
Data reusable search scan methods for low power motion estimation

Sung Dae Kim, Jin Wook Baek*, Jin Wook Burm**, and Myung Hoon Sunwoo
School of Electrical and Computer Engineering, Ajou University, San 5, Woncheon-Dong, Yeongtong-Gu, Suwon, 443-749, Korea
*Samsung Electronics, **Sogang University
E-mail: sunwoo@ajou.ac.kr
Abstract: This paper proposes data-reusable search scan methods for low-power Motion Estimation (ME). The proposed Optimized Sub-region Partitioning (OSP) method reduces the number of required registers while keeping the same data reusability. In addition, the proposed Center Biased Search Scan (CBSS) method improves data reusability for various fast search algorithms. Performance comparisons show that the proposed search scan methods reduce redundant data loading by about 26.9% and 16.1% on average compared with the raster scan and snake scan methods, respectively. Because they reduce memory accesses, the proposed search scan methods are well suited to low-power, high-performance ME implementations.

I. INTRODUCTION

The transmission and storage of video are important for many application devices, especially hand-held devices such as smart phones and smart pads. Video compression is essential for transmission and storage because most video sequences contain a huge amount of data. Over the years, many video coding standards such as MPEG-2, MPEG-4, H.261, H.263, H.264 [1], and HEVC [2] have been developed to achieve efficient compression. These video codecs are based on a hybrid of inter prediction and transform coding. Block-based Motion Estimation (ME) is the most computation-intensive part of these codecs.

The well-known Full Search (FS) is the simplest but most computationally expensive search algorithm: it exhaustively searches all the candidate points in the search range. Numerous fast search algorithms have been developed to reduce the huge computational complexity of FS [3]-[8]. They can be roughly classified into lossy and lossless algorithms. Lossless algorithms such as successive elimination based algorithms [3], [4] have the same search quality as FS, but they are faster than FS because they eliminate unnecessary candidate points as early as possible. The representative lossy approach is the decimation of candidate points. The various fast search algorithms include Three Step Search (TSS) [5], Diamond Search (DS) [6], hexagon based search, cross diamond search, Predictive MV Field Adaptive Search Technique (PMVFAST) [7], and Unsymmetrical-cross Multi-Hexagon-grid Search (UMHexagonS) [8]. The latest video compression standards, H.264 [1] and HEVC [2], further increase the complexity of ME with variable block sizes and multiple reference frames.

Existing ME implementations are mainly tailored to FS [9]-[10], since FS has hardware-friendly features such as regular data flow and low control overhead. Due to its huge computational complexity, however, FS is not suitable for low-power and real-time applications such as smart phones and smart pads. Flexible ME architectures [11]-[13] that support multiple ME algorithms have been developed to meet these requirements. Chen et al. [11] introduced ME architectures that support both FS and Four Step Search (4SS). Configurable ME architectures for FS and fast search algorithms such as TSS and DS are introduced in [12], [13]. These architectures implement fast search algorithms by filling lookup tables or ROMs with predefined data structures that contain all the specific information about the search patterns and search paths.

This paper proposes novel search scan orders for FS and fast search algorithms. To reduce the number of registers required by the rigid sub-region partitioning method [10] for FS, the Optimized Sub-region Partitioning (OSP) method is proposed. In addition, the data reuse scan order for FS cannot be applied to fast search algorithms because their search patterns differ. Hence, we propose a new data reuse scan order, the Center Biased Search Scan (CBSS) method, for various fast search algorithms. These data reuse methods reduce the redundant loading of reference data.

The rest of this paper is organized as follows. Section 2 introduces the existing data reusable search scan orders. Section 3 proposes the efficient search scan orders and their architecture. Simulation results for the proposed algorithms are presented in Section 4. Finally, a brief conclusion is given in Section 5.

This work was supported by the Mid-career Researcher Program through the NRF grant funded by the MEST (20110016671), by the framework of the international cooperation program managed by the National Research Foundation of Korea (2011-0030930), and by the IC Design Education Center (IDEC).
978-1-4673-0219-7/12/$31.00 ©2012 IEEE

II. EXISTING DATA REUSABLE SEARCH SCAN ORDERS

Search scan order plays an important role in improving data reusability. Existing ME architectures usually use the raster scan shown in Fig. 1. In the raster scan, the search proceeds from left to right along each row, which makes it effective in reusing horizontal reference data. Consider a block size of N x N. When the upper-left search point is searched, all the reference pixels in the N x N block are loaded from memory. For each subsequent search point in the row, 1 x N new pixels are loaded and (N - 1) x N pixels are reused. However, raster scan does not support data reuse between adjacent rows: after one row has been searched, a full N x N block must be loaded from memory again.
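The load counts implied by this description can be tallied with a short sketch. It is only an illustration: the function names are hypothetical, and the snake scan count assumes that every move to a neighboring search point, including the move to the next row, loads exactly N new pixels, which is one reading of the snake scan of Fig. 1(b).

```python
def raster_loads(width, height, n):
    """Reference pixels loaded by a raster scan over width x height search points."""
    # Each row starts with a full n x n block, then adds one 1 x n strip per step.
    per_row = n * n + (width - 1) * n
    return height * per_row


def snake_loads(width, height, n):
    """Reference pixels loaded by a snake scan (assumed: n new pixels per move)."""
    # Only the very first search point needs a full n x n block.
    moves = width * height - 1
    return n * n + moves * n


for name, (w, h) in {"32 x 32": (32, 32), "128 x 96": (128, 96)}.items():
    print(name, raster_loads(w, h, 16), snake_loads(w, h, 16))
```

With N = 16, the sketch gives 24064 and 16624 loaded pixels for a 32 x 32 search window and 219648 and 196848 for a 128 x 96 window, which agree with the L values listed for raster and snake scan in Table I.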

Figure 1. Search scan orders for ME.

Snake scan, shown in Fig. 1(b), is another scanning order that improves data reusability. Snake scan processes the first row from left to right, the second row from right to left, the third row from left to right, and so on. During horizontal scanning along a row, (N - 1) x N pixels are reused, as in raster scan. Unlike raster scan, when a row is finished, N x (N - 1) pixels can also be reused for the first search point of the next row.

Smart snake scan, which minimizes redundant data loading with variable data reuse ratios, was introduced in [10]. In smart snake scan, the search range is divided into non-overlapping rectangular sub-regions. An example with six sub-regions is shown in Fig. 1(c). Snake scan is performed within each sub-region, but the search direction can change from one sub-region to the next. In sub-region L1, the search proceeds from top to bottom. Then, in sub-region L2, the search proceeds from bottom to top. After the search of L2 is finished, sub-region L3 adopts a column-by-column snake scan, in contrast to the row-by-row operation of the original snake scan. The width and height of each sub-region are restricted to be less than or equal to a parameter M. Smart snake scan with a Reconfigurable Register Array (RRA) of (M-1) x (N-1) registers is well suited to implementing FS because its data reuse is simple and highly efficient. However, the sub-region partitioning method is too rigid to improve the data reuse efficiency further. Moreover, the same search scan order and RRA update methods cannot be used to implement fast search algorithms, for two reasons: the lack of coherence in the reference data inside the RRA, and the complex data transactions between the Processing Element (PE) array and the RRA. The RRA cannot hold all the reference data related to the previous row or column of the search region, and the PE array requires reuse data of different sizes and locations depending on the search pattern. Hence, we propose novel search scan orders and data reuse methods for fast search algorithms.

III. PROPOSED SEARCH SCAN ORDERS

This section proposes two data reuse methods: OSP for FS and CBSS for fast search algorithms.

A. Optimized Sub-region Partitioning (OSP) method

As mentioned above, the required size of the RRA is (M-1) x (N-1) for sub-regions with a maximum width or height of M. However, if the search direction can be arbitrarily adjusted, the maximum width or height of a sub-region is no longer limited by the size of the RRA. Using this feature, we propose a new data reuse search scan method that combines OSP with a search direction control method. OSP uses a new parameter D, which indicates the maximum number of registers, to generate a new set of sub-regions. Once the width and height of the search region are determined, the search region is partitioned into a new set of sub-regions only when its width or height is larger than D. The parameter D plays a role similar to that of the parameter M in that it restricts the size of a sub-region. However, the two parameters differ: M limits both the width and the height of a sub-region, whereas D needs to limit only one of them. Once either the width or the height of a sub-region is less than or equal to D, the other dimension does not have to be limited by D. Thus, a 32 x 32 search region can be divided into two 32 x 16 sub-regions with D = 16. An example search region divided into two sub-regions, L1 and L2, by OSP is shown in Fig. 2(a).
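The OSP partitioning rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper name `osp_partition` and the (x, y, width, height) tuple layout are hypothetical. The paper does not state which dimension is cut when both exceed D, so the sketch assumes the cut goes along the height unless the width is the shorter side. It also assumes that no cut is made once one dimension already fits within D.

```python
def osp_partition(width, height, d):
    """Split a width x height search region (in search points) into strips.

    Returns a list of (x, y, w, h) tuples. Hypothetical helper for illustration.
    """
    # Assumption: a region needs no partitioning once one side is <= d.
    if min(width, height) <= d:
        return [(0, 0, width, height)]
    # Assumption: cut along the height unless the width is the shorter side.
    if height <= width:
        return [(0, y, width, min(d, height - y)) for y in range(0, height, d)]
    return [(x, 0, min(d, width - x), height) for x in range(0, width, d)]


# The 32 x 32 example from the text with D = 16 yields two 32 x 16 sub-regions.
print(osp_partition(32, 32, 16))
```

Only one side of each strip is bounded by D here, which is the difference from the parameter M of the smart snake scan, where both sides are bounded.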

Figure 2. Proposed search order and direction methods for FS.

The next step determines the search order and direction. The search direction is derived from the width and height of the search region. If the width is larger than the height, the search proceeds column by column, from top to bottom or from bottom to top. Otherwise, the search proceeds row by row, from left to right or from right to left. In this example, the width is larger than the height, and thus a column-by-column search is performed. The search start position and direction change according to the size of the search region, as shown in Fig. 2(b) and Fig. 2(c).

B. Center Biased Search Scan (CBSS) method

Most ME algorithms use symmetric search patterns, such as diamond, hexagon, and cross patterns. Hence, the reusable data is concentrated around the center candidate point. The proposed CBSS method stores the reference data corresponding to the center position in the RRA. Then, the search is performed on the right side of the search range. After the right-side search is finished, the search proceeds on the left side. During the left- or right-side search, the shift registers in the PE array, which support left and right shifts, are used for data reuse. The RRA is used when the center point is searched or when the search direction changes. Fig. 3 shows the proposed CBSS method, which is explained in detail as follows.

Figure 3. Proposed CBSS methods.

The search scan proceeds from the top row to the bottom row of the search region. First, the center position of each candidate row is checked to decide whether the search there is performed or skipped. If the search is performed at the center position, the N x N reference pixels corresponding to the center position are loaded from memory into the PE array. At the same time, the reference data are propagated from the PE array to the RRA. Otherwise, if both the left and right sides have candidate search points, the same data update is performed. After the center position is processed, the candidate points on the right side of the current row are scanned, starting from the point nearest the center position. The reference data are propagated within the PE array as they are needed for the following search points. During this step, the RRA remains unchanged. Next, the remaining candidate points on the left side of the current row are scanned. Before this scan starts, the stored reference data are propagated back from the RRA, so that the N x N reference pixels related to the center position are located in the PE array again. The left-side search is processed in a similar manner to the right-side search. After all candidate search points in the current row have been searched, the data is moved back to the PE array only if the RRA holds the reference data for the center position.

Fig. 4 shows the search scan order for various search patterns. The search orders obtained with the raster scan and the snake scan are shown in Fig. 4(a) and Fig. 4(b), respectively. Fig. 4(c) shows the proposed search scan order. The black circles in Fig. 4(c) indicate the locations where the RRA is updated.

Figure 4. Scan orders for each search pattern according to search scan method.

IV. EXPERIMENTAL RESULTS

To investigate the data reusability of different scanning orders, the redundant loading ratio R [10] is defined as

$$R = \frac{L - S}{S} \qquad (1)$$

where S is the number of reference pixels inside the search range, and L is the number of reference pixels actually loaded during the search. R is equal to zero when each pixel inside the search range is loaded only once and no redundant loading occurs.
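As a worked check of the ratio R = (L - S) / S, the sketch below recomputes R from the S and L values reported in Table II. It also averages, over the three search patterns, the difference in R between each baseline scan and the proposed CBSS. Reading the quoted average reductions as mean differences in R is an assumption made here, not something the paper states, and the dictionary layout and function names are hypothetical.

```python
# (S, L) pairs copied from Table II, keyed by search pattern and scan method.
TABLE_II = {
    "diamond": {"raster": (388, 568), "snake": (388, 508), "cbss": (388, 448)},
    "hexagon": {"raster": (356, 506), "snake": (356, 446), "cbss": (356, 384)},
    "cross": {"raster": (384, 444), "snake": (384, 444), "cbss": (384, 384)},
}


def redundant_loading_ratio(in_range, loaded):
    """R = (L - S) / S, with S pixels in the search range and L pixels loaded."""
    return (loaded - in_range) / in_range


def mean_reduction(baseline):
    """Average over patterns of R(baseline) - R(cbss); assumed interpretation."""
    diffs = [
        redundant_loading_ratio(*methods[baseline])
        - redundant_loading_ratio(*methods["cbss"])
        for methods in TABLE_II.values()
    ]
    return sum(diffs) / len(diffs)


ratios = {
    pattern: {m: round(redundant_loading_ratio(*sl), 2) for m, sl in methods.items()}
    for pattern, methods in TABLE_II.items()
}
print(ratios)
print(mean_reduction("raster"), mean_reduction("snake"))
```

The rounded ratios match the R column of Table II, and the two averages come out at roughly 0.27 and 0.16, close to the 26.9% and 16.1% quoted in the text.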
Table I shows the redundant loading ratios for FS with various scanning orders, including raster scan, snake scan, smart snake scan, and the proposed scan. The redundant loading ratios are computed for two representative cases: a small video resolution (CIF) with a small search window (32 x 32), and a large video resolution (1080p) with a large search window (128 x 96). If both the width and the height of the search region are less than or equal to D, the region is processed as a single block and the redundant loading ratio R can be equal to 0. The proposed OSP scan method for FS shows a better data reuse ratio than the snake scan when the size of the RRA is the same, and it provides comparable or even better performance than the snake scan method while using an RRA of half the size.

TABLE I. DATA REUSE RATIO FOR VARIOUS SCAN METHODS

Scan method           CIF (2P=32, 2Q=32, N=16)      1080p (2P=128, 2Q=96, N=16)
                      S        L        R           S        L         R
Raster                2099     24064    989         15873    219648    1283
Snake                 2099     16624    652         15873    196848    1140
Smart snake (M=16)    2099     3844     83          15873    46128     191
Smart snake (M=32)    2099     2099     0           15873    26508     67
Proposed (D=16)       2099     2575     0.22        15873    25497     4.35

Table II shows the redundant data loading ratio R for various scan orders according to search pattern. Because snake scan searches successive candidate points, only the shift registers are used for data reuse. In addition, raster scan with the RRA shows the same performance as snake scan when the RRA is updated at the leftmost search point of every row.

TABLE II. DATA REUSE RATIO FOR VARIOUS SCAN METHODS

Scan method       Diamond pattern        Hexagon pattern        Cross pattern
                  S      L      R        S      L      R        S      L      R
Raster scan       388    568    0.46     356    506    0.42     384    444    0.16
Snake scan        388    508    0.31     356    446    0.25     384    444    0.16
Proposed CBSS     388    448    0.15     356    384    0.08     384    384    0

From the experimental results, the proposed CBSS method for fast search algorithms reduces the average redundant data loading by about 26.9% and 16.1% compared with the existing raster scan and snake scan, respectively.

V. CONCLUSIONS

This paper proposes data-reusable search scan methods for low-power ME. A systolic PE array and an RRA are usually used for data reuse in FS. The proposed OSP method reduces the size of the RRA while providing the same data reusability as the smart snake method [10]. In fast search algorithms, the data reuse efficiency of the RRA decreases sharply because of the irregular dataflow. The proposed CBSS method improves data reusability by changing the search order. As shown in the performance comparisons, the average redundant data loading can be reduced by 26.9% and 16.1% compared with the existing raster scan and snake scan methods, respectively. Therefore, the proposed search scan methods are well suited to low-power ME applications, including smart phones and smart pads.

REFERENCES

[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG, Document JVT-G050, May 2003.
[2] Thomas Wiegand, Woo-Jin Han, Benjamin Bross, Jens-Rainer Ohm, and Gary J. Sullivan, "WD1: Working Draft 1 of High-Efficiency Video Coding," ITU-T SG16/WP3 Doc. JCTVC-C403, Guangzhou, China, Oct. 2010.
[3] Lien Fei Chen, Shien-Yu Huang, Chi-Yao Liao, and Yeong-Kang Lai, "Hardware efficient coarse-to-fine fast algorithm for H.264/AVC variable block size motion estimation," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2009, pp. 1657-1660.
[4] H. M. Wang, J.-K. Lin, and J.-F. Yang, "A successive termination and elimination method for fast H.264/AVC SATD-based inter mode decision," IET Signal Processing, vol. 3, issue 3, pp. 165-176, May 2009.
[5] Luo Tao, Yao Su-ying, Shi Zai-feng, and Gao Peng, "An improved three-step search algorithm with zero detection and vector filter for motion estimation," in Proc. IEEE International Conference on Computer Science and Software Engineering, Dec. 2008, pp. 976-970.
[6] Obianuju Ndili and Tokunbo Ogunfunmi, "Hardware-oriented modified diamond search for motion estimation in H.264/AVC," in Proc. IEEE International Conference on Image Processing (ICIP), Sept. 2010, pp. 749-752.
[7] A. M. Tourapis, O. C. Au, and M. L. Liou, "Highly efficient predictive zonal algorithms for fast block matching motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 10, pp. 934-947, Oct. 2002.
[8] Z. Chen, P. Zhou, and Y. He, "Fast integer pel and fractional pel motion estimation for JVT," JVT-F017, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 6th Meeting, Japan, Dec. 2002.
[9] Yeong-Kang Lai and Liang-Gee Chen, "A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 2, pp. 124-127, April 1998.
[10] Xing Wen, Oscar C. Au, Jiang Xu, Lu Fang, Run Cha, and Jiali Li, "Novel RD-optimized VBSME with matching highly data re-usable hardware architecture," IEEE Trans. on Circuits and Systems for Video Technology, vol. 21, no. 2, pp. 206-219, Feb. 2011.
[11] Tung-Chien Chen, Yu-Han Chen, Sung-Fang Tsai, Shao-Yi Chien, and Liang-Gee Chen, "Fast algorithm and architecture design of low power integer motion estimation for H.264/AVC," IEEE Trans. on Circuits and Systems for Video Technology, vol. 17, no. 5, pp. 568-577, May 2007.
[12] W. M. Chao, C. W. Hsu, Y. C. Chang, and L. G. Chen, "A novel hybrid motion estimator supporting diamond search and fast full search," in Proc. IEEE International Symposium on Circuits and Systems, May 2002, pp. 492-495.
[13] T. Li, S. Li, and C. Shen, "A novel configurable motion estimation architecture for high-efficiency MPEG-4/H.264 encoding," in Proc. IEEE Asia and South Pacific Design Automation Conf., vol. 2, Jan. 2005, pp. 1264-1267.
