
INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

ISSN 0976-6464 (Print), ISSN 0976-6472 (Online)
Volume 5, Issue 1, January (2014), pp. 74-81
IAEME: www.iaeme.com/ijecet.asp
Journal Impact Factor (2013): 5.8896 (Calculated by GISI)
www.jifactor.com

REVIEW OF METHODS OF SCENE TEXT DETECTION AND ITS CHALLENGES


Ms. Saumya Sucharita Sahoo
M.E. (E&TC), Genba Sopanrao Moze College of Engineering, Pune
Prof. Smita Tikar

ABSTRACT

Over the last decade, many methods have been presented in the research area of scene text detection using image processing techniques. Automated systems have been proposed to address the challenges of detecting and localizing text information in natural scene images. The application areas of such automated systems include keyword based image search, tourist guides, image indexing using text, image text translation systems, etc. The text data available in images or videos is important information required for automatic annotation, searching, structuring, and so on. An automated system to extract text information from images basically consists of phases such as detection, preprocessing, localization, tracking, enhancement, extraction, and recognition. The challenging part of automated scene text detection is that scene images contain text that varies in style, size, alignment and orientation, often against complex backgrounds. Many research methods on automatic detection of text from natural images have been presented previously, and much research is still ongoing in this area. The main goal of this paper is to present a survey of the different methods proposed for text detection in images, with a detailed description of the work done on automatic detection of text from scene images.

Keywords: Text detection, Scene text detection.

I. INTRODUCTION

Current work in the research field of content retrieval from videos and images has identified a wide range of application areas where automated text extraction from natural scene images is required. Recently, a new mobile banking application has been developed, provided by banks to their customers with the aim of letting users execute transactions by simply sending an image of their cheque or passbook to the main

server of the bank. Another application is a tourist guide, which helps tourists understand display boards written in different languages; such image text translation systems help visually impaired people as well as tourists. All of these applications are based on the concept of automatic text extraction from scene images. Such an automated system needs to efficiently detect, localize and extract the text-related information available in natural scene images. Figure 1 below shows the overall processing of such automated systems. From the figure, the first phase is image acquisition, which is possible through input videos or cameras; the quality depends on the camera used. Once the image is acquired, the next step is preprocessing, in which the contrast of the image is enhanced or noise is removed so that detection accuracy improves. The next phase is to detect and localize the text present in the preprocessed image. The following phase is recognition, where the text is extracted and recognized using different methods: the detected text regions are given to an OCR engine, which recognizes the characters and gives the textual output. Preprocessing is mainly required because the input image may differ in size, angle, orientation, alignment, etc., and hence the image needs to be smoothed. For each phase of this system, different methods have been suggested by various researchers, each with its own advantages and disadvantages. A minimal preprocessing sketch is given below.
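As an illustration of the preprocessing step, here is a minimal sketch using OpenCV in Python; the file name and parameter values are hypothetical and not taken from any of the surveyed systems:

```python
import cv2

# Load the acquired scene image (hypothetical file name).
img = cv2.imread("scene.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Contrast enhancement via CLAHE (adaptive histogram equalization);
# clip limit and tile size are illustrative defaults.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)

# Noise removal with non-local means; filter strength 10 is illustrative.
denoised = cv2.fastNlMeansDenoising(enhanced, None, 10)
```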

Figure 1: Text Extraction Process from Scene Images

Since the 1990s, research on the detection and localization of text has been carried out, and many text detection algorithms have been presented. In this paper we aim to present a literature survey of the different techniques proposed for automatic text extraction from scene images, the research challenges, the performance metrics used, etc. In Section II we present the literature survey of methods proposed by various authors for the extraction of text from scene images. In Section III we discuss the different challenges faced by these systems. In Section IV we present the performance metrics used for the evaluation of scene text extraction methods.

II. REVIEW OF TEXT DETECTION METHODS

Several approaches for text detection in images and videos have been proposed in the past, including techniques for automatic detection and translation of text in images and videos.

The distribution of edges, for example, is used in many text detection methods [7, 8, 9]. In these methods the edges are grouped together based on features such as size, color and aspect ratio [10, 11]. Many researchers working on text detection and thresholding algorithms with various approaches have achieved good performance under certain constraints. The early histogram based global thresholding method of Otsu is widely used in many applications [1]. A text detection and binarization method has been proposed for Korean signboard images using k-means clustering [2], but finding the best value of k to achieve a good binary image is difficult in images with complex backgrounds and uneven lighting. The linear Niblack method was proposed to extract connected components, and text was localized using a classifier algorithm [3]. Four different methods were suggested to extract text depending on character size [4]. In the work of Wu et al., a method was proposed to clean up and extract text using a histogram based binarization algorithm [5]: the local threshold is picked at the first valley of the smoothed intensity histogram and used to achieve a good binarization result. A thresholding method was developed using intensity and saturation features to separate text from background in color document images [6]. A system using the gray-level values at high-gradient regions as known data to interpolate the threshold surface of a document image was also proposed [7]. A layer based approach using morphological operations was proposed to detect text in complex natural scene images [8]. However, these methods impose constraints and show many missed and false positive detections on natural scene images, which confirms that the detection of text from natural scenes is still a challenging issue. In our previous work we proposed a region based method using the color contrast between text and its surrounding pixels; due to the limited color variation between text and its immediate background, finding the right threshold and detecting text patterns are the key issues. A minimal binarization sketch is given below.
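To make the global/local distinction concrete, here is a minimal Python/OpenCV sketch; the file name is hypothetical, and OpenCV's mean-based adaptive threshold is used as a stand-in for Niblack-style local methods (it is not the Niblack formula itself):

```python
import cv2

gray = cv2.cvtColor(cv2.imread("signboard.jpg"), cv2.COLOR_BGR2GRAY)

# Global Otsu thresholding [1]: one threshold computed from the
# intensity histogram of the whole image.
_, global_bin = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Local (mean-based) thresholding as a Niblack-style stand-in: the
# threshold adapts per neighborhood, coping better with uneven lighting.
# Block size 25 and offset 10 are illustrative values.
local_bin = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 25, 10)
```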
Based on the methods used to localize text regions, these approaches can be categorized into two main classes: connected component based methods and texture based methods. Cai et al. [2] presented a text detection approach based on character features such as edge strength, edge density and horizontal distribution. First, they apply a color edge detection algorithm in the YUV color space and filter out non-text edges; then a local thresholding technique is used to keep low-contrast text while simplifying the text-background contrast. Finally, projection profile analysis is applied to localize the text areas. Lienhart and Effelsberg [1] proposed an approach which operates directly on color images in the RGB color space. Character features such as monochromaticity and contrast with the local environment are used to decide whether a pixel belongs to a connected component, segmenting each frame into suitable objects in this way. Then, regions are merged using the criterion of similar color. Finally, specific ranges of width, height, width-to-height ratio and compactness of characters are used to discard all non-character regions. Kim [6] has proposed an approach in which Local Color Quantization (LCQ) is performed for each color separately; each color is assumed to be a text color without knowing whether it is the real text color or not. Color quantization takes place before processing to reduce the number of colors: the input image is converted to a 256-color image. Connected components are extracted for each color and merged into candidate text lines when they exhibit text-like features. Since LCQ is executed for each color, the drawback of this method is its processing time. Agnihotri and Dimitrova [11] have presented an algorithm which uses only the red component of the RGB color space, with the aim of obtaining high-contrast edges for the frequent text colors. By means of a convolution process with specific masks they first enhance the image and then detect edges. Non-text areas are removed using a preset fixed threshold. Finally, a connected component analysis (eight-pixel neighborhood) is performed on the edge image in order to group neighboring edge pixels into single connected component structures. The detected text candidates then undergo further treatment in order to be ready for an OCR. This edge-then-grouping idea is sketched below.
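A minimal sketch of the edge-then-connected-components idea, using Canny edges and heuristic size/aspect-ratio filters as simplified stand-ins for the cited methods' specific masks and thresholds:

```python
import cv2

img = cv2.imread("scene.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect edges, then dilate horizontally so edges of neighboring
# characters merge into word-level components.
edges = cv2.Canny(gray, 100, 200)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
merged = cv2.dilate(edges, kernel)

# Connected component analysis via contours; heuristic geometric
# filters (illustrative values) discard non-text-like regions.
contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if h > 8 and 1.0 < w / float(h) < 15.0:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
```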

Garcia and Apostolidis [4] perform an eight-connected component analysis on a binary image, which is obtained as the union of local edge maps produced by applying the Deriche filter to each color band. Jain and Yu [5] first perform a color reduction by bit dropping and color clustering quantization; afterwards, a multi-valued image decomposition algorithm is applied to decompose the input image into multiple foreground and background images. Connected component analysis is then performed on each of them to localize text candidates. This method can extract only horizontal texts of large sizes. The second class of approaches [7, 9] regards text as regions with distinct textural properties, such as character components that contrast with the background while exhibiting a periodic horizontal intensity variation due to the horizontal alignment of characters. Texture analysis techniques such as Gabor filtering and spatial variance are used in this class to detect text regions automatically; such approaches do not function well across different character font sizes and, besides, are computationally intensive. For example, Li and Doermann [7] scan the image with a small window, usually 16 x 16 pixels, and use a three-layer neural network to classify each window as text or non-text. To detect text of different sizes, they use a three-level pyramid approach: text regions are extracted at each level and then extrapolated to the original scale, and the bounding box of the text area is generated by a connected component analysis of the text windows. Wu et al. [9] have proposed an automatic text extraction system in which features are computed from the filtered images to form a feature vector for each pixel, in order to classify pixels as text or non-text. In a second step, bottom-up methods are applied to extract connected components. A simple histogram based algorithm is proposed to automatically find the threshold value for each text region, making the text cleaning process more efficient. A simplified sliding-window sketch follows.
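As a simplified illustration of the texture based (sliding window) idea, the sketch below scores windows with a hand-crafted variance/gradient statistic; this stands in for Li and Doermann's neural network classifier, and the window size, stride and score cutoff are all illustrative:

```python
import cv2
import numpy as np

gray = cv2.cvtColor(cv2.imread("scene.jpg"),
                    cv2.COLOR_BGR2GRAY).astype(np.float32)
win, step, cutoff = 16, 8, 400.0  # window size, stride, score cutoff

# Horizontal gradients: text shows strong periodic horizontal variation.
gx = np.abs(cv2.Sobel(gray, cv2.CV_32F, 1, 0))

text_mask = np.zeros(gray.shape, np.uint8)
for y in range(0, gray.shape[0] - win, step):
    for x in range(0, gray.shape[1] - win, step):
        patch = gray[y:y + win, x:x + win]
        score = patch.std() * gx[y:y + win, x:x + win].mean()
        if score > cutoff:  # classify window as text-like
            text_mask[y:y + win, x:x + win] = 255
```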
III. CHALLENGES OF SCENE TEXT DETECTION

First of all, in order to understand the challenges of this field, the new imaging conditions and the newly considered scenes need to be detailed.

Raw sensor image and sensor noise: in low-priced hand-held imaging devices (HIDs), the pixels of a raw sensor are interpolated to produce real colors, which can introduce degradations. Demosaicing techniques, viewed as complex interpolation techniques, are sometimes required. Moreover, the sensor noise of an HID is usually higher than that of a scanner.

Angle: scene text and HIDs are not necessarily parallel, creating perspective distortion to correct (a rectification sketch is given after this list).

Blur: during acquisition, motion blur can appear or be created by a moving object. All other kinds of blur, such as wrong focus, may degrade image quality even further.

Lighting: in real images, real (uneven) lighting, shadowing, reflections on objects and inter-reflections between objects may make colors vary drastically and decrease analysis performance.

Resolution and aliasing: from webcams to professional cameras, the resolution range is large, and images with low resolution must also be taken into account. Resolution may be below 50 dpi, which causes commercial OCR to fail; it may also lead to aliasing, creating fringed artifacts in the image.

The newly considered scenes represent targets such as:

Outdoor/non-paper objects: different materials cause different surface reflections, leading to various degradations and creating inter-reflections between objects.

Scene text: backgrounds are not necessarily clean and white, and more complex backgrounds make text extraction difficult. Moreover, scene text such as that seen in advertisements may include artistic fonts.

Non-planar objects: text embedded on bottles or cans suffers from deformation.

Unknown layout: a priori information about the structure of the text is not available to detect it efficiently.

Objects at a distance: the distance between the text and the HID can vary, so character sizes may vary over a wide range within the same scene.
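For the angle/perspective challenge above, a minimal rectification sketch in Python/OpenCV; in practice the four corner points would come from a detected text region, while here they are hypothetical hand-picked values:

```python
import cv2
import numpy as np

img = cv2.imread("signboard.jpg")

# Four corners of a skewed text region (hypothetical points).
src = np.float32([[120, 80], [430, 60], [450, 210], [110, 230]])
# Target rectangle the text is mapped onto (illustrative size).
w, h = 340, 150
dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

M = cv2.getPerspectiveTransform(src, dst)
rectified = cv2.warpPerspective(img, M, (w, h))  # fronto-parallel patch
```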

Fig. 1. Samples of natural scene images

The main challenge is to design a system as versatile as possible, able to handle all the variability of daily life: variable targets with unknown layout, scene text, several character fonts and sizes, and variable imaging conditions with uneven lighting, shadowing and aliasing. The proposed solutions for each text understanding step must be context independent, meaning independent of scene, colors, lighting and all other varying conditions. We discuss methods which work reliably across the broadest possible range of natural scene images, such as those displayed in Figure 1.

IV. PERFORMANCE METRICS USED

The ICDAR 2011 Robust Reading Competition (Challenge 2: Reading Text in Scene Images) dataset [?] is widely used for benchmarking scene text detection algorithms. The dataset contains 229 training images and 255 testing images; the proposed system is trained on the training set and evaluated on the testing set. It is worth noting that the evaluation scheme of the ICDAR 2011 competition is not the same as that of ICDAR 2003 and ICDAR 2005. The new scheme, the object count/area scheme proposed by Wolf et al. [?], is more complicated but offers several enhancements over the old scheme. Basically, both schemes use the notions of precision, recall and f-measure, defined as follows.
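In the standard ICDAR formulation (assumed here, as the source renders the equations as an image), with $E$ the set of detected rectangles, $T$ the set of ground-truth rectangles, $m(r, R)$ the best area-overlap match of rectangle $r$ against the set $R$, and $\alpha = 0.5$:

$$p = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|}, \qquad r = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|}, \qquad f = \frac{1}{\alpha/p + (1-\alpha)/r}$$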

The above matching functions only consider one-to-one matches between ground truth and detected rectangles, leaving room for ambiguity between detection quantity and quality [?]. In the new evaluation scheme, the matching functions are redesigned considering detection quality and different matching situations (one-to-one, one-to-many and many-to-one matchings) between ground truth rectangles and detected rectangles, such that the detection quantity

and quality can both be observed using the new evaluation scheme. The evaluation software DetEval used by the ICDAR 2011 competition is available online and free to use. The performance of our system, together with Neumann and Matas' method [?], a very recent MSER based method by Shi et al. [?] and some of the top scoring methods (Kim's method, Yi's method, the TH-TextLoc system and Neumann's method) from the ICDAR 2011 Competition, is presented in Table 1. As can be seen from Table 1, our method produced much better recall, precision and f-measure than the other methods on this dataset. Four of the methods in Table 1 are MSER based methods, and Kim's method is the winning method of the ICDAR 2011 Robust Reading Competition. Apart from the detection quality, the proposed system offers a speed advantage over some of the listed methods. The average processing speed of the proposed system on a Linux laptop with an Intel Core 2 Duo 2.00 GHz CPU is 0.43 s per image. The processing speed of Shi et al.'s method [?] on a PC with an Intel Core 2 Duo 2.33 GHz CPU is 1.5 s per image. The average processing speed of Neumann and Matas' method [?] is 1.8 s per image on a standard PC. Figure 9 shows some text detection examples produced by our system on the ICDAR 2011 dataset.

Fig. 9: Text detection examples on the ICDAR 2011 dataset. Text detected by our system is labeled with red rectangles. Notice the robustness against low contrast, complex backgrounds and font variations.

TABLE 1: Performance (%) comparison of text detection algorithms on the ICDAR 2011 Robust Reading Competition dataset

To fully appreciate the benefits of the text candidates elimination and the MSER pruning algorithm, we further profiled the proposed system on this dataset using the following schemes (see Table 2):

1) Scheme-I, no text candidates elimination performed. As can be seen from Table 2, the absence of text candidates elimination results in a major decrease in precision. The degradation is explained by the fact that a large number of non-text candidates are passed to the text candidates classification stage without being eliminated.

2) Scheme-II, using the default parameter setting [?] for the MSER extraction algorithm. The MSER extraction algorithm is controlled by several parameters [?]: Δ controls how the variation is calculated; the maximal variation v+ excludes MSERs that are too unstable; the minimal diversity d+ removes duplicate MSERs by measuring the size difference between an MSER and its parent. As can be seen from Table 2, compared with our parameter setting (Δ = 1, v+ = 0.5, d+ = 0.1), the default parameter setting (Δ = 5, v+ = 0.25, d+ = 0.2) results in a major decrease in recall: (1) the MSER algorithm is not able to detect some low contrast characters (due to v+), and (2) the MSER algorithm tends to miss some regions that are more likely to be characters (due to Δ and d+). Note that the speed loss (from 0.36 seconds to 0.43 seconds) is mostly due to the MSER detection algorithm itself.

TABLE 2: Performance (%) of the proposed method with different components
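As a concrete illustration, here is a minimal MSER extraction sketch in Python/OpenCV using the parameter setting discussed above; the file name is hypothetical, and the keyword argument names follow recent OpenCV 4.x releases (older versions prefix them with an underscore):

```python
import cv2

# delta, max_variation and min_diversity correspond to the
# Δ, v+ and d+ parameters discussed above.
mser = cv2.MSER_create(delta=1, max_variation=0.5, min_diversity=0.1)

img = cv2.imread("scene.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detectRegions returns per-region pixel lists and bounding boxes.
regions, bboxes = mser.detectRegions(gray)
for (x, y, w, h) in bboxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
```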

V. CONCLUSION AND FUTURE WORK

In this paper we have reviewed the concepts of automatic text extraction from scene images using different methods. We have discussed the various research challenges of automatic text extraction from images, as well as the performance metrics used. Many papers on the automatic extraction of text from images have been discussed in the literature review of this paper. The surveyed algorithms are designed by considering various properties of text which help to distinguish text regions in natural scene images or videos. Future work in this research area is to present an improved method with the aim of improving overall detection accuracy as compared to previous methods.

VI. REFERENCES

1. R. Lienhart and W. Effelsberg, Automatic Text Segmentation and Text Recognition for Video Indexing, Multimedia Systems, Vol. 8, pp. 69-81, 2000.
2. N. Ezaki, M. Bulacu and L. Schomaker, Text Detection from Natural Scene Images: Towards a System for Visually Impaired Persons, Int. Conf. on Pattern Recognition (ICPR 2004), Vol. II, pp. 683-686.
3. J. Park, G. Lee, E. Kim, J. Lim, S. Kim, H. Yang, M. Lee and S. Hwang, Automatic detection and recognition of Korean text in outdoor signboard images, Pattern Recognition Letters, 2010.
4. T. N. Dinh, J. Park and G. Lee, Korean Text Detection and Binarization in Color Signboards, Proc. of the Seventh Int. Conf. on Advanced Language Processing and Web Information Technology (ALPIT 2008), pp. 235-240.
5. P. Shivakumara, W. Huang and C. L. Tan, Efficient Video Text Detection using Edge Features, Int. Conf. on Pattern Recognition (ICPR 2008), pp. 1-4.
6. P. Shivakumara, T. Q. Phan and C. L. Tan, Video text detection based on filters and edge features, Int. Conf. on Multimedia & Expo (ICME 2009), pp. 514-517.
7. Q. Yuan and C. Tan, Text extraction from gray scale document images using edge information, Sixth International Conference on Document Analysis and Recognition, 2001, pp. 302-306.
8. X. Chen, J. Yang, J. Zhang and A. Waibel, Automatic detection and recognition of signs from natural scenes, IEEE Transactions on Image Processing, Vol. 13, pp. 87-99, 2004.


9. N. Ezaki, M. Bulacu and L. Schomaker, Text detection from natural scene images: towards a system for visually impaired persons, Proceedings of the 17th International Conference on Pattern Recognition, Vol. 2, 2004, pp. 683-686.
10. A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless and D.H. Salesin, Image analogies, Proceedings of ACM SIGGRAPH, Int. Conf. on Computer Graphics and Interactive Techniques, 2001.
11. M. Hild, Color similarity measures for efficient color classification, Journal of Imaging Science and Technology, Vol. 15, No. 6, pp. 529-547, 2004.
12. ICDAR Competition (2003), http://algoval.essex.ac.uk/icdar; K. Jung, K.I. Kim and A.K. Jain, Text information extraction in images and video: a survey, Pattern Recognition, Vol. 37, No. 5, pp. 977-997, 2004.
13. D. Karatzas and A. Antonacopoulos, Text extraction from web images based on a split-and-merge segmentation method using color perception, Proceedings of Int. Conf. on Pattern Recognition, Vol. 2, pp. 634-637, 2004.
14. I.J. Kim, Keynote presentation on camera-based document analysis and recognition, 2005, http://www.m.cs.osakafu-u.ac.jp/cbdar.
15. J. Kim, S. Park and S. Kim, Text locating from natural scene images using image intensities, Proceedings of Int. Conf. on Document Analysis and Recognition, pp. 655-659, 2005.
16. P.D. Kovesi, MATLAB and Octave functions for computer vision and image processing, School of Computer Science & Software Engineering, The University of Western Australia, 2006, http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
17. H. Li and D. Doermann, Text enhancement in digital video using multiple frame integration, Proceedings of ACM Int. Conf. on Multimedia, pp. 19-22, 1999.
18. J. Liang, D. Doermann and H. Li, Camera-based analysis of text and documents: a survey, Int. Journal on Document Analysis and Recognition, Vol. 7, No. 2-3, pp. 84-104, 2003; R. Lienhart and A. Wernicke, Localizing and segmenting text in images, videos and web pages, IEEE Trans. Circuits and Systems for Video Technology, Vol. 12, No. 4, pp. 256-268, 2002.
19. D. Lopresti and J. Zhou, Locating and recognizing text in WWW images, Information Retrieval, Vol. 2, pp. 177-206, 2000.
20. R. Lukac, B. Smolka, K. Martin, K.N. Plataniotis and A.N. Venetsanopoulos, Vector filtering for color imaging, IEEE Signal Processing, Special Issue on Color Image Processing, Vol. 22, No. 1, pp. 74-86, 2005.
21. X.-P. Luo, J. Li and L.-X. Zhen, Design and implementation of a card reader based on a built-in camera, Proceedings of Int. Conf. on Pattern Recognition, pp. 417-420, 2004.
22. C. Mancas-Thillou, Natural scene text understanding, PhD thesis, Faculté Polytechnique de Mons, Belgium, 2006; C. Mancas-Thillou and B. Gosselin, Spatial and color spaces combination for natural scene text extraction, Proceedings of Int. Conf. on Image Processing, 2006; C. Mancas-Thillou, M. Mancas and B. Gosselin, Camera-based degraded character segmentation into individual components, Proceedings of Int. Conf. on Document Analysis and Recognition, pp. 755-759, 2005.
23. M. Mata, J.M. Armingol, A. Escalera and M.A. Salichs, A visual landmark recognition system for topologic navigation of mobile robots, Proceedings of Int. Conf. on Robotics and Automation, pp. 1124-1129, 2001.
24. S. Messelodi and C.M. Modena, Automatic identification and skew estimation of text lines in real scene images, Pattern Recognition, Vol. 32, No. 5, pp. 791-810, 1999.
25. Vilas Naik and Sagar Savalagi, Textual Query Based Sports Video Retrieval by Embedded Text Recognition, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 556-565, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
26. Pushpa D. and Dr. H.S. Sheshadri, Schematic Model for Analyzing Mobility and Detection of Multiple Objects on a Traffic Scene, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 32-49, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
