
Text Detection in Scene Images using Stroke Width and Nearest-Neighbor Constraints


Apurva Srivastav1 and Jayant Kumar2
Indian Institute of Technology, Roorkee, India1
Indian Institute of Science, Bangalore, India2

Abstract - Text in scene images can provide very useful as well as vital information and hence, its detection and recognition is an important task. We propose an adaptive edge-based connected-component method for text detection in natural scene images. The approach is based on three reasonable assumptions: (i) characters of a particular word are locally aligned in a certain direction, (ii) each character is of uniform color, and (iii) stroke width is almost constant for most of the characters in a particular word. We apply color quantization and use the luminance to obtain the intensity values. An improved edge-detection technique that performs adaptive thresholding is used to initially capture all possible text components along with some non-text components. Then, we remove obvious non-text components based on a few heuristics. Further, we classify as text those components for which we successfully obtain two consecutive nearest neighbors that are aligned in a direction and satisfy certain constraints based on size and inter-component distance. Finally, we estimate the stroke width and foreground color for each component, and those having a fairly uniform value of the same are classified as text. Results on the ICDAR 2003 Robust Reading Competition data show that the method is competitive for text detection. The main advantage of our method is that it is robust to font size, degraded intensities and complex backgrounds. Also, the use of stroke width and color in this manner for text detection is novel to the best of the authors' knowledge.

I. INTRODUCTION

Text localization and extraction is one of the first and most prevalent steps in the analysis of structured documents or natural scenes. The processing of image text is a specific application that seeks to recognize text appearing as part of or embedded in visual content. One branch of the previous work in the literature is concerned with text detection in video key frames or still images, which has received a great deal of attention as it provides a supplemental way of indexing images or video. The community typically distinguishes between graphic text, which is superimposed on the image (such as subtitles, sports scores, or movie credits), and scene text (such as on signs, buildings, vehicles, name tags, or T-shirts). Hence, depending on the application, text localization in images can be targeted at caption text (or graphic text), scene text, or both. Caption text usually has strong contrast with the background, a fixed font size or a fixed position. The challenge associated with caption text is that it is of low resolution, limited by the video source. Scene text is harder to localize in general, since it varies arbitrarily in the image unless it is taken under specific cases, wherein domain knowledge clues may help to find text. Making a domain-independent, robust text detection algorithm is the challenge we hope to meet.

II. PREVIOUS WORK

The published algorithms can be broadly classified as gradient-feature based, color-segmentation based, and texture-analysis based. The family of gradient-feature-based text localization methods assumes that text exhibits a strong edge against the background, and therefore pixels with high gradient values are regarded as good candidates for text regions. Cai et al. [1] propose a locally adaptive threshold computation method for the Sobel edge detectors in all Y, U, and V channels of a color image to find text that is similar to the background in luminance but not in color. Kim et al. [2], in addition to the gradient value, consider three other gradient-based local features as well: gradient variance, edge density, and edge density variance. In their underlying application of license plate reading, they assume that license plate regions tend to have large gradient variance, high edge density and low density variance. The features are fed to a simple neural network to determine the possible text pixels. Lienhart and Wernicle [3] compute the gradient map of an image and apply a neural network classifier on a small sliding window to identify possible text pixels. They adopt a multi-resolution approach to find text of various sizes. Instead of finding edges, Hua et al. [4] use a SUSAN corner detector to find the stroke ends common in characters. Isolated corners in the background can be removed due to their low-density property. A common procedure in all the above methods is the selection of strong gradient pixels and their grouping into text regions, followed by possible post-processing as verification.
Another category of text localization algorithms is based
on color segmentation. Miene et al. [5] use a fast color segmentation method where pixels are scanned first in the x direction, then in the y direction. Each pixel is compared to its neighboring pixels, and a decision is made if a new color region is present. Li et al. [6] propose a multichannel decomposition approach to text extraction from scene images. The original RGB color image is quantized to 27 color indices by taking only the two highest bits from each color component and further mapping the four states into three states. A second quantization result is obtained from a filtered image that can eliminate illumination variance to some extent. The multichannel results were fused together in the connected component analysis to generate candidate text regions. Further knowledge on color spaces, multi-channel decomposition and quantization can be obtained from the work of C. Mancas-Thillou [15]. Wang and Kangas [7] also use multigroup decomposition, where they combine quantization results from the hue component layer, the low-saturation layer, the luminance layer, and the edge map. Their quantization method is an unsupervised clustering algorithm. The connected-component approach applied by Nobuo et al. [17] uses mathematical-morphological operations on the edge image.

The third category of text localization approaches views the text as a unique texture that exhibits a certain regularity distinguishable from the background. Wu et al. [8] use Gaussian derivative filters to extract local texture features and apply K-means clustering to group pixels that have similar filter outputs. Zhong et al. [9] assume that text areas contain certain high horizontal and vertical frequencies and detect these frequencies directly from the DCT coefficients of the compressed image. Li et al. [10] perform wavelet decomposition on multiscale layers generated from the input grayscale image. The feature values are the first to third moments of the decomposition output. An ANN classifier is trained to label a small window as text or non-text based on the features residing in the window. The multiscale labeling results are integrated into the final result at the original resolution. Clark and Mirmehdi [11] design five statistical texture measurements for text areas and build an ANN classifier to classify each pixel as text or non-text based on the five statistics.

Among the methods introduced above, the reported results vary primarily because of differences in the engineering details of the implementation rather than the general methodology. While texture-based methods are more noise tolerant, have a better self-learning ability, and can output a confidence value, they are often computationally expensive compared with gradient- or color-based schemes. Due to the variation in the font size of scene text, the notion of texture for it becomes useless. We observed this fact while applying the Gabor-filtering approach proposed by Sabari et al. [16], which works very efficiently for document images (small text size), but is not so robust on real-scene images, where the font size is larger as well as variable. Hence we have tried to use some very common properties of text characters, like uniform stroke width, alignment in a certain direction and uniform foreground intensity, coupled with some practical heuristic constraints, to extract the text components. In the next section, we explain the details of the proposed approach. In Section IV we present experimental results, and the final conclusion is provided in Section V.

III. TEXT DETECTION SYSTEM

Our text detection system basically exploits the properties associated with characters and text strings. We have made some reasonable assumptions about the alignment of characters in words and the stroke width of a character. Based on these assumptions, we filter out non-text components one by one. Fig. 1 shows the main steps involved in our text detection system.

[Figure 1: Preprocessing: Color Quantization -> Adaptive Edge Detection and Basic Filtering -> Classification based on Nearest-Neighbor Constraints -> Classification based on Stroke Width and Foreground Color Constraints]
Figure 1. Block diagram of the proposed text detection system.

A. Preprocessing and Basic Filtering
We use color quantization so that colors that appear the same to the naked eye but vary over quite a range in their actual RGB values can be grouped together and reduced to the same value, thus increasing the accuracy of our foreground color estimation. The bin size is fixed to 64. The Canny edge detection algorithm [12] is known to many as the optimal edge detector. The Canny edge detector, which uses Gaussian filtering and a differentiation mask, works best for edge detection in most images, but due to the wide range of variations in intensity, contrast and illumination within the same image, global thresholding may miss some of the weak edges and prevent text characters from being detected as connected components. Hence in our
approach, we resort to adaptive thresholding, which modifies the threshold values according to the peak of the gradient magnitude in a particular region. This gives a better chance of capturing edges that might otherwise be missed. The image is accessed in 64x64 blocks and the Canny edge response is thresholded individually as follows:

1) Calculate MaxGrad = maximum gradient in the image.
2) For each block i in the image:
   Calculate LocalMax(i) = maximum gradient in the block.
   If LocalMax(i) >= t1*MaxGrad, run the Canny edge detector on the block with
   upper threshold = t2*LocalMax(i) and lower threshold = t3*LocalMax(i);
   else discard the block and continue with the next block.

The edge detection is performed on all three channels, R, G and B, separately. This assures the capture of a gradient in the intensity of any of the three color components. The three edge images hence obtained, ER, EG and EB, are combined by a logical OR operation, as illustrated in (1), to give a comprehensive edge map:

E = ER | EG | EB (1)

Figure 2. (a) An edge-box having two inner edge-boxes. (b) A typical scene character depicting stroke width at different pixel positions with the direction associated with it.

After obtaining the edge image, an 8-connected component labeling is done and the associated bounding-box information is obtained. The aspect ratio of an edge-box (EB) is constrained to lie between 0.1 and 10 to eliminate highly elongated regions. The size of the EB should be greater than 25 pixels but smaller than 1/5th of the image dimension to be considered for further processing. Since the edge detection captures both the inner and outer boundaries of the characters, it is possible that an EB may completely enclose one or more EBs, as illustrated in Fig. 2(a). For example, the letter 'B' gives rise to three components: two due to the inner boundaries, EBint, and a third due to the outer boundary, EBout. Filtering of EBs is done in a similar manner as described in [13]. We need to eliminate the undesired EBs as they could interfere with the nearest-neighbor calculation in subsequent steps.

B. Nearest-Neighbor Constraints
For every component, two consecutive nearest neighbors are found using the Euclidean distance. The components and their nearest neighbors are compared on the basis of their dimensions, alignment and distance. If the constraints given by (2), (3) and (4) are satisfied, we declare the component and its two consecutive nearest neighbors as text; otherwise we discard the component and proceed by performing the same check on the other components.

max(h1,h2) < 2*min(h1,h2) OR max(w1,w2) < 2*min(w1,w2) (2)

max(h2,h3) < 2*min(h2,h3) OR max(w2,w3) < 2*min(w2,w3) (3)

(max(d12,d23) < 2*min(d12,d23)) AND (a12 > a23 - t4) AND (a12 < a23 + t4) (4)

where
hi = height of the ith component
wi = width of the ith component
dij = distance between the ith and jth components
aij = angle made by the line joining the centroids of the two components with the horizontal axis

Figure 3. (a) Two prominent stroke widths for the character 'E'. (b) Pixel intensity at the midpoint of two edge pixels, shown as a yellow cross on the (green) scan line, for foreground intensity estimation.

C. Stroke Width and Foreground Color Constraints
Any character is written using either single or multiple sub-strokes. Each sub-stroke has a width associated with it.
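Before turning to stroke width, the nearest-neighbor test of Section B, i.e. constraints (2), (3) and (4), can be sketched in code. This is a minimal illustration rather than the authors' implementation: the component record layout (width, height, centroid) is our own, and reading the angle tolerance t4 = 0.6 from Table I as radians is an assumption.

```python
import math

# Hypothetical component record: w, h are the edge-box dimensions,
# (cx, cy) is its centroid. This layout is our assumption.
def size_compatible(a, b):
    # Constraints (2)/(3): heights OR widths differ by less than a factor of 2.
    return (max(a["h"], b["h"]) < 2 * min(a["h"], b["h"]) or
            max(a["w"], b["w"]) < 2 * min(a["w"], b["w"]))

def is_text_triplet(c1, c2, c3, t4=0.6):
    """Constraints (2)-(4) on a component and its two consecutive nearest neighbors."""
    d12 = math.hypot(c2["cx"] - c1["cx"], c2["cy"] - c1["cy"])
    d23 = math.hypot(c3["cx"] - c2["cx"], c3["cy"] - c2["cy"])
    a12 = math.atan2(c2["cy"] - c1["cy"], c2["cx"] - c1["cx"])
    a23 = math.atan2(c3["cy"] - c2["cy"], c3["cx"] - c2["cx"])
    spacing_ok = max(d12, d23) < 2 * min(d12, d23)   # distance part of (4)
    aligned = abs(a12 - a23) < t4                    # angle part of (4)
    return size_compatible(c1, c2) and size_compatible(c2, c3) and spacing_ok and aligned

# Three equal, evenly spaced, horizontally aligned boxes pass the test:
row = [{"w": 20, "h": 30, "cx": 10 + 25 * i, "cy": 50} for i in range(3)]
print(is_text_triplet(*row))  # True
```

A component failing the test is not immediately discarded as non-text; it is merely not declared text by this triplet, and other components are checked in turn.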
We observe that the stroke width (SW) in a character is almost the same for all sub-strokes, as shown in Fig. 2(b). The sub-strokes can by and large be classified into four types on the basis of the direction of the edge at that point (refer Fig. 2(b)): (i) horizontal edge, (ii) edge at 45 deg, (iii) vertical edge, (iv) edge at 135 deg. For each EB, we calculate the SW in each of these four directions, and the direction of edge in which the width (SWd) satisfies (5) is considered for estimating the overall stroke width of the EB. Finally, we observe that in any character there are at most two predominant stroke widths (e.g. the letter E, shown in Fig. 3(a)); hence, if more than 60% of the pixels thus chosen are nearly equal to one or two values of width, we classify that particular component as text.

Standard Deviation(SWd) < t5*(max(SWd) - min(SWd)) (5)

The foreground color refers to the color of the character stroke, i.e. the color of the region enclosed by the edge boundary. We scan every row of the EB and obtain the intensities of pixels which lie in the middle of two edge pixels (as shown in Fig. 3(b)). We select only those intensities that lie inside the edge boundary and find their median value (FG); if (6) is satisfied, we classify that component as text:

(FG_range / FG_total) > t6 (6)

where FG_range is the number of pixels whose intensities lie in the range (FG - t7, FG + t7) and FG_total is the total number of intensities for the given edge-box.

Figure 4. Some results from our text detection system. The output images in (a), (b), (c) have a few false positives but no false negatives; (d) and (e) have no false alarms; (f) has both kinds of false alarms: observe that the character 'A' is missed.
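The foreground-color test in (6) lends itself to a compact sketch. The following is a simplified illustration, not the authors' code: it samples the intensity midway between each pair of consecutive edge pixels on every scan line and ignores the inner/outer boundary distinction made in the text; t6 and t7 follow Table I.

```python
import numpy as np

def foreground_color_uniformity(gray, edge, t6=0.7, t7=15):
    """Test (6): collect intensities midway between successive edge pixels on
    each scan line, take their median FG, and require that the fraction lying
    within (FG - t7, FG + t7) exceed t6."""
    samples = []
    for row_gray, row_edge in zip(gray, edge):
        cols = np.flatnonzero(row_edge)               # edge-pixel columns on this scan line
        for left, right in zip(cols[:-1], cols[1:]):
            samples.append(row_gray[(left + right) // 2])  # midpoint intensity
    if not samples:
        return False
    samples = np.asarray(samples, dtype=float)
    fg = np.median(samples)                           # median foreground intensity FG
    in_range = np.count_nonzero(np.abs(samples - fg) < t7)
    return in_range / samples.size > t6

# A synthetic edge-box: uniform stroke intensity 100 between two edge columns.
gray = np.full((5, 20), 100, dtype=np.uint8)
edge = np.zeros((5, 20), dtype=bool)
edge[:, [3, 16]] = True
print(foreground_color_uniformity(gray, edge))  # True
```

The stroke-width test in (5) follows the same pattern, replacing midpoint intensities with per-direction distances between edge-pixel pairs.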
IV. EXPERIMENTAL RESULTS

The test images used in our work are taken from the ICDAR 2003 robust reading and text locating competition image database [14]. The empirical values of the various parameters used are shown in Table I. We have selected 100 images of varying backgrounds, 50 from the training and 50 from the test database. Table II presents a summary of our results. Fig. 4 shows some of the outputs.
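Measures (7) and (8) below can be checked numerically against Table II. In this sketch we read the Characters column as the count of correctly detected characters; that reading is our assumption, but it reproduces the reported recall and precision for the training split.

```python
def recall_precision(correct, false_neg, false_pos):
    """Evaluation measures (7) and (8)."""
    recall = correct / (correct + false_neg)
    precision = correct / (correct + false_pos)
    return recall, precision

# Training split of Table II: 534 correct, 16 false negatives, 119 false positives.
r, p = recall_precision(534, 16, 119)
print(round(100 * r, 2), round(100 * p, 2))  # 97.09 81.78 (the paper reports 81.77)
```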
We use Recall and Precision measures for the evaluation of results, where:

Recall = Correct/(Correct + False Negatives) (7)

Precision = Correct/(Correct + False Positives) (8)

TABLE I
EMPIRICAL VALUES OF PARAMETERS

Parameter:  t1    t2    t3   t4   t5   t6   t7
Value:      0.1   0.25  0.7  0.6  0.5  0.7  15

TABLE II
EXPERIMENTAL RESULTS

Database    Characters  False Positives  False Negatives  Recall   Precision
Train (50)  534         119              16               97.09%   81.77%
Test (50)   509         155              23               95.67%   76.65%
Average     522         137              20               96.38%   79.21%

The same algorithm, when applied using the original Canny edge detector, gives precision and recall values of 80.01% and 95.06%, respectively. We observe that though recall has improved, precision has slightly decreased. This could be explained by the false detection of a few edges in images with high resolution, due to the sensitive thresholds.

V. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a novel approach for text analysis and detection based on character stroke width and nearest-neighbor constraints. Unlike traditional approaches based mainly on gradient features and texture, we use inherent properties of the scene text. These properties, though simple and obvious enough, carry great strength, as they define the basic nature of text characters in general. The strength of this method lies in its robustness to font size, style, text orientation and varying background shade. Due to the adaptive edge detection, we are able to capture even text regions with weak edges. The experimental results show good recall and precision for the method (averages of 96.38% and 79.21%, respectively). The values of the parameters used in this algorithm are a trade-off between relaxed and strict thresholds; although this has given good results for several images, it does not work well for many others. The current method still gives some false positives, which could be eliminated by tightening the constraints on color and distance between characters in a text string. Another drawback of the algorithm is that it does not detect a text string of 2 or fewer characters. Our future work will concentrate on these and some more issues; for instance, making the parameter values self-adaptive to the images, so that for an image with poor resolution the threshold values are low, in order to capture weak edges, while for images with better resolution they are higher, to avoid detecting false edges. Also, as an extension of the stroke-width approach, future work will try to use stroke smoothness and the inherent symmetry of characters to distinguish them from non-text components.

REFERENCES

[1] Cai M, Song J-Q, Lyu MR (2002) A new approach for video text detection. In: Proc. ICIP, pp 117-120.
[2] Kim S, Kim D, Ryu Y, Kim G (2002) A robust license-plate extraction method under complex image conditions. In: Proc. ICPR, pp 216-219.
[3] Lienhart R, Wernicle A (2002) Localizing and segmenting text in images and videos. IEEE Trans Circuits Syst Video Technol 12(4), pp 256-268.
[4] Hua X-S, Chen X-R, Liu W-Y, Zhang H-J (2001) Automatic location of text in video frames. In: Proc. ACM workshop on multimedia: multimedia information retrieval, pp 24-28.
[5] Miene A, Hermes Th, Ioannidis G (2001) Extracting textual inserts from digital videos. In: Proc. ICDAR, pp 1079-1083.
[6] Li C, Ding X-Q, Wu Y-S (2001) Automatic text location in natural scene images. In: Proc. ICDAR, pp 1069-1073.
[7] Wang H, Kangas J (2001) Character-like region verification for extracting text in scene images. In: Proc. ICDAR, pp 957-962.
[8] Wu V, Manmatha R, Riseman EM (1999) TextFinder: an automatic system to detect and recognize text in images. IEEE Trans Pattern Anal Mach Intell 21(11), pp 1124-1129.
[9] Zhong Y, Karu K, Jain AK (1995) Locating text in complex color images. In: Proc. ICDAR, pp 146-149.
[10] Li H, Doermann D, Kia O (2000) Automatic text detection and tracking in digital video. IEEE Trans Image Processing 9(1), pp 147-167.
[11] Clark P, Mirmehdi M (2000) Finding text regions using localized measures. In: Proc. BMVC, pp 675-684.
[12] Canny J (1986) A computational approach to edge detection. IEEE Trans. PAMI, vol. 8, pp 679-698.
[13] Kasar T, Kumar J, Ramakrishnan AG (2007) Font and background color independent text binarization. In: Intl. Workshop on Camera-Based Document Analysis and Recognition (CBDAR), pp 3-9. http://www.imlab.jp/cbdar2007/program.shtml#proc
[14] ICDAR 2003 robust reading and text locating competition image database. http://algoval.essex.ac.uk/icdar/Datasets.html
[15] Mancas-Thillou C (2006) Natural scene text understanding. PhD thesis, Faculté Polytechnique de Mons, Belgium.
[16] Sabari Raju S, Pati PB, Ramakrishnan AG (2004) Gabor filter based block energy analysis for text extraction from digital document images. In: Intl. Workshop on Document Analysis for Libraries, Palo Alto, California, USA, Jan 23-24.
[17] Ezaki N, Bulacu M, Schomaker L (2004) Text detection from natural scene images: towards a system for visually impaired persons. In: Proc. 17th International Conference on Pattern Recognition (ICPR 2004).
