Accurate Scene Text Recognition based on Recurrent Neural Network

Anonymous ACCV submission
Paper ID ***
Abstract. Scene text recognition is a useful but very challenging task due to the uncontrolled conditions of text in natural scenes. This paper presents a novel approach to recognizing text in scene images. In the proposed technique, a word image is first converted into sequential column vectors based on the Histogram of Oriented Gradients (HOG). A Recurrent Neural Network (RNN) is then adapted to classify the sequential feature vectors into the corresponding word. Compared with most existing methods, which follow a bottom-up approach and form words by grouping recognized characters, our proposed method recognizes whole word images without character-level segmentation and recognition. Experiments on a number of publicly available datasets show that the proposed method outperforms the state-of-the-art techniques significantly. In addition, the recognition results on publicly available datasets provide a good benchmark for future research in this area.
1 Introduction
Reading text in scenes is a very challenging task in computer vision that has been drawing increasing research interest in recent years. This is partially due to the rapid development of wearable and mobile devices such as smartphones, digital cameras, and the latest Google Glass, where scene text recognition is a key module for a wide range of practical and useful applications.

Traditional Optical Character Recognition (OCR) systems usually assume that the document text has well-defined fonts, sizes, layout, etc., and is scanned under well-controlled lighting. They often fail to recognize camera-captured texts in scenes, which have few constraints in terms of text fonts, environmental lighting, image background, etc., as illustrated in Fig. 1.

Intensive research efforts have been devoted to this area in recent years and a number of good scene text recognition systems have been proposed. One approach is to combine text segmentation with existing OCR engines, where text pixels are first segmented from the image background and then fed to OCR engines for recognition. Several systems have been reported that exploit Markov Random Fields [5], nonlinear color enhancement [6], and inverse rendering [7] to extract the character regions. However, the text segmentation process by itself is a very challenging task that is prone to different types of segmentation errors. Furthermore, the OCR engines may fail to recognize the segmented texts due to special text fonts and perspective distortion, as the OCR engines are usually trained on characters with a fronto-parallel view and normal fonts.
Fig. 1. Word image examples taken from the recent public datasets [1-4]. All the words in the images are correctly recognized by our proposed method.
A number of scene text recognition techniques [4, 8-11] have been reported in recent years that train their own scene character classifiers. Most of these methods follow a bottom-up approach that groups the recognized characters into a word based on context information. The grouping process to form a word is usually defined as finding the best word alignment that fits the set of detected characters. A lexicon and an n-gram language model can also be incorporated as top-down cues to recover common errors such as misspellings and ambiguities [10, 6, 7]. On the other hand, these techniques require robust and accurate character-level detection and recognition; otherwise the word alignment will lead to incorrect results due to error accumulation from lower levels to higher levels.

In this paper, we propose a novel scene text recognition technique that treats a word image as an unsegmented sequence and does not require character-level detection, segmentation, or recognition. Fig. 2 shows the overall flowchart of our proposed word recognition system. First, a word image is converted into a sequence of column features, where each column feature is generated by concatenating the HOG features extracted from the image patches in the same column of the input image. A multi-layer recurrent neural network (RNN) with bidirectional Long Short-Term Memory (LSTM) [12] is then trained to label the sequential data. Finally, the connectionist temporal classification (CTC) [13] technique is exploited to find the best match from a list of lexicon words based on the RNN output for the sequential feature.
Fig. 2. The overall flowchart of our proposed scene word recognition system.
We apply our proposed technique to cropped word image recognition with a lexicon. A cropped word denotes a word image that is cropped along the word bounding box in the original image, as illustrated in Fig. 1. Such word regions can be located by a text detector or with user assistance in different real-world scenarios. The lexicon refers to a list of possible words associated with the word image, which can be viewed as a form of contextual information that significantly narrows down the search space in real-world applications. For example, the lexicon can be nearby signboard names collected using Google when recognizing signboards at one location [4]. In other cases, the lexicon can be food names, athlete names, product lists, etc., depending on the scenario.
The proposed scene text recognition technique has a number of novel contributions. First, we describe an effective way of converting a word image into a sequential signal so that techniques used in related areas, such as speech processing and handwriting recognition, can be introduced and applied. Second, we adapt the RNN and CTC techniques for the recognition of text in scenes and design a segmentation-free scene word recognition system that obtains superior word recognition accuracy. Third, unlike some systems that rely heavily on private datasets that are not available to the public [8, 14], our system makes use of several publicly available datasets, hence providing a baseline for easier benchmarking of ensuing scene text recognition techniques.
2 Related Work

2.1 Scene Text Recognition
In general, word recognition in natural scenes consists of two main steps: text detection and text recognition. The first step detects possible regions as character candidates, and the second step recognizes the detected regions and groups them into words.

Different visual feature detectors and descriptors have been exploited for the detection and recognition of text in scenes. The Histogram of Oriented Gradients (HOG) feature [15] is widely used in different methods [4, 10, 16].
Fig. 3. Illustration of the Recurrent Neural Network: (a) a fully recurrent neural network; (b) a Long Short-Term Memory RNN.
Recently, a part-based tree structure [9] has also been proposed along with the HOG descriptor to capture the structural information of different characters. In addition, the Co-occurrence HOG feature [17] has been developed to represent the spatial relationship of neighbouring pixels. Other features, including stroke width (SW) [18, 19], maximally stable extremal regions (MSER) [20, 21], and the weighted direction code histogram (WDCH) [8], have also been applied for character detection and classification. The HOG feature is also exploited in our proposed system for the construction of sequential features of word images.

With a large amount of training data, the systems reported in [8, 14, 22] obtain high recognition accuracy with unsupervised feature learning techniques. Moreover, different approaches have been proposed to group the detected characters into words using contextual information, including Weighted Finite-State Transducers [11], Pictorial Structures [4, 16], and Conditional Random Field models [10, 9]. In contrast, we propose a novel scene text recognition technique that treats each word image as a whole without requiring character-level segmentation and recognition. The CTC technique [13] is exploited to find the best word alignment for the unsegmented column feature sequence, so character-level segmentation and recognition are not needed.
2.2 Recurrent Neural Network
The Recurrent Neural Network (RNN) is a special neural network designed for handling sequential data, as illustrated in Fig. 3 (a). The RNN aims to predict the label of the current time stamp using the contextual information of past time stamps. It is a powerful classification model, but it has not been widely used in the literature. The major reason is that it often requires a long training process, as the error path integral decays exponentially along the sequence [23].

The long short-term memory (LSTM) model [23] was proposed to solve this problem, as illustrated in Fig. 3 (b). In LSTM, an internal memory structure replaces the nodes of the traditional RNN, and the output activation of the network at time t is determined by the input data of the network at time t and the internal memory stored in the network at time t - 1. The learning procedure under LSTM therefore becomes local and constant. Furthermore, a forget gate is added to determine whether to reset the stored memory [24]. This strategy helps the RNN to remember contextual information and to discard accumulated errors during learning. The memory update and output activation procedure can be formulated in Eq. 1 as follows:
S^t = S^{t-1} \cdot f(W_{forget} X^t) + f(W_{in} X^t) \cdot f(W_c X^t)    (1a)
Y^t = f(W_{out} X^t) \cdot g(S^t)    (1b)
where S^t, X^t, and Y^t denote the stored memory, input data, and output activation of the network at time t, respectively. The functions g(*) and f(*) refer to sigmoid functions that squash the data, and W_* denotes the weight parameters of the network. In addition, the first term f(W_{forget} X^t) controls whether to discard the previously stored memory S^{t-1}.
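As a concrete illustration, the following minimal NumPy sketch implements the memory update of Eq. 1 exactly as written above. The weight matrices are illustrative placeholders, and, following the text, both f(*) and g(*) are taken as the sigmoid function; this is a simplified sketch, not the full LSTM cell used by the trained network.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, s_prev, W_forget, W_in, W_c, W_out):
        # One memory update following Eq. 1 (simplified form used in the paper).
        # x_t: input column feature at time t; s_prev: stored memory S^{t-1}.
        forget_gate = sigmoid(W_forget @ x_t)        # whether to keep S^{t-1}
        input_gate  = sigmoid(W_in @ x_t)            # how much new content to write
        candidate   = sigmoid(W_c @ x_t)             # new memory content
        s_t = s_prev * forget_gate + input_gate * candidate   # Eq. 1a
        y_t = sigmoid(W_out @ x_t) * sigmoid(s_t)    # Eq. 1b, with g taken as sigmoid
        return s_t, y_t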
The bidirectional LSTM [13, 25] was further proposed to predict the current label with both past and future contextual information by processing the input sequence in two directions (i.e., from beginning to end and from end to beginning). It has been applied to handwriting recognition and outperforms the widely used Hidden Markov Model (HMM). We introduce the bidirectional LSTM into the scene text recognition domain, where this powerful model obtains superior recognition performance.
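The bidirectional processing itself can be sketched on top of the lstm_step function above: the sequence is scanned once in each direction and the two per-time-step outputs are concatenated. The parameter tuples and the state size are assumptions for illustration only and do not reflect the exact layer layout of the trained network.

    def bidirectional_pass(X, params_fw, params_bw, state_size):
        # X: list of column feature vectors; params_*: (W_forget, W_in, W_c, W_out)
        # tuples for the forward and backward directions (assumed layout).
        def run(seq, params):
            s, outs = np.zeros(state_size), []
            for x in seq:
                s, y = lstm_step(x, s, *params)
                outs.append(y)
            return outs
        forward  = run(X, params_fw)               # beginning -> end
        backward = run(X[::-1], params_bw)[::-1]   # end -> beginning
        return [np.concatenate([f, b]) for f, b in zip(forward, backward)]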
3 Feature Preparation
To apply the RNN model, the input word image needs to be first converted into a sequential feature. In speech recognition, the input signal is already sequential and features can be extracted directly frame by frame. Similarly, a word image can be viewed as a sequential array if we take each column of the word image as a frame. This idea has been applied to handwriting recognition [13] with great success.
However, the same procedure cannot be applied to the scene text recognition problem due to two factors. First, the input data of the handwriting recognition task is usually binary or has a clear bi-modal pattern, where the text pixels can be easily segmented from the background. Second, handwritten text usually has a much smaller stroke width variation compared with text in scenes, so features extracted from each column of a handwriting image contain more meaningful information for classification.

On the other hand, quite a number of visual features in computer vision, such as HOG, are extracted over image patches and perform well in text recognition due to their robustness to illumination variation and their invariance to local geometric and photometric transformations. We therefore first partition the input image convolutionally into patches with a step size of 1, as described in Algorithm 1 below.
Fig. 4. The overall process of column feature construction. From left to right: HOG features are extracted convolutionally from every image block of the input image; average pooling is then performed on each column; finally, vectorization is applied to obtain the sequential feature. Note that P = M - W + 1, L = N - W + 1, K = P/T, and Q = K * H, where H denotes the length of the HOG feature.
Algorithm 1 Convolutional Image Patch Extraction
 1: procedure PatchExtraction(img, W)
 2:   M : height of img
 3:   N : width of img
 4:   Patches : (M-W+1) x (N-W+1) matrix, each entry is a W x W matrix
 5:   for i = 1:M-W+1 do
 6:     for j = 1:N-W+1 do
 7:       Patches(i,j) = img(i:i+W-1, j:j+W-1)
 8:     end for
 9:   end for
10:   Return Patches
11: end procedure
where img denotes the input word image and W denotes the size of an image patch. The HOG feature is then extracted and normalized for each image patch. After that, the HOG features of the image patches in the same column are linked together. An average pooling strategy is applied to incorporate information from neighbouring blocks by averaging the HOG feature vectors within a neighbouring window, as defined in Eq. 2.
HOG_{avg}(i, j) = \frac{1}{T} \sum HOG\left((i-1) \cdot T + 1 : i \cdot T,\ j\right)    (2)
where i and j are indices, HOG denotes the extracted and normalized HOG feature vector of the corresponding patch, HOG_avg denotes the feature vector after average pooling, and T denotes the size of the neighbouring window for average pooling. A column feature is finally determined by concatenating the averaged HOG feature vectors in the same column.

Fig. 4 shows the overall process of column feature vector construction. To ensure that all the column features have the same length, the input word image needs to be normalized to the same height M beforehand. Furthermore, the patch size W and the neighbouring window size T can be set empirically, and will be discussed in the experiment section.
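To make the construction concrete, the sketch below builds the column features under the settings reported in the experiments (height M = 32, patch size W = 8, pooling window T = 5, 8 HOG bins). It is a minimal illustration under assumptions, not the exact implementation: scikit-image's hog() is assumed as the per-patch HOG extractor, and any equivalent descriptor could be substituted.

    import numpy as np
    from skimage.feature import hog   # assumed HOG implementation

    def column_features(img, W=8, T=5, bins=8):
        # img: M x N grayscale word image, already normalized to height M.
        # Returns an L x Q matrix with one column feature per image column,
        # where L = N - W + 1, K = (M - W + 1) // T and Q = K * bins (see Fig. 4).
        M, N = img.shape
        P, L = M - W + 1, N - W + 1
        feats = np.zeros((P, L, bins))
        for i in range(P):                     # convolutional patch extraction,
            for j in range(L):                 # step size 1 (Algorithm 1)
                patch = img[i:i + W, j:j + W]
                feats[i, j] = hog(patch, orientations=bins,
                                  pixels_per_cell=(W, W), cells_per_block=(1, 1))
        K = P // T
        # average pooling over T consecutive patches in the same column (Eq. 2)
        pooled = feats[:K * T].reshape(K, T, L, bins).mean(axis=1)
        # vectorization: concatenate the K pooled HOG vectors of each column
        return pooled.transpose(1, 0, 2).reshape(L, K * bins)

With these settings each column feature has length K * bins = 5 * 8 = 40, which matches the number of RNN input cells reported in the experiments.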
4 Recurrent Neural Network Construction
A word image can thus be recognized by classifying the correspondingly converted sequence of column feature vectors. We use the RNN [23] instead of a traditional neural network or an HMM due to its superior characteristics in several aspects. First, unlike the HMM, which generates observations based only on the current hidden state, the RNN incorporates context information, including the historical states, via the LSTM structure [13] and therefore greatly outperforms the HMM. Second, unlike a traditional neural network, the bidirectional LSTM RNN model [13] does not require explicit labelling of every single column vector of the input sequence. This is very important for scene text recognition because characters in scenes are often connected, broken, or blurred, which makes explicit labelling infeasible, as illustrated in Fig. 1. Note that RNNLIB [26] is used to build and train the multi-layer RNN.

CTC [13] is applied to the output layer of the RNN to label the unsegmented data. In our system, a training sample can be viewed as a pair (C, W) of an input column feature sequence and a target word string. The objective function of CTC is then defined as follows:
O = -\sum_{(C, W) \in S} \ln p(W|C)    (3)
where S denotes the whole training set and p(W|C) denotes the conditional probability of the word W given a sequence of column features C. The target is to minimize O, which is equivalent to maximizing the conditional probability p(W|C).

An output path π of the RNN output activations has the same length as the input sequence C. Since neighbouring column feature vectors might represent the same character, and some column feature vectors may not represent any label, an additional 'blank' output cell needs to be added to the RNN output layer. In addition, repeated labels and empty labels need to be removed to map an output path to the target word W. For example, ('-', 'a', 'a', '-', '-', 'b', 'b', 'b') is mapped to (a, b), where '-' denotes the empty label. p(W|C) is then defined as follows:
p(W|C) = \sum_{V(\pi) = W} p(\pi|C)    (4)
where V denotes the operator that translates an output path π to the target word W. It is worth noting that this translation is many-to-one: several output paths can map to the same word. p(π|C) refers to the conditional probability of the output path π given the input sequence C, which is defined as follows:
p(\pi|C) = \prod_{t=1}^{L} p(\pi_t|C) = \prod_{t=1}^{L} y^t_{\pi_t}    (5)
where L denotes the length of the output path and π_t denotes the label of the output path π at time t. The term y^t denotes the network output of the RNN at time t, which can be interpreted as a probability distribution over the output labels at time t. Therefore, y^t_{π_t} denotes the probability of π_t at time t.
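For clarity, the many-to-one operator V of Eq. 4 can be written in a few lines. The helper below is a hypothetical sketch that simply reproduces the mapping illustrated above: repeated labels are merged first, then empty labels are dropped.

    def collapse_path(path, blank='-'):
        # Merge repeated labels, then remove the empty (blank) label.
        merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
        return tuple(p for p in merged if p != blank)

    # collapse_path(('-','a','a','-','-','b','b','b')) -> ('a', 'b')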
327
The CTC forward backward algorithm [13] is then applied to calculate p(W|C). 327
328
The RNN network is trained by back-propagating the gradient through the out- 328
329
put layer based on the objective function as defined in Eq. 3. Once the RNN is 329
trained, it can be used to convert a sequential feature vector into a probability
330 330
matrix. In particular, the RNN will produce a L × G probability matrix Y given
331 331
an input sequence of column feature vector, where L denotes the length of the
332 332
sequence, and G denotes the number of possible output labels. Each entry of Y
333 333
can be interpreted as the probability of a label at a time step.
334 334
335 335
5 Word Scoring with Lexicon
Given a probability matrix Y and a lexicon L containing a set of possible words, word recognition can be formulated as searching for the best-matching word w* as follows:
w^* = \arg\max_{w \in L} p(w|Y) = \arg\max_{w \in L} \sum_{V(\pi) = w} p(\pi|Y)    (6)
where p(w|Y) is the conditional probability of the word w given Y. A directed graph can be constructed for the word w so that each node represents a possible label of w. In other words, we need to sum over all the possible paths that can form the word w on the probability matrix Y to calculate the score of w.

An augmented word w^i can be generated by adding blank intervals at the beginning and end of w as well as between neighbouring labels of w, where the blank interval denotes the empty label. The length of w^i is 2|w| + 1, where |w| denotes the length of w. A new |w^i| x L probability matrix P can thus be formed, where |w^i| denotes the length of w^i and L denotes the length of the input sequence. P(m, t) denotes the probability of label w^i_m at time t, which can be determined from the probability matrix Y. Each path from P(1, 1) to P(|w^i|, L) denotes a possible output path π of the word w, whose probability can be calculated using Eq. 5.

The problem thus becomes accumulating the scores along all the possible paths in P. It can be solved with the CTC token passing algorithm [13] using dynamic programming, whose computational complexity is O(L · |w^i|). Finally, the word with the highest score in the lexicon is returned as the recognized word.
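The accumulation over all paths in P can be sketched as a standard CTC-style forward pass, which computes the same sum as the token passing algorithm cited above. The snippet below is an assumption-laden illustration: Y is taken to hold one probability distribution per time step, 'labels' is a hypothetical mapping from each character to its index in the RNN output layer, and index 0 is taken as the blank label.

    import numpy as np

    def score_word(Y, word, labels, blank=0):
        # Augmented word: blank, c1, blank, c2, ..., blank  (length 2*|w| + 1).
        ext = [blank]
        for c in word:
            ext += [labels[c], blank]
        L, S = Y.shape[0], len(ext)
        alpha = np.zeros((L, S))          # alpha[t, s]: summed path probability
        alpha[0, 0] = Y[0, ext[0]]
        alpha[0, 1] = Y[0, ext[1]]
        for t in range(1, L):
            for s in range(S):
                a = alpha[t - 1, s]
                if s > 0:
                    a += alpha[t - 1, s - 1]
                # a blank may be skipped when the neighbouring labels differ
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    a += alpha[t - 1, s - 2]
                alpha[t, s] = a * Y[t, ext[s]]
        return alpha[L - 1, S - 1] + alpha[L - 1, S - 2]

    # recognized = max(lexicon, key=lambda w: score_word(Y, w, labels))

In practice the probabilities would be accumulated in log space to avoid numerical underflow on long sequences.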
6 Experiments and Discussion

6.1 System Details
In the proposed system, all cropped word images are normalized to the same height, i.e., M = 32. The patch size W, the HOG bin number, and the averaging window size T are set to 8, 8, and 5, respectively. For the RNN, the number of input cells equals the length of the extracted column feature, i.e., 40. The output layer has 64 cells, covering 62 characters ([a...z, A...Z, 0...9]), one label for special characters ([+, &, $, ...]), and one empty label. Three hidden layers are used with 60, 100, and 140 cells, respectively. The system is implemented on Ubuntu 13.10 with 16 GB RAM and a 64-bit 3.40 GHz Intel CPU. The training process takes about one hour on a training set of about 3000 word images, and the average time for recognizing a cropped word image is around one second. This is comparable with the state-of-the-art techniques and can be further improved to satisfy the requirements of real-time applications.
381 381
The proposed method has been tested on three public datasets: the ICDAR 2003 dataset, the ICDAR 2011 dataset, and the Street View Text (SVT) dataset. The three datasets consist of 1156 training and 1110 testing images, 848 training and 1189 testing images, and 257 training and 647 testing images, respectively.

During the experiments, we add the Char74k character images [27] to form a bigger training dataset. For the Char74k dataset, only the English characters are used, amounting to more than ten thousand character images in total. In particular, 7705 character images are obtained from natural images and the rest are hand-drawn using a tablet PC. The trained RNN is applied to the testing images of the three datasets for word recognition.
We compare our proposed method with eight state-of-the-art techniques, including the Markov Random Field method (MRF) [5], the inverse rendering method (IR) [7], the nonlinear color enhancement method (NESP) [6], the pictorial structure method (PLEX) [16], the HOG-based conditional random field method (HOG+CRF) [10], the weighted finite-state transducers method (WFST) [11], the part-based tree structure method (PBS) [9], and the convolutional neural network method (CNN) [14].

To make the comparison fair, we evaluate recognition accuracy on the testing data with a lexicon created from all the words in the test set (denoted by ICDAR03(Full) and ICDAR11(Full) in Table 1), as well as with a lexicon consisting of 50 random words from the test set (denoted by ICDAR03(50) and ICDAR11(50) in Table 1). For the SVT dataset, a lexicon of about 50 words is provided and directly adopted in our experiments.
1 ICDAR 2003: http://algoval.essex.ac.uk/icdar/Datasets.html
2 ICDAR 2011: http://robustreading.opendfki.de/wiki/SceneText
3 SVT: http://vision.ucsd.edu/~kai/grocr/
4 Char74k: http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/
Table 1. Word recognition accuracy on the ICDAR 03 testing dataset, the ICDAR 11 testing dataset, and the SVT dataset.

Datasets         ICDAR03(Full)  ICDAR03(50)  ICDAR11(Full)  ICDAR11(50)  SVT
MRF [5]          0.67           0.69         -              -            -
IR [7]           0.75           0.77         -              -            -
NESP [6]         0.66           -            0.73           -            -
PLEX [16]        0.62           0.76         -              -            0.57
HOG + CRF [10]   -              0.82         -              -            0.73
PBS [9]          0.79           0.87         0.83           0.87         0.74
WFST [11]        0.83           -            0.56           -            0.73
CNN [14]         0.84           0.90         -              -            0.70
Proposed         0.82           0.92         0.83           0.91         0.83
Table 1 shows the word recognition accuracy of the proposed technique and the compared techniques. The text segmentation methods (MRF, IR, and NESP) produce lower recognition accuracy than the other methods because robust and accurate scene text segmentation is by itself a very challenging task. Our proposed method produces good recognition results on all the testing datasets. In particular, on the SVT dataset our method achieves 83% word recognition accuracy, which outperforms the state-of-the-art methods significantly. The deep learning technique using a convolutional neural network [14] also performs well, but it requires a much larger training dataset together with a portion of synthetic data that are not available to the public.

We also tested our proposed method on the ICDAR 2011 and 2013 word recognition datasets for born-digital images (web and email). It achieves 92% and 94% recognition accuracy, respectively, which clearly outperforms the other methods listed on the competition websites (with the best reported accuracies at 82.9% and 82.21%, respectively). The much higher recognition accuracy further demonstrates the robustness and effectiveness of our proposed word image recognition system.
5 ICDAR 2011: http://www.cvc.uab.es/icdar2011competition/
6 ICDAR 2013: http://dag.cvc.uab.es/icdar2013competition/
Fig. 5. Recognition accuracy of our proposed method on the ICDAR 03, ICDAR 11, and ICDAR 13 datasets for different lexicon sizes (5, 10, 20, and 50 words).
In addition, the correlation between lexicon size and word recognition accuracy of our proposed method is investigated. Fig. 5 shows the experimental results on the ICDAR03, ICDAR11, and ICDAR13 datasets, where four lexicon sizes of 5, 10, 20, and 50 words are tested. As illustrated in Fig. 5, the word recognition performance can be further improved when such prior knowledge is incorporated and the lexicon size is reduced.
We further applied our proposed method to the recent ICDAR 2013 Robust Reading Competition dataset [3], for which 22 algorithms from 13 different research groups were submitted and evaluated under the same criteria. The ICDAR 2013 data is actually a subset of the ICDAR 2011 dataset in which a small number of duplicated images were removed.

The winning PhotoOCR method [8] makes use of a large multi-layer deep neural network and obtains 83% accuracy on the testing dataset. As a comparison, our proposed method achieves 84% recognition accuracy. Note that our proposed method incorporates a lexicon of around 1000 words, generated by including all the ground-truth words of the ICDAR 2013 test dataset, whereas the PhotoOCR method does not use a lexicon. On the other hand, the PhotoOCR method uses a huge amount of training data consisting of more than five million word images, which are not available to the public.

7 http://dag.cvc.uab.es/icdar2013competition
6.3 Discussion
Fig. 6. Examples of falsely recognized word images.
In our proposed system, we assume that the cropped words are more or less horizontal, which is true in most real-world cases. However, the method might fail when texts in scenes are severely curved or suffer from severe perspective distortion, as illustrated in Fig. 6. This partially explains the lower recognition accuracy on the SVT dataset, which contains a certain amount of severely perspectively distorted word images. In addition, the SVT dataset is generated from Google Street View and so consists of a large number of difficult images such as shop names, street names, etc. The proposed technique could also fail when the word image has a low resolution or when inaccurate text bounding boxes are detected, as illustrated in Fig. 6. We will investigate these two issues in our future study.

With a very limited set of training data, our proposed method produces much higher recognition accuracy than the state-of-the-art methods, as illustrated in Table 1. PhotoOCR reports a better word recognition accuracy (90%) on the SVT dataset, but this is largely due to the use of a huge amount of training data that is not available to the public. As a comparison, our proposed method is better in terms of training data size, training time, and computational cost. More importantly, our proposed model is trained using publicly available datasets, which provides a benchmarking baseline for future scene text recognition techniques.
7 Conclusion
Word recognition under unconstrained conditions is a difficult task that has attracted increasing research interest in recent years. Many methods have been reported to address this problem, but there still exists a large gap in computer understanding of texts in natural scenes. This paper presents a novel scene text recognition system that makes use of the HOG feature and the RNN model. Compared with state-of-the-art techniques, our proposed method is able to recognize whole word images without segmentation. It works by integrating two key components. First, it converts a word image into a sequential feature vector and therefore requires no character-level segmentation and recognition. Second, the RNN is introduced and exploited to classify the sequential column feature vectors into words accurately. Experiments on several public datasets show that the proposed technique obtains superior word recognition accuracy. In addition, the proposed technique is trained and tested on several publicly available datasets, which forms a good baseline for future benchmarking of new scene text recognition techniques.
References
1. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: ICDAR 2003 robust reading competitions. In: Document Analysis and Recognition (ICDAR), 2003 International Conference on. (2003) 682-687
2. Shahab, A., Shafait, F., Dengel, A.: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: Proceedings of the 2011 International Conference on Document Analysis and Recognition. ICDAR '11 (2011) 1491-1496
3. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez i Bigorda, L., Robles Mestre, S., Mas, J., Fernandez Mota, D., Almazan Almazan, J., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. (2013) 1484-1493
4. Wang, K., Belongie, S.: Word spotting in the wild. In: Proceedings of the 11th European Conference on Computer Vision: Part I. ECCV'10 (2010) 591-604
5. Mishra, A., Alahari, K., Jawahar, C.V.: An MRF model for binarization of natural scene text. In: Document Analysis and Recognition (ICDAR), 2011 11th International Conference on. (2011) 11-16
6. Kumar, D., Anil Prasad, M.N., Ramakrishnan, A.G.: NESP: Nonlinear enhancement and selection of plane for optimal segmentation and recognition of scene word images. In: Proc. SPIE. Volume 8658. (2013)
7. Zhou, Y., Feild, J., Learned-Miller, E., Wang, R.: Scene text segmentation via inverse rendering. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. (2013) 457-461
8. Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: Reading text in uncontrolled conditions. In: Computer Vision (ICCV), 2013 IEEE International Conference on. (2013)
9. Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S., Zhang, Z.: Scene text recognition using part-based tree-structured character detection. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. (2013) 2961-2968
10. Mishra, A., Alahari, K., Jawahar, C.: Top-down and bottom-up cues for scene text recognition. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. (2012) 2687-2694
11. Novikova, T., Barinova, O., Kohli, P., Lempitsky, V.: Large-lexicon attribute-consistent text recognition in natural images. In: Proceedings of the 12th European Conference on Computer Vision - Volume Part VI. ECCV'12 (2012) 752-765
12. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. In: Neural Networks. Volume 18. (2005) 602-610
13. Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31 (2009) 855-868
14. Wang, T., Wu, D., Coates, A., Ng, A.: End-to-end text recognition with convolutional neural networks. In: Pattern Recognition (ICPR), 2012 21st International Conference on. (2012) 3304-3308
15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Volume 1. (2005) 886-893
16. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Computer Vision (ICCV), 2011 IEEE International Conference on. (2011) 1457-1464
17. Tian, S., Lu, S., Su, B., Tan, C.L.: Scene text recognition using co-occurrence of histogram of oriented gradients. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. (2013) 912-916
18. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. (2010) 2963-2970
19. Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations in natural images. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. (2012) 1083-1090
20. Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. (2012) 3538-3545
21. Phan, T.Q., Shivakumara, P., Tian, S., Tan, C.L.: Recognizing text with perspective distortion in natural scenes. In: Computer Vision (ICCV), 2013 IEEE International Conference on. (2013)
22. Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., Wu, D., Ng, A.: Text detection and character recognition in scene images with unsupervised feature learning. In: Document Analysis and Recognition (ICDAR), 2011 International Conference on. (2011) 440-445
23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9 (1997) 1735-1780
24. Gers, F.A., Schmidhuber, J.A., Cummins, F.A.: Learning to forget: Continual prediction with LSTM. Neural Computation 12 (2000) 2451-2471
25. Zhang, X., Tan, C.: Segmentation-free keyword spotting for handwritten documents based on heat kernel signature. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. (2013) 827-831
26. Graves, A.: RNNLIB: A recurrent neural network library for sequence learning problems. http://sourceforge.net/projects/rnnl/
27. de Campos, T.E., Babu, B.R., Varma, M.: Character recognition in natural images. In: Proceedings of the International Conference on Computer Vision Theory and Applications. (2009)
