arXiv:1801.01315v1 [cs.CV] 4 Jan 2018
Figure 2: Architecture of PixelLink. A CNN model is trained to perform two kinds of pixel-wise predictions: text/non-text
prediction and link prediction. After being thresholded, positive pixels are joined together by positive links, achieving instance
segmentation. minAreaRect is then applied to extract bounding boxes directly from the segmentation result. Noise predictions
can be efficiently removed using post-filtering. An input sample is shown for better illustration. The eight heat-maps in the
dashed box stand for the link predictions in eight directions. Although some words are difficult to separate in text/non-text
prediction, they are separable through link predictions.
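The pipeline in the caption, thresholded positive pixels joined through positive links, one bounding box per connected component, then geometric post-filtering, can be sketched in a few lines. This is a minimal pure-Python sketch under our own naming; an axis-aligned box stands in for OpenCV's minAreaRect so the example has no dependencies.

```python
def decode(pixel_pos, link_pos):
    """Group positive pixels into text instances via positive links.

    pixel_pos: H x W nested lists of 0/1 (thresholded text/non-text map).
    link_pos:  H x W nested lists of 8-element 0/1 lists, one entry per
               neighbor direction, ordered as in OFFSETS below.
    Returns one (x0, y0, x1, y1) box per connected component.
    """
    H, W = len(pixel_pos), len(pixel_pos[0])
    OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    # Every positive pixel starts as its own union-find node.
    parent = {(y, x): (y, x)
              for y in range(H) for x in range(W) if pixel_pos[y][x]}

    def find(a):  # root lookup with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Join positive pixels through positive links (both endpoints positive).
    for (y, x) in list(parent):
        for d, (dy, dx) in enumerate(OFFSETS):
            n = (y + dy, x + dx)
            if link_pos[y][x][d] and n in parent:
                ra, rb = find((y, x)), find(n)
                if ra != rb:
                    parent[ra] = rb

    # One axis-aligned bounding box per instance (connected component).
    comps = {}
    for p in parent:
        comps.setdefault(find(p), []).append(p)
    return [(min(x for _, x in pts), min(y for y, _ in pts),
             max(x for _, x in pts), max(y for y, _ in pts))
            for pts in comps.values()]


def post_filter(boxes, min_side=10, min_area=300):
    """Drop noise detections by simple geometry; the defaults are the
    IC15 thresholds quoted in the paper (99th training-set percentiles)."""
    kept = []
    for (x0, y0, x1, y1) in boxes:
        w, h = x1 - x0 + 1, y1 - y0 + 1
        if min(w, h) >= min_side and w * h >= min_area:
            kept.append((x0, y0, x1, y1))
    return kept
```

In the actual pipeline, cv2.minAreaRect would be applied to each component's pixel coordinates to obtain an oriented rectangle rather than the axis-aligned box used here.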
IC15 (Karatzas et al. 2015), and COCO-Text (Veit et al. 2016). Therefore, bounding boxes of CCs are then extracted through methods like minAreaRect in OpenCV (Itseez 2014). The output of minAreaRect is an oriented rectangle, which can be easily converted into quadrangles for IC15, or rectangles for IC13. It is worth mentioning that in PixelLink, there is no restriction on the orientation of scene text.

This step leads to the key difference between PixelLink and regression-based methods, i.e., bounding boxes are obtained directly from instance segmentation rather than from location regression.

3.4 Post Filtering after Segmentation

Since PixelLink attempts to group pixels together via links, some noise predictions are inevitable, so a post-filtering step is necessary. A straightforward yet efficient solution is to filter by simple geometric features of the detected boxes, e.g., width, height, area and aspect ratio. For example, in the IC15 experiments in Sec. 5.3, a detected box is abandoned if its shorter side is less than 10 pixels or if its area is smaller than 300. The values 10 and 300 are statistics of the IC15 training data: for each chosen filtering criterion, the corresponding 99th percentile calculated on the TRAINING set is taken as the threshold. For example, 10 is chosen as the threshold on shorter side length because about 99% of text instances in IC15-train have a shorter side ≥ 10 pixels.

[Figure: network architecture diagram residue. Recoverable labels: VGG16 conv stages 1-4 with pooling layers; 1x1 conv prediction layers (2 output channels for text/non-text prediction, 16 for link prediction); upsample-and-fuse connections between stages.]

4 Optimization

4.1 Ground Truth Calculation

Following the formulation in TextBlocks (Zhang et al. 2016), pixels inside text bounding boxes are labeled as positive; if overlapping exists, only un-overlapped pixels are positive, and all other pixels are negative.

For a given pixel and one of its eight neighbors, the link between them is positive if they belong to the same instance, and negative otherwise.

Note that ground truth calculation is carried out on input images resized to the shape of the prediction layer, i.e., conv3_3 for 4s and conv2_2 for 2s.

4.2 Loss Function

The training loss is a weighted sum of the loss on pixels and the loss on links:

L = λ L_pixel + L_link. (1)

Since L_link is calculated on positive pixels only, the classification task on pixels is more important than that on links, and λ is set to 2.0 in all experiments.

Loss on Pixels  Sizes of text instances might vary a lot. For example, in Fig. 1, the area of 'Manchester' is greater than the sum of all the other words. If all positive pixels were weighted equally when calculating loss, instances with small areas would be treated unfairly. Instead, an Instance-Balanced Cross-Entropy Loss gives every one of the N instances in an image the same weight B_i:

B_i = S / N, S = Σ_i S_i, ∀i ∈ {1, ..., N}, (2)

where S_i is the area of the i-th instance, so each positive pixel within it has weight w_i = B_i / S_i. Online Hard Example Mining (Shrivastava, Gupta, and Girshick 2016) is applied to select negative pixels: with the negative-positive ratio r set to 3, the rS negative pixels with the highest losses are kept. Together these two mechanisms yield a weight matrix, denoted W, over all positive pixels and the selected negative ones. The loss on the pixel classification task is:

L_pixel = (1 / ((1 + r) S)) W L_pixel_CE, (3)

where L_pixel_CE is the matrix of Cross-Entropy loss on text/non-text prediction. As a result, pixels in small instances have a higher weight, and pixels in large instances have a smaller weight, while every instance contributes equally to the loss.

Loss on Links  Losses for positive and negative links are calculated separately, and on positive pixels only:

L_link_pos = W_pos_link L_link_CE,
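One plausible construction of the weight matrix W and the normalizer (1 + r)S in the pixel loss above is sketched below, assuming each instance receives the same total weight and the rS negative pixels with the highest losses are kept, in the OHEM style of Shrivastava, Gupta, and Girshick (2016). The function name and the flat-list layout are our own.

```python
def pixel_weights(labels, ce_losses, r=3):
    """Instance-balanced weights plus hardest-negative selection.

    labels:    flat list, 0 for background, 1..N for text instances.
    ce_losses: per-pixel cross-entropy losses; only the entries of
               negative pixels are used (to rank them).
    Returns the flat weight list W and the loss normalizer (1 + r) * S.
    """
    N = max(labels)
    assert N > 0, "expects at least one text instance"
    areas = {i: labels.count(i) for i in range(1, N + 1)}
    S = sum(areas.values())      # total positive area
    B = S / N                    # same total weight for every instance
    W = [0.0] * len(labels)
    for p, lab in enumerate(labels):
        if lab > 0:
            W[p] = B / areas[lab]            # w_i = B_i / S_i
    # Keep the r*S negative pixels with the highest losses.
    hardest = sorted((p for p, lab in enumerate(labels) if lab == 0),
                     key=lambda p: -ce_losses[p])[:r * S]
    for p in hardest:
        W[p] = 1.0
    return W, (1 + r) * S
```

Dividing the weighted sum of cross-entropy losses by the returned normalizer makes small instances count as much as large ones, which matches the stated behavior that every instance contributes equally to the loss.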
5.4 Detecting Long Text in TD500

Since lines are detected in TD500, the final model on IC15 is not used for fine-tuning. Instead, the model is pretrained on IC15-train for about 15K iterations and fine-tuned on TD500-train + HUST-TR400 (Yao, Bai, and Liu 2014) for about 25K iterations. Images are resized to 768 × 768 for testing. Thresholds on pixel and link are set to (0.8, 0.7). The minimum shorter side in post-filtering is 15 and the minimum area 600, the corresponding 99th percentiles of the training set.

Results in Tab. 2 show that among all VGG16-based models, EAST behaves the worst because of its high demand for large receptive fields. Neither SegLink nor PixelLink needs a deeper network to detect long texts, owing to their smaller demands on receptive fields.

5.5 Detecting Horizontal Text in IC13

The final models for IC15 are fine-tuned on IC13-train, TD500-train and TR400 for about 10K iterations. In single-scale testing, all images are resized to 512 × 512. Multi-scale testing uses (384, 384), (512, 512), (768, 384), (384, 768), (768, 768), and a maximum longer side of 1600. Thresholds on pixel and link are (0.7, 0.5) for single-scale testing, and (0.6, 0.5) for multi-scale testing. The 99th percentiles used for post-filtering are 10 and 300, for shorter side and area respectively.

Different from regression-based methods like EAST and TextBoxes, PixelLink has no direct confidence output on each detected bounding box, so its multi-scale testing scheme is specially designed. Specifically, prediction maps of different scales are uniformly resized to the largest height and the largest width among all maps and then fused by taking the average. The remaining steps are identical to single-scale testing.

Results in Tab. 3 show that multi-scale testing leads to an improvement of about 4 to 5 points in F-score, similar to the observation in TextBoxes.

6 Analysis and Discussion

6.1 The Advantages of PixelLink

A further analysis of experiment results on IC15 shows that PixelLink, as a segmentation-based method, has several advantages over regression-based methods. As listed in Tab. 4, among all methods using VGGNet, PixelLink can be trained much faster with less data, and performs much better than the others. Specifically, after only about 25K iterations of training (less than half of those needed by EAST or SegLink), PixelLink achieves a performance on par with SegLink or EAST. Keep in mind that PixelLink is trained from scratch, while the others need to be fine-tuned from an ImageNet-pretrained model. When also trained from scratch on IC15-train, SegLink can only achieve an F-score of 67.8¹.

This raises the question: why can PixelLink achieve better performance with many fewer training iterations and less training data? We humans are good at learning how to solve problems. The easier a problem is, the faster we learn it, the less teaching material we need, and the better we perform. The same may hold for deep-learning models, so two factors may contribute.

Requirement on receptive fields  When both adopt VGG16 as the backbone, SegLink behaves much better than EAST in long text detection, as shown in Tab. 2. This gap should be caused by their different requirements on receptive fields. Prediction neurons of EAST are trained to observe the whole image in order to predict texts of any length. SegLink, although also regression based, only tries to predict segments of text, resulting in a lower requirement on receptive fields than EAST.

Difficulty of tasks  In regression-based methods, bounding boxes are formulated as quadrangles or rotated rectangles. Every prediction neuron has to learn to predict their locations as precise numerical values, i.e., coordinates of four vertices, or center point, width, height and rotation angle. Such predictions are possible and effective, as has long been proven by algorithms like Faster R-CNN, SSD and YOLO. However, they are far from intuitive and simple; neurons have to study a lot, and study hard, to be competent for their assignments.

When it comes to PixelLink, neurons on the prediction

¹ This experiment is repeated 3 times using the open-source code of SegLink. Its best performance is 63.6, 72.7 and 67.8 for Recall, Precision and F-score respectively.
Table 4: Comparison of training speed and training data for IC15. SynthText (Gupta, Vedaldi, and Zisserman 2016) is a synthetic
dataset containing more than 0.8M images. All listed results for comparison are quoted from the corresponding original papers.
Method ImageNet Pretrain Optimizer Training Data Iterations F-score
PixelLink+VGG16 4s at Iter-25K   No    SGD    IC15-train               ≈25K    79.7
PixelLink+VGG16 4s at Iter-40K   No    SGD    IC15-train               ≈40K    82.3
SegLink+VGG16                    Yes   SGD    SynthText, IC15-train    ≈100K   75.0
EAST+VGG16                       Yes   ADAM   IC15-train, IC13-train   >55K    76.4
layer only have to observe the status of the pixel itself and its neighboring pixels on the feature map. In other words, PixelLink has the lowest requirement on receptive fields and the easiest task to learn among the listed methods.

A useful guide for the design of deep models can be drawn from the assumption above: simplify the task for deep models if possible, because it might make them train faster and be less data-hungry.

The success in training PixelLink from scratch with a very limited amount of data also indicates that text detection is much simpler than general object detection. Text detection may rely more on low-level texture features and less on high-level semantic features.

6.2 Model Analysis

As shown in Tab. 5, ablation experiments have been done to analyze PixelLink models. Although PixelLink+VGG16 2s models have better performance, they are much slower and less practical than the 4s models; therefore the best 4s model (Exp. 1) is used as the basic setting for comparison.

Table 5: Investigation of PixelLink with different settings on IC15. 'R': Recall, 'P': Precision, 'F': F-score.
#  Configurations                R     P     F
1  The best 4s model             81.7  82.9  82.3
2  Without link mechanism        58.0  71.4  64.0
3  Without Instance-Balance      80.2  82.3  81.2
4  Training on 384x384           79.6  81.2  80.4
5  No post-filtering             82.3  52.7  64.3
6  Predicting at 2s resolution   82.0  85.5  83.7

Link is very important.  In Exp. 2, link is disabled by setting the threshold on link to zero, resulting in a huge drop in both recall and precision. The link design is important because it converts semantic segmentation into instance segmentation, which is indispensable for separating nearby texts in PixelLink.

Instance-Balance contributes to a better model.  In Exp. 3, Instance-Balance (IB) is not used, and the weights of all positive pixels are set equal during loss calculation. Even without IB, PixelLink achieves an F-score of 81.2, outperforming the state of the art. When IB is used, a slightly better model is obtained. When the experiments are continued on IC13, the performance gap is more obvious (shown in Tab. 6).

Table 6: Effect of Instance-Balance on IC13. The only difference between the two settings is the use of Instance-Balance. Testing is done without multi-scale.
IB   Recall  Precision  F-score
Yes  82.3    84.4       83.3
No   79.4    84.2       81.7

Training image size matters.  In Exp. 4, images are resized to 384 × 384 for training, resulting in an obvious decline in both recall and precision. This phenomenon is in accordance with SSD.

Post-filtering is essential.  In Exp. 5, post-filtering is removed, leading to a slight improvement in recall but a significant drop in precision.

Predicting at higher resolution is more accurate but slower.  In Exp. 6, the predictions are conducted on conv2_2, and the performance is improved w.r.t. both recall and precision, however at a cost in speed. As shown in Tab. 1, the speed of 2s is less than half that of 4s, demonstrating a tradeoff between performance and speed.

7 Conclusion and Future Work

PixelLink, a novel text detection algorithm, is proposed in this paper. The detection task is achieved through instance segmentation, by linking pixels within the same text instance together. Bounding boxes of detected text are extracted directly from the segmentation result, without performing location regression. Since smaller receptive fields are required and easier tasks are to be learned, PixelLink can be trained from scratch with less data in fewer iterations, while achieving performance on par with or better than state-of-the-art methods based on location regression on several benchmarks.

VGG16 is chosen as the backbone for convenient comparison in this paper. Some other deep models will be investigated for better performance and higher speed. Different from current prevalent instance segmentation methods (Li et al. 2016) (He et al. 2017a), PixelLink does not make its segmentation rely on detection results. Applications of PixelLink will also be explored on other tasks that require instance segmentation.

Acknowledgement

This work was supported by National Natural Science Foundation of China under Grant 61379071. Special thanks to Dr.
Yanwu Xu in CVTE Research for all his kindness and great help to us.

References

Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 379–387.

Donoser, M., and Bischof, H. 2006. Efficient maximally stable extremal region (MSER) tracking. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference, volume 1, 553–560. IEEE.

Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.

Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2315–2324.

He, T.; Huang, W.; Qiao, Y.; and Yao, J. 2016. Accurate text localization in natural image with cascaded convolutional text network. arXiv preprint arXiv:1603.09423.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017a. Mask R-CNN. arXiv preprint arXiv:1703.06870.

He, W.; Zhang, X.-Y.; Yin, F.; and Liu, C.-L. 2017b. Deep direct regression for multi-oriented scene text detection. arXiv preprint arXiv:1703.08289.

Itseez. 2014. The OpenCV Reference Manual, 2.4.9.1 edition.

Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L. G.; Mestre, S. R.; Mas, J.; Mota, D. F.; Almazan, J. A.; and de las Heras, L. P. 2013. ICDAR 2013 robust reading competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference, 1484–1493. IEEE.

Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V. R.; Lu, S.; et al. 2015. ICDAR 2015 competition on robust reading. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference, 1156–1160. IEEE.

Kim, K.-H.; Hong, S.; Roh, B.; Cheon, Y.; and Park, M. 2016. PVANET: Deep but lightweight neural networks for real-time object detection. arXiv preprint arXiv:1608.08021.

Li, Y.; Qi, H.; Dai, J.; Ji, X.; and Wei, Y. 2016. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709.

Liao, M.; Shi, B.; Bai, X.; Wang, X.; and Liu, W. 2017. TextBoxes: A fast text detector with a single deep neural network. In AAAI, 4161–4167.

Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision, 21–37. Springer.

Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.

Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; and Xue, X. 2017. Arbitrary-oriented scene text detection via rotation proposals. arXiv preprint arXiv:1703.01086.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.

Shi, B.; Bai, X.; and Belongie, S. 2017. Detecting oriented text in natural images by linking segments. arXiv preprint arXiv:1703.06520.

Shrivastava, A.; Gupta, A.; and Girshick, R. 2016. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 761–769.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Tian, Z.; Huang, W.; He, T.; He, P.; and Qiao, Y. 2016. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, 56–72. Springer.

Veit, A.; Matera, T.; Neumann, L.; Matas, J.; and Belongie, S. 2016. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140.

Yao, C.; Bai, X.; and Liu, W. 2014. A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing 23(11):4737–4749.

Yao, C.; Bai, X.; Liu, W.; Ma, Y.; and Tu, Z. 2012. Detecting texts of arbitrary orientations in natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 1083–1090. IEEE.

Yao, C.; Bai, X.; Sang, N.; Zhou, X.; Zhou, S.; and Cao, Z. 2016. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002.

Ye, Q., and Doermann, D. 2015. Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(7):1480–1500.

Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; and Bai, X. 2016. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4159–4167.

Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; and Liang, J. 2017. EAST: An efficient and accurate scene text detector. arXiv preprint arXiv:1704.03155.