
Multifont Arabic Character Recognition

Using Hough Transform and Hidden Markov Models

Nadia Ben Amor
n.benamor@ttnet.tn
National Engineering School of Tunis (ENIT), Tunisia

and

Najoua Essoukri Ben Amara
Najoua.benamara@enim.rnu.tn
National Engineering School of Monastir (ENIM), Tunisia

Laboratory of Systems and Signal Processing (LSTS)

Abstract

Optical Character Recognition (OCR) has been an active subject of research since the early days of computers. Despite the age of the subject, it remains one of the most challenging and exciting areas of research in computer science. In recent years it has grown into a mature discipline, producing a huge body of work.
Arabic has been one of the last major languages to receive attention in character recognition. This is due, in part, to the cursive nature of the task, since even printed Arabic characters are in cursive form.
This paper describes the performance of combining the Hough transform and Hidden Markov Models in a multifont Arabic OCR system. Experimental tests have been carried out on a set of 85,000 samples of characters covering 5 of the fonts most commonly used in Arabic writing. Some promising experimental results are reported.
Keywords: Arabic Optical Character Recognition, Hough Transform, Hidden Markov Models.

1. Introduction

Arabic belongs to the group of Semitic alphabetical scripts in which mainly the consonants are represented in writing, while the marking of vowels (using diacritics) is optional.

This language is spoken by almost 250 million people and is the official language of 19 countries[1]. There are two main types of written Arabic: classical Arabic, the language of the Quran and classical literature, and modern standard Arabic, the universal language of the Arabic-speaking world, which is understood by all Arabic speakers. Each Arabic-speaking country or region also has its own variety of colloquial spoken Arabic.

Due to the cursive nature of the script, there are several characteristics that make the recognition of Arabic distinct from the recognition of Latin scripts or Chinese.
The work we present in this paper belongs to the general field of Arabic document recognition and explores the use of multiple sources of information. In fact, several experiments carried out in our laboratory have proved the importance of the cooperation of different types of information at different levels (feature extraction, classification…) in order to overcome the variability of Arabic and especially of multifont characters[2].

In spite of the different research efforts realised in the field of Arabic OCR (AOCR), we are not yet able to evaluate objectively the performances reached, since the tests have not been carried out on the same database. Thus, the idea is to develop several single and hybrid approaches and to run tests on the same database of multifont Arabic characters so that we can deduce the most suitable combination or method for Arabic character recognition.

In this paper, we present a multifont Arabic Optical Character Recognition system based on the Hough transform for feature selection and Hidden Markov Models for classification[3].
In the next section, the whole OCR system is presented. The different tests carried out and the results obtained so far are developed in the fourth section.

2. Characters Recognition System:

The main process of the AOCR system we developed can be presented by the following figure:

Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (2005) 285
Figure 1. Block diagram of the OCR system (acquisition and preprocessing → feature extraction → character learning / character recognition, using the trained models → recognized characters)

2.1 Pre-Processing:

Pre-processing covers all those functions carried out prior to feature extraction to produce a cleaned-up version of the original image, so that it can be used directly and efficiently by the feature extraction components of the OCR. In our case, the goal of image preprocessing is to generate a simple line-drawing image such as the one in Figure 2, which presents the edge detection of the character 'noun'. Our implementation uses the Canny edge detector [4] for this extraction.
While the extracted edges are generally good, they include many short, incorrect (noise) edges as well as the correct boundaries. Noise edges are removed through a two-step process: first, connected components are extracted from the thresholded edge image, and then the smallest components, those with the fewest edge pixels, are eliminated. After noise removal, the resulting edges are quite clean.

Figure 2. Edge extraction using the Canny edge detector

2.2 Feature Extraction:

Feature extraction is one of the two basic steps of pattern recognition. We quote from Lippmann [5]:
"Features should contain information required to distinguish between classes, be insensitive to irrelevant variability in the input, and also be limited in number to permit efficient computation of discriminant functions and to limit the amount of training data required."
In fact, this step involves measuring those features of the input character that are relevant to classification. After feature extraction, the character is represented by the set of extracted features.
There is an infinite number of potential features that one can extract from a finite 2D pattern. However, only those features that are of possible relevance to classification need to be considered. This entails that during the design stage, the expert focuses on those features which, given a certain classification technique, will produce the most efficient classification results. Obviously, the extraction of suitable features helps the system reach the best recognition rate[6]. In a previous work, we used the wavelet transform to extract features and obtained very promising results[7]. In this paper, we present a Hough Transform based method for feature extraction.

2.2.1 Hough Transform:

The Hough Transform (HT) is known as a popular and powerful technique for finding multiple lines in a binary image, and it has been used in various applications. Though the principle of the Hough Transform is rather simple and seems easy to use, we cannot obtain precise results without paying enough attention to the arrangement of the parameter space used in the HT.
The HT gathers evidence for the parameters of the equation that defines a shape, by mapping image points into the space defined by the parameters of the curve. After gathering evidence, shapes are extracted by finding local maxima in the parameter space (i.e., local peaks). The HT is a robust technique capable of handling significant levels of noise and occlusion.
The Hough technique is particularly useful for computing a global description of a feature (where the number of solution classes need not be known a priori) given local measurements. The motivating idea behind the Hough technique for line detection is that each input measurement (e.g. a coordinate point) indicates its contribution to a globally consistent solution.
The Hough transform is used to identify features of a particular shape within a character image, such as straight lines, curves and circles. When using the HT to detect straight lines, we rely on the fact that a line can be expressed in parametric form by the formula r = x cos θ + y sin θ, where r is the length of a normal from the origin to the line and θ is the orientation of r with respect to the x-axis.
To find all the lines within the character image we need to build up the Hough parameter space H. This is a two-dimensional array of accumulator cells. These cells should be initialised with zero values and will be filled with line lengths for particular values of θ and r. For our study the range of θ is usually from 0° to 180°, although often we only need to consider a subset of these angles, as we are usually only interested in lines that lie in a particular direction.
Without using information from neighbouring pixels (which the Hough transform does not), each black pixel p(x, y) in the input image can possibly lie on a line of any angle. For each black pixel p(x, y) in the image, we take each angle along which we wish to find lines, calculate the value r as defined above and increment the value held in accumulator cell H(r, θ) by 1. The resultant matrix will hold values that indicate the number of pixels that lie on a particular line r = x cos θ + y sin θ. These values do not represent actual lines within the source picture, merely a pixel count of points that lie upon a line of infinite length through the image.
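The accumulator-filling procedure described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function name, the 1° angle step and the offset used to index negative r values are our own choices:

```python
import numpy as np

def hough_accumulator(binary_img, n_angles=180):
    """Fill the Hough parameter space H(r, theta) for a binary image.

    Every black (foreground) pixel p(x, y) votes, at each sampled angle
    theta, for the line r = x*cos(theta) + y*sin(theta) it could lie on.
    """
    rows, cols = binary_img.shape
    # Furthest possible distance from the origin to a line through the image.
    r_max = int(np.ceil(np.hypot(rows, cols)))
    thetas = np.deg2rad(np.arange(n_angles))           # 0..179 degrees, 1 deg step
    # Rows are offset by r_max so that negative r values index valid cells.
    H = np.zeros((2 * r_max + 1, n_angles), dtype=int)

    ys, xs = np.nonzero(binary_img)                    # foreground pixel coords
    for x, y in zip(xs, ys):
        for j, t in enumerate(thetas):
            r = int(round(x * np.cos(t) + y * np.sin(t)))
            H[r + r_max, j] += 1                       # one vote per angle
    return H, r_max
```

Cells with high counts correspond to lines supported by many pixels; line extraction then reduces to finding local maxima in H, as described in Section 2.2.2.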



Lines passing through more pixels will have higher values than lines passing through fewer pixels. A line can be plotted by substituting values for either x and y or r and θ and calculating the corresponding coordinates.

2.2.2 Line Extraction

To extract collinear point sets, one must first extract significant straight lines from the image. These lines correspond to major linear features. The advantage of the Hough transform[8] is the fact that it operates globally on the image rather than locally. The Hough transform works by allowing each edge point in the image to vote for all lines that pass through the point, and then selecting the lines with the most votes. After all edge points are considered, the peaks in the parameter space indicate which lines are supported by the most points from the image.
The first thing to understand about the parameter space for line extraction is that there is no one-to-one relationship between pixels in the image and cells in the parameter space matrix. Rather, each cell in parameter space represents a line that spans the entire image.
The transformation between feature space and parameter space is the following:
- Project a line through each edge pixel at every possible angle (the angles can also be incremented in steps).
- For each line, calculate the minimum distance between the line and the origin.
- Increment the appropriate parameter space accumulator by one.
In the resulting matrix, the x-axis of parameter space ranges from 1 to the square root of the sum of the squares of the numbers of rows and columns of feature space. This number corresponds to the furthest possible minimum distance from the origin to a line passing through the image. The y-axis represents the angle of the line. Obviously the axes could be switched.
The larger the number in any given cell of the accumulator matrix, the larger the likelihood that a line exists at that angle and distance from the origin.

3. Hidden Markov Models Classification:

Hidden Markov models (HMMs) are widely used in many fields where temporal (or spatial) dependencies are present in the data[9].
During the last decade, HMMs, which can be thought of as a generalization of dynamic programming techniques, have become a very interesting approach in pattern recognition. The power of the HMM lies in the fact that the parameters used to model the signal can be well optimized, and this results in lower computational complexity in the decoding procedure as well as improved recognition accuracy. Furthermore, other knowledge sources can also be represented within the same structure, which is one of the important advantages of hidden Markov modeling.

3.1 HMM Topology:

The HMM we retained uses a left-to-right topology, in which each state has a transition to itself and to the next state. The HMM for each character has 4 to 7 states, but we have noticed that 5 is approximately the optimal number of states.
The standard approach is to assume a simple probabilistic model of character production whereby a specified character C produces an observation sequence O with probability P(C, O). The goal is then to decode the character, based on the observation sequence, so that the decoded character has the maximum a posteriori probability.
Given the choices of initial values of the observation and transition matrices, all models are identical at the beginning of the learning.
The number of states varies from 4 to 7, and it is worth mentioning that a state 0 has been added to make the computation easier. Since there is no observation in this state, it has no influence on the models.

3.2 Recognition:

Since the models are labeled by the identity of the characters they represent, the task of recognition is to identify, among a set of L models λk, k = 1, …, L, the one (the character) which gives the best interpretation of the observation sequence to be decoded, i.e.:

Car = arg max P(O|λk), 1 ≤ k ≤ L

4. Experimental Results:

4.1. Test vocabulary:

The different tests have been carried out on isolated Arabic characters.
Due to the absence of a common database for Arabic OCR, we have created our own corpus, which is formed by 85,000 samples in five fonts among the most commonly used in Arabic writing: Arabic Transparent, Badr, Alhada, Diwani, Koufi. The following figure shows some Arabic characters in the five considered fonts we have worked on.

Figure 3. Samples of the shapes of the characters 'Té', 'Dhel', 'Ké', 'Noun' and 'Mim' in the fonts Arabic Transparent, Badr, Diwani, Alhada and Koufi
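The decision rule of Section 3.2 can be illustrated with a short sketch: score the observation sequence under a bank of character models with the forward algorithm (in its rescaled form, to avoid underflow) and keep the best-scoring label. Discrete observation symbols and the (π, A, B) parameterisation are assumptions made for this example; the paper does not detail its observation model:

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | lambda) for a discrete HMM
    lambda = (pi, A, B), where A is the state-transition matrix and
    B[i, k] is the probability of emitting symbol k in state i."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()          # rescale to avoid numerical underflow
        log_p += np.log(scale)
        alpha = alpha / scale
    return log_p + np.log(alpha.sum())

def recognize(obs, models):
    """Decision rule Car = arg max_k P(O | lambda_k): score the sequence
    under every character model and return the best-scoring label."""
    return max(models, key=lambda char: log_likelihood(obs, *models[char]))
```

With left-to-right models such as those retained in Section 3.1, A is upper-triangular (each state transitions only to itself and to the next state), and the comparison across models is meaningful because all models score the same observation sequence.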
4.2 Results:

From this database, we used 80% of the samples as a learning base; the rest were used for the tests.
Table 1 presents the different results obtained from combining Hidden Markov Models for classification and the Hough Transform for feature extraction.
We notice that the best result was achieved by using five states in the HMM; in this case the recognition rate is 96.84%.
Comparing the achieved results with the wavelet/HMM based method we previously developed[10], we can say that the result obtained with the Hough transform is almost the same as the one obtained with the DB3 wavelet transform, which is 96.66%.

Table 1: Recognition rate (%) per character and per number of states in each model

Character   4        5        6        7
ا           97.32    94.56    95.25    94.34
ب           96.95    93.11    94.85    98.69
ت           94.49    97.90    96.81    98.11
ث           99.56    99.78    99.14    93.98
ج           94.42    94.99    95.27    96.95
ح           91.81    92.17    92.65    93.40
خ           96.59    100      94.14    95.79
د           99.35    99.35    98.26    98.91
ذ           97.16    93.55    95.57    96.45
ر           96.88    99.20    96.78    96.23
ز           98.19    93.98    97.43    99.38
س           98.04    98.77    98.61    97.10
ش           92.03    98.84    97.14    95.07
ص           96.37    98.26    96.10    96.81
ض           99.13    99.27    98.54    97.32
ط           95.29    90.94    96.27    98.12
ظ           92.31    93.26    94.61    98.04
ع           98.84    99.06    99.10    99.27
غ           98.33    95.00    94.89    93.55
ف           100      95.21    96.37    97.75
ق           93.26    95.65    93.24    93.33
ك           97.10    96.60    96.41    96.16
ل           97.09    92.96    94.65    96.81
م           95.43    96.23    95.89    95.29
ن           99.27    100      99.10    97.75
ﻩ           99.78    99.13    98.21    97.97
و           96.59    93.62    96.51    93.65
ى           100      99.64    96.42    94.78
Average     96.84    96.47    94.85    96.46

5. Conclusion

A wide variety of techniques are used to perform Arabic character recognition. In this paper we presented one of these techniques, based on the Hough Transform for feature extraction and Hidden Markov Models for classification.
As the results show, designing an appropriate set of features for the classifier is a vital part of the system, and the achieved recognition rate is indebted to the selection of features.
We aim to optimise the feature extraction step by adapting the increment step of θ according to the character's form [11], especially in a multifont context. We also intend to build other hybrid classifiers combining Hidden Markov Models and Artificial Neural Networks in order to take advantage of their different characteristics.

6. References

[1] A. Amin. "Arabic character recognition." In H. Bunke and P. Wang, editors, Handbook of Character Recognition and Document Image Analysis, pages 397–420. World Scientific Publishing Company, 1997.

[2] N. Ben Amor, S. Gazeh, N. Essoukri Ben Amara. "Adaptation d'un système d'identification de fontes à la reconnaissance des caractères arabes multi-fontes." Quatrièmes Journées des Jeunes Chercheurs en Génie Électrique et Informatique, GEI'2004, Monastir, Tunisia, 2004.

[3] N. Ben Amara, A. Belaïd, N. Ellouze. "Utilisation des modèles markoviens en reconnaissance de l'écriture arabe : état de l'art." CIFED 2000.

[4] J. F. Canny. "A Computational Approach to Edge Detection." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 679–698, 1986.

[5] R. Lippmann. "Pattern Classification using Neural Networks." IEEE Communications Magazine, p. 48, November 1989.

[6] E. W. Brown. "Character Recognition by Feature Point Extraction." Northeastern University internal paper, 1992.

[7] N. Ben Amor, N. Essoukri Ben Amara. "Applying Neural Networks and Wavelet Transform to Multifont Arabic Character Recognition." International Conference on Computing, Communications and Control Technologies (CCCT 2004), Austin, Texas, USA, August 14–17, 2004.

[8] J. Illingworth and J. Kittler. "A Survey of the Hough Transform." Computer Vision, Graphics and Image Processing, vol. 44, pp. 87–116, 1988.

[9] R.-D. Bippus and M. Lehning. "Cursive script recognition using Semi-Continuous Hidden Markov Models in combination with simple features." In European Workshop on Handwriting Analysis and Recognition, Brussels, July 1994.

[10] N. Ben Amor, N. Essoukri Ben Amara. "Hidden Markov Models and Wavelet Transform in Multifont Arabic Characters Recognition." Accepted at the International Conference on Computing, Communications and Control
