
CIMSA 2004 - IEEE International Conference on

Computational Intelligence for Measurement Systems and Applications


Boston, MA, USA, 14-16 July 2004

Segmentation of Connected Arabic Characters Using Hidden Markov Models


Alaa M. Gouda¹ and M. A. Rashwan²
¹Electrical and Computer Engineering Department, Faculty of Engineering,
King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
²Electronics and Communications Department, Faculty of Engineering,
Cairo University, Egypt

Abstract - Because the Arabic text is connected by nature, segmentation of Arabic text into characters is a very important task for building an Arabic OCR. Although a lot of work has been done in this area, no perfect segmentation technique has been found until now. In this paper, discrete Hidden Markov Models are used for segmentation of Arabic words into letters. The results are very encouraging: a system has been built and used for testing the proposed algorithm, and the segmentation results achieved 99%.

Keywords - OCR; Arabic; Character Segmentation; Cursive Script

I. INTRODUCTION

When building an OCR for recognizing Latin text, a segmentation procedure is needed only for systems that handle cursive script. But when building an Arabic OCR, the problem of segmenting the connected word must be addressed, because Arabic text is connected by nature in both typewritten and handwritten forms. Segmentation is a critical step because incorrectly segmented characters are not likely to be correctly recognized. Solving the character segmentation problem is one of the keys to putting character recognition technology to practical use. Several algorithms have been proposed for segmentation of Latin cursive script [14]. However, the problem of segmentation of Arabic cursive script has not received as much attention.

Some of the early work was done by F. H. Hassan [12], who used the pitch segmentation approach to segment typewritten Arabic cursive script. The main disadvantage of this approach is that an error in the segmentation of one character is likely to cause incorrect segmentation of the following characters in the same word or piece of word. H. Almuallim [5] implemented a segmentation algorithm based on extraction of character strokes. The disadvantage of this system is its dependence upon the writing font, since some characters have strokes that differ from font to font. Cheung et al. [7] proposed a recognition-based segmentation algorithm in which the word bitmap is processed sequentially in a step-by-step mode; at each step the character is checked for recognition against a prespecified feature space, and the character is isolated after being recognized. This approach has the disadvantage of heavy computation and accordingly low recognition speed. Segmentation based on the vertical histogram was introduced by Abdelazim [1] and Nishida [16]. Using vertical histograms, the word is segmented into many primitives depending on the baseline; after these primitives are recognized, a reconstruction algorithm is used to rebuild the characters from their primitives. The main disadvantage of this method is the difficulty of finding the baseline of the Arabic text, especially for handwritten forms. A further disadvantage of this algorithm is the dependence of some primitives on the font and writing style used. Considerable work has also been done on contour-following-based segmentation [11]. A sophisticated technique for segmentation of machine-printed characters, based on neural networks, which determines the location of break points on the closed contour of the word, was presented by Abdul-Mageed [2]. Some researchers avoided the segmentation problem altogether by recognizing complete words without segmentation [4], [8], and [10]. The disadvantages of these algorithms are the huge amount of training text required when handling a large-vocabulary lexicon, and their inability to deal with the different derivatives of the same word unless these are included in the lexicon.

In this paper we use discrete Hidden Markov Models for the task of segmenting Arabic text into characters; any recognition algorithm may then easily be used for recognizing the segmented characters. In the following section, an introduction to Hidden Markov Models is presented. The preprocessing required for Arabic text is described in Section III, a description of the proposed system is presented in Section IV, the experimental results are presented in Section V, and the conclusion and future work are presented in Section VI, with the references listed at the end of the paper.

II. HIDDEN MARKOV MODELS

Hidden Markov models (HMMs) are a method of modeling systems with discrete, time-dependent behavior characterized by common, short-time "processes" and transitions between them. An HMM can be thought of as a finite state machine where the transitions between the states are dependent upon the occurrence of some "symbol". Associated with each state transition is an output probability distribution, which describes the probability with which a

symbol will occur during the transition, and a transition probability indicating the likelihood of this transition.

The HMM describes a stochastic process which produces a sequence of observed events or symbols. It is called a "hidden" Markov model because there is an underlying stochastic process that is not observable but affects the observed sequence of events. The sequence of observations can be described using the notation O_1^T = O_1, O_2, ..., O_t, ..., O_T, where the process has been observed for T discrete time steps from t=1 to t=T. In a discrete HMM, the observations may be any one of K symbols v_k. These symbols are usually vector quantization (VQ) codebook entries computed at regular intervals.

The key parameters to be determined in an HMM-based system are the number of states per pattern, N_s, and the state transition and symbol output probability matrices. A sufficient amount of training data (per pattern) is needed to obtain acceptable estimates of these probability density functions, which capture the statistics of the training population.

Each HMM consists of a number of states. When the model is in state s_i, the process may be measured and one of the symbols v_k may be produced, according to an observation probability distribution b_i(v_k). At each time step t, the model undergoes a transition to a new state s_j according to a transition probability distribution given by a_ij. These transitions may take place between any two states, including self-transitions, as long as the transition probability distribution is non-zero. The HMM is therefore a Markov process, because the probability of being in a particular state at time t+1, given the state sequence prior to time t, depends only on the state at time t. In our system, we used discrete left-to-right HMMs, where the state transition at any time step t is restricted to the same state s_i or the following state s_{i+1}, as shown in Fig. 1.

Fig. 1. Left-to-right Hidden Markov Model.

A particular model may be characterized by the number of states N and the three probability densities Π = {π_i}, A = {a_ij}, and B = {b_i(v_k)}, where π_i is the initial probability distribution across states. Pattern recognition with an HMM is equivalent to selecting the single model M* from the set {M^1, M^2, ..., M^V} which maximizes the probability of the observation sequence O_1^T:

    M* = argmax_i Pr(O | M^i)

Rabiner and Juang [18] have noted that either the forward-backward procedure or the Viterbi algorithm may be used to determine Pr(O | M). The summation in the forward-backward procedure is usually dominated by the single term which is selected by the maximization step in the Viterbi algorithm. As a result, the scaling procedure may be avoided, since the Viterbi algorithm can be modified to compute log(Pr(O | M)) directly. In this case the initial (log) probability is given by:

    δ_i(1) = log(π_i b_i(O_1))

and the recursion becomes:

    δ_j(t) = max_{1≤i≤N} [δ_i(t-1) + log(a_ij)] + log(b_j(O_t))

The log probability computed with the Viterbi algorithm for each word is then:

    log(Pr(O | M)) = max_{1≤i≤N} [δ_i(T)]

The use of log values permits the computation of word probabilities to be additive rather than multiplicative, which avoids computer "overflow" and speeds up the recognition process considerably. In general, the HMM computational complexity is O(N²) in the number of states searched by the Viterbi algorithm [15].

Hidden Markov Models have been used successfully for speech recognition [3], [6], [13], [17], [18], [19], and [20].

To use Hidden Markov Models for handling Arabic text, the 2-D problem of the text image must be converted into a 1-D problem. This is implemented by using a sliding window that scans each line of text from right to left, extracts features of the window, and applies them to the Hidden Markov Models, as shown in Fig. 2. The window used in our system has a width of one pixel.

To build a system that is able to handle an Arabic font with M different characters (or ligatures), we construct M different hidden Markov models, each of which represents one of these characters (ligatures). The number of states per model is selected empirically to maximize the system performance.

Fig. 2. Sliding Window over the Bitmap of an Arabic Text.

III. PREPROCESSING ARABIC TEXT

The Arabic character set contains basically 28 characters, but each character has two or more shapes depending upon its position in the word (start, middle, end, or isolated). For example, character "Baa" has four shapes, while character "Raa" has only two shapes, as shown in Table 1. The existence of secondary strokes like Fatha, Kasra, Dhamma, Hamza, Madda, and Tanweens adds many shapes for each character. It is clear that these forms are different in shape and have to be modeled using different models. This increases the required number of models and decreases the system performance.

Table 1. Examples for Character Positions in Arabic Text (columns: Character, Isolated, End, Middle, Start; the Arabic glyphs are not reproduced here).

To solve this problem, we added a preprocessing stage to remove all secondary strokes from the input images, which allows us to model all the different forms of the same character by a single model. This does not affect the main task, since our goal is segmentation and not recognition. For example, character Alef and character Noon in their isolated forms have the shapes found in Table 2, and the rightmost column shows both characters after removing the secondary strokes.

Table 2. Effect of Secondary Strokes on Character Shapes.

Fig. 3. Preprocessing of Input Text.

Furthermore, there are some characters that are identical in the shape of the main stroke but differ in the number or the position of their secondary strokes. By removing the secondary strokes in the preprocessing phase, we are able to use a single model to represent these different characters, and again this increases the system performance without affecting the main task, which is segmentation. For example, characters Baa, Taa, Thaa, and Yaa have the same main stroke, as shown in Table 3, where the common main stroke appears in the rightmost column.

Table 3. Different Characters of the Same Main Stroke (Yaa, Thaa, Taa, Baa; the Arabic glyphs are not reproduced here).

Table 4. Examples for Arabic Ligatures (Laam-Alef, Baa-Meem, Laam-Meem-Haa, in Traditional and Simplified Arabic; the glyphs are not reproduced here).
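Section II describes converting the 2-D text image into a 1-D observation sequence with a one-pixel-wide window scanning the line from right to left. A minimal sketch of that conversion over a preprocessed (binarized, secondary-stroke-free) line is shown below. The per-column feature and the nearest-neighbour quantizer here are illustrative stand-ins: the paper's actual features are the invariant moments of [8], and its codebook construction is not detailed.

```python
import numpy as np

def column_observations(line_bitmap, codebook):
    """Scan a binarized text line right-to-left with a 1-pixel-wide window
    and map each column's feature vector to the index of its nearest
    codebook entry, yielding the discrete observation sequence for the HMMs.

    line_bitmap : (H, W) array of 0/1 pixels for one text line
    codebook    : (K, H) array of VQ codebook vectors
    """
    obs = []
    for x in range(line_bitmap.shape[1] - 1, -1, -1):  # right to left
        col = line_bitmap[:, x].astype(float)          # 1-pixel-wide window
        # Vector quantization: index of the nearest codebook entry
        obs.append(int(np.argmin(np.linalg.norm(codebook - col, axis=1))))
    return obs
```

The resulting list of codebook indices is exactly the O_1^T sequence consumed by the Viterbi recursion of Section II.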

IV. SYSTEM ARCHITECTURE

Our proposed system is divided into two subsystems: the trainer subsystem, which is used to initialize and train the Hidden Markov Models using many scanned bitmaps with their corresponding text files, and the segmentor subsystem, which uses the Viterbi algorithm to segment input bitmaps into output characters (or ligatures). The block diagram of the trainer subsystem is shown in Fig. 4.

Fig. 4. Trainer Subsystem.

To build a system that is able to handle an Arabic font with M different ligatures, we use M different Hidden Markov Models, each of which represents one of the Arabic ligatures. The number of states per model is selected empirically to maximize the system performance. All models are initialized using a single bitmap that contains the complete set of ligatures of the Arabic font used.

The word-segmentor module is used to segment the page into lines, and then to segment each line into words or sub-words, starting from the upper right corner of the page.

The feature extractor module is used to extract features of the sliding window, which slides through the body of the Arabic word from right to left, and provides a series of feature vectors to the feature quantizer module. Many sets of features were tested with our proposed system, but the invariant moments described by El-Khaly et al. [8] provided the best performance.

The vector quantizer module provides a series of observations to the labeler module, using the input train of feature vectors and a codebook constructed from many images of the training set of documents representing the Arabic font.

The labeler module provides a series of observations, with their associated sequence of patterns, to the HMM trainer. The HMM trainer uses these inputs to adapt the values of the transition probability matrices and the observation probability matrices of the models of these patterns, in order to maximize the probability of the observation sequences of these patterns, as described in the HMM section.

The segmentor subsystem uses the trained HMMs to segment unseen test documents of the same font used for training the hidden Markov models. The block diagram of the segmentor subsystem is shown in Fig. 5.

Fig. 5. Segmentor Subsystem.

The word segmentor module segments the scanned bitmap into words, while the feature extractor and vector quantizer modules generate a series of observations representing the segmented words, as described for the trainer subsystem. The Viterbi segmentor generates the series of patterns that maximizes the probability of the observation sequence. The time steps at which a transition occurs from the last state of one HMM to the first state of another HMM are used as the output segmentation borders between the characters of the given word.

V. EXPERIMENTAL RESULTS

The proposed system was built using the C++ language and used for training on both simplified and traditional Arabic fonts. Twenty-five Arabic text pages were printed using a laser printer and scanned using a monochrome scanner at 300 DPI resolution. Twenty-one pages of this set were used for training the system and four pages were used for testing.

The segmentation results were greater than 99% for both fonts. A sample of the output segmentation results is shown in Fig. 6.

Fig. 6. Segmentation Results.

VI. CONCLUSION AND FUTURE WORK

In this paper, we used HMM techniques for segmenting connected Arabic text, and tested the system using the simplified and traditional Arabic fonts of the Microsoft Word application. The system provided a very high accuracy (99% for both fonts).

In the future, the current system will be modified to be used for segmenting connected handwritten Arabic text. Another set of features is to be used for this task. In addition, a language model is to be added to increase the segmentation accuracy.
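As an illustration of the border-extraction rule of Section IV (a cut is made wherever the best path leaves the last state of one model and enters the first state of the next), the following sketch assumes a hypothetical encoding of the Viterbi state path as (model, state) pairs; the paper does not specify its data structures.

```python
def segmentation_borders(state_path):
    """Given the Viterbi best path over concatenated left-to-right HMMs,
    return the time steps at which a transition occurs into the first
    state of a new model. Because the models are left-to-right, such a
    transition can only come from the last state of the previous model,
    so these time steps are the segmentation borders between characters.

    state_path : list of (model_id, state_index) pairs, one per time step
    """
    borders = []
    for t in range(1, len(state_path)):
        prev_model, _ = state_path[t - 1]
        model, state = state_path[t]
        if model != prev_model and state == 0:  # entered a new model's first state
            borders.append(t)
    return borders
```

For a path that spends two steps in a "Baa" model, three in a "Taa" model, and one in an "Alef" model, the cuts fall at the "Taa" and "Alef" entry times.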

REFERENCES

[1] Abdelazim, H.Y., and Hashish, M.A., "Arabic Reading Machine", Proc. of the 10th Nat. Computer Conference, Scientific Publishing Center, Jeddah, pp. 733-744, 1988.
[2] Abdul-Mageed, A., "A Novel Approach for a Trainable Arabic/Latin Text Reader", M.Sc. Thesis, Cairo University, 1994.
[3] Afify, M. A., "Large Vocabulary Continuous Arabic Speech Recognition", Ph.D. Thesis, Faculty of Engineering, Cairo University, 1995.
[4] Al-Badr, B., and Haralick, R.M., "A Segmentation-Free Approach to Text Recognition with Application to Arabic Text", International Journal on Document Analysis and Recognition, IJDAR 1(3), pp. 147-166, 1998.
[5] AlMuallim, H., and Yamaguchi, S., "A Method of Recognition of Arabic Cursive Handwriting", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 9, No. 5, pp. 715-722, Sept. 1987.
[6] Brugnara, F., Falavigna, D., and Omologo, M., "Automatic Segmentation and Labeling of Speech Based on Hidden Markov Models", Speech Communication 12, North Holland, 1993.
[7] Cheung, A., Bennamoun, M., and Bergmann, N.W., "An Arabic Optical Character Recognition System Using Recognition-Based Segmentation", Pattern Recognition, Vol. 34, pp. 215-233, 2001.
[8] El-Khaly, F., and Sid-Ahmed, M., "Machine Recognition of Optically Captured Machine Printed Arabic Text", Pattern Recognition, Vol. 23, No. 11, pp. 1207-1214, 1990.
[9] Erlandson, E.J., Trenkle, J.M., and Vogt, R.C., "Word-Level Recognition of Multifont Arabic Text Using a Feature-Vector Matching Approach", Proceedings of the SPIE, Vol. 2660-08, San Jose, 1996.
[10] Gader, P.D., Forester, B.D., Gillies, A.M., Ganzberger, M.J., Vogt, R.C., and Trenkle, J.M., "A Segmentation-Free Neural Network Classifier for Machine-Printed Numeric Fields", Proceedings of the Fifth U.S.P.S. Advanced Technology Conference, Vol. 3, Washington, D.C., pp. 137-149, 1992.
[11] Guindi, R. M., "An Arabic Text Recognition System", M.Sc. Thesis, Faculty of Engineering, Cairo University, 1987.
[12] Hassan, F. H., "Arabic Character Recognition", 7th Summer Session of the Arabic School of Science and Technology, Syria, 1985.
[13] Lee, K. F., and Hon, H. W., "Speaker-Independent Phone Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 11, November 1989.
[14] Liang, S., Shridhar, M., and Ahmadi, M., "Efficient Algorithms for Segmentation and Recognition of Printed Characters in Document Processing", IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, Vol. 1, pp. 240-243, 1995.
[15] Morgan, D. P., and Scofield, C. L., "Neural Networks and Speech Processing", Kluwer Academic Publishers, pp. 119-127, 1992.
[16] Nishida, H., and Mori, S., "An Algebraic Approach to Automatic Construction of Structured Models", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 15, No. 12, pp. 1298-1311, Dec. 1993.
[17] Rabiner, L. R., and Levinson, S. E., "A Speaker-Independent, Syntax-Directed, Connected Word Recognition System Based on Hidden Markov Models and Level Building", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 33, No. 3, June 1985.
[18] Rabiner, L. R., and Juang, B. H., "An Introduction to Hidden Markov Models", IEEE ASSP Magazine, January 1986.
[19] Rabiner, L. R., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, pp. 257-286, February 1989.
[20] Rabiner, L. R., Wilpon, J. G., and Soong, F. K., "High Performance Connected Digit Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 8, August 1989.

