Bengali Characters
Jalal Uddin Mahmud, Mohammed Feroz Raihan and Chowdhury Mofizur Rahman
Department of Computer Science and Engineering
Bangladesh University of Engineering & Technology
880-2-96656 I2(7 109)
rnrony(rr a~~~~~tc'l.nc'1,fraihan2002(@yahoo.coni,
cmrhet(cc y ,ihoo c ' w i
Abstract---This paper is concerned with a complete optical character
recognition (OCR) system for Bengali characters. Recognition is done
for both isolated and continuous printed multi-font Bengali characters.
Preprocessing steps include segmentation at various levels, noise
removal and scaling. The Freeman chain code is calculated from the
scaled character and further processed to obtain a discriminating set
of feature vectors for the recognizer. The unknown samples are
classified using a feed-forward neural network based recognition
scheme. Experimental results show that the success rate is
approximately 98% for isolated characters and 96% for continuous
characters.

Index Terms: Chain code, preprocessing, segmentation, feature
extraction, neural network, back propagation, connected component.

1. INTRODUCTION

There has been particular interest over the last decade in recognition
of printed characters. Recognition of printed characters is itself a
challenging problem, since the same character varies with a change of
font or the introduction of noise. Recognition of Bengali characters is
a subject of special interest for us, and much work has been done in
this area. Various strategies have been proposed by different authors.
A multi-font character recognition scheme was suggested by Kahan and
Pavlidis [1]. Roy and Chatterjee [2] presented a nearest neighbor
classifier for Bengali characters employing features extracted by a
string connectivity criterion. Abhijit Datta and Santanu Chaudhuri [3]
suggested a curvature based feature extraction strategy for both
printed and handwritten Bengali characters. B. B. Chaudhuri and U. Pal
[4] combined primitive analysis with template matching to detect
compound Bengali characters. Most of the work on Bengali characters
concerns recognition of isolated characters. Very few deal with a
complete OCR for printed documents in Bengali. From that standpoint,
this paper should be useful to researchers who are interested in
developing a complete OCR system.

In this work, continuous characters are segmented using some
traditional methodology as well as some new methodology at various
levels, and other preprocessing is performed to find a stream of
isolated characters. We have adopted the chain code method of image
representation [5], which allows a compact representation and a
reduction of data and hence of processing time. Chain code is a linear
structure that results from quantization of the trajectory traced by
the centers of adjacent boundary elements in an image array. Feature
extraction from the chain code representation can be used effectively
for recognition of the character image. As the chain code
representation gives the boundary of the character image, the
thickness of the character is irrelevant in this case, and thinning of
the character image is unnecessary when chain code representation is
used. The trajectory traced by the chain code implies the shape of the
character, which can be further processed for feature extraction. The
slope distribution of the chain code implies the curvature properties
of the character, which has been used as a local feature. Further, use
of the back propagation based learning scheme in the recognition
strategy enables the system to learn from examples. The generalizing
capability of this learning scheme has been harnessed to achieve font
invariant recognition of the characters.

In section 2 we present the methodology of the system. In section 3 we
discuss the techniques used for preprocessing. Section 4 is concerned
with the feature extraction strategies. In section 5 the training and
recognition scheme is discussed. Empirical results are presented in
section 6 and conclusions are drawn in section 7.

2. METHODOLOGY

The basic objective of the present scheme is to develop a complete OCR
(optical character recognition) system for different fonts and sizes of
Bengali characters. Differences in font and size make the recognition
task difficult if preprocessing, feature extraction and recognition are
not robust. There may be noise pixels that are introduced by scanning
of the image. Besides, the same font and size can have bold face
characters and normal ones, so the width of the stroke is also a factor
that affects recognition. A good character recognition approach must
therefore eliminate the noise after reading the binary image data,
smooth the image for better recognition, extract features efficiently,
train the system and classify patterns. In the next figure a typical
Bengali script is shown.
0-7803-7651-X/03/$17.00 ©2003 IEEE
Poster Papers / 1373
3.1 Segmentation
Segmentation of the input binary image data at different levels is
performed. The segmentation is done in the following steps:

3.1.1 Text Line Detection----Text line detection has been [...]
components below that imaginary line. All those components contain a
lowest point, which is called the 'Base Point'. The base line is the
row in which these points occur most frequently. After determining the
base line, a depth first search easily extracts the characters below
the base line.

Fig 4.a Linear scan for character segmentation
Fig 4.b Ordinary linear search fails to segment
Fig 6. Base Line of the word image
Fig 7. Extracting characters below base line

3.2 Scaling
Scaling of the isolated character is performed so that size invariant
recognition is possible. Though the recognition is size invariant,
better results are obtained when the characters are assumed to be in a
specific range. For our system, the range of character sizes has been
taken as 10 pt to 24 pt. The characters are scaled using an efficient
scaling algorithm [6] and converted to a standard size, which is
64 x 64 for our system.

3.3 Noise Removal
After scanning, because of unwanted noise, arbitrary extrusions and
intrusions may be found at the boundary of the character images. Noisy
cavities in the character images are also common. These distortions
detrimentally affect the shape of the characters. Noise removal
includes removal of single pixel components and removal of the
staircase effect after scaling. The staircase effect occurs when the
scaled character has junctions so thin that the inner and outer
contours required for the chain code representation cannot be found.
Each pixel has been replaced by a filtering function to avoid this
effect.

4. FEATURE EXTRACTION

Feature extraction is the most challenging part of character
recognition, and the choice of good features significantly improves the
recognition rate and minimizes the error in case of noise. The steps of
feature extraction are discussed below:

4.1 Connected Components Extraction
In the Bengali language a character can have more than one connected
component. In these characters recognition of two components actually
yields the desired result. Therefore all the connected components are
detected from the isolated character. Detection of connected
components is done using a depth first search approach.

4.2 Center of Mass for Each Component
The center of mass is calculated for each connected component. The
center of mass of the i-th connected component is $(X_i, Y_i)$, where

$$X_i = \frac{1}{N_i} \sum_{j=1}^{N_i} P_{ij} \qquad (1)$$

$$Y_i = \frac{1}{N_i} \sum_{j=1}^{N_i} Q_{ij} \qquad (2)$$

Here,
$N_i$ = number of black pixels in connected component i.
$P_{ij}$ = x coordinate of the j-th black pixel in the i-th connected
component.
$Q_{ij}$ = y coordinate of the j-th black pixel in the i-th connected
component.

4.3 Bounded Rectangle Calculation
If the grid is searched in a row-wise manner and a connected component
is found using the depth first search strategy, the top left and bottom
right coordinates of the component can be found. Besides, its minimum
and maximum span in the x direction as well as in the y direction can
be found, which results in a bounded rectangle of the component.

4.4 Division of the Components into Regions
Each connected component is divided into four regions corresponding to
the four quadrants of a two-dimensional geometric system. The origin of
this system is the center of mass of that connected component. With the
origin and the bounded rectangle of the connected component, four
regions can be established.

Fig 8: Four Regions for a connected component

4.5 Chain Code Generation
After the character has been divided into connected components and the
boundaries of the connected components are established, the chain code
is to be calculated. There are several chain code conventions used for
image representation, but the most popular one is the Freeman chain
code. The Freeman chain code is based on the observation that each
pixel has eight neighboring pixels.
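
The component-level steps of Sections 4.1 to 4.3 and the Freeman coding of Section 4.5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the use of 8-connectivity in the depth first search, and the numbering of the eight directions (0 = east, then counter-clockwise, with rows growing downward) are assumptions made for illustration, and the boundary is assumed to be already available as an ordered list of adjacent pixels.

```python
# Assumed direction numbering: 0 = east, then counter-clockwise.
# Keys are (row step, column step); rows grow downward in the image.
FREEMAN_CODE = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def connected_components(grid):
    """Return the black-pixel components of a binary grid (1 = black).

    The grid is scanned row-wise and each unseen black pixel starts an
    iterative depth first search over its eight neighbors (assumed
    connectivity). Each component is a list of (row, col) pixels.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                pr, pc = stack.pop()
                pixels.append((pr, pc))
                for dr, dc in FREEMAN_CODE:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            components.append(pixels)
    return components


def center_of_mass(pixels):
    """Equations (1) and (2): mean x (column) and mean y (row)."""
    n = len(pixels)
    return (sum(c for _, c in pixels) / n, sum(r for r, _ in pixels) / n)


def bounding_rectangle(pixels):
    """Return (top, left, bottom, right) of a component."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))


def freeman_chain_code(boundary):
    """Encode an ordered list of 8-adjacent boundary pixels as codes 0-7."""
    return [FREEMAN_CODE[(r1 - r0, c1 - c0)]
            for (r0, c0), (r1, c1) in zip(boundary, boundary[1:])]


grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
components = connected_components(grid)
square = components[0]
print(center_of_mass(square), bounding_rectangle(square))
# Closed boundary of the 2 x 2 square, traversed clockwise from (0, 0).
print(freeman_chain_code([(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]))
```

With the center of mass as origin and the bounding rectangle as outer limit, the four regions of Section 4.4 follow from the sign of each pixel's offset from the origin.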
Here

$$\bar{a}_1 = \frac{a_1}{N}, \quad \bar{a}_2 = \frac{a_2}{N}, \quad \ldots, \quad \bar{a}_n = \frac{a_n}{N},$$

$$N = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}.$$
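
The normalization above divides every feature by the Euclidean length of the feature vector. A minimal Python sketch follows. The function name `normalize` is hypothetical, and returning an all-zero vector unchanged is an assumption, since the paper does not discuss that case.

```python
import math


def normalize(features):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in features))
    if norm == 0.0:
        # Assumed behavior: an all-zero vector is returned unchanged.
        return list(features)
    return [a / norm for a in features]


print(normalize([3.0, 4.0]))  # N = 5, so the result is [0.6, 0.8]
```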
5.1 Training
The neural network has been trained with the normalized feature vector
obtained for each character in the training set. A four-layer neural
network with two hidden layers has been used to improve the
classification capability of the network with a minimum error
tolerance rate. For a 32 dimensional feature vector and 4 layers, the
number of neurons used in the first hidden layer is 90 and that in the
second hidden layer is 75.
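
The layer sizes given above (32 inputs, hidden layers of 90 and 75 neurons) can be illustrated with a small pure-Python sketch of one back propagation update. Several details are assumptions, because the paper does not state them: sigmoid units, a squared-error criterion, the weight initialization range, the learning rate, and the number of output classes (set to 10 here only for the demonstration). The function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(layer_sizes, seed=0):
    """Create one (weights, biases) pair per pair of adjacent layers."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        network.append((weights, [0.0] * n_out))
    return network


def forward(network, features):
    """Return the activations of every layer, input layer first."""
    activations = [list(features)]
    for weights, biases in network:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
            for row, b in zip(weights, biases)])
    return activations


def train_step(network, features, target, learning_rate=0.1):
    """Apply one back propagation update; return the error before it."""
    activations = forward(network, features)
    output = activations[-1]
    error = 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))
    delta = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
    for layer in range(len(network) - 1, -1, -1):
        weights, biases = network[layer]
        prev = activations[layer]
        prev_delta = None
        if layer > 0:
            # Propagate the error back before this layer's weights change.
            prev_delta = [
                sum(weights[j][i] * delta[j] for j in range(len(delta)))
                * prev[i] * (1.0 - prev[i])
                for i in range(len(prev))]
        for j, d in enumerate(delta):
            for i, a in enumerate(prev):
                weights[j][i] -= learning_rate * d * a
            biases[j] -= learning_rate * d
        delta = prev_delta
    return error


# 32 inputs, hidden layers of 90 and 75 neurons, 10 output classes (assumed).
network = init_network([32, 90, 75, 10])
sample = [1.0 / math.sqrt(32)] * 32      # a unit-length feature vector
target = [1.0] + [0.0] * 9               # one-hot target for class 0
errors = [train_step(network, sample, target) for _ in range(20)]
print(errors[0], errors[-1])
```

Repeating the update on a training sample drives the squared error down, which is the behavior the training phase relies on. A real training run would cycle over all characters in the training set.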
Fig 10: 4 direction zones for searching (Up Zone, Left Zone, Down Zone)

5.2 Recognition
In the recognition phase of the network a single iteration is enough
to give the confidence value for each class of the character set. A
confidence value obtained from the output layer of the neural network
that is close to 1 implies the presence of that character class.
Except for the confidence value of the recognized character, the other
confidence values are close to 0.
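
The decision rule described above can be sketched as picking the class with the highest confidence value. This is an illustration only: the function name, the placeholder class labels and the rejection threshold of 0.5 are assumptions, since the paper says only that the winning confidence is close to 1.

```python
def classify(confidences, labels, threshold=0.5):
    """Return the label with the highest confidence value.

    Returns None when even the best confidence is below the (assumed)
    rejection threshold.
    """
    best = max(range(len(confidences)), key=confidences.__getitem__)
    if confidences[best] < threshold:
        return None
    return labels[best]


labels = ["class_0", "class_1", "class_2"]   # placeholder class names
print(classify([0.02, 0.97, 0.01], labels))  # class_1
print(classify([0.20, 0.30, 0.10], labels))  # None
```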
7. CONCLUSION
A fast chain code based optical character recognition (OCR)
system for the Bengali alphabet has been presented and
implemented. Performance analysis has also been given.
Development of efficient methods for preprocessing,
segmentation and feature extraction resulted in an increase in
speed. Contour representation is suitable for representing
both printed and handwritten characters. Though we are
concerned only with printed characters in this paper, the
same methodology can be used for recognition of
handwritten characters. In that case, the
preprocessing and segmentation algorithms have to consider
more practical problems than in printed character recognition.
REFERENCES
[1] S. Kahan and T. Pavlidis, “Recognition of printed
characters of any font and size”, IEEE Trans. Pattern Anal.
and Mach. Intell. 9, 274-288, 1987.