A Complete OCR System For Continuous Bengali Characters

Jalal Uddin Mahmud, Mohammed Feroz Raihan and Chowdhury Mofizur Rahman
Department of Computer Science and Engineering
Bangladesh University of Engineering & Technology
880-2-9665612 (7109)
[...], fraihan2002@yahoo.com, [...]
Abstract---This paper presents a complete optical character recognition (OCR) system for Bengali characters. Recognition is performed for both isolated and continuous printed multi-font Bengali characters. The preprocessing steps include segmentation at various levels, noise removal and scaling. The Freeman chain code is calculated from the scaled character and is further processed to obtain a discriminating set of feature vectors for the recognizer. Unknown samples are classified using a recognition scheme based on a feed-forward neural network. Experimental results show a success rate of approximately 98% for isolated characters and 96% for continuous characters.

Index Terms: Chain code, preprocessing, segmentation, feature extraction, neural network, back propagation, connected component.

1. INTRODUCTION

There has been particular interest over the last decade in the recognition of printed characters. Recognition of printed characters is itself a challenging problem, since the same character varies with changes of font and with the introduction of noise. Recognition of Bengali characters is a subject of special interest for us, and much work has been done in this area. Various strategies have been proposed by different authors. A multi-font character recognition scheme was suggested by Kahan and Pavlidis [1]. Roy and Chatterjee [2] presented a nearest neighbor classifier for Bengali characters employing features extracted by a string connectivity criterion. Abhijit Dutta and Santanu Chaudhury [3] suggested a curvature-based feature extraction strategy for both printed and handwritten Bengali characters. B.B. Chaudhuri and U. Pal [4] combined primitive analysis with template matching to detect compound Bengali characters. Most of the work on Bengali characters concerns recognition of isolated characters; very few papers deal with a complete OCR for printed documents in Bengali. From that standpoint, this paper should be useful to researchers who are interested in developing a complete OCR system.

In this work, continuous characters are segmented at various levels using some traditional methods as well as some new ones, and other preprocessing is performed to obtain a stream of isolated characters. We have adopted the chain code method of image representation [5], which allows a compact representation and a reduction of data and hence of processing time. A chain code is a linear structure that results from quantization of the trajectory traced by the centers of adjacent boundary elements in an image array. Features extracted from the chain code representation can be used effectively for recognition of the character image. Because the chain code representation gives the boundary of the character image, the thickness of the character is irrelevant, and thinning of the character image is unnecessary. The trajectory traced by the chain code captures the shape of the character, which can be further processed for feature extraction. The slope distribution of the chain code reflects the curvature properties of the character and has been used as a local feature. Further, the use of a back-propagation-based learning scheme in the recognition strategy enables the system to learn from examples. The generalizing capability of this learning scheme has been harnessed to achieve font-invariant recognition of the characters.

Section 2 presents the methodology of the system. Section 3 discusses the techniques used for preprocessing. Section 4 is concerned with the feature extraction strategies. Section 5 discusses the training and recognition scheme. Empirical results are presented in Section 6 and conclusions are drawn in Section 7.

2. METHODOLOGY

The basic objective of the present scheme is to develop a complete OCR (optical character recognition) system for different fonts and sizes of Bengali characters. Differences in font and size make the recognition task difficult if preprocessing, feature extraction and recognition are not robust. Noise pixels may be introduced by scanning of the image. Besides, the same font and size can have both bold and normal characters, so the width of the stroke is also a factor that affects recognition. A good character recognition approach must therefore eliminate the noise after reading the binary image data, smooth the image for better recognition, extract features efficiently, train the system and classify patterns. In the next figure a typical Bengali script is shown.
0-7803-7651-X/03/$17.00 ©2003 IEEE (TENCON 2003, Poster Papers)
Before isolated characters are recognized, the preprocessing steps segment paragraphs into lines, lines into words and words into isolated characters. Noise removal and smoothing constitute the next step. Connected components within each character are then detected using a depth-first approach. Each connected component is divided into four regions according to its center of mass. The distribution of directional slopes of the Freeman chain code in each region forms the feature set for that region of the connected component.

3. PREPROCESSING

In the present system, character images are obtained by optically scanning text printed on plain paper. The input data obtained by scanning printed text is almost always contaminated with noise and contains redundant information. Preprocessing includes segmentation, scaling, noise removal and elimination of redundant information as far as possible.

3.1 Segmentation

The input binary image data is segmented at several levels, in the following steps:

3.1.1 Text Line Detection---Text lines are detected by scanning the input page image horizontally. The frequency of black pixels in each row is counted in order to construct the row histogram. A position between two consecutive lines where the number of black pixels in a row is zero denotes a boundary between the lines. It is assumed here that the text block contains only a single column of text. Fig. 2 shows the line segmentation process.

Fig. 2: Text line segmentation

3.1.2 Word Segmentation---After a line has been detected, it is scanned vertically. The column histogram is found by counting the number of black pixels in each column. If n consecutive columns contain no black pixel, we take them to be a marker between two words. The value of n is determined experimentally. Fig. 3 shows the word segmentation process.

Fig. 3: Word segmentation

3.1.3 Character Segmentation---To segment the individual characters in a word, we first find the head line of the word, which is called the ‘Matra’ in Bengali. A row histogram of the word is constructed by counting the black pixels in each row. The row with the highest frequency value indicates the head line, denoted the ‘Matra’. Sometimes two or more consecutive rows have almost the same frequency value. In that case the ‘Matra’ is not a single row: all the rows that are consecutive to the highest-frequency row and have a frequency very close to it constitute the ‘Matra’, which is then a thick head line.

To find the demarcation line between characters, a vertical scan is initiated from the row just beneath the ‘Matra’. If the scan can reach the bottom of the word, it has found the demarcation line between characters. A purely linear vertical scan, however, fails to find the demarcation line for some words, because a portion of the adjacent character intrudes into the column associated with the current character. To overcome this, we use a scan that is piecewise linear: it takes a turn whenever it meets an obstacle (i.e. a black pixel) and tries to reach the bottom of the word. Fig. 4 shows the approach.

Fig. 4.a: Linear scan for character segmentation
Fig. 4.b: Ordinary linear search fails to segment
Fig. 4.c: Piecewise linear scan for segmentation

To find the portion of a character above the ‘Matra’, we check whether we can move upward from a point just adjacent to the ‘Matra’ row and between the two demarcation lines. If so, a greedy search is initiated from that point and the whole character is found.

Fig. 5: Greedy search for finding the portion of the character above the ‘Matra’ row

Besides these difficulties, some characters are positioned below another character. To segment them, the base line of the word image is calculated. Each word can be considered to have an imaginary line that crosses its middle. If a greedy search is initiated from the word image for black pixels below the imaginary line, the search yields some connected components below that line. Each of those components contains a lowest point, called its ‘Base Point’. The base line is the row in which these points occur with the highest frequency. After the base line has been determined, a depth-first search easily extracts the characters below it.

Fig. 6: Base line of the word image
Fig. 7: Extracting characters below the base line

3.2 Scaling

Isolated characters are scaled so that size-invariant recognition is possible. Although the recognition is size invariant, better results are obtained when the characters are assumed to lie in a specific range. For our system, the range of characters has been taken as 10 pt to 24 pt. The characters are scaled using an efficient scaling algorithm [6] and converted to a standard size, which is 64 x 64 for our system.

3.3 Noise Removal

After scanning, unwanted noise may produce arbitrary extrusions and intrusions at the boundary of the character images. Noisy cavities in the character images are also common. These distortions detrimentally affect the shape of the characters. Noise removal includes removal of single-pixel components and removal of the staircase effect after scaling. The staircase effect occurs when the scaled character has junctions so thin that the inner and outer contours required for the chain code representation cannot be found. Each pixel is replaced by the output of a filtering function to avoid this effect.

4. FEATURE EXTRACTION

Feature extraction is the most challenging part of character recognition; a good choice of features significantly improves the recognition rate and minimizes the error in the presence of noise. The steps of feature extraction are discussed below.

4.1 Connected Components Extraction

In the Bengali language a character can have more than one connected component. For these characters, recognizing both components yields the desired result. Therefore all the connected components are detected from the isolated character. Detection of connected components is done using a depth-first search approach.

4.2 Center of Mass for Each Component

The center of mass is calculated for each connected component. The center of mass of the i-th connected component is (X_i, Y_i), where

$$X_i = \frac{1}{N_i} \sum_{j=1}^{N_i} P_{ij} \tag{1}$$

$$Y_i = \frac{1}{N_i} \sum_{j=1}^{N_i} Q_{ij} \tag{2}$$

Here,
N_i = number of black pixels in connected component i,
P_ij = x coordinate of the j-th black pixel in the i-th connected component,
Q_ij = y coordinate of the j-th black pixel in the i-th connected component.

4.3 Bounded Rectangle Calculation

If the grid is searched row by row and a connected component is found using the depth-first search strategy, the top-left and bottom-right coordinates of the component can be found. Its minimum and maximum extent in the x direction as well as in the y direction can also be found, which gives the bounded rectangle of the component.

4.4 Division of the Components into Regions

Each connected component is divided into four regions, corresponding to the four quadrants of a two-dimensional coordinate system whose origin is the center of mass of that component. With the origin and the bounded rectangle of the connected component, the four regions can be established.

Fig. 8: Four regions for a connected component

4.5 Chain Code Generation:

After the character has been divided into connected components and the boundaries of the connected components have been established, the chain code is calculated. Several chain code conventions are used for image representation, but the most popular one is the Freeman chain code. The Freeman chain code is based on the observation that each pixel has eight neighboring pixels.
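The idea behind the Freeman chain code can be illustrated with a minimal sketch. It assumes that the boundary of a connected component is already available as an ordered, closed list of 8-adjacent (row, column) pixels, and it assumes the common numbering in which code 0 points east and the codes increase counter-clockwise in 45-degree steps; the numbering actually used in this paper is defined by its own figure and may differ. The names `DIRECTIONS` and `freeman_chain_code` are hypothetical and chosen only for this illustration.

```python
# Assumed convention (not taken from the paper): code 0 = east, then
# counter-clockwise in 45-degree steps. Steps are (d_row, d_col) with
# image rows growing downward, so "north" is d_row = -1.
DIRECTIONS = {
    (0, 1): 0,    # east
    (-1, 1): 1,   # north-east
    (-1, 0): 2,   # north
    (-1, -1): 3,  # north-west
    (0, -1): 4,   # west
    (1, -1): 5,   # south-west
    (1, 0): 6,    # south
    (1, 1): 7,    # south-east
}


def freeman_chain_code(contour):
    """Return the 8-direction chain code of a closed contour.

    contour: ordered list of (row, col) boundary pixels; consecutive
    pixels (and the last and first pixel) must be 8-adjacent.
    """
    codes = []
    n = len(contour)
    for k in range(n):
        r0, c0 = contour[k]
        r1, c1 = contour[(k + 1) % n]  # wrap around to close the contour
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {contour[k]} and {contour[(k + 1) % n]} are not 8-adjacent")
        codes.append(DIRECTIONS[step])
    return codes


# A 2 x 2 block of pixels traversed clockwise on screen:
# east, south, west, north.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(freeman_chain_code(square))  # [0, 6, 4, 2]
```

Each code depends only on the step between two neighboring boundary pixels, so the resulting sequence describes the outline of the component and is unaffected by the thickness of the stroke.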
Fig. 9: Slope convention for Freeman chain code

The 8 transitional positions defined by the Freeman chain code are then divided into 4 transitional zones in order to facilitate the searching and to keep the correct order of searching.

Fig. 10: Four direction zones for searching

Fig. 11: Determining the chain code of an image

4.6 Slope Distribution Generation:

As the search along a closed contour continues, the slope varies in each region. The frequency of each directional slope in each region is recorded and updated during the traversal. There are eight directional slopes in a region, and therefore 32 directional slopes in total for the whole component. The frequency of the j-th directional slope in the i-th region is the local feature S_ij, where j = 0, 1, ..., 7 and i = 0, 1, 2, 3.

4.7 Normalized Slope Calculation:

In order to obtain fractional values, the feature values must be normalized to the (0-1) scale. The rule for normalization is: if a_1, a_2, a_3, ..., a_n are the n components of a feature vector in n-dimensional feature space, then their normalized values are

$$\bar{a}_k = \frac{a_k}{N}, \quad k = 1, 2, \ldots, n, \qquad N = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}.$$

4.8 Conversion to Character Slope Distribution:

If there is more than one connected component in the character, then 32 normalized slopes are found for each connected component after the previous step. The recognition step, however, recognizes the whole character, not its individual connected components. Therefore the normalized features of the connected components are averaged to get the total features for the character.

5. CLASSIFICATION AND RECOGNITION

A neural network approach has been used for classification and recognition. The training and recognition phases of the neural network are performed using the conventional back propagation algorithm.

5.1 Training

The neural network is trained with the normalized feature vector obtained for each character in the training set. A four-layer neural network with two hidden layers has been used to improve the classification capability of the network with a minimum error tolerance rate. For the 32-dimensional feature vector and 4 layers, the number of neurons used in the first hidden layer is 90 and that in the second hidden layer is 75.

5.2 Recognition

In the recognition phase of the network a single iteration is enough to give the confidence value for each class of the character set. A confidence value at the output layer that is close to 1 implies the presence of that character class. Except for the confidence value of the recognized character, the other confidence values are close to 0.

Fig. 12: A neural network with 4 layers, with 90 and 75 neurons in the hidden layers

6. EMPIRICAL RESULT

The system has been tested extensively for both isolated and continuous characters. Many test files have been generated for this purpose, and tests of different complexity levels have been used. The recognition accuracy is presented in Chart 1. Training of the system was performed with three fonts: Sulekha, Susri and Sunetra. Test files were generated with samples from the training fonts as well as from some fonts unknown to the system. Chart 2 presents the success rates when the test files contained multiple fonts with varying sizes. The recognition rate is higher for isolated characters than for continuous characters. The lower performance for continuous characters may be due to the segmentation procedure.

Chart 1: Success rates in recognizing continuous characters for fonts unknown to the system

Chart 2: Success rates in recognizing continuous characters for mixed fonts with various sizes

7. CONCLUSION

A fast chain-code-based optical character recognition (OCR) system for the Bengali alphabet has been presented and implemented, and a performance analysis has been given. The development of efficient methods for preprocessing, segmentation and feature extraction resulted in an increase of speed. Contour representation is suitable for representing both printed and handwritten characters. Though we are concerned only with printed characters in this paper, the same methodology can be used for recognition of handwritten characters. In that case, the preprocessing and segmentation algorithms have to deal with more practical problems than in printed character recognition.

REFERENCES

[1] S. Kahan and T. Pavlidis, “Recognition of printed characters of any font and size”, IEEE Trans. Pattern Anal. and Mach. Intell. 9, 274-288, 1987.
[2] A.K. Roy and B. Chatterjee, “Design of nearest neighbor classifier for Bengali character recognition”, J. IEEE 30, 1984.
[3] Abhijit Dutta and Santanu Chaudhury, “Bengali Alpha-Numeric Character Recognition Using Curvature Features”, Pattern Recognition Vol-26, 1707-1720, 1993.
[4] B.B. Chaudhuri and U. Pal, “A Complete Printed Bangla OCR System”, Pattern Recognition Vol-31, 531-549, 1997.
[5] H. Freeman, “Computer Processing of Line Drawing Images”, Computing Surveys, Vol-6, no-1, pp-57-97, 1974.
[6] Suman Kumar Nath and Muhammad Mashroor Ali, “An Efficient Object Scaling Algorithm for raster device”, Graphics and Image Processing, NCCIS, 1997.