
CHAPTER - 1

1.1 INTRODUCTION

Text line extraction is generally seen as a preprocessing step for tasks such as document structure extraction, printed character or handwriting recognition. Many techniques have been developed for page segmentation of printed documents (newspapers, scientific journals, magazines, business letters) produced with modern editing tools. The segmentation of handwritten documents has also been addressed with the segmentation of address blocks on envelopes and mail pieces, and for authentication or recognition purposes. More recently, the development of handwritten text databases provides new material for handwritten page segmentation.

Ancient and historical documents, printed or handwritten, strongly differ from the documents mentioned above because layout formatting requirements were looser. Their physical structure is thus harder to extract. In addition, historical documents are of low quality, due to aging or faint typing. They include various disturbing elements such as holes, spots, writing from the verso appearing on the recto, ornamentation, or seals. Handwritten pages include narrowly spaced lines with overlapping and touching components. Characters and words have unusual and varying shapes, depending on the writer, the period and the place concerned. The vocabulary is also large and may include unusual names and words.

Full text recognition is in most cases not yet available, except for printed
documents for which dedicated OCR can be developed. However, invaluable
collections of historical documents are already digitized and indexed for
consulting, exchange and distant access purposes which protect them from
direct manipulation. In some cases, highly structured editions have been
established by scholars. But a huge amount of documents are still to be exploited
electronically. To produce an electronic searchable form, a document has to be
indexed. The simplest way of indexing a document consists in attaching its main
characteristics such as date, place and author (the so-called ‘metadata’).

Indexing can be enhanced when the document structure and content are
exploited. When a transcription (published version, diplomatic transcription) is
available, it can be attached to the digitized document: this allows users to
retrieve documents from textual queries. Since text based representations do not
reflect the graphical features of such documents, a better representation is
obtained by linking the transcription to the document image. A direct
correspondence can then be established between the document image and its
content by text/image alignment techniques. This allows the creation of indexes
where the position of each word can be recorded, and of links between both
representations.

Clicking on a word in the transcription or in the index through a GUI allows users to visualize the corresponding image area, and vice versa. The document analysis embedded in such systems provides tools to search for blocks, lines and words, and may include a dedicated handwriting recognition system. Interactive tools are generally offered for segmentation and recognition correction purposes. Several projects also concern printed material. However, document structure can also be used when no transcription is available. Word spotting techniques can retrieve similar words in the image document through an image query. When words of the image document are extracted by top-down segmentation, which is generally the case, text lines are extracted first.

The authentication of manuscripts in the paleographic sense can also make use of document analysis and text line extraction. Authentication consists in retrieving writer characteristics independently of the document content. It generally consists in dating documents, localizing the place where the document was produced, and identifying the writer by using characteristics and features extracted from blank spaces, line orientations and fluctuations, and word or character shapes. Page segmentation into text lines is performed in most tasks mentioned above, and overall performance strongly relies on the quality of this process.

Fig 1 Line segmentation block diagram

The purpose of this project is to survey the efforts made on the text line segmentation task for historical documents. Section 2 describes the characteristics of text line structures in historical documents and the different ways of defining a text line. Preprocessing of document images (gray level, color or black and white) is often necessary before text line extraction, to prune superfluous information (non-textual elements, textual elements from the verso) or to correctly binarize the image.

This preprocessing problem is addressed first. We then survey the different approaches to segmenting the clean image into text lines: a taxonomy is proposed, comprising projection-based, smearing, grouping, Hough-based, repulsive-attractive network and stochastic methods. The majority of these techniques have been developed for the projects on historical documents mentioned above. We address the specific problem of overlapping and touching components further on.

Document image understanding algorithms are expected to work with a document irrespective of its layout, script, font, color, etc. Segmentation aims to partition a document image into various homogeneous regions such as text blocks, image blocks, lines, words, etc. Page segmentation algorithms can be broadly classified into three categories: bottom-up, top-down, and hybrid algorithms. The classification is based on the order in which the regions in a document are identified and labeled. The layout of the document is represented by a hierarchy of regions: page, image or text blocks, lines, words, components, and pixels. Traditional document segmentation algorithms give good results on most documents with complex layouts, but assume the script in the document to be simple, as in English. These algorithms fail to give good results on documents with complex scripts such as African, Persian and Indian scripts.

Fig 2 Reference lines and interfering lines with overlapping and touching components.

1.2 CHARACTERISTICS AND REPRESENTATION OF TEXT LINES

To have a good idea of the physical structure of a document image, one only needs to look at it from a certain distance: the lines and the blocks are immediately visible. These blocks consist of columns, annotations in margins, stanzas, etc. As blocks generally have no rectangular shape in historical documents, the text line structure becomes the dominant physical structure. We first give some definitions about text line components and text line segmentation. Then we describe the factors which make text line segmentation hard. Finally, we describe how a text line can be represented.
1.2.1 DEFINITION

Baseline: Fictitious line which follows and joins the lower part of the character bodies in a text line (Fig. 2).

Median line: Fictitious line which follows and joins the upper part of the
character bodies in a text line.

Upper line: Fictitious line which joins the top of ascenders.

Lower line: Fictitious line which joins the bottom of descenders.

Overlapping components: Overlapping components are descenders and ascenders located in the region of an adjacent line (Fig. 2).

Touching components: Touching components are ascenders and descenders belonging to consecutive lines which are thus connected. These components are large but hard to discriminate before text lines are known.

Text line segmentation: Text line segmentation is a labeling process which consists in assigning the same label to spatially aligned units (such as pixels, connected components or characteristic points).

There are two categories of text line segmentation approaches: searching for (fictitious) separating lines or paths, or searching for aligned physical units. The choice of a segmentation technique depends on the complexity of the text line structure of the document.

1.2.2 INFLUENCE OF AUTHOR STYLE


Baseline fluctuation: The baseline may vary due to writer movement. It may
be straight, straight by segments, or curved.

Line orientations: There may be different line orientations, especially in authorial works where there are corrections and annotations.

Line spacing: Lines that are widely spaced are easy to find. Extracting text lines grows more difficult as the interlines narrow: the lower baseline of the first line comes closer to the upper baseline of the second line, and descenders and ascenders start to fill the blank space left for separating two adjacent text lines.

Insertions: Words or short text lines may appear between the principal text
lines, or in the margins.

1.2.3 INFLUENCE OF POOR IMAGE QUALITY

Imperfect preprocessing: Smudges, variable background intensity and the presence of seeping ink from the other side of the document make image preprocessing particularly difficult and produce binarization errors.

Stroke fragmentation and merging: Punctuation, dots and broken strokes due
to low-quality images and/or binarization may produce many connected
components; conversely, words, characters and strokes may be split into several
connected components. The broken components are no longer linked to the
median baseline of the writing and become ambiguous and hard to segment into
the correct text line.

1.2.4 TEXT LINE REPRESENTATION

Separating paths and delimited strip: Separating lines (or paths) are
continuous fictitious lines which can be uniformly straight, made of straight
segments, or of curving joined strokes. The delimited strip between two
consecutive separating lines receives the same text line label. So the text line
can be represented by a strip with its pair of separating lines (Fig. 3).

Clusters: Clusters are a general set-based way of defining text lines. A label is
associated with each cluster. Units within the same cluster belong to the same
text line. They may be pixels, connected components, or blocks enclosing pieces
of writing. A text line can be represented by a list of units with the same label.

Strings: Strings are lists of spatially aligned and ordered units. Each string
represents one text line.

Baselines: Baselines follow line fluctuations but only partially define a text line. Units
connected to a baseline are assumed to belong to it. Complementary processing
has to be done to cluster non-connected units and touching components.
Fig. 3 Various text line representations: paths, strings and baselines.
1.3 DOCUMENT IMAGE ANALYSIS

Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In this paper we briefly describe various components of a document analysis system. Many of these basic building blocks are found in most document analysis systems.

The objective of document image analysis is to recognize the text and graphics components in images of documents, and to extract the intended information as a human would. Two categories of document image analysis can be defined (see figure 4). Textual processing deals with the text components of a document image. Some tasks here are: determining the skew (any tilt at which the document may have been scanned into the computer); finding columns, paragraphs, text lines, and words; and finally recognizing the text (and possibly its attributes such as size, font, etc.) by optical character recognition (OCR). Graphics processing deals with the non-textual line and symbol components that make up line diagrams, delimiting straight lines between text sections, company logos, etc.
Pictures are a third major component of documents, but except for
recognizing their location on a page, further analysis of these is usually the task
of other image processing and machine vision techniques. After application of
these text and graphics analysis techniques, the several megabytes of initial data
are culled to yield a much more concise semantic description of the document.

Fig 4 A hierarchy of document processing subareas listing the types of document components dealt with in each subarea.

Document analysis systems will become increasingly evident in the form of everyday document systems. For instance, OCR systems will be more widely used to store, search, and excerpt from paper-based documents. Page-layout analysis techniques will recognize a particular form or page format and allow its duplication. Diagrams will be entered from pictures or by hand, and logically edited. Pen-based computers will translate handwritten entries into electronic documents. Archives of paper documents in libraries and engineering companies will be electronically converted for more efficient storage and instant delivery to a home or office computer.
Consider three specific examples of the need for document analysis
presented here.

(1) Typical documents in today’s office are computer-generated, but even so,
inevitably by different computers and software such that even their electronic
formats are incompatible. Some include formatted text and tables as well as
handwritten entries. There are different sizes, from a business card to a large
engineering drawing. Document analysis systems recognize types of documents,
enable the extraction of their functional parts, and translate from one computer
generated format to another.

(2) Automated mail-sorting machines that perform sorting and address recognition have been used for several decades, but there is the need to process more mail, more quickly, and more accurately.

(3) In a traditional library, loss of material, misfiling, limited numbers of each copy, and even degradation of materials are common problems, and may be alleviated by document analysis techniques. All these examples serve as applications ripe for the potential solutions of document image analysis.

1.3.1 DATA CAPTURE

Data in a paper document are usually captured by optical scanning and stored in a file of picture elements, called pixels, that are sampled in a grid pattern throughout the document. These pixels may have values: OFF (0) or ON (1) for binary images, 0–255 for gray-scale images, and 3 channels of 0–255 color values for color images. At a typical sampling resolution of 120 pixels per centimeter, a 20 x 30 cm page would yield an image of 2400 x 3600 pixels. When the document is on a different medium such as microfilm, palm leaves, or fabric, photographic methods are often used to capture images. In any case, it is important to understand that the image of the document contains only raw data that must be further analyzed to glean the information.
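As a rough check of that figure, the arithmetic can be sketched as follows (a minimal illustration; the resolution and page size are simply the values quoted above):

```python
resolution = 120                     # pixels per centimeter
width_cm, height_cm = 20, 30         # page size quoted above

width_px = resolution * width_cm     # 2400
height_px = resolution * height_cm   # 3600
pixels = width_px * height_px        # 8,640,000 pixels

# Approximate raw storage for the three pixel formats described above:
binary_bytes = pixels // 8           # 1 bit per pixel   -> ~1.1 MB
gray_bytes = pixels                  # 1 byte per pixel  -> ~8.6 MB
color_bytes = pixels * 3             # 3 bytes per pixel -> ~25.9 MB
```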

1.4 OPTICAL CHARACTER RECOGNITION (OCR)

Optical Character Recognition (OCR) lies at the core of the discipline of pattern recognition, where the objective is to interpret a sequence of characters taken from an alphabet. Characters of the alphabet are usually rich in shape. In fact, the characters can be subject to many variations in terms of fonts and handwriting styles. Despite these variations, there is perhaps a basic abstraction of the shapes that identifies any of their instantiations. Developing computer algorithms to identify the characters of the alphabet is the principal task of OCR. The challenge to the research community is the following: while humans can recognize neatly handwritten characters with 100% accuracy, there is no OCR that can match that performance. OCR difficulty can increase on several counts.

Increase in fonts, size of the alphabet set, unconstrained handwriting, touching of adjacent characters, broken strokes due to poor binarization, noise, etc. all contribute to the difficulty; handwritten 0’s and 6’s, for example, are easily confused by a digit recognizer. There are many applications that require the recognition of unconstrained handwriting. A word can be purely numeric, as in the case of a Zip code; purely alphabetic, as in the case of US state abbreviations; or mixed, as in the number of an apartment.
The task becomes particularly challenging when adjacent characters in a
character string are touching as shown in figure 8. Unlike purely alphabetic
strings where joining of the characters is natural and takes place by means of
ligatures, the joining of numerals in a numeric word and the upper-case
characters in an abbreviation are accidental. There are various ways in which
two digits can touch. Some of the categories lend themselves to natural
segmentation, whereas for some a holistic approach is the only option available.

1.4.1 WHAT IS OCR?

OCR is the acronym for Optical Character Recognition. This technology allows a machine to automatically recognize characters through an optical mechanism. Human beings recognize many objects in this manner; our eyes are the "optical mechanism." But while the brain "sees" the input, the ability to comprehend these signals varies in each person according to many factors. By reviewing these variables, we can understand the challenges faced by the technologist developing an OCR system.

First, if we read a page in a language other than our own, we may recognize the various characters, but be unable to recognize words. However, on the same page, we are usually able to interpret numerical statements, since the symbols for numbers are universally used. This explains why many OCR systems recognize numbers only, while relatively few understand the full alphanumeric character range. Second, there is similarity between many numerical and alphabetical symbol shapes. For example, while examining a string of characters combining letters and numbers, there is very little visible difference between a capital letter "O" and the numeral "0."
As humans, we can re-read the sentence or entire paragraph to help us
determine the accurate meaning. This procedure, however, is much more difficult
for a machine. Third, we rely on contrast to help us recognize characters. We
may find it very difficult to read text which appears against a very dark
background, or is printed over other words or graphics. Again, programming a
system to interpret only the relevant data and disregard the rest is a difficult task
for OCR engineers. There are many other problems which challenge the
developers of OCR systems. In this paper, we will review the history,
advancements, abilities and limitations of existing systems. This analysis should
help determine if OCR is the correct application for your company's needs, and if
so, which type of system to implement.

1.4.2 HISTORY OF OCR

The engineering attempts at automated recognition of printed characters started prior to World War II. But it was not until the early 1950's that a commercial venture was identified that justified the necessary funding for research and development of the technology. This impetus was provided by the American Bankers Association and the financial services industry. They challenged all the major equipment manufacturers to come up with a "Common Language" to automatically process checks. After the war, check processing had become the single largest paper processing application in the world. Although the banking industry eventually chose Magnetic Ink Character Recognition (MICR), some vendors had proposed the use of an optical recognition technology.

However, OCR was still in its infancy at the time and did not perform as acceptably as MICR. The advantage of MICR was that it is relatively impervious to change, fraudulent alteration and interference from non-MICR inks. The "eye" of early OCR equipment utilized lights, mirrors, fixed slits for the reflected light to pass through, and a moving disk with additional slits. The reflected image was broken into discrete bits of black and white data, presented to a photo-multiplier tube, and converted to electronic bits.

The "brain's" logic required the presence or absence of "black'' or "white"


data bits at prescribed intervals. This allowed it to recognize a very limited,
specially designed character set. To accomplish this, the units required
sophisticated transports for documents to be processed. The documents were
required to run at a consistent speed and the printed data had to occur in a fixed
location on each and every form.

This technology also introduced the concept of blue, non-reading inks, as the system was sensitive to the ultraviolet spectrum. The third generation of recognition devices, introduced in the early 1970's, consisted of photo-diode arrays. These tiny sensors were aligned in an array so that the reflected image of a document would pass by at a prescribed speed. These devices were most sensitive in the infra-red portion of the visual spectrum, so "red" inks were used as non-reading inks. That brings us to the current generation of hardware.

1.4.3 LIMITATIONS OF OCR

OCR has never achieved a read rate that is 100% perfect. Because of
this, a system which permits rapid and accurate correction of rejects is a major
requirement. Exception item processing is always a problem because it delays
the completion of the job entry, particularly the balancing function. Of even
greater concern is the problem of misreading a character (substitutions). In
particular, if the system does not accurately balance dollar data, customer
dissatisfaction will occur. The success of any OCR device to read accurately
without substitutions is not the sole responsibility of the hardware manufacturer.
Much depends on the quality of the items to be processed.
Through the years, the desire has been to increase the accuracy of reading, that is, to reduce rejects and substitutions; to reduce the sensitivity of scanning so that less-controlled input can be read; to eliminate the need for specially designed fonts (characters); and to read handwritten characters. However, today's systems, while much more forgiving of printing quality and more accurate than earlier equipment, still work best when specially designed characters are used and attention to printing quality is maintained. These limits are not objectionable to most applications, and the number of dedicated users of OCR systems grows each year. But the ability to read a special character set is not, by itself, sufficient to create a successful system.

1.4.4 WHAT DOES IT TAKE TO MAKE A SUCCESSFUL OCR SYSTEM?

1. It takes a complementary merging of the input document stream with the processing requirements of the particular application, within a total system concept that provides for convenient entry of exception-type items and an output that provides cost-effective entry to complete the system. To show a successful example, let's review the early credit card OCR applications.

2. Input was a carbon imprinted document. However, if the carbon was wrinkled,
the imprinter was misaligned, or any one of a variety of reasons existed, the
imprinted characters were impossible to read accurately.

3. To compensate for this problem, the processing system permitted direct key entry of the fail-to-read items at a fairly high speed. Directly keyed items from the misread document were under intelligent computer control, which placed the proper data in the right location in the data record. Important considerations in designing the system encouraged the use of modulus-controlled check digits for the embossed credit card account number. This, coupled with tight monetary controls by batch totals, reduced the chance of read substitutions.
4. The output of these early systems provided a "country club" type of billing. That is, each of the credit card sales slips was returned to the original purchaser. This provided the credit card customer with the opportunity to review his own purchases to ensure the final accuracy of billing. This has been a very successful
operation through the years. Today's systems improve the process by increasing
the amount of data to be read, either directly or through reproduction of details on
the sales draft. This provides customers with a "descriptive" billing statement
which itemizes each transaction. Attention to the details of each application step
is a requirement for successful OCR systems.

1.4.6 PHASES IN OCR

• Preprocessing
• Segmentation
• Recognition
• Post-processing

Fig 5 Phases of OCR

1.4.6.1 PREPROCESSING

Optical Character Recognition (OCR) refers here to the process of converting printed Tamil text documents into machine-editable Unicode Tamil text. The printed documents available in the form of books, papers, magazines, etc. are scanned using standard scanners, which produce an image of the scanned document. As part of the preprocessing phase the image file is checked for skewing. If the image is skewed, it is corrected by a simple rotation technique in the appropriate direction. Then the image is passed through a noise elimination phase and is binarized.

The preprocessed image is segmented using an algorithm which decomposes the scanned text into paragraphs using a special space detection technique, then the paragraphs into lines using vertical histograms, lines into words using horizontal histograms, and words into character image glyphs using horizontal histograms. Each image glyph comprises 32x32 pixels. Thus a database of character image glyphs is created by the segmentation phase. All the image glyphs are then considered for recognition using Unicode mapping. Each image glyph is passed through various routines which extract the features of the glyph.
1.4.6.2 SEGMENTATION

In computer vision, segmentation refers to the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain visual characteristics.

The result of image segmentation is a set of segments that collectively cover the entire image, or a set of contours extracted from the image (see edge detection). Each of the pixels in a region is similar with respect to some characteristic or computed property, such as color, intensity, or texture. Adjacent regions are significantly different with respect to the same characteristic(s).

1.4.6.3 RECOGNITION

Often abbreviated OCR, optical character recognition refers to the branch of computer science that involves reading text from paper and translating the images into a form that the computer can manipulate (for example, into ASCII codes). An OCR system enables you to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor.

All OCR systems include an optical scanner for reading text, and
sophisticated software for analyzing images. Most OCR systems use a
combination of hardware (specialized circuit boards) and software to recognize
characters, although some inexpensive systems do it entirely through software.
Advanced OCR systems can read text in a large variety of fonts, but they still have difficulty with handwritten text.
The potential of OCR systems is enormous because they enable users to
harness the power of computers to access printed documents. OCR is already
being used widely in the legal profession, where searches that once required
hours or days can now be accomplished in a few seconds.

1.4.6.4 POST PROCESSING

In most OCR systems, an independent character recognition engine is used to recognize each segmented part of an image, where only the shape and structure of the character are considered. In order to improve the recognition accuracy rate, it is necessary in post-processing to use language knowledge, which introduces context information, to correct the image recognition results. Post-processing approaches based on language knowledge include using a lexicon or some syntax and semantic rules to correct the spelling of words, and using statistical language models (SLM) to select the best sequence from the candidate characters given by the OCR engine. Because of the complexity of language, all kinds of language knowledge are sometimes used together to obtain better performance.

An OCR engine outputs not only candidate characters, but also candidate
distance information of each candidate character, which is also important in OCR
post-processing. Currently, candidate distance is usually transformed to reliability
of the corresponding candidate character to be utilized. Generally speaking, the
bigger the reliability of a candidate character, the smaller the corresponding
candidate distance. In the early period, the reliability was calculated by using some
empirical formulas. Afterwards, a statistical approach was proposed, which
calculates the reliability according to the distribution of candidate characters and
correct characters with different candidate distances. It reflects some statistical
characteristics, and its complexity is low, therefore it achieves good results in
some applications. However, the use of candidate distance is still limited in OCR
post-processing.
1.4.7 STEPS INVOLVED IN OCR

Optical character recognition is the recognition of printed or written text by a computer. This involves photo-scanning of the text, which converts the paper document into an image, and then translation of the text image into character codes such as ASCII. Any OCR implementation consists of a number of preprocessing steps followed by the actual recognition. The number and types of preprocessing algorithms employed on the scanned image depend on many factors such as the age of the document, paper quality, resolution of the scanned image, the amount of skew in the image, and the format and layout of the images and text. Typical preprocessing includes the following stages:

• Binarization,
• Noise removal,
• Thinning,
• Skew detection and correction,
• Line segmentation,
• Word segmentation, and
• Character segmentation

Recognition consists of

• Feature extraction,
• Feature selection, and
• Classification
Fig 6 Steps in an OCR

1.4.7.1 Binarization

Binarization is a technique by which gray-scale images are converted to binary images. In any image analysis or enhancement problem, it is very essential to identify the objects of interest from the rest. Binarization separates the foreground (text) from the background. The most common method for binarization is to select a proper threshold for the intensity of the image and then convert all intensity values above the threshold to one intensity value (for example, white), and all intensity values below the threshold to the other chosen intensity (black).

Binarization is usually performed either globally or locally. Global methods apply one threshold to the entire image. Local or adaptive thresholding methods apply different threshold values to different regions of the image; these values are determined by the neighborhood of the pixel to which the thresholding is being applied. Several binarization techniques are discussed in (Anuradha & Koteswarrao 2006). Figure 7(A) shows the scanned image of a paper document printed in Telugu, a south Indian language. Figure 7(B) is the same image after binarization, in which the text pixels are separated from the background.
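A minimal sketch of both variants in Python (the threshold, block size and offset values are illustrative assumptions, not taken from the source; dark text on a light background is assumed):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def binarize_global(gray, threshold=128):
    # One threshold for the whole image: pixels darker than the
    # threshold are labeled text (1), the rest background (0).
    return (gray < threshold).astype(np.uint8)

def binarize_adaptive(gray, block=31, offset=10):
    # Local thresholding: compare each pixel with the mean of its
    # block x block neighborhood; the offset biases toward background.
    local_mean = uniform_filter(gray.astype(float), size=block)
    return (gray < local_mean - offset).astype(np.uint8)
```

The adaptive variant is the one that copes with the variable background intensity typical of historical documents.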

Fig 7 (A) Original image, (B) Binarized image


1.4.7.2 Noise Removal

Scanned documents often contain noise arising from the printer, scanner, print quality, age of the document, etc. Therefore, it is necessary to filter this noise before we process the image. The commonly used approach is to low-pass filter the image and to use the result for later processing. The objective in the design of a noise-reducing filter is that it should remove as much of the noise as possible while retaining the entire signal (Rangachar et al 2002).
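As a sketch, a small median filter is a common choice here, since it removes isolated salt-and-pepper pixels while preserving stroke edges better than plain averaging (the 3x3 window size is an illustrative assumption):

```python
from scipy.ndimage import median_filter

# binary_image: a binarized page, e.g. from the binarization sketch above
denoised = median_filter(binary_image, size=3)
```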

1.4.7.3 Thinning

Thinning, or skeletonization, is a process by which a one-pixel-width representation (or skeleton) of an object is obtained, preserving the connectedness of the object and its end points. The purpose of thinning is to reduce the image components to their essential information so that further analysis and recognition are facilitated. This enables easier subsequent detection of pertinent features. Figure 8 shows an image before and after thinning. A number of thinning algorithms have been proposed and are in use. The most common is the classical Hilditch algorithm and its variants.

Fig 8 A character image (left) before thinning, and (right) after thinning
1.4.7.4 Skew detection and correction

When a document is fed to the scanner either mechanically or by a human operator, a few degrees of skew (tilt) are unavoidable. The skew angle is the angle that the lines of text in the digital image make with the horizontal direction. Figure 9(a) shows an image with skew.

Fig 9 An image (a) with skew, (b) without skew, and its horizontal profiles
There exist many techniques for skew estimation. One skew estimation
technique is based on the projection profile of the document; another class of
approach is based on nearest neighbor clustering of connected components.
Techniques based on the Hough transform and Fourier transform are also
employed for skew estimation. A popular method for skew detection employs the
projection profile. A horizontal projection profile is a one-dimensional array where
each element denotes the number of black pixels along a row in the image.

Scanning horizontally, the horizontal projection profile has peaks whose widths are equal to the character height and valleys whose widths are equal to the spacing between lines. At the correct skew angle, since scan lines are aligned to text lines, the projection profile has maximum-height peaks for text and deep valleys for line spacing. In the image of figure 9(a), the horizontal projection profile shows no clear valleys due to the presence of skew. Figure 9(b) is an image in which the skew is removed; the peaks and valleys in its projection profile can be clearly seen.
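A minimal sketch of this idea: rotate the page over a range of candidate angles and keep the angle at which the horizontal projection profile is "peakiest" (largest variance). The angle range and step are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import rotate

def profile_sharpness(binary, angle):
    # At the true skew angle, scan lines align with text lines, so the
    # horizontal projection has tall peaks and deep valleys (high variance).
    rotated = rotate(binary, angle, reshape=False, order=0)
    return rotated.sum(axis=1).var()

def estimate_skew(binary, angles=np.arange(-5, 5.25, 0.25)):
    return max(angles, key=lambda a: profile_sharpness(binary, a))
```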

1.4.7.5 Line, word, and character segmentation

After the tilt is corrected, the text has to be segmented first into lines, each line then into words, and finally each word into its constituent characters. The horizontal projection of a document image is most commonly employed to extract the lines from the document. If the lines are well separated and not tilted, the horizontal projection will have well-separated peaks and valleys, as shown in figure 9(b), which serve as the separators of the text lines. Figure 10 shows an image consisting of 3 text lines (left), and the 3 segmented lines (right), using horizontal projection profiles.

Similarly, a vertical projection profile gives the column sums. One can separate lines by looking for minima in the horizontal projection profile of the page and then separate words by looking for minima in the vertical projection profile of a single line. Figure 11(a) shows a line consisting of 4 words, along with vertical projection profiles, and figure 11(b) shows the 4 words after segmentation. In figure 11(c), a word is shown segmented into its 3 constituent characters.
Overlapping, adjacent characters in a word (called kerned characters) cannot be
segmented using zero-valued valleys in the vertical projection profile.
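A minimal sketch of the valley-based splitting described above (assuming a deskewed binary page; the same helper splits lines on the horizontal profile and words on the vertical profile of one line):

```python
import numpy as np

def split_on_valleys(profile):
    # Zero-valued runs of the projection profile separate consecutive
    # text lines (horizontal profile) or words (vertical profile).
    segments, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i                      # a new ink run begins
        elif v == 0 and start is not None:
            segments.append((start, i))    # the run ends at a valley
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

line_ranges = split_on_valleys(binary_page.sum(axis=1))  # rows of each line
```

As noted above, this fails for kerned characters, whose vertical profile never reaches zero between them.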

Special techniques have to be employed to solve this problem.

1.4.7.6 Feature extraction and selection

Feature extraction can be considered as finding a set of parameters (features) that define the shape of the underlying character as precisely and uniquely as possible. The features have to be selected in such a way that they help in discriminating between characters. Thinned data are analyzed to detect features such as straight lines, curves, and significant points along the curves.

Fig 10 Line segmentation


Fig 11 (a). A line segment, (b). Word segmentation, (c). Character
segmentation

1.4.7.7 Techniques used

Any OCR contains more or less the same steps described above. The exact number and techniques differ slightly from one language to another. We now present studies of different OCRs, along with a detailed description of the methods used in them. Recognition of isolated and continuous printed multi-font Bengali characters is reported in the work by Mahmud et al (2003).

This is based on Freeman chain code features, which are explained as follows. When objects are described by their skeletons or contours, they can be represented by chain coding, where the ON pixels are represented as sequences of connected neighbors along lines and curves. Instead of storing the absolute location of each ON pixel, the direction from its previously coded neighbor is stored.

The chain codes from the center pixel are 0 for east, 1 for north-east, and so on. This is represented pictorially in figures 12(a) and (b). The chain code gives the boundary of the character image; the slope distribution of the chain code captures the curvature properties of the character. In this work, connected components from each character are divided into four regions with the center of mass as the origin. The slope distribution of the chain code in these four regions is used as a local feature. Using the chain code representation, classification is done by a feed-forward neural network.
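A minimal sketch of Freeman chain coding for an ordered boundary (assuming 8-connectivity and image coordinates where the row index grows downward, so "north" means the row decreases):

```python
# Direction codes: 0 = east, then counter-clockwise (1 = north-east, ...).
DELTAS = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
          (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(contour):
    # contour: ordered list of (row, col) boundary points of a component.
    return [DELTAS[(r1 - r0, c1 - c0)]
            for (r0, c0), (r1, c1) in zip(contour, contour[1:])]

# Two steps east, then one step north:
print(chain_code([(5, 2), (5, 3), (5, 4), (4, 4)]))  # -> [0, 0, 2]
```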

Testing on three types of fonts, with accuracy of approximately 98% for isolated characters and 96% for continuous characters, is reported. Ray & Chatterjee (1984) presented a recognition system based on a nearest-neighbor classifier employing features extracted by using a string connectivity criterion. A complete OCR for printed Bangla is reported in the work by Chaudhuri & Pal (1998), in which a combination of template- and feature-matching approaches is used.

A histogram-based thresholding approach is used to convert the image into binary form. For a clean document, the histogram shows two prominent peaks corresponding to the white and black regions. The threshold value is chosen as the midpoint between the two histogram peaks. The skew angle is determined from the skew of the headline.

Text lines are partitioned into three zones, and the horizontal and vertical projection profiles are used to segment the text into lines, words, and characters. Primary grouping of characters into basic, modified and compound characters is made before the actual classification. A few stroke features are used for this purpose, along with a tree classifier where the decision at each node of the tree is taken on the basis of the presence or absence of a particular feature.

The compound character recognition is done in two stages:

1) In the first stage, the characters are grouped into small subsets by the above tree classifier.

2) In the second stage, characters in each group are recognized by a run-based template matching approach. Some character-level statistics, such as individual character occurrence frequency and bigram and trigram statistics, are utilized to aid the recognition process. For single-font, clean documents, a character-level recognition accuracy of 99.10% is reported.

Fig 12 Chain code and graphical representations


CHAPTER - 2
BACKGROUND WORK

2.1.1 Projection-based Methods

Projection profiles are commonly used for printed document segmentation. The technique can also be adapted to handwritten documents with little overlap. The vertical projection profile is obtained by summing pixel values along the horizontal axis for each y value:

Profile(y) = Σ f(x, y), where the sum runs over 1 ≤ x ≤ M.

From the vertical profile, the gaps between the text lines in the vertical direction can be observed (Fig. 13).

The vertical profile is not sensitive to writing fragmentation. Variants for obtaining a profile curve may consist in projecting black/white transitions, or the number of connected components, rather than pixels. The profile curve can be smoothed, e.g. by a Gaussian or median filter, to eliminate local maxima. The profile curve is then analysed to find its maxima and minima.

There are two drawbacks: short lines will produce only low peaks, and very narrow lines, as well as those including many overlapping components, will not produce significant peaks. In case of skew or moderate fluctuations of the text lines, the image may be divided into vertical strips and profiles sought inside each strip. These piecewise projections are thus a means of adapting to local fluctuations within a more global scheme.

In one approach, the global orientation (skew angle) of a handwritten page is first found by applying a Hough transform to the entire image. Once this skew angle is obtained, projections are computed along this angle. The number of maxima of the profile gives the number of lines. Low maxima are discarded based on their value, compared to the highest maximum. Lines are delimited by strips, searching for the minima of the projection profile around each maximum. This technique has been tested on a set of 200 pages within a word segmentation task.
In other work, each minimum of the profile curve is treated as a potential segmentation point. Potential points are then scored according to their distance to adjacent segmentation points. The reference distance is obtained from the histogram of distances between adjacent potential segmentation points. The highest-scored segmentation point is used as an anchor to derive the remaining ones. The method is applied to printed records of the Second World War, which have regularly spaced text lines. The logical structure is used to derive the text regions where the names of interest can be found.

Fig. 13 Vertical projection profile extracted on an autograph of Jean-Paul Sartre.

The RXY cuts method applied in He and Downton uses alternating projections along the X and the Y axes. This results in a hierarchical tree structure. Cuts are found within white spaces. Thresholds are necessary to derive inter-line or inter-block distances. This method can be applied to printed documents (which are assumed to have these regular distances) or to well-separated handwritten lines.
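A minimal sketch of recursive XY cuts (all names and the min_gap default are illustrative; a fuller version would also try the other axis before declaring a region a leaf):

```python
import numpy as np

def ink_runs(profile, min_gap):
    # (start, end) ranges of non-zero profile values; blank runs must be
    # at least min_gap long to count as a cut.
    runs, start, blanks = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            blanks = 0
        elif start is not None:
            blanks += 1
            if blanks >= min_gap:
                runs.append((start, i - blanks + 1))
                start, blanks = None, 0
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def rxy_cut(binary, region=None, axis=0, min_gap=10, out=None):
    # Alternately split a region at white gaps of its horizontal
    # (axis=0: rows) and vertical (axis=1: columns) projection profiles.
    if out is None:
        out = []
    if region is None:
        region = (0, binary.shape[0], 0, binary.shape[1])
    r0, r1, c0, c1 = region
    profile = binary[r0:r1, c0:c1].sum(axis=1 - axis)
    segments = ink_runs(profile, min_gap)
    if len(segments) <= 1:
        out.append(region)   # no cut on this axis: keep as a block/line
        return out
    for s, e in segments:
        child = (r0 + s, r0 + e, c0, c1) if axis == 0 else (r0, r1, c0 + s, c0 + e)
        rxy_cut(binary, child, 1 - axis, min_gap, out)
    return out
```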
2.1.2 Smearing Methods

For printed and binarized documents, smearing methods such as the Run-Length Smoothing Algorithm (RLSA) can be applied. Consecutive black pixels along the horizontal direction are smeared: i.e. the white space between them is filled with black pixels if their distance is within a predefined threshold. The bounding boxes of the connected components in the smeared image then enclose text lines.
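A minimal sketch of horizontal run-length smearing (the threshold is a free parameter, typically tied to the expected inter-character gap):

```python
import numpy as np

def rlsa_horizontal(binary, threshold):
    # Fill white gaps shorter than `threshold` between black pixels
    # along each row, so the characters of a line fuse into one blob.
    smeared = binary.copy()
    for row in smeared:
        ink = np.flatnonzero(row)
        for a, b in zip(ink, ink[1:]):
            if b - a - 1 <= threshold:   # gap length between two ink pixels
                row[a:b] = 1
    return smeared
```

The bounding boxes of the connected components of the smeared image (e.g. via scipy.ndimage.label and find_objects) then enclose the text lines.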

A variant of this method, adapted to gray-level images and applied to printed books from the sixteenth century, consists in accumulating the image gradient along the horizontal direction. This method has been adapted to old printed documents within the Debora project. For this purpose, numerous adjustments in the method concern the tolerance for character alignment and line justification.

Text line patterns are found in the work of Shi and Govindaraju by building a fuzzy run-length matrix. At each pixel, the fuzzy run-length is the maximal extent of the background along the horizontal direction; some foreground pixels may be skipped if their number does not exceed a predefined value. This matrix is thresholded to make pieces of text lines appear without ascenders and descenders (Fig. 14). Parameters have to be accurately and dynamically tuned.

2.1.3 Grouping methods

These methods consist in building alignments by aggregating units in a bottom-up strategy. The units may be pixels or of a higher level, such as connected components, blocks or other features such as salient points. Units are then joined together to form alignments. The joining scheme relies on both local and global criteria, which are used for checking local and global consistency respectively.
Fig 14 Text line patterns extracted from a letter of George Washington (reprinted from Shi and Govindaraju). Foreground pixels have been smeared along the horizontal direction.

Contrary to printed documents, a simple nearest-neighbor joining scheme would often fail to group complex handwritten units, as the nearest neighbor often belongs to another line. The joining criteria used in the methods described below are adapted to the type of the units and the characteristics of the documents under study. But every method has to face the following issues:

1) Initiating alignments: one or several seeds for each alignment.

2) Defining a unit's neighborhood for reaching the next unit. It is generally a rectangular or angular area (Fig. 15).

3) Solving conflicts: as one unit may belong to several alignments under construction, a choice has to be made: discard one alignment, or keep both of them, cutting the unit into several parts.

Hence, these methods include one or several quality measures which ensure that the text line under construction is of good quality. When comparing the quality measures of two alignments in conflict, the alignment of lower quality can be discarded (Fig. 15). Also, during the grouping process, it is possible to choose between the different units that can be aggregated within the same neighborhood by evaluating the quality of each of the so-formed alignments.

Fig. 15 Angular and rectangular neighborhoods from point and rectangular units (left). Neighborhood defined by a cluster of units (upper right). Two alignments A and B in conflict: a quality measure will choose A and discard B (lower right).

Quality measures generally include the strength of the alignment, i.e. the
number of units included. Other quality elements may concern component size,
component spacing, or a measure of the alignment’s straightness.

Fig. 16 Text lines extracted on Church Registers

Likforman-Sulem and Faure have developed an iterative method based on perceptual grouping for forming alignments, which has been applied to handwritten pages, author drafts and historical documents. Anchors are detected by selecting connected components elongated in specific directions (0°, 45°, 90°, 135°). Each of these anchors becomes the seed of an alignment. First each anchor, then each alignment, is extended to the left and to the right. This extension uses three Gestalt criteria for grouping components: proximity, similarity and direction continuity. The threshold is iteratively incremented in order to group components within a broader neighborhood until no change occurs. Between iterations, alignment quality is checked by a quality measure which gives higher rates to long alignments including anchors of the same direction. A penalty is given when the alignment includes anchors of different directions. Two alignments may cross each other, or overlap. A set of rules is applied to solve these conflicts, taking into account the quality of each alignment and neighboring components of higher order (Fig. 16).

In the work of Feldbach and Tönnies, body baselines are searched for in images of Church Registers. These documents include many fluctuating and overlapping lines. Baseline units are the minima points of the writing (obtained here from the skeleton). First, basic line segments (BLS) are constructed, joining each minima point to its neighbors. This neighborhood is defined by an angular region (±20°) for the first unit grouped, then by a rectangular region enclosing the points already joined for the remaining ones. Unwanted basic segments arise from minima points detected in descenders and ascenders.

These segments may be isolated or in conflict with others. Various heuristics are defined to eliminate alignments based on their size or the local inter-line distance, and on a quality measure which favors alignments whose units lie in the same direction over nearer units positioned lower or higher than the current direction. Conflicting alignments can be reconstructed depending on the topology of the conflict. The median line is searched for from the baseline and from maxima points (Fig. 16). Pixels lying between a given baseline and median line are clustered into the corresponding text line, while ascenders and descenders are not segmented. Correct segmentation rates between 90% and 97% are reported with adequate parameter adjustment. The seven documents tested range from the 17th to the 19th century.
2.1.4 Methods based on the Hough transform
The Hough transform is a very popular technique for finding straight lines in images. In Likforman-Sulem et al., a method has been developed based on a hypothesis-validation scheme. Potential alignments are hypothesized in the Hough domain and validated in the image domain. Thus, no assumption is made about text line directions (several may exist within the same page). The centroids of the connected components are the units for the Hough transform. A set of units in the image aligned along a line with parameters (ρ, θ) is included in the corresponding cell (ρ, θ) of the Hough domain. Alignments including many units correspond to high-peaked cells of the Hough domain. To take into account fluctuations of handwritten text lines, i.e. the fact that units within a text line are not perfectly aligned, two hypotheses are considered for each alignment, and an alignment is formed from the units of the cell structure of a primary cell.

Fig. 17 Hypothesized cells (ρ0, θ0) and (ρ1, θ1) in Hough space. Each peak
corresponds to perfectly aligned units. An alignment is composed of units
belonging to a cluster of cells (the cell structure) around a primary cell.
A cell structure of a cell (ρ, θ) includes all the cells lying in a cluster centered on (ρ, θ). Consider the cell (ρ0, θ0) having the greatest count of units. A second hypothesis (ρ1, θ1) is searched for in the cell structure of (ρ0, θ0). The alignment chosen between these two hypotheses is the strongest one, i.e. the one which includes the highest number of units in its cell structure, and the corresponding cell (ρ0, θ0) or (ρ1, θ1) is the primary cell (Fig. 17). However, actual text lines rarely correspond to the alignments with the highest number of units, as crossing alignments (running from top to bottom when the writing is horizontal) can contain more units than actual text lines.

A potential alignment is validated (or invalidated) using contextual information, i.e. considering its internal and external neighbors. An internal neighbor of a unit j is a neighbor belonging to the same Hough alignment. An external neighbor is a neighbor outside the Hough alignment which lies within a circle of radius δj around unit j, where δj is the average distance of the internal neighbors from unit j. To be validated, a potential alignment must contain fewer external units than internal ones. This enables the rejection of alignments which have no perceptual relevance. This method can extract oriented text lines and sloped annotations under the assumption that such lines are almost straight (Fig. 18).
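A minimal sketch of the accumulation step (the units are component centroids, as described above; resolution parameters are illustrative):

```python
import numpy as np

def hough_accumulate(points, img_diag, n_theta=180, rho_res=1.0):
    # Vote in (rho, theta) space: a cell with many votes corresponds to
    # a set of nearly collinear units. Neighboring cells around a peak
    # form the "cell structure" used to absorb line fluctuations.
    thetas = np.deg2rad(np.arange(n_theta))
    n_rho = int(2 * img_diag / rho_res) + 1
    acc = np.zeros((n_rho, n_theta), dtype=int)
    votes = {}                     # cell -> indices of contributing units
    for i, (x, y) in enumerate(points):
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        rows = ((rhos + img_diag) / rho_res).astype(int)
        for t, r in enumerate(rows):
            acc[r, t] += 1
            votes.setdefault((r, t), []).append(i)
    return acc, votes
```

Validation against internal and external neighbors, as described above, would then operate on the point lists stored in `votes`.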

The Hough transform can also be applied to fluctuating lines of handwritten drafts, as in Pu and Shi. The Hough transform is first applied to minima points (units) in a vertical strip on the left of the image. The alignments in the Hough domain are searched starting from a main direction, by grouping cells in an exhaustive search in 6 directions. Then a moving window, associated with a clustering scheme in the image domain, assigns the remaining units to alignments. The clustering scheme (Natural Learning Algorithm) allows the creation of new lines starting in the middle of the page.
Fig. 18 Text lines extracted on an autograph of Miguel Angel Asturias. The
orientations of traced lines correspond to those of the primary cells found
in Hough space.
2.1.5 Repulsive-Attractive network method

An approach based on attractive-repulsive forces is presented in Oztop et al. It works directly on gray-level images and consists in iteratively adapting the y-position of a predefined number of baseline units. Baselines are constructed one by one from the top of the image to the bottom. Pixels of the image act as attractive forces for baselines, and already extracted baselines act as repulsive forces. The baseline to extract is initialized just under the previously examined one, in order to be repelled by it and attracted by the pixels of the line below (the first one is initialized in the blank space at the top of the document). The lines must have similar lengths. The result is a set of pseudo-baselines, each one passing through the word bodies (Fig. 19). The method is applied to ancient Ottoman document archives and Latin texts.

Fig. 19 Pseudo-baselines extracted by a repulsive-attractive network on an ancient Ottoman text (reprinted from Oztop et al).
2.1.6 Stochastic method
We present here a method based on a probabilistic Viterbi algorithm (Tseng and Lee), which derives non-linear paths between overlapping text lines. Although this method has been applied to modern Chinese handwritten documents, the principle could be extended to historical documents, which often include overlapping lines. Lines are extracted through hidden Markov modeling. The image is first divided into little cells (sized according to stroke width), each one corresponding to a state of the HMM (Hidden Markov Model). The best segmentation paths are searched from left to right; they correspond to paths which do not cross many black pixels and which are as straight as possible.

However, the displacement in the graph is limited to the immediately superior or inferior cells. All best paths ending at each y location of the image are considered first. Elimination of some of these paths uses a quality threshold T: a path whose probability is less than T is discarded. Shifted paths are easily eliminated (and close paths are removed on quality criteria). The method succeeds when the ground-truth path between text lines changes only slightly along the y-direction (Fig. 20). In the case of touching components, the path of highest probability will cross the touching component at points with as few black pixels as possible. But the method may fail if the contact point contains many black pixels.
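A minimal dynamic-programming sketch in the spirit of this Viterbi search (here phrased as a minimum-cost path rather than a probability maximization; `cost` would be, e.g., the amount of ink in each grid cell plus a straightness penalty):

```python
import numpy as np

def best_separating_path(cost):
    # Left-to-right path whose vertical moves are limited to one cell up
    # or down per step and which accumulates as little cost as possible.
    h, w = cost.shape
    total = np.full((h, w), np.inf)
    back = np.zeros((h, w), dtype=int)
    total[:, 0] = cost[:, 0]
    for x in range(1, w):
        for y in range(h):
            for dy in (-1, 0, 1):          # stay level, or shift one cell
                yp = y + dy
                if 0 <= yp < h and total[yp, x - 1] + cost[y, x] < total[y, x]:
                    total[y, x] = total[yp, x - 1] + cost[y, x]
                    back[y, x] = yp
    # Backtrack from the cheapest endpoint in the last column.
    path = [int(np.argmin(total[:, -1]))]
    for x in range(w - 1, 0, -1):
        path.append(back[path[-1], x])
    return path[::-1]        # y position of the separating path at each x
```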

Fig. 20 Segmentation paths obtained by a stochastic method

2.1.7 Water reservoir principle
The water reservoir principle is as follows. If water is poured from the top
(bottom) of a component, the cavity regions of the component where water is
stored are considered the top (bottom) reservoirs (Pal et al 2003). Here, two
Oriya characters touch and create a large space which represents the bottom
reservoir. This large space is very useful for touching character detection and
segmentation. Owing to the shape of Oriya characters, a small top reservoir is also generated by the touching (see figure 21).

This small top reservoir also helps in touching-character detection and segmentation. Not all reservoirs are considered for further processing: only reservoirs whose heights are greater than a threshold T1 are selected. For a component, the value of T1 is chosen as 1/9 of the component height (the threshold was determined experimentally). We now discuss some terms relating to water reservoirs that will be used in feature extraction.

Top reservoir: By top reservoir of a component, we mean the reservoir obtained when water is poured from the top of the component.

Bottom reservoir: By bottom reservoir of a component, we mean the reservoir obtained when water is filled from the bottom of the component. A bottom reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 180°.

Left (right) reservoir: If water is poured from the left (right) side of a component, the cavity regions of the component where water is stored are considered the left (right) reservoirs. A left (right) reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 90° clockwise (anti-clockwise).
Water reservoir area: By area of a reservoir, we mean the area of the cavity region where water can be stored if water is poured from a particular side of the component. The number of pixels inside a reservoir is computed, and this number is considered the area of the reservoir.

Water flow level: The level from which water overflows from a reservoir is called the water flow level of the reservoir (see figure 22).

Reservoir baseline: A line passing through the deepest point of a reservoir and parallel to the water flow level of the reservoir is called the reservoir baseline (see figure 22).

Height of a reservoir: By height of a reservoir, we mean the depth of water in the reservoir; in other words, the height of a reservoir is the normal distance between the reservoir baseline and the water flow level of the reservoir. In figure 22, H denotes the reservoir height.

Width of a reservoir: By width of a reservoir, we mean the normal distance between the two extreme boundaries (perpendicular to the baseline) of a reservoir.

Fig. 21 Examples of big reservoirs created by touching (because of the touching of two characters, a big bottom reservoir is formed here).

Fig. 22 Illustration of different features obtained from the water reservoir principle. 'H' denotes the height of the bottom reservoir. The gray area of the zoomed portion represents the reservoir base area.

In each selected reservoir we compute its base-area points. By base-area points of a reservoir, we mean those border points of the reservoir whose height from the baseline of the reservoir is less than 2×RL. Base-area points for a component are shown in the zoomed-in version of figure 22. Here RL is the length of the most frequently occurring black runs of a component; in other words, RL is the statistical mode of the black run lengths of the component. The value of RL is calculated as follows. The component is scanned both horizontally and vertically. If for a component we get n different run lengths r1, r2, . . ., rn with frequencies f1, f2, . . ., fn respectively, then RL = ri, where fi = max(fj), j = 1 . . . n.
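The computation of RL can be sketched directly from this definition. This is a minimal sketch, assuming the component is given as a boolean[][] binary image (true = black); the class name is illustrative.

    import java.util.HashMap;
    import java.util.Map;

    public class RunLengthMode {
        // Returns RL, the most frequent black run length over both scan directions.
        public static int rl(boolean[][] img) {
            Map<Integer, Integer> freq = new HashMap<>();
            int h = img.length, w = img[0].length;
            for (int y = 0; y < h; y++) {                  // horizontal runs
                int run = 0;
                for (int x = 0; x <= w; x++) {
                    if (x < w && img[y][x]) run++;
                    else { if (run > 0) freq.merge(run, 1, Integer::sum); run = 0; }
                }
            }
            for (int x = 0; x < w; x++) {                  // vertical runs
                int run = 0;
                for (int y = 0; y <= h; y++) {
                    if (y < h && img[y][x]) run++;
                    else { if (run > 0) freq.merge(run, 1, Integer::sum); run = 0; }
                }
            }
            int best = 1, bestF = 0;                       // pick the mode
            for (Map.Entry<Integer, Integer> e : freq.entrySet())
                if (e.getValue() > bestF) { bestF = e.getValue(); best = e.getKey(); }
            return best;
        }
    }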
CHAPTER - 3

3.1 PROCESSING OF OVERLAPPING COMPONENTS

Overlapping components are the main challenge for text line extraction, since no white space is left between lines. Some of the methods surveyed above do not need to detect such components, either because they extract only baselines, or because the method itself includes criteria that make paths avoid crossing black pixels. This section only deals with methods where ambiguous (overlapping) components are actually detected before, during or after text line segmentation. Criteria such as component size, the fact that the component belongs to several alignments, or on the contrary to no alignment, can be used to detect ambiguous components.

Once the component is detected as ambiguous, it must be classified into three categories: an overlapping component which belongs to the upper alignment, an overlapping component which belongs to the lower alignment, or a touching component which has to be decomposed into several parts (two or more, as components may belong to three or more alignments in historical documents). The separation along the vertical direction is a hard problem which can be done roughly (horizontal cut), or more accurately by analyzing stroke contours and referring to typical configurations (Fig. 23).

Fig. 23 Set of typical overlapping configurations

The grouping technique presented in the section on grouping methods detects an ambiguous component during the grouping process when a conflict occurs between two alignments. A set of rules is applied to label the component as overlapping or touching. The ambiguous component extends into each alignment region. The rules use as features the density of black pixels of the component in each alignment region, alignment proximity and contextual information (the positions of both alignments around the component). An overlapping component is assigned to only one alignment.
In the piece-wise projection method, the document page is first cut into eight equal columns, and a projection profile is computed for each column. In each histogram, two consecutive minima delimit a text block. In order to detect overlapping components, a k-means clustering scheme is used to classify the extracted text blocks into three classes: big, average and small. Overlapping components necessarily belong to big physical blocks, so all the overlapping cases are found in the big-block class, while all the "one line" blocks are grouped in the average-block class. A second k-means clustering scheme finds the actual inter-line blocks; together with the "one line" block size, this determines the number of pieces into which a large text block must be cut (cf. Fig. 24). The document is divided into vertical strips. Profile cuts within each strip are computed to obtain anchor points of segmentation (PSLs) which do not cross any black pixels. These points are then grouped across strips by neighboring criteria.
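The big/average/small classification can be illustrated with a one-dimensional k-means over the block heights. This is a sketch under assumptions: k = 3, centroids initialized at the minimum, midpoint and maximum height, and a fixed iteration count; the published method may differ in these details.

    // One-dimensional k-means (k = 3) over text-block heights, sketching the
    // small / average / big classification. Initialization is illustrative.
    public class BlockKMeans {
        public static int[] classify(double[] h) {
            double min = h[0], max = h[0];
            for (double v : h) { min = Math.min(min, v); max = Math.max(max, v); }
            double[] c = { min, (min + max) / 2, max };    // small, average, big
            int[] label = new int[h.length];
            for (int it = 0; it < 50; it++) {              // Lloyd iterations
                double[] sum = new double[3]; int[] cnt = new int[3];
                for (int i = 0; i < h.length; i++) {
                    int best = 0;                          // nearest centroid
                    for (int k = 1; k < 3; k++)
                        if (Math.abs(h[i] - c[k]) < Math.abs(h[i] - c[best])) best = k;
                    label[i] = best; sum[best] += h[i]; cnt[best]++;
                }
                for (int k = 0; k < 3; k++) if (cnt[k] > 0) c[k] = sum[k] / cnt[k];
            }
            return label;   // 0 = small, 1 = average, 2 = big (by initial ordering)
        }
    }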

Fig. 24 Text line segmentation
If no segmentation point is present in the adjacent strip, the baseline is extended up to the first black pixel encountered, which belongs to an overlapping or touching component. This component is classified as overlapping or touching by analyzing its vertical extension (upper, lower) on each side of the intersection point. An empirical rule classifies the component. In the touching case, the component is horizontally cut at the intersection point (Fig. 25).

Fig. 25 Overlapping components (circle) separated into two parts (rectangle) in Bangla writing.

Some solutions for separating units belonging to several text lines can also be found in the case of mail pieces and handwritten databases, where efforts have been made for recognition purposes. In one approach, separation is made from the skeleton of touching characters and the use of a dictionary of possible touching configurations (Fig. 23). In Bruzzone and Coffetti, the contact point between ambiguous strokes is detected and processed from their external border.
An accurate analysis of the contour near the contact point is performed in order to separate the strokes according to two registered configurations: a loop in contact with a stroke, or two loops in contact. In simple cases of handwritten pages, the center of gravity of the connected component is used either to associate the component with the current line or with the following line, or to cut the component into two parts. This works well if the component is a single character. It may fail if the component is a word, part of a word, or even several words.
3.2 PROPOSED METHOD

PIECE-WISE PROJECTION METHOD

The global horizontal projection method computes the sum of all black pixels on every row and constructs the corresponding histogram. Based on the peak/valley points of the histogram, individual lines are generally segmented. Although this global horizontal projection method is applicable for line segmentation of printed documents, it cannot be used on unconstrained handwritten documents, because the characters of two consecutive text lines may touch or overlap. For example, see the 4th and 5th text lines of the document shown in figure 26a.
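The global method itself is a one-count-per-row projection. A minimal sketch, assuming a boolean[][] binary image (true = black); the peak/valley analysis of the returned histogram is left to the caller.

    // Global horizontal projection: ink count per row. Valleys (low counts)
    // between peaks suggest line boundaries in printed documents.
    public class HorizontalProjection {
        public static int[] profile(boolean[][] img) {
            int[] hist = new int[img.length];
            for (int y = 0; y < img.length; y++)
                for (int x = 0; x < img[y].length; x++)
                    if (img[y][x]) hist[y]++;
            return hist;
        }
    }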

Fig. 26 (a) The N stripes and the PSL lines in each stripe, shown for a sample of handwritten text. (b) The potential PSLs of figure 26a.

Here these two lines are mostly overlapping. To take care of unconstrained handwritten documents, we use a piece-wise projection method, as follows. At first, we divide the text into vertical stripes of width W (here we assume that a document page is in portrait mode). The width of the last stripe may differ from W: if the text width is Z and the number of stripes is N, the width of the last stripe is [Z − W × (N − 1)].
The computation of W is discussed later. Next, we compute piece-wise separating lines (PSLs) in each of these stripes. We compute the row-wise sum of all black pixels of a stripe; a row where this sum is zero is a PSL. We may get a few consecutive rows where the sum of all black pixels is zero, in which case the first row of such consecutive rows is taken as the PSL. The PSLs of different stripes of a text are shown in figure 26a by horizontal lines. Not all these PSLs are useful for line segmentation, so we choose some potential PSLs as follows. We compute the normal distances between two consecutive PSLs in a stripe, so if there are n PSLs in a stripe we get n − 1 distances.
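The PSL computation just described can be sketched as follows for a single stripe spanning columns [x0, x1); the class and method names are illustrative, and the page is again assumed to be a boolean[][] binary image.

    import java.util.ArrayList;
    import java.util.List;

    // Piece-wise separating lines for one stripe: a PSL is the first row of
    // each maximal run of all-white rows within the stripe.
    public class StripePSL {
        public static List<Integer> psls(boolean[][] img, int x0, int x1) {
            List<Integer> out = new ArrayList<>();
            boolean prevWhite = false;
            for (int y = 0; y < img.length; y++) {
                boolean white = true;
                for (int x = x0; x < x1 && white; x++) if (img[y][x]) white = false;
                if (white && !prevWhite) out.add(y);   // first row of a white run
                prevWhite = white;
            }
            return out;
        }
    }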

This is done for all stripes. We compute the statistical mode (MPSL) of these distances. If the distance between any two consecutive PSLs of a stripe is less than MPSL, we remove the upper PSL of the two. The PSLs obtained after this removal are the potential PSLs. The potential PSLs obtained from the PSLs of figure 26a are shown in figure 26b. We note the left and right co-ordinates of each potential PSL for future use. By properly joining these potential PSLs, we get individual text lines. It may be noted that, because of overlapping or touching between a component of the upper line and a component of the lower line, we may sometimes not get PSLs in some regions. Also, because of some modified characters of Telugu, we find some extra PSLs in a stripe. We take care of both cases during PSL joining, as explained next. Joining of PSLs is done in two steps.
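The two operations just described, computing the mode MPSL of consecutive-PSL distances pooled over all stripes and removing the upper PSL of any too-close pair, can be sketched as below; the names are illustrative.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PotentialPSL {
        // MPSL: the mode of consecutive-PSL distances over all stripes.
        public static int mode(List<List<Integer>> allStripes) {
            Map<Integer, Integer> freq = new HashMap<>();
            for (List<Integer> s : allStripes)
                for (int i = 1; i < s.size(); i++)
                    freq.merge(s.get(i) - s.get(i - 1), 1, Integer::sum);
            int best = 1, f = 0;
            for (Map.Entry<Integer, Integer> e : freq.entrySet())
                if (e.getValue() > f) { f = e.getValue(); best = e.getKey(); }
            return best;
        }
        // Keep only potential PSLs: drop the upper PSL of pairs closer than mpsl.
        public static List<Integer> filter(List<Integer> psls, int mpsl) {
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < psls.size(); i++) {
                boolean tooClose = i + 1 < psls.size()
                        && psls.get(i + 1) - psls.get(i) < mpsl;
                if (!tooClose) out.add(psls.get(i));
            }
            return out;
        }
    }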

In the first step, we join PSLs from right to left; in the second step, we check whether the line-wise PSL joining is complete or not. If for a line it is not complete, joining from left to right is done to obtain complete segmentation. We say the PSL joining of a line is complete if the length of the joined PSLs equals the column width of the document image. This two-step approach gives good results even if two consecutive text lines overlap or are connected.
To join a PSL of the ith stripe, say Ki, to a PSL of the (i − 1)th stripe, we check whether the (i − 1)th stripe contains any PSL whose normal distance from Ki is less than MPSL. If it exists, we join the left co-ordinate of Ki with the right co-ordinate of that PSL in the (i − 1)th stripe. If it does not exist, we extend Ki horizontally to the left until it reaches the left boundary of the (i − 1)th stripe or intersects a black pixel of a component in the (i − 1)th stripe. If the extended part intersects a black pixel of a component of the (i − 1)th stripe, we decide the "belongingness" of the component to the upper line or the lower line. Based on the belongingness of this component, we extend the line in such a way that the component falls in its actual line. The belongingness of a component is decided as follows.
We compute the distances from the intersecting point to the topmost and bottommost points of the component. Let d1 be the top distance and d2 the bottom distance.

If d1 < d2 and d1 < (MPSL/2), then the component belongs to the lower line.
If d2 ≤ d1 and d2 < (MPSL/2), then the component belongs to the upper line.
If d1 > (MPSL/2) and d2 > (MPSL/2), then we assume the component touches another component of the lower line.

If the component belongs to the upper line (lower line), then the line is extended following the contour of the lower part (upper part) of the component, so that the component is included in the upper line (lower line).
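The three rules translate directly into code. Below is a minimal sketch of the decision, with d1, d2 and MPSL passed in as pixel distances; the enum is an illustrative device, not part of the original method.

    // Belongingness rules for a component hit while extending a PSL.
    // d1 = distance from the intersection to the component's topmost point,
    // d2 = distance to its bottommost point.
    public class Belongingness {
        public enum Owner { LOWER_LINE, UPPER_LINE, TOUCHING }
        public static Owner decide(int d1, int d2, int mpsl) {
            if (d1 < d2 && d1 < mpsl / 2.0) return Owner.LOWER_LINE;
            if (d2 <= d1 && d2 < mpsl / 2.0) return Owner.UPPER_LINE;
            return Owner.TOUCHING;   // both distances large: assume touching lines
        }
    }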

The line extension is done until it reaches the left boundary of the (i − 1)th stripe. If the component is touching, we detect possible touching points based on the structural shape of the touching component. From experiments, we notice that in most touching cases there exist junction/crossing shapes, or some obstacle points of low black-pixel density in the middle portion of the touching component. These obstacle points and the junction/crossing shapes help to find the touching position. The PSL is then extended through this touching point to segment the component into two parts.

Fig. 27 Line-segmented result of the text shown in figure 26. Text line segmentation is shown by solid lines. (a) The two end points of a mis-segmented line XY are marked by circles. (b) The correct segmentation is shown.
Sometimes, because of some modified characters, we may get wrongly segmented lines. For example, see the line marked XY (see figure 27a). To take care of such wrong lines, we compute the density of black pixels and compare this value with the candidate length of the line. (By candidate length of a line we mean the distance between the leftmost column of the leftmost component and the rightmost column of the rightmost component of that line.)

Let L be the candidate length of a line. We scan each column of the portion of the line that lies within the candidate length and check for the presence of black pixels. If black pixels are present in fewer than 50% of the columns of that line, the line is not a valid line, and we delete the lower boundary of this line to merge it with the line below. Thus a mis-segmented line like XY of figure 27a is corrected; the corrected line segmentation result is shown in figure 27b.
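This validity test can be sketched as a column scan between the line's upper and lower boundaries; yTop, yBot, xL and xR are assumed inputs delimiting the candidate line within the binary image.

    // Validity test for a candidate line: if fewer than half of the columns
    // spanned by the candidate length contain ink, reject the line.
    public class LineValidity {
        public static boolean isValid(boolean[][] img, int yTop, int yBot,
                                      int xL, int xR) {
            int inked = 0;
            for (int x = xL; x <= xR; x++) {
                for (int y = yTop; y <= yBot; y++)
                    if (img[y][x]) { inked++; break; }     // column has ink
            }
            return inked >= 0.5 * (xR - xL + 1);           // the 50% rule
        }
    }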

To get a size-independent measure, the computation of W is done as follows. We compute the statistical mode (md) of the widths of the bottom reservoirs obtained from the text. This mode is generally equal to the character width. Since the average number of characters in a word is four (we computed word-length statistics), the value of W is taken as 4×md so that the stripe width approximates the word width. The proposed line-segmentation method does not depend on the size or style of the handwriting: even if the handwritten lines overlap, touch or are curved, the proposed scheme works. For word segmentation within a line, we compute the vertical histogram of the line. In general, the distance between two consecutive words of a line is greater than the distance between two consecutive characters within a word. Taking the vertical histogram of the line and using this distance criterion, we segment words from lines.
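The word segmentation step can be sketched as a gap search over the vertical histogram. The gap threshold is an assumed parameter: the text does not fix its value, so the character-width mode md is a plausible choice.

    import java.util.ArrayList;
    import java.util.List;

    // Word segmentation inside one text line: gaps in the vertical projection
    // wider than gapThresh separate words. gapThresh is an assumed parameter.
    public class WordSegmenter {
        public static List<int[]> words(boolean[][] line, int gapThresh) {
            int w = line[0].length;
            int[] vhist = new int[w];                      // ink count per column
            for (boolean[] row : line)
                for (int x = 0; x < w; x++) if (row[x]) vhist[x]++;
            List<int[]> spans = new ArrayList<>();
            int start = -1, end = -1, gap = 0;
            for (int x = 0; x < w; x++) {
                if (vhist[x] > 0) {
                    if (start < 0) start = x;
                    end = x; gap = 0;
                } else if (start >= 0 && ++gap >= gapThresh) {
                    spans.add(new int[]{ start, end });    // gap wide enough: close word
                    start = -1;
                }
            }
            if (start >= 0) spans.add(new int[]{ start, end });
            return spans;   // one [firstColumn, lastColumn] pair per word
        }
    }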
3.2.1 Flowchart

Divide the text into vertical stripes
↓
Compute piece-wise separating lines
↓
Choose some potential PSLs
↓
Join these PSLs
↓
Compute the belongingness of the component

Fig. 28 Flow chart of the algorithm


3.2.2 ALGORITHM

In short, the line segmentation algorithm (LINE-SEGM) is as follows:

Algorithm LINE-SEGM

Step 1: Divide the text into vertical stripes of width W.

Step 2: Compute piece-wise separating lines (PSL) from each of these stripes as
discussed earlier.

Step 3: Compute potential PSLs from the PSLs obtained in step 2.

Step 4: Choose the rightmost top potential PSL and extend it (from right to left) to the previous stripe.

Step 5: Continue this PSL joining from right to left until we reach the left
boundary of the left-most stripe.

Step 6: Check whether the length of the line drawn equals the width of the document. If yes, go to step 7. Otherwise, PSL line extension is done to the right until we reach the right boundary of the document.

Step 7: Repeat steps 4 to 6 for the potential PSLs not considered for joining so
far. If there is no more PSL for joining, stop.
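A driver skeleton tying these steps together is sketched below. It reuses the illustrative StripePSL and PotentialPSL classes from the earlier sketches; the joining and belongingness logic of steps 4 to 7 is only indicated by comments, so this is an outline, not a full implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Driver skeleton for LINE-SEGM over a boolean[][] binary page image.
    public class LineSegm {
        public static void segment(boolean[][] img, int stripeWidth) {
            int w = img[0].length;
            List<List<Integer>> perStripe = new ArrayList<>();
            for (int x0 = 0; x0 < w; x0 += stripeWidth) {            // Steps 1-2
                int x1 = Math.min(x0 + stripeWidth, w);
                perStripe.add(StripePSL.psls(img, x0, x1));
            }
            int mpsl = PotentialPSL.mode(perStripe);                 // Step 3
            for (int s = 0; s < perStripe.size(); s++)
                perStripe.set(s, PotentialPSL.filter(perStripe.get(s), mpsl));
            // Steps 4-7: starting from the rightmost stripe, join each potential
            // PSL to a PSL of the previous stripe within distance mpsl; when no
            // partner exists, extend horizontally and, on hitting ink, apply the
            // belongingness rules (Belongingness.decide) or cut a touching pair.
        }
    }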
Let us now see these steps in detail.

Step 1: Divide the given text into a number of vertical stripes.

Fig. 29 (a) Original text image divided into stripes (1st stripe, 2nd stripe, …, nth stripe). (b) Output of the text image.


Step 2: Compute the piece-wise separating lines.

Fig. 30 PSLs of the text

Compute the row-wise sum of all black pixels of a stripe. A row where the sum is zero is a PSL. If there are a few consecutive rows where the number of black pixels is zero, then the first row of such rows is the PSL.

Step 3: Choose only potential PSLs

Fig. 31 Potential PSLs of the text

All the PSLs may not be useful for line segmentation, so we choose some potential PSLs among them. Compute the normal distances between two consecutive PSLs in a stripe; if there are n PSLs, we get n − 1 distances. This is done for all stripes. Compute the statistical mode MPSL of these distances. If the distance between any two consecutive PSLs of a stripe is less than MPSL, remove the upper PSL of the two. The PSLs obtained after this removal are the potential PSLs.

Step 4: Join the PSLs

Fig. 32 Joining the PSLs

Joining of PSLs is done in two steps:

i) First, we join PSLs from right to left.

ii) Then we check whether the line-wise PSL joining is complete. If for a line it is not complete, joining from left to right is done to obtain complete segmentation.

We say the PSL joining of a line is complete if the length of the joined PSLs equals the column size (width) of the document image. This two-step approach gives good results even if two consecutive text lines overlap or are connected.

Step 5: Compute belongingness of the component

Fig. 33 Belongingness of the component

If the extended part intersects a black pixel of a component, we compute the belongingness of that component. We compute the distances from the intersecting point to the topmost and bottommost points of the component; let d1 be the distance to the topmost point and d2 the distance to the bottommost point.

If d1 < d2 and d1 < (MPSL/2), then the component belongs to the lower line. If d2 ≤ d1 and d2 < (MPSL/2), then the component belongs to the upper line.
The following result is obtained after all the steps.

Fig. 34 Complete line segmentation obtained after all the steps

3.3 APPLICATIONS
3.3.1 Practical Applications

In recent years, OCR (Optical Character Recognition) technology has been applied throughout the entire spectrum of industries, revolutionizing the document management process. OCR has enabled scanned documents to become more than just image files, turning them into fully searchable documents with text content that is recognized by computers. With the help of OCR, people no longer need to manually retype important documents when entering them into electronic databases. Instead, OCR extracts relevant information and enters it automatically. The result is accurate, efficient information processing in less time.

3.3.2 Banking

The uses of OCR vary across different fields. One widely known
application is in banking, where OCR is used to process checks without human
involvement. A check can be inserted into a machine, the writing on it is scanned
instantly, and the correct amount of money is transferred. This technology has
nearly been perfected for printed checks, and is fairly accurate for handwritten
checks as well, though it occasionally requires manual confirmation. Overall, this
reduces wait times in many banks.

3.3.3 Legal

In the legal industry, there has also been a significant movement to digitize paper documents. In order to save space and eliminate the need to sift
through boxes of paper files, documents are being scanned and entered into
computer databases. OCR further simplifies the process by making documents
text-searchable, so that they are easier to locate and work with once in the
database. Legal professionals now have fast, easy access to a huge library of
documents in electronic format, which they can find simply by typing in a few
keywords.

3.3.4 Healthcare
Healthcare has also seen an increase in the use of OCR technology to
process paperwork. Healthcare professionals always have to deal with large
volumes of forms for each patient, including insurance forms as well as general
health forms. To keep up with all of this information, it is useful to input relevant
data into an electronic database that can be accessed as necessary. Form
processing tools, powered by OCR, are able to extract information from forms
and put it into databases, so that every patient's data is promptly recorded. As a
result, healthcare providers can focus on delivering the best possible service to
every patient.

3.3.5 OCR in Other Industries

OCR is widely used in many other fields, including education, finance, and
government agencies. OCR has made countless texts available online, saving
money for students and allowing knowledge to be shared. Invoice imaging
applications are used in many businesses to keep track of financial records and
prevent a backlog of payments from piling up. In government agencies and
independent organizations, OCR simplifies data collection and analysis, among
other processes. As the technology continues to develop, more and more
applications are found for OCR technology, including increased use of
handwriting recognition. Furthermore, other technologies related to OCR, such as barcode recognition, are used daily in retail and other industries.

3.3.6 Resume processing

Several of the industry leaders in resume processing software use Prime OCR to generate high-accuracy results. Some customers use the text results straight from Prime OCR, while others choose to manually verify OCR results with Prime Verify for maximum accuracy. One of the largest resume processing facilities leverages Prime OCR's increased accuracy by providing recruiting customers the same accuracy of results without having to manually verify each resume. They take the results straight from Prime OCR and deliver them to the customer, passing on the savings of processing large batches of resumes. What used to take days to send offshore for OCR and manual verification can now all be done overnight in a local facility, with Prime OCR software.

3.3.7 Library archives/Digital Library

Digital library initiatives are adopting advanced OCR technology like Prime OCR to convert large book collections for on-line viewing of content. Not only is Prime OCR designed to generate accurate results, but it can also provide a level of reliability that cannot be found in traditional desktop OCR software.

A large university's project of converting large collections and providing the content on-line was improved by Prime OCR's ability to provide high-accuracy results. The results were so impressive that all of the material that had been previously processed was run through Prime OCR a second time to improve the ability to find textual information in the collection.

3.3.8 Document identification

An added option of Prime OCR allows the software to accurately identify different types of documents. Using high-accuracy OCR output coupled with text-parsing technology, Prime OCR is able to identify and group together different styles of documents or forms. In one case, there were over 250 different documents that could be identified by the software. Customers use the identification information to search for and retrieve pertinent documents in the document database.

Medical facilities, including hospitals, generate a large number of different styles of documents when admitting and treating patients. The document collection for an individual is typically stored under the person's name or a tracking number. Unique attributes of each style of document were fed into Prime OCR to successfully identify the document type of each page within each individual's document collection. Hospital staff can now electronically retrieve a patient's records, including all accompanying documentation, and review pertinent history or lab results from the patient's record.

3.3.9 E-books

On-line retailers use Prime OCR's RTF results to retain text format and
layout to re-create books that can be marketed as e-books. Prime OCR's
character accuracy and retention of format allow clients to efficiently reproduce
machine printed material into electronic media.

Various clients use Prime OCR's high accuracy results to save time and
money in generating on-line content from bound books. Not only does Prime
OCR generate high accuracy character results but it retains excellent formatting
which cuts down on the time to format each page of the book for on-line viewing.

3.3.10 Invoice and shipping receipt processing

Numerous applications demand high-accuracy OCR results for reliable operation. Customers use Prime OCR to extract the invoice or bill-of-lading number from the document and rename the scanned document to match the invoice number, or convert the image to a different format for viewing on the web. With the invoice or bill-of-lading number, they are able to quickly perform an electronic search and retrieval of the scanned document, improving customer service.

A large shipping company uses Prime Zone to scan bar codes on signed billing receipts. Once scanned into the system, users can view the signed receipt on-line by searching for the shipping reference number. Customer service personnel are able to e-mail the scanned signed receipt within seconds, instead of taking days to find a filed hard copy of the receipt. Another large shipping company OCRs the invoice number from the scanned invoice and has customized Prime OCR to rename the image file to the invoice number, facilitating document storage, search and retrieval.

3.4 SOFTWARE & HARDWARE REQUIREMENTS

Software Requirements

Development tool: JCreator
Image viewer: IrfanView
Language: Java
OS: any Windows-based PC

Hardware Requirements

RAM: 512 MB
HDD: 20 GB or higher
CHAPTER - 4

4.1 RESULTS
VERTICAL STRIPES & PSLs OF THE INPUT IMAGE

Fig. 35 Vertical stripes and PSLs of the input image

FILTERED PSLs

Fig. 36 Filtered PSLs

JOINING OF PSLs

Fig. 37 Joining of PSLs
CHAPTER - 5
5.1 CONCLUSION

This paper has provided a comprehensive review of the methods for off-line handwritten text line segmentation previously proposed by researchers. After a brief description of the characteristics of text line structures in handwritten documents, we described the challenges in text line segmentation. We also reviewed the different approaches to segmenting a handwritten document into text lines and proposed a taxonomy. An extensive performance evaluation and quantitative comparison of the experimental results of the previously proposed methods was performed, and a study was made of the different optical character recognition systems developed for Indian scripts. The technologies of these OCRs are discussed at length in this paper, which can be used as a starting point for researchers entering this area.
BIBLIOGRAPHY

Rodolfo P. dos Santos, Gabriela S. Clemente, Tsang Ing Ren and George D. C. Cavalcanti, Center of Informatics, Federal University of Pernambuco, Recife, PE, Brazil. www.cin.ufpe.br/~viisar; {rps2,gsc2,tir,gdcc}@cin.ufpe.br

AIM, Inc., 634 Alpha Drive, Pittsburgh, PA 15238-2802, USA. Email: aidc@aimglobal.org; website: www.aimglobal.org

Suen, C. Y., et al., "Future Challenges in Handwriting and Computer Applications", 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987. http://users.erols.com/rwservices/pens/biblio88.html#Suen88, retrieved 2008-10-03

Tappert, Charles C., et al., "The State of the Art in On-line Handwriting Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 8, August 1990, pp. 787 ff. http://users.erols.com/rwservices/pens/biblio90.html#Tappert90c, retrieved 2008-10-03

N. Tripathy and U. Pal, Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700 108, India. Email: umapada@isical.ac.in. MS received 19 June 2004; revised 11 May 2006

G. Louloudis and C. Halatsis, Department of Informatics and Telecommunications, University of Athens, Greece. http://www.di.uoa.gr; louloud@mm.di.uoa.gr, halatsis@di.uoa.gr

M. K. Jindal, Department of Computer Applications, Panjab University Regional Centre, Muktsar, Punjab, India. manishphd@rediffmail.com

R. K. Sharma, School of Mathematics and Computer Applications, Thapar Institute of Engineering and Technology, Patiala, Punjab, India. rksharma@tiet.ac.in

G. S. Lehal, Department of Computer Science, Punjabi University, Patiala, Punjab, India. gslehal@gmail.com

Arun Agarwal and C. Raghavendra Rao, Department of Computer & Information Science, University of Hyderabad, Andhra Pradesh, India. anuradhabs@yahoo.co.in

Laurence Likforman-Sulem, GET-Ecole Nationale Supérieure des Télécommunications/TSI and CNRS-LTCI, 46 rue Barrault, 75013 Paris, France. likforman@tsi.enst.fr; http://www.tsi.enst.fr/~lauli/

T. K. Bhowmik, IBM Global Services Pvt Ltd, Embassy Golf Link, Bangalore 560 071, India. tbhowmik@in.ibm.com

A. Roy and U. Roy, Dept. of Computer and System Sciences, Visva-Bharati, Santiniketan 731235, India. uroyin@yahoo.co.in
