
PRODUCT OVERVIEW OpenText Capture Center

OpenText Capture Center


Classifying and extracting data from documents: OCR, ICR, and IDR

The content of scanned documents (or faxes) is readable only by humans. For computer applications, they are merely a collection of meaningless pixels. In order to facilitate the automatic storage of documents within a repository and provide correct attributes for those documents, the type of document and relevant metadata have to be known. In order to automatically trigger business processes and populate business transactions with information, relevant data from the documents must be available.

What is all this: OCR, ICR, IDR?
A scanned document is just a bunch of pixels.
Optical Character Recognition (OCR) yields the character code for each written character on the document.
Intelligent Character Recognition (ICR) extends OCR to contextual algorithms. ICR also allows for the recognition of hand-printed characters.
Intelligent Document Recognition (IDR) analyzes a document and finds specific information from an unknown document layout, like an order number from a purchase order. IDR typically builds on top of OCR/ICR results.
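
To make the IDR idea concrete, here is a minimal Python sketch that looks for an order number in OCR text whose layout is not known in advance, by searching for a label and taking the value next to it. The function name `find_order_number`, the label pattern, and the sample strings are illustrative assumptions; this is not OCC code and says nothing about how OCC implements IDR.

```python
import re
from typing import Optional

# Hypothetical label pattern: "Order No: 123", "Order Number 123", "PO #123".
# "number" is listed before "no" so that "Number" is consumed as a whole label.
ORDER_LABEL = re.compile(
    r"\b(?:order|po)\s*(?:number|no\.?|#)?\s*:?\s*(\d[\d-]{3,})",
    re.IGNORECASE,
)

def find_order_number(ocr_text: str) -> Optional[str]:
    """Return the first order number that follows an order label, if any."""
    match = ORDER_LABEL.search(ocr_text)
    return match.group(1) if match else None

if __name__ == "__main__":
    page = "ACME Corporation\nPurchase Order No: 4500123\nOrder date 10/24/10"
    print(find_order_number(page))
```

The value is located by its context (the label) rather than by a fixed position on the page, which is the property that distinguishes this kind of extraction from plain forms reading.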

OpenText Capture Center (OCC) extracts information from bitmap documents by using the most advanced Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), and Intelligent Document Recognition (IDR).

Using OCC, your organization saves money from reduced manual keying and paper handling, speeds up business processes by using digital workflow right from the start, improves data quality by capturing all relevant data from documents, and reduces compliance risks by keeping track of document-related activities.

Benefits
• Reduce operating costs: Automate manual tasks and deploy a single input management platform.
• Improve information quality: Classify, extract, and verify information and leverage a common set of business rules.
• Accelerate business processes: Reduce exception processing and enhance customer relationships.
• Reduce compliance risks: Control the flow of each incoming document and connect each document with its business transaction.

Using OCC
OCC is used in many areas, including the following:

Digital mailroom
The digital mailroom automatically captures and classifies all information entering an organization and then routes it to the appropriate person, department, or backend Enterprise Resource Planning, Customer Relationship Management, Enterprise Content Management (ECM), or workflow solution. It also provides tracking and auditing of that correspondence.

Traditional mailroom processes are slow and inefficient, dominated by paper documents. By moving to a digital mailroom, companies can reduce operational costs, streamline and accelerate business processes, and deliver improved customer service.

Transaction and process management
OCC captures images and processes data for certain business processes. The purpose of scanning is to input data into a business process. Ideally, all business data for a transaction should be extracted and validated. As a result, the process may be fully automated. The scanned document will be attributed by some metadata, stored in an archive, and associated with the business transaction.

ENTERPRISE INFORMATION MANAGEMENT

Typical documents that may be processed are order entries, application forms, insurance claims, or business reply mail. Invoice processing is also a dominant application for this business-use case, and OpenText Invoice Capture Center is a preconfigured application of OCC dedicated to this application. Automation with OCC reduces costs, increases productivity, increases efficiency of business processes, and improves information quality.

Scanning documents into electronic files
Documents have to be put into a repository or classified and routed to a centralized point as quickly as possible. Typically, the process is batch oriented. The documents are scanned and indexed automatically, sometimes with key entry to enter or correct indexes. Presorting is eliminated through automated classification. The process is managed centrally, and components of the process may remain local within the scanning system or distributed remotely.

Documents will then be available in the content management and workflow systems for fast access, archiving, and knowledge management. Apart from this, the main costs associated with managing paper (manual handling, storage, retrieval, loss, compliance) will be reduced dramatically.

Backfile conversion
Backfile conversion replaces a paper archive with a digital archive.

This business-use case is very similar to batch capture, as the main task is the conversion of high-volume paper to electronic images with index and metadata. However, it is important to conserve the original file structure of the paper archive. This is done with various kinds of separators to reflect the hierarchical structure of the filing repository.

Conversion of the paper archive will leverage all the benefits related to a digital archive.

Ad hoc capturing
Ad hoc capturing is characterized by low-volume, on-demand document capture. For example, an office worker who wants to convert paper documents into usable electronic documents does ad hoc capturing. The devices used are low-speed scanners or networked office Multi-Function Peripherals (MFPs). Systems are sometimes linked into shared repositories, or documents may be moved into centralized repositories, converted, and used as faxes, email attachments, etc.

With ubiquitous availability of network scanners and MFPs, ad hoc capturing ensures that paper is eliminated as early as possible and documents can be shared for collaborative processes.

Example applications
Customer Relationship Management: In business-to-consumer businesses, it is critical to maintain customer records accurately and completely. Customers often send paper documents (e.g., application forms, claim forms, complaint letters, salary certificates, etc.) that must be filed in the right customer record. Useful metadata are customer name, customer number, document type, case identification, contract number, and so on. Typical organizations with a high volume of incoming paper documents are utility service companies, city councils, assurance companies, banks, mail order businesses, and service companies.

Human Resources (HR): Using electronic personnel files is the most efficient way to reduce chaos in the HR department. This requires an initial conversion of the existing files. From then on, all incoming documents will be stored in electronic format. The management of HR documentation can be very complex, mainly due to the large number of different document types that can exist for each employee. Documents often exist in several formats and locations with different processes being carried out for each document type.

Insurance claims: Insurance customers report claims either by using forms or by submitting free-form letters. The customer needs to be identified in the database along with claim details like injured person, date of incident, or sum of damages. Prepopulating the claim processing mask lets the insurance specialist focus on the subject itself instead of capturing data.

Travel expenses: Expense processing is a time-consuming, manual task with data entry, paper storage, and distribution problems. A capture solution streamlines and automates this process; expense claims are routed automatically to the appropriate individuals for approval, and the resulting expense data can be automatically posted for payment once approved.

eDiscovery: Electronic discovery refers to any process in which electronic data is sought, located, secured, and searched with the intent of using it as evidence in a civil or criminal legal case. Scanned documents and faxes must provide all of their textual contents in order to be found in such a case. With OCC, thousands of file types can be classified as records.

"Using OpenText Capture Center, your organization saves money, speeds up business processes, improves data quality, and reduces compliance risks."


[Figure: OCC architecture diagram. Labels: Scanner, Scan Client, Import (fax, FTP site, email, SharePoint), Recognition, Validation, Export, Dispatch, Monitor, Configuration, Business Application, Archive.]

Functional description
Extracting data from scanned documents is a multi-step process:

Document acquisition: OCC is closely connected to OpenText Imaging Enterprise Scan. Using the OCC scan profile, documents are automatically transferred to the recognition process when scanning is finished. Documents can also be picked up from a variety of other sources: file system folders, File Transfer Protocol (FTP) sites, email servers, or Microsoft® SharePoint® servers.

Document recognition: Document recognition is a two-step process. First, a document is classified; that is, its document type is determined, for example a purchase order, a contract, or an application form. In the second step, a defined extraction profile is used for each document class. This allows the user to extract the relevant business data from any specific type of document, e.g., a Purchase Order (PO) number for a PO or a contract number for a contract. If only one type of document is processed, the classification step is omitted.

Document validation: OCR, ICR, and IDR do not always extract all required data. Due to dirt, document damage, irregular fonts, or unusual document layout, some data will not be identified with a sufficient level of confidence. For these cases, manual data entry is supported by a powerful data entry client that is designed according to the highest ergonomic standards. Keyboard usage for advanced data keying personnel is supported, as well as mouse-based data capture using OpenText Desktop Capture.

Document release: Release modules for OpenText Content Server and Microsoft SharePoint automatically transfer the document into the required folder or library and fill in metadata. For other systems, the document image and data are stored in the file system and can be picked up from there. A programming interface allows for the development of custom release modules.

To manage these steps, OCC supplies modules for customizing and administration.

Configuration: OCC classifies each document, i.e., determines its document class. A document class defines the set of fields (also known as metadata or index fields) and how OCC is expected to locate and extract these. Classification determines the document class without manual presorting. All of this customizing occurs with an intuitive user interface. For all basic extraction tasks, design and test tools allow for recognition rate control, verification, and optimization.

Application Programming Interface (API): Using the API via programming (.NET™) or scripting (JavaScript), advanced adaptation to project-specific requirements is possible. Customizing code can be injected at almost any step of the recognition process.

Production monitoring: To control the production process, the administrator can look into the current state of each of the batches that are in the system. By selecting a certain subset, personnel can easily spot production problems like missing resources or failures of connected components. Statistical reports help to allocate resources or to distribute costs in a shared service environment.

Technical administration: In case one of the modules runs into trouble (e.g., a release module cannot connect to the target system), the administrator uses the technical administration tools to identify and fix the issue.
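
The recognition and validation steps of the functional description (classify first, then apply the extraction profile of that class, then send whatever could not be extracted to a person) can be sketched in a few lines of Python. Everything named here is a hypothetical stand-in: the keyword lists, the regular-expression "profiles", and the `recognize` function are assumptions for illustration and do not reflect the OCC API. A field that is simply not found stands in for "not identified with a sufficient level of confidence".

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical classification keywords per document class.
CLASS_KEYWORDS = {
    "purchase_order": ["purchase order", "po number"],
    "contract": ["contract", "agreement"],
}

# Hypothetical extraction profiles: one set of field patterns per class.
EXTRACTION_PROFILES = {
    "purchase_order": {"po_number": re.compile(r"PO Number:\s*(\d+)")},
    "contract": {"contract_number": re.compile(r"Contract No\.?:\s*([A-Z]+-\d+)")},
}

@dataclass
class RecognitionResult:
    doc_class: Optional[str]
    fields: Dict[str, str] = field(default_factory=dict)
    needs_validation: List[str] = field(default_factory=list)

def classify(text: str) -> Optional[str]:
    """Step 1: pick the class whose keywords occur most often, if any occur."""
    lowered = text.lower()
    scores = {cls: sum(kw in lowered for kw in kws)
              for cls, kws in CLASS_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def recognize(text: str) -> RecognitionResult:
    """Step 2: apply the extraction profile of the detected class."""
    doc_class = classify(text)
    if doc_class is None:
        # Unknown class: the whole document goes to manual validation.
        return RecognitionResult(None, needs_validation=["doc_class"])
    result = RecognitionResult(doc_class)
    for name, pattern in EXTRACTION_PROFILES[doc_class].items():
        match = pattern.search(text)
        if match:
            result.fields[name] = match.group(1)
        else:
            result.needs_validation.append(name)  # left for manual data entry
    return result

if __name__ == "__main__":
    print(recognize("PURCHASE ORDER\nPO Number: 4500123"))
```

If only one document type is processed, the `classify` call could be skipped and a single profile applied directly, which mirrors the note that the classification step is then omitted.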


Automated recognition: the heart of OCC
[Figure: sample document types. Labels: Cheque, Application, Order, Claim.]

The tasks that can be automated are document separation, document classification, and data extraction. For most of these tasks, OCC offers several automation methods that can be configured according to the nature of the document. Some types of documents include the following:

Structured documents
Typically these are forms with fixed locations for each piece of data.
Semi-structured documents
Typical business documents are semi-structured. POs or delivery notes follow some general layout patterns so that rules can be defined concerning where to look for certain pieces of information. However, unlike forms, there is no defined geometric region for each piece of data.

Unstructured documents
In business-to-consumer environments, correspondence follows no regular pattern. This is a typical example of unstructured documents. Only the syntax of the information and semantic pattern can guide the search for information.

Document separation
OCC can assemble a batch of joined images into documents. The cutting points in the separation process are either defined by the content of extracted data fields, e.g., barcode or patch code, or by a defined number of pages.

Document classification
The document class is an attribute of the document that is typically used to determine the relevant business process or the folder into which it should be stored. Within OCC, the document class is used to control which kind of metadata have to form the attributes of a specific type of document. Each document class may have a different set of metadata.

OCC offers several options to determine the document class. These options can be combined.
• Adaptive Classification Technology (ACT) is a learning algorithm. ACT uses several samples from each document class, e.g., several orders, claims, or applications, and extracts the characteristic features of these documents (like specific keywords or phrases) based on OCR content. Each document that is to be classified is compared against these features and classified accordingly. This method is well-suited for unstructured documents.
• Rule-based classification uses a man-made set of classification rules. These rules typically use graphical objects, phrases, and combinations of keywords together with their geometric relation. This approach is best used for semi-structured documents as well as forms.
• Forms often contain certain elements for identification, like a form ID number. These can be extracted and used for classification.
• Preset values are used when the scan operator scans just a single batch of documents from the same class. Imaging Enterprise Scan allows for this data to be imported at scan time.

Data extraction
Data extracted from documents is either used as metadata in a repository for structured storage and retrieval or to automate transaction processing in an enterprise application. The set of extraction methods is always the same for both usages. All of the following methods can be applied to a single document:
• Barcode, patch code
• Optical mark recognition
• Forms reading (fixed, anchored location, hand print, machine print)
• Free forms recognition (rule-based extraction)
• Adaptive Reading Technology (learning through validation operator)
• Database-driven recognition (match a record in a database with the document)

OpenText is known for its exceptionally advanced recognition technology all the way through the technology stack. More than 35 years of experience in the field with large-scale operations (like the US census or the German tax authorities) are the foundation of the product. Thousands of users as well as industry partners rely on OpenText’s engines for OCR, ICR, and IDR. The power of these components is unleashed by OCC to allow for the highest automation rate in document recognition.
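
The two separation strategies described under "Document separation" (cutting at a barcode or patch code, or after a defined number of pages) can be illustrated with a short Python sketch. Pages are modeled as plain strings and the literal "PATCH" stands for a detected patch-code sheet; both, along with the function names and the choice to discard separator sheets, are assumptions made for illustration and not OCC behavior.

```python
from typing import Callable, List

def separate_by_marker(pages: List[str],
                       is_separator: Callable[[str], bool]) -> List[List[str]]:
    """Cut the batch at every separator page; separator pages are dropped."""
    documents: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if is_separator(page):
            if current:              # close the document collected so far
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:                      # last document has no trailing separator
        documents.append(current)
    return documents

def separate_by_page_count(pages: List[str],
                           pages_per_document: int) -> List[List[str]]:
    """Cut the batch after every N pages; the last document may be shorter."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    return [pages[i:i + pages_per_document]
            for i in range(0, len(pages), pages_per_document)]

if __name__ == "__main__":
    batch = ["PATCH", "a1", "a2", "PATCH", "b1"]
    print(separate_by_marker(batch, lambda p: p == "PATCH"))
    print(separate_by_page_count(["p1", "p2", "p3", "p4", "p5"], 2))
```

In a setup where the barcode is printed on the first page of each document rather than on a separate sheet, the marker page would be kept as the start of the new document instead of being dropped.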


[Figure: extraction pipeline with the stages OCR, Analyze, Extract, Normalize, Verify. Example shown: on a form with the fields CLERK, DATE SHIPPED, ORDER DATE, and COVER CODE, the value "10/24/10" is found for ORDER_DATE, read in US format (day = 24, month = Oct, year = 2010), normalized to 24.10.2010, and checked for validity (yes/no), leading to export or manual keying.]

Extracting data from a business document typically involves several steps. In the first step, OCR is used to turn the pixels into individual characters. In the second step, meaningful units like amounts, dates, or numbers are identified. In step three, the most plausible hypothesis for the information searched is identified, typically based on contextual information. In the example, it is the phrase “ORDER DATE” that is the triggering piece. Step four is normalizing the varying writing styles for the information, and the last step is the logical validation. Although depicted as a sequence, the steps really run in cycles, following many hypotheses in parallel.
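
The normalize and verify steps of this example can be sketched in Python using the values from the figure: the US-style string "10/24/10" becomes 24.10.2010, and a failed check sends the field to manual keying. The function names and the plausibility window used for the logical validation are assumptions for illustration; the source does not say which validation rules OCC applies.

```python
from datetime import date, datetime
from typing import Optional, Tuple

def normalize_us_date(raw: str) -> Optional[date]:
    """Step four: turn a US-style date (month/day/year) into a date object."""
    for fmt in ("%m/%d/%y", "%m/%d/%Y"):   # two-digit year first, then four-digit
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None                            # not a readable date

def process_order_date(raw: str, earliest: date,
                       latest: date) -> Tuple[str, Optional[str]]:
    """Last step: logical validation, then export or manual keying."""
    parsed = normalize_us_date(raw)
    if parsed is None or not (earliest <= parsed <= latest):
        return ("manual_keying", None)
    return ("export", parsed.strftime("%d.%m.%Y"))

if __name__ == "__main__":
    window = (date(2010, 1, 1), date(2010, 12, 31))  # hypothetical valid period
    print(process_order_date("10/24/10", *window))
```

A production system would weigh several competing hypotheses for the same field, as the caption notes; this sketch follows a single hypothesis from end to end.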
Licensing options
OCC is licensed by volume, i.e., the number of processed pages per year. Three licenses are available depending on the number of fields for which automation is used: Unlimited, 1-5, and 0. The last of these (0 fields) is mainly used for manual indexing and/or searchable PDF creation. Also available are optional modules and one-time licenses. The number of validation clients is not limited by any of the options.

www.opentext.com
North America +800 499 6544 • United States +1 847 267 9330 • Germany +49 89 4629
United Kingdom +44 0 1189 848 000 • Australia +61 2 9026 3400
Copyright © 2012 Open Text Corporation. OpenText and Open Text are trademarks or registered trademarks of Open Text Corporation. This list is not exhaustive. All other trademarks or registered trademarks are the property of their respective
owners. All rights reserved. For more information, visit: http://www.opentext.com/2/global/site-copyright.html (08/2012)00137EN
