The content of scanned documents (or faxes) is readable only by humans. For computer applications, they are merely a collection of meaningless pixels. In order to facilitate the automatic storage of documents within a repository and provide correct attributes for those documents, the type of document and relevant metadata have to be known. In order to automatically trigger business processes and populate business transactions with information, relevant data from the documents must be available.

What is all this: OCR, ICR, IDR?
A scanned document is just a bunch of pixels.
Optical Character Recognition (OCR) yields the character code for each written character on the document.
Intelligent Character Recognition (ICR) extends OCR with contextual algorithms. ICR also allows for the recognition of hand-printed characters.
Intelligent Document Recognition (IDR) analyzes a document and finds specific information in an unknown document layout, such as an order number on a purchase order. IDR typically builds on top of OCR/ICR results.
ENTERPRISE INFORMATION MANAGEMENT
PRODUCT OVERVIEW OpenText Capture Center
[Figure: processing workflow with the stages Scanner, Scan Client, Import (fax, FTP site, email, SharePoint), Recognition, Validation, Dispatch, Export, Business Application, and Archive, plus Monitor and Configuration.]
Functional description

Extracting data from scanned documents is a multi-step process:

Document acquisition: OpenText Capture Center (OCC) is closely connected to OpenText Imaging Enterprise Scan. Using the OCC scan profile, documents are automatically transferred to the recognition process when scanning is finished. Documents can also be picked up from a variety of other sources: file system folders, File Transfer Protocol (FTP) sites, email servers, or Microsoft® SharePoint® servers.

Document recognition: Document recognition is a two-step process. First, a document is classified; that is, its document type is determined, whether it is a purchase order, a contract, or an application form. In the second step, a defined extraction profile is used for each document class. This allows the user to extract the relevant business data from any specific type of document, e.g., a Purchase Order (PO) number for a PO or a contract number for a contract. If only one type of document is processed, the classification step is omitted.

Document validation: OCR, ICR, and IDR do not always extract all required data. Due to dirt, document damage, irregular fonts, or unusual document layout, some data will not be identified with a sufficient level of confidence. For these cases, manual data entry is supported by a powerful data entry client that is designed according to the highest ergonomic standards. Keyboard-driven entry is supported for experienced data keying personnel, as is mouse-based data capture using OpenText Desktop Capture.

Document release: Release modules for OpenText Content Server and Microsoft SharePoint automatically transfer the document into the required folder or library and fill in metadata. For other systems, the document image and data are stored in the file system and can be picked up from there. A programming interface allows for the development of custom release modules.

For managing the following steps, OCC supplies modules for customizing and administration.

Configuration: OCC classifies each document, i.e., determines its document class. A document class defines the set of fields (also known as metadata or index fields) and how OCC is expected to locate and extract these. Classification determines the document class without manual presorting. All of this customizing occurs with an intuitive user interface. For all basic extraction tasks, design and test tools allow for recognition rate control, verification, and optimization.

Application Programming Interface (API): Using the API via programming (.NET™) or scripting (JavaScript), advanced adaptation to project-specific requirements is possible. Customizing code can be injected at almost any step of the recognition process.

Production monitoring: To control the production process, the administrator can look into the current state of each of the batches that are in the system. By selecting a certain subset, personnel can easily spot production problems like missing resources or failures of connected components. Statistical reports help to allocate resources or to distribute costs in a shared service environment.

Technical administration: In case one of the modules runs into trouble (e.g., a release module cannot connect to the target system), the administrator uses the technical administration tools to identify and fix the issue.
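
The recognition and validation steps can be illustrated with a short Python sketch: classify the document, apply the extraction profile for its class, and send low-confidence results to manual validation. This is only an illustration under stated assumptions. The keyword classifier, the regular-expression profiles, the confidence values, the 0.8 threshold, and every function name are hypothetical; none of them represents the OCC API or its actual recognition algorithms.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keywords used to guess the document class (step 1).
CLASS_KEYWORDS = {
    "purchase_order": ("purchase order", "po number"),
    "contract": ("contract", "agreement"),
}

# Hypothetical extraction profiles: field patterns per document class (step 2).
EXTRACTION_PROFILES = {
    "purchase_order": {"po_number": re.compile(r"PO Number:\s*(\d+)")},
    "contract": {"contract_number": re.compile(r"Contract No\.?:?\s*([A-Z0-9-]+)")},
}

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not a product default


@dataclass
class FieldResult:
    name: str
    value: Optional[str]
    confidence: float


def classify(text: str) -> Optional[str]:
    """Return the class whose keywords occur most often, or None if none occur."""
    lowered = text.lower()
    scores = {
        doc_class: sum(lowered.count(keyword) for keyword in keywords)
        for doc_class, keywords in CLASS_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract(text: str, doc_class: str, ocr_confidence: float) -> list:
    """Apply the profile of doc_class; a field that is not found gets confidence 0."""
    results = []
    for name, pattern in EXTRACTION_PROFILES[doc_class].items():
        match = pattern.search(text)
        if match:
            results.append(FieldResult(name, match.group(1), ocr_confidence))
        else:
            results.append(FieldResult(name, None, 0.0))
    return results


def route(fields: list) -> str:
    """Export only when every field reaches the confidence threshold."""
    if all(field.confidence >= CONFIDENCE_THRESHOLD for field in fields):
        return "export"
    return "manual_validation"


if __name__ == "__main__":
    page = "PURCHASE ORDER\nPO Number: 450012\n"
    doc_class = classify(page)
    fields = extract(page, doc_class, ocr_confidence=0.95)
    print(doc_class, fields[0].value, route(fields))
```

When only one type of document is processed, `classify` can be skipped and the class name passed to `extract` directly, mirroring the omitted classification step.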
[Figure: extraction example. On a form with the fields CLERK, DATE SHIPPED, ORDER DATE, NUMBER, and COVER CODE, the string "10/24/10" next to the words "ORDER DATE" is read as ORDER_DATE in US format (day = 24, month = Oct, year = 2010). The date is then checked against a valid period: if it passes, it is exported as 24.10.2010; if not, it goes to manual keying.]

Extracting data from a business document typically involves several steps. In the first step, OCR is used to turn the pixels into individual characters. In the second step, meaningful units like amounts, dates, or numbers are identified. In step three, the most plausible hypothesis for the information searched is identified, typically based on contextual information. In the example, it is the phrase "ORDER DATE" that is the triggering piece. Step four is normalizing the varying writing styles for the information, and the last step is the logical validation. Although depicted as a sequence, the steps really run in cycles, following many hypotheses in parallel.
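
The order-date example can be traced in a minimal Python sketch, starting from text that OCR has already produced (step one). The regular expression, the rule "take the first date after the trigger phrase", the function names, and the sample text are all assumptions made for illustration, not the product's method. The sketch is also strictly sequential, whereas the real process is described as cyclic with many hypotheses followed in parallel.

```python
import re
from datetime import date, datetime
from typing import Optional

# Step 2: date-like units in US notation with a two-digit year (assumed pattern).
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}\b")


def find_date_candidates(text: str) -> list:
    """Return (position, raw string) for every date-like unit in the text."""
    return [(m.start(), m.group()) for m in DATE_PATTERN.finditer(text)]


def pick_candidate(text: str, candidates: list, trigger: str = "ORDER DATE") -> Optional[str]:
    """Step 3: choose the candidate that follows the trigger phrase most closely."""
    anchor = text.find(trigger)
    following = [c for c in candidates if anchor >= 0 and c[0] > anchor]
    if not following:
        return None
    return min(following, key=lambda c: c[0] - anchor)[1]


def normalize_us_date(raw: str) -> date:
    """Step 4: interpret month/day/two-digit-year and return a date object."""
    return datetime.strptime(raw, "%m/%d/%y").date()


def extract_order_date(text: str, valid_from: date, valid_to: date) -> Optional[str]:
    """Run steps 2 to 5; None means the field would go to manual keying."""
    raw = pick_candidate(text, find_date_candidates(text))
    if raw is None:
        return None
    try:
        parsed = normalize_us_date(raw)
    except ValueError:
        return None
    # Step 5: logical validation against an assumed valid period.
    if not (valid_from <= parsed <= valid_to):
        return None
    return parsed.strftime("%d.%m.%Y")


if __name__ == "__main__":
    ocr_text = "DATE SHIPPED 10/30/10\nORDER DATE 10/24/10\nCLERK: 12"
    print(extract_order_date(ocr_text, date(2010, 1, 1), date(2010, 12, 31)))
    # prints 24.10.2010
```

Returning `None` instead of raising keeps the two outcomes of the figure visible in the caller: a normalized value ready for export, or a field that needs manual keying.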
Licensing options

OCC is licensed by volume, i.e., number of processed pages per year. Three licenses are available depending on the number of fields for which automation is used: Unlimited, 1-5, and 0. The 0-field license is mainly used for manual indexing and/or searchable PDF creation. Also available are optional modules and one-time licenses. The number of validation clients is not limited by any of the options.
www.opentext.com
North America +800 499 6544 | United States +1 847 267 9330 | Germany +49 89 4629
United Kingdom +44 0 1189 848 000 | Australia +61 2 9026 3400
Copyright © 2012 Open Text Corporation. OpenText and Open Text are trademarks or registered trademarks of Open Text Corporation. This list is not exhaustive. All other trademarks or registered trademarks are the property of their respective
owners. All rights reserved. For more information, visit: http://www.opentext.com/2/global/site-copyright.html (08/2012)00137EN