
Quality Stage in Data Stage

Data Quality Challenges

Different or inconsistent standards in structure, format or values

Missing data, default values

Spelling errors, data in wrong fields

Buried information

Data myopia

Data anomalies

Different or Inconsistent Standards


Missing Data & Default Values

Buried Information


The Anomalies Nightmare

Quality Stage

Quality Stage is a tool intended to deliver high quality data required for success in a range of enterprise initiatives including business intelligence, legacy consolidation and master data management. It does this primarily by identifying components of data that may be in columns or free format, standardizing the values and formats of those data, using the standardized results and other generated values to determine likely duplicate records, and building a "best of breed" record out of these sets of potential duplicates. Through its intuitive user interface Quality Stage substantially reduces the time and cost to implement Customer Relationship Management (CRM), data warehouse/business intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their return on investment by ensuring data quality.

With Quality Stage it is possible, for example, to construct consolidated


customer and
household views, enabling more effective cross-selling, up-selling, and
customer
retention, and to help to improve customer support and service, for example
by
identifying a company's most profitable customers. The cleansed data
provided by Quality Stage allows creation of business intelligence on
individuals and organizations for research, fraud detection, and planning.


Out of the box Quality Stage provides for cleansing of name and address
data and some
related types of data such as email addresses, tax IDs and so on. However,
Quality Stage
is fully customizable to be able to cleanse any kind of classifiable data, such
as
infrastructure, inventory, health data, and so on.

Quality Stage Heritage


The product now called Quality Stage has its origins in a product called
INTEGRITY from a
company called Vality. Vality was acquired by Ascential Software in 2003 and
the
product renamed to Quality Stage. This first version of Quality Stage
reflected its
heritage (for example it only had batch mode operation) and, indeed, its
mainframe
antecedents (for example file name components limited to eight characters).
Ascential did not do much with the inner workings of Quality Stage which
was, after all,
already a mature product. Ascential's emphasis was to provide two new
modes of
operation for Quality Stage. One was a plug-in for Data Stage that allowed
data
cleansing/standardization to be performed (by Quality Stage jobs) as part of
an ETL data
flow. The other was to provide for Quality Stage to use the parallel execution
technology
(Orchestrate) that Ascential had as a result of its acquisition of Torrent
Systems in 2001.
IBM acquired Ascential Software at the end of 2005. Since then the main
direction has
been to put together a suite of products that share metadata transparently
and share a
common set of services for such things as security, metadata delivery,
reporting, and so
on. In the particular case of Quality Stage, it now shares a common Designer
client with
Data Stage: from version 8.0 onwards Quality Stage jobs run as, or as part of, Data Stage jobs, at least in the parallel execution environment.


QualityStage Functionality
Four tasks are performed by QualityStage; they are investigation,
standardization,
matching and survivorship. We need to look at each of these in turn. Under
the covers
QualityStage incorporates a set of probabilistic matching algorithms that can
find
potential duplicates in data despite variations in spelling, numeric or date
values, use of
non-standard forms, and various other obstacles to performing the same
tasks using
deterministic methods. For example, if you have what appears to be the
same
employee record where the name is the same but date of hire differs by a
day or two, a
deterministic algorithm would show two different employees whereas a
probabilistic
algorithm would show the potential duplicate.
(Deterministic means absolute in this sense; either something is equal or it
is not.
Probabilistic leaves room for some degree of uncertainty; a value is close
enough to be
considered equal. Needless to say, the degree of uncertainty used within
QualityStage
is configurable by the designer.)
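
A minimal Python sketch of the distinction (illustrative only; the similarity measure and threshold here stand in for QualityStage's configurable probabilistic scoring):

from difflib import SequenceMatcher

def deterministic_match(a: str, b: str) -> bool:
    # Absolute: either the values are equal or they are not.
    return a == b

def probabilistic_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # "Close enough" to be considered equal, per a designer-chosen threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Same employee, hire dates differing by a day:
print(deterministic_match("2009-06-01", "2009-06-02"))  # False -> two employees
print(probabilistic_match("2009-06-01", "2009-06-02"))  # True  -> potential duplicate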


Investigation
By investigation we mean inspection of the data to reveal certain types of
information
about those data. There is some overlap between Quality Stage investigation
and the
kinds of profiling results that are available using Information Analyzer, but
not so much
overlap as to suggest removing functionality from either tool. Quality
Stage can
undertake three different kinds of investigation.
Features

Data investigation is done using the investigate stage


This stage analyzes each record field by field for its content and
structure.
Free-form fields are broken up into individual tokens and then analyzed.
Provides frequency distributions of distinct values and patterns


Each investigation phase produces pattern reports, word frequency reports and word classification reports. The reports are located in the data directory on the server.

Investigate methods


Character Investigation

Single-domain fields
Entity Identifiers: e.g., ZIP codes, SSNs, Canadian postal codes
Entity Clarifiers: e.g., name prefix, gender, and marital status

Multiple-domain fields
Large free-form fields such as multiple Address fields


Character discrete investigation looks at the characters in a single field (domain) to report what values or patterns exist in that field. For example, a field might be expected to contain only codes A through E. A character discrete investigation looking at the values in that field will report the number of occurrences of every value in the field (and therefore any out-of-range values, empty or null values, etc.). "Pattern" in this context means whether each character is alphabetic, numeric, blank or something else. This is useful in planning cleansing rules; for example, a telephone number may be represented with or without delimiters and with or without parentheses surrounding the area code, all in the one field. To come up with a standard format, you need to be aware of what formats actually exist in the data. The result of a character discrete investigation (which can also examine just part of a field, for example the first three characters) is a frequency distribution of values or patterns; the developer determines which.

Character concatenate investigation is exactly the same as character discrete investigation except that the contents of more than one field can be examined as if they were in a single field; the fields are, in some sense, concatenated prior to the investigation taking place. The results of a character concatenate investigation can be useful in revealing whether particular sets of patterns or values occur together.
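
A minimal Python sketch of the kind of report a character discrete investigation produces, assuming the character classes just described (alphabetic, numeric, blank, other); this is illustrative only, not QualityStage itself:

from collections import Counter

def char_pattern(value: str) -> str:
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")   # alphabetic
        elif ch.isdigit():
            out.append("n")   # numeric
        elif ch == " ":
            out.append("b")   # blank
        else:
            out.append(ch)    # any other character stands for itself
    return "".join(out)

phones = ["(617) 338-9931", "617-338-9931", "6173389931"]
print(Counter(char_pattern(p) for p in phones))
# {'(nnn)bnnn-nnnn': 1, 'nnn-nnn-nnnn': 1, 'nnnnnnnnnn': 1}

A character concatenate investigation would apply the same logic to several fields joined together, e.g. char_pattern(field1 + field2).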
Word investigation is probably the most important of the three for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected rule set; the other is to report how those words are to be classified, again in terms of the currently selected rule set. There is no overlap with Information Analyzer (the data profiling tool) from word investigation.

Rule Set:
A rule set includes a set of tables that list the known words or tokens. For example, the GBNAME rule set contains a list of names that are known to be first names in Great Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that can not only be recognized as name prefixes (titles, if you prefer) but can in some cases reveal additional information, such as gender.
When a word investigation reports about classification, it does so by producing a pattern. This shows how each known word in the data record is classified, and the order in which each occurs. For example, under the USNAME rule set the name WILLIAM F. GAINES III would report the pattern "FI?G": the "F" indicates that William is a known first name, the "I" indicates the "F." is an initial, the "?" indicates that Gaines is not a known word in context, and the "G" indicates that III is a "generation", as would be "Senior", "IV" and "fils". Punctuation may be included or ignored.
Rule sets also come into play when performing standardization (discussed
below).
Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example, "William" might be recorded as the standard form for "Bill") and may contain an uncertainty threshold (for example, "Felliciity" might still be recognizable as "Felicity" even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage.
Investigation might also be performed to review the results of
standardization,
particularly to see whether there are any unhandled patterns or text that
could be
better handled if the rule set itself were tweaked, either with improved
classification
tables or through a mechanism called rule set overrides.

Standardization
Standardization, as the name suggests, is the process of generating standard forms of data that might more reliably be matched. For example, by generating the standard form "William" from "Bill", there is an increased likelihood of finding the match between William Gates and Bill Gates. Other standard forms that can be generated include phonetic equivalents (using NYSIIS and/or Soundex), and something like initials, maybe the first two characters from each of five fields.
Each standardization specifies a particular rule set. As well as word/token
classification
tables, a rule set includes specification of the format of an output record
structure, into
which original and standardized forms of the data, generated fields (such as
gender) and
reporting fields (for example whether a user override was used and, if so,
what kind of
override) may be written.
It may be that standardization is the desired end result of using Quality Stage. For example, street address components such as "Street" or "Avenue" or "Road" are often represented differently in data, perhaps differently abbreviated in different records. Standardization can convert all the non-standard forms into whatever standard format the organization has decided that it will use.
This kind of Quality Stage job can be set up as a web service. For example, a
data entry
application might send in an address to be standardized. The web service
would return
the standardized address to the caller.
More commonly standardization is a preliminary step towards performing
matching.
More accurate matching can be performed if standard forms of words/tokens
are
compared than if the original forms of these data are compared.
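
A minimal Python sketch of why standardization improves matching, using a tiny hand-made table of standard forms as a stand-in for a real rule set's classification table:

# Hypothetical standard forms; a real rule set holds these per domain
# (names, addresses, and so on).
STANDARD_FORMS = {"BILL": "WILLIAM", "BILLY": "WILLIAM", "WM": "WILLIAM"}

def standardize(name: str) -> str:
    tokens = name.upper().split()
    return " ".join(STANDARD_FORMS.get(t, t) for t in tokens)

# The original forms differ, the standard forms agree:
print(standardize("Bill Gates") == standardize("William Gates"))  # True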
Standardization Process Flow


Delivered Rule Sets Methodology in Standardization

Example: Country Identifier Rule Set



Output Records

US  Y  100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA  Y  SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB  Y  28 GROSVENOR STREET LONDON W1X 9FE
US  N  123 MAIN STREET

Example: Domain Pre-processor Rule Set

Output Record

Name Domain:    TINA FISHER ATTN IBM
Address Domain: 211 WASHINGTON DR PO BOX 52
Area Domain:    WESTBORO, MA 02140


Example: Domain Specific Rule Set

Output Record

House Number:       211
Street Name:        WASHINGTON
Street Suffix Type: DR
Box Type:           PO BOX
Box Value:          52

Logic for NAME Rule Set

Set variables for process option delimiters

Process the most common patterns first

Simplify the patterns

Check for common patterns again

Check for multiple names

Process organization names


Process individual names

Default processing (based on process options)

Post process subroutine to populate matching fields

Logic of ADDR Rule Sets

Process the most common patterns first

Simplify the patterns

Check for common patterns again

Call subroutines for each secondary address element

Check for street address patterns

Post process subroutine to populate matching fields

Logic of AREA Rule Sets

Process input from right to left

Call subroutines for each sub-domain (i.e. country name, post code,
province, city)

Post process subroutine to populate matching fields


Rule Sets

Rule Sets are standardization processes used by the Standardize Stage and have three required components:
1. Classification Table - contains the key words that provide special context, their standard value, and their user-defined class
2. Dictionary File - defines the output columns that will be created by the standardization process
3. Pattern-Action File - drives the logic of the standardization process and decides how to populate the output columns

Optional rule set components:

User Overrides


Reference Tables

Standardization Example

Parsing (the Standardization Adventure Begins)

The standardization process begins by parsing the input data into


individual data elements called tokens

Parsing parameters are provided by the pattern-action file

Parsing parameters are two lists of individual characters:


SEPLIST - Any character in this list will be used to separate
tokens


STRIPLIST - Any character in this list will be removed

The SEPLIST is always applied first

Any character that is in the SEPLIST and not in the STRIPLIST, will be
used to separate tokens and will also become a token itself

The space character should be included in both lists

Any character that is in both lists will be used to separate tokens but
will not become a token itself
The best example of this is the space character - one or more
spaces are stripped but the space indicates where one token
ends and another begins
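
A minimal Python sketch of the SEPLIST/STRIPLIST semantics described above (illustrative only; the real parsing parameters live in the pattern-action file):

def parse(value, seplist=" ,.-/", striplist=" ."):
    tokens, current = [], ""
    for ch in value:
        if ch in seplist:              # SEPLIST is always applied first
            if current:
                tokens.append(current) # close the current token
                current = ""
            if ch not in striplist:
                tokens.append(ch)      # separator also becomes a token itself
        elif ch not in striplist:      # STRIPLIST characters are removed
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(parse("123 MAIN ST., APT 4"))
# ['123', 'MAIN', 'ST', ',', 'APT', '4'] - spaces and periods separate tokens
# and are stripped; the comma separates tokens and survives as a token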

Parsing (Chinese, Japanese, Korean)


The parser behaves differently if the locale setting is Chinese,


Japanese, or Korean

Spaces are not used to divide tokens so each character, including a


space, is considered a token

Spaces are classified by underscores (_) in the pattern

The Classification file allows multiple characters to be classified


together

Latin characters are transformed to double byte representations

Classification

Parsing separated the input data into individual tokens

Each token is basically either an alphabetic word, a number, a special


character, or some mixture

Classification assigns a one character tag (called a class) to each and


every individual parsed token to provide context

First, key words that can provide special context are classified
Provided by the standardization rule set classification table
Since these classes are context specific, they vary across rule
sets

Next, default classes are assigned to the remaining tokens


These default classes are always the same regardless of the rule
set used

Lexical patterns are assembled from the classification results


Concatenated string of the classes assigned to the parsed tokens


Classification Example


Apply defaults to tokens not found in the classification table

System defaults that are always the same regardless


of the rule set used
^ = A single numeric token
+ = A single unclassified alpha token

Default Classes

Class   Description
^       A single numeric token
+       A single unclassified alpha token
?       One or more consecutive unclassified alpha tokens
>       Leading numeric mixed token (i.e. 2B, 88WR)
<       Trailing numeric mixed token (i.e. B2, WR88)
@       Complex mixed token (i.e. NOT2B, C3PO, R2D2)
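
A minimal Python sketch of assigning these default classes and assembling a lexical pattern (illustrative only; for brevity, runs of + are not collapsed into ?):

import re

def default_class(token: str) -> str:
    if token.isdigit():
        return "^"    # a single numeric token
    if token.isalpha():
        return "+"    # a single unclassified alpha token
    if re.fullmatch(r"\d+[A-Za-z]+", token):
        return ">"    # leading numeric mixed token (2B, 88WR)
    if re.fullmatch(r"[A-Za-z]+\d+", token):
        return "<"    # trailing numeric mixed token (B2, WR88)
    return "@"        # complex mixed token (NOT2B, C3PO, R2D2)

classified = {"STREET": "T", "ST": "T"}   # key words from the classification table
tokens = ["123", "MAIN", "ST"]
pattern = "".join(classified.get(t, default_class(t)) for t in tokens)
print(pattern)   # ^+T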

Default Classes (Special Characters)

Some special characters are reserved for use as default classes that
describe token values that are not actual special character values
For example: ^ + ? > < @ (as described on the previous slide)

However, if a special character is included in the SEPLIST but omitted


from the STRIPLIST, then the default class for that special character
becomes the special character itself and in this case, the default class
does describe an actual special character value
For example: Periods (.), Commas (,), Hyphens (-)
It is important to note this can also happen to the reserved
default classes (for example: ^ = ^ if ^ is in the SEPLIST but
omitted from the STRIPLIST)

Also, if a special character is omitted from both the SEPLIST and


STRIPLIST (and it is surrounded by spaces in the input data), then the
special default class of ~ (tilde) is assigned
If not surrounded by spaces, then the appropriate mixed token
default class would be assigned (for example: P.O. = @ if . is
omitted from both lists)

Default Class (NULL Class)

Has nothing to do with NULL values

The NULL class is a special class


Represented by a numeric zero (0)
Only time that a number is used as a class


Tokens classified as NULL are unconditionally removed

Essentially, the NULL class does to complete tokens what the STRIPLIST
does to individual characters

Therefore, you will never see the NULL class represented in the
assembled lexical patterns

Classification Table
Classification Tables contain three required space-delimited columns:
1. Key word that can provide special context
2. Standard value for the key word
Standard value can be either an abbreviation or an
expansion
The pattern-action file will determine if the standard value
is used
3. Data class (one character tag) assigned to each key word

Classification Table Example


;----------------------------------------------------------------------------
; USADDR Classification Table
;----------------------------------------------------------------------------
; Classification Legend
;----------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; T - Street Types
; U - Unit Types
;----------------------------------------------------------------------------
PO          "PO BOX"   B
BOX         "PO BOX"   B
POBOX       "PO BOX"   B
NORTH       N          D
N           N          D
FLOOR       FL         F
FL          FL         F
STREET      ST         T
ST          ST         T
APARTMENT   APT        U
APT         APT        U
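
To make the three-column layout concrete, here is a minimal Python sketch of reading such a table into a token -> (standard value, class) lookup (illustrative only; QualityStage manages these tables itself):

import shlex

def load_classification(lines):
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):          # ';' introduces a comment
            continue
        token, standard, cls = shlex.split(line)[:3]  # quoted values like "PO BOX" stay intact
        table[token.upper()] = (standard, cls)
    return table

table = load_classification(['PO     "PO BOX"  B',
                             'STREET ST        T'])
print(table["STREET"])   # ('ST', 'T')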

Tokens in the Classification Table

A common misconception by new users is assuming that every


input alpha token should be classified by the classification table
Unclassified != Unhandled (i.e. unclassified tokens can still be
processed)

Classification table is intended for key words that provide special


context, which means context essential to the proper processing of
the data

General requirements for tokens in the classification table:

Tokens with standard values that need to be applied (within proper context)
  Tokens that require standard values, especially standard abbreviations, will often map directly into their own dictionary columns
  This does not mean that every dictionary column requires a user-defined class

Tokens with both a high individual frequency and a low set cardinality
  Low set cardinality means that the token belongs to a group of related tokens that have a relatively small number of possible values, and therefore the complete token group can be easily maintained in the classification table
  If set cardinality is high, adjacent tokens can often provide the necessary context.


What is a Dictionary File?

Defines the output columns created by the standardization rule set

When data is moved to these output columns, it is called bucketing

The order that the columns are listed in the dictionary file defines the
order the columns appear in the standardization rule set output

Dictionary file entries are used to automatically generate the column


metadata available for mapping on the Standardize Stage output link

Dictionary File Example


;----------------------------------------------------------------------------
; USADDR Dictionary File
;----------------------------------------------------------------------------
; Business Intelligence Fields
;----------------------------------------------------------------------------
HouseNumber              C 10 S HouseNumber
StreetName               C 25 S StreetName
StreetSuffixType         C 5  S StreetSuffixType
StreetSuffixDirectional  C 3  S StreetSuffixDirectional
;----------------------------------------------------------------------------
; Matching Fields
;----------------------------------------------------------------------------
StreetNameNYSIIS         C 8  S StreetNameNYSIIS
;----------------------------------------------------------------------------
; Reporting Fields
;----------------------------------------------------------------------------
UnhandledPattern         C 30 S UnhandledPattern
UnhandledData            C 50 S UnhandledData

Dictionary File Fields (Output Columns)

Standardization can prepare data for all of its uses, and therefore most dictionary files contain three types of output columns:
1. Business Intelligence
   Usually comprised of the parsed and standardized input tokens
2. Matching
   Columns specifically intended to facilitate more effective matching
   Commonly includes phonetic coding fields (NYSIIS and SOUNDEX)
3. Reporting
   Columns specifically intended to assist with the evaluation of the standardization results

Standard Reporting Fields in the Dictionary File

Unhandled Pattern - the lexical pattern representing the unhandled data

Unhandled Data - the tokens left unhandled (i.e. unprocessed) by the rule set

Input Pattern - the lexical pattern representing the parsed and classified input tokens

Exception Data - placeholder column for storing invalid input data (alternative to deletion)

User Override Flag - indicates whether or not a user override was applied (default = NO)

What is the Pattern-Action File?

Drives the logic of the standardization process

Configures the parsing parameters (SEPLIST/STRIPLIST)

Configures the phonetic coding (NYSIIS and SOUNDEX)

Populates the standardization output structures

Written in Pattern-Action Language, which consists of a series of


patterns and associated actions structured into logical processing units
called Pattern-Action Sets

Each Pattern-Action Set consists of:


One line containing a pattern, which is tested against the current
data


One or more lines of actions, which are executed if the pattern


tested true

Pattern-Action Set Example
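
A hedged sketch of what a pattern-action set looks like, based on the common house-number/street-name/street-type case; the action names and dictionary column names below are assumptions for illustration, not taken from a specific delivered rule set:

; e.g. "123 MAIN ST" - numeric, unknown word, street type
^ | ? | T
    COPY [1] {HouseNumber}
    COPY_S [2] {StreetName}
    COPY_A [3] {StreetSuffixType}

The first line is the pattern, tested against the current token classes; the indented lines are the actions executed if the pattern tests true: operand [1] is copied into the HouseNumber column, the unknown word(s) of operand [2] into StreetName, and the standard abbreviation of operand [3] (looked up in the classification table) into StreetSuffixType.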

Pattern-Action File Structure


Subroutines

Subroutines contain pattern-action sets
Each subroutine starts with a header line (\SUB) and ends with a trailer line (\END_SUB)
Subroutines can be called by MAIN or by other subroutines
When called, a subroutine is processed sequentially until a RETURN command or \END_SUB is encountered


Standardization vs. Validation

In QualityStage, standardization and validation describe different,


although related, types of processing

Validation extends the functionality of standardization

For example: 50 Washington Street, Westboro, Mass. 01581


Standardization can parse, identify, and re-structure the data as
follows:

House Number = 50

Street Name = WASHINGTON

Street Suffix Type = ST

City Name = WESTBORO

State Abbreviation = MA

Zip Code = 01581

Validation can verify that the data describes an actual address


and can also:

Correct City Name = WESTBOROUGH

Append Zip + 4 Code = 1013

Validation provides this functionality by matching against a


database

How to Deal with Unhandled Data?

There are two reporting fields in all delivered rule sets:


Unhandled Data
Unhandled Pattern

To identify and review unhandled data:


Investigate stage on the Unhandled Data and Unhandled Pattern
columns


SQA stage on the output of the Standardize stage

Unhandled data may represent the entire input or a subset of the input

If there is no unhandled data, it does not necessarily mean the data is


processed correctly

Some unhandled data does not need to be processed, if it doesn't belong to that domain

Processing of a rule set may be modified through overrides or pattern-action language

User Overrides

Most standardization rule sets are enabled with user overrides

User overrides provide the user with the ability to make modifications
without directly editing the classification table or the pattern-action file

User Overrides are:


Entered via simple GUI screens
Stored in specific objects within the rule set
Classification overrides can be used to add classifications for
tokens not in the classification table or to replace existing
classifications already in the classification table
The following pattern/text override objects are called based on
logic in the pattern-action file

input pattern

input text

unhandled pattern

unhandled text

Domain Specific Override Example


Classification Override

Input Text Override



Input Pattern Override

User Modification Subroutines

There are two subroutines in each delivered rule set that are specifically for users to add pattern-action language

User modifications within the pattern action file:


Input Modifications

This subroutine is called after the Input User Overrides are


applied but before any of the rule set pattern actions are
checked

Unhandled Modifications

This subroutine is called after all the pattern actions are


checked and the Unhandled User Overrides are applied

Pattern Action Language


http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.qs.patguide.doc/c_qspatact_container_topic.html

What is Matching ?
Matching is the real heart of Quality Stage. Different probabilistic algorithms
are
available for different types of data. Using the frequencies developed during
investigation (or subsequently), the information content (or rarity value) of
each value
in each field can be estimated. The less common a value, the more
information it
contributes to the decision. A separate agreement weight or disagreement
weight is
calculated for each field in each data record, incorporating both its
information content
(likelihood that a match actually has been found) and its probability that a
match has
been found purely at random. These weights are summed for each field in
the record to
come up with an aggregate weight that can be used as the basis for reporting that a particular pair of records probably are, or probably are not, duplicates of each other. There is a third possibility, a grey area in the middle, which Quality Stage refers to as the "clerical review" area: record pairs in this category need to be referred to a human to make the decision because there is not enough certainty either way. Over time the algorithms can be tuned with things like improved rule sets, weight overrides, different settings of probability levels and so on, so that fewer and fewer clericals are found.
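
A minimal Python sketch of the weighting arithmetic, in the classic probabilistic record-linkage (Fellegi-Sunter) form on which this style of matching is based; the m and u probabilities below are invented for illustration, not QualityStage's tuned values:

from math import log2

# m = P(field agrees | records truly match)
# u = P(field agrees | records paired at random); rarer values give a
#     smaller u, hence more information when they do agree
FIELDS = {"last_name": (0.95, 0.01),
          "zip":       (0.90, 0.10)}

def aggregate_weight(rec_a, rec_b):
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            total += log2(m / u)              # agreement weight
        else:
            total += log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

w = aggregate_weight({"last_name": "GAINES", "zip": "02111"},
                     {"last_name": "GAINES", "zip": "02111"})
print(w)   # ~9.7; compared against match and clerical-review cutoffs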
Matching makes use of a concept called blocking, which is an
unfortunately-chosen
term that means that potential sets of duplicates form blocks (or groups, or
sets) which
can be treated as separate sets of potentially duplicated values. Each block
of potential
duplicates is given a unique ID, which can be used by the next phase
(survivorship) and

can also be used to set up a table of linkages between the blocks of potential
duplicates
and the keys to the original data records that are in those blocks. This is
often a
requirement when de-duplication is being performed, for example when
combining
records from multiple sources, or generating a list of unique addresses from
a customer
file, et cetera.
More than one pass through the data may be required to identify all the
potential
duplicates. For example, one customer record may refer to a customer with a
street
address but another record for the same customer may include the
customer's post
office box address. Searching for duplicate addresses would not find this
customer; an
additional pass based on some other criteria would also be required. Quality
Stage does
provide for multiple passes, either fully passing through the data for each
pass, or only
examining the unmatched records on subsequent passes (which is usually
faster).
Matching vs. Lookups, Joins, and Merges
Within Information Server, multiple stages offer capability that can be
considered matching, for example:
Lookup
Join
Merge
Unduplicate Match
Reference Match

Lookups, Joins, and Merges typically use key attributes, exact match
criteria, or matches to a range of values or simple formats

The Unduplicate Match Stage and Reference Match Stage offer


probabilistic matching capability
There are two types of match stage


Unduplicate Match locates and groups all similar records within a single input data source. This process identifies potential duplicate records, which might then be removed.
Reference Match identifies relationships among records in two data sources. An example of many-to-one matching is matching the ZIP codes in a customer file with the list of valid ZIP codes. More than one record in the customer file can have the same ZIP code in it.
Blocking step
Blocking provides a method of limiting the number of pairs to examine.
When you partition data sources into mutually-exclusive and
exhaustive subsets and only search for matches within a subset, the
process of matching becomes manageable.
Basic blocking concepts include:

Blocking partitions the sources into subsets that make computation


feasible. Block size is the single most important factor in match
performance. Blocks should be as small as possible without causing
block overflows. Smaller blocks are more efficient than larger blocks
during matching.
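
A minimal Python sketch of blocking (illustrative only): partition records by a block key, then generate candidate pairs only within each block:

from collections import defaultdict
from itertools import combinations

def make_blocks(records, key):
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)   # mutually exclusive, exhaustive subsets
    return blocks

records = [{"id": 1, "zip": "02111", "name": "WILLIAM GATES"},
           {"id": 2, "zip": "02111", "name": "BILL GATES"},
           {"id": 3, "zip": "90210", "name": "JOHN SMITH"}]

for zip_code, members in make_blocks(records, lambda r: r["zip"]).items():
    for a, b in combinations(members, 2):    # pairs only within the block
        print(zip_code, a["id"], b["id"])    # prints: 02111 1 2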
Reference Match Stage

The Reference Match stage identifies relationships among records. This


match can group records that are being compared in different ways as
follows:
One-to-many matching
Many-to-one matching
One-to-many matching
Identifies all records in one data source that correspond to a record for
the same individual, event, household, or street address in a second
data source.
Only one record in the reference source can match one record in the data source because the matching applies to individual events. E.g., finding the same individual based on comparing SSNs in a voter registration list and a department of motor vehicles list.
Many-to-one matching

Multiple records in the data file can match a single record in the
reference file.
E.g., matching a transaction data source to a master data source allows
many transactions for one person in the master data source.

The Reference Match stage delivers up to six outputs as follows:
Match - contains matched records for both inputs
Clerical - has records that fall in the clerical range for both inputs
Data Duplicate - contains duplicates in the data source
Reference Duplicate - contains duplicates in the reference source
Data Residual - contains records that are non-matches from the data input
Reference Residual - contains records that are non-matches from the reference input


Survivorship
As the name suggests, survivorship is about what becomes of the data in these blocks of potential duplicates. The idea is to get the "best of breed" data out of each block, based on built-in or custom rules such as most frequently occurring non-missing value, longest string, most recently updated and so on.
The data that fulfill the requirements of these rules can then be handled in a couple of ways. One technique is to come up with a master record - a "single version of the truth" - that will become the standard for the organization. Another possibility is that the improved data could be populated back into the source systems whence they were derived; for example, if one source were missing date of birth, this could be populated because the date of birth was obtained from another source (or more than one). If this is not the requirement (perhaps for legal reasons), then a table containing the linkage between the source records and the master record keys can be created, so that the original source systems also have the ability to refer to the single source of truth and vice versa.
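
A minimal Python sketch of survivorship rules applied to one block of potential duplicates (the rules and data are illustrative):

from collections import Counter

def most_frequent_nonmissing(values):
    values = [v for v in values if v]
    return Counter(values).most_common(1)[0][0] if values else None

def longest_string(values):
    return max((v for v in values if v), key=len, default=None)

# One rule per output column; a real setup would also offer rules such
# as "most recently updated" and custom logic.
RULES = {"dob": most_frequent_nonmissing, "address": longest_string}

def survive(block):
    return {col: rule([rec.get(col) for rec in block])
            for col, rule in RULES.items()}

block = [{"dob": "1955-10-28", "address": "1 MICROSOFT WAY"},
         {"dob": None,         "address": "ONE MICROSOFT WAY, REDMOND WA"},
         {"dob": "1955-10-28", "address": ""}]
print(survive(block))
# {'dob': '1955-10-28', 'address': 'ONE MICROSOFT WAY, REDMOND WA'}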

Address Verification and Certification


Quality Stage can do more (than simple matching). Address verification can
be
performed; that is, whether or not the address is a valid format can be
reported. Out of
the box address verification can be performed down to city level for most
countries. For
an extra charge, an additional module for world-wide address verification
(WAVES) can
be purchased, which will give address verification down to street level for
most
countries.
For some countries, where the postal systems provide appropriate data (for example CASS in the USA, SERP in Canada, DPID in Australia), address certification can be
performed: in this case, an address is given to Quality Stage and looked up
against a
database to report whether or not that particular address actually exists.
These
modules carry an additional price, but that includes IBM obtaining regular
updates to
the data from the postal authorities and providing them to the Quality Stage
licensee.

New Address Verification Module



Summary

IBM is planning to release the next version of its InfoSphere QualityStage Worldwide Address Verification module (v10)
Release time frame is Q4 2012
AVI v10 will have superior functionality and coverage over the current AVI v8.x module - see slide 4
AVI v10 will leverage new address/decoding reference data
AVI v10 will have broad support for various Information Server versions - see slide 5
For current AVI v8.x customers only:
  AVI v8.x will have continued support until the end of Dec. 2013
  Address reference data for AVI v8.x has been discontinued by the vendor and ends in Dec. 2013


AVI v10 will include a migration utility for automated migration from AVI v8.x to AVI v10

For comparison, AVI v10 and AVI v8 can run side by side (for development)

Information Server / Operating System support matrix for AVI v10


Stage Icon and Location


Quality Stage Benefits


Quality Stage provides the most powerful, accurate matching available,
based on
probabilistic matching technology, easy to set up and maintain, and
providing the
highest match rates available in the market.
An easy-to-use graphical user interface (GUI) with an intuitive, point-and-click interface for specifying automated data quality processes (data investigation, standardization, matching, and survivorship) reduces the time needed to deploy data cleansing applications.
Quality Stage offers a thorough data investigation and analysis process for
any kind of
free-formatted data. Through its tight integration with Data Stage and other Information Server products, it also offers fully integrated management of the metadata associated with those data.
There exists rigorous scientific justification for the probabilistic algorithms
used in
Quality Stage; results are easy to audit and validate.
Worldwide address standardization, verification and enrichment capabilities, including certification modules for the United States, Canada, and Australia, add to the value of cleansed address data.
Domain-agnostic data cleansing capabilities including product data, phone
numbers,
email addresses, birth dates, events, and other comment and descriptive
fields, are all
handled. Common data quality anomalies, such as data in the wrong field or
data
spilling over into the next field, can be identified and addressed.
Extensive reporting providing metrics yields business intelligence about the data and helps tune the application for quality assurance.
Service oriented architecture (SOA) enablement with InfoSphere Information Services Director allows you to leverage data quality logic built using the IBM InfoSphere Information Server and publish it as an "always on, available everywhere" service in a SOA in minutes.
The bottom line is that Quality Stage helps to ensure that systems deliver
accurate,
complete, trusted information to business users both within and outside the
enterprise.
