Buried information
Data myopia
Data anomalies
info@anshisoft.com|www.anshisoft.com
Buried Information
Quality Stage
Quality Stage is a tool intended to deliver the high-quality data required for success in a range of enterprise initiatives, including business intelligence, legacy consolidation and
USA PH:+1-(999)-666-5174 | IND PH :+91-9000380723
Relationship Management (CRM), data
Out of the box, Quality Stage provides for cleansing of name and address data and some related types of data such as email addresses, tax IDs and so on. However, Quality Stage is fully customizable and can cleanse any kind of classifiable data, such as infrastructure, inventory, health data, and so on.
Investigation
By investigation we mean inspection of the data to reveal certain types of information about those data. There is some overlap between Quality Stage investigation and the kinds of profiling results that are available using Information Analyzer, but not so much overlap as to suggest removing functionality from either tool. Quality Stage can undertake three different kinds of investigation.
Features
Investigate methods
Character Investigation
Single-domain fields
Entity Identifiers:
Eg: ZIP codes, SSN, Canadian postal codes
Entity Clarifiers:
Eg: name prefix, gender, and marital status.
Multiple-domain fields
in which each occurs. For example, under the USNAME rule set the name WILLIAM F. GAINES III would report the pattern FI?G: the F indicates that WILLIAM is a known first name, the I indicates that F is an initial, the ? indicates that GAINES is not a known word in context, and the G indicates that III is a "generation", as would be "Senior", "IV" and "fils". Punctuation may be included or ignored.
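The pattern in the example above can be reproduced with a small sketch. The classification entries, the `classify` and `word_pattern` helper names, and the rule that a single letter counts as an initial are all assumptions made for illustration; they are not the contents of the shipped USNAME rule set.

```python
# Hypothetical, hand-made classification table: token -> one-character class.
# The classes mirror the example in the text (F = first name, G = generation).
CLASSIFICATION = {
    "WILLIAM": "F",
    "BILL": "F",
    "III": "G",
    "IV": "G",
    "SENIOR": "G",
    "FILS": "G",
}


def classify(token: str) -> str:
    """Return the class of one token; '?' marks a word that is not known."""
    word = token.rstrip(".")  # this sketch ignores trailing punctuation
    if word in CLASSIFICATION:
        return CLASSIFICATION[word]
    if len(word) == 1 and word.isalpha():
        return "I"  # assumption: a single letter is treated as an initial
    return "?"


def word_pattern(text: str) -> str:
    """Build the pattern for a free-text field, one class per token."""
    return "".join(classify(tok) for tok in text.upper().split())


print(word_pattern("WILLIAM F. GAINES III"))  # prints FI?G
```

A frequency count of such patterns over a whole column is what makes unusual or unhandled name formats stand out.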
Rule sets also come into play when performing standardization (discussed below).

Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example, William might be recorded as the standard form for Bill) and may contain an uncertainty threshold (for example, Felliciity might still be recognizable as Felicity even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage.
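To make the idea of a standard form plus an uncertainty threshold concrete, here is a minimal sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, which is an assumption: the product's own comparison routine is not described in these notes. The table contents, the thresholds and the `standard_form` name are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical table: known word -> (standard form, minimum similarity accepted).
# A threshold of 1.0 means only an exact spelling is accepted.
TABLE = {
    "BILL": ("WILLIAM", 1.0),
    "WILLIAM": ("WILLIAM", 1.0),
    "FELICITY": ("FELICITY", 0.8),
}


def standard_form(word):
    """Return the standard form of the best acceptable table entry, or None."""
    word = word.upper()
    best, best_score = None, 0.0
    for known, (standard, threshold) in TABLE.items():
        score = SequenceMatcher(None, word, known).ratio()
        # Accept the entry only if it clears its own threshold.
        if score >= threshold and score > best_score:
            best, best_score = standard, score
    return best


print(standard_form("Bill"))        # WILLIAM
print(standard_form("Felliciity"))  # FELICITY
print(standard_form("Gaines"))      # None
```

Lowering a threshold accepts more misspellings but also raises the risk of mapping an unrelated word to the wrong standard form.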
Investigation might also be performed to review the results of standardization, particularly to see whether there are any unhandled patterns or text that could be better handled if the rule set itself were tweaked, either with improved classification tables or through a mechanism called rule set overrides.
Standardization
Standardization, as the name suggests, is the process of generating standard forms of data that might more reliably be matched. For example, generating the standard form William from Bill increases the likelihood of finding the match between William Gates and Bill Gates. Other standard forms that can be generated
Output Records
US  Y  100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA  Y  SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB  Y  28 GROSVENOR STREET LONDON W1X 9FE
US  N  123 MAIN STREET
Output Record
Name Domain: TINA FISHER ATTN IBM
Address Domain: 211 WASHINGTON DR PO BOX 52
Area Domain: WESTBORO , MA 02140
Output Record
House Number: 211
Street Name: WASHINGTON
Street Suffix Type: DR
Box Type: PO BOX
Box Value: 52
Call subroutines for each sub-domain (i.e. country name, post code,
province, city)
Rule Sets
Reference Tables
Standardization Example
Any character that is in the SEPLIST and not in the STRIPLIST will be used to separate tokens and will also become a token itself.
Any character that is in both lists will be used to separate tokens but will not become a token itself.
The best example of this is the space character: one or more spaces are stripped, but the space indicates where one token ends and another begins.
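The two rules above can be sketched as a small tokenizer. The `tokenize` function is a hypothetical helper, not product code, and its handling of a character that appears only in the STRIPLIST (dropped without splitting the token) is an assumption, since the text above does not cover that case.

```python
def tokenize(text, seplist, striplist):
    """Split text into tokens using SEPLIST / STRIPLIST style character sets."""
    tokens, current = [], []
    for ch in text:
        if ch in seplist:
            # A separator always ends the token being built.
            if current:
                tokens.append("".join(current))
                current = []
            # A separator that is not stripped becomes a token itself.
            if ch not in striplist:
                tokens.append(ch)
        elif ch in striplist:
            # Assumption: strip-only characters vanish without splitting.
            continue
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# The space is in both lists; the comma is only in the separator list.
print(tokenize("PO BOX 52, BOSTON", seplist=" ,", striplist=" "))
```

With these settings the comma survives as its own token while runs of spaces disappear, which matches the space-character example in the text.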
Classification
First, key words that can provide special context are classified
Provided by the standardization rule set classification table
Since these classes are context specific, they vary across rule
sets
Classification order
First, key words that can provide special context are
classified
Classification Example
Default Classes
Class    Description
>
<
Some special characters are reserved for use as default classes that
describe token values that are not actual special character values
For example: ^ + ? > < @ (as described on the previous slide)
Essentially, the NULL class does to complete tokens what the STRIPLIST
does to individual characters
Therefore, you will never see the NULL class represented in the
assembled lexical patterns
Classification Table
Classification Tables contain three required space-delimited columns:
1. Key word that can provide special context
2. Standard value for the key word
   Standard value can be either an abbreviation or an expansion
   The pattern-action file will determine if the standard value is used
3. Data class (one-character tag) assigned to each key word
Key word    Standard value    Data class
NORTH       N                 D
FLOOR       FL                F
STREET      ST                T
APARTMENT   APT               U
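As a rough illustration of how the three columns are used, the sketch below loads the four example rows and classifies a token list. The helper names are hypothetical, and the fallback classes (`^` for a numeric token, `?` for an unknown word) are stated here as assumptions about the default classes, which these notes list but do not define.

```python
# The four example rows: key word, standard value, data class.
CLASSIFICATION_TEXT = """\
NORTH N D
FLOOR FL F
STREET ST T
APARTMENT APT U
"""


def load_classification(text):
    """Parse space-delimited rows into {key word: (standard value, class)}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword, standard, data_class = line.split()[:3]
        table[keyword] = (standard, data_class)
    return table


def classify_tokens(tokens, table):
    """Return (token, standard value, class) for each token."""
    result = []
    for tok in tokens:
        if tok in table:
            standard, data_class = table[tok]
            result.append((tok, standard, data_class))
        elif tok.isdigit():
            result.append((tok, tok, "^"))  # assumption: numeric default class
        else:
            result.append((tok, tok, "?"))  # assumption: unknown-word class
    return result


def pattern(classified):
    """Join the classes into the pattern string."""
    return "".join(data_class for _, _, data_class in classified)


table = load_classification(CLASSIFICATION_TEXT)
classified = classify_tokens("100 SUMMER STREET".split(), table)
print(pattern(classified))  # prints ^?T
```

Whether the standard value (ST) or the original key word (STREET) ends up in the output is decided later by the pattern-action file, as noted above.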
The order that the columns are listed in the dictionary file defines the
order the columns appear in the standardization rule set output
Exception data: placeholder column for storing invalid input data (alternative to deletion)
temp = N
Subroutines
Pattern-Action Sets
Each subroutine starts with a header line (\SUB) and ends with a trailer line (\END_SUB)
Subroutines can be called by MAIN or by other subroutines
When called, a subroutine is processed sequentially until a RETURN command or \END_SUB is encountered
House Number = 50
State Abbreviation = MA
Unhandled data may represent the entire input or a subset of the input
User Overrides
User overrides provide the user with the ability to make modifications
without directly editing the classification table or the pattern-action file
input pattern
input text
unhandled pattern
unhandled text
Classification Override
There are two subroutines in each delivered rule set that are specifically for users to add pattern-action language
Unhandled Modifications
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.qs.patguide.doc/c_qspatact_container_topic.html
What is Matching ?
Matching is the real heart of Quality Stage. Different probabilistic algorithms are available for different types of data. Using the frequencies developed during investigation (or subsequently), the information content (or "rarity value") of each value in each field can be estimated. The less common a value, the more information it contributes to the decision. A separate agreement weight or disagreement weight is calculated for each field in each data record, incorporating both its information content (likelihood that a match actually has been found) and its probability that a match has been found purely at random. These weights are summed across the fields in the record to come up with an aggregate weight that can be used as the basis for reporting that a particular pair of records probably are, or probably are not, duplicates of each other.

There is a third possibility, a "grey" area in the middle, which Quality Stage refers to as the "clerical review" area: record pairs in this category need to be referred to a human to make the decision because there is not enough certainty either way. Over time the algorithms can be tuned with things like improved rule sets, weight overrides, different settings of probability levels and so on, so that fewer and fewer "clericals" are found.
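The weighting scheme described above can be sketched with the classic Fellegi-Sunter formulation, which is assumed here as the underlying model. In this sketch, m is the probability that a field agrees when two records really are the same entity, and u is the probability that it agrees purely by chance. The field names, probabilities, cutoffs and function names are all illustrative assumptions.

```python
import math


def agreement_weight(m, u):
    """Weight added when a field agrees; rarer chance agreement scores higher."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight (normally negative) added when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def composite_weight(agreements, probabilities):
    """Sum the per-field weights for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = probabilities[field]
        if agrees:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


def decide(weight, match_cutoff, clerical_cutoff):
    """Map an aggregate weight to match, clerical review or nonmatch."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


# Hypothetical (m, u) pairs: surnames agree by chance less often than ZIP codes.
PROBABILITIES = {"surname": (0.9, 0.01), "zip": (0.9, 0.1)}

weight = composite_weight({"surname": True, "zip": False}, PROBABILITIES)
print(round(weight, 2), decide(weight, match_cutoff=8.0, clerical_cutoff=2.0))
```

Raising the match cutoff or lowering the clerical cutoff widens the grey area; the tuning mentioned above aims to shrink it without misclassifying pairs.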
Matching makes use of a concept called "blocking", which is an unfortunately chosen term that means that potential sets of duplicates form blocks (or groups, or sets) which can be treated as separate sets of potentially duplicated values. Each block of potential duplicates is given a unique ID, which can be used by the next phase (survivorship) and can also be used to set up a table of linkages between the blocks of potential duplicates and the keys to the original data records that are in those blocks. This is often a requirement when de-duplication is being performed, for example when combining records from multiple sources, or generating a list of unique addresses from a customer file, et cetera.
More than one pass through the data may be required to identify all the potential duplicates. For example, one customer record may refer to a customer with a street address but another record for the same customer may include the customer's post office box address. Searching for duplicate addresses would not find this customer; an additional pass based on some other criteria would also be required. Quality Stage does provide for multiple passes, either fully passing through the data for each pass, or only examining the unmatched records on subsequent passes (which is usually faster).
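The sketch below illustrates blocks, block IDs and a second pass over the unmatched records only. It is deliberately simplified: every record sharing a blocking key is treated as a potential duplicate, whereas a real match pass would also score the pairs inside each block. The records, key choices and function names are hypothetical.

```python
from collections import defaultdict


def blocks_for_pass(records, key_func):
    """Group record ids by blocking key; keep groups with 2 or more members."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_func(rec)
        if key:  # records with a missing key are left out of this pass
            groups[key].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def multi_pass(records, key_funcs):
    """Run one pass per key; later passes see only still-unmatched records."""
    remaining = dict(records)
    linkage = {}  # record id -> block id (the table of linkages)
    next_block_id = 1
    for key_func in key_funcs:
        for ids in blocks_for_pass(remaining, key_func):
            for rec_id in ids:
                linkage[rec_id] = next_block_id
                del remaining[rec_id]
            next_block_id += 1
    return linkage


RECORDS = {
    1: {"address": "12 OAK ST", "phone": "555-0101"},
    2: {"address": "12 OAK ST", "phone": ""},
    3: {"address": "9 ELM RD", "phone": "555-0199"},
    4: {"address": "PO BOX 52", "phone": "555-0199"},
    5: {"address": "1 PINE AVE", "phone": "555-0123"},
}

# Pass 1 blocks on address, pass 2 on phone.
linkage = multi_pass(RECORDS, [lambda r: r["address"], lambda r: r["phone"]])
print(linkage)
```

Records 3 and 4 show the street address versus post office box case: the address pass misses them and the phone pass links them. Record 5 stays unmatched and so does not appear in the linkage table.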
Matching vs. Lookups, Joins, and Merges
Within Information Server, multiple stages offer capability that can be
considered matching, for example:
Lookup
Join
Merge
Unduplicate Match
Reference Match
Lookups, Joins, and Merges typically use key attributes, exact match
criteria, or matches to a range of values or simple formats
Multiple records in the data file can match a single record in the
reference file.
Eg: matching a transaction data source to a master data source allows
many transactions for one person in the master data source.
Survivorship
As the name suggests, survivorship is about what becomes of the data in these blocks of potential duplicates. The idea is to get the "best of breed" data out of each block, based on built-in or custom rules such as "most frequently occurring non-missing value", "longest string", "most recently updated" and so on.
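Two of the rules named above can be sketched as plain functions applied field by field to one block. The function names, the sample block and the choice of rule per field are assumptions for illustration only.

```python
from collections import Counter


def most_frequent(values):
    """Most frequently occurring non-missing value, or None if all are missing."""
    present = [v for v in values if v]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def longest(values):
    """Longest non-missing string, or None if all are missing."""
    present = [v for v in values if v]
    return max(present, key=len) if present else None


def survive(block, rules):
    """Build one surviving record by applying a rule to each field of a block."""
    return {
        field: rule([rec.get(field) for rec in block])
        for field, rule in rules.items()
    }


# Hypothetical block of potential duplicates.
BLOCK = [
    {"name": "BILL GATES", "dob": None, "city": "SEATTLE"},
    {"name": "WILLIAM GATES", "dob": "1970-01-01", "city": "SEATTLE"},
    {"name": "W GATES", "dob": None, "city": "REDMOND"},
]

RULES = {"name": longest, "dob": most_frequent, "city": most_frequent}
print(survive(BLOCK, RULES))
```

Because the rules are chosen per field, the surviving record can combine values from several source records, such as the longest name from one and the only known date of birth from another.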
The data that fulfill the requirements of these rules can then be handled in a couple of ways. One technique is to come up with a master record, a "single version of the truth", that will become the standard for the organization. Another possibility is that the improved data could be populated back into the source systems whence they were derived; for example, if one source were missing date of birth, this could be populated because the date of birth was obtained from another source, or from more than one. If this is not the requirement (perhaps for legal reasons), then a table containing the linkage between the source records and the "master record" keys can be created, so that the original source systems also have the ability to refer to the "single source of truth" and vice versa.
Summary
IBM is planning to release its next version of the InfoSphere Quality Stage Worldwide Address Verification module (v10)
Release time frame is Q4 2012
AVI v10 will have superior functionality and coverage over our current AVI v8.x module (see slide 4)
AVI v10 will leverage new address/decoding reference data
AVI v10 will have broad support for various Information Server versions (see slide 5)
For current AVI v8.x customers only:
AVI v8.x will have continued support until the end of Dec. 2013
Address reference data for AVI v8.x has been discontinued by the vendor and ends in Dec. 2013