WHITE PAPER
Wipro Technologies
Innovative Solutions, Quality Leadership
Table of Contents
Introduction
Customer Name and Address Data
Significance of Cleanliness
Concept of Clean Data
Standardization or Data Consistency
De-duplication
Conformance to Postal Rules or Data Verification
Completeness
Isolation of Personal & Business Names
Data Integration or Linking
Cross-Column Based Correctness Analysis
Introduction
With the arrival of the information age, huge amounts of a wide variety of data have been made readily accessible through voluminous databases, the Internet and sophisticated communication networks. Applications of the Internet in supply chains, data mining, knowledge management and many other related areas are developing dramatically [3].
There has been a considerable increase in independent distributed database servers that directly provide online information-retrieval services to end users. This has greatly facilitated the manipulation of multi-source data. The resulting integration of data from multiple independent sources, for effective information sharing, gives rise to certain incompatibilities [1]. Most users neglect the importance of data, treating it as mere stuff stored in computers, but data is the real fuel of the information technology engine [2]. Incorrect and meaningless data should be removed, as it leads to faulty analysis and consequent loss.
Without accurate data, users lose confidence in the database and make improper decisions [3]. The impact of data errors is felt by everyone at one time or another, no matter which area the user works in: management, Internet applications, financial applications, marketing or information services. Most often, poor data quality results in loss of time, money and customer confidence, and causes embarrassment [4].
Proprietary studies have found that data quality problems cost 10% of total revenue [3]. Where data quality is poor, the staff of an organization spends 25% of its time handling customer complaints caused by erratic data, checking data that should already be correct when it reaches a particular department, fixing incorrect data, finding missing data and clarifying data that does not make any sense. Some disasters resulting from poor data quality that were highlighted in the news in the recent past are listed below [5]:
Mail from INS stuns flight school by Kevin Johnson (USA Today, March 13, 2002)
As a result of mismanaged documents, a Florida flight school received notice from the Immigration and Naturalization Service of approval of student visas for two of the September 11th terrorists, six months to the day after the attack on America.
W Hotels Room-Rate Mistake Benefits Some New York Guests by Jane Costello (The Wall Street Journal Online, January 3, 2002)
Poor data quality hurt the profits of a hotel chain when a room rate was misquoted on its web site.
Right Answer, Wrong Score: Test Flaws Take Toll By Diana B. Henriques and Jacques
Steinberg (The New York Times, May 20, 2001)
Bombing of Chinese Embassy
Using out-dated information, the CIA selected the address of an armory as a bombing target. At the time of the bombing, however, the building housed the Chinese Embassy. Result: tragic loss of life and property.
Data Quality in Small-Medium Enterprises - A Problem and an Approach: A Case Analysis
The 2000 Presidential Election in the United States held our attention for weeks as we
tried to determine:
- The accuracy of various counts and recounts
- How the end-to-end election chain is supposed to work
- The final result and consequences of the above.
On the Web, Pricing Errors Can Be Costly in More Ways Than One (The New York
Times, December 17, 1999)
Group Asking U.S. for New Vigilance in Patient Safety (The New York Times, November
30, 1999, National Desk)
The health industry has been rocked by the news that poor quality kills up to 98,000 people annually. Fortunately, not all of these deaths are due to poor data, but many are; poor prescriptions are a good example.
Oh, Those Pesky Little Financial Details by Gretchen Morgenson (Market Watch, The
New York Times, January 31, 1999)
Poor quality data led to corporate embarrassment and drops in stock prices as many
companies were forced to restate corporate earnings.
A few examples specifically pointing to the importance of customer name and address data are given below (all for the US):
- By one estimate, more than 175,000 IRS and state tax refund checks are marked as undeliverable by the postal service in a year
- An audit estimated that 15-20% of voters on voter registration lists have either moved or are deceased, when compared with post office relocation data
- An acquiring company learnt only after closing a deal for new consumer business that, because of duplicated data, it had gained only half the customers it had anticipated
- A fiber-optics company lost $500,000 after a mislabeled shipment caused the wrong cable to be laid at the bottom of a lake
- The US government estimates that billions of dollars are lost annually due to poor data quality
While the examples above draw attention to severe data quality problems in general, the authors would like to concentrate on corporate problems relating to customer names, addresses and products.
The foundation of effective customer relationships is built on customer data of the highest standard of quality and integrity. Managing customer name and address data is not a simple task, as such data is often volatile: customers' names and addresses change, and are also easily edited on data entry screens [6].
On the Internet, customer acquisition costs have soared to US$65-250 per customer. More than 50% of Internet companies cannot respond to their customers, and cannot relate to them, for lack of good-quality customer data [Members]. Acquiring a new customer costs 6-7 times more than retaining an existing one, and retaining 5% more of an organization's best customers can boost profits by as much as 100%. Personal bankruptcies in 2001 increased 20% over 2000 (source: American Bankruptcy Institute), and credit card charge-off rates are at an all-time high. Even in uncertain economic conditions, managing these risks can keep an organization in shape [7].
A Data Warehousing Institute survey observed that only one in four US companies has implemented data quality initiatives. Meanwhile, 15% of the data in the typical US customer database is inaccurate, as confirmed by national data audits [8]. Based on interviews with industry experts and customers, and on data surveys, data quality problems have been estimated to cost US businesses $600 billion a year [9].
To analyze the cost of bad data, William Weil of Innovative Systems Inc. carried out the following analysis. Assume a company's customer list is 90% accurate. Of the 10% inaccurate customer records, 5% (i.e. 0.5% of all records) have unusable addresses that could have been rectified, and the cost of retaining each customer is estimated at $100-1,000. In a large organization with around a million customers, that 0.5% (i.e. 5,000 customers) would be lost if they are not identified correctly in the organization's database. Losing 5,000 customers results in a direct cost of $500,000 - $5,000,000 (i.e. 5,000 * {100-1,000}). Bad data is thus considerably expensive, and data cleaning is an activity that cannot be neglected at any cost.
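Weil's arithmetic can be reproduced in a short script; the figures are those used in the estimate above, and the function name is our own:

```python
def lost_customer_cost(total_customers, unusable_rate, cost_low, cost_high):
    """Direct cost range of customers lost to unusable (but correctable) addresses."""
    lost = int(total_customers * unusable_rate)  # customers who cannot be reached
    return lost, lost * cost_low, lost * cost_high

# 1,000,000 customers; 10% of records inaccurate, of which 5% (0.5% overall)
# have unusable addresses; retention cost $100-1,000 per customer.
lost, low, high = lost_customer_cost(1_000_000, 0.005, 100, 1000)
print(lost, low, high)  # 5000 500000 5000000
```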
To communicate effectively with its customers, by phone or by mail, an organization must maintain an exceptionally clean customer list. Credibility is lost by using nonsensical or misspelled addresses and by sending multiple letters to the same person. In banking and healthcare, customer matching is an important issue: it is desirable to know when a customer buys a product repeatedly, and this cannot be done with poor-quality data. Cross-selling, i.e. identifying the overall needs of a household (comprising identified individuals) and suggesting an effective consolidation or expansion of products, is also an application built on customer data. It can also benefit an organization to know when it is dealing with multiple commercial organizations that turn out to be parts of a larger parent organization. Some applications require merging internal and external data, e.g. after an acquisition; even if the external data is of good quality, its format may not coincide with the internal one.
Many more customer-based applications can be cited that strongly need the firm foundation of a good-quality customer database. A few of them are [10]:
CRM and Customer Systems
E-business and Web
Call Centers
Marketing Systems
Data Matching and Compliance
Justice, Intelligence and Anti-Fraud
Householding and Customer Matching
Many major industries could benefit from customer data cleansing.
Table 1: Inconsistent representations of the same data

Country : United States                    / US
State   : Uttar Pradesh                    / UP
City    : Bombay                           / Mumbai
Company : GEIS International Incorporation / GEIS INTL Ltd
Address : B44, Gautam Budh Nagar, UP       / B44, G B Nagar, UP

The prevalence of such inconsistencies leads to duplicate records. Data should be properly standardized according to well-defined business rules.
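A standardization rule of the kind Table 1 calls for can be sketched as a simple canonicalization lookup; the dictionary and function below are illustrative, not part of any product:

```python
# Canonical forms per field, following the inconsistencies shown in Table 1.
CANONICAL = {
    "country": {"us": "United States", "united states": "United States"},
    "state":   {"up": "Uttar Pradesh", "uttar pradesh": "Uttar Pradesh"},
    "city":    {"bombay": "Mumbai", "mumbai": "Mumbai"},
}

def standardize(field, value):
    """Map a raw value to its canonical form; unknown values pass through unchanged."""
    return CANONICAL.get(field, {}).get(value.strip().lower(), value.strip())

print(standardize("country", "US"))   # United States
print(standardize("city", "Bombay"))  # Mumbai
```

In practice the rule table would be driven by the organization's own business rules rather than hard-coded.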
De-duplication
Consider the last row of Table 1 and assume the address belongs to Wipro Technologies: duplicate records will be created for the company at the same location (i.e. B44, Gautam Budh Nagar, UP). Such duplication also causes data integration problems: if we try to join two tables on this attribute (as in the current example), "Gautam Budh Nagar" will not compare equal to "G B Nagar", though they are the same place.
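The following is a minimal sketch of why such duplicates survive naive equality joins and how an abbreviation-aware comparison might catch them; the heuristic is our own, not Trillium's:

```python
def same_place(a, b):
    """True when two place names differ only by abbreviation of words to
    their initials, e.g. 'Gautam Budh Nagar' vs 'G B Nagar'."""
    ta, tb = a.lower().split(), b.lower().split()
    if len(ta) != len(tb):
        return False
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))   # 'g' matches 'gautam'
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta, tb)
    )

print("Gautam Budh Nagar" == "G B Nagar")            # False: equality join misses it
print(same_place("Gautam Budh Nagar", "G B Nagar"))  # True
```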
Completeness
In many cases, values of particular attributes are simply absent from the customer database. This missing-data problem has been observed to occur more frequently (in about 65% of cases) than other quality problems (60% or less) [9].
Business Value
Business executives should view data quality as a direct means to reduce costs and improve revenue. Data quality should be treated as a business issue, not only a technological concept. A company's survival depends on the information hidden under its piles of data; if the data is bad, the system is bound to fail. Data quality is often overlooked in the trade-off between time, cost and performance, and it has to be made sure that it is not ignored in the race for low cost and delivery before deadline. Organizations must realize that there is no point in being quick if the information is poor. Reports from the META Group indicate that 75% of companies in the US have yet to implement any data quality initiative. Business people ignore data quality because of:
- The little attention drawn to the issue by technical staff
- The cost and difficulty of implementing a data quality initiative
- The inability to measure RoI [12]
To see the returns from a data quality initiative, consider the two examples below:
1. Calculate how much a company saves on postage by de-duplicating a customer list with a 20% duplication ratio.
2. Accurate analysis of good customer data that reduces anomalies by just 2% among a million bank customers is significant, given that a customer keeps about $15,000 with the bank over a 5-year period [13].
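The first example can be turned into a quick estimate. The 20% duplication ratio comes from the text; the per-piece postage cost and mailing frequency below are purely illustrative assumptions:

```python
def postage_savings(records, duplicate_rate, cost_per_piece, mailings_per_year):
    """Annual postage saved by suppressing duplicate customer records."""
    duplicates = int(records * duplicate_rate)
    return duplicates * cost_per_piece * mailings_per_year

# 1,000,000 records with a 20% duplication ratio;
# $0.40 per piece and 4 mailings a year are assumed figures.
print(postage_savings(1_000_000, 0.20, 0.40, 4))  # 320000.0
```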
It is now easy for business executives to understand the profound effect data quality can have on a business. Estimates show that 15-20% of the data in an organization can be erroneous. As external data collection grows in the Internet age, the quality of data, and control over it, declines; and as customers can now see a large amount of data via the Internet, that data naturally demands greater accuracy.
Some factors useful in achieving good data quality across the enterprise:
- Finding the soft points, or channels, that most need data quality implementation and can contribute directly to business goals
- Observing ROI [14]
For CRM, SCM, ERP, Web channels, e-commerce and similar systems, correct data is the necessary input. The authors have personally witnessed data warehouse projects in which all data items were well defined, developed, executed, managed, delivered and received, but which finally did not deliver the expected results because of quality problems inherent in the data: teams were working with bad-quality data to get good-quality results. Even minor differences in customer data can distort a unified view of subjects like customer and product. When a number of systems exist in isolation, seamless integration of the data becomes difficult, so there has to be a system that can ensure it. Suppose, for example, that the expensive data warehouse, the sensitive CRM, the wide ERP, the long SCM and the integrated e-commerce system all deal with the same set of customers in reality, but digitally cannot map two identical customers because a few bits differ. Will we get a unified view? Total figures will match, but today's strategies focus on segmentation down to the most granular level possible; lose granularity and we lose the details needed to chalk out strategies. There is no point in incorporating data quality components in each system separately, as that not only increases cost but also raises the risk of data disparity. Why not establish a universal source for common data? Having realized a strong need for this, the authors propose a Central Customer Data Bus (CCDB) based on existing data quality architectures available in the market.
Proposed architecture [16]
The CCDB architecture will deal with issues pertaining to data quality, viz. selection, correction, modification, enhancement and integration of data. The CCDB will be an architecture that employs various tools and technologies to achieve the desired data quality: data cleansing tools such as the Trillium Software System and dfStudio provide the data quality itself, while Java and ASP can be employed to integrate the components with the Web and other existing systems.
The CCDB is proposed to have the following components:
Collection:
This phase identifies and profiles the channels/sources of data as good, bad, accurate, reliable, confidential, erroneous, and so on. Each data channel is treated with a customized solution, providing data quality specific to that channel's error vulnerability; standardization of data also begins at each channel. A data quality initiative that claims to correct data at runtime needs to identify the sources of data and to assess the intensity and style of the data correction to be put in.
Standardization:
After collection, standard or specific common formats and rules are applied to the incoming data.
Correction:
In this phase, the data is corrected according to its definition, value domains, ranges and the missing-value algorithms defined.
Unification:
Profile and match new data against existing data and establish a unique identity across the enterprise. This step identifies similar data within and across data sources, appends new data and supplies missing data. Enhancement may take the form of demography, geography, credit or revenue earned.
Integration:
In this step, data is propagated to all the child or peer systems that will use the data from the data quality initiative. This step provides links and connections, integrates the data and ensures data consistency across systems.
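The five phases can be sketched as a pipeline. Every function below is a placeholder standing in for the behaviour described above; the names, rules and record layout are our own, not a CCDB implementation:

```python
def collect(record, source):
    record["_source"] = source                # profile/tag the channel of origin
    return record

def standardize_record(record):
    record["name"] = record["name"].strip().title()  # apply common format rules
    return record

def correct(record):
    record.setdefault("country", "Unknown")   # an example missing-value rule
    return record

def unify(record, known_ids):
    key = record["name"].lower()              # match against existing identities
    record["id"] = known_ids.setdefault(key, len(known_ids) + 1)
    return record

def integrate(record, target_systems):
    for system in target_systems:             # push the cleansed record downstream
        system.append(record)
    return record

known, crm, dw = {}, [], []
rec = {"name": "  acme corp "}
integrate(unify(correct(standardize_record(collect(rec, "Web"))), known), [crm, dw])
print(rec)  # {'name': 'Acme Corp', '_source': 'Web', 'country': 'Unknown', 'id': 1}
```

The point of the sketch is the ordering: each record passes through all five phases once, and downstream systems receive only the unified, cleansed copy.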
[Figure: CCDB Architecture. Customer data arriving from channels such as the Web, ERP and SCM, along with internal and external databases, flows through the Collection, Standardization, Correction, Unification and Integration phases, governed by business rules and system standards.]
De-duplication tools available in the market include Integrity (Vality), i.d.Centric (firstLogic), PureIntegrate (Carleton) and TwinFinder (Omikron).
Quick and easy integration into current CRM, eBusiness, and data warehouse applications
A solution that conforms to business rules, rather than forcing business rules to conform to it
Identifying and understanding the difference between a ship-to and a bill-to address
Intelligently identifies elements in free-form data such as first and last names, business names, address lines, cities, serial numbers, dates, part numbers and other data shapes
Recodes data patterns for consistency with support for international data
Methodology
Trillium processes data through a flow of four modules:
Converter: Data discovery and transformation
Profile, investigate, analyze, format, scan/clean and recode
Parser: Customer identification and enhancement
Elementize, classify, standardize, correct and transform name & address
Geocoder: Global customer validation
Verify addresses, global postal codes, address correction, and append census data
Matcher: Customer relationship matching
Identify duplicates at personal and household level and present the output as Pass (exact match) / Fail (fresh record) / Suspect (not exactly a match).
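The Matcher's three-way output can be imitated with a generic string-similarity score. The thresholds and the use of Python's difflib below are our assumptions for illustration, not Trillium's actual algorithm:

```python
from difflib import SequenceMatcher

def match_grade(candidate, existing, pass_cut=0.95, suspect_cut=0.7):
    """Grade a candidate record against an existing one, in the spirit of the
    Matcher's Pass / Suspect / Fail output (thresholds are assumed)."""
    score = SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()
    if score >= pass_cut:
        return "Pass"      # exact or near-exact match
    if score >= suspect_cut:
        return "Suspect"   # possible duplicate, needs review
    return "Fail"          # treat as a fresh record

print(match_grade("Wipro Technologies", "Wipro Technologies"))  # Pass
print(match_grade("Wipro Tech", "Wipro Technologies"))          # Suspect
print(match_grade("Wipro Technologies", "Infosys"))             # Fail
```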
Implementation
The four platform-independent modules may be implemented in systems in several ways:
Callable Functions
Library functions accessible from virtually any calling program and/or communications
manager
Distributed Objects
CORBA, COM+, EJBeans, Java/JNI
Plug-Ins
Fully integrated with DataStage, PowerCenter, Siebel, Ab Initio and Microsoft's Data Transformation Services (DTS) for SQL Server
Control Center
This Trillium utility is a powerful GUI that helps set up and edit business rules and define unique data quality processes. Within the Control Center, one can establish new parsing standards, define specific business rules for matching and linking records, identify industry terminology, or develop user-defined conditional logic.
Platforms
Trillium currently runs on the following platforms:
PCs (Windows NT/2000)
Minis (AS/400)
- To gather information about all the products and services purchased by customers
- To get a clear picture of the number of quotes provided to a customer, across all lines of business, that turned into a sale
To capture exact answers to these queries, a strong need to understand every aspect of customers across all business units and geographies was recognized.
[Figure: Input customer data flows. Passage 1 (ONLINE PROCESS): a record entered on the online screen passes through callable Trillium APIs, and the cleansed output is displayed on a confirmation screen. Passage 2 (BATCH PROCESS): input batch XML files are fed to the Trillium command line running on UNIX; the resulting output flat file is passed through the callable "matcher" APIs, producing output XML files. Both passages interact with the customer database.]
Effective cleansing was identified as a mandatory activity for better management of core customer data and for perfect alignment with the overall strategy of focusing on customers. Wipro devised a complete application providing data quality solutions for this customer registry system. The two Trillium utilities incorporated in the application were the callable APIs and the command-line (batch) utility:
As shown in the figure, in passage 1 (the online process), a single customer record, with or without an address, entered through the online screen passes through the Trillium APIs for the Converter, Parser, Geocoder and Matcher modules, called from Java. The final cleansed output is shown on a confirmation screen and can be added to the customer database at the user's discretion.
Passage 2 (the batch process) takes as input XML files consisting of customer records, with or without addresses. Each file is converted to a form acceptable to Trillium's command-line utility. During command-line processing, matching is carried out against a flat file extracted from the customer database at a particular instant. If there is a considerable time lag between the execution of the application and the time the extract was taken, the results of this matching cannot be considered reliable, as records may have been entered through the online screen in the interim. The command-line output, a flat file, is therefore passed through the callable Matcher module APIs alone; the output of this second matching is more reliable because it is carried out against the latest state of the customer database. Finally, an output XML file is created carrying information about the cleansed data and the database interactions that occurred during the entire process.
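The two-pass design above can be sketched as follows. The set-membership matching and the company names are purely illustrative; a real implementation would use the Matcher's similarity logic:

```python
def batch_match(incoming, extract):
    """First pass: drop records already present in a point-in-time extract."""
    return [r for r in incoming if r not in extract]

def final_match(candidates, latest_db):
    """Second pass: re-check survivors against the latest database state,
    catching records entered online after the extract was taken."""
    return [r for r in candidates if r not in latest_db]

extract   = {"acme corp", "globex"}             # flat file taken at time T
latest_db = {"acme corp", "globex", "initech"}  # DB state at load time
incoming  = ["acme corp", "initech", "umbrella"]

survivors = final_match(batch_match(incoming, extract), latest_db)
print(survivors)  # ['umbrella'] -- 'initech' was caught only by the second pass
```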
Conclusion
Data quality is an opportunity for organizations to stay competitive. In an ever-evolving competitive scenario, organizations have been striving to become the best, and the ongoing theme is information. As companies recognize, and there are numerous examples, that information gives an edge over competitors, the quality of the data from which that information is extracted will be the next area to differentiate companies and determine their survival. Data quality brings organizations to a level of rigor at which data is minutely scrutinized for reliability and consistency, and its advantages multiply as the data passes through analysis systems such as data warehouses and business intelligence that strive to extract the hidden information. Data quality initiatives will form a core part of IT and business strategies and will certainly make it to boardrooms. Conclusively, data quality is worth the effort.
References
1. Laudon K. C., "Data Quality and Due Process in Large Inter-organizational Record Systems", Communications of the ACM, 29(1), 4-11, 1986.
2. Redman T. C., Data Quality: The Field Guide, Digital Press, January 2001.
3.
4. www.dataqualitysolutions.com/home.htm
5. OASIS Technical Discussion Group: Customer Information Quality, http://lists.oasis-open.org/archives/members/200005/msg00004.html
6.
7. www.dataflux.com
8.
9.
10. Julien Raigneau, "Data Quality Metrics for Customer Relationship Management", http://nervosa.cic.hull.ac.uk:8080
11. Bob Brauer, "Data Quality: Spinning Straw into Gold", DataFlux Corporation, www.dataflux.com/data/spinning.pdf
12. Len Dubois, "Ten Critical Factors for Successful Enterprise-wide Data Quality", www.crmguru.com/features/2002a/04181d.html
13. Tony Fisher, "Customer Data Integration: The Key to Successful CRM", www.dataflux.com
14.
15. Ralph Kimball, "Dealing with Dirty Data", DBMS Online, September 1996.
Wipro in DW/BI
Wipro Technologies delivers Business Intelligence and Data Warehousing (BI&DW) solutions to customers across the globe. Today it has a dedicated 450-member BI&DW group working on the implementation of BI technology in support of clients worldwide. The group has cutting-edge expertise in data warehousing and mining, ERP and SCM analytics, analytical CRM, e-business analytics and NCR Teradata. To date, the group has delivered BI&DW services at approximately 50 client sites across domain segments such as Retail, Utilities, Manufacturing, Healthcare, Finance, Insurance, Government, Transportation and Telecommunications.
http://www.wipro.com/datawarehouse
Copyright 2003. Wipro Technologies. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without express written permission from Wipro Technologies. Specifications subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
Worldwide HQ
Wipro Technologies,
Sarjapur Road,
Bangalore-560 035,
India.
Tel: +91-80-844 0011.
U.S.A.
Wipro Technologies
1300, Crittenden Lane,
Mountain View, CA 94043.
Tel: (650) 316 3555.
U.K.
Wipro Technologies
137 Euston Road,
London, NW1 2 AA.
Tel: +44 (20) 7387 0606.
France
Wipro Technologies
17, Square Edouard,
VII, 75009 Paris.
Tel: +33 (01) 5343 9058.
Germany
Wipro Technologies
Am Wehr 5,
Oberliederbach,
Frankfurt 65835.
Tel: +49 (69) 3005 9408.
Japan
Wipro Technologies
# 911A, Landmark Tower,
2-1-1 Minatomirai 2-chome,
Nishi-ku, Yokohama 220 8109.
Tel: +81 (04) 5650 3950.
U.A.E.
Wipro Limited
Office No. 124,
Building 1, First Floor,
Dubai Internet City,
P.O. Box 500119, Dubai.
Tel: +97 (14) 3913480.
www.wipro.com
eMail: info@wipro.com