WHITE PAPER
Wipro Technologies
Innovative Solutions, Quality Leadership
Table of Contents
Introduction
Customer Name and Address Data
Significance of Cleanliness
Concept of Clean Data
Standardization or Data Consistency
De-duplication
Conformance to Postal Rules or Data Verification
Completeness
Isolation of Personal & Business Names
Data Integration or Linking
Cross-Column Based Correctness Analysis
Introduction
With the arrival of the information age, huge amounts of a wide variety of data have been made readily accessible through voluminous databases, the Internet and sophisticated communication networks. Applications of the Internet in supply chains, data mining, knowledge management and many other related areas are developing dramatically [3].
There has been a considerable increase in independent distributed database servers that directly provide online information-retrieval services to end users. This has greatly facilitated the manipulation of multi-source data. The resulting integration of data from multiple independent sources, for effective information sharing, gives rise to certain incompatibilities [1]. Most users neglect the importance of data, treating it as mere stuff stored in computers, but data is the real fuel of the information technology engine [2]. Incorrect and meaningless data should be removed, as it leads to faulty analysis and consequent loss.
Without accurate data, users lose confidence in the database and make improper decisions [3]. The impact of data errors is felt by everyone at one time or another, no matter which area the user works in: management, Internet applications, financial applications, marketing or information services. Most often, poor data quality results in loss of time, money and customer confidence, and causes embarrassment [4].
Proprietary studies have found that data quality problems cost 10% of total revenue [3]. Where data quality is poor, the staff of an organization spends 25% of its time handling customer complaints caused by erratic data, checking data that should already be correct when it reaches a particular department, fixing incorrect data, finding missing data and clarifying data that does not make any sense. Some disasters resulting from poor data quality that were highlighted in the news in the recent past are listed below [5]:
Mail from INS stuns flight school by Kevin Johnson (USA Today, March 13, 2002)
As a result of mismanaged documents, a Florida flight school received notice from the Immigration and Naturalization Service of approval of student visas for two of the September 11th terrorists, six months to the day after the attack on America.
W Hotels Room-Rate Mistake Benefits Some New York Guests by Jane Costello (The Wall Street Journal Online, January 3, 2002)
Poor data quality hurt the profits of a hotel chain when a room rate was misquoted on its web site.
Right Answer, Wrong Score: Test Flaws Take Toll By Diana B. Henriques and Jacques
Steinberg (The New York Times, May 20, 2001)
Bombing of Chinese Embassy
Using out-dated information, the CIA selected the address of an armory as a bombing target. At the time of the bombing, however, the building housed the Chinese Embassy. Result: tragic loss of life and property.
Data Quality in Small-Medium Enterprises - A Problem and an Approach: A Case Analysis
The 2000 Presidential Election in the United States held our attention for weeks as we
tried to determine:
- The accuracy of various counts and recounts
- How the end-to-end election chain is supposed to work
- The final result and consequences of the above.
On the Web, Pricing Errors Can Be Costly in More Ways Than One (The New York
Times, December 17, 1999)
Group Asking U.S. for New Vigilance in Patient Safety (The New York Times, November
30, 1999, National Desk)
The health industry has been rocked by the news that poor quality kills up to 98,000 people annually. Fortunately, not all of these deaths are due to poor data, but many are; poor prescriptions are a good example.
Oh, Those Pesky Little Financial Details by Gretchen Morgenson (Market Watch, The
New York Times, January 31, 1999)
Poor quality data led to corporate embarrassment and drops in stock prices as many
companies were forced to restate corporate earnings.
A few examples specifically pointing to the importance of customer name and address data are given below (all for the US):
- By one estimate, more than 175,000 IRS and state tax refund checks are marked as undeliverable by the postal service in a year
- An audit estimated that 15-20% of voters on voter registration lists have either moved or are deceased, when compared with post office relocation data
- An acquiring company learnt only after closing a deal for new consumer business that, because of duplicated data, it had gained only half the customers it had anticipated
- A fiber-optics company lost $500,000 after a mislabeled shipment caused the wrong cable to be laid at the bottom of a lake
- The US government estimates that billions of dollars are lost annually due to poor data quality
While the examples above draw attention to severe data quality problems in general, the authors would like to concentrate on corporate problems relating to customer names, addresses and products.
The foundation of effective customer relationships is built on customer data of the highest standard of quality and integrity. Managing customer name and address data is not a simple task, as such data is often volatile: customers' names and addresses change, and are also easily edited on data entry screens [6].
On the Internet, customer acquisition costs have soared to US$65-250 per customer. More than 50% of Internet companies cannot respond to their customers, and cannot relate to them, for lack of good-quality customer data [Members]. Acquiring a new customer costs 6-7 times more than retaining an existing one, and retaining 5% more of an organization's best customers can boost profits by as much as 100%. Personal bankruptcies in 2001 increased 20% over 2000 (source: American Bankruptcy Institute), and credit card charge-off rates are at an all-time high. Even in uncertain economic conditions, managing these risks can keep an organization in shape [7].
A Data Warehousing Institute survey observed that only one in four US companies has implemented data quality initiatives. Meanwhile, 15% of the data in the typical US customer database is inaccurate, as confirmed by national data audits [8]. Based on interviews with industry experts and customers, and on data surveys, data quality problems have been estimated to cost US businesses $600 billion a year [9].
To analyze the cost of bad data, William Weil of Innovative Systems Inc. carried out the following analysis. Assume a company's customer list is 90% accurate. Of the 10% inaccurate customer records, 5% (i.e. 0.5% of all records) have unusable addresses that could have been rectified, and the cost of retaining each customer is estimated at $100-1,000. In a large organization with around a million customers, that 0.5% (i.e. 5,000 customers) would be lost if they are not identified correctly in the organization's database. Losing 5,000 customers results in a direct cost of $500,000 - $5,000,000 (i.e. 5,000 * {100-1,000}). Bad data is thus considerably expensive, and data cleaning is an activity that cannot be neglected at any cost.
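Weil's arithmetic can be reproduced in a short script; the figures are those used in the estimate above, and the function name is our own:

```python
def lost_customer_cost(total_customers, unusable_rate, cost_low, cost_high):
    """Direct cost range of customers lost to unusable (but correctable) addresses."""
    lost = int(total_customers * unusable_rate)  # customers who cannot be reached
    return lost, lost * cost_low, lost * cost_high

# 1,000,000 customers; 10% of records inaccurate, of which 5% (0.5% overall)
# have unusable addresses; retention cost $100-1,000 per customer.
lost, low, high = lost_customer_cost(1_000_000, 0.005, 100, 1000)
print(lost, low, high)  # 5000 500000 5000000
```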
To communicate effectively with its customers, by phone or by mail, an organization must maintain an exceptionally clean customer list. Credibility is lost by using nonsensical or misspelled addresses and by sending multiple letters to the same person. In banking and healthcare, customer matching is an important issue: it is desirable to know when a customer buys a product repeatedly, and this cannot be done with poor-quality data. Cross-selling, i.e. identifying the overall needs of a household (comprising identified individuals) and suggesting an effective consolidation or expansion of products, is also an application built on customer data. It can also benefit an organization to know when it is dealing with multiple commercial organizations that turn out to be parts of a larger parent organization. Some applications require merging internal and external data, e.g. after an acquisition; even if the external data is of good quality, its format may not coincide with the internal one.
Many more customer-based applications can be cited that strongly need the firm foundation of a good-quality customer database. A few of them are [10]:
CRM and Customer Systems
E-business and Web
Call Centers
Marketing Systems
Data Matching and Compliance
Justice, Intelligence and Anti-Fraud
Householding and Customer Matching
Many major industries could benefit from customer data cleansing.
Table 1: Inconsistent representations of the same data

Country : United States                    / US
State   : Uttar Pradesh                    / UP
City    : Bombay                           / Mumbai
Company : GEIS International Incorporation / GEIS INTL Ltd
Address : B44, Gautam Budh Nagar, UP       / B44, G B Nagar, UP

The prevalence of such inconsistencies leads to duplicate records. Data should be properly standardized according to well-defined business rules.
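A standardization rule of the kind Table 1 calls for can be sketched as a simple canonicalization lookup; the dictionary and function below are illustrative, not part of any product:

```python
# Canonical forms per field, following the inconsistencies shown in Table 1.
CANONICAL = {
    "country": {"us": "United States", "united states": "United States"},
    "state":   {"up": "Uttar Pradesh", "uttar pradesh": "Uttar Pradesh"},
    "city":    {"bombay": "Mumbai", "mumbai": "Mumbai"},
}

def standardize(field, value):
    """Map a raw value to its canonical form; unknown values pass through unchanged."""
    return CANONICAL.get(field, {}).get(value.strip().lower(), value.strip())

print(standardize("country", "US"))   # United States
print(standardize("city", "Bombay"))  # Mumbai
```

In practice the rule table would be driven by the organization's own business rules rather than hard-coded.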
De-duplication
Consider the last row of Table 1 and assume the address belongs to Wipro Technologies: duplicate records will be created for the company at the same location (i.e. B44, Gautam Budh Nagar, UP). Such duplication also causes data integration problems: if we try to join two tables on this attribute (as in the current example), "Gautam Budh Nagar" will not compare equal to "G B Nagar", though they are the same place.
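The following is a minimal sketch of why such duplicates survive naive equality joins and how an abbreviation-aware comparison might catch them; the heuristic is our own, not Trillium's:

```python
def same_place(a, b):
    """True when two place names differ only by abbreviation of words to
    their initials, e.g. 'Gautam Budh Nagar' vs 'G B Nagar'."""
    ta, tb = a.lower().split(), b.lower().split()
    if len(ta) != len(tb):
        return False
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))   # 'g' matches 'gautam'
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta, tb)
    )

print("Gautam Budh Nagar" == "G B Nagar")            # False: equality join misses it
print(same_place("Gautam Budh Nagar", "G B Nagar"))  # True
```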
Completeness
In many cases, values of particular attributes are simply absent from the customer database. This missing-data problem has been observed to occur more frequently (in about 65% of cases) than other quality problems (60% or less) [9].
Business Value
Business executives should view data quality as a direct means to reduce costs and improve revenue. Data quality should be treated as a business issue, not only a technological concept. A company's survival depends on the information hidden under its piles of data; if the data is bad, the system is bound to fail. Data quality is often overlooked in the trade-off between time, cost and performance, and it has to be made sure that it is not ignored in the race for low cost and delivery before deadline. Organizations must realize that there is no point in being quick if the information is poor. Reports from the META Group indicate that 75% of companies in the US have yet to implement any data quality initiative. Business people ignore data quality because of:
- The little attention drawn to the issue by technical staff
- The cost and difficulty of implementing a data quality initiative
- The inability to measure RoI [12]
To see the returns from a data quality initiative, consider the two examples below:
1. Calculate how much a company saves on postage by de-duplicating a customer list with a 20% duplication ratio.
2. Accurate analysis of good customer data that reduces anomalies by just 2% among a million bank customers is significant, given that a customer keeps about $15,000 with the bank over a 5-year period [13].
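The first example can be turned into a quick estimate. The 20% duplication ratio comes from the text; the per-piece postage cost and mailing frequency below are purely illustrative assumptions:

```python
def postage_savings(records, duplicate_rate, cost_per_piece, mailings_per_year):
    """Annual postage saved by suppressing duplicate customer records."""
    duplicates = int(records * duplicate_rate)
    return duplicates * cost_per_piece * mailings_per_year

# 1,000,000 records with a 20% duplication ratio;
# $0.40 per piece and 4 mailings a year are assumed figures.
print(postage_savings(1_000_000, 0.20, 0.40, 4))  # 320000.0
```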
It is now easy for business executives to understand the profound effect data quality can have on a business. Estimates show that 15-20% of the data in an organization can be erroneous. As external data collection grows in the Internet age, the quality of data, and control over it, declines; and as customers can now see a large amount of data via the Internet, that data naturally demands greater accuracy.
Some factors useful in achieving good data quality across the enterprise:
- Finding the soft points, or channels, that most need data quality implementation and can contribute directly to business goals
- Observing ROI [14]
For CRM, SCM, ERP, Web channels, e-commerce and similar systems, correct data is the necessary input. The authors have personally witnessed data warehouse projects in which all data items were well defined, developed, executed, managed, delivered and received, but which finally did not deliver the expected results because of quality problems inherent in the data: teams were working with bad-quality data to get good-quality results. Even minor differences in customer data can distort a unified view of subjects like customer and product. When a number of systems exist in isolation, seamless integration of the data becomes difficult, so there has to be a system that can ensure it. Suppose, for example, that the expensive data warehouse, the sensitive CRM, the wide ERP, the long SCM and the integrated e-commerce system all deal with the same set of customers in reality, but digitally cannot map two identical customers because a few bits differ. Will we get a unified view? Total figures will match, but today's strategies focus on segmentation down to the most granular level possible; lose granularity and we lose the details needed to chalk out strategies. There is no point in incorporating data quality components in each system separately, as that not only increases cost but also raises the risk of data disparity. Why not establish a universal source for common data? Having realized a strong need for this, the authors propose a Central Customer Data Bus (CCDB) based on existing data quality architectures available in the market.
Proposed architecture [16]
The CCDB architecture will deal with issues pertaining to data quality, viz. selection, correction, modification, enhancement and integration of data. The CCDB will be an architecture that employs various tools and technologies to achieve the desired data quality: data cleansing tools such as the Trillium Software System and dfStudio provide the data quality itself, while Java and ASP can be employed to integrate the components with the Web and other existing systems.
The CCDB is proposed to have the following components:
Collection:
This phase identifies and profiles the channels/sources of data as good, bad, accurate, reliable, confidential, erroneous, and so on. Each data channel is treated with a customized solution, providing data quality specific to that channel's error vulnerability; standardization of data also begins at each channel. A data quality initiative that claims to correct data at runtime needs to identify the sources of data and to assess the intensity and style of the data correction to be put in.
Standardization:
After collection, standard or specific common formats and rules are applied to the incoming data.
Correction:
In this phase, the data is corrected according to its definition, value domains, ranges and the missing-value algorithms defined.
Unification:
Profile and match new data against existing data and establish a unique identity across the enterprise. This step identifies similar data within and across data sources, appends new data and supplies missing data. Enhancement may take the form of demography, geography, credit or revenue earned.
Integration:
In this step, data is propagated to all the child or peer systems that will use the data from the data quality initiative. This step provides links and connections, integrates the data and ensures data consistency across systems.
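The five phases can be sketched as a pipeline. Every function below is a placeholder standing in for the behaviour described above; the names, rules and record layout are our own, not a CCDB implementation:

```python
def collect(record, source):
    record["_source"] = source                # profile/tag the channel of origin
    return record

def standardize_record(record):
    record["name"] = record["name"].strip().title()  # apply common format rules
    return record

def correct(record):
    record.setdefault("country", "Unknown")   # an example missing-value rule
    return record

def unify(record, known_ids):
    key = record["name"].lower()              # match against existing identities
    record["id"] = known_ids.setdefault(key, len(known_ids) + 1)
    return record

def integrate(record, target_systems):
    for system in target_systems:             # push the cleansed record downstream
        system.append(record)
    return record

known, crm, dw = {}, [], []
rec = {"name": "  acme corp "}
integrate(unify(correct(standardize_record(collect(rec, "Web"))), known), [crm, dw])
print(rec)  # {'name': 'Acme Corp', '_source': 'Web', 'country': 'Unknown', 'id': 1}
```

The point of the sketch is the ordering: each record passes through all five phases once, and downstream systems receive only the unified, cleansed copy.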
[Figure: CCDB Architecture. Customer data arriving from channels such as the Web, ERP and SCM, along with internal and external databases, flows through the Collection, Standardization, Correction, Unification and Integration phases, governed by business rules and system standards.]
De-duplication tools available in the market include Integrity (Vality), i.d.Centric (firstLogic), PureIntegrate (Carleton) and TwinFinder (Omikron).
Quick and easy integration into current CRM, eBusiness, and data warehouse applications
A solution that conforms to business rules, rather than forcing business rules to conform to it
Identifying and understanding the difference between a ship-to and a bill-to address
Intelligently identifies elements in free-form data such as first and last names, business names, address lines, cities, serial numbers, dates, part numbers and other data shapes
Recodes data patterns for consistency with support for international data
Methodology
Trillium processes data through a flow of four modules:
Converter: Data discovery and transformation
Profile, investigate, analyze, format, scan/clean and recode
Parser: Customer identification and enhancement
Elementize, classify, standardize, correct and transform name & address
Geocoder: Global customer validation
Verify addresses, global postal codes, address correction, and append census data
Matcher: Customer relationship matching
Identify duplicates at personal and household level and present the output as Pass (exact match) / Fail (fresh record) / Suspect (not exactly a match).
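The Matcher's three-way output can be imitated with a generic string-similarity score. The thresholds and the use of Python's difflib below are our assumptions for illustration, not Trillium's actual algorithm:

```python
from difflib import SequenceMatcher

def match_grade(candidate, existing, pass_cut=0.95, suspect_cut=0.7):
    """Grade a candidate record against an existing one, in the spirit of the
    Matcher's Pass / Suspect / Fail output (thresholds are assumed)."""
    score = SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()
    if score >= pass_cut:
        return "Pass"      # exact or near-exact match
    if score >= suspect_cut:
        return "Suspect"   # possible duplicate, needs review
    return "Fail"          # treat as a fresh record

print(match_grade("Wipro Technologies", "Wipro Technologies"))  # Pass
print(match_grade("Wipro Tech", "Wipro Technologies"))          # Suspect
print(match_grade("Wipro Technologies", "Infosys"))             # Fail
```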
Implementation
The four platform-independent modules may be implemented in systems in several ways:
Callable Functions
Library functions accessible from virtually any calling program and/or communications
manager
Distributed Objects
CORBA, COM+, EJBeans, Java/JNI
Plug-Ins
Fully integrated with DataStage, PowerCenter, Siebel, Ab Initio and Microsoft's Data Transformation Services (DTS) for SQL Server
Control Center
This Trillium utility is a powerful GUI that helps set up and edit business rules and define unique data quality processes. Within the Control Center, one can establish new parsing standards, define specific business rules for matching and linking records, identify industry terminology, or develop user-defined conditional logic.
Platforms
Trillium currently runs on the following platforms:
PCs (Windows NT/2000)
Minis (AS/400)
- To gather information about all the products and services purchased by customers
- To get a clear picture of the number of quotes provided to a customer, across all lines of business, that turned into a sale
To capture exact answers to these queries, a strong need to understand every aspect of customers across all business units and geographies was recognized.
[Figure: Input customer data flows. Passage 1 (ONLINE PROCESS): a record entered on the online screen passes through callable Trillium APIs, and the cleansed output is displayed on a confirmation screen. Passage 2 (BATCH PROCESS): input batch XML files are fed to the Trillium command line running on UNIX; the resulting output flat file is passed through the callable "matcher" APIs, producing output XML files. Both passages interact with the customer database.]
Effective cleansing was identified as a mandatory activity for better management of core customer data and for perfect alignment with the overall strategy of focusing on customers. Wipro devised a complete application providing data quality solutions for this customer registry system. The two Trillium utilities incorporated in the application were the callable APIs and the command-line (batch) utility:
As shown in the figure, in passage 1 (the online process), a single customer record, with or without an address, entered through the online screen passes through the Trillium APIs for the Converter, Parser, Geocoder and Matcher modules, called from Java. The final cleansed output is shown on a confirmation screen and can be added to the customer database at the user's discretion.
Passage 2 (the batch process) takes as input XML files consisting of customer records, with or without addresses. Each file is converted to a form acceptable to Trillium's command-line utility. During command-line processing, matching is carried out against a flat file extracted from the customer database at a particular instant. If there is a considerable time lag between the execution of the application and the time the extract was taken, the results of this matching cannot be considered reliable, as records may have been entered through the online screen in the interim. The command-line output, a flat file, is therefore passed through the callable Matcher module APIs alone; the output of this second matching is more reliable because it is carried out against the latest state of the customer database. Finally, an output XML file is created carrying information about the cleansed data and the database interactions that occurred during the entire process.
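The two-pass design above can be sketched as follows. The set-membership matching and the company names are purely illustrative; a real implementation would use the Matcher's similarity logic:

```python
def batch_match(incoming, extract):
    """First pass: drop records already present in a point-in-time extract."""
    return [r for r in incoming if r not in extract]

def final_match(candidates, latest_db):
    """Second pass: re-check survivors against the latest database state,
    catching records entered online after the extract was taken."""
    return [r for r in candidates if r not in latest_db]

extract   = {"acme corp", "globex"}             # flat file taken at time T
latest_db = {"acme corp", "globex", "initech"}  # DB state at load time
incoming  = ["acme corp", "initech", "umbrella"]

survivors = final_match(batch_match(incoming, extract), latest_db)
print(survivors)  # ['umbrella'] -- 'initech' was caught only by the second pass
```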
Conclusion
Data quality is an opportunity for organizations to stay competitive. In an ever-evolving competitive scenario, organizations have been striving to become the best, and the ongoing theme is information. As companies recognize, and there are numerous examples, that information gives an edge over competitors, the quality of the data from which that information is extracted will be the next area to differentiate companies and determine their survival. Data quality brings organizations to a level of rigor at which data is minutely scrutinized for reliability and consistency, and its advantages multiply as the data passes through analysis systems such as data warehouses and business intelligence that strive to extract the hidden information. Data quality initiatives will form a core part of IT and business strategies and will certainly make it to boardrooms. Conclusively, data quality is worth the effort.
References
1. Laudon K. C., "Data Quality and Due Process in Large Inter-organizational Record Systems", Communications of the ACM, 29(1), 4-11, 1986.
2. Redman T. C., Data Quality: The Field Guide, Digital Press, January 2001.
3.
4. www.dataqualitysolutions.com/home.htm
5. OASIS Technical Discussion Group: Customer Information Quality, http://lists.oasis-open.org/archives/members/200005/msg00004.html
6.
7. www.dataflux.com
8.
9.
10. Julien Raigneau, "Data Quality Metrics for Customer Relationship Management", http://nervosa.cic.hull.ac.uk:8080
11. Bob Brauer, "Data Quality: Spinning Straw into Gold", DataFlux Corporation, www.dataflux.com/data/spinning.pdf
12. Len Dubois, "Ten Critical Factors for Successful Enterprise-wide Data Quality", www.crmguru.com/features/2002a/04181d.html
13. Tony Fisher, "Customer Data Integration: The Key to Successful CRM", www.dataflux.com
14.
15. Ralph Kimball, "Dealing with Dirty Data", DBMS Online, September 1996.
Wipro in DW/BI
Wipro Technologies delivers Business Intelligence and Data Warehousing (BI&DW) solutions to customers across the globe. Today it has a dedicated 450-member BI&DW group working on the implementation of BI technology in support of clients worldwide. The group has cutting-edge expertise in data warehousing and mining, ERP and SCM analytics, analytical CRM, e-business analytics and NCR Teradata. To date, the group has delivered BI&DW services at approximately 50 client sites across domain segments such as Retail, Utilities, Manufacturing, Healthcare, Finance, Insurance, Government, Transportation and Telecommunications.
http://www.wipro.com/datawarehouse
Copyright 2003. Wipro Technologies. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without express written permission from Wipro Technologies. Specifications subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
Worldwide HQ
Wipro Technologies,
Sarjapur Road,
Bangalore-560 035,
India.
Tel: +91-80-844 0011.
U.S.A.
Wipro Technologies
1300, Crittenden Lane,
Mountain View, CA 94043.
Tel: (650) 316 3555.
U.K.
Wipro Technologies
137 Euston Road,
London, NW1 2 AA.
Tel: +44 (20) 7387 0606.
France
Wipro Technologies
17, Square Edouard,
VII, 75009 Paris.
Tel: +33 (01) 5343 9058.
Germany
Wipro Technologies
Am Wehr 5,
Oberliederbach,
Frankfurt 65835.
Tel: +49 (69) 3005 9408.
Japan
Wipro Technologies
# 911A, Landmark Tower,
2-1-1 Minatomirai 2-chome,
Nishi-ku, Yokohama 220 8109.
Tel: +81 (04) 5650 3950.
U.A.E.
Wipro Limited
Office No. 124,
Building 1, First Floor,
Dubai Internet City,
P.O. Box 500119, Dubai.
Tel: +97 (14) 3913480.
www.wipro.com
eMail: info@wipro.com