White Paper What Is QualityStage?

This white paper is one in a series of non-technical explanations of the various products supported by IBM InfoSphere Information Server. White papers in this series include: What Is Business Glossary?, What Is Information Analyzer?, What Is FastTrack?, What Is QualityStage?, What Is DataStage? and What Is Metadata Workbench?

Information Server is a suite of products and services that combine seamlessly to enable an organization to deliver trusted information to all consumers of that information, as well as to deliver knowledge about that information to the organization itself.

QualityStage
QualityStage is a tool intended to deliver the high-quality data required for success in a range of enterprise initiatives, including business intelligence, legacy consolidation and master data management. It does this primarily by identifying components of data that may be in columns or in free format, standardizing the values and formats of those data, using the standardized results and other generated values to determine likely duplicate records, and building a best-of-breed record out of each set of potential duplicates. Through its intuitive user interface, QualityStage substantially reduces the time and cost to implement Customer Relationship Management (CRM), data warehouse/business intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their return on investment by ensuring data quality. With QualityStage it is possible, for example, to construct consolidated customer and household views, enabling more effective cross-selling, up-selling and customer retention, and to improve customer support and service, for example by identifying a company's most profitable customers. The cleansed data provided by QualityStage allow creation of business intelligence on individuals and organizations for research, fraud detection, and planning.

DSXchange.com All rights reserved.

Out of the box, QualityStage provides for cleansing of name and address data and some related types of data, such as email addresses, tax IDs and so on. However, QualityStage is fully customizable and can be set up to cleanse any kind of classifiable data, such as infrastructure, inventory or health data.

QualityStage Heritage
The product now called QualityStage has its origins in a product called INTEGRITY from a company called Vality. Vality was acquired by Ascential Software in 2002 and the product was renamed QualityStage. This first version of QualityStage reflected its heritage (for example, it only had batch mode operation) and, indeed, its mainframe antecedents (for example, file name components limited to eight characters). Ascential did not do much with the inner workings of QualityStage, which was, after all, already a mature product. Ascential's emphasis was on providing two new modes of operation for QualityStage. One was a plug-in for DataStage that allowed data cleansing/standardization to be performed (by QualityStage jobs) as part of an ETL data flow. The other was to allow QualityStage to use the parallel execution technology (Orchestrate) that Ascential had as a result of its acquisition of Torrent Systems in 2001. IBM acquired Ascential Software in 2005. Since then the main direction has been to put together a suite of products that share metadata transparently and share a common set of services for such things as security, metadata delivery, reporting, and so on. In the particular case of QualityStage, it now shares a common Designer client with DataStage: from version 8.0 onwards, QualityStage jobs run as, or as part of, DataStage jobs, at least in the parallel execution environment.

QualityStage Functionality
QualityStage performs four tasks: investigation, standardization, matching and survivorship. We will look at each of these in turn. Under the covers, QualityStage incorporates a set of probabilistic matching algorithms that can find potential duplicates in data despite variations in spelling, numeric or date values, use of non-standard forms, and various other obstacles to performing the same tasks using deterministic methods. For example, if you have what appears to be the same employee record, where the name is the same but the date of hire differs by a day or two, a deterministic algorithm would report two different employees, whereas a probabilistic algorithm would report a potential duplicate.

(Deterministic means absolute in this sense; either something is equal or it is not. Probabilistic leaves room for some degree of uncertainty; a value is close enough to be considered equal. Needless to say, the degree of uncertainty used within QualityStage is configurable by the designer.)
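The distinction can be sketched in a few lines of Python. The string-similarity measure and the 0.85 threshold below are illustrative stand-ins, not QualityStage's actual algorithms or settings:

```python
from difflib import SequenceMatcher

def deterministic_match(a: str, b: str) -> bool:
    """Absolute comparison: the values are equal or they are not."""
    return a == b

def probabilistic_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat values as equal when their similarity exceeds a
    configurable uncertainty threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Two hire dates one day apart: deterministically different,
# probabilistically "close enough" to flag as a potential duplicate.
print(deterministic_match("2019-03-04", "2019-03-05"))  # False
print(probabilistic_match("2019-03-04", "2019-03-05"))  # True
```

The key design point is the threshold: lowering it finds more potential duplicates at the cost of more false positives, which is exactly the trade-off the designer configures.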

Investigation
By investigation we mean inspection of the data to reveal certain types of information about those data. There is some overlap between QualityStage investigation and the kinds of profiling results that are available using Information Analyzer, but not so much overlap as to suggest removal of functionality from either tool. QualityStage can undertake three different kinds of investigation.

Character discrete investigation looks at the characters in a single field (domain) to report what values or patterns exist in that field. For example, a field might be expected to contain only codes A through E. A character discrete investigation looking at the values in that field will report the number of occurrences of every value in the field (and therefore any out-of-range values, empty or null values, and so on). Pattern in this context means whether each character is alphabetic, numeric, blank or something else. This is useful in planning cleansing rules; for example, a telephone number may be represented with or without delimiters and with or without parentheses surrounding the area code, all in the one field. To come up with a standard format, you need to be aware of what formats actually exist in the data. The result of a character discrete investigation (which can also examine just part of a field, for example the first three characters) is a frequency distribution of values or patterns; the developer determines which.

Character concatenate investigation is exactly the same as character discrete investigation except that the contents of more than one field can be examined as if they were in a single field; the fields are, in some sense, concatenated prior to the investigation taking place. The results of a character concatenate investigation can be useful in revealing whether particular sets of patterns or values occur together.

Word investigation is probably the most important of the three for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected rule set; the other is to report how those words are to be classified, again in terms of the currently selected rule set. Word investigation has no overlap with Information Analyzer (the data profiling tool).
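As a rough sketch of what a character pattern investigation produces, the following maps each character to a class and builds a frequency distribution of the resulting patterns. The class letters used here are illustrative, not QualityStage's exact notation:

```python
from collections import Counter

def char_pattern(value: str) -> str:
    """Map each character to a class: a=alphabetic, n=numeric,
    b=blank; any other character is kept as itself."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)

# Three tellings of "the same" phone number in one field.
phones = ["(02) 9555 1234", "02-9555-1234", "0295551234"]
freq = Counter(char_pattern(p) for p in phones)
for pattern, count in freq.most_common():
    print(pattern, count)
```

The frequency distribution of patterns is exactly what tells the developer which formats actually occur, and therefore what a standardization rule has to handle.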

A rule set includes a set of tables that list the known words or tokens. For example, the GBNAME rule set contains a list of names that are known to be first names in Great Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that can not only be recognized as name prefixes (titles, if you prefer) but can in some cases reveal additional information, such as gender. When a word investigation reports about classification, it does so by producing a pattern. This shows how each known word in the data record is classified, and the order in which each occurs. For example, under the USNAME rule set the name WILLIAM F. GAINES III would report the pattern FI?G: the F indicates that William is a known first name, the I indicates that F is an initial, the ? indicates that Gaines is not a known word in context, and the G indicates that III is a generation (as would be Senior, IV and fils). Punctuation may be included or ignored. Rule sets also come into play when performing standardization (discussed below). Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example, William might be recorded as the standard form of Bill) and may contain an uncertainty threshold (for example, Felliciity might still be recognizable as Felicity even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage. Investigation might also be performed to review the results of standardization, particularly to see whether there are any unhandled patterns or text that could be handled better if the rule set itself were tweaked, either with improved classification tables or through a mechanism called rule set overrides.
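A minimal sketch of how a word investigation might derive the FI?G pattern, using a tiny hypothetical classification table. Real rule sets such as USNAME ship with QualityStage and are far richer than this:

```python
# Hypothetical, tiny classification table in the spirit of a rule set.
CLASSIFICATION = {
    "WILLIAM": "F",   # known first name
    "BILL": "F",
    "III": "G",       # generation
    "SENIOR": "G",
    "MR": "P",        # name prefix / title
}

def classify(token: str) -> str:
    t = token.strip(".").upper()
    if t in CLASSIFICATION:
        return CLASSIFICATION[t]
    if len(t) == 1 and t.isalpha():
        return "I"    # single letter: an initial
    return "?"        # word not known in this context

def word_pattern(name: str) -> str:
    """Classify each token and emit the classes in order."""
    return "".join(classify(tok) for tok in name.split())

print(word_pattern("WILLIAM F. GAINES III"))  # FI?G
```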

Standardization
Standardization, as the name suggests, is the process of generating standard forms of data that might more reliably be matched. For example, by generating the standard form William from Bill, there is an increased likelihood of finding the match between William Gates and Bill Gates. Other standard forms that can be generated include phonetic equivalents (using NYSIIS and/or Soundex), and something like initials (perhaps the first two characters from each of five fields). Each standardization specifies a particular rule set. As well as word/token classification tables, a rule set includes a specification of the format of an output record structure, into which original and standardized forms of the data, generated fields (such as gender) and reporting fields (for example, whether a user override was used and, if so, what kind of override) may be written.
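To make the phonetic-equivalent idea concrete, here is the standard American Soundex algorithm, one of the two phonetic encodings mentioned above. This is the published algorithm, not QualityStage's internal code:

```python
def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits encoding
    consonant groups; vowels separate codes, H and W do not."""
    codes = {**dict.fromkeys("BFPV", "1"),
             **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":   # H and W do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```

Because Smith and Smyth both encode to S530, a match pass on the Soundex of the surname brings these spelling variants together for comparison.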

It may be that standardization is the desired end result of using QualityStage. For example street address components such as Street or Avenue or Road are often represented differently in data, perhaps differently abbreviated in different records. Standardization can convert all the non-standard forms into whatever standard format the organization has decided that it will use. This kind of QualityStage job can be set up as a web service. For example, a data entry application might send in an address to be standardized. The web service would return the standardized address to the caller. More commonly standardization is a preliminary step towards performing matching. More accurate matching can be performed if standard forms of words/tokens are compared than if the original forms of these data are compared.

Matching
Matching is the real heart of QualityStage. Different probabilistic algorithms are available for different types of data. Using the frequencies developed during investigation (or subsequently), the information content (or rarity value) of each value in each field can be estimated; the less common a value, the more information it contributes to the decision. A separate agreement weight or disagreement weight is calculated for each field in each data record, incorporating both its information content (the likelihood that a match actually has been found) and the probability that a match has been found purely at random. These weights are summed over the fields in the record to come up with an aggregate weight that can be used as the basis for reporting that a particular pair of records probably are, or probably are not, duplicates of each other. There is a third possibility, a grey area in the middle, which QualityStage refers to as the clerical review area: record pairs in this category need to be referred to a human to make the decision, because there is not enough certainty either way. Over time the algorithms can be tuned, with things like improved rule sets, weight overrides and different settings of probability levels, so that fewer and fewer clericals are found. Matching makes use of a concept called blocking, an unfortunately chosen term meaning that potential sets of duplicates form blocks (or groups, or sets) that can be treated as separate sets of potentially duplicated values. Each block of potential duplicates is given a unique ID, which can be used by the next phase (survivorship) and can also be used to set up a table of linkages between the blocks of potential duplicates and the keys of the original data records in those blocks. This is often a requirement when de-duplication is being performed, for example when combining records from multiple sources, or generating a list of unique addresses from a customer file.
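The weighting scheme described above follows the general shape of probabilistic record linkage: each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability the field agrees on a true match and u the probability it agrees by chance. The m/u probabilities, field names and thresholds below are purely illustrative, not values from any real rule set:

```python
import math

def field_weight(agrees: bool, m: float, u: float) -> float:
    """Agreement weight log2(m/u); disagreement weight log2((1-m)/(1-u))."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

# (field, agrees?, m, u) -- rare surname agreement carries more weight
# than common-first-name agreement; a disagreement pulls the total down.
fields = [
    ("surname",    True,  0.95, 0.01),
    ("first_name", True,  0.90, 0.05),
    ("birth_date", False, 0.90, 0.002),
]
total = sum(field_weight(a, m, u) for _, a, m, u in fields)

# Two thresholds carve the score into match / clerical review / non-match.
MATCH, NONMATCH = 8.0, 0.0
if total >= MATCH:
    decision = "match"
elif total > NONMATCH:
    decision = "clerical review"
else:
    decision = "non-match"
print(round(total, 2), decision)  # prints: 7.42 clerical review
```

Note how the pair lands in the clerical review band: the two agreements are not quite enough to outweigh the date disagreement, so a human would decide.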

More than one pass through the data may be required to identify all the potential duplicates. For example, one customer record may refer to a customer with a street address, but another record for the same customer may include the customer's post office box address. Searching for duplicate addresses would not find this customer; an additional pass based on some other criteria would also be required. QualityStage does provide for multiple passes, either fully passing through the data on each pass, or examining only the unmatched records on subsequent passes (which is usually faster).
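A toy sketch of why multiple passes with different blocking criteria matter; the records and field names are invented for illustration:

```python
from collections import defaultdict

# The same customer once with a street address, once with a PO box.
records = [
    {"id": 1, "name": "SMITH", "zip": "2000", "addr": "1 MAIN ST"},
    {"id": 2, "name": "SMITH", "zip": "2000", "addr": "PO BOX 99"},
    {"id": 3, "name": "JONES", "zip": "3000", "addr": "5 HIGH ST"},
]

def block(recs, key_fields):
    """Group records sharing a blocking key; only records within a
    block are candidates for detailed comparison."""
    groups = defaultdict(list)
    for r in recs:
        groups[tuple(r[f] for f in key_fields)].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Pass 1 on address alone misses the duplicate; pass 2 on a
# different key (name + postcode) finds it.
print(block(records, ["addr"]))           # []
print(block(records, ["name", "zip"]))    # [[1, 2]]
```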

Survivorship
As the name suggests, survivorship is about what becomes of the data in these blocks of potential duplicates. The idea is to get the best-of-breed data out of each block, based on built-in or custom rules such as most frequently occurring non-missing value, longest string, most recently updated, and so on. The data that fulfil the requirements of these rules can then be handled in a couple of ways. One technique is to come up with a master record, a single version of the truth that will become the standard for the organization. Another possibility is that the improved data could be populated back into the source systems whence they were derived; for example, if one source were missing date of birth, this could be populated because the date of birth was obtained from another source (or from more than one). If this is not the requirement (perhaps for legal reasons), then a table containing the linkage between the source record keys and the master record keys can be created, so that the original source systems can also refer to the single source of truth, and vice versa.
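A minimal sketch of building a master record from one block of potential duplicates, applying rules of the kind listed above (the records, field names and rule choices are illustrative):

```python
from collections import Counter

# One block of potential duplicates.
dupes = [
    {"name": "W GATES",       "dob": None,         "updated": "2021-01-05"},
    {"name": "WILLIAM GATES", "dob": "1955-10-28", "updated": "2020-06-01"},
    {"name": "WILLIAM GATES", "dob": None,         "updated": "2019-03-09"},
]

def survive(records):
    """Best-of-breed master record:
    name -> most frequent non-missing value (longest string breaks ties);
    dob  -> first non-missing value found in the block;
    updated -> most recent timestamp."""
    master = {}
    names = Counter(r["name"] for r in records if r["name"])
    master["name"] = max(names, key=lambda n: (names[n], len(n)))
    master["dob"] = next((r["dob"] for r in records if r["dob"]), None)
    master["updated"] = max(r["updated"] for r in records)
    return master

print(survive(dupes))
# {'name': 'WILLIAM GATES', 'dob': '1955-10-28', 'updated': '2021-01-05'}
```

Note that the surviving date of birth came from a record other than the one with the most recent update: each field's rule is applied independently, which is what makes the result best-of-breed rather than simply the newest record.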

Address Verification and Certification


QualityStage can do more than simple matching. Address verification can be performed; that is, QualityStage can report whether or not an address is in a valid format. Out of the box, address verification can be performed down to city level for most countries. For an extra charge, an additional module for worldwide address verification (WAVES) can be purchased, which gives address verification down to street level for most countries. For some countries, where the postal systems provide appropriate data (for example CASS in the USA, SERP in Canada, AMAS/DPID in Australia), address certification can be performed: in this case, an address is given to QualityStage and looked up against a database to report whether or not that particular address actually exists. These modules carry an additional price, but that price includes IBM obtaining regular updates to the data from the postal authorities and providing them to the QualityStage licensee.

QualityStage Designer
The Designer is a Windows-based graphical user interface in which the various QualityStage tasks (called jobs) can be specified. This user interface has been created to use a "design in the same way that you think" paradigm. That is, you select components and metadata that describe precisely what you want to do, and draw a picture of the flow of activities. This is then converted into something that can be executed in batch or in real time as a web service; a single mouse click effects the conversion (the process is called compilation in the Designer). QualityStage Designer also includes facilities for constructing and testing rule sets and rule set overrides, for obtaining quality metrics, and for tuning the entire process, for example by setting thresholds for match/non-match, applying weight overrides based on external knowledge (typically of the rarity value or sampling strategy), adjusting the uncertainty thresholds applicable to the various probabilistic matching algorithms, and more. Figure 1 shows a QualityStage job design that performs standardization of customer records and generates frequency distributions of selected fields. These frequency distributions will be used in a subsequent match job.

Figure 1 QualityStage Designer

QualityStage Benefits
QualityStage provides powerful, accurate matching based on probabilistic matching technology that is easy to set up and maintain and that delivers among the highest match rates available in the market. An easy-to-use graphical user interface (GUI) with an intuitive, point-and-click approach for specifying automated data quality processes (data investigation, standardization, matching and survivorship) reduces the time needed to deploy data cleansing applications.

QualityStage offers a thorough data investigation and analysis process for any kind of free-format data. Through its tight integration with DataStage and other Information Server products, it also offers fully integrated management of the metadata associated with those data. There is rigorous scientific justification for the probabilistic algorithms used in QualityStage; results are easy to audit and validate.

Worldwide address standardization, verification and enrichment capabilities, including certification modules for the United States, Canada and Australia, add to the value of cleansed address data. Domain-agnostic data cleansing capabilities handle product data, phone numbers, email addresses, birth dates, events, and other comment and descriptive fields. Common data quality anomalies, such as data in the wrong field or data spilling over into the next field, can be identified and addressed.

Extensive reporting provides metrics that yield business intelligence about the data and help tune the application for quality assurance. Service oriented architecture (SOA) enablement with InfoSphere Information Services Director allows you to take data quality logic built using IBM InfoSphere Information Server and publish it as an "always on, available everywhere" service in a SOA in minutes.

The bottom line is that QualityStage helps to ensure that systems deliver accurate, complete, trusted information to business users both within and outside the enterprise.

More Information
http://www-01.ibm.com/software/data/infosphere/qualitystage

IBM, the IBM logo, InfoSphere, WebSphere, Information Server, Information Analyzer, FastTrack, Business Glossary, QualityStage, DataStage and Metadata Workbench are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product or service names might be trademarks or service marks of others.

About the Author Ray Wurlod


Ray is a self-employed trainer and consultant for the IBM DataStage, IBM UniVerse, IBM Red Brick Warehouse and DataStage XE suites of products. Ray has taught advanced classes in the USA, the UK and Germany, and has frequently been engaged by IBM as a training consultant to conduct advanced in-house training classes. Additionally, Ray has presented training classes in almost every country in the Asia-Pacific region, and has been involved in technical presentations and implementations throughout the region. Ray joined Prime Computer of Australia in 1986. He later joined VMARK Software (the original developers of DataStage) after Prime Computer sold its database businesses to VMARK. Ray's principal role with VMARK, and subsequently with Ardent, Informix and IBM, was as a DataStage trainer, but he was also actively involved in technical support. While with VMARK and Ardent he was actively involved in the development of DataStage, creating a complete training curriculum for use in the Asia-Pacific region. He has also developed training curricula and train-the-trainer programs for the UniVerse RDBMS, including its NLS (national language support) implementation. When Ardent sold its database businesses to Informix Software, Ray continued his involvement in data warehouse technology by becoming expert with the Red Brick Warehouse product, a database designed specifically for data warehouse (star schema) implementations. When Informix was acquired by IBM, Ray continued his concentration on training, while additionally focusing on data warehousing applications.
