
User Guide

Informatica Data Quality


(Version 8.6.1)
Informatica Data Quality User Guide
Version 8.6.1
September 2008

Copyright (c) 2001–2008 Informatica Corporation.

All rights reserved.


This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also
protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,
recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and
227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.
Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer,
Informatica B2B Data Exchange and Informatica On Demand are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All
other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright © Sun Microsystems. All rights reserved. Copyright © Platon Data
Technology GmbH. All rights reserved. Copyright © Melissa Data Corporation. All rights reserved. Copyright © 1995-2006 MySQL AB. All rights reserved.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is Copyright © 1999-2006 The Apache Software Foundation. All rights
reserved.

ICU is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of the ICU
software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or
sell copies of the Software, and to permit persons to whom the Software is furnished to do so.

ACE(TM) and TAO(TM) are copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006,
all rights reserved.

Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant permission to use, copy, modify, distribute,
and license this software and its documentation for any purpose.

InstallAnywhere is Copyright © Macrovision (Copyright ©2005 Zero G Software, Inc.) All Rights Reserved.

Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com).
This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright © 2000-2004 Jason Hunter and Brett McLaughlin. All rights reserved.

This product includes software developed by the JFreeChart project (http://www.jfree.org/freechart/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement,
which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”, without warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability and fitness for a particular purpose.

This product includes software developed by the JDIC project (https://jdic.dev.java.net/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may
be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”, without warranty of any kind, either express or implied, including but not limited
to the implied warranties of merchantability and fitness for a particular purpose.

This product includes software developed by l2fprod.com (http://common.l2fprod.com/). Your right to use such materials is set forth in the Apache License Agreement, which may be found at
http://www.apache.org/licenses/LICENSE-2.0.html.

DISCLAIMER: Informatica Corporation provides this documentation “as is” without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-
infringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or
documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

Part Number: IDQ-USG-86100-0002


Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Informatica Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Chapter 1: Informatica Data Quality Features and Functionality . . . . . . . . . . . . . . . . . 1


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Data Quality Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Project Manager and File Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Publishing Plans to Data Quality Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Exporting and Importing Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Running Plans: Local and Remote Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Plan Resources and Plan Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Version Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Working with Multiple Instances of a Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Organizing the Workbench User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Chapter 2: Data Source Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
CSV Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Database Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Fixed Width Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Realtime Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
SAP Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
CSV Match Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
CSV Dual Match Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Database Match Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Dual Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
CSV Identity Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
DB Identity Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Chapter 3: Data Target Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
CSV Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Fixed Width Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Report Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
CSV Merge Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
CSV Match Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Match Key Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Group Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Database Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Database Report Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
SAP Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Realtime Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Identity Group Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Chapter 4: Frequency Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
MinAvgMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Range Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Chapter 5: Analysis Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Character Labeller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Token Labeller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Chapter 6: Transformation Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Search Replace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Word Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
To Upper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Rule Based Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Scripting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Chapter 7: Parsing Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Token Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Profile Standardizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Context Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Chapter 8: Key Field Generator Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Soundex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Nysiis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Chapter 9: Matching Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Identity Match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Jaro Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Hamming Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Bigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Mixed Field Matcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Weight Based Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Chapter 10: Address Validation Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Global AV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Chapter 11: Dictionary Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Dictionary Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Updating Dictionary Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Creating a Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

Chapter 12: Report Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Viewing Data in the Report Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Standard View and Dashboard View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Viewing Plan Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Report Viewer Parameters and Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Tracking Changes in Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Importing Report Files and Working with Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

Chapter 13: Deploying Plans for Runtime Execution . . . . . . . . . . . . . . . . . . . . . . . . 119


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Deploying Runtime Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Running a Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Command Line Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Multi-Threading and Multi-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Appendix A: Rule Based Analyzer Rule Statements . . . . . . . . . . . . . . . . . . . . . . . . 127


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Functional Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Appendix B: Global AV: Output Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . 131


Global AV Output Field Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Appendix C: Search/Replace Operations and Noise Removal . . . . . . . . . . . . . . . . 135


Noise Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Appendix D: Matching Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Matching Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

Appendix E: SQL Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Creating a MySQL Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Use of MAX Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Nested Groups and Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

Appendix F: ODBC Data Source Administrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


Using the ODBC Data Source Administrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Appendix G: Character Encodings and Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


Character Encodings and Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Appendix H: Data Quality Workbench Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145


Data Quality Workbench Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Appendix I: Output Options in the CSV Match Target . . . . . . . . . . . . . . . . . . . . . . . 147


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Configuring the Outputs for Identified Matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

Appendix J: Informatica Data Quality Naming Conventions . . . . . . . . . . . . . . . . . . 149


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

Preface
Welcome to Informatica Data Quality, the latest-generation data quality management system from Informatica
Corporation. Informatica Data Quality will empower your organization to solve its data quality problems and
realize real, sustainable data quality improvements.
The high-level objectives for this guide are to describe the functionality of Informatica Data Quality in the
following areas:
♦ How to build data quality plans using the data sources, data targets, and operational components available in
the Workbench user interface.
♦ How to manage your data quality projects, plans, and associated resource files through Informatica Data
Quality Workbench.
♦ How to use dictionaries and reference data content.
This document builds on the Getting Started Guide. Before reading this document, Data Quality users should
read the Getting Started Guide to familiarize themselves with data quality concepts and product capabilities.
Note: The Informatica Data Quality Integration for PowerCenter is not documented in this guide. For more
information on the Data Quality Integration, see the Data Quality Integration for PowerCenter Guide.

Informatica Resources
Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com.
The site contains product information, user group information, newsletters, access to the Informatica customer
support case management system (ATLAS), the Informatica Knowledge Base, Informatica Documentation
Center, and access to the Informatica user community.

Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have
questions, comments, or ideas about this documentation, contact the Informatica Documentation team
through email at infa_documentation@informatica.com. We will use your feedback to improve our
documentation. Let us know if we can contact you regarding your comments.

Informatica Web Site
You can access the Informatica corporate web site at http://www.informatica.com. The site contains
information about Informatica, its background, upcoming events, and sales offices. You will also find product
and partner information. The services area of the site includes important information about technical support,
training and education, and implementation services.

Informatica Knowledge Base


As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use
the Knowledge Base to search for documented solutions to known technical issues about Informatica products.
You can also find answers to frequently asked questions, technical white papers, and technical tips.

Informatica Global Customer Support


There are many ways to access Informatica Global Customer Support. You can contact a Customer Support
Center through telephone, email, or the WebSupport Service.
Use the following email addresses to contact Informatica Global Customer Support:
♦ support@informatica.com for technical inquiries
♦ support_admin@informatica.com for general customer service requests
WebSupport requires a user name and password. You can request a user name and password at
http://my.informatica.com.
Use the following telephone numbers to contact Informatica Global Customer Support:

North America / South America
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, California 94063
United States
Toll Free: +1 877 463 2435
Standard Rate:
Brazil: +55 11 3523 7761
Mexico: +52 55 1168 9763
United States: +1 650 385 5800

Europe / Middle East / Africa
Informatica Software Ltd.
6 Waltham Park
Waltham Road, White Waltham
Maidenhead, Berkshire SL6 3TN
United Kingdom
Toll Free: 00 800 4632 4357
Standard Rate:
Belgium: +32 15 281 702
France: +33 1 41 38 92 26
Germany: +49 1805 702 702
Netherlands: +31 306 022 797
United Kingdom: +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd.
Diamond District
Tower B, 3rd Floor
150 Airport Road
Bangalore 560 008
India
Toll Free:
Australia: 1 800 151 830
Singapore: 001 800 4632 4357
Standard Rate:
India: +91 80 4112 5738

CHAPTER 1

Informatica Data Quality Features and Functionality

This chapter includes the following topics:
♦ Overview, 1
♦ Data Quality Plans, 2
♦ Project Manager and File Manager, 2
♦ Publishing Plans to Data Quality Server, 4
♦ Running Plans: Local and Remote Execution, 6
♦ Plan Resources and Plan Execution, 7
♦ Version Control, 8
♦ Working with Multiple Instances of a Plan, 11
♦ Organizing the Workbench User Interface, 11

Overview
This chapter discusses the project management, file management, and plan management options available
through Data Quality, including the capabilities of Data Quality Workbench in conjunction with Data Quality
Server. If you are running Data Quality Workbench in stand-alone or client-only mode, some functionality
might not be available to you.
Note: For more information on the components that make up the Informatica Data Quality suite, see the
Informatica Data Quality Installation Guide and the Getting Started with Data Quality Guide.

Data Quality Plans
Informatica Data Quality analyzes and enhances your source data through processes called plans that you
create in its Workbench application. A data quality plan is a self-contained and executable set of data analysis or
data enhancement steps consisting of one or more of the following types of components:

Table 1-1. Data Quality Plan Components

Component      Required/Optional   Description
Data source    Required            Provides input data for the plan.
Data target    Required            Collects data output from the plan.
Operational    Optional            Performs the data analysis or data enhancement actions on the data
                                   they receive. Most plans contain multiple operational components.

A plan must contain at least one data source and one data target. It can use any number of operational components.
A plan that writes data directly from one file or database to another does not require operational components.
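The composition rule above can be sketched as a small data model. This is an illustration only, not part of Informatica Data Quality: the `Plan`, `Component`, and `ComponentKind` names and the `problems` method are hypothetical, and the component names are borrowed from the table of contents purely as labels.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ComponentKind(Enum):
    """The three component types listed in Table 1-1."""
    DATA_SOURCE = "data source"
    DATA_TARGET = "data target"
    OPERATIONAL = "operational"


@dataclass
class Component:
    name: str
    kind: ComponentKind


@dataclass
class Plan:
    """Hypothetical stand-in for a data quality plan."""
    name: str
    components: List[Component] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the composition rules this plan breaks (empty if none)."""
        kinds = [c.kind for c in self.components]
        found = []
        if ComponentKind.DATA_SOURCE not in kinds:
            found.append("plan needs at least one data source")
        if ComponentKind.DATA_TARGET not in kinds:
            found.append("plan needs at least one data target")
        # Operational components are optional, so they are never required here.
        return found


# A plan that writes directly from one file to another: no operational
# components, yet it satisfies the rule.
copy_plan = Plan("copy", [
    Component("CSV Source", ComponentKind.DATA_SOURCE),
    Component("CSV Target", ComponentKind.DATA_TARGET),
])

# A plan with a source and an operational component but no target.
broken_plan = Plan("broken", [
    Component("CSV Source", ComponentKind.DATA_SOURCE),
    Component("To Upper", ComponentKind.OPERATIONAL),
])

print(copy_plan.problems())
print(broken_plan.problems())
```

The check inspects only which component types are present. It does not model how components are connected in the workspace.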
Figure 1-1 shows the components in a plan arranged in the Data Quality Workbench user interface:

Figure 1-1. Plan Components in the Data Quality Workspace

The arrows indicate the direction of the data flow through the plan, from data source, through operational
components, to data target.
Note: You can move components in the workspace. Arrows are not foolproof indicators of the precise progress of
data in the plan.
Each operational component in Workbench performs a different type of analysis or enhancement task on your
data. Configure an operational component to execute on a subset of the data that it receives or to filter the data
that it makes available to other components in the component chain.
Many plans make use of text- or table-based reference dictionaries. Informatica provides a set of reference
dictionary files with its Content Installer. You can add dictionaries to several components in Workbench, and
you can define dictionaries in live tables within a database, ensuring that reference tables stay current.
You can edit and define your own dictionary files through the Dictionary Manager. Dictionary files are stored as
text files (.DIC files) in a Dictionaries folder in the Informatica Data Quality directory.
Note: Data Quality dictionaries install through the Content Installer, a separate installer within the Informatica
Data Quality installation. The Content Installer also installs any reference data and processing engine updates
that you receive from Informatica.

Project Manager and File Manager


Workbench stores plans in the Data Quality repository and reads reference data from the file system. It provides
separate browsers to view the contents of the repository and the file system.



♦ Project Manager. Lists the plans and project folders in the local Data Quality repository and any available
repositories on a Data Quality service domain. Allows you to organize plans in folders, publish plans from
the local repository to a service domain repository, export plans to PowerCenter repositories, and run plans.
♦ File Manager. Allows you to access and move files within the local file system and across the service domain
file system. With the File Manager, you can access any file type stored on a server.
In stand-alone installations of Data Quality Workbench, the File Manager and Project Manager provide access
to the local system and local repository only.

To view the Project Manager:

♦ In Informatica Data Quality Workbench, click the Projects tab.

To view the File Manager:

♦ In Informatica Data Quality Workbench, click the Files tab.

Working with the File Manager


The File Manager provides visibility to a Data Quality service domain in the following ways:
♦ The names of the servers configured in the domain appear under the service domain name.
♦ The servers are host to the client user spaces and a shared file space for all users. These user spaces contain
the dictionary files and other resource files for plans stored in the service domain repository.
♦ The server hosts a Dictionaries folder that all service domain repository plans can read from. This folder is
created by the Data Quality installer and populated by the Content Installer.
♦ The local computer structure also appears.
To work with files within the File Manager, right-click a file or folder and select the required operation from the
shortcut menu that appears. The permitted operations are as follows:
♦ (Create) New Folder
♦ Rename
♦ Delete
♦ Cut
♦ Copy
♦ Paste
♦ Refresh
♦ Open Externally
♦ Security
The following procedure illustrates how to use the File Manager.
Note: You cannot copy files from another system, such as Windows Explorer, into File Manager folders.

To copy local files to the service domain with the File Manager:

1. Under the File Manager tab, browse the local folder structure and locate the required file.
2. Right-click the file name and select Copy from the context menu that appears.
3. On the service domain, expand the folders of the server to which you’ll copy the file and locate the
destination folder.
4. Right-click the folder name and select Paste from the context menu that appears.



Publishing Plans to Data Quality Server
Publishing is the process of copying plans from a Workbench repository to a Data Quality Server repository.
Publishing deploys plans in a networked environment, allowing domain users with appropriate permissions to
access and execute the plans. Administrators set user permissions in the Data Quality Administration Console.
A published plan contains version control information that references the owner of the original plan, allowing
the genealogy of plans to be traced across repositories.

To publish a plan from the local repository to a domain repository:

1. Right-click the plan(s) you want to publish.
2. Select Copy from the context menu.
3. Browse the domain repository and locate the folder where you would like to publish the plan(s).
4. Right-click the folder and select Paste from the context menu.
5. Copy all necessary plan resources to the server file system, ensuring that you recreate the folder path
structures used in the source Workbench plan. For more information on placing resources in the correct
locations, see “Implications for Plan Design” on page 8.
Note: When plans are published, the latest base version of the plan is used. Any changes saved since this version
are not published. For more information about plan version control, see “Version Control and Plan
Publication” on page 10.

Exporting and Importing Plans


Use Data Quality Workbench to export and import plans to and from your local repository. Export plans
directly into the PowerCenter repository as mapplets, or export them as files that can be imported by other Data
Quality users.
The following export and import options are available:
♦ Export plans directly into the PowerCenter repository as mapplets. Use this option to run Data Quality
plans natively within PowerCenter.
♦ Export plans in XML format. XML plans can be used by the runtime version of Data Quality as part of
command batch jobs or scheduled processes.
♦ Back up plans to Data Quality PLN files for storage.
♦ Import plans from PLN or XML formats. Informatica recommends importing from PLN files in order to
preserve the layout of the original plan.
Exported and imported plans do not contain plan version histories.

Exporting Plans to PowerCenter


Use Workbench to export Data Quality plan metadata directly to a PowerCenter repository.

To export plans into a PowerCenter repository, perform the following steps:

1. Right-click the plan(s) you want to export.
2. Select Export > PowerCenter Mapplet > To PowerCenter Repository.
3. Enter your connection details in the ‘Connect to PowerCenter Repository’ dialog box. Ensure you select
the correct PowerCenter repository version.
4. Choose a destination repository folder for the exported plans.



PowerCenter users can also import plan metadata to the PowerCenter repository if they have installed the Data
Quality Integration transformation. PowerCenter runs plans saved through the Data Quality Integration
transformation in mappings and sessions by loading an instance of the Data Quality engine. When you export a
plan as a mapplet, PowerCenter runs its parent mapping and session within the PowerCenter engine.
Note: You can also export plans as PowerCenter mapplet files in XML form. To access this option, right-click
and select Export > PowerCenter Mapplet > To XML File.

Exporting Plans for Runtime Use


Export plans as XML files for use during runtime execution. Runtime execution uses a command-line version of
the Data Quality engine to run plans as part of a scheduled or batch process. For more information on runtime
execution, see “Deploying Plans for Runtime Execution” on page 119.

To export a plan for runtime use:

1. Right-click the plan(s) you want to export.
2. Select Export > IDQ Runtime Plan(s) (.xml).
3. Choose a destination folder for the XML plans, and click Select.
4. In the Export a Plan to XML dialog box, choose the operating system on which the plan will run and select
OK. If the exported plans contain file-based sources or targets, you can perform the following actions in
this dialog box:
♦ Change the paths for the sources or targets.
♦ Select OK to All to use the same paths for all file-based sources or targets.
5. Copy the exported XML file to the computer that will run the plans.
6. Copy all necessary source and reference files to the computer that will run the plans, ensuring that they are
placed in the proper locations. For more information, see “Plan Resources and Plan Execution” on page 7.

Backing Up Plans
Create backup copies of your plans in PLN format. Do not create XML copies of plans for backup purposes.
PLN files retain the original onscreen appearance of the plans.

To back up your plans:

1. Right-click the plan(s) you want to back up.
2. Select Export > Workbench Plan(s) (.pln).
3. Choose a destination folder, and click Select.
4. If reference files are required for the exported plans, back up these files to ensure that the backup plan is
fully functional.

Importing Plans
Informatica recommends using PLN files as the source for your plan imports. While you can import XML
plans, these plans separate all component instances into individual components. This greatly increases the visual
complexity of many plans in the Workbench user interface. Use XML export only for plans intended for runtime execution.

To import plans:

1. Right-click the destination project or folder for the imported plan.
2. Select Import > Workbench Plan(s) (.pln).
3. Choose a file, and click Select.



4. If source and reference files are required for the imported plans, verify that these files are available to Data
Quality Workbench.

Running Plans: Local and Remote Execution


The plan execution process in Data Quality Workbench differs slightly for client-only license users and users in
client-server environments. Client-only license users define and run plans locally. Full Informatica Data Quality
users can select any available plan in the service domain and run the plan on any available server. Any machine
on the service domain can run a plan if it is host to an Execution service, the Informatica Data Quality service
that executes the plan.
Before you run a plan, make sure all necessary resources, such as the data source files and any required reference
data, are present on the computer that runs the plan and in locations recognized by Data Quality.
When you run a plan locally through your local Workbench, this is automatically the case, unless you have
moved any resources between design time and execution. When you run a plan on a remote server, you must
ensure that the necessary resources are present in the correct locations on the server that runs the plan.
In remote execution scenarios, it is possible for the Execution service and domain repository to reside on
separate servers. The server that runs the plan is the server on which the Execution service is present.

Running a Data Quality Plan


Use the following procedure to run data quality plans in Workbench.

To run a data quality plan in Workbench:

1. Ensure the required plan is selected in the workspace.
2. Click the Run Plan toolbar button.
A dialog box opens with the plan name in its title bar.
3. Click Run.
The plan executes.
If you are connected to a Data Quality service domain, you can also select a remote Data Quality computer
on which to run the plan. That is, you can specify the Execution service that will run the plan. You can run
a plan from any repository available on the service domain. For example, you can open a plan from the
domain repository on Server 1 and run the plan on Server 2.
The Run Plan dialog features a progress bar that states the percentage of the data processed as the plan
executes. You can click the Stop button at any time to end plan execution and view the results so far.
This dialog box also has a menu that allows you to select the percentage of data to use in the plan. The
default setting is 100 percent. You can select a smaller percentage if you want to test that a plan will run as
anticipated. This can be useful if you have designed a complex plan that will take time to execute.

Reporting Options
As well as generating file-based and table-based output, Data Quality Workbench offers graphical reporting
options. These include a proprietary format that lets you view high-level and fine-grained plan results,
create scorecards, and export data to file. For more information, see “Report Viewer” on page 109.



Plan Resources and Plan Execution
Before you run a plan, check that all relevant files are available to the computer that runs it.
When you run a plan locally, the source data and reference data files are set when you configure the
components. Unless you move the data between designing and running the plan, the locations are understood
when you run the plan.
When you run a plan on a remote computer, the Data Quality Server reads the plan, identifies the original path
to each resource, and replaces each path with a corresponding path on the server. The server substitutes the
Windows drive letter with your file folder in the Server host folder structure. Therefore, you must ensure that
the source data and reference data files are available to the Server in locations that the Server expects.
Note: If you have used third-party data in the plan, ensure that the third-party data is installed in a location
accessible to the Execution service that runs the plan.
The following sections describe how Data Quality handles resource files in cases of remote plan execution.

Data Source Files


Data Quality Server recognizes a specific set of folders as valid resource file locations. If a plan refers to a source
file stored in the following location on the Workbench computer:
C:\Myfiles\File.txt
A Data Quality Server on Windows looks for the file here:
C:\Program Files\Informatica Data Quality\users\user.name\Files\Myfiles
A Data Quality Server on UNIX installed at /home/Informatica/DataQuality/ looks for the file here:
/home/Informatica/DataQuality/users/user.name/Files/Myfiles
For further information, see “Implications for Plan Design” on page 8.
Note: If you have exported a plan for runtime execution and your source file is located in a non-standard
location, you can provide a parameter file with the runtime command that maps the original location to the
required location.
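
The path substitution described in this section can be sketched in Python. This is an illustration only, not the Server's own logic: the function name map_to_server_path and its parameters are hypothetical, and the sketch assumes that the Server simply appends the original folders and file name, minus the Windows drive letter, to the user's Files folder.

```python
from pathlib import PurePosixPath, PureWindowsPath


def map_to_server_path(client_path, user_files_root, server_is_windows=True):
    """Swap the Windows drive letter for the user's Files folder on the server."""
    client = PureWindowsPath(client_path)
    # parts[0] holds the drive and root ("C:\\") when the path has one;
    # everything after it is the folder structure and the file name.
    relative_parts = client.parts[1:] if client.anchor else client.parts
    flavour = PureWindowsPath if server_is_windows else PurePosixPath
    return str(flavour(user_files_root, *relative_parts))


windows_target = map_to_server_path(
    r"C:\Myfiles\File.txt",
    r"C:\Program Files\Informatica Data Quality\users\user.name\Files",
)
unix_target = map_to_server_path(
    r"C:\Myfiles\File.txt",
    "/home/Informatica/DataQuality/users/user.name/Files",
    server_is_windows=False,
)
print(windows_target)
print(unix_target)
```

Running the sketch against your own source paths before you publish a plan shows where each file would need to be placed on the server, under the assumption stated above.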

Dictionary Files
Data Quality locates dictionary files differently from source files.
The installation processes for Data Quality Workbench and Server create an empty Dictionaries folder under
the top-level Informatica Data Quality folder. This folder is populated with dictionary files by the Content
Installer.
By default, the Dictionaries folder is created at the following location on Windows systems:
C:\Program Files\Informatica Data Quality\Dictionaries
and at the following location on UNIX systems:
/home/Informatica/DataQuality/Dictionaries
Data Quality Server also creates a separate dictionary folder for each Data Quality user that connects into the
service domain. The folder is created when the client user first opens the File Manager or first attempts to run a
plan remotely.
A remotely-run plan first looks for dictionaries in the client user’s Dictionaries folder. If this folder does not
contain the required dictionaries, the plan looks in the Dictionaries folder created during installation.
Therefore, when you run a plan on the server, you do not need to copy dictionary files to your user dictionary
folder on the server if those dictionaries already exist in the server’s dictionary folder.
By default, user dictionary folders are created in the following server locations:
♦ UNIX: /home/Informatica/DataQuality/users/user.name/Dictionaries
♦ Windows: C:\Program Files\Informatica Data Quality\users\user.name\Dictionaries
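
The two-step dictionary lookup described above can be sketched as follows. The function name find_dictionary and the file name titles.dic are hypothetical, and temporary folders stand in for the user and installation Dictionaries folders. The sketch only illustrates the search order, not how Data Quality implements it.

```python
import tempfile
from pathlib import Path


def find_dictionary(name, user_dictionaries, shared_dictionaries):
    """Return the first existing copy of a dictionary file, or None."""
    # The user's Dictionaries folder is checked before the shared one.
    for folder in (user_dictionaries, shared_dictionaries):
        candidate = Path(folder) / name
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    user_dir = Path(tmp, "users", "user.name", "Dictionaries")
    shared_dir = Path(tmp, "Dictionaries")
    user_dir.mkdir(parents=True)
    shared_dir.mkdir(parents=True)

    # Only the shared folder holds the file: the lookup falls back to it.
    (shared_dir / "titles.dic").write_text("MR\nMRS\n")
    found_shared = find_dictionary("titles.dic", user_dir, shared_dir)

    # Once a user copy exists, it takes precedence over the shared copy.
    (user_dir / "titles.dic").write_text("DR\n")
    found_user = find_dictionary("titles.dic", user_dir, shared_dir)

    missing = find_dictionary("unknown.dic", user_dir, shared_dir)
```

The precedence matters when a user folder holds an older copy of a dictionary: under this search order, that copy is the one a remotely-run plan would read.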



Cross-Platform Plan File Conventions
Data Quality Server handles the translation of client-to-server file paths and Windows-to-UNIX file paths
seamlessly. When a plan is opened on a Windows system, Data Quality ensures that all paths are in a Windows
format, with folders separated by back slashes. When a plan is opened on a UNIX system, Data Quality renders
all paths in UNIX format with folders separated by forward slashes. The transformations and file paths are case-
sensitive and case-preserving.
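
As a rough illustration of the separator translation, the following Python sketch converts a relative path between the two formats while leaving the case of each folder name unchanged. The helper names are hypothetical, and the sketch assumes that no folder or file name itself contains a slash character.

```python
from pathlib import PureWindowsPath


def as_unix_path(path_text):
    """Render a relative path with forward slashes, preserving case."""
    return PureWindowsPath(path_text).as_posix()


def as_windows_path(path_text):
    """Render a relative path with back slashes, preserving case."""
    # PureWindowsPath accepts either separator on input.
    return str(PureWindowsPath(path_text))


print(as_unix_path(r"Myfiles\Reports\File.txt"))
print(as_windows_path("Myfiles/Reports/File.txt"))
```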

Implications for Plan Design


When you design a plan in Data Quality Workbench, you should ensure that the folders you create for file
resources can map efficiently to the server folder structure.
For example, a plan runs in Workbench and reads a source file from the following location:
C:\Program Files\Informatica Data Quality\Sources
When this plan runs on a remote Windows machine, Data Quality Server looks for the source file in the
following location:
C:\Program Files\Informatica Data Quality\users\user.name\Files\Program Files\Informatica Data Quality\Sources
The folder path Program Files\Informatica Data Quality is repeated here. In this case, good plan design suggests
the creation of folders under C:\ that can be recreated efficiently on the server.

Version Control
Data Quality’s version control features enable you to save multiple versions of a plan, to view the plan version
history, and to edit and run historical versions of the plan.
As well as the most recently-saved version of a plan, Data Quality stores any earlier versions that have been
flagged for retention in the repository. This allows you to save versions of a plan at meaningful points in its
development and to revert to earlier versions of the plan if necessary.
For the purposes of version control, each Data Quality plan has a latest version and one or more base versions.
♦ Latest version. The most recently-saved state of a plan.
♦ Base versions. Earlier versions that have been preserved in the repository.
When you save a plan for the first time, you automatically create a base version. If you do not create another
base version, the plan version history shows details for that base version and the latest version only.
Note the following:
♦ A base version cannot be overwritten. If you are working in a base version and save your changes, the newly-
saved state becomes the latest version.
♦ Version control does not keep every saved state of a plan. It is possible to open, edit, and save a plan
multiple times without adding base versions to the version history.
♦ Version control applies to plans only. Version control does not apply to projects or to the external resources
that a plan may require to run successfully.
♦ Version history is reset when you copy or publish a plan. Version information does not move with a plan
when it is copied within a repository, as this operation effectively creates a new plan. When a plan is
published, it retains the version details of the base version published from the Workbench repository – the
base version number on the client computer, the creation date and time of that base version, the user who
created it, and the comment added by that user. For more information, see “Version Control and Plan
Publication” on page 10.
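
The relationship between the latest version and the base versions can be modelled with a short Python sketch. The class and method names are hypothetical. The comment text recorded for the automatic first base version and the numbering from 1 are assumptions, and the sketch does not describe how the repository stores plans.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class BaseVersion:
    # Frozen: a base version cannot be overwritten once created.
    number: int
    content: str
    author: str
    comment: str
    created: datetime


@dataclass
class PlanHistory:
    latest: str = ""
    base_versions: list = field(default_factory=list)

    def save(self, content, author):
        """Ordinary save: update the latest version only.

        The very first save also creates a base version.
        """
        self.latest = content
        if not self.base_versions:
            self._add_base(author, "First save")  # comment text is made up

    def save_as_base_version(self, content, author, comment):
        """Preserve the current state as a new base version."""
        if not comment.strip():
            raise ValueError("a comment is required for a base version")
        self.latest = content
        self._add_base(author, comment)

    def _add_base(self, author, comment):
        number = len(self.base_versions) + 1  # numbering from 1 is assumed
        self.base_versions.append(
            BaseVersion(number, self.latest, author, comment, datetime.now())
        )


history = PlanHistory()
history.save("draft A", "user.name")  # first save creates a base version
history.save("draft B", "user.name")  # updates the latest version only
history.save_as_base_version("draft C", "user.name", "Added address checks")
print([v.number for v in history.base_versions], history.latest)
```

The second save changes the latest version without adding to the history, which is why the version history does not record every saved state of a plan.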



Version Control Commands
You can perform all plan activities in Data Quality without interacting with the version control features.
However, all plans in the repository are assigned a version history that you can access through a shortcut menu.
When you right-click a plan name and select Version Control, a submenu opens with the following options:


♦ History. Opens the History Viewer dialog box, which provides file properties for the latest and base versions
of the plan.
♦ Get Latest Version. Opens the last-saved version of the plan or, if the plan is open, restores the onscreen plan
to its last-saved version.
♦ Save Plan as Base Version. Saves the current state of the plan as a new base version. You must enter a
comment describing your changes when you save a new version of the plan.

Viewing Version History


The History Viewer dialog box lists the plan versions maintained in the repository, with the latest version at the
top of the list.
It lists the latest and base versions of the plan, showing the version number, creation date and time, author (the
user who saved the plan), and the comment provided by the author when the version was created.
The Comment for Version pane shows the full text of the comment entered for the version.
Figure 1-2 shows the History Viewer dialog box:

Figure 1-2. History Viewer

Tracking Plans Across the Service Domain


The History Viewer can be useful to service domain users who want to track the progress of a plan through the
enterprise. As a plan retains the version details of its meaningful iterations, the History Viewer facilitates an
audit trail that can assist collaboration between plan designers and the users who deploy the plans.

Opening Plans with Version Control


When you double-click a plan in the Project Manager, you retrieve its latest saved version. You can also open
the latest version of a plan through the version control menus by right-clicking a plan name and selecting
Version Control > Get Latest Version.

The Get Latest Version option also allows you to revert to the latest saved version while working with a plan. If
your plan has unsaved changes when you select Get Latest Version, Data Quality prompts you to confirm the
command, since reverting to the latest version will undo your changes.
Use the following procedure to open a base version of the plan.

To open a base version of a plan:

1. In the Project Manager, right-click a plan name and select Version Control > History.
2. In the History Viewer dialog box, select the required base version and click Open Selected Version.

Saving, Deleting, and Renaming Plans


Version control is sensitive to general plan operations. By default, any save command will update the latest plan
version.
When you save a plan for the first time, you automatically create a base version. When you create a subsequent
base version, the latest version is automatically updated.
When you rename a plan, the name change is propagated through all base versions of the plan.
When you delete a plan, you delete all versions. It is not possible to delete a specific base version of a plan.

To create a base version:

1. In the Project Manager, right-click the name of the plan and select Version Control > Save Plan as Base
Version.
2. In the Confirm Base Version Creation dialog box, type a comment explaining the operation.
You will not be allowed to proceed without typing a comment in this dialog box.
3. Click Set As Base Version.

Version Control and Plan Publication


Data Quality treats version control differently for publication and local repository copy/move operations.
Publication preserves a plan’s most recent base version information. Local repository copy/move operations do
not.
Consider a plan published from the local repository to the domain. Publishing the plan sends its most recent
base version, with that version information, to the domain repository. Version information copied with the
published version includes the version number of the published base version on the client, the user who created
the base version on the client, a date-time stamp for the creation of that version, and the comments added when
the version was created. In this way, a plan on the domain is traceable back to its point of origin.
The domain repository also initiates its own version history for the plan. When a plan is first published, the
domain repository assigns it a base version number of 1 while also retaining the client-side version data for the
published version. If a client user subsequently publishes the plan a second time, the domain repository
increments its base version number while again retaining the client-side version data.
For example, you have published base version 5 of a plan from your Workbench repository to the domain
repository. The domain repository creates base version number 1. After working locally on the plan, you publish
base version number 8 from your Workbench repository to the domain, creating a new base version number in
the domain repository.
Table 1-2 illustrates the changes in version details:

Table 1-2. Version Data Updated During Plan Publication

                      Client Repository     Domain Repository
Version Number        5                     1
Version Number        8                     2
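
The numbering shown in Table 1-2 can be reproduced with a small Python sketch. The class names, the plan name address_check, and the sample dates and comments are all made up for illustration. The sketch assumes only what the text states: the domain repository counts publications from 1 and keeps the client-side details of each published base version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientVersion:
    number: int
    author: str
    created: str
    comment: str


class DomainRepository:
    def __init__(self):
        self._plans = {}  # plan name -> list of (domain number, ClientVersion)

    def publish(self, plan_name, client_version):
        """Record a publication and return the new domain base version number."""
        entries = self._plans.setdefault(plan_name, [])
        domain_number = len(entries) + 1
        entries.append((domain_number, client_version))
        return domain_number

    def history(self, plan_name):
        return list(self._plans.get(plan_name, []))


domain = DomainRepository()
first = domain.publish(
    "address_check",
    ClientVersion(5, "user.name", "2008-09-01 10:00", "Initial release"),
)
second = domain.publish(
    "address_check",
    ClientVersion(8, "user.name", "2008-09-15 16:30", "Tuned matching"),
)
pairs = [(number, client.number) for number, client in domain.history("address_check")]
print(pairs)
```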



Note:

♦ Publication copies/moves the most recent base version, which may not be the latest saved version.
♦ When a plan is copied within the client repository, only the latest saved version is copied/moved. All base
versions are discarded.

Working with Multiple Instances of a Plan


Data Quality is designed to be flexible. To enable teamwork between plan designers, it does not apply any locks
to an open plan. Though it is possible for users on different systems to work on a plan concurrently, this is not
recommended.
The following section describes plan behavior in the event that different instances of Data Quality Workbench
are working with the same plan.
♦ When you save a plan, Data Quality checks the repository to determine if there have been any updates to the
plan since its last “save” event. If it finds such an update, the system prompts you to confirm that you want
to overwrite the saved plan. This updates the latest version in the repository. Any changes made by the other
user will be lost.
♦ When you save a plan as a base version, Data Quality checks for any updates to the list of base versions for
that plan. If it finds such an update, the system notifies you that a new base version will be created with a
version number incremented from the version most recently created by the other user.
♦ Creating a base version also updates the latest saved version in the repository. Data Quality performs two
checks in this case: to establish if the latest version has been updated and to establish if a more recent base
version has been created. When you create a base version in this case, you are asked to accept the changes to
both versions of the plan. If you click No in either case, the plan is not saved and the base version is not
created.
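
The save-time check described above resembles an optimistic concurrency check: no lock is taken, and a conflict is detected only when a user saves. The following Python sketch illustrates the idea for the latest version. The revision counter and all class and parameter names are hypothetical, and the text does not state how Data Quality detects that another user has saved the plan.

```python
class Repository:
    def __init__(self):
        self.content = ""
        self.revision = 0  # hypothetical counter, bumped on every save


class WorkbenchSession:
    def __init__(self, repo):
        self.repo = repo
        self.seen_revision = repo.revision  # state when the plan was opened

    def save(self, content, confirm_overwrite=lambda: False):
        """Save the plan, asking for confirmation if another save intervened."""
        changed_elsewhere = self.repo.revision != self.seen_revision
        if changed_elsewhere and not confirm_overwrite():
            return False  # user declined, nothing is written
        self.repo.content = content
        self.repo.revision += 1
        self.seen_revision = self.repo.revision
        return True


repo = Repository()
user_a = WorkbenchSession(repo)
user_b = WorkbenchSession(repo)

a_saved = user_a.save("edit by user A")
b_declined = user_b.save("edit by user B")  # prompt answered No
b_saved = user_b.save("edit by user B", confirm_overwrite=lambda: True)
print(a_saved, b_declined, b_saved, repo.content)
```

When user B confirms the overwrite, user A's edit is lost, which is the reason concurrent editing of a plan is not recommended.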

Organizing the Workbench User Interface


You can organize the components on the plan workspace in any manner you choose. The Data Quality
Workbench user interface provides menu options that allow you to organize your plan components in a
meaningful way:
♦ The component icons are connected by directional lines in the workspace. These lines indicate the directions
in which data flows within the plan. However, the directional lines do not provide a foolproof indicator of
whether one component precedes another in plan operations. The relative positions of the icons in the
workspace do not affect the running of the plan.
♦ Another method of keeping track of the component dependencies in a plan is to assign components to one
or more layers. Layers let you show or hide component icons onscreen. You can create a layer through the
Plan Layer Manager, available from the Tools menu.
To assign a component to a layer, right-click it and select Assign To Layer from the context menu. To view
only the components in a single layer, select View > Plan Layers.
♦ To view a snapshot of the current source data in the plan, open the Source Viewer (F6). This window
appears in the workspace and displays the first 250 rows of the source data currently in use.
♦ The plan components can make use of reference dictionary files to determine the validity of data values.
These dictionaries are visible through the Workbench Dictionary Manager (F8).
♦ You can read or add notes to a plan by opening the Plan Notes window (F11). This window is a free-text tool
that allows you to comment on any aspect of the plan.



Workbench Naming Conventions
When you design or edit plans that will be shared with other users, it is good practice to agree with your team on a
clear and consistent set of naming conventions for projects, folders, plans, configurable components, component
elements, and dictionaries.
For a comprehensive guide to developing a naming system for these elements, see “Informatica Data Quality
Naming Conventions” on page 149.



CHAPTER 2

Data Source Components


This chapter includes the following topics:
♦ Overview
♦ CSV Source
♦ Database Source
♦ Fixed Width Source
♦ Realtime Source
♦ SAP Source
♦ CSV Match Source
♦ CSV Dual Match Source
♦ Database Match Source
♦ Group Source
♦ Dual Group Source
♦ CSV Identity Group Source
♦ DB Identity Group Source

Overview
Source components are used to specify the location of the input data files for a plan.

CSV Source
The CSV Source component connects to files with data organized in a delimited format, such as
comma delimited (CSV), to provide source data for a plan. When configuring this component you
specify the location of the delimited file, the type of delimiter used, and other options as described
below.

Configuration
The CSV Source configuration dialog box contains the following editable fields:

♦ Source File. Displays the name of the file to which the component connects.
♦ Select. Click this button to browse to the source file.
When you click Select, the Select a CSV File as a Source dialog box opens. This dialog box provides an
option to identify the character encoding associated with the dataset. For more information, see “Character
Encodings and Unicode” on page 143.
♦ Field Delimiter. Select a field delimiter appropriate to the source data from this menu. The default option is
comma. If the column headings in the source data contain this delimiter, you must use a text qualifier to
preserve the data structure.
♦ Text Qualifier. Select a qualifier appropriate to the source data from this menu.
The application in which the source file was last edited may have saved information with a text qualifier. The
default option is the double quote (").
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as a
header and thus distinguish it from the rest of the dataset.
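
To see why the text qualifier matters, the following Python sketch parses a small delimited sample with the standard csv module. It is an analogy for the Field Delimiter, Text Qualifier, and First Line of File is the Header settings, not a description of how the CSV Source component reads files. The function name and the sample data are made up.

```python
import csv
import io


def read_delimited(text, field_delimiter=",", text_qualifier='"',
                   first_line_is_header=True):
    """Split delimited text into a header row and data rows."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=field_delimiter,
        quotechar=text_qualifier,
    )
    rows = list(reader)
    if first_line_is_header and rows:
        return rows[0], rows[1:]
    return None, rows


# The second column contains commas, so it is wrapped in the text qualifier.
sample = 'id,"name, full"\n1,"Smith, John"\n2,"Jones, Mary"\n'
header, records = read_delimited(sample)
print(header)
print(records)
```

Without the qualifier, each quoted value would be split at its comma and every row would appear to have three fields instead of two.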

Database Source
The Database Source component connects directly to a database to provide source data for a plan.
When configuring a Database Source, you identify the required database type, connect to a database
available to Data Quality, and configure the tables and columns on the database to produce a source
dataset for your plan.

Configuration
The component dialog box displays configuration options across four tabs: Connect To Database, Before,
During, and After.
The connection is defined on the Connect To Database tab. The Before tab settings create the database table
that will be populated with the source data for the plan. The During options define the data that is used in the
plan, i.e. by selecting and joining columns from the available databases and adding the data to the table defined
in the Before tab. The After tab updates the table configured on the previous tabs and determines the state of
the data as it will be used by other plan components.
Note: The Before, During, and After tabs work in the same fashion for all database types.

Connect To Database Tab


When connecting to a database source, first identify the database type.
The Database Type menu provides five options: Staging, IBM DB2, Oracle, Microsoft SQL Server, and ODBC
(connection to an ODBC-compliant database).
Staging is the default option. It refers to the local database used by Data Quality. The remaining Database
Information and Login Information fields are disabled for this option. That is, you can connect to the local
repository without setting any other options on this page.
When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must provide a
Data Source Name (DSN) for the database and you might be prompted to provide a valid username and
password combination. The DSN field identifies the database on the network.
When you connect to an Oracle database, you must provide the System Identifier (SID) that refers to the
Oracle instance.
The Encoding menu lists the available character encodings that can be applied to the data as it is used in the
plan. For more information, see “Character Encodings and Unicode” on page 143.



The Login Information area contains Username and Password fields. Use these fields when access permissions
have been applied to the database in question. Data Quality does not require this information by default.
Click Connect to establish the connection.

Before Tab
The Before tab has a Database pane and SQL Script pane.
The Database pane displays the available databases and tables in a folder hierarchy. Browse the hierarchy to
locate the data source tables and columns and write the SQL script that defines the table in the SQL Script
pane. Clicking on a folder or column in the left pane transposes its name to the right pane to aid accuracy in
scripting.
The following sample script creates an elementary table called Names:
drop table if exists names;   # removes any existing names table
create table names
(
    id   int,            # id field populated by integers
    name varchar(255)    # name field entries up to 255 characters
);
Click Execute to run the script and create the table. You must click Execute before proceeding to the During
tab.
Click Stop On Error if you want the system to stop the script operation and display an error message if the
execution encounters a problem.

During Tab
The During tab allows you to browse database tables and filter the columns to provide source data for your
plan. You can also apply conditions to tables and join columns from multiple tables. The tab shows five
columns:
♦ Database. Like the Before tab, the Database column displays the database structure as a folder hierarchy of
tables and columns.
♦ Select. Provides check boxes for the columns in the explored tables. Check a column check box under Select
to add its data to the dataset.
♦ Join. Lets you select columns from multiple tables for “join” operations so their data is added to the dataset.
♦ Where and Text. These columns allow you to specify the conditions for data inclusion, both for the columns
identified in the Select column and the columns to be joined. Note the following:
− To activate the editable fields in the Where and Text columns, click in the column. Use the fields in the
Where column to access conditional statements. You can enter text in the Text column for each database
column.
− You can use the Where statement builder to specify the join condition to join two databases using two
Database Source components. Select a database table in the Join column by checking its check box. A new
Join column, such as Join1, appears to its right.
The During tab also contains the following options:
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. They are cleared by default.
♦ Expert mode. Use to view and edit the underlying SQL query statements, and to create advanced select
statements. This option is cleared by default.
♦ Preview. Use the Preview option to view the dataset as defined by the configured settings in this dialog box.
The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
♦ Validate. Use the Validate option to verify that the SQL query is valid. This option allows you to
periodically test validity as you are constructing an SQL query.
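
The Select, Join, and Where settings described above correspond to the parts of an ordinary SQL query. The following Python sketch illustrates that correspondence only. It assumes an in-memory SQLite database and hypothetical customers and orders tables; it does not show the SQL that Workbench generates, which you can view in Expert mode.

```python
import sqlite3

# Hypothetical tables and data. An in-memory SQLite database stands in
# for the real data source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, '  Smith', 'New York'), (2, 'Jones ', 'Chicago');
    INSERT INTO orders VALUES (1, 120.0), (2, 75.5), (1, 30.0);
""")

# Select: the columns checked for inclusion in the dataset.
# Join: the column pair that links the two tables.
# Where and Text: the condition and the value it is compared with.
query = """
    SELECT c.name, c.city, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.city = ?
    ORDER BY o.amount
"""
rows = conn.execute(query, ("New York",)).fetchall()

# Rough equivalent of Trim Leading Spaces and Trim Trailing Spaces.
trimmed = [(name.strip(), city, amount) for name, city, amount in rows]
print(trimmed)
```

Passing the comparison value as a query parameter, rather than pasting it into the SQL text, is a design choice of this sketch and not a statement about how the component builds its query.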

After Tab
The After tab completes the process of generating the plan dataset. The Before tab runs SQL scripts on the
database before the dataset is configured. The After tab runs SQL scripts on the configured dataset. Like the
Before tab, the After tab displays Database and SQL Script panes.
You can browse the configured tables and columns in the left pane and write the SQL script to run on data in
the right pane.
For more information and examples, see “SQL Scripts” on page 139.

Fixed Width Source


Use this component to specify a fixed-width file as the data source for your plan. This component
allows you to edit column names, widths, and data types.

Configuration
The Fixed Width Source configuration dialog box contains the following features:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file.
When you click Select, the Select a Fixed Width File as a Source dialog box opens. You can create a new file
by typing a name in the File Name field of this dialog. In this dialog box, you can identify the character
encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on
page 143.
♦ Fixed Width columns. The columns in this group allow you to enter the name, width, and datatype for each
field in the file.
♦ Remove Trailing Spaces. Use this option to remove trailing spaces (extra spaces at the end of a data value)
from the dataset used in the plan.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
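
The way a name, width, and datatype per field describe a fixed-width record can be sketched in Python as follows. This is an illustration under stated assumptions: the column layout, the sample record, and the parse_fixed_width function are hypothetical and are not part of Data Quality.

```python
# Hypothetical column layout: (name, width, datatype) for each field,
# mirroring the three values entered in the Fixed Width columns group.
COLUMNS = [("id", 4, int), ("name", 10, str), ("city", 8, str)]


def parse_fixed_width(line, columns, remove_trailing_spaces=True):
    """Slice one fixed-width line into named, typed fields."""
    record = {}
    position = 0
    for name, width, datatype in columns:
        raw = line[position:position + width]
        position += width
        if remove_trailing_spaces:
            # Comparable in spirit to the Remove Trailing Spaces option.
            raw = raw.rstrip(" ")
        record[name] = datatype(raw)
    return record


# Build a 22-character sample record: 4 + 10 + 8 characters.
sample = "0042" + "Smith".ljust(10) + "Chicago".ljust(8)
print(parse_fixed_width(sample, COLUMNS))
```

Because each field is located by its width alone, a wrong width for one column shifts every column after it, which is why the widths must match the file exactly.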

Realtime Source
The Realtime Source allows you to develop plans that accept input in real time from live data entry or
other applications. To configure this component, define the input fields that pass data to the plan.

Configuration
The Realtime Source configuration dialog box includes an Inputs column and an Input Type column and, when
first added to a plan, a single, undefined row.
To add or delete rows in the table, right-click in the dialog box and use the Add or Delete option on the
context menu. The Delete option deletes the highlighted row.
The following columns display:
♦ Inputs. Double-click a field in this column to edit the input name. Click OK to apply your changes before
moving from the field.
♦ Input Type. Click a field in this column to view options for defining the input data type. The options are
String or Float.

For example, you may want to design a simple real-time plan to test the validity of a data code. The data code is
valid within an organization if it contains the correct year (for example, 2005 in Figure 2-1). You can write a
rule in the Rule Based Analyzer to check if any given input string contains this value. When you test the plan in
Workbench, an input dialog box like the following appears:

Figure 2-1. Realtime Source: Data Setup Dialog Box

Type the year (or any value) in the Value field and click OK to return a result. In a real-time scenario, data
inputs are checked without any direct user activity.

SAP Source
The SAP Source component allows you to use an SAP database as the data source in a plan. To obtain
the data, the SAP Source connects to an SAP system and uses a BAPI (Business API) function to read
data from the SAP database.
In the SAP Source component configuration dialog box you can identify the SAP system and set the input and
output parameters of the function. Set the input parameters to filter the database for the data relevant to your
plan. Set the output parameters to specify the data to be used in the plan.
Data Quality SAP connectivity is licensed separately from other Workbench components. If your license does
not include SAP connectivity, contact Informatica Global Customer Support. Similarly, the SAP Source
requires a valid connection to the SAP System and a corresponding SAP license for the SAP System.

Configuration
The configuration dialog box for the SAP Source displays its options on two tabs:
♦ Connection
♦ SAP System

Connection Tab
The Connection tab displays the following options:
♦ Host. The name or IP address of the SAP host computer.
♦ Client Number. Identifies an SAP client that you are authorized to use.
An SAP system can have multiple clients, each identified by a three-digit client number.
♦ System Number. A two-digit number that identifies the application server to which you want to connect.
SAP allows multiple application server instances to run against a database.
♦ Encoding. Character encodings that can be applied to the data as it is used in the plan. For more
information, see “Character Encodings and Unicode” on page 143.
♦ Username and Password. SAP username and password to identify you to the SAP system.

SAP System Tab


After entering the required information on the Connection tab, click Connect to open the SAP System tab.

The SAP application areas available on the connected system are listed on the left. Options for defining the
input and output parameters to be used in the function call to the SAP database appear on the right.
You can explore the SAP application areas to reveal the business objects defined for each area and the functions
that can be configured for each business object. The icons associated with each level are color-coded:
application area icons are yellow, business object icons are green, and function icons are red.
Your first task is to explore the available objects and select the function you want to run. Then, you can define
the function using the Import and Export tab options.

Import Tab
On the Import tab, you can set the input parameters of the function that retrieves data from the SAP database.
With this tab selected, two columns display:
♦ Name. Lists the input parameters available for the function.
♦ Value. Use to filter parameter output. To enter a filter, click in the Value column for the parameter and
enter a filter string.
Note that there are three types of parameters. Configure the values on the Import tab based on the parameter
type:
♦ Scalar parameter. A single name-value pair of the type described above, such as “Town – Chicago.”
♦ Structure parameter. A group of one or more scalar parameters, such as a multi-line address group. A
structure can have multiple rows but has a single column of values, for example:

ADDRESS
AddressLine1   781 Fifth Avenue
AddressLine2   New York
AddressLine3   NY
AddressLine4   10022

♦ Table parameter. Contains one or more rows of data described by one or more columns. For example, each
name below has multiple values:

CUSTOMERS
Name     AddressLine1      AddressLine2   AddressLine3
Smith    Fifth Avenue      New York       NY 10022
Jones    Park Avenue       New York       NY 10128
Wilson   Columbus Avenue   New York       NY 10025
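
The difference between the three parameter types is easier to see when the same examples are written as data structures. The Python sketch below is illustrative only. It reuses the sample values above, and the parameter_type helper is a hypothetical name. It does not represent how the SAP Source or a BAPI function stores parameters internally.

```python
# Scalar parameter: a single name-value pair.
town = ("Town", "Chicago")

# Structure parameter: several named values that form a single column.
address = {
    "AddressLine1": "781 Fifth Avenue",
    "AddressLine2": "New York",
    "AddressLine3": "NY",
    "AddressLine4": "10022",
}

# Table parameter: one or more rows that share the same columns.
customers = [
    {"Name": "Smith", "AddressLine1": "Fifth Avenue",
     "AddressLine2": "New York", "AddressLine3": "NY 10022"},
    {"Name": "Jones", "AddressLine1": "Park Avenue",
     "AddressLine2": "New York", "AddressLine3": "NY 10128"},
    {"Name": "Wilson", "AddressLine1": "Columbus Avenue",
     "AddressLine2": "New York", "AddressLine3": "NY 10025"},
]


def parameter_type(value):
    """Classify a value using the shapes chosen in this sketch."""
    if isinstance(value, list):
        return "table"
    if isinstance(value, dict):
        return "structure"
    return "scalar"


for example in (town, address, customers):
    print(parameter_type(example))
```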

Export Tab
The Export tab displays output parameters that correspond to the settings on the Import tab. The export
parameters determine the data values that are “exported” from the SAP database for use as source data in your
data quality plan.
The export parameters that appear are specific to the function being used. The tab provides the following options:
♦ Value. To select a parameter for data export to your plan, use the Value check box of the parameter.
Depending on the parameter type, you might need to select individual data elements for export.
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing
spaces from the dataset. They are cleared by default.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.

Click OK in the configuration dialog box to save your changes.

CSV Match Source


The CSV Match Source compares the records in a single source file to identify duplicates. The source
file must be delimited. This component makes use of a CSV file in a similar manner to the CSV
Source component, then selects data for a matching operation. To match between two delimited
source files, use the CSV Dual Match Source component. For more information, see “CSV Dual
Match Source” on page 19.
When the CSV Match Source has been configured, two versions of each field in the source dataset will be
visible to the matching components. To distinguish between them, “_1” and “_2” are appended to the field
names.
The CSV Match Source is one of two components that enable the generation of match cluster information by
the CSV Match Target. The other source component is the Group Source. If you want to use the CSV Match
Target Identified Matches option to generate match cluster information, you must use CSV Match Source or
Group Source in the plan.

Configuration
The configuration dialog box contains the following fields:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see “Character Encodings and Unicode” on page 143.
♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (").
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
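
The role of the text qualifier can be shown with a short Python sketch that uses the standard csv module. The sample data is hypothetical, and the sketch shows general delimited-file behavior rather than the internal parsing performed by the component.

```python
import csv
import io

# Hypothetical file content. Two values contain the field delimiter (,),
# so they are wrapped in the text qualifier (").
raw = 'id,name,address\n1,"Smith, John","781 Fifth Avenue, New York"\n'

reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"')
header = next(reader)  # comparable to First Line of File is the Header
rows = list(reader)

# Splitting on the delimiter alone ignores the qualifier and breaks the record.
naive_fields = raw.splitlines()[1].split(",")

print(header)
print(rows)
print(len(naive_fields))
```

With the qualifier honored, the record keeps its three fields. Without it, the same line splits into five pieces and no longer lines up with the header.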

CSV Dual Match Source


This component allows you to match data from two delimited source files. The functionality of the
component is similar to that of the CSV Match Source, except the Dual Match Source compares data
across two files.

Configuration
The CSV Dual Match Source configuration dialog box displays a set of options in two areas: Source 1 and
Source 2. Each area provides identical settings for selecting and configuring a dataset. The settings in each area
are identical to those in the configuration dialog for the CSV Match Source:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see “Character Encodings and Unicode” on page 143.

♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (").
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
Note: If the CSV Dual Match Source component is being used for Match-and-Append operations, the reference
file appears in the Source 2 area.

Database Match Source


The Database Match Source component lets you explore the Data Quality repository to select tables
and columns for use in a matching plan. To configure this component you connect to the Data
Quality repository and configure the dataset.
The Database Match Source provides a single-component alternative for plans that use two Database Source
components to match data across a single table.

Configuration
The Database Match Source configuration dialog box includes two tabs: Connect to Database and Match
Selection. The Connect to Database tab options are identical to those on the Connect to Database tab of the Database
Source configuration dialog box, as described in “Database Source” on page 14.

Connect to Database Tab


The Database Match Source connects to the Data Quality repository. This option may be named Staging in the
configuration dialog box.
Click Connect to establish the connection and open the Match Selection tab. The remaining options on this tab
are disabled.

Match Selection Tab


The options on this tab allow you to explore the database tables defined in the repository and select the
columns to provide data for the matching plan:
♦ Database. Displays the repository structure as a folder hierarchy of tables and columns.
♦ Select. Provides check boxes for the columns in the explored tables. Check Select for a column to add its data
to the dataset.
♦ Unique ID. Use to identify the data column to provide the unique ID for the dataset. The dataset can have
one unique ID only.
♦ Group Key. The fields that the matching plan searches for common values. Select one or more group keys.
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. They are cleared by default.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Note: Configuring a column as the Unique ID or a Group Key automatically checks the Select option to add the
column to the dataset. However, clearing either option does not automatically remove the column from the dataset.
Clear the Select option to remove a column from the dataset.

Group Source
The Group Source component defines the input data for a plan by reading the set of group files
created by a Group Target in another plan. When you configure the Group Source to connect to the
set of group files, the Group Source uses the dataset underlying these files as the source for the plan,
providing the data to the operational components on a group-by-group basis.
Grouped data is chiefly used in matching plans, although it can be used in other types of plans.
Groups are produced by the Group Target component. The Group Target creates a set of delimited text files in
a proprietary format and saves the files in a user-defined directory. The files use the extension SSG. When
configuring the Group Source, you need to specify the host directory for the grouped files.
Groups are created in the Group Target component by defining one or more key grouping fields for the dataset.
All records with common values in the key grouping fields will be associated with a single group.
The Group Source is one of two components that enable the generation of match cluster information by a CSV
Match Target. The other source component is the CSV Match Source. If you want to use the CSV Match
Target Identified Matches option to generate match cluster information, you must use Group Source or CSV
Match Source in the plan.
You can use the Dual Group Source to group data from two data sources. For more information, see “Dual
Group Source” on page 21.

Configuration
The Group Source configuration dialog box contains the following features:
♦ Select Directories pane. Identifies the directory or directories containing the grouped data you want to use.
To add a directory, right-click in the pane and click Add from the menu.
♦ Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act
as the source directory. Be sure to select a folder, not a file.
♦ Column Headers pane. Displays the headings for each data column in the group highlighted in the Select
Directories pane. This pane has no editable options.
Note the following:
♦ Group files do not contain data from the underlying dataset, and group creation does not edit the
underlying dataset in any way. Groups are a way to identify data records with common values so these
records can be processed together in matching operations. Matching operations can be performed on
grouped data at significantly higher speeds than on non-grouped data.
♦ The column names in the Column Headers pane are appended with “_1” or “_2.” The columns are derived
from the source dataset in the plan that generated the SSG files. Each column in the dataset is duplicated so
that its data values can be matched.
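
The speed advantage of grouped data comes from comparing records only within a group instead of comparing every record with every other record. The Python sketch below illustrates the idea under stated assumptions: the records, the zip grouping field, and the function names are hypothetical, and the sketch does not reproduce the SSG file format or the matching algorithms used by Data Quality.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical source records. "zip" plays the role of the key grouping field.
records = [
    {"id": 1, "name": "Smith", "zip": "10022"},
    {"id": 2, "name": "Smyth", "zip": "10022"},
    {"id": 3, "name": "Jones", "zip": "10128"},
    {"id": 4, "name": "Jonas", "zip": "10128"},
    {"id": 5, "name": "Wilson", "zip": "10025"},
]


def group_by_key(rows, key_field):
    """Associate all records that share a key value with one group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_field]].append(row)
    return groups


def candidate_pairs(groups):
    """Yield record pairs within each group, with _1 and _2 field suffixes."""
    for members in groups.values():
        for left, right in combinations(members, 2):
            pair = {f"{field}_1": value for field, value in left.items()}
            pair.update({f"{field}_2": value for field, value in right.items()})
            yield pair


groups = group_by_key(records, "zip")
grouped_pairs = list(candidate_pairs(groups))
all_pairs = list(combinations(records, 2))

print(len(grouped_pairs), "comparisons with grouping")
print(len(all_pairs), "comparisons without grouping")
```

In this small sample, grouping reduces the number of comparisons from ten to two. The trade-off is that records placed in different groups are never compared, so the choice of key grouping field affects which matches can be found.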

Dual Group Source


The Dual Group Source allows you to perform matching operations on grouped data from two
different data sources. It uses the SSG files defined for two datasets as input.

Configuration
The Dual Group Source configuration dialog box contains the same elements as the Group Source component.
However, the Dual Group Source dialog box displays two instances of each pane.
The Dual Group Source configuration dialog box contains the following features:

♦ Select Directories pane. Identifies the directory or directories containing the grouped data you want to use.
To add a directory, right-click in the pane and click Add from the menu.
♦ Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act
as the source directory. Be sure to select a folder, not a file.
♦ Column Headers pane. Displays the headings for each data column in the group highlighted in the Select
Directories pane. This pane has no editable options.
For more information about using grouped data in plans, see “Group Source” on page 21.

CSV Identity Group Source


The CSV Identity Group Source performs identity matching on CSV sources using keys created by the
Identity Group Target. To use the CSV Identity Group Source, you must first run a plan containing an
Identity Group Target. The Identity Group Target stores keys in an identity index within Informatica
Data Quality. The CSV Identity Group Source matches input data against the keys in this identity
index.
In both the CSV Identity Group Source and the Identity Group Target, you must select the same Population
and Key Type, and ensure that the Input Column in both components contains the same type of data.
Additionally, the data sources used in both components must contain the same number of columns.
Note: Identity Group components require population files that install through the Content Installer. You must
contact Informatica to purchase and download population files separately. For information on installing
population files, consult the Informatica Data Quality Installation Guide.

Configuration
The configuration dialog box contains the following fields:
♦ Source File. Displays the name of the file to which the source component connects.
♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source
dialog box opens. You can identify the character encoding associated with the dataset. For more information,
see “Character Encodings and Unicode” on page 143.
♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings
for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.
♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (").
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the dataset.
♦ Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population that most closely matches the origin of the input data.
♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
♦ Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of
search quality and search speed. The search speed is inversely related to the number of matches returned, so
that faster searches return fewer matches. The following table describes the search speed and matching
criteria for each Search Level.

Search Level   Search Speed   Matching Criteria   Description

Narrow         Fastest        Nearly exact        This Search Level performs the fastest and most exact
                                                  matches. For example, using a Narrow Search Level for
                                                  person name matching returns exact matches and name
                                                  abbreviation matches (initials).

Typical        Fast           Strict              This Search Level performs fast searches with strict
                                                  matching criteria. For example, using a Typical Search
                                                  Level for person name matching returns data with name
                                                  abbreviation matches and some potential errors (e.g.,
                                                  incorrect initials).

Exhaustive     Average        Loose               This Search Level performs average speed searches with
                                                  loose matching criteria. For example, using an Exhaustive
                                                  Search Level for person name matching returns matches
                                                  that may represent substantial spelling errors.

Extreme        Slow           Very Loose          This Search Level performs slow searches with very loose
                                                  matching criteria. For example, using an Extreme Search
                                                  Level for person name matching may return matches with
                                                  a very wide variety of spelling errors.

♦ Input Column. The input column specifies the source data that the CSV Identity Group Source uses for
matching. Choose an input column that contains the type of data specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory that contains the key
index. Enter the Key Index Location specified in the Identity Group Target. The following string is an
example of a Key Index Location with multiple subdirectories:
UK/Person/Name

DB Identity Group Source


The DB Identity Group Source performs identity matching on database sources using keys created by
the Identity Group Target. To use the DB Identity Group Source, you must first run a plan containing
an Identity Group Target. The Identity Group Target stores keys in an identity index within
Informatica Data Quality. The DB Identity Group Source matches input data against the keys in this
identity index.
In both the DB Identity Group Source and the Identity Group Target, you must select the same Population and
Key Type, and ensure that the Input Column in both components contains the same type of data. Additionally,
the data sources used in both components must contain the same number of columns.
Note: Identity Group components require population files that install through the Content Installer. Informatica
provides these files separately from Data Quality. You must contact Informatica to purchase and download
population files. For information on installing population files, consult the Informatica Data Quality
Installation Guide.

Configuration
The DB Identity Group Source configuration dialog box includes two tabs: Connect to Database and Match
Selection.

Connect to Database Tab


The Connect to Database tab options are identical to those on the Connect to Database tab of the Database Source
configuration dialog box. For more information about the Connect to Database tab options, see “Database
Source” on page 14.
Click Connect to effect the connection and open the Match Selection tab.

Match Selection Tab


The options on this tab allow you to explore database tables and select the columns to provide data for the
matching plan:
♦ Database. Displays the database structure as a folder hierarchy of tables and columns.
♦ Select. Provides check boxes for the columns in the explored tables. Check Select for a column to add its data
to the dataset.
♦ Input Column. The input column specifies the source data that the DB Identity Group Source uses for
matching. You can only select one input column. Choose an input column that contains the type of data
specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
♦ Group Key. The fields that the matching plan searches for common values. Select one or more group keys.
Note: Do not select the same column as the Input Column and Group Key. The selections must be
different. Both are mandatory.
♦ Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population that most closely matches the origin of the input data.
♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
♦ Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of
search quality and search speed. The search speed is inversely related to the number of matches returned, so
that faster searches return fewer matches. The following table describes the search speed and matching
criteria for each Search Level.

Search Level   Search Speed   Matching Criteria   Description

Narrow         Fastest        Nearly exact        This Search Level performs the fastest and most exact
                                                  matches. For example, using a Narrow Search Level for
                                                  person name matching returns exact matches and name
                                                  abbreviation matches (initials).

Typical        Fast           Strict              This Search Level performs fast searches with strict
                                                  matching criteria. For example, using a Typical Search
                                                  Level for person name matching returns data with name
                                                  abbreviation matches and some potential errors (e.g.,
                                                  incorrect initials).

Exhaustive     Average        Loose               This Search Level performs average speed searches with
                                                  loose matching criteria. For example, using an Exhaustive
                                                  Search Level for person name matching returns matches
                                                  that may represent substantial spelling errors.

Extreme        Slow           Very Loose          This Search Level performs slow searches with very loose
                                                  matching criteria. For example, using an Extreme Search
                                                  Level for person name matching may return matches that
                                                  contain a very wide variety of spelling errors.

♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory that contains the key
index. Enter the Key Index Location specified in the Identity Group Target. The following string is an
example of a Key Index Location with multiple subdirectories:
UK/Person/Name
♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces
from the dataset. They are cleared by default.
♦ Stop on Error. Select this option if you want to stop script operation and display an error message if the
execution encounters a problem.
♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The
Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
Note: Configuring a column as the Input Column or a Group Key automatically checks the Select option to add the
column to the dataset. However, clearing either option does not automatically remove the column from the dataset.
Clear the Select option to remove a column from the dataset.

CHAPTER 3

Data Target Components


This chapter includes the following topics:
♦ Overview
♦ CSV Target
♦ Fixed Width Target
♦ Report Target
♦ CSV Merge Target
♦ CSV Match Target
♦ Match Key Target
♦ Group Target
♦ Database Target
♦ Database Report Target
♦ SAP Target
♦ Realtime Target
♦ Identity Group Target

Overview
Just as you configure source components to specify input data for your data quality plan, you configure target
components to specify plan output. Targets are designed to accept data derived from the source and operational
components of a plan.

CSV Target
The CSV Target component defines a delimited file, such as a comma-separated file, as the output
format for your data quality plan.
The component allows you to do the following:
♦ Specify the fields included in the output file, including any combination of data source fields and fields
generated within the plan.
♦ Specify the position of each field in the output file.

♦ Enter a condition to filter data written to the output file.
♦ Configure the plan to create new output files or append data to an existing file.

Configuration
The CSV Target configuration dialog box contains the following options:
♦ Target File. Identifies the output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also
identify the character encoding associated with the dataset. For more information, see “Character Encodings
and Unicode” on page 143.
♦ Overwrite file? When checked, this option specifies that the plan overwrites the target file every time it runs
(in cases where the target file name and path are unchanged for successive executions of the plan). When
cleared, this option specifies that the plan writes its output to the end of the existing target file each time it
runs. In this case, the target file grows in size each time the plan is run. This box is checked by default.
♦ Condition. Use to apply a condition-based filter, in the form of an IF statement, to the data processed by the
target. Use the filter to limit the records written to the output file.
Specify a condition by selecting a single input data field, an operator, and a condition value.
♦ Inputs. This pane lists the field types available to the target, typically, the data derived from the operational
components of the plan and the source dataset. Beside each field type is a check box. Use the check box to
add a field to the target output.
♦ Outputs. This pane shows the fields that have been selected from Inputs for inclusion in the data output. To
change the order of the output fields, use the Up and Down arrows.
♦ Launch Viewer. If there is a program associated with the file type, use this option to launch a database table
view of the target output automatically when the plan is executed.
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the rest of the dataset.
♦ Field Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is a
comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to
preserve the data structure.
♦ Text Qualifier. Select a qualifier appropriate to the data from this menu. The default option is a quotation
mark (").
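The following Python sketch is not part of Data Quality. It only illustrates how the options above interact when a delimited file is written: overwrite versus append, the condition filter, the header line, the field delimiter, and the text qualifier. The function name write_csv_target, its parameters, and the sample records are hypothetical, and writing the header only to a new or overwritten file is an assumption made for the sketch.

```python
import csv
import os
import tempfile


def write_csv_target(path, rows, fields, overwrite=True, header=True,
                     delimiter=",", qualifier='"', condition=None):
    """Write the selected fields of each row to a delimited file (illustrative only)."""
    # Assumption: a header is written only when the file is new, empty, or overwritten.
    starts_empty = overwrite or not os.path.exists(path) or os.path.getsize(path) == 0
    mode = "w" if overwrite else "a"  # "Overwrite file?" checked vs. cleared
    with open(path, mode, newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter, quotechar=qualifier,
                            quoting=csv.QUOTE_MINIMAL)
        if header and starts_empty:
            writer.writerow(fields)
        for row in rows:
            # The condition plays the role of the IF-style filter on one input field.
            if condition is None or condition(row):
                writer.writerow([row[name] for name in fields])


records = [
    {"name": "Smith, John", "city": "Chicago", "score": 0.95},
    {"name": "Bill Brown", "city": "Boston", "score": 0.40},
]
target = os.path.join(tempfile.mkdtemp(), "target.csv")
write_csv_target(target, records, ["name", "city"],
                 condition=lambda r: r["score"] >= 0.9)
with open(target, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

Because the first name contains the delimiter, the writer wraps it in the text qualifier so that the two-column structure is preserved.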

Fixed Width Target


The Fixed Width Target component generates plan output in a fixed-width file format.
The component allows you to do the following:
♦ Specify the fields included in the output file, including any combination of data source fields and
fields generated within the plan.
♦ Specify the position of each field in the output file.
♦ Specify the length of each fixed width column.
♦ Enter a condition to filter data written to the output file.
♦ Configure the plan to create new output files or append data to an existing file.

Configuration
The Fixed Width Target configuration dialog box contains the following features:

28 Chapter 3: Data Target Components


♦ Target File. Identifies the output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also
identify the character encoding associated with the dataset. For more information, see “Character Encodings
and Unicode” on page 143.
♦ Condition. Use to apply a condition-based filter, in the form of an IF statement, to the data processed by the
target. Use the filter to limit the records written to the output file.
Specify a condition by selecting a single input data field, an operator, and a condition value.
♦ Overwrite File. Use to overwrite the target file with successive executions of the plan. This option is checked
by default. Clearing this option keeps the selected target file from being overwritten, making it read-only.
♦ Inputs. This pane lists the field types available to the target, typically, the data derived from the operational
components of the plan and the source dataset. Beside each field type is a check box. Use the check box to
add a field to the target output.
♦ Outputs. Lists the name, width, and type of each selected input. The values in the cells of the Width column
determine the width as a number of characters for the associated columns of output data.
If the data values are longer than the width specified, the data will be truncated in the output file.
The default data type is String. Valid types are String, Number, and Date.
♦ Launch Viewer. If there is a program associated with the file type, use this option to launch a database table
view of the target output automatically when the plan is executed.
♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading
text and distinguish it from the rest of the dataset.
Note that the Fixed Width Source does not use a header record. Clear this option if you intend to use the
fixed-width target output file as a source in another plan.
♦ Launch Specification Viewer. Use this option to open the fixed-width specification file, which specifies the
field names and widths defined for the target output file.
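As a rough illustration of how the Width values shape the output, the Python sketch below formats one record against a list of field names and widths. It is not the product's implementation: the function name format_fixed_width is hypothetical, and left alignment with space padding is an assumption. Only the truncation of over-long values is taken from the description above.

```python
def format_fixed_width(record, spec):
    """Build one fixed-width line from a record.

    spec is a list of (field name, width) pairs. Values longer than the
    width are truncated; shorter values are padded with spaces (assumed).
    """
    return "".join(str(record[name])[:width].ljust(width) for name, width in spec)


spec = [("name", 10), ("city", 7)]
line = format_fixed_width({"name": "Mary Murphy", "city": "Cork"}, spec)
# "Mary Murphy" has 11 characters, so it is cut to "Mary Murph";
# "Cork" is padded to 7 characters.
```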

Report Target
The Report Target generates an easy-to-read report file that displays plan output data. The report files
can be opened in other applications, including web browsers and spreadsheets.
You can create three types of report files: HTML, CSV (delimited flat file), and SSR (a proprietary
Informatica Data Quality format). SSR reports can be viewed as dashboards in the Data Quality Report Viewer.
For more information, see “Report Viewer” on page 109.
When you use Report Target, you need to use a frequency component, such as Count, before Report Target.
The data fields counted in the Report Target are determined in the frequency component preceding it in the
plan.
Note: The Report Target does not read outputs from the Aggregation component.
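The relationship between a frequency component and the report can be pictured with the Python sketch below. It counts the occurrences of each value and writes the counts as a small CSV report. The function name frequency_report and the column headings are hypothetical, and the sketch does not reproduce the layout of a real Data Quality report.

```python
import csv
import io
from collections import Counter


def frequency_report(values):
    """Count each distinct value and return the counts as CSV text."""
    counts = Counter(values)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["value", "count"])
    # most_common() lists the values from most to least frequent.
    for value, count in counts.most_common():
        writer.writerow([value, count])
    return buffer.getvalue()


report = frequency_report(["Valid", "Invalid", "Valid", "Valid"])
```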

Configuration
The Report Target configuration dialog box contains the following features:
♦ Report File. Identifies the output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a Report as a
Target dialog box opens. You can create a new file by typing a name in the File Name field of this dialog. By
default, files of the type specified by the Report Transform options display.
♦ Report Transform. Determines the output file type.
− Check the Standard option to enable the file type selection menu. The options are HTML, CSV, and
SSR. The HTML option activates the Include Chart menu, which allows you to add a pie chart, bar chart,
or line chart to the report.
− Check the Custom option to write the target output to a customized HTML report template and to
generate graphical reports. Click Select beside the Custom text field to browse to a template file.
♦ Launch Report on Completion. Use to launch the report file automatically when the plan is executed.

CSV Merge Target


The CSV Merge target merges columns from two sources into a single target file. It can be used in
matching plans that compare a dataset against a reference dataset. The component operates as follows:
♦ The target lists data fields available from the other components in the plan as inputs. Select the input
fields to write as outputs to the target.
♦ The inputs defined as Source 1 are automatically written to the resulting merged target.
♦ The inputs defined as Source 2 constitute reference data. Data values from Source 2 are appended to the
merged target where good matches are found with Source 1 data, as determined by the Match Input Field
and Match Threshold settings.
Note: When more than one positive match is identified, the match with the highest score is appended.
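The merge behavior described above can be sketched in Python as follows. This is a simplified model, not the component's code: the function name merge_sources, the record layout, and the sample scores are hypothetical. Source 1 rows are always written, and the best-scoring Source 2 row is appended only when its score meets the threshold.

```python
def merge_sources(source1, candidates, threshold=0.9):
    """Append the best-scoring Source 2 row to each Source 1 row.

    source1: list of dicts, each with an "id" key.
    candidates: list of (source1 id, source2 row, match score) tuples.
    """
    best = {}
    for s1_id, s2_row, score in candidates:
        # Keep only matches at or above the threshold, and only the highest score.
        if score >= threshold and (s1_id not in best or score > best[s1_id][1]):
            best[s1_id] = (s2_row, score)

    merged = []
    for row in source1:
        out = dict(row)  # Source 1 fields are always written
        if row["id"] in best:
            out.update(best[row["id"]][0])  # append the reference data
        merged.append(out)
    return merged


source1 = [{"id": 1, "name": "John Smith"}, {"id": 2, "name": "Bill Brown"}]
candidates = [
    (1, {"ref_name": "John Smyth"}, 0.92),
    (1, {"ref_name": "Jon Smith"}, 0.95),
    (2, {"ref_name": "Will Browne"}, 0.70),
]
merged = merge_sources(source1, candidates)
```

In this sample, record 1 receives the reference value with the higher score, and record 2 is written without reference data because its only candidate falls below the threshold.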

Configuration
The CSV Merge Target configuration dialog box contains the following features:
♦ Target File. Identifies the output file for the merged data.
♦ Select. Use to browse to the output file for the data target.
When you click Select, the Select a CSV file as a Target dialog box opens. You can create a new file by
typing a name in the File Name field of this dialog.
♦ Inputs. Lists the potential input fields for the target. Input fields can be added to the Source 1 or Source 2
output panes so their data can be considered for inclusion in plan output. Add an input column to either
pane by right-clicking a field name in the Inputs pane and selecting Add to Source 1 List or Add to Source 2
List.
♦ Launch Match File. Use to open the output file automatically when the plan is run.
♦ Match Threshold. Filters the columns in the Source 2 Outputs pane according to their scores in the key
matching field, as defined for the target on the Match Input Field. Records in these columns with match
scores below this value are not included in the merged output. The default value is 0.9.
♦ Match Input Field. Lists the key matching fields defined by the plan components. Use this menu to select
the field on which to base the matching calculation. The Match Threshold applies to this calculation.
♦ Use First Line as Header. Use this option to designate the first line of data in the source file as heading text
and distinguish it from the rest of the dataset.
♦ CSV Separator: Delimiter. Select a field delimiter appropriate to the data from this menu. The default
option is comma (,). If headings for the column source data contain this delimiter, you must use a text
qualifier to preserve the data structure.
♦ CSV Separator: Qualifier. Select a qualifier appropriate to the data from this menu. The default is quotation
mark (").

CSV Match Target
The CSV Match Target creates a delimited output file containing data generated by a matching plan.
The component can generate two types of output: an HTML match report displaying match clusters
and corresponding match scores, and a CSV file containing data values that meet or exceed the match
threshold score. This match file can be used as input for the consolidation process.
The principal steps in configuring the CSV Match Target are:
♦ Select the data fields whose data matches you want to include in the target output. Include at least one
matching component output field.
♦ Select the match input field to which you want to apply the match threshold. This field and the match
threshold value constitute a filter for the plan output data.
♦ Select the types of output you want the target to generate. The target can generate an HTML report or a
CSV file in one of two formats.
For more information about formatting CSV outputs, see “Output Options in the CSV Match Target” on
page 147.
The input fields listed in the CSV Match Target configuration dialog box are numbered by appending “_1” and
“_2” to the field names. When you match data fields from a single source file, “_1” and “_2” are appended to
the field names. When you match data fields in two data sources, “_1” is appended to the fields in
one source and “_2” is appended to the fields in the other source.

Configuration
The CSV Match Target configuration dialog box contains the following options:
♦ Target File. Identifies the CSV output file for the data target.
♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a
Target dialog box opens. You can create a new file by typing a name in the File Name field.
♦ Inputs. Lists the data fields that can be included in the target output. Check a field to include it in the plan
output calculations. You must select at least one output from a matching component.
♦ Outputs. Lists the fields selected in the Inputs field. Use the Up and Down arrows to change the order of
the output fields, that is, the order in which you want them to appear in the plan output.
♦ Use First Line as Header. Check to designate the first line of data in the source file as heading text and so
distinguish it from the dataset.
♦ Launch Viewer. Use to open the output files automatically when the plan executes.
♦ Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,).
If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the
data structure.
♦ Qualifier. Select a qualifier appropriate to the data from this menu. The default is quotation mark (").
♦ Create HTML Match Report. Use to generate an HTML report displaying the match clusters found by the
plan. This option is checked by default.
Note: An HTML match report can only be generated for plans that use a Group Source or CSV Match
Source. If your plan does not include one of these two sources, an error message appears. If you are running
a CSV Match target plan created in an earlier version of Workbench, check the source configuration to make
sure that the plan continues to run successfully.
♦ Match Output Type (Matched Pairs/Identified Matches). These options determine how the CSV report file
displays the matches found by the plan.

Use the Matched Pairs option to list matching values together in the file output. For example, if the strings
“John Smith” and “John Smyth” are identified as a matched pair, both these strings will be written to a single
row along with the match score:
John Smith John Smyth 0.9

Use the Identified Matches option to append the match cluster ID and the number of records per cluster to
records identified as matches by the plan. For example, in a plan that matches the four input records “John
Smith,” “Bill Brown,” “Mary Murphy,” and “John Smyth,” the Identified Matches option appends the
following columns to the target file and populates the columns as follows:
Name           Cluster ID    Records Per Cluster
John Smith     1             2
Bill Brown     3             1
Mary Murphy    2             1
John Smyth     1             2

Here, “John Smith” and “John Smyth” share a common Cluster ID, indicating that they satisfy the plan’s
matching criteria.
Also note the following points about the Identified Matches option:
− The Identified Matches option requires inputs from a CSV Match Source or a Group Source. If you add
inputs from other sources to the CSV Match Target and select the Identified Matches option, the plan
registers an error.
− Clustering does not group matching records in the output file. The data input order corresponds to the
data output order.
− The columns listed in the Outputs pane must be organized by data source, with an equal number of
columns for records from each data source. The match score column must appear after the record
columns. Figure 3-1 illustrates the correct order.
− If you select the Identified Matches option, match score values do not appear in the file output for this
Target, even if you select a match score in the Outputs pane. This is because Identified Matches causes
data to be written one by one, and any given data row can have multiple rows associated with it.

Figure 3-1. CSV Match Target Outputs Pane, Showing Column Order for Identified Matches

For more information about formatting outputs, see “Output Options in the CSV Match Target” on
page 147.
♦ Field. Lists the output fields defined by the matching components in the plan. Use this menu to select the
field from which the CSV Match Target reads the match score. The match threshold values set in this dialog
box apply to the match scores achieved in this field.

♦ Threshold fields (Lower and Upper). Filter the data record values written as plan output according to the
record scores in the match input field (see Field menu above).
Enter a lower and upper limit for the match scores in these fields, between 0 and 1. Data from records whose
scores fall outside this range will not be included in the output. The default values are 0.9 for Lower and 1.0
for Upper. The Lower field is not designed to calculate matches with a value of 1.
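To make the Identified Matches output and the threshold filter more concrete, the Python sketch below groups matched pairs into clusters and reports a cluster ID and a record count for each input record, in input order. It is an illustration under stated assumptions, not the product's algorithm: the function name identified_matches is hypothetical, both threshold bounds are treated as inclusive, and clusters are numbered in order of first appearance, which does not reproduce the ID assignment shown in the example table above.

```python
def identified_matches(names, scored_pairs, lower=0.9, upper=1.0):
    """Return (name, cluster id, records per cluster) for each name, in input order."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for first, second, score in scored_pairs:
        # Only pairs whose score lies inside the threshold range are linked.
        if lower <= score <= upper:
            parent[find(second)] = find(first)

    cluster_ids, sizes = {}, {}
    for name in names:
        root = find(name)
        cluster_ids.setdefault(root, len(cluster_ids) + 1)  # assumed numbering
        sizes[root] = sizes.get(root, 0) + 1

    return [(name, cluster_ids[find(name)], sizes[find(name)]) for name in names]


names = ["John Smith", "Bill Brown", "Mary Murphy", "John Smyth"]
pairs = [("John Smith", "John Smyth", 0.9), ("Bill Brown", "Mary Murphy", 0.2)]
result = identified_matches(names, pairs)
```

As in the component, clustering does not reorder the records: the output rows follow the input order, and matching records are recognized by their shared cluster ID.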

Match Key Target


The Match Key Target component is commonly used in consolidation plans. It allows you to append
match plan output data directly to the source database. This eliminates the need to write match data to
a new target table. With the Match Key Target, matching and consolidation information is written
and held in database tables. The outputs of this component are CSV and HTML reports.
Data may be written by the Match Key Target if the following criteria are met in the source table
structure:
♦ The source table contains a column that can be used by the Match Key Target to uniquely identify a record.
This column will be a primary key: unique, non-null, and populated by an auto-incrementing sequence.
♦ The source table contains a column in which the system stores the match score for each matching record.
This field must be of datatype Float.
♦ The source table contains a column in which the match key is recorded. This key identifies the consolidated
records within a cluster.
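A table that satisfies these three criteria might look like the one in the Python sketch below. The sketch uses an in-memory SQLite database purely as a stand-in for the source database. The table name customer, the column names, and the sample values are hypothetical. Following the Match Key description in the Configuration section, the match key of each record is set to the unique ID of the master record in its cluster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real source database
conn.execute("""
    CREATE TABLE customer (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique record ID
        name        TEXT NOT NULL,
        match_key   INTEGER,                            -- ID of the master record
        match_score FLOAT                               -- score against the master
    )
""")
conn.executemany("INSERT INTO customer (name) VALUES (?)",
                 [("John Smith",), ("John Smyth",)])

# Made-up match results: record 2 matches master record 1 with a score of 0.9.
# Giving the master a score of 1.0 against itself is an assumption.
conn.executemany(
    "UPDATE customer SET match_key = ?, match_score = ? WHERE id = ?",
    [(1, 1.0, 1), (1, 0.9, 2)],
)
conn.commit()

rows = conn.execute(
    "SELECT id, match_key, match_score FROM customer ORDER BY id"
).fetchall()
```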

Configuration
The configuration options in the Match Key Target configuration dialog box are arranged on three tabs:
Database, Match Details, and Outputs.

Database Tab
The Database Type menu lists a static option, Staging, representing the Data Quality repository. The remaining
fields are disabled.
Click the Connect button to access the database data. This opens the Match Details tab.

Match Details Tab


The options on this tab are arranged in three areas:
♦ Table Details. The Table Details area contains the Table Names menu. This menu lists the database tables
available to the target. Use this menu to select the table to which the target will write the output data.
♦ Column Details. These menu options relate to the table identified under Table Details, whereas the Inputs
menu options list all columns in the database tables available according to the Database tab settings.
The Column Details area contains three fields:
− UniqueID. Select the column that contains the unique ID (primary key) of this table.
− Match Key. Select the column to record the match key. The match key is the primary key of the master
record in a match cluster.
− Match Score. Select the column to store the match score between each record and its master.
If the table does not already have a column created to hold the match key and match score, the table
structure must be altered to generate these fields. The match key and match score are populated when the
matching plan is run.
♦ Inputs. This area contains two fields: Unique ID - Input 1 and Unique ID - Input 2. Select the columns on
which to base the matching operations.

Outputs Tab
The options on this tab let you configure an HTML match report and CSV match file to display the data output
from the target. The match report presents the matches in clusters, and the match file presents a single row for
each matched pair.
The creation of a report or file is optional. Also, fields selected under Match Table Column Selection and
Ordering appear in the match report and match file.
The Outputs tab displays the following areas:
♦ Match Report. This area contains the following options.
− Create Report. Check to create a match report when the plan is executed.
− Select. Click to browse to the report file. When you click Select, the Select a HTML file for the Report
dialog box opens. You can create a new file by typing a name in the File name field.
− Launch Viewer. Enabled when Create Report is checked. When selected, the report opens
automatically when the plan runs.
− Clusters Per Page. Determines how many match clusters appear on each page in the report.
♦ Match Table Column Selection and Ordering. This area shows two panes. The left pane lists the columns
available on the table selected on the Match Details tab. The right pane lists the columns to appear in the
report or match file. To add a column to the right pane, click its check box in the left pane.
♦ Match Input. The match report presents each match cluster along with the selected input fields from related
match sources and the field selected from the Match Input menu. The Match Input selection and the
primary key of the source data appear as default fields on this report.
The Match Input menu lists the key fields defined by the matching components in the plan. The field you
select, in conjunction with its match threshold score, determines the records to be included in the target
output.
Likewise, the range of values you set in the Match Threshold fields is applied to the Match Input key field.
Matching records whose scores fall outside this range are not included in the output. You can set lower
and upper values between 0 and 1. The default values are 0.75 and 1.0.
♦ Match File. Like the match report, the match file contains records that contain matches within the match
threshold for the field selected from the Match Input menu. The file contains the columns selected in the
Match Table Column Selection and Ordering area. Match File has the following options:
− Create File. Check to create a match file when the plan is executed.
− Select. Click to browse to the report file. When you click Select, the Select a CSV File as a Target dialog
box opens. You can create a new file by typing a name in the File name field of this dialog.
− Launch Viewer. Enabled when the Create File box is checked. When selected, the file opens automatically
when the plan runs.
− Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma
(,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve
the data structure.
− Qualifier. Select a qualifier appropriate to the data from this menu. The default option is the quotation
mark (").
Note: It is good practice to run a plan populating an audit trail table with the unique IDs of each matching
record for every match created. When the data is consolidated, duplicate records are removed from the source
table.

Group Target
The Group Target component creates groups, a series of files in a Data Quality-proprietary format that
organizes plan data according to key data fields that you configure.
Grouping involves organizing records based on similar or identical values in one or more fields and
performing matching operations on the records assigned to each group.
Group Target output files can be used by a Group Source or Dual Group Source to organize the data inputs to
a matching plan.
Grouping large datasets is a useful precursor to running a matching plan. Performing matching operations on
grouped data improves performance with minimal loss of matching accuracy.
Grouped data is stored in local directories as a set of delimited files with the extension SSG. Set up groups by
defining one or more group key fields for the dataset. All records with a common value in the defined key fields
are written to a single group file.
Note: Group files are organized separately from the original dataset and do not modify the original dataset in any
way. A large number of SSG files can be created in the group directories, depending on the number of records
with common data in the key fields.

Configuration
The Group Target configuration dialog box contains the following options:
♦ Directory. The location and name of the directory in which the groups are created. This field is not editable.
♦ Select. Click to open the Select the Group Directory dialog box and browse to the required directory. To
select a directory, highlight it in the main window and click Select. Select a directory, not a file.
♦ Outputs. This pane lists the columns available in the dataset. Check the column name to include its data in
the plan output. The columns you select are added to the Grouping Fields pane.
Tip: Right-click in this pane to display a Select All option.

♦ Grouping Fields. Select a group key. The group files created in the group directory are based on the key you
select.
♦ Maximum Group Size. The maximum number of records assigned to a group file. If the Group Target
reaches this limit when writing to a group file, it creates another file for the group. The default value is zero,
which means no limit.
Note: Matching operations are performed within group files. This is standard behavior for matching
operations on grouped records. Although a reduction in group size can lead to faster processing times, it can
also impact the accuracy of match results.
♦ Maximum Files Per Group. The maximum number of group files written to a given folder on disk. The
default value is 5000. When this number is exceeded, the Group Target creates one or more sub-folders to
house the remaining files. If this value is set to zero, no limit is imposed and files are written to a single
folder.
♦ Ignore Empty Group Field Values. Use to avoid the creation of a group based on records with null values in
a group key field.
Note: The group files you create are overwritten if you run a plan again without changing the target
configuration details. To preserve a set of group files, select a new group directory before you run the plan
again.
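The grouping options above can be pictured with the following Python sketch, which assigns records to groups by key value, starts a new group "file" when the maximum group size is reached, and skips records with an empty key. It models the behavior in memory only. The function name build_groups and the record layout are hypothetical, and the sketch does not write SSG files.

```python
from collections import defaultdict


def build_groups(records, key_field, max_group_size=0, ignore_empty=True):
    """Return a list of (key value, records) pairs, one pair per group file."""
    by_key = defaultdict(list)
    for record in records:
        key = record.get(key_field)
        if ignore_empty and not key:
            continue  # Ignore Empty Group Field Values
        by_key[key].append(record)

    files = []
    for key, members in by_key.items():
        # A maximum group size of zero means no limit.
        step = max_group_size if max_group_size > 0 else len(members)
        for start in range(0, len(members), step):
            files.append((key, members[start:start + step]))
    return files


records = [
    {"zip": "60601", "name": "A"},
    {"zip": "60601", "name": "B"},
    {"zip": "60601", "name": "C"},
    {"zip": "", "name": "D"},
    {"zip": "10001", "name": "E"},
]
groups = build_groups(records, "zip", max_group_size=2)
```

With a maximum group size of 2, the three records that share the key 60601 are split across two groups. Records in different groups are not compared with each other, which is why a smaller group size can affect match accuracy.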

Database Target
The Database Target (or DB Target) component allows you to write plan output to a database. Data
produced by the plan can update selected tables in the database or can be inserted in new or existing
tables.
In addition to its own repository, Data Quality connects to Oracle, IBM DB2, and Microsoft SQL Server
databases and also supports ODBC connections. A single plan can write to multiple databases using multiple
Database Targets.
The Database Target can write the data records processed by the plan to the database, or it can write data from
the Aggregation component detailing the frequency of occurrence of data values.

Configuration
The Database Target configuration dialog box contains four tabs:
♦ Connect To Database
♦ Before
♦ During
♦ After
The connection is defined on the Connect To Database tab.

Connect To Database Tab


This tab contains two areas: Database Information and Target Format.
You must identify the target database in the Database Information fields.
♦ Database Type. This menu provides five options: Staging (the local repository), IBM DB2, Oracle,
Microsoft SQL Server, and ODBC (as a connection to an ODBC-compliant database).
Note: When you select Oracle, you are prompted for an Oracle database system identifier. If you select another
database type, you are prompted for a data source name.
♦ DSN. Data Source Name. Identifies the database on the network. This is required for all database
connection types except Oracle.
♦ SID. System Identifier. Identifies the instance of the Oracle database.
♦ Encoding. Lists the available character encodings that can be applied to the data output. For more
information, see “Character Encodings and Unicode” on page 143.
♦ Login Information. Contains username and password text fields. You must provide your login when access
permissions have been applied to the database.
♦ Connect. Click to establish the connection.
You must also set the target format.
♦ Select Normal Mode to write the plan data to the database.
♦ Select Aggregation Mode to write data summarizing the frequency of occurrence of data values, as tabulated
by the Aggregation component, to the database. When you select this option, select the Aggregation component from
which the target will read the data.
Note: When you select Normal Mode, the outputs from all components except the Aggregation component
are available to the target. When you select Aggregation Mode, only the outputs from the Aggregation
component are available.

Before Tab
The Before tab contains a Database pane and a SQL Script pane. This tab is typically used in the Database Target
to create new tables in the selected database. You can also create Pre-INSERT and Pre-UPDATE statements.

♦ Click Execute to implement the SQL script. Click Execute before proceeding to the During tab.
♦ Check the Stop On Error check box to stop the script operation and open a message box if the execution
encounters a syntax error in the script.

During Tab
The During tab enables you to browse the database tables and filter the columns that will constitute the data
written to the database. Use this tab to create INSERT and UPDATE statements. You can also apply conditions
to tables and join columns from multiple tables. The During tab includes five columns: Database, Insert,
Update, Where, and Text.
Figure 3-2 displays the Database Target During tab:

Figure 3-2. Database Target, During Tab

Note:

♦ As on the Before tab, the Database column displays the database structure as a hierarchy of tables and
columns.
♦ To write to a column in a database table, select the required Data Quality output from the corresponding
list in the Insert or Update column.
♦ Use Stop On Error to stop the script operation and open a message box if the execution encounters
a syntax error in the script.
♦ Use Roll Back on Error to commit data to the database at the end of the batch operation. If this box is cleared,
data is committed to the database at the end of each transaction.
♦ Use Expert Mode to view and edit the underlying SQL query. Expert Mode is typically used to create more
advanced statements.
Any changes made in Expert Mode are lost if you clear this box and return to standard mode.
♦ Click the Condition option to apply a condition-based filter, in the form of an IF statement, to the data
processed by the target. Use the filter to limit the records written to the output file.
♦ In Aggregation Mode, only outputs from the Aggregation component are available. You can use Expert Mode to
perform additional calculations on aggregates.

After Tab
Use the After tab options to write post-insert or update SQL statements for a table. Use this tab to configure
primary keys and indexes for tables.
The After tab completes the process of defining the target output. The Before tab runs SQL scripts on the data
prior to its configuration. The After tab runs SQL scripts on the configured dataset. Its Database and SQL
Script panes are identical to those of the Before tab. You can browse configured tables and columns in the
database and write the SQL script to run on selected data.
For more information about SQL scripts, see “SQL Scripts” on page 139.
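The division of work between the Before, During, and After tabs can be sketched in Python as three phases run against a database. The sketch uses an in-memory SQLite database as a stand-in for a real target. The function name run_target, the table dq_output, and the index name are hypothetical, and the statements are examples rather than SQL generated by Data Quality. The roll_back_on_error flag models the Roll Back on Error option: when set, data is committed once at the end of the batch, otherwise after each insert.

```python
import sqlite3


def run_target(conn, rows, roll_back_on_error=True):
    """Run hypothetical Before, During, and After phases against a database."""
    # Before: create the table that the plan writes to.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dq_output (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.commit()

    # During: insert the plan output.
    try:
        for row in rows:
            conn.execute("INSERT INTO dq_output (id, name) VALUES (?, ?)", row)
            if not roll_back_on_error:
                conn.commit()  # commit each insert separately
        conn.commit()          # commit the whole batch at the end
    except sqlite3.Error:
        conn.rollback()        # discard whatever has not been committed

    # After: add an index once the data is in place.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_dq_output_name ON dq_output (name)")
    conn.commit()


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM dq_output").fetchone()[0]


rows = [(1, "John Smith"), (2, "Bill Brown"), (1, "Duplicate key")]

batch = sqlite3.connect(":memory:")
run_target(batch, rows, roll_back_on_error=True)

per_row = sqlite3.connect(":memory:")
run_target(per_row, rows, roll_back_on_error=False)
```

The third sample row violates the primary key. With batch commit, the error discards all three rows; with per-insert commit, the first two rows remain in the table.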

Database Report Target


The Database Report Target component generates report data for a plan and inserts this data into the
Data Quality repository. Like the Report Target, Database Report Target accepts input from frequency
components.
The Database Report Target also makes Data Quality report data accessible to external applications through an
ODBC connection. You can analyze and present the results of a data quality plan through a range of analytical
software tools, including Microsoft Excel and Crystal Reports.
Note: Unlike the Report Target component, the Database Report Target does not produce a formatted report on
the data. Instead, it writes report data to local Data Quality MySQL database tables. The tables can then be
made available to other applications through ODBC.
The MySQL database tables that store the Data Quality report data are located in the Data Quality repository,
named repository.t_athanor_report (master record) and repository.t_athanor_report_detail (detail record).

Configuration
The Database Report Target configuration dialog box contains the following:
♦ Connection Details Area. Because the Database Report Target always writes data to the Data Quality
repository, the connection options shown in this area are static.
♦ Parameters Area. This area contains the following fields:
− Report Name. Enter a report name. The report data is saved in the repository under this name.
− Maintain Reports. When this box is checked, a new record containing the report data is inserted in the
MySQL database tables each time the plan executes. Each instance of the report is identified on the
MySQL table by a unique report ID and timestamp. When this box is cleared, the record containing the
report data is updated with the latest report data each time the plan is executed.

Technical Requirements
A MySQL ODBC Driver is required when importing data from the MySQL database to an external
application. This is available to download from http://www.mysql.com.

Maintenance
To ensure reasonable table size, it might be necessary to remove historical data from the database tables that
store report data. When deleting a record from these tables, ensure that the record in question is deleted from
both the Master and Detail records to avoid creating orphaned records.
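One way to avoid orphaned records is to delete the detail rows and the master row in a single transaction, as in the Python sketch below. The sketch uses an in-memory SQLite database as a stand-in for the MySQL repository. The table names follow the names given above, but the columns (report_id, report_name, data_value, frequency) are hypothetical, so check the actual table definitions before adapting the statements.

```python
import sqlite3


def delete_report(conn, report_id):
    """Delete one report from the detail and master tables in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM t_athanor_report_detail WHERE report_id = ?",
                     (report_id,))
        conn.execute("DELETE FROM t_athanor_report WHERE report_id = ?",
                     (report_id,))


conn = sqlite3.connect(":memory:")  # stand-in for the repository database
conn.execute("CREATE TABLE t_athanor_report "
             "(report_id INTEGER PRIMARY KEY, report_name TEXT)")
conn.execute("CREATE TABLE t_athanor_report_detail "
             "(report_id INTEGER, data_value TEXT, frequency INTEGER)")
conn.execute("INSERT INTO t_athanor_report VALUES (1, 'demo')")
conn.executemany("INSERT INTO t_athanor_report_detail VALUES (?, ?, ?)",
                 [(1, "Valid", 3), (1, "Invalid", 1)])
conn.commit()

delete_report(conn, 1)

# Detail rows whose master row no longer exists would be orphans.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM t_athanor_report_detail d
    LEFT JOIN t_athanor_report r ON r.report_id = d.report_id
    WHERE r.report_id IS NULL
""").fetchone()[0]
```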

SAP Target
The SAP Target allows you to write plan output to a SAP database. This component complements the
SAP Source component, which allows you to obtain data from the SAP database for use as source data
in a plan.

There are three basic steps to configuring the target to write data to the SAP database:
1. Define a connection between Data Quality and the target SAP system.
2. Browse the list of BAPI functions on the SAP system and select the function associated with the data.
3. Configure one or more parameters on the function to be populated with data from the Data Quality plan.
Perform these steps using options on the SAP Target configuration dialog box.

Configuration
The configuration dialog box for the SAP Target displays its options on two tabs:
♦ Connection. Use the Connection tab options to establish the connection to the SAP system.
♦ SAP System. When connected, use the SAP System tab options to locate the appropriate BAPI and link its
parameters to the output columns in your plan.

Connection Tab
The Connection tab contains the following options:
♦ Host. The name or IP address of the SAP host computer.
♦ Client Number. Identifies the SAP client that you are authorized to use. A SAP system can have multiple
clients, each of which is identifiable by the three-digit client number.
♦ System Number. SAP allows multiple application server instances to run against a database. The system
number is a two-digit number that identifies the application server to which you want to connect.
♦ Encoding. This menu lists the available character encodings that can be applied to the data as it is used in
the plan. For more information, see “Character Encodings and Unicode” on page 143.
♦ Username and Password. These fields identify you to the SAP system.
Clicking Connect opens the SAP System tab.

SAP System Tab


This tab is divided into two panes. The left pane lists the SAP application areas and functions available on the
connected system, and the right pane lists the parameters defined on the highlighted function.
You can explore the application area pane as an alphabetical list or as a hierarchy that groups areas together
according to user-defined criteria. The areas can be expanded to reveal the business objects defined for each area
and the functions configured for each business object. Application areas are read from the SAP system.
The icons associated with each level in the left pane are color-coded: application area icons are yellow, business
object icons are green, and function icons are red.
Explore the available objects and select the function you want to use to write to the SAP database. Then,
configure one or more of the function parameters to receive data from one or more plan output columns.
As demonstrated for the SAP Source configuration dialog, there are three parameter types:
♦ Scalar. A single name-value pair, such as Town – Chicago.
♦ Structure. A group of one or more scalar parameters, like a multi-line address group. A structure may have
multiple rows but has a single column of values.
♦ Table. Contains one or more rows of data described by one or more columns.
Note: The SAP Target treats each field in a parameter as a scalar parameter, regardless of whether it is a single-
field scalar parameter or a multi-field table.

To configure a parameter:

1. Examine the parameter and identify the fields to which you want to add data.
2. Double-click the Value field of the parameter:
If you select a scalar parameter, this opens the Edit Scalar Parameter dialog box.
If you select a structure or table parameter, this opens the Edit Structure Parameter or Edit Table Parameter
dialog box, in which you can configure the constituent scalar values. Double-clicking a value in these dialog
boxes opens the Edit Scalar Parameter dialog box.
3. In the Edit Scalar Parameter dialog box, click the Down arrow by the Value field to see a list of available
output columns.
You can also enter a column name.
4. Select a column, and click OK.
5. Repeat these steps for all required parameters.

Realtime Target
The Realtime Target enables you to develop plans to process output data in real time and deliver data
to another application. With this component, you can define a set of columns that determine the data
sources for a plan executed by the Data Quality engine in a real-time environment.
You can develop, run, and test the plan using the Workbench user interface.
When the Data Quality engine executes a real-time plan, the records passed to the application contain all fields
selected as outputs from the Realtime Target. When configuring Realtime Target, select only the data fields that
your application needs.

Configuration
The Realtime Target configuration dialog box displays a single pane that lists all available data fields. Select the
required fields individually, or right-click within the selection pane to Select All.

Identity Group Target


The Identity Group Target component generates keys for groups of input data. It stores these keys and
the input data in an identity index within Informatica Data Quality. The CSV Identity Group Source
and the DB Identity Group Source require the key values in this index to perform identity matching
on plan data.
All identity matching operations require two plans that must be run consecutively. The first plan must contain
an Identity Group Target. The second plan must contain either a CSV Identity Group Source or a
DB Identity Group Source. These components search the data for the keys defined by the Identity Group
Target in the first plan.
Note: Do not use the Identity Group Target in the same plan as any Data Quality match source component.

Identity Group components require population files that install through the Content Installer. You must
contact Informatica to purchase and download population files separately. For information on installing
population files, consult the Informatica Data Quality Installation Guide.

Configuration
The Identity Group Target configuration dialog box contains the following options:

♦ Input. This pane lists the potential input columns available to the target. Use the check box next to each
column to add that column to the target output. At least one input column should contain person name,
organization, or address data, as the Identity Group Target uses these data types for key generation.
Tip: Right-click in the input pane to display a Select All option.

♦ Outputs. This pane contains outputs for each selected input column. The outputs are automatically
generated when you add input columns.
♦ Population. Populations contain key-building algorithms that are customized for specific countries and
languages. Select the population that most closely matches the origin of the input data.
♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data:
person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you
wish to use in key generation.
♦ Key Level. The Key Level determines the number and variety of keys generated by the Identity Group
Target. The three key levels are Limited, Standard, and Extended. The following table describes the features
of each Key Level:

Key Level  Disk Space Usage  Matching Success                           Intended Use
Limited    Low               Finds likely matches; does not find all    Non-critical searches on systems
                             probable matches                           with limited disk space
Standard   High              Overcomes most variations in word order,   Most search applications
                             missing words, and extra words
Extended   Very high         Finds most possible matches, regardless    High-risk or mission-critical
                             of word order variation and concatenation  search applications

♦ Input Column. The input column specifies the source data that the Identity Group Target uses for key
generation. Choose an input column that contains the type of data specified in the Key Type field.
The order of individual strings in the selected input column should match the normal string order used in
the population Key Type you selected. For example, in English-speaking countries the normal string order
for person names is as follows:
First Name + Middle Name(s) + Family Name(s)
♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory where the key index
will be generated. Set a unique Key Index Location for each plan to avoid overwriting other key indexes.
You can specify a Key Index Location with multiple subdirectories in order to help organize your Identity
Key Indexes. The following string displays an example of a Key Index Location with multiple
subdirectories:
UK/Person/Name

CHAPTER 4

Frequency Components
This chapter includes the following topics:
♦ Overview
♦ Count
♦ Sum
♦ Aggregation
♦ MinAvgMax
♦ Range Counter
♦ Missing Values

Overview
Data Quality provides six components that determine the frequencies of values within selected data fields.
These components allow you to determine the frequencies of all values, specific values, and defined ranges of
values within data fields.
Frequency Analyzer components are essential in plans that use the Report Target or Database Report Target to
create plan output. Report Target and Database Report Target can only accept inputs from frequency
components.
Data Quality provides the following frequency components:
♦ Count
♦ Sum
♦ Aggregation
♦ MinAvgMax
♦ Range Counter
♦ Missing Values

Count
The Count component determines the number of unique values in a column and calculates the
frequency of occurrence of each value. Count is a frequency component and therefore can provide data
input to the Report Target and Database Report Target.

For example, consider the addresses listed in Table 4-1:

Table 4-1. Count Component: Sample Address List

Address1                       Address2      Address3      State  Zip
2440 Camino Ramon              San Ramon     Contra Costa  CA     94583-4296
2306 Shoreline Loop # 132      San Ramon     Contra Costa  CA     94583
2050 Shoreline Loop            San Ramon     Contra Costa  CA     94583-5502
1200 Concord Ave               Concord       Contra Costa  CA     94520-4915
1350 Montego                   Walnut Creek  Contra Costa  CA     94598-2822
1200 Montego                   Walnut Creek  Contra Costa  CA     94598-2820
108 Summerwood Pl              Concord       Contra Costa  CA     94518-2718
305 Reflections Cir Apt 27     San Ramon     Contra Costa  CA     94583-5204
101 Ygnacio Valley Rd Ste 300  Walnut Creek  Contra Costa  CA     94596-4061
2245 Via De Mercados           Concord       Contra Costa  CA     94520-4919
2000 Crow Canyon Pl Ste 206    San Ramon     Contra Costa  CA     94583-4633
2000 Crow Canyon Pl Ste 420    San Ramon     Contra Costa  CA     94583-1367
2000 Crow Canyon Pl Ste 260    San Ramon     Contra Costa  CA     94583-1384
2400 Camino Ramon Ste 100      San Ramon     Contra Costa  CA     94583-4287

Applying Count to the Address2 column results in the following data:


San Ramon 8
Concord 3
Walnut Creek 3
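
The same tally can be reproduced outside Data Quality as a quick cross-check. The following Python sketch is illustrative only and is not part of the product. It counts the Address2 values from Table 4-1 with the standard library's collections.Counter; the variable names are assumptions.

```python
from collections import Counter

# Address2 values from Table 4-1, in row order.
address2 = [
    "San Ramon", "San Ramon", "San Ramon", "Concord",
    "Walnut Creek", "Walnut Creek", "Concord", "San Ramon",
    "Walnut Creek", "Concord", "San Ramon", "San Ramon",
    "San Ramon", "San Ramon",
]

# Count how many times each distinct value occurs.
counts = Counter(address2)

for value, count in counts.most_common():
    print(value, count)
```

The counts agree with the figures above: eight records for San Ramon and three each for Concord and Walnut Creek.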

When the Count component output is read by a Report Target and the plan output is viewed in the Report
Viewer, you can drill down on any item heading to view the underlying data values.

Configuration
The Count configuration dialog box displays its settings on two tabs:
♦ Inputs
♦ Parameters

Inputs Tab
The Inputs tab lists the data columns available to the Count component from other components in the plan.
Select a column to add it to the Report Target.

Parameters Tab
The Parameters tab allows you to select and filter the data values that are counted by the component and passed
to the Report Target. It also lets you edit the output names for each counted column. The tab lists the columns
selected on the Inputs tab. For each column, three fields are displayed: Min Count, Max Cases, and Output
Name.
♦ Min Count. Specifies the minimum number of times a value must occur in a column before being listed in
the report output. For example, if a SURNAME column is selected on the Inputs tab, and the Min Count
value for SURNAME is 5, then a given surname must appear at least five times in the column to appear on
the list of surnames in the generated report. If the surname appears fewer than five times, its occurrences are
added to the Filtered total on the report.
♦ Max Cases. The Max Cases field specifies a stopping point for the count operation by setting an upper limit
on the number of different values the component lists in the report. When this limit is reached, the number
of uncounted records is included in the Others column of the report.
♦ Output Name. The name of each column sent to the target component. You can edit the name in each field.

Example
The following data sample contains eight different surnames in eleven records. A Min Count value of 2 returns
all surnames that occur more than once, Smith and Jones. A Max Cases of 7 continues counting until finding
seven different names, so the eighth name, Yeung, is added to the Others figure on the report.
SURNAME
1 Smith
2 Jones
3 Adams
4 Jones
5 Smith
6 Brady
7 Baldwin
8 Smith
9 Chase
10 Powell
11 Yeung

The Max Cases setting takes precedence over the Min Count setting. Max Cases determines the number of data
“buckets” available in the output. The Max Cases limit can be reached without identifying all the values that
meet or exceed the Min Count setting. For this reason, note the percentage of values represented by the Others
total.
For example, with the same settings but data ordered differently, as shown below, the most common name
would not be listed on the report:
SURNAME
1 Powell
2 Jones
3 Adams
4 Jones
5 Chase
6 Brady
7 Baldwin
8 Yeung
9 Smith
10 Smith
11 Smith

In this case, the Max Cases setting of 7 does not reach the eighth surname, Smith, which in fact is the most
common name in the dataset.
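
The interaction between the two settings can be modelled in a few lines of Python. This is a sketch of the behavior as inferred from the two examples above, not the Data Quality implementation. The function name count_with_limits and its return shape are hypothetical, and the model assumes that a record whose value cannot get a bucket is added to Others, and that buckets below Min Count are added to Filtered after the pass.

```python
def count_with_limits(values, min_count, max_cases):
    """Return (listed, filtered, others) for a column of values.

    Assumed model, inferred from the examples in this section:
    - a new distinct value gets a bucket only while fewer than
      max_cases buckets exist; otherwise the record goes to Others
    - after the pass, buckets with fewer than min_count
      occurrences are moved to the Filtered total
    """
    buckets = {}
    others = 0
    for value in values:
        if value in buckets:
            buckets[value] += 1
        elif len(buckets) < max_cases:
            buckets[value] = 1
        else:
            others += 1
    listed = {v: n for v, n in buckets.items() if n >= min_count}
    filtered = sum(n for n in buckets.values() if n < min_count)
    return listed, filtered, others


# The two orderings of the SURNAME column shown above.
first = ["Smith", "Jones", "Adams", "Jones", "Smith", "Brady",
         "Baldwin", "Smith", "Chase", "Powell", "Yeung"]
second = ["Powell", "Jones", "Adams", "Jones", "Chase", "Brady",
          "Baldwin", "Yeung", "Smith", "Smith", "Smith"]

print(count_with_limits(first, min_count=2, max_cases=7))
print(count_with_limits(second, min_count=2, max_cases=7))
```

Under this model, the first ordering lists Smith and Jones with one record in Others, while the second ordering lists only Jones and sends all three Smith records to Others.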
The Parameters options allow you to tune the performance of the plan in a number of ways.

For example, suppose you require the fifty most common surnames in a dataset of one million records. Assuming the
surnames are spread randomly throughout the dataset, applying a Max Cases figure in excess of fifty should
return the most common surnames without counting all rows.
There is no limit to the number that can be applied for Max Cases. However, when the total number of
different counts is greater than 20,000, plan performance may slow. When the number of counts is below
20,000, all values being counted are held in memory. If the number exceeds 20,000, all counts above this
number are held in the database as the count operations are carried out.
The following examples demonstrate how the two parameters can be used:
♦ To check for non-unique values in a field that should contain only unique values, set the Min Count value
to 2. The report identifies all non-unique values, that is, those that occur more than once.
The Max Cases field should be set to the number of records in the dataset. This ensures that sufficient
counts are performed so that even if the last two rows in the table are the only two with duplicate values,
they are identified.
♦ To count the frequency of values in a column where a finite number of different values are possible,
set Min Count to 1 and Max Cases to any value greater than the maximum number of possible values.

Sum
The Sum component calculates sums for the numeric values in each selected column. This component
classifies numeric values as positive, negative, invalid, or filtered, and provides count and sum totals
for each of these classes.
Use outputs from the Sum component as inputs for the Report Target and Database Report Target.
Note: The Sum component processes positive and negative numbers, for example 10 and -10. Do not prefix a
positive number with a + symbol. The Sum component will treat numbers entered in other formats (for
example, (10) or “10”) as invalid values.

Configuration
The Sum configuration dialog box contains the following:
♦ Inputs tab
♦ Parameters tab

Inputs Tab
The Inputs tab lists the data columns available to the component from other components in the plan. Check
the column name to assign it as an input.

Parameters Tab
Use the options on the Parameters tab to set a minimum value for inclusion in the “Positive” category for each
input column.
Positive numeric values that are less than or equal to the Min value for a column are classified as filtered. The
default Min value is 0.
Use the Parameters tab to rename the column outputs for the Sum component.
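
The four classes can be illustrated with a short Python sketch. It is not the product's implementation. The function name summarize and the regular expression are hypothetical, and two points are assumptions rather than documented behavior: a value prefixed with + is treated as invalid, and zero is classified as filtered when Min is 0.

```python
import re
from decimal import Decimal

# Plain decimal numbers with an optional leading minus sign.
# Assumption: anything else, including "+10", "(10)", and quoted
# values, is classified as invalid.
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def summarize(values, min_value=Decimal("0")):
    """Return {class_name: [count, sum]} for a column of raw strings."""
    totals = {name: [0, Decimal("0")]
              for name in ("positive", "negative", "invalid", "filtered")}
    for raw in values:
        if not NUMBER.fullmatch(raw):
            totals["invalid"][0] += 1
            continue
        number = Decimal(raw)
        if number < 0:
            name = "negative"
        elif number <= min_value:
            # Assumption: zero falls in this class when Min is 0.
            name = "filtered"
        else:
            name = "positive"
        totals[name][0] += 1
        totals[name][1] += number
    return totals


result = summarize(["10", "-10", "(10)", '"10"', "+10", "0", "2.5"])
for name, (count, total) in result.items():
    print(name, count, total)
```

Raising min_value moves small positive values from the positive class to the filtered class, which mirrors the Min setting described above.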

Aggregation
The Aggregation component provides a number of methods to calculate the frequency of occurrence of
data values both in a single column and across multiple columns. It can create detailed metrics that
demonstrate value frequencies across a dataset without writing the data to a temporary staging area or
using SQL.
The Aggregation component's capabilities include the following:
♦ It tabulates the quantities of records that contain common values in a selected field. The Count component
also performs this operation.
♦ It can tabulate the quantities of records that share a set of common values across multiple fields.
♦ It can calculate a sum of the numerical values in a given column.
♦ It can apply conditional rules to the data in selected columns so that additional counts are performed for
values that satisfy the conditions. Sum calculations do not use conditions.
The Aggregation component delivers outputs directly to a Database Target. Its outputs are not compatible with
other components.
Note: Set the Database Target to Aggregation Mode to enable it to read the Aggregation outputs.

Configuration
The Aggregation’s configuration dialog box displays its settings on three tabs:
♦ Inputs
♦ Parameters
♦ Outputs

Inputs Tab
The Inputs tab lists the data columns available to the component from other components in the plan. Select one
or more columns for configuration on the Parameters tab.
Note: When you select one or more columns on this tab, the Aggregation performs an aggregate count operation
on all data from these columns. This output appears as the Count field on the Outputs tab. You do not need to
configure other parameters to create this output, and you cannot deselect this output in the Aggregation
component.

Parameters Tab
The Parameters tab allows you to select and filter the data values that are counted by the component and passed
to the Database Target. The tab contains an upper area that lists the columns selected on the Inputs tab and a
lower area that lets you define conditions to apply to the inputs.
Beside the input names in the upper area are two columns: Group and Sum.
♦ Check the Group option for one or more input columns to generate totals for each pattern of values that
occurs across those columns. See “Calculating in Groups” on page 48.
♦ Check the Sum option for one or more input columns to calculate a total for the numerical values in those
columns. See “Calculating Sums” on page 48.
The Parameters tab also contains a Conditional Counts area. This allows you to filter the data to which a count
calculation is applied.
♦ Define a conditional count by selecting an input field and operators from the Conditional Count area and
clicking Add. To delete a condition, select it in the lower area and click Delete.
You can define conditional counts for individual columns, and you can add multiple conditional counts on
this tab.

Calculating in Groups
Table 4-2 provides sample bank account data that illustrates how group calculations work.

Table 4-2. Sample Input Data for Aggregation Component

NAME              CITY      STATE  BALANCE
John Smith        Brooklyn  NY     36541.64
Mary Jones        Brooklyn  NY     6345.87
Estelle Franklin  Brooklyn  NY     354.12
Brian Franklin    New York  NY     -650.01
Tina Brooks       New York  NY     3515.21
Charles Cowell    New York  NY     216.87
Marian Hodges     New York  NY     32.81
Kate Lee          Albany    NY     354.21
Albert Chung      Albany    NY     15498.32
Gillian Ross      Buffalo   NY     244.66

Figure 4-1 illustrates a sample configuration for the Aggregation component based on this data:

Figure 4-1. Aggregation Component Dialog Box. Parameters Tab

In Figure 4-1, the Group options for CITY and STATE are checked. Thus the component will aggregate data
patterns across both columns and send the following totals to a Database Target:

Brooklyn NY 3

New York NY 4

Albany NY 2

Buffalo NY 1
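
A group calculation is equivalent to counting each distinct combination of values across the grouped columns. The following Python sketch, which is illustrative and not part of Data Quality, reproduces the totals above by counting (CITY, STATE) pairs from Table 4-2. The variable names are assumptions.

```python
from collections import Counter

# (CITY, STATE) pairs from Table 4-2, in row order.
rows = [
    ("Brooklyn", "NY"), ("Brooklyn", "NY"), ("Brooklyn", "NY"),
    ("New York", "NY"), ("New York", "NY"), ("New York", "NY"),
    ("New York", "NY"),
    ("Albany", "NY"), ("Albany", "NY"),
    ("Buffalo", "NY"),
]

# Each distinct tuple is one pattern of values across the grouped columns.
group_counts = Counter(rows)

for (city, state), count in group_counts.items():
    print(city, state, count)
```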

Calculating Sums
In Figure 4-1, the Sum option is checked for the BALANCE column. Thus the component will calculate the
sum of all values in this column, which is $62,453.70.
Sum calculations ignore all non-numeric data.

Conditional Counts
The Conditional Counts area lets you define a condition with Argument, Operator, and Value variables. A
condition acts as a filter for count calculations in the selected column.
Argument. The input column whose data will be filtered.
Operator. A mathematical operator applied to the argument data.
Value. The filter value.
Figure 4-1 contains a condition that will count the quantity of negative values in the BALANCE column, which
equates to the quantity of overdrawn accounts. You cannot define conditions for Sum calculations.
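
The Sum and conditional count results for the BALANCE column in Table 4-2 can be checked with a short Python sketch. It is illustrative only. The helper conditional_count and the OPERATORS table are hypothetical, and the set of operators offered by the component is an assumption.

```python
import operator
from decimal import Decimal

# BALANCE values from Table 4-2, held as Decimal to avoid float rounding.
balances = [Decimal(s) for s in (
    "36541.64", "6345.87", "354.12", "-650.01", "3515.21",
    "216.87", "32.81", "354.21", "15498.32", "244.66",
)]

# Sum calculation: no condition is applied.
total = sum(balances)

# Assumed operator set; the product may offer a different list.
OPERATORS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">=": operator.ge, ">": operator.gt,
}


def conditional_count(values, op, threshold):
    """Count the values that satisfy: value <op> threshold."""
    compare = OPERATORS[op]
    return sum(1 for v in values if compare(v, threshold))


overdrawn = conditional_count(balances, "<", Decimal("0"))

print(total)      # sum of all balances
print(overdrawn)  # number of negative balances
```

The total is 62453.70, matching the figure quoted for the Sum option, and the condition BALANCE<0 matches one record.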

Outputs Tab
This tab lists the outputs that are written to the Database Target. You can edit the output names.
Figure 4-2 shows the outputs for the Parameters set in the previous example.

Figure 4-2. Aggregation Component, Outputs Tab

CITY and STATE. The quantities of common values in these fields will be calculated in group fashion. Group
calculations are not prefixed.
Count. This output is created when a column is selected on the Inputs tab. It sends a count of all value
quantities in all columns selected on the Inputs tab to the Database Target.
(Sum)BALANCE. All numbers in the BALANCE column will be added together and the sum sent to the
Database Target.
(Where)BALANCE<0. The quantity of negative balances will be sent to the Database Target.

MinAvgMax
This component returns the minimum, maximum, and average data values for selected columns.
The MinAvgMax component recognizes only data in the Float datatype that originates as output from the
Rule Based Analyzer.

Configuration
The MinAvgMax configuration dialog box displays an Inputs tab with a single pane beneath listing the columns
you can use. Only numeric fields appear in the Inputs tab.
The calculations for the selected columns are sent to the Report Target.

Range Counter
The Range Counter calculates the frequency and distribution of numerical data in selected fields. It
does so by counting the number of values that fall within user-defined intervals in the data.
To configure the Range Counter, select a data column and an interval, or a series of custom intervals,
to apply to the data. You can define multiple such instances within the component.

Configuration
The Range Counter configuration dialog box contains the following:
♦ Components pane
♦ Inputs tab
♦ Parameters tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance by working with the options on the Inputs
and Parameters tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the data columns available to the component from the other components in the plan.
Check the column name to assign it to the highlighted instance in the Components pane.

Parameters Tab
The options on the Parameters tab determine how the range of data is represented in the report. The parameters
divide the data into meaningful subsets. While the Count component counts the overall number of data values
in a given column, the Range Counter divides the column data into subsets and counts the data values in each
subset.
The parameters are organized in two areas, Select Range Type and Select Intervals. The Select Range Type area
provides two options:
♦ Linear Numeric Range. Select to apply a uniform interval to the data column associated with the
highlighted instance.
When you select this option, the Select Intervals area displays a single Interval Value field. The value you
enter determines the size of the subsets in which the reported data is organized.
♦ Variable Numeric Range. Select to apply custom intervals to the data column associated with the
highlighted instance. When you select this option, the Select Intervals area displays. When you first
configure the component, this area shows a single row with three fields: Label, Start, and End. It also shows
an All check box. You can add as many rows as you need. Each row defines an interval, and each interval can
be a different size.
Label field. Allows you to enter a descriptive label for the data row that appears in the report.
Start and End fields. Allow you to set the interval boundaries for the ranges displayed in the report.
Add button. Adds a row beneath the existing rows.
Remove button. Deletes the selected row. To delete a row from the report, check its box and click Remove.
To delete all rows, check the All option and click Remove.
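
The two range types can be sketched in Python as follows. This is an illustration, not the component's implementation. The function names and the sample data are hypothetical, and the sketch assumes that each interval includes its Start value and excludes its End value; the guide does not state how boundary values are treated.

```python
from collections import Counter


def linear_ranges(values, interval):
    """Count values per uniform interval; keys are (start, end) pairs."""
    counts = Counter()
    for v in values:
        start = (v // interval) * interval
        counts[(start, start + interval)] += 1
    return dict(sorted(counts.items()))


def variable_ranges(values, intervals):
    """Count values per custom interval.

    intervals is a list of (label, start, end) rows, like the Label,
    Start, and End fields. Assumption: start is inclusive, end is
    exclusive, and a value is counted in the first matching row.
    """
    counts = {label: 0 for label, _, _ in intervals}
    for v in values:
        for label, start, end in intervals:
            if start <= v < end:
                counts[label] += 1
                break
    return counts


# Hypothetical sample column.
ages = [3, 17, 18, 25, 40, 64, 65, 90]

print(linear_ranges(ages, 25))
print(variable_ranges(ages, [("Minor", 0, 18),
                             ("Adult", 18, 65),
                             ("Senior", 65, 200)]))
```

A linear range suits evenly spread data, while variable ranges let each row cover a span of a different size.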

Missing Values
The Missing Values component searches for specific values in an input field and determines the
frequency of the values within the field. Use it to search for known bad or absent data values.
The Report Target creates a table listing the searched-for values and the number of times they occur in
the related column.

Configuration
The Missing Values configuration dialog box contains an upper pane that lists the data columns available to the
component, and a Missing Values pane to specify the data values you want to find.
To configure the component, highlight and select a data column in the upper pane. Next, right-click in the
Missing Values pane and select Add Value or Add Null Value from the context menu.
When you select Add Value, a message appears. Double-click the text as prompted and type a value on the edit
line. The value you provide will be assigned to the highlighted column. To save your changes, press Enter before
moving from the edit line. You can assign multiple values to a single column.
Note: You can select all columns in the upper pane with a context menu option. However, values are assigned
only to the highlighted column.
Selecting Add Null Value adds the text “Null Value” to the pane and instructs Data Quality to search for null
values in the selected column.
To delete a value from the Missing Values pane, select Delete Value from the context menu.
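
The search that this configuration describes can be sketched in Python. The sketch is illustrative only. The function name count_missing and the sample column are hypothetical, null values are represented by Python's None, and matching is assumed to be exact and case-sensitive.

```python
def count_missing(column, search_values):
    """Return (value_counts, null_count) for one column.

    value_counts maps each searched-for value to its number of
    occurrences; null_count is the number of null (None) cells.
    """
    value_counts = {value: 0 for value in search_values}
    null_count = 0
    for cell in column:
        if cell is None:
            null_count += 1
        elif cell in value_counts:
            value_counts[cell] += 1
    return value_counts, null_count


# Hypothetical PHONE column with placeholder and null entries.
phone = ["555-0100", "N/A", None, "UNKNOWN", "N/A", "", None, "555-0199"]

value_counts, null_count = count_missing(phone, ["N/A", "UNKNOWN", ""])
print(value_counts)
print("Null Value", null_count)
```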

CHAPTER 5

Analysis Components
This chapter includes the following topics:
♦ Overview
♦ Character Labeller
♦ Token Labeller

Overview
Analysis components are used to identify data quality problems within individual fields in a dataset. The
analysis components identify features within free-text or non-numeric fields. The frequency of these features
can then be counted using the Count component and included in the plan report. The features can also be used
directly in cleansing and standardization routines.
Data Quality provides the following analysis components:
♦ Character Labeller
♦ Token Labeller

Character Labeller
The Character Labeller creates a character-by-character profile of data values in a data field. The
component categorizes some or all characters in the input fields according to character type. The
character types recognized by the component are:
♦ Alpha. An alphabetic character. The default label is c.
♦ Digit. A numeric character. The default label is n.
♦ Symbol. A symbol, such as a period. The default label is s.
♦ Space. Any space between data elements. The default label is _.
You can configure the component to identify all instances of one or more of these types in the input data. The
Character Labeller searches each field in the dataset for the character types you specify and writes a new column
containing codified representations of where your selections occur.
For example, the Character Labeller labels the string “01/01/2008” as “nn/nn/nnnn” with the Digit type
selected. It labels the same string as “nnsnnsnnnn” with the Digit and Symbol types selected.

You can change the labels assigned to the character types. You can also define custom labels that represent a
single character value or a set of character values.
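
The labelling rule can be expressed as a short Python sketch. It is not the component's implementation. The function names are hypothetical, the default labels are taken from the list above, and the classification of each character relies on Python's own str methods, which may differ from the product for unusual characters.

```python
DEFAULT_LABELS = {"alpha": "c", "digit": "n", "symbol": "s", "space": "_"}


def char_type(ch):
    """Classify one character as alpha, digit, space, or symbol."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "symbol"


def label(value, selected, labels=DEFAULT_LABELS):
    """Replace characters of the selected types with their labels.

    Characters whose type is not selected are returned unchanged.
    """
    out = []
    for ch in value:
        kind = char_type(ch)
        out.append(labels[kind] if kind in selected else ch)
    return "".join(out)


print(label("01/01/2008", {"digit"}))
print(label("01/01/2008", {"digit", "symbol"}))
```

With only Digit selected the sketch returns nn/nn/nnnn, and with Digit and Symbol selected it returns nnsnnsnnnn. Passing a modified copy of DEFAULT_LABELS corresponds to changing the labels assigned to the character types.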

Configuration
The Character Labeller configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Filters tab
♦ Dictionaries tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. Use the Components pane
to define an instance of the component for use in the plan.
When first opened, this pane lists a single unconfigured instance. Configure this instance by working with the
options on the tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can remove an existing
instance by highlighting it and selecting Delete from the context menu.

Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column name to assign the column to the instance highlighted in the Components pane. You can assign a single
input to each instance.

Parameters Tab
The Parameters tab options are organized in two areas:
♦ Standard Symbols. This area lists the standard symbols that can be applied to input data. To filter the input
fields for a character type, check its check box. If you clear a box, the underlying data for that character type
is returned.
You can select multiple character types for each instance of the component. You can also edit the symbols
returned for the character types. Table 5-1 lists the default symbols for each character type:

Table 5-1. Character Type Default Symbols

Character Type Default Symbol

Alpha c

Digit n

Space _ (underscore)

Symbol s

♦ Substring. This area provides options for returning the underlying data characters instead of the character
symbols for data in a field. It returns underlying characters based on their positions in the field.
For the data fields on the selected component instance, you can determine how many underlying characters
to return and where in the field to locate them.
Check Use Position to activate these settings.

− Start Position. Determines the starting location in the field for this operation. For example, with a setting
of 3, the Character Labeller returns underlying data starting at the third character in the string.
− Length. Determines the number of underlying characters to be returned, starting with the character
identified by the Start Position setting. For example, in a Date field with values in the mm/dd/yyyy
format, a Start Position of 7 and a Length of 4 returns the underlying year values for this field. You must
enter a value in this field to activate the substring settings.
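
The substring settings can be illustrated with a Python sketch. It is not the product's implementation. The function name is hypothetical, and the sketch assumes that all four character types are selected and that the output combines labels with the retained characters; the guide does not show the exact output format.

```python
def label_with_substring(value, start_position, length):
    """Label every character except a retained substring.

    start_position is 1-based, as in the example above. Characters
    inside the substring are returned as underlying data; all other
    characters are replaced by the default labels.
    """
    first = start_position - 1
    last = first + length
    out = []
    for index, ch in enumerate(value):
        if first <= index < last:
            out.append(ch)        # keep the underlying character
        elif ch.isalpha():
            out.append("c")
        elif ch.isdigit():
            out.append("n")
        elif ch.isspace():
            out.append("_")
        else:
            out.append("s")
    return "".join(out)


print(label_with_substring("01/01/2008", 7, 4))
```

For a date in mm/dd/yyyy format, a Start Position of 7 and a Length of 4 keep the year, so the sketch returns nnsnns2008.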

Filters Tab
The Filters options allow you to define filters for the input data on a component instance. You can use one or
more characters to define a filter. When the Character Labeller encounters the filter string in the input data, it
returns the underlying data characters rather than the character type symbol.
For example, in a numeric field containing quantities, such as the number of transactions in an account, you
might define a filter of 0 (zero) as it is impossible that a customer would have zero transactions. In such a case,
non-zero values will be reported by the Digit symbol while values of zero will be reported by the zero digit.
♦ To create a filter, right-click in the Filters pane and select Add from the context menu. This opens the Filter
Setup dialog box. Type the required string in the Filter Text field and set the Enable Substring options if
required. If you do not select Enable Substring, the filter will apply to all characters in the field.
♦ Check Use Position to activate the substring settings.
− The Start Position option determines the starting location in the field for the filter operation.
− The Length option determines the number of underlying characters to be returned, starting with the
character identified by the Start Position setting. You must enter a value in this field to activate the
substring settings.
− The Case Sensitive option applies the filter text in a case-sensitive manner, that is, the filter will only
recognize alphabet characters in the same case (upper or lower) as the characters in the Filter Text field.
♦ The Transform all filtered text to upper case option changes the case of filtered characters to upper case.
This option does not affect the operation of the Case Sensitive option. Transform all filtered text to upper case
operates on text that has already passed the Case Sensitive option, if the latter option is selected.

Dictionaries Tab
This tab allows you to apply dictionaries to the input data for the highlighted component instance. A dictionary
acts as another type of filter for the input data. Any character string that appears in the dictionary will be
filtered, and a user-defined character will be returned for it.
For example, you can apply a dictionary of state names to a customer address file, having first removed the
name of your home state. Using this dictionary, you can set the Character Labeller to replace any values in the
state field with an easily recognizable value such as X. This may assist a business that charges different postal
rates for out-of-state customers.
To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. The Dictionary
Setup dialog box opens. In this dialog, click the Select button to browse to a dictionary, and type a single filter
character in the Format Text field. The Character Labeller uses one character only.
Note: You must set the Enable Substring options on this tab if you select a dictionary. You cannot apply a
dictionary to all characters in a field.
♦ Check the Use Position option to activate the substring settings.
− The Start Position field determines the starting location in the field for the dictionary filter operation.
− The Length field determines the number of underlying characters to be filtered, starting with the character
identified by the Start Position setting.
Note: The Character Labeller applies dictionaries to the dataset in the order they are listed under the
Dictionaries tab for a highlighted component. You can adjust the dictionary order using the Up/Down arrows.

Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.

Token Labeller
The Token Labeller analyzes the format of the data values within a field and categorizes each value
according to a list of standard or user-defined tokens.
The Token Labeller component defines nine standard tokens:
♦ Word (alphabetic)
♦ Number (numeric)
♦ Code (alphanumeric mix)
♦ Initial (single alphabetic character)
♦ Init Set (multiple alphabetic characters)
♦ Symbol (punctuation or other symbols)
♦ Dictionary
♦ Word Symbol (mix of alphabet and symbols)
♦ Code Symbol (mix of alpha-numeric tokens and symbols)
The Token Labeller searches the dataset for the tokens you specify and returns a profile detailing how these
tokens occur in the dataset.
Table 5-2 shows a sample Customer_Name data extract:

Table 5-2. Sample Customer_Name Data Extract

Customer_Name Customer_Name

Mr Matthew Evans Robert Chad Griffin

Jason R Taylor Ms Megan Adams

Amanda Parker Antonio Reed

Heather Gray D M Jenkins

Scott Campbell Mrs L Perry

Table 5-3 displays a data profile itemizing the occurrences of tokens in the data extract:

Table 5-3. Profile of Tokens

Data Values                      Quantity   Percent
firstname surname                4          40
nameprefix firstname surname     2          20
nameprefix initial surname       1          10
firstname initial surname        1          10
initial initial surname          1          10
firstname firstname surname      1          10

You can define additional token types for the Token Labeller. Customized tokens are called filters in the Token
Labeller configuration dialog box.
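The profiling idea above can be sketched in a few lines of Python. This is an illustration only, not the Token Labeller's implementation: the prefix list, the rule that the last word is a surname, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical reference list, invented for this illustration only.
NAME_PREFIXES = {"Mr", "Mrs", "Ms"}

CUSTOMER_NAMES = [
    "Mr Matthew Evans", "Jason R Taylor", "Amanda Parker", "Heather Gray",
    "Scott Campbell", "Robert Chad Griffin", "Ms Megan Adams", "Antonio Reed",
    "D M Jenkins", "Mrs L Perry",
]


def label_token(token, index, count):
    """Return an illustrative token label for one word of a name."""
    if token in NAME_PREFIXES:
        return "nameprefix"
    if len(token) == 1 and token.isalpha():
        return "initial"
    # Assumption: the last word is the surname, any other word a first name.
    return "surname" if index == count - 1 else "firstname"


def label_value(value):
    """Replace each word in a value with its token label."""
    tokens = value.split()
    return " ".join(label_token(t, i, len(tokens)) for i, t in enumerate(tokens))


def profile(values):
    """Return (pattern, quantity, percent) rows, most frequent first."""
    counts = Counter(label_value(v) for v in values)
    return [(p, n, round(100 * n / len(values))) for p, n in counts.most_common()]


for row in profile(CUSTOMER_NAMES):
    print(row)
```

Each row pairs a token pattern with the number of values that follow it, which is the kind of information a token profile reports.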



Configuration
The Token Labeller configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Filters tab
♦ Dictionaries tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component that are available to the plan. When first opened,
this pane lists a single unconfigured instance. Configure this instance by working with the options on the tabs.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance from this pane by selecting Delete from the context menu.

Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column name to assign the column to the instance highlighted in the Components pane. You can assign a single
input to each instance.

Parameters Tab
The Parameters tab options are organized in the following areas:
♦ Tokens. Lists the standard tokens that can be applied to input fields. To filter the input fields for a token
type, select the token. You can select multiple tokens for each instance of the component. If you clear a
selected token, the underlying data for that token type is returned.
♦ Case Sensitive. Lists the standard tokens that can be rendered in upper or lower case, except Number and
Symbol. To generate case-sensitive output for a token type, select the token.
Case-sensitive output means that the token appearance in the analysis output will mirror the case of the
related characters in the source data. For example, with case sensitivity applied, the name Lyndon B Johnson
is rendered, “Word INIT Word.” With case sensitivity inactive, the name is rendered “word init word.”
♦ Lookup. Check to apply case sensitivity to any dictionaries specified on the Dictionaries tab.
♦ Delimiters area. Provides a list of the punctuation symbols used to delimit data entries in a flat file. As with
the Tokens area, select a symbol if you want to use it as a delimiter between data fields. Any punctuation
marks or symbols not selected are considered part of the dataset.
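The effect of the Case Sensitive setting on the Lyndon B Johnson example can be sketched as follows. The labelling rule (one letter is an initial, anything else is a word) and the way case is mirrored are assumptions made for this illustration, not a description of the component's internal logic.

```python
def base_label(token):
    """Assumed rule: a single letter is an initial, other text is a word."""
    return "init" if len(token) == 1 else "word"


def render(token, case_sensitive):
    """Return the label, mirroring the case of the source token if requested."""
    label = base_label(token)
    if not case_sensitive:
        return label
    if token.isupper():
        return label.upper()        # all-uppercase source, e.g. "B"
    if token[0].isupper():
        return label.capitalize()   # leading capital, e.g. "Lyndon"
    return label


def render_value(value, case_sensitive):
    return " ".join(render(t, case_sensitive) for t in value.split())


print(render_value("Lyndon B Johnson", case_sensitive=True))
print(render_value("Lyndon B Johnson", case_sensitive=False))
```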

Filters Tab
The Filters options allow you to define and edit custom token types for a component instance and to specify the
data values to correspond to those types.
For example, data might contain fields of null or system-default data with their null status represented in
multiple ways, such as Null, Missing, N/A, or Other. The Filters tab allows you to create a token type, such as
“Null” and assign one or more data values to it. When the Token Labeller encounters that value, it identifies it
as the token you have created. In effect, a filter type with multiple values assigned to it is a form of reference
dictionary.

To create a filter:

1. Right-click in the Filters pane and select Add from the context menu.
This opens the Filter Setup dialog box.

2. In the Format Text field, enter a filter type, that is, a token type.
3. Type a data value in the Filter Text field.
When the Token Labeller encounters the Filter Text value, it generates the Format Text custom token type.
You can add multiple filters with different Filter Text entries and a common Format Text entry.
The context menu also provides options to edit and delete filters from a component instance.
Note: Filters defined on this tab are not governed by the Parameters tab options. They are always applied to the
input data for the component instance with which they were created.
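Conceptually, a set of filters that share one Format Text entry behaves like a small lookup table. The sketch below assumes exact, case-sensitive matching and uses invented names; the component itself may match values differently.

```python
# Hypothetical filter definitions: Filter Text -> Format Text.
FILTERS = {
    "Null": "Null",
    "Missing": "Null",
    "N/A": "Null",
    "Other": "Null",
}


def apply_filters(value, filters=FILTERS):
    """Return the custom token type for a value, or None if no filter matches."""
    return filters.get(value)


print(apply_filters("N/A"))
print(apply_filters("Dublin"))
```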

Dictionaries Tab
This tab allows you to use one or more reference dictionaries as token identifiers. The Token Labeller assigns
dictionary entries to a single token type.
For example, you add a US_CITY dictionary to an instance of the component and assign the token type CITY
to it. Now any value in the dataset that matches a dictionary value will be recognized as the token type CITY by
the Token Labeller.

To add a dictionary:

1. Right-click in the Dictionaries pane and select Add from the context menu.
This opens the Dictionary Setup dialog box.
2. In this dialog, click Select and browse to a dictionary.
3. In the Format Text field, type a name for the dictionary value type, that is, a token type.
In the Dictionary Setup dialog box, the Inclusive and Priority options determine how the Token Labeller treats
the data values it recognizes in a dictionary:
♦ Inclusive. When selected, the Token Labeller assigns the Format Text label to every data value it finds in the
dictionary for the highlighted instance. If this box is cleared, the Token Labeller assigns the Format Text
label to all data values that are not listed in the dictionary for the highlighted instance. This option is useful
for identifying invalid or non-dictionary matches.
♦ Priority. Determines how the Token Labeller treats strings that contain a dictionary entry. If this box is checked,
the Token Labeller treats the entire contents of a field as a single entity and labels it as a dictionary match. If
this box is cleared, the Token Labeller treats the matching string as a dictionary match and labels the rest of
the field separately.
For example, a company name column contains a field with the string “Informatica Corporation.” A Corporate
Suffix dictionary is applied to this column, so the Token Labeller identifies any string containing Ltd, Inc,
Corp, LLP, or any other standard corporate suffix.
When you check Priority for the Corporate Suffix dictionary, the Token Labeller treats the string
“Informatica Corporation” as a single entity and returns a corresponding value: companyname. If you clear this
option, the Token Labeller returns two values for this string: word companyname.
Note: The Token Labeller applies dictionaries to the dataset in the order they are listed under the Dictionaries
tab. You can adjust the dictionary order using the Up/Down arrows.
When multiple dictionaries have been assigned to a component instance and a data value appears in more than
one such dictionary, the Token Labeller applies the token defined for the first dictionary in which it finds the
value.
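The dictionary order rule and the Priority option can be illustrated with a short sketch. The dictionary contents and function names here are hypothetical, and the code assumes whole-word, case-sensitive matching.

```python
# Hypothetical dictionaries, in the order they would be listed on the tab.
DICTIONARIES = [
    ("CITY", {"Paris", "Dublin", "Lincoln"}),
    ("SURNAME", {"Lincoln", "Evans"}),
]

# Hypothetical Corporate Suffix dictionary entries.
CORPORATE_SUFFIXES = {"Ltd", "Inc", "Corp", "LLP", "Corporation"}


def label_with_dictionaries(value, dictionaries=DICTIONARIES):
    """Return the token type of the first dictionary that lists the value."""
    for token_type, entries in dictionaries:
        if value in entries:
            return token_type
    return None


def label_company(field, priority):
    """Label a field that may contain a corporate suffix."""
    words = field.split()
    if not any(w in CORPORATE_SUFFIXES for w in words):
        return " ".join("word" for _ in words)
    if priority:
        # Priority checked: the whole field is one dictionary match.
        return "companyname"
    # Priority cleared: only the matching word gets the dictionary label.
    return " ".join("companyname" if w in CORPORATE_SUFFIXES else "word" for w in words)


print(label_with_dictionaries("Lincoln"))
print(label_company("Informatica Corporation", priority=True))
print(label_company("Informatica Corporation", priority=False))
```

Because "Lincoln" appears in both hypothetical dictionaries, it receives the token type of the dictionary listed first.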

Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus
from the field.
You can save the data output from a Token Labeller instance as metadata with the following procedure.



To save data output from a Token Labeller:

1. In the Meta Data area of the output pane, check Store.
This activates the Metadata and Profile menu fields.
2. Type the metadata and profile names in these two fields or select from existing names.
3. Click OK.
There is no need to create metadata more than once. After metadata has been created for a component instance,
you can clear the Store option so metadata is not recreated each time the plan runs. Recreate metadata only
when the plan input dataset changes.

CHAPTER 6

Transformation Components
This chapter includes the following topics:
♦ Overview
♦ Search Replace
♦ Word Manager
♦ Merge
♦ To Upper
♦ Rule Based Analyzer
♦ Scripting

Overview
Data Quality transformation components allow you to adjust source data. They are typically used in
standardization plans.
Data Quality provides the following transformation components:
♦ Search Replace
♦ Word Manager
♦ Merge
♦ To Upper
♦ Rule Based Analyzer
♦ Scripting
Note: Transformation components create new fields for altered data. The original data remains untouched.

Search Replace
Use this component to standardize data. Like the Word Manager, the Search Replace component can
be used to remove unwanted values from a group. While the Word Manager uses dictionaries, the
Search Replace component makes use of user-defined values.
You can use the Search Replace component in the following ways:

♦ Search for a user-defined data string and remove it from the dataset.
♦ Search for a user-defined data string and replace it with another string.
♦ Insert a user-defined data string at the start or end of a field.

Configuration
The Search Replace configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Actions tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the data columns available to the component instance highlighted in the top pane. Select a
field by highlighting it and clicking its check box. You can select a single column for each highlighted instance.

Actions Tab
The Actions tab lists the search and replace operations defined for the highlighted component instance. To add
an action, right-click in the pane and select Add from the context menu. This opens the Action Setup dialog
box:

Figure 6-1. Action Setup Dialog Box

The dialog box provides three options — Replace, Remove, and Insert — and a grid of text fields where you can
type one or more strings to be replaced or removed. Below this grid is a field where you can type any values that
you want to add to data. At the bottom of the dialog box are three buttons that determine where in each input
field the search and replace operation should be conducted.
The settings in this dialog box depend on the type of action you require. If you select Replace, all fields remain
available, so you can search for one or more strings and replace them with another string. If you select Remove,
the With field is disabled. If you select Insert, the search grid and the Anywhere option are disabled.
The search grid has twelve input fields by default. To add more fields, right-click in the grid and select Add
from the context menu. Likewise you can right-click and select Delete from the context menu to remove a row
from the grid. The highlighted row will be removed.



When you have finished working in this dialog box, click OK to save your action. To edit previously created
actions, right-click an action and choose Edit from the context menu.
If your Search Replace component contains multiple actions, you can change the order in which these actions
are performed. Select an action and click the arrows to move it up or down in the list.
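The three action types and their ordering can be pictured with the sketch below. It is an analogy in Python, not the component's code: the function names and sample data are invented, and Replace and Remove are modelled only for the Anywhere position.

```python
def replace(value, search_strings, replacement):
    """Replace every occurrence of each search string (Anywhere position)."""
    for text in search_strings:
        value = value.replace(text, replacement)
    return value


def remove(value, search_strings):
    """Remove every occurrence of each search string."""
    return replace(value, search_strings, "")


def insert(value, text, at_start=True):
    """Insert text at the start or end of the field."""
    return text + value if at_start else value + text


def run_actions(value, actions):
    """Apply the actions in list order, as the Actions tab order implies."""
    for action in actions:
        value = action(value)
    return value


# Hypothetical action list for an address field.
ACTIONS = [
    lambda v: remove(v, ["#"]),
    lambda v: replace(v, ["Str."], "Street"),
    lambda v: insert(v, "Addr: ", at_start=True),
]

print(run_actions("#12 Main Str.", ACTIONS))
```

Reordering the list changes the result whenever one action produces or removes text that another action searches for, which is why the order of actions matters.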

Outputs Tab
The Outputs tab lists the names of the data outputs for the highlighted component instance as they appear in
other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.

Word Manager
The Word Manager applies one or more reference sources (data dictionaries) to an input dataset. You can
use it to assess and improve the usability of the dataset.
The Word Manager is used for three main tasks:
♦ Determining the accuracy or inaccuracy of data in a column based on a reference source.
♦ Removing terms from a data column.
♦ Replacing terms in a data column.
Principally the Word Manager is used for data enhancement operations.
For example, by comparing an address data column containing European city names with a reference dictionary
of city names, you can evaluate the accuracy of data in this column.
If the dictionary includes variant spellings of city names, you can use the Word Manager to standardize spelling
by creating a new output column based on the dictionary entries.
You can check for original data entries that are not recognized by the dictionary. The Word Manager provides
an option to return only those values that are not recognized by the dictionary. The output column contains
only non-standard data. You can then subject that data to further evaluation.

Configuration
The Word Manager configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Dictionaries tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
This tab lists the data columns available to the component from the other components in the plan. Check a
column to assign that column to the instance highlighted in the Components pane. You can assign a single
input to each instance.

Parameters Tab
The Parameters tab displays two groups of editable options:
♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries you specify for the data on the Dictionaries
tab. Check this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner.
♦ Delimiters. Displays a list of delimiting characters. Check the delimiters applicable to your source dataset.
If your input data includes multi-domain fields, you must indicate the delimiters in use in the dataset so that
the Word Manager can distinguish between the words in the field and apply the transformative rules you
define.

Dictionaries Tab
This tab allows you to use one or more reference dictionaries to analyze or improve input data.
To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. This opens the
Dictionary Setup dialog box. In this dialog, click Select to browse to a dictionary.
The Remove Dictionary Matches option ensures that only input data values that are not recognized by the
dictionary are returned in the output column.
Dictionaries are applied to the input data in the order listed in the Dictionaries pane. You can change this order
with the Up and Down arrows.
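The two uses of a dictionary described in this section, standardizing values and returning only unrecognized values, can be sketched as follows. The dictionary contents are hypothetical, and returning an empty string for a recognized value is an assumption made so that the output column keeps one entry per input row.

```python
# Hypothetical dictionary: variant spelling -> standard spelling.
CITY_DICTIONARY = {
    "Munich": "Munich",
    "Muenchen": "Munich",
    "Cologne": "Cologne",
    "Koeln": "Cologne",
}


def standardize(values, dictionary):
    """Build a new column in which dictionary matches take the standard form."""
    return [dictionary.get(v, v) for v in values]


def remove_dictionary_matches(values, dictionary):
    """Build a new column that keeps only values the dictionary does not list."""
    return ["" if v in dictionary else v for v in values]


cities = ["Muenchen", "Koeln", "Mnchen", "Munich"]
print(standardize(cities, CITY_DICTIONARY))
print(remove_dictionary_matches(cities, CITY_DICTIONARY))
```

In both cases the original column is left untouched and the result is written to a new column, in line with the note in the chapter overview.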

Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to render it editable. To save your edits, press Enter before
removing focus from the field.

Merge
The Merge component combines the data values from multiple input fields to form a single output
field. This component is common in standardization and analysis plans. For example, you can
combine Customer_Firstname and Customer_Surname fields to create a new field called
Customer_Name. You set the order in which the input values are merged. For example, you can create
a Customer_Name field in which surname precedes firstname or firstname precedes surname.

Configuration
The Merge configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab



Components Pane
The Components pane shows the instances of the component that are available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the data fields available for assignment to the highlighted component. Select a field by
highlighting it and clicking its check box. Select at least two fields on this tab.
Note: The order in which you check the boxes determines the order in which the columns are merged. If, in the
example above, you check the Customer_Surname field before the Customer_Firstname field, the merged
output lists the surname before the first name. The default name given to the output for the instance lists the
field whose box was checked first.

Parameters Tab
This tab displays the output order of the selected inputs and the join character used to merge them. To change
the output order, select an input and click the arrows to move it up or down in the list.
In the Select Join Character dropdown, choose the character to place between the merged items. Table 6-1 lists
the available characters:

Table 6-1. Available Join Characters for the Merge Component

Available Characters

Space Double Quote Comma Full Stop

Semi-Colon Single Quote Underscore Tab

Dash Pipe Forward Slash At Symbol (@)

NONE
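The merge operation amounts to joining the selected fields in the chosen order with the chosen join character. A minimal sketch, assuming a record is a simple mapping of field names to values (the function and variable names are illustrative):

```python
# A subset of the join characters listed in Table 6-1.
JOIN_CHARACTERS = {
    "Space": " ",
    "Comma": ",",
    "Underscore": "_",
    "Dash": "-",
    "Pipe": "|",
    "NONE": "",
}


def merge(record, field_order, join_name="Space"):
    """Join the named fields of a record in the given order."""
    return JOIN_CHARACTERS[join_name].join(record[f] for f in field_order)


record = {"Customer_Firstname": "Megan", "Customer_Surname": "Adams"}
print(merge(record, ["Customer_Firstname", "Customer_Surname"]))
print(merge(record, ["Customer_Surname", "Customer_Firstname"], "Comma"))
```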

Outputs Tab
This tab lists the names of the configured outputs as they appear in any other components connected to the
Merge component. Double-click a name to render it editable. To save your edits, press Enter before removing
focus from the field.

To Upper
The To Upper component provides several ways to alter the case of a dataset. The component provides
pre-set methods to transform case and also allows you to use dictionaries when determining which
strings to transform.
To Upper is often used to create data uniformity before matching, standardization, or analysis operations.

Configuration
The To Upper configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab

♦ Parameters tab
♦ Delimiters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component that are available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the fields available for assignment to the highlighted instance. Select a field by highlighting
it and clicking its check box. You can add multiple fields to a single component instance. Each input field has
its own output field.

Parameters Tab
On this tab, the Case Transform area allows you to select the transformation method for the case of the data,
and the Options area provides additional options for dictionary use and underlying data in uppercase form.
The methods for transforming case are as follows:
♦ Uppercase. Converts all letters to uppercase.
♦ Lowercase. Converts all letters to lowercase.
♦ Toggle Case. Converts each lowercase letter to uppercase and vice versa.
♦ Title Case. Capitalizes the first letter in each sub-string.
♦ Sentence Case. Capitalizes the first letter of the field data string.
♦ No transform. No case transformation is applied. This option is generally used with the Capitalize option.
The Options area provides the following options:
♦ Capitalize Using Dictionary Entries. Use this option if you want to use a reference dictionary to identify
data strings for capitalization. Click Select to browse to a dictionary. Data strings recognized in the
dictionary are returned in the case style of their respective dictionary entries.
♦ Leave UPPERCASE Words as Found. Use this option to override the Capitalize option if the input data
string is already in upper case.
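The case transformation methods map closely onto common string operations. The sketch below is an approximation for illustration: it assumes a single space delimiter, assumes that Title Case and Sentence Case leave the remaining letters unchanged, and does not model the Leave UPPERCASE Words as Found option.

```python
def title_case(text, delimiter=" "):
    """Capitalize the first letter of each delimited sub-string."""
    return delimiter.join(w[:1].upper() + w[1:] for w in text.split(delimiter))


def sentence_case(text):
    """Capitalize the first letter of the whole field string."""
    return text[:1].upper() + text[1:]


CASE_TRANSFORMS = {
    "Uppercase": str.upper,
    "Lowercase": str.lower,
    "Toggle Case": str.swapcase,
    "Title Case": title_case,
    "Sentence Case": sentence_case,
    "No transform": lambda text: text,
}


def capitalize_using_dictionary(text, dictionary, delimiter=" "):
    """Return recognized words in the case style of their dictionary entry."""
    lookup = {entry.lower(): entry for entry in dictionary}
    return delimiter.join(lookup.get(w.lower(), w) for w in text.split(delimiter))


print(CASE_TRANSFORMS["Title Case"]("new york city"))
print(capitalize_using_dictionary("ibm corp", ["IBM"]))
```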

Delimiters Tab
When the input dataset consists of multi-domain fields, you might need to specify the delimiting symbol used
in the fields. The Delimiters tab lists the delimiters recognized by the component:

Table 6-2. Available Delimiters for the To Upper Component

Available Characters

Space Double Quote Comma Full Stop

Semi-Colon Single Quote Underscore Tab

Dash Pipe Forward Slash At Symbol [@]

Check the delimiters you want the component to recognize. You can use multiple delimiters.



Outputs Tab
This tab lists the names of the configured outputs as they appear in other components connected to the To
Upper component. Double-click a name to render it editable. To save your edits, press Enter before removing
focus from the field.

Rule Based Analyzer


The Rule Based Analyzer allows you to define and apply one or more business rules to selected input
data. It requires no previous knowledge of scripting or coding.
You can define two types of rules in this component: Condition and Assignment. Define a conditional
rule using IF-THEN-ELSE logic. Define an assignment by assigning a value to an output.

Configuration
When opened, the Rule Based Analyzer configuration dialog box displays any rules defined for the component.
Rule names appear in the Description column. The Status field indicates whether the plan can run the rule as
currently defined. A red icon in this field indicates that the rule has not been properly configured.
To add a rule, right-click in this pane and select Add Condition or Add Assignment from the context menu.
When you add a rule, default text appears in the Description field. Double-click in the field to edit the default
text. To configure the rule, right-click in this field and select Edit from the context menu.
Selecting Edit for a condition rule opens the Standard Rule dialog box. Selecting Edit for an assignment rule
opens the Set Rule dialog box.

Defining a Conditional Rule


The Standard Rule dialog box lists the IF, THEN, and ELSE statements defined for the component. You can
add multiple sets of statements. To edit a statement, right-click it and select Edit from the context menu.
Editing a statement involves working with a Rule Wizard to define the criteria for the statement.
When you enter multiple statements in the IF pane, those statements have an AND relationship.
The condition outputs are identified in the lower half of the Standard Rule dialog box. You can define multiple
outputs and assign a THEN or ELSE statement to any one of them.

Defining an Assignment Rule


The Set Rule dialog box provides fewer options than the Standard Rule dialog box. In place of the If, Then, and
Else panes, it has a single SET pane that lists the assignment settings defined for the rule. To edit a SET
statement, right-click its name and select Edit from the context menu.
As with conditional rules, editing a SET statement involves working with a Rule Wizard to define the criteria
for the statement. Similarly, you can define multiple potential outputs in the lower half of the dialog box and
assign the SET statement to any one of them.
The conditional rule logic is essentially a superset of assignment rule logic. If you add another THEN or ELSE
statement to a conditional rule, the Standard Rule dialog box displays the text “Assignment statement, right
click and select Edit.”

Expert Mode
The rule wizards allow you to write condition and assignment rules even if you have no knowledge of
programming. However, these rules retain their underlying code and syntax. To view and edit the underlying
code, use the Expert Mode option in the Standard and Set Rule dialog boxes. The code below is taken from a
conditional rule defined to check the validity of data values, Input1, by comparing them with a reference
dataset, Input2:
IF (Input1 = Input2) THEN
    Output1 := "INVALID"
ELSE
    Output1 := "VALID"
ENDIF
Use Expert Mode to construct more complex rules than are possible in the rule wizard, such as nested IF
statements.
Click the Validate button to validate the syntax of a rule.
Click OK to save your work. Informatica Data Quality displays an error message if the rule is invalid.
You can save an invalid or incomplete rule in Expert Mode. Complete or repair the rule before running the
plan.
Clearing the Expert Mode option before saving your work restores the dialog box defaults and discards any
changes you have made in the Scripts window.
For a list of keywords and expressions usable in Expert Mode, see “Rule Based Analyzer Rule Statements” on
page 127.

Example: CONTAINS Function


Use the CONTAINS function to create a rule that determines if a given string contains a user-defined value.
This function is useful when checking if data entry strings contain predicted data, for example, checking the
validity of a product code at the point of data entry.
The syntax for creating such a CONTAINS rule in Expert Mode is as follows:
Output1 := CONTAINS (Input2, Input1)

Where Input1 is the input string and Input2 is the string to be located.
The function returns an integer indicating the position in the string of the first character of the value. If the
value is present in multiple positions in the string, the function returns the first position in which
it occurs. If the value is not present, the function returns 0.
The CONTAINS function is case-sensitive.
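The behavior described above can be mimicked in Python for comparison. This is an analogy, not the Rule Based Analyzer's implementation. It assumes that positions are counted from 1, which the return value of 0 for a missing value suggests, and it follows the argument order of the syntax shown above (value to locate first, input string second).

```python
def contains(value, text):
    """Return the 1-based position of the first occurrence of value in text.

    Returns 0 if value does not occur. Matching is case-sensitive.
    """
    # str.find returns -1 when the value is absent, so adding 1 yields 0.
    return text.find(value) + 1


print(contains("AB", "XXAB-AB"))
print(contains("ab", "XXAB-AB"))
```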

Example: DATECONVERT Function


Use the DATECONVERT function to create a rule that converts a date to a different format. For example, a
plan might use a rule that converts a date from typical UK format (DD/MM/YYYY) to U.S. format
(MM/DD/YYYY). The syntax for such a rule is:
Output1 := DATECONVERT(Input1,"DD/MM/YYYY","MM/DD/YYYY")
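For readers who want to check expected results outside Data Quality, the same conversion can be approximated in Python. The mapping from the DD, MM, and YYYY tokens to Python format directives is an assumption made for this sketch, and the function name is illustrative.

```python
from datetime import datetime

# Assumed mapping from the format tokens shown above to strptime directives.
TOKEN_MAP = {"DD": "%d", "MM": "%m", "YYYY": "%Y"}


def to_directives(fmt):
    """Translate a format such as DD/MM/YYYY into a strptime pattern."""
    for token, directive in TOKEN_MAP.items():
        fmt = fmt.replace(token, directive)
    return fmt


def date_convert(value, source_format, target_format):
    """Parse value in source_format and write it out in target_format."""
    parsed = datetime.strptime(value, to_directives(source_format))
    return parsed.strftime(to_directives(target_format))


print(date_convert("25/12/2008", "DD/MM/YYYY", "MM/DD/YYYY"))
```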

Date Functions
Date functions only accept numerical dates and do not accept leading or trailing spaces. Use a slash to separate
date elements in input strings. The Rule Based Analyzer processes all Gregorian dates.
When a two-digit year value is entered, Data Quality uses the following rules to determine the century:
♦ If the two-digit year value is less than ten, the year is treated as twenty-first century. Therefore, the Rule
Based Analyzer handles the year digits 00-09 as 2000-2009.
♦ If the two-digit year value is ten or more, the year is treated as twentieth century. Therefore, the Rule Based
Analyzer handles the year digits 10-99 as 1910-1999.
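The century rules above reduce to a single comparison. A minimal Python sketch of the rule as stated (the function name is illustrative):

```python
def expand_two_digit_year(year):
    """Expand a two-digit year using the century rules described above."""
    if not 0 <= year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    # 00-09 map to 2000-2009; 10-99 map to 1910-1999.
    return 2000 + year if year < 10 else 1900 + year


for yy in (0, 9, 10, 99):
    print(yy, expand_two_digit_year(yy))
```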



Treatment of Locale Numbers
All numerical inputs and outputs in the Rule Based Analyzer are interpreted in a locale-specific format. For
example, when using a French locale setting, the Rule Based Analyzer accepts inputs and generates outputs that use the
comma as a decimal separator.
If you want to use numbers in a format that differs from the default setting, place them in quotation marks, as
shown in the second point below:
♦ Generic format: 1.65
♦ Locale format: “1,65”
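The difference between the two formats can be illustrated with a small sketch. It handles only the decimal separator, ignores thousands separators, and is an assumption-laden analogy rather than a description of how the Rule Based Analyzer parses numbers.

```python
def parse_locale_number(text, decimal_separator=","):
    """Convert a locale-formatted string such as "1,65" to a float."""
    return float(text.replace(decimal_separator, "."))


def format_locale_number(value, decimal_separator=","):
    """Write a float using the given decimal separator."""
    return str(value).replace(".", decimal_separator)


print(parse_locale_number("1,65"))
print(format_locale_number(1.65))
```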

Error Handling
When invalid parameters are passed into Rule Based Analyzer functions, the error is logged and the plan
continues execution. For example, if a numeric value is incorrectly passed to a Date Compare function, Data
Quality executes the plan, but the Rule Based Analyzer output appears in the output file as “Invalid Value.”
When conditional statements contain incorrect syntax, Data Quality produces an error message and the plan
fails.

Scripting
The Scripting component provides greater flexibility than the Rule Based Analyzer to build
customized rules and processes into a data quality plan.
Note: The Scripting component allows you to write scripts using Tool Command Language (TCL). As
such, the component requires some knowledge of this language.
For a standard dataset and for standard rules, the Rule Based Analyzer is typically adequate. Informatica
recommends the Scripting component only for rules of a complexity that the Rule Based Analyzer cannot
handle.

Configuration
The Scripting configuration dialog box contains the following areas:
♦ Inputs
♦ Script
♦ Outputs
It does not have a Components pane and does not permit multiple instances to be defined for a single
component.
♦ Inputs. Allows you to identify the data columns that constitute the input data for the component. These
fields list the input fields available to the component. Click a field to access a menu and choose a column.
The columns you select are numbered in the Input Index fields.
♦ Script. Provides a workspace for writing the TCL script that can make use of the inputs defined above.
The Save and Load options allow you to save the script to a file and to load a pre-saved script from file.
These options act on the TCL script written in the Script pane only — they do not save or load other
settings in the dialog box.
♦ Outputs. Displays the output name for the generated data as it appears to other components. Double-click a
name to render it editable. To save your edits, press Enter before removing focus from the field.
The Output Type field allows you to change the output data type. Two types are available: String and Float.

For more information about the range of functionality within the Scripting component, contact Informatica
Global Customer Support.



CHAPTER 7

Parsing Components
This chapter includes the following topics:
♦ Overview
♦ Parser
♦ Splitter
♦ Token Parser
♦ Profile Standardizer
♦ Context Parser

Overview
The parsing components allow you to extract relevant data from a field and separate extracted data into a
standardized format.
Data Quality provides the following parsing components:
♦ Parser
♦ Splitter
♦ Token Parser
♦ Profile Standardizer
♦ Context Parser

Parser
Informatica partners use the Parser component to implement customized parsing plug-ins. Parsing
plug-ins read specified input strings and create one or more new custom values from the words or
characters in the string.
Developers implement this component using the Global Component SDK. For more information, see the
Global Component SDK Guide.

Splitter
The Splitter component parses data values in a text field into new fields by comparing source data with
one or more reference datasets. Each instance of the Splitter parses a single data column.
Configure the Splitter by:
♦ Selecting data input, that is, a column on the dataset already configured in the plan.
♦ Identifying another data column to use as a reference dataset.
♦ Optionally, defining output field variables or identifying a dictionary for use as a filter on parsed data.
You can use the Splitter with or without a dictionary. The method you choose depends on the composition of
your dataset and the available dictionaries.

Parsing Data Without a Dictionary


You want to parse a column of names by gender and your dataset already contains a Gender column, so you do
not need a dictionary. First, select the source data column, such as the First_name field, and then select the
Gender column for reference purposes.
Next, identify the variables you want the Splitter to match against the reference data. The variables should
match the possible values in the reference field, in this case MALE and FEMALE.
The Splitter component creates output fields based on the defined variables. Each value in the First_name field
identified as MALE in the reference data is written to a corresponding new MALE data field, and each source
value defined as FEMALE is written to a new FEMALE field. By default, the Splitter also creates an Overflow
field to capture any source data that cannot be identified by the reference column.
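The behavior described above can be sketched in Python. This is an illustrative model only, not Data Quality code: the function name split_by_reference and the sample records are hypothetical, and the sketch assumes that each record produces one value per output field.

```python
def split_by_reference(rows, source, reference, values):
    """Model a Splitter that uses variables only (hypothetical helper).

    Each source value is written to the output field named after the
    matching value in the reference column; anything else goes to Overflow.
    """
    result = []
    for row in rows:
        out = {name: "" for name in values}
        out["Overflow"] = ""
        target = row[reference] if row[reference] in values else "Overflow"
        out[target] = row[source]
        result.append(out)
    return result


rows = [
    {"First_name": "Anna", "Gender": "FEMALE"},
    {"First_name": "Brian", "Gender": "MALE"},
    {"First_name": "Sam", "Gender": ""},  # not identified by the reference column
]
parsed = split_by_reference(rows, "First_name", "Gender", ["MALE", "FEMALE"])
# parsed[0] -> {"MALE": "", "FEMALE": "Anna", "Overflow": ""}
# parsed[2] -> {"MALE": "", "FEMALE": "", "Overflow": "Sam"}
```

Note that the comparison is made against the Gender column, while the value written to the output comes from the First_name column.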

Parsing Data with a Dictionary


You want to parse a column of account names based on their residence in the United States. Instead of adding
variables for the names and possible abbreviations of every state, you can use a dictionary.
First, select a source data column, such as the Surname field, then select an appropriate address column,
such as State or Zip, for reference purposes.
Next, identify an appropriate dictionary, in this case, a dictionary of all valid U.S. zip codes. The entries in this dictionary are
compared with the reference column data. By default, the Splitter creates an output field for source data
recognized by the dictionary and an overflow field for values not recognized. In this way, the Splitter produces
two columns, one each for U.S. and non-U.S. account names.
Note the following:
♦ You can use multiple dictionaries and multiple variables.
♦ Dictionaries and variables are not mutually exclusive. You can use either or both with an instance of the
Splitter. Each has its own output column.
♦ The variables or the dictionaries you select are compared with the reference dataset, not the source dataset.
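The dictionary-based case can be modeled in the same way. The following Python sketch is an assumption-laden illustration, not product code: split_by_dictionary, the output field name Dictionary, the case_sensitive parameter, and the two-entry zip code set are all hypothetical stand-ins.

```python
def split_by_dictionary(rows, source, reference, dictionary, case_sensitive=False):
    """Model a Splitter that uses one dictionary (hypothetical helper).

    The dictionary is compared with the reference column, never with the
    source column. Recognized records go to the dictionary output and all
    other records go to Overflow.
    """
    if not case_sensitive:
        dictionary = {entry.lower() for entry in dictionary}
    result = []
    for row in rows:
        ref = row[reference] if case_sensitive else row[reference].lower()
        if ref in dictionary:
            result.append({"Dictionary": row[source], "Overflow": ""})
        else:
            result.append({"Dictionary": "", "Overflow": row[source]})
    return result


us_zip_codes = {"10001", "94105"}  # tiny stand-in for a real zip code dictionary
rows = [
    {"Surname": "Lopez", "Zip": "10001"},
    {"Surname": "Murphy", "Zip": "D02 X285"},
]
parsed = split_by_dictionary(rows, "Surname", "Zip", us_zip_codes)
# parsed[0] -> {"Dictionary": "Lopez", "Overflow": ""}
# parsed[1] -> {"Dictionary": "", "Overflow": "Murphy"}
```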

Configuration
The Splitter configuration dialog box contains two menus for identifying the input and reference data fields,
and two panes that you can populate using context menus:
♦ Source Input menu. Use to identify the data column to be parsed.
♦ Reference Input menu. Use to identify the data column with which the defined variables or dictionaries will be
compared.
♦ Lookup (Case Sensitive) option. Use if you want the Splitter to apply case sensitivity when comparing a
dictionary with the reference data.



To add a dictionary or variable, right-click in the pane beneath the Lookup option and select Add Dictionary or
Add Value from the context menu.
The Splitter creates an output column for each entry in the upper pane and lists them in the Outputs pane. Edit
an output column name or overflow output field name by double-clicking it.

Token Parser
The Token Parser is designed to parse free-text fields that contain multiple tokens. It parses each token
to a separate field. The component identifies each value in the field by data type and writes each value
to a user-defined output field.
For example, a single free-text address field such as “3 Trebovir Rd, London, SW1” can be parsed to the
following output fields:
House Number    Street Name    Address Suffix    City      Postcode
3               Trebovir       Road              London    SW1

The Token Parser searches an input field for the data types defined on the Outputs tab of the configuration
dialog box. When it finds a type specified for the first defined output, it writes that data to the associated
output field. It then searches the field for the type defined in the second output. When a specified data type is
not found, the corresponding output is left blank.
The parsing operation passes through each field only once. The parsing operation does not reset to the start of
the field when a data value is recognized.
The Token Parser uses the same set of generic data types as in the Token Labeller component:
♦ Word
♦ Code
♦ Number
It also allows you to define data types by dictionary.

Configuration
The Token Parser configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Dictionaries tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select a single field for each component instance.

Parameters Tab
The Parameters tab displays the following editable options:
♦ Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to
your source dataset.
♦ Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead
of the default direction of left to right. This option enables you to parse data based on the final values in a
field, such as postcode.
♦ Overflow Reverse Enabled. When selected, overflow data from a reverse-enabled parsing operation is
written to the Overflow output in reverse, right to left. Enabled when you use the Reverse Enabled option,
this option is selected by default. If you clear this option, overflow output for the parsed data is written left
to right.
♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Dictionaries
tab. Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner. When this option is checked, the dictionary will only recognize tokens in the same case as the
dictionary labels.
Note: This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the
lookup.
♦ Multiple Dictionary Outputs. Determines whether the component creates a single output column for the
dictionary or dictionaries applied to the instance, or whether a separate output column is created for each
dictionary. This option is selected by default.

Multiple Dictionary Operations


When you enable the Multiple Dictionary Outputs option, an output column is created for each dictionary
applied to the instance. The input is parsed by the first selected dictionary, and the first match found is written
to the dictionary output field.
If a match is found, the next dictionary is invoked, and this dictionary searches for a match within the
remaining non-parsed tokens. It does not search the tokens already searched by the former dictionary. If no
match is found, the dictionary output field is left blank and the process begins again by invoking the next
dictionary. This process continues for all dictionaries applied to the instance.
When the Multiple Dictionary Outputs option is cleared, a single output field is created. All dictionaries are
searched in the order in which they are listed on the Dictionaries tab, but only the first term identified is
written to the output column. The remaining non-parsed terms are passed to the text, number, and code
outputs, or alternatively to the overflow column.
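The two modes can be contrasted with a short Python model. This is a sketch under stated assumptions rather than the component's implementation: parse_with_dictionaries and the sample dictionaries are hypothetical, "remaining non-parsed tokens" is read as every token not yet written to an output, and in single-output mode the first dictionary that produces a match supplies the output.

```python
def parse_with_dictionaries(tokens, dictionaries, multiple_outputs=True):
    """Model the Multiple Dictionary Outputs option (hypothetical helper).

    dictionaries maps an output name to a set of entries, in listed order.
    Returns (outputs, remaining_tokens).
    """
    remaining = list(tokens)
    if multiple_outputs:
        outputs = {name: "" for name in dictionaries}
        for name, entries in dictionaries.items():
            for token in remaining:
                if token in entries:
                    outputs[name] = token  # only the first match is written
                    remaining.remove(token)
                    break
        return outputs, remaining
    # Option cleared: one output field, dictionaries searched in listed order.
    outputs = {"Dictionary": ""}
    for entries in dictionaries.values():
        match = next((t for t in remaining if t in entries), None)
        if match is not None:
            outputs["Dictionary"] = match
            remaining.remove(match)
            break
    return outputs, remaining


tokens = ["Dr", "Anna", "Smith", "London"]
dictionaries = {"Title": {"Dr", "Mr"}, "City": {"London", "Paris"}}

multi = parse_with_dictionaries(tokens, dictionaries)
# -> ({"Title": "Dr", "City": "London"}, ["Anna", "Smith"])
single = parse_with_dictionaries(tokens, dictionaries, multiple_outputs=False)
# -> ({"Dictionary": "Dr"}, ["Anna", "Smith", "London"])
```

In both modes the remaining tokens are the ones that would be passed on to the other outputs or to the overflow column.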

Dictionaries Tab
The options on this tab allow you to apply a Data Quality dictionary to the input strings so that any input data
that matches a dictionary entry will be returned as a dictionary output. You can configure each dictionary to
write the input token unchanged to the dictionary output column or to standardize the input token to the
dictionary version of the token.
To add a dictionary to the instance highlighted in the Components pane, right-click in the pane beneath the
Dictionaries tab and select Add from the context menu. This opens the Dictionary Setup dialog box. Click the
Select button in this dialog to browse to the required dictionary.
The Dictionary Setup dialog box contains a Dictionary Standardization option. Check this option to return the
dictionary version of the token. Unchecked, this option returns the token as it appears in the input string.
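The effect of the Dictionary Standardization option, together with the case-sensitivity setting on the Parameters tab, can be pictured with the following Python sketch. It is illustrative only: dictionary_output is a hypothetical helper, and the dictionary is modeled as a simple mapping from a recognized label to its standard form, which may not reflect the actual dictionary file layout.

```python
def dictionary_output(token, dictionary, standardize=True, case_sensitive=False):
    """Return the dictionary output for one token, or None if not recognized.

    dictionary maps each recognized label to its standard (dictionary) form.
    """
    for label, standard in dictionary.items():
        if case_sensitive:
            matched = token == label
        else:
            matched = token.lower() == label.lower()
        if matched:
            # Standardization on: return the dictionary version of the token.
            # Standardization off: return the token as it appears in the input.
            return standard if standardize else token
    return None


street_suffixes = {"Rd": "Road", "Road": "Road", "St": "Street"}

dictionary_output("Rd", street_suffixes)                       # -> "Road"
dictionary_output("Rd", street_suffixes, standardize=False)    # -> "Rd"
dictionary_output("rd", street_suffixes, case_sensitive=True)  # -> None
```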



Outputs Tab
The Outputs tab options define the output columns into which the input data values are parsed. Figure 7-1
shows the Outputs tab of the Token Parser:

Figure 7-1. Token Parser, Outputs Tab

The Token Parser can create up to five types of output column:
♦ Code. Any value that mixes alphabetical and numerical data. Right-click in the Add Code Outputs field to
create a code output column.
♦ Number. Any purely numerical value identified in the input data. Right-click in the Add Number Outputs field
to create a number output column.
♦ Text. Any purely alphabetical value identified in the input data. Right-click in the Add Text Outputs field to
create a text output column.
♦ Dictionary. Lists the columns defined on the Dictionaries tab. You cannot add or delete dictionary outputs
from the Outputs tab.
♦ Overflow. A single column to which any non-parsed data is written. This field is created by the component and
cannot be deleted from the component.
The Token Parser creates its outputs as follows:
♦ First, the component applies any user-set dictionaries to the input data. Any tokens recognized by the
dictionaries are written to the columns specified in the Dictionary Outputs field.
♦ Next, the component looks for output columns defined for code, number, and text tokens, in that order. If it
finds such columns, it writes any recognized tokens to the respective columns.
♦ You can create multiple output columns for a Token type. For example, if your input data is composed of
records containing three address fields, create three text outputs. If your input data contains a telephone
number and a five-digit zip code, create two code outputs.
♦ The component attempts to populate the first output column of each token type and then moves down the
columns listed for that type. If the component cannot find an appropriate column for a token, it writes that
token to the overflow column.

Note: The parsing operation passes through each input record once only. The parsing operation does not reset to
the start of the record when a data value is recognized.
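The way tokens are distributed across code, number, text, and overflow outputs can be sketched in Python. This is a simplified model, not the component's implementation: token_parse and classify are hypothetical names, dictionary outputs are omitted, and the classification tests (isdigit, isalpha, isalnum) are assumptions based on the type descriptions above.

```python
import re


def classify(token):
    """Assign a generic type to a token, or None if no type applies."""
    if token.isdigit():
        return "number"  # purely numerical
    if token.isalpha():
        return "text"    # purely alphabetical
    if token.isalnum():
        return "code"    # mixes alphabetical and numerical data
    return None


def token_parse(value, outputs, delimiters=" ,"):
    """Parse one field in a single pass.

    outputs is an ordered list of (column_name, token_type) pairs.
    """
    tokens = [t for t in re.split("[" + re.escape(delimiters) + "]+", value) if t]
    parsed = {name: "" for name, _ in outputs}
    overflow = []
    for token in tokens:
        kind = classify(token)
        # Use the first output column of this type that is still empty.
        free = next((name for name, t in outputs if t == kind and not parsed[name]), None)
        if free is None:
            overflow.append(token)
        else:
            parsed[free] = token
    parsed["Overflow"] = " ".join(overflow)
    return parsed


address_outputs = [
    ("House Number", "number"),
    ("Street Name", "text"),
    ("Address Suffix", "text"),
    ("City", "text"),
    ("Postcode", "code"),
]
result = token_parse("3 Trebovir Rd, London, SW1", address_outputs)
# result["House Number"] -> "3", result["City"] -> "London", result["Postcode"] -> "SW1"
```

In this sketch the Address Suffix output holds "Rd" unchanged. Returning "Road" would require a dictionary with standardization enabled, which the sketch leaves out.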

Profile Standardizer
The Profile Standardizer uses the output data from a Token Labeller as input data in a parsing
operation. The Profile Standardizer parses input data to a number of output fields based on a data
structure that you define.
A Profile Standardizer parses one or more inputs from a single Token Labeller. To parse output from another
Token Labeller, use another Profile Standardizer.

Configuration
The Profile Standardizer configuration dialog box enables you to define a multi-field data structure for the
tokens recognized by the Token Labeller. Figure 7-2 displays the Profile Standardizer configuration dialog box:

Figure 7-2. Profile Standardizer Configuration Dialog Box

Using the Profile Standardizer, you can create new data columns into which one or more tokens are parsed. You
can create a rule for each combination of tokens, so that each underlying value is written to a new field.
For example, a Customer Account dataset includes a single Name field for customer names, including first and
middle names, surnames, and initials. The Token Labeller recognizes the types of tokens present in the Name
field data. The Profile Standardizer accepts the Token Labeller output and lists the various combinations of
tokens in the Name field. The Profile Standardizer can create new columns for first names, middle names, and
surnames.
Figure 7-2 shows a Profile Standardizer in mid-configuration. You do not have to create rules for every
combination of tokens.
In Figure 7-2, the rule applied to line 3, word word, sends the first token to a new first name field and the
second token to a surname field. Similarly, the combination word word word on line 5 corresponds to a
customer first name, middle name, and surname, and the rule is defined accordingly. Depending on the dataset,
there can be an element of trial and error to maximizing the output of the Profile Standardizer. The rules might
require tuning to reach your target level of parsing quality.
When you define a rule for a token combination, its row changes appearance.
♦ Components pane. Lists the instances defined for the Profile Standardizer. When first opened, this pane lists
a single instance. You can add multiple instances as long as they are linked to the same Token Labeller.
♦ Inputs pane. Lists the Token Labeller outputs available to the highlighted component instance. Select an
input by highlighting it and clicking its check box. You can select a single input.
The Metadata and Profile menus let you identify the metadata associated with the Token Labeller output. A
single Token Labeller can store multiple metadata and profile combinations. Selecting a new metadata-profile
combination in the Profile Standardizer can provide a new range of input options.
Save any changes you have made in the component before changing the current metadata or profile.
When the input, metadata, and profile are selected for the current instance, the Profiles column is populated
with the profiles created by the Token Labeller. You can now define the target columns for each set of tokens.
Right-click anywhere in the Profiles pane to add, insert, delete or rename columns from a context menu. When
you add a column, it appears to the right of existing columns.

Applying Rules to Profiles


After you create the new columns that you need, you can define the rules that determine how input data values
are parsed to new fields.
You do not have to define rules for every token profile. Defining a small number of rules can often parse a large
percentage of input data. You can subsequently add or edit rules to reach your target levels for parsing quality.
As with other parsing components, the Profile Standardizer creates an Overflow column automatically for all
data that is not parsed by the defined rules.
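Conceptually, a rule maps each token position in a profile to a target column. The following Python sketch models that idea for the name example; it is not product code. The label function is a crude, hypothetical stand-in for Token Labeller output, and RULES, COLUMNS, and apply_profile_rules are illustrative names.

```python
def label(token):
    """Stand-in for Token Labeller output (hypothetical labelling logic)."""
    stripped = token.rstrip(".")
    if stripped.isalpha():
        return "init" if len(stripped) == 1 else "word"
    return "other"


COLUMNS = ["First Name", "Middle Name", "Surname"]

# One rule per token profile: the target column for each token position.
RULES = {
    "word word": ["First Name", "Surname"],
    "word word word": ["First Name", "Middle Name", "Surname"],
}


def apply_profile_rules(name, rules=RULES):
    tokens = name.split()
    profile = " ".join(label(t) for t in tokens)
    row = {column: "" for column in COLUMNS}
    row["Overflow"] = ""
    rule = rules.get(profile)
    if rule is None:
        row["Overflow"] = name  # no rule defined for this profile
    else:
        for token, column in zip(tokens, rule):
            row[column] = token
    return row


apply_profile_rules("Mary Ann Jones")
# -> First Name "Mary", Middle Name "Ann", Surname "Jones"
apply_profile_rules("J. Smith")
# -> profile "init word" has no rule, so the value goes to Overflow
```

Only two rules are defined here, yet any two-word or three-word name is parsed. Adding a rule for "init word" would move further records out of the overflow.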

To apply rules to profiles:

1. Click a field in a user-defined column to open the Edit Profile Rule dialog box.
This displays the tokens available for insertion to that field, that is, the tokens in the Name input field for
that record. Tokens are listed in order of their occurrence in the source field, from top to bottom.
2. Select a token to send all values corresponding to that token to the new field.
3. Define a rule for a field and click Apply.
The Edit Profile Rule dialog box automatically moves to the next field in the row and displays its token
options.

Reusing Profile Data


Configured Profile Standardizer instances are saved with the metadata and profile from which the Profile
Standardizer drew the input token information. The metadata and profile appear in menus in the dialog box.
Any rules you save with a Profile Standardizer can be accessed by other instances of the component in the plan,
or in any other plans that access the same metadata repository.
Changing or deleting the Token Labeller can affect the input to the Profile Standardizer, but does not affect the
rules already created for a profile. Changing the inputs selected in the Inputs window of the Profile Standardizer
does not affect the rules already saved in the component. These rules remain in the table for any other inputs
selected in the component.
When a component is saved with a particular profile and rules and a new profile is introduced and assigned
parsing rules, the rules from the previously-selected profile are appended to the end of the new table. The rules
from the previous profile are displayed in a light grey font on a dark grey background.

Changing the Number of Displayed Profiles
The number of profiles displayed within the Profile Standardizer is limited by default to 500 rows. You can
change the maximum number of rows by editing the config.xml file located in your Data Quality installation
folder, by default: C:\Program Files\Informatica Data Quality\config.xml.
The value is configured as MetadataProfiles:
<MetadataProfiles>500</MetadataProfiles>
Note: Restart Data Quality Workbench for the changes to take effect.
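If you prefer to script the change, the edit can be sketched in Python with the standard xml.etree.ElementTree module. This is a sketch under assumptions: the overall layout of config.xml is not shown in this guide, so the Config root element in the sample is hypothetical, and the helper simply updates the first MetadataProfiles element it finds.

```python
import xml.etree.ElementTree as ET


def set_metadata_profiles(xml_text, max_rows):
    """Return xml_text with the MetadataProfiles value replaced (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag == "MetadataProfiles":
        element = root
    else:
        element = root.find(".//MetadataProfiles")
    if element is None:
        raise ValueError("MetadataProfiles element not found")
    element.text = str(max_rows)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: the real file contains other settings as well.
sample = "<Config><MetadataProfiles>500</MetadataProfiles></Config>"
updated = set_metadata_profiles(sample, 1000)
# updated -> "<Config><MetadataProfiles>1000</MetadataProfiles></Config>"
```

Back up the original file before writing any changes, because ElementTree does not preserve comments or the XML declaration by default.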

Context Parser
Like the Token Parser, the Context Parser is designed to parse free-text fields containing multiple
tokens into multiple single-token fields. Context Parser operations are based on the values and the
relative positions of the tokens.
The high-level steps in configuring the Context Parser are as follows:
1. Select an input data column for each instance.
2. Specify the delimiters to use when parsing input data.
3. Configure the output columns where individual tokens will be parsed:
♦ Determine the number of tokens you expect in the output data.
♦ Add an output field for each of these tokens.
♦ Define a token type for each output you add.
The output columns can contain one or more data values, which can be of the following types:
♦ Word
♦ Number
♦ Code
♦ Symbol
♦ Init
♦ Dictionary (listed in a specified dictionary)
By using a combination of positional hierarchy, generic token types, and dictionary-determined data, you can
achieve highly effective parsing results even in very “noisy” datasets.

Configuration
The Context Parser configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.



To add an instance, right-click in this pane and select Add from the context menu. You can also remove an
instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select a single field for each component instance.

Parameters Tab
The Parameters tab displays the following editable options:
♦ Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to
your source dataset.
♦ Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead
of the default direction of left to right. This option enables you to parse data based on the final values in a
field, such as postcode.
♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Outputs tab.
Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive
manner.
This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the
lookup.

Outputs Tab
This tab displays the user-defined output columns for the highlighted component instance. With no outputs
defined, this area is empty. Right-click below the tab and select Add Output to add an output column.
Each output is defined by two fields. The output name appears in an editable upper field. The lower field lists
the types of data values to be parsed to the field. You can set the output field to accept any of six data value
types, and you can organize these types in any order.
The input data is parsed according to the order in which the outputs are listed on this tab, and within each
output column, by the order in which the data types are listed. You can change the order of the output columns
by right-clicking an output name and selecting Move Up or Move Down from the context menu.
Note the following:
♦ The Context Parser performs a single sweep of each input field. As a result, the Context Parser works best for
structured data. For less-structured data, the Profile Standardizer may be more appropriate.
For example, you add an output of type NUMBER, and below it add an output of type WORD. When
parsing “12 Main Street,” the Context Parser locates “12,” then “Main.” If you reverse the output types, the
Context Parser locates “Main” but skips the number “12.”
♦ You can configure an output to accept more than one token by adding multiple token types to the output or
by selecting the Toggle Merge option.
Right-click a data type and select Toggle Merge from the context menu to place multiple values of that type
in a single output field if they occur consecutively within the input field. For example, right-clicking a
WORD data type and selecting Toggle Merge returns consecutive words, starting with the first word in the
field.
♦ An overflow output is created automatically for any input values that have not been handled by the
component.
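The single-sweep behavior and the Toggle Merge option can be modeled with a short Python sketch. It is an approximation for illustration, not the component's algorithm: context_parse and token_type are hypothetical names, the Dictionary type is omitted, and the type tests are assumptions.

```python
import re


def token_type(token):
    """Assumed classification of a token into a generic type."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        return "INIT" if len(token) == 1 else "WORD"
    if token.isalnum():
        return "CODE"
    return "SYMBOL"


def context_parse(value, outputs, delimiters=" ,"):
    """outputs: ordered list of (name, [(token_type, merge), ...])."""
    tokens = [t for t in re.split("[" + re.escape(delimiters) + "]+", value) if t]
    used = [False] * len(tokens)
    position = 0  # the sweep never moves backwards
    parsed = {}
    for name, wanted in outputs:
        found = []
        for kind, merge in wanted:
            index = next((i for i in range(position, len(tokens))
                          if token_type(tokens[i]) == kind), None)
            if index is None:
                continue  # type not found ahead of the current position
            while index < len(tokens) and token_type(tokens[index]) == kind:
                found.append(tokens[index])
                used[index] = True
                index += 1
                if not merge:
                    break  # without merge, take a single token
            position = index
        parsed[name] = " ".join(found)
    parsed["Overflow"] = " ".join(t for t, u in zip(tokens, used) if not u)
    return parsed


number_first = context_parse("12 Main Street",
                             [("Number", [("NUMBER", False)]), ("Street", [("WORD", False)])])
# -> {"Number": "12", "Street": "Main", "Overflow": "Street"}
word_first = context_parse("12 Main Street",
                           [("Street", [("WORD", False)]), ("Number", [("NUMBER", False)])])
# -> {"Street": "Main", "Number": "", "Overflow": "12 Street"}
merged = context_parse("12 Main Street",
                       [("Number", [("NUMBER", False)]), ("Street", [("WORD", True)])])
# -> {"Number": "12", "Street": "Main Street", "Overflow": ""}
```

The second call shows why output order matters: once the sweep has passed "12" to reach "Main", it does not return to it.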

CHAPTER 8

Key Field Generator Components


This chapter includes the following topics:
♦ Overview
♦ Normalization
♦ Soundex
♦ Nysiis

Overview
Key Field Generator components group data in preparation for the matching process. With these components,
you can create the keys by which the data is grouped. When you group data, you enhance the efficiency of the
matching process.
Data Quality provides the following key field generator components:
♦ Normalization
♦ Soundex
♦ Nysiis
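The efficiency gain from grouping can be illustrated with a few lines of Python. This is a generic sketch, not product code: group_by_key is a hypothetical helper, and the key used here (the first two letters of the surname) is a deliberately simple stand-in for a key produced by one of these components.

```python
from collections import defaultdict


def group_by_key(records, key_func):
    """Group records by a generated key (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    return dict(groups)


def comparisons(n):
    """Number of pairwise comparisons needed among n records."""
    return n * (n - 1) // 2


surnames = ["Smith", "Smyth", "Smithe", "Ford", "Forde", "Burton"]
groups = group_by_key(surnames, lambda s: s[:2].upper())  # stand-in key

total_pairs = comparisons(len(surnames))                             # 15
grouped_pairs = sum(comparisons(len(g)) for g in groups.values())    # 3 + 1 + 0 = 4
```

Matching only within groups reduces the work from 15 comparisons to 4 in this small example. The saving grows quickly with dataset size, at the cost of missing matches between records that fall into different groups.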

Normalization
Informatica partners use the Normalization component to implement customized normalization plug-ins.
Normalization plug-ins read input values and write standardized versions of those values.
Developers implement this component using the Global Component SDK. For more information, see
the Global Component SDK Guide.

Soundex
The Soundex component recognizes phonetic matches between alphabetic strings. It analyzes the
phonetic components of a word and assigns a value to the string based on the phonetic characteristics
of the initial characters in the string. Because it can identify matches between words based on an analysis of how
the words sound rather than how they are spelled, Soundex allows for spelling errors at the point of data entry.
Use Soundex to generate a phonetic key for grouping similar records before matching. Soundex can be applied
to any free-text field.
For every field analyzed, Soundex generates a code beginning with the first letter in the word and followed by a
series of numbers representing successive consonants. Generally, similar-sounding consonants are assigned the
same code. The Soundex depth, the number of alphanumeric characters returned, is set to 3 by default. This
means the Soundex code consists of the first letter in the string and two numbers representing the next two
distinct-sounding consonants. You can change the Soundex depth.

Configuration
The Soundex configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Soundex component to another.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You can select multiple inputs for each instance in the Components
pane, but all inputs share a common Soundex depth.

Parameters Tab
The Parameters tab allows you to set the number of alphanumeric characters Soundex returns, called the depth.
The default depth is 3, with an alphabetic character representing the first letter in the word, and two numbers
representing the next two distinct-sounding consonants.
Increasing the depth means increasing the number of digits generated to represent additional letters in the
word. The depth setting applies to the highlighted instance in the upper pane.
The following table illustrates different Soundex depth codes:

Surname Value    Soundex Value - Depth 3    Soundex Value - Depth 4
Broderick        B63                        B636
Smith            S53                        S530
Ford             F63                        F630
Burton           B63                        B635



Outputs Tab
This tab lists the names of the data outputs for the highlighted component instance as they appear in other
components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus
from the field.

Deriving Soundex Depth Codes


The Soundex depth code consists of the first letter of the string in a given field, followed by a series of numbers
that represent some or all of the remaining letters in the string. The component skips all vowels and similar
letters:
a, e, i, o, u, h, w, y
It adds numbers for other letters as shown in the following table:

Table 8-1. Soundex Depth Codes

Code    Letters
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

The following general rules apply:


♦ If two or more consecutive letters have the same code number, they are coded together, allowing Soundex to
skip to the next distinct consonant sound. This rule applies in all cases, including the first and second letters
of the word.
For example:
Gutierrez is coded G362: G, 3 = T, 6 = both Rs, 2 = Z
Pfister is coded P236: P, (F skipped for having the same code as P), 2 = S, 3 = T, 6 = R
♦ If there are insufficient letters for the Soundex depth, the remaining numbers in the code appear as zero.
For example, if the depth is set to 5 and the word in question has three letters, Soundex completes the code
with zeros.
♦ Special rules apply when two consonants with the same Soundex code are separated by a letter that Soundex
skips.
If a vowel separates two consonants that have the same Soundex code, the consonant to the right of the
vowel is also coded.
For example:
Tymczak is coded as T522 (T, 5 = M, 2 = C, Z skipped, 2 = K). Because "A" separates Z and K,
the K is coded.
If “H” or “W” separates two consonants that have the same Soundex code, the consonant to the left of the
“H” or “W” is coded and the consonant to the right is ignored.
For example:
Ashcraft is coded A261 (A, 2 = S, C ignored, 6 = R, 1 = F). It is not coded A226.
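The rules above can be expressed as a short Python function. This is a minimal sketch written from the rules and Table 8-1 in this section, not the component's own code. The function name soundex and its depth parameter are illustrative, and the sketch assumes that Y is treated like a vowel and that non-alphabetic characters are ignored.

```python
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word, depth=3):
    """Return the Soundex code of word, truncated or zero-padded to depth."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")  # the first letter's code also counts
    for letter in letters[1:]:
        if len(result) == depth:
            break
        if letter in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel (code "") ends the run
    return result.ljust(depth, "0")


soundex("Broderick")            # -> "B63"
soundex("Smith", depth=4)       # -> "S530"
soundex("Gutierrez", depth=4)   # -> "G362"
soundex("Pfister", depth=4)     # -> "P236"
soundex("Tymczak", depth=4)     # -> "T522"
soundex("Ashcraft", depth=4)    # -> "A261"
```

The worked examples in this section (Gutierrez, Pfister, Tymczak, Ashcraft) use a depth of 4, so the calls above pass depth=4 for them.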

Nysiis
The Nysiis component converts the values of an input field to their phonetic equivalent.

Unlike the Soundex component, Nysiis does not create a code to represent the string. Instead, it reconstitutes
the spelling of the string based on its phonetic characteristics. While Soundex focuses on similarities in spelling
at the start of matched strings, Nysiis looks for overall similarities between strings.
Nysiis uses a phonetic encoding algorithm created for the New York State Identification and Intelligence
System.

Configuration
The Nysiis configuration dialog box consists of the following areas:
♦ Inputs tab
♦ Outputs tab

Inputs Tab
The Inputs tab lists the input columns available to the component. To select an input, check its check box. You
can access a Select All option in the context menu by right-clicking in the dialog box. You can create a single
instance of Nysiis for each component.

Outputs Tab
This tab lists the names of the data outputs as they appear in other components in the plan. Double-click a
name to render it editable. To save your edits, press Enter before removing focus from the field.
The following table shows examples of Name-to-Nysiis value conversions:

Surname Value    Nysiis Value
Adams            Adan
Adames           Adan
Adems            Adan
Barnes           Barn
Barns            Barn
Bearns           Barn



CHAPTER 9

Matching Components
This chapter includes the following topics:
♦ Overview
♦ Identity Match
♦ Similarity
♦ Edit Distance
♦ Jaro Distance
♦ Hamming Distance
♦ Bigram
♦ Mixed Field Matcher
♦ Weight Based Analyzer

Overview
Data Quality provides matching components that are explicitly designed to determine the degrees of similarity
between given data values. Each matching component applies a different algorithm to its data input, and each is
suited to a different type of data quality problem:
♦ Identity Match. Performs matching operations on input data at an identity level.
♦ Similarity. Implements custom plug-ins to calculate the type and degree of similarity between two strings.
♦ Edit Distance. Calculates the edit distance between two strings.
♦ Jaro Distance. Calculates the difference between two strings using a variation of the Jaro-Winkler
algorithm.
♦ Hamming Distance. Calculates the number of positions in which characters differ between two strings.
♦ Bigram. Calculates the occurrence of matching pairs of consecutive characters between two strings.
♦ Mixed Field Matcher. Compares multiple fields between two strings based on selected match calculations.
♦ Weight Based Analyzer. Calculates an aggregate match score based on the output scores from other
matching components using user-defined weights for each score.
Note: Distance components are case-sensitive.

Matching components calculate numerical scores representing the similarity or dissimilarity between pairs of
data values, generating a match score between 0 and 1. The higher the score, the greater the degree of similarity
between the two strings based on the match component criteria.

For information about the formulas used to calculate match scores, see “Matching Formulas” on page 137.
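As a rough illustration of how a distance can be turned into a score between 0 and 1, the following Python sketch implements the textbook Hamming and edit (Levenshtein) distances. It does not reproduce the product's formulas. In particular, the similarity normalization and the treatment of unequal-length strings in hamming_distance are assumptions made for the example.

```python
def hamming_distance(a, b):
    """Count positions at which characters differ.

    Assumption: extra characters in the longer string count as differences.
    """
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]


def similarity(distance, a, b):
    """Assumed normalization to the range 0 to 1; not the product formula."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - distance / longest


edit_distance("Smith", "Smyth")      # -> 1
hamming_distance("Smith", "Smyth")   # -> 1
edit_distance("kitten", "sitting")   # -> 3
edit_distance("smith", "Smith")      # -> 1, the comparison is case-sensitive
```

With this normalization, "Smith" and "Smyth" score approximately 0.8, and identical strings score 1.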

Identity Match
The Identity Match component performs matching operations on input data at an identity level. An
identity is a set of fields providing name and address information for a person or organization. The
component treats one or more input fields as a defined identity and performs matching analysis
between the identities it locates in the input data.
The component analyzes records regardless of the character sets in which they are stored. Use this component to
identify similar or duplicate identities across datasets that may use several different language locales or character
encodings.
Informatica uses population files to describe key-building algorithms, search strategies, and matching schemes
that are customized for specific countries and languages. These customized settings improve match accuracy for
data sourced from those countries and languages.
There are three main steps to configuring the Identity Match component:
♦ Select a population in the upper menu in the configuration dialog box.
♦ Select the type of identity to analyze in the lower menu of this dialog box. Table 9-1 lists the type of identity
you can analyze. The fields available will depend on the population selected.
♦ Select the data fields you want to analyze and apply them to the template fields for your chosen identity
type. The fields available will depend on the population selected.

Configuration
The Identity Match configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options below this pane and on the
Inputs, Parameters, and Outputs tabs.
Below the Components pane are two drop-down menus:
♦ Use the upper drop-down menu to select the population that you will apply to the data. Select the Identity
Match Country option for a single locale or region, or select the Identity Match - Multiple Populations
option.
♦ Use the lower menu to specify the type of identity data that the component will match. For example, the
Contact option relates to the names and addresses of members of organizations. The option you select here
determines the fields that are displayed on the Inputs tab. Each population selected in the upper menu has
its own set of information types. Table 9-1 lists the type of identity you can analyze.

Table 9-1. Identity Type

Options Description

Wide_Contact Matches person name at organization name

Contact Matches person name at organization name and address

Individual Matches person with either name id or birth date

Resident Matches person name at address

Address Matches address

Organization Matches organization name

Division Matches organization name at address

Household Matches family name at address

Person_Name Matches person name

Fields For general use for any one or combination of fields

Corp_Entity Matches company name

Family Matches family name at either address or phone number

Wide_Household Matches family name or phone number at address

Inputs Tab
The Inputs tab allows you to configure the data input fields. The Input Fields Mapping Area contains two
columns:
♦ The left-hand column lists the field names. The names displayed depend on the population selected in the
Components pane. Mandatory input fields are highlighted in the column.
♦ The right-hand column lists the available inputs for the selected input field. Select an option from each
drop-down list to map an available input to the selected input field.
Note: If you have selected the Identity Match - Multiple Populations option in the upper drop-down menu
beneath the Components pane, the Population field name is displayed and highlighted as mandatory in the
left-hand column. Select a population field in the right-hand column.
Note: For all field names (except for the Population field name) you must select values for the field name in
pairs. For example, when using field names PERSON_NAME1 and PERSON_NAME2 you must select
values for both field names in the right-hand column. This enables the component to match input fields
against each other.

Parameters Tab
The Parameters tab contains the following options:
♦ Default Population. Sets the default population if the multiple populations option has been selected in the
Components pane.
When you opt to match data from several populations, the Identity Match component looks to the specified
population first, and then to the other populations configured, when determining what population to apply
to the data.
♦ Match Level. Sets the match level to one of the following:
Typical. Accepts reasonable matches. This is the default selection if no other match level is specified. The
Accept Limit is 89 and the Reject Limit is 70.
Conservative. Accepts only close matches. The Accept Limit is 90 and the Reject Limit is 80.
Loose. Accepts matches with a high degree of variation. The Accept Limit is 75 and the Reject Limit is 50.
♦ Stop on Error. Check this option if you want the plan to stop running when the plan cannot locate up-to-
date population data. When this option is checked, the plan will stop running if it finds that the population
data is absent. When this option is unchecked, the plan will run as normal and write a status code to the
output column.
♦ Advanced Matching. The Overriding Match Control Field allows you to override the population settings by
providing a dialog in which you enter a query. The query syntax specifies the Identity Match options to be
used.
Note: For more information on the query syntax, refer to the Informatica Identity Systems Naming Server
documentation.

Outputs Tab
This tab lists the possible output fields for the data associated with the instance highlighted in the Components
pane. The tab shows two output fields:
♦ Identity Match Score. The score can range between zero (no similarity) and 1 (perfect match) and is correct
to two decimal places.
♦ Identity Match Decision. Accept, Reject, Undecided, or Processed. The decisions returned are based on a
combination of the Match Score and the Match Level specified on the Parameters tab (Typical,
Conservative, or Loose).
Double-click a field name to render it editable. To save your edits, press Enter before removing focus from the
field.
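
The relationship between the match score, the Match Level limits, and the decision can be sketched as follows. This is a minimal illustration, not the component's actual logic. It assumes that the Accept and Reject limits are percentages of the 0 to 1 match score and that a score falling between the two limits is reported as Undecided. The names MATCH_LEVELS and match_decision are hypothetical, and the Processed decision is not modelled.

```python
# Accept and Reject limits for each Match Level, as listed on the Parameters tab.
MATCH_LEVELS = {
    "Typical": (89, 70),
    "Conservative": (90, 80),
    "Loose": (75, 50),
}


def match_decision(score, level="Typical"):
    """Return a decision for a match score between 0 and 1.

    Assumption: the limits are percentages of the score, and scores
    between the two limits are reported as Undecided.
    """
    accept_limit, reject_limit = MATCH_LEVELS[level]
    percent = round(score * 100)
    if percent >= accept_limit:
        return "Accept"
    if percent < reject_limit:
        return "Reject"
    return "Undecided"
```

Under these assumptions, a score of 0.80 is Undecided at the Typical and Conservative levels but is accepted at the Loose level.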

Similarity
Informatica partners use the Similarity component to implement customized similarity plug-ins.
Similarity plug-ins read a pair of input values and compute the type and degree of identity between the
two values, expressing this identity as a numerical value.
Developers implement this component using the Global Component SDK. For more information, see the
Global Component SDK Guide.

Edit Distance
The Edit Distance component derives a match score for two data values by calculating the minimum
“cost” of transforming one string into the other by inserting, deleting, or replacing characters.
The result of this calculation is the edit distance. The higher the edit distance score, the greater the
similarity between the two strings.
This component is ideal for matching fields containing a single word or a short text string such as a name or
short address field. You can use it to compare corresponding fields across two records or to compare different
fields within the same record.
For example, an edit distance calculation is performed on two street names:
College St. Collage St

The component calculates the cost of transforming the “a” in Collage to an “e” and inserting a period after “St.”
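
The street name example can be reproduced with a standard Levenshtein calculation. The sketch below is illustrative only. It assumes that every insertion, deletion, or replacement costs 1 and that the cost is converted to a 0 to 1 score by dividing by the length of the longer string. The formula that Data Quality applies is described in “Matching Formulas” and may differ. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or replacements needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


def edit_match_score(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("Collage St", "College St."))  # two edits
```

The two edits are the replacement of “a” with “e” and the insertion of the period.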

Configuration
The Edit Distance configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Edit Distance component to another.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab
The Parameters tab allows you to set the output score assigned to a matched pair when one or both fields are
empty or contain null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.

Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.

Jaro Distance
Like the Edit Distance component, the Jaro Distance component calculates the general similarity
between two data values. However, the Jaro Distance component reduces the match score when a pair
of values do not share a common prefix.
Like other Data Quality matching components, the higher the match score, the greater the similarity between
the strings.
The component uses a variation of the Jaro-Winkler algorithm. The algorithm penalizes the match if the first
four characters in each string are not identical. The default penalty is 0.2.
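
The following sketch shows one way to combine a Jaro similarity score with the prefix penalty described above. It is an approximation for illustration, not the component's implementation. It computes the textbook Jaro similarity and then subtracts the penalty when the first four characters differ. The function names are hypothetical. The case_sensitive argument mirrors the Case Sensitive check box on the Parameters tab, and flooring the result at 0 is an assumption.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity between two strings, from 0 to 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_match_score(s1, s2, penalty=0.2, case_sensitive=False):
    """Jaro similarity, reduced by penalty if the first four characters differ."""
    if not case_sensitive:
        s1, s2 = s1.lower(), s2.lower()
    score = jaro_similarity(s1, s2)
    if s1[:4] != s2[:4]:
        score = max(0.0, score - penalty)  # assumed floor at 0
    return score
```

For example, Johnson and Johnston share the prefix “John” and keep their full Jaro score, while Martha and Marhta differ at the fourth character and lose 0.2.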

Configuration
The Jaro Distance configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Jaro Distance component to another.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
The Penalty field determines the value subtracted from the match score if the first four characters of both
strings are not identical. The default setting is 0.2.
The Case Sensitive check box, when checked, specifies that the matching calculation will consider the case of
the characters when determining the identity between them. This box is unchecked by default.

Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.

Hamming Distance
The Hamming Distance component derives a match score by calculating the number of positions in
which characters differ for a pair of data strings. Use the Hamming Distance component when the
position of the data characters is a critical factor, as in numeric or code fields such as telephone
numbers, zip codes, dates, and product codes.
By default, the Hamming Distance component reads data from left to right. You can reverse this setting.
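
A position-by-position comparison of this kind can be sketched as follows. The sketch is illustrative and makes two assumptions that the guide does not state: positions that exist in only one string count as differences, and the score is 1 minus the share of differing positions. The function name and the reverse argument, which stands in for the Reverse Hamming option, are hypothetical.

```python
def hamming_match_score(a, b, reverse=False):
    """Score two strings by the share of positions holding the same character.

    Assumptions: positions present in only one string count as differences,
    and the score is 1 - differences / length of the longer string.
    """
    if reverse:
        # Compare from right to left by reversing both strings first.
        a, b = a[::-1], b[::-1]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    differences = sum(
        1 for i in range(longest)
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    )
    return 1.0 - differences / longest
```

Reading from the right can help when values differ only by a leading prefix. Under these assumptions, hamming_match_score("5551234", "15551234") returns 0.25, while the same call with reverse=True returns 0.875.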

Configuration
The Hamming Distance configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Hamming Distance component to another.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.
This tab also displays the Reverse Hamming option. Use this option to configure the Hamming Distance
component to read data from right to left instead of the default, left to right.

Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.

Bigram
The Bigram component matches data based on the occurrence of consecutive characters in both data
strings in a matching pair, looking for pairs of consecutive characters that are common to both strings.
The greater the number of common identical pairs between the strings, the higher the match score.
This component is useful in the comparison of long text strings, such as free format address lines or lines of user
comments.
For example, when the following two names are analyzed by the Bigram component:
Damien Darren

The bigram pairs for the two inputs are as follows:


Da, am, mi, ie, en
Da, ar, rr, re, en
The two strings produce ten bigrams in total. Two bigrams, Da and en, occur in both strings, which gives four
matching bigrams out of ten. Therefore, the Bigram Distance between these strings is 0.4.
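
The arithmetic in the Damien and Darren example can be checked with a short script. This is a sketch that reproduces the worked example by counting shared bigrams twice and dividing by the total number of bigrams. Whether the component treats repeated bigrams or very short strings in the same way is an assumption. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Return the consecutive character pairs in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_match_score(a, b):
    """Twice the number of shared bigrams divided by the total bigram count."""
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0  # assumed score when neither string has a bigram
    # Counter intersection keeps each shared bigram at its lower count.
    shared = sum((Counter(pairs_a) & Counter(pairs_b)).values())
    return 2 * shared / total


print(bigrams("Damien"))                       # ['Da', 'am', 'mi', 'ie', 'en']
print(bigram_match_score("Damien", "Darren"))  # 0.4
```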

Configuration
The Bigram configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.
Cut and Copy options are also available on the context menu. These options allow you to paste instances within
the component and from one Bigram component to another.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab
The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain
null values.
The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null
Match Value setting applies when both fields are null. Possible values range between 0 and 1.

Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.

Mixed Field Matcher


While the distance matching components compare one pair of data values at a time, the Mixed Field
Matcher compares multiple fields in different match calculations.
The Mixed Field Matcher component identifies matches in a dataset where data values of the same or
similar types appear across multiple fields, such as freeform address fields where address elements like the
apartment number, city, or zip code can exist in different fields for different records.
The component provides several mechanisms for fine-tuning the match score computation, so you can give
different priorities to matches or near-matches of different types and levels of approximation.
To configure this component, select two groups of data fields to be matched and identify the matching
algorithm to apply to the data. You can also activate and tune priority levels for incorrect or approximate
matches. However, Informatica recommends using the default settings for these parameters.
Note: Matching operations in this component can incur a significant performance overhead and may take longer
to execute than operations in other matching components.

Configuration
The Mixed Field Matcher configuration dialog box contains the following areas:
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Inputs Tab
The Inputs tab allows you to view available data fields and select the sets of input fields to be compared. To
compare data, assign fields to Input Group A and Input Group B.
Note: Groups A and B must contain the same number of fields.

The Inputs pane lists the data fields available to the component. To add a data field to either input group, right-
click it and select Add to Group A or Add to Group B from the context menu. The data fields you select display
in the input group panes.
To remove a field from either pane, right-click it and select the Remove context menu option.
Use Ctrl-A to select all fields in these panes. Select multiple fields using Shift-click or Ctrl-click.

Parameters Tab
The Parameters tab options allow you to fine-tune the component matching operations. The tab organizes its
parameters in three areas:
♦ General. This area contains the following options:
− Relative Position Factor. When the Mixed Field Matcher compares two fields from different record sets,
the relative position within each record of each field affects the strength of the match. For example, when
the Mixed Field Matcher matches a pair of fields in two records, it considers the match stronger when the
two fields are in the same column. If the same two fields appear in different columns, it considers them
a relatively inferior match.
You can set Relative Position Factor to Off, Low, Medium, or High. Medium is the default.
− Matching Order Factor. This setting is concerned with the relative order of the best matches between the
input record sets. For example, when matching two fields in the record sets representing Firstname and
Surname, the Mixed Field Matcher matches John Smith with Joan Smith better than with Smith Joan even
though the individual fields match with the same score.
You can set Matching Order Factor to Off, Low, Medium, or High. Medium is the default.
− Empty Input Fields Factor. This setting calculates the number of empty fields in a record as a proportion
of the total number of input fields. A high proportion of empty fields lowers the match score for fields in
the record.
You can set Empty Input Fields Factor to Off, Low, Medium, or High. Medium is the default.
− Different Input Sizes Factor. This property compares the numbers of empty or null fields found in a pair
of records. When two records have different numbers of empty or null fields, this difference is
incorporated into the final matching score.
You can set Different Input Sizes Factor to Off, Low, Medium, or High. Medium is the default.
♦ Field Match. This area contains the following options:
− Match Method. This menu identifies the overall key for the matching operations. The default setting is
LCS (Longest Common Subsequence). This setting considers the length of any common character strings
in a pair of input fields and adds a factor based on the longest such string to the final score.
The default setting does not require input from another matching component in the plan. The other
settings in this menu provide for scores from other matching components.
− Single Null Match Value. This setting applies if one of the two compared fields is empty. The default
setting is 0.5.
− Both Null Match Value. This setting applies if both fields are empty. The default setting is 0.5.
♦ Advanced Area. In most situations there is no need to change the advanced settings for this component. For
more information about these settings, consult Informatica Global Customer Support.
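
To illustrate the idea of comparing every field in one group with every field in the other, the sketch below scores field pairs with a longest common subsequence ratio and keeps the best score for each Group A field. It is a simplified illustration under stated assumptions, not the Mixed Field Matcher algorithm. It ignores the Relative Position, Matching Order, Empty Input Fields, and Different Input Sizes factors, and the normalization by the longer field length is assumed. All function names are hypothetical. The null defaults of 0.5 come from the Field Match settings above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def field_score(a, b, single_null=0.5, both_null=0.5):
    """Score one pair of fields, applying the null match values if needed."""
    if not a and not b:
        return both_null
    if not a or not b:
        return single_null
    # Assumed normalization: LCS length over the longer field length.
    return lcs_length(a, b) / max(len(a), len(b))


def best_cross_field_scores(group_a, group_b):
    """For each Group A field, the best score against any Group B field."""
    return [max(field_score(a, b) for b in group_b) for a in group_a]
```

With this approach, the groups ["Apt 4", "Boston"] and ["Boston", "Apt 4"] score 1.0 for both fields even though the values sit in different columns.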

Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.

Weight Based Analyzer


The Weight Based Analyzer takes the results from two or more matching operations and calculates a
single match score. The component accepts data from any matching component and allows you to
assign weights to their match scores so the overall score for a field pair can reflect the priorities of the
data.
You can define more than one instance in the Weight Based Analyzer. This allows you to configure each
instance with different combinations of input fields and different weights as required.
You can use the Weight Based Analyzer to calculate overall matching scores for the plan. For effective matching,
assign higher weightings to the more important fields.

Configuration
The Weight Based Analyzer configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.
You can add multiple instances of the component. To add an instance, right-click in this pane and select Add
from the context menu. You can remove an instance by selecting Delete from the context menu.

Inputs Tab
The Inputs tab lists the data fields available to the highlighted component instance. Select a field by
highlighting it and clicking its check box. For each instance in the Components pane, you must select the
output fields of at least two matching components on this tab.

Parameters Tab
This tab displays the matching components selected on the Inputs tab. Each matching component has a text
field in which you can edit the weight defined for it. The higher the value in a text field, the greater the
contribution of that matching component to the overall match score.
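
The effect of the weights can be illustrated with a weighted average. The guide does not state the exact formula at this point, so the sketch below assumes that the overall score is the sum of each input score multiplied by its weight, divided by the sum of the weights. The function name is hypothetical.

```python
def weighted_match_score(scores, weights):
    """Combine match scores into one score.

    Assumption: the overall score is the weighted average of the inputs,
    so the result stays between 0 and 1.
    """
    if len(scores) < 2 or len(scores) != len(weights):
        raise ValueError("supply at least two scores and one weight per score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight
```

For example, a surname score of 0.9 with a weight of 3 and a city score of 0.5 with a weight of 1 combine to an overall score of about 0.8, which is closer to the surname score because of its higher weight.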

Outputs Tab
This tab lists the names of the configured output as they appear in other components in the plan. Double-click
a name to render it editable. To save your edits, press Enter before removing focus from the field.

CHAPTER 10

Address Validation Components


This chapter includes the following topics:
♦ Overview
♦ Global AV

Overview
Data Quality installs with address validation engines that process address data within a plan while the Data
Quality engine processes other aspects of the plan. It also accepts address validation engines developed as plug-
ins in accordance with the requirements of Data Quality Global Component SDK. Data Quality installs a
single address validation component to handle these validation engines, called the Global AV. It also supports
plans that contain deprecated address validation components from earlier versions of Data Quality.
Note: The Global AV matches input address data against reference datasets of postal addresses. Before you can
use the Global AV, you must install reference data for the countries you are interested in. Data Quality does not
install these datasets by default. You can purchase reference datasets for the default-installed validation engines
from Informatica.
The Global AV and the installed validation engines deliver the following functionality:
♦ They validate the accuracy and deliverability of addresses according to the best reference data available for
the country in question. Some countries provide complete address information, down to premise level, and
can also enrich the address with new information, for example providing a nine-digit zip code in place of a
five-digit zip code. Other countries provide last-line address information only, that is, information on city,
province, or post code (information commonly found on the “last line” on the envelope).
♦ Where possible, they correct errors in addresses and complete partial address records. An address engine may
find a match for an input address in its reference dataset that is more complete or formally correct than the
input address. The component can return the reference address as an enhanced version of the input address.
♦ They add postally-relevant information to the address that may not appear in the data source or “on the
envelope.” For example, they can report on whether an address is a physical address or a commercial
mailbox location.
♦ They provide detailed status reports on the validity of each input address, describing its deliverable status
and the nature of any errors or ambiguities it contains.
♦ In addition to returning individual fields that contain postal address and other value-added information,
they can provide output addresses in an envelope-ready format.
The Global AV provides the user interface to all address validation engines, including engines that users add to
Data Quality through the Global Component SDK. Data Quality no longer installs a separate operational
component for each installed address validation engine.

This installation of Data Quality supports plans that contain address validation components installed with
earlier product versions. The supported components are the Address Validator, the International AV, and the
North America AV. You cannot create new instances of these components.

Installing Validation Components and Reference Data


Data Quality installs three address validation engines: Melissa Data, QAS, and Address Doctor. The Data
Quality Content Installer installs the reference datasets for these engines. You purchase and download address
reference datasets on a country-by-country basis from Informatica.
You can also use the Content Installer to install updates to these validation engines. For more information,
consult the Informatica Data Quality Installation Guide.
Note: Data Quality also permits approved third parties to add address validation engines to the Data Quality
system. These engines and their functionality must meet the requirements of the Data Quality Global
Component SDK. The Global AV component acts as a shell for all address validation engines.

Global AV
The Global AV component provides access to address data functionality and processing capabilities in
Data Quality. It provides a means of validating addresses from anywhere in the world through a single
component.
The Global AV compares your input data records to reference databases of postally valid address information to
quantify, verify, and enhance the quality and deliverability of your address records. It provides access to all
address validation engines installed with or linked to Data Quality.

Configuration
The Global AV configuration dialog box contains the following areas:
♦ Components pane
♦ Inputs tab
♦ Parameters tab
♦ Outputs tab

Components Pane
The Components pane shows the instances of the component available to the plan. When first opened, this
pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog
box.

Inputs Tab
The Inputs tab lists all available data columns. Select a column to add it to the instance highlighted in the
Components pane.
You can select multiple address columns for the component instance. In general, the more columns you provide,
the greater the opportunity for the Global AV to locate the correct address in its reference data. However,
incorrect input data does not enhance the matching operation.

Parameters Tab
The Parameters tab options allow you to perform the following operations:
♦ Set the principal country database to use when validating input data.
♦ Verify or change the address structure for the input address strings.
♦ Add CASS/DPV or Geocoding information to the outputs (country data permitting).
The options displayed on this tab change according to the country database option (single or multiple) that you
select:
♦ Validating data from one country. Check the Select Single Country option, and then select the required
country from the Select Country menu.
♦ Validating data from several countries. Check the Select Multiple Countries option, and then select a
country from the Select Default Country menu.
The process for validating addresses from several countries works as follows:
− The Global AV first looks for a populated country code field in the address. This must be a three-letter
ISO country code.
− If it finds a country code for a country on its menu, the component sends the address to the database for
that country.
− If it does not find a country code for a country on its menu, the component sends the address to the
default country database.
− You do not have to set a country for the Select Multiple Countries option. If you select the NONE option
in the country database menu, the component will search all input addresses for a country code and
attempt to validate the addresses accordingly. If it does not find a country code for the address, the
component will not perform a validation check for that address.
Note: When you opt to validate data from several countries, the Global AV looks to the country code first,
and then to the database selected, when determining what country database to apply to the data.
Note: Do not select a Single Country database in the Global AV unless the input data relates exclusively to
the country you specify.
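
One reading of the multiple-country process above can be sketched as a small routing function. This is an illustration only. The function name and arguments are hypothetical, and treating an unrecognized country code in the same way as a missing code is an assumption.

```python
def select_country_database(country_code, available, default=None):
    """Return the country database to use, or None to skip validation.

    country_code: value of the mapped Country field (may be empty).
    available: set of three-letter ISO codes with installed reference data.
    default: the Select Default Country choice, or None for the NONE option.
    """
    code = (country_code or "").strip().upper()
    if code in available:
        return code
    # Assumption: an unrecognized code is handled like a missing code.
    return default
```

With reference data for USA and GBR and a default of USA, an address coded GBR goes to the GBR database and an address with no country code goes to USA. With the NONE option, an address with no country code is not validated.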
The Services Required area contains options relating to the enrichment of address information with Geocoding
and DPV information and to the handling of the plan in cases of critical reference data errors.
♦ Geocoding. Check this option to return latitude and longitude coordinates for each input address. This
option is available for the United States, United Kingdom, and Australia. This option is also available when
you choose the Select Multiple Countries option, but it only returns data from country databases containing
Geocoding data.
♦ CASS/DPV (Delivery Point Values). Check this option to return a two-digit Delivery Point Value for the
address. This option is available for the United States only. This option is also available when you choose the
Select Multiple Countries option, but it only returns data from a country database containing DPV data.
A delivery point value is a two-digit code that can uniquely represent, along with the nine-digit zip code, any
mailbox address. The full delivery code, including the zip and DPV information, is typically added to sorted
mail as a bar code. CASS (Coding Accuracy Support System) is a United States Postal Service system for
certifying the accuracy of address validation software.
♦ Stop on Error. Check this option if you want the plan to cease execution if the plan cannot locate up-to-date
country reference data. When this option is checked, the plan will stop running if it finds that the reference
data is absent, or expired, or lacks a current license. When this option is unchecked, the plan will run as
normal and write a status code to the output columns.
The Input Fields Mapping area contains a Parameters column and an Input Fields column.
♦ The Parameters column lists the field names selected on the Inputs tab. The component will validate the
address fields in the order in which they appear in this column.
♦ The Input Fields column contains a set of menus for every field in the Parameters column. Each menu
contains an address element.
Use these menus to build the address that the component will send to the validation engines. Map each field
name you require from the Parameters column to a unique field name under Input Fields.
Note: You must map an input field to the Addressline1 parameter. You must also map an input field to the
Country parameter if you choose the Select Multiple Countries option.

Outputs Tab
This tab lists the possible output fields for the data associated with the instance highlighted in the Components
pane. The tab shows all the following options:
♦ All address field options associated with the country database selected on the Parameters tab.
♦ Formatted address fields that provide envelope-ready address lines in the manner expected by the postal
carrier of the country in question.
♦ Options providing postally-relevant information in areas such as CASS/DPV certification and Geocoding.
The CASS/DPV options are enabled if a current set of United States reference data is installed on your
system. Geocoding options are enabled if current reference data for the United States, United Kingdom, or
Australia is installed.
Check the fields you want to use as outputs from the component.
The two outputs at the top of this pane provide information on the quality of the match found between the
input address and the reference data. These outputs do not provide address data. You cannot clear these options:
♦ Match Status. Describes the type of match found for each input address.
♦ Match Code. Describes the success of the match found for each address.
For more information about the meanings of these variables, see “Global AV: Output Field Descriptions” on
page 131.

Understanding Match Status and Match Code Outputs


The Global AV provides access to the processing capabilities of the address validation engines installed with
Data Quality and also to any third-party address validation engines installed as plug-ins. The output fields for
the Global AV are based on the output fields of these engines. The output fields for the address validation
engines that are installed with the product are described here.
The Global AV reads its Match Status and Match Code values directly from the underlying component engines.
Table 10-1 lists the code values returned for the engines installed by default with Data Quality. The engine
names in these tables correspond to the names of the

Table 10-1. Match Code Comparison Across All Validation Components

Global AV     Address Validator   International AV   North America AV

Match Code    Match Code          Match Type         Status Code (if successful match) or Error Code, Error String

Table 10-2 lists the status values returned for each engine.

Table 10-2. Match Status Comparison Across All Validation Components

Global AV                        Address Validator   International AV           North America AV

Validated                        Verified Correct    Validated                  V
Unmatched                        Unmatched           Poor/Fair deliverability   X and S
Validated                        Verified Correct    Validated                  6
Multiple Matches                 Multiple Matches    -                          7
Validated                        Verified Correct    Validated                  9
Good Match                       Good Match          -                          -
Partial Match                    Partial Match       -                          -
Tentative Match                  Tentative Match     -                          -
Foreign Address                  Foreign Address     -                          -
Poor Match                       Poor Match          -                          -
Corrected                        -                   Corrected                  -
Good Deliverability              -                   Good Deliverability        -
Not Processed                    -                   Not Processed              -
Engine not installed             -                   -                          -
Engine not licensed              -                   -                          -
Reference Data Missing           -                   -                          -
Reference Data Not Licensed      -                   -                          -
Reference Data Expired           -                   -                          E
Reference Data License Expired   -                   -                          -
Unsupported Country              -                   -                          -
Incorrect Postal Code            -                   -                          F

Use these tables to compare the values across components. These codes are also listed in appendixes for the four
validation components.
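As an illustration of how the comparison in Table 10-2 might be used downstream, the following sketch translates North America AV status codes into the corresponding Global AV status text. The lookup values are taken from the table; the names `NORTH_AMERICA_TO_GLOBAL` and `global_status` are hypothetical and are not part of the product.

```python
# North America AV code -> Global AV match status, as listed in Table 10-2.
NORTH_AMERICA_TO_GLOBAL = {
    "V": "Validated",
    "6": "Validated",
    "9": "Validated",
    "X": "Unmatched",
    "S": "Unmatched",
    "7": "Multiple Matches",
    "E": "Reference Data Expired",
    "F": "Incorrect Postal Code",
}


def global_status(na_code):
    """Return the Global AV status for a North America AV code, or None."""
    return NORTH_AMERICA_TO_GLOBAL.get(str(na_code).strip().upper())


print(global_status("x"))  # Unmatched
```

Codes that do not appear in the table return None, so a calling process can decide how to treat them.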

Formatted Address Outputs


In addition to analyzing and enhancing input address elements, the Global AV can assemble validated address
outputs in a standardized envelope-ready format. The component uses the validated input data to build the
formatted addresses, eliminating the need to manually parse address values from multiple fields into
standardized formats. The Global AV engines create formatted addresses on a record-by-record basis, so that
each address is created in the envelope format expected by the postal carrier in its country.
Because standard address formats differ from country to country, the formatted address lines are named
generically in the Global AV. The component provides ten lines for formatted addresses. Select as many lines as
your address may need. The address validation engines ignore any address lines that are unused.
The outputs are named as follows:
Formatted_Address_Line_1, Formatted_Address_Line_2... Formatted_Address_Line_10

Example: United States


The Global AV uses up to four lines to create an address in the standard USPS format. Table 10-3 shows how
the Global AV builds the formatted address:

Table 10-3. Standard United States Business Address Format

Global AV Output           Description

Formatted_Address_Line_1   Company or organization name
Formatted_Address_Line_2   Urbanization (where applicable, for example in Puerto Rican addresses)
Formatted_Address_Line_3   Street address, including Suite/Suite Range fields
Formatted_Address_Line_4   City, State, Zip code

The address format shown in Table 10-3 is a business address. It does not include personal name information.
You can select this information separately when configuring the plan outputs.
Note: You cannot change the output values that the component writes to the formatted address fields. The
selections are determined in the underlying validation engines.
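A downstream process that receives the ten generic formatted outputs typically needs to skip the unused lines. The sketch below assumes each output record is available as a Python dictionary keyed by the output names listed above; the function name `envelope_block` and the sample address are invented for the example.

```python
def envelope_block(record, max_lines=10):
    """Join the populated Formatted_Address_Line_N values, in order."""
    lines = []
    for n in range(1, max_lines + 1):
        value = (record.get(f"Formatted_Address_Line_{n}") or "").strip()
        if value:  # unused lines are skipped
            lines.append(value)
    return "\n".join(lines)


# Invented sample record: line 2 (urbanization) is unused.
sample = {
    "Formatted_Address_Line_1": "Example Company",
    "Formatted_Address_Line_2": "",
    "Formatted_Address_Line_3": "100 Main St Ste 200",
    "Formatted_Address_Line_4": "Springfield, IL 62701",
}
print(envelope_block(sample))
```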

Writing Formatted Addresses To Target Components
Formatted addresses answer a particular business need. If you do not need envelope-ready address information,
you need not select the formatted address options in the Global AV or in your plan target components. If you
select these options, you must have a strategy for using the information when it leaves the data quality plan. You
should consider the structure of the file or database table that will contain the formatted addresses.
When defining or editing a plan to create formatted addresses, consider the following strategies:
♦ Add an additional target to an address validation plan, and select only the formatted address outputs in that
target.
♦ Create a copy of an address validation plan and replace the target components with new targets that use the
formatted outputs only.

Address Formatting And Invalid Addresses


When your dataset contains only validated addresses, you can follow the strategies above with no difficulty.
When your dataset produces mixed validation results, you must decide how to handle the addresses that Data
Quality identifies as invalid or partially valid.
How the Global AV formats a poor-quality address depends on the engine that processes reference data for that
address. Informatica provides reference data on a country-by-country basis. For example:
♦ If the Global AV cannot validate an input address from the United States, the Global AV does not write
any values to formatted address fields. The Global AV calls the Melissa Data processing engine to process
United States address data.
♦ If the Global AV cannot validate an input address from France, it writes the original input values to the
formatted address fields. The Global AV calls the QAS processing engine to process French address data.
You must test your plan output to verify that you receive the formatting results you expect. If your plan writes
both valid and invalid addresses to the formatted address fields, you can use a Rule Based Analyzer to create new
outputs from formatted addresses where the address records meet one or more validation status criteria.
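The same kind of status-based selection can be sketched outside the plan. The example below is not a Rule Based Analyzer rule; it is a minimal Python illustration that assumes each output record is a dictionary with a hypothetical `Match_Status` key holding one of the Global AV status values from Table 10-2.

```python
def split_by_status(records, accepted=frozenset({"Validated"})):
    """Separate records whose match status is accepted from the rest."""
    good, review = [], []
    for record in records:
        if record.get("Match_Status") in accepted:
            good.append(record)
        else:
            review.append(record)
    return good, review


# Invented sample records.
records = [
    {"id": 1, "Match_Status": "Validated"},
    {"id": 2, "Match_Status": "Unmatched"},
    {"id": 3, "Match_Status": "Validated"},
]
good, review = split_by_status(records)
print(len(good), len(review))
```

Which statuses count as acceptable is a business decision; pass a different `accepted` set to widen or narrow the selection.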

Reference Data Engines And Supported Countries


Use Table 10-4 to determine how the Global AV handles non-validated addresses from different countries when
populating the formatted address fields:

Table 10-4. Address Formatting By Country (Invalid Data)

Country          Processing Engine   Formatted Address Handling When Data Is Invalid

Brazil           Address Doctor      Global AV writes the best available values to the formatted address fields.
Argentina        Address Doctor      Global AV writes the best available values to the formatted address fields.
Australia        QAS                 Global AV writes original input values to formatted address fields.
Canada           Melissa Data        Global AV does not write data to formatted address fields.
Czech Republic   Address Doctor      Global AV writes the best available values to the formatted address fields.
Denmark          QAS                 Global AV writes original input values to formatted address fields.
France           QAS                 Global AV writes original input values to formatted address fields.
India            Address Doctor      Global AV writes the best available values to the formatted address fields.
Luxembourg       QAS                 Global AV writes original input values to formatted address fields.
Mexico           Address Doctor      Global AV writes the best available values to the formatted address fields.
Netherlands      QAS                 Global AV writes original input values to formatted address fields.
Poland           Address Doctor      Global AV writes the best available values to the formatted address fields.
Russia           Address Doctor      Global AV writes the best available values to the formatted address fields.
Singapore        QAS                 Global AV writes original input values to formatted address fields.
South Africa     Address Doctor      Global AV writes the best available values to the formatted address fields.
Turkey           Address Doctor      Global AV writes the best available values to the formatted address fields.
United Kingdom   QAS                 Global AV writes original input values to formatted address fields.
United States    Melissa Data        Global AV does not write data to formatted address fields.
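When testing plan output, it can help to hold the expectations from Table 10-4 in a lookup. The sketch below copies the country, engine, and behavior values from the table; the names `COUNTRY_ENGINE`, `ENGINE_BEHAVIOR`, and `invalid_address_handling` are hypothetical.

```python
# Behavior of each engine when an address cannot be validated (Table 10-4).
ENGINE_BEHAVIOR = {
    "Address Doctor": "writes the best available values",
    "QAS": "writes original input values",
    "Melissa Data": "does not write data",
}

# Engine that processes each country's reference data (Table 10-4).
COUNTRY_ENGINE = {
    "Brazil": "Address Doctor", "Argentina": "Address Doctor",
    "Australia": "QAS", "Canada": "Melissa Data",
    "Czech Republic": "Address Doctor", "Denmark": "QAS",
    "France": "QAS", "India": "Address Doctor",
    "Luxembourg": "QAS", "Mexico": "Address Doctor",
    "Netherlands": "QAS", "Poland": "Address Doctor",
    "Russia": "Address Doctor", "Singapore": "QAS",
    "South Africa": "Address Doctor", "Turkey": "Address Doctor",
    "United Kingdom": "QAS", "United States": "Melissa Data",
}


def invalid_address_handling(country):
    """Return (engine, behavior) for a country listed in Table 10-4."""
    engine = COUNTRY_ENGINE[country]
    return engine, ENGINE_BEHAVIOR[engine]


print(invalid_address_handling("France"))
```

A country that is not in the table raises KeyError, which signals that no expectation has been recorded for it.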

Enhancing Address Validation Engine Performance


You can edit the configuration files associated with the Melissa Data and Address Doctor engines to improve
data processing speed and to log messages warning of data expiry. For more information, see the Data Quality
Installation Guide.

CHAPTER 11

Dictionary Management
This chapter includes the following topics:
♦ Overview, 103
♦ Dictionary Manager, 104
♦ Updating Dictionary Files, 104
♦ Creating a Dictionary, 106

Overview
Informatica Data Quality plans can use the following types of reference data:
♦ Dictionary files. Plain-text files provided by Informatica and saved in the DIC file format. These files are
usable in many Workbench components and are installed by the Content Installer.
♦ Database dictionaries. User-created reference datasets stored in database tables. These tables can be updated
dynamically when the underlying data is updated. Informatica does not provide these dictionaries.
Database dictionaries are a convenient way to use data that has been created for other purposes. By making
use of a dynamic connection, data quality plans can always point to the current version of a database
dictionary.
♦ Third-party reference data. File-based and database reference datasets originating from third party sources
and offered by Data Quality as additional product options. Required for address validation components.
The Content Installer installs these datasets.
This chapter describes the DIC files provided by Informatica and the process to create a dictionary. For more
information about third-party reference data, contact Informatica Global Support.

Dictionary Files
Dictionary files provide an authoritative reference source for many areas in which common terminology is used,
including postal address terms, city names, units of measurement, personal salutations, telephone area codes,
and company names. Many Data Quality components provide options for comparing or updating input data
against dictionary data. These dictionaries are editable, and you can also define your own dictionaries.
A dictionary file is essentially a text file saved in a proprietary (.DIC) format. Each file contains one or more
label entries with one or more item entries for each label. The label represents the correct or standard form of a
word or term. The item values for each label represent a range of variant or alternative spellings. Any operation
that updates your dataset from a dictionary does so by locating an item entry and returning its corresponding
label.
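The item-to-label lookup described above can be pictured with a short sketch. It does not read the DIC format; it assumes the dictionary rows are already available as tuples of (label, item1, item2, ...). The sample rows, the function names, and the case-insensitive comparison are assumptions made for the illustration.

```python
def build_lookup(rows):
    """Map every item spelling to its label (case-insensitive by assumption)."""
    lookup = {}
    for label, *items in rows:
        for item in items:
            lookup[item.lower()] = label
    return lookup


def standardize(value, lookup):
    """Return the label for a matching item, or the value unchanged."""
    return lookup.get(value.strip().lower(), value)


# Invented sample rows: label first, then its item spellings.
rows = [
    ("Street", "Street", "St", "Str"),
    ("Avenue", "Avenue", "Ave", "Av"),
]
lookup = build_lookup(rows)
print(standardize("ave", lookup))  # Avenue
```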

Data Quality reads dictionary files from the Dictionaries folder created at install time. The Data Quality
installer does not add dictionaries to this folder. Dictionaries are added by the Content Installer.
When you run a local plan, Data Quality Workbench looks for any dictionaries cited in the plan in the
Dictionaries folder of your Workbench installation. When you run a plan across the service domain, Data
Quality Server looks in the local Dictionaries folder and also in your Dictionaries folder on the service
domain. For more information, see “Dictionary Files” on page 7.
Note: The dictionary folders read by Data Quality are set during product installation. Their locations can be
changed later if necessary. For information on changing these locations, contact Informatica Global Customer
Support.

Dictionary Manager
The Dictionary Manager is an applet within Workbench that allows you to view and manage the contents of the
local Dictionaries folder. To open the Dictionary Manager in Workbench, press F8.
When you use the Dictionary Manager for the first time following the Content Install, it appears populated
with multiple folders. Figure 11-1 displays the Dictionary Manager window:

Figure 11-1. Dictionary Manager

Note: The Content Installer overwrites any files with the same names that it finds in the Dictionaries folders. If
you have created, renamed, or moved any dictionaries since install and wish to rerun the Content Installer, back
up these files first.

Updating Dictionary Files


A dictionary file is organized as a table with a column of definitive spellings for the terms in the dictionary and
one or more columns for matching or acceptable variant spellings. Each dictionary term has entries in at least
two fields:
♦ Label field. Represents the spelling that will be written back to the plan.



♦ Item fields. Represent the forms of spelling that are recognized as a match for the Label in the input data.
The first item field always contains the same spelling as the Label field, that is, it matches the formally
correct or approved spelling of the term.
You can create or update a dictionary in the following ways:
♦ Add or delete an item. Add or delete variant spellings for an existing dictionary term.
♦ Add or delete a label and its related items. Add or delete a definition from the dictionary.
♦ Create a new dictionary file. See page 106.
Before deleting data from a dictionary, be sure that doing so is appropriate for all plans that reference the
dictionary.
Note: You should back up or rename any dictionary you edit. If you rename a dictionary that is used by a plan,
you must edit the plan components to recognize the new dictionary name. If you edit a dictionary but do not
change its name, you do not need to update the plan configuration.

Adding New Items


You can add new spellings to existing definitions. For example, the Numeric Patterns dictionary contains
character patterns for many types of personal data, such as Social Security numbers, telephone numbers, and zip
codes. You can add a variant pattern for one of these data types.
In Figure 11-2, a pattern for a U.S. area code and telephone number has been added to the Item4 field. This
pattern divides the numbers with blank spaces, indicated by an underscore:

Figure 11-2. Numeric Patterns Dictionary

To add new spellings to a term in the dictionary:

1. Open the dictionary in the Dictionary Manager and locate the row containing the term.
2. Type the new spelling in the first empty cell on the row.

Adding New Labels


You can add new terms to a dictionary and define the related spellings. Dictionary labels do not need to be in
alphabetical order.
The decision to add terms to a dictionary depends on the purposes of the plans that will use it. You might not
want to recognize all possible variations in a data value.

To add a new term to a dictionary:

1. Open the dictionary and type the formal spelling in the first empty Label field and the Item1 field. These
two fields must be identical. You might need to scroll the dictionary contents to reach an empty row.
2. In the adjacent Item fields, type any variant spellings you want to include in the dictionary. Start in the
Item2 column.
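The rule that the Label and Item1 fields must be identical can be expressed in a few lines. The helper below is hypothetical and only models a dictionary row as a Python list; it does not write a DIC file.

```python
def new_entry(label, variants=()):
    """Build a row: Label, Item1 (same as Label), then variants from Item2 on."""
    row = [label, label]
    for variant in variants:
        if variant not in row:  # skip the label itself and repeated variants
            row.append(variant)
    return row


print(new_entry("Avenue", ["Ave", "Av", "Ave"]))
```

Dropping repeated variants is a convenience of the sketch, not a documented requirement of the Dictionary Manager.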



Creating a Dictionary
You can create text dictionaries or database dictionaries.

To create a text dictionary:

1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.
2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Text.
An empty dictionary worksheet displays.
3. Type or copy a list of values into the Label and Item columns of the dictionary.
4. Close the dictionary and click Yes to save the dictionary.
The dictionary appears in the folder with the name New Dictionary.
5. To rename the dictionary, right-click the dictionary name and select Rename.
6. Type a new name for the dictionary.
The newly-created dictionary can be viewed in the Dictionary Manager and can be found in the Dictionaries
folder of your Data Quality installation.
Note: You can add a correctly-formatted text file with the extension DIC to folders in the Dictionaries folder
structure. The file will be visible in the Dictionary Manager.

To create a database dictionary:

1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.
2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Database.
The Select Two Columns for Dictionary dialog box opens.
3. Complete the enabled fields under the Connect To Database tab.
Fields differ based on the database type you select.
The default database setting is Staging. It refers to the local database used by Data Quality. You can select
any valid connection.
♦ When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must
provide a DSN (Data Source Name) for the database. You might be prompted to provide a valid login.
The DSN field identifies the database on the network.
♦ When you connect to an Oracle database, you must provide the SID (System Identifier) for the Oracle
instance.
♦ You might be prompted for login information if you select a non-default database type.
♦ You can identify the character encoding associated with the data in the dictionary. For more
information, see “Character Encodings and Unicode” on page 143.
4. Click Connect.
The During tab displays.
5. Under this tab, select the two columns to use for the Label and Item1 values in the dictionary, and click
OK.

Creating Dictionary Files with the Report Viewer


The Data Quality Report Viewer allows you to create dictionary files from the output of a data quality plan.
To create or append to a dictionary file using the Report Viewer, your plan should write its output to a Report
Target. A Report Target creates output files in a proprietary SSR file format that allows plan data to display
graphically and in Data Quality dashboards.



The Report Target accepts data only from a frequency component, such as a Count component. The Count
component counts the occurrences of data values in a selected column. You can drill-down into the summary
calculations for each column in the Report Viewer to locate the raw data for a dictionary file. When you
drill-down into data, you can select a data column and add it to an existing dictionary or create a new dictionary.
For more information about the Report Target, see page 29. For more information about the Report Viewer, see
page 109.

To create or append to a dictionary file using the Report Viewer:

1. Open the Report Viewer. Open the SSR file that references the plan data to be added to the dictionary.
You can open an SSR file in two ways:
♦ In Workbench, run a Data Quality plan with a Report Target, ensuring that the Report Target has been
configured to launch the Report Viewer on plan execution.
♦ In the Report Viewer, click File > Open and browse to the SSR file for the report in question.
2. With the report open in standard view, right-click the row for the relevant data instance and select Open.
A spreadsheet opens, showing all data rows for the instance you have selected.
3. If you want to save the full contents of a column to a dictionary file, right-click in the column and click
Edit > Select Column.
The entire column is highlighted.
-or-
If you want to save a selection from a column to a dictionary file, Shift-click to select the required values.
4. Right-click the highlighted values and select Export To > Dictionary File.
The Select Dictionary Name dialog box opens.
5. Browse to a location in the Informatica Data Quality Dictionaries folder structure.
6. If you want to create a new dictionary, type a new dictionary name.
-or-
If you want to append to or replace a dictionary, select a dictionary name.
You will be prompted to append to or overwrite the current data for the dictionary.
7. Click OK.



Figure 11-3 illustrates how you can drill-down through report data, right-click on a column, and save the
column data as a dictionary file. This file becomes populated with Label and Item1 entries corresponding to the
column data:

Figure 11-3. Creating a Dictionary File with the Report Viewer

In this case, the dictionary will contain a list of serial numbers from customer records that include invalid zip
codes. You can now create plans to check customer databases against these serial numbers.



CHAPTER 12

Report Viewer
This chapter includes the following topics:
♦ Overview, 109
♦ Viewing Data in the Report Viewer, 109
♦ Standard View and Dashboard View, 111
♦ Viewing Plan Data, 114
♦ Report Viewer Parameters and Settings, 115
♦ Tracking Changes in Data Quality, 116
♦ Importing Report Files and Working with Groups, 117

Overview
This chapter describes the Data Quality Workbench Report Viewer. The Report Viewer allows you to perform
the following tasks:
♦ Display plan results in both graphical and numerical formats in a dedicated viewing application.
♦ View drill-down analysis of the raw data underlying the plan results.
♦ Create data quality dashboards that can be exported in spreadsheet and HTML form for business users and
other interested parties.
♦ Save key subsets of plan data to file for use as reference dictionaries.
The Report Viewer is particularly suited to displaying data quality dashboards, which explore the quality of
a dataset according to criteria set by the business.
You can use the Report Viewer to view the SSR report files that are created by plans containing a Report Target.

Viewing Data in the Report Viewer


You can open and read data in the Report Viewer.

Opening the Report Viewer


The Report Viewer can be activated in three ways:
♦ Configure the Report Target to generate a report in Standard/SSR report format, check Launch Report on
Completion, and then execute the plan.
♦ Open the Report Viewer from the Data Quality Workbench program group via the Windows Start menu.
You can use the Report Viewer’s File menu to open a report file.
♦ Click the Report Viewer toolbar button in the Data Quality Workbench user interface.

Reading Report Data


The Report Viewer can display data for all items selected in the frequency components of the plan. Data items
typically have many kinds of data associated with them.
When you select a data item in the Count component, you add the number of times each value occurs to the
report.

Figure 12-1. Report Viewer, Standard View

For example, a plan might contain a business rule defined in a Rule Based Analyzer that tests the accuracy of the
currency type associated with data records. In this case, the Rule Based Analyzer creates a new data column
whose fields may read Valid Currency or Invalid Currency.
The Report Viewer might also show the number of empty fields and values excluded from calculations
depending on the parameters of the preceding operational component, such as the number of values classified as
Others by the Count component. For this reason, it is important to understand how frequency components are
configured. A large number of Others values can indicate that the Count component needs to be reconfigured.
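To make the idea of value counts, empty fields, and an Others bucket concrete, here is a rough Python sketch. It is not the Count component's algorithm: the rule that everything outside the most frequent `top_n` values falls into Others is an assumption made for the example, and the sample data is invented.

```python
from collections import Counter


def count_values(values, top_n=3):
    """Count occurrences, reporting empty fields and an Others remainder."""
    populated = [v for v in values if v not in ("", None)]
    counts = Counter(populated)
    top = counts.most_common(top_n)
    # Assumption: values outside the top_n most frequent are grouped as Others.
    others = len(populated) - sum(count for _, count in top)
    return {"top": top, "others": others, "empty": len(values) - len(populated)}


sample = ["USD", "USD", "EUR", "GBP", "", "JPY"]
print(count_values(sample, top_n=2))
```

In this sketch a large `others` figure relative to the total would be the cue to raise `top_n`, which parallels reconfiguring the frequency component.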

Types of Graph
In standard mode, you can choose from two graphing options for a data item from the View menu:
♦ Pie Chart
♦ Bar Chart
Beneath each chart type, the data for the item is tabulated. The No Graph option omits both chart types.
When you open the Report Viewer, the right pane displays data for one item at a time. You can select an All
Reports option through the View menu that displays all items in scrollable form in the right pane.
The View menu also lets you set the orientation of the bars in the chart to horizontal or vertical. The legend for
the charted item appears below the chart, providing precise metrics for the quantity and percentage of the
charted data.



Standard View and Dashboard View
You can view data in the Report Viewer in two modes:
♦ Standard view
♦ Dashboard view

Standard View
When first opened, the Report Viewer opens in Standard view, presenting its information in two panes. The left
pane lists the source fields selected in the frequency components in the plan. The right pane displays the
following information:
♦ A bar chart or pie chart for each item in the left pane.
♦ The numbers of records that satisfy or do not satisfy the quality criterion for each item and the percentage of
data in the item that each number represents.
Any changes you make to the view settings for the report are stored to a master settings file for the Report
Viewer. For example, if you leave the standard mode by selecting Dashboard view, the report data displays in
dashboard mode the next time the SSR file is opened.

Dashboard View
Dashboards illustrate the ongoing progress of the dataset towards data quality business targets. When you
activate the dashboard, the standard view is collapsed, and the items are presented in a series of bar charts that
can be arranged in data quality categories.
Dashboards can display the following information:
♦ The percentage of records that satisfy the data quality criterion underlying each item.
♦ The data quality target set by the business for each item.
♦ Horizontal bars charting the percentage of good quality records in each item with each bar color-coded to
indicate whether the data meets or misses its target.
♦ An icon that indicates whether the data quality in the item is improving over time.
♦ The percentage of records in each item that satisfied the respective data quality criteria in previous
executions of the plan.
Select View > Dashboard from the main menu to toggle between standard and dashboard modes.

Setting Data Quality Targets in the Dashboard


The fields in the Target column for each data item are editable. You can activate the cursor in each field and
type a percentage target value for it.
♦ When a data item meets its target, that is, when the percentage in the Passed field meets or exceeds the
percentage in the Target field, the horizontal bar for that item turns green.
♦ When the Passed percentage is lower than the Target percentage, the horizontal bar turns red, except in cases
where the shortfall is within the threshold set in the Settings dialog box.
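The color rule in the two bullets above can be sketched as a comparison of three numbers. The function name `bar_state` is hypothetical, and because the text does not say how a shortfall within the threshold is displayed, the sketch returns the neutral label "within threshold" for that case rather than guessing a color.

```python
def bar_state(passed, target, threshold=0.0):
    """Classify a dashboard bar from its Passed and Target percentages."""
    if passed >= target:
        return "green"
    if target - passed <= threshold:
        # Shortfall is inside the threshold from the Settings dialog box;
        # the display used for this case is not specified here.
        return "within threshold"
    return "red"


print(bar_state(88, 90, threshold=5))
```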

Modifying Dashboard Calculation Parameters


In addition to setting the weight associated with an item and its target percentage, you can add or remove data
elements from the data quality percentage calculation for that item. This allows you to display the data quality
compliance percentages for constituent elements within the data item.



To view and edit the list of data elements for a data item, right-click the item and select Configure Items. This
opens a configuration dialog box that lists the data elements associated with the item and shows which ones are
applied to the passed percentage calculation.
Check an element to add it to the calculation. To remove an element, clear its checkbox. Select at least one
element.
Note: Item configuration changes made in the dashboard are not applicable to the charts and statistics in
standard mode.

Dashboard Categories
In dashboard mode, you can create categories and assign data items to them. You typically create categories to
display items with common data quality criteria. Figure 12-2 on page 112 shows categories for Accuracy,
Completeness, Conformity, and Consistency and also the default New Items category.
Categories are managed through the Dashboard Categories dialog box. This dialog box provides options to add
new categories, edit category names, and move categories higher or lower in the dashboard report.
To open this dialog box, right-click any data item on the dashboard and select Configure Categories:

Figure 12-2. Report Viewer, Showing Dashboard Categories

Creating a Category
Use the following procedure to create categories.

To create a category:

1. Open the Dashboard Categories dialog box and click Add.


The Category Name dialog box opens.



2. Type a name in this dialog and click OK.
3. Click Close in the Dashboard Categories dialog box.

Assigning Items
All dashboards contain a single category when first created, named after the plan. All data items reside in this
category before you assign them to other categories.
Data Quality Workbench creates a new category for each new plan/group added to the report.

To assign a data item to a category:

1. On the dashboard, highlight the item name.


2. Right-click the category and select Move to from the context menu.
This displays a list of available categories.
3. Without leaving the context menu, select a new category for the item.
Note: A dashboard displays all items available to the Report Target. Items cannot be hidden or deleted from
the dashboard.

Moving Rows within Categories


You can move a row of data within a dashboard category.

To move a data row within a dashboard category:

♦ Hold the Alt key and drag the row to a different location in the category.

Deleting a Category
You can delete categories from a dashboard. A category that contains a data item cannot be deleted from the
dashboard. Assign the data item to a different category before deleting the category.

To remove a category from the dashboard:

♦ Highlight the category in the Dashboard Categories dialog box and click Remove.

Assigning Weights to Data Items


Each category on the dashboard has a weighted average: the average pass percentage across all items in the
category, calculated from the weight assigned to each item.
By default, all items have an equal weight of 1.0. You might change this value based on the business importance
of the item within the category or the relative number of data records represented by the category. A higher
number reflects higher relevance for that item. A lower number reflects lower importance. Setting the number
to 0 removes the item from the calculation of the average pass rate for the category.
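As a worked illustration, the sketch below computes a standard weighted mean of pass percentages. It is assumed, not confirmed, that the dashboard uses this formula; the function name `weighted_average` and the sample figures are invented.

```python
def weighted_average(items):
    """Weighted mean of (pass_percentage, weight) pairs.

    Returns None when every weight is 0, because no item contributes.
    """
    total_weight = sum(weight for _, weight in items)
    if total_weight == 0:
        return None
    return sum(pct * weight for pct, weight in items) / total_weight


# Two items with equal default weights, then one item given double weight.
print(weighted_average([(90.0, 1.0), (70.0, 1.0)]))  # 80.0
print(weighted_average([(90.0, 2.0), (60.0, 1.0)]))  # 80.0
```

Setting an item's weight to 0 leaves the other items to determine the result, which matches the description of removing an item from the category average.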

To review and edit the weight assigned to an item:

1. Highlight the first row in its category, right-click, and select Configure Items.
This opens the Weighted Average Configuration dialog box, which lists the items in the category and the
current weight for each one.
Note: The first row in each category is named Weighted Average by default. This name can be changed in
the Weighted Average Configuration dialog box. However, the first row always provides the weighted
average pass rate for the category and appears in bold type. The configuration dialog box name is static
regardless of the item name displayed in the first row.
2. Enter new weights as necessary.



Viewing Plan Data
You can use the Report Viewer to drill-down into the underlying plan data, including the source data, in tabular
form. From the drill-down table, you can filter the data to pinpoint different data values and copy all or part of
the dataset to a CSV file or clipboard.
In standard mode, you can double-click any chart element in the right pane to open a new window that displays
data records matching the properties of that element. You can also right-click any highlighted element in the
legend and select Open.
Dashboards provide another means to view the underlying data.

To view the records that do not satisfy the quality criteria for an item:

X Right-click a highlighted data item in dashboard mode and select View Exceptions.
Note: When you drill-down to data within the Report Viewer, you refresh the view of the underlying plan data,
displaying the current state of the dataset. If the data has changed since the plan was last run in Workbench,
these changes are available to the Report Viewer. This does not alter the SSR file or the plan.
Drill-down mode can display either the columns in plan source data or all columns used in the plan. The latter
includes both source data columns and columns created in the plan. Configure this setting in the Report Viewer
Settings dialog box.

Exporting and Filtering Data in Drill-Down Mode


In drill-down mode, you can export data to a CSV file or to a dictionary (.DIC) file.

To export data to a dictionary file:

1. Right-click the data values you want to export and click Export To > Dictionary.
The Select Dictionary Name dialog box displays.
2. You can append the data to the dictionary or overwrite existing data by selecting an existing dictionary file.
-or-
You can enter a new name in the File name field to create a new Data Quality Workbench dictionary with
values for Label and Item1.
3. Save the dictionary in a location recognized by the Dictionary Manager.
To export data to a CSV file:

1. Right-click the data values you want to export and click Export To > CSV File.
The Select CSV File Name dialog box displays.
2. You can overwrite data in an existing file.
-or-
You can enter a new name in the File name field to create a new CSV file.
You can use the context menu to filter the data that displays and focus on a subset of data. The drill-down
context menu provides the following options:
♦ Edit > Select Column. Selects all values in the column.
♦ Edit > Select All. Selects all values in the table.
♦ Edit > Copy. Copies the highlighted cells to the Windows clipboard. You can use Ctrl or Shift-click to
highlight cells across multiple rows and columns, and then copy their contents to the clipboard.
♦ Export to > Dictionary. Copies the highlighted cells to a reference dictionary (.DIC) file.
For more information about creating dictionaries using the Report Viewer, see “Creating Dictionary Files
with the Report Viewer” on page 106.

♦ Export to > CSV File. Copies the highlighted cells to a CSV File.
♦ Filter > Filter by Selection. Hides all records that do not contain the value in the highlighted cell.
♦ Filter > Remove Filters. Removes the filter applied and restores the data table.
♦ Filter > Auto Filter. Adds a new cell at the top of every column in the table. Each cell provides a menu of
every data value in the column. You can select a value from any cell to filter the table for records containing
the same value in the same column.
You can use multiple cells in a filter, resulting in data that fulfills all filter requirements. Select Unfilter to
clear these filters.
♦ Find. Opens a dialog box that permits searches of selected columns or the entire table.
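
The way several Auto Filter cells combine can be sketched as follows. This is an illustration only: it assumes each record is a mapping of column name to value, and the function and column names are hypothetical rather than part of the Report Viewer.

```python
def apply_filters(records, filters):
    """Keep only the records that match every column/value pair in `filters`."""
    return [
        record for record in records
        if all(record.get(column) == value for column, value in filters.items())
    ]


# Hypothetical drill-down rows.
rows = [
    {"CITY": "Limerick", "STATUS": "Valid"},
    {"CITY": "Limerick", "STATUS": "Invalid"},
    {"CITY": "Cork", "STATUS": "Valid"},
]
matches = apply_filters(rows, {"CITY": "Limerick", "STATUS": "Valid"})
```

Only the first row satisfies both conditions. An empty set of filters returns every row, which corresponds to clearing the filters.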

Report Viewer Parameters and Settings


Bear the following points in mind:
♦ The Report Viewer displays report files. The SSR files displayed in the Report Viewer are written or
updated only when the plan is executed using the Workbench Run Plan command. You cannot edit or save
report files using the Report Viewer.
♦ The Report Viewer stores settings in a master report settings file. Some display settings are stored
automatically, such as the display mode and report charts display. Other settings can be set as properties. The
Report Viewer does not store report settings in the SSR file.
♦ Some key report settings cannot be restored if they are changed in the Report Viewer. If you delete the
dashboard history, for example, you cannot restore it, even if you run the plan again or have a back up SSR
file. There is no Undo function in the Report Viewer.

Editing Report Viewer Settings


Several settings and display parameters relating to all viewed reports can be set manually.
The following settings are available in the Report Viewer Settings dialog box. Click File > Preferences to access
this dialog box.
♦ Limit pages to n records. Sets the number of records displayed when you drill-down to the data records
underlying the plan. The default value is 500.
♦ Limit record retrieval to [n] records. Sets the number of records retrieved in a drill-down operation. This
setting is useful when you want a snapshot of the plan data and do not need to run the entire plan. The
default value is 2000.
♦ Limit column autosizing to [n] characters. This value sets the default column width. Any field that is not
wide enough to display all characters in a string displays an arrow indicator. The default value is 30
characters.
♦ Limit Pie chart to [n] slices. This value sets the number of slices that display in report pie charts. Any data
values that do not fall into the number of slices set by this field are aggregated into a single slice.
The default value is 10 slices, displaying a maximum of nine slices that refer to data elements and a tenth
slice for the remaining elements.
Use this setting to keep pie charts easy to read. It is also a useful method of grouping data elements for
drill-down purposes.
♦ Limit Bar chart to [n] bars. This value sets the number of bars that display in report bar charts. Any data
values that do not fall into the number of bars set by this field are aggregated into a single bar.
The default value is 10 bars, displaying a maximum of nine bars that refer to data elements and a tenth bar
for the remaining elements.
As is the case with pie charts, you can use this setting to group data elements for drill-down purposes.

♦ Show orange bar when within [n] percent of target. This setting relates to dashboards. It provides a visual
cue to indicate when a data quality level approaches its data quality target. The default setting is 5 percent.
♦ Show component columns. Use this option to show all data columns available in the plan in drill-down
view. This option is cleared by default, displaying only source data columns for drill-down.
♦ Report template. Displays the path to the XSL template on which the standard report view is based.
♦ Dashboard template. Displays the path to the XSL template on which the dashboard view is based.
♦ Dashboard history template. Displays the path to the template for the dashboard history graph.
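
The aggregation performed by the Limit Pie chart and Limit Bar chart settings can be sketched as follows. This is a rough illustration under stated assumptions: the largest values are kept, the remainder is folded into one segment, and the label "Other" is hypothetical. The guide does not say how the Report Viewer chooses which values to keep.

```python
from collections import Counter


def limit_slices(values, max_slices=10, other_label="Other"):
    """Return (label, count) pairs for a chart limited to `max_slices` segments.

    Assumption: the most frequent values are kept and everything else is
    aggregated into a final segment named `other_label`.
    """
    counts = Counter(values).most_common()
    if len(counts) < max_slices:
        return counts
    kept = counts[: max_slices - 1]
    remainder = sum(count for _, count in counts[max_slices - 1:])
    return kept + [(other_label, remainder)]


# Hypothetical column with four distinct values.
sample = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(limit_slices(sample, max_slices=3))
# [('a', 5), ('b', 3), ('Other', 3)]
```

With the default of 10, at most nine segments refer to individual data elements and the tenth holds the rest, which matches the description of the default setting.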

Hiding Data Elements in Standard View


In addition to limiting the bar chart and pie chart segments displayed through the Settings dialog box, you can
hide data elements through the legend displayed in standard mode.

To hide data elements:

X Right-click the element and click Hide.


The item is removed from the legend and from any chart above it.

To restore hidden data elements:

X Right-click the legend and click Unhide.


The resulting dialog box will list all hidden items. You can choose one or more of these to restore.
Note: In dashboard view, the Report Viewer stores drill-down settings across successive Report Viewer sessions
and successive plan executions. However, in standard view, hidden data settings are not stored.

Tracking Changes in Data Quality


A dashboard is particularly useful for tracking changes in the data quality levels of the dataset, data item by data
item. It provides two means to do so:
♦ Historical percentages
♦ Historical trend graphs

Historical Percentages
A dashboard can show the changes in the percentage data quality achieved by a data item over time. The Report
Viewer remembers the data quality percentages from the most recent dashboard view on each day that the
report is opened. That is, the Report Viewer remembers one set of percentages a day. These percentages appear
on the right of the dashboard.
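
The one-set-per-day behavior can be pictured as a history keyed by date. The sketch below is illustrative only: the storage structure, function name, and item name are assumptions, not a description of the master report settings file.

```python
from datetime import date


def record_view(history, view_date, percentages):
    """Store one set of percentages per day.

    A later dashboard view on the same day replaces the earlier one.
    """
    history[view_date] = dict(percentages)
    return history


history = {}
record_view(history, date(2008, 9, 1), {"Postcode": 81.0})
record_view(history, date(2008, 9, 1), {"Postcode": 84.5})  # later view, same day
record_view(history, date(2008, 9, 2), {"Postcode": 86.0})
```

After these three views, the history holds two entries, and the entry for the first day reflects the most recent view on that day.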

Historical Trend Graphs


At a high level, an arrow in the left-most column on the dashboard indicates whether the data quality for an
item has improved or declined since the base point date. (No arrow means there has been no change.)
For a more detailed view, highlight the item name, right-click on the dashboard, and select View History... from
the context menu. This opens a line graph plotting the progress in data quality for the item over time.

Viewing the Line Graph


The line graph displays percentage values on its vertical axis and date values on its horizontal axis.
Right-clicking in the graph area provides access to the following options:

♦ Copy. Use to copy the chart image to the clipboard.
♦ Set as base point. Use to set the selected percentage as the baseline for the graph. In a graph with multiple
data points, a pair of dotted X-Y lines identify the selected percentage.
♦ Clear history before point. Use to clear all history before this date. When you select this option, you are
asked if you want to clear the history for all other items on the dashboard. The default option is Yes. Click
No if you want to clear the history for this item only. Click Cancel to cancel the operation.
Note: The Clear command deletes the earlier graph history and the associated historical data on the dashboard
itself. Once deleted, this information cannot be restored.

Importing Report Files and Working with Groups


You can combine data from multiple report files into a single view in the Report Viewer by using the Import
command. This command identifies an SSR file and imports its data into an open report.
When you import a report, you create a group comprising data from the imported report and the report
previously open in the Report Viewer. A group is a collection of settings saved to the master report settings file
that points to multiple SSR files and defines how they display.
The group does not store report data or edit the SSR files.

Creating a Group
Use the following procedure to create groups.

To import data from a report file and create a group:

1. Select File > Import... from the main menu.


The Import Report dialog box opens.
2. Browse to the location of the SSR file and click OK.
When you identify the relevant file, a new dialog prompts you to type a group name for the combined
report data.

Managing Groups
Use the following procedure to view or delete groups.

To view the groups available to the Report Viewer:

1. Click File > Groups to open the Manage Groups dialog box.
2. To view a group, highlight its name and click Open.
3. To delete a group, highlight it and click Delete.
Clicking the Close button closes this dialog box.
You cannot delete the currently open group.

Groups and Dashboards


Groups are useful for aggregating and displaying the data analyses of several plans. This can provide a wide-
angle view of the quality of the business data, particularly when scorecards are built for the group.
You can define a dashboard for a group as you do for a single report. With group dashboards, you can define
one or more categories containing key items from multiple reports.

Note: You cannot toggle between a dashboard for a single report file and for a group. When you view the
dashboard for a group, the Report Viewer drops the dashboard for the originally-opened report file and displays
dashboards for available groups for the remaining Report Viewer session. To return to the earlier report file, you
need to open the file again.



CHAPTER 13

Deploying Plans for Runtime Execution
This chapter includes the following topics:
♦ Overview, 119
♦ Deploying Runtime Plans, 119
♦ Running a Plan, 120
♦ Command Line Arguments, 122
♦ Performance, 123
♦ Multi-Threading and Multi-Processing, 124
♦ Security, 125

Overview
Data Quality supports the deployment of plans for runtime execution — that is, for execution as part of a
scheduled or batch process. Plans created in Data Quality Workbench can be published from one Data Quality
repository to another. The execution of the plans is then managed from the command line. You can deploy
plans on Windows and UNIX platforms.
Note: In earlier versions of Informatica Data Quality, the capability to deploy plans for scheduled or batch
execution was delivered through a separate application called Data Quality Runtime. In this version, Runtime
functionality has been incorporated into Data Quality Server. This chapter describes the runtime plans.
For information about the prerequisites and system requirements for runtime functionality, see the Informatica
Data Quality Installation Guide.

Deploying Runtime Plans


Plans deployed for batch or scheduled execution can be run from one of two locations:
♦ Directly from the Data Quality repository (enterprise installs only).
♦ As an XML file from the local file system.

The local or remote Data Quality repository is identified in the config.xml file on the machine that runs the
plan.
Data Quality Workbench users in a service domain can use the Project Manager and File Manager to publish
plans and move file resources to a remote Data Quality repository for deployment. All plans published to the
repository are available for execution by Informatica Data Quality as long as the paths to all relevant data and
dictionary files are valid for the plan. You can identify the paths and filenames using parameter files. For more
information, see “The -c Option” on page 122.
You can convert plans to XML files from the Workbench interface and deploy the plan files and other resource
files. For example, you can transfer files to another computer using FTP.
Note: When executing a runtime plan, Data Quality looks in the default Dictionaries folder for plan
dictionaries. However, you can specify data source files that reside anywhere on the Runtime host as long as their
locations are specified in a parameter file associated with the plan. For this reason, Data Quality Workbench
allows you to specify the source and target file locations when you save a plan as XML.
Use runtime plans in environments where the data repository is updated periodically from one or more
low-quality source systems and you need to cleanse and run reports on the data periodically.
On Windows, the executable file for implementing runtime functionality is Athanor-RT.exe, located in the bin
folder of the Data Quality Server installation.
On UNIX and Linux the executable file is a script located in the bin folder of the Data Quality Server
installation, named “athanor-rt.” This script calls the Athanor-RT executable file using a suitable environment.
Note: Do not run the Athanor-RT executable directly on non-Windows platforms.

Running a Plan
Data Quality can execute a plan as an XML file from the file system or from the Data Quality repository.
The -f flag specifies that athanor-rt should read a plan from an XML file in the local file system. The -p flag
specifies that the plan should be read from the repository identified in the local config.xml file. For example, the
following code runs myplan.xml from the home/Informatica/DataQuality/plans folder:
athanor-rt -f home/Informatica/DataQuality/plans/myplan.xml
The following code runs myplan from the Folder1 folder in the Project1 project in the repository:
athanor-rt -p project1/folder1/myplan
Note the following:
♦ You can use the -c option to have Data Quality read plan variables and source file locations from a
parameter file. This allows you to reuse a plan without having to edit the plan for each scenario. For more
information, see “Command Line Arguments” on page 122.
♦ Parameter files are also important elements in plan execution. Use the -c option to specify the parameter file
that identifies the locations of the data source files.
♦ As Data Quality executes plans, it logs messages to the screen, to the local log file, and to the Event Log
on Windows platforms or syslog on UNIX platforms as configured in the config.xml file.

Version Control
Data Quality Server provides version control for plans stored in the repository. The -p option allows you to
identify a base version of a plan for runtime execution.
For example, the following code runs base version 3 of myplan:
athanor-rt -p project1/folder1/myplan:3
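
When the -p argument is generated by a script, it can help to build it in one place. The helper below is a hypothetical convenience function, not part of Data Quality. It only assembles the project, folder, plan, and optional base version in the form shown in the examples above.

```python
def repository_plan_path(project, plan, folders=(), version=None):
    """Build the value passed to -p: project[/folder ...]/plan[:version]."""
    path = "/".join([project, *folders, plan])
    if version is not None:
        path = f"{path}:{version}"
    return path


print(repository_plan_path("project1", "myplan", folders=["folder1"], version=3))
# project1/folder1/myplan:3
```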

Scheduling Operations
Data Quality can run plans in batch mode automatically, by means of a scheduling application, or manually, by
an operator. For example, when an overnight batch schedule updates a database from a series of data feeds, you
can call the Data Quality engine to check the feeds for data quality problems. You can call the command line
application with a scheduler such as Windows Task Scheduler or UNIX Cron.

Windows Scheduling
The following steps describe how to schedule a plan on a Windows computer:
1. Create a batch file named QualityReport.bat and add the desired command, for example:
"C:\Program Files\IDQ\bin\Athanor-RT.exe" -f C:\Plans\QualityReport.xml

2. Run the batch file to ensure that it works as expected.


Run the file with the user profile of its intended user.
3. Add a new task.
Open the Scheduled Tasks window from the Windows Control Panel. Right-click in the window and click
New > Scheduled Task from the shortcut menu, and name the task.
4. Open the property sheet for this task and edit its settings as follows:
On the Task tab:
♦ Type the local path to the batch file in the Run field, such as C:\Plans\QualityReport.bat.
♦ Type the path to the Data Quality installation in the Start In field, such as C:\Program Files\IDQ.
♦ Select the user profile that will run the plan. Remember to confirm that the file will run correctly for
that user.
On the Schedule tab, specify when you want to run the task.
Review the Settings tab fields. The default settings on this tab are sufficient for most tasks.
5. Click OK and, if prompted, enter a username and password.
The task is now under the control of the Windows Task Scheduler.
6. To add pre- or post-task operations, add steps to the batch file or add new tasks to the Scheduler.
You can use any scheduler with the ability to run command line tools.
Note: If the Windows Scheduler cannot find the specified file, check for spaces in the paths provided in step 4
above. Check the path by running the file from the command line. If spaces are present, surround the path with
quotation marks, as follows:
"C:\My Tasks\QualityReport.bat"
The batch file returns the error code of the last command executed.

UNIX Scheduling
The following steps illustrate the scheduling of plan Profile.xml on a Solaris machine using the cron scheduler:
1. Create a shell script called QualityReport.sh and add the run command, for example:
$HOME/athanor/bin/athanor-rt -f $HOME/Plans/Profile.xml

2. Run the shell script to make sure it performs as expected.


3. Create a new scheduled task using the crontab -e shell command.
The following task runs QualityReport.sh and logs standard and error messages to /tmp/QualityReport.log:
0 02 * * * sh -f /export/home/athanor/QualityReport.sh > /tmp/QualityReport.log 2>&1
You can use any scheduler that has the ability to run command line tools. For more information on using cron
and crontab, see the “man crontab” and “man cron” commands or contact your system administrator.

Command Line Arguments
Typing athanor-rt -? at the command prompt displays the following output:
Usage: .\Athanor-RT.exe [ -f <XML plan filename> | -p <project name>[/<folder name> ...
]/<plan name>[:<version id>] ]
[ Options ]
Specify a plan:
-f <XML plan filename> Run the plan contained in the runtime plan XML file
-p <Repository plan> Run the plan from the repository specified by the path
Options:
-c f Use the parameter file f to override values in the XML plan
-i n Display progress information every n records
-? Display this usage screen
-h Display this usage screen
For more information about options -f and -p, see “Running a Plan” on page 120.
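
A calling script can assemble the argument list from the options shown in the usage screen. The following is a minimal sketch, not an Informatica utility: the function name and the file names in the example are hypothetical, and the -i value is passed as a separate argument, following the `-i n` form in the usage output.

```python
def build_command(executable, xml_file=None, repository_plan=None,
                  parameter_file=None, progress_interval=None):
    """Assemble an athanor-rt argument list from the documented options."""
    # The usage screen lists -f and -p as alternatives, so require exactly one.
    if (xml_file is None) == (repository_plan is None):
        raise ValueError("specify exactly one of xml_file (-f) or repository_plan (-p)")
    command = [executable]
    command += ["-f", xml_file] if xml_file is not None else ["-p", repository_plan]
    if parameter_file is not None:
        command += ["-c", parameter_file]
    if progress_interval is not None:
        command += ["-i", str(progress_interval)]
    return command


# Hypothetical plan and parameter file names.
print(build_command("athanor-rt", xml_file="plans/myplan.xml",
                    parameter_file="prod.params", progress_interval=1000))
# ['athanor-rt', '-f', 'plans/myplan.xml', '-c', 'prod.params', '-i', '1000']
```

Returning a list rather than a single string avoids quoting problems when paths contain spaces, because the list can be handed directly to a process launcher.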

The -c Option
Data Quality supports the use of parameter files that can facilitate the deployment of a plan in one or more
environments. The parameter file is passed to the Data Quality engine using the -c option.
The parameter file defines the environment-specific values to be used when the plan is executed. For example, a
mapping between the original location of a source file and its new location can be defined in the parameter file:
C:\Program Files\IDQ\DevData\Source.csv=
C:\Program Files\IDQ\users\user.name\Files\ProdData\Source.csv
Such mappings are platform-independent, that is, a Windows path can be mapped to a UNIX path, and vice
versa.
You can export or publish a plan and notify an administrator who applies the parameter file. Alternatively, you
can prepare the parameter file before exporting or publishing the plan.
To make best use of the -c option, establish a standard convention to indicate the kind of information files
contain. Take care when defining mappings in the parameter file. For example, the mapping “word=book” will
replace all instances of “word” in the XML file, including tags such as <password>, which can result in an invalid
plan.
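
The risk of an overly short mapping key can be reproduced with plain text replacement. The sketch below assumes that a mapping behaves like a simple find-and-replace over the plan text, which is how the guide describes its effect. The internal mechanism is not documented here, and the XML fragment and function name are hypothetical.

```python
def apply_mappings(plan_xml, mappings):
    """Replace every occurrence of each key in the plan text with its value."""
    for original, replacement in mappings.items():
        plan_xml = plan_xml.replace(original, replacement)
    return plan_xml


# Hypothetical plan fragment.
plan = "<source>word.csv</source><password>secret</password>"

# Too broad: "word" also occurs inside the <password> tag.
broken = apply_mappings(plan, {"word": "book"})
print(broken)  # <source>book.csv</source><passbook>secret</passbook>

# Narrower: map the full file name instead of a fragment.
safe = apply_mappings(plan, {"word.csv": "book.csv"})
print(safe)    # <source>book.csv</source><password>secret</password>
```

Mapping complete paths or distinctive placeholder names keeps the replacement from touching tags and other unrelated text.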

Encryption
Often the details in a parameter file, such as passwords and database connection details, are sensitive. To
maintain security, an administrator can encrypt the parameter file by passing it to the Athanor-Encode utility.
This generates an encrypted file with the extension .enc appended to the original parameter file name.
This file can only be read by Data Quality or by Informatica Global Customer Support. You can edit the
parameter file in a secure environment and place the encrypted version in the production environment.

Passwords
You can apply the parameter file in encrypted or plain text mode. In plain text mode, when you edit the
password tag, the parameter will be applied each time the plan is run.
When you want to replace encrypted passwords at execution time, you must edit the XML plan and replace the
encrypted password with a placeholder. For example, the following line:
<Password EncryptionLevel='1'>W3uC+PY/kzcAUw==</Password>
should be replaced with a non-encrypted placeholder that can be easily communicated and defined in
production parameter files, for example:
<Password>PasswordHolder</Password>
In a parameter file, the password can now be substituted using the following mapping:
PasswordHolder=user.name

Shared Database Details
A plan may be designed for use with two databases that share common connection details and then, in production,
be run against two different databases. In such a case, Data Quality cannot distinguish between the two.
You must edit the original plan so that it refers to the production databases, or add placeholders for the
production databases before moving plans to a different domain. Alternatively, as best practice, it may be worth
developing the convention of using distinct database details and accounts for each database when a plan is in
design.

The -i Option
Use the -i option for checking system performance and establishing the reasons why a plan is behaving in a
certain way.
For example, plan n reads a CSV source, changes two fields within the dataset to uppercase, and then writes
the data to a CSV target. Its input fields are as follows:
CUSTOMER_KEY, FIRST_NAME, LAST_NAME, ADDRESS_LINE_1, ADDRESS_LINE_2... ADDRESS_LINE_6
Running the plan and specifying -ix at the command line, where x is a positive integer, produces the output shown
below whenever x records (plus 1 for the initial record) are processed:
Time in long seconds 1063104892
Local time Tue Sep 09 11:54:52 2003
[0] DataSource Progress = 0
[1] DataSource Num Records = 9975
[2] DataSource Num Comparisons = 4
[3] Similarity Record ID = 4
[4] CUSTOMER_KEY = 12321
[5] FIRST_NAME = Edward
[6] LAST_NAME = Oconnell
[7] ADDRESS_LINE_1 = Clorane
[8] ADDRESS_LINE_2 = Kiloimo
[9] ADDRESS_LINE_3 = Co Limerick
[10] ADDRESS_LINE_4 =
[11] ADDRESS_LINE_5 =
[12] ADDRESS_LINE_6 =
[13] To Upper 2(FIRST_NAME) = EDWARD
[14] To Upper 2(LAST_NAME) = OCONNELL
Each row corresponds to a memory location in the engine. The time in long seconds is useful for checking the
performance of the engine. For most tasks, every set of x records should be processed in the same amount of
time. If this is not the case, a performance bottleneck exists.
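
If the progress output is captured to a file, the interval between reports can be checked with a short script. The sketch below assumes only the "Time in long seconds" line format shown above. The function name is hypothetical, and the second and third timestamps in the sample are invented for illustration.

```python
import re

TIME_LINE = re.compile(r"^Time in long seconds\s+(\d+)\s*$")


def batch_durations(progress_output):
    """Return the seconds elapsed between consecutive progress reports."""
    stamps = []
    for line in progress_output.splitlines():
        match = TIME_LINE.match(line.strip())
        if match:
            stamps.append(int(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# The first timestamp comes from the example above; the others are made up.
sample = """Time in long seconds 1063104892
[0] DataSource Progress = 0
Time in long seconds 1063104894
[0] DataSource Progress = 1000
Time in long seconds 1063104899
"""
print(batch_durations(sample))  # [2, 5]
```

A sequence of roughly equal durations suggests steady throughput. A duration that grows noticeably, as in the second interval here, points to the kind of bottleneck described above.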

Performance
The time it takes for a plan to execute depends on several factors. Some are related to Data Quality, and some
are related to the environment in which the plan is executed.
In general, plan execution time includes time for the following:
1. Reading data from a data source.
2. Executing the business rules defined in the plan.
3. Writing data to a data target or report.
Reading and writing data depends on the speeds at which the Data Quality engine can read from and write to a
data source or data target. With a slow-performing database source, the engine may spend more time waiting
for data than processing it. Similarly, a slow-performing file target means that Data Quality may spend more
time waiting for data to be written.

As a rule, database sources should be as close as possible to the Data Quality instance that executes the plan.
For example, a plan using a database source will run much faster if the database is located on the same local
network than if the database is located at a remote site.
Similarly, when the Data Quality process is constrained by system resources such as CPU or available memory,
it spends more time processing. When a plan consumes a large percentage of the CPU, it will probably execute
faster on a higher-performance CPU.

Reading and Writing


Tuning database or file system access to reduce the time spent accessing data sources and targets allows Data
Quality to concentrate on processing records.

Processing
Increasing the CPU speed means that records can be processed more quickly.
The MySQL database underlying the Data Quality repository or staging area can also be tuned.

Maintenance and Housekeeping


Note the following points about failures:
♦ Plan failure. athanor-rt reports an error code of 1 if a plan fails to execute. The calling process can opt to fail
or run again depending on the error code returned.
♦ Product failure. In the unlikely event that Data Quality crashes, you can facilitate crash diagnosis by
performing a stack traceback and sending the results to Informatica Global Customer Support. For
information about this operation, contact your systems administrator.
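
A calling process that retries a failed plan can be sketched as follows. This is a minimal illustration, not a supported wrapper: it assumes that an exit code of 0 indicates success, which the guide does not state explicitly, and the function name and `runner` parameter are hypothetical.

```python
import subprocess


def run_plan(command, max_attempts=2, runner=subprocess.call):
    """Run a plan command and retry while it returns a non-zero exit code.

    Returns the exit code of the last attempt. `runner` defaults to
    subprocess.call and can be replaced for testing.
    """
    exit_code = 1
    for _ in range(max_attempts):
        exit_code = runner(command)
        if exit_code == 0:  # assumption: 0 means the plan ran successfully
            break
    return exit_code
```

For example, `run_plan(["athanor-rt", "-f", "myplan.xml"], max_attempts=3)` would start the plan up to three times and return the final exit code, which a scheduler can then use to decide whether to fail the job.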

Multi-Threading and Multi-Processing


Data Quality applications are multi-threaded and therefore suited for multiple CPU environments. Multi-
threading allows an application to make use of multiple CPUs to improve throughput. On a single CPU, multi-
threading also allows an application to make use of a CPU while a slow input or output operation takes place.
However, multi-threading is not the only way to improve throughput. Multi-processing can split a problem
between multiple computing devices or multiple CPUs on a single device.
With multi-processing, you can decide how the best possible throughput can be achieved by dividing a problem
into several different “jobs.” Each job then executes and solves a part of the overall problem. There are two
major differences between this approach and multi-threading:
♦ Jobs can run on multiple devices and can provide greater computational power than any single device can
offer.
♦ You might be able to accelerate processing beyond speeds possible with a generic threading approach.
Multi-processing and multi-threading provide complementary approaches to increasing throughput.
With Data Quality installed on a single machine, you can execute multiple processes concurrently, each process
applying the same Data Quality plan to different parts of an overall dataset, and thus achieve greater
throughput efficiency.
For example, when matching large datasets, you might have six processes running on a four-CPU system, with
each process tackling a different cluster of records. Each Data Quality process executes against only those
clusters assigned to it.

The processing requirements of each cluster increase exponentially with the number of records in the cluster.
Typically one process is assigned only a few very large clusters while other processes are assigned a large number
of small clusters. Each process performs the same amount of work and each contributes to the overall operation.
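
One way to arrive at such an assignment is a greedy balancing pass over the clusters, largest first. The sketch below is illustrative only and is not how Data Quality distributes work internally. The default cost function, which counts pairwise comparisons, is a stand-in assumption. Pass a different `cost` function to model steeper growth.

```python
import heapq


def assign_clusters(cluster_sizes, process_count, cost=lambda n: n * (n - 1) // 2):
    """Greedily assign clusters to processes so that estimated work is balanced.

    Each cluster goes to the process with the smallest estimated load so far.
    """
    heap = [(0, index) for index in range(process_count)]  # (load, process index)
    heapq.heapify(heap)
    assignments = [[] for _ in range(process_count)]
    for size in sorted(cluster_sizes, reverse=True):
        load, index = heapq.heappop(heap)
        assignments[index].append(size)
        heapq.heappush(heap, (load + cost(size), index))
    return assignments


# Hypothetical cluster sizes (records per cluster) split across two processes.
print(assign_clusters([100, 10, 10, 10, 10, 5, 5], 2))
# [[100], [10, 10, 10, 10, 5, 5]]
```

The single large cluster occupies one process while the other process takes all of the small clusters, which mirrors the distribution described above.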
A similar approach applies to the standardization of records. In this case, each Data Quality process executes on
a subset of the data. As the time taken to process the overall dataset increases linearly with the number of
records, it is a simple task to distribute the processing load across multiple Data Quality processes executing on
one or more CPUs within one or more computing hosts.
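
Because the cost is linear, an even split by record count is usually sufficient. The following sketch computes near-equal index ranges for a given number of processes. The function name is hypothetical, and how each range is fed to a Data Quality process (for example, as separate source files) is left to the deployment.

```python
def split_records(record_count, process_count):
    """Return (start, end) index ranges, end exclusive, of near-equal size."""
    base, extra = divmod(record_count, process_count)
    ranges, start = [], 0
    for index in range(process_count):
        # The first `extra` ranges take one additional record each.
        end = start + base + (1 if index < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


print(split_records(9975, 4))
# [(0, 2494), (2494, 4988), (4988, 7482), (7482, 9975)]
```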

Security
Note the following security-related details:
♦ To avoid storing potentially sensitive passwords in plain text, Data Quality can encrypt plan and parameter
file passwords.
♦ The Data Quality installer on UNIX prevents the product from being installed by any user with root
privileges. On UNIX, Data Quality requires no special user privileges, other than write access to /tmp.
Consequently, a system administrator can restrict and control access to the product in the same manner as
access to any other user-level application.
♦ The Data Quality staging area is configured by default to permit access to the underlying MySQL database
to local users only. Extending access privileges requires the explicit granting of access to other users.

APPENDIX A

Rule Based Analyzer Rule Statements
This appendix includes the following topics:
♦ Overview, 127
♦ Functional Operators, 128

Overview
When working with the Rule Based Analyzer, note the following points:
1. The rules are defined in a rule block.
2. Rule blocks contain a sequence of IF statements and assignment statements.
3. IF statements have the following form:
// Primary condition
IF <boolean expression>
THEN <Rule Block>
// Optional arbitrary number of elseifs
ELSEIF <boolean expression>
THEN <Rule Block>
// Optional else
ELSE <Rule Block>
ENDIF

The definition of a rule block allows for IF statements to be nested. Each IF statement must be closed by
the ENDIF keyword.
Examples of IF statements:
IF input1 = "" // Testing if input 1 is empty
THEN output1:= "Empty Input"
ENDIF

IF (input1 < 100) and (input2 < 100)
THEN output1:= 0
ELSEIF input1 > 100
THEN output1:= input1
ELSEIF input2 > 100
THEN output1:= input2
ELSE output1:= 100
ENDIF

4. You can add single-line text comments to logical expressions. Comments start with two forward slashes (//).
5. Assignment statements have the following form:
OUTPUTX:= <expression>
(Where X ranges from 1 to the maximum output number.)

For example:
output1:= input1 * 123.5

6. Every expression has a type that is a Boolean, an integer, a floating point value, or a string. Expressions can
be simple constant values, inputs, outputs, or operations. For example:
123 // Integer
"123" // String
123.5 // Float
Input1 // Input 1 type and value
Output3 // Output 3 type and value
100 + 2 // Integer addition operation

7. Operations are composed of operators and their arguments.


Table A-1 lists operators you can use when building a rule:

Table A-1. Operators

Operator Types                                                      Operators

Prefix operators that take Boolean arguments                        NOT

Infix operators that take Boolean arguments                         AND
                                                                    OR
                                                                    XOR (Exclusive or =)

Prefix operators that take numerical arguments (integer or float)   - (Negative)

Infix operators that take numerical arguments (integer or float)    = (Equal)
                                                                    <> (Not equal)
                                                                    < (Less than)
                                                                    <= (Less than or equal to)
                                                                    > (Greater than)
                                                                    >= (Greater than or equal to)
                                                                    - (Minus)
                                                                    + (Plus)
                                                                    * (Multiply)
                                                                    / (Divide)
                                                                    % (Modulo)
                                                                    ^ (Power)

Operators that take String arguments                                = (Equal)
                                                                    <> (Not equal)
                                                                    & (Concatenate)
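
To make the IF / ELSEIF / ELSE semantics concrete, the second example rule in item 3 can be mirrored in Python. This is only an illustrative sketch: the Rule Based Analyzer evaluates rules in its own syntax, and the function name output1 is a hypothetical stand-in for the rule's output.

```python
def output1(input1, input2):
    """Python mirror of the example rule: the first matching branch wins."""
    if input1 < 100 and input2 < 100:
        return 0
    elif input1 > 100:
        return input1
    elif input2 > 100:
        return input2
    else:
        # Reached when neither value exceeds 100 and at least one equals 100.
        return 100


print(output1(5, 7))      # both below 100
print(output1(150, 20))   # first ELSEIF branch
print(output1(100, 100))  # ELSE branch
```

Note that the ELSE branch is reached only when no earlier condition is true, which in this rule means at least one input is exactly 100.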

Functional Operators
The Rule Based Analyzer accepts several functional operators in rules. You can apply them in the Rule wizard
and in Expert Mode. The operators ISNUMBER and ISDATE appear as options in IF statements only.
Use the following rules and guidelines when you use functional operators:
♦ Operators that expect float arguments attempt to convert string arguments to floating point numbers where
possible.
♦ The string concatenate operator [&] converts arguments to strings.
♦ Operators display an error message if an automatic conversion between types fails.
♦ The Rule Based Analyzer accepts all Gregorian dates.

♦ Date functions do not accept leading or trailing spaces.
Table A-2 describes the functional operators you can use when building a rule:

Table A-2. Functional Operators

Functional Operator Returns Description

ISNUMBER (expression e) Boolean Returns true if the expression can be evaluated as a number.

ISDATE (expression e) Boolean Returns true if the expression can be evaluated as a date.
    Dates must be in the DD/MM/YYYY format.

TOINT (expression e) Integer Converts an expression to an integer.

TOFLOAT (expression e) Float Converts an expression to a floating point value.

TOSTRING (expression e) String Converts an expression to a string.

STRLEN (string s) Integer Returns the number of characters in s.

LEFTSTR (string s, integer n) String Returns the leftmost n characters of the input string s.
    If n is greater than the length of s, then s is returned.

RIGHTSTR (string s, integer size) String Returns the rightmost size characters of the input string s.
    If size is greater than the length of s, then s is returned.

SUBSTR (string s, integer startPos, integer size) String Returns a substring of s, starting at the position
    specified by startPos and with the length specified by size.

DATECOMPARE (string s1, string s2, dateformat) Integer Returns the number of days between s1 and s2.
    You must define the date format, such as DD/MM/YYYY.
    For example, DateCompare("2003/03/04", "2002/03/04", "YYYY/MM/DD") returns the number of days
    between 4 March 2003 and 4 March 2002.

DATECONVERT (string s, dateformat1, dateformat2) String Converts the date from one specified format to another.
    You must define the date formats, such as DD/MM/YYYY.
    See also Example, page 68.

MONTHCOMPARE (string s1, string s2, dateformat) Integer Returns the number of months between s1 and s2.
    You must define the date format, such as DD/MM/YYYY.
    For example, MonthCompare("2003/03/04", "2002/03/04", "YYYY/MM/DD") returns the number of months
    between 4 March 2003 and 4 March 2002.

TIMECOMPARE (string s1, string s2) Integer Returns the number of seconds between s1 and s2.
    Both s1 and s2 must be in hh:mm:ss format.
    For example, TimeCompare("13:35:27", "13:34:28") returns the integer value 59.

CHAR (integer i) String Returns a string containing the character with the specified ASCII code value.

CODE (string s) Integer Returns the ASCII code value for the first character of the specified string.

MAX (integer i1, integer i2) Integer Returns the maximum value of the two arguments.

MAX (float f1, float f2) Float Returns the maximum value of the two arguments.

MIN (integer i1, integer i2) Integer Returns the minimum value of the two arguments.

MIN (float f1, float f2) Float Returns the minimum value of the two arguments.

ABS (integer i1) Integer Returns the absolute value of the argument.

ABS (float f1) Float Returns the absolute value of the argument.

CURDATE ("DD/MM/YYYY") String Returns the current date in DD/MM/YYYY format.
    You can also delimit the date with a hyphen [-], such as DD-MM-YYYY.

CURTIME () String Returns the current time in the hh:mm:ss format.

LTRIM (string s) String Returns the string created by trimming any white space from the start of string s.

RTRIM (string s) String Returns the string created by trimming any white space from the end of string s.

TRIM (string s) String Returns the string created by trimming any white space from the start and end of string s.

CONTAINS (string s2, string s1) Integer Searches for string s2 in string s1. Returns the position of s2 in s1,
    that is, the position of the first character of s2 in s1. Case-sensitive.
    For more information, see “Example: CONTAINS Function” on page 68.
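
The behavior described in Table A-2 for a few of these operators can be approximated in Python. The sketch below is an illustration only, not the Rule Based Analyzer implementation. The function names are hypothetical, the mapping of DD, MM and YYYY to Python date directives is an assumption, and the sign convention (s1 minus s2) is inferred from the TimeCompare example in the table.

```python
from datetime import datetime

# Assumed mapping from the guide's date tokens to strptime directives.
_TOKENS = (("YYYY", "%Y"), ("MM", "%m"), ("DD", "%d"))


def _to_strptime(dateformat):
    """Translate a format such as 'YYYY/MM/DD' into '%Y/%m/%d'."""
    for token, directive in _TOKENS:
        dateformat = dateformat.replace(token, directive)
    return dateformat


def leftstr(s, n):
    """Leftmost n characters; the whole string if n exceeds its length."""
    return s[:max(n, 0)]


def rightstr(s, n):
    """Rightmost n characters; the whole string if n exceeds its length."""
    return s if n >= len(s) else s[len(s) - max(n, 0):]


def date_compare(s1, s2, dateformat):
    """Whole days between two dates (assumed to be s1 minus s2)."""
    fmt = _to_strptime(dateformat)
    return (datetime.strptime(s1, fmt) - datetime.strptime(s2, fmt)).days


def time_compare(s1, s2):
    """Seconds between two hh:mm:ss times (s1 minus s2)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(s1, fmt) - datetime.strptime(s2, fmt)
    return int(delta.total_seconds())


print(date_compare("2003/03/04", "2002/03/04", "YYYY/MM/DD"))  # 365
print(time_compare("13:35:27", "13:34:28"))                    # 59
```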



APPENDIX B

Global AV: Output Field Descriptions
This appendix includes the following topics:
♦ Global AV Output Field Map, 131

Global AV Output Field Map


This appendix contains information about the codes and values returned by the Global AV component. The
table below lists the Global AV output field names and maps these names to the outputs that can be created by
the underlying validation engines.

Table B-1. Global AV Outputs and Corresponding Validation Engine Outputs

Global AV Output Name                 Global AV Selection Status  Corresponding Address Validator Output / International AV Output / North America AV Output

Match Status                          Required                    Match Type Match Status Status Code
Match Code (previously Match Score)   Required                    Match Code Match Code Error Code and Error String
Address1                              Default On                  Address1
Address2                              Default On                  Address2
Organization                          Default On                  Organization Organization,
Building                              Default On                  Building Name Building,
Sub Building                          Default On                  Sub-Building Name Sub Building,
House Number                          Default On                  Building Number House Number, Parsed Address Range
Street Name                           Default On                  Street Name, Parsed Street Name
City Abbreviation                     Default Off                 City Abbreviation
Locality/City                         Default On                  Post Town, Locality/City City

Additional Locality                   Default Off                 Additional Locality
Dependent Locality                    Default Off                 Dependent Locality Dependent Locality
Dependant Thoroughfare                Default Off                 Dependant Thoroughfare
Thoroughfare                          Default Off                 Thoroughfare,
Double Dependant Locality             Default Off                 Double Dependant Locality
Province/State                        Default On                  County Name Province/State State
Postal Code/Zipcode                   Default On                  Postcode Postal Code/Zipcode Zip
Zip Plus 4                            Default On                  Zip Plus 4
PO Box                                Default Off                 Post-Office Box PO Box
Country Name                          Default Off                 Country Name Country Country Name
Country Code ISO 3 Digit              Default Off                 Three Character Country Code Country Code
Carrier Route                         Default Off                 Carrier Route
Delivery Point Code                   Default Off                 Delivery Point Code
Delivery Point Check Digit            Default Off                 Delivery Point Check Digit
County FIPS                           Default Off                 County FIPS
Address Type Code                     Default Off                 Address Type Code
Address Type String                   Default Off                 Address Type String
Urbanization                          Default Off                 Urbanization
Congressional District                Default Off                 Congressional District
Private Mailbox                       Default Off                 Private Mailbox
Time Zone Code                        Default Off                 Time Zone Code
Time Zone                             Default Off                 Time Zone
MSA                                   Default Off                 MSA
PMSA                                  Default Off                 PMSA
Suite Status Code                     Default Off                 Suite Status Code
EWS Flag                              Default Off                 EWS Flag
Zip Type                              Default Off                 Zip Type
Parsed Pre-Direction                  Default Off                 Parsed Pre-Direction
Parsed Suffix                         Default Off                 Parsed Suffix
Parsed Post-Direction                 Default Off                 Parsed Post-Direction

Parsed Suite Name                     Default Off                 Parsed Suite Name
Parsed Suite Range                    Default Off                 Parsed Suite Range
Parsed Private Mailbox Name           Default Off                 Parsed Private Mailbox Name
Parsed Private Mailbox Number         Default Off                 Parsed Private Mailbox Number
LACS                                  Default Off                 LACS
LACS Link Indicator                   Default Off                 LACS Link Indicator
LACS Link Return Code                 Default Off                 LACS Link Return Code
Element Match Status                  Default Off                 Element Match Status
Element Result Status                 Default Off                 Element Result Status
CMRA                                  Required DPV                CMRA
DPV Footnotes                         Required DPV                DPV Footnotes
Delivery Point Suffix (DPS)           Default Off                 Postally Not Required (PNR) Locality. Data Quality does not populate this field.
GEO_StatusCode                        Required Geocoder Option    GEO_StatusCode
GEO_ErrorCode                         Required Geocoder Option    GEO_ErrorCode
GEO_CensusBlock                       Required Geocoder Option    GEO_CensusBlock
GEO_CensusTrack                       Required Geocoder Option    GEO_CensusTrack
GEO_CountyFips                        Required Geocoder Option    GEO_CountyFips
GEO_CountyName                        Required Geocoder Option    GEO_CountyName
GEO_Latitude                          Required Geocoder Option    Latitude for UK and AUS GEO_Latitude
GEO_Longitude                         Required Geocoder Option    Longitude for UK and AUS GEO_Longitude
Formatted_Address_n                   Default Off                 Outputs are not engine-specific

APPENDIX C

Search/Replace Operations and Noise Removal
This appendix includes the following topic:
♦ Noise Removal, 135

Noise Removal
This appendix contains information about noise removal, that is, removing extraneous
characters from data strings. Noise removal can make data records more legible and facilitate
matching operations.
When you run an analysis plan, identify any symbols, spaces, and unexpected characters in
the source data fields so you can remove or replace them with a Search Replace component.
This is known as noise removal.
Table C-1 lists some typical removal and replacement selections in the Search Replace component:

Table C-1. Standard Noise Removal and Replacement Operations

Data Element Action

. Replace with a single space.

, Replace with a single space.

- Replace with a single space.

/ Replace with a single space.

\ Replace with a single space.

; Replace with a single space.

Double Spaces Replace with a single space.

Blank space Remove at start.

ATTN: Remove at start.

C/O Remove at start.

C\O Remove at start.

Blank space Remove at end.


“ Remove.

“ Remove.

' Remove.

' Remove.

( Remove.

! Remove.

` Remove.

# Remove.

: Remove.

{ Remove.

} Remove.

[ Remove.

] Remove.
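
The operations in Table C-1 can be pictured as a small text-cleaning routine. The Python sketch below is not the Search Replace component; it only illustrates the same selections under stated assumptions: the function name remove_noise is hypothetical, leading markers are removed before characters are replaced (so that C/O is matched before the slash becomes a space), and both straight and curly quotation marks are treated as removable.

```python
import re

# Characters that Table C-1 replaces with a single space.
REPLACE_WITH_SPACE = ".,-/\\;"
# Markers that Table C-1 removes at the start of a value.
REMOVE_AT_START = ("ATTN:", "C/O", "C\\O")
# Characters that Table C-1 removes; the quote variants are an assumption.
REMOVE_ANYWHERE = "\u201c\u201d\"'\u2018\u2019(!`#:{}[]"

_SPACE_TABLE = str.maketrans({c: " " for c in REPLACE_WITH_SPACE})
_REMOVE_TABLE = str.maketrans("", "", REMOVE_ANYWHERE)


def remove_noise(value):
    """Apply the Table C-1 selections to one data value."""
    text = value.strip()                    # blank space at start and end
    changed = True
    while changed:                          # strip any stacked leading markers
        changed = False
        for prefix in REMOVE_AT_START:
            if text.startswith(prefix):
                text = text[len(prefix):].lstrip()
                changed = True
    text = text.translate(_REMOVE_TABLE)    # characters that are removed
    text = text.translate(_SPACE_TABLE)     # characters replaced by a space
    text = re.sub(r" {2,}", " ", text)      # double spaces to a single space
    return text.strip()


print(remove_noise("  ATTN: J. Smith, [Sales] "))  # J Smith Sales
print(remove_noise("C/O Acme-Corp; Ltd."))         # Acme Corp Ltd
```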



APPENDIX D

Matching Formulas
This appendix includes the following topic:
♦ Matching Formulas, 137

Matching Formulas
Given an input set of N records, the following number of comparisons is required without grouping:

If the records are grouped into m groups (G1…Gm being the number of records in groups 1…m) and
comparisons only occur within records in the same group, the following number of comparisons is required:

In the worst case, this means that grouping leads to a reduction of comparisons, where Gmax is the size of the
biggest group:

In practice, a greater reduction is expected since it is unlikely that every group is the same size.
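
The effect of grouping can be checked with a short calculation. The sketch below assumes that each unordered pair of records is compared exactly once, which gives N(N-1)/2 comparisons for N records; that counting convention, the function names, and the sample group sizes are assumptions made for illustration and are not taken from the guide's formulas.

```python
def comparisons_without_grouping(n):
    """Pairwise comparisons among n records, each pair compared once."""
    return n * (n - 1) // 2


def comparisons_with_grouping(group_sizes):
    """Comparisons when records are compared only within their own group."""
    return sum(g * (g - 1) // 2 for g in group_sizes)


def worst_case_with_grouping(group_sizes):
    """Upper bound: treat every group as if it were as large as the biggest."""
    g_max = max(group_sizes)
    return len(group_sizes) * g_max * (g_max - 1) // 2


groups = [5000, 3000, 2000]          # hypothetical group sizes, N = 10,000
total = sum(groups)
print(comparisons_without_grouping(total))   # 49995000
print(comparisons_with_grouping(groups))     # 18995000
print(worst_case_with_grouping(groups))      # 37492500
```

With these sample sizes, grouping removes more than half of the comparisons, and the actual count is well below the worst-case bound because the groups are not all as large as the biggest one.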

APPENDIX E

SQL Scripts
This appendix includes the following topics:
♦ Overview, 139
♦ Creating a MySQL Table, 139
♦ Use of MAX Function, 140
♦ Nested Groups and Counts, 140

Overview
Data Quality is installed with a MySQL database system to which data files can be migrated
and in which queries can be developed. Although SQL scripts are not required in the majority
of cases when designing and running plans, there are cases in which SQL scripts can provide
efficient solutions to particular data problems.
The Database Source and Database Target component configuration dialog boxes allow you to develop SQL
scripts. The sections below describe some useful SQL scripts and the particular issues that they address.

Creating a MySQL Table


Use the following steps to create a MySQL table:
1. Using a Database Target component, create the database table to which you want to migrate a data file. In
the Before pane, type the following:
drop table if exists table_name; # delete table if it already exists
create table table_name # create table with following fields
(
TableID int primary key,
FieldA varchar(20), # use descriptive names for fields
FieldB varchar(20),
FieldC varchar(20),
FieldD float,
FieldE int
);

2. In the During pane, insert the data from the source file into the new table.
Select Expert Mode to see the SQL scripting equivalent of the tab settings.

3. In the After pane, you should create an index, especially when dealing with large datasets. Use the following
script:
Create index index_name on table_name(FieldE);

Use of MAX Function


The MAX function works best on numeric data.
You can use the following steps to use the MAX function to identify the most recent transaction for each
customer:
1. Convert each date to YYYYMMDD format and store it as a numeric data field.
With this step in place, you can add the following SQL scripts to the Database Source configuration dialog
box to identify the most recent transaction for each customer.
2. Type the following in the Before tab:
DROP TABLE IF EXISTS tmp; # remove any earlier copy of the temporary table
CREATE TABLE tmp # create a temporary table
(cust_ref varchar(20),
numdate bigint);
INSERT INTO tmp
SELECT
transtable.cust_ref,
MAX(transtable.numdate)
FROM transtable
GROUP BY transtable.cust_ref; # one row per customer
CREATE INDEX tmp_trans_index ON tmp(cust_ref, numdate);

3. Type the following in the During tab:


SELECT transtable.cust_ref, transtable.numdate, <any other fields>
FROM transtable, tmp
WHERE transtable.cust_ref = tmp.cust_ref
AND transtable.numdate = tmp.numdate

4. Type the following in the After tab:


Drop table tmp;
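
The same approach can be tried outside Data Quality. The Python sketch below uses an in-memory SQLite database as a stand-in for MySQL, which is an assumption made so the example is self-contained; the sample rows, the amount column, and the helper to_numdate are hypothetical. It converts each date to a YYYYMMDD number, stores the latest number per customer in a temporary table, and joins back to retrieve the full transaction.

```python
import sqlite3
from datetime import datetime


def to_numdate(date_text, fmt="%d/%m/%Y"):
    """Convert a date string to a YYYYMMDD integer, e.g. 04/03/2003 -> 20030304."""
    return int(datetime.strptime(date_text, fmt).strftime("%Y%m%d"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transtable (cust_ref TEXT, numdate INTEGER, amount REAL)")
sample = [
    ("C001", "04/03/2002", 10.0),
    ("C001", "04/03/2003", 25.0),
    ("C002", "15/01/2003", 7.5),
]
conn.executemany(
    "INSERT INTO transtable VALUES (?, ?, ?)",
    [(cust, to_numdate(date), amount) for cust, date, amount in sample],
)

# "Before" step: one row per customer holding the latest numeric date.
conn.execute("CREATE TABLE tmp (cust_ref TEXT, numdate INTEGER)")
conn.execute(
    "INSERT INTO tmp SELECT cust_ref, MAX(numdate) FROM transtable GROUP BY cust_ref"
)

# "During" step: join back to pick up the other fields of that transaction.
latest = conn.execute(
    "SELECT t.cust_ref, t.numdate, t.amount FROM transtable t, tmp "
    "WHERE t.cust_ref = tmp.cust_ref AND t.numdate = tmp.numdate "
    "ORDER BY t.cust_ref"
).fetchall()

# "After" step: remove the temporary table.
conn.execute("DROP TABLE tmp")
print(latest)  # [('C001', 20030304, 25.0), ('C002', 20030115, 7.5)]
```

Storing dates as YYYYMMDD numbers works because numeric order then matches chronological order, which is what MAX relies on.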

Nested Groups and Counts


You might use the following steps to count the number of customers in your dataset by town and country:
1. In the During pane, select the data fields required for the report.
For this example, assume each unique record represents a single customer and that each record contains the
following fields of information: Country and Town.
2. Check the Expert Mode option.
3. Edit the resulting script so that it reads as follows:
SELECT table_name.Country, COUNT(table_name.Country), table_name.Town, COUNT(table_name.Town)
FROM table_name
GROUP BY
table_name.Country, table_name.Town
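
A grouped count of this kind can be previewed on a small sample. The Python sketch below again uses an in-memory SQLite database as a stand-in for MySQL; the table name customers and the sample rows are hypothetical. It counts customers per country and town, then per country alone, to show how the two levels of grouping relate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (country TEXT, town TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Ireland", "Dublin"),
        ("Ireland", "Dublin"),
        ("Ireland", "Cork"),
        ("France", "Paris"),
    ],
)

# One row per (country, town) pair with the number of customers in it.
by_town = conn.execute(
    "SELECT country, town, COUNT(*) FROM customers "
    "GROUP BY country, town ORDER BY country, town"
).fetchall()

# One row per country, for comparison with the nested counts.
by_country = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()

print(by_town)     # [('France', 'Paris', 1), ('Ireland', 'Cork', 1), ('Ireland', 'Dublin', 2)]
print(by_country)  # [('France', 1), ('Ireland', 3)]
```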



APPENDIX F

ODBC Data Source Administrator
This appendix includes the following topic:
♦ Using the ODBC Data Source Administrator, 141

Using the ODBC Data Source Administrator


Use the Microsoft ODBC Data Source Administrator when connecting to databases with
ODBC. When the Database Source is configured to connect using ODBC, it requires a Data
Source Name.
Note: The following procedure is written for Windows XP users. Details may differ slightly
for other versions of Windows.

To create a Data Source Name that is recognized by ODBC:

1. Open the Administrative Tools window.


2. Double-click Data Sources (ODBC).
The ODBC Data Source Administrator dialog box opens.
3. In this dialog box, select the System DSN tab and click Add.
The Create New Data Source dialog box prompts you to select the driver for which you want to set up a data
source.
4. Select the appropriate driver for the database that you want to connect to.
You might need to install a driver if you cannot locate one in the list.
When you have successfully identified the driver, a setup dialog box opens for the database driver you have
selected.
5. Type a name for the data source in the Data Source Name field.
6. Click Select and browse to select the appropriate database for the new data source.
7. Click OK to exit the dialog boxes and return to Data Quality Workbench.
8. Under the Connect to Database tab of the Database Source configuration dialog box, type the newly-
created Data Source Name in the relevant field and click Connect.

You should now see the data tables of the database that you associated with the data source name. You can drill
down into the tables and select fields as required.
Note the following:
♦ You can apply Data Quality components directly to data retrieved by ODBC and write the results to local
files. Alternatively, you can migrate the data retrieved by ODBC into a local Data Quality MySQL data table.
This approach may prove useful if you are retrieving a large data set across a network that is prone to heavy
traffic.
♦ When connecting to Microsoft Access databases, you might find that no tables or data fields are available for
viewing after you establish an ODBC connection. This can occur if Access table names or field names
include spaces. Most database vendors do not accept spaces in table names or field names, and this naming
convention is an accepted industry standard. To view data in this instance, you must remove all spaces from
the Microsoft Access table names and field names.



APPENDIX G

Character Encodings and Unicode


This appendix includes the following topic:
♦ Character Encodings and Unicode, 143

Character Encodings and Unicode


Informatica Data Quality is Unicode-compliant. Several components allow you to specify the character
encodings to be applied to the data on which they operate. The character encoding options are generally
available in the Encodings menu on the configuration dialog box for the component.
Entries on this menu include the default encoding for the current system based on the current locale, the
standard UTF encodings (UTF-8 and UTF-16 little endian and big endian), and an option to choose other
encodings not listed in the menu by default.
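
The difference between these encodings is easiest to see at the byte level. The Python sketch below is independent of Data Quality and only illustrates how one short string is stored under UTF-8 and the two UTF-16 byte orders; the sample text is arbitrary, and the locale lookup stands in for the idea of a system default encoding.

```python
import locale

text = "Zoë"  # three characters; the last one is outside ASCII

utf8 = text.encode("utf-8")          # ASCII stays 1 byte, ë takes 2 bytes
utf16_le = text.encode("utf-16-le")  # 2 bytes per character, low byte first
utf16_be = text.encode("utf-16-be")  # 2 bytes per character, high byte first

print(utf8)      # b'Zo\xc3\xab'
print(utf16_le)  # b'Z\x00o\x00\xeb\x00'
print(utf16_be)  # b'\x00Z\x00o\x00\xeb'

# Decoding with the matching encoding restores the original text.
assert utf16_le.decode("utf-16-le") == text

# The default encoding for the current system depends on the locale.
print(locale.getpreferredencoding(False))
```

Reading data with the wrong encoding or byte order produces garbled characters, which is why the encoding selected in a component should match the encoding of the data it reads or writes.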
Encodings recently selected but not defined by the default selections are added to a history of previously-
selected encodings. Only those encodings not available by default are added to the history. The history is
limited to three entries.

Choosing a Non-Default Encoding


Click Choose on the menu to open a new dialog box listing the available encodings as defined in the
localeEncoding.csv file.
This dialog box lists the following:
♦ Base languages
♦ Encodings available for versions of the base language
♦ Countries associated with each version
♦ ISO number of each version
The list can be expanded and collapsed to aid list navigation. Highlight a language or dialect and click OK to
select it for any data on which the component will operate.
Note that you select an encoding of the language rather than the base language, and that in some cases the
versions are distinguished by operating system rather than region.

APPENDIX H

Data Quality Workbench Toolbar


This appendix includes the following topic:
♦ Data Quality Workbench Toolbar, 145

Data Quality Workbench Toolbar


Figure H-1 lists the names of Data Quality Workbench toolbar icons:

Figure H-1. Data Quality Workbench Toolbar

New Project, New Plan, Save Plan, Run Plan, Refresh, Undo, Redo, Cut Component,
Copy Component, Paste Component, Configure Component, Delete Component,
Show Source Viewer, Show Project Manager, Show Plan Notes, Import Workbench Plan,
Export Workbench Plan, Import Realtime Plan, Export Realtime Plan, Import Runtime Plan,
Export Runtime Plan, Open Report Viewer, Open Dictionary Manager, View Plan Layers,
Tile Windows, Cascade Windows, Open Help Topics

APPENDIX I

Output Options in the CSV Match Target
This appendix includes the following topics:
♦ Overview, 147
♦ Configuring the Outputs for Identified Matches, 148

Overview
Significant changes have been made to the CSV Match Target component in this version of
Data Quality. The CSV Match Target component:
♦ Can generate a CSV file in two formats.
♦ Provides improved HTML reporting.
♦ Employs a new algorithm to generate match clusters.

New Output Formats


The CSV Match Target provides two output formats:
♦ Identified Matches. Provides similar results to the HTML report output. In this format, the target
reconstructs the original source file and appends a cluster ID and the number of records in each cluster to
the record. As a result, the number of rows in the target output file should be the same as the number of
input rows. Any record for which a match was not found will have its own unique cluster ID and a cluster
size of 1.
♦ Matched Pairs. Delivers each matching pair that meets or exceeds the match threshold set in the target.
(This corresponds to the target output in version 3.0 of the product.)

HTML Report
The HTML Report format displays the unique records in each cluster, with the best match identified and
the score against that match.

Usage
The CSV Match Target calculates clusters only when configured to do so. Select the Identified Matches or
HTML Report option to activate cluster generation.
You can also disable HTML report generation.

Clustering
The clustering algorithm assigns all records identified as matches to a cluster. The algorithm runs while the plan
runs and stores temporary data in memory.
In larger datasets, large numbers of matches can consume a large amount of memory. Grouping data
keeps group sizes within recommended parameters and avoids unnecessary matching operations.
Informatica recommends a maximum of 5,000 records per group.

Sources
The CSV Match Target can calculate record clusters when used with the CSV Match Source or Group Source.
When you use CSV Match Target with other sources and select the Identified Matches option, the plan does
not run. If you select HTML Report, the plan runs, but the HTML page indicates that the
report cannot be created.

Configuring the Outputs for Identified Matches


When you select the Identified Matches output format, you must review the order of the output columns in the
Outputs pane.
The columns in the Outputs pane must be organized by data source, with an equal number of columns for
records from each data source. The match score column must appear after the record columns. The logic is as
follows:
♦ Data reaches the CSV Match Target as two input records side by side, followed by the match score. For
example, records with Name and Address fields reach the Target in the following format:
Name_1,Address_1,Name_2,Address_2
♦ When you select the Identified Matches format, the Target reconstructs the original input records. The
previous example would be reconstructed as follows:
Name_1,Address_1
Name_2,Address_2
♦ You must order the output columns in the Outputs pane so the columns from the first record are listed in
order, followed by the columns from the second record, followed by the column for the match score. The
Outputs pane for the previous example should look like this:
Name_1
Address_1
Name_2
Address_2
MatchScore
♦ Figure 3-1 on page 32 illustrates a well-ordered Outputs pane for the Identified Matches option.
Use the Up and Down arrows to order columns.
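
The reconstruction described above can be sketched in Python. The guide does not describe the clustering algorithm that the CSV Match Target uses, so the union-find grouping below is a substitute technique chosen for illustration, and the function name, record values, and scores are all hypothetical. The sketch splits each matched-pair row into its two records, links them, and returns every source record with a cluster ID and cluster size appended.

```python
from collections import Counter


def identified_matches(records, matched_pairs):
    """Append (cluster_id, cluster_size) to each record.

    records: list of tuples, one per source record.
    matched_pairs: rows laid out as record 1 fields, record 2 fields, score.
    """
    parent = {record: record for record in records}

    def find(record):
        # Follow parent links to the cluster representative, shortening paths.
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for row in matched_pairs:
        half = (len(row) - 1) // 2           # the last column is the match score
        first, second = tuple(row[:half]), tuple(row[half:2 * half])
        parent[find(first)] = find(second)   # merge the two clusters

    roots = [find(record) for record in records]
    cluster_ids = {}
    for root in roots:
        cluster_ids.setdefault(root, len(cluster_ids) + 1)
    sizes = Counter(roots)
    return [rec + (cluster_ids[root], sizes[root]) for rec, root in zip(records, roots)]


records = [
    ("Ann Lee", "1 Main St"),
    ("Anne Lee", "1 Main Street"),
    ("A. Lee", "1 Main St."),
    ("Bob Ray", "9 Oak Ave"),
]
pairs = [
    ("Ann Lee", "1 Main St", "Anne Lee", "1 Main Street", 0.93),
    ("Anne Lee", "1 Main Street", "A. Lee", "1 Main St.", 0.88),
]
for row in identified_matches(records, pairs):
    print(row)
```

The output has one row per input record. The unmatched record receives its own cluster ID and a cluster size of 1, as described for the Identified Matches format.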



APPENDIX J

Informatica Data Quality Naming Conventions
This appendix includes the following topics:
♦ Overview, 149

Overview
This appendix describes a recommended naming system for Data Quality project elements. You
and your team should agree on a clear and consistent set of naming conventions for the elements
you create in Workbench. Your exact approach to naming conventions will depend on your
organization’s needs.
The elements to consider are:
♦ Projects. Create a project under the local repository (My Repository) in Workbench Project Manager. You
cannot rename a Data Quality repository.
♦ Folders. Create a folder under a project in Workbench Project Manager. Folders can be nested in projects.
♦ Plans. Create a plan at folder or project level in Workbench Project Manager.
♦ Configurable components. Select a component from the Component Palette and add it to an open plan.
♦ Component instances. Open a component onscreen to view or edit an instance. A component comprises
one or more instances.
♦ Component outputs. Open a component onscreen to view or edit its outputs. A component creates one or
more output columns based on the rules applied to its inputs.
♦ Dictionaries. Open Workbench Dictionary Manager or the local file system to view dictionary (.DIC) files.
No element can share a name with another element at the same node in the Project Manager. For example, you
cannot define two folders named MyFolder in the same project.
You can copy an element at its current location. In such cases, Workbench prefixes its name with “Copy of.” For
example, you can make a copy of MyFolder and create a new folder named Copy of MyFolder by default in the
same project. If the new name is longer than permitted, Workbench truncates the name.

Naming Projects
Workbench creates a project with the default name “New Project”.
Project naming should be clear and consistent within a repository. Follow these guidelines:
♦ Limit project names to 22 characters. The repository imposes a limit of 30 characters. Limiting project
names to 22 characters allows Workbench to prefix “Copy of ” to a copied project without truncating
characters.
♦ Include enough descriptive information in the project name for an unfamiliar user to grasp the general
purpose of the plans in the project.
♦ If plans within the project will operate on a single data source, incorporate the data source name in the
project name.
♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the project without changing its name.
♦ If you use company codes or abbreviations in the project name, ensure they are consistent and well
documented.

Naming Folders
Workbench creates four folders by default beneath a new project. The folders are named Consolidation,
Matching, Profiling, and Standardization and are listed alphabetically. These names relate to four common
types of data quality plan. You can rename, delete, and create folders to suit your business and project
objectives.
Naming guidelines for folders:
♦ Limit folder names to 42 characters. The repository imposes a limit of 50 characters. Limiting folder names
to 42 characters allows Workbench to prefix “Copy of ” to a copied folder without truncating characters.
♦ Include enough descriptive information in the folder name for an unfamiliar user to grasp the purpose of the
plans in the folder.
♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the folder without changing its name.
♦ If you use company codes or abbreviations in the folder name, ensure they are consistent and well
documented.

Naming Plans
When you create a new plan, Workbench prompts you to select one of four generic plan types as the plan name:
Analysis, Consolidation, Matching, or Standardization. These names relate to the default folder names.
Workbench provides them as an aid to project design.
These default names in no way determine or constrain plan functionality. You can add a new plan to any folder,
regardless of the plan name or the folder name.
Note: Take particular care when naming plans, particularly if you will export the plan to a PowerCenter
repository. Be as clear and descriptive as possible. Data quality operations are defined and implemented at plan
level. Although you can see a plan’s folder and project parentage in Workbench, these elements may not be
evident in the PowerCenter repository.
Naming guidelines for plans:
♦ Include the plan’s purpose or primary functionality in the plan name.
♦ If you will use the plan in a PowerCenter mapping or mapplet, prefix the plan name with dq_. This
conforms to PowerCenter naming conventions. PowerCenter applies a lowercase prefix to all elements in its
repository. For data quality plans, this is an optional but recommended step.
♦ Limit plan names to 42 characters. The repository imposes a limit of 50 characters. Limiting plan names to
42 characters allows Workbench to prefix “Copy of ” to a copied plan without truncating characters.

♦ Include enough descriptive information in the plan name for an unfamiliar user to grasp the purpose of the
plan.
♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions.
They allow the PowerCenter repository to import the plan without changing its name.
♦ If you use company codes or abbreviations in the plan name, ensure they are consistent and well
documented.
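
The length and character guidelines for projects, folders, and plans can be checked mechanically. The Python sketch below is not part of Workbench; the function names and messages are hypothetical. It derives the recommended limits from the repository limits stated above by reserving room for the “Copy of ” prefix, and it flags names that use characters other than letters, numbers, and underscores.

```python
import re

COPY_PREFIX = "Copy of "   # prefix Workbench adds to a copied element
REPOSITORY_LIMITS = {"project": 30, "folder": 50, "plan": 50}
_ALLOWED = re.compile(r"[A-Za-z0-9_]+")


def recommended_limit(kind):
    """Longest name that can still take the copy prefix without truncation."""
    return REPOSITORY_LIMITS[kind] - len(COPY_PREFIX)


def check_name(kind, name):
    """Return a list of guideline problems; an empty list means the name passes."""
    problems = []
    if not _ALLOWED.fullmatch(name):
        problems.append("use only letters, numbers, and underscores")
    if len(name) > recommended_limit(kind):
        problems.append(
            "limit %s names to %d characters" % (kind, recommended_limit(kind))
        )
    return problems


print(recommended_limit("project"))                   # 22
print(check_name("plan", "dq_customer_name_match"))   # []
print(check_name("project", "Customer Data 2008"))
```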

Naming Components
When you add a component to a plan, its default name appears underneath its icon in the plan workspace. Edit
this name to provide a description of the component’s role in the plan. Prefix your new name with an
abbreviation of the component’s original name to make the plan more legible onscreen.
If the component type abbreviation itself is not sufficient to identify what the component does, include an
identifier for the function of the component in its name.
Table J-1 lists prefixes you can use when renaming your components:

Table J-1. Component Names and Prefixes

Component               Prefix      Component                    Prefix

Address Validator       av_         Soundex                      sx_
Aggregation             ag_         Splitter                     spl_
Bigram                  bg_         To Upper                     tu_
Character Labeller      cl_         Token Labeller               tl_
Context Parser          cp_         Token Parser                 tp_
Count                   co_         Weight Based Analyzer        wba_
Edit Distance           ed_         Word Manager                 wm_
Global AV               av_         SOURCES/TARGETS
Hamming Distance        hd_         CSV Dual Source              csv_m_
International AV        iav_        CSV Match Source             csv_d_
Jaro Distance           jd_         CSV Merge Target             csv_merge_
Merge                   mg_         CSV Source/Target            csv_
MinAvgMax               mam_        DB Match Source              db_m_
Missing Values          mv_         DB Report Target             db_r_
Mixed Field Matcher     mfm_        DB Source/Target             db_
North America AV        nav_        Dual Group Source            dgs_
Nysiis                  nys_        Fixed Width Source/Target    fws_
Profile Standardizer    ps_         Group Source/Target          grp_
Range Counter           rc_         Match Key Target             mks_
Rule Based Analyzer     rba_        Realtime Source/Target       rs_
Scripting               sc_         Report Target                rep_
Search Replace          sr_         SAP Source/Target            sap_

In addition, consider these naming guidelines for components:


♦ Keep component names short where possible. You may wish to reuse component names in field names, and
your database may impose a limit on field length.
♦ Include the name of the input field or the field type.

♦ Use letters, numbers, and underscores in your name. Do not use spaces.
♦ If you use company codes or abbreviations in the component name, ensure they are consistent and well
documented.

Naming Fields
Careful field naming is essential when designing data quality plans. The power of Data Quality leads to
complex plans with many components.
Data Quality requires that every component output field name is unique in the plan. Output field names do
not persist from component to component.
Data Quality does not have the data lineage feature of PowerCenter, so the field name is the clearest indicator of
the source of a data element when a plan is examined by a third party.
Naming guidelines for fields:
♦ Prefix each output field name with an abbreviation of its component name. For a list of usable abbreviations,
see Table J-1.
♦ Use upper and lower case consistently.
♦ Do not rename output fields in target components unless necessary, as there is no convenient way to
determine the origin of a renamed output field.
♦ If you use company codes or abbreviations in the field name, ensure they are consistent and well
documented.
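
The field naming guidelines can be illustrated with a short Python sketch. It is not a Data Quality feature; the helper names and sample field names are hypothetical, and only three of the Table J-1 prefixes are included. The sketch builds output field names from a component prefix and reports any name that appears more than once in a plan.

```python
from collections import Counter

# A few prefixes from Table J-1.
PREFIXES = {
    "Rule Based Analyzer": "rba_",
    "Search Replace": "sr_",
    "To Upper": "tu_",
}


def field_name(component, description):
    """Prefix an output field name with its component abbreviation."""
    return PREFIXES[component] + description


def duplicate_fields(field_names):
    """Return the output field names that occur more than once."""
    counts = Counter(field_names)
    return sorted(name for name, count in counts.items() if count > 1)


plan_fields = [
    field_name("Search Replace", "cust_name"),
    field_name("To Upper", "cust_name"),
    field_name("Rule Based Analyzer", "name_status"),
]
print(plan_fields)                    # ['sr_cust_name', 'tu_cust_name', 'rba_name_status']
print(duplicate_fields(plan_fields))  # []
```

Because the prefix differs per component, two components that process the same input field still produce distinct output field names.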

Naming Dictionary Files


Dictionaries may be given any name suitable for the operating system on which they will be used.
Naming guidelines for dictionary files:
♦ Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both
Windows and UNIX, do not use spaces.
♦ If you modify a dictionary file from Informatica, rename or move it to a new folder before using it in a plan.
In this way, you will not overwrite your modifications if you perform a Content update.
♦ If you use company codes or abbreviations in the dictionary name, ensure they are consistent and well
documented.
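
As a rough illustration of the cross-platform guideline above, the sketch below rejects names that contain spaces or characters that Windows does not permit in file names. `portable_dictionary_name` is a hypothetical helper, and the dictionary names shown are invented examples.

```python
# Characters that Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def portable_dictionary_name(name):
    """Return True if the name is non-empty and has no spaces or Windows-reserved characters."""
    return bool(name) and not any(
        ch == " " or ch in WINDOWS_RESERVED for ch in name
    )
```

Here `portable_dictionary_name("acme_titles")` is `True`, while `"acme titles"` and `"acme:titles"` are both rejected.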



INDEX

A

Aggregation component
    configuring 47

B

Bigram component
    configuring 91

C

-c option
    command line argument 122
    shared database details 123
categories
    creating dashboard 112
    dashboard 112
    deleting 113
    moving rows 113
character encoding
    configuring 143
Character Labeller component
    configuring 53
characters
    removing extraneous 135
clustering
    CSV Match Source algorithm 148
command line arguments
    -c option 122
    encrypting parameter files 122
    -i option 123
    overview 122
Components
    Address Validation Components
        Global AV 96
    Analysis Components
        Character Labeller 53
        Token Labeller 56
    Frequency Components
        Aggregation 47
        Count 43
        MinAvgMax 49
        Missing Values 51
        Range Counter 50
        Sum 46
    Key Field Generator Components
        Normalization 81
        Nysiis 83
        Soundex 81
    Matching Components
        Bigram 91
        Edit Distance 88
        Hamming Distance 90
        Identity Match 86
        Jaro Distance 89
        Mixed Field Matcher 92
        Similarity 88
        Weight Based Analyzer 94
    Parsing Components
        Context Parser 78
        Parser 71
        Profile Standardizer 76
        Splitter 72
        Token Parser 73
    Source Components
        CSV 13
        CSV Dual Match 19
        CSV Identity Group 22
        CSV Match 19
        Database 14
        Database Match 20
        DB Identity Group 23
        Dual Group 21
        Fixed Width 16
        Group 21
        Realtime 16
        SAP 17
    Target Components
        CSV 27
        CSV Match 31
        CSV Merge 30
        Database 36
        Database Report 38
        Fixed Width 28
        Group 35
        Identity Group 40
        Match Key 33
        Realtime 40
        Report 29
        SAP 38
    Transformation Components
        Merge 64
        Rule Based Analyzer 67
        Scripting 69
        Search Replace 61
        To Upper 65
        Word Manager 63

Context Parser component
    configuring 78
Count component
    configuring 43
CSV Dual Match Source component
    configuring 19
CSV Identity Group Source component
    configuring 22
CSV Match Source component
    configuring 19
CSV Match Target component
    configuring 31
    Identified Matches option 31, 148
    Matched Pairs option 31
    output options 147
    sources for calculating clusters 148
CSV Merge Target component
    configuring 30
CSV Source component
    configuring 13
CSV Target component
    configuring 27

D

dashboard view
    Report Viewer 111
dashboards
    categories 112
    creating categories 112
    creating groups 117
    modifying calculation parameters 111
    setting Data Quality targets 111
    tracking changes 116
    tracking historical percentages 116
    tracking historical trends 116
data
    viewing plan 114
data elements
    hiding 116
data matching
    formulae 137
Data Quality staging area
    default permissions 125
data sources
    creating ODBC 141
database dictionaries
    creating 106
    description 103
Database Match Source component
    configuring 20
Database Report Target component
    configuring 38
Database Source component
    configuring 14
Database Target component
    configuring 36
databases
    shared details 123
DB Identity Group Source component
    configuring 23
deploying
    runtime plans 119
deploying plans
    using the command line 122
dictionaries
    adding spellings 105
    creating 106
    overview 103
    updating files 104
Dictionary Manager
    overview 104
Dual Group Source component
    configuring 21

E

Edit Distance component
    configuring 88
encodings
    configuring 143
encrypting
    parameter files 122
encryption
    for password protection 125
executing
    plans 6

F

File Manager
    description 2
Fixed Width Source component
    configuring 16
Fixed Width Target component
    configuring 28
functional operators
    in rules 128

G

Global AV component
    configuring 96
Group Source component
    configuring 21
Group Target component
    configuring 35
groups
    creating 117
    creating dashboards 117
    managing 117
    nested in scripts 140

H

Hamming Distance components
    configuring 90
hiding
    data elements 116
HTML
    CSV Match Target component report format 147

I

-i option
    command line argument 123
Identified Matches option
    configuring output 148
    CSV Match Target component 147
Identity Group Target component
    configuring 40
Identity Match
    populations 86
Identity Match component
    configuring 86
International AV component
    return codes and values 131
items
    assigning 113

J

Jaro Distance component
    configuring 89

L

line graphs
    viewing 116

M

Match Key Target component
    configuring 33
Matched Pairs option
    CSV Match Target component 147
MAX function
    in scripts 140
Merge component
    configuring 64
MinAvgMax component
    configuring 49
Missing Values component
    configuring 51
Mixed Field Matcher component
    configuring 92
multi-processing
    overview 124
multi-threading
    overview 124
MySQL tables
    creating 139

N

nested groups
    in scripts 140
noise
    removal 135
Normalization component
    configuring 81
Nysiis component
    configuring 83

O

ODBC
    creating data sources 141
ODBC Data Source Administrator
    creating a DSN 141

P

parameter files
    encrypting 122
    passwords 122
Parser component
    configuring 71
passwords
    parameter files 122
percentages
    tracking historical 116
performance
    checking with command line argument 123
    tuning 123
plans
    executing 6
    overview 2
    performance tuning 123
    version control 8
Profile Standardizer component
    configuring 76
Project Manager
    description 2

R

Range Counter component
    configuring 50
Realtime Source component
    configuring 16
Realtime Target component
    configuring 40
removing
    extra characters 135
Report Target component
    configuring 29
Report Viewer
    assigning weights to data items 113
    creating dictionary files 106
    creating groups 117
    dashboard view 111
    Data Quality targets on the dashboards 111
    editing settings 115
    exporting data 114
    filtering data 114
    importing report files 117
    managing groups 117
    parameters and settings 115
    standard view 111
    tracking changes 116
    viewing plan data 114
    working with groups 117
Rule Based Analyzer
    rule statements 127

Rule Based Analyzer component
    configuring 67
rules
    functional operators 128
runtime execution
    plans 119
runtime plans
    deploying 119

S

SAP Source component
    configuring 17
SAP Target component
    configuring 38
scheduling
    operations 121
Scripting component
    configuring 69
Search Replace component
    configuring 61
security
    encrypting parameter files 122
    tips 125
Similarity component
    configuring 88
Soundex component
    configuring 81
sources
    calculating clusters with CSV Match Target 148
Splitter component
    configuring 72
SQL scripts
    samples 139
standard dictionaries
    creating text 106
    description 103
standard view
    Report Viewer 111
Sum component
    configuring 46
system performance
    checking with command line argument 123

T

tables
    creating MySQL 139
terms
    adding new to dictionaries 105
    adding spellings to dictionaries 105
third-party reference data
    description 103
To Upper component
    configuring 65
Token Labeller component
    configuring 56
Token Parser component
    configuring 73, 74
    multiple dictionary operations 74
toolbar
    icons 145
trends
    tracking historical 116

U

Unicode
    compliance 143
UNIX installation
    root privileges 125

V

version control
    plan publication 10
    plans 8
    tracking plans 9
views
    Report Viewer 111

W

Weight Based Analyzer component
    configuring 94
weights
    assigning to data items 113
Word Manager component
    configuring 63

NOTICES
This Informatica product (the “Software”) includes certain drivers (the “DataDirect Drivers”) from DataDirect Technologies, an operating company of Progress
Software Corporation (“DataDirect”) which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NON-INFRINGEMENT.

2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS,
WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF
ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY,
MISREPRESENTATION AND OTHER TORTS.