Analyst 9.X
Lab Guide
Version IDQ91_HF1_AT_201107
Informatica Data Quality, Analyst Lab Guide
Version 9.1 HF1
July 2011
Copyright (c) 2001-2010 Informatica Corporation.
All rights reserved.
This software and documentation contain proprietary information of Informatica Corporation and are provided under a license
agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the
software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S.
and/or international Patents and other Patents Pending.
Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable
software license agreement and as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii)
(OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.
The information in this product or documentation is subject to change without notice. If you find any problems in this product or
documentation, please report them to us in writing.
Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart,
Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica Complex Data Exchange and Informatica On
Demand Data Replicator are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions
throughout the world. All other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright
Sun Microsystems. All rights reserved. Copyright Platon Data Technology GmbH. All rights reserved. Copyright Melissa Data
Corporation. All rights reserved. Copyright 1995-2006 MySQL AB. All rights reserved
This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is
Copyright 1999-2006 The Apache Software Foundation. All rights reserved.
ICU is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby
granted, free of charge, to any person obtaining a copy of the ICU software and associated documentation files (the Software), to
deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or
sell copies of the Software, and to permit persons to whom the Software is furnished to do so.
ACE(TM) and TAO(TM) are copyrighted by Douglas C. Schmidt and his research group at Washington University, University of
California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved.
Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The
authors hereby grant permission to use, copy, modify, distribute, and license this software and its documentation for any purpose.
InstallAnywhere is Copyright Macrovision (Copyright 2005 Zero G Software, Inc.) All Rights Reserved.
Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com).
This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright 2000-2004 Jason Hunter and
Brett McLaughlin. All rights reserved.
This product includes software developed by the JFreeChart project (http://www.jfree.org/freechart/). Your right to use such
materials is set forth in the GNU Lesser General Public License Agreement, which may be found at
http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, as is, without warranty of any
kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular
purpose.
This product includes software developed by the JDIC project (https://jdic.dev.java.net/). Your right to use such materials is set forth
in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials
are provided free of charge by Informatica, as is, without warranty of any kind, either express or implied, including but not limited to
the implied warranties of merchantability and fitness for a particular purpose.
This product includes software developed by L2FProd.com (http://common.l2fprod.com/). Your right to use such materials is set forth
in the Apache License Agreement, which may be found at http://www.apache.org/licenses/LICENSE-2.0.html.
DISCLAIMER: Informatica Corporation provides this documentation as is without warranty of any kind, either express or implied,
including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. Informatica
Corporation does not warrant that this software or documentation is error free. The information provided in this software or
documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is
subject to change at any time without notice.
Part Number: IDQ-LBG-90100-000
Table of Contents
UNIT 4: SCORECARDING ................................................................................................................................. 63
EXERCISE 5.1: DEFINING THE BAD RECORD TABLE IN THE ANALYST TOOL ................................................. 78
WORKING WITH THE DATA QUALITY ASSISTANT ................................................................................................... 80
EXERCISE 5.2: MANAGE BAD RECORDS ............................................................................................................ 80
EXERCISE 5.3: AUDIT TRAIL............................................................................................................................... 84
EXERCISE 5.4: PROFILE THE BAD CUSTOMER RECORDS TABLE..................................................................... 86
EXERCISE 5.5: DEFINING THE DUPLICATE RECORD TABLE IN THE ANALYST TOOL ..................................... 87
EXERCISE 5.6: RECORD CONSOLIDATION ......................................................................................................... 89
EXERCISE 5.7: SPLITTING A CLUSTER ................................................................................................................ 90
EXERCISE 5.8: THE AUDIT TRAIL ...................................................................................................................... 92
Introduction and Overview
The business case
GoDog.Ltd is a US-based dog food manufacturing company. Its main customers are
supermarkets across the United States; however, it also sells to supermarkets in the UK and Spain.
Problems have arisen within the company relating to the quality of its customer orders and product
data.
There is little confidence in the quality of the data. Reports generated provide misleading and
inaccurate information.
Before the data can be cleansed and standardized, the specific problems must be identified. This
will be done using Analyst Profiling. The results of this analysis will define what standardization
needs to be done.
The Data Analysts know the data and can identify what needs to be done to cleanse and standardize
the data. They can collaborate with Developers to ensure the appropriate standardization and
cleansing can be done.
Known problems with the data include:
- There are many variations in supermarket name. A&P, for example, appears as A&P, A and P, A+P, A&Ps, and AandP.
- Invoices have been sent to the wrong people in certain supermarkets due to inadequate contact details and, in some cases, inaccurate addresses. No address validation has been done, and many addresses are incomplete.
- The data is completely unstandardized; case varies per record, and fields contain superfluous symbols and spaces.
- Auto dial, which was enabled on all phones, is failing because many phone fields contain text and spaces in what should be purely numeric fields.
- Possible duplicate records need to be identified.
- Similar problems exist with the product data: description formats differ per record, naming conventions are inconsistent, and many duplicate records exist.
The level of data quality needs to be assessed. Once the Analyst has assessed the data, the
developer can begin the process of cleansing and standardization, and duplicates can be identified.
Step 2. Standardizing the merged Customer Orders file (Analyst and Developer Task)
Once the data has been profiled, the analyst can identify anomalies and decide on the
standardization that needs to be performed. These requirements can be documented within the
profile which will enable analysts and developers to collaborate on projects. The analyst can create
the reference tables that the developers need to standardize, cleanse and enrich the data. Once
mapplets have been developed by the developers, they can be reviewed by the analyst and modified
if required.
The data analyst and developer can easily collaborate and transfer project related knowledge using
the Analyst and Developer Tools. This symbiotic transfer of information and ease of communication
on projects helps reduce misunderstandings and ensure project success.
Step 3. Address Validation (Developer Task using input from the Analyst)
Once the data has been cleansed and standardized the next step would be to ensure the addresses
are accurate by validating and enhancing them using definitive reference data from international
postal agencies. This is a developer task performed in the Developer Tool.
Objectives
US and EMEA Customer order data are in separate flat files that contain order information.
Import the Customer Order files both by using Browse and Upload and by referencing the file on the
server. Once imported, we will be able to profile the files in subsequent labs.
Duration
30 minutes approximately
b) Click OK and verify the Project has been created. If needed, click the plus sign next to
the word Projects in the navigation panel.
Exercise 1.2: Importing the EU_Custord File using Browse and Upload
Next we want to go through importing our data using the various options available.
1) Locate and select the New Flat File icon in the Project Contents workspace tool bar
2) Select the Browse and Upload method to import a flat file definition.
Note: Remember this places a copy of the file on the server.
Note: By choosing to import from the first line <1>, values are automatically picked up from the first line.
c) Click Next and review the Column Attributes for the data. This view allows the user to
define the data types for each column so the data can be imported appropriately.
(i) Set the following properties for the columns specified:
Note: We are not setting all the data types. If a value does not match the defined type, it will
not be displayed in some views. We need to profile the data before we can be certain of the data types. If
you are unsure, we suggest leaving the default of string.
RECORD_ID DATA TYPE: Choose int from the Data type dropdown.
UNIT_COST DATA TYPE: number
SCALE: 2
b) Click Next. Leave the default code page, ensure the comma delimiter is selected, set the
Text Qualifier to Double Quote, and select Import column names from first line.
c) As before click the Show button to update the preview.
(i) Scroll to the end of the preview to ensure the data appears as expected.
d) Click Next. Again the Data Types and Precision are set but these can be modified.
(i) Set:
ID DATA TYPE: Choose int from the Data type dropdown.
UNIT_COST DATA TYPE: number
SCALE : 2
e) Click Next and ensure the file is being saved in the CUSTOMER_DATA folder. Rename the
Data Object: US_CustomerOrders
(i) Description: US Customer Orders
f) Click Finish and review the results noting the file path is defined in the properties.
Objectives
Engage in a discovery effort to determine various business problems with the data and
possible fixes.
Apply Filters to profile subsets of data.
Create Comments to track any issues discovered.
Tag project objects with appropriate tags
Export Drilldowns to files.
Apply Out of the Box rules to your data.
Build custom expression rules.
Duration
90 minutes approximately
Unit 2: Profiling
Exercise 2.1: Column Profiling
b) It is possible to create multiple profiles at the same time; however, in this example we will
only profile the EU_CustomerOrders file.
(i) Leave the EU_CustomerOrders file selected and press Save and Run to create a
profile using the default settings.
A profile named Profile_EU_CustomerOrders is automatically created and opened.
c) Make sure all the columns are selected to be displayed in drilldowns. If not, select the
Drilldown checkbox at the top. You may need to scroll to the right to see this.
Exercise 2.2: Create a Filter and apply to a profile
1) Review the results of the profile. Scroll to the end of the profile and note the field called
Record_Status.
This field tells us if the record is live or inactive. In this example we want to profile Live records only
so we will create a filter and rerun the profile applying the filter to the profile.
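Conceptually, a filter is a row-level predicate applied to the source before profiling. The sketch below (Python, with hypothetical records, used only for illustration of the GUI behavior) shows what the profile run sees once a RECORD_STATUS = LIVE filter is applied:

```python
# Illustration only: the effect of a RECORD_STATUS = LIVE filter.
# These records are hypothetical, not the actual lab data.
records = [
    {"RECORD_ID": 1, "RECORD_STATUS": "LIVE"},
    {"RECORD_ID": 2, "RECORD_STATUS": "INACTIVE"},
    {"RECORD_ID": 3, "RECORD_STATUS": "LIVE"},
]

# Only rows matching the filter condition reach the profile run.
live = [r for r in records if r["RECORD_STATUS"] == "LIVE"]
print(len(live))  # 2
```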
2) In the profile click on the Manage Filters icon.
3) We are going to create a Filter for LIVE records so in the Manage Filters dialog click on the
New button.
4) Name the filter LIVE_RECORDS and select the field RECORD_STATUS from the dropdown.
a) Set the filter to filter records where the RECORD_STATUS = LIVE and click OK. (You may
need to click out of the LIVE dialog for the OK button to become enabled.)
b) Click Save when complete.
5) We have now saved the filter with the profile but we need to apply it.
a) In the Run Profile dialog click Choose a Filter, then select the Filter just created and
choose Run.
Note: The option to remove any applied Filters from the profile is option 1.
7) Check the RECORD_STATUS field and ensure that only LIVE records have been profiled.
Exercise 2.3: Review Profiling Results, add comments and export drilldowns
Note: You may need to turn off the Pop up blocker ahead of this exercise. This can be done by selecting
Tools > Pop-up Blocker > Turn Off Pop-up Blocker from the browser menu.
1) There is much you can determine about your data through Column profiling:
a) Unique Values
The number of unique values in a column is listed for each column.
b) Unique %
The percentage of unique values in each column. Therefore, a column with 100% unique
values is a key. In this case, Record_ID.
c) Null values
The number of null values in each column
d) Null %
The percentage of null values in each column
e) Datatype
Data type derived from the values for the column
f) Inferred %
Percentage of values that match the data type inferred by the Analyst tool
g) Documented Data Type
Data type declared for the column in the profiled object
h) Max Value
Maximum value in the column
i) Min Value
Minimum value in the column
j) Last Profiled
Date and time you last ran the profile
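How these column statistics are derived can be sketched as follows. The sample values are hypothetical, and the exact definitions the Analyst tool uses (for example, whether Unique % counts nulls in the denominator) are assumptions here, not documented behavior:

```python
# Sketch of the column profiling statistics listed above.
# Sample column values are hypothetical.
values = ["1001", "1002", "1002", None, "ABC"]

total = len(values)
non_null = [v for v in values if v is not None]

null_count = total - len(non_null)            # Null values
null_pct = 100.0 * null_count / total         # Null %
unique_count = len(set(non_null))             # Unique Values
unique_pct = 100.0 * unique_count / total     # Unique % (definition assumed)

# Inferred %: share of values matching the inferred data type (integer here).
inferred_pct = 100.0 * sum(v.isdigit() for v in non_null) / total

print(null_count, unique_count, round(inferred_pct))  # 1 3 60
```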
2) Review the Datatype column; this is the data type that was inferred by the Analyst tool
through Profiling. The Inferred % is the percentage of values that match that data type.
a) For the RECORD_ID we defined the type as integer and upon review all of the data has
been identified as integer.
b) What data type has been identified for the ORDER/SHIPPING DATE fields?
c) Has the Analyst tool identified the expected data types for the rest of the data?
3) Select the ORDER_NO column and review the Values to the right.
a) The column should contain order numbers which would be expected to be numeric. From
looking at the values it is clear this is not the case.
(i) Review the patterns; there are letters, spaces, and symbols in the field. It would
benefit from some data cleansing to remove any noise.
b) Open the comment view by clicking on the Show Comments icon to the right.
Note: Because the comments are saved with the profile, not the column, it's a good idea to name the column
you are writing the comment about.
(i) Add a comment to ask the developer to remove the noise from this field. Once
complete press Add.
Note: To remove a comment click on the icon to the left of the comment.
(ii) Close the comment view by pressing the Hide Comment icon in the viewer.
4) Select the ORDER_DATE column and review the Values to the right.
a) Examine the values:
(i) Note the 22 occurrences of NULL
This is not valid and will need to be addressed. We will need to export these records
and send them to the data owner to be corrected.
(ii) Note the formats of the other dates in the field.
This is a European format for dates (DD/MM/YYYY). Review the other dates in the
column by clicking on the Value column to sort the data. It is evident that the date is
written as DD/MM/YYYY throughout the file.
Note: We would need to change the format of the date from European to US format before the files could be
merged.
(iii) Patterns - Check the format of the data in Patterns. Are the dates in the same format
throughout the file?
We want to add comments to the profile to track inconsistencies and flag what needs to be reviewed
through the DQ process.
(i) Create a comment to note the dates in this file are in DD/MM/YYYY format and may
need to be changed.
(ii) Close the Comment viewer when complete.
b) Make sure the NULLs are at the top of the viewer (sort the data if required). We want to
export these records to a file to send to the data owner, as they can't be processed until
the appropriate order date is defined.
(i) Select the NULL values, right-click, and choose Drilldown.
Note: It is possible to change the columns that are displayed in the lower preview pane by deselecting the
checkbox to the right of the Last Profiled column. Columns that are not selected are not displayed in the data
preview.
Note: Check to ensure the Pop-up blocker is switched off ahead of exporting the data.
c) Select the Export Data icon in the drilldown part of the dialog to export the data to a file.
d) Rename the File as Profile_EU_CustomerOrders_Invalid_ORDERDATE.
(i) Either leave the default code page or choose UTF-8 encoding of Unicode (if that was
what you imported the file as) and click the OK button.
(ii) Choose to save the file on the desktop for easy access. (If this does not open it may
be because the pop up blocker is active.)
(iii) When complete choose Open and review the contents of the file.
(iv) Once done, close Excel and return to your profiling.
d) In the new dropdown choose the CURRENCY field. We want to see if there are any
records where the country code is GBR but the currency isn't GBP. Once complete,
press the Run button and review the results. Are there any records where the country is
GBR and the currency isn't correct?
e) Drill down on records with the CountryCode = ESP and check that all records with a
Spanish country code have an EUR Currency Code. You will need to edit the filter to
remove the previous query and run the new one.
If this is not the case, what are the incorrect values in the Currency field?
f) Add a comment to ask the developer to create a Currency for these missing values based
on the CountryCode.
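The country/currency consistency check in steps d) and e) amounts to a cross-field lookup. A minimal sketch, with hypothetical rows; the expected GBR/GBP and ESP/EUR pairs come from the exercise text:

```python
# Sketch of a country-code vs. currency consistency check.
# Expected pairs from the exercise; rows are hypothetical.
expected = {"GBR": "GBP", "ESP": "EUR"}

rows = [
    {"COUNTRYCODE": "GBR", "CURRENCY": "GBP"},
    {"COUNTRYCODE": "ESP", "CURRENCY": "GBP"},  # inconsistent: should be EUR
    {"COUNTRYCODE": "ESP", "CURRENCY": "EUR"},
]

# Flag rows whose currency does not match the expected currency for
# their country; rows with an unknown country code are skipped.
bad = [r for r in rows
       if expected.get(r["COUNTRYCODE"]) not in (None, r["CURRENCY"])]
print(len(bad))  # 1
```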
Tags:
As well as leaving comments for the developer, we can Tag fields. This means when the
Developer searches for Tags in the Developer tool, objects will be highlighted in association with a
Tag, for example Standardization. This will flag the object for standardization for the developer.
A list of all the available Tags is displayed. These tags were imported along with the Core Accelerator
(pre-built rules and reference tables).
2) Create a new tag by clicking on the new tag icon.
3) In the area below type the name: Remove Noise and also enter the description. Note that
the Tag appears immediately in the list of available tags above.
Now that the Tag has been created, it needs to be assigned to objects. The Developers will be able
to use a combination of both comments and tags to see what needs to be done in the project.
4) Assign appropriate Tags to the following fields:
Addr1
Country Code
Currency
Contact
5) Once complete close the Tag Viewer and close the profile.
b) Make sure the following columns only are selected:
ID
COMPANY
CONTACT
SHIPPING_STREET
SHIPPING_CITY
SHIPPING_STATE
SHIPPING_ZIP
COUNTRY
c) Choose to run the profile on the first 100 rows only. Click Save & Run.
Note: Clicking Next will bring you into the Filters view; we will apply a filter later.
d) Only the selected columns have been profiled and only the first 100 records. Review the
results of the profile.
e) Drill down on the first Value in the Company. Note how only the profiled columns are
currently being displayed in the drilldown.
3) In our case the file is quite small so we will want to profile all of the columns and all of the
LIVE rows. To profile only the LIVE records we need to create the filter.
a) Create a new filter in the Manage Filters dialog to profile only the records where the
RECORD_STATUS = LIVE.
Note: The filter we created previously is not available; this is because filters are saved with the profile on
which they are created.
b) Save the filter with the profile.
4) Now the filter has been created it needs to be applied when the profile is run. To rerun the
profile choose the Run Profile icon.
a) This time make sure all the columns are selected by checking the checkbox above
Columns.
b) Ensure All Rows (complete analysis) is selected under Sampling Options.
c) The Select Columns button under Drill down options allows you to choose the columns
that are displayed when drilling down on values.
d) To apply the filter, click Choose a Filter and in the Filter dialog choose the filter you just
created.
e) Then choose Run. Make sure all the columns have been profiled this time but only
records with a Record_Status of Live should be profiled.
5) Column Profiling: ORDER_DATE
a) Review the date fields. Are they in the same format as the EU file?
9) Review the other Columns and add Comments and Tags where appropriate.
Unit 2 Lab B: Rules
Duration: 60 mins
Profile Rules
Rules provide the Analyst with a way to identify anomalies in the data and also to manipulate the
data. Rules can be built in both the developer and analyst tools. The Developer will create rules that
the Analyst will be able to test and review.
While it is possible to apply and test the rules in the Analyst tool, the data will not be changed until
the rule is applied in a mapping in Developer.
Rules in the Analyst tool have restrictions:
Only one rule can be applied to the data at a time. For example, you cannot apply a rule to a rule.
However, you can use Custom rules with complex expressions to perform more than one
type of manipulation.
Part of the Core Accelerator contains a set of prebuilt rules that are available to all customers. These
rules can be used to execute mapplets that perform fairly complex functions.
It is also possible for the Analyst to develop their own rules using expressions in the Analyst Tool.
These Rules can be used to cleanse and standardize data.
These labs will look at applying prebuilt rules as well as developing custom rules.
Note: Rules can be created and reviewed in the Analyst Tool but the data will not be modified unless
applied as mapplets in the Developer.
We want to apply a rule to check the validity of the data in the email column in
Profile_US_CustomerOrders.
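The internal logic of rule_Email_Validation is not shown in this guide; the sketch below is only a minimal stand-in for what an email-conformance check of this kind typically does, so you know what to expect from the Conformant_Email output:

```python
import re

# Minimal stand-in for an email-conformance check. This is NOT the
# actual rule_Email_Validation mapplet logic, which is not shown here.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def conformant_email(value):
    # Non-empty and shaped like local@domain.tld.
    return bool(value) and bool(EMAIL_RE.match(value))

print(conformant_email("jo@godog.com"), conformant_email("no-at-sign"))  # True False
```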
b) Select Projects -> CONTENT -> Rules -> Contact_Data_Cleansing and select
rule_Email_Validation
e) Click Next.
The data has not changed since we ran the profile so we only need to run it for the new rule that has
been added.
(i) Click twice on the top Columns checkbox to select and then deselect all of the columns.
Scroll to the bottom and select the rule column Conformant_Email.
(ii) We don't want to show results only for the columns selected in this run, so deselect
that checkbox and press OK if a warning is displayed.
(a) Also verify that All Rows (complete analysis) is selected.
(iii) In Drilldown Options click the Select Columns button. We want to view all columns in
a drilldown so make sure all of the columns are selected and click OK.
f) Click Save & Run
3) Review the output of the Conformant_Email Rule over to the right.
b) The input column is ORDER_NO. Rename the output column Conformant_OrderNo.
c) Click Next and select the rule column Conformant_OrderNo as the column to profile.
(i) Select to view all of the columns in Drilldown Options, then Save & Run.
3) In the Profile click on the Conformant_OrderNo rule
a) Select Values and note there are approximately 32 values that need to be corrected.
Exercise 2.9: Apply Pre-built Rules: rule_Completeness_Multi_Port
1) Apply another rule, this time choose the rule_Completeness_Multi_Port from the Projects ->
CONTENT -> Rules -> General_Data_Cleansing and apply it to the following fields renaming
the output appropriately:
COMPANY Completeness_Company
CONTACT Completeness_Contact
TITLE Completeness_Title
SHIPPING_STREET Completeness_Street
SHIPPING_CITY Completeness_City
SHIPPING_STATE Completeness_State
SHIPPING_ZIP Completeness_Zip
COUNTRY Completeness_Country
PHONE Completeness_Phone
EMAIL Completeness_Email
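The multi-port completeness rule produces one output flag per input field. The logic sketched below is an assumption about what "completeness" means here (not NULL and not blank); the field names come from the exercise:

```python
# Sketch of a per-field completeness check in the style of
# rule_Completeness_Multi_Port. The "not null and not blank"
# definition is an assumption, not the documented rule logic.
def completeness(value):
    return value is not None and value.strip() != ""

# Hypothetical record using field names from the exercise.
row = {"COMPANY": "A&P", "CONTACT": "", "PHONE": None}
flags = {field: completeness(value) for field, value in row.items()}
print(flags)  # {'COMPANY': True, 'CONTACT': False, 'PHONE': False}
```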
2) Press Next and ensure the fields are selected and this time choose to Save the rule. We will
apply another rule before running the profile in the next step.
3) Apply the following rule and when finished choose Save and Run making sure all of the
required columns are selected for profiling.
Rule Column Output Name
IsNumeric PHONE Conformant_Phone
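An IsNumeric-style check flags values that are not purely numeric, which is exactly the auto-dial problem described in the business case. A sketch of the idea (the real rule's implementation is not shown in this guide):

```python
# Sketch of an IsNumeric-style conformance check on PHONE.
# A phone value conforms only if it contains digits alone.
def conformant_phone(value):
    return value is not None and value.isdigit()

print(conformant_phone("5551234567"), conformant_phone("555 123-4567"))  # True False
```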
4) Once complete you will have the following rules applied to the US_CustomerOrders profile
Create and Apply Custom Cleansing Rules EU_CustomerOrders
We want to apply rules that will correct some of the problems in the data. We can review the outputs
and verify they are correct. Later the developers can convert the profile to a mapping, which will allow
them to utilize the cleansed data going forward.
There is a business rule that specifies there should be no 4-character zip codes. If one exists, it is
because the leading 0 was removed by Excel. Because of this, a zero (0) should be appended to the
front of any 4-digit codes.
Note: This is just a sample rule that could be applied.
2) Make a note of the number of values in the 9(5) and 9(4) pattern formats.
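The 9(5) and 9(4) notation comes from the profiler's pattern analysis: digits are abstracted to 9 and runs are collapsed to symbol(count). The sketch below shows the idea; the exact abstraction rules used by the Analyst tool (e.g. how letters and punctuation are mapped) are assumptions here:

```python
import itertools

# Sketch of profiler pattern notation: digits -> 9, letters -> X,
# other characters kept, with runs collapsed to symbol(count).
# The mapping for letters/punctuation is an assumption.
def pattern(value):
    symbols = ["9" if c.isdigit() else ("X" if c.isalpha() else c)
               for c in value]
    parts = []
    for sym, run in itertools.groupby(symbols):
        n = len(list(run))
        parts.append(f"{sym}({n})" if n > 1 else sym)
    return "".join(parts)

print(pattern("02116"), pattern("2116"))  # 9(5) 9(4)
```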
Create a custom rule to add a zero character to the front of any zip code values that are 4 characters
long.
3) Choose to Add a rule.
4) This time choose to Create a rule and name the rule: rule_LeadZeroFiveDigitZip
(i) Description: If a zip has four characters, append a 0 to the start.
a) Save as a reusable rule in the Rules > User_Data_Cleansing folder.
Note: This is a folder that was created centrally for user defined reusable rules.
Note: Because this is a Shared Project, you will have read access, but you will need to have been assigned
Write Permissions to save the rule here.
b) In the expression editor type the following expression:
IIF(LENGTH(ADDR4)=4, '0'||ADDR4, ADDR4)
This expression says: if the value in the ADDR4 column is four characters long, concatenate a
zero character to the front of the value; otherwise, return the current value of the column
unchanged.
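The same logic, sketched in Python for readers less familiar with the Informatica expression syntax:

```python
# Python equivalent of IIF(LENGTH(ADDR4)=4, '0'||ADDR4, ADDR4):
# prepend a zero to 4-character zip codes, leave others unchanged.
def lead_zero_five_digit_zip(addr4):
    return "0" + addr4 if len(addr4) == 4 else addr4

print(lead_zero_five_digit_zip("2116"), lead_zero_five_digit_zip("90210"))  # 02116 90210
```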
The Analyst will automatically check the validity of the expression syntax. If you receive a message
that the expression is invalid, carefully examine your typing to make sure it is correct. Call your
instructor if necessary.
(i) Under Columns: Choose to run only the new rule
(ii) Under Profile Results Option: Deselect the checkbox: Show results only for column, rules
selected in the current run.
(iii) Under Drilldown options: Choose Select Columns and choose to view all columns.
5) Once again review the value frequency for the patterns 9(5) and compare the frequency to
the previous values. The frequency should have increased for 9(5) and the pattern 9(4) will
be gone.
We need to create a rule that will convert the European dates to US format so the data can be
merged together into a single Customer Order file.
1) Click on the ORDER_DATE column and create a new Rule called: rule_EUtoUS_DateFormat
(i) Description: This rule converts EU Date Format to US date format
(ii) Save as a reusable rule in /Content/Rules/User_Data_Cleansing
a) In the Expression Editor, enter the following expression:
TO_CHAR(
TO_DATE(
ORDER_DATE, 'DD/MM/YYYY'), 'MM/DD/YYYY')
This will take the ORDER_DATE and change the format from European format 'DD/MM/YYYY' to US
format 'MM/DD/YYYY'.
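The same conversion, sketched in Python; note that like the TO_DATE expression, it parses the value against the European format first, so a malformed date would fail rather than be silently reordered:

```python
from datetime import datetime

# Python equivalent of
# TO_CHAR(TO_DATE(ORDER_DATE, 'DD/MM/YYYY'), 'MM/DD/YYYY').
def eu_to_us_date(order_date):
    return datetime.strptime(order_date, "%d/%m/%Y").strftime("%m/%d/%Y")

print(eu_to_us_date("31/01/2011"))  # 01/31/2011
```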
b) Click Next
(i) Under Columns: Choose to run only the new rule
(ii) Under Profile Results Option: make sure the checkbox is deselected
(iii) Under Drilldown options: Choose Select Columns and ensure all columns is selected.
(iv) Then click Save & Run.
c) In the profile, select rule column rule_EUtoUS_DateFormat
Check the dates have been converted to North American format.
Editing the output
2) We want to edit the output of the rule to set the output as ORDER_DATE_US.
a) Right click on the rule in the profile and select Edit Rule.
a) Save the changes.
Note: If you wanted to edit the syntax or permanently change the name of the output, you
would need to select the rule from the User_Data_Cleansing folder.
Note: It may not be possible to edit the syntax for all rules. Some rules created in the Developer tool use DQ
transformations, and rules that contain these transformations should be edited in the Developer tool.
Once the rule is opened here, it is possible to edit its inputs, outputs and syntax. The dialog also
shows where the rule is used.
2) Select the rule, choose the appropriate Tag, and assign the tag to the rule.
Exercise 2.14: Searching for Items that have been Tagged
1) To the right of the Analyst, choose to Search by Tag.
2) Enter the name of one of the Tags that you applied and press Go.
Note: There is an error in 9.1HF1 that may prevent the results from being displayed. This will be
corrected in a future release.
Unit 3: Reference Table Management
Technical Description
The function of this lab is to familiarize users with working with Reference Tables.
It will cover the steps involved in creating, updating and modifying reference tables.
These reference tables will be used later in the course by the developer to standardize, cleanse and
enrich the data.
Objectives
Build reference tables using the reference table editor and also importing flat files.
Update, modify and edit values in the reference table.
Duration
45 minutes approximately
b) Select option 1, "Profile source data will not be filtered", and press Run.
There will be a slight increase in the number of records in the profile.
c) Select Create a New Reference Table and click Next.
d) Click Next and verify the Column Attributes and Preview of the column.
e) Change the Precision to 3 and click Next
We will not include column descriptors at this time
f) On the Browse Project Tab, select the CONTENT>Dictionaries>Training project. It is
possible to keep notes for the IT auditor. This is important in standards compliance.
(i) Enter Created from European Customer Order file in the Audit Note and click Finish
to save the Reference Table.
To provide the cross reference standardization values, we must edit the new Reference Table.
1) Refresh the Browse: Projects tab and open the reference table just created.
2) To Edit the Reference Table click on the Edit Table icon ( ) in the toolbar.
a) To add columns in which to enter the values that are to be mapped to the correct Country
code, click the Add Column Attributes button ( ) and add 5 more columns to cater for
variants.
(i) Ensure the COUNTRY_CODE is set as the Valid column in the table.
(ii) Change the precision of the new columns to 20.
(iii) Click OK
The names of the additional columns are not important for the purpose of this lab. These columns
will serve to hold aliases that are to be associated with the desired value in column COUNTRY_CODE.
Therefore there is no need to change the column names at this time.
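The lookup behaviour these alias columns support, where any value found in a row standardizes to the value in the Valid column, can be sketched as follows. The rows and alias lists are hypothetical, and this is only an approximation of how a reference table is consumed for standardization:

```python
# Each row: first entry is the valid COUNTRY_CODE, the rest are aliases.
reference_rows = [
    ["USA", "UNITED STATES", "U.S.A.", "AMERICA", "US"],
    ["DEU", "GERMANY", "DEUTSCHLAND"],
]

# Build an alias -> valid-value map covering every cell in each row.
alias_to_valid = {
    alias.upper(): row[0]
    for row in reference_rows
    for alias in row
}

def standardize(value: str) -> str:
    """Return the valid code for a value, or the value unchanged if unknown."""
    return alias_to_valid.get(value.upper(), value)
```

This is why the column names do not matter: only the row membership and the Valid column drive the mapping.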
a) To Delete unneeded rows, select the rows containing COUNTRY_CODE values ES, UK,
and GB and click the Delete icon ( ).
b) An Audit Note box pops up. Enter the text:
Country codes ES, UK and GB, are invalid and have been deleted.
c) Press OK and review.
c) Select the list of values and click Add to Reference Table icon
d) Select Add to existing reference table and click Next
e) Select Projects >CONTENT>Dictionaries>Training > COUNTRY_CODE_STD and click Next
f) Choose to add the values to the second or third column in the table (we avoid the first
column because its precision is only 3 and longer values would be truncated) and click
Finish.
g) Return to the Browse: Projects Tab and open the reference table.
h) The row that contains the value USA in the column is the valid row.
(i) Modify this row to hold the correct value as well as the various aliases from the
US_CustomerOrders Data Object. The data values to enter are:
USA
UNITED STATES
U.S.A.
AMERICA
US
i) Delete the extra rows with the redundant US country code values
In the Audit Note enter: Invalid COUNTRY_CODES
j) Click OK.
The table should now appear as shown:
2) To view the Audit Trail press the Audit Trail button as illustrated.
a) In the Audit Trail window, press the Show button to review all the entries.
b) Browse to C:\infa_shared\Dictionaries\Training, select
standardize_company_names.dic and choose Open.
c) Rename it Standardize_Company_Names and press the Upload button.
d) Set the Code Page to UTF-8 encoding of Unicode and click Next.
e) Leave the default comma delimiter and select Double quotes as the Text Qualifier. Press
Next.
f) To use the reference table for standardization we must have a valid column selected.
This is the master column in the table. Make sure Column1 is selected.
g) Click Finish. The table is now ready for the developer to use.
Exercise 3.3: Find, Replace and add values in a Reference Table
1) Open the Reference Table Standardize_Company_Names if it isn't already open.
(i) Scroll to the top of the table if you aren't already there.
a) Click the Find and Replace icon ( )
b) In the Find box, type: THE % and press the Next button
We want to change any invalid instances to the correct version. We selected Column1 as the Valid
Column so this means that all the other values in the row will be standardized to the value in this
cell.
c) In the first free cell for the row, enter the value CORNERS SHOPS and press Enter (or the
green tick arrow to the right) to save the changes.
d) Find PUBLIX SUPERMARKETS (or PUBLIX %) and change the value in column 7 from
PUBLIX SUPER MARKETS INC to PUBLIX SUPER MARKETS, INC as one of the variants.
2) We want to add in some new values. Press the Add Row icon
b) Close the reference table. It is now ready to be used by the developers.
a) This time choose Connect to a relational table and leave the Unmanaged Table
checkbox unchecked. This means that the data will be imported into the Informatica 9
staging area and the reference table can be edited and updated. It also means that any
changes to the source table are NOT reflected in Informatica.
b) Click Next
e) This table will be used with product data by the developer, where the country code
should be 2 digits, so this column will be set as the Valid column. Press Next.
Exercise 3.5: Create an Unmanaged Reference Table from an Oracle table
1) Again - we want to keep the reference tables all together in our central project so on the
Browse Project Tab open CONTENT>Dictionaries>Training folder and from the icons to the
right, choose New Reference Table.
a) This time choose Connect to a relational table and check the Unmanaged Table
checkbox. This means that a link is created to the table and the data will not be
imported; it is therefore unmanaged by Informatica 9. Any changes to the source table
are reflected in Informatica, but the table cannot be edited in Informatica.
b) Click Next
c) Select the Training Connection and press Next.
d) From the Tables available choose DOMAINS and press Next.
e) Set the first column as the Valid column and press Next.
f) Save it to the Projects\Dictionaries\Training folder and press Finish.
g) Note that the values in the table can NOT be edited, nor can rows be added or deleted.
Unit 4: Scorecarding
Scorecarding US_CustomerOrders
Technical Description
The function of the following lab is to demonstrate how to configure a scorecard in the Analyst Tool
using a number of different techniques. These will include:
Applying and including Pre-built Rules
Using valid values from profiling.
Because we have already applied Pre-built rules to our US_CustomerOrders profile we can use this to
create the scorecard.
Objectives
Understand how to build and configure a data quality scorecard by applying Pre-built rules
and specifying the valid values per column.
Duration
40 minutes approximately
Overview
The data has been profiled and we have applied both Pre-built Rules (to the US_CustomerOrders
Profile) and Custom rules (to the EU_CustomerOrders Profile).
We want to track the level of data quality using a scorecard. This will provide a high level view of the
quality of the data. This is facilitated through the Data Quality Scorecard.
Scorecards can be created in the Analyst Tool and updated to include more complex Rules
developed in the Developer Tool.
Targets can be defined and a snapshot taken of the data quality measures at any stage in the
process. Once the data has been cleansed, the scorecard can be re-run to measure the level of
improvement and the targets adjusted if required.
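Conceptually, each scorecard measure is the percentage of rows whose values pass the measure's test. A minimal sketch, with hypothetical sample data (not Informatica's implementation):

```python
def score(values, valid):
    """Percentage of values considered valid, as a scorecard measure reports."""
    valid_set = set(valid)
    hits = sum(1 for v in values if v in valid_set)
    return round(100.0 * hits / len(values), 2)

# Hypothetical PAY_TERM column; 30/45/90/120 are the valid terms used later
pay_terms = ["30", "45", "90", "120", "60"]
pay_term_score = score(pay_terms, {"30", "45", "90", "120"})
```

Re-running the same computation after cleansing shows the improvement against the target.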
The first step involves identifying which key columns we want to scorecard and which measures (e.g.
consistency, conformity, accuracy) will be applied.
Business Rules we have already applied to the US_CustomerOrders profile include:
Conformity: (using Pre-built rules)
The ORDER and PHONE columns should contain only numeric characters.
The EMAIL address should be a valid email in the appropriate format.
Both DATE columns should contain valid dates.
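As a rough illustration of what these conformity rules test (the exact logic of the pre-built rules is not shown here; the email regex and the date format are assumptions):

```python
import re
from datetime import datetime

def conformant_order_no(value: str) -> bool:
    """ORDER and PHONE style check: only numeric characters."""
    return value.isdigit()

def conformant_email(value: str) -> bool:
    """Basic email shape check; the pre-built rule may apply stricter logic."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None

def conformant_date(value: str, fmt: str = "%m/%d/%Y") -> bool:
    """DATE columns should parse as valid dates in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False
```

Each function returns True for conformant values and False otherwise, matching the True/Valid outputs the rules expose to the scorecard.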
2) We want to scorecard all of the data. Remove the filter and run the profile on ALL of the
data, including all columns and rules.
a) Review the results to ensure that all records have been profiled.
We have already applied Pre-built rules to the profile that we can use in the Scorecard.
In the scorecard, as well as applying Pre-built rules to determine good and bad data, we can define
what the valid/invalid values per column are. This is possible in columns that do not have very many
unique values. In this section we will be able to check the following:
Valid country in the COUNTRY column is USA.
Valid values in the PAY_MTHD column are DD, SO, CA, or CQ.
Valid values in the PAY_TERM column are 30, 45, 90 or 120.
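The valid/invalid classification these definitions produce can be sketched as a simple partition (the sample rows are hypothetical):

```python
VALID_VALUES = {
    "COUNTRY": {"USA"},
    "PAY_MTHD": {"DD", "SO", "CA", "CQ"},
    "PAY_TERM": {"30", "45", "90", "120"},
}

def split_rows(rows, column):
    """Partition rows into (valid, invalid) by the column's valid-value set."""
    valid, invalid = [], []
    for row in rows:
        (valid if row[column] in VALID_VALUES[column] else invalid).append(row)
    return valid, invalid

# Hypothetical sample rows
rows = [
    {"COUNTRY": "USA", "PAY_MTHD": "DD", "PAY_TERM": "30"},
    {"COUNTRY": "AMERICA", "PAY_MTHD": "XX", "PAY_TERM": "60"},
]
ok, bad = split_rows(rows, "COUNTRY")
```

This partition is what the scorecard drilldown later exposes as valid and invalid rows.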
a) Select the following columns to add to the Scorecard, as well as all of the rules:
(i) COUNTRY: rename to Accurate_Country
(ii) PAY_MTHD: rename to Accurate_PayMethod
(iii) PAY_TERM: rename to Accurate_PayTerm
Note: We are selecting only these columns as the others contain too many values for us to define as
valid/invalid.
b) Also select the Pre-built Rules that were applied earlier. There is no need to rename the
outputs as they were named appropriately at the time.
Note: If any of these are disabled, run the profile and try again.
c) Once selected, press Next.
a) Call the Scorecard US_Customer_Scorecard and create it in the CUSTOMER_DATA
Project. Select OK.
We are measuring Completeness, Conformity and Accuracy and we want to group the measures into
these categories or groups.
b) To the right of the dialog choose to create a New Group within the
US_Customer_Scorecard.
We have a default group where all of the current measures reside and we will rename this later.
c) Enter the name Accuracy and press OK
d) Press New again and enter the name Conformity and press OK. You will now have 3
groups: Default, Accuracy and Conformity.
e) Press Next
5) Click on the first measure in the list - Accurate_Country. The list of values it contains is
displayed to the right of the dialog. This is where we can define the valid values.
a) Select the USA as valid.
b) Click on the next measure Accurate_PayMethod and select CQ/CA/DD/SO as valid
values.
c) For Accurate_PayTerm select the following as valid: 30/45/90/120
d) For the Completeness Columns select Complete
e) For Conformant_Email select Valid
f) For Conformant_OrderNo select True
g) For Conformant_Phone select True
h) Once you have defined the values for all of the columns select Finish. Open the
scorecard if it does not open automatically.
We now have a scorecard with 3 groups, with all of the measures in the Default group. The next
step is to rename the measures and reorganize them into their appropriate groups. Before we can
edit the scorecard we will need to run it.
6) Run the scorecard by pressing the Run Scorecard button to the right.
a) Accept the defaults in the Run scorecard dialog and press Run.
Once the scorecard has been run we will be able to edit it.
7) Click on the Edit Scorecard icon
a) Choose the Score Groups tab.
b) Click on Default and choose to Edit.
(i) Most of the measures in the group are for Completeness so rename Default to
Completeness and press OK.
c) Click on the Accuracy measures to the right holding down the CTRL key to select them
together, and then press Move.
e) Click on the Conformity measures to the right holding down the CTRL key to select them
together, and then press Move.
8) We now have the right measures in the appropriate Group. Drill down to view the matching
rows. From here it is possible to toggle between the Valid and Invalid rows.
9) To run the Scorecard at a later date, open the Scorecard in the project and press the Run
Scorecard icon.
a) It is possible to choose which columns in the source you want available during drilldowns
by selecting/deselecting the columns in the viewer.
b) You can also choose to use live or staged data in the drilldowns.
10) To view the history or trend, right click on the column and choose Show Trend Chart.
You now have a scorecard:
That was created directly from value frequencies in a profile BUT IS NOT connected to the
profile(s) from which the columns/virtual columns originated. This means you could delete
the profile that was used to create the scorecard without impacting the scorecard.
That you can update to include additional scores from ANY profile, which means the
scorecard can reference multiple data sources.
Note: If you delete the source, then both the profile and the scorecard become invalid.
Unit 5: Working with the Data Quality Assistant
Technical Description
The objective of the following lab is to enable users to become familiar with launching the DQA,
editing bad records and consolidating records through the DQA.
Objectives
Working with the Data Quality Assistant.
Step 2. Working with the DQA (this part is done by the Analysts):
Define the tables in the Analyst tool
Correct Bad records
Consolidate duplicate records
In this section we will deal with Step 2. The tables have already been created and populated by
mappings run by developers in the Developer Tool. Once the data has been written to the table the
Analyst can connect to the table to review and edit the data.
c) From the established connections choose Training as this is where the customer table is.
Click Next.
d) Select the table Bad_Customer_Records from the list of tables. The DQA knows that the
associated issue table will be the original table name with _issue appended. Click Next.
2) Select the checkbox to the left of the two KWIKSAVE records and press the Edit icon.
a) Check the box for the company field and enter the value KWIKSAVE. In the Audit Note
enter an explanation of why you are changing the name. Once complete press OK and
OK again on the pop up dialog.
3) Click on the SUPERPETZ (CREDIT ON HOLD) record and note that you are in edit mode. (If
you can't edit the field, click the Show button again.)
a) Correct the name of this and the Clares Pantry record by removing the words in brackets.
4) Select the checkbox to the left of the four records we edited. They are complete now so we
want to mark them as Accepted.
a) Once the records are selected click the Accept button to the right. The records will
remain in the view. To refresh the view press the Show button. Note the records will now
have been removed from the view.
a) Enter Spar_GBR as the name and from the dropdown select the ex_company field and
set to SPAR.
b) Press the button to add a condition line and set ex_country = GBR.
c) Once complete press OK and Save.
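The Spar_GBR filter defined above is simply the conjunction of the two conditions. In Python terms (field names as in the lab; the records are hypothetical):

```python
def spar_gbr(record: dict) -> bool:
    """Match records where ex_company is SPAR and ex_country is GBR."""
    return record["ex_company"] == "SPAR" and record["ex_country"] == "GBR"

records = [
    {"ex_company": "SPAR", "ex_country": "GBR"},
    {"ex_company": "SPAR", "ex_country": "IRL"},
]
matches = [r for r in records if spar_gbr(r)]
```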
d) From the list of available tables, choose CON_CUSTOMER and click Next.
2) Scroll to Cluster 8.
a) Using the green arrow to the left of the fields, create a master record by selecting the
most appropriate field from each record.
b) Once complete select Consolidate Cluster. Note the records are removed from the view.
b) Once extracted consolidate the 2 clusters. (The extracted cluster will become cluster
number 1)
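In the DQA the master record is assembled manually, field by field. An automated approximation, choosing the most frequent non-empty value per field, can be sketched like this (the cluster data is hypothetical):

```python
from collections import Counter

def build_master(cluster):
    """Build a master record from the most common non-empty value per field."""
    master = {}
    for field in cluster[0]:
        values = [r[field] for r in cluster if r[field]]
        master[field] = Counter(values).most_common(1)[0][0] if values else ""
    return master

cluster = [  # hypothetical duplicate cluster
    {"company": "KWIKSAVE", "city": "LEEDS"},
    {"company": "KWIK SAVE", "city": "LEEDS"},
    {"company": "KWIKSAVE", "city": ""},
]
master = build_master(cluster)
```

Manual selection remains preferable when the "best" value is not simply the most frequent one.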
An information bar at the bottom of the Audit Trail tab provides the number of records in each status group.
Note: All the audit details are stored in the RTM_DQA_AUDIT_TABLE table and RTM_DQA_AUDIT_DETAILS
table
The function of the workshop is to give users time to review what was covered on the course and
apply their knowledge to a different type of data. It should be viewed as a revision module that is
aimed at helping users become more confident with working with the Analyst Tool.
It will primarily cover Profiling, Rules and Scorecarding. The data is product data that has been
loaded into a table for you to work with.
b) Within the new project we want to connect to the table that contains the data. From the
Actions Menu on the right choose New > Table.
(ii) From the list of tables presented select the Product Data table and press Next, Next
and Finish.
Optional Labs
Informatica Data Quality Analyst 9.1 93
(iii) The table will be listed in your project.
Note: There is a known issue displaying large numeric values (GTIN for example) which will be corrected in the
next HF. This issue relates to how the data is displayed only and will not affect the labs.
____________________________________________________________________________________
b) Review the material_desc field. How consistent do the values appear? Write a list of
some of the issues that are in this field and how they could be corrected.
_________________________________________________________________________________
__________________________________________________________________________________
c) Based on first impressions are there invalid values in the material type field? What are
they?
____________________________________________________________________________________
d) The base unit of measure should be either KG or EA. How many do not conform?
____________________________________________________________________________________
e) How many of the gross and net weight values are null or 0?
____________________________________________________________________________________
f) The GTIN code should be made up of 14 numeric values. Do all of the records conform to
this?
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
____________________________________________________________________________________
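The GTIN conformity test from question f) can be expressed as a one-line check (the sample values are hypothetical):

```python
def conformant_gtin(value: str) -> bool:
    """A conformant GTIN is exactly 14 numeric characters."""
    return len(value) == 14 and value.isdigit()

gtins = ["00012345678905", "12345", "0001234567890A"]
non_conforming = [g for g in gtins if not conformant_gtin(g)]
```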
4 - Scorecard: Requirements
Once complete, build a scorecard on your profile to measure the following:
a) Completeness:
Material Description
Material Type
Vendor ID
Created_On
b) Conformity:
Base_UOM
Material code
Gtin
Vendor ID
c) Consistency:
Consistency between the Material Description and Material Type (Rule)
d) Accuracy:
Material Type Valid Values:
o SP, CD, SC, PC, DT, DFM, PSC, DB, PSD, OT, CDS, PP, PS
Product Unit Valid Values:
o MM
Volume Unit Valid Values:
o PP
Weight Unit Valid Values:
o KG
2) Open the Developer Tool and review that the Projects, Profiles, Rules, Tags/Comments and
Scorecard created in the Analyst tool are available for the Developer in the Developer Tool.