Unstructured Data Transformation Overview

By PenchalaRaju.Yanamala

Transformation type:
Active/Passive
Connected

The Unstructured Data transformation processes unstructured and semi-structured
file formats, such as messaging formats, HTML pages, and PDF documents. It also
transforms structured formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP,
and SWIFT.

The Unstructured Data transformation calls a Data Transformation service from a
PowerCenter session. Data Transformation is the application that transforms the
unstructured and semi-structured file formats. You can pass data from the
Unstructured Data transformation to a Data Transformation service, transform
the data, and return the transformed data to the pipeline.

Data Transformation has the following components:

Data Transformation Studio. A visual editor to design and configure
transformation projects.
Data Transformation Service. A Data Transformation project that is deployed
to the Data Transformation Repository and is ready to run.
Data Transformation repository. A directory that stores executable services
that you create in Data Transformation Studio. You can deploy projects to
different repositories, such as repositories for test and production services.
Data Transformation Engine. A processor that runs the services that you
deploy to the repository.

When Data Transformation Engine runs a service, it either writes the output data
to a file or returns the output data to the Integration Service. When Data
Transformation Engine returns output to the Integration Service, it returns XML
data. You can configure the Unstructured Data transformation to return the XML
in an output port, or you can configure output groups to return row data.

Configuring the Unstructured Data Option

The Unstructured Data transformation is installed with PowerCenter. Data
Transformation has a separate installer. Install the Data Transformation Server
and Client components after you install PowerCenter.

To install the Unstructured Data option, complete the following steps:

1. Install PowerCenter.
2. Install Data Transformation. For information about installing Data
   Transformation, see the Data Transformation Administrator Guide.
3. Configure the Data Transformation repository folder.

Configuring the Data Transformation Repository Directory

The Data Transformation repository contains executable Data Transformation
services. When you install Data Transformation, the installation creates the
following folder:

<Data_Transformation_install_dir>\ServiceDB

To configure a different repository folder location, open Data Transformation
Configuration from the Windows Start menu. The repository location is in the
following path in the Data Transformation Configuration:

CM Configuration > CM Repository > File System > Base Path

If Data Transformation Studio can access the remote file system, you can
change the Data Transformation repository to a remote location and deploy
services directly from Data Transformation Studio to the system that runs the
Integration Service. For more information about deploying services to remote
machines, see the Data Transformation Studio User Guide.

Copy custom files from the Data Transformation autoInclude\user or the
externLibs\user directory to the autoInclude\user or externLibs\user directory on
the machine that runs the Integration Service. For more information about these
directories, see the Data Transformation Engine Developer Guide.
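
The copy step above can be scripted. The following Python sketch is one way to do it, not a tool that ships with Data Transformation. The function name and the two installation root paths that you pass in are hypothetical. Only the autoInclude\user and externLibs\user directory names come from this guide.

```python
import shutil
from pathlib import Path

# Directory names come from the text above; everything else is illustrative.
CUSTOM_DIRS = (Path("autoInclude") / "user", Path("externLibs") / "user")


def copy_custom_files(source_home: Path, target_home: Path) -> list[Path]:
    """Copy the custom-file directories from one installation root to another.

    Returns the target directories that were written. Both roots are
    hypothetical paths supplied by the caller.
    """
    copied = []
    for rel in CUSTOM_DIRS:
        src = source_home / rel
        if not src.is_dir():
            continue  # no custom files of this kind to copy
        dst = target_home / rel
        # dirs_exist_ok (Python 3.8+) merges into a directory that already exists.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(dst)
    return copied
```

Call the function with the installation directory on the development machine and the installation directory on the machine that runs the Integration Service, for example over a mapped network drive.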

Data Transformation Service Types

When you create a project in Data Transformation Studio, you choose a Data
Transformation service type to define the project. Data Transformation has the
following types of services that transform data:

Parser. Converts source documents to XML. The output of a parser is always
XML. The input can have any format, such as text, HTML, Word, PDF, or HL7.
Serializer. Converts an XML file to an output document of any format. The
output of a serializer can be any format, such as a text document, an HTML
document, or a PDF.
Mapper. Converts an XML source document to another XML structure or
schema. A mapper processes the XML input similarly to a serializer. It
generates XML output similarly to a parser. The input and the output are fully
structured XML.
Transformer. Modifies data in any format by adding, removing, converting, or
changing text. Use transformers with a parser, mapper, or serializer. You can
also run a transformer as a stand-alone component.
Streamer. Splits large input documents, such as multi-gigabyte data streams,
into segments. The streamer processes documents that have multiple
messages or records in them, such as HIPAA or EDI files.
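
As a rough rule of thumb, you can pick a service type by asking whether the input and the output are XML. The Python sketch below encodes that rule. The function is hypothetical and is not part of Data Transformation Studio. Returning Transformer when neither side is XML is a simplifying assumption, and the Streamer is left out because it addresses input size rather than format.

```python
def suggest_service_type(input_is_xml: bool, output_is_xml: bool) -> str:
    """Suggest a service type from the input and output formats.

    Mirrors the list above: a mapper goes from XML to XML, a parser from
    any format to XML, and a serializer from XML to any format.
    """
    if input_is_xml and output_is_xml:
        return "Mapper"
    if output_is_xml:
        return "Parser"
    if input_is_xml:
        return "Serializer"
    # Assumption: non-XML to non-XML changes are handled by a transformer.
    return "Transformer"


# Example: a PDF source that must become XML calls for a parser.
print(suggest_service_type(input_is_xml=False, output_is_xml=True))
```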

For more information about creating projects with Data Transformation, see
Getting Started with Data Transformation.

Unstructured Data Transformation Components

The Unstructured Data transformation contains the following tabs:

Transformation. Enter the name and description of the transformation. The
naming convention for an Unstructured Data transformation is
UD_TransformationName. You can also make the Unstructured Data
transformation reusable.
Properties. Configure the Unstructured Data transformation general properties
such as IsPartitionable and Output is Repeatable.
UDT Settings. Modify Unstructured Data transformation settings such as input
type, output type, and service name.
UDT Ports. Configure Unstructured Data transformation ports and attributes.
Relational Hierarchy. Define a hierarchy of output groups and ports to enable
the Unstructured Data transformation to write rows to relational targets.

Properties Tab

Configure the Unstructured Data transformation general properties on the
Properties tab.

The following table describes properties on the Properties tab that you can
configure:

Property                 Description
Tracing Level            The amount of detail included in the session log when
                         you run a session containing this transformation.
                         Default is Normal.
IsPartitionable          The transformation can run in more than one partition.
                         Select one of the following options:
                         - No. The transformation cannot be partitioned.
                         - Locally. The transformation can be partitioned, but
                           the Integration Service must run all partitions in
                           the pipeline on the same node.
                         - Across Grid. The transformation can be partitioned,
                           and the Integration Service can distribute each
                           partition to different nodes.
                         Default is Across Grid.
Output is Repeatable     The order of the output data is consistent between
                         session runs.
                         - Never. The order of the output data is inconsistent
                           between session runs.
                         - Based On Input Order. The output order is consistent
                           between session runs when the input data order is
                           consistent between session runs.
                         - Always. The order of the output data is consistent
                           between session runs even if the order of the input
                           data is inconsistent between session runs.
                         Default is Never for active transformations. Default
                         is Based On Input Order for passive transformations.
Output is Deterministic  Indicates whether the transformation generates
                         consistent output data between session runs. Enable
                         this property to perform recovery on sessions that use
                         this transformation.

Warning: If you configure a transformation as repeatable and deterministic, it is
your responsibility to ensure that the data is repeatable and deterministic. If you
try to recover a session with transformations that do not produce the same data
between the session and the recovery, the recovery process can result in
corrupted data.

UDT Settings Tab

Configure the Unstructured Data transformation attributes on the UDT Settings
tab.

The following table describes the attributes on the UDT Settings tab:

Attribute             Description
InputType             Type of input data that the Unstructured Data
                      transformation passes to Data Transformation Engine.
                      Choose one of the following input types:
                      - Buffer. The Unstructured Data transformation receives
                        source data in the InputBuffer port and passes data
                        from the port to Data Transformation Engine.
                      - File. The Unstructured Data transformation receives a
                        source file path in the InputBuffer port and passes the
                        source file path to Data Transformation Engine. Data
                        Transformation Engine opens the source file.
OutputType            Type of output data that the Unstructured Data
                      transformation or Data Transformation Engine returns.
                      Choose one of the following output types:
                      - Buffer. The Unstructured Data transformation returns
                        XML data through the OutputBuffer port unless you
                        configure a relational hierarchy of output ports. If
                        you configure a relational hierarchy of ports, the
                        Unstructured Data transformation does not write to the
                        OutputBuffer port.
                      - File. Data Transformation Engine writes the output to a
                        file. It does not return the data to the Unstructured
                        Data transformation unless you configure a relational
                        hierarchy of ports in the Unstructured Data
                        transformation.
                      - Splitting. The Unstructured Data transformation splits
                        a large XML output file into smaller files that can fit
                        in the OutputBuffer port. You must pass the split XML
                        files to the XML Parser transformation.
ServiceName           Name of the Data Transformation service to run. The
                      service must be present in the local Data Transformation
                      repository.
Streamer Chunk Size   Buffer size of the data that the Unstructured Data
                      transformation passes to Data Transformation Engine when
                      the Data Transformation service runs a streamer. Valid
                      values are 1-1,000,000 KB. Default is 256 KB.
Dynamic Service Name  Run a different Data Transformation service for each
                      input row. When Dynamic Service Name is enabled, the
                      Unstructured Data transformation receives the service
                      name in the Service Name input port.
                      When Dynamic Service Name is disabled, the Unstructured
                      Data transformation runs the same service for each input
                      row. The Service Name attribute in the UDT Settings must
                      contain a service name. Default is disabled.
Status Tracing Level  Set the level of status messages from the Data
                      Transformation service.
                      - Description Only. Return a status code and a short
                        description to indicate if the Data Transformation
                        service was successful or if it failed.
                      - Full Status. Return a status code and a status message
                        from the Data Transformation service in XML.
                      - None. Do not return status from the Data Transformation
                        service.
                      Default is None.
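
The allowed values and defaults in the table can be captured in a small check. The Python sketch below is illustrative only. The UdtSettings class and its field names are hypothetical and do not correspond to a PowerCenter API. The Buffer defaults for the input and output types are the defaults that this guide gives for the create dialog box.

```python
from dataclasses import dataclass

INPUT_TYPES = ("Buffer", "File")
OUTPUT_TYPES = ("Buffer", "File", "Splitting")
STATUS_TRACING_LEVELS = ("None", "Description Only", "Full Status")


@dataclass
class UdtSettings:
    """Hypothetical container mirroring some attributes on the UDT Settings tab."""

    input_type: str = "Buffer"
    output_type: str = "Buffer"
    streamer_chunk_size_kb: int = 256  # valid values are 1 to 1,000,000 KB
    status_tracing_level: str = "None"

    def problems(self) -> list[str]:
        """Return one message for each value that the table does not allow."""
        found = []
        if self.input_type not in INPUT_TYPES:
            found.append(f"InputType must be one of {INPUT_TYPES}")
        if self.output_type not in OUTPUT_TYPES:
            found.append(f"OutputType must be one of {OUTPUT_TYPES}")
        if not 1 <= self.streamer_chunk_size_kb <= 1_000_000:
            found.append("Streamer Chunk Size must be 1 to 1,000,000 KB")
        if self.status_tracing_level not in STATUS_TRACING_LEVELS:
            found.append(f"Status Tracing Level must be one of {STATUS_TRACING_LEVELS}")
        return found
```

For example, UdtSettings(streamer_chunk_size_kb=0).problems() reports the chunk size, while the defaults report nothing.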

Viewing Status Tracing Messages

You can view status messages from the Data Transformation service. Set the
status tracing level to Description Only or Full Status. The Designer creates the
UDT_Status_Code and UDT_Status_Message output ports in the Unstructured Data
transformation.

When you choose Description Only, Data Transformation Engine returns a status
code and one of the following status messages:

Status Code  Status Message
1            Success
2            Warning
3            Failure
4            Error
5            Fatal Error

When you choose Full Status, Data Transformation Engine returns a status code
and the error message from the Data Transformation service. The message is in
XML format.
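
If you process the status ports downstream, the Description Only codes map to messages as in the following Python sketch. The function names and the row layout (a dictionary keyed by port name) are assumptions made for illustration. Treating every code other than 1 as a row that needs review is also an assumption, not documented product behavior.

```python
# Status codes and messages from the Description Only table.
STATUS_MESSAGES = {
    1: "Success",
    2: "Warning",
    3: "Failure",
    4: "Error",
    5: "Fatal Error",
}


def describe_status(code: int) -> str:
    """Return the status message for a code, or a placeholder if it is unknown."""
    return STATUS_MESSAGES.get(code, f"Unknown status code {code}")


def rows_needing_review(rows: list[dict]) -> list[dict]:
    """Keep the rows whose UDT_Status_Code is anything other than 1 (Success)."""
    return [row for row in rows if row["UDT_Status_Code"] != 1]
```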

Unstructured Data Transformation Ports

When you create an Unstructured Data transformation, the Designer creates
default ports. It creates other ports based on how you configure the
transformation. The Unstructured Data transformation input and output types
determine how the Unstructured Data transformation passes data to and
receives data from Data Transformation Engine.

Table 27-1 describes the Unstructured Data transformation default ports:

Table 27-1. Unstructured Data Transformation Default Ports

Port          Input/Output  Description
InputBuffer   Input         Receives source data when the input type is buffer.
                            Receives a source file name and path when the input
                            type is file.
OutputBuffer  Output        Returns XML data when the output type is buffer.
                            Returns the output file name when the output type
                            is file.
                            Returns no data when you configure hierarchical
                            output groups of ports.

Table 27-2 describes other Unstructured Data transformation ports that the
Designer creates when you configure the transformation:

Table 27-2. Unstructured Data Transformation Other Ports

Port                Input/Output  Description
OutputFileName      Input         Receives a name for an output file when the
                                  output type is file.
ServiceName         Input         Receives the name of a Data Transformation
                                  service when you enable Dynamic Service Name.
UDT_Status_Code     Output        Returns a status code from Data
                                  Transformation Engine when the status tracing
                                  level is Description Only or Full Status.
UDT_Status_Message  Output        Returns a status message from Data
                                  Transformation Engine when the status tracing
                                  level is Description Only or Full Status.

Note: You can add groups of output ports for relational targets on the Relational
Hierarchy tab. When you configure groups of ports, a message appears on the
UDT Ports tab that says hierarchical groups and ports are defined on another
tab.

Ports by Input and Output Type

The input type determines the type of data that the Integration Service passes to
Data Transformation Engine. The input type determines whether the input is data
or a source file path.

Configure one of the following input types:

Buffer. The Unstructured Data transformation receives source data in the
InputBuffer port. The Integration Service passes source rows from the
InputBuffer port to Data Transformation Engine.
File. The Unstructured Data transformation receives the source file path in the
InputBuffer port. The Integration Service passes the source file path to Data
Transformation Engine. Data Transformation Engine opens the source file. Use
the file input type to parse binary files such as Microsoft Excel or Microsoft Word
files.

If you do not define output groups and ports, the Unstructured Data
transformation returns data based on the output type.

Configure one of the following output types:

Buffer. The Unstructured Data transformation returns XML through the
OutputBuffer port. You must connect an XML Parser transformation to the
OutputBuffer port.
File. Data Transformation Engine writes the output file instead of passing data
to the Integration Service. Data Transformation Engine names the output file
based on the file name from the OutputFileName port. Choose the File output
type to transform XML to binary data such as a PDF file or a Microsoft Excel file.
The Integration Service returns the output file name in the OutputBuffer port for
each source row. If the output file name is blank, the Integration Service returns a
row error. When an error occurs, the Integration Service writes a null value to the
OutputBuffer port and returns a row error.
Splitting. The Unstructured Data transformation splits XML data from Data
Transformation Engine into multiple segments. Choose the Splitting output type
when the Unstructured Data transformation returns XML files that are too large
for the OutputBuffer port. When you configure Splitting output, pass the XML
data to the XML Parser transformation. Configure the XML Parser
transformation to process the multiple XML rows as one XML file.
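
The following Python sketch summarizes what the OutputBuffer port carries for the Buffer and File output types. It is a simplified model of the rules in this section, not PowerCenter code. The function name is hypothetical, the ValueError stands in for a row error, and the Splitting output type is left out to keep the sketch short.

```python
from typing import Optional


def output_buffer_value(
    output_type: str,
    has_output_groups: bool,
    xml: str = "",
    output_file_name: str = "",
) -> Optional[str]:
    """Model the value returned in the OutputBuffer port for one source row."""
    if has_output_groups:
        # Rows go to the group ports, so the OutputBuffer port returns no data.
        return None
    if output_type == "Buffer":
        return xml  # the XML that Data Transformation Engine returned
    if output_type == "File":
        if not output_file_name:
            # A blank output file name results in a row error.
            raise ValueError("row error: output file name is blank")
        return output_file_name
    raise ValueError(f"output type not covered by this sketch: {output_type}")
```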

Adding Ports

A Data Transformation service might require multiple input files, file names, and
parameters. It can return multiple output files. When you create an Unstructured
Data transformation, the Designer creates one InputBuffer port and one
OutputBuffer port. If you need to pass additional files or file names between the
Unstructured Data transformation and Data Transformation Engine, add the input
or output ports. You can add ports manually or from the Data Transformation
service.

The following table describes the ports you can create on the UDT Ports tab:

Port Type                   Input/Output  Description
Additional Input (buffer)   Input         Receives input data to pass to Data
                                          Transformation Engine.
Additional Input (file)     Input         Receives the file name and path for
                                          Data Transformation Engine to open.
Service Parameter           Input         Receives an input parameter for a
                                          Data Transformation service.
Additional Output (buffer)  Output        Receives XML data from Data
                                          Transformation Engine.
Additional Output (file)    Output        Receives an output file name from
                                          Data Transformation Engine.
Pass-through                Input/Output  Passes data through the Unstructured
                                          Data transformation without changing
                                          it.

Creating Ports From a Data Transformation Service

A Data Transformation service can require input parameters, additional input
files, or user-defined variables. The service might return more than one output
file to the Unstructured Data transformation. You can add ports that pass
parameters, additional input files, and additional output files. The Designer
creates ports that correspond to the ports in the Data Transformation service.

Note: You must configure a service name to populate ports from a service.

To create ports based on a Data Transformation service:

1. Click the UDT Ports tab on the Unstructured Data transformation.
2. Click Populate From Service.
   The Designer displays the service parameters, additional input, and additional
   output port requirements from the Data Transformation service. Service
   parameters include Data Transformation system variables and user-defined
   variables.
3. Select the ports to create and configure each port as a buffer port or file port.
4. Click Populate to create the ports that you select. You can select all ports that
   appear.

Defining a Service Name

When you create an Unstructured Data transformation, the Designer displays a
list of the Data Transformation services that are in the Data Transformation
repository. Choose the name of a Data Transformation service that you want to
call from the Unstructured Data transformation. You can change the service
name after you create the transformation. The service name appears on the UDT
Settings tab.

To run a different Data Transformation service for each source row, enable the
Dynamic Service Name attribute. Pass the service name with each source row.
The Designer creates the ServiceName input port when you enable dynamic
service names.
When you enable dynamic service names, you cannot create ports from a Data
Transformation service.

Relational Hierarchies

To pass row data to relational tables or other targets, configure output ports on
the Relational Hierarchy tab. You can define groups of ports and define a
relational structure for the groups.

When you configure output groups, the output groups represent the relational
tables or the targets that you want to pass the output data to. Data
Transformation Engine returns rows to the group ports instead of writing an XML
file to the OutputBuffer port. The transformation writes rows based on the output
type.

Create a hierarchy of groups in the left pane of the Relational Hierarchy tab. All
groups are under the root group called PC_XSD_ROOT. You cannot delete the
root. Each group can contain ports and other groups. The group structure
represents the relationship between target tables. When you define a group
within a group, you define a parent-child relationship between the groups. The
Designer defines a primary key-foreign key relationship between the groups with
a generated key.
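
To make the parent-child relationship concrete, the following Python sketch flattens nested order data into header rows and detail rows that share a generated key. It only illustrates the primary key-foreign key idea. The input structure, the function name, and the order_key column are hypothetical, and the keys that the Designer generates may look different.

```python
from itertools import count


def flatten_orders(orders: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split nested orders into header rows and detail rows.

    Each header row gets a generated key (primary key), and each detail
    row repeats the key of its parent header (foreign key).
    """
    header_rows, detail_rows = [], []
    keys = count(1)  # simple generated key sequence: 1, 2, 3, ...
    for order in orders:
        order_key = next(keys)
        header_rows.append({"order_key": order_key, "customer": order["customer"]})
        for line in order["lines"]:
            detail_rows.append(
                {"order_key": order_key, "item": line["item"], "qty": line["qty"]}
            )
    return header_rows, detail_rows


# Hypothetical input: two orders, the first with two detail lines.
headers, details = flatten_orders(
    [
        {"customer": "A", "lines": [{"item": "x", "qty": 1}, {"item": "y", "qty": 2}]},
        {"customer": "B", "lines": [{"item": "z", "qty": 5}]},
    ]
)
```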

Select a group to display the ports for the group. You can add or delete ports in
the group. When you add a port, the Designer creates a default port
configuration. Change the port name, datatype, and precision. If the port must
contain data, select Not Null. Otherwise, the output data is optional.

When you view the Unstructured Data transformation in the workspace, each
port in a transformation group has a prefix that contains the group name.

When you delete a group, you delete the ports in the group and the child groups.

Exporting the Hierarchy Schema

When you define hierarchical output groups in the Unstructured Data
transformation, you must define the same structure in the Data Transformation
project that you create to transform the data. Export the hierarchy structure as an
XML schema file from the Unstructured Data transformation. Import the schema
to your Data Transformation project. You can then map the content of a source
document to the XML elements and attributes in the Data Transformation project.

To export the group hierarchy from the Relational Hierarchy tab, click Export to
XML Schema. Choose a name and a location for the .xsd file. Choose a location
that you can access when you import the schema with Data Transformation
Studio.

The Designer creates an XML schema file with the following namespace:

"www.informatica.com/UDT/XSD/<mappingName_<Transformation_Name>>"

The schema includes the following comment:

<!-- ===== AUTO-GENERATED FILE - DO NOT EDIT ===== -->
<!-- ===== This file has been generated by Informatica PowerCenter ===== -->

If you modify the schema, the Data Transformation Engine might return data that
is not the same format as the output ports in the Unstructured Data
transformation.

The XML elements in the schema represent the output ports in the hierarchy.
Columns that can contain null values have the XML attributes minOccurs=0 and
maxOccurs=1.
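
The following Python sketch shows what such an element declaration can look like. It builds a single xs:element with the standard library and is not the file that the Designer exports. The port names and XSD types are hypothetical. In XML Schema, minOccurs and maxOccurs both default to 1, so a Not Null port needs neither attribute.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize with the conventional xs: prefix


def port_element(name: str, xsd_type: str, nullable: bool) -> ET.Element:
    """Build an xs:element declaration for one output port (illustrative only)."""
    attrs = {"name": name, "type": xsd_type}
    if nullable:
        # A column that can contain null values is optional in the XML.
        attrs["minOccurs"] = "0"
        attrs["maxOccurs"] = "1"
    return ET.Element(f"{{{XS}}}element", attrs)


# Hypothetical ports: one nullable, one marked Not Null.
nullable_port = port_element("ship_date", "xs:date", nullable=True)
required_port = port_element("order_id", "xs:string", nullable=False)
print(ET.tostring(nullable_port, encoding="unicode"))
```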

Mappings

When you create a mapping, design it according to the type of Data
Transformation project you are going to run. For example, the Data
Transformation Parser and Mapper generate XML data. You can configure the
Unstructured Data transformation to return rows from the XML data or you can
configure it to return an XML file.

The Data Transformation Serializer component can generate any output from
XML. It can generate HTML or binary files such as Microsoft Word or Microsoft
Excel. When the output is binary data, Data Transformation Engine writes the
output to a file instead of passing it back to the Unstructured Data transformation.

The following examples show how to configure mappings with an Unstructured
Data transformation.

Parsing Word Documents for Relational Tables

You can extract order information from a Microsoft Word document and write the
order information to an order header table and an order detail table. Configure an
Unstructured Data transformation to call a Data Transformation parser service
and pass the name of each Word document to parse. The Data Transformation
Engine opens the Word document, parses it, and returns the rows to the
Unstructured Data transformation. The Unstructured Data transformation passes
the order header and order details to the relational targets.

The mapping has the following objects:

Source Qualifier transformation. Passes each Microsoft Word file name to the
Unstructured Data transformation. The source file name contains the complete
path to the file that contains order information.
Unstructured Data transformation. The input type is file. The output type is
buffer. The transformation contains an order header output group and an order
detail output group. The groups have a primary key-foreign key relationship.
The Unstructured Data transformation receives the source file name in the
InputBuffer port. It passes the name to Data Transformation Engine. Data
Transformation Engine runs a parser service to extract the order header and
order detail rows from the Word document. Data Transformation Engine returns
the data to the Unstructured Data transformation. The Unstructured Data
transformation passes data from the order header group and order detail group
to the relational targets.
Relational targets. Receive the rows from the Unstructured Data
transformation.

Creating an Excel Sheet from XML

You can extract employee names and addresses from an XML file and create a
Microsoft Excel sheet with the list of names.

The mapping has the following components:

XML source file. Contains employee names and addresses.
Source Qualifier transformation. Passes XML data and an output file name to
the Unstructured Data transformation. The XML file contains employee names.
Unstructured Data transformation. The input type is buffer and the output
type is file. The Unstructured Data transformation receives the XML data in the
InputBuffer port and the file name in the OutputFileName port. It passes the
XML data and the file name to Data Transformation Engine.
Data Transformation Engine runs a serializer service to transform the XML data
to a Microsoft Excel file. It writes the Excel file with a file name based on the
value of OutputFileName.
The Unstructured Data transformation receives only the output file name from
Data Transformation Engine. The Unstructured Data transformation OutputBuffer
port returns the value of OutputFileName.
Flat file target. Receives the output file name.

Parsing Word Documents and Returning A Split XML File

The Data Transformation Parser and Mapper components can transform data
from any format and generate XML data. When the XML data is large, you can
split the XML into segments and pass the segments to an XML Parser
transformation. The XML Parser transformation receives the segments and
processes the XML data as one document.

When you configure the Unstructured Data transformation to split XML output,
the Unstructured Data transformation returns XML based on the OutputBuffer
port size. If the XML file size is greater than the output port precision, the
Integration Service divides the XML into files equal to or less than the port size.
The XML Parser transformation parses the XML and passes the rows to
relational tables or other targets.
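
The splitting rule can be pictured with a few lines of Python. The sketch below cuts the XML into segments that are no larger than the port size and shows that joining the segments restores the original document. It is a conceptual model, not the Integration Service implementation. The function name is hypothetical, and measuring the port size in bytes is an assumption.

```python
def split_segments(xml_bytes: bytes, port_size: int) -> list[bytes]:
    """Split XML data into consecutive segments of at most port_size bytes."""
    if port_size < 1:
        raise ValueError("port_size must be at least 1")
    return [
        xml_bytes[start:start + port_size]
        for start in range(0, len(xml_bytes), port_size)
    ]


# A small document and a deliberately tiny port size.
document = b"<orders>" + b"<order/>" * 10 + b"</orders>"
segments = split_segments(document, port_size=16)

# A downstream parser that treats the rows as one document sees the original XML.
reassembled = b"".join(segments)
```

A single segment is usually not well-formed XML by itself, which is why the segments must be processed together as one document.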

For example, you can extract the order header and detail information from
Microsoft Word documents with a Data Transformation parser service.

The mapping has the following components:

Source Qualifier transformation. Passes the Word document file name to the
Unstructured Data transformation. The source file name contains the complete
path to the file that contains order information.
Unstructured Data transformation. The input type is file. The output type is
splitting. The Unstructured Data transformation receives the source file name in
the InputBuffer port. It passes the file name to Data Transformation Engine.
Data Transformation Engine opens the source file, parses it, and returns XML
data to the Unstructured Data transformation.
The Unstructured Data transformation receives the XML data, splits the XML file
into smaller files, and passes the segments to an XML Parser transformation.
The Unstructured Data transformation returns data in segments less than the
OutputBuffer port size. When the transformation returns XML data in multiple
segments, it generates the same pass-through data for each row. The
Unstructured Data transformation returns data in pass-through ports whether a
row is successful or not.
XML Parser transformation. The Enable Input Streaming session property
is enabled. The XML Parser transformation receives the XML data in the
DataInput port. The input data is split into segments. The XML Parser
transformation parses the XML data into order header and detail rows. It passes
order header and detail rows to relational targets. It returns the pass-through
data to a Filter transformation.
Filter transformation. Removes the duplicate pass-through data before
passing it to the relational targets.
Relational targets. Receive data from each group in the XML Parser
transformation and the Filter transformation.
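
Because every XML segment row carries the same pass-through values, the Filter transformation has to keep one row per source document. The following Python sketch shows the idea on plain dictionaries. The row layout, the source_file column, and the function name are hypothetical, and the actual filter condition in a mapping depends on your data.

```python
def first_row_per_key(rows: list[dict], key: str) -> list[dict]:
    """Keep only the first row for each distinct value of the key column."""
    seen = set()
    kept = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            kept.append(row)
    return kept


# Hypothetical pass-through rows: a.doc was split into two segments.
rows = [
    {"source_file": "a.doc", "segment": 1},
    {"source_file": "a.doc", "segment": 2},
    {"source_file": "b.doc", "segment": 1},
]
unique_rows = first_row_per_key(rows, "source_file")
```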

Rules and Guidelines

Use the following rules and guidelines when you create an unstructured data
mapping:

When you configure hierarchical groups of output ports, the Integration Service
writes to the groups of ports instead of writing to the OutputBuffer port. The
Integration Service writes to the groups of ports regardless of the output type
you define for the transformation.
If an Unstructured Data transformation has the File output type, and you have
not defined group output ports, you must link the OutputBuffer port to a
downstream transformation. Otherwise, the mapping is invalid. The
OutputBuffer port contains the output file name when the Data Transformation
service writes the output file.
Enable Dynamic Service Name to pass a service name to the Unstructured
Data transformation in the Service Name input port. When you enable Dynamic
Service Name, the Designer creates the Service Name input port.
You must configure a service name with the Unstructured Data transformation
or enable the Dynamic Service Name option. Otherwise the mapping is invalid.
Link XML output from the Unstructured Data transformation to an XML Parser
transformation.
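
Two of these rules make a mapping invalid, and you can express them as a simple check. The Python sketch below is illustrative and is not how the Designer validates mappings. The UdtMappingCheck class, its field names, and the message texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UdtMappingCheck:
    """Hypothetical summary of one Unstructured Data transformation in a mapping."""

    output_type: str            # "Buffer", "File", or "Splitting"
    has_output_groups: bool     # groups defined on the Relational Hierarchy tab
    output_buffer_linked: bool  # OutputBuffer is linked to a downstream transformation
    service_name: str           # empty if no service name is configured
    dynamic_service_name: bool  # Dynamic Service Name option is enabled


def validation_errors(t: UdtMappingCheck) -> list[str]:
    """Return the reasons why the mapping would be invalid under the rules above."""
    errors = []
    if t.output_type == "File" and not t.has_output_groups and not t.output_buffer_linked:
        errors.append("Link the OutputBuffer port to a downstream transformation.")
    if not t.service_name and not t.dynamic_service_name:
        errors.append("Configure a service name or enable Dynamic Service Name.")
    return errors
```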

Steps to Create an Unstructured Data Transformation

Create an Unstructured Data transformation in the PowerCenter Transformation
Developer or the Mapping Designer.

To create an Unstructured Data transformation:

1. In the Mapping Designer or Transformation Developer, click Transformation >
   Create.
2. Select Unstructured Data Transformation as the transformation type.
3. Enter a name for the transformation.
4. Click Create.
   The Unstructured Data Transformation dialog box appears.
5. Configure the following properties:

   Property      Description
   Service Name  Name of the Data Transformation service you want to use. The
                 Designer displays the Data Transformation services in the
                 Data Transformation repository folder. Do not choose a name
                 if you plan to enable dynamic service names. You can add a
                 service name on the UDT Settings tab after you create the
                 transformation.
   Input Type    Describes how Data Transformation Engine receives input data.
                 Default is Buffer.
   Output Type   Describes how Data Transformation Engine returns output data.
                 Default is Buffer.

6. Click OK.
7. You can change the service name, input, and output type on the UDT Settings
   tab.
8. Configure the Unstructured Data transformation properties on the Properties
   tab.
9. If the Data Transformation service has more than one input or output file, or if it
   requires input parameters, you can add ports on the UDT Ports tab. You can
   also add pass-through ports on the UDT Ports tab.
10. If you want to return row data from the Unstructured Data transformation
    instead of XML data, create groups of output ports on the Relational Hierarchy
    tab.
11. If you create groups of ports, export the schema that describes them from the
    Relational Hierarchy tab.
12. Import the schema to the Data Transformation project to define the project
    output.
