
Question 1
==========================================================
Phases
Phases are used to break the graph into pieces. Temporary files created during a phase are deleted after it completes. Phases are used to manage the resource-consuming (memory, CPU, disk) parts of the application separately and effectively.

Phases vs Checkpoints
Checkpoints are created for recovery purposes: points where everything is written to disk. You can recover to the latest saved checkpoint and rerun from it. You can have phase breaks with or without checkpoints.

xfr
A new sandbox has many directories: mp, dml, xfr, db, ... . xfr is the directory where you put files with the extension .xfr containing your own custom functions (which you then use via: include "somepath/xfr/yourfile.xfr"). Usually XFR stores mappings.

Three types of parallelism
1) Data parallelism - partitioning of data into parallel streams for parallel processing.
2) Component parallelism - components execute simultaneously on different branches of the graph.
3) Pipeline parallelism - a downstream component processes records while the upstream component is still producing them.

Memory requirements of a graph (how to calculate)
Each partition of a component uses roughly 8 MB + max-core (if any). Add the size of the lookup files used in the phase (if multiple components use the same lookup, count it only once). Multiply by the degree of parallelism. Add up all components in a phase; that is how much memory that phase uses. The largest-memory phase determines the memory requirement of the graph.

Multi-File System (MFS)
m_mkfs - create a multifile (m_mkfs ctrlfile mpfile1 ... mpfileN)
m_ls - list multifiles
m_rm - remove a multifile
m_cp - copy a multifile
m_mkdir - add more directories to an existing directory structure
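A minimal command-line sketch of these utilities, following the syntax given above; all paths are made up for illustration:

#!/bin/ksh
# Create a 4-way multifile system: a control directory plus 4 data partitions.
m_mkfs /u/ctrl/mfs4 /u/d1/mfs4 /u/d2/mfs4 /u/d3/mfs4 /u/d4/mfs4

m_ls /u/ctrl/mfs4                                    # list the multifile directory
m_cp /u/ctrl/mfs4/cust.dat /u/ctrl/mfs4/cust.bak     # copy a multifile
m_rm /u/ctrl/mfs4/cust.bak                           # remove a multifile
m_mkdir /u/ctrl/mfs4/archive                         # add a directory to the structure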

Scan with Rollup
Scan followed by a Dedup Sort, selecting the last record of each group.

Dedup sort with null key
If we don't use any key in the sort component while using the Dedup Sort, the output depends on the keep parameter:
first - only the first record
last - only the last record
unique_only - there will be no records in the output file

Join on partitioned flow
file1 (A,B,C) and file2 (A,B,D). We partition both files by "A", and then join by "A,B". Is that OK, or should we partition by "A,B"? Partitioning by "A" is sufficient: records with equal "A,B" necessarily have equal "A", so they land in the same partition and the join by "A,B" remains correct.

Checkin, checkout
You can do checkin/checkout using the wizard right from the GDE, using versions and tags.

How to have different passwords for QA and production
Parameterize the .dbc file - or use an environment variable.

How to get records 50-75 out of 100
Several ways:
- m_dump <dml> <mfs file> -start 50 -end 75
- use the next_in_sequence() function in a Filter by Expression component: next_in_sequence() >= 50 && next_in_sequence() <= 75
- use scan and filter

How to convert a serial file into MFS
Create an MFS, then use a partition component.
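A hypothetical invocation of the m_dump approach described above; the dml file and data file names are made up:

# Dump records 50-75 of a multifile, given its record format in cust.dml.
m_dump cust.dml /u/ctrl/mfs4/cust.dat -start 50 -end 75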

Project parameters vs. sandbox parameters
When you check out a project into your sandbox, you get project parameters. Once in your sandbox, you can refer to them as sandbox parameters.

Bad straight flow
The error you get when connecting mismatching components (for example, connecting a serial flow directly to an MFS flow without using a partition component).

Merging graphs
You cannot merge two Ab Initio graphs. You can use the output of one graph as input for another, or copy/paste contents between graphs. See also the use of .plan.

Partitioning, repartitioning, departitioning
partitioning - dividing a single flow of records (serial file, MFS) into multiple flows.
departitioning - removing partitioning (gather and merge components).
re-partitioning - changing the number of partitions (e.g., from 2 to 4 flows).

Lookup file
For large amounts of data use an MFS lookup file (instead of a serial one).

Indexing
No indexes as such. But there is "output indexing" using a Reformat component and doing the necessary coding in the transform part.

Environment project
A special public project that exists in every Ab Initio environment. It contains all the environment parameters required by the private or public projects which constitute the AI Standard Environment.

Aggregate vs Rollup
Aggregate is the old component. Rollup is newer and extended (built-in functions like sum, count, avg, min, max, product, ...) and is recommended instead of Aggregate.

EME, GDE, Co>Operating System
EME = Enterprise Metadata Environment. Functions: repository, version control, statistical analysis, dependency analysis. It is on the server side and holds all the projects (metadata of transformations, config info, source and target info: graph, dml, xfr, ksh, sql, etc.). This is where you check in/check out. The /Project directory of the EME contains common directories for all application sandboxes connected to it. It also helps in dependency analysis of code. Ab Initio has a series of air commands to manipulate repository objects.
GDE = Graphical Development Environment (on the client box).
Co>Operating System = the Ab Initio server software installed on top of the native (unix) OS on the server.

Fencing
Fencing means job control on a priority basis - changing the priority of a job. In Ab Initio it actually refers to customized phase breaking: a well fenced graph will not run into deadlocks no matter what the source data volume is, because it limits the number of simultaneous processes.
Phasing - managing resources to avoid deadlocks, for example by limiting the number of simultaneous processes (breaking the graph into phases, only one of which can run at any given time).

Question 2
==========================================================
Continuous components
Continuous components produce usable output while running continuously. Examples: Continuous Rollup, Continuous Update, Batch Subscribe.

Deadlock
Deadlock occurs when two or more processes request the same resource. To avoid deadlocks, use phasing and resource pooling.

Environment
AB_HOME - where the Co>Operating System is installed.
AB_AIR_ROOT - default location for the EME datastore.

Standard environment
Sandbox parameters such as AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS, etc. To see them from the unix prompt: env | grep AI

Wrapper script
A unix script used to run graphs.

Multistage component
A multistage component is a transform component which transforms input records in five stages (1. input select, 2. temporary initialization, 3. processing, 4. output selection, 5. finalize) - that is, a transform component which has packages. Examples: Scan, Rollup, Normalize, and Denormalize Sorted.

Dynamic DML
Dynamic DML is used when the input metadata can change. Example: at different times, different input files are received for processing, each with a different dml. In that case we can use a flag in the dml: the flag is read first from the received input file, and according to the flag the corresponding dml is used.

Fan in, fan out
fan in - a departition component (decreases parallelism).
fan out - a partition component (increases parallelism).

Lock
A user can lock a graph for editing, so that others will see a message and cannot edit the same graph.

Join vs lookup
Lookup is good for speed with small files (it loads the whole file into memory). For large files use Join. You may need to increase the max-core limit to handle big joins.

Multi update
multi update executes SQL statements - it treats each input record as a completely separate piece of work.

Scheduler
We can use Autosys, Control-M, or any other external scheduler, and we can take care of dependencies in many ways. For example, if scripts should run sequentially, we can arrange for this in Autosys, or we can create a wrapper script that runs the commands one after another (command1.ksh; command2.ksh); to run them in parallel instead, use nohup command1.ksh & nohup command2.ksh & . We can even create a special graph in Ab Initio to execute individual scripts as needed.
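A minimal wrapper-script sketch of both styles; the graph script names are made up for illustration:

#!/bin/ksh
# Sequential execution: the second graph starts only if the first succeeds.
./my_graph1.ksh || { echo "my_graph1 failed"; exit 1; }
./my_graph2.ksh || { echo "my_graph2 failed"; exit 1; }

# Parallel execution of independent graphs: start both, then wait for them.
nohup ./my_graph3.ksh &
nohup ./my_graph4.ksh &
wait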

API and utility modes in input table
These are database interfaces (api - uses SQL; utility - bulk loads, or whatever the vendor provides).

Lookup file
The Lookup File component. Functions: lookup, lookup_count, lookup_next, lookup_match, lookup_local. Lookups are always used in combination with Reformat components.

Calling a stored proc in the DB
You can call a stored procedure (for example, from an input component). In fact, you can even write stored procedures in Ab Initio. Make it "with recompile" to assure good performance.

Frequently used functions
string_ltrim, string_lrtrim, string_substring, reinterpret_as, today(), now()

Data validation
is_valid, is_null, is_blank, is_defined

Driving port
When joining inputs (in0, in1, ...), one of the ports is used as "driving" (by default - in0). The driving input is usually the largest one, whereas the smallest one can have its "Sorted-Input" parameter set to "Input need not be sorted" because it will be loaded completely into memory.

Ab Initio benefits
Parallelism built in, multifile system, handles huge amounts of data, easy to build and run. Generates scripts which can be easily modified as needed (if something couldn't be done in the ETL tool itself). The scripts can be easily scheduled using any external scheduler and easily integrated with other systems.

Ab Initio vs Informatica for ETL

Ab Initio doesn't require a dedicated administrator. Ab Initio doesn't have built-in CDC (Change Data Capture) capabilities. Ab Initio allows you to attach error/reject files to each transformation and to capture and analyze the messages and data separately (as opposed to Informatica, which has just one huge log). Ab Initio provides immediate metrics for each component.

Override key
The override-key option is used when we need to join 2 fields which have different field names.

Control file
The control file should be in the multifile directory (it contains the addresses of the serial files).

max-core
The max-core parameter (for example, 100 MB for a Sort) specifies the amount of memory used by a component (like Sort or Rollup) - per partition - before spilling to disk. Usually you don't need to change it - just use the default value. Setting it too high may degrade performance because of OS swapping and degradation of the performance of other components.

Input parameters
Graph > Parameters tab > click "Create" - and create a parameter. Usage: $paramname. Also Edit > Parameters. These parameters will be substituted during run time. You may need to declare your parameter scope as formal.

Error trapping
Each component has reject, error, and log ports. Reject captures rejected records, error captures the corresponding errors, and log captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to Never Abort, Abort on first reject, or setting ramp/limit. You can also use the force_error() function in a transform function.

Question 3
==========================================================
How to see resource usage
In the GDE go to View > Tracking Details - you will see each component's CPU and memory usage, etc.

Assign keys component
Easy to use and saves development time, but you need to understand how to feed its parameters, and you can't control it easily.

Join in DB vs join in Ab Initio
Scenario 1 (preferred): run a query which joins the 2 tables in the database and returns the result via just one DB component.
Scenario 2 (much slower): use 2 database components, extract all the data - and join it in Ab Initio.

Join with DB
Not recommended if the number of records is big. It is better to retrieve the data out - and then join in Ab Initio.

dbc vs cfg
.dbc - database configuration file (db name, nodes, version, user/pwd) - resides in the db directory.
.cfg - any type of config file; for example, a remote connection config (name of the remote server, user/pwd to connect to the db, location of the OS on the remote machine, connection method). A .cfg file resides in the config directory.

Compilation errors
depth not equal, data format error, etc. A depth error occurs when two components are connected together but their layouts don't match.

Types of partitions
Broadcast, Partition by Expression, Partition by Round robin, Partition by Key, Partition with Load balance.

Unused port
When joining, used records go to the output port, unused records - to the unused port.

Tuning performance
- Go parallel using partitioning. Round-robin partitioning gives good balance. Use the Multi-File System (MFS). Use an ad hoc MFS to read many serial files in parallel, and use the Concatenate component.
- Once data is partitioned, do not switch it to serial and back; repartition instead.
- Do not access large files via NFS - use FTP instead.
- Use lookup_local rather than lookup (especially for big lookups).
- Use Rollup and Filter as early as possible to reduce the number of records; ideally do it in the source (database) before you get the data.
- Remove unnecessary components. For example, instead of using Filter by Expression, you can implement the same function in a Reformat/Join/Rollup. Another example: when joining data from 2 files, use a union function instead of adding an additional component for removing duplicates.
- Use Gather instead of Concatenate.
- It is faster to sort after a partition than to sort before a partition.
- Try to avoid using a join with the "db" component.
- When getting data from a database, make sure your queries are fast (use indexes, etc.). If possible, do the necessary selection / aggregation / sorting in the database before getting the data into Ab Initio.
- Tune max-core for optimal performance (for a sort it depends on the size of the input file). Note: if an in-memory join cannot fit its non-driving inputs in the provided max-core, it will drop all the inputs to disk, and in-memory processing makes no sense.
- Phase breaks let you allocate more memory to individual components - thus improving performance.
- Use a checkpoint after a sort to land data on disk.
- Use the Join and Rollup in-memory feature.
- When joining a very small dataset to a very large dataset, it is more efficient to broadcast the small dataset to MFS using the Broadcast component, or to use the small file as a lookup. But for a large dataset, don't use Broadcast as a partitioner.
- Use an Ab Initio layout instead of the database default to achieve parallel loads.
- Change the AB_REPORT parameter to increase the monitoring duration.
- Use catalogs for reusability.
- Components like Join/Rollup should have the option "Input must be sorted" set when they are placed after a Sort component.
- Minimize the number of Sort components. Minimize the use of the sorted Join component; if possible, replace it with an in-memory/hash join.
- Use only the required fields in Sort, Reformat, and Join components.
- Use "Sort within Groups" instead of a plain Sort when the data is already presorted.
- Use phasing/flow buffers in the case of merge-sorted joins.
- Minimize the use of regular-expression functions like re_index in transform functions.
- Avoid repartitioning data unnecessarily.
- When splitting records into more than two flows, use a Reformat rather than a Broadcast component.
- For joining records from 2 flows, use the Concatenate component ONLY when there is a need to follow a specific order in joining records; if no order is required, it is preferable to use the Gather component.
- Instead of putting many Reformat components consecutively, use the output indexes parameter in the first Reformat component and put the condition there.

Delta table
A delta table maintains the sequencer of each data table.

Master (base) table
A table on top of which we create a view.

Data skew
A parameter showing how unevenly data is distributed between partitions:
skew = (partition size - avg. partition size) * 100 / (size of the largest partition)
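A hypothetical worked sketch of this skew formula in ksh/awk; the partition sizes are made up:

#!/bin/ksh
# Compute per-partition skew from made-up partition sizes (in MB),
# using the formula above: (size - average) * 100 / largest.
sizes="120 80 100 100"    # 4 partitions: average = 100, largest = 120
echo $sizes | awk '{
    n = NF; max = 0; sum = 0
    for (i = 1; i <= n; i++) { sum += $i; if ($i > max) max = $i }
    avg = sum / n
    for (i = 1; i <= n; i++)
        printf "partition %d: skew = %.1f%%\n", i, ($i - avg) * 100 / max
}'
# Output: partition 1: 16.7%, partition 2: -16.7%, partitions 3-4: 0.0%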

Scan vs Rollup
rollup - performs aggregate calculations on groups; scan - calculates cumulative totals.

Packages
Used in multistage components or transform components.

Reformat vs "Redefine Format"
Reformat - derives new data by adding/dropping fields.
Redefine Format - renames fields.

Conditional DML
DML which is chosen based on a condition.

SORTWITHINGROUP
The prerequisite for using Sort within Groups is that the data is already sorted by the major key. Sort within Groups outputs the data once it has finished reading a major-key group. It is like an implicit phase.

Passing a condition as a parameter

Define a formal keyword parameter of type string. For example, call it FilterCondition, and suppose you want it to filter on COUNT > 0. In your graph, in your Filter by Expression component, enter the following condition: $FilterCondition. Then on your command line or in a wrapper script give the following command: YourGraphname.ksh -FilterCondition "COUNT > 0" (quote the condition so the shell passes it as a single argument).
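A hypothetical invocation sketch using the graph and parameter names from above:

#!/bin/ksh
# Quoting keeps the whole condition as one argument to the graph script.
COND="COUNT > 0"
./YourGraphname.ksh -FilterCondition "$COND"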

Passing file name as a parameter

#!/bin/ksh
# Run the environment set-up script.
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Pass the two script parameters to the graphs as input file names.
if [ $# -eq 2 ]; then
    INPUT_FILE_PARAMETER_1=$1
    INPUT_FILE_PARAMETER_2=$2
    cd $AI_RUN
    # This graph uses the first input file.
    ./my_graph1.ksh $INPUT_FILE_PARAMETER_1
    # This graph uses the second input file.
    ./my_graph2.ksh $INPUT_FILE_PARAMETER_2
    exit 0
else
    echo "Insufficient parameters"
    exit 1
fi
------------------------------------

#!/bin/ksh
# Run the environment set-up script.
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Export the first script parameter as INPUT_FILE_NAME.
export INPUT_FILE_NAME=$1
cd $AI_RUN
# This graph uses the input file.
./my_graph1.ksh
# This graph also uses the input file.
./my_graph2.ksh
exit 0

How to remove header and trailer lines?
Use conditional DML, which lets you separate the detail records from the header and trailer. For validations, use a Reformat with count 3 (out0: header, out1: detail, out2: trailer).

How to create a multi file system on Windows
First method: in the GDE go to Run > Execute Command - and run m_mkfs c:control c:dp1 c:dp2 c:dp3 c:dp4
Second method: double-click on the file component, and in the Ports tab double-click on partitions - there you can enter the number of partitions.

Vector
A vector is simply an array: an ordered set of elements of the same type (the type can be any type, including a vector or a record).

Dependency analysis
Dependency analysis answers questions regarding data lineage: where does the data come from, which applications produce it, which depend on it, etc.

Question 4
==========================================================
.abinitiorc
The config file for Ab Initio - in the user's home directory and in $AB_HOME/Config. It sets the Ab Initio home path.

Surrogate key
There are many ways to create a surrogate key. For example, use the next_in_sequence() function in your transform, or use the "Assign key values" component, or write a stored procedure and call it.
Note: if you use partitions, then do something like this:
(next_in_sequence()-1)*no_of_partition()+this_partition()
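Worked example of the partitioned formula (assuming next_in_sequence() starts at 1 and this_partition() is zero-based): with 4 partitions, partition 0 generates keys 0, 4, 8, ..., partition 1 generates 1, 5, 9, ..., so the keys from different partitions never collide.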
