What is the architecture of any Data warehousing project? What is the flow?
1) The basic step of data warehousing starts with data modelling, i.e. the creation of
dimensions and facts.
2) The data warehouse starts with the collection of data from source systems such as
OLTP, CRM, ERP, etc.
3) The cleansing and transformation process is done with an ETL (Extraction, Transformation,
Loading) tool.
4) By the end of the ETL process, the target databases (dimensions, facts) are ready with data
that satisfies the business rules.
5) Finally, with the use of reporting (OLAP) tools, we can get the information which
is used for decision support.
In the bottom-up approach, we first build data marts and then the data warehouse. The data
mart that is built first serves as a proof of concept for the others. This takes less time and
costs less than the top-down approach.
only about 15% of the size when compared to the FACT table, so only denormalization is
promoted in universe design.
What is a factless fact table? Where have you used it in your project?
A factless fact table contains nothing but dimensional keys. It is used to support negative
analysis reports, for example a store that did not sell a product for a given period.
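As a rough illustration of that negative analysis (the keys and values below are invented, not from any project), a small Python sketch: a factless coverage fact holds only dimension keys, and subtracting the sales fact keys from it leaves the promoted products that did not sell.

# Factless "promotion coverage" fact: (date_key, store_key, product_key) rows
promotion_coverage = {
    (20240101, 10, 501),
    (20240101, 10, 502),
    (20240101, 10, 503),
}
# Sales fact keys at the same grain (only products that actually sold)
sales = {
    (20240101, 10, 501),
}
# Promoted but not sold = coverage minus sales
did_not_sell = promotion_coverage - sales
print(did_not_sell)   # {(20240101, 10, 502), (20240101, 10, 503)}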
What is snapshot?
A snapshot is a static data source; it is a permanent local copy or picture of a report,
and it is suitable for disconnected networks. We cannot add any columns to a snapshot. We can
sort, group, and aggregate, and it is mainly used for analyzing historical data.
What is a cube, why do we create a cube, and what is the difference between ETL and OLAP cubes?
Any schema, table, or report which gives you meaningful information about one
attribute with respect to more than one attribute is called a cube. For example, in a product table with
Product ID and Sales columns, we can analyze Sales with respect to Product Name, but if you
analyze Sales with respect to Product as well as Region (Region being an attribute in the Location table), the
resulting report, table, or schema would be a cube.
ETL cubes: built in the staging area to load frequently accessed reports to the target.
Reporting cubes: built after the actual load of all the tables to the target, depending on the
customer's requirements for business analysis.
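A small Python sketch of that idea (product names and figures are made up): the same sales rows summarized against one attribute versus against two attributes; the two-attribute result is the kind of cross-tabulated output the answer calls a cube.

from collections import defaultdict

rows = [
    ("Phone", "East", 100),
    ("Phone", "West", 150),
    ("Laptop", "East", 200),
]
by_product = defaultdict(int)          # Sales wrt Product only
by_product_region = defaultdict(int)   # Sales wrt Product and Region
for product, region, amount in rows:
    by_product[product] += amount
    by_product_region[(product, region)] += amount
print(dict(by_product))         # {'Phone': 250, 'Laptop': 200}
print(dict(by_product_region))  # {('Phone', 'East'): 100, ('Phone', 'West'): 150, ('Laptop', 'East'): 200}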
How is data in the data warehouse stored after it has been extracted and transformed from
heterogeneous sources, and where does the data go from the data warehouse?
Data in a data warehouse is stored in the form of relational tables; most data warehouses
take a snowflake schema approach.
To customize the warehouse architecture for different groups in the organization, data marts
are added and used.
Architecture: Source -> Staging Area -> Warehouse -> Data Marts -> End Users
What is ODS?
ODS stands for Operational Data Store.
What is the need for a surrogate key? Why is the primary key not used as the surrogate key?
A surrogate key is an artificial identifier for an entity. Surrogate key values are
generated by the system sequentially (like the Identity property in SQL Server and a Sequence
in Oracle). They do not describe anything.
A primary key is a natural identifier for an entity. Primary key values are entered
manually by the user and uniquely identify each record; there is no repetition of data.
If a column is made a primary key and later there needs to be a change in the data type or the
length of that column, then all the foreign keys that depend on that primary key
must also be changed, making the database unstable.
Surrogate Keys make the database more stable because it insulates the Primary and
foreign key relationships from changes in the data types and length.
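A minimal Python sketch of the idea (names are hypothetical): the surrogate is just the next number in a system-generated sequence, the way an Identity column or a Sequence would assign it, and it carries no business meaning.

surrogate_of = {}   # natural key -> surrogate key
next_key = 1

def get_surrogate(natural_key):
    # Return the existing surrogate for a natural key, or assign the next one in sequence.
    global next_key
    if natural_key not in surrogate_of:
        surrogate_of[natural_key] = next_key
        next_key += 1
    return surrogate_of[natural_key]

print(get_surrogate("CUST-0001"))   # 1
print(get_surrogate("CUST-0002"))   # 2
print(get_surrogate("CUST-0001"))   # 1 again: same entity, same surrogate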
A degenerate dimension is a dimension key without a corresponding dimension table. Example:
in the PointOfSale transaction fact table, we have:
Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), and
POS Transaction Number.
The Date dimension corresponds to the Date Key and the Product dimension corresponds to the
Product Key. In a traditional parent-child database, the POS Transaction Number would
be the key to the transaction header record that contains all the information valid for the
transaction as a whole, such as the transaction date and store identifier. But in this
dimensional model, we have already extracted this information into other dimensions. Therefore,
the POS Transaction Number looks like a dimension key in the fact table but does not have
a corresponding dimension table.
Therefore, the POS Transaction Number is a degenerate dimension.
A cube can be stored on a single analysis server and then defined as a linked cube on
other Analysis servers. End users connected to any of these analysis servers can then
access the cube. This arrangement avoids the more costly alternative of storing and
maintaining copies of a cube on multiple analysis servers. linked cubes can be connected
using TCP/IP or HTTP. To end users a linked cube looks like a regular cube.
What is the difference between snowflake and star schema? In what situations is the snowflake
schema better than the star schema, and when is the opposite true?
A star schema has a centralized fact table surrounded by different dimensions.
A snowflake schema is the same star schema in which dimensions are split into further dimensions.
A star schema contains highly denormalized data; a snowflake schema contains partially
normalized data.
A star schema cannot have parent tables, but a snowflake schema contains parent tables.
Why go for a star schema:
1) fewer joins
2) a simpler database
3) it supports drill-up options
Why go for a snowflake schema:
sometimes we have to provide separate dimensions from existing dimensions, and in that case
we go for a snowflake.
Disadvantage of the snowflake schema:
query performance is lower because more joins are involved.
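A tiny Python sketch of the split described above (column names are invented): the star version keeps one wide, denormalized product dimension, while the snowflake version normalizes the category attributes into a separate parent table that needs an extra join.

# Star: one wide, denormalized dimension row
star_product_dim = {
    "product_id": 501, "product_name": "Phone",
    "category_name": "Electronics", "category_manager": "Smith",
}
# Snowflake: category attributes split out into a parent table
snow_category_dim = {10: {"category_name": "Electronics", "category_manager": "Smith"}}
snow_product_dim = {"product_id": 501, "product_name": "Phone", "category_id": 10}
# Reading the category from the snowflake needs the extra lookup (join)
category = snow_category_dim[snow_product_dim["category_id"]]
print(star_product_dim["category_name"], category["category_name"])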
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
What is the main difference between Inmon and Kimball philosophies of data
warehousing?
The two differ in their concept of building the data warehouse.
According to Kimball:
Kimball views data warehousing as a constituency of data marts. Data marts are focused
on delivering business objectives for departments in the organization, and the data
warehouse is a conformed dimension of the data marts. Hence a unified view of the
enterprise can be obtained from dimensional modeling at a local, departmental level.
Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the
development of the data warehouse can start with data from the online store. Other
subject areas can be added to the data warehouse as the need arises; point-of-sale (POS)
data can be added later if management decides it is necessary.
That is:
Kimball: first data marts, then combined into the data warehouse.
Inmon: first the data warehouse, later the data marts.
What are the data types present in BO, and what happens if we implement a view in Designer
and in a report?
In my knowledge, these are called object types in Business Objects. An alias is
different from a view in the universe: a view is at the database level, whereas an alias is a
different name given to the same table to resolve loops in the universe.
The different data types in Business Objects are: 1. Character, 2. Date, 3. Long text, 4.
Number.
navigation services. Metadata includes things like the name, length, valid values, and
description of a data element. Metadata is stored in a data dictionary and repository. It
insulates the data warehouse from changes in the schema of operational systems.
Metadata synchronization is the process of consolidating, relating, and synchronizing data
elements with the same or similar meaning from different systems. Metadata
synchronization joins these differing elements together in the data warehouse to allow for
easier access.
Data warehouses are built to support on-line analytical processing (OLAP). As a result, data
warehouses are designed differently than traditional relational databases.
What is normalization? What are First Normal Form, Second Normal Form, and Third Normal Form?
Normalization can be defined as splitting a table into separate tables so as to avoid
duplication of values. First Normal Form requires atomic column values with no repeating groups;
Second Normal Form additionally requires that every non-key column depend on the whole primary
key; Third Normal Form additionally requires that non-key columns have no transitive
dependencies on other non-key columns.
What are Semi-additive and factless facts and in which scenario will you use such kinds
of fact tables?
Semi-additive: semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not for the others. For example, suppose
Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive
fact: it makes sense to add balances up across all accounts (what is the total current balance
for all accounts in the bank?), but it does not make sense to add them up through time
(adding up all current balances for a given account for each day of the month does not
give us any useful information).
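A short Python sketch of that example (figures are made up): the balance sums meaningfully across accounts for one day, but summing the same balance across days is not useful, which is what makes it semi-additive.

balances = {
    # (account, day) -> current balance
    ("A", "2024-01-01"): 100, ("B", "2024-01-01"): 200,
    ("A", "2024-01-02"): 110, ("B", "2024-01-02"): 190,
}
total_on_day1 = sum(v for (acct, day), v in balances.items() if day == "2024-01-01")
print(total_on_day1)            # 300 -> meaningful: total balance of the bank that day
summed_over_time_for_A = sum(v for (acct, day), v in balances.items() if acct == "A")
print(summed_over_time_for_A)   # 210 -> not meaningful; use an average or the last value instead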
A factless fact table captures the many-to-many relationships between
dimensions, but contains no numeric or textual facts. They are often used to record events
or
coverage information. Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that didn't sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university
What is VLDB?
Very Large Database (VLDB)
The term is sometimes used to describe databases occupying magnetic storage in the terabyte
range and containing billions of table rows. Typically, these are decision support systems
or transaction processing applications serving large numbers of users.
What is ETL?
ETL is an abbreviation for "Extract, Transform and Load". This is the process of
extracting data from their operational data sources or external data sources, transforming
the data which includes cleansing, aggregation, summarization, integration, as well as
basic transformation and loading the data into some form of the data warehouse.
What is the definition of normalized and denormalized view and what are the differences
between them?
I would like to add one more point here: as OLTP is in normalized form, more tables
are scanned or referred to for a single query, since data needs to be fetched from the
respective master tables through primary and foreign keys. Whereas in OLAP, as the data is in
denormalized form, fewer tables are queried for a given query. For example, in a
banking application in an OLTP environment we would have separate tables for customer personal
details, address details, transaction details, etc., whereas in an OLAP environment all these
details can be stored in one single table, thus decreasing the scanning of multiple tables for a
single customer record.
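A small Python sketch of the banking example (table and column names are assumptions): the normalized layout needs several lookups joined by customer_id to assemble one customer's details, while the denormalized layout keeps the same attributes in one wide row.

# Normalized (OLTP-style): separate tables linked by customer_id
customers = {1: {"name": "Asha"}}
addresses = {1: {"customer_id": 1, "city": "Pune"}}
transactions = [{"customer_id": 1, "amount": 500}]

def customer_profile(cid):
    # Several lookups/joins to assemble one customer's details
    addr = addresses[cid]
    txns = [t for t in transactions if t["customer_id"] == cid]
    return {**customers[cid], **addr, "transactions": txns}

# Denormalized (OLAP-style): one wide row, no joins needed
customer_wide = {"customer_id": 1, "name": "Asha", "city": "Pune", "total_spend": 500}
print(customer_profile(1))
print(customer_wide)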
What is data mining?
Data mining is the process of extracting hidden trends within a data warehouse. For
example, an insurance data warehouse can be used to mine data for the highest-risk
people to insure in a certain geographical area.
What is BUS Schema?
A BUS schema or BUS matrix (in the Kimball approach) is used to identify
common dimensions across business processes, i.e., it is a way of identifying conformed
dimensions.
What are data validation strategies for data mart validation after loading process?
Data validation is to make sure that the loaded data is accurate and meets the business
requirements.
Strategies are different methods followed to meet the validation requirements
Why are OLTP database designs not generally a good idea for a data warehouse?
OLTP cannot store historical information about the organization. It is used for storing the
details of daily transactions while a data warehouse is a huge storage of historical
information obtained from different data marts for making intelligent decisions about the
organization.
Which columns go to the fact table and which columns go the dimension table?
The aggregation or calculated value columns go to the fact table, and detail
information goes to the dimension table.
To add on, Foreign key elements along with Business Measures, such as Sales in $ amt,
Date may be a business measure in some case, units (qty sold) may be a business
measure, are stored in the fact table. It also depends on the granularity at which the data
is stored.
What is ER Diagram ?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76]
as a way to unify the network and relational database views.
What are the various ETL tools in the Market?
1. Informatica Power Center
2. Ascential DataStage
3. Hyperion Essbase
4. Ab Initio
5. BO Data Integrator
6. SAS ETL
7. MS DTS
8. Oracle OWB
9. Pervasive Data Junction
10. Cognos Decision Stream
Why should you put your data warehouse on a different system than your OLTP system?
Data Warehouse is a part of OLAP (On-Line Analytical Processing). It is the source from
which any BI tools fetch data for Analytical, reporting or data mining purposes. It
generally contains the data through the whole life cycle of the company/product. DWH
contains historical, integrated, Denormalized, subject oriented data.
However, the OLTP system, on the other hand, contains data that is generally limited to the
last couple of months, or a year at most. The nature of data in OLTP is current, volatile,
and highly normalized. Since the two systems differ in nature and functionality, we
should always keep them on separate systems.
Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you use for
your transaction logs?
Raid 0 - Make several physical hard drives look like one hard drive. No redundancy but
very fast. May use for temporary spaces where loss of the files will not result in loss of
committed data.
Raid 1 - Mirroring. Each hard drive in the drive array has a twin. Each twin has an exact
copy of the other twin's data, so if one hard drive fails, the other is used to pull the data.
Raid 1 is half the speed of Raid 0, and read and write performance are good.
Raid 1/0 - Striped Raid 0, then mirrored Raid 1. Similar to Raid 1. Sometimes faster than
Raid 1. Depends on vendor implementation.
Raid 5 - Great for read-only systems. Write performance is one third that of Raid 1, but read
performance is the same as Raid 1. Raid 5 is great for a DW but not good for OLTP.
Hard drives are cheap now so I always recommend Raid 1.
What are the differences between static and dynamic caches?
A static cache stores the lookup values in memory and does not change throughout the
run of the session.
A dynamic cache also stores the values in memory but changes dynamically during the
run of the session; it is used in SCD implementations where the target table changes and
the cache changes along with it.
Do you need separate space for the data warehouse and the data mart?
In the data warehouse all the information of the enterprise is present, but a data mart is
specific to a particular analysis such as sales or production. So a data mart is subject
oriented, and the warehouse is nothing but a collection of data marts, so we assume it is also
subject oriented because it is a collection of data marts. So for individual analysis we need
data marts.
ASCENTIAL DATASTAGE 7.5
1. What are other Performance tunings you have done in your last project to
increase the performance of slowly running jobs?
1) Minimize the usage of the Transformer (instead use Copy, Modify, Filter, Row
Generator).
2) Use SQL code while extracting the data; handle the nulls; minimize the warnings.
3) Reduce the number of lookups in a job design; use not more than 20 stages in a job.
4) Use an IPC stage between two passive stages to reduce processing time.
5) Drop indexes before data loading and recreate them after loading data into the tables.
6) There is no hard limit on the number of stages such as 20 or 30, but we can break the job
into smaller jobs and then use Dataset stages to store the data between them.
7) Check the write cache of Hash file. If the same hash file is used for Look up and as
well as target, disable this Option. If the hash file is used only for lookup then "enable
Preload to memory". This will improve the performance. Also, check the order of
execution of the routines.
8) Don't use more than 7 lookups in the same transformer; introduce new transformers
if it exceeds 7 lookups.
9) Use Preload to memory option in the hash file output.
10) Use Write to cache in the hash file input.
11) Write into the error tables only after all the transformer stages.
12) Reduce the width of the input record - remove the columns that you would not use.
13) Cache the hash files you are reading from and writing into. Make sure your cache is
big enough to hold the hash files.
(Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash
files.)
This would also minimize overflow on the hash file.
14) If possible, break the input into multiple threads and run multiple instances of the job.
15) Staged the data coming from ODBC/OCI/DB2UDB stages for optimum performance
also for data recovery in case job aborts.
16) Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for
faster inserts, updates and selects.
17) Tuned the 'Project Tunables' in Administrator for better performance.
18) Sorted the data as much as possible in DB and reduced the use of DS-Sort for better
performance of jobs. Used sorted data for Aggregator.
19) Removed the data not used from the source as early as possible in the job.
20) Worked with DB-admin to create appropriate Indexes on tables for better
performance of DS queries
21) Converted some of the complex joins/business logic in DS to stored procedures for
faster execution of the jobs.
22) If an input file has an excessive number of rows and can be split-up then use standard
logic to run jobs in parallel.
23) Constraints are generally CPU intensive and take a significant amount of time to
process. This may be the case if the constraint calls routines or external macros but if it is
inline code then the overhead will be minimal.
24) Try to have the constraints in the 'Selection' criteria of the jobs itself. This will
eliminate the unnecessary records even getting in before joins are made.
25) Tuning should occur on a job-by-job basis.
26) Using a constraint to filter a record set is much slower than performing a SELECT …
WHERE….
27) Make every attempt to use the bulk loader for your particular database. Bulk loaders
are generally faster than using ODBC or OCI.
2. How can I extract data from DB2 (on IBM i-series) to the data warehouse via
Datastage as the ETL tool? I mean do I first need to use ODBC to create
connectivity and use an adapter for the extraction and transformation of data?
You would need to install ODBC drivers to connect to the DB2 instance (these do not come with
the regular drivers we usually install; use the CD provided with the DB2 installation, which
has the ODBC drivers to connect to DB2) and then try it out.
You use the Designer to build jobs by creating a visual design that models the flow and
transformation of data from the data source to the target warehouse. The Designer
graphical interface lets you select stage icons, drop them onto the Designer work area,
and add links.
You can do it like this: pass parameters from a UNIX file and then call the
execution of a DataStage job. The DS job has the parameters defined, which are passed in by
UNIX.
6. What is a project? Specify its various components?
You always enter Datastage through a Datastage project. When you start a Datastage
client you are prompted to connect to a project.
Yes, we can do it in an indirect way. First create a job which can populate the data from
database into a Sequential file and name it as Seq_First1. Take the flat file which you are
having and use a Merge Stage to join the two files. You have various join types in Merge
Stage like Pure Inner Join, Left Outer Join, Right Outer Join etc., You can use any one of
these which suits your requirements.
10. Can any one tell me how to extract data from more than 1 heterogeneous
Sources? Means, example 1 sequential file, Sybase, Oracle in a single Job.
Yes, you can extract the data from two heterogeneous sources in DataStage using the
Transformer stage; it is simple, you just need to form a link between the two sources in
the Transformer stage.
11. Will Datastage consider the second constraint in the transformer if the first
constraint is satisfied (if link ordering is given)?"
Answer: Yes.
12. How we use NLS function in Datastage? What are advantages of NLS function?
Where we can use that one? Explain briefly?
- Use Local formats for dates, times and money
- Sort the data according to the local rules
13. If a Datastage job aborts after say 1000 records, how to continue the job from
1000th record after fixing the error?
By specifying checkpointing in the job sequence properties. If we restart the job, it
will start by skipping up to the failed record. This option is available in the 7.5 edition.
14. How to kill a job in DataStage? Answer: by killing the respective process ID.
Basically an environment variable is a predefined variable that we can use while creating a
DS job. We can set it either at project level or at job level. Once we set a specific variable,
that variable is available in the project/job.
We can also define new environment variables; for that we go to DS Administrator.
16. What are all the third party tools used in Datastage?
Autosys, TNG, event coordinator are some of them that I know and worked with
APT_CONFIG is just an environment variable used to identify the *.apt file. Don't
confuse it with the *.apt file itself, which has the node information and the configuration of
the SMP/MPP server.
17. If you’re running 4 ways parallel and you have 10 stages on the canvas, how
many processes does Datastage create?
Answer is 40
You have 10 stages and each stage can be partitioned and run on 4 nodes which makes
total number of processes generated are 40
18. Did you Parameterize the job or hard-coded the values in the jobs?
Always parameterized the job. Either the values come from Job Properties or from
a 'Parameter Manager', a third-party tool. There is no way you would hard-code some
parameters in your jobs. The most often parameterized variables in a job are: DB DSN name,
username, and password.
Actually the Number of Nodes depends on the number of processors in your system. If
your system is supporting two processors we will get two nodes by default.
No, it is not possible to run Parallel jobs in server jobs. But Server jobs can be executed
in Parallel jobs
21. Is it possible for two users to access the same job at a time in DataStage?
No, it is not possible for two users to access the same job at the same time. DS will produce
the following error: "Job is accessed by other user".
MetaStage is used to handle the metadata, which will be very useful for data lineage and
data analysis later on. Metadata defines the type of data we are handling. These data
definitions are stored in the repository and can be accessed with the use of MetaStage.
23. What is merge and how it can be done plz explain with simple example taking 2
tables
Merge is used to join two tables. It takes the key columns and sorts them in ascending or
descending order. Let us consider two tables, Emp and Dept. If we want to join these two
tables, we have DeptNo as a common key, so we can give that column name as the key,
sort DeptNo in ascending order, and join the two tables.
master link
25. What are the enhancements made in DataStage 7.5 compared with 7.0?
Many new stages were introduced compared to DataStage version 7.0. In server jobs we
have the Stored Procedure stage and the Command stage, and a generate-report option was added
in the File tab.
In job sequences, many activities such as Start Loop, End Loop, Terminate Loop, and User
Variables were introduced.
In parallel jobs, the Surrogate Key stage and Stored Procedure stage were introduced.
26. How can we join one Oracle source and Sequential file?.
The stages following an Exception activity are executed whenever an unknown error occurs
while running the job sequencer.
The main difference is the vendors; each one has pluses coming from its architecture. For
DataStage it is a top-down approach. Based on the business needs, we have to choose
products.
31. What are Static Hash files and Dynamic Hash files?
The hashed files have the default size established by their modulus and separation when
you create them, and this can be static or dynamic.
Overflow space is only used when data grows over the reserved size for one of the groups
(sectors) within the file. There are as many groups as specified by the modulus.
32. What is the exact difference between Join, Merge and Lookup Stage?
Duplicates can be eliminated by loading the corresponding data into a hash file.
Specify the columns on which you want to eliminate duplicates as the keys of the hash file.
The different hashing algorithms are designed to distribute records evenly among the
groups of the file based on characters and their position in the record ids.
When a hashed file is created, separation and modulus respectively specify the group
buffer size and the number of groups allocated for the file. When a static hash file is
created, DataStage creates a file that contains the number of groups specified by the
modulus.
Size of hash file = modulus (number of groups) * separation (buffer size)
The concept of a surrogate key comes into play when there is a slowly changing dimension in a
table.
In such a condition there is a need for a key by which we can identify the changes made in
the dimensions.
These are system-generated keys. Mainly they are just a sequence of numbers, but they can also
be alphanumeric values.
These slowly changing dimensions can be of three types, namely SCD1, SCD2, and
SCD3.
We can call a DataStage batch job from the command prompt using 'dsjob'. We can also pass
all the parameters from the command prompt.
Then call this shell script in any of the schedulers available in the market.
The second option is to schedule these jobs using DataStage Director.
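A hedged sketch of that approach (project, job, and parameter names are placeholders, and the exact dsjob options should be checked against your DataStage release): a small Python wrapper that shells out to dsjob with parameters, the same call a scheduler-invoked shell script would make.

import subprocess

cmd = [
    "dsjob", "-run",
    "-param", "LoadDate=2024-01-01",
    "-param", "SourceFile=/data/in/customers.txt",
    "-wait",                      # wait for the job to finish
    "MyProject", "LoadCustomers", # project name, job name
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode, result.stdout)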
We can also use the Hash File stage to avoid/remove duplicate rows by specifying the
hash key on a particular field.
Version Control stores different versions of DS jobs, runs different versions of the same job,
reverts to a previous version of a job, and also views version histories.
41. Suppose if there are million records did you use OCI? If not then what stage do
you prefer?
Using Orabulk
How to find errors in job sequence?
Using Datastage Director we can find the errors in job sequence
How do you pass the parameter to the job sequence if the job is running at night?
Two ways:
1. Set the default values of Parameters in the Job Sequencer and map these parameters to
job.
2. Run the job in the sequencer using the dsjob utility, where we can specify the values to be
taken for each parameter.
What is the transaction size and array size in OCI stage? How these can be used?
Transaction Size - This field exists for backward compatibility, but it is ignored for
release 3.0 and later of the Plug-in. The transaction size for new jobs is now handled by
Rows per transaction on the Transaction Handling tab on the Input page.
Rows per transaction - The number of rows written before a commit is executed for
the transaction. The default value is 0, that is, all the rows are written before being
committed to the data table.
Array Size - The number of rows written to or read from the database at a time. The
default value is 1, that is, each row is written in a separate statement.
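As a loose analogy in plain Python with SQLite (not the OCI plug-in; the table name and sizes are invented): "array size" corresponds to how many rows are sent per statement batch, and "rows per transaction" to how many rows are written before each commit.

import sqlite3

rows = [(i, f"name{i}") for i in range(1000)]
ARRAY_SIZE = 100             # rows sent to the database per batch
ROWS_PER_TRANSACTION = 500   # rows written before a commit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
written = 0
for start in range(0, len(rows), ARRAY_SIZE):
    batch = rows[start:start + ARRAY_SIZE]
    conn.executemany("INSERT INTO t VALUES (?, ?)", batch)   # one round trip per batch
    written += len(batch)
    if written % ROWS_PER_TRANSACTION == 0:
        conn.commit()                                        # commit interval
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM t").fetchone())     # (1000,)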
What is the difference between DRS (Dynamic Relational Stage) and ODBC
STAGE?
To answer your question, the DRS stage should be faster than the ODBC stage, as it
uses native database connectivity. You will need to install and configure the required
database clients on your DataStage server for it to work.
The Dynamic Relational Stage was leveraged for PeopleSoft, to allow a job to run on any of
the supported databases. It supports ODBC connections too. Read more about that in the
plug-in documentation.
ODBC uses the ODBC driver for a particular database; DRS (Dynamic Relational
Stage) is a stage that tries to make it seamless to switch from one database to another.
It uses the native connectivity for the chosen target ...
How do you track performance statistics and enhance it?
Through the Monitor we can view the performance statistics.
What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs
itself; this will eliminate the unnecessary records even getting in before joins are
made"?
This means try to improve the performance by avoiding use of constraints wherever
possible and instead using them while selecting the data itself using a where clause. This
improves performance.
How to drop the index before loading data into the target, and how to rebuild it in
DataStage?
The Administrator enables you to set up Datastage users, control the purging of the
Repository, and, if National Language Support (NLS) is enabled, install and manage
maps and locales.
# 64BIT_FILES - This sets the default mode used to
# create static hashed and dynamic files.
# A value of 0 results in the creation of 32-bit
# files. 32-bit files have a maximum file size of
# 2 gigabytes. A value of 1 results in the creation
# of 64-bit files (ONLY valid on 64-bit capable platforms).
# The maximum file size for 64-bit
# files is system dependent. The default behavior
# may be overridden by keywords on certain commands.
64BIT_FILES 0
What is the order of execution done internally in the transformer with the stage
editor having input links on the left hand side and output links?
Stage variables, then constraints, then column derivations or expressions.
Type 1: The new record replaces the original record, no trace of the old record at all
Type 2: A new record is added into the customer dimension table. Therefore, the
customer is treated essentially as two different people.
Type 3: The original record is modified to reflect the changes.
In Type 1 the new value overwrites the existing one, which means no history is
maintained; the history of where the person stayed last is lost. It is simple to use.
In Type 2 a new record is added, so both the original and the new record are
present; the new record gets its own primary key. The advantage of Type 2 is that
historical information is maintained, but the size of the dimension table grows, and storage
and performance can become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track historical
changes.
In Type 3 there are two columns, one to indicate the original value and the other to
indicate the current value. For example, a new column is added which shows the original
address as New York and the current address as Florida. This helps in keeping some part of
the history, and the table size is not increased. But one problem is that when the customer
moves from Florida to Texas, the New York information is lost, so Type 3 should only be used
if the changes will occur only a finite number of times.
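A minimal Python sketch of the three behaviours above (the columns are invented) when a customer's address changes from New York to Florida:

# Type 1: overwrite in place, history lost
dim_type1 = {"cust_key": 1, "name": "Kim", "address": "Florida"}

# Type 2: expire the old row and add a new row with its own surrogate key
dim_type2 = [
    {"cust_key": 1, "name": "Kim", "address": "New York", "current": False},
    {"cust_key": 2, "name": "Kim", "address": "Florida",  "current": True},
]

# Type 3: keep the original and current values in two columns of the same row
dim_type3 = {"cust_key": 1, "name": "Kim",
             "original_address": "New York", "current_address": "Florida"}

print(dim_type1, dim_type2, dim_type3, sep="\n")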
Server jobs mainly execute in a sequential fashion; the IPC stage, as well as the Link
Partitioner and Link Collector, simulate a parallel mode of execution for server jobs
on a single CPU. Link Partitioner: receives data on a single input link and diverts
the data to a maximum of 64 output links; the data is processed by the same stage
having the same metadata. Link Collector: collects the data from up to 64 input links,
merges it into a single data flow, and loads it to the target. Both are active stages, and the
design and mode of execution of server jobs has to be decided by the designer.
JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or
without using loops.
Steps: click on Edit in the menu bar and select 'Job Properties', then enter the parameters,
for example (parameter, prompt, type): STEP_ID, STEP_ID, string; Source, SRC, string; DSN, DSN,
string; Username, unm, string; Password, pwd, string. After editing the above, press the JCL
button, select the jobs from the list box, and run the job.
What is the difference between Datastage and Datastage TX?
It is a tricky question to answer, but one thing I can tell you is that DataStage TX is not an
ETL tool and it is not a new version of DataStage 7.5. TX is used for ODS sources; this much I
know.
If the size of the Hash file exceeds 2GB...What happens? Does it overwrite the
current rows?
How much would be the size of the database in Datastage? What is the difference
between In process and Interprocess?
In-process:
You can improve the performance of most DataStage jobs by turning in-process row
buffering on and recompiling the job. This allows connected active stages to pass data via
buffers rather than row by row.
Note: You cannot use in-process row-buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice, and it
is advisable to redesign your job to use row buffering rather than COMMON blocks.
Inter-process
Use this if you are running server jobs on an SMP parallel system. This enables the job to
run using a separate process for each active stage, which will run simultaneously on a
separate processor.
Note: You cannot use inter-process row buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice, and it
is advisable to redesign your job to use row buffering rather than COMMON blocks.
How can you do an incremental load in DataStage? Incremental load means daily load.
Whenever you are selecting data from the source, select the records which were loaded or
updated between the timestamp of the last successful load and today's load start date and
time. For this you have to pass parameters for those two dates.
Store the last run date and time in a file and read the parameter through job parameters,
and pass the current date and time as the second argument.
In the target, make the column the key column and run the job.
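A rough Python sketch of that window (the file name and formats are assumptions, not part of the original answer): read the last successful run timestamp, extract only rows changed since then and up to now, and save the new timestamp for the next run.

from datetime import datetime

STAMP_FILE = "last_run.txt"   # hypothetical parameter file

def load_window():
    try:
        with open(STAMP_FILE) as f:
            last_run = datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        last_run = datetime.min   # first run: take everything
    return last_run, datetime.now()

def save_window_end(end):
    with open(STAMP_FILE, "w") as f:
        f.write(end.isoformat())

start, end = load_window()
# e.g. WHERE updated_at > :start AND updated_at <= :end in the source query
print(f"extract rows updated after {start} up to {end}")
save_window_end(end)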
What are XML files, how do you read data from XML files, and what stage is to be
used?
In the palette there are Real Time stages such as XML Input, XML Output, and XML Transformer.
Flat files store the data, and the path can be given in the General tab of the Sequential File
stage.
File set: it allows you to read data from or write data to a file set. The stage can have a
single input link, a single output link, and a single rejects link. It executes only in parallel
mode. The data files and the file that lists them are called a file set. This capability is
useful because some operating systems impose a 2 GB limit on the size of a file and you
need to distribute files among nodes to prevent overruns.
Datasets are used to import data in parallel jobs, like ODBC in server jobs.
What is the meaning of file extender in DataStage server jobs? Can we run a DataStage job
from one job to another job?
File extender means adding columns or records to an already existing file. In DataStage, we
can run one DataStage job from another job.
Either use the Copy command as a before-job subroutine if the metadata of the two files is the
same, or create a job to concatenate the two files into one if the metadata is different.
What is the default cache size? How do you change the cache size if needed?
The default read cache size is 128 MB. We can increase it by going into DataStage
Administrator, selecting the Tunables tab, and specifying the cache size.
What about System variables?
Datastage provides a set of variables containing useful system information that you can
access from a transform or routine. System variables are read-only.
@DATE the internal date when the program started. See the Date function.
@DAY The day of the month extracted from the value in @DATE.
@FALSE The compiler replaces the value with 0.
@FM A field mark, Char(254).
@IM An item mark, Char(255).
@INROWNUM Input row counter. For use in constraints and derivations in Transformer
stages.
@OUTROWNUM Output row counter (per link). For use in derivations in Transformer
stages.
@LOGNAME The user login name.
@MONTH The current month extracted from the value in @DATE.
@NULL The null value.
@NULL.STR The internal representation of the null value, Char(128).
@PATH The pathname of the current Datastage project.
@SCHEMA The schema name of the current Datastage project.
@SM A sub value mark (a delimiter used in Universe files), Char(252).
@SYSTEM.RETURN.CODE
Status codes returned by system processes or commands.
@TIME The internal time when the program started. See the Time function.
@TM A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE The compiler replaces the value with 1.
@USERNO The user number.
@VM A value mark (a delimiter used in UniVerse files), Char(253).
@WHO The name of the current DataStage project directory.
@YEAR The current year extracted from @DATE.
What is DS Director used for? Did you use it?
DataStage Director is a GUI to monitor, run, validate, and schedule DataStage server jobs.
A DataStage developer is one who codes the jobs; a DataStage designer is one who designs the
job, i.e., he deals with the blueprints and designs the jobs and the stages that are required
in developing the code.
What will you do in a situation where somebody wants to send you a file, use that
file as an input or a reference, and then run the job?
Under Windows: use the 'WaitForFileActivity' under the Sequencers and then run the
job. Maybe you can schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: poll for the file. Once the file has arrived, start the job or sequencer
depending on the file.
What are the command line functions that import and export the DS jobs?
What is sequence stage in job sequencer? What are the conditions?
A sequencer allows you to synchronize the control flow of multiple activities in a job
sequence. It can have multiple input triggers as well as multiple output triggers. The
sequencer operates in two modes: ALL mode. In this mode all of the inputs to the
sequencer must be TRUE for any of the sequencer outputs to fire. ANY mode. In this
mode, output triggers can be fired if any of the sequencer inputs are TRUE
What are the Repository Tables in Datastage and what are they?
What difference between operational data stage (ODS) & data warehouse?
100+ jobs for every 6 months if you are in Development, if you are in testing 40 jobs for
every 6 months although it need not be the same number for everybody
3.How can we improve the performance of DataStage?
4.what are the Job parameters?
5.what is the difference between routine and transform and function?
6.What are all the third party tools used in DataStage?
7.How can we implement Lookup in DataStage Server jobs?
8.How can we implement Slowly Changing Dimensions in DataStage?.
9.How can we join one Oracle source and Sequential file?.
10.What is iconv and oconv functions?
11.Difference between Hashfile and Sequential File?
12. Maximum how many characters we can give for a Job name in DataStage?
1. Go to DataStage Administrator->Projects->Properties->Environment->UserDefined.
Here you can see a grid, where you can enter your parameter name and the corresponding
path of the file.
2. Go to the stage Tab of the job, select the NLS tab, click on the "Use Job Parameter"
and select the parameter name which you have given in the above. The selected
parameter name appears in the text box beside the "Use Job Parameter" button. Copy the
parameter name from the text box and use it in your job. Keep the project default in the
text box.
What is the utility you use to schedule the jobs on a UNIX server other than using
Ascential Director?
AUTOSYS: Through AutoSys you can automate the job by invoking the shell script written to
schedule the DataStage jobs.
I think we can call a job into another job. In fact, "calling" doesn't sound right, because you
attach/add the other job through the job properties; in fact, you can attach zero or more jobs.
Click on Add Job and select the desired job.
If data is partitioned in your job on key 1 and then you aggregate on key 2, what
issues could arise?
Data will be partitioned on both the keys; it will hardly take any more time for execution.
Controlling DataStage jobs through some other DataStage jobs. Example: consider two jobs,
XXX and YYY. Job YYY can be executed from job XXX by using DataStage
macros in Routines.
To execute one job from another job, the following steps need to be followed in Routines.
Container is a collection of stages used for the purpose of Reusability. There are 2 types
of Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project.
There are two types of shared container:
1.Server shared container. Used in server jobs (can also be used in parallel jobs).·
2.Parallel shared container. Used in parallel jobs. You can also include server shared
containers in parallel jobs as a way of incorporating server job functionality into a
parallel stage (for example, you could use one to make a server plug-in stage available to
a parallel job).
What does a Config File in parallel extender consist of?
Complex design means having more joins and more lookups; that job design will be called a
complex job. We can easily implement any complex design in DataStage by following simple
tips, in terms of increasing performance as well. There is no limitation on using stages in a
job. For better performance, use at most 20 stages in each job; if it exceeds 20 stages then
go for another job. Use not more than 7 lookups per transformer; otherwise go for including
one more transformer.
What are the different types of errors you faced during loading, and how did you solve them?
Check for parameters, check whether the input files exist or not, check whether the input
tables exist or not, and also check usernames, data source names, passwords, and the like.
What is the User Variables activity, when is it used, how is it used, and where is it used?
Give a real example.
By using the User Variables activity we can create some variables in the job sequence; these
variables are available to all the activities in that sequence.
I want to process 3 files sequentially, one by one; how can I do that? While
processing, it should fetch the files automatically.
If the metadata for all the files is the same, then create a job having the file name as a
parameter, then use the same job in a routine and call the job with different file names, or
you can create a sequencer to use the job.
What happens if the output of a hash file is connected to a transformer? What error does it
throw?
If a hash file output is connected to a Transformer stage, the hash file is treated as the
lookup file when there is a primary link to the same Transformer stage; if there is no
primary link, then the hash file is treated as the primary link itself. You can do SCD in a
server job by using lookup functionality. This will not return any error code.
What are the Oconv() and Iconv() functions and where are they used?
Iconv is used to convert a date into the internal format, i.e., a format only DataStage can
understand.
Example: a date comes in mm/dd/yyyy format;
DataStage converts this date into a number such as 740.
You can then derive it in your own format by using Oconv.
Suppose you want to change mm/dd/yyyy to dd/mm/yyyy; you would use Iconv and then Oconv.
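As an analogy only (these are not the actual DataStage BASIC functions): UniVerse-style internal dates are commonly described as days counted from 31 December 1967, and that base date is treated as an assumption here; on that basis 01/09/1970 maps to the 740 quoted above, and formatting the number back out is the Oconv step.

from datetime import date, timedelta

BASE = date(1967, 12, 31)   # assumed internal day 0

def iconv_like(mmddyyyy):
    # external mm/dd/yyyy string -> internal day number
    m, d, y = (int(x) for x in mmddyyyy.split("/"))
    return (date(y, m, d) - BASE).days

def oconv_like(internal, day_first=True):
    # internal day number -> formatted date string
    d = BASE + timedelta(days=internal)
    return d.strftime("%d/%m/%Y") if day_first else d.strftime("%m/%d/%Y")

n = iconv_like("01/09/1970")
print(n, oconv_like(n))   # 740 09/01/1970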
I have never tried doing this, however, I have some information which will help you in
saving a lot of time. You can convert your server job into a server shared container. The
server shared container can also be used in parallel jobs as shared container.
I am using DataStage 7.5 on UNIX. Can we use a shared container more than once in a job, and
is there any limit on its use? In my job I used the shared container in 6 flows, but at any
time only 2 flows are working. Can you please share the info on this?
DataStage from Staging to MDW is only running at 1 row per second! What do we do to
remedy?
I am assuming that there are too many stages, which is causing the problem, and providing
the solution accordingly.
In general, if you have too many stages (especially transformers and hash lookups), there would
be a lot of overhead and the performance would degrade drastically. I would suggest you
write a query instead of doing several lookups. It may seem embarrassing to
have a tool and still write a query, but that is best at times.
If there are too many look ups that are being done, ensure that you have appropriate
indexes while querying. If you do not want to write the query and use intermediate
stages, ensure that you use proper elimination of data between stages so that data
volumes do not cause overhead. So, there might be a re-ordering of stages needed for
good performance.
1) For massive transactions, set the hashing size and buffer size to appropriate values so as
to do as much as possible in memory, with no I/O overhead to disk.
2) Enable row buffering and set an appropriate size for the row buffer.
What is the flow of loading data into fact & dimensional tables?
What is the difference between sequential file and a dataset? When to use the copy
stage?
The Sequential File stage stores small amounts of data, with any extension, in order to access
the file, whereas a Dataset is used to store huge amounts of data and opens only with the .ds
extension.
The Copy stage copies a single input data set to a number of output datasets. Each record
of the input data set is copied to every output data set. Records can be copied without
modification or you can drop or change the order of columns.
Runtime column propagation (RCP): If RCP is enabled for any job, and specifically for
those stages whose output connects to the shared container input, then meta data will be
propagated at run time, so there is no need to map it at design time.
If RCP is disabled for the job, then OSH has to perform an import and export every
time the job runs, and the processing time of the job is also increased.
What are Routines and where/how are they written and have you written any
routines before?
Routines are used for implementing the business logic. They are of two types: 1) Before Sub
Routines and 2) After Sub Routines. Steps: double-click on the Transformer stage, right-click
on any one of the mapping fields, select the [DS Routines] option, give the business logic
within the edit window, and select either of the options (Before / After Sub Routines).
How can we improve the performance of Datastage jobs?
Performance and tuning of DS jobs:
1.Establish Baselines
2.Avoid the Use of only one flow for tuning/performance testing
3.Work in increments
4.Evaluate data skew
5.Isolate and solve
6.Distribute file systems to eliminate bottlenecks
7.Do not involve the RDBMS in initial testing
8.Understand and evaluate the tuning knobs available.
ORABULK is used to load bulk data into a single table of a target Oracle database.
BCP is used to load bulk data into a single table for Microsoft SQL Server and Sybase.
Open the ODBC Data Source Administrator found in the control panel/administrative
tools.
Under the system DSN tab, add the Driver to Microsoft Excel.
Then you'll be able to access the XLS file from DataStage.
OCI doesn't mean the orabulk data. It actually uses the "Oracle Call Interface" of the
oracle to load the data. It is kind of the lowest level of Oracle being used for loading the
data.
There are two types of lookups: the Lookup stage and the Lookup File Set. Lookup: references
another stage or database to get the data from it and transforms it to the other
database. Lookup File Set: allows you to create a lookup file set or reference one for a
lookup. The stage can have a single input link or a single output link. The output link
must be a reference link. The stage can be configured to execute in parallel or sequential
mode when used with an input link. When creating lookup file sets, one file will be
created for each partition. The individual files are referenced by a single descriptor file,
which by convention has the suffix .fs.
Shared Container:
Step 1: select the stages required.
Step 2: Edit > Construct Container > Shared.
Shared containers are stored in the Shared Containers branch of the tree structure.
There are many ways to populate it; writing a SQL statement in Oracle is one way.
What are the differences between the data stage 7.0 and 7.5
in server jobs?
There are a lot of differences: a lot of new stages are available in DS 7.5, e.g., the
CDC stage, the Stored Procedure stage, etc.
Datastage Director. A user interface used to validate, schedule, run, and monitor
Datastage jobs.
Datastage Manager. A user interface used to view and edit the contents of the
Repository.
These are the variables used at the project or job level. We can use them to configure
the job, i.e., we can associate the configuration file (without this you cannot run your job),
increase the sequential or dataset read/write buffer, and so on.
ex: $APT_CONFIG_FILE
Like above we have so many environment variables. Please go to job properties and click
on "add environment variable" to see most of the environment variables.
When we say "Validating a Job", we are talking about running the Job in the "check
only" mode. The following checks are made :
- SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use
the local data source are created, if they do not already exist.
When the source data is enormous, or for bulk data, we can use OCI and SQL*Loader
depending upon the source.
Where we use link Partitioner in data stage job? explain with example?
We use the Link Partitioner in DataStage server jobs. The Link Partitioner stage is an active
stage which takes one input and allows you to distribute partitioned rows to up to 64
output links.
Purpose of using the key and difference between Surrogate keys and natural key
We use keys to provide relationships between the entities (Tables). By using primary and
foreign key relationship, we can maintain integrity of the data.
The natural key is the one coming from the OLTP system.
The surrogate key is the artificial key which we are going to create in the target DW. We
can use these surrogate keys instead of using the natural key. In SCD2 scenarios,
surrogate keys play a major role.
We have to create users in the Administrators and give the necessary privileges to users.
Is it possible to move the data from an Oracle warehouse to a SAP warehouse using the
DataStage tool?
We can use Datastage Extract Pack for SAP R/3 and DataStage Load Pack for SAP BW
to transfer the data from oracle to SAP Warehouse. These Plug In Packs are available
with DataStage Version 7.5
How to implement Type 2 slowly changing dimensions in DataStage? Explain with an
example.
We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by
turning on the Rejected cell where we write our constraints in the properties of the
Transformer, or 2) by using REJECTED in the expression editor of the constraint. Create a
hash file as temporary storage for rejected rows. Create a link and use it as one of the
outputs of the transformer. Apply either of the two steps above on that link. All the
rows which are rejected by all the constraints will go to the hash file.
Does Enterprise Edition only add the parallel processing for better performance?
Server jobs only accept server stages, parallel jobs only accept parallel stages, and MVS jobs
only accept MVS stages. There are some stages that are common to all types (such as
aggregation), but they tend to have different fields and options within the stage.
What is the utility you use to schedule the jobs on a UNIX server other than using
Ascential Director?
AUTOSYS: Through AutoSys we can automate the job by invoking the shell script written to
schedule the DataStage jobs.
I think we can call a job into another job. In fact calling doesn't sound good, because you
attach/add the other job through job properties. In fact, you can attach zero or more jobs.
Steps will be Edit --> Job Properties --> Job Control, Click on Add Job and select the desired
job.
If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could
arise?
Data will be partitioned on both the keys; it will hardly take any more time for execution.
Ans.
1) E-R Diagrams
2) Dimensional modeling
a) logical modeling b) Physical modeling
Controlling DataStage jobs through some other DataStage jobs. Example: consider two jobs, XXX
and YYY. Job YYY can be executed from job XXX by using DataStage macros in Routines.
To execute one job from another job, the following steps need to be followed in Routines:
1. Attach the job using the DSAttachJob function.
2. Run the other job using the DSRunJob function.
3. Stop the job using the DSStopJob function.
1. Server shared container. Used in server jobs (can also be used in parallel jobs).
2. Parallel shared container. Used in parallel jobs. You can also include server shared
containers in parallel jobs as a way of incorporating server job functionality into a parallel
stage (for example, you could use one to make a server plug-in stage available to a parallel
job).
A constraint specifies the condition under which data flows through the output link; the
constraint decides which output link is used. Constraints are nothing but business rules or
logic.
For example, if we have to split a customers.txt file into customer address files based on
customer country, we need to pass constraints. Suppose we want US customer addresses; we need
to pass a constraint for the US customer file, and similarly for Canadian and Australian
customers.
Constraints are used to check a condition and filter the data. Example: Cust_Id<>0 is set
as a constraint, and it means only those records meeting this condition will be processed
further.
Derivation is a method of deriving the fields, for example if you need to compute a SUM, AVG,
etc. A derivation specifies the expression used to pass values to the target column. As a
simple example, an input column is a derivation that passes its value to the target column.
Any DataStage objects, including whole projects, which are stored in the Manager repository
can be exported to a file. This exported file can then be imported back into DataStage.
Complex design means having more joins and more lookups. Then that job design will be
called as complex job. We can easily implement any complex design in Datastage by following
simple tips in terms of increasing performance also. There is no limitation of using stages in a
job. For better performance, Use at the Max of 20 stages in each job. If it is exceeding 20
stages then go for another job. Use not more than 7 look ups for a transformer otherwise go
for including one more transformer.
Validation checks that a DataStage job will be successful; it carries out the following
without actually processing data:
1) Connections are made to the sources.
2) The files are opened.
3) The SQL statements necessary for fetching the data are prepared.
4) It makes all connections from source to target, so they are ready for data processing from
source to target.
5) Check for parameters, check whether the input files exist or not, check whether the
input tables exist or not, and also usernames, data source names, passwords, and the
like.
What are the different types of errors you faced during loading, and how did you solve them?
How do you fix the error "OCI has fetched truncated data" in DataStage?
Can we use the Change Capture stage to get the truncated data? Members, please
confirm.
What is the User Variables activity, when is it used, how is it used, and where is it used?
Give a real example.
By using the User Variables activity we can create some variables in the job sequence; these
variables are available to all the activities in that sequence.
1) If an input file has an excessive number of rows and can be split up, then use standard
logic to run jobs in parallel.
Answer: row partitioning and collecting.
If you have SMP machines you can use IPC, Link Collector, and Link Partitioner for performance
tuning. If you have cluster or MPP machines you can use parallel jobs.
In such a condition there is a need for a key by which we can identify the changes made in
the dimensions.
These slowly changing dimensions can be of three types, namely SCD1, SCD2, and SCD3.
These are system-generated keys. Mainly they are just a sequence of numbers, but they can also
be alphanumeric values.
How can we implement Lookup in DataStage Server jobs?
We can use a Hash File as a lookup in server jobs. The hash file needs at least one key
column to be created.
How can you implement slowly changing dimensions in DataStage?
You can implement Type 1, Type 2, or Type 3. Let me try to explain Type 2 with a
timestamp.
Step 1: we create the timestamp via a shared container; it returns the system time and one
key. To satisfy the lookup condition we create a key column by using the
Column Generator.
Step 2: our source is a dataset and the lookup table is an Oracle OCI stage. Using the Change
Capture stage we find the differences; the Change Capture stage returns a value
for change_code. Based on the return value we find out whether this is an insert, edit,
or update. If it is an insert we write it with the current timestamp, and the old timestamp
is kept as history.
Summarize the difference between OLTP, ODS and data warehouse.
OLTP means online transaction processing; it is nothing but a database. Oracle, SQL Server,
and DB2 are databases commonly used for it.
OLTP databases, as the name implies, handle real-time transactions, which inherently
have some special requirements.
ODS stands for Operational Data Store. It is a final integration point in the ETL process; we
load the data into the ODS before loading the values into the target.
Data warehouse: a data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data which is used to make management decisions.
Why are OLTP database designs not generally a good idea for a data warehouse?
OLTP systems cannot store historical information about the organization. They are used for storing the details of daily transactions, while a data warehouse is a huge store of historical information obtained from different data marts for making intelligent decisions about the organization.
What is data cleaning? How is it done?
I can simply describe it as purifying the data.
Data Cleansing: the act of detecting and removing and/or correcting a database's dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly).
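As a rough illustration of how such rules are often applied in a staging area (Oracle syntax; the table and column names are assumed, not taken from any real project):

    -- Standardise casing, strip stray characters and default missing values.
    UPDATE stage.customers
    SET    customer_name = INITCAP(TRIM(customer_name)),
           phone         = REGEXP_REPLACE(phone, '[^0-9]', ''),
           country       = COALESCE(NULLIF(TRIM(country), ''), 'UNKNOWN');

    -- Remove duplicate rows, keeping one row per natural key.
    DELETE FROM stage.customers
    WHERE  rowid NOT IN (SELECT MIN(rowid)
                         FROM   stage.customers
                         GROUP  BY customer_id);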
What is the level of granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data at transaction level. The level of granularity then describes how much detail you are willing to keep for each transactional fact: whether you record product sales for each individual transaction, or aggregate them up to the minute and store that instead.
It also means that we can have (for example) data aggregated for a year for a given product, while the same data can be drilled down to a monthly, weekly and daily basis. The lowest level is known as the grain; going down to that level of detail is granularity.
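A small SQL illustration (the fact table and its columns are assumed names): the same sales facts can be kept at transaction grain or rolled up to a coarser, daily grain.

    -- Transaction grain: one row per individual sale (the lowest level, i.e. the grain).
    SELECT product_key, store_key, sale_timestamp, quantity, amount
    FROM   sales_fact;

    -- Daily grain: the same facts aggregated to one row per product, store and day.
    SELECT product_key, store_key, TRUNC(sale_timestamp) AS sale_date,
           SUM(quantity) AS quantity, SUM(amount) AS amount
    FROM   sales_fact
    GROUP  BY product_key, store_key, TRUNC(sale_timestamp);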
Which columns go to the fact table and which columns go to the dimension table?
The aggregation or calculated-value columns go to the fact table, and the detailed descriptive information goes to the dimension table.
To add on: foreign key elements are stored in the fact table along with the business measures, such as sales in dollar amount and units (quantity sold); a date may also be a business measure in some cases. It also depends on the granularity at which the data is stored.
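A minimal sketch of that split (Oracle syntax; all names are illustrative): descriptive attributes sit in the dimension, while the fact table holds foreign keys and measures.

    CREATE TABLE product_dim (
        product_key  NUMBER PRIMARY KEY,    -- surrogate key
        product_id   VARCHAR2(20),          -- natural key from the source
        product_name VARCHAR2(100),
        category     VARCHAR2(50)
    );

    CREATE TABLE sales_fact (
        product_key  NUMBER REFERENCES product_dim (product_key),
        date_key     NUMBER,                -- foreign key to a date dimension
        units_sold   NUMBER,                -- business measure
        sales_amount NUMBER(12,2)           -- business measure
    );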
What is the main difference between a schema in an RDBMS and the schemas in a Data Warehouse?
RDBMS Schema
* Used for OLTP systems
* Traditional, older style of schema
* Normalized
* Difficult to understand and navigate
* Extraction and complex analytical queries are hard to solve
* Poorly modelled for analytical work
DWH Schema
* Used for OLAP systems
* Newer generation of schema
* Denormalized
* Easy to understand and navigate
* Extraction and complex analytical queries are easily solved
* A very good model for analysis
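A small contrast, again with purely illustrative names: the normalized OLTP design splits the data across joined tables, while the warehouse dimension repeats the descriptive values on each row to avoid the join.

    -- OLTP (normalized): customer and city are separate, joined tables.
    CREATE TABLE city (
        city_id   NUMBER PRIMARY KEY,
        city_name VARCHAR2(50),
        region    VARCHAR2(50)
    );
    CREATE TABLE customer (
        customer_id   NUMBER PRIMARY KEY,
        customer_name VARCHAR2(100),
        city_id       NUMBER REFERENCES city (city_id)
    );

    -- DWH (denormalized): city and region are carried on the dimension row itself.
    CREATE TABLE customer_dim (
        customer_key  NUMBER PRIMARY KEY,
        customer_name VARCHAR2(100),
        city_name     VARCHAR2(50),
        region        VARCHAR2(50)
    );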
What are warehouse schemas such as the "star schema" and "snowflake schema", and what are their advantages / disadvantages under different conditions?
How do you design an optimized data warehouse from both a data-load and a query-performance point of view?
What exactly are parallel processing and partitioning, and how can they be employed to optimize the data warehouse design?
What are the preferred indexes and constraints for a DWH?
How will the volume of data (from medium to very high) and the frequency of querying affect the design considerations?
Why a data warehouse?
What is the difference between OLTP and OLAP?
What are the features of a DWH?
Do you know some more ETL tools?
What is the use of the staging area?
Do you know the life cycle of a warehouse?
Have you heard about the star schema?
Tell me about yourself.
How many dimensions and facts are there in your project?
What is a dimension?
What is the difference between a DWH and a data mart?
1. How can you explain a DWH to a layman?
2. What are MOLAP and ROLAP? What is the difference between them?
3. What are the different schemas used in a DWH? Which one is most commonly used?
4. What is a snowflake schema? Explain.
5. What is a star schema? Explain.
6. How do you decide that you have to go for warehousing during the requirement study?
7. What are all the questions you put to your client when you are designing a DWH?
Oracle:
How many types of indexes are there?
Which indexes are used in a warehouse?
What is the difference between TRUNCATE and DELETE on a table?
How do you optimise a query? Read up on optimisation in Oracle.
Project:
Project description and related details.
• What is a Star schema?
• What is a Snowflake schema?
• What is the difference between those two?
• Which one would you choose and why?
• What are dimensions?
• What are facts?
• What is your role in the projects that you have done?
• What are Informatica PowerCenter's capabilities?
• What are the different types of transformers?
• What is the difference between Source and Joiner transformers?
• Why is a source used and why is a joiner used?
• What are active and passive transformers?
• How many transformers have you used?
• What is CDC?
• What are SCDs and what are the different types?
• Which of the types have you used in your project?
• What is a date dimension?
• How have you handled sessions?
• How can you handle multiple sessions and batches?
• On what platform was your Informatica server?
• How many mappings were there in your projects?
• How many transformers did your biggest mapping have?
• Which were your source and target platforms?
• What is a cube?
• How can you create a catalog and what are its types?
• What is the PowerPlay Transformer?
• What is PowerPlay Administrator? (Same as the above question.)
• What is slice and dice?
• What is your idea of an Authenticator?
• Can cubes exist independently?
• Can cubes be sources to another application?
• What is the maximum number of rows a date dimension can have?
• How have you done reporting?
• What are hot files?
• What are snapshots?
• What is the difference?
1] What is the difference between snowflake and star schemas?
2] How will you come to know that you have to do performance tuning?
3] Describe your project.
4] How many dimensions and facts are in your project?
5] Draw SCD Type 1 and SCD Type 2.
6] If you are getting timestamp data from the source and you have one port in the target with datatype date, how will you load it?
7] What are the different types of lookups?
8] What condition will you give in the Update Strategy transformation in SCD Type 1?
9] What are the different types of variables in the Update Strategy transformation?
10] What are target-based commit and source-based commit?
11] Why do you think SCD Type 2 is critical?
12] What are the types of facts?
13] What is a factless fact?
14] If I can return one port through a connected lookup, then why do you need an unconnected lookup?
15] If duplicate rows are coming from a flat file, how will you remove them using Informatica?
16] If duplicate rows are coming from a relational table, how will you remove them using Informatica?
17] If I did not give the group-by option in the Aggregator transformation, what will be the result?
18] What is multidimensional analysis?
19] If I give all the characteristics of a data warehouse to an OLTP system, will it be a data warehouse?
20] What are the characteristics of a data warehouse?
21] What is the break-up of your team?
22] How will you do performance tuning in a mapping?
23] Which is better for performance, a static or a dynamic cache?
24] What is target load order?
25] What are the transformations you worked on?
26] What is the naming convention you are using?
27] How are you getting data from the client?
28] How will you convert rows into columns and columns into rows using Informatica?
29] How will you enable a test load?
30] Did you work with connected and unconnected lookups? Tell the difference.
31] Did you ever use the Normalizer?
• Why should you do indexing first of all?
• What sort of indexing is done in the fact table and why?
• What sort of indexing is done in the dimensions and why?
• What sort of normalization will you have on dimensions and facts?
• What are materialized views?