
Data Warehousing and Business Intelligence

11 December 2010
Data Warehousing Interview Questions
Filed under: Data Warehousing, Other - Vincent Rainardi @ 10:59 pm
Tags: Data Warehousing, Other
Following my article last week, SSAS Developer Interview Questions, which
generated quite a lot of responses, I thought I'd write a similar thing on data
warehousing. Like before, I will be using LMH to show the complexity (Low, Medium,
High). This time, I'll add the purpose, i.e. what the questions are designed for.

Data warehousing skill never stands on its own. You don't interview people just for
their data warehousing knowledge. Usually you are either looking for an ETL
developer (Informatica, DataStage, BODI, SSIS), a Microsoft BI developer
(RS/AS/IS), a BO/Cognos (Report) Developer, a Data Warehouse Architect, a BI
Solution Architect, an ETL Architect, a Data Architect or an Enterprise Architect.
All these roles require data warehousing knowledge. So most likely you will be
combining these questions with other questions, e.g. for an Informatica developer
you will be combining them with Informatica questions.

When I interview to fill a role, what I look for particularly is whether the
candidate can do the job or not. Nowadays it is different from 10 years ago. We now
have Google and Bing. Anything we don't know we can quickly Google. So the amount of
knowledge is not important now. What is important (to me) is a) experience, b)
problem solving and c) character. I remember about 3 years ago I was interviewing a
candidate for a data architect role. The candidate was a speaker at a data
architecture conference so we were convinced she must be very good. I asked her what
the rule for the 2nd normal form was. She could not answer it. I asked about the 1st
and 3rd normal forms and she could not answer them either. This is bread and butter.
You can't do data modelling without knowing the normalisation rules.

But then in an interview with a bank 3 years ago I also went blank like her. I was
the author of a SQL Server data warehousing & BI book so they thought highly of me
with regards to SSIS, SSAS and SSRS. They asked me what the 5 tabs in SSIS BIDS
were. I could mention 4 but could not remember the 5th one, even though I was using
SSIS almost every day. Since then, when I interview I do not look for the amount of
knowledge, but whether the candidate can solve problems instead. I remember one day
my manager and I were interviewing for a Teradata developer role. I said to the
candidate that the amount of Teradata knowledge that he had was not important.
Within 5 minutes of opening the manual or Googling he would be able to get that
information. So I said I would not ask him about any Teradata SQL or BTEQ functions.
Instead I gave him 2 real world problems that we were facing in the project and
asked him to give me the solutions in about 5 minutes. The way he interrogated us
with questions to get information about the project and finally suggested a good
solution really impressed us, so we offered him the job. I can completely
understand that some people disagree with my approach. After that interview my boss
pulled me aside and told me off: "You must not say that in front of the candidates,
Vincent. Of course the amount of Teradata knowledge they possess is important! Why
do you think we hire them?"

So in the interview questions below, which crystallised from my experience, I put
both knowledge-based and experience/problem solving questions. Generally I'm less
interested in theory-only questions, so I try to wrap them up in real world
problems/situations.

I'm a firm believer that experience is the best teacher. So at interviews I always
try to find out if the candidate has done it before. So I test them using everyday
problems. People who have done data warehousing will surely have come across those
problems, and understand what the solutions are. People who are new to data
warehousing, or only know the DW theory from books, wouldn't have encountered those
problems and would not have a clue what the answers are.

1. Question: How do you implement Slowly Changing Dimension type 2? I am not
looking for the definition, but the practical implementation e.g. table structure,
ETL/loading. {M}

Answer: Create the dimension table as normal, i.e. first the dim key column as an
integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I'd
create 3 additional columns: IsCurrent flag, Valid From and Valid To (they are
datetime columns). With regards to the ETL, I'd check first if the row already
exists by comparing the natural key. If it exists (and its attributes have changed)
then expire the row and insert a new row. Set the Valid From date to today's date or
the current date time.

An experienced candidate (particularly a DW ETL developer) will not set the Valid
From date to the current date time, but to the time when the ETL started. This is
so that all the rows in the same load will have the same Valid From, which is 1
millisecond after the expiry time of the previous version, thus avoiding issues with
ETL workflows that run across midnight.
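
To make the answer above concrete, here is a minimal sketch of an SCD 2 load, written in Python against an in-memory SQLite database so that it can be run as-is. The table and column names (dim_customer, customer_code, city, is_current, valid_from, valid_to), the function apply_scd2, the high date and the sample rows are all my assumptions for illustration; a real load would be done in the ETL tool and would compare every SCD 2 attribute, not just one.

```python
import sqlite3
from datetime import datetime, timedelta

HIGH_DATE = "9999-12-31T00:00:00.000"  # assumed open-ended Valid To for current rows


def create_dim(conn):
    conn.execute("""
        CREATE TABLE dim_customer (
            customer_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate dim key
            customer_code TEXT NOT NULL,                      -- natural key
            city          TEXT,                               -- SCD 2 attribute
            is_current    INTEGER NOT NULL,
            valid_from    TEXT NOT NULL,
            valid_to      TEXT NOT NULL
        )""")


def apply_scd2(conn, source_rows, etl_start):
    """Expire changed rows and insert new versions, all stamped with etl_start."""
    valid_from = etl_start.isoformat(timespec="milliseconds")
    # The old version expires 1 millisecond before the new one becomes valid.
    expiry = (etl_start - timedelta(milliseconds=1)).isoformat(timespec="milliseconds")
    for code, city in source_rows:
        current = conn.execute(
            "SELECT customer_key, city FROM dim_customer "
            "WHERE customer_code = ? AND is_current = 1", (code,)).fetchone()
        if current is not None and current[1] == city:
            continue  # natural key exists and nothing changed
        if current is not None:
            conn.execute(
                "UPDATE dim_customer SET is_current = 0, valid_to = ? "
                "WHERE customer_key = ?", (expiry, current[0]))
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_code, city, is_current, valid_from, valid_to) "
            "VALUES (?, ?, 1, ?, ?)", (code, city, valid_from, HIGH_DATE))
    conn.commit()


conn = sqlite3.connect(":memory:")
create_dim(conn)
# Two nightly loads; each passes the time the ETL batch started, not "now".
apply_scd2(conn, [("C1", "London")], datetime(2010, 12, 10, 23, 50))
apply_scd2(conn, [("C1", "Leeds")], datetime(2010, 12, 11, 23, 50))
history = conn.execute(
    "SELECT city, is_current, valid_from, valid_to "
    "FROM dim_customer ORDER BY customer_key").fetchall()
```

After the second load, history holds the expired London version followed by the current Leeds version, and every row written by one load shares the same Valid From even if the batch itself runs past midnight.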

Purpose: SCD 2 is one of the first things that we learn in data warehousing. It
is considered basic/fundamental. The purpose of this question is to separate
the quality candidates from the ones who are bluffing. If the candidate cannot
answer this question you should worry.

2. Question: How do you index a fact table? And explain why. {H}

Answer: Index all the dim key columns, individually, non clustered (SQL Server) or
bitmap (Oracle). The dim key columns are used to join to the dimension tables, so
if they are indexed the join will be faster. An exceptional candidate will suggest
3 additional things: a) index the fact key separately, b) consider creating a
covering index in the right order on the combination of dim keys, and c) if the
fact table is partitioned the partitioning key must be included in all indexes.
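
A rough sketch of the indexing described above, again in Python with SQLite so it runs anywhere. SQLite has no clustered/non clustered or bitmap distinction, so this only shows which columns get an index; on SQL Server the same statements would be written as CREATE NONCLUSTERED INDEX and on Oracle as CREATE BITMAP INDEX. The table fact_sales, its columns and the index names are assumptions for illustration, and partitioning is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        fact_key     INTEGER PRIMARY KEY,  -- fact key, indexed on its own
        date_key     INTEGER NOT NULL,
        customer_key INTEGER NOT NULL,
        product_key  INTEGER NOT NULL,
        amount       REAL
    );

    -- One individual index per dim key column, used when joining to the dims.
    CREATE INDEX ix_fact_sales_date     ON fact_sales (date_key);
    CREATE INDEX ix_fact_sales_customer ON fact_sales (customer_key);
    CREATE INDEX ix_fact_sales_product  ON fact_sales (product_key);

    -- Optional covering index on the combination of dim keys plus the measure;
    -- the column order here is an assumption and should follow the real queries.
    CREATE INDEX ix_fact_sales_cover
        ON fact_sales (date_key, customer_key, product_key, amount);
""")

# List the indexes SQLite now holds for the fact table.
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(fact_sales)"))
```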

Purpose: Many people know data warehousing only in theory or only at the logical
data model level. This question is designed to separate those who have actually
built a data warehouse from those who haven't.

3. Question: In the source system, your customer record changes like this:
customer1 and customer2 now become one company called customer99. Explain a) the
impact to the customer dim (SCD1), b) the impact to the fact tables. {M}

Answer: In the customer dim we update the customer1 row, changing it to customer99
(remember that it is SCD1). We do a soft delete on the customer2 row by updating the
IsActive flag column (hard delete is not recommended). On the fact table we find
the Surrogate Key for customer1 and 2 and update it with customer99's SK.
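
Here is a small sketch of that merge in Python with SQLite. The tables dim_customer and fact_sales, the is_active column, the function merge_customers and the sample data are assumptions for illustration. Because the customer1 row is updated in place, its surrogate key becomes customer99's key, so only the fact rows that point to customer2 need to be repointed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,  -- surrogate key
        customer_code TEXT,                 -- natural key from the source system
        is_active     INTEGER
    );
    CREATE TABLE fact_sales (
        fact_key     INTEGER PRIMARY KEY,
        customer_key INTEGER,
        amount       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'customer1', 1), (2, 'customer2', 1);
    INSERT INTO fact_sales VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 2, 25.0);
""")


def merge_customers(conn, surviving_key, retired_key, new_code):
    """SCD1 merge: overwrite one dim row, soft delete the other, repoint facts."""
    with conn:  # one transaction, so dim and fact stay consistent
        conn.execute(
            "UPDATE dim_customer SET customer_code = ? WHERE customer_key = ?",
            (new_code, surviving_key))
        conn.execute(
            "UPDATE dim_customer SET is_active = 0 WHERE customer_key = ?",
            (retired_key,))
        conn.execute(
            "UPDATE fact_sales SET customer_key = ? WHERE customer_key = ?",
            (surviving_key, retired_key))


merge_customers(conn, surviving_key=1, retired_key=2, new_code="customer99")
totals = conn.execute("""
    SELECT d.customer_code, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.customer_code""").fetchall()
```

All three fact rows now roll up to customer99, and the customer2 row is still in the dimension but flagged inactive.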

Purpose: This is a common problem that everybody in data warehousing encounters. By
asking this question we will know if the candidate has enough experience in data
warehousing. If they have not come across this (probably they are new in DW), we
want to know if they have the capability to deal with it or not.

4. Question: What are the differences between the Kimball approach and Inmon's?
Which one is better and why? {L}

Answer: if you are looking for a junior role e.g. a developer, then the expected
answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas
in Inmon's we do CIF, i.e. an EDW in normalised form, and we then create a DM/DDS
from the EDW. Junior candidates usually prefer Kimball, because of query performance
and flexibility, or because that's the only one they know; which is fine. But if you
are interviewing for a senior role e.g. senior data architect then they need to say
that the approach depends on the situation. Both Kimball's & Inmon's approaches have
advantages and disadvantages. I explained some of the main reasons for having a
normalised DW here.

Purpose: a) to see if the candidate understands the core principles of data
warehousing or only knows the surface, b) to find out if the candidate is open
minded, i.e. the solution depends on what we are trying to achieve (there's no right
or wrong answer), or if they are blindly using Kimball for every situation.

5. Question: Suppose a fact row has unknown dim keys, do you load that row or not?
Can you explain the advantages/disadvantages? {M}

Answer: We need to load that row so that the total of the measure/fact is correct.
To enable us to load the row, we need to either set the unknown dim key to 0 or to
the dim key of a newly created dim row. We can also not load that row (so the total
of the measure will be different from the source system) if the business
requirement prefers it. In this case we load the fact row to a quarantine area
complete with error processing, DQ indicator and audit log. On the next day, after
we receive the dim row, we load the fact row. This is commonly known as Late
Arriving Dimension Rows and there are many sources for further information; one of
the best is Bob Becker's article here in 2006. Others refer to this as Early
Arriving Fact Row, which Ralph Kimball explained here in 2004.
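
The first option (set the unknown dim key to 0) can be sketched like this, in Python with SQLite. The tables dim_product, stage_sales and fact_sales, the 'UNKNOWN' member with key 0 and the sample rows are assumptions for illustration; the quarantine option is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_code TEXT);
    -- Key 0 is the assumed "unknown" member, created once when the dim is built.
    INSERT INTO dim_product VALUES (0, 'UNKNOWN'), (1, 'P1');

    CREATE TABLE stage_sales (product_code TEXT, amount REAL);
    -- P2 has not arrived in the product dimension yet.
    INSERT INTO stage_sales VALUES ('P1', 10.0), ('P2', 5.0);

    CREATE TABLE fact_sales (product_key INTEGER NOT NULL, amount REAL);
""")

# Dim key lookup: a left join keeps the fact row even when the lookup fails,
# and COALESCE maps the missing key to the unknown member.
conn.execute("""
    INSERT INTO fact_sales (product_key, amount)
    SELECT COALESCE(d.product_key, 0), s.amount
    FROM stage_sales s
    LEFT JOIN dim_product d ON d.product_code = s.product_code
""")

fact_total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
unknown_rows = conn.execute(
    "SELECT COUNT(*) FROM fact_sales WHERE product_key = 0").fetchone()[0]
```

The fact total still matches the staged source total, and the one row with a missing product can be found later by its key of 0 and corrected once the dim row arrives.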

Purpose: again this is a common problem that we encounter on a regular basis in data
warehousing. With this question we want to see if the candidate's experience level
is up to expectation or not.

6. Question: Please tell me your experience on your last 3 data warehouse projects.
What were your roles in those projects? What were the issues and how did you solve
them? {L}

Answer: There's no wrong or right answer here. With this question you are looking
for a) whether they have done similar things to your current project, b) whether
they have done the same role as the role you are offering, c) whether they have
faced the same issues as your current DW project.

Purpose: Some of the reasons why we pay more to certain candidates compared to
others are: a) they have done it before so they can deliver quicker than those who
haven't, b) they come from our competitors so we would know what's happening there
and we can make a better system than theirs, c) they have solved similar issues so
we could borrow their techniques.

7. Question: What are the advantages of having a normalised DW compared to a
dimensional DW? What are the advantages of a dimensional DW compared to a normalised
DW? {M}

Answer: For the advantages of having a normalised DW see here and here. The
advantages of a dimensional DW are: a) flexibility, e.g. we can accommodate changes
in the requirements with minimal changes to the data model, b) performance, e.g. you
can query it faster than a normalised model, c) it's quicker and simpler to develop
than a normalised DW and easier to maintain.

Purpose: to see if the candidate has seen the other side of the coin. Many people
in data warehousing only know Kimball/dimensional. The second purpose of this
question is to check if the candidate understands the benefits of dimensional
modelling, which is a fundamental understanding in data warehousing.

8. Question: What is 3rd normal form? {L} Give me an example of a situation where
the tables are not in 3rd NF, then make it 3rd NF. {M}

Answer: No column is transitively dependent on the PK. For example, column2 is
dependent on column1 (the PK) and column3 is dependent on column2. In this case
column3 is transitively dependent on column1. To make it 3rd NF we need to split it
into 2 tables: table1 which has column1 & column2 and table2 which has column2 and
column3.
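
As a worked example of that split, here is a sketch in Python with SQLite. The employee/department tables and their columns are my assumptions for illustration: emp_id plays the role of the PK, dept_code is the column that depends on the PK, and dept_name is the column that depends on dept_code rather than directly on the PK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Not in 3rd NF: dept_name depends on dept_code, and only through
    -- dept_code on the PK (emp_id).
    CREATE TABLE employee_flat (
        emp_id    INTEGER PRIMARY KEY,
        dept_code TEXT,
        dept_name TEXT
    );
    INSERT INTO employee_flat VALUES
        (1, 'FIN', 'Finance'), (2, 'FIN', 'Finance'), (3, 'IT', 'Technology');

    -- 3rd NF: split into two tables joined on dept_code.
    CREATE TABLE department (dept_code TEXT PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        dept_code TEXT REFERENCES department (dept_code)
    );
    INSERT INTO department SELECT DISTINCT dept_code, dept_name FROM employee_flat;
    INSERT INTO employee SELECT emp_id, dept_code FROM employee_flat;
""")

# Joining the two tables back together gives the original rows, so nothing is lost.
original = conn.execute(
    "SELECT emp_id, dept_code, dept_name FROM employee_flat ORDER BY emp_id").fetchall()
rejoined = conn.execute("""
    SELECT e.emp_id, e.dept_code, d.dept_name
    FROM employee e
    JOIN department d ON d.dept_code = e.dept_code
    ORDER BY e.emp_id""").fetchall()
department_rows = conn.execute("SELECT COUNT(*) FROM department").fetchone()[0]
```

The department name is now stored once per department instead of once per employee, which is the whole point of removing the transitive dependency.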

Purpose: A lot of people talk about 3rd normal form but they don't know what it
means. This is to test if the candidate is one of those people. If they can't
answer 3rd NF, ask 2nd NF. If they can't answer 2nd NF, ask 1st NF.

9. Question: Tell me how to design a data warehouse, i.e. what are the steps of
doing dimensional modelling? {M}

Answer: There are many ways, but it should not be too far from this order: 1.
Understand the business process, 2. Declare the grain of the fact table, 3. Create
the dimension tables including attributes, 4. Add the measures to the fact tables
(from Kimball's Toolkit book, chapter 2). Steps 3 and 4 could be reversed (add the
fact first, then create the dims), but step 1 & 2 must be done in that order.
Understanding the business process must always be the first, and declaring the
grain must always be the second.

Purpose: This question is for data architect or data warehouse architect to see if
they can do their job. It's not a question for an ETL, report or cube developer.

10. Question: How do you join 2 fact tables? {H}

Answer: It's a trap question. You don't usually join 2 fact tables, especially if
they have different grains. When designing a dimensional model, you include all the
necessary measures in the same fact table. If the measure you need is located on
another fact table, then there's something wrong with the design. You need to add
that measure to the fact table you are working with. But what if the measure has a
different grain? Then you add the lower grain measure to the higher grain fact
table. What if the fact table you are working with has a lower grain? Then you need
to get the business logic for allocating the measure.

It is possible to join 2 fact tables, i.e. using the common dim keys. But the
performance is usually horrible, hence people don't do this in practice, except for
small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key,
dim3key and FactTable2 has dim1key and dim2key then you could join them like this:

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from
( select dim1key, dim2key, sum(measure1) as measure1
  from FactTable1
  group by dim1key, dim2key
) f1
join FactTable2 f2
on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key

So if we don't join 2 fact tables that way, how do we do it? The answer is using
the fact key column. It is a good practice (especially in SQL Server because of the
concept of clustered index) to have a fact key column to enable us to identify rows
on the fact table (see my article here). The performance would be much better (than
joining on dim keys), but you need to plan this in advance as you need to include
the fact key column on the other fact table.

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from FactTable1 f1
join FactTable2 f2
on f2.fact1key = f1.factkey

I implemented this technique originally for self joining, but then expanded the
usage to join to other fact tables. But this must be used on an exception basis
rather than as the norm.

Purpose: not to trap the candidate of course, but to see if they have the
experience of dealing with a problem which doesn't happen every day.

11. Question: How do you index a dimension table? {L}

Answer: clustered index on the dim key, and non clustered index (individual) on
attribute columns which are used in the query's where clause.

Purpose: this question is critical if you are looking for a Data Warehouse
Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical
data model. Many of them don't know how to index. They don't know how different the
physical tables are in Oracle compared to Teradata. This question is not essential
if you are looking for a report or ETL developer. It's good for them to know, but
it's not essential.

12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph
Kimball.

Answer: He was the one who introduced the concept of data warehousing. Arguably
Barry Devlin was the first one, but he's not as popular as Inmon. If you ask who
Barry Devlin is or who Claudia Imhoff is, 99.9% of the candidates wouldn't know. But
every decent practitioner in data warehousing should know about Inmon and Kimball.

Purpose: to test if the candidate is a decent practitioner in data warehousing or
not. You'll be surprised (especially if you are interviewing a report developer) how
many candidates don't know the answer. If someone is applying for a BI architect
role and has never heard of Inmon you should worry.

13. Question: How do we build a real time data warehouse? {H}

Answer: if the candidate asks "Do you mean real time or near real time?" it may
indicate that they have a good amount of experience dealing with this in the past.
There are two ways we build a real time data warehouse (and this is applicable for
both Normalised DW and Dimensional DW):

a) By storing the previous periods' data in the warehouse then putting a view on top
of it pointing to the source system's current period data. The current period is
usually 1 day in DW, but in some industries e.g. online trading and ecommerce, it is
1 hour.

b) By storing the previous periods' data in the warehouse then using some kind of
synchronous mechanism to propagate the current period's data. An example of a
synchronous data propagation mechanism is SQL Server 2008's Change Tracking or the
old school trigger.

Near real time DW is built using an asynchronous data propagation mechanism, aka
mini batch (2-5 mins frequency) or micro batch (30s to 1.5 mins frequency).
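
Option a) above can be sketched as a single view, here in Python with SQLite. Everything named below (dw_sales_history, src_sales, etl_control, v_sales and the sample rows) is an assumption for illustration. In a real setup the source table would sit on another server and be reached through something like a linked server or database link, rather than living in the same database as it does in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Previous periods, already loaded into the warehouse by the nightly batch.
    CREATE TABLE dw_sales_history (sale_date TEXT, amount REAL);
    INSERT INTO dw_sales_history VALUES ('2010-12-09', 100.0), ('2010-12-10', 120.0);

    -- Stand-in for the source system's transaction table.
    CREATE TABLE src_sales (sale_date TEXT, amount REAL);
    INSERT INTO src_sales VALUES ('2010-12-10', 120.0), ('2010-12-11', 30.0);

    -- The last period that the batch has loaded into the warehouse.
    CREATE TABLE etl_control (last_loaded_date TEXT);
    INSERT INTO etl_control VALUES ('2010-12-10');

    -- History from the warehouse plus the current period straight from the source.
    CREATE VIEW v_sales AS
        SELECT sale_date, amount FROM dw_sales_history
        UNION ALL
        SELECT s.sale_date, s.amount
        FROM src_sales s
        WHERE s.sale_date > (SELECT last_loaded_date FROM etl_control);
""")

# A new transaction in the source is visible through the view with no ETL run.
conn.execute("INSERT INTO src_sales VALUES ('2010-12-11', 45.0)")
rows = conn.execute(
    "SELECT sale_date, SUM(amount) FROM v_sales "
    "GROUP BY sale_date ORDER BY sale_date").fetchall()
```

The filter on the control table is what stops the already loaded period from being counted twice.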

Purpose: to test if the candidate understands complex, non-traditional mechanisms
and follows the latest trends. Real time DW was considered impossible 5 years ago
and has only been developed in the last 5 years. If the DW is normalised it's easier
to make it real time than if the DW is dimensional, as there's dim key lookup
involved.

14. Question: What is the difference between a data mart and a data warehouse? {L}

Answer: Most candidates will answer that one is big and the other is small. Some
good candidates (particularly Kimball practitioners) will say that a data mart is
one star, whereas a DW is a collection of all stars. An excellent candidate will
give all the above answers, plus they will say that a DW could be the normalised
model that stores the EDW, whereas a DM is the dimensional model containing 1-4
stars for a specific department (both relational DB and multidimensional DB).

Purpose: The question has 3 different levels of answer, so we can see how deep the
candidate's knowledge of data warehousing is.

15. Question: What is the purpose of having a multidimensional database? {L}

Answer: Many candidates don't know what a multidimensional database (MDB) is. They
have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by
saying "an MDB is an OLAP database". Many will say "Oh... I see" but actually they
are still puzzled, so it will take a good few moments before they are back to earth
again. So ask again: "What is the purpose of having an OLAP database?" The answer
is performance and easier data exploration. An MDB (aka cube) is a hundred times
faster than a relational DB for returning an aggregate. An MDB is very easy to
navigate, drilling up and down the hierarchies and across attributes, exploring the
data.

Purpose: This question is irrelevant to a report or ETL developer, but a must for a
cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos)
should be able to answer the question as it's their bread and butter.

16. Question: Why do you need a staging area? {M}

Answer: Because:

a) Some data transformations/manipulations from the source system to the DWH can't
be done on the fly, but require several stages and therefore need to be landed on
disk first.

b) The time to extract data from the source system is limited (e.g. we were only
given a 1 hour window) so we just get everything we need out first and process it
later.

c) For traceability and consistency, i.e. some data transformations are simple and
some are complex, but for consistency we put all of them on stage first, then pick
them up from stage for further processing.

d) Some data is required by more than 1 part of the warehouse (e.g. ODS and DDS)
and we want to minimise the impact on the source system's workload. So rather than
reading twice from the source system, we land the data on the staging area, then
both the ODS and the DDS read the data from staging.

Purpose: This question is intended more for an ETL developer than a report/cube
developer. Obviously a data architect needs to know this too.

17. Question: How do you decide that you need to keep it as 1 dimension or split it
into 2 dimensions? Take for example dim product: there are attributes which are at
product code level and there are attributes which are at product group level.
Should we keep them all in 1 dimension (product) or split them into 2 dimensions
(product and product group)? {H}

Answer: Depends on how they are going to be used, as I explained in my article One
or two dimensions here.

Purpose: To test if the candidate is conversant in dimensional modelling. This
question is especially relevant for data architects and cube developers and less
relevant for a report or ETL developer.

18. Question: Fact table columns are usually numeric. In what case does a fact table
have a varchar column? {M}

Answer: degenerate dimension

Purpose: to check if the candidate has ever been involved in the detailed design of
warehouse tables. Follow up with question 19.

19. Question: What kind of dimension is a degenerate dimension? Give me an
example. {L}

Answer: A dimension which stays in the fact table. It is usually the reference
number of the transaction. For example: Transaction ID, payment ref and order ID.
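
For illustration, here is what a degenerate dimension looks like on a fact table, sketched in Python with SQLite. The table fact_order_line, its columns and the order numbers are assumptions. Note that order_id is a text (varchar) column sitting among the numeric dim keys and measures, with no dimension table behind it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_order_line (
        date_key    INTEGER NOT NULL,   -- joins to a date dimension (not shown)
        product_key INTEGER NOT NULL,   -- joins to a product dimension (not shown)
        order_id    TEXT    NOT NULL,   -- degenerate dimension: stays in the fact
        quantity    INTEGER,
        amount      REAL
    );
    INSERT INTO fact_order_line VALUES
        (20101211, 1, 'ORD-1001', 2, 20.0),
        (20101211, 2, 'ORD-1001', 1, 15.0),
        (20101211, 1, 'ORD-1002', 3, 30.0);
""")

# The degenerate dimension is used like any other dimension attribute:
# here it groups the order lines back into orders.
order_totals = conn.execute(
    "SELECT order_id, COUNT(*), SUM(amount) FROM fact_order_line "
    "GROUP BY order_id ORDER BY order_id").fetchall()
```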

Purpose: Just another question to test the fundamentals.

20. Question: What is snow flaking? What are the advantages and disadvantages? {M}

Answer: In dimensional modelling, snow flaking is breaking a dimension into several
tables by normalising it. The advantages are: a) performance when processing
dimensions in SSAS, b) flexibility if the sub dim is used in several places e.g.
city is used in dim customer and dim supplier (or in insurance DW: dim policy
holder and dim broker), c) one place to update, and d) the DW load is quicker as
there is less duplication of data. The disadvantages are: a) it is more difficult to
"navigate the star"*, i.e. you need to join a few tables, b) worse "sum group by"*
query performance (compared to a "pure star"*), c) a pure star is more flexible in
accommodating requirements, i.e. the city attributes for dim supplier don't have to
be the same as the city attributes for dim customer, d) with a pure star the DW load
is simpler as you don't have to integrate the city.

*: a "star" is a fact table with all its dimensions, "navigating" means
browsing/querying, "sum group by" is a SQL select statement with a group by
clause, "pure star" is a fact table with all its dimensions and none of the dims are
snow-flaked.
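
To show the "need to join a few tables" point, here is the same "sum group by" query against a pure star and against a snow-flaked customer dimension, sketched in Python with SQLite. All table names, columns and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pure star: city is just an attribute on the customer dimension.
    CREATE TABLE dim_customer_star (
        customer_key INTEGER PRIMARY KEY, customer_name TEXT, city_name TEXT);

    -- Snow flake: city is normalised out into its own table (a sub dim).
    CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE dim_customer_snow (
        customer_key INTEGER PRIMARY KEY, customer_name TEXT,
        city_key INTEGER REFERENCES dim_city (city_key));

    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);

    INSERT INTO dim_city VALUES (1, 'London'), (2, 'Leeds');
    INSERT INTO dim_customer_star VALUES
        (1, 'A', 'London'), (2, 'B', 'London'), (3, 'C', 'Leeds');
    INSERT INTO dim_customer_snow VALUES (1, 'A', 1), (2, 'B', 1), (3, 'C', 2);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 5.0);
""")

# Pure star: one join from the fact table to the dimension.
star_totals = conn.execute("""
    SELECT c.city_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer_star c ON c.customer_key = f.customer_key
    GROUP BY c.city_name ORDER BY c.city_name""").fetchall()

# Snow flake: the same question needs an extra join through the sub dim.
snow_totals = conn.execute("""
    SELECT ci.city_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer_snow c ON c.customer_key = f.customer_key
    JOIN dim_city ci ON ci.city_key = c.city_key
    GROUP BY ci.city_name ORDER BY ci.city_name""").fetchall()
```

Both queries return the same totals; the difference is the extra join, and that the city name is stored once in dim_city instead of once per customer.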

Purpose: Snow flaking is one of the classic debates in the dimensional modelling
community. It is useful to check if the candidate understands the reasons or is just
following blindly. This question is applicable particularly for data architects
and OLAP designers. If their answers are way off then you should worry. But it is
also relevant to ETL and report developers as they will be populating and querying
the structure.

I hope these interview questions will be useful for the data warehousing community.
Not only for the interviewees but also for the interviewers. I'm sure I made some
mistakes in this article (everybody does) so if you spot one please contact me. As
usual I welcome any question & discussion at vrainardi@gmail.com.

Vincent Rainardi, 11/12/2010

Update 6/5/2012: Just a quick note that Arshad Khan has put together 500 data
warehousing and BI questions and answers in his new book: Business Intelligence &
Data Warehousing Simplified: 500 Questions, Answers, & Tips. It is by far the most
comprehensive set of interview questions I have ever seen and will be very useful
for preparing for interviews. Some of the questions are:

What is a fact?
What are the different types of facts?
What are the characteristics of a fact table?
What is fact table granularity?
What is OLAP?
What are the benefits of OLAP?
What are OLAP limitations?
How does OLAP impact the data warehouse?
What are OLAP Models?
What is the Inmon approach?
What is the Kimball approach?
What is a star schema?
What are the benefits of a star schema?
What is the snowflake schema?

11 Comments

Good work

Comment by Ali Dogar, 11 May 2011 @ 7:31 pm

Very Impressive and insightful.

Comment by Raphik Minaua, 7 June 2011 @ 4:08 am

great article

Comment by vinay, 7 December 2011 @ 4:03 pm

thanks, it was helpful.

Comment by gita, 6 May 2012 @ 9:49 pm

Thanks for sharing this extremely specific interview tip article. You didn't mention
it, but I really like the question "How to maintain Business intelligence
activities?". It is a great question that not only allows you to show your technical
skills, but also your management and decision making skills. Thanks again!
http://questionstoaskduringaninterview.net/business-intelligence-interview-questions/

Comment by scoyne2, 27 May 2012 @ 8:06 pm

here i have collected lots of topics related to:

Data Warehouse Interview Questions and Answers

kindly have a look this would help you a lot

Regards,

Akaas Developer

Comment by Akaas Developer, 12 June 2012 @ 9:36 am

Perfect. Exactly what I was looking for!

Comment by Kerri, 21 December 2012 @ 4:41 pm

It is very good information. If you can add a little more information about the
inputs for creating the data model and who is involved in that process, it will be
good (like business requirement, data modeling, conceptual design, logical design,
physical design). That will give a clear picture of how to execute a data warehouse
project for

Comment by sahadevan, 19 May 2013 @ 12:15 pm

Hi Sahadevan,
some people create the data model based on the source systems. But I usually create
the data model based on the reporting/cube requirements, i.e. I identify what
data elements are required for the cubes/reports. These data elements are the ones
I create in the dimensional model. Based on that I specify the requirements for
the ETL, i.e. where these data elements should be sourced from. After it goes live,
often there are new reporting/cube requirements for new data elements (new
attributes, new measures) so we add them to the dimensional model, and source them
in the ETL design. So no, I don't bring everything from the source systems, but
only the data elements required for the analytics/reporting/dashboard.

As for conceptual model, logical model and physical model, I usually start with a
list of entities and how they are connected to each other. Based on this list of
entities I draw the conceptual model at entity level. Then I identify the
attributes and measures for each entity and create the logical dimensional model.
Translating to the physical model is usually an exercise of a) identifying the
appropriate data types, b) identifying appropriate indexes, c) identifying
appropriate partitioning criteria. If there is an ODS or NDS in the warehouse, I
usually design it based on the source systems, but I try to make it truly 3NF (lots
of tables from the source systems are 2NF or 1NF).

Hope this makes it clearer for you. If you have further questions please don't
hesitate. Also if you (or anybody) have a different view/practice I'd be interested
to hear from you. Of course, there is not only 1 approach that we can use; there
are many approaches that we can use to design the data model for a DW project.

Comment by Vincent Rainardi, 19 May 2013 @ 1:37 pm

Try hands on Informatica using a book

http://www.amazon.in/Learning-Informatica-PowerCenter-Rahul-Malewar/dp/1782176489/ref=tmm_pap_title_0?ie=UTF8&qid=1444030559&sr=1-1
Regards,
Rahul Malewar
8411002339

Comment by Rahul Malewar, 17 October 2015 @ 6:02 am
