HiringTheBest HiringAnalysts
There are three significant areas to cover. The assumption is that questions in these areas
will provide data to assess leadership, culture fit, & communication skills.
First: Business Process Assessment. The candidate should be able to assess
problems/opportunities in a "case" study method.
Second: Technical Depth. The candidate will need to retrieve, manipulate, & evaluate
large sets of data efficiently.
Third: Self-directed/Leadership. Will this candidate look for business opportunities with
passion?
1. Business Process Assessment Quick Questions
What are you reading currently?
What has influenced your business behavior most heavily in the last year?
What kind of process or project mgmt training have you had?
What do you think of (five forces, rational, UML, competitive advantage, Ted Levitt's
criticism of the "product lifecycle", Six Sigma)?
How do you get to root cause for an issue such as
Looking for: Evidence that the person is growing, stretching in a direction that is
successful at Amazon. Is the candidate reading The Learning Organization, The Innovator's
Dilemma, or Who Moved My Cheese? Do they read the Economist, MIT Technology Review, HBR,
or Newsweek? Do they recognize process terms?
Longer Questions: These should be of 2 types: simple "give me an equation for this
situation" questions, and then more vague case types.
Simple Equations
Profitability = Revenue - Costs
Need Inventory = OH - Demand (bonus if time phased & includes forecasts, intransits)
Predicted OH Inventory = OH + Intransits + POs - Demand - Forecast
Healthy Inventory =
Start just asking for a simple definition, then start discussing factors with the candidate.
For example in profitability, what factors go into costs?
Looking for thoughtfulness & testing of assumptions. How does the candidate think
through the question: systematically or ad hoc?
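The simple equations above can be written down as a small sketch, which is handy for probing which terms a candidate remembers to include. The function names and the sample figures are illustrative assumptions, not part of the original notes.

```python
def profitability(revenue: float, costs: float) -> float:
    """Profitability = Revenue - Costs."""
    return revenue - costs


def predicted_on_hand(on_hand: int, in_transit: int, open_pos: int,
                      demand: int, forecast: int) -> int:
    """Predicted OH Inventory = OH + Intransits + POs - Demand - Forecast."""
    return on_hand + in_transit + open_pos - demand - forecast


# Hypothetical figures for one item over one planning period.
print(profitability(revenue=1200.0, costs=950.0))  # 250.0
print(predicted_on_hand(on_hand=40, in_transit=10, open_pos=25,
                        demand=30, forecast=20))   # 25
```

A natural follow-up is to ask which of the inputs are time phased and how the result changes if the forecast is wrong.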
Cases:
Skip suggested to preface questions of this type with "there is no right answer, I want to
use this as an example to see how you approach a problem" <insert Nimrod, Janice, Skip
questions here>
2. Technical Assessment Questions Quick questions:
From a list of orders over the last week using the tool of your choice
1. rank the orders by quantity
2. avg quantity for each vendor
3. # of distinct vendors per week.
4. Find & count lines in a log file that have a specific ASIN or user id
(In an onsite, could have a data file on a laptop and say "show me".)
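As a rough sketch of what answers to the first three quick questions might look like, here they are in SQL run through Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for illustration; a candidate could equally answer in Excel or with Unix tools.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, vendor TEXT, order_day TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", "2006-05-01", 5),
        (2, "acme", "2006-05-02", 15),
        (3, "bolt", "2006-05-02", 10),
        (4, "cogs", "2006-05-09", 20),
    ],
)

# 1. Rank the orders by quantity (largest first).
ranked = conn.execute(
    "SELECT order_id, quantity FROM orders ORDER BY quantity DESC, order_id"
).fetchall()

# 2. Average quantity for each vendor.
avg_by_vendor = conn.execute(
    "SELECT vendor, AVG(quantity) FROM orders GROUP BY vendor ORDER BY vendor"
).fetchall()

# 3. Number of distinct vendors per week (%W = week of year, weeks start Monday).
vendors_per_week = conn.execute(
    "SELECT strftime('%W', order_day) AS wk, COUNT(DISTINCT vendor) "
    "FROM orders GROUP BY wk ORDER BY wk"
).fetchall()

print(ranked)
print(avg_by_vendor)
print(vendors_per_week)
```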
Looking for:
Unix - cut cat find grep sort
Excel/Access : basic functions, pivot tables, data structures, domain tables
SQL - nested queries, functions, basic joins
Perl - any scripting?, RegExp
Reference - does the candidate know how to find help, admit boundaries
Longer SQL/Unix Questions
1. I need to provide a report with the total units & the average cost of book orders
by day of week over the last 10 weeks, by country
A good answer will look something like
Select sum(quantity), avg(cost), product, to_char(date,'DY'),
country
from (
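The sample answer above is cut off in these notes. As a hedged sketch of the kind of query being looked for, the version below uses SQLite through Python, so `strftime('%w', ...)` stands in for Oracle's `to_char(date,'DY')`. The table layout, column names, sample rows, and the `as_of` parameter are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_day TEXT, country TEXT, product TEXT, "
    "quantity INTEGER, cost REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    ("2006-05-01", "US", "book", 2, 10.0),
    ("2006-05-08", "US", "book", 4, 20.0),
    ("2006-05-02", "DE", "book", 1, 8.0),
    ("2006-05-02", "DE", "dvd", 9, 99.0),   # not a book, filtered out
    ("2005-01-03", "US", "book", 7, 70.0),  # older than 10 weeks, filtered out
])

REPORT_SQL = """
SELECT country,
       strftime('%w', order_day) AS day_of_week,  -- 0 = Sunday ... 6 = Saturday
       SUM(quantity)             AS total_units,
       AVG(cost)                 AS avg_cost
FROM orders
WHERE product = 'book'
  AND order_day > date(:as_of, '-70 days')        -- last 10 weeks
  AND order_day <= :as_of
GROUP BY country, day_of_week
ORDER BY country, day_of_week
"""
rows = conn.execute(REPORT_SQL, {"as_of": "2006-05-10"}).fetchall()
print(rows)
```

Things worth discussing with the candidate: whether "average cost" means per order line or per unit, and how the date window should be anchored.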
1. "Say there's a text file of the form "userid-tab-command" that tracks all the
commands that a given user runs. How would you find out how many times user
"Bob" has run any command at all?"
A Good Answer: At its most basic:
"grep -c Bob filename" or "cat filename | grep -c Bob".
If they understand that "Bob" could be part of the command, then the correct grep is
actually:
grep -c "^Bob"
to anchor the user. Even better, so in case there's a user called "Bob" and another called
"BobH", they should do:
grep -c "^Bob<tab>".
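To see why the anchoring matters, the three grep patterns can be mimicked with Python's `re` module on a few made-up log lines. The sample data and the helper name `count_matches` are assumptions for illustration only.

```python
import re

# Hypothetical "userid<tab>command" log lines.
log_lines = [
    "Bob\tls",
    "BobH\tcat notes",
    "Alice\tmail Bob",
    "Bob\tgrep Bob file",
]


def count_matches(pattern: str, lines: list[str]) -> int:
    """Count lines matching the regex, like `grep -c pattern`."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


print(count_matches(r"Bob", log_lines))     # 4: also hits BobH and the command text
print(count_matches(r"^Bob", log_lines))    # 3: still counts user BobH
print(count_matches(r"^Bob\t", log_lines))  # 2: only user Bob
```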
Companies that hire lots of analysts have the process down to a science, just as Amazon
does for SDEs. The interview process at a big 5 consulting firm is very defined,
behaviorally focused, looking at capabilities. An Analyst is a unique creature but not
impossible to find & assess.
During the interview loop, in addition to culture fit & interpersonal skills, a candidate
should be reviewed on how they've displayed analyst-type competencies in the past AND
asked to solve a problem to display the competency in actuality. Analysts are usually good
presenters, so just asking them about the past may not display the limits of their abilities.
Here are a couple frameworks of core Analyst Competencies:
--1. Think broad and deep: can take the big picture strategic business view and can also
dive into the details to understand a problem
2. Problem solving skills: can they structure and frame a problem, make estimates when
necessary, figure out the dataset needed (smallest, easiest dataset to draw solid
conclusions), get and analyze the data, summarize the conclusions and their reasoning
3. Communication skills: clear, organized, concise, ability to adapt to audience (VP to
SDE), think on the fly, thoughtful
4. Multi-tasking: can they juggle many issues at one time?
5. Independence: ability to work with minimal direction and ask for help when needed
6. Customer focus
7. Cultural fit: Team (COFS, SCOS,...) and Amazon
8. Leadership
--Find, Frame, Analyze, & Deliver within Amazon
Find Problems/Opportunities
An analyst should be able to recognize broken processes, bad processes, troubleshoot
processes. But also prioritize whether the proposal is polishing a pig or creating a golden
cow. Building pretty toys with no ROI is a waste of time. Given the business maturity at
Amazon, there are a lot of process improvements or new businesses where money can be
saved/found.
Past Example A candidate should be able to point to past projects where they:
-worked as support
-Saved x USD, n Minutes as a result of a process change
Have them explain their role, then ask what upstream/downstream business impacts
occurred after the change occurred. Drive into specifics
Problem Solving Provide a problem for them to solve:
-What should Amazon add to its site to deliver more on the "Find Discover & Buy anything online"?
-What is different for Amazon over Blockbuster video?
-What impacts to supply chain & customer experience would be felt by adding an Amazon Air Travel Store?
-Given factors x,y,z - how would you calculate ROI on project Q to present to a sr. vice president.... in an hour
This competency is a display of business competency - does the candidate see the big
picture or get wrapped up into their project?
Potential Skills
-Process mapping: Can a candidate draw out a level 1,2,3 diagram? Do they understand ICOMs, or do they move into systems & dataflow?
-Basic Business Measures: ROI, DRP, Forecasting, S&OP
--Frame Model & Hypothesis
The analyst is part wizard, part math professor. They are called in to explain the past,
look into the crystal ball about the future, and draw a cool looking formula to make you
believe it. Acceptable skills vary here; an Ops Research type will need to demonstrate different
skills than an MBA type or a Supply Chain type. But at the end of the day, an Amazon
analyst will need to be able to stand at a whiteboard and draw some algebraic looking
formula (sum of receive time - confirm time for n asins * min fifo cost layer.... )
Can they identify the opportunity (competency 1) then define and model it here?
Past Example A candidate should be able to explain past projects where they:
-developed a model (forecasting, spreadsheet, financial...)
-describe the tool(s) with which they've worked (AMPL,Excel,Pkg
Software)
This competency is a display of analytic skills - does the candidate set assumptions,
challenge the definitions, and display the ability to draft a reasonable model? Could they
build a metrics package?
Potential Skills
-Modeling: Can a candidate draw out a forecast equation or a linear programming formulation?
-Advanced Business Measures: Time Value of Money
--Analyze
Once a candidate has built a model, no one is going to go get the data for them. The tools
on hand will be limited or perhaps not available. To succeed the analyst will need to identify
and evaluate a data source, then get the data themselves or negotiate for SDE time.
Since SDE time is money, negotiating is usually the less preferred choice. The key elements
here are abilities to:
*Retrieve Data
*Evaluate Data Quality
*Data Scale
So an analyst has found a good opportunity, determined how to quantify it, but how will
the control be built ongoing?
Past Example A candidate should be able to explain past projects where they:
-built a tool or heavily configured software
-what were the shortcomings? how did they drive through their weaknesses
-What data gathering tools were used, how big was the data set
--Deliver
Once the analysis is completed, is it just a report on a shelf? What changed? Were cost
reductions actually realized? What form did the analysis results take - powerpoint, 3 ring
binder, email, whitepaper? Who saw them and what did they do? Is the candidate aware
of good visualization guidelines (Tufte, W. Cleveland) or do they LOVE powerpoint? At
Amazon, Analysts often present their own results - will the work stand up to scrutiny?
Past Example A candidate should be able to explain past projects where they:
-presented results in detail, in 15 minutes
-How did you get your points across in your allotted 10 minutes of
executive time?
-What data presentation tools were used?
Problem Solving Provide a problem for them to solve - tweak it for ecommerce
-"You have 15 minutes tomorrow afternoon to report back to a VP about a question he asked
you today regarding specific metric accuracy - Could you prepare an outline of your
answer, what format would it be in, how would you followup on your
recommendations?"
Look for creativity
Potential Skills
-Creativity
-Effective communication
-A get-it-done attitude
Data Engineer Interview Questions
Contents
1.1 DW Concepts
1.2 Tuning
1.3 SQL
1.4 Oracle
1.5 ETL
1.6 Linux/Unix
1.7 Teradata
1.9.1 Oracle
Can you describe the different types of slowly changing dimensions (Type
I, II, III)? What are the key differences in their implementation?
1. Type I SCDs are dimensions where old data is overwritten with
new data and no historical data is kept.
2. Type II SCDs are dimensions where multiple records are kept to
track historical data. 'Version' or 'Effective Date' columns are common
ways to preserve unlimited history, with a new record for each
update.
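The difference between the two implementations described above is easy to show on a toy dimension. This is a minimal sketch using SQLite through Python; the table names, columns, and the use of both an effective date and a current-row flag are illustrative assumptions rather than a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_type1 (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE customer_type2 (
    customer_key   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id    INTEGER,                            -- natural key
    city           TEXT,
    effective_date TEXT,
    is_current     INTEGER
);
INSERT INTO customer_type1 VALUES (1, 'Seattle');
INSERT INTO customer_type2 (customer_id, city, effective_date, is_current)
VALUES (1, 'Seattle', '2005-01-01', 1);
""")

# Type I: overwrite in place; the old value is lost.
conn.execute("UPDATE customer_type1 SET city = 'Portland' WHERE customer_id = 1")

# Type II: close out the current row, then insert a new version.
conn.execute(
    "UPDATE customer_type2 SET is_current = 0 WHERE customer_id = 1 AND is_current = 1"
)
conn.execute(
    "INSERT INTO customer_type2 (customer_id, city, effective_date, is_current) "
    "VALUES (1, 'Portland', '2006-05-01', 1)"
)

type1_rows = conn.execute("SELECT city FROM customer_type1").fetchall()
type2_rows = conn.execute(
    "SELECT city, is_current FROM customer_type2 ORDER BY effective_date"
).fetchall()
print(type1_rows)  # [('Portland',)]
print(type2_rows)  # [('Seattle', 0), ('Portland', 1)]
```

Fact rows that reference `customer_key` keep pointing at the version that was current when they were loaded, which is what makes point-in-time reporting possible.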
Given a Type II dimension table having a 32bit guid as the natural key,
how would you design the fact tables to support both point-in-time and
current-hierarchy reporting?
What kinds of factors should be considered while building a fact table? Would
merging an Order table and an Order Item table make sense or not?
ANS: Factors like storage, maintenance, duplication, denormalization, volume, and backfill all contribute to the design
decision for a fact table in such a scenario.
You have an Item Orders fact table. Will you store the Product group of
the item in the fact? If so, why? If not, why not?
ANS: Push and pull ETL strategies refer to the way in which data
is transferred from source to ETL tool. Push ETL is when external
source sends data to ETL tool. Pull ETL is when ETL tool
requests/retrieves data from source.
Tuning
If you have a poorly performing report/ETL process, how would you investigate and tune it,
going all the way back to table design?
Explain plans: when tuning, what do you look for in an explain plan
that screams red flags?
What about the Oracle-level join types (hash, nested loop), and when
should each be used?
SQL
FROM ORDERS
SELECT
TO_CHAR(ORDER_DAY,'MONTH') AS MONTH
, SUM(QTY) AS MONTHLY_SUM
FROM ORDER_ITEMS
GROUP BY TO_CHAR(ORDER_DAY,'MONTH')
4. Pivot:
using the data from #3, give me the data with the Months as
columns instead of rows or
SELECT
FROM
FROM ORDER_ITEMS
GROUP BY TO_CHAR(ORDER_DAY,'MONTH')
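The pivot answer is truncated in these notes. One common, portable way to pivot is conditional aggregation, sketched below in SQLite through Python; this is not necessarily the answer the page originally gave. The column names follow the ORDER_ITEMS fragments above, while the sample rows and the restriction to three months are assumptions for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (asin TEXT, order_day TEXT, qty INTEGER)")
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", [
    ("A1", "2006-01-15", 3),
    ("A1", "2006-02-10", 4),
    ("A2", "2006-01-20", 5),
    ("A2", "2006-03-05", 6),
])

# One SUM(CASE ...) per month turns the monthly rows into columns.
PIVOT_SQL = """
SELECT SUM(CASE WHEN strftime('%m', order_day) = '01' THEN qty ELSE 0 END) AS jan,
       SUM(CASE WHEN strftime('%m', order_day) = '02' THEN qty ELSE 0 END) AS feb,
       SUM(CASE WHEN strftime('%m', order_day) = '03' THEN qty ELSE 0 END) AS mar
FROM order_items
"""
pivoted = conn.execute(PIVOT_SQL).fetchone()
print(pivoted)  # (8, 4, 6)
```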
SELECT
ASIN,
SUM(VALUE) AS ITEM_VALUE,
FROM ORDER_ITEMS
GROUP BY ASIN
FROM ORDERS
Query for customers who bought a year ago and yesterday.
FROM ORDERS
O6,C1, 01-May-2006
Give SQL for the list of customer_ids who placed more than 1 order
GROUP BY Customer
Give the SQL for the list of customer_ids who have placed at least 1
order in 2000 and at least 1 order in 2006.
GROUP BY Customer
Please write SQL which can generate the number of Orders for each
year, 2000 to 2006.
SELECT
FROM ORDERS
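Only fragments of the answers to the three ORDERS questions above survive, so here is a hedged sketch of one way to answer each, run on SQLite through Python. The column names and sample rows are assumptions, and `strftime('%Y', ...)` stands in for an Oracle `TO_CHAR(order_day,'YYYY')`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, order_day TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("O1", "C1", "2000-03-01"),
    ("O2", "C2", "2000-07-15"),
    ("O3", "C2", "2003-02-11"),
    ("O4", "C3", "2006-01-09"),
    ("O5", "C1", "2006-05-01"),
])

# Customers who placed more than one order.
repeat_customers = conn.execute("""
    SELECT customer_id FROM orders
    GROUP BY customer_id HAVING COUNT(*) > 1
    ORDER BY customer_id
""").fetchall()

# Customers with at least one order in 2000 and at least one in 2006.
both_years = conn.execute("""
    SELECT customer_id FROM orders
    WHERE strftime('%Y', order_day) IN ('2000', '2006')
    GROUP BY customer_id
    HAVING COUNT(DISTINCT strftime('%Y', order_day)) = 2
""").fetchall()

# Number of orders per year, 2000 to 2006 (years with no orders are absent).
per_year = conn.execute("""
    SELECT strftime('%Y', order_day) AS yr, COUNT(*) FROM orders
    WHERE strftime('%Y', order_day) BETWEEN '2000' AND '2006'
    GROUP BY yr ORDER BY yr
""").fetchall()

print(repeat_customers)  # [('C1',), ('C2',)]
print(both_years)        # [('C1',)]
print(per_year)          # [('2000', 2), ('2003', 1), ('2006', 2)]
```

A good follow-up for the last query is how to show a zero for years with no orders, which needs a list of years to outer join against.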
Display the records of employees who joined the department before their
manager.
SELECT emp1.*
Display the records of employees earning more than the average salary
in their department.
SELECT
FROM EMPLOYEES
SELECT
FROM EMPLOYEES
FROM
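The answers to the two EMPLOYEES questions above are also truncated. A minimal sketch of both, assuming a single employees table with a manager_id column (the schema, names, and rows are made up for illustration), could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER,
    manager_id INTEGER, join_date TEXT, salary REAL)""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Maya", 10, None, "2004-06-01", 150.0),
    (2, "Ravi", 10, 1, "2003-01-15", 100.0),   # joined before manager Maya
    (3, "Lena", 10, 1, "2005-09-30", 80.0),
    (4, "Omar", 20, None, "2002-02-02", 90.0),
])

# Self join: employees who joined before their manager.
joined_before_manager = conn.execute("""
    SELECT e.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.emp_id
    WHERE e.join_date < m.join_date
""").fetchall()

# Correlated subquery: employees paid more than their department's average.
above_dept_avg = conn.execute("""
    SELECT e.name
    FROM employees e
    WHERE e.salary > (SELECT AVG(d.salary) FROM employees d
                      WHERE d.dept_id = e.dept_id)
    ORDER BY e.name
""").fetchall()

print(joined_before_manager)  # [('Ravi',)]
print(above_dept_avg)         # [('Maya',)]
```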
ANS: A Cartesian product pairs each row of one table with every
row of each of the other tables listed in the query, so the result
has as many rows as the product of the tables' row counts. This
happens when no join condition is specified between the tables.
WHERE RANK = 2
ANS: Yes.
ANS: DUAL is a small table that Oracle creates automatically; it is
a real table, although it is used like a pseudo table. It has a single
row and only one column, named DUMMY, which is a VARCHAR2 data type.
Oracle
ANS: IN tells SQL to run the outer query against the list of values
within the clause. EXISTS tells SQL to check the subquery only
until there is a match. EXISTS can be faster because SQL stops
evaluating the subquery after the first match, whereas with IN it
may have to look at all the values; in practice the optimizer often
produces the same plan for both.
ANS: You can have more than 1 UNIQUE constraint within a table,
and its columns can be NULL, whereas there can only be one PK
constraint per table, and its columns cannot be NULL.
ANS: Insert.
ANS: DECODE can only work with scalar values. CASE can work
with predicates and sub queries, and handles NULL differently.
ANS: Hash joins are used for joining large data sets. The
optimizer uses the smaller of two tables or data sources to build
a hash table on the join key in memory. It then scans the larger
table, probing the hash table to find the joined rows. Nested
loop joins are used for joining a small number of rows, with a good
driving condition between the two tables. The join drives from the
outer loop to the inner loop: the inner loop is iterated for every
row returned from the outer loop, ideally by an index scan.
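The two join strategies described above can be illustrated in a few lines of plain Python. This is a teaching sketch of the general algorithms only, not how Oracle implements them; the function names and the sample rows are assumptions.

```python
from collections import defaultdict

small = [(1, "US"), (2, "DE")]                     # (country_id, code): build side
large = [(100, 1), (101, 2), (102, 1), (103, 3)]   # (order_id, country_id): probe side


def nested_loop_join(outer, inner):
    """For every outer row, scan the inner rows for a matching key."""
    return [(o[0], i[1]) for o in outer for i in inner if o[1] == i[0]]


def hash_join(build, probe):
    """Build a hash table on the smaller input, then probe it once per row of the larger."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(p[0], value) for p in probe for value in table.get(p[1], [])]


print(nested_loop_join(large, small))  # [(100, 'US'), (101, 'DE'), (102, 'US')]
print(hash_join(small, large))         # [(100, 'US'), (101, 'DE'), (102, 'US')]
```

The nested loop does work proportional to the product of the two input sizes unless the inner lookup is indexed, while the hash join reads each input once, which is why it suits large data sets.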
ETL
1. Add in world wide reporting. How would that affect your ETL?
ANS: Your ETL will then have to be adjusted to ensure that the
data is available for reporting, based on the different time zones.
2. Given a billion row table, How do you add a new column and backfill
the data from source without impacting the user?
ANS: You will create a new table with the additional column and
then backfill the data from the existing table. Once the backfill is
complete, you can then deprecate the original table and publish
the new table with the additional column to the users. This
approach causes no impact to the users, as you are creating a
separate table to backfill (with the additional column) instead of
attempting to perform an UPDATE on a billion row table.
3. What is the best strategy to use when you have to delete 400 million
rows from a billion row table?
ANS: Create a new table and backfill it with only the rows you want
to keep from the original table, leaving out the 400 million rows
to be deleted. Then publish the new table to the users, while
deprecating the original.
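A small-scale sketch of the build-aside-and-swap idea from the last two answers, using SQLite through Python. Here only the rows to keep are copied, so no large DELETE is ever run; the table names, the `status` filter, and the 10-row data set are assumptions, and a real warehouse would use its own mechanisms (for example CTAS, partition exchange, or a view switch) rather than a plain rename.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, "cancelled" if i % 5 < 2 else "shipped") for i in range(10)],
)

# Build the replacement off to the side, copying only the rows to keep,
# while users continue to query the original table.
conn.execute("CREATE TABLE orders_new (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "INSERT INTO orders_new SELECT order_id, status FROM orders "
    "WHERE status <> 'cancelled'"
)

# Swap the tables in one transaction, then retire the original.
with conn:
    conn.execute("ALTER TABLE orders RENAME TO orders_old")
    conn.execute("ALTER TABLE orders_new RENAME TO orders")
conn.execute("DROP TABLE orders_old")

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 6
```

The same pattern covers the new-column backfill: create the new table with the extra column, populate it from the source, then swap.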
Linux/Unix
1. cron
2. combine 2 files
3. dedupe #2
4. pipes
If a file has permissions 000, then who can access the file?
ANS: Only root can read/write the file, while only the owner (or root)
can change the file's permissions. No one can execute the file.
What is redirection?
Given that the 3rd column is the primary key, how would you find if there
are duplicates in the file?
What is piping?
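For the duplicate-key question above, the logic is simply "count each value of the 3rd column and report any that appear more than once". A minimal Python sketch on made-up, tab-delimited lines (the delimiter and the sample data are assumptions):

```python
from collections import Counter

# Hypothetical tab-delimited file contents; the 3rd column is the key.
lines = [
    "a\tx\tK1",
    "b\ty\tK2",
    "c\tz\tK1",
]

key_counts = Counter(line.split("\t")[2] for line in lines)
duplicates = sorted(key for key, n in key_counts.items() if n > 1)
print(duplicates)  # ['K1']
```

On the command line the same idea is usually expressed with the tools listed earlier, for example `cut -f3 file | sort | uniq -d`.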
ANS: grep is used to search for patterns in a file, whereas find
is used to search for files or directories.
ANS: You could use awk's built-in variables: NR gives the number
of the record (line) being processed, and NF gives the number of
fields in it. For example, if a file should have 10 columns, then
you would print NR for each line where NF<10.
Teradata
The following are just definitions. Try to provide a real-life problem, like how you would
model the data so you can report on delay times between order state statuses - pending,
success, error, etc.
1. What are the primary differences between a transactional database
and a data warehouse database?
1. A Transaction Database is a relational database with normalized
tables, whereas a Data Warehouse uses denormalized tables.
2. A Transaction Database is highly volatile and designed to maintain
the transactions of the business, whereas a Data Warehouse is
non-volatile with periodic updates.
3. A Transaction Database is OLTP. A Data Warehouse is for analysis.
4. A Transaction Database holds functional data. A Data Warehouse
database is subject oriented.
2. Differentiate Primary Key and Partition Key?
1. ANS: Primary key is the key we define on the table column or set
of columns(composite pk) to make sure all the rows in a table
are unique. Partition key is the key that we use to partition the
table with.
1. Type I: Replace the old record with a new record with updated
data, thereby losing the history. But a data warehouse has a
responsibility to track the history effectively, which is where a
Type I implementation fails.
2. Type II: Create a new additional dimension table record with the new
value. This way we can keep the history. We can determine
which dimension record is current by adding a current record flag or
by a time stamp on the dimensional row.
2. What is the difference between Snow flake and Star Schema?
What are the benefits of each?
1. Star Schema
1. Star join is a primary key to foreign key join of the
dimension tables to a fact table.
Describe the normal forms? What is BCNF? 2nd normal form? 3rd
normal form?
3rd normal form represents a table where every nonprime attribute is non-transitively dependent on every
candidate key in the table. The attributes that do not
contribute to the description of the primary key are
removed from the table. In other words, no transitive
dependency is allowed.
Add a value to the OLTP design that alters the grain of one
associated dimension (e.g: new/used books). Where would the
change be propagated to?
ANS: One way is by using bridge tables that hold at least the 2
foreign keys from the 2 tables that have the M:M relationship.
2. What kind of logging system would you design for sql and pl/sql scripts
so that all errors get logged in error tables? Provide at least two design
solutions
ANS: One design solution within Oracle is, you can create a
stored procedure call that can be attached to any other
package/procedure that would be able to gather data on an
error/exception or user/system checkpoint, and insert into a
specified table. Another solution, using UNIX, would be to create
one script as the controller file that captures the unique
identifiers for each error/checkpoint, while another script uses
the data from the controller file to pull more data from Oracle's
error logs.
ANS: RAC systems allow closer to 100% uptime, can scale with
less hardware, and possibly even handle a larger load. On the
other hand, RAC systems can be costly, and more difficult to
manage (training and troubleshooting). Also, RAC systems usually
only improve availability, which is just one aspect of a well
designed system.
followup: Have you architected a reporting solution. What were the challenges faced.
SQL
Department
deptid (primary key)
deptname
Questions
Type | Question | SQL
Group by | |
Group by having | |
Sub query | Highest salary |
Join | Employee with dept name |
Self join | |
Self join | |
Correlated subquery | |
The above will cover some basic scenarios. If you want multiple joining conditions, maybe
add another table like address into the mix and create some joining conditions. You can
ask about EXISTS, NOT EXISTS and other correlated subquery conditions.
Ask some questions regarding partitioning. Say we have tables orders and customers;
orders has an order date and performance issues, so how do we improve it? The candidate
should arrive at partitioning by date. Maybe add one question about giving hints in a SQL query.
PipsInterviewQuestions
1. What is the 'Simpson's paradox'? Give an example. followup: How might this paradox
occur in continuous distributions?
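A quick numeric illustration of the paradox, using the figures commonly quoted for the kidney-stone treatment example: treatment A has the better success rate within each stone-size group, yet treatment B looks better once the groups are pooled. The code is only a sketch; the variable names are arbitrary.

```python
# (successes, patients) per treatment and stone size, as commonly quoted
# for the kidney-stone example.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes: int, total: int) -> float:
    return successes / total


for size in ("small", "large"):
    # Treatment A wins within each stone-size group...
    assert rate(*data["A"][size]) > rate(*data["B"][size])

overall = {
    t: rate(sum(s for s, _ in groups.values()), sum(n for _, n in groups.values()))
    for t, groups in data.items()
}
# ...yet B looks better once the groups are pooled.
print({t: round(r, 2) for t, r in overall.items()})  # {'A': 0.78, 'B': 0.83}
```

The reversal comes from the unequal group sizes: A was mostly given to the harder large-stone cases, B mostly to the easier small-stone cases.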
SQL
1. Suppose you are aggregating shipping_addresses over customers; each customer has a
customer_id and each address has an address_id; customers may have multiple shipping
addresses.
We want to aggregate shipping address zip codes up to customers to choose a
'representative' zip code for each customer that can be used for model building.
There are three tables
Create a sql query that returns the most recently used zip-code and the most commonly
used zip code for each customer. Join the results of this query with the census table to get
the medianHouseValue for the zip code for each customer.
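The table definitions did not survive in these notes, so the sketch below assumes a shipping_addresses table with a last_used date and a census table keyed by zip; every table name, column name other than medianHouseValue, and sample row is an assumption. It runs on SQLite through Python and needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipping_addresses (
    address_id INTEGER, customer_id INTEGER, zip TEXT, last_used TEXT);
CREATE TABLE census (zip TEXT PRIMARY KEY, medianHouseValue INTEGER);
INSERT INTO shipping_addresses VALUES
    (1, 1, '98101', '2006-01-01'),
    (2, 1, '98101', '2006-02-01'),
    (3, 1, '97201', '2006-03-01'),
    (4, 2, '10001', '2006-01-15');
INSERT INTO census VALUES ('98101', 400000), ('97201', 300000), ('10001', 700000);
""")

SQL = """
WITH recent AS (      -- rank each customer's addresses, newest first
    SELECT customer_id, zip,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY last_used DESC) AS rn
    FROM shipping_addresses
), counts AS (        -- how often each customer used each zip
    SELECT customer_id, zip, COUNT(*) AS uses
    FROM shipping_addresses
    GROUP BY customer_id, zip
), common AS (        -- rank zips by usage, ties broken by zip
    SELECT customer_id, zip,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY uses DESC, zip) AS rn
    FROM counts
)
SELECT r.customer_id, r.zip AS recent_zip, cr.medianHouseValue,
       c.zip AS common_zip, cc.medianHouseValue
FROM recent r
JOIN common c  ON c.customer_id = r.customer_id AND c.rn = 1
JOIN census cr ON cr.zip = r.zip
JOIN census cc ON cc.zip = c.zip
WHERE r.rn = 1
ORDER BY r.customer_id
"""
rows = conn.execute(SQL).fetchall()
print(rows)
```

Points to probe: how ties for "most common" should be broken, and what happens to customers whose zip code is missing from the census table (the inner joins above drop them).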
SQLInterviewQuestions
I've created this page as a place to put SQL puzzles to assign candidates who claim strong
SQL backgrounds as homework, or on-site, or phone screen questions (in decreasing
order of difficulty).
Homework Questions
1. Prove or disprove the following equation:
( X join(f(X,Y)) Y ) left join(g(Y,Z)) Z == X join(f(X,Y)) ( Y
left join(g(Y,Z)) Z )
where all the field names of X, Y, and Z are distinct. [Answer: true.
Argument via set-theoretic calculation. Incidentally, Oracle
Corporation's query-plan optimizer team is in a state of denial about
this equivalence.]
On-site Questions
1. Suppose I have two entities in my DB: Objects, and Tags. Suppose also
that I have a mapping table ObjectTag which represents a many-to-many relationship between Objects and Tags. Now I wish to find, given
a finite input list of Tag ids, the set of Objects which map to (a) any of
the input tags [easy], and (b) all of the input tags [harder]. Can you do
(a) and (b) with one query each?
(a)
select distinct o.*
from Objects o join ObjectTag ot on o.id = ot.obj_id
where ot.tag_id in ( <input list> )
(b) Two ways, with the second worth many more points than the
first in terms of elegance. Let n be the length of the input list:
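The two answers for (b) are missing from these notes. One widely used single-query approach, which may or may not be either of the intended answers, is to filter to the input tags, group by object, and keep the groups whose distinct tag count equals n. A sketch on SQLite through Python, with made-up rows and a hypothetical helper name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Objects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ObjectTag (obj_id INTEGER, tag_id INTEGER, PRIMARY KEY (obj_id, tag_id));
INSERT INTO Objects VALUES (1, 'apple'), (2, 'brick'), (3, 'cherry');
INSERT INTO ObjectTag VALUES (1, 10), (1, 20), (2, 10), (3, 20), (3, 30);
""")


def objects_with_all_tags(tag_ids):
    """Return ids of objects mapped to every tag in tag_ids (case b)."""
    tags = sorted(set(tag_ids))
    placeholders = ", ".join("?" for _ in tags)
    sql = f"""
        SELECT ot.obj_id
        FROM ObjectTag ot
        WHERE ot.tag_id IN ({placeholders})
        GROUP BY ot.obj_id
        HAVING COUNT(DISTINCT ot.tag_id) = ?   -- n = number of input tags
        ORDER BY ot.obj_id
    """
    return [row[0] for row in conn.execute(sql, (*tags, len(tags)))]


print(objects_with_all_tags([10, 20]))  # [1]
print(objects_with_all_tags([20]))      # [1, 3]
```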
What command would I use to find the names of all processes running
as a specific user?
What is a BST
or
-----------------------------------------------------------------------------------------------------------------------------
DWInterviewCompetencies
1 Competencies
1.7 DW Grid
Following are the competencies identified for each interviewer to focus on for the
DW Data Engineer role. Before looking into the competencies, please abide by the
following:
Please make sure you have only two Competencies from the list below;
if it's more, please ask your HM.
For any skill set where you have a serious data point that is outside of
your competency, please don't vote on it; instead keep it in Pros/Cons.
This should include Operational Data Engineering skills that are required in the DW world.
ONLY HIRING MANAGER WILL DO THIS. Candidates should be comfortable
with partitioning, parallelism, impacts to objects (indexes, MVs), huge backfills,
different granularity handling etc.
Example questions:
Huge, billion rows, multi terabyte data. Some section corrupted, how
will you backfill only those affected rows?
Tables in three Clusters out of sync, how will you correct it?
Duplicates
Data errors when they don't match column data type definition
A big file, 500 M rows (200 GB): how will you load it into tables?
External table
SQL producing 500 M rows, writing takes a long time: what is the candidate's
thought process to make it better?
This includes both OLTP Data Modeling and DW Data Modeling and Design.
ONLY HIRING MANAGER and BR WILL DO OLTP DM, OTHERS PLEASE
ASK MORE DW DM and Design
Give a use case and ask him to design a data model (some DMs that
we ask are bookmyshow.com, car pooling, table management in a
restaurant etc..)
Aggregate designs
How will you design a multi granular table with some measures
against 3 dimensions?
How will you select a particular granularity row (see for bitmap
indexes on booleans that describe the granularity of that row?)
Execution plans
Partitioning concepts
Parallelism concepts
This includes giving candidates problems and observing their approach and SQL coding
skills. My recommendation is to start off with simple SQL coding problems and move to
medium and complex problems that require intermediate designs, implementing those
with SQL code as well. You can also give problems that require procedural coding
(PL/SQL programming). In SQL, please look for minimal scans, effective joins, not
too many subqueries, set operators, temporary tables, WITH clauses etc.
Examples include:
Analytical functions (lag, lead, ranks, row numbers, first value, last value
etc.). If the candidate DOESN'T know analytical functions, IT'S
FINE. Ask more on solving using group bys.
Pivoting, De-Pivoting
Efficient FOR looping in PL/SQL programs (if it's two for loops, can it be
done in a single for loop etc.)
Project management
Culture fit
HM can also pick any skill set from above just to be comfortable, and
please include that in Pros/Cons.
BR competencies.
DW Grid
Competency | Votes
Data Engineering | 2 Votes
DM and Design (OLTP/DW) | 2 Votes
Coding and PS | 3 Votes
DB Concepts | 2 Votes
HM | 2 Votes
BR | 2 Votes
DE - A B
DM - C D
Coding - A B C
DB Concepts - D HM
HM Round - HM BR
BR - BR HM
So we need 4 onsite interviewers + a HM + a BR. (HM should do one of the
competencies as well).
Data Engineering | Coding and PS | DB Concepts | DM (OLTP/DW) | General HM/BR skill sets
Venkatesh Mohan
Abhishek Agrawal
Rakesh Singh
Naidu Rongali
Aniruddha Vishnupurikar
Paparao Chinthagunti
Ankush Kuhar
Samar Sodhi
AmazonAnalyticsDEInterviewsQuestions
Contents
1.2.1 BI Tools
1.2.2 Reporting
1.2.3 SQL
1.2.5 Unix
1.2.8 Essbase
Basic
Intermediate
Advanced
Reporting
1. What are drill down and drill across reports, what is the difference?
7. Given order and order items tables, select customer ids of customers
who placed orders with more than 3 items (having or subquery)
8. What is the use of DESC in SQL?
9. How do you find the number of rows in a Table?
10. What is Cartesian product in the SQL?
11. What is a view? What is materialized View? What is the difference
between view and materialized view?
12. Can you insert data into a view?
13. What is a merge statement? What is the requirement for a merge
statement? Is PK necessary for merge?
14. What is dual? Is it a table? If so what columns does it have? What's the
data type?
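For question 7 in the list above, both of the hinted forms (HAVING and subquery) can be sketched on SQLite through Python. The table and column names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_items (order_id INTEGER, asin TEXT);
INSERT INTO orders VALUES (1, 100), (2, 200);
INSERT INTO order_items VALUES
    (1, 'A'), (1, 'B'), (1, 'C'), (1, 'D'), (2, 'A'), (2, 'B');
""")

# HAVING form: count items per order, keep orders with more than 3.
having_rows = conn.execute("""
    SELECT DISTINCT o.customer_id
    FROM orders o JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY o.order_id, o.customer_id
    HAVING COUNT(*) > 3
""").fetchall()

# Subquery form of the same question.
subquery_rows = conn.execute("""
    SELECT DISTINCT customer_id FROM orders
    WHERE order_id IN (SELECT order_id FROM order_items
                       GROUP BY order_id HAVING COUNT(*) > 3)
""").fetchall()

print(having_rows)    # [(100,)]
print(subquery_rows)  # [(100,)]
```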
Basic
Intermediate
Advanced
8. What is MAXL?
9. What is MDX?
10. What is aggregation?
11. Why is aggregation needed?
12. If new data is added to the cube, without adding new dim members, is
re-aggregation required?
13. What is query based aggregation and stop value based aggregation?