
How to Ace a Data Science Interview

As I mentioned in my first post, I have just finished an extensive tech job search, which
featured eight on-sites, along with countless phone screens and informal chats. I was
interviewing for a combination of data science and software engineering (machine learning)
positions, and I got a pretty good sense of what those interviews are like. In this post, I give an
overview of what you should expect in a data science interview, and some suggestions for how to
prepare.
An interview is not a pop quiz. You should know what to expect going in, and you can take the
time to prepare for it. During the interview phase of the process, your recruiter is on your side
and can usually tell you what types of interviews you'll have. Even if the recruiter is reluctant to
share that, common practices in the industry are a good guide to what you're likely to see.
In this post, I'll go over the types of data science interviews I've encountered, and offer
my advice on how to prepare for them. Data science roles generally fall into two broad areas of
focus: statistics and machine learning. I only applied to the latter category, so that's the type of
position discussed in this post. My experience is also limited to tech companies, so I can't offer
guidance for data science in finance, biotech, etc.
Here are the types of interviews (or parts of interviews) I've come across.
Always:

Coding (usually whiteboard)

Applied machine learning

Your background

Often:

Culture fit

Machine learning theory

Dataset analysis

Stats

You will encounter a similar set of interviews for a machine learning software engineering
position, though more of the questions will fall in the coding category.

Coding (usually whiteboard)

This is the same type of interview you'd have for any software engineering position, though the
expectations may be less stringent. There are lots of websites and books that will tell you how to
prepare. Practice your coding skills if they're rusty. Don't forget to practice coding away from
the computer (e.g. on paper), which is surely a skill that's rusty. Review the data structures you
may never have used outside of school: binary search trees, linked lists, heaps. Be comfortable
with recursion. Know how to reason about algorithm running times. You can generally use any
real language you want in an interview (Matlab doesn't count, unfortunately); Python's
succinct syntax makes it a great language for coding interviews.
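If your whiteboard skills are rusty, practicing a small problem end-to-end helps. Here's a sketch of a classic exercise of the kind you might rehearse on paper (a generic practice problem, not a question from any particular company): validating a binary search tree with recursion.

```python
# Classic whiteboard exercise: validate a binary search tree.
# Runs in O(n) time and O(h) space, where h is the tree height.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def is_bst(node, lo=float("-inf"), hi=float("inf")):
    """Every node must fall within the (lo, hi) bounds inherited from
    its ancestors, not just satisfy the local parent comparison."""
    if node is None:
        return True
    if not (lo < node.val < hi):
        return False
    return (is_bst(node.left, lo, node.val) and
            is_bst(node.right, node.val, hi))

# A valid BST...
valid = Node(5, Node(3, Node(1), Node(4)), Node(8))
# ...and a tree that passes a naive parent-only check but is not a BST:
# the 6 sits in the left subtree of the 5.
invalid = Node(5, Node(3, Node(1), Node(6)), Node(8))
print(is_bst(valid))    # True
print(is_bst(invalid))  # False
```

The ancestor-bounds trick (rather than comparing each node only to its parent) is exactly the kind of edge case interviewers like to probe.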
Prep tips:

If you get nervous in interviews, try doing some practice problems under time
pressure.

If you don't have much software engineering experience, see if you can get a
friend to look over your practice code and provide feedback.

During the interview:

Make sure you understand exactly what problem you're trying to solve. Ask
the interviewer questions if anything is unclear or underspecified.

Make sure you explain your plan to the interviewer before you start writing
any code, so that they can help you avoid spending time going down less-than-ideal paths.

If you can't think of a good way to do something, it often helps to start by
talking through a dumb way to do it.

Mention what invalid inputs you'd want to check for (e.g. input variable type
check). Don't bother writing the code to do so unless the interviewer asks. In
all my interviews, nobody has ever asked.

Before declaring that your code is finished, think about variable initialization,
end conditions, and boundary cases (e.g. empty inputs). If it seems helpful,
run through an example. You'll score points by catching your bugs yourself,
rather than having the interviewer point them out.

Applied machine learning

All the applied machine learning interviews I've had focused on supervised learning. The
interviewer will present you with a prediction problem, and ask you to explain how you would
set up an algorithm to make that prediction. The problem selected is often relevant to the
company you're interviewing at (e.g. figuring out which product to recommend to a user, which
users are going to stop using the site, which ad to display, etc.), but can also be a toy example
(e.g. recommending board games to a friend). This type of interview doesn't depend on much
background knowledge, other than having a general understanding of machine learning
concepts (see below). However, it definitely helps to prepare by brainstorming the types of
problems a particular company might ask you to solve. Even if you miss the mark, the
brainstorming session will help with the culture fit interview (also see below).
When answering this type of question, I've found it helpful to start by laying out the setup of the
problem. What are the inputs? What are the labels you're trying to predict? What machine
learning algorithms could you run on the data? Sometimes the setup will be obvious from the
question, but sometimes you'll need to figure out how to define the problem. In the latter case,
you'll generally have a discussion with the interviewer about some plausible definitions (e.g.,
what does it mean for a user to stop using the site?).
The main component of your answer will be feature engineering. There is nothing magical about
brainstorming features. Think about what might be predictive of the variable you are trying to
predict, and what information you would actually have available. I've found it helpful to give
context around what I'm trying to capture, and to what extent the features I'm proposing reflect
that information.
For the sake of concreteness, here's an example. Suppose Amazon is trying to figure out what
books to recommend to you. (Note: I did not interview at Amazon, and have no idea what they
actually ask in their interviews.) To predict what books you're likely to buy, Amazon can look for
books that are similar to your past Amazon purchases. But maybe some purchases were mistakes,
and you vowed to never buy a book like that again. Well, Amazon knows how you've interacted
with your Kindle books. If there's a book you started but never finished, it might be a positive
signal for general areas you're interested in, but a negative signal for the particular author. Or
maybe some categories of books deserve different treatment. For example, if a year ago you were
buying books targeted at one-year-olds, Amazon could deduce that nowadays you're looking for
books for two-year-olds. It's easy to see how you can spend a while exploring the space between
what you'd like to know and what you can actually find out.
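To make the brainstorming concrete in code, here's a rough sketch of what a couple of these features might look like. All the field names (category, author, percent_read) are invented for illustration; Amazon's actual data and features are unknown to me.

```python
# Hypothetical feature-engineering sketch for "will this user buy this book?".
# Field names are invented; a real recommender would draw on far richer data.

def book_features(user, book):
    purchases = user["purchases"]
    return {
        # Raw similarity: how many past purchases share the book's category.
        "same_category_purchases": sum(
            1 for p in purchases if p["category"] == book["category"]
        ),
        # Started-but-abandoned Kindle books: a negative signal for that
        # particular author, even if the topic area was appealing.
        "abandoned_same_author": sum(
            1 for p in purchases
            if p["author"] == book["author"] and p.get("percent_read", 100) < 20
        ),
    }

user = {"purchases": [
    {"category": "parenting", "author": "A. Author", "percent_read": 100},
    {"category": "sci-fi", "author": "B. Writer", "percent_read": 5},
]}
book = {"category": "sci-fi", "author": "B. Writer"}
print(book_features(user, book))
# {'same_category_purchases': 1, 'abandoned_same_author': 1}
```

In an interview you'd talk through features like these verbally; the point is that each one maps back to a piece of information you plausibly have and a hypothesis about why it's predictive.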
Your background

You should be prepared to give a high-level summary of your career, as well as to do a deep-dive
into a project you've worked on. The project doesn't have to be directly related to the position
you're interviewing for (though it can't hurt), but it needs to be the kind of work you can have an
in-depth technical discussion about.
To prepare:

Review any papers/presentations that came out of your projects to refresh
your mind on the technical details.

Practice explaining your project to a friend in order to make sure you are
telling a coherent story. Keep in mind that you'll probably be talking to
someone who's smart but doesn't have expertise in your particular field.

Be prepared to answer questions as to why you chose the approach that you
did, and about your individual contribution to the project.

Culture fit

Here are some culture fit questions your interviewers are likely to be interested in. These
questions might come up as part of other interviews, and will likely be asked indirectly. It helps
to keep what the interviewer is looking for in the back of your mind.

Are you specifically interested in the product/company/space you'd
be working in? It helps to prepare by thinking about the problems the
company is trying to solve, and how you and the team you'd be part of could
make a difference.

Do you care about impact? Even in a research-oriented corporate
environment, I wouldn't recommend saying that you don't care about
company metrics, and that you'd love to just play with data and write papers.

Will you work well with other people? I know it's a cliché, but most work
is collaborative, and companies are trying to assess this as best they can.
Avoid bad-mouthing former colleagues, and show appreciation for their
contributions to your projects.

Are you willing to get your hands dirty? If there's annoying work that
needs to be done (e.g. cleaning up messy data), will you take care of it?

Are you someone the team will be happy to have around on a
personal level? Even though you might be stressed, try to be friendly,
positive, enthusiastic and genuine throughout the interview process.

You may also get broad questions about what kinds of work you enjoy and what motivates you.
It's useful to have an answer ready, but there may not be a "right" answer the interviewer is
looking for.
Machine learning theory

This type of interview will test your understanding of basic machine learning concepts, generally
with a focus on supervised learning. You should understand:

The general setup for a supervised learning system

Why you want to split data into training and test sets

The idea that models that aren't powerful enough can't capture the right
generalizations about the data, and ways to address this (e.g. different model
or projection into a higher-dimensional space)

The idea that models that are too powerful suffer from overfitting, and ways
to address this (e.g. regularization)

You don't need to know a lot of machine learning algorithms, but you definitely need to
understand logistic regression, which seems to be what most companies are using. I also had
some in-depth discussions of SVMs, but that may just be because I brought them up.
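Since logistic regression comes up so often, it's worth being able to sketch its core from scratch. Here's a minimal, from-first-principles version (plain gradient descent on the log loss; a toy illustration, not how you'd train a real model):

```python
import math

# The core of logistic regression: a linear score squashed through the
# sigmoid gives P(y=1 | x), and training minimizes the log loss.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train(data, lr=0.5, epochs=500):
    """data: list of (features, label) pairs with label in {0, 1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            # (prediction - label) is the gradient of the log loss
            # with respect to the linear score.
            err = predict(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Toy 1-D data: label is 1 when the feature is large.
data = [([0.1], 0), ([0.3], 0), ([0.7], 1), ([0.9], 1)]
w, b = train(data)
print(predict(w, b, [0.2]) < 0.5, predict(w, b, [0.8]) > 0.5)  # True True
```

Being able to explain why the gradient is simply `(prediction - label) * x` is exactly the kind of "math behind logistic regression" an interviewer might probe.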
Dataset analysis

In this type of interview, you will be given a data set, and asked to write a script to pull out
features for some prediction task. You may be asked to then plug the features into a machine
learning algorithm. This interview essentially adds an implementation component to the applied
machine learning interview (see above). Of course, your features may now be inspired by what
you see in the data. Do the distributions for each feature you're considering differ between the
labels you're trying to predict?
I found these interviews hardest to prepare for, because the recruiter often wouldn't tell me what
format the data would be in, and what exactly I'd need to do with it. (For example, do I need to
review Python's csv import module? Should I look over the syntax for training a model in scikit-learn?) I also had one recruiter tell me I'd be analyzing "big data", which was a bit intimidating
(am I going to be working with distributed databases or something?) until I discovered at the
interview that the "big" data set had all of 11,000 examples. I encourage you to push for as much
info as possible about what you'll actually be doing.
If you plan to use Python, working through the scikit-learn tutorial is a good way to prepare.
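As a concrete illustration of that workflow, here's a small stdlib-only sketch: read a CSV, pull out a couple of features per row, and compare a feature's mean between the two labels. The column names and data are made up:

```python
import csv
import io

# Hypothetical dataset-analysis sketch: parse a CSV, extract features,
# and eyeball how a feature's distribution differs between labels.
# Column names (visits, days_since_signup, churned) are invented.

raw = """user_id,visits,days_since_signup,churned
1,40,300,0
2,2,14,1
3,25,120,0
4,1,7,1
"""

def extract_features(csv_text):
    rows = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for r in rows:
        features.append([
            float(r["visits"]),
            float(r["visits"]) / float(r["days_since_signup"]),  # visit rate
        ])
        labels.append(int(r["churned"]))
    return features, labels

X, y = extract_features(raw)
# Compare the mean visit count between the two labels.
for label in (0, 1):
    vals = [f[0] for f, lab in zip(X, y) if lab == label]
    print(label, sum(vals) / len(vals))
# 0 32.5
# 1 1.5
```

If the means differ sharply between labels, as here, the feature is worth feeding into a model; scikit-learn's estimators accept exactly this kind of feature-list/label-list pair.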
Stats

I have a decent intuitive understanding of statistics, but very little formal knowledge. Most of the
time, this sufficed, though I'm sure knowing more wouldn't have hurt. You should understand
how to set up an A/B test, including random sampling, confounding variables, summary statistics
(e.g. mean), and measuring statistical significance.
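For the significance piece, it helps to be able to sketch the standard two-proportion z-test. This is a minimal illustration with made-up numbers (normal approximation; real A/B analyses involve more care about sampling and multiple testing):

```python
import math

# Two-proportion z-test for an A/B test (normal approximation).

def ab_significance(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the difference
    in conversion rates between variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up numbers: 200/10000 conversions for control vs. 260/10000
# for treatment.
z, p = ab_significance(200, 10_000, 260, 10_000)
print(round(z, 2), round(p, 4))  # z around 2.83, p well under 0.05
```

Knowing the pieces here (pooled proportion, standard error, two-sided p-value) covers most of what an interviewer means by "measuring statistical significance" for an A/B test.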
Preparation Checklist & Resources

Here is a summary list of tips for preparing for data science interviews, along with a few helpful
resources.
1. Coding (usually whiteboard)
   - Get comfortable with basic algorithms, data structures and figuring out algorithm complexity.
   - Practice writing code away from the computer in your programming language of choice.
   - Resources:
     - A pretty exhaustive list of what you might encounter in an interview
     - Many interview prep books, e.g. Cracking the Coding Interview
2. Applied machine learning
   - Think about the machine learning problems that are relevant for each company you're interviewing at. Use these problems as practice questions.
3. Your background
   - Think through how to summarize your experience.
   - Prepare to give an in-depth technical explanation of a project you've worked on. Try it out on a friend.
4. Culture fit
   - Think about the problems each company is trying to solve, and how you and the team you'd be part of could make a difference.
   - Be prepared to answer broad questions about what kind of work you enjoy and what motivates you.
5. Machine learning theory
   - Understand machine learning concepts on an intuitive level, focusing especially on supervised learning.
   - Learn the math behind logistic regression.
   - Resources:
     - The Shape of Data blog provides a nice intuitive overview.
     - "A Few Useful Things to Know about Machine Learning"
     - To really go in depth, check out Andrew Ng's Stanford machine learning course on Coursera or OpenClassroom.
6. Dataset analysis
   - Get comfortable with a set of technical tools for working with data.
   - Resources:
     - If you plan to use Python, work through the scikit-learn tutorial (you could skip section 2.4).
7. Stats
   - Get familiar with how to set up an A/B test.
   - Resources:
     - Quora answer about how to prepare for interview questions about A/B testing
     - "How not to run an A/B test"
     - Sample size calculator, which you can use to get some intuition about the sample sizes required based on the sensitivity (i.e. minimal detectable effect) and statistical significance you're looking for

The Interview Process: What a Company Wants

I have just finished a more extensive tech job search than anyone should really do. It
featured eight on-sites, along with countless phone screens and informal chats. There were a few
reasons why I ended up doing things this way: (a) I quit my job when my husband and I moved
from Boston to San Francisco a few months ago, so I had the time; (b) I wasn't sure what I was
looking for: big company vs. small, data scientist vs. software engineer on a machine learning
system, etc.; (c) I wasn't sure how well it would all go.
This way of doing a job search turned out to be an awesome learning experience. In this series
of posts, I've tried to jot down some thoughts on what makes for a good interview process, both
for the company and for the candidate. I was interviewing for a combination of data science and
software engineering positions, but many observations should be more broadly applicable.

What are we trying to do here, anyway?


Before we can talk about what is a good or bad interview process, we need to understand the
company's objectives. Here are some things your company might be trying to do, or perhaps
should be trying to do. Note that I'm focusing on the interview stage here; there are many
separate questions about finding/filtering candidates.

Hire or no hire: Decide whether to give the candidate an offer.


1. Qualification check: Figure out whether the candidate is qualified for
the position they applied for. This is the most basic objective of the
interview process. To check someone's qualifications, you first need to define
what it means to be qualified for the position. In addition to technical skills,
many companies look for a "culture fit", which can help maintain the work
and social environment at the company (or change it, if that's what's
needed).
2. Potential check: If the candidate isn't qualified right now, can they
become excellent at this job anyway? Companies have very different
philosophies on whether this is a question they care to ask. In many cases,
there are good reasons to ask it. I was told a story about someone who was
hired as a machine learning expert, but soon got excited about infrastructure
challenges, and before long became the head of an infrastructure team. At
that point, what does it matter precisely what set of skills he originally came
in with, as long as he's smart and capable of learning new things?
3. Opportunity check: If the candidate isn't ideally suited to the
position they applied for, are there other roles in the company where
we'd love to have them? More than one place I interviewed at came back
with an offer for a different role from the one I applied for (in my case, data
scientist instead of engineer). They weren't advertising for that job, but
they were thinking opportunistically.
Leave a good impression.

There are two major components to this.


1. Be cool: Make sure the candidate comes away with a positive view of
the company. Part of doing this effectively is figuring out what counts as
cool to this particular candidate.
2. Be nice: Make sure the candidate has a positive overall experience.

Doing this well has an obvious benefit when the candidate is qualified: they'll be more likely to
take the offer. But it also has some less obvious benefits that apply to all candidates:

The candidate will be more likely to refer friends to your company. I heard
about a candidate who was rejected but went on to recommend two friends
who ended up joining the company.

The candidate will be more positive when discussing your company with their
friends. It's a small world.

Even if you don't want to hire the candidate right now, you might want to hire
them in a year.

There is intrinsic merit in being nice to people as they're going through what
is often a stressful experience.

Feel good doing it: Make sure the interviewers have a positive interview
experience.

As someone on the other side of the fence, this one is harder for me to reason about. But here are
some thoughts on why this is important:

Your employees might be spending a lot of time interviewing (as much as 10
hours a week during the fall recruiting season), and you don't want them to
be miserable doing it.

If the interviewer is grumpy, the candidate will be less likely to think well of
the company (see above). One of the companies I interviewed at requires
interviewers to submit detailed written feedback, which resulted in them
dedicating much of their attention to typing up my whiteboard code during
the interview. More than one interviewer expressed their frustration with the
process. Even if they were pretty happy with their job most of the time, it
certainly didn't come across that way.

In the next post, I'll take a look at some job postings. Do you have thoughts on
other goals companies should strive for? Please comment!
Get that job at Google
I've been meaning to write up some tips on interviewing at Google for a good long
time now. I keep putting it off, though, because it's going to make you mad.
Probably. For some statistical definition of "you", it's very likely to upset you.
Why? Because... well, here, I wrote a little ditty about it:
Hey man, I don't know that stuff
Stevey's talking aboooooout
If my boss thinks it's important
I'm gonna get fiiiiiiiiiired
Oooh yeah baaaby baaaay-beeeeee....

I didn't realize this was such a typical reaction back when I first started writing
about interviewing, way back at other companies. Boy-o-howdy did I find out in a
hurry.
See, it goes like this:
Me: blah blah blah, I like asking question X in interviews, blah blah blah...
You: Question X? Oh man, I haven't heard about X since college! I've never needed
it for my job! He asks that in interviews? But that means someone out there thinks
it's important to know, and, and... I don't know it! If they detect my ignorance, not
only will I be summarily fired for incompetence without so much as a thank-you, I
will also be unemployable by people who ask question X! If people listen to Stevey,
that will be everyone! I will become homeless and destitute! For not knowing
something I've never needed before! This is horrible! I would attack X itself, except
that I do not want to pick up a book and figure enough out about it to discredit it.
Clearly I must yell a lot about how stupid Stevey is so that nobody will listen to him!
Me: So in conclusion, blah blah... huh? Did you say "fired"? "Destitute?" What are
you talking about?
You: Aaaaaaauuuggh!!! *stab* *stab* *stab*
Me: That's it. I'm never talking about interviewing again.
It doesn't matter what X is, either. It's arbitrary. I could say: "I really enjoy asking the
candidate (their name) in interviews", and people would still freak out, on account
of insecurity about either interviewing in general or their knowledge of their own
name, hopefully the former.
But THEN, time passes, and interview candidates come and go, and we always wind
up saying: "Gosh, we sure wish that obviously smart person had prepared a little
better for his or her interviews. Is there any way we can help future candidates out
with some tips?"
And then nobody actually does anything, because we're all afraid of getting stabbed
violently by People Who Don't Know X.
I considered giving out a set of tips in which I actually use variable names like X,
rather than real subjects, but decided that in the resultant vacuum, everyone would
get upset. Otherwise that approach seemed pretty good, as long as I published
under a pseudonym.
In the end, people really need the tips, regardless of how many feelings get hurt
along the way. So rather than skirt around the issues, I'm going to give you a few
mandatory substitutions for X along with a fair amount of general interview-prep
information.
Caveats and Disclaimers
This blog is not endorsed by Google. Google doesn't know I'm publishing these tips.
It's just between you and me, OK? Don't tell them I prepped you. Just go kick ass on
your interviews and we'll be square.
I'm only talking about general software engineering positions, and interviews for
those positions.
These tips are actually generic; there's nothing specific to Google vs. any other
software company. I could have been writing these tips about my first software job
20 years ago. That implies that these tips are also timeless, at least for the span of
our careers.
These tips obviously won't get you a job on their own. My hope is that by following
them you will perform your very best during the interviews.
Oh, and um, why Google?
Oho! Why Google, you ask? Well let's just have that dialog right up front, shall we?
You: Should I work at Google? Is it all they say it is, and more? Will I be serenely
happy there? Should I apply immediately?
Me: Yes.
You: To which ques... wait, what do you mean by "Yes?" I didn't even say who I am!
Me: Dude, the answer is Yes. (You may be a woman, but I'm still calling you Dude.)
You: But... but... I am paralyzed by inertia! And I feel a certain comfort level at my
current company, or at least I have become relatively inured to the discomfort. I
know people here and nobody at Google! I would have to learn Google's build
system and technology and stuff! I have no credibility, no reputation there; I would
have to start over virtually from scratch! I waited too long, there's no upside! I'm
afraaaaaaid!
Me: DUDE. The answer is Yes already, OK? It's an invariant. Everyone else who
came to Google was in the exact same position as you are, modulo a handful of
famous people with beards that put Gandalf's to shame, but they're a very tiny
minority. Everyone who applied had the same reasons for not applying as you do.
And everyone here says: "GOSH, I SURE AM HAPPY I CAME HERE!" So just apply
already. But prep first.
You: But what if I get a mistrial? I might be smart and qualified, but for some
random reason I may do poorly in the interviews and not get an offer! That would be
a huge blow to my ego! I would rather pass up the opportunity altogether than have
a chance of failure!
Me: Yeah, that's at least partly true. Heck, I kinda didn't make it in on my first
attempt, but I begged like a street dog until they gave me a second round of
interviews. I caught them in a weak moment. And the second time around, I
prepared, and did much better.
The thing is, Google has a well-known false negative rate, which means we
sometimes turn away qualified people, because that's considered better than
sometimes hiring unqualified people. This is actually an industry-wide thing, but the
dial gets turned differently at different companies. At Google the false-negative rate
is pretty high. I don't know what it is, but I do know a lot of smart, qualified people
who've not made it through our interviews. It's a bummer.
But the really important takeaway is this: if you don't get an offer, you may still be
qualified to work here. So it needn't be a blow to your ego at all!
As far as anyone I know can tell, false negatives are completely random, and are
unrelated to your skills or qualifications. They can happen from a variety of factors,
including but not limited to:
1. you're having an off day
2. one or more of your interviewers is having an off day
3. there were communication issues invisible to you and/or one or more of the
interviewers
4. you got unlucky and got an Interview Anti-Loop
Oh no, not the Interview Anti-Loop!
Yes, I'm afraid you have to worry about this.
What is it, you ask? Well, back when I was at Amazon, we did (and they undoubtedly
still do) a LOT of soul-searching about this exact problem. We eventually concluded
that every single employee E at Amazon has at least one "Interview Anti-Loop": a
set of other employees S who would not hire E. The root cause is important for you
to understand when you're going into interviews, so I'll tell you a little about what
I've found over the years.
First, you can't tell interviewers what's important. Not at any company. Not unless
they're specifically asking you for advice. You have a very narrow window of perhaps
one year after an engineer graduates from college to inculcate them in the art of
interviewing, after which the window closes and they believe they are a "good
interviewer" and they don't need to change their questions, their question styles,
their interviewing style, or their feedback style, ever again.
It's a problem. But I've had my hand bitten enough times that I just don't try
anymore.
Second problem: every "experienced" interviewer has a set of pet subjects and
possibly specific questions that he or she feels is an accurate gauge of a candidate's
abilities. The question sets for any two interviewers can be widely different and
even entirely non-overlapping.
A classic example found everywhere is: Interviewer A always asks about C++ trivia,
filesystems, network protocols and discrete math. Interviewer B always asks about
Java trivia, design patterns, unit testing, web frameworks, and software project
management. For any given candidate with both A and B on the interview loop, A
and B are likely to give very different votes. A and B would probably not even hire
each other, given a chance, but they both happened to go through interviewer C,
who asked them both about data structures, unix utilities, and processes versus
threads, and A and B both happened to squeak by.
That's almost always what happens when you get an offer from a tech company. You
just happened to squeak by. Because of the inherently flawed nature of the
interviewing process, it's highly likely that someone on the loop will be unimpressed
with you, even if you are Alan Turing. Especially if you're Alan Turing, in fact, since it
means you obviously don't know C++.
The bottom line is, if you go to an interview at any software company, you should
plan for the contingency that you might get genuinely unlucky, and wind up with
one or more people from your Interview Anti-Loop on your interview loop. If this
happens, you will struggle, then be told that you were not a fit at this time, and then
you will feel bad. Just as long as you don't feel meta-bad, everything is OK. You
should feel good that you feel bad after this happens, because hey, it means you're
human.
And then you should wait 6-12 months and re-apply. That's pretty much the best
solution we (or anyone else I know of) could come up with for the false-negative
problem. We wipe the slate clean and start over again. There are lots of people here
who got in on their second or third attempt, and they're kicking butt.
You can too.
OK, I feel better about potentially not getting hired
Good! So let's get on to those tips, then.
If you've been following along very closely, you'll have realized that I'm interviewer
D. Meaning that my personal set of pet questions and topics is just my own, and it's
no better or worse than anyone else's. So I can't tell you what it is, no matter how
much I'd like to, because I'll offend interviewers A through X who have slightly
different working sets.
Instead, I want to prep you for some general topics that I believe are shared by the
majority of tech interviewers at Google-like companies. Roughly speaking, this
means the company builds a lot of their own software and does a lot of distributed
computing. There are other tech-company footprints, the opposite end of the
spectrum being companies that outsource everything to consultants and try to use
as much third-party software as possible. My tips will be useful only to the extent
that the company resembles Google.
So you might as well make it Google, eh?
First, let's talk about non-technical prep.
The Warm-Up
Nobody goes into a boxing match cold. Lesson: you should bring your boxing gloves
to the interview. No, wait, sorry, I mean: warm up beforehand!
How do you warm up? Basically there is short-term and long-term warming up, and
you should do both.
Long-term warming up means: study and practice for a week or two before the
interview. You want your mind to be in the general "mode" of problem solving on
whiteboards. If you can do it on a whiteboard, every other medium (laptop, shared
network document, whatever) is a cakewalk. So plan for the whiteboard.
Short-term warming up means: get lots of rest the night before, and then do
intense, fast-paced warm-ups the morning of the interview.
The two best long-term warm-ups I know of are:
1) Study a data-structures and algorithms book. Why? Because it is the most
likely to help you beef up on problem identification. Many interviewers are happy
when you understand the broad class of question they're asking without
explanation. For instance, if they ask you about coloring U.S. states in different
colors, you get major bonus points if you recognize it as a graph-coloring problem,
even if you don't actually remember exactly how graph-coloring works.
And if you do remember how it works, then you can probably whip through the
answer pretty quickly. So your best bet, interview-prep wise, is to practice the art of
recognizing that certain problem classes are best solved with certain algorithms and
data structures.
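To illustrate the payoff of recognizing the problem class: once you see the states-and-borders question as graph coloring, even a simple greedy heuristic falls out naturally. Here's a rough sketch (greedy coloring isn't optimal in general, but it's a reasonable whiteboard starting point):

```python
# Greedy graph coloring: give each node the smallest color not used by an
# already-colored neighbor. Not optimal in general, but simple and fast.

def greedy_coloring(adjacency):
    """adjacency: dict mapping node -> list of neighbor nodes."""
    colors = {}
    for node in adjacency:
        taken = {colors[nb] for nb in adjacency[node] if nb in colors}
        colors[node] = next(c for c in range(len(adjacency)) if c not in taken)
    return colors

# A tiny map fragment: four states and (a subset of) their borders.
borders = {
    "Utah": ["Arizona", "Nevada", "Colorado"],
    "Arizona": ["Utah", "Nevada"],
    "Nevada": ["Utah", "Arizona"],
    "Colorado": ["Utah"],
}
coloring = greedy_coloring(borders)
# Check that no two neighboring states share a color:
print(all(coloring[a] != coloring[b] for a in borders for b in borders[a]))  # True
```

The recognition step is the valuable part; remembering that optimal graph coloring is hard (NP-hard for three or more colors) while greedy heuristics are easy is exactly the kind of bonus point the interviewer is looking for.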

My absolute favorite for this kind of interview preparation is Steven Skiena's The
Algorithm Design Manual. More than any other book it helped me understand just
how astonishingly commonplace (and important) graph problems are; they should
be part of every working programmer's toolkit. The book also covers basic data
structures and sorting algorithms, which is a nice bonus. But the gold mine is the
second half of the book, which is a sort of encyclopedia of 1-pagers on zillions of
useful problems and various ways to solve them, without too much detail. Almost
every 1-pager has a simple picture, making it easy to remember. This is a great way
to learn how to identify hundreds of problem types.
Other interviewers I know recommend Introduction to Algorithms. It's a true classic
and an invaluable resource, but it will probably take you more than 2 weeks to get
through it. But if you want to come into your interviews prepped, then consider
deferring your application until you've made your way through that book.
2) Have a friend interview you. The friend should ask you a random interview
question, and you should go write it on the board. You should keep going until it is
complete, no matter how tired or lazy you feel. Do this as much as you can possibly
tolerate.
I didn't do these two types of preparation before my first Google interview, and I
was absolutely shocked at how bad at whiteboard coding I had become since I had
last interviewed seven years prior. It's hard! And I also had forgotten a bunch of
algorithms and data structures that I used to know, or at least had heard of.
Going through these exercises for a week prepped me mightily for my second round
of Google interviews, and I did way, way better. It made all the difference.
As for short-term preparation, all you can really do is make sure you are as alert and
warmed up as possible. Don't go in cold. Solve a few problems and read through
your study books. Drink some coffee: it actually helps you think faster, believe it or
not. Make sure you spend at least an hour practicing immediately before you walk
into the interview. Treat it like a sports game or a music recital, or heck, an exam: if
you go in warmed up you'll give your best performance.
Mental Prep
So! You're a hotshot programmer with a long list of accomplishments. Time to forget
about all that and focus on interview survival.
You should go in humble, open-minded, and focused.
If you come across as arrogant, then people will question whether they want to work

with you. The best way to appear arrogant is to question the validity of the
interviewer's question; it really ticks them off, as I pointed out earlier on.
Remember how I said you can't tell an interviewer how to interview? Well, that's
especially true if you're a candidate.
So don't ask: "gosh, are algorithms really all that important? do you ever need to do
that kind of thing in real life? I've never had to do that kind of stuff." You'll just get
rejected, so don't say that kind of thing. Treat every question as legitimate, even if
you are frustrated that you don't know the answer.
Feel free to ask for help or hints if you're stuck. Some interviewers take points off for
that, but occasionally it will get you past some hurdle and give you a good
performance on what would have otherwise been a horrible stony half-hour silence.
Don't say "choo choo choo" when you're "thinking".
Don't try to change the subject and answer a different question. Don't try to divert
the interviewer from asking you a question by telling war stories. Don't try to bluff
your interviewer. You should focus on each problem they're giving you and make
your best effort to answer it fully.
Some interviewers will not ask you to write code, but they will expect you to start
writing code on the whiteboard at some point during your answer. They will give you
hints but won't necessarily come right out and say: "I want you to write some code
on the board now." If in doubt, you should ask them if they would like to see code.
Interviewers have vastly different expectations about code. I personally don't care
about syntax (unless you write something that could obviously never work in any
programming language, at which point I will dive in and verify that you are not, in
fact, a circus clown and that it was an honest mistake). But some interviewers are
really picky about syntax, and some will even silently mark you down for missing a
semicolon or a curly brace, without telling you. I think of these interviewers as,
well, it's a technical term that rhymes with "bass soles", but they think of
themselves as brilliant technical evaluators, and there's no way to tell them
otherwise.
So ask. Ask if they care about syntax, and if they do, try to get it right. Look over
your code carefully from different angles and distances. Pretend it's someone else's
code and you're tasked with finding bugs in it. You'd be amazed at what you can
miss when you're standing 2 feet from a whiteboard with an interviewer staring at
your shoulder blades.
It's OK (and highly encouraged) to ask a few clarifying questions, and occasionally
verify with the interviewer that you're on the track they want you to be on. Some

interviewers will mark you down if you just jump up and start coding, even if you
get the code right. They'll say you didn't think carefully first, and you're one of those
"let's not do any design" type cowboys. So even if you think you know the answer to
the problem, ask some questions and talk about the approach you'll take a little
before diving in.
On the flip side, don't take too long before actually solving the problem, or some
interviewers will give you a delay-of-game penalty. Try to move (and write) quickly,
since often interviewers want to get through more than one question during the
interview, and if you solve the first one too slowly then they'll be out of time. They'll
mark you down because they couldn't get a full picture of your skills. The benefit of
the doubt is rarely given in interviewing.
One last non-technical tip: bring your own whiteboard dry-erase markers. They sell
pencil-thin ones at office supply stores, whereas most companies (including Google)
tend to stock the fat kind. The thin ones turn your whiteboard from a 480i
standard-definition tube into a 58-inch 1080p HD plasma screen. You need all the
help you can get, and free whiteboard space is a real blessing.
You should also practice whiteboard space-management skills, such as not starting
on the right and coding down into the lower-right corner in Teeny Unreadable Font.
Your interviewer will not be impressed. Amusingly, although it always irks me when
people do this, I did it during my interviews, too. Just be aware of it!
Oh, and don't let the marker dry out while you're standing there waving it. I'm tellin'
ya: you want minimal distractions during the interview, and that one is surprisingly
common.
OK, that should be good for non-tech tips. On to X, for some value of X! Don't stab
me!
Tech Prep Tips
The best tip is: go get a computer science degree. The more computer science you
have, the better. You don't have to have a CS degree, but it helps. It doesn't have to
be an advanced degree, but that helps too.
However, you're probably thinking of applying to Google a little sooner than 2 to 8
years from now, so here are some shorter-term tips for you.
Algorithm Complexity: you need to know Big-O. It's a must. If you struggle with
basic big-O complexity analysis, then you are almost guaranteed not to get hired.
It's, like, one chapter in the beginning of one theory of computation book, so just go
read it. You can do it.
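To make Big-O concrete, here is a quick illustrative sketch (not from the original post) comparing an O(n) list membership test against an average-O(1) set membership test; the data and sizes are invented for the demonstration:

```python
import timeit

# Membership testing: a list scans elements one by one (O(n) per lookup),
# while a set does a hash lookup (O(1) on average).
n = 100_000
data_list = list(range(n))
data_set = set(data_list)

# Look up the worst-case element for the list (the last one), 100 times each.
t_list = timeit.timeit(lambda: (n - 1) in data_list, number=100)
t_set = timeit.timeit(lambda: (n - 1) in data_set, number=100)

# The set lookup is dramatically faster because it never scans the collection.
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

Being able to explain *why* the set wins here is exactly the kind of complexity reasoning interviewers probe.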

Sorting: know how to sort. Don't do bubble-sort. You should know the details of at
least one n*log(n) sorting algorithm, preferably two (say, quicksort and merge sort).
Merge sort can be highly useful in situations where quicksort is impractical, so take
a look at it.
For God's sake, don't try sorting a linked list during the interview.
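As a refresher, here is one minimal way to write merge sort in Python (a sketch; real-world code would sort in place or use the standard library):

```python
def merge_sort(xs):
    """Classic O(n log n) merge sort: split, recursively sort halves, merge."""
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])
    right = merge_sort(xs[mid:])
    # Merge the two sorted halves in linear time.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # → [1, 2, 5, 5, 6, 9]
```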
Hashtables: hashtables are arguably the single most important data structure
known to mankind. You absolutely have to know how they work. Again, it's like one
chapter in one data structures book, so just go read about them. You should be able
to implement one using only arrays in your favorite language, in about the space of
one interview.
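A minimal sketch of that "hashtable from arrays" exercise, using separate chaining (one of several possible collision strategies; the class and method names are mine, not a standard API):

```python
class HashTable:
    """Minimal hash table with separate chaining, built on plain lists."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # Map the key's hash onto one of the fixed buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                # overwrite an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

table = HashTable()
table.put("answer", 42)
print(table.get("answer"))  # → 42
```

A real implementation would also resize when buckets get long; mentioning that trade-off is worth points in an interview.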
Trees: you should know about trees. I'm tellin' ya: this is basic stuff, and it's
embarrassing to bring it up, but some of you out there don't know basic tree
construction, traversal and manipulation algorithms. You should be familiar with
binary trees, n-ary trees, and trie-trees at the very very least. Trees are probably the
best source of practice problems for your long-term warmup exercises.
You should be familiar with at least one flavor of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree. You should actually know how it's
implemented.
You should know about tree traversal algorithms: BFS and DFS, and know the
difference between inorder, postorder and preorder.
You might not use trees much day-to-day, but if so, it's because you're avoiding tree
problems. You won't need to do that anymore once you know how they work. Study
up!
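For warm-up purposes, here is a sketch of the three depth-first traversal orders on a small binary search tree (tree values invented for illustration):

```python
from collections import namedtuple

Node = namedtuple("Node", "value left right")

#       4
#      / \
#     2   6
#    / \
#   1   3
tree = Node(4,
            Node(2, Node(1, None, None), Node(3, None, None)),
            Node(6, None, None))

def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

def preorder(node):
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)

def postorder(node):
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]

print(inorder(tree))    # → [1, 2, 3, 4, 6]  (sorted, since it's a BST)
print(preorder(tree))   # → [4, 2, 1, 3, 6]
print(postorder(tree))  # → [1, 3, 2, 6, 4]
```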
Graphs
Graphs are, like, really really important. More than you think. Even if you already
think they're important, it's probably more than you think.
There are three basic ways to represent a graph in memory (objects and pointers,
matrix, and adjacency list), and you should familiarize yourself with each
representation and its pros and cons.
You should know the basic graph traversal algorithms: breadth-first search and
depth-first search. You should know their computational complexity, their tradeoffs,
and how to implement them in real code.
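As a sketch, here are both traversals over an adjacency-list graph (one of the three representations mentioned above; the graph itself is invented for illustration). Both run in O(V + E):

```python
from collections import deque

# Adjacency-list representation of a small undirected graph.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def bfs(graph, start):
    """Visit nodes level by level, using a queue."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

def dfs(graph, start):
    """Go deep before wide, using an explicit stack."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors reversed so they pop in listed order.
        stack.extend(reversed(graph[node]))
    return order

print(bfs(graph, "A"))  # → ['A', 'B', 'C', 'D', 'E']
print(dfs(graph, "A"))  # → ['A', 'B', 'D', 'C', 'E']
```

The trade-off to be ready to discuss: BFS finds shortest paths in unweighted graphs; DFS uses less memory on wide graphs and underlies topological sort and cycle detection.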

You should try to study up on fancier algorithms, such as Dijkstra and A*, if you get
a chance. They're really great for just about anything, from game programming to
distributed computing to you name it. You should know them.
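If you have time for Dijkstra, a compact heap-based sketch looks like this (the weighted graph is invented for illustration; this is one common formulation, not the only one):

```python
import heapq

def dijkstra(graph, start):
    """Shortest-path distances from start; O((V + E) log V) with a heap."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Weighted directed graph as adjacency lists of (neighbor, weight).
weighted = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
print(dijkstra(weighted, "A"))  # → {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```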
Whenever someone gives you a problem, think graphs. They are the most
fundamental and flexible way of representing any kind of a relationship, so it's
about a 50-50 shot that any interesting design problem has a graph involved in it.
Make absolutely sure you can't think of a way to solve it using graphs before moving
on to other solution types. This tip is important!
Other data structures
You should study up on as many other data structures and algorithms as you can fit
in that big noggin of yours. You should especially know about the most famous
classes of NP-complete problems, such as traveling salesman and the knapsack
problem, and be able to recognize them when an interviewer asks you about them
in disguise.
You should find out what NP-complete means.
Basically, hit that data structures book hard, and try to retain as much of it as you
can, and you can't go wrong.
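For example, the knapsack problem is NP-complete in general, but when the capacity is a small integer a pseudo-polynomial dynamic program solves it; recognizing that is exactly the skill interviewers probe. A sketch (item values and weights invented for illustration):

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack via dynamic programming: O(n * capacity).

    NP-complete in general, but this pseudo-polynomial DP works
    whenever the capacity is a small integer.
    """
    # best[w] = best value achievable with total weight <= w
    best = [0] * (capacity + 1)
    for i in range(len(values)):
        # Iterate weights downward so each item is used at most once.
        for w in range(capacity, weights[i] - 1, -1):
            best[w] = max(best[w], best[w - weights[i]] + values[i])
    return best[capacity]

# Items: values [60, 100, 120] with weights [1, 2, 3]; capacity 6
# fits all three items.
print(knapsack([60, 100, 120], [1, 2, 3], 6))  # → 280
```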
Math
Some interviewers ask basic discrete math questions. This is more prevalent at
Google than at other places I've been, and I consider it a Good Thing, even though
I'm not particularly good at discrete math. We're surrounded by counting problems,
probability problems, and other Discrete Math 101 situations, and those innumerate
among us blithely hack around them without knowing what we're doing.
Don't get mad if the interviewer asks math questions. Do your best. Your best will
be a heck of a lot better if you spend some time before the interview refreshing
your memory on (or teaching yourself) the essentials of combinatorics and
probability. You should be familiar with n-choose-k problems and their ilk; the
more the better.
I know, I know, you're short on time. But this tip can really help make the difference
between a "we're not sure" and a "let's hire her". And it's actually not all that bad;
discrete math doesn't use much of the high-school math you studied and forgot. It
starts back with elementary-school math and builds up from there, so you can
probably pick up what you need for interviews in a couple of days of intense study.
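A couple of n-choose-k warm-ups as a sketch, using only the standard library (math.comb requires Python 3.8+):

```python
from math import comb, factorial

# n-choose-k: ways to pick k items from n when order doesn't matter.
# C(n, k) = n! / (k! * (n-k)!)
n, k = 5, 2
by_formula = factorial(n) // (factorial(k) * factorial(n - k))
print(by_formula)   # → 10
print(comb(n, k))   # → 10  (same thing via the stdlib)

# A classic interview warm-up: probability of exactly 2 heads
# in 5 fair coin flips = C(5, 2) / 2**5.
print(comb(5, 2) / 2**5)  # → 0.3125
```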
Sadly, I don't have a good recommendation for a Discrete Math book, so if you do,

please mention it in the comments. Thanks.


Operating Systems
This is just a plug, from me, for you to know about processes, threads and
concurrency issues. A lot of interviewers ask about that stuff, and it's pretty
fundamental, so you should know it. Know about locks and mutexes and
semaphores and monitors and how they work. Know about deadlock and livelock
and how to avoid them. Know what resources a process needs, and a thread
needs, and how context switching works, and how it's initiated by the operating
system and underlying hardware. Know a little about scheduling. The world is
rapidly moving towards multi-core, and you'll be a dinosaur in a real hurry if you
don't understand the fundamentals of "modern" (which is to say, "kinda broken")
concurrency constructs.
The best, most practical book I've ever personally read on the subject is Doug Lea's
Concurrent Programming in Java. It got me the most bang per page. There are
obviously lots of other books on concurrency. I'd avoid the academic ones and focus
on the practical stuff, since it's most likely to get asked in interviews.
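The book is about Java, but the core ideas translate to any language. As a minimal sketch of the most basic construct, a mutex protecting a shared counter, here is a Python version (counts and thread counts invented for illustration):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        # Without the lock, this read-modify-write is a race: two
        # threads can read the same value and one update is lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(50_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # → 200000 (deterministic only because of the lock)
```

Being able to say *why* removing the lock can lose updates, and what deadlock would look like if two such locks were taken in opposite orders, covers a surprising fraction of concurrency questions.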
Coding
You should know at least one programming language really well, and it should
preferably be C++ or Java. C# is OK too, since it's pretty similar to Java. You will be
expected to write some code in at least some of your interviews. You will be
expected to know a fair amount of detail about your favorite programming
language.
Other Stuff
Because of the rules I outlined above, it's still possible that you'll get Interviewer A,
and none of the stuff you've studied from these tips will be directly useful (except
being warmed up.) If so, just do your best. Worst case, you can always come back in
6-12 months, right? Might seem like a long time, but I assure you it will go by in a
flash.
The stuff I've covered is actually mostly red-flags: stuff that really worries people if
you don't know it. The discrete math is potentially optional, but somewhat risky if
you don't know the first thing about it. Everything else I've mentioned you should
know cold, and then you'll at least be prepped for the baseline interview level. It
could be a lot harder than that, depending on the interviewer, or it could be easy.
It just depends on how lucky you are. Are you feeling lucky? Then give it a try!

Send me your resume


I'll probably batch up any resume submissions people send me and submit them
weekly. In the meantime, study up! You have a lot of warming up to do. Real-world
work makes you rusty.
I hope this was helpful. Let the flames begin, etc. Yawn.
5:15 AM, July 19, 2011

Top Data Science Interview Questions Most Asked

Here are the top 50 objective-type sample Data Science interview questions; the answers are
given just below each question. These sample questions were framed by experts from Intellipaat,
who provide Data Science training, to give you an idea of the type of questions that may be
asked in an interview. We have taken full care to give correct answers for all the questions. Do
comment your thoughts. Happy job hunting!

Top Answers to Data Science Interview Questions


1.What do you mean by the term Data Science?
Data Science is the extraction of knowledge from large volumes of data that
are structured or unstructured. It is a continuation of the fields of data
mining and predictive analytics, and is also known as knowledge discovery
and data mining.
2.Explain the term botnet?
A botnet is a network of compromised machines, each running a bot controlled
by an attacker. Bots are often installed via a Trojan and were classically
coordinated over an IRC network.
3.What is Data Visualization?
Data visualization is a common term that describes any effort to help people
understand the significance of data by placing it in a visual context.
4.How would you define Data cleaning as a critical part of the process?
Cleaning up data to the point where you can work with it is a huge amount of
work. If we are trying to reconcile many sources of data that we don't
control, it can take up to 80% of our time.

5.Point out 7 Ways how Data Scientists use Statistics?
1. Design and interpret experiments to inform product decisions.
2. Build models that predict signal, not noise.
3. Turn big data into the big picture.
4. Understand user retention, engagement, conversion, and leads.
5. Give your users what they want.
6. Estimate intelligently.
7. Tell the story with the data.

6.Differentiate between Data modeling and Database design?


Data Modeling - Data modeling (or modeling) in software engineering is the
process of creating a data model for an information system by applying
formal data modeling techniques.
Database Design- Database design is the system of producing a detailed data
model of a database. The term database design can be used to describe
many different parts of the design of an overall database system.
7.Describe in brief the data Science Process flowchart?
1. Data is collected from sensors in the environment.
2. Data is cleaned and processed to produce a data set (typically a data
table) usable for analysis.
3. Exploratory data analysis and statistical modeling may be performed.
4. A data product is a program, such as one retailers use to inform new
purchases based on purchase history. It may also create data and feed it
back into the environment.
8. What do you understand by term hash table collisions?
A hash table (hash map) is a kind of data structure used to implement an
associative array, a structure that can map keys to values. Ideally, the hash
function will assign each key to a unique bucket, but sometimes two keys will
generate an identical hash, causing both keys to point to the same bucket.
This is known as a hash collision.
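A quick illustration on invented toy keys: with only 8 buckets and 20 keys, the pigeonhole principle guarantees that at least one bucket receives more than one key.

```python
# A modular bucket function over a small, fixed number of buckets.
def bucket_of(key, n_buckets=8):
    return hash(key) % n_buckets

# 20 keys into 8 buckets: some keys MUST share a bucket (a collision).
keys = [f"user{i}" for i in range(20)]
buckets = {}
for key in keys:
    buckets.setdefault(bucket_of(key), []).append(key)

collided = [ks for ks in buckets.values() if len(ks) > 1]
print(f"{len(collided)} buckets hold more than one key")
```

Real hash tables resolve collisions with chaining (lists per bucket, as above) or open addressing (probing for the next free slot).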
9.Compare and contrast R and SAS?
SAS is commercial software whereas R is open source and can be downloaded
by anyone.
SAS is easy to learn and provides an easy option for people who already know
SQL, whereas R is a programming language, so simple procedures can take
longer code.
10.What do you understand by the term R?

R is a language and environment for statistical computing and graphics. It is
a GNU project similar to the S language and environment, which was
developed at Bell Laboratories.
11.What all things R environment includes?
1. A suite of operators for calculations on arrays, in particular matrices,
2. An effective data handling and storage facility,
3. A large, coherent, integrated collection of intermediate tools for data
analysis,
4. Graphical facilities for data analysis and display, either on-screen or on
hardcopy, and
5. A well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions, and input and
output facilities.
12.What are the applied Machine Learning Process Steps?
1. Problem Definition: Understand and clearly describe the problem that is
being solved.
2. Analyze Data: Understand the information available that will be used to
develop a model.
3. Prepare Data: Define and expose the structure in the dataset.
4. Evaluate Algorithms: Develop a robust test harness and baseline accuracy
from which to improve, and spot-check algorithms.
5. Improve Results: Refine results to develop more accurate models.
6. Present Results: Detail the problem and solution so that they can be
understood by third parties.
13.Compare Multivariate, Univariate and Bivariate analysis?
MULTIVARIATE: Multivariate analysis focuses on the results of observations of
many different variables for a number of objects.
UNIVARIATE: Univariate analysis is perhaps the simplest form of statistical
analysis. Like other forms of statistics, it can be inferential or descriptive. The
key fact is that only one variable is involved.
BIVARIATE: Bivariate analysis is one of the simplest forms of quantitative
(statistical) analysis. It involves the analysis of two variables (often denoted
as X, Y), for the purpose of determining the empirical relationship between
them.
14.What is Hypothesis in Machine Learning?
The hypothesis space used by a machine learning system is the set of
all hypotheses that might possibly be returned by it. It is typically defined by
a hypothesis language, possibly in conjunction with a language bias.

15.Differentiate between Uniform and Skewed Distribution?


UNIFORM DISTRIBUTION: A uniform distribution, sometimes also known as a
rectangular distribution, is a distribution that has constant probability over
its support.
SKEWED DISTRIBUTION: In probability theory and statistics, Skewness is a
measure of the asymmetry of the probability distribution of a real-valued
random variable about its mean. The skewness value can be positive or
negative, or even undefined. The qualitative interpretation of the skew is
complicated.
16.What do you understand by term Transformation in Data Acquisition?
The transformation process allows you to consolidate, cleanse, and integrate
data. We can semantically arrange the data from heterogeneous sources.
17.What do you understand by term Normal Distribution?
It is a function which shows the distribution of many random variables as a
symmetrical bell-shaped graph.
18.What is Data Acquisition?
It is the process of measuring an electrical or physical phenomenon such as
voltage, current, temperature, pressure, or sound with a computer. A DAQ
system comprises sensors, DAQ measurement hardware, and a computer
with programmable software.
19.What is Data Collection?
Data collection is the process of collecting and measuring information on
variables of interest, in a systematic fashion that enables one to answer
stated research questions, test hypotheses, and evaluate outcomes.
20.What do you understand by term Use case?
A use case is a methodology used in system analysis to identify, clarify, and
organize system requirements. The use case consists of a set of possible
sequences of interactions between systems and users in a particular
environment and related to a defined particular goal.
21.What is Sampling and Sampling Distribution?
SAMPLING: Sampling is the process of choosing units (e.g., people,
organizations) from a population of interest so that by studying the sample
we can fairly generalize our results back to the population from which they
were chosen.

SAMPLING DISTRIBUTION: The sampling distribution of a statistic is the


distribution of that statistic, considered as a random variable, when derived
from a random sample of size n. It may be considered as the distribution of
the statistic for all possible samples from the same population of a given size.
22.What is Linear Regression?
In statistics, linear regression is an approach for modeling the relationship
between a scalar dependent variable y and one or more explanatory variables
(or independent variables) denoted by X. The case of one explanatory
variable is known as simple linear regression.
23.Differentiate between Extrapolation and Interpolation?
Extrapolation is an estimate of a value based on extending a known
sequence of values or facts beyond the area that is certainly known.
Interpolation is an estimation of a value within two known values in a list of
values.
24.How is expected value different from mean value?
There is no difference; these are two names for the same thing. They are
mostly used in different contexts, though: we speak of the expected value of
a random variable and the mean of a sample, population or probability
distribution.
25.Differentiate between Systematic and Cluster Sampling?
SYSTEMATIC SAMPLING: Systematic sampling is a statistical methodology
involving the selection of elements from an ordered sampling frame. The
most common form of systematic sampling is an equal-probability method.
CLUSTER SAMPLING: A cluster sample is a probability sample by which each
sampling unit is a collection, or cluster, of elements.
26.What are the advantages of Systematic Sampling?
1.Easier to perform in the field, especially if a proper frame is not available.
2. Regularly provides more information per unit cost than simple random
sampling, in the sense of smaller variances.
27.What do you understand by term Threshold limit value?
The threshold limit value (TLV) of a chemical substance is a level to which it
is believed a worker can be exposed day after day for a working lifetime
without adverse effects on his/her health.
28.Differentiate between Validation Set and Test set?

Validation set: It is a set of examples used to tune the parameters [i.e.,


architecture, not weights] of a classifier, for example to choose the number of
hidden units in a neural network.
Test set: A set of examples used only to assess the performance
[generalization] of a fully specified classifier.
29.How can R and Hadoop be used together?
The most common way to link R and Hadoop is to use HDFS (potentially
managed by Hive or HBase) as the long-term store for all data, and use
MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode,
enrich, and sample data sets from HDFS into R. Data analysts can then
perform complex modeling exercises on a subset of prepared data in R.
30.What do you understand by the term RImpala?
The RImpala package contains the R functions required to connect, execute
queries and retrieve results from Impala. It uses the rJava package to create
a JDBC connection to any of the Impala servers running on a Hadoop cluster.
31.What is Collaborative Filtering?
Collaborative filtering (CF) is a method used by some recommender systems.
The term has two senses, a narrow one and a more general one. In the
general sense, collaborative filtering is the process of filtering for information
or patterns using techniques involving collaboration among multiple agents,
viewpoints, and data sources.
32.What are the challenges of Collaborative Filtering?
1. Scalability
2. Data sparsity
3. Synonyms
4. Grey sheep
5. Shilling attacks
6. Diversity and the long tail

33.What do you understand by Big data?


Big data is a buzzword, or catch-phrase, that describes a massive volume of
both structured and unstructured data that is so large it is difficult to
process using traditional database and software techniques.
34.What do you understand by Matrix factorization?

Matrix factorization is simply a mathematical tool for working with matrices,
and is therefore applicable in many scenarios where one wants to uncover
structure hidden in the data.
35.What do you understand by term Singular Value Decomposition?
In linear algebra, the singular value decomposition (SVD) is a factorization of
a real or complex matrix. It has many useful applications in signal processing
and statistics.
36.What do you mean by Recommender systems?
Recommender systems or recommendation systems (sometimes replacing
system with a synonym such as platform or engine) are a subclass of
information filtering systems that seek to predict the rating or preference
that a user would give to an item.
37.What are the applications of Recommender Systems?
Recommender systems have become extremely common in recent years, and
are applied in a variety of applications. The most popular ones are probably
movies, music, news, books, research articles, search queries, social tags,
and products in general.
38.What are the two approaches used by Recommender Systems?
Recommender systems typically produce a list of recommendations in one of
two ways: through collaborative or content-based filtering. Collaborative
filtering approaches build a model from a user's past behavior (items
previously purchased or selected and/or numerical ratings given to those
items) as well as similar decisions made by other users. This model is then
used to predict items (or ratings for items) that the user may have an
interest in. Content-based filtering approaches utilize a series of discrete
characteristics of an item in order to recommend additional items with
similar properties.
39.What are the factors to find the most accurate recommendation algorithms?
1. Diversity
2. Recommender persistence
3. Privacy
4. User demographics
5. Robustness
6. Serendipity
7. Trust
8. Labeling

40.What is K-Nearest Neighbor?


k-NN is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until
classification. The k-NN algorithm is among the simplest of all machine
learning algorithms.
41.What is Horizontal Slicing?
In horizontal slicing, projects are broken up roughly along architectural lines.
That is, there would be one team for UI, one team for business logic and
services (SOA), and another team for data.
42.What are the advantages of vertical slicing?
The advantage of slicing vertically is that you are more efficient. You don't
have the overhead and effort that come from trying to coordinate activities
across multiple teams, and there is no need to negotiate for resources:
you're all on the same team.
43.What is null hypothesis?
In inferential statistics the null hypothesis usually refers to a general
statement or default position that there is no relationship between two
measured phenomena, or no difference among groups.
44.What is Statistical hypothesis?
In statistical hypothesis testing, the alternative hypothesis (or maintained
hypothesis or research hypothesis) and the null hypothesis are the two rival
hypotheses which are compared by a statistical hypothesis test.
45.What is performance measure?
Performance measurement is the method of collecting, analyzing and/or
reporting information regarding the performance of an individual, group,
organization, system or component.
46.What is the use of tree command?
This command is used to list contents of directories in a tree-like format.
47.What is the use of uniq command?
This command is used to report or omit repeated lines.
48.Which command is used to translate or delete characters?
The tr command is used to translate or delete characters.

49.What is the use of tapkee command?


This command is used to reduce dimensionality of a data set using various
algorithms.
50.Which command is used to sort the lines of text files?
sort command is used to sort the lines of text files.
100 Data Science in Python Interview Questions and Answers for 2016
30 Dec 2015

Python's growing adoption in data science has pitched it as a competitor to the R programming
language. With its various libraries maturing over time to suit all data science needs, a lot of
people are shifting towards Python from R. This might seem like the logical scenario, but R
still comes out as the popular choice for data scientists: people are shifting towards Python,
but not so many as to disregard R altogether. We have highlighted the pros and cons of both
these languages in our Python vs R article. Many data scientists learn both Python and R to
counter the limitations of either language, and being prepared with both will help in data
science job interviews.

Python is the friendly programming language that plays well with everyone and runs on
everything, so it is hardly surprising that it offers quite a few libraries that deal with data
efficiently and is therefore used in data science. Python has been adopted for data science
only in recent years, but now that it has firmly established itself as an important language
for data science, it is not going anywhere. Python is mostly used for data analysis when you
need to integrate the results into web apps or add mathematical/statistical code for
production.

In our previous posts 100 Data Science Interview Questions and Answers (General) and 100
Data Science in R Interview Questions and Answers, we listed all the questions that can be asked
in data science job interviews. This article in the series, lists questions which are related to
Python programming and will probably be asked in data science interviews.

Data Science Python Interview Questions and Answers

The questions below are based on the course Data Science in Python taught at DeZyre.
This is not a guarantee that these questions will be asked in Data Science Interviews. The
purpose of these questions is to make the reader aware of the kind of knowledge that an applicant
for a Data Scientist position needs to possess.
Data Science interview questions in Python are generally scenario-based or problem-based:
candidates are provided with a data set and asked to do data munging, data exploration, data
visualization, modelling, machine learning, etc. Most of these questions are subjective, and the
answers vary based on the given data problem. The main aim of the interviewer is to see how
you code, what visualizations you can draw from the data, the conclusions you can make from
the data set, etc.
1) How can you build a simple logistic regression model in Python?
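An interviewer may expect scikit-learn's LogisticRegression here; as a dependency-light sketch, the same model can also be fit by gradient descent in plain NumPy (the toy data below is invented for illustration):

```python
import numpy as np

# Toy 1-D data (invented): the label is 1 exactly when x > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(float)

# Add an intercept column and fit weights by gradient descent on
# the logistic (cross-entropy) loss.
Xb = np.hstack([np.ones((len(X), 1)), X])
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))    # sigmoid predictions
    w -= 0.1 * Xb.T @ (p - y) / len(y)   # gradient step

pred = (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(float)
print(f"train accuracy: {(pred == y).mean():.2f}")
```

On this separable toy data the learned decision boundary ends up near x = 0, so training accuracy is close to 1.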
2) How can you train and interpret a linear regression model in SciKit learn?
3) Name a few libraries in Python used for Data Analysis and Scientific computations.
NumPy, SciPy, Pandas, SciKit, Matplotlib, Seaborn
4) Which library would you prefer for plotting in Python language: Seaborn or
Matplotlib?
Matplotlib is the Python library used for plotting, but it needs a lot of fine-tuning to ensure
that the plots look polished. Seaborn helps data scientists create statistically and aesthetically
appealing, meaningful plots. The answer to this question varies based on the requirements for
plotting data.
5) What is the main difference between a Pandas series and a single-column
DataFrame in Python?
6) Write code to sort a DataFrame in Python in descending order.
7) How can you handle duplicate values in a dataset for a variable in Python?
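A small sketch covering questions 6 and 7, using an invented one-column DataFrame (assumes pandas is installed):

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 1, 2, 2]})

# Q6: sort a DataFrame in descending order.
sorted_df = df.sort_values(by="score", ascending=False)

# Q7: handle duplicate values of a variable -- inspect, then drop.
dupes = df["score"].duplicated()               # True for each repeat
deduped = df.drop_duplicates(subset="score")   # keeps the first occurrence

print(sorted_df["score"].tolist())  # [3, 2, 2, 1]
print(deduped["score"].tolist())    # [3, 1, 2]
```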
8) Which Random Forest parameters can be tuned to enhance the predictive power of
the model?
9) Which method in pandas.tools.plotting is used to create scatter plot matrix?
scatter_matrix()
10) How can you check if a data set or time series is Random?
To check whether a dataset is random, use a lag plot. If the lag plot for the given
dataset does not show any structure, then the dataset is random.
11) Can we create a DataFrame with multiple data types in Python? If yes, how can
you do it?
12) Is it possible to plot histogram in Pandas without calling Matplotlib? If yes, then
write the code to plot the histogram?
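A sketch of one possible answer: pandas delegates to Matplotlib internally, so your own code only calls pandas. The column name and values here are invented; the Agg backend line is only needed on machines without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; harmless elsewhere

import pandas as pd

df = pd.DataFrame({"length": [1.5, 0.5, 1.2, 0.9, 3.0]})
# No explicit matplotlib call in our own code: pandas draws the histogram.
ax = df["length"].hist(bins=3)
print(len(ax.patches))  # one rectangle artist per bin
```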
13) What are the possible ways to load an array from a text data file in Python? How
can the efficiency of the code to load data file be improved?
numpy.loadtxt()
14) Which is the standard data missing marker used in Pandas?
NaN
15) Why you should use NumPy arrays instead of nested Python lists?
16) What is the preferred method to check for an empty array in NumPy?
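A short sketch of the usual answer (checking the size attribute), under the assumption that NumPy is available:

```python
import numpy as np

arr = np.empty(0)
# Truth-testing or len() can be ambiguous for multi-dimensional arrays;
# the size attribute counts elements across all dimensions.
print(arr.size == 0)             # True
print(np.array([[]]).size == 0)  # True: shape (1, 0) still holds no elements
```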
17) List down some evaluation metrics for regression problems.

18) Which Python library would you prefer to use for Data Munging?
Pandas
19) Write the code to sort an array in NumPy by the nth column?
This can be achieved with the argsort() function. If x is an array and you would like to
sort its rows by the nth column, the code is x[x[:, n].argsort()].
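A runnable sketch of this answer (the array values are made up for illustration):

```python
import numpy as np

# Toy 2-D array; sort its rows by the values in column n.
x = np.array([[3, 9],
              [1, 7],
              [2, 8]])
n = 0
# argsort() on column n gives the row order; indexing with it reorders the rows.
sorted_x = x[x[:, n].argsort()]
print(sorted_x)  # rows ordered as [[1, 7], [2, 8], [3, 9]]
```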
20) How are NumPy and SciPy related?
21) Which python library is built on top of matplotlib and Pandas to ease data plotting?
Seaborn
22) Which plot will you use to assess the uncertainty of a statistic?
A bootstrap plot.
23) What are some features of Pandas that you like or dislike?
24) Which scientific libraries in SciPy have you worked with in your project?
25) What is pylab?
A package that combines NumPy, SciPy and Matplotlib into a single namespace.
26) Which python library is used for Machine Learning?
SciKit-Learn

Basic Python Programming Interview Questions


27) How can you copy objects in Python?
The functions used to copy objects in Python are:
1) copy.copy() for a shallow copy
2) copy.deepcopy() for a deep copy
However, not every object can be copied with these functions. For instance, dictionaries
also have their own copy() method, and sequences can additionally be copied by slicing.
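A tiny demonstration of the shallow/deep distinction described above (the list values are invented):

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)     # new outer list, but inner lists are shared
deep = copy.deepcopy(a)    # fully independent copy of the nested structure

a[0].append(99)
print(shallow[0])  # [1, 2, 99] -- the shared inner list changed too
print(deep[0])     # [1, 2]    -- the deep copy is unaffected
```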
28) What is the difference between tuples and lists in Python?
Tuples can be used as keys for dictionaries i.e. they can be hashed. Lists are mutable whereas
tuples are immutable - they cannot be changed. Tuples should be used when the order of
elements in a sequence matters. For example, set of actions that need to be executed in sequence,
geographic locations or list of points on a specific route.
29) What is PEP8?
PEP8 consists of coding guidelines for Python language so that programmers can write readable
code making it easy to use for any other person, later on.
30) Is all the memory freed when Python exits?
No it is not, because the objects that are referenced from global namespaces of Python modules
are not always de-allocated when Python exits.
31) What does __init__.py do?
__init__.py is an (often empty) .py file that marks a directory as a package so that the
modules in it can be imported; it also provides an easy way to organize the files. If there is a
module maindir/subdir/module.py, an __init__.py is placed in each of the directories so that
the module can be imported using the following command: import maindir.subdir.module
32) What is the different between range () and xrange () functions in Python?
range() returns a list, whereas xrange() returns an object that acts like an iterator,
generating numbers on demand.
33) How can you randomize the items of a list in place in Python?
random.shuffle(lst) can be used to randomize the items of a list in place in Python.
34) What is a pass in Python?
Pass in Python signifies a no operation statement indicating that nothing is to be done.

35) If you are given the first and last names of employees, which data type in Python will
you use to store them?
You can use a list whose elements hold the first and last name, or use a dictionary.
36) What happens when you execute the statement mango=banana in Python?
A NameError will occur when this statement is executed, because the name banana is not defined.
37) Write a sorting algorithm for a numerical dataset in Python.
38) Optimize the below Python code:
word = 'word'
print word.__len__ ()
Answer: print len(word)
39) What is monkey patching in Python?
Monkey patching is a technique that helps the programmer to modify or extend other code at
runtime. Monkey patching comes handy in testing but it is not a good practice to use it in
production environment as debugging the code could become difficult.
40) Which tool in Python will you use to find bugs if any?
Pylint and PyChecker. Pylint verifies whether a module satisfies the coding standards;
PyChecker is a static analysis tool that helps find bugs in the source code.
41) How are arguments passed in Python- by reference or by value?
The answer to this question is neither of these because passing semantics in Python are
completely different. In all cases, Python passes arguments by value where all values are
references to objects.
42) You are given a list of N numbers. Write a single list comprehension in Python to
create a new list that contains only the even values found at even indices of the original
list. For instance, if list[4] holds an even value it should be included in the output
because 4 is an even index, but if list[5] holds an even value it should not be included,
because 5 is not an even index.

[x for x in lst[::2] if x % 2 == 0]

The slice lst[::2] takes the elements at even indices, and the condition then discards the
odd values among them.
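As a runnable check of the comprehension above (the numbers are invented):

```python
lst = [10, 5, 7, 3, 6, 4]
# lst[::2] keeps indices 0, 2, 4 -> [10, 7, 6];
# the condition then keeps only the even values among them.
result = [x for x in lst[::2] if x % 2 == 0]
print(result)  # [10, 6]
```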
43) Explain the usage of decorators.
Decorators in Python are used to modify or inject code in functions or classes. Using decorators,
you can wrap a class or function method call so that a piece of code can be executed before or
after the execution of the original code. Decorators can be used to check for permissions, modify
or track the arguments passed to a method, logging the calls to a specific method, etc.
44) How can you check whether a pandas data frame is empty or not?
The attribute df.empty is used to check whether a data frame is empty or not.
45) What will be the output of the below Python code?
def multipliers():
    return [lambda x: i * x for i in range(4)]
print [m(2) for m in multipliers()]
The output for the above code will be [6, 6, 6, 6]. The reason is late binding: the value of
the variable i is looked up only when one of the functions returned by multipliers is called,
by which time the loop has finished and i is 3.
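A common follow-up is how to get [0, 2, 4, 6] instead; binding i as a default argument captures its value at definition time (sketched with Python 3 print):

```python
def multipliers():
    # i=i freezes the current loop value into each lambda's own default.
    return [lambda x, i=i: i * x for i in range(4)]

print([m(2) for m in multipliers()])  # [0, 2, 4, 6]
```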
46) What do you mean by list comprehension?
A list comprehension is a concise way of creating a list by performing an operation on each
item of an iterable, optionally filtering items with a condition.
Example:
[ord (j) for j in string.ascii_uppercase]
[65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90]
47) What will be the output of the below code?

word = 'aeioubcdfg'
print word[:3] + word[3:]

The output for the above code will be: 'aeioubcdfg'.
The slices word[:3] and word[3:] partition the string at index 3, so concatenating them
with + reproduces the original string.
48) What will be the output of the below code?

list = ['a', 'e', 'i', 'o', 'u']
print list[8:]

The output for the above code will be an empty list []. Many people expect an IndexError,
because the code appears to access a member beyond the end of the list. But slicing, unlike
indexing, does not raise an error when the start index exceeds the number of members in the
list; it simply returns an empty slice.
49) What will be the output of the below code?

def foo(i=[]):
    i.append(1)
    return i

>>> foo()
>>> foo()

The output for the above code will be:
[1]
[1, 1]
The default argument of foo is evaluated only once, when the function is defined. However,
since it is a list, on every call the same list is modified by appending a 1 to it.
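The standard fix for this pitfall is a None sentinel, so each call builds a fresh list (sketched in Python 3):

```python
def foo(i=None):
    # Use None as the default so the list is created anew on every call.
    if i is None:
        i = []
    i.append(1)
    return i

print(foo())  # [1]
print(foo())  # [1] again, not [1, 1]
```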
50) Can the lambda forms in Python contain statements?

No, as their syntax is restricted to single expressions; they are used for creating function
objects which are returned at runtime.
This list of questions for Python interview questions and answers is not an exhaustive one and
will continue to be a work in progress. Let us know in comments below if we missed out on any
important question that needs to be up here.
Python Developer interview questions

This Python Developer interview profile brings together a snapshot of what to look for in
candidates with a balanced sample of suitable interview questions.

Introduction

Computing Science Questions

Role Specific Questions

In some respects even the most technical role demands qualities common to strong candidates for
all positions: the willingness to learn; qualified skills; passion for the job.
Even college performance, while it helps you to assess formal education, doesn't give a complete
picture. This is not to underplay the importance of a solid background in computer science. Some
things to look for:
Understanding of basic algorithmic concepts
Discuss basic algorithms, how would they find/think/sort
Can they show a wider understanding of databases
Do they have an approach to modelling?
Do they stay up to date with the latest developments? If so, how? Probe for their favourite
technical books. Who are they following on Twitter, which blogs do they turn to?
Are they active on GitHub? Do they contribute to any open source software projects, or take part
in hackathons? In short, how strong is their intellectual interest in their chosen field, and how is
it demonstrated? Ask about side projects (like game development). Committed, inquisitive
candidates will stand out.

Computing Science Questions

Using pseudo-code, reverse a String iteratively and recursively

What constitutes a good unit test and what a functional one?

Role Specific Questions

Do arguments in Python get passed by reference or by value?

Why are functions considered first class objects in Python?

What tools do you use for linting, debugging and profiling?

Give an example of filter and reduce over an iterable object

Implement the linux whereis command that locates the binary, source, and
manual page files for a command.

What are list and dict comprehensions?

What do we mean when we say that a certain lambda expression forms a closure?

What is the difference between list and tuple?

What will be the output of the following code?

list = ['a', 'b', 'c', 'd', 'e']

print list[10:]

What will be the output of the following code in each step?

class C:
    dangerous = 2

c1 = C()
c2 = C()
print c1.dangerous

c1.dangerous = 3
print c1.dangerous
print c2.dangerous

del c1.dangerous
print c1.dangerous

C.dangerous = 3
print c2.dangerous
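A walkthrough of the class-attribute question above, rewritten for Python 3 (print as a function); the behaviour hinges on instance attributes shadowing class attributes:

```python
class C:
    dangerous = 2          # class attribute shared by all instances

c1 = C()
c2 = C()
print(c1.dangerous)        # 2 -- read through the class
c1.dangerous = 3           # creates an *instance* attribute that shadows it
print(c1.dangerous)        # 3
print(c2.dangerous)        # 2 -- c2 still sees the class attribute
del c1.dangerous           # removes only the instance attribute
print(c1.dangerous)        # 2 -- lookup falls back to the class
C.dangerous = 3            # rebinding the class attribute...
print(c2.dangerous)        # 3 -- ...is visible through every instance
```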

Top Python Interview Questions Most Asked

Here are the top 30 objective-type sample Python interview questions, with their answers given
just below them. These sample questions are framed by experts from Intellipaat who deliver
Python training, to give you an idea of the type of questions that may be asked in an interview.
We have taken full care to give correct answers to all the questions. Do comment your thoughts.
Happy job hunting!

Top Answers to Python Interview Questions


1. What is Python?
Python is an object-oriented, open-source programming language which supports
structured and functional programming and rich built-in data structures. With a clean
and easy-to-understand syntax, Python allows code reuse and modularity of
programs. Its built-in data structures make it a wonderful option for Rapid
Application Development (RAD). The language also encourages faster
editing, testing and debugging, with no compilation step.
2. What are the standard data types supported by Python?
It supports six data types:
1. Number: object stored as a numeric value
2. String: object stored as a string
3. Tuple: data stored as a sequence of immutable objects
4. Dictionary (dict): associates one thing with another irrespective of the type
of data; the most useful container (called a hash in C and Java)
5. List: data stored as a list sequence
6. Set (frozenset): unordered collection of distinct objects
3. Explain built-in sequence types in Python programming.
It provides two built-in sequence types:
1. Mutable type: objects whose value can be changed after creation,
for example sets, lists, dictionaries
2. Immutable type: objects whose value cannot be changed once created,
for example numbers, Booleans, tuples, strings
4. Explain the use of iterators in Python?
Python uses iterators to implement the iterator protocol, which enables
traversing through containers and groups of elements, like lists. The two
important methods are __iter__(), which returns the iterator object, and the
next() method, used for traversal.
5. Define Python slicing?
Slicing is the process of extracting a range of elements from lists, arrays, tuples and
custom Python data structures as well. It works on a general start/stop/step
pattern: slice(start, stop, step)
6. How can you compare two lists in Python?
The simplest way is the == operator, which compares the lists element by element:
intellipaatlist1 == intellipaatlist2
In Python 2 there is also the built-in cmp(intellipaatlist1, intellipaatlist2),
which returns -1, 0 or 1 depending on how the lists compare.
7. What is the use of the // operator?
// is the floor division operator, which divides two operands with the result as a
quotient showing only the digits before the decimal point. For instance, 6//3 = 2 and
6.0//3.0 = 2.0
8. Define docstring in Python with an example.
A string literal occurring as the first statement (like a comment) in any
module, class, function or method is referred to as a docstring in Python. This kind
of string becomes the __doc__ special attribute of the object and provides an
easy way to document a particular code segment. Most modules contain
docstrings, and thus the functions and classes extracted from the module
also have docstrings.

9. What function randomizes the items of a list in place?
The shuffle() function from the random module. For instance:
import random
lst = [2, 18, 8, 4]
random.shuffle(lst)
print 'Shuffled list : ', lst
random.shuffle(lst)
print 'Reshuffled list : ', lst
10. List five benefits of using Python?
1. With its built-in data types, Python saves programmers the time and effort
of declaring variables. It has a powerful dictionary and polymorphic lists for
automatic declaration. It also ensures better code reusability.
2. Highly accessible and easy to learn for beginners, and a strong glue for
advanced professionals, consisting of several high-level modules and
operations not offered by other programming languages.
3. Allows easy readability due to the use of square brackets for most functions
and indexes.
4. Python requires no explicit memory management, as the interpreter itself
allocates memory to new variables and frees them automatically.
5. Python comprises a huge standard library covering most Internet protocols
such as email, HTML, FTP and other WWW platforms.
11. What are the disadvantages of using Python?
1. Python is slow compared to other programming languages. Although
this slow pace often doesn't matter, at times we need another language to
handle performance-critical situations.
2. It is ineffective on mobile platforms; fewer mobile applications are
developed using Python. The main reason behind its instability on
smartphones is Python's weaker security story; there are no good, secure
frameworks available for Python on mobile so far.
3. Due to dynamic typing, programmers face design restrictions while using
the language. The code needs more and more testing before being put into
action, since errors pop up only during runtime.
4. Unlike JavaScript, Python's features for concurrency and parallelism are
not designed for elegant use.
12. Explain the use of the split function?
The split() function in Python breaks a string into shorter strings using the
defined separator. It returns a list of all the words present in the string.
>>> y = 'true,false,none'
>>> y.split(',')
Result: ['true', 'false', 'none']
What is the use of generators in Python?
Generators are primarily used to return multiple items, one after the other.
They are used for iteration in Python and for computing large result sets lazily:
the generator function suspends after each yield until the next value is requested.
One of the best uses of generators in Python is implementing callback-style
operations with less effort: they replace callbacks with iteration. With the
generator approach, programmers are saved from writing a separate
callback function and passing it to the work function, since the caller can
simply run a for loop over the generator.
13. How to create a multidimensional list in Python?
As the name suggests, a multidimensional list is the concept of a list holding
another list, applying to many such lists. It can easily be done by creating a
single-dimensional list and filling each element with a newly created list.
14. What is lambda?
lambda is a powerful construct used in conjunction with other functions like
filter(), map() and reduce(). Its major use is to create anonymous functions at
runtime, which are used where they are created. Such functions are known
as throw-away functions in Python. The general syntax is
lambda argument_list: expression.
For instance:
>>> intellipaat1 = lambda i, n: i + n
>>> intellipaat1(2, 2)
4
Using filter():
>>> intellipaat = [1, 6, 11, 21, 29, 18, 24]
>>> print filter(lambda x: x % 3 == 0, intellipaat)
[6, 21, 18, 24]
15. Define Pass in Python?
The pass statement in Python is equivalent to a null operation and a
placeholder: nothing takes place when it executes. It is mostly used where the
syntax requires a statement but the code is not written yet, so the program
can still run. The syntax is simply: pass
16. How to perform Unit Testing in Python?
Referred to as PyUnit, the Python unit testing framework unittest supports
automated testing, grouping tests into collections, shutdown code for tests,
and independence of the tests from the reporting framework. The unittest module
makes use of the TestCase class for holding and preparing test routines and
cleaning them up after successful execution.
17. Define Python tools for finding bugs and performing static analysis?
PyChecker is an excellent bug-finder tool for Python, which performs static
analysis of the source code. It also notifies programmers about the
complexity and style of the code. In addition, there is another tool, PyLint, for
checking coding standards, including code line length, variable names
and whether the declared interfaces are fully implemented.
18. How to convert a string into a list?
Using the function list(string). For instance:
>>> list('intellipaat') in your lines of code will return
['i', 'n', 't', 'e', 'l', 'l', 'i', 'p', 'a', 'a', 't']
In Python, strings behave like lists in various ways. For example, you can access
individual characters of a string:
>>> y = 'intellipaat'
>>> y[2]
't'
19. What OS do Python support?
Linux, Windows, Mac OS X, IRIX, Compaq, Solaris
20. Name the Java implementation of Python?
Jython
21. Define docstring in Python.
A docstring is a string literal occurring as the first statement in a module, class,
function or method; it becomes the __doc__ special attribute of the object
(see question 8 for details).
22. Name the optional clauses used in a try-except statement in Python?
While Python exception handling is a bit different from Java's, Python
provides a try-except clause where the programmer receives a detailed
error message without the program terminating. Sometimes, along with the
problem, the try-except statement offers a way to deal with the error.
The language also provides try-except-finally and try-except-else blocks.

23. How to use PYTHONPATH?


PYTHONPATH is the environment variable consisting of directories.
$PYTHONPATH holds the list of folders that is searched for libraries.
24. Define self in Python?
self is a reference to the current instance of the class, much like this in
JavaScript. When we create an instance of a class, that instance holds its own
data and internally passes a reference to itself.
25. Define CGI?
Common Gateway Interface support in Python is an external gateway to
interact with HTTP server and other information servers. It consists of a series
of standards and instructions defining the exchange of information between a
custom script and web server. The HTTP server puts all important and useful
information concerning the request in the script environment and then run
the script and sends it back in the form of output to the client.
26. What is PYTHONSTARTUP and how is it used?
PYTHONSTARTUP is yet another environment variable to test the Python file in
the interpreter using interactive mode. The script file is executed even before
the first prompt is seen. Additionally, it also allows reloading of the same
script file after being modified in the external editor.
27. What is the return value of trunc() in Python?
math.trunc() returns an integer value; it uses the __trunc__ method.
>>> import math
>>> math.trunc(4.34)
4
28. How to convert a string to an object in Python?
To convert string into object, Python provides a function eval(string). It allows
the Python code to run in itself
29. Is there any function to change case of all letters in the string?
Yes, Python supports a function swapcase(), which swaps the current letter
case of the string. This method returns a copy of the string with the string
case swapped.
30. What is pickling and unpickling in Python?
Pickling relates to the pickle module. pickle is a general module that takes a
Python object and converts it into a byte string, then dumps that string into a
file using the dump() function.
pickle provides two main methods:
dump(): dumps an object to a file object
load(): loads an object back from a file object
Unpickling is the reverse process: retrieving the original Python object from
the stored string for reuse.
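A round-trip sketch of pickling and unpickling, using an in-memory buffer instead of a file on disk (the dictionary contents are invented):

```python
import io
import pickle

data = {"name": "ada", "scores": [1, 2, 3]}

buf = io.BytesIO()
pickle.dump(data, buf)        # pickling: object -> byte stream
buf.seek(0)
restored = pickle.load(buf)   # unpickling: byte stream -> object

print(restored == data)       # True: equal in value
print(restored is data)       # False: a distinct, rebuilt object
```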

Top 25 Python Interview Questions

1) What is Python? What are the benefits of using Python?


Python is a programming language with objects, modules, threads, exceptions and automatic
memory management. The benefits of Python are that it is simple and easy, portable, extensible,
has built-in data structures, and is open source.
2) What is PEP 8?
PEP 8 is a coding convention, a set of recommendation, about how to write your Python code
more readable.
3) What is pickling and unpickling?
Pickle module accepts any Python object and converts it into a string representation and dumps it
into a file by using dump function, this process is called pickling. While the process of
retrieving original Python objects from the stored string representation is called unpickling.
4) How Python is interpreted?

Python language is an interpreted language. Python program runs directly from the source code.
It converts the source code that is written by the programmer into an intermediate language,
which is again translated into machine language that has to be executed.
5) How memory is managed in Python?

Python memory is managed by the Python private heap space. All Python objects and data
structures are located in a private heap. The programmer does not have access to this
private heap; the interpreter takes care of it.

The allocation of heap space for Python objects is done by the Python memory
manager. The core API gives the programmer access to some tools for coding.

Python also has an inbuilt garbage collector, which recycles all the unused memory,
freeing it and making it available to the heap space.

6) What are the tools that help to find bugs or perform static analysis?
PyChecker is a static analysis tool that detects the bugs in Python source code and warns about
the style and complexity of the bug. Pylint is another tool that verifies whether the module meets
the coding standard.
7) What are Python decorators?
A Python decorator is a specific change that we make in Python syntax to alter functions easily.
8) What is the difference between list and tuple?
The difference between list and tuple is that list is mutable while tuple is not. Tuple can be
hashed for e.g as a key for dictionaries.
9) How are arguments passed: by value or by reference?
Everything in Python is an object, and all variables hold references to objects. The
references themselves are passed by value into functions; as a result, you cannot rebind
the caller's reference from inside a function. However, you can change the object itself
if it is mutable.

10) What are Dict and List comprehensions?

They are syntax constructions that ease the creation of a Dictionary or List based on an
existing iterable.
11) What built-in types does Python provide?
Python's built-in types are either mutable or immutable.

Mutable built-in types:

Lists

Sets

Dictionaries

Immutable built-in types:

Strings

Tuples

Numbers

12) What is namespace in Python?


In Python, every name introduced has a place where it lives and can be looked up. This is
known as a namespace. It is like a box where a variable name is mapped to an object.
Whenever the variable is searched for, this box is searched to get the corresponding object.
13) What is lambda in Python?

It is a single expression anonymous function often used as inline function.


14) Why do lambda forms in Python not have statements?
A lambda form in Python does not contain statements because it is used to make a new
function object and return it at runtime.
15) What is pass in Python?
Pass means a no-operation Python statement; in other words, it is a placeholder in a
compound statement, used where a statement is syntactically required but nothing needs
to be written.
16) In Python what are iterators?
In Python, iterators are used to iterate a group of elements, containers like list.
17) What is unittest in Python?
A unit testing framework in Python is known as unittest. It supports sharing of setups,
automation testing, shutdown code for tests, aggregation of tests into collections etc.
18) In Python what is slicing?
A mechanism to select a range of items from sequence types like list, tuple, strings etc. is known
as slicing.
19) What are generators in Python?
Generators are the way of implementing iterators. A generator is a normal function except
that it yields an expression in the function.
20) What is docstring in Python?
A Python documentation string is known as docstring, it is a way of documenting Python
functions, modules and classes.
21) How can you copy an object in Python?
To copy an object in Python, you can try copy.copy () or copy.deepcopy() for the general case.
You cannot copy all objects but most of them.
22) What is negative index in Python?

Python sequences can be indexed with positive and negative numbers. For a positive index,
0 is the first index, 1 is the second index, and so forth. For a negative index, (-1) is the
last index and (-2) is the second last index, and so forth.
23) How can you convert a number to a string?
In order to convert a number into a string, use the inbuilt function str(). If you want an
octal or hexadecimal representation, use the inbuilt function oct() or hex().
24) What is the difference between xrange and range?
xrange returns an xrange object, while range returns a list. xrange uses the same amount
of memory no matter what the range size is.
25) What are modules and packages in Python?
In Python, a module is the way to structure a program. Each Python program file is a
module, which can import other modules' objects and attributes.
A folder of Python programs is a package of modules. A package can contain modules or
subfolders.

21 Must-Know Data Science Interview Questions and Answers
KDnuggets Editors bring you the answers to 20 Questions to Detect Fake Data Scientists,
including what is regularization, Data Scientists we admire, model validation, and more.
By Gregory Piatetsky, KDnuggets.
The recent post on KDnuggets
20 Questions to Detect Fake Data Scientists has been very popular - most viewed in
the month of January.
However, these questions were lacking answers, so KDnuggets Editors got together
and wrote them. I also added one more critical question, number 21, which was omitted
from the 20-questions post.

Here are the answers. Because of the length, here are the answers to the first 11
questions, and here is part 2.
Q1. Explain what regularization is and why it
is useful.
Answer by Matthew Mayo.
Regularization is the process of adding a tuning
parameter to a model to induce smoothness in
order to prevent overfitting. (see also KDnuggets
posts on Overfitting)

This is most often done by adding a penalty proportional to the norm of the existing weight
vector. This penalty norm is most often the L1 (Lasso) or L2 (ridge), but it can in actuality
be any norm. The model predictions should then minimize the loss
function calculated on the regularized training set.
Xavier Amatriain presents a good comparison of L1 and L2 regularization here, for
those interested.

Fig 1: Lp ball: As the value of p decreases, the size of the corresponding Lp space also decreases.
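A small sketch of L2 regularization in practice, assuming scikit-learn is available; the synthetic data and the alpha value are invented for illustration. The ridge penalty shrinks the fitted weight vector relative to plain least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = X[:, 0] + 0.1 * rng.randn(40)   # only the first feature truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # adds an L2 penalty on the weights

# The penalty pulls the coefficient vector toward zero.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```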

Q2. Which data scientists do you admire most? Which startups?


Answer by Gregory Piatetsky:
This question does not have a correct answer, but here is my personal list of 12

Data Scientists I most admire, not in any particular order.

Geoff Hinton, Yann LeCun, and Yoshua Bengio, for persevering with Neural Nets when they
were out of fashion and for starting the current Deep Learning revolution.

Demis Hassabis, for his amazing work on DeepMind, which achieved human or
superhuman performance on Atari games and recently Go.
Jake Porway from DataKind and Rayid Ghani from U. Chicago/DSSG, for enabling
data science contributions to social good.
DJ Patil, First US Chief Data Scientist, for using Data Science to make US
government work better.
Kirk D. Borne for his influence and leadership on social media.
Claudia Perlich for brilliant work on ad ecosystem and serving as a great KDD-2014
chair.
Hilary Mason for great work at Bitly and inspiring others as a Big Data Rock Star.
Usama Fayyad, for showing leadership and setting high goals for KDD and Data
Science, which helped inspire me and many thousands of others to do their best.
Hadley Wickham, for his fantastic work on Data Science and Data Visualization in R,
including dplyr, ggplot2, and Rstudio.
There are too many excellent startups in Data Science area, but I will not list them
here to avoid a conflict of interest.
Here is some of our previous coverage of startups.
Q3. How would you validate a model you created to generate a predictive
model of a quantitative outcome variable using multiple regression?

Answer by Matthew Mayo.


Proposed methods for model validation:

If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy.

If the values seem to be reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or observed inconsistency when the model is fed new data.

Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a model validity measure.

Use data splitting to form a separate dataset for estimating model parameters, and another for validating predictions.

Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
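The data-splitting approach can be sketched in pure Python. The data below is made up, and a one-variable least-squares fit stands in for a full multiple regression; the validation logic (fit on one split, score the other with MSE and R squared) is the same:

```python
import statistics

# Toy data (hypothetical): y is roughly 2x + 1 plus noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 15.1, 16.9]

# Data splitting: the first 6 points estimate parameters, the last 2 validate.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]

# Ordinary least squares for a single predictor.
mx, my = statistics.mean(train_x), statistics.mean(train_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx

# Validate predictions on the held-out data with MSE and R squared.
preds = [intercept + slope * x for x in test_x]
mse = statistics.mean((p - y) ** 2 for p, y in zip(preds, test_y))
ss_res = sum((y - p) ** 2 for y, p in zip(test_y, preds))
ss_tot = sum((y - statistics.mean(test_y)) ** 2 for y in test_y)
r_squared = 1 - ss_res / ss_tot
```

A low held-out MSE and an R squared near 1 suggest the model generalizes; a large gap between training and held-out performance is the classic sign of overfitting.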

Q4. Explain what precision and recall are. How do they relate to the ROC
curve?
Answer by Gregory Piatetsky:
Here is the answer from KDnuggets FAQ: Precision and Recall:

Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases
among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a
better chance of catching many of the 100 positive cases. You record the IDs of your predictions,
and when you get the actual results you sum up how many times you were right or wrong. There
are four ways of being right or wrong:
1. TN / True Negative: case was negative and predicted negative
2. TP / True Positive: case was positive and predicted positive
3. FN / False Negative: case was positive but predicted negative
4. FP / False Positive: case was negative but predicted positive

Makes sense so far? Now you count how many of the 10,000 cases fall in each bucket, say:

                   Predicted Negative   Predicted Positive
Negative Cases     TN: 9,760            FP: 140
Positive Cases     FN: 40               TP: 60

Now, your boss asks you three questions:


1. What percent of your predictions were correct?
You answer: the "accuracy" was (9,760+60) out of 10,000 = 98.2%
2. What percent of the positive cases did you catch?
You answer: the "recall" was 60 out of 100 = 60%
3. What percent of positive predictions were correct?
You answer: the "precision" was 60 out of 200 = 30%
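The three answers above follow directly from the four counts:

```python
# Confusion-matrix counts from the worked example above.
tn, fp, fn, tp = 9760, 140, 40, 60

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all cases
recall    = tp / (tp + fn)                   # positives caught / actual positives
precision = tp / (tp + fp)                   # correct positives / predicted positives

print(accuracy, recall, precision)  # 0.982 0.6 0.3
```

Note how accuracy looks excellent (98.2%) even though the classifier misses 40% of the positives and only 30% of its positive calls are right - exactly why precision and recall matter on imbalanced data.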

See also a very good explanation of Precision and recall in Wikipedia.

Fig 4: Precision and Recall.


The ROC curve represents a relation between sensitivity (recall) and specificity (not precision), and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance. See also this Quora answer: What is the difference between a ROC curve and a precision-recall curve?
Q5. How can you prove that one improvement you've brought to an
algorithm is really an improvement over not doing anything?
Answer by Anmol Rajpurohit.
Often it is observed that in the pursuit of rapid innovation (aka "quick fame"), the
principles of scientific methodology are violated leading to misleading innovations,
i.e. appealing insights that are confirmed without rigorous validation. One such
scenario is the case that, given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement.
An obvious human urge is to announce these ideas ASAP and ask for their implementation. When asked for supporting data, often limited results are shared, which are very likely to be impacted by selection bias (known or unknown) or a misleading global minimum (due to lack of appropriate variety in test data).
Data scientists do not let their human emotions overrun their logical reasoning.
While the exact approach to prove that one improvement you've brought to an
algorithm is really an improvement over not doing anything would depend on the
actual case at hand, there are a few common guidelines:

Ensure that there is no selection bias in the test data used for performance comparison

Ensure that the test data has sufficient variety in order to be representative of real-life data (helps avoid overfitting)

Ensure that "controlled experiment" principles are followed, i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same while running the original algorithm and the new algorithm

Ensure that the results are repeatable with near similar results

Examine whether the results reflect local maxima/minima or global maxima/minima

One common way to achieve the above guidelines is through A/B testing, where
both the versions of algorithm are kept running on similar environment for a
considerably long time and real-life input data is randomly split between the two.
This approach is particularly common in Web Analytics.
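The A/B comparison described above can be sketched with a two-proportion z-test; all the counts below are made up for illustration:

```python
import math

# Hypothetical A/B results: conversions out of randomly split traffic.
a_success, a_total = 200, 5000   # original algorithm
b_success, b_total = 260, 5000   # new algorithm

p_a, p_b = a_success / a_total, b_success / b_total
p_pool = (a_success + b_success) / (a_total + b_total)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / a_total + 1 / b_total))
z = (p_b - p_a) / se

# One-sided p-value via the standard normal CDF (math.erf is in the stdlib).
p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

A small p-value (conventionally below 0.05) is evidence that the new algorithm's improvement is real rather than noise - though with a real A/B test you would also fix the sample size in advance rather than peeking.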
Q6. What is root cause analysis?
Answer by Gregory Piatetsky:
According to Wikipedia,
Root cause analysis (RCA) is a method of problem solving used for identifying the
root causes of faults or problems. A factor is considered a root cause if removal
thereof from the problem-fault-sequence prevents the final undesirable event from
recurring; whereas a causal factor is one that affects an event's outcome, but is not
a root cause.

Root cause analysis was initially developed to analyze industrial accidents, but is
now widely used in other areas, such as healthcare, project management, or
software testing.
Here is a useful Root Cause Analysis Toolkit from the state of Minnesota.
Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question "Why?" until you find the root of the problem. This technique is commonly called "5 Whys", although it can involve more or fewer than 5 questions.

Fig. 5 Whys Analysis Example, from The Art of Root Cause Analysis .
Q7. Are you familiar with price optimization, price elasticity, inventory
management, competitive intelligence? Give examples.
Answer by Gregory Piatetsky:
Those are economics terms that are not frequently asked of Data Scientists but they
are useful to know.
Price optimization is the use of mathematical tools to determine how customers will respond to different prices for products and services offered through different channels.
Big Data and data mining enable the use of personalization for price optimization. Now companies like Amazon can take optimization even further and show different prices to different visitors, based on their history, although there is a strong debate about whether this is fair.
Price elasticity in common usage typically refers to price elasticity of demand, a measure of price sensitivity. It is computed as:

Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price

Similarly, price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price.
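A quick worked example of the demand-elasticity formula above, with made-up numbers:

```python
# Price rises from $10 to $11 (+10%); quantity demanded falls from 100 to 85 (-15%).
pct_change_quantity = (85 - 100) / 100   # -0.15
pct_change_price = (11 - 10) / 10        # +0.10

elasticity = pct_change_quantity / pct_change_price  # -1.5
# |elasticity| > 1 means demand is elastic: revenue falls when the price rises.
```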

Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell, as well as the overseeing and controlling of quantities of finished products for sale.
Wikipedia defines competitive intelligence as: the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization.

Tools like Google Trends, Alexa, and Compete can be used to determine general trends and analyze your competitors on the web.
Q8. What is statistical power?
Answer by Gregory Piatetsky:
Wikipedia defines statistical power or sensitivity of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.
To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).
Here are some tools to calculate statistical power.
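A sketch of one such power calculation, using the common normal approximation for a two-sided, two-sample test; the effect size and sample size below are illustrative:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the stdlib error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(effect_size, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided two-sample z-test at alpha = 0.05."""
    noncentrality = effect_size * math.sqrt(n_per_group / 2)
    return normal_cdf(noncentrality - z_alpha)

# A medium effect (Cohen's d = 0.5) with 64 subjects per group
# gives roughly 80% power - the conventional target.
power = power_two_sample(0.5, 64)
```

This is why sample-size planning matters: halve n and power drops well below 80%, making a Type II error much more likely.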
Q9. Explain what resampling methods are and why they are useful. Also explain their limitations.
Answer by Gregory Piatetsky:
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology based upon repeated sampling within the same sample.
Resampling refers to methods for doing one of the following:

Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)

Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)

Validating models by using random subsets (bootstrapping, cross validation)

See more in Wikipedia about bootstrapping and jackknifing.
See also How to Check Hypotheses with Bootstrap and Apache Spark.
Here is a good overview of Resampling Statistics.
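The bootstrap idea fits in a few lines of pure Python - here a 95% confidence interval for a sample median (the data and resample count below are illustrative):

```python
import random
import statistics

random.seed(0)  # reproducible
data = [12, 15, 9, 21, 14, 18, 11, 25, 13, 16, 19, 10]

# Resample with replacement many times, recording each resample's median.
boot_medians = sorted(
    statistics.median(random.choices(data, k=len(data)))
    for _ in range(2000)
)

# The 2.5th and 97.5th percentiles bound a 95% bootstrap confidence interval.
lo, hi = boot_medians[50], boot_medians[1949]
```

The main limitation shows up here too: the bootstrap only resamples what you observed, so with a small or unrepresentative sample the interval inherits that sample's quirks.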


Q10. Is it better to have too many false positives, or too many false negatives? Explain.
Answer by Devendra Desale.
It depends on the question as well as on the domain for which we are trying to solve the question.
In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent, when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So, it is desired to have too many false positives.
For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. So, we prefer too many false negatives over too many false positives.
Q11. What is selection bias, why is it important and how can you avoid it?
Answer by Matthew Mayo.
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that class probability is a determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.

21 Must-Know Data Science Interview Questions and Answers, part 2

Second part of the answers to 20 Questions to Detect Fake Data Scientists, including controlling
overfitting, experimental design, tall and wide data, understanding the validity of statistics in the
media, and more.
By Gregory Piatetsky, KDnuggets.

The post on KDnuggets 20 Questions to Detect Fake Data Scientists has been very popular - the most viewed post of the month.
However, these questions lacked answers, so KDnuggets Editors got together and wrote the answers. Here is part 2 of the answers, starting with a "bonus" question.

Bonus Question: Explain what overfitting is and how you would control for it
This question was not part of the original 20, but probably is the most important one
in distinguishing real data scientists from fake ones.
Answer by Gregory Piatetsky.
Overfitting is finding spurious results that are due to chance and cannot be
reproduced by subsequent studies.
We frequently see newspaper reports about studies that overturn previous findings, like eggs are no longer bad for your health, or saturated fat is not linked to heart disease. The problem, in our opinion, is that many researchers, especially in social sciences or medicine, too frequently commit the cardinal sin of Data Mining - overfitting the data.
The researchers test too many hypotheses without proper statistical control, until
they happen to find something interesting and report it. Not surprisingly, next time
the effect, which was (at least partly) due to chance, will be much smaller or absent.

These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false.
Ioannidis noted that in order for a research finding to be reliable, it should have:

Large sample size and large effects

A small number of tested relationships, selected before the study rather than after

Less flexibility in designs, definitions, outcomes, and analytical modes

Minimal bias due to financial and other factors (including popularity of that scientific field)

Unfortunately, too often these rules were violated, producing irreproducible results.
For example, the S&P 500 index was found to be strongly related to production of butter in Bangladesh (from 1981 to 1993) (here is a PDF).

See more interesting (and totally spurious) findings which you can discover yourself
using tools such as Google correlate or Spurious correlations by Tyler Vigen.
Several methods can be used to avoid "overfitting" the data:

Try to find the simplest possible hypothesis

Regularization (adding a penalty for complexity)

Randomization testing (randomize the class variable, try your method on this data - if it finds the same strong results, something is wrong)

Nested cross-validation (do feature selection on one level, then run the entire method in cross-validation on the outer level)

Adjusting the False Discovery Rate

Using the reusable holdout method - a breakthrough approach proposed in 2015
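As a toy illustration of the regularization idea, here is an L2 (ridge) penalty applied to a one-variable least-squares fit through the origin, where the closed-form slope is sum(x*y) / (sum(x^2) + lam); the data and penalty values are made up:

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y ≈ w*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]            # roughly y = 2x

w_plain = ridge_slope(xs, ys, lam=0.0)    # ordinary least squares, ~2.0
w_ridge = ridge_slope(xs, ys, lam=10.0)   # penalized: the coefficient shrinks

# The penalty pulls the coefficient toward zero, trading a little bias
# for lower variance - the essence of controlling overfitting.
assert w_ridge < w_plain
```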

Good data science is on the leading edge of scientific understanding of the world, and it is data scientists' responsibility to avoid overfitting data and to educate the public and the media on the dangers of bad data analysis.
See also

The Cardinal Sin of Data Mining and Data Science: Overfitting

Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis

Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis

11 Clever Methods of Overfitting and how to avoid them

Tag: Overfitting

Q12. Give an example of how you would use experimental design to answer a question about user behavior.
Answer by Bhavya Geethika.

Step 1: Formulate the Research Question:
What are the effects of page load times on user satisfaction ratings?
Step 2: Identify variables:
We identify cause & effect. Independent variable - page load time; dependent variable - user satisfaction rating.
Step 3: Generate Hypothesis:
Lower page load time will have more effect on the user satisfaction rating for a web page. Here the factor we analyze is page load time.

Fig 12: There is a flaw in your experimental design (cartoon from here)
Step 4: Determine Experimental Design.
We consider experimental complexity, i.e. vary one factor at a time or multiple factors at one time, in which case we use factorial design (2^k design). A design is also selected based on the type of objective (comparative, screening, response surface) & the number of factors.
Here we also identify within-participants, between-participants, and mixed models. For example: there are two versions of a page, one with a Buy button (call to action) on the left and the other version with this button on the right.
Within-participants design - both user groups see both versions.
Between-participants design - one group of users sees version A & the other user group sees version B.
Step 5: Develop experimental task & procedure:
A detailed description of steps involved in the experiment, tools used to measure user behavior, goals and success metrics should be defined. Collect qualitative data about user engagement to allow statistical analysis.
Step 6: Determine Manipulation & Measurements
Manipulation: one level of a factor will be controlled and the other will be manipulated. We also identify the behavioral measures:
1. Latency - time between a prompt and occurrence of behavior (how long it takes for a user to click buy after being presented with products).
2. Frequency - number of times a behavior occurs (number of times the user clicks on a given page within a time).
3. Duration - length of time a specific behavior lasts (time taken to add all products).
4. Intensity - force with which a behavior occurs (how quickly the user purchased a product).

Step 7: Analyze results
Identify user behavior data and determine whether it supports or contradicts the hypothesis according to the observations made, e.g. how the majority of users' satisfaction ratings compared with page load times.
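The analysis step can be sketched with a permutation test (one of the resampling methods discussed earlier) on hypothetical satisfaction ratings from the two page versions; all numbers below are made up:

```python
import random
import statistics

random.seed(1)  # reproducible

fast_page = [8, 9, 7, 9, 8, 9, 8, 7, 9, 8]  # ratings under low load time
slow_page = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6]  # ratings under high load time

observed = statistics.mean(fast_page) - statistics.mean(slow_page)

# Permutation test: shuffle the group labels and see how often a difference
# at least as large as the observed one arises by chance alone.
pooled = fast_page + slow_page
count = 0
for _ in range(5000):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:10]) - statistics.mean(pooled[10:])
    if diff >= observed:
        count += 1
p_value = count / 5000
```

A tiny p-value supports the hypothesis that lower load time raises satisfaction; a large one means the observed gap is consistent with noise.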

Q13. What is the difference between "long" ("tall") and "wide" format data?
Answer by Gregory Piatetsky.
In most data mining / data science applications there are many more records (rows)
than features (columns) - such data is sometimes called "tall" (or "long") data.
In some applications like genomics or bioinformatics you may have only a small
number of records (patients), eg 100, but perhaps 20,000 observations for each
patient. The standard methods that work for "tall" data will lead to overfitting the
data, so special approaches are needed.

Fig 13. Different approaches for tall data and wide data, from the presentation Sparse Screening for Exact Data Reduction, by Jieping Ye.
The problem is not just reshaping the data (there are useful R packages for that), but avoiding false positives by reducing the number of features to find the most relevant ones.
Approaches for feature reduction like Lasso are well covered in Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani, and Wainwright (you can download a free PDF of the book).


Q14. What method do you use to determine whether the statistics published in an article (or appearing in a newspaper or other media) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
A simple rule, suggested by Zack Lipton, is
if some statistics are published in a newspaper, then they are wrong.
Here is a more serious answer by Anmol Rajpurohit.
Every media organization has a target audience. This choice impacts a lot of
decisions such as which article to publish, how to phrase an article, what part of an
article to highlight, how to tell a given story, etc.
In determining the validity of statistics published in any article, one of the first steps
will be to examine the publishing agency and its target audience. Even if it is the
same news story involving statistics, you will notice that it will be published very
differently across Fox News vs. WSJ vs. ACM/IEEE journals. So, data scientists are
smart about where to get the news from (and how much to rely on the stories based
on sources!).

Fig 14a: Example of a very misleading bar chart that appeared on Fox
News

Fig 14b: how the same data should be presented objectively, from 5 Ways to
Avoid Being Fooled By Statistics
Often the authors try to hide the inadequacy of their research through canny storytelling and omitting important details to jump on to enticingly presented false insights. Thus, a rule of thumb to identify articles with misleading statistical inferences is to examine whether the article includes details on the research methodology followed and any perceived limitations of the choices made related to research methodology. Look for words such as "sample size", "margin of error", etc.
While there are no perfect answers as to what sample size or margin of error is
appropriate, these attributes must certainly be kept in mind while reading the end
results.

Another common case of erratic reporting is the situation when journalists with poor data education pick up an insight from one or two paragraphs of a published research paper, while ignoring the rest of the paper, just in order to make their point. So, here is how you can be smart enough to avoid being fooled by such articles:
Firstly, a reliable article must not have any unsubstantiated claims. All the assertions must be backed with references to past research. Otherwise, they must be clearly differentiated as "opinion" and not assertion. Secondly, just because an article refers to renowned research papers does not mean that it is using the insights from those papers appropriately. This can be validated by reading those referred research papers "in entirety" and independently judging their relevance to the article at hand. Lastly, though the end results might naturally seem like the most interesting part, it is often fatal to skip the details about research methodology (and thereby miss the errors, bias, etc.).
Ideally, I wish that all such articles publish their underlying research data as well as
the approach. That way, the articles can achieve genuine trust as everyone is free
to analyze the data and apply the research approach to see the results for
themselves.

Q15. Explain Edward Tufte's concept of "chart junk."


Answer by Gregory Piatetsky:
Chartjunk refers to all visual elements in charts and graphs that are not necessary
to comprehend the information represented on the graph, or that distract the viewer
from this information.
The term chartjunk was coined by Edward Tufte in his 1983 book The Visual Display
of Quantitative Information.

Fig 15. Tufte writes: "an unintentional Necker Illusion, as two back planes optically
flip to the front. Some pyramids conceal others; and one variable (stacked depth of
the stupid pyramids) has no label or scale."

Here is a more modern example from exceluser, where it is very hard to understand the column plot because of the workers and cranes that obscure the columns.
The problem with such decorations is that they force readers to work much harder than necessary to discover the meaning of the data.

Q16. How would you screen for outliers and what should you do if you find one?
Answer by Bhavya Geethika.
Some methods to screen for outliers are z-scores, modified z-scores, box plots, Grubbs' test, the Tietjen-Moore test, exponential smoothing, the Kimber test for exponential distributions, and the moving window filter algorithm. Two of the more robust methods, in detail:
Inter Quartile Range
An outlier is a point of data that lies over 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a given data set.

High = Q3 + 1.5 * IQR

Low = Q1 - 1.5 * IQR

Tukey Method
It uses interquartile range to filter very large or very small numbers. It is practically
the same method as above except that it uses the concept of "fences". The two
values of fences are:

Low outliers = Q1 - 1.5(Q3 - Q1) = Q1 - 1.5(IQR)

High outliers = Q3 + 1.5(Q3 - Q1) = Q3 + 1.5(IQR)

Anything outside of the fences is an outlier.
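The fences above translate directly into code; a minimal sketch with made-up data:

```python
import statistics

def iqr_outliers(data):
    """Flag points outside Tukey's fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

sample = [1, 2, 2, 3, 3, 3, 4, 4, 30]
print(iqr_outliers(sample))  # [30]
```

Here Q1 = 2 and Q3 = 4, so the fences are [-1, 7] and only the value 30 is flagged.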


When you find outliers, you should not remove them without a qualitative assessment, because by doing so you are altering the data and making it no longer pure. It is important to understand the context of the analysis, or, more importantly, the "why" question - why is an outlier different from the other data points?
This reason is critical. If outliers are attributed to error, you may throw them out, but if they signify a new trend or pattern, or reveal a valuable insight into the data, you should retain them.
Q17. How would you use either the extreme value theory, Monte Carlo
simulations or mathematical statistics (or anything else) to correctly
estimate the chance of a very rare event?
Answer by Matthew Mayo.
Extreme value theory (EVT) focuses on rare events and extremes, as opposed to classical approaches to statistics, which concentrate on average behaviors. EVT states that there are 3 types of distributions needed to model the extreme data points of a collection of random observations from some distribution: the Gumbel, Frechet, and Weibull distributions, also known as the Extreme Value Distributions (EVD) 1, 2, and 3, respectively.
EVT states that, if you were to generate N data sets from a given distribution, and then create a new dataset containing only the maximum values of these N data sets, this new dataset would only be accurately described by one of the EVD distributions: Gumbel, Frechet, or Weibull. The Generalized Extreme Value Distribution (GEV) then combines these 3 models into a single family.
Knowing which models to use for modeling our data, we can then fit them to our data and evaluate. Once the best-fitting model is found, analysis can be performed, including calculating probabilities of rare events.
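The Monte Carlo route mentioned in the question can be sketched in pure Python. Here we estimate a tail probability whose exact value is known, P(Z > 3) ≈ 0.00135 for a standard normal, so the simulated answer can be sanity-checked:

```python
import random

random.seed(42)  # reproducible

trials = 200_000
threshold = 3.0

# Draw standard normal samples and count how often the rare event occurs.
hits = sum(1 for _ in range(trials) if random.gauss(0, 1) > threshold)
p_hat = hits / trials
# p_hat should land near the exact tail probability, ~0.00135.
```

For genuinely rare events (probabilities of 1e-6 and below), naive simulation like this becomes infeasible - you would almost never observe a hit - which is exactly where EVT or variance-reduction techniques such as importance sampling come in.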

Q18. What is a recommendation engine? How does it work?
Answer by Gregory Piatetsky:
We are all familiar now with recommendations from Netflix ("Other movies you might enjoy") or from Amazon ("Customers who bought X also bought Y").
Such systems are called recommendation engines, or more broadly, recommender systems.
They typically produce recommendations in one of two ways: using collaborative or content-based filtering.
Collaborative filtering methods build a model based on users' past behavior (items previously purchased, movies viewed and rated, etc.) and use decisions made by current and other users. This model is then used to predict items (or ratings for items) that the user may be interested in.

Content-based filtering methods use features of an item to recommend
additional items with similar properties. These approaches are often combined in
Hybrid Recommender Systems.
Here is a comparison of these 2 approaches used in two popular music recommender systems - Last.fm and Pandora Radio (example from the Recommender System entry).

Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests. As this approach leverages the behavior of users, it is an example of a collaborative filtering technique.

Pandora uses the properties of a song or artist (a subset of the 400 attributes
provided by the Music Genome Project) in order to seed a "station" that plays
music with similar properties. User feedback is used to refine the station's
results, deemphasizing certain attributes when a user "dislikes" a particular
song and emphasizing other attributes when a user "likes" a song. This is an
example of a content-based approach.
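A toy sketch of user-based collaborative filtering: a made-up ratings dictionary, cosine similarity over co-rated items, and a recommendation taken from the most similar user's unseen items. All names and scores are hypothetical:

```python
import math

# Hypothetical user -> {item: rating} data.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "up": 1},
    "bob":   {"matrix": 5, "inception": 5, "coco": 4},
    "carol": {"up": 5, "coco": 5, "inception": 1},
}

def cosine(a, b):
    """Cosine similarity restricted to items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    num = sum(a[i] * b[i] for i in shared)
    den = (math.sqrt(sum(a[i] ** 2 for i in shared))
           * math.sqrt(sum(b[i] ** 2 for i in shared)))
    return num / den

def recommend(user):
    """Suggest the best unseen item from the most similar other user."""
    _, nearest = max((cosine(ratings[user], ratings[o]), o)
                     for o in ratings if o != user)
    unseen = set(ratings[nearest]) - set(ratings[user])
    return max(unseen, key=lambda i: ratings[nearest][i]) if unseen else None

print(recommend("alice"))  # alice's tastes match bob's -> "coco"
```

Production systems replace the nearest-neighbor step with matrix factorization or learned embeddings, but the core idea - similar users predict each other's preferences - is the same.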

Here is a good Introduction to Recommendation Engines by Dataconomy and an overview of building a Collaborative Filtering Recommendation Engine by Toptal. For the latest research on recommender systems, check the ACM RecSys conference.

Q19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
Answer by Gregory Piatetsky:
In binary classification (or medical testing), a false positive is when an algorithm (or test) indicates the presence of a condition when in reality it is absent. A false negative is when an algorithm (or test) indicates the absence of a condition when in reality it is present.
In statistical hypothesis testing, a false positive is also called a Type I error and a false negative a Type II error.

It is obviously very important to distinguish and treat false positives and false
negatives differently because the costs of such errors can be hugely different.
For example, if a test for a serious disease is a false positive (the test says disease, but the person is healthy), then an extra test will be made that will determine the correct diagnosis. However, if a test is a false negative (the test says healthy, but the person has the disease), then needed treatment will be withheld and the person may die as a result.

Q20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?

Answer by Gregory Piatetsky:


There are many good tools for Data Visualization. R, Python, Tableau and Excel are among the most commonly used by Data Scientists.
Here are useful KDnuggets resources:

Visualization and Data Mining Software

Overview of Python Visualization Tools

21 Essential Data Visualization Tools

Top 30 Social Network Analysis and Visualization Tools

Tag: Data Visualization

There are many ways of representing more than 2 dimensions in a chart. A 3rd dimension can be shown with a 3D scatter plot that can be rotated. You can use color, shading, shape, and size. Animation can be used effectively to show the time dimension (change over time).
Here is a good example.

Fig 20a: 5-dimensional scatter plot of Iris data, with size: sepal length; color:
sepal width; shape: class; x-column: petal length; y-column: petal width, from here.
For more than 5 dimensions, one approach is Parallel Coordinates, pioneered by
Alfred Inselberg.

Fig 20b: Iris data in parallel coordinates

See also

Quora: What's the best way to visualize high-dimensional data?

The pioneering work of Georges Grinstein and his colleagues on High-Dimensional Visualizations.