
Big Picture

1. Explain machine learning to a layperson.

Imagine a curious kid who sticks his palm over a candle flame and pulls back in a brief moment of sharp pain.

The next day, he comes across a hot stove top, seeing the red color and feeling the heat waves pulsing from it like the candle from the day before.

The kid has never touched a stove top, but fortunately, he has learned from previous data to avoid red things that pulse heat.

2. Which Python and R libraries have you used in the past?

Be prepared to list a few that you've used, and have example projects.

Popular Python libraries include scikit-learn, NumPy, Pandas, and matplotlib.

Popular R libraries include caret, caret's individual model dependencies, and ggplot.

3. What does it mean to "fit" a model? How do hyperparameters relate?

Fitting a model is the process of learning the parameters of a model using training data.

Parameters help define the mathematical formulas behind machine learning models.

However, there are also "higher-level" parameters that cannot be learned from the data,
called hyperparameters.
Hyperparameters define properties of the models, such as model complexity or learning
rate.
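As a minimal sketch with scikit-learn (illustrative values only): the regularization strength C below is a hyperparameter chosen before training, while the coefficients are the parameters learned by fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# C is a hyperparameter: set before training, not learned from the data.
model = LogisticRegression(C=1.0)

# "Fitting" learns the model's parameters from the training data.
model.fit(X, y)
print(model.coef_, model.intercept_)   # the learned parameters
```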

4. Explain the Bias-Variance tradeoff.

Predictive models have a tradeoff between bias (how far the model's average predictions are from the truth) and variance (how much the model changes when trained on different data).

Simpler models are stable (low variance) but they don't get close to the truth (high bias).

More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias).

The best model for a given problem usually lies somewhere in the middle.

5. Explain the relationship between priors and posteriors.

In Bayesian statistics, the prior is the probability distribution of a variable before collecting
data and the posterior is the updated probability distribution after collecting data.
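As a hedged illustration with a Beta-Binomial model (made-up counts): the Beta prior over a conversion rate is updated into a Beta posterior after observing data, which also previews the conjugacy question below.

```python
from scipy import stats

# Prior belief about a conversion rate: Beta(2, 2), centered on 0.5.
prior_a, prior_b = 2, 2

# Hypothetical observed data: 30 successes out of 100 trials.
successes, failures = 30, 70

# The Beta prior is conjugate to the Binomial likelihood,
# so the posterior is simply another Beta with updated counts.
posterior = stats.beta(prior_a + successes, prior_b + failures)
print(posterior.mean())   # posterior estimate of the conversion rate
```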

6. What is a conjugate distribution and why is it important in machine learning?

If the prior and posterior distributions belong to the same family, then they are conjugate
distributions.

Conjugate distributions allow dynamic statistical models that can be continuously updated with new data, such as multi-armed bandits.

7. What is supervised learning? Give an example.

Supervised learning tasks model a set of inputs against their labeled outputs with the goal
of being able to make predictions.

Supervised learning encompasses all of what we commonly think of as "predictive modeling."

Examples include classification and regression.

8. How do you decide between a list of supervised learning models?

There's no single best algorithm before seeing the data (No Free Lunch Theorem).

Picking the right model is a combination of having a theoretical understanding of each model while trying many different ones and evaluating their performance metrics on a hold-out set of data.

The most important concept here is that you should always have a pure hold-out set of data that the models never see until you make your final choice.

9. What is unsupervised learning? Give an example.

Unsupervised learning tasks look for structure in only the inputs, as there are no labeled
outputs.

Examples include clustering and algorithmic feature engineering.

10. What is reinforcement learning? Give an example.

Reinforcement learning tasks interact with a dynamic environment to reach a certain goal.

Examples include chess AI and self-driving cars.

11. What are parametric models? Give an example.


Parametric models are those with a finite number of parameters.

To predict new data, you only need to know the parameters of the model.

Examples include linear regression, logistic regression, and linear SVMs.

12. What are non-parametric models? Give an example.

Non-parametric models are those with an unbounded number of parameters, allowing for
more flexibility.

To predict new data, you need to know the parameters of the model and the state of the
data that has been observed.

Examples include decision trees, k-nearest neighbors, and topic models using Latent Dirichlet Allocation.

13. What are generative models? Give an example.

Generative models describe how data is generated by modeling the probability distributions of the data.

In other words, they model the joint probability P(y, x).

Examples include Naive Bayes, Latent Dirichlet Allocation (LDA), and Hidden Markov
Models.

14. What are discriminative models? Give an example.

Discriminative models directly learn the dependence of an unobserved variable y on an observed variable x, or the conditional probability P(y | x).

Examples include logistic regression, neural networks, and random forests.

15. When are discriminative models more effective than generative ones, and vice-versa?

For tasks that do not require the joint distribution, such as regression and classification,
discriminative models tend to have better performance.

Generative models are more practical to update when dealing with nonstationary
distributions.

They also have unsupervised learning applications such as anomaly detection.

16. What is the curse of dimensionality?

The difficulty of searching through a solution space becomes much harder as you have more features (dimensions).

Consider the analogy of looking for a penny in a line vs. a field vs. a building.

The more dimensions you have, the higher volume of data you'll need.

17. Walk me through the process of building a predictive model.

There are many right answers, but they should at least include the following elements:

Clarifying business objective, data collection, exploratory analysis, train/test split, data
pre-processing, feature engineering, model tuning, and analysis.

18. Explain the No Free Lunch theorem and what it means for applied machine learning.

The NFL Theorem states that any two algorithms are equivalent when their performance is averaged across all possible problems.

In practice this means two things...

(1) You'll often just need to try a variety of algorithms because it's impossible to guess
which one will perform the best.

(2) You'll often only work with a subset of "all possible problems," and you can definitely
build intuition on which algorithms work better for the problems you tackle. For example,
random forests, XGBoost, and neural networks tend to win most Kaggle competitions.

Optimization

1. Explain the idea behind gradient descent in layman's terms.

Gradient descent is like a ball rolling down a valley, rolling in the steepest downhill direction at any moment.

If the valley is shaped like a bowl (convex function), then we are guaranteed to reach the global minimum.

That valley represents the loss function you're trying to minimize.

2. What is the difference between stochastic gradient descent (SGD) and gradient
descent (GD)?

Both algorithms are methods for finding a set of parameters that minimize a loss function
by evaluating parameters against data and then making adjustments.

In standard gradient descent, you'll evaluate all training samples for each set of
parameters. This is akin to taking big, slow steps toward the solution.

In stochastic gradient descent, you'll evaluate only 1 training sample for the set of
parameters before updating them. This is akin to taking small, quick steps toward the
solution.
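A rough NumPy sketch of the difference for linear regression (learning rate, epoch count, and data are arbitrary): batch GD makes one update per pass over all samples, while SGD makes one update per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w_gd, w_sgd, lr = np.zeros(3), np.zeros(3), 0.01

for epoch in range(50):
    # Batch GD: one big step using the gradient over the full dataset.
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= lr * grad

    # SGD: many small, noisy steps, one per training sample.
    for i in rng.permutation(len(y)):
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= lr * grad_i

print(w_gd, w_sgd)   # both should approach [2.0, -1.0, 0.5]
```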

3. When would you use GD over SGD, and vice-versa?

GD theoretically minimizes the error function better than SGD. However, SGD converges
much faster once the dataset becomes large.

That means GD is preferable for small datasets while SGD is preferable for larger ones.

In practice, however, SGD is used for most applications because it minimizes the error
function well enough while being much faster and more memory efficient for large
datasets.

4. Why would you use gradient descent for convex functions instead of
computing the closed-form solution?

Gradient descent is extremely easy to implement and it's computationally cheap. In many
big data applications, the bottleneck tends to be problem size (data and supporting
infrastructure) rather than the algorithm used for optimization.

5. Does gradient descent always reach the same solution?

No, because it might get stuck in a local minimum. It will only be guaranteed to reach the global minimum if the loss function is convex.

6. How should you decide on a cost (a.k.a loss or error) function for
machine learning?

The problem you're trying to solve determines the cost function, which often has a real-
world interpretation.

The cost is often represented as a cost for missed predictions plus a penalty for model
complexity (regularization term).

7. How would you choose between minimizing mean-squared-error (MSE) or mean-
absolute-error (MAE) as your error metric?

Minimizing MSE finds the mean while minimizing MAE finds the median. Minimizing MSE
is better for avoiding very large errors while MAE is more robust to outliers. In addition,
MSE is computationally easier because it has continuous derivatives.
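A quick numerical check of the mean-vs-median claim using NumPy (made-up data with one outlier), minimizing each metric over constant predictions:

```python
import numpy as np

y = np.array([10, 11, 12, 13, 50])           # one outlier at 50
candidates = np.linspace(0, 60, 6001)        # candidate constant predictions

mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(mse)], y.mean())       # MSE minimizer is the mean (19.2)
print(candidates[np.argmin(mae)], np.median(y))   # MAE minimizer is the median (12.0)
```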

8. How can you perform logistic regression on a large dataset in a memory-based environment like R?

You can use stochastic gradient descent, which has a small memory footprint. This is
because stochastic gradient descent only needs to calculate one data point at a time.

9. Explain the log-loss function.

Log-loss is a loss function for classification models, and it's used when the model outputs a probability for each class.

It measures the cross-entropy between the true labels and the predicted probabilities, which captures how surprising the true labels are given the model's predictions.
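A small sketch of binary log-loss with NumPy (hypothetical labels and probabilities), checked against scikit-learn's built-in version:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])           # true labels (hypothetical)
p = np.array([0.9, 0.2, 0.6, 0.4])        # predicted P(y = 1)

# Binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual, log_loss(y_true, p))        # the two values should match
```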

10. What are the advantages and disadvantages of search-based methods (e.g. genetic algorithms) vs. gradient-based methods?

Search-based methods don't require the optimized function to be differentiable.

They are also better at dealing with loss functions that have many local extrema.

Their disadvantage is that they are generally much slower to converge than gradient
methods when the requirements for gradient methods are met.

Data Preprocessing
1. Which data preprocessing steps are almost always useful?

1. Normalizing (center and scaling) features to bring them to the same scale.

2. Handling missing values either through dropping samples with missing values or
through imputing them.

3. Encoding categorical values.
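A hedged sketch of these three steps as a scikit-learn pipeline (the column names and data are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]    # hypothetical numeric columns
categorical = ["city"]         # hypothetical categorical column

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # handle missing values
        ("scale", StandardScaler()),                   # center and scale
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # encode categories
])

df = pd.DataFrame({"age": [25, None, 40],
                   "income": [50000, 60000, None],
                   "city": ["NY", "SF", "NY"]})
X = preprocess.fit_transform(df)
```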

2. What is the Box-Cox transformation used for?

The Box-Cox transformation is a type of "power transformation" that transforms data to make the distribution more normal.

It's used to stabilize the variance (eliminate heteroskedasticity) and normalize the
distribution.

3. What is the advantage of the Box-Cox transformation over other transformations?

It's the generalized power transformation, which means it's flexible and can empirically
find the best transformation.

For example, when its lambda parameter is 0, it's equivalent to the log-transformation.
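For illustration with SciPy (assuming strictly positive data, which Box-Cox requires), the lambda parameter can be fit empirically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # skewed, strictly positive data

# boxcox searches for the lambda that makes the result most normal-looking.
x_transformed, lam = stats.boxcox(x)
print(lam)   # a lambda near 0 means the best fit is close to a log-transform
```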

4. What are 3 ways to deal with missing data?

1. Drop samples that have missing data for the features in your model.

2. Impute the missing data.

3. Use models that are robust to missing data, such as mixed effects models.

5. What are 3 different methods of data imputation?

1. Center imputation (using mean, median, or mode).

2. k-nearest neighbors imputation.

3. Bagged tree imputation.

6. What is one major flaw of imputation methods?

Imputation introduces bias towards existing data, which could make the model less
generalizable.

If your dataset is large enough, consider dropping missing data instead of imputing it.

7. What are 3 data preprocessing techniques to handle outliers?

1. Winsorize (cap at threshold).

2. Transform to reduce skew (using Box-Cox or similar).

3. Remove outliers if you're certain they are anomalies or measurement errors.

8. What are 3 ways of reducing dimensionality?

1. Removing collinear features.

2. Performing PCA, ICA, or other forms of algorithmic dimensionality reduction.

3. Combining features with feature engineering.

9. What is multicollinearity?

Multicollinearity occurs when your features are highly correlated with one another.

For example, if you had both height in meters and height in centimeters as features, you'd have multicollinearity.

10. Which ML algorithms are more robust to multicollinearity?

Models with regularization or built-in feature selection tend to perform better because they
do not overvalue the collinear features.

11. Should you remove collinearity before or after performing PCA?

You should remove collinearity before running PCA because correlated variables inflate the variances of the principal components.

12. What is one-hot encoding?

One-hot encoding transforms a categorical feature of k classes into k numeric features by creating one indicator (0 or 1) feature per class.
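A quick illustration of both encodings with pandas (made-up data); label encoding is covered in the next question:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# One-hot: one 0/1 indicator column per class.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: a single integer column with values 0 to k-1.
labels = df["color"].astype("category").cat.codes

print(one_hot)
print(labels)
```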

13. What is label encoding?

Label encoding transforms a categorical feature of k classes into 1 numeric feature of k unique values (0 to k-1).

14. Why is it necessary to perform one-hot or label encoding?

Most ML algorithms cannot handle categorical data.

One-hot encoding is especially prevalent because many learning algorithms use only a single weight per feature, so it often offers performance boosts over label encoding for many model types.

15. When is it appropriate to use label encoding over one-hot encoding?

If the feature only has 2 classes or if it's ordinal in nature (e.g. highest level of education).

16. What are 3 ways to get around memory constraints if you need to train a model on a local machine?

1. You can down-sample the data

2. Reduce dimensionality using PCA

3. Select a model that you can train in batches


Sampling and Splitting
1. Why is it important to split your data into training, validation, and test sets?

It's important for ensuring the model is generalizable, meaning it can apply to future,
unseen data.

The training set is meant for learning the model parameters.

The validation set is meant to help you tune hyperparameters.

The test set gives you a reliable estimation of model performance on unseen data.

2. How much data should you allocate for your training, validation, and test sets?

You have to find a balance, and there's no right answer for every problem.

If your test set is too small, you'll have an unreliable estimation of model performance
(performance statistic will have high variance).

If your training set is too small, your actual model parameters will have high variance.

A good rule of thumb is to use an 80/20 train/test split.

Then your train set can be further split into train/validation or into partitions for cross-
validation.
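As a sketch with scikit-learn (an 80/20 train/test split, then a further split of the training portion for validation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% as a test set that is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remaining data again to get a validation set for tuning.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)   # 0.25 * 0.8 = 0.2
```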

3. If you split your data into train/test splits, is it still possible to overfit your model?

Yes, it's definitely possible.

One common beginner mistake is re-tuning a model or training new models with different parameters after seeing its performance on the test set.

In this case, it's the model selection process that causes the overfitting.

The test set should not be tainted until you're ready to make your final selection.

4. How should you split your dataset to tune models on time series data?

You should use walk-forward (step-forward) validation: pick k different cutoff timestamps, train on the data before each cutoff, then evaluate on the data after it.

5. How should you sample your data into train and test sets for
classification problems?

You should sample through stratified sampling so the outcome distribution in the training
set is representative of the outcome distribution in the test set.

6. What's the purpose of cross-validation?

Cross-validation helps you measure the performance of one model on different sets of
data.
In practice, a common use of CV is to help you tune model hyperparameters while
reducing the risk of overfitting.

7. Walk me through k-fold cross validation.

First, split the data into k subsets (folds), and then train the model on k-1 folds while using
the last fold to evaluate the model.

Repeat this k times, leaving each fold out once.
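A minimal sketch with scikit-learn (5 folds and an arbitrary model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())   # performance averaged over the 5 held-out folds
```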

8. What's the relationship between LOOCV and k-fold CV?

LOOCV, or leave-one-out cross-validation, is essentially k-fold CV except where k is equal to n, the size of your entire sample.

9. What is bootstrap resampling (or "bootstrapping" for short)?

Bootstrapping is the process of resampling from your sample with replacement.

It's used in random forests and other bagging methods.


10. What is class imbalance and why is it troublesome for machine learning?

Class imbalance occurs when you have a much higher proportion of certain classes than
others.

For example, if you have a credit card fraud detector, and only 1% of transactions are
fraudulent, then you have a problem with class imbalance.

It becomes tricky to train machine learning models because they can achieve high
accuracy by labeling every case as non-fraudulent (the majority class).

11. What are 3 ways to address class imbalance?

1. Up-sample the minority class.

2. Down-sample the majority class.

3. Alter the cost function to have larger penalties for Type 2 errors (false negatives).
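A hedged sketch of two of these options with scikit-learn utilities (hypothetical imbalanced data); up-sampling uses resample, and class_weight is one way to alter the cost function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)   # heavily imbalanced labels (hypothetical)

# (1) Up-sample the minority class by resampling it with replacement.
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, n_samples=950,
                              replace=True, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])

# (3) Or keep the data as-is and penalize false negatives more heavily.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)
```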

Supervised Learning

1. What are the advantages and disadvantages of decision trees?


Decision trees are easy to interpret, non-parametric (which means they are robust to outliers), and there are relatively few parameters to tune.

Decision trees are prone to be overfit. However, this can be addressed by ensemble methods like random forests or boosted trees.

2. What are the advantages and disadvantages of logistic regression?

They have a nice probabilistic interpretation, there are many ways to regularize the models, and they are fast and scalable with online gradient descent.

They do not perform as well when the relationships in the data are not linear.

3. What are the advantages and disadvantages of support vector machines (SVMs)?

They have high accuracy, nice theoretical guarantees against overfitting, and they can work even if the data isn't linearly separable if you choose the right kernel. SVMs are especially popular in text classification problems.

Choosing the appropriate hyperparameters and kernel can be tricky and prone to overfitting.

4. What are the advantages and disadvantages of Naive Bayes?

Naive Bayes is simple to implement and you need less training data if the conditional
independence assumption holds. It also performs surprisingly well in practice, even if the
independence assumption doesn't hold.

Their simple representation doesn't allow the flexibility to solve some problems.

5. What are the advantages and disadvantages of random forests?

They tend to perform very well in many practical applications while being relatively easy to
implement. In addition, they have built in feature selection and regularization.

They can take longer to train, and they have higher memory requirements.

Random forests should almost always be tried, and they can set the benchmark as a well-
performing, yet relatively easy out-of-the-box algorithm.

6. What is regularization?

Regularization artificially discourages complex models in order to reduce overfitting. It involves penalizing the loss function for additional complexity and limiting the flexibility of the model. Different models are regularized differently.

7. What is the difference between Lasso and Ridge regression?

Lasso (L1) regression performs variable selection and parameter shrinkage. Ridge (L2)
only performs parameter shrinkage.
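A brief sketch with scikit-learn (arbitrary alpha values) showing that L1 tends to drive some coefficients exactly to zero while L2 only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))   # Lasso zeroes out uninformative features
print(np.sum(ridge.coef_ == 0))   # Ridge shrinks coefficients but rarely to exactly zero
```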

8. What are the advantages and disadvantages of neural networks?

Neural networks (specifically deep NNs) have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn.

However, they require a large amount of training data to converge. It's also difficult to pick the right architecture, and the internal "hidden" layers are incomprehensible.

9. What are the advantages and disadvantages of k-nearest neighbors?

k-Nearest Neighbors have a nice intuitive explanation, and they tend to work very well for problems where comparables are inherently indicative. For example, you could build a kNN housing price model by modeling on other houses in the area with a similar number of bedrooms, floor space, etc.

They are memory-intensive. They also do not have built-in feature selection or regularization, so they do not handle high dimensionality well.

10. How can you choose a classifier based on training set size?

If training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform
better because they are less likely to be overfit. If training set is large, low bias / high
variance models (e.g. Logistic Regression) tend to perform better because they can reflect
more complex relationships.

11. Which types of models are more robust to outliers?

Tree-based (non-parametric) models are more robust to outliers than regression-based (parametric) models.

12. Why might it be preferable to include fewer predictors in your model?

If the model doesn't have regularization or built-in feature selection, then you can alleviate overfitting by including fewer predictors.

13. What is linear regression and how does least squares relate to it?

Linear regression is fitting a straight hyperplane through a set of points. Least squares is a
way of performing linear regression by minimizing the squared errors of predictions.

14. What is your favorite classification algorithm? Can you explain it to me?

Here's where practice with implementing algorithms from scratch really helps! Try to be as
concise as possible. You can also review pseudo-code of common algorithms.

15. Explain both parts of the naive Bayes classifier name... what makes it naive
and what makes it Bayesian?

Naive Bayes assumes all input features are independent of one another given the class. It gets the "naive" label because it doesn't account for interactions and dependencies between features.

The Bayes part of the name comes from Bayes' Rule, a key concept in conditional probability.

16. In general, when do regression models outperform decision tree models?


Regression models tend to perform better when there is linearity in the data (such as with
time series data). In addition, decision trees can only create orthogonal decision
boundaries, which can be restrictive for some datasets.

17. Name 3 types of supervised learning models that have built-in feature selection.

1. Stepwise regression

2. Lasso regression

3. Random Forests

18. How do decision trees decide how to split the data?

They iterate through different splitting options and evaluate the resulting Gini Index or
Node Entropy. Both metrics attempt to measure the "purity" of the resulting nodes.
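A tiny sketch of how Gini impurity scores a candidate split (hypothetical class counts):

```python
import numpy as np

def gini(counts):
    """Gini impurity for a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

# A candidate split sends 40 samples left and 60 right (hypothetical counts).
left, right = [35, 5], [10, 50]
weighted = (40 * gini(left) + 60 * gini(right)) / 100
print(gini([45, 55]), weighted)   # parent impurity vs. impurity after the split
```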

19. Which types of models work better when the number of features is larger than
the size of your dataset?

Models that penalize complexity and have built-in feature selection tend to work better for
high-dimensionality datasets.

20. What is a convex hull in the context of SVMs?

The convex hull represents the outer boundaries of different groups when the data are
linearly separable. SVMs attempt to maximize the distance between convex hulls.

Unsupervised Learning
1. What is the difference between cluster analysis and factor analysis?

Cluster analysis is for grouping cases by similarity along the feature-set. An example is k-
means clustering.

Factor analysis is for grouping features using linear combinations. An example is PCA.

Both are common applications of unsupervised learning.

2. Explain Principal Component Analysis (PCA).

PCA is a method for transforming features in a dataset by combining them into uncorrelated linear combinations.

These new features, or principal components, sequentially maximize the variance represented (i.e. the first principal component has the most variance, the second principal component has the second most, and so on).

As a result, PCA is useful for dimensionality reduction because you can set an arbitrary variance cutoff.
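A short sketch with scikit-learn (keeping enough components to explain 95% of the variance, an arbitrary cutoff), which also shows the normalization step from the next question:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)   # normalize before PCA

pca = PCA(n_components=0.95)                   # keep 95% of the total variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_, X_reduced.shape)
```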

3. What types of data-preprocessing should always be performed before PCA?

You should always normalize (center and scale) your data first. Otherwise, features with larger scales will dominate the principal components and skew the results.

4. What is the difference between PCA and ICA (Independent Component Analysis)?

PCA finds linear combinations of the features that are uncorrelated while ICA finds linear
combinations of the features that are independent.

The classic example is the "cocktail party problem" where you are trying to find the
independent audio streams from the different participants.

5. Walk me through the process of k-means clustering.

1. K-means starts by randomly initializing k centroids.

2. Each data point is assigned to its closest centroid.

3. Each centroid is recalculated as the mean of its cluster.

4. Steps (2) and (3) are repeated until the clusters are stable.
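A compact NumPy sketch of these steps (random data, k = 3, a fixed number of iterations, and ignoring the rare empty-cluster edge case):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
k = 3

# 1. Randomly initialize k centroids from the data points.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(20):   # in practice, loop until assignments stop changing
    # 2. Assign each point to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3. Recalculate each centroid as the mean of its cluster.
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```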

6. Explain anomaly detection.

Anomaly detection is the process of fitting a distribution to data and then detecting future
outliers (anomalies) based on their statistical likelihood of occurring.

7. Explain Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter.

LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words.

The "Dirichlet" distribution is simply a distribution of distributions. In LDA, documents are distributions of topics that are distributions of words.

8. What are hierarchical cluster models? Give an example.

Hierarchical (or connectivity) cluster models are distance-based models that represent
clusters using dendrograms.

They do not provide a single partition of the dataset, but instead produce a hierarchy of
clusters that merge at certain distances.

An example is single-linkage clustering.

9. What are centroid cluster models? Give an example.

Centroid cluster models form clusters around centroids that may not necessarily be
members of the dataset.

An example is k-means clustering.


10. What are distribution cluster models? Give an example.

Distribution cluster models group objects that are most likely produced by the same
underlying distribution.

An example is the Gaussian mixture model.

Model Evaluation
1. What is a confusion matrix?

It's a performance output for binary classification that shows the number of samples in
each predicted class vs. actual class.

The confusion matrix shows True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
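A small sketch with scikit-learn (hypothetical labels); precision and recall, covered next, fall straight out of the four cells:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # of everything predicted positive, how much was correct
recall = tp / (tp + fn)      # of all actual positives, how much was found
print(tn, fp, fn, tp, precision, recall)
```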

2. What's the difference between precision and recall?

Precision is the ratio of true positives to all selected (predicted positive) elements. Recall is the ratio of true positives to all actual positives.

3. What's the difference between Type 1 and Type 2 error?

Type 1 errors are false positives (null true, but rejected).

Type 2 errors are false negatives (null false, but failed to reject).

4. Are Type 1 or Type 2 errors more acceptable? Give examples.

It all depends on the application and the cost associated with each error.

For example, Type 1 (false positive) errors are more acceptable for disease detection if
going by the "better-safe-than-sorry" motto.

On the other hand, Type 2 (false negative) errors may be more acceptable in marketing
campaigns in which false leads are extremely costly to pursue.

5. Is more data always better?


Larger datasets afford simple models more power, making it easier to classify outliers and
identify the underlying distribution.

The only real downside to collecting more data is the expense of doing so. Note that this does not mean models and theory are unimportant.

6. Pick one: better data vs. better algorithms.

In general, better data > better algorithms.

Feature engineering also goes a long way.

Note that better data does not always mean more data... sometimes it could even mean
less (e.g. data cleaning and outlier removal)!

7. What happens when the distribution of the test data is significantly different
than that of the training data?

The model becomes inaccurate on test data.

This is called dataset shift, and it could be caused by sample selection bias or non-
stationary environments (e.g. seasonality).

8. What is the ROC Curve and what is AUC (a.k.a. AUROC)?

The ROC (receiver operating characteristic) curve is the performance plot for binary classifiers of True Positive Rate (y-axis) vs. False Positive Rate (x-axis).

AUC is the area under the ROC curve, and it's a common performance metric for evaluating binary classification models.

It's equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
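A minimal sketch with scikit-learn (hypothetical labels and scores); note that AUC is computed from predicted probabilities or scores, not hard class labels:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model-predicted P(y = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))       # area under the TPR-vs-FPR curve
```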

9. Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of-
sample evaluation metric?

AUROC is robust to class imbalance, unlike raw accuracy.

For example, if you want to detect a type of cancer that's prevalent in only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free.

10. What is calibration and why is it important?

Calibration is whether a prediction made with X% confidence is correct about X% of the time.

This tells us whether we can trust the probabilities computed by a model.

11. If you built a classification model that has 90% accuracy when detecting a
certain type of fraud, do you have a good model?

Not necessarily, because of class imbalance.

You'll need to calculate other performance metrics as well.

12. How can you detect overfitting based on train and test set performance metrics?

If the model performs well on the training set but poorly on the
test set, then the model is likely overfit.

13. How can you detect underfitting based on train and test set performance
metrics?

If the model performs poorly on the training set, even as more data are introduced, then
the model may be underfit.

14. You are building a trading algorithm and find that stock prices for a
particular company are correlated with rainfall in Japan. Should you trade on
this signal?

Unlikely. Correlation does not imply causation, and you should always test your model on
out-of-sample data first.

15. What is the relationship between correlation and covariance?

Correlation is covariance standardized to the range from -1 to 1, making it possible to compare across features.

16. In regression models, is a higher R-squared always better?

Not necessarily, because standard R-squared always increases as you add more features
into the model.

It's better to evaluate performance metrics on a hold-out test set instead.


Ensemble Learning
1. How do ensemble methods work?

Ensembles are meta-models that use the outputs of individual models as their inputs.

Think of them as making predictions by committee.

2. Why are ensemble methods superior to individual models?

They average out biases, reduce variance, and are less likely to overfit.

There's a common line in machine learning which is: "ensemble and get 2%."

This implies that you can build your models as usual and typically expect a small performance boost from ensembling.

3. Explain bagging.

Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through resampling.

Then, each subset is used to train a model, and the final predictions are made through voting or averaging the component models.

Bagging is performed in parallel.
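A short sketch with scikit-learn (100 bootstrapped decision trees, arbitrary settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 100 trees is trained on a bootstrap resample of the data,
# and predictions are made by majority vote across the trees.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```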


4. Explain stacking.

Stacking is an ensemble method in which several models are trained on your original data.

Then, a meta-model (commonly a logistic regression) is used to aggregate their predictions.

5. Explain boosting.

Boosting is an ensemble method in which a model is iteratively improved by weighing misclassifications more heavily in subsequent iterations.

This process is continued until a stopping criterion is met, and it's performed sequentially.

6. What is a "weak" learner? Give an example.

Weak learners are predictors that perform relatively poorly, but still better than random guessing.

They are also often computationally simple.

One example is a decision stump, or a 1-level decision tree.

7. What is a "strong" learner? Give an example.

Strong learners are predictors that perform relatively well.

Many popular ML algorithms are considered strong learners, such as SVMs, random
forests, and logistic regression.

8. How can you turn "weak" learners into "strong" learners? (hint: it's not yelling
at them to hit the gym)

Weak learners can be turned into strong ones through boosting.

As long as you can consistently beat random guessing with the weak learner, any boosting algorithm will work.

9. In general, which perform better: ensembles of similar models or ensembles of different models?

Ensembles tend to perform better when the individual models are uncorrelated.

10. Is a random forest an ensemble?

Yes, random forests are bagged decision trees that are built through bootstrapping and
limiting the choice of features for each tree.

11. Explain how bagging improves performance by lowering variance.

Bagging trains many models on bootstrap resamples of your original dataset.

Averaging (or voting over) these models cancels out much of their individual variance, so the ensemble's predictions are more stable than any single model's.


Business Applications
1. What are some key business metrics for (S-a-a-S startup | Retail bank | e-
Commerce site)?

Thinking about key business metrics, often shortened to KPIs (Key Performance Indicators), is an essential part of a data scientist's job. Here are a few examples, but you should practice brainstorming your own.

Tip: When in doubt, start with the easier question of "how does this business make
money?"

• S-a-a-S startup: Customer lifetime value, new accounts, account lifetime, churn rate,
usage rate, social share rate

• Retail bank: Offline leads, online leads, new accounts (segmented by account type), risk
factors, product affinities

• e-Commerce: Product sales, average cart value, cart abandonment rate, email leads,
conversion rate

2. How would you build a model to predict customer churn for a S-a-a-S product?

Here's a simplified run-down of the process:

First, start by clarifying the business objective. In this case, you're trying to predict existing
customers who might be likely to cancel their subscriptions. Therefore, this is a supervised
learning task.

Next, consider which outcome metric you'd model against. In this case, it would be past
customers who cancelled their subscriptions.

You may also wish to implement a cutoff point to keep the problem manageable (for
example, those who canceled their subscriptions within 3 months from joining).

Then, collect the data you'll need for feature engineering and modeling. Get creative here.

Finally, try different ML methods, build the model, and validate it against hold-out data.

Afterwards, communicate your results with business leaders and iterate as needed.

3. How can you help our marketing team be more efficient?

The answer will depend on the type of company. Here are some examples.

Clustering algorithms to build custom customer segments for each type of marketing
campaign.

Natural language processing for headlines to predict performance before running ad spend.

Predict conversion probability based on a user's website behavior in order to create better
retargeting campaigns.

4. Explain to senior executives why data is important.

Data helps your company make better decisions.

Rather than relying on only the knowledge and experience from individuals, you can
leverage observations from across your entire enterprise.

5. Design a spam filter using supervised learning.

For supervised learning, you'd need a set of training emails that were labeled as spam or
not spam.

From there, you can use a variety of different NLP techniques to extract features from the
text, such as bag-of-words.

You can also extract features from email metadata, such as the from-address and email headers.

Once you have your features, you could fit a naive Bayes classifier to make predictions.

6. Design a spam filter using unsupervised learning.

You could use unsupervised learning if you had a set of training emails but no spam / not spam labels.

You would start by engineering the same types of features as you would for supervised
learning.

You could then use clustering algorithms to group the emails into clusters. Afterward, you
would hand label particular clusters as spam or not spam.

7. What are multi-armed bandits used for?

Multi-armed bandits address the challenge of maximizing the reward from a finite number of attempts when you are presented with various alternatives of different hidden expected values.

They attempt to balance exploration (searching for new solutions) and exploitation
(cashing in on proven solutions).

One classic use case in business is maximizing the effectiveness of an advertising campaign by trying to find the best performing ad creative.

8. How would you approach feature engineering for a domain you were
unfamiliar with?

Feature engineering is often one of the highest ROI activities in machine learning, so you
should be very comfortable with it, even if you don't know the domain well. Here are 4 thought-starters:

1. Speak with domain experts and try to quantify or systematize their recommendations
from experience.

2. Perform ample exploratory data analysis to see big-picture patterns.

3. Consider segmenting blunt groups of data.

4. Consider aggregating sparse features.

9. What are some ways you can help our business?

You'll want to prepare a custom answer for each company. Look at the previous
applications for inspiration.
