
Dealing with Missing Data in Python

If enough records are missing entries, any analysis you perform will be skewed and the results of the
analysis weighted in an unpredictable manner. Having a strategy for dealing with missing data is
important.

It’s essential to find missing data in your dataset to avoid getting incorrect results from your analysis.
The following code shows how you could obtain a listing of missing values without too much effort.

import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
print(s.isnull())
print()
print(s[s.isnull()])

Use the isnull() method to detect the missing values. The output shows True when the value is missing.
By using that Boolean output as an index into the Series, you obtain just the entries that are missing.

After you figure out that your dataset is missing information, you need to consider what to do about it.
The three possibilities are to ignore the issue, fill in the missing items, or remove (drop) the missing
entries from the dataset.

import pandas as pd
import numpy as np
s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
print(s.fillna(int(s.mean())))
print()
print(s.dropna())

The two methods of interest are fillna(), which fills in the missing entries, and dropna(), which drops the
missing entries. When using fillna(), you must provide a value to use for the missing data. This example
uses the mean of all the values, but you could choose many other approaches.
Interpreting Data Description
The results of the describe() method show 8 numbers for each numeric column in your original dataset. The
first number, the count, shows how many rows have non-missing values.

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be
collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

The second value is the mean, which is the average. Under that, std is the standard deviation,
which measures how numerically spread out the values are.

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to
highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find
a number that is bigger than 25% of the values and smaller than 75% of the values. That is
the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously,
and the max is the largest number.
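
As a sketch, assuming your data is in a pandas DataFrame called home_data (the file name below is a placeholder for your own dataset), this summary comes from the describe() method:

import pandas as pd

# hypothetical file name; replace with the path to your own dataset
home_data = pd.read_csv('train.csv')

# count, mean, std, min, 25%, 50%, 75% and max for each numeric column
print(home_data.describe())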

Building Your Model


You will use the scikit-learn library to create your models. When coding, this library is written
as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for
modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

 Define: What type of model will it be? A decision tree? Some other type of model? Some
other parameters of the model type are specified too.
 Fit: Capture patterns from provided data. This is the heart of modeling.
 Predict: Just what it sounds like
 Evaluate: Determine how accurate the model's predictions are.
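
As a minimal sketch with scikit-learn (X and y here stand for a feature DataFrame and a target column you have already prepared; the names are just placeholders):

from sklearn.tree import DecisionTreeRegressor

# Define: choose the model type and its parameters
model = DecisionTreeRegressor(random_state=1)

# Fit: capture patterns from the training data
model.fit(X, y)

# Predict: estimate the target for some rows of data
predictions = model.predict(X)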

What is Model Validation


You've built a model. But how good is it?

You'll need to answer this question for almost every model you ever build. In most (though not
necessarily all) applications, the relevant measure of model quality is predictive accuracy. In other
words, will the model's predictions be close to what actually happens?

Some people try answering this problem by making predictions with their training data. They
compare those predictions to the actual target values in the training data.
There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute
Error (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is:


error = actual − predicted

So, if a house cost $150,000 and you predicted it would cost $100,000 the error is $50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive
number. We then take the average of those absolute errors. This is our measure of model quality. In
plain English, it can be said as

On average, our predictions are off by about X
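
As a sketch, assuming a fitted model and the same placeholder X and y as above, MAE can be computed with scikit-learn's mean_absolute_error:

from sklearn.metrics import mean_absolute_error

predicted_prices = model.predict(X)
# average absolute difference between actual and predicted values
print(mean_absolute_error(y, predicted_prices))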

The Problem with "In-Sample" Scores


The measure we just computed can be called an "in-sample" score. We used a single set of houses
(called a data sample) for both building the model and for calculating its MAE score. This is bad.

Imagine that, in the large real estate market, door color is unrelated to home price. However, in the
sample of data you used to build the model, it may be that all homes with green doors were very
expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and
it will always predict high prices for homes with green doors.

Since this pattern was originally derived from the training data, the model will appear accurate in the
training data.

But this pattern likely won't hold when the model sees new data, and the model would be very
inaccurate (and cost us lots of money) when we applied it to our real estate business.

Even a model capturing only happenstance relationships in the data, relationships that will not be
repeated when it sees new data, can appear to be very accurate on in-sample accuracy measurements.

Models' practical value comes from making predictions on new data, so we should measure
performance on data that wasn't used to build the model. The most straightforward way to do this is
to exclude some data from the model-building process, and then use those to test the model's
accuracy on data it hasn't seen before. This data is called validation data.
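
A sketch of that split using scikit-learn's train_test_split (again with the placeholder X and y):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# split features and target into training and validation pieces
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# score the model on data it has not seen during fitting
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))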

Underfitting, Overfitting and Model Optimization


When a tree is very deep, with many leaves, it may match the training data almost perfectly but do
poorly in validation and other new data; this is called overfitting. On the flip side, if we make our tree
very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses.
Resulting predictions may be far off for most houses, even in the training data (and it will be bad in
validation too for the same reason). When a model fails to capture important distinctions and
patterns in the data, so it performs poorly even in training data, that is called underfitting.
Since we care about accuracy on new data, which we estimate from our validation data, we want to
find the sweet spot between underfitting and overfitting. Visually, this is the low point of the
validation curve when validation error is plotted against model complexity (such as tree depth).

There are a few alternatives for controlling the tree depth, and many allow for some routes through
the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very
sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the
more we move from the underfitting side of that curve toward the overfitting side.
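
A sketch of comparing a few candidate values of max_leaf_nodes, using the train/validation split from earlier (the helper get_mae is a hypothetical convenience function):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # fit a tree with the given number of leaves and score it on the validation data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)

for max_leaf_nodes in [5, 50, 500, 5000]:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t Mean Absolute Error: %d" % (max_leaf_nodes, mae))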
Models can suffer from either:

 Overfitting: capturing spurious patterns that won't recur in the future, leading to less
accurate predictions, or
 Underfitting: failing to capture relevant patterns, again leading to less accurate
predictions.
We use validation data, which isn't used in model training, to measure a candidate model's
accuracy. This lets us try many candidate models and keep the best one.

Random Forests
Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because
each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree
with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and
overfitting. But, many models have clever ideas that can lead to better performance.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each
component tree. It generally has much better predictive accuracy than a single decision tree and it
works well with default parameters. If you keep modeling, you can learn about more models with even
better performance, but many of those are sensitive to getting the right parameters.
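
A minimal sketch, reusing the earlier train/validation split:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)

# a random forest often beats a single decision tree without any tuning
preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, preds))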

Handling Missing Values


There are many ways data can end up with missing values.

Python libraries represent missing numbers as nan which is short for "not a number". You can detect
which cells have missing values, and then count how many there are in each column with the
command:
print(data.isnull().sum())

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with
missing values. So you'll need to choose one of the strategies below.
A Simple Option: Drop Columns with Missing Values
If your data is in a DataFrame called original_data, you can drop columns with missing values.
One way to do that is
data_without_missing_values = original_data.dropna(axis=1)

In many cases, you'll have both a training dataset and a test dataset. You will want to drop the same
columns in both DataFrames. In that case, you would write
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

If those columns had useful information (in the places that were not missing), your model loses
access to this information when the column is dropped. Also, if your test data has missing values in
places where your training data did not, this will result in an error.

So it's usually not the best solution. However, it can be useful when most values in a
column are missing.

A Better Option: Imputation


Imputation fills in the missing value with some number. The imputed value won't be exactly right in
most cases, but it usually gives more accurate models than dropping the column entirely.

This is done with

from sklearn.preprocessing import Imputer

my_imputer = Imputer()

data_with_imputed_values = my_imputer.fit_transform(original_data)

The default behavior fills in the mean value for imputation. Statisticians have researched more
complex strategies, but those complex strategies typically give no benefit once you plug the results
into sophisticated machine learning models.
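
Note that Imputer comes from older scikit-learn releases; in current versions it has been removed, and the equivalent transformer is SimpleImputer (its default strategy is also the mean):

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)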

One (of many) nice things about Imputation is that it can be included in a scikit-learn Pipeline.
Pipelines simplify model building, model validation and model deployment.

An Extension To Imputation
Imputation is the standard approach, and it usually works well. However, imputed values may be
systematically above or below their actual values (which weren't collected in the dataset). Or rows
with missing values may be unique in some other way. In that case, your model would make better
predictions by considering which values were originally missing. Here's how it might look:
# make copy to avoid changing original data (when imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)

In some cases this approach will meaningfully improve results. In other cases, it doesn't help at all.

Using Categorical Data with One Hot Encoding


Categorical data is data that takes only a limited number of values.

For example, if people responded to a survey about which brand of car they owned, the
result would be categorical (because the answers would be things like Honda, Toyota, Ford, None,
etc.). Responses fall into a fixed set of categories.

You will get an error if you try to plug these variables into most machine learning models in Python
without "encoding" them first. Here we'll show the most popular method for encoding categorical
variables.

One-Hot Encoding : The Standard Approach for Categorical Data


One hot encoding is the most widespread approach, and it works very well unless your categorical
variable takes on a large number of values (i.e. you generally won't use it for variables taking more
than 15 different values. It'd be a poor choice in some cases with fewer values, though that varies.)

One hot encoding creates new (binary) columns, indicating the presence of each possible value from
the original data. Let's work through an example.

Pandas assigns a data type (called a dtype) to each column or Series.


Object indicates a column has text (there are other things it could theoretically be, but that's
unimportant for our purposes). It's most common to one-hot encode these "object" columns, since they
can't be plugged directly into most models. Pandas offers a convenient function called get_dummies
to get one-hot encodings. Call it like this:

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

Alternatively, you could have dropped the categoricals. To see how the approaches compare, we
can calculate the mean absolute error of models built with two alternative sets of predictors:

1. One-hot encoded categoricals as well as numeric predictors


2. Numerical predictors, where we drop categoricals.
One-hot encoding usually helps, but it varies on a case-by-case basis. In this case, there doesn't
appear to be any meaningful benefit from using the one-hot encoded variables.

Applying to Multiple Files


So far, you've one-hot-encoded your training data. What about when you have multiple files (e.g. a
test dataset, or some other data that you'd like to make predictions for)? Scikit-learn is sensitive to
the ordering of columns, so if the training dataset and test datasets get misaligned, your results will
be nonsense. This could happen if a categorical had a different number of values in the training data
vs the test data.
Ensure the test data is encoded in the same manner as the training data with the align command.
The align command makes sure the columns show up in the same order in both datasets (it uses
column names to identify which columns line up in each dataset.) The
argument join='left' specifies that we will do the equivalent of SQL's left join. That means, if
there are ever columns that show up in one dataset and not the other, we will keep exactly the
columns from our training data. The argument join='inner' would do what SQL databases call
an inner join, keeping only the columns showing up in both datasets. That's also a sensible choice.
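
As a sketch, assuming train_predictors and test_predictors hold the raw predictors for the two files:

import pandas as pd

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)

# line the columns up by name; join='left' keeps exactly the training columns
final_train, final_test = one_hot_encoded_training_predictors.align(
    one_hot_encoded_test_predictors, join='left', axis=1)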

Conclusion
The world is filled with categorical data. You will be a much more effective data scientist if you know
how to use this data. Here are resources that will be useful as you start doing more sophisticated
work with categorical data.

 Pipelines: Deploying models into production ready systems is a topic unto itself. While
one-hot encoding is still a great approach, your code will need to be built in an especially
robust way. Scikit-learn pipelines are a great tool for this. Scikit-learn offers a class for
one-hot encoding and this can be added to a Pipeline. Unfortunately, it doesn't handle text
or object values, which is a common use case.

 Applications To Text for Deep Learning: Keras and TensorFlow have functionality for
one-hot encoding, which is useful for working with text.
 Categoricals with Many Values: Scikit-learn's FeatureHasher uses the hashing trick to
store high-dimensional data. This will add some complexity to your modeling code.

What is XGBoost
XGBoost is the leading model for working with standard tabular data (the type of data you store in
Pandas DataFrames, as opposed to more exotic types of data like images and videos). XGBoost
models dominate many Kaggle competitions.

To reach peak accuracy, XGBoost models require more knowledge and model tuning than
techniques like Random Forest.

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm (scikit-learn has
another version of this algorithm, but XGBoost has some technical advantages.) What is Gradient
Boosted Decision Trees? We'll walk through the cycle step by step.
We go through cycles that repeatedly build new models and combine them into
an ensemble model. We start the cycle by calculating the errors for each observation in the dataset.
We then build a new model to predict those. We add predictions from this error-predicting model to
the "ensemble of models."
To make a prediction, we add the predictions from all previous models. We can use these
predictions to calculate new errors, build the next model, and add it to the ensemble.

There's one piece outside that cycle. We need some base prediction to start the cycle. In practice,
the initial predictions can be pretty naive. Even if its predictions are wildly inaccurate, subsequent
additions to the ensemble will address those errors.

Model Tuning
XGBoost has a few parameters that can dramatically affect your model's accuracy and training
speed. The first parameters you should understand are:

n_estimators and early_stopping_rounds


n_estimators specifies how many times to go through the modeling cycle described above.

In the underfitting vs overfitting graph, n_estimators moves you further to the right. Too low a value
causes underfitting, which is inaccurate predictions on both training data and new data. Too large a
value causes overfitting, which is accurate predictions on training data, but inaccurate predictions on
new data (which is what we care about). You can experiment with your dataset to find the ideal.
Typical values range from 100-1000, though this depends a lot on the learning rate discussed
below.
The argument early_stopping_rounds offers a way to automatically find the ideal value. Early
stopping causes the model to stop iterating when the validation score stops improving, even if we
aren't at the hard stop for n_estimators. It's smart to set a high value for n_estimators and then
use early_stopping_rounds to find the optimal time to stop iterating.
Since random chance sometimes causes a single round where validation scores don't improve, you
need to specify a number for how many rounds of straight deterioration to allow before
stopping. early_stopping_rounds = 5 is a reasonable value. Thus we stop after 5 straight rounds of
deteriorating validation scores.
When using early_stopping_rounds, you need to set aside some of your data for checking the
number of rounds to use. If you later want to fit a model with all of your data, set n_estimators to
whatever value you found to be optimal when run with early stopping.
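
A sketch of how this looks with XGBoost's scikit-learn wrapper (the placement of early_stopping_rounds has shifted between XGBoost versions, so treat the exact signature as illustrative; train_X, train_y, val_X, val_y are the placeholder splits from earlier):

from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=1000)
# stop adding trees after 5 straight rounds without improvement on the validation set
my_model.fit(train_X, train_y,
             early_stopping_rounds=5,
             eval_set=[(val_X, val_y)],
             verbose=False)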

learning_rate
Here's a subtle but important trick for better XGBoost models:

Instead of getting predictions by simply adding up the predictions from each component model, we
will multiply the predictions from each model by a small number before adding them in. This means
each tree we add to the ensemble helps us less. In practice, this reduces the model's propensity to
overfit.

So, you can use a higher value of n_estimators without overfitting. If you use early stopping, the
appropriate number of trees will be set automatically.

In general, a small learning rate (and large number of estimators) will yield more accurate XGBoost
models, though it will also take the model longer to train since it does more iterations through the
cycle.
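
For example, a sketch combining a small learning rate with a large n_estimators and early stopping (same caveats about the XGBoost version as above):

from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(train_X, train_y,
             early_stopping_rounds=5,
             eval_set=[(val_X, val_y)],
             verbose=False)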

What Are Partial Dependence Plots


Some people complain machine learning models are black boxes. These people will argue we
cannot see how these models are working on any given dataset, so we can neither extract insight
nor identify problems with the model.
By and large, people making this claim are unfamiliar with partial dependence plots. Partial
dependence plots show how each variable or predictor affects the model's predictions. This is useful
for questions like:

 How much of wage differences between men and women are due solely to gender, as
opposed to differences in education backgrounds or work experience?

 Controlling for house characteristics, what impact do longitude and latitude have on home
prices? To restate this, we want to understand how similarly sized houses would be priced
in different areas, even if the homes actually at these sites are different sizes.

 Are health differences between two groups due to differences in their diets, or due to other
factors?
If you are familiar with linear or logistic regression models, partial dependence plots can be
interpreted similarly to the coefficients in those models. But partial dependence plots can capture
more complex patterns from your data, and they can be used with any model. If you aren't familiar
with linear or logistic regressions, don't get caught up on that comparison.

Interpreting Partial Dependence Plots


The partial dependence plot is calculated only after the model has been fit. The model is fit on
real data. In that real data, houses in different parts of town may differ in myriad ways (different
ages, sizes, etc.)
Some tips related to plot_partial_dependence:

 The features are the column numbers from the X array or dataframe that you wish to have
plotted. This starts to look bad beyond 2 or 3 variables. You could make repeated calls to
plot 2 or 3 at a time.
 There are options to establish what points on the horizontal axis are plotted. The simplest
is grid_resolution which we use to determine how many different points are plotted. These
plots tend to look jagged as that value increases, because you will pick up lots of
randomness or noise in your model. It's best not to take the small or jagged fluctuations
too literally. Smaller values of grid_resolution smooth this out. It's also much less of an
issue for datasets with many rows.
 There is a function called partial_dependence to get the raw data making up this plot,
rather than making the visual plot itself. This is useful if you want to control how it is
visualized using a plotting package like Seaborn. With moderate effort, you could make
much nicer looking plots.
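
The exact import has moved between scikit-learn releases; as a hedged sketch, recent versions expose the plot through sklearn.inspection (my_model and X are a fitted model and its predictor data, and the feature indices are placeholders):

from sklearn.inspection import PartialDependenceDisplay

# plot partial dependence for the first and third predictor columns
PartialDependenceDisplay.from_estimator(my_model, X, features=[0, 2],
                                        grid_resolution=20)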

Partial dependence plots are a great way (though not the only way) to extract insights from complex
models. These can be incredibly powerful for communicating those insights to colleagues or non-
technical users.

There are a variety of opinions on how to interpret these plots when they come from non-
experimental data. Some claim you can conclude nothing about cause-and-effect relationships from
data unless it comes from experiments. Others are more positive about what can be learned from
non-experimental data (also called observational data). It's a divisive topic in the data science world,
beyond the scope of this tutorial.

However, most agree that these are useful to understand your model. Also, given the messiness of
most real-world data sources, it's also a good sanity check that your model is capturing realistic
patterns.

The plot_partial_dependence function is an easy way to get these plots, though the results aren't
visually beautiful. The partial_dependence function gives you the raw data, in case you want to
make presentation-quality graphs.
What Are Pipelines
Pipelines are a simple way to keep your data processing and modeling code organized. Specifically,
a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were
a single step.

Many data scientists hack together models without pipelines, but Pipelines have some important
benefits. Those include:

1. Cleaner Code: You won't need to keep track of your training (and validation) data at each
step of processing. Accounting for data at each step of processing can get messy. With a
pipeline, you don't need to manually keep track of each step.
2. Fewer Bugs: There are fewer opportunities to mis-apply a step or forget a pre-processing
step.
3. Easier to Productionize: It can be surprisingly hard to transition a model from a
prototype to something deployable at scale. We won't go into the many related concerns
here, but pipelines can help.
4. More Options For Model Testing: You will see an example in the next tutorial, which
covers cross-validation.

Understanding Pipelines
Most scikit-learn objects are either transformers or models.

Transformers are for pre-processing before modeling. The Imputer class (for filling in missing
values) is an example of a transformer. Over time, you will learn many more transformers, and you
will frequently use multiple transformers sequentially.

Models are used to make predictions. You will usually preprocess your data (with transformers)
before putting it in a model.

You can tell if an object is a transformer or a model by how you apply it. After fitting a transformer,
you apply it with the transform command. After fitting a model, you apply it with
the predict command. Your pipeline must start with transformer steps and end with a model. This is
what you'd want anyway.

Eventually you will want to apply more transformers and combine them more flexibly.
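
A minimal pipeline sketch, bundling an imputer (a transformer) with a random forest (a model); train_X, train_y and val_X are the placeholder splits from earlier:

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())

# the whole bundle behaves like a single model
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(val_X)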

What is Cross Validation


Machine learning is an iterative process.

You will face choices about predictive variables to use, what types of models to use, what arguments
to supply those models, etc. We make these choices in a data-driven way by measuring model
quality of various alternatives.

You've already learned to use train_test_split to split the data, so you can measure model
quality on the test data. Cross-validation extends this approach to model scoring (or "model
validation.") Compared to train_test_split, cross-validation gives you a more reliable measure
of your model's quality, though it takes longer to run.

The Shortcoming of Train-Test Split


Imagine you have a dataset with 5000 rows. The train_test_split function has an argument
for test_size that you can use to decide how many rows go to the training set and how many go
to the test set. The larger the test set, the more reliable your measures of model quality will be. At an
extreme, you could imagine having only 1 row of data in the test set. If you compare alternative
models, which one makes the best predictions on a single data point will be mostly a matter of luck.

You will typically keep about 20% as a test dataset. But even with 1000 rows in the test set, there's
some random chance in determining model scores. A model might do well on one set of 1000 rows,
even if it would be inaccurate on a different 1000 rows. The larger the test set, the less randomness
(aka "noise") there is in our measure of model quality.

But we can only get a large test set by removing data from our training data, and smaller training
datasets mean worse models. In fact, the ideal modeling decisions on a small dataset typically aren't
the best modeling decisions on large datasets.

The Cross-Validation Procedure


In cross-validation, we run our modeling process on different subsets of the data to get multiple
measures of model quality. For example, we could have 5 folds or experiments. We divide the data
into 5 pieces, each being 20% of the full dataset.
We run an experiment called experiment 1 which uses the first fold as a holdout set, and everything
else as training data. This gives us a measure of model quality based on a 20% holdout set, much
as we got from using the simple train-test split.
We then run a second experiment, where we hold out data from the second fold (using everything
except the 2nd fold for training the model.) This gives us a second estimate of model quality. We
repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is
used as a holdout at some point.

Returning to our example above from train-test split, if we have 5000 rows of data, we end up with a
measure of model quality based on 5000 rows of holdout (even if we don't use all 5000 rows
simultaneously).
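
A sketch of the scoring step with scikit-learn's cross_val_score, assuming the my_pipeline defined in the Pipelines section and the full placeholder X and y:

from sklearn.model_selection import cross_val_score

# scikit-learn reports negative MAE by convention; flip the sign to read it as an error
scores = cross_val_score(my_pipeline, X, y,
                         scoring='neg_mean_absolute_error', cv=5)
print('Mean Absolute Error: %.2f' % (-1 * scores.mean()))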

Trade-offs Between Cross-Validation and Train-Test Split


Cross-validation gives a more accurate measure of model quality, which is especially important if
you are making a lot of modeling decisions. However, it can take more time to run, because it
estimates models once for each fold. So it is doing more total work.

Given these tradeoffs, when should you use each approach? On small datasets, the extra
computational burden of running cross-validation isn't a big deal. These are also the problems where
model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you
should run cross-validation.

For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and
you may have enough data that there's little need to re-use some of it for holdout.
There's no simple threshold for what constitutes a large vs small dataset. If your model takes a
couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes
much longer to run, cross-validation may slow down your workflow more than it's worth.

Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If
each experiment gives the same results, train-test split is probably sufficient.

Data Leakage
Data leakage is one of the most important issues for a data scientist to understand. If you don't know
how to prevent it, leakage will come up frequently, and it will ruin your models in the most subtle and
dangerous ways. Specifically, leakage causes a model to look accurate until you start making
decisions with the model, and then the model becomes very inaccurate.

There are two main types of leakage: Leaky Predictors and Leaky Validation Strategies.

Leaky Predictors
This occurs when your predictors include data that will not be available at the time you make
predictions.
To prevent this type of data leakage, any variable updated (or created) after the target value is
realized should be excluded, because when we use this model to make new predictions, that data
won't be available to the model.

Leaky Validation Strategy


A much different type of leak occurs when you aren't careful to distinguish training data from
validation data. For example, this happens if you run preprocessing (like fitting the Imputer for
missing values) before calling train_test_split. Validation is meant to be a measure of how the model
does on data it hasn't considered before. You can corrupt this process in subtle ways if the validation
data affects the preprocessing behavior. The end result? Your model will get very good validation
scores, giving you great confidence in it, but perform poorly when you deploy it to make decisions.

Preventing Leaky Predictors


There is no single solution that universally prevents leaky predictors. It requires knowledge about
your data, case-specific inspection and common sense.

However, leaky predictors frequently have high statistical correlations to the target. So two tactics to
keep in mind:

 To screen for possible leaky predictors, look for columns that are statistically correlated to
your target.
 If you build a model and find it extremely accurate, you likely have a leakage problem.
Preventing Leaky Validation Strategies
If your validation is based on a simple train-test split, exclude the validation data from any type
of fitting, including the fitting of preprocessing steps. This is easier if you use scikit-learn Pipelines.
When using cross-validation, it's even more critical that you use pipelines and do your preprocessing
inside the pipeline.
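
As a sketch of the train-test split case, fit any preprocessing on the training rows only and merely transform the validation rows (SimpleImputer is used here purely for illustration):

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
# learn imputation values from the training data only
imputed_train_X = my_imputer.fit_transform(train_X)
# apply (but do not re-fit) those values to the validation data
imputed_val_X = my_imputer.transform(val_X)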
