
Intro to Forecasting

Michael Bailey
Economist, Data Scientist @Facebook

"Trying to predict the future is like trying to drive down a country
road at night with no lights while looking out the back window."
Peter Drucker

Successful Forecasts

Source: 538 Blog NY Times

Successful Forecasts

Source: Baseball-Almanac


Failed Forecasts
"I predict the Internet will soon go spectacularly supernova and in 1996
catastrophically collapse." Robert Metcalfe, founder of 3Com and
inventor of Ethernet, writing in a 1995 InfoWorld column.

"Television won't be able to hold on to any market it captures after the
first six months. People will soon get tired of staring at a plywood box
every night." Darryl Zanuck, 20th Century Fox, 1946

"We will never make a 32 bit operating system." Bill Gates

Outline
What methods should be used to construct a forecast?
What distinguishes a good forecast from a bad one? What are the best
practices and common mistakes of forecasters?
How can I learn more?

Example: P2P Lending


Peer-to-Peer (P2P) lending networks match borrowers who want to borrow
money with peer lenders who are willing to loan it to them at a premium.
Because peers are matched directly, borrowers face lower interest rates
than those offered by banks or credit cards, and lenders can earn higher
returns than those offered by banks or bonds.
These networks are highly transparent, releasing a plethora of data
about potential borrowers. The two largest networks, Lending Club
and Prosper, continuously release their lending data to the public:
LendingClub Downloads
Prosper Downloads

Example: P2P Lending


How many loans will Lending Club and Prosper make next month? In the next
3 months? Next year?

Qualitative Methods
Ad-hoc / make stuff up: used more often than you would think, often for
new products/markets where there isn't much data available.
Delphi Method: iterative forecasts by a room of experts. The panel sees
the results from the previous round and reforecasts. This method has
several problems: bias from outliers, group psychology effects like
herding, etc.

Quantitative Methods
Time Series: predict future values based upon past values. Some models
include other regressors (ARMAX), but usually the forecast is based solely
upon observed values of the response.

Naïve Forecast: this period's value is equal to last period's value
(ARIMA(0,1,0)).
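A minimal sketch of the naive forecast in Python, assuming the series is a plain list of numbers. The function name `naive_forecast` and the sample loan counts are made up for illustration and do not come from the lending data.

```python
def naive_forecast(series, horizon=1):
    """Repeat the last observed value for each of the next `horizon` periods."""
    if not series:
        raise ValueError("series must contain at least one observation")
    return [series[-1]] * horizon


# Hypothetical monthly loan counts (illustrative numbers only).
loans = [120, 135, 150, 160]
forecast = naive_forecast(loans, horizon=3)
print(forecast)
```

Because it needs no fitting, the naive forecast is a handy baseline: a more elaborate model should at least beat it on held-out data.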

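Moving Average: this period's predicted response is equal to the
average of the past n periods' responses. This is a simple moving average
of order n (not to be confused with the MA(q) term of an ARIMA model,
which is built from past forecast errors).
Useful for smoothing out noise to see trends in the data.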
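A short Python sketch of the moving average idea, assuming a plain list of numbers. The helper names `moving_average_forecast` and `rolling_mean` are hypothetical, chosen for this example.

```python
def moving_average_forecast(series, n):
    """Forecast the next period as the mean of the last n observations."""
    if n <= 0 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    window = series[-n:]
    return sum(window) / n


def rolling_mean(series, n):
    """Smoothed series: the mean of each run of n consecutive observations."""
    if n <= 0 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    return [sum(series[i - n:i]) / n for i in range(n, len(series) + 1)]


# Illustrative data only.
values = [2, 4, 6, 8]
print(moving_average_forecast(values, 2))
print(rolling_mean(values, 2))
```

A larger n smooths more aggressively but reacts more slowly to real changes in level.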


The key to time series analysis is to transform the data into a stable
(stationary) time series:
(1) Mean is nearly constant.
Transformation: take differences (diff() function in R).
(2) Volatility is nearly constant.
Transformation: take logs or powers. The Box-Cox family of
transformations flexibly covers both:
Y = (y^lambda - 1)/lambda for lambda != 0, and Y = log(y) for lambda = 0
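The two transformations can be sketched in plain Python as follows. This assumes the standard Box-Cox form, (y^lambda - 1)/lambda with log(y) at lambda = 0, and strictly positive data. The function names are illustrative and are not the R functions.

```python
import math


def difference(series, lag=1):
    """Return y(t) - y(t - lag); the result is `lag` elements shorter."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def box_cox(series, lam):
    """Standard Box-Cox transform; requires strictly positive values."""
    if any(y <= 0 for y in series):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(y) for y in series]
    return [(y ** lam - 1) / lam for y in series]


# Made-up series growing 10% per period: the mean and the spread both drift.
growth = [100.0, 110.0, 121.0, 133.1]
# Logs (lambda = 0) followed by one difference leave a roughly constant series.
stable = difference(box_cox(growth, 0))
print(stable)
```

Applying `difference` twice gives second differences, which is the "several differences" case.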

Ahhhh, stability at last! Note that taking several differences of the
data is not at all uncommon.

Once you have a stable time series, you can forecast forward using
exponential smoothing (ES) models. Moving averages are a close relative:
both forecast with a weighted average of past observations.
ES models take in all past data, but put more weight on recent
observations and exponentially less on older ones when predicting the
next period.
One ES model is Holt-Winters (HoltWinters() in R), which selects its
smoothing parameters automatically.
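A minimal sketch of simple exponential smoothing (level only, no trend or seasonal terms) in Python. The grid search over alpha is an assumption made for illustration: it picks the value that minimises squared one-step-ahead errors, which is in the spirit of, but not a reimplementation of, what HoltWinters() does.

```python
def ses_forecast(series, alpha):
    """One-step-ahead forecast: an exponentially weighted average of the past."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def one_step_sse(series, alpha):
    """Sum of squared one-step-ahead errors over the observed series."""
    level = series[0]
    sse = 0.0
    for y in series[1:]:
        sse += (y - level) ** 2
        level = alpha * y + (1 - alpha) * level
    return sse


def pick_alpha(series, grid=None):
    """Choose alpha from a grid by minimising one-step squared error."""
    grid = grid or [i / 20 for i in range(1, 21)]
    return min(grid, key=lambda a: one_step_sse(series, a))


# Illustrative data only.
data = [1, 2, 3, 4, 5, 6]
best = pick_alpha(data)
print(best, ses_forecast(data, best))
```

With alpha = 1 the forecast collapses to the last observed value, i.e. the naive forecast.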

Decomposition: Sometimes you'll want to decompose your time series into
trend and seasonal components. There are several algorithms to accomplish
this (see decompose() and stl() in R).
Let's see if there is any seasonality in the monthly lending data.
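A rough Python sketch of classical additive decomposition: the trend is a centered moving average, and the seasonal effect is the average detrended value for each position in the cycle. It is a simplified illustration under those assumptions, not the exact algorithm behind decompose() or stl(), and the test series is synthetic.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit at the edges."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: give the two end points half weight (a 2 x period MA).
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def additive_decompose(series, period):
    """Return (trend, seasonal) where seasonal has one effect per cycle position."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full cycles of data")
    trend = centered_moving_average(series, period)
    sums = [0.0] * period
    counts = [0] * period
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            sums[t % period] += y - tr
            counts[t % period] += 1
    raw = [s / c for s, c in zip(sums, counts)]
    mean_raw = sum(raw) / period
    # Center the effects so they sum to zero over one cycle.
    return trend, [r - mean_raw for r in raw]


# Synthetic quarterly series: linear trend plus a known seasonal pattern.
pattern = [3.0, -1.0, -2.0, 0.0]
series = [10 + 2 * t + pattern[t % 4] for t in range(16)]
trend, seasonal = additive_decompose(series, 4)
print(seasonal)
```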

ES models assume no autocorrelation, that is, no correlation between the
errors of successive y values.
The correct model might need to take into account substantial
correlation, for example the true model generating the data could be:
y(t) = y(t-1) + e(t)
Today's value is yesterday's value plus some error. This is an
autoregressive model of order 1 (here with a coefficient of 1 on y(t-1),
i.e. a random walk).
Use lag.plot() or acf() in R to see the autocorrelation structure.
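For readers outside R, the sample autocorrelation at a given lag can be sketched in a few lines of Python. This uses the usual estimator (lagged covariance divided by the lag-0 sum of squares). The function name and the alternating test series are illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series with itself `lag` periods back."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# A perfectly alternating series is strongly negatively correlated at lag 1.
zigzag = [1, -1, 1, -1, 1, -1]
print([round(autocorrelation(zigzag, k), 3) for k in range(3)])
```

Values near zero at every lag above 0 are what white-noise errors should look like.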

There is a model that incorporates all the concepts of differences, MA,
and AR: the ARIMA(p, d, q) model.
ARIMA(1, 0, 0) = AR(1)
ARIMA(0, 0, 1) = MA(1)
d = number of times we difference the data to obtain stationarity.
I won't go into how to select the values of p and q; it requires learning
about the partial autocorrelation function and the autocorrelation
function (see references at end of slides).
Let's be lazy and use auto.arima() in R (from the forecast package) to
pick them for us.

ARIMA(1, 2, 1) predictions for Lending Club disbursements:

Month   Forecast
Feb     114,727,363
Mar     123,818,067
Apr     132,671,221
May     141,424,018
Jun     150,134,416
Jul     158,826,902

Make sure you are fitting your model using training data, and validating
your model using test data.
Two most common model validation metrics are:
Mean Absolute Error = mean(|error|)
Root Mean Squared Error = sqrt(mean(error^2))
There is vociferous debate in the forecasting journals about the best
metric to use; the right choice is very domain specific.
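Both metrics are easy to compute by hand. The sketch below is plain Python with illustrative numbers. The function names are assumptions for this example.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalises large misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up test-set values and forecasts.
actual = [3, 5, 7]
predicted = [2, 5, 10]
print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```

RMSE is never smaller than MAE on the same errors. A large gap between them signals a few big misses.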

What if you want to predict y|x?

For non-time series, a plethora of tools are available: multivariable
regression, information likelihood models, machine learning, neural
networks, etc.
For time series, any model that fits y|x used in a forecast will need
forecasted values of x to make the forecast. This is usually avoided in
practice unless very predictable x's are chosen.
Scenario Forecasting: fit a model of y|x and then pick different scenarios
for the evolution of x. Provide a forecast for each scenario.
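Scenario forecasting can be sketched with a one-regressor linear model. Everything here is an assumption for illustration: the straight-line model of y|x, the scenario names, and the numbers. A real application would use whatever y|x model fits the data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def scenario_forecasts(intercept, slope, scenarios):
    """Apply the fitted model to each assumed future path of x."""
    return {
        name: [intercept + slope * x for x in path]
        for name, path in scenarios.items()
    }


# Made-up history of a driver x and the response y.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# Hypothetical scenarios for how x evolves over the next two periods.
scenarios = {"flat": [5, 5], "growth": [6, 7]}
forecasts = scenario_forecasts(intercept, slope, scenarios)
print(forecasts)
```

The output is one forecast path per scenario, which makes the dependence on the assumed x explicit instead of hiding it inside a single number.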

Forecasting Process:
1) Define problem, Gather contextual knowledge
2) Plot like crazy to learn data structure (correlation matrices,
autocorrelation plots, etc.)
3) Split data into train/test set.
4) (Time Series) Transform the data to obtain stationarity.
5) Fit appropriate models, test that errors look like white noise.
6) Apply models to test data set, evaluate using error deviation metrics.
Make forecasts.
7) Re-evaluate next period.
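Steps 3 and 6 can be sketched as a chronological split followed by a walk-forward evaluation. This is a simplified illustration in plain Python. The two candidate forecasters and the data are stand-ins, not the models fitted in these slides.

```python
def chronological_split(series, test_size):
    """Hold out the most recent observations; never shuffle a time series."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def one_step_errors(train, test, forecaster):
    """Forecast one period ahead, record the error, then reveal the actual value."""
    history = list(train)
    errors = []
    for actual in test:
        errors.append(actual - forecaster(history))
        history.append(actual)
    return errors


def mae(errors):
    """Mean absolute error of a list of forecast errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Two simple candidate models, used here only as placeholders.
def last_value(history):
    return history[-1]


def mean_of_last_three(history):
    return sum(history[-3:]) / 3


data = list(range(1, 11))  # illustrative upward-trending series
train, test = chronological_split(data, test_size=3)
scores = {
    "last value": mae(one_step_errors(train, test, last_value)),
    "mean of last 3": mae(one_step_errors(train, test, mean_of_last_three)),
}
print(scores)
```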

Best Practices
Gather as much contextual knowledge (experts) as possible.

Best Practices
Embrace Uncertainty.


Best Practices
Avoid overfitting.
Always train your model and perform model selection on a subset of your
data.
Don't necessarily select the model that best fits your current data.
Don't necessarily select the model that best fits the next n periods.
Remember, you can always attain a perfect fit to the data with enough
parameters.

Best Practices
Make Lots of Forecasts and Calibrate.
Continually assess the probability statements of your model to see how
far they deviate from the truth.
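One simple calibration check, sketched in Python under the assumption that the model issues prediction intervals: if it claims 80% intervals, roughly 80% of realised values should land inside them over many forecasts. The function name and the numbers are illustrative.

```python
def interval_coverage(actuals, lowers, uppers):
    """Share of realised values that fell inside their prediction interval."""
    hits = sum(1 for a, lo, hi in zip(actuals, lowers, uppers) if lo <= a <= hi)
    return hits / len(actuals)


# Made-up outcomes against a fixed nominal 80% interval of [8, 12].
actuals = [10, 12, 9, 15, 11]
lowers = [8] * 5
uppers = [12] * 5
coverage = interval_coverage(actuals, lowers, uppers)
print(coverage)
```

Coverage well below the nominal level suggests overconfident intervals, and coverage well above it suggests intervals that are wider than they need to be.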

Best Practices
Beware the Lucas Critique.
When your forecast might affect the outcome, calibration is incredibly
difficult.

Platform Forecasting
At Facebook, we face the challenging problem of needing a platform
forecast.
Our revenue is dependent on our users using the site, and our advertisers
wanting to serve them ads. We control neither supply nor demand, and
thus we need to make forecasts for both.
We also need to understand how the composition of supply and demand
turns into revenue, which very much depends on the ads-serving
mechanisms and optimizations we employ, which are continuously
changing.
We use a combination of simulation techniques, experiments, machine
learning and cross-section models, and time series models to estimate
the demand and supply curves we are facing.

Resources
Don't google "forecasting"; instead search on methods (time series,
prediction market, neural networks, etc.)

R
Forecasting in R
R zoo package, useful for dates
R forecast package
A Little R Time Series Book

Python/Pandas
Time Series in Pandas
Time Series in Pandas (Video)
statsmodels

Texts
Forecasting Methods and Applications
Time Series Analysis
Forecasting: Principles and Practice
The Signal and the Noise: Why So Many Predictions Fail but Some Don't
