Data Visualization
Coding: Gender: 0 = female, 1 = male; Preference: 0 = regular beer, 1 = light beer.
[Scatterplots of Preference (0/1) against Gender, Married, and Income ($0-$100,000) omitted.]
Questions:
What can we learn from these plots?
Does the relationship look linear? Linearity requires a continuous DV, but here Preference is binary.
Problems with fitting OLS to a binary DV:
Residuals do not follow a normal distribution.
Residuals are heteroskedastic (non-constant error variance):
Var(ε) = E[Y | X] · (1 − E[Y | X]) = P(1 − P)
Fallacious predictions: Y has two discrete response values (0/1), but fitted Y values may lie outside the (0,1) bound.
How to Handle Binary Dependent Variables
Steps:
Transform the binary Y into a continuous probability function f(Y) constrained between 0 and 1, then run the regression: f(Y) = β₀ + β₁X + ε
Two popular probability models: logit (logistic regression) and probit models.
Logit and probit models:
Both models use an S-shaped cumulative distribution function (CDF) as f(Y) to restrict fitted values to fall between 0 and 1.
Probit uses the CDF of a standard normal distribution, and logit uses the CDF of a logistic
distribution (similar to standard normal distribution but with heavier tails).
The DV is not Y, but the logit/probit function f(Y).
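As a quick numeric illustration (a minimal Python sketch using only the standard library), both link functions squash any real-valued linear predictor into (0, 1); the logistic CDF also shows its heavier tails at extreme values:

```python
import math

def logistic_cdf(z):
    # Logit link: CDF of the standard logistic distribution
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    # Probit link: CDF of the standard normal distribution, via erf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both links map any real z into (0, 1)
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p_logit, p_probit = logistic_cdf(z), normal_cdf(z)
    assert 0.0 < p_logit < 1.0 and 0.0 < p_probit < 1.0
    print(f"z={z:+.1f}  logit P={p_logit:.6f}  probit P={p_probit:.6f}")
```

At z = −5 the logistic CDF is still around 0.0067 while the normal CDF is below 10⁻⁶, which is the "heavier tails" difference between the two distributions.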
Logit and Probit Models
Let P = Pr[Y=1].
In the beer example, P is the probability that a customer prefers light beer (Y=1).
Then, Pr[Y=0] = 1 − P.
1 − P is the probability that a customer prefers regular beer.
Odds (or odds ratio) of the event: w = P / (1 − P)
Logit model:
ln(P / (1 − P)) = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ

Probit model:
P(Y = 1) = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^(−t²/2) dt,
where z = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ

PDF of a standard normal distribution:
φ(x) = (1/√(2π)) e^(−x²/2)
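A small Python sketch (with a hypothetical coefficient and baseline value) of the algebra above: inverting the logit gives P = e^z / (1 + e^z), so the odds equal e^z, and a one-unit increase in a predictor multiplies the odds by e^β:

```python
import math

def p_from_logit(z):
    # Invert ln(P / (1 - P)) = z  =>  P = e^z / (1 + e^z)
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    # Odds of the event: w = P / (1 - P)
    return p / (1.0 - p)

# A one-unit increase in a predictor adds its coefficient beta to z,
# which multiplies the odds by exp(beta)
beta = 0.7   # hypothetical coefficient
z0 = -0.3    # hypothetical baseline linear predictor
ratio = odds(p_from_logit(z0 + beta)) / odds(p_from_logit(z0))
print(f"odds ratio = {ratio:.4f}, exp(beta) = {math.exp(beta):.4f}")
```

This is why logit coefficients are commonly reported as odds ratios: the multiplicative effect e^β does not depend on the baseline z₀.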
Logit vs. Probit Models
Logit (logistic regression):
More popular in health sciences (e.g., epidemiology).
Coefficients are easier to interpret, in terms of odds ratios.
Susceptible to heteroskedasticity (non-constant error variances).
Probit:
More popular in advanced econometrics and political science.
Robust to heteroskedasticity (because we use the standard normal PDF).
Probit is an integral function; no easy interpretation is possible.
Odds vs Probabilities
Question:
You plan to celebrate your birthday on a randomly chosen day of the week. What is the
probability that this day is a Sunday? What are the odds?
A. Probability and odds are both 1/7
B. Probability = 1/7, but Odds=1/6
C. Probability = 1/6, but Odds = 1/7
For more on the difference, see http://en.wikipedia.org/wiki/Odds
Probability: number of successes / number of (successes + failures) = 1/7
Odds: number of successes / number of failures = 1/6
Odds against are often read as 6 to 1, 6-1, or 6/1.
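The birthday example can be checked in a couple of lines of Python:

```python
# Probability vs. odds for "birthday falls on a Sunday":
# 1 favorable day (success), 6 unfavorable days (failures)
successes, failures = 1, 6

probability = successes / (successes + failures)  # 1/7
odds = successes / failures                       # 1/6

print(f"probability = {probability:.4f}, odds = {odds:.4f}")
```

So answer B is correct: the probability is 1/7, but the odds are 1/6.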
Logit Model
Logit of an observation is modeled as a linear combination of predictor variables:
logit = β₀ + β₁·Gender + β₂·Married + β₃·Income + β₄·Age + ε
In R, you can fit a logistic regression, using the "binomial" family, as:
m <- glm(Preference ~ ., family=binomial(link="logit"), data=d)
Use link="probit" for probit models.
Coefficients:
             Estimate   Std. Error  z value  Pr(>|z|)
(Intercept) -6.819e-01   1.931e+00   -0.353     0.724
Gender      -7.779e-01   7.166e-01   -1.085     0.278
Married      1.697e-01   7.945e-01    0.214     0.831
Income       2.785e-04   6.334e-05    4.396  1.10e-05 ***
Age         -2.282e-01   5.239e-02   -4.357  1.32e-05 ***
Change in probability p for a unit increase in one predictor variable, while holding all other
predictors constant, is not constant, but depends on the specific values of all predictors.
Question: A 45-year-old single male earning $40,000/year has P[Light Beer] = 0.352.
Does this customer prefer light beer or regular beer?
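The question can be answered by plugging the printed coefficient estimates into the inverse logit. Because the estimates above are rounded as printed, this Python sketch gives P ≈ 0.36 rather than the slide's exact 0.352; it also illustrates the point that the marginal effect of a predictor depends on the values of all predictors:

```python
import math

# Coefficient estimates read off the fitted logit model (rounded as printed)
b0, b_gender, b_married, b_income, b_age = -0.6819, -0.7779, 0.1697, 2.785e-4, -0.2282

def p_light(gender, married, income, age):
    # Inverse logit: P = 1 / (1 + e^(-z))
    z = b0 + b_gender*gender + b_married*married + b_income*income + b_age*age
    return 1.0 / (1.0 + math.exp(-z))

# 45-year-old single male earning $40,000/year (Gender=1, Married=0)
p = p_light(gender=1, married=0, income=40000, age=45)
print(f"P[Light Beer] = {p:.3f}")  # ~0.36 with rounded coefficients; below 0.5

# The marginal effect of income is not constant: +$1,000 of income
# changes P by different amounts at different ages
d30 = p_light(1, 0, 41000, 30) - p_light(1, 0, 40000, 30)
d45 = p_light(1, 0, 41000, 45) - p_light(1, 0, 40000, 45)
print(f"dP at age 30: {d30:.4f}, at age 45: {d45:.4f}")
```

Since P[Light Beer] is below 0.5, this customer is classified as preferring regular beer.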
Logit and Probit Models in R
logit <- glm(Preference ~ ., family=binomial(link="logit"), data=d)
probit <- glm(Preference ~ ., family=binomial(link="probit"), data=d)
# Predicted probabilities
predols <- predict(ols, type="response"); summary(predols)
predlogit <- predict(logit, type="response"); summary(predlogit)
predprobit <- predict(probit, type="response"); summary(predprobit)
Assumptions of the Logistic Regression
Multivariate normality and homoskedasticity:
E[Pr(Y=1)] = P
Var[Pr(Y=1)] = P(1-P)
Hence, the logit model will not meet the multivariate normality and homoskedasticity assumptions (the variance depends on P).
Linearity:
Logit is a non-linear transformation.
However, the logistic model must still satisfy the other assumptions:
Independence.
No multicollinearity (use VIF tests).
No autocorrelation.
Classification and Cutoff Value
From the estimated logit model:
First, predict P = Pr[Y=1] for each observation.
Then, use a predetermined cutoff value to determine class affiliation.
What is a good cutoff value?
In the beer example, for a 45-year-old single male, earning $40,000/year, P[Light Beer] = 0.352. In
this case, 0.5 may be a reasonable cutoff value.
However, in other cases, the cutoff value may depend on your objective.
Objective 1: Target customers with the “right” beer ads (i.e. light beer ads for light beer drinkers).
Objective 2: Influence regular beer drinkers to drink the pricier light beer.
Would the cutoff value be the same for both objectives?
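Not necessarily. A sketch in Python: with the slide's customer at P = 0.352, a 0.5 cutoff excludes the customer from light-beer ads (Objective 1), while a lower, purely illustrative cutoff of 0.3 would include that same customer in a conversion campaign (Objective 2):

```python
def classify(p, cutoff=0.5):
    # Assign class 1 (light beer) when the predicted probability reaches the cutoff
    return 1 if p >= cutoff else 0

p = 0.352  # predicted P[Light Beer] for the slide's example customer

# Objective 1: balanced 0.5 cutoff -> class 0, target with regular-beer ads
print(classify(p, cutoff=0.5))

# Objective 2: a lower, hypothetical cutoff (0.3) casts a wider net
# over customers who might be influenced to switch -> class 1
print(classify(p, cutoff=0.3))
```

Lowering the cutoff trades more false positives for fewer missed prospects, which may be worthwhile when a conversion is more profitable than a wasted ad.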
Goodness of Fit of Classification Models
Classification models don't have residuals or R².
McFadden's Pseudo R²:
Compares the model of interest with a model where the RHS is constant:

Pseudo R²_McF = 1 − ln L̂(M_Full) / ln L̂(M_Intercept)

ln L̂(M_Full): log-likelihood of the model of interest.
ln L̂(M_Intercept): log-likelihood with all coefficients except the intercept restricted to zero.

In R:
library(rcompanion)
nagelkerke(logit)
nagelkerke(probit)

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.592800
Cox and Snell (ML)                   0.560359
Nagelkerke (Cragg and Uhler)         0.747145

$Likelihood.ratio.test
 Df.diff LogLik.diff Chisq    p.value
      -4      -41.09 82.18 6.0133e-17

$Number.of.observations
Model: 100   Null: 100
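The printed output can be reproduced from the formula. Assuming the intercept-only model's log-likelihood is 100·ln(0.5) ≈ −69.31 (a 50/50 class split over the 100 observations, which is consistent with LogLik.diff = −41.09), a short Python check:

```python
import math

# Log-likelihoods consistent with the printed nagelkerke() output:
# null (intercept-only) model under an assumed 50/50 split over n=100,
# and the full model recovered from LogLik.diff = -41.09
ll_null = 100 * math.log(0.5)   # ~ -69.31
ll_full = ll_null + 41.09       # ~ -28.22

mcfadden = 1.0 - ll_full / ll_null      # McFadden's Pseudo R^2
chisq = 2.0 * (ll_full - ll_null)       # likelihood-ratio test statistic

print(f"McFadden R2 = {mcfadden:.4f}")  # ~0.5928, matching the output
print(f"LR Chisq    = {chisq:.2f}")     # 82.18, matching the output
```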
Confusion matrix for the beer example (Preference: 0 = Regular, 1 = Light):

                        Actual: 0 (Regular)   Actual: 1 (Light)
Predicted: 0 (Regular)           44                   6
Predicted: 1 (Light)              6                  44

TP = 44, FN = 6, FP = 6, TN = 44
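From these four cell counts, the usual summary rates follow directly (a short Python sketch):

```python
# Cell counts from the confusion matrix above (positive class = light beer)
tp, fn, fp, tn = 44, 6, 6, 44
n = tp + fn + fp + tn

accuracy = (tp + tn) / n        # share of correct classifications
sensitivity = tp / (tp + fn)    # true positive rate: light-beer drinkers caught
specificity = tn / (tn + fp)    # true negative rate: regular-beer drinkers caught

print(accuracy, sensitivity, specificity)  # 0.88 0.88 0.88
```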
[Figures omitted: example decision and regression trees, including a classification tree splitting on weight and age, trees splitting on house attributes (Brick, SqFt, Bedrooms, Location, Salary, Children), and the classic play-tennis tree splitting on outlook and windy.]
Predictive Accuracy
Classification models work well if they can accurately predict future occurrences.
How to assess predictive accuracy:
Split your data into two random subsets: training set and holdout set.
Train (“estimate”) a model using the training set.
Fit the estimated model on the holdout set to assess predictive performance.
Question: Which of the two models below predicts better, i.e., which model has higher “lift”?
Logistic Regression              Classification Tree

             Actual                           Actual
Predicted   0    1               Predicted   0    1
    0      11    3                   0      16    6
    1       6   20                   1       1   17
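Comparing overall holdout accuracy (one simple notion of predictive performance; "lift" is often measured in other ways too) in Python:

```python
# Holdout confusion matrices from the slide: rows = predicted, cols = actual
logistic = [[11, 3],
            [6, 20]]
tree = [[16, 6],
        [1, 17]]

def accuracy(m):
    # Diagonal cells are correct classifications
    correct = m[0][0] + m[1][1]
    total = sum(sum(row) for row in m)
    return correct / total

print(accuracy(logistic))  # 31/40 = 0.775
print(accuracy(tree))      # 33/40 = 0.825 -> the tree predicts better here
```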
Important Note on OLS Regression
Questions:
Multiple or Adjusted R² in an OLS model is a measure of:
A. Predictive accuracy
B. Explanatory power
C. Fit between data and model
D. All of the above
What does high R2 mean? Can high R2 be problematic?
Use the same procedure as classification to assess predictive accuracy.
How to create training and holdout samples in R:
set.seed(43324)  # seed allows replication of this same sample later
# use 80% of the data for the training set and 20% for the holdout set;
# sample selection is without replacement
index <- sample(1:nrow(d), size=floor(0.8*nrow(d)), replace=FALSE)
training <- d[index, ]
holdout <- d[-index, ]
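The same 80/20 split without replacement, sketched in Python on a hypothetical set of 100 row indices:

```python
import random

random.seed(43324)  # seed allows replication of the same sample later

n = 100                      # hypothetical dataset of 100 rows
rows = list(range(n))

# sample 80% of the row indices without replacement for training
train_idx = set(random.sample(rows, k=int(0.8 * n)))
training = [r for r in rows if r in train_idx]
holdout  = [r for r in rows if r not in train_idx]

print(len(training), len(holdout))  # 80 20
```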
Key Takeaways
Classification models can be used to predict two or more discrete values (not continuous values).
Classification models may be based on statistical or algorithmic methods.
Statistical methods (logit, probit) are model based:
Allow insight into the relationship between input and output.
But interpretations are cumbersome (“logit”, “odds”).
May not handle complex interactions among X variables very well.
Data mining methods (decision tree, regression tree) are algorithm-based.
May not explain the relationships between X and Y (a “black box”).
Can be very intuitive.
Have a tendency to overfit the data.
Evaluate predictive ability using training and holdout data samples.