
Classification Models

ANOL BHATTACHERJEE, PH.D.


UNIVERSITY OF SOUTH FLORIDA
Outline
 Classification models
 Why OLS regression does not work
 Statistical models: logit, probit
 Concepts: probability, odds ratio
 Evaluation: confusion matrix, error rate, accuracy, specificity, true positives, …
 Plotting: ROC curves, lift curves, lift curves with costs & benefits
 Algorithmic models:
 Decision trees: classification trees, regression trees
 Comparison with logit models
 Predictive accuracy of models
Beer Example
 The problem:
 A beer-maker wants to know the characteristics that can distinguish light beer drinkers from regular beer drinkers, so that it can better target customers with appropriate coupons and promotions.
 Data:
 Sample: 100 customers.
 DV: Customer’s beer preference (1=light or 0=regular).
 IV: gender (0=female, 1=male), married (1=yes), income, and age.

Sample records:

Gender  Married  Income   Age  Preference
1       0        $20,945  42   Regular
1       0        $22,408  40   Regular
1       0        $23,234  31   Regular
1       1        $24,302  46   Regular
1       1        $25,440  51   Regular
1       0        $26,004  62   Regular
0       0        $26,186  38   Regular
0       0        $26,598  29   Light
0       0        $27,078  45   Regular
The Problem
plot(Preference ~ Income, data=d)
plot(Preference ~ Age, data=d)
plot(Preference ~ Gender, data=d)

Does it look linear? Linearity requires a continuous DV, but here Preference is binary.
Data Visualization
[Plots of Preference (0 = regular beer, 1 = light beer) against Gender (0 = female, 1 = male), Married (0 = unmarried, 1 = married), Income, and Age; N = 100]

Questions:
 What can we learn from these plots?
A. Male customers prefer light beer
B. Older people drink light beer
C. Customers with higher income prefer light beer
 What is the problem with this data?
A. It is not a random sample
B. We cannot use regression
C. There are too many dummy variables
Linear Probability Model
 When the DV is binary, the OLS model is called the linear probability model.

ols <- lm(Preference ~ Income + Married + Gender + Age, data=d)


summary(ols)
plot(ols)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.877e-01 1.866e-01 2.078 0.0404 *
Gender -4.586e-02 6.950e-02 -0.660 0.5109
Married 2.131e-02 7.569e-02 0.282 0.7789
Income 2.877e-05 3.770e-06 7.631 1.79e-11 ***
Age -2.323e-02 2.968e-03 -7.826 6.98e-12 ***

Residual standard error: 0.3343 on 95 degrees of freedom


Multiple R-squared: 0.5753, Adjusted R-squared: 0.5574
F-statistic: 32.17 on 4 and 95 DF, p-value: < 2.2e-16

Question: What do we learn from the above model?


Interpreting the Linear Probability Model
 Y follows a Bernoulli distribution with expected value P.
 Expected value of Y, given X is:
E(Y|X) = Pr[ Y=1 | X] = P
1 - E(Y|X) = Pr[ Y=0 | X] = 1 - P
 Interpretation of 𝛽:
𝛽 = ∆E(Y|X) / ∆X = ∆Pr[Y=1 | X] / ∆X
i.e., 𝛽 is the change in the probability that Y=1 for a unit change in X.
 Beer example:
 𝛽Married = 0.021 means that the probability of a married person choosing light beer is 0.021 higher than that of an otherwise identical single person.
 𝛽Age = -0.023 means that the probability of a person choosing light beer decreases by 0.023 for each one-year increase in age.
Problems with the Linear Probability Model
 Residuals are not normal:
 Error terms follow a Bernoulli distribution, not a normal distribution.
 Residuals are heteroskedastic (non-constant error variance):
 Var(𝜀) = E[Y | X] (1 - E[Y | X]) = P(1-P), which varies with X.
 Fallacious predictions:
 Y has two discrete response values (0/1), but fitted Y values may lie outside the (0,1) bound.

[Plot: Preference vs. Income with the fitted OLS line extending below 0 and above 1]
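The last problem is easy to verify directly from the fitted linear probability model; a minimal sketch, assuming the ols object and data frame d from the previous slide:

range(fitted(ols))                       # fitted "probabilities" from the OLS model
sum(fitted(ols) < 0 | fitted(ols) > 1)   # number of observations outside the (0,1) bound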
How to Handle Binary Dependent Variables
 Steps:
 Transform the binary Y into a continuous probability function f(Y) that is constrained between 0 and 1, and then run the regression: f(Y) = 𝛽0 + 𝛽1*x + 𝜀
 Two popular probability models: logit (logistic regression) and probit models.
 Logit and probit models:
 Both models use a S-shaped cumulative distribution function (CDF) as f(Y) to restrict fitted values to
fall between 0 and 1.
 Probit uses the CDF of a standard normal distribution, and logit uses the CDF of a logistic
distribution (similar to standard normal distribution but with heavier tails).
 The DV is not Y, but the logit/probit function f(Y).
Logit and Probit Models
 Let P = Pr[Y=1]
 In the beer example, P is the probability that a customer prefers light beer (Y=1).
 Then, Pr[Y=0] = 1-P.
 1-P is the probability that a customer prefers regular beer.
 Odds (or odds ratio) of the event: w = P / (1-P)
 Logit model:
ln[ P / (1-P) ] = 𝛽1 + 𝛽2*X2i + … + 𝛽k*Xki = z

 Probit model:
P(Y=1) = ∫ from -∞ to z of (1/√(2π)) e^(-t²/2) dt = Φ(z),
where z = 𝛽1 + 𝛽2*X2i + … + 𝛽k*Xki

 PDF of a normal distribution:
φ(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²))

 PDF of the standard normal distribution:
φ(x) = (1/√(2π)) e^(-x²/2)
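The two S-shaped CDFs can be compared in base R; a minimal sketch (no extra packages needed):

z <- seq(-4, 4, by=0.1)
plot(z, plogis(z), type="l", ylab="P(Y=1)")   # logistic CDF (logit link)
lines(z, pnorm(z), lty=2)                     # standard normal CDF (probit link)
legend("topleft", c("Logistic CDF", "Normal CDF"), lty=1:2)

The logistic curve shows the slightly heavier tails mentioned above.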
Logit vs. Probit Models
 Logit (logistic regression):
 More popular in health sciences (e.g., epidemiology).
 Coefficients easier to interpret in terms of odds ratios.
 Susceptible to heteroskedasticity (non-constant error
variances).
 Probit:
 More popular in advanced econometrics and political
science.
 Robust to heteroskedasticity (because we use standard
normal PDF).
 The probit link is defined through an integral (the normal CDF), so its coefficients have no easy interpretation.
Odds vs Probabilities
 Question:
 You plan to celebrate your birthday on a randomly chosen day of the week. What is the
probability that this day is a Sunday? What are the odds?
A. Probability and odds are both 1/7
B. Probability = 1/7, but Odds=1/6
C. Probability = 1/6, but Odds = 1/7
 For more on the difference, see http://en.wikipedia.org/wiki/Odds
 Probability: Number of success / Number of (successes + failures) = 1/7
 Odds: Number of successes / Number of failures = 1/6
 Odds of 1/6 are often read as 6 to 1 against (also written 6-1 or 6/1).
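The conversion between the two is a one-liner in R; a tiny illustration with the birthday example:

p <- 1/7                      # probability that the chosen day is a Sunday
odds <- p / (1 - p)           # = 1/6, read as "6 to 1 against"
p_back <- odds / (1 + odds)   # converting odds back to a probability = 1/7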
Logit Model
 Logit of an observation is modeled as a linear combination of predictor variables:
logit = 𝛽0 + 𝛽1*Gender + 𝛽2*Married + 𝛽3*Income + 𝛽4*Age + 𝜀
 In R, you can fit a logistic regression, using the “binomial” family, as:

m <- glm(Preference ~ ., family=binomial(link="logit"), data=d)
# Use link="probit" for probit models

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-2.55206 -0.39729  0.00092  0.28800  2.05923

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.819e-01  1.931e+00  -0.353    0.724
Gender      -7.779e-01  7.166e-01  -1.085    0.278
Married      1.697e-01  7.945e-01   0.214    0.831
Income       2.785e-04  6.334e-05   4.396 1.10e-05 ***
Age         -2.282e-01  5.239e-02  -4.357 1.32e-05 ***

Null deviance: 138.63 on 99 degrees of freedom
Residual deviance: 56.45 on 95 degrees of freedom
AIC: 66.45

[Plot: predicted probability of light beer (0 to 1) against the S-shaped fitted curve]
Interpreting the Odds Ratio
 Recall: ln(w) = 𝛽0 + 𝛽1*x + 𝜀 where w = P/(1-P) is the odds ratio
 Hence w = exp(𝛽0 + 𝛽1*x)
 Beer example:
ln(w) = -0.68 – 0.78*Gender + 0.17*Married + 0.0003*Income – 0.23*Age
 If income x3 increases by 1, but all other predictor variables remain the same, then
w(x1, x2, x3, x4) = exp (𝛽0 + 𝛽1x1 + 𝛽2x2 + 𝛽3x3 + 𝛽4 x4 + e)
w(x1, x2, x3+1, x4) = exp (𝛽0 + 𝛽1x1 + 𝛽2x2 + 𝛽3(x3+1) + 𝛽4 x4 + e)
w(x1, x2, x3+1, x4) / w(x1, x2, x3, x4) = exp(𝛽3)
 Regression coefficient:
 exp(𝛽3) is the proportional change in the odds of event Y=1 if predictor x3 increases by 1 unit and all other x’s remain the same.
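In R, the estimated odds multipliers (and their confidence intervals) come straight from the fitted model; a sketch assuming the logit object fitted above:

exp(coef(logit))      # multiplicative change in odds for a 1-unit increase in each predictor
exp(confint(logit))   # profile-likelihood confidence intervals on the odds scale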
Estimating Odds
 Interpreting the Income coefficient:
 𝛽Income = 0.0003
 If income increases by $1, the odds that a customer prefers light beer are multiplied by exp(0.0003) ≈ 1.0003 (a 0.03% increase), holding all other predictors in the model constant.
 Question:
 𝛽Age = -0.23
 For each additional year of age, the odds that a customer prefers regular beer…
A. Decrease by 23% C. Decrease by 79%
B. Decrease by 21% D. None of the above
Odds for Categorical Variables
 𝛽Gender = -0.78 ⇒ exp(𝛽Gender) = 0.46
 The odds of a male customer (Gender = 1/Male) preferring light beer are 0.46 times the odds of a female customer (Gender = 0/Female) of the same marital status, age, and income.
 𝛽Married = 0.17 ⇒ exp(𝛽Married) = 1.19
 The odds of a married customer preferring light beer are 1.19 times the odds of a single customer of the same gender, age, and income.
 But we are still talking about odds, rather than the probability of drinking light/regular beer.
Estimating Probabilities
 log(w) = log[P/(1-P)] = 𝛽0 + 𝛽1*x
 w = P/(1-P) = e(𝛽0 + 𝛽1*x)
 P = 1/(1 + e-(𝛽0 + 𝛽1*x) )
 Questions:
 What is the probability that a 45-year-old single male, earning $40,000/year prefers light beer?
 How would the above probability change if the same customer was 46 years old?
 What is the probability that a 55-year-old single male, earning $40,000/year prefers light beer?
 How would this probability change if the same male customer was 56 years old?
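These questions can be answered with predict() on a new data frame; a sketch, assuming the fitted logit model and the variable names used in this data set:

newcust <- data.frame(Gender=1, Married=0, Income=40000, Age=c(45, 46, 55, 56))
predict(logit, newdata=newcust, type="response")   # predicted P[light beer] for each profile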
Nonlinearity
P[Y=Light beer | 45-year-old, single male, earning $40,000/year]:
P[Light] = 1/(1+exp(0.60967)) = 0.352
P[Y=Light beer | 46-year-old, single male, earning $40,000/year]:
P[Light] = 1/(1+exp(0.83789)) = 0.302
∆P (Age: 45→46) = -0.050

P[Y=Light beer | 55-year-old, single male, earning $40,000/year]:
P[Light] = 1/(1+exp(2.89187)) = 0.052
P[Y=Light beer | 56-year-old, single male, earning $40,000/year]:
P[Light] = 1/(1+exp(3.12009)) = 0.042
∆P (Age: 55→56) = -0.010

Change in probability p for a unit increase in one predictor variable, while holding all other
predictors constant, is not constant, but depends on the specific values of all predictors.

Question: A 45-year-old single male earning $40,000/year has P[Light Beer] = 0.352.
Does this customer prefer light beer or regular beer?
Logit and Probit Models in R
logit <- glm(Preference ~ ., family=binomial(link="logit"), data=d)
probit <- glm(Preference ~ ., family=binomial(link="probit"), data=d)

ols$coef
logit$coef
probit$coef
exp(logit$coef) # Odds ratio

# Marginal effects in logit and probit
LogitScalar <- mean(dlogis(predict(logit, type="link")))
LogitScalar * coef(logit)
ProbitScalar <- mean(dnorm(predict(probit, type="link")))
ProbitScalar * coef(probit)

# Predicted probabilities
predols <- predict(ols, type="response"); summary(predols)
predlogit <- predict(logit, type="response"); summary(predlogit)
predprobit <- predict(probit, type="response"); summary(predprobit)

Coefficient comparison:

        Intercept  Gender  Married  Income     Age
OLS      0.388*    -0.046   0.021   0.0003***  -0.023***
Logit   -0.682     -0.778   0.170   0.0003***  -0.228***
Probit  -0.510     -0.417   0.046   0.0002***  -0.128***
Assumptions of the Logistic Regression
 Multivariate normality and homoskedasticity:
 E[Pr(Y=1)] = P
 Var[Pr(Y=1)] = P(1-P)
 Hence, the logit model will not meet the multivariate normality and homoskedasticity assumptions (and does not rely on them).
 Linearity:
 Logit is a non-linear transformation.
 However, the logistic model must still satisfy the other assumptions:
 Independence.
 No multicollinearity (use VIF tests).
 No autocorrelation.
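Multicollinearity among the predictors can be checked with variance inflation factors; a sketch assuming the car package is installed:

library(car)   # assumes the car package is available
vif(logit)     # VIF/GVIF values; large values (roughly > 5-10) suggest multicollinearity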
Classification and Cutoff Value
 From the estimated logit model:
 First, predict P = Pr[Y=1] for each observation.
 Then, use a predetermined cutoff value to determine class affiliation.
 What is a good cutoff value?
 In the beer example, for a 45-year-old single male, earning $40,000/year, P[Light Beer] = 0.352. In
this case, 0.5 may be a reasonable cutoff value.
 However, in other cases, the cutoff value may depend on your objective.
 Objective 1: Target customers with the “right” beer ads (i.e. light beer ads for light beer drinkers).
 Objective 2: Influence regular beer drinkers to drink the pricier light beer.
 Would the cutoff value be the same for both objectives?
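Applying a cutoff to the predicted probabilities takes one line; a sketch using a 0.5 cutoff (the cutoff itself is a modeling choice, as discussed above):

cutoff <- 0.5
pred_class <- ifelse(predict(logit, type="response") > cutoff, 1, 0)
table(Actual=d$Preference, Predicted=pred_class)   # class assignments at this cutoff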
Goodness of Fit of Classification Models
 Classification models don’t have residuals or R².
 McFadden’s pseudo R²:
 Compares the model of interest with a model whose RHS contains only a constant (intercept-only model).
 Likelihood of the model of interest vs. the likelihood with all coefficients except the intercept restricted to zero.
 Test:
Pseudo R²_McF = 1 - ln L̂(M_Full) / ln L̂(M_Intercept)
 Higher pseudo R² may indicate better fit.

library(rcompanion)
nagelkerke(logit)
nagelkerke(probit)

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.592800
Cox and Snell (ML)                   0.560359
Nagelkerke (Cragg and Uhler)         0.747145

$Likelihood.ratio.test
 Df.diff LogLik.diff Chisq    p.value
      -4      -41.09 82.18 6.0133e-17

$Number.of.observations
Model: 100   Null: 100
Accuracy of Classification Models
 Confusion (or classification) matrix:
 Comparison of model predictions vs. actual responses.

table(Preference, round(fitted(logit)))
Preference  0  1
         0 44  6
         1  6 44

table(Preference, round(fitted(probit)))
Preference  0  1
         0 44  6
         1  7 43

Confusion Matrix for Beer Example:

                        Predicted Light (Y=1)   Predicted Regular (Y=0)
Actual Light (Y=1)      TP = 44                 FN = 6
Actual Regular (Y=0)    FP = 6                  TN = 44

TP: True Positives, FP: False Positives, TN: True Negatives, FN: False Negatives

 Classification metrics:
 Error rate = (FP + FN) / (TP + FP + TN + FN)
 Accuracy = 1 - Error rate
 Questions:
 In the beer example, what is the overall accuracy of the logistic model?
 How does this model compare to a random (base) model?
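Accuracy and error rate can be computed directly from the R confusion matrix above; a minimal sketch:

cm <- table(d$Preference, round(fitted(logit)))   # rows = actual, columns = predicted
accuracy <- sum(diag(cm)) / sum(cm)               # (TP + TN) / total = (44 + 44) / 100
error_rate <- 1 - accuracy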
When One Class is More Important
 In many cases, it is more important to identify members of one class:
 Tax fraud, response to promotional offers, detection of malignant tumors.
 In such cases, we are willing to tolerate greater overall error, in return for
better identifying the important class for further attention.
 Accuracy metrics for the important class:
 Sensitivity:
 Also called the True Positive Rate.
 % of Class 1 (positives) correctly classified.
 Sensitivity = TP / (TP + FN)
 Specificity:
 Also called the True Negative Rate (1 minus the False Positive Rate).
 % of Class 0 (negatives) correctly classified.
 Specificity = TN / (TN + FP)

                 Predicted class 1   Predicted class 0
Actual class 1   TP                  FN
Actual class 0   FP                  TN

Question: Which is more important: sensitivity or specificity?
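With the same confusion matrix, both metrics follow directly; a sketch treating class 1 (light beer) as the important class:

cm <- table(Actual=d$Preference, Predicted=round(fitted(logit)))
sensitivity <- cm["1", "1"] / sum(cm["1", ])   # TP / (TP + FN)
specificity <- cm["0", "0"] / sum(cm["0", ])   # TN / (TN + FP)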
ROC Curve and AUC
 ROC Curve:
 Plot of the true positive rate vs. the false positive rate at different classification cutoffs.
 The diagonal is the baseline – a random classifier.
 The classifier that yields the highest “lift” (gain) above
the random classifier is the best classification model.
 AUC (Area Under the ROC Curve):
 An aggregate performance metric between 0 and 1.
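One way to draw the ROC curve and compute the AUC for the beer model is with the pROC package (an assumption; other packages such as ROCR work as well):

library(pROC)                                 # assumes pROC is installed
roc_obj <- roc(d$Preference, fitted(logit))   # actual classes vs. predicted probabilities
plot(roc_obj)                                 # ROC curve; the diagonal is the random baseline
auc(roc_obj)                                  # area under the ROC curve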
Lift Curve
 Used to assess and compare model performance in
identifying the important class (Y=1).
 How many tax records to examine.
 How many loans to grant.
 How many customers to mail promotional offers.
 Compares lift (or “gains”) of classifiers:
 Measure of model performance relative to “no model” (random model).
 Provides model assessment at different cutoff values.
 Comparison with the confusion matrix:
 While the confusion matrix evaluates a model on the entire population, lift charts evaluate model performance on a portion of the population.
Computing Lift Curves
 Using model’s classification scores, sort observations from
most likely to least likely to be members of Class 1 (“the
important class”).
 Compute lift: Count of correctly classified Class 1 records
(on y-axis) vs. number of total records (on x-axis).
Lift Curve
 Computation of lift:
 Lift (n=5) = 5 correctly classified Class 1 / 5 total
observations = 1
 Lift (n=10) = 9 / 10 = 0.90
 Lift (n=15) = 11/15 = 0.73
 Lift (n=20) = 12/20 = 0.60
 Characteristic of the lift curve:
 A good classifier gives a high lift when we act on only a few cases; as we include more cases, the lift decreases.
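A cumulative gains version of the lift curve can be built by sorting on the model's predicted probabilities; a base-R sketch for the beer data:

prob <- predict(logit, type="response")
ord <- order(prob, decreasing=TRUE)           # most likely Class 1 cases first
cum_hits <- cumsum(d$Preference[ord] == 1)    # correctly identified Class 1 records so far
plot(seq_along(cum_hits), cum_hits, type="l",
     xlab="Number of cases acted on", ylab="Cumulative Class 1 found")
abline(0, mean(d$Preference == 1), lty=2)     # "no model" (random) baseline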
Misclassification Costs
 Cost of a misclassification error may be higher for one class than the other(s).
 Example:
 Suppose we send a promotion offer to 1000 people, with 1% average response rate (1=response).
 Confusion matrix:
                Predict as 1   Predict as 0
     Actual 1              8              2
     Actual 0             20            970
 Baseline: Classify every observation as belonging to the majority group (Naïve Rule – the simplest
classification rule).
 Our classifier can classify eight Class 1’s as 1; this comes at the cost of 2 FN and 20 FP.
 If the cost of sending an offer is $1 and the expected revenue from a Class 1 offer is $10,
 Under Naïve rule, don’t send any offers: Cost = 0, Revenue = 0, Profit = 0.
 Under our classifier, send offers to 28 predicted Class 1: Cost = $28, Revenue = $80, Profit = $52.
 Without any classifier, cost of 1000 offers = $1000, Revenue from 1000 offers = $100, Loss = $900.
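The profit comparison is simple arithmetic; a short check using the numbers from the confusion matrix above:

cost_per_offer <- 1; revenue_per_response <- 10
tp <- 8; fp <- 20                                  # offers sent under our classifier
profit_classifier <- tp*revenue_per_response - (tp + fp)*cost_per_offer   #  $52
profit_mass_mail  <- 10*revenue_per_response - 1000*cost_per_offer        # -$900 (10 actual responders)
profit_naive_rule <- 0                             # Naive rule: send no offers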
Lift Curves with Costs & Benefits
 How to add costs and benefits to Lift Curve
 Sort data in descending probability of success.
 Add cost/benefit for each outcome and cumulative cost/benefit.
 Plot records as cumulative cost/benefit vs. number of cases.
 Lift curve may go negative.
 Use cutoff to select the point where net benefit is maximum.
Extensions to the Classification Problem
 Classifier with multiple outcomes:
 For m classes, confusion matrix will have m rows and m columns.
 m(m-1) misclassification costs, since any case could be misclassified in m-1 ways.
 Can get pretty complex; luckily, such situations are rare.
 Classification using triage:
 Three outcomes: Class 1, Class 0, and Maybe.
 The third category may be assigned to 1 or 0 after a special human review.
Decision Trees
 A machine learning based approach used for classification.
 Maps observations about an item (represented in the branches) to conclusions about the item's
target value (represented in the leaves).
 Classification accuracy can be assessed using confusion matrix, ROC curves, and lift curves.
 Advantages:
 Easy to understand: Tree-structure mimics our thinking; no logit functions, odds, log-odds, etc.
 Powerful: More flexible than linear (logistic) models.
 Algorithms:
 ID3 (Iterative Dichotomiser 3)
 CART (Classification And Regression Tree)
 CHAID (CHi-squared Automatic Interaction Detector).
Two Types of Decision Trees
 Classification tree:
 Where the target variable takes discrete values.
 Regression tree:
 Where the target variable takes continuous values.
 Modeling trees in R, using the rpart() function in the “rpart” package:

library(rpart)
m <- rpart(y ~ x1 + x2 + …)   # method="class" for a classification tree,
summary(m)                    # method="anova" for a regression tree
plot(m); text(m)
?prune # Pruning trees

[Tree diagram: probabilities of light beer preference, with Income < $39,930 (yes) vs. Income > $39,930 (no) as the first splitting variable, followed by further splitting variables]
Classification Trees

Example: recursively partitioning 40 observations on age and weight (wt).
Each node shows “n (class counts)”:

40 (20,20)
  age < 50: 13 (9,4)
    wt < 75: 5 (1,4)
      age < 48: 3 (0,3)
      age ≥ 48: 2 (1,1)
        wt < 70: 1 (0,1)
        wt ≥ 70: 1 (1,0)
    wt ≥ 75: 8 (8,0)
  age ≥ 50: 27 (11,16)
    wt < 95: 15 (0,15)
    wt ≥ 95: 12 (11,1)
      age < 62: 11 (11,0)
      age ≥ 62: 1 (0,1)

[Scatter plot of weight (40–120) vs. age (40–60) showing the corresponding rectangular partitions]

Powerful technique; finds the best ways to partition data.
Logit Model vs. Classification Tree

Logistic Regression (The Regression Model):

Input variables   Coefficient    Std. Error    p-value       Odds
Constant term      2.08893299    2.67051315    0.43408436    *
Gender            -0.16874874    0.88501459    0.84878147    0.84472114
Married            0.47131756    0.93353128    0.61364591    1.60210371
Income             0.00021046    0.00007467    0.00482294    1.00021052
Age               -0.24593146    0.06694476    0.00023911    0.78197581

Residual df: 55; Std. Dev. Estimate: 35.71668625; % Success in training data: 50;
# Iterations used: 10; Multiple R-squared: 0.57059759

Training Data Scoring - Summary Report (cutoff probability value for success = 0.5):

Classification Confusion Matrix
                  Predicted Light   Predicted Regular
Actual Light                   26                   4
Actual Regular                  3                  27

 Do both models generate the same insights?
 Question: Both models conclude that:
A. Income and age are the most important predictors.
B. High income implies a high propensity for light beer.
C. Low age implies a high propensity for light beer.
D. All of the above.
 Which customer segment has the highest propensity for light beer?
Regression Tree

[Two example regression trees, with the predicted value of the target variable at each leaf:
 House Prices example: leaf predictions range from roughly 99,400 to 186,000.
 Catalogs example: leaf predictions range from roughly 387 to 4,940.
Splitting variables include Salary, Neighborhood, Catalogs, SqFt, Brick, Bedrooms, Bathrooms, Children, Offers, and Location.]
Pros and Cons of Decision Trees
 Advantages:
 Very flexible, even for modeling very complex combinations of input factors.
 Very easy to understand and to use; very intuitive; mimics the human thought process.
 Information can be converted into rules; can be augmented with “expert rules”.
 Disadvantages:
 Tree can be very large.
 Can over-fit the data, leading to poor prediction.
 Deduced rules can be very complex, with a lot of “if-then-else” statements.
 Provides little insight into what decision makers can do about a problem.

[Example: a large decision tree on weather attributes (temperature, outlook, windy, humidity) illustrating how complex a tree can become]
Predictive Accuracy
 Classification models work well if they can accurately predict future occurrences.
 How to assess predictive accuracy:
 Split your data into two random subsets: training set and holdout set.
 Train (“estimate”) a model using the training set.
 Apply the estimated model to the holdout set to assess predictive performance.
 Question: Which of the two models below predicts better, i.e., which model has higher “lift”?
Logistic Regression                Classification Tree

                Actual                             Actual
                 0    1                             0    1
Predicted 0     11    3            Predicted 0     16    6
Predicted 1      6   20            Predicted 1      1   17
Important Note on OLS Regression
 Questions:
 Multiple or Adjusted R2 in an OLS model is a measure of:
A. Predictive accuracy C. Fit between data and model
B. Explanatory power D. All of the above
 What does high R2 mean? Can high R2 be problematic?
 Use the same procedure as for classification models to assess predictive accuracy.
 How to create training and holdout samples in R:

set.seed(43324)   # seed allows replication of this same sample later
index <- sample(1:nrow(d), size=floor(0.8*nrow(d)), replace=FALSE)   # sampling without replacement
training <- d[index, ]    # 80% of the data for the training set
holdout <- d[-index, ]    # remaining 20% for the holdout set
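Once the split exists, holdout accuracy can be checked in a few more lines; a sketch re-using the logit specification from earlier:

m_train <- glm(Preference ~ ., family=binomial(link="logit"), data=training)
pred <- ifelse(predict(m_train, newdata=holdout, type="response") > 0.5, 1, 0)
mean(pred == holdout$Preference)                     # holdout accuracy
table(Actual=holdout$Preference, Predicted=pred)     # holdout confusion matrix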
Key Takeaways
 Classification models can be used to predict two or more discrete values (not for continuous values).
 Classification models may be based on statistical or algorithmic methods.
 Statistical methods (logit, probit) are model based:
 Allow insight into the relationship between input and output.
 But interpretations are cumbersome (“logit”, “odds”).
 May not handle complex interactions among X variables very well.
 Data mining methods (decision tree, regression tree) are algorithm-based.
 May not explain the relationships between X and Y (a “black box”).
 Can be very intuitive.
 Have a tendency to overfit the data.
 Evaluate predictive ability using training and holdout data samples.