1
Statistics is sexy?
2
Making better decisions!
Historical data
Demand Time Series
[Figure: Historical Demand (units) over Period 0–100; a probability model is fit to the historical demand, and the decision is based on the model probabilities]
3
Current/previous research (snapshot)
• Academic/consulting projects:
o LinkedIn – investigating job seeker status
o Facebook – sampling social networks
o Statistical approaches to improve data quality
• In my research, I typically target the “quantitative” (=geek-
type) academic journals; e.g.
o Product line optimization; lead article with discussions; best paper
award (International Journal of Research in Marketing 2011)
o Customer (value) analysis in heterogeneous markets
(Psychometrika 2012, Management Science 2015)
o Firm performance and the role of marketing (Journal of Marketing
2015, HBR 2015)
o Social effects in CRM campaigns (Journal of Marketing Research
2017)
Course objectives
4
Course organization
• Blackboard
• Course syllabus
• Before class:
– preparation guide (readings, case, class
discussion questions and dataset)
• After class:
– pdf of the lecture notes (WYSIWYG)
10
5
What you can expect from us
11
12
6
Mathcamp quiz: (partial) results
13
Today’s lecture
Moral of quiz: be aware of your ability to evaluate numbers
judgmentally! You may want to get some data for decision making…
14
7
Example: airline demand over time
[Figure: Historical Demand (in 10s of units) over Period 0–100]
15
• More often than not the quantities we are interested in will not
be predictable but will exhibit an inherent variation
16
8
Learning from past observations
[Figure: Historical Demand (in 10s of units) over Period 0–100, as on the previous slide]
17
[Figure: Demand Histogram – frequency of occurrence and cumulative frequency of period demand (in 10s of units); mean = 28.1, standard deviation = 10.9]
18
9
Approximate the data with a probability function
Use a well-defined probability function whose shape resembles
the frequency histogram to probabilistically represent demand.
[Figure: the demand histogram as above (mean = 28.1, standard deviation = 10.9), used to choose a matching probability function]
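[[ The idea of approximating the histogram with a probability function can be sketched in a few lines of Python (not part of the course's SPSS workflow). The demand data below is synthetic, generated to roughly match the slide's mean of 28.1 and standard deviation of 10.9 (in 10s of units). ]]

```python
import math
import random
import statistics

# Hypothetical demand history: 100 periods, roughly matching the
# slide's histogram (mean ~28.1, std. dev. ~10.9, in 10s of units)
random.seed(42)
demand = [random.gauss(28.1, 10.9) for _ in range(100)]

# "Fit" a normal model by estimating its two parameters from the data
mu_hat = statistics.mean(demand)
sigma_hat = statistics.stdev(demand)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a Normal(mu, sigma), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# The fitted model can now answer probabilistic questions, e.g. the
# chance that demand in a future period exceeds 40 (in 10s of units)
p_over_40 = 1 - normal_cdf(40, mu_hat, sigma_hat)
print(round(mu_hat, 1), round(sigma_hat, 1), round(p_over_40, 3))
```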
19
10
Organizing data
Rows: “cases” / “records” / “observations”
Columns: variables
21
Big data!
22
11
Variable types: measurement levels
Categorical: no natural numerical meaning – e.g. customer ID, gender, ZIP code, brand, hair color, class grades (A, B, C), …
Quantitative: natural numerical meaning – e.g. sales, price, stock returns, interest rate, …
Today’s lecture
24
12
Mini case: American Express
25
26
13
Working with data
2. If electronic, it depends:
– SPSS can import Excel files, text files etc.
– Etc.
28
14
SPSS: two ‘views’
29
30
15
SPSS variable view
COLUMNS are
“properties” of your
variables (slides 35—37)
31
session1&2_credit_card_web.sav
data_science_day_1_credit_card_web.sav
32
16
33
34
17
‘Variable View’ fields
• Name: short name for the variable (no spaces; no
special characters)
– Try to give it a clever, brief name referring to what the variable
is about, without using special characters or being lengthy:
• Good variable name: “ItemsGrocery” or “items_grocery”
• Bad variable name: “# of grocery items sold?”
– Good practice: always keep a record ID variable (e.g. ‘obs’)
35
[[ For instance, if you work with survey data, you may want to
include wording from the actual question in the survey ]]
– When you create an SPSS table or chart, this “label” will be the
title of the chart or table
36
18
‘Variable View’ fields, cont.
• Values: enter the words associated with your number
codes for a categorical variable
– Only needed when your variable has categories, and the
categories are described by words
– When you analyze the data, these words will appear in the
tables / charts rather than the numeric codes
37
Today’s lecture
38
19
Descriptive statistics: describing data
39
20
How to choose the “correct” descriptive
statistical technique (1)
Generally there is not “one” correct way. Your choice
depends on:
41
42
21
Today’s lecture
43
• Frequency tables
• Pie charts
• Bar/column charts
44
22
Frequency table variable ‘card’
45
46
23
Another example: date primary card issued
(~tenure of the customer)
In SPSS: Analyze – Descriptive Statistics – Frequencies
48
24
Another example: date primary card issued
(~tenure of the customer)
49
In-class exercise 1
50
25
In-class exercise 1 (discussion)
52
26
Graphical representation of frequency tables
54
27
Revisit the brand-tenure analysis (slide 48)
55
56
28
Cross tab: credit card and date issued
In SPSS: Analyze – Descriptive Statistics – Crosstabs
29
In class exercise 2
59
60
30
Graphically displaying cross tabs
Segmented bar
chart
31
Another useful option in SPSS:
recode a categorical variable
• One data transformation that is used quite often is to
recode a categorical (nominal or ordinal) variable
– To collapse categories of a categorical variable in fewer
categories (e.g. some categories are thinly populated, or
for presentation sake)
5. Click ‘OK’
64
32
Recode the variable ‘card_date_yr’ (cont.)
• SPSS added your new variable to the back of your data file
(Data View); at the bottom of the variable list (Variable
View)
• You thought you were done, but you are not!
– Update all the fields under ‘Variable View’ for this new
variable
• Enter a description under ‘Label’ (e.g. ‘Date primary card issued
before 2001 or 2001 or after’)
65
66
33
Today’s lecture
67
• Numerical methods
– Central tendency measures
– Dispersion measures
– Correlation (core stats class session 5)
• Visual methods
– Histograms
– Box plots
– Time series plots
– Scatterplots (core stats class session 5)
68
34
APPENDIX
Contents ---
69
70
35
Practice measurement scales (slide 23)
1. Quantitative variable; numbers of this scale have natural meaning;
multiplications in the context of this construct (number of students)
make sense
2. Ordinal scale (categorical variable); the numbers 1,2,3,4,5 reflect
order with respect to the underlying construct (age)
3. Nominal scale (categorical variable); the numbers (e.g.) 1=off
campus; 2=on campus are just labels
4. Quantitative variable; this can be debated. This is an attitude rating
scale. It seems reasonable to compute an average satisfaction,
hence, some arithmetic for this variable makes sense
5. Quantitative variable; same comment as example 1
6. Ordinal scale (categorical variable); same comment as example 2
7. Quantitative variable. It seems reasonable to compute an average
GMAT score, hence, some arithmetic for this variable makes
sense
71
36
Data science camp
Introduction to the core stats class
Day 2
Today’s lecture
Part 6: Wrap up
1
Analyzing quantitative variables
• Numerical methods
– Central tendency measures
– Dispersion measures
– Correlation (core stats class session 5)
• Visual methods
– Histograms
– Box plots
– Time series plots
– Scatterplots (core stats class session 5)
Research question:
2
Numerical methods – central tendencies
Interpretation?
Range = max-min
3
Graphical representation: histogram
Mode = 0.00
Median = 88.42
Mean = 129.28
25th percentile = 0.00
4
Graphical representation: boxplot
5
Temporal Data
11
Stock Performance
12
6
Mini case: amount spent on groceries
Green – retail
Brown – travel
Blue – groceries
Insights for
American
Express?
14
7
Recap of techniques for quantitative data
– Histograms
15
Today’s lecture
Part 6: Wrap up
8
Learning from data for decision making
17
• Manufacturing
– Quality control
• Marketing
– Online ad copy testing (A/B)
• Finance
– Comparing returns from different investment portfolios
9
Two forms of applied statistics
1. Descriptive statistics
2. Inferential statistics
19
20
10
The two ‘key’ concepts in statistics
Notation: population μ and σ (and proportion p); sample X̄ and S
Population: the total of all elements that share some common characteristic(s) (‘Truth’); goal of stats: learn about it
Sample: a subset of a population; goal of stats: use the sample to learn about the population
21
Opinion polls
22
11
The idea behind sampling
N (BIG) Population
Every sample of size n is equally likely: n1, n2, …, n(N choose n)
For each hypothetical sample of size n you compute a statistic T, giving T1, T2, …, T(N choose n)
A histogram of all these hypothetical T’s would determine the ‘margin of
error’ (previous slide)
23
24
12
Two sampling questions to class
25
13
Today’s lecture
Part 6: Wrap up
28
14
Mini-case: insurance claims
WSJ 09/2012
29
WSJ 09/2012
30
15
Mini-case: insurance claims
• Fairly rich dataset from a large insurance company
31
16
Key stats for claim amount
17
Two fundamental distributions in statistics
• Population distribution:
– frequency distribution (histogram) of the population elements for a
certain variable (e.g. claim amount); generally a smooth line
– It is unknown but you want to know about it
– The mean of the population distribution is μ (how could you
compute it?)
– The standard deviation of the population distribution is σ
• Sample distribution:
– frequency distribution (histogram) of the sample elements for e.g.
claim amount
– it is known once we have our sample
– the mean of the sample distribution is the sample mean X̄
– the standard deviation of the sample distribution is S
– usage: to infer (=learn) about the population distribution
35
Today’s lecture
Part 6: Wrap up
18
Statistical inference: learning about the
population mean
• Mini-case:
– Sample mean is 73.01 (in 1000$) [[ slide 32 ]]
– Can we conclude µ = 73.01?
37
Sampling error
38
19
Understanding the variability in sample means
Consider the following hypothetical ‘mini’ population (‘the truth’): μ = 72.5
Claim ID       1   2   3   4   5   6   7   8   9   10
Claim (1000$)  73  71  69  71  75  78  70  72  72  74
Now draw every possible sample of size n = 2 and compute each sample mean;
there are (10 choose 2) = 45 such samples. These 45 means will form the
sampling distribution. Of what form will this distribution be (think about
making a histogram of these 45 sample means)?
39
20
[[ Challenging ]] In-class exercise
Discuss how the sampling distribution gives insight into how
well your sample mean X̄ estimates the population mean μ.
Hint:
– Consider the following two situations:
1. Consider a sample of 100 cases from company A. The sample
mean for claims is 72 and the sample standard deviation S = 20.
What does the sampling distribution look like?
2. In a sample of 100 from company B the sample mean is 68 and
the standard deviation S = 5. What does the sampling distribution
look like?
– Which estimate (72 or 68) gives you the most confidence to infer
about the population mean?
41
Completes slide 35
42
21
Today’s lecture
Part 6: Wrap up
Uncertainty in statistics
44
22
Big picture
Constructing a confidence interval (CI) for an
unknown parameter
45
23
Confidence interval: Zconfidence for 95% area
[Figure: standard normal curve with 95% of the area between −Zconfidence and +Zconfidence]
Web: http://homepage.divms.uiowa.edu/~mbognar/applets/normal.html
HEC: http://rstudio-test.hec.fr/probcalc/
[[ FYI: check out appendix for some extra info and exercise on
probability calculations for a normal distribution ]]
48
24
Using PQRS
This is Z
49
Using PQRS
This is Z
50
25
Mini-case: insurance claims
Confidence interval for μ for the variable insurance claims
( X̄ − zc × s/√n , X̄ + zc × s/√n )
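[[ A minimal Python sketch of this interval formula. X̄ = 73.01 (in 1000$) is the mini-case sample mean; the S = 20 and n = 100 below are assumed values for illustration, since the case's actual S and n are on the earlier slides. ]]

```python
import math

def confidence_interval(x_bar, s, n, z_conf=1.96):
    """(X̄ - z·s/√n, X̄ + z·s/√n); z_conf = 1.96 gives ~95% confidence."""
    margin = z_conf * s / math.sqrt(n)
    return (x_bar - margin, x_bar + margin)

# Sample mean 73.01 (in 1000$) as in the mini-case, with an assumed
# S = 20 and n = 100 (hypothetical; not from the slide)
lo, hi = confidence_interval(73.01, 20, 100)
print(round(lo, 2), round(hi, 2))  # prints 69.09 76.93
```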
51
Interpretation:
26
In-class exercise
53
54
27
Confidence interval – couple of remarks
55
56
28
Confidence interval – remark 1 (cont.)
58
29
Confidence interval – remark 2
59
60
30
Today’s lecture
Part 6: Wrap up
62
31
Univariate, descriptive, statistics
• A very important first step [[ and sometimes the only step ]] of
statistical work:
– Get a feel for data and become friends!
– Examine data for accuracy
– Helps decide follow-up research/analyses
• Always make sure you know the basic numbers and graphs
for the key variables when you use data for decision making
63
64
32
Inferential statistics
• We often need to learn something about the “state of the
world” using a sample
• However, the “decision space” lies beyond the data
• Inferential statistics helps out:
– Realization: if I get a different sample, my statistics will
change!
– Important to know for decision making: by how much?
– Fortunately, we only need one sample (and knowledge of
the central limit theorem) to get an idea of this
– How? Compute a confidence interval! (e.g. day 2 part 5)
33
Final remarks data science camp
• Keep an eye out for the quiz!
– You need to fill it out to get a pass grade for the camp. Your
grade does not matter!
[[ Of course, take the quiz seriously. It will signal where you
stand. This material is relevant for the core stats class! ]]
• Stats class starts next week Tuesday! We’ll dive right in!
67
Today’s lecture
Part 6: Wrap up
34
@home SPSS practice
SPSS work for the credit card case (day 1)
[[ answers will be given in the how to guide ]]
69
35
@home SPSS practice
SPSS work for the insurance case (day 2)
[[ answers will be given in the how to guide ]]
71
APPENDIX
Contents ---
How to generate a simple random sample in Excel or SPSS
(slide 73)
72
36
Simple Random Sample in Excel and SPSS
• To select a simple random sample in Excel:
– Create column in Excel (labeled “random” or similar)
– Use Excel’s =RAND() function to generate a random
number for each observation
– “Freeze” the random number
• Edit-Copy-Paste Special-Values
– Sort by the random number
– Take the first n rows (n=desired random sample size)
• In SPSS, this can be done through the option
“Random sample of cases” in the menu ‘Data—
Select Cases…’
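[[ The same Excel recipe, sketched in Python with hypothetical records: attach a random key to each row, sort by it, keep the first n. (Python's `random.sample(rows, n)` does this in one call.) ]]

```python
import random

# The Excel recipe in code: attach a random number to each row,
# sort by it, and keep the first n rows
rows = [f"obs_{i}" for i in range(1, 101)]  # hypothetical 100 records
n = 10                                      # desired sample size

random.seed(7)
keyed = [(random.random(), row) for row in rows]  # "=RAND()" per row
keyed.sort()                                      # sort by the random key
sample = [row for _, row in keyed[:n]]            # take the first n rows

print(len(sample))  # prints 10
```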
73
– Etc.
74
37
Compute a new variable in SPSS
– In the ‘Target Variable’ box enter the name for the new variable (e.g.
‘spent_per_item_retail’)
– Click ‘OK’
75
76
38
Histogram of monthly $ spent per item (retail)
78
39
Probability distribution continuous variable
[Figure: probability density function f(X) of a continuous variable]
40
Continuous probability model: Normal distribution
81
82
41
Properties of the Normal distribution
[Figure: normal density; the area left of X = 0 is the probability that X < 0, the area right of it the probability that X > 0]
84
42
Normal probability calculations exercise 1&2
85
P(Demand<600) = ?
86
43
Normal probability calculations exercise 4&5
87
For an alternative solution using the 68–95–99.7 rule, see the next
slide.
88
44
Normal probability calculations solutions 1—3
89
90
45
Normal probability calculations solution Q5
Sometimes we need to find the observed value
corresponding to a given proportion. Here, we are given a
maximum probability (10%) that we don’t offer enough
flights to meet demand. How many passengers must we
plan to accommodate to not exceed this risk?
1. Fill in given
probability here
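[[ This inverse lookup ("given a probability, find the value") can be sketched in Python. The mean of 500 and standard deviation of 100 below are hypothetical, since the exercise's actual parameters are on the earlier slides. ]]

```python
from statistics import NormalDist

# Hypothetical demand model: Normal with mean 500 and std. dev. 100
# (illustrative numbers, not the exercise's actual parameters)
demand = NormalDist(mu=500, sigma=100)

# We tolerate at most a 10% chance that demand exceeds capacity,
# so we need the 90th percentile of the demand distribution
capacity = demand.inv_cdf(0.90)
print(round(capacity))  # plan for about 628 passengers
```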
46
Statistics and business analytics
Session 1
Poll of n = 732 data scientists (63% industry, 11% academia, 26% other)
Source: Kdnuggets
2
1
Course organization
• Blackboard
• Course syllabus
• Quizzes
• SPSS labs
Grading policy
• Grading scale:
– Total: 100%
2
Last topic data science camp
• Key point?
– different samples lead to different statistics (e.g.
mean, proportion, standard deviation etc.).
– question: ‘how different is different’?
• Solution?
– Sampling distribution → confidence intervals
for means
Today’s lecture
3
Today’s lecture: supporting decisions with data
Mini-case insurance claims: the accounting department
reports that the average claim last year was $63500.
Decision ‘not reject the null: NOT GUILTY (NO JAIL)’ → Correct if the truth is NOT GUILTY; False Negative (Type II error) if the truth is GUILTY
8
4
Steps in Hypothesis Testing
Problem Definition
– How much certainty do you want?
– What data have you collected?
– Conduct the appropriate test
– … through Step 6
10
5
Today’s lecture
11
12
6
Step 1: one-sided vs. two-sided hypothesis tests
• One-sided tests are focused on departures from H0 in a
single direction
– In the claims mini-case, we want to know if the new claims
are 10% higher on average than the old claims to warrant
new policy premiums
13
Truth: NOT GUILTY vs. GUILTY (GO TO JAIL)
Our decision ‘not reject the null: NOT GUILTY (NO JAIL)’ → Correct if the truth is NOT GUILTY; False Negative (Type II error) if the truth is GUILTY
14
7
Step 2: choose the significance level
The significance level is denoted by α. It indicates how
certain we are in our decision. You choose this number
(typically α = 0.10, 0.05, or 0.01), and use this in step 5.
Truth: H0: μ = 70 vs. HA: μ > 70 (new claims at least 10% higher)
Our decision ‘not reject H0 (μ = 70)’ → Correct if the truth is μ = 70; False Negative (Type II error) if the truth is μ > 70
15
8
Step 4: prepare a statistical decision
17
[Figure: standard normal curve over −2, 0, 2]
All possible Ztest values that you could get (from many many
hypothetical samples) if H0 is true
Consider two situations:
1. If the null hypothesis is true, you are likely to get Ztest values that are close to 0,
say in the white area, and you are unlikely to get Ztest values that are far away
from zero, say in the green areas
2. Therefore, it would be unlikely to get a Ztest value in the green area from your
sample, if indeed the null hypothesis is true.
18
9
Step 4: prepare a statistical decision
• A precise definition of “far” and “close”: where does
my 𝑍𝑡𝑒𝑠𝑡 value fall under the standard normal curve?
Is it in the tails (green area) or not?
Enter your
𝑍𝑡𝑒𝑠𝑡 value in
this box
10
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject the
null hypothesis
– What you do: compare the P-value to the chosen significance
level (in step 2) of the test
• The significance level of the test (denoted by α) is the
relative cut-off for deciding if the observed difference is
sufficiently big to reject the null
– You choose this value α: typically 1%, 5%, or 10%
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis
[[ Statistical warning: you never accept a null hypothesis! ]]
21
11
Step 6: make a business conclusion
• What if we had chosen 𝛼 = 0.10 in step 2 instead (i.e. we
would tolerate a larger type I error in our decision)?
• The null hypothesis is rejected for 𝛼 = 0.10.
• Conclusion: our sample with average claim size of $73000
provides evidence that the average claim amount [[ in the
population ]] is larger than $70000 and (therefore) increased by
more than 10% (Ztest = 1.39, P-value = 0.08). We would
recommend a re-evaluation of the policy prices.
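[[ The slide's numbers can be checked in a few lines of Python: a Ztest of 1.39 indeed gives a one-sided P-value of about 0.08, which is below α = 0.10. ]]

```python
import math

def p_value_one_sided(z):
    """P(Z > z) under the standard normal (upper-tail P-value)."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# Reproduce the slide: Z_test = 1.39 gives a one-sided P-value ~0.08
p = p_value_one_sided(1.39)
alpha = 0.10
print(round(p, 2), p < alpha)  # prints 0.08 True -> reject H0 at the 10% level
```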
12
Today’s lecture
25
26
13
Hypothesis test for proportions
• Mini-case credit fraud: last year, 8.5% (=0.085) of
the claims were fraudulent. Management wants to
know whether this year there were more or less
fraudulent claims.
28
14
Hypothesis test for proportions
Step 4: compute the P-value
– Tells us on a common (probability) scale whether the
sample is “close” or “far” from the null hypothesis
– Curve under a standard normal distribution
– E.g. use PQRS
– Did you find: P-value = 2*0.00 = 0.00?
Step 5: reject or not reject the null hypothesis?
– Compare the P-value to the significance level in step 2
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the
null hypothesis
– Statistical conclusion?
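[[ A sketch of the proportion test in Python. p0 = 0.085 is last year's fraud rate from the mini-case; the sample proportion (0.105) and sample size (4415) below are assumed for illustration only. ]]

```python
import math

def z_test_proportion(p_hat, p0, n):
    """Z test statistic for H0: p = p0 (two-sided test)."""
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return (p_hat - p0) / se

# p0 = 0.085 from last year (slide); p_hat and n are hypothetical
z = z_test_proportion(p_hat=0.105, p0=0.085, n=4415)
p_value = 2 * 0.5 * (1 - math.erf(abs(z) / math.sqrt(2)))  # two-sided
print(round(z, 2), round(p_value, 2))  # large |z| -> P-value ~ 0.00
```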
29
30
15
Hypothesis testing – remark 1
31
16
Statistical vs. practical significance – remark 3
• Statistical significance based on the rule “P-value < α”
(step 5) is at best a rule-of-thumb and at worst bad practice
34
17
Another illustration of type I and II errors
(slide 8)
18
Statistics and business analytics
Session 2
Course announcements
1
Today’s lecture
2
Columbus Ohio bike share
• Project goal: bike share in Columbus Ohio?
• Project kick off late Fall 2011
• Collaboration with Ohio State University,
ConsiderBiking.org and the mayor’s office of Columbus
• Employ a survey:
– Purchase intent scale (5-point (interval) scale)
– Industry rule-of-thumb: 80% of “definitely buy” and 30% of
“probably buy” actually end up buying (…)
3
Confidence intervals: recap
Confidence interval size is a function of three things:
X̄ ± Zconfidence × S/√n, where Zconfidence × S/√n is the ‘margin of error’
– the data
Specifically, the standard deviation
– the confidence level
As the confidence level increases (all else equal), the length
of the confidence interval increases.
– the sample size(s)
To control confidence interval length – choose the sample
size appropriately.
7
4
Sample size determination for a single outcome
10
5
Sample size determination for a single outcome
• From historical data: in a similar study run on the OSU
campus we had found an S of 1.55, so
n = ((1.96 × 1.55) / 0.25)² = 12.152² = 147.67 ≈ 150
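[[ The calculation as a small Python function, using the slide's inputs (z = 1.96 for 95% confidence, S = 1.55 from the historical OSU study, half-width B = 0.25): ]]

```python
import math

def sample_size_mean(z_conf, s, b):
    """n = (z * S / B)^2; B is the desired half-width (margin of error)."""
    return (z_conf * s / b) ** 2

# Slide's inputs: z = 1.96 (95%), S = 1.55, B = 0.25
n = sample_size_mean(1.96, 1.55, 0.25)
print(round(n, 2), math.ceil(n))  # prints 147.67 148 (the slide rounds to ~150)
```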
6
Sample size calculation for a proportion
(single outcome)
• Step 1: what is the desired confidence level?
– Take 95% which gives you a 𝑍𝑐𝑜𝑛𝑓 = 1.96
• Step 2: what is the smallest difference (above and
below) that has practical importance to you?
– Here we took 5% up and down, so 𝐵 = 0.05
• Step 3: working backwards from the confidence
interval formula for proportions [[ see lecture notes data
science camp day 2, part 5 ]], use the following formula
n = (Zconf² × p × (1 − p)) / B²
13
n = (1.96² × 0.5 × 0.5) / 0.05² = 0.9604 / 0.0025 = 384.16 ≈ 385
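[[ The same calculation in Python, with the conservative p = 0.5: ]]

```python
def sample_size_proportion(z_conf, p, b):
    """n = z^2 * p * (1 - p) / B^2; p = 0.5 is the conservative choice."""
    return (z_conf ** 2) * p * (1 - p) / (b ** 2)

# Slide's inputs: 95% confidence (z = 1.96), p = 0.5, B = 0.05
n = sample_size_proportion(1.96, 0.5, 0.05)
print(round(n, 2))  # prints 384.16, round up to 385
```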
14
7
Today’s lecture
15
Caution!!
8
Data challenges
17
Today’s lecture
18
9
Check your sampling as part of basic
statistical work
• Bike share on the Ohio State University campus
– A completely separate part of the project involved
sampling students on the OSU campus
– This data was analyzed separately from the
(previous) downtown study
– The sampling was done by students in my class who
sampled ‘on campus’ – not ideal (convenience
sample)!
• When the sample is in, it is good practice to check
some basic demographic variables and compare
those with population demographics (to the extent
these are known, of course)
19
20
10
Testing a hypothesis about population proportionS:
6 steps
1. Formulate the null and alternative hypotheses
2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English
21
– Ha:
22
11
Step 2: choose the significance level
Significance level is denoted by α. It indicates how certain
we are in our decision. You choose this number (typically
α = 0.10, 0.05, or 0.01), and use this in step 5.
23
24
12
Step 3: compute a test statistic
Four steps to compute a chi-square statistic
1. Write down the formula and the symbols
χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + …
Oi = observed counts for cell i
χ² = (176 − 158.4)²/158.4 + (220 − 237.6)²/237.6
   = 17.6²/158.4 + 17.6²/237.6 = 3.26
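[[ The computation can be verified in a couple of lines of Python, using the slide's observed and expected counts: ]]

```python
def chi_square(observed, expected):
    """Sum of (Oi - Ei)^2 / Ei over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The slide's two cells: O = (176, 220), E = (158.4, 237.6)
chi2 = chi_square([176, 220], [158.4, 237.6])
print(round(chi2, 2))  # prints 3.26
```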
26
13
Step 4: prepare a statistical decision
• When the χ² value you computed is:
– small (‘close to zero’) your data is close to the null
hypothesis
– large your data is ‘far away’ from the null hypothesis
27
28
14
Step 4: prepare a statistical decision
Enter your χ² value in this box
Web: http://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html
HEC: http://rstudio-test.hec.fr/probcalc/
29
30
15
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject the
null hypothesis
– What you do: compare the P-value to the chosen significance
level (in step 2) of the test
• The significance level of the test (denoted by α) is the
relative cut-off for deciding if the observed difference is
sufficiently big
– You choose this value α: typically 1%, 5%, or 10%
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis
• Statistical warning: you never accept a null hypothesis!
31
32
16
Try it yourself!
A second key demographic variable in the bike share study
is class standing. This is an ordinal variable. Hence, we
could analyze it with a frequency table. From the OSU
administration office, we know that the relative shares of
students who are Freshmen, Sophomores, Juniors and
Seniors are equal in the population. Would you say our
sample is representative?
33
17
What’s up next: SPSS lab 1 & quiz 1
• SPSS lab 1 –
– Time and location: same as regular class meetings
– It will be ‘hands-on’; my TA (Alican) and I will be there to help out
– Prepare the lab (e.g. with your team)
• Review lecture notes of sessions 1&2, data science camp
• ‘How to in SPSS’ (pdfs on Blackboard)
• Case: American Express (Data Science camp)
– You will be asked to hand in a brief assignment after the
lab (counts towards final grade); may be done in pairs
• Quiz 1 –
– Same idea as for data science camp
– Open: 1 day before SPSS lab
– Close: 2 days after SPSS lab
35
Appendix
36
18
What to do if my sample is not representative?
Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).
37
19
What to do if my sample is not representative?
Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).
– If we know (e.g. US census) that 18% of the population are
AAs and 82% are NAAs, what would be a better overall
average?
– A better average would be the weighted average:
0.18 × 80 + 0.82 × 40 = 47.2%
– We “down-weighted” the AAs data in computing the new
sample average
– This weighted average is much closer to the truth in the
population: 45% (the truth is the outcome of the election,
assuming everybody voted, or that those who voted are
representative of the whole population)
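[[ The re-weighting in code form, reproducing the slide's numbers (group supports of 80% for AAs and 40% for NAAs in the 500/500 sample): ]]

```python
# Sample of 1000 voters: 500 AA (80% support) and 500 NAA (40% support)
# Census weights: AAs are 18% of the population, NAAs 82%
unweighted = (500 * 80 + 500 * 40) / 1000  # naive sample average
weighted = 0.18 * 80 + 0.82 * 40           # re-weighted to census shares

print(unweighted, round(weighted, 1))  # prints 60.0 47.2 (truth: 45%)
```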
39
20
Statistics and business analytics
Session 3
Today’s lecture
1
Bivariate statistical analysis
• Often quite important for decision making, and a
stepping stone to multivariate statistics (e.g.
regressions)
• Examples:
– Marketing – did ad A or B generate more
clickthroughs?
– Supply chain – does the temperature affect the sales
of cola?
– Human resources – did men and women have an
equal chance of being promoted in the past year?
– Banking – are homeowners who are single more likely
to default than married homeowners?
Today and next class: we will further work on completing this table
4
2
Case insurance claims
• Same case (data) as previous sessions 1&2
• Sample of insurance claims from a large insurer
• Today’s class: can we use demographic information
to help price insurance policies?
Today’s lecture
3
Comparing two means: means plot
(X̄ = 73.01 across all policies; data science camp day 2 slide 32)
4
Side-by-side box plots help us see the
within-group variation
10
5
Step 1: Formulate the statistical hypotheses
Step 1: You formulate TWO hypotheses:
• The null hypothesis H0
– For bivariate statistics: stated in terms of no difference or no
relation
– Formal: the two variables are independent
– Example: there is no difference in average claim amounts of
retirees and non-retirees
• The alternative hypothesis H1 or HA
– For bivariate statistics: it states that there is a difference or a
relation
11
12
6
Step 3: compute a test statistic
• A test-statistic measures how close the sample has come to
the null hypothesis
• A well-thought-out test statistic (statisticians figure this out)
follows a well-known distribution such as the normal, t-, or
chi-square distribution
• For testing the difference between two population means,
use the following formula:
Zdifference = (X̄1 − X̄2) / SX̄, where SX̄ = √(S1²/n1 + S2²/n2)
13
Zdifference = (X̄1 − X̄2) / SX̄, where SX̄ = √(S1²/n1 + S2²/n2)
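[[ A sketch of the formula in Python. The group summaries below (means, standard deviations, group sizes) are hypothetical stand-ins, since the case's actual numbers are on the slides. ]]

```python
import math

def z_difference(x1, x2, s1, s2, n1, n2):
    """Z = (X1_bar - X2_bar) / sqrt(S1^2/n1 + S2^2/n2)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1 - x2) / se

# Hypothetical group summaries (illustration only):
# retirees: mean 65, S 18, n 400; non-retirees: mean 75, S 22, n 4015
z = z_difference(65, 75, 18, 22, 400, 4015)
print(round(z, 2))  # a large |z| -> the means clearly differ
```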
Interpretation?
14
7
Step 4: prepare a statistical decision
• Similar to the Z-test for one mean (session 1), we use the
standard normal distribution to curve the computed Zdifference
value
– E.g. use PQRS or other probability calculators
15
16
8
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject
the null hypothesis
– What you do: compare the P-value to the chosen
significance level (in step 2) of the test
17
9
@Home practice
19
Means plot
Are claim amounts from customers with different education levels,
on average, the same, or not? How should we proceed to address
this question?
20
10
Today’s lecture
21
ANalysis Of VAriance
22
11
Side-by-side box plots help us see
the within-group variation
24
12
Using ANOVA to test equality of population means
25
13
Using ANOVAs
CAUTION!
More so than any of the techniques we have learned so far,
ANOVA requires us to be more careful about examining
underlying data assumptions
1. Sample should be a random sample (or at least arguably
so)
2. Data should be approximately normally distributed within
each group
3. The variances in the different groups should be
approximately equal
27
14
Using ANOVAs – CAUTION! Check assumption 3 (slide 27)
• The variances are fairly similar for the five groups (above
table); however, it is hard to argue that within each group the
data is approximately normally distributed (previous slide)
• Therefore, we should resist the temptation to interpret and
use the previous analyses (slides 25&26).
• Instead, we could consider re-doing the analysis with the
logarithmic transformation! (why?)
29
Conclusion?
Are the assumptions after the log transform valid?
30
15
Check ANOVA assumptions
32
16
ANOVA: note on interpretation
• Rejecting the null hypothesis of equal means (e.g. slide
30) does not mean that all of the means are different!
– How do we find out which ones differ?
33
The P-values for the tests H0: µi = µj are listed in the column ‘Sig.’
(Sloppy: you do the test on slides 10—18 here 5 × 4 = 20 times)
Practice managerial interpretation
34
17
Today’s class in sum
• Statistical inference for bivariate statistical analysis
(analyzing two variables jointly)
• Particularly today: one quantitative variable and one
categorical variable
– Compare two means (t-test)
– Compare more than two means (ANOVA + multiple comparisons)
– Graphically: means plot; side-by-side box plots
• Application: insurance claims
– Are insurance claims from certain demographic groups, on
average, higher, lower, or about the same?
– Claims of retirees are, on average, lower than claims from non-
retirees; claims from clients with the least education are, on
average, lower than from clients with the most education
– These analyses provide a starting point for building pricing
models for segments
35
Appendix
36
18
Comparing two means: remark A1
37
38
19
Comparing two means: remark A2
• For instance, suppose for this sample of 4415 individuals, we
had also measured the claim amounts from two years ago.
• How would this variable show up in SPSS? Eg consider the
following hypothetical example:
39
– We cannot use the formula on slide 13, because the two means are
computed over the same set of observations
40
20
Comparing two means: remark A2
• Idea: create a new variable (see column ‘diff’ on slide 39)
• That is, test H0: µdiff = 0 (or any other number) with
Zdiff = (d̄ − 0) / (Sd / √n), where d̄ and Sd are the sample mean and standard deviation of ‘diff’
• Curve under a standard normal distribution to get P-
value
[[ Discussion book pp392—402 (11th); pp455—459 (12th); pp447—451 (13th);
in SPSS paired samples t-test ]]
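[[ A sketch of the paired test in Python with hypothetical paired claims, following the 'diff' column idea from slide 39: ]]

```python
import math
from statistics import mean, stdev

# Hypothetical paired data: this year's vs. two-years-ago claims for
# the same 8 customers (in 1000$); illustration only
now = [73, 80, 69, 75, 71, 78, 74, 70]
before = [70, 76, 71, 72, 68, 74, 73, 66]

diff = [a - b for a, b in zip(now, before)]  # one 'diff' per customer
n = len(diff)

# Test H0: mu_diff = 0 with Z = (mean(diff) - 0) / (S_diff / sqrt(n))
z = (mean(diff) - 0) / (stdev(diff) / math.sqrt(n))
print(round(z, 2))
```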
41
21
Statistics and business analytics
Session 4
“Lots of numbers”
lecture!
Course announcements
• Quiz 1 results
• Quiz “walk in” office hours (data science camp quiz +
quiz 1)
– Review the feedback given in your result summary!
• 1730-1800hrs ES2
• 1800-1830hrs ES1
• Quiz 2: will open one day before SPSS lab 2 on Mon Oct
15 and close two days after the lab on Thu Oct 18
2
1
Today’s lecture
Previous class
2
Bivariate stats techniques
The purpose of the analysis and the scale level of the
variables help us decide what statistical technique to use
3
Hypothetical ANOVA example 1
4
Hypothetical ANOVA example 2 (cont.)
Today’s lecture
10
5
Insurance claims case
Table useful?
12
6
Cross tab fraud versus type
13
14
7
Step 1: Formulate the statistical hypotheses
Step 1: You formulate TWO hypotheses:
• The null hypothesis H0
– For bivariate statistics: stated in terms of no difference or no
relation
– Formal: the variables ‘fraudulent’ and ‘claim_type’ are
independent
– Here: there is no difference in likelihood (~probability) for a claim
to be fraudulent across the different claim types
• The alternative hypothesis H1 or HA
– For bivariate statistics: it states that there is a difference or there
is a relation between the variables
15
16
8
Step 3: compute a test statistic
χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + …
Oi = observed counts for cell i
Ei = expected counts for cell i when H0 is true
• Interpretation?
17
18
9
Step 3: computing Ei (previous slide)
II. Obtain the Ei’s – warning: this is tricky
Use formula: Ei = (row total × column total) / total sample size
          No                          Yes                       Total
Wind    (1054×3952)/4415 = 943.5    (1054×463)/4415 = 110.5    1054
Water    (627×3952)/4415 = 561.2     (627×463)/4415 =  65.8     627
Fire    (1039×3952)/4415 = 930.0    (1039×463)/4415 = 109.0    1039
Contam   (404×3952)/4415 = 361.6     (404×463)/4415 =  42.4     404
Theft   (1291×3952)/4415 = 1155.6   (1291×463)/4415 = 135.4    1291
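The expected counts can be checked from the row and column totals on this slide, a minimal sketch of the Ei formula rather than SPSS output:

```python
import numpy as np

# Row totals (Wind, Water, Fire, Contam, Theft) and column totals (No, Yes)
row_totals = np.array([1054, 627, 1039, 404, 1291])
col_totals = np.array([3952, 463])
n = row_totals.sum()                       # 4415 claims in total

# E_i = (row total x column total) / total sample size, for every cell
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 1))
```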
19
20
10
Step 4: prepare a statistical decision
21
22
11
Step 4: prepare a statistical decision
CAUTION
23
24
12
Step 6: make a business conclusion
25
26
13
Try it yourself!
How about size of the town? Are fraudulent claims more
likely to happen in smaller or larger cities? You should
develop a hypothesis test to investigate this research
question; use α = 0.10.
You should find that the null hypothesis is not rejected (but
borderline). Size of town does not help us assign fraud
inspectors, for instance.
27
28
14
Learning about fraud
29
Today’s lecture
30
15
Making bad graphs
Question to class
31
Bad graph 1
16
Bad graph 2
34
17
Bad graph 3
200 yrs ago people (Playfair, 1786) already knew how to do this..
36
18
Bad graph 4
19
Bad graph 5
39
40
20
Graphics in sum
• Good practices:
– Data-ink ratio should grow with the amount of data
displayed
– No “chartjunk”
41
Today’s lecture
42
21
In sum – first 5 sessions
Using statistics for decisions
• Always start with descriptive statistics
– Get the key ‘statistics’ for your variables
– Ask/compute graphics for the key variables
– Convince yourself that the data is of good quality (e.g.
sample selection, sample size, measurement, outliers
etc.)
Univariate statistics
• Given a purpose of analysis...
• Categorical variable
Descriptive
44
22
Bivariate statistics
46
23
Next class meeting: SPSS lab 2
47
Appendix
24
Statistics and business analytics
Session 5
Course announcements
Quiz 2
1
Today’s lecture
Part 2: Correlations
2
Bivariate stats techniques
The purpose of the analysis as well as the scale level of the
variables help us decide what statistical technique to use
Case: La Quinta
3
Case: La Quinta
Margin
factor
Market Demand
Competition Community Physical
Awareness Generators
Case: La Quinta
• Sample of 100 hotels
• We got a subset of the variables that measure the
factors in the profit margin model
– # of rooms within 3 mile radius (competition)
– Profit margin
4
Univariate statistics: variable ‘Margin’
5
Today’s lecture
Part 2: Correlations
11
Correlation
• When you want to describe the relation between TWO
quantitative variables, you may compute the correlation
coefficient
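As a sketch, the correlation coefficient between two quantitative variables can be computed directly. The margin and competition figures below are made up for illustration, not the case dataset:

```python
import numpy as np

# Made-up margin and nearby-rooms figures for 8 hotels (not the case data)
margin   = np.array([45.7, 50.2, 38.1, 55.0, 42.3, 47.9, 35.6, 52.4])
rooms3mi = np.array([3200, 2100, 4100, 1500, 3800, 2600, 4500, 1900])

# Pearson correlation coefficient between the two quantitative variables
r = np.corrcoef(margin, rooms3mi)[0, 1]
print(round(r, 2))   # negative: more nearby rooms, lower margin
```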
6
Correlation between profit margin and
competition is negative
r = -0.47
13
r = 0.50
14
7
Correlation between profit margin and physical
factor is negligible
r = -0.09
15
16
8
Are correlation coefficients used a lot in
applied business analytics?!
Yes!! A lot… But here are two warnings:
www.correlated.org
www.Tylervigen.com
18
9
Freakonomics: Everything Is Correlated
(04/04/2011) www.Tylervigen.com
[Chart: “People who drowned after falling out of a fishing boat (# deaths)” vs. a second series; r = 0.95]
19
[Chart: a series vs. “launches (#)”; r = 0.79]
20
10
Freakonomics: Everything Is Correlated
(04/04/2011) www.Tylervigen.com
[Chart: “Price of apples ($ per pound)” vs. a second series; r = 0.89]
Today’s lecture
Part 2: Correlations
22
11
Regression analysis
23
– Price elasticity
• Risk assessments
– Insurance polices
12
Correlation vs. regression analysis
25
Y = a + b*X
(Demand generator)
26
13
The Regression equation -
true model (in population)
Y = β0 + β1·X1 + ε
• Y: dependent variable (“Profit margin”)
• X1: independent variable (“Office space volume”); a.k.a. regressor, explanatory variable, predictor
• β0: constant (intercept)
• β1: coefficient of the independent variable (slope)
45.7 (slide 8)
28
14
Sum of squared errors is measure of
predictive accuracy
45.7 (slide 8)
29
45.7 (slide 8)
30
15
Sum of squared errors is measure of
predictive accuracy
45.7 (slide 8)
31
“Best” line
45.7 (slide 8)
32
16
Sum of squared errors is measure of
predictive accuracy
ei (green error bar) is
the difference between
the predicted value
“Best” line (red line) and the
actual data
45.7 (slide 8)
33
34
17
Does X help us explain/predict Y?
35
Question:
Can SSE>SST?
36
18
Does X help us explain/predict Y?
"Variation in Y explained"
"Total variation in Y"
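The decomposition above can be sketched on toy numbers (the x and y values below are assumed, not the case data):

```python
import numpy as np

# Toy data: find the least-squares line, then split the variation in Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)            # slope and intercept of the "best" line
y_hat = a + b * x

sse = np.sum((y - y_hat) ** 2)        # sum of squared errors (unexplained)
sst = np.sum((y - y.mean()) ** 2)     # total variation in Y
r2 = 1 - sse / sst                    # fraction of variation explained
print(round(r2, 3))
```

With a least-squares line that includes an intercept, SSE can never exceed SST.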
37
Today’s lecture
Part 2: Correlations
38
19
SPSS puts a line where sum of squared errors
is smallest
Table 1
Table 2
Table 3
“Best” line
39
40
20
SPSS output for simple linear regression
Table 2
41
21
SPSS output for simple linear regression
Table 3
44
22
Try it yourself @Home 1
23
Today’s class in sum
47
24
Statistics and business analytics
Session 6
Course announcements
SPSS lab 3 (of 5), on Thursday Oct 25th, practices
basic regressions (sessions 5 and 6)
1
Today’s lecture
• Examples:
– Sales force management: what is the relationship between
sales productivity (e.g. sales) and years of experience for
sales people?
2
Case: La Quinta – previous class meeting
Important challenge in hotel business: expanding locations
Where should La Quinta locate a new hotel?
Margin
factor
Market Demand
Competition Community Physical
Awareness Generators
3
Multiple linear regression model
Case: La Quinta
Margin
factor
Market Demand
Competition Community Physical
Awareness Generators
4
La Quinta: regression results from SPSS
5
SPSS output for linear regression
Table 1
11
6
SPSS output for simple linear regression
Table 3
*in each case we assume all other variables are held constant!
14
7
Interpreting the Coefficients*
*in each case we assume all other variables are held constant!
15
8
Which of the factors is/are most important in
determining location?
• The independent variables X1, X2,…,X6 are all measured
on different scales complicating assessing their relative
importance
– X1 (number of rooms) is a count (1,2,…)
– X2 (distance to competitor) is in miles
– X3 (volume of office space) is in 1000s sq ft
– Etc.
• The estimated regression coefficients incorporate these
scale differences and it is therefore tricky to compare
them relatively
• Two possible solutions:
– Use the standardized coefficients
– Use the t-values
17
9
Which of the factors is/are most important in
determining location?
Today’s lecture
20
10
Check the underlying assumptions of linear
regressions
21
22
11
Important aspects to check in regression
analysis
Before drawing conclusions about a population based on
a regression analysis done on a (one!) sample, first
check (at the minimum) the following five aspects:
1. Variable types: use quantitative variables [[ fun fact: we will
extend this to categorical variables in the next class meetings ]]
23
– Be aware of outliers
24
12
Is there a linear relationship between Y and the Xs?
Inspect scatter plots of the dependent variable versus the
independent variables
R2 = 50%
(note: no relation,
i.e. R² ≈ 0%, is not
per se abnormal)
25
R2 = 50%
26
13
Is there a linear relationship between Y and the Xs?
Inspect scatter plots of the dependent variable versus the
independent variables
R2 = 50%
27
• Good practice:
(1) Look for “normality” [[ next slide ]]
28
14
Inspect the residuals (errors) of your model
The 10 largest
residuals are all
less than +/- 3
(‘Std. Residual’)
which is good
(and rare)!
30
15
Aspect 5: Check for multicollinearity
• With more X variables, you run the risk of multicollinearity
– Two (or more) independent variables are highly correlated
with each other
• This poses a threat to your regression model:
– Untrustworthy estimates for your β’s (“wrong” signs)
– Low t-values (“very few predictors are significant”)
– Limits the size of R2
– Hard to assess importance of predictors
• Importance of multicollinearity problem is less severe if
your goal is prediction, however, it is more important if
your goal is explanation
• Neither detection nor solutions are obvious:
1. Compute correlation matrix among independent variables
2. Run collinearity diagnostics in SPSS
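The VIF diagnostics that SPSS reports can be sketched outside SPSS. The data below are simulated, and the standard VIF formula 1/(1 − R²) is applied to each predictor:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
    column j on all the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to x1 and x2
v = vif(np.column_stack([x1, x2, x3]))
print([round(val, 1) for val in v])    # x1 and x2 get very large VIFs
```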
31
32
16
Check for multicollinearity
33
Possible solutions:
– Increase sample size (duh!)
– get rid of some of the independent variables (duh!)
– leave as is (duh!), but do report in analysis
– [[ savvy: Factor analysis – replace two or more collinear
variables with a synthetic variable that summarizes them;
session 9 ]]
34
17
Today’s lecture
35
36
18
Using the regression model for predictions
Plug in (hypothetical) X values in your estimated
equation:
X1 = 3815; X2 = 0.9; X3 = 476; X4 = 24.5; X5 = 35; X6 = 11.2
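The plug-in step can be sketched as below. The coefficients b0 to b6 are placeholders for illustration only; the real estimates come from the SPSS regression output for the La Quinta model:

```python
# Coefficients b0..b6 below are assumed values for illustration; substitute
# the estimates from the SPSS output before using this for a real decision.
b = [38.1, -0.0008, 1.6, 0.02, 0.2, -0.4, 0.8]
x = [1, 3815, 0.9, 476, 24.5, 35, 11.2]             # leading 1 is the intercept

margin_hat = sum(bi * xi for bi, xi in zip(b, x))   # b0 + b1*X1 + ... + b6*X6
print(round(margin_hat, 2))
```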
19
In sum: choosing and using a regression model
• Choosing a regression model:
– Reasonable model fit (i.e. check the underlying assumptions
– five ‘tasks’ on slide 23)
39
• Absolutely!
40
20
Statistics and business analytics
Session 7
Course announcements
1
Previous class meetings
Linear regression models
– Quantify the relation between one dependent variable
and one or more independent variables
– Very important class of models in applied statistical
work
– Useful for explaining and predicting
– Several diagnostic tasks need to be performed before
regressions can be used for decisions [[ G.I.G.O. –
slide 23 session 6 ]]
2
Today’s lecture
3
Using statistics in this law suit case
4
Evidence for gender discrimination in salary?
Test that the means for men and women are the same
[[ see session 3 part 2 ]]
5
Including categorical independent variables
• We can NOT include categorical variables as
regressors in a regression model “as is”
– Gender: 1=male, 2 = female
– Job grade: 1=lowest,…,6=highest
– Education: 1=high school,…,5=grad school
• Why’s that?
– Regression models do multiplications, additions,
subtractions which can only be done with quantitative
variables
– Regression interpretation: a one-unit increase in X results
in a β unit increase in Y, for all X values. This is generally
too restrictive if X is categorical (e.g. next slide)
11
6
[[ bad ]] Regression model salary vs job grade
Salary (quantitative)
Example: Gender
Female: value 1 → dummy 1
Male (reference): value 2 → dummy 0
14
7
Why does dummy (0/1) coding work?
Let’s consider a better model to investigate possible salary
discrimination, by controlling for experience [[ ‘YrsExper’ –
quantitative independent variable ]]
Salary = β0 + β1 * YrsExper + β2 * Femaledummy
Two cases:
1. Employee is female ---- Femaledummy = 1
Salary = β0 + β1 * YrsExper + β2 * 1 or
Salary = (β0 + β2) + β1 * YrsExper
2. Employee is male ---- Femaledummy = 0
Salary = β0 + β1 * YrsExper + β2 * 0 or
Salary = β0 + β1 * YrsExper
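The two cases above can be sketched with the estimates reported for this model elsewhere in the deck (b0 = 35.8, b1 = 0.98, b2 = -8.0, salary in $1000s):

```python
# Estimates from the fitted salary model: Salary = b0 + b1*YrsExper + b2*Female
b0, b1, b2 = 35.8, 0.98, -8.0

def expected_salary(yrs_exper, female):
    """Expected salary (in $1000s) under the dummy-variable model."""
    return b0 + b1 * yrs_exper + b2 * (1 if female else 0)

# Same experience, different gender: the gap equals b2 exactly
gap = expected_salary(10, True) - expected_salary(10, False)
print(round(gap, 2))
```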
15
Males
Salary (quant)
Females
(nominal)
8
Detailed interpretation of the regression
coefficients on previous slide
• The intercept (b0=35.8; P-value=0.00)
– The expected starting salary for males with zero years of
experience
• The slope for years of experience (b1=0.98; P-
value=0.00)
– The expected increase in salary for one extra year of
experience at the bank for either gender
• The slope for the female dummy (b2=-8.0; P-
value=0.00)
– This is the key coefficient for this law case
– It indicates that the average salary for women is 8.0
(~$8000) lower than for men, given that they have the
same experience levels
17
Lowest 1
2nd 2
3rd 3 ???
4th 4
5th 5
Highest 6
18
9
Categorical (=nominal/ordinal) independent variables
Job Value Dum1 Dum2 Dum 3 Dum 4 Dum 5
level
Lowest 1 1 0 0 0 0
2nd 2 0 1 0 0 0
3rd 3 0 0 1 0 0
4th 4 0 0 0 1 0
5th 5 0 0 0 0 1
Highest 6 0 0 0 0 0
19
10
Salary model with YrsExp, JobGrade, Gender
22
11
@home exercise for a rainy day
23
Today’s lecture
24
12
Salary model from slide 16: two parallel lines
Males
(n=68)
Salary (quant)
Females
(n=140)
(nominal)
YrsExper (quant)
Given experience, females earn less than men. But salary
increases at the same rate for males and females. Realistic?
25
26
13
Interaction variables in regressions
• An interaction variable is a product of two
explanatory variables
– Scale level doesn’t matter (e.g. dummy×dummy,
dummy×quantitative, quantitative×quantitative)
– Useful if we believe the effect of one explanatory
variable on Y depends on the value of another
explanatory variable
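A minimal sketch of a dummy-by-quantitative interaction (the coefficients below are illustrative values, not estimates from the case):

```python
# Salary = b0 + b1*YrsExper + b2*Female + b3*(YrsExper*Female)
# Coefficients are made up for illustration only.
b0, b1, b2, b3 = 30.0, 1.5, -4.0, -0.8

def salary(yrs, female):
    f = 1 if female else 0
    return b0 + b1 * yrs + b2 * f + b3 * yrs * f   # X3 = YrsExper x Female

# Per-year raise implied by the model differs by gender when b3 != 0:
male_slope = salary(1, False) - salary(0, False)      # b1
female_slope = salary(1, True) - salary(0, True)      # b1 + b3
print(round(male_slope, 2), round(female_slope, 2))
```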
27
14
Case gender discrimination: interaction of
gender with years of experience
Y=Salary, X1=YrsExp, X2=Female (dummy), X3=X1×X2
29
15
Interaction of gender with years of experience
Males
Salary (quant)
Females
YrsExper (quant)
The effect of years of experience on salary is quite different for
male and female employees: males move up the salary ladder
much quicker!
31
Once you have included job grade (and if it still rains), you
should include education level in the model as well (slide
23). The model now already becomes pretty complex. How
does it fit the data? How would you interpret the
coefficients?
32
16
One note of caution
While not emphasized today, using dummy variables
and interaction terms does not free you from the
diagnostic tasks discussed before! (session 6 slide 23)
– G.I.G.O.!
– Dependent variable = quantitative
– Independent variables are quantitative OTHERWISE
use DUMMIES
– Linear relation between Y and quantitative X’s
– Assess goodness of fit (R square; p-values)
– Residuals (errors) must ‘behave’
– Multicollinearity
33
Today’s lecture
34
17
Generalizing linear regression
• The linearity assumption is often a good and
convenient assumption, but sometimes not realistic
• How do we know things are linear or not?
– LOOK AT YOUR DATA (scatterplots of Y and Xs, and
examine residuals)
– Economic theory
35
Ad spending ($100s)
18
Example: ad spending on sales
Sales ($100s)
Fit Y against X
Ad spending ($100s)
• With SPSS: Ŷ = 8181 + 85·X (R² = 0.66; F = 491; P-value = 0)
• Probably not: for low and large values of X we over predict Y,
for medium values of X we under predict Y. Alternatives?
37
Ad spending ($100s)
• With SPSS: Ŷ = 6773 + 190·X − 1.10·X² (R² = 0.69; etc.)
• Hard to interpret the coefficients b1=190 and b2=-1.10
• Other alternatives: fit Y against √X or LN(X) instead of X and X²
38
19
Example: ad spending on sales
Sales ($100s)
Ad spending ($100s)
• With SPSS: LN(Ŷ) = 8.5 + 0.25·LN(X)
• Interpretation: a 1% increase in X goes with a 0.25 percent
increase in Y [[ sales-advertisement elasticity ]]
39
40
20
Today’s class in sum
• How to use categorical variables as independent
variables in a linear regression model
– Use dummy variables to represent the categorical variable
21
Statistics and business analytics
Session 8
Warning!
Tough Lecture!
Course announcements
Music preference survey is available! Please fill out..
https://hec.az1.qualtrics.com/jfe/form/SV_0debqP4E3ppDOoB
1
Today’s lecture
2
In many business applications we have
categorical variables…
3
Brilliant idea!
4
Example: simple logistic regression model
(one independent variable)
log[ p / (1-p) ] = β0 + β1 * X1
Today’s lecture
10
5
Mini case: gender discrimination in salary at
large US bank
11
[Pie chart: promoted 27%, not promoted 73%]
P(promoted)?
6
Clustered bar chart promotion and gender
log[ p / (1-p) ] = β0 + β1 * X1 + β2 * X2
14
7
For logistic regression, SPSS produces LOTS of output
Here are the relevant tables (interpretation next slides)
Table 1 Table 2
Table 3
Table 4
15
Prob(Y=1) = p
Y = ‘Prom’ (1=Y, 0=N)
X1 = ‘YrsExper’ (quantitative)
X2 = ‘Gender_dum’ (1=F, 0=M)
16
8
SPSS output for logit regression
Table 1
• Generally, as the logit regression is “almost” linear (it is linear
in log-odds), much of the reasoning / interpretation / checks is
similar to linear regressions (sessions 5—7)
17
9
SPSS output for logit regression
Table 3 False positive
False negative
19
10
SPSS output for logit regression
Table 3 False positive
False negative
• There are no clear guidelines for the cut value. For instance it
depends on whether predicting a false negative or a false
positive is worse (e.g. more costly); see appendix 2.
• Sometimes a reasonable compromise is to set it to the observed
proportion of promotions in the sample (here: 27% of the sample
is promoted; see slide 12)
21
11
Today’s lecture
23
12
A more detailed interpretation: odds ratio
log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2
(p=probability promotion, X1 = YrsExper, X2 = Female_dummy)
• A more precise interpretation can be given through the odds
ratio (exp(B) column in table 4). When X1 is increased by 1, the
odds ratio is
26
13
Remark: odds ratio interpretation
• Odds ratios are not easy to interpret; they are fairly
abstract
• Make them easier to interpret through ‘baseline odds’,
which translates them to a concrete situation like “the
number of successes per the number of failures”
• Example: let’s choose as baseline odds a situation
where all X variables are put to 0
– This represents a promotion of a male with 0yrs of experience
14
Remark: odds ratio interpretation
• What if being promoted (at 0yrs experience) were rare?
• Suppose instead that these baseline odds were 0.001:
– “One male is promoted within his first year for every 1000 males
in their first year that are not”
• Now, the odds for a female would change from 0.001 to
0.00027:
– Because: 0.27 × 0.001 = 0.00027
29
30
15
Today’s lecture
31
16
Using the logit model for predictions
Example: the probability that a female employee with 5
years of experience is promoted is
Solve for p:
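Using the coefficients from the estimated model on slide 26, solving for p gives:

```python
import math

# Estimated model (slide 26): log[p/(1-p)] = -1.52 + 0.13*YrsExper - 1.32*Female
score = -1.52 + 0.13 * 5 - 1.32 * 1     # female, 5 years of experience
p = 1.0 / (1.0 + math.exp(-score))      # solve the log-odds equation for p
print(round(p, 3))
```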
17
Using the logit model for predictions
35
Blue = M
Probability of promotion
Red = F
Calculations on
previous slides
[[ What’s missing
in this graph? ]]
YrsExper
18
Important aspects to check in regression analysis
Before drawing conclusions about a population based on a
LOGIT regression analysis, first check (at the minimum) the
following five aspects:
1. Variable types: dependent variable is DICHOTOMOUS (1/0);
independent variables are quantitative otherwise dummies
2. Is there a linear relationship between the log odds and the Xs?
Hard to investigate! Rely on theory and model checks 3. and 4.
37
38
19
Appendix
39
20
Appendix 2: cut-off value for logit prediction
Predicted
No Yes
No Correct False positive
Observed
Yes False negative Correct
41
21
Appendix 2: cut-off value for logit prediction
Given these expected costs, it would be better to classify an
observation as ‘Yes’ if the expected cost of that action is lower than
the expected cost of classifying the observation as ‘No’. That is,
classify as ‘Yes’ if:
E[cost of ‘Yes’] < E[cost of ‘No’] ⟺
(1 − p) × 1 < p × 5 ⟺
1 − p < 5p ⟺
1 < 6p ⟺
p > 1/6 ≈ 0.17.
That is, our cut-off value is now 0.17 instead of 0.27, which
acknowledges the relative cost of misclassification.
Hence, when the logit model predicts for an observation that the
probability of a ‘Yes’ is larger than 0.17, i.e. p̂ > 0.17, then
classify as ‘Yes’; otherwise classify as ‘No’.
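The derivation above generalizes: classify ‘Yes’ whenever p > c_fp / (c_fp + c_fn). A sketch with the costs used here (a false negative five times as costly as a false positive):

```python
# Cost-based cut-off: classify 'Yes' when p * c_fn > (1 - p) * c_fp,
# which rearranges to p > c_fp / (c_fp + c_fn).
c_fp, c_fn = 1.0, 5.0
cutoff = c_fp / (c_fp + c_fn)           # 1/6, about 0.17
print(round(cutoff, 2))

def classify(p, cutoff):
    return "Yes" if p > cutoff else "No"

print(classify(0.20, cutoff), classify(0.10, cutoff))
```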
43
44
22
Appendix 3: odds ratio in a logit model
45
23
Appendix 4: the math behind odds ratio
• Now the ratio of the new odds over the old odds is:
exp(β0 + β1·(X + 1)) / exp(β0 + β1·X) = exp(β1)
• Hence, a one-unit increase in X multiplies the odds by exp(β1)
47
24
Statistics and business analytics
Session 9
Course announcements
• Next class meeting: session 10 of 10 (cluster analysis)
• SPSS Lab 5 of 5 (covers sessions 9 and 10): we’ll have
two lab sessions
– Lab 5.1 (Thu Nov 22)
• Quiz 5 of 5: opens one day before lab 5.2, closes two days
after lab 5.2
2
1
Today’s lecture
Part 2: Running a FA
2
Daimler/Chrysler seeks a new image
3
Daimler/Chrysler seeks a new image
Daimler/Chrysler regression
• Regression analysis based on attitudinal data
• The full model (handout pp3-5) is not very useful
– Too many regressors
– Most regressors are not significant (P-value>0.05)
– High levels of multicollinearity (average VIF = 5.6, several VIFs
> 10, several tolerance levels < 0.07)
– Many correlations between regressors over 0.8, e.g.
o “I can do anything I set my mind to” with “Skeptical predictions are
usually wrong” (r=0.96, p-value<0.05)
o “I would like to take a trip around the world” with “I wish I could leave
my present life and do something entirely different” (r=0.90 , p-
value<0.05)
o “I usually dress for fashion, not comfort” with “I am in very good
physical condition”. (r=0.92, p-value<0.05)
8
4
Daimler/Chrysler regression
10
5
Factor Analysis: to potentially help a regression
Goal: we need to run following regression:
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + … + bnXn
11
6
Today’s lecture
Part 2: Running a FA
13
14
7
Decide on the number of factors
• The maximum number of factors is the number of
variables (here: 30)
• The choice depends on..
a) Managerial decision (subjective)
15
8
Today’s lecture
Part 2: Running a FA
17
9
Rotating factors to facilitate interpretation
Factor 2
1
(1) Orthogonal rotation
(varimax is most popular) 2
5
3 4
19
20
10
Interpretation of factors
Factor 1
V11 - family is not too heavily in debt today (0.896)
V12 - pay cash for everything I buy (0.902)
V13 - spend for today & let tomorrow bring what it will (0.937)
V14 - use credit cards because I can slowly pay off bill (0.937)
V15 - seldom use coupons when I shop (0.871)
V16 - interest rates are low enough so I can buy what I want
(0.758)
21
Interpretation of factors
Factor 2
22
11
Interpretation of factors
Factor 3
23
Interpretation of factors
Factor 4
24
12
Interpretation of factors
Factor 5
25
Interpretation of factors
Factor 6
26
13
Interpretation of factors
Factor 7
27
Interpretation of factors
Factor 8
28
14
Interpretation of factors
Factor 9
29
15
Today’s lecture
Part 2: Running a FA
31
1. Face validity? That is, does the found solution make sense?
Can it be given a reasonable interpretation? [[ subjective! ]]
32
16
Evaluate the goodness of fit (2)
• How much of the variation in each variable is accounted
for by the factor solution?
• Inspect the ‘Communalities’ table (handout p12)
– A ‘0’ means no variation of that variable is explained by the
9 factors [[ could suggest another factor is needed ]]
– A ‘1’ means all the variation is explained by the factor(s)
and variable and factor are the same [[ defies purpose of
factor analysis ]]
– Ideal is somewhere in between
• Variable V6 (“Life is too short not to take some gambles”) is
most poorly captured (communality = 0.37)
• Variable V30 (“I can do anything I set my mind to”) is best
captured (communality = 0.94)
33
Today’s lecture
Part 2: Running a FA
34
17
Save factor scores for subsequent
analyses (optional)
So, what was all this “Factor Analyzing” good for?
1. It helps us give names to underlying factors or constructs
(‘super variables’) in sets of highly intercorrelated /
multicollinear variables
– Having data on many variables does not mean that we know
what is going on
– Instead, looking at a fewer number of transformed variables
often gives more comprehensible and useful information.
35
18
Configuring the Viper program using
psychological characteristics (handout pp13-14)
Style; adventure;
probably not family
traditionalism
19
Factor analysis popular in business research
Marketing
Economics
Finance
39
Today, we did all three, and combined
factor analysis with regressions, which
led to fewer variables in our regression
equation, eliminating multicollinearity
[Diagram: the X variables (X2…X5) grouped into factors F1 and F2, which predict Y]
40
20
Next class meeting
41
21
Statistics and business analytics
Session 10
Course announcements
• SPSS Lab 5 of 5 (covers sessions 9 and 10): we’ll have
two lab sessions
1
Today’s lecture
A managerial question
o We wish to create CD(s) or playlist(s) for the MBA
market. What would be the best music compilation for
this market?
o Survey (handout p1)
o Assume mean (say) over 7.0 should be
included, mean of 4.0 or less excluded
For this class the compilation should include:
2
HEC Paris MBA students music preference
(handout p2)
5
3
MBA music market
Cluster analysis
4
Let’s examine Rock vs. Rap for 11 students
Cluster 4
Cluster 3
Cluster 2 Cluster 1
A cluster solution consists of at least:
(a) the cluster means/centers
(b) cluster sizes
(c) for each observation, its cluster
assignment (a “new” categorical variable)
Cluster sizes –
Cluster   Count   %
1         2       0.18
2         3       0.27
3         3       0.27
4         3       0.27
Total     11      1.00
10
5
Let’s examine Rock vs. Rap for 11 students
To measure distance we
could use the Euclidean
distance measure, which
measures the length of the
line segment connecting two
points
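For two students with (Rock, Rap) scores, the Euclidean distance is just the length of that line segment. The scores below are made up for illustration:

```python
import math

# Euclidean distance between two students' (Rock, Rap) preference scores
def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclid((8, 2), (5, 6)))   # sqrt(3^2 + 4^2) = 5.0
```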
6
Let’s examine Rock vs. Rap for 11 students
13
14
7
K-means clustering
15
K-means clustering
16
8
Today’s lecture
17
18
9
Cross-validation to get a feel for number of
clusters in the data
• Run K-means for k = 1, 2, …, K clusters on the training
dataset. Then, examine for each choice of k how well the cluster
solution fits in the test dataset
• Plot the fit statistics against k = 1, 2, …, K
– Disclaimer: many fit statistics, no agreement on which is best
– Distance measure: choose the k that corresponds to the elbow
or kink
– Model-based proxies based on Akaike and Bayes Information
Criteria: assumes that variables are (approximately) independent
and normally distributed within each cluster; penalizes for
number of parameters; choose the k that gives the smallest AIC or BIC
– R-square based proxies: (sloppy) how much variance of the
variables can be explained by the cluster solution? Choose the k
that gives the highest R-square value
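The elbow idea can be sketched with a plain Lloyd's-algorithm K-means on simulated two-cluster data (not SPSS, and not the music data):

```python
import numpy as np

def kmeans_sse(X, k, iters=25, seed=0):
    """Run plain K-means and return the total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (Euclidean distance)
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its points (keep it if cluster empty)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return float(((X - centers[labels]) ** 2).sum())

# Two well-separated blobs: the fit improves sharply up to k = 2, then levels off
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
sse = {k: kmeans_sse(X, k) for k in (1, 2, 3)}
print({k: round(v, 1) for k, v in sse.items()})
```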
19
20
10
In-class exercise 1
Scatter plot (handout p5)
In-class exercise 1
Fusion diagram (handout p5)
11
In-class exercise 1
(handout p5)
Minimum for all three model based proxies occurs at a four cluster
solution; BIC penalizes here the most for number of parameters
23
In-class exercise 1
(handout p6)
12
Run K-means to get four cluster solution
(In-class exercise 1, cont.)
Fit statistics on
handout p6
13
In-class exercise 2: how many clusters?
Cl. 1 Cl. 2
Red = cluster 1; Black = cluster 2
Rock 5.07 4.99
Rap/HH 4.88 5.12
Size 52% 48%
14
Implementing K-means
http://rstudio-test.hec.fr/kmeans/
Today’s lecture
30
15
So, where does this leave us for the MBA
music market case?
• Data:
– 801 students (seven cohorts) provided liking responses on
10 pnt scale of 16 musical types (slide 5; handout pp1-2)
• Managerial question:
– Are there segments of students who might best be
targeted differently with different music playlists/CDs?
– If so, who are they?
• To address the managerial question, we run a cluster
analysis. We need to provide (at the minimum): (a) a
discussion of the # clusters, and (b) a discussion of the
cluster solution (cluster centers, cluster sizes, cluster
assignments)
31
32
16
MBA music market (in-class exercise 3)
• How many clusters could be supported by the data?
– Usually the metrics do not agree, and the best you can do is to
give a range for k
– For the class case, we could argue anywhere from 3–7 clusters
33
Include
Exclude
N
34
17
MBA music market (in-class exercise 4)
Genres to Include, per cluster:
Cluster 1: Rock, Jazz, Classical, Folk
Cluster 2: Pop, Rock, Jazz, Classical, Blues, RnB, RapHipHop, TechnoDance, BroadwayMovies
Cluster 3: Classical, BroadwayMovies, Jazz
Cluster 4: RapHipHop, Pop, RnB, Rock
35
36
18
MBA music market (in-class exercise 4)
Cluster 2: CD “We want it all” featuring a mix of the most popular Pop,
Rock, Jazz, Classical, Blues, RnB, RapHipHop, TechnoDance,
BroadwayMovies
Cluster 4: CD “Party People” featuring Rap, Hip Hop, RnB [[ and a bit
of Techno/Dance ]]
37
38
19
Today’s lecture
39
20
Scatter plot MBA preferences R/H vs Country
Not much! Mean (stdev) for RHH and Country are 5.9 (2.7) and 5.2 (2.6),
and r = 0.03 (P-value=0.47). Doesn’t help much for decision making.
41
Black – cl 1
Blue – cl 2
Red – cl 3
Green – cl 4
But, a cluster solution with four clusters gives insights that may
be useful for marketing decision making
42
21
Cluster analysis in sum…
44
22
APPENDIX
45
4. If the new seeds are close to the previous, stop. Otherwise, repeat
steps 2 and 3.
46
23
Appendix 2: K-means in SPSS
• Two challenges: choosing k [[ but cross-validation can help ]], and
more importantly, choosing the starting seeds (step 1 previous
slide) is tricky
• Final solution tends to be sensitive to choice of starting seeds
particularly for small(er) samples, many clustering variables, and
a “messy” cluster structure
• SPSS’ implementation of K-means is quite poor
– By default, it uses the first observations in your data spreadsheet
as starting values
– Hence, (randomly) re-arranging the rows of your data spreadsheet
could lead to (very) different cluster solutions
– Therefore, you should never-ever rely on a single clustering
solution from one set of starting values
• Because SPSS has no automated option for “randomizing”
starts, its K-means implementation is not recommended
47
24