1
Statistics is sexy?
2
Making better decisions!
Historical data
Demand Time Series
[Figure: Historical Demand (units) over Period 0–100; a probability model is fit to the historical demand, and the decision is based on the model probabilities]
3
Current/previous research (snapshot)
• Academic/consulting projects:
o LinkedIn – investigating job seeker status
o Facebook – sampling social networks
o Statistical approaches to improve data quality
• In my research, I typically target the “quantitative” (=geek-
type) academic journals; e.g.
o Product line optimization; lead article with discussions; best paper
award (International Journal of Research in Marketing 2011)
o Customer (value) analysis in heterogeneous markets
(Psychometrika 2012, Management Science 2015)
o Firm performance and the role of marketing (Journal of Marketing
2015, HBR 2015)
o Social effects in CRM campaigns (Journal of Marketing Research
2017)
Course objectives
4
Course organization
• Blackboard
• Course syllabus
• Before class:
– preparation guide (readings, case, class
discussion questions and dataset)
• After class:
– pdf of the lecture notes (WYSIWYG)
10
5
What you can expect from us
11
12
6
Mathcamp quiz: (partial) results
13
Today’s lecture
Moral of quiz: be aware of your ability to evaluate numbers
judgmentally! You may want to get some data for decision making…
14
7
Example: airline demand over time
[Figure: Historical Demand (in 10s of units) over Period 0–100]
15
• More often than not the quantities we are interested in will not
be predictable but will exhibit an inherent variation
16
8
Learning from past observations
[Figure: Historical Demand (in 10s of units) over Period 0–100, as on the previous slide]
17
[Figure: Demand Histogram – frequency of occurrence and cumulative frequency of period demand (in 10s of units); mean = 28.1, standard deviation = 10.9]
18
9
Approximate the data with a probability function
Use a well-defined probability function whose shape resembles
the frequency histogram to probabilistically represent demand.
[Figure: the demand histogram as above (mean = 28.1, standard deviation = 10.9), used to choose a matching probability function]
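[[ The idea of approximating the histogram with a probability function can be sketched in a few lines of Python (not part of the course's SPSS workflow). The demand data below is synthetic, generated to roughly match the slide's mean of 28.1 and standard deviation of 10.9 (in 10s of units). ]]

```python
import math
import random
import statistics

# Hypothetical demand history: 100 periods, roughly matching the
# slide's histogram (mean ~28.1, std. dev. ~10.9, in 10s of units)
random.seed(42)
demand = [random.gauss(28.1, 10.9) for _ in range(100)]

# "Fit" a normal model by estimating its two parameters from the data
mu_hat = statistics.mean(demand)
sigma_hat = statistics.stdev(demand)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a Normal(mu, sigma), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# The fitted model can now answer probabilistic questions, e.g. the
# chance that demand in a future period exceeds 40 (in 10s of units)
p_over_40 = 1 - normal_cdf(40, mu_hat, sigma_hat)
print(round(mu_hat, 1), round(sigma_hat, 1), round(p_over_40, 3))
```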
19
10
Organizing data
Rows: “cases” / “records” / “observations”
Columns: variables
21
Big data!
22
11
Variable types: measurement levels
Categorical: no natural numerical meaning – e.g. customer ID, gender, ZIP code, brand, hair color, class grades (A, B, C), …
Quantitative: natural numerical meaning – e.g. sales, price, stock returns, interest rate, …
Today’s lecture
24
12
Mini case: American Express
25
26
13
Working with data
2. If electronic, it depends:
– SPSS can import Excel files, text files etc.
– Etc.
28
14
SPSS: two ‘views’
29
30
15
SPSS variable view
COLUMNS are
“properties” of your
variables (slides 35—37)
31
session1&2_credit_card_web.sav
data_science_day_1_credit_card_web.sav
32
16
33
34
17
‘Variable View’ fields
• Name: short name for the variable (no spaces; no
special characters)
– Try to give it a clever, brief name referring to what the variable
is about, without using special characters or being lengthy:
• Good variable name: “ItemsGrocery” or “items_grocery”
• Bad variable name: “# of grocery items sold?”
– Good practice: always keep a record ID variable (e.g. ‘obs’)
35
[[ For instance, if you work with survey data, you may want to
include wording from the actual question in the survey ]]
– When you create an SPSS table or chart, this “label” will be the
title of the chart or table
36
18
‘Variable View’ fields, cont.
• Values: enter the words associated with your number
codes for a categorical variable
– Only needed when your variable has categories, and the
categories are described by words
– When you analyze the data, these words will appear in the
tables / charts rather than the numeric codes
37
Today’s lecture
38
19
Descriptive statistics: describing data
39
20
How to choose the “correct” descriptive
statistical technique (1)
Generally there is not “one” correct way. Your choice
depends on:
41
42
21
Today’s lecture
43
• Frequency tables
• Pie charts
• Bar/column charts
44
22
Frequency table variable ‘card’
45
46
23
Another example: date primary card issued
(~tenure of the customer)
In SPSS: Analyze – Descriptive Statistics – Frequencies
48
24
Another example: date primary card issued
(~tenure of the customer)
49
In-class exercise 1
50
25
In-class exercise 1 (discussion)
52
26
Graphical representation of frequency tables
54
27
Revisit the brand-tenure analysis (slide 48)
55
56
28
Cross tab: credit card and date issued
In SPSS: Analyze – Descriptive Statistics – Crosstabs
29
In class exercise 2
59
60
30
Graphically displaying cross tabs
Segmented bar
chart
31
Another useful option in SPSS:
recode a categorical variable
• One data transformation that is used quite often is to
recode a categorical (nominal or ordinal) variable
– To collapse categories of a categorical variable in fewer
categories (e.g. some categories are thinly populated, or
for presentation sake)
5. Click ‘OK’
64
32
Recode the variable ‘card_date_yr’ (cont.)
• SPSS added your new variable to the back of your data file
(Data View); at the bottom of the variable list (Variable
View)
• You thought you were done, but you are not!
– Update all the fields under ‘Variable View’ for this new
variable
• Enter a description under ‘Label’ (e.g. ‘Date primary card issued
before 2001 or 2001 or after’)
65
66
33
Today’s lecture
67
• Numerical methods
– Central tendency measures
– Dispersion measures
– Correlation (core stats class session 5)
• Visual methods
– Histograms
– Box plots
– Time series plots
– Scatterplots (core stats class session 5)
68
34
APPENDIX
Contents ---
69
70
35
Practice measurement scales (slide 23)
1. Quantitative variable; numbers of this scale have natural meaning;
multiplications in the context of this construct (number of students)
make sense
2. Ordinal scale (categorical variable); the numbers 1,2,3,4,5 reflect
order with respect to the underlying construct (age)
3. Nominal scale (categorical variable); the numbers (e.g.) 1=off
campus; 2=on campus are just labels
4. Quantitative variable; this can be debated. This is an attitude rating
scale. It seems reasonable to compute an average satisfaction,
hence, some arithmetic for this variable makes sense
5. Quantitative variable; same comment as example 1
6. Ordinal scale (categorical variable); same comment as example 2
7. Quantitative variable. It seems reasonable to compute an average
GMAT score, hence, some arithmetic for this variable makes
sense
71
36
Data science camp
Introduction to the core stats class
Day 2
Today’s lecture
Part 6: Wrap up
1
Analyzing quantitative variables
• Numerical methods
– Central tendency measures
– Dispersion measures
– Correlation (core stats class session 5)
• Visual methods
– Histograms
– Box plots
– Time series plots
– Scatterplots (core stats class session 5)
Research question:
2
Numerical methods – central tendencies
Interpretation?
Range = max-min
3
Graphical representation: histogram
Mode = 0.00
Median = 88.42
Mean = 129.28
25th percentile = 0.00
4
Graphical representation: boxplot
5
Temporal Data
11
Stock Performance
12
6
Mini case: amount spent on groceries
Green – retail
Brown – travel
Blue – groceries
Insights for
American
Express?
14
7
Recap of techniques for quantitative data
– Histograms
15
Today’s lecture
Part 6: Wrap up
8
Learning from data for decision making
17
• Manufacturing
– Quality control
• Marketing
– Online ad copy testing (A/B)
• Finance
– Comparing returns from different investment portfolios
9
Two forms of applied statistics
1. Descriptive statistics
2. Inferential statistics
19
20
10
The two ‘key’ concepts in statistics
Notation: population μ and σ (and proportion p); sample X̄ and S
Population: the total of all elements that share some common characteristic(s) (‘Truth’); goal of stats: learn about it
Sample: a subset of a population; goal of stats: use the sample to learn about the population
21
Opinion polls
22
11
The idea behind sampling
N (BIG) Population
Every sample of size n is equally likely: n1, n2, …, n(N choose n)
For each hypothetical sample of size n you compute a statistic T, giving T1, T2, …, T(N choose n)
A histogram of all these hypothetical T’s would determine the ‘margin of
error’ (previous slide)
23
24
12
Two sampling questions to class
25
13
Today’s lecture
Part 6: Wrap up
28
14
Mini-case: insurance claims
WSJ 09/2012
29
WSJ 09/2012
30
15
Mini-case: insurance claims
• Fairly rich dataset from a large insurance company
31
16
Key stats for claim amount
17
Two fundamental distributions in statistics
• Population distribution:
– frequency distribution (histogram) of the population elements for a
certain variable (e.g. claim amount); generally a smooth line
– It is unknown but you want to know about it
– The mean of the population distribution is μ (how could you
compute it?)
– The standard deviation of the population distribution is σ
• Sample distribution:
– frequency distribution (histogram) of the sample elements for e.g.
claim amount
– it is known once we have our sample
– the mean of the sample distribution is the sample mean X̄
– the standard deviation of the sample distribution is S
– usage: to infer (=learn) about the population distribution
35
Today’s lecture
Part 6: Wrap up
18
Statistical inference: learning about the
population mean
• Mini-case:
– Sample mean is 73.01 (in 1000$) [[ slide 32 ]]
– Can we conclude µ = 73.01?
37
Sampling error
38
19
Understanding the variability in sample means
Consider the following hypothetical ‘mini’ population (‘the truth’): μ = 72.5
Claim ID       1   2   3   4   5   6   7   8   9   10
Claim (1000$)  73  71  69  71  75  78  70  72  72  74
Now draw every possible sample of size n = 2 and compute each sample mean;
there are (10 choose 2) = 45 such samples. These 45 means will form the
sampling distribution. Of what form will this distribution be (think about
making a histogram of these 45 sample means)?
39
20
[[ Challenging ]] In-class exercise
Discuss how the sampling distribution gives insight into how
well your sample mean X̄ estimates the population mean μ.
Hint:
– Consider the following two situations:
1. Consider a sample of 100 cases from company A. The sample
mean for claims is 72 and the sample standard deviation S = 20.
What does the sampling distribution look like?
2. In a sample of 100 from company B the sample mean is 68 and
the standard deviation S = 5. What does the sampling distribution
look like?
– Which estimate (72 or 68) gives you the most confidence to infer
about the population mean?
41
Completes slide 35
42
21
Today’s lecture
Part 6: Wrap up
Uncertainty in statistics
44
22
Big picture
Constructing a confidence interval (CI) for an
unknown parameter
45
23
Confidence interval: Zconfidence for 95% area
[Figure: standard normal curve with 95% of the area between −Zconfidence and +Zconfidence]
Web: http://homepage.divms.uiowa.edu/~mbognar/applets/normal.html
HEC: http://rstudio-test.hec.fr/probcalc/
[[ FYI: check out appendix for some extra info and exercise on
probability calculations for a normal distribution ]]
48
24
Using PQRS
This is Z
49
Using PQRS
This is Z
50
25
Mini-case: insurance claims
Confidence interval for μ for the variable insurance claims
( X̄ − zc × s/√n , X̄ + zc × s/√n )
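[[ A minimal Python sketch of this interval formula. X̄ = 73.01 (in 1000$) is the mini-case sample mean; the S = 20 and n = 100 below are assumed values for illustration, since the case's actual S and n are on the earlier slides. ]]

```python
import math

def confidence_interval(x_bar, s, n, z_conf=1.96):
    """(X̄ - z·s/√n, X̄ + z·s/√n); z_conf = 1.96 gives ~95% confidence."""
    margin = z_conf * s / math.sqrt(n)
    return (x_bar - margin, x_bar + margin)

# Sample mean 73.01 (in 1000$) as in the mini-case, with an assumed
# S = 20 and n = 100 (hypothetical; not from the slide)
lo, hi = confidence_interval(73.01, 20, 100)
print(round(lo, 2), round(hi, 2))  # prints 69.09 76.93
```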
51
Interpretation:
26
In-class exercise
53
54
27
Confidence interval – couple of remarks
55
56
28
Confidence interval – remark 1 (cont.)
58
29
Confidence interval – remark 2
59
60
30
Today’s lecture
Part 6: Wrap up
62
31
Univariate, descriptive, statistics
• A very important first step [[ and sometimes the only step ]] of
statistical work:
– Get a feel for data and become friends!
– Examine data for accuracy
– Helps decide follow-up research/analyses
• Always make sure you know the basic numbers and graphs
for the key variables when you use data for decision making
63
64
32
Inferential statistics
• We often need to learn something about the “state of the
world” using a sample
• However, the “decision space” lies beyond the data
• Inferential statistics helps out:
– Realization: if I get a different sample, my statistics will
change!
– Important to know for decision making: by how much?
– Fortunately, we only need one sample (and knowledge of
the central limit theorem) to get an idea of this
– How? Compute a confidence interval! (e.g. day 2 part 5)
33
Final remarks data science camp
• Keep an eye out for the quiz!
– You need to fill it out to get a pass grade for the camp. Your
grade does not matter!
[[ Of course, take the quiz seriously. It will signal where you
stand. This material is relevant for the core stats class! ]]
• Stats class starts next week Tuesday! We’ll dive right in!
67
Today’s lecture
Part 6: Wrap up
34
@home SPSS practice
SPSS work for the credit card case (day 1)
[[ answers will be given in the how to guide ]]
69
35
@home SPSS practice
SPSS work for the insurance case (day 2)
[[ answers will be given in the how to guide ]]
71
APPENDIX
Contents ---
How to generate a simple random sample in Excel or SPSS
(slide 73)
72
36
Simple Random Sample in Excel and SPSS
• To select a simple random sample in Excel:
– Create column in Excel (labeled “random” or similar)
– Use Excel’s =RAND() function to generate a random
number for each observation
– “Freeze” the random number
• Edit-Copy-Paste Special-Values
– Sort by the random number
– Take the first n rows (n=desired random sample size)
• In SPSS, this can be done through the option
“Random sample of cases” in the menu ‘Data—
Select Cases…’
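[[ The same Excel recipe, sketched in Python with hypothetical records: attach a random key to each row, sort by it, keep the first n. (Python's `random.sample(rows, n)` does this in one call.) ]]

```python
import random

# The Excel recipe in code: attach a random number to each row,
# sort by it, and keep the first n rows
rows = [f"obs_{i}" for i in range(1, 101)]  # hypothetical 100 records
n = 10                                      # desired sample size

random.seed(7)
keyed = [(random.random(), row) for row in rows]  # "=RAND()" per row
keyed.sort()                                      # sort by the random key
sample = [row for _, row in keyed[:n]]            # take the first n rows

print(len(sample))  # prints 10
```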
73
– Etc.
74
37
Compute a new variable in SPSS
– In the ‘Target Variable’ box enter the name for the new variable (e.g.
‘spent_per_item_retail’)
– Click ‘OK’
75
76
38
Histogram of monthly $ spent per item (retail)
78
39
Probability distribution continuous variable
[Figure: probability density function f(X) of a continuous variable]
40
Continuous probability model: Normal distribution
81
82
41
Properties of the Normal distribution
[Figure: normal density; the area left of X = 0 is the probability that X < 0, the area right of it the probability that X > 0]
84
42
Normal probability calculations exercise 1&2
85
P(Demand<600) = ?
86
43
Normal probability calculations exercise 4&5
87
For an alternative solution using the 68–95–99.7 rule, see the next
slide.
88
44
Normal probability calculations solutions 1—3
89
90
45
Normal probability calculations solution Q5
Sometimes we need to find the observed value
corresponding to a given proportion. Here, we are given a
maximum probability (10%) that we don’t offer enough
flights to meet demand. How many passengers must we
plan to accommodate to not exceed this risk?
1. Fill in given
probability here
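[[ This inverse lookup ("given a probability, find the value") can be sketched in Python. The mean of 500 and standard deviation of 100 below are hypothetical, since the exercise's actual parameters are on the earlier slides. ]]

```python
from statistics import NormalDist

# Hypothetical demand model: Normal with mean 500 and std. dev. 100
# (illustrative numbers, not the exercise's actual parameters)
demand = NormalDist(mu=500, sigma=100)

# We tolerate at most a 10% chance that demand exceeds capacity,
# so we need the 90th percentile of the demand distribution
capacity = demand.inv_cdf(0.90)
print(round(capacity))  # plan for about 628 passengers
```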
46
Statistics and business analytics
Session 1
Poll of n = 732 data scientists (63% industry, 11% academia, 26% other)
Source: Kdnuggets
2
1
Course organization
• Blackboard
• Course syllabus
• Quizzes
• SPSS labs
Grading policy
• Grading scale:
– Total: 100%
2
Last topic data science camp
• Key point?
– different samples lead to different statistics (e.g.
mean, proportion, standard deviation etc.).
– question: ‘how different is different’?
• Solution?
– Sampling distribution → confidence intervals
for means
Today’s lecture
3
Today’s lecture: supporting decisions with data
Mini-case insurance claims: the accounting department
reports that the average claim last year was $63500.
Decision ‘not reject the null: NOT GUILTY (NO JAIL)’ → Correct if the truth is NOT GUILTY; False Negative (Type II error) if the truth is GUILTY
8
4
Steps in Hypothesis Testing
Problem Definition
– How much certainty do you want?
– What data have you collected?
– Conduct the appropriate test
– … through Step 6
10
5
Today’s lecture
11
12
6
Step 1: one-sided vs. two-sided hypothesis tests
• One-sided tests are focused on departures from H0 in a
single direction
– In the claims mini-case, we want to know if the new claims
are 10% higher on average than the old claims to warrant
new policy premiums
13
Truth: NOT GUILTY vs. GUILTY (GO TO JAIL)
Our decision ‘not reject the null: NOT GUILTY (NO JAIL)’ → Correct if the truth is NOT GUILTY; False Negative (Type II error) if the truth is GUILTY
14
7
Step 2: choose the significance level
The significance level is denoted by α. It indicates how
certain we are in our decision. You choose this number
(typically α = 0.10, 0.05, or 0.01), and use this in step 5.
Truth: H0: μ = 70 vs. HA: μ > 70 (new claims at least 10% higher)
Our decision ‘not reject H0 (μ = 70)’ → Correct if the truth is μ = 70; False Negative (Type II error) if the truth is μ > 70
15
8
Step 4: prepare a statistical decision
17
[Figure: standard normal curve over −2, 0, 2]
All possible Ztest values that you could get (from many many
hypothetical samples) if H0 is true
Consider two situations:
1. If the null hypothesis is true, you are likely to get Ztest values that are close to 0,
say in the white area, and you are unlikely to get Ztest values that are far away
from zero, say in the green areas
2. Therefore, it would be unlikely to get a Ztest value in the green area from your
sample, if indeed the null hypothesis is true.
18
9
Step 4: prepare a statistical decision
• A precise definition of “far” and “close”: where does
my 𝑍𝑡𝑒𝑠𝑡 value fall under the standard normal curve?
Is it in the tails (green area) or not?
Enter your
𝑍𝑡𝑒𝑠𝑡 value in
this box
10
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject the
null hypothesis
– What you do: compare the P-value to the chosen significance
level (in step 2) of the test
• The significance level of the test (denoted by α) is the
relative cut-off for deciding if the observed difference is
sufficiently big to reject the null
– You choose this value α: typically 1%, 5%, or 10%
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis
[[ Statistical warning: you never accept a null hypothesis! ]]
21
11
Step 6: make a business conclusion
• What if we had chosen 𝛼 = 0.10 in step 2 instead (i.e. we
would tolerate a larger type I error in our decision)?
• The null hypothesis is rejected for 𝛼 = 0.10.
• Conclusion: our sample with average claim size of $73000
provides evidence that the average claim amount [[ in the
population ]] is larger than $70000 and (therefore) increased by
more than 10% (Ztest = 1.39, P-value = 0.08). We would
recommend a re-evaluation of the policy prices.
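[[ The slide's numbers can be checked in a few lines of Python: a Ztest of 1.39 indeed gives a one-sided P-value of about 0.08, which is below α = 0.10. ]]

```python
import math

def p_value_one_sided(z):
    """P(Z > z) under the standard normal (upper-tail P-value)."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# Reproduce the slide: Z_test = 1.39 gives a one-sided P-value ~0.08
p = p_value_one_sided(1.39)
alpha = 0.10
print(round(p, 2), p < alpha)  # prints 0.08 True -> reject H0 at the 10% level
```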
12
Today’s lecture
25
26
13
Hypothesis test for proportions
• Mini-case credit fraud: last year, 8.5% (=0.085) of
the claims were fraudulent. Management wants to
know whether this year there were more or less
fraudulent claims.
28
14
Hypothesis test for proportions
Step 4: compute the P-value
– Tells us on a common (probability) scale whether the
sample is “close” or “far” from the null hypothesis
– Curve under a standard normal distribution
– E.g. use PQRS
– Did you find: P-value = 2*0.00 = 0.00?
Step 5: reject or not reject the null hypothesis?
– Compare the P-value to the significance level in step 2
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the
null hypothesis
– Statistical conclusion?
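[[ A sketch of the proportion test in Python. p0 = 0.085 is last year's fraud rate from the mini-case; the sample proportion (0.105) and sample size (4415) below are assumed for illustration only. ]]

```python
import math

def z_test_proportion(p_hat, p0, n):
    """Z test statistic for H0: p = p0 (two-sided test)."""
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return (p_hat - p0) / se

# p0 = 0.085 from last year (slide); p_hat and n are hypothetical
z = z_test_proportion(p_hat=0.105, p0=0.085, n=4415)
p_value = 2 * 0.5 * (1 - math.erf(abs(z) / math.sqrt(2)))  # two-sided
print(round(z, 2), round(p_value, 2))  # large |z| -> P-value ~ 0.00
```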
29
30
15
Hypothesis testing – remark 1
31
16
Statistical vs. practical significance – remark 3
• Statistical significance based on the rule “P-value < α”
(step 5) is at best a rule-of-thumb and at worst bad practice
34
17
Another illustration of type I and II errors
(slide 8)
18
Statistics and business analytics
Session 2
Course announcements
1
Today’s lecture
2
Columbus Ohio bike share
• Project goal: bike share in Columbus Ohio?
• Project kick off late Fall 2011
• Collaboration with Ohio State University,
ConsiderBiking.org and the mayor’s office of Columbus
• Employ a survey:
– Purchase intent scale (5-point (interval) scale)
– Industry rule-of-thumb: 80% of “definitely buy” and 30% of
“probably buy” actually end up buying (…)
3
Confidence intervals: recap
Confidence interval size is a function of three things:
X̄ ± Zconfidence × S/√n, where Zconfidence × S/√n is the ‘margin of error’
– the data
Specifically, the standard deviation
– the confidence level
As the confidence level increases (all else equal), the length
of the confidence interval increases.
– the sample size(s)
To control confidence interval length – choose the sample
size appropriately.
7
4
Sample size determination for a single outcome
10
5
Sample size determination for a single outcome
• From historical data: in a similar study run on the OSU
campus we had found an S of 1.55, so
n = ((1.96 × 1.55) / 0.25)² = 12.152² = 147.67 ≈ 150
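[[ The calculation as a small Python function, using the slide's inputs (z = 1.96 for 95% confidence, S = 1.55 from the historical OSU study, half-width B = 0.25): ]]

```python
import math

def sample_size_mean(z_conf, s, b):
    """n = (z * S / B)^2; B is the desired half-width (margin of error)."""
    return (z_conf * s / b) ** 2

# Slide's inputs: z = 1.96 (95%), S = 1.55, B = 0.25
n = sample_size_mean(1.96, 1.55, 0.25)
print(round(n, 2), math.ceil(n))  # prints 147.67 148 (the slide rounds to ~150)
```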
6
Sample size calculation for a proportion
(single outcome)
• Step 1: what is the desired confidence level?
– Take 95% which gives you a 𝑍𝑐𝑜𝑛𝑓 = 1.96
• Step 2: what is the smallest difference (above and
below) that has practical importance to you?
– Here we took 5% up and down, so 𝐵 = 0.05
• Step 3: working backwards from the confidence
interval formula for proportions [[ see lecture notes data
science camp day 2, part 5 ]], use the following formula
n = (Zconf² × p × (1 − p)) / B²
13
n = (1.96² × 0.5 × 0.5) / 0.05² = 0.9604 / 0.0025 = 384.16 ≈ 385
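[[ The same calculation in Python, with the conservative p = 0.5: ]]

```python
def sample_size_proportion(z_conf, p, b):
    """n = z^2 * p * (1 - p) / B^2; p = 0.5 is the conservative choice."""
    return (z_conf ** 2) * p * (1 - p) / (b ** 2)

# Slide's inputs: 95% confidence (z = 1.96), p = 0.5, B = 0.05
n = sample_size_proportion(1.96, 0.5, 0.05)
print(round(n, 2))  # prints 384.16, round up to 385
```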
14
7
Today’s lecture
15
Caution!!
8
Data challenges
17
Today’s lecture
18
9
Check your sampling as part of basic
statistical work
• Bike share on the Ohio State University campus
– A completely separate part of the project involved
sampling students on the OSU campus
– This data was analyzed separately from the
(previous) downtown study
– The sampling was done by students in my class who
sampled ‘on campus’ – not ideal (convenience
sample)!
• When the sample is in, it is good practice to check
some basic demographic variables and compare
those with population demographics (to the extent
these are known, of course)
19
20
10
Testing a hypothesis about population proportionS:
6 steps
1. Formulate the null and alternative hypotheses
2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English
21
– Ha:
22
11
Step 2: choose the significance level
Significance level is denoted by α. It indicates how certain
we are in our decision. You choose this number (typically
α = 0.10, 0.05, or 0.01), and use this in step 5.
23
24
12
Step 3: compute a test statistic
Four steps to compute a chi-square statistic
1. Write down the formula and the symbols
χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + …
Oi = observed counts for cell i
χ² = (176 − 158.4)²/158.4 + (220 − 237.6)²/237.6
   = 17.6²/158.4 + 17.6²/237.6 = 3.26
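[[ The computation can be verified in a couple of lines of Python, using the slide's observed and expected counts: ]]

```python
def chi_square(observed, expected):
    """Sum of (Oi - Ei)^2 / Ei over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The slide's two cells: O = (176, 220), E = (158.4, 237.6)
chi2 = chi_square([176, 220], [158.4, 237.6])
print(round(chi2, 2))  # prints 3.26
```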
26
13
Step 4: prepare a statistical decision
• When the χ² value you computed is:
– small (‘close to zero’) your data is close to the null
hypothesis
– large your data is ‘far away’ from the null hypothesis
27
28
14
Step 4: prepare a statistical decision
Enter your χ² value in this box
Web: http://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html
HEC: http://rstudio-test.hec.fr/probcalc/
29
30
15
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject the
null hypothesis
– What you do: compare the P-value to the chosen significance
level (in step 2) of the test
• The significance level of the test (denoted by α) is the
relative cut-off for deciding if the observed difference is
sufficiently big
– You choose this value α: typically 1%, 5%, or 10%
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis
• Statistical warning: you never accept a null hypothesis!
31
32
16
Try it yourself!
A second key demographic variable in the bike share study
is class standing. This is an ordinal variable. Hence, we
could analyze it with a frequency table. From the OSU
administration office, we know that the relative shares of
students who are Freshmen, Sophomores, Juniors and
Seniors are equal in the population. Would you say our
sample is representative?
33
17
What’s up next: SPSS lab 1 & quiz 1
• SPSS lab 1 –
– Time and location: same as regular class meetings
– It will be ‘hands-on’; my TA (Alican) and I will be there to help out
– Prepare the lab (e.g. with your team)
• Review lecture notes of sessions 1&2, data science camp
• ‘How to in SPSS’ (pdfs on Blackboard)
• Case: American Express (Data Science camp)
– You will be asked to hand in a brief assignment after the
lab (counts towards final grade); may be done in pairs
• Quiz 1 –
– Same idea as for data science camp
– Open: 1 day before SPSS lab
– Close: 2 days after SPSS lab
35
Appendix
36
18
What to do if my sample is not representative?
Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).
37
19
What to do if my sample is not representative?
Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).
– If we know (e.g. US census) that 18% of the population are
AAs and 82% are NAAs, what would be a better overall
average?
– A better average would be the weighted average:
0.18 × 80 + 0.82 × 40 = 47.2%
– We “down-weighted” the AAs data in computing the new
sample average
– This weighted average is much closer to the truth in the
population: 45% (the truth is the outcome of the election,
assuming everybody voted, or that those who voted are
representative of the whole population)
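[[ The re-weighting in code form, reproducing the slide's numbers (group supports of 80% for AAs and 40% for NAAs in the 500/500 sample): ]]

```python
# Sample of 1000 voters: 500 AA (80% support) and 500 NAA (40% support)
# Census weights: AAs are 18% of the population, NAAs 82%
unweighted = (500 * 80 + 500 * 40) / 1000  # naive sample average
weighted = 0.18 * 80 + 0.82 * 40           # re-weighted to census shares

print(unweighted, round(weighted, 1))  # prints 60.0 47.2 (truth: 45%)
```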
39
20
Statistics and business analytics
Session 3
Today’s lecture
1
Bivariate statistical analysis
• Often quite important for decision making, and a
stepping stone to multivariate statistics (e.g.
regressions)
• Examples:
– Marketing – did ad A or B generate more
clickthroughs?
– Supply chain – does the temperature affect the sales
of cola?
– Human resources – did men and women have an
equal chance of being promoted in the past year?
– Banking – are homeowners who are single more likely
to default than married homeowners?
Today and next class: we will further work on completing this table
4
2
Case insurance claims
• Same case (data) as previous sessions 1&2
• Sample of insurance claims from a large insurer
• Today’s class: can we use demographic information
to help price insurance policies?
Today’s lecture
3
Comparing two means: means plot
(X̄ = 73.01 across all policies; data science camp day 2 slide 32)
4
Side-by-side box plots help us see the
within-group variation
10
5
Step 1: Formulate the statistical hypotheses
Step 1: You formulate TWO hypotheses:
• The null hypothesis H0
– For bivariate statistics: stated in terms of no difference or no
relation
– Formal: the two variables are independent
– Example: there is no difference in average claim amounts of
retirees and non-retirees
• The alternative hypothesis H1 or HA
– For bivariate statistics: it states that there is a difference or a
relation
11
12
6
Step 3: compute a test statistic
• A test-statistic measures how close the sample has come to
the null hypothesis
• A well-thought-out test statistic (statisticians figure this out)
follows a well-known distribution such as the normal, t-, or
chi-square distribution
• For testing the difference between two population means,
use the following formula:
Zdifference = (X̄1 − X̄2) / SX̄, where SX̄ = √(S1²/n1 + S2²/n2)
13
Zdifference = (X̄1 − X̄2) / SX̄, where SX̄ = √(S1²/n1 + S2²/n2)
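[[ A sketch of the formula in Python. The group summaries below (means, standard deviations, group sizes) are hypothetical stand-ins, since the case's actual numbers are on the slides. ]]

```python
import math

def z_difference(x1, x2, s1, s2, n1, n2):
    """Z = (X1_bar - X2_bar) / sqrt(S1^2/n1 + S2^2/n2)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (x1 - x2) / se

# Hypothetical group summaries (illustration only):
# retirees: mean 65, S 18, n 400; non-retirees: mean 75, S 22, n 4015
z = z_difference(65, 75, 18, 22, 400, 4015)
print(round(z, 2))  # a large |z| -> the means clearly differ
```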
Interpretation?
14
7
Step 4: prepare a statistical decision
• Similar to the Z-test for one mean (session 1), we use the
standard normal distribution to curve the computed Zdifference
value
– E.g. use PQRS or other probability calculators
15
16
8
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject
the null hypothesis
– What you do: compare the P-value to the chosen
significance level (in step 2) of the test
17
9
@Home practice
19
Means plot
Are claim amounts from customers with different education levels,
on average, the same, or not? How should we proceed to address
this question?
20
10
Today’s lecture
21
ANalysis Of VAriance
22
11
Side-by-side box plots help us see
the within-group variation
24
12
Using ANOVA to test equality of population means
25
13
Using ANOVAs
CAUTION!
More so than any of the techniques we have learned so far,
ANOVA requires us to be more careful about examining
underlying data assumptions
1. Sample should be a random sample (or at least arguably
so)
2. Data should be approximately normally distributed within
each group
3. The variances in the different groups should be
approximately equal
27
14
Using ANOVAs – CAUTION! Check assumption 3 (slide 27)
• The variances are fairly similar for the five groups (above
table); however, it is hard to argue that within each group the
data is approximately normally distributed (previous slide)
• Therefore, we should resist the temptation to interpret and
use the previous analyses (slides 25&26).
• Instead, we could consider re-doing the analysis with the
logarithmic transformation! (why?)
29
Conclusion?
Are the assumptions after the log transform valid?
30
15
Check ANOVA assumptions
32
16
ANOVA: note on interpretation
• Rejecting the null hypothesis of equal means (e.g. slide
30) does not mean that all of the means are different!
– How do we find out which ones differ?
33
The P-values for the tests H0: µi = µj are listed in the column ‘Sig.’
(Sloppy: you do the test on slides 10—18 here 5 × 4 = 20 times)
Practice managerial interpretation
34
17
Today’s class in sum
• Statistical inference for bivariate statistical analysis
(analyzing two variables jointly)
• Particularly today: one quantitative variable and one
categorical variable
– Compare two means (t-test)
– Compare more than two means (ANOVA + multiple comparisons)
– Graphically: means plot; side-by-side box plots
• Application: insurance claims
– Are insurance claims from certain demographic groups, on
average, higher, lower, or about the same?
– Claims of retirees are, on average, lower than claims from non-
retirees; claims from clients with the least education are, on
average, lower than from clients with the most education
– These analyses provide a starting point for building pricing
models for segments
35
Appendix
36
18
Comparing two means: remark A1
37
38
19
Comparing two means: remark A2
• For instance, suppose for this sample of 4415 individuals, we
had also measured the claim amounts from two years ago.
• How would this variable show up in SPSS? Eg consider the
following hypothetical example:
39
– We cannot use the formula on slide 13, because the two means are
computed over the same set of observations
40
20
Comparing two means: remark A2
• Idea: create a new variable (see column ‘diff’ on slide 39)
• That is, test H0: µdiff = 0 (or any other number) with
Zdiff = (d̄ − 0) / (Sd / √n), where d̄ and Sd are the sample mean and standard deviation of ‘diff’
• Curve under a standard normal distribution to get P-
value
[[ Discussion book pp392—402 (11th); pp455—459 (12th); pp447—451 (13th);
in SPSS paired samples t-test ]]
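[[ A sketch of the paired test in Python with hypothetical paired claims, following the 'diff' column idea from slide 39: ]]

```python
import math
from statistics import mean, stdev

# Hypothetical paired data: this year's vs. two-years-ago claims for
# the same 8 customers (in 1000$); illustration only
now = [73, 80, 69, 75, 71, 78, 74, 70]
before = [70, 76, 71, 72, 68, 74, 73, 66]

diff = [a - b for a, b in zip(now, before)]  # one 'diff' per customer
n = len(diff)

# Test H0: mu_diff = 0 with Z = (mean(diff) - 0) / (S_diff / sqrt(n))
z = (mean(diff) - 0) / (stdev(diff) / math.sqrt(n))
print(round(z, 2))
```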
41
21
Statistics and business analytics
Session 4
“Lots of numbers”
lecture!
Course announcements
• Quiz 1 results
• Quiz “walk in” office hours (data science camp quiz +
quiz 1)
– Review the feedback given in your result summary!
• 1730-1800hrs ES2
• 1800-1830hrs ES1
• Quiz 2: will open one day before SPSS lab 2 on Mon Oct
15 and close two days after the lab on Thu Oct 18
2
1
Today’s lecture
Previous class
2
Bivariate stats techniques
The purpose of the analysis and the scale level of the
variables help us decide what statistical technique to use
3
Hypothetical ANOVA example 1
4
Hypothetical ANOVA example 2 (cont.)
Today’s lecture
10
5
Insurance claims case
Table useful?
12
6
Cross tab fraud versus type
13
14
7
Step 1: Formulate the statistical hypotheses
Step 1: You formulate TWO hypotheses:
• The null hypothesis H0
– For bivariate statistics: stated in terms of no difference or no
relation
– Formal: the variables ‘fraudulent’ and ‘claim_type’ are
independent
– Here: there is no difference in likelihood (~probability) for a claim
to be fraudulent across the different claim types
• The alternative hypothesis H1 or HA
– For bivariate statistics: it states that there is a difference or there
is a relation between the variables
15
16
8
Step 3: compute a test statistic
χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + …
Oi = observed counts for cell i
Ei = expected counts for cell i when H0 is true
• Interpretation?
17
18
9
Step 3: computing Ei (previous slide)
II. Obtain the Ei’s – warning: this is tricky
Use formula: Ei = (row total × column total) / total sample size
          No                          Yes                       Total
Wind    (1054×3952)/4415 = 943.5    (1054×463)/4415 = 110.5    1054
Water    (627×3952)/4415 = 561.2     (627×463)/4415 =  65.8     627
Fire    (1039×3952)/4415 = 930.0    (1039×463)/4415 = 109.0    1039
Contam   (404×3952)/4415 = 361.6     (404×463)/4415 =  42.4     404
Theft   (1291×3952)/4415 = 1155.6   (1291×463)/4415 = 135.4    1291
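The expected counts can be checked from the row and column totals on this slide, a minimal sketch of the Ei formula rather than SPSS output:

```python
import numpy as np

# Row totals (Wind, Water, Fire, Contam, Theft) and column totals (No, Yes)
row_totals = np.array([1054, 627, 1039, 404, 1291])
col_totals = np.array([3952, 463])
n = row_totals.sum()                       # 4415 claims in total

# E_i = (row total x column total) / total sample size, for every cell
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 1))
```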
19
20
10
Step 4: prepare a statistical decision
21
22
11
Step 4: prepare a statistical decision
CAUTION
23
24
12
Step 6: make a business conclusion
25
26
13
Try it yourself!
How about size of the town? Are fraudulent claims more
likely to happen in smaller or larger cities? You should
develop a hypothesis test to investigate this research
question; use α = 0.10.
You should find that the null hypothesis is not rejected (but
borderline). Size of town does not help us assign fraud
inspectors, for instance.
27
28
14
Learning about fraud
29
Today’s lecture
30
15
Making bad graphs
Question to class
31
Bad graph 1
16
Bad graph 2
34
17
Bad graph 3
200 yrs ago people (Playfair, 1786) already knew how to do this..
36
18
Bad graph 4
19
Bad graph 5
39
40
20
Graphics in sum
• Good practices:
– Data-ink ratio should grow with the amount of data
displayed
– No “chartjunk”
41
Today’s lecture
42
21
In sum – first 5 sessions
Using statistics for decisions
• Always start with descriptive statistics
– Get the key ‘statistics’ for your variables
– Ask/compute graphics for the key variables
– Convince yourself that the data is of good quality (e.g.
sample selection, sample size, measurement, outliers
etc.)
Univariate statistics
• Given a purpose of analysis...
• Categorical variable
Descriptive
44
22
Bivariate statistics
46
23
Next class meeting: SPSS lab 2
47
Appendix
24
Statistics and business analytics
Session 5
Course announcements
Quiz 2
1
Today’s lecture
Part 2: Correlations
2
Bivariate stats techniques
The purpose of the analysis as well as the scale level of the
variables help us decide what statistical technique to use
Case: La Quinta
3
Case: La Quinta
Margin
factor
Market Demand
Competition Community Physical
Awareness Generators
Case: La Quinta
• Sample of 100 hotels
• We got a subset of the variables that measure the
factors in the profit margin model
– # of rooms within 3 mile radius (competition)
– Profit margin
4
Univariate statistics: variable ‘Margin’
5
Today’s lecture
Part 2: Correlations
11
Correlation
• When you want to describe the relation between TWO
quantitative variables, you may compute the correlation
coefficient
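As a sketch, the correlation coefficient between two quantitative variables can be computed directly. The margin and competition figures below are made up for illustration, not the case dataset:

```python
import numpy as np

# Made-up margin and nearby-rooms figures for 8 hotels (not the case data)
margin   = np.array([45.7, 50.2, 38.1, 55.0, 42.3, 47.9, 35.6, 52.4])
rooms3mi = np.array([3200, 2100, 4100, 1500, 3800, 2600, 4500, 1900])

# Pearson correlation coefficient between the two quantitative variables
r = np.corrcoef(margin, rooms3mi)[0, 1]
print(round(r, 2))   # negative: more nearby rooms, lower margin
```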
6
Correlation between profit margin and
competition is negative
r = -0.47
13
r = 0.50
14
7
Correlation between profit margin and physical
factor is negligible
r = -0.09
15
16
8
Are correlation coefficients used a lot in
applied business analytics?!
Yes!! A lot… But here are two warnings:
www.correlated.org
www.Tylervigen.com
18
9
Freakonomics: Everything Is Correlated
(04/04/2011) www.Tylervigen.com
[Chart: “People who drowned after falling out of a fishing boat (# deaths)” vs. a second series; r = 0.95]
19
[Chart: a series vs. “launches (#)”; r = 0.79]
20
10
Freakonomics: Everything Is Correlated
(04/04/2011) www.Tylervigen.com
[Chart: “Price of apples ($ per pound)” vs. a second series; r = 0.89]
Today’s lecture
Part 2: Correlations
22
11
Regression analysis
23
– Price elasticity
• Risk assessments
– Insurance polices
12
Correlation vs. regression analysis
25
Y = a + b*X
(Demand generator)
26
13
The Regression equation -
true model (in population)
Y = β0 + β1·X1 + ε
• Y: dependent variable (“Profit margin”)
• X1: independent variable (“Office space volume”); a.k.a. regressor, explanatory variable, predictor
• β0: constant (intercept)
• β1: coefficient of the independent variable (slope)
45.7 (slide 8)
28
14
Sum of squared errors is measure of
predictive accuracy
45.7 (slide 8)
29
45.7 (slide 8)
30
15
Sum of squared errors is measure of
predictive accuracy
45.7 (slide 8)
31
“Best” line
45.7 (slide 8)
32
16
Sum of squared errors is measure of
predictive accuracy
ei (green error bar) is
the difference between
the predicted value
“Best” line (red line) and the
actual data
45.7 (slide 8)
33
34
17
Does X help us explain/predict Y?
35
Question:
Can SSE>SST?
36
18
Does X help us explain/predict Y?
"Variation in Y explained"
"Total variation in Y"
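The decomposition above can be sketched on toy numbers (the x and y values below are assumed, not the case data):

```python
import numpy as np

# Toy data: find the least-squares line, then split the variation in Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)            # slope and intercept of the "best" line
y_hat = a + b * x

sse = np.sum((y - y_hat) ** 2)        # sum of squared errors (unexplained)
sst = np.sum((y - y.mean()) ** 2)     # total variation in Y
r2 = 1 - sse / sst                    # fraction of variation explained
print(round(r2, 3))
```

With a least-squares line that includes an intercept, SSE can never exceed SST.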
37
Today’s lecture
Part 2: Correlations
38
19
SPSS puts a line where sum of squared errors
is smallest
Table 1
Table 2
Table 3
“Best” line
39
40
20
SPSS output for simple linear regression
Table 2
41
21
SPSS output for simple linear regression
Table 3
44
22
Try it yourself @Home 1
23
Today’s class in sum
47
24
Statistics and business analytics
Session 6
Course announcements
SPSS lab 3 (of 5), on Thursday Oct 25th, practices
basic regressions (sessions 5 and 6)
1
Today’s lecture
• Examples:
– Sales force management: what is the relationship between
sales productivity (e.g. sales) and years of experience for
sales people?
2
Case: La Quinta – previous class meeting
Important challenge in hotel business: expanding locations
Where should La Quinta locate a new hotel?
Margin
factor
Market Demand
Competition Community Physical
Awareness Generators
3
Multiple linear regression model
Case: La Quinta
Margin
factor
Market Demand
Competition Community Physical
Awareness Generators
4
La Quinta: regression results from SPSS
5
SPSS output for linear regression
Table 1
11
6
SPSS output for simple linear regression
Table 3
*in each case we assume all other variables are held constant!
14
7
Interpreting the Coefficients*
*in each case we assume all other variables are held constant!
15
8
Which of the factors is/are most important in
determining location?
• The independent variables X1, X2,…,X6 are all measured
on different scales complicating assessing their relative
importance
– X1 (number of rooms) is a count (1,2,…)
– X2 (distance to competitor) is in miles
– X3 (volume of office space) is in 1000s sq ft
– Etc.
• The estimated regression coefficients incorporate these
scale differences and it is therefore tricky to compare
them relatively
• Two possible solutions:
– Use the standardized coefficients
– Use the t-values
17
9
Which of the factors is/are most important in
determining location?
Today’s lecture
20
10
Check the underlying assumptions of linear
regressions
21
22
11
Important aspects to check in regression
analysis
Before drawing conclusions about a population based on
a regression analysis done on a (one!) sample, first
check (at the minimum) the following five aspects:
1. Variable types: use quantitative variables [[ fun fact: we will
extend this to categorical variables in the next class meetings ]]
23
– Be aware of outliers
24
12
Is there a linear relationship between Y and the Xs?
Inspect scatter plots of the dependent variable versus the
independent variables
R2 = 50%
(note: no relation,
i.e. R² ≈ 0%, is not
per se abnormal)
25
R2 = 50%
26
13
Is there a linear relationship between Y and the Xs?
Inspect scatter plots of the dependent variable versus the
independent variables
R2 = 50%
27
• Good practice:
(1) Look for “normality” [[ next slide ]]
28
14
Inspect the residuals (errors) of your model
The 10 largest
residuals are all
less than +/- 3
(‘Std. Residual’)
which is good
(and rare)!
30
15
Aspect 5: Check for multicollinearity
• With more X variables, you run the risk of multicollinearity
– Two (or more) independent variables are highly correlated
with each other
• This poses a threat to your regression model:
– Untrustworthy estimates for your β’s (“wrong” signs)
– Low t-values (“very few predictors are significant”)
– Limits the size of R2
– Hard to assess importance of predictors
• Importance of multicollinearity problem is less severe if
your goal is prediction, however, it is more important if
your goal is explanation
• Neither detection nor solutions are obvious:
1. Compute correlation matrix among independent variables
2. Run collinearity diagnostics in SPSS
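The VIF diagnostics that SPSS reports can be sketched outside SPSS. The data below are simulated, and the standard VIF formula 1/(1 − R²) is applied to each predictor:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
    column j on all the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to x1 and x2
v = vif(np.column_stack([x1, x2, x3]))
print([round(val, 1) for val in v])    # x1 and x2 get very large VIFs
```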
31
32
16
Check for multicollinearity
33
Possible solutions:
– Increase sample size (duh!)
– get rid of some of the independent variables (duh!)
– leave as is (duh!), but do report in analysis
– [[ savvy: Factor analysis – replace two or more collinear
variables with a synthetic variable that summarizes them;
session 9 ]]
34
17
Today’s lecture
35
36
18
Using the regression model for predictions
Plug in (hypothetical) X values in your estimated
equation:
X1 = 3815; X2 = 0.9; X3 = 476; X4 = 24.5; X5 = 35; X6 = 11.2
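The plug-in step can be sketched as below. The coefficients b0 to b6 are placeholders for illustration only; the real estimates come from the SPSS regression output for the La Quinta model:

```python
# Coefficients b0..b6 below are assumed values for illustration; substitute
# the estimates from the SPSS output before using this for a real decision.
b = [38.1, -0.0008, 1.6, 0.02, 0.2, -0.4, 0.8]
x = [1, 3815, 0.9, 476, 24.5, 35, 11.2]             # leading 1 is the intercept

margin_hat = sum(bi * xi for bi, xi in zip(b, x))   # b0 + b1*X1 + ... + b6*X6
print(round(margin_hat, 2))
```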
19
In sum: choosing and using a regression model
• Choosing a regression model:
– Reasonable model fit (i.e. check the underlying assumptions
– five ‘tasks’ on slide 23)
39
• Absolutely!
40
20
Statistics and business analytics
Session 7
Course announcements
1
Previous class meetings
Linear regression models
– Quantify the relation between one dependent variable
and one or more independent variables
– Very important class of models in applied statistical
work
– Useful for explaining and predicting
– Several diagnostic tasks need to be performed before
regressions can be used for decisions [[ G.I.G.O. –
slide 23 session 6 ]]
2
Today’s lecture
3
Using statistics in this law suit case
4
Evidence for gender discrimination in salary?
Test that the means for men and women are the same
[[ see session 3 part 2 ]]
5
Including categorical independent variables
• We can NOT include categorical variables as
regressors in a regression model “as is”
– Gender: 1=male, 2 = female
– Job grade: 1=lowest,…,6=highest
– Education: 1=high school,…,5=grad school
• Why’s that?
– Regression models do multiplications, additions,
subtractions which can only be done with quantitative
variables
– Regression interpretation: a one-unit increase in X results
in a β unit increase in Y, for all X values. This is generally
too restrictive if X is categorical (e.g. next slide)
11
6
[[ bad ]] Regression model salary vs job grade
Salary (quantitative)
Example: Gender
Female: value 1 → dummy 1
Male (reference): value 2 → dummy 0
14
7
Why does dummy (0/1) coding work?
Let’s consider a better model to investigate possible salary
discrimination, by controlling for experience [[ ‘YrsExper’ –
quantitative independent variable ]]
Salary = β0 + β1 * YrsExper + β2 * Femaledummy
Two cases:
1. Employee is female ---- Femaledummy = 1
Salary = β0 + β1 * YrsExper + β2 * 1 or
Salary = (β0 + β2) + β1 * YrsExper
2. Employee is male ---- Femaledummy = 0
Salary = β0 + β1 * YrsExper + β2 * 0 or
Salary = β0 + β1 * YrsExper
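The two cases above can be sketched with the estimates reported for this model elsewhere in the deck (b0 = 35.8, b1 = 0.98, b2 = -8.0, salary in $1000s):

```python
# Estimates from the fitted salary model: Salary = b0 + b1*YrsExper + b2*Female
b0, b1, b2 = 35.8, 0.98, -8.0

def expected_salary(yrs_exper, female):
    """Expected salary (in $1000s) under the dummy-variable model."""
    return b0 + b1 * yrs_exper + b2 * (1 if female else 0)

# Same experience, different gender: the gap equals b2 exactly
gap = expected_salary(10, True) - expected_salary(10, False)
print(round(gap, 2))
```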
15
Males
Salary (quant)
Females
(nominal)
8
Detailed interpretation of the regression
coefficients on previous slide
• The intercept (b0=35.8; P-value=0.00)
– The expected starting salary for males with zero years of
experience
• The slope for years of experience (b1=0.98; P-
value=0.00)
– The expected increase in salary for one extra year of
experience at the bank for either gender
• The slope for the female dummy (b2=-8.0; P-
value=0.00)
– This is the key coefficient for this law case
– It indicates that the average salary for women is 8.0
(~$8000) lower than for men, given that they have the
same experience levels
17
Lowest 1
2nd 2
3rd 3 ???
4th 4
5th 5
Highest 6
18
9
Categorical (=nominal/ordinal) independent variables
Job Value Dum1 Dum2 Dum 3 Dum 4 Dum 5
level
Lowest 1 1 0 0 0 0
2nd 2 0 1 0 0 0
3rd 3 0 0 1 0 0
4th 4 0 0 0 1 0
5th 5 0 0 0 0 1
Highest 6 0 0 0 0 0
19
10
Salary model with YrsExp, JobGrade, Gender
22
11
@home exercise for a rainy day
23
Today’s lecture
24
12
Salary model from slide 16: two parallel lines
Males
(n=68)
Salary (quant)
Females
(n=140)
(nominal)
YrsExper (quant)
Given experience, females earn less than men. But salary
increases at the same rate for males and females. Realistic?
25
26
13
Interaction variables in regressions
• An interaction variable is a product of two
explanatory variables
– Scale level doesn’t matter (e.g. dummy×dummy,
dummy×quantitative, quantitative×quantitative)
– Useful if we believe the effect of one explanatory
variable on Y depends on the value of another
explanatory variable
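A minimal sketch of a dummy-by-quantitative interaction (the coefficients below are illustrative values, not estimates from the case):

```python
# Salary = b0 + b1*YrsExper + b2*Female + b3*(YrsExper*Female)
# Coefficients are made up for illustration only.
b0, b1, b2, b3 = 30.0, 1.5, -4.0, -0.8

def salary(yrs, female):
    f = 1 if female else 0
    return b0 + b1 * yrs + b2 * f + b3 * yrs * f   # X3 = YrsExper x Female

# Per-year raise implied by the model differs by gender when b3 != 0:
male_slope = salary(1, False) - salary(0, False)      # b1
female_slope = salary(1, True) - salary(0, True)      # b1 + b3
print(round(male_slope, 2), round(female_slope, 2))
```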
27
14
Case gender discrimination: interaction of
gender with years of experience
Y=Salary, X1=YrsExp, X2=Female (dummy), X3=X1×X2
29
15
Interaction of gender with years of experience
Males
Salary (quant)
Females
YrsExper (quant)
The effect of years of experience on salary is quite different for
male and female employees: males move up the salary ladder
much quicker!
31
Once you have included job grade (and if it still rains), you
should include education level in the model as well (slide
23). The model now already becomes pretty complex. How
does it fit the data? How would you interpret the
coefficients?
32
16
One note of caution
While not emphasized today, using dummy variables
and interaction terms does not free you from the
diagnostic tasks discussed before! (session 6 slide 23)
– G.I.G.O.!
– Dependent variable = quantitative
– Independent variables are quantitative OTHERWISE
use DUMMIES
– Linear relation between Y and quantitative X’s
– Assess goodness of fit (R square; p-values)
– Residuals (errors) must ‘behave’
– Multicollinearity
33
Today’s lecture
34
17
Generalizing linear regression
• The linearity assumption is often a good and
convenient assumption, but sometimes not realistic
• How do we know things are linear or not?
– LOOK AT YOUR DATA (scatterplots of Y and Xs, and
examine residuals)
– Economic theory
35
Ad spending ($100s)
18
Example: ad spending on sales
Sales ($100s)
Fit Y against X
Ad spending ($100s)
• With SPSS: Ŷ = 8181 + 85·X (R² = 0.66; F = 491; P-value = 0)
• Probably not: for low and large values of X we over predict Y,
for medium values of X we under predict Y. Alternatives?
37
Ad spending ($100s)
• With SPSS: Ŷ = 6773 + 190·X − 1.10·X² (R² = 0.69; etc.)
• Hard to interpret the coefficients b1=190 and b2=-1.10
• Other alternatives: fit Y against √X or LN(X) instead of X and X²
38
19
Example: ad spending on sales
Sales ($100s)
Ad spending ($100s)
• With SPSS: LN(Ŷ) = 8.5 + 0.25·LN(X)
• Interpretation: a 1% increase in X goes with a 0.25 percent
increase in Y [[ sales-advertisement elasticity ]]
39
40
20
Today’s class in sum
• How to use categorical variables as independent
variables in a linear regression model
– Use dummy variables to represent the categorical variable
21
Statistics and business analytics
Session 8
Warning!
Tough Lecture!
Course announcements
Music preference survey is available! Please fill out..
https://hec.az1.qualtrics.com/jfe/form/SV_0debqP4E3ppDOoB
1
Today’s lecture
2
In many business applications we have
categorical variables…
3
Brilliant idea!
4
Example: simple logistic regression model
(one independent variable)
log[ p / (1-p) ] = β0 + β1 * X1
Today’s lecture
10
5
Mini case: gender discrimination in salary at
large US bank
11
[Pie chart: promoted 27%, not promoted 73%]
P(promoted)?
6
Clustered bar chart promotion and gender
log[ p / (1-p) ] = β0 + β1 * X1 + β2 * X2
14
7
For logistic regression, SPSS produces LOTS of output
Here are the relevant tables (interpretation next slides)
Table 1 Table 2
Table 3
Table 4
15
Prob(Y=1) = p
Y = ‘Prom’ (1=Y, 0=N)
X1 = ‘YrsExper’ (quantitative)
X2 = ‘Gender_dum’ (1=F, 0=M)
16
8
SPSS output for logit regression
Table 1
• Generally, as the logit regression is “almost” linear (it is linear
in log-odds), much of the reasoning / interpretation / checks is
similar to linear regressions (sessions 5—7)
17
9
SPSS output for logit regression
Table 3 False positive
False negative
19
10
SPSS output for logit regression
Table 3 False positive
False negative
• There are no clear guidelines for the cut value. For instance it
depends on whether predicting a false negative or a false
positive is worse (e.g. more costly); see appendix 2.
• Sometimes a reasonable compromise is to set it to the observed
proportion of promotions in the sample (here: 27% of the sample
is promoted; see slide 12)
21
11
Today’s lecture
23
12
A more detailed interpretation: odds ratio
log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2
(p=probability promotion, X1 = YrsExper, X2 = Female_dummy)
• A more precise interpretation can be given through the odds
ratio (exp(B) column in table 4). When X1 is increased by 1, the
odds ratio is
26
13
Remark: odds ratio interpretation
• Odds ratios are not easy to interpret; they are fairly
abstract
• Make them easier to interpret through ‘baseline odds’,
which translates them to a concrete situation like “the
number of successes per the number of failures”
• Example: let’s choose as baseline odds a situation
where all X variables are put to 0
– This represents a promotion of a male with 0yrs of experience
14
Remark: odds ratio interpretation
• What if being promoted (at 0yrs experience) were rare?
• Suppose instead that these baseline odds were 0.001:
– “One male is promoted within his first year for every 1000 males
in their first year that are not”
• Now, the odds for a female would change from 0.001 to
0.00027:
– Because: 0.27 × 0.001 = 0.00027
29
30
15
Today’s lecture
31
16
Using the logit model for predictions
Example: the probability that a female employee with 5
years of experience is promoted is
Solve for p:
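Using the coefficients from the estimated model on slide 26, solving for p gives:

```python
import math

# Estimated model (slide 26): log[p/(1-p)] = -1.52 + 0.13*YrsExper - 1.32*Female
score = -1.52 + 0.13 * 5 - 1.32 * 1     # female, 5 years of experience
p = 1.0 / (1.0 + math.exp(-score))      # solve the log-odds equation for p
print(round(p, 3))
```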
17
Using the logit model for predictions
35
Blue = M
Probability of promotion
Red = F
Calculations on
previous slides
[[ What’s missing
in this graph? ]]
YrsExper
18
Important aspects to check in regression analysis
Before drawing conclusions about a population based on a
LOGIT regression analysis, first check (at the minimum) the
following five aspects:
1. Variable types: dependent variable is DICHOTOMOUS (1/0);
independent variables are quantitative otherwise dummies
2. Is there a linear relationship between the log odds and the Xs?
Hard to investigate! Rely on theory and model checks 3. and 4.
37
38
19
Appendix
39
20
Appendix 2: cut-off value for logit prediction
Predicted
No Yes
No Correct False positive
Observed
Yes False negative Correct
41
21
Appendix 2: cut-off value for logit prediction
Given these expected costs, it would be better to classify an
observation as ‘Yes’ if the expected cost of that action is lower than
the expected cost of classifying the observation as ‘No’. That is,
classify as ‘Yes’ if:
E[cost of ‘Yes’] < E[cost of ‘No’] ⟺
(1 − p) × 1 < p × 5 ⟺
1 − p < 5p ⟺
1 < 6p ⟺
p > 1/6 ≈ 0.17.
That is, our cut-off value is now 0.17 instead of 0.27, which
acknowledges the relative cost of misclassification.
Hence, when the logit model predicts for an observation that the
probability of a ‘Yes’ is larger than 0.17, i.e. p̂ > 0.17, then
classify as ‘Yes’; otherwise classify as ‘No’.
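The derivation above generalizes: classify ‘Yes’ whenever p > c_fp / (c_fp + c_fn). A sketch with the costs used here (a false negative five times as costly as a false positive):

```python
# Cost-based cut-off: classify 'Yes' when p * c_fn > (1 - p) * c_fp,
# which rearranges to p > c_fp / (c_fp + c_fn).
c_fp, c_fn = 1.0, 5.0
cutoff = c_fp / (c_fp + c_fn)           # 1/6, about 0.17
print(round(cutoff, 2))

def classify(p, cutoff):
    return "Yes" if p > cutoff else "No"

print(classify(0.20, cutoff), classify(0.10, cutoff))
```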
43
44
22
Appendix 3: odds ratio in a logit model
45
23
Appendix 4: the math behind odds ratio
• Now the ratio of the new odds over the old odds is:
exp(β0 + β1·(X + 1)) / exp(β0 + β1·X) = exp(β1)
• Hence, a one-unit increase in X multiplies the odds by exp(β1)
47
24
Statistics and business analytics
Session 9
Course announcements
• Next class meeting: session 10 of 10 (cluster analysis)
• SPSS Lab 5 of 5 (covers sessions 9 and 10): we’ll have
two lab sessions
– Lab 5.1 (Thu Nov 22)
• Quiz 5 of 5: opens one day before lab 5.2, closes two days
after lab 5.2
2
1
Today’s lecture
Part 2: Running a FA
2
Daimler/Chrysler seeks a new image
3
Daimler/Chrysler seeks a new image
Daimler/Chrysler regression
• Regression analysis based on attitudinal data
• The full model (handout pp3-5) is not very useful
– Too many regressors
– Most regressors are not significant (P-value>0.05)
– High levels of multicollinearity (average VIF = 5.6, several VIFs
> 10, several tolerance levels < 0.07)
– Many correlations between regressors over 0.8, e.g.
o “I can do anything I set my mind to” with “Skeptical predictions are
usually wrong” (r=0.96, p-value<0.05)
o “I would like to take a trip around the world” with “I wish I could leave
my present life and do something entirely different” (r=0.90 , p-
value<0.05)
o “I usually dress for fashion, not comfort” with “I am in very good
physical condition”. (r=0.92, p-value<0.05)
8
4
Daimler/Chrysler regression
10
5
Factor Analysis: to potentially help a regression
Goal: we need to run following regression:
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + … + bnXn
11
6
Today’s lecture
Part 2: Running a FA
13
14
7
Decide on the number of factors
• The maximum number of factors is the number of
variables (here: 30)
• The choice depends on..
a) Managerial decision (subjective)
15
8
Today’s lecture
Part 2: Running a FA
17
9
Rotating factors to facilitate interpretation
Factor 2
1
(1) Orthogonal rotation
(varimax is most popular) 2
5
3 4
19
20
10
Interpretation of factors
Factor 1
V11 - family is not too heavily in debt today (0.896)
V12 - pay cash for everything I buy (0.902)
V13 - spend for today & let tomorrow bring what it will (0.937)
V14 - use credit cards because I can slowly pay off bill (0.937)
V15 - seldom use coupons when I shop (0.871)
V16 - interest rates are low enough so I can buy what I want
(0.758)
21
Interpretation of factors
Factor 2
22
11
Interpretation of factors
Factor 3
23
Interpretation of factors
Factor 4
24
12
Interpretation of factors
Factor 5
25
Interpretation of factors
Factor 6
26
13
Interpretation of factors
Factor 7
27
Interpretation of factors
Factor 8
28
14
Interpretation of factors
Factor 9
29
15
Today’s lecture
Part 2: Running a FA
31
1. Face validity? That is, does the found solution make sense?
Can it be given a reasonable interpretation? [[ subjective! ]]
32
16
Evaluate the goodness of fit (2)
• How much of the variation in each variable is accounted
for by the factor solution?
• Inspect the ‘Communalities’ table (handout p12)
– A ‘0’ means no variation of that variable is explained by the
9 factors [[ could suggest another factor is needed ]]
– A ‘1’ means all the variation is explained by the factor(s)
and variable and factor are the same [[ defies purpose of
factor analysis ]]
– Ideal is somewhere in between
• Variable V6 (“Life is too short not to take some gambles”) is
most poorly captured (communality = 0.37)
• Variable V30 (“I can do anything I set my mind to”) is best
captured (communality = 0.94)
33
Today’s lecture
Part 2: Running a FA
34
17
Save factor scores for subsequent
analyses (optional)
So, what was all this “Factor Analyzing” good for?
1. It helps us give names to underlying factors or constructs
(‘super variables’) in sets of highly intercorrelated /
multicollinear variables
– Having data on many variables does not mean that we know
what is going on
– Instead, looking at a fewer number of transformed variables
often gives more comprehensible and useful information.
35
18
Configuring the Viper program using
psychological characteristics (handout pp13-14)
Style; adventure;
probably not family
traditionalism
19
Factor analysis popular in business research
Marketing
Economics
Finance
39
Today, we did all three, and combined
factor analysis with regressions, which
led to fewer variables in our regression
equation, eliminating multicollinearity
[Diagram: the X variables (X2…X5) grouped into factors F1 and F2, which predict Y]
40
20
Next class meeting
41
21
Statistics and business analytics
Session 10
Course announcements
• SPSS Lab 5 of 5 (covers sessions 9 and 10): we’ll have
two lab sessions
1
Today’s lecture
A managerial question
o We wish to create CD(s) or playlist(s) for the MBA
market. What would be the best music compilation for
this market?
o Survey (handout p1)
o Assume mean (say) over 7.0 should be
included, mean of 4.0 or less excluded
For this class the compilation should include:
2
HEC Paris MBA students music preference
(handout p2)
5
3
MBA music market
Cluster analysis
4
Let’s examine Rock vs. Rap for 11 students
Cluster 4
Cluster 3
Cluster 2 Cluster 1
A cluster solution consists of at least:
(a) the cluster means/centers
(b) cluster sizes
(c) for each observation, its cluster
assignment (a “new” categorical variable)
Cluster sizes –
Cluster   Count   %
1         2       0.18
2         3       0.27
3         3       0.27
4         3       0.27
Total     11      1.00
10
5
Let’s examine Rock vs. Rap for 11 students
To measure distance we
could use the Euclidean
distance measure, which
measures the length of the
line segment connecting two
points
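For two students with (Rock, Rap) scores, the Euclidean distance is just the length of that line segment. The scores below are made up for illustration:

```python
import math

# Euclidean distance between two students' (Rock, Rap) preference scores
def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclid((8, 2), (5, 6)))   # sqrt(3^2 + 4^2) = 5.0
```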
6
Let’s examine Rock vs. Rap for 11 students
13
14
7
K-means clustering
15
K-means clustering
16
8
Today’s lecture
17
18
9
Cross-validation to get a feel for number of
clusters in the data
• Run K-means for k = 1, 2, …, K clusters on the training
dataset. Then, examine for each choice of k how well the cluster
solution fits in the test dataset
• Plot the fit statistics against k = 1, 2, …, K
– Disclaimer: many fit statistics, no agreement on which is best
– Distance measure: choose the k that corresponds to the elbow
or kink
– Model-based proxies based on Akaike and Bayes Information
Criteria: assumes that variables are (approximately) independent
and normally distributed within each cluster; penalizes for
number of parameters; choose the k that gives the smallest AIC or BIC
– R-square based proxies: (sloppy) how much variance of the
variables can be explained by the cluster solution? Choose the k
that gives the highest R-square value
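The elbow idea can be sketched with a plain Lloyd's-algorithm K-means on simulated two-cluster data (not SPSS, and not the music data):

```python
import numpy as np

def kmeans_sse(X, k, iters=25, seed=0):
    """Run plain K-means and return the total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (Euclidean distance)
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its points (keep it if cluster empty)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return float(((X - centers[labels]) ** 2).sum())

# Two well-separated blobs: the fit improves sharply up to k = 2, then levels off
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
sse = {k: kmeans_sse(X, k) for k in (1, 2, 3)}
print({k: round(v, 1) for k, v in sse.items()})
```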
19
20
10
In-class exercise 1
Scatter plot (handout p5)
In-class exercise 1
Fusion diagram (handout p5)
11
In-class exercise 1
(handout p5)
Minimum for all three model based proxies occurs at a four cluster
solution; BIC penalizes here the most for number of parameters
23
In-class exercise 1
(handout p6)
12
Run K-means to get four cluster solution
(In-class exercise 1, cont.)
Fit statistics on
handout p6
13
In-class exercise 2: how many clusters?
Cl. 1 Cl. 2
Red = cluster 1; Black = cluster 2
Rock 5.07 4.99
Rap/HH 4.88 5.12
Size 52% 48%
14
Implementing K-means
http://rstudio-test.hec.fr/kmeans/
Today’s lecture
30
15
So, where does this leave us for the MBA
music market case?
• Data:
– 801 students (seven cohorts) provided liking responses on
10 pnt scale of 16 musical types (slide 5; handout pp1-2)
• Managerial question:
– Are there segments of students who might best be
targeted differently with different music playlists/CDs?
– If so, who are they?
• To address the managerial question, we run a cluster
analysis. We need to provide (at the minimum): (a) a
discussion of the # clusters, and (b) a discussion of the
cluster solution (cluster centers, cluster sizes, cluster
assignments)
31
32
16
MBA music market (in-class exercise 3)
• How many clusters could be supported by the data?
– Usually the metrics do not agree, and the best you can do is to
give a range for k
– For the class case, we could argue anywhere from 3–7 clusters
33
Include
Exclude
N
34
17
MBA music market (in-class exercise 4)
Genres to Include, per cluster:
Cluster 1: Rock, Jazz, Classical, Folk
Cluster 2: Pop, Rock, Jazz, Classical, Blues, RnB, RapHipHop, TechnoDance, BroadwayMovies
Cluster 3: Classical, BroadwayMovies, Jazz
Cluster 4: RapHipHop, Pop, RnB, Rock
35
36
18
MBA music market (in-class exercise 4)
Cluster 2: CD “We want it all” featuring a mix of the most popular Pop,
Rock, Jazz, Classical, Blues, RnB, RapHipHop, TechnoDance,
BroadwayMovies
Cluster 4: CD “Party People” featuring Rap, Hip Hop, RnB [[ and a bit
of Techno/Dance ]]
37
38
19
Today’s lecture
39
20
Scatter plot MBA preferences R/H vs Country
Not much! Mean (stdev) for RHH and Country are 5.9 (2.7) and 5.2 (2.6),
and r = 0.03 (P-value=0.47). Doesn’t help much for decision making.
41
Black – cl 1
Blue – cl 2
Red – cl 3
Green – cl 4
But, a cluster solution with four clusters gives insights that may
be useful for marketing decision making
42
21
Cluster analysis in sum…
44
22
APPENDIX
45
4. If the new seeds are close to the previous, stop. Otherwise, repeat
steps 2 and 3.
46
23
Appendix 2: K-means in SPSS
• Two challenges: choosing k [[ but cross-validation can help ]], and
more importantly, choosing the starting seeds (step 1 previous
slide) is tricky
• Final solution tends to be sensitive to choice of starting seeds
particularly for small(er) samples, many clustering variables, and
a “messy” cluster structure
• SPSS’ implementation of K-means is quite poor
– By default, it uses the first observations in your data spreadsheet
as starting values
– Hence, (randomly) re-arranging the rows of your data spreadsheet
could lead to (very) different cluster solutions
– Therefore, you should never-ever rely on a single clustering
solution from one set of starting values
• Because SPSS has no automated option for “randomizing”
starts, its K-means implementation is not recommended
47
24