
CS109/Stat121/AC209/E-109

Data Science
Statistical Models
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu

[Diagram: a model with an unobserved parameter generates observed data y via probability; statistical inference runs in the reverse direction, from the observed data y back to the unobserved parameter.]
This Week

HW1 due this Thursday - start last week!

Course dropbox is now active at http://isites.harvard.edu/k99240 (Harvard ID required). Please follow the submission instructions carefully, and do a test well in advance of the HW1 deadline.

Friday lab 10-11:30 am in MD G115: Pandas with Rahul, Brandon, and Steffen
Road Map to Probability

[Figure: a CDF F(x), increasing from 0 to 1, plotted for x from -6 to 6.]

distributions → random variables → events → numbers

A distribution (specified by its CDF F, its PMF (discrete) or PDF (continuous), a story, a name with parameters, or its MGF) generates a random variable X. Events such as X ≤ x and X = x then have probabilities P(X ≤ x) = F(x) and P(X = x). Functions of the r.v. (X, X², X³, ..., g(X)) lead via LOTUS to numbers: E(X), E(X²), E(X³), ..., E(g(X)), and hence Var(X), SD(X), and the MGF.

for more about probability: stat110.net
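To make the LOTUS arrow on the road map concrete, here is a minimal sketch (not from the slides) that computes E(g(X)) for g(x) = x² and X ~ Expo(1), both by numerically integrating g(x)·f(x) and by simulation; it assumes numpy and scipy are available.

```python
import numpy as np
from scipy import integrate, stats

# LOTUS: E(g(X)) = integral of g(x) * f(x) dx -- no need to find the
# distribution of g(X) itself.  Illustrated with g(x) = x^2, X ~ Expo(1).
g = lambda x: x**2
f = stats.expon.pdf                     # PDF of Expo(1): e^{-x} for x > 0

lotus_value, _ = integrate.quad(lambda x: g(x) * f(x), 0, np.inf)

# Simulation check: average g(X) over many draws of X.
rng = np.random.default_rng(0)
sim_value = g(rng.exponential(scale=1.0, size=10**6)).mean()

print(lotus_value, sim_value)           # both close to E(X^2) = 2
```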




What is a statistical model?
a family of distributions, indexed by parameters
sharpens the distinction between data and parameters, and between estimators and estimands
parametric (e.g., Normal, Binomial) vs. nonparametric (e.g., methods like bootstrap, KDE)
Parametric vs. Nonparametric
parametric: finite-dimensional parameter space (e.g., mean and variance for a Normal); see the sketch below
nonparametric: infinite-dimensional parameter space
is there anything in between?
nonparametric is very general, but no free lunch!
remember to plot and explore the data!
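As a concrete illustration of the contrast above (not from the slides): fit a two-parameter Normal model versus a nonparametric kernel density estimate to the same sample, using scipy. The data here are simulated and purely hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical sample

# Parametric: assume a Normal family indexed by (mu, sigma) and
# estimate the two parameters (here by maximum likelihood).
mu_hat, sigma_hat = stats.norm.fit(data)

# Nonparametric: kernel density estimate -- no finite-dimensional
# parameter; the "parameter" is effectively the whole density.
kde = stats.gaussian_kde(data)

grid = np.linspace(data.min(), data.max(), 5)
print(stats.norm.pdf(grid, mu_hat, sigma_hat))    # fitted parametric density
print(kde(grid))                                  # nonparametric estimate
```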
What good is a statistical model?
All models are wrong, but some models are useful.
George Box (1919-2013)

Jorge Luis Borges, On Exactitude in Science:
"In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those Unconscionable Maps no longer satisfied, and the Cartographers Guild struck a Map of the Empire whose size was that of the Empire, and which coincided point for point with it."

Borges Google Doodle


Statistical Models: Two Books
Parametric Model Example: Exponential Distribution

f(x) = e^{-x}, x > 0

[Figure: the PDF and CDF of the Expo(1) distribution, plotted for x from 0 to 3.]

Remember the memoryless property!
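A quick simulation sketch of the memoryless property (not from the slides): for X ~ Expo(1), P(X > s + t | X > s) should match P(X > t) for any s, t > 0. The specific values of s and t below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10**6)   # draws from Expo(1)

s, t = 1.0, 0.5                              # arbitrary positive constants
survived_s = x[x > s]

# Conditional survival probability vs. unconditional survival probability.
cond = np.mean(survived_s > s + t)           # estimates P(X > s+t | X > s)
uncond = np.mean(x > t)                      # estimates P(X > t)
print(cond, uncond)                          # both close to exp(-t) ≈ 0.6065
```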


Length-Biasing Paradox
What is the waiting time for a bus?

[Figure: timeline of bus arrival times.]

For i.i.d. Exponential arrivals, your average waiting time is the same as the average time between buses!
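A small simulation sketch of this (not from the slides): buses arrive according to a Poisson process with rate 1 (i.i.d. Expo(1) gaps), you show up at a fixed time, and your average wait to the next bus comes out to about 1, the same as the average gap between buses.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps, rate = 10**4, 1.0
waits = np.empty(n_reps)

for i in range(n_reps):
    # Bus arrival times: cumulative sums of i.i.d. Expo(1) interarrival gaps.
    arrivals = np.cumsum(rng.exponential(scale=1 / rate, size=200))
    t_you = 50.0                                  # you show up at time 50
    waits[i] = arrivals[arrivals > t_you][0] - t_you

print(waits.mean())    # close to 1/rate = 1, the mean time between buses
```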
Length-Biasing Paradox
How would you measure the average prison sentence?
Exponential Distribution

f(x) = e^{-x}, x > 0

Exponential is characterized by the memoryless property, and equivalently by having a constant hazard function
all models are wrong, but some are useful...
iterate between exploring the data, model-building, model-fitting, and model-checking
key building block for more realistic models

Remember the memoryless property!
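A one-line check of the constant-hazard claim, written for the general rate-λ Exponential (the slide's density is the λ = 1 case): the hazard is the density divided by the survival function,

h(t) = f(t) / P(X > t) = λe^{-λt} / e^{-λt} = λ for all t > 0,

so the hazard does not depend on t; conversely, a constant hazard λ forces the survival function to be e^{-λt}, which is exactly the Exponential.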


The Weibull Distribution
Exponential has constant hazard function
Weibull generalizes this to a hazard that is t to a power
much more flexible and realistic than Exponential
representation: a Weibull is an Expo to a power
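A simulation sketch of the representation in the last bullet (not from the slides): if X ~ Expo(1), then X^(1/k) follows the standard Weibull distribution with shape k; here it is checked against numpy's built-in Weibull sampler. The shape value k = 2 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2.0                                    # arbitrary Weibull shape parameter
n = 10**6

# Representation: (Expo(1))**(1/k) has the standard Weibull(k) distribution.
from_expo = rng.exponential(scale=1.0, size=n) ** (1.0 / k)
direct = rng.weibull(k, size=n)            # numpy's Weibull(k) sampler

# Compare a few quantiles of the two samples -- they should agree closely.
qs = [0.25, 0.5, 0.75, 0.9]
print(np.quantile(from_expo, qs))
print(np.quantile(direct, qs))
```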
The Evil Cauchy Distribution

http://www.etsy.com/shop/NausicaaDistribution
Family Tree of Parametric Distributions

[Diagram: the family tree of named distributions, with nodes HGeom, Bin (Bern), Beta (Unif), Pois, Gamma (Expo, Chi-Square), NBin (Geom), Normal, and Student-t (Cauchy), connected by edges labeled Limit, Conditioning, Conjugacy, Poisson process, and bank - post office.]

Blitzstein-Hwang, Introduction to Probability
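One of the "Limit" connections in the tree, sketched numerically (not from the slides): for large n and small p with np moderate, the Bin(n, p) PMF is close to the Pois(np) PMF. The values n = 100, p = 0.03 match the Bin(100, 0.03) panel in the Binomial figure that follows.

```python
import numpy as np
from scipy import stats

n, p = 100, 0.03            # large n, small p (cf. the Bin(100, 0.03) panel)
k = np.arange(0, 11)

binom_pmf = stats.binom.pmf(k, n, p)
pois_pmf = stats.poisson.pmf(k, n * p)   # Poisson with mean np = 3

# Total variation distance between the two PMFs on 0..10 -- small.
print(0.5 * np.abs(binom_pmf - pois_pmf).sum())
```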


Binomial Distribution
Figure 3.6 shows plots of the Binomial PMF for various values of n and p. Note that
the PMF of the Bin(10, 1/2) distribution is symmetric about 5, but when the success
story: X~Bin(n,p) is the number of successes in n
probability is not 1/2, the PMF is skewed. For a fixed number of trials n, X tends to be
larger when the success probability is high and lower when the success probability is low,
independent Bernoulli(p) trials.
as we would expect from the story of the Binomial distribution. Also recall that in any
PMF plot, the sum of the heights of the vertical bars must be 1.

[Figure 3.6: Binomial PMF plots for Bin(10, 1/2), Bin(10, 1/8), Bin(100, 0.03), and Bin(9, 4/5), each over x = 0 to 10.]
Binomial Distribution
story: X ~ Bin(n,p) is the number of successes in n independent Bernoulli(p) trials.

Example: # votes for candidate A in an election with n voters, where each voter independently votes for A with probability p

mean is np (by the story and linearity of expectation: E(X+Y) = E(X) + E(Y))

variance is np(1-p) (by the story and the fact that Var(X+Y) = Var(X) + Var(Y) if X, Y are uncorrelated)
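A quick numerical check of the mean and variance formulas (not from the slides), using scipy's Binomial PMF for an arbitrary choice of n and p.

```python
import numpy as np
from scipy import stats

n, p = 10, 0.3                       # arbitrary example values
k = np.arange(n + 1)
pmf = stats.binom.pmf(k, n, p)

mean = (k * pmf).sum()               # should equal n*p = 3.0
var = ((k - mean) ** 2 * pmf).sum()  # should equal n*p*(1-p) = 2.1

print(mean, n * p)
print(var, n * p * (1 - p))
```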
(Doonesbury)
Normal (Gaussian) Distribution
symmetry
central limit theorem
characterizations (e.g., via entropy)
68-95-99.7% rule

Wikipedia
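A two-line check of the 68-95-99.7% rule (not from the slides), using the standard Normal CDF from scipy.

```python
from scipy import stats

# P(|Z| < k) for k = 1, 2, 3 standard deviations around the mean.
for k in (1, 2, 3):
    print(k, stats.norm.cdf(k) - stats.norm.cdf(-k))
# prints approximately 0.6827, 0.9545, 0.9973
```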
Normal Approximation to Binomial

Wikipedia
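A sketch of the approximation (not from the slides): the Bin(n, p) PMF at k is close to the N(np, np(1-p)) density at k when n is large and p is moderate, and a continuity-corrected CDF comparison works even better. The values n = 100, p = 0.5 are arbitrary.

```python
import numpy as np
from scipy import stats

n, p = 100, 0.5
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

k = np.arange(35, 66)
exact = stats.binom.pmf(k, n, p)
approx = stats.norm.pdf(k, mu, sigma)      # Normal density approximating the PMF
print(np.abs(exact - approx).max())        # small

# With continuity correction: P(X <= 55) ≈ Phi((55.5 - mu) / sigma).
print(stats.binom.cdf(55, n, p), stats.norm.cdf((55.5 - mu) / sigma))
```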
Bootstrap

data: 3.142 2.718 1.414 0.693 1.618

bootstrap resamples (reps):
1.414 2.718 0.693 0.693 2.718
1.618 3.142 1.618 1.414 3.142
1.618 0.693 2.718 2.718 1.414
0.693 1.414 3.142 1.618 3.142
2.718 1.618 3.142 2.718 0.693
1.414 0.693 1.618 3.142 3.142

resample with replacement, use the empirical distribution to approximate the true distribution
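A minimal bootstrap sketch in Python (not from the slides), using the five data points above: resample with replacement many times and use the spread of the resampled means to approximate the sampling distribution of the mean.

```python
import numpy as np

data = np.array([3.142, 2.718, 1.414, 0.693, 1.618])   # the slide's data
rng = np.random.default_rng(0)

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# Bootstrap estimate of the standard error and a simple 95% interval for the mean.
print(boot_means.std(ddof=1))
print(np.percentile(boot_means, [2.5, 97.5]))
```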
