
Theory

PROBABILITY
DISTRIBUTIONS
By Dr Gary Deng

Every random variable has an associated probability distribution.
Random variables can be:
Univariate
Multivariate
Every statistical test is based on some probability distribution.
Every regression model requires assumptions to be made about the probability distribution of the residuals.
For probability distributions that have known analytical solutions, you can simulate values from them: Monte Carlo Simulation.

Univariate Random Variables

We will focus mainly on univariate random variables in this unit.
To put it simply, a univariate random variable has a designated set of possible values and associated probabilities.
A random variable can be either:
Discrete
Continuous

Discrete Random Variables

A discrete random variable (X) has:
A set of possible values: x1, x2, ..., xk
A set of probabilities: p1, p2, ..., pk
And: p1 + p2 + ... + pk = 1

Example:

# Your mark for this subject
X = c(30, 40, 50, 60, 70, 80)
# The probability of you getting these marks
P = c(0.3, 0.3, 0.05, 0.05, 0.1, 0.2)
# Checking that the sum of probabilities equals 1
sum(P)

Discrete Random Variables

The most important features of a random variable:
Mean: mu = sum of xi * pi
Variance: sigma^2 = sum of (xi - mu)^2 * pi

Example (there are many ways of doing this):

mean=sum(X*P)
mean
sum( (X-mean)^2*P )

You can also use matrix algebra:

mean=t(X)%*%P
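As a small sketch (not in the original slides), the variance can be computed in the same matrix-algebra style:

# variance via matrix algebra: sum of (xi - mu)^2 * pi, returned as a (1x1) matrix
mu = sum(X*P)
t((X - mu)^2) %*% P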
Continuous Random Variables

A continuous random variable has:
A continuous set of values.
The fixed set of probabilities of a discrete variable are now replaced with a continuous probability density function (pdf).
The pdf is typically denoted by p(x), and its integral over all possible values equals 1 (the probabilities sum to 1).
For the remainder of this lecture we will focus mainly on continuous random variables.

Uniform Distribution

The uniform distribution is the easiest of all.
It's "uniform" because every value has the same probability density.
pdf: p(x) = 1/(b - a) for a <= x <= b, and 0 otherwise,
where a is the min value and b is the max value.

Uniform Distribution
R code:

# Generate 10 uniform random numbers between 3 and 5.


runif(10, 3, 5)
# Generate 100000 uniform random numbers between 3 and 5.
X <- runif(100000,3,5)
# Look at its histogram.
hist(X)
# compute its mean using the formula.
(1/2)*(3+5)
# verify using the mean function.
mean(X)
# compute its variance using the formula.
(1/12)*(5-3)^2
# verify using the var function.
var(X)

Normal Distribution

The most widely used distribution of all, commonly denoted as N(mu, sigma^2).
Let Z = (X - mu)/sigma, then Z ~ N(0, 1). This is where your standard normal table comes from.

R code:

# Generate 10,000 N(0,1) numbers (i.e., standard normal).
# no need to specify the mean and variance.
X<-rnorm(10000)
# Generate 10,000 N(30, 10^2) numbers.
X<-rnorm(10000,30,10)
# checking the histogram.
hist(X)
# checking the mean and variance.
mean(X)
var(X)

R code (tracing out the normal CDF using the error function and optimize):

install.packages("pracma")
library("pracma")
# mean of the normal distribution
mu = 5
# standard deviation of the normal distribution
sigma = 1
# storage of corresponding X values
NormalCDF = matrix(0,10000,1)
# cumulative probability ranges from 0.0001 to 1
PROB = seq(0.0001,1,0.0001)
# a simple loop
for (i in 1:10000)
{
# define the normal cumulative density function
# erf is known as the error function
# Notice the function is defined as squared deviation from cumulative probability
# what happens if it's not squared?
# the minimizer will seek to find the lowest value rather than a zero value
# test it yourself by removing ^2
NCDF <- function(X)
{ ( PROB[i] - 1/2 * (1 + erf( (X-mu)/(sigma*2^0.5) )) )^2 }
# optimize is an optimizing function used for situations where there is one unknown
# the function is NCDF
# 0 is the lower bound of the search
# 10 is the upper bound of the search
out = optimize(NCDF, lower = 0, upper = 10 )
# stores the solved value of X in the above function
NormalCDF[i] = out$minimum
}
plot(NormalCDF,PROB)

Chi Squared Distribution

The Chi Squared distribution is derived from the standard normal distribution.
Let Z ~ N(0,1), and if Z1, Z2, ..., Zn are drawn randomly and independently from this distribution, then their sum of squares is chi-squared distributed with n degrees of freedom:
Z1^2 + Z2^2 + ... + Zn^2 ~ chi-squared(n)
And:
E(X) = n; Var(X) = 2n
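As a quick simulation check (a sketch, not from the slides), you can build chi-squared draws directly from standard normals:

# sum of squares of 5 independent N(0,1) draws, repeated 10,000 times
Z <- matrix(rnorm(10000*5), 10000, 5)
SS <- rowSums(Z^2)
mean(SS)   # close to n = 5
var(SS)    # close to 2n = 10
hist(SS)   # compare with hist(rchisq(10000, 5))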

Chi Squared Distribution

Recall the most widely used test of independence between categorical variables, the Chi-square test:
chi-squared = sum over cells of (Observed - Expected)^2 / Expected
In light of the definition of a Chi-squared variable, the above Chi-square test makes a lot more sense: it is a sum of squared standard normal variables.

R code:

# Generate 10,000 Chi-squared numbers with 5 degrees of freedom.
X<-rchisq(10000,5)
# Look at its histogram.
hist(X)
# Verify that the mean and variance are indeed n and 2n respectively
mean(X)
var(X)

F Distribution

The F distribution is (F)undamental to hypothesis testing.
For example, your ANOVA outputs are nothing more than reported results of an F-test.
The F distribution is defined in terms of two chi-squared variables.
Let:
X1 ~ chi-squared(df1) and X2 ~ chi-squared(df2);
X1 and X2 are independent of each other;
Then:
F = (X1/df1) / (X2/df2) ~ F(df1, df2)

R code:

# Generate 100 F numbers with df1 = 5, df2 = 2.
X<-rf(100,5,2)
# Look at its histogram.
hist(X)
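A small simulation sketch (not in the slides) that matches this definition against rf():

# build F draws from two independent chi-squared variables
df1 <- 5; df2 <- 2
F1 <- (rchisq(100000, df1)/df1) / (rchisq(100000, df2)/df2)
F2 <- rf(100000, df1, df2)
quantile(F1, c(0.5, 0.9, 0.95))
quantile(F2, c(0.5, 0.9, 0.95))  # the two sets of quantiles should be close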

t Distribution

The t-test should not be unfamiliar to any of you.
Now I am going to show you that it is closely linked to all the distributions we have mentioned up to this point.

t Distribution: First Representation

Let Z ~ N(0,1) and X ~ chi-squared(v), and let Z and X be independent of each other.
Then:
t = Z / sqrt(X/v) ~ t(v)
Mean: 0
Variance: v/(v-2), where v = degrees of freedom
t is just the ratio between a standard normal and the square root of a Chi-squared standardized by its degrees of freedom.
It's easy to see that as v goes to infinity, the variance goes to 1, and t collapses to a standard normal (mean 0, variance 1).
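Here is a short simulation sketch of that construction (not from the slides; degrees of freedom assumed to be 10):

v <- 10
T1 <- rnorm(100000) / sqrt(rchisq(100000, v)/v)   # standard normal over sqrt(chi-squared/df)
T2 <- rt(100000, v)                               # drawn directly from the t distribution
var(T1); var(T2); v/(v-2)                         # both close to v/(v-2) = 1.25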

t Distribution: Second Representation

Let Z ~ N(0,1) and X ~ chi-squared(v), and let Z and X be independent of each other.
Then:
t^2 = Z^2 / (X/v) ~ F(1, v), i.e., an F distribution with df1 = 1.
In other words, the square of a t is just an F.
In light of how BOTH the t statistic and the F statistic are used in hypothesis testing, this relationship is very intuitive.
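A quick numerical check of this relationship (a sketch; degrees of freedom assumed to be 10):

v <- 10
quantile(rt(100000, v)^2, c(0.5, 0.9, 0.95))   # squared t draws
quantile(rf(100000, 1, v), c(0.5, 0.9, 0.95))  # F(1, v) draws: quantiles should be close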

Other important distributions

Later on we will be looking at Generalized Linear Models (GLM), where the distributional assumptions made about the residuals are not normal.
Some of the most important distributions will be presented below.
You do NOT need to know the mathematics behind these distributions.
The important thing is for you to recognize WHEN to use these distributions.

Logistic Distribution

Looks very similar to a normal distribution, but it has fatter tails.
It is used when your dependent variable is binary (0/1).
It is the foundation of the Logit model.

R codes:

# generate 10000 logistic random numbers with location 0 and scale 1.
X<-rlogis(10000, 0, 1)
# generate 10000 standard normal random numbers.
Y<-rnorm(10000, 0,1)
# compare their histograms
# Note add=T means I am adding the second histogram to the first
hist(X, xlim=c(-20,20), col="red")
hist(Y, add=T, col="blue")

Exponential Distribution

Frequently used to describe the time between events in a Poisson process (more on Poisson later).
PDF: p(x; lambda) = lambda * exp(-lambda*x)
CDF: F(x; lambda) = 1 - exp(-lambda*x)

Example: suppose that the average time you spend on clearing a stage in a computer game is exponentially distributed with a mean time of 30 minutes; then the rate parameter is lambda = 1/30. The probability of you wasting more than an hour of your life fighting imaginary dragons on your Xbox is:
P(X > 60) = 1 - F(60; 1/30) = exp(-60/30) = 0.1353353
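You can confirm this number in R (a quick sketch using the built-in exponential CDF):

# P(X > 60) when the mean waiting time is 30 minutes (rate = 1/30)
pexp(60, rate = 1/30, lower.tail = FALSE)   # 0.1353353
exp(-60/30)                                 # same value by hand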

Exponential Distribution

R codes:

# generate 10000 random exponential numbers from the example
X<-rexp(10000, rate = 1/30)
hist(X)

Poisson Distribution

Closely related to the exponential distribution.
It is used to describe the probability of a number of events occurring in a fixed interval of time or a fixed area of space.
It is very commonly used in dealing with COUNT data.
PMF: P(X = k) = lambda^k * exp(-lambda) / k!, where lambda is the expected number of occurrences.
E(X) = Var(X) = lambda.
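To see the link with the exponential distribution in action, here is a small sketch (illustrative, not from the slides): if waiting times between events are exponential with rate 1/30 per minute, then the number of events in each 60-minute window behaves like a Poisson with lambda = 2.

arrivals <- cumsum(rexp(200000, rate = 1/30))   # event times in minutes
counts <- tabulate(ceiling(arrivals/60))        # number of events in each 60-minute window
counts <- counts[-length(counts)]               # drop the incomplete last window
mean(counts); var(counts)                       # both close to 60*(1/30) = 2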

Poisson Distribution

R codes:

# generate 10000 random Poisson numbers
X<-rpois(10000, 1)
Y<-rpois(10000, 10)
Z<-rpois(10000, 20)
# stack their histograms together
hist(X, xlim=c(0,50), col="red")
hist(Y, add=T, col="blue")
hist(Z, add=T, col="green")
# check the mean and variance are indeed very close
mean(X)
var(X)

Example: Tips (ANOVA)

Textbook pages 198-210.

Load package reshape2.

> data(tips, package = "reshape2")
> tips
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4

Example: Tips (ANOVA)

Let's look at whether Day, Time, and Sex have any significant impact on the amount of Tips received.

R codes

tipAnova <- aov(tip~day+time+sex-1,tips)
summary(tipAnova)
tipAnova$coefficients

Example: Tips (ANOVA)

Let's look at the p-value of gender.
In this F distribution: DF1 = 1, DF2 = 238.

# R code for obtaining the p-value from an F distribution.
pf(0.820,1,238,lower.tail=FALSE)

Recall that when DF1 = 1 the F distribution is the same as the square of a t(DF2).

# R code for obtaining the p-value from a t distribution.
pt(0.820^0.5,238,lower.tail=FALSE)*2

R Shiny

It's a great web application framework for R.
Go to: http://shiny.rstudio.com/tutorial/ for a comprehensive online tutorial.

LINEAR MODELS
By Dr Gary Deng

First you must install the Shiny package:

install.packages("shiny")
library(shiny)

Every Shiny App has two components:
1. a user-interface (ui) script
2. a server script

The user-interface (ui) script controls the layout and appearance of your app. It is defined in a source script named ui.R. The server.R script contains the instructions that your computer needs to build your app.
You need to put BOTH the ui.R and server.R scripts in the same folder.

Shiny example:

Here is a simple App. As you change the number of bins using the slider, the histogram of X changes accordingly.

ui script:

# Define UI for application that draws a histogram
shinyUI(fluidPage(
# Application title
titlePanel("Hello Shiny!"),
# Sidebar with a slider input for the number of bins
sidebarPanel(
sliderInput("bins",
"Number of bins:",
min = 1,
max = 50,
value = 30)
),
# Show a plot of the generated distribution
mainPanel(
plotOutput("distPlot")
)
))

fluidPage: Essentially the layout of the webpage changes as you change the size of the window.
sidebarPanel: This defines the layout of the sidebar panel. You don't have to have a sidebar, but typically it is useful to separate your Controls (such as sliders, checkboxes, etc.) from the Results (such as charts, tables, etc.).
sliderInput: One of the many widgets you can use to design your app.
value: This is just the starting value.
mainPanel: This defines the layout of the main panel, which contains just one plot.
plotOutput: This is used to plot any outputs you want to plot. The name inside the brackets is what you are plotting.

server script:

# Define server logic required to draw a histogram
shinyServer(function(input, output) {
# Expression that generates a histogram. The expression is
# wrapped in a call to renderPlot to indicate that:
# 1) it is "reactive" and therefore should re-execute automatically when inputs change
# 2) its output type is a plot
output$distPlot <- renderPlot({
x <- faithful[, 2] # Old Faithful Geyser data
bins <- seq(min(x), max(x), length.out = input$bins + 1)
# draw the histogram with the specified number of bins
hist(x, breaks = bins, col = 'darkgray', border = 'white')
})
})

function(input, output): There is always an input and an output for a server function.
output$distPlot: The output is an object known as distPlot. It corresponds to the ui.R script's plotOutput("distPlot").
renderPlot: You need to render outputs. Other options include renderText, renderTable, etc. Rendering is hugely important.
The rest of it is no different from how you would script in R.

Matrix Algebra (Very Basic)

Let A be any (n x m) matrix. n is its row dimension, and m is the column dimension.
If n = m = 1, you have a scalar (a single number).
If n > 1 & m = 1, you have a Column Vector.
If n = 1 & m > 1, you have a Row Vector.
If n = m > 1, you have a Square Matrix.
The transpose of a matrix, often denoted as A' or A^T, reverses the rows and columns of a matrix.

Matrix Algebra (Very Basic)


# Column Vector
X = matrix(c(3,4,7,8,9,10,3,3),8,1)
X
# Row Vector
X = matrix(c(3,4,7,8,9,10,3,3),1,8)
X
# a (4x2) matrix
X = matrix(c(3,4,7,8,9,10,3,3),4,2)
X
# a (3x3) square matrix
X = matrix(c(3,4,7,8,9,10,3,3,1),3,3)
X
# transpose of X
t(X)

Matrix Multiplication

Let there be two matrices, X (n x m) and Y (p x q).
You can multiply X and Y if and only if they are conformable. And the ORDER in which you multiply them matters. That is, XY is not equal to YX in general.
For XY to exist, m must be equal to p, and the resultant matrix A = XY will have dimension (n x q).
For YX to exist, q must be equal to n, and the resultant matrix B = YX will have dimension (p x m).
In general, A is not equal to B even if both A and B are square matrices.
In the case of A = XY, each element in A is the sumproduct of the corresponding ROW in X and COLUMN in Y. For example, if an element is located in row 2 and column 3 of the A matrix, it will be the sumproduct of row 2 of X and column 3 of Y.

Matrix Multiplication

X = matrix(c(3,4,7,8,9,10,3,3,1),3,3)
Y = matrix(c(6,8,2,1,2,1,3,2,1),3,3)
O = matrix(c(2,2,3),3,1)
# X and O are conformable (3x1)
X%*%O
# O and X are NOT conformable
O%*%X
# t(O) and X are conformable (1x3)
t(O)%*%X
# X and Y are conformable
X%*%Y
# first element of X%*%Y
# is sumproduct of row1 of X and column1 of Y
sum(X[1,]*Y[,1])
# last element of X%*%Y
# is sumproduct of row3 of X and column3 of Y
sum(X[3,]*Y[,3])
# YX is NOT equal to XY
Y%*%X==X%*%Y

Matrix Inverse

Matrix inverse is hugely important in statistics.
Similar to: 8 * 8^(-1) = 1.
For B to be the matrix inverse of A, we must have:
AB = BA = I, where I is known as an Identity matrix with 1 on its leading diagonal and 0 everywhere else.
Only square matrices have a matrix inverse.
But NOT all square matrices are invertible.

Matrix Inverse

X = matrix(c(3,4,7,8,9,10,3,3,1),3,3)
# in R, matrix inverse is done through solve().
XI = solve(X)
X%*%XI
# if you foolishly did this
XI = X^-1
XI
# you have inverted every single element in X.
# XI will NOT be the true matrix inverse of X.
X%*%XI
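As a small illustration (not in the slides) of a square matrix that is NOT invertible:

S = matrix(c(1,2,2,4),2,2)   # the second column is exactly twice the first
try(solve(S))                # solve() reports the matrix is singular: no inverse exists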

Theory of LM

In matrix notation:
y = X*beta + e
Where:
y is (n x 1); X is (n x k); beta is (k x 1); e is (n x 1).
n is the number of observations; k is the number of regressors/predictors/explanatory variables/exogenous variables, which typically includes the intercept/constant unless specified otherwise.
e determines the stochastic properties of y. In this chapter, we will assume that it is multivariate normal.
Clearly, beta determines the nature of the relationship between the response variable and the regressors. But it is unknown and needs to be estimated.

LS estimation

The most fundamental of all estimation principles in statistics is Least Squares.
It originated in the early 19th century.
Consider a simple example: looking at the relationship between speed and stopping distance.

Cars example

data(cars)
attach(cars)
plot(speed,dist,xlim=c(0,25))

Simple Linear regression model (lm):

regmodel=lm(dist~speed)
summary(regmodel)

Cars example

The regression model: dist = b0 + b1*speed + e. Your LS estimates are the coefficients reported by summary(regmodel).

R codes:

# generate a sequence of xs
x=seq(min(speed),max(speed),1)
ypredicted=regmodel$coefficients[1]+regmodel$coefficients[2]*x
lines(x,ypredicted,col="red")

LS Principle:

Let the residuals of the regression model be denoted by:
e_i = y_i - yhat_i
where yhat_i = b0 + b1*x_i is the LS fitted value of y_i. Graphically, e_i is the vertical distance between y_i and the line of best fit.
The least squares (LS) principle is:
Select b0 and b1 to minimize the residual sum of squares (RSS):
RSS = sum of e_i^2 = sum of (y_i - b0 - b1*x_i)^2

First order conditions are:
dRSS/db0 = -2 * sum of (y_i - b0 - b1*x_i) = 0
dRSS/db1 = -2 * sum of x_i*(y_i - b0 - b1*x_i) = 0
Solving the above two equations gives us the LS estimates.
It is easy to extend the above to more than one regressor, but it requires matrix algebra.
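As a numerical sketch of the LS principle (not in the slides; lm() solves this analytically), you can minimize the RSS directly with optim:

data(cars)
RSS <- function(b) sum( (cars$dist - b[1] - b[2]*cars$speed)^2 )
out <- optim(c(0,0), RSS)
out$par    # close to coef(lm(dist ~ speed, data = cars))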

LS Estimator In Matrix Algebra

Recall our previous LM example with CARS:
y = X*b + e
e = y - X*b
RSS = t(e) %*% e = t(y - X*b) %*% (y - X*b)
Using matrix algebra and differentiating the above with respect to b, you will get the most memorable matrix algebra result of all time:
b = (X'X)^(-1) X'y
Note: Eqn (16.3) on page 211 is WRONG.

In Matrix Algebra

Run the following codes in R:

data(cars)
attach(cars)
Y = matrix(dist)
dim(Y)
one = matrix(rep(1,50))
X = cbind(one,matrix(speed))
dim(X)
b = solve( t(X)%*%X ) %*% ( t(X)%*%Y )
b

Other (not-so-good) options

It is important to recognize that the LS line of best fit is NOT the only option out there.
If you choose a different penalty function, you will get a different line.
For example, you could instead require the sum of residuals to equal zero: sum of (y_i - yhat_i) = sum of y_i - n*ybar = 0. One such line (not unique) is the horizontal line through the sample mean.
Clearly we are then under-predicting half of the data and over-predicting the other half.

Properties of LS estimators

For all intents and purposes, you just have to remember that LS estimators are BLUE:
BLUE: Best Linear Unbiased Estimator.
Under the following assumptions:
1) The regressors are INDEPENDENT of the residuals.
2) The regressors are NOT perfectly collinear with each other.
3) The residuals are homoscedastic.
4) The residuals are uncorrelated.

Simulated Example to Verify that LS estimators are indeed BLUE

Let: X1 = 1; X2 ~ N(3, 1); X3 ~ chi-squared(3).
Let: n = 1,000.
Let: e ~ N(0, 0.5^2) (i.e., standard deviation 0.5).
Let: B1 = 1; B2 = 0.1; B3 = 3.
Simulate Y (1,000 x 1).

R codes:
X1<-rep(1,1000)
X2<-rnorm(1000,3,1)
X3<-rchisq(1000,3)
E<-rnorm(1000,0,0.5)
B1=1
B2=0.1
B3=3
Y<-B1*X1+B2*X2+B3*X3+E

Simulated Example

What is the distribution of the error term?
What is the distribution of Y?

hist(Y, xlim=c(-5,60),col="blue")
hist(E,add=T,col="red")

Assuming normally distributed residuals does NOT imply a normally distributed Y.

Simulated Example

regmodel=lm(Y~X1+X2+X3-1)
summary(regmodel)

The LS estimates are not going to be exactly equal to the known values, but they are very close.

Simulated Example

All estimators have a corresponding sampling distribution. Under repeated sampling, you see that the means are practically the same as the known values.

# Simulate 10,000 times.
T <- 10000
# Use BETA to store all LS estimates from EACH simulated sample
BETA <- matrix(0,T,3)
# Use a simple LOOP to do this.
for (i in 1:10000)
{
X1<-rep(1,1000)
X2<-rnorm(1000,3,1)
X3<-rchisq(1000,3)
E<-rnorm(1000,0,0.5)
B1=1
B2=0.1
B3=3
Y<-B1*X1+B2*X2+B3*X3+E
# store the LS estimates
regmodel=lm(Y~X1+X2+X3-1)
BETA[i,] <- regmodel$coefficient
}
# compute the MEAN of the 10,000 LS estimates of the three coefficients.
colMeans(BETA)
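A small follow-up sketch (not in the slides) to look at one sampling distribution directly:

# the sampling distribution of the estimate of B2, centred on the known value 0.1
hist(BETA[,2])
sd(BETA[,2])   # the simulated standard error of the LS estimate of B2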

Electricity Usage Example

Electricity Usage Example

data <- read.csv("Electricity.csv")


attach(data)
plot(WMAXit,AvgKWH)
plot(WMINit,AvgKWH)

regmodel=lm(AvgKWH~WMAXit+WMINit+Time)
summary(regmodel)
plot(resid(regmodel))
hist(resid(regmodel))

Electricity Usage Example


WMAXit2 <- WMAXit^2
WMINit2 <- WMINit^2
regmodel=lm(AvgKWH~WMAXit+WMINit+WMAXit2+WMINit2+Time)
summary(regmodel)
plot(resid(regmodel))
hist(resid(regmodel))

GENERALIZED LINEAR MODELS
By Dr Gary Deng

More Rshiny: a Forecasting App!

The app lets you:
Select the time series you want to forecast.
Select how many quarters ahead you want to forecast.
Obtain both time series plots and a table of forecast values.

The time series data needed for the app:
Need to load the time series data first.
Need to use the forecast package.
The choices of time series to be selected are given by the column names of the loaded csv file.

This is known as a reactive environment, as the outputs constantly REACT to input changes:
Define start and end dates of the observed sample data.
Define Y as a quarterly time series.
Define a list of objects that you need to refer to when rendering results.

Two key inputs:
(1) Select input as to which time series to forecast.
(2) Select the number of quarters to forecast ahead.

Both Y and H come from the reactive object selectedData defined above.
There are a lot of options available but you don't have to worry about them here.

Theory

There are many occasions where a LS estimation of a linear regression model may not be appropriate.
Specifically, when the residuals of the model do not follow a normal distribution (which is very often the case in practice), LS estimation could result in biased and inconsistent estimates.
As the name suggests, GLM is a flexible generalization of a linear regression model.
The Right Hand Side remains a LINEAR function: X*beta, i.e., a linear combination of the predictors.
The Left Hand Side response variable, however, is now related to X*beta via a so-called link function.
More precisely: E(Y) = mu = g^(-1)(X*beta), where mu is the mean of the response variable and g(.) is the link function.
The nature of this link function depends on the distribution of the dependent variable.

R function

The glm() function definition shown on the original slide is taken straight from R.
For all intents and purposes, in most cases you only need to know how to specify the following:
formula: the linear combination of the regressors on the right hand side.
data: the dataset being used.
family: describes the error distribution and the link function.
You can just about ignore the rest of the arguments in this function.
GLMs are mostly estimated using ML (which we will look at next week).
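As a minimal runnable sketch (not from the slides), the three key arguments look like this; with the gaussian family and identity link the fit matches lm():

# formula on the left, then the data frame and the family (distribution + link)
regmodel <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))
summary(regmodel)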

R Function

These are the distribution families and corresponding link functions that you can choose from.
In this lecture we will only look at:
Binomial
Poisson

Binomial Data

One of the most common problems encountered in practice is that the response variable could be binary in nature.
The two primary ways of dealing with binomial data are: Probit and Logit.
For all intents and purposes, they give you very similar results, and we will just look at Logit.

Logit Model

One way to understand the Logit model is to employ a very important concept called a latent variable, which forms the foundation of most probability models involving categorical dependent variables.
For example, let us suppose that we are looking at the probability of a suspect making a false confession, i.e., the dependent variable Y = 1 when the confession is false and Y = 0 otherwise.
Intuitively, one could think of one's willingness to lie as a latent variable that determines the outcome of the confession. It is unobserved, but when this willingness to lie passes a certain threshold, one's confession becomes false.
Graphically, the idea of this latent variable is a threshold: once the latent variable crosses it, Y switches from 0 to 1.

Logit Model

Define Y* as the latent variable. The initial problem of modelling Y = 1 or Y = 0 can then be restated as:
Y = 1 if Y* > 0, and Y = 0 otherwise.
Let us rewrite the linear regression model in terms of the latent variable:
Y* = b0 + b1*X + e
Therefore:
P(Y = 1) = P(Y* > 0) = P(b0 + b1*X + e > 0) = P(e > -(b0 + b1*X)) = P(e < b0 + b1*X)
Finally, if we assume that e follows a so-called extreme value distribution, the quantity P(e < b0 + b1*X) can be written as:
P(Y = 1) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X)) = F(b0 + b1*X)
where F(.) is the cumulative density function of a logistic distribution evaluated at b0 + b1*X, which lies between 0 and 1.
The nice thing about this latent variable approach is that it makes an intuitive link between an observed binary outcome (such as false confession) and an unobserved yet intuitive latent variable (such as willingness to lie).
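In R, the logistic CDF used above is plogis(); a tiny sketch (not from the slides):

# P(Y = 1) when b0 + b1*X equals 0 and 2 respectively
plogis(0)             # 0.5
plogis(2)             # about 0.88
exp(2)/(1 + exp(2))   # same value written out as in the formula above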

Logit Model

Another way to understand the logit model is through a functional transformation of the left-hand side dependent variable. More specifically, the transformation uses what is known as the logit function:
logit(p) = ln( p / (1 - p) )
Firstly, we note that the input to a logit function is p, i.e., the probability of Y = 1. Secondly, the natural log, i.e., ln(.), is used. Thirdly, p/(1 - p) is known as the odds, and it is simply the probability of observing Yes relative to the probability of observing No.
The binary logistic regression is then defined as:
ln( p / (1 - p) ) = b0 + b1*X

Logit Model: The Two Interpretations are Two Sides of the Same Coin

We can easily see why this transformation is useful by noticing that the left hand side variable of the regression model is no longer bounded between 0 and 1. For instance, if the outcome overwhelmingly favours Yes over No, then p tends to 1 and ln(p/(1-p)) tends to plus infinity. Conversely, when the odds overwhelmingly favour No over Yes, then p tends to 0 and ln(p/(1-p)) tends to minus infinity.
Finally, to show that the two interpretations are two sides of the same coin, one could easily derive the following:
p = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))
which is exactly the latent-variable expression for P(Y = 1) given earlier.

Example: US homicides

There were 3,085 counties in the US in 1990. The National Consortium on Violence Research (NCOVR) had compiled a dataset containing homicide rates (per 100,000 capita) for each of the 3,085 counties in the US in 1990, along with a number of socio-economic variables thought to be important in predicting homicide rates.
The Dependent Variable:
HomicideHotSpot = 1 if the homicide count exceeded 20 per 100,000 capita. These are the homicide "hotspots".
The Explanatory (socio-economic) Variables:
Southern (SOUTH): 1 if the county is in a southern state, 0 otherwise.
UnemploymentRate (UE90): unemployment rate of the county.
DivorceRate (DV90): divorce rate of the county.
MedianAge (MA90): median age of the county population.
PopulationStructure (PS90): a variable constructed using principal component analysis, which essentially captures the percentage of minority races in the county population. The larger the value of this variable, the larger the percentage of minority races.
ResourceDeprivation (RD90): a variable constructed using principal component analysis, which essentially captures the level of deprivation of social and economic infrastructure in the county. The larger the value of this variable, the more deprived a county was of adequate social and economic infrastructure (such as schools and hospitals).
The model is:
p = exp(b0 + b1*SOUTH + b2*UE90 + b3*DV90 + b4*MA90 + b5*PS90 + b6*RD90) / (1 + exp(b0 + b1*SOUTH + b2*UE90 + b3*DV90 + b4*MA90 + b5*PS90 + b6*RD90))

Example: US homicides
data <- read.csv("Homicides.csv")
attach(data)
regmodel_logit <- glm(HomicideHotSpot~SOUTH+UE90+DV90+MA90+PS90+RD90, family=binomial("logit"))
summary(regmodel_logit)

Example: US homicides

regmodel_Probit <- glm(HomicideHotSpot~SOUTH+UE90+DV90+MA90+PS90+RD90, family=binomial("probit"))
summary(regmodel_Probit)

Probit gives you similar results.

Poisson Model

Very often, we could be dealing with count data.
Counts are non-negative, and they are strictly integers.
Formally (p237, Lander):
y_i ~ Poisson(lambda_i), where y_i is the ith response and lambda_i = exp(X_i * beta) is the mean of the distribution for the ith observation.
Typically Poisson regression models are estimated using Maximum Likelihood.
There are a lot of advanced issues with Poisson regressions, such as overdispersion (variance > mean) and zero inflation (having too many 0 counts in the data). They are beyond the scope of this unit.

Example: Shipwreck

The Ship Damage Data.
These are the data from McCullagh and Nelder (1989). The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:
1) ship type, coded 1-5 for A, B, C, D and E,
2) year of construction (1=1960-64, 2=1965-70, 3=1970-74, 4=1975-79),
3) period of operation (1=1960-74, 2=1975-79),
4) months of service, ranging from 63 to 20,370, and
5) damage incidents, ranging from 0 to 53.
Reference: McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd Edition. Chapman and Hall, London. Page 204.

Example: Shipwreck

data <- read.csv("ShipWreck.csv")
attach(data)
data
hist(damage)

Example: Shipwreck

xtabs(~damage+type)
xtabs(~damage+construction)
xtabs(~damage+operation)
plot(damage~months)

Example: Shipwreck

months2 <- months^2
regmodel_Pois <- glm(damage~type+construction+operation+months+months2-1,family=poisson(link="log"))
summary(regmodel_Pois)
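With the log link, the fitted counts are the exponential of the linear predictor; a quick sketch (not in the slides):

# predicted damage counts for the first few rows
head(exp(predict(regmodel_Pois)))   # numerically the same as head(fitted(regmodel_Pois))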

MAXIMUM LIKELIHOOD ESTIMATION
By Dr Gary Deng

Optimization

Before we look at Maximum Likelihood (ML) Estimation in more detail, we will first look at Optimization in general.
Simply put, optimization means choosing values that make the value of a function as large or as small as possible.
(1) Optimization could be either Maximization (such as in ML) or Minimization (such as in LS).
(2) The function that gets optimized has to be a function of controllables. For example, in minimizing the Sum of Squares, you are choosing the values of your least squares estimates.
(3) There may or may not be substantive constraints to your optimization problem. I will not go into details about this as it is beyond the scope of this unit, but there are many real life problems that are constrained optimization problems. As an example, profit maximization could be subject to a labour supply constraint (a maximum number of hours that your employees could work, for instance).

LS as an optimization problem

Recall the LS estimation objective function for a simple linear regression model:
RSS(b0, b1) = sum of (y_i - b0 - b1*x_i)^2
It's a Minimization problem: you choose the values of b0 and b1.
In this case, the residual e_i = y_i - b0 - b1*x_i is a linear function of the parameters of interest, because b0 and b1 enter the right hand side of the equation in a linear fashion.

Nonlinear Least Squares

Very often, the residual is a Nonlinear function of the parameters of interest.
Consider the following example in Navigation. You are at Site 12, and the exact spatial locations of all other nineteen sites are known. Your equipment allows you to shoot laser beams and measure the distance between yourself and the other sites. However the measurement is prone to errors.
Question: can you find out your exact spatial coordinates?

Navigation Example

We are interested in the coordinates (a, b) of site 12.
We observe/measure the Euclidean distances between site 12 and all other sites with some error.
Thus the measured distance between site 12 and any site i is:
D_i = sqrt( (X_i - a)^2 + (Y_i - b)^2 ) + e_i
which is nonlinear in the two unknowns (a, b).

Navigation Example

We can turn this into an optimization problem. Nonlinear Least Squares minimizes:
RSS(a, b) = sum of ( D_i - sqrt( (X_i - a)^2 + (Y_i - b)^2 ) )^2
In R, there are a number of optimization routines that you can call on.
For all intents and purposes, knowing how to use optim is good enough in most cases. By default optim MINIMIZES.

Navigation Example

data <- read.csv("NavigationExample.csv")
attach(data)
# First we must specify the objective function
fn = function(P) sum( ( D - ( (X-P[1])^2 + (Y-P[2])^2 )^0.5 )^2 )
# next we specify initial values/guesses to the problem
initial = c(0,0)
# call on optim to minimize sum of squares
# the first argument is the initial value
# the second argument is the function to be optimized
# the third argument selects the numerical search method
# we will talk about Hessian later.
out = optim(initial, fn, method = "BFGS", hessian = TRUE)
# problem SOLVED
out$par

(Figure on the original slide: the optimizer's search path from the starting guess to site 12.)

You don't need to know how to do this, but if you had run the following codes:
out = optim(initial, fn, method = "L-BFGS-B", hessian = TRUE, control=list(trace=6,REPORT=1))
you would have obtained step-by-step information on the convergence process.
And if you had plotted the path on the same graph, you would see clearly how the algorithm gradually took us to the correct answer.

Maximum Likelihood Estimation

Let Y = (y1, y2, ..., yn)' be an n-vector of observed sample values. Let theta = (theta1, theta2, ..., thetak)' be a k-vector of unknown parameters. Furthermore, let Y depend on theta.
You can write down the joint density as f(Y; theta). This joint density can be interpreted in two ways:
1) Conditional on fixed theta, it is the probability of the sample outcomes Y.
2) Conditional on the sample outcomes Y, it is a function of theta.
The second interpretation is referred to as a likelihood function:
L(theta; Y) = f(Y; theta)
The reversal of the order of theta and Y emphasizes the new focus of interest. We observe Y, and we want to estimate theta.

Maximum Likelihood Estimation

In the most simple case (the only case we will explore in this unit), we look at what is known as an IID (Independent and Identically Distributed) sample.
To put it simply, an IID set of random variables: (1) have the same probability distribution; (2) are mutually independent.
A classic example of a sample that is NOT IID: serially dependent time series data.
Maximum Likelihood Estimation can be easily extended to deal with non-IID data, but it is a little bit more mathematically involved and so we will not get into it for now.

Maximum Likelihood Estimation

For an IID sample, we can write down the joint density function as:
f(Y; theta) = f(y1, ..., yn; theta) = f(y1; theta) * f(y2; theta) * ... * f(yn; theta)
Note: we can only write the joint density as the product of the marginal densities when the sample observations are independent of each other.
The likelihood function can then be written as:
L(theta; Y) = f(y1; theta) * f(y2; theta) * ... * f(yn; theta)

Maximum Likelihood Estimation

The Maximum Likelihood Principle:
Maximizing the likelihood function L(theta; Y) with respect to theta amounts to finding values of theta, typically denoted as theta-hat, that maximize the probability of obtaining the observed sample values Y = (y1, ..., yn)'.
In most empirical applications, it is simpler to maximize the log of the likelihood function, lnL(theta; Y). It is clear that the theta-hat that maximizes L(theta; Y) also maximizes lnL(theta; Y), as the logarithmic transformation is monotonic.
Differentiate lnL(theta; Y) with respect to theta: the derivative d lnL(theta; Y)/d theta is known as the score. The ML estimator theta-hat is obtained by setting the score to zero, that is:
d lnL(theta; Y)/d theta = 0

Maximum Likelihood Estimation

The majority of properties of ML estimators are large-sample or asymptotic results.
Key properties of ML Estimators:
1) Consistent: plim(theta-hat) = theta. In words: as the sample size approaches infinity, theta-hat converges in value to theta.
2) Asymptotically Normal: theta-hat ~ N(theta, I(theta)^(-1)). In words: theta-hat is asymptotically normally distributed with mean theta and variance given by the inverse of I(theta). I(theta) is known as the Information Matrix. Numerically, it is often evaluated with the Hessian Matrix:
H = d^2 lnL(theta; Y) / d theta d theta'
which is the second order matrix derivative of the log likelihood function. As you will see later, this will be numerically evaluated and generated in R.
Consistency means that under large-sample conditions ML estimators give you very good coefficient estimates. Asymptotic normality means that you can use its asymptotic distribution to perform hypothesis tests.
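Before the beetle example, here is a minimal sketch (not from the slides) of the whole recipe: write down the negative log likelihood, hand it to optim, and read approximate standard errors off the Hessian.

set.seed(1)
y <- rnorm(500, mean = 5, sd = 2)   # an IID normal sample
negLogLik <- function(p) {
  if (p[2] <= 0) return(1e10)       # keep the sd positive during the search
  -sum(dnorm(y, mean = p[1], sd = p[2], log = TRUE))
}
out <- optim(c(0, 1), negLogLik, hessian = TRUE)
out$par                             # ML estimates, close to (5, 2)
sqrt(diag(solve(out$hessian)))      # approximate standard errors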

Example One: Logistic Model for Dose-Response Data from Dobson (1990)

This example fits a logistic model to dose-response data. It is primarily illustrative, as it could easily be dealt with by glm() using family=binomial. The data, for beetle mortality, are as follows. For each of 8 experiments a group of n beetles were subject to a dose of carbon disulphide at concentration x for five hours, and y records the number that died.

# Here is the data


# carbon disulphide concentration
x = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
# beetles death
y = c(6, 13, 18, 28, 52, 53, 61, 60)
# beetles experimented on
n = c(59, 60, 62, 56, 63, 59, 62, 60)

Beetles Death

For continuous data, we have a probability density function (pdf). For discrete data, we have a probability mass function (pmf).
To apply maximum likelihood estimation, we need to be able to write down the pmf for this sample.
You can easily find out from Wikipedia that the pmf for a binomial distribution is:
P(Y = y) = choose(n, y) * pi^y * (1 - pi)^(n - y)
where n is the number of beetles experimented on, y is the number that died, and pi is the probability of success (in this case, Death).

Beetles Death

The probability of death depends on the concentration x. For logistic regression, recall that we model the probability using:
pi = exp(a + b*x) / (1 + exp(a + b*x))
which can be easily rewritten as:
1 - pi = 1 / (1 + exp(a + b*x))
Substitute the above expressions into the pmf.
Recall that for an IID sample, we can write down the joint density function as the product of the marginal pmfs. Evaluate the pmf for each experiment observation (8 experiments in total in this sample) and multiply them together: we have a likelihood function.
Specifically (omitting the i subscript for notational convenience), the likelihood function is:
L(a, b) = product over the 8 experiments of choose(n, y) * [exp(a + b*x)/(1 + exp(a + b*x))]^y * [1/(1 + exp(a + b*x))]^(n - y)
Using simple algebra involving log functions, we can easily work out:
lnL(a, b) = sum over the 8 experiments of [ y*(a + b*x) - n*log(1 + exp(a + b*x)) + log(choose(n, y)) ]
Now you will see why a log likelihood function is easier to work with. This is the log likelihood function we will maximize by choosing values for a and b.

Beetles Death

# Note that the default of optim is to minimize
# so we simply reverse the sign of the log likelihood function derived earlier
fn = function(B) sum( -y*(B[1]+B[2]*x) + n*log(1+exp(B[1]+B[2]*x)) - lchoose(n, y) )
out = optim(c(-50,20), fn, method = "BFGS", hessian = TRUE)

Note that the vector of parameters must be a single argument to the function, here denoted by B, where B[1] is a and B[2] is b.
We pick sensible starting values and do the fit. There are various ways of roughly guessing the values of a and b for use as starting values. Since this is a simple example, it doesn't matter much where we start.

Beetles Death

The maximum likelihood estimation output is stored in out.
Recall that the inverse of the negative of the Hessian matrix approximates the variance-covariance matrix of the estimated coefficient vector (in a minimization problem you don't have to take the negative of the Hessian).
sqrt(diag(solve(out$hessian)))
gives you the approximate standard errors of the maximum likelihood coefficient estimates.
Therefore, the t-stat for each estimated coefficient is the estimate divided by its approximate standard error.
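Putting those pieces together (a small sketch, not in the original slides):

beta <- out$par                            # ML estimates of a and b
se   <- sqrt(diag(solve(out$hessian)))     # approximate standard errors
cbind(beta, se, t_stat = beta/se)          # t-stats: estimate divided by standard error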

Beetles Death
Finally as a comparison, if we had used GLM, the results will be very similar.
Because GLMs are also estimated with ML

out = glm(cbind(y,n-y) ~ x, family = binomial)


summary(out)

is:

Gaussian (Normal) Data

The Gaussian (Normal) pdf is:
f(y; mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp( -(y - mu)^2 / (2*sigma^2) )
In the case of a linear regression model, y is the response variable, mu is the underlying mean conditional on the explanatory variables (i.e., X*beta), and sigma^2 will also be unknown.
As an example, let there be n observations in the sample, and let there be 3 explanatory variables. The negative of the log likelihood function is (verify on your own and use it for your assignment), with p[1] playing the role of sigma^2 and p[2:4] the coefficients:

lnL <- function(p) (n/2)*log(2*3.1415926)+(n/2)*log(p[1])+(1/(2*p[1]))*(t(y-X%*%p[2:4])%*%(y-X%*%p[2:4]))
out <- optim(c(rep(1,4)), lnL, hessian=TRUE, method = "BFGS")

REVISION WEEK
By Dr Gary Deng

Example 1: Baltimore House Price

Variable   Description
STATION    ID variable
PRICE      sales price of house in $1,000 (MLS)
NROOM      number of rooms
DWELL      1 if detached unit, 0 otherwise
NBATH      number of bathrooms
PATIO      1 if patio, 0 otherwise
FIREPL     1 if fireplace, 0 otherwise
AC         1 if air conditioning, 0 otherwise
BMENT      1 if basement, 0 otherwise
NSTOR      number of stories
GAR        number of car spaces in garage (0 = no garage)
AGE        age of dwelling in years
CITCOU     1 if dwelling is in Baltimore County, 0 otherwise
LOTSZ      lot size in hundreds of square feet
SQFT       interior living space in hundreds of square feet

R codes
data <- read.csv("BaltimoreHousing.csv")
data<-data.matrix(data)
n = dim(data)[1]
m = dim(data)[2]
y = data.matrix(data[,2])
X = data.matrix(data[,3:m])
one = matrix(rep(1,n))
X = cbind(one,X)
k = dim(X)[2]
lnL <- function(p)
(n/2)*log(2*3.1415926)+(n/2)*log(p[1])+(1/(2*p[1]))*(t(y-X%*%p[2:(k+1)])%*% (y-X%*%p[2:(k+1)]))
out <- optim(c(rep(1,(k+1))), lnL, hessian=TRUE, method = "BFGS")
beta <- matrix(out$par)
stdev <- matrix(sqrt(diag(solve(out$hessian))))
t_beta <- beta/stdev
cbind(beta[2:(k+1)],stdev[2:(k+1)],t_beta[2:(k+1)])
reg_lm <- lm(y~X-1)
cbind(beta[2:(k+1)],reg_lm$coefficient)
solve(t(X)%*%(X))%*%(t(X)%*%y)


Example 2: Columbus Crime

Variable   Description
POLYID     neighborhood ID, used in GeoDa User's Guide and tutorials
HOVAL      housing value (in $1,000)
INC        household income (in $1,000)
CRIME      residential burglaries and vehicle thefts per 1000 households
OPEN       open space (area)
PLUMB      percent housing units without plumbing
DISCBD     distance to CBD
NSA        north-south indicator variable (North = 1)
NSB        other north-south indicator variable (North = 1)
EW         east-west indicator variable (East = 1)
CP         core-periphery indicator variable (Core = 1)

R codes

Exactly the same codes as before (except for the file name):

data <- read.csv("columbus.csv")
data<-data.matrix(data)
n = dim(data)[1]
m = dim(data)[2]
y = data.matrix(data[,2])
X = data.matrix(data[,3:m])
one = matrix(rep(1,n))
X = cbind(one,X)
k = dim(X)[2]
lnL <- function(p)
(n/2)*log(2*3.1415926)+(n/2)*log(p[1])+(1/(2*p[1]))*(t(y-X%*%p[2:(k+1)])%*% (y-X%*%p[2:(k+1)]))
out <- optim(c(rep(1,(k+1))), lnL, hessian=TRUE, method = "BFGS")
beta <- matrix(out$par)
stdev <- matrix(sqrt(diag(solve(out$hessian))))
t_beta <- beta/stdev
cbind(beta[2:(k+1)],stdev[2:(k+1)],t_beta[2:(k+1)])
reg_lm <- lm(y~X-1)
cbind(beta[2:(k+1)],reg_lm$coefficient)
solve(t(X)%*%(X))%*%(t(X)%*%y)

The results are not as similar as before: small sample size.

A Template for Your Forecasting Needs

require(forecast)
data <- read.csv("BigMacIndex.csv")
data<-data.matrix(data)

# T is the total number of observations we have


T = nrow(data)
# H is the forecast horizon: i.e., how far into the future
# say you want to forecast 4 quarters ahead
H=4

# loading the observed time series.


# Let's say you want to treat the last 4 observations as unknown
# a practice known as creating a hold-out-sample
YAUS = data[1:(T-H),2]
YASIA = data[1:(T-H),3]
# Define both series as quarterly time series
YAUS <- ts(YAUS, frequency =4)
YASIA <- ts(YASIA, frequency =4)
# First we fit forecasting models
# ARIMA is a popular choice
# exponential smoothing (ets) is another one
fit_arima_AUS <- auto.arima(YAUS)
fit_arima_AUS
fit_ets_ASIA <- ets(YASIA)
fit_ets_ASIA

# using the fitted models to generate forecasts


forecast_arima_AUS = forecast(fit_arima_AUS,H)$mean
forecast_arima_AUS <- data.matrix(forecast_arima_AUS)
forecast_ets_ASIA = forecast(fit_ets_ASIA,H)$mean
forecast_ets_ASIA <- data.matrix(forecast_ets_ASIA)
# compare your forecasts to the actual (AUS)
TIME = matrix(data[,1],T,1)
REAL = matrix(data[,2],T,1)
FORECASTED = matrix(c(YAUS,forecast_arima_AUS),T,1)
plot(TIME,REAL,type="l",col="blue",xlim=c(1,T),ylim=c(1.0,4.0))
par(new=TRUE)
plot(TIME,FORECASTED,type="l",col="green",xlim=c(1,T),ylim=c(1.0,4.0))
# compare your forecasts to the actual (ASIA)
TIME = matrix(data[,1],T,1)
REAL = matrix(data[,3],T,1)
FORECASTED = matrix(c(YASIA,forecast_ets_ASIA),T,1)
plot(TIME,REAL,type="l",col="blue",xlim=c(1,T),ylim=c(1.0,4.0))
par(new=TRUE)
plot(TIME,FORECASTED,type="l",col="green",xlim=c(1,T),ylim=c(1.0,4.0))

(Figures on the original slides: actual vs forecasted series for Australia and Asia.)
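As an optional extra (a sketch, not in the original slides), the forecast package's accuracy() function compares the hold-out forecasts with the actual last H observations:

accuracy(forecast(fit_arima_AUS, H), data[(T-H+1):T, 2])   # RMSE, MAE, etc. for Australia
accuracy(forecast(fit_ets_ASIA, H), data[(T-H+1):T, 3])    # and for Asia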
