
Theory

PROBABILITY
DISTRIBUTIONS
By Dr Gary Deng

Every random variable has an associated probability distribution.
Random variables can be:
Univariate
Multivariate
Every statistical test is based on some probability distribution.
Every regression model requires assumptions to be made about the probability distribution of the residuals.
For probability distributions that have known analytical solutions, you can simulate values from them: Monte Carlo Simulation.

Univariate Random Variables

We will focus mainly on univariate random variables in this unit.
To put it simply, a univariate random variable has a designated set of possible values and associated probabilities.
A random variable can be either:
Discrete
Continuous

Discrete Random Variables

A discrete random variable (X) has:
A set of possible values: x1, x2, ..., xk
A set of probabilities: p1, p2, ..., pk
And: p1 + p2 + ... + pk = 1

Example:

# Your mark for this subject
X = c(30, 40, 50, 60, 70, 80)
# The probability of you getting these marks
P = c(0.3, 0.3, 0.05, 0.05, 0.1, 0.2)
# Checking that the sum of probabilities equals 1
sum(P)

Discrete Random Variables

The most important features of a random variable:
Mean: mu = sum of xi * pi
Variance: sigma^2 = sum of (xi - mu)^2 * pi

Example (there are many ways of doing this):

mean=sum(X*P)
mean
sum( (X-mean)^2*P )

You can also use matrix algebra:

mean=t(X)%*%P
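As a small sketch (not in the original slides), the variance can be computed in the same matrix-algebra style:

# variance via matrix algebra: sum of (xi - mu)^2 * pi, returned as a (1x1) matrix
mu = sum(X*P)
t((X - mu)^2) %*% P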
Continuous Random Variables

A continuous random variable has:
A continuous set of values.
The fixed set of probabilities of a discrete variable are now replaced with a continuous probability density function (pdf).
The pdf is typically denoted by p(x), and its integral over all possible values equals 1 (the probabilities sum to 1).
For the remainder of this lecture we will focus mainly on continuous random variables.

Uniform Distribution

The uniform distribution is the easiest of all.
It's "uniform" because every value has the same probability density.
pdf: p(x) = 1/(b - a) for a <= x <= b, and 0 otherwise,
where a is the min value and b is the max value.

Uniform Distribution
R code:

# Generate 10 uniform random numbers between 3 and 5.


runif(10, 3, 5)
# Generate 100000 uniform random numbers between 3 and 5.
X <- runif(100000,3,5)
# Look at its histogram.
hist(X)
# compute its mean using the formula.
(1/2)*(3+5)
# verify using the mean function.
mean(X)
# compute its variance using the formula.
(1/12)*(5-3)^2
# verify using the var function.
var(X)

Normal Distribution

The most widely used distribution of all, commonly denoted as N(mu, sigma^2).
Let Z = (X - mu)/sigma, then Z ~ N(0, 1). This is where your standard normal table comes from.

R code:

# Generate 10,000 N(0,1) numbers (i.e., standard normal).
# no need to specify the mean and variance.
X<-rnorm(10000)
# Generate 10,000 N(30, 10^2) numbers.
X<-rnorm(10000,30,10)
# checking the histogram.
hist(X)
# checking the mean and variance.
mean(X)
var(X)

R code (tracing out the normal CDF using the error function and optimize):

install.packages("pracma")
library("pracma")
# mean of the normal distribution
mu = 5
# standard deviation of the normal distribution
sigma = 1
# storage of corresponding X values
NormalCDF = matrix(0,10000,1)
# cumulative probability ranges from 0.0001 to 1
PROB = seq(0.0001,1,0.0001)
# a simple loop
for (i in 1:10000)
{
# define the normal cumulative density function
# erf is known as the error function
# Notice the function is defined as squared deviation from cumulative probability
# what happens if it's not squared?
# the minimizer will seek to find the lowest value rather than a zero value
# test it yourself by removing ^2
NCDF <- function(X)
{ ( PROB[i] - 1/2 * (1 + erf( (X-mu)/(sigma*2^0.5) )) )^2 }
# optimize is an optimizing function used for situations where there is one unknown
# the function is NCDF
# 0 is the lower bound of the search
# 10 is the upper bound of the search
out = optimize(NCDF, lower = 0, upper = 10 )
# stores the solved value of X in the above function
NormalCDF[i] = out$minimum
}
plot(NormalCDF,PROB)

Chi Squared Distribution

The Chi Squared distribution is derived from the standard normal distribution.
Let Z ~ N(0,1), and if Z1, Z2, ..., Zn are drawn randomly and independently from this distribution, then their sum of squares is chi-squared distributed with n degrees of freedom:
Z1^2 + Z2^2 + ... + Zn^2 ~ chi-squared(n)
And:
E(X) = n; Var(X) = 2n
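As a quick simulation check (a sketch, not from the slides), you can build chi-squared draws directly from standard normals:

# sum of squares of 5 independent N(0,1) draws, repeated 10,000 times
Z <- matrix(rnorm(10000*5), 10000, 5)
SS <- rowSums(Z^2)
mean(SS)   # close to n = 5
var(SS)    # close to 2n = 10
hist(SS)   # compare with hist(rchisq(10000, 5))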

Chi Squared Distribution

Recall the most widely used test of independence between categorical variables, the Chi-square test:
chi-squared = sum over cells of (Observed - Expected)^2 / Expected
In light of the definition of a Chi-squared variable, the above Chi-square test makes a lot more sense: it is a sum of squared standard normal variables.

R code:

# Generate 10,000 Chi-squared numbers with 5 degrees of freedom.
X<-rchisq(10000,5)
# Look at its histogram.
hist(X)
# Verify that the mean and variance are indeed n and 2n respectively
mean(X)
var(X)

F Distribution

The F distribution is (F)undamental to hypothesis testing.
For example, your ANOVA outputs are nothing more than reported results of an F-test.
The F distribution is defined in terms of two chi-squared variables.
Let:
X1 ~ chi-squared(df1) and X2 ~ chi-squared(df2);
X1 and X2 are independent of each other;
Then:
F = (X1/df1) / (X2/df2) ~ F(df1, df2)

R code:

# Generate 100 F numbers with df1 = 5, df2 = 2.
X<-rf(100,5,2)
# Look at its histogram.
hist(X)
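A small simulation sketch (not in the slides) that matches this definition against rf():

# build F draws from two independent chi-squared variables
df1 <- 5; df2 <- 2
F1 <- (rchisq(100000, df1)/df1) / (rchisq(100000, df2)/df2)
F2 <- rf(100000, df1, df2)
quantile(F1, c(0.5, 0.9, 0.95))
quantile(F2, c(0.5, 0.9, 0.95))  # the two sets of quantiles should be close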

t Distribution

The t-test should not be unfamiliar to any of you.
Now I am going to show you that it is closely linked to all the distributions we have mentioned up to this point.

t Distribution: First Representation

Let Z ~ N(0,1) and X ~ chi-squared(v), and let Z and X be independent of each other.
Then:
t = Z / sqrt(X/v) ~ t(v)
Mean: 0
Variance: v/(v-2), where v = degrees of freedom
t is just the ratio between a standard normal and the square root of a Chi-squared standardized by its degrees of freedom.
It's easy to see that as v goes to infinity, the variance goes to 1, and t collapses to a standard normal (mean 0, variance 1).
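Here is a short simulation sketch of that construction (not from the slides; degrees of freedom assumed to be 10):

v <- 10
T1 <- rnorm(100000) / sqrt(rchisq(100000, v)/v)   # standard normal over sqrt(chi-squared/df)
T2 <- rt(100000, v)                               # drawn directly from the t distribution
var(T1); var(T2); v/(v-2)                         # both close to v/(v-2) = 1.25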

t Distribution: Second Representation

Let Z ~ N(0,1) and X ~ chi-squared(v), and let Z and X be independent of each other.
Then:
t^2 = Z^2 / (X/v) ~ F(1, v), i.e., an F distribution with df1 = 1.
In other words, the square of a t is just an F.
In light of how BOTH the t statistic and the F statistic are used in hypothesis testing, this relationship is very intuitive.
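A quick numerical check of this relationship (a sketch; degrees of freedom assumed to be 10):

v <- 10
quantile(rt(100000, v)^2, c(0.5, 0.9, 0.95))   # squared t draws
quantile(rf(100000, 1, v), c(0.5, 0.9, 0.95))  # F(1, v) draws: quantiles should be close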

Other important distributions

Later on we will be looking at Generalized Linear Models (GLM), where the distributional assumptions made about the residuals are not normal.
Some of the most important distributions will be presented below.
You do NOT need to know the mathematics behind these distributions.
The important thing is for you to recognize WHEN to use these distributions.

Logistic Distribution

Looks very similar to a normal distribution, but it has fatter tails.
It is used when your dependent variable is binary (0/1).
It is the foundation of the Logit model.

R codes:

# generate 10000 logistic random numbers with location 0 and scale 1.
X<-rlogis(10000, 0, 1)
# generate 10000 standard normal random numbers.
Y<-rnorm(10000, 0,1)
# compare their histograms
# Note add=T means I am adding the second histogram to the first
hist(X, xlim=c(-20,20), col="red")
hist(Y, add=T, col="blue")

Exponential Distribution

Frequently used to describe the time between events in a Poisson process (more on Poisson later).
PDF: p(x; lambda) = lambda * exp(-lambda*x)
CDF: F(x; lambda) = 1 - exp(-lambda*x)

Example: suppose that the average time you spend on clearing a stage in a computer game is exponentially distributed with a mean time of 30 minutes; then the rate parameter is lambda = 1/30. The probability of you wasting more than an hour of your life fighting imaginary dragons on your Xbox is:
P(X > 60) = 1 - F(60; 1/30) = exp(-60/30) = 0.1353353
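You can confirm this number in R (a quick sketch using the built-in exponential CDF):

# P(X > 60) when the mean waiting time is 30 minutes (rate = 1/30)
pexp(60, rate = 1/30, lower.tail = FALSE)   # 0.1353353
exp(-60/30)                                 # same value by hand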

Exponential Distribution

R codes:

# generate 10000 random exponential numbers from the example
X<-rexp(10000, rate = 1/30)
hist(X)

Poisson Distribution

Closely related to the exponential distribution.
It is used to describe the probability of a number of events occurring in a fixed interval of time or a fixed area of space.
It is very commonly used in dealing with COUNT data.
PMF: P(X = k) = lambda^k * exp(-lambda) / k!, where lambda is the expected number of occurrences.
E(X) = Var(X) = lambda.
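To see the link with the exponential distribution in action, here is a small sketch (illustrative, not from the slides): if waiting times between events are exponential with rate 1/30 per minute, then the number of events in each 60-minute window behaves like a Poisson with lambda = 2.

arrivals <- cumsum(rexp(200000, rate = 1/30))   # event times in minutes
counts <- tabulate(ceiling(arrivals/60))        # number of events in each 60-minute window
counts <- counts[-length(counts)]               # drop the incomplete last window
mean(counts); var(counts)                       # both close to 60*(1/30) = 2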

Poisson Distribution

R codes:

# generate 10000 random Poisson numbers
X<-rpois(10000, 1)
Y<-rpois(10000, 10)
Z<-rpois(10000, 20)
# stack their histograms together
hist(X, xlim=c(0,50), col="red")
hist(Y, add=T, col="blue")
hist(Z, add=T, col="green")
# check the mean and variance are indeed very close
mean(X)
var(X)

Example: Tips (ANOVA)

Textbook pages 198-210.

Load package reshape2.

> data(tips, package = "reshape2")
> tips
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4

Example: Tips (ANOVA)

Let's look at whether Day, Time, and Sex have any significant impact on the amount of Tips received.

R codes

tipAnova <- aov(tip~day+time+sex-1,tips)
summary(tipAnova)
tipAnova$coefficients

Example: Tips (ANOVA)

Let's look at the p-value of gender.
In this F distribution: DF1 = 1, DF2 = 238.

# R code for obtaining the p-value from an F distribution.
pf(0.820,1,238,lower.tail=FALSE)

Recall that when DF1 = 1 the F distribution is the same as the square of a t(DF2).

# R code for obtaining the p-value from a t distribution.
pt(0.820^0.5,238,lower.tail=FALSE)*2

R Shiny

It's a great web application framework for R.
Go to: http://shiny.rstudio.com/tutorial/ for a comprehensive online tutorial.

LINEAR MODELS
By Dr Gary Deng

First you must install the Shiny package:

install.packages("shiny")
library(shiny)

Every Shiny App has two components:
1. a user-interface (ui) script
2. a server script

The user-interface (ui) script controls the layout and appearance of your app. It is defined in a source script named ui.R. The server.R script contains the instructions that your computer needs to build your app.
You need to put BOTH the ui.R and server.R scripts in the same folder.

Shiny example:

Here is a simple App. As you change the number of bins using the slider, the histogram of X changes accordingly.

ui script:

# Define UI for application that draws a histogram
shinyUI(fluidPage(
# Application title
titlePanel("Hello Shiny!"),
# Sidebar with a slider input for the number of bins
sidebarPanel(
sliderInput("bins",
"Number of bins:",
min = 1,
max = 50,
value = 30)
),
# Show a plot of the generated distribution
mainPanel(
plotOutput("distPlot")
)
))

fluidPage: Essentially the layout of the webpage changes as you change the size of the window.
sidebarPanel: This defines the layout of the sidebar panel. You don't have to have a sidebar, but typically it is useful to separate your Controls (such as sliders, checkboxes, etc.) from the Results (such as charts, tables, etc.).
sliderInput: One of the many widgets you can use to design your app.
value: This is just the starting value.
mainPanel: This defines the layout of the main panel, which contains just one plot.
plotOutput: This is used to plot any outputs you want to plot. The name inside the brackets is what you are plotting.

server script:

# Define server logic required to draw a histogram
shinyServer(function(input, output) {
# Expression that generates a histogram. The expression is
# wrapped in a call to renderPlot to indicate that:
# 1) it is "reactive" and therefore should re-execute automatically when inputs change
# 2) its output type is a plot
output$distPlot <- renderPlot({
x <- faithful[, 2] # Old Faithful Geyser data
bins <- seq(min(x), max(x), length.out = input$bins + 1)
# draw the histogram with the specified number of bins
hist(x, breaks = bins, col = 'darkgray', border = 'white')
})
})

function(input, output): There is always an input and an output for a server function.
output$distPlot: The output is an object known as distPlot. It corresponds to the ui.R script's plotOutput("distPlot").
renderPlot: You need to render outputs. Other options include renderText, renderTable, etc. Rendering is hugely important.
The rest of it is no different from how you would script in R.

Matrix Algebra (Very Basic)

Let A be any (n x m) matrix. n is its row dimension, and m is the column dimension.
If n = m = 1, you have a scalar (a single number).
If n > 1 & m = 1, you have a Column Vector.
If n = 1 & m > 1, you have a Row Vector.
If n = m > 1, you have a Square Matrix.
The transpose of a matrix, often denoted as A' or A^T, reverses the rows and columns of a matrix.

Matrix Algebra (Very Basic)


# Column Vector
X = matrix(c(3,4,7,8,9,10,3,3),8,1)
X
# Row Vector
X = matrix(c(3,4,7,8,9,10,3,3),1,8)
X
# a (4x2) matrix
X = matrix(c(3,4,7,8,9,10,3,3),4,2)
X
# a (3x3) square matrix
X = matrix(c(3,4,7,8,9,10,3,3,1),3,3)
X
# transpose of X
t(X)

Matrix Multiplication

Let there be two matrices, X (n x m) and Y (p x q).
You can multiply X and Y if and only if they are conformable. And the ORDER in which you multiply them matters. That is, XY is not equal to YX in general.
For XY to exist, m must be equal to p, and the resultant matrix A = XY will have dimension (n x q).
For YX to exist, q must be equal to n, and the resultant matrix B = YX will have dimension (p x m).
In general, A is not equal to B even if both A and B are square matrices.
In the case of A = XY, each element in A is the sumproduct of the corresponding ROW in X and COLUMN in Y. For example, if an element is located in row 2 and column 3 of the A matrix, it will be the sumproduct of row 2 of X and column 3 of Y.

Matrix Multiplication

X = matrix(c(3,4,7,8,9,10,3,3,1),3,3)
Y = matrix(c(6,8,2,1,2,1,3,2,1),3,3)
O = matrix(c(2,2,3),3,1)
# X and O are conformable (3x1)
X%*%O
# O and X are NOT conformable
O%*%X
# t(O) and X are conformable (1x3)
t(O)%*%X
# X and Y are conformable
X%*%Y
# first element of X%*%Y
# is sumproduct of row1 of X and column1 of Y
sum(X[1,]*Y[,1])
# last element of X%*%Y
# is sumproduct of row3 of X and column3 of Y
sum(X[3,]*Y[,3])
# YX is NOT equal to XY
Y%*%X==X%*%Y

Matrix Inverse

Matrix inverse is hugely important in statistics.
Similar to: 8 * 8^(-1) = 1.
For B to be the matrix inverse of A, we must have:
AB = BA = I, where I is known as an Identity matrix with 1 on its leading diagonal and 0 everywhere else.
Only square matrices have a matrix inverse.
But NOT all square matrices are invertible.

Matrix Inverse

X = matrix(c(3,4,7,8,9,10,3,3,1),3,3)
# in R, matrix inverse is done through solve().
XI = solve(X)
X%*%XI
# if you foolishly did this
XI = X^-1
XI
# you have inverted every single element in X.
# XI will NOT be the true matrix inverse of X.
X%*%XI
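As a small illustration (not in the slides) of a square matrix that is NOT invertible:

S = matrix(c(1,2,2,4),2,2)   # the second column is exactly twice the first
try(solve(S))                # solve() reports the matrix is singular: no inverse exists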

Theory of LM

In matrix notation:
y = X*beta + e
Where:
y is (n x 1); X is (n x k); beta is (k x 1); e is (n x 1).
n is the number of observations; k is the number of regressors/predictors/explanatory variables/exogenous variables, which typically includes the intercept/constant unless specified otherwise.
e determines the stochastic properties of y. In this chapter, we will assume that it is multivariate normal.
Clearly, beta determines the nature of the relationship between the response variable and the regressors. But it is unknown and needs to be estimated.

LS estimation

The most fundamental of all estimation principles in statistics is Least Squares.
It originated in the early 19th century.
Consider a simple example: looking at the relationship between speed and stopping distance.

Cars example

data(cars)
attach(cars)
plot(speed,dist,xlim=c(0,25))

Simple Linear regression model (lm):

regmodel=lm(dist~speed)
summary(regmodel)

Cars example

The regression model: dist = b0 + b1*speed + e. Your LS estimates are the coefficients reported by summary(regmodel).

R codes:

# generate a sequence of xs
x=seq(min(speed),max(speed),1)
ypredicted=regmodel$coefficients[1]+regmodel$coefficients[2]*x
lines(x,ypredicted,col="red")

LS Principle:

Let the residuals of the regression model be denoted by:
e_i = y_i - yhat_i
where yhat_i = b0 + b1*x_i is the LS fitted value of y_i. Graphically, e_i is the vertical distance between y_i and the line of best fit.
The least squares (LS) principle is:
Select b0 and b1 to minimize the residual sum of squares (RSS):
RSS = sum of e_i^2 = sum of (y_i - b0 - b1*x_i)^2

First order conditions are:
dRSS/db0 = -2 * sum of (y_i - b0 - b1*x_i) = 0
dRSS/db1 = -2 * sum of x_i*(y_i - b0 - b1*x_i) = 0
Solving the above two equations gives us the LS estimates.
It is easy to extend the above to more than one regressor, but it requires matrix algebra.
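As a numerical sketch of the LS principle (not in the slides; lm() solves this analytically), you can minimize the RSS directly with optim:

data(cars)
RSS <- function(b) sum( (cars$dist - b[1] - b[2]*cars$speed)^2 )
out <- optim(c(0,0), RSS)
out$par    # close to coef(lm(dist ~ speed, data = cars))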

LS Estimator In Matrix Algebra

Recall our previous LM example with CARS:
y = X*b + e
e = y - X*b
RSS = t(e) %*% e = t(y - X*b) %*% (y - X*b)
Using matrix algebra and differentiating the above with respect to b, you will get the most memorable matrix algebra result of all time:
b = (X'X)^(-1) X'y
Note: Eqn (16.3) on page 211 is WRONG.

In Matrix Algebra

Run the following codes in R:

data(cars)
attach(cars)
Y = matrix(dist)
dim(Y)
one = matrix(rep(1,50))
X = cbind(one,matrix(speed))
dim(X)
b = solve( t(X)%*%X ) %*% ( t(X)%*%Y )
b

Other (not-so-good) options

It is important to recognize that the LS line of best fit is NOT the only option out there.
If you choose a different penalty function, you will get a different line.
For example, you could instead require the sum of residuals to equal zero: sum of (y_i - yhat_i) = sum of y_i - n*ybar = 0. One such line (not unique) is the horizontal line through the sample mean.
Clearly we are then under-predicting half of the data and over-predicting the other half.

Properties of LS estimators

For all intents and purposes, you just have to remember that LS estimators are BLUE:
BLUE: Best Linear Unbiased Estimator.
Under the following assumptions:
1) The regressors are INDEPENDENT of the residuals.
2) The regressors are NOT perfectly collinear with each other.
3) The residuals are homoscedastic.
4) The residuals are uncorrelated.

Simulated Example to Verify that LS estimators are indeed BLUE

Let: X1 = 1; X2 ~ N(3, 1); X3 ~ chi-squared(3).
Let: n = 1,000.
Let: e ~ N(0, 0.5^2) (i.e., standard deviation 0.5).
Let: B1 = 1; B2 = 0.1; B3 = 3.
Simulate Y (1,000 x 1).

R codes:
X1<-rep(1,1000)
X2<-rnorm(1000,3,1)
X3<-rchisq(1000,3)
E<-rnorm(1000,0,0.5)
B1=1
B2=0.1
B3=3
Y<-B1*X1+B2*X2+B3*X3+E

Simulated Example

What is the distribution of the error term?
What is the distribution of Y?

hist(Y, xlim=c(-5,60),col="blue")
hist(E,add=T,col="red")

Assuming normally distributed residuals does NOT imply a normally distributed Y.

Simulated Example

regmodel=lm(Y~X1+X2+X3-1)
summary(regmodel)

The LS estimates are not going to be exactly equal to the known values, but they are very close.

Simulated Example

All estimators have a corresponding sampling distribution. Under repeated sampling, you see that the means are practically the same as the known values.

# Simulate 10,000 times.
T <- 10000
# Use BETA to store all LS estimates from EACH simulated sample
BETA <- matrix(0,T,3)
# Use a simple LOOP to do this.
for (i in 1:10000)
{
X1<-rep(1,1000)
X2<-rnorm(1000,3,1)
X3<-rchisq(1000,3)
E<-rnorm(1000,0,0.5)
B1=1
B2=0.1
B3=3
Y<-B1*X1+B2*X2+B3*X3+E
# store the LS estimates
regmodel=lm(Y~X1+X2+X3-1)
BETA[i,] <- regmodel$coefficient
}
# compute the MEAN of the 10,000 LS estimates of the three coefficients.
colMeans(BETA)
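A small follow-up sketch (not in the slides) to look at one sampling distribution directly:

# the sampling distribution of the estimate of B2, centred on the known value 0.1
hist(BETA[,2])
sd(BETA[,2])   # the simulated standard error of the LS estimate of B2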

Electricity Usage Example

Electricity Usage Example

data <- read.csv("Electricity.csv")


attach(data)
plot(WMAXit,AvgKWH)
plot(WMINit,AvgKWH)

regmodel=lm(AvgKWH~WMAXit+WMINit+Time)
summary(regmodel)
plot(resid(regmodel))
hist(resid(regmodel))

Electricity Usage Example


WMAXit2 <- WMAXit^2
WMINit2 <- WMINit^2
regmodel=lm(AvgKWH~WMAXit+WMINit+WMAXit2+WMINit2+Time)
summary(regmodel)
plot(resid(regmodel))
hist(resid(regmodel))

GENERALIZED LINEAR MODELS
By Dr Gary Deng

More Rshiny: a Forecasting App!

The app lets you:
Select the time series you want to forecast.
Select how many quarters ahead you want to forecast.
Obtain both time series plots and a table of forecast values.

The time series data needed for the app:
Need to load the time series data first.
Need to use the forecast package.
The choices of time series to be selected are given by the column names of the loaded csv file.

This is known as a reactive environment, as the outputs constantly REACT to input changes:
Define start and end dates of the observed sample data.
Define Y as a quarterly time series.
Define a list of objects that you need to refer to when rendering results.

Two key inputs:
(1) Select input as to which time series to forecast.
(2) Select the number of quarters to forecast ahead.

Both Y and H come from the reactive object selectedData defined above.
There are a lot of options available but you don't have to worry about them here.

Theory

There are many occasions where a LS estimation of a linear regression model may not be appropriate.
Specifically, when the residuals of the model do not follow a normal distribution (which is very often the case in practice), LS estimation could result in biased and inconsistent estimates.
As the name suggests, GLM is a flexible generalization of a linear regression model.
The Right Hand Side remains a LINEAR function: X*beta, i.e., a linear combination of the predictors.
The Left Hand Side response variable, however, is now related to X*beta via a so-called link function.
More precisely: E(Y) = mu = g^(-1)(X*beta), where mu is the mean of the response variable and g(.) is the link function.
The nature of this link function depends on the distribution of the dependent variable.

R function

The glm() function definition shown on the original slide is taken straight from R.
For all intents and purposes, in most cases you only need to know how to specify the following:
formula: the linear combination of the regressors on the right hand side.
data: the dataset being used.
family: describes the error distribution and the link function.
You can just about ignore the rest of the arguments in this function.
GLMs are mostly estimated using ML (which we will look at next week).
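As a minimal runnable sketch (not from the slides), the three key arguments look like this; with the gaussian family and identity link the fit matches lm():

# formula on the left, then the data frame and the family (distribution + link)
regmodel <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))
summary(regmodel)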

R Function

These are the distribution families and corresponding link functions that you can choose from.
In this lecture we will only look at:
Binomial
Poisson

Binomial Data

One of the most common problems encountered in practice is that the response variable could be binary in nature.
The two primary ways of dealing with binomial data are: Probit and Logit.
For all intents and purposes, they give you very similar results, and we will just look at Logit.

Logit Model

One way to understand the Logit model is to employ a very important concept called a latent variable, which forms the foundation of most probability models involving categorical dependent variables.
For example, let us suppose that we are looking at the probability of a suspect making a false confession, i.e., the dependent variable Y = 1 when the confession is false and Y = 0 otherwise.
Intuitively, one could think of one's willingness to lie as a latent variable that determines the outcome of the confession. It is unobserved, but when this willingness to lie passes a certain threshold, one's confession becomes false.
Graphically, the idea of this latent variable is a threshold: once the latent variable crosses it, Y switches from 0 to 1.

Logit Model

Define Y* as the latent variable. The initial problem of modelling Y = 1 or Y = 0 can then be restated as:
Y = 1 if Y* > 0, and Y = 0 otherwise.
Let us rewrite the linear regression model in terms of the latent variable:
Y* = b0 + b1*X + e
Therefore:
P(Y = 1) = P(Y* > 0) = P(b0 + b1*X + e > 0) = P(e > -(b0 + b1*X)) = P(e < b0 + b1*X)
Finally, if we assume that e follows a so-called extreme value distribution, the quantity P(e < b0 + b1*X) can be written as:
P(Y = 1) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X)) = F(b0 + b1*X)
where F(.) is the cumulative density function of a logistic distribution evaluated at b0 + b1*X, which lies between 0 and 1.
The nice thing about this latent variable approach is that it makes an intuitive link between an observed binary outcome (such as false confession) and an unobserved yet intuitive latent variable (such as willingness to lie).
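In R, the logistic CDF used above is plogis(); a tiny sketch (not from the slides):

# P(Y = 1) when b0 + b1*X equals 0 and 2 respectively
plogis(0)             # 0.5
plogis(2)             # about 0.88
exp(2)/(1 + exp(2))   # same value written out as in the formula above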

Logit Model

Another way to understand the logit model is through a functional transformation of the left-hand side dependent variable. More specifically, the transformation uses what is known as the logit function:
logit(p) = ln( p / (1 - p) )
Firstly, we note that the input to a logit function is p, i.e., the probability of Y = 1. Secondly, the natural log, i.e., ln(.), is used. Thirdly, p/(1 - p) is known as the odds, and it is simply the probability of observing Yes relative to the probability of observing No.
The binary logistic regression is then defined as:
ln( p / (1 - p) ) = b0 + b1*X

Logit Model: The Two Interpretations are Two Sides of the Same Coin

We can easily see why this transformation is useful by noticing that the left hand side variable of the regression model is no longer bounded between 0 and 1. For instance, if the outcome overwhelmingly favours Yes over No, then p tends to 1 and ln(p/(1-p)) tends to plus infinity. Conversely, when the odds overwhelmingly favour No over Yes, then p tends to 0 and ln(p/(1-p)) tends to minus infinity.
Finally, to show that the two interpretations are two sides of the same coin, one could easily derive the following:
p = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))
which is exactly the latent-variable expression for P(Y = 1) given earlier.

Example: US homicides

There were 3,085 counties in the US in 1990. The National Consortium on Violence Research (NCOVR) had compiled a dataset containing homicide rates (per 100,000 capita) for each of the 3,085 counties in the US in 1990, along with a number of socio-economic variables thought to be important in predicting homicide rates.
The Dependent Variable:
HomicideHotSpot = 1 if the homicide count exceeded 20 per 100,000 capita. These are the homicide "hotspots".
The Explanatory (socio-economic) Variables:
Southern (SOUTH): 1 if the county is in a southern state, 0 otherwise.
UnemploymentRate (UE90): unemployment rate of the county.
DivorceRate (DV90): divorce rate of the county.
MedianAge (MA90): median age of the county population.
PopulationStructure (PS90): a variable constructed using principal component analysis, which essentially captures the percentage of minority races in the county population. The larger the value of this variable, the larger the percentage of minority races.
ResourceDeprivation (RD90): a variable constructed using principal component analysis, which essentially captures the level of deprivation of social and economic infrastructure in the county. The larger the value of this variable, the more deprived a county was of adequate social and economic infrastructure (such as schools and hospitals).
The model is:
p = exp(b0 + b1*SOUTH + b2*UE90 + b3*DV90 + b4*MA90 + b5*PS90 + b6*RD90) / (1 + exp(b0 + b1*SOUTH + b2*UE90 + b3*DV90 + b4*MA90 + b5*PS90 + b6*RD90))

Example: US homicides
data <- read.csv("Homicides.csv")
attach(data)
regmodel_logit <- glm(HomicideHotSpot~SOUTH+UE90+DV90+MA90+PS90+RD90, family=binomial("logit"))
summary(regmodel_logit)

Example: US homicides

regmodel_Probit <- glm(HomicideHotSpot~SOUTH+UE90+DV90+MA90+PS90+RD90, family=binomial("probit"))
summary(regmodel_Probit)

Probit gives you similar results.

Poisson Model

Very often, we could be dealing with count data.
Counts are non-negative, and they are strictly integers.
Formally (p237, Lander):
y_i ~ Poisson(lambda_i), where y_i is the ith response and lambda_i = exp(X_i * beta) is the mean of the distribution for the ith observation.
Typically Poisson regression models are estimated using Maximum Likelihood.
There are a lot of advanced issues with Poisson regressions, such as overdispersion (variance > mean) and zero inflation (having too many 0 counts in the data). They are beyond the scope of this unit.

Example: Shipwreck

The Ship Damage Data.
These are the data from McCullagh and Nelder (1989). The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:
1) ship type, coded 1-5 for A, B, C, D and E,
2) year of construction (1=1960-64, 2=1965-70, 3=1970-74, 4=1975-79),
3) period of operation (1=1960-74, 2=1975-79),
4) months of service, ranging from 63 to 20,370, and
5) damage incidents, ranging from 0 to 53.
Reference: McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd Edition. Chapman and Hall, London. Page 204.

Example: Shipwreck

data <- read.csv("ShipWreck.csv")
attach(data)
data
hist(damage)

Example: Shipwreck

xtabs(~damage+type)
xtabs(~damage+construction)
xtabs(~damage+operation)
plot(damage~months)

Example: Shipwreck

months2 <- months^2
regmodel_Pois <- glm(damage~type+construction+operation+months+months2-1,family=poisson(link="log"))
summary(regmodel_Pois)
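With the log link, the fitted counts are the exponential of the linear predictor; a quick sketch (not in the slides):

# predicted damage counts for the first few rows
head(exp(predict(regmodel_Pois)))   # numerically the same as head(fitted(regmodel_Pois))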

MAXIMUM LIKELIHOOD ESTIMATION
By Dr Gary Deng

Optimization

Before we look at Maximum Likelihood (ML) Estimation in more detail, we will first look at Optimization in general.
Simply put, optimization means choosing values that make the value of a function as large or as small as possible.
(1) Optimization could be either Maximization (such as in ML) or Minimization (such as in LS).
(2) The function that gets optimized has to be a function of controllables. For example, in minimizing the Sum of Squares, you are choosing the values of your least squares estimates.
(3) There may or may not be substantive constraints to your optimization problem. I will not go into details about this as it is beyond the scope of this unit, but there are many real life problems that are constrained optimization problems. As an example, profit maximization could be subject to a labour supply constraint (a maximum number of hours that your employees could work, for instance).

LS as an optimization problem

Recall the LS estimation objective function for a simple linear regression model:
RSS(b0, b1) = sum of (y_i - b0 - b1*x_i)^2
It's a Minimization problem: you choose the values of b0 and b1.
In this case, the residual e_i = y_i - b0 - b1*x_i is a linear function of the parameters of interest, because b0 and b1 enter the right hand side of the equation in a linear fashion.

Nonlinear Least Squares

Very often, the residual is a Nonlinear function of the parameters of interest.
Consider the following example in Navigation. You are at Site 12, and the exact spatial locations of all other nineteen sites are known. Your equipment allows you to shoot laser beams and measure the distance between yourself and the other sites. However the measurement is prone to errors.
Question: can you find out your exact spatial coordinates?

Navigation Example

We are interested in the coordinates (a, b) of site 12.
We observe/measure the Euclidean distances between site 12 and all other sites with some error.
Thus the measured distance between site 12 and any site i is:
D_i = sqrt( (X_i - a)^2 + (Y_i - b)^2 ) + e_i
which is nonlinear in the two unknowns (a, b).

Navigation Example

We can turn this into an optimization problem. Nonlinear Least Squares minimizes:
RSS(a, b) = sum of ( D_i - sqrt( (X_i - a)^2 + (Y_i - b)^2 ) )^2
In R, there are a number of optimization routines that you can call on.
For all intents and purposes, knowing how to use optim is good enough in most cases. By default optim MINIMIZES.

Navigation Example

data <- read.csv("NavigationExample.csv")
attach(data)
# First we must specify the objective function
fn = function(P) sum( ( D - ( (X-P[1])^2 + (Y-P[2])^2 )^0.5 )^2 )
# next we specify initial values/guesses to the problem
initial = c(0,0)
# call on optim to minimize sum of squares
# the first argument is the initial value
# the second argument is the function to be optimized
# the third argument selects the numerical search method
# we will talk about Hessian later.
out = optim(initial, fn, method = "BFGS", hessian = TRUE)
# problem SOLVED
out$par

(Figure on the original slide: the optimizer's search path from the starting guess to site 12.)

You don't need to know how to do this, but if you had run the following codes:
out = optim(initial, fn, method = "L-BFGS-B", hessian = TRUE, control=list(trace=6,REPORT=1))
you would have obtained step-by-step information on the convergence process.
And if you had plotted the path on the same graph, you would see clearly how the algorithm gradually took us to the correct answer.

Maximum Likelihood Estimation

Let Y = (y1, y2, ..., yn)' be an n-vector of observed sample values. Let theta = (theta1, theta2, ..., thetak)' be a k-vector of unknown parameters. Furthermore, let Y depend on theta.
You can write down the joint density as f(Y; theta). This joint density can be interpreted in two ways:
1) Conditional on fixed theta, it is the probability of the sample outcomes Y.
2) Conditional on the sample outcomes Y, it is a function of theta.
The second interpretation is referred to as a likelihood function:
L(theta; Y) = f(Y; theta)
The reversal of the order of theta and Y emphasizes the new focus of interest. We observe Y, and we want to estimate theta.

Maximum Likelihood Estimation

In the most simple case (the only case we will explore in this unit), we look at what is known as an IID (Independent and Identically Distributed) sample.
To put it simply, an IID set of random variables: (1) have the same probability distribution; (2) are mutually independent.
A classic example of a sample that is NOT IID: serially dependent time series data.
Maximum Likelihood Estimation can be easily extended to deal with non-IID data, but it is a little bit more mathematically involved and so we will not get into it for now.

Maximum Likelihood Estimation

For an IID sample, we can write down the joint density function as:
f(Y; theta) = f(y1, ..., yn; theta) = f(y1; theta) * f(y2; theta) * ... * f(yn; theta)
Note: we can only write the joint density as the product of the marginal densities when the sample observations are independent of each other.
The likelihood function can then be written as:
L(theta; Y) = f(y1; theta) * f(y2; theta) * ... * f(yn; theta)

Maximum Likelihood Estimation

The Maximum Likelihood Principle:
Maximizing the likelihood function L(theta; Y) with respect to theta amounts to finding values of theta, typically denoted as theta-hat, that maximize the probability of obtaining the observed sample values Y = (y1, ..., yn)'.
In most empirical applications, it is simpler to maximize the log of the likelihood function, lnL(theta; Y). It is clear that the theta-hat that maximizes L(theta; Y) also maximizes lnL(theta; Y), as the logarithmic transformation is monotonic.
Differentiate lnL(theta; Y) with respect to theta: the derivative d lnL(theta; Y)/d theta is known as the score. The ML estimator theta-hat is obtained by setting the score to zero, that is:
d lnL(theta; Y)/d theta = 0

Maximum Likelihood Estimation

The majority of properties of ML estimators are large-sample or asymptotic results.
Key properties of ML Estimators:
1) Consistent: plim(theta-hat) = theta. In words: as the sample size approaches infinity, theta-hat converges in value to theta.
2) Asymptotically Normal: theta-hat ~ N(theta, I(theta)^(-1)). In words: theta-hat is asymptotically normally distributed with mean theta and variance given by the inverse of I(theta). I(theta) is known as the Information Matrix. Numerically, it is often evaluated with the Hessian Matrix:
H = d^2 lnL(theta; Y) / d theta d theta'
which is the second order matrix derivative of the log likelihood function. As you will see later, this will be numerically evaluated and generated in R.
Consistency means that under large-sample conditions ML estimators give you very good coefficient estimates. Asymptotic normality means that you can use its asymptotic distribution to perform hypothesis tests.
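Before the beetle example, here is a minimal sketch (not from the slides) of the whole recipe: write down the negative log likelihood, hand it to optim, and read approximate standard errors off the Hessian.

set.seed(1)
y <- rnorm(500, mean = 5, sd = 2)   # an IID normal sample
negLogLik <- function(p) {
  if (p[2] <= 0) return(1e10)       # keep the sd positive during the search
  -sum(dnorm(y, mean = p[1], sd = p[2], log = TRUE))
}
out <- optim(c(0, 1), negLogLik, hessian = TRUE)
out$par                             # ML estimates, close to (5, 2)
sqrt(diag(solve(out$hessian)))      # approximate standard errors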

Example One: Logistic Model for Dose-Response Data from Dobson (1990)

This example fits a logistic model to dose-response data. It is primarily illustrative, as it could easily be dealt with by glm() using family=binomial. The data, for beetle mortality, are as follows. For each of 8 experiments a group of n beetles were subject to a dose of carbon disulphide at concentration x for five hours, and y records the number that died.

# Here is the data


# carbon disulphide concentration
x = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
# beetles death
y = c(6, 13, 18, 28, 52, 53, 61, 60)
# beetles experimented on
n = c(59, 60, 62, 56, 63, 59, 62, 60)

Beetles Death

For continuous data, we have a probability density function (pdf). For discrete data, we have a probability mass function (pmf).
To apply maximum likelihood estimation, we need to be able to write down the pmf for this sample.
You can easily find out from Wikipedia that the pmf for a binomial distribution is:
P(Y = y) = choose(n, y) * pi^y * (1 - pi)^(n - y)
where n is the number of beetles experimented on, y is the number that died, and pi is the probability of success (in this case, Death).

Beetles Death

The probability of death depends on the concentration x. For logistic regression, recall that we model the probability using:
pi = exp(a + b*x) / (1 + exp(a + b*x))
which can be easily rewritten as:
1 - pi = 1 / (1 + exp(a + b*x))
Substitute the above expressions into the pmf.
Recall that for an IID sample, we can write down the joint density function as the product of the marginal pmfs. Evaluate the pmf for each experiment observation (8 experiments in total in this sample) and multiply them together: we have a likelihood function.
Specifically (omitting the i subscript for notational convenience), the likelihood function is:
L(a, b) = product over the 8 experiments of choose(n, y) * [exp(a + b*x)/(1 + exp(a + b*x))]^y * [1/(1 + exp(a + b*x))]^(n - y)
Using simple algebra involving log functions, we can easily work out:
lnL(a, b) = sum over the 8 experiments of [ y*(a + b*x) - n*log(1 + exp(a + b*x)) + log(choose(n, y)) ]
Now you will see why a log likelihood function is easier to work with. This is the log likelihood function we will maximize by choosing values for a and b.

Beetles Death

# Note that the default of optim is to minimize
# so we simply reverse the sign of the log likelihood function derived earlier
fn = function(B) sum( -y*(B[1]+B[2]*x) + n*log(1+exp(B[1]+B[2]*x)) - lchoose(n, y) )
out = optim(c(-50,20), fn, method = "BFGS", hessian = TRUE)

Note that the vector of parameters must be a single argument to the function, here denoted by B, where B[1] is a and B[2] is b.
We pick sensible starting values and do the fit. There are various ways of roughly guessing the values of a and b for use as starting values. Since this is a simple example, it doesn't matter much where we start.

Beetles Death

The maximum likelihood estimation output is stored in out.
Recall that the inverse of the negative of the Hessian matrix approximates the variance-covariance matrix of the estimated coefficient vector (in a minimization problem you don't have to take the negative of the Hessian).
sqrt(diag(solve(out$hessian)))
gives you the approximate standard errors of the maximum likelihood coefficient estimates.
Therefore, the t-stat for each estimated coefficient is the estimate divided by its approximate standard error.
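Putting those pieces together (a small sketch, not in the original slides):

beta <- out$par                            # ML estimates of a and b
se   <- sqrt(diag(solve(out$hessian)))     # approximate standard errors
cbind(beta, se, t_stat = beta/se)          # t-stats: estimate divided by standard error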

Beetles Death
Finally as a comparison, if we had used GLM, the results will be very similar.
Because GLMs are also estimated with ML

out = glm(cbind(y,n-y) ~ x, family = binomial)


summary(out)

is:

Gaussian (Normal) Data

The Gaussian (Normal) pdf is:
f(y; mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp( -(y - mu)^2 / (2*sigma^2) )
In the case of a linear regression model, y is the response variable, mu is the underlying mean conditional on the explanatory variables (i.e., X*beta), and sigma^2 will also be unknown.
As an example, let there be n observations in the sample, and let there be 3 explanatory variables. The negative of the log likelihood function is (verify on your own and use it for your assignment), with p[1] playing the role of sigma^2 and p[2:4] the coefficients:

lnL <- function(p) (n/2)*log(2*3.1415926)+(n/2)*log(p[1])+(1/(2*p[1]))*(t(y-X%*%p[2:4])%*%(y-X%*%p[2:4]))
out <- optim(c(rep(1,4)), lnL, hessian=TRUE, method = "BFGS")

REVISION WEEK
By Dr Gary Deng

Example 1: Baltimore House Price

Variable   Description
STATION    ID variable
PRICE      sales price of house in $1,000 (MLS)
NROOM      number of rooms
DWELL      1 if detached unit, 0 otherwise
NBATH      number of bathrooms
PATIO      1 if patio, 0 otherwise
FIREPL     1 if fireplace, 0 otherwise
AC         1 if air conditioning, 0 otherwise
BMENT      1 if basement, 0 otherwise
NSTOR      number of stories
GAR        number of car spaces in garage (0 = no garage)
AGE        age of dwelling in years
CITCOU     1 if dwelling is in Baltimore County, 0 otherwise
LOTSZ      lot size in hundreds of square feet
SQFT       interior living space in hundreds of square feet

R codes
data <- read.csv("BaltimoreHousing.csv")
data<-data.matrix(data)
n = dim(data)[1]
m = dim(data)[2]
y = data.matrix(data[,2])
X = data.matrix(data[,3:m])
one = matrix(rep(1,n))
X = cbind(one,X)
k = dim(X)[2]
lnL <- function(p)
(n/2)*log(2*3.1415926)+(n/2)*log(p[1])+(1/(2*p[1]))*(t(y-X%*%p[2:(k+1)])%*% (y-X%*%p[2:(k+1)]))
out <- optim(c(rep(1,(k+1))), lnL, hessian=TRUE, method = "BFGS")
beta <- matrix(out$par)
stdev <- matrix(sqrt(diag(solve(out$hessian))))
t_beta <- beta/stdev
cbind(beta[2:(k+1)],stdev[2:(k+1)],t_beta[2:(k+1)])
reg_lm <- lm(y~X-1)
cbind(beta[2:(k+1)],reg_lm$coefficient)
solve(t(X)%*%(X))%*%(t(X)%*%y)


Example 2: Columbus Crime

Variable   Description
POLYID     neighborhood ID, used in GeoDa User's Guide and tutorials
HOVAL      housing value (in $1,000)
INC        household income (in $1,000)
CRIME      residential burglaries and vehicle thefts per 1000 households
OPEN       open space (area)
PLUMB      percent housing units without plumbing
DISCBD     distance to CBD
NSA        north-south indicator variable (North = 1)
NSB        other north-south indicator variable (North = 1)
EW         east-west indicator variable (East = 1)
CP         core-periphery indicator variable (Core = 1)

R codes

Exactly the same codes as before (except for the file name):

data <- read.csv("columbus.csv")
data<-data.matrix(data)
n = dim(data)[1]
m = dim(data)[2]
y = data.matrix(data[,2])
X = data.matrix(data[,3:m])
one = matrix(rep(1,n))
X = cbind(one,X)
k = dim(X)[2]
lnL <- function(p)
(n/2)*log(2*3.1415926)+(n/2)*log(p[1])+(1/(2*p[1]))*(t(y-X%*%p[2:(k+1)])%*% (y-X%*%p[2:(k+1)]))
out <- optim(c(rep(1,(k+1))), lnL, hessian=TRUE, method = "BFGS")
beta <- matrix(out$par)
stdev <- matrix(sqrt(diag(solve(out$hessian))))
t_beta <- beta/stdev
cbind(beta[2:(k+1)],stdev[2:(k+1)],t_beta[2:(k+1)])
reg_lm <- lm(y~X-1)
cbind(beta[2:(k+1)],reg_lm$coefficient)
solve(t(X)%*%(X))%*%(t(X)%*%y)

The results are not as similar as before: small sample size.

A Template for Your Forecasting Needs

require(forecast)
data <- read.csv("BigMacIndex.csv")
data<-data.matrix(data)

# T is the total number of observations we have


T = nrow(data)
# H is the forecast horizon: i.e., how far into the future
# say you want to forecast 4 quarters ahead
H=4

# loading the observed time series.


# Let's say you want to treat the last 4 observations as unknown
# a practice known as creating a hold-out-sample
YAUS = data[1:(T-H),2]
YASIA = data[1:(T-H),3]
# Define both series as quarterly time series
YAUS <- ts(YAUS, frequency =4)
YASIA <- ts(YASIA, frequency =4)
# First we fit forecasting models
# ARIMA is a popular choice
# exponential smoothing (ets) is another one
fit_arima_AUS <- auto.arima(YAUS)
fit_arima_AUS
fit_ets_ASIA <- ets(YASIA)
fit_ets_ASIA

# using the fitted models to generate forecasts


forecast_arima_AUS = forecast(fit_arima_AUS,H)$mean
forecast_arima_AUS <- data.matrix(forecast_arima_AUS)
forecast_ets_ASIA = forecast(fit_ets_ASIA,H)$mean
forecast_ets_ASIA <- data.matrix(forecast_ets_ASIA)
# compare your forecasts to the actual (AUS)
TIME = matrix(data[,1],T,1)
REAL = matrix(data[,2],T,1)
FORECASTED = matrix(c(YAUS,forecast_arima_AUS),T,1)
plot(TIME,REAL,type="l",col="blue",xlim=c(1,T),ylim=c(1.0,4.0))
par(new=TRUE)
plot(TIME,FORECASTED,type="l",col="green",xlim=c(1,T),ylim=c(1.0,4.0))
# compare your forecasts to the actual (ASIA)
TIME = matrix(data[,1],T,1)
REAL = matrix(data[,3],T,1)
FORECASTED = matrix(c(YASIA,forecast_ets_ASIA),T,1)
plot(TIME,REAL,type="l",col="blue",xlim=c(1,T),ylim=c(1.0,4.0))
par(new=TRUE)
plot(TIME,FORECASTED,type="l",col="green",xlim=c(1,T),ylim=c(1.0,4.0))

(Figures on the original slides: actual vs forecasted series for Australia and Asia.)
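As an optional extra (a sketch, not in the original slides), the forecast package's accuracy() function compares the hold-out forecasts with the actual last H observations:

accuracy(forecast(fit_arima_AUS, H), data[(T-H+1):T, 2])   # RMSE, MAE, etc. for Australia
accuracy(forecast(fit_ets_ASIA, H), data[(T-H+1):T, 3])    # and for Asia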
