EC509 Slides 3, Bilkent University
So far, our interest has been in events involving only a single random variable. In
other words, we have considered only “univariate models.”
Multivariate models, on the other hand, involve more than one variable.
Consider an experiment about health characteristics of the population. Would we be
interested in one characteristic only, say weight? Not really. There are many
important characteristics.
Definition (4.1.1): An n-dimensional random vector is a function from a sample
space Ω into R^n, the n-dimensional Euclidean space.
Suppose, for example, that with each point in a sample space we associate an
ordered pair of numbers, that is, a point (x, y) ∈ R², where R² denotes the plane.
Then we have defined a two-dimensional (or bivariate) random vector (X, Y).
Example (4.1.2): Consider the experiment of tossing two fair dice. The sample
space has 36 equally likely points. Now, let X = the sum of the two dice and
Y = the absolute value of their difference. Then, for example,
(3, 3): X = 6 and Y = 0,
(4, 1): X = 5 and Y = 3.
What is, say, P(X = 5 and Y = 3)? One can verify that the two relevant sample
points in Ω are (4, 1) and (1, 4). Therefore, the event {X = 5 and Y = 3} will occur
only if the event {(4, 1), (1, 4)} occurs. Since each of these sample points in Ω is
equally likely,
P({(4, 1), (1, 4)}) = 2/36 = 1/18.
Thus,
P(X = 5 and Y = 3) = 1/18.
For example, can you see why
P(X = 7, Y ≤ 4) = 1/9?
This is because the only sample points that yield this event are (4, 3), (3, 4), (5, 2)
and (2, 5).
Note that from now on we will use P (event a, event b) rather than P (event a and
event b).
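A quick sanity check of these two probabilities, by brute-force enumeration of the 36 sample points (a minimal Python sketch; exact arithmetic via fractions):

from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))                 # probability of each sample point

# X = sum of the dice, Y = absolute difference.
prob_a = sum(p for (d1, d2) in omega if d1 + d2 == 5 and abs(d1 - d2) == 3)
prob_b = sum(p for (d1, d2) in omega if d1 + d2 == 7 and abs(d1 - d2) <= 4)

print(prob_a)   # 1/18, i.e. P(X = 5, Y = 3)
print(prob_b)   # 1/9,  i.e. P(X = 7, Y <= 4)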
The joint pmf is defined for all (x, y) ∈ R², not just the 21 pairs in the above Table.
For any other (x, y), f(x, y) = P(X = x, Y = y) = 0.
As before, we can use the joint pmf to calculate the probability of any event defined
in terms of (X, Y). For A ⊂ R²,
P((X, Y) ∈ A) = ∑_{(x,y)∈A} f(x, y).
Expectations of functions of (X, Y) are computed in the same way:
E[g(X, Y)] = ∑_{(x,y)∈R²} g(x, y) f(x, y).
Example (4.1.4): For the (X, Y) whose joint pmf is given in the above Table, what
is the expected value of XY? Letting g(x, y) = xy, we have
E[XY] = (2)(0)(1/36) + (4)(0)(1/36) + ... + (8)(4)(1/18) + (7)(5)(1/18) = 13 11/18 (that is, 245/18).
As in the univariate case, one very useful result is that any nonnegative function from R² into R that is
nonzero for at most a countable number of (x, y) pairs and sums to 1 is the joint
pmf of some bivariate discrete random vector (X, Y).
Suppose we have a bivariate random vector (X, Y) but are concerned only with, say,
P(X = 2).
We know the joint pmf f_{X,Y}(x, y), but what we need in this case is the marginal pmf f_X(x).
Theorem (4.1.6): Let (X, Y) be a discrete bivariate random vector with joint pmf
f_{X,Y}(x, y). Then the marginal pmfs of X and Y, f_X(x) = P(X = x) and
f_Y(y) = P(Y = y), are given by
f_X(x) = ∑_{y∈R} f_{X,Y}(x, y)   and   f_Y(y) = ∑_{x∈R} f_{X,Y}(x, y).
Proof: For any x ∈ R, let A_x = {(x, y) : −∞ < y < ∞}. That is, A_x is the line in
the plane with first coordinate equal to x. Then, for any x ∈ R,
f_X(x) = P(X = x) = P(X = x, −∞ < Y < ∞)
       = P((X, Y) ∈ A_x) = ∑_{(x,y)∈A_x} f_{X,Y}(x, y)
       = ∑_{y∈R} f_{X,Y}(x, y).
Example (4.1.7): Now we can compute the marginal distributions of X and Y from
the joint distribution given in the above Table. For Y, this gives
f_Y(0) = 1/6, f_Y(1) = 5/18, f_Y(2) = 2/9, f_Y(3) = 1/6, f_Y(4) = 1/9, f_Y(5) = 1/18.
Notice that ∑_{y=0}^{5} f_Y(y) = 1, as expected, since these are the only six possible values
of Y.
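These marginal values can be reproduced from the dice example by summing the joint pmf over x, exactly as in Theorem (4.1.6); a small illustrative sketch:

from fractions import Fraction
from itertools import product
from collections import defaultdict

# Joint pmf of (X, Y) for two fair dice: X = sum, Y = absolute difference.
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(d1 + d2, abs(d1 - d2))] += Fraction(1, 36)

# Marginal pmf of Y: f_Y(y) = sum over x of f_{X,Y}(x, y).
f_Y = defaultdict(Fraction)
for (x, y), pr in joint.items():
    f_Y[y] += pr

print({y: str(f_Y[y]) for y in sorted(f_Y)})
# {0: '1/6', 1: '5/18', 2: '2/9', 3: '1/6', 4: '1/9', 5: '1/18'}
print(sum(f_Y.values()))   # 1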
Now consider the marginal pmfs for the distribution considered in Example (4.1.5).
We obtain the same marginal pmfs even though the joint distributions are different!
It is important to realise that the joint pdf is defined for all (x, y) ∈ R². The pdf
may equal 0 on a large set A if P((X, Y) ∈ A) = 0, but the pdf is still defined for
the points in A.
Again, naturally,
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,   −∞ < x < ∞,
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx,   −∞ < y < ∞.
We can also describe the joint distribution through the joint cdf,
F(x, y) = P(X ≤ x, Y ≤ y)   for all (x, y) ∈ R².
Although for discrete random vectors it might not be convenient to use the joint cdf,
for continuous random variables the following relationship makes the joint cdf very
useful:
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds.
From the bivariate Fundamental Theorem of Calculus,
∂²F(x, y) / (∂x ∂y) = f(x, y)
at continuity points of f(x, y). This relationship is very important.
We have talked a little bit about conditional probabilities before. Now we will
consider conditional distributions.
The idea is the same. If we have some extra information about the sample, we can
use that information to make better inference.
Suppose we are sampling from a population where X is the weight (in kg) and Y is
the height (in cm). What is P(X > 95)? Would we have a better/more relevant
answer if we knew that the person in question has Y = 202 cm? Usually,
P(X > 95 | Y = 202) should be much larger than P(X > 95 | Y = 165).
Once we have the joint distribution for (X , Y ) , we can calculate the conditional
distributions, as well.
Notice that now we have three distribution concepts: marginal distribution,
conditional distribution and joint distribution.
Definition (4.2.1): Let (X, Y) be a discrete bivariate random vector with joint pmf
f(x, y) and marginal pmfs f_X(x) and f_Y(y). For any x such that
P(X = x) = f_X(x) > 0, the conditional pmf of Y given that X = x is the function
of y denoted by f(y|x) and defined by
f(y|x) = P(Y = y | X = x) = f(x, y) / f_X(x).
For any y such that P(Y = y) = f_Y(y) > 0, the conditional pmf of X given that
Y = y is the function of x denoted by f(x|y) and defined by
f(x|y) = P(X = x | Y = y) = f(x, y) / f_Y(y).
Note that f(y|x) is indeed a pmf: for any fixed x with f_X(x) > 0,
∑_y f(y|x) = ( ∑_y f(x, y) ) / f_X(x) = f_X(x) / f_X(x) = 1.
In addition,
f(10|1) = f(30|1) = (3/18) / (10/18) = 3/10,
f(20|1) = (4/18) / (10/18) = 4/10,
f(30|2) = (4/18) / (4/18) = 1.
Interestingly, when X = 2, we know for sure that Y will be equal to 30.
Finally,
f(x|y) = f(x, y) / f_Y(y).
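As an illustration (using the dice example rather than the Table above), the sketch below computes f(y|x) = f(x, y)/f_X(x) and checks that each conditional pmf sums to one:

from fractions import Fraction
from itertools import product
from collections import defaultdict

# Joint and marginal pmfs for two fair dice: X = sum, Y = absolute difference.
joint, f_X = defaultdict(Fraction), defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    x, y = d1 + d2, abs(d1 - d2)
    joint[(x, y)] += Fraction(1, 36)
    f_X[x] += Fraction(1, 36)

def cond_pmf_y_given_x(x):
    """Return {y: f(y|x)} for all y with f(x, y) > 0."""
    return {y: pr / f_X[x] for (xx, y), pr in joint.items() if xx == x}

print(cond_pmf_y_given_x(5))                  # f(3|5) = f(1|5) = 1/2
print(sum(cond_pmf_y_given_x(5).values()))    # 1, as it must be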
E[(X − b)²] = E[{X − E[X] + E[X] − b}²]
            = E[({X − E[X]} + {E[X] − b})²]
            = E[{X − E[X]}²] + {E[X] − b}² + 2E[{X − E[X]}{E[X] − b}],
where
E[{X − E[X]}{E[X] − b}] = {E[X] − b} E[{X − E[X]}] = 0.
Then,
E[(X − b)²] = E[{X − E[X]}²] + {E[X] − b}².
We have no control over the first term, while the second term is nonnegative and is
made equal to zero by setting b = E[X]. Therefore, b = E[X] is the value that
minimises the prediction error and is the best predictor of X.
In other words, the expectation of a random variable is its best predictor.
Can you show this for the conditional expectation, as well? See Exercise 4.13 which
you will be asked to solve as homework.
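A small numerical illustration of this result, with a fair six-sided die as an arbitrary example (so E[X] = 3.5 and the minimised error equals Var(X) = 35/12):

import numpy as np

# pmf of a fair six-sided die.
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

def mse(b):
    """Prediction error E[(X - b)^2] for a constant predictor b."""
    return np.sum(pmf * (values - b) ** 2)

grid = np.linspace(0, 7, 7001)
best = grid[np.argmin([mse(b) for b in grid])]
print(best)        # ~3.5, i.e. E[X]
print(mse(3.5))    # 35/12 ~ 2.9167, the irreducible part Var(X)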
Var(Y|x) = E[Y²|x] − {E[Y|x]}².
Hence,
Var(Y|x) = ∫_x^∞ y² e^{−(y−x)} dy − ( ∫_x^∞ y e^{−(y−x)} dy )² = 1.
Again, you can obtain the first part of this result by integration by parts.
What is the implication of this? One can show that the marginal distribution of Y is
gamma (2, 1) and so Var (Y ) = 2. Hence, the knowledge that X = x has reduced the
variability of Y by 50%!!!
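A Monte Carlo check of this variance reduction, assuming the underlying joint here is f(x, y) = e^{−y} for 0 < x < y, so that X ∼ exponential(1) and Y | X = x is x plus an independent exponential(1) draw (hence marginally Y ∼ gamma(2, 1)); a sketch only:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.exponential(1.0, size=n)        # X ~ exponential(1) (assumed marginal)
y = x + rng.exponential(1.0, size=n)    # Y | X = x ~ x + exponential(1)

print(y.var())          # ~2: the marginal variance, Y ~ gamma(2, 1)
print((y - x).var())    # ~1: the conditional variance Var(Y|x)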
In the above example, for instance, E[Y|X] = 1 + X, so the conditional distribution of Y changes with the value of X.
Yet, in some other cases, the conditional distribution might not depend on the
conditioning variable.
Say, the conditional distribution of Y given X = x is not different for different
values of x.
In other words, knowledge of the value of X does not provide any more information.
This situation is defined as independence.
Definition (4.2.5): Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y) and marginal pdfs or pmfs f_X(x) and f_Y(y). Then X and Y are called
independent random variables if, for every x ∈ R and y ∈ R,
f(x, y) = f_X(x) f_Y(y).    (1)
Example (4.2.6): Consider the discrete bivariate random vector (X , Y ) , with joint
pmf given by
Lemma (4.2.7): Let (X, Y) be a bivariate random vector with joint pdf or pmf f(x, y). Then X and Y are independent random variables if and only if there exist functions g(x) and h(y) such that, for every x ∈ R and y ∈ R, f(x, y) = g(x)h(y).
Proof (continuous case): Suppose f(x, y) = g(x)h(y), and define c = ∫_{−∞}^{∞} g(x) dx and d = ∫_{−∞}^{∞} h(y) dy, where
cd = ∫_{−∞}^{∞} g(x) dx ∫_{−∞}^{∞} h(y) dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) dy dx = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1.
Moreover,
f_X(x) = ∫_{−∞}^{∞} g(x)h(y) dy = g(x)d   and   f_Y(y) = ∫_{−∞}^{∞} g(x)h(y) dx = h(y)c.
Then,
f(x, y) = g(x)h(y) = g(x)h(y)·cd = [g(x)d][h(y)c] = f_X(x) f_Y(y),
proving that X and Y are independent.
To prove the Lemma for discrete random vectors, replace integrals with summations.
It might not be clear enough at first sight, but this is a powerful result. It implies
that we do not have to calculate the marginal distributions first and then check
whether their product gives the joint distribution.
Instead, it is enough to check whether the joint distribution is equal to the product
of some function of x and some function of y, for all values of (x, y) ∈ R².
Example (4.2.8): Consider the joint pdf
f(x, y) = (1/384) x² y⁴ e^{−y−(x/2)},   x > 0 and y > 0.
If we define
g(x) = x² e^{−x/2}   and   h(y) = y⁴ e^{−y} / 384,
for x > 0 and y > 0, and g(x) = h(y) = 0 otherwise, then clearly
f(x, y) = g(x)h(y)
for all (x, y) ∈ R², so by Lemma (4.2.7), X and Y are independent random variables.
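A quick numerical confirmation that this factorisation is consistent: the sketch below integrates g and h and checks that the joint pdf integrates to 1. (The constants also identify the marginals as gamma densities, X ∼ gamma(3, 2) and Y ∼ gamma(5, 1); that identification is an added observation, not part of the example above.)

import numpy as np

# Riemann-sum approximations of the two one-dimensional integrals.
x = np.linspace(0.0, 200.0, 200_001)
dx = x[1] - x[0]
c = np.sum(x**2 * np.exp(-x / 2)) * dx          # ~16   = Gamma(3) * 2^3

y = np.linspace(0.0, 200.0, 200_001)
dy = y[1] - y[0]
d = np.sum(y**4 * np.exp(-y) / 384) * dy        # ~1/16 = Gamma(5) / 384

print(c, d, c * d)   # c*d ~ 1, so the joint pdf integrates to 1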
Proof: Start with (2) and consider the continuous case. Now,
E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f(x, y) dx dy
            = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_X(x) f_Y(y) dx dy
            = ( ∫_{−∞}^{∞} g(x) f_X(x) dx )( ∫_{−∞}^{∞} h(y) f_Y(y) dy )
            = E[g(X)] E[h(Y)].
Let g(x) and h(y) be the indicator functions of the sets A and B, respectively.
Then g(x)h(y) is the indicator function of the set C ⊂ R², where
C = {(x, y) : x ∈ A, y ∈ B}.
Note that for an indicator function such as g(x), E[g(X)] = P(X ∈ A), since
P(X ∈ A) = ∫_{x∈A} f(x) dx = ∫_{−∞}^{∞} I_A(x) f(x) dx = E[I_A(X)].
Therefore,
P(X ∈ A, Y ∈ B) = P((X, Y) ∈ C) = E[g(X)h(Y)] = E[g(X)] E[h(Y)] = P(X ∈ A) P(Y ∈ B).
These results make life a lot easier when calculating expectations of certain random
variables.
For example, if X and Y are independent with E[X] = E[Y] = 1 and Var(X) = 1, then
E[X²Y] = E[X²] E[Y] = (Var(X) + {E[X]}²) E[Y] = (1 + 1²)(1) = 2.
Similarly, for Z = X + Y with X and Y independent,
E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t),
where we have used the result that for two independent random variables X and Y,
E[g(X)h(Y)] = E[g(X)] E[h(Y)].
Of course, if independence does not hold, life gets pretty tough! But we will not
deal with that here.
For example, if X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y) are independent, then
M_{X+Y}(t) = M_X(t) M_Y(t) = exp{ (µ_X + µ_Y)t + (σ²_X + σ²_Y)t²/2 },
which is the mgf of a normal random variable with mean µ_X + µ_Y and variance
σ²_X + σ²_Y.
We now consider transformations involving two random variables rather than only
one.
Let (X, Y) be a random vector and consider (U, V), where
U = g_1(X, Y) and V = g_2(X, Y)
for some known functions g_1 and g_2. For any set B in the range of (U, V), let A
denote the set of (x, y) values that are mapped into B by the transformation. Hence,
P((U, V) ∈ B) = P((X, Y) ∈ A).
This implies that the probability distribution of (U , V ) is completely determined by
the probability distribution of (X , Y ) .
f_{X,Y}(x, y) = (θ^x e^{−θ} / x!) (λ^y e^{−λ} / y!),   x = 0, 1, 2, ... and y = 0, 1, 2, ... .
Obviously,
A = {(x, y) : x = 0, 1, 2, ... and y = 0, 1, 2, ...}.
Define
U = X + Y and V = Y,
implying that
g_1(x, y) = x + y and g_2(x, y) = y, so that
u = x + y = x + v.
Moreover, for any (u, v) the only (x, y) satisfying u = x + y and v = y are given by
x = u − v and y = v. Therefore, we always have
A_{uv} = {(u − v, v)}.
As such,
f_{U,V}(u, v) = ∑_{(x,y)∈A_{uv}} f_{X,Y}(x, y) = f_{X,Y}(u − v, v)
            = (θ^{u−v} e^{−θ} / (u − v)!) (λ^v e^{−λ} / v!),   v = 0, 1, 2, ... and u = v, v + 1, v + 2, ... .
The marginal pmf of U is then
f_U(u) = ∑_{v=0}^{u} (θ^{u−v} e^{−θ} / (u − v)!) (λ^v e^{−λ} / v!)
       = (e^{−(θ+λ)} / u!) ∑_{v=0}^{u} (u! / (v!(u − v)!)) θ^{u−v} λ^v = (θ + λ)^u e^{−(θ+λ)} / u!,   u = 0, 1, 2, ...,
which follows from the binomial formula, (a + b)^n = ∑_{x=0}^{n} (n! / (x!(n − x)!)) a^x b^{n−x}.
This is the pmf of a Poisson random variable with parameter θ + λ. A theorem
follows.
Theorem (4.3.2): If X ∼ Poisson(θ), Y ∼ Poisson(λ), and X and Y are
independent, then X + Y ∼ Poisson(θ + λ).
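A minimal numerical check of the theorem: convolving the two pmfs directly and comparing with the Poisson(θ + λ) pmf (the particular values of θ and λ below are arbitrary):

from math import exp, factorial

def pois_pmf(k, mean):
    return mean**k * exp(-mean) / factorial(k)

theta, lam = 2.0, 3.5
for u in range(10):
    # f_U(u) = sum over v of f_X(u - v) f_Y(v), as derived above.
    conv = sum(pois_pmf(u - v, theta) * pois_pmf(v, lam) for v in range(u + 1))
    assert abs(conv - pois_pmf(u, theta + lam)) < 1e-12
print("convolution matches the Poisson(theta + lambda) pmf for u = 0, ..., 9")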
More generally, when the transformation is one-to-one, we can solve
u = g_1(x, y) and v = g_2(x, y)
for (x, y) and obtain
x = h_1(u, v) and y = h_2(u, v).
The last remaining ingredient is the Jacobian of the transformation. This is the
determinant of the matrix of partial derivatives:
J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).
The joint pdf of (U, V) is then given by the transformation formula
f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v)) |J|.
The next example is based on the beta distribution, which is related to the gamma
distribution.
The beta(α, β) pdf is given by
f_X(x | α, β) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1},
where 0 < x < 1, α > 0 and β > 0.
In this example, X ∼ beta(α, β) and Y ∼ beta(α + β, γ) are independent, and we
consider U = XY and V = X.
Now we know that V = X and the set of possible values for X is 0 < x < 1. Hence,
the set of possible values for V is given by 0 < v < 1.
Since U = XY = VY, for any given value of V = v, U will vary between 0 and v, as
the set of possible values for Y is 0 < y < 1. Hence
0 < u < v.
For
x = h_1(u, v) = v   and   y = h_2(u, v) = u/v,
we have
J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = det [ 0  1 ; 1/v  −u/v² ] = −1/v,
so that |J| = 1/v.
Then, the transformation formula gives
f_{U,V}(u, v) = [Γ(α + β + γ) / (Γ(α)Γ(β)Γ(γ))] v^{α−1} (1 − v)^{β−1} (u/v)^{α+β−1} (1 − u/v)^{γ−1} (1/v),
where 0 < u < v < 1.
Obviously, since V = X, V ∼ beta(α, β). What about U?
Define K = Γ(α + β + γ) / (Γ(α)Γ(β)Γ(γ)). Then,
f_U(u) = ∫_u^1 f_{U,V}(u, v) dv
       = K ∫_u^1 v^{α−1} (1 − v)^{β−1} (u/v)^{α+β−1} (1 − u/v)^{γ−1} (1/v) dv
       = K u^{α−1} ∫_u^1 (1 − v)^{β−1} (u/v)^{β−1} (1 − u/v)^{γ−1} (u/v²) dv.
Let
y = (u/v − u) / (1 − u) = (u/v)(1 − v) / (1 − u),
which implies that
dy = −(u/v²) (1/(1 − u)) dv,
or, equivalently,
(u/v²) dv = −(1 − u) dy.
Note that as v increases from u to 1, y decreases from 1 to 0, so the minus sign is
absorbed by reversing the limits of integration.
Now, observe that
y^{β−1} = [ (u/v)(1 − v) / (1 − u) ]^{β−1},
(1 − y)^{γ−1} = [ (1 − u − u/v + u) / (1 − u) ]^{γ−1} = [ (1 − u/v) / (1 − u) ]^{γ−1}.
Now,
y^{β−1} (1 − y)^{γ−1}
is the kernel of the pdf of Y ∼ beta(β, γ),
f_Y(y | β, γ) = [Γ(β + γ) / (Γ(β)Γ(γ))] y^{β−1} (1 − y)^{γ−1},   0 < y < 1, β > 0, γ > 0,
and it is straightforward to show that
∫_0^1 y^{β−1} (1 − y)^{γ−1} dy = Γ(β)Γ(γ) / Γ(β + γ).
Hence,
f_U(u) = K u^{α−1} (1 − u)^{β+γ−1} ∫_0^1 y^{β−1} (1 − y)^{γ−1} dy
       = [Γ(α + β + γ) / (Γ(α)Γ(β)Γ(γ))] [Γ(β)Γ(γ) / Γ(β + γ)] u^{α−1} (1 − u)^{β+γ−1}
       = [Γ(α + β + γ) / (Γ(α)Γ(β + γ))] u^{α−1} (1 − u)^{β+γ−1},
where 0 < u < 1.
This shows that the marginal distribution of U is beta (α, β + γ) .
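A Monte Carlo check of this conclusion under the setup above (X ∼ beta(α, β) and Y ∼ beta(α + β, γ), independent): the product U = XY should match beta(α, β + γ), here compared through its first two moments for one arbitrary choice of parameters.

import numpy as np

rng = np.random.default_rng(1)
alpha, beta, gamma = 2.0, 3.0, 4.0
n = 1_000_000

u = rng.beta(alpha, beta, n) * rng.beta(alpha + beta, gamma, n)   # U = XY

# Moments of beta(alpha, beta + gamma) for comparison.
a, b = alpha, beta + gamma
print(u.mean(), a / (a + b))                            # both ~0.2222
print(u.var(), a * b / ((a + b) ** 2 * (a + b + 1)))    # both ~0.01728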
Now let X and Y be independent N(0, 1) random variables and consider
U = g_1(X, Y) = X + Y   and   V = g_2(X, Y) = X − Y.
The joint pdf of (X, Y) is
f_{X,Y}(x, y) = (1/√(2π)) exp(−x²/2) · (1/√(2π)) exp(−y²/2) = (1/(2π)) exp(−(x² + y²)/2),
where −∞ < x < ∞ and −∞ < y < ∞.
Therefore,
A = {(x, y) : f_{X,Y}(x, y) > 0} = R².
We have
u = x + y and v = x − y,
and to determine B we need to find out all the values taken on by (u, v) as we
choose different (x, y) ∈ A.
Thankfully, when these equations are solved for (x, y) they yield unique solutions:
x = h_1(u, v) = (u + v)/2   and   y = h_2(u, v) = (u − v)/2.
Now, the reasoning is as follows: A = R². Moreover, for every (u, v) ∈ B, there is a
unique (x, y). Therefore, B = R², as well! The mapping is one-to-one and, by
definition, onto.
Therefore, with |J| = 1/2,
f_{U,V}(u, v) = [1/(√(2π)√2)] exp(−u²/4) · [1/(√(2π)√2)] exp(−v²/4).
Since the joint density factors into a function of u and a function of v, by Lemma
(4.2.7), U and V are independent.
In particular, U = X + Y ∼ N(0, 2); and defining Z = −Y ∼ N(0, 1), we see that
V = X − Y = X + Z ∼ N(0, 2),
as well.
That sums and differences of independent normal random variables are
independent normal random variables is true regardless of the means of X and Y,
as long as Var(X) = Var(Y). See Exercise 4.27.
Note that the more difficult bit here is to prove that U and V are indeed
independent.
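A quick empirical look at this fact (a sketch only; zero sample correlation is of course necessary rather than sufficient for independence, which here follows from the factorised density):

import numpy as np

rng = np.random.default_rng(2)
x, y = rng.standard_normal((2, 1_000_000))   # independent N(0, 1) draws
u, v = x + y, x - y

print(u.var(), v.var())           # both ~2, as N(0, 2) variances
print(np.corrcoef(u, v)[0, 1])    # ~0: U and V are uncorrelated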
What if we start with a bivariate random variable (X , Y ) but are only interested in
U = g (X , Y ) and not (U , V )?
We can then choose a convenient V = h (X , Y ) , such that the resulting
transformation from (X , Y ) to (U , V ) is one-to-one on A.
Then, the joint pdf of (U , V ) can be calculated as usual and we can obtain the
marginal pdf of U from the joint pdf of (U , V ) .
Which V is “convenient” would generally depend on what U is.
We now turn to hierarchical models. As an example, consider an insect that lays Y
eggs, where Y ∼ Poisson(λ), and suppose that each egg survives independently with
probability p, so that, given Y, the number of survivors X satisfies
X | Y ∼ binomial(Y, p). What is the expected number of survivors? The marginal
pmf of X is
P(X = x) = ∑_{y=0}^{∞} P(X = x, Y = y) = ∑_{y=0}^{∞} P(X = x | Y = y) P(Y = y)
         = ∑_{y=x}^{∞} [ (y! / (x!(y − x)!)) p^x (1 − p)^{y−x} ] [ λ^y e^{−λ} / y! ]
         = ((λp)^x e^{−λ} / x!) ∑_{y=x}^{∞} [ (1 − p)λ ]^{y−x} / (y − x)!,
where the summations start from y = x rather than y = 0 since, if y < x, the
conditional probability is equal to zero: clearly, the number of surviving eggs cannot
be larger than the number of those laid.
But the summation in the final term is the kernel of a Poisson((1 − p)λ) pmf. Remember
that
∑_{t=0}^{∞} [ (1 − p)λ ]^t / t! = e^{(1−p)λ}.
Therefore,
P(X = x) = ((λp)^x e^{−λ} / x!) e^{(1−p)λ} = (λp)^x e^{−λp} / x!,
which implies that X ∼ Poisson(λp).
The answer to the original question then is E [X ] = λp : on average, λp eggs survive.
Now comes a very useful theorem which you will, most likely, use frequently in the
future.
Remember that E [X jy ] is a function of y and E [X jY ] is a random variable whose
value depends on the value of Y .
Theorem (4.4.3): If X and Y are two random variables, then
E_X[X] = E_Y{ E_{X|Y}[X | Y] },
provided that the expectations exist.
The corresponding proof for the discrete case can be obtained by replacing integrals
with sums.
How does this help us? Consider calculating the expected number of survivors again.
E_X[X] = E_Y{ E_{X|Y}[X | Y] } = E_Y[pY] = p E_Y[Y] = pλ.
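A simulation sketch of the two-stage hierarchy confirming E[X] = pλ (and, as shown earlier, X ∼ Poisson(pλ), so the variance matches too); the particular λ and p are arbitrary:

import numpy as np

rng = np.random.default_rng(3)
lam, p, n = 10.0, 0.3, 1_000_000

y = rng.poisson(lam, n)          # Y ~ Poisson(lambda): eggs laid
x = rng.binomial(y, p)           # X | Y ~ binomial(Y, p): eggs surviving

print(x.mean(), p * lam)         # both ~3
print(x.var(), p * lam)          # also ~3, consistent with X ~ Poisson(p*lambda)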
X | Y ∼ binomial(Y, p),
Y | Λ ∼ Poisson(Λ),
Λ ∼ exponential(β).
Then,
E_X[X] = E_Y{ E_{X|Y}[X | Y] } = E_Y[pY]
       = E_Λ{ E_{Y|Λ}[pY | Λ] } = p E_Λ{ E_{Y|Λ}[Y | Λ] }
       = p E_Λ[Λ] = pβ.
The final expression is that of the negative binomial pmf. Therefore, the two-stage
hierarchy is given by
X | Y ∼ binomial(Y, p),
Y ∼ negative binomial(r = 1, p = 1/(1 + β)).
Var_P(E_{X|P}[X | P]) = Var_P(nP) = n² αβ / [ (α + β)² (α + β + 1) ].
Moreover, Var_{X|P}(X | P) = nP(1 − P), due to X | P being a binomial random
variable.
What about E_P[Var_{X|P}(X | P)]? Remember that the beta(α, β) pdf is given by
f_X(x | α, β) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1},
where 0 < x < 1, α > 0 and β > 0.
Then,
E_P[Var_{X|P}(X | P)] = E_P[nP(1 − P)]
  = n [Γ(α + β) / (Γ(α)Γ(β))] ∫_0^1 p(1 − p) p^{α−1} (1 − p)^{β−1} dp.
The integrand is the kernel of another beta pdf, with parameters α + 1 and β + 1,
since the pdf of P ∼ beta(α + 1, β + 1) is given by
[Γ(α + β + 2) / (Γ(α + 1)Γ(β + 1))] p^α (1 − p)^β.
Therefore,
E_P[Var_{X|P}(X | P)] = n [Γ(α + β) / (Γ(α)Γ(β))] [Γ(α + 1)Γ(β + 1) / Γ(α + β + 2)]
  = n [Γ(α + β) / (Γ(α)Γ(β))] [ αΓ(α) βΓ(β) / ((α + β + 1)(α + β)Γ(α + β)) ]
  = n αβ / [ (α + β + 1)(α + β) ].
Thus,
Var_X(X) = n αβ / [ (α + β + 1)(α + β) ] + n² αβ / [ (α + β)² (α + β + 1) ]
         = n αβ (α + β + n) / [ (α + β)² (α + β + 1) ].
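A simulation check of this beta-binomial variance formula for one arbitrary choice of n, α and β:

import numpy as np

rng = np.random.default_rng(4)
n_trials, alpha, beta = 10, 2.0, 5.0
m = 2_000_000

p = rng.beta(alpha, beta, m)            # P ~ beta(alpha, beta)
x = rng.binomial(n_trials, p)           # X | P ~ binomial(n, P)

s = alpha + beta
theory = n_trials * alpha * beta * (s + n_trials) / (s**2 * (s + 1))
print(x.var(), theory)                  # both ~4.34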
We now consider the covariance and correlation of X and Y. The covariance of X
and Y is Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)], and
ρ_XY = Cov(X, Y) / (σ_X σ_Y)
is also called the correlation coefficient.
If large (small) values of X tend to be observed with large (small) values of Y , then
Cov (X , Y ) will be positive.
Why so? Within the above setting, when X > µ_X then Y > µ_Y is likely to be true,
whereas when X < µ_X then Y < µ_Y is likely to be true. Hence
E[(X − µ_X)(Y − µ_Y)] > 0.
Similarly, if large (small) values of X tend to be observed with small (large) values of
Y , then Cov (X , Y ) will be negative.
Cov(X, Y) = E[XY] − µ_X µ_Y.
Proof:
Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
          = E[XY − µ_X Y − µ_Y X + µ_X µ_Y]
          = E[XY] − µ_X E[Y] − µ_Y E[X] + µ_X µ_Y
          = E[XY] − µ_X µ_Y.
As an example, let f_{X,Y}(x, y) = 1 for 0 < x < 1 and x < y < x + 1, and 0 otherwise. Now,
f_X(x) = ∫_x^{x+1} f_{X,Y}(x, y) dy = ∫_x^{x+1} 1 dy = [y]_{y=x}^{x+1} = 1,
implying that X ∼ Uniform(0, 1). Therefore, E[X] = 1/2 and Var(X) = 1/12.
Now, f_Y(y) is a bit more complicated to calculate. Considering the region where
f_{X,Y}(x, y) > 0, we observe that 0 < x < y when 0 < y < 1, and y − 1 < x < 1
when 1 ≤ y < 2. Therefore,
f_Y(y) = ∫_0^y f_{X,Y}(x, y) dx = ∫_0^y 1 dx = y,   when 0 < y < 1,
f_Y(y) = ∫_{y−1}^1 f_{X,Y}(x, y) dx = ∫_{y−1}^1 1 dx = 2 − y,   when 1 ≤ y < 2.
Moreover,
E_{X,Y}[XY] = ∫_0^1 ∫_x^{x+1} xy dy dx = ∫_0^1 [xy²/2]_{y=x}^{x+1} dx
            = ∫_0^1 (x² + x/2) dx = 7/12.
Therefore, since E[Y] = 1 (by the symmetry of f_Y around y = 1),
Cov(X, Y) = E[XY] − E[X]E[Y] = 7/12 − (1/2)(1) = 1/12   and   ρ_XY = 1/√2.
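A Monte Carlo sketch of these two values, using the fact that this joint pdf corresponds to Y = X + Z with X and Z independent Uniform(0, 1):

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 1_000_000)
y = x + rng.uniform(0, 1, 1_000_000)   # (X, Y) uniform on 0 < x < 1, x < y < x + 1

print(np.cov(x, y)[0, 1], 1 / 12)              # both ~0.0833
print(np.corrcoef(x, y)[0, 1], 1 / 2**0.5)     # both ~0.7071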
If X and Y are independent random variables, then
E[XY] = E[X]E[Y].
Then
Cov(X, Y) = E[XY] − µ_X µ_Y = µ_X µ_Y − µ_X µ_Y = 0,
and consequently,
ρ_XY = Cov(X, Y) / (σ_X σ_Y) = 0.
It is crucial to note that although X ⊥⊥ Y implies that Cov(X, Y) = ρ_XY = 0, the
relationship does not necessarily hold in the reverse direction.
Example (4.5.8): Let X ∼ Uniform(0, 1), Z ∼ Uniform(0, 1/10) and Z ⊥⊥ X. Let,
moreover,
Y = X + Z,
and consider (X, Y).
Consider the following intuitive derivation of the distribution of (X, Y). We are
given X = x and Y = x + Z for a particular value of X. Now, since Z ⊥⊥ X, the
conditional distribution of Z given X = x is still Uniform(0, 1/10), so
Y | X = x ∼ Uniform(x, x + 1/10).
Cov(X, Y) = E[XY] − E[X]E[Y]
          = E[X(X + Z)] − E[X]E[X + Z]
          = E[X²] + E[XZ] − {E[X]}² − E[X]E[Z]
          = E[X²] + E[X]E[Z] − {E[X]}² − E[X]E[Z]
          = E[X²] − {E[X]}² = Var(X) = 1/12.
By Theorem (4.5.6),
Var(Y) = Var(X) + Var(Z) = 1/12 + 1/1200.
Then,
ρ_XY = (1/12) / ( √(1/12) √(1/12 + 1/1200) ) = √(100/101).
−1 ≤ ρ_XY ≤ 1;
see part (1) of Theorem (4.5.7) in Casella & Berger (p. 172). We will provide an
alternative proof of this result when we deal with inequalities.
The bivariate normal distribution with parameters µ_X, µ_Y, σ²_X, σ²_Y and ρ is
(X, Y)′ ∼ N(µ, Σ),   with mean vector µ = (µ_X, µ_Y)′ and covariance matrix
Σ = [ σ²_X   ρσ_X σ_Y ; ρσ_X σ_Y   σ²_Y ].
In addition, starting from the bivariate distribution, one can show that
Y | X = x ∼ N( µ_Y + ρσ_Y (x − µ_X)/σ_X , σ²_Y (1 − ρ²) ),
and, likewise,
X | Y = y ∼ N( µ_X + ρσ_X (y − µ_Y)/σ_Y , σ²_X (1 − ρ²) ).
Finally, again starting from the bivariate distribution, it can be shown that the marginals themselves are normal: X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y).
Therefore, joint normality implies conditional and marginal normality. However, this
does not go in the opposite direction; marginal normality does not imply joint
normality.
Indeed, the bivariate normal density can be factorised into two univariate normal
densities: one factor involves y only through the standardised variable
(y − µ_{Y|X}) / σ_{Y|X}, and the other is
(1/(σ_X √(2π))) exp(−v²/2) = (1/(σ_X √(2π))) exp( −(1/2) ((x − µ_X)/σ_X)² ),   with v = (x − µ_X)/σ_X.
The first of these can be considered as the conditional density f_{Y|X}(y|x), which is
normal with mean and variance
µ_{Y|X} = µ_Y + (ρσ_Y/σ_X)(x − µ_X)   and   σ²_{Y|X} = (1 − ρ²)σ²_Y.
The second, on the other hand, is the unconditional density f_X(x), which is normal
with mean µ_X and variance σ²_X.
So far, we have discussed multivariate random variables which consist of two random
variables only. Now, we extend these ideas to general multivariate random variables.
For example, we might have the random vector (X1 , X2 , X3 , X4 ) where X1 is
temperature, X2 is weight, X3 is height and X4 is blood pressure.
The ideas and concepts we have discussed so far extend to such random vectors
directly.
Let X = (X_1, ..., X_n). Then the sample space for X is a subset of R^n, the
n-dimensional Euclidean space.
If this sample space is countable, then X is a discrete random vector and its joint
pmf is given by
f(x) = f(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n).
For any A ⊂ R^n,
P(X ∈ A) = ∑_{x∈A} f(x).
Similarly, for the continuous random vector, we have the joint pdf given by
f(x) = f(x_1, ..., x_n), which satisfies
P(X ∈ A) = ∫···∫_A f(x) dx = ∫···∫_A f(x_1, ..., x_n) dx_1 ... dx_n.
Note that ∫···∫_A is an n-fold integration, where the limits of integration are such
that the integral is calculated over all points x ∈ A.
Let g(x) = g(x_1, ..., x_n) be a real-valued function defined on the sample space of X.
Then, for the random variable g(X),
E[g(X)] = ∑_x g(x) f(x) in the discrete case, and E[g(X)] = ∫···∫ g(x) f(x) dx in the continuous case.
The marginal pdf or pmf of (X_1, ..., X_k), the first k coordinates of (X_1, ..., X_n), is
obtained by integrating (or summing) the joint pdf or pmf over the remaining
coordinates x_{k+1}, ..., x_n. Whenever f(x_1, ..., x_k) > 0, the conditional pdf or pmf of
(X_{k+1}, ..., X_n) given X_1 = x_1, ..., X_k = x_k is
f(x_{k+1}, ..., x_n | x_1, ..., x_k) = f(x_1, ..., x_n) / f(x_1, ..., x_k),
where f(x_1, ..., x_k) is the marginal pdf or pmf of (X_1, ..., X_k).
This marginal pdf can be used to compute any probability or expectation involving X_1
and X_2. For instance,
E[X_1 X_2] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x_1 x_2 f(x_1, x_2) dx_1 dx_2
           = ∫_0^1 ∫_0^1 x_1 x_2 [ (3/4)(x_1² + x_2²) + 1/2 ] dx_1 dx_2 = 5/16.
Now let's find the conditional distribution of X_3 and X_4 given X_1 = x_1 and
X_2 = x_2, for any (x_1, x_2) where 0 < x_1 < 1 and 0 < x_2 < 1.
Consider now the sum
Z = X_1 + ... + X_n.
If X_1, ..., X_n are iid with common mgf M_X(t), then
M_Z(t) = [M_X(t)]^n.
For example, if X_1, ..., X_n are mutually independent with X_i ∼ gamma(α_i, β), then
M_Z(t) = M_{X_1}(t) ··· M_{X_n}(t) = (1 − βt)^{−α_1} ··· (1 − βt)^{−α_n} = (1 − βt)^{−(α_1 + ... + α_n)}.
This is the mgf of a gamma(α_1 + ... + α_n, β) distribution. Thus, the sum of
independent gamma random variables that have a common scale parameter β also
has a gamma distribution.
We will now cover some basic inequalities used in statistics and econometrics.
Most of the time, more complicated expressions can be written in terms of simpler
expressions. Inequalities on these simpler expressions can then be used to obtain an
inequality, or often a bound, on the original complicated term.
This part is based on Sections 3.6 and 4.7 in Casella & Berger.
Chebychev's Inequality states that, for any k > 0, P(|X − µ| ≥ kσ) ≤ 1/k². With
k = 2, this says that there is at least a 75% chance that a random variable will be
within 2σ of its mean (independent of the distribution of X).
Markov's Inequality: if P(Y ≥ 0) = 1 and r > 0, then
P(Y ≥ r) ≤ E[Y] / r,
and the relationship holds with equality if and only if
P(Y = r) = p = 1 − P(Y = 0), where 0 < p ≤ 1.
The more general form of Chebychev’s Inequality, provided in White (2001), is as
follows.
Proposition 2.41 (White 2001): Let X be a random variable such that
E[|X|^r] < ∞, r > 0. Then, for every t > 0,
P(|X| ≥ t) ≤ E[|X|^r] / t^r.
Setting r = 2, and some re-arranging, gives the usual Chebychev's Inequality. If we
let r = 1, then we obtain Markov's Inequality. See White (2001, pp. 29-30).
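A small numerical illustration of this bound for a few values of r, using exponential(1) draws (an arbitrary choice) and t = 3:

import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(1.0, 1_000_000)
t = 3.0

for r in (1, 2, 4):
    empirical = np.mean(np.abs(x) >= t)          # P(|X| >= t), roughly e^{-3}
    bound = np.mean(np.abs(x) ** r) / t**r       # E[|X|^r] / t^r
    print(r, empirical, bound)                   # the bound always lies above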
Then, applying the two-term inequality ab ≤ a^p/p + b^q/q with
a = |X| / {E[|X|^p]}^{1/p} and b = |Y| / {E[|Y|^q]}^{1/q}, and taking expectations,
E[|XY|] / ( {E[|X|^p]}^{1/p} {E[|Y|^q]}^{1/q} ) ≤ E[ (1/p) |X|^p / E[|X|^p] + (1/q) |Y|^q / E[|Y|^q] ] = 1/p + 1/q = 1,
and so
E[|XY|] ≤ {E[|X|^p]}^{1/p} {E[|Y|^q]}^{1/q}.
−1 ≤ ρ_XY ≤ 1.
One can show this without using the Cauchy-Schwarz Inequality, as well. However,
the corresponding calculations would be a lot more tedious. See Proof of Theorem
4.5.7 in Casella & Berger (2001, pp.172-173).
Next, consider Minkowski's Inequality: for p ≥ 1,
{E[|X + Y|^p]}^{1/p} ≤ {E[|X|^p]}^{1/p} + {E[|Y|^p]}^{1/p}.
The proof starts from the triangle inequality,
|X + Y| ≤ |X| + |Y|.
Now,
E[|X + Y|^p] = E[ |X + Y| · |X + Y|^{p−1} ]
            ≤ E[ |X| · |X + Y|^{p−1} ] + E[ |Y| · |X + Y|^{p−1} ],    (3)
Then,
E[|X + Y|^p] / {E[|X + Y|^{q(p−1)}]}^{1/q} = E[|X + Y|^p] / {E[|X + Y|^p]}^{1/q}
                                           = {E[|X + Y|^p]}^{1−1/q}
                                           = {E[|X + Y|^p]}^{1/p},
and
{E[|X + Y|^p]}^{1/p} ≤ {E[|X|^p]}^{1/p} + {E[|Y|^p]}^{1/p}.
The previous results can be used for the case of numerical sums, as well.
For example, let a1 , ..., an and b1 , ..., bn be positive nonrandom values. Let X be a
random variable with range a1 , ..., an and P (X = ai ) = 1/n, i = 1, ..., n. Similarly,
let Y be a random variable taking on values b1 , ..., bn with probability
P(Y = b_i) = 1/n. Moreover, let p and q be such that p^{−1} + q^{−1} = 1.
Then,
E[|XY|] = (1/n) ∑_{i=1}^{n} |a_i b_i|,
{E[|X|^p]}^{1/p} = [ (1/n) ∑_{i=1}^{n} |a_i|^p ]^{1/p}   and   {E[|Y|^q]}^{1/q} = [ (1/n) ∑_{i=1}^{n} |b_i|^q ]^{1/q},
and so,
∑_{i=1}^{n} |a_i b_i| ≤ [ ∑_{i=1}^{n} |a_i|^p ]^{1/p} [ ∑_{i=1}^{n} |b_i|^q ]^{1/q}.
A special case of this result is obtained by letting b_i = 1 for all i and setting
p = q = 2. Then,
∑_{i=1}^{n} |a_i| ≤ [ ∑_{i=1}^{n} |a_i|² ]^{1/2} [ ∑_{i=1}^{n} 1 ]^{1/2} = [ ∑_{i=1}^{n} |a_i|² ]^{1/2} n^{1/2},
and so,
( (1/n) ∑_{i=1}^{n} |a_i| )² ≤ (1/n) ∑_{i=1}^{n} a_i².
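A one-line numerical check of this special case for an arbitrary set of positive numbers:

import numpy as np

rng = np.random.default_rng(7)
a = rng.uniform(0.1, 5.0, size=20)       # arbitrary positive values a_1, ..., a_n

lhs = (np.sum(np.abs(a)) / a.size) ** 2  # ((1/n) * sum |a_i|)^2
rhs = np.sum(a**2) / a.size              # (1/n) * sum a_i^2
print(lhs <= rhs, lhs, rhs)              # True, typically with some slack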
Jensen's Inequality: if g(x) is a convex function, then
E[g(X)] ≥ g(E[X]).
The proof uses the supporting (tangent) line ℓ(x) = a + bx of g at the point E[X];
convexity implies g(x) ≥ ℓ(x) for all x, so
E[g(X)] ≥ E[a + bX] = a + bE[X] = ℓ(E[X]) = g(E[X]).
For example, with g(x) = x²,
E[X²] ≥ {E[X]}².
For concave g, the inequality is reversed:
E[g(X)] ≤ g(E[X]).
Theorem (4.7.9): Let X be any random variable and g(x) and h(x) functions such
that E[g(X)], E[h(X)] and E[g(X)h(X)] exist. If g(x) and h(x) are both
nondecreasing (or both nonincreasing), then E[g(X)h(X)] ≥ E[g(X)] E[h(X)]; if one
is nondecreasing and the other nonincreasing, the inequality is reversed.
Let us prove the first part of this Theorem for the easier case where
E[g(X)] = E[h(X)] = 0.
Now,
E[g(X)h(X)] = ∫_{−∞}^{∞} g(x)h(x) f_X(x) dx
            = ∫_{{x: h(x)≤0}} g(x)h(x) f_X(x) dx + ∫_{{x: h(x)≥0}} g(x)h(x) f_X(x) dx.
Since h is nondecreasing with E[h(X)] = 0, there is a point x_0 at which h changes
sign, so that h(x) ≤ 0 for x ≤ x_0 and h(x) ≥ 0 for x ≥ x_0; and since g is
nondecreasing, g(x) ≤ g(x_0) on the first set and g(x) ≥ g(x_0) on the second. Hence,
∫_{{x: h(x)≤0}} g(x)h(x) f_X(x) dx ≥ g(x_0) ∫_{{x: h(x)≤0}} h(x) f_X(x) dx,
and
∫_{{x: h(x)≥0}} g(x)h(x) f_X(x) dx ≥ g(x_0) ∫_{{x: h(x)≥0}} h(x) f_X(x) dx.
Thus,
E[g(X)h(X)] = ∫_{{x: h(x)≤0}} g(x)h(x) f_X(x) dx + ∫_{{x: h(x)≥0}} g(x)h(x) f_X(x) dx
            ≥ g(x_0) ∫_{{x: h(x)≤0}} h(x) f_X(x) dx + g(x_0) ∫_{{x: h(x)≥0}} h(x) f_X(x) dx
            = g(x_0) ∫_{−∞}^{∞} h(x) f_X(x) dx = g(x_0) E[h(X)] = 0.
You can try to do the proof for the second part, following the same method.