
Math 630 - Probability Theory and its Applications

Zach Bailey and Werner Linde


October 21, 2016

Contents

1 Events and Their Probabilities
  1.1 Introduction
  1.2 Events as Sets
  1.3 Probability
  1.4 Different Types of Probabilities
  1.5 Conditional Probability
  1.6 Independence
  1.7 Product Spaces

2 Random Variables and Their Distributions
  2.1 Random Variables
  2.2 Discrete and Continuous Random Variables
  2.3 Random Vectors
  2.4 Relation between Joint and Marginal Distributions
  2.5 Independence of Random Variables
  2.6 Independence of Continuous Random Variables

3 Special Random Variables
  3.1 Important Discrete Random Variables
  3.2 Important Continuous Random Variables

4 Transformations of Random Variables
  4.1 Functions of Random Variables
  4.2 Random Numbers and Coin Tossing
  4.3 Addition of Random Variables

5 Expectation, Variance, and Covariance
  5.1 Expected Value of a Random Variable
  5.2 Expectation of Special Random Variables
  5.3 Properties of the Expectation
  5.4 Higher Moments and Variance
  5.5 Covariance

6 Limit Theorems
  6.1 Chebyshev's Inequality
  6.2 Central Limit Theorem

1 Events and Their Probabilities

1.1 Introduction

Almost everything in our life is random; there exist only very few events having probability 1. The aim of probability theory is to describe random experiments by mathematical methods.
Examples:
- rolling a die
- the lifetime of a particle
- weather forecasts
- the number of accidents in a city per day
- the number of website hits
History: Already in the 11th century, Arab mathematicians tried to find the probability of each of the sums {2, ..., 12} that can occur when rolling two dice. In the year 1900 there was a mathematical congress in Paris where David Hilbert presented his 23 most important problems in mathematics, now called the Hilbert Problems. One of them was this: give a solid foundation of probability theory.
This problem was solved by A. N. Kolmogorov in 1933 with his book Foundations of Probability Theory. Much of it is based on the earlier work of H. Lebesgue, who developed measure theory in his 1902 doctoral thesis. Kolmogorov recognized that this was a very good tool to formulate the axioms of probability theory.
Definition 1.1 (Random Experiment). Without changing the conditions, we observe different results/outcomes, and the results are not predictable. We are only able to give a list of the possible outcomes.

1.2 Events as Sets

Say we investigate a random experiment.


Definition 1.2 (Sample Space). The sample space, Ω, is the set of all possible outcomes.
Remark. Sometimes we choose Ω larger than necessary for certain mathematical reasons.
Example 1.1.
1. Roll a die, then Ω = {1, ..., 6}. We could also take Ω = N or R because the probability of events not containing 1, ..., 6 is zero. Ω cannot be {1, ..., 5} because it does not contain all possible results.
2. Roll a die n times. Then Ω = {1, ..., 6}^n = {(x_1, ..., x_n) : x_j ∈ {1, ..., 6} for all j}.
3. The lifetime of a machine gives Ω = [0, ∞). If we include that the machine is not defective at the beginning, we can take Ω = (0, ∞).
4. We roll a die until we observe a 6 for the first time. What is random is the number of tries. Then Ω = N.


Definition 1.3 (Events). A subset A ⊆ Ω is called an event. We denote by P(Ω), also written 2^Ω, the power set of Ω, i.e. the set of all subsets of Ω. We write #A for the number of elements of a set A. If Ω is finite, then #P(Ω) = 2^{#Ω}.
We observe ω ∈ Ω. If ω ∈ A, then A occurred. If ω ∉ A, then A did not occur. As an example, if we describe the lifetime of a particle by Ω = [0, ∞) and A = [2, ∞), then A occurs if the particle lives longer than two years.
Easy Rules:
(i) Ω always occurs (certain event)
(ii) ∅ never occurs (impossible event)
(iii) A occurs if and only if A^c does not occur.
(iv) The union of two events occurs if and only if at least one of them occurs.
(v) The intersection occurs if and only if both occur.
There are some events of special interest. An event A is elementary if #A = 1.
Basic idea: Events occur with a certain probability. Given A ⊆ Ω, there is a number p ∈ [0, 1] such that A occurs with probability p. We write P(A) = p for the probability of occurrence.
Example 1.2. If Ω = {1, ..., 6} and A = {2, ..., 6}, then P(A) = 5/6.
In short, we are looking for a function P : P(Ω) → [0, 1] such that P(A) is the probability of occurrence of A. Our problem is that if we let Ω be R, then there are no non-trivial functions P : P(Ω) → [0, 1] possessing the natural properties of a probability. Kolmogorov's solution to this problem was to distinguish good events from bad events. By this we mean that we choose a collection of sets, A ⊆ P(Ω), where P(A) is defined for A ∈ A and not defined for A ∉ A.
Definition 1.4 (σ-field). A collection A ⊆ P(Ω) is called a σ-field if:
1. Ω ∈ A.
2. A ∈ A ⟹ A^c ∈ A.
3. For any countable collection A_1, A_2, ... ∈ A we have ∪_{n=1}^∞ A_n ∈ A.

Proposition 1.1. Let A be a σ-field. Then
(i) ∅ ∈ A
(ii) A_1, ..., A_n ∈ A ⟹ ∪_{j=1}^n A_j ∈ A
(iii) A_1, A_2, ... ∈ A ⟹ ∩_{j=1}^∞ A_j ∈ A
(iv) A, B ∈ A ⟹ A \ B ∈ A


Proof.
(i) ∅ = Ω^c.
(ii) Let A_j = ∅ for j > n. Then ∪_{j=1}^∞ A_j ∈ A, but this union is just A_1 ∪ ··· ∪ A_n.
(iii) A_1, A_2, ... ∈ A implies that A_1^c, A_2^c, ... ∈ A. Thus, because (∪_n A_n^c)^c = ∩_n A_n by De Morgan's law, we get ∩_n A_n ∈ A as well.
(iv) B ∈ A means B^c ∈ A, thus A ∩ B^c = A \ B ∈ A.

Proposition 1.2. Let E be any non-empty collection of subsets of Ω. Then there is a smallest σ-field A_0 such that E ⊆ A_0.
Proof. Let Σ = {A : A is a σ-field and E ⊆ A}. We know Σ is non-empty, because P(Ω) ∈ Σ. So let

A_0 = ∩_{A ∈ Σ} A = {A ⊆ Ω : A ∈ A for all A ∈ Σ}.

We see A_0 is a σ-field because:
- Ω ∈ A_0.
- A ∈ A_0 means A ∈ A for all A ∈ Σ, thus A^c ∈ A for all A ∈ Σ, hence A^c ∈ A_0.
- Countable unions are contained for the same reason.
We see also that E ⊆ A_0 and that A_0 is indeed the smallest σ-field containing E.
Definition 1.5 (Generated σ-field). We say that A_0 = σ(E) is the σ-field generated by E, or that E generates A_0.
Basic Examples:
Let Ω = R and E = {(a, b) : a < b}. Then B(R) = σ(E) is called the Borel σ-field, and the elements of B(R) are called the Borel sets. This is also the σ-field generated by the topology of R.
Question: Is every subset of R a Borel set? Answer: No! There is a construction called the Vitali set which is not Borel. The proof can be found in Wheeden and Zygmund, Measure and Integral.
Let Ω = R^n and E = {(a_1, b_1) × ··· × (a_n, b_n) : a_j < b_j}. Then B(R^n) = σ(E).
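On a finite sample space, the generated σ-field of Proposition 1.2 can even be computed by brute force: close the collection under complements and pairwise unions until nothing new appears (on a finite space, countable unions reduce to finite ones). The following Python sketch is my own illustration, not part of the original notes.

    # Brute-force computation of the sigma-field generated by a collection E
    # on a finite sample space Omega.
    def generated_sigma_field(omega, collection):
        omega = frozenset(omega)
        sigma = {frozenset(), omega} | {frozenset(a) for a in collection}
        while True:
            new = set()
            for a in sigma:
                new.add(omega - a)          # closure under complement
                for b in sigma:
                    new.add(a | b)          # closure under (finite) union
            if new <= sigma:                # fixed point: nothing new appeared
                return sigma
            sigma |= new

    omega = {1, 2, 3, 4, 5, 6}
    A0 = generated_sigma_field(omega, [{1, 2}])
    print(sorted(sorted(a) for a in A0))
    # [[], [1, 2], [1, 2, 3, 4, 5, 6], [3, 4, 5, 6]]

For E = {{1, 2}} this returns exactly the four sets ∅, {1, 2}, {3, 4, 5, 6}, Ω, i.e. the smallest σ-field containing E.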

1.3 Probability

What does P(A) mean, say P(A) = 0.7? One repeats the experiment n times under the same conditions, independently. Then a_n(A) = #{j ≤ n : A occurs in the j-th trial} denotes the absolute frequency of occurrence of A. Of course, a_n(A) is random with 0 ≤ a_n(A) ≤ n. Its relative frequency r_n(A) is defined as

r_n(A) = a_n(A)/n ≤ 1.    (1)

We expect r_n(A) → p for some p ∈ [0, 1] as n → ∞, and we let p = P(A) be the limit of these frequencies. If n is large, then A occurs approximately n · P(A) times.

Example 1.3. If Ω = {1, ..., 6}, A = {1, 2} and n = 10^3, then on average A occurs 10^3/3 ≈ 333 times. Similarly, if a treatment cures a disease with probability 0.9, then out of 1000 patients approximately 900 will be cured.
Properties of r_n:
- r_n(∅) = 0, r_n(Ω) = 1.
- A ∩ B = ∅ ⟹ r_n(A ∪ B) = r_n(A) + r_n(B).
The limit should have the same properties:
1. P(∅) = 0, P(Ω) = 1.
2. A ∩ B = ∅ ⟹ P(A ∪ B) = P(A) + P(B).
3. By induction on n we get that if A_1, ..., A_n are pairwise disjoint, then P(∪_{j=1}^n A_j) = Σ_{j=1}^n P(A_j).
Property 2, or equivalently property 3, is called finite additivity. It does not suffice to build a powerful theory, although many mathematicians tried to do so. We need the stronger σ-additivity of P defined below.
Definition 1.6 (Probability Measure). Let (Ω, A) be a measurable space, i.e. Ω ≠ ∅ and A is a σ-field on Ω. Then P : A → [0, 1] is called a probability measure or a probability if:
1. P(∅) = 0, P(Ω) = 1.
2. P is σ-additive, i.e., if A_1, A_2, ... ∈ A are pairwise disjoint, then

P(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).

Then (Ω, A, P) is called a probability space.


Axiom of Kolmogorov:
Each random experiment can be described by a suitable probability space.
Example 1.4. Roll a die once. What is the probability space describing this experiment? Ω = {1, 2, 3, 4, 5, 6}, A = P(Ω). Then P(A) = #A/6.
Proposition 1.3. Let (Ω, A, P) be a probability space and A, B ∈ A. Then
1. P is finitely additive.
2. P is monotone, i.e. A ⊆ B ⟹ P(A) ≤ P(B).
3. If A ⊆ B, then P(B \ A) = P(B) − P(A).
4. P(A^c) = 1 − P(A).

5. (Boole's Inequality) For arbitrary A_1, A_2, ... ∈ A,

P(∪_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n).

6. P is continuous from below, i.e. if A_1 ⊆ A_2 ⊆ ···, then

P(∪_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

7. P is continuous from above, i.e. if A_1 ⊇ A_2 ⊇ A_3 ⊇ ···, then

P(∩_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

Proof.
1. Take A_{n+1} = A_{n+2} = ··· = ∅ in the σ-additivity.
2. Immediate consequence of 1: write B = A ∪ (B \ A) disjointly, so P(B) = P(A) + P(B \ A) ≥ P(A).
3. B = (B \ A) ∪ A is a disjoint union, so P(B) = P(B \ A) + P(A).
4. Let B = Ω in 3.
5. Let B_1 = A_1, B_2 = A_2 \ A_1, B_3 = A_3 \ (A_1 ∪ A_2), and so on, so B_n = A_n \ (A_1 ∪ ··· ∪ A_{n−1}). Then ∪_{n=1}^∞ B_n = ∪_{n=1}^∞ A_n, the sets B_1, B_2, ... are pairwise disjoint, and B_n ⊆ A_n for all n. Then

P(∪_{n=1}^∞ A_n) = P(∪_{n=1}^∞ B_n) = Σ_{n=1}^∞ P(B_n) ≤ Σ_{n=1}^∞ P(A_n).

6. Let A_0 = ∅ and B_n = A_n \ A_{n−1} for n ≥ 1. Then ∪_{n=1}^∞ B_n = ∪_{n=1}^∞ A_n, where B_1, B_2, ... are pairwise disjoint. Then

P(∪_{n=1}^∞ A_n) = P(∪_{n=1}^∞ B_n) = Σ_{n=1}^∞ P(B_n) = lim_{n→∞} Σ_{j=1}^n P(A_j \ A_{j−1}) = lim_{n→∞} [P(A_n) − P(A_0)] = lim_{n→∞} P(A_n).

7. Since A_1 ⊇ A_2 ⊇ ···, we see A_1^c ⊆ A_2^c ⊆ ···, so by 6,

P(∪_{n=1}^∞ A_n^c) = lim_{n→∞} P(A_n^c) = 1 − lim_{n→∞} P(A_n),

and on the other hand

P(∪_{n=1}^∞ A_n^c) = P((∩_{n=1}^∞ A_n)^c) = 1 − P(∩_{n=1}^∞ A_n).

Comparing both expressions proves the claim.

Proposition 1.4. Given A, B ∈ A, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. We see A ∪ B = (A \ B) ∪ (B \ A) ∪ (A ∩ B), a union of disjoint sets. Thus

P(A ∪ B) = P(A \ B) + P(B \ A) + P(A ∩ B).

Then P(A \ B) = P(A) − P(A ∩ B) and P(B \ A) = P(B) − P(A ∩ B), proving the proposition.
Proposition 1.5.

P(A_1 ∪ A_2 ∪ A_3) = P(A_1) + P(A_2) + P(A_3) − P(A_1 ∩ A_2) − P(A_1 ∩ A_3) − P(A_2 ∩ A_3) + P(A_1 ∩ A_2 ∩ A_3).

Proof. Let B = A_1 ∪ A_2, then use Proposition 1.4 twice.
Proposition 1.6 (Inclusion/Exclusion Formula). Given A_1, ..., A_n ∈ A, then

P(∪_{j=1}^n A_j) = Σ_{k=1}^n (−1)^{k+1} Σ_{1 ≤ j_1 < ··· < j_k ≤ n} P(A_{j_1} ∩ ··· ∩ A_{j_k}).

Proof. A clever technique using the expected value; we omit the details here.


Application:
Suppose n people exchange their gifts at random. What is the probability that at least one person gets his own gift? Let A_j = {person j gets his own gift}. This probability is just

P(∪_{j=1}^n A_j) = Σ_{k=1}^n (−1)^{k+1} Σ_{1 ≤ j_1 < ··· < j_k ≤ n} P(A_{j_1} ∩ ··· ∩ A_{j_k}).

The probability inside the right-hand sum is the probability that persons j_1, ..., j_k all get their own gift out of n people. This is just (n−k)!/n!. Since the inner sum has C(n, k) = n!/(k!(n−k)!) terms and C(n, k) · (n−k)!/n! = 1/k!, our probability is

P(∪_{j=1}^n A_j) = Σ_{k=1}^n (−1)^{k+1} Σ_{1 ≤ j_1 < ··· < j_k ≤ n} (n−k)!/n! = Σ_{k=1}^n (−1)^{k+1} 1/k! → 1 − e^{−1}, n → ∞.
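The limit 1 − e^{−1} ≈ 0.632 can be checked numerically. Below is a small Python sketch (my own addition, not part of the notes) comparing the exact inclusion/exclusion sum with a random-permutation simulation.

    import math, random

    def exact_match_probability(n):
        # Inclusion/exclusion: sum_{k=1}^{n} (-1)^{k+1} / k!
        return sum((-1) ** (k + 1) / math.factorial(k) for k in range(1, n + 1))

    def simulated_match_probability(n, trials=100_000):
        hits = 0
        for _ in range(trials):
            perm = list(range(n))
            random.shuffle(perm)                     # random assignment of gifts
            if any(perm[i] == i for i in range(n)):  # somebody got their own gift
                hits += 1
        return hits / trials

    n = 10
    print(exact_match_probability(n))       # ~0.6321, close to 1 - 1/e
    print(simulated_match_probability(n))   # agrees up to sampling error
    print(1 - math.exp(-1))                 # 0.6321205...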

Final remark: If P(A) = 1, then A occurs almost surely (a.s.). But P(A) = 1 does not imply A = Ω. An easy way to see this is to let Ω be the set of all infinite sequences of 0s and 1s. The probability of choosing one fixed sequence at random out of all of them is 0, so the probability of not choosing it is 1. Thus we choose a sequence different from {0, 0, 0, ...} almost surely, although the corresponding event is not all of Ω. Similarly, we say A is a zero set if P(A) = 0.

1.4 Different Types of Probabilities

We distinguish between the following two types of probabilities:

Discrete Probabilities
These probabilities appear when describing experiments with finitely many or at most countably many possible outcomes. Typical examples of such experiments are rolling a die n times, tossing a coin n times, or recording the number of customers in a store.


Continuous Probabilities
These probabilities describe experiments where uncountably many outcomes are possible. Typical examples are the lifetime of a light bulb, the length of a phone call, or the values of a measurement (e.g. of the air pressure or of an item).
Remark. There exists a third kind of probability different from both mentioned above, the so-called singular continuous probabilities. Although they satisfy P({a}) = 0 for all a ∈ R, they do not possess a density. They are concentrated on thin sets, for example on Cantor's discontinuum.
Case 1: Discrete Probabilities
Here we have either Ω = {x_1, ..., x_N} or Ω = {x_1, x_2, ...}. As σ-field A we may always choose the power set, i.e. A = P(Ω).
We define a function f : Ω → R by

f(x) := P({x}), x ∈ Ω.

Then

f(x) ≥ 0 and Σ_{x ∈ Ω} f(x) = 1.    (2)

The function f is called the probability mass function of P. Moreover, given A ⊆ Ω, it follows that

P(A) = Σ_{x ∈ A} f(x).    (3)

Conversely, given any function f : Ω → R satisfying (2), a probability P on P(Ω) is defined by (3). Thus we have a one-to-one relation between probabilities on P(Ω) and real-valued functions f on Ω satisfying (2).
Example 1.5. Suppose Ω = {0, 1, 2, 3}. Letting f(0) = 1/4, f(1) = 1/4, f(2) = 1/6 and f(3) = 1/3, the generated probability P satisfies

P({1, 3}) = f(1) + f(3) = 1/4 + 1/3 = 7/12.
Basic examples of discrete probabilities
1. Uniform distribution on a finite set: We have Ω = {x_1, ..., x_N} and assume that all elementary events are equally likely. Hence

f(x_i) = P({x_i}) = 1/N, i = 1, ..., N.

This leads to

P(A) = #A/N = #A/#Ω, A ⊆ Ω.

This probability is called the uniform distribution on Ω = {x_1, ..., x_N}.

2. Binomial distribution: Given an integer n ≥ 1, let Ω = {0, ..., n}. For p ∈ [0, 1] we set

f(k) = B_{n,p}({k}) = C(n, k) p^k (1−p)^{n−k}, k = 0, ..., n.

The generated probability B_{n,p} on P(Ω) is called the binomial distribution with parameters n and p. Execute n independent trials of an experiment where each time success appears with probability p and failure with probability 1−p. Then B_{n,p}({k}) is the probability to observe exactly k successes (hence n−k failures). The number p is called the success probability.

3. Geometric distribution: Let Ω = {1, 2, ...}. Given p ∈ (0, 1), define

f(k) = G_p({k}) = p(1−p)^{k−1}, k = 1, 2, ....

The generated measure G_p on P(Ω) is called the geometric distribution with parameter p.
Execute independent trials of an experiment where each time success appears with probability p and failure with probability 1−p. Then G_p({k}) is the probability to have success for the first time exactly in the k-th trial.
4. Poisson distribution: Let Ω = {0, 1, 2, ...}. Given a parameter λ > 0, we set

f(k) = Pois_λ({k}) = (λ^k / k!) e^{−λ}, k = 0, 1, ....

The probability Pois_λ on P(Ω) is called the Poisson distribution with parameter λ > 0.
The Poisson distribution describes the number of successes in experiments with many trials and very small success probability. If n is large and p is small, then B_{n,p}({k}) is almost equal to Pois_λ({k}) where λ = n·p. For example, the Poisson distribution is used to describe the number of accidents in a city per day, the number of hits of a website, or the number of atoms of radium which disintegrate during some time period.
5. Hypergeometric distribution: Given parameters 1 ≤ M ≤ N and n ≤ N, the hypergeometric distribution H_{N,M,n} is defined on P(Ω), Ω = {0, ..., n}, by

f(m) = H_{N,M,n}({m}) = C(M, m) C(N−M, n−m) / C(N, n), m = 0, ..., n.

Suppose there is a delivery of N items among which M are defective. Choose n of the N items at random. Then H_{N,M,n}({m}) is the probability to observe exactly m defective ones among the chosen n.
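Each of these mass functions is nonnegative and sums to one over its domain. The sketch below (my own illustration, not part of the notes) implements them with math.comb and checks this numerically; the geometric and Poisson sums are truncated, so they come out only approximately equal to one.

    import math

    def binomial(n, p):
        return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

    def geometric(p, kmax):
        return [p * (1 - p) ** (k - 1) for k in range(1, kmax + 1)]   # truncated

    def poisson(lam, kmax):
        return [lam**k / math.factorial(k) * math.exp(-lam) for k in range(kmax + 1)]

    def hypergeometric(N, M, n):
        # math.comb returns 0 when the lower index exceeds the upper one,
        # which handles impossible values of m automatically.
        return [math.comb(M, m) * math.comb(N - M, n - m) / math.comb(N, n)
                for m in range(n + 1)]

    print(sum(binomial(10, 0.3)))           # 1.0
    print(sum(hypergeometric(50, 5, 10)))   # 1.0
    print(sum(geometric(0.5, 60)))          # ~1.0 (tail truncated)
    print(sum(poisson(4.0, 60)))            # ~1.0 (tail truncated)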
Case 2: Continuous Probabilities
Here we have Ω = R or Ω ⊆ R, for example Ω = [0, 1]. As σ-field we choose A = B(R).
Definition 1.7. A Riemann integrable function f : R → R is called a probability density if it satisfies

f(x) ≥ 0, x ∈ R, and ∫_{−∞}^∞ f(x) dx = 1.    (4)

Given an interval [a, b] ⊆ R, we set

P([a, b]) = ∫_a^b f(x) dx.    (5)

Proposition 1.7 (Extension Theorem). Given a probability density f on R, there is a unique probability P on B(R) satisfying (5).
The function f is called the density of P. Probabilities having a density are said to be continuous.^1

^1 The correct notion would be absolutely continuous. But since we do not deal with singular probabilities, we may shortly say continuous.


Remark.
(1) In contrast to the discrete case, continuous probabilities satisfy P({a}) = 0 for all a ∈ R.
(2) The density of a continuous probability is not unique. For example, changing the density at finitely many points does not change the generated probability.
Basic examples of continuous probabilities
1. Uniform distribution on a finite interval: Let I = [α, β] be a finite interval on the real line. Define the function f by

f(x) = 1/(β−α) if x ∈ [α, β], and f(x) = 0 if x ∉ [α, β].

Of course, f satisfies (4), hence it is a probability density, and the probability P generated by f via (5) is called the uniform distribution on the interval I.
Note that P satisfies

P([a, b]) = |[a, b] ∩ [α, β]| / |[α, β]|,

where |A| denotes the length of a set A. In particular, if [a, b] ⊆ [α, β], then we get

P([a, b]) = (b − a)/(β − α).

The crucial property of the uniform distribution is as follows: the probability of occurrence of a set A ⊆ I is independent of the position of A in I = [α, β]. Only the size of A matters.
2. Exponential distribution: Given λ > 0, define f on R by

f(x) = λ e^{−λx} if x ≥ 0, and f(x) = 0 if x < 0.

Let E_λ be the probability generated by f. It is called the exponential distribution with parameter λ > 0.
If 0 ≤ a < b, then we get

E_λ([a, b]) = ∫_a^b λ e^{−λx} dx = e^{−λa} − e^{−λb}.

The exponential distribution describes the lifetime of non-aging particles, components, or other items.
3. Cauchy distribution: Define f by

f(x) = (1/π) · 1/(1 + x²), x ∈ R.

The probability P generated by this density is called the Cauchy distribution. Note that

P([a, b]) = (arctan(b) − arctan(a))/π.

The Cauchy distribution is used to describe experiments where large values may appear with comparably large probability.


4. Normal distribution: Let μ ∈ R and σ > 0. Define f_{μ,σ} on R by

f_{μ,σ}(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)), x ∈ R.

Of course, f_{μ,σ}(x) > 0, and we will see later on that ∫_{−∞}^∞ f_{μ,σ}(x) dx = 1.
The probability N(μ, σ²) having density f_{μ,σ} is called the normal distribution with expectation μ and variance σ² > 0. That is,

N(μ, σ²)([a, b]) = (1/(σ√(2π))) ∫_a^b e^{−(x−μ)²/(2σ²)} dx.

The case μ = 0 and σ = 1 is of special interest. Here we have

N(0, 1)([a, b]) = (1/√(2π)) ∫_a^b e^{−x²/2} dx.

The probability N(0, 1) is called the standard normal distribution.
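Since the normal density has no elementary antiderivative, probabilities like N(0,1)([a, b]) are evaluated numerically or via the error function. Here is a short Python sketch of both routes (my own illustration, not from the notes):

    import math

    def std_normal_prob(a, b, steps=10_000):
        # Midpoint-rule integration of the standard normal density over [a, b].
        h = (b - a) / steps
        total = sum(math.exp(-(a + (i + 0.5) * h) ** 2 / 2) for i in range(steps))
        return total * h / math.sqrt(2 * math.pi)

    def std_normal_prob_erf(a, b):
        # Same probability via the error function: Phi(b) - Phi(a).
        phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
        return phi(b) - phi(a)

    print(std_normal_prob(-1.96, 1.96))       # ~0.9500
    print(std_normal_prob_erf(-1.96, 1.96))   # 0.9500042...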

1.5 Conditional Probability

Let (Ω, A, P) be our probability space and B ∈ A such that P(B) > 0. Then we say
P(A|B) = probability of A given B, or probability of A conditional on B.
Example 1.6.
1. Roll a die twice. Let A = {sum of the results is 5}; then P(A) = 4/36 = 1/9. Let B = {first result is 1 or 2}. We get the result by restricting the sample space Ω = {1, ..., 6} × {1, ..., 6} to the new sample space B = {1, 2} × {1, ..., 6}. So

P(A|B) = #(A ∩ B)/#B = 2/12 = 1/6.

2. We have an urn containing two black balls and two white ones. Let A = {2nd ball is white} and B = {1st ball is black}. Then P(A|B) = 2/3.
Definition 1.8 (Conditional Probability). Let (Ω, A, P) be a probability space and B ∈ A with P(B) > 0. Then we set

P(A|B) = P(A ∩ B)/P(B).

P(A|B) is called the conditional probability of A under (condition) B.


Remark. Writing the previous definition as

P(A ∩ B) = P(B) P(A|B),

this formula is called the Law of Multiplication.
Example 1.7. Let us give two examples which show how the law of multiplication applies.

1. From the above example, compute P(A ∩ B) = P({(b, w)}). We have P(B) = 1/2 and P(A|B) = 2/3, so P(A ∩ B) = 1/3.
2. Lottery with 49 numbers. Choose 6 in a row (without replacing them). What is the probability that the first number is even and the second is odd? Let B be the first event and A the second. Then P(B) = 24/49 and P(A|B) = 25/48, and our result is that

P(A ∩ B) = (24 · 25)/(49 · 48) = 25/98.
Proposition 1.8. The mapping A ↦ P(A|B), for P(B) > 0, is a probability measure on (Ω, A).
Proof. P(∅|B) = P(∅ ∩ B)/P(B) = 0. Similarly, P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.
Let A_1, A_2, ... be disjoint. Then

P(∪_{j=1}^∞ A_j | B) = P((∪_{j=1}^∞ A_j) ∩ B)/P(B) = P(∪_{j=1}^∞ (A_j ∩ B))/P(B)
= Σ_{j=1}^∞ P(A_j ∩ B)/P(B) = Σ_{j=1}^∞ P(A_j|B).

Definition 1.9. The probability A ↦ P(A|B) is called the conditional probability (with respect to B). It satisfies P(B|B) = 1 and P(B^c|B) = 0.
Proposition 1.9 (Law of total probability). Let Ω = ∪_{j=1}^n B_j with B_1, ..., B_n disjoint and P(B_j) > 0. Then

P(A) = Σ_{j=1}^n P(B_j) P(A|B_j).    (6)

Proof. We see

Σ_{j=1}^n P(B_j) P(A|B_j) = Σ_{j=1}^n P(B_j) · P(A ∩ B_j)/P(B_j) = Σ_{j=1}^n P(A ∩ B_j) = P(∪_{j=1}^n (A ∩ B_j)).

Note that the B_j are all disjoint, allowing us to write the sum as the union. Then this is equal to

P(A ∩ (∪_{j=1}^n B_j)) = P(A ∩ Ω) = P(A).

Example 1.8. Consider an urn with four balls, two black and two white. Let A be the event that the 2nd ball picked is white, B_1 the event that the first ball is white, and B_2 the event that the first ball is black. Then Ω = B_1 ∪ B_2 and

P(A) = P(B_1)P(A|B_1) + P(B_2)P(A|B_2) = (1/2)(1/3) + (1/2)(2/3) = 1/2.

Let B_1, ..., B_n be disjoint with ∪_{j=1}^n B_j = Ω. The probabilities P(B_1), ..., P(B_n) are called a priori probabilities. These are the probabilities of the B_j before executing the experiment. Now execute the experiment and observe the occurrence of an event A. Then P(B_1|A), ..., P(B_n|A) are the a posteriori probabilities, i.e., those after knowing the occurrence of A.
Proposition 1.10 (Bayes' Rule).

P(B_j|A) = P(B_j)P(A|B_j) / Σ_{i=1}^n P(B_i)P(A|B_i)    (7)

Proof. From the previous proposition, the denominator is equal to P(A). Then

P(B_j)P(A|B_j) / Σ_{i=1}^n P(B_i)P(A|B_i) = P(B_j)P(A|B_j)/P(A) = P(B_j) · (P(A ∩ B_j)/P(B_j)) / P(A) = P(B_j ∩ A)/P(A) = P(B_j|A).

Example 1.9. We have three coins; two are fair and one is biased. For the biased coin, heads occurs with probability 1/3 and tails with probability 2/3. Say we choose a coin at random, toss it, and observe heads. What is the probability that the chosen coin was fair?
Let B_j be the event that the j-th coin was chosen, j ∈ {1, 2, 3}, the third coin being the biased one. The a priori probabilities are P(B_j) = 1/3. Now let A be the event that we observe heads. Then

P(A) = (1/3)(1/2) + (1/3)(1/2) + (1/3)(1/3) = 4/9.

For the a posteriori probabilities we see

P(B_1|A) = P(B_1)P(A|B_1)/P(A) = ((1/3)(1/2))/(4/9) = (1/6)(9/4) = 3/8.

By the same computation, P(B_2|A) = 3/8, and thus P(B_3|A) = 1/4. Hence the probability that the chosen coin was fair is P(B_1|A) + P(B_2|A) = 3/4.
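The same computation written out as code (a sketch of mine, not from the notes); the priors and the per-coin heads probabilities determine everything:

    priors = [1/3, 1/3, 1/3]        # P(B_j): which coin was chosen
    p_heads = [1/2, 1/2, 1/3]       # P(A | B_j): heads probability per coin

    p_a = sum(pb * pa for pb, pa in zip(priors, p_heads))       # total probability
    posteriors = [pb * pa / p_a for pb, pa in zip(priors, p_heads)]

    print(p_a)                           # 0.4444... = 4/9
    print(posteriors)                    # [0.375, 0.375, 0.25] = [3/8, 3/8, 1/4]
    print(posteriors[0] + posteriors[1]) # 0.75: probability the coin was fair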

1.6 Independence

What is the mathematical description of independence, i.e. of A and B occurring independently? We can describe this using conditional probabilities. If A and B are independent, then P(A|B) = P(A). We say this because the event B does not affect A, or: A occurs independently of B. In short, no information about A can be recovered from knowledge of B. So we have

P(A) = P(A|B) = P(A ∩ B)/P(B) ⟹ P(A ∩ B) = P(A)P(B).

Definition 1.10. Let (Ω, A, P) be a probability space and A, B ∈ A. Then A and B are said to be independent if

P(A ∩ B) = P(A)P(B).    (8)

If they are not independent, they are called dependent.
Example 1.10.

1. Roll a die twice. Let B be the event that the sum of the results is 7, and A the event that the first result is 3. We know P(A) = 1/6 and P(B) = 6/36 = 1/6. Then P(A ∩ B) = 1/36. Thus P(A ∩ B) = P(A)P(B), so they are independent. This is an interesting example because the events are not independent if we replace 7 with 6.
2. Consider the urn with two black and two white balls. Let A be the event that the first ball is white and B the event that the second is black. We see P(A) = P(B) = 1/2. Then P(A ∩ B) = P(A)P(B|A) = (1/2)(2/3) = 1/3 ≠ 1/4 = P(A)P(B). Thus they are dependent.
Proposition 1.11.
(i) A is independent of B ⟺ B is independent of A.
(ii) ∅ and Ω are independent of every event.
(iii) If A and B are independent, then so are A and B^c, as well as A^c and B^c.
Proof.
(i) Trivial.
(ii) P(Ω ∩ B) = P(B) = P(Ω)P(B).
(iii) P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(B^c). The first equality follows by noting that the disjoint sets A ∩ B and A ∩ B^c give us P(A) = P(A ∩ B) + P(A ∩ B^c). Since A and B^c are now known to be independent, using (iii) again we see that A^c and B^c are independent.

What does it mean for more than two sets to be independent?
Given A, B, C, one possibility is to ask that P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j (with A = A_1, B = A_2, C = A_3). Another possibility is that P(A ∩ B ∩ C) = P(A)P(B)P(C).
Define two conditions for A, B, C:
(a) A and B are independent, A and C are independent, and B and C are independent.
(b) P(A ∩ B ∩ C) = P(A)P(B)P(C).
Example 1.11. Our sample space is Ω = {1, ..., 12} with

P(A) = #A/12.

Let A = {1, ..., 9}, B = {6, 7, 8, 9} and C = {9, 10, 11, 12}. Then P(A) = 3/4, P(B) = 1/3, P(C) = 1/3, and P(A ∩ B ∩ C) = P({9}) = 1/12 = (3/4)(1/3)(1/3). But P(A ∩ C) = P({9}) = 1/12 ≠ P(A)P(C) = 1/4. Thus the events A and C are dependent.
This means that (b) does not imply (a). We want our definition of independence to have both properties, so just one does not suffice. At home, construct an example where (a) does not imply (b).
Definition 1.11 (Independence). Let (Ω, A, P) be a probability space and A_i ∈ A, i ∈ I. Then {A_i}_{i ∈ I} are independent if and only if for each finite I_0 ⊆ I we have

P(∩_{i ∈ I_0} A_i) = Π_{i ∈ I_0} P(A_i).    (9)

Remark.
1. It suffices to take #I_0 ≥ 2, because for #I_0 = 1 there is nothing to prove.
2. If I = {1, ..., n}, then for all 1 ≤ i_1 < i_2 < ··· < i_m ≤ n we have

P(A_{i_1} ∩ ··· ∩ A_{i_m}) = P(A_{i_1}) ··· P(A_{i_m})

by taking I_0 = {i_1, ..., i_m}.
3. For n = 3 and #I_0 = 2 this says P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j in {1, 2, 3}, and for #I_0 = 3 it says P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3).
Example 1.12. Consider a fair coin with sides 0 and 1 and toss it n times. Let A_j = {the j-th toss is 0}. We want to show that A_1, ..., A_n are independent. We see Ω = {0, 1}^n and A = P(Ω) with

P(A) = #A/2^n, hence P(A_j) = 2^{n−1}/2^n = 1/2.

If I = {1, ..., n} and I_0 ⊆ I with m = #I_0, then

P(∩_{i ∈ I_0} A_i) = 2^{n−m}/2^n = 2^{−m}.

Since Π_{i ∈ I_0} P(A_i) = (1/2)^m, we are done.

1.7 Product Spaces

Suppose we execute two (maybe different) random experiments, described by probability spaces (Ω_1, A_1, P_1) and (Ω_2, A_2, P_2). How can we express that they are executed independently? Independence of the experiments should mean: the probability that A_1 ∈ A_1 occurs in the first experiment and A_2 ∈ A_2 occurs in the second equals P_1(A_1) · P_2(A_2). But so far there is no probability space that describes the joint occurrence of A_1 and A_2. Consequently, we need a probability space (Ω, A, P) which describes the common execution of both experiments and where the probability of the occurrence of A_1 and A_2 equals P_1(A_1) · P_2(A_2).
Example 1.13. The first experiment is rolling a fair die while the second one is tossing a fair coin twice. In order to express that both experiments are independent, we need a probability space which describes the simultaneous execution of both experiments. A suitable sample space would be

Ω = {1, ..., 6} × {H, T}² = {(x_1, x_2, x_3) : x_1 = 1, ..., 6; x_2, x_3 are H or T}.

Let (Ω_1, A_1, P_1) and (Ω_2, A_2, P_2) be two probability spaces. The sample space for combining the two experiments is

Ω := Ω_1 × Ω_2 = {(ω_1, ω_2) : ω_1 ∈ Ω_1, ω_2 ∈ Ω_2}.

The natural σ-field on Ω is the one generated by the rectangle sets:

A = A_1 ⊗ A_2 = σ({A_1 × A_2 : A_1 ∈ A_1, A_2 ∈ A_2}).

Definition 1.12. The σ-field A_1 ⊗ A_2 is called the product σ-field of A_1 and A_2.


Next we observe that the event A_1 × A_2 occurs if and only if A_1 and A_2 occur. Hence a suitable probability P on (Ω, A) should satisfy

P(A_1 × A_2) = P_1(A_1)P_2(A_2), A_1 ∈ A_1, A_2 ∈ A_2.    (10)

Proposition 1.12. Given two probability spaces (Ω_1, A_1, P_1) and (Ω_2, A_2, P_2), let Ω = Ω_1 × Ω_2 and A = A_1 ⊗ A_2. Then there is a unique probability measure P on (Ω, A) satisfying (10) for all A_1 ∈ A_1 and A_2 ∈ A_2.
Definition 1.13. The unique P satisfying (10) is called the product measure of P_1 and P_2. It is denoted by P = P_1 ⊗ P_2. In different words,

(P_1 ⊗ P_2)(A_1 × A_2) = P_1(A_1)P_2(A_2), A_1 ∈ A_1, A_2 ∈ A_2.
Concrete cases:
1. Discrete case: Suppose Ω_1 = {x_1, x_2, ...} and Ω_2 = {y_1, y_2, ...}. Then

Ω = Ω_1 × Ω_2 = {(x_i, y_j) : 1 ≤ i, j < ∞}.

The product σ-field of P(Ω_1) and P(Ω_2) is P(Ω).
Proposition 1.13. If f_1(x) = P_1({x}), x ∈ Ω_1, and f_2(y) = P_2({y}), y ∈ Ω_2, the product probability P = P_1 ⊗ P_2 is given by

P(A) = Σ_{(x,y) ∈ A} f_1(x) f_2(y), A ⊆ Ω.

Example 1.14. Choose independently two numbers from {1, ..., n}, where the same number may be chosen twice. Find the probability that the first number is strictly less than the second one.
The two sample spaces are {1, ..., n}, so the common sample space is Ω = {(i, j) : 1 ≤ i, j ≤ n}. Since all numbers are equally likely, the probability mass functions f_1 and f_2 are constant with f_1({i}) = f_2({j}) = 1/n. Hence, for any A ⊆ Ω we have

P(A) = Σ_{(i,j) ∈ A} (1/n)(1/n) = #A/n²,

that is, the product probability P is the uniform distribution on {1, ..., n}². In particular, if A = {(i, j) : i < j}, then #A = (n−1)n/2, hence

P(A) = n(n−1)/(2n²) = (n−1)/(2n).

Example 1.15. Two players, say X and Y, simultaneously roll a die. The winner is whoever gets a 6 first. Find the probability that X wins.
Here we have Ω_1 = Ω_2 = N. Moreover, the probability to observe 6 for the first time in the i-th roll is

f_1({i}) = f_2({i}) = (1/6)(5/6)^{i−1}, i = 1, 2, ....

Hence for any event A ⊆ N² the product measure P is given by

P(A) = (1/36) Σ_{(i,j) ∈ A} (5/6)^{i+j−2}.

Let A be the event that the game ends in a draw, that is,

A = {(i, i) : i ≥ 1}.

Then it follows that

P(A) = (1/36) Σ_{i=1}^∞ (25/36)^{i−1} = (1/36) · 1/(1 − 25/36) = 1/11.

Hence, since it is equally likely that X or Y wins, we finally get

P{X wins} = (1 − 1/11)/2 = 5/11.
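As a numerical check of these values (a Python sketch of mine, not part of the notes), one can sum the geometric series directly and also simulate the game round by round:

    import random

    # Exact draw probability: sum over i of (1/36) * (25/36)^(i-1) = 1/11.
    draw = sum((1 / 36) * (25 / 36) ** (i - 1) for i in range(1, 200))
    print(draw, 1 / 11)                       # both ~0.090909...

    def simulate(trials=200_000):
        x_wins = draws = 0
        for _ in range(trials):
            while True:                        # both players roll each round
                x6 = random.randint(1, 6) == 6
                y6 = random.randint(1, 6) == 6
                if x6 or y6:
                    if x6 and y6:
                        draws += 1
                    elif x6:
                        x_wins += 1
                    break
        return x_wins / trials, draws / trials

    print(simulate())                          # ~ (5/11, 1/11) = (0.4545, 0.0909)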

Example 1.16. Suppose the number of accidents per week in a city is Poisson distributed with parameter λ > 0. Further, we assume that the numbers of accidents in different weeks are independent of each other. Find the probability that the number of accidents in the next week is at least twice that of this week.
We have Ω_1 = Ω_2 = N_0, hence Ω = N_0². The probability mass functions are given by

f_1({i}) = f_2({i}) = (λ^i/i!) e^{−λ}, i = 0, 1, ....

The set of interest is A = {(i, j) : j ≥ 2i}. Hence

P(A) = Σ_{(i,j) ∈ A} (λ^{i+j}/(i! j!)) e^{−2λ} = Σ_{i=0}^∞ (λ^i/i!) [Σ_{j=2i}^∞ λ^j/j!] e^{−2λ}.

2. Continuous case: We are given two probability spaces (R, B(R), P_1) and (R, B(R), P_2) and suppose that both probabilities possess densities:

P_1(A) = ∫_A f_1(x) dx and P_2(A) = ∫_A f_2(y) dy.    (11)

A first result describes the product σ-field B(R) ⊗ B(R).
Proposition 1.14. It holds that

B(R) ⊗ B(R) = B(R²).

The next result characterizes P_1 ⊗ P_2 in the continuous case.
Proposition 1.15. Let P_1 and P_2 be as in (11). Then P = P_1 ⊗ P_2 satisfies

P(A) = ∬_A f_1(x) f_2(y) dx dy, A ∈ B(R²).


Example 1.17. Choose at random, and independently of each other, two numbers from [0, 1]. Find the probability that their sum does not exceed 1/4.
The probabilities P_1 and P_2 are the uniform distributions on [0, 1]. Hence the densities f_1 and f_2 are given by f_1(x) = f_2(x) = 1_{[0,1]}(x). The set of interest is

A = {(x, y) : x + y ≤ 1/4}.

Consequently, if P = P_1 ⊗ P_2, it follows that

P(A) = ∬_A 1_{[0,1]}(x) 1_{[0,1]}(y) dx dy = ∬_{x,y ≥ 0, x+y ≤ 1/4} dx dy = 1/32,

the last integral being the area of a right triangle with legs of length 1/4.

Example 1.18. Suppose the lifetime of certain light bulbs is exponentially distributed with parameter λ > 0. Switch on two bulbs at the same time. Find the probability that at least one of the bulbs still burns at time T > 0.
The densities f_1 and f_2 are given by

f_1(x) = f_2(x) = λ e^{−λx} for x ≥ 0, and 0 for x < 0.

Let t_1 and t_2 be the lifetimes of bulb 1 and bulb 2. Then we ask for the probability of the event A which occurs if max{t_1, t_2} > T. Let us look at the complementary event

A^c = {(t_1, t_2) : max{t_1, t_2} ≤ T}.

Then it follows that

P(A^c) = ∬_{0 ≤ t_1, t_2 ≤ T} λ² e^{−λ(t_1+t_2)} dt_1 dt_2 = (1 − e^{−λT})²,

hence

P(A) = 1 − (1 − e^{−λT})² = 2 e^{−λT} − e^{−2λT}.
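A quick simulation confirms the formula; the sketch below is my own addition (assuming λ = 1 and T = 1 for concreteness) and uses random.expovariate to draw exponential lifetimes:

    import math, random

    lam, T = 1.0, 1.0
    exact = 2 * math.exp(-lam * T) - math.exp(-2 * lam * T)

    trials = 200_000
    hits = sum(
        max(random.expovariate(lam), random.expovariate(lam)) > T
        for _ in range(trials)
    )
    print(exact)           # 0.6004...
    print(hits / trials)   # close to the exact value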
Remark. Of course, the construction of products easily extends from two probability spaces to finitely many. More precisely, given probability spaces (Ω_1, A_1, P_1) to (Ω_n, A_n, P_n), we set

Ω = Ω_1 × ··· × Ω_n,

and A = A_1 ⊗ ··· ⊗ A_n is the smallest σ-field containing the rectangle sets A_1 × ··· × A_n with A_j ∈ A_j. A probability measure P on (Ω, A) is called the product measure of P_1 to P_n (write P = P_1 ⊗ ··· ⊗ P_n) if

P(A_1 × ··· × A_n) = P_1(A_1) ··· P_n(A_n), A_j ∈ A_j.

The existence of P can be proved by induction, observing that

P_1 ⊗ ··· ⊗ P_n = (P_1 ⊗ ··· ⊗ P_{n−1}) ⊗ P_n.
Example 1.19. Toss a coin labeled 0 and 1 n times. Suppose that 1 appears with probability p, hence 0 with probability 1−p. Then Ω_1 = ··· = Ω_n = {0, 1}, hence Ω = {0, 1}^n, and P_j({1}) = p while P_j({0}) = 1−p. If x = (x_1, ..., x_n) ∈ Ω, then the product probability P = P_1 ⊗ ··· ⊗ P_n satisfies

P({x}) = P_1({x_1}) ··· P_n({x_n}) = p^k (1−p)^{n−k},

where k = #{j ≤ n : x_j = 1} = Σ_{j=1}^n x_j.

2 Random Variables and Their Distributions

2.1 Random Variables

We conduct a random experiment, described by (Ω, A, P), and observe ω ∈ Ω. In a next step we transform ω by a function X,

X : Ω → R.

So we obtain a random real number X(ω).
Remark. Note that the function X is fixed and non-random. The randomness of X(ω) comes from the random entry ω ∈ Ω.
Example 2.1.
1. Roll a die twice, ω = (ω_1, ω_2) where ω_1, ω_2 ∈ {1, ..., 6}. Then

X(ω) = ω_1 + ω_2

with random input ω gives a random number in {2, ..., 12}. We could define other random variables such as Y(ω) = max{ω_1, ω_2} and so on.
2. x_1, ..., x_n are measurements, and x = (x_1, ..., x_n) ∈ R^n is a random vector. We could define X(x) = (1/n)(x_1 + ··· + x_n).
3. Say we have a dartboard D of radius 1 centered at (0, 0). Then the point (s, t) ∈ D where the dart hits the board is random. What we care about is

X(s, t) = √(s² + t²),

the distance of the impact point to the center of the board.
Definition 2.1 (Random Variable). Suppose that X : Ω → R. Then X is said to be a random variable if for all t ∈ R the sets

{ω ∈ Ω : X(ω) ≤ t} = {X ≤ t} ∈ A.

Remark.
(i) If A = P(Ω), then every X is a random variable.
(ii) The other extreme case is A = {∅, Ω}. Then only constant functions X are random variables.
(iii) Suppose Ω = R^n and X : R^n → R is a continuous function; then X is a random variable. We see this because {ω ∈ R^n : X(ω) ≤ t} is closed in R^n, thus belongs to B(R^n).
(iv) If X : R → R is monotone, then {ω ∈ R : X(ω) ≤ t} is an open or a closed interval, hence belongs to B(R).
We call X^{−1}(B) = {ω ∈ Ω : X(ω) ∈ B} the pre-image of B. We have some easily verified properties:

X^{−1}(∪_{j=1}^∞ B_j) = ∪_{j=1}^∞ X^{−1}(B_j),
X^{−1}(A ∩ B) = X^{−1}(A) ∩ X^{−1}(B) and X^{−1}(A^c) = (X^{−1}(A))^c.

Using this notation, the function X is a random variable if and only if

X^{−1}((−∞, t]) ∈ A, t ∈ R.


Proposition 2.1. X is a random variable if and only if X^{−1}(B) ∈ A for all Borel sets B ⊆ R.
Proof. Suppose first that X^{−1}(B) ∈ A for all Borel sets B. Take t ∈ R and set B = (−∞, t]. This is a Borel set, thus X^{−1}(B) ∈ A; hence X is a random variable.
For the other direction, let X be a random variable. Then we know X^{−1}((−∞, t]) ∈ A for all t. We want to prove that X^{−1}(B) ∈ A for all Borel sets B. Let us prove that

C = {A ⊆ R : X^{−1}(A) ∈ A}

is a σ-field. We know this is true because:
- R ∈ C, as X^{−1}(R) = Ω ∈ A.
- If A ∈ C, then X^{−1}(A) ∈ A, which means (X^{−1}(A))^c = X^{−1}(A^c) ∈ A, so A^c ∈ C.
- If A_1, A_2, ... ∈ C, then X^{−1}(A_n) ∈ A for all n, thus ∪_{n=1}^∞ X^{−1}(A_n) = X^{−1}(∪_{n=1}^∞ A_n) ∈ A, so ∪_{n=1}^∞ A_n ∈ C.
So we have shown that C is a σ-field. Now let E = {(−∞, t] : t < ∞}; then E ⊆ C implies σ(E) ⊆ C. Since σ(E) = B(R), we are done.
Let X : Ω → R be a random variable.
Definition 2.2 (Distribution Function). The distribution function F_X of X is defined by

F_X(t) = P{X ≤ t} = P({ω ∈ Ω : X(ω) ≤ t}) = P(X^{−1}((−∞, t])).

Thus F_X : R → [0, 1].
Example 2.2.
1. X(ω) = c ∈ R. Then F_X(t) = P{X ≤ t} = 0 for t < c and 1 for t ≥ c.
2. P(X = 0) = 1−p, P(X = 1) = p. Then

F_X(t) = 0 for t < 0, 1−p for 0 ≤ t < 1, and 1 for t ≥ 1.

3. P{X = k} = (λ^k/k!) e^{−λ} for λ > 0, k = 0, 1, 2, .... Then

F_X(t) = 0 for t < 0, and F_X(t) = Σ_{k=0}^n (λ^k/k!) e^{−λ} for n ≤ t < n+1.

4. Suppose x_1 < x_2 < ··· < x_N in R. Let X be uniformly distributed on {x_1, ..., x_N}, meaning that P{X = x_j} = 1/N. Then one easily gets

F_X(t) = k/N for t ∈ [x_k, x_{k+1}), k = 1, ..., N−1.

Furthermore,

F_X(t) = 0 for t < x_1, and F_X(t) = 1 for t ≥ x_N.


Theorem 2.1. F_X has the following properties:
(i) F_X(−∞) = 0, F_X(∞) = 1.
(ii) t < s ⟹ F_X(t) ≤ F_X(s).
(iii) t_n ↘ t ⟹ F_X(t_n) → F_X(t).
Proof.
(i) Let A_n = {ω : X(ω) ≤ t_n} where t_n ↘ −∞. Then A_1 ⊇ A_2 ⊇ ··· are all in A and ∩_{j=1}^∞ A_j = ∅. Thus P(A_n) → 0, and noting P(A_n) = F_X(t_n), we are done.
Now let A_n = {ω : X(ω) ≤ t_n} where t_n ↗ ∞. Then A_1 ⊆ A_2 ⊆ ··· are all in A and ∪_{j=1}^∞ A_j = Ω. Thus P(A_n) → 1, and noting P(A_n) = F_X(t_n), we are done.
(ii) Since t < s, we see {X ≤ t} ⊆ {X ≤ s}, thus P{X ≤ t} ≤ P{X ≤ s}.
(iii) Let t_n ↘ t and define A_j as before. Then ∩_{j=1}^∞ A_j = {ω : X(ω) ≤ t}, and continuity from above gives F_X(t_n) = P(A_n) → P{X ≤ t} = F_X(t).

Since F_X is non-decreasing, for each t ∈ R the left-hand limit

lim_{s ↗ t} F_X(s) =: F_X(t−0)

exists. Either F_X(t−0) = F_X(t), in which case F_X is continuous at t, or

h = F_X(t) − F_X(t−0) > 0,

in which case F_X has a jump of height h > 0 at t ∈ R.
Proposition 2.2. The distribution function F_X is continuous at t ∈ R if and only if

P{X = t} = 0.

It has a jump of height h > 0 at t if and only if

P{X = t} = h.

Proof. Choose t_n ↗ t. Then we have

F_X(t) − F_X(t−0) = lim_{n→∞} [F_X(t) − F_X(t_n)] = lim_{n→∞} [P{X ≤ t} − P{X ≤ t_n}] = lim_{n→∞} P{t_n < X ≤ t} = P{X = t}.

Thus F_X has a jump of height h > 0 at t if and only if P{X = t} = h, and it is continuous at t if and only if P{X = t} = 0.


Let us state further properties of F_X:
1. If a < b, then

P{a < X ≤ b} = F_X(b) − F_X(a).

This easily follows from {ω : a < X(ω) ≤ b} = {ω : X(ω) ≤ b} \ {ω : X(ω) ≤ a}.
2. It holds that

1 − F_X(t) = P({X ≤ t}^c) = P{X > t}.

Fact: The big thing about random variables X is that we do not care what they look like! We only care about the probabilities associated with them via the distribution, i.e. we only care about how they are distributed.
Example 2.3. Roll a die twice. For ω = (ω_1, ω_2), let X(ω) = ω_1 + ω_2. What we care about are the probabilities of the values,

P({X = 2}) = ?, P({X = 3}) = ?
Definition 2.3 (Distribution Law). Take B ∈ B(R) and denote

P_X(B) = P{X ∈ B} = P(X^{−1}(B)) = P({ω ∈ Ω : X(ω) ∈ B}).

Then P_X is called the distribution law of X.
Important Remark: The distribution of the values of X can be described either by its distribution function F_X or by its distribution law P_X. Because of

F_X(t) = P_X((−∞, t]) and P_X((a, b]) = F_X(b) − F_X(a),

both approaches are completely equivalent. The advantage of F_X is that it is easier to describe than P_X. Its disadvantage is that its extension to higher dimensions is less useful.
Proposition 2.3. P_X is a probability measure on (R, B(R)).
Proof. We see P_X(∅) = 0 trivially, and P_X(R) = P{ω : X(ω) ∈ R} = P(Ω) = 1. Now let B_1, B_2, ... ∈ B(R) be disjoint. Then the A_j = X^{−1}(B_j) are disjoint as well, and if ∪_j B_j = B and ∪_j A_j = A, then X^{−1}(B) = A and

P_X(B) = P(A) = Σ_{j=1}^∞ P(A_j) = Σ_{j=1}^∞ P_X(B_j).

Example 2.4. Take the sum of the values when rolling a die twice. Then

F_X(t) = 0 for t < 2, 1/36 for 2 ≤ t < 3, 3/36 for 3 ≤ t < 4, 6/36 for 4 ≤ t < 5, and so on.

Using the law we can say

P_X({2}) = P{X = 2} = 1/36, P_X({3}) = P{X = 3} = 2/36, P_X([2, 4]) = 6/36 = 1/6.
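The whole table of jumps of F_X for this example can be generated directly; the following small Python sketch is my own addition, using exact fractions:

    from fractions import Fraction
    from collections import Counter

    # Mass function of the sum of two fair dice.
    counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
    f = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

    F = {}
    acc = Fraction(0)
    for s, p in f.items():            # distribution function at the jump points
        acc += p
        F[s] = acc

    print(f[2], f[3], F[4])           # 1/36, 1/18 (= 2/36), 1/6 (= 6/36)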

2.2 Discrete and Continuous Random Variables

Definition 2.4 (Discrete Random Variable). A random variable X : Ω → R is called discrete if there is an at most countable set D ⊆ R such that X : Ω → D.
Definition 2.5 (Probability Mass Function). The probability mass function f of X is defined by

f(t) = P{X = t} = P({ω ∈ Ω : X(ω) = t}).

Then we observe four facts:
1. f(t) = 0 if t ∉ D.
2. f(t) ≥ 0 and Σ_{t ∈ D} f(t) = P{X ∈ D} = 1.
3. F_X(s) = Σ_{t ≤ s, t ∈ D} f(t).
4. P_X(B) = Σ_{t ∈ B ∩ D} f(t).
Example 2.5. Let D = N and P{X = k} = 1/2^k for k ∈ N. Then

f(t) = 2^{−t} for t ∈ N, and f(t) = 0 else.

What are the distribution function F_X and the distribution law P_X(B) of that X? For t ≥ 1,

F_X(t) = Σ_{k=1}^{[t]} 1/2^k = 1 − 2^{−[t]},

and

P_X(B) = Σ_{k ∈ B ∩ N} 1/2^k.

Here [t] denotes the largest integer less than or equal to t ∈ R.


Definition 2.6 (Continuous Random Variable). X is a continuous random variable if there is a function f such that

F_X(t) = P{X ≤ t} = ∫_{−∞}^t f(u) du,

with density function f. Some facts:
1. f(u) ≥ 0 and ∫_{−∞}^∞ f(u) du = 1.
2. The mapping t ↦ ∫_{−∞}^t f(u) du is continuous. Thus F_X is continuous, which implies P{X = t} = 0 for all t ∈ R.
3. P_X(B) = ∫_B f(u) du and

P_X([a, b]) = P{a ≤ X ≤ b} = ∫_a^b f(u) du.

4. (d/dt) F_X(t) = f(t) if f is continuous at t.


Example 2.6. If

F_X(t) = 0 for t ≤ 0 and F_X(t) = 1 − e^{−t^λ} for t > 0,

then

f(t) = 0 for t ≤ 0 and f(t) = λ t^{λ−1} e^{−t^λ} for t > 0.

Example 2.7. Lifetime of a light bulb. Let X = t if the bulb burns out at time t > 0. What is the probability P(X ≥ t)? It is the probability that the bulb is still burning at time t, which is given by P(X ≥ t) = e^{−λt}; the lifetime behavior is governed by the parameter λ. Then

F_X(t) = P{X ≤ t} = 1 − P(X ≥ t) = 1 − e^{−λt}, t > 0,

and the density is

f(t) = λ e^{−λt}, t > 0.

This is called the exponential distribution with parameter λ.
One important question is: are discrete and continuous random variables the only random variables? The answer is no; here are some examples:
1. First of all, we may have random variables which are a mixture of a discrete and a continuous random variable.
Example 2.8. Let X be the lifetime of a light bulb. At a certain time T > 0 we switch the bulb off, provided it is still burning. Then

F_X(t) = 1 − e^{−λt} for 0 ≤ t < T, and F_X(t) = 1 for t ≥ T.

2. If F_X is continuous, does this imply that X is continuous? I.e., can any continuous, non-decreasing function be recovered from its derivative? The answer is no. Take the devil's staircase (the Cantor function), for example.
Definition 2.7 (Binomial distributed random variables). Let X : Ω → {0, ..., n} and 0 ≤ p ≤ 1. Then X is B_{n,p} distributed if

P{X = k} = C(n, k) p^k (1−p)^{n−k},

so that

f(k) = C(n, k) p^k (1−p)^{n−k}, k = 0, ..., n.

This describes an experiment with n trials, where each trial can be a failure (0) or a success (1); X = k means exactly k successes, and p is the probability of a success in each trial.
Example 2.9. Say there is an airplane company having a plane with m seats. The company sells n > m tickets in order to minimize the risk of empty seats. Let p be the probability that a passenger shows up, so that 1−p is the probability of a passenger not showing up. Let X_n be the number of passengers that show up when selling n tickets. The company has to find the maximal n > m such that

P{X_n > m} = Σ_{k=m+1}^n P{X_n = k} = Σ_{k=m+1}^n C(n, k) p^k (1−p)^{n−k} ≤ α

for a given risk probability α > 0.
Longtime records show that a show-up probability of p = 0.9 is very reasonable. Suppose now an airplane has m = 250 seats and the company sells 266 tickets. Then the probability that the plane is overbooked equals 0.00794. Yet selling 267 tickets, this probability is 0.01413. Hence, if α = 0.01, the optimal choice is to sell 266 tickets. That keeps the risk for the company below 0.01. In case of α = 0.1, the optimal choice for n is 271. Selling 272 tickets, the risk of overbooking is already 0.1224.
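The quoted overbooking probabilities can be reproduced with a binomial tail sum. A Python sketch of mine (exact values may differ from the rounded figures above in the last digit):

    import math

    def prob_overbooked(n, m, p):
        # P{X_n > m} for X_n binomial with parameters n and p.
        return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
                   for k in range(m + 1, n + 1))

    m, p = 250, 0.9
    for n in (266, 267, 271, 272):
        print(n, prob_overbooked(n, m, p))
    # 266 -> ~0.0079, 267 -> ~0.0141, 271 -> ~0.10, 272 -> ~0.1224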

2.3 Random Vectors

Let (Ω, A, P) be a probability space and

X_j : Ω → R, ω ↦ X_j(ω), 1 ≤ j ≤ n.

Then make X_1, ..., X_n into a vector X : Ω → R^n where

X(ω) = (X_1(ω), ..., X_n(ω)).
Example 2.10.
1. Roll a die twice. Let ω = (ω_1, ω_2), X_1(ω) = min{ω_1, ω_2}, X_2(ω) = max{ω_1, ω_2} and X_3(ω) = ω_1 + ω_2.
2. N students, X_1(ω) = height, X_2(ω) = weight.
3. Say we have m boxes B_1, ..., B_m. Place k balls into these boxes. Let X_j(ω) be the number of balls in B_j. Then (X_1, ..., X_m) is an m-dimensional random vector with

X_1 + ··· + X_m = k.

4. Lottery with 49 numbers of which six are chosen. X_k is the number chosen k-th. Then

X = (X_1, ..., X_6)

is a 6-dimensional random vector.
Definition 2.8. Let X = (X_1, ..., X_n) be a mapping from Ω to R^n. Then X is called a random vector if and only if all coordinate mappings X_1, ..., X_n are random variables.
Equivalently, X is a random vector if and only if for all t_j ∈ R we have

{ω ∈ Ω : X_1(ω) ≤ t_1, ..., X_n(ω) ≤ t_n} ∈ A.

Proposition 2.4. X is a random vector if and only if for all B ∈ B(R^n) it follows that

X^{−1}(B) ∈ A.

Proof. One direction is trivial. The other direction is left to the reader.
We want to define the distribution law of X, or the joint distribution of X_1, ..., X_n.
Definition 2.9. P_X(B) = P({ω ∈ Ω : X(ω) ∈ B}) for all B ∈ B(R^n). P_X is called the distribution law of X.
Proposition 2.5. P_X is a probability measure on (R^n, B(R^n)).

Note that

P_X([a_1, b_1] × ··· × [a_n, b_n]) = P(X ∈ [a_1, b_1] × ··· × [a_n, b_n]) = P(a_1 ≤ X_1 ≤ b_1, ..., a_n ≤ X_n ≤ b_n).

If n = 1, then the distribution function F_X and the distribution law P_X are tightly related via

F_X(t) = P_X((−∞, t]) and P_X((a, b]) = P{a < X ≤ b} = F_X(b) − F_X(a).

Let t = (t_1, ..., t_n). If X is a random vector, define

F_X(t) = P{X_1 ≤ t_1, ..., X_n ≤ t_n}.

Then F_X : R^n → [0, 1] is the distribution function. The relation between F_X and P_X now becomes much more complicated. If n = 2, then

P{a_1 < X_1 ≤ b_1, a_2 < X_2 ≤ b_2} = F_X(b_1, b_2) − F_X(a_1, b_2) − F_X(b_1, a_2) + F_X(a_1, a_2),

and the bigger n becomes, the more complicated the relation between F_X and P_X gets.
Conclusion: Distribution functions are a helpful tool if n = 1. If n > 1, they are less useful. There are only a few cases where we consider distribution functions.
Definition 2.10 (Special Cases).
(a) Call X = (X_1, ..., X_n) discrete if there is an at most countable set D ⊆ R^n such that X : Ω → D. The joint probability mass function is defined for t ∈ D by

f(t) = P{X = t}, and then P_X(B) = P{X ∈ B} = Σ_{t ∈ D ∩ B} f(t).

(b) X is continuous if

P{X_1 ≤ t_1, ..., X_n ≤ t_n} = ∫_{−∞}^{t_1} ··· ∫_{−∞}^{t_n} f(x_1, ..., x_n) dx_n ··· dx_1.

We call f the joint density; then

P_X(B) = ∫_B f(x) dx.

Example 2.11.
1. Consider an urn with four balls, two marked 0 and two marked 1. Take out two balls without replacement. Let X_1 be the value of the first ball and X_2 the value of the second. Then X = (X_1, X_2) is a 2-dimensional random vector. The possible values are D = {(0,0), (0,1), (1,0), (1,1)}. What is f(0,0)? It is

f(0, 0) = P{X_1 = 0, X_2 = 0} = 1/6, and likewise f(1, 1) = 1/6.

Similarly, f(1, 0) = f(0, 1) = 1/3. This means that P_X is completely described:

           X_2 = 0   X_2 = 1
  X_1 = 0    1/6       1/3
  X_1 = 1    1/3       1/6

Now, if we replace the first ball before drawing the second, we get

           X_2 = 0   X_2 = 1
  X_1 = 0    1/4       1/4
  X_1 = 1    1/4       1/4


2. Roll a die twice; ω = (ω_1, ω_2). Let X_1 = min{ω_1, ω_2} and X_2 = max{ω_1, ω_2}. Draw the tables yourself.
3. Consider the case that we throw a dart at the unit circle K_1, with X = (X_1, X_2) the coordinates of the point where it lands. Then

f(u, v) = 1/π if u² + v² ≤ 1, and 0 else,

and

P_X(B) = ∫_B f(x) dx = vol_2(B ∩ K_1)/π.

Definition 2.11 (Marginal Distributions). Given X = (X_1, ..., X_n), P_X is a probability measure on (R^n, B(R^n)), while P_{X_1}, ..., P_{X_n} are probability measures on (R, B(R)). The latter probability measures are called the marginal distributions of X. Recall that P_X is their joint distribution.
Consider example 1 above. The marginal distributions there certainly do not determine the joint distribution.
Example 2.12. We have two groups of students, I and II, with 40 and 60 students respectively. If we know the joint distribution to be

           I        II
  Fail   10/100    5/100
  Pass   30/100   55/100

then we know the probability that a student chosen at random is in group I and passed the exam. This probability is 0.3. But if we only know the number of students in groups I and II and the number of students that failed or passed, this information does not suffice to evaluate the previous probability.

2.4 Relation between Joint and Marginal Distributions

Let X = (X_1, ..., X_n) be a random vector and P_X = P_{(X_1,...,X_n)}. Recall that the joint distribution P_X is uniquely described by

P_X(B_1 × ··· × B_n) = P{X_1 ∈ B_1, ..., X_n ∈ B_n}.

On the other hand, we also have the n probabilities

P_{X_j}(B) = P{X_j ∈ B},

which were called the marginal distributions. Recall that they are probabilities on (R, B(R)), while the joint distribution is a probability on (R^n, B(R^n)).
Proposition 2.6. P_{X_j}(B) = P_{(X_1,...,X_n)}(R × ··· × B × ··· × R), with B in the j-th position.
Proof.

P_{X_j}(B) = P{X_1 ∈ R, ..., X_{j−1} ∈ R, X_j ∈ B, X_{j+1} ∈ R, ..., X_n ∈ R} = P_{(X_1,...,X_n)}(R × ··· × B × ··· × R).


Consequences:
In the discrete case with n = 2, consider (X, Y). Then P_{(X,Y)} is the joint distribution, and P_X and P_Y are the marginals. Say

X : Ω → D = {x_1, x_2, ...}, Y : Ω → E = {y_1, y_2, ...},

so that

(X, Y) : Ω → D × E = {(x_i, y_j) : i, j ∈ N}.

Then the joint probability mass function f is

f(x_i, y_j) = P{X = x_i, Y = y_j}, i, j ∈ N.

P_X and P_Y are described by the probability mass functions f_X and f_Y, where

f_X(x_i) = P{X = x_i}, f_Y(y_j) = P{Y = y_j}.

Proposition 2.7.

f_X(x_i) = Σ_{j=1}^∞ f(x_i, y_j), f_Y(y_j) = Σ_{i=1}^∞ f(x_i, y_j).

Proof. Since E = ∪_{j=1}^∞ {y_j}, σ-additivity gets us

f_X(x_i) = P{X = x_i} = P{X = x_i, Y ∈ E} = Σ_{j=1}^∞ f(x_i, y_j).

The other assertion follows in the same way.

Example 2.13. Consider the joint distribution of X and Y to be

           X = 0   X = 1
  Y = 0     1/8     1/4
  Y = 1     3/8     1/4

Then the marginals can be found to be P(Y = 0) = 3/8 and P(Y = 1) = 5/8, while P(X = 0) = P(X = 1) = 1/2.
Imagine that we only know the marginals to be P(X = 0) = P(X = 1) = 1/2. Then we could have a number of possibilities, for instance either

           X = 0   X = 1
  Y = 0     1/6     1/3
  Y = 1     1/3     1/6

or

           X = 0   X = 1
  Y = 0     1/4     1/4
  Y = 1     1/4     1/4

Thus the marginal distributions do not determine the joint distribution.


For the continuous case, the joint distribution is
Z

P(X t, Y s) =

f (u, v)dudv


Proposition 2.8. Let f be the joint density of (X, Y). Then it follows that

f_X(u) = ∫_{−∞}^∞ f(u, v) dv is a density of X,
f_Y(v) = ∫_{−∞}^∞ f(u, v) du is a density of Y.

Proof.

P{X ≤ t} = P{X ≤ t, Y < ∞} = ∫_{−∞}^t [∫_{−∞}^∞ f(u, v) dv] du,

and the inner integral is exactly f_X(u).

Example 2.14. Suppose (X, Y) is uniformly distributed on K = {(u, v) : u² + v² ≤ 1}, i.e.

f(u, v) = 1/π if u² + v² ≤ 1, and 0 else.

Then for |u| < 1,

f_X(u) = ∫_{−∞}^∞ f(u, v) dv = (1/π) ∫_{v² ≤ 1−u²} dv = (2/π) √(1 − u²).

2.5 Independence of Random Variables

Say we have X_1, ..., X_n : Ω → R. What does it mean that X_1, ..., X_n are independent?
For n = 2 this should mean P{X_2 ∈ B_2 | X_1 = t} = P{X_2 ∈ B_2}. Of course, this makes sense only if P{X_1 = t} > 0. Therefore, the better way is to say that for all B_1, B_2 we have

P{X_2 ∈ B_2 | X_1 ∈ B_1} = P{X_2 ∈ B_2},

or that

P_{(X_1,X_2)}(B_1 × B_2) = P_{X_1}(B_1) P_{X_2}(B_2).

Definition 2.12. X_1, ..., X_n are independent if and only if for all B_1, ..., B_n ∈ B(R) we have, with B = B_1 × ··· × B_n,

P_{(X_1,...,X_n)}(B) = P_{X_1}(B_1) ··· P_{X_n}(B_n).    (12)

Another way to write (12) is

P{X_1 ∈ B_1, ..., X_n ∈ B_n} = P{X_1 ∈ B_1} ··· P{X_n ∈ B_n}, B_j ∈ B(R).

Remark.
(1) It suffices to verify (12) for B_j = (−∞, t_j]. In different words, the X_j are independent if and only if for all t_1, ..., t_n ∈ R we have

P{X_1 ≤ t_1, ..., X_n ≤ t_n} = P{X_1 ≤ t_1} ··· P{X_n ≤ t_n}.

(2) X_1, ..., X_n are independent if and only if X_1^{−1}(B_1), ..., X_n^{−1}(B_n) are independent as events for all B_1, ..., B_n ∈ B(R).^2

^2 Why is this not a trivial assertion? Recall how the independence of n events was defined.


(3) Equation (12) tells us that X_1, ..., X_n are independent if and only if

P_{(X_1,...,X_n)} = P_{X_1} ⊗ ··· ⊗ P_{X_n}.

In particular, we observe that for independent X_j the marginal distributions determine the joint distribution. Recall that this is false in general.
Concrete cases:
(a) Take n = 2 in the discrete case. Let (X, Y) be the 2-dimensional vector; we have f(x_i, y_j) = P{X = x_i, Y = y_j}, with the mass functions of the marginals f_X(x_i) = P{X = x_i} and f_Y(y_j) = P{Y = y_j}. Then:
Theorem 2.2. X and Y are independent if and only if

f(x_i, y_j) = f_X(x_i) f_Y(y_j)

for all x_i and y_j.
Proof. Let X and Y be independent. Then

f(x_i, y_j) = P{X ∈ {x_i}, Y ∈ {y_j}} = P{X ∈ {x_i}} · P{Y ∈ {y_j}} = f_X(x_i) f_Y(y_j).

Now let f(x_i, y_j) = f_X(x_i) f_Y(y_j), and let B_1, B_2 ⊆ R be arbitrary. Note

P{X ∈ B_1, Y ∈ B_2} = P{(X, Y) ∈ B_1 × B_2} = Σ_{(x_i,y_j) ∈ B_1 × B_2} f(x_i, y_j)
= Σ_{x_i ∈ B_1} Σ_{y_j ∈ B_2} f_X(x_i) f_Y(y_j) = (Σ_{x_i ∈ B_1} f_X(x_i)) (Σ_{y_j ∈ B_2} f_Y(y_j)) = P{X ∈ B_1} P{Y ∈ B_2}.

Example 2.15.
1. Roll a die twice. Let X be the result of the first roll and Y the result of the second. Then

f(x_i, y_j) = 1/36 = f_X(x_i) f_Y(y_j).

2. Let P{X = k} = P{Y = k} = 1/2^k. X, Y independent implies P{X = k, Y = l} = 1/2^{k+l}. We want to find

P{|X − Y| ≤ 1} = Σ_{k=1}^∞ P{X = k, |Y − k| ≤ 1}
= Σ_{k=1}^∞ [P{X = k, Y = k+1} + P{X = k, Y = k} + P{X = k, Y = k−1}]
= Σ_{k=1}^∞ 1/2^{2k+1} + Σ_{k=1}^∞ 1/2^{2k} + Σ_{k=2}^∞ 1/2^{2k−1} = 1/6 + 1/3 + 1/6 = 2/3.
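The three geometric series can be checked numerically with partial sums; this short sketch is my own addition:

    # Partial sums of the three series; the total should approach 2/3.
    s1 = sum(2.0 ** -(2 * k + 1) for k in range(1, 60))   # Y = k + 1
    s2 = sum(2.0 ** -(2 * k) for k in range(1, 60))       # Y = k
    s3 = sum(2.0 ** -(2 * k - 1) for k in range(2, 60))   # Y = k - 1
    print(s1, s2, s3)       # ~1/6, ~1/3, ~1/6
    print(s1 + s2 + s3)     # 0.6666... = 2/3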


3. Say you have a massive ball of dough, and say that the raisins in the dough are uniformly distributed. Take out a piece of dough; what is the number of raisins in it?
Let X be the number of raisins in one pound of bread. A good model is to assume that P_X is a Poisson distribution with parameter λ > 0,

P{X = k} = (λ^k/k!) e^{−λ}, k = 0, 1, 2, ....

Now take another pound of bread, and let Y be the number of raisins in that pound, independent of X. What is

P{X = Y} = Σ_{k=0}^∞ P{X = k, Y = k} = Σ_{k=0}^∞ (λ^{2k}/(k!)²) e^{−2λ}?
Proposition 2.9 (Extension to n random variables). Let

X_j : Ω → D_j, 1 ≤ j ≤ n,

hence

(X_1, ..., X_n) : Ω → D_1 × ··· × D_n.

Then X_1, ..., X_n are independent if and only if

P{X_1 = d_1, ..., X_n = d_n} = P{X_1 = d_1} ··· P{X_n = d_n}

for all d_j ∈ D_j.
Example 2.16. Toss a coin with sides 0 and 1 n times. Let X_j be the result of the j-th toss. Suppose P{X_j = 0} = 1−p and P{X_j = 1} = p. Then, given x_j ∈ {0, 1}, it follows that

P{X_1 = x_1, ..., X_n = x_n} = p^k (1−p)^{n−k},

where exactly k of the x_j satisfy x_j = 1. Then

P{X_1 = x_1} ··· P{X_n = x_n} = p^k (1−p)^{n−k} = P{X_1 = x_1, ..., X_n = x_n}.

Thus X_1, ..., X_n are independent.

2.6 Independence of Continuous Random Variables

Let us restrict ourselves to the case n = 2. Then

P{X ≤ t} = ∫_{−∞}^t f_X(u) du, P{Y ≤ t} = ∫_{−∞}^t f_Y(v) dv.

Theorem 2.3. X and Y are independent if and only if

f(u, v) = f_X(u) f_Y(v)

is a joint density of (X, Y).


Proof. If X and Y are independent, then

∫_{−∞}^t ∫_{−∞}^s f(u, v) du dv = P{X ≤ t, Y ≤ s} = P{X ≤ t} P{Y ≤ s}
= (∫_{−∞}^t f_X(u) du)(∫_{−∞}^s f_Y(v) dv) = ∫_{−∞}^t ∫_{−∞}^s f_X(u) f_Y(v) du dv.

Thus f_X(u) f_Y(v) is a joint density of (X, Y). The other direction is proved in exactly the same way.
Example 2.17. Consider the lifetime of a bulb. Let X = t denote the bulb burning out at time t, with distribution

P{X ≤ t} = ∫_0^t λ e^{−λu} du.

Let Y denote the lifetime of a second, independent bulb. The joint density of (X, Y) is then λ² e^{−λ(u+v)} for u, v > 0. What is

P{Y ≥ X + 1} = P{(X, Y) ∈ {(u, v) : v ≥ u + 1, u > 0}}?

Then

P{Y ≥ X + 1} = ∬_{v ≥ u+1, u > 0} λ² e^{−λ(u+v)} du dv = λ² ∫_0^∞ [∫_{u+1}^∞ e^{−λv} dv] e^{−λu} du
= λ ∫_0^∞ e^{−λ(u+1)} e^{−λu} du = e^{−λ}/2.

For λ = 1 this equals 1/(2e).
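A simulation check (my own sketch, assuming λ = 1, so the exact answer is e^{−1}/2 = 1/(2e)):

    import math, random

    lam = 1.0
    trials = 500_000
    hits = sum(random.expovariate(lam) >= random.expovariate(lam) + 1
               for _ in range(trials))
    print(hits / trials)          # close to exp(-lam)/2
    print(math.exp(-lam) / 2)     # 0.18393... = 1/(2e) for lam = 1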
Example 2.18. Take (X, Y) to be a random point (x, y) in the unit circle, with density

f(u, v) = 1/π if u² + v² ≤ 1, and 0 else.

Then f_X(u) = ∫ f(u, v) dv = (2/π)√(1−u²) and f_Y(v) = ∫ f(u, v) du = (2/π)√(1−v²). Thus f(u, v) ≠ f_X(u) f_Y(v), so X and Y are not independent. Indeed, for continuous X and Y to be independent, the support of the joint density must be a rectangle.

3 Special Random Variables

3.1 Important Discrete Random Variables

Let X : Ω → D = {x_1, x_2, ...}, where f(x_j) = P{X = x_j} and f(x) = 0 for x ∉ D.
Properties:
(i) f(x_j) ≥ 0
(ii) Σ_{j=1}^∞ f(x_j) = 1
(iii) P{X ∈ B} = P_X(B) = Σ_{x_j ∈ B} f(x_j)
(iv) F_X(t) = Σ_{x_j ≤ t} f(x_j)

Here are some discrete random variables of interest:

(a) Uniformly Distributed Random Variables
If D = {x_1, ..., x_N}, then X is uniform on D if

P{X = x_1} = ··· = P{X = x_N} = 1/N.

Then f(x_j) = 1/N for j = 1, ..., N and

P_X(B) = P{X ∈ B} = Σ_{x_j ∈ B} f(x_j) = #(B ∩ D)/#D.

Example 3.1. Roll a die n times, so Ω = {1, ..., 6}^n with P({ω}) = 1/6^n. Let X_j(ω) = ω_j, where ω = (ω_1, ..., ω_n), be the value of the j-th trial. Then

P{X_j = k} = 6^{n−1}/6^n = 1/6, k = 1, ..., 6.

So X_j is uniformly distributed on D = {1, ..., 6}.


In the previous example all X_j, 1 ≤ j ≤ n, possess the same distribution. This leads to the following definition.
Definition 3.1 (Identically Distributed). Two random variables are said to be identically distributed if P_X(B) = P_Y(B) for all B ∈ B(R). We write

X =^d Y.

Remark. Note that X and Y need not be defined on the same probability space in order that X =^d Y. Roll a die twice and take X as the first result. Now roll it four times and define Y as the last result. Then X =^d Y, although they are defined on different probability spaces.

3 SPECIAL RANDOM VARIABLES

34

(b) Binomial Distributed Random Variables


0 p 1, n N. Suppose X has values in D = {0, . . . , n}. Then X Bn,p (is binomial
distributed with parameters n and p) if
 
n k
P{X = k} =
p (1 p)nk ,
k = 0, . . . , n .
k
This can be thought of as n trials of a success/failure process with p being the success probability,
with failure probability as 1 p. We then have
X = k we have exactly k times success.
Example 3.2. Exam with 100 problems. You have a probability of p of answering each question
correctly. To get a grade better than F one needs 60 or more questions answered correctly. So
P(to pass) =


100 
X
100
k=60

pk (1 p)100k

For example, in order that P(to pass) 0.75, the success probability p has to satisfy p > 0.6274.
(c) Poisson Distributed Random Variables
D = N0 = {0, 1, 2, . . . } with > 0. We say X Pois (Poisson distributed with parameter
> 0) if
k
P{X = k} = e , k = 0, 1, 2, . . .
k!
Note that

X
k
e = e e = 1
k!
k=0
Lemma 3.1. xn x in R implies


x n n
1+
ex ,
n

Proof. We have to show



x n
xn  n 

1+
1+
0,
n
n

Take
|an bn | = |a b||an1 + an2 b + + bn1 |
Suppose |a|, |b| c < . Then
|an bn | |a b| n cn1
We know xn x so that |xn | 1 + |x| = d for n large enough. Then a = 1 +
addition, |a|, |b| 1 + nd

n1
|x xn |
d
|a b |
n 1+
C|x xn | 0,
n
n
n

Note that 1 +


d n1
n

ed , hence it is bounded by some C > 0.

xn
n

and b = 1 + nx . In

3 SPECIAL RANDOM VARIABLES

35

Theorem 3.1 (Poissons Limit Theorem). Let Xn Bn,pn distributed and suppose
lim npn = > 0

Then
P{Xn = k}

k k
e
k!

PXn ({k}) PX ({k}) where X Pois .

or

Proof. We have
 
n k
1
n!
P{Xn = k} =
pn (1 pn )nk =
pk (1 pn )nk
k
k! (n k)! n
1 n(n 1) (n k + 1)
=
(npn )k (1 pn )n (1 pn )k
k!
nk
Since npn > 0, it follows pn 0 as n , hence we get
(i)

n(n1)(nk+1)
nk

(ii) (npn )k k
(iii) (1 pn )k 1
(iv) Employing the previous lemma, (1 pn )n = 1 +
Thus for any k N0 we have
lim P{Xn = k} =


npn n
n

e , as npn , n

k
e .
k!

Applications: Let n big and p small. If X Bn,p then with = np it follows


P{X = k}

k
e .
k!

Example 3.3. Let X be the number of people among N having a birthday today. Then
 
0 
N   
1 
N 1
N
1
364
N
1
364
P{X 2} = 1 P{X 1} = 1

.
0
365
365
1
365
365
Using the above approximation, =

N
365

(13)

then

0 N/365 N/365
P{X 2} 1 e
e
.
0
1

(14)

For example, if N = 500, then (13) gives P{X 2} = 0.397895 while (14) equals 0.397719.
Remark. Poissons Limit Theorem also explains why Poisson distributed random variables appear in
cases where there are many trials, each having a small success probability. For example, there are
many cars in the city (when driven it is a trial), but the probability that a fixed car has an accident
at that day is very small. This explains why the number of accidents per day may be described by
a Poisson distributed random variable.

3 SPECIAL RANDOM VARIABLES

36

(d) Geometric Distributed Random Variables


0 < p < 1, D = {1, 2, . . . }. Then X is geometric distributed with parameter p if
P{X = k} = p(1 p)k1 .
This means the success probability is p. So that X = k if and only if in the kth trial we have
success for the first time.
Example 3.4. Roll a die. The probability that 6 appears for the first time in the 10th trial is
( 61 )( 56 )9 = 0.0323011.
Example 3.5. Playing roulette (without double zero) the probability of Red is 18/37. Thus
the probability that Red appears for the first time in the 88th trial is (18/37) (19/37)87 which is
approximately 3.19951 1026 . This really happened once in Monte Carlo.
Remark. Geometric Random Variables have a special property: They are memoryless: The probability of occurrence of the first success in the (m + k)th trial under the condition that the first k
trials were failures is the same as that of occurrence of first success in the mth trial. More precisely,
it holds
P{X = m + k|X > k} = P{X = m} .
Note that X > k occurs if and only if the first k trials were failures. In different words it does not
improve your chance of winning having many failures before starting the game.
(e) Negative Binomial Distributed Random Variables
0 < p < 1 success probability, n 1 fixed. Probability in the (k + n)th trial to have exactly the
nth success. As an example say on the 20th roll of a die, we see exactly our fourth 6. Here
n = 4 and k = 16.


, or
There are exactly n+k1
ways to distribute the n 1 ones and k zeros. So we say X is Bn,p
k
negative binomial distributed with parameters n and p, if


n+k1 n
P{X = k} =
p (1 p)k
k

Remark. If n = 1, then X is B1,p


distributed if and only if

P{X = k} = p (1 p)k ,

k = 0, 1, . . .

Let us say that X is modified geometric distributed. Note that in this the case Y := X + 1 is
geometric distributed with parameter p. Indeed, we have
P{Y = k} = P{X = k 1} = p (1 p)k1 ,

k = 1, 2, . . . .

In different words X is modified geometric if X = k says that in the (k + 1)st trial we have
success for the first time. So everything is shifted by one.
Let us come back to the general case. We can write


n+k1
(n + k 1)(n + k 2) n
=
k!
k
(n)(n 1)(n 2) (n k + 1)
= (1)k
k!
 
n
= (1)k
k

3 SPECIAL RANDOM VARIABLES


So

37

 
n n
P{X = k} =
p (p 1)k
k

Why does this sum up to 1?

P{X = k} =



X
n

pn (p 1)k

k



X
n
n
=p
(p 1)k
k
k=0

k=0

k=0

= pn

1
(1 + (p 1))n

=1
This is found by computing the Taylor expansion of

1
.
(1+x)n

Example 3.6 (Banachs Matchbox Problem). Two matchboxes M1 and M2 each with N matches.
The probability of removing a match from either box is p = 12 . Take the last match out of M1 or M2 .
How many matches remain in the other box? Each trial we take out one match. For each m, there
are two cases: M1 is empty and M2 has m matches or M2 is empty and M1 has m matches. Let M2
be empty and take
X = m, if m matches are left in M1 .
Let success be if we choose M2 . Thus taking out the last match from M2 means that we have exactly
for the N th time success. In order that there are still m matches in M1 , this N -th success has to
happen in the (2N m)th trial. Then the probability is success after 2N m trials that M2 is empty
equals

  N  N m
2N m 1
1
1
P{X = m} =
, m = 1, . . . , N .
N m
2
2
The event that after 2N m trials, M1 is empty has the same probability (and they are disjoint).
Thus the probability that when taking out the last match from one box, the other one contains m
matches, equals

  N  N m 
  2N m1
2N m 1
1
1
2N m 1
1
=
.
2
N m
2
2
N m
2
Remark. Since some m = 1, . . . , N has to appear, the last result tells us
  2N m1
N 
X
2N m 1
1
= 1.
N

m
2
m=1
Letting k = N m, we get
N
1 
X
k=0

  k
N +k1
1
= 2N 1 .
k
2

Replacing N 1 by n this implies the known identity



n 
X
n+k 1
= 2n .
k
k
2
k=0

3 SPECIAL RANDOM VARIABLES

38

Remark. In the literature one also finds Banachs Matchbox Problem in a slightly modified form:
One asks for the probability to have left m matches in the other box when choosing a box which
turns out to be empty. Note that here m = 0 may occur, namely if both boxes are empty at the
same time.
(f) Hypergeometric Random Variables
A retailer delivers N machines. Among them are M defective. Choose n of the N machines and
check them. What is the probability that among the n tested, m are defective? Counting, this
becomes


P{X = m} =

M
m

N M
nm

N
n

m = 0, . . . , n

We say that the random variable X here is hypergeometric distributed with parameters N, M
and n.
Example 3.7. In an urn are M white balls and N M black ones. Take out n balls in a single
draw. Let X be the number of observed white balls. Then X is hypergeometric with parameters N ,
M , and n.
What happens if we take out the n balls one after the other with replacement. Probability for a
. So
white ball is p = M
N
nm
   n 
M
n
M
1
P{X = m} =
N
N
m
Example 3.8. Lottery of 49 numbers and 6 chosen without repetition. There are 6 numbers written
on the lottery ticket. What is the probability that exactly three of them were chosen and three not?
Let X = m if m numbers of my ticket were chosen.
 43 
6
P{X = m} =

6m

49
6

m = 0, . . . , 6 .

Applications of Hypergeometric Distributions:


1. Quality Control: One knows N , one knows n and observes the random m. We want to get
some information about the unknown M . To this end we use the Maximum Likelihood method:
Investigate the function


: M 7

M
m

N M
nm

N
n

(m) as this M where is maximal.


Here N and n are known and m was observed. Choose M
Then
i
(h
m(N +1)
:m<n
n
(m) = argmax (M ) =
M
N
:m=n
is an estimator for M .
For example, let us assume N = 100 n = 15. Observe
=6
m=1 M
= 13
m=2 M
= 26
m=4 M
= 40
m=6 M

3 SPECIAL RANDOM VARIABLES

39

2. Size Estimation: Suppose we have the following situation: We know M and n and observe
m. Is it possible to get some information about N ?
Let us illustrate this problem by an example. Say there are N fish in a pond, N unknown.
Catch M and mark them. Put back and wait! Catch n observe m marked. We investigate


M N M
(N ) 7

nm

N
n

(m) = argmax (N ). Then


where now M and n are known and m was observed. Choose N


M

n
(m) =
N
m
is an estimator for N .
= 214. If m = 2 then N
= 750. If
Say M = 50 and n = 30. If m = 7 we estimate that N

m = 16 then N = 93.
(g) Multinomial Distributed Random Vectors
We have m boxes B1 , . . . , Bm and place independently one after the other n balls into these
m boxes. The probability to place a single ball into box Bj is pj where 0 pj 1 and
p1 + + pm = 1. Let Xj be the number of balls in Bj . Then we have X1 , . . . , Xm random
variables. What is the probability
P{X1 = k1 , . . . , Xm = km } where k1 + + km = n ?
This can be modeled by a random vector X = (X1 , . . . , Xm ) that takes values in in the set
D = {(k1 , . . . , km ) : k1 + + km = n} .
Then


P{X1 = k1 , , Xm = km } =


n
pk11 pkmm
k1 , . . . , km

where the multinomial coefficients are defined by




n
n!
,
=
k1 ! km !
k1 , . . . , km

k1 + + km = n .

This is inspired by the multinomial theorem, that


n

(a1 + + am ) =

X
k1 ++km =n


n
ak11 akmm
k1 , . . . , km

We say X = (X1 , . . . , Xm ) is multinomial distributed with parameters p1 , . . . , pm .


What happens if each box is equally likely? Then we have p1 = = pm = 1/m, hence

  n
n
1
P{X1 = k1 , . . . , Xm = km } =
k1 , . . . , km
m

3 SPECIAL RANDOM VARIABLES

40

Example 3.9. Suppose n m. The number of balls is smaller than the number of boxes. What is
the probability that in the first n boxes B1 , . . . , Bn is exactly one ball? Then k1 = = kn = 1 and
kn+1 = = km = 0. Then the probability is
n!
mn
Since there are
most one ball is

m
n

ways to fix n of the m boxes, the probability that in each of the m boxes is at
 
m
n!
n
m
n

Example 3.10. A train with 3 coaches arrives. 6 passengers enter the train. Find the probability
that there are exactly 2 persons in each coach. We have m = 3 and n = 6 with p1 = p2 = p3 = 13 .
Then
 6
6!
1
10
P{X1 = 2, X2 = 2, X3 = 2} =
=
2!2!2! 3
81
Remark. Suppose the vector X = (X1 , . . . , Xm ) is multinomial distributed with parameters p1 , . . . , pm
Question: What are the marginal distributions, that is, the distributions of the Xj ?
Answer: Each Xj Bn,pj .

3.2

Important Continuous Random Variables

Recall, for a continuous random variable X whose distribution has density f , we have
Z t
P{X t} =
f (u)du

R
f density, f (u) 0, f (u)du = 1.
From this we deduce
Z
P{a X b} =

f (u)du
a

What are the most important densities f ?


(a) Uniform Distribution on R
Given an interval [, ] we say X is uniform distributed on [, ] or X U [, ] if it has density
(
1
:u
f (u) =
0
: else
Equivalently, X U [, ] if
Z

P{X t} = F (t) =

1
t
dt =

:t<
:t
:t>

Basically the uniform distribution is saying that the probability of an event B is just the length
of B [, ] divided by the length of [, ], i.e. the Lebesgue measure of the event in [, ]
normalized by .

3 SPECIAL RANDOM VARIABLES

41

Example 3.11. Say a train leaves the station every 30 minutes. X is the time since the last
departure and X U [0, 30]. What is the probability that we wait more than 20 minutes? If
we wait more than 20 minutes, then the last train had to have departed in the first 10 minutes.
This probability is
Z 10
1
1
P{0 X 10} =
= .
30 0
3
0
(b) Uniform Distribution on Rn
Given a set K Rn bounded and closed and n the Lebesgue measure. Then X U (K) or X
is uniformly distributed on K if
P{X B} =

n (B K)
n (K)

If we are in R1 this translates to length, in R2 - area and in R3 - volume.


Example 3.12. If we have a dart board of radius 1 and the bullseye has radius 1/20. Say X is
the location of where we hit on the dart board, uniformly distributed. If K is the dart board and

. Then the probability of hitting the bullseye is


B is the bullseye then (K) = and (B) = 400
P{X K} =

/400
1
=

400

Example 3.13. Two friends visit a bar independently between 1 and 2 oclock. After arrival
each of them waits 20 minutes. Find the probability that the two friends meet each other.
If X1 is the arrival time of the first friend and X2 that of the second, the vector X = (X1 , X2 )
is uniformly distributed on K = [1, 2]2 . They meet each other if X A where


1
2
.
A = (x1 , x2 ) [0, 1] : |x1 x2 |
3
Hence P{X A} = 2 (A) =

5
9

Example 3.14 (Buffons Needle Test). Take a needle of length a < 1 and throw it on a lined
sheet of paper. Say the distance between two lines on the paper is 1. Find the probability that
the needle cuts a line.
What is random in this experiment? Choose the two lines such that between them the midpoint
of the needle lies. Let x [0, 1] be the distance of the midpoint of the needle to the lower line.
Furthermore, denote by [/2, /2] the angle of the needle to a line perpendicularly to the
lines on the paper. For example, if = 0, then the needle is perpendicular to the lines on the
paper while for = /2 it lies parallel.
Hence, to throw a needle randomly is equivalent to choosing a point (, x) uniformly distributed
in K = [/2, /2] [0, 1].
The needle cuts the lower line if and only if
a
cos 1 x.
2

a
2

cos x and it cuts the upper line provided that

If
A = {(, x) [/2, /2] [0, 1] : x

a
cos
2

or 1 x

a
cos } ,
2

then we get
P{The needle cuts a line} = P(A) =

2 (A)
2 (A)
=
.
2 (K)

3 SPECIAL RANDOM VARIABLES

42

But it follows
Z

/2

2 (A) = 2
/2

a
cos d = 2a ,
2

hence
P(A) =

2a
.

(c) Standard Normal Distribution


Lemma 3.2. It holds

2 /2

ex

dx =

2 .

We say X is standard normal distributed, or X N(0, 1) if the density is


u2
1
f (u) = e 2 .
2

Then ist distribution function is the Gaussian Error Function


Z t
u2
1
(t) =
e 2 du
2
(d) N(, 2 ) Distributed Random Variables
Given R and > 0 a random variable X is N(, 2 ) distributed if
Z t
1
2
2
P{X t} =
e(u) /2 du .
2
Proposition 3.1. Suppose > 0 and R.
(a) If X N(0, 1), then

Y = X + N(, 2 )

(b) If Y N(, 2 ), then


X=

Y
N(0, 1)

Proof. We only prove part (a). The second assertion is proved in the same way.


t
P{Y t} = P{X + t} = P X

Z t

u2
1
=
e 2 du
2
Letting s =

we see that

Z t
(s)2
1
P{Y t} =
e 22 ds
2
Conclusion: If Y N(, 2 ) then




b
a
P{a Y b} =

3 SPECIAL RANDOM VARIABLES

43

Question: How does 1 (t) 0 as t ?


If t , then 1 (t)

2 /2

t
1 e
t
2

. To show this, apply LHopitals rule and get

lim

1 (t)

1
=1
t 1 + 1/t2

= lim

2 /2

t 1 et
t
2

What is the n-dimensional standard normal? Suppose X1 , . . . , Xn are all independent and N(0, 1).
Then if X = (X1 , . . . , Xn ) and f : Rn [0, ) is the density of X, we see that
f (s1 , s2 , . . . , sn ) = f1 (s1 ) f2 (s2 ) fn (sn ) =

1
2
e|s| /2
n/2
(2)

where s = (s1 , . . . , sn ).
(e) Gamma Distributed Random Variables:
Recall the Gamma function,

us1 eu du, 0 < s <

(s) =
0

Properties:
(1) is continuous on (0, ).
(2) (s + 1) = s (s).
(3) (n) = (n 1)!

(4) (1/2) =
Proof. To show (4),
Z

u1/2 eu du

(1/2) =
0
2

Take u = v /2 then du = v dv, then


Z
(1/2) = 2

ev

2 /2

dv =

Let , > 0 and define


f, (u) =
Then

R
0

(
0
1
u1 eu/
()

:u0
:u>0

f, (u)du = 1. We say a random variable X is , distributed if


Z
P{X t} =

f, (u)du ,
0

Special Cases:

0 t < .

3 SPECIAL RANDOM VARIABLES

44

(i) If = 1 and = 1/ for some > 0, then


Z
P{X t} =

eu du

We say that X is Exp distributed if for t > 0,


P{X t} = 1 et
So Exp = 1/,1 . Take
P{a X b} = ea eb
for a 0. There is a special property: Take P{X t + s|X s}. This relates to a
probability that a particle that is s years old, lives at least t more years. This is just
P{X t + s|X s} =

P{X t + s}
e(t+s)
= s = et = P{X t}
P{X s}
e

This means that the particle is not aging.


(ii) The Erlang Distribution with parameter n is just 1/,n for > 0 and n > 1, n N. This
has density
n
1
un1 eu , u > 0 .
un1 eu =
f (u) = 1 n
(n

1)!
(u)

To see where this applies, take n bulbs of the same lifetime. Switch on the first bulb. It
burns out, then replace it by a second, and so on. Let Xn be the time the n-th bulb burns
out, then Erlang distributed.
(iii) 2n = 2,n/2 is the Chi-Square distribution with n degrees of freedom. So the density is
2n/2 n/21 u/2
u
e
,
n2

u>0

Its applied by taking X1 , . . . , Xn independent standard normal. Now take


X = X12 + + Xn2
then X 2n .
(f) Beta Distribution:
This is
Z
B(x, y) =
0

This is called Eulers Beta Function.


Properties:
(1) B is continuous on (0, )2 .
(2) B(x, y) = B(y, x).
(3) B(x, y) =
(4) B(n, m) =

(5) B 21 , 12 =

(x)(y)
(x+y)
(n1)!(m1)!
(n+m1)!


=
(1)

ux1 (1 u)y1 du,

x, y > 0

3 SPECIAL RANDOM VARIABLES

45

Let , > 0 and 0 a b 1, then X is Beta distributed means


Z b
1
P{a X b} =
u1 (1 u)1 du
B(, ) a
Where does this appear? Take X1 , . . . , Xn independent and uniform on [0, 1]. Order them
pointwise according to size
X1 X2 Xn
Then each Xk is Bk,n+k1 distributed.
(g) Cauchy Distribution:
If
1
P{X t} =

Then X is Cauchy.

1
i
1h
arctan(t) +
du =
1 + u2

4 TRANSFORMATIONS OF RANDOM VARIABLES

46

Transformations of Random Variables

4.1

Functions of Random Variables

We have X : R and : R R. We want to be measurable, i.e. that


{s R : (s) t} B(R) ,

t R.

Say Y = (X) so that Y () = (X()).


Task: Describe the distribution law of Y by that of X.
Example 4.1.
X N(0, 1). Describe the distribution of X 2 . Here (s) = s2 .
X N(0, 1) and Y = eX . Here (s) = es .
X is uniform on [0, 1] and Y = 1/X. Here (s) = s1 , s > 0.
Easy Case: (Discrete) If X : D = {x1 , x2 , . . . }. Then (X) : E where
E = (D) = {y1 , y2 , . . . }
for some yj R. Let Bk = {d D : (d) = yk }. Then
P{Y = yk } = P{X Bk } =

f (d)

{dD:(d)=yk }

with probability mass function f of X.


Example 4.2. Say P(X = 3) = = P(X = 3) =
Y = (X) then E = {0, 1, 4, 9}, hence
P{Y = 0} = P{X 2 = 0} =

1
7

1
,
7

where D = {3, . . . , 3}. If (s) = s2 with

P{X 2 = 1} =

2
, ...
7

Example 4.3. Simple Random Walk. Say there is a drunken sailor that takes steps of exactly length
1 to the right with probability p and to the left with probability 1 p. Let Xj be the value of the
j-th step, either +1 or -1. Then Sn = X1 + + Xn is the place where the sailor ends up after taking
and
n random steps. We have Sn {n, n + 2, . . . , n 2, n}. Let (s) = s+n
2
Zn = (Sn ) =

Sn + n
2

Then Zn attains the values {0, . . . , n} and we can write


Zn =
Set Yj =

Xj +1
.
2

X1 + 1 X2 + 1
Xn + 1
+
+ +
.
2
2
2

Then P{Yj = 0} = 1 p and P{Y = 1} = p, so that Zn Bn,p . Then


 
n k
P{Zn = k} = P{Sn = 2k n} =
p (1 p)k , k = 0, . . . , n .
k

4 TRANSFORMATIONS OF RANDOM VARIABLES

47

Choose m {n, n + 2, . . . , n 2, n} and set k = m+n


. Then we get
2


n+m
nm
n
P{Sn = m} = n+m p 2 (1 p) 2 .
2

To go even further, assume that p = 1/2 and that n is even and m = 0. Then
 
n 1
n! 1
P{Sn = 0} = n n = n 2 n .
2
[( 2 )!] 2
2
Stirlings formula gives n!


n n
e

2n so that

2n
2
P{Sn = 0}
  n 2 = .
n
n 2
2n 2e
n

n n
e

Hard Case: (Continuous). Then P{X t} =


strictly increasing, we can say

Rt

f (u)du and Y = (X). If is differentiable and

FY (t) = P{Y t} = P{(X) t} = P{X 1 (t)} = FX (1 (t)) .


Let g be the density of Y , then
g(t) = FY0 (t) =

f (1 (t))
0 (1 (t))

For strictly decreasing it follows


FY (t) = P{Y t} = P{(X) t} = P{X 1 (t)} = 1 FX (1 (t)) ,
leading to
g(t) = FY0 (t) =

f (1 (t))
0 (1 (t))

Example 4.4.
1. X N(0, 1) and y = eX , (s) = es . Then 1 (t) = ln(t). The density is
1

1 e 2 (ln(t))
g(t) =
, t > 0,
t
2

and g(t) = 0 , t 0 .

2. Let U be uniform on [0,1] and Y = 1/U with (s) = 1s . Of course f (t) = 1[0,1] (t). Then
 (1
1[0,1] 1t
:t1
2
g(t) =
= t
2
t
0 : else
Note that here is decreasing, hence we have to use (15) in this case.
What do we do if is not monotone?
Idea: Try to find FY (t) = P{(X) t}. Then g(t) = FY0 (t).

(15)

4 TRANSFORMATIONS OF RANDOM VARIABLES

48

Example 4.5. X N(0, 1), Y = X 2 . We know P{Y t} = 0 if t 0. Take t > 0, then


r Z t
Z t

2
1
2
2
u2 /2
FY (t) = P{X t} = P{ t X t} =
eu /2 du .
e
du =

0
2 t
What if FY0 (t)? This is
r
FY0 (t) =

1
1
2 t2 /2 1
1
= t1/2 et/2 = 1/2
e
t 2 1 et/2 .

2 (1/2)
2 t
2

Since the density of the Gamma , distribution is

1
u1 eu/
()

we have the following proposition.

Proposition 4.1. If X N(0, 1) then X 2 2, 1 .


2

Proof. See (16)


Affine Linear Transformations:
a 6= 0 and b R. Let Y = aX + b and suppose that X is continuous with density f . For
1. a > 0

Density g(t) = a1 f

tb
FY (t) = P{Y t} = P X
a

tb


= FX

tb
a

2. a < 0


FY (t) = P{aX t b} = P

Then the density is g(t) = a1 f

tb
a

tb
X
a


= 1 FX

tb
a

Summing up we obtain the following result:


Proposition 4.2. The density g of aX + b with a 6= 0 is


tb
1
f
g(t) =
|a|
a
Example 4.6.
1. X N(0, 1) and Y = aX + b with a 6= 0. Then


(tb)2
1
tb
1

g(t) =
f
= e 2|a|2
|a|
a
|a| 2
So that Y N(b, |a|2 ).
2. Suppose X Exp . So that f (t) = et for t > 0. Let Y = aX for a > 0. Then
 
1
1
t
g(t) = f
= et/a
a
a
a
so that Y Exp/a .

(16)

4 TRANSFORMATIONS OF RANDOM VARIABLES

4.2

49

Random Numbers and Coin Tossing

Say x [0, 1) then we can write x in binary form, i.e. x = 0.x1 x2 . . . where the xj are either 0 or 1.
Equivalently, we have

X
xk
x=
.
k
2
k=1
To make the representation of x unique, we will not consider representations ending in all 1s.
Let U : [0, 1] be uniform distributed. P{U t} = t for 0 t 1. Let Xj : {0, 1} so that
U () = 0.X1 ()X2 () . . . = U () =

X
Xj ()
j=1

2j

This leads us to the theorem


Theorem 4.1.
(a) P{Xj = 0} = P{Xj = 1} = 21 .
(b) n 1 we have X1 , . . . , Xn are independent.
Proof. Draw the unit interval and split it into intervals on either side of 12 , I0 = {x : x1 = 0} and
I1 = {x : x1 = 1}. Then split up these intervals as I00 , I01 , I10 , I11 accordingly and continue n times.
When do we have Xn = 0? This occurs exactly when
U I1 2 ...n1 0
for ai arbitrary in {0, 1}. Then
P {Xn = 0} =

P{U I1 2 ...n1 0 }

1 ,...,n1

Since U is uniformly distributed, this is just


P{Xn = 0} =

X
1 ,...,n1

1
2n1
1
=
= .
n
n
2
2
2

For (b) choose 1 , . . . , n {0, 1} arbitrarily. Then


P{X1 = 1 , . . . , Xn = n } = P{U I1 n } =

1
= P{X1 = 1 } P{Xn = n } .
2n

Since n and the choice of 1 , . . . , n are arbitrary, this shows that X1 , . . . , Xn are independent.
Remark. The preceding theorem tells us that choosing a number u = 0.x1 x2 x3 . . . uniformly in [0, 1]
leads to an infinite sequence (x1 , x2 , . . .) of coin tossing. Every time 0 and 1 appear with probability
1/2 and the results are independent of each other.
The next theorem shows that we also may go the other way round. We toss a coin infinitely often and
obtain a number uniformly distributed on [0, 1]. So let us assume X1 , X2 , . . . are random variables
with the two properties
(a) P{Xj = 0} = P{Xj = 1} = 21 .

4 TRANSFORMATIONS OF RANDOM VARIABLES

50

(b) n 1 we have X1 , . . . , Xn are independent.


Set U =

P
j=1

Xj
,
2j

then we have the following theorem,

Theorem 4.2. U is uniform distributed on [0, 1].


Proof. Suppose s = 0.s1 s2 . . . and t = 0.t1 t2 . . . . When do we have that s < t? There exists n 1
such that tj = sj for 1 j < n but sn = 0 and tn = 1. Now fix t (0, 1). Then
{s [0, 1) : s < t} =

An (t)

n=1

where An (t) = {s : sj = tj , j < n, sn = 0, tn = 1}. When is An (t) 6= ? This happens if and only
if tn = 1. Note also that An (t) is disjoint from Am (t) for n 6= m. Then we want to investigate
X
P{U < t} =
P{U An (t)} .
(17)
tn =1

We see that U An (t) if and only if X1 = t1 , . . . , Xn1 = tn1 , Xn = 0. Then


P{U An (t)} = P{X1 = t1 , . . . , Xn1 = tn1 , Xn = 0}
= P{X1 = t1 } P{Xn1 = tn1 } P{Xn = 0}
1
= n.
2
Then the sum in (17) is
X

P{U An (t)} =

tn =1

X 1
X
tn
=
= t.
n
n
2
2
t =1
n=1
n

Thus P{U < t} = t. The last thing we need to present is that


1
= 0.
n 2n

P{U = t} = lim P{X1 = t1 , . . . , Xn = tn } = lim


n

Thus
P{U t} = P{U < t} + P{U = t} = t ,
so that U is uniform distributed.
Remark. In view of the previous result we may generate a random number in [0, 1] by using a coin:
Take a fair coin with 0 and 1 and flip it N times. We get u = 0.x1 x2 . . . xN where xj represents
the j-th coin toss. Then, if N is large enough, u is almost uniformly chosen from [0, 1]. What does
this mean? It tells us that the probability to get an u which is in [a, b] [0, 1] is approximately b a.
Say we want not only one number uniform distributed on [0, 1] but n independent numbers u1 , . . . , un .
How one can do this? The answer is very easy. Toss not only one coin but n independently. The
results are x11 , x12 , . . . to xn1 , xn2 , . . .. Constructing with these sequences u1 , . . . , un , these numbers are
uniformly distributed and independent.
Problem: Until now we are only able to construct uniform distributed numbers on [0, 1] by tossing a
coin. Is it also possible to find such numbers which possess other distributions?
Let us explain by an example why such a question may be of interest. Say you want to build
a vending machine. If the machine dispenses to quickly, the operating cost is expensive. If the

4 TRANSFORMATIONS OF RANDOM VARIABLES

51

machine dispenses too slowly not enough money is made. Say the random variable Xj dictated the
time between the j-th and the (j + 1)-st customer is Exp and independent. So we have arrival times
0 < t1 < t2 < with tj+1 tj independent and Exp distributed. To simulate whether or not the
machine is adequate we have to test it with randomly chosen t1 , t2 , . . . having the desired properties.
Discrete case:
Construct a discrete random variable X so that for given xk R and pk 0 with
have
P{X = xk } = pk , k = 1, 2, . . .
The answer is that
[0, 1] =

|Ik | = pk ,

Ik ,

X=

k=1

xk 1Ik (U )

k=1

pk = 1 we

(18)

k=1

where U is uniformly distributed on [0, 1].


Proposition 4.3. The random variable X defined by (18) satisfies
P{X = xk } = pk ,

k = 1, 2, . . .

Proof.
P{X = xk } = P{U Ik } = |Ik | = pk .

Example 4.7. We want to construct a random variable X which is Bn,p distributed, i.e. that
 
n k
P{X = k} =
p (1 p)nk , k = 0, . . . , n.
k

To this end split [0, 1] into intervals Ik , k = 0, . . . , n, where each length |Ik | = nk pk (1 p)nk . Then
let
n
X
k 1Ik (U ) .
X=
k=0

Continuous case:
R
Given a function f with f 0 and f (s)ds = 1, we want to construct a random variable X so
that f is its density, i.e. that for all t R
Z t
P{X t} =
f (s)ds .

Define F : R [0, 1] by
Z

F (t) :=

f (s)ds .

(19)

Thus F is the distribution function of the random variable we want to construct.


Next define the pseudo-inverse F : [0, 1) [, ) of F by
F (s) = inf{t R : F (t) = s} ,

0 s < 1.

4 TRANSFORMATIONS OF RANDOM VARIABLES

52

Remark. Of course, F (s) is a well-defined real number if 0 < s < 1 and F (0) = . If F is
one-to-one on a finite or infinite interval (a, b), then F (s) = F 1 (s) if F (a) < s < F (b). This
happens, for example, if f (t) > 0, a < t < b.
We shall need the following properties of F :
Lemma 4.1.
1. For s (0, 1) and t R holds
F (F (s)) = s

and

F (F (t)) t .

2. Given a t (0, 1) we have


F (s) t s F (t) .

(20)

Proof. The equality F (F (s)) = s is a direct consequence of the continuity of F . Indeed, if there
are tn & F (s) with F (tn ) = s, then
s = lim F (tn ) = F (F (s)) .
n

The second part of the first assertion follows directly from the definition of F .
Now let us come to the proof of (20). If F (s) t, then the monotonicity of F as well as F (F (s)) = s
lead to s = F (F (s)) F (t).
Conversely, if s F (t), then F (s) F (F (t)) t by the first part, thus (20) is proved.
Now choose a uniform distributed U and set X = F (U ). Since P{U = 0} = 0, we may assume that
X attains values in R.
R
Proposition 4.4. Let f satisfy f 0 and f (s)ds = 1. Define F by (19) and let F be its
pseudo-inverse. Take U uniform on [0, 1] and set X = F (U ). Then f is a density of the random
variable X.
Proof. Using (20) it follows
FX (t) = P{X t} = P{ : F (U ()) t} = P{ : U () F (t)} = F (t)
which completes the proof.
(
0
Example 4.8. f (s) =
es
Then

:s<0
.
:s0
Z t
F (t) =
es ds = 1 et , t > 0 .
0

The inverse of F is given by


F 1 (s) =

ln(1 s)
,

Hence

0 < s < 1.

ln(1 U )

is exponential distributed with parameter . Note that 1 U is also uniform distributed on [0, 1],
hence
= ln U
X

is exponential as well.
X=

4 TRANSFORMATIONS OF RANDOM VARIABLES

53

Let us give another important example.


Example 4.9. Say we want to simulate an N(0, 1) distributed random variable X. Recall the
definition of the Gaussian error function
Z t
1
2
(t) =
ex /2 dx .
2
If U ist uniform on [0, 1], then X = 1 (U ) is N(0, 1).
How do we get an N(, 2 ) distributed random variable Y ? The answer is set Y = 1 (U ) + .
Example 4.10. Suppose we have

f (t) =
Then it follows

3 t2 : 0 t 1
0 : otherwise

(21)

t<0
0 :
3
t : 0t1
F (t) =

1 :
t>1

and X = U 1/3 , where is U uniform on [0, 1], is a random variable having density (21).

4.3

Addition of Random Variables

Problem: Given X and Y random variables, how is X + Y distributed? Do we even know that X + Y
is a random variable?
Proposition 4.5. If X and Y are random variables, then X + Y is a random variable.
Proof. Since Q is dense in R for any a < b, we can choose q Q such that a < q < b. Then
[
[{X < q} {q < t Y }]
{X + Y < t} = {X < t Y } =
qQ

Since X and Y are random variables, {X < q} A and {q < t Y } A . Thus their intersection is
in A as a -field is closed under finite intersections. Since A is also closed under countable unions,
the whole set is in A , thus X + Y is a random variable.
Theorem 4.3. Let X and Y be independent with values in Z with mass functions
f (k) = P{X = k},

g(k) = P{Y = k},

Then
h(k) = (f g)(k) =

f (k j)g(j) =

j=

h(k) = P{X + Y = k}

X
j=

Proof. We want to determine P{X + Y = k}. Define


Bk = {(i, j) Z2 : i + j = k} .

f (j)g(k j)

4 TRANSFORMATIONS OF RANDOM VARIABLES

54

Then
P{X + Y = k} = P{(X, Y ) Bk } =
=

P{X = i, Y = j}

(i,j)Bk

P{X = i} P{Y = j} =

P{X = k j}P{Y = j}

j=

(i,j)Bk

f (k j)g(j) = (f g)(k)

j=

Example 4.11. Say P{X = k} = P{Y = k} = 21k , k = 1, 2, 3, . . . . Say X, Y are independent and
Z = X Y . Let f , g, and h be the densities of X, Y and Z respectively. Then
(
(
1
1
:
k

1
: k 1
k
2
2k
,
g(k) =
f (k) =
0 : otherwise
0
: otherwise
and for k 0,
h(k) =

f (k j)g(j) =

j=

1
X

f (k j)g(j) =

j=

If k < 0 we get in the same way h(k) =

f (k + j)g(j) =

j=1
2k
.
3

2(k+j) 2j =

j=1

2k
.
3

Putting both cases together gives

P{X Y = k} =

2|k|
,
3

k Z.

Theorem 4.4. Suppose X, Y : N0 where N0 = {0, 1, 2, . . . } and X, Y are independent. Then


P{X + Y = k} =

k
X

P{X = j}P{Y = k j}

j=0

for k N0 .
Proof. Same as above theorem, noting that f (k) = g(k) = 0 if k
/ N0 .
Example 4.12. X Bn,p , Y Bm,p independent. Then
k
X

k  
X
n


m
P{X + Y = k} =
p (1 p)
pkj (1 p)m(kj)
P{X = j}P{Y = k j} =
j
kj
j=0
j=0

k  
X
n
m
k
n+mk
= p (1 p)
j
kj
j=0


m+n
= pk (1 p)n+mk
k
j

nj

Thus X + Y Bn+m,p .
Corollary 4.1. X1 , Xn B1,p . Then if Sn = X1 + + Xn we have Sn Bn,p .

4 TRANSFORMATIONS OF RANDOM VARIABLES

55

Example 4.13. Let X Pois and Y Pois independent. Then X + Y Pois+ .


Proof.
P{X + Y = k} =

k  j
X

j!

j=0

kj
e
(k j)!

k
e(+) X
k!
e(+)
=
j kj =
( + )k
k! j=0 j!(k j)!
k!

Explanation: Let X Pois and Y Pois be the number of accidents in city A and B respectively.
Then if Z is the number of accidents in both, then Z Pois+ .

Example 4.14. X Bn,p


and Y Bm,p
independent. Then X + Y Bn+m,p
. Recall




n
m m
f (k j) =
pn (1 p)kj ,
g(j) =
p (1 p)j
kj
j

Then

k
X

f (k j)g(j) = p

n+m

j=0





k 
X
n
m
n+m
k n m
(1 p)
=p
(1 p)
k

j
j
k
j=0
k

Now we ask a different question. Say we have continuous random variables X and Y with densities
f and g. Does X + Y possess a density, and if so what is it?
Definition 4.1. If f, g : R R, then we define the convolution of f and g as
Z
Z
f (y)g(x y) dy
f (x y)g(y) dy =
(f g)(x) =

Proposition 4.6. Assume


surely.

|f (x)| dx < and

|g(x)| dx < . Then (f g)(x) exists almost

Proof. Using Fubinis Theorem we get


Z Z
Z
|f (x y)| |g(y)| dy dx
|(f g)(x)|dx

Z
Z
=
|g(y)|
|f (x y)| dx dy < .

Hence |(f g)(x)| < for almost all x R.


Theorem 4.5. Let X and Y be independent with densities f and g. Then X + Y has density f g.
Proof. If Bt = {(u, y) : u + y t} then if f(X,Y ) is the density of (X, Y ), we have
ZZ
P{X + Y t} = P{(X, Y ) Bt } =
f(X,Y ) (u, y) du dy
Bt

Since X and Y are independent, f(X,Y ) (u, y) = f (u)g(y). Then the above integral is equal to

ZZ
Z Z ty
f (u)g(y) du dy =
f (u) du g(y) dy

(u,y)Bt

If we let u = x y, then du = dx so that the above integral is equal to


Z Z t
Z t Z
Z
f (x y) dxg(y) dy =
f (x y)g(y) dy dx =

(f g)(x) dx

4 TRANSFORMATIONS OF RANDOM VARIABLES

56

Example 4.15. X and Y uniform on [0, 1]. Then f (x) = 1[0,1] (x) and g(y) = 1[0,1] (y). Then
Z
Z 1
(f g)(x) =
f (x y)g(y) dy =
1[0,1] (x y) dy

We then have two cases: if 0 x 1,


Z 1

1[0,1] (x y) dy =

dy = x
0

if 1 x 2 then
Z

1[0,1] (x y) dy =

dy = 2 x
x1

Hence X + Y has density h given by

x
: 0x1
2x : 1x2
h(x) =

0
: otherwise
Example 4.16. X ,1 and Y ,2 with X and Y independent. Then
1

f (y) =

1 (1 )
1

g(x y) =
Then

2 (2 )

y 1 1 ey/ ,

(x y)2 1 e(xy)/ ,

ex/
(f g)(x) = 1 +2

(1 )(2 )

so that
() = x

1 +2 2

Let s =

y
x

y0

Z
|0

y 1 1 (x y)2 1 dy
{z
}
()

 y 1 1 
y 2 1
1
dy
x
x

and dy = x ds so that
() = x

1 +2 1

Z
|0

s1 (1 s)2 1 ds
{z
}

=B(1 ,2 )=

So that

yx

(1 )(2 )
(1 +2 )

x1 +2 1
(f g)(x) = 1 +2
ex/ ,

(1 + 2 )

x>0

So we obtained the following proposition.


Proposition 4.7. If X ,1 and Y ,2 independent then X + Y ,1 +2 .
Corollary 4.2. If X 2n and Y 2m independent then X + Y 2n+m .
Proof. We have X 2n = 2,n/2 and Y 2m = 2,m/2 , hence by the preceding proposition
X + Y 2,(n+m)/2 = 2n+m .

4 TRANSFORMATIONS OF RANDOM VARIABLES

57

Corollary 4.3. X1 , . . . , Xn independent and N(0, 1) distributed. Then


X12 + + X12 2n
Proof. Recall that X12 , . . . , Xn2 are independent 2,1/2 distributed. Hence an iterative application of
the preceding proposition implies
X12 + + Xn2 2,n/2 = 2n .

Proposition 4.8. If X and Y are independent Erlang distributed with parameters n or m and > 0,
then X + Y is Erlang distributed with parameters n + m and .
Proof. X Exp,n = 1 ,n and Y Exp,m = 1 ,m implies X + Y 1 ,n+m = Exp,n+m .

Corollary 4.4. If X1 , . . . , Xn are exponential with parameter > 0, then X1 + + Xn Exp,n


Proof. Note that we have Xj Exp,1 . Thus apply the previous proposition n times.
Application: Say we have light bulbs. The lifetime of each is exponential distributed with parameter
> 0. Let Xj be the time between changing the (j 1)st and j th bulbs, with 0 the start time. Then
Sn = X1 + + Xn is the time we have to change the n-th bulb. Since X1 , . . . , Xn are independent
and Exp distributed. Then Sn Exp,n and
Z
n
un1 eu du
P{Sn t} =
(n 1)! t
Lemma 4.2.
n
(n 1)!

un1 eu du =

n1
X
(t)j

j!

j=0

et

Proposition 4.9. Let X1 , X2 , . . . be independent Exp distributed random variables and set Sn =
X1 + + Xn . Given T > 0 it follows
(T )n T
P{Sn T < Sn+1 } =
e
,
n!

n = 1, 2, . . .

Proof. It holds
{Sn T < Sn+1 } = {Sn+1 > T } \ {Sn > T } .
Hence in view of the preceding lemma we get
P{Sn T < Sn+1 } = P{Sn+1 > T } P{Sn > T } =

n
X
(t)j
j=0

j!

et

n1
X
(t)j
j=0

j!

et =

(T )n T
e
.
n!

This completes the proof.


Remark. Suppose the lifetime of light bulbs is exponential distributed. Switch on a first bulb at time
t = 0. If this bulb burns out replace it by a second and so on. Then Sn is the time the n-th bulb
burns out. The preceding proposition tells us that the number of necessary changes until time T > 0
is PoisT distributed. In formulas,
P{n changes until time T > 0} =

(T )n T
e
.
n!

4 TRANSFORMATIONS OF RANDOM VARIABLES

58

Let us state a very important result.


Theorem 4.6. If X N(1 , 12 ) and Y N(2 , 22 ) independent, then X + Y N(1 + 2 , 12 + 22 ).
Proof. Let f, be the density of a N(, 2 ) distributed random variable then we have to show
f1 ,1 f2 ,2 = f1 +2 ,2 +2
1

or, equivalently,




Z
(x 1 2 )2
1
1
(x y 1 )2 (y 2 )2
exp
.

dy =
exp
21 2
212
222
2(12 + 22 )
2(12 + 22 )1/2

5 EXPECTATION, VARIANCE, AND COVARIANCE

5
5.1

59

Expectation, Variance, and Covariance


Expected Value of a Random Variable

Discrete Case:
Let X : R. What is the mean value, or expected value of X? Say P{Xj = xj } = pj for
j = 1, . . . , n. Then the barycentre is
p 1 x1 + p n xn =

n
X

xj P{Xj = xj } = EX .

j=1

Say X : D = {x1 , x2 , . . . }. What is EX now?


Definition 5.1. Suppose xj 0 and X : [0, ). Then
EX =

xj P{X = xj } [0, ]

j=1

A first example shows that EX = may occur even in quite natural problems.
Example 5.1. Play a series of fair games. Whenever you put M dollars into the pool you get back
2M dollars if you win. If you lose, then the M dollars are lost.
Apply the following strategy. After losing a game double the amount in the pool. Say you start with
$1 and lose, then next time put $2 into the pool, then $4 and so on, until you win for the first time.
As easily seen, in the n-th game the stake is $2n1 .
Suppose for some n 1 you lost n 1 games and won the n-th one. How much money did you lose?
If n = 1, then you lost nothing, while for n 2 you spent
1 + 2 + 4 + + 2n2 = 2n1 1
dollars. Note that 2n1 1 = 0 if n = 1, hence for all n 1 the total lost is 2n1 1 dollars.
On the other hand, if you win the n-th game, you gain 2n1 dollars. Consequently, no matter what
the results are, you will always win 2n1 (2n1 1) = 1 dollars.
Where is the catch? Let X be the amount of money needed in the case that one wins for the first
time in the n-th game. One needs $1 to play the first game, $1 +$2 =$3 to play the second, until
1 + 2 + 4 + + 2n1 = 2n 1
to play the n-th game. Thus X has values in {1, 3, 7, 15 . . .} and
P{X = 2n 1} = P{First win in game n} =

1
,
2n

n = 1, 2 . . . .

Consequently, it follows
EX =

X
n=1

(2n 1) P{X = 2n 1} =

X
2n 1
n=1

2n

= .

This result tells us that the average amount of money needed to use this strategy is arbitrarily large.
Of course, the owners of gambling casinos know this strategy as well. Therefore they limit the
possible amount of money in the pool. For example, if the largest possible stake is $N, then the
strategy breaks down as soon as you lose n games with 2n > N .

5 EXPECTATION, VARIANCE, AND COVARIANCE

60

Definition 5.2. Let X : {x1 , x2 , . . . }. We say that X has expected value, or that its expectation
exists, if

X
E|X| =
|xj |P{Xj = xj } < .
(22)
n=1

Then we set
EX =

xj P{Xj = xj } .

(23)

n=1

Remark. Recall that

k=1

|k | < for a sequence real numbers (k )k1 implies the existence of

k = lim

n
X

k=1

k .

k=1

Hence, under condition (22) the expected value EX is well-defined by (23).


Example 5.2. P{X = k} =

1
,
N

k = 1, . . . , N and

N
X

N
1 X
1 N (N + 1)
N +1
kP{X = k} =
EX =
k=
=
N k=1
N
2
2
k=1

Example 5.3. P{X = k} =

n
k

 k
p (1 p)nk , k = 0, . . . , n. Then

 
n
n
X
X
(n 1)!
n k
k
p (1 p)nk = np
EX =
pk1 (1 p)nk
k
(k

1)!(n

k)!
k=1
k=1
= np

n1
X
k=0

(n 1)!
pk (1 p)n1k
k!(n 1 k)!

= np(p + (1 p))n1 = np
The explanation of this is that if you have n trials with success probability p, then on average one
has success np times.
Continuous Case:
Think of a Riemann sum of discrete random variables, approximating a continuous r.v. X. Then
you can say that
Z
EX =
uf (u) du .

If X 0 almost surely then this integral is well-defined in [0, ]. For the general case:
Definition 5.3. X has an expected value if
Z
E|X| =

|u|f (u) du < .

(24)

Its expected value is then defined by


Z

EX =

uf (u) du .

(25)

5 EXPECTATION, VARIANCE, AND COVARIANCE

61

Remark. Suppose a function g : R R satisfies


Z
|g(x)| dx < .

Then this implies the existence of


Z

g(x) dx .

g(x) dx = a
lim

b+

Hence, if condition (24) is satisfied, then the integral in (25) is well-defined and EX exists in R.
Example 5.4.

1. Let X be uniform on [, ] then f (u) =


Z

EX =

uf (u) du =

1
2 2
1
+
du =

2
1 1
.
1u2

2. Let X be Cauchy distributed, so that f (u) =


1
E|X| =

|u|
2
du =
2
1+u

1[,] (u). Then

Z
0

Then


u
1
2
du = ln(1 + u ) =
1 + u2

This means that X has no expectation if it is Cauchy distributed.


3. X N(0, 1). Then

so that

1
E|X| =
2

1
EX =
2

2 /2

|u|eu

du < ,

2 /2

ueu

du = 0 .

4. If X , , then P{X 0} = 1, and thus


1
EX =
()

u u1 eu/ du

u/ = s and du = ds, then this is equal to


Z
+1
s es ds =
() 0
as the integral on the right is equal to ( + 1) = (). This gives us that X Exp 1 ,1

implies EX = 1 . If the lifetime of a bulb is exponential distributed, the average lifetime is


> 0. Then = 1 is the inverse of the average lifetime. Now say the average lifetime is 80
years, then
Z

eu du = e90/80 = e9/8

P{X 90} =
90

5. Let X be geometric distributed. So that


EX =

X
k=0

kp(1 p)k1 = p

X
k=1

k(1 p)k1

5 EXPECTATION, VARIANCE, AND COVARIANCE


We know that the Taylor series of h(x) =
h(x) =

1
1x

is

1
,
(1x)2

k=0

xk for |x| < 1, hence

x = h (x) =

k=0

so that h0 (x) =

62

kxk1 ,

k=1

thus
EX =

p
1
= .
2
(1 (1 p))
p

Remark. The expected value of a random variable is motivated by the strong law of large numbers
which asserts that the average of the values of n experiments converges to the expected value as the
number of trials goes to infinity.
Theorem 5.1 (Strong Law of Large Numbers). If X1 , X2 , . . . i.i.d. and EX1 exists, then


X1 + + Xn
P lim
= EX1 = 1 .
n
n
Remark. The expected value should not be mixed up with the median. Note that m is a median of
X if P{X m} 1/2 and P{X m} 1/2. For example, if X is Exp -distributed, then EX = 1
while the median is m = ln2 .

5.2

Expectation of Special Random Variables

(a) P{X = xj } =

1
,
N

j = 1, . . . , N , X is uniform. Then EX =

1
N

N
P

xj .

j=1

(b) If X Bn,p then EX = np.


(c) X Pois i.e. P{X = k} =

k
e ,
k!

k = 0, 1, . . . . Then EX = .

(d) If P{X = k} = p(1 p)k1 , k = 1, 2, . . . then EX = p1 . For example, when rolling a die in the
average 6 appears for the first time in the 6th trial.
(e) X U [, ] then EX =

+
2

(f) X , then EX = . Since Exp = 1/,1 then if Y Exp , EY = 1 .


(g) X B(, ) then EX =
them by size.

B(+1,)
B(,)

.
+

If we choose n numbers uniformly from [0,1] and order

X1 X2 Xn
Then Xk B(k, n k + 1) so EXk =

k
.
n+1

(h) X N(, 2 ). Then EX = .

5.3

Properties of the Expectation


d

Theorem 5.2. (a) X = Y = EX = EY


(b) P{X = c} = 1 = EX = c for c constant.
(c) E(aX + bY ) = aEX + bEY for a and b constants.

5 EXPECTATION, VARIANCE, AND COVARIANCE

63

(d) X Y , a.s. X() Y (), , then EX EY .


(e) : R R, then
Discrete Case: E(X) =

(xj )P{X = xj }

j=1
Z

(s)f (s) ds

Continuous Case: E(X) =

(f ) If X and Y are independent, then


E(XY ) = (EX)(EY )
Proof. (f)
Take X and Y as simple functions. Let X takes values {x1 , . . . , xn } and Y in y1 , . . . , ym and let
Aj = {X = xj } = X =
Bj = {Y = yj } = Y =

n
X

xj 1Aj

j=1
m
X

yi 1Bi

i=1

Then by independence of the Aj and Bi ,


XY =

m X
n
X

xj yi 1Aj Bi = E(XY ) =

i=1 j=1

m X
n
X

xj yi P(Aj Bi ) =

i=1 j=1

n
X

xj yi P(Aj )P(Bi )

i=1 j=1

!
xj P(Aj )

j=1

m X
n
X

m
X

!
yi P(Bi )

= (EX)(EY )

i=1

The general case follows by approximating X and Y by simple functions.


Remark. Suppose P{X 0} = 1. Let f be the density. Then
Z
Z Z s
Z Z
Z
duf (s) ds =
f (s) ds du =
P{X u} du
EX =
sf (s) ds =
0
0
u
0
0
0
Z
=
(1 FX (u)) du
0

Applications and Examples


1. Suppose X N(0, 1). Then EX = 0 as ses

2 /2

is an odd function. Similarly,

E(X + ) = EX + =
2. X N(0, 1). Then

Z
1
2
EX =
sn es /2 ds
2
n
If n is odd, then clearly EX = 0. So look at
Z
Z
1
2
2
2n
2n s2 /2
() = EX =
s e
ds =
s2n es /2 ds
0
2
n

5 EXPECTATION, VARIANCE, AND COVARIANCE


Then let u = s2 /2 so s =
1
() =

2u then ds =

1
2u

64

du. Then

1
2n
(n + ) = 2n (1/2)(n 1/2) (n 3/2) (1/2)
(1/2)
2
= (2n 1)(2n 3)(2n 5) 3 1
= (2n 1)!!

2n un1/2 eu du =

3. Company produces cornflakes. In order to sell the cornflakes they put a picture in each box. If
there are n distinct pictures, how many boxes must we buy, on average, to collect all n pictures?
Solution: Assume we already have k pictures, k = 0, . . . , n 1. Let Xk be the number of
purchases before one gets a new picture. We see that
P{X0 = 1} = 1
..
.
P{Xk = m} = pk (1 pk )m1
. In other words, the Xk are
where the success probability of getting the k-th picture is nk
n
nk
geometric distributed with parameter pk = n where 2 k n 1. Then
EXk =

1
n
=
pk
nk

Now let S = X0 + + Xn1 . Then


ES = EX0 + EXn1
= n (

n1
X
1
n
n
+ +
=1+
=1+
pk
n1
n (n 1)
k=1

n
X
1
1
1
+
+ + 1) = n
n ln n
n n1
j
j=1

What is the average of necessary purchases before one has n/2 pictures? Then you need to
take


n
n
1
1
1
+
=n
+
+ +
EX0 + + En/21 = 1 +
n 1 n (n/2 1)
n n1
n/2 + 1
This can be written as

() = n

n
X
1
j=1

We know
lim

n
X
1
j=1

n/2
X
1

j=1

!
ln n

= 0.577 . . .

which is called the Euler-Masceroni constant. So


() n (ln n ln(n/2)) = n ln 2
4. Roll a die n times. Then Sn is the sum, so that ESn =

7n
.
2

5 EXPECTATION, VARIANCE, AND COVARIANCE

5.4

65

Higher Moments and Variance

For n 1, X has an n-th moment if


E|X|n <
Then EX n is called the n-th moment of X. Also if f has density f , then
Z
n
|s|n f (s) ds
E|X| =

Lemma 5.1. a > 0, 0 . Then


a a + 1 .
Proof. If a 1 then a 1. If a > 1 then a a .
Proposition 5.1. Let 1 n m < . If X has an m-th moment, then it has an n-th moment.
Proof. |X()|n |X()|m + 1. Then
EX()|n E|X()|m + 1 < .

Remark. A much stronger result is Holders inequality:


1

(E|X|n ) n (E|X|m ) m
Example 5.5.
If X N(0, 1) then E|X|n < for all n.
If X Exp , then

E|X| =

sn es ds <

If X , then E|X| < for all n 1.


If X has density f (x) =

1
x

1[1,) (x), then


n

EX = ( 1)
1

sn
ds < n < 1
s

Remark (Hamburger Moment Problem). Given two r.v. X and Y such that EX n = EY n for all
n N. Then do X and Y possess the same distribution?
Now we come to the variance. We know that the EX is similar to the mean value of X. Now let
= EX. Then the mean quadratic distance of X to EX
E|X |2
This measures how far the values of X are, on average, close to the mean value.
Definition 5.4. Suppose EX 2 < . Then
VX = E|X |2 = EX 2 2EX + 2
is called the Variance of X.
Proposition 5.2.

5 EXPECTATION, VARIANCE, AND COVARIANCE

66

(a) VX = EX 2 (EX)2 .
(b) V(aX + b) = a2 VX.
(c) If X is discrete, then

X
VX =
(xj )2 P{X = xj }
j=1

while for X continuous we have


Z

VX =

(s )2 f (s) ds

(d) V(X) 0 and VX = 0 X = c almost surely, c constant.


(e) If X and Y are independent, then V(X + Y ) = VX + VY
Proof.
(a) VX = E(X )2 = EX 2 2EX + 2 = EX 2 22 + 2 = EX 2 2
(b) Let = EX then E(aX + b) = aEX + b = a + b
V(aX + b) = E[(aX + b) (a + b)]2 = a2 E[X ]2 = a2 VX
(c) Observe that VX = E(X) where (s) = |s |2 . Hence the assertion follows directly from
property (e) in Theorem 5.2.
(d) Of course we always have V(X) 0 because of (X )2 0. Use (d) of Theorem 5.2.
Furthermore, if P{X = c} = 1, then Ex = c, and since X c = 0 with probability 1 we
get VX = 0. The other direction is more complicated and cannot be proved by our methods.
(e) Let = EX and = EY . Then
V(X + Y ) = E[(X ) + (Y )]2 = E(X )2 + 2E(X )(Y ) + E(Y )2
But by independence
E(X )(Y ) = E(X )E(Y ) = ( )( ) = 0

Example 5.6.
1. X is Pois . Then EX = and
2

EX =

X
k=1

k!

X
k=1

X
kk
(k + 1)k
e =
e = 2 + .
(k 1)!
k!
k=0

Hence, VX = EX 2 (EX)2 = 2 + 2 = .

5 EXPECTATION, VARIANCE, AND COVARIANCE

67

2. X is uniform on [, ] Then EX = +
and
2
Z
1
2 + + 2
1 3 3
2
EX =
=
.
s2 ds =

3
3
Hence,
VX =

2 + + 2 2 + 2 + 2
( )2

=
.
3
4
12

3. X N(0, 1), EX = 0 and

Z
s2
1
s2 e 2 ds = 1 .
EX =
2
2
If X N(, ) then EX = and VX = 2 .
2

4. P{X = 0} = 1 p and P{X = 1} = p. Then EX = p, EX 2 = p so that


VX = p p2 = p(1 p)
If Sn = X1 + + Xn , then Sn Bn,p . The variance becomes VSn = np(1 p).
5. Say P{X = k} = p(1 p)k1 for k N. Then EX = p1 . We know g(x) =

1
1x

xk so

k=0

X
X
1
x
k1
kx
=
kxk
=
=
2
(1 x)2
(1

x)
k=1
k=1
Thus

X
2x
1+x
1
2 k1
+
=
k
x
=
(1 x)2 (1 x)3
(1 x)3
k=1
Then
2

EX = p

k 2 (1 p)k1 = p

k=1

and finally
VX =

5.5

1 + (1 p)
2p
=
3
p
p2

1
1p
2p

=
p2
p2
p2

Covariance

Say we are given X and Y , dependent. We want to measure their degree of dependence.
Definition 5.5. Let E|X|2 < , E|Y |2 < . Where = EX, = EY . The covariance of X and
Y is
Cov(X, Y ) = E(X )(Y )
Properties:
(a) |ab| 21 (a2 + b2 ), so

1
|(X )(Y )| [(X )2 + (Y )2 ]
2

implies
1
E|(X )(Y )| [VX + VY ] <
2
this means that Cov(X, Y ) is well-defined.

5 EXPECTATION, VARIANCE, AND COVARIANCE

68

(b) If X and Y are independent, then


Cov(X, Y ) = E[(X )(Y )]
= E(X )E(Y ) = 0
If X and Y are independent, then Cov(X, Y ) = 0. Then X and Y are called uncorrelated
if Cov(X, Y ) = 0. Thus X, Y independent = X, Y uncorrelated. (The converse is not
necessarily true!)
Discrete Case:
Let f (xi , yj ) = P{X = xi , Y = yj }. Then
X

Cov(X, Y ) =

(xi )(yj ) f (xi , yj )

i=1 j=1

Example 5.7. Say


Y \X
0
1

1
6
1
3

1
3
1
6

Then,
1
1
1
Cov(X, Y ) = (0 1/2)(0 1/2) + (1 1/2)(0 1/2) + (0 1/2)(0 1/2)
6
3
3
1
1
+ (1 1/2)(1 1/2) =
6
12
Continuous Case:
The joint distribution is
Z

P{X t, Y s} =

f (u, v) du dv

then

(u )(v )f (u, v) du dv

Cov(X, Y ) =

: u2 + v 2 1
0 : else
What we see is that EX = 0 = EY because ufX (u) and vfY (v) are odd functions. Then
ZZ
Z
Z 1v2
1
1 1
Cov(X, Y ) =
(u 0)(v 0) du dv =
v
u du dv = 0

1 1v2
u2 +v 2 1

Example 5.8. Let f (u, v) =

We see then that X and Y are uncorrelated, but we have proven earlier that they are not independent.
Let us summarize some properties of the covariance.
(a) Cov(X, Y ) = E(XY ) E(X)E(Y ). This is seen by computing
Cov(X, Y ) = E(X )(Y ) = E(XY ) EX EY + = E(XY )
as = EX and = EY . If we look at Example (5.7) we see that E(XY ) = 61 1 and (EX)(EY ) =
1
1
. Then Cov(XY ) = 61 41 = 12
.
4

5 EXPECTATION, VARIANCE, AND COVARIANCE

69

(b) Cov(X, Y ) = Cov(Y, X) and Cov(1 X1 + 2 X2 , Y ) = 1 (X1 , Y ) + 2 Cov(X2 , Y ). In other words,


the covariance is bilinear.
(c) Cov(X, X) = VX.
Proposition 5.3 (Schwarz Inequality). Let X and Y be random variables with second moment.
Then
1/2
1/2
EY 2
|E(X Y )| EX 2
Proof. Assume EY 2 > 0 and let R. We know
0 E[|X| |Y |]2 = E|X|2 2E|XY | + 2 E|Y |2
1

Let = (EX 2 ) 2 /(EY 2 ) 2 then this shows


1

0 EX 2

(EX 2 ) 2
1

(EY 2 ) 2

E|XY | + EX 2

(EX 2 ) 2
(EY

2) 2

E|XY | EX 2
1

= |EXY | E|XY | (EX 2 ) 2 (EY 2 ) 2

Corollary 5.1. |Cov(X, Y )| (VX)1/2 (VY )1/2 .


Proof.
|E(X )(Y )| (E|X |2 )1/2 (E|Y |2 )1/2 .

Definition 5.6. The correlation coefficient between X and Y is


(X, Y ) :=

Cov(X, Y )
1

(VX) 2 (VY ) 2

Note that by Proposition 5.3 we have 1 (X, Y ) 1. Then (X, X) = 1, (X, X) = 1 and
(X, Y ) = 0 if X, Y are independent (the converse again is not always true).
We say that X and Y are strongly correlated if (X, Y ) is near 1 or -1. We say that X and Y are
positively correlated if (X, Y ) > 0. This means that large X implies that a larger Y is more likely.
Similarly, X and Y are negatively correlated if (X, Y ) < 0. This means that larger values of X
mean smaller values of Y more likely.
For example, if we choose a person by random and X is the height and Y the weight. Then X and
Y are surely positively correlated.
On the other hand, if X is the number of cigarettes a man smokes per day and Y his lifetime, then
X and Y are, as known, negatively correlated.
Example 5.9. We have an urn with 2n balls. n are labeled 0 and n are labeled 1. Take out
two balls without replacement, where X is the first and Y is the second ball. How are X and Y
correlated? By the law of multiplication,
1
EX = EY = ,
2

E(XY ) =

n1
,
4n 2

Cov(X, Y ) =

n1
1
1
=
4n 2 4
8n 4

5 EXPECTATION, VARIANCE, AND COVARIANCE

70

Then

1
1
= (X, Y ) =
4
2n 1
The result tells us the following: The larger n the smaller is the correlation between X and Y .
For very large n they are almost uncorrelated. Furthermore we see that X and Y are negatively
correlated. Why this so? If X is large, i.e., if X = 1, then it becomes more likely that Y = 0, that
is, that it becomes smaller.
VX = VY =

Example 5.10. Say we have


Y \X
-1
0
1

-1

-1

1
10
1
10
1
10

1
10
2
10
1
10

1
10
1
10
1
10

Then X and Y are dependent (check this) but


Cov(X, Y ) = EXY = 0
so they are uncorrelated.

6 LIMIT THEOREMS

71

Limit Theorems

6.1

Chebyshevs Inequality

Proposition 6.1. Let Z be a non-negative random variable and > 0. Then


P{Z }
Proof. For each R,

1
EZ

1Z () Z() = E(1Z ) EZ

But if Z is continuous,
E(1Z ) = E1Z =

f (s) ds = P{Z }

so we are done. If Z is discrete then the proof is similar.


Proposition 6.2 (Chebyshevs Inequality). Let X be a random variable with 2nd moment and let
c > 0. Then
1
P{|X EX| c} 2 VX
c
Proof. Let Z = |X EX|2 . Then
P{|X EX| c} = P{|X EX|2 c2 } = P{Z c2 }

1
1
EZ = 2 VX
2
c
c

Example 6.1.
1. Roll a die n times and let Sn be the sum of the values. Then
 
7
7
1
Sn
= n =
E
n
n
2
2
and


V

Then

Sn
n


=

1
1 35
VSn = nVX1 =
2
n
n 12




Sn 7
35


P 
n
2
n 12 2

Say  = 0.1. Then


P{34

Sn
35
35
36} 1
=1
0.7083
n
12 n 0.01
120

2. Toss a fair coin n times let Sn be the number of heads. Then ESn =
P{|Sn n/2| c}

n
2

and VSn = n4 . Thus

n
4c2

Theorem 6.1 (Weak Law of Large Numbers). Let X1 , X2 , ... be i.i.d (independent and identically
distributed) and set = EX1 = EX2 = . Let Sn = X1 + + Xn then for each  > 0 we have



Sn

lim P  = 0
n
n

6 LIMIT THEOREMS

72

Proof. We make the extra assumption EX12 < . Then






Sn

Sn
V
(nVX)/n2
1 VX1
n
P 
=
=
0, n
2
2
n


n 2

Theorem 6.2 (Strong Law of Large Numbers). X1 , X2 , . . . i.i.d and let EX1 = < , Sn =
X1 + + Xn . Then


Sn
= =1
P lim
n n
Application:
Let h : [0, 1]d R. Our goal is to evaluate
Z
h(x) dx
[0,1]d

What we do is we take U uniform on [0, 1]d . Then U = (U1 , . . . , Ud ) with density f (x) = 1[0,1]d (x).
Then
Z
Z
Eh(U ) =
h(x)f (x) dx =
h(x) dx
Rd

[0,1]d

We take U 1 , U 2 , U 3 , . . . , independent uniform distributed on [0, 1]d . Then


n

1X
h(U j ) Eh(U1 ) =
n k=1

6.2

Z
h(x) dx
[0,1]d

Central Limit Theorem

Let X1 , X2 , . . . be i.i.d. where E|X1 |2 < . Let Sn = X1 + + Xn with = EX1 and 2 = VX1 .
Then
E(Sn n) = 0,
V(Sn n) = n 2
which implies

V

Sn n


= 1 and E

Sn n


=0

Theorem 6.3 (Central Limit Theorem). X1 , X2 , . . . i.i.d. with E|X1 |2 < . Let Sn , , and 2 be
as above, then


Z b
Sn n
1
2

P a
b
et /2 dt
n
2 a
In other words if Zn =

Sn
n
n

then Zn Z where Z N(0, 1).

Example 6.2.
35
1. Roll a die n times = 72 and 2 = 12
. Then Sn is almost N(7n/2, 35n/12) distributed for n
3
large. Let n = 10 and a = 3, 400, b = 3, 600. Then
!
!
3600 3500
3400 3500
P{3400 Sn 3600} p
p
0.9352
n 35/12
n 35/12

6 LIMIT THEOREMS

73

2. Calculations in a bank. Say there is a deposit of $1.2347. Then it rounds down so that $1.23
shows in the account, so the bank gains $0.0047. Let Xj be the amount that the bank gains
or loses on transaction j. Then 0.5 Xj 0.5 if we measure X in cents. Then Xj are
1
independent and uniformly distributed on [0.5, 0.5]. Then EX = 0 and VXj = 12
. Let Sn =
X1 + + Xn be the total amount of money (in cents) gained or lost from transactions. Then
Sn is approximately N(0, n/12) distributed for n large. Let n = 106 , then a = 103 cents = $10.
Thus
!
103 12
0.0002663
P{Sn $10} = 1 P{Sn a} 1
103

S-ar putea să vă placă și