
ST213 Mathematics of Random Events

Wilfrid S. Kendall
version 1.0

28 April 1999

1. Introduction
The main purpose of the course ST213 Mathematics of Random Events (which we
will abbreviate to MoRE) is to work over again the basics of the mathematics of
uncertainty. You have already covered this in a rough-and-ready fashion in:
(a) ST111 Probability;
and possibly in
(b) ST114 Games and Decisions.
In this course we will cover these matters with more care. It is important to
do this because a proper appreciation of the fundamentals of the mathematics of
random events
(a) gives an essential basis for getting a good grip on the basic ideas of
statistics;
(b) will be of increasing importance in the future as it forms the basis of
the hugely important field of mathematical finance.
It is appropriate at this level that we cover the material emphasizing concepts
rather than proofs: by-and-large we will concentrate on what the results say and
so will on some occasions explain them rather than prove them. The third-year
courses MA305 Measure Theory, and ST318 Probability Theory go into the matter
of proofs. For further discussion of how Warwick probability courses fit together,
see our road-map to probability at Warwick at
www.warwick.ac.uk/statsdept/teaching/probmap.html

1.1 Books
[1] D. Williams (1991) Probability with Martingales CUP.

1.2 Resources (including examination information)


The course is composed of 30 lectures, valued at 12 CATS credit. It has an assessed
component (20%) as well as an examination in the summer term. The assessed
component will be conducted as follows: an exercise sheet will be handed out
approximately every fortnight, totalling 4 sheets. In the 10 minutes at the start
of the next lecture you will produce, under examination conditions, an answer to
one question specified at the start of that lecture. Model answers will be distributed
after the test, and an examples class will be held a week after the test. The tests
will be marked, and the assessed component will be based on the best 3 out of 4
of your answers.
This method helps you learn during the lecture course so should:
improve your exam marks;
increase your enjoyment of the course;
cost less time than end-of-term assessment.
Further copies of exercise sheets (after they have been handed out in lectures!)
can be obtained at the homepage for the ST213 course:
www.warwick.ac.uk/statsdept/teaching/ST213.html
These notes will also be made available at the above URL, chapter by chapter
as they are covered in lectures. Notice that they do not cover all the material
of the lectures: their purpose is to provide a basic skeleton of summary material
to supplement the notes you make during lectures. For example no proofs are
included. In particular you will not find it possible to cover the course by ignoring
lectures and depending on these notes alone!
Further related material (eg: related courses, some pretty pictures of random
processes, ...) can be obtained by following links from W.S. Kendall's homepage:
www.warwick.ac.uk/statsdept/Staff/WSK/
Finally, the Library Student Reserve Collection (SRC) will in the summer
term hold copies of previous examination papers, and we will run two revision
classes for this course at that time.

1.3 Motivating Examples


Here are some examples to help us see what the issues are.
(1) J. Bernoulli (circa 1692): Suppose that A1, A2, ... are mutually
independent events, each of which has probability p. Define

    Sn = #{ events Ak which happen for k ≤ n }.

Then the probability that Sn/n is close to p increases to 1 as n tends
to infinity:

    P[ |Sn/n − p| ≤ ε ] → 1    as n → ∞, for all ε > 0.
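Bernoulli's theorem can be illustrated numerically. A minimal Monte Carlo sketch (the values of p, ε and the trial counts here are arbitrary illustrative choices):

```python
import random

# Monte Carlo sketch of Bernoulli's theorem: estimate P[|S_n/n - p| <= eps]
# for increasing n. p, eps and the trial counts are illustrative choices.
random.seed(0)
p, eps = 0.3, 0.05

def prob_close(n, trials=2000):
    hits = 0
    for _ in range(trials):
        s = sum(random.random() < p for _ in range(n))  # S_n for one experiment
        if abs(s / n - p) <= eps:
            hits += 1
    return hits / trials

for n in [10, 100, 1000]:
    print(n, prob_close(n))   # the estimated probabilities increase towards 1
```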
(2) Suppose the random variable U is uniformly distributed over the
continuous range [0, 1]. Why is it that for all x in [0, 1] we have

    P[ U = x ] = 0

and yet

    P[ a ≤ U ≤ b ] = b − a

whenever 0 ≤ a ≤ b ≤ 1? Why can't we argue as follows?

    P[ a ≤ U ≤ b ] = P[ ∪_{x∈[a,b]} {x} ] = Σ_{x∈[a,b]} P[ U = x ] = Σ_{x∈[a,b]} 0 = 0 ?
(3) The Banach-Tarski paradox. Consider a sphere S². In a certain
qualified sense it is possible to do the following curious thing: we
can find a subset F ⊆ S² and (for any k ≥ 3) rotations ρ_{1k}, ρ_{2k}, ...,
ρ_{kk} such that

    S² = ρ_{1k}F ∪ ρ_{2k}F ∪ ... ∪ ρ_{kk}F.

What then should we suppose the surface area of F to be? Since
S² = ρ_{13}F ∪ ρ_{23}F ∪ ρ_{33}F we can argue for area(F) = 1/3. But since
S² = ρ_{14}F ∪ ρ_{24}F ∪ ρ_{34}F ∪ ρ_{44}F we can equally argue for area(F) = 1/4.
Or similarly for area(F) = 1/5. Or 1/6, or ...
(4) Reverting to Bernoulli's example (Example 1 above) we could ask,
what is the probability that, when we look at the whole sequence
S1/1, S2/2, S3/3, ..., we see the sequence tends to p? Is this different
from Bernoulli's statement?
(5) Here is a question which is apparently quite different, but which turns
out to be strongly related to the above ideas! Can we generalize the
idea of a Riemann integral in such a way as to make sense of rather
discontinuous integrands, such as the case given below?

    ∫_[0,1] f(x) dx

where

    f(x) = 1    when x is a rational number,
    f(x) = 0    when x is an irrational number.
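This f (the indicator of the rationals) shows exactly why Riemann integration struggles: a Riemann sum depends entirely on whether the chosen sample points are rational. A small sketch, tracking rationality symbolically since it cannot be tested on floating-point numbers (the tagging scheme is my own illustrative device):

```python
from fractions import Fraction

# f is the indicator of the rationals. Exact rationality cannot be tested on
# floats, so sample points are represented symbolically: a Fraction stands for
# a rational point, and an ("irr", Fraction) pair for that point shifted by a
# fixed irrational amount (hence irrational).
def f(x):
    return 1 if isinstance(x, Fraction) else 0

n = 1000
rational_tags = [Fraction(k, n) for k in range(n)]             # all rational
irrational_tags = [("irr", Fraction(k, n)) for k in range(n)]  # all irrational

riemann_rational = sum(f(x) for x in rational_tags) / n        # 1.0
riemann_irrational = sum(f(x) for x in irrational_tags) / n    # 0.0
print(riemann_rational, riemann_irrational)
```

Two equally fine partitions give Riemann sums 1 and 0, so no Riemann integral exists; the Lebesgue answer, developed later, is 0 because the rationals are countable.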

2. Probabilities, algebras, and σ-algebras


2.1 Motivation
Consider two coins A and B which are tossed in the air so as each to land with
either heads or tails upwards. We do not assume the coin-tosses are independent!

It is often the case that one feels justified in assuming the coins individually are
equally likely to come up heads or tails. Using the fact P[ A = T ] = 1 − P[ A = H ],
etc, we find

    P[ A comes up heads ] = 1/2
    P[ B comes up heads ] = 1/2
To find probabilities such as P[ HH ] = P[ A = H, B = H ] we need to say
something about the relationship between the two coin-tosses. It is often the case
that one feels justified in assuming the coin-tosses are independent, so

    P[ A = H, B = H ] = P[ A = H ] P[ B = H ].

However this assumption may be unwise when the person tossing the coin is not
experienced! We may decide that some variant of the following is a better model:
the event determining [B = H] is C if [A = H], and D if [A = T], where

    P[ C = H ] = 3/4,    P[ D = H ] = 1/4,
and A, C, D are independent.


There are two stages of specification at work here. Given a collection C of
events, and specified probabilities P[ C ] for each C ∈ C, we can find P[ C^c ] =
1 − P[ C ], the probability of the complement C^c of C, but not necessarily P[ C ∩ D ]
for C, D ∈ C.
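The dependent-coin model above is small enough to check by exhaustive enumeration. A sketch in exact arithmetic (variable names are my own):

```python
from itertools import product
from fractions import Fraction

# Exact enumeration of the model above: A fair, C and D biased, all three
# independent, and B defined to copy C when A = H and D when A = T.
pA = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
pC = {"H": Fraction(3, 4), "T": Fraction(1, 4)}
pD = {"H": Fraction(1, 4), "T": Fraction(3, 4)}

p_B_H = Fraction(0)
p_AH_BH = Fraction(0)
for a, c, d in product("HT", repeat=3):
    pr = pA[a] * pC[c] * pD[d]     # joint probability, by independence of A, C, D
    b = c if a == "H" else d       # the rule defining B
    if b == "H":
        p_B_H += pr
        if a == "H":
            p_AH_BH += pr

print(p_B_H)     # 1/2: marginally B is still a fair coin
print(p_AH_BH)   # 3/8, not 1/4 = P[A = H] P[B = H]: the tosses are dependent
```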

2.2 Revision of sample space and events


Remember from ST111 that we can use notation from set theory to describe
events. We can think of events as subsets of sample space Ω. If A is an event,
then the event that A does not happen is the complement or complementary event
A^c = {ω ∈ Ω : ω ∉ A}.
If B is another event then the event that both A and B happen is the
intersection A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}. The event that either A or B (or
both!) happen is the union A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}.

2.3 Algebras of sets


This leads us to identify classes of sets for which we want to find probabilities.

Definition 2.1 (Algebra of sets): An algebra (sometimes called a field) of
subsets of Ω is a class C of subsets of a sample space Ω, containing Ω itself and
satisfying:
(1) closure under complements: if A ∈ C then A^c ∈ C;
(2) closure under intersections: if A, B ∈ C then A ∩ B ∈ C;
(3) closure under unions: if A, B ∈ C then A ∪ B ∈ C.

Definition 2.2 (Algebra generated by a collection): If C is a collection of
subsets of Ω then A(C), the algebra generated by C, is the intersection of all
algebras of subsets of Ω which contain C.
Here are some examples of algebras:
(i) the trivial algebra A = {∅, Ω};
(ii) supposing Ω = {H, T}, another example is

    A = { Ω = {H, T}, {H}, {T}, ∅ };

(iii) now consider the following class of subsets of the unit interval [0, 1]:
A = { finite unions of subintervals }. This is an algebra. For example, if

    A = (a0, a1) ∪ (a2, a3) ∪ ... ∪ (a2n, a2n+1)

is a non-overlapping union of intervals (and we can always re-arrange
matters so that any union of intervals is non-overlapping!) then

    A^c = [0, a0] ∪ [a1, a2] ∪ ... ∪ [a2n+1, 1].

This checks point (1) of the definition of an algebra of sets. Point
(2) is rather easy, and point (3) follows from points (1) and (2).
(iv) Consider A = {{1, 2, 3}, {1, 2}, {3}, ∅}. This is an algebra of subsets
of Ω = {1, 2, 3}. Notice it does not include events such as {1}, {2, 3}.
(v) Just to give an example of a collection of sets which is not an algebra,
consider {{1, 2, 3}, {1, 2}, {2, 3}, ∅}.
(vi) Algebras get very large. It is typically more convenient simply to
give a collection C of sets generating the algebra. For example, if
C = ∅ then A(C) = {∅, Ω} is the trivial algebra described above!
(vii) If Ω = {H, T} and C = {{H}} then A(C) = {{H, T}, {H}, {T}, ∅} as in
example (ii) above.
(viii) If Ω = [0, 1] and C = { intervals in [0, 1] } then A(C) is the collection
of finite unions of intervals as in example (iii) above.
(ix) Finally, if Ω = [0, 1] and C is the collection of singleton points in [0, 1]
then A(C) is the collection of (a) all finite sets in [0, 1] and (b) all
complements of finite sets in [0, 1].
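On a finite sample space the generated algebra can be computed directly, by repeatedly closing under the three operations of the definition. A brute-force sketch (the function name is my own):

```python
# A brute-force sketch: the algebra A(C) generated by a collection C of
# subsets of a finite sample space, obtained by repeatedly closing under
# complements, intersections and unions until nothing new appears.
def generated_algebra(omega, collection):
    omega = frozenset(omega)
    algebra = {frozenset(), omega} | {frozenset(c) for c in collection}
    while True:
        new = set(algebra)
        for a in algebra:
            new.add(omega - a)          # closure under complements
            for b in algebra:
                new.add(a & b)          # closure under intersections
                new.add(a | b)          # closure under unions
        if new == algebra:
            return algebra
        algebra = new

# Example (vii): C = {{H}} on {H, T} generates the four-event algebra.
print(sorted(map(sorted, generated_algebra({"H", "T"}, [{"H"}]))))
# Example (iv): C = {{1, 2}} on {1, 2, 3} generates {∅, {1, 2}, {3}, Ω}.
alg = generated_algebra({1, 2, 3}, [{1, 2}])
print(len(alg))   # 4
```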

In realistic examples algebras are rather large: not surprising, since they
correspond to the collection of all true-or-false statements you can make about a
certain experiment! (If your experiment's results can be summarised as n different
yes/no answers such as, "result is hot/cold", "result is coloured black/white",
etc, then the sample space has 2^n outcomes and the relevant algebra is composed
of 2^(2^n) different subsets!) Therefore it is of interest that the typical element
of an algebra can be written down in a rather special form:
Theorem 2.3 (Representation of typical element of algebra): If C is a
collection of subsets of Ω then the event A belongs to the algebra A(C) generated
by C if and only if

    A = ∪_{i=1}^N ∩_{j=1}^{M_i} C_{i,j}

where for each i, j either C_{i,j} or its complement C_{i,j}^c belongs to C. Moreover we
may write A in this form with the sets

    D_i = ∩_{j=1}^{M_i} C_{i,j}

being disjoint. *
We are now in a position to produce our first stab at a set of axioms for
probability. Given a sample space Ω and an algebra A of subsets, probability P[ · ]
assigns a number between 0 and 1 to each event in the algebra A, obeying the
rules given below. There is a close analogy to the notion of length of subsets of
[0, 1] (and also to notions of area, volume, ...): the table below makes this clear:

    Probability                          Length of subset of [0, 1]

    P[ ∅ ] = 0                           Length(∅) = 0

    P[ Ω ] = 1                           Length([0, 1]) = 1

    P[ A ∪ B ] = P[ A ] + P[ B ]         Length([a, b] ∪ [c, d]) =
      if A ∩ B = ∅                         Length([a, b]) + Length([c, d])
                                           if a ≤ b < c ≤ d

* This result corresponds to a basic remark in logic: logical statements, however


complicated, can be reduced to statements of the form (A1 and A2 and ... and
Am ) or (B1 and B2 and ... and Bn ) or ... or (C1 and C2 and ... and Cp ), where
the statements A1 etc are either basic statements or their negations, and no more
than one of the (...) or ... or (...) can be true at once.

There are some consequences of these axioms which are not completely trivial.
For example, the law of negation

    P[ A^c ] = 1 − P[ A ];

the generalized law of addition, holding when A ∩ B is not necessarily empty:

    P[ A ∪ B ] = P[ A ] + P[ B ] − P[ A ∩ B ]

(think of double-counting); and finally the inclusion-exclusion law

    P[ A1 ∪ A2 ∪ ... ∪ An ] = Σ_i P[ Ai ] − Σ_{i<j} P[ Ai ∩ Aj ] + ... + (−1)^{n+1} P[ A1 ∩ A2 ∩ ... ∩ An ].
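The inclusion-exclusion law is easy to verify on a small finite example. A sketch using a uniform measure on a ten-point sample space (the particular sets are arbitrary choices for illustration):

```python
from itertools import combinations
from fractions import Fraction

# Verify inclusion-exclusion for three events under the uniform measure
# on a ten-point sample space (an illustrative toy model).
omega = set(range(10))
A = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7, 0}]
P = lambda s: Fraction(len(s), len(omega))   # uniform probability of an event

lhs = P(set.union(*A))                       # P[A1 ∪ A2 ∪ A3]
rhs = Fraction(0)
for k in range(1, len(A) + 1):
    for sub in combinations(A, k):           # all k-fold intersections
        rhs += (-1) ** (k + 1) * P(set.intersection(*sub))

print(lhs, rhs)   # both 4/5
```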

2.4 Limit Sets


Much of the first half of ST111 is concerned with calculations using these various
rules of probabilistic calculation. Essentially the representation theorem above tells
us we can compute the probability of any event in A(C) just so long as we know the
probabilities of the various events in C and also of all their intersections, whether
by knowing events are independent or whether by knowing various conditional
probabilities. *
However these calculations can become long-winded and ultimately either infeasible or unrevealing. It is better to know how to approximate probabilities and
events, which leads us to the following kind of question:
Suppose we have a sequence of events Cn which are decreasing (getting harder
and harder to satisfy) and which converge to a limit C:

    Cn ↓ C.

Can we say P[ Cn ] converges to P[ C ]?


Here is a specific example. Suppose we observe an infinite sequence of coin
tosses, and think therefore of the collection C of events Ai that the ith coin comes
up heads. Consider the probabilities
(a) P [ second toss gives heads ] = P [ A2 ]
* We avoid discussing conditional probabilities here for reasons of shortage of
time: they have been dealt with in ST111 and figure very largely in
www.warwick.ac.uk/statsdept/teaching/ST202.html

(b) P [ first n tosses all give heads ] = P[ ∩_{i=1}^n Ai ]
(c) P [ the first toss which gives a head is even-numbered ]
There is a difference! The first two can be dealt with within the algebra. The
third cannot: suppose Cn is the event "the first toss in numbers 1, ..., n which
gives a head is even-numbered, or else all n of these tosses give tails"; then Cn lies
in A(C), and converges down to the event C, "the first toss which gives a head is
even-numbered", but C is not in A(C).
We now find that a number of problems raise their heads.
Problems with "everywhere being impossible": Suppose we
are running an experiment with an outcome uniformly distributed
over [0, 1]. Then we have a problem, as mentioned in the second of our
motivating examples: under reasonable conditions we are working
with the algebra of finite unions of sub-intervals of [0, 1], and the
probability measure which gives P[ [a, b] ] = b − a, but this means
P[ {a} ] = 0. Now we need to be careful, since if we rashly allow
ourselves to work with uncountable unions we get

    P[ ∪_{x∈[0,1]} {x} ] = Σ_{x∈[0,1]} 0 = 0.

But this contradicts P[ [0, 1] ] = 1 and so is obviously wrong.


Problems with specification: if we react to the above example
by insisting that we can only give probabilities to events in the original
algebra, then we can fail to give probabilities to perfectly sensible
events, such as example (c) in the infinite sequence of coin-tosses
above. On the other hand, if we rashly prescribe probabilities then
how can we avoid getting into contradictions such as the one above?

It seems sensible to suppose that at least when we have Cn ↑ C we should
be allowed to say P[ Cn ] ↑ P[ C ], and this turns out to be the case as long as the
set-up is sensible. Here is an example of a set-up which is not sensible:
Ω = {1, 2, 3, ...}, C = {{1}, {2}, ...}, P[ {n} ] = 1/2^(n+1). Then A(C) is the
collection of finite and co-finite* subsets of the positive integers, and

    P[ {1, 2, ..., n} ] = Σ_{m=1}^n 1/2^(m+1) = 1/2 − 1/2^(n+1) ↑ 1/2 ≠ 1 = P[ Ω ].

We must now investigate how we can deal with limit sets.

* co-finite: complement is finite

2.5 σ-algebras
The first task is to establish a wide range of sensible limit sets. Boldly, we look
at sets which can be obtained by any imaginable combination of countable set
operations: the collection of all such sets is a σ-algebra.**

Definition 2.4 (σ-algebra): A σ-algebra of subsets of Ω is an algebra which is
also closed under countable unions.

In fact σ-algebras are even larger than ordinary algebras; it is difficult to
describe a typical member of a σ-algebra, and it pays to talk about σ-algebras
generated by specified collections of sets.

Definition 2.5 (σ-algebra generated by a collection): For any collection of
subsets C of Ω, we define σ(C) to be the intersection of all σ-algebras of subsets of
Ω which contain C:

    σ(C) = ∩ { S : S is a σ-algebra and C ⊆ S }.

Theorem 2.6 (Monotone limits): Note that σ(C) defined above is indeed a
σ-algebra. Furthermore, it is the smallest algebra containing C which is closed
under monotone limits.

Examples of σ-algebras include: all algebras of subsets of finite sets (because
then countable set operations reduce to finite ones); the Borel σ-algebra,
generated by the family of all intervals of the real line; and the σ-algebra for the
coin-tossing example, generated by the infinite family of events Ai = [ ith coin is heads ].

2.6 Countable additivity

Now we have established a context for limit sets (they are sets belonging to a σ-algebra),
we can think about what sort of limiting operations we should allow for
probability measures.

Definition 2.7 (Measures): A set-function μ : A → [0, ∞] is said to be a
finitely-additive measure if it satisfies:
(FA) μ(A ∪ B) = μ(A) + μ(B) whenever A, B ∈ A are disjoint.
It is said to be countably-additive (or σ-additive) if in addition
(CA) μ( ∪_{i=1}^∞ Ai ) = Σ_{i=1}^∞ μ(Ai) whenever the Ai ∈ A are disjoint and
their union ∪_{i=1}^∞ Ai lies in A.

We abbreviate finitely-additive to (FA), countably-additive to (CA).
We often abbreviate "countably-additive measure" to "measure".
Notice that if A were actually a σ-algebra then we wouldn't have to check the
condition that ∪_{i=1}^∞ Ai lies in A in (CA).

** σ stands for countable

Definition 2.8 (Probability measures): A set-function P : A → [0, 1] is said to
be a finitely-additive probability measure if it is a (FA) measure such that P[ Ω ] =
1. It is a (CA) probability measure (we often just say probability measure) if in
addition it is (CA).

Notice various consequences for probability measures: P(∅) = 0; condition
(FA) follows from condition (CA) if condition (CA) holds; and we always have

    P( ∪_{i=1}^∞ Ai ) ≤ Σ_{i=1}^∞ P(Ai)

even when the union is not disjoint; etc.
(CA) is a kind of continuity condition. A similar continuity condition is that of
monotone limits.

Definition 2.9 (Monotone limits): A set-function μ : A → [0, 1] is said to obey
the monotone limits property (ML) if it satisfies:
(ML) μ(Ai) ↑ μ(A) whenever the Ai increase upwards to a limit set A which
lies in A.

(ML) is simpler to check than (CA), but is equivalent for finitely-additive
measures.
Theorem 2.10 (Equivalence for countable additivity):

    (FA) + (ML)  ⟺  (CA)

Lemma 2.11 (Another equivalence): Suppose P is a finitely additive probability
measure on (Ω, F), where F is an algebra of sets. Then P is countably
additive if and only if

    lim_{n→∞} P[ An ] = 1

whenever the sequence of events An belongs to the algebra F and moreover An ↑ Ω.

2.7 Uniqueness of probability measures


To illustrate the next step, consider the notion of length/area. (To avoid awkward
alternatives, we talk about "the measure" instead of length/area/volume/...) It is
easy to define the area of very regular sets. But for a stranger, more fractal-like,
set A we would need to define something like an outer measure

    μ*(A) = inf { Σ_i μ(Bi) : where the regular sets Bi cover A }

to get at least an upper bound for what it would be sensible to call the measure
of A.

Of course we must give equal priority to considering what is the measure of
the complement A^c. Suppose for definiteness that A is contained in a simple set
Q of finite measure (a convenient interval for length, a square for area, a cube for
volume, ...) so that A^c = Q \ A. Then consideration of μ*(A^c) leads us directly to
consideration of an inner measure for A:

    μ_*(A) = μ(Q) − μ*(A^c).

Clearly μ_*(A) ≤ μ*(A): moreover we can only expect a truly sensible definition
of measure on the set

    F = { A : μ_*(A) = μ*(A) }.

The fundamental theorem of measure theory states that this works out all
right!

Theorem 2.12 (Extension theorem): If μ is a measure on an algebra A which
is σ-additive on A then it can be extended uniquely to a countably additive measure
on F defined as above: moreover σ(A) ⊆ F.

The proof of this remarkable theorem is too lengthy to go into here. Notice
that it can be paraphrased very simply: if your notion of measure (probability,
length, area, volume, ...) can be defined consistently on an algebra, in such a way
that it is σ-additive whenever the two sides of

    μ( ∪_{i=1}^∞ Ai ) = Σ_{i=1}^∞ μ(Ai)

make sense (whenever the disjoint union ∪_{i=1}^∞ Ai actually belongs to the algebra),
then it can be extended uniquely to the (typically much larger) σ-algebra generated
by the original algebra, so as again to be a (σ-additive) measure.
There is an important special part of this theorem which is worth stating
separately.

Definition 2.13 (π-system): A π-system of subsets of Ω is a collection of subsets
including Ω itself and closed under finite intersections.

Theorem 2.14 (Uniqueness for probability measures): Two finite measures
which agree on a π-system C also agree on the generated σ-algebra σ(C).

2.8 Lebesgue measure and coin tossing


The extension theorem can be applied to the uniform probability space Ω =
[0, 1], A given by finite unions of intervals, P given by lengths of intervals. It
turns out P is indeed σ-additive on A (showing this is non-trivial!) and so the
extension theorem tells us there is a unique countably additive extension P on
the σ-algebra B = σ(A) (the Borel σ-algebra restricted to [0, 1]). We call this
Lebesgue measure.
There is a significant connection between infinite sequences of coin tosses and
numbers in [0, 1]. Briefly, we can expand a number x ∈ [0, 1] in binary (as opposed
to decimal!): we write x as

    .ε1 ε2 ε3 ...

where εi equals 1 or 0 according as 2^i x, reduced modulo 2, is at least 1 or not.
The coin-tossing σ-algebra can be viewed as generated by the sequence

    { ε1, ε2, ε3, ... }

with 0 standing for tails, 1 for heads. In effect we get a map from coin-tossing
space 2^N to number space [0, 1], with the slight cautionary note that this map
very occasionally maps two sequences onto one number (think of .0111111... and
.100000...). In particular

    [ ε1 = a1, ε2 = a2, ..., εd = ad ]    corresponds to    [ x, x + 2^(−d) )

where x is the number corresponding to (a1, a2, ..., ad).


Remarkably, we can now use the uniqueness theorem to show that the map
T : (a1, a2, ..., ad) ↦ x preserves probabilities, in the sense that Lebesgue measure
is exactly the same as we get by finding the probability of the event T^(−1)(A) as a
coin-tossing event, if the coins are independent and fair.
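The correspondence between digit patterns and dyadic intervals can be sketched directly: `bits` extracts binary digits by repeated doubling, and `interval` recovers the interval [x, x + 2^(−d)) for a given pattern (both helper names are my own):

```python
from fractions import Fraction

# Binary digits of x by repeated doubling, and the dyadic interval which a
# digit pattern (a_1, ..., a_d) corresponds to.
def bits(x, d):
    out = []
    for _ in range(d):
        x *= 2
        if x >= 1:       # the next binary digit is 1; keep the fractional part
            out.append(1)
            x -= 1
        else:
            out.append(0)
    return out

def interval(a):
    # the event [eps_1 = a_1, ..., eps_d = a_d] corresponds to [x, x + 2^-d)
    x = sum(Fraction(ai, 2 ** (i + 1)) for i, ai in enumerate(a))
    return x, x + Fraction(1, 2 ** len(a))

a = [1, 0, 1]
lo, hi = interval(a)
print(lo, hi)                                   # 5/8 3/4
print(bits(lo, 3), bits(Fraction(21, 32), 3))   # both [1, 0, 1]
```

Note the interval has length 2^(−3) = 1/8, matching the probability (1/2)³ of the corresponding coin-tossing event for fair independent coins.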
It is reasonable to ask whether there are any non-measurable sets, since
σ-algebras are so big! It is indeed very hard to find any. Here is the basic example,
which is due in essence to Vitali.

Consider the following equivalence relation on ([0, 1], B, P): we say x ∼ y if x − y
is a rational number. Now construct a set A by choosing exactly one member from
each equivalence class. So for any x ∈ [0, 1] there is one and only one y ∈ A such
that x − y is a rational number.
If A were Lebesgue measurable then it would have a value P[ A ]. What would
this value be?

Imagine [0, 1] folded round into a circle. It is the case that P[ A ] does not
change when one turns this circle. In particular we can now consider Aq = {a + q
(mod 1) : a ∈ A} for rational q. By construction Aq and Ar are disjoint for different
rationals q, r. Now we have

    ∪_{q rational} Aq = [0, 1]

and since there are only countably many rationals q, and P[ Aq ] doesn't depend on
q, we determine

    P[ [0, 1] ] = Σ_{q rational} P[ Aq ] = Σ_{q rational} P[ A ].

But this cannot make sense if P[ [0, 1] ] = 1! We are forced to conclude that
A cannot be Lebesgue measurable.

This example has a lot to do with the Banach-Tarski paradox described in
one of our motivating examples above.

3. Independence and measurable functions

3.1 Independence
In ST111 we formalized the idea of independence of events. Essentially we require
a multiplication law to hold:

Definition 3.15 (Independence of an infinite sequence of events): We say
the events Ai (for i = 1, 2, ...) are independent if, for any finite subsequence
i1 < i2 < ... < ik, we have

    P[ Ai1 ∩ ... ∩ Aik ] = P[ Ai1 ] × ... × P[ Aik ].

Notice we require all possible multiplication laws to hold: it is possible to build


interesting examples where events are independent pair-by-pair, but altogether give
non-trivial information about each other.
We need to talk about infinite sequences of events (often independent). We
often have in the back of our minds a sense that the sequence is revealed to us
progressively over time (though this need not be so!), suggesting two natural questions. First, will we see events occur in the sequence right into the indefinite
future? Second, will we after some point see all events occur?
Definition 3.16 (Infinitely often and Eventually): Given a sequence of
events B1, B2, ... we say

Bi holds infinitely often ([Bi i.o.]) if there are infinitely many different
i for which the statement Bi is true: in set-theoretic terms

    [Bi i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ Bj.

Bi holds eventually ([Bi ev.]) if for all large enough i the statement
Bi is true: in set-theoretic terms

    [Bi ev.] = ∪_{i=1}^∞ ∩_{j=i}^∞ Bj.

Notice these two concepts ev. and i.o. make sense even if the infinite sequence
is just a sequence, with no notion of events occurring consecutively in time!
Notice also (you should check this yourself!)

    [Bi i.o.] = [Bi^c ev.]^c.

3.2 Borel-Cantelli lemmas

The multiplication laws appearing above in Section 2.1 force a kind of infinite
multiplication law.

Lemma 3.17 (Probability of infinite intersection): If the events Ai (for
i = 1, 2, ...) are independent then

    P[ ∩_{i=1}^∞ Ai ] = Π_{i=1}^∞ P[ Ai ].

We have to be careful what we mean by the infinite product Π_{i=1}^∞ P[ Ai ]: we
mean of course the limiting value lim_{n→∞} Π_{i=1}^n P[ Ai ].

We can now prove a remarkable pair of facts about P[ Ai i.o. ] (and hence its
twin P[ Ai ev. ]!). It turns out it is often easy to tell whether these events have
probability 0 or 1.

Theorem 3.18 (Borel-Cantelli lemmas): Suppose the events Ai (for i = 1, 2,
...) form an infinite sequence. Then
(i) if Σ_{i=1}^∞ P[ Ai ] < ∞ then

    P[ Ai holds infinitely often ] = P[ Ai i.o. ] = 0;

(ii) if Σ_{i=1}^∞ P[ Ai ] = ∞ and the Ai are independent then

    P[ Ai holds infinitely often ] = P[ Ai i.o. ] = 1.

Note the two parts of the above result are not quite symmetrical: the second
part also requires independence. It is a good exercise to work out a counterexample
to part (ii) if independence fails.
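The suggested exercise has a classic answer which can be sketched numerically: take a single uniform U and set Ai = [U ≤ 1/i]. Then Σ P[Ai] = Σ 1/i diverges, yet the Ai are nested, so [Ai i.o.] = {U = 0}, an event of probability zero. A Monte Carlo sketch (the finite horizon is my own stand-in for "infinitely often"):

```python
import random

# Dependent counterexample to Borel-Cantelli (ii): one uniform U, events
# A_i = [U <= 1/i]. Sum of P[A_i] = sum of 1/i diverges, but the A_i are
# nested, so A_i happens infinitely often only when U = 0 (probability 0).
random.seed(1)

def occurs_io(u, horizon=10**6):
    # Since the A_i are nested, "A_i occurs for some i >= horizon" is just
    # u <= 1/horizon; this proxies the event [A_i i.o.] at a finite horizon.
    return u <= 1.0 / horizon

samples = [random.random() for _ in range(10000)]
print(sum(occurs_io(u) for u in samples) / len(samples))   # essentially 0
```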

3.3 Law of large numbers for events

As a consequence of these ideas it can be shown that limiting frequencies exist
for sequences of independent trials with the same success probability.

Theorem 3.19 (Law of large numbers for events): Suppose that we have a
sequence of independent events Ai each with the same probability p. Let Sn count
the number of events A1, ..., An which occur. Then

    P[ |Sn/n − p| ≤ ε  ev. ] = 1

for all positive ε.

3.4 Independence and classes of events


The idea of independence stretches beyond mere sequences of events. For example,
consider (a) a set of events concerning a football match between Coventry City and
Aston Villa at home for Coventry, and (b) a set of events concerning a cricket test
between England and Australia at Melbourne, both happening on the same day.
At least as a first approximation, one might assume that any combination of events
concerning (a) is independent of any combination concerning (b).
Definition 3.20 (Independence and classes of events): Suppose C1, C2 are
two classes of events. We say they are independent if A and B are independent
whenever A ∈ C1, B ∈ C2.

Here our notion of π-systems becomes important.

Lemma 3.21 (Independence and π-systems): If two π-systems are independent,
then so are the σ-algebras they generate.

Returning to sequences, the above is the reason why we can jump immediately
from assumptions of independence of events to deducing that their complements
are independent.

Corollary 3.22 (Independence and complements): If a sequence of events
Ai is independent, then so is the sequence of complementary events Ai^c.

3.5 Measurable functions


Mathematical work often becomes easier if one moves from sets to functions.
Probability theory is no different. Instead of events (subsets of sample space) we can
often find it easier to work with random variables (real-valued functions defined on
sample space). You should think of a random variable as involving lots of different
events, namely those events defined in terms of the random variable taking on
different sets of values. Accordingly we need to take care that the random variable
doesn't produce events which fall outwith our chosen σ-algebra. To do this we
need to develop the idea of a measurable function.


Definition 3.23 (Measurable space): (Ω, F) is a measurable space if F is a
σ-algebra of subsets of Ω.

Definition 3.24 (Borel σ-algebra): The Borel σ-algebra B is the σ-algebra of
subsets of R generated by the collection of intervals of R.

In fact we don't need all the intervals of R. It is enough to take the closed
half-infinite intervals (−∞, x].

Definition 3.25 (Measurable function): Suppose that (Ω, F), (Ω′, F′) are
both measurable spaces. We say the function

    f : Ω → Ω′

is measurable if f^(−1)(A) = {ω ∈ Ω : f(ω) ∈ A} belongs to F whenever A belongs to F′.

Definition 3.26 (Random variable): Suppose that X : Ω → R is measurable
as a mapping from (Ω, F) to (R, B). Then we say X is a random variable.

As we have said, to each random variable there is a class of related events.
This actually forms a σ-algebra.

Definition 3.27 (σ-algebra generated by a random variable): If X : Ω → R
is a random variable then the σ-algebra generated by X is the family of events
σ(X) = { X^(−1)(A) : A ∈ B }.

3.6 Independence of random variables

Random variables can be independent too! Essentially independence here means
that an event generated by one of the random variables cannot be used to give useful
predictions about an event generated by the other random variable.

Definition 3.28 (Independence of random variables): We say random variables
X and Y are independent if their σ-algebras σ(X), σ(Y) are independent.

Theorem 3.29 (Criterion for independence of random variables): Let X
and Y be random variables, and let P be the π-system of subsets of R formed by all
half-infinite closed intervals (−∞, x]. Then X and Y are independent if and only if the
collections of events X^(−1)P, Y^(−1)P are independent*.

3.7 Distributions of random variables


We often need to talk about random variables on their own, without reference
to other random variables or events. In such cases all we are interested in is the
probabilities they have of taking values in various regions:
* Here we define X^(−1)P = { X^(−1)(A) : A ∈ P } = { X^(−1)((−∞, x]) : x ∈ R }

Definition 3.30 (Distribution of a random variable): Suppose that X is a
random variable. Its distribution is the probability measure PX on R given by

    PX[ B ] = P[ X ∈ B ]

whenever B ∈ B.

4. Integration
One of the main things to do with functions is to integrate them (find the area
under the curve). One of the main things to do with random variables is to take
their expectations (find their average values). It turns out that these are really the
same idea! We start with integration.

4.1 Simple functions and Indicators

Begin by thinking of the simplest possible function to integrate. That is an
indicator function, which takes only two possible values, 0 or 1:

Definition 4.31 (Indicator function): If A is a measurable set then its indicator
function is defined by

    I[A](x) = 0  if x ∉ A;
    I[A](x) = 1  if x ∈ A.

The next stage up is to consider a simple function taking only a finite number
of values, since it can be regarded as a linear combination of indicator functions.

Definition 4.32 (Simple functions): A simple function h is a measurable function
h : Ω → R which takes only finitely many values. Thus we can represent it
as

    h(x) = c1 I[A1](x) + ... + cn I[An](x)

for some finite collection A1, ..., An of measurable sets and constants c1, ..., cn.
It is easy to integrate simple functions ...

Definition 4.33 (Integration of simple functions): The integral of a simple
function h with respect to a measure μ is given by

    ∫ h dμ = ∫ h(x) μ(dx) = Σ_{i=1}^n ci μ(Ai)

where

    h(x) = c1 I[A1](x) + ... + cn I[An](x)

as above.

Note that one really should prove that the definition of ∫ h dμ does not depend
on exactly how one represents h as the sum of indicator functions.
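A sketch of Definition 4.33 on a finite space, where a measure is just a table of point masses; it also illustrates the remark above, since two different representations of the same simple function give the same integral (all names here are my own):

```python
from fractions import Fraction

# Integrate a simple function h = sum_i c_i I[A_i] against a measure mu given
# as a table of point masses on a finite space (an illustrative sketch).
def integrate_simple(coeff_sets, mu):
    # integral = sum_i c_i * mu(A_i)
    return sum(c * sum(mu[x] for x in A) for c, A in coeff_sets)

mu = {x: Fraction(1, 4) for x in range(4)}   # uniform measure on {0, 1, 2, 3}
h1 = [(2, {0, 1}), (5, {2})]                 # h = 2 I[{0,1}] + 5 I[{2}]
h2 = [(2, {0}), (2, {1}), (5, {2})]          # the same h, represented differently
print(integrate_simple(h1, mu), integrate_simple(h2, mu))   # 9/4 9/4
```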
Integration for such functions has a number of basic properties which one uses
all the time, almost unconsciously, when trying to find integrals.

Theorem 4.34 (Properties of integration for simple functions):
(1) if μ(f ≠ g) = 0 then ∫ f dμ = ∫ g dμ;
(2) Linearity: ∫ (af + bg) dμ = a ∫ f dμ + b ∫ g dμ;
(3) Monotonicity: f ≤ g means ∫ f dμ ≤ ∫ g dμ;
(4) min{f, g} and max{f, g} are simple.

Simple functions are rather boring. For more general functions we use limiting
arguments. We have to be a little careful here, since some functions would have
integrals built up from +∞ where they are integrated over one part of the region,
and −∞ over another part. Think for example of

    ∫_{−1}^{1} (1/x) dx = ∫_{−1}^{0} (1/x) dx + ∫_{0}^{1} (1/x) dx = (−∞) + (+∞) = ?

So we first consider just non-negative functions.


Definition 4.35 (Integration for non-negative measurable functions): If f ≥ 0 is measurable then we define

    ∫ f dμ = sup { ∫ g dμ : for simple g such that 0 ≤ g ≤ f } .
4.2 Integrable functions


For general functions we require that we don't get into this situation of ∞ − ∞.

Definition 4.36 (Integration for general measurable functions): If f is measurable and we can write f = g − h for two non-negative measurable functions g and h, both with finite integrals, then

    ∫ f dμ = ∫ g dμ − ∫ h dμ .

We then say f is integrable.
One really needs to prove that the integral ∫ f dμ does not depend on the choice f = g − h. In fact if there is any choice which works then the easy choice

    g = max{f, 0} ,
    h = max{−f, 0}

will work.
One can show that the integral on integrable functions agrees with its definition on simple functions and is linear. What starts to make the theory very easy
is that the integral thus defined behaves very well when studying limits.
Theorem 4.37 (Monotone convergence theorem (MON)): If fn ↑ f (all being non-negative measurable functions) then

    ∫ fn dμ ↑ ∫ f dμ .

Corollary 4.38 (Integrability and simple functions): If f is non-negative and measurable then for any sequence of non-negative simple functions fn such that fn ↑ f we have

    ∫ fn dμ ↑ ∫ f dμ .
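As a concrete (and entirely illustrative) check of Corollary 4.38, the sketch below approximates f(x) = x² on [0, 1] from below by the dyadic simple functions fn(x) = floor(2ⁿ f(x)) / 2ⁿ and watches their integrals increase towards ∫₀¹ x² dx = 1/3. The fine midpoint grid is only a numerical stand-in for Lebesgue measure.

```python
import math

def dyadic_lower_integral(n, grid=100_000):
    """Integral of the simple function f_n = floor(2^n f)/2^n for f(x) = x^2
    on [0,1], with Lebesgue measure approximated by a fine midpoint grid."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = (i + 0.5) * h
        total += math.floor(x * x * 2**n) / 2**n * h
    return total

vals = [dyadic_lower_integral(n) for n in (1, 3, 6, 10)]
print(vals)  # increasing, approaching 1/3
```

The integrals increase with n, exactly as MON predicts for an increasing sequence of simple approximations.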

Definition 4.39 (Integration over a measurable set): If A is measurable and f is integrable then

    ∫_A f dμ = ∫ I[A] f dμ .
4.3 Expectation of random variables


The above notions apply directly to random variables, which may be thought of
simply as measurable functions defined on the sample space!
Definition 4.40 (Expectation): If P is a probability measure then we define expectation (with respect to this probability measure) for all integrable random variables X by

    E[X] = ∫ X dP = ∫ X(ω) P(dω) .
The notion of expectation is really only to do with the random variable considered on its own, without reference to any other random variables. Accordingly
it can be expressed in terms of the distribution of the random variable.

Theorem 4.41 (Change of variables): Let X be a random variable and let g : R → R be a measurable function. Assuming that the random variable g(X) is integrable,

    E[g(X)] = ∫ g(x) P_X(dx) .
4.4 Examples
You need to work through examples such as the following to get a good idea of
how the above really works out in practice. See the material covered in lectures
for more on this.
- Evaluate ∫_0^1 x Leb(dx).

- Consider Ω = {1, 2, 3, ...} with P[{i}] = pi, where Σ_{i=1}^{∞} pi = 1. Evaluate

      ∫ f dP = Σ_{i=1}^{∞} f(i) pi .

- Evaluate ∫_0^y e^{−x} Leb(dx).

- Evaluate ∫_0^n f(x) Leb(dx) where f is a step function,

      f(x) = c1 if 0 ≤ x < 1,
             c2 if 1 ≤ x < 2,
             ...
             cn if n − 1 ≤ x < n.

- Evaluate ∫ I[0,π](x) sin(x) Leb(dx).
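For the countable-Ω example above one can also check things numerically. Here is a sketch with the illustrative choices pᵢ = 2⁻ⁱ and f(i) = i (my choices, not the notes'), for which Σᵢ f(i) pᵢ = 2, the mean of a Geometric(1/2) distribution.

```python
def integral_counting(f, p, terms=200):
    """Approximate the integral of f dP on Omega = {1, 2, 3, ...} where
    P[{i}] = p(i), i.e. the sum of f(i) * p(i), truncated after `terms`."""
    return sum(f(i) * p(i) for i in range(1, terms + 1))

val = integral_counting(lambda i: i, lambda i: 0.5 ** i)
print(val)  # ≈ 2.0 (the tail beyond 200 terms is negligible)
```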

5. Convergence
Approximation is a fundamental key to making mathematics work in practice.
Instead of being stuck, unable to do a hard problem, we find an easier problem
which has almost the same answer, and do that instead! The notion of convergence
(see first-year analysis) is the formal structure giving us the tools to do this. For
random variables there are a number of different notions of convergence, depending
on whether we need to approximate a whole sequence of actual random values, or
just a particular random value, or even just probabilities.

5.1 Convergence of random variables


Definition 5.42 (Convergence in probability): The random variables Xn converge in probability to Y,

    Xn → Y in prob.,

if for all positive ε we have

    P[|Xn − Y| > ε] → 0 .

Definition 5.43 (Convergence almost surely / almost everywhere): The random variables Xn converge almost surely to Y,

    Xn → Y a.s.,

if we have

    P[Xn ↛ Y] = 0 .

The (measurable) functions fn converge almost everywhere to f if the set

    {x : fn(x) → f(x) fails}

is of Lebesgue measure zero.
The difference is that convergence in probability deals with just a single random value Xn for large n. Convergence almost surely deals with the behaviour of the whole sequence. Here are some examples to think about.

- Consider random variables defined on ([0, 1], B, Leb) by Xn(ω) = I[[0, 1/n]](ω). Then Xn → 0 a.s..

- Consider the probability space above and the events A1 = [0, 1], A2 = [0, 1/2], A3 = [1/2, 1], A4 = [0, 1/4], ..., A7 = [3/4, 1], ... Then Xn = I[An] converges to zero in probability but not almost surely.

- Suppose in the above that Xn = Σ_{k=1}^{n} (k/n) I[[(k−1)/n, k/n]]. Then Xn → X a.s., where X(ω) = ω for ω ∈ [0, 1].

- Suppose in the above that Xn ≤ a for all n. Let Yn = max_{m≤n} Xm. Then Yn → Y a.s. for some Y.

- Suppose in the above that the Xn are not bounded, but are independent, and furthermore

      lim_{a→∞} Π_{i=1}^{∞} P[Xi ≤ a] = 1 .

  Then Yn → Y a.s. where

      P[Y ≤ a] = Π_{i=1}^{∞} P[Xi ≤ a] .
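The second example above (the "sliding intervals") can be made concrete in a few lines of Python. The enumeration of the intervals below is my own, but it matches A1 = [0, 1], A2 = [0, 1/2], A3 = [1/2, 1], A4 = [0, 1/4], ...: the interval lengths P[Xn = 1] shrink to zero, yet any fixed ω falls into one interval per dyadic level, hence into infinitely many An, so Xn(ω) cannot converge.

```python
def interval(n):
    """The n-th set A_n (n >= 1): level j holds the 2^j dyadic intervals
    of length 2^-j, enumerated as n = 2^j, ..., 2^(j+1) - 1."""
    j = n.bit_length() - 1
    k = n - 2**j
    return (k / 2**j, (k + 1) / 2**j)

# P[X_n = 1] = length of A_n -> 0, giving convergence in probability.
length_64 = interval(64)[1] - interval(64)[0]
print(length_64)  # 1/64

# But omega = 0.3 lies in one interval per level: infinitely many hits.
hits = [n for n in range(1, 2**10) if interval(n)[0] <= 0.3 < interval(n)[1]]
print(len(hits))  # 10 hits (one for each of levels 0 through 9)
```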

As one might expect, the notion of almost sure convergence implies that of
convergence in probability.


Theorem 5.44 (Almost sure convergence implies convergence in probability): Xn → X a.s. implies Xn → X in prob.

Almost sure convergence allows for various theorems telling us when it is OK to exchange integrals and limits. Generally this doesn't work: consider the example

    1 = ∫_0^∞ n e^{−nt} dt ↛ ∫_0^∞ ( lim_n n e^{−nt} ) dt = ∫_0^∞ 0 dt = 0 .

However we have already seen one case where it does work: when the limit is monotonic. In fact we only need this to hold almost everywhere (i.e. when the convergence is almost sure).

Theorem 5.45 (MON): If the functions fn, f are non-negative and if fn ↑ f a.e. then

    ∫ fn dμ ↑ ∫ f dμ .
It is often the case that the following simple inequalities are crucial to figuring out whether convergence holds.

Lemma 5.46 (Markov's inequality): If f : R → R is increasing and non-negative and X is a random variable then

    P[X ≥ a] ≤ E[f(X)] / f(a)

for all a such that f(a) > 0.

Corollary 5.47 (Chebyshev's inequality): If E[X²] < ∞ then

    P[|X − E[X]| ≥ a] ≤ Var(X) / a²

for all a > 0.
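A quick empirical sanity check of Chebyshev's inequality (an illustration of mine, with arbitrary parameter choices): for X distributed Binomial(100, 1/2), the bound Var(X)/a² = 0.25 at a = 10 comfortably exceeds the observed tail frequency.

```python
import random

random.seed(0)
n, trials, a = 100, 20_000, 10.0
mean, var = n * 0.5, n * 0.25      # Binomial(n, 1/2) mean and variance

# Count how often |X - E X| >= a over many simulated X.
count = sum(
    1 for _ in range(trials)
    if abs(sum(random.random() < 0.5 for _ in range(n)) - mean) >= a
)
freq = count / trials
print(freq, var / a**2)  # observed tail frequency vs Chebyshev bound 0.25
```

Chebyshev is crude (the true tail probability here is much smaller than 0.25), but it needs nothing beyond a finite second moment.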


In particular we can get a lot of mileage by combining these with the fact that, while in general the variance of a random variable is not additive, it is additive in the case of independence.

Lemma 5.48 (Variance and independence): If a sequence of random variables Xi is independent then

    Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi) .

5.2 Laws of large numbers for random variables


An important application of these ideas is to show that the law of large numbers
extends from events to random variables.


Theorem 5.49 (Weak law of large numbers): If a sequence of random variables Xi is independent, and if the random variables all have the same finite mean and variance, E[Xi] = μ and Var(Xi) = σ² < ∞, then

    Sn/n → μ in prob.,

where Sn = X1 + ... + Xn is the partial sum of the sequence.

As you will see, the proof is really rather easy when we use Chebyshev's inequality above. Indeed it is also quite easy to generalize to the case when the random variables are correlated, as long as the covariances are small ...

However the corresponding result for almost sure convergence, rather than convergence in probability, is rather harder to prove.

Theorem 5.50 (Strong law of large numbers): If a sequence of random variables Xi is independent and identically distributed, and if E[Xi] = μ, then

    Sn/n → μ a.s.,

where Sn = X1 + ... + Xn is the partial sum of the sequence.
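The strong law is easy to watch in simulation. The sketch below (illustrative choices throughout) averages i.i.d. Uniform(0, 1) draws and sees Sn/n settle near μ = 1/2.

```python
import random

random.seed(1)
n = 100_000
s = sum(random.random() for _ in range(n))  # S_n for Uniform(0,1) draws
print(s / n)  # close to mu = 0.5
```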

5.3 Convergence of integrals and expectations


We already know a way to relate integrals to limits (MON). What about a general
sequence of non-negative measurable functions?
Theorem 5.51 (Fatou's lemma (FATOU)): If the functions fn : R → R are non-negative then

    ∫ lim inf fn dμ ≤ lim inf ∫ fn dμ .

We can also go the other way:

Theorem 5.52 (Reverse Fatou): If the functions fn : R → R are bounded above by g a.e. and g is integrable then

    lim sup ∫ fn dμ ≤ ∫ lim sup fn dμ .

5.4 Dominated convergence theorem


Although in general one can't interchange limits and integrals, this can be done if all the functions (equivalently, random variables) involved are bounded in absolute value by a single non-negative function (random variable) which has finite integral.

Corollary 5.53 (Dominated convergence theorem (DOM)): If the functions fn : R → R are bounded in absolute value by g a.e. (so |fn| ≤ g a.e.), g is integrable, and fn → f, then

    lim ∫ fn dμ = ∫ f dμ .
This is a very powerful result ...
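Here is a small numerical illustration (my own example, not from the notes): fn(t) = e⁻ᵗ cos(t/n) converges pointwise to e⁻ᵗ and is dominated by the integrable g(t) = e⁻ᵗ, so DOM guarantees ∫ fn → ∫₀^∞ e⁻ᵗ dt = 1. A midpoint Riemann sum stands in for the Lebesgue integral.

```python
import math

def riemann(f, a=0.0, b=40.0, steps=200_000):
    """Midpoint Riemann sum standing in for the integral over [0, infinity);
    the tail beyond b = 40 is negligible because of the e^(-t) factor."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

vals = [riemann(lambda t, n=n: math.exp(-t) * math.cos(t / n))
        for n in (1, 5, 50)]
print(vals)  # the exact values are n^2/(n^2+1): 0.5, 25/26, 2500/2501
```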

5.5 Examples

- If the Xn form a bounded sequence of random variables and they converge almost surely to X then

      E[Xn] → E[X] .

- Suppose that U is a random variable uniformly distributed over [0, 1] and

      Xn = Σ_{k=0}^{2^n − 1} k 2^{−n} I[k 2^{−n} ≤ U < (k+1) 2^{−n}] .

  Then E[log(1 − Xn)] → −1.

- Suppose that the Xn are independent and X1 = 1 while for n ≥ 2

      P[Xn = n + 1] = 1/n³ ,   P[Xn = 1/(n + 1)] = 1/n³ ,   P[Xn = 1] = 1 − 2/n³ ,

  and Zn = Π_{i=1}^{n} Xi. Then the Zn form an almost surely convergent sequence with limit Z∞, and

      E[Zn] → E[Z∞] .

6. Product measures
6.1 Product measure spaces
The idea here is, given two measure spaces (Ω, F, μ) and (Ω′, F′, μ′), we build a measure space Ω × Ω′ by using rectangle sets A × B with measures μ(A) μ′(B). As you might guess from the product form μ(A) μ′(B), in the context of probability this is related to independence.

Definition 6.54 (Product measure space): Define the product measure μ × μ′ by

    (μ × μ′)(A × B) = μ(A) μ′(B)

on the π-system R of rectangle sets A × B as above. Let A(R) be the algebra generated by R.

Lemma 6.55 (Representation of A(R)): Every member of A(R) can be expressed as a finite disjoint union of rectangle sets.

It is now possible to apply the Extension Theorem (we need to check countable additivity; this is non-trivial but works) to define the product measure μ × μ′ on the whole σ-algebra σ(R).

6.2 Fubinis theorem


There are three big results on integration. We have already met two: MON and
DOM, which tell us cases when we can exchange integrals and limits. The other
result arises in the situation where we have a product measure space. In such a
case we can integrate any function in one of three possible ways: either using the
product measure, or by first doing a partial integration holding one coordinate
fixed, and then integrating with respect to that one. We call this alternative
iterated integration, and obviously there are two ways to do it depending on which
variable we fix first. The final big result is due to Fubini, and tells us that as long
as the function is modestly well-behaved it doesn't matter which of the three ways
we do the integration, we still get the same answer:
Theorem 6.56 (Fubini's theorem): Suppose f is a real-valued function defined on the product measure space above which is either (a) non-negative or (b) integrable. Then

    ∫_{Ω×Ω′} f d(μ × μ′) = ∫_{Ω′} ( ∫_{Ω} f(ω, ω′) μ(dω) ) μ′(dω′) .

Notice the two alternative conditions. Non-negativity (sometimes described as Tonelli's condition) is easy to check but can be limited. Think carefully about Fubini's theorem and especially Tonelli's condition, and you will see that the only thing which can go wrong is when in the product form you have an ∞ − ∞ problem!
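That ∞ − ∞ failure can be seen with counting measure on N × N, a standard example coded up here as a sketch: take f = +1 on the diagonal and −1 just above it. Each row sums to 0 but the zeroth column keeps a stray +1, so the two iterated sums disagree; f is not integrable because Σ |f| = ∞, so Fubini does not apply. The cutoff M is exact here since each row and each column has only finitely many nonzero entries.

```python
def f(i, j):
    """+1 on the diagonal, -1 on the superdiagonal, 0 elsewhere."""
    if i == j:
        return 1.0
    if j == i + 1:
        return -1.0
    return 0.0

M = 1000  # generous cutoff; inner sums are exact (finitely many nonzero terms)
row_first = sum(sum(f(i, j) for j in range(M + 2)) for i in range(M))
col_first = sum(sum(f(i, j) for i in range(M + 2)) for j in range(M))
print(row_first, col_first)  # 0.0 and 1.0: the order of summation matters
```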

6.3 Relationship with independence


Suppose X and Y are independent random variables. Then the distribution of the pair (X, Y), a measure ν on R × R given by

    ν(A) = P[(X, Y) ∈ A] ,

is exactly the product measure μ × μ′, where μ is the distribution of X and μ′ is the distribution of Y.
End of outline notes

