
Random Variables

EE 278
Lecture Notes # 3
Winter 2010–2011

© R.M. Gray 2011

Probability space (Ω, F, P)

Random variables, vectors, and processes

A (real-valued) random variable is a real-valued function defined on Ω,
with a technical condition (to be stated).

Common to use upper-case letters. E.g., a random variable X is a
function X : Ω → R. Also Y, Z, U, V, ...

Also common: a random variable may take on values only in some
subset 𝒳 ⊂ R (sometimes called the alphabet of X; A_X and Ω_X are also
common notations).

Intuition: Randomness is in the experiment, which produces an outcome ω
according to the probability measure P; the random variable's outcome is
X(ω) ∈ 𝒳 ⊂ R.


Examples

Consider (Ω, F, P) with Ω = R and P determined by the uniform pdf on [0, 1).

Coin flip from earlier: X : R → {0, 1} defined by

    X(r) = 0 if r ≤ 0.5, and X(r) = 1 otherwise.

Observe X, but do not observe the outcome of the fair spin.

Lots of possible random variables, e.g., W(r) = r², Z(r) = e^r, V(r) = r,
L(r) = r ln r (requires r ≥ 0), Y(r) = cos(2πr), etc.

Can think of rvs as observations or measurements made on an underlying
experiment.

Functions of random variables

Suppose that X is a rv defined on (Ω, F, P) and suppose that
g : 𝒳 → R is another real-valued function.

Then the function g(X) : Ω → R defined by g(X)(ω) = g(X(ω)) is also
a real-valued mapping of Ω, i.e., a real-valued function of a random
variable is a random variable.

Can express the previous examples as W = V², Z = e^V, L = V ln V,
Y = cos(2πV).

Similarly, 1/W, sinh(Y), and L³ are all random variables.
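The fair-spinner examples above are easy to play with numerically. The following short Python sketch is not part of the original notes; the sample size and seed are arbitrary. It simulates the spin, forms the quantizer X and a couple of other functions of the outcome, and checks that X behaves like a fair coin.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# Outcome of the fair spinner: uniform on [0, 1)
r = rng.random(n_trials)

# Random variables defined as functions of the outcome
X = (r > 0.5).astype(int)      # binary quantizer: 0 if r <= 0.5, 1 otherwise
W = r**2                        # W(r) = r^2
Y = np.cos(2 * np.pi * r)       # Y(r) = cos(2*pi*r)

# Empirical pmf of X should be close to (0.5, 0.5)
print("Pr(X=0) ~", np.mean(X == 0))
print("Pr(X=1) ~", np.mean(X == 1))
```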

Random vectors and random processes

A finite collection of random variables (defined on a common
probability space (Ω, F, P)) is a random vector.
E.g., (X, Y), (X0, X1, ..., X_{k−1}).

An infinite collection of random variables (defined on a common
probability space) is a random process.
E.g., {Xn; n = 0, 1, 2, ...}, {X(t); t ∈ (−∞, ∞)}.

So the theory of random vectors and random processes mostly boils
down to the theory of random variables.

Derived distributions

In general: input probability space (Ω, F, P) + random variable X ⇒
output probability space, say (𝒳, B(𝒳), P_X), where 𝒳 ⊂ R and P_X
is the distribution of X:

    P_X(F) = Pr(X ∈ F)

Typically P_X is described by a pmf p_X or a pdf f_X.

For the binary quantizer special case we derived P_X.

The idea generalizes and forces a technical condition on the definition of a
random variable (and hence also on random vectors and random
processes).

Inverse image formula

Given (Ω, B(Ω), P) and a random variable X, find P_X.

Basic method: P_X(F) = the probability, computed using P, of all the
original sample points that are mapped by X into the subset F:

    P_X(F) = P({ω : X(ω) ∈ F})

Shorthand way to write the formula in terms of the inverse image of an event
F ∈ B(𝒳) under the mapping X, X^{-1}(F) = {r : X(r) ∈ F}:

    P_X(F) = P(X^{-1}(F))

Inverse image formula: Pr(X ∈ F) = P({ω : X(ω) ∈ F}) = P(X^{-1}(F))

Written informally as P_X(F) = Pr(X ∈ F) = P{X ∈ F} = probability that
the random variable X assumes a value in F.

The inverse image formula is fundamental to probability, random
processes, and signal processing.

It shows how to compute probabilities of output events in terms of the
input probability space.

Does the definition make sense?
I.e., is P_X(F) = P(X^{-1}(F)) well-defined for all output events F?

Yes, if we include the requirement in the definition of a random variable.

Careful definition of a random variable

Given a probability space (Ω, F, P), a (real-valued) random variable
X is a function X : Ω → 𝒳 ⊂ R with the property that

    if F ∈ B(𝒳), then X^{-1}(F) ∈ F.

Notes:

In English: X : Ω → 𝒳 ⊂ R is a random variable iff the inverse
image of every output event is an input event, and therefore
P_X(F) = P(X^{-1}(F)) is well-defined for all events F.

Another name for a function with this property: measurable function.

Most every function we encounter is measurable, but the calculus of
probability rests on this property, and advanced courses prove the
measurability of important functions.

In the simple binary quantizer example, X is measurable (easy to show
since F = B([0, 1)) contains intervals).

Recall

    P_X({0}) = P({r : X(r) = 0}) = P(X^{-1}({0}))
             = P({r : 0 ≤ r ≤ 0.5}) = P([0, 0.5]) = 0.5

    P_X({1}) = P(X^{-1}({1})) = P((0.5, 1.0]) = 0.5

    P_X(𝒳) = P_X({0, 1}) = P(X^{-1}({0, 1})) = P([0, 1)) = 1

    P_X(∅) = P(X^{-1}(∅)) = P(∅) = 0.

In general, find P_X by computing the pmf or pdf, as appropriate.
Many shortcuts, but the basic approach is the inverse image formula.

Random vectors

A random vector can be discrete (described by a multidimensional pmf) or
continuous (e.g., described by a multidimensional pdf) or mixed.

All the theory, calculus, and applications of individual random variables are
useful for studying random vectors and random processes, since random
vectors and processes are simply collections of random variables.
One k-dimensional random vector = k 1-dimensional random
variables defined on a common probability space.

Several notations are used, e.g., X^k = (X0, X1, ..., X_{k−1}) is shorthand for

    X^k(ω) = (X0(ω), X1(ω), ..., X_{k−1}(ω))

or X, or {Xn; n = 0, 1, ..., k−1}, or {Xn; n ∈ Z_k}.

Earlier example: two coin flips, k coin flips (first k binary coefficients
of the fair spinner).

Recall that a real-valued function of a random variable is a random
variable.

Similarly, a real-valued function of a random vector (several random
variables) is a random variable. E.g., if X0, X1, ..., X_{n−1} are random
variables, then

    S_n = (1/n) Σ_{k=0}^{n−1} X_k

is a random variable, defined by

    S_n(ω) = (1/n) Σ_{k=0}^{n−1} X_k(ω).
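As an illustration (not from the notes, with arbitrary sizes) of a random variable defined as a function of a random vector, the Python sketch below computes S_n for simulated fair coin flips; by the law of large numbers the values concentrate near 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_trials = 100, 10_000

# Each row is one realization of the random vector (X_0, ..., X_{n-1})
flips = rng.integers(0, 2, size=(n_trials, n))

# S_n = (1/n) * sum_k X_k, one value per realization
S_n = flips.mean(axis=1)

print("typical S_n values:", S_n[:5])
print("mean of S_n ~", S_n.mean())   # close to 0.5
```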

Inverse image formula for random vectors

    P_X(F) = P(X^{-1}(F)) = P({ω : X(ω) ∈ F})
           = P({ω : (X0(ω), X1(ω), ..., X_{k−1}(ω)) ∈ F})

where the various forms are equivalent and all stand for Pr(X ∈ F).

Technically, the formula holds for suitable events F ∈ B(R^k), the Borel
field of R^k (or some suitable subset). See the book for discussion.

One multidimensional event of particular interest is a Cartesian
product of 1D events (called a rectangle):

    F = ×_{i=0}^{k−1} F_i = {x : x_i ∈ F_i; i = 0, ..., k−1},

for which

    P_X(F) = P({ω : X0(ω) ∈ F0, X1(ω) ∈ F1, ..., X_{k−1}(ω) ∈ F_{k−1}}).

Random processes

A random vector is a finite collection of rvs defined on a common
probability space.

A random process is an infinite family of rvs defined on a common
probability space. Also called a stochastic process. Many types:

    {Xn; n = 0, 1, 2, ...} (discrete-time, one-sided)
    {Xn; n ∈ Z} (discrete-time, two-sided)
    {Xt; t ∈ [0, ∞)} (continuous-time, one-sided)
    {Xt; t ∈ R} (continuous-time, two-sided)

In general: {Xt; t ∈ T} or {X(t); t ∈ T}.

Other notations: {X(t)}, {X[n]} (for discrete-time).

Sloppy but common: X(t); context tells us it is a random process and not a
single rv.

Discrete-time random processes are also called time series.

Always: a random process is an indexed family of random variables;
T is the index set.

For each t, Xt is a random variable. All Xt are defined on a common
probability space.

Keep in mind the suppressed argument ω: each Xt is Xt(ω), a
function defined on the sample space.

X(t) is X(t, ω); it can be viewed as a function of two arguments.

The index is usually time; in some applications it is space, e.g., a random
field {X(t, s); t, s ∈ [0, 1)} models a random image, and
{V(x, y, t); x, y ∈ [0, 1); t ∈ [0, ∞)} models analog video.

Have seen one example: fair coin flips, a Bernoulli random process.

Another, simpler, example:

Random sinusoids. Suppose that A and Θ are two random variables
with a joint pdf f_{A,Θ}(a, θ) = f_A(a) f_Θ(θ). For example, Θ ~ U([0, 2π))
and A ~ N(0, σ²). Define a continuous-time random process X(t) for
all t ∈ R by

    X(t) = A cos(2πt + Θ)

Or, making the dependence on ω explicit,

    X(t, ω) = A(ω) cos(2πt + Θ(ω)).
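A quick numerical sketch of the random sinusoid example (illustrative only, not from the notes; σ, the time grid, and the number of draws are arbitrary choices): each draw of (A, Θ) selects one sample function X(t, ω).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
t = np.linspace(0.0, 2.0, 500)

# Draw a few independent (A, Theta) pairs: A ~ N(0, sigma^2), Theta ~ U[0, 2*pi)
for _ in range(3):
    A = rng.normal(0.0, sigma)
    theta = rng.uniform(0.0, 2 * np.pi)
    x_t = A * np.cos(2 * np.pi * t + theta)   # one sample function of the process
    print(f"A={A:+.3f}, Theta={theta:.3f}, X(0)={x_t[0]:+.3f}")
```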

Derived distributions for random variables

General problem: Given a probability space (Ω, F, P) and a random
variable X with range space (alphabet) 𝒳, find the distribution P_X.

If X is discrete, then P_X is described by a pmf

    p_X(x) = P(X^{-1}({x})) = P({ω : X(ω) = x})

    P_X(F) = Σ_{x∈F} p_X(x) = P(X^{-1}(F))

If X is continuous, then we need a pdf.
But a pdf is not a probability, so the inverse image formula does not apply
immediately ⇒ alter the approach.

Cumulative distribution functions

Define the cumulative distribution function (cdf) by

    F_X(x) ≜ ∫_{−∞}^{x} f_X(r) dr = Pr(X ≤ x)

This is a probability, and the inverse image formula works:

    F_X(x) = P(X^{-1}((−∞, x]))

and from calculus

    f_X(x) = (d/dx) F_X(x)

So first find the cdf F_X(x), then differentiate to find f_X(x).

Notes:

If a ≤ b, then since (−∞, b] = (−∞, a] ∪ (a, b] is a union of disjoint
intervals, F_X(b) = F_X(a) + P_X((a, b]), and hence

    P_X((a, b]) = ∫_a^b f_X(x) dx = F_X(b) − F_X(a)

F_X(x) is monotonically nondecreasing.

If the original space (Ω, F, P) is a discrete probability space, then a rv X
defined on (Ω, F, P) is also discrete.

Inverse image formula:

    p_X(x) = P_X({x}) = P(X^{-1}({x})) = Σ_{ω: X(ω)=x} p(ω)

The cdf is well defined for discrete rvs:

    F_X(r) = Pr(X ≤ r) = Σ_{x: x≤r} p_X(x),

but not as useful. Not needed for derived distributions.

Example: discrete derived distribution

Ω = Z+, P determined by the geometric pmf

    p(k) = (1 − p)^{k−1} p,  k = 1, 2, ...

Define a random variable Y by

    Y(ω) = 1 if ω is even, 0 if ω is odd.

Using the inverse image formula for the pmf for Y(ω) = 1:

    p_Y(1) = Σ_{k even} (1 − p)^{k−1} p = Σ_{k=2,4,...} (1 − p)^{k−1} p
           = (p/(1 − p)) Σ_{k=1}^{∞} ((1 − p)²)^k = p(1 − p) Σ_{k=0}^{∞} ((1 − p)²)^k
           = p(1 − p)/(1 − (1 − p)²) = (1 − p)/(2 − p)

    p_Y(0) = 1 − p_Y(1) = 1/(2 − p)

Example: continuous derived distribution

Suppose the original space is (Ω, F, P) = (R, B(R), P), where P is
described by a pdf g:

    P(F) = ∫_{r∈F} g(r) dr;  F ∈ B(R).

X a rv. Inverse image formula:

    P_X(F) = P(X^{-1}(F)) = ∫_{r: X(r)∈F} g(r) dr.

If X is discrete, find the pmf

    p_X(x) = ∫_{r: X(r)=x} g(r) dr.

If X is continuous, we want the pdf. First find the cdf, then differentiate.
The quantizer example did this.

Square of a random variable

(R, B(R), P) with P induced by a Gaussian pdf.

Define W : R → R by W(r) = r²; r ∈ R.

Find the pdf f_W. First find the cdf F_W, then differentiate. If w < 0,
F_W(w) = 0. If w ≥ 0,

    F_W(w) = Pr(W ≤ w) = P({ω : W(ω) = ω² ≤ w})
           = P([−w^{1/2}, w^{1/2}]) = ∫_{−w^{1/2}}^{w^{1/2}} g(r) dr

This can be complicated, but we don't need to plug in g yet.
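A sanity check of the parity computation p_Y(1) = (1 − p)/(2 − p), not part of the notes: sum the geometric pmf over even k numerically (the value of p and the truncation length are arbitrary).

```python
import numpy as np

p = 0.3
k = np.arange(1, 2000)                 # truncate the infinite sum; tail is negligible
geom_pmf = (1 - p) ** (k - 1) * p

p_Y1_numeric = geom_pmf[k % 2 == 0].sum()   # sum over even k
p_Y1_formula = (1 - p) / (2 - p)

print(p_Y1_numeric, p_Y1_formula)      # the two values agree to many decimals
```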

Example: continuous derived distribution

Use the integral differentiation (Leibniz) formula to get the pdf directly:

    (d/dw) ∫_{a(w)}^{b(w)} g(r) dr = g(b(w)) (db(w)/dw) − g(a(w)) (da(w)/dw)

In our example

    f_W(w) = g(w^{1/2}) (w^{−1/2}/2) + g(−w^{1/2}) (w^{−1/2}/2)

E.g., if g = N(0, σ²), then

    f_W(w) = (w^{−1/2} / √(2πσ²)) e^{−w/(2σ²)};  w ∈ [0, ∞),

a chi-squared pdf with one degree of freedom.

The max and min functions

Let X ~ f_X(x) and Y ~ f_Y(y) be independent, so that
f_{X,Y}(x, y) = f_X(x) f_Y(y).

Define

    U = max{X, Y},  V = min{X, Y}

where

    max(x, y) = x if x ≥ y, y otherwise;
    min(x, y) = y if x ≥ y, x otherwise.

Find the pdfs of U and V.

To find the pdf of U, we first find its cdf. U ≤ u iff both X and Y are
≤ u, so using independence

    F_U(u) = Pr(U ≤ u) = Pr(X ≤ u, Y ≤ u) = F_X(u) F_Y(u)

Using the product rule for derivatives,

    f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u)

To find the pdf of V, first find the cdf. V ≤ v iff either X ≤ v or Y ≤ v, so
using independence

    F_V(v) = Pr(X ≤ v or Y ≤ v)
           = 1 − Pr(X > v, Y > v)
           = 1 − (1 − F_X(v))(1 − F_Y(v))

Thus

    f_V(v) = f_X(v) + f_Y(v) − f_X(v) F_Y(v) − f_Y(v) F_X(v)
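A Monte Carlo check of the max formula, not from the notes; it assumes scipy is available and uses two arbitrary independent Gaussians. The histogram of U = max{X, Y} is compared with f_U(u) = f_X(u)F_Y(u) + f_Y(u)F_X(u).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 200_000

# Independent X ~ N(0, 1) and Y ~ N(1, 2^2)
X = rng.normal(0.0, 1.0, n)
Y = rng.normal(1.0, 2.0, n)
U = np.maximum(X, Y)

# Compare a histogram of U with f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u)
bins = np.linspace(-4, 8, 50)
hist, edges = np.histogram(U, bins=bins, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f_U = (norm.pdf(centers, 0, 1) * norm.cdf(centers, 1, 2)
       + norm.pdf(centers, 1, 2) * norm.cdf(centers, 0, 1))

print("max abs deviation:", np.max(np.abs(hist - f_U)))   # small for large n
```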

Directly-given random variables

All named examples of pmfs (uniform, Bernoulli, binomial, geometric,
Poisson) and pdfs (uniform, exponential, Gaussian, Laplacian,
chi-squared, etc.) and the probability spaces they imply can be
considered as describing random variables:

Suppose (Ω, F, P) is a probability space with Ω ⊂ R.

Define a random variable V by

    V(ω) = ω,

the identity mapping; the random variable just reports the original sample
value.

This implies an output probability space in a trivial way:

    P_V(F) = P(V^{-1}(F)) = P(F)

If the original space is discrete (continuous), so is the random variable, and
the random variable is described by a pmf (pdf).

A random variable is said to be Bernoulli, binomial, etc. if its
distribution is determined by a Bernoulli, binomial, etc. pmf (or pdf).

Two random variables V and X (possibly defined on different
experiments) are said to be equivalent or identically distributed if
P_V = P_X, i.e., P_V(F) = P_X(F) for all events F.

E.g., both continuous with the same pdf, or both discrete with the same pmf.

Example: a binary random variable defined as a quantization of the fair
spinner vs. directly given as above.

Note: Two ways to describe random variables:

1. Describe a probability space (Ω, F, P) and define a function X on
   it. Together these imply the distribution P_X for the rv (by a pmf or pdf).

2. (Directly given) Describe the distribution P_X directly (by a pmf or pdf).
   Implicitly (Ω, F, P) = (𝒳, B(𝒳), P_X) and X(ω) = ω.

Both representations are useful.

Derived distributions: random vectors

As in the scalar case, the distribution can be described by probability
functions: cdfs and either pmfs or pdfs (or both).

If the random vector has a discrete range space, then the distribution can
be described by a multidimensional pmf p_X(x) = P_X({x}) = Pr(X = x) as

    P_X(F) = Σ_{(x0,x1,...,x_{k−1})∈F} p_{X0,X1,...,X_{k−1}}(x0, x1, ..., x_{k−1})

If the random vector X has a continuous range space, then the
distribution can be described by a multidimensional pdf f_X:

    P_X(F) = ∫_F f_X(x) dx

Use the multidimensional cdf to find the pdf.

Given a k-dimensional random vector X, define the cumulative
distribution function (cdf) F_X by

    F_X(x) = P_X(×_{i=0}^{k−1} (−∞, x_i])
           = P({ω : X_i(ω) ≤ x_i; i = 0, 1, ..., k−1})
           = P(∩_{i=0}^{k−1} X_i^{-1}((−∞, x_i])).

Other ways to express the multidimensional cdf:

    F_X(x) = F_{X0,X1,...,X_{k−1}}(x0, x1, ..., x_{k−1})
           = P_X({ξ : ξ_i ≤ x_i; i = 0, 1, ..., k−1})
           = Pr(X_i ≤ x_i; i = 0, 1, ..., k−1)
           = ∫_{−∞}^{x0} ∫_{−∞}^{x1} ... ∫_{−∞}^{x_{k−1}} f_{X0,X1,...,X_{k−1}}(ξ0, ξ1, ..., ξ_{k−1}) dξ0 dξ1 ... dξ_{k−1}

Integration and differentiation are inverses of each other:

    f_{X0,X1,...,X_{k−1}}(x0, x1, ..., x_{k−1})
        = ∂^k/(∂x0 ∂x1 ... ∂x_{k−1}) F_{X0,X1,...,X_{k−1}}(x0, x1, ..., x_{k−1}).

Joint and marginal distributions

Random vector X = (X0, X1, ..., X_{k−1}) is a collection of random
variables defined on a common probability space (Ω, F, P).

Alternatively, X is a random vector that takes on values randomly as
described by a probability distribution P_X, without explicit reference to
the underlying probability space.

Either the original probability measure P or the induced distribution
P_X can be used to compute probabilities of events involving the
random vector.

E.g., finding the distributions of individual components of the random
vector.

For example, if X = (X0, X1, ..., X_{k−1}) is discrete, described by a pmf
p_X, then the distribution P_{X0} is described by the pmf p_{X0}(x0), which
can be computed as

    p_{X0}(x0) = P({ω : X0(ω) = x0})
              = P({ω : X0(ω) = x0, X_i(ω) ∈ 𝒳; i = 1, 2, ..., k−1})
              = Σ_{x1,x2,...,x_{k−1}} p_X(x0, x1, x2, ..., x_{k−1})

In English, all of these are Pr(X0 = x0).

In general we have for cdfs that

    F_{X0}(x0) = P({ω : X0(ω) ≤ x0})
              = P({ω : X0(ω) ≤ x0, X_i(ω) ∈ 𝒳; i = 1, 2, ..., k−1})
              = F_X(x0, ∞, ∞, ..., ∞)

and, if the pdfs exist,

    f_{X0}(x0) = ∫ f_X(x0, x1, x2, ..., x_{k−1}) dx1 dx2 ... dx_{k−1}

Can find distributions for any of the components in this way:

    F_{Xi}(α) = F_X(∞, ∞, ..., ∞, α, ∞, ..., ∞),

or Pr(X_i ≤ α) = Pr(X_i ≤ α and X_j ∈ 𝒳, all j ≠ i).

Sum or integrate over all of the dummy variables corresponding to
the unwanted random variables in the vector to obtain the pmf or pdf
for the random variable X_i:

    p_{Xi}(α) = Σ_{x0,...,x_{i−1},x_{i+1},...,x_{k−1}} p_{X0,...,X_{k−1}}(x0, ..., x_{i−1}, α, x_{i+1}, ..., x_{k−1})

or

    f_{Xi}(α) = ∫ dx0 ... dx_{i−1} dx_{i+1} ... dx_{k−1} f_{X0,...,X_{k−1}}(x0, ..., x_{i−1}, α, x_{i+1}, ..., x_{k−1})

These relations are called consistency relationships: a random
vector distribution implies many other distributions, and these must
be consistent with each other.

Similarly we can find cdfs/pmfs/pdfs for any pairs or triples of random
variables in the random vector or any other subvector (at least in
theory).

2D random vectors

Ideas are clearest when there are only 2 rvs: (X, Y) a random vector.

The marginal distribution of X is obtained from the joint distribution of X
and Y by leaving Y unconstrained:

    P_X(F) = P_{X,Y}({(x, y) : x ∈ F, y ∈ R});  F ∈ B(R).

Marginal cdf of X is F_X(α) = F_{X,Y}(α, ∞).

If the range space of the vector (X, Y) is discrete,

    p_X(x) = Σ_y p_{X,Y}(x, y).

If the range space of the vector (X, Y) is continuous and the cdf is
differentiable so that f_{X,Y}(x, y) exists,

    f_X(x) = ∫ f_{X,Y}(x, y) dy,

with similar expressions for the distribution of the rv Y.

Joint distributions imply marginal distributions.
The opposite is not true without additional assumptions, e.g.,
independence.

Examples of joint and marginal distributions

Example

Suppose rvs X and Y are such that the random vector (X, Y) has a
pmf of the form

    p_{X,Y}(x, y) = r(x) q(y),

where r and q are both valid pmfs. (p_{X,Y} is a product pmf.)
Then

    p_X(x) = Σ_y p_{X,Y}(x, y) = Σ_y r(x) q(y) = r(x) Σ_y q(y) = r(x).

Thus in the special case of a product distribution, knowing the
marginal pmfs is enough to know the joint distribution. Thus marginal
distributions + independence ⇒ the joint distribution.

A pair of fair coins provides an example:

    p_{XY}(x, y) = p_X(x) p_Y(y) = 1/4;  x, y = 0, 1

    p_X(x) = p_Y(y) = 1/2;  x = 0, 1

Example of where marginals are not enough

Flip two fair coins connected by a piece of flexible rubber:

    p_{XY}(x, y)   y = 0   y = 1
    x = 0           0.4     0.1
    x = 1           0.1     0.4

    p_X(x) = p_Y(y) = 1/2,  x = 0, 1

Not a product distribution, but the same marginals as in the
product-distribution case.

Quite different joints can yield the same marginals. Marginals alone
do not tell the story.

Another example

A loaded pair of six-sided dice has the property that the sum of the two
dice = 7 on every roll.

All 6 possible combinations ((1,6), (2,5), (3,4), (4,3), (5,2), (6,1))
have equal probability.

Suppose the outcome of one die is X, the other is Y.

(X, Y) is a random vector taking values in {1, 2, ..., 6}²:

    p_{X,Y}(x, y) = 1/6,  x + y = 7,  (x, y) ∈ {1, 2, ..., 6}².

Find the marginal pmfs.

    p_X(x) = Σ_y p_{XY}(x, y) = p_{XY}(x, 7 − x) = 1/6,  x = 1, 2, ..., 6

Same as if it were a product distribution ⇒ marginals alone do not imply
the joint.

Continuous example

(X, Y) a random vector with a pdf that is constant on the unit disk in the
X-Y plane:

    f_{X,Y}(x, y) = C if x² + y² ≤ 1, 0 otherwise

Find the marginal pdfs. Is it a product pdf?

Need C:

    ∫∫_{x²+y²≤1} C dx dy = 1.

The integral = the area of a circle multiplied by C ⇒ C = 1/π.

    f_X(x) = ∫_{−√(1−x²)}^{+√(1−x²)} C dy = 2C √(1 − x²),  x² ≤ 1.

Could now also find C by a second integration:

    2C ∫_{−1}^{+1} √(1 − x²) dx = Cπ = 1,

or C = 1/π. Thus

    f_X(x) = (2/π) √(1 − x²),  x² ≤ 1.

By symmetry Y has the same pdf. f_{X,Y} is not a product pdf.

Note the marginal pdf is not constant, even though the joint pdf is.
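A quick numerical check of the disk marginal (not in the notes; sample size and bin count are arbitrary): rejection-sample points uniformly on the unit disk and compare a histogram of the x-coordinate with (2/π)√(1 − x²).

```python
import numpy as np

rng = np.random.default_rng(4)

# Rejection-sample points uniformly on the unit disk
pts = rng.uniform(-1, 1, size=(400_000, 2))
pts = pts[(pts**2).sum(axis=1) <= 1.0]

x = pts[:, 0]
bins = np.linspace(-1, 1, 41)
hist, edges = np.histogram(x, bins=bins, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f_X = (2 / np.pi) * np.sqrt(1 - centers**2)   # claimed marginal pdf

print("max abs deviation:", np.max(np.abs(hist - f_X)))
```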

Consistency & directly given processes

Need 2 < 1 for to be positive definite


To find the pdf of X , integrate joint over y
Do this using standard trick: complete the square:

Have seen two ways to describe (specify) a random variable as a


probability space + a function (random variable), or a directly given rv
(a distribution pdf or pmf)

x2 + y2 2xy = (y x)2 2 x2 + x2 = (y x)2 + (1 2)x2

Same idea works for random vectors.

2
(yx)2
(yx)2
x2
x
exp 2(1

exp

2)
2
2(12) exp 2
fX,Y (x, y) =
=
.

2
2 1 2
2(1 2)

What about random processes? E.g., direct definition of fair coin


flipping process.

Part of joint is N(x, 1 2), which integrates to 1. Thus


2

fX (x) = (2)1/2ex /2.


Note marginals the same regardless of !
EE278: Introduction to Statistical Signal Processing, winter 20102011

c R.M. Gray 2011

49

EE278: Introduction to Statistical Signal Processing, winter 20102011

c R.M. Gray 2011

50

For simplicity, consider a discrete time, discrete alphabet random
process, say {Xn}. Given the random process, we can use the inverse
image formula to compute the pmf for any finite collection of samples
(X_{k1}, X_{k2}, ..., X_{kK}), e.g.,

    p_{X_{k1},X_{k2},...,X_{kK}}(x1, x2, ..., xK) = Pr(X_{ki} = xi; i = 1, ..., K)
                                                  = P({ω : X_{ki}(ω) = xi; i = 1, ..., K})

For example, in the fair coin flipping process

    p_{X_{k1},X_{k2},...,X_{kK}}(x1, x2, ..., xK) = 2^{−K},  all (x1, x2, ..., xK) ∈ {0, 1}^K

The axioms of probability imply that these pmfs, for any choice of K and
k1, ..., kK, must be consistent in the sense that if any of the pmfs is
used to compute the probability of an event, the answer must be the
same. E.g.,

    p_{X1}(x1) = Σ_{x2} p_{X1,X2}(x1, x2)
              = Σ_{x0,x2} p_{X0,X1,X2}(x0, x1, x2)
              = Σ_{x3,x5} p_{X1,X3,X5}(x1, x3, x5)

since all of these computations yield the same probability in the
original probability space: Pr(X1 = x1) = P({ω : X1(ω) = x1}).

To completely describe a random process, you need only provide a
formula for a consistent family of pmfs for finite collections of
samples.

Bottom line: If given a discrete time, discrete alphabet random
process {Xn; n ∈ Z}, then for any finite K and collection of K sample
times k1, ..., kK we can find the joint pmf p_{X_{k1},X_{k2},...,X_{kK}}(x1, x2, ..., xK),
and this collection of pmfs must be consistent.

The same result holds for continuous time random processes and for
continuous alphabet processes (a family of pdfs).

Kolmogorov proved a converse to this idea, now called the
Kolmogorov extension theorem, which provides the most common
method for describing a random process:

Theorem. Kolmogorov extension theorem for discrete time
processes. Given a consistent family of finite-dimensional pmfs
p_{X_{k1},X_{k2},...,X_{kK}}(x1, x2, ..., xK) for all dimensions K and sample times
k1, ..., kK, there is a random process {Xn; n ∈ Z} described by
these marginals.

Difficult to prove, but the most common way to specify a model.
Kolmogorov or directly-given representation of a random process:
describe a consistent family of vector distributions. For completeness:

Theorem. Kolmogorov extension theorem.
Suppose that one is given a consistent family of finite-dimensional
distributions P_{X_{t0},X_{t1},...,X_{t_{k−1}}} for all positive integers k and all possible
sample times ti ∈ T; i = 0, 1, ..., k−1. Then there exists a random
process {Xt; t ∈ T} that is consistent with this family. In other words,
to describe a random process completely, it is sufficient to describe a
consistent family of finite-dimensional distributions of its samples.

Example: Given a pmf p, define a family of vector pmfs by

    p_{X_{k1},X_{k2},...,X_{kK}}(x1, x2, ..., xK) = Π_{i=1}^{K} p(x_i);

then there is a random process {Xn} having these vector pmfs for
finite collections of samples. A process of this form is called an iid
process.

The continuous alphabet analog is defined in terms of a pdf f:
define the vector pdfs by

    f_{X_{k1},X_{k2},...,X_{kK}}(x1, x2, ..., xK) = Π_{i=1}^{K} f(x_i).

A discrete time continuous alphabet process is iid if its joint pdfs
factor in this way.

Independent random variables

Return to the definition of independent rvs, with more explanation.
The definition of independent random variables is an application of the
definition of independent events.

Defined events F and G to be independent if P(F ∩ G) = P(F)P(G).

Two random variables X and Y defined on a probability space are
independent if the events X^{-1}(F) and Y^{-1}(G) are independent for all
F and G in B(R), i.e., if

    P(X^{-1}(F) ∩ Y^{-1}(G)) = P(X^{-1}(F)) P(Y^{-1}(G))

Equivalently, Pr(X ∈ F, Y ∈ G) = Pr(X ∈ F) Pr(Y ∈ G), or

    P_{XY}(F × G) = P_X(F) P_Y(G)

If X, Y are discrete, choosing F = {x}, G = {y} ⇒

    p_{XY}(x, y) = p_X(x) p_Y(y)  for all x, y

Conversely, if the joint pmf = product of marginals, then evaluate
Pr(X ∈ F, Y ∈ G) as

    P(X^{-1}(F) ∩ Y^{-1}(G)) = Σ_{x∈F, y∈G} p_{XY}(x, y) = Σ_{x∈F, y∈G} p_X(x) p_Y(y)
                             = Σ_{x∈F} p_X(x) Σ_{y∈G} p_Y(y) = P(X^{-1}(F)) P(Y^{-1}(G))

⇒ independent by the general definition.

For general random variables, consider F = (−∞, x], G = (−∞, y].
Then if X, Y are independent, F_{XY}(x, y) = F_X(x) F_Y(y) for all x, y. If pdfs
exist, this implies that

    f_{XY}(x, y) = f_X(x) f_Y(y)

Conversely, if this relation holds for all x, y, then
P(X^{-1}(F) ∩ Y^{-1}(G)) = P(X^{-1}(F)) P(Y^{-1}(G)), and hence X and Y are
independent.

A collection of rvs {Xi, i = 0, 1, ..., k−1} is independent or mutually
independent if all collections of events of the form
{Xi^{-1}(Fi); i = 0, 1, ..., k−1} are mutually independent for any
Fi ∈ B(R); i = 0, 1, ..., k−1.

A collection of discrete random variables Xi; i = 0, 1, ..., k−1 is
mutually independent iff

    p_{X0,...,X_{k−1}}(x0, ..., x_{k−1}) = Π_{i=0}^{k−1} p_{Xi}(xi)  for all xi.

A collection of continuous random variables is independent iff the
joint pdf factors as

    f_{X0,...,X_{k−1}}(x0, ..., x_{k−1}) = Π_{i=0}^{k−1} f_{Xi}(xi).

A collection of general random variables is independent iff the joint
cdf factors as

    F_{X0,...,X_{k−1}}(x0, ..., x_{k−1}) = Π_{i=0}^{k−1} F_{Xi}(xi);  (x0, x1, ..., x_{k−1}) ∈ R^k.

The random vector is independent, identically distributed (iid) if the
components are independent and the marginal distributions are all
the same.

Conditional distributions

Apply conditional probability to distributions.
Can express joint probabilities as products even if the rvs are not
independent.

E.g., the distribution of the input given the observed output (for inference).

There are many types: conditional pmfs, conditional pdfs, conditional
cdfs.

Elementary and nonelementary conditional probability.

Discrete conditional distributions

Simplest case: a direct application of elementary conditional probability to
pmfs.

Consider a 2D discrete random vector (X, Y):
alphabet A_X × A_Y, joint pmf p_{X,Y}(x, y), marginal pmfs p_X and p_Y.

Define for each x ∈ A_X for which p_X(x) > 0 the conditional pmf

    p_{Y|X}(y|x) = P(Y = y | X = x)
                 = P(Y = y, X = x) / P(X = x)
                 = P({ω : Y(ω) = y} ∩ {ω : X(ω) = x}) / P({ω : X(ω) = x})
                 = p_{X,Y}(x, y) / p_X(x),

the elementary conditional probability that Y = y given X = x.

Properties of conditional pmfs

For fixed x, p_{Y|X}(·|x) is a pmf:

    Σ_{y∈A_Y} p_{Y|X}(y|x) = Σ_{y∈A_Y} p_{X,Y}(x, y)/p_X(x) = (1/p_X(x)) Σ_{y∈A_Y} p_{X,Y}(x, y)
                           = (1/p_X(x)) p_X(x) = 1.

The joint pmf can be expressed as a product as

    p_{X,Y}(x, y) = p_{Y|X}(y|x) p_X(x).

Can compute conditional probabilities by summing conditional pmfs:

    P(Y ∈ F | X = x) = Σ_{y∈F} p_{Y|X}(y|x)

Can write probabilities of events of the form X ∈ G, Y ∈ F (rectangles)
as

    P(X ∈ G, Y ∈ F) = Σ_{x,y: x∈G, y∈F} p_{X,Y}(x, y)
                    = Σ_{x∈G} p_X(x) Σ_{y∈F} p_{Y|X}(y|x)
                    = Σ_{x∈G} p_X(x) P(F | X = x)

Later: define nonelementary conditional probability to mimic this
formula.

If X and Y are independent, then p_{Y|X}(y|x) = p_Y(y).

Given p_{Y|X} and p_X, Bayes rule for pmfs:

    p_{X|Y}(x|y) = p_{X,Y}(x, y)/p_Y(y) = p_{Y|X}(y|x) p_X(x) / Σ_u p_{Y|X}(y|u) p_X(u),

a result often referred to as Bayes rule.

Example of Bayes rule: Binary Symmetric Channel

Consider the following binary communication channel:
bit sent X ∈ {0, 1}, noise Z ∈ {0, 1}, bit received Y ∈ {0, 1}.

The bit sent is X ~ Bern(p), 0 ≤ p ≤ 1, the noise is Z ~ Bern(ε), 0 ≤ ε ≤ 0.5,
the bit received is Y = (X + Z) mod 2 = X ⊕ Z, and X and Z are
independent.

Find 1) p_{X|Y}(x|y), 2) p_Y(y), and 3) Pr{X ≠ Y}, the probability of error.

1. To find p_{X|Y}(x|y) use Bayes rule

    p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / Σ_{x'∈A_X} p_{Y|X}(y|x') p_X(x')

We know p_X(x), but we need to find p_{Y|X}(y|x):

    p_{Y|X}(y|x) = Pr{Y = y | X = x} = Pr{X ⊕ Z = y | X = x}
                 = Pr{x ⊕ Z = y | X = x} = Pr{Z = y ⊕ x | X = x}
                 = Pr{Z = y ⊕ x}  since Z and X are independent
                 = p_Z(y ⊕ x)

Therefore

    p_{Y|X}(0 | 0) = p_Z(0 ⊕ 0) = p_Z(0) = 1 − ε
    p_{Y|X}(0 | 1) = p_Z(0 ⊕ 1) = p_Z(1) = ε
    p_{Y|X}(1 | 0) = p_Z(1 ⊕ 0) = p_Z(1) = ε
    p_{Y|X}(1 | 1) = p_Z(1 ⊕ 1) = p_Z(0) = 1 − ε

Plugging into Bayes rule:

    p_{X|Y}(0|0) = p_{Y|X}(0|0) p_X(0) / (p_{Y|X}(0|0) p_X(0) + p_{Y|X}(0|1) p_X(1))
                 = (1 − ε)(1 − p) / ((1 − ε)(1 − p) + εp)

    p_{X|Y}(1|0) = 1 − p_{X|Y}(0|0) = εp / ((1 − ε)(1 − p) + εp)

    p_{X|Y}(0|1) = p_{Y|X}(1|0) p_X(0) / (p_{Y|X}(1|0) p_X(0) + p_{Y|X}(1|1) p_X(1))
                 = ε(1 − p) / ((1 − ε)p + ε(1 − p))

    p_{X|Y}(1|1) = 1 − p_{X|Y}(0|1) = (1 − ε)p / ((1 − ε)p + ε(1 − p))

2. We already found p_Y(y) as

    p_Y(y) = p_{Y|X}(y|0) p_X(0) + p_{Y|X}(y|1) p_X(1)
           = (1 − ε)(1 − p) + εp   for y = 0
           = ε(1 − p) + (1 − ε)p   for y = 1

3. Now to find the probability of error Pr{X ≠ Y}, consider

    Pr{X ≠ Y} = p_{X,Y}(0, 1) + p_{X,Y}(1, 0)
              = p_{Y|X}(1|0) p_X(0) + p_{Y|X}(0|1) p_X(1)
              = ε(1 − p) + εp = ε

An interesting special case is ε = 1/2. Here, Pr{X ≠ Y} = 1/2, which is
the worst possible (no information is sent), and

    p_Y(0) = (1/2) p + (1/2)(1 − p) = 1/2 = p_Y(1)

Therefore Y ~ Bern(1/2), independent of the value of p!

In this case, the bit sent X and the bit received Y are independent
(check this).
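The BSC formulas above are easy to verify by brute-force enumeration. The following sketch is not part of the notes; the values of p and ε are arbitrary.

```python
import numpy as np

p, eps = 0.3, 0.1           # p = Pr(X=1), eps = crossover probability (arbitrary values)

p_X = np.array([1 - p, p])
# p_{Y|X}(y|x): rows indexed by x, columns by y
p_Y_given_X = np.array([[1 - eps, eps],
                        [eps, 1 - eps]])

p_XY = p_X[:, None] * p_Y_given_X        # joint pmf p_{X,Y}(x, y)
p_Y = p_XY.sum(axis=0)                   # marginal of Y
p_X_given_Y = p_XY / p_Y[None, :]        # Bayes rule; columns sum to 1

print("p_Y:", p_Y)                        # ((1-eps)(1-p)+eps*p, eps*(1-p)+(1-eps)*p)
print("p_{X|Y}(0|0):", p_X_given_Y[0, 0])
print("Pr(X != Y):", p_XY[0, 1] + p_XY[1, 0])   # equals eps
```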

Conditional pmfs for vectors

Random vector (X0, X1, ..., X_{k−1}) with pmf p_{X0,X1,...,X_{k−1}}.

Define conditional pmfs (assuming the denominators are not 0)

    p_{Xl|X0,...,X_{l−1}}(x_l | x0, ..., x_{l−1}) = p_{X0,...,Xl}(x0, ..., x_l) / p_{X0,...,X_{l−1}}(x0, ..., x_{l−1})

⇒ chain rule:

    p_{X0,X1,...,X_{n−1}}(x0, x1, ..., x_{n−1})
        = [p_{X0,X1,...,X_{n−1}}(x0, x1, ..., x_{n−1}) / p_{X0,X1,...,X_{n−2}}(x0, x1, ..., x_{n−2})] p_{X0,X1,...,X_{n−2}}(x0, x1, ..., x_{n−2})
        ...
        = p_{X0}(x0) Π_{i=1}^{n−1} p_{X0,X1,...,Xi}(x0, x1, ..., x_i) / p_{X0,X1,...,X_{i−1}}(x0, x1, ..., x_{i−1})
        = p_{X0}(x0) Π_{l=1}^{n−1} p_{Xl|X0,...,X_{l−1}}(x_l | x0, ..., x_{l−1})

The formula plays an important role in characterizing memory in
processes. Can be used to construct joint pmfs, and to specify a
random process.

Continuous conditional distributions

Continuous distributions are more complicated.

Problem: the conditioning event has probability 0. Elementary
conditional probability does not work.

Given X, Y with joint pdf f_{X,Y} and marginal pdfs f_X, f_Y, define the
conditional pdf

    f_{Y|X}(y|x) ≜ f_{X,Y}(x, y) / f_X(x),

analogous to the conditional pmf, but unlike the conditional pmf, it is not a
conditional probability!

It is a density of conditional probability.

Nonelementary conditional probability

The conditional pdf is a pdf:

    ∫ f_{Y|X}(y|x) dy = ∫ (f_{X,Y}(x, y)/f_X(x)) dy = (1/f_X(x)) ∫ f_{X,Y}(x, y) dy
                      = (1/f_X(x)) f_X(x) = 1,

provided we require that f_X(x) > 0 over the region of integration.

Given a conditional pdf f_{Y|X}, define the (nonelementary) conditional
probability that Y ∈ F given X = x by

    P(Y ∈ F | X = x) ≜ ∫_F f_{Y|X}(y|x) dy.

Resembles the discrete form.

Does P(Y ∈ F | X = x) = ∫_F f_{Y|X}(y|x) dy make sense as an appropriate
definition of conditional probability given an event of zero probability?

Observe that, analogous to the result for pmfs, assuming the
pdfs all make sense,

    P(X ∈ G, Y ∈ F) = ∫_{x∈G} ∫_{y∈F} f_{X,Y}(x, y) dy dx
                    = ∫_{x∈G} f_X(x) [∫_{y∈F} f_{Y|X}(y|x) dy] dx
                    = ∫_{x∈G} f_X(x) P(F | X = x) dx

Our definition is ad hoc. But the careful mathematical definition of
conditional probability P(F | X = x) for an event of 0 probability is
made not by a formula such as we have used to define conditional
pmfs and pdfs and elementary conditional probability, but by its
behavior inside an integral (like the Dirac delta). In particular,
P(F | X = x) is defined as any measurable function satisfying the
equation for all events F and G, which our definition does.

Bayes rule for pdfs

Bayes rule:

    f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) = f_{Y|X}(y|x) f_X(x) / ∫ f_{Y|X}(y|u) f_X(u) du.

Example of conditional pdfs: 2D Gaussian

U = (X, Y), Gaussian pdf with mean (m_X, m_Y)^t and covariance matrix

    Λ = [ σ_X²        ρσ_X σ_Y ;
          ρσ_X σ_Y    σ_Y²     ],

Algebra

    det(Λ) = σ_X² σ_Y² (1 − ρ²)

    Λ^{-1} = (1/(1 − ρ²)) [ 1/σ_X²          −ρ/(σ_X σ_Y) ;
                            −ρ/(σ_X σ_Y)    1/σ_Y²       ]

so

    f_{XY}(x, y) = (1/(2π √det Λ)) exp(−(1/2)(x − m_X, y − m_Y) Λ^{-1} (x − m_X, y − m_Y)^t)
                 = (1/(2π σ_X σ_Y √(1 − ρ²)))
                   × exp(−(1/(2(1 − ρ²))) [((x − m_X)/σ_X)² − 2ρ (x − m_X)(y − m_Y)/(σ_X σ_Y) + ((y − m_Y)/σ_Y)²])

Rearrange:

    f_{XY}(x, y) = [exp(−(1/2)((x − m_X)/σ_X)²) / √(2πσ_X²)]
                   × [exp(−(y − m_Y − ρ(σ_Y/σ_X)(x − m_X))² / (2σ_Y²(1 − ρ²))) / √(2πσ_Y²(1 − ρ²))]

so

    f_{Y|X}(y|x) = exp(−(y − m_Y − ρ(σ_Y/σ_X)(x − m_X))² / (2σ_Y²(1 − ρ²))) / √(2πσ_Y²(1 − ρ²)),

a Gaussian with variance σ²_{Y|X} ≜ σ_Y²(1 − ρ²) and mean

    m_{Y|X} ≜ m_Y + ρ(σ_Y/σ_X)(x − m_X).
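A Monte Carlo check of the conditional mean and variance formulas, not from the notes; the parameter values, the conditioning point x0, and the binning width are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
m_X, m_Y, s_X, s_Y, rho = 1.0, -2.0, 2.0, 3.0, 0.6

cov = np.array([[s_X**2, rho * s_X * s_Y],
                [rho * s_X * s_Y, s_Y**2]])
samples = rng.multivariate_normal([m_X, m_Y], cov, size=500_000)
X, Y = samples[:, 0], samples[:, 1]

# Condition on X being near x0 and compare with the formulas above
x0 = 2.0
sel = np.abs(X - x0) < 0.05
print("empirical E[Y|X~x0]      :", Y[sel].mean())
print("m_Y + rho*(s_Y/s_X)*(x0-m_X):", m_Y + rho * (s_Y / s_X) * (x0 - m_X))
print("empirical Var[Y|X~x0]    :", Y[sel].var())
print("s_Y^2*(1-rho^2)          :", s_Y**2 * (1 - rho**2))
```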

Integrate the joint over y (as before); the conditional Gaussian factor
integrates to 1, leaving

    f_X(x) = e^{−(x − m_X)²/(2σ_X²)} / √(2πσ_X²).

Similarly, f_Y(y) and f_{X|Y}(x|y) are also Gaussian.

Note: X and Y jointly Gaussian ⇒ also both individually and
conditionally Gaussian!

Chain rule for pdfs

Assume f_{X0,X1,...,Xi}(x0, x1, ..., x_i) > 0. Then

    f_{X0,X1,...,X_{n−1}}(x0, x1, ..., x_{n−1})
        = [f_{X0,X1,...,X_{n−1}}(x0, x1, ..., x_{n−1}) / f_{X0,X1,...,X_{n−2}}(x0, x1, ..., x_{n−2})] f_{X0,X1,...,X_{n−2}}(x0, x1, ..., x_{n−2})
        ...
        = f_{X0}(x0) Π_{i=1}^{n−1} f_{X0,X1,...,Xi}(x0, x1, ..., x_i) / f_{X0,X1,...,X_{i−1}}(x0, x1, ..., x_{i−1})
        = f_{X0}(x0) Π_{i=1}^{n−1} f_{Xi|X0,...,X_{i−1}}(x_i | x0, ..., x_{i−1}).

Statistical detection and classification

A simple application of conditional probability mass functions
describing discrete random vectors.

Transmitted: discrete rv X, pmf p_X, p_X(1) = p
(e.g., one sample of a binary random process).

Received: rv Y.
Conditional pmf (noisy channel): p_{Y|X}(y|x).

More specific example as a special case: X Bernoulli with parameter p, and
a binary symmetric channel (BSC)

    p_{Y|X}(y|x) = ε if x ≠ y, 1 − ε if x = y.

Given an observation Y, what is the best guess X̂(Y) of the transmitted
value?

X̂ is a decision rule or detection rule.

Measure quality by the probability that the guess is correct:

    Pc(X̂) ≜ Pr(X = X̂(Y)) = 1 − Pe,

where

    Pe(X̂) ≜ Pr(X̂(Y) ≠ X).

A decision rule is optimal if it yields the smallest possible Pe or the
maximum possible Pc.

    Pc(X̂) = Pr(X = X̂(Y)) = 1 − Pe(X̂)
          = Σ_{(x,y): X̂(y)=x} p_{X,Y}(x, y)
          = Σ_{(x,y): X̂(y)=x} p_{X|Y}(x|y) p_Y(y)
          = Σ_y p_Y(y) Σ_{x: X̂(y)=x} p_{X|Y}(x|y)
          = Σ_y p_Y(y) p_{X|Y}(X̂(y)|y).

To maximize the sum, maximize p_{X|Y}(X̂(y)|y) for each y.

Accomplished by X̂(y) ≜ argmax_u p_{X|Y}(u|y), which yields

    p_{X|Y}(X̂(y)|y) = max_u p_{X|Y}(u|y)

This is the maximum a posteriori (MAP) detection rule.

In the binary example: Choose X̂(y) = y if ε < 1/2 and X̂(y) = 1 − y if
ε > 1/2.

The minimum (optimal) error probability over all possible rules is
min(ε, 1 − ε).

In the general nonbinary case, statistical detection is statistical
classification: the unseen X might be the presence or absence of a disease,
the observation Y the results of various tests.

General Bayesian classification allows weighting of the cost of different
kinds of errors (Bayes risk), so minimize a weighted average
(expected cost) instead of only the probability of error.

Additive noise: Discrete random variables

Common setup in communications, signal processing, statistics:

The original signal X has random noise W (independent of X) added to it;
observe Y = X + W.
Typically use the observation Y to make an inference about X.

Begin by deriving conditional distributions.

Discrete case: Have independent rvs X and W with pmfs p_X and p_W.
Form Y = X + W. Find p_Y.

Use the inverse image formula:

    p_{X,Y}(x, y) = Pr(X = x, Y = y) = Pr(X = x, X + W = y)
                  = Σ_{α,β: α=x, α+β=y} p_{X,W}(α, β) = p_{X,W}(x, y − x)
                  = p_X(x) p_W(y − x).

Note: The formula only makes sense if y − x is in the range space of W.

Thus

    p_{Y|X}(y|x) = p_{X,Y}(x, y)/p_X(x) = p_W(y − x).

Intuitive!

Marginal for Y:

    p_Y(y) = Σ_x p_{X,Y}(x, y) = Σ_x p_X(x) p_W(y − x)
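A quick numerical check of this formula (the pmfs below are arbitrary examples, not from the notes); numpy's convolve computes exactly the sum Σ_x p_X(x) p_W(y − x).

```python
import numpy as np

# pmfs on {0, 1, ...}: X uniform on {0,...,5} (a fair die shifted to start at 0),
# W Bernoulli(0.3); Y = X + W
p_X = np.full(6, 1 / 6)
p_W = np.array([0.7, 0.3])

p_Y = np.convolve(p_X, p_W)          # p_Y(y) = sum_x p_X(x) p_W(y - x)

print("support of Y:", np.arange(len(p_Y)))
print("p_Y:", p_Y)
print("sums to", p_Y.sum())          # 1.0
```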

The marginal p_Y is a discrete convolution.

The above uses ordinary real arithmetic. Similar results hold for other
definitions of addition, e.g., modulo 2 arithmetic for binary rvs.

As with linear systems, convolutions can usually be easily evaluated in
the transform domain. Will do shortly.

Additive noise: continuous random variables

X, W with f_{XW}(x, w) = f_X(x) f_W(w) (independent), Y = X + W.

Find f_{Y|X} and f_Y.

Since continuous, find the joint pdf by first finding the joint cdf:

    F_{X,Y}(x, y) = Pr(X ≤ x, Y ≤ y) = Pr(X ≤ x, X + W ≤ y)
                  = ∫∫_{α,β: α≤x, α+β≤y} f_{X,W}(α, β) dα dβ
                  = ∫_{−∞}^{x} dα f_X(α) ∫_{−∞}^{y−α} dβ f_W(β)
                  = ∫_{−∞}^{x} dα f_X(α) F_W(y − α).

Taking derivatives:

    f_{X,Y}(x, y) = f_X(x) f_W(y − x)

    f_{Y|X}(y|x) = f_W(y − x)

    f_Y(y) = ∫ f_{X,Y}(x, y) dx = ∫ f_X(x) f_W(y − x) dx,

a convolution integral of the pdfs f_X and f_W.

The pdf f_{X|Y} follows from Bayes rule:

    f_{X|Y}(x|y) = f_X(x) f_W(y − x) / ∫ f_X(α) f_W(y − α) dα

Additive Gaussian noise

Assume f_X = N(0, σ_X²), f_W = N(0, σ_W²), f_{XW}(x, w) = f_X(x) f_W(w),
Y = X + W.

Gaussian example:

    f_{Y|X}(y|x) = f_W(y − x) = e^{−(y−x)²/(2σ_W²)} / √(2πσ_W²),

which is N(x, σ_W²).

To find f_{X|Y} using Bayes rule, we need f_Y:

    f_Y(y) = ∫ f_{Y|X}(y|α) f_X(α) dα
           = ∫ [e^{−(y−α)²/(2σ_W²)}/√(2πσ_W²)] [e^{−α²/(2σ_X²)}/√(2πσ_X²)] dα
           = (1/(2πσ_Xσ_W)) ∫ exp(−(1/2)[(y² − 2yα + α²)/σ_W² + α²/σ_X²]) dα
           = (e^{−y²/(2σ_W²)}/(2πσ_Xσ_W)) ∫ exp(−(1/2)[(1/σ_X² + 1/σ_W²)α² − (2y/σ_W²)α]) dα

Can integrate by completing the square (later we will see an easier way
using transforms, but this trick is not difficult).

The integrand resembles

    exp(−(1/2)((α − m)/σ)²),

which has integral

    ∫ exp(−(1/2)((α − m)/σ)²) dα = √(2πσ²)

(a Gaussian pdf integrates to 1).

Compare

    (1/σ_X² + 1/σ_W²)α² − (2y/σ_W²)α   vs.   (1/σ²)(α − m)² = (1/σ²)α² − (2m/σ²)α + m²/σ².

The braced terms will be the same if we choose

    1/σ² = 1/σ_W² + 1/σ_X²  ⇒  σ² = σ_X² σ_W² / (σ_X² + σ_W²),

and

    m/σ² = y/σ_W²  ⇒  m = (σ²/σ_W²) y.

Then

    (1/σ_X² + 1/σ_W²)α² − (2y/σ_W²)α = ((α − m)/σ)² − m²/σ²,

completing the square, so

    ∫ exp(−(1/2)[(1/σ_X² + 1/σ_W²)α² − (2y/σ_W²)α]) dα
        = exp(m²/(2σ²)) ∫ exp(−(1/2)((α − m)/σ)²) dα = √(2πσ²) exp(m²/(2σ²))

Putting the pieces together,

    f_Y(y) = (e^{−y²/(2σ_W²)}/(2πσ_Xσ_W)) √(2πσ²) exp(m²/(2σ²))
           = exp(−y²/(2(σ_X² + σ_W²))) / √(2π(σ_X² + σ_W²))

So f_Y = N(0, σ_X² + σ_W²).

The sum of two independent zero-mean Gaussian rvs is another zero-mean
Gaussian rv; the variance of the sum = the sum of the variances.

For the a posteriori probability f_{X|Y}, use Bayes rule + algebra:

    f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y)
                 = [exp(−(y − x)²/(2σ_W²)) exp(−x²/(2σ_X²)) / exp(−y²/(2(σ_X² + σ_W²)))]
                   / √(2πσ_X²σ_W²/(σ_X² + σ_W²))
                 = exp(−(x − yσ_X²/(σ_X² + σ_W²))² / (2σ_X²σ_W²/(σ_X² + σ_W²)))
                   / √(2πσ_X²σ_W²/(σ_X² + σ_W²)).

That is,

    f_{X|Y}(x|y) = N( y σ_X²/(σ_X² + σ_W²),  σ_X²σ_W²/(σ_X² + σ_W²) ).

The mean of a conditional distribution is called a conditional mean; the
variance of a conditional distribution is called a conditional variance.
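A Monte Carlo check of these results, not from the notes; σ_X, σ_W, the conditioning point y0, and the binning width are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
s_X, s_W = 2.0, 1.0
n = 1_000_000

X = rng.normal(0.0, s_X, n)
W = rng.normal(0.0, s_W, n)
Y = X + W

print("Var(Y):", Y.var(), "vs", s_X**2 + s_W**2)

# Conditional mean of X given Y near y0 should be y0 * s_X^2 / (s_X^2 + s_W^2)
y0 = 1.5
sel = np.abs(Y - y0) < 0.05
print("E[X|Y~y0]:", X[sel].mean(), "vs", y0 * s_X**2 / (s_X**2 + s_W**2))
print("Var[X|Y~y0]:", X[sel].var(), "vs", s_X**2 * s_W**2 / (s_X**2 + s_W**2))
```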

Continuous additive noise with discrete input

Most important case of mixed distributions in communications
applications.

Typical: binary random variable X, Gaussian random variable W, X
and W independent, Y = X + W.

The previous examples do not work: one rv is discrete, the other continuous.

Similar signal processing issue: Observe Y, guess X.

As before, this may be one sample of a random process; in practice we have
{Xn}, {Wn}, {Yn}. At time n, observe Yn, guess Xn.

The conditional cdf F_{Y|X}(y|x) for Y given X = x is an elementary
conditional probability, analogous to the purely discrete and purely
continuous cases:

    F_{Y|X}(y|x) = Pr(Y ≤ y | X = x) = Pr(X + W ≤ y | X = x)
                 = Pr(x + W ≤ y | X = x) = Pr(W ≤ y − x | X = x)
                 = Pr(W ≤ y − x) = F_W(y − x)

Differentiating,

    f_{Y|X}(y|x) = (d/dy) F_{Y|X}(y|x) = (d/dy) F_W(y − x) = f_W(y − x)

so

    Pr(Y ∈ G) = Σ_x p_X(x) ∫_G f_{Y|X}(y|x) dy = Σ_x p_X(x) ∫_G f_W(y − x) dy.

Choosing G = (−∞, y] yields the cdf F_Y(y).

The joint distribution is described by a combination of pmf and pdf:

    Pr(X ∈ F and Y ∈ G) = Σ_{x∈F} p_X(x) ∫_G f_W(y − x) dy.

Choosing F = R yields

    f_Y(y) = Σ_x p_X(x) f_{Y|X}(y|x) = Σ_x p_X(x) f_W(y − x),

a convolution, analogous to the pure discrete and pure continuous cases.

Continuing the analogy, Bayes rule suggests a conditional pmf:

    p_{X|Y}(x|y) = f_{Y|X}(y|x) p_X(x) / f_Y(y) = f_{Y|X}(y|x) p_X(x) / Σ_α p_X(α) f_{Y|X}(y|α),

but this is not an elementary conditional probability; the conditioning
event has probability 0!

It can be justified in a similar way to conditional pdfs:

    Pr(X ∈ F and Y ∈ G) = ∫_G dy f_Y(y) Pr(X ∈ F | Y = y)
                        = ∫_G dy f_Y(y) Σ_{x∈F} p_{X|Y}(x|y)

so that p_{X|Y}(x|y) satisfies

    Pr(X ∈ F | Y = y) = Σ_{x∈F} p_{X|Y}(x|y)

Binary detection in Gaussian noise

Can now solve classical binary detection in Gaussian noise.

Apply to a binary input and Gaussian noise: the conditional pmf of the
binary input given the noisy observation is

    p_{X|Y}(x|y) = f_W(y − x) p_X(x) / f_Y(y)
                 = f_W(y − x) p_X(x) / Σ_α p_X(α) f_W(y − α);  y ∈ R, x ∈ {0, 1}.

The derivation of the MAP detector or classifier extends immediately
to a binary input random variable and independent Gaussian noise.

As in the purely discrete case, the MAP detector X̂(y) of X given Y = y is
given by

    X̂(y) = argmax_x p_{X|Y}(x|y) = argmax_x f_W(y − x) p_X(x) / Σ_α p_X(α) f_W(y − α).

The denominator of the conditional pmf does not depend on x, so the
denominator has no effect on the maximization:

    X̂(y) = argmax_x p_{X|Y}(x|y) = argmax_x f_W(y − x) p_X(x).

Assume for simplicity that X is equally likely to be 0 or 1:

    X̂(y) = argmax_x p_{X|Y}(x|y) = argmax_x (1/√(2πσ_W²)) exp(−(x − y)²/(2σ_W²))
          = argmin_x |x − y|

Minimum distance or nearest neighbor decision: choose the closest x to y.

A threshold detector:

    X̂(y) = 0 if y < 0.5, 1 if y > 0.5.

Error probability of the optimal detector:

    Pe = Pr(X̂(Y) ≠ X)
       = Pr(X̂(Y) ≠ 0 | X = 0) p_X(0) + Pr(X̂(Y) ≠ 1 | X = 1) p_X(1)
       = Pr(Y > 0.5 | X = 0) p_X(0) + Pr(Y < 0.5 | X = 1) p_X(1)
       = Pr(W + X > 0.5 | X = 0) p_X(0) + Pr(W + X < 0.5 | X = 1) p_X(1)
       = Pr(W > 0.5 | X = 0) p_X(0) + Pr(W + 1 < 0.5 | X = 1) p_X(1)
       = Pr(W > 0.5) p_X(0) + Pr(W < −0.5) p_X(1)

using the independence of W and X. In terms of the Gaussian cdf Φ:

    Pe = (1/2)[1 − Φ(0.5/σ_W)] + (1/2) Φ(−0.5/σ_W) = 1 − Φ(0.5/σ_W).
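A simulation of the threshold detector, not part of the notes; σ_W and the sample size are arbitrary. It checks the simulated error rate against 1 − Φ(0.5/σ_W).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)
s_W = 0.4
n = 1_000_000

X = rng.integers(0, 2, n)               # equally likely bits
W = rng.normal(0.0, s_W, n)
Y = X + W

X_hat = (Y > 0.5).astype(int)           # threshold (minimum distance) detector
Pe_sim = np.mean(X_hat != X)

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # Gaussian cdf
Pe_theory = 1 - Phi(0.5 / s_W)

print("simulated Pe:", Pe_sim)
print("theoretical 1 - Phi(0.5/sigma_W):", Pe_theory)
```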

Statistical estimation

In detection/classification problems, the goal is to guess which of a
discrete set of possibilities is true. The MAP rule is an intuitive solution.

Different if (X, Y) is continuous, we observe Y, and guess X.

Examples of estimation or regression instead of detection:

X, W independent Gaussian, Y = X + W. What is the best
guess of X given Y?

{Xn} is a continuous alphabet random process (perhaps Gaussian).
Observe X_{n−1}. What is the best guess for Xn? What if we observe
X0, X1, X2, ..., X_{n−1}?

The quality criterion for the discrete case no longer works: Pr(X̂(Y) = X) = 0
in general.

Will later introduce another quality measure (MSE) and optimize.
Now mention other approaches.

MAP Estimation

Mimic MAP detection: maximize the conditional probability function

    X̂_MAP(y) = argmax_x f_{X|Y}(x|y)

Easy to describe, an application of conditional pdfs + Bayes.

But we cannot argue it is optimal in the sense of maximizing quality.

Example: Gaussian signal plus noise.
Found f_{X|Y}(x|y) = Gaussian with mean y σ_X²/(σ_X² + σ_W²).

A Gaussian pdf is maximized at its mean ⇒ the MAP estimate of X given
Y = y is the conditional mean y σ_X²/(σ_X² + σ_W²).

Maximum Likelihood Estimation

The maximum likelihood (ML) estimate of X given Y = the value of x
that maximizes the conditional pdf f_{Y|X}(y|x) (instead of the a posteriori
pdf f_{X|Y}(x|y)):

    X̂_ML(y) = argmax_x f_{Y|X}(y|x).

Advantage: Do not need to know the prior f_X and use Bayes to find
f_{X|Y}(x|y). Simple.

In the Gaussian case, X̂_ML(y) = y.

Will return to estimation when we consider expectations in more detail.

Characteristic functions

When summing independent random variables, we find the derived
distribution by convolution of pmfs or pdfs.
Can be complicated; avoidable using transforms, as in linear systems.

Summing independent random variables arises frequently in signal
analysis problems. E.g., an iid random process {Xk} is put into a linear
filter to produce an output Yn = Σ_{k=1}^{n} h_{n−k} X_k.

What is the distribution of Yn?

An n-fold convolution is a mess. Describe a shortcut.

Transforms of probability functions are called characteristic functions.
A variation on Fourier/Laplace transforms. Notation varies.

For a discrete rv with pmf p_X, define the characteristic function M_X

    M_X(ju) = Σ_x p_X(x) e^{jux}

where u is usually assumed to be real.

A discrete exponential transform. Conventions for signs and scale factors
vary (⇒ notational differences in Fourier transforms).

Alternative useful form: Recall the definition of expectation of a random
variable g defined on a discrete probability space described by a pmf
p: E(g) = Σ_ω p(ω) g(ω).

Consider the probability space (𝒳, B(𝒳), P_X) with P_X described by the
pmf p_X.

This is the directly-given representation for the rv X; X is the identity
function on 𝒳: X(x) = x.

Define the random variable g(X) on this space by g(X)(x) = e^{jux}. Then
E[g(X)] = Σ_x p_X(x) e^{jux}, so that

    M_X(ju) = E[e^{juX}]

Characteristic functions, like probabilities, can be viewed as special
cases of expectation.

Resembles the discrete time Fourier transform

    F_ν(p_X) = Σ_x p_X(x) e^{j2πνx}

and the z-transform

    Z_z(p_X) = Σ_x p_X(x) z^x,

i.e.,

    M_X(ju) = F_{u/2π}(p_X) = Z_{e^{ju}}(p_X)

Properties of characteristic functions follow from those of
Fourier/Laplace/z/exponential transforms.

Characteristic functions and summing independent rvs

Can recover the pmf from M_X by suitable inversion. E.g., given
p_X(k); k ∈ Z_N,

    (1/2π) ∫_{−π}^{π} M_X(ju) e^{−juk} du = (1/2π) ∫_{−π}^{π} Σ_x p_X(x) e^{jux} e^{−juk} du
        = Σ_x p_X(x) (1/2π) ∫_{−π}^{π} e^{ju(x−k)} du
        = Σ_x p_X(x) δ_{k−x} = p_X(k).

But usually we invert by inspection or from tables, avoiding inverse
transforms.

Two independent random variables X, W with pmfs p_X and p_W and
characteristic functions M_X and M_W:

    Y = X + W

To find the characteristic function of Y

    M_Y(ju) = Σ_y p_Y(y) e^{juy}

use the inverse image formula

    p_Y(y) = Σ_{x,w: x+w=y} p_{X,W}(x, w)

to obtain

    M_Y(ju) = Σ_y [Σ_{x,w: x+w=y} p_{X,W}(x, w)] e^{juy} = Σ_y Σ_{x,w: x+w=y} p_{X,W}(x, w) e^{juy}
            = Σ_y Σ_{x,w: x+w=y} p_{X,W}(x, w) e^{ju(x+w)} = Σ_{x,w} p_{X,W}(x, w) e^{ju(x+w)}

The last sum factors:

    M_Y(ju) = Σ_{x,w} p_X(x) p_W(w) e^{jux} e^{juw} = [Σ_x p_X(x) e^{jux}] [Σ_w p_W(w) e^{juw}]
            = M_X(ju) M_W(ju)

Iterate:

Theorem 1. If {Xi; i = 1, ..., N} are independent random variables
with characteristic functions M_{Xi}, then the characteristic function of
the random variable Y = Σ_{i=1}^{N} Xi is

    M_Y(ju) = Π_{i=1}^{N} M_{Xi}(ju).

If the Xi are independent and identically distributed with common
characteristic function M_X, then

    M_Y(ju) = M_X^N(ju).

⇒ The transform of the pmf of the sum of independent random variables
is the product of their transforms.

Example: X Bernoulli with parameter p = p_X(1) = 1 − p_X(0)

    M_X(ju) = Σ_{k=0}^{1} p_X(k) e^{juk} = (1 − p) + p e^{ju}

{Xi; i = 1, ..., n} iid Bernoulli random variables, Yn = Σ_{i=1}^{n} Xi; then

    M_{Yn}(ju) = [(1 − p) + p e^{ju}]^n

With the binomial theorem,

    M_{Yn}(ju) = Σ_{k=0}^{n} p_{Yn}(k) e^{juk} = ((1 − p) + p e^{ju})^n
               = Σ_{k=0}^{n} (n choose k) (1 − p)^{n−k} p^k e^{juk},

so

    p_{Yn}(k) = (n choose k) (1 − p)^{n−k} p^k;  k ∈ Z_{n+1}.

The same idea works for continuous rvs.

For a continuous random variable X with pdf f_X, define the
characteristic function M_X of the random variable (or of the pdf) as

    M_X(ju) = E[e^{juX}] = ∫ f_X(x) e^{jux} dx.

Relates to the continuous-time Fourier transform

    F_ν(f_X) = ∫ f_X(x) e^{j2πνx} dx

and the Laplace transform

    L_s(f_X) = ∫ f_X(x) e^{sx} dx

by

    M_X(ju) = F_{u/2π}(f_X) = L_{ju}(f_X)

Paralleling the discrete case, consider again two independent random
variables X and W with pdfs f_X and f_W and characteristic functions M_X
and M_W, and let Y = X + W. As in the discrete case,

    M_Y(ju) = M_X(ju) M_W(ju).

Will later see a simple and general proof.

Uniqueness of transforms

Hence we can apply results from Fourier/Laplace transform theory. E.g.,
given a well-behaved density f_X(x); x ∈ R ⇒ M_X(ju), we can invert the
transform:

    f_X(x) = (1/2π) ∫ M_X(ju) e^{−jux} du.
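The transform shortcut can be checked numerically. A minimal sketch, not from the notes, assuming Yn is the sum of n iid Bernoulli(p) rvs with arbitrary p and n: take the DFT of the zero-padded pmf (a sampled version of the transform), raise it to the n-th power, and invert; the result matches the binomial pmf.

```python
import numpy as np
from math import comb

p, n = 0.3, 8
N = n + 1                                   # enough points to hold the pmf of the sum

p_X = np.zeros(N)
p_X[:2] = [1 - p, p]                        # Bernoulli pmf, zero-padded

M_X = np.fft.fft(p_X)                       # sampled transform of the pmf
p_Yn = np.real(np.fft.ifft(M_X ** n))       # product of transforms <-> n-fold convolution

binom = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(N)])
print("max abs error:", np.max(np.abs(p_Yn - binom)))
```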

Summing Independent Gaussian rvs

As in the discrete case, iterating gives result for many independent


rvs:

X N(m, 2)

If {Xi; i = 1, . . . , N} are independent random variables with


characteristic functions MXi , then the characteristic function of the
N
random variable Y = i=1
Xi is

MY ( ju) =

Characteristic function found by completing the square:

MXi ( ju).

i=1

If the Xi are independent and identically distributed with common


characteristic function MX , then

MY ( ju) = MXN ( ju).

= e jumu

2 2 /2

Thus N(m, 2) e jumu


c R.M. Gray 2011

EE278: Introduction to Statistical Signal Processing, winter 20102011

125

MYn ( ju) = [e jumu

2 2 /2 n

] = e ju(nm)u

2 (n2 )/2

2 2 /2

EE278: Introduction to Statistical Signal Processing, winter 20102011

c R.M. Gray 2011

126

{X_i; i = 1, ..., n} iid Gaussian random variables with pdfs N(m, σ²)

Y_n = Σ_{k=1}^{n} X_k

Then

M_{Y_n}(ju) = [e^{jum - u²σ²/2}]^n = e^{ju(nm) - u²(nσ²)/2}

= characteristic function of N(nm, nσ²)


Moral: Use characteristic functions to derive distributions of sums of
independent rvs.
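
The conclusion is easy to check by simulation as well (a minimal sketch; all parameter values are illustrative):

# Sketch: the sum of n iid N(m, sigma^2) samples behaves like N(n*m, n*sigma^2).
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma, trials = 5, 1.0, 2.0, 200_000
Y = rng.normal(m, sigma, size=(trials, n)).sum(axis=1)

print(Y.mean())   # close to n*m       = 5.0
print(Y.var())    # close to n*sigma^2 = 20.0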

Gaussian random vectors

A random vector is Gaussian if its density is Gaussian

Component rvs are jointly Gaussian

Description is complicated, but many nice properties

Multidimensional characteristic functions help derivation

Random vector X = (X_0, ..., X_{n-1}), vector argument u = (u_0, ..., u_{n-1})

n-dimensional characteristic function:

M_X(ju) = M_{X_0,...,X_{n-1}}(ju_0, ..., ju_{n-1}) = E[e^{ju^t X}] = E[exp(j Σ_{k=0}^{n-1} u_k X_k)]

Can be shown using multivariable calculus: a Gaussian rv with mean vector m and covariance matrix Λ has characteristic function

M_X(ju) = e^{ju^t m - u^t Λ u / 2} = exp(j Σ_{k=0}^{n-1} u_k m_k - (1/2) Σ_{k=0}^{n-1} Σ_{l=0}^{n-1} u_k Λ(k, l) u_l)

Same basic form as the Gaussian pdf, but depends directly on Λ, not Λ^{-1}

So exists more generally, only need Λ to be nonnegative definite (instead of strictly positive definite). Define a Gaussian rv more generally as a rv having a characteristic function of this form (the inverse transform will have singularities)
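
A sketch comparing this characteristic function with a Monte Carlo estimate of E[exp(j u^t X)] (the mean vector, covariance matrix, and u below are arbitrary illustrative choices):

# Sketch: M_X(ju) = exp(j u.m - u^T Lambda u / 2) for a Gaussian random vector.
import numpy as np

rng = np.random.default_rng(1)
m = np.array([1.0, -0.5, 0.2])
Lam = np.array([[2.0, 0.6, 0.3],
                [0.6, 1.5, 0.2],
                [0.3, 0.2, 1.0]])            # symmetric, positive definite
u = np.array([0.4, -0.3, 0.7])

closed = np.exp(1j * (u @ m) - 0.5 * (u @ Lam @ u))

X = rng.multivariate_normal(m, Lam, size=500_000)
monte_carlo = np.mean(np.exp(1j * (X @ u)))
print(closed, monte_carlo)                   # agree to Monte Carlo accuracy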

Further examples of random processes:

Have seen two ways to define rps: Indirectly in terms of an underlying probability space or directly (Kolmogorov representation) by describing a consistent family of joint distributions (via pmfs, pdfs, or cdfs).

Used to define discrete time iid processes and processes which can be constructed from iid processes by coding or filtering.

Introduce more classes of processes and develop some properties for various examples.

In particular: Gaussian random processes and Markov processes

Gaussian random processes

A random process {X_t; t ∈ T} is Gaussian if the random vectors (X_{t_0}, X_{t_1}, ..., X_{t_{k-1}}) are Gaussian for all positive integers k and all possible sample times t_i ∈ T; i = 0, 1, ..., k - 1.

Works for continuous and discrete time.

Consistent family? Yes, if all mean vectors and covariance matrices are drawn from a common mean function m(t); t ∈ T and covariance function Λ(t, s); t, s ∈ T; i.e., for any choice of sample times t_0, ..., t_{k-1} ∈ T the random vector (X_{t_0}, X_{t_1}, ..., X_{t_{k-1}}) is Gaussian with mean (m(t_0), m(t_1), ..., m(t_{k-1})) and covariance matrix Λ = {Λ(t_l, t_j); l, j ∈ Z_k}.

Gaussian random processes in both discrete and continuous time are extremely common in analysis of random systems and have many nice properties.
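
A sketch of the consistency idea in code: pick any finite set of sample times, build the mean vector and covariance matrix from m(t) and Λ(t, s), and draw the corresponding Gaussian vector. The particular mean and covariance functions below are illustrative choices, not from the notes.

# Sketch: sampling a Gaussian process at chosen times from m(t) and Lambda(t, s).
import numpy as np

rng = np.random.default_rng(2)
m_fn = lambda t: 0.0                                 # illustrative mean function
cov_fn = lambda t, s: np.exp(-abs(t - s))            # illustrative covariance function

times = np.array([0.1, 0.5, 1.2, 3.0])               # any finite set of sample times
mean = np.array([m_fn(t) for t in times])
Lam = np.array([[cov_fn(t, s) for s in times] for t in times])

sample = rng.multivariate_normal(mean, Lam)          # one draw of (X_t0, ..., X_tk-1)
print(sample)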

Discrete time Markov processes

An iid process is memoryless because the present is independent of the past.

A Markov process allows dependence on the past in a structured way.

Introduce via example.

A binary Markov process

{X_n; n = 0, 1, ...} is a Bernoulli process with

p_{X_n}(x) = p for x = 1 and p_{X_n}(x) = 1 - p for x = 0,

with p ∈ (0, 1) a fixed parameter.

Since the pmf p_{X_n}(x) does not depend on n, abbreviate to p_X:

p_X(x) = p^x (1 - p)^{1-x};  x = 0, 1.

Since the process is iid,

p_{X^n}(x^n) = Π_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)} (1 - p)^{n-w(x^n)},

where w(x^n) = Hamming weight of the binary vector x^n.

Let {X_n} be input to a device which produces an output binary process {Y_n} defined by

Y_n = Y_0 for n = 0,  Y_n = X_n ⊕ Y_{n-1} for n = 1, 2, ...

where Y_0 is a binary equiprobable random variable (p_{Y_0}(0) = p_{Y_0}(1) = 0.5), independent of all of the X_n, and ⊕ is mod 2 addition (linear filter using mod 2 arithmetic).
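
A minimal simulation sketch of this construction (p and the run length are illustrative; with a small p the output shows the long runs discussed below):

# Sketch: generate Y_n = X_n xor Y_{n-1} from a Bernoulli input process.
import numpy as np

rng = np.random.default_rng(3)
p, n = 0.05, 60                          # small p gives long runs of 0s and 1s
X = rng.random(n) < p                    # X_1, ..., X_n, each 1 with probability p
Y = np.empty(n + 1, dtype=int)
Y[0] = rng.integers(0, 2)                # equiprobable Y_0, independent of the X_n
for i in range(1, n + 1):
    Y[i] = Y[i - 1] ^ int(X[i - 1])      # mod 2 addition
print("".join(map(str, Y)))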

Alternatively:

Y_n = 1 if X_n ≠ Y_{n-1},  Y_n = 0 if X_n = Y_{n-1}.

This process is called a binary autoregressive process. As will be seen, it is also called the symmetric binary Markov process.

Unlike X_n, Y_n depends strongly on past values. If p < 1/2, Y_n is more likely to equal Y_{n-1} than not.

If p is small, Y_n is likely to have long runs of 0s and 1s.

Task: Find the joint pmfs for the new process: p_{Y^n}(y^n) = Pr(Y^n = y^n)

Use the inverse image formula:

p_{Y^n}(y^n) = Pr(Y^n = y^n)

= Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, ..., Y_{n-1} = y_{n-1})

= Pr(Y_0 = y_0, X_1 ⊕ Y_0 = y_1, X_2 ⊕ Y_1 = y_2, ..., X_{n-1} ⊕ Y_{n-2} = y_{n-1})

= Pr(Y_0 = y_0, X_1 ⊕ y_0 = y_1, X_2 ⊕ y_1 = y_2, ..., X_{n-1} ⊕ y_{n-2} = y_{n-1})

= Pr(Y_0 = y_0, X_1 = y_1 ⊕ y_0, X_2 = y_2 ⊕ y_1, ..., X_{n-1} = y_{n-1} ⊕ y_{n-2})

= p_{Y_0,X_1,X_2,...,X_{n-1}}(y_0, y_1 ⊕ y_0, y_2 ⊕ y_1, ..., y_{n-1} ⊕ y_{n-2})

= p_{Y_0}(y_0) Π_{i=1}^{n-1} p_X(y_i ⊕ y_{i-1}).

Used the facts that (1) a ⊕ b = c iff a = b ⊕ c, (2) Y_0, X_1, X_2, ..., X_{n-1} are mutually independent, and (3) the X_n are iid.

Plug in the specific forms of p_{Y_0} and p_X:

p_{Y^n}(y^n) = (1/2) Π_{i=1}^{n-1} p^{y_i ⊕ y_{i-1}} (1 - p)^{1 - (y_i ⊕ y_{i-1})}.

Marginal pmfs for Y_n evaluated by summing out the joints (total probability), e.g.,

p_{Y_1}(y_1) = Σ_{y_0} p_{Y_0,Y_1}(y_0, y_1) = Σ_{y_0} (1/2) p^{y_1 ⊕ y_0} (1 - p)^{1 - (y_1 ⊕ y_0)} = 1/2;  y_1 = 0, 1.

In a similar fashion it can be shown that the marginals for Y_n are all the same:

p_{Y_n}(y) = 1/2;  y = 0, 1;  n = 0, 1, 2, ...

Hence drop the subscript and abbreviate the pmf to p_Y.

Note: Would not be the same with a different initialization, e.g., Y_0 = 1.

Unlike the iid {X_n} process,

p_{Y^n}(y^n) ≠ Π_{i=0}^{n-1} p_Y(y_i)  (provided p ≠ 1/2)

so {Y_n} is not iid.

The joint is not a product of marginals, but can use the chain rule with conditional probabilities to write it as a product of conditional pmfs, given by

p_{Y_l|Y_0,Y_1,...,Y_{l-1}}(y_l|y_0, y_1, ..., y_{l-1}) = p_{Y^{l+1}}(y^{l+1}) / p_{Y^l}(y^l) = p_X(y_l ⊕ y_{l-1})
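
These formulas can be checked exactly for a small n by enumeration (a minimal sketch; p and n are illustrative):

# Sketch: enumerate p_{Y^n}(y^n) = (1/2) prod_i p^(y_i xor y_{i-1}) (1-p)^(1-(y_i xor y_{i-1})).
from itertools import product

p, n = 0.2, 4

def joint(y):
    prob = 0.5                                    # p_{Y_0}(y_0) = 1/2
    for i in range(1, n):
        d = y[i] ^ y[i - 1]
        prob *= p ** d * (1 - p) ** (1 - d)
    return prob

ys = list(product([0, 1], repeat=n))
print(sum(joint(y) for y in ys))                  # 1.0: a valid pmf
print(sum(joint(y) for y in ys if y[2] == 1))     # 0.5: marginals are all 1/2
print(joint((0, 0, 0, 0)), 0.5 ** n)              # joint != product of marginals: not iid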

Note: The conditional probability of the current output Y_l given the entire past Y_i; i = 0, 1, ..., l - 1 depends only on the most recent past output Y_{l-1}! This property can be summarized nicely by also deriving the conditional pmf

p_{Y_l|Y_{l-1}}(y_l|y_{l-1}) = p_{Y_{l-1},Y_l}(y_{l-1}, y_l) / p_{Y_{l-1}}(y_{l-1}) = p^{y_l ⊕ y_{l-1}} (1 - p)^{1 - (y_l ⊕ y_{l-1})}

Hence

p_{Y_l|Y_0,Y_1,...,Y_{l-1}}(y_l|y_0, y_1, ..., y_{l-1}) = p_{Y_l|Y_{l-1}}(y_l|y_{l-1}).

A discrete time random process with this property is called a Markov process or Markov chain.

The binary autoregressive process is a Markov process!
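
An empirical sketch of the Markov property (parameter values are illustrative): the relative frequency of Y_l = 1 given the pair (Y_{l-2}, Y_{l-1}) should depend only on Y_{l-1}.

# Sketch: estimate Pr(Y_l = 1 | Y_{l-2} = a, Y_{l-1} = b) from a long simulated path.
import numpy as np

rng = np.random.default_rng(4)
p, n = 0.2, 500_000
X = (rng.random(n) < p).astype(int)
Y = np.zeros(n + 1, dtype=int)
Y[0] = rng.integers(0, 2)
for i in range(1, n + 1):
    Y[i] = Y[i - 1] ^ X[i - 1]

for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    idx = np.where((Y[:-2] == a) & (Y[1:-1] == b))[0]
    est = np.mean(Y[idx + 2] == 1)
    print(a, b, round(est, 3))    # about p when b = 0 and 1 - p when b = 1, whatever a is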

The binomial counting process

Next filter a binary Bernoulli process using ordinary arithmetic.

{X_n} iid binary random process with marginal pmf p_X(1) = p = 1 - p_X(0).

Y_n = 0 for n = 0,  Y_n = Σ_{k=1}^{n} X_k = Y_{n-1} + X_n for n = 1, 2, ...

Y_n = output of a discrete time time-invariant linear filter with Kronecker delta response h_k given by h_k = 1 for k ≥ 0 and h_k = 0 otherwise.

By definition,

Y_n = Y_{n-1} or Y_n = Y_{n-1} + 1;  n = 1, 2, ...

A discrete time process with this property is called a counting process. Will later see a continuous time counting process which also can only increase by 1.

Already found the marginal pmf p_{Y_n}(k) using transforms to be binomial, hence the name binomial counting process.

To completely describe this process need a formula for the joint pmfs. Chain rule:

p_{Y_1,...,Y_n}(y_1, ..., y_n) = p_{Y_1}(y_1) Π_{l=2}^{n} p_{Y_l|Y_1,...,Y_{l-1}}(y_l|y_1, ..., y_{l-1})

Find the conditional pmfs, which imply the joints via the chain rule.

p_{Y_n|Y_{n-1},...,Y_1}(y_n|y_{n-1}, ..., y_1) = Pr(Y_n = y_n|Y_l = y_l; l = 1, ..., n - 1)

= Pr(X_n = y_n - y_{n-1}|Y_l = y_l; l = 1, ..., n - 1)

= Pr(X_n = y_n - y_{n-1}|X_1 = y_1, X_i = y_i - y_{i-1}; i = 2, 3, ..., n - 1)

= p_{X_n|X_{n-1},...,X_2,X_1}(y_n - y_{n-1}|y_{n-1} - y_{n-2}, ..., y_2 - y_1, y_1)

Follows since the conditioning event {Y_i = y_i; i = 1, 2, ..., n - 1} is the event {X_1 = y_1, X_i = y_i - y_{i-1}; i = 2, 3, ..., n - 1} and, given this event, the event Y_n = y_n is the event X_n = y_n - y_{n-1}.

X_n iid, so

p_{Y_n|Y_{n-1},...,Y_1}(y_n|y_{n-1}, ..., y_1) = p_X(y_n - y_{n-1})

Similar derivation:

p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = Pr(Y_n = y_n|Y_{n-1} = y_{n-1}) = Pr(X_n = y_n - y_{n-1}|Y_{n-1} = y_{n-1}).

The conditioning event depends only on values of X_k for k < n, hence

p_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = p_X(y_n - y_{n-1})

so {Y_n} is Markov.

A similar derivation works for the sum of iid rvs with any pmf p_X to show that

p_{Y_n|Y_{n-1},...,Y_1}(y_n|y_{n-1}, ..., y_1) = p_{Y_n|Y_{n-1}}(y_n|y_{n-1})

or, equivalently,

Pr(Y_n = y_n|Y_i = y_i; i = 1, ..., n - 1) = Pr(Y_n = y_n|Y_{n-1} = y_{n-1})  (Markov)

Hence the chain rule + the definition y_0 = 0 give

p_{Y_1,...,Y_n}(y_1, ..., y_n) = Π_{i=1}^{n} p_X(y_i - y_{i-1})

For the binomial counting process, use the Bernoulli p_X:

p_{Y_1,...,Y_n}(y_1, ..., y_n) = Π_{i=1}^{n} p^{(y_i - y_{i-1})} (1 - p)^{1 - (y_i - y_{i-1})},

where y_i - y_{i-1} = 0 or 1, i = 1, 2, ..., n;  y_0 = 0.
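
A simulation sketch of the marginal claim (values are illustrative): cumulative sums of Bernoulli(p) samples should have a binomial(n, p) marginal.

# Sketch: empirical marginal pmf of the binomial counting process at time n.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(5)
p, n, trials = 0.3, 10, 200_000
X = (rng.random((trials, n)) < p).astype(int)
Y = X.cumsum(axis=1)                                  # Y_1, ..., Y_n for each trial

empirical = np.bincount(Y[:, -1], minlength=n + 1) / trials
print(np.max(np.abs(empirical - binom.pmf(np.arange(n + 1), n, p))))   # small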

Discrete random walk

Slight variation: Let X_n be binary iid with alphabet {-1, 1} and Pr(X_n = -1) = p.

Y_n = 0 for n = 0,  Y_n = Σ_{k=1}^{n} X_k for n = 1, 2, ...

Also has an autoregressive format:

Y_n = Y_{n-1} + X_n,  n = 1, 2, ...

The transform of the iid random variables is

M_X(ju) = (1 - p) e^{ju} + p e^{-ju},

so

M_{Y_n}(ju) = ((1 - p) e^{ju} + p e^{-ju})^n

With the binomial theorem,

M_{Y_n}(ju) = Σ_{k=0}^{n} C(n, k) (1 - p)^{n-k} p^k e^{ju(n-2k)}

= Σ_{k=-n,-n+2,...,n-2,n} C(n, (n - k)/2) (1 - p)^{(n+k)/2} p^{(n-k)/2} e^{juk}.

Matching terms again gives the pmf:

p_{Y_n}(k) = C(n, (n - k)/2) (1 - p)^{(n+k)/2} p^{(n-k)/2},  k = -n, -n + 2, ..., n - 2, n.

Note that Y_n must be even or odd depending on whether n is even or odd. This follows from the nature of the increments.
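
A sketch checking this pmf against simulation, under the convention used above that Pr(X_n = -1) = p (the particular numbers are illustrative):

# Sketch: empirical pmf of the random walk Y_n versus
# p_{Y_n}(k) = C(n, (n-k)/2) (1-p)^((n+k)/2) p^((n-k)/2), k = -n, -n+2, ..., n.
import numpy as np
from math import comb

rng = np.random.default_rng(6)
p, n, trials = 0.3, 9, 400_000
steps = np.where(rng.random((trials, n)) < p, -1, 1)   # -1 with probability p, else +1
Y = steps.sum(axis=1)

for k in range(-n, n + 1, 2):
    formula = comb(n, (n - k) // 2) * (1 - p) ** ((n + k) // 2) * p ** ((n - k) // 2)
    empirical = np.mean(Y == k)
    print(k, round(formula, 4), round(empirical, 4))   # the two columns agree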

The discrete time Wiener process

{X_n} iid N(0, σ²).

As with the counting process, define

Y_n = 0 for n = 0,  Y_n = Σ_{k=1}^{n} X_k for n = 1, 2, ...

This is the discrete time Wiener process.

Handle in essentially the same way, but use cdfs and then pdfs.

Previously found the marginal f_{Y_n} using transforms to be N(0, nσ²).

To find the joint pdfs, use conditional pdfs and the chain rule:

f_{Y_1,...,Y_n}(y_1, ..., y_n) = Π_{l=1}^{n} f_{Y_l|Y_1,...,Y_{l-1}}(y_l|y_1, ..., y_{l-1}).
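
A simulation sketch (sigma and n are illustrative): the marginal of Y_n should be N(0, n sigma^2).

# Sketch: discrete time Wiener process Y_n = X_1 + ... + X_n with X_i ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(7)
sigma, n, trials = 1.5, 20, 100_000
Y = rng.normal(0.0, sigma, size=(trials, n)).cumsum(axis=1)

print(Y[:, -1].var())    # close to n * sigma^2 = 45.0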

To find the conditional pdf f_{Y_n|Y_1,...,Y_{n-1}}(y_n|y_1, ..., y_{n-1}), first find the conditional cdf P(Y_n ≤ y_n|Y_{n-i} = y_{n-i}; i = 1, 2, ..., n - 1). Analogous to the discrete case:

P(Y_n ≤ y_n|Y_{n-i} = y_{n-i}; i = 1, 2, ..., n - 1)

= P(X_n ≤ y_n - y_{n-1}|Y_{n-i} = y_{n-i}; i = 1, 2, ..., n - 1)

= P(X_n ≤ y_n - y_{n-1}) = F_X(y_n - y_{n-1})

Differentiating the conditional cdf to obtain the conditional pdf:

f_{Y_n|Y_1,...,Y_{n-1}}(y_n|y_1, ..., y_{n-1}) = ∂/∂y_n F_X(y_n - y_{n-1}) = f_X(y_n - y_{n-1}),

and hence the pdf chain rule gives

f_{Y_1,...,Y_n}(y_1, ..., y_n) = Π_{i=1}^{n} f_X(y_i - y_{i-1})  (with y_0 = 0).

If f_X = N(0, σ²),

f_{Y^n}(y^n) = (exp(-y_1²/2σ²)/√(2πσ²)) Π_{i=2}^{n} (exp(-(y_i - y_{i-1})²/2σ²)/√(2πσ²))

= (2πσ²)^{-n/2} exp(-(1/2σ²)(Σ_{i=2}^{n} (y_i - y_{i-1})² + y_1²)).

This is a joint Gaussian pdf with mean vector 0 and covariance matrix with (m, n) entry σ² min(m, n), m, n = 1, 2, ...
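
A sketch verifying that the chain-rule product of increment densities is exactly this joint Gaussian pdf (sigma and the evaluation point are arbitrary illustrative choices):

# Sketch: product of increment densities vs. the zero-mean Gaussian pdf
# with covariance sigma^2 * min(m, n).
import numpy as np
from scipy.stats import norm, multivariate_normal

sigma = 1.3
y = np.array([0.4, -0.2, 0.9, 1.1])                  # an arbitrary point (y_1, ..., y_4)
n = len(y)

increments = np.diff(np.concatenate(([0.0], y)))     # y_i - y_{i-1} with y_0 = 0
product_form = np.prod(norm.pdf(increments, scale=sigma))

cov = sigma ** 2 * np.minimum.outer(np.arange(1, n + 1), np.arange(1, n + 1))
joint_gaussian = multivariate_normal(mean=np.zeros(n), cov=cov).pdf(y)
print(np.isclose(product_form, joint_gaussian))      # True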
A similar argument implies that

f_{Y_n|Y_1,...,Y_{n-1}}(y_n|y_1, ..., y_{n-1}) = f_{Y_n|Y_{n-1}}(y_n|y_{n-1}),

with

f_{Y_n|Y_{n-1}}(y_n|y_{n-1}) = f_X(y_n - y_{n-1}).

As in the discrete alphabet case, a process with this property is called a Markov process.

Combine the discrete alphabet and continuous alphabet definitions into a common definition: a discrete time random process {Y_n} is said to be a Markov process if the conditional cdfs satisfy the relation

Pr(Y_n ≤ y_n|Y_{n-i} = y_{n-i}; i = 1, 2, ...) = Pr(Y_n ≤ y_n|Y_{n-1} = y_{n-1})

for all y_{n-1}, y_{n-2}, ...

More specifically, such a {Y_n} is frequently called a first-order Markov process because it depends on only the most recent past value. An extended definition to nth-order Markov processes can be made in the obvious fashion.
