
EXAM 1 MATERIAL

BASIC PROBABILITY
Probability of an event's complement: P[A^c] = 1 - P[A]
Probability of the union of events: P[A \cup B] = P[A] + P[B] - P[A \cap B]
Conditional Probability: P[A|B] = \frac{P[A \cap B]}{P[B]}
which directly implies that: P[A \cap B] = P[A|B] P[B] = P[B|A] P[A]
Law of Total Probability:
For event space \{B_1, B_2, \ldots, B_n\}, with P[B_i] > 0 for all i:
P[A] = \sum_{i=1}^{n} P[A|B_i] P[B_i]
Bayes' Rule: P[B|A] = \frac{P[A|B] P[B]}{P[A]}
Definition of Independence: P[A \cap B] = P[A] P[B]
which implies that: P[A|B] = P[A] and P[B|A] = P[B]
DISCRETE RANDOM VARIABLES


Probability Mass Function: P_X(x) = P[X = x]
Cumulative Distribution Function: F_X(x) = P[X \le x] = \sum_{t \le x} P_X(t)
***Expected Value: E[X] = \mu_X = \sum_{x} x P_X(x)
NOTES: It's very possible to have an expected value that couldn't actually occur as an outcome. For example, if a prof only gives out 90s and 100s, and there is a 50% likelihood of each, then the expected value of a grade is 95 - even though the prof will never actually give a 95.
Expected value is LINEAR!!!! This is a beautiful thing, allowing us to do things like this:
E[aX + b] = a E[X] + b and E[X + Y] = E[X] + E[Y]
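A quick numerical sanity check of both notes above (a minimal sketch in Python; the 90/100 grade example is from the text, the constants 3 and 7 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    grades = rng.choice([90, 100], size=100_000, p=[0.5, 0.5])

    print(grades.mean())            # ~95, even though no single grade is 95
    print(np.mean(3 * grades + 7))  # ~3*95 + 7 = 292, i.e., E[aX + b] = a E[X] + b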

***Variance: Var[X] = E[(X - \mu_X)^2] = E[X^2] - (E[X])^2
Standard Deviation: \sigma_X = \sqrt{Var[X]}
Conditional PMF: P_{X|A}(x) = P[X = x \mid A]
Conditional Expected Value: E[X|A] = \sum_{x} x P_{X|A}(x)
Families of Discrete Random Variables (list is not exhaustive, but includes the "most important"):
Bernoulli (p)
- single trial with two possible outcomes (e.g., flipping a coin, answering a yes/no question)
For 0 < p < 1:
P_X(x) = 1 - p for x = 0, p for x = 1 (0 otherwise); E[X] = p; Var[X] = p(1 - p)
Binomial (n, p)
- repeated trials of Bernoullis (e.g., flipping several coins in sequence, answering several yes/no questions in sequence)
For a positive integer n and 0 < p < 1:
P_X(x) = \binom{n}{x} p^x (1 - p)^{n-x} for x = 0, 1, \ldots, n (0 otherwise); E[X] = np; Var[X] = np(1 - p)
Geometric (p)
- number of trials up to and including the first success (e.g., running the Boston marathon every year until the year you manage to finish). Waiting for a given number of successes (e.g., cold-calling people for donations until you have six donations) is the closely related Pascal (negative binomial) RV.
For 0 < p < 1:
P_X(x) = (1 - p)^{x-1} p for x = 1, 2, \ldots (0 otherwise); E[X] = 1/p; Var[X] = (1 - p)/p^2
Discrete Uniform (a, b)
- outcomes in a given range all have an equal likelihood of occurring (e.g., rolling a die)
For integers a and b such that a < b:
P_X(x) = \frac{1}{b - a + 1} for x = a, a+1, \ldots, b (0 otherwise); E[X] = \frac{a + b}{2}; Var[X] = \frac{(b - a)(b - a + 2)}{12}
Poisson (\alpha) - involves a RATE of occurrences over time (e.g., number of dodgeball hits in 3 minutes, number of phone calls received in a given span of time)
For \alpha > 0:
P_X(x) = \frac{\alpha^x e^{-\alpha}}{x!} for x = 0, 1, 2, \ldots (0 otherwise); E[X] = \alpha; Var[X] = \alpha
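A minimal sanity check of the mean/variance formulas above using scipy.stats (the parameter values are arbitrary; scipy's parameterizations match the conventions used here):

    from scipy import stats

    n, p, alpha = 10, 0.3, 4.0
    print(stats.bernoulli(p).stats(moments='mv'))    # (p, p(1-p))
    print(stats.binom(n, p).stats(moments='mv'))     # (np, np(1-p))
    print(stats.geom(p).stats(moments='mv'))         # (1/p, (1-p)/p^2)
    print(stats.randint(1, 7).stats(moments='mv'))   # die roll: (3.5, 35/12)
    print(stats.poisson(alpha).stats(moments='mv'))  # (alpha, alpha)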

CONTINUOUS RANDOM VARIABLES


Probability Density Function: f_X(x)
The density function for continuous random variables is different from the mass function for discrete random variables in the sense that we are no longer looking for "mass" at a particular integer outcome to indicate the probability of that outcome. Instead, we consider the area under the function's curve within a range of outcomes as the indicator of probability for that range of outcomes; hence, we integrate between these limits to determine probabilities for outcomes of continuous random variables, and the total area under a density function must equal 1:
P[a < X \le b] = \int_a^b f_X(x) \, dx and \int_{-\infty}^{\infty} f_X(x) \, dx = 1

Cumulative Distribution Function: F_X(x) = P[X \le x] = \int_{-\infty}^{x} f_X(u) \, du
Because the cumulative distribution function is equal at every point to the value of the probability density function's integral from -\infty up to that point, finding the value of the CDF at a point is equivalent to finding the probability that a random variable's outcome is less than or equal to that point. In other words, its definition has not changed.

***Expected Value:
As with all transitions from discrete to continuous values, we should expect summations to be replaced by integrals, and that is the only change you see here:
E[X] = \mu_X = \int_{-\infty}^{\infty} x f_X(x) \, dx
and, as you might expect:
E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx
***Variance:
Because we define variance in terms of expected values, and because we have amended our definition of expected value to utilize the necessary integration, the definition of variance is exactly the same as before...
Var[X] = E[(X - \mu_X)^2] = E[X^2] - (E[X])^2
Standard Deviation:
Again, this is the same as before:
\sigma_X = \sqrt{Var[X]}
Families of Continuous Random Variables (list is not even close to exhaustive, but includes the "most important"):
Uniform (a, b)
- pdf is uniformly distributed on (a, b):
For constants a < b:
f_X(x) = \frac{1}{b - a} for a \le x \le b (0 otherwise); E[X] = \frac{a + b}{2}; Var[X] = \frac{(b - a)^2}{12}
Exponential (\lambda)
For \lambda > 0:
f_X(x) = \lambda e^{-\lambda x} for x \ge 0 (0 otherwise); E[X] = 1/\lambda; Var[X] = 1/\lambda^2
Erlang (n, \lambda)
For \lambda > 0 and a positive integer n:
f_X(x) = \frac{\lambda^n x^{n-1} e^{-\lambda x}}{(n - 1)!} for x \ge 0 (0 otherwise); E[X] = n/\lambda; Var[X] = n/\lambda^2

Gaussian (\mu, \sigma)
- NOTE: Gaussian RVs are HUGELY important!!!!! These babies AREN'T going away, so start loving them now ...
For constants \mu and \sigma > 0:
f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}; E[X] = \mu; Var[X] = \sigma^2
FUNCTIONS OF RANDOM VARIABLES


If random variable Y is a linear transformation of random variable X, i.e., Y = aX + b, then:
E[Y] = a E[X] + b
This shouldn't surprise you, because expected value is linear!!
Var[Y] = a^2 Var[X]
Again, there is no surprise here! Addition of a constant just shifts the location of the density function - it does NOT affect the spread of the pdf, which is what variance measures. Only the scaling affects the variance, and it shouldn't surprise you that the effect is quadratic, for variance is essentially a quadratic measure (the second moment minus the square of the expected value).
Are the above relationships true for ANY random variable? YES!!!! This is true for ANY random variable, provided that
the transformation of that random variable is LINEAR.
NOTE: Linear transformations of any kind on uniform and Gaussian random variables produce uniform and Gaussian
random variables, respectively.
This is not so for exponential and Erlang random variables, whose distributions are constrained to be 0 for values less than 0 and to start at 0. Hence, ONLY linear transformations of the form Y = aX, where a > 0, result in a Y that is an exponential or an Erlang random variable, respectively. Shifting the original X distribution in any direction results in a random variable that is neither exponential nor Erlang (respectively); likewise, scaling the original X distribution by a negative constant will flip the original distribution, again resulting in a random variable that is neither exponential nor Erlang (respectively).
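A small simulation of these facts (a minimal sketch; the exponential distribution and the constants a, b are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=200_000)  # E[X] = 2, Var[X] = 4

    a, b = 3.0, 5.0
    y = a * x + b
    print(y.mean(), a * x.mean() + b)  # E[Y] = a E[X] + b
    print(y.var(), a**2 * x.var())     # Var[Y] = a^2 Var[X]

    # The shift moves the support to [b, inf), so Y is no longer exponential:
    print(y.min())                     # ~b, not ~0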

THE STANDARD NORMAL (AND COMPLEMENTARY) CDFs


Standard Normal CDF:
NOTE: The standard normal CDF is just the CDF of a Gaussian PDF centered at 0, with variance 1. Because the CDF is always the integral of the PDF, this gives us the following definition:
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, du
Notice that the function under integration is exactly the 0 mean, variance 1 Gaussian.
Transformation of X, a Gaussian (\mu, \sigma) Random Variable, to Standard Normal Random Variable Z:
We need to be able to do this transformation because there is no analytical solution for the integration of a Gaussian pdf; thus, we need to be able to linearly transform a general Gaussian (non-zero mean and/or non-unity variance) into the standard Gaussian. To do this:
Z = \frac{X - \mu}{\sigma}
Probability that X is less than a:
This is straightforward, given the standard normal CDF. We simply convert our X value (a) to Z, and find the value of the CDF at that point, because the standard normal CDF is, by definition, P[Z \le z]:
P[X \le a] = \Phi\left(\frac{a - \mu}{\sigma}\right)
Probability that X is in (a, b]:
Extending the logic above, this is simply:
P[a < X \le b] = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)
Standard Normal Complementary CDF:
As we might expect from the fact that it's called the "complementary" CDF, this is defined as 1 - (value of standard normal CDF), or equivalently, as the probability that Z takes on some value LARGER than z:
Q(z) = 1 - \Phi(z) = P[Z > z]
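In code, \Phi and Q are just the normal CDF and its complement (the "survival function"); a minimal sketch with arbitrary numbers:

    from scipy.stats import norm

    mu, sigma, a, b = 10.0, 2.0, 9.0, 13.0
    z_a, z_b = (a - mu) / sigma, (b - mu) / sigma

    print(norm.cdf(z_a))                  # Phi((a - mu)/sigma) = P[X <= a]
    print(norm.cdf(z_b) - norm.cdf(z_a))  # P[a < X <= b]
    print(norm.sf(z_b))                   # Q((b - mu)/sigma) = P[X > b]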

EXAM 2 MATERIAL
Pairs of Random Variables
Relationships between Distributions/Mass Functions:
Discrete:
Joint PMF: P_{X,Y}(x, y) = P[X = x, Y = y]
Marginals: P_X(x) = \sum_y P_{X,Y}(x, y) and P_Y(y) = \sum_x P_{X,Y}(x, y)
Probability of Event A: P[A] = \sum_{(x,y) \in A} P_{X,Y}(x, y)
Joint Conditional (on Event A): P_{X,Y|A}(x, y) = \frac{P_{X,Y}(x, y)}{P[A]} for (x, y) \in A (0 otherwise)
NOTE: To find the marginal P_{X|A}(x), can marginalize the expression on the left over the values of y in A!
Conditional (on a Random Variable): P_{X|Y}(x|y) = \frac{P_{X,Y}(x, y)}{P_Y(y)} and P_{Y|X}(y|x) = \frac{P_{X,Y}(x, y)}{P_X(x)}
Bayes' Rule: P_{X|Y}(x|y) = \frac{P_{Y|X}(y|x) P_X(x)}{P_Y(y)}
Independence: Two RVs are independent if and only if P_{X,Y}(x, y) = P_X(x) P_Y(y) for all x, y
...which directly implies that for independent RVs: P_{X|Y}(x|y) = P_X(x) and P_{Y|X}(y|x) = P_Y(y)

Continuous:
Joint PDF: f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y}, where F_{X,Y}(x, y) = P[X \le x, Y \le y]
Marginals: f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy and f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx
Probability of Event A: P[A] = \iint_A f_{X,Y}(x, y) \, dx \, dy
Joint Conditional (on Event A): f_{X,Y|A}(x, y) = \frac{f_{X,Y}(x, y)}{P[A]} for (x, y) \in A (0 otherwise)
NOTE: To find the marginal f_{X|A}(x), can marginalize the expression on the left over the values of y in A!
Conditional (on a Random Variable): f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} and f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}
Bayes' Rule: f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)}
Independence: Two RVs are independent if and only if f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y
...which directly implies that for independent RVs: f_{X|Y}(x|y) = f_X(x) and f_{Y|X}(y|x) = f_Y(y)
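A toy joint PMF in numpy, showing marginalization and the independence check (the joint table here is made up for illustration):

    import numpy as np

    # P_{X,Y}(x, y) for x in {0, 1}, y in {0, 1, 2}; rows index x, columns index y
    P = np.array([[0.10, 0.20, 0.10],
                  [0.20, 0.10, 0.30]])

    P_X = P.sum(axis=1)  # marginal over y
    P_Y = P.sum(axis=0)  # marginal over x
    print(P_X, P_Y)

    # Independent iff the joint factors into the product of the marginals:
    print(np.allclose(P, np.outer(P_X, P_Y)))  # False for this table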

Covariance/Correlation:
Cov[X, Y] = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y
NOTE: Remember, shifts (addition of scalars) don't affect covariances. Also, scalars can be pulled out, i.e.,
Cov[aX + b, cY + d] = ac \, Cov[X, Y]
Correlation coefficient:
\rho_{X,Y} = \frac{Cov[X, Y]}{\sigma_X \sigma_Y}
NOTE: -1 \le \rho_{X,Y} \le 1, and:
\rho_{X,Y} = 0 \Leftrightarrow X and Y uncorrelated;
|\rho_{X,Y}| = 1 \Leftrightarrow X and Y completely correlated (linear relationship, i.e., Y = aX + b)
Uncorrelated vs. Independent RVs:

*Independent RVs are uncorrelated, but uncorrelated RVs are NOT necessarily independent, unless they are jointly Gaussian RVs!!
X,Y Uncorrelated: Cov[X, Y] = 0, i.e., E[XY] = E[X] E[Y]
X,Y Independent: f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y
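The classic counterexample in code: X uniform on {-1, 0, 1} and Y = X^2 are uncorrelated but clearly dependent (a minimal sketch):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.choice([-1, 0, 1], size=500_000)
    y = x**2  # Y is a deterministic function of X: dependent!

    print(np.mean(x * y) - x.mean() * y.mean())  # ~0: uncorrelated
    # But P[Y = 0 | X = 0] = 1, while P[Y = 0] = 1/3, so X, Y are NOT independent.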

Iterated Expectation:
E[X] = E[E[X|Y]] = \sum_y E[X|Y = y] P_Y(y) (discrete) or \int_{-\infty}^{\infty} E[X|Y = y] f_Y(y) \, dy (continuous)
Jointly Gaussian Random Variables


If X,Y are jointly Gaussian random variables (not necessarily independent), where X is Gaussian (\mu_X, \sigma_X) and Y is Gaussian (\mu_Y, \sigma_Y), then:

*Their joint distribution f_{X,Y}(x, y) is Gaussian (hence the phrase "jointly Gaussian")
*Their marginal distributions f_X(x) and f_Y(y) are Gaussian
*X,Y uncorrelated \Leftrightarrow X,Y independent
*Linear combinations of the Gaussians are Gaussian, i.e.:
W = aX + bY is Gaussian, with:
E[W] = a \mu_X + b \mu_Y and Var[W] = a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab \, Cov[X, Y]
*Linear transformations of X and Y are invertible transformations, and hence the transformed variables are also jointly Gaussian, i.e., aX + b and cY + d are not only marginally Gaussian, but also are jointly Gaussian!
*The conditional f_{X|Y}(x|y) is Gaussian, with:
E[X|Y = y] = \mu_X + \rho_{X,Y} \frac{\sigma_X}{\sigma_Y} (y - \mu_Y) and Var[X|Y = y] = \sigma_X^2 (1 - \rho_{X,Y}^2)
*The conditional f_{Y|X}(y|x) is Gaussian, with:
E[Y|X = x] = \mu_Y + \rho_{X,Y} \frac{\sigma_Y}{\sigma_X} (x - \mu_X) and Var[Y|X = x] = \sigma_Y^2 (1 - \rho_{X,Y}^2)
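A quick simulation of the linear-combination property (a minimal sketch; the means, variances, and covariance are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    mu = [1.0, -2.0]
    C = [[4.0, 1.5],  # Var[X] = 4, Var[Y] = 9, Cov[X, Y] = 1.5
         [1.5, 9.0]]
    X, Y = rng.multivariate_normal(mu, C, size=500_000).T

    a, b = 2.0, -1.0
    W = a * X + b * Y
    print(W.mean(), a * mu[0] + b * mu[1])                 # a mu_X + b mu_Y
    print(W.var(), a**2 * 4 + b**2 * 9 + 2 * a * b * 1.5)  # Var[W] formula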


Detection (Binary Hypothesis Testing)


We have two hypotheses, H_0 ("null hypothesis"/"nothing's there") and H_1 ("hypothesis"/"something's there"). We observe some random variable Y and want to decide, based on its value, which hypothesis is true.
Errors:
Missed Detection: Choose H_0 when H_1 is true.
False Alarm: Choose H_1 when H_0 is true.

Probability of Error: This is always equal to the probability of missed detection times the a priori probability of the hypothesis plus the probability of false alarm times the a priori probability of the null hypothesis, i.e.:
P_{err} = P[choose H_0 \mid H_1] P[H_1] + P[choose H_1 \mid H_0] P[H_0]
Expected Value of the Cost of Errors: All we do here is factor in the associated costs to the probability of error computation, i.e.:
E[C] = C_{MD} P[choose H_0 \mid H_1] P[H_1] + C_{FA} P[choose H_1 \mid H_0] P[H_0]
Detectors:
NOTE: All detectors are written for the continuous case. As always, for the discrete case, the expressions are the same, but with the PMFs (big "P") substituted for the densities (little "f").
Maximum Likelihood (ML) Detector: Most basic detector; compares the likelihood ratio to 1 in order to determine which density is larger. If the density in the numerator is larger, then the ratio is larger than 1; if the ratio is smaller than 1, then the density in the denominator must be larger.
Choose H_1 if L(y) = \frac{f_{Y|H_1}(y)}{f_{Y|H_0}(y)} > 1; otherwise choose H_0.
Maximum A Posteriori (MAP) Detector: Minimizes probability of error by weighting the likelihoods with the a priori probabilities. Note that if the a prioris are equal, then the MAP detector simplifies to the ML detector.
Choose H_1 if \frac{f_{Y|H_1}(y)}{f_{Y|H_0}(y)} > \frac{P[H_0]}{P[H_1]}; otherwise choose H_0.
Minimum Cost (Bayes' Risk) Detector: Minimizes expected value of the cost of error by weighting the likelihoods by both the a priori probabilities and the costs of missed detection/false alarm. Note that if the costs are equal, then the minimum cost detector simplifies to the MAP detector.
Choose H_1 if \frac{f_{Y|H_1}(y)}{f_{Y|H_0}(y)} > \frac{C_{FA} P[H_0]}{C_{MD} P[H_1]}; otherwise choose H_0.
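A sketch of all three detectors on a toy problem - Y Gaussian(0, 1) under H_0 and Gaussian(2, 1) under H_1 - with assumed priors and costs (none of these numbers come from the text):

    from scipy.stats import norm

    p0, p1 = 0.7, 0.3       # a priori probabilities (assumed)
    c_md, c_fa = 10.0, 1.0  # costs of missed detection / false alarm (assumed)

    def detect(y, threshold):
        # Choose H_1 (return 1) when the likelihood ratio exceeds the threshold.
        L = norm.pdf(y, loc=2) / norm.pdf(y, loc=0)
        return int(L > threshold)

    y = 1.2
    print(detect(y, 1.0))                        # ML: threshold 1
    print(detect(y, p0 / p1))                    # MAP: threshold P[H0]/P[H1]
    print(detect(y, (c_fa * p0) / (c_md * p1)))  # min cost threshold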


Useful Math Facts:


Simplification of the likelihood ratio used in detection becomes much easier when logarithms are taken. So, remember:
*the log of the product is the sum of the logs: \log(ab) = \log a + \log b
*the log of the quotient is the difference of the logs: \log(a/b) = \log a - \log b
*the log pulls exponents down in front: \log(a^b) = b \log a, and in particular \log(e^x) = x
(We engineers know that the log can be taken to any base, including base e, so no need to specify 'ln'.)
*\log 1 = 0


Estimation
We want to know X, but we can only observe Y; so, we have to make an educated guess at the value of X given that we
have observed some value of Y.
Biased/Unbiased:
bias = E[\hat{X} - X]
NOTE: If bias = 0, then E[\hat{X}] = E[X], and the estimator is said to be "unbiased".

ML Estimation:
Choose the value of X that maximizes the conditional distribution, called the "likelihood function":
\hat{x}_{ML}(y) = \arg\max_x f_{Y|X}(y|x)
MAP Estimation:
Choose the value of X that maximizes the conditional distribution, but now we multiply by the prior distribution on X, because we no longer assume equal priors (as in the ML case):
\hat{x}_{MAP}(y) = \arg\max_x f_{Y|X}(y|x) f_X(x)
Minimum Mean Square Error Estimation:

Choose the value of X that minimizes the mean square error, E[(X - \hat{X})^2], over ALL possible relationships (even nonlinear ones) between X and Y.

Case 1 (Blind Estimate): \hat{x} = E[X]
Case 2 (A, some attribute of X, is observed, thus restricting the possibilities for X): \hat{x} = E[X|A]
Case 3 (A dependent random variable's value, Y=y, is observed): \hat{x} = E[X|Y = y]
Properties of MMSE Estimator:


*Unbiased
*Estimate is orthogonal to (uncorrelated with) the estimation error
*All functions of the data used in the estimate are orthogonal to (uncorrelated with) the estimation error
Linear Least Squares Error Estimation:
Easier than MMSE, because we need the conditional distribution for MMSE! If we don't have that distribution, but know the key statistics of X and Y, then:

Choose the value of X that minimizes the mean square error, E[(X - \hat{X})^2], over ONLY LINEAR relationships between X and Y:
\hat{X}_L(Y) = a^* Y + b^* = \rho_{X,Y} \frac{\sigma_X}{\sigma_Y} (Y - \mu_Y) + \mu_X
where, clearly, a^* = \rho_{X,Y} \frac{\sigma_X}{\sigma_Y} = \frac{Cov[X, Y]}{Var[Y]} and b^* = \mu_X - a^* \mu_Y
Properties of LLSE Estimator:


*Unbiased
*Estimate is orthogonal to (uncorrelated with) the estimation error
*Estimation error is orthogonal to (uncorrelated with) the data used in the estimate
Mean Square Error of the LLSE Estimator:
NOTE: This is also called the "variance of the error", or \sigma_e^2, which is the variance in this case because the error has zero mean (the estimator is unbiased).
Let e = X - \hat{X}_L(Y) be the error in the linear estimate. Then the mean square error of the estimate is:
e_L^* = E[e^2] = \sigma_X^2 (1 - \rho_{X,Y}^2)

BIG NOTE: X,Y Jointly Gaussian:
--> the conditional distribution is Gaussian, and actually, E[X|Y = y] IS a linear function of y. So, in this case, \hat{X}_{MMSE} = \hat{X}_{L}!!!!!
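A sketch of computing the LLSE coefficients from sample statistics and checking the error-variance formula (the data model here is made up):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(0, 2, 300_000)
    y = 3 * x + rng.normal(0, 1, x.size)  # noisy linear observation (assumed model)

    a = np.cov(x, y, bias=True)[0, 1] / y.var()  # a* = Cov[X, Y]/Var[Y]
    b = x.mean() - a * y.mean()                  # b* = mu_X - a* mu_Y
    e = x - (a * y + b)                          # estimation error

    rho = np.corrcoef(x, y)[0, 1]
    print(e.var(), x.var() * (1 - rho**2))  # matches sigma_X^2 (1 - rho^2)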


EXAM 3 MATERIAL
LIMIT THEOREMS
Sums of Random Variables
Let W_n = X_1 + X_2 + \cdots + X_n. Then:
E[W_n] = E[X_1] + E[X_2] + \cdots + E[X_n] by the linearity of expected value.
Var[W_n] = \sum_{i=1}^{n} Var[X_i] + 2 \sum_{i < j} Cov[X_i, X_j]
which, if X_1, \ldots, X_n are independent (or just uncorrelated), simplifies to:
Var[W_n] = Var[X_1] + Var[X_2] + \cdots + Var[X_n]
Repeat after me: "If the RVs are independent, then the variance of the sum equals the sum of the variances."
NOTE: PDFs of sums of independent RVs are convolutions of the individual PDFs!!!
What if N is random? (i.e., we don't know how many variables we're adding ...)
Let R = X_1 + X_2 + \cdots + X_N, and let the X_i be i.i.d. (independent, identically distributed) and independent of N. Then:
E[R] = E[N] E[X] and Var[R] = E[N] Var[X] + Var[N] (E[X])^2
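A simulation of the random-sum formulas (a minimal sketch; choosing N Poisson and X exponential is arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    lam, mu = 4.0, 2.0  # E[N] = Var[N] = lam; E[X] = mu, Var[X] = mu^2

    N = rng.poisson(lam, size=100_000)
    R = np.array([rng.exponential(mu, size=n).sum() for n in N])

    print(R.mean(), lam * mu)                  # E[R] = E[N] E[X]
    print(R.var(), lam * mu**2 + lam * mu**2)  # E[N] Var[X] + Var[N] (E[X])^2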

Average of Random Variables

Let X_1, \ldots, X_n be i.i.d. (or even just uncorrelated) RVs, and let M_n(X) = \frac{1}{n}(X_1 + \cdots + X_n) be their sample mean. Then:
E[M_n(X)] = \mu_X and Var[M_n(X)] = \frac{\sigma_X^2}{n}
Markov Inequality
For non-negative RV X and any c > 0:
P[X \ge c] \le \frac{E[X]}{c}

Chebyshev Inequality (this gives a tighter bound than the Markov inequality ...)
For any RV X (not necessarily non-negative) and any c > 0:
P[|X - \mu_X| \ge c] \le \frac{\sigma_X^2}{c^2}
Laws of Large Numbers


Weak Law:
Let X_1, X_2, \ldots be i.i.d. (although the weak law holds for uncorrelated RVs). Then:
\lim_{n \to \infty} P[|M_n(X) - \mu_X| \ge \epsilon] = 0
for any \epsilon > 0, as arbitrarily close to 0 as we like. Stated another way:
\lim_{n \to \infty} P[|M_n(X) - \mu_X| < \epsilon] = 1
The WLLN says that as the number of RVs we're summing approaches infinity, the sample mean approaches the true mean with a probability that approaches certainty!!!!
Strong Law:
This is very similar to the weak law, except we require that the RVs be i.i.d., and we are essentially changing the above statement to say that as the number of samples approaches infinity, the sample mean actually equals the true mean with absolute certainty:
P[\lim_{n \to \infty} M_n(X) = \mu_X] = 1
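A quick look at the law of large numbers in action (a minimal sketch; the uniform distribution is arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 1, size=1_000_000)  # true mean is 0.5

    for n in (10, 1_000, 1_000_000):
        print(n, x[:n].mean())             # sample mean closes in on 0.5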

Central Limit Theorem


Remember, the CLT is your friend!!! It basically says that, when we sum together a bunch of i.i.d. RVs, we get
a Gaussian - which means we can apply all of those nice Gaussian properties simply by invoking the CLT.
Formally, this says that in the limit as n approaches infinity, the CDF of the sum (NOT the sample mean!!!) of the RVs approaches a Gaussian CDF. So, we can use:
P[W_n \le w] \approx \Phi\left(\frac{w - E[W_n]}{\sqrt{Var[W_n]}}\right) for X_1, \ldots, X_n i.i.d.
in which we transform to a standard normal Gaussian CDF, then use the phi or Q function to get the probability we seek. We can of course express the mean and standard deviation above in terms of the mean and standard deviation of X as follows...
Remember: E[W_n] = n \mu_X and Var[W_n] = n \sigma_X^2
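A CLT sanity check: standardize the sum of uniforms and compare a probability against \Phi (a minimal sketch):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(7)
    n, mu, sigma = 50, 0.5, np.sqrt(1 / 12)  # X ~ Uniform(0, 1)

    W = rng.uniform(0, 1, size=(100_000, n)).sum(axis=1)
    Z = (W - n * mu) / (sigma * np.sqrt(n))

    print(np.mean(Z <= 1.0), norm.cdf(1.0))  # empirical vs. Gaussian CDF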


Confidence Intervals

Here, the confidence interval is the interval [A, B] satisfying P[A \le \mu \le B] \ge 1 - \alpha, and A is called the lower limit, while B is called the upper limit. Moreover, 1 - \alpha is called the confidence coefficient. Notice that if \alpha is small, then we're more sure that the outcome we've estimated lies within the confidence interval.
By Chebyshev's inequality:
P[|M_n(X) - \mu_X| \ge c] \le \frac{\sigma_X^2}{n c^2} = \alpha
Theorem: X is a Gaussian RV with unknown mean \mu. The relationship between a confidence interval and the estimate of \mu, denoted M_n(X), is given by
P[M_n(X) - c \le \mu \le M_n(X) + c] = 2\Phi\left(\frac{c \sqrt{n}}{\sigma_X}\right) - 1 = 1 - \alpha
where \Phi is the standard normal CDF.

MARKOV CHAINS
Markov Property
Basically, it says we only need to know the current state in order to know the probabilities of the next state, so that there is only a one-step delay dependence:
P[X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0] = P[X_{n+1} = j \mid X_n = i]
Markov ("Stochastic") Matrices


Let

represent a stochastic matrix.

Row numbers are the states we're leaving; column numbers are the states at which we're arriving, and the -th
element is the probability of entering state from state .

The sum of each row must be 1, because at each time step, some action MUST be taken, whether it's to stay in the
same state or move to another.

Steady State Behavior of Homogeneous Markov Chains

We can iterate the matrix by raising it to the exponent that represents the time of interest to find the stochastic matrix at that time step:
P(n) = P^n

Under certain conditions, the matrix will converge to the steady state, where each row will contain the same steady state probability vector, denoted using the Greek lowercase letter pi:
\pi = \pi P, with \sum_i \pi_i = 1

\pi is the (left) eigenvector of P corresponding to eigenvalue 1. There can be as many steady state probability vectors as the matrix has eigenvalues equal to 1.
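A quick numeric check with a made-up two-state chain, finding \pi both by iterating the matrix and as the left eigenvector for eigenvalue 1:

    import numpy as np

    P = np.array([[0.9, 0.1],  # hypothetical transition matrix; rows sum to 1
                  [0.5, 0.5]])

    print(np.linalg.matrix_power(P, 50))  # every row converges to pi

    # pi as the left eigenvector of P for eigenvalue 1:
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.isclose(w, 1)].ravel())
    print(pi / pi.sum())                  # normalize so the entries sum to 1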

Definitions
accessible - State j is accessible from state i if there's a directed path from i to j.
communicate - States i and j communicate if j is accessible from i and i is accessible from j. A group whose members all communicate with each other is called a communicating class.
irreducible - In the state diagram or matrix, every state communicates with every other, and hence all states are active in the steady state, so none of them can be "reduced out".


transient - State i is transient if some state j is accessible from i, but i is not accessible from j. You can think of transient states as follows: Over time, "things" in that state will leak out to other non-transient states, until eventually there's virtually nothing left to leak out. Hence, \pi_i = 0 for any transient state i.
recurrent - Nontransient. The end.

NOTE: If there are multiple communicating classes with recurrent states, then the steady state probability vector
depends on the initial probability vector (where you probably started).

period - Greatest common divisor of the lengths of all possible cycles from all states back to themselves.
aperiodic - Period of the Markov chain is equal to 1.
NOTE: At least one self loop means that the Markov chain must have period 1, and thus must be aperiodic!!!
NOTE 2: If the period is greater than 1, then the chain will oscillate between steady state vectors, and will depend on
where you started.

Finding Steady State Vectors (without eigendecomposition)


Step 1. Write down the stochastic matrix P.

Step 2. Draw the state transition diagram.


Step 3. Identify the transient and recurrent states. If there are any transient states, then you already know the
probability of being in that state after reaching the steady state is 0! Hence, the element in the steady state vector that
corresponds to that state is equal to 0!!!
Step 4. Determine a system of equations, where the elements of \pi are the unknowns. (Remember, you need as many equations as you have unknowns in order to solve a system of equations.)
a) You get one of these equations for free: because \pi is a probability vector, its elements must sum to 1, so:
\sum_i \pi_i = 1
b) Draw a dashed line between one recurrent state and the rest of the states. Then, the probability of flowing into state i must equal the probability of flowing out of state i. That is, write an equation where
\sum_{j \ne i} \pi_j P_{ji} = \pi_i \sum_{j \ne i} P_{ij}

NOTE: Be sure not to draw dashed lines around transient states. Their steady state probabilities are zero, so
their values won't help you solve systems of equations involving them.
NOTE 2: Self-loops don't matter, because they're neither flowing into nor out of that state.
c) Repeat part b for a different recurrent state; do this until you have a system of as many equations as you
have unknowns.
d) Solve the system for the unknown elements of the steady state vector.
e) Check your work - your results should sum to 1!!!
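A sketch of Step 4 in code for a hypothetical 3-state chain: solve the balance equations plus the sum-to-1 constraint with numpy (all numbers made up):

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.0, 0.4, 0.6]])

    # pi P = pi  <=>  (P^T - I) pi = 0; replace one equation with sum(pi) = 1
    A = P.T - np.eye(3)
    A[-1, :] = 1.0
    b = np.array([0.0, 0.0, 1.0])

    pi = np.linalg.solve(A, b)
    print(pi, pi.sum())  # steady state vector; entries sum to 1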

