
cs109 Final Cheat Sheet

TG Sido
April 4, 2017

1 Fundamentals
1.1 DeMorgan’s Laws

( ⋃_{i=1}^{n} Ei )^c = ⋂_{i=1}^{n} Ei^c        ( ⋂_{i=1}^{n} Ei )^c = ⋃_{i=1}^{n} Ei^c

1.2 Axioms of Probability


Axiom 1 : 0 ≤ P (E) ≤ 1
Axiom 2 : P (S) = 1
Axiom 3 : For any sequence of mutually exclusive events E1 , E2 , . . .
P( ⋃_{i=1}^{∞} Ei ) = ∑_{i=1}^{∞} P(Ei)

1.3 Inclusion-Exclusion Identity


P (E ∪ F ) = P (E) + P (F ) − P (EF )
P( ⋃_{i=1}^{n} Ei ) = ∑_{r=1}^{n} (−1)^{r+1} ∑_{i1<⋯<ir} P(Ei1 Ei2 ⋯ Eir)

1.4 Number of Integer Solutions of Equations


There are (n−1 choose r−1) distinct positive integer-valued vectors (x1, x2, . . . , xr) satisfying the equation

x1 + x2 + ⋯ + xr = n,    xi > 0, i = 1, . . . , r

There are (n+r−1 choose r−1) distinct nonnegative integer-valued vectors (x1, x2, . . . , xr) satisfying the equation

x1 + x2 + ⋯ + xr = n,    xi ≥ 0, i = 1, . . . , r
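A quick Python sanity check of these stars-and-bars counts; the values of n and r are arbitrary illustrative choices, and the helper functions are just brute-force counters:

```python
from math import comb
from itertools import product

def count_positive_solutions(n, r):
    """Brute-force count of positive integer vectors (x1..xr) with sum n."""
    return sum(1 for xs in product(range(1, n + 1), repeat=r) if sum(xs) == n)

def count_nonneg_solutions(n, r):
    """Brute-force count of nonnegative integer vectors (x1..xr) with sum n."""
    return sum(1 for xs in product(range(n + 1), repeat=r) if sum(xs) == n)

n, r = 7, 3
print(comb(n - 1, r - 1))              # stars and bars: C(6, 2) = 15
print(count_positive_solutions(n, r))  # brute force agrees: 15
print(comb(n + r - 1, r - 1))          # C(9, 2) = 36
print(count_nonneg_solutions(n, r))    # brute force agrees: 36
```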

2 Conditional Probability
P(E∣F) = P(EF) / P(F)  ⇔  P(EF) = P(E∣F)P(F)

2.1 Generalized Chain Rule


P (E1 E2 . . . En ) = P (E1 )P (E2 ∣E1 )P (E3 ∣E1 E2 ) . . . P (En ∣E1 E2 . . . En−1 )

2.2 Bayes’ Theorem
The many shapes and forms of Bayes’ Theorem...
P(E) = P(E∣F)P(F) + P(E∣F^c)P(F^c)

P(F∣E) = P(EF) / P(E) = P(E∣F)P(F) / P(E)

P(F∣E) = P(E∣F)P(F) / [ P(E∣F)P(F) + P(E∣F^c)P(F^c) ]
Fully General Form:
If F1 , F2 , . . . , Fn comprise a set of mutually exclusive and exhaustive events, then
P(Fj∣E) = P(E∣Fj)P(Fj) / ∑_{i=1}^{n} P(E∣Fi)P(Fi)
That’s odd.
The odds of H given observed evidence E:
P(H∣E) / P(H^c∣E) = P(H)P(E∣H) / ( P(H^c)P(E∣H^c) )
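A small Python sketch of Bayes' theorem in action; the sensitivity, false-positive rate, and prevalence below are made-up illustrative numbers:

```python
# Hypothetical numbers: a test with 99% sensitivity, 5% false-positive rate,
# and 1% disease prevalence.
p_d = 0.01            # P(F):   prior probability of disease
p_e_given_d = 0.99    # P(E|F): test positive given disease
p_e_given_nd = 0.05   # P(E|F^c): false positive rate

# Total probability: P(E) = P(E|F)P(F) + P(E|F^c)P(F^c)
p_e = p_e_given_d * p_d + p_e_given_nd * (1 - p_d)

# Bayes' theorem: P(F|E) = P(E|F)P(F) / P(E)
p_d_given_e = p_e_given_d * p_d / p_e
print(f"P(disease | positive test) = {p_d_given_e:.3f}")   # ~0.167
```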

3 Independence
3.1 Definition
Two events are independent if P (EF ) = P (E)P (F ). Otherwise they are dependent.
More generally, events E1, E2, . . . , En are independent if for every subset E1′, E2′, . . . , Er′ with r ≤ n it holds that
P(E1′ E2′ ⋯ Er′) = P(E1′)P(E2′)⋯P(Er′)

3.2 Conditional Independence


Two events E and F are conditionally independent given G if
P (EF ∣G) = P (E∣G)P (F ∣G)
Dependent events can become independent, and vice-versa, by conditioning on additional information.

4 Random Distributions
4.1 Definitions and Properties
Probability Mass Function:
p(a) = P (X = a)
Probability Density Function:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx        P(−∞ < X < ∞) = ∫_{−∞}^{∞} f(x) dx = 1

Cumulative Distribution Function:


F(a) = P(X ≤ a)    where −∞ < a < ∞

F(a) = ∑_{all x ≤ a} p(x)        F(a) = ∫_{−∞}^{a} f(x) dx

Density f is the derivative of the CDF F: f(a) = (d/da) F(a)

4.2 Joint distributions
Joint Probability Mass Function:
pX,Y (a, b) = P (X = a, Y = b)
Marginal distributions:

pX(a) = P(X = a) = ∑_y pX,Y(a, y)        pY(b) = P(Y = b) = ∑_x pX,Y(x, b)

Joint Cumulative Probability Distribution (CDF):

FX,Y (a, b) = F (a, b) = P (X ≤ a, Y ≤ b) where − ∞ < a, b < ∞

Marginal distributions:

FX (a) = P (X ≤ a) = P (X ≤ a, Y < ∞) = FX,Y (a, ∞)


FY (b) = P (Y ≤ b) = P (X < ∞, Y ≤ b) = FX,Y (∞, b)

Joint Probability Density Function:


P(a1 < X ≤ a2, b1 < Y ≤ b2) = ∫_{a1}^{a2} ∫_{b1}^{b2} fX,Y(x, y) dy dx

FX,Y(a, b) = ∫_{−∞}^{a} ∫_{−∞}^{b} fX,Y(x, y) dy dx        fX,Y(a, b) = ∂²/∂a∂b FX,Y(a, b)
Marginal density functions:
fX(a) = ∫_{−∞}^{∞} fX,Y(a, y) dy        fY(b) = ∫_{−∞}^{∞} fX,Y(x, b) dx

4.3 Independent Random Variables


n random variables X1 , X2 , . . . , Xn are called independent if
P(X1 = x1, X2 = x2, . . . , Xn = xn) = ∏_{i=1}^{n} P(Xi = xi)    for all x1, x2, . . . , xn

or analogously for continuous random variables if


P(X1 ≤ a1, X2 ≤ a2, . . . , Xn ≤ an) = ∏_{i=1}^{n} P(Xi ≤ ai)    for all a1, a2, . . . , an

4.4 Convolution
Let X and Y be independent random variables. The convolution of FX and FY is FX+Y :

FX+Y(a) = P(X + Y ≤ a) = ∫_{y=−∞}^{∞} FX(a − y) fY(y) dy

fX+Y(a) = ∫_{y=−∞}^{∞} fX(a − y) fY(y) dy

In the discrete case, replace ∫_{y=−∞}^{∞} with ∑_y, and f(y) with p(y).
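A minimal Python sketch of the discrete convolution formula, summing the PMFs of two independent fair dice:

```python
from collections import defaultdict

# PMFs of two independent fair dice
p_x = {i: 1/6 for i in range(1, 7)}
p_y = {i: 1/6 for i in range(1, 7)}

# Discrete convolution: p_{X+Y}(a) = sum_y p_X(a - y) p_Y(y)
p_sum = defaultdict(float)
for y, py in p_y.items():
    for x, px in p_x.items():
        p_sum[x + y] += px * py

print(p_sum[7])   # 6/36 ~= 0.1667, the most likely total
```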

4.5 Conditional Distributions
Conditional PMF of X given Y :
pX∣Y(x∣y) = P(X = x ∣ Y = y) = pX,Y(x, y) / pY(y)
Conditional PDF of X given Y :
fX∣Y(x∣y) = fX,Y(x, y) / fY(y)
Conditional CDF of X given Y :
FX∣Y(a∣y) = P(X ≤ a ∣ Y = y) = ∑_{x ≤ a} pX∣Y(x∣y)    (discrete)
FX∣Y(a∣y) = ∫_{−∞}^{a} fX∣Y(x∣y) dx    (continuous)

n random variables X1 , X2 , . . . , Xn are conditionally independent given Y if


P(X1 = x1, X2 = x2, . . . , Xn = xn ∣ Y = y) = ∏_{i=1}^{n} P(Xi = xi ∣ Y = y)    for all x1, x2, . . . , xn, y

or analogously for continuous random variables if


P(X1 ≤ a1, X2 ≤ a2, . . . , Xn ≤ an ∣ Y = y) = ∏_{i=1}^{n} P(Xi ≤ ai ∣ Y = y)    for all a1, a2, . . . , an, y

It is possible to mix continuous and discrete random variables in conditional distributions. For example let X
be a continuous random variable and N be a discrete random variable. Then the conditional PDF of X given
N and the conditional PMF of N given X are
fX∣N(x∣n) = pN∣X(n∣x) fX(x) / pN(n)

pN∣X(n∣x) = fX∣N(x∣n) pN(n) / fX(x)

5 Expectation
5.1 Definitions
The expected value for a discrete random variable X is defined as
E[X] = ∑_{x: p(x)>0} x p(x)

For a continuous random variable X, the expected value is



E[X] = ∫_{−∞}^{∞} x f(x) dx

5.2 Properties
If I is an indicator variable for the event A, then
E[I] = P (A)
Let g(X) be a real-valued function of X.

E[g(X)] = ∑_i g(xi) p(xi)        E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Let g(X, Y ) be a real-valued function of two random variables.
E[g(X, Y)] = ∑_y ∑_x g(x, y) pX,Y(x, y)        E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy

Linearity:
E[aX + b] = aE[X] + b
n-th Moment of X:
E[X^n] = ∑_{x: p(x)>0} x^n p(x)

Expected Values of Sums:


E[ ∑_{i=1}^{n} Xi ] = ∑_{i=1}^{n} E[Xi]

Bounding Expectation:
If random variable X ≥ a then E[X] ≥ a.
If P (a ≤ X ≤ b) = 1 then a ≤ E[X] ≤ b.
If random variables X ≥ Y then E[X] ≥ E[Y ].

5.3 Conditional Expectation


Conditional Expectation of X given Y = y:
E[X∣Y = y] = ∑_x x pX∣Y(x∣y)        E[X∣Y = y] = ∫_{−∞}^{+∞} x fX∣Y(x∣y) dx

Expectation of conditional sum:


E[ ∑_{i=1}^{n} Xi ∣ Y = y ] = ∑_{i=1}^{n} E[Xi ∣ Y = y]

Expectation of conditional expectations:


E[E[X∣Y ]] = E[X]

6 Variance
6.1 Definition
If X is a random variable with mean µ then the variance of X, denoted Var(X), is:

Var(X) = E[(X − µ)2 ] = E[X 2 ] − (E[X])2

6.2 Properties
Var(aX + b) = a2 Var(X)
If X1 , X2 , . . . , Xn are independent random variables, then
Var( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} Var(Xi)

6.3 Covariance
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]
If X and Y are independent, then Cov(X, Y) = 0.
Properties:

Cov(X, Y ) = Cov(Y, X)
Cov(X, X) = Var(X)
Cov(aX + b, Y ) = aCov(X, Y )

If X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Ym are random variables, then

Cov( ∑_{i=1}^{n} Xi , ∑_{j=1}^{m} Yj ) = ∑_{i=1}^{n} ∑_{j=1}^{m} Cov(Xi, Yj)

6.4 Correlation
ρ(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) )
Note: −1 ≤ ρ(X, Y ) ≤ 1.
Correlation measures linearity between X and Y .
If ρ(X, Y ) = 0, X and Y are uncorrelated.
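A short NumPy sketch estimating covariance and correlation from simulated samples; the linear relationship Y = 2X + noise is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)   # linearly related to x, so strongly correlated

# Sample covariance matrix: entry [0, 1] is Cov(X, Y)
print(np.cov(x, y)[0, 1])             # close to 2
# Sample correlation matrix: entry [0, 1] is rho(X, Y)
print(np.corrcoef(x, y)[0, 1])        # close to 2/sqrt(5) ~= 0.894
```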

7 Moment Generating Functions


7.1 Definition
Moment Generating Function (MGF) of a random variable X, where −∞ < t < ∞, is

M (t) = E[etX ]

When X is discrete: M(t) = ∑_x e^{tx} p(x)
When X is continuous: M(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx

For any n random variables X1 , X2 , . . . , Xn

M (t1 , t2 , . . . , tn ) = E[et1 X1 +t2 X2 +⋯+tn Xn ]

The individual moment generating functions are obtained from the joint MGF:

MXi(t) = E[e^{t Xi}] = M(0, . . . , 0, t, 0, . . . , 0)    where t is in the i-th position

7.2 Properties
M^(n)(t) = (d^n/dt^n) M(t) = E[X^n e^{tX}]
M^(n)(0) = E[X^n]
MX (t) = MY (t) iff X ∼ Y
X1 , X2 , . . . , Xn are independent if and only if:

M (t1 , t2 , . . . , tn ) = MX1 (t1 )MX2 (t2 ) . . . MXn (tn )

8 Inequalities
8.1 Boole’s Inequality
Let E1 , E2 , . . . , En be events with indicator random variables Xi .
∑_{i=1}^{n} P(Ei) ≥ P( ⋃_{i=1}^{n} Ei )

8.2 Markov’s Inequality


X is a nonnegative random variable.

P(X ≥ a) ≤ E[X] / a    for all a > 0

8.3 Chebyshev’s Inequality


X is a random variable with E[X] = µ and Var(X) = σ 2 .

P(∣X − µ∣ ≥ k) ≤ σ²/k²    for all k > 0
One-sided inequality:
P(X ≥ E[X] + a) ≤ σ² / (σ² + a²)    for any a > 0
P(X ≤ E[X] − a) ≤ σ² / (σ² + a²)    for any a > 0

8.4 Chernoff Bound


X is a random variable with MGF M (t).

P (X ≥ a) ≤ e−ta M (t) for all t > 0

P (X ≤ a) ≤ e−ta M (t) for all t < 0


In practice, use the t that minimizes e^{−ta} M(t).
For a Poisson(λ) variable and P(X ≥ i), the minimizing t is ln(i/λ).
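A small Python check of the Chernoff bound for a Poisson variable, using the minimizing t = ln(i/λ); the values of λ and i are arbitrary:

```python
import math

lam, i = 10, 20                           # X ~ Poisson(10), bound P(X >= 20)
t = math.log(i / lam)                     # minimizing t = ln(i / lambda)
mgf = math.exp(lam * (math.exp(t) - 1))   # Poisson MGF: e^{lambda(e^t - 1)}
bound = math.exp(-t * i) * mgf
print(f"Chernoff bound on P(X >= {i}): {bound:.4f}")

# Exact tail probability for comparison (the bound should be larger)
exact = 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(i))
print(f"Exact P(X >= {i}):           {exact:.4f}")
```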

8.5 Jensen’s Inequality


If f (x) is a convex function (f ′′ (x) ≥ 0 for all x) then

E[f(X)] ≥ f(E[X])

9 Maximum Likelihood Estimator


9.1 Derivation Method
1) Write the density (or mass) function f(Xi ∣ λ) of the data, with the parameter λ left as a variable.
2) Likelihood: L(λ) = ∏_{i=1}^{n} f(Xi ∣ λ)
3) Log-likelihood: LL(λ) = ∑_{i=1}^{n} log f(Xi ∣ λ)
4) Maximize by setting dLL(λ)/dλ = 0
5) Solve for λ̂ (see the sketch below).
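A minimal sketch of this recipe in Python, assuming Poisson data (whose closed-form MLE is the sample mean) and checking the closed form against a numeric maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(1)
data = rng.poisson(lam=4.2, size=1000)   # simulated data with true lambda = 4.2

# Log-likelihood LL(lambda) = sum_i log f(X_i | lambda); minimize its negative
def neg_log_likelihood(lam):
    return -np.sum(poisson.logpmf(data, lam))

res = minimize_scalar(neg_log_likelihood, bounds=(0.01, 20), method="bounded")
print(res.x)            # numeric MLE
print(data.mean())      # closed-form Poisson MLE: lambda_hat = sample mean
```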

9.2 Biased or Unbiased?
An estimator Θ̂ is unbiased when:
E[Θ̂] = Θ

9.3 Estimator Consistency


For any ε > 0,
lim_{n→∞} P(∣Θ̂ − Θ∣ ≤ ε) = 1
Meaning: as we get more data, estimate should deviate from true value by at most a small amount.

10 Central Limit Theorem


10.1 Discussion
Deals with I.I.D. (independent and identically distributed) random variables.
The CLT says that if the random variables have finite mean µ and finite variance σ², then the distribution of the sum of the first n of them is, for large n, approximately that of a normal random variable with mean nµ and variance nσ².

10.2 Confidence Interval


Consider IID random variables X1, . . . , Xn with sample mean X̄.
S: sample standard deviation, with S² = ∑_{i=1}^{n} (Xi − X̄)² / (n − 1)
Var(X̄) = σ²/n
For large n, the 100(1 − α)% CI is:
( X̄ − z_{α/2} S/√n , X̄ + z_{α/2} S/√n )
Meaning: 100(1 − α)% of the time that a CI is computed from a sample, the true µ lies in the interval.
Φ(z_{α/2}) = 1 − α/2
Ex: α = .05, α/2 = .025, Φ(z_{α/2}) = .975, z_{α/2} = 1.96
Confidence Level:
90% → 1.645
95% → 1.96
99% → 2.58
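A short Python sketch computing such an interval from simulated data; the exponential sample is an arbitrary example:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sample = rng.exponential(scale=3.0, size=500)   # true mean mu = 3

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)          # sample std. deviation (n - 1 denominator)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)     # z_{alpha/2} = 1.96 for alpha = 0.05
half_width = z * s / np.sqrt(n)
print(f"95% CI for mu: ({x_bar - half_width:.3f}, {x_bar + half_width:.3f})")
```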

10.3 Calculating n to ensure confidence level with average


Consider IID random variables with given µ and σ². Suppose you want the sample mean to be within a buffer ±a of µ with confidence level 100(1 − α)%. Let

Zn = ( ∑_{i=1}^{n} Xi − nµ ) / (σ√n)

Then require

P( −a√n/σ ≤ Zn ≤ a√n/σ ) = 2Φ( a√n/σ ) − 1 = 1 − α

and solve for n.
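Solving 2Φ(a√n/σ) − 1 = 1 − α for n gives n = (z_{α/2} σ / a)². A small Python sketch with illustrative values for σ, a, and α:

```python
import math
from scipy.stats import norm

sigma = 15.0   # known std. deviation (illustrative value)
a = 2.0        # want the sample mean within +/- 2 of mu
alpha = 0.05   # 95% confidence

# 2 * Phi(a * sqrt(n) / sigma) - 1 = 1 - alpha  =>  a * sqrt(n) / sigma = z_{alpha/2}
z = norm.ppf(1 - alpha / 2)
n = math.ceil((z * sigma / a) ** 2)
print(n)       # smallest n satisfying the requirement, here 217
```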

10.4 Approximate Probability with CLT


Take X ∼ Poi(20). Then P(X ≥ 24) ≈ P( Z ≥ (23.5 − 20)/√20 ).
*Remember the continuity correction if X comes from a discrete distribution.
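The same example in Python, comparing the exact Poisson tail to the normal approximation with continuity correction:

```python
import math
from scipy.stats import poisson, norm

lam = 20
# Exact: P(X >= 24) for X ~ Poisson(20)
exact = 1 - poisson.cdf(23, lam)
# Normal approximation with continuity correction: P(Z >= (23.5 - 20) / sqrt(20))
approx = 1 - norm.cdf((23.5 - lam) / math.sqrt(lam))
print(f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```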

11 Laws of Large Numbers
Consider IID Random Variables

11.1 Weak Law


For any ε > 0,
P( ∣ (X1 + X2 + ⋯ + Xn)/n − µ ∣ ≥ ε ) → 0    as n → ∞

11.2 Strong Law


P( lim_{n→∞} (X1 + X2 + ⋯ + Xn)/n = µ ) = 1

The strong law implies the weak law, but NOT vice versa.
The strong law implies that for any ε > 0, there are only a finite number of values of n for which ∣(X1 + ⋯ + Xn)/n − µ∣ ≥ ε holds.

12 Discrete Random Variables


12.1 Bernoulli
An experiment that results in "success" or "failure."

X ∼ Ber(p)
P (X = 0) = 1 − p
P (X = 1) = p
E[X] = p
Var(X) = p(1 − p)
M(t) = p e^t + 1 − p

12.2 Binomial
The number of successes in an experiment with n trials and p probability of success on each trial.

X ∼ Bin(n, p)
P(X = i) = p(i) = (n choose i) p^i (1 − p)^{n−i}    where i = 0, 1, . . . , n
E[X] = np
Var(X) = np(1 − p)
M(t) = (p e^t + 1 − p)^n

If Xi ∼ Bin(ni , p) for 1 ≤ i ≤ N , then


( ∑_{i=1}^{N} Xi ) ∼ Bin( ∑_{i=1}^{N} ni , p )

Note that the binomial distribution is a generalization of the Bernoulli distribution, since Ber(p) ∼ Bin(1, p).

12.3 Poisson
Approximates the binomial random variable when n is large and p is small enough to make np "moderate" (generally when n > 20 and p < 0.05); the approximation becomes exact as n → ∞ and p → 0 with np held fixed.
X ∼ Poi(λ) where λ = np
P(X = i) = e^{−λ} λ^i / i!    where i = 0, 1, 2, . . .
E[X] = λ
Var(X) = λ
M(t) = e^{λ(e^t − 1)}

The approximation also works to a certain extent when the trials are not entirely independent, and when the probability of success varies slightly from trial to trial.

If Xi ∼ Poi(λi ) for 1 ≤ i ≤ N , then


( ∑_{i=1}^{N} Xi ) ∼ Poi( ∑_{i=1}^{N} λi )
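A quick Python comparison of the binomial PMF and its Poisson approximation, for illustrative values of n and p:

```python
from scipy.stats import binom, poisson

n, p = 100, 0.03          # large n, small p, np = 3 is "moderate"
lam = n * p
for i in range(6):
    print(i, round(binom.pmf(i, n, p), 4), round(poisson.pmf(i, lam), 4))
```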

12.4 Geometric
The number of independent trials until a success, where the probability of success is p.
X ∼ Geo(p)
P(X = n) = (1 − p)^{n−1} p    where n = 1, 2, . . .
E[X] = 1/p
Var(X) = (1 − p)/p²

12.5 Negative Binomial


The number of independent trials until r successes, with probability p of success.
X ∼ NegBin(r, p)
P(X = n) = (n−1 choose r−1) p^r (1 − p)^{n−r}    where n = r, r + 1, . . .
E[X] = r/p
Var(X) = r(1 − p)/p²
Note that the negative binomial distribution generalizes the geometric distribution, with Geo(p) ∼ NegBin(1, p).

12.6 Hypergeometric
The number of white balls drawn after drawing n balls (without replacement) from an urn containing N balls,
with m white balls and N − m other ("black") balls.
X ∼ HypG(n, N, m)
X ∼ HypG(n, N, m)

P(X = i) = (m choose i)(N−m choose n−i) / (N choose n)    where i = 0, 1, . . . , n
E[X] = n(m/N )
Var(X) = nm(N − n)(N − m) / ( N²(N − 1) )
HypG(n, N, m) → Bin(n, m/N ) , as N → ∞ and m/N stays constant

12.7 Multinomial
The multinomial distribution further generalizes the binomial distribution: given an experiment with n independent trials, where each trial results in one of m outcomes with respective probabilities p1, p2, . . . , pm such that ∑_{i=1}^{m} pi = 1, if Xi denotes the number of trials with outcome i, then

P(X1 = c1, X2 = c2, . . . , Xm = cm) = (n choose c1, c2, . . . , cm) p1^{c1} p2^{c2} ⋯ pm^{cm}

where ∑_{i=1}^{m} ci = n and (n choose c1, c2, . . . , cm) = n! / (c1! c2! ⋯ cm!).

13 Continuous Random Variables


If Y is a non-negative continuous random variable

E[Y] = ∫_0^∞ P(Y > y) dy

13.1 Uniform

X ∼ Uni(α, β)
f(x) = 1/(β − α)    if α ≤ x ≤ β;    0 otherwise
E[X] = (α + β)/2
Var(X) = (β − α)²/12

13.2 Normal
For values in common natural phenomena, especially when resulting from the sum of multiple variables.

X ∼ N(µ, σ 2 )
f(x) = ( 1/(σ√(2π)) ) e^{−(x−µ)²/(2σ²)}    where −∞ < x < ∞
E[X] = µ
Var(X) = σ 2
M(t) = e^{σ²t²/2 + µt}

Letting X ∼ N (µ, σ 2 ) and Y = aX + b, we have

Y ∼ N(aµ + b, a²σ²)
FY(x) = FX( (x − b)/a )
The Standard (Unit) Normal Random Variable Z ∼ N (0, 1) has a cumulative distribution function (CDF)
commonly labeled Φ(z) = P (Z ≤ z) that has some useful properties.
Φ(z) = ∫_{−∞}^{z} ( 1/√(2π) ) e^{−x²/2} dx
Φ(−z) = 1 − Φ(z)
P(Z ≥ −z) = P(Z ≤ z)

Given X ∼ N (µ, σ 2 ) where σ > 0, we can then compute the CDF of X using the CDF of the standard normal
variable.
FX(x) = Φ( (x − µ)/σ )
By the de Moivre-Laplace Limit Theorem, the normal variable can approximate the binomial when Var(X) =
np(1 − p) ≥ 10. If we let Sn denote the number of successes (with probability p) in n independent trials, then

P( a ≤ (Sn − np)/√(np(1 − p)) ≤ b ) → Φ(b) − Φ(a)    as n → ∞
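A short Python check of this approximation (with continuity correction) against the exact binomial tail, for an arbitrary n and p:

```python
import math
from scipy.stats import binom, norm

n, p = 100, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))   # Var = np(1-p) = 25 >= 10, so approx is OK

# P(S_n >= 60) exactly vs. via the normal approximation (with continuity correction)
exact = 1 - binom.cdf(59, n, p)
approx = 1 - norm.cdf((59.5 - mu) / sd)
print(f"exact = {exact:.4f}, approx = {approx:.4f}")
```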

If Xi ∼ N(µi , σi2 ) for i = 1, 2, . . . , n, then


( ∑_{i=1}^{n} Xi ) ∼ N( ∑_{i=1}^{n} µi , ∑_{i=1}^{n} σi² )

13.3 Exponential
Represents time until some event, with rate λ > 0.
X ∼ Exp(λ)
f(x) = λ e^{−λx}    if x ≥ 0;    0 if x < 0
E[X] = 1/λ
Var(X) = 1/λ²
F(x) = 1 − e^{−λx}    where x ≥ 0
Exponentially distributed random variables are memoryless.
P (X > s + t∣X > s) = P (X > t)

13.4 Beta

X ∼ Beta(a, b)
f(x) = ( 1/B(a, b) ) x^{a−1} (1 − x)^{b−1}    if 0 < x < 1;    0 otherwise
B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx
E[X] = a/(a + b)
Var(X) = ab / ( (a + b)²(a + b + 1) )
If X ∼ Uni(0, 1) is the unknown probability of getting heads for a coin, and N denotes the number of heads observed in m + n flips, then
X ∣ (N = n, m + n trials) ∼ Beta(n + 1, m + 1)

14 Useful Definitions
14.1 Taylor Series
e^x = ∑_{n=0}^{∞} x^n / n!

14.2 Integration By Parts
∫_a^b u(x) (dv/dx) dx = u(x)v(x) ∣_a^b − ∫_a^b v(x) (du/dx) dx

