
CSE 592: Advanced Topics in Computer Science

Principles of Machine Learning (Fall 2011)


Department of Computer Science
Stony Brook University
Homework 2 Solutions: On Probability Theory, Decision Theory, Classification, Regression and Statistical Estimation
Out: Thursday, October 27 Due: Thursday, November 3
Exercises (100 pts. total; + 30 pts. Extra Credit)
1. [Sum-of-Squares Regression (Bishop, 1.2)] (10 pts.) Show that the following set of coupled linear equations is satisfied by the coefficients $w_i$ which minimize the sum-of-squares error function: if $(x_1, t_1), \ldots, (x_m, t_m)$ are the input-output example pairs, then
$$\sum_{j=0}^{n} A_{ij} w_j = T_i$$
where
$$A_{ij} = \sum_{l=1}^{m} (x_l)^{i+j}, \qquad T_i = \sum_{l=1}^{m} (x_l)^{i}\, t_l .$$
Solution. Given the set of examples above, the sum-of-squares error function is
$$L(w) = \frac{1}{2} \sum_{l=1}^{m} \left( t_l - \sum_{i=0}^{n} w_i (x_l)^i \right)^{2} .$$
The optimal value of $w$ that minimizes $L$ must satisfy the following first-order condition: all partial derivatives must equal zero; so,
$$\frac{\partial L}{\partial w_i}(w) = -\sum_{l=1}^{m} \left( t_l - \sum_{j=0}^{n} w_j (x_l)^j \right) (x_l)^i = 0 .$$
Rewriting (by rearranging terms), we obtain
$$\sum_{l=1}^{m} \left( \sum_{j=0}^{n} w_j (x_l)^j \right) (x_l)^i = \sum_{l=1}^{m} (x_l)^i\, t_l$$
$$\sum_{j=0}^{n} \left( \sum_{l=1}^{m} (x_l)^{i+j} \right) w_j = \sum_{l=1}^{m} (x_l)^i\, t_l .$$
We complete the solution by defining $A_{ij}$ and $T_i$ appropriately. □
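The normal equations above are easy to check numerically. The following is a minimal sketch (not part of the original solution; the data and degree are made-up illustration values) that builds $A$ and $T$ exactly as defined and solves $A w = T$ with NumPy.

```python
# Illustrative sketch: polynomial least squares via the normal equations A w = T.
import numpy as np

def fit_polynomial(x, t, n):
    """Degree-n least-squares polynomial fit using the equations derived above."""
    x, t = np.asarray(x, dtype=float), np.asarray(t, dtype=float)
    powers = np.arange(n + 1)
    # A[i, j] = sum_l x_l^(i+j),  T[i] = sum_l x_l^i * t_l
    A = np.array([[np.sum(x ** (i + j)) for j in powers] for i in powers])
    T = np.array([np.sum((x ** i) * t) for i in powers])
    return np.linalg.solve(A, T)

# Example: noisy samples from t = 1 + 2x - x^2 (made-up data).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
t = 1 + 2 * x - x ** 2 + 0.05 * rng.standard_normal(x.size)
print(fit_polynomial(x, t, n=2))   # approximately [1, 2, -1]
```

In practice one would typically prefer a QR-based solver or `numpy.polyfit` for numerical conditioning, but the direct normal equations mirror the derivation above.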
2. [Linearity of Expectation] (5 pts. total) Let X and Z be random variables of the same experiment.
(a) (2 pts.) Show that
$$E[X + Z] = E[X] + E[Z] .$$
Solution. Suppose both X and Z are discrete random variables with joint PMF $p_{X,Z}$. Then, we have
$$
\begin{aligned}
E[X + Z] &= \sum_x \sum_z p_{X,Z}(x, z)\,(x + z) \\
&= \left( \sum_x \sum_z p_{X,Z}(x, z)\, x \right) + \left( \sum_x \sum_z p_{X,Z}(x, z)\, z \right) \\
&= \left( \sum_x x \sum_z p_{X,Z}(x, z) \right) + \left( \sum_z z \sum_x p_{X,Z}(x, z) \right) \\
&= \left( \sum_x x\, p_X(x) \right) + \left( \sum_z z\, p_Z(z) \right) \\
&= E[X] + E[Z],
\end{aligned}
$$
where the first equality follows by the definition of the expectation, the second by the distributive properties of sums, the third by taking $x$ and $z$ out of the sum with respect to $z$ and $x$, respectively, the fourth by the definition of the marginal PMF, and the last one by the definition of the expectation.
If both X and Z are continuous random variables with joint PDF $f_{X,Z}$, then we have
$$
\begin{aligned}
E[X + Z] &= \int_x \int_z f_{X,Z}(x, z)\,(x + z)\, dx\, dz \\
&= \left( \int_x \int_z f_{X,Z}(x, z)\, x\, dx\, dz \right) + \left( \int_x \int_z f_{X,Z}(x, z)\, z\, dx\, dz \right) \\
&= \left( \int_x x \left( \int_z f_{X,Z}(x, z)\, dz \right) dx \right) + \left( \int_z z \left( \int_x f_{X,Z}(x, z)\, dx \right) dz \right) \\
&= \left( \int_x x\, f_X(x)\, dx \right) + \left( \int_z z\, f_Z(z)\, dz \right) \\
&= E[X] + E[Z]
\end{aligned}
$$
by a reasoning similar to the discrete case: the first equality follows by the definition of the expectation, the second by the distributive properties of integrals, the third by taking $x$ and $z$ out of the integral with respect to $z$ and $x$, respectively, the fourth by the definition of the marginal PDF, and the last one by the definition of the expectation.
Finally, suppose, without loss of generality, that X is continuous and Z is discrete with joint distribution $f_{X|Z}(x \mid z)\, p_Z(z)$. (Note that the case when X is discrete and Z is continuous is symmetric.) Then, we have
$$
\begin{aligned}
E[X + Z] &= \int_x \sum_z f_{X|Z}(x \mid z)\, p_Z(z)\,(x + z)\, dx \\
&= \left( \int_x \sum_z f_{X|Z}(x \mid z)\, p_Z(z)\, x\, dx \right) + \left( \int_x \sum_z f_{X|Z}(x \mid z)\, p_Z(z)\, z\, dx \right) \\
&= \left( \int_x x \left( \sum_z f_{X|Z}(x \mid z)\, p_Z(z) \right) dx \right) + \left( \sum_z z\, p_Z(z) \int_x f_{X|Z}(x \mid z)\, dx \right) \\
&= \left( \int_x x\, f_X(x)\, dx \right) + \left( \sum_z z\, p_Z(z) \right) \\
&= E[X] + E[Z],
\end{aligned}
$$
again by a reasoning similar to that of the previous two cases. □
(b) (3 pts.) Show that if, in addition, X and Z are independent, then
$$\mathrm{var}(X + Z) = \mathrm{var}(X) + \mathrm{var}(Z).$$
Solution. In general, we have
$$
\begin{aligned}
\mathrm{var}(X + Z) &= E[(X + Z - E[X + Z])^2] \\
&= E[(X - E[X] + Z - E[Z])^2] \\
&= E[(X - E[X])^2 + 2(X - E[X])(Z - E[Z]) + (Z - E[Z])^2] \\
&= E[(X - E[X])^2] + 2\, E[(X - E[X])(Z - E[Z])] + E[(Z - E[Z])^2] \\
&= \mathrm{var}(X) + 2\, E[(X - E[X])(Z - E[Z])] + \mathrm{var}(Z).
\end{aligned}
$$
The random variables $(X - E[X])$ and $(Z - E[Z])$ are independent because X and Z are independent. Hence, we have
$$
\begin{aligned}
E[(X - E[X])(Z - E[Z])] &= E[X - E[X]]\, E[Z - E[Z]] \\
&= (E[X] - E[X])(E[Z] - E[Z]) \\
&= (0)(0) = 0,
\end{aligned}
$$
from which the result follows. Let us prove more formally that the first equality in the last derivation is true. Suppose both X and Z are discrete random variables with joint PMF $p_{X,Z}$. Then, we have
$$
\begin{aligned}
E[(X - E[X])(Z - E[Z])] &= \sum_x \sum_z p_{X,Z}(x, z)\,(x - E[X])(z - E[Z]) \\
&= \sum_x \sum_z p_X(x)\, p_Z(z)\,(x - E[X])(z - E[Z]) \\
&= \sum_x p_X(x)\,(x - E[X]) \sum_z p_Z(z)\,(z - E[Z]) \\
&= \left( \sum_x p_X(x)\,(x - E[X]) \right) \left( \sum_z p_Z(z)\,(z - E[Z]) \right) \\
&= E[X - E[X]]\, E[Z - E[Z]] .
\end{aligned}
$$
The derivation for the cases in which only one or both are continuous is similar. □
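As a quick numerical illustration of both parts (not part of the proof; the two distributions below are arbitrary), a Monte Carlo check for independent X and Z:

```python
# Sanity check: for independent X and Z, the sample mean and variance of X + Z
# should be close to E[X] + E[Z] and var(X) + var(Z).
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=1_000_000)        # E[X] = 2,  var(X) = 4
Z = rng.normal(loc=-1.0, scale=3.0, size=1_000_000)   # E[Z] = -1, var(Z) = 9
S = X + Z
print(S.mean())   # ~ 1.0
print(S.var())    # ~ 13.0
```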
3. [Uniform PDF (Bishop, 2.12)] (10 pts. total) The uniform PDF with bounding parameters $a$ and $b$ is defined by
$$f_X(x \mid a, b) = \frac{1}{b - a}, \quad \text{if } a \le x \le b,$$
and 0 everywhere else.
(a) (2 pts.) Show that this PDF is properly normalized.
Solution.
$$
\begin{aligned}
\int_{-\infty}^{+\infty} f_X(x \mid a, b)\, dx &= \int_a^b f_X(x \mid a, b)\, dx \\
&= \int_a^b \frac{1}{b - a}\, dx = \frac{1}{b - a} \int_a^b dx \\
&= \frac{1}{b - a} \left[ x \right]_a^b = \frac{1}{b - a}\,[b - a] \\
&= 1 . \qquad \square
\end{aligned}
$$
(b) (8 pts.) Let $X \sim \mathrm{Uniform}(a, b)$. Find expressions for $E[X]$ and $\mathrm{var}(X)$.
Solution. The expectation of X is
$$
\begin{aligned}
E[X] &= \int_{-\infty}^{+\infty} f_X(x \mid a, b)\, x\, dx = \int_a^b f_X(x \mid a, b)\, x\, dx \\
&= \int_a^b \frac{1}{b - a}\, x\, dx = \frac{1}{b - a} \int_a^b x\, dx \\
&= \frac{1}{b - a} \left[ \frac{x^2}{2} \right]_a^b = \frac{1}{b - a} \left( \frac{b^2 - a^2}{2} \right) \\
&= \frac{1}{b - a} \left( \frac{(b - a)(b + a)}{2} \right) = \frac{b + a}{2} .
\end{aligned}
$$
To calculate the variance, it is convenient to use the alternative expression involving the first two moments of X:
$$\mathrm{var}(X) = E[X^2] - (E[X])^2 .$$
The second moment of X is
$$
\begin{aligned}
E[X^2] &= \int_{-\infty}^{+\infty} f_X(x \mid a, b)\, x^2\, dx = \int_a^b f_X(x \mid a, b)\, x^2\, dx \\
&= \int_a^b \frac{1}{b - a}\, x^2\, dx = \frac{1}{b - a} \int_a^b x^2\, dx \\
&= \frac{1}{b - a} \left[ \frac{x^3}{3} \right]_a^b = \frac{1}{b - a} \left( \frac{b^3 - a^3}{3} \right) .
\end{aligned}
$$
From this we obtain that
$$
\begin{aligned}
\mathrm{var}(X) &= E[X^2] - (E[X])^2 \\
&= \frac{1}{b - a} \left( \frac{b^3 - a^3}{3} \right) - \left( \frac{b + a}{2} \right)^2 \\
&= \frac{4(b^3 - a^3) - 3(b - a)(b + a)^2}{12(b - a)} \\
&= \frac{4(b^3 - a^3) - 3(b^3 + 2b^2 a + b a^2 - b^2 a - 2 b a^2 - a^3)}{12(b - a)} \\
&= \frac{4(b^3 - a^3) - 3(b^3 + b^2 a - b a^2 - a^3)}{12(b - a)} \\
&= \frac{4 b^3 - 4 a^3 - 3 b^3 - 3 b^2 a + 3 b a^2 + 3 a^3}{12(b - a)} \\
&= \frac{b^3 - 3 b^2 a + 3 b a^2 - a^3}{12(b - a)} \\
&= \frac{(b - a)^3}{12(b - a)} = \frac{(b - a)^2}{12} . \qquad \square
\end{aligned}
$$
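A short numerical check of the two closed forms (the values of $a$ and $b$ below are arbitrary illustration values):

```python
# For X ~ Uniform(a, b): sample mean ~ (a + b) / 2, sample variance ~ (b - a)^2 / 12.
import numpy as np

a, b = 2.0, 7.0
x = np.random.default_rng(2).uniform(a, b, size=1_000_000)
print(x.mean(), (a + b) / 2)         # ~ 4.5
print(x.var(), (b - a) ** 2 / 12)    # ~ 2.0833
```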
4. [Multivariate Gaussian Random Variables] (15 pts., Extra Credit) Consider two multidimensional random vectors $X \sim \mathrm{Gaussian}(\mu_X, \Sigma_X)$ and $Z \sim \mathrm{Gaussian}(\mu_Z, \Sigma_Z)$, and their sum $Y = X + Z$. Find expressions for the marginal and conditional PDFs $f_Y$ and $f_{Y|X}$, respectively.
Solution. Let us begin by assuming that $Y \sim \mathrm{Gaussian}(\mu_Y, \Sigma_Y)$. This is actually true because the sum of Gaussian random variables is Gaussian, but we will formally prove that this is the case later. Because Gaussian PDFs are fully determined by their mean and covariance, the marginal PDF $f_Y$ is defined by the following mean and covariance:
$$
\begin{aligned}
\mu_Y &= E[Y] = E[X + Z] = E[X] + E[Z] = \mu_X + \mu_Z \\
\Sigma_Y &= \mathrm{cov}(Y) = E[Y Y^T] - \mu_Y \mu_Y^T \\
&= E[(X + Z)(X + Z)^T] - (\mu_X + \mu_Z)(\mu_X + \mu_Z)^T \\
&= E[X X^T + Z Z^T + 2\, X Z^T] - (\mu_X \mu_X^T + \mu_Z \mu_Z^T + 2\, \mu_X \mu_Z^T) \\
&= (E[X X^T] - \mu_X \mu_X^T) + (E[Z Z^T] - \mu_Z \mu_Z^T) + 2\,(E[X Z^T] - \mu_X \mu_Z^T) \\
&= \mathrm{cov}(X) + \mathrm{cov}(Z) + 2\, \mathrm{cov}(X, Z) \\
&= \Sigma_X + \Sigma_Z + 2\, \Sigma_{X,Z} .
\end{aligned}
$$
In summary, we can express the marginal PDF of Y in terms of X and Z as $Y \sim \mathrm{Gaussian}(\mu_X + \mu_Z,\; \Sigma_X + \Sigma_Z + 2\, \Sigma_{X,Z})$. Of course, if X and Z are also independent, the expression for the covariance of Y simplifies because in that case $\Sigma_{X,Z} = 0$ (i.e., the matrix of all zeros), and $\Sigma_Y = \Sigma_X + \Sigma_Z$.
We now show that in fact Y is a multivariate Gaussian random variable. The CDF of Y can be expressed as
$$
\begin{aligned}
F_Y(y) &= P(Y \le y) \\
&= \int P(Y \le y \mid X = x)\, f_X(x)\, dx \\
&= \int P(x + Z \le y \mid X = x)\, f_X(x)\, dx \\
&= \int F_{Z|X}(y - x \mid x)\, f_X(x)\, dx .
\end{aligned}
$$
Hence, carefully applying the fundamental theorem of calculus, we obtain the PDF of Y:
$$
f_Y(y) = \int f_{Z|X}(y - x \mid x)\, f_X(x)\, dx = \int f_{X,Z}(x, y - x)\, dx .
$$
Let $W = [X\ \ Z]^T \sim \mathrm{Gaussian}(\mu_W, \Lambda_W^{-1})$, where
$$
\mu_W = \begin{bmatrix} \mu_X \\ \mu_Z \end{bmatrix}
\quad\text{and}\quad
\Lambda_W = \begin{bmatrix} \Lambda_X & -\Lambda_{X,Z} \\ -\Lambda_{X,Z} & \Lambda_Z \end{bmatrix}
= \begin{bmatrix} \Sigma_X & \Sigma_{X,Z} \\ \Sigma_{X,Z} & \Sigma_Z \end{bmatrix}^{-1} .
$$
Then, we can express the PDF of Y as
$$
\begin{aligned}
f_Y(y) &\propto \int \exp\left\{ -\tfrac{1}{2} \left( \begin{bmatrix} x \\ y - x \end{bmatrix} - \mu_W \right)^{\!T} \Lambda_W \left( \begin{bmatrix} x \\ y - x \end{bmatrix} - \mu_W \right) \right\} dx \\
&\propto \int \exp\left\{ -\tfrac{1}{2} \left( \begin{bmatrix} x \\ y - x \end{bmatrix}^{\!T} \Lambda_W \begin{bmatrix} x \\ y - x \end{bmatrix} - 2\, \mu_W^T \Lambda_W \begin{bmatrix} x \\ y - x \end{bmatrix} \right) \right\} dx \\
&= \int \exp\left\{ -\tfrac{1}{2} \left( x^T \Lambda_X x - 2 x^T \Lambda_{X,Z} (y - x) + (y - x)^T \Lambda_Z (y - x) - 2 \tilde\mu_X^T x - 2 \tilde\mu_Z^T (y - x) \right) \right\} dx \\
&= \int \exp\left\{ -\tfrac{1}{2} \left( x^T \Lambda_X x - 2 x^T \Lambda_{X,Z}\, y + 2 x^T \Lambda_{X,Z}\, x + y^T \Lambda_Z y - 2 y^T \Lambda_Z x + x^T \Lambda_Z x - 2 \tilde\mu_X^T x - 2 \tilde\mu_Z^T y + 2 \tilde\mu_Z^T x \right) \right\} dx \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y \right) \right\} \int \exp\left\{ -\tfrac{1}{2} \left( x^T \Lambda_X x - 2 x^T \Lambda_{X,Z}\, y + 2 x^T \Lambda_{X,Z}\, x - 2 y^T \Lambda_Z x + x^T \Lambda_Z x - 2 \tilde\mu_X^T x + 2 \tilde\mu_Z^T x \right) \right\} dx \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y \right) \right\} \int \exp\left\{ -\tfrac{1}{2} \left( x^T (\Lambda_X + \Lambda_Z + 2 \Lambda_{X,Z})\, x - 2 \left( y^T (\Lambda_{X,Z} + \Lambda_Z) + \tilde\mu_X^T - \tilde\mu_Z^T \right) x \right) \right\} dx ,
\end{aligned}
$$
where, to keep the notation compact, we write $\tilde\mu_X := \Lambda_X \mu_X - \Lambda_{X,Z}\, \mu_Z$ and $\tilde\mu_Z := \Lambda_Z \mu_Z - \Lambda_{X,Z}\, \mu_X$ for the two blocks of $\Lambda_W \mu_W$, so that $\mu_W^T \Lambda_W [x\ \ y - x]^T = \tilde\mu_X^T x + \tilde\mu_Z^T (y - x)$.
Letting $\Lambda \equiv \Lambda_X + \Lambda_Z + 2 \Lambda_{X,Z}$, $\mu(y) \equiv \Lambda^{-1} \left( (\Lambda_{X,Z} + \Lambda_Z) y + \tilde\mu_X - \tilde\mu_Z \right)$, and continuing the last derivation, we have
$$
\begin{aligned}
f_Y(y) &\propto \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y \right) \right\} \int \exp\left\{ -\tfrac{1}{2} \left( x^T \Lambda x - 2 \left( \Lambda^{-1} \left( (\Lambda_{X,Z} + \Lambda_Z) y + \tilde\mu_X - \tilde\mu_Z \right) \right)^{\!T} \Lambda x \right) \right\} dx \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y \right) \right\} \int \exp\left\{ -\tfrac{1}{2} \left( x^T \Lambda x - 2\, \mu(y)^T \Lambda x \right) \right\} dx \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y - \mu(y)^T \Lambda\, \mu(y) \right) \right\} \int \exp\left\{ -\tfrac{1}{2} \left( x^T \Lambda x - 2\, \mu(y)^T \Lambda x + \mu(y)^T \Lambda\, \mu(y) \right) \right\} dx \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y - \mu(y)^T \Lambda\, \mu(y) \right) \right\} \int \exp\left\{ -\tfrac{1}{2} (x - \mu(y))^T \Lambda\, (x - \mu(y)) \right\} dx \\
&\propto \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y - \mu(y)^T \Lambda\, \mu(y) \right) \right\} .
\end{aligned}
$$
Now, because
$$
\begin{aligned}
\mu(y)^T \Lambda\, \mu(y) &= \left( \Lambda^{-1} \left( (\Lambda_{X,Z} + \Lambda_Z) y + \tilde\mu_X - \tilde\mu_Z \right) \right)^{\!T} \Lambda \left( \Lambda^{-1} \left( (\Lambda_{X,Z} + \Lambda_Z) y + \tilde\mu_X - \tilde\mu_Z \right) \right) \\
&= \left( (\Lambda_{X,Z} + \Lambda_Z) y + \tilde\mu_X - \tilde\mu_Z \right)^{\!T} \Lambda^{-1} \left( (\Lambda_{X,Z} + \Lambda_Z) y + \tilde\mu_X - \tilde\mu_Z \right) \\
&= y^T (\Lambda_{X,Z} + \Lambda_Z)\, \Lambda^{-1} (\Lambda_{X,Z} + \Lambda_Z)\, y + 2\, (\tilde\mu_X - \tilde\mu_Z)^T \Lambda^{-1} (\Lambda_{X,Z} + \Lambda_Z)\, y + \text{const.} ,
\end{aligned}
$$
letting $B \equiv \Lambda_{X,Z} + \Lambda_Z$ so that
$$\mu(y)^T \Lambda\, \mu(y) = y^T B \Lambda^{-1} B\, y + 2\, (\tilde\mu_X - \tilde\mu_Z)^T \Lambda^{-1} B\, y + \text{const.} ,$$
we obtain
$$
\begin{aligned}
f_Y(y) &\propto \exp\left\{ -\tfrac{1}{2} \left( y^T \Lambda_Z y - 2 \tilde\mu_Z^T y - \left( y^T B \Lambda^{-1} B\, y + 2\, (\tilde\mu_X - \tilde\mu_Z)^T \Lambda^{-1} B\, y \right) \right) \right\} \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T (\Lambda_Z - B \Lambda^{-1} B)\, y - 2 \left( \tilde\mu_Z^T + (\tilde\mu_X - \tilde\mu_Z)^T \Lambda^{-1} B \right) y \right) \right\} .
\end{aligned}
$$
Letting $A = \Lambda_Z - B \Lambda^{-1} B$ and $b = \tilde\mu_Z + B \Lambda^{-1} (\tilde\mu_X - \tilde\mu_Z)$, the last expression simplifies to
$$f_Y(y) \propto \exp\left\{ -\tfrac{1}{2} \left( y^T A\, y - 2\, b^T y \right) \right\} .$$
The last condition implies that Y is Gaussian, because A is positive definite (i.e., invertible), as shown in what follows:
$$
\begin{aligned}
f_Y(y) &\propto \exp\left\{ -\tfrac{1}{2} \left( y^T A\, y - 2\, b^T A^{-1} A\, y \right) \right\} \\
&= \exp\left\{ -\tfrac{1}{2} \left( y^T A\, y - 2\, (A^{-1} b)^T A\, y \right) \right\} \\
&\propto \exp\left\{ -\tfrac{1}{2} \left( y^T A\, y - 2\, (A^{-1} b)^T A\, y + (A^{-1} b)^T A\, (A^{-1} b) \right) \right\} \\
&= \exp\left\{ -\tfrac{1}{2} (y - A^{-1} b)^T A\, (y - A^{-1} b) \right\} ,
\end{aligned}
$$
which is in fact the PDF of a multivariate Gaussian with mean vector and covariance matrix parameters $A^{-1} b$ and $A^{-1}$, respectively.
As for the conditional PDF $f_{Y|X}$, first note that conditioned on the event $X = x$, Y becomes the sum of a multivariate Gaussian random variable and a constant: $Y = x + Z$. Adding a constant to a multivariate Gaussian random variable yields another multivariate Gaussian random variable; in this case the mean vector and covariance matrix parameters are
$$
\begin{aligned}
\mu_{Y|X=x} &= E[Y \mid X = x] = E[X + Z \mid X = x] = E[x + Z \mid X = x] = x + E[Z \mid X = x] = x + \mu_{Z|X=x} \\
\Sigma_{Y|X=x} &= \mathrm{cov}(Y \mid X = x) = E[(Y - \mu_{Y|X=x})(Y - \mu_{Y|X=x})^T \mid X = x] \\
&= E[(x + Z - (x + \mu_{Z|X=x}))(x + Z - (x + \mu_{Z|X=x}))^T \mid X = x] \\
&= E[(Z - \mu_{Z|X=x})(Z - \mu_{Z|X=x})^T \mid X = x] \\
&= \mathrm{cov}(Z \mid X = x) = \Sigma_{Z|X=x} .
\end{aligned}
$$
We now show that in fact $f_{Y|X}$ is a Gaussian PDF. The conditional CDF of Y given $X = x$ can be expressed as
$$F_{Y|X}(y \mid x) = P(Y \le y \mid X = x) = F_{Z|X}(y - x \mid x) .$$
Hence, carefully applying the fundamental theorem of calculus, we obtain the conditional PDF of Y given $X = x$:
$$f_{Y|X}(y \mid x) = f_{Z|X}(y - x \mid x) .$$
Note that because X and Z are jointly Gaussian random variables, the conditional PDF $f_{Z|X}$ is also a Gaussian PDF. To see this, note that, using some of the same definitions and notation from the derivation of the marginal PDF $f_Y$ given above,
$$
\begin{aligned}
f_{Z|X}(z \mid x) &\propto f_{X,Z}(x, z) \\
&\propto \exp\left\{ -\tfrac{1}{2} \left( \begin{bmatrix} x \\ z \end{bmatrix} - \mu_W \right)^{\!T} \Lambda_W \left( \begin{bmatrix} x \\ z \end{bmatrix} - \mu_W \right) \right\} \\
&= \exp\left\{ -\tfrac{1}{2} \left( (x - \mu_X)^T \Lambda_X (x - \mu_X) - 2 (x - \mu_X)^T \Lambda_{X,Z} (z - \mu_Z) + (z - \mu_Z)^T \Lambda_Z (z - \mu_Z) \right) \right\} \\
&\propto \exp\left\{ -\tfrac{1}{2} \left( (z - \mu_Z)^T \Lambda_Z (z - \mu_Z) - 2 (x - \mu_X)^T \Lambda_{X,Z} (z - \mu_Z) \right) \right\} \\
&\propto \exp\left\{ -\tfrac{1}{2} \left( z^T \Lambda_Z z - 2\, \mu_Z^T \Lambda_Z z - 2 (x - \mu_X)^T \Lambda_{X,Z}\, z \right) \right\} \\
&= \exp\left\{ -\tfrac{1}{2} \left( z^T \Lambda_Z z - 2 \left( \mu_Z^T \Lambda_Z + (x - \mu_X)^T \Lambda_{X,Z} \right) z \right) \right\} \\
&= \exp\left\{ -\tfrac{1}{2} \left( z^T \Lambda_Z z - 2 \left( \Lambda_Z \mu_Z + \Lambda_{X,Z} (x - \mu_X) \right)^{\!T} z \right) \right\} ,
\end{aligned}
$$
and using a result from the previous derivation of $f_Y$, and standard properties of inverse matrices from linear algebra (given in Bishop's book), we obtain that $Z \mid X = x$ is a multivariate Gaussian with parameters
$$
\begin{aligned}
\mu_{Z|X=x} &= \Lambda_Z^{-1} \left( \Lambda_Z \mu_Z + \Lambda_{X,Z} (x - \mu_X) \right) = \mu_Z + \Lambda_Z^{-1} \Lambda_{X,Z} (x - \mu_X) = \mu_Z + \Sigma_{X,Z}\, \Sigma_X^{-1} (x - \mu_X) \\
\Sigma_{Z|X=x} &= \Lambda_Z^{-1} = \Sigma_Z - \Sigma_{X,Z}\, \Sigma_X^{-1}\, \Sigma_{X,Z} .
\end{aligned}
$$
Hence, for the conditional PDF $f_{Y|X}$, we obtain
$$
\begin{aligned}
\mu_{Y|X=x} &= x + \mu_Z + \Sigma_{X,Z}\, \Sigma_X^{-1} (x - \mu_X) \\
\Sigma_{Y|X=x} &= \Sigma_Z - \Sigma_{X,Z}\, \Sigma_X^{-1}\, \Sigma_{X,Z} . \qquad \square
\end{aligned}
$$
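The moment expressions for the marginal of $Y$ can be checked by simulation. Below is an illustrative sketch (all parameter values are made up) that draws correlated jointly Gaussian $X$ and $Z$ and compares the empirical mean and covariance of $Y = X + Z$ with $\mu_X + \mu_Z$ and $\Sigma_X + \Sigma_Z + 2\Sigma_{X,Z}$; the cross-covariance is chosen symmetric here, so $2\Sigma_{X,Z} = \Sigma_{X,Z} + \Sigma_{X,Z}^T$.

```python
# Simulation check of the marginal mean and covariance of Y = X + Z.
import numpy as np

rng = np.random.default_rng(3)
mu_X, mu_Z = np.array([1.0, 0.0]), np.array([-2.0, 3.0])
Sigma_X = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_Z = np.array([[1.5, -0.2], [-0.2, 0.5]])
Sigma_XZ = np.array([[0.4, 0.1], [0.1, 0.2]])   # cross-covariance cov(X, Z)

mu_W = np.concatenate([mu_X, mu_Z])
Sigma_W = np.block([[Sigma_X, Sigma_XZ], [Sigma_XZ.T, Sigma_Z]])
W = rng.multivariate_normal(mu_W, Sigma_W, size=500_000)
Y = W[:, :2] + W[:, 2:]

print(Y.mean(axis=0), mu_X + mu_Z)                   # empirical vs. analytical mean
print(np.cov(Y.T))                                   # empirical covariance
print(Sigma_X + Sigma_Z + 2 * Sigma_XZ)              # analytical covariance
```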
5. [Decision Theory for Multi-class Classification (Bishop, 1.22)] (5 pts.) Given a loss matrix with elements $L_{kj}$, the expected risk is minimized if, for each x, we choose the class label j that minimizes
$$\sum_k L_{kj}\, p(k \mid x) .$$
Verify that, when the loss matrix is given by $L_{kj} = 1 - \mathbb{1}[k = j]$, this reduces to the criterion of choosing the class having the largest posterior probability. What is the interpretation of this form of loss matrix?
Solution.
$$
\begin{aligned}
\sum_k L_{kj}\, p(k \mid x) &= \sum_k (1 - \mathbb{1}[k = j])\, p(k \mid x) \\
&= \left( \sum_k p(k \mid x) \right) - \left( \sum_k \mathbb{1}[k = j]\, p(k \mid x) \right) \\
&= 1 - p(j \mid x) .
\end{aligned}
$$
Hence, we have
$$\arg\min_j \sum_k L_{kj}\, p(k \mid x) = \arg\min_j\ 1 - p(j \mid x) = \arg\max_j\ p(j \mid x) .$$
The loss matrix corresponds to the standard 0/1 loss. □
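A three-class numerical illustration of the same reduction (the posterior values below are made up):

```python
# With the 0/1 loss L[k, j] = 1 - (k == j), minimizing sum_k L[k, j] * p(k | x)
# over j picks the class with the largest posterior probability.
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])               # p(k | x) for classes 0, 1, 2
L = 1.0 - np.eye(3)                                 # 0/1 loss matrix
expected_risk = L.T @ posterior                     # risk of predicting each j
print(expected_risk.argmin(), posterior.argmax())   # both print 1
```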
6. [Naive-Bayes Classification (Duda, Hart & Stork, Ch. 2, 43)] (15 pts.) Represent the class-label random variable $Z \sim \mathrm{Multinomial}(c, q)$, where each component $q_j$ of the c-dimensional vector q corresponds to $q_j = P(Z = j)$, for $j = 1, \ldots, c$. Represent the observed input $X = (X_1, \ldots, X_d)^T$ as a vector of Bernoulli random variables, each being $X_i \mid Z \sim \mathrm{Bernoulli}(p_{ij})$, where $p_{ij} = P(X_i = 1 \mid Z = j)$. Assume $X_1, \ldots, X_d$ are conditionally independent given Z. (This is the so-called naive Bayes assumption.)
Show that the minimum probability of error is achieved by the following decision rule: Given input x, output class k if $g_k(x) \ge g_j(x)$ for all alternative classes $j \ne k$, where
$$g_j(x) = \sum_{i=1}^{d} x_i \ln \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln q_j .$$
Solution. Given input x, the posterior probability for class label j is
$$
\begin{aligned}
P(Z = j \mid X = x) &\propto P(Z = j, X = x) \\
&= P(Z = j)\, P(X = x \mid Z = j) \\
&= P(Z = j) \prod_{i=1}^{d} P(X_i = x_i \mid Z = j) \\
&= q_j \prod_{i=1}^{d} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i} .
\end{aligned}
$$
Hence, to minimize the probability of error we should output the class label $j^{*}$ such that
$$
\begin{aligned}
j^{*} \in \arg\max_j P(Z = j \mid X = x) &= \arg\max_j\ q_j \prod_{i=1}^{d} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i} \\
&= \arg\max_j\ \ln\left( q_j \prod_{i=1}^{d} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i} \right) \\
&= \arg\max_j\ \ln q_j + \sum_{i=1}^{d} \left[ x_i \ln p_{ij} + (1 - x_i) \ln(1 - p_{ij}) \right] \\
&= \arg\max_j\ \ln q_j + \sum_{i=1}^{d} \left[ x_i \ln \frac{p_{ij}}{1 - p_{ij}} + \ln(1 - p_{ij}) \right] \\
&= \arg\max_j\ \sum_{i=1}^{d} x_i \ln \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln q_j ,
\end{aligned}
$$
from which the solution follows. □
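A minimal sketch of the discriminant $g_j(x)$ derived above (the parameters $p_{ij}$ and $q_j$ below are made-up illustration values; classes are 0-indexed in the code):

```python
# Naive-Bayes discriminant for binary features.
import numpy as np

def g(x, p, q):
    """g_j(x) for every class j.

    x : (d,)   binary input vector
    p : (d, c) matrix with p[i, j] = P(X_i = 1 | Z = j)
    q : (c,)   prior class probabilities
    """
    x = np.asarray(x, dtype=float)
    return x @ np.log(p / (1 - p)) + np.log(1 - p).sum(axis=0) + np.log(q)

p = np.array([[0.8, 0.2],
              [0.3, 0.6],
              [0.5, 0.9]])        # d = 3 features, c = 2 classes
q = np.array([0.5, 0.5])
x = np.array([1, 0, 1])
print(g(x, p, q).argmax())        # predicted class label
```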
7. [MLE for Gaussian PDF] (10 pts. total) Consider the log-likelihood function for a one-dimensional Gaussian PDF with mean and variance parameters $\mu$ and $\sigma^2$, respectively, given a data set of m examples $D = \{x_1, x_2, \ldots, x_m\}$. For convenience, let $\lambda \equiv \frac{1}{\sigma^2}$, so that the log-likelihood can be written as
$$\ln f(D \mid \mu, \lambda) = -\frac{1}{2} \left[ \sum_{l=1}^{m} \left( \lambda (x_l - \mu)^2 - \ln \lambda + \ln 2\pi \right) \right] .$$
(a) (5 pts.) Show that the MLE $\hat\mu = \frac{1}{m} \sum_{l=1}^{m} x_l$ by setting the partial derivative of the log-likelihood with respect to $\mu$ to 0 and solving for $\mu$.
Solution.
$$\frac{\partial \ln f(D \mid \mu, \lambda)}{\partial \mu} = -\frac{1}{2} \left[ \sum_{l=1}^{m} 2 \lambda (x_l - \mu)(-1) \right] = \lambda \sum_{l=1}^{m} (x_l - \mu) .$$
Setting the above derivative to 0 we obtain the MLE for $\mu$:
$$
\begin{aligned}
\lambda \sum_{l=1}^{m} (x_l - \hat\mu) &= 0 \\
\sum_{l=1}^{m} (x_l - \hat\mu) &= 0 \\
\left( \sum_{l=1}^{m} x_l \right) - m \hat\mu &= 0 \\
\sum_{l=1}^{m} x_l &= m \hat\mu \\
\hat\mu &= \frac{1}{m} \sum_{l=1}^{m} x_l . \qquad \square
\end{aligned}
$$
(b) (5 pts.) Show that the MLE $\hat\sigma^2 = \frac{1}{m} \sum_{l=1}^{m} (x_l - \hat\mu)^2$ by first setting the partial derivative of the log-likelihood for $\mu = \hat\mu$ with respect to $\lambda$ to 0, solving for $\hat\lambda$, and finally re-expressing $\hat\lambda$ in terms of $\hat\sigma^2$.
Solution.
$$\frac{\partial \ln f(D \mid \hat\mu, \lambda)}{\partial \lambda} = -\frac{1}{2} \left[ \sum_{l=1}^{m} \left( (x_l - \hat\mu)^2 - \frac{1}{\lambda} \right) \right] .$$
Setting the above derivative to 0 we obtain the MLE for $\lambda$:
$$
\begin{aligned}
-\frac{1}{2} \left[ \sum_{l=1}^{m} \left( (x_l - \hat\mu)^2 - \frac{1}{\hat\lambda} \right) \right] &= 0 \\
\left( \sum_{l=1}^{m} (x_l - \hat\mu)^2 \right) - m\, \frac{1}{\hat\lambda} &= 0 \\
\hat\sigma^2 = \frac{1}{\hat\lambda} &= \frac{1}{m} \sum_{l=1}^{m} (x_l - \hat\mu)^2 . \qquad \square
\end{aligned}
$$
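Both estimators are exactly the familiar sample statistics, which is easy to confirm numerically (illustrative data):

```python
# The sample mean and the biased sample variance (ddof=0) are the MLEs above.
import numpy as np

x = np.random.default_rng(4).normal(loc=3.0, scale=2.0, size=100_000)
mu_hat = x.mean()
var_hat = np.mean((x - mu_hat) ** 2)     # equals x.var(ddof=0)
print(mu_hat, var_hat)                   # ~ 3.0 and ~ 4.0
```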
8. [MLE in Incorrect Statistical Models (Duda, Hart & Stork, Ch. 3, 7)] (20 pts. total) Show that if our model is poor, the maximum-likelihood classifier we derive is not the best, even among our (poor) model set, by exploring the following example. Suppose we have two (a priori) equally probable categories, represented by the class label random variable $Z \sim \mathrm{Bernoulli}(0.5)$. Furthermore, we know that, for class label 1, we can represent the input as $X \mid Z = 1 \sim \mathrm{Gaussian}(0, 1)$, but assume that for class label 0, we can represent the input as $X \mid Z = 0 \sim \mathrm{Gaussian}(\mu, 1)$. (That is, the only parameter we seek to estimate by maximum-likelihood techniques is the mean $\mu$ of the distribution for class label 0.) Imagine, however, that the true conditional PDF over the input given class label 0 is $\mathrm{Gaussian}(1, 10^6)$.
(a) (3 pts.) What is the value of our MLE $\hat\mu$ in our poor model, given a large amount of data?
Solution. The value of $\hat\mu$ would be very large, for the following reasons. First, the Gaussian associated with $Z = 1$ in our poor model is anchored at 0. Second, the true distribution has more mass on the positive side of the real line, because the mean of the Gaussian associated with $Z = 0$ is 1; in other words, $P(X \ge 0) > \frac{1}{2}$, where the probability is with respect to the true distribution. Hence, given this, and the large variance of the true model, there is more mass at the right tail (i.e., extreme positive values) of the marginal density for X in the true model than at the left tail (i.e., extreme negative values). Therefore, we improve the likelihood of the poor model by capturing those examples at the right tail with the Gaussian in the poor model corresponding to $Z = 0$.
Given the low variance of both Gaussians in the poor model, the most likely location for $\hat\mu$ when the data size is large is roughly $0.2 \sqrt{10^6} = 200$. We can obtain this empirically by implementing the first-order condition for the MLE that results from taking the derivative of the average log-likelihood of the poor model with respect to $\mu$ and setting it to zero. The derivation leads to the following condition on $\hat\mu$:
$$\hat\mu = \frac{1}{m} \sum_{l=1}^{m} \left[ \frac{1}{\exp\!\left( -\left( x_l\, \hat\mu - \frac{\hat\mu^2}{2} \right) \right) + 1} \right] x_l .$$
We can obtain (an approximation to) the MLE via an iterative process which results from replacing the equality in the condition above by an assignment operation (i.e., replace = by $\leftarrow$), and updating the value of $\hat\mu$ in rounds, starting from an initial value of 0. The first update moves $\hat\mu$ toward the empirical mean of the examples (at $\hat\mu = 0$ every weight in the bracket equals $\tfrac{1}{2}$); because the true mean is 1/2, we expect that for large data sizes the first update leads to a small positive value. Once that happens, successive updates will continue to increase the value monotonically, pulled by examples from the long right tail of the true distribution, until it reaches the MLE value; a numerical sketch of this iteration is given below. □
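The following is a sketch of that iteration under the stated assumptions: the data are simulated from the problem's true distribution, the weight is the bracketed term in the condition above, and the exponent is clipped only to avoid floating-point overflow.

```python
# Fixed-point iteration for the MLE of mu in the poor model.
import numpy as np

rng = np.random.default_rng(5)
m = 200_000
z = rng.integers(0, 2, size=m)                    # true labels, P(Z = 1) = 1/2
x = np.where(z == 1,
             rng.normal(0.0, 1.0, size=m),        # class 1: N(0, 1)
             rng.normal(1.0, 1000.0, size=m))     # class 0: N(1, 10^6)

mu = 0.0
for _ in range(50):
    # weight of example l under the class-0 component of the poor model
    w = 1.0 / (1.0 + np.exp(np.clip(mu ** 2 / 2 - mu * x, -700, 700)))
    mu = np.mean(w * x)
print(mu)    # roughly 200, as argued above
```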
(b) (5 pts.) What is the decision boundary arising from this MLE in the poor model?
Solution. In general, the decision boundary for the poor model is derived as follows. For any given input x, we would assign it to class label 1 if
$$
\begin{aligned}
P(Z = 1 \mid x, \hat\mu) &\ge P(Z = 0 \mid x, \hat\mu) \\
\frac{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2}}{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2} + \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x - \hat\mu)^2}} &\ge \frac{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x - \hat\mu)^2}}{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2} + \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x - \hat\mu)^2}} \\
e^{-\frac{1}{2} x^2} &\ge e^{-\frac{1}{2}(x - \hat\mu)^2} \\
-\tfrac{1}{2} x^2 &\ge -\tfrac{1}{2}(x - \hat\mu)^2 \\
x^2 &\le (x - \hat\mu)^2 = x^2 - 2 \hat\mu x + \hat\mu^2 \\
2 \hat\mu x - \hat\mu^2 &\le 0 \\
\hat\mu \left( x - \frac{\hat\mu}{2} \right) &\le 0 .
\end{aligned}
$$
So the decision boundary occurs at $x = \frac{\hat\mu}{2}$, which, substituting the MLE estimate, becomes $x \approx \frac{200}{2} = 100$. □
(c) (5 pts.) Ignore for the moment the maximum-likelihood approach, and derive the Bayes optimal decision boundary given the true underlying distributions: Gaussian(0, 1) for class label 1 and Gaussian(1, $10^6$) for class label 0. Be careful to include all portions of the decision boundary.
Solution. The decision boundary for the true distribution is derived as follows. For any given input x and a probability law P with respect to the true distribution, we would assign it to class label 1 if
$$
\begin{aligned}
P(Z = 1 \mid x) &\ge P(Z = 0 \mid x) \\
\frac{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2}}{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2} + \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}\, 10^3}\, e^{-\frac{1}{2} \cdot \frac{(x-1)^2}{10^6}}} &\ge \frac{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}\, 10^3}\, e^{-\frac{1}{2} \cdot \frac{(x-1)^2}{10^6}}}{\frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2} + \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}\, 10^3}\, e^{-\frac{1}{2} \cdot \frac{(x-1)^2}{10^6}}} \\
e^{-\frac{1}{2} x^2} &\ge \frac{1}{10^3}\, e^{-\frac{1}{2} \cdot \frac{(x-1)^2}{10^6}} \\
-\tfrac{1}{2} x^2 &\ge -\tfrac{1}{2} \cdot \frac{(x - 1)^2}{10^6} - 3 \ln 10 \\
x^2 &\le \frac{(x - 1)^2}{10^6} + 6 \ln 10 = \frac{x^2 - 2x + 1}{10^6} + 6 \ln 10 \\
\left( 1 - \frac{1}{10^6} \right) x^2 + \frac{2x - 1}{10^6} - 6 \ln 10 &\le 0 \\
\left( 10^6 - 1 \right) x^2 + 2x - 1 - 10^6 \cdot 6 \ln 10 &\le 0 .
\end{aligned}
$$
Applying the quadratic formula, we obtain
$$
\begin{aligned}
\frac{-2 - \sqrt{2^2 - 4(10^6 - 1)(-1 - 10^6 \cdot 6 \ln 10)}}{2(10^6 - 1)} \le\; &x \;\le \frac{-2 + \sqrt{2^2 - 4(10^6 - 1)(-1 - 10^6 \cdot 6 \ln 10)}}{2(10^6 - 1)} \\
\frac{-2 - \sqrt{4 + 4(10^6 - 1)(1 + 10^6 \cdot 6 \ln 10)}}{2(10^6 - 1)} \le\; &x \;\le \frac{-2 + \sqrt{4 + 4(10^6 - 1)(1 + 10^6 \cdot 6 \ln 10)}}{2(10^6 - 1)} \\
\frac{-1 - \sqrt{1 + (10^6 - 1)(1 + 10^6 \cdot 6 \ln 10)}}{10^6 - 1} \le\; &x \;\le \frac{-1 + \sqrt{1 + (10^6 - 1)(1 + 10^6 \cdot 6 \ln 10)}}{10^6 - 1} \\
\frac{-1 - \sqrt{10^6 + (10^6 - 1)\, 10^6 \cdot 6 \ln 10}}{10^6 - 1} \le\; &x \;\le \frac{-1 + \sqrt{10^6 + (10^6 - 1)\, 10^6 \cdot 6 \ln 10}}{10^6 - 1} \\
\frac{-1 - 10^6 \sqrt{10^{-6} + (1 - 10^{-6})\, 6 \ln 10}}{10^6 - 1} \le\; &x \;\le \frac{-1 + 10^6 \sqrt{10^{-6} + (1 - 10^{-6})\, 6 \ln 10}}{10^6 - 1} \\
-3.7169 \approx -\sqrt{6 \ln 10} \lesssim\; &x \;\lesssim \sqrt{6 \ln 10} \approx 3.7169 .
\end{aligned}
$$
In summary, the decision boundary occurs at approximately $-3.7169$ and $3.7169$, and we output class label 1 for any x such that $|x| \le 3.7169$, approximately. □
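The roots can also be obtained numerically, which confirms the approximation (illustrative check):

```python
# Solve (10^6 - 1) x^2 + 2x - 1 - 10^6 * 6 ln 10 = 0 and compare with sqrt(6 ln 10).
import numpy as np

a, b, c = 1e6 - 1, 2.0, -1.0 - 1e6 * 6 * np.log(10)
print(np.roots([a, b, c]))          # approximately +/- 3.717
print(np.sqrt(6 * np.log(10)))      # ~ 3.7169
```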
(d) (5 pts.) Now consider again classifiers based on the (poor) model assumption that the distribution for class label 0 is $\mathrm{Gaussian}(\mu, 1)$. Using your result immediately above, find a new value of $\mu$ that will give lower error than the maximum-likelihood classifier.
Solution. The misclassification error rate of the poor model decreases monotonically as we decrease $\mu$ so that the decision boundary of the poor model decreases to 3.7169, the positive side of the Bayes optimal decision boundary (based on the true distribution). Hence, because the decision boundary of the poor model for an arbitrary value of $\mu$ is at $\mu/2$, setting $\mu$ to be any value between $2 \cdot 3.7169 \approx 7.4338$ and 200 would improve the misclassification error rate (with respect to the MLE). Specifically, setting $\mu \approx 7.4338$ would lead to the best model, in terms of misclassification error rate, among the class of poor models. □
(e) (2 pts.) Discuss these results, with particular attention to the role of knowledge of the underlying model.
Solution. Clearly, grossly incorrect assumptions about a model (in this case, about the variance of one of the Gaussians) can render MLE essentially useless, in the sense that it leads us to a model with very poor generalization ability (in this case, in terms of misclassification rate). Hence, one needs to be careful about making assumptions that are unsupported by the data. So, if one is unsure about the value of some parameters of the probabilistic model that could have generated the data, then it is better to treat those parameters as unknowns and estimate them from the data instead (in this case, the variance of the Gaussian model for $Z = 0$ should also have been treated as an unknown). □
9. [Bayesian Statistics (Duda, Hart & Stork, Ch. 3, 17)] (25 pts. total; + 15 pts. Extra Credit) The purpose of this problem is to derive the Bayesian classifier for the d-dimensional multivariate Bernoulli case, as described in an earlier exercise on naive Bayes classification. (In the context of that exercise, the result here, if repeated for each class, would provide an estimate of the conditional PMF $p_{X|Z}$ needed for the classifier's decision rule.) To do this, we consider each class separately, as the derivation is the same for each. Let the conditional probability for a given category be given by
$$p(x \mid \theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{(1 - x_i)} ,$$
and let $D = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ be a set of m samples independently drawn according to this probability distribution.
(a) (5 pts.) If $s = (s_1, \ldots, s_d)^T$ is the sum of the m samples (i.e., $s_i = \sum_{l=1}^{m} x_i^{(l)}$), show that
$$p(D \mid \theta) = \prod_{i=1}^{d} \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} .$$
Solution.
$$
\begin{aligned}
p(D \mid \theta) &= \prod_{l=1}^{m} p(x^{(l)} \mid \theta) \\
&= \prod_{l=1}^{m} \prod_{i=1}^{d} \theta_i^{x_i^{(l)}} (1 - \theta_i)^{(1 - x_i^{(l)})} \\
&= \prod_{i=1}^{d} \prod_{l=1}^{m} \theta_i^{x_i^{(l)}} (1 - \theta_i)^{(1 - x_i^{(l)})} \\
&= \prod_{i=1}^{d} \theta_i^{\sum_{l=1}^{m} x_i^{(l)}} (1 - \theta_i)^{\sum_{l=1}^{m} (1 - x_i^{(l)})} \\
&= \prod_{i=1}^{d} \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} . \qquad \square
\end{aligned}
$$
(b) (8 pts.) Assuming a uniform prior distribution for $\theta$ and using the identity
$$\int_0^1 \theta^m (1 - \theta)^n\, d\theta = \frac{m!\, n!}{(m + n + 1)!} ,$$
show that
$$f(\theta \mid D) = \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} .$$
Solution. By Bayes' rule, we have
$$f(\theta \mid D) \propto p(D \mid \theta)\, f(\theta) \propto \prod_{i=1}^{d} \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} .$$
Now, because $f_{\theta|D}$ is a proper PDF (i.e., it normalizes to 1), we have, for some constant c,
$$
\begin{aligned}
1 = \int_{[0,1]^d} f(\theta \mid D)\, d\theta &= \int_{[0,1]^d} c \prod_{i=1}^{d} \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)}\, d\theta \\
&= c \prod_{i=1}^{d} \int_0^1 \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)}\, d\theta_i \\
&= c \prod_{i=1}^{d} \frac{s_i!\, (m - s_i)!}{(s_i + (m - s_i) + 1)!} \\
&= c \prod_{i=1}^{d} \frac{s_i!\, (m - s_i)!}{(m + 1)!} ,
\end{aligned}
$$
so that
$$c = \frac{1}{\prod_{i=1}^{d} \frac{s_i!\, (m - s_i)!}{(m + 1)!}} = \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!} ,$$
and
$$f(\theta \mid D) = c \prod_{i=1}^{d} \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} = \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} . \qquad \square$$
(c) (2 pts.) Plot this density for the case d = 1, m = 1, and for the two resulting possibilities for $s_1$.
Solution. For $s_1 = 1$, we have $f(\theta_1 \mid D) = 2\theta_1$, a linear function with positive slope (i.e., the mode is at 1). For $s_1 = 0$, we have $f(\theta_1 \mid D) = 2(1 - \theta_1)$, a linear function with negative slope (i.e., the mode is at 0). □
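One way to reproduce the two plots (an illustrative matplotlib sketch, not part of the original solution):

```python
# Posterior densities for d = 1, m = 1 under the two possible observed counts.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)
plt.plot(theta, 2 * theta, label=r"$s_1 = 1$: $f(\theta_1|D) = 2\theta_1$")
plt.plot(theta, 2 * (1 - theta), label=r"$s_1 = 0$: $f(\theta_1|D) = 2(1-\theta_1)$")
plt.xlabel(r"$\theta_1$")
plt.ylabel("posterior density")
plt.legend()
plt.show()
```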
(d) (15 pts., Extra Credit) Integrate the product $p(x \mid \theta)\, f(\theta \mid D)$ over $\theta$ to obtain the desired conditional (posterior) PMF
$$p(x \mid D) = \prod_{i=1}^{d} \left( \frac{s_i + 1}{m + 2} \right)^{x_i} \left( 1 - \frac{s_i + 1}{m + 2} \right)^{1 - x_i} .$$
Solution.
$$
\begin{aligned}
p(x \mid D) &= \int_{[0,1]^d} p(x \mid \theta)\, f(\theta \mid D)\, d\theta \\
&= \int_{[0,1]^d} \left( \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{(1 - x_i)} \right) \left( \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} \right) d\theta \\
&= \int_{[0,1]^d} \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{(1 - x_i)}\, \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)}\, d\theta \\
&= \int_{[0,1]^d} \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i + x_i} (1 - \theta_i)^{(m - s_i + 1 - x_i)}\, d\theta \\
&= \prod_{i=1}^{d} \int_0^1 \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i + x_i} (1 - \theta_i)^{(m - s_i + 1 - x_i)}\, d\theta_i \\
&= \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!} \int_0^1 \theta_i^{s_i + x_i} (1 - \theta_i)^{(m - s_i + 1 - x_i)}\, d\theta_i \\
&= \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!} \cdot \frac{(s_i + x_i)!\, (m - s_i + 1 - x_i)!}{(s_i + x_i + m - s_i + 1 - x_i + 1)!} \\
&= \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!} \cdot \frac{(s_i + x_i)!\, (m - s_i + 1 - x_i)!}{(m + 2)!} \\
&= \prod_{i=1}^{d} \frac{1}{m + 2} \cdot \frac{(s_i + x_i)!}{s_i!} \cdot \frac{(m - s_i + 1 - x_i)!}{(m - s_i)!} \\
&= \prod_{i=1}^{d} \left( \frac{s_i + 1}{m + 2} \right)^{x_i} \left( \frac{m - s_i + 1}{m + 2} \right)^{1 - x_i} \\
&= \prod_{i=1}^{d} \left( \frac{s_i + 1}{m + 2} \right)^{x_i} \left( 1 - \frac{s_i + 1}{m + 2} \right)^{1 - x_i} . \qquad \square
\end{aligned}
$$
(e) (10 pts.) If we think of obtaining the conditional posterior PMF $p(x \mid D)$ by substituting an estimate $\hat\theta$ for $\theta$ in $p(x \mid \theta)$, what is the effective Bayesian estimate for $\theta$?
Solution. First recall that the prior over the parameters, $f_\theta$, is the uniform joint PDF, which is equivalent to assuming that $\theta_i \sim \mathrm{Beta}(1, 1)$, i.e., a Beta distribution with hyper-parameters $\alpha = \beta = 1$, for all dimensions i, i.i.d.:
$$f(\theta) \propto \prod_{i=1}^{d} \theta_i^{\alpha - 1} (1 - \theta_i)^{\beta - 1} = \prod_{i=1}^{d} \theta_i^{1 - 1} (1 - \theta_i)^{1 - 1} = 1 .$$
The effective Bayesian estimate $\hat\theta = E[\theta \mid D]$ is the posterior mean. Recall that the Beta distribution is the conjugate prior for the Bernoulli PMF. Hence, for all i, $\theta_i \mid \mathcal{D} = D \sim \mathrm{Beta}(s_i + 1, m - s_i + 1)$; that is,
$$f(\theta \mid D) = \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{s_i} (1 - \theta_i)^{(m - s_i)} = \prod_{i=1}^{d} \frac{(m + 1)!}{s_i!\, (m - s_i)!}\, \theta_i^{(s_i + 1) - 1} (1 - \theta_i)^{(m - s_i + 1) - 1} .$$
Because $\frac{\alpha}{\alpha + \beta}$ is the mean of the Beta distribution with hyper-parameters $\alpha$ and $\beta$, we obtain that
$$\hat\theta_i = E[\theta_i \mid D] = \frac{s_i + 1}{m + 2} .$$
(Note that we could have derived this result from the previous part of this problem, i.e., the extra credit, where we derived $X_i \mid \mathcal{D} = D \sim \mathrm{Bernoulli}\!\left( \frac{s_i + 1}{m + 2} \right)$, for all i, independent.) □
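In other words, the effective Bayesian estimate is the Laplace-smoothed frequency $(s_i + 1)/(m + 2)$ rather than the MLE $s_i / m$. A small illustration with made-up binary data:

```python
# Effective Bayesian estimate vs. MLE for each Bernoulli component.
import numpy as np

D = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])          # m = 4 samples, d = 3 binary features
m, d = D.shape
s = D.sum(axis=0)                  # s = [3, 1, 2]
theta_bayes = (s + 1) / (m + 2)    # [0.667, 0.333, 0.5]
theta_mle = s / m                  # [0.75, 0.25, 0.5]
print(theta_bayes, theta_mle)
```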
Collaboration Policy: It is OK to discuss the homework with your peers, but each student must
write and turn in his/her own report, code, etc. based on his/her own work.