
COMS 4771 Practice Problems

Last updated: 2019-11-30 22:01:53-05:00

These practice problems are for practice only; they are not to be submitted. Please feel free to discuss
on Piazza, in office hours, etc.
We will not be posting any solutions other than those that already appear here.

Contents

1 Probability
  1.1 Markov's inequality
  1.2 Bayes Optimal Classification

2 Maximum likelihood estimation

3 Decision trees
  3.1 Binary trees
  3.2 Build your tree

4 Linear algebra
  4.1 Some basics
  4.2 Linear Algebra in Linear Regression
  4.3 Distance optimization

5 Nearest Neighbor

6 Linear regression
  6.1 Heteroskedastic noise in linear regression
    6.1.1 Part (a)
    6.1.2 Part (b)
  6.2 Linear algebraic perspective
    6.2.1 Part (a)
    6.2.2 Part (b)
    6.2.3 Part (c)
  6.3 Two Approaches to Deriving the OLS Normal Equations
    6.3.1 Part (a)
    6.3.2 Part (b)
  6.4 The least (Euclidean) norm solution
  6.5 Regularized Linear Regression
    6.5.1 Part (a)
    6.5.2 Part (b)

7 More linear algebra
  7.1 Gradients and derivatives
  7.2 Spectral theorem for symmetric matrices

8 More linear regression

9 Dimension reduction
  9.1 Solution

10 Kernels and features
  10.1 Solution

11 Decision boundaries
  11.1 Solution

12 Failure of MLE
  12.1 Solution

13 Ridge regression problem from Dian Jiao
  13.1 Solution

14 Min-max regression
  14.1 Solution

15 Closure properties of kernels
  15.1 Solution (part 1)
  15.2 Solution (part 2)

16 Schur product theorem
  16.1 Solution

17 Convergence Theorem for Perceptron
  17.1 Solution

18 Projection onto a line

19 Radial Basis Function Kernel
  19.1 Solution

20 Convex Hulls and Linear Separability
  20.1 Solution

21 Kernelized gradient descent for linear regression
  21.1 Solution

22 Model selection and inductive biases

23 Linear separability of an arbitrary training set
  23.1 Solution

24 SVM Example
  24.1 Solution

25 Convex sets

26 Convex functions
1 Probability
1.1 Markov’s inequality
Show that:

1. For any event A, we have Pr[A] = E[1{A}].¹


2. For any non-negative random variable X, and any c > 0,

   Pr[X ≥ c] ≤ E[X]/c.

(Hint: compare the value of the function 1{X ≥ c} with the value of X/c, pointwise.)
3. Let X1, . . . , Xn be uncorrelated {0, 1}-valued random variables, each with distribution Bern(p). We can estimate the value of p using the sample mean ∑_i Xi/n. Use the previous result to show that, for any ε > 0,

   Pr[ |∑_i Xi/n − p| > ε ] ≤ p(1 − p)/(nε²) ≤ 1/(4nε²).

(Hint: X > 0 =⇒ Pr[X < c] = Pr[X² < c²].)
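As a quick numerical sanity check of the bound in part 3 (not a proof), here is a sketch in Python; NumPy is assumed available, and the parameters p, n, and ε are arbitrary choices.

    import numpy as np

    # Empirically compare Pr[|mean(X) - p| > eps] against the bound 1/(4*n*eps^2).
    rng = np.random.default_rng(0)
    p, n, eps, trials = 0.3, 200, 0.1, 20000

    samples = rng.binomial(1, p, size=(trials, n))   # trials x n Bernoulli(p) draws
    deviations = np.abs(samples.mean(axis=1) - p)    # |sample mean - p| for each trial
    empirical = np.mean(deviations > eps)            # estimated failure probability
    bound = 1.0 / (4 * n * eps**2)

    print(f"empirical = {empirical:.4f}, bound = {bound:.4f}")  # empirical should be <= bound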

1.2 Bayes Optimal Classification


In binary classification, the loss function we usually want to minimize is the zero-one loss:

l(f(x), y) = 1{f(x) ≠ y}

where f (x), y ∈ {0, 1}. In this problem we will consider the effect of using an asymmetric loss function:

lα,β(f(x), y) = α·1{f(x)=1, y=0} + β·1{f(x)=0, y=1}.

Under this loss function, the two types of errors receive different weights, determined by α, β > 0. Determine
the Bayes optimal classifier, i.e., the classifier that achieves minimum risk assuming the distribution of the
random example (X, Y ) is known, for the loss lα,β where α, β > 0.

2 Maximum likelihood estimation


1. Let X1, . . . , Xn be the outcomes of n independent rolls of a (potentially) weighted six-sided die. That is, X1, . . . , Xn are iid random variables from Categorical(p1, p2, p3, p4, p5, p6), where the pj's are non-negative and sum to one. (Each takes values in {1, . . . , 6}.) The probability of the observed rolls (x1, . . . , xn) ∈ {1, . . . , 6}^n is

   ∏_{i=1}^n ∑_{j=1}^6 1{xi = j} pj .

Show that the MLE estimator for pj is

   (1/n) ∑_{i=1}^n 1{xi = j},

i.e., the fraction of rolls with that face value. Hint: compute the log probability and make sure to enforce the ∑_j pj = 1 condition (for example, with Lagrange multipliers) while maximizing.
2. Consider the probability density p(x) = 2θx e^{−x²θ} for x ≥ 0 and p(x) = 0 for x < 0. Find the maximum likelihood estimator for θ given an iid sample (x1, . . . , xn).
¹ For those of you who have studied measure theory, assume A is measurable. We will ignore measurability issues in this course.
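As a small numerical illustration of part 1 (the estimator itself is stated in the problem), the empirical face frequencies attain a log-likelihood at least as large as other candidate distributions. A sketch, assuming NumPy; the true probabilities are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    p_true = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])
    rolls = rng.choice(np.arange(1, 7), size=5000, p=p_true)

    # MLE from the problem statement: fraction of rolls showing each face.
    p_hat = np.array([(rolls == j).mean() for j in range(1, 7)])

    def log_likelihood(p):
        # sum_i log p_{x_i}
        return np.sum(np.log(p[rolls - 1]))

    # Compare against the uniform distribution and a random point on the simplex.
    q = rng.dirichlet(np.ones(6))
    print(log_likelihood(p_hat) >= log_likelihood(np.full(6, 1 / 6)))  # expect True
    print(log_likelihood(p_hat) >= log_likelihood(q))                  # expect True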

3 Decision trees
3.1 Binary trees
Suppose you have a decision tree where the splitting rule can have more than two possible output values, so
that non-leaf nodes can have more than two children. Show how to convert it into a decision tree in which
the splitting rules have only two possible output values.

3.2 Build your tree


We recall two measures of uncertainty M ∈ {G, E} for decision trees at node n where pk is the proportion
of examples reaching node n with label k:
   Gini index at node n:   G(n) = 1 − ∑_k pk²            (1)

   Entropy at node n:      E(n) = − ∑_k pk log pk .       (2)

For one node n with children l_left, l_right, we recall the overall uncertainty:

   ∑_{l ∈ {l_left, l_right}} M(l) · (# training examples reaching l).
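As a small illustration of these two uncertainty measures, here is a sketch (assuming NumPy) that computes both from the labels of the examples reaching a node; the helper names are made up for this example.

    import numpy as np

    def gini(labels):
        # G(n) = 1 - sum_k p_k^2, where p_k are the label proportions at the node.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def entropy(labels):
        # E(n) = -sum_k p_k log p_k (natural log here; any fixed base works).
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    labels = [0, 0, 1, 1, 1, 0]              # labels reaching some node
    print(gini(labels), entropy(labels))     # 0.5 and log 2 for a 50/50 split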

Figure 1: Dataset: red are label 0 and blue are label 1. (Scatter plot over x and y; the figure itself is not reproduced here.)

1. Show that the Gini index can be rewritten as G(n) = ∑_k pk(1 − pk).
2. Compute both measures for:
(a) All the observations of figure 1
(b) The observations only with x > −1
(c) The observations only with x < 1
3. Using only splits of the form (x1, x2) ↦ 1{xi > c} for c ∈ R, build 2 binary trees of height 1 that minimize
the overall uncertainty (one using the Gini index, and one using the entropy).
4. Do you think that, by using other forms of splits instead, you could build a tree with fewer nodes?

4 Linear algebra
4.1 Some basics
Let g : R² → R be the function defined by g(x) := (1/2) xᵀAx − bᵀx + c, where

   A := [ 3  5 ; 5  3 ],   b := (1, 2),   and c := 9.

1. Compute the determinant of A

2. Without any calculations, show that the eigenvalues of A are real and of opposite sign.
3. Show that for any unit vector b, we have bᵀx ≤ ‖x‖₂.
4. Show that g has neither a minimum nor a maximum.

4.2 Linear Algebra in Linear Regression


Consider the linear model y = Xβ + ε, and the fitted values from linear regression ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy. For simplicity, let's assume the εi's are i.i.d. normal with mean 0 and variance σ². Let H = X(XᵀX)⁻¹Xᵀ. Prove that

   (1/σ²) ∑_{i=1}^n cov(yi, ŷi) = tr(H).

Solution
Observe that ŷ = Hy. Now, cov(ŷ, y) = cov(Hy, y) = H cov(y, y) = H(σ²In×n) = σ²H. Thus,

   (1/σ²) tr(cov(ŷ, y)) = tr(H).
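A quick simulation consistent with this identity, under the stated model (a sketch assuming NumPy; the dimensions and σ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 50, 3, 2.0
    X = rng.normal(size=(n, d))
    beta = rng.normal(size=d)
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix

    # Monte Carlo estimate of sum_i cov(y_i, yhat_i) over many noise draws.
    trials = 20000
    Y = X @ beta + sigma * rng.normal(size=(trials, n))   # each row is one draw of y
    Yhat = Y @ H.T                                        # yhat = H y for each draw
    cov_sum = np.sum(np.mean(Y * Yhat, axis=0) - Y.mean(axis=0) * Yhat.mean(axis=0))

    print(cov_sum / sigma**2, np.trace(H))         # both approximately d = 3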

4.3 Distance optimization


Given a point p ∈ Rd and a plane D := {x ∈ Rd |w · x = 0} defined as the set of vectors orthogonal to a
vector w, derive a formula for the minimum distance between p and any point on D.

5 Nearest Neighbor
1. Given a collection of labeled examples D := {(xi, yi)}_{i=1}^n and two unlabeled examples t1, t2, suppose that t1 and t2 have the same nearest neighbor d1 in D (when using the ℓ₂ norm). Prove that for any point lying on the line segment between t1 and t2, that point's nearest neighbor in D will also be d1.
2. If t1 and t2 ’s nearest neighbors in D simply have the same training label (but may be different ex-
amples), must the nearest neighbor of every point on the line segment in between t1 and t2 have that
training label as well?

6 Linear regression
6.1 Heteroskedastic noise in linear regression
6.1.1 Part (a)
Let Pβ be a probability distribution on Rd × R for the random pair (X, Y ) (where X = (X1 , . . . , Xd )) such
that
X1, . . . , Xd ∼ iid N(0, 1), and Y | X = x ∼ N(xᵀβ, ‖x‖₂²), x ∈ Rd.

5
Here, β = (β1 , . . . , βd ) ∈ Rd are the parameters of Pβ .
True or false: The linear function with the smallest squared loss risk with respect to Pβ is β.
Answer with “true” or “false”, and briefly (but precisely) justify your answer.

6.1.2 Part (b)


Let (x1, y1), . . . , (xn, yn) ∈ Rd × R be given, and assume xi ≠ 0 for all i = 1, . . . , n. Let fβ be the probability density function for Pβ as defined in Part (a). Define the function Q : Rd → R by

   Q(β) := (1/n) ∑_{i=1}^n ln fβ(xi, yi),   β ∈ Rd.

Find a system of linear equations Aβ = b over variables β = (β1 , . . . , βd ) ∈ Rd such that its solutions are
maximizers of Q over all vectors in Rd .
Write the system of linear equations by defining the left-hand side matrix A ∈ Rm×d and right-hand
side vector b ∈ Rm (here, m is the number of linear equations), and briefly (but precisely) justify your answer.
You may define A and b as products of matrices and vectors if you like, but make sure these matrices and
vectors are also clearly defined.

6.2 Linear algebraic perspective


Let A ∈ Rn×d and b ∈ Rn be given, and let R̂ be defined by R̂(β) := ‖Aβ − b‖₂².

6.2.1 Part (a)


Suppose the rank of A is smaller than d. Is the minimizer of R̂ uniquely defined?
Answer with “yes” or “no”, and briefly (but precisely) justify your answer.

6.2.2 Part (b)


Suppose the rank of A is smaller than d. Is the orthogonal projection of b onto the range of A uniquely
defined?
Answer with “yes” or “no”, and briefly (but precisely) justify your answer.

6.2.3 Part (c)


Suppose you only have A and not b, but you are given an orthogonal projection b̂ of b onto the range of A.
Explain how to find a minimizer of R̂ using only A and b̂.
Answer with precise pseudocode for a procedure that takes as input A and b̂ and returns a minimizer of R̂. Briefly (but precisely) justify the correctness of the procedure.

6.3 Two Approaches to Deriving the OLS Normal Equations


In this problem, we will work through two different ways of deriving the standard OLS normal equations for minimizing ‖Xw − Y‖². The first is a standard calculus-based approach that minimizes directly over w, and the second is a clever linear algebra approach.

6.3.1 Part (a)


Show that the gradient of ‖Xw − Y‖² with respect to w is 2XᵀXw − 2XᵀY. Hint: use the fact that ∂(xᵀAx)/∂x = 2Ax (for symmetric A), or write it out in terms of components.
Set this gradient equal to zero and solve, assuming XᵀX is invertible. If XᵀX is not invertible, it simply means that the map w ↦ XᵀXw is not bijective, but a generalized inverse A† can be constructed that has most of the desired properties (most notably AA†A = A). This is called the pseudo-inverse.
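A small sketch of both cases in code (assuming NumPy): solving the normal equations directly when XᵀX is invertible, and using the pseudo-inverse in general.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    Y = rng.normal(size=20)

    # Invertible case: solve the normal equations X^T X w = X^T Y directly.
    w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

    # General case: the pseudo-inverse gives a solution even if X^T X is singular.
    w_pinv = np.linalg.pinv(X) @ Y

    print(np.allclose(w_normal, w_pinv))                      # True when X has full column rank
    print(np.allclose(X.T @ (X @ w_pinv - Y), 0, atol=1e-8))  # gradient condition X^T (Xw - Y) = 0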

6.3.2 Part (b)
Here we will derive the same equation, but from a purely geometric point of view. We want to minimize ‖Xw − Y‖², which means we want to find the point Xw closest in Rn to Y. Geometrically, we know Xw can be any point in the column space of X, and the closest point is just going to be the (orthogonal) projection of Y onto that space. In other words, by changing w, we can select any point in the space spanned by the columns of X, since Xw = X1w1 + · · · + Xdwd. But if Xw is the orthogonal projection, then Xw − Y is perpendicular to the column space, so it must lie in the null space of Xᵀ. Prove this. Hint: Since Xw − Y is orthogonal to every column of X, it must be orthogonal to every row of Xᵀ. Now use this fact to recover the normal equations for OLS.

6.4 The least (Euclidean) norm solution


Let A ∈ Rn×d and b ∈ Rn . In lecture, we saw that the solution to the normal equations of minimum
Euclidean norm is an element of the row space of A. Prove that if w is a solution to the normal equations
that also is in the row space of A, then it is unique.

6.5 Regularized Linear Regression


6.5.1 Part (a)
We are given a set of two-dimensional inputs and their corresponding outputs: (xi,1, xi,2, yi). We would like to use the following regression model to predict y:

   yi = w1² xi,1 + w2² xi,2 .

Derive the optimal value for w1 when using least squares as the minimization objective (w2 may appear in your resulting equation).

6.5.2 Part (b)


Ridge regression usually optimizes the squared (L2) norm:

   Ŵ_L2 = arg min_w ∑_{j=1}^N ( tj − ∑_i wi hi(xj) )² + λ ∑_i wi² .

An alternative that is less susceptible to outliers is to minimize the “sum of absolute values” (L1) norm:

   Ŵ_L1 = arg min_w ∑_{j=1}^N | tj − ∑_i wi hi(xj) | + λ ∑_i wi² .

(i) Plot a sketch of the L1 loss function; do not include the regularization term in your plot.

(ii) Give an example of a case where outliers can hurt a learning algorithm.

(iii) Why do you think L1 is less susceptible to outliers than L2?

(iv) Are outliers always bad, so that we should always ignore them? Why?

(v) As with the ridge regression objective above, the regularized L1 regression objective can also be viewed as a MAP estimator. Explain why by describing the prior P(w) and the likelihood function P(t|x, w) for this Bayesian learning problem.

7 More linear algebra
7.1 Gradients and derivatives
In this exercise, A ∈ Rn×n, b ∈ Rn, J denotes the Jacobian and ∇ the gradient. The gradient is defined for scalar-valued functions from Rn to R, and the Jacobian for vector-valued functions. Sometimes people say “gradient” even when the function is vector-valued.

(a) Let f : Rn → Rn, x ↦ Ax. Compute J(f)(x) for any x ∈ Rn where f is differentiable.

(b) Let f : Rn → R, x ↦ xᵀAx + xᵀb. Compute ∇f(x) for any x ∈ Rn where f is differentiable.

(c) Let f : Rn → R, x ↦ ‖x‖₂. Compute ∇f(x) for any x ∈ Rn where f is differentiable.

(d) Let f : Rn×n → R, X ↦ Tr(AX). Compute ∇f(X) for any X where f is differentiable.

(e) Let f : GLn(R) → R, M ↦ log det M. Compute ∇f(M) for any M ∈ GLn(R) where f is differentiable.

(f) Let f : GLn(R) → GLn(R), M ↦ M⁻¹. Compute the derivative (Jacobian) of f at any M ∈ GLn(R) where f is differentiable.
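Whatever closed forms you derive for parts (b) and (c) can be checked numerically with finite differences; a sketch assuming NumPy, with random A, b, and x.

    import numpy as np

    def numerical_gradient(f, x, h=1e-6):
        # Central finite differences for a scalar-valued f : R^n -> R.
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2 * h)
        return g

    rng = np.random.default_rng(0)
    n = 4
    A, b, x = rng.normal(size=(n, n)), rng.normal(size=n), rng.normal(size=n)

    f_b = lambda x: x @ A @ x + x @ b       # part (b)
    f_c = lambda x: np.linalg.norm(x)       # part (c), away from x = 0

    print(numerical_gradient(f_b, x))       # compare against your formula for (b)
    print(numerical_gradient(f_c, x))       # compare against your formula for (c)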

7.2 Spectral theorem for symmetric matrices


Let A ∈ Sn(R) be a real symmetric matrix (meaning Aᵀ = A) of size n × n.
Let V be a subspace of Rn. We say V is an invariant subspace of A if ∀x ∈ V, Ax ∈ V.

1. Show that if V is an invariant subspace of A, then V⊥ is also an invariant subspace of A. (In general V⊥ is invariant under Aᵀ, but here A is symmetric.)

2. Suppose Rn = B ⊕ C and B is an invariant subspace of A. What does this imply about the block structure of A (if you write A = [ D E ; F G ] with blocks corresponding to the decomposition B ⊕ C)?

3. Show that nullspace(A) = range(A)⊥, where range(A) is the span of the columns of A.


4. Show that the (complex) eigenvalues of A are actually real
5. Show by induction that there exists an orthonormal basis of eigenvectors of A

8 More linear regression


Suppose (X1 , Y1 ), . . . , (Xn , Yn ) are iid random examples in R × R, where for each i = 1, . . . , n,

Xi ∼ N(0, 1),   Yi | Xi = x ∼ N(x³ − x, 1)   for all x ∈ R.

Let β⋆ be a linear function that minimizes the true risk R(β) := E[(X1β − Y1)²]. Let β̂ be a minimizer of the empirical risk R̂(β) := (1/n) ∑_{i=1}^n (Xiβ − Yi)². True or false: E[β̂] = β⋆ for all integers n ≥ 1.
Answer with “true” or “false”. If you answer “true”, give a clear and precise proof. If you answer
“false”, prove your claim with a simple counterexample.

9 Dimension reduction
Suppose you are faced with a binary classification problem for data from an unknown probability distribution
over R1000 × {0, 1}.
You are asked to consider the following approach to construct a classifier. First, find a feature transfor-
mation ϕ : R1000 → R. Then, after applying ϕ to all 1000-dimensional feature vectors, find an affine classifier
hθ : R → {0, 1} of the form hθ (z) = 1{z≥θ} for some θ ∈ R. Thus, you are to only consider classifiers of the
form fθ (x) = 1{ϕ(x)≥θ} .
True or false: There exists a feature transformation ϕ such that there is a classifier of the form fθ (as
described above) with (zero-one loss) risk no larger than that of any classifier f : R1000 → {0, 1}.
Answer with “true” or “false”. If you answer “true”, give a clear and precise proof. If you answer
“false”, prove your claim with a simple counterexample.

9.1 Solution
True.
Let P be the unknown probability distribution over R1000 × {0, 1}, and (X, Y ) ∼ P .
Define

   ϕ(x) := Pr_{(X,Y)∼P}(Y = 1 | X = x)   for all x ∈ R1000.

The classifier defined by f1/2 (x) = 1{ϕ(x) ≥ 1/2} for all x ∈ R1000 is the Bayes optimal classifier; its risk is
no larger than that of any other classifier.

10 Kernels and features


Consider the following six training examples from R2 \ {0} × {−1, +1}:
1. Positive examples (with label +1): (1, 1), (√3, 1), (1, √3)

2. Negative examples (with label −1): (3, 0), (0, 3), (−1, −1)

Note that these examples are not linearly separable. Consider the following function:

   K(x, z) := 1 + xᵀz / √(xᵀx · zᵀz)   for all x, z ∈ R2 \ {0}.

Construct a feature map ϕ : R2 \ {0} → R3 such that the following properties hold: (i) ϕ(x)T ϕ(z) = K(x, z)
for all x, z ∈ R2 \ {0}; and (ii) there exists a linear separator with normal vector w ∈ R3 of the form
w = (a, b, b) for some a, b ∈ R such that yϕ(x)T w ≥ 1 for all training examples (x, y).
Answer with a precise definition of the feature map ϕ (e.g., ϕ((x1 , x2 )) = (x1 , x2 , x1 x2 )), and explain
why this ϕ has the desired properties. As part of the explanation, give the parameters a, b ∈ R of the linear
separator, either explicitly (e.g., a = 1, b = 2) or implicitly as the solution to a system of linear equations.

10.1 Solution
Define

   ϕ(x) := (1, x1/‖x‖₂, x2/‖x‖₂)   for all x ∈ R2 \ {0}.

We verify that, for any x, z ∈ R2 \ {0},

   ϕ(x)ᵀϕ(z) = 1 · 1 + (x1/‖x‖₂)(z1/‖z‖₂) + (x2/‖x‖₂)(z2/‖z‖₂) = 1 + xᵀz/(‖x‖₂‖z‖₂) = K(x, z)

as required.

9
The positive examples, after mapping through ϕ, are (1, √2/2, √2/2), (1, √3/2, 1/2), (1, 1/2, √3/2). The negative examples, after mapping through ϕ, are (1, 1, 0), (1, 0, 1), (1, −√2/2, −√2/2). Therefore, we need to ensure the existence of a weight vector w = (a, b, b) that satisfies

   a + √2 b ≥ 1,
   a + ((1 + √3)/2) b ≥ 1,
   a + b ≤ −1,
   a − √2 b ≤ −1.

These inequalities are simultaneously satisfied by

   a := −(3 + 2√3),
   b := 2 + 2√3.

(This solution is obtained by solving the linear system determined by the second and third inequalities above,
except with equality.)
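A short numerical check of this solution (the kernel identity and the margin conditions); a sketch assuming NumPy.

    import numpy as np

    def K(x, z):
        return 1 + x @ z / (np.linalg.norm(x) * np.linalg.norm(z))

    def phi(x):
        return np.array([1.0, x[0] / np.linalg.norm(x), x[1] / np.linalg.norm(x)])

    pos = [np.array(p) for p in [(1, 1), (np.sqrt(3), 1), (1, np.sqrt(3))]]
    neg = [np.array(p) for p in [(3, 0), (0, 3), (-1, -1)]]

    # (i) phi(x)^T phi(z) = K(x, z) on all pairs of training points.
    print(all(np.isclose(phi(x) @ phi(z), K(x, z)) for x in pos + neg for z in pos + neg))

    # (ii) y * phi(x)^T w >= 1 with w = (a, b, b) as derived above.
    a, b = -(3 + 2 * np.sqrt(3)), 2 + 2 * np.sqrt(3)
    w = np.array([a, b, b])
    print(all(phi(x) @ w >= 1 - 1e-9 for x in pos)
          and all(-(phi(x) @ w) >= 1 - 1e-9 for x in neg))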

11 Decision boundaries
Consider the following generative model M for binary classification data in R3 × {0, 1} with multivariate
Gaussian class conditional distributions:

Pr(Y = 0) = 1 − π, Pr(Y = 1) = π, X | Y = 0 ∼ N(0, I), X | Y = 1 ∼ N(µ, I).

Here, 0 ∈ R3 is the origin, and I ∈ R3×3 is the identity matrix. The parameters of the model M are
π ∈ (0, 1) and µ ∈ R3 . Let Pπ,µ ∈ M denote the distribution in the model with parameters (π, µ).
Does the following property always, sometimes, or never hold? (“Always” = for all distributions in M;
“never” = for none of the distributions in M; “sometimes” = for some but not all distributions in M.)

The decision boundary of the Bayes optimal classifier for Pπ,µ passes through the line segment
joining 0 and µ.
Answer with either “always”, “sometimes”, or “never”. If your answer is “always” or “never”, write
a clear and precise proof of the claim. If your answer is “sometimes”, write a precise condition on the
parameters (e.g., “µ = (π, 2π, 3π)”) such that the property holds if and only if the condition is satisfied.

11.1 Solution
Sometimes. (This might seem counterintuitive, but consider what happens as π → 0 or π → 1.)
The decision boundary is the set

   {x ∈ R³ : (1 − π) exp(−‖x‖²/2) = π exp(−‖x − µ‖²/2)} = {x ∈ R³ : µᵀx = ½‖µ‖² + ln((1 − π)/π)}.

The line segment joining 0 and µ is


{tµ : t ∈ [0, 1]}.
The decision boundary intersects the line segment when there exists t ∈ [0, 1] such that t‖µ‖² = ½‖µ‖² + ln((1 − π)/π). In other words,

   0 ≤ ½‖µ‖² + ln((1 − π)/π) ≤ ‖µ‖².

12 Failure of MLE
In this problem, you will study a case where maximum likelihood fails but empirical risk minimization
succeeds.
Consider the following probability distribution P on X × Y, for X = {0, 1}2 and Y = {0, 1}.

P(X = (x1, x2)):
              x1 = 0        x1 = 1
   x2 = 0    (1 − ε)/2     (1 − ε)/2
   x2 = 1      ε/2           ε/2

P(Y = 1 | X = (x1, x2)):
              x1 = 0        x1 = 1
   x2 = 0       0             1
   x2 = 1       1             0

Above, regard ε as a small positive number between 0 and 1. Let f⋆ be the Bayes optimal classifier for P.
(In this problem, we are concerned with zero-one loss.)
Now also consider the statistical model P for Y | X: P = {PA, PB}, where

   PA(Y = 1 | X = (x1, x2)) = 4/5 if x1 = 0, and 1/5 if x1 = 1;
   PB(Y = 1 | X = (x1, x2)) = 0 if x1 = 0, and 1 if x1 = 1.

Note that P ∉ P, and that distributions in P do not specify the distribution of X (like logistic regression).
Let fA and fB be the Bayes optimal classifiers, respectively, for PA and PB .
The maximum likelihood approach to learning a classifier selects the distribution in P of highest likelihood
given training data (which are regarded as an iid sample), then returns the optimal classifier for the chosen
distribution (i.e., fA if PA is the maximum likelihood distribution, otherwise fB ).

(a) Give simple expressions for the Bayes optimal classifiers for P, PA, and PB. E.g.,

       f⋆(x) = 1 if x1 + x2 = 2, and 0 otherwise.

(b) What is the risk of each classifier from Part (a) under distribution P ?
(c) Suppose in the training data, the number of training examples of the form ((x1 , x2 ), y) is equal to
Nx1 ,x2 ,y , for (x1 , x2 , y) ∈ {0, 1}3 . Give a simple rule for determining the maximum likelihood distribu-
tion in P in terms of Nx1 ,x2 ,y for (x1 , x2 , y) ∈ {0, 1}3 . Briefly justify your answer.
(d) If the training data is a typical iid sample from P (with large sample size n), which classifier from Part
(a) is returned by the maximum likelihood approach? Briefly justify your answer.
(e) If the training data is an iid sample from P of size n, then how large should n be so that, with probability at least 0.99, Empirical Risk Minimization returns the classifier in {fA, fB} with smallest risk? Briefly justify your answer.
(This part requires some tools that we may not have covered in the class. You may want to use some-
thing called “Hoeffding’s inequality”: https://en.wikipedia.org/wiki/Hoeffding%27s_inequality.)

12.1 Solution
(a) f⋆(x) = 1{x1 ≠ x2} = x1 ⊕ x2,  fA(x) = 1{x1 = 0} = 1 − x1,  fB(x) = 1{x1 = 1} = x1.

(b) R(f⋆) = 0, R(fA) = 1 − ε, R(fB) = ε.
(c) Likelihood of PA is

   (1/5)^{N0,0,0 + N0,1,0 + N1,0,1 + N1,1,1} × (4/5)^{N0,0,1 + N0,1,1 + N1,0,0 + N1,1,0}

which is always some number strictly between zero and one. The likelihood of PB is

1{N0,0,1 + N0,1,1 + N1,0,0 + N1,1,0 = 0}

which is either zero or one. So the MLE is PB iff N0,0,1 + N0,1,1 + N1,0,0 + N1,1,0 = 0.

(d) When n is large, we are likely to see N0,1,1 ≈ εn/2 > 0 (so the likelihood of PB is zero). In this case, the MLE is PA.
(e) There is an absolute constant C > 0 such that if n ≥ C/(1 − 2ε)², then with probability at least 0.99 over the realization of (X1, Y1), . . . , (Xn, Yn) ∼ iid P,

       (1/n) ∑_{i=1}^n [ 1{fA(Xi) ≠ Yi} − 1{fB(Xi) ≠ Yi} ] > E[ 1{fA(X) ≠ Y} − 1{fB(X) ≠ Y} ] − (1 − 2ε)
                                                           = R(fA) − R(fB) − (1 − 2ε)
                                                           = 0.

This statement follows from Hoeffding’s inequality. In this event, the empirical risk of fA is larger than
that of fB , so ERM selects fB .

13 Ridge regression problem from Dian Jiao


Let A ∈ Rn×d be a matrix of feature vectors and b be a vector of responses. Let wλ be the ridge regression
solution with parameter λ > 0. Let b̂λ := Awλ . Under what condition (about b) is

   bᵀb̂λ / (‖b‖₂ ‖b̂λ‖₂)

invariant as a function of λ > 0?

13.1 Solution:
Let's rewrite the problem assumption in simpler terms. The hypothesis is that there exists a constant β such that

   ∀λ > 0,  bᵀb̂λ = β ‖b̂λ‖₂ ‖b‖₂.

This looks like Cauchy–Schwarz, right? Since the norm of b is constant, the hypothesis is equivalently that there exists a constant α such that

   ∀λ > 0,  bᵀb̂λ = α ‖b̂λ‖₂.

In order to see what's happening, we first get an expression for b̂λ. We write the normal equation for λ > 0:

   (AᵀA + λI) wλ = Aᵀb.

Since λ > 0, the matrix (AᵀA + λI) is positive definite, hence invertible, and

   wλ = (AᵀA + λI)⁻¹ Aᵀb,

giving

   b̂λ = A (AᵀA + λI)⁻¹ Aᵀb.

So the equality becomes: ∀λ > 0, bᵀA(AᵀA + λI)⁻¹Aᵀb = α‖b̂λ‖₂.

We define v := Aᵀb and Cλ := (AᵀA + λI)⁻¹ to lighten the notation, and so:

   ∀λ > 0,  vᵀCλv = α‖b̂λ‖₂,                    (3)

with ‖b̂λ‖₂ = √(b̂λᵀ b̂λ) = √(vᵀ Cλᵀ Aᵀ A Cλ v).

Now, we know that AᵀA is symmetric, so up to an orthogonal change of basis it is diagonal. We can assume for simplicity that AᵀA is diagonal (it is not hard to show that everything below remains true after the substitution A ← PᵀAQ, and similarly for v, where P and Q are given by the SVD of A). One can also work with the diagonalization AᵀA = ∑_k µk vk vkᵀ with an orthonormal basis of eigenvectors (vk), which is another way of writing QᵀAᵀAQ.

So write AᵀA = diag(λ1, . . . , λd); then Cλ = diag(1/(λ1 + λ), . . . , 1/(λd + λ)). Also Cλᵀ Aᵀ A Cλ = diag(λi/(λi + λ)²), so squaring equation (3) gives, with v = (vi)_{1≤i≤d}:

   ( ∑_{i=1}^d vi²/(λi + λ) )² = α² ∑_{i=1}^d vi² λi/(λi + λ)²                          (4)

   ∑_{i,j=1}^d vi² vj² / ((λi + λ)(λj + λ)) = ∑_{i=1}^d α² vi² λi/(λi + λ)²             (5)

Now, by uniqueness of the partial fraction decomposition (as functions of λ on an open set with non-empty interior), we can conclude that ∀i ≠ j, (vi vj)² = 0, which means that at most one of the vi can be nonzero, and so Aᵀb is an eigenvector of AᵀA.
One can check that this is a sufficient condition too, so the necessary and sufficient condition is:

   Aᵀb is an eigenvector of AᵀA.

Let's see if we can say more (I have no idea for now).

• Case AᵀA invertible: since Aᵀb is an eigenvector for a nonzero eigenvalue, we know that the linear map x ↦ Aᵀx gives a correspondence between eigenvectors of AAᵀ (for nonzero eigenvalues) and eigenvectors of AᵀA. So there is an eigenvector x of AAᵀ such that Aᵀb = Aᵀx, and since Aᵀ has full row rank, x = b, so b is an eigenvector of AAᵀ. (Another way is to prove that Dx being an eigenvector of D, for D invertible, implies x is an eigenvector of D — left as an exercise.)

• Case AᵀA not invertible, but Aᵀb is an eigenvector for a nonzero eigenvalue: to check.

• Case AᵀA not invertible and Aᵀb in the nullspace of AᵀA: to do.
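Independently of the unfinished case analysis above, the main claim (the cosine is invariant in λ iff Aᵀb is an eigenvector of AᵀA) can be checked numerically; a sketch assuming NumPy, with arbitrary dimensions.

    import numpy as np

    def cosine_vs_lambda(A, b, lambdas):
        d = A.shape[1]
        out = []
        for lam in lambdas:
            w = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)   # ridge solution
            bhat = A @ w
            out.append(b @ bhat / (np.linalg.norm(b) * np.linalg.norm(bhat)))
        return np.array(out)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(8, 3))
    lambdas = [0.01, 0.1, 1.0, 10.0]

    # Generic b: A^T b is typically not an eigenvector of A^T A, so the cosine varies.
    b_generic = rng.normal(size=8)
    print(cosine_vs_lambda(A, b_generic, lambdas))

    # Choose b so that A^T b is an eigenvector of A^T A: the cosine is constant.
    _, eigvecs = np.linalg.eigh(A.T @ A)
    b_special = A @ eigvecs[:, 0]            # then A^T b = (A^T A) v, a scaled eigenvector
    print(cosine_vs_lambda(A, b_special, lambdas))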

14 Min-max regression
Let (x1 , y1 ), . . . , (xn , yn ) ∈ Rd × R be given. Express the following optimization problem (over w) as a linear
program:

   min_{w ∈ Rd} max_{1≤i≤n} | yi − xiᵀw | .

(What is the linear objective function? What are the linear inequality constraints? What are the linear
equality constraints?)

14.1 Solution
   min_{(z,w) ∈ R×Rd}  z
   s.t.   yi − xiᵀw − z ≤ 0,    for all i = 1, . . . , n,
          −yi + xiᵀw − z ≤ 0,   for all i = 1, . . . , n.

1. Optimization variables: (z, w) ∈ R × Rd .


2. Linear objective function: z

3. Linear inequality constraints: yi − xTi w − z ≤ 0 for all i = 1, . . . , n; −(yi − xTi w) − z ≤ 0 for all
i = 1, . . . , n.
4. Linear equality constraints: none
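This LP can be handed to an off-the-shelf solver; a sketch using scipy.optimize.linprog (assumed available), with randomly generated data and dense constraint matrices.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, d = 30, 3
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Variables are (w, z) in R^{d+1}; minimize z subject to +/-(y_i - x_i^T w) - z <= 0.
    c = np.zeros(d + 1)
    c[-1] = 1.0
    A_ub = np.vstack([np.hstack([-X, -np.ones((n, 1))]),    #  y_i - x_i^T w - z <= 0
                      np.hstack([X, -np.ones((n, 1))])])    # -y_i + x_i^T w - z <= 0
    b_ub = np.concatenate([-y, y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))

    w, z = res.x[:d], res.x[-1]
    print(z, np.max(np.abs(y - X @ w)))     # the two values should agree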

15 Closure properties of kernels


1. Show that if k1 : X × X → R and k2 : X × X → R are both positive definite kernels, then k = k1 + k2 (i.e., k(x, x′) = k1(x, x′) + k2(x, x′)) is also a positive definite kernel.

2. Show that if k1 : X × X → R and k2 : X × X → R are both positive definite kernels, then k = k1 · k2 (i.e., k(x, x′) = k1(x, x′) · k2(x, x′)) is also a positive definite kernel. (You can use the Schur product theorem; see below.)

15.1 Solution (part 1)


We need to show that for any choice of x1, . . . , xn ∈ X and any (a1, . . . , an) ∈ Rn,

   ∑_{i=1}^n ∑_{j=1}^n ai aj k(xi, xj) ≥ 0.

Since k(x, x′) = k1(x, x′) + k2(x, x′) for any x, x′ ∈ X, we have

   ∑_{i=1}^n ∑_{j=1}^n ai aj k(xi, xj) = ∑_{i=1}^n ∑_{j=1}^n ai aj (k1(xi, xj) + k2(xi, xj))
                                       = ∑_{i=1}^n ∑_{j=1}^n ai aj k1(xi, xj) + ∑_{i=1}^n ∑_{j=1}^n ai aj k2(xi, xj)
                                       =: T1 + T2.

Now, since k1 is a positive definite kernel, it follows that T1 ≥ 0; similarly, since k2 is a positive definite kernel, it follows that T2 ≥ 0. Therefore,

   ∑_{i=1}^n ∑_{j=1}^n ai aj k(xi, xj) = T1 + T2 ≥ 0.

This implies that k is a positive definite kernel.

15.2 Solution (part 2)


Let k1 and k2 both be positive definite kernels. We need to show that for every choice of x1 , . . . , xn ∈ X ,
the matrix M given by Mi,j = k(xi , xj ) = k1 (xi , xj )k2 (xi , xj ) is positive semidefinite.
We’ll use the Schur product Theorem (Problem 16). Fix any x1 , . . . , xn ∈ X . Define the matrix S such
that Si,j = k1 (xi , xj ) and the matrix T such that Ti,j = k2 (xi , xj ). Since k1 and k2 are positive definite
kernels, the matrices S and T are positive semidefinite. The matrix M is the elementwise product of S and
T. By the Schur product theorem, the elementwise product of S and T is also positive semidefinite, so we conclude that M is positive semidefinite, and hence that k is a positive definite kernel.

16 Schur product theorem


The Schur product theorem says that if S and T are both d × d positive semidefinite matrices, then so is their elementwise product M := S ⊙ T (i.e., Mi,j = Si,j Ti,j for all i, j). In this problem, you will prove this theorem.
Let X ∼ N(0, S) and Y ∼ N(0, T), and assume X and Y are independent. Let Z := X ⊙ Y (i.e., Zi = Xi Yi for all i). What are the mean and covariance of Z?

16.1 Solution
First, observe that
E[Zi ] = E[Xi Yi ] = E[Xi ]E[Yi ] = 0.
The penultimate step uses the independence of X and Y. This implies that E[Z] = E[X ⊙ Y] = 0. Moreover,
cov(X) = S = E[XX T ] and cov(Y ) = T = E[Y Y T ] since each of X and Y has zero mean.
Next, observe that
cov(Z)ij = E[Zi Zj ] − E[Zi ]E[Zj ] = E[Xi Yi Xj Yj ] − 0 = E[Xi Xj ]E[Yi Yj ] = Sij Tij .
The penultimate step uses the independence of X and Y. Therefore cov(Z) = S ⊙ T = M.
The Schur product theorem follows because covariance matrices are always positive semidefinite, and any
positive semidefinite matrix can be regarded as a covariance matrix.
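A quick numerical illustration of both the theorem and the probabilistic argument; a sketch assuming NumPy, with randomly generated PSD matrices.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 6
    B, C = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    S, T = B @ B.T, C @ C.T                    # two random PSD matrices

    M = S * T                                  # elementwise (Hadamard) product
    print(np.linalg.eigvalsh(M).min())         # >= 0 up to round-off, i.e., M is PSD

    # The probabilistic argument: Z = X (.) Y with X ~ N(0, S), Y ~ N(0, T) independent.
    Xs = rng.multivariate_normal(np.zeros(d), S, size=200000)
    Ys = rng.multivariate_normal(np.zeros(d), T, size=200000)
    print(np.max(np.abs(np.cov((Xs * Ys).T) - M)))   # small: empirical cov(Z) is close to M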

17 Convergence Theorem for Perceptron


In this problem, we will prove a convergence guarantee for the perceptron algorithm on linearly separable
data. Assume there is a (unit) vector w∗ that can separate the training samples S with margin γ. Let
R = max_{x∈S} ‖x‖. We will prove that the perceptron algorithm makes at most

   T := (R/γ)²

mistakes.
1. Let w(t) be the weight vector found after t iterations of the perceptron algorithm. We know that w(t) = w(t−1) + yx for some example (x, y). Show that, if the perceptron makes a mistake at iteration t,

       w(t) · w∗ ≥ w(t−1) · w∗ + γ.

   Use induction to get a lower bound on w(T) · w∗ in terms of T and γ. (The base case should use w(0) = 0.)

2. Now we would like to upper bound the norm of w(t). Show that

       ‖w(t)‖² ≤ ‖w(t−1)‖² + R².

   Likewise, use induction to get an upper bound on ‖w(T)‖² in terms of T and R.

3. Finally, combine these two statements to prove that

       Tγ ≤ w(T) · w∗ ≤ ‖w(T)‖ ‖w∗‖ ≤ R√T.

   What does this tell us about the convergence of the perceptron algorithm?

17.1 Solution
1. If the perceptron algorithm makes a mistake at time t, the update is w(t) = w(t−1) + yx, so w(t) · w∗ = (w(t−1) + yx) · w∗ ≥ w(t−1) · w∗ + γ by the definition of the margin. By induction (with w(0) = 0 as the base case), we have w(T) · w∗ ≥ Tγ.

2. Likewise, we can show that

       ‖w(t)‖² = ‖w(t−1) + yx‖² = ‖w(t−1)‖² + 2y (w(t−1) · x) + ‖yx‖² ≤ ‖w(t−1)‖² + R²,

   since the middle term is non-positive (we made a mistake on this example, so y (w(t−1) · x) ≤ 0) and ‖yx‖² ≤ R². By induction (with w(0) = 0 as the base case), ‖w(T)‖² ≤ TR².

3. Therefore, from the prior two parts,

       Tγ ≤ w(T) · w∗ ≤ ‖w(T)‖ ‖w∗‖ ≤ R√T,

   since a · b ≤ ‖a‖‖b‖ and ‖w∗‖ = 1. This tells us that Tγ ≤ R√T, or, solving for T,

       T ≤ (R/γ)².

   This tells us that no error can occur for T larger than (R/γ)²; in other words, we have a convergence bound that guarantees that the perceptron will find a linear decision boundary. Convergence is slower for datasets with large ‖x‖ values, and also for datasets with small margins.
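For reference, a sketch of the perceptron algorithm itself (assuming NumPy), with a counter for the number of mistakes, which the theorem bounds by (R/γ)²; the toy data below is an arbitrary separable example.

    import numpy as np

    def perceptron(X, y, max_passes=100):
        # X: n x d array, y: labels in {-1, +1}. Returns (w, number of mistakes).
        w = np.zeros(X.shape[1])
        mistakes = 0
        for _ in range(max_passes):
            updated = False
            for x_i, y_i in zip(X, y):
                if y_i * (w @ x_i) <= 0:      # mistake: update
                    w = w + y_i * x_i
                    mistakes += 1
                    updated = True
            if not updated:                   # a full pass with no mistakes: done
                break
        return w, mistakes

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] > 0, 1, -1)
    X[:, 0] += 0.5 * y                        # push the classes apart to create a margin

    w, mistakes = perceptron(X, y)
    print(mistakes, np.all(y * (X @ w) > 0))  # mistake count; all points separated?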

18 Projection onto a line


http://www.deeplearningbook.org/linear_algebra.pdf

19 Radial Basis Function Kernel


Prove that for the 1-dimensional, unit-variance RBF kernel, with feature map ϕ : R → R∞ given by

   x ↦ e^{−x²/2} (1, x, x²/√2!, x³/√3!, . . .),

we have, for x1, x2 ∈ R, that ϕ(x1)ᵀϕ(x2) = e^{−(x1−x2)²/2}.

19.1 Solution
By the definition of ϕ, we have that:

   ϕ(x1)ᵀϕ(x2) = e^{−x1²/2} e^{−x2²/2} ∑_{i=0}^∞ (x1 x2)^i / i! .

From the Taylor series of f(x) = e^x, we have that ∑_{i=0}^∞ (x1 x2)^i / i! = e^{x1 x2}, so

   ϕ(x1)ᵀϕ(x2) = e^{−x1²/2} e^{−x2²/2} e^{x1 x2} = e^{−(x1² − 2 x1 x2 + x2²)/2} = e^{−(x1 − x2)²/2}.
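A numerical check with a truncated version of this feature map (a sketch assuming NumPy; the first m + 1 coordinates already approximate the kernel well for moderate |x|):

    import numpy as np
    from math import factorial

    def phi_truncated(x, m=25):
        # First m+1 coordinates of the feature map: e^{-x^2/2} * x^i / sqrt(i!).
        return np.exp(-x**2 / 2) * np.array([x**i / np.sqrt(factorial(i)) for i in range(m + 1)])

    x1, x2 = 0.7, -1.3
    lhs = phi_truncated(x1) @ phi_truncated(x2)
    rhs = np.exp(-(x1 - x2)**2 / 2)
    print(lhs, rhs)     # nearly equal; the truncation error vanishes as m grows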

20 Convex Hulls and Linear Separability


For a set of points X := {xi}_{i=1}^n, each of which is in Rd, define the convex hull of X as:

   conv(X) = { ∑_{i=1}^n αi xi : αi ≥ 0 ∀i = 1, . . . , n;  ∑_{i=1}^n αi = 1 } ⊆ Rd.

Prove that for any two finite sets of points X and Y in Rd , the sets X and Y are linearly separable (with
affine expansion) if and only if conv(X) ∩ conv(Y ) = ∅. (Regard the points in X as positive examples, and
the points in Y as negative examples.)
You can use the hyperplane separation theorem, as well as the fact that the convex hulls of X and Y are
closed and compact.
https://en.wikipedia.org/wiki/Hyperplane_separation_theorem

20.1 Solution
Suppose that X = {x1, . . . , x|X|} and Y = {y1, . . . , y|Y|} were linearly separable (with affine expansion) and conv(X) ∩ conv(Y) ≠ ∅, and take some p ∈ conv(X) ∩ conv(Y). Due to p's membership in this intersection, there must exist {αi}_{i=1}^{|X|} and {βj}_{j=1}^{|Y|} such that αi, βj ≥ 0, ∑_{i=1}^{|X|} αi = 1, ∑_{j=1}^{|Y|} βj = 1, and

   p = ∑_{i=1}^{|X|} αi xi = ∑_{j=1}^{|Y|} βj yj.

Additionally, since X and Y are linearly separable (with affine expansion), there exist some vector w ∈ Rd and θ ∈ R such that wᵀx > θ for all x ∈ X, and wᵀy < θ for all y ∈ Y. Then

   wᵀp = wᵀ( ∑_{i=1}^{|X|} αi xi ) = ∑_{i=1}^{|X|} αi (wᵀxi) > θ,

and

   wᵀp = wᵀ( ∑_{j=1}^{|Y|} βj yj ) = ∑_{j=1}^{|Y|} βj (wᵀyj) < θ.

This is clearly a contradiction, so conv(X) ∩ conv(Y) = ∅.


Now suppose that conv(X) ∩ conv(Y) = ∅. Since both conv(X) and conv(Y) are convex, closed, and compact, the Hyperplane Separation Theorem I (see Wikipedia) gives that there exist c1, c2 ∈ R and a vector v, with c1 < c2, such that vᵀx > c2 for all x ∈ X and vᵀy < c1 for all y ∈ Y. Then we also have vᵀy < c2 for all y ∈ Y. So X and Y are separated by the affine classifier specified by f(x) = sign(vᵀx − c), where c := (c1 + c2)/2.

21 Kernelized gradient descent for linear regression


Suppose you have a black-box function for computing k : X × X → R, which is a positive definite kernel
with reproducing kernel Hilbert space H and feature map ϕ : X → H. Also suppose you have training data
(x1 , y1 ), . . . , (xn , yn ) ∈ X × R.
Let ŵ ∈ H denote the weight vector obtained after one step of gradient descent on the empirical squared
loss risk on training data after applying the feature map ϕ. Assume gradient descent is initialized at the
origin 0 ∈ H, and the step size used is η = 1.
Give the pseudocode for computing ϕ(x)T ŵ for a given input x ∈ X . Your code should not invoke ϕ,
but it can make calls to the kernel function k.

21.1 Solution
On input x ∈ X :
1. Let S := 0.
2. For i = 1, . . . , n:
(a) S := S + yi k(x, xi )
3. Return S/n.

Explanation. Let A be the matrix whose i-th row is ϕ(xi)ᵀ, and let b collect the responses, both scaled by 1/√n:

   A := (1/√n) [ ϕ(x1)ᵀ ; . . . ; ϕ(xn)ᵀ ],    b := (1/√n) (y1, . . . , yn).

After one iteration of gradient descent (starting at 0 and with step size η = 1), the weight vector we obtain is

   ŵ = 0 − 1 · Aᵀ(A·0 − b) = Aᵀb = (1/n) ∑_{i=1}^n yi ϕ(xi).

Then for any x ∈ X,

   ϕ(x)ᵀŵ = (1/n) ∑_{i=1}^n yi ϕ(x)ᵀϕ(xi) = (1/n) ∑_{i=1}^n yi k(x, xi).
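The pseudocode above translates directly into code; a sketch assuming NumPy, using an RBF kernel as a stand-in for the black-box k (any positive definite kernel would do), with made-up training points.

    import numpy as np

    def rbf_kernel(x, z, gamma=1.0):
        # Stand-in for the black-box kernel k.
        return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

    def predict_after_one_gd_step(x, xs, ys, k=rbf_kernel):
        # phi(x)^T w_hat = (1/n) * sum_i y_i k(x, x_i), per the derivation above.
        return sum(y_i * k(x, x_i) for x_i, y_i in zip(xs, ys)) / len(xs)

    xs = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
    ys = [1.0, -1.0, 0.5]
    print(predict_after_one_gd_step(np.array([0.5, 0.5]), xs, ys))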

22 Model selection and inductive biases


Suppose you have training data (x1, y1), . . . , (xn, yn) ∈ R × R, and you would like to try different feature expansions before using linear regression. You want to learn the regression coefficients using the minimum Euclidean norm solution to the normal equations, and the feature expansions you want to try are polynomial feature expansions of degrees k = 1, 2, . . . , 100. The polynomial feature expansion for degree k is

   ϕk(x) = (1, x, x², . . . , x^k).

(a) Suppose you are indifferent to the degree of the polynomial expansion. Describe how to use the “hold
out set” approach to select a feature expansion to use. (Don’t worry about computation time.)
(b) Suppose you prefer lower-degree polynomial feature expansions over higher-degree polynomial feature
expansions. Suggest a way to modify the “hold out set” approach to incorporate this inductive bias.

23 Linear separability of an arbitrary training set


For an arbitrary, fixed set of training data (x1 , y1 ), . . . , (xn , yn ) ∈ X × {−1, +1} with each xi distinct, there
exists a feature map ϕ : X → Rm with a positive definite kernel such that (ϕ(x1 ), y1 ), . . . , (ϕ(xn ), yn ) ∈
Rm × {−1, 1} is linearly separable. Find an explicit formula for such a ϕ. What is a weight vector w ∈ Rm
that separates the training data (i.e., has yi ϕ(xi )T w > 0 for all i = 1, . . . , n)? What is the associated kernel
function k : X × X → R?
Hint: m = n

23.1 Solution
The desired ϕ maps xi ∈ X to the vector in Rn (so m = n) with a 1 in the i-th component and zeroes
everywhere else (i.e., the j-th component of ϕ(xi ) is 1{i=j} ). Then the weight vector that separates the
training data is w := (y1 , . . . , yn ). To see this, observe that for each i = 1, . . . , n,
   yi ϕ(xi)ᵀw = ∑_{j=1}^n yi yj 1{i=j} = yi² > 0.

To calculate the kernel function of this transformation, notice that


   ϕ(xi)ᵀϕ(xj) = ∑_{k=1}^n 1{k=i} 1{k=j} = 1{i=j}.

So the associated kernel function is k(xi , xj ) = 1{i=j} , and the kernel matrix is the n × n identity matrix.

24 SVM Example
Consider two sets of training data A = {(1, 1), (2, 2), (2, 0)} and B = {(0, 0), (1, 0), (0, 1)}, where the points
in A have label −1 and the points in B have label 1. Plot the points in X := A ∪ B, and hand-construct the
maximum margin hyperplane for X . Calculate the optimal margin and identify the support vectors.

24.1 Solution
Hyperplane: y = −x + 3/2.
Maximum margin: 1/(2√2).
Support vectors: {(0, 1), (1, 0), (2, 0), (1, 1)}.
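This hand construction can be checked against an off-the-shelf solver; a sketch using scikit-learn's SVC (assumed available), with a very large C to approximate the hard-margin SVM.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin
    w, b = clf.coef_[0], clf.intercept_[0]

    print(w, b)                        # w x + b = 0 should be the line x1 + x2 = 3/2 (up to scaling)
    print(1.0 / np.linalg.norm(w))     # geometric margin: about 1/(2*sqrt(2)) ~ 0.3536
    print(clf.support_vectors_)        # should match the support vectors above (order may differ)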

25 Convex sets
Are the following sets convex or not? If convex, provide a brief justification. If not, provide a counter-example
to its convexity.

1. S = {x ∈ Rd : xT w ≤ b} where w ∈ Rd and b ∈ R
2. S = {x ∈ Rd : ‖x‖₂ ≤ 1}

3. S = {x ∈ Rd : ∑_{i=1}^d |xi| ≤ 1}

4. S = {x ∈ Rd : max_i |xi| ≤ 1}

5. S = {x ∈ Rd : x has at most k non-zero entries} where k ∈ {1, . . . , d}.


6. S = {x ∈ Rd : f (x) ≤ 0} where f : Rd → R is a convex function.
7. S = {(x, b) ∈ Rd+1 : f (x) ≤ b} where f : Rd → R is a convex function.

26 Convex functions
We say a function f : Rd → R is concave if −f is convex.
Are the following functions convex, concave, or neither? If convex or concave, provide a brief justification.
If neither, provide a counter-example (either a single point or pair of points) to the convexity property for
an appropriate definition of convexity (for general functions, differentiable functions, or twice-differentiable
functions).

The first several functions are f : R → R.

1. f(x) = e^{cx} for c ∈ R
2. f (x) = |x|
3. f (x) = x3

4. f (x) = x4
5. f (x) = sin(x)
6. f (x) = ln(x) where the domain of f is the positive real line.
7. f (x) = x1/2 where the domain of f is the positive real line.

8. f (x) = cos(x) where the domain of f is [−π/2, π/2].


The next several functions are f : Rd → R.

9. f(x) = xᵀMx where M is a symmetric d × d matrix with only positive eigenvalues.
10. f (x) = xT M x where M is a symmetric d × d matrix with only negative eigenvalues.
11. f(x) = e^{wᵀx} for non-zero w ∈ Rd.
12. f (x) = |wT x| for non-zero w ∈ Rd .

13. f(x) = √|wᵀx| for non-zero w ∈ Rd.

14. f(x) = √|xᵀMx| where M is a symmetric d × d matrix with only positive eigenvalues.

If we have restricted the domain of a function f to a convex set (e.g., in the case of f (x) = ln(x)), then
the convexity (or concavity) properties just need to be checked over points in the domain, rather than all of
Rd . (The domain should be a convex set!)

