These practice problems are for practice only; they are not to be submitted. Please feel free to discuss
on Piazza, in office hours, etc.
We will not be posting any solutions other than those that already appear here.
Contents

1 Probability
  1.1 Markov's inequality
  1.2 Bayes Optimal Classification
3 Decision trees
  3.1 Binary trees
  3.2 Build your tree
4 Linear algebra
  4.1 Some basics
  4.2 Linear Algebra in Linear Regression
  4.3 Distance optimization
5 Nearest Neighbor
6 Linear regression
  6.1 Heteroskedastic noise in linear regression
    6.1.1 Part (a)
    6.1.2 Part (b)
  6.2 Linear algebraic perspective
    6.2.1 Part (a)
    6.2.2 Part (b)
    6.2.3 Part (c)
  6.3 Two Approaches to Deriving the OLS Normal Equations
    6.3.1 Part (a)
    6.3.2 Part (b)
  6.4 The least (Euclidean) norm solution
  6.5 Regularized Linear Regression
    6.5.1 Part (a)
    6.5.2 Part (b)
9 Dimension reduction
  9.1 Solution
11 Decision boundaries
  11.1 Solution
12 Failure of MLE
  12.1 Solution
14 Min-max regression
  14.1 Solution
24 SVM Example
  24.1 Solution
25 Convex sets
26 Convex functions
1 Probability
1.1 Markov’s inequality
Show that for any nonnegative random variable X and any c > 0,

Pr(X > c) ≤ E[X]/c.

(Hint: compare the output of the function 1{X > c} with the outcome of X.)
3. Let X_1, ..., X_n be uncorrelated {0, 1}-valued random variables, each with distribution Bern(p). We can estimate the value of p using the sample mean ∑_i X_i / n. Use the previous result to show that, for any ε > 0,

Pr( |∑_i X_i / n − p| > ε ) ≤ p(1 − p)/(nε²) ≤ 1/(4nε²)
1.2 Bayes Optimal Classification

Consider classifiers f with f(x), y ∈ {0, 1}. In this problem we will consider the effect of using an asymmetric loss function:

ℓ_{α,β}(ŷ, y) := α · 1{ŷ = 1, y = 0} + β · 1{ŷ = 0, y = 1}.

Under this loss function, the two types of errors receive different weights, determined by α, β > 0. Determine the Bayes optimal classifier, i.e., the classifier that achieves minimum risk assuming the distribution of the random example (X, Y) is known, for the loss ℓ_{α,β} where α, β > 0.
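A small numeric sketch of the pointwise risk comparison involved (the encoding of ℓ_{α,β}, charging α for a false positive and β for a false negative, and the helper names are ours, not the problem's):

```python
def expected_loss(prediction, eta_x, alpha, beta):
    """Expected asymmetric loss at a point with eta_x = Pr(Y=1 | X=x).

    Assumes alpha is the cost of predicting 1 when y = 0 and beta the
    cost of predicting 0 when y = 1 (a hypothetical but natural encoding).
    """
    return alpha * (1 - eta_x) if prediction == 1 else beta * eta_x

def bayes(eta_x, alpha, beta):
    # Pointwise, the Bayes optimal classifier picks whichever prediction
    # has the smaller expected loss.
    return min((0, 1), key=lambda p: expected_loss(p, eta_x, alpha, beta))

print(bayes(0.6, 1, 1), bayes(0.6, 3, 1))   # 1 0
```

With symmetric costs the rule reduces to the familiar comparison of η(x) against 1/2; making false positives costlier (α = 3) flips the prediction at the same η(x).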
3 Decision trees
3.1 Binary trees
Suppose you have a decision tree where the splitting rule can have more than two possible output values, so
that non-leaf nodes can have more than two children. Show how to convert it into a decision tree in which
the splitting rules have only two possible output values.
For one node n with children l_left, l_right, we recall the overall uncertainty:

∑_{l ∈ {l_left, l_right}} M(l) · (#training examples reaching l)
Figure 1: Dataset: red are label 0 and blue are label 1
1. Show that the Gini index can be rewritten as ∑_k p_k (1 − p_k).
2. Compute both measures for:
(a) All the observations of figure 1
(b) The observations only with x > −1
(c) The observations only with x < 1
3. Using only splits of the form (x_1, x_2) ↦ 1{x_i > c} for c ∈ R, build 2 binary trees of height 1 that minimize the overall uncertainty (one using the Gini index, and one using the entropy).
4. Do you think that, using two other forms of splits instead, you could build a tree with fewer nodes?
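A minimal sketch of the quantities involved, on hypothetical label lists (the exact observations of Figure 1 must be read off the plot, so the labels below are placeholders):

```python
import math
from collections import Counter

def gini(labels):
    # Gini index: sum_k p_k (1 - p_k) = 1 - sum_k p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Entropy: -sum_k p_k log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def overall_uncertainty(M, left, right):
    # sum over the two children of M(l) * (#training examples reaching l)
    return M(left) * len(left) + M(right) * len(right)

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))   # 0.5 1.0
```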
4 Linear algebra
4.1 Some basics
Let g : R² → R be the function defined by g(x) := (1/2) xᵀAx − bᵀx + c, where

A := [ 3  5 ],    b := [ 1 ],    and c := 9.
     [ 5  3 ]          [ 2 ]
2. Without any calculations, show that the eigenvalues of A are real and of opposite sign.
3. Show that for any unit vector b, we have bᵀx ≤ ‖x‖₂.
4. Show that g has neither a minimum nor a maximum.
Solution
Observe that ŷ = Hy. Now, cov(ŷ, y) = cov(Hy, y) = H cov(y, y) = H(σ² I_{n×n}) = σ²H. Thus,

(1/σ²) tr(cov(ŷ, y)) = tr(H).
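Assuming H is the usual hat matrix H = X(XᵀX)⁻¹Xᵀ for a design matrix X ∈ R^{n×d} (the data below are synthetic), a quick numeric check that tr(H) = d:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))          # synthetic design matrix

# Hat matrix: y_hat = H y with H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H is the orthogonal projection onto the column space of X,
# so tr(H) = rank(X) = d for full-rank X
print(np.trace(H))   # ≈ 3.0
```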
5 Nearest Neighbor
1. Given a collection of labeled examples D := {(x_i, y_i)}_{i=1}^n and two unlabeled examples t_1, t_2, suppose that t_1 and t_2 have the same nearest neighbor d_1 in D (when using the ℓ₂ norm). Prove that for any point lying on the line segment between t_1 and t_2, that point's nearest neighbor in D will also be d_1.
2. If t1 and t2 ’s nearest neighbors in D simply have the same training label (but may be different ex-
amples), must the nearest neighbor of every point on the line segment in between t1 and t2 have that
training label as well?
6 Linear regression
6.1 Heteroskedastic noise in linear regression
6.1.1 Part (a)
Let Pβ be a probability distribution on Rd × R for the random pair (X, Y ) (where X = (X1 , . . . , Xd )) such
that
X1 , . . . , Xd ∼iid N(0, 1), and Y | X = x ∼ N(xT β, kxk22 ), x ∈ Rd .
Here, β = (β1 , . . . , βd ) ∈ Rd are the parameters of Pβ .
True or false: The linear function with the smallest squared loss risk with respect to Pβ is β.
Answer with "true" or "false", and briefly (but precisely) justify your answer.
6.1.2 Part (b)
Find a system of linear equations Aβ = b over the variables β = (β_1, ..., β_d) ∈ R^d such that its solutions are maximizers of Q over all vectors in R^d.
Write the system of linear equations by defining the left-hand side matrix A ∈ R^{m×d} and the right-hand side vector b ∈ R^m (here, m is the number of linear equations), and briefly (but precisely) justify your answer. You may define A and b as products of matrices and vectors if you like, but make sure these matrices and vectors are also clearly defined.
6.3.2 Part (b)
Here we will derive the same equation, but from a purely geometric point of view. We want to minimize
kXw − Y k2 , which means we want to find the point Xw closest in Rn to Y . Geometrically, we know Xw can
be any point in the column space of X, and the closest point is just going to be the (orthogonal) projection
of Y onto that space. In other words, by changing w, we can select any point in the space spanned by the
columns of X, since Xw = X1 w1 + ... + Xd wd . But if Xw is the orthogonal projection, then Xw − Y is
perpendicular to the column space, so it must lie in the null space of Xᵀ. Prove this. (Hint: since Xw − Y is orthogonal to every column of X, it must be orthogonal to every row of Xᵀ.) Now use this fact to recover the normal equations for OLS.
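The conclusion can be checked numerically (with synthetic X and Y): solving the normal equations XᵀXw = XᵀY makes the residual Xw − Y orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# Normal equations: X^T X w = X^T Y
w = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual Xw - Y lies in the null space of X^T
# (it is orthogonal to every column of X)
residual = X @ w - Y
print(np.linalg.norm(X.T @ residual))   # ≈ 0
```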
An alternative that is less susceptible to outliers is to minimize the "sum of absolute values" (L₁) norm:

Ŵ_{L₁} = arg min_w ∑_{j=1}^N | t_j − ∑_i w_i h_i(x_j) | + λ ∑_i w_i²
(i) Plot a sketch of the L₁ loss function; do not include the regularization term in your plot.
(ii) Give an example of a case where outliers can hurt a learning algorithm.
(iv) Are outliers always bad, and should we always ignore them? Why?
(v) As with ridge regression in Equation 1, the regularized L₁ regression in Equation 2 can also be viewed as a MAP estimator. Explain why by describing the prior P(w) and the likelihood function P(t|x, w) for this Bayesian learning problem.
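A minimal numeric illustration of point (ii), with data invented here (no regularization, a single constant feature): the squared-loss fit of a constant is the mean, while the L₁ fit is the median, which an outlier barely moves.

```python
import numpy as np

# Targets with one gross outlier
t = np.array([1.0, 1.1, 0.9, 1.0, 100.0])

# For a single constant "fit", squared loss is minimized by the mean
# and L1 loss by the median:
print(np.mean(t))     # ≈ 20.8  (dragged far off by the outlier)
print(np.median(t))   # 1.0    (barely affected)
```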
7 More linear algebra
7.1 Gradients and derivatives
In this exercise, A ∈ R^{n×n}, b ∈ R^n, J is the Jacobian and ∇ is the gradient. The gradient is defined for scalar functions from R^n to R, and the Jacobian for vector-valued functions. People sometimes say "gradient" even when the function is vector-valued.

(a) Let f : R^n → R^n, x ↦ Ax; compute J(f)(x) for any x ∈ R^n where f is differentiable.
(b) Let f : R^n → R, x ↦ xᵀAx + xᵀb; compute ∇f(x) for any x ∈ R^n where f is differentiable.
(c) Let f : R^n → R, x ↦ ‖x‖₂; compute ∇f(x) for any x ∈ R^n where f is differentiable.
(d) Let f : R^{n×n} → R, M ↦ Tr(AM); compute ∇f(M) for any M ∈ R^{n×n} where f is differentiable.
(e) Let f : GL_n(R) → R, M ↦ log det M; compute ∇f(M) for any M ∈ GL_n(R) where f is differentiable.
(f) Let f : GL_n(R) → R^{n×n}, M ↦ M⁻¹; compute J(f)(M) for any M ∈ GL_n(R) where f is differentiable.
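Closed-form answers to exercises like (b) can be sanity-checked with finite differences; a sketch for the gradient ∇(xᵀAx + xᵀb) = (A + Aᵀ)x + b, with random A, b, x invented here:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.normal(size=(n, n))
b = rng.normal(size=n)
x = rng.normal(size=n)

f = lambda v: v @ A @ v + v @ b
grad = (A + A.T) @ x + b                  # claimed gradient for (b)

# Central finite differences, coordinate by coordinate
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])
print(np.max(np.abs(numeric - grad)))     # ≈ 0 (up to O(h^2))
```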
1. Show that if V is an invariant subspace of A, then V⊥ is also an invariant subspace of A (in general V⊥ is invariant under Aᵀ, but here A is symmetric).
2. Suppose R^n = B ⊕ C and B is an invariant subspace of A. What does this imply about the block structure of A if you write

A = [ D  E ] ?
    [ F  G ]
Let β⋆ be a linear function that minimizes the true risk R(β) := E[(X_1 β − Y_1)²]. Let β̂ be a minimizer of the empirical risk R̂(β) := (1/n) ∑_{i=1}^n (X_i β − Y_i)². True or false: E[β̂] = β⋆ for all integers n ≥ 1.
Answer with "true" or "false". If you answer "true", give a clear and precise proof. If you answer "false", prove your claim with a simple counterexample.
9 Dimension reduction
Suppose you are faced with binary classification problem for data from an unknown probability distribution
over R1000 × {0, 1}.
You are asked to consider the following approach to construct a classifier. First, find a feature transfor-
mation ϕ : R1000 → R. Then, after applying ϕ to all 1000-dimensional feature vectors, find an affine classifier
hθ : R → {0, 1} of the form hθ (z) = 1{z≥θ} for some θ ∈ R. Thus, you are to only consider classifiers of the
form fθ (x) = 1{ϕ(x)≥θ} .
True or false: There exists a feature transformation ϕ such that there is a classifier of the form fθ (as
described above) with (zero-one loss) risk no larger than that of any classifier f : R1000 → {0, 1}.
Answer with “true” or “false”. If you answer “true”, give a clear and precise proof. If you answer
“false”, prove your claim with a simple counterexample.
9.1 Solution
True.
Let P be the unknown probability distribution over R1000 × {0, 1}, and (X, Y ) ∼ P .
Define

ϕ(x) := Pr_{(X,Y)∼P}(Y = 1 | X = x)   for all x ∈ R^{1000}.
The classifier defined by f1/2 (x) = 1{ϕ(x) ≥ 1/2} for all x ∈ R1000 is the Bayes optimal classifier; its risk is
no larger than that of any other classifier.
2. Negative examples (with label −1): (3, 0), (0, 3), (−1, −1)
Note that these examples are not linearly separable. Consider the following function:

K(x, z) := 1 + xᵀz / √(xᵀx · zᵀz)   for all x, z ∈ R² \ {0}.
Construct a feature map ϕ : R2 \ {0} → R3 such that the following properties hold: (i) ϕ(x)T ϕ(z) = K(x, z)
for all x, z ∈ R2 \ {0}; and (ii) there exists a linear separator with normal vector w ∈ R3 of the form
w = (a, b, b) for some a, b ∈ R such that yϕ(x)T w ≥ 1 for all training examples (x, y).
Answer with a precise definition of the feature map ϕ (e.g., ϕ((x1 , x2 )) = (x1 , x2 , x1 x2 )), and explain
why this ϕ has the desired properties. As part of the explanation, give the parameters a, b ∈ R of the linear
separator, either explicitly (e.g., a = 1, b = 2) or implicitly as the solution to a system of linear equations.
10.1 Solution
Define

ϕ(x) := (1, x_1/‖x‖₂, x_2/‖x‖₂)   for all x ∈ R² \ {0}.

We verify that, for any x, z ∈ R² \ {0},

ϕ(x)ᵀϕ(z) = 1 · 1 + (x_1/‖x‖₂)(z_1/‖z‖₂) + (x_2/‖x‖₂)(z_2/‖z‖₂) = 1 + xᵀz/(‖x‖₂‖z‖₂) = K(x, z)
as required.
The positive examples, after mapping through ϕ, are (1, √2/2, √2/2), (1, √3/2, 1/2), and (1, 1/2, √3/2). The negative examples, after mapping through ϕ, are (1, 1, 0), (1, 0, 1), and (1, −√2/2, −√2/2). Therefore, we need to ensure the existence of a weight vector w = (a, b, b) that satisfies

a + √2 · b ≥ 1,
a + ((1 + √3)/2) · b ≥ 1,
a + b ≤ −1,
a − √2 · b ≤ −1.

(This solution is obtained by solving the linear system determined by the second and third inequalities above, except with equality.)
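A numeric check of the construction (the concrete a and b below come from solving the second and third inequalities as equalities, as the parenthetical indicates; the positive pre-image points are our guess, inferred from their listed images under ϕ):

```python
import numpy as np

def phi(x):
    x = np.asarray(x, dtype=float)
    n = np.linalg.norm(x)
    return np.array([1.0, x[0] / n, x[1] / n])

def K(x, z):
    x, z = np.asarray(x, float), np.asarray(z, float)
    return 1.0 + (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

# (i) phi(x)^T phi(z) = K(x, z)
x, z = (1.0, 2.0), (-3.0, 0.5)
print(abs(phi(x) @ phi(z) - K(x, z)))      # ≈ 0

# (ii) solve  a + ((1 + sqrt(3))/2) b = 1  and  a + b = -1
M = np.array([[1.0, (1.0 + np.sqrt(3)) / 2.0],
              [1.0, 1.0]])
a, b = np.linalg.solve(M, np.array([1.0, -1.0]))
w = np.array([a, b, b])

pos = [(1.0, 1.0), (np.sqrt(3), 1.0), (1.0, np.sqrt(3))]   # map to the listed images
neg = [(3.0, 0.0), (0.0, 3.0), (-1.0, -1.0)]               # from the problem statement
margins = [phi(p) @ w for p in pos] + [-(phi(q) @ w) for q in neg]
print(min(margins) >= 1 - 1e-9)            # True: y phi(x)^T w >= 1 holds
```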
11 Decision boundaries
Consider the following generative model M for binary classification data in R3 × {0, 1} with multivariate
Gaussian class conditional distributions:
Here, 0 ∈ R3 is the origin, and I ∈ R3×3 is the identity matrix. The parameters of the model M are
π ∈ (0, 1) and µ ∈ R3 . Let Pπ,µ ∈ M denote the distribution in the model with parameters (π, µ).
Does the following property always, sometimes, or never hold? (“Always” = for all distributions in M;
“never” = for none of the distributions in M; “sometimes” = for some but not all distributions in M.)
The decision boundary of the Bayes optimal classifier for Pπ,µ passes through the line segment
joining 0 and µ.
Answer with either “always”, “sometimes”, or “never”. If your answer is “always” or “never”, write
a clear and precise proof of the claim. If your answer is “sometimes”, write a precise condition on the
parameters (e.g., “µ = (π, 2π, 3π)”) such that the property holds if and only if the condition is satisfied.
11.1 Solution
Sometimes. (This might seem counterintuitive, but consider what happens as π → 0 or π → 1.)
The decision boundary is the set
12 Failure of MLE
In this problem, you will study a case where maximum likelihood fails but empirical risk minimization
succeeds.
Consider the following probability distribution P on X × Y, for X = {0, 1}2 and Y = {0, 1}.
P(X = (x_1, x_2)):
              x_1 = 0     x_1 = 1
  x_2 = 0    (1 − ε)/2   (1 − ε)/2
  x_2 = 1      ε/2         ε/2

P(Y = 1 | X = (x_1, x_2)):
              x_1 = 0     x_1 = 1
  x_2 = 0       0           1
  x_2 = 1       1           0

Above, regard ε as a small positive number between 0 and 1. Let f⋆ be the Bayes optimal classifier for P.
(In this problem, we are concerned with zero-one loss.)
Now also consider the statistical model P for Y | X: P = {PA , PB }, where
P_A(Y = 1 | X = (x_1, x_2)) = 4/5 if x_1 = 0, and 1/5 if x_1 = 1;
P_B(Y = 1 | X = (x_1, x_2)) = 0 if x_1 = 0, and 1 if x_1 = 1.
Note that P ∉ P, and that distributions in P do not specify the distribution of X (as with logistic regression).
Let fA and fB be the Bayes optimal classifiers, respectively, for PA and PB .
The maximum likelihood approach to learning a classifier selects the distribution in P of highest likelihood
given training data (which are regarded as an iid sample), then returns the optimal classifier for the chosen
distribution (i.e., fA if PA is the maximum likelihood distribution, otherwise fB ).
(a) Give simple expressions for the Bayes optimal classifiers for P , PA , and PB . E.g.,
f⋆(x) = 1 if x_1 + x_2 = 2, and f⋆(x) = 0 otherwise.
(b) What is the risk of each classifier from Part (a) under distribution P ?
(c) Suppose in the training data, the number of training examples of the form ((x1 , x2 ), y) is equal to
Nx1 ,x2 ,y , for (x1 , x2 , y) ∈ {0, 1}3 . Give a simple rule for determining the maximum likelihood distribu-
tion in P in terms of Nx1 ,x2 ,y for (x1 , x2 , y) ∈ {0, 1}3 . Briefly justify your answer.
(d) If the training data is a typical iid sample from P (with large sample size n), which classifier from Part
(a) is returned by the maximum likelihood approach? Briefly justify your answer.
(e) If the training data is an iid sample from P of size n, then how large should n be so that, with probability at least 0.99, Empirical Risk Minimization returns the classifier in {f_A, f_B} with smallest risk? Briefly justify your answer.
(This part requires some tools that we may not have covered in the class. You may want to use some-
thing called “Hoeffding’s inequality”: https://en.wikipedia.org/wiki/Hoeffding%27s_inequality.)
12.1 Solution
(a) f⋆(x) = 1{x_1 ≠ x_2} = x_1 ⊕ x_2,  f_A(x) = 1{x_1 = 0} = 1 − x_1,  f_B(x) = 1{x_1 = 1} = x_1.
(b) R(f⋆) = 0, R(f_A) = 1 − ε, R(f_B) = ε.
(c) The likelihood of P_A is

(4/5)^{N_{0,0,1} + N_{0,1,1} + N_{1,0,0} + N_{1,1,0}} · (1/5)^{N_{0,0,0} + N_{0,1,0} + N_{1,0,1} + N_{1,1,1}},

which is always some number strictly between zero and one. The likelihood of P_B is

0^{N_{0,0,1} + N_{0,1,1} + N_{1,0,0} + N_{1,1,0}}   (with the convention 0⁰ = 1),

which is either zero or one. So the MLE is P_B iff N_{0,0,1} + N_{0,1,1} + N_{1,0,0} + N_{1,1,0} = 0.
(d) When n is large, we are likely to see N_{0,1,1} ≈ nε/2 > 0. In this case, the MLE is P_A.
(e) There is an absolute constant C > 0 such that if n ≥ C/(1 − 2ε)², then with probability at least 0.99 over the realization of (X_1, Y_1), ..., (X_n, Y_n) ∼iid P,

(1/n) ∑_{i=1}^n [ 1{f_A(X_i) ≠ Y_i} − 1{f_B(X_i) ≠ Y_i} ] > E[ 1{f_A(X) ≠ Y} − 1{f_B(X) ≠ Y} ] − (1 − 2ε)
                                                          = R(f_A) − R(f_B) − (1 − 2ε)
                                                          = 0.
This statement follows from Hoeffding’s inequality. In this event, the empirical risk of fA is larger than
that of fB , so ERM selects fB .
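The comparison ERM makes can be illustrated by simulation (a sketch with a hypothetical ε = 0.1; the constant C is not computed here):

```python
import random

eps = 0.1            # hypothetical choice of the parameter ε
random.seed(0)

def sample_P():
    # Under P: x2 = 1 with probability eps, x1 is uniform on {0, 1},
    # and the label is deterministic: y = x1 XOR x2.
    x1 = random.randint(0, 1)
    x2 = 1 if random.random() < eps else 0
    return (x1, x2), x1 ^ x2

f_A = lambda x: 1 - x[0]   # Bayes classifier for P_A
f_B = lambda x: x[0]       # Bayes classifier for P_B

n = 2000
data = [sample_P() for _ in range(n)]
err_A = sum(f_A(x) != y for x, y in data) / n   # ≈ R(f_A) = 1 - eps
err_B = sum(f_B(x) != y for x, y in data) / n   # ≈ R(f_B) = eps
print(err_A, err_B)   # ERM compares these and selects f_B
```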
bᵀb̂_λ / (‖b‖₂ ‖b̂_λ‖₂)
13.1 Solution:
Let's rewrite the problem assumption in simpler terms. The hypothesis is that there exists a constant β such that

∀λ > 0,  bᵀb̂_λ = β ‖b̂_λ‖₂ ‖b‖₂.

This looks like Cauchy–Schwarz, right? Since the norm of b is constant, in this problem the hypothesis is equivalently that there exists a constant α such that

∀λ > 0,  bᵀb̂_λ = α ‖b̂_λ‖₂.   (3)
In order to see what's happening, we first get an expression for b̂_λ. We write the normal equation for λ > 0:

(AᵀA + λI) w_λ = Aᵀb.

Since λ > 0, (AᵀA + λI) is positive definite, hence invertible, and

w_λ = (AᵀA + λI)⁻¹ Aᵀb,

giving

b̂_λ = A (AᵀA + λI)⁻¹ Aᵀb.
Now, we know that AᵀA is symmetric, so up to an orthogonal change of basis it is diagonal. We can assume for simplicity that AᵀA is diagonal (it is not hard to show that everything below remains true after the substitution A ← PᵀAQ, and similarly for v, where P and Q are given by the SVD). One can also use the diagonalization AᵀA = ∑ μ_k v_k v_kᵀ with an orthonormal basis of eigenvectors (v_k), which is another way of writing QᵀAᵀAQ.
So write AᵀA = diag(λ_1, ..., λ_d); then C_λ := (AᵀA + λI)⁻¹ = diag(1/(λ_1 + λ), ..., 1/(λ_d + λ)), and C_λᵀ AᵀA C_λ = diag(λ_i/(λ_i + λ)²). Squaring equation (3) gives, with v := Aᵀb = (v_i)_{1≤i≤d}:

∀λ > 0:  ( ∑_{i=1}^d v_i²/(λ_i + λ) )² = α² ∑_{i=1}^d v_i² λ_i/(λ_i + λ)²    (4)

∀λ > 0:  ∑_{i,j=1}^d v_i² v_j² / ((λ_i + λ)(λ_j + λ)) = α² ∑_{i=1}^d v_i² λ_i/(λ_i + λ)²    (5)
Now, by uniqueness of the partial fraction decomposition on an open set with nonempty interior, we can conclude that (v_i v_j)² = 0 for all i ≠ j; that means at most one v_i can be nonzero, and so Aᵀb is an eigenvector of AᵀA.
One can check that this condition is sufficient too, so the necessary and sufficient condition is:
Aᵀb is an eigenvector of AᵀA.
• Case AᵀA invertible: since Aᵀb is an eigenvector for a nonzero eigenvalue, we know there is an isomorphism between eigenvectors of AAᵀ and eigenvectors of AᵀA, given by the linear map x ↦ Aᵀx. So there is an eigenvector x of AAᵀ such that Aᵀb = Aᵀx, and since Aᵀ is full row rank, x = b. (Another way is to prove that Dx being an eigenvector of D, for D invertible, implies x is an eigenvector of D; left as an exercise.) So b is an eigenvector of AAᵀ.
• Case AᵀA not invertible but Aᵀb an eigenvector for a nonzero eigenvalue: to check.
• Case AᵀA not invertible and Aᵀb in the null space of AᵀA: to do.
14 Min-max regression
Let (x1 , y1 ), . . . , (xn , yn ) ∈ Rd × R be given. Express the following optimization problem (over w) as a linear
program:
min_{w ∈ R^d} max_{1≤i≤n} | y_i − x_iᵀw |.
(What is the linear objective function? What are the linear inequality constraints? What are the linear
equality constraints?)
14.1 Solution

min_{(z,w) ∈ R×R^d} z
3. Linear inequality constraints: y_i − x_iᵀw − z ≤ 0 for all i = 1, ..., n; −(y_i − x_iᵀw) − z ≤ 0 for all i = 1, ..., n.
4. Linear equality constraints: none.
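This LP can be handed directly to an off-the-shelf solver; a sketch using scipy.optimize.linprog on synthetic data invented here (note the bounds: both z and w are free variables):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, d = 12, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=n)

# Variables: (z, w) in R x R^d.  Objective: minimize z.
c = np.zeros(1 + d)
c[0] = 1.0

#  y_i - x_i^T w - z <= 0    ->  -z - x_i^T w <= -y_i
#  -(y_i - x_i^T w) - z <= 0 ->  -z + x_i^T w <=  y_i
A_ub = np.vstack([np.hstack([-np.ones((n, 1)), -X]),
                  np.hstack([-np.ones((n, 1)), X])])
b_ub = np.concatenate([-y, y])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (1 + d))
z_opt, w_opt = res.x[0], res.x[1:]

# At the optimum, z equals the min-max residual achieved by w_opt
print(abs(z_opt - np.max(np.abs(y - X @ w_opt))))   # ≈ 0
```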
Now, since k1 is a positive definite kernel, it follows that T1 ≥ 0; similarly, since k2 is a positive definite
kernel, it follows that T2 ≥ 0. Therefore,
∑_{i=1}^n ∑_{j=1}^n a_i a_j k(x_i, x_j) = T_1 + T_2 ≥ 0.
16.1 Solution
First, observe that
E[Zi ] = E[Xi Yi ] = E[Xi ]E[Yi ] = 0.
The penultimate step uses the independence of X and Y. This implies that E[Z] = E[X ∘ Y] = 0. Moreover,
cov(X) = S = E[XX T ] and cov(Y ) = T = E[Y Y T ] since each of X and Y has zero mean.
Next, observe that
cov(Z)ij = E[Zi Zj ] − E[Zi ]E[Zj ] = E[Xi Yi Xj Yj ] − 0 = E[Xi Xj ]E[Yi Yj ] = Sij Tij .
The penultimate step uses the independence of X and Y. Therefore cov(Z) = S ∘ T = M.
The Schur product theorem follows because covariance matrices are always positive semidefinite, and any
positive semidefinite matrix can be regarded as a covariance matrix.
‖w⃗^(t)‖² ≤ ‖w⃗^(t−1)‖² + R²

Likewise, use induction to get an upper bound on ‖w⃗^(T)‖² in terms of T and R.
What does this tell us about the convergence of the perceptron algorithm?
17.1 Solution
1. If the perceptron algorithm makes a mistake at time t, the update is w⃗^(t) = w⃗^(t−1) + yx⃗, so

w⃗^(t) · w⃗⋆ = (w⃗^(t−1) + yx⃗) · w⃗⋆ ≥ w⃗^(t−1) · w⃗⋆ + γ

by the definition of the margin. By induction (with w⃗^(0) = 0 as the base case), we have w⃗^(T) · w⃗⋆ ≥ Tγ.
2. Likewise, we can show that

‖w⃗^(t)‖² = ‖w⃗^(t−1) + yx⃗‖² = ‖w⃗^(t−1)‖² + 2y w⃗^(t−1) · x⃗ + ‖yx⃗‖² ≤ ‖w⃗^(t−1)‖² + R²

(on a mistake, y w⃗^(t−1) · x⃗ ≤ 0, and ‖yx⃗‖ = ‖x⃗‖ ≤ R).
3. Therefore, from the prior two parts,

Tγ ≤ w⃗^(T) · w⃗⋆ ≤ ‖w⃗^(T)‖ ‖w⃗⋆‖ ≤ R√T

since a · b ≤ ‖a‖‖b‖ and ‖w⃗⋆‖ = 1. This tells us that Tγ ≤ R√T, or, solving for T,

T ≤ (R/γ)².
This tells us that no error can occur for T larger than (R/γ)2 , or in other words, we have a convergence
bound that guarantees that perceptron will find a linear decision boundary. Convergence is slower for
datasets with large kxk values, and also for datasets with small margins.
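A minimal sketch of the perceptron on synthetic separable data (the dataset and the margin threshold are invented here), checking the mistake bound T ≤ (R/γ)²:

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """Run the perceptron until no mistakes remain; return (w, #mistakes)."""
    w = np.zeros(X.shape[1])
    mistakes, changed = 0, True
    while changed and mistakes < max_updates:
        changed = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # mistake (or exactly on the boundary)
                w = w + yi * xi        # the update analyzed above
                mistakes += 1
                changed = True
    return w, mistakes

# Separable data with an enforced margin around a known unit normal
rng = np.random.default_rng(3)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)
X = rng.normal(size=(50, 2))
X = X[np.abs(X @ w_star) > 0.2]            # enforce margin gamma >= 0.2
y = np.where(X @ w_star >= 0, 1.0, -1.0)

R = np.max(np.linalg.norm(X, axis=1))      # radius of the data
gamma = np.min(np.abs(X @ w_star))         # margin achieved by w_star
w, mistakes = perceptron(X, y)
print(mistakes <= (R / gamma) ** 2)        # True: the bound from part 3
```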
Show that for the feature map

ϕ : x ↦ e^{−x²/2} (1, x, x²/√(2!), x³/√(3!), ...)

and x_1, x_2 ∈ R, we have ϕ(x_1)ᵀϕ(x_2) = e^{−(x_1−x_2)²/2}.
19.1 Solution
By the definition of ϕ, we have that:

ϕ(x_1)ᵀϕ(x_2) = e^{−x_1²/2} e^{−x_2²/2} ∑_{i=0}^∞ (x_1^i x_2^i)/i!

From the Taylor series of f(x) = eˣ, we have that ∑_{i=0}^∞ (x_1 x_2)^i / i! = e^{x_1 x_2}, so

ϕ(x_1)ᵀϕ(x_2) = e^{−x_1²/2} e^{−x_2²/2} e^{x_1 x_2} = e^{−(x_1² − 2x_1x_2 + x_2²)/2} = e^{−(x_1−x_2)²/2}.
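The identity can be checked numerically with a truncated version of ϕ (the truncation length is an arbitrary choice here; the tail of the series decays factorially):

```python
import math

def phi(x, terms=30):
    # Truncation of x -> e^{-x^2/2} (1, x, x^2/sqrt(2!), x^3/sqrt(3!), ...)
    return [math.exp(-x * x / 2) * x ** i / math.sqrt(math.factorial(i))
            for i in range(terms)]

def k(x1, x2):
    return math.exp(-((x1 - x2) ** 2) / 2)

x1, x2 = 0.7, -0.4
approx = sum(a * b for a, b in zip(phi(x1), phi(x2)))
print(abs(approx - k(x1, x2)))   # tiny: the truncated tail is negligible
```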
Prove that for any two finite sets of points X and Y in Rd , the sets X and Y are linearly separable (with
affine expansion) if and only if conv(X) ∩ conv(Y ) = ∅. (Regard the points in X as positive examples, and
the points in Y as negative examples.)
You can use the hyperplane separation theorem, as well as the fact that the convex hulls of X and Y are
closed and compact.
https://en.wikipedia.org/wiki/Hyperplane_separation_theorem
20.1 Solution
Suppose that X = {x¹, ..., x^{|X|}} and Y = {y¹, ..., y^{|Y|}} were linearly separable (with affine expansion) and conv(X) ∩ conv(Y) ≠ ∅; then take some p ∈ conv(X) ∩ conv(Y). Due to p's membership in this intersection, there must exist {α_i}_{i=1}^{|X|} and {β_j}_{j=1}^{|Y|} such that α_i, β_j ≥ 0, ∑_{i=1}^{|X|} α_i = 1, ∑_{j=1}^{|Y|} β_j = 1, and

p = ∑_{i=1}^{|X|} α_i xⁱ = ∑_{j=1}^{|Y|} β_j yʲ.
Additionally, since X and Y are linearly separable (with affine expansion), there exists some vector w ∈ Rd
and θ ∈ R such that wT x > θ for all x ∈ X, and wT y < θ for all y ∈ Y . Then
wᵀp = wᵀ ∑_{i=1}^{|X|} α_i xⁱ = ∑_{i=1}^{|X|} α_i (wᵀxⁱ) > θ,

and

wᵀp = wᵀ ∑_{j=1}^{|Y|} β_j yʲ = ∑_{j=1}^{|Y|} β_j (wᵀyʲ) < θ,

a contradiction. For the converse, if conv(X) ∩ conv(Y) = ∅, then since these convex hulls are compact and convex, the hyperplane separation theorem yields w ∈ R^d and θ ∈ R with wᵀx > θ for all x ∈ conv(X) ⊇ X and wᵀy < θ for all y ∈ conv(Y) ⊇ Y, i.e., X and Y are linearly separable (with affine expansion).
21.1 Solution
On input x ∈ X :
1. Let S := 0.
2. For i = 1, . . . , n:
(a) S := S + yi k(x, xi )
3. Return S/n.
Explanation. Let

A := (1/√n) [ ← ϕ(x_1)ᵀ → ],    b := (1/√n) [ y_1 ].
            [     ⋮      ]                  [  ⋮  ]
            [ ← ϕ(x_n)ᵀ → ]                 [ y_n ]
After one iteration of gradient descent (starting at 0 and with step size η = 1), the weight vector we obtain is

ŵ = 0 − 1 · Aᵀ(A · 0 − b) = Aᵀb = (1/n) ∑_{i=1}^n y_i ϕ(x_i).

Then for any x ∈ X,

ϕ(x)ᵀŵ = (1/n) ∑_{i=1}^n y_i ϕ(x)ᵀϕ(x_i) = (1/n) ∑_{i=1}^n y_i k(x, x_i).
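A direct transcription of the algorithm, checked against the explicit ŵ for a hypothetical feature map ϕ(x) = (x, x²) and the kernel it induces:

```python
def train_predictor(data, kernel):
    """Steps 1-3 above: x -> (1/n) * sum_i y_i * k(x, x_i)."""
    n = len(data)
    return lambda x: sum(yi * kernel(x, xi) for xi, yi in data) / n

# Hypothetical explicit feature map and its kernel k(x, z) = phi(x)^T phi(z)
phi = lambda x: (x, x * x)
kernel = lambda x, z: sum(a * b for a, b in zip(phi(x), phi(z)))

data = [(1.0, 1.0), (2.0, -1.0), (0.5, 1.0)]
f = train_predictor(data, kernel)

# Agreement with phi(x)^T w_hat, where w_hat = (1/n) sum_i y_i phi(x_i)
n = len(data)
w_hat = [sum(y * phi(x)[j] for x, y in data) / n for j in range(2)]
x0 = 1.5
explicit = sum(a * b for a, b in zip(phi(x0), w_hat))
print(abs(f(x0) - explicit))   # ≈ 0
```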
ϕ_k(x) = (1, x, x², ..., x^k).
(a) Suppose you are indifferent to the degree of the polynomial expansion. Describe how to use the “hold
out set” approach to select a feature expansion to use. (Don’t worry about computation time.)
(b) Suppose you prefer lower-degree polynomial feature expansions over higher-degree polynomial feature
expansions. Suggest a way to modify the “hold out set” approach to incorporate this inductive bias.
23.1 Solution
The desired ϕ maps xi ∈ X to the vector in Rn (so m = n) with a 1 in the i-th component and zeroes
everywhere else (i.e., the j-th component of ϕ(xi ) is 1{i=j} ). Then the weight vector that separates the
training data is w := (y1 , . . . , yn ). To see this, observe that for each i = 1, . . . , n,
y_i ϕ(x_i)ᵀw = y_i ∑_{j=1}^n y_j 1{i=j} = y_i² > 0.
So the associated kernel function is k(xi , xj ) = 1{i=j} , and the kernel matrix is the n × n identity matrix.
24 SVM Example
Consider two sets of training data A = {(1, 1), (2, 2), (2, 0)} and B = {(0, 0), (1, 0), (0, 1)}, where the points
in A have label −1 and the points in B have label 1. Plot the points in X := A ∪ B, and hand-construct the
maximum margin hyperplane for X . Calculate the optimal margin and identify the support vectors.
24.1 Solution
Hyperplane: y = −x + 3/2
Maximum margin: 1/(2√2)
Support vectors: {(0, 1), (1, 0), (2, 0), (1, 1)}
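A numeric check of this solution: distances of all six training points to the stated hyperplane x + y = 3/2.

```python
import numpy as np

A = [(1, 1), (2, 2), (2, 0)]   # label -1
B = [(0, 0), (1, 0), (0, 1)]   # label +1

w = np.array([1.0, 1.0])       # normal of the line y = -x + 3/2, i.e. x + y = 3/2
theta = 1.5

def signed_distance(p):
    return (np.asarray(p, dtype=float) @ w - theta) / np.linalg.norm(w)

# A lies on the positive side, B on the negative side; the margin is the
# smallest distance from any training point to the hyperplane.
dists = [abs(signed_distance(p)) for p in A + B]
margin = min(dists)
print(abs(margin - 1 / (2 * np.sqrt(2))))   # ≈ 0

# Support vectors: the points achieving the margin exactly
support = [p for p, d in zip(A + B, dists) if abs(d - margin) < 1e-9]
print(support)   # [(1, 1), (2, 0), (1, 0), (0, 1)]
```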
25 Convex sets
Are the following sets convex or not? If convex, provide a brief justification. If not, provide a counter-example
to its convexity.
1. S = {x ∈ Rd : xT w ≤ b} where w ∈ Rd and b ∈ R
2. S = {x ∈ Rd : kxk2 ≤ 1}
3. S = {x ∈ R^d : ∑_{i=1}^d |x_i| ≤ 1}
4. S = {x ∈ Rd : maxi |xi | ≤ 1}
26 Convex functions
We say a function f : Rd → R is concave if −f is convex.
Are the following functions convex, concave, or neither? If convex or concave, provide a brief justification.
If neither, provide a counter-example (either a single point or pair of points) to the convexity property for
an appropriate definition of convexity (for general functions, differentiable functions, or twice-differentiable
functions).
4. f (x) = x4
5. f (x) = sin(x)
6. f (x) = ln(x) where the domain of f is the positive real line.
7. f (x) = x1/2 where the domain of f is the positive real line.
13. f(x) = √(|wᵀx|) for non-zero w ∈ R^d.
14. f(x) = √(|xᵀMx|) where M is a symmetric d × d matrix with only positive eigenvalues.
If we have restricted the domain of a function f to a convex set (e.g., in the case of f (x) = ln(x)), then
the convexity (or concavity) properties just need to be checked over points in the domain, rather than all of
Rd . (The domain should be a convex set!)