Assignment 3: Gaussian distribution, Linear Classification, and
Linear Regression
1. (10 points) The Gaussian distribution
(a) Using the fact that √(β/(2π)) ∫₋∞^∞ exp(−(β/2)(s − γ)²) ds = 1, show that

    ∫₋∞^∞ exp(−as² + bs) ds = √(π/a) exp(b²/(4a)).

(b) Suppose that µ ∼ N(0, 1/α) and x|µ ∼ N(µ, 1/β). By integrating out µ, show that the marginal distribution of x is given by x ∼ N(0, 1/β + 1/α). [Hint: Write down the joint distribution p(x, µ) = p(x|µ)p(µ). To perform the integral over µ, use the identity from part (a).]

(c) In the lecture, we showed that a product of two Gaussian probability density functions is proportional to a Gaussian density function, but we did not derive the proportionality factor. Suppose that p1(x) = N(x, µ1, 1/β1) and p2(x) = N(x, µ2, 1/β2). Find Z such that

    p(x) = (1/Z) p1(x) p2(x) = N(x, (β1µ1 + β2µ2)/β, 1/β), where β = β1 + β2.   (1)

[Hint: Go through the calculations we did in the lecture again, but carefully keep track of the factors we had dropped.]

2. (10 points) Linear Regression (based on Bishop exercise 3.3). Consider a data-set in which each data point tn has a weighting rn > 0, so that the sum-of-squares error function is

    E(ω) = Σ_{n=1}^N rn (tn − ω⊤xn)².   (2)
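Before proving the identity in part (a), it can be reassuring to check it numerically. The sketch below compares the numerical integral against the closed form for arbitrary test values of a and b (these values are illustrative, not part of the assignment):

```python
import numpy as np
from scipy.integrate import quad

# Arbitrary test constants with a > 0 (required for convergence).
a, b = 1.3, 0.7

# Left-hand side: numerical integration of exp(-a s^2 + b s) over the real line.
numeric, _ = quad(lambda s: np.exp(-a * s**2 + b * s), -np.inf, np.inf)

# Right-hand side: the claimed closed form sqrt(pi/a) * exp(b^2 / (4a)).
closed_form = np.sqrt(np.pi / a) * np.exp(b**2 / (4 * a))

print(numeric, closed_form)  # the two values should agree closely
```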
(a) Find the parameter-vector ω̂ which minimizes this error function.
(b) Describe two interpretations of this error function in terms of i) replicated measurements and ii) a data-dependent noise variance.

3. (15 points) Linear classification and the logistic function. This exercise is concerned with the logistic function σ(s) = 1/(1 + exp(−s)) as well as its connection with linear classification in Gaussian models.

(a) Show that the logistic function satisfies σ(−s) + σ(s) = 1, and find the first two derivatives of σ(s), σ′(s) and σ″(s).

(b) Plot σ(s) as well as log(σ(s)) as a function of s (either using Python or with pen and paper; a rough plot which captures the qualitative features of the functions is sufficient). Explain why, for large s > 0, log(σ(s)) ≈ 0 and log(σ(−s)) ≈ −s.

(c) Suppose that we have data from two classes, and the data within each class is Gaussian distributed with the same covariance, i.e. x|t = 1 ∼ N(µ+, Σ) and x|t = −1 ∼ N(µ−, Σ), and that the two classes have the same prior probabilities π+ = P(t = +1) = π− = P(t = −1) = 0.5. Show that the conditional probability of belonging to the positive class can be written as a logistic function P(t = 1|x) = σ(ω⊤x + ω0), and identify the corresponding parameters ω and ω0.

4. (15 points) Linear Classification [Python] Download the file LinearClassification.mat, in which you will find training data xTrain (a matrix of size N = 500 by D = 2) with labels tTrain. Your job will be to train and compare two classification algorithms on this data.

(a) Calculate the means and the covariances of each of the two classes, as well as the average covariance Σ = ½Σ+ + ½Σ−. Use µ+, µ− and Σ to compute the weight vector ω and offset ω0 of the Gaussian linear discriminant analysis used in the lectures.

(b) Plot the data as well as the decision boundary into a 2-D plot, and calculate the (training) error rate of the algorithm, i.e. the proportion of points in the training set which were misclassified by it.
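The mechanics of part 4(a) can be sketched as follows. This is a minimal illustration on synthetic data, assuming labels ±1 and the standard equal-prior Gaussian LDA parameterisation; with the actual assignment data you would instead load xTrain and tTrain via scipy.io.loadmat('LinearClassification.mat'):

```python
import numpy as np

# Synthetic stand-ins for the two classes (250 points each, D = 2).
rng = np.random.default_rng(0)
x_pos = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(250, 2))    # class t = +1
x_neg = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(250, 2))  # class t = -1

# Per-class means and covariances, and the average covariance from part (a).
mu_pos, mu_neg = x_pos.mean(axis=0), x_neg.mean(axis=0)
Sigma_pos = np.cov(x_pos, rowvar=False)
Sigma_neg = np.cov(x_neg, rowvar=False)
Sigma = 0.5 * Sigma_pos + 0.5 * Sigma_neg

# Weight vector and offset of the equal-prior Gaussian LDA rule
# (the form you derive in problem 3(c)).
w = np.linalg.solve(Sigma, mu_pos - mu_neg)
w0 = -0.5 * (mu_pos + mu_neg) @ w
```

A point x is then assigned to the positive class when w @ x + w0 > 0.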
Use the data in xTest and tTest to also calculate its error rate on the test set.

(c) Calculate the parameters of the decision function y(x) = x⊤Ax + b⊤x + c of the 'quadratic discriminant analysis' that can be derived by doing classification in a Gaussian model without assuming that Σ+ = Σ−, and calculate the training and test error rates of this algorithm.

(d) For each data point in the test set, calculate its (scaled and signed) distance to the decision boundary (i.e. the value of y(x) for each x). Make a plot which contains the histogram of all points in the positive class (in blue) as well as a histogram of the points in the negative class (in red).

(e) Calculate the decision boundary of the quadratic algorithm and add it to the plot used in (b).

5. (20 points) Regression with basis functions [Python] Download the file LinearRegression.mat, in which you will find training data xTrain (a vector of length N = 20) with outputs tTrain. Your job will be to train a nonlinear regression model from x to t using basis functions.

(a) We want to use a 50-dimensional basis set, i.e. the 'feature vector' z(x) should be 50-dimensional with zi(x) = exp(−(x − i)²/σ²), with σ = 5 and i = 1, . . . , 50. Make a plot of the 50 basis functions (use the x-values in xPlot). Calculate the 50 × N matrix zTrain whose n-th column is z(xn), and produce an image of the matrix (using matplotlib.pyplot.matshow).

(b) Using α = β = 1 (same notation as in the lectures), calculate the posterior mean µ = E(ω|D) (a 50 × 1 vector) and plot it.

(c) The posterior mean µ is a vector of weights of the basis functions. Calculate the corresponding predictive mean fµ(x) = E(t(x)|D) = Σ_{i=1}^{50} µi zi(x), and plot the predictive mean and the observed training data into the same plot.

(d) Calculate the posterior covariance over weights Σ = Cov(ω|D) and display it as an image.
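The basis-matrix and posterior computations in parts 5(a)-(b) can be sketched as below. This assumes the standard Bayesian linear regression posterior with prior ω ∼ N(0, α⁻¹I) and noise precision β; the training inputs and targets here are made-up stand-ins, whereas with the assignment data you would load xTrain and tTrain from LinearRegression.mat:

```python
import numpy as np

sigma, D = 5.0, 50
x_train = np.linspace(0.0, 50.0, 20)   # stand-in for xTrain (N = 20)
t_train = np.sin(x_train / 8.0)        # stand-in for tTrain

def z(x):
    """50-dimensional Gaussian-bump feature vector z(x), centres i = 1..50."""
    i = np.arange(1, D + 1)
    return np.exp(-(x - i) ** 2 / sigma ** 2)

# 50 x N design matrix whose n-th column is z(x_n).
Z = np.stack([z(xn) for xn in x_train], axis=1)

# Posterior over weights with alpha = beta = 1:
#   Cov(w|D) = (alpha I + beta Z Z^T)^{-1},  E(w|D) = beta Cov(w|D) Z t.
alpha = beta = 1.0
Sigma_post = np.linalg.inv(alpha * np.eye(D) + beta * Z @ Z.T)
mu_post = beta * Sigma_post @ Z @ t_train
```

Plotting the basis functions, mu_post, and the image of Z then only requires matplotlib calls on these arrays.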
Extract the diagonal of Σ to obtain the posterior variance, and use it to plot ±2 standard-deviation error bars on the mean in part (b).

(e) Calculate, for each x (use the values in xPlot), the predictive variance Var(t|D, x), and use it to plot 'error bars' for the predictive distribution, i.e. fµ(x) ± 2√Var(t|D, x).
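For part (e), the predictive moments at a query point follow the standard Bayesian linear regression formulas, Var(t|D, x) = 1/β + z(x)⊤ Σ z(x). A minimal sketch, where the feature map z, the posterior mean mu_post, and the posterior covariance Sigma_post are assumed to come from your earlier parts (all names here are illustrative stand-ins):

```python
import numpy as np

def predictive(x, z, mu_post, Sigma_post, beta=1.0):
    """Predictive mean f_mu(x) = E(t|D, x) and variance Var(t|D, x)."""
    zx = z(x)
    mean = zx @ mu_post                        # weighted sum of basis functions
    var = 1.0 / beta + zx @ Sigma_post @ zx    # noise variance + weight uncertainty
    return mean, var
```

Evaluating this over the values in xPlot gives the curves for the fµ(x) ± 2√Var(t|D, x) error-bar plot.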