
2015 | February | 10

Report

Assignment 1
Karan Goel, 2011EE50555

Part 1: Small Dataset (first 20 points)


Comparison of Loss Functions
To compare various loss functions, the small dataset was divided in a 4:1 ratio. 16 points were used for
training the linear model, while 4 were used for testing. The Mean Squared Prediction Error and the
Mean Absolute Prediction Error are noted in Table 1. Figure 1 shows the variation of Mean Absolute
Prediction Error v/s the Polynomial Order.
Table 1: Prediction errors on the test set for each loss function and polynomial order.

Mean Squared Prediction Error
Polynomial Order   Squared Loss   Absolute Loss   Huber Loss    Hampel Loss   Bisquare Loss
1                  79907834       25244397        66715039      79907834      68722207
2                  3625802        4141409         4038223       3625802       3861330
3                  36475.32       33839.53        35051.83      31758.24      17715.53
4                  796.4604       763.9789        1014.938      999.6108      1375.599
5                  11771.15       9683.386        17932.89      19654.91      23689.89
8                  10785436       2557741         5857425       31425649      3780434
10                 49050846168    59729003        12122043070   28806448169   7884536392

Mean Absolute Prediction Error
Polynomial Order   Squared Loss   Absolute Loss   Huber Loss    Hampel Loss   Bisquare Loss
1                  8780.378       4899.833        8015.42       8780.378      8137.174
2                  1794.252       1936.124        1899.168      1794.252      1855.813
3                  164.2568       159.7959        161.9135      152.2111      109.4717
4                  27.38742       26.97899        30.85986      30.45805      35.40511
5                  94.92881       87.15927        115.7957      121.0288      132.2092
8                  2126.542       1023.567        1571.796      3645.455      1266.367
10                 131582.8       5624.049        65160.95      100724.5      52980.33

Types of Loss Functions

Let the true value be denoted by $y$ and the predicted value by $\hat{y}$. Let $L$ denote the loss function and $a = y - \hat{y}$.

Squared Loss
$$ L = \tfrac{1}{2} a^2 $$

Absolute Loss
$$ L = |a| $$
Huber Loss
$$ L = \begin{cases} \tfrac{1}{2} a^2, & |a| \le \delta, \\ \delta |a| - \tfrac{1}{2} \delta^2, & \text{otherwise.} \end{cases} $$
Hampel Loss
$$ L = \begin{cases}
\tfrac{1}{2} a^2, & |a| \le \delta_1, \\
\delta_1 |a| - \tfrac{1}{2} \delta_1^2, & \delta_1 < |a| \le \delta_2, \\
\delta_1 \delta_2 - \tfrac{1}{2} \delta_1^2 + (\delta_3 - \delta_2) \tfrac{\delta_1}{2} \left( 1 - \left( \tfrac{\delta_3 - |a|}{\delta_3 - \delta_2} \right)^2 \right), & \delta_2 < |a| \le \delta_3, \\
\delta_1 \delta_2 - \tfrac{1}{2} \delta_1^2 + (\delta_3 - \delta_2) \tfrac{\delta_1}{2}, & \text{otherwise.}
\end{cases} $$

Bisquare Loss (Tukey's Loss)
$$ L = \begin{cases} \tfrac{\delta^2}{6} \left( 1 - \left( 1 - \tfrac{a^2}{\delta^2} \right)^3 \right), & |a| \le \delta, \\ \tfrac{\delta^2}{6}, & \text{otherwise.} \end{cases} $$
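For concreteness, the following Python sketch (not part of the original analysis, which may have used different tooling) implements these loss functions and fits a polynomial by directly minimizing the chosen loss. The default threshold values for delta, delta1, delta2, delta3 are illustrative assumptions, not the values behind the reported results.

```python
# Sketch: the loss functions above, plus robust polynomial fitting by
# minimizing the mean loss over the residuals. Thresholds are placeholders.
import numpy as np
from scipy.optimize import minimize

def squared(a):
    return 0.5 * a**2

def absolute(a):
    return np.abs(a)

def huber(a, delta=1.0):
    return np.where(np.abs(a) <= delta,
                    0.5 * a**2,
                    delta * np.abs(a) - 0.5 * delta**2)

def hampel(a, d1=1.0, d2=2.0, d3=4.0):
    aa = np.abs(a)
    r1 = 0.5 * aa**2
    r2 = d1 * aa - 0.5 * d1**2
    r3 = d1 * d2 - 0.5 * d1**2 + (d3 - d2) * (d1 / 2) * (1 - ((d3 - aa) / (d3 - d2))**2)
    r4 = d1 * d2 - 0.5 * d1**2 + (d3 - d2) * (d1 / 2)
    return np.select([aa <= d1, aa <= d2, aa <= d3], [r1, r2, r3], default=r4)

def bisquare(a, delta=4.685):
    return np.where(np.abs(a) <= delta,
                    (delta**2 / 6) * (1 - (1 - (a / delta)**2)**3),
                    delta**2 / 6)

def fit_poly(x, y, order, loss):
    """Fit polynomial coefficients by minimizing the mean of `loss` over residuals."""
    X = np.vander(x, order + 1)                   # design matrix [x^order, ..., x, 1]
    obj = lambda w: np.mean(loss(y - X @ w))      # empirical risk for this loss
    w0 = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares warm start
    # Nelder-Mead handles the non-smooth losses; a sketch, not tuned for speed.
    return minimize(obj, w0, method="Nelder-Mead").x
```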

Figure 1: Mean Absolute Prediction Error v/s Polynomial Order



Now restricting the analysis to Squared Loss, we observe the following cross-validation results.
Cross Validation (Leave-one-out)
Polynomial Order   Mean Squared Prediction Error   Mean Absolute Prediction Error
1                  23802270                        3742
2                  527001                          500
3                  4742                            39.5
4                  141                             8.55
5                  255                             10.4
6                  8767                            35.4
                             Polynomial Order
Property                     3        4       5
Residual Standard Error      31.8     7.91    7.91
Residual SSE                 16195    937     877
Estimate of Noise Variance   1012     62.5    62.6
AIC                          201      146     146

Cross-validation clearly indicates that the low-order polynomials (1st, 2nd, 3rd) tend to underfit the data, while the 6th order polynomial overfits and does not generalize well to unseen data.
The results indicate that the 4th order polynomial fit works best. Even though the 5th order polynomial is similar on most statistics, it tends to slightly overfit (shown clearly by cross-validation), and the null hypothesis for its x^5 coefficient cannot be rejected (high p-value).
The 4th order polynomial is given by
$$ y = 17963.10 - 55667.21x + 16741.70x^2 - 1943.39x^3 + 123.52x^4 \qquad (1) $$

The noise variance comes out to be 62.5 (the residual SSE divided by the residual degrees of freedom, 937/15).


Regularization
Now applying regularization to the Squared Loss function, we get the results shown below. Regularization was applied in two forms:
- Ridge Regularization (L2 norm of the coefficients)
- Lasso Regularization (L1 norm of the coefficients)
Ridge tends to shrink coefficients towards smaller values, while Lasso performs feature selection by driving some coefficients exactly to zero (sparse weights).
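As an illustration of the two forms (a sketch over placeholder data with an arbitrary regularization strength, not the fitted models reported below):

```python
# Sketch: ridge (L2) and lasso (L1) regression on polynomial features.
# alpha plays the role of the regularization parameter lambda; the value
# here is a placeholder, as is the data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20).reshape(-1, 1)          # placeholder inputs
y = 2 * x.ravel()**3 - 5 * x.ravel() + rng.normal(0, 5, 20)

ridge_model = make_pipeline(PolynomialFeatures(degree=7, include_bias=False),
                            StandardScaler(), Ridge(alpha=1.0))
lasso_model = make_pipeline(PolynomialFeatures(degree=7, include_bias=False),
                            StandardScaler(), Lasso(alpha=1.0, max_iter=100000))

ridge_model.fit(x, y)
lasso_model.fit(x, y)
print("ridge coefficients:", ridge_model.named_steps["ridge"].coef_)   # shrunk, all non-zero
print("lasso coefficients:", lasso_model.named_steps["lasso"].coef_)   # several exactly zero
```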


Ridge Regularization: Cross Validation (Leave-one-out)
Polynomial Order   Mean Squared Error   Mean Absolute Error
4                  144306               265
5                  48746                158
6                  8187                 69.7
7                  679                  19.6
8                  1556                 22.3
9                  6999                 40.2
10                 7022                 38.8

Lasso Regularization: Cross Validation (Leave-one-out)
Polynomial Order   Mean Squared Error   Mean Absolute Error
4                  119788               239
5                  23743                109
6                  239                  10.4
7                  1429                 24.9
8                  6498                 47.5
9                  10371                59.5
10                 10713                59.3

Figure 2: Mean Squared Error v/s Log Lambda for Best Lasso Regularization


Figure 3: Mean Squared Error v/s Log Lambda for Best Ridge Regularization
Cross-validation indicates that the best values of the regularization parameter (λ) are λ_lasso = 3.26 and λ_ridge = 2.10. Figures 2 and 3 show how the cross-validated mean squared error varies with λ; we select the value of λ that minimizes this error. The noise variance estimates come out to be 465 for ridge and 161 for lasso. The corresponding mean residual sum-of-squares errors for ridge and lasso are 279 and 104.
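This λ-selection step can be sketched with cross-validated estimators (an illustrative Python sketch; the λ grid, polynomial degree, and placeholder data are assumptions, not the settings behind the reported values).

```python
# Sketch: choosing the regularization parameter by (leave-one-out) cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 20)
y = 0.5 * x**4 - 4 * x**3 + 10 * x**2 + rng.normal(0, 5, 20)

X = PolynomialFeatures(degree=7, include_bias=False).fit_transform(x.reshape(-1, 1))
X = StandardScaler().fit_transform(X)        # put the polynomial features on one scale
lambdas = np.logspace(-3, 2, 100)            # candidate grid (placeholder)

lasso = LassoCV(alphas=lambdas, cv=len(x), max_iter=100000).fit(X, y)  # cv = n -> leave-one-out
ridge = RidgeCV(alphas=lambdas, cv=len(x)).fit(X, y)

print("best lasso lambda:", lasso.alpha_)
print("best ridge lambda:", ridge.alpha_)
```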
The overall regression equations are
$$ y = 6545.751 + 2631.198x - 165.514x^2 - 67.667x^3 + 23.701x^4 - 5.397x^5 + 1.103x^6 - 0.218x^7 \qquad (2) $$
for the ridge regularization and
$$ y = 43.29 - 3.21x^5 + 3.43x^6 \qquad (3) $$
for the lasso regularization.

Part 2: Full Dataset


For the full dataset, we have 100 points. Restricting the analysis first to Squared Loss, we observe the following cross-validation results.


Squared Loss - Cross Validation (Leave-one-out)
Polynomial Order   Mean Squared Prediction Error   Mean Absolute Prediction Error
4                  1759771                         995
5                  1736696                         923
6                  79.2                            5.75
7                  84.7                            5.76
8                  104                             6.07
9                  103                             6.14
                             Polynomial Order
Property                     6 (Squared Loss)   7 (Squared Loss)   7 (Absolute Loss)   6 (Huber Loss)   7 (Hampel Loss)
Residual Standard Error      6.82               6.78               6.92                6.82             6.91
Residual SSE                 4325               4230               4409                4330             4390
Estimate of Noise Variance   46.5               46                 47.9                46.6             47.7
AIC                          676                676                667                 677              680

Figure 4: Plot of Best Fit 6th degree polynomial from Squared Loss
The equation of the 6th order squared loss polynomial turns out to be
$$ y = 6372 - 12304x - 87712x^2 - 12301x^3 + 51000x^4 - 5003x^5 + 9846x^6 \qquad (4) $$
and for the 7th order squared loss polynomial,
$$ y = 6371.9 - 12303.97x - 87712.11x^2 - 12300.85x^3 + 50999.53x^4 - 5002.67x^5 + 9845.56x^6 - 9.76x^7 \qquad (5) $$


We choose the 6th order polynomial since it gives nearly the same performance as the higher-degree one, with a simpler model and better cross-validation results.
Regularization
Ridge Regularization: Cross Validation (Leave-one-out)
Polynomial Order   Mean Squared Error   Mean Absolute Error   Optimal λ   Noise Variance
6                  78.6                 5.75                  0.00883     46.5
7                  84.8                 5.76                  0.000182    46
8                  160                  6.88                  0.000001    54.4

Lasso Regularization: Cross Validation (Leave-one-out)
Polynomial Order   Mean Squared Error   Mean Absolute Error   Optimal λ   Noise Variance
6                  78.1                 5.73                  0.00893     46.6
7                  84                   5.76                  0.00345     46
8                  151                  6.73                  0.000001    53.1

L1-regularization gives the best overall performance, with a noise variance of 46.6. The regression equation
given by the Lasso regularization is
$$ y = 7.372 - 1.167x - 1.711x^2 - 4.041x^3 - 0.469x^4 - 2.997x^5 + 3.467x^6 \qquad (6) $$

Given this model's best overall performance and its small coefficient values, it is our best guess for the actual underlying polynomial of the dataset.

Part 3: Boston Housing Dataset


The Boston Housing Dataset contains 506 instances with 14 attributes describing real estate in Boston. We start by fitting a Squared Loss regression using all the features.
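A minimal sketch of this baseline fit (Python/statsmodels) is shown below; it assumes the dataset is available locally as a file named boston.csv with the target column MEDV, which is an assumption about file layout rather than part of the original analysis.

```python
# Sketch: ordinary least-squares fit of MEDV on all 13 predictors.
# Assumes a local "boston.csv" whose columns include the 13 features and
# the target MEDV; the file name and column names are assumptions.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("boston.csv")
X = sm.add_constant(df.drop(columns=["MEDV"]))   # intercept + all features
y = df["MEDV"]

model = sm.OLS(y, X).fit()
print(model.summary())        # reports residual SE, R-squared, F-statistic
print("AIC:", model.aic)
```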
Property      All Features   Without INDUS, AGE   Lasso   Ridge   AIC (up to cubic)   AIC (up to 4th power)
Residual SE   4.75           4.74                 3.89    3.89    3.74                3.56
R-Squared     0.734          0.735                0.816   0.822   0.835               0.85
F-Statistic   108            128                  -       -       95.6                87.7
AIC           3028           3024                 -       -       2799                2757


Cross Validation (Leave-one-out)
Regression Type                     Mean Squared Error   Mean Absolute Error
All Features                        23.7                 3.38
Without INDUS, AGE                  23.5                 3.37
Lasso Regularized (Cubic)           16.7                 2.75
Ridge Regularized (Cubic)           16.4                 2.72
AIC Feature Selection (Cubic)       15.6                 2.68
AIC Feature Selection (4th Power)   15                   2.63
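The AIC-based feature selection reported above can be approximated by a greedy forward search over candidate polynomial terms; the sketch below is a hedged illustration of that idea (the actual stepwise procedure used for the report may differ), reusing the assumed boston.csv / MEDV layout from the earlier snippet.

```python
# Sketch: greedy forward selection of polynomial terms by AIC.
# Candidate terms are each feature raised to powers 1..max_power, an
# illustrative reading of "AIC feature selection (up to the 4th power)".
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_aic(df, target="MEDV", max_power=4):
    y = df[target]
    # Candidate pool: every feature raised to powers 1..max_power.
    pool = pd.DataFrame({f"{c}^{p}": df[c] ** p
                         for c in df.columns.drop(target)
                         for p in range(1, max_power + 1)})
    selected = []
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic   # intercept-only model
    improved = True
    while improved:
        improved = False
        remaining = [c for c in pool.columns if c not in selected]
        if not remaining:
            break
        # AIC of each model obtained by adding one more candidate term.
        aics = {c: sm.OLS(y, sm.add_constant(pool[selected + [c]])).fit().aic
                for c in remaining}
        name, aic = min(aics.items(), key=lambda kv: kv[1])
        if aic < best_aic:             # accept the term only if it lowers AIC
            best_aic, improved = aic, True
            selected.append(name)
    return selected, best_aic

# Usage (with the assumed CSV):
# df = pd.read_csv("boston.csv")
# terms, aic = forward_aic(df)
```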

Figure 5: Residuals v/s Fitted for AIC Feature Selected Model (4th power)
The residuals v/s fitted values plots show that the noise in the data has been well explained by the final model (ideally, the red line in the graph should be a horizontal line passing through 0). The basic model (with all features) has biased residuals with visible structure, which indicates that the model is not expressive enough.
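This diagnostic is easy to reproduce; a hedged matplotlib sketch, where a lowess smoother stands in for the red trend line in the figures:

```python
# Sketch: residuals-vs-fitted diagnostic for a fitted statsmodels OLS result.
import matplotlib.pyplot as plt
import statsmodels.api as sm

def residuals_vs_fitted(results):
    fitted = results.fittedvalues
    resid = results.resid
    smooth = sm.nonparametric.lowess(resid, fitted)     # trend of the residuals
    plt.scatter(fitted, resid, s=10)
    plt.plot(smooth[:, 0], smooth[:, 1], color="red")   # analogue of the red line
    plt.axhline(0, linestyle="--", color="grey")        # ideal: residuals centred on 0
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()
```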


Figure 6: Residuals v/s Fitted for All Features Model

