Sarajit Poddar
31 July 2015
Contents
1 Objective
2.1 Initial setup
2.2
2.3 Model development
2.4
2.5
2.6
2.7
Objective
The objective of this article is to explore machine learning algorithms for classifying diamonds into
cost buckets based on their characteristics.
2.1 Initial setup
Structure of dataset
str(dataset)
## 'data.frame': 5000 obs. of 12 variables:
## $ carat  : num 1.02 0.57 0.71 0.51 0.41 0.79 1 0.52 0.9 1.19 ...
## $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 3 5 3 5 5 5 1 5 3 4 ...
## $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 5 1 2 1 2 3 6 6 3 1 ...
## $ clarity:
## $ depth  :
## $ table  :
## $ price  :
## $ x      :
## $ y      :
## $ z      :
## $ fprice :
## $ fcarat :
summary(dataset)
##      carat               cut        color      clarity          depth
##  Min.   :0.2500   Fair     : 213   D: 704   SI1    :1319   Min.   :53.10
##  1st Qu.:0.5200   Good     : 518   E:1004   SI2    :1028   1st Qu.:61.00
##  Median :0.7000   Very Good:1081   F: 994   VS2    : 992   Median :61.80
##  Mean   :0.7127   Premium  :1209   G: 953   VS1    : 629   Mean   :61.78
##  3rd Qu.:0.9000   Ideal    :1979   H: 685   VVS2   : 417   3rd Qu.:62.60
##  Max.   :1.5100                    I: 418   VVS1   : 349   Max.   :79.00
##                                    J: 242   (Other): 266
##      table           price            x               y               z
##  Min.   :51.00   Min.   :1000   Min.   :0.000   Min.   :4.230   Min.   :0.000
##  1st Qu.:56.00   1st Qu.:1624   1st Qu.:5.180   1st Qu.:5.190   1st Qu.:3.200
##  Median :57.00   Median :2391   Median :5.690   Median :5.700   Median :3.520
##  Mean   :57.57   Mean   :2635   Mean   :5.661   Mean   :5.662   Mean   :3.496
##  3rd Qu.:59.00   3rd Qu.:3684   3rd Qu.:6.150   3rd Qu.:6.150   3rd Qu.:3.830
##  Max.   :76.00   Max.   :5000   Max.   :7.380   Max.   :7.200   Max.   :4.820
##      fprice         fcarat
##  3      :1105   6      :1127
##  4      : 855   8      : 846
##  5      : 656   5      : 578
##  6      : 547   11     : 548
##  9      : 512   7      : 517
##  10     : 463   9      : 418
##  (Other): 862   (Other): 966
head(dataset, 5)
##       carat       cut color clarity depth table price    x    y    z fprice fcarat
## 9068   1.02 Very Good     H     SI1  58.4    62  4518 6.55 6.61 3.84     10     11
## 47377  0.57     Ideal     D     SI1  61.0    57  1847 5.34 5.31 3.25      4      6
## 3178   0.71 Very Good     E    VVS2  63.4    58  3340 5.64 5.69 3.59      7      8
## 45482  0.51     Ideal     D     VS2  62.2    58  1678 5.10 5.12 3.18      4      6
## 40631  0.41     Ideal     E     VS1  62.1    56  1153 4.78 4.75 2.96      3      5
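The `fprice` and `fcarat` values in the five rows above are consistent with 500-unit price buckets and 0.1-carat buckets. The original binning code is not shown, so the following is only a minimal sketch under that assumption: the bucket widths, the use of `ceiling()`, and the level ranges `2:10` and `3:16` are inferred from the printed output, not taken from the author's code.

```r
# Assumption: bucket widths (500 for price, 0.1 for carat) are inferred from
# the head() rows above; the report's own binning code is not shown.
price <- c(4518, 1847, 3340, 1678, 1153)
carat <- c(1.02, 0.57, 0.71, 0.51, 0.41)

# Upper-closed buckets: a price of 1847 falls in bucket 4 (1501 to 2000)
price_bucket <- ceiling(price / 500)
carat_bucket <- ceiling(carat / 0.1)

# Ordered factors, in line with the polynomial terms (fprice.L, fcarat.L, ...)
# that appear later in the regression output
fprice <- factor(price_bucket, levels = 2:10, ordered = TRUE)
fcarat <- factor(carat_bucket, levels = 3:16, ordered = TRUE)

data.frame(price, fprice, carat, fcarat)
```

Storing the buckets as ordered factors keeps their natural ordering available to models and plots, at the cost of one model term per level when they are used as predictors.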
2.2
2.2.1
[Figure: histogram of fprice; x-axis: fprice, y-axis: count]
# Histogram of the binned carat (fcarat) distribution
library(ggplot2)
qplot(fcarat, data = dataset, geom = "histogram")
[Figure: histogram of fcarat; x-axis: fcarat, y-axis: count]
## geom_smooth: Only one unique x value each group.Maybe you want aes(group = 1)?
[Figure: fprice (y-axis) against fcarat (x-axis); legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]
2.3 Model development
## [1] 3504   10
## [1] 1496   10
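The two outputs above read as the dimensions of a training set and a test set: 3504 + 1496 = 5000 rows, roughly a 70/30 split. The code that produced the split is not shown. As a minimal base-R sketch of a class-stratified 70/30 split, using a hypothetical stand-in data frame `toy` rather than the report's data:

```r
# Hypothetical stand-in data; only the row count matches the report.
set.seed(1)
n <- 5000
toy <- data.frame(id = seq_len(n),
                  fprice = sample(2:10, n, replace = TRUE))

# Within each fprice class, draw about 70% of the rows for training
train_idx <- unlist(lapply(split(seq_len(n), toy$fprice), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.7 * length(rows)))]
}))

training <- toy[train_idx, ]
testing  <- toy[-train_idx, ]

dim(training)
dim(testing)
```

Sampling within each class keeps the class proportions of the two sets close to each other, which matters here because the smallest price bucket contains only a handful of diamonds.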
2.4
2.4.1
2.4.2
library(rattle); library(rpart.plot)
fancyRpartPlot(modFit$finalModel)
[Figure: decision tree drawn by fancyRpartPlot. The root node (100% of observations) splits on x < 5.5; the "yes" branch is a leaf predicting class 3 (41%); the "no" branch (59%) splits on y < 6.1 into leaves predicting class 6 (31%) and class 9 (28%)]
Model validation
[Confusion matrix, training set (n = 3504); surviving rows, by reference class 2 to 10]
## Prediction 6:   0  11  97 306 323 200  87  34  14
## Prediction 9:   0   0   2  13  29 104 196 324 310
## Overall Statistics
##
##                Accuracy : 0.4024
##                  95% CI : (0.3861, 0.4189)
##     No Information Rate : 0.2209
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.2939
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.000000   0.9858   0.0000   0.0000  0.84334  0.00000
## Specificity          1.000000   0.7469   1.0000   1.0000  0.76001  1.00000
## Pos Pred Value            NaN   0.5248      NaN      NaN  0.30131      NaN
## Neg Pred Value       0.998573   0.9946   0.8291   0.8687  0.97533  0.91067
## Prevalence           0.001427   0.2209   0.1709   0.1313  0.10930  0.08933
## Detection Rate       0.000000   0.2178   0.0000   0.0000  0.09218  0.00000
## Detection Prevalence 0.000000   0.4150   0.0000   0.0000  0.30594  0.00000
## Balanced Accuracy    0.500000   0.8663   0.5000   0.5000  0.80168  0.50000
##                      Class: 8 Class: 9 Class: 10
## Sensitivity           0.00000  0.90251   0.00000
## Specificity           1.00000  0.79205   1.00000
## Pos Pred Value            NaN  0.33129       NaN
## Neg Pred Value        0.91838  0.98614   0.90725
## Prevalence            0.08162  0.10245   0.09275
## Detection Rate        0.00000  0.09247   0.00000
## Detection Prevalence  0.00000  0.27911   0.00000
## Balanced Accuracy     0.50000  0.84728   0.50000
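In the statistics above, the detection rates of the three classes the tree actually predicts (0.2178 + 0.09218 + 0.09247) add up to the reported accuracy of 0.4024, and the no-information rate equals the prevalence of the largest class. The sketch below shows how accuracy, no-information rate and Cohen's kappa follow from a confusion matrix. The helper `cm_summary` and the 2 x 2 matrix are illustrative assumptions, not part of the original analysis.

```r
# cm_summary is a hypothetical helper: rows are predictions, columns are
# reference classes.
cm_summary <- function(cm) {
  n <- sum(cm)
  accuracy <- sum(diag(cm)) / n
  # No-information rate: share of the most common reference class
  nir <- max(colSums(cm)) / n
  # Agreement expected by chance, from the row and column totals
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  kappa <- (accuracy - expected) / (1 - expected)
  c(accuracy = accuracy, nir = nir, kappa = kappa)
}

# Toy matrix, filled column by column
cm <- matrix(c(40, 10, 5, 45), nrow = 2,
             dimnames = list(Prediction = c("a", "b"),
                             Reference = c("a", "b")))
res <- cm_summary(cm)
res
```

Kappa rescales accuracy against chance agreement, which is why a model that predicts only three of nine classes can show an accuracy near 0.40 but a kappa below 0.30.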
[Confusion matrix, test set (n = 1496); surviving rows, by reference class 2 to 10]
## Prediction 6:   0   8  41 125 134  80  28  12   6
## Prediction 9:   0   0   0   5  15  46  94 140 132
## Overall Statistics
##
##                Accuracy : 0.3991
##                  95% CI : (0.3741, 0.4244)
##     No Information Rate : 0.2213
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.2892
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.000000   0.9758   0.0000    0.000  0.81707  0.00000
## Specificity          1.000000   0.7365   1.0000    1.000  0.77477  1.00000
## Pos Pred Value            NaN   0.5127      NaN      NaN  0.30876      NaN
## Neg Pred Value       0.998663   0.9908   0.8289    0.869  0.97175  0.91043
## Prevalence           0.001337   0.2213   0.1711    0.131  0.10963  0.08957
## Detection Rate       0.000000   0.2159   0.0000    0.000  0.08957  0.00000
## Detection Prevalence 0.000000   0.4211   0.0000    0.000  0.29011  0.00000
## Balanced Accuracy    0.500000   0.8562   0.5000    0.500  0.79592  0.50000
##                      Class: 8 Class: 9 Class: 10
## Sensitivity           0.00000  0.91503   0.00000
## Specificity           1.00000  0.78258   1.00000
## Pos Pred Value            NaN  0.32407       NaN
## Neg Pred Value        0.91845  0.98778   0.90775
## Prevalence            0.08155  0.10227   0.09225
## Detection Rate        0.00000  0.09358   0.00000
## Detection Prevalence  0.00000  0.28877   0.00000
## Balanced Accuracy     0.50000  0.84880   0.50000
2.5
2.5.1
2.5.2
Model validation
2.5.2.1
## Overall Statistics
##
##                Accuracy : 0.9991
##                  95% CI : (0.9975, 0.9998)
##     No Information Rate : 0.2209
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.999
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          1.000000   1.0000   0.9983   1.0000   0.9974  1.00000
## Specificity          1.000000   0.9996   1.0000   1.0000   1.0000  0.99937
## Pos Pred Value       1.000000   0.9987   1.0000   1.0000   1.0000  0.99365
## Neg Pred Value       1.000000   1.0000   0.9997   1.0000   0.9997  1.00000
## Prevalence           0.001427   0.2209   0.1709   0.1313   0.1093  0.08933
## Detection Rate       0.001427   0.2209   0.1707   0.1313   0.1090  0.08933
## Detection Prevalence 0.001427   0.2212   0.1707   0.1313   0.1090  0.08990
## Balanced Accuracy    1.000000   0.9998   0.9992   1.0000   0.9987  0.99969
##                      Class: 8 Class: 9 Class: 10
## Sensitivity           0.99650   1.0000   1.00000
## Specificity           1.00000   1.0000   1.00000
## Pos Pred Value        1.00000   1.0000   1.00000
## Neg Pred Value        0.99969   1.0000   1.00000
## Prevalence            0.08162   0.1025   0.09275
## Detection Rate        0.08134   0.1025   0.09275
## Detection Prevalence  0.08134   0.1025   0.09275
## Balanced Accuracy     0.99825   1.0000   1.00000
[Confusion matrix, test set (n = 1496); only the columns for reference classes 6 to 10 survive]
##           Reference
## Prediction   6   7   8   9  10
##         2    0   0   0   0   0
##         3    0   0   0   0   0
##         4    0   0   0   0   0
##         5   40   7   0   1   0
##         6   85  36   6   1   1
##         7   34  57  19   8   0
##         8    4  22  46  15   7
##         9    0  10  45  80  42
##         10   1   2   6  48  88
##
## Overall Statistics
##
##                Accuracy : 0.6651
##                  95% CI : (0.6406, 0.689)
##     No Information Rate : 0.2213
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6097
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.000000   0.9396   0.8125  0.61224  0.51829  0.42537
## Specificity          1.000000   0.9803   0.9589  0.94000  0.93994  0.95301
## Pos Pred Value            NaN   0.9311   0.8031  0.60606  0.51515  0.47107
## Neg Pred Value       0.998663   0.9828   0.9612  0.94145  0.94065  0.94400
## Prevalence           0.001337   0.2213   0.1711  0.13102  0.10963  0.08957
## Detection Rate       0.000000   0.2079   0.1390  0.08021  0.05682  0.03810
## Detection Prevalence 0.000000   0.2233   0.1731  0.13235  0.11029  0.08088
## Balanced Accuracy    0.500000   0.9599   0.8857  0.77612  0.72912  0.68919
##                      Class: 8 Class: 9 Class: 10
## Sensitivity           0.37705  0.52288   0.63768
## Specificity           0.96288  0.92777   0.95803
## Pos Pred Value        0.47423  0.45198   0.60690
## Neg Pred Value        0.94568  0.94466   0.96299
## Prevalence            0.08155  0.10227   0.09225
## Detection Rate        0.03075  0.05348   0.05882
## Detection Prevalence  0.06484  0.11832   0.09693
## Balanced Accuracy     0.66997  0.72532   0.79785
2.6
2.6.1
# Fitted model
fitted.model <- lm(price ~ fcarat + cut + clarity + color + table + y + z + fprice,
                   data = dataset.pr)
## Call:
## lm(formula = price ~ fcarat + cut + clarity + color + table +
##     y + z + fprice, data = dataset.pr)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -478.00  -91.32   -0.86   90.29  653.91
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   692.7184   121.8809   5.684 1.39e-08 ***
## fcarat.L       63.6436    83.7364   0.760 0.447263
## fcarat.Q     -207.6135    64.4114  -3.223 0.001276 **
## fcarat.C      -59.1854    59.6963  -0.991 0.321518
## fcarat^4       22.1916    53.7768   0.413 0.679872
## fcarat^5      -35.8988    45.3522  -0.792 0.428657
## fcarat^6       -2.3563    36.5826  -0.064 0.948646
## fcarat^7      -25.9610    31.6190  -0.821 0.411654
## fcarat^8       -1.1780    29.5570  -0.040 0.968211
## fcarat^9      -13.4611    25.6953  -0.524 0.600390
## fcarat^10     -15.2370    19.4604  -0.783 0.433680
## fcarat^11     -29.6520    13.2595  -2.236 0.025378 *
## fcarat^12     -17.7874     8.5098  -2.090 0.036649 *
## fcarat^13     -10.1020     6.2293  -1.622 0.104931
## cut.L          24.6540     7.0539   3.495 0.000478 ***
## cut.Q           2.7526     5.7337   0.480 0.631200
## cut.C           5.7071     5.1496   1.108 0.267803
## cut^4          -5.7487     4.3535  -1.320 0.186742
## clarity.L     370.0914    14.4251  25.656  < 2e-16 ***
## clarity.Q     -79.9797    10.2724  -7.786 8.37e-15 ***
## clarity.C      57.5250     8.2458   6.976 3.43e-12 ***
## clarity^4     -36.7786     6.6591  -5.523 3.50e-08 ***
## clarity^5       1.7895     5.5359   0.323 0.746521
## clarity^6      17.2801     5.0240   3.439 0.000588 ***
## clarity^7      14.7822     4.6156   3.203 0.001370 **
## color.L      -172.2569     8.0063 -21.515  < 2e-16 ***
## color.Q       -27.3245     5.8839  -4.644 3.51e-06 ***
## color.C       -13.3488     5.3907  -2.476 0.013309 *
## color^4         4.9552     5.0367   0.984 0.325251
## color^5        -3.7985     4.6906  -0.810 0.418089
## color^6         6.2334     4.2783   1.457 0.145190
## table          -0.9325     0.9375  -0.995 0.319938
## y             294.9346    16.8682  17.485  < 2e-16 ***
## z             119.8985    17.1153   7.005 2.79e-12 ***
## fprice.L     3103.1795    31.1688  99.560  < 2e-16 ***
## fprice.Q      238.9273    26.0004   9.189  < 2e-16 ***
## fprice.C     -126.1668    21.6234  -5.835 5.73e-09 ***
## fprice^4       45.2983    15.5962   2.904 0.003695 **
## fprice^5      -53.8780    10.0269  -5.373 8.08e-08 ***
## fprice^6       40.1292     6.7176   5.974 2.48e-09 ***
## fprice^7      -14.4412     5.6136  -2.573 0.010125 *
## fprice^8        3.7922     5.3327   0.711 0.477048
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.7 on 4958 degrees of freedom
## Multiple R-squared:  0.9897, Adjusted R-squared:  0.9897
## F-statistic: 1.167e+04 on 41 and 4958 DF,  p-value: < 2.2e-16
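The coefficient names ending in `.L`, `.Q`, `.C` and `^4` arise because `cut`, `color`, `clarity`, `fcarat` and `fprice` enter the model as ordered factors, which R encodes by default with orthogonal polynomial contrasts: a factor with k levels contributes k - 1 terms (linear, quadratic, cubic, and so on). A minimal sketch using the five `cut` levels, assuming R's default contrast options; the three sample values are arbitrary:

```r
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_example <- factor(c("Ideal", "Good", "Premium"),
                      levels = cut_levels, ordered = TRUE)

# Default coding for ordered factors: orthogonal polynomial contrasts
poly_contrasts <- contrasts(cut_example)
poly_contrasts

# One column per polynomial degree, named as in the lm() output
colnames(poly_contrasts)
```

This is why `cut` (5 levels) has 4 terms in the summary, `fprice` has 8 and `fcarat` has 13; the individual terms describe trend shapes across the ordered levels rather than effects of single levels.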
2.6.2
##        2    3    4    5    6    7    8    9   10
##   2    5    0    0    0    0    0    0    0    0
##   3    2 1085   21    1    0    0    0    0    0
##   4    0   16  806   35    0    0    0    0    0
##   5    0    4   26  580   40    7    0    1    0
##   6    0    0    2   34  466   36    6    1    1
##   7    0    0    0    3   36  370   20    8    0
##   8    0    0    0    3    4   22  331   15    7
##   9    0    0    0    0    0   10   45  439   42
##   10   0    0    0    0    1    2    6   48  413
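The cross-tabulation above can be summarised as a single agreement figure. The sketch below re-enters the counts and computes the share on the diagonal, assuming the diagonal holds the diamonds whose price bucket was matched exactly; the names `tab`, `exact` and `within_one` are illustrative and do not come from the original code.

```r
# Counts from the 9 x 9 table in this section, entered row by row
counts <- c(
  5,    0,   0,   0,   0,   0,   0,   0,   0,
  2, 1085,  21,   1,   0,   0,   0,   0,   0,
  0,   16, 806,  35,   0,   0,   0,   0,   0,
  0,    4,  26, 580,  40,   7,   0,   1,   0,
  0,    0,   2,  34, 466,  36,   6,   1,   1,
  0,    0,   0,   3,  36, 370,  20,   8,   0,
  0,    0,   0,   3,   4,  22, 331,  15,   7,
  0,    0,   0,   0,   0,  10,  45, 439,  42,
  0,    0,   0,   0,   1,   2,   6,  48, 413
)
buckets <- as.character(2:10)
tab <- matrix(counts, nrow = 9, byrow = TRUE,
              dimnames = list(buckets, buckets))

# Share of diamonds on the diagonal (same bucket on both axes)
exact <- sum(diag(tab)) / sum(tab)

# Share that is at most one bucket away from the diagonal
within_one <- sum(tab[abs(row(tab) - col(tab)) <= 1]) / sum(tab)

c(exact = exact, within_one = within_one)
```

The column totals reproduce the `fprice` counts from `summary(dataset)` (for example 1105 diamonds in bucket 3), which is a useful check that the table has been read correctly.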
2.7
2.7.1
x <- dataset.pr$price;
y <- resid(fitted.model)
ggplot(data.frame(x, y), aes(x,y)) +
geom_hline(yintercept=0, size=1) +
geom_point(size=3, colour="black", alpha = 0.1) +
[Figure: residual plot; x-axis: Fitted value, y-axis: Residual]
2.7.2
[Figure: actual price against predicted price; x-axis: Predicted Price, y-axis: Actual Price]
2.7.3
##  1st Qu.   Median     Mean  3rd Qu.     Max.
## -90.0000   1.0000   0.0002  91.0000 478.0000
[Figure: histogram of the price difference with mean line, density curve and normal curve; x-axis: Price difference, y-axis: Frequency]
2.7.4
Plotting the difference between the actual and predicted prices (in percent)
## 1st Qu.  Median    Mean 3rd Qu.    Max.
## -3.5000  0.0450  0.3333  3.9100 26.2500
[Figure: histogram of the percentage price difference with mean line, density curve and normal curve; x-axis: Price difference, y-axis: Frequency]
2.7.5
model.rmse <- sqrt(mean(residuals(fitted.model)^2))
model.rmse
## [1] 122.1546
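The RMSE above is slightly smaller than the residual standard error of 122.7 reported by `summary()`, because the RMSE divides the residual sum of squares by the number of observations while the residual standard error divides by the residual degrees of freedom. A minimal sketch of that relation on the built-in `mtcars` data (a stand-in chosen for illustration, not the report's model):

```r
# Stand-in model on a built-in dataset; any lm() fit behaves the same way
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

rmse <- sqrt(mean(residuals(fit)^2))   # divides by n
rse  <- summary(fit)$sigma             # divides by n - p (residual df)

# The two differ only by the factor sqrt(df / n)
all.equal(rmse, rse * sqrt(df.residual(fit) / n))

# Same relation applied to the rounded figures printed in this report
122.7 * sqrt(4958 / 5000)
```

With 4958 residual degrees of freedom out of 5000 observations the correction factor is very close to 1, so the two measures nearly coincide for this model.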