Project 1

Prostate Project
Part 1. Describe your data and the purpose of your analysis
Stamey et al. (1989) studied potential predictors of prostate-specific antigen (PSA) in 97 prostate cancer patients. Besides an index column X, the data set contains eight predictors: log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). The response is the log of PSA (lpsa); the goal is to build a regression model that predicts lpsa from these predictors.

Part 2. Describe your variables; use EDA techniques to check whether any transformations or data replacement are necessary; prepare the data for analysis.
> prostate<-read.csv("/Users/baiyunchen/Dropbox/Courses/STA5703/Assignment/project1/prostate.csv",header=T,sep=",")

> dim(prostate)

[1] 97 10

> names(prostate)

[1] "X" "lcavol" "lweight" "age" "lbph" "svi"

[7] "lcp" "gleason" "pgg45" "lpsa"

> prostate[1:5,]

X lcavol lweight age lbph svi lcp gleason pgg45

1 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0

2 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0

3 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20


4 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0

5 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0

lpsa

1 -0.4307829

2 -0.1625189

3 -0.1625189

4 -0.1625189

5 0.3715636


> pairs(lpsa~.,data=prostate)

> boxplot(prostate)
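Beyond the scatterplot matrix and boxplots, it is worth confirming that no values are missing before modeling. A minimal sketch (these two calls are an addition, not part of the original session):

```r
# Count missing values per column and summarize each variable
sapply(prostate, function(v) sum(is.na(v)))
summary(prostate)
```

If every count is zero, no data replacement is needed.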


Divide the data into training and validation data sets


> set.seed(1234)

> prostate.train<-sample(1:nrow(prostate),round(nrow(prostate)/2),replace=FALSE)

> index.train<-sample(1:nrow(prostate),round(nrow(prostate)/2),replace=FALSE)

> prostate.train<-prostate[index.train,]

> prostate.validation<-prostate[-index.train,]

Part 3. Build different models using different techniques


Full model
> model1<-lm(lpsa~1+lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate.train)

> summary(model1)

Call:

lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi +

lcp + gleason + pgg45, data = prostate.train)

Residuals:

Min 1Q Median 3Q Max

-1.46523 -0.34944 0.02133 0.45204 1.33804

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.891004 2.140249 0.416 0.6795

lcavol 0.642491 0.144581 4.444 7.11e-05 ***


lweight 0.690378 0.387860 1.780 0.0829 .

age -0.040023 0.016852 -2.375 0.0226 *

lbph 0.128830 0.089682 1.437 0.1588

svi 1.087138 0.326371 3.331 0.0019 **

lcp -0.096406 0.141141 -0.683 0.4986

gleason 0.062949 0.232260 0.271 0.7878

pgg45 0.001542 0.006047 0.255 0.8001

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7012 on 39 degrees of freedom

Multiple R-squared: 0.715, Adjusted R-squared: 0.6565

F-statistic: 12.23 on 8 and 39 DF, p-value: 1.503e-08

Best Subsets
> library(leaps)

> leaps<-regsubsets(lpsa~1+lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate.train,nbest=1)

> names(leaps)

 [1] "np"        "nrbar"     "d"         "rbar"      "thetab"    "first"     "last"      "vorder"    "tol"       "rss"       "bound"
[12] "nvmax"     "ress"      "ir"        "nbest"     "lopt"      "il"        "ier"       "xnames"    "method"    "force.in"  "force.out"
[23] "sserr"     "intercept" "lindep"    "nullrss"   "nn"        "call"

> summary(leaps)

Subset selection object

Call: regsubsets.formula(lpsa ~ 1 + lcavol + lweight + age + lbph +

svi + lcp + gleason + pgg45, data = prostate.train, nbest = 1)

8 Variables (and intercept)

Forced in Forced out


lcavol FALSE FALSE

lweight FALSE FALSE

age FALSE FALSE

lbph FALSE FALSE

svi FALSE FALSE

lcp FALSE FALSE

gleason FALSE FALSE

pgg45 FALSE FALSE

1 subsets of each size up to 8

Selection Algorithm: exhaustive

          lcavol lweight age lbph svi lcp gleason pgg45
1  ( 1 )  "*"    " "     " " " "  " " " " " "     " "
2  ( 1 )  "*"    " "     " " " "  "*" " " " "     " "
3  ( 1 )  "*"    "*"     " " " "  "*" " " " "     " "
4  ( 1 )  "*"    "*"     "*" " "  "*" " " " "     " "
5  ( 1 )  "*"    "*"     "*" "*"  "*" " " " "     " "
6  ( 1 )  "*"    "*"     "*" "*"  "*" "*" " "     " "
7  ( 1 )  "*"    "*"     "*" "*"  "*" "*" "*"     " "
8  ( 1 )  "*"    "*"     "*" "*"  "*" "*" "*"     "*"

> leaps$rss

[1] 67.27038 66.45893 46.28542 44.98887 44.97942 33.37148 29.85956 28.88242 19.17387

> plot(leaps$rss)
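RSS necessarily decreases as predictors are added, so the raw RSS plot cannot choose a subset size by itself; a penalized criterion can. A sketch using the summary components of the regsubsets fit (cp, bic, and adjr2 are standard fields of summary.regsubsets):

```r
# Choose a subset size with penalized criteria instead of raw RSS
best <- summary(leaps)
which.min(best$cp)     # size minimizing Mallows' Cp
which.min(best$bic)    # size minimizing BIC
which.max(best$adjr2)  # size maximizing adjusted R-squared
```

The subsets() plots that follow display the same criteria graphically.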


> coef(leaps,6)

(Intercept)      lcavol     lweight         age        lbph         svi         lcp
 1.28867769  0.65108329  0.67448276 -0.03836805  0.13572784  1.11858421 -0.06266703

> library(car)

> subsets(leaps,statistic="rss")


> subsets(leaps,statistic="cp")


> subsets(leaps,statistic="bic")


> model2<-lm(lpsa~1+lcavol+lweight+age+svi,data=prostate.train)

> summary(model2)

Call:

lm(formula = lpsa ~ 1 + lcavol + lweight + age + svi, data = prostate.train)

Residuals:

Min 1Q Median 3Q Max

-1.37218 -0.37259 0.04658 0.44372 1.54817

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.36601 1.13619 0.322 0.74891

lcavol 0.60727 0.10163 5.976 3.97e-07 ***

lweight 0.78727 0.34821 2.261 0.02888 *

age -0.02872 0.01483 -1.937 0.05937 .

svi 0.93710 0.28559 3.281 0.00206 **

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6908 on 43 degrees of freedom

Multiple R-squared: 0.695, Adjusted R-squared: 0.6666

F-statistic: 24.49 on 4 and 43 DF, p-value: 1.306e-10

Stepwise Regression
> model<-lm(lpsa~1+lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate.train)

> library(MASS)

> steps<-stepAIC(model,direction="both")

Start: AIC=-26.05


lpsa ~ 1 + lcavol + lweight + age + lbph + svi + lcp + gleason +

pgg45

Df Sum of Sq RSS AIC

- pgg45 1 0.0320 19.206 -27.9674

- gleason 1 0.0361 19.210 -27.9570

- lcp 1 0.2294 19.403 -27.4765

<none> 19.174 -26.0473

- lbph 1 1.0145 20.188 -25.5724

- lweight 1 1.5576 20.732 -24.2982

- age 1 2.7732 21.947 -21.5633

- svi 1 5.4549 24.629 -16.0296

- lcavol 1 9.7085 28.882 -8.3825

Step: AIC=-27.97

lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason

Df Sum of Sq RSS AIC


- gleason 1 0.1310 19.337 -29.641

- lcp 1 0.2220 19.428 -29.416

<none> 19.206 -27.967

- lbph 1 1.0435 20.249 -27.428

+ pgg45 1 0.0320 19.174 -26.047

- lweight 1 1.7501 20.956 -25.782

- age 1 2.8032 22.009 -23.428

- svi 1 5.7435 24.949 -17.409

- lcavol 1 9.8820 29.088 -10.042


Step: AIC=-29.64

lpsa ~ lcavol + lweight + age + lbph + svi + lcp

Df Sum of Sq RSS AIC

- lcp 1 0.1206 19.457 -31.343

<none> 19.337 -29.641

- lbph 1 1.1466 20.483 -28.876

+ gleason 1 0.1310 19.206 -27.967

+ pgg45 1 0.1268 19.210 -27.957

- lweight 1 1.6299 20.967 -27.757

- age 1 2.6723 22.009 -25.428

- svi 1 6.0016 25.338 -18.666

- lcavol 1 10.0995 29.436 -11.471

Step: AIC=-31.34

lpsa ~ lcavol + lweight + age + lbph + svi

Df Sum of Sq RSS AIC

<none> 19.457 -31.343

- lbph 1 1.0625 20.520 -30.791

- lweight 1 1.5335 20.991 -29.701

+ lcp 1 0.1206 19.337 -29.641

+ pgg45 1 0.0530 19.404 -29.474

+ gleason 1 0.0295 19.428 -29.416

- age 1 2.5573 22.015 -27.416

- svi 1 6.0970 25.554 -20.259


- lcavol 1 16.7003 36.158 -3.599
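Instead of retyping the predictors that survived, the selected formula can be taken directly from the step object; a short sketch (the stepAIC return value is an lm fit, so formula() applies):

```r
# Extract the formula retained by stepwise selection and refit
formula(steps)   # lpsa ~ lcavol + lweight + age + lbph + svi
model3 <- lm(formula(steps), data = prostate.train)
```

This reproduces the model3 fit shown next.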

> model3<-lm(lpsa~1+lcavol+lweight+age+lbph+svi,data=prostate.train)

> summary(model3)

Call:

lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi,

data = prostate.train)

Residuals:

Min 1Q Median 3Q Max

-1.43279 -0.36145 0.05076 0.43338 1.41406

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.33225 1.28853 1.034 0.307088

lcavol 0.60161 0.10020 6.004 3.92e-07 ***

lweight 0.64666 0.35543 1.819 0.075985 .

age -0.03628 0.01544 -2.349 0.023577 *


lbph 0.12921 0.08532 1.514 0.137413

svi 1.06764 0.29430 3.628 0.000768 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6806 on 42 degrees of freedom

Multiple R-squared: 0.7108, Adjusted R-squared: 0.6763

F-statistic: 20.64 on 5 and 42 DF, p-value: 2.362e-10


Ridge Regression
> model.ridge<-lm.ridge(lpsa~1+lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate.train,na.action="na.omit",lambda=seq(1,100,.02))

> plot(model.ridge$lambda,model.ridge$GCV)

> model.ridge$lambda[model.ridge$GCV==min(model.ridge$GCV)]

[1] 4.44

> model.ridge<-lm.ridge(lpsa~1+lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate.train,na.action="na.omit",lambda=4.44)


> model.ridge

                  lcavol      lweight          age         lbph          svi          lcp      gleason        pgg45
 0.865719691 0.499779323  0.561858506 -0.027173767  0.092847064  0.937095244  0.013524115  0.043011778  0.002437025
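lm.ridge objects have no predict() method, which is why an ordinary lm refit appears next as a workaround; predictions can also be formed directly from the ridge coefficients. A sketch (coef() on a ridgelm object returns the intercept followed by the slopes on the original data scale):

```r
# Manual predictions from the ridge fit: intercept + X %*% slopes
b <- coef(model.ridge)                      # intercept first, then 8 slopes
X <- as.matrix(prostate.validation[, 2:9])  # predictor columns (lcavol..pgg45)
pred.ridge <- drop(b[1] + X %*% b[-1])
```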

> model.ridge2<-lm(lpsa~1+lcavol+lweight+age+lbph+svi+lcp+gleason,data=prostate.train)

> model.ridge2

Call:

lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi +
    lcp + gleason, data = prostate.train)

Coefficients:

(Intercept)       lcavol      lweight          age         lbph          svi          lcp      gleason
    0.61977      0.64572      0.71277     -0.04020      0.13036      1.10067     -0.09475      0.09745

> summary(model.ridge2)

Call:

lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi +

lcp + gleason, data = prostate.train)

Residuals:

Min 1Q Median 3Q Max

-1.455101 -0.352466 0.008478 0.433936 1.341303

Coefficients:

Estimate Std. Error t value Pr(>|t|)


(Intercept) 0.61977 1.83537 0.338 0.73737

lcavol 0.64572 0.14233 4.537 5.11e-05 ***

lweight 0.71277 0.37335 1.909 0.06344 .

age -0.04020 0.01664 -2.416 0.02034 *

lbph 0.13036 0.08843 1.474 0.14825

svi 1.10067 0.31824 3.459 0.00130 **

lcp -0.09475 0.13933 -0.680 0.50041

gleason 0.09745 0.18657 0.522 0.60435

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6929 on 40 degrees of freedom

Multiple R-squared: 0.7145, Adjusted R-squared: 0.6645

F-statistic: 14.3 on 7 and 40 DF, p-value: 3.886e-09

Part 4. Select the best model based on RSS on the validation data


> x.validation<-data.frame(prostate.validation[,-10])

> y.validation<-data.frame(prostate.validation[,10])

> predict.model1<-predict(model1,x.validation)

> ls()

 [1] "cars"                "data1"               "datafat"             "Distance"            "index.train"         "leaps"
 [7] "model"               "model.lasso"         "model.ridge"         "model.ridge2"        "model.test"          "model1"
[13] "model2"              "model3"              "predict.model1"      "predict.model2"      "predict.model3"      "predict.model4"
[19] "predict.model5"      "prostate"            "prostate.train"      "prostate.validation" "rss.model1"          "rss.model2"
[25] "rss.model3"          "rss.model4"          "rss.model5"          "Speed"               "steps"               "woodard"
[31] "x"                   "x.validation"        "y"                   "y.validation"

> predict.model2<-predict(model2,x.validation)

> predict.model3<-predict(model3,x.validation)

> predict.model4<-predict(model.ridge2,x.validation)

> rss.model1<-sum((y.validation-predict.model1)^2)

> rss.model2<-sum((y.validation-predict.model2)^2)

> rss.model3<-sum((y.validation-predict.model3)^2)

> rss.model4<-sum((y.validation-predict.model4)^2)

> rss.model1

[1] 19.17387

> rss.model2

[1] 20.51985

> rss.model3

[1] 19.45737

> rss.model4

[1] 19.20584
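Collecting the four values into one named vector makes the comparison easier to read; a small sketch using the objects above:

```r
# Rank the candidate models by validation RSS
rss <- c(model1 = rss.model1, model2 = rss.model2,
         model3 = rss.model3, ridge = rss.model4)
sort(rss)   # model1 (the full model) comes out smallest
```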

Part 5. Describe Your Model


> summary(model1)

Call:

lm(formula = lpsa ~ 1 + lcavol + lweight + age + lbph + svi +

lcp + gleason + pgg45, data = prostate.train)

Residuals:

Min 1Q Median 3Q Max

-1.46523 -0.34944 0.02133 0.45204 1.33804


Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.891004 2.140249 0.416 0.6795

lcavol 0.642491 0.144581 4.444 7.11e-05 ***

lweight 0.690378 0.387860 1.780 0.0829 .

age -0.040023 0.016852 -2.375 0.0226 *

lbph 0.128830 0.089682 1.437 0.1588

svi 1.087138 0.326371 3.331 0.0019 **

lcp -0.096406 0.141141 -0.683 0.4986

gleason 0.062949 0.232260 0.271 0.7878

pgg45 0.001542 0.006047 0.255 0.8001

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7012 on 39 degrees of freedom

Multiple R-squared: 0.715, Adjusted R-squared: 0.6565

F-statistic: 12.23 on 8 and 39 DF, p-value: 1.503e-08

Our final model is model1, the full model, which had the smallest RSS on the validation data:

lpsa = 0.89 + 0.64*lcavol + 0.69*lweight - 0.04*age + 0.13*lbph + 1.09*svi - 0.10*lcp + 0.06*gleason + 0.002*pgg45

This model explains 71.5% of the variation in lpsa (multiple R-squared = 0.715).
