
Advanced Analysis of

Engineering Data
IENG Course Project

Prepared by
Omar Al-shebeeb
Mazin Alahmadi
Omar Alfaydi

Instructor
Dr. Iskander

Fall 2014

Table of Contents
Executive Summary
Introduction
Data Preparation and Cleaning
    Transformations
    Division of Data
The Assumptions of the Model
Selection of the Best Model
Model Refinement
    With respect to X
    With respect to Y
    Influential Observations
        DFFITs (Difference in fitted values)
        Cook's Distance (influence on all fitted values)
        DFBETAs (Influence on coefficients)
Multicollinearity Check
The Final Model
Residual Analysis
    Kolmogorov-Smirnov Test
Model Validation
Obtaining the Confidence Interval
Conclusion
Appendices
    Appendix A: Scatter Plots for the Variables that have errors
    Appendix B: Scatter Plots for the Variables vs. Response
    Appendix C: SAS Code for Stepwise regression
    Appendix D: SAS Code for Final Model
    Appendix E: Final model results
References

Executive Summary

The best possible model constructed for the problem is:

Ŷ = 1517.72 − 12293·IX1 + 155.0347·X2 + 61.856·XD3 + 57.20·XD5 − 81.122·X61 + 167.42·XD7 + 137.5362·XD9 − 18.5192·X18 + 34.98·X78

The model was obtained by first analyzing and cleaning the data. The initial cleaning was done by plotting each defined variable separately and looking for obvious outlying points and errors. The outliers were removed from the data, and the remaining observations were divided into two groups. The larger of the two groups was used for the regression analysis, and the other group was used to validate the final model. To get an initial model, stepwise regression was run in SAS. The initial model was then studied in depth and refined using various methods. Next, residual analysis was conducted to check the model's assumptions, and the Kolmogorov-Smirnov test was applied. Finally, the model was validated before being used to obtain the required confidence interval:

1435.69 ≤ Ŷh(new) ≤ 1894.31

Introduction

This report was prepared to apply the knowledge gained from the course. The project follows several steps to build the model that best fits the problem. These steps consist of:

1) Data preparation: the data was provided, but it needs some cleaning and error elimination.
2) Reduction of the number of variables: finding the best possible model.
3) Model refinement and selection: this step mainly includes residual analysis.
4) Model validation.

Each step is explained in detail throughout the report.

Data Preparation and Cleaning

In this section, we identify extreme observations that are well separated from the
remainder of the data. Scatter plots are used to detect such cases, which are then discarded
from the data set. The following graphs show extreme observations with respect to the Y values.

[Figure: Dependent Variable vs. Obs Index (two panels).]

Similar graphs were constructed for each predictor to identify potential errors in the data
set (refer to Appendix A). Table 1 below shows the seven observations that will be discarded:

Observation #      254    345   430   281   54   328   155
With Respect To    Y      X1    X1    X2    X3   X6    X8
Value              9352   677   544   3     15   4     44

Table 1: Outliers and Errors.

Transformations

We consider transformations for predictors that have a nonlinear regression relation with
Y, the treatment cost. We do so when we detect a violation of the normality or constant-variance
assumptions for the error term. Since we do not yet have our best model, we
consider all possible transformations for each quantitative predictor. We also center the
variables around their means and scale them to standardize the units:

XDi = (Xi − X̄i) / sXi
IXi = 1 / Xi
LXi = log(Xi)
SXi = √Xi
XiS = XDi²
XiT = XDi³          for i ∈ {1, 3, 4, 5, 7, 8, 9}

where X̄i and sXi are the sample mean and standard deviation of Xi.

For the qualitative predictors (Gender and Geographic Location), we define the following indicator
variables:

X2  = 1 if Male;    0 if Female
X61 = 1 if East;    0 otherwise
X62 = 1 if Central; 0 otherwise

Note that when both X61 and X62 are equal to 0, the observation represents the West region.

We also considered two-variable interactions:

X12=X1D*X2D; X13=X1D*X3D; X14=X1D*X4D; X15=X1D*X5D; X16=X1D*X6D; X17=X1D*X7D; X18=X1D*X8D; X19=X1D*X9D;
X23=X2D*X3D; X24=X2D*X4D; X25=X2D*X5D; X26=X2D*X6D; X27=X2D*X7D; X28=X2D*X8D; X29=X2D*X9D;
X34=X3D*X4D; X35=X3D*X5D; X36=X3D*X6D; X37=X3D*X7D; X38=X3D*X8D; X39=X3D*X9D;
X45=X4D*X5D; X46=X4D*X6D; X47=X4D*X7D; X48=X4D*X8D; X49=X4D*X9D;
X56=X5D*X6D; X57=X5D*X7D; X58=X5D*X8D; X59=X5D*X9D;
X67=X6D*X7D; X68=X6D*X8D; X69=X6D*X9D;
X78=X7D*X8D; X79=X7D*X9D;
X89=X8D*X9D;

Division of Data
After cleaning, 521 observations remain. Assuming the data set is in random order, we
use the first two-thirds of the observations (approximately 350) for building the model; the
remaining observations are reserved for validation.
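
A minimal SAS sketch of this split (the data set name PROJ314 follows the appendices; BUILD and VALID are assumed names):

/* Split the cleaned data: the first two-thirds (about 350 observations)
   build the model; the rest validate it. */
DATA BUILD VALID;
  SET PROJ314;
  IF _N_ <= 350 THEN OUTPUT BUILD;
  ELSE OUTPUT VALID;
RUN;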

The Assumptions of the Model

The standard regression assumptions, namely that the residuals are mutually
independent, normally distributed, and have constant variance, apply to the models in this
report.

Selection of the Best Model

Because the number of predictors is relatively large, the best-subsets algorithm may not be
feasible [1]. Instead, an appropriate algorithm here is stepwise regression. Appendix
C shows the code used to run the algorithm in SAS, and Table 2 shows the summary of the
stepwise results:

Summary of Stepwise Selection
Step  Variable Entered  Variable Removed  Number Vars In  Partial R-Square  Model R-Square  C(p)  F Value  Pr > F
1 XD7 1 0.4556 0.4556 819.640 290.36 <.0001
2 XD9 2 0.2305 0.6860 328.609 254.00 <.0001
3 X2 3 0.0525 0.7385 218.306 69.27 <.0001
4 XD3 4 0.0352 0.7737 145.079 53.46 <.0001
5 XD5 5 0.0162 0.7899 112.365 26.50 <.0001
6 X78 6 0.0142 0.8041 84.0254 24.76 <.0001
7 IX1 7 0.0112 0.8153 62.0874 20.66 <.0001
8 X61 8 0.0114 0.8267 39.7969 22.27 <.0001
9 X18 9 0.0017 0.8283 38.2494 3.27 0.0712
10 X8S 10 0.0022 0.8305 35.5317 4.40 0.0367
11 X13 11 0.0019 0.8324 33.4994 3.79 0.0524
12 X37 12 0.0019 0.8344 31.3525 3.93 0.0482
13 X79 13 0.0016 0.8359 29.9441 3.25 0.0722
14 X561 14 0.0017 0.8376 28.3630 3.44 0.0644
15 X38 15 0.0017 0.8393 26.6921 3.56 0.0602
16 XD4 16 0.0015 0.8409 25.3977 3.21 0.0740
17 X619 17 0.0014 0.8422 24.4685 2.87 0.0910
18 X48 18 0.0012 0.8435 23.7972 2.63 0.1056
19 X362 19 0.0012 0.8447 23.1923 2.58 0.1092
20 X618 20 0.0011 0.8458 22.9079 2.27 0.1328
21 X28 21 0.0015 0.8473 21.6050 3.31 0.0699
22 X27 22 0.0019 0.8492 19.5800 4.07 0.0445
23 X15 23 0.0011 0.8503 19.2493 2.37 0.1250
24 X8S (removed) 22 0.0009 0.8494 19.2442 2.02 0.1557
25 LX7 23 0.0012 0.8506 18.5808 2.71 0.1008
26 X78 (removed) 22 0.0002 0.8504 17.1039 0.53 0.4663
Table 2: Stepwise regression to get the initial model.
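
Appendix C uses the legacy PROC STEPWISE procedure. As a hedged sketch (not the authors' code), the same selection can be run in current SAS with PROC REG; only the first-order terms are listed here, while the full candidate pool appears in Appendix C:

PROC REG DATA=PROJ314;
  /* SLENTRY/SLSTAY correspond to SLE/SLS in the legacy procedure. */
  MODEL Y = XD1 X2 XD3 XD4 XD5 X61 X62 XD7 XD8 XD9
        / SELECTION=STEPWISE SLENTRY=0.15 SLSTAY=0.15;
RUN;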

When comparing the models, the following rules were used to select the best one:

a) After step 5, the R² value improves only slightly, which means the models from step 5
onward can be considered good candidates based on R².

b) After step 8, the C(p) value comes down to a relatively low level.

c) After step 9, the F-values drop sharply.

Taking all of these rules into account, we decided that step 9 gave the best model
to choose. The initial model is therefore:

Ŷ = 1507.49 − 11035·IX1 + 160.594·X2 + 68.80·XD3 + 59.67·XD5 − 86.56·X61 + 175.62·XD7 + 136.09·XD9 − 12.87·X18 + 27.4291·X78

Model Refinement

With respect to X
To find outliers with respect to X, the leverage value Hii of each observation, defined as the
i-th diagonal element of the hat matrix H, was compared with 2p/n = 20/350 = 0.0571, and 13
observations were removed.

With respect to Y
The absolute value of each observation's studentized deleted residual was compared with the Bonferroni
critical value B = t(1 − α/(2n); n − p) = t(1 − 0.1/(2·337); 337 − 10) = t(0.99985; 327) = 3.655;
no observations were eliminated.

Influential Observations
DFFITs (Difference in fitted values)
The absolute value of each DFFITS was compared with 2·√(p/n) = 2·√(10/337) = 0.34452, and 7
observations had to be eliminated.

8
Cook's Distance (influence on all fitted values)
Comparing the Cook's distance values with F(0.10; 10, 330 − 10) = 0.48, all Cook's distances
were smaller than this value. Thus, no observations were eliminated.

DFBETAs (Influence on coefficients)

The usual cutoff 2/√n becomes very small when n is large, which would flag a large number of
observations as influential. Therefore, to limit the number of eliminated data points, the
comparison was made against 3/√330 = 0.165145, and 10 observations were removed, leaving 320
observations.
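
These cutoffs can be applied programmatically. A sketch in SAS (data set and variable names are assumptions, not the authors' code): PROC REG's INFLUENCE option prints the diagnostics, and the OUTPUT statement saves the leverage, studentized deleted residuals, DFFITS, and Cook's distance for screening (DFBETAS are printed by INFLUENCE):

PROC REG DATA=BUILD;
  MODEL Y = IX1 X2 XD3 XD5 X61 XD7 XD9 X18 X78 / R INFLUENCE;
  OUTPUT OUT=DIAG H=LEV RSTUDENT=TRES DFFITS=DFF COOKD=CD;
RUN;

/* Flag cases exceeding the cutoffs used above (p = 10 parameters). */
DATA FLAGGED;
  SET DIAG;
  OUT_X   = (LEV       > 2*10/350);        /* leverage cutoff 2p/n      */
  OUT_Y   = (ABS(TRES) > 3.655);           /* Bonferroni critical value */
  INF_FIT = (ABS(DFF)  > 2*SQRT(10/337));  /* DFFITS cutoff             */
  INF_ALL = (CD        > 0.48);            /* Cook's D vs. F percentile */
RUN;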

Multicollinearity Check

The variance inflation factor (VIF) was calculated for each variable and compared with the
threshold value of 10. As a result, no variables were eliminated and all nine predictors were
kept in the model. The results are shown in Appendix E.

The Final Model

Ŷ = 1517.72 − 12293·IX1 + 155.0347·X2 + 61.856·XD3 + 57.20·XD5 − 81.122·X61 + 167.42·XD7 + 137.5362·XD9 − 18.5192·X18 + 34.98·X78

where:

XD1 = (X1 − 59.08) / 11.40
IX1 = 1 / X1
XD3 = (X3 − 167.31) / 39.96
XD5 = (X5 − 9.54) / 3.34
XD7 = (X7 − 4.36) / 1.55
XD8 = (X8 − 3.16) / 1.03
XD9 = (X9 − 70.34) / 30.87
X78 = XD7 · XD8
X18 = XD1 · XD8
X61 = 1 if East; 0 otherwise

The SAS code for the final model can be found in Appendix D.

Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              9    29013316          3223702        170.16     <.0001
Error            310    5873106           18946
Corrected Total  319    34886422
Table 3: ANOVA for the final model.

The ANOVA in Table 3 confirms that the final model with 9 variables is appropriate, and the
p-value shows that the model is highly significant. Note that R² = 0.8317 and R²adj = 0.8268,
which indicates that a good portion of the variation is explained by this model.
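
For reference, these values follow directly from Table 3:

R²    = SSR / SSTO = 29013316 / 34886422 = 0.8317
R²adj = 1 − [(n − 1)/(n − p)]·(SSE/SSTO) = 1 − (319/310)·(5873106/34886422) = 0.8268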

Residual Analysis

Now we check whether the model's assumptions stated earlier are satisfied. The SAS residual
plots for the final model are shown in Figure 1, and Table 4 gives the interpretation of each
plot:

Plot                                  Interpretation
Residuals vs. Predictors              The plots are satisfactory and do not show any violation of the
                                      constant-variance assumption for the error terms. However, there
                                      seems to be a funnel-shaped pattern for variable X78.
Residuals vs. Predicted Y Values      The plot shows a structureless pattern, which means the variance
                                      is constant.
Normality Plot and Normal Histogram   There seems to be a violation of the normality assumption. However,
                                      transforming Y is not recommended because it may materially change
                                      the shape of the error distribution away from the normal and may
                                      also lead to substantially differing error term variances.
Table 4: Interpretations of the residual graphs.

In conclusion, the assumptions of the model are considered satisfied at this stage.

[Figure 1: Residual graphs.]
Kolmogorov-Smirnov Test

We take the analysis of the residuals a step further and test the model's assumption that the errors
follow a normal distribution with mean zero and variance σ². For simplicity, we used the studentized
residuals and compared them with the standard normal distribution. The critical difference value for
α = 0.05 and n = 320 is:

D(0.05) = 1.36/√n = 1.36/√320 = 0.076

Using MATLAB, we obtained D(max) = 0.0581 with a p-value of 0.2211. Hence, since the maximum
difference is less than the critical difference, we cannot reject the hypothesis that the values come
from a standard normal distribution. Figure 2 compares the theoretical/hypothesized standard normal
distribution with the observed distribution.
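
The same check the report ran in MATLAB could be reproduced in SAS; a minimal sketch, assuming the studentized residuals are stored as variable TRES in a data set DIAG:

/* With MU and SIGMA fixed, the EDF goodness-of-fit tests (including the
   Kolmogorov-Smirnov D and its p-value) are computed against the fully
   specified standard normal distribution. */
PROC UNIVARIATE DATA=DIAG;
  VAR TRES;
  HISTOGRAM TRES / NORMAL(MU=0 SIGMA=1);
RUN;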

[Figure 2: Empirical CDF, F(x) vs. x, comparing the theoretical standard normal distribution with the observed distribution.]

Model Validation

To validate the model, the remaining 171 observations were used, with the same transformations
applied to them. For the validation test, the mean squared prediction error (MSPR) was calculated
on the validation data set:

MSPR = Σ(i=1..171) (Yi − Ŷi)² / 171 = 23770.01

From SAS, the MSE equals 18946. The MSPR is reasonably close to the MSE (about 25% greater),
which indicates that the MSE is not seriously biased and that the model has adequate predictive
ability. Therefore, the model developed is considered valid.
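
A sketch of the MSPR computation in SAS, assuming the hold-out observations (with the same transformed variables) are in a data set VALID and using the final-model coefficients given above:

DATA PRED;
  SET VALID;
  YHAT = 1517.72 - 12293*IX1 + 155.0347*X2 + 61.856*XD3 + 57.20*XD5
       - 81.122*X61 + 167.42*XD7 + 137.5362*XD9 - 18.5192*X18 + 34.98*X78;
  SQERR = (Y - YHAT)**2;
RUN;

PROC MEANS DATA=PRED MEAN N;
  VAR SQERR;   /* the reported MEAN is the MSPR */
RUN;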

Obtaining the Confidence Interval

Before calculating the confidence interval, the new observation should be checked for being an
outlier. Using Maple, the following was concluded:

Hhh = Xh′(X′X)⁻¹Xh = 0.0193

where:

Xh′ = [1, 0.016129, 1, 0.417668, −0.16168, 1, 1.058065, 0.377835, 0.208891, 0.862888]

Note: the values in Xh are obtained using the transformations defined earlier.

Since Hhh = 0.0193 < 2p/n = 20/320 = 0.0625, this point is not an outlier.

Now we will obtain a 90% confidence interval for the treatment cost for a patient with the following
information:

Given observation:

Age: 62 years

Gender: Male

Weight: 184 lbs.

Family Annual Income: $112,000

Number of Visits to Doctors/Hospitals in Last 2 Years: 9

Geographic Location: East

Number of Interventions or Procedures Carried out: 6

Number of Tracked Drugs Prescribed: 4

Number of Days of Treatment: 82

The following calculations were obtained with MATLAB:

Ŷh = Xh′b = 1665

s²{Ŷh(new)} = MSE·[1 + Xh′(X′X)⁻¹Xh] = 19314.1891
s{Ŷh(new)} = 138.9754

t(0.05; 319) = 1.65

90% CI: Ŷh ± t(0.05; 319)·s{Ŷh(new)} = 1665 ± (1.65)(138.9754)

1435.69 ≤ Ŷh(new) ≤ 1894.31
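
The same interval can be reproduced in SAS without the matrix algebra; a hedged sketch, assuming the 320 refined observations are in a data set FINAL: appending the new patient with a missing response makes PROC REG report 90% limits for the prediction via the CLI option (ALPHA=0.1):

DATA NEWOBS;                           /* transform the given patient */
  IX1 = 1/62;                    X2  = 1;
  XD3 = (184-167.31)/39.96;      XD5 = (9-9.54)/3.34;
  X61 = 1;                       XD7 = (6-4.36)/1.55;
  XD8 = (4-3.16)/1.03;           XD9 = (82-70.34)/30.87;
  X18 = ((62-59.08)/11.40)*XD8;  X78 = XD7*XD8;
  Y = .;                         /* unknown response: excluded from the fit */
RUN;

DATA ALLOBS; SET FINAL NEWOBS; RUN;

PROC REG DATA=ALLOBS;
  MODEL Y = IX1 X2 XD3 XD5 X61 XD7 XD9 X18 X78 / CLI ALPHA=0.1;
RUN;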

Conclusion

First, stepwise regression was conducted to obtain the initial model. This initial model was
then refined by removing outlying observations with respect to X and Y. It was then checked
for multicollinearity between the variables, and none was found. The final model was fitted on
320 observations, and variable X4 (Family Income) made no contribution to the treatment cost.
When validating the model, the MSPR value was found to be only slightly larger than the MSE
value, indicating that the model is valid. Finally, we obtained a 90% confidence interval for
the treatment cost.

Appendices

Appendix A: Scatter Plots for the Variables that have errors.

[Scatter plots of X1, X2, X3, X6, and X8 vs. observation index (two panels each), before and after removing the erroneous points.]
Appendix B: Scatter Plots for the Variables vs. Response

[Scatter plots of each predictor (X1 through X9) against the response Y.]

Appendix C: SAS Code for Stepwise regression.

DATA PROJ314;
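/* Read the raw variables, then define the candidate predictors: centered
   and scaled variables (XD#), transformations (inverse, log, square root,
   square, cube), and all two-way interactions. */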
INPUT Y X1 X2 X3 X4 X5 X61 X62 X7 X8 X9;
XD1= (X1- 59.09)/(11.66);
IX1= 1/X1;
LX1= LOG(X1);
SX1= SQRT(X1);
X1S= XD1*XD1;
X1T= XD1*XD1*XD1;

XD3= (X3- 167.08)/(40.33);


IX3= 1/X3;
LX3= LOG(X3);
SX3= SQRT(X3);
X3S= XD3*XD3;
X3T= XD3*XD3*XD3;

XD4= (X4- 93442.07)/(29703.25);


IX4= 1/X4;
LX4= LOG(X4);
SX4= SQRT(X4);
X4S= XD4*XD4;
X4T= XD4*XD4*XD4;

XD5= (X5- 9.64)/(3.46);


IX5= 1/X5;
LX5= LOG(X5);
SX5= SQRT(X5);
X5S= XD5*XD5;
X5T= XD5*XD5*XD5;

XD7= (X7- 4.37)/( 1.66);


IX7= 1/X7;
LX7= LOG(X7);
SX7= SQRT(X7);
X7S= XD7*XD7;
X7T= XD7*XD7*XD7;

XD8= (X8- 3.19)/(1.11);


IX8= 1/X8;
LX8= LOG(X8);
SX8= SQRT(X8);
X8S= XD8*XD8;
X8T= XD8*XD8*XD8;

XD9= (X9- 71.18)/(32.36);
IX9= 1/X9;
LX9= LOG(X9);
SX9= SQRT(X9);
X9S= XD9*XD9;
X9T= XD9*XD9*XD9;

X12=XD1*X2; X13=XD1*XD3; X14=XD1*XD4; X15=XD1*XD5; X161=XD1*X61;


X162=XD1*X62; X17=XD1*XD7; X18=XD1*XD8; X19=XD1*XD9;

X23=X2*XD3; X24=X2*XD4; X25=X2*XD5; X261=X2*X61; X262=X2*X62; X27=X2*XD7;


X28=X2*XD8; X29=X2*XD9;

X34=XD3*XD4; X35=XD3*XD5; X361=XD3*X61; X362=XD3*X62; X37=XD3*XD7;


X38=XD3*XD8; X39=XD3*XD9;

X45=XD4*XD5; X461=XD4*X61; X462=XD4*X62; X47=XD4*XD7; X48=XD4*XD8;


X49=XD4*XD9;

X561=XD5*X61; X562=XD5*X62; X57=XD5*XD7; X58=XD5*XD8; X59=XD5*XD9;

X617=X61*XD7; X627=X62*XD7; X618=X61*XD8; X628=X62*XD8; X619=X61*XD9;


X629=X62*XD9;

X78=XD7*XD8; X79=XD7*XD9;

X89=XD8*XD9;

CARDS;
922 41 0 115 72501 5 1 0 3 2 33
1597 61 1 174 95711 12 1 0 6 4 75
1233 65 0 156 58907 4 0 1 5 3 77
1258 51 0 176 93313 7 0 1 2 2 74
1482 55 0 158 143441 10 1 0 6 4 67
1693 46 1 118 133247 15 0 1 4 3 132
1903 41 0 108 58920 15 0 1 7 5 142
.
.
.
1095 58 0 203 64180 9 0 1 6 3 16
1128 78 0 190 124540 7 0 0 2 2 58
1185 60 1 190 132753 4 0 1 3 2 14
PROC STEPWISE;
MODEL Y= XD1 X2 XD3 XD4 XD5 X61 X62 XD7 XD8 XD9
IX1 IX3 IX4 IX5 IX7 IX8 IX9

LX1 LX3 LX4 LX5 LX7 LX8 LX9
SX1 SX3 SX4 SX5 SX7 SX8 SX9
X1S X3S X4S X5S X7S X8S X9S
X1T X3T X4T X5T X7T X8T X9T
X12 X13 X14 X15 X161 X162 X17 X18 X19
X23 X24 X25 X261 X262 X27 X28 X29
X34 X35 X361 X362 X37 X38 X39
X45 X461 X462 X47 X48 X49
X561 X562 X57 X58 X59
X617 X627 X618 X628 X619 X629
X78 X79
X89
/STEPWISE SLE=0.15 SLS=0.15 ;
RUN;

Appendix D: SAS Code for Final Model

DATA PROJ314;
INPUT Y X1 X2 X3 X4 X5 X61 X62 X7 X8 X9;
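/* Only the transformations entering the final model are defined here; the
   centering constants differ slightly from Appendix C, presumably because
   they were recomputed after the data refinement. */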
XD1= (X1- 59.08)/(11.40);
IX1= 1/X1;
XD3= (X3- 167.31)/( 39.96);
XD5= (X5- 9.54)/( 3.34);
XD7= (X7- 4.36)/( 1.55);
XD8= (X8- 3.16)/( 1.03);
XD9= (X9- 70.34)/( 30.87);
X78=XD7*XD8;
X18=XD1*XD8;

CARDS;
922 41 0 115 72501 5 1 0 3 2 33
1597 61 1 174 95711 12 1 0 6 4 75
1233 65 0 156 58907 4 0 1 5 3 77
1258 51 0 176 93313 7 0 1 2 2 74
1482 55 0 158 143441 10 1 0 6 4 67
1693 46 1 118 133247 15 0 1 4 3 132
1903 41 0 108 58920 15 0 1 7 5 142
.
.
.
1095 58 0 203 64180 9 0 1 6 3 16
1128 78 0 190 124540 7 0 0 2 2 58
1185 60 1 190 132753 4 0 1 3 2 14

PROC REG;
MODEL Y= IX1 X2 XD3 XD5 X61 XD7 XD9 X18 X78/R ALL INFLUENCE VIF ;
RUN;

Appendix E: Final model results.

References

[1] Kutner, M. H., Nachtsheim, C. J., and Neter, J., Applied Linear Regression Models, 4th edition, McGraw-Hill/Irwin, 2004.

