
Multiple Regression

Group 9
Akshey Bhogra (PGP/18/178)
Rahul Kaman (PGP/18/211)
Shrishti Khushiram (PGP/18/219)
Indu (PGP/18/351)
Vikram Singh (PGP/18/228)
Objective
To develop a multiple regression model that predicts the retail price of a car
from characteristics such as mileage, make, model, engine size, interior style,
and cruise control.
Sample size: 800 GM cars from model year 2005
Simple Linear Regression
Fit a simple linear regression to find the relationship
between the price of a car and its mileage
Price = 24723 - 0.17 Mileage
t statistic for the slope coefficient (b1): t = -4.09 (p-value < 0.001)
R-square = 2%
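The quantities reported for the fit above (intercept, slope, t statistic of the slope, R-square) can be computed as in the following minimal sketch. The function name `simple_ols` and the mileage and price values are made up for illustration; they are not the GM data and will not reproduce the figures above.

```python
import math

def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares.

    Returns (b0, b1, t statistic of b1, R-square).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares and standard error of the slope
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    t_b1 = b1 / se_b1 if se_b1 > 0 else math.inf
    r2 = 1 - sse / syy
    return b0, b1, t_b1, r2

# Made-up illustration values, NOT the GM data
mileage = [8000, 12000, 18000, 22000, 30000, 35000]
price = [24000, 23500, 21000, 22500, 19000, 19500]
b0, b1, t_b1, r2 = simple_ols(mileage, price)
print(f"Price = {b0:.0f} + ({b1:.3f}) Mileage, t = {t_b1:.2f}, R-square = {r2:.1%}")
```

A large |t| together with a small R-square, as in the fit above, means the slope is reliably non-zero but mileage alone explains little of the price variation.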
Equation 1 Interpretation
1. There is a statistically significant linear
relationship between retail price and mileage,
but there is too much variation around the
expected value (low R-square).

2. More predictors need to be included to make
the model more accurate.
3. Outliers show that some cars are priced
above 52000.
4. For the top-end convertible cars, the fitted
slope for predicting price from mileage is
-0.48, much steeper than the overall -0.17.
High-end cars depreciate faster with
mileage.


Variable Selection
Candidate models are compared on the following
criteria
1. R-square should be high
2. Adjusted R-square should be high
3. Mallows' Cp should be small and close to the
number of model parameters (predictors plus intercept)
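The two adjusted criteria above follow standard formulas, sketched below. The function names are illustrative, and the inputs in the usage line are the rounded figures the deck reports later for Equation 2 (R-square 44.6%, 6 predictors, 800 cars), so the result only approximates the software output.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model with p predictors.

    sse_subset: residual sum of squares of the subset model
    mse_full:   residual mean square of the full model (all candidate predictors)
    A subset model with little bias has Cp close to p + 1.
    """
    return sse_subset / mse_full - n + 2 * (p + 1)

# Rounded inputs taken from the deck's Equation 2 summary
print(f"adjusted R-square = {adjusted_r2(0.446, 800, 6):.4f}")
```

Unlike R-square, which never decreases when a predictor is added, both criteria penalise extra predictors, which is why they are used to compare models of different sizes.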

Selected Variables
a) Mileage
b) Cylinder
c) Liter
d) Doors
e) Cruise
f) Sound
g) Leather

Equation 2
Price = 7323 - 0.171 Mileage + 3200 Cylinder - 1463 Doors + 6206 Cruise
- 2024 Sound + 3327 Leather

S = 7387.11   R-square = 44.6%   R-square (adjusted) = 44.26%
R-square has improved from 2% to 44.6%, but it is still quite low
Equation 2 (contd.)
The normal probability plot shows deviation from
the normal distribution at both ends, with more
variability when prices are higher

The histogram is positively skewed.
The long upper tail is due to cars with a high retail price,
such as the Cadillac XLR V8, a top-end model
The assumption that the errors are normally distributed is violated

Equation 2 (contd.) Heteroskedasticity
1. The residual variance is not constant
2. The model is inaccurate
3. Higher prices show more variability

1. Residuals are plotted against observation order,
where the observations are in alphabetical order
2. Cars of the same make and model have
similar retail prices
3. Hence make should also be included in
the model
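The non-constant variance described above is usually judged from the residual plot, but a rough numeric check is also possible. The sketch below compares the mean squared residual in the top third of fitted values with that in the bottom third. It is a simplified Goldfeld-Quandt-style ratio, not a formal test; the function name and the numbers are made up for illustration.

```python
def variance_ratio(fitted, residuals):
    """Ratio of mean squared residuals: top third vs bottom third of fitted values.

    A ratio far above 1 suggests that variability grows with the fitted value.
    """
    pairs = sorted(zip(fitted, residuals))  # order residuals by fitted value
    k = max(2, len(pairs) // 3)
    low = [r for _, r in pairs[:k]]
    high = [r for _, r in pairs[-k:]]

    def mean_sq(rs):
        return sum(r * r for r in rs) / len(rs)

    return mean_sq(high) / mean_sq(low)

# Made-up fitted prices and residuals, NOT the GM data
fitted = [12000, 15000, 18000, 24000, 35000, 48000]
residuals = [500, -400, 900, -1500, 6000, -9000]
print(f"variance ratio = {variance_ratio(fitted, residuals):.1f}")
```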
Specially constructed variables
Categorical factors such as make and model also affect the retail price,
so dummy variables for them should be created and included in the
model
Dummy variables can be created for
Make
Model
Trim
Type
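Dummy coding of a categorical variable such as Make can be sketched as follows. Each level except one reference level gets its own 0/1 column; the reference is left out so the columns are not perfectly collinear with the intercept. The helper name `dummy_code`, the `is_` column prefix, and the choice of Pontiac as reference are arbitrary assumptions for illustration.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per level, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Illustrative rows; Pontiac is the (arbitrarily chosen) reference level
makes = ["Buick", "Cadillac", "Chev", "Pontiac", "SAAB", "Buick"]
dummies = dummy_code(makes, reference="Pontiac")
for name, column in dummies.items():
    print(name, column)
```

The coefficient of each dummy is then read as the price difference relative to the reference level, holding the other predictors fixed.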

Multicollinearity
Equation 2 omitted Liter and used only Cylinder
Cylinder and Liter are highly correlated, so both cannot be
used together in modelling price
Separate regression models were fitted to determine which of the two
(Cylinder or Liter) is the more precise predictor
Liter was more precise and easier to measure, so Cylinder
can be removed
Price is log-transformed to reduce the effect of the outliers
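How strongly two predictors overlap can be quantified with their Pearson correlation and, for a pair of predictors, the variance inflation factor VIF = 1 / (1 - r^2). The sketch below uses made-up cylinder and liter values, not the GM data, and the function names are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(r):
    """VIF of either predictor when the model contains just these two."""
    r2 = r * r
    return math.inf if r2 >= 1 else 1 / (1 - r2)

# Made-up engine data, NOT the GM data
cylinders = [4, 4, 6, 6, 8, 8]
liters = [1.8, 2.2, 3.1, 3.8, 4.6, 6.0]
r = pearson_r(cylinders, liters)
print(f"r = {r:.3f}, VIF = {vif_two_predictors(r):.1f}")
```

A high VIF means the coefficient standard errors are inflated when both variables are kept, which is the reason for retaining only one of them.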


Equation 3
TPrice = 3.98 - 0.000003 Mileage + 0.0997 Liter + 0.0400 Buick
+ 0.249 Cadillac + 0.00937 Chev + 0.345 SAAB
(TPrice is the log-transformed price)

S = 0.0515753   R-square = 91.7%   R-square (adjusted) = 91.6%

R-square has improved considerably compared with the previous
model
Interpretations

1. The error terms are still not normally
distributed

2. The residuals versus fitted values plot still
shows clustering

3. The residuals versus observation order plot
still shows a systematic pattern, but it is much
less pronounced than before

4. More variables need to be included
Equation 4
TPrice = 3.92 - 0.000004 Mileage + 0.0958 Liter + 0.0335 Doors
+ 0.00752 Cruise + 0.00522 Sound + 0.00626 Leather + 0.0417
Buick + 0.233 Cadillac - 0.0133 Chev - 0.00042 Pontiac + 0.281
SAAB + 0.138 Conv - 0.0890 Hatchback - 0.0711 Sedan

S = 0.0393   R-square = 95.2%   R-square (adjusted) = 95.1%
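To turn Equation 4 into a price, the fitted TPrice has to be back-transformed. The sketch below plugs a hypothetical car into the coefficients above. It assumes TPrice is the base-10 logarithm of price, which the deck does not state but which is consistent with an intercept near 4 for prices in the tens of thousands. The example car and the function name are made up.

```python
# Coefficients copied from Equation 4
INTERCEPT = 3.92
COEF = {
    "Mileage": -0.000004, "Liter": 0.0958, "Doors": 0.0335,
    "Cruise": 0.00752, "Sound": 0.00522, "Leather": 0.00626,
    "Buick": 0.0417, "Cadillac": 0.233, "Chev": -0.0133,
    "Pontiac": -0.00042, "SAAB": 0.281,
    "Conv": 0.138, "Hatchback": -0.0890, "Sedan": -0.0711,
}

def predict_tprice(car):
    """Fitted TPrice for a car given as {variable: value}; missing variables count as 0."""
    return INTERCEPT + sum(COEF[name] * car.get(name, 0) for name in COEF)

# Hypothetical car: a 4-door Buick sedan, 3.0 L, 20000 miles, cruise and upgraded sound
car = {"Mileage": 20000, "Liter": 3.0, "Doors": 4,
       "Cruise": 1, "Sound": 1, "Buick": 1, "Sedan": 1}
tprice = predict_tprice(car)
price = 10 ** tprice  # ASSUMPTION: TPrice = log10(Price)
print(f"TPrice = {tprice:.4f}, predicted price = {price:,.0f}")
```

On the log scale the coefficients act multiplicatively: under the same base-10 assumption, a dummy coefficient b changes the predicted price by a factor of 10^b.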
Interpretation
1. The residuals appear homoscedastic
and follow a normal distribution more
closely

2. Model could also be included, but the
number of dummy variables would then be large
Recommendations
1. Try several candidate models before settling on the final multiple regression model
2. Check the model assumptions
3. If two or more predictors are strongly correlated, keep the one that is more
significant and easier to measure
4. Include qualitative variables in the model (through dummy variables) if they
affect the dependent variable




Thank You
