
Boston House Pricing Model

Introduction:

The data to be analyzed has been collected for the purpose of predicting the price of houses in
Boston based on various features related to the individual houses.

This report seeks to examine the influence of several attributes on the prices of housing, in an
attempt to discover the most suitable explanatory variables. Here we will evaluate the
performance and predictive power of a regression model that has been trained and tested on data
collected on houses in Boston, Massachusetts. A model trained on these data can then be used to
make certain predictions about a home, in particular its monetary value. Such a model would be
valuable for someone like a real estate agent, who could make use of this information on a
daily basis. The SPSS analytical software is used to conduct this analysis.

Variable Definition:

Dependent Variable:

SalePrice: Actual price at which each house was sold

Total: The data set contains 1085 observations from which to build the regression model

Independent Variables:

MSSubClass: Identifies the type of dwelling involved in the sale.

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

OverallQual: Rates the overall material and finish of the house

OverallCond: Rates the overall condition of the house

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

MasVnrArea: Masonry veneer area in square feet

BsmtFinSF1: Type 1 finished square feet

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area


TotalBsmtSF: Total square feet of basement area

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

KitchenAbvGr: Kitchens above grade

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Fireplaces: Number of fireplaces

GarageYrBlt: Year garage was built

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

MiscVal: Value of miscellaneous feature (Elevator, 2nd Garage, Tennis Court, etc)

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)


Descriptive Statistics

Variable            N     Minimum  Maximum  Mean       Std. Error of Mean  Std. Deviation

MSSubClass          1085  20       190      56.62      1.251               41.215
LotFrontage         1085  21       234      80.74      1.202               39.590
LotArea             1085  1300     215245   10162.71   258.021             8499.044
OverallQual         1085  3        10       6.27       .040                1.319
OverallCond         1085  2        9        5.51       .030                .991
YearBuilt           1085  1880     2010     1979.28    .787                25.929
YearRemodAdd        1085  1950     2010     1987.67    .586                19.297
MasVnrArea          1085  0        975      112.37     5.088               167.603
BsmtFinSF1          1085  0        2188     460.51     13.036              429.386
BsmtUnfSF           1085  0        2153     577.51     13.707              451.508
TotalBsmtSF         1085  0        3206     1081.54    12.285              404.669
@1stFlrSF           1085  483      2633     1167.76    10.813              356.180
@2ndFlrSF           1085  0        1540     339.68     13.035              429.348
LowQualFinSF        1085  0        528      2.37       .872                28.724
GrLivArea           1085  520      3238     1509.81    13.552              446.395
BsmtFullBath        1085  0        2        .44        .016                .511
BsmtHalfBath        1085  0        1        .05        .007                .227
FullBath            1085  0        3        1.61       .016                .523
HalfBath            1085  0        2        .41        .015                .508
BedroomAbvGr        1085  0        6        2.84       .023                .762
KitchenAbvGr        1085  1        3        1.04       .006                .198
TotRmsAbvGrd        1085  3        12       6.50       .046                1.518
Fireplaces          1085  0        3        .62        .019                .626
GarageYrBlt         1085  1906     2010     1982.63    .686                22.593
GarageCars          1085  1        4        1.91       .019                .614
GarageArea          1085  164      1220     511.32     5.342               175.953
WoodDeckSF          1085  0        576      95.15      3.497               115.177
OpenPorchSF         1085  0        312      48.56      1.775               58.452
EnclosedPorch       1085  0        330      14.92      1.530               50.397
@3SsnPorch          1085  0        508      3.97       .985                32.449
ScreenPorch         1085  0        440      14.07      1.588               52.310
PoolArea            1085  0        648      1.61       .930                30.649
YrSold              1085  2006     2010     2007.83    .040                1.329
SalePrice           1085  58500    611657   188324.25  2220.735            73149.529
Valid N (listwise)  1085
Analysis:
The purpose of our analysis is to determine the effect of the different predictor (independent)
variables, as listed above, on the dependent variable, i.e. the price of the houses, and to come
up with a linear regression model that can satisfactorily predict the price given the values of
the predictor variables.
Here, a standard linear multiple regression was performed to assess the ability of the predictor
variables to predict the sale price of houses.

Assumptions:
1. The larger the sample size, the better the analysis. As a rule of thumb, there should be at
least 15 observations for each independent variable.
2. The dependent variable is normally distributed.
3. There is no multicollinearity among the predictor variables, i.e. there are no redundant
predictor variables (pairs of predictors with a correlation coefficient R of more than .7).
4. There are no outliers in the sample data set, as multiple regression is very sensitive to
outliers, whether high or low.
5. The dependent and independent variables are linearly related.

Verification of the assumptions:


Step 1:
The first step of the analysis is to perform a correlation analysis between the dependent
variable and the independent variables, and among the independent variables themselves, to find
out which variables have an inconsequential relationship with the outcome variable and whether
there is any multicollinearity among the predictor variables.
Upon performing the correlation analysis among the different variables, the following
observations were made:
The independent variables OverallQual, YearBuilt, YearRemodAdd, MasVnrArea,
BsmtFinSF1, TotalBsmtSF, 1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, Fireplaces,
GarageYrBlt, GarageCars, GarageArea, WoodDeckSF and OpenPorchSF each have a correlation
coefficient of more than 0.3 with SalePrice and are therefore treated as having a meaningful
relationship with the dependent variable.
Going through the correlation coefficients between the independent variables, it was
observed that YearBuilt and GarageYrBlt are highly correlated (R = .874), as are 1stFlrSF
and TotalBsmtSF (R = .848), GrLivArea and TotRmsAbvGrd (R = .815), and GarageArea and
GarageCars (R = .815).
Keeping only one variable from each of the pairs mentioned above, we get the final list of
variables used to form our regression model:
OverallQual, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, 1stFlrSF, GrLivArea,
FullBath, Fireplaces, GarageArea, WoodDeckSF, OpenPorchSF.
The corresponding Correlation table is given below.
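
The screening in Step 1 can be sketched in Python with pandas. This is only an illustration, not the SPSS procedure used for the report: the function name screen_predictors, the synthetic data and the rule of keeping the member of a correlated pair that correlates more strongly with the target are all assumptions made for the example. The thresholds 0.3 and 0.7 are the ones stated in the text.

```python
import numpy as np
import pandas as pd


def screen_predictors(df, target, min_target_r=0.3, max_pair_r=0.7):
    """Keep predictors correlated with the target, then drop redundant ones."""
    corr = df.corr()  # Pearson correlations between all numeric columns
    target_r = corr[target].drop(target).abs()
    # Candidates ordered so the predictor most correlated with the target comes first.
    candidates = target_r[target_r > min_target_r].sort_values(ascending=False).index
    kept = []
    for name in candidates:
        # Skip a candidate that is highly correlated with a predictor already kept.
        if all(abs(corr.loc[name, other]) <= max_pair_r for other in kept):
            kept.append(name)
    return kept


# Synthetic demonstration data (hypothetical, not the report's data set).
rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 400, n)
rooms = area / 250 + rng.normal(0, 0.6, n)  # strongly tied to area, so redundant
noise = rng.normal(0, 1, n)                 # unrelated to price
price = 100 * area + rng.normal(0, 20000, n)
demo = pd.DataFrame({"area": area, "rooms": rooms, "noise": noise, "price": price})

selected = screen_predictors(demo, "price")
print(selected)
```

In this toy data the unrelated column fails the 0.3 threshold and only one of the two size-related columns survives the 0.7 threshold, which mirrors how the report reduces sixteen candidates to twelve.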

Step 2:
Plotting a normal curve for our dependent variable SalePrice, we found that it is not quite
normally distributed; rather, it is right-skewed. To normalize the variable, we took the base-10
logarithm of the sale price to create a new variable, LogSalePrice, which we found to be more
normally distributed, as shown below.

Our regression model will therefore take this new variable, LogSalePrice, as the dependent
variable.
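
The effect of the log transformation on skewness can be illustrated with a short sketch. The prices below are synthetic, drawn from a log-normal distribution chosen only because it is right-skewed; they are an assumption for the example and not the report's data. The base-10 logarithm matches the Log10 used in the regression equation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices for illustration only (not the report's data).
rng = np.random.default_rng(1)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Base-10 logarithm of the price, giving the transformed dependent variable.
log_sale_price = np.log10(sale_price).rename("LogSalePrice")

# Sample skewness: clearly positive for a right-skewed variable, near zero for a symmetric one.
skew_before = sale_price.skew()
skew_after = log_sale_price.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero after the transformation is the numerical counterpart of the more bell-shaped histogram described above.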
Step 3:

Checking the Normal P-P plot of the regression residuals, we can see that the points deviate
little from the diagonal reference line, as shown in the diagram above. This plot is used here to
test normality.
The graph is also consistent with a linear relationship between the dependent and independent
variables.

Step 4:
Coming to the scatter plot, we can see that the points form a roughly rectangular distribution,
which supports the linearity of the relationship between the predictors and the dependent
variable.
Outliers are taken to be cases with a standardized residual above 3.3 or below -3.3. The graph
shows very few points near or beyond those boundaries relative to our sample size, so outliers
should not be a problem for the regression analysis.
This is also supported by the Casewise Diagnostics table given below, which lists only 14 cases
out of a total of 1085 with a standardized residual beyond ±3; seven of these lie beyond ±3.3.

Casewise Diagnostics (a)

Case Number  Std. Residual  LogSalePrice  Predicted Value  Residual

49           -3.240         5.26          5.4315           -.17625
111          3.035          5.37          5.2060           .16507
205          3.276          5.15          4.9710           .17819
232          3.255          5.57          5.3970           .17705
294          -3.324         4.83          5.0069           -.18080
339          -5.110         4.80          5.0730           -.27793
463          -6.716         4.92          5.2818           -.36532
486          -4.299         5.11          5.3444           -.23383
503          3.814          5.59          5.3858           .20744
564          -3.092         5.03          5.1976           -.16819
590          3.192          5.77          5.5920           .17363
776          -3.183         5.06          5.2338           -.17314
909          -3.391         5.05          5.2336           -.18442
984          -4.705         5.17          5.4232           -.25592

a. Dependent Variable: LogSalePrice
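
The standardized residuals in the table above can be reproduced, to rounding, by dividing each raw residual by the Std. Error of the Estimate (.05439) reported in the Model Summary. The sketch below does this in Python and counts the cases beyond ±3 and beyond ±3.3; the variable names are chosen for the example.

```python
# Raw residuals from the Casewise Diagnostics table, keyed by case number.
residuals = {
    49: -0.17625, 111: 0.16507, 205: 0.17819, 232: 0.17705, 294: -0.18080,
    339: -0.27793, 463: -0.36532, 486: -0.23383, 503: 0.20744, 564: -0.16819,
    590: 0.17363, 776: -0.17314, 909: -0.18442, 984: -0.25592,
}
STD_ERROR_OF_ESTIMATE = 0.05439  # from the Model Summary table

# Standardized residual = raw residual / standard error of the estimate.
std_residuals = {case: r / STD_ERROR_OF_ESTIMATE for case, r in residuals.items()}

# Cases outside the two cut-offs discussed in the text.
beyond_3 = [case for case, z in std_residuals.items() if abs(z) > 3.0]
beyond_3_3 = [case for case, z in std_residuals.items() if abs(z) > 3.3]
print(len(beyond_3), "cases beyond 3;", len(beyond_3_3), "cases beyond 3.3")
```

All 14 listed cases fall beyond ±3, and 7 of them (cases 294, 339, 463, 486, 503, 909 and 984) fall beyond ±3.3, which is well under 1% of the 1085 observations.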

Analysis of the Regression Model:

Now we move forward to evaluating our model that has been created from our regression
analysis and see how effective it is in predicting the outcome, how statistically significant it is
and how accurate it will be in its predictions.
As a first step we look at the Model Summary box, given below.

Model Summary (b)

Model  R         R Square  Adjusted R Square  Std. Error of the Estimate

1      .938 (a)  .879      .878               .05439

a. Predictors: (Constant), OpenPorchSF, WoodDeckSF, BsmtFinSF1, MasVnrArea, Fireplaces,
YearRemodAdd, @1stFlrSF, FullBath, GarageArea, GrLivArea, OverallQual, YearBuilt
b. Dependent Variable: LogSalePrice

Here we look at the R square value, which tells us how much of the variance in the dependent
variable, which is LogSalePrice in this case, is explained by the model.
The R square value here is .879, which, expressed as a percentage, means that the independent
variables in the model account for about 87.9% of the variance in the log-transformed sale price
of houses.

The next step is to assess the statistical significance of our result, whether the model is a true
predictor of the actual outcome happening in the population. For this we need to look at the
ANOVA table as given below.

ANOVA (a)

Model 1     Sum of Squares  df    Mean Square  F        Sig.

Regression  23.043          12    1.920        649.075  .000 (b)
Residual    3.171           1072  .003
Total       26.215          1084

a. Dependent Variable: LogSalePrice
b. Predictors: (Constant), OpenPorchSF, WoodDeckSF, BsmtFinSF1, MasVnrArea, Fireplaces,
YearRemodAdd, @1stFlrSF, FullBath, GarageArea, GrLivArea, OverallQual, YearBuilt

Here we test the null hypothesis that the model has no predictive power, i.e. that all the
regression coefficients are zero.
From the last column of the table above we can see that the significance value, in other words
the P-value, is .000 (to three decimal places), which is less than 0.05.
So we can say that the model is statistically significant, and we reject the null hypothesis.
In other words, the model can significantly predict the price of houses given the values of the
predictor variables.
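
As a consistency check, the R square, adjusted R square, F statistic and standard error of the estimate can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The Python sketch below uses only the standard formulas; small differences from the SPSS figures are expected because the table values are rounded to three decimals.

```python
import math

# Values copied from the ANOVA table.
ss_regression, df_regression = 23.043, 12
ss_residual, df_residual = 3.171, 1072
ss_total, df_total = 26.215, 1084

# Share of the total variation explained by the regression.
r_squared = ss_regression / ss_total
# Adjustment for the number of predictors: 1 - (1 - R^2) * (n - 1) / (n - k - 1).
adjusted_r_squared = 1 - (1 - r_squared) * df_total / df_residual

# F is the ratio of the regression and residual mean squares.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual

# The standard error of the estimate is the square root of the residual mean square.
std_error_of_estimate = math.sqrt(ms_residual)

print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adjusted_r_squared:.3f}")
print(f"F = {f_statistic:.1f}, std. error of estimate = {std_error_of_estimate:.5f}")
```

The recomputed values agree with the Model Summary (.879, .878, .05439) and are within rounding of the reported F of 649.075.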
Finally, we have a look at the Coefficients table of our regression outcome, as given below.

Here we get the actual coefficients associated with each independent variable of our regression
model, as given by the unstandardized B coefficients.

The Standardized Beta coefficients, in the above table, show us how much each independent
variable contributes to the prediction of the dependent variable. Here we find that OverallQual
makes the largest contribution towards predicting the sale price.
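
The link between an unstandardized coefficient and its standardized Beta can be sketched as follows: Beta is the unstandardized coefficient multiplied by the standard deviation of the predictor and divided by the standard deviation of the dependent variable. The example uses the rounded OverallQual coefficient from the regression equation (0.038), its standard deviation from the Descriptive Statistics table, and the total sum of squares from the ANOVA table. Because the coefficient is rounded, the result is only an approximation and is not taken from the report's Coefficients table.

```python
import math

b_overall_qual = 0.038      # unstandardized coefficient from the regression equation (rounded)
sd_overall_qual = 1.319     # Std. Deviation of OverallQual, Descriptive Statistics table
ss_total, n = 26.215, 1085  # total sum of squares of LogSalePrice (ANOVA table), sample size

# Sample standard deviation of LogSalePrice recovered from the total sum of squares.
sd_log_price = math.sqrt(ss_total / (n - 1))

# Standardized Beta = B * sd(predictor) / sd(dependent variable).
beta_overall_qual = b_overall_qual * sd_overall_qual / sd_log_price
print(f"approximate standardized Beta for OverallQual: {beta_overall_qual:.2f}")
```

Under these assumptions, a one standard deviation increase in OverallQual corresponds to roughly a third of a standard deviation increase in LogSalePrice.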

The Significance value in the above table indicates whether each variable makes a statistically
significant contribution to the prediction of the outcome. This is simply the P-value, and a
value of less than 0.05 denotes a statistically significant contribution.

So, from the above table, we finally get our regression equation as given below.

Log10(SalePrice) = 0.038*OverallQual + 0.001*YearBuilt + 0.001*YearRemodAdd
- 1.047*10^-5*MasVnrArea + 5.004*10^-5*BsmtFinSF1
+ 4.962*10^-5*1stFlrSF + 0.0001169*GrLivArea - 0.013*FullBath
+ 0.020*Fireplaces + 8.399*10^-5*GarageArea + 3.229*10^-5*WoodDeckSF
+ 9.217*10^-5*OpenPorchSF + 0.972
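
A minimal Python sketch of how the equation could be applied is given below, taking the small coefficients as multiples of 10^-5. The function name predict_sale_price and the example house are hypothetical. Because the printed coefficients are rounded (the two year coefficients to only three decimals), predictions from this sketch are rough and should not be expected to reproduce the SPSS predicted values.

```python
# Coefficients as printed in the regression equation (rounded in the report).
COEFFICIENTS = {
    "OverallQual": 0.038,
    "YearBuilt": 0.001,
    "YearRemodAdd": 0.001,
    "MasVnrArea": -1.047e-5,
    "BsmtFinSF1": 5.004e-5,
    "1stFlrSF": 4.962e-5,
    "GrLivArea": 0.0001169,
    "FullBath": -0.013,
    "Fireplaces": 0.020,
    "GarageArea": 8.399e-5,
    "WoodDeckSF": 3.229e-5,
    "OpenPorchSF": 9.217e-5,
}
INTERCEPT = 0.972


def predict_sale_price(house):
    """Return the predicted price by undoing the base-10 log of the fitted value."""
    log_price = INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())
    return 10 ** log_price


# Hypothetical house, used only to show the call.
example_house = {
    "OverallQual": 6, "YearBuilt": 1980, "YearRemodAdd": 1990, "MasVnrArea": 100,
    "BsmtFinSF1": 450, "1stFlrSF": 1150, "GrLivArea": 1500, "FullBath": 2,
    "Fireplaces": 1, "GarageArea": 500, "WoodDeckSF": 100, "OpenPorchSF": 50,
}
better_house = dict(example_house, OverallQual=7)

price = predict_sale_price(example_house)
ratio = predict_sale_price(better_house) / price
print(f"one extra OverallQual point multiplies the predicted price by {ratio:.3f}")
```

Since the model is fitted on the logarithm of the price, each coefficient acts multiplicatively: one additional point of OverallQual multiplies the predicted price by 10^0.038, about 1.09, with the other variables held fixed.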
