Introduction:
The data to be analyzed has been collected for the purpose of predicting the price of houses in
Boston based on various features related to the individual houses.
This report seeks to examine the influence of several attributes on the prices of housing, in an
attempt to discover the most suitable explanatory variables. Here we will evaluate the
performance and predictive power of a regression model that has been trained and tested on data
collected on houses in Boston, Massachusetts. A model trained on this data can then be used to make
certain predictions about a home, in particular its monetary value. Such a model would be
valuable to someone like a real estate agent, who could make use of such information on a
daily basis. The SPSS analytical software will be used to conduct this analysis.
Variable Definition:
Dependent Variable:
TOTAL: In total, we have 1085 observations to analyze and use to fit a regression model.
Independent Variables:
MiscVal: Value of miscellaneous features (elevator, 2nd garage, tennis court, etc.)
Assumptions:
1. The larger the sample size, the better the analysis. As a rule of thumb, there should be at
least 15 observations for each independent variable.
2. The dependent variable is normally distributed.
3. There is no multicollinearity among the predictor variables, i.e. there are no redundant
predictor variables (no pair of predictors with a correlation coefficient R above .7).
4. There are no outliers in the sample data set, as multiple regression is very sensitive to
outliers, high or low.
5. The dependent and independent variables are linearly related.
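Assumptions 1 and 3 can be checked numerically before fitting. The following is a minimal Python
sketch, not the SPSS procedure used in this report; the helper names, the third predictor
(HypotheticalArea) and all data values are illustrative assumptions.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def enough_samples(n_obs, n_predictors, per_predictor=15):
    """Assumption 1: at least `per_predictor` observations per predictor."""
    return n_obs >= per_predictor * n_predictors

def collinear_pairs(predictors, threshold=0.7):
    """Assumption 3: list predictor pairs whose |R| exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Illustrative values only; not taken from the report's data set.
predictors = {
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "HypotheticalArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "MiscVal": [0, 0, 0, 0, 0, 700],
}
print(enough_samples(n_obs=1085, n_predictors=len(predictors)))
print(collinear_pairs(predictors))
```

Any pair reported by collinear_pairs would be a candidate for dropping one of the two predictors.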
Step 2:
Plotting a normal curve for our dependent variable SalePrice, we found that it is not quite
normally distributed; rather, it is right-skewed. To normalize the variable, we took the logarithm
of the sale price to create a new variable, LogSalePrice, which we found to be more nearly normally
distributed, as shown below.
Our regression model will therefore take this new variable, LogSalePrice, as the dependent
variable.
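The effect of the transformation can be illustrated with a skewness calculation. This is a hedged
Python sketch rather than the SPSS step itself: the prices are made-up illustrative values, and the
natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math

def skewness(values):
    """Sample skewness (third standardized moment); 0 for a symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative right-skewed prices; not the report's data.
sale_price = [105000, 118000, 129500, 140000, 143000,
              160000, 181500, 208500, 250000, 755000]

# Natural log is an assumption; the report does not state the base used.
log_sale_price = [math.log(p) for p in sale_price]

print(round(skewness(sale_price), 2), round(skewness(log_sale_price), 2))
```

A positive skewness indicates a right-skewed distribution; the log-transformed values should show a
smaller figure than the raw prices.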
Step 3:
Checking the Normal P-P plot of the regression standardized residuals, we see little deviation
from the diagonal reference line, as shown in the above diagram. This plot tests the normality of
the residuals.
The graph above shows the linear relationship that exists between the dependent and
independent variables.
Step 4:
Coming to the scatter plot diagram, we can see that the points form a roughly rectangular
distribution, which confirms the linearity of the relationship between the predictor and the
dependent variables.
Outliers are cases whose standardized residual lies beyond +/- 3.3 (above 3.3 or below -3.3). The
graph shows very few cases near or beyond those boundaries relative to our sample size, so outliers
should not be a problem when performing the regression analysis.
This can also be confirmed from the Casewise Diagnostics table given below, where we find only
14 such cases out of a total of 1085 (about 1.3%).
Casewise Diagnostics
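The casewise screening described above can be sketched in Python as follows. This is an
illustration under assumptions, not the SPSS output: the function names are hypothetical, the data
are made up, and residuals are standardized by dividing by the standard error of the estimate.

```python
import math

def standardized_residuals(y_true, y_pred, n_predictors):
    """Residuals divided by the standard error of the estimate."""
    residuals = [a - p for a, p in zip(y_true, y_pred)]
    dof = len(residuals) - n_predictors - 1
    std_error = math.sqrt(sum(r ** 2 for r in residuals) / dof)
    return [r / std_error for r in residuals]

def flag_outliers(std_residuals, limit=3.3):
    """Return (case index, standardized residual) pairs beyond +/- limit."""
    return [(i, z) for i, z in enumerate(std_residuals) if abs(z) > limit]

# Illustrative data: 19 well-fitted cases and one poorly fitted case.
y_pred = [12.0] * 20
y_true = [12.0 + (0.05 if i % 2 == 0 else -0.05) for i in range(19)] + [13.0]

z_scores = standardized_residuals(y_true, y_pred, n_predictors=1)
flagged = flag_outliers(z_scores)
print([i for i, _ in flagged])
```

The share of flagged cases (here one out of twenty) can then be compared with the sample size to
judge whether outliers are likely to distort the fit.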
Now we move on to evaluating the model produced by our regression analysis, to see how effective
it is in predicting the outcome, how statistically significant it is, and how accurate its
predictions will be.
As a first step we look at the Model Summary box, given below.
Model Summary
Here we look at the R square value, which tells us how much of the variance in the dependent
variable, which is the (log-transformed) Sale Price in this case, is explained by the model.
The R square value here is .879, which, expressed as a percentage, means that the independent
variables together account for about 87.9% of the variance in the Sale Price of houses.
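For reference, R square is one minus the ratio of the residual sum of squares to the total sum of
squares. A minimal Python sketch of that definition follows; the function name and the sample
values are illustrative assumptions and do not reproduce the SPSS Model Summary.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)             # total
    return 1 - ss_res / ss_tot

# Illustrative observed and predicted values; not the report's data.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed, predicted)
print(f"R square = {r2:.3f}, i.e. {r2:.1%} of the variance explained")
```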
The next step is to assess the statistical significance of our result, whether the model is a true
predictor of the actual outcome happening in the population. For this we need to look at the
ANOVA table as given below.
ANOVA
Here we can test the null hypothesis that the model has no predictive power, i.e. that it explains
none of the variance in the outcome.
From the last column of the table above, we can see that the significance value, in other words
the p-value, is .000 (to three decimal places), i.e. p < .001, which is less than 0.05.
So the model is statistically significant and we reject the null hypothesis.
In other words, the model significantly predicts the price of houses given the values of the
predictor variables.
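The F statistic behind this test can be derived from R square, the sample size and the number of
predictors. The sketch below is a hedged Python illustration: R square (.879) and the sample size
(1085) are the figures quoted in this report, while the number of predictors (10) is a hypothetical
placeholder. The p-value is the upper tail of the F distribution with (k, n - k - 1) degrees of
freedom, which could be obtained with, for example, scipy.stats.f.sf if SciPy is available.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic of a regression, computed from R square."""
    df_model = n_predictors
    df_residual = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_residual)

# R square and sample size are the values quoted in this report;
# the number of predictors (10) is a hypothetical placeholder.
f_value = f_statistic(r_squared=0.879, n_obs=1085, n_predictors=10)
print(round(f_value, 1))
```

A large F value relative to its degrees of freedom corresponds to a very small p-value, which is
what the .000 entry in the ANOVA table reflects.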
Finally, we have a look at the Coefficients table of our regression outcome, as given below.
Here we get the actual coefficients associated with each independent variable of our regression
model, as given by the unstandardized coefficients (B).
The Standardized Beta coefficients, in the above table, show us how much each independent
variable contributes to the prediction of the dependent variable. Here we find that OverallQual
makes the largest contribution towards predicting the Sale Price.
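A standardized beta rescales an unstandardized coefficient by the ratio of the predictor's standard
deviation to that of the dependent variable, which puts predictors measured in different units on a
common scale. The Python sketch below illustrates this under assumptions: the data and the
coefficients are made up for illustration and are not the values from the Coefficients table.

```python
from statistics import stdev

def standardized_beta(b, x, y):
    """Convert an unstandardized coefficient b for predictor x into a standardized beta."""
    return b * stdev(x) / stdev(y)

# Illustrative data and coefficients only; not the report's values.
log_price = [11.6, 11.8, 12.0, 12.1, 12.3, 12.6]
predictors = {
    # name: (observed values, hypothetical unstandardized coefficient)
    "OverallQual": ([5, 6, 6, 7, 8, 9], 0.15),
    "MiscVal": ([0, 0, 500, 0, 0, 700], 0.0001),
}

betas = {name: standardized_beta(b, x, log_price)
         for name, (x, b) in predictors.items()}

# Rank predictors by the absolute size of their standardized beta.
ranked = sorted(betas, key=lambda name: abs(betas[name]), reverse=True)
print(ranked)
```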
The Significance value in the above table indicates whether each variable makes a statistically
significant contribution towards the prediction of the outcome. This is simply the p-value, and a
value less than 0.05 indicates a statistically significant contribution.
So, from the above table, we finally get our regression equation as given below.
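An equation of this form predicts LogSalePrice as an intercept plus a weighted sum of the
predictors. The Python sketch below shows how such an equation might be applied and converted back
to a price. The intercept and coefficients are placeholders, not the fitted values from this
analysis, and the back-transformation assumes a natural logarithm.

```python
import math

def predict_log_price(intercept, coefficients, features):
    """Evaluate intercept + sum(b_i * x_i) for one house."""
    return intercept + sum(coefficients[name] * features[name]
                           for name in coefficients)

# Placeholder numbers for illustration; not the report's fitted coefficients.
intercept = 10.5
coefficients = {"OverallQual": 0.15, "MiscVal": 0.0001}
house = {"OverallQual": 7, "MiscVal": 0}

log_price = predict_log_price(intercept, coefficients, house)
price = math.exp(log_price)  # assumes LogSalePrice is a natural log
print(round(log_price, 2), round(price))
```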