
Boston Housing Price

Rodrigo Barbosa de Santis

May 21, 2019

1 Introduction
The objective of this work is to develop a machine learning model able to predict Boston house
prices according to known attributes.

2 Materials and methods

2.1 Multi-layered perceptron

The multi-layered feedforward network, also known as multi-layered perceptron (MLP) pro-
vides a general artificial neural network representation, as shown in Figure 1.
The MLP has one or more non-linear layers – known as hidden layers – and can learn a non-
linear function f(·) : R^m → R^o, where m is the number of input dimensions and o is the
number of output dimensions, from a given set of features X = x1, x2, ..., xm and a target y
for regression (Pedregosa et al., 2011).

Figure 1 – General MLP with a single hidden layer (Pedregosa et al., 2011)

The input layer is composed of a set of neurons {xi | x1, x2, ..., xm} representing the input
features. Each neuron in a hidden layer transforms the values from the previous layer with a
weighted linear summation ∑_{i=1}^{m} wi xi, followed by a non-linear activation function
g(·) : R → R set by the user – such as the logistic or hyperbolic tangent function – which
generates the subsequent hidden-layer and output values.
The main advantage of the MLP is its capability to learn non-linear models in real time. On the
other hand, drawbacks of this kind of network include a non-convex loss function in problems
with more than one local minimum, sensitivity to feature scaling and the need for hyper-
parameter tuning.
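The forward computation described above can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions and randomly chosen weights (the function name `mlp_forward` and all values are hypothetical, not part of the original work); it shows a single hidden layer with a hyperbolic tangent activation and a linear output layer, as is usual for regression:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a single-hidden-layer MLP for regression.

    The hidden layer applies a non-linear activation g(.) (here tanh)
    to the weighted linear summation of the inputs; the output layer
    is a plain linear combination.
    """
    h = np.tanh(W1 @ x + b1)   # hidden layer: g(sum_i w_i * x_i + b)
    return W2 @ h + b2         # linear output layer

# Toy dimensions: m = 3 inputs, 4 hidden units, o = 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

y_hat = mlp_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(y_hat.shape)  # (1,)
```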

2.2 Boston housing dataset

The Boston housing dataset contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Massachusetts. It comprises 506 samples, each described by 13
numerical attributes and one continuous target: the median value of owner-occupied homes.
The dataset attributes are the following:
– x1: per capita crime rate by town
– x2: proportion of residential land zoned for lots over 25,000 sq. ft.
– x3: proportion of non-retail business acres per town
– x4: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
– x5: nitric oxides concentration (parts per 10 million)
– x6: average number of rooms per dwelling
– x7: proportion of owner-occupied units built prior to 1940
– x8: weighted distances to five Boston employment centres
– x9: index of accessibility to radial highways
– x10: full-value property-tax rate per $10,000
– x11: pupil-teacher ratio by town
– x12: 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town
– x13: % lower status of the population
– y: median value of owner-occupied homes in $1000's

2.3 Cross-validation

Cross-validation is an important technique used to estimate how accurately a model will perform
on new data, avoiding a common problem known as overfitting, in which a particular model
becomes excessively complex and fails to generalize beyond the training set.
K-fold cross-validation, one of the most widely applied methods along with grid search, randomly
splits a dataset D into k mutually exclusive subsets (or folds) of approximately equal size. Figure 2
exemplifies the iterative procedure of cross-validation through division of the data into k subsets.
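The fold division illustrated in Figure 2 can be reproduced with scikit-learn's `KFold`. A minimal sketch on a toy dataset of 20 samples (the array `X` and the `random_state` value are illustrative, not part of the original work):

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative 5-fold split of a small dataset of 20 samples.
X = np.arange(20).reshape(20, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Each fold holds out 20 / 5 = 4 samples for validation;
    # the remaining 16 are used for training.
    print(f"k={fold}: train={len(train_idx)}, validation={len(val_idx)}")
```

Each sample appears in the validation set exactly once across the k iterations, which is what makes the resulting error estimate less sensitive to a single lucky (or unlucky) split.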

Figure 2 – Division of the dataset into K = 5 folds

2.4 Performance metrics

To assess the regression performance of the method, the mean squared error (MSE) evaluation
metric was applied, which computes the average squared difference between the predicted and
desired values, given by (Pedregosa et al., 2011):

MSE(y, ŷ) = (1/N) ∑_{i=0}^{N−1} (y_i − ŷ_i)²

where N is the number of samples, ŷ_i is the estimated target output and y_i is the corresponding
(correct) target output.
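The formula above translates directly into NumPy (the function name `mse` and the sample values are illustrative only):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/N) * sum_{i=0}^{N-1} (y_i - y_hat_i)^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

print(mse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # 0.375
```

The same quantity is available in scikit-learn as `sklearn.metrics.mean_squared_error`.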

3 Development
The dataset was randomly split into two subsets, with 80% designated for training and 20% for
testing the model. The network was trained using the K-fold cross-validation technique, with
K = 5, adopting the parameters in Table 1 for the estimator.

Table 1 – Parameters adopted for the learning estimator

Estimator   Parameter        Value
MLP         Alpha (α)        10^−4
            Activation (ρ)   relu
            Hidden layers    [12]

The algorithm was implemented using Python 2.7.8 (Van Rossum, 1998), including the following
libraries:
1. Matplotlib (http://matplotlib.org/) – a library that provides a group of 2D chart and image
functions;
2. NumPy (http://www.numpy.org/) – a large set of functions for array manipulation;
3. Scikit-learn (http://scikit-learn.org/) – a machine learning library in Python.
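The training procedure described in this section – an 80/20 split, feature scaling, 5-fold cross-validation and the Table 1 parameters – can be sketched as follows. Note this is a reconstruction under stated assumptions, not the original script: a synthetic 13-feature regression problem from `make_regression` stands in for the Boston data (the classic `load_boston` loader has been removed from recent scikit-learn releases), and `max_iter` and all `random_state` values are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data: 506 samples, 13 features,
# one continuous target.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0,
                       random_state=42)

# 80% / 20% train-test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# MLPs are sensitive to feature scaling (Section 2.1), hence the
# StandardScaler step before the estimator with Table 1 parameters.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(12,), activation="relu",
                 alpha=1e-4, max_iter=2000, random_state=42),
)

# 5-fold cross-validation on the training set (scikit-learn reports
# the negated MSE, by convention).
scores = cross_val_score(model, X_train, y_train, cv=5,
                         scoring="neg_mean_squared_error")
print("CV MSE per fold:", -scores)

# Final fit and held-out test error.
model.fit(X_train, y_train)
print("Test MSE:", np.mean((model.predict(X_test) - y_test) ** 2))
```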

4 Results
The mean squared errors achieved in training and testing sets were 0.755 and 1.965, respectively.

Figure 3 – Comparison between predicted and desired values.

5 Conclusion
The MLP regressor was able to provide a reasonable generalization model for predicting housing
prices from the dataset.

References
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M. & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, vol. 12, pp. 2825-2830.

Principe, J. C., Euliano, N. R. & Lefebvre, W. C. (1999). Neural and adaptive systems: funda-
mentals through simulations with CD-ROM. John Wiley & Sons, Inc.

Van Rossum, G. (1998). Python: a computer language. Version 2.7.8. Amsterdam, Stichting
Mathematisch Centrum. (http://www.python.org).
