
Robust geographically weighted regression with least absolute deviation method in case of poverty in Java Island

Rawyanil Afifah, Yudhie Andriyana, and I. G. N. Mindra Jaya

Citation: AIP Conference Proceedings 1827, 020023 (2017); doi: 10.1063/1.4979439


View online: http://dx.doi.org/10.1063/1.4979439
View Table of Contents: http://aip.scitation.org/toc/apc/1827/1
Published by the American Institute of Physics
Rawyanil Afifah a), Yudhie Andriyana b), and I G N Mindra Jaya c)

Department of Statistics, Universitas Padjadjaran, Bandung 40132

a) Corresponding author: rawy.afifah@gmail.com
b) y.andriyana@unpad.ac.id
c) jay.komang@gmail.com

Abstract. Geographically Weighted Regression (GWR) is an extension of Ordinary Least Squares (OLS) regression that
is effective for modeling spatially non-stationary data. In a GWR model the regression parameters are estimated locally,
so each observation has its own regression coefficients. Parameter estimation in GWR uses Weighted Least Squares (WLS).
However, when the data contain outliers, WLS produces inefficient estimators. Hence, this study uses a robust method,
Least Absolute Deviation (LAD), to estimate the parameters of the GWR model in the case of poverty in Java Island.
The study concludes that the GWR model with the LAD method performs better.

INTRODUCTION

Poverty is a multidimensional problem faced by most countries, including Indonesia. Poverty is defined as
the condition of an individual who is unable to meet minimum levels of income, food, clothing, healthcare, shelter,
and other essentials [5]. In Indonesia, the poverty rate is published by Statistics Indonesia (Badan Pusat Statistik,
BPS), the country's official statistics agency. BPS uses the basic needs approach to measure poverty: poverty is
seen as an economic inability to meet basic food and non-food needs, measured from the expenditure
side [2].
Poverty rate data are distributed spatially. According to Anselin [1], spatial data exhibit two effects: spatial
dependency and spatial heterogeneity. Spatial dependency means that observations at one location depend
on observations at other locations. Observations at adjacent locations tend to have similar characteristics,
and become increasingly different as the distance between them grows. Spatial heterogeneity shows up as a
different influence of the explanatory variables on the response variable at each location. Under spatial
heterogeneity, the homoscedasticity assumption of the classical regression model is not fulfilled: the
variance of the model is no longer constant but differs across observations. Therefore, the regression model was
extended to allow the variance to differ between locations by making the regression coefficients local,
so that each location has its own regression coefficients [3]. A method that accommodates
local regression coefficients is Geographically Weighted Regression (GWR).
GWR uses a point approach in which the parameters of the regression model are estimated at each
geographic location. The regression coefficients are estimated with the Weighted Least Squares (WLS) method,
which minimizes the weighted sum of squared errors. This procedure is known to be
sensitive to outliers, so a method that is robust to outliers is needed. One such method is
Least Absolute Deviation (LAD). LAD reduces the influence of outliers without requiring the outlying
observations to be identified first.

Statistics and its Applications


AIP Conf. Proc. 1827, 020023-1–020023-9; doi: 10.1063/1.4979439
Published by AIP Publishing. 978-0-7354-1495-2/$30.00
BASIC CONCEPT

Linear Regression Analysis


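Regression analysis is a model that describes the relationship between a response variable and explanatory variables.
Mathematically, the linear regression model is given by:

$$ y_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i $$

where $y_i$ is the response variable of the $i$th observation; $\beta_0, \beta_1, \ldots, \beta_K$ are the regression
coefficients to be estimated, also called parameters; $x_{i1}, x_{i2}, \ldots, x_{iK}$ are the explanatory variables;
$\varepsilon_i$ $(i = 1, 2, \ldots, n)$ are independent random errors with mean zero and variance $\sigma^2$; $n$ is the
number of observations; and $K$ is the number of explanatory variables. The model can be expressed in matrix form as:

$$ Y = X\beta + \varepsilon $$

where

$$ Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1K} \\ 1 & x_{21} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nK} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{pmatrix}, \quad \text{and} \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$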

The regression coefficients are estimated with the Ordinary Least Squares (OLS) procedure by
minimizing:

$$ \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{k=1}^{K} \beta_k x_{ik} \right)^2 $$

The OLS estimator of $\beta$ is then:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$

where $\hat{\beta}$ is the $(K+1) \times 1$ vector of estimated regression coefficients; $X$ is the $n \times (K+1)$
matrix of explanatory variables whose first column consists of ones; and $Y$ is the $n \times 1$ vector of the
response variable. In the rest of this paper, this linear regression model is called the global model.
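
As an illustration only, the OLS estimator of the global model can be sketched in Python with NumPy (the study itself works in R). The function name `ols_fit` and the toy data are assumptions made for this example. The sketch calls `numpy.linalg.lstsq` rather than forming the inverse of $X^T X$ explicitly, which gives the same solution when $X$ has full column rank.

```python
import numpy as np

def ols_fit(x, y):
    """Return the OLS coefficients (intercept first) for y regressed on x.

    x: (n, K) array of explanatory variables; y: (n,) response vector.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the column of ones so that beta[0] is the intercept.
    design = np.column_stack([np.ones(len(y)), x])
    # Least-squares solution of design @ beta = y.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Toy data (hypothetical): y = 1 + 2*x1 - 3*x2 with no noise.
x_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y_demo = 1.0 + 2.0 * x_demo[:, 0] - 3.0 * x_demo[:, 1]
beta_demo = ols_fit(x_demo, y_demo)
```

Because the toy response is an exact linear function of the two columns, the fitted coefficients reproduce the intercept and slopes used to generate it.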

Testing the Spatial Heterogeneity


Estimating the global model with OLS depends on several assumptions that have to be fulfilled: normality,
no autocorrelation, homoscedasticity, and no multicollinearity. All of them should be
fulfilled to obtain the Best Linear Unbiased Estimator (BLUE). If any of these assumptions is not fulfilled, the
resulting estimator is no longer BLUE.
In spatial data, the assumptions of no autocorrelation and homoscedasticity can rarely be fulfilled because
of the spatial effects. The dependency effect violates the assumption of
no autocorrelation because the observations, which in this context are locations, depend on each other. The other
effect, heterogeneity, violates the assumption of homoscedasticity because the variances of the model are no
longer homogeneous: they differ depending on the location.
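To test for spatial heterogeneity, we can use the Breusch-Pagan (BP) test [1]. The hypotheses are expressed as
follows:

$$ H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \sigma^2 $$
$$ H_1: \text{at least one } i \text{ where } \sigma_i^2 \neq \sigma^2 $$

and the test statistic is formulated as below:

$$ BP = \frac{1}{2} f^T Z (Z^T Z)^{-1} Z^T f \sim \chi^2_{(K)} $$

where

$$ f = (f_1, f_2, \ldots, f_n)^T \quad \text{and} \quad f_i = \frac{e_i^2}{\hat{\sigma}^2} - 1, $$

with $e_i$ the residual of the $i$th observation from the global model, $\hat{\sigma}^2$ the estimated residual
variance, and $Z$ a matrix that contains the vectors of explanatory variables at each observation.

The null hypothesis is rejected if $BP > \chi^2_{(\alpha, K)}$, where $K$ is the number of explanatory variables.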
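
A minimal Python sketch of the BP statistic follows, assuming the form $BP = \tfrac{1}{2} f^T Z (Z^T Z)^{-1} Z^T f$ with $f_i = e_i^2/\hat{\sigma}^2 - 1$ and $\hat{\sigma}^2 = \sum_i e_i^2 / n$. The function name `breusch_pagan_stat`, the choice of variance estimator, and the toy residuals are assumptions for illustration. The statistic would then be compared with a chi-square quantile whose degrees of freedom equal the number of non-constant columns of $Z$.

```python
import numpy as np

def breusch_pagan_stat(residuals, z):
    """Breusch-Pagan statistic 0.5 * f' Z (Z'Z)^{-1} Z' f.

    residuals: (n,) residuals e_i of the global model.
    z: (n,) or (n, K) variables assumed to drive the variance (no constant column).
    """
    e = np.asarray(residuals, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(e), -1)
    # Assumption: sigma^2 is estimated by the mean squared residual.
    sigma2 = np.mean(e ** 2)
    f = e ** 2 / sigma2 - 1.0
    # Add the constant column, then project f onto the columns of Z.
    z_full = np.column_stack([np.ones(len(e)), z])
    coef, *_ = np.linalg.lstsq(z_full, f, rcond=None)
    fitted = z_full @ coef
    # f' P f equals f' (P f) because P is a projection matrix.
    return 0.5 * float(f @ fitted)

# Hypothetical residuals whose squares grow linearly with z.
z_demo = np.array([1.0, 2.0, 3.0, 4.0])
e_demo = np.sqrt(z_demo)
bp_demo = breusch_pagan_stat(e_demo, z_demo)

# Residuals of constant magnitude carry no sign of heteroscedasticity.
bp_flat = breusch_pagan_stat(np.array([1.0, -1.0, 1.0, -1.0]), z_demo)
```

In the first toy case the squared residuals are a linear function of `z_demo`, so the projection reproduces `f` and the statistic is strictly positive. In the second case every `f_i` is zero and so is the statistic.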

Detecting Outliers
Least squares is the most common method for estimating the parameters of a regression model. It chooses
the parameters that make the sum of squared errors as small as possible. This becomes a problem when there are
outliers in the data, because outliers may produce large errors, and the influence of these large errors is
amplified when they are squared. Therefore, it is important to check whether the data contain outliers.
The boxplot is a common tool for detecting outliers. It uses the interquartile
range (IQR), defined as the difference between the third quartile and the first quartile:

$$ IQR = Q_3 - Q_1 $$

Outliers are observations that satisfy one of the following conditions:
a. The value is less than $Q_1 - 1.5 \times IQR$, or
b. The value is more than $Q_3 + 1.5 \times IQR$.
In exploratory spatial data analysis, Anselin [15] transforms the boxplot into a map called a boxmap, which
visualizes the outliers spatially.
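
The boxplot rule can be sketched as follows. This is an illustrative example only: the function name `iqr_outliers`, the multiplier of 1.5 (the conventional boxplot hinge), and the toy values are assumptions, and the quartiles use NumPy's default linear interpolation, which may differ slightly from other software.

```python
import numpy as np

def iqr_outliers(values, hinge=1.5):
    """Flag values outside [Q1 - hinge*IQR, Q3 + hinge*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - hinge * iqr
    upper = q3 + hinge * iqr
    mask = (values < lower) | (values > upper)
    return mask, lower, upper

# Hypothetical residuals with one gross outlier.
residuals_demo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
mask_demo, lower_demo, upper_demo = iqr_outliers(residuals_demo)
```

A boxmap then colors each location by the class its value falls into, so the flagged entries of `mask_demo` would correspond to the highlighted areas on the map.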

Geographically Weighted Regression (GWR)


There are many ways to deal with the heterogeneity effect that causes the homoscedasticity
assumption to fail. In the case of poverty, however, we need models that describe it locally, for each location,
because addressing the poverty problem locally is easier than addressing it globally. Therefore, in this case, we
use models that estimate the parameters locally.
Fotheringham [3] introduced a model that can estimate the regression parameters locally. This model is called
Geographically Weighted Regression (GWR). The GWR model can be written as follows:

$$ y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i) x_{ik} + \varepsilon_i \qquad (1) $$

where $y_i$ is the observed value of the response variable; $(u_i, v_i)$ is the spatial coordinate of the $i$th
location; $x_{i1}, x_{i2}, \ldots, x_{iK}$ are observations of the explanatory variables $X_1, \ldots, X_K$ at
$(u_i, v_i)$; $\beta_k(u_i, v_i)$ $(k = 1, 2, \ldots, K)$ are unknown regression coefficients to be estimated; and
$\varepsilon_i$ $(i = 1, 2, \ldots, n)$ are independent random errors with mean zero and variance $\sigma^2$.
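The model is estimated with the Weighted Least Squares (WLS) method. The procedure is to minimize:

$$ \sum_{j=1}^{n} w_{ij} \left( y_j - \beta_0(u_i, v_i) - \sum_{k=1}^{K} \beta_k(u_i, v_i) x_{jk} \right)^2 $$

where $w_{ij}$ is the weight of location $j$ in estimating the parameters at location $i$. In matrix form, the
criterion can be written as:

$$ \left( Y - X\beta(u_i, v_i) \right)^T W(u_i, v_i) \left( Y - X\beta(u_i, v_i) \right) $$

The regression coefficients are estimated with:

$$ \hat{\beta}(u_i, v_i) = \left( X^T W(u_i, v_i) X \right)^{-1} X^T W(u_i, v_i) Y $$

The weighting matrix describes the influence of neighboring observations through their distances to the location
being estimated. The weight increases as the distance decreases. The weighting matrix is:

$$ W(u_i, v_i) = \mathrm{diag}(w_{i1}, w_{i2}, \ldots, w_{in}) $$
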
The values of the weighting matrix can be computed in several ways, most commonly with a kernel function.
The kernel is a function that weights observations according to the distances among locations. Several
kernel functions are in common use; one of them is the Gaussian kernel, which computes the weight as
follows:

$$ w_{ij} = \exp\left( -\frac{1}{2} \left( \frac{d_{ij}}{h} \right)^2 \right) $$

where $d_{ij} = \sqrt{(u_i - u_j)^2 + (v_i - v_j)^2}$ is the Euclidean distance between location $i$ and location
$j$, and $h$ is a non-negative bandwidth.
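
The local estimation at a single location can be sketched in Python as below, assuming the Gaussian kernel $w_{ij} = \exp(-\tfrac{1}{2}(d_{ij}/h)^2)$ and the WLS estimator $(X^T W X)^{-1} X^T W Y$. The function names `gaussian_weights` and `gwr_local_fit` and the toy coordinates are hypothetical. The study itself uses the GWmodel package in R, so this is an illustration rather than the authors' code.

```python
import numpy as np

def gaussian_weights(coords, i, h):
    """Gaussian kernel weights w_ij = exp(-0.5 * (d_ij / h)**2) for location i."""
    coords = np.asarray(coords, dtype=float)
    # Euclidean distances from location i to every location.
    d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
    return np.exp(-0.5 * (d / h) ** 2)

def gwr_local_fit(x, y, coords, i, h):
    """Local WLS coefficients (X' W X)^{-1} X' W y at location i, intercept first."""
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    w = gaussian_weights(coords, i, h)
    xtw = design.T * w  # same result as design.T @ np.diag(w)
    return np.linalg.solve(xtw @ design, xtw @ np.asarray(y, dtype=float))

# Hypothetical toy data: five locations on a line, y = 2 + 3*x exactly.
coords_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
x_demo = np.array([[1.0], [2.0], [4.0], [3.0], [5.0]])
y_demo = 2.0 + 3.0 * x_demo[:, 0]
w_demo = gaussian_weights(coords_demo, 0, h=1.0)
beta_demo = gwr_local_fit(x_demo, y_demo, coords_demo, 0, h=1.0)
```

Repeating `gwr_local_fit` for every index `i` yields one coefficient vector per location, which is what distinguishes GWR from the global model.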
The bandwidth is important in GWR because it determines which neighbors are included in the parameter estimation
at a location. The bandwidth controls how closely the fitted surface follows the data and how smooth it is. The
larger the bandwidth, the smoother the resulting model. If the model is too smooth, however, it can no longer be
distinguished from the global model. Choosing the optimum bandwidth is therefore very important to obtain the
best model. One criterion for choosing the optimum bandwidth is Cross-Validation (CV), which is
formulated as follows:

$$ CV(h) = \sum_{i=1}^{n} \left( y_i - \hat{y}_{\neq i}(h) \right)^2 \qquad (2) $$

where $\hat{y}_{\neq i}(h)$ is the estimated value of $y_i$ when the $i$th location is omitted from the model. The
bandwidth is optimum when the CV score is at its minimum [3].
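
A sketch of bandwidth selection by cross-validation follows, assuming a Gaussian kernel, a leave-one-out score equal to the sum of squared prediction errors, and a simple search over a list of candidate bandwidths. The function names, the candidate values, and the toy data are hypothetical. Packages such as GWmodel use their own optimized search, which is not reproduced here.

```python
import numpy as np

def gwr_cv_score(x, y, coords, h):
    """Leave-one-out CV score: sum_i (y_i - yhat_{-i}(h))**2 for bandwidth h."""
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    score = 0.0
    for i in range(len(y)):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / h) ** 2)  # Gaussian kernel weights
        w[i] = 0.0                       # omit location i from its own fit
        xtw = design.T * w
        beta_i = np.linalg.solve(xtw @ design, xtw @ y)
        score += (y[i] - design[i] @ beta_i) ** 2
    return score

def best_bandwidth(x, y, coords, candidates):
    """Return the candidate bandwidth with the smallest CV score, plus all scores."""
    scores = [gwr_cv_score(x, y, coords, h) for h in candidates]
    return candidates[int(np.argmin(scores))], scores

# Hypothetical toy data on a 3 x 3 grid; the slope drifts with the u coordinate.
coords_demo = np.array([[u, v] for u in range(3) for v in range(3)], dtype=float)
x_demo = np.array([[1.0], [2.0], [4.0], [3.0], [5.0], [7.0], [6.0], [9.0], [8.0]])
y_demo = 1.0 + (1.0 + 0.5 * coords_demo[:, 0]) * x_demo[:, 0]
h_demo, scores_demo = best_bandwidth(x_demo, y_demo, coords_demo, [0.8, 1.5, 3.0, 10.0])
```

A grid of candidates keeps the example short. In practice the score is usually minimized with a one-dimensional search, and very small bandwidths can make the local system nearly singular.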

Geographically Weighted Regression Using Least Absolute Deviation (Robust GWR)


Estimation with the least squares procedure is sensitive to the presence of outliers. Outliers can produce
large residuals that inflate the variance, so that confidence intervals become wider. As a result,
the estimated regression coefficients may no longer be consistent. Hence, we need an alternative estimation
method that is more robust, that is, one whose final estimates are less influenced by outliers.
One method that is more robust than least squares is Least Absolute Deviation (LAD). LAD is well suited to
longer-tailed error distributions, such as the Laplace or Cauchy distribution [6]. The concept of this method is no
more difficult than that of least squares: LAD simply replaces the squared error $\varepsilon_i^2$ of least squares
with the absolute error $|\varepsilon_i|$. The actual calculation, however, is considerably more complicated. LAD
parameter estimation minimizes:

$$ \sum_{i=1}^{n} |\varepsilon_i| = \sum_{i=1}^{n} \left| y_i - \beta_0 - \sum_{k=1}^{K} \beta_k x_{ik} \right| \qquad (3) $$

Whereas least squares has a closed-form estimator obtained by differentiating the loss function, the LAD method
has no closed-form estimator because eq. (3) is not differentiable. The LAD estimates are therefore
computed with an algorithm. In 1955, Charnes, Cooper, and Ferguson
showed the equivalence between LAD and a linear programming problem [6]. In 1959, Wagner suggested that the
LAD problem can be solved through its dual, which can be reduced to a problem with a smaller
basis.
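In the context of GWR, the parameters of eq. (1) at a certain location $(u_i, v_i)$, with $w_{ij}$ denoting the
weights at that location, can be estimated with the LAD method by adapting eq. (3) as follows:

Minimize
$$ \sum_{j=1}^{n} w_{ij} |\varepsilon_j| $$
subject to
$$ y_j = \beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i) x_{jk} + \varepsilon_j, \quad j = 1, 2, \ldots, n $$

Let

$$ \varepsilon_j^+ = \max(\varepsilon_j, 0), \qquad \varepsilon_j^- = \max(-\varepsilon_j, 0) $$

and the decomposition of the residual is

$$ \varepsilon_j = \varepsilon_j^+ - \varepsilon_j^- $$

Then

$$ |\varepsilon_j| = \varepsilon_j^+ + \varepsilon_j^- $$

and

$$ \varepsilon_j^+ \geq 0, \qquad \varepsilon_j^- \geq 0 $$

The linear program that solves the LAD problem is as follows:

Minimize
$$ \sum_{j=1}^{n} w_{ij} \left( \varepsilon_j^+ + \varepsilon_j^- \right) $$

Subject to
$$ y_j = \beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i) x_{jk} + \varepsilon_j^+ - \varepsilon_j^-, \quad j = 1, 2, \ldots, n $$
$$ \varepsilon_j^+ \geq 0, \quad \varepsilon_j^- \geq 0 $$
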

Technically, the RGWR model is estimated with the rq function of the quantreg package in R.
The estimate at location $(u_i, v_i)$ is obtained by passing the matrix $X$ as the
covariates and $(w_{i1}, w_{i2}, \ldots, w_{in})$ as the weighting vector. The process is repeated for
each location $(u_i, v_i)$, where $i = 1, 2, \ldots, n$.
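
As an illustration of the linear programming route, the sketch below solves a weighted LAD problem directly with `scipy.optimize.linprog` instead of calling rq in R. It assumes the usual split of each residual into non-negative positive and negative parts. The function name `weighted_lad_fit`, the weights, and the toy data with a single gross outlier are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_lad_fit(x, y, w):
    """Weighted LAD coefficients (intercept first) via linear programming."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    n, p = design.shape
    # Decision vector: [beta (p, free), e_plus (n, >= 0), e_minus (n, >= 0)].
    cost = np.concatenate([np.zeros(p), w, w])
    # Equality constraints: design @ beta + e_plus - e_minus = y.
    a_eq = np.hstack([design, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=a_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]

# Hypothetical data: y = 1 + 2*x, except the last response is a gross outlier.
x_demo = np.arange(6, dtype=float).reshape(-1, 1)
y_demo = 1.0 + 2.0 * x_demo[:, 0]
y_demo[5] = 100.0
w_demo = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])  # stand-in for kernel weights
beta_lad = weighted_lad_fit(x_demo, y_demo, w_demo)
```

In this toy case the LAD fit stays on the line through the five clean points, whereas a least squares fit would be pulled toward the outlier. For RGWR, `w_demo` would be replaced by the kernel weights of the location being estimated, and the fit repeated for every location.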

The criterion used to choose the bandwidth for the model must also be robust
to the presence of outliers. The CV score in eq. (2) is sensitive to outliers: the CV errors of
outliers can dominate the whole CV score. In other words, the choice of bandwidth no longer has a meaningful
effect, because the CV score is almost the same for every bandwidth. For this reason, Zhang and Mei [4] use the
Absolute Value Cross-Validation (ACV) criterion, which is more robust and reduces the influence of the large errors
of outliers. The ACV score is defined as:

$$ ACV(h) = \sum_{i=1}^{n} \left| y_i - \hat{y}_{\neq i}(h) \right| \qquad (4) $$

where $\hat{y}_{\neq i}(h)$ is the estimated value of $y_i$ obtained with the $i$th location omitted from the model.
Similar to the CV score, the bandwidth chosen is the one with the smallest ACV score.
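
The ACV score can be sketched in the same spirit. The example assumes a Gaussian kernel and a weighted LAD fit solved by linear programming with `scipy.optimize.linprog`, with each location left out of its own fit. All function names and the toy data are hypothetical, and the sketch is not the implementation used in the study.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(design, y, w):
    """Weighted LAD fit by linear programming; returns the coefficient vector."""
    n, p = design.shape
    cost = np.concatenate([np.zeros(p), w, w])         # weights on e_plus, e_minus
    a_eq = np.hstack([design, np.eye(n), -np.eye(n)])  # X beta + e_plus - e_minus = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=a_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]

def acv_score(x, y, coords, h):
    """ACV(h) = sum_i |y_i - yhat_{-i}(h)| using local weighted LAD fits."""
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    score = 0.0
    for i in range(len(y)):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / h) ** 2)   # Gaussian kernel weights
        keep = np.arange(len(y)) != i     # leave location i out of its own fit
        beta_i = lad_fit(design[keep], y[keep], w[keep])
        score += abs(y[i] - design[i] @ beta_i)
    return score

# Hypothetical toy data on a 3 x 3 grid with one gross outlier at the center.
coords_demo = np.array([[u, v] for u in range(3) for v in range(3)], dtype=float)
x_demo = np.array([[1.0], [2.0], [4.0], [3.0], [5.0], [7.0], [6.0], [9.0], [8.0]])
y_demo = 1.0 + 2.0 * x_demo[:, 0]
y_demo[4] += 50.0
acv_demo = acv_score(x_demo, y_demo, coords_demo, h=1.5)
```

Evaluating `acv_score` over a set of candidate bandwidths and keeping the smallest value mirrors the CV-based search, with absolute errors in place of squared errors so that a single outlier contributes linearly rather than quadratically.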

METHODOLOGY
This study uses data from the Survei Sosial Ekonomi Nasional (SUSENAS), the Survei Angkatan Kerja Nasional
(SAKERNAS), and Potensi Desa (PODES) conducted by BPS in 2015. The study area covers 119
regions/cities in Java Island. The variables used in this study are:
1. The percentage of the poor
2. The percentage of heads of households with less than elementary school education
3. The percentage of people who suffered health complaints during the past month
4. The percentage of households using protected drinking water
5. Underemployment rate
6. The percentage of per capita consumption spent on food
7. The percentage of people employed in informal sectors
8. The percentage of villages in the area of land
The steps for modeling the poverty rate with robust GWR are as follows:
1. Prepare the data. The data consist of the poverty rate of the study area as the response variable and
social and economic variables that are thought to cause poverty as the explanatory variables.
2. Test for spatial heterogeneity using the BP test.
3. Detect the existence of outliers.
4. If there are outliers and the variances of the model are not constant across the study area, use robust GWR to
model poverty. The steps are:
a. Calculate the distance matrix from the data.
b. Find the optimum bandwidth using the ACV criterion.
c. Construct the weighting matrix for each location using the optimum bandwidth.
d. Estimate the parameters of the model using a linear programming algorithm.

RESULT
Java Island consists of six provinces divided into 119 regions/cities. It covers 129,738 km² with a population
density of 1,121 people per km². The highest percentage of poverty in Java is in Sampang Region, at 25.69%, and
the lowest is in Tangerang Selatan City, at 1.69%. Figure 1 shows the distribution of the percentage of poverty in
Java Island.

FIGURE 1. Distribution of Percentage of Poverty in Java Island, 2015

The summary of the data is given in Table 1 below.


TABLE 1. Summary of the Data
Variable   Min     Max      Mean    Median   Std. Dev.
1          1.69    25.69    11.32   11.27    4.90
2          15.02   85.69    55.48   61.68    18.11
3          18.05   47.62    33.36   33.24    5.98
4          38.47   100.00   74.14   74.34    13.06
5          9.34    68.80    28.42   27.13    11.83
6          31.07   60.95    47.61   48.68    6.94
7          16.50   86.12    56.33   60.74    18.34
8          0.00    100.00   82.58   87.67    19.84
Rows are numbered in the order of the variable list in the Methodology section.

The resulting global regression model is as follows :


     

The Breusch-Pagan test for spatial heterogeneity concludes that the global model has non-constant
variance, as indicated by a test statistic of 14.267 with a p-value of 0.046. Therefore, the GWR model can be used
to describe the relationship between the percentage of poverty and its explanatory variables. In addition,
outlier detection using the boxmap shows that there are outliers among the residuals of the global model. The
outliers are in Tasikmalaya City, Purbalingga Region, Bondowoso Region, Kulon Progo Region, Bantul Region,
Gunung Kidul Region, and Sampang Region. The boxmap is shown in Figure 2.

FIGURE 2. The Boxmap of Global Regression Model

Using the GWmodel package in R 3.3.1 with a fixed Gaussian kernel function, the optimum bandwidth
obtained for the GWR model is 83.39 km, with a CV score of 41.87. The GWR parameter estimates are summarized as follows:

TABLE 2. GWR Model Parameter Estimation Summary
Parameter   Minimum   Maximum   Mean    Median   Standard Deviation
$\beta_0$   -0.49     0.30      0.01    0.08     0.21
$\beta_1$   -0.43     0.97      0.18    0.23     0.39
$\beta_2$   -0.20     0.34      0.03    -0.02    0.16
$\beta_3$   -0.15     0.18      0.01    -0.01    0.10
$\beta_4$   -0.57     0.40      -0.01   -0.01    0.24
$\beta_5$   -0.44     0.53      0.09    -0.01    0.28
$\beta_6$   0.03      1.04      0.47    0.52     0.20
$\beta_7$   -0.28     0.48      0.02    0.00     0.18

As with the GWR model, the RGWR model was fitted with a fixed Gaussian kernel function. Instead of the
CV score, the RGWR model uses the ACV score as the criterion for choosing the bandwidth. The optimum bandwidth
obtained is 33.10 km, with an ACV score of 94.21. The RGWR parameter estimates are summarized as follows:

TABLE 3. RGWR Model Parameter Estimation Summary
Parameter   Minimum   Maximum   Mean    Median   Standard Deviation
$\beta_0$   -1.26     1.52      -0.05   -0.06    0.52
$\beta_1$   -2.17     3.83      0.33    0.27     1.05
$\beta_2$   -1.52     1.18      0.04    0.00     0.41
$\beta_3$   -1.56     1.27      -0.02   0.00     0.49
$\beta_4$   -2.00     3.20      0.10    -0.08    0.75
$\beta_5$   -2.43     1.66      -0.03   0.08     0.73
$\beta_6$   -1.28     3.01      0.44    0.52     0.86
$\beta_7$   -3.86     2.42      0.15    0.14     0.64

Figure 3 below maps the percentage of poverty together with its estimates from the GWR and RGWR models for
visual comparison.

(a). Percentage of Poverty

(b). Estimation of Percentage of Poverty Using RGWR

(c). Estimation of Percentage of Poverty Using GWR

FIGURE 3. Distribution of Percentage of poverty and Its Estimation Using RGWR and GWR model

Figure 3 shows that the estimates of the percentage of poverty from the RGWR model are closer
to the actual values than those from the GWR model. The Mean Square Error (MSE) of the RGWR
model is 0.29, which is lower than the MSE of the GWR model, 3.76. This
indicates that the RGWR model describes the percentage of poverty in Java Island in 2015 better than the GWR model.
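
For reference, the MSE comparison can be sketched as below, assuming the MSE is the mean of the squared differences between actual and fitted values (the paper does not state its divisor). The function name and all numbers are hypothetical and are not the study data.

```python
import numpy as np

def mean_square_error(actual, fitted):
    """Mean of squared differences between actual and fitted values."""
    actual = np.asarray(actual, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    return float(np.mean((actual - fitted) ** 2))

# Hypothetical values, not the study data.
actual_demo = [10.0, 12.0, 15.0]
fit_a_demo = [10.5, 11.5, 15.0]  # fitted values from a first model
fit_b_demo = [12.0, 9.0, 17.0]   # fitted values from a second model
mse_a = mean_square_error(actual_demo, fit_a_demo)
mse_b = mean_square_error(actual_demo, fit_b_demo)
```

The model with the smaller value, here the first one, is the one whose fitted values lie closer to the observations on average.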

CONCLUSION
From this study, we conclude that the percentage of poverty in Java Island in 2015 is spatially distributed and
exhibits spatial heterogeneity. The RGWR model produces estimates that are closer to the actual values than
those of the GWR model. This indicates that a robust technique performs better for spatial data that
contain outliers.

REFERENCES
1. L. Anselin, Spatial Econometrics: Methods and Models (Kluwer Academic Publishers, The Netherlands, 1988).
2. Badan Pusat Statistik, Data dan Informasi Kemiskinan Kabupaten/Kota 2014 (Badan Pusat Statistik, Jakarta,
2015).
3. A. S. Fotheringham, C. Brunsdon and M. Charlton, Geographically Weighted Regression (John Wiley and
Sons, Chichester, UK, 2002).
4. H. Zhang and C. Mei, Local Least Absolute Deviation Estimation of Spatially Varying Coefficient Models:
Robust Geographically Weighted Regression Approaches, International Journal of Geographical Information
Science, Vol. 25:9, 1467-1489 (2011).
5. M. P. Todaro and S. C. Smith, Economic Development, 11th Edition (Pearson Education, Boston, 2012).
6. Y. Dodge, An Introduction to L1-norm based statistical data analysis, Computational Statistics & Data
Analysis, 5, 239-253 (1987).
7. D. S. Prihantari, Analisis Pengaruh Pertumbuhan Ekonomi Terhadap Ketidakmerataan Pendapatan dan
Kemiskinan di Jawa Timur, Master Thesis, Universitas Indonesia, 2013.
8. H. Yasin, Pemilihan Variabel Pada Geographically Weighted Regression, Media Statistika, Vol. 4:2, 63-72
(2011).
9. S. Chatterjee and A. S. Hadi, Regression Analysis by Example, Fourth Edition (John Wiley & Sons, New Jersey,
2006).
10. D. Birkes and Y. Dodge, Alternative Methods of Regression (John Wiley & Sons, New York, 1993).
11. A. Djuraidah and A. H. Wigena, Regresi Spasial Untuk Menentukan Faktor-faktor Kemiskinan di Provinsi
Jawa Timur, Jurnal Statistika, Vol. 12 No. 1, 1-8.
12. R. Agusti, Pemodelan Data Panel Tak Seimbang di Pulau Jawa dengan Model Spasial Durbin, Master
Thesis, Institut Pertanian Bogor, 2015.
13. R. F. M. Zain, Analisis Data Kemiskinan Provinsi DKI Jakarta dan Jawa Barat dengan Metode
Geographically Weighted Logistic Regression (GWLR), Master Thesis, Universitas Padjadjaran, 2016.
14. A. Saefuddin, N. A. Setiabudi and N. A. Achsani, On comparison between ordinary linear regression and
geographically weighted regression: with Application to Indonesian Poverty Data, European Journal of
Scientific Research, 57(2), 275-285.
15. L. Anselin, Exploring Spatial Data with GeoDa™: A Workbook (Spatial Analysis Laboratory,
Department of Geography, University of Illinois, Urbana-Champaign, Urbana, IL 61801, 2005).
