Abstract. Geographically Weighted Regression (GWR) is an extension of Ordinary Least Squares (OLS) regression that is
effective for modeling spatially non-stationary data. In a GWR model, the regression parameters are estimated locally, so each
observation has its own regression coefficients. Parameter estimation in GWR uses Weighted Least Squares (WLS). However,
when there are outliers in the data, WLS produces estimators that are not efficient.
Hence, this study uses a robust method, Least Absolute Deviation (LAD), to estimate the parameters of the GWR model for the
case of poverty in Java Island. The study concludes that the GWR model estimated with LAD performs better than the GWR model estimated with WLS.
INTRODUCTION
Poverty is a multidimensional problem faced by most countries, including Indonesia. Poverty is defined as
the situation of an individual who is unable to meet minimum levels of income, food, clothing, healthcare, shelter,
and other essentials [5]. In Indonesia, the poverty rate is published by Statistics Indonesia (Badan Pusat Statistik,
BPS), the official statistics agency of Indonesia. BPS uses the basic needs approach to determine poverty: poverty is
seen as an economic inability to meet basic food and non-food needs, measured from the expenditure
side [2].
Poverty rate data are distributed spatially. According to Anselin [1], spatial data exhibit two effects: spatial
dependency and spatial heterogeneity. Spatial dependency means that the observations at one location depend
on the observations at other locations. Observations at adjacent locations tend to have similar characteristics,
and they become increasingly different as the distance between them grows. Spatial heterogeneity appears as
different influences of the explanatory variables on the response variable at each location. The existence of spatial
heterogeneity means that the homoscedasticity assumption of the classical regression model is not fulfilled: the
variance of the model is no longer constant, but differs at each observation. Therefore, a regression model was
developed that allows the variance to differ at each location by using local regression
coefficients, so that each location has its own regression coefficients [3]. A method that accommodates
local regression coefficients is Geographically Weighted Regression (GWR).
GWR uses a point approach in which each parameter of the regression model is estimated at every point of the
geographic study area. The regression coefficients are estimated with the Weighted Least Squares (WLS) method.
The WLS procedure estimates the parameters by minimizing the weighted sum of squared errors. This procedure is known to be
sensitive to outliers; hence, a method that is robust to outliers is needed. One of the
methods that can be used in the presence of outliers is Least Absolute Deviation (LAD). LAD can reduce
the effect of outliers without first having to identify the outlying observations.
020023-1
BASIC CONCEPT
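The linear regression model is written as:
$$y_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
where $y_i$ is the response variable of the ith observation; $\beta_0, \beta_1, \ldots, \beta_K$ are the regression coefficients (parameters) to be estimated; $x_{i1}, x_{i2}, \ldots, x_{iK}$ are the explanatory variables; $\varepsilon_i$ $(i = 1, 2, \ldots, n)$ are independent random errors
with mean zero and variance $\sigma^2$; $n$ is the number of observations, and $K$ is the number of explanatory variables. The model
can be expressed in matrix form:
$$Y = X\beta + \varepsilon$$
where
$Y = (y_1, y_2, \ldots, y_n)^T$, $\beta = (\beta_0, \beta_1, \ldots, \beta_K)^T$, $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$, and
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1K} \\ 1 & x_{21} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nK} \end{pmatrix}$$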
The regression coefficients are estimated with the Ordinary Least Squares (OLS) procedure by
minimizing:
$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{k=1}^{K} \beta_k x_{ik} \right)^2$$
The OLS estimator of $\beta$ is then:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
where $\hat{\beta}$ is the $(K+1) \times 1$ vector of estimated regression coefficients; $X$ is the $n \times (K+1)$ matrix of explanatory
variables whose first column consists of ones; and $Y$ is the $n \times 1$ vector of the response variable. In the rest of
this paper, this linear regression model is called the global model.
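As an illustration of the estimator above, the following minimal sketch computes $\hat{\beta}$ from the normal equations $X^T X \hat{\beta} = X^T Y$ in plain Python. The function names `solve_linear_system` and `ols_fit` and the toy data are assumptions made for this example; they do not come from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols_fit(x_rows, y):
    """Return beta_hat solving (X^T X) beta = X^T Y; a column of ones is prepended."""
    design = [[1.0] + [float(v) for v in row] for row in x_rows]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(design, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
x_rows = [[0, 1], [1, 0], [2, 2], [3, 1], [4, 5]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in x_rows]
beta_hat = ols_fit(x_rows, y)  # approximately [1.0, 2.0, -3.0]
```

Solving the normal equations directly avoids forming the explicit inverse $(X^T X)^{-1}$, which is usually preferable numerically even though both give the same estimator.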
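Spatial heterogeneity is tested with the Breusch-Pagan (BP) test, and the test statistic, denoted here by $BP$, asymptotically follows a chi-square distribution:
$$BP \sim \chi^2_K$$
where $Z$ is a matrix that contains the vectors at each observation.
The null hypothesis is rejected if $BP > \chi^2_{\alpha, K}$, where $K$ is the number of explanatory variables.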
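The sketch below shows one common way to compute a Breusch-Pagan statistic, $BP = \tfrac{1}{2} f^T Z (Z^T Z)^{-1} Z^T f$ with $f_i = e_i^2 / \hat{\sigma}^2 - 1$, for a single explanatory variable. This form is an assumption taken from the usual textbook presentation, since the formula itself is not legible in this section, and the function names and toy data are hypothetical. With $Z = [1, x]$ the quadratic form equals the explained sum of squares from regressing $f$ on $x$, which is what the code computes.

```python
def simple_ols(x, y):
    """Intercept and slope of the least squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def breusch_pagan(x, y):
    """BP statistic for one explanatory variable (K = 1)."""
    n = len(x)
    a, b = simple_ols(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / n
    # f_i = e_i^2 / sigma^2 - 1 has mean zero by construction.
    f = [e * e / sigma2 - 1.0 for e in resid]
    a_f, b_f = simple_ols(x, f)
    fitted = [a_f + b_f * xi for xi in x]
    # Half the explained sum of squares of the auxiliary regression.
    return 0.5 * sum(v * v for v in fitted)


# Deterministic toy data: errors of constant size versus errors growing with x.
x = [float(i) for i in range(1, 101)]
sign = [(-1.0) ** i for i in range(1, 101)]
y_constant = [1 + 2 * xi + s for xi, s in zip(x, sign)]
y_growing = [1 + 2 * xi + s * xi for xi, s in zip(x, sign)]

bp_constant = breusch_pagan(x, y_constant)  # close to zero
bp_growing = breusch_pagan(x, y_growing)    # far above 3.84, the 5% value of chi-square(1)
```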
Detecting Outliers
Least squares is the most common method for estimating the parameters of a regression model. It
seeks the parameters that make the sum of squared errors as small as possible. This becomes a problem when there are
outliers in the data, because outliers may produce large errors, and these large errors are magnified when they are
squared in computing the estimator. Therefore, it is important to detect whether outliers exist in the data.
The boxplot is one of the common methods used to detect outliers. It uses the Interquartile
Range (IQR), defined as the difference between the 3rd quartile ($Q_3$) and the 1st quartile ($Q_1$): $IQR = Q_3 - Q_1$.
Outliers are observations that satisfy one of the following conditions:
a. The value is less than $Q_1 - 1.5 \times IQR$, or
b. The value is more than $Q_3 + 1.5 \times IQR$.
In exploratory spatial data analysis, Anselin [15] transforms the boxplot into a map called a boxmap, which visualizes the
outliers spatially.
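The boxplot rule can be sketched in a few lines of Python. The quartiles here are computed by linear interpolation between order statistics, which is one of several conventions and is an assumption of this example, as are the multiplier of 1.5, the function names, and the toy values.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def boxplot_outliers(values, hinge=1.5):
    """Return the outliers and the (lower, upper) fences of the boxplot rule."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    low, high = q1 - hinge * iqr, q3 + hinge * iqr
    return [v for v in values if v < low or v > high], (low, high)


# Hypothetical values; the last one lies far from the rest.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, fences = boxplot_outliers(values)
# outliers is [100]; fences is (-3.5, 14.5)
```

A boxmap applies the same classification to the value attached to each spatial unit and colors the units on a map accordingly.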
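The GWR model is written as:
$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n \qquad (1)$$
where $y_i$ is the observed value of the response variable; $(u_i, v_i)$ is the spatial coordinate of the ith location; $x_{i1}, x_{i2}, \ldots, x_{iK}$
are the observations of the explanatory variables $X_1, \ldots, X_K$ at $(u_i, v_i)$; $\beta_k(u_i, v_i)$ $(k = 0, 1, \ldots, K)$ are the unknown regression
coefficients to be estimated; and $\varepsilon_i$ $(i = 1, 2, \ldots, n)$ are independent random errors with mean zero and variance $\sigma^2$.
The model is estimated with the Weighted Least Squares (WLS) method. The procedure is to minimize:
$$\sum_{j=1}^{n} w_j(u_i, v_i) \left( y_j - \beta_0(u_i, v_i) - \sum_{k=1}^{K} \beta_k(u_i, v_i)\, x_{jk} \right)^2$$
where $w_j(u_i, v_i)$ is the weight of location j in estimating the parameters at location i. In
matrix form, the resulting estimator is:
$$\hat{\beta}(u_i, v_i) = \left( X^T W(u_i, v_i) X \right)^{-1} X^T W(u_i, v_i) Y$$
The weighting matrix describes the influence of neighboring observations through their distances to the target
location. The weight increases as the distance decreases. The weighting matrix is:
$$W(u_i, v_i) = \mathrm{diag}\left( w_1(u_i, v_i), w_2(u_i, v_i), \ldots, w_n(u_i, v_i) \right)$$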
The values of the weighting matrix can be computed in several ways, and a kernel function is the most common
choice. The kernel is a function that describes the density of the distances among all locations. Several
kernel functions are commonly used; one of them is the Gaussian kernel. The weight under
the Gaussian kernel is computed as follows:
$$w_j(u_i, v_i) = \exp\left( -\frac{1}{2} \left( \frac{d_{ij}}{h} \right)^2 \right)$$
where $d_{ij} = \sqrt{(u_i - u_j)^2 + (v_i - v_j)^2}$ is the Euclidean distance between location i and location j, and $h$ is a bandwidth,
which is non-negative.
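A minimal sketch of the two ingredients described so far, Gaussian kernel weights and a local weighted least squares fit at each location, is given below for the special case of one explanatory variable. The kernel form $\exp(-\tfrac{1}{2}(d_{ij}/h)^2)$ is the common Gaussian definition and is assumed here; the function names, coordinates, and data are hypothetical.

```python
import math


def gaussian_weights(coords, i, h):
    """Weights w_j(u_i, v_i) = exp(-0.5 * (d_ij / h)^2) for all j."""
    ui, vi = coords[i]
    return [math.exp(-0.5 * (math.hypot(ui - uj, vi - vj) / h) ** 2)
            for uj, vj in coords]


def local_wls(x, y, w):
    """Weighted least squares line (intercept, slope) for one covariate."""
    sw = sum(w)
    mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
    my = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def gwr_fit(coords, x, y, h):
    """One (intercept, slope) pair per location, each with its own weights."""
    return [local_wls(x, y, gaussian_weights(coords, i, h))
            for i in range(len(y))]


# Hypothetical locations and data generated exactly from y = 1 + 2x.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
x = [1.0, 2.0, 4.0, 7.0]
y = [1 + 2 * xi for xi in x]

weights_at_first = gaussian_weights(coords, 0, h=5.0)
local_coefficients = gwr_fit(coords, x, y, h=5.0)
```

Because the toy relationship is the same everywhere, every location recovers an intercept near 1 and a slope near 2; with spatially varying data the pairs would differ from location to location.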
The bandwidth is important in GWR because it determines the neighborhood that is included in the parameter estimation at a
location. The bandwidth controls how closely the fitted surface follows the data and how smooth it is. The larger the bandwidth,
the smoother the resulting model. If the model is too smooth, however, it can no longer be distinguished from
the global model. Therefore, choosing the optimum bandwidth is very important for obtaining the
best model. One of the criteria that can be used to choose the optimum bandwidth is Cross-Validation (CV), which is
formulated as follows:
$$CV(h) = \sum_{i=1}^{n} \left( y_i - \hat{y}_{\neq i}(h) \right)^2 \qquad (2)$$
where $\hat{y}_{\neq i}(h)$ is the estimated value of $y_i$ when the ith location is omitted from the model. The bandwidth is optimum when
the CV score is minimum [3].
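The cross-validation criterion can be sketched as a search over candidate bandwidths, leaving each location out of its own local fit by setting its weight to zero. The sketch assumes one explanatory variable and a Gaussian kernel; the function names, candidate values, and toy data (a slope that drifts along a line of 20 locations) are hypothetical.

```python
import math


def loo_prediction(coords, x, y, i, h):
    """Predict y_i from a local fit that leaves observation i out."""
    ui, vi = coords[i]
    w = [math.exp(-0.5 * (math.hypot(ui - uj, vi - vj) / h) ** 2)
         for uj, vj in coords]
    w[i] = 0.0  # omit location i from its own calibration
    sw = sum(w)
    mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
    my = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return my + slope * (x[i] - mx)


def cv_score(coords, x, y, h):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    return sum((y[i] - loo_prediction(coords, x, y, i, h)) ** 2
               for i in range(len(y)))


def select_bandwidth(coords, x, y, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    return min(candidates, key=lambda h: cv_score(coords, x, y, h))


# Toy data: the local slope equals the position along the line (0 to 19).
coords = [(float(i), 0.0) for i in range(20)]
x = [1.0 + (7 * i) % 5 for i in range(20)]
y = [i * xi for i, xi in enumerate(x)]

candidates = [1.5, 5.0, 1000.0]
best_h = select_bandwidth(coords, x, y, candidates)
```

The very large candidate behaves like the global model, so on data with a drifting slope it receives a much larger CV score than the small bandwidth and is not selected.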
Whereas least squares has a closed-form estimator obtained by differentiating the loss function, the LAD method
has no closed-form estimator because eq. (3) is not differentiable. The
LAD estimates are therefore computed with an algorithm. Charnes, Cooper, and Ferguson showed in 1955
that LAD is equivalent to a linear programming problem [6]. In 1959, Wagner suggested that the
LAD problem can be solved through its dual, which can be reduced to a problem with a smaller
basis.
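In the context of GWR, the parameters of eq. (1) at a certain location $(u_i, v_i)$, where $w_j(u_i, v_i)$ denotes the
weight of observation j at that location, can be estimated using the LAD method by adopting eq. (3) as follows:
Minimize
$$\sum_{j=1}^{n} w_j(u_i, v_i)\, |\varepsilon_j|$$
subject to
$$y_j = \beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i)\, x_{jk} + \varepsilon_j, \quad j = 1, 2, \ldots, n$$
Let
$$\varepsilon_j^{+} = \max(\varepsilon_j, 0), \qquad \varepsilon_j^{-} = \max(-\varepsilon_j, 0)$$
and the decomposition of the residual is
$$\varepsilon_j = \varepsilon_j^{+} - \varepsilon_j^{-}$$
Then
$$|\varepsilon_j| = \varepsilon_j^{+} + \varepsilon_j^{-}$$
and
$$\varepsilon_j^{+} \ge 0, \qquad \varepsilon_j^{-} \ge 0$$
so the problem becomes the linear program:
Minimize
$$\sum_{j=1}^{n} w_j(u_i, v_i) \left( \varepsilon_j^{+} + \varepsilon_j^{-} \right)$$
Subject to
$$\beta_0(u_i, v_i) + \sum_{k=1}^{K} \beta_k(u_i, v_i)\, x_{jk} + \varepsilon_j^{+} - \varepsilon_j^{-} = y_j, \quad \varepsilon_j^{+}, \varepsilon_j^{-} \ge 0, \quad j = 1, 2, \ldots, n$$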
Technically, the robust GWR (RGWR) model is estimated with the rq function of the quantreg package in R.
The estimation at location $(u_i, v_i)$ is done by supplying the matrix $X$ as the
covariates and $\left( w_1(u_i, v_i), w_2(u_i, v_i), \ldots, w_n(u_i, v_i) \right)$ as the weighting vector. The process is repeated at
each location $(u_i, v_i)$, where $i = 1, 2, \ldots, n$.
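To make the weighted LAD step concrete, the sketch below fits a weighted LAD line for one explanatory variable in plain Python. It is not the algorithm used by rq; it is a brute-force illustration that relies on a property of the linear program: with at least two distinct x values, an optimal solution exists that passes exactly through two observations, so it is enough to check every such line. The function names, weights, and data are hypothetical.

```python
import math
from itertools import combinations


def weighted_abs_loss(x, y, w, a, b):
    """Weighted sum of absolute residuals of the line y = a + b*x."""
    return sum(wj * abs(yj - (a + b * xj)) for wj, xj, yj in zip(w, x, y))


def weighted_lad_line(x, y, w):
    """Weighted LAD fit of y = a + b*x by checking every line through two
    observations (brute force, only practical for small n)."""
    best = None
    for j, k in combinations(range(len(x)), 2):
        if x[j] == x[k]:
            continue  # a vertical pair does not define a line in (a, b)
        b = (y[k] - y[j]) / (x[k] - x[j])
        a = y[j] - b * x[j]
        loss = weighted_abs_loss(x, y, w, a, b)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]


# Toy data on the line y = 1 + 2x, with the last response replaced by an outlier.
x = [float(i) for i in range(10)]
y = [1 + 2 * xi for xi in x]
y[9] = 100.0

# Gaussian kernel weights relative to the first point, bandwidth 5 (assumed).
w = [math.exp(-0.5 * (xi / 5.0) ** 2) for xi in x]

a_lad, b_lad = weighted_lad_line(x, y, w)  # approximately (1.0, 2.0)
```

The outlier leaves the LAD line unchanged, whereas a least squares fit to the same data would be pulled toward it. For realistic problem sizes a linear programming or quantile regression solver should be used instead of enumeration.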
The criterion used to choose the bandwidth of the model should itself be robust
to the presence of outliers. The CV score introduced in eq. (2) is sensitive to outliers: the CV errors of the
outliers can dominate the whole CV score. In other words, the chosen bandwidth no longer has a meaningful
effect, because the CV score is almost the same for every bandwidth. Therefore, Zhang and Mei [4] use the Absolute Value
Cross-Validation (ACV) criterion, which is more robust and reduces the influence of the large errors of the outliers. The ACV
score is defined as:
$$ACV(h) = \sum_{i=1}^{n} \left| y_i - \hat{y}_{\neq i}(h) \right| \qquad (4)$$
As with the CV score, the chosen bandwidth is the one with the smallest ACV score.
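The difference between the two criteria can be seen on a small set of leave-one-out residuals. The residual values below are hypothetical and the function names are assumptions for this example; the sketch only compares how much of each score is contributed by a single outlier.

```python
def cv_from_residuals(residuals):
    """CV score: sum of squared leave-one-out residuals."""
    return sum(r * r for r in residuals)


def acv_from_residuals(residuals):
    """ACV score: sum of absolute leave-one-out residuals."""
    return sum(abs(r) for r in residuals)


# Hypothetical leave-one-out residuals; the last one belongs to an outlier.
loo_residuals = [1.0, -1.0, 2.0, -2.0, 30.0]

cv = cv_from_residuals(loo_residuals)    # 910.0
acv = acv_from_residuals(loo_residuals)  # 36.0

# Share of each score that comes from the outlier alone.
outlier_share_cv = 30.0 ** 2 / cv        # about 0.99
outlier_share_acv = 30.0 / acv           # about 0.83
```

Under CV almost the entire score comes from the outlier, so the remaining observations have little say in the choice of bandwidth; under ACV their contribution is several times larger.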
METHODOLOGY
This study uses data from the Survei Sosial Ekonomi Nasional (SUSENAS), the Survei Angkatan Kerja Nasional
(SAKERNAS), and the Potensi Desa (PODES), conducted by BPS in 2015. The study area covers 119
regions/cities in Java Island. The variables used in this study are:
1. The percentage of the poor
2. The percentage of heads of households with less than elementary school education
3. The percentage of people who suffered health complaints during the past month
4. The percentage of households using protected drinking water
5. Underemployment rate
6. The percentage of per capita consumption for food
7. The percentage of people employed in informal sectors
8. The percentage of villages in the area of land
The steps taken to model the poverty rate using robust GWR are as follows:
1. Prepare the data. The data consist of the poverty rate of the study area as the response variable and
social and economic variables that are indicated to be causes of poverty as the explanatory variables.
2. Test the spatial heterogeneity using the BP test.
3. Detect the existence of outliers.
4. If there are outliers and the variance of the model is not constant across the study area, use robust GWR to
model poverty. The steps are:
a. Calculate the distance matrix from the data.
b. Find the optimum bandwidth using the ACV criterion.
c. Construct the weighting matrix for each target location using the optimum bandwidth.
d. Estimate the parameters of the model using a linear programming algorithm.
RESULT
Java Island consists of six provinces that are divided into 119 regions/cities. It covers 129,738 km² with a population density of
1,121 people per km². The highest percentage of poverty in Java is in Sampang Region, at 25.69%, and the lowest
is in Tangerang Selatan City, at 1.69%. Figure 1 shows the distribution of the percentage of poverty in Java Island.
FIGURE 2. The Boxmap of Global Regression Model
Using the GWmodel package in R 3.3.1 with a fixed Gaussian kernel function, the optimum bandwidth
obtained for the GWR model is 83.39 km, with a CV score of 41.87. The GWR parameter estimates are as follows:
Like the GWR model, the RGWR model is also obtained using a fixed Gaussian kernel function. Instead of using the
CV score as the criterion for choosing the bandwidth, the RGWR model uses the ACV score. The optimum bandwidth
obtained is 33.10 km, with an ACV score of 94.21. The RGWR parameter estimates are as follows:
Maps of the percentage of poverty estimated by the GWR and RGWR models are shown in
Figure 3 below for visual comparison.
FIGURE 3. Distribution of the Percentage of Poverty and Its Estimation Using the RGWR and GWR Models
Figure 3 shows that the percentages of poverty estimated by the RGWR model are closer
to the actual values than those estimated by the GWR model. The Mean Squared Error (MSE) of the RGWR
model is 0.29, which is considerably lower than the MSE of the GWR model, 3.76. This
indicates that the RGWR model describes the percentage of poverty in Java Island in 2015 better than the GWR model.
CONCLUSION
From this study, we conclude that the percentage of poverty in Java Island in 2015 is spatially distributed and
exhibits spatial heterogeneity. The RGWR model produces estimates that are closer to the actual values than
those of the GWR model. This indicates that a robust technique performs better for spatial data containing
outliers.
REFERENCES
1. L. Anselin, Spatial Econometrics: Method and Models (Kluwer Academic Publisher, The Netherlands, 1988).
2. Badan Pusat Statistik, Data dan Informasi Kemiskinan Kabupaten/Kota 2014 (Badan Pusat Statistik, Jakarta,
2015).
3. A. S. Fotheringham, C. Brunsdon and M. Charlton, Geographically Weighted Regression (John Wiley and
Sons, Chichester UK, 2002).
4. H. Zhang and C. Mei, Local Least Absolute Deviation Estimation of Spatially Varying Coefficient Models:
Robust Geographically Weighted Regression Approaches, International Journal of Geographical Information
Science, Vol 25:9, 1467-1489 (2011).
5. M. P. Todaro and S. C. Smith, Economic Development 11th Edition (Pearson Education, Boston, 2012).
6. Y. Dodge, An Introduction to L1-norm based statistical data analysis, Computational Statistics & Data
Analysis, 5, 239-253 (1987).
7. D. S. Prihantari, Analisis Pengaruh Pertumbuhan Ekonomi Terhadap Ketidakmerataan Pendapatan dan
Kemiskinan di Jawa Timur, Master Thesis, Universitas Indonesia, 2013.
8. H. Yasin, Pemilihan Variabel pada Geographically Weighted Regression, Media Statistika, Vol 4:2, 63-72
(2011).
9. S. Chatterjee and A. S. Hadi, Regression Analysis by Example Fourth Edition (John Wiley & Sons, New Jersey,
2006).
10. D. Birkes and Y. Dodge, Alternative Methods of Regression (John Wiley & Sons, New York, 1993).
11. A. Djuraidah and A. H. Wigena, Regresi Spasial Untuk Menentukan Faktor-faktor Kemiskinan di Provinsi
Jawa Timur, Jurnal Statistika, Vol. 12 No. 1, 1-8.
12. R. Agusti, Pemodelan Data Panel Tak Seimbang di Pulau Jawa dengan Model Spasial Durbin, Master
Thesis, Institut Pertanian Bogor, 2015.
13. R. F. M. Zain, Analisis Data Kemiskinan Provinsi DKI Jakarta dan Jawa Barat dengan Metode
Geographically Weighted Logistic Regression (GWLR), Master Thesis, Universitas Padjadjaran, 2016.
14. A. Saefuddin, N. A. Setiabudi and N. A. Achsani, On comparison between ordinary linear regression and
geographically weighted regression: with Application to Indonesian Poverty Data, European Journal of
Scientific Research, 57(2), 275-285.
15. L. Anselin, Exploring Spatial Data with GeoDa™: A Workbook (Spatial Analysis Laboratory,
Department of Geography, University of Illinois, Urbana-Champaign, Urbana, IL 61801, 2005).