
2016 International Conference on Information, Communication Technology and System (ICTS)

Using Google Trend Data in Forecasting Number of Dengue Fever Cases with ARIMAX Method Case Study: Surabaya, Indonesia
Wiwik Anggraeni 1), Laras Aristiani 2)
1,2) Department of Information Systems
Sepuluh Nopember Institute of Technology
Surabaya, Indonesia
1) wiwik@is.its.ac.id, 2) larasarist@gmail.com

978-1-5090-1381-4/16/$31.00 ©2016 IEEE

Abstract: Indonesia has the highest number of dengue fever cases in Southeast Asia. Early detection of the disease is required in order to prepare preventive measures against dengue fever. Previous research has shown that certain search queries related to communicable diseases on Google Trends are highly correlated with the number of communicable disease cases in South Korea. Based on this previous research, the Google Trends search index shows potential for inclusion as an external variable in a multivariate quantitative forecasting model.

Using a time series model, the role of Google Trends in the epidemiology of dengue fever transmission in Surabaya is analyzed. This research uses two data sources: (1) the number of dengue fever cases obtained from the general local hospital of Dr. Soetomo, and (2) the Google Trends search index of certain queries related to dengue fever. All of the data span from December 2010 to August 2015. Interpolation and extrapolation techniques are used to handle the missing data. ARIMA and ARIMAX models with Google Trends data are implemented in order to forecast the number of dengue fever cases. The research shows that the addition of Google Trends to the ARIMAX model improves forecasting performance. The best ARIMAX model with Google Trends improves the MAPE value by 3 percentage points.

Keywords: Forecasting; ARIMA; ARIMAX; Google Trends; Dengue Fever.

I. INTRODUCTION

According to the World Health Organization, Indonesia has the second-highest number of dengue fever cases in the world [1]. The incidence rate of dengue fever in Indonesia has been increasing constantly since 1968. Based on data obtained from the Indonesian Ministry of Health, the incidence rate of dengue fever was 41.25 per 100,000 citizens in 2013 [2]. Controlling dengue fever transmission is part of the Indonesian Ministry of Health's strategic plan [3]. Forecasting the impact of the disease is crucial for planning an effective response strategy [4].

Previous studies have shown the potential of trace data on the internet for monitoring and forecasting the transmission of communicable diseases. Wikipedia access log data have been used to monitor and predict several communicable diseases in Haiti, Uganda, China, Japan, Poland, the United States of America and Norway [5, 4]. The limitation of these studies is that using article language as a location proxy is considered weak, because it cannot be used at a scale smaller than country level [5].

Google launched Google Flu Trends and Google Dengue Trends in 2008. These services are modeled from search queries submitted to Google. Previous studies have used Google Flu Trends [6] and Google Dengue Trends [7] as surveillance system alternatives and for forecasting the number of dengue fever and influenza cases [8, 9, 10]. Both services were available online for free, yet Google's scientists never revealed which search terms were used to track the diseases. These services were discontinued in 2015; however, the data remain available, with limitations, for academic research purposes.

Search query data on Google are available to the public on a web service called Google Trend. It allows people to examine the trends of certain search queries. In a previous study, Google Trend data for certain influenza queries, selected through a survey, were correlated with national surveillance data in South Korea [11]. Although research on the correlation between dengue fever cases and certain dengue fever search queries in Indonesian has never been conducted, that study has shown that Google Trends can be considered as independent variables for a multivariate forecasting model. The Google Trend geographical location can be broken down to city level, which makes it more favorable than Wikipedia access log data.

Autoregressive Integrated Moving Average with exogenous variable (ARIMAX) is chosen to develop the forecasting model of dengue fever cases. Previous studies have shown that the ARIMAX model performs better than univariate ARIMA. The forecasting performance of the ARIMAX method including variation of calendar effects is better than that of the ARIMA model [12]. ARIMAX with various external variables has been used in previous studies, including school absenteeism [13], pharmacy drug sales [14], and climatic factors (temperature, relative humidity, rainfall, and air pressure) [15]. The ARIMA model has been widely used in epidemiology to monitor infectious diseases [16]. We present a comparative analysis of ARIMA and ARIMAX with Google Trends forecasting models to forecast dengue fever cases in Surabaya, Indonesia.

II. DATA AND METHODOLOGY

A. Data
The dependent variable is the number of dengue fever cases. It was obtained from the general local hospital RSUD Dr. Soetomo Surabaya, Indonesia. The Google Trend search index provides the external/independent variables for the model. The Google Trend search queries related to dengue fever are determined based on
related search results that appeared after searching for "demam berdarah". The queries used for ARIMAX modelling are "demam berdarah", "dbd", "demam" and "dengue". In English, "demam berdarah" means "dengue fever". Another Indonesian term for dengue fever is "demam berdarah dengue", shortened to "dbd". In English, "demam" means "fever". The Google Trend search index data were downloaded on 21 September 2015 from http://www.google.com/trend. The geographical location was set to the East Java area. The weekly Google Trends search index was then aggregated into a monthly index by averaging over the weeks of each month. All of the data span from January 2010 to August 2015. Interpolation and extrapolation techniques are used to handle the missing data.

B. Methodology
Both dependent and independent variables are divided into two sets: a training set (for parameter estimation) and a testing set. The ratio of training set to testing set is 70:30. First, an ARIMA model using the dengue fever time series is developed, followed by ARIMAX with Google Trend as external variables. An ARIMA model consists of three parts: autoregression, moving average and differencing order. The general notation is ARIMA(p,d,q), where p is the autoregressive (AR) order, d is the differencing order, and q is the moving average (MA) order.

First, the stationarity of the dependent variable is tested with the Augmented Dickey-Fuller (ADF) test (p-value < 0.05) [17]. ARIMA is based on the assumption that the dependent variable is stationary, that is, that the mean and variance of the series are independent of time. Stationarity can be achieved by differencing the series, or by applying a logarithmic transformation to the dependent variable in order to stabilize the variance or mean [15]. In this research the dependent variable is logarithmically transformed twice to reduce the variance of the time series. The transformed variable then goes through the ADF test to check the stationarity of the series. Table 1 shows the result of the ADF test for the dengue fever time series. It turns out that the non-differenced series is already stationary (d=0), and the series with first-order differencing is also stationary (d=1).

TABLE 1 STATIONARITY TEST RESULT
Differencing order   Dickey-Fuller   Lag Order   p-value
d=0                  -3.6141         3           0.04204
d=1                  -4.1989         3           0.01

Once the dependent variable is stationary, the orders p and q are determined based on the Partial Autocorrelation Function (PACF) and Autocorrelation Function (ACF). Based on PACF and ACF, 16 ARIMA models are fitted. In the fitting process the AR and MA coefficients are estimated using the maximum likelihood method. Models with insignificant coefficients (p-value > 0.05) are discarded. The residuals of models with significant AR and MA coefficients are further inspected for white noise through the Ljung-Box test [18]. White noise in the model residuals (p-value > 0.05) indicates that the model is adequate. Models without white noise in their residuals are discarded. For ARIMA models which have passed the coefficient significance test and the white noise test, the goodness of fit is examined through MAPE (Mean Absolute Percentage Error). The ARIMA model with the best performance is the baseline for ARIMAX modelling.

To model ARIMAX with Google Trend, first the search indices of "demam berdarah", "dbd", "demam" and "dengue" are lagged from 1 to 10, which together with the unlagged series (lag 0) results in 11 variations of Google Trends external variables for each query. Each variable has to go through exactly the same transformation process as the dependent variable. The correlation between those variables and the dependent variable is analyzed. Only variables with significant correlation (p-value < 0.05) are considered as external variables for the ARIMAX model. Similar to the ARIMA fitting process, the coefficients of the AR and MA terms as well as the lagged Google Trends variables are estimated using the maximum likelihood method. Insignificant estimated coefficients (p-value > 0.05) are excluded and the model is re-fitted. The residuals of the ARIMAX model are also tested with the Ljung-Box test. All ARIMA modelling and the corresponding statistical tests were performed using SAS University Edition Software.

III. RESULTS
Figure 1 and Figure 2 show the log-transformed dependent variable with differencing order = 0 and differencing order = 1. As shown in Table 1, both of these series are stationary.

Figure 1. Plot of log transformed dependent variables d=0

Figure 2. Plot of log transformed dependent variables d=1

The possible values of d in ARIMA(p,d,q) are d=0 and d=1. For d=0, the PACF cuts off at lag 1 and lag 2 while the ACF plot is sinusoidal. A sinusoidal ACF plot for d=0 may indicate that the MA order for d=0 equals 0. Other possible values for the MA order at d=0 are q=1, q=2 and q=3. Figure 3 and Figure 4 are the plots of PACF and ACF for d=0.
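The order-identification step relies on the sample ACF and PACF of the transformed series. The following is a minimal sketch in plain Python of how these two functions can be computed; it is not the SAS procedure used in this study. The function names `sample_acf`, `sample_pacf` and `white_noise_bound` are illustrative, the PACF is obtained with the Durbin-Levinson recursion, and the bound 1.96/sqrt(n) is the usual approximate 95% limit for judging whether a lag differs from zero.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a numeric sequence."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares (must be non-zero)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def sample_pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    rho = sample_acf(series, max_lag)
    pacf = [1.0]
    phi_prev = []  # coefficients phi_{k-1,1..k-1} from the previous step
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_{k,k}.
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


def white_noise_bound(n):
    """Approximate 95% bound for a sample (partial) autocorrelation."""
    return 1.96 / math.sqrt(n)


if __name__ == "__main__":
    demo = [1, 2, 3, 4, 5]  # toy data, not the dengue series
    print(sample_acf(demo, 2))
    print(sample_pacf(demo, 2))
    print(white_noise_bound(len(demo)))
```

In this reading, a PACF that drops inside the bound after lag p suggests an AR order of p, and an ACF that drops inside the bound after lag q suggests an MA order of q.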

Figure 3. Plot of PACF for d=0

Figure 4. Plot of ACF for d=0

For d=1, the PACF cuts off at lag 4, while the ACF cuts off at lag 4 and lag 5. Table 2 summarizes the identified univariate ARIMA model parameters. Based on these identified parameters, there are 16 ARIMA models. These models are going to be fitted.

TABLE 2 IDENTIFIED ARIMA(P,D,Q) PARAMETERS
Identified p parameter   Identified d parameter   Identified q parameter
1, 2                     0                        1, 2, 3
4                        1                        4, 5

In order to include Google Trends as external variables in the model, the correlations between dengue fever cases and the Google Trends search index for each query are examined. Table 4 shows that there are significant correlations with "dengue" queries at lag 0, 1, 3 and 8, "demam berdarah" queries at lag 2, 7 and 8, "demam" queries at lag 2 and 8, and "dbd" queries at lag 2, 3, 7, 8 and 9.

The blue shading (values marked with *) in Table 4 indicates that there is a significant correlation between Google Trend queries and dengue fever cases. The number below each Spearman coefficient indicates the strength of the correlation. The interpretations of these numbers are:
(1) a VERY STRONG correlation
(2) a STRONG correlation
(3) a MODERATE correlation
(4) a WEAK correlation
Variables with WEAK and insignificant correlations are excluded from the ARIMAX model. Only those with VERY STRONG, STRONG and MODERATE correlation are fitted to the model. The estimation result is shown in Table 3. These models have significant AR, MA and external variable parameters.
TABLE 3 SUMMARY OF MODEL PERFORMANCE AND ESTIMATION RESULT
Model           MAPE    AR                 MA                 External Variables                         White Noise
                        Est.      Pr>|t|   Est.      Pr>|t|   Vars                     Est.     Pr>|t|   Chi-Square  p-value
ARIMA(1,0,3)    32.56%  0.56531   0.0034   -0.4436   0.0083   -                        -        -        1.16        0.5612
ARIMA(1,0,1)    34.19%  0.74963   <.0001   -0.4036   0.0117   -                        -        -        7.95        0.0933
ARIMA(2,0,0)    33.37%  -0.3792   0.0064   -         -        -                        -        -        8.3         0.0811
ARIMA(2,0,1)    32.34%  -0.7622   <.0001   0.53024   0.0328   -                        -        -        6.91        0.0747
ARIMA(4,1,0)    33.87%  -0.4357   0.0034   -         -        -                        -        -        1.07        0.5844
ARIMAX(2,0,1)   30.85%  -0.74428  <.0001   0.54451   0.0498   dengue (lag 1)           0.35677  0.033    4.4         0.2215
ARIMAX(1,0,3)   30.29%  0.552     0.005    -0.42622  0.0162   dengue (lag 1)           0.32721  0.0486   0.53        0.7657
ARIMAX(1,0,3)   31.45%  0.63735   <.0001   -0.62746  <.0001   demam berdarah (lag 2)   0.16776  0.0003   3.24        0.1982
ARIMAX(1,0,0)   34.88%  0.90258   <.0001   -         -        demam berdarah (lag 2)   0.17203  0.0005   12.9        0.2996
                                                              dbd_log2_2               0.04358  0.0481
ARIMAX(1,0,0)   31%     0.89091   <.0001   -         -        dengue (lag 2)           0.52375  0.0021   18.09       0.0795
                                                              demam berdarah (lag 2)   0.17694  0.0002
ARIMAX(2,0,0)   29.11%  -0.33857  0.0239   -         -        dengue (lag 1)           0.40698  0.0036   14.89       0.1361
                                                              demam berdarah (lag 2)   0.1574   <.0001
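Table 3 ranks the fitted models by MAPE. As a minimal sketch of how that criterion is computed and how two MAPE values can be compared, the plain-Python helpers below may be useful. The names `mape` and `mape_gain` are illustrative, and the refusal to accept zero actual values is an assumption made here, since the percentage error is undefined in that case.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, expressed in percent."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def mape_gain(baseline_mape, candidate_mape):
    """Return (gain in percentage points, gain relative to the baseline in %)."""
    absolute = baseline_mape - candidate_mape
    relative = 100.0 * absolute / baseline_mape
    return absolute, relative


if __name__ == "__main__":
    # Toy numbers, not the hospital data.
    print(mape([100, 200, 400], [110, 180, 400]))
    # MAPE values of the best ARIMA and ARIMAX models reported in Table 3.
    print(mape_gain(32.34, 29.11))
```

Applied to the two best models in Table 3 (32.34% and 29.11%), `mape_gain` gives a difference of 3.23 percentage points, which corresponds to a reduction of roughly 10% relative to the ARIMA baseline.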

TABLE 4 CORRELATIONS BETWEEN GOOGLE TREND QUERIES AND DENGUE FEVER CASE

Google Trend Queries  Lag 0     Lag 1     Lag 2     Lag 3     Lag 4     Lag 5     Lag 6     Lag 7     Lag 8     Lag 9     Lag 10
dengue                0.373*    0.555*    0.265     0.541*    0.278     -0.018    -0.058    -0.172    -0.318*   -0.1582   -0.059
                      (3)       (2)                 (2)                                               (3)
demam berdarah        0.0942    0.202     0.307*    0.166     0.089     -0.076    -0.227    -0.386*   -0.331*   -0.221    -0.0594
                                          (3)                                               (3)       (3)
demam                 0.098     0.264     0.350*    0.283     0.227     0.0287    -0.172    -0.187    -0.321*   -0.217    -0.0281
                                          (3)                                                         (3)
dbd                   -0.0203   0.182     0.423*    0.407*    0.263     0.031     -0.216    -0.432*   -0.535*   -0.474*   -0.320
                                          (3)       (3)                                     (3)       (2)       (3)
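The lagged correlations in Table 4 can be illustrated with a short sketch in plain Python. It assumes that "lag k" pairs the number of cases in month t with the search index in month t-k, which is an interpretation rather than something the text states, and it computes only the Spearman coefficient, not its p-value. The function names are illustrative.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(cases, search_index, lag):
    """Correlate cases[t] with search_index[t - lag] (assumed lag direction)."""
    if len(cases) != len(search_index):
        raise ValueError("series must have equal length")
    if lag < 0 or lag >= len(cases) - 1:
        raise ValueError("lag leaves fewer than two paired observations")
    return spearman(cases[lag:], search_index[: len(search_index) - lag])


if __name__ == "__main__":
    # Toy series in which the search index leads the cases by one month.
    cases = [1, 2, 3, 4, 5, 6]
    searches = [2, 3, 4, 5, 6, 0]
    for lag in range(0, 3):
        print(lag, round(lagged_spearman(cases, searches, lag), 3))
```

Running the same loop for lags 0 to 10 over each query would produce one row of coefficients per query, in the layout of Table 4.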

IV. DISCUSSIONS
The result of this research shows that there is a correlation between certain Google Trend queries related to dengue fever and the number of dengue fever cases. The best-fitting univariate ARIMA model for dengue fever cases is ARIMA(2,0,1) with a MAPE of 32.34%. The best-fitting ARIMAX model is ARIMAX(2,0,0) with "dengue" lag 1 and "demam berdarah" lag 2 as external variables, with a MAPE of 29.11%. Comparing the best ARIMA and ARIMAX models, we found that including Google Trend external variables improves the MAPE by 3.23 percentage points. On average, the performance of the ARIMAX models with Google Trend is better than that of the univariate ARIMA models by 2 percentage points. The best ARIMA and ARIMAX models are then applied to the testing set. Table 5 shows the performance comparison between these models on the training set and testing set. As shown in Table 5, ARIMAX performs better in predicting the number of dengue fever cases.

TABLE 5 PERFORMANCE COMPARISON OF ARIMA AND ARIMAX
Model           External Variables                    MAPE (training set)   MAPE (testing set)
ARIMA(2,0,1)    -                                     32.34%                32.34%
ARIMAX(2,0,0)   dengue_log2_1, demamberdarah_log2_2   29.11%                19%

The fitted values of ARIMA(2,0,1) and ARIMAX(2,0,0) are plotted in Figure 5 and Figure 6. The fitted values lie within the 95% confidence interval [11].

Figure 6. Plot of ARIMAX(2,0,0) with Google Trend fitted model (x-axis: time period, Jan-10 to Oct-13; y-axis: # of cases; series: Actual, Forecast, L95, U95)

The addition of certain Google Trend queries does improve the performance of the forecasting model statistically. However, there is a limitation in using Google Trends in this study. The selection of queries related to dengue fever in Indonesian is decided through the related search results shown on Google Trend. Although there is a significant correlation between the related search results shown and the number of dengue fever cases, this method has a weakness. The related search results are selected based on the frequency correlation of search queries in a certain period of time [19]. Hence, they can change as time goes on, resulting in inconsistent suggestions. For further studies using Google Trend search queries, the queries can be decided based on a survey of a sample of the population asking "what comes to your mind when you're searching for dengue fever?"

Figure 5. Plot of ARIMA(2,0,1) fitted model (x-axis: time period, Jan-10 to Oct-13; y-axis: # of cases; series: Actual, Forecast, lower 95%, upper 95%)

V. CONCLUSION
This study concludes that the addition of Google Trends in forecasting the number of dengue fever cases using the ARIMA method can improve the model performance. However, the selection method of search queries on Google Trend has to be refined in further studies.

REFERENCES
[1] World Health Organization, "Prevention and control of dengue and dengue haemorrhagic fever," World Health Organization, India, 2003.
[2] Central of Data and Information, Indonesian Ministry of Health, "Dengue Fever situation in Indonesia," Indonesian Ministry of Health, Jakarta, 2014.
[3] Indonesian Ministry of Health, Strategic Plan of Indonesian Ministry of Health 2015-2019, Jakarta: Indonesian Ministry of Health, 2015.
[4] K. S. Hickmann, G. Fairchild, R. Priedhorsky, N. Generous and J. M. H. et.al, "Forecasting the 2013-2014 Influenza Season Using Wikipedia," PLoS Computational Biology, vol. 11, no. 5, 2015.
[5] N. Generous, G. Fairchild, A. Deshpande, S. Y. D. Valle and R. Priedhorsky, "Global Disease Monitoring and Forecasting with Wikipedia," PLOS ONE, vol. 10, no. 11, 2014.
[6] Google, "Google Flu Trends," Google, 2015. [Online]. Available: http://www.google.org/flutrends. [Accessed 29 September 2015].
[7] Google, "Google Dengue Trends," Google, 2015. [Online]. Available: http://www.google.org/denguetrends. [Accessed 29 September 2015].
[8] O. M. Araz, D. Bentley and R. L. Muellman, "Using Google Flu Trends data in forecasting influenza-like-illness related ED visits in Omaha, Nebraska," American Journal of Emergency Medicine, vol. 32, 2014.
[9] A. F. Dugas, M. Jalalpour, Y. Gel, S. Levin, F. Torcaso, T. Igusa and R. E. Rothman, "Influenza Forecasting with Google Flu Trends," PLOS ONE, vol. 8, no. 2, 2013.
[10] J. Ginsberg, M. H. Mohebbi, R. S. Patel and L. B. et.al, "Detecting influenza epidemics using search engine query data," 19 February 2009. [Online]. Available: http://static.googleusercontent.com/media/research.google.com/en//archive/papers/detecting-influenza-epidemics.pdf. [Accessed 29 September 2015].
[11] S. C. J. M. S. S.-Y. L. J. e. a. Cho S, "Correlation between National Influenza Surveillance Data and Google Trends in South Korea," PLOS ONE, vol. 8, no. 12, 2013.
[12] W. Anggraeni, R. A. Viniarti and Y. D. Kurniawati, "Performance Comparisons Between ARIMA and ARIMAX Method in Moslem Kids Clothes Demand Forecasting : Case Study," Procedia Computer Science, vol. 72, pp. 630-637, 2015.
[13] E. JR, H. AG, B. JS, B. DL and O. D. e. al, "Usefulness of school absenteeism data for predicting influenza outbreaks, United States," Emerging infectious diseases, vol. 18, no. 8, 2012.
[14] P. A, "Comparison : Flu Prescription Sales Data from a Retail Pharmacy in the US with Google Flu Trends and US ILINet(CDC) Data as Flu Activity Indicator," PLOS ONE, vol. 7, 2012.
[15] R. P. Soebiyanto, F. Adimi and R. K. Kiang, "Modelling and Predicting Seasonal Influenza Transmission in Warm Regions Using Climatological Parameters," PLOS ONE, vol. 5, no. 3, 2010.
[16] S. Chadsuthi, C. Modchang, Y. Lenbury, S. Iamsirithaworn and W. Triampo, "Modelling seasonal leptospirosis transmission and its association with rainfall and temperature in Thailand using time-series and ARIMAX analyses," Asian Pacific Journal of Tropical Medicine, 2012.
[17] "Dickey-Fuller Unit Root Test (Stasionarity Test)," [Online]. Available: http://staffweb.hkbu.edu.hk/billhung/econ3600/application/app01/app01.html. [Accessed 6 October 2015].
[18] C. Chatfield, "Basic of Time Series Analysis," in Time-Series Forecasting, Florida, CRC Press LLC, 2000, pp. 20-42.
[19] D. E. Bowman, R. E. Ortega, M. L. Hamrick, J. R. Spiegel and T. R. Kohn, "Refining search queries by the suggestion of correlated terms from prior searches". United States Patent US6006225 A, 21 December 1999.
