Monthly gas production in Australia from 1956 to 1995 is released by the Australian Bureau of Statistics in time series format.
The objective is to read the data and analyse it by plotting, observing and conducting the applicable tests.
The project also requires building models and forecasting 12 months ahead using ARIMA and auto ARIMA models.
The best model for prediction is chosen by comparing the performance measures of the candidate models.
2. Assumptions
There are a few assumptions considered:
• The sample size is adequate for the techniques applicable to a time series dataset.
• All the necessary packages are installed in R.
• The dataset used for this project is available in the forecast package.
3. Exploratory Data Analysis
In this analysis, the objective is to identify the key characteristics of the series using descriptive analysis methods.
These insights give us a better understanding of the past and can be used to forecast the future.
3.1 Cleaning Up
Cleaning of environment and Memory
summary(gas)
We conclude that the “gas” time series contains both a trend component and a seasonal component.
3.7 Periodicity of dataset
A time series object is an ordered sequence of values (data points) of a variable at equally spaced time intervals. It is represented in the time domain.
There are a couple of methods to detect the periodicity of a time series object. The two used in this project are:
• DFT: identify the underlying periodic patterns by transforming the series into the frequency domain.
• Autocorrelation: correlate the signal with itself.
3.7.1 Periodicity check by computing Fourier transform
A Fourier analysis is a method for expressing a function as a sum of periodic components, and for recovering the
function from those components.
• When both the function and its Fourier transform are replaced with discretized counterparts, it is called the
discrete Fourier transform (DFT).
The periodogram shows the “power” of each possible frequency. There are clear spikes between 0 and 0.1: the power is highest close to frequency 0, then decays, and then spikes again at a frequency of about 0.07 Hz.
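The periodogram check can be sketched in base R as follows. This is a minimal sketch, assuming the forecast package is installed so that `gas` is available. The cutoff of 0.5 cycles per year, used to skip the trend-dominated low frequencies, is an illustrative choice and not part of the original analysis. Note that for a monthly `ts` object `spec.pgram()` reports frequencies in cycles per year, so the scale differs from the one quoted above.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# Raw periodogram of the linearly detrended series, without plotting.
# For a ts with frequency 12, pg$freq is expressed in cycles per year.
pg <- spec.pgram(gas, detrend = TRUE, plot = FALSE)

# Skip the lowest frequencies, where leftover trend dominates
# (illustrative cutoff), and locate the strongest remaining peak.
keep <- pg$freq >= 0.5
peak_freq <- pg$freq[keep][which.max(pg$spec[keep])]

# Convert the peak frequency into a period in months.
peak_period_months <- 12 / peak_freq
print(peak_period_months)
```

A period close to 12 months would agree with the annual cycle discussed in this section.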
• In the correlogram, all correlations lie above the blue lines, which indicates that they are significantly different from zero. The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the periodicity.
• When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
• When data are periodic, the autocorrelations are larger at the periodic lags (multiples of the seasonal frequency) than at other lags.
We conclude that the dataset has a clear annual periodicity.
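The correlogram reading above can be checked numerically. The following is a minimal sketch, assuming the forecast package is installed; the approximate 95% bound used here is the usual large-sample one drawn as dashed lines by `acf()`, and the choice of 48 lags is only for illustration.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# Autocorrelations up to 48 months, without plotting.
ac <- acf(gas, lag.max = 48, plot = FALSE)
r <- drop(ac$acf)            # r[1] is lag 0, r[k + 1] is lag k (in months)

# Approximate 95% significance bound (the dashed lines on the plot).
bound <- qnorm(0.975) / sqrt(length(gas))

# Are all non-zero lags outside the bound?
all_significant <- all(abs(r[-1]) > bound)

# Compare each seasonal lag with the lag half a cycle earlier.
seasonal_lags <- c(12, 24, 36, 48)
seasonal_vs_half_cycle <- r[seasonal_lags + 1] - r[seasonal_lags - 6 + 1]
print(all_significant)
print(seasonal_vs_half_cycle)
```

Positive differences in `seasonal_vs_half_cycle` correspond to the scalloped shape described above.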
A time series is classified as stationary if its mean, variance and autocovariance are constant and do not depend on time.
There are various methods to check whether a time series is stationary. The ones used in this project are:
• Observing plots: review the time series plot of gas for obvious trend and seasonality.
• Summary statistics: review the summary statistics of the data for seasons or random partitions and check for differences.
• Statistical tests such as the augmented Dickey-Fuller (ADF) test, checking whether the p-value is at most 0.05.
4.1 Observing plots
Observation:
• The year-on-year trend clearly shows that gas production has been increasing consistently.
• The variance and the mean value in July and August are much higher than in the rest of the months.
• Even though the mean values of the months are quite different, their variance is small. Hence, we have a strong seasonal effect with a cycle of 12 months or less.
• Visual inspection makes it clear that the time series is not stationary.
4.2 Summary Statistics to check stationarity
The mean and variance are not constant across partitions of the time series, so the project time series is non-stationary.
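A partition check of this kind can be sketched as below. This is a minimal sketch, assuming the forecast package is installed; splitting the series into two halves is an assumption made for illustration, since the partitions used in the report are not stated.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# Split the series into two halves (illustrative partitioning).
n <- length(gas)
half <- n %/% 2
first_half  <- as.numeric(gas)[1:half]
second_half <- as.numeric(gas)[(half + 1):n]

# Compare mean and variance across the two partitions.
partition_stats <- data.frame(
  partition = c("first half", "second half"),
  mean      = c(mean(first_half), mean(second_half)),
  variance  = c(var(first_half), var(second_half))
)
print(partition_stats)
```

For a stationary series the two rows would be roughly equal; large differences point to non-stationarity.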
The ADF test evaluates the null hypothesis (H0) that a unit root is present in the time series. The ADF statistic is a negative number: the more negative it is, the stronger the rejection of the hypothesis that there is a unit root.
Alternative Hypothesis (H1): the time series does not have a unit root, meaning it is stationary.
p-value > 0.05: Fail to reject H0; the data are consistent with a unit root and are treated as non-stationary.
p-value ≤ 0.05: Reject H0; the data do not have a unit root and are stationary.
4.3.2 ADF Test result / inference
The ADF test gives a p-value greater than 0.05, so we cannot reject the null hypothesis H0: the data are consistent with a unit root and the series is treated as non-stationary.
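The test can be run along the following lines. This is a minimal sketch, assuming the tseries package is installed in addition to forecast; the report does not say which package or lag order was used, so `adf.test()` with its default lag order is an assumption.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series
library(tseries)   # assumed installed; provides adf.test()

# ADF test on the original series. H0: a unit root is present.
adf_level <- adf.test(gas, alternative = "stationary")

# Decision rule from the text, at the 5% level.
reject_h0 <- adf_level$p.value <= 0.05
print(adf_level$statistic)
print(adf_level$p.value)
print(reject_h0)
```

`adf.test()` interpolates p-values from a table, so very small or very large p-values are reported at the table limits with a warning.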
Multiplicative decomposition is more prevalent with economic series because most seasonal economic series have seasonal variations that increase with the level of the series.
Rather than choosing either an additive or a multiplicative decomposition, we could transform the data beforehand.
We start by decomposing the series into its three components: the trend, seasonal and random components. The ts_decompose function provides an interactive interface for the decompose function.
Run ts_decompose on “gas”
We can observe that the trend of the series is fairly flat up to 1970 and starts to increase afterward. The trend plot also suggests that the trend is not linear.
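The same decomposition can be reproduced without the interactive plot. The following is a minimal sketch using base R `decompose()`, assuming the forecast package is installed for the data; the log transform shown as the "transform beforehand" option is one common choice and is an assumption, not necessarily the one used in the report.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# Multiplicative decomposition: data = trend * seasonal * random.
dec_mult <- decompose(gas, type = "multiplicative")

# Twelve seasonal indices, January to December (they average to 1).
seasonal_index <- setNames(dec_mult$figure, month.abb)
print(round(seasonal_index, 3))

# Alternative: transform first, then decompose additively.
# On the log scale a multiplicative structure becomes additive.
dec_log <- decompose(log(gas), type = "additive")

# The trend estimate has NA values at both ends (moving average).
trend_range <- range(dec_mult$trend, na.rm = TRUE)
print(trend_range)
```

Calling `plot(dec_mult)` draws the observed, trend, seasonal and random panels in a single figure.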
Both the series plot and the decomposition plots show that the series has a strong seasonal pattern along with a non-linear trend. We use the ts_seasonal, ts_heatmap and ts_surface functions to explore the seasonal pattern of the series:
The ts_seasonal function provides three different views of the seasonality of the series (when the type
argument is equal to all):
• Split and plot the series by the full frequency cycle of the series, which for a monthly series is a full year. This view allows you to observe and compare the variation of each frequency unit from year to year. The plot’s color scale is set by the chronological order of the lines (i.e., dark colors for early years and bright colors for the latest years).
• Plot each frequency unit over time; for a monthly series, each line represents one month of consumption over time. This allows us to observe whether the seasonal pattern remains the same over time.
• Finally, a box plot of each frequency unit, which allows us to compare the distributions of the frequency units.
The main observations from this set of plots are:
• The structure of the seasonal pattern remains the same over the years: low consumption at the start of the year, high consumption through July and August, and a fall after August.
• The distribution of consumption during July and August is wider than in the rest of the year.
• The series is growing from year to year.
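The box-plot view can be approximated numerically by summarising the series month by month. This is a minimal sketch in base R, assuming the forecast package is installed for the data; it is a non-interactive stand-in for the ts_seasonal views, not the function itself.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

values <- as.numeric(gas)
month  <- as.integer(cycle(gas))   # 1 = January, ..., 12 = December

# Mean and standard deviation of production for each calendar month.
monthly_mean <- setNames(tapply(values, month, mean), month.abb)
monthly_sd   <- setNames(tapply(values, month, sd), month.abb)

print(round(monthly_mean))
print(round(monthly_sd))

# Month with the highest average production.
peak_month <- names(which.max(monthly_mean))
print(peak_month)
```

The equivalent static plot is `boxplot(values ~ month)`, which draws one box per calendar month.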
To get a clearer view of the seasonal pattern of the series, you may want to remove the series growth (detrend) and replot it:
Remove the seasonal and trend effects and run ts_decompose on “gas”
5.1 STL method to de-seasonalize time series
STL is a very versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and
Trend decomposition using Loess”. It does an additive decomposition and the four graphs are the original
data, seasonal component, trend component and the remainder.
If the focus is on figuring out whether the general trend of demand is up, we de-seasonalize and possibly ignore the seasonal component. However, if we need to forecast the demand in the next month, we need to take into account both the secular trend and the seasonality.
Let us de-seasonalize the time series and compare.
The plot above shows the actual time series in blue and the de-seasonalized time series in red.
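The de-seasonalizing step can be sketched with base R `stl()`. This is a minimal sketch, assuming the forecast package is installed for the data; `s.window = "periodic"` (a fixed seasonal pattern) is an assumption, since the report does not state the setting it used.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# STL decomposition into seasonal, trend and remainder (additive).
fit_stl <- stl(gas, s.window = "periodic")
components <- fit_stl$time.series

# De-seasonalized series: subtract the seasonal component.
gas_deseasonalized <- gas - components[, "seasonal"]

# Sanity check: the three components add back up to the data.
reconstruction_error <- max(abs(rowSums(components) - as.numeric(gas)))
print(reconstruction_error)
```

To reproduce the comparison plot, `plot(gas, col = "blue")` followed by `lines(gas_deseasonalized, col = "red")` overlays the two series.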
5. ARIMA Model for Forecasting 12 months
Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, which allows for non-zero autocorrelations in the irregular component. ARIMA models are defined for stationary time series, so the series must either be stationary or be converted to a stationary one.
Let us split the data into training and test sets.
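A split of this kind can be done with `window()`. This is a minimal sketch, assuming the forecast package is installed for the data; the split point (training data up to December 1993) is a hypothetical choice for illustration, because the report does not state where it split the series.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# Hypothetical split point: train up to Dec 1993, test from Jan 1994.
gas_train <- window(gas, end = c(1993, 12))
gas_test  <- window(gas, start = c(1994, 1))

print(length(gas_train))
print(length(gas_test))
print(end(gas_train))
print(start(gas_test))
```

Keeping the test set at the end of the series preserves the time ordering, which a random split would destroy.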
In section 4.3 we concluded that the time series is non-stationary because the p-value is greater than 0.05.
Perform the ADF test on “gas”
Step 2: De-seasonalize the data
Difference the data until it appears stationary. Use a unit root test if unsure.
Apply a first difference to “gas”, then check the p-value again using the ADF test.
After first differencing, the ADF test gives a p-value below 0.05, so we reject the null hypothesis H0: the differenced data do not have a unit root and are stationary.
This gives a differencing order of d = 1. Next we use the ACF plot to choose q and the PACF plot to choose p.
We plot the ACF and PACF first for the original time series and then for the de-seasonalized, differenced time series.
Perform ACF on “gas” and “gasTrSt”
ACF plots display the correlation between the series and its lags. Most of the lines are significant, as they lie beyond the two blue lines; for the differenced series, the second line is significant, followed by a couple more. Looking for spikes at specific lags of the differenced series, the highest spike is at lag 6.
Perform PACF on “gas” and countd1
In the PACF plot, too, most of the lines are significant, as they lie beyond the two blue lines. Looking for spikes at specific lags of the differenced series, the highest spike is at lag 6, followed by lag 12.
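The significant lags can be listed programmatically instead of being read off the plots. This is a minimal sketch, assuming the forecast package is installed for the data; it differences the original `gas` series directly, whereas the report works with its own de-seasonalized training objects, so the lags found here need not match the ones quoted above.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# First difference of the series (d = 1).
gas_d1 <- diff(gas, differences = 1)

# ACF and PACF of the differenced series, without plotting.
acf_d1  <- drop(acf(gas_d1, lag.max = 36, plot = FALSE)$acf)[-1]  # drop lag 0
pacf_d1 <- drop(pacf(gas_d1, lag.max = 36, plot = FALSE)$acf)     # starts at lag 1

# Approximate 95% bound (the blue dashed lines on the plots).
bound <- qnorm(0.975) / sqrt(length(gas_d1))

# Lags (in months) whose correlations fall outside the bound.
significant_acf_lags  <- which(abs(acf_d1) > bound)
significant_pacf_lags <- which(abs(pacf_d1) > bound)
print(significant_acf_lags)
print(significant_pacf_lags)
```

Spikes in the ACF guide the choice of q, and spikes in the PACF guide the choice of p.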
Step 5 : Manual ARIMA model
We look for an appropriate ARIMA model based on the ACF and PACF. The significant spike at lag 1 in the ACF suggests a non-seasonal MA component, followed by significant spikes at lags 7 and 12. Consequently, we begin with an ARIMA(0,1,1) and try various models with varying p and q values. The residuals of the fitted model are shown in the figure below. By analogous logic applied to the PACF, we could also have started with an ARIMA(1,1,0).
Step 6: Choose an ARIMA model by comparing parameters and the ACF and PACF plots above
Both the ACF and PACF show significant spikes, indicating that some additional non-seasonal terms need to be included in the model. We then compare the AIC values of various ARIMA models and prefer the model with the smaller AIC and fewer parameters.
Consequently, we choose the ARIMA(0,1,3)(0,1,1)44 model. Its residuals are plotted. Not all of the spikes are within the significance limits, but the number of significant spikes is reduced. The Ljung-Box test also shows that the residuals have no remaining autocorrelations when the lag is reduced.
• Choose model with fewer parameters
• Compare MSE, AIC, BIC etc to choose between models
• Smaller values preferred
• Practical considerations, if any, or domain input
Model AIC
ARIMA(1,1,1) 8485.34
ARIMA(2,1,1) 8477.67
ARIMA(1,1,2) 8473.51
ARIMA(2,1,2) 8475.39
ARIMA(3,1,2) 8230.24
ARIMA(5,1,1) 8349.32
ARIMA(6,1,1) 8298.03
ARIMA(1,1,5) 8428.75
We select the model with (p, d, q) = (3, 1, 2): it has the smallest AIC in the table above and fewer parameters than the (6, 1, 1) model.
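A comparison table like the one above can be built by fitting each candidate order in a loop. This is a minimal sketch, assuming the forecast package is installed for the data; it fits the models to the full `gas` series with `method = "ML"`, which is an assumption, so the AIC values it prints are not expected to reproduce the ones in the table. Only a few of the candidate orders are included to keep the run short.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# A subset of the candidate (p, d, q) orders from the table.
candidates <- list(c(1, 1, 1), c(2, 1, 1), c(1, 1, 2), c(2, 1, 2))

# Fit one model and return its AIC, or NA if the fit fails.
fit_aic <- function(ord) {
  tryCatch(AIC(arima(gas, order = ord, method = "ML")),
           error = function(e) NA_real_)
}

aic_table <- data.frame(
  model = sapply(candidates, function(o)
    sprintf("ARIMA(%d,%d,%d)", o[1], o[2], o[3])),
  AIC = sapply(candidates, fit_aic)
)

# Smaller AIC is preferred, so sort in ascending order.
aic_table <- aic_table[order(aic_table$AIC), ]
print(aic_table)
```

All candidates share d = 1, which keeps their AIC values comparable.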
The plots and forecasts for this model are shown below.
ARIMA model
Box-Ljung Test
To check whether the residuals are independent
H0: Residuals are independent
Ha: Residuals are not independent
A p-value < 0.05 lets you reject the null hypothesis, but a p-value > 0.05 does not let you confirm the null hypothesis.
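The residual check can be sketched with base R `Box.test()`. This is a minimal sketch, assuming the forecast package is installed for the data; the model is fitted to the full `gas` series with `method = "ML"`, and the lag of 24 is an illustrative choice, so the result need not match the report's output.

```r
library(forecast)  # assumed installed; provides the monthly `gas` series

# Fit the non-seasonal ARIMA(3,1,2) chosen above.
fit_312 <- arima(gas, order = c(3, 1, 2), method = "ML")

# Ljung-Box test on the residuals. fitdf = p + q = 5 adjusts the
# degrees of freedom for the estimated ARMA coefficients.
lb <- Box.test(residuals(fit_312), lag = 24, type = "Ljung-Box", fitdf = 5)

# H0: residuals are independent. Reject at the 5% level if p < 0.05.
residuals_look_dependent <- lb$p.value < 0.05
print(lb$statistic)
print(lb$p.value)
print(residuals_look_dependent)
```

The test has `lag - fitdf` degrees of freedom, so the lag must exceed the number of fitted coefficients.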
After checking multiple models and comparing their AIC values and numbers of parameters, we now have seasonal ARIMA models that pass the required checks, and two of them are ready for forecasting. Forecasts from the models for a year ahead are shown in the figures below.
Model AIC
ARIMA(1,1,1)(0,1,1)[12] 8013.43
ARIMA(2,1,1)(0,1,1)[12] 8010.26
ARIMA(1,1,2)(0,1,1)[12] 8010.75
ARIMA(2,1,2)(0,1,1)[12] 8012.1
ARIMA(3,1,2)(0,1,1)[12] 8014.09
ARIMA(5,1,1)(0,1,1)[12] 8016.59
ARIMA(6,1,1)(0,1,1)[12] 8008.72
ARIMA(1,1,5)(0,1,1)[12] 8009.8
Below are the two chosen models, with their plots, residual tests and forecasts.
Forecasts from the model for the next year are shown in the figure. The forecasts follow a decreasing and then increasing trend.
Seasonal ARIMA model (2,1,1)(0,1,1)
Box-Ljung Test
To check whether the residuals are independent
H0: Residuals are independent
Ha: Residuals are not independent
A p-value < 0.05 lets you reject the null hypothesis, but a p-value > 0.05 does not let you confirm the null hypothesis.
The forecast monthly gas production is shown by the blue line, with the 80 percent and 95 percent prediction intervals in dark gray and light gray respectively. The forecast shows a similar pattern developing over the next 12 months: gas production declines over 5 months and then returns to the previous high year-end values at the end of 1996. Next we fit an ARIMA model that brings the seasonality back in, and compare.
The model above has first-order non-seasonal differencing, no seasonal autoregressive terms, and a first-order seasonal moving average term with a lag of twelve months.
As with the previous model, the forecast monthly gas production is shown by the blue line, with the 80 percent and 95 percent prediction intervals in dark gray and light gray respectively. The forecast again shows gas production declining over 5 months and then returning to the previous high year-end values at the end of 1996.
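A 12-month forecast with 80% and 95% intervals can be produced as follows. This is a minimal sketch using `Arima()` and `forecast()` from the forecast package, which is assumed to be installed; it fits the ARIMA(2,1,1)(0,1,1)[12] model to the full `gas` series with `method = "ML"`, which is an assumption, so the numbers need not match the report's figures.

```r
library(forecast)  # assumed installed; provides `gas`, Arima() and forecast()

# Seasonal ARIMA(2,1,1)(0,1,1) with period 12 (taken from frequency(gas)).
fit_seasonal <- Arima(gas, order = c(2, 1, 1), seasonal = c(0, 1, 1),
                      method = "ML")

# Forecast 12 months ahead with 80% and 95% prediction intervals.
fc <- forecast(fit_seasonal, h = 12, level = c(80, 95))

forecast_table <- data.frame(
  point = as.numeric(fc$mean),
  lo80  = as.numeric(fc$lower[, 1]),
  hi80  = as.numeric(fc$upper[, 1]),
  lo95  = as.numeric(fc$lower[, 2]),
  hi95  = as.numeric(fc$upper[, 2])
)
print(round(forecast_table))
```

Calling `plot(fc)` draws the point forecasts as a line with the two interval bands shaded around it.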
6. Accuracy of ARIMA Model
We compared several of the ARIMA models fitted so far, with and without seasonality (manual and automatic), using a test set consisting of the most recent data up to 1995, and then applied them to the original gas data.
The models chosen manually and with auto.arima() are both among the top four models based on their AIC values.
When models are chosen by AIC alongside MAPE, ensure that the order of differencing is the same, since AIC values are only comparable between models with the same differencing.
However, when comparing models using a test set, it does not matter how the forecasts were produced: the comparisons are always valid. In the tables above we compared models with seasonal = FALSE and seasonal = TRUE and found that the results are better with seasonal = TRUE for both the manual and the auto.arima models.
None of the models considered here pass all of the residual tests. In practice, we would normally use the best
model we could find, even if it did not pass all the tests.
Forecasts from the manually chosen ARIMA(1,1,5)(0,1,1)[12] model (which has the lowest AIC value on the test set amongst models with only seasonal differencing) are shown below. This model also gave the lowest MAPE values, so we selected it for forecasting.
The most accurate model is the manual seasonal ARIMA(1,1,5)(0,1,1)[12].
In this time series forecasting problem, we observed both trend and seasonality in the data. The trend is increasing, the seasonal variation is high, and the data also contain a few outliers.
ARIMA(1,1,5)(0,1,1)[12] is the most accurate model, with an error about 3% lower than that of the other models.
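The test-set MAPE behind a comparison of this kind can be computed as follows. This is a minimal sketch, assuming the forecast package is installed; the split point (training data up to December 1993) is hypothetical and `method = "ML"` is an assumption, so the value it prints is not expected to reproduce the accuracy figures reported above.

```r
library(forecast)  # assumed installed; provides `gas`, Arima() and forecast()

# Hypothetical split: train up to Dec 1993, test on the remainder.
gas_train <- window(gas, end = c(1993, 12))
gas_test  <- window(gas, start = c(1994, 1))

# Fit the selected seasonal model on the training data only.
fit_best <- Arima(gas_train, order = c(1, 1, 5), seasonal = c(0, 1, 1),
                  method = "ML")

# Forecast over the test period and compare with the actual values.
fc_test   <- forecast(fit_best, h = length(gas_test))
actual    <- as.numeric(gas_test)
predicted <- as.numeric(fc_test$mean)

# Mean absolute percentage error on the test set.
mape <- mean(abs((actual - predicted) / actual)) * 100
print(mape)
```

Because the error is measured on data the model never saw, the same calculation can be repeated for any other candidate model and the results compared directly.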