
Project 6: Australian monthly gas production

“Gas” Forecast Report


Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis
3.1 Cleaning Up
3.2 Reading the dataset
3.3 Number of Rows and Columns
3.3 Dataset Summary
3.4 Checking missing values and Outliers in dataset
3.5 Plotting of data
3.5.1 Normal Plot
3.5.2 Seasonal Plot
3.5.3 Subseries Plot
3.5.4 Lag Plot
3.5.5 Auto Correlation Plot
3.6 Observation on plots and component presence in time series
3.6.1 Normal Plot
3.6.2 Seasonal Plot
3.6.3 Subseries Plot
3.6.4 Lag Plot
3.6.5 Autocorrelation
3.7 Periodicity of dataset
4. Is time series Stationary?
4.1 Observing plots
4.2 Summary Statistics to check stationarity
4.3 Dickey Fuller test to check stationarity
5. De-seasonalize the time series
5.1 STL method to de-seasonalize time series
5. ARIMA Model for Forecasting 12 months
6. Accuracy of ARIMA Model
1. Project Objective
The objective of this project is to analyze the Australian monthly gas production dataset “gas” from the R package “forecast”.

The monthly gas production of Australia between 1956 and 1995, released by the Australian Bureau of Statistics, is provided in time series format.

The objective is to read the data and analyze it by plotting, observing, and conducting the applicable tests.
The project also requires building models and forecasting 12 months ahead using manual ARIMA and auto ARIMA models.
The best model for prediction is chosen by comparing the performance measures of the models.

The dataset looks as shown below:

Variable          Description
Year              Year
Month             Month
Gas Production    Gas produced during the specified month and year

2. Assumptions
The following assumptions are made:

• The sample size is adequate for the techniques applicable to a time series dataset.
• All the necessary packages are installed in R.
• The dataset used for this project is available in the package “forecast”.
3. Exploratory Data Analysis
In this analysis, the objective is to identify the key characteristics of the series using descriptive analysis methods:

• Seasonal pattern: single or multiple seasonal patterns (if any exist)
• Trend type: linear or exponential
• Structural breaks and outliers
• Any other pattern in the series

These insights give a better understanding of the past and can be used to forecast the future.

3.1 Cleaning Up
Clear the environment and free memory.

3.2 Reading the dataset

Structure of the data set “gas” and value analysis

3.3 Number of Rows and Columns


• The number of rows in the dataset is 476
• The number of columns (Features) in the dataset is 1
• Frequency of TS object is 12
3.3 Dataset Summary

summary(gas)

3.4 Checking missing values and Outliers in dataset


Check for outliers and missing values

Let’s clean the data using the tsclean function.

3.5 Plotting of data

3.5.1 Normal Plot

Plotting the dataset “gas”


3.5.2 Seasonal Plot

Seasonal Plots of “gas”

3.5.3 Subseries Plot

Subseries Plot of “gas”


3.5.4 Lag Plot

Lag plots (gglagplot) of “gas”

3.5.5 Auto Correlation Plot

Let's see the autocorrelation values of “gas”


3.6 Observation on plots and component presence in time series
3.6.1 Normal Plot
The plot shows two patterns:
• An overall positive trend.
There is a clear, increasing trend. The sudden drop at the start of each year needs to be investigated in order to find what causes this effect at the end of the calendar year.
• A zig-zagging seasonal pattern.
There is also a strong seasonal pattern that increases in size as the level of the series increases. Any forecasts of this series would need to capture the seasonal pattern and the fact that the trend is changing slowly.

3.6.2 Seasonal Plot


• A seasonal plot allows the underlying seasonal pattern to be seen more clearly, and is especially useful in identifying years in which the pattern changes. In this case, there is a clear jump in gas production in July and August each year. Overall, the plot also shows production increasing from year to year.

3.6.3 Subseries Plot


• The horizontal lines indicate the means for each month. This form of plot enables the underlying seasonal
pattern to be seen clearly and shows the changes in seasonality over time. It is especially useful in
identifying changes within seasons.
3.6.4 Lag Plot
• Here the colors indicate the month of the variable on the vertical axis. The lines connect points in
chronological order. The relationship is strongly positive at lag 12, reflecting the strong seasonality in the
data.
3.6.5 Autocorrelation
• Looking at the correlogram, we notice that all correlations are above the blue lines, which indicates that the correlations are significantly different from zero. The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality.
• When data have a trend, the autocorrelations for small lags tend to be large and positive, because observations nearby in time are also nearby in size. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
• When data are seasonal, the autocorrelations are larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags.
• When data are both trended and seasonal, you see a combination of these effects.

We conclude that the dataset “gas” has both trend and seasonal components present in its time series.
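The effects described above can be reproduced on a small synthetic series. The following is a minimal Python sketch (standard library only), not the R code used for the report: the series, the helper `acf` and the constants are assumptions chosen for illustration, and the 1.96/sqrt(n) limits are the usual large-sample approximation of the blue lines.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

# Hypothetical monthly series (NOT the gas data): linear trend + 12-month cycle.
series = [0.5 * t + 40 * math.sin(2 * math.pi * t / 12) for t in range(240)]

r = acf(series, 24)
bound = 1.96 / math.sqrt(len(series))  # approximate 95% significance limits

for lag in (1, 6, 12, 18, 24):
    flag = "significant" if abs(r[lag]) > bound else "not significant"
    print(f"lag {lag:2d}: r = {r[lag]: .3f} ({flag})")
```

The autocorrelations at the seasonal lags 12 and 24 come out larger than those at the half-season lags 6 and 18, which is the “scalloped” shape described in the text.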
3.7 Periodicity of dataset
A time series object is an ordered sequence of values (data points) of a variable at equally spaced time intervals. It is represented in the time domain.
There are a couple of methods to detect the periodicity of a time series object. The two used in this project are:
• DFT: identify the underlying periodic patterns by transforming the series into the frequency domain.
• Autocorrelation: correlate the signal with itself.
3.7.1 Periodicity check by computing Fourier transform
Fourier analysis is a method for expressing a function as a sum of periodic components, and for recovering the function from those components.
• When both the function and its Fourier transform are replaced with discretized counterparts, it is called the discrete Fourier transform (DFT).

Let's apply the discrete Fourier transform to “gas”.

The periodogram shows the “power” of each possible frequency. We can clearly see spikes between 0 and 0.1: the power is high at frequencies close to 0, then decreases, and then spikes again at a frequency of about 0.07.

Let's find the two frequencies with the highest “power”.

The frequencies 0.00208 and 0.004166 cycles per observation have the two highest powers.
The corresponding periods (1/frequency) are about 480 and 240 observations. Because the series is monthly, these periods are in months rather than days, and periods this close to the length of the series mainly reflect the long-term trend rather than a seasonal cycle. The annual periodicity is checked with the autocorrelation in section 3.7.2.
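To illustrate how a frequency read from a periodogram maps to a period, here is a minimal Python sketch (standard library only). It is not the report's R code: the synthetic series and the helper `periodogram` are assumptions for illustration, and the direct O(n²) DFT is used only because the series is short.

```python
import cmath
import math

def periodogram(series):
    """Return (frequency, power) pairs for frequencies k/n, k = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]  # remove the mean (the zero frequency)
    result = []
    for k in range(1, n // 2 + 1):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        result.append((k / n, abs(coef) ** 2 / n))
    return result

# Hypothetical monthly series (NOT the gas data): 20 years, trend + annual cycle.
n = 240
series = [0.05 * t + 5 * math.sin(2 * math.pi * t / 12) for t in range(n)]

spec = periodogram(series)
top = sorted(spec, key=lambda fp: fp[1], reverse=True)[:2]
periods = [1 / f for f, _ in top]  # period in observations (months here)

for (f, p), period in zip(top, periods):
    print(f"frequency {f:.5f} cycles/observation -> period {period:.1f} observations")
```

In this sketch the two strongest frequencies correspond to a 12-observation period (the annual cycle) and a period equal to the full series length (the trend), which shows why a very low-frequency spike should not be read as a seasonal period.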
3.7.2 Periodicity check by Auto Correlation

Let's see the autocorrelation values of “gas”.

• Looking at the correlogram, we notice that all correlations are above the blue lines, which indicates that the correlations are significantly different from zero. The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the periodicity.
• When data have a trend, the autocorrelations for small lags tend to be large and positive, because observations nearby in time are also nearby in size. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
• When data are periodic, the autocorrelations are larger for the periodic lags (at multiples of the seasonal frequency) than for other lags.
We conclude that the dataset has a periodicity, and that it is clearly annual.

4. Is time series Stationary?

A time series is classified as stationary if its mean, variance and covariance are constant and are not functions of time.

There are various methods to check whether a time series is stationary. The ones used in this project are:
• Observing plots: review the time series plot of gas for obvious trend and seasonality.
• Summary statistics: review summary statistics of the data for seasons or random partitions and check for differences.
• Statistical tests such as the Augmented Dickey-Fuller (ADF) test: check whether the p-value is below 0.05.
4.1 Observing plots

• Check the distribution of gas production over time with a plot.
• Fit a line to check for upward or downward movement.
• Check the yearly aggregates (by cycle) to display the year-on-year trend.
• Use a box plot to get a sense of the seasonal effect.

Observation:
• The year-on-year trend clearly shows that gas production has been increasing steadily.
• The variance and the mean value in July and August are much higher than in the rest of the months.
• Even though the mean value of each month is quite different, the variance within each month is small. Hence, we have a strong seasonal effect with a cycle of 12 months or less.
• By visual inspection, it is clear that the time series is not stationary.
4.2 Summary Statistics to check stationarity

Another way to check whether a time series is stationary is to review its summary statistics.

Let us proceed by splitting the time series into two (or more) partitions and comparing the mean and variance of each group. If they differ and the difference is statistically significant, the time series is likely non-stationary.

Splitting the time series, then cross-checking mean and variance

The mean and variance are not constant across the partitions of the time series; thus the project time series is non-stationary.
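The partition check can be sketched as follows. This is a minimal Python illustration (standard library only) on a hypothetical series whose level and seasonal swing both grow over time; the helper `partition_stats` and all constants are assumptions, not part of the report's R workflow.

```python
import math
import statistics

def partition_stats(series, parts=2):
    """Split a series into equal consecutive parts; return (mean, variance) per part."""
    size = len(series) // parts
    chunks = [series[i * size:(i + 1) * size] for i in range(parts)]
    return [(statistics.mean(c), statistics.variance(c)) for c in chunks]

# Hypothetical monthly series (NOT the gas data): rising level with a
# seasonal swing proportional to the level.
series = [
    (100 + 2 * t) * (1 + 0.2 * math.sin(2 * math.pi * t / 12))
    for t in range(240)
]

stats = partition_stats(series, parts=2)
for i, (mean, var) in enumerate(stats, start=1):
    print(f"partition {i}: mean = {mean:.1f}, variance = {var:.1f}")
```

Both the mean and the variance are clearly larger in the second partition, which is the pattern that points to non-stationarity. The sketch does not test whether the difference is statistically significant.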

4.3 Dickey Fuller test to check stationarity


Statistical tests help us draw conclusions about the data. They can only be used to inform the degree to which a null hypothesis can be rejected or fail to be rejected. The result must be interpreted for a given problem to be meaningful.

The ADF test tests the null hypothesis that a unit root is present in the time series. The ADF statistic is a negative number; the more negative it is, the stronger the rejection of the hypothesis that there is a unit root.

4.3.1 Null Hypothesis H0 and Alternate Hypothesis H1


Null Hypothesis (H0): If it is not rejected, it suggests the time series has a unit root, meaning it is non-stationary. It has some time-dependent structure.

Alternate Hypothesis (H1): If the null hypothesis is rejected, it suggests the time series does not have a unit root, meaning it is stationary.

p-value > 0.05: Fail to reject H0; the data has a unit root and is non-stationary.

p-value ≤ 0.05: Reject H0; the data does not have a unit root and is stationary.
4.3.2 ADF Test result / inference

Perform ADF Test on “gas”

The ADF test result shows a p-value greater than 0.05, which means that the data has a unit root and is non-stationary. Thus we cannot reject the null hypothesis H0.
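To make the idea of a unit-root test concrete, below is a minimal Python sketch (standard library only) of the basic, non-augmented Dickey-Fuller regression Δy_t = α + γ·y_(t-1) + ε_t. It is a simplified stand-in, not the ADF implementation used in the report: there are no lagged difference terms, no p-value is computed, and the statistic is only compared with -2.86, the commonly tabulated asymptotic 5% critical value for the constant-only case. The function name and the simulated series are assumptions.

```python
import math
import random

def dickey_fuller_stat(y):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_(t-1) + error."""
    x = y[:-1]
    d = [y[t] - y[t - 1] for t in range(1, len(y))]
    m = len(d)
    xbar = sum(x) / m
    dbar = sum(d) / m
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (di - dbar) for xi, di in zip(x, d))
    gamma = sxy / sxx
    alpha = dbar - gamma * xbar
    ssr = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, d))
    se_gamma = math.sqrt(ssr / (m - 2) / sxx)
    return gamma / se_gamma

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(500)]  # stationary white noise
walk = []                                         # random walk: has a unit root
level = 0.0
for e in noise:
    level += e
    walk.append(level)

CRITICAL_5PCT = -2.86  # asymptotic 5% value, regression with constant only
stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print("white noise:", round(stat_noise, 2), "reject H0:", stat_noise < CRITICAL_5PCT)
print("random walk:", round(stat_walk, 2), "reject H0:", stat_walk < CRITICAL_5PCT)
```

The white-noise statistic is strongly negative, so the unit-root hypothesis is rejected for it; the random-walk statistic is much closer to zero, which is the situation described for “gas”.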

5. De-seasonalize the time series.


A time series decomposition is a procedure that transforms a time series into multiple different time series. The original time series is usually decomposed into three sub-series:
• Seasonal: patterns that repeat with a fixed period of time.
• Trend: the underlying trend of the metric.
• Random (also called “noise”, “irregular” or “remainder”): the residuals of the time series after the seasonal and trend components have been removed.
• Besides these three components, there can be a cyclic component, which recurs over long periods of time.
To get a successful decomposition, it is important to choose between the additive and the multiplicative model.
To choose the right model, we need to look at the time series.
a. The additive model is useful when the seasonal variation is relatively constant over time.
b. The multiplicative model is useful when the seasonal variation increases over time.

Multiplicative decomposition is more prevalent with economic series because most seasonal economic series have seasonal variations that increase with the level of the series.

Rather than choosing either an additive or a multiplicative decomposition, we could transform the data beforehand.

We will start by decomposing the series into its three components: trend, seasonal and random. The ts_decompose function provides an interactive interface to the decompose function.
Perform ts_decompose on “gas”

We can observe that the trend of the series is fairly flat up to 1970 and starts to increase afterward. It also appears from the trend plot that the trend is not linear.
Both the series plot and the decomposition plot show that the series has a strong seasonal pattern along with a non-linear trend. We will use the ts_seasonal, ts_heatmap and ts_surface functions to explore the seasonal pattern of the series:

Perform ts_decompose on “gas”

The ts_seasonal function provides three different views of the seasonality of the series (when the type argument is equal to all):
• Split and plot the series by the full frequency cycle of the series, which in the case of a monthly series is a full year. This view allows you to observe and compare the variation of each frequency unit from year to year. The plot’s color scale is set by the chronological order of the lines (i.e., dark colors for early years and bright colors for the latest years).
• Plot each frequency unit over time; in the case of a monthly series, each line represents one month's production over time. This allows us to observe whether the seasonal pattern remains the same over time.
• A box plot of each frequency unit, which allows us to compare the distributions of the frequency units.
The main observations from this set of plots are:

• The structure of the seasonal pattern remains the same over the years: high production through July and August, low production at the start of the year, and a fall after August.
• The distribution of production during July and August is wider than during the rest of the year.
• The series is growing from year to year.

To get a clearer view of the seasonal pattern of the series, you may want to remove the series growth (detrend) and replot it:

Remove the seasonal and trend effects and perform ts_decompose on “gas”
5.1 STL method to de-seasonalize time series

STL is a very versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”. It performs an additive decomposition, and the four graphs are the original data, the seasonal component, the trend component and the remainder.

Perform STL on “gas”


Perform STL on “gas”

If the focus is on figuring out whether the general trend of production is up, we de-seasonalize and possibly ignore the seasonal component. However, if you need to forecast the next month, then you need to take into account both the secular trend and the seasonality.
Let's de-seasonalize the time series and compare.

The plot above shows the actual time series in blue and the de-seasonalized time series in red.
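As a rough illustration of what de-seasonalizing does, the sketch below performs a classical additive decomposition with a centered moving average in plain Python. This is deliberately a simpler technique than STL (which uses loess smoothing), and the synthetic series and helper names are assumptions for illustration only.

```python
import math

def centered_ma(series, period):
    """2 x period centered moving average (even period); None at the edges."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]
        # Half weight on the two end points, full weight on the rest.
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend

def seasonal_indices(series, trend, period):
    """Average detrended value per position in the cycle, centered on zero."""
    sums = [0.0] * period
    counts = [0] * period
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            sums[t % period] += y - tr
            counts[t % period] += 1
    raw = [s / c for s, c in zip(sums, counts)]
    mean_raw = sum(raw) / period
    return [v - mean_raw for v in raw]

# Hypothetical monthly series (NOT the gas data): linear trend + annual cycle.
period = 12
series = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / period) for t in range(120)]

trend = centered_ma(series, period)
seasonal = seasonal_indices(series, trend, period)
adjusted = [y - seasonal[t % period] for t, y in enumerate(series)]  # de-seasonalized
print([round(v, 2) for v in adjusted[:6]])
```

For this idealized series the seasonally adjusted values fall back onto the straight trend line, which is the same effect the blue and red curves show for “gas”.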
5. ARIMA Model for Forecasting 12 months
Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series that allows for non-zero autocorrelations in the irregular component. ARIMA models are defined for stationary time series, so the time series needs to be stationary or be converted to a stationary one.

Let us split the data into training and test data sets.

Step 1: Check for stationarity using the ADF test

As per section 4.3, we concluded that the time series is non-stationary, as the p-value is more than 0.05.
Perform ADF Test on “gas”
Step 2: De-seasonalize the data

De-seasonalize the data using stl

Step 3: Apply differencing / log transform

Difference the data until it appears stationary. Use a unit root test if unsure.
Apply a difference of order 1 to the data “gas”, then check the p-value again using the ADF test.

Now let’s test the stationarity using the ADF test.


Perform ADF Test on “gas”

The ADF test result shows a p-value less than 0.05 for a difference of order 1, which means that the data does not have a unit root and is stationary. Thus we reject the null hypothesis H0.

We obtained the differencing order d = 1. Next, let's use the ACF plot to find q and the PACF plot to find p.
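The differencing step can be sketched in a few lines of Python (standard library only). The series below is a hypothetical trend-plus-seasonal series, not the gas data, and the helper `difference` is an assumption for illustration.

```python
import math

def difference(series, lag=1):
    """Lag-k difference: y_t - y_(t-lag)."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly series (NOT the gas data): linear trend + annual cycle.
series = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(120)]

d1 = difference(series)             # first difference: removes the linear trend
d12 = difference(series, lag=12)    # seasonal difference: removes the annual cycle
d1_12 = difference(d12)             # both: first difference of the seasonal difference

print(len(series), len(d1), len(d12), len(d1_12))
print(round(d12[0], 6), round(d1_12[0], 6))
```

In this idealized case the seasonal difference leaves only the constant yearly growth (0.5 × 12 = 6), and differencing once more leaves zero. With real data the differenced series is what the ADF test is applied to.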

Step 4: ACF and PACF plots for p and q values

Let's plot the ACF and PACF for the original time series and then for the de-seasonalized, differenced time series.
Perform ACF on “gas” and “gasTrSt”

ACF plots display the correlation between the series and its lags. Most of the spikes are significant, as they extend beyond the two blue lines: the second spike is significant, followed by a couple more. Looking for spikes at specific lags of the differenced series, the highest spike is at lag 6.
Perform PACF on “gas” and countd1

In the PACF plot, too, most of the spikes are significant, as they extend beyond the two blue lines. Looking for spikes at specific lags of the differenced series, the highest spike is at lag 6 and then at lag 12.
Step 5 : Manual ARIMA model

We now look for an appropriate ARIMA model based on the ACF and PACF. The significant spike at lag 1 in the ACF suggests a non-seasonal MA component; there are further significant spikes at lags 7 and 12. Consequently, we begin with an ARIMA(0,1,1) and try various models with varying p and q values. The residuals for the fitted model are shown in the figure below. By analogous logic applied to the PACF, we could also have started with an ARIMA(1,1,0).

Fit ARIMA models for various p, q values.


Below are the ACF, PACF and residual plots.

Step 6: Choose the ARIMA model by comparing parameters and the ACF and PACF plots above

Both the ACF and PACF show significant spikes, indicating that some additional non-seasonal terms need to be included in the model. We then compare the AIC values of the various ARIMA models and choose a model with a small AIC value and few parameters.
Consequently, we choose the ARIMA(0,1,3)(0,1,1)44 model. Its residuals are plotted. Not all the spikes are within the significance limits, but the number of significant spikes is reduced. The Ljung-Box test also shows that the residuals have no remaining autocorrelations when the lag is reduced.
• Choose the model with fewer parameters
• Compare MSE, AIC, BIC, etc. to choose between models
• Smaller values are preferred
• Take into account practical considerations or domain input, if any

Model           AIC

ARIMA(1,1,1)    8485.34
ARIMA(2,1,1)    8477.67
ARIMA(1,1,2)    8473.51
ARIMA(2,1,2)    8475.39
ARIMA(3,1,2)    8230.24
ARIMA(5,1,1)    8349.32
ARIMA(6,1,1)    8298.03
ARIMA(1,1,5)    8428.75

Let's consider the model with (p,d,q) = (3,1,2), as its AIC value is the smallest and it has fewer parameters than (6,1,1).
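The comparison of the AIC values from the table can be written out as a short Python sketch. The AIC numbers are copied from the table above; the dictionary layout and variable names are assumptions for illustration.

```python
# AIC values of the non-seasonal candidates, keyed by (p, d, q).
aic = {
    (1, 1, 1): 8485.34,
    (2, 1, 1): 8477.67,
    (1, 1, 2): 8473.51,
    (2, 1, 2): 8475.39,
    (3, 1, 2): 8230.24,
    (5, 1, 1): 8349.32,
    (6, 1, 1): 8298.03,
    (1, 1, 5): 8428.75,
}

# Rank by AIC (smaller is better); p + q is the number of AR and MA terms.
ranked = sorted(aic.items(), key=lambda item: item[1])
best_order, best_aic = ranked[0]

for (p, d, q), value in ranked:
    print(f"ARIMA({p},{d},{q})  AIC = {value:8.2f}  "
          f"delta = {value - best_aic:7.2f}  terms = {p + q}")
```

The ranking puts ARIMA(3,1,2) first and ARIMA(6,1,1) second, with ARIMA(3,1,2) also using fewer AR and MA terms.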
Below are the plots and the forecast for this model.

ARIMA model

Let's check whether the residuals are independent.

Box-Ljung Test
This test checks whether the residuals are independent.
H0: Residuals are independent
Ha: Residuals are not independent
A p-value < 0.05 lets you reject the null hypothesis, but a p-value > 0.05 does not let you confirm the null hypothesis.

Conclusion: reject H0; the residuals are not independent.
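For reference, the Ljung-Box statistic is Q = n(n+2) Σ r_k² / (n-k), summed over lags k = 1..h, and is compared with a chi-square distribution. The following minimal Python sketch (standard library only) computes it on simulated residuals. It is not the R test call used for the report: the helper names are assumptions, the simulated series are hypothetical, and the closed-form chi-square tail used here is valid only for an even number of degrees of freedom.

```python
import math
import random

def acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability; closed form valid for EVEN df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    term = total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def ljung_box(residuals, lags, fitted_params=0):
    """Return (Q, p-value) with lags - fitted_params degrees of freedom."""
    n = len(residuals)
    r = acf(residuals, lags)
    q = n * (n + 2) * sum(r[k] ** 2 / (n - k) for k in range(1, lags + 1))
    return q, chi2_sf_even(q, lags - fitted_params)

random.seed(1)
noise = [random.gauss(0, 1) for _ in range(300)]   # independent "residuals"
ar = [0.0]
for e in noise[1:]:
    ar.append(0.8 * ar[-1] + e)                    # strongly autocorrelated "residuals"

q_noise, p_noise = ljung_box(noise, lags=10)
q_ar, p_ar = ljung_box(ar, lags=10)
print(f"white noise: Q = {q_noise:.1f}, p = {p_noise:.3f}")
print(f"AR(1) 0.8  : Q = {q_ar:.1f}, p = {p_ar:.3g}")
```

A tiny p-value, as for the autocorrelated series here, leads to rejecting H0; that is the outcome reported for the fitted model above.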


Let's forecast 12 months ahead and plot the result.

Step 7: Let's consider seasonal ARIMA and test the model

After checking multiple models and comparing their AIC values and number of parameters, we now have seasonal ARIMA models that are ready for forecasting; two of them are examined below. Forecasts from these models for the next year are shown in the figures below.

Model                      AIC

ARIMA(1,1,1)(0,1,1)[12]    8013.43
ARIMA(2,1,1)(0,1,1)[12]    8010.26
ARIMA(1,1,2)(0,1,1)[12]    8010.75
ARIMA(2,1,2)(0,1,1)[12]    8012.1
ARIMA(3,1,2)(0,1,1)[12]    8014.09
ARIMA(5,1,1)(0,1,1)[12]    8016.59
ARIMA(6,1,1)(0,1,1)[12]    8008.72
ARIMA(1,1,5)(0,1,1)[12]    8009.8

Below are the two chosen models, for which details such as plots and residual tests are shown along with the forecasts.

Seasonal Arima Model of 2,1,1 (0,1,1)


Box-Ljung Test
This test checks whether the residuals are independent.
H0: Residuals are independent
Ha: Residuals are not independent
A p-value < 0.05 lets you reject the null hypothesis, but a p-value > 0.05 does not let you confirm the null hypothesis.

Conclusion: reject H0; the residuals are not independent.

Forecasting of Arima Model 2,1,1 (0,1,1) for 12 months

Forecasts from the model for the next year are shown in the figure. The forecasts follow a decreasing and then increasing trend.
Seasonal Arima Model of 2,1,1 (0,1,1)

Box-Ljung Test
This test checks whether the residuals are independent.
H0: Residuals are independent
Ha: Residuals are not independent
A p-value < 0.05 lets you reject the null hypothesis, but a p-value > 0.05 does not let you confirm the null hypothesis.

Conclusion: reject H0; the residuals are not independent.

Forecasting of Arima Model 2,1,1 (0,1,1) for 12 months


Forecasts from the model for the next year are shown in the figure. The forecasts follow a decreasing and then increasing trend, the same as for the earlier model.

Step 8 : Auto Arima Model

Perform auto ARIMA with seasonality set to False

ACF and PACF on residuals

Box-Ljung Test
This test checks whether the residuals are independent.
H0: Residuals are independent
Ha: Residuals are not independent
A p-value < 0.05 lets you reject the null hypothesis, but a p-value > 0.05 does not let you confirm the null hypothesis.

Conclusion: do not reject H0; there is no evidence that the residuals are not independent.

Testing the model

Perform auto ARIMA with seasonality set to False on the test data


Forecasting

Perform a 12-month forecast with the ARIMA model with seasonality set to False

The forecasted monthly gas production is represented by the blue line, with the 80 percent and 95 percent prediction intervals in dark gray and light gray respectively. This forecast shows a similar pattern developing in the next 12 months: gas production declines over 5 months and then returns to the previous high year-end values by the end of 1996. Let's fit ARIMA again with the seasonality brought back in, and compare.

Perform auto ARIMA with seasonality set to True


Forecasting

Perform a 12-month forecast with the ARIMA model with seasonality set to True

The accuracy of the ARIMA model can be checked with the accuracy measure MAPE.

The model above has first-order non-seasonal differencing, no seasonal autoregressive terms, and a first-order seasonal moving average term with a seasonal lag of twelve months.

The forecasted monthly gas production is represented by the blue line, with the 80 percent and 95 percent prediction intervals in dark gray and light gray respectively. This forecast shows a similar pattern developing in the next 12 months: gas production declines over 5 months and then returns to the previous high year-end values by the end of 1996.
6. Accuracy of ARIMA Model
After fitting ARIMA models with and without seasonality (manual and auto), we compare some of the models fitted so far using a test set consisting of the last data of 1995, and then also apply them to the original data “gas”.
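A test-set comparison with MAPE can be sketched as follows in plain Python. The forecasters here are a seasonal naive method and a naive (last value) method, used only as stand-ins because fitting ARIMA models is outside the scope of a short sketch; the synthetic series and the helper names are assumptions, not the report's R code.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def seasonal_naive(train, horizon, period=12):
    """Forecast each step with the value observed one season earlier."""
    return [train[len(train) - period + (h % period)] for h in range(horizon)]

def naive(train, horizon):
    """Forecast every step with the last observed value."""
    return [train[-1]] * horizon

# Hypothetical monthly series (NOT the gas data): linear trend + annual cycle.
series = [100 + 0.5 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(240)]

horizon = 12
train, test = series[:-horizon], series[-horizon:]   # hold out the last 12 months

mape_snaive = mape(test, seasonal_naive(train, horizon))
mape_naive = mape(test, naive(train, horizon))
print(f"seasonal naive MAPE: {mape_snaive:.2f}%")
print(f"naive MAPE:          {mape_naive:.2f}%")
```

Because both forecasts are scored on the same held-out months, their MAPE values are directly comparable regardless of how each forecast was produced; the method with the lower MAPE would be preferred.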

The models chosen manually and with auto.arima() are both in the top four models based on their AIC values.
When the AIC value and MAPE are the criteria for choosing a model, ensure that the order of differencing is the same across the models.
However, when comparing models using a test set, it does not matter how the forecasts were produced; the comparisons are always valid. In the tables above we list the various models. We compared seasonal = FALSE and seasonal = TRUE and found that the results are better with seasonal = TRUE for both the manual and the auto.arima models.
None of the models considered here pass all of the residual tests. In practice, we would normally use the best model we could find, even if it did not pass all the tests.

Forecasts from the manual ARIMA(1,1,5)(0,1,1)[12] model, which has one of the lowest AIC values among the seasonal models considered, are shown below. This model gave the lowest MAPE values, so we selected it for forecasting.

The most accurate ARIMA model is the manual seasonal ARIMA(1,1,5)(0,1,1)[12].
For this time series forecasting problem, we observed both trend and seasonality in the data. The trend is increasing, with high variation in the seasonality. The data also had a few outliers.

ARIMA(1,1,5)(0,1,1)[12] is the most accurate model, with an error about 3% lower than that of the other models.
