
Introduction to R Package - zoo

STAT 890: Statistical Computing


Pulindu Ratnasekera
February 23, 2013

Introduction

Package zoo provides methods for dealing with totally ordered indexed observations.
In other words, it allows creating and manipulating time series. The main
objective of the package, however, is handling irregular time series, since the base
packages of R do not support irregularly spaced observations. Its key feature
is therefore independence from any particular index/date/time class. In addition, the zoo
package is consistent with ts and base R.

Overview: Package zoo

From an overall perspective, the functionality of package zoo can be classified into six categories.
1. Creating/defining a time series (both regular and irregular).
2. Writing or reading a time series to or from a text file (txt, csv etc.).
3. Manipulating a time series using various functions
4. General functions that help to handle a zoo series
5. Interpolating or predicting missing values of a time series
6. Plotting a time series
The following sections briefly describe each of the six categories listed above.

2.1 Creating an Irregular Time Series using package zoo

From this section onwards the main emphasis is on irregularly spaced data
series. However, as mentioned above, package zoo allows creating both regular and
irregular time series.
A regular series is created using the function zooreg (which is not discussed in
this document).
An irregular series is created using the function zoo.
The function zoo has three main arguments.
1. x: a numeric vector or a matrix of time series data
2. order.by: an index vector with unique entries by which the observations in x
are ordered
3. frequency: a numeric indicating the frequency of order.by. If specified, zoo
will return a regular time series
> library(zoo)
> n = 40
> # Generating 40 random values from an exponential distribution
> # and getting their cumulative sum
> timeIndex = cumsum(rexp(n, rate = 0.05))
> rTimeIndex = round(timeIndex, digits = 0)
> # Creating 40 irregularly spaced dates
> irts.Date = as.Date("2010-01-01") + rTimeIndex
> # Generating 40 random values from a normal distribution with
> # mean = 20 and SD = 2
> irtsVal = round(rnorm(40, 20, 2), digits = 2)
In the above code irts.Date was created to serve as the order.by argument and irtsVal
was generated to serve as the x argument of the function zoo. Once we have them, we
can create the zoo series (irregular time series) in the following manner.
> # Creates a zoo series with the generated values and dates
> IRTSzoo = zoo(irtsVal, irts.Date)
> head(IRTSzoo) # Printing the first six points of the zoo series
2010-01-17 2010-01-26 2010-03-08 2010-03-20 2010-05-08 2010-06-03
     21.82      18.15      17.42      21.33      20.41      19.63
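As a small illustration of the order.by argument, the sketch below builds a three-point series from made-up values and dates that are deliberately supplied out of order. The names vals, dates and z are illustrative only and are not part of the code above.

```r
library(zoo)

# Illustrative (made-up) observations, deliberately not in time order
vals  <- c(5, 2, 9)
dates <- as.Date(c("2010-01-05", "2010-01-02", "2010-01-09"))

# order.by should have unique entries; check before building the series
stopifnot(anyDuplicated(dates) == 0)

# zoo() sorts the observations by the index supplied in order.by
z <- zoo(vals, order.by = dates)

print(index(z))     # dates in increasing order
print(coredata(z))  # values rearranged to follow the sorted dates
```

Since the dates in this section come from rounded cumulative sums, two of them could in principle coincide; a check such as anyDuplicated() is one way to notice that before calling zoo().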

2.2 Writing and Reading zoo series from a text file

The objective of writing and reading a zoo series is to save the series in a text file and retrieve it from that file when needed. This can be done using the functions
write.zoo and read.zoo respectively.
write.zoo: here we need to specify the name of the zoo series, the separator and the
name of the text file
read.zoo: in order to read we need to specify the name of the text file, the format
of the index (as it was saved in the text file) and whether there is a header or
not
> # Writing the zoo series to a text file in order to use it
> # in the future
> #write.zoo(IRTSzoo, sep=" ", "IRTSdata.txt")
> # Reading the zoo series from the text file
> # After reading from the text file it is assigned to the
> # object "IRTS"
> IRTS = read.zoo("IRTSdata.txt", format="%Y-%m-%d", header=FALSE)
> head(IRTS)
2010-04-08 2010-04-18 2010-06-06 2010-06-10 2010-06-21 2010-08-15
     20.02      15.75      18.42      22.58      19.49      21.14
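The write/read round trip can be tried without touching the working directory. The following is a minimal sketch that assumes a temporary file is acceptable; the object names z, z2 and tmp are illustrative, and the three data points are borrowed from the first values printed in section 2.1.

```r
library(zoo)

# Small illustrative series
z <- zoo(c(21.82, 18.15, 17.42),
         as.Date(c("2010-01-17", "2010-01-26", "2010-03-08")))

# Write to a temporary file, without a header line
tmp <- tempfile(fileext = ".txt")
write.zoo(z, file = tmp, sep = " ", col.names = FALSE)

# Read it back, giving the date format used in the file
z2 <- read.zoo(tmp, format = "%Y-%m-%d", header = FALSE)
print(z2)

# Remove the temporary file again
unlink(tmp)
```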

2.3 Manipulating a zoo series

There are five main functions that allow us to manipulate a zoo series in different
ways.
1. coredata: extracts or replaces the data in a zoo series
2. index: extracts or replaces the time index/dates of a zoo series
3. as.yearmon: converts an index to year-month form, used to represent a monthly zoo series
4. as.yearqtr: converts an index to year-quarter form, used to represent a quarterly zoo series
5. aggregate: computes summary statistics of a zoo series (monthly means,
monthly sums etc.)

2.3.1 coredata():

> # Extracting numerical Data/values from a "zoo" series
> coredata(IRTS)
 [1] 20.02 15.75 18.42 22.58 19.49 21.14 21.33 18.10 20.03 20.47 19.38 20.41
[13] 18.69 17.69 21.05 22.79 20.36 19.86 19.41 22.83 22.20 22.81 21.25 20.55
[25] 21.00 20.53 17.60 24.30 17.71 21.90 17.53 18.21 18.58 17.60 19.86 18.30
[37] 20.31 24.03 21.48 16.62

2.3.2 index():

> # Extracting index/Dates from a "zoo" series
> head(index(IRTS))
[1] "2010-04-08" "2010-04-18" "2010-06-06" "2010-06-10" "2010-06-21"
[6] "2010-08-15"
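Both coredata and index also have replacement forms, which the extraction examples above do not show. The sketch below uses a made-up three-point series z (an illustrative name) to replace first the values and then the dates.

```r
library(zoo)

# Illustrative three-point series
z <- zoo(c(10, 20, 30),
         as.Date(c("2010-04-08", "2010-04-18", "2010-06-06")))

# Replace the data while keeping the dates
coredata(z) <- coredata(z) / 10

# Replace the dates while keeping the data (shift every date by one day)
index(z) <- index(z) + 1

print(z)
```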

2.3.3 yearmon():

> # Function "yearmon" facilitates representing a monthly zoo series
> as.yearmon(index(IRTS))
 [1] "Apr 2010" "Apr 2010" "Jun 2010" "Jun 2010" "Jun 2010" "Aug 2010"
 [7] "Aug 2010" "Aug 2010" "Sep 2010" "Sep 2010" "Sep 2010" "Oct 2010"
[13] "Oct 2010" "Oct 2010" "Nov 2010" "Feb 2011" "Mar 2011" "Apr 2011"
[19] "May 2011" "Aug 2011" "Aug 2011" "Aug 2011" "Sep 2011" "Oct 2011"
[25] "Nov 2011" "Dec 2011" "Jan 2012" "Jan 2012" "Feb 2012" "Feb 2012"
[31] "Feb 2012" "Feb 2012" "Mar 2012" "Mar 2012" "Mar 2012" "Mar 2012"
[37] "Mar 2012" "Mar 2012" "Apr 2012" "May 2012"

2.3.4 yearqtr():

> # Function "yearqtr" facilitates representing a quarterly zoo series
> as.yearqtr(index(IRTS))
 [1] "2010 Q2" "2010 Q2" "2010 Q2" "2010 Q2" "2010 Q2" "2010 Q3" "2010 Q3"
 [8] "2010 Q3" "2010 Q3" "2010 Q3" "2010 Q3" "2010 Q4" "2010 Q4" "2010 Q4"
[15] "2010 Q4" "2011 Q1" "2011 Q1" "2011 Q2" "2011 Q2" "2011 Q3" "2011 Q3"
[22] "2011 Q3" "2011 Q3" "2011 Q4" "2011 Q4" "2011 Q4" "2012 Q1" "2012 Q1"
[29] "2012 Q1" "2012 Q1" "2012 Q1" "2012 Q1" "2012 Q1" "2012 Q1" "2012 Q1"
[36] "2012 Q1" "2012 Q1" "2012 Q1" "2012 Q2" "2012 Q2"
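To see what the two classes hold, the following sketch converts three dates taken from the index printed above. The variable names d, ym and yq are illustrative. It shows the numeric value behind each class and the conversion back to a Date.

```r
library(zoo)

d <- as.Date(c("2010-04-08", "2010-08-15", "2012-05-13"))

ym <- as.yearmon(d)   # year-month representation of each date
yq <- as.yearqtr(d)   # year-quarter representation of each date

# Internally both are numeric: year plus the fraction of the year elapsed
print(as.numeric(ym))
print(as.numeric(yq))

# Converting back to Date gives the first day of the month or quarter
print(as.Date(ym))
print(as.Date(yq))
```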

2.3.5 aggregate():

The following code first calculates monthly means: the zoo series is split into
monthly subsets using as.yearmon, and the mean of each subset is extracted via the function aggregate. This can be considered an indirect way of converting an irregular time series to a
regular one. Similarly, the second call calculates quarterly means
and is thus similar to converting an irregular time series to a regular quarterly time
series.
> #### Computing summary statistics using function "aggregate"
> monMeans = aggregate(IRTS, as.yearmon(index(IRTS)), mean); monMeans
Apr 2010 Jun 2010 Aug 2010 Sep 2010 Oct 2010 Nov 2010 Feb 2011 Mar 2011
17.88500 20.16333 20.19000 19.96000 18.93000 21.05000 22.79000 20.36000
Apr 2011 May 2011 Aug 2011 Sep 2011 Oct 2011 Nov 2011 Dec 2011 Jan 2012
19.86000 19.41000 22.61333 21.25000 20.55000 21.00000 20.53000 20.95000
Feb 2012 Mar 2012 Apr 2012 May 2012
18.83750 19.78000 21.48000 16.62000
> qtrMeans = aggregate(IRTS, as.yearqtr(index(IRTS)), mean); qtrMeans
 2010 Q2  2010 Q3  2010 Q4  2011 Q1  2011 Q2  2011 Q3  2011 Q4  2012 Q1
19.25200 20.07500 19.46000 21.57500 19.63500 22.27250 20.69333 19.66083
 2012 Q2
19.05000
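The grouping can be checked by hand on a small subset. The sketch below rebuilds the first five observations of IRTS as printed in this document (the object names z, monMeans and monSums are local to the sketch) and compares the April 2010 mean with a direct calculation. Swapping mean for sum gives monthly sums.

```r
library(zoo)

# First five observations of IRTS as printed in this document
z <- zoo(c(20.02, 15.75, 18.42, 22.58, 19.49),
         as.Date(c("2010-04-08", "2010-04-18", "2010-06-06",
                   "2010-06-10", "2010-06-21")))

# Group by calendar month and average within each group
monMeans <- aggregate(z, as.yearmon(index(z)), mean)
print(monMeans)

# The same grouping with a different summary function (monthly sums)
monSums <- aggregate(z, as.yearmon(index(z)), sum)
print(monSums)

# Hand calculation of the April 2010 mean for comparison
print((20.02 + 15.75) / 2)
```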

2.4 General Functions associated with zoo series

In this section three functions of the package zoo are considered. The first
checks whether a zoo series is regular or not. The second and third are the
lagging and differencing functions for a time series. These functions are discussed in
the following subsections.

2.4.1 Testing regularity of a Time Series

The function is.regular checks whether a series of ordered observations has
an underlying regularity or is even strictly regular. It takes
two arguments. The first is the object representing the series of ordered observations.
The second is a logical argument, strict, indicating whether to check for strict regularity.
A time series can either be irregular (unequally spaced), strictly regular (equally
spaced) or have an underlying regularity, i.e., be created from a regular series by
omitting some observations. Here, the latter property is called regular. Consequently,
regularity follows from strict regularity but not vice versa. Thus, to detect unequal
spacing with is.regular we need to set the logical argument strict to
TRUE.
> # Logical test to check whether the series is a regular one or not
> # Here "strict=TRUE" is important, as otherwise even a weakly regular
> # series is considered a regular series
> is.regular(IRTS, strict=TRUE)
[1] FALSE
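The difference between the two notions of regularity can be seen on a toy example. In the sketch below, zDaily and zGap are illustrative names: the first is an equally spaced daily series and the second is the same series with one observation omitted.

```r
library(zoo)

days <- as.Date("2010-01-01") + 0:5

# Equally spaced daily series
zDaily <- zoo(1:6, days)

# The same series with the third observation omitted
zGap <- zDaily[-3]

print(is.regular(zDaily, strict = TRUE))  # equally spaced
print(is.regular(zGap))                   # underlying regularity only
print(is.regular(zGap, strict = TRUE))    # no longer equally spaced
```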

2.4.2 Lagging and Differencing a Time Series

In time series analysis lagging and differencing are two of the most frequently used
techniques. Package zoo allows us to apply the functions lag and diff, familiar from
base R, to zoo series. Strictly speaking,
the function diff belongs to the base package and the function lag belongs to the stats
package.
Both functions take the zoo series as their first argument. For the function lag the
second argument, k, is the number of lags. For the function diff the second argument,
lag, is the lag over which differences are taken; the order of differencing is set
separately through the argument differences.
However, both functions have an additional argument when used with package
zoo (which is not available in the base or the stats packages): the logical
argument na.pad. If na.pad is TRUE, any times that would not otherwise
have been in the result are added with a value of NA. If FALSE, those times are dropped.
The following R code demonstrates the use of the functions lag and diff on an
irregular time series under the package zoo.

2.4.3 lag():

> #### Computing Lags
> lag(IRTS, k=1)
2010-04-08 2010-04-18 2010-06-06 2010-06-10 2010-06-21 2010-08-15 2010-08-23
     15.75      18.42      22.58      19.49      21.14      21.33      18.10
2010-08-24 2010-09-14 2010-09-15 2010-09-28 2010-10-06 2010-10-11 2010-10-29
     20.03      20.47      19.38      20.41      18.69      17.69      21.05
2010-11-04 2011-02-01 2011-03-07 2011-04-04 2011-05-16 2011-08-01 2011-08-16
     22.79      20.36      19.86      19.41      22.83      22.20      22.81
2011-08-20 2011-09-30 2011-10-14 2011-11-09 2011-12-04 2012-01-21 2012-01-28
     21.25      20.55      21.00      20.53      17.60      24.30      17.71
2012-02-11 2012-02-20 2012-02-26 2012-02-29 2012-03-06 2012-03-08 2012-03-12
     21.90      17.53      18.21      18.58      17.60      19.86      18.30
2012-03-14 2012-03-27 2012-03-30 2012-04-19
     20.31      24.03      21.48      16.62
> ## "na.pad" is an additional feature in function "lag" compared to
> ## the traditional lag function in the stats package
> lag(IRTS, k=1, na.pad=TRUE)
2010-04-08 2010-04-18 2010-06-06 2010-06-10 2010-06-21 2010-08-15 2010-08-23
     15.75      18.42      22.58      19.49      21.14      21.33      18.10
2010-08-24 2010-09-14 2010-09-15 2010-09-28 2010-10-06 2010-10-11 2010-10-29
     20.03      20.47      19.38      20.41      18.69      17.69      21.05
2010-11-04 2011-02-01 2011-03-07 2011-04-04 2011-05-16 2011-08-01 2011-08-16
     22.79      20.36      19.86      19.41      22.83      22.20      22.81
2011-08-20 2011-09-30 2011-10-14 2011-11-09 2011-12-04 2012-01-21 2012-01-28
     21.25      20.55      21.00      20.53      17.60      24.30      17.71
2012-02-11 2012-02-20 2012-02-26 2012-02-29 2012-03-06 2012-03-08 2012-03-12
     21.90      17.53      18.21      18.58      17.60      19.86      18.30
2012-03-14 2012-03-27 2012-03-30 2012-04-19 2012-05-13
     20.31      24.03      21.48      16.62         NA

2.4.4 diff():

> #### Differencing an irregular time series
> diff(IRTS, -1)
2010-04-08 2010-04-18 2010-06-06 2010-06-10 2010-06-21 2010-08-15 2010-08-23
     -4.27       2.67       4.16      -3.09       1.65       0.19      -3.23
2010-08-24 2010-09-14 2010-09-15 2010-09-28 2010-10-06 2010-10-11 2010-10-29
      1.93       0.44      -1.09       1.03      -1.72      -1.00       3.36
2010-11-04 2011-02-01 2011-03-07 2011-04-04 2011-05-16 2011-08-01 2011-08-16
      1.74      -2.43      -0.50      -0.45       3.42      -0.63       0.61
2011-08-20 2011-09-30 2011-10-14 2011-11-09 2011-12-04 2012-01-21 2012-01-28
     -1.56      -0.70       0.45      -0.47      -2.93       6.70      -6.59
2012-02-11 2012-02-20 2012-02-26 2012-02-29 2012-03-06 2012-03-08 2012-03-12
      4.19      -4.37       0.68       0.37      -0.98       2.26      -1.56
2012-03-14 2012-03-27 2012-03-30 2012-04-19
      2.01       3.72      -2.55      -4.86
> ## "na.pad" is an additional feature in function "diff"
> ## compared to the traditional
> ## differencing function in base package
> diff(IRTS, -1, na.pad=TRUE)
2010-04-08 2010-04-18 2010-06-06 2010-06-10 2010-06-21 2010-08-15 2010-08-23
     -4.27       2.67       4.16      -3.09       1.65       0.19      -3.23
2010-08-24 2010-09-14 2010-09-15 2010-09-28 2010-10-06 2010-10-11 2010-10-29
      1.93       0.44      -1.09       1.03      -1.72      -1.00       3.36
2010-11-04 2011-02-01 2011-03-07 2011-04-04 2011-05-16 2011-08-01 2011-08-16
      1.74      -2.43      -0.50      -0.45       3.42      -0.63       0.61
2011-08-20 2011-09-30 2011-10-14 2011-11-09 2011-12-04 2012-01-21 2012-01-28
     -1.56      -0.70       0.45      -0.47      -2.93       6.70      -6.59
2012-02-11 2012-02-20 2012-02-26 2012-02-29 2012-03-06 2012-03-08 2012-03-12
      4.19      -4.37       0.68       0.37      -0.98       2.26      -1.56
2012-03-14 2012-03-27 2012-03-30 2012-04-19 2012-05-13
      2.01       3.72      -2.55      -4.86         NA

The following time series plot shows the original time series; it was plotted in order to
allow comparison with the time series plot of the first-order differences.
> plot(IRTS, xlab="Index", ylab="Values",
+      main="Time Series Plot - without differencing")
> points(IRTS, col="red", pch=20)

[Figure: Time Series Plot - without differencing (x-axis: Index, y-axis: Values)]

> plot(diff(IRTS, -1, na.pad=TRUE), xlab="Index",
+      ylab="Values - 1st Differencing",
+      main="Time Series Plot - First Order of Difference")
> points(diff(IRTS, -1, na.pad=TRUE), col="red", pch=20)

[Figure: Time Series Plot - First Order of Difference (x-axis: Index, y-axis: Values - 1st Differencing)]
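Because the full listings are long, the effect of na.pad is easier to follow on a four-point series. The sketch below uses made-up values and the default positive lag for diff (the listings above use a lag of -1); the name z is illustrative.

```r
library(zoo)

# Four made-up observations at irregular dates
z <- zoo(c(10, 13, 11, 18), as.Date("2012-03-01") + c(0, 2, 7, 8))

# k = 1 moves each value one observation earlier; the last date is dropped
print(lag(z, k = 1))

# With na.pad = TRUE the last date is kept and filled with NA
print(lag(z, k = 1, na.pad = TRUE))

# Default differencing: each value minus the previous observation,
# regardless of how many days lie between them
print(diff(z))

# With na.pad = TRUE the first date is kept and filled with NA
print(diff(z, na.pad = TRUE))
```

Note that both functions work on the sequence of observations, not on calendar time, so a lag of 1 means "one observation", not "one day".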

2.5 Interpolating / Predicting Missing values of a Time Series

This section focuses on interpolating or predicting the missing values of a time
series. There are five main methods for doing so. They are as follows,
1. na.locf: this method replaces each NA with the most recent non-NA prior
to it. We can instead replace each NA with the following non-NA by setting the logical
argument fromLast to TRUE.
2. na.fill: this function fills NA values or specified positions. In this function
we mainly need to specify two arguments, of which the first is the zoo
object. The second is the argument fill, which is a three
component list or a vector that is coerced to a list. The three components represent the fill value to the left of the data, within the interior of the data and
to the right of the data, respectively. The value of any component may be the
keyword "extend" to indicate repetition of the leftmost or rightmost non-NA
value or linear interpolation in the interior.
3. na.approx: a function which replaces NA values via linear interpolation.
4. na.spline: a function which replaces NA values via cubic spline interpolation
5. na.StructTS: fills NA values using a seasonal Kalman filter. Ideal when you
have seasonal variations in the time series data; it mainly works with regular
time series, as the input object should have a frequency.
The following code demonstrates each of the interpolating methods discussed above. However, the 5th method, that is estimation via a seasonal Kalman filter, does not work with
this particular irregular time series as there is no frequency present in this data.
In order to compare the alternative interpolating methods, the 3rd
value of the time series was changed to NA. Then the 3rd value of the series was
estimated using each of the interpolating methods, and those estimated values are
shown in the time series plot given below.
> # 5 ways of interpolating a zoo series
> newIRTS = IRTS
> # Assigning an "NA" to the third value of the zoo series
> coredata(newIRTS)[3] = NA
> na.locf(newIRTS)            # Last observation carried forward
> na.fill(newIRTS, "extend")  # Linear interpolation
> na.approx(newIRTS)          # Linear interpolation
> na.spline(newIRTS)          # Cubic spline interpolation
> #na.StructTS(newIRTS)       # Interpolates using seasonal Kalman filter
> # Visualizing the interpolating methods using a plot
> # Plotting the IRTS series with an "NA" as the third value of the series
> plot(newIRTS, type = "o", xlab="Index", ylab="Value", pch = 20)
> # Estimating third value of the series using the
> # last observation carried forward method
> points(na.locf(newIRTS)[3], col = "red", pch = 15)
> # Estimating third value of the series using na.fill with "extend"
> points(na.fill(newIRTS, "extend")[3], col = "blue", pch = 15)
> # Estimating third value of the series using linear interpolation
> points(na.approx(newIRTS)[3], col = "green", pch = 15)
> # Estimating third value of the series using cubic spline interpolation
> points(na.spline(newIRTS)[3], col = "orange", pch = 15)
> # Assigning the actual value to the third value of the series
> coredata(newIRTS)[3] = coredata(IRTS)[3]
> points(newIRTS[3], col = "black", pch = 15)
> # Creating the legend
> legend(x = "bottom", legend = c("na.locf", "na.fill", "na.approx",
+        "na.spline", "Actual"), col=c("red","blue","green",
+        "orange", "black"), pch = 15)
> # The following method does not work for this particular data set
> # as the series has no frequency
> #points(na.StructTS(IRTS)[3])

[Figure: newIRTS with the third value estimated by each method; legend: na.locf, na.fill, na.approx, na.spline, Actual (x-axis: Index, y-axis: Value)]
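The methods can also be compared numerically on a series small enough to check by hand. The following is a minimal sketch with made-up values (z is an illustrative name): the missing value sits one day after the first observation and two days before the next, so linear interpolation in time gives 1 + (4 - 1) * (1/3) = 2.

```r
library(zoo)

# Made-up values observed on day 0, 1, 3, 4 and 6; the second one is missing
z <- zoo(c(1, NA, 4, 6, 5), as.Date("2010-06-01") + c(0, 1, 3, 4, 6))

print(na.locf(z))                   # carries the previous value (1) forward
print(na.locf(z, fromLast = TRUE))  # carries the next value (4) backward
print(na.fill(z, 0))                # fills every NA with the constant 0
print(na.approx(z))                 # linear interpolation along the time index
print(na.spline(z))                 # cubic spline through the non-NA points
```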

2.6 Making an autoplot with the zoo series

Package zoo allows different types of plots to be drawn, whether the series is a regular or an irregular
time series. There are three main plotting functions:
plot(): the usual plot function in R, which has already been used in
this document to draw the time series plots in the section on differencing.
ggplot(): requires the package ggplot2. Here we need to specify three main
components. aes specifies the aesthetic mapping (which variables go on which axes). The data can be supplied using the function
fortify, which takes a zoo object and converts it into a data frame. Then we need to specify a geom, i.e. whether
it is a line, point, ribbon etc.
autoplot(): this is very similar to ggplot(). What autoplot()
does is take a zoo object and return a ggplot object.
The following R code creates an autoplot for the time series that was discussed throughout
this document.
> # Making an "autoplot" with zoo
> library(ggplot2)
> # geom: character specifying which geom to use (line, point, ribbon etc.)
> autoplot(IRTS) + geom_point(col = "red") + xlab("Index") + ylab("Values")

[Figure: autoplot of IRTS with the observations marked as red points (x-axis: Index, y-axis: Values)]
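For comparison, the ggplot() route described above might look as follows. This is a sketch under the assumption that ggplot2 is installed; it uses four points from IRTS, and the names z, df and p are illustrative. With melt = TRUE, fortify.zoo returns a long data frame whose columns are named Index, Series and Value.

```r
library(zoo)
library(ggplot2)

# Four observations of IRTS as printed in this document
z <- zoo(c(20.02, 15.75, 18.42, 22.58),
         as.Date(c("2010-04-08", "2010-04-18", "2010-06-06", "2010-06-10")))

# Convert the zoo object to a long data frame
# (columns Index, Series and Value)
df <- fortify.zoo(z, melt = TRUE)
print(df)

# Map the date to the x axis and the value to the y axis, then add geoms;
# calling print(p) would draw the plot
p <- ggplot(df, aes(x = Index, y = Value)) +
  geom_line() +
  geom_point(col = "red") +
  xlab("Index") + ylab("Values")
```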
