
Statistics 133: Concepts in Computing with Data

Instructor: Dr. Cari Kaufman


cgk@stat.berkeley.edu

GSI: Daisy Huang


yanhuang@stat.berkeley.edu
What Are Data?
Numbers

Example: Traffic on I-80


Text

Example: SPAM or HAM?


Images, video, or audio

Example: Mary Jane ski area and Rifle Sight trail

Height taken from a digital elevation model, with overlaid high-resolution photograph.

Plan your descent through the bumps and go for it.


Bump skiing does not get much harder than this. This pitch
is a long one and typically does not have much loose snow, so
technique is important even if you decide to traverse across to the
left to lose some speed. Bear to skier's left at the bottom of this
pitch and finish out the run on Feebleminded. Look for good snow on
the sides. However you get down this run you should feel like you skied
something hard and a bit wild -- and done it in view of all the folks comfortably
sitting on the SuperGauge chairs. You will not find many other black runs that
will stretch you like Riflesight Notch. - From the Mary Jane Project
Meta-data

Example: Shelters along the Appalachian Trail


Course Expectations
Getting Started with R
Why use R?

Some of you may have used statistical software with a


GUI, like Minitab. You may also be familiar with other
programming languages, like C, Java, Python, etc.

In this class, we’ll use the R programming language and


environment as our “home base” for performing many
data analytic tasks.

Some benefits of R:
• Allows custom analyses and easy replicability
• High level language designed for statistics
• Active user community, lots of add-ons
• It’s free!
A screenshot from http://www.R-project.org/
R can be run in interactive or batch modes. The
interactive mode is useful for trying out new analyses and
making sure your code is doing what you think it is. The
batch mode is useful for carrying out pre-defined analyses
in the background.

For now, we’ll focus on the interactive mode.

When you fire up R, you’ll see a prompt, like this:
At the prompt, you can type an expression. An expression
is a combination of letters/numbers/symbols which are
interpreted by a particular programming language
according to its rules. It then returns a value. We can also
say it evaluates to that value.

> 3 + 5
[1] 8
> 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15 16 17 18 19 20
>
> # This is a comment
>
> 30 + 10 / # I'm not done typing
+ 2
[1] 35
To store a value, we can assign it to a variable.

> x1 <- 32 %% 5
> print(x1)
[1] 2
> x2 <- 32 %/% 5
> x2 # In interactive mode, this prints the object
[1] 6
> ls() # List all my variables
[1] "x1" "x2"
> rm(x2) # Remove a variable
> ls()
[1] "x1"
Variable names must follow some rules:

• May not start with a digit or underscore (_)


• May contain numbers, characters, and some
punctuation - period and underscore are ok, but most
others are not
• Case-sensitive, so x and X are different
Advice on variable names:

• Use meaningful names


• Avoid names that already have a meaning in R. If in
doubt, check:
> exists("pi")
[1] TRUE
There are several ways to save your objects for later.

You can use the save and load functions to save specific
variables.

> save(x1, file = "x1.RData")


> rm(x1)
> ls()
character(0)
> load(file = "x1.RData")
> ls()
[1] "x1"

When you quit R, you’ll be asked whether you want to


save ALL the contents of your current workspace.
> q()
Save workspace image? [y/n/c]:
A function is a portion of code that performs a specific
task. Usually it takes some inputs, performs some
computations, and returns a value.

The inputs are called arguments to the function. When you use a function with a particular set of arguments, you are said to be calling the function. The computer evaluates the function call and returns the output.

For now, we’ll work with R’s built-in functions, and the
most important things to know are how to call the
function and how to get help when you need it.
First, determine the arguments. Note that mean and sd have default values.

> args(rnorm)
function (n, mean = 0, sd = 1)
NULL
> args(plot)
function (x, y, ...)

The "..." argument is special and we’ll talk about it later.


When you call a function, you can specify the arguments
either by position or by name, or a combination.
> x <- 1:100
> y <- rnorm(100, sd = x) # Combination
> plot(x, y) # By position
[Scatterplot of y against x]
> help(rnorm) # A shortened version of the real page:
Normal                 package:stats                 R Documentation

The Normal Distribution

Description:
Random generation for the normal distribution with
mean equal to 'mean' and standard deviation equal to 'sd'.

Usage:
rnorm(n, mean = 0, sd = 1)

Arguments:
n: number of observations.
mean: vector of means.
sd: vector of standard deviations.
Details:
If 'mean' or 'sd' are not specified they assume the
default values of 0 and 1, respectively.

Value:
'rnorm' generates random deviates.

Source:
See RNG for how to select the algorithm and for
references to the supplied methods.

References:
Becker, R. A., Chambers, J. M. and Wilks, A. R.
(1988) _The New S Language_. Wadsworth & Brooks/Cole.

See Also:
'runif' and '.Random.seed'

Examples:
...
R has a number of built-in data types. The three most
basic types are numeric, character, and logical.

You can check the type using the mode function.

> mode(3.5)
[1] "numeric"
> mode("Hello")
[1] "character"
> mode(2 < 3)
[1] "logical"

Actually, the three types are numeric, character, and


logical vectors. There’s no such thing as a scalar in R, just a
vector of length one.
A vector in R is a collection of values of the same type.
You can join vectors together using the c (for
“concatenate”) function.

> c(1.3, 2, 8/3)


[1] 1.300000 2.000000 2.666667
> c("a", "l", "q")
[1] "a" "l" "q"
> c(TRUE, FALSE, FALSE)
[1] TRUE FALSE FALSE
>
> c(1, 2, FALSE)
[1] 1 2 0
> c(1, 2, "c")
[1] "1" "2" "c"

The last two expressions illustrate implicit coercion. You


should try to avoid this in most situations.
The elements of a vector can have names.

> unfair.coin <- c("heads" = 0.55, "tails" = 0.45)


> unfair.coin
heads tails
0.55 0.45
> names(unfair.coin)
[1] "heads" "tails"
>
> # Another way to do it
> fair.coin <- c(0.5, 0.5)
> names(fair.coin) <- names(unfair.coin)
> fair.coin
heads tails
0.5 0.5
There are five ways to extract elements of a vector.
> unfair.coin[1] # 1) Inclusion by position
heads
0.55
> unfair.coin[-1] # 2) Exclusion by position
tails
0.45
> unfair.coin["heads"] # 3) By name
heads
0.55
> unfair.coin[unfair.coin > 0.5] # 4) By logical index
heads
0.55
> unfair.coin[] # 5) No index (include everything)
heads tails
0.55 0.45
A few announcements:

If you haven’t gotten your computer account, be sure to


email Daisy (yanhuang@stat.berkeley.edu) ASAP.

If you are just joining the course this week, please see me
after class, in office hours, or send me an email if you have
not done so already.
Last time in our introduction to R, we learned how to

• start and quit R in interactive mode


• do basic calculations in R
• assign, print, list, and remove variables
• save some or all of the variables in the workspace
• find the arguments for a function and use the help
system
• create numeric, character, and logical vectors and
concatenate them using c()
• name the elements of a vector
• extract elements of a vector five different ways
A few notes before we move on...

You can assign values to variables using = rather than <- if


you like.

You can use c to concatenate existing vectors.


> x1 <- c(1, 8)
> x2 <- 2:5
> x3 <- 4
> c(x1, x2, x3)
[1] 1 8 2 3 4 5 4

Remember that unlike other languages you may have used, R does not start indexing with 0. Also, it does not allow mixing of positive and negative subscripts. (Why not?)
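A quick illustration (a small aside, not from the original slides; the exact error message may vary by R version):

> x <- 1:5
> x[0]          # zero is not a valid position; this returns an empty vector
integer(0)
> x[c(1, -2)]   # mixing positive and negative subscripts raises an error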
Indexing by exclusion can be used to remove elements of
a vector.
> x <- 1:5
> x
[1] 1 2 3 4 5
> x <- x[-c(1,3,5)]
> x
[1] 2 4

There was a question about indexing by name when the


names are not unique. It appears that R returns only the
first element with that name. So I’d avoid repeating
names.
> x <- 1:2
> names(x) <- c("a", "a")
> x["a"]
a
1
Today, we’ll cover

• missing values and other special values


• assigning parts of a vector using indexing
• vector arithmetic and the recycling rule
• making patterned vectors
• some built-in summary functions for vectors
• basic manipulation of character vectors
• logical vectors and Boolean algebra
• a new data type: factors
Next time: more complicated data structures, reading
data into R
Next week: graphics
The missing value symbol is NA. Note that this is different
from “NA”, so don’t include the quotation marks. You can
check for the presence of NA values using the is.na
function.

> x <- c(1, 5, NA)


> is.na(x)
[1] FALSE FALSE TRUE

Other special values are NaN, for "not a number," which typically arises when you try to compute an indeterminate form such as 0/0. The result of dividing a non-zero number by zero is Inf (or -Inf).
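For example:

> 0/0
[1] NaN
> 1/0
[1] Inf
> -1/0
[1] -Inf
> is.nan(0/0)
[1] TRUE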
In general, the same indexing may be used to assign values
to elements of a vector. Make sure the vector exists
first, or you will get an error.

Can you guess what x will look like after each of the
following lines?

> x <- 1:10


> names(x) <- letters[1:10]
> x[1:2] <- 2:1 # By inclusion
> x[-(1:2)] <- 10:3 # By exclusion
> x["a"] <- 100 # By name
> x[x==100] <- NA # By logical index
> x[] <- 10 # No index
> x <- 10 # Watch out - what happens here?
A very important feature of R is that it can carry out
vectorized calculations. What this means is that basic
arithmetic, as well as many built-in R functions, will
operate on each element of a vector. This avoids much of
the looping that’s used in lower-level languages.

> x <- 1:3


> x * 10
[1] 10 20 30
> x^2
[1] 1 4 9
> y <- 0:2
> x + y
[1] 1 3 5
> x / y
[1] Inf 2.0 1.5
When the vectors in a calculation are of different lengths,
R follows the recycling rule. That is, it starts repeating
elements from the shorter one.

> x <- 1:3


> y <- 1:2
> x + y
[1] 2 4 4
Warning message:
In x + y : longer object length is not a multiple of
shorter object length

We’ve actually used this before. It would be a good


exercise for you to go through the notes so far and
identify where R is applying the recycling rule.
R has a number of built-in functions for making patterned
vectors, including seq and rep. We’ve seen “:” many times,
which is just a special case of the seq function.

> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(0, 10, by = 2)
[1] 0 2 4 6 8 10
> seq(0, 0.5, length = 6)
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> seq(1, 0, by = -0.1)
[1] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
> rep(c(0, 1), times = 5)
[1] 0 1 0 1 0 1 0 1 0 1
> rep(letters[1:5], each = 2)
[1] "a" "a" "b" "b" "c" "c" "d" "d" "e" "e"
R also has many built-in summary functions.

> x <- rnorm(100)


> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.92500 -0.71430 -0.19300 -0.09377 0.49810 2.68200
> mean(x)
[1] -0.09377121
> min(x)
[1] -1.925202
> max(x)
[1] 2.682179
> range(x)
[1] -1.925202 2.682179
> length(x)
[1] 100
> sum(x)
[1] -9.377121
> prod(x)
[1] 1.482105e-25
A handy way to make patterned character vectors is to
use the paste function.

> args(paste)
function (..., sep = " ", collapse = NULL)
NULL

The help says . . . represents “one or more R objects, to


be converted to character vectors.” This actually depends
on the function, but “one or more R objects” is a good
way to think of it for now. For another example, see the
help for c().

Type help(paste) to see more about how this function


works.
Some examples using paste
> paste("Iteration", 1:3)
[1] "Iteration 1" "Iteration 2" "Iteration 3"
> paste("Iteration", 1:3, sep = "")
[1] "Iteration1" "Iteration2" "Iteration3"
> words <- c("Hi", "everyone")
> paste(words, collapse = " ")
[1] "Hi everyone"
> paste(letters[1:5], collapse = "-")
[1] "a-b-c-d-e"
> paste("Iteration", 1:3, sep = "", collapse = "-")
[1] "Iteration1-Iteration2-Iteration3"
The substr function allows us to extract parts of a string.

> some.letters <- paste(letters[1:5], collapse = "-")


> some.letters
[1] "a-b-c-d-e"
> substr(some.letters, start = 1, stop = 3)
[1] "a-b"

It also allows us to assign parts of a string.

> substr(some.letters, start = 1, stop = 3) <- "A*B"


> some.letters
[1] "A*B-c-d-e"

We’ll talk a lot more about working with text later in the
course.
We learned that one of the three main data types in R is
a logical vector, which is either TRUE or FALSE. To
understand how R operates on logical vectors, you need
to know a bit about Boolean algebra.

Boolean algebra is a mathematical formalization of the


truth or falsity of statements. It has three operations,
which we’ll call “not,” “or,” and “and.” Boolean algebra
tells us how to evaluate the truth or falsity of compound
statements that are built using these operations. For
example, if A and B are statements, some compound
statements are

A and B
(not A) or B
The “not” operation just causes the statement following it
to switch its truth value. So (not TRUE) is FALSE and
(not FALSE) is TRUE. The compound statement A and B is
TRUE only if both A and B are TRUE. The compound
statement A or B is TRUE if either or both A or B is TRUE.

In R, we write ! for “not,” & for “and,” and | for “or.”


Note: all of these are vectorized!
> A <- c(TRUE, TRUE, FALSE, FALSE)
> B <- c(TRUE, FALSE, TRUE, FALSE)
> !A
[1] FALSE FALSE TRUE TRUE
> A & B
[1] TRUE FALSE FALSE FALSE
> A | B
[1] TRUE TRUE TRUE FALSE
We often need to test various conditions using the
relational operators. Again, these are vectorized and follow
the recycling rule.
> x <- 1:5
> x > 2
[1] FALSE FALSE TRUE TRUE TRUE
> x < 2
[1] TRUE FALSE FALSE FALSE FALSE
> x == 2
[1] FALSE TRUE FALSE FALSE FALSE
> x >= 2
[1] FALSE TRUE TRUE TRUE TRUE
> x <= 2
[1] TRUE TRUE FALSE FALSE FALSE
> x != 2
[1] TRUE FALSE TRUE TRUE TRUE
Two other useful functions that operate on logical vectors
are all and any. Can you guess what they do?
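Here is a small check of your guess:

> x <- c(2, 4, 6)
> any(x > 5)   # TRUE if at least one element satisfies the condition
[1] TRUE
> all(x > 5)   # TRUE only if every element satisfies the condition
[1] FALSE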

Logical vectors in R are just special representations of numeric vectors filled with 1’s and 0’s. Treating them as 1’s and 0’s in calculations where we’d otherwise use their numeric value is one of those instances in which implicit coercion is ok, even helpful.

> x <- rnorm(1000)


> sum(x > 0) # Number of times the condition is TRUE
[1] 468
> mean(x > 0) # Proportion of times the condition is TRUE
[1] 0.468
> y <- x * (x > 0) # Multiplying by an indicator variable
> min(y)
[1] 0
Factors are a special storage class in R used for
categorical data.

> group <- rep(c("control", "treatment"), each = 2)


> group
[1] "control" "control" "treatment" "treatment"
> group <- factor(group)
> group
[1] control control treatment treatment
Levels: control treatment
> levels(group)
[1] "control" "treatment"

Because the levels of a factor are internally coded as


integers, this is more efficient than using character
vectors. However, we still have the advantage of seeing
what the levels represent (rather than just the integer
codes).
Announcement: There will be another “Short
Assignment” posted later today and due Monday night.

Today’s topics

• Data structures galore:


matrices, arrays, data frames, and lists
• More ways to operate efficiently on entire data
structures and avoid looping
You can create a matrix in R using the matrix function. By
default, matrices in R are assigned by column-major order.
You can assign them by row-major order by setting the
byrow argument to TRUE. Note that the first argument to
matrix is a vector, so all elements must be of the same
type (numeric, character, or logical).
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Assign names to the rows and columns of a matrix:
> rownames(m) <- letters[1:2]
> colnames(m) <- letters[1:3]
> m
a b c
a 1 2 3
b 4 5 6

Find the dimensions of a matrix:


> dim(m); nrow(m); ncol(m)
[1] 2 3
[1] 2
[1] 3

Exchange rows and columns:


> t(m) # t for transpose
a b
a 1 4
b 2 5
c 3 6
To index elements of a matrix, use the same five methods
of indexing we covered for vectors, but with the first
index for rows and the second for columns.

Note: by default the result is coerced to a vector if


possible, rather than a matrix with a single row or
column.

Can you guess what each line returns?


> m
a b c
a 1 3 5
b 2 4 6
> m[-1, 2] # Exclusion & inclusion by position
> m["a",] # By name, empty column index
> m[, c(TRUE, TRUE, FALSE)] # Empty row index, logical
To avoid the coercion to lower dimension, add the
argument drop = FALSE to the indexing.
> m[1, 1, drop = FALSE]
a
a 1
> m[1, , drop = FALSE] # Note empty column index!
a b c
a 1 2 3

You can also index a matrix using a single index, as you


would for a vector. The ordering is again in column-major
order. (How could you change this?)
> m
a b c
a 1 2 3
b 4 5 6
> m[2]
[1] 4
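One possible answer to the question above (just an illustration, not the only way): index the transpose, which effectively walks through m in row-major order.

> t(m)[2]   # second element of m reading across the first row
[1] 2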
An array is like a matrix, but with arbitrary dimension.
The first argument is still a vector of elements to fill the
array, but the second argument is a vector of sizes in each
dimension.

> array(1:24, dim = c(2, 3, 4)) # A 3-D array
, , 1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3
     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

(and so on for , , 4)

Note: entries are filled in such that the first index varies the fastest and the last index varies the slowest.
Data frames are also like matrices, but columns can be of
different types.

Example: Number of cars on Friday, 6th and Friday, 13th:

> cars
Year Month Cars6 Cars13 Junction
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
3 1991 September 137055 136018 7 to 8
4 1991 September 133732 131843 9 to 10
5 1991 December 123552 121641 7 to 8
6 1991 December 121139 118723 9 to 10
7 1992 March 128293 125532 7 to 8
8 1992 March 124631 120249 9 to 10
9 1992 November 124609 122770 7 to 8
10 1992 November 117584 117263 9 to 10
The data.frame function will extract column names either
from arguments with a name = value construction, or from
the arguments themselves.

> data.frame(let = letters[1:2], val = 1:2)


let val
1 a 1
2 b 2
> grp <- factor(rep(c("Control", "Treatment"), each = 2))
> effect <- rnorm(4, mean = rep(c(0, 10), each = 2))
> data.frame(grp, effect)
grp effect
1 Control 0.4145526
2 Control 0.6052182
3 Treatment 10.4363078
4 Treatment 10.7534556
> data.frame(1:2, rnorm(2))
X1.2 rnorm.2.
1 1 -0.58342447
Data frames can be indexed in all the ways that matrices
can. The result will be either a vector or another data
frame. Coercion to vectors can again be avoided using
the argument drop = FALSE.

We can also extract a column by name using the $


symbol.

> cars$Year
[1] 1990 1990 1991 1991 1991 1991 1992 1992 1992 1992
> cars$Month
[1] July July September September
[5] December December March March
[9] November November
Levels: December July March November September
Data frames are actually a special kind of list. As when
constructing a data frame, we specify the elements of a
list using either name = value or just value for each
argument. Unlike a data frame, lists are not displayed in
columns, and each element can have a different length.

> ingredients <- list(cheese = c("Cheddar", "Swiss"),


+ meat = c("Ham","Turkey", "Bologna"))
> ingredients
$cheese
[1] "Cheddar" "Swiss"
$meat
[1] "Ham" "Turkey" "Bologna"

Note that the elements are not associated with one


another by position, as they were in a given row of a data
frame.
You will often encounter lists as return values of function
calls in R.

> x <- 1:100


> y <- x * 3 + rnorm(100)
> regression.results <- lm(y~x) # Regress y on x
> is.list(regression.results)
[1] TRUE
> names(regression.results)
[1] "coefficients" "residuals" "effects"
[4] "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels"
[10] "call" "terms" "model"
> regression.results$coef # Note partial matching
(Intercept) x
0.2433211 2.9950379
Lists can be indexed by name, using $.

They can also be indexed like vectors, using []. The result
will be another list.

> regression.results[1]
$coefficients
(Intercept) x
0.08847387 2.99781408

To extract individual elements of a list, enclose the index


in [[]]. The result will be coerced to a simpler structure,
depending on the element.

> regression.results[[1]]
(Intercept) x
0.08847387 2.99781408
To summarize, the types of data structures we have
encountered so far are:

vector
matrix
array
list
data frame

Matrices and arrays are actually just stored as vectors


with shape information, so our discussions of “vectorized”
calculations hold for matrices and arrays as well.

This is NOT true for lists and data frames.


Sometimes we want an operation to be applied to
individual dimensions of a matrix or array, or to each
element of a list.

Here R provides something called the apply mechanism.


This again avoids the need for looping through each
dimension or over each element of a list, as we would do
in lower-level languages.

• apply for matrices and arrays

• lapply and sapply for lists


> args(apply)
function (X, MARGIN, FUN, ...)
NULL

There’s that . . . argument again. The help page for apply


has this to say about the arguments:
X the array to be used.

MARGIN a vector giving the subscripts which the


function will be applied over. 1 indicates
rows, 2 indicates columns, c(1,2)
indicates rows and columns.
FUN the function to be applied: see ‘Details’.
In the case of functions like +, %*%,
etc., the function name must be backquoted
or quoted.
... optional arguments to FUN.
Let’s first talk about the MARGIN argument. This is a vector
representing the dimension(s) to which FUN will be
applied. Another way to think about it is that MARGIN gives
the dimension(s) we want to preserve.

> A <- array(1:12, dim = c(2, 3, 2))


> apply(A, 1, sum)
[1] 36 42
> apply(A, 2, sum)
[1] 18 26 34
> apply(A, c(1, 2), sum)
[,1] [,2] [,3]
[1,] 8 12 16
[2,] 10 14 18
We’ve seen that the “. . .” argument can stand for an
arbitrary number of objects, as in the c or paste
functions. Here it allows us to pass arguments through
from one function to another. Here, from apply to FUN.
> A[1,2,2] <- NA
> apply(A, 1, sum)
[1] NA 42
> apply(A, 1, sum, na.rm = TRUE)
[1] 27 42

Note that by using “. . .”, the author of the apply function


didn’t need to specify all possible arguments that FUN
could take.
Announcements:
Regina Wu, Hana Ueda, and John Jimenez will be helping
to answer your questions on bSpace and in lab.

Homework 1 is due next Wednesday night. There have


been some problems on bSpace. Please make sure you
get a verification email when you upload your assignment.

Today’s topics:

• Review of data structures and how to index them


• The apply mechanism, revisited
• Reading and writing data from within R
• Keeping track of your commands
• High-level graphics
The types of data structures and how to index them:

Vectors: [index]
> x[1:10]; x[-3]; x[x>3]

Matrices: [rowindex, colindex]
> m[1,2]; m[1:2, ]; m[ ,"a"]

Arrays: [index1, index2, ..., indexK]
> a[1, 3, ]; a[v==TRUE,,]

Data frames: [rowindex, colindex], $name
> cars$Cars6; cars[,3:4]; cars[cars$Junction == "7 to 8",]

Lists: $name, [index], [[index]]
> ingredients$meat; ingredients[1:2]; ingredients[[1]]

Note: both $ and [[]] can index only one element.


Last time we started talking about the apply function.
Let’s review how this works for matrices.
> args(apply)
function (X, MARGIN, FUN, ...)
NULL

Here X is the matrix, MARGIN is which dimension to operate on (1 for rows, 2 for columns), FUN is the function, and "..." holds any additional arguments to FUN.

> m <- matrix(1:4, nrow = 2)


> m
[,1] [,2]
[1,] 1 3
[2,] 2 4
> apply(m, 2, paste, collapse = "")
[1] "12" "34"
The lapply and sapply functions both apply a specified
function FUN to each element of a list. The former returns
a list object and the latter returns a vector when
possible. Again, both allow passing of additional
arguments to FUN through the “. . .” argument.

> random.draws <- list(x1 = rnorm(10), x2 = rnorm(100000))


> lapply(random.draws, mean)
$x1
[1] 0.0827779

$x2
[1] 0.001470952

> sapply(random.draws, mean)


x1 x2
0.082777901 0.001470952
The tapply function allows us to apply a function to
different parts of a vector, where the parts are indexed by a
factor or list of factors.

Single factor:

> grp <- factor(rep(c("Control", "Treatment"), each = 4))


> grp
[1] Control Control Control Control
[5] Treatment Treatment Treatment Treatment
Levels: Control Treatment
>
> effect <- rnorm(8) # Make up some fake data
> tapply(effect, INDEX = grp, FUN = mean)
Control Treatment
0.2180109 -0.2433582
Multiple factors:

> sex <- factor(rep(c("Female", "Male"), times = 4))


> sex
[1] Female Male Female Male Female Male Female Male
Levels: Female Male
> tapply(effect, INDEX = list(grp, sex), FUN = mean)
Female Male
Control 0.3634973 0.07252456
Treatment -0.2860360 -0.20068040
Many data sets are stored as tables in text files. The
easiest way to read these into R is using either the
read.table or read.csv function.

As you can see in help(read.table), there are quite a few


options that can be changed. Some of the important ones
are
• file - name or URL
• header - are column names at the top of the file?
• sep - what divides elements of the table
• na.strings - symbol for missing values, like 9999
• skip - number of lines at the top of the file to ignore
read.csv is like read.table, but with different defaults for CSV (comma separated value) files.
By default, all strings are read in as factors.

If a file doesn’t contain column names, you can add them


after the fact. Here’s how I created the R objects for the
assignment last week:

> cars <- read.csv("~/Desktop/friday13thcars.csv",


+ header = FALSE)
> cars[1:2,]
V1 V2 V3 V4 V5
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
> names(cars) <- c("Year", "Month", "Cars6",
+ "Cars13", "Junction")
> cars[1:2,]
Year Month Cars6 Cars13 Junction
1 1990 July 139246 138548 7 to 8
2 1990 July 134012 132908 9 to 10
Earthquakes Example:
Data from the California Geological Survey

> CAquakes <- read.table(file =
+ "http://www.consrv.ca.gov/cgs/rghm/quakes/Documents/ms49epicenters.txt",
+ header = TRUE)
> dim(CAquakes)
[1] 383 4
> CAquakes[1:3,]
Date Latitude Longitude M
1 18001011 36.8 -121.5 5.5
2 18001122 32.9 -117.8 6.3
3 18030000 34.2 -118.1 5.5
> mode(CAquakes$Date)
[1] "numeric"

How can we extract the years/months/days from the


Date column?
> datechar <- as.character(CAquakes$Date)
> substring(datechar, 1, 4)[1:3]
[1] "1800" "1800" "1803"
> CAquakes$Year <- as.numeric(substring(datechar, 1, 4))
> CAquakes$Month <- as.numeric(substring(datechar, 5, 6))
> CAquakes$Day <- as.numeric(substring(datechar, 7, 8))
> CAquakes[1:3,]
Date Latitude Longitude M Year Month Day
1 18001011 36.8 -121.5 5.5 1800 10 11
2 18001122 32.9 -117.8 6.3 1800 11 22
3 18030000 34.2 -118.1 5.5 1803 0 0
> CAquakes$Month[CAquakes$Month == 0] <- NA
> CAquakes$Day[CAquakes$Day == 0] <- NA
> summary(CAquakes$Month)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 4.000 6.000 6.281 9.000 12.000 2.000
> summary(CAquakes$Day)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.00 9.00 18.00 16.64 24.00 31.00 3.00
To save your R commands, use a plain text editor. Here
are two I like:

• R for Mac and Windows has a built-in text editor.


Access commands related to it, such as New Document
and Save, under the File menu. One nice feature is that it
automatically prints the arguments for functions at the
bottom of the window.

• The Emacs editor has a special package called ESS, for


“Emacs Speaks Statistics,” that makes working with .R files
very easy. It’s installed on all the 342 lab computers. It
includes keyboard shortcuts to evaluate the code, rather
than cutting and pasting. (See http://stat.ethz.ch/ESS/
refcard.pdf.)
Whichever editor you choose, you can run all the
commands in a particular file using source("myfile.R").

A few more notes:

If you don’t save your files as plain text, this won’t work,
since R cannot interpret any extra formatting commands.
So I do NOT recommend you use Microsoft Word.

If you’re cutting and pasting from the R session window


back into the text editor, be sure not to copy the prompt
(> symbol) as well.

If you want to keep your results in your .R file, put a # in


front of each line to mark them as comments.
Graphics in R
Part 1: High-level graphics
functions
We’ll be working in this section with many of R’s built-in
data sets. To see a list of them, just type

> data()
Data sets in package 'datasets':

AirPassengers Monthly Airline Passenger Numbers


1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide uptake in grass
plants
ChickWeight Weight versus age of chicks on
different diets

. . . many more
1. Barplots

> x <- 1:5; names(x) <- letters[1:5]


> barplot(x)
[Barplot of x, with bars labeled a through e]
> VADeaths
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
> barplot(VADeaths, legend = TRUE)
[Stacked barplot of VADeaths with a legend showing the age groups]

This stacked barplot makes it hard to read anything but the bottom category and the total.

Making a good plot in R is often a matter of iterative improvement.

> barplot(VADeaths, beside = TRUE, legend = TRUE)


[Grouped barplot of VADeaths, one bar per age group within each of Rural Male, Rural Female, Urban Male, Urban Female]


> barplot(VADeaths, beside = TRUE, legend = TRUE,
+ ylab = "Deaths per 1000",
+ main = "Death rates in Virginia, 1940")

[Grouped barplot titled "Death rates in Virginia, 1940", y-axis "Deaths per 1000"]


Saving your plots as graphics files

If you call a high-level plot command, R will automatically


start a graphics device or window.

To save the contents of the already open device to a file,


use dev.print.

> barplot(VADeaths, legend = TRUE)


> dev.print(device = pdf, file = "mybar.pdf",
+ height = 5, width = 6) # Inches
> dev.print(device = jpeg, file = "mybar.jpeg",
+ height = 500, width = 600) # Pixels

See help(device) for a list of other graphics formats.


To close the device (shut the window), type
> dev.off()

Alternatively, you can open up the device with a given file


name, run the commands, then use dev.off(). The device
itself won’t appear as a window. This is useful if you want
to run your commands in BATCH mode.

> pdf(file = "mybar.pdf", height = 6, width = 6)


> barplot(VADeaths, legend = TRUE)
> dev.off()
2. Pie charts

> pie(c(1, 1, 2), labels = letters[1:3])

[Pie chart with slices a, b, and c]

Note that elements of the vector are normalized by their sum, so that the total gives 100% of the pie.
> Titanic
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0 0
2nd 0 0
3rd 35 17
Crew 0 0
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 118 4
2nd 154 13
3rd 387 89
Crew 670 3

. . . two more matrices not printed here, with survivors


Did all groups have an equal survival rate?
> apply(Titanic, 1, sum) # Total passengers, each class
1st 2nd 3rd Crew
325 285 706 885
> pie(apply(Titanic, 1, sum), main = "Total Passengers")
> pie(apply(Titanic[,,,"Yes"], 1, sum),
+ main = "Survivors")

[Two pie charts, "Total Passengers" and "Survivors", each with slices 1st, 2nd, 3rd, Crew]
Studies of human perception show we are not very good
at comparing areas, volumes, or angles.

• When making bar plots, start the axis at zero and keep all bars the same width, so that length and area are proportional.

• Try to avoid pie charts for anything requiring a precise comparison.
3. Histograms

> precip[1:4] # Average annual precipitation in cities


Mobile Juneau Phoenix Little Rock
67.0 54.7 7.0 48.5
> hist(precip)

[Histogram of precip, y-axis "Frequency"]

The height of the bars shows the number of observations falling into each bin.
There are several ways to change the cutoff points.

> hist(precip, breaks = 10) # Only a suggestion


> hist(precip, breaks = seq(min(precip), max(precip),
+ length = 11)) # Force it

[Two histograms of precip: one with breaks = 10, one with forced cutpoints]
Again, let’s add meaningful axis labels and a title.
> hist(precip, breaks = 10, xlab = "Inches",
+ main = "Yearly Average Rainfall for US Cities")

[Histogram titled "Yearly Average Rainfall for US Cities", x-axis "Inches", y-axis "Frequency"]
4. Boxplots

> boxplot(precip, ylab = "Inches",


+ main = "Yearly Average Rainfall for US Cities")

[Annotated boxplot titled "Yearly Average Rainfall for US Cities", y-axis "Inches"]

Reading the boxplot from top to bottom:
• Outlier(s), plotted as individual points
• Upper whisker - upper quartile + 1.5 IQR
• Upper quartile
• Median (the box spans the inter-quartile range, IQR)
• Lower quartile
• Lower whisker - lower quartile - 1.5 IQR
> mtcars[1:2,1:5]
mpg cyl disp hp drat
Mazda RX4 21 6 160 110 3.9
Mazda RX4 Wag 21 6 160 110 3.9
> boxplot(mpg~cyl, data = mtcars, xlab = "Cylinders",
+ ylab = "Miles per Gallon",
+ main = "Fuel Consumption")

[Boxplots titled "Fuel Consumption": "Miles per Gallon" by "Cylinders" (4, 6, 8)]
5. Scatterplots

> state.x77[1:2,1:4]
Population Income Illiteracy Life Exp
Alabama 3615 3624 2.1 69.05
Alaska 365 6315 1.5 69.31
> plot(state.x77[,"Income"], state.x77[,"Life Exp"])


[Scatterplot of state.x77[, "Life Exp"] against state.x77[, "Income"]]
> plot(state.x77[,"Income"], state.x77[,"Life Exp"],
+ xlab = "Per Capita Income (Dollars)",
+ ylab = "Life Expectancy (Years)",
+ main = "Income and Life Expectancy in U.S., 1970s")
[Scatterplot titled "Income and Life Expectancy in U.S., 1970s", x-axis "Per Capita Income (Dollars)", y-axis "Life Expectancy (Years)"]

How can we label the interesting cases? See Stat133Lecture5.R
Announcement: John will have office hours tonight from
8-10pm in the bSpace chatroom, and I will have office
hours tomorrow from 2:30-3:30.

A few loose ends from last time:

arguments to par - mar, oma, xaxt, and yaxt

functions - legend, locator, axis, abline


Graphics, Continued:
The Dirty Dozen
From Wainer, H. (1984) “How to Display Data Badly.”
The American Statistician, 38, 137-147.
Additional images from Tufte, E. The Visual Display of
Quantitative Information.
1. Show as few data as possible.

An example with lots


of “chart junk,” not to
mention visual
distortion
How many data points?
2. Hide what data you do show.
3. Ignore the visual metaphor, or reverse it mid-graph.
4. Only order matters.

Are we supposed to
compare length, area,
or volume?
5. Graph data out
of context.
6. Change scales in
mid-axis.
7. Emphasize the trivial.
(Ignore the important.)
8. Jiggle the baseline.
Sometimes varying the baseline is ok, if the main points of
comparison are the first category and the total. This plot
is bizarre in other ways.
9. Austria first!
10. Label (a) Illegibly,
(b) Incompletely,
(c) Incorrectly, and
(d) Ambiguously.
11. More (dimensions) is murkier.
More dimensions AND
colors!
12. If it has been done well in the past, think of another
way to do it.
On the other hand, here are some creative plotting
techniques you may want to consider.
1. Letting the data points represent another variable.

(E.J. Marey)
2. Using “small
multiples”
3. Letting deformation represent a variable.

versus
An example from www.swivel.com
Critique:
x-axis labels poorly located - put them at election years
y-axis label misleading - these are numbers of counties
use of color could be improved (eg. red/blue)
Some R code for you to play with:

parties <- read.csv(file = "graph_26277754.csv", header = TRUE)


parties
names(parties) <- c("Democrat", "Republican", "Year")
order(parties$Year)
parties <- parties[order(parties$Year),]
party.mat <- t(as.matrix(parties[,1:2]))
barplot(party.mat, beside = TRUE)
barplot(party.mat, beside = TRUE, col = c("blue", "red"))
barplot(party.mat, beside = TRUE, col = c("blue", "red"),
names.arg = parties$Year)
barplot(party.mat, beside = TRUE, col = c("blue", "red"),
names.arg = parties$Year, legend = TRUE)
barplot(party.mat, beside = TRUE, col = c("blue", "red"),
names.arg = parties$Year, legend = TRUE,
ylim = c(0, 50))
title(main = "California Counties\nMajority Party of Registered Voters",
xlab = "Election Year", ylab = "Number of Counties")
dev.print(pdf, file = "countyvotes.pdf", height = 6, width = 7)
[Barplot titled "California Counties: Majority Party of Registered Voters", x-axis "Election Year" (1992-2008), y-axis "Number of Counties", with blue/red bars for Democrat/Republican]

But does this new graph tell the whole story?


Programming in R
So far we have relied on the built-in functionality of R to
carry out our analyses. In the next three lectures, we’ll
cover

• Importing packages into R


• How to write your own functions
• The meaning of “environments” and “variable scope”
• How to use flow control mechanisms like if and for
• Debugging your code when something goes wrong
• Timing and speeding up your code
If a particular package is already installed on your system,
you can access its contents by typing

> library("nameofpackage")
or
> library(nameofpackage)

The authors of the package write documentation for the


functions and datasets included in it, which you can read
as usual using help().

All packages come with a reference manual, which you can access by visiting CRAN. Go to http://cran.r-project.org/, click on "Packages," then scroll down for the particular package. This is just a hard copy of the help pages. A few packages also come with a tutorial.
Writing your own functions in R

Think about the code we’ve been writing so far in R. It


has been

• made up of a list of commands, one after another


• specific to the particular dataset we’re working with.

Functions allow us to

• organize our code into individual tasks


• reuse the same code on different datasets by making
the data an argument to the function.
For example, last time we simulated some data.

> beta0 <- 3; beta1 <- 1


> m <- beta0 + beta1 * x
> y <- rnorm(100, mean = m, sd = 10)

Then we plotted it and added a linear regression line.


> plot(x, y)
> ls.mod <- lm(y~x)
> abline(a = ls.mod$coef[1], b = ls.mod$coef[2])

What we want to do now is encapsulate the last three


lines into a function that we can apply to any x and y
vector, not just the ones currently in our workspace.
Anatomy of a function

The syntax for writing a function is

function ( arglist ) body

Typically we assign the function to a particular name. This


should describe what the function does.

myfunction <- function (arglist) body

A function without a name is called an “orphan” function.


These can be very powerfully used with the apply
mechanism. Stay tuned....
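As a quick preview (a minimal sketch; we'll return to this idea properly later):

> sapply(1:3, function(i) i^2)   # an unnamed function passed straight to sapply
[1] 1 4 9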
function ( arglist ) body

The keyword function just tells R that you want to create


a function.

Recall that the arguments to a function are its inputs,


which may have default values.

> args(substring)
function (text, first, last = 1e+06)

Here, if we do not explicitly specify last when we call


substring, it will be assigned the default value of 1e+06,
which is very large. (Why do you think this was chosen?)
A few notes on writing the arguments list

When you’re writing your own function, it’s good practice


to put the most important arguments first. Often these
will not have default values.

This allows the user of your function to easily specify the


arguments by position, eg. plot(xvec, yvec) rather than
plot(x = xvec, y = yvec).
Next we have the body of the function, which typically
consists of expressions surrounded by curly brackets.
Think of these as performing some operations on the
input values given by the arguments.

{
expression 1
expression 2
return(value)
}

The return expression hands control back to the caller of


the function and returns a given value. If the function
returns more than one thing, this is done using a named
list, for example

return(list(total = sum(x), avg = mean(x))).


In the absence of a return expression, a function will
return the last evaluated expression. This is particularly
common if the function is short.

For example, I could write the simple function:

my.mean <- function(x) sum(x)/length(x)

Here I don’t even need brackets {}, since there is only


one expression.

A return expression anywhere in the function will cause


the function to return control to the user immediately,
without evaluating the rest of the function. This is often
used in conjunction with if statements, which we’ll come
to later.
Returning to our example, let’s make a function to carry
out these three steps for any vectors x and y.
> plot(x, y)
> ls.mod <- lm(y~x)
> abline(a = ls.mod$coef[1], b = ls.mod$coef[2])

What should we call it? (What does it do?)


What will be the arguments? Should they have default
values?
What (if anything) should the function return?
What do you actually need to type into R to create this
function?

Also, looking ahead, what might go wrong with the


function?
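Here is one possible answer, written as a sketch; the name plotfit, the argument order, and the choice of return value are just one option, not the only reasonable one.

plotfit <- function(x, y){
  plot(x, y)                                       # scatterplot of the data
  ls.mod <- lm(y ~ x)                              # fit the least-squares line
  abline(a = ls.mod$coef[1], b = ls.mod$coef[2])   # add the fitted line
  return(ls.mod)                                   # hand back the fitted model
}

# Example use, on simulated data:
x <- 1:100
y <- 3 + 1 * x + rnorm(100, sd = 10)
fit <- plotfit(x, y)

One thing that could go wrong: x and y of different lengths, or non-numeric inputs. We could guard against these with the error-handling tools discussed later.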
Environments and variable scope

R has a special mechanism for allowing you to use the


same name in different places in your code and have it
refer to different objects.

For example, you want to be able to create new variables


in your functions and not worry if there are variables
with the same name already in the workspace.

The solution relies on environments.


When you call a function, R creates a new workspace
containing just the variables defined by the arguments of
that function. This collection of variables is called a frame.

> x <- 1; y <- 2


> lookatframe <- function(a, b, c) print(ls())
> lookatframe(a = 1, b = 2, c = 3)
[1] "a" "b" "c"

However, R has a way of accessing variables that are not


in the frame created by the function.

> lookatframe <- function(a, b, c){print(ls()); print(x)}


> lookatframe(a = 1, b = 2, c = 3)
[1] "a" "b" "c”
[1] 1
What is happening is that R is looking for variables with
that name in a sequence of environments. An environment
is just a frame (collection of variables) plus a pointer to
the next environment to look in.

In our example, R didn’t find the variable x in the


environment defined by the function lookatframe, so it
went on to the next one. In this case, this was our main
workspace, which is called the Global Environment.

The “next environment to look in” is called the parent


environment. For the environment created by a function
call, this is just the environment we were in when we
called the function.
If R reaches the Global Environment and still can’t find
the variable, it looks in something called the search path.
This is a list of additional environments, which is used for
packages of functions and user attached data.

You can see the search path by typing search().


Computing in R consists of sequentially evaluating
statements. Flow control structures allow us to control
which statements are evaluated and in what order.

In R these consist of
• if/else statements
• for and while loops
• break and next
• the switch function
Statements can be grouped together using curly braces
“{” and “}”. A group of statements is called a block. For
today’s lecture, the word statement will refer to either a
single statement or a block.
The basic syntax for an if/else statement is

if ( condition ) {
statement1
} else {
statement2
}

First, condition is evaluated. If the first element of the


result is TRUE then statement1 is evaluated. If the first
element of the result is FALSE then statement2 is evaluated.
Only the first element of the result is used.

If the result is numeric, 0 is treated as FALSE and any other


number as TRUE. If the result is not logical or numeric, or
if it is NA, you will get an error.
When we discussed Boolean algebra before, we met the
operators & (AND) and | (OR).

Recall that these are both vectorized operators.

If/else statements, on the other hand, are based on a


single, “global” condition. So we often see constructions
using any or all to express something related to the
whole vector, like

if ( any(x < -1 | x > 1) )
  warning("Value(s) in x outside the interval [-1,1]")

(We’ll discuss error handling more next time.)


There is another set of operators, && and ||, which are not
vectorized. In fact, they ignore all but the first element of
whatever you give them.

The advantage in using them is that they only evaluate as


much as they need to in order to return TRUE or FALSE.

For example, in A && B, first A will be evaluated. If it is


FALSE, R will immediately evaluate to FALSE for the whole
expression, and will not evaluate B.

Likewise, in A || B, R will immediately evaluate to TRUE if A


is TRUE.
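A small illustration of the difference (assuming no object named y exists in the workspace):

> exists("y") && y > 0    # safe: the right-hand side is never evaluated
[1] FALSE
> exists("y") & y > 0     # error: & evaluates both sides, and y is not found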
The result of an if/else statement can be assigned. For
example,

> if ( any(x <= 0) ) y <- log(1+x) else y <- log(x)

is the same as

> y <- if ( any(x <= 0) ) log(1+x) else log(x)

Also, the else clause is optional. Another way to do the


above is

> if( any(x <= 0) ) x <- 1+x


> y <- log(x)

Note that this version changes x as well.


If/else statements can be nested.

if (condition1 )
statement1
else if (condition2)
statement2
else if (condition3)
statement3
else
statement4

The conditions are evaluated, in order, until one evaluates to TRUE. Then the associated statement/block is evaluated.
The statement in the final else clause is evaluated if none
of the conditions evaluates to TRUE.
A note about formatting if/else statements:

When the if statement is not in a block, the else (if


present) must appear on the same line as statement1 or
immediately following the closing brace. For example,

if (condition) {statement1}
else {statement2}

will be an error if not part of a larger block and/or


function, because R will evaluate the first line.
Some common uses of if/else clauses

1. With logical arguments to tell a function what to do

corplot <- function(x, y, plotit = TRUE){


if ( plotit == TRUE ) plot(x, y)
cor(x,y)
}

2. To verify that the arguments of a function are as


expected

if ( !is.matrix(m) )
stop("m must be a matrix”)
3. To handle common numerical errors

ratio <- if ( x!=0 ) y/x else NA

4. In general, to control which block of code is executed

if ( dist == "normal" ){
return( rnorm(n) )
} else if (dist == "t"){
return(rt(n, df = 1, ncp = 0))
} else stop("distribution not implemented")
These if/else constructions are useful for global tests, not
tests applied to individual elements of a vector.

However, there is a vectorized function called ifelse.


> args(ifelse)
function (test, yes, no)

Here test is an R object that can be coerced to logical, and yes and no are R objects of the same size as test.

For each element of test, the corresponding element of yes is returned if the element is TRUE, and the corresponding element of no is returned if it is FALSE.
Some examples of ifelse

ratio <- ifelse(x!=0, y/x, NA) # (Compare with earlier)

US.indicator <- ifelse(country == "USA", 1, 0)

plot(Income, Donations,
col = ifelse(party == "Republican", "red", "blue"))
Looping is the repeated evaluation of a statement or block
of statements.

Much of what is handled using loops in other languages


can be more efficiently handled in R using vectorized
calculations or one of the apply mechanisms.

However, certain algorithms, such as those requiring


recursion, can only be handled by loops.

There are two main looping constructs in R: for and while.


For loops

A for loop repeats a statement or block of statements a


predefined number of times.

The syntax in R is

for ( name in vector ){


statement
}

For each element in vector, the variable name is set to the value of that element and statement is evaluated.

vector often contains integers, but can be any valid type.


Some examples of for loops:

fibseq <- rep(NA, 100)


fibseq[1:2] <- 1
for(i in 3:100)
fibseq[i] <- fibseq[i-1] + fibseq[i-2]

datafiles <- paste("data", 1:10, ".RData", sep = "")


for(file in datafiles)
load(file)
While loops

A while loop repeats a statement or block of statements


for as many times as a particular condition is TRUE.

The syntax in R is

while (condition){
statement
}

condition is evaluated, and if it is TRUE, statement is


evaluated. This process continues until condition
evaluates to FALSE.
Exercise:

The expression sample(c(1, 0), size = 1, prob = c(p, 1-p)) simulates a random coin flip, where the coin has probability p of coming up heads, represented by a 1.

Write a function that simulates flipping a coin until a fixed


number of heads are obtained. It should take the
probability p and the total number of heads total and
return the trial on which the final head was obtained.
This produces a single sample from the negative binomial
distribution.
coin.flips <- function(total.heads, p = 0.5){
current.heads <- n.trials <- 0
while(current.heads < total.heads){
n.trials <- n.trials + 1
if(sample(c(1,0), size = 1, prob = c(p, 1-p))){
current.heads <- current.heads + 1
}
}
return(n.trials)
}
Announcement:
Graded homework 2 will be posted on bSpace after class.
It is due next Friday at 11pm.

Please take advantage of the lab sessions tomorrow to


get started.
Continued from last time...

The expression sample(c(1, 0), size = 1, prob = c(p, 1-p)) simulates a random coin flip, where the coin has probability p of coming up heads, represented by a 1.

Write a function that simulates flipping a coin until a fixed


number of heads are obtained. It should take the
probability p and the total number of heads total and
return the trial on which the final head was obtained.
This produces a single sample from the negative binomial
distribution.

Now write a function to take multiple samples from the


negative binomial distribution.
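Here is one possible sketch, reusing the coin.flips function from the previous slide; the name and arguments are just one choice.

neg.binom.samples <- function(B, total.heads, p = 0.5){
  samples <- rep(NA, B)            # storage for the B draws
  for(i in 1:B)
    samples[i] <- coin.flips(total.heads, p)
  return(samples)
}

# Example: 1000 draws, then summarize their empirical distribution
draws <- neg.binom.samples(1000, total.heads = 5, p = 0.5)
mean(draws); hist(draws)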
The break statement causes an exit from the innermost
loop that is currently being executed. The next statement
immediately causes control to return to the start of the
loop. These are typically used in conjunction with an if
statement.

> for(i in 1:10){


+ if(i == 5)
+ break
+ }
> i
[1] 5

Notice that the “name” being iterated over in a for loop,


in this case i, still exists once the loop is done. This tells
you where you were when the break occurred.
The syntax for the switch function is

switch(EXPR, ...)

where the additional arguments specified by “...” may be


named. EXPR is evaluated. If the result is a number
between 1 and the number of additional arguments then
the corresponding element of “...” is evaluated and
returned. If EXPR returns a character string then that
string is used to match the names of the elements in “...”.

switch(distribution, normal = rnorm(1),
t = rt(1, df = 1, ncp = 0),
poisson = rpois(1, lambda = 1),
stop("distribution not implemented"))
Catching errors

1. The function stop stops execution of the current


expression and prints a specified error message.

> showstop <- function(x){


+ if(any(x < 0)) stop("x must be >= 0")
+ return("ok")
+ }
> showstop(1)
[1] "ok"
> showstop(c(-1, 1))
Error in showstop(c(-1, 1)) : x must be >= 0
2. A similar function is stopifnot. It has the advantage of
being able to take multiple conditions.

> showstopifnot <- function(x){


+ stopifnot(x>=0, x%%2 == 1)
+ return("ok")
+ }
> showstopifnot(1)
[1] "ok"
> showstopifnot(c(1, -1))
Error: all(x >= 0) is not TRUE
> showstopifnot(c(1,2))
Error: x%%2 == 1 is not all TRUE
3. Finally, warning just prints a warning message without
stopping the execution of the function.

> ratio.warn <- function(x, y){


+ if(any(y == 0))
+ warning("Dividing by zero")
+ return(x/y)
+ }
> ratio.warn(x = 1, y = c(1, 0))
[1] 1 Inf
Warning message:
In ratio.warn(x = 1, y = c(1, 0)) : Dividing by zero
> ratio.warn(x = 1:3, y = 1:2)
[1] 1 1 3
Warning message:
In x/y : longer object length is not a multiple of shorter
object length
Some debugging strategies

1. The traceback function prints the sequence of calls that


led to the last error. This can show you where in your
function something is going wrong.

It may not even be in the function itself, but in another


function that is being called within the original function.

> cv <- function(x) sd(x/mean(x))


> cv(0)
Error in var(x, na.rm = na.rm) : missing observations in
cov/cor
> traceback()
3: var(x, na.rm = na.rm)
2: sd(x/mean(x))
1: cv(0)
2. If you have some idea where the error is occurring, you
can use print to check that key variables are what you
think they are.

3. Consider “commenting out” lines of your code where


the error might occur, then adding them back in one by
one.

4. To step through the function, expression by expression,


and be able to print out any variable at each step, use the
debug function. Use undebug to turn off debugging.
> coin.flips(total.heads = 5)
debugging in: coin.flips(total.heads = 5)
debug: {
current.heads <- n.trials <- 0
while (current.heads < total.heads) {
n.trials <- n.trials + 1
if (sample(c(1, 0), size = 1, prob = c(p, 1-p))) {
current.heads <- current.heads + 1
}
}
return(n.trials)
}
(The lines after "debug:" show what’s about to be evaluated.)
Browse[1]> n
debug: current.heads <- n.trials <- 0
Browse[1]>
While in the debugger, you can use the following
commands:

'n' (or just return) - Advance to the next step.


'c' - continue to the end of the current context: e.g. to
the end of the loop if within a loop or to the end of the
function.
'Q' - exit the browser and the current evaluation and
return to the top-level prompt.

You can also evaluate any valid R expression. For


example, you can type the names of variables to see their
current values.
Efficient programming

The first rule of efficient programming in R is to make use


of vectorized calculations and the apply mechanisms
whenever possible.

You can check how much time it takes to evaluate any


expression by wrapping it in system.time(). Units are in
seconds.

> system.time(normal.samples <- rnorm(1000000))
   user  system elapsed
  0.196   0.013   0.221

Here "user" is CPU time for the R process, "system" is CPU time spent by the system on behalf of R, and "elapsed" is wall clock time.
> x <- y <- 1:100000
> time1 <- system.time({
+ z <- x[1] + y[1]
+ for(i in 2:100000)
+ z <- c(z, x[i] + y[i])})
> time2 <- system.time({
+ z <- rep(NA, 100000)
+ for(i in 1:100000)
+ z[i] <- x[i] + y[i]})
> time3 <- system.time(x+y)
> time1/time3
user system elapsed
41687.5 80872.0 83769.5
> time2/time3
user system elapsed
276.5 4.0 279.5
Simulation in R
First, a brief review of probability theory.

Probability allows us to quantify statements about the


chance of an event taking place. There are two formal
definitions of probability:

• Frequentist: long-run relative frequency of an event


occurring in repeated experiments.

• Subjective/Bayesian: an individual’s degree of belief in


the occurrence of the event, given the evidence.

We will focus on the first definition. However, the basic


laws of probability are the same under both definitions.
A probability distribution assigns a number P(A) to each
event in the sample space (set of all possible outcomes).
P(A) must be between 0 and 1.

We may characterize the distribution using the cumulative


distribution function or CDF, defined by

F (x) = P (X ≤ x)

We call X the random variable.

Exercise: Graph the CDF for the random variable equal to the number of heads in two independent coin flips.
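One way to sketch it in R (an illustration, not the only approach): assuming fair coins, the number of heads takes the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4, so the CDF is a step function.

x <- 0:2
F <- cumsum(c(0.25, 0.5, 0.25))   # 0.25, 0.75, 1.00
plot(stepfun(x, c(0, F)), verticals = FALSE,
     main = "CDF of the number of heads in two coin flips")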
Three important properties of the CDF are

1. F is non-decreasing: x1 < x2 implies F(x1) ≤ F(x2)
2. F is normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. F is right-continuous: lim_{y↓x} F(y) = F(x)

In fact, a function mapping the real line to [0,1] is a CDF if and only if it satisfies these three conditions.
The inverse CDF or quantile function is defined by

F⁻¹(q) = inf{x : F(x) > q}

If you’re not familiar with "inf," just think of it as the minimum.

Exercise: What does the inverse CDF for the coin flipping
example look like?
A random variable X is discrete if it takes countably many values. We define the probability mass function for X by

f(x) = P(X = x)

A random variable X is continuous if there exists a function f, called the probability density function (PDF), with

1. f(x) ≥ 0 for all x
2. ∫_{−∞}^{∞} f(x) dx = 1
3. P(a < X < b) = ∫_a^b f(x) dx

Note that F(x) = ∫_{−∞}^x f(t) dt

In R, there are many built-in functions for handling
distributions, some of which we have seen already.

The prefixes of the functions indicate what they do:

d - evaluate the PDF


p - evaluate the CDF
q - evaluate the inverse CDF
r - take a random sample

Note that the functions prefixed by d, p, and q are all


calculating mathematical quantities.

However, once we have a random sample, we can also


estimate the PDF, CDF, and inverse CDF....
A histogram is a type of density estimator.
! b
Recall that f (x)dx = P (a < X < b)
a

For each bin of a histogram (with lower endpoint a and


upper endpoint b), we count the number of observations
falling into the bin, i.e.
Σ_{i=1}^n I{a < Xi ≤ b}
If we properly normalize each of these quantities, the
total area of the rectangles in the histogram is one, just
like the area under a PDF. You can do this automatically in
R with hist(x, prob = TRUE).
The empirical CDF uses the same sort of counting idea.
Define
F̂(x) = (1/n) Σ_{i=1}^n I{Xi ≤ x}

We are estimating a probability by a proportion. Another


way to think of it is that we estimate the PDF by a
discrete distribution which assigns probability 1/n to each
data point.

Exercise: Write a function which calculates the empirical CDF. It should take two vectors, sample and x.
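One possible sketch of a solution, using the counting definition above (the function and argument names are just my choice):

ecdf.hat <- function(sample, x) {
  # proportion of the sample less than or equal to each element of x
  sapply(x, function(xi) mean(sample <= xi))
}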
Finally, the quantile function in R returns the sample
quantiles, defined by

F̂⁻¹(q) = inf{x : F̂(x) > q}

Note that we just “plug in” the empirical CDF to the


definition of the quantile function.
We’ll talk more next time about specific distributions.
For now, let’s consider the role that simulation can play in
helping us understand statistics.

We can think of probability theory as complementary to statistical inference.

[Diagram: probability takes us from the distribution to the observed data; inference takes us from the observed data back to the distribution.]
A statistic is a function of a sample, for example the
sample mean or a sample quantile.

Statistics are often used as estimators of quantities of


interest about the distribution, called parameters.
Estimators are random variables; parameters are not.

In simple cases, we can study the distribution of the


statistic analytically. For example, we can prove that
under mild conditions the standard error of the sample
mean decreases at a rate proportional to 1/√n.

In more complicated cases, we turn to simulation.


Whereas mathematical results are symbolic, in terms of
arbitrary parameters and sample size, on a computer we
must specify particular values.

A single experiment looks something like this:

[Diagram: a particular choice of parameters and sample size produces X1, X2, . . . , Xn, which are summarized by a single statistic.]

To study the distribution of the statistic, we repeat the


whole experiment B times. The larger B is, the better our
approximation of the distribution.
Steps in carrying out a simulation study:

1. Specify what makes up an individual experiment: sample


size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
Example: Find the standard error of the median when
sampling from the normal distribution. How does it vary
with the sample size and with the standard deviation?
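A minimal sketch of the individual experiment for this example, for one illustrative combination of inputs (n = 25, sd = 1, B = 1000 replications):

B <- 1000
medians <- replicate(B, median(rnorm(25, mean = 0, sd = 1)))
sd(medians)    # simulated standard error of the median

# Varying the sample size is then just an sapply over n:
sapply(c(10, 25, 100), function(n) sd(replicate(B, median(rnorm(n)))))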
Recall our example of a simulation study from last time:

Find the standard error of the median when sampling


from the normal distribution. How does it vary with the
sample size and with the standard deviation?
Steps in carrying out a simulation study:

1. Specify what makes up an individual experiment: sample


size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
A quick review of some other probability distributions
available in R:

abbreviation - Distribution

unif - Uniform(a, b):  f(x) = 1/(b−a) for a ≤ x ≤ b, and 0 otherwise

exp - Exponential(λ):  f(x) = λe^{−λx} for x > 0, and 0 otherwise

pois - Poisson(λ):  f(x) = e^{−λ} λ^x / x!,  x = 0, 1, 2, . . .

binom - Binomial(n, p):  f(x) = (n choose x) p^x (1 − p)^{n−x},  x = 0, 1, . . . , n

What if a distribution is not available in R? For instance,


there is no built-in Bernoulli distribution.

Well, in this case you could just use binom with size=1, or
sample(0:1, 1, prob = c(1-p, p)).

We can also derive certain distributions from others. For


example, last week we sampled from the negative
binomial distribution by explicitly counting the number of
trials until we got the desired number of heads. Can we
sample from the Bernoulli in some other way?
If the CDF has an inverse in closed form (that is, if the quantile function can be written down explicitly), there is a simple method for generating random values from the distribution.

The inverse CDF method is simple:

1. Generate n samples from a standard uniform


distribution. Call this vector u. In R, u <- runif(n).

2. Take y <- F.inv(u), where F.inv computes the


inverse CDF of the distribution we want.

We can prove that the CDF of the random values


produced in this way is exactly F.

P(U ≤ u) = 0   for u < 0
         = u   for 0 ≤ u ≤ 1
         = 1   for u > 1

Therefore,

P(Y ≤ y) = P(F⁻¹(U) ≤ y)
         = P(F(F⁻¹(U)) ≤ F(y))
         = P(U ≤ F(y))
         = F(y)

We used the fact that F is nondecreasing in the second


line.
Example: Triangle distribution with endpoints at a and b
and center at c.

f(x) = 2(x−a) / [(b−a)(c−a)]     for a ≤ x < c
     = 2(b−x) / [(b−a)(b−c)]     for c ≤ x ≤ b
     = 0                         otherwise

[Figure: density of the triangle distribution, plotted against x from 0 to 2.]

We need to:

1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry out the inverse CDF method.
We ended last time by talking about the inverse-CDF
method:

1. Generate n samples from a standard uniform


distribution. Call this vector u. In R, u <- runif(n).

2. Take y <- F.inv(u), where F.inv computes the


inverse CDF of the distribution we want.
Example: Triangle distribution with endpoints at a and b
and center at c.

f(x) = 2(x−a) / [(b−a)(c−a)]     for a ≤ x < c
     = 2(b−x) / [(b−a)(b−c)]     for c ≤ x ≤ b
     = 0                         otherwise

[Figure: density of the triangle distribution, plotted against x from 0 to 2.]

We need to:

1. Find the CDF
2. Find the inverse CDF
3. Write a function to carry out the inverse CDF method.
Using the fact that the total area is one, and that the area
of a triangle is 1/2 base x height, we find that


F(x) = 0                              for x < a
     = (x−a)² / [(b−a)(c−a)]          for a ≤ x < c
     = 1 − (b−x)² / [(b−a)(b−c)]      for c ≤ x ≤ b
     = 1                              for x > b
Inverting this function, we have

F⁻¹(y) = a + √[ y(b−a)(c−a) ]        for 0 ≤ y < (c−a)/(b−a)
       = b − √[ (1−y)(b−a)(b−c) ]    for (c−a)/(b−a) ≤ y ≤ 1

Now we can write our function.
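Here is one minimal sketch of such a function (my own version, not necessarily the one written in lecture), using the inverse CDF just derived; the name rtriangle and the default endpoints are purely illustrative:

rtriangle <- function(n, a = 0, b = 2, c = 1) {
  u <- runif(n)                    # step 1: standard uniforms
  breakpt <- (c - a) / (b - a)     # F(c), where the two pieces meet
  ifelse(u < breakpt,
         a + sqrt(u * (b - a) * (c - a)),         # invert the first piece
         b - sqrt((1 - u) * (b - a) * (b - c)))   # invert the second piece
}

hist(rtriangle(10000), prob = TRUE)   # should look like the triangular density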


We’ll finish off this section on simulation by going over
one more example of designing a simulation study, dealing
with the risk of the James-Stein estimator.

In 1956, Charles Stein rocked the world of statistics when


he proved that the maximum likelihood estimator (MLE)
is inadmissible (that is, we can always find a better
estimator) in this simple problem when d ≥ 3:

Let Yi ∼ N(θi, σ²) independently, with θi ∈ ℝ, i = 1, . . . , d.

The MLE for the vector θ is just the vector Y of


observations, which seems intuitively sensible.

To state Stein’s result, we first have to talk about risk.


Speaking somewhat more formally, a loss function
describes the consequences of using a particular
estimator θ̂ when the true parameter is θ.

A common loss function is the squared error

L(θ, θ̂) = (θ − θ̂)²

which we can generalize to multiple dimensions by summing the squared errors in each dimension:

L(θ, θ̂) = Σ_{i=1}^d (θi − θ̂i)² = ||θ − θ̂||²

But θ̂ is random, because it depends on the data. We call


the expected value of the loss for given θ the risk function.
It’s easy to calculate the risk (under squared error loss) of
the MLE in this problem:
E[L(θ, Y)] = E[ Σ_{i=1}^d (θi − Yi)² ]
           = Σ_{i=1}^d E[(θi − Yi)²]
           = dσ²

What Stein proved was that when d ≥ 3, we can find


another estimator whose risk is always less than that of
the MLE, no matter what θ is.
A famous example is the James-Stein estimator:

θ̂JS = ( 1 − (d − 2)σ² / ||Y||² ) Y

We’ll now study the risk of the James-Stein estimator


through simulation and compare it to the risk of the MLE.

Note that for a given data set in the simulation, we can


calculate the loss, because we know θ.

We can approximate the risk (the expected value of the


loss) by generating many data sets and then calculating
the sample mean of the vector of losses.
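A minimal sketch of that idea, assuming σ = 1 and a particular θ (the function name sim.risk and the choice B = 1000 are just illustrative):

sim.risk <- function(theta, B = 1000, sigma = 1) {
  d <- length(theta)
  loss.mle <- loss.js <- numeric(B)
  for (b in 1:B) {
    y <- rnorm(d, mean = theta, sd = sigma)
    theta.js <- (1 - (d - 2) * sigma^2 / sum(y^2)) * y   # James-Stein estimate
    loss.mle[b] <- sum((y - theta)^2)
    loss.js[b]  <- sum((theta.js - theta)^2)
  }
  c(mle = mean(loss.mle), js = mean(loss.js))   # approximate risks
}

sim.risk(rep(0, 10))   # JS risk should be well below d * sigma^2 = 10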
Recall the steps one more time:

1. Specify what makes up an individual experiment: sample


size, distributions, parameters, statistic of interest.
2. Write an expression or function to carry out an
individual experiment and return the statistic.
3. Determine what inputs, if any, to vary.
4. For each combination of inputs, repeat the experiment
B times, providing B samples of the statistic.
5. For each combination of input, summarize the empirical
CDF of the statistic of interest, e.g. look at the sample
mean or standard error.
6. State and/or plot the results.
One more quick note about summarizing your simulation
results....

When you are reporting the means of your simulated


distributions, it’s a good idea to add an indication of your
uncertainty as well.

The Central Limit Theorem tells us that the sample mean of


the simulated distribution is, with a sufficiently large
sample, approximately normally distributed. We can use
this to form a 95% confidence interval:

X̄ ± 2 SD/√B

Note that we have control over B, so we can make the


intervals as narrow as we like!
UNIX Basics
Operating systems

An operating system (OS) is a piece of software that


controls the hardware and other pieces of software on
your computer.

The most popular OS today, Microsoft Windows, uses a


graphical user interface (GUI) for you to interact with the
OS. This is easy to learn but not very powerful.

UNIX, on the other hand, is hard at first to learn, but it


allows you vastly more control over what your computer
can do. There are actually many different “flavors” of
UNIX, but what we’ll cover applies to almost all of them.
The differences between, say, Windows and UNIX stem
from an underlying philosophy about what software
should do.

Windows: Programs are large, multi-functional. Example:


Microsoft Word.

UNIX: Many small programs, which can be combined to


get the job done. A “toolbox approach.” Example: stop all
my (cgk’s) processes whose name begins with cat and a
space:

ps -u cgk | grep "[0-9] cat" | awk '{print $2}' | xargs kill


The UNIX kernel is the part of the OS that actually
carries out basic tasks.

The UNIX shell is the user interface to the kernel. Like flavors of UNIX, there are also many different shells. For this course, it doesn’t matter which one you use. The default on the lab computers is called tcsh.

[Screenshot: a terminal window showing the prompt -- yours will differ.]
The first thing you need to know about UNIX is how to
work with directories and files. Technically, everything in
UNIX is a file, but it’s easier to think of directories as you
would folders on Windows or Mac OS.
Directories are organized in an inverted tree structure.

To see the directory you’re currently in, type the


command pwd (“print working directory”).

There are two “special” directories: The top level


directory, named “/”, is called the root directory.
Your home directory, named “~”, contains all your files. For
Mary, “~” and “/users/mary” mean the same thing.
To create a new directory, use the command mkdir. Then
to move into it, use cd.

$ pwd
/Users/cgk
$ mkdir unixexamples
$ cd unixexamples
$ ls
$ ls -a
. ..

ls -a means to show all files, including the hidden files


starting with a dot (“.”).

The two hidden files here are special and exist in every
directory. “.” refers to the current directory, and “..”
refers to the directory above it.
This brings us to the distinction between relative and
absolute path names. (Think of a path like an address in
UNIX, telling you where you are in the directory tree.)

You may have noticed that I typed cd unixexamples, rather


than cd /Users/cgk/unixexamples.

The first is the relative path; the second is the absolute


path.

To refer to a file, you need to either be in the directory


where the file is located, or you need to refer to it using a
relative or absolute path name.
Example:
$ pwd
/Users/cgk/unixexamples
$ echo "Testing 1 2 3" > test.txt
$ ls
test.txt
$ cat test.txt
Testing 1 2 3
$ cd ..
$ cat test.txt
cat: test.txt: No such file or directory
$ cat unixexamples/test.txt
Testing 1 2 3

Note that file names must be unique within a particular


directory, but having, say, both /Users/cgk/test.txt and
/Users/cgk/unixexamples/test.txt is OK.
Commands, arguments, and options

We’ve already started using these; now let’s define them


more precisely.

The general syntax for a UNIX command looks like this:

$ command -options argument1 argument2

(The number of arguments may vary.) An argument


comes at the end of the command line. It’s usually the
name of a file or some text.

Example: move/rename a file.

$ mv test.txt newname.txt
Options come between the name of the command and
the arguments, and they tell the command to do
something other than its default. They’re usually prefaced
with one or two hyphens.

$ pwd
/Users/cgk
$ rmdir unixexamples
rmdir: unixexamples: Directory not empty
$ rm -r unixexamples
$ ls
Desktop Movies Rlibs
Documents Music Sites
Icon? Pictures Work
Library Public bin
MathematicaFonts README
mathfonts.tar
To look at the syntax of any particular UNIX command,
type man (for “manual”) and then the name of the
command.

The two most important parts of the man page are


labeled SYNOPSIS and DESCRIPTION. These are very
much like the “Usage” and “Arguments” in R’s help pages.

SYNOPSIS shows you the syntax for a particular


command. Bracketed arguments are optional.

DESCRIPTION tells you what all the options do.

Press the space bar to scroll forward through the man


page, “b” to go backward, and “q” to exit.
You can refer to multiple files at once using wildcards. The
most common one is the asterisk (*). It stands in for
anything (including nothing at all).

$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls G*
Gagging.text Going.nxt
$ ls *.xt
Bing.xt

The question mark (?) is similar, except it can only


represent a single character.
$ ls ?ing.xt
Bing.xt
Finally, square brackets match any single character from the
set listed within those brackets.

$ ls
AGing.txt Bing.xt Gagging.text Going.nxt ing.ext
$ ls [A-G]ing.*
Bing.xt

The wildcards can also be combined.

$ ls *G*
AGing.txt Gagging.text Going.nxt
$ ls *i*.*e*
Gagging.text ing.ext

We’ll cover text matching in a lot more detail next week


when we talk about regular expressions.
A recap of commands so far

pwd print working directory


ls list contents of current directory
ls -a list contents, include hidden files
mkdir create a new directory
cd dname change directory to dname
cd .. change to parent directory
cd ~ change to home directory
mv move or rename a file
rm remove a file
rm -r remove all lower-level files
Here are a few more handy ones:

wc -l - count the number of lines in a file


$ wc -l halfdeg.elv
134845 halfdeg.elv

head -nx - look at the first x lines of a file

$ head -n5 halfdeg.elv


Tyndall Centre grim file created on 13.05.2003 at 13:52 by
Dr. Tim Mitchell
.elv = elevation (km)
0.5deg MarkNew elev
[Long=-180.00, 180.00] [Lati= -90.00, 90.00] [Grid X,Y=
720, 360]
[Boxes= 67420] [Years=1975-1975] [Multi= 0.0010]
[Missing=-999]
tail - like head, but look at the end of the file

cp - copy a file

$ cp unixexamples/Bing.xt .

cat - print the contents of a file

echo - write arguments

The real power in UNIX, however, comes from stringing


these commands together. We’ll talk about this next
time.
Today, we’ll talk about

Some interfaces between R and UNIX


- getting the results of UNIX commands from within R
- running R in BATCH mode and monitoring its progress

Putting UNIX commands together using


- redirection
- pipes

Manipulating data in UNIX using filtering commands in


combination with redirection and pipes
The system function in R allows you to execute a UNIX
command and either print the result to the screen or
store it as an R object (argument intern = TRUE).

> system("ls")
datagen.R group1.dat group2.dat group3.dat
> system("head -n2 *.dat")
==> group1.dat <==
height weight
65.4 134.9

==> group2.dat <==
height weight
65.7 145.7

==> group3.dat <==
height weight
63.8 138.9

Goal: Read in all the data files and put them in a single matrix with an extra column for group. Referring to your UNIX handout, what should our strategy be?
> nlines <- system("wc -l *.dat", intern = TRUE)
> nlines
[1] " 3 group1.dat" " 4 group2.dat"
[3] " 3 group3.dat" " 10 total"
> nfiles <- length(nlines) - 1
> nlines <- as.numeric(substring(nlines[1:nfiles], 7, 8)) - 1
> nlines
[1] 2 3 2
> hw <- matrix(NA, nrow = sum(nlines), ncol = 3)
> startline <- 1
> for(group in 1:nfiles){
+ temp <- read.table(file = paste("group", group,
+ ".dat", sep = ""),
+ header = TRUE)
+ index <- startline:(startline+nrow(temp)-1)
+ hw[index,1:2] <- as.matrix(temp)
+ hw[index,3] <- as.matrix(group)
+ startline <- startline + nrow(temp)
+ }
> colnames(hw) <- c("height", "weight", "group")  # hw is a matrix, so set column names
BATCH jobs in R are useful whenever

- You have a long job and you want to be able to use the
computer for other things in the meantime.
- You want to log out of the machine while the job is
running and come back to it later.
- You’re running the job on a remote machine, and again
you want to log out.
- You want to be courteous to other users of the machine
by decreasing the priority of the job.
To start a BATCH job, use

nice R CMD BATCH scriptfile.R outfile.Rout &

Here nice gives the job lower priority, outfile.Rout collects what would otherwise be printed to the screen, and the trailing & indicates that you want to run the job in the background.
A few other things to keep in mind:

- scriptfile.R should require no input from the user. For


example, don’t use identify.

- Graphics should be created by surrounding the relevant


code with pdf(file = “filename.pdf”) and dev.off(), rather
than using dev.print(pdf, file = “filename.pdf”).

- In simulations it can be helpful to include a line like


if(i %% 10 == 0) print(paste("Iteration", i))
Then you can monitor it using tail -f outfile.Rout.

- By default, the workspace will be saved in .RData. You


can also save specific objects using the save function.
To see information about currently running processes,
just type top. There are arguments to top that allow you
to sort by CPU usage, memory, etc. See man top for more
details.
The number at the beginning of the line is called the
process ID, or PID.

To kill a particular process, type kill PID, substituting the


correct PID.

Sometimes you want to see the list of all processes in a


non-interactive way. For example, you might want to pipe
the results through a filter, as we’ll discuss next.

On BSD UNIX systems (like the Apple machines in the


lab), ps -aux will list all processes.
On other systems, ps -ef does the trick.
We’ll use ssh and sftp to start a new UNIX session on a
remote machine and to send files back and forth between
our computer and the remote computer.

To log into a statistics department machine, type

ssh -l uname scf-ugNN.berkeley.edu

where uname is the user name you’ve been assigned for


the course, and NN is a number between 01 and 27.

You will be prompted for your password. Your starting


directory is your home directory on the network. Note
this is the same no matter which department computer
you log into!
To transfer files back and forth, first type

sftp -l uname scf-ugNN.berkeley.edu

You can use pwd, cd, and ls just as you would at the usual
prompt to find the right remote file or directory. You can
also use lpwd, lcd, and lls to move around the local
machine.

To copy a file from the remote computer to your


computer, type get nameoffile.

To copy a file from your computer to the remote


computer, type put nameoffile.

Type exit to quit.


Redirection and pipes are really at the heart of the UNIX
philosophy, which is to have many small tools, each one
suited for a particular job.

Redirection refers to changing the input and output of


individual commands/programs.

The “standard input” or STDIN is usually your keyboard.


The “standard output” or STDOUT is usually your
terminal (monitor).

As an example, if we type cat at the prompt and hit


return, the computer will accept input from us until it hits
an end-of-file (EOF) marker, which on most systems is
CNTRL-D. Each time we hit return, our input is printed
to the terminal.
We can redirect as follows:

> redirects STDOUT to a file


< redirects STDIN from a file
>> redirects STDOUT to a file, but appends rather than
overwriting.

(There’s also a <<, but its use is more advanced than we’ll
cover.)

Here are two examples:

$ cat > temp.txt


$ sort < temp.txt

Try it out!
The idea behind pipes is that rather than redirecting
output to a file, we redirect it into another program.

Another way to say this is that STDOUT of one program


is used as STDIN to another program.

A common use of pipes is to view the output of a


command through a pager, like less. This is particularly
useful if the output is very long.

$ ls | less

The vertical bar (|) is the pipe. Note that the data flows from left to right. See the UNIX
handout for more details on less.
A program through which data is piped is called a filter.
We’ve already seen a few filters: head, tail, and wc.

Two more common filters are

sort - sort lines of text files alphabetically


uniq - strip duplicate lines when they follow each other

$ cat somenumbers.txt

What will be the output of:


cat somenumbers.txt | sort
cat somenumbers.txt | uniq
cat somenumbers.txt | sort | uniq
$ cat somenumbers.txt | sort
One
One
One
Three
Two
Two
Two
$ cat somenumbers.txt | uniq
One
Two
One
Two
Three
One
$ cat somenumbers.txt | sort | uniq
One
Three
Two
Today: A quick wrap up on UNIX filters, then on to
regular expressions.

There will be a short assignment posted on bSpace later


today to give you some practice with these.

Recall the filters we’ve seen so far:

head - first lines of file


tail - last lines of file
wc - word (or line or character or byte) count
sort - sort lines of text files alphabetically
uniq - strip duplicate lines when they follow each other
Here are two more useful filters:

grep - print lines matching a pattern (We’ll talk about


patterns more shortly -- for now just think of the pattern
as requiring an exact match.)

$ grep "save" *.R


Print all lines in any file ending with .R which contain the
word (pattern) save.

cut - select portions of each line of a file

$ cut -d " " -f 3-7
Here are some practice problems (one possible set of answers is sketched after the list):

On many systems, the file /etc/passwd shows information


about the registered users for the machine. A quick look
at the file shows there are also some notes at the top.

1. Determine the total number of users.


2. Sort the users and display the information for the last
five users, alphabetically speaking.
3. Show just the usernames for these entries.
4. Put the usernames in a file called lastusers.txt.
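One possible set of answers, assuming the notes at the top of /etc/passwd are comment lines starting with # (check your own system first; the exact commands may need adjusting):

$ grep -v "^#" /etc/passwd | wc -l                                # 1
$ grep -v "^#" /etc/passwd | sort | tail -n5                      # 2
$ grep -v "^#" /etc/passwd | sort | tail -n5 | cut -d ":" -f 1    # 3
$ grep -v "^#" /etc/passwd | sort | tail -n5 | cut -d ":" -f 1 > lastusers.txt   # 4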
Regular Expressions
Regular expressions give us a powerful way of matching
patterns in text data.

Example 1: election data from three different datasets.


We know these are the same places, but how can the
computer recognize that?
Example 2: Creating variables that predict whether an
email is SPAM

- numbers or underscores in the sending address


- all capital letters in the subject line
- fake “words” like Vi@graa
- number of exclamation points in the subject line
- received time in the current time zone
Example 3: Mining the State of the Union addresses

How long are the speeches? How do the distributions of


certain words change over time? Which presidents have
given “similar” speeches?
The language of regular expressions allows us to carry
out some common tasks, such as

• extracting pieces of text in non-standard format -


for example, finding all the links in an HTML document
• creating variables from information found in text
• cleaning and transforming text into a uniform format,
resolving inconsistencies in format between files
• mining text by treating documents directly as data
Most importantly, we do this all programmatically rather
than by hand, so that we can easily reproduce our work if
needed.
Regular expressions are constructed from three things:

A literal character is matched only by the character itself.

A character class is matched by any member of the


specified class. For example, [A-Z].

Modifiers operate on literal characters, character classes,


or combinations of the two.
First let’s go “under the hood” a little bit and think about
the algorithms we could use to identify pattern matches
made up of literal characters.

What strategy is this code using?

> string
[1] "St John the Baptist Parish"
> if (substring(string, 1, 3) == "St ")
+ newstring <- paste("St. ",
+ substring(string, 4, nchar(string)), sep = "")
> newstring
[1] "St. John the Baptist Parish"

Can you see any problems with it?


A more general approach is to split the input string into a
vector of characters and then iterate over those
characters looking for the particular string.

> string <- "The Slippery St Frances"


> characters <- unlist(strsplit(string, ""))
> characters
[1] "T" "h" "e" " " "S" "l" "i" "p" "p" "e" "r" "y"
[13] " " "S" "t" " " "F" "r" "a" "n" "c" "e" "s"
> possible <- which(characters == "S")
> possible
[1] 5 14
> substring(string, possible, possible + 2)
[1] "Sli" "St "
The regular expression “St ” is made up of three literal
characters. The regular expression matching engine does
something very similar to what we just did.

The Slippery St Frances


|| |||
Found S________|| |||
Followed by t?__| No |||
Is it S?________| No ...||| Keep looking for an S
|||
Found S_________________|||
Followed by t?___________|| Yes
Followed by a blank?______| Yes - A match!
Luckily, we don’t actually need to write our own functions
for replacement. The R functions gsub() and sub() look
for a pattern and replace it within a string with some
other text.

The “g” in gsub() refers to global. It changes all the


matches, whereas sub() only replaces the first match.

> strings <- c("a test", "and one and one is two",
+ "one two three")
> gsub("one", "1", strings)
[1] "a test" "and 1 and 1 is two" "1 two three"
> sub("one", "1", strings)
[1] "a test" "and 1 and one is two" "1 two three"
What about finding fake “words” such as rep1!c@ted or
Vi@graa? In this case, we’re looking for numbers and/or
punctuation surrounded by regular letters.

These concepts of “numbers”, “punctuation”, and “regular


letters” get at the idea of equivalent characters or character
classes.

We can enumerate any collection of characters within


[ ]. Example: [Tt]his

If we put a caret (^) as the first character, this indicates


that the equivalent characters are the complement of the
enumerated characters.
The character “-” when used within the character class
pattern identifies a range.

Examples: [0-9], [A-Za-z]

If we want to include the character “-” in the set of


characters to match, put it at the beginning of the
character set to avoid confusion.

Example: [-+][0-9]

Note that here we’ve created a pattern from a sequence


of two sub-patterns.
There are also built-in character sets for commonly used
collections.
[[:alpha:]] All alphabetic
[[:digit:]] Digits 0123456789
[[:alnum:]] All alphabetic and numeric
[[:lower:]] Lower case alphabetic
[[:upper:]] Upper case alphabetic
[[:punct:]] Punctuation characters
[[:blank:]] Blank characters, i.e. space or tab

These can be used in conjunction with other characters,


for example [[:digit:]_].
The grep() function in R works in somewhat the same
way as the UNIX command grep, although rather than
returning the matching strings, it returns the indices of the
elements for which there was a match.

However, you can easily use the indices to grab the


corresponding strings.

> Addresses
[1] "Cari Kaufman <cgk@stat.berkeley.edu"
[2] "depchairs03-04@uclink.berkeley.edu"
[3] "Chancey <_arkbound@deutschland.de>"
> grep("[[:digit:]_]", Addresses)
[1] 2 3
> Addresses[grep("[[:digit:]_]", Addresses)]
[1] "depchairs03-04@uclink.berkeley.edu"
[2] "Chancey <_arkbound@deutschland.de>"
Going back to our fake “words” example, what will this
match?

[[:alpha:]][[:digit:][:punct:]][[:alpha:]]

Can you foresee any problems with it?


> subjectLines
[1] "Re: 90 days" "Fancy rep1!c@ted watches" "It's me"
> grep("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
subjectLines)
[1] 2 3

We can either remove the apostrophe first:


> newString <- gsub("'", "", subjectLines)
> grep("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
newString)
[1] 2

Or we can specify the particular punctuation marks we’re


looking for:
> grep("[[:alpha:]][[:digit:]!@#$%^&*():;?,.][[:alpha:]]",
subjectLines)
[1] 2
gregexpr() shows exactly where the pattern was found:

> newString
[1] "Re: 90 days" "Fancy rep1!c@ted watches" "Its me"
> gregexpr("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]",
newString)
[[1]]
[1] -1                  (no match)
attr(,"match.length")
[1] -1

[[2]]
[1] 12                  (a match starting at 12)
attr(,"match.length")
[1] 3                   (of length 3)

[[3]]
[1] -1                  (no match)
attr(,"match.length")
[1] -1
Did we miss anything??
We didn’t find p1!c because it consists of four characters:
a letter, a digit, a punctuation mark, and another letter.

To search for the more general pattern of any number of


digits or punctuation marks between letters, we use

[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]

The plus sign indicates that members from the second


character class (digits and punctuation) may appear one or
more times.

The plus sign is an example of a meta character.


More meta characters

^      As the first character in the pattern, anchor for the beginning of the line;
       as the first character in [ ], exclude these characters
$      End of line anchor
?      Character or sub-pattern occurs zero or one time
+      Character or sub-pattern occurs one or more times
*      Character or sub-pattern occurs zero or more times
.      Any single character
[ ]    Character class
-      Range within a character class
( )    Group or sub-pattern
|      Alternation, i.e. one sub-pattern or another
{ }    Quantifier: {n} means exactly n repeats of the sub-pattern,
       {n,m} n to m repeats, {n,} n or more repeats

What will this match?

^[^[:lower:]]+$
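One way to check your answer in R (the example strings are just illustrative):

grep("^[^[:lower:]]+$", c("HELLO!", "Hello", "123", ""), value = TRUE)
# [1] "HELLO!" "123"    -- lines with at least one character and no lowercase letters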
The position of a character in a pattern determines
whether it is treated as a meta character.

Examples: [-+*/], [1-9]*

When you want to refer to one of these symbols literally, you need to precede it with a backslash (\). However, backslashes already have a special meaning in R’s character strings -- they indicate control characters like newline (\n).

So, to refer to these symbols in R’s regular expressions,


you need to precede them with two backslashes.

The characters for which you need to do this are:

. ^ $ + ? ( ) [ ] { } | \
Announcements:

There will be a new homework posted on bSpace today


and due Wed., Oct. 29.

I will be out of town Wednesday - Sunday and able to


respond to email only about once per day.

-- In class Thursday, Dr. Deborah Nolan will be our guest


speaker. She’ll cover some concepts you’ll need in the
new homework.
-- Daisy will be holding extra office hours on Thursday
from 3:30-4:30 in the 342 Evans Hall lab.
Last time, we learned that regular expressions are made
up of

1. Literal characters
2. Character classes
3. Modifiers

Today we’ll cover some advanced concepts, including

- getting text data into R


- greedy matching
- tagging and back references

We’ll do this all in the context of learning how to


automatically grab text data from the web.
If I select a state here, I get...
a table with vote counts for that state. The URL is

http://www.usatoday.com/news/politicselections/
vote2004/PresidentialByCounty.aspx?
oi=P&rti=G&tf=l&sp=CA

If I click a few more states,


I see the only thing that
changes is the abbreviation
at the end.

So it’s easy to create the


URLS in R using paste.
The most flexible way to read text files, like web pages,
into R is to use the readLines function. The result is a
character vector where each element is a line in the file.

state <- "California"


# Use the built-in state names and abbreviations
abb <- state.abb[state == state.name]
web <- readLines(paste("http://www.usatoday.com/news/
politicselections/vote2004/PresidentialByCounty.aspx?
oi=P&rti=G&tf=l&sp=", abb, sep = ""))

If we wanted a single long character vector, we could


simply say

paste(web, collapse = " ")

However, it’s often easier to process line by line.


Most web browsers allow you to look at the source file
creating what you see on the screen.

The first county was Alameda. Searching for it in the file,


we see some lines like this:
<td class="notch_medium" width="153"><b>County</b></td><td class="notch_medium" align="Right" width="65"><b>Total
Precincts</b></td><td class="notch_medium" align="Right" width="70"><b>Precincts Reporting</b></td><td class="notch_medium" align="Right"
width="60"><b>Bush</b></td><td class="notch_medium" align="Right" width="60"><b>Kerry</b></td><td class="notch_medium" align="Right"
width="60"><b>Nader</b></td>
</tr><tr>
<td class="notch_white" width="153"><b>Alameda</b></td><td class="notch_white" align="Right" width="65">1,141</td><td
class="notch_white" align="Right" width="70">1,141</td><td class="notch_white" align="Right" width="60">107,489</td><td class="notch_white"
align="Right" width="60">326,675</td><td class="notch_white" align="Right" width="60">0</td>
</tr><tr>
<td class="notch_light" width="153"><b>Alpine</b></td><td class="notch_light" align="Right" width="65">5</td><td
class="notch_light" align="Right" width="70">5</td><td class="notch_light" align="Right" width="60">311</td><td class="notch_light"
align="Right" width="60">373</td><td class="notch_light" align="Right" width="60">0</td>
</tr><tr>
<td class="notch_white" width="153"><b>Amador</b></td><td class="notch_white" align="Right" width="65">57</td><td
class="notch_white" align="Right" width="70">57</td><td class="notch_white" align="Right" width="60">10,479</td><td class="notch_white"
align="Right" width="60">6,211</td><td class="notch_white" align="Right" width="60">0</td>
</tr><tr>

Some additional searching reveals that width=“153” only


occurs in the lines of the table. Knowing something about
HTML would have helped, but it wasn’t necessary.
Using grep in R, we can find just those lines containing
width=“153”. Removing the first one (the header of the
table), we have a bunch of lines like this:

[1] "\t\t\t\t<td class=\"notch_white\" width=


\"153\"><b>Alameda</b></td><td class=\"notch_white\"
align=\"Right\" width=\"65\">1,141</td><td class=
\"notch_white\" align=\"Right\" width=\"70\">1,141</td><td
class=\"notch_white\" align=\"Right\" width=
\"60\">107,489</td><td class=\"notch_white\" align=\"Right
\" width=\"60\">326,675</td><td class=\"notch_white\"
align=\"Right\" width=\"60\">0</td>"

Goal: grab the county name, votes for Bush, and votes for
Kerry.
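The selection step described above might look like this (a sketch; rowsoftable is the name used in the gsub() call below):

rowsoftable <- grep('width="153"', web)
rowsoftable <- rowsoftable[-1]    # drop the header row of the table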

We can make our lives a little easier by first removing all


the HTML tags. Note that they all start with < and end
with >. So we might think to try

gsub(pattern = "<.*>", replacement = "",


x = web[rowsoftable])
The issue is that the regular expression engine performs
something called “greedy matching.” This means that it will
always try to find the longest pattern that satisfies the
match.

To get around this, we have to consider what we really


want to match: in this case it isn’t anything at all (denoted
by .*), it’s anything except a right angle bracket (denoted by
[^>]*).

> newtable <- gsub(pattern = "<[^>]*>", replacement = " ",


+ x = web[rowsoftable])
> newtable[1:3]
[1] "\t\t\t\t Alameda 1,141 1,141 107,489 326,675 0 "
[2] "\t\t\t\t Alpine 5 5 311 373 0 "
[3] "\t\t\t\t Amador 57 57 10,479 6,211 0 "

There are a couple different ways we can now tackle the


problem. As an exercise, try doing it using strsplit.
(What regular expression can you use as the delimiter?)
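One hypothetical strsplit-based sketch, which assumes the fields of newtable are separated by runs of whitespace and that the last five fields are the numeric columns (so multi-word county names stay intact):

pieces   <- strsplit(sub("^[[:space:]]+", "", newtable), "[[:space:]]+")
counties <- sapply(pieces, function(x) paste(head(x, -5), collapse = " "))
bush     <- as.numeric(gsub(",", "", sapply(pieces, function(x) x[length(x) - 2])))
kerry    <- as.numeric(gsub(",", "", sapply(pieces, function(x) x[length(x) - 1])))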

Another way to do it is to use tagging and back references.

The idea is that we write a regular expression for the


whole pattern, and then we mark what we later want to
refer to using ().

> gsub(pattern = "([[:alnum:]]+)@([[:alnum:].]+)",


+ replacement = "\\1", "cgk@stat.berkeley.edu")
[1] "cgk"
We can do the same thing here, with pattern

pattern <- "^\t\t\t\t (.*) [[:digit:],]+ +[[:digit:],]+ +([[:digit:],]+) +([[:digit:],]+).*$"
counties <- gsub(pattern, "\\1", newtable)
bush <- gsub(pattern, "\\2", newtable)
kerry <- gsub(pattern, "\\3", newtable)

bush <- as.numeric(gsub(",", "", bush))
kerry <- as.numeric(gsub(",", "", kerry))

Make sure you understand what each part of the regular


expression is doing.

Now we’ll switch over to R to finish the job.


XML
Other than the text data we’ve just learned to work
with, most of the data sets we’ve seen have been in the
form of ASCII tables.

Date Time Lat Lon Depth Mag


1968/01/12 22:19:10.35 36.6453 -121.2497 6.84 3.00
1968/02/09 13:42:37.05 37.1527 -121.5448 8.49 3.00
1968/02/21 14:39:48.10 37.1783 -121.5780 6.95 3.80
1968/03/02 04:25:53.94 36.8343 -121.5447 5.35 3.00
1968/03/17 15:07:02.12 37.3088 -121.6615 4.39 3.00
1968/03/21 21:54:59.94 37.0378 -121.7407 11.86 4.30

Advantages:
- easy to read, write, and process
- in standard cases, don’t need a lot of extra information

But these advantages can quickly disappear....


XML (for ‘eXtensible Markup Language’) is a standard for
semantic, hierarchical representation of data.
<state>
  <gml:name abbreviation="AL">
  ALABAMA
  </gml:name>
  <county>
   <gml:name>
   Autauga County
   </gml:name>
   <gml:location>
    <gml:coord>
     <gml:X>
     -86641472
     </gml:X>
     <gml:Y>
     32542207
     </gml:Y>
    </gml:coord>
   </gml:location>
  </county>
...

Relationships between pieces of data reflect relationships in the real world.
Some positive aspects of XML are

- data is self-describing
- format separates content from structure
- data can be easily merged and exchanged
- file is human-readable
- but file is also easily machine-generated
- standards are widely adopted

Some negative aspects are

- XML documents can be very verbose and hard to read


- it’s so general that it’s hard to develop tools for all cases
- files can be quite large due to high amount of
redundancy
XML has become quite popular in many scientific
fields, and it is standard in many web applications for the
exchange and visualization of data. We’ll learn how to 1)
read/process it, and 2) create it.

We’ll do both of these things from within R, but first let’s


start with an overview of what XML documents look like.
The basic unit of XML code is called an “element” or
“node.” It is made up of both markup and content.
Markup consists of tags, attributes, and comments.

<CYL> 6 </CYL> <!-- CYL element with content 6 -->

Here <CYL> is the start tag, </CYL> is the end tag, 6 is the content, and the comment can appear anywhere.

<CYL> </CYL> <!-- CYL element with no content -->


<CYL/> <!-- another way to write it -->

<CYL size="2"> 6 </CYL>

Here size="2" is an attribute.
XML is well-formed if it obeys certain syntax rules. The
rules for tags are

1. Tag names are case-sensitive; start and end tags must


match exactly.
2. No spaces are allowed between the < and the tag
name.
3. Tag names must begin with a letter and contain only
alphanumeric characters.
4. An element must have both an open and closing tag
unless it is empty.
5. An empty element that does not have a closing tag
must be of the form <tagname/>.
6. Tags must nest properly.
An example (the first line below is the XML declaration; processing instructions and comments can also appear near the top):

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Sunny</LIGHT>
<PRICE>$6.81</PRICE>
<AVAILABILITY>051799</AVAILABILITY>
</PLANT>
</CATALOG>

Note how indentation makes it easier to check that the tags are correctly nested.
In addition, we have the rules

7. All attributes must appear in quotes in a name =


“value” format
8. Isolated markup characters must be specified via entity
references. For example, < is specified by &lt; and > is
specified by &gt;.
9. All XML documents must contain a root node containing
all the other nodes.

This brings us to the tree structure of an XML document.


There is only one root or document node in the tree, and
all the other nodes are contained within it.

We might also think of these other nodes as being


descendants of the root node. We use the language of a
family tree to refer to relationships between nodes.
- parents
- children
- siblings
- ancestors
- descendants

The terminal nodes in a tree are also known as leaf nodes.


Content always falls in a leaf node.
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>

Here the title, author, year, and price elements are the leaf nodes.
Working with XML in R
The first thing we need to do is load the XML library.

> library(XML)

Then read the XML file into R using xmlTreeParse

> doc <- xmlTreeParse("plant_catalog.xml")

and extract the root node using xmlRoot.

> root <- xmlRoot(doc)
> class(root)
[1] "XMLNode"
Aside:

R allows for object-oriented programming. We’re not going


to do any of this style of programming ourselves, but it’s
helpful to know how to interpret it when we see it.

A class is a definition of a type of object. A class contains


slots that are used to hold class-specific information. A
particular object is called an instance of the class.

Methods are functions that are specialized for a certain


class. Rather than being fully object-oriented, R uses what
are called generic functions. These determine the type of
object being operated on and then call the appropriate
function. To see the classes for which the function has
been defined, use methods(functionname).
Ok, back to XML.

xmlTreeParse implements what is called the DOM (Document Object Model) parser. It reads the entire file into memory.

We don’t have time to cover it, but you should be aware


of another parsing model called SAX (Simple API for
XML). It reads the document incrementally and is more
memory efficient, but it is trickier to use.

The tree structure is represented in R as a list of lists.

We can access an element within a node (i.e., a child),


using the usual [[ ]] indexing for lists.
> ## Look at one plant node
> oneplant <- root[[1]]
> class(oneplant)
[1] "XMLNode"
> print(oneplant)
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>

We’ve reached the leaf nodes.


We can access the content of the leaf nodes using the
function xmlValue.

> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"

Note: this removes the markup:

> oneplant[['COMMON']]
<COMMON>Bloodroot</COMMON>
There are special XML versions of lapply, and sapply,
named xmlApply, xmlSApply. Each takes an XMLNode object as
its primary argument. They iterate over the node’s
children nodes, invoking the given function.

Like lapply, xmlApply returns a list. Like sapply, xmlSApply


returns a simpler data structure if possible. In this case,
we can use xmlSApply to extract the names of all the
plants.

> commons <- xmlSApply(root, function(x){


+ xmlValue(x[['COMMON']])})
> head(commons)
PLANT PLANT PLANT
"Bloodroot" "Columbine" "Marsh Marigold"
PLANT PLANT PLANT
"Cowslip" "Dutchman's-Breeches" "Ginger, Wild"
Now we can create the full dataframe.

> getvar <- function(x, var) xmlValue(x[[var]])


> res <- lapply(names(root[[1]]), function(var){
+ xmlSApply(root,getvar,var)})
> plants <- data.frame(res)
> names(plants) <- names(root[[1]])

What is this command doing?


An overview of where we are:

This week: finishing XML, then spatial data


No class next Tuesday, starting SQL on Thursday

We’ll have one more graded assignment (spatial data) and


one more short assignment (SQL).

Nov 18 -- I’ll assign the group projects. These will involve


analyzing the results of the presidential election.
Part I -- due Dec 1 -- data collection and planning
Part 2 -- due Dec 17 (day of the final)
-- carrying out the analysis and completing your report

The final is on paper (not computer) and will take 1-2 hrs.
First, a quick review of sapply and lapply. Remember:

1) lapply and sapply can operate on either a list or a vector.


What will be the results of the following?

lapply(1:3, function(x){x^2})
sapply(1:3, function(x){x^2})
myList <- list(a=1, b=2, c=3)
lapply(myList, function(x){x^2})
sapply(myList, function(x){x^2})

2) You can always include additional arguments. Examples:

sapply(myList, log)
sapply(myList, log, base = 10)
sapply(myList, function(x, pow){x^pow}, pow = 3)
3) If the results of sapply cannot be simplified, then sapply
and lapply will return the same thing.

myList <- list(a=1:2, b=3:5, c=6)


lapply(myList, function(x){x^2})
sapply(myList, function(x){x^2})

(All of these are just examples to illustrate the differences


between sapply and lapply. Of course if you have a
vector, you can just use vectorized operations, e.g. vec^2.)

Finally, do you remember what plain old apply does?
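As a quick reminder (a small sketch): apply() iterates over the margins of a matrix or array.

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)   # row sums: 9 12
apply(m, 2, sum)   # column sums: 3 7 11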


The functions xmlApply and xmlSApply work like lapply and
sapply, except their arguments are XML nodes.

xmlApply returns a list (the elements may themselves be XML nodes).

If it can, xmlSApply returns a vector or matrix. If not, it


also returns a list.

With all of these functions, always ask yourself


1) What do I want to operate on (iterate over)?
2) What do I want to produce?
Spatial Data
In most introductory statistics textbooks, we assume that
when there is more than one observation, they are iid
(independent and identically distributed). This makes the
theory of estimators using these observations very
analytically tractable.

However, one can easily think of instances where this is


not the case.

-- observations of some genetic variable will tend to be


closer within families than between families
-- variables that change over time can have a distribution
for current values that depends on past values
-- variables that are measured over space can have similar
values for nearby locations
Spatial data can be divided into three main types:

1. Geostatistical data associate a value (univariate case) or


values (multivariate case) with a particular location.

Clearly it wouldn’t be appropriate to treat these data as


iid. Often we fit a parametric model for the correlation.
[Figure: a map of geostatistical observations plotted as points, colored by the value of the variable of interest (color scale roughly −2 to 3).]

One of the most common goals is prediction of the variable of interest at new locations.
2. Lattice data consist of measurements that are particular
to a certain geographic region, such as a county.

It’s common to see lattice data when the data collection
method is controlled by government agencies. The census
is one example. A lot of epidemiological data looks like
this too.

Modelling tends to focus on the structure of
“neighborhoods.”
3. Point process data consist of the locations of particular
events. If there are also values associated with the event,
this is known as a marked point process.

The earthquake data in your homework can be modeled


as a marked point process in space and time.

Some questions to study with this type of data are
- Is the rate of events the same everywhere? (Is it a
homogeneous point process?)
- Given this underlying rate, are the events independent?
Or do they tend to cluster together?
Geostatistical data use a continuous representation of
space, with s ∈ S ⊆ ℝ³.

We write the covariance between the variable of interest
at any two locations as Cov(Z(s), Z(s′)).

Note:

Cov(X, Y) = E[(X − EX)(Y − EY)]
          = √Var(X) √Var(Y) Cor(X, Y)
          = σ² Cor(X, Y)

if Var(X) = Var(Y) = σ².
A few simplifying assumptions about the structure of the
covariance function are

1. stationarity - the covariance between Z(s) and Z(s′)
depends only on the relative locations, i.e. s − s′.

2. isotropy - the covariance between Z(s) and Z(s′)
depends only on the distance between the locations,
i.e. ‖s − s′‖.

If both stationarity and isotropy hold, then the covariogram


can be used to study the form of the spatial covariance
function.
The (empirical) covariogram is a scatterplot that puts
distance on the x-axis and covariance on the y-axis.

If we had multiple replications, such as independent


observations in time, then for each pair of locations we
could calculate their covariance over those replications
and add one point to the covariogram. (How many
points would there be in all?)

However, if we have just a single replication, we can look


at pairs whose distance are within a given window of the
distance we want to plot.
Let I_{d,ε} = {(i, j) : i ≤ j, ‖si − sj‖ ∈ (d − ε, d + ε)}

Then the empirical covariogram is

Ĉ_ε(d) = (1 / #I_{d,ε}) Σ_{(i,j) ∈ I_{d,ε}} (Zi − Z̄(i))(Zj − Z̄(j))

where Z̄(i) is the mean of Z over all the i’s appearing in
I_{d,ε}, and Z̄(j) is the mean over all the j’s.
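To make the definition concrete, here is a minimal sketch of the calculation in R. It is not the course's own code, and it assumes hypothetical objects locs (a two-column matrix of coordinates) and z (the observations at those locations).

empirical.covariogram <- function(locs, z, dvec, eps) {
  D <- as.matrix(dist(locs))           # all pairwise distances
  keep <- upper.tri(D, diag = TRUE)    # pairs with i <= j
  sapply(dvec, function(d) {
    in.window <- keep & D > d - eps & D < d + eps
    idx <- which(in.window, arr.ind = TRUE)
    if (nrow(idx) == 0) return(NA)
    zi <- z[idx[, 1]]
    zj <- z[idx[, 2]]
    mean((zi - mean(zi)) * (zj - mean(zj)))   # average over the pairs in I_{d,eps}
  })
}

# e.g. covariances at distances 25, 50, ..., 500 with a window of +/- 25:
# chat <- empirical.covariogram(locs, z, dvec = seq(25, 500, by = 25), eps = 25)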
Recall from last time:

Geostatistical data consist of observations associated with


a set of locations.

Today we’ll talk about how to interpolate a geostatistical


data set.

Example: Average surface ozone at monitoring stations

Our goal: Estimate ozone over a grid of locations

[Figure: map of average ozone at the monitoring stations (colour scale roughly 35–65), covering longitudes −95 to −75 and latitudes 34 to 42.]
Plotting the data

You’ll need to load the packages fields and maps.

From fields, we use


image.plot - color plot of a data set on a regular grid
as.image - take an irregularly spaced data set and put it on
a grid with nrow rows and ncol columns.

From maps, we use

map("state", add = TRUE)

Other databases are available - try "county", "usa", "world".
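Putting those pieces together might look roughly like this (a sketch with placeholder object names lon, lat, and ozone.avg rather than the actual course data):

library(fields)
library(maps)

# Put the irregularly spaced averages onto a regular grid, then plot in colour
im <- as.image(ozone.avg, x = cbind(lon, lat))
image.plot(im)

# Overlay state boundaries
map("state", add = TRUE)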
Now we plot the covariogram and use it to estimate the
spatial covariance.

[Figure: empirical covariogram for the ozone data — estimated covariance (roughly 0 to 30) plotted against distance (0 to 500).]
The idea is that we fit a parametric model to the
covariogram. In other words, we specify a functional form
for the covariance, where the function depends on
certain parameters, and then we estimate those
parameters.

A common parametric model takes the covariance to


decay exponentially with distance

Cov(Z(s), Z(s′)) = σ² exp{−‖s − s′‖/θ}

Here σ² is the variance at a single location, and θ controls
the rate of decay; a higher θ means higher correlation at a
given distance.
One way to estimate the parameters in the covariance
function is to use nonlinear least squares.


Minimize the sum of squared residuals over the
parameters using the function nls.

[Figure: the empirical covariogram from the previous slide — estimated covariance against distance.]
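For example, if the empirical covariogram has been stored in a data frame — here called covgram, with columns d (distance) and chat (estimated covariance); these names are placeholders — the fit might look like:

fit <- nls(chat ~ sigma2 * exp(-d / theta),
           data = covgram,
           start = list(sigma2 = 30, theta = 100))
coef(fit)    # estimates of sigma2 and theta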
Now we need to determine the set of locations at which
we want to do the prediction.

Remember expand.grid from when we were running


simulation studies? It works here too, just be sure to
convert to a matrix, rather than a data frame.

You need to determine the resolution of the grid. Higher


resolution looks better, but takes longer.

Also be sure not to extrapolate, i.e. predict beyond the
range of the data.
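A sketch of the grid construction (the 50 x 50 resolution is an arbitrary choice, and lon/lat are again placeholder names for the observed coordinates):

pred.grid <- expand.grid(lon = seq(min(lon), max(lon), length = 50),
                         lat = seq(min(lat), max(lat), length = 50))
pred.grid <- as.matrix(pred.grid)   # convert to a matrix, not a data frame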
A fairly dense grid

[Figure: the prediction grid — a regular array of points over longitude −95 to −75 and latitude 32 to 42.]
Ok, now it is time to predict.

We will use a linear combination of the observations to


give us a prediction at a new location. The only question
then is what weights to assign to each observation.

We choose the weights to minimize the variance of the


resulting estimate. The “best” predictor minimizes this
quantity, and we call it the BLUP, for “best linear unbiased
predictor.”

It turns out the weights for the BLUP are easy to derive if
you know the true covariance function....
First we form the covariance matrix for the observations.
This contains the covariances for every pair of
observations.

Σij = Cov(Z(si), Z(sj)) = σ² exp{−‖si − sj‖/θ}

We calculate a similar matrix for each pair where one
location comes from the set of observations and the
other location comes from the new grid. Writing s*j for
new location j,

γij = Cov(Z(si), Z(s*j)) = σ² exp{−‖si − s*j‖/θ}
If the vector of observations Z has mean zero, the BLUP
is simply
γ′ Σ⁻¹ Z

The weight matrix γ′ Σ⁻¹ depends on the particular values
of the parameters as well as on the distances.

This procedure is called kriging, after Danie Krige, a South
African mining engineer.

Now, if we don’t know the parameters, we can plug in our


estimates when we compute the covariance matrices. This
predictor is no longer the BLUP. (Why not?)
We can also estimate the variance in our predictions.

If the parameters are known, then the vector of variances


of the BLUP at each predicted location is

σ² 1m − diag{γ′ Σ⁻¹ γ}

where 1m is a vector of ones, one for each prediction location.

We plug in our estimates to approximate this variance.

The variance will be small when the predicted location is


close to the original locations.

In fact, with the covariance function we’re currently using,


the variance will be exactly zero if we predict at a
location where we already have an observation. (It
interpolates.)
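Here is one way the kriging calculations could be coded, given estimates sigma2 and theta from the covariogram fit; obs.locs (e.g. cbind(lon, lat)), z, and pred.grid are the placeholder names used above, and this is a sketch rather than the course's reference solution.

library(fields)    # for rdist()

Sigma <- sigma2 * exp(-rdist(obs.locs) / theta)             # obs-to-obs covariances
gamma <- sigma2 * exp(-rdist(obs.locs, pred.grid) / theta)  # obs-to-grid covariances

w <- solve(Sigma, gamma)                 # Sigma^{-1} gamma, one column per grid point
z.pred <- crossprod(w, z)                # gamma' Sigma^{-1} z, the plug-in BLUP
pred.var <- sigma2 - colSums(gamma * w)  # sigma^2 1_m - diag{gamma' Sigma^{-1} gamma}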
Announcements:

There will be no lab tomorrow. I’ll assign your group


projects next Tuesday, and we’ll have a short in-lab
assignment on SQL next Friday.

Be sure to attend on Tuesday so you can make plans with


your group.

If you want to practice the SQL statements we talk about


today, there are lots of free SQL interpreters (programs
that mimic a database server) online.
For example, see http://sqlcourse.com/select.html.
Databases and SQL
A database is a collection of data with information about
how the data are organized (meta-data). A database server
is like a web server, but responds to requests for data
rather than web pages.

We’ll talk about relational database management systems


(RDBMS) and how to communicate with them using the
structured query language (SQL).

Why use a database?

• Coordinate synchronized access to data


• Change continually; give immediate access to live data
• Centralize data for backups
• Support client-server computing
• Control access to the data
An RDBMS has three main parts
• Data definition
• Data access
• Privilege management
We’ll concentrate on data access, assuming the database
is already available and we have the needed privileges.

Topics:
• using SQL to extract info from RDBMSs
• relating these back to similar tasks in R
• using SQL from within R
There are tradeoffs in terms of what we choose to do
using SQL and what we do in R.
A database is made up of one or more two dimensional
tables, usually stored as files on the server.

A very important concept in the design of a database is


normalization. The idea is to remove as much redundancy
as possible when creating the tables. We do this by
breaking the full dataset into separate tables.

The “relational” in RDBMS comes from the fact that we


then need to link the tables together.

For now let’s talk about a single table....


A table is a rectangular arrangement of values, where a
row represents a case, and a column represents a variable
(just like a data frame in R).

Terminology

Object Statistics Database


Table Data frame Relation
Row Case Tuple
Column Variable Attribute
Row count Sample size Cardinality
Column count Dimension Degree
Row ID Row name Key
An entity is the general object of interest. For example, a
lab test. Each case is a particular occurrence of the entity.
This means that rows in the table are unique.

To identify each row, we use a key. A key is just an


attribute or a combination of attributes.

In the lab test example, there is a composite key of both


patient ID and date, since neither is necessarily unique.

In R, the row names of a dataframe play a similar role.


Queries and the SELECT statement

SQL allows us to interactively query the database to


reduce the data by subsetting, grouping, or aggregation.

Each database program tends to have its own version of


SQL, but they all support the same basic SQL statements.
(We say statements rather than commands because SQL
is referred to as a declarative rather than a programming
language.)

The SQL statement for retrieving data is the SELECT


statement. The result will always be another table.
A table called Chips: [the slide shows a small example table of processors, with columns including Processor, Mips, and Microns]

The simplest possible query gives back everything:

SELECT * FROM Chips;

By convention, we display SQL commands in upper case.


Selecting by variables/attributes

Recall that in R, we can select particular variables


(columns) by name.

Chips[ , c("Mips", "Microns")]

The order of the variable names determines the order in


which they’ll be returned in the resulting data frame.

The corresponding SQL query is

SELECT Mips, Microns FROM Chips;


Selecting by cases/tuples

Likewise, in R we can select cases from a dataframe using


their row names.

Chips[c("Pentium", "PentiumII"), ]

Two equivalent SQL queries are

SELECT * FROM Chips
WHERE Processor = "Pentium" OR
      Processor = "PentiumII";

SELECT * FROM Chips
WHERE Processor IN ("Pentium", "PentiumII");
In both R and SQL, we can do both types of subsetting at
once.

R:

Chips[c("Pentium", "PentiumII"), c("Mips", "Microns")]

SQL:

SELECT Mips, Microns FROM Chips


WHERE Processor IN ("Pentium", "PentiumII");
Generalizing, so far we have the syntax

SELECT attribute(s) FROM relation(s)


[WHERE constraints];
[optional]
How would we pull the years of all 32-bit processors that
execute fewer than 250 million instructions per second,
1) in SQL, 2) in R?
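One possible answer, assuming the Chips table has columns named Year, Bits, and Mips — the exact column names are not shown on these slides:

SELECT Year FROM Chips
WHERE Bits = 32 AND Mips < 250;

and in R, with the same column-name assumptions:

Chips[Chips$Bits == 32 & Chips$Mips < 250, "Year"]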
SQL offers limited features for summarizing data -- some
aggregate functions that operate over the rows of a table,
and some mathematical functions that operate on
individual values in a tuple.

The aggregate functions are


• COUNT - number of tuples
• SUM - total of all values for an attribute
• AVG - average value for an attribute
• MIN - minimum value for an attribute
• MAX - maximum value for an attribute
SELECT attribute(s) [or functions of attributes] FROM relation(s)
[WHERE constraints];
Additional clauses

The GROUP BY clause makes the aggregate functions in


SQL more useful. It enables the aggregates to be applied
to subsets of the tuples in a table.

SELECT Region, SUM(Amount) FROM Sales


GROUP BY Region;

The WHERE clause can’t contain an aggregate function,


but the HAVING clause can be used to refer to the
groups to be selected.

SELECT Region, SUM(Amount) FROM Sales


GROUP BY Region HAVING SUM(Amount) > 100000;
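For comparison, roughly equivalent summaries in R (a sketch, assuming a hypothetical data frame Sales with columns Region and Amount):

# Like SELECT Region, SUM(Amount) ... GROUP BY Region
totals <- tapply(Sales$Amount, Sales$Region, sum)

# Like the HAVING clause: keep only groups whose total exceeds 100000
totals[totals > 100000]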
A few other predicates and clauses are

DISTINCT - forces values of an attribute in the results


table to have unique values

NOT - negates conditions in WHERE or HAVING clause

LIMIT - limits the number of tuples returned

SELECT DISTINCT State FROM Sales
WHERE NOT Region IN ("East", "West")
LIMIT 10;
The order of execution of the clauses in a SELECT
statement is as follows:

1. FROM: The working table is constructed.

2. WHERE: The WHERE clause is applied to each tuple of


the table, and only the rows that test TRUE are retained.

3. GROUP BY: The results are broken into groups of


tuples all with the same value of the GROUP BY clause.

4. HAVING: The HAVING clause is applied to each group


and only those that test TRUE are retained.

5. SELECT: The attributes not in the list are dropped and


options such as DISTINCT and LIMIT are applied.
Finally, another useful command is ORDER BY. This is
always executed last, since it is technically part of the host
language and not SQL.

For example, this will order the first seven, not give you
the top seven!

SELECT Location, Amount FROM Sales


ORDER BY Amount DESC LIMIT 7;
Moving on to multiple tables: an example
This is where normalization really comes into play. We
could have stored the same information in one big table:

Designing efficient databases is a topic in its own right. As


users of the database, we just need to understand the
relationships between tables.
An example

SELECT CID, SUM(Balance) AS Total


FROM Registration AS R, Accounts AS A
WHERE A.AcctNo = R.AcctNo
GROUP BY CID;
Your turn: Find the names and addresses of all customers
with accounts in the downtown branch of the bank.
It helps to think about the order of execution (FROM, then
WHERE, then GROUP BY, then SELECT).
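One possible shape for the answer — the real schema was shown on a slide that is not reproduced here, so the Customer table and the Branch column below are assumptions:

SELECT DISTINCT C.Name, C.Address
FROM Customer AS C, Registration AS R, Accounts AS A
WHERE C.CID = R.CID
  AND R.AcctNo = A.AcctNo
  AND A.Branch = "Downtown";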
Recall from last time how subsetting rows and columns in
SQL differs from doing it in R.

R:

Chips[c("Pentium", "PentiumII"), c("Mips", "Microns")]

SQL:

SELECT Mips, Microns FROM Chips


WHERE Processor IN ("Pentium", "PentiumII");
Also remember the additional clauses GROUP BY and
WHERE.

The GROUP BY clause enables the aggregate functions to


be applied to subsets of the tuples in a table.

SELECT Region, SUM(Amount) FROM Sales


GROUP BY Region;

The WHERE clause can’t contain an aggregate function,


but the HAVING clause can be used to refer to the
groups to be selected.

SELECT Region, SUM(Amount) FROM Sales


GROUP BY Region HAVING SUM(Amount) > 100000;
Interacting with databases and multiple tables

Each server may provide multiple databases, each of


which may contain multiple tables, each of which may
have multiple columns. The following commands are
useful for orienting yourself.

SHOW DATABASES;
SHOW TABLES IN database;
SHOW COLUMNS IN table;
DESCRIBE table;            (same thing as SHOW COLUMNS IN table)

Today we’ll get practice using the command line program


MySQL to interact with a database.
springer.cgk% mysql -u stat133 -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 5.0.51a-3ubuntu5.1 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the


buffer.

mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| albums             |
| baseball           |
| music              |
+--------------------+
4 rows in set (0.00 sec)

You can log into springer from any of the department machines. After
typing the mysql command above, enter the password T0pSecr3t.
We’ll try some examples with the database called albums.

mysql> SHOW TABLES IN albums;


+------------------+
| Tables_in_albums |
+------------------+
| Album |
| Artist |
| Track |
+------------------+
3 rows in set (0.00 sec)

mysql> DESCRIBE Album;


ERROR 1046 (3D000): No database selected

Note that to access the tables within a database, we can


use the same “dot” notation as before, e.g. albums.Album.
Or...
We can also first say
mysql> USE albums;
Reading table information for completion of table and
column names
You can turn off this feature to get a quicker startup
with -A

Database changed
mysql> DESCRIBE Album;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| alid | bigint(20) | YES | MUL | NULL | |
| aid | double | YES | | NULL | |
| title | text | YES | | NULL | |
+-------+------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
mysql> describe Album;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| alid  | bigint(20) | YES  | MUL | NULL    |       |
| aid   | double     | YES  |     | NULL    |       |
| title | text       | YES  |     | NULL    |       |
+-------+------------+------+-----+---------+-------+

mysql> describe Artist;
+-------+--------+------+-----+---------+-------+
| Field | Type   | Null | Key | Default | Extra |
+-------+--------+------+-----+---------+-------+
| aid   | double | YES  | MUL | NULL    |       |
| name  | text   | YES  |     | NULL    |       |
+-------+--------+------+-----+---------+-------+

What is the structure of this database?

mysql> describe Track;
+----------+------------+------+-----+---------+-------+
| Field    | Type       | Null | Key | Default | Extra |
+----------+------------+------+-----+---------+-------+
| alid     | bigint(20) | YES  |     | NULL    |       |
| aid      | double     | YES  |     | NULL    |       |
| title    | text       | YES  |     | NULL    |       |
| filesize | bigint(20) | YES  |     | NULL    |       |
| bitrate  | double     | YES  |     | NULL    |       |
| length   | bigint(20) | YES  |     | NULL    |       |
+----------+------------+------+-----+---------+-------+
The individual tables don’t have much interesting
information, since they rely on the IDs.

mysql> SELECT * FROM Album LIMIT 5;
+------+------+-----------------------+
| alid | aid  | title                 |
+------+------+-----------------------+
|  346 |  372 | 'Perk Up'             |
|  226 |  235 | 'Round Midnight       |
|  316 |  326 | 10 to 4 at the 5 Spot |
|  204 |  205 | 100% Pure Funk        |
|  500 |  265 | 150 MPH               |
+------+------+-----------------------+
5 rows in set (0.01 sec)

(The LIMIT 5 is just so that the examples fit on the slides -- you’ll
want to remove it.)

To see anything interesting, we have to link together


multiple tables.
mysql> SELECT Artist.name, Album.title FROM Artist, Album
-> WHERE Artist.aid = Album.aid LIMIT 5;
+----------------------+-----------------------+
| name | title |
+----------------------+-----------------------+
| Shelly Manne | 'Perk Up' |
| Kenny Burrell | 'Round Midnight |
| Pepper Adams Quintet | 10 to 4 at the 5 Spot |
| Jimmy McGriff | 100% Pure Funk |
| Louie Bellson | 150 MPH |
+----------------------+-----------------------+
5 rows in set (0.00 sec)

Combining two datasets in this way is called an inner join.

The arrow (->) appears above because I haven’t entered


the terminating “;” yet. You shouldn’t type it in.
Also, don’t forget you can use AS to rename tables or
columns. This can save a lot of typing.

mysql> SELECT ar.name AS Artist, al.title AS 'Album Name'


-> FROM Artist AS ar, Album AS al
-> WHERE ar.aid = al.aid LIMIT 5;
+----------------------+-----------------------+
| Artist | Album Name |
+----------------------+-----------------------+
| Shelly Manne | 'Perk Up' |
| Kenny Burrell | 'Round Midnight |
| Pepper Adams Quintet | 10 to 4 at the 5 Spot |
| Jimmy McGriff | 100% Pure Funk |
| Louie Bellson | 150 MPH |
+----------------------+-----------------------+
5 rows in set (0.00 sec)
Subqueries

The result of one query can be used to represent a value


in another query.

For example, say we wanted to find the artist and title of


the longest track in the database.

This gives the length, but not the other information:

mysql> SELECT MAX(length) FROM Track;


+-------------+
| MAX(length) |
+-------------+
|        1561 |
+-------------+
1 row in set (0.00 sec)

(Remember the aggregate functions are COUNT, SUM, AVG, MIN, and MAX.)
Now let’s get the name of the track.

mysql> SELECT title,length FROM Track
-> WHERE length = (SELECT max(length) FROM Track);
+------------+--------+
| title      | length |
+------------+--------+
| The Lovers |   1561 |
+------------+--------+
1 row in set (0.00 sec)

How do we get the artist and album information as well?


Ask yourself:
1. What tables do I need (for the FROM clause)?
2. What constraints do I need (for the WHERE clause)?
3. What columns do I want to SELECT?
4. Should I rename anything to save typing?
mysql> SELECT tr.title,al.title,ar.name,tr.length
    ->   FROM Track as tr, Album as al, Artist as ar
    ->   WHERE tr.alid = al.alid AND tr.aid = al.aid
-> AND ar.aid = al.aid
    ->   AND length = (SELECT max(length) FROM Track);

+------------+------------------------+------------+--------+
| title      | title                  | name       | length |
+------------+------------------------+------------+--------+
| The Lovers | Invitation to Openness | Les McCann |   1561 |
+------------+------------------------+------------+--------+
1 row in set (0.00 sec)

Note that it’s very important to make all the links between
tables, or you will get unwanted rows in the table.
Another example: let’s make a table of the number of
artists with certain numbers of albums. (How many
artists have one album, how many have two, etc.)

First, this tells us how many albums each artist (aid) has:

mysql> SELECT aid, COUNT(aid) AS ct FROM Album


-> GROUP BY aid LIMIT 5;
+------+----+
| aid | ct |
+------+----+
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 5 |
+------+----+
5 rows in set (0.01 sec)
mysql> SELECT ct, COUNT(ct) FROM
    -> (SELECT aid, count(aid) AS ct FROM Album GROUP BY aid) AS x
    -> GROUP BY ct ORDER BY ct;
+----+-----------+
| ct | COUNT(ct) |
+----+-----------+
|  1 |       318 |
|  2 |        63 |
|  3 |        30 |
|  4 |        20 |
|  5 |         7 |
|  6 |         4 |
|  7 |         3 |
|  8 |         2 |
|  9 |         2 |
| 10 |         1 |
| 11 |         1 |
| 12 |         2 |
| 14 |         1 |
+----+-----------+
13 rows in set (0.00 sec)

We’re using the whole table from the last slide as a subquery here.
Remember its first few rows looked like:

+------+----+
| aid  | ct |
+------+----+
|    1 |  3 |
|    2 |  1 |
|    3 |  1 |
|    4 |  1 |
|    5 |  5 |
+------+----+
Using MySQL with R

The RMySQL library allows you to connect to a database,


submit a query, and receive the results as a data frame. The
important functions are dbDriver, dbConnect, and
dbGetQuery.

> library(RMySQL)
Loading required package: DBI
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, dbname = "albums",
+ user = "stat133", pass = "T0pSecr3t")

This assumes you are already logged into springer. If you're


connecting from some other SCF machine, you'd need to
add the host='springer' argument to the dbConnect call.
We can grab each full table:

> album <- dbGetQuery(con,


+ statement = "SELECT * FROM Album")
> track <- dbGetQuery(con,
+ statement = "SELECT * FROM Track")
> artist <- dbGetQuery(con,
+ statement = "SELECT * FROM Artist")

Notice that we can get away with not using the


terminating (“;”) here.

The merge function in R performs an inner join. We can


use it to recreate the huge table with every column.
> album.and.artist <- merge(album, artist, by = "aid")
> full <- merge(track, album.and.artist, by = "alid")
> head(full)
alid aid.x title.x filesize bitrate length aid.y
1 1 2 S'Wonderful 3351 126.5808 217 2
2 1 2 Taboo 2528 126.0203 164 2
3 1 2 Just One Of Those Things 3753 128.0000 237 2
4 1 2 Yardbird Suite 2879 126.3243 187 2
5 1 2 It's The Talk Of The Town 2915 126.5238 189 2
6 1 2 Mighty Like A Rose 4503 126.7958 291 2
title.y name
1 Al Haig Trio And Sextets Featu Al Haig
2 Al Haig Trio And Sextets Featu Al Haig
3 Al Haig Trio And Sextets Featu Al Haig
4 Al Haig Trio And Sextets Featu Al Haig
5 Al Haig Trio And Sextets Featu Al Haig
6 Al Haig Trio And Sextets Featu Al Haig

Now we can do any processing we need to do.


On the other hand, with large databases it is often slow
or even impossible to load all the tables into R or create
one large dataframe. Then we can customize the query to
select just what we want and to do some processing on
the remote server.
> query <- "SELECT al.title, ar.name, SUM(tr.length) AS tot\
+ FROM Album AS al,Artist AS ar,Track AS tr\
+ WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid\
+ GROUP BY tr.alid\
+ HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC"
> albums <- dbGetQuery(con, statement = query)

What do you think this query will do? (The backslash at the end of
each line says that the string is not finished, even though we hit
return.)
> head(albums)
title name tot
1 'Perk Up' Shelly Manne 2684
2 Kaleidoscope Sonny Stitt 2679
3 Red Garland's Piano (Remastere Red Garland 2676
4 Ask The Ages Sonny Sharrock 2675
5 Duo Charlie Hunter & Leon Parker 2667
6 Tenor Conclave Hank Mobley/Al Cohn/John Coltr 2665
Numerical
Optimization
Function optimization refers to the problem of finding a
value of x to make some function f(x) as large (or as
small) as possible.

In statistics, these problems often arise in the context of


calculating estimates of model parameters.
• Nonlinear least squares
• Generalized linear models
• Maximum likelihood estimates
Sometimes we can find the solution explicitly, for example
using the derivatives of f. But when the solution can’t be
found in closed form, we turn to numerical optimization.
There are very many techniques for numerical
optimization, and we can’t possibly cover them all.

We’ll talk about two basic methods and how to program


them:
• Golden section search
• Newton-Raphson algorithm
Then we’ll move on to some statistical examples and how
to use the built-in optimization methods in R.

Since every maximization problem can be rewritten as a


minimization problem, using -f(x) rather than f(x), we’ll
assume from now on that we’re minimizing.
Golden section search

Assume f(x) has a single minimum on the interval [a,b].

The golden section search algorithm iteratively shrinks the


interval over which we’re looking for the minimum, until
the length of the interval is less than some preset
tolerance.

The name golden section comes from the fact that at


each iteration, we choose a new point to evaluate so that
we can reuse one of the points from the last iteration. It
works out that the way to do this is to maintain the so-
called golden ratio between the distances between points.
Consider two line segments c and d. They are said to be
in golden ratio if their sum c+d is to c as c is to d.

(c + d)/c = c/d = φ

Writing c = dφ,

⇒ (dφ + d)/(dφ) = dφ/d
⇒ (φ + 1)/φ = φ
⇒ φ² − φ − 1 = 0
⇒ φ = (1 + √5)/2 ≈ 1.618034
An example:

Start with

x1 = b − (b − a)/φ
x2 = a + (b − a)/φ

Now compare f(x1) and f(x2). Since f(x1) < f(x2), we know the
minimum must be in [a, x2].

[Figure: f(x) on the interval [1, 5], with a, x1, x2, b marked.]

Add a new point and maintain the golden ratio. This time
f(x1) > f(x2), so the minimum must be in [x1, b].

Keep going like this....

When b − a is sufficiently small, we stop and report a minimum
of (a + b)/2. One can show that the error is
at most (1 − Φ)(b − a).

[Figures: the same curve at successive iterations, with the interval [a, b] and the points x1, x2 shrinking toward the minimum.]
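Here is one way the algorithm might be written in R (a sketch for illustration, not the course's reference implementation):

golden.section <- function(f, a, b, tol = 1e-6) {
  phi <- (1 + sqrt(5)) / 2
  x1 <- b - (b - a) / phi
  x2 <- a + (b - a) / phi
  f1 <- f(x1); f2 <- f(x2)
  while (b - a > tol) {
    if (f1 < f2) {          # minimum must be in [a, x2]
      b <- x2
      x2 <- x1; f2 <- f1
      x1 <- b - (b - a) / phi
      f1 <- f(x1)
    } else {                # minimum must be in [x1, b]
      a <- x1
      x1 <- x2; f1 <- f2
      x2 <- a + (b - a) / phi
      f2 <- f(x2)
    }
  }
  (a + b) / 2               # report the midpoint of the final interval
}

golden.section(function(x) (x - 2)^2, 0, 5)   # should return roughly 2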
The Newton-Raphson algorithm can be used if the function
to be minimized has two continuous derivatives that may
be evaluated.

Again assume that there is a single minimum in [a, b]. If
the minimizing value x* is not at a or b, then f′(x*) = 0.

If in addition f″(x*) > 0, then x* is a minimum.

The main idea behind N-R is that if we have an initial
value x0 that is close to the minimizing value, then we
can approximate

f′(x) ≈ f′(x0) + (x − x0) f″(x0)

Setting the right-hand side equal to zero gives

x1 = x0 − f′(x0) / f″(x0)

We keep going in this way until f′(xn) is sufficiently close
to zero.

It’s important to have a good initial guess, otherwise the
Taylor series approximation may be very poor and we
may even have f(xn+1) > f(xn).
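A corresponding sketch in R, taking the first and second derivatives as arguments (again an illustration rather than the course's own code):

newton.raphson <- function(fprime, fdoubleprime, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (i in 1:maxit) {
    x <- x - fprime(x) / fdoubleprime(x)   # the N-R update
    if (abs(fprime(x)) < tol) break        # stop when f'(x) is close to zero
  }
  x
}

# Minimize f(x) = x^2 - 4x, whose minimum is at x = 2
newton.raphson(fprime = function(x) 2 * x - 4,
               fdoubleprime = function(x) 2,
               x0 = 0)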
We’ve already seen one example of numerical
optimization in action, when we used nonlinear least
squares to fit a curve to the covariogram for spatial data.

This is useful more generally, for non-linear regression


models of the form
y = f(x, β) + ε,   ε ∼ N(0, σ²)

where y is the response variable, x is a vector of covariates,
β is a vector of coefficients, f is a nonlinear function, and ε
is the error term.

Example: weight loss

[Figure: weight for one patient (roughly 110–170 kg, or 250–400 lb) over about 250 days, with a linear fit superimposed.]

Patients tend to lose weight at a diminishing rate. Here is
data from one patient with a linear fit superimposed.

Another proposed model is

y = β0 + β1 2^{−t/θ} + ε

where β0 is the ultimate lean weight (the asymptote), β1 is
the total amount remaining to be lost, and θ is the time
taken to lose half the amount remaining to be lost.
The function nls in R uses numerical optimization to find
the values of the parameters that minimize the sum of
squared errors
Σ_{i=1}^n (Yi − f(xi, β))²

The main arguments are


• formula - outcome on the LHS, function on the RHS
• data - dataframe holding the variables
• start - vector of starting values
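For the weight-loss model above, the call might look roughly like this. The data frame name wtloss and its columns Days and Weight follow the version of this data set in the MASS package, which may differ from whatever file was used in lecture; the starting values are rough guesses.

library(MASS)    # wtloss: Days, Weight for one patient

fit <- nls(Weight ~ b0 + b1 * 2^(-Days / theta),
           data = wtloss,
           start = list(b0 = 90, b1 = 95, theta = 120))
summary(fit)

# Overlay the fitted curve on the data
plot(Weight ~ Days, data = wtloss)
lines(wtloss$Days, fitted(fit))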
Numerical
Optimization II: Fitting
Generalized Linear
Models
The normal linear model assumes that

1) the expected value of the outcome variable can be


expressed as a linear function of the explanatory
variables, and

2) the residuals (observations minus their expected


values) are independent and identically distributed with
a normal distribution.

Last time we talked about relaxing assumption 1), using


nonlinear regression models.

Today we’ll talk about relaxing assumption 2), using what


are called generalized linear models.
First, a few words about the normal linear model.

With a single explanatory variable, it has the form

Yi = β0 + β1 Xi + εi,   where the εi are iid N(0, σ²), i = 1, . . . , n

Recall that the least squares estimates of β0 and β1
minimize the residual sum of squares

RSS(β0, β1) = Σ_{i=1}^n (Yi − β0 − β1 Xi)²

With a little calculus, we can minimize RSS explicitly....


∂RSS/∂β0 = −2 Σ_{i=1}^n (Yi − β0 − β1 Xi)
∂RSS/∂β1 = −2 Σ_{i=1}^n (Yi − β0 − β1 Xi) Xi

Setting each equal to zero and solving, we get

β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²

β̂0 = Ȳ − β̂1 X̄

In this model, the least squares estimates are equal to the


maximum likelihood estimates, which we’ll discuss next
time.
We can do similar calculations if we have more than one
explanatory variable.

Another way of thinking about what we’ve done in the


normal linear model is that we’ve expressed the mean of
the Y’s as a linear combination of the X’s.

E[Yi] = β0 + β1 Xi + E[εi] = β0 + β1 Xi

To work with non-normal distributions, we’re going to


slightly modify this idea.
First, a motivating example.

Prior to the launch of the space shuttle Challenger, there


was some debate about whether temperature had any
effect on the performance of a key part called an O-ring.
The following plot, with data from past flights, was used as
evidence that it was safe to launch at a temperature of 31°F.
One key problem with this analysis was that the
engineers left out the data from all the flights with no O-
ring problems, under the mistaken assumption that these
gave no extra information.
The solid rocket motors (labeled
3 and 4) are delivered to
Kennedy Space Center in four
pieces, and they are connected
on site using the O-rings. There
are actually two sets of O-rings
at each joint, but we’ll focus on
the primary ones.

So in each launch, there are six


primary O-rings that can fail. If
any one fails, it can lead to a
catastrophic failure of the whole
shuttle.
Here are the data on past failures of the primary O-rings.

The data from past flights come from rocket motors that
are retrieved from the ocean after the flight. There had
been 24 shuttle launches prior to Challenger, of which the
rocket motors were retrieved in 23 cases.

   Temp Fail Date
1    66    0 4/12/81
2    70    1 11/12/81
3    69    0 3/22/82
4    68    0 11/11/82
5    67    0 4/4/83
6    72    0 6/18/83
7    73    0 8/30/83
8    70    0 11/28/83
9    57    1 2/3/84
10   63    1 4/6/84
11   70    1 8/30/84
12   78    0 10/5/84
13   67    0 11/8/84
14   53    2 1/24/85
15   67    0 4/12/85
16   75    0 4/29/85
17   70    0 6/17/85
18   81    0 7/29/85
19   76    0 8/27/85
20   79    0 10/3/85
21   75    2 10/30/85
22   76    0 11/26/85
23   58    1 1/12/86
We could fit a linear regression model to this data,
relating the expected number of failures to temperature.

Some problems with this approach are that

1) the residuals are clearly not iid normal

2) if we go out far enough, we actually predict a negative
number of failures.

[Figure: number of failures (0–6) plotted against temperature (0–80 degrees F).]
Instead we will fit a logistic regression model.

This model is appropriate when the data have a binomial


distribution (counting the number of events out of n
trials), of which binary data is a special case with n=1.

The expected value for a given trial is pi , the probability


of an event when the explanatory variable X = Xi . We
relate this to the linear predictor using the logit function.
log( pi / (1 − pi) ) = β0 + β1 Xi

The ratio pi/(1 − pi) is called the odds, so the logit can
also be called the log odds.
The case we are discussing (binomial outcome, logit
function) is a special case of a larger class of models called
generalized linear models.

Some other examples:

Normal outcome, identity link:   Yi ∼ N(µi, σ²),   µi = β0 + β1 Xi

Poisson outcome, log link:   Yi ∼ Pois(λi),   log(λi) = β0 + β1 Xi

Note that in each case, the link function maps the space of
the parameter representing the mean of the distribution
(µi , λi , or pi ) to the real line, which is the space of the linear
predictor.
Generalized linear models can be fit using an algorithm
called iteratively reweighted least squares. In R, this is
implemented in the function glm.

# First create a matrix with events and non-events


FN <- cbind(challenge$Fail, 6 - challenge$Fail)

# Fit using specified family, default link function


glm.fit <- glm(FN~Temp, data = challenge,
family = binomial)

# Now predict for a range of temperatures
tempseq <- seq(0, 90, length = 100)
pred <- predict(glm.fit, newdata = data.frame(Temp = tempseq),
                se.fit = TRUE)
inv.logit <- function(x){1/(1+exp(-x))}

# Plot the observed failure proportions first (a step implied by the
# figure on the next slide, not shown in the original code)
plot(challenge$Temp, challenge$Fail/6, xlim = c(0, 90), ylim = c(0, 1),
     xlab = "Temperature (Degrees F)", ylab = "Probability of Failure")
lines(tempseq, inv.logit(pred$fit))
lines(tempseq, inv.logit(pred$fit + 2*pred$se.fit), lty = 2)
lines(tempseq, inv.logit(pred$fit - 2*pred$se.fit), lty = 2)
The confidence interval at 31°F is quite wide, but the
point estimate probably still should have been cause for
alarm, especially since the temperature was colder than
anything that had been tried before.
[Figure: probability of O-ring failure (0 to 1) against temperature (0–80 degrees F), showing the fitted logistic curve with dashed ±2 standard error bands and the observed proportions as points.]
Interpretation of the coefficients is a bit trickier in logistic
regression models than it is in linear regression models.

When E[Yi ] = β0 + β1 Xi , we can say that


• β0 is the expected value of Y when Xi = 0
(This is not always interesting; for example the
temperature will almost never be 0°F.)
• β1 is the change in the expected value of Y due to a unit
increase in X.
Now we have

log( pi / (1 − pi) ) = β0 + β1 Xi

We can interpret the parameters in terms of log odds,
odds, or probabilities.

The interpretation regarding log odds is the easiest to


state but probably the hardest to understand.

• β0 is the log odds of an event when Xi = 0


• β1 is the change in the log odds due to a unit
increase in X.

Now, if we exponentiate both sides, we get

pi /(1 − pi ) = exp(β0 + β1 Xi )

which implies that exp(β0 ) is the value of the odds


when Xi = 0
Also, suppose Xi = Xj + 1.

[pi /(1 − pi)] / [pj /(1 − pj)] = [exp(β0) exp(β1 Xi)] / [exp(β0) exp(β1 Xj)]
                               = exp(β1 (Xi − Xj)) = exp(β1)

So exp(β1 ) gives the multiplicative change in odds


corresponding to a one unit change in X.

In particular, if X takes only the values 0 and 1, then exp(β1 )


is the odds ratio for category 1 compared to category 0.

The interpretation in terms of probabilities is conditional


on other variables in the model, so we’ll save it for after
we talk about using multiple regressors.
Another example, this time with multiple explanatory
variables

> library(MASS)
> birthwt[1:2,]
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551

'low' indicator of birth weight less than 2.5kg


'age' mother's age in years
'lwt' mother's weight in pounds at last menstrual period
'race' mother's race ('1' = white, '2' = black, '3' = other)
'smoke' smoking status during pregnancy
'ptl' number of previous premature labours
'ht' history of hypertension
'ui' presence of uterine irritability
'ftv' number of physician visits during the first trimester
'bwt' birth weight in grams
We are now confronted with the question of model
choice. There are a variety of principles that can guide us
here, but in the interest of time, let’s consider one
criterion balancing goodness of fit with parsimony (the
number of parameters).

The Akaike information criterion is

AIC = 2k − 2 log(L)

where k is the number of parameters in the given model


and L is the maximized value of the likelihood for that
model. (For now, you can think of the likelihood as the
joint density of the data for a particular setting of the
parameter values.) Looking at this criterion, we favor
models with a lower value of AIC.
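As a sketch of AIC-based model choice on the birthwt data (not necessarily how the lecture proceeded), the step function repeatedly drops or adds terms, keeping a change whenever it lowers AIC:

library(MASS)
bw <- birthwt
bw$race <- factor(bw$race)   # race is coded 1/2/3, so treat it as categorical

# Full logistic regression for low birth weight (bwt itself is excluded,
# since low is defined from it), then backward selection by AIC
full <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
            data = bw, family = binomial)
chosen <- step(full)
summary(chosen)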
Numerical
Optimization III:
Maximizing Likelihoods
One of the canonical cases in which we need to
numerically optimize a function in statistics is to find the
maximum likelihood estimate.
For X1, . . . , Xn iid ∼ f(x; θ), the likelihood function is

L(θ) = ∏_{i=1}^n f(Xi; θ)

The log-likelihood function is

ℓ(θ) = log L(θ) = Σ_{i=1}^n log f(Xi; θ)

The maximum likelihood estimator (MLE), which we’ll
denote by θ̂, is the value of θ that maximizes L(θ).
(Note that this is equivalent to maximizing ℓ(θ).)
Another important thing to note is that we can multiply
the likelihood by a constant (or add a constant to the log-
likelihood), and this does not change the location of the
maximum.

Therefore, we often work only with the part of the


likelihood that concerns θ. This part of the function is
called the kernel.

In simple cases, we can often find the MLE in closed form


by, for example, differentiating the log-likelihood with
respect to θ, setting this equal to zero, and solving for θ.
But things are often not this simple!
As an example, let’s go back to the logistic regression
model. Remember, we have Yi ∼ Ber(pi ), i = 1, . . . , n
where, inverting the logit function, we have

pi = exp{β0 + β1 Xi} / (1 + exp{β0 + β1 Xi})

and the likelihood function is

L(β0, β1) = ∏_{i=1}^n pi^{Yi} (1 − pi)^{1−Yi},

substituting in the expression above for pi.

We can’t maximize this analytically as a function of β0


and β1 , but we can easily write a function for the
likelihood or log-likelihood and have R do the work for
us....
# Function for the negative log-likelihood

logistic.nll <- function(beta, x, y, verbose = FALSE){


if(verbose) print(beta)
beta0 <- beta[1]; beta1 <- beta[2]
pvec <- exp(beta0 + beta1 * x) /
(1 + exp(beta0 + beta1 * x))
fvec <- y * log(pvec) + (1-y) * log(1 - pvec)
return(-sum(fvec))
}

# Use optim to minimize the nll


# par is a vector of starting values
# better starting values => faster convergence, and
# less chance of missing the global maximum

optim(par = c(0, 0), fn = logistic.nll,


x = x, y = y, verbose = TRUE)
In the case of logistic regression, minimizing the negative
log-likelihood using optim will give the same answer as
using glm with family = binomial.

However, there are many other models without built-in


functions like glm. One example is the spatial models we
discussed a few weeks ago.

Suppose we have a spatial field with mean zero and


covariance function

Cov(Z(si), Z(sj)) = σ² exp{−‖si − sj‖/ρ}

Before, we estimated σ² and ρ by finding the covariogram
and fitting a curve to it using nonlinear least squares.
However, the MLEs are actually much better estimators.
The kernel of the likelihood function (for normal data)
looks like this:

|Σ(σ², ρ)|^{−1/2} exp{−Z′ Σ(σ², ρ)⁻¹ Z / 2}

where Z = (Z1, Z2, . . . , Zn)′ is the vector of observations
and Σ(σ², ρ) is the n by n matrix with

Σ(σ², ρ)ij = Cov(Z(si), Z(sj)) = σ² exp{−‖si − sj‖/ρ}

We can again use optim to find the MLEs numerically....
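A sketch of what that could look like, reusing the placeholder names obs.locs and z from the kriging example (illustrative code, not the course's own):

# Negative log-likelihood for the mean-zero model with exponential covariance.
# Optimizing over log(sigma2) and log(rho) keeps both parameters positive.
spatial.nll <- function(par, locs, z) {
  sigma2 <- exp(par[1]); rho <- exp(par[2])
  Sigma <- sigma2 * exp(-as.matrix(dist(locs)) / rho)
  cholS <- chol(Sigma)
  # (1/2) log|Sigma| + (1/2) z' Sigma^{-1} z, dropping the additive constant
  sum(log(diag(cholS))) + 0.5 * sum(backsolve(cholS, z, transpose = TRUE)^2)
}

fit <- optim(par = c(log(1), log(100)), fn = spatial.nll,
             locs = obs.locs, z = z)
exp(fit$par)   # MLEs of sigma2 and rho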


A few words before we move on:

1) It’s always preferable to find the MLEs in closed form if


you can. The answer is exact, and you avoid all the errors
that can be introduced in the numerical optimization,
including possibly converging to a local rather than a
global optimum.

2) If you do need to use numerical optimization, it’s a


good idea to evaluate the likelihood (or log-likelihood)
over a grid of values first, to help you find a good starting
value.

3) There’s a lot more theoretical detail concerning MLEs


that we don’t have time to cover, importantly how to
estimate uncertainty. See Stat 135.
Nonparametric
regression and
scatterplot smoothing
We’ve looked at linear models and nonlinear models with
a specified form, but what if you don’t know a good
function to relate two variables X and Y?

This data set shows head acceleration in a simulated
motorcycle accident, used to test helmets.

[Figure: acceleration plotted against time for the motorcycle data.]
This area of statistics is known as nonparametric regression
or scatterplot smoothing. Basically, we want to draw a
curve through the data that relates X and Y. More
formally, we suppose

Yi = f(Xi) + εi

where f is an unknown function and the εi are iid with


some common distribution, typically normal.

Now, if we don’t put any restrictions on f, it’s easy to get


a perfect fit to the data -- just draw a curve that passes
through all the points! But this curve is unlikely to give
good predictions for any future observations.
Aside: This actually gets at a fundamental idea in
statistics, called the bias-variance tradeoff. We can get a
very low variance estimator of f by interpolating the
data, using a very wiggly curve. But this introduces a lot
of bias. So we look for a happy medium.

We won’t cover the theoretical details here, but just keep
in mind this question of how much smoothing to do.
Back to the motorcycle data....

One of the simplest things we could do would be to fit a


high degree polynomial.

But fitting a global polynomial this way isn’t very efficient.

How about breaking up the region of x and fitting a
separate, lower-degree polynomial in each region?

[Figure: the motorcycle data with global polynomial fits of degree 5, 10, and 20 superimposed.]
This type of model is known as a piecewise polynomial
model or regression splines.

The breakpoints, between which we have separate


polynomial functions, are called knots.

Typically we impose some constraints on the way the


functions match up at the knots, such as maintaining the
first and second derivatives.

So the modelling choices boil down to

1) where to put the knots


2) what degree polynomial to fit between the knots

More knots ⇒ less smoothing


Motorcycle data with 6 knots:

[Figure: piecewise polynomial fits with 6 knots, degree 1 and degree 3.]
Motorcycle data with 9 knots:

[Figure: piecewise polynomial fits with 9 knots, degree 1 and degree 3.]
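Fits like these can be produced with the bs function from the splines package, which builds a piecewise-polynomial (B-spline) basis that lm can use. This is a sketch using the mcycle data from MASS; the lecture's own code may differ, and the knot placement below (quantiles of the time variable) is just one reasonable choice.

library(MASS)       # mcycle: times, accel
library(splines)    # bs() builds the piecewise-polynomial basis

knots <- quantile(mcycle$times, probs = seq(0.15, 0.85, length = 6))
fit1 <- lm(accel ~ bs(times, knots = knots, degree = 1), data = mcycle)
fit3 <- lm(accel ~ bs(times, knots = knots, degree = 3), data = mcycle)

plot(accel ~ times, data = mcycle)
tgrid <- seq(min(mcycle$times), max(mcycle$times), length = 200)
lines(tgrid, predict(fit1, data.frame(times = tgrid)), lty = 2)
lines(tgrid, predict(fit3, data.frame(times = tgrid)))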
Smoothing spline models are defined in a slightly different
way. Within a class of functions, a smoothing spline
minimizes the penalized least squares criterion

n
! "
1
(Yi − f (Xi )) + λ
2
f (x)dx
!!
n i=1

The parameter λ controls how smooth the function is (in


terms of integrated second derivative).

We can specify λ in terms of the equivalent degrees of


freedom of the model, or we can choose it in a data-based
way, using something called cross validation.
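In R this is what smooth.spline does; a quick sketch on the motorcycle data (mcycle from MASS again):

library(MASS)

fit.df10 <- smooth.spline(mcycle$times, mcycle$accel, df = 10)  # fix the equivalent df
fit.auto <- smooth.spline(mcycle$times, mcycle$accel)           # smoothing chosen by
                                                                # (generalized) cross-validation
plot(accel ~ times, data = mcycle)
lines(fit.df10, lty = 2)
lines(fit.auto)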
[Figure: the motorcycle data with smoothing spline fits for df = 10, df = 20, and df chosen by cross-validation.]