
An Introduction to Mapping and Spatial Modelling R

By Richard Harris, School of Geographical Sciences, University of Bristol

An Introduction to Mapping and Spatial Modelling R by Richard Harris is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Based on a work at www.social-statistics.org.
You are free:
to Share: to copy, distribute and transmit the work
to Remix: to adapt the work
Under the following conditions:
Attribution: You must attribute the work in the following manner: Based on An Introduction to Mapping and Spatial Modelling R by Richard Harris (www.social-statistics.org).
Noncommercial: You may not use this work for commercial purposes. Use for education in a recognised higher education institution (a College or University) is permissible.
Share Alike: If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
With the understanding that:
Waiver: Any of the above conditions can be waived if you get permission from the copyright holder (Richard Harris, rich.harris@bris.ac.uk).
Public Domain: Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
Other Rights: In no way are any of the following rights affected by the license:
your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
the author's moral rights;
rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
Notice: For any reuse or distribution, you must make clear to others the license terms of this work, which apply also to derivatives.
(Document version 0.1, November, 2013. Draft version.)


Introduction and contents


This document presents a short introduction to R highlighting some geographical functionality.
Specifically, it provides:


A basic introduction to R (Session 1)
A short 'showcase' of using R for data analysis and mapping (Session 2)
Further information about how R works (Session 3)
Guidance on how to use R as a simple GIS (Session 4)
Details on how to create a spatial weights matrix (Session 5)
An introduction to spatial regression modelling, including Geographically Weighted Regression (Session 6)

Further sessions will be added in the months (more likely, years) ahead.
The document is provided in good faith and the contents have been tested by the author. However, use is entirely at the user's risk. Absolutely no responsibility or liability is accepted by the author for consequences arising from this document howsoever it is used. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (see above).
Before starting, the following should be considered.
First, you will notice that in this document the pages and, more unusually, the lines are numbered.
The reason is educational: it makes directing a class to a specific part of a page easier and faster. For
other readers, the line numbers can be ignored.

Second, the sessions presume that, as well as R, a number of additional R packages (libraries) have
been installed and are available to use. You can install them by following the 'Before you begin'
instructions below.
Third, each session is written to be completed in a single sitting. If that is not possible, you can normally stop at a convenient point, save the workspace before quitting R, then reload the saved workspace when you wish to continue. Note, however, that whereas the additional
packages (libraries) need be installed only once, they must be loaded each time you open R and
require them. Any objects that were attached before quitting R also need to be attached again to take
you back to the point at which you left off. See the sections entitled 'Saving and loading
workspaces', 'Attaching a data frame' and 'Installing and loading one or more of the packages
(libraries)' on pages 10, 31 and 37 for further information.
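As a minimal sketch of that workflow (the file name, the package and the data frame below are examples only, not requirements of the tutorial):
> save.image("mywork.RData")    # Save the workspace before quitting
> q("no")
# ...restart R later, with the same working directory...
> load("mywork.RData")          # Reload the saved workspace
> library(sp)                   # Packages must be loaded again in each new session
> attach(schools.data)          # Re-attach any data frame you were using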


Before you begin


Install R. It can be downloaded from http://cran.r-project.org/.
I am currently using version 3.0.2.
Start R. Use the drop-down menus to change your working directory to somewhere
you are happy to download all the files you need for this tutorial.
At the > prompt type,
download.file("http://dl.dropboxusercontent.com/u/214159700/RIntro.zip", "Rintro.zip")
and press return.

Next, type
unzip("Rintro.zip")

All the data you need for the sessions are now available in the working directory.
If you would like to install all the libraries (packages) you need for these practicals, type
load("begin.RData")

and then
install.libs()

You are advised to read Installing and loading one or more of the packages (libraries) on p. 37
before doing so.
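Alternatively (a suggestion, not part of the supplied files), the packages used in Session 2 can be installed directly from CRAN; later sessions may require others:
> install.packages(c("RgoogleMaps", "png", "sp", "spdep"))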


Please note:
this is a draft version of the document and has not as yet
been thoroughly checked for typos and other errors.


Session 1: Getting Started with R


This session provides a brief introduction to how R works and introduces some of the more
common commands and procedures. Don't worry if not everything is clear at this stage. The
purpose is to get you started, not to make you an expert user. If you would prefer to jump straight to seeing R in action, then move on to Session 2 (p.13) and come back to this introduction later.

1.1 About R
R is an open source software package, licensed under the GNU General Public Licence. You can
obtain and install it for free, with versions available for PCs, Macs and Linux. To find out what is
available, go to the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/

Being free is not necessarily a good reason to use R. However, R is also well developed, well
documented, widely used and well supported by an extensive user community. It is not just software
for 'hobbyists'. It is widely used in research, both academic and commercial. It has well developed
capabilities for mapping and spatial analysis.
In his book R in a Nutshell (O'Reilly, 2010), Joseph Adler writes that R is "very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer's memory". Nevertheless, no software provides the perfect tool for every job, and Adler adds that it is "not good at storing data in complicated structures, efficiently querying data, or working with data that doesn't fit in the computer's memory".


To these caveats it should be added that R does not offer spreadsheet editing of data of the type
found, for example, in Microsoft Excel. Consequently, it is often easier to prepare and 'clean' data
prior to loading them into R. There is an add-in to R that provides some integration with Excel. Go
to http://rcom.univie.ac.at/ and look for RExcel.
A possible barrier to learning R is that it is generally command-line driven. That is, the user types a
command that the software interprets and responds to. This can be daunting for those who are used
to extensive graphical user interfaces (GUIs) with drop-down menus, tabs, pop-up menus, left or
right-clicking and other navigational tools to steer you through a process. It may mean that R takes
a while longer to learn; however, that time is well spent. Once you know the commands it is usually
much faster to type them than to work through a series of menu options. They can be easily edited
to change things such as the size or colour of symbols on a graph, and a log or script of the
commands can be saved for use on another occasion or for sharing with others.
That said, a fairly simple and platform-independent GUI called R Commander can be installed
(see http://cran.r-project.org/web/packages/Rcmdr/index.html). Field et al.'s book Discovering
Statistics Using R provides a comprehensive introduction to statistical analysis in R using both
command-lines and R Commander.

1.2 Getting Started


Assuming R has been installed in the normal way on your computer, clicking on the link/shortcut to
R on the desktop will open the RGui, offering some drop-down menu options, and also the R
Console, within which R commands are typed and executed. The appearance of the RGui differs a
little depending upon the operating system being used (Windows, Mac or Linux) but having used
one it should be fairly straightforward to navigate around another.

An Introduction to Mapping and Spatial Modelling in R. Richard Harris, 2013

Figure 1.1. Screen shot of the R Gui for Windows

1.2.1 Using R as a calculator

At its simplest, R can be used as a calculator. Typing 1 + 1 after the prompt > will (after pressing the return/enter key) produce the result 2, as in the following example:
> 1 + 1
[1] 2

Comments can be indicated with a hash symbol (#) and will be ignored


> # This is a comment, no need to type it

Some other simple mathematical expressions are given below.


> 10 - 5
[1] 5
> 10 * 2
[1] 20
> 10 - 5 * 2        # The order of operations gives priority to multiplication
[1] 0
> (10 - 5) * 2      # The use of brackets changes the order
[1] 10
> sqrt(100)         # Uses the function that calculates the square root
[1] 10
> 10^2              # 10 squared
[1] 100
> 100^0.5           # 100^0.5, i.e. the square root again
[1] 10
> 10^3
[1] 1000
> log10(100)        # Uses the function that calculates the common log
[1] 2
> log10(1000)
[1] 3
> 100 / 5
[1] 20
> 100^0.5 / 5
[1] 2


1.2.2 Incomplete commands

If you see the + symbol instead of the usual (>) prompt it is because what has been typed is
incomplete. Often there is a missing bracket. For example,

> sqrt(        # The + symbol indicates that the command is incomplete
+ 100
+ )
[1] 10
> (1 + 2) * (5 - 1
+ )
[1] 12

Commands broken over multiple lines can be easier to read.

> for (i in 1:10) {    # This is a simple loop
+ print(i)             # printing the numbers 1 to 10 on-screen
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

1.2.3 Repeating or modifying a previous command

If there is a mistake in a line of code that needs to be corrected, or if some previously typed commands will be repeated, then the up and down arrow keys on the keyboard can be used to scroll between previous entries in the R Console. Try it!

1.3 Scripting and Logging in R



1.3.1 Scripting

You can create a new script file from the drop-down menu File > New script (in Windows) or File > New Document (Mac OS). It is basically a text file in which you could write, for example,
a <- 1:10
print(a)


In Windows, if you move the cursor up to the required line of the script and press Ctrl + R, then it
will be run in the R Console. So, for example, move the cursor to where you have typed a <- 1:10
and press Ctrl + R. Then move down a line and do the same. The contents of a, the numbers 1 to 10,
should be printed in the R Console. If you continue to keep the focus on the Scripting window and
go to Edit in the RGui you will find an option to run everything. Similar commands are available
for other operating systems (e.g. Cmd + Return on a Mac). You can save files and load previously saved
files.
Scripting is both good practice and good sense. It is good practice because it allows for
reproducibility of your work. It is good sense because if you need to go back and change things you
can do so easily without having to start from scratch.
Tip: It can be sensible to create the script in a simple text editor that is independent of R, such as Notepad. Although you will not be able to use Ctrl + R in the same way, if R crashes for any reason you will not lose your script file.
1.3.2 Logging

You can save the contents of the R Console window to a text file which will then give you a log file
of the commands you have been using (including any mistakes). The easiest way to do this is to
click on the R Console (to take the focus from the Scripting window) and then use File > Save History (in Windows) or File > Save As (Mac). Note that graphics are not usually plotted in the R
Console and therefore need to be saved separately.

1.4 Some R Basics



1.4.1 Functions, assignments and getting help

It is helpful to understand R as an object-oriented system that assigns information to objects within the current workspace. The workspace is simply all the objects that have been created or loaded
since beginning the session in R. Look at it this way: the objects are like box files, containing useful
information, and the workspace is a larger storage container, keeping the box files together. A useful
feature of this is that R can operate on multiple tables of data at once: they are just stored as
separate objects within the workspace.
To view the objects currently in the workspace, type
> ls()
character(0)


Doing this runs the function ls(), which lists the contents of the workspace. The result,
character(0), indicates that the workspace is empty. (Assuming it currently is).
To find out more about a function, type ? or help() with the function name,
> ?ls()
> help(ls)

This will provide details about the function, including examples of its use. It will also list the
arguments required to run the function, some of which may be optional and some of which may
have default values which can be changed as required. Consider, for example,
> ?log()

A required argument is x, which is the data value or values. Typing log() omits any data and generates an error. However, log(100) works just fine. The argument base takes a default value of exp(1), that is e (approximately 2.72), which means the natural logarithm is calculated. Because the default is assumed unless otherwise stated, log(100) gives the same answer as log(100, base=exp(1)). Using log(100, base=10) gives the common logarithm, which can also be calculated using the convenience function log10(100).
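For example, the following can be typed to confirm the behaviour of the defaults:
> log(100)
[1] 4.60517
> log(100, base=exp(1))
[1] 4.60517
> log(100, base=10)
[1] 2
> log10(100)
[1] 2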

The results of mathematical expressions can be assigned to objects, as can the outcome of many
commands executed in the R Console. When the object is given a name different to other objects
within the current workspace, a new object will be created. Where the name and object already
exist, the previous contents of the object will be over-written without warning, so be careful!
> a <- 10 - 5
> print(a)
[1] 5
> b <- 10 * 2
> print(b)
[1] 20
> print(a * b)
[1] 100
> a <- a * b
> print(a)
[1] 100


In these examples the assignment is achieved using the combination of < and -, as in a <- 100.
Alternatively, 100 -> a could be used or, more simply, a = 100. The print(...) command can often
be omitted, though it is useful, and sometimes necessary (for example, when what you hope should
appear on-screen doesn't).
> f = a * b
> print(f)
[1] 2000
> f
[1] 2000
> sqrt(b)
[1] 4.472136
> print(sqrt(b), digits=3)      # The additional parameter now specifies
[1] 4.47                        # the number of significant figures
> c(a, b)                       # The c(...) function combines its arguments
[1] 100  20
> c(a, sqrt(b))
[1] 100.000000   4.472136
> print(c(a, sqrt(b)), digits=3)
[1] 100.00   4.47

1.4.2 Naming objects in the workspace

Although the naming of objects is flexible, there are some exceptions,



> _a <- 10
Error: unexpected input in "_"
> 2a <- 10
Error: unexpected symbol in "2a"

Note also that R is case sensitive, so a and A are different objects


> a <- 10
> A <- 20
> a == A
[1] FALSE

The following is rarely sensible because it won't appear in the workspace, although it is there,

> .a <- 10
> ls()
[1] "a" "b" "f"
> .a
[1] 10
> rm(.a, A)     # Removes the objects .a and A (see below)

1.4.3 Removing objects from the workspace

From typing ls() we know when the workspace is not empty. To remove an object from the
workspace it can be referenced explicitly, as in rm(A), or indirectly by its position in the
workspace. To see how the second of these options will work, type
> ls()
[1] "a" "b" "f"

The output returned from the ls() function is here a vector of length three where the first element is
the first object (alphabetically) in the workspace, the second is the second object, and so forth. We
can access specific elements by using notation of the form ls()[index.number]. So, the first element,
the first object in the workspace, can be obtained using,
> ls()[1]       # Get the brackets right! some rounded, some square
[1] "a"
> ls()[2]
[1] "b"

Note how the square brackets [] are used to reference specific elements within the vector. Similarly,

> ls()[3]
[1] "f"
> ls()[c(1,3)]
[1] "a" "f"
> ls()[c(1,2,3)]
[1] "a" "b" "f"
> ls()[c(1:3)]      # 1:3 means the numbers 1 to 3
[1] "a" "b" "f"

Using the remove function, rm(...), the first and third objects in the workspace can be removed using

> rm(list=ls()[c(1,3)])
> ls()
[1] "b"

Alternatively, objects can be removed by name


> rm(b)

To delete all the objects in the workspace and therefore empty it, type the following code, but be warned! There is no undo function. Whenever rm(...) is used the objects are deleted permanently.
> rm(list=ls())
> ls()
character(0)    # In other words, the workspace is empty

1.4.4 Saving and loading workspaces

Because objects are deleted permanently, a sensible precaution prior to using rm(...) is to save the
workspace. To do so permits the workspace to be reloaded if necessary and the objects recovered.
One way to save the workspace is to use
> save.image(file.choose(new=T))

Alternatively, the drop-down menus can be used (File > Save Workspace in the Windows version of the RGui). In either case, type the extension .RData manually or else it risks being omitted, making it harder to locate and reload what has been saved. Try creating a couple of objects in your workspace and then save it with the name workspace1.RData.
To load a previously saved workspace, use

> load(file.choose())

or the drop-down menus.


When quitting R, it will prompt to save the workspace image. If the option to save is chosen it will be saved to the file .RData within the working directory. Assuming that directory is the default one, the workspace and all the objects it contains will be reloaded automatically each and every time R is opened, which could be useful but also potentially irritating. To stop it, locate and delete the file.
The current working directory is identified using the get working directory function, getwd() and
changed most easily using the drop-down menus.
> getwd()
[1] "/Users/rich_harris"

(Your working directory will differ from the above)
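If you prefer to set the working directory from the command line, the setwd(...) function can be used instead of the menus (the path below is just an example; substitute a folder that exists on your own computer):
> setwd("C:/RIntro")
> getwd()
[1] "C:/RIntro"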



Tip: A good strategy for file management is to create a new folder for each project in R, saving the
workspace regularly using a naming convention such as Dec_8_1.RData, Dec_8_2.RData etc. That
way you can easily find and recover work.

1.5 Quitting R
Before quitting R, you may wish to save the workspace. To quit R use either the drop-down menus
or
> q()

As promised, you will be prompted whether to save the workspace. Answering yes will save the
workspace to the file .RData in the current working directory (see section 1.4.4, 'Saving and loading
workspaces', on page 10, above). To exit without the prompt, use
> q(save = "no")


Or, more simply,


> q("no")

1.6 Getting Help


In addition to the use of the ? or help() documentation and the material available at CRAN,
http://cran.r-project.org/, R has an active user community. Helpful mailing lists can be accessed
from www.r-project.org/mail.html.
Perhaps the best all-round introduction to R is An Introduction to R, which is freely available at CRAN (http://cran.r-project.org/manuals.html) or by using the drop-down Help menus in the RGui. It is clear and succinct.

I also have a free introduction to statistical analysis in R which accompanies the book Statistics for
Geography and Environmental Science. It can be obtained from http://www.social-statistics.org/?p=354.
There are many books available. My favourite, with a moderate statistical leaning and written with clarity, is,
Maindonald, J. & Braun, J., 2007. Data Analysis and Graphics using R (2nd edition). Cambridge:
CUP.
I also find useful,
Adler, J., 2010. R in a Nutshell. O'Reilly: Sebastopol, CA.
Crawley, MJ, 2005. Statistics: An Introduction using R. Chichester: Wiley (which is a shortened
version of The R Book by the same author).


Field, A., Miles, J. & Field, Z., 2012. Discovering Statistics Using R. London: Sage
However, none of these books is about mapping or spatial analysis (of particular interest to me as a
geographer). For that, the authoritative guide making the links between geographical information
science, geographical data analysis and R (but not really written for R newcomers) is,
Bivand, R.S., Pebesma, E.J. & Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with R.
Berlin: Springer.
Also helpful is,
Ward, M.D. & Skrede Gleditsch, K., 2008. Spatial Regression Models. London: Sage. (Which uses
R code examples).

And
Chun, Y. & Griffith, D.A., 2013. Spatial Statistics and Geostatistics. London: Sage. (I found this
book a little eccentric but it contains some very good tips on its subject and gives worked examples
in R).
The following book has a short section of maps as well as other graphics in R (and is also, as the
title suggests, good for practical guidance on how to analyse surveys using cluster and stratified
sampling, for example):
Lumley, T., 2010. Complex Surveys. A Guide to Analysis Using R. Hoboken, NJ: Wiley.


Springer publish an ever-growing series of books under the banner Use R! If you are interested in visualization, time-series analysis, Bayesian approaches, econometrics, data mining and more, then you'll find something of relevance at http://www.springer.com/series/6991. But you may well also find what you are looking for, for free, on the Internet.


Session 2: A Demonstration of R
This session provides a quick tour of some of R's functionality, with a focus on some geographical
applications. The idea here is to showcase a little of what R can do rather than provide a comprehensive explanation of all that is going on. Aim for an intuitive understanding of the
commands and procedures but do not worry about the detail. More information about the workings
of R is given in the next session. More details about how to use R as a GIS and for spatial analysis
are given in Sessions 4, 5 and 6.
Note: this session assumes the libraries RgoogleMaps, png, sp and spdep are installed and available
for use. You can find out which packages you currently have installed by using

> row.names(installed.packages())

If the packages cannot be found then they can be installed using


install.packages(c("RgoogleMaps","png","sp","spdep")). Note that you may need administrative rights on your computer to install the packages (see Section 3.5.1, Installing and loading one or more of the packages (libraries), p.37).

2.1 Getting Started


As the focus of this session is on showing what R can do rather than teaching you how to do it, instead of requiring you to type a series of commands they can be executed automatically from a previously written source file (a script: see Section 1.3.1, page 7). As the commands are executed we will ask R to echo (print) them to the screen so you can follow what is going on. At regular intervals you will be prompted to press return before the script continues.
To begin, type,
> source(file.choose(), echo=T)

and load the source file session2.R. After some comments that you should ignore, you will be
prompted to load the .csv file schools.csv:
> ## Read in the file schools.csv file
> wait()
Please presss return
schools.data <- read.csv(file.choose())


Assuming there is no error, we will now proceed to a simple inspection of the data. Remember: the
commands you see written below are the ones that appear in the source file. You do not need to type
them yourself for this session.

2.2 Checking the data


It is always sensible to check a data table for any obvious errors.
> head(schools.data)    # Shows the first few rows of the data
> tail(schools.data)    # Shows the bottom few rows of the data

We can produce a summary of each column in the data table using


> summary(schools.data)

In this instance, each column is a continuous variable so we obtain a six-number summary of the
centre and spread of each variable.

The names of the variables are



> names(schools.data)

Next, the number of columns and rows, and a check row-by-row to see if the data are complete
(have no missing data).
> ncol(schools.data)
> nrow(schools.data)
> complete.cases(schools.data)

It is not the most comprehensive check but everything appears to be in order.
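For one further check (an addition to the original script), you could also ask whether any value anywhere in the table is missing:
> any(is.na(schools.data))    # TRUE would indicate at least one missing value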

2.3 Some simple graphics



The file schools.csv contains information about the location and some attributes of schools in
Greater London (in 2008). The locations are given as a grid reference (Easting, Northing). The
information is not real but is realistic. It should not, however, be used to make inferences about real
schools in London.
Of particular interest is the average attainment on leaving primary school (elementary school) of
pupils entering their first year of secondary school. Do some schools in London attract higher
attaining pupils more than others? The variable attainment contains this information.
A stripchart and then a histogram will show that (not surprisingly) there is variation in the average
prior attainment by school.

> attach(schools.data)
> stripchart(attainment, method="stack", xlab="Mean Prior Attainment by School")
> hist(attainment, col="light blue", border="dark blue", freq=F, ylim=c(0,0.30),
+ xlab="Mean attainment")

Here the histogram is scaled so the total area sums to one. To this we can add a rug plot,
> rug(attainment)

also a density curve, a Normal curve for comparison and a legend.

> lines(density(sort(attainment)))
> xx <- seq(from=23, to=35, by=0.1)
> yy <- dnorm(xx, mean(attainment), sd(attainment))
> lines(xx, yy, lty="dotted")
> rm(xx, yy)
> legend("topright", legend=c("density curve","Normal curve"),
+ lty=c("solid","dotted"))

It would be interesting to know if attainment varies by school type. A simple way to consider this is
to produce a box plot. The data contain a series of dummy variables for each of a series of school
types (Voluntary Aided Church of England school: coe = 1; Voluntary Aided Roman Catholic: rc =
1; Voluntary controlled faith school: vol.con = 1; another type of faith school: other.faith = 1; a
selective school (sets an entrance exam): selective = 1). We will combine these into a single,
categorical variable then produce the box plot showing the distribution of average attainment by
school type.
First the categorical variable:

> school.type <- rep("Not Faith/Selective", times=nrow(schools.data))
# This gives each school an initial value which will then be replaced with its actual type
> school.type[coe==1] <- "VA CoE"
# Voluntary Aided Church of England schools are given the category VA CoE
> school.type[rc==1] <- "VA RC"
# Voluntary Aided Roman Catholic schools are given the category VA RC [etc.]
> school.type[vol.con==1] <- "VC"
> school.type[other.faith==1] <- "Other Faith"
> school.type[selective==1] <- "Selective"
> school.type <- factor(school.type)
> levels(school.type)       # A list of the categories
[1] "Not Faith/Selective" "Other Faith"         "Selective" [etc.]


Now the box plots:

> par(mai=c(1,1.4,0.5,0.5))     # Changes the graphic margins
> boxplot(attainment ~ school.type, horizontal=T, xlab="Mean attainment", las=1,
+ cex.axis=0.8)
# Includes options to draw the boxes and labels horizontally
> abline(v=mean(attainment), lty="dashed")      # Adds the mean value to the plot
> legend("topright", legend="Grand Mean", lty="dashed")

Not surprisingly, the selective schools (those with an entrance exam) recruit the pupils with highest
average prior attainment.

Figure 2.1. A histogram with annotation in R


Figure 2.2. Mean prior attainment by school type

2.4 Some simple statistics


It appears (in Figure 2.2) that there are differences in the levels of prior attainment of pupils in
different school types. We can test whether the variation is significant using an analysis of variance.
> summary(aov(attainment ~ school.type))
             Df Sum Sq Mean Sq F value Pr(>F)    
school.type   5  479.8   95.95   71.42 <2e-16 ***
Residuals   361  485.0    1.34

It is, at a greater than 99.9% confidence (F = 71.42, p < 0.001).



We might also be interested in comparing those schools with the highest and lowest proportions of
Free School Meal eligible pupils to see if they are recruiting pupils with equal or differing mean
prior attainment. We expect a difference because free school meal eligibility is used as an indicator
of a low income household and there is a link between economic disadvantage and educational
progress in the UK.
> attainment.high.fsm.schools <- attainment[fsm > quantile(fsm, probs=0.75)]
# Finds the attainment scores for schools with the highest proportions of FSM pupils
> attainment.low.fsm.schools <- attainment[fsm < quantile(fsm, probs=0.25)]
# Finds the attainment scores for schools with the lowest proportions of FSM pupils
> t.test(attainment.high.fsm.schools, attainment.low.fsm.schools)

        Welch Two Sample t-test

data:  attainment.high.fsm.schools and attainment.low.fsm.schools
t = -15.0431, df = 154.164, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.437206 -2.639240
sample estimates:
mean of x mean of y 
 26.58352  29.62174


It comes as little surprise to learn that those schools with the greatest proportions of FSM eligible pupils are also those recruiting lower attaining pupils on average (mean attainment 26.6 vs 29.6, t = -15.0, p < 0.001; the 95% confidence interval for the difference is from -3.44 to -2.64).
Exploring this further, the Pearson correlation between the mean prior attainment of pupils entering
each school and the proportion of them that are FSM eligible is -0.689, and significant (p < 0.001):
> round(cor(fsm, attainment),3)
> cor.test(fsm, attainment)

        Pearson's product-moment correlation

data:  fsm and attainment
t = -18.1731, df = 365, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7394165 -0.6313939
sample estimates:
       cor 
-0.6892159

Of course, the use of the Pearson correlation assumes that the relationship is linear, so let's check:
> plot(attainment ~ fsm)
> abline(lm(attainment ~ fsm))      # Adds a line of best fit (a regression line)

There is some suggestion the relationship might be curvilinear. However, we will ignore that here.
Finally, some regression models. The first seeks to explain the mean prior attainment scores for the
schools in London by the proportion of their intake who are free school meal eligible. (The result is
the line of best fit added to the scatterplot above).
The second model adds a variable giving the proportion of the intake of a white ethnic group.
The third adds a dummy variable indicating whether the school is selective or not.
> model1 <- lm(attainment ~ fsm, data=schools.data)
> summary(model1)

Call:
lm(formula = attainment ~ fsm, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8871 -0.7413 -0.1186  0.5487  3.6681 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.6190     0.1148  258.12   <2e-16 ***
fsm          -6.5469     0.3603  -18.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.178 on 365 degrees of freedom
Multiple R-squared: 0.475,	Adjusted R-squared: 0.4736 
F-statistic: 330.3 on 1 and 365 DF,  p-value: < 2.2e-16

> model2 <- lm(attainment ~ fsm + white, data=schools.data)
> summary(model2)

Call:
lm(formula = attainment ~ fsm + white, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9442 -0.7295 -0.1335  0.5111  3.7837 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.1250     0.1979  152.21  < 2e-16 ***
fsm          -7.2502     0.4214  -17.20  < 2e-16 ***
white        -0.8722     0.2796   -3.12  0.00196 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.164 on 364 degrees of freedom
Multiple R-squared: 0.4887,	Adjusted R-squared: 0.4859 
F-statistic: 173.9 on 2 and 364 DF,  p-value: < 2.2e-16

> model3 <- update(model2, . ~ . + selective)
# Means: take the previous model and add the variable 'selective'
> summary(model3)

Call:
lm(formula = attainment ~ fsm + white + selective, data = schools.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6262 -0.5620  0.0537  0.5607  3.6215 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.1706     0.1689 172.712   <2e-16 ***
fsm          -5.2381     0.3591 -14.586   <2e-16 ***
white        -0.2299     0.2249  -1.022    0.307    
selective     3.4768     0.2338  14.872   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9189 on 363 degrees of freedom
Multiple R-squared: 0.6823,	Adjusted R-squared: 0.6796 
F-statistic: 259.8 on 3 and 363 DF,  p-value: < 2.2e-16

Looking at the adjusted R-squared value, each model appears to be an improvement on the one that
precedes it (marginally so for model 2). However, looking at the last (model 3), we may suspect that
we could drop the white ethnicity variable with no significant loss in the amount of variance
explained. An analysis of variance confirms that to be the case.

> model4 <- update(model3, . ~ . - white)
# Means: take the previous model but remove the variable 'white'
> anova(model4, model3)
Analysis of Variance Table

Model 1: attainment ~ fsm + selective
Model 2: attainment ~ fsm + white + selective
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    364 307.42                           
2    363 306.54  1   0.88222 1.0447 0.3074

The residual error, measured by the residual sum of squares (RSS), is not very different for the two
models, and that difference, 0.882, is not significant (F = 1.045, p = 0.307).

2.5 Some simple maps


For a geographer like myself, R becomes more interesting when we begin to look at its
geographical data handling capabilities.


The schools data contain geographical coordinates and are therefore geographical data.
Consequently they can be mapped. The simplest way for point data is to use a 2-dimensional plot,
making sure the aspect ratio is fixed correctly.
> plot(Easting, Northing, asp=1, main="Map of London schools")
# The argument asp=1 fixes the aspect ratio correctly

Amongst the attribute data for the schools, the variable esl gives the proportion of pupils who speak
English as an additional language. It would be interesting for the size of the symbol on the map to
be proportional to it.
> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5))

It would also be nice to add a little colour to the map. We might, for example, change the default
plotting 'character' to a filled circle with a yellow background.
> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5), pch=21, bg="yellow")

A more interesting option would be to have the circles filled with a colour gradient that is related to a second variable in the data, the proportion of pupils eligible for free school meals, for example. To achieve this, we can begin by creating a simple colour palette:
> palette <- c("yellow","orange","red","purple")

We now cut the free school meals eligibility variable into quartiles (four classes, each containing
approximately the same number of observations).
> map.class <- cut(fsm, quantile(fsm), labels=FALSE, include.lowest=TRUE)

The result is to split the fsm variable into four groups with the value 1 given to the first quarter of
the data (schools with the lowest proportions of eligible pupils), the value 2 given to the next
quarter, then 3, and finally the value 4 for schools with the highest proportions of FSM eligible
pupils.
There are, then, now four map classes and the same number of colours in the palette. Schools in
map class 1 (and with the lowest proportion of fsm-eligible pupils) will be coloured yellow, the next
class will be orange, and so forth.
Bringing it all together,
> plot(Easting, Northing, asp=1, main="Map of London schools",
+ cex=sqrt(esl*5), pch=21, bg=palette[map.class])


It would be good to add a legend, and perhaps a scale bar and North arrow. Nevertheless, as a first
map in R this isn't too bad!
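One way to add a basic legend (a suggestion only, not part of the session script; it assumes the palette and map.class objects created above still exist) is:
> legend("topright", legend=c("lowest 25%", "2nd quarter", "3rd quarter", "highest 25%"),
+ pch=21, pt.bg=palette, title="FSM eligibility")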


Figure 2.3. A simple point map in R

Why don't we be a bit more ambitious and overlay the map on a Google Maps tile, adding a legend
as we do so? This requires us to load an additional library for R and to have an active Internet
connection.
> library(RgoogleMaps)

If you get an error such as the following


Error in library(RgoogleMaps) : there is no package called RgoogleMaps

it is because the library has not been installed.


Assuming that the data frame, schools.data, remains in the workspace and attached (it will be if you
have followed the instructions above), and that the colour palette created above has not been
deleted, then the map shown in Figure 2.4 is created with the following code:
> MyMap <- MapBackground(lat=Lat, lon=Long)
> PlotOnStaticMap(MyMap, Lat, Long, cex=sqrt(esl*5), pch=21,
+ bg=palette[map.class])
> legend("topleft", legend=paste("<", tapply(fsm, map.class, max)),
+ pch=21, pt.bg=palette, pt.cex=1.5, bg="white", title="P(FSM-eligible)")
> legVals <- seq(from=0.2, to=1, by=0.2)
> legend("topright", legend=round(legVals,3), pch=21, pt.bg="white",
+ pt.cex=sqrt(legVals*5), bg="white", title="P(ESL)")


(If you are running the script for this session then the code you see on-screen will differ slightly. That is because it has some error trapping included in case there is no Internet connection available.)


Remember that the data are simulated. The points shown on the map are not the true locations of
schools in London. Do not worry about understanding the code in detail; the purpose is to see the
sort of things R can do with geographical data. We will look more closely at the detail in later
sessions.

Figure 2.4. A slightly less simple map produced in R

2.6 Some simple geographical analysis


Remember the regression models from earlier? It would be interesting to test the assumption that
the residuals exhibit independence by looking for spatial dependencies. To do this we will consider
to what degree the residual value for any one school correlates with the mean residual value for its
six nearest other schools (the choice of six is completely arbitrary).

First, we will take a copy of the schools data and convert it into an explicitly spatial object in R:
> detach(schools.data)
> schools.xy <- schools.data
> library(sp)
> attach(schools.xy)
> coordinates(schools.xy) <- c("Easting", "Northing")
> # Converts into a spatial object
> class(schools.xy)
> detach(schools.xy)
> proj4string(schools.xy) <- CRS("+proj=tmerc +datum=OSGB36")
> # Sets the Coordinate Referencing System

Second, we find the six nearest neighbours for each school.


> library(spdep)
> nearest.six <- knearneigh(schools.xy, k=6, RANN=F)
> # RANN = F to override the use of the RANN package that may not be installed

We can learn from this that the six nearest schools to the first school in the data (row 1) are schools
5, 38, 2, 40, 223 and 6:
> nearest.six$nn[1,]
[1]   5  38   2  40 223   6

The neighbours object, nearest.six, is an object of class knn:


> class(nearest.six)

It is next converted into the more generic class of neighbours.

> neighbours <- knn2nb(nearest.six)
> class(neighbours)
[1] "nb"
> summary(neighbours)
Neighbour list object:
Number of regions: 367 
Number of nonzero links: 2202 
Percentage nonzero weights: 1.634877 
Average number of links: 6 
[etc.]

The connections between each point and its neighbours can then be plotted. It may take a few
minutes.
> plot(neighbours, coordinates(schools.xy))

Having identified the six nearest neighbours to each school we could give each equal weight in a
spatial weights matrix or, alternatively, decrease the weight with distance away (so the first nearest
neighbour gets most weight and the sixth nearest the least). Creating a matrix with equal weight
given to all neighbours is sufficient for the time being.

> spatial.weights <- nb2listw(neighbours)

(The other possibility is achieved by creating then supplying a list of general weights to the
function, see ?nb2listw)
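As an illustration of that alternative (a sketch, not part of the session script), the distances to each neighbour could be measured with nbdists(...) and converted to inverse-distance weights, which are then supplied through the glist argument:
> distances <- nbdists(neighbours, coordinates(schools.xy))
> inv.d <- lapply(distances, function(d) 1/d)
> spatial.weights.idw <- nb2listw(neighbours, glist=inv.d)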
We now have all the information required to test whether there are spatial dependencies in the
residuals. The answer is yes (Moran's I = 0.218, p < 0.001, indicating positive spatial
autocorrelation).
> lm.morantest(model4, spatial.weights)
        Global Moran's I for regression residuals

data:  
model: lm(formula = attainment ~ fsm + selective, data = schools.data)
weights: spatial.weights

Moran I statistic standard deviate = 7.9152, p-value = 1.235e-15
alternative hypothesis: greater
sample estimates:
Observed Moran's I        Expectation           Variance 
      0.2181914682      -0.0038585704       0.0007870118


2.7 Tidying up
It is better to save your workspace regularly whilst you are working (see Section 1.4.4, 'Saving and
loading workspaces', page 10) and certainly before you finish. Don't forget to include the
extension .RData when saving. Having done so, you can tidy-up the workspace.
> save.image(file.choose(new=T))
> rm(list=ls())     # Be careful, it deletes everything!

2.8 Further Information


A simple introduction to graphics and statistical analysis in R is given in Statistics for Geography
and Environmental Science: An Introduction in R, available at http://www.social-statistics.org/?p=354.


Session 3: A Little More about the workings of R


This session provides a little more guidance on the 'inner workings' of R. All the commands are contained in the file session3.R and can be run using it (see the section 'Scripting' on p.7). You can, if
you wish, skip this session and move straight on to the sessions on mapping and spatial modelling
in R, returning to this later to better understand some of the commands and procedures you will
have used.

3.1 Classes, types and coercion


Let us create two objects, each a vector containing ten elements. The first will be the numbers from
one to ten, recorded as integers. The second will be the same sequence but now recorded as real
numbers (that is, 'floating point' numbers, those with a decimal place).
> b <- 1:10
> b
[1] 1 2 3 4 5 6 7 8 9 10
> c <- seq(from=1.0, to=10.0, by=1)
> c
[1] 1 2 3 4 5 6 7 8 9 10

Note that in the second case, we could just type,


> c <- seq(1, 10, 1)
> c
[1]  1  2  3  4  5  6  7  8  9 10

This works because if we don't explicitly define the argument (by omitting from=1 etc.) then R will
assume we are giving values to the arguments in their default order, which in this case is in the
order from, to and by. Type ?seq and look under Usage for this to make a little more sense.
In any case, the two objects, b and c, appear the same on screen but one is an object of class integer
whereas the other is an object of class numeric and of type double (double precision in the memory
space).


> class(b)
[1] "integer"
> class(c)
[1] "numeric"
> typeof(c)
[1] "double"

Often it is possible to coerce an object from one class and type to another.


> b <- 1:10
> class(b)
[1] "integer"
> b <- as.double(b)
> class(b)
[1] "numeric"
> typeof(b)
[1] "double"
> class(c)
> c <- as.integer(c)
> class(c)
[1] "integer"
> c
[1]  1  2  3  4  5  6  7  8  9 10
> c <- as.character(c)
> class(c)
[1] "character"
> c
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

The examples above are trivial. However, it is important to understand that seemingly generic
functions like summary(...) can produce outputs that are dependent upon the class type. Try, for
example,
> class(b)
[1] "numeric"
> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 
> class(c)
[1] "character"
> summary(c)
   Length     Class      Mode 
       10 character character 

In the first instance, a six number summary of the centre and spread of the numeric data is given.
That makes no sense for character data. The second summary gives the length of the vector, its class
type and its storage mode.
A more interesting example is provided if we consider the plot(...) command, used first with a
single data variable, secondly with two variables in a data table, and finally on a model of the
relationship between those two variables.
The first variable is created by generating 100 observations drawn randomly from a Normal
distribution with mean of 100 and a standard deviation of 20.
> var1 <- rnorm(n=100, mean=100, sd=20)


Being random, the data assigned to the variable will differ from user to user. Usually we would want this. However, in this case it will be easier if we all obtain the same results by each getting the same 'random' draw:
> set.seed(1)
> var1 <- rnorm(n=100, mean=100, sd=20)

Always check the data,

> class(var1)
[1] "numeric"
> length(var1)      # The number of elements in the vector
[1] 100
> summary(var1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  55.71   90.12  102.30  102.20  113.80  148.00 
> head(var1)        # The first few elements
[1]  87.47092 103.67287  83.28743 131.90562 106.59016  83.59063
> tail(var1)        # The last few elements
[1] 131.73667 111.16973  74.46816  88.53469  75.50775  90.53199

They seem fine! Returning to the use of the plot(...) command, in this instance it simply plots the
data in order of their position in the vector.
> plot(var1)


Figure 3.1. A simple plot of a numeric vector

To demonstrate a different interpretation of the plot command, a second variable is created that is a
function of the first but with some random error.
> set.seed(101)
> var2 <- 3 * var1 + 10 + rnorm(100, 0, 25)
# which, because n, mean and sd are the first three arguments into rnorm
# is the same as writing var2 <- 3 * var1 + 10 + rnorm(n=100, mean=0, sd=25)
> head(var2)
[1] 264.2619 334.8301 242.9887 411.0758 337.5397 290.1211


Next, the two variables are gathered together in a data table, of class data frame, where each row is
an observation and each column is a variable. There is more about data frames on page 29, in
Section 3.2 ('Data frames').
> mydata <- data.frame(x = var1, y = var2)
> class(mydata)
[1] "data.frame"
> head(mydata)
          x        y
1  87.47092 264.2619
2 103.67287 334.8301
3  83.28743 242.9887
4 131.90562 411.0758
5 106.59016 337.5397
6  83.59063 290.1211
> nrow(mydata)      # The number of rows in the data
[1] 100
> ncol(mydata)      # The number of columns
[1] 2

In this case, plotting the data frame will produce a scatter plot (to which the line of best fit shown in
Figure 3.2 will be added shortly).
> plot(mydata)


If there had been more than two columns in the data table, or if they had not been arranged in x, y
order, then the plot could be produced by referencing the columns directly. All the following are
equivalent:

> with(mydata, plot(x, y))          # Here the order is x, y
> with(mydata, plot(y ~ x))         # Here it is y ~ x
> plot(mydata$x, mydata$y)          # Plot using the first and second columns
> plot(mydata[,1], mydata[,2])
> plot(mydata[,2] ~ mydata[,1])
The attach(...) command could also be used. This is introduced in Section 3.2.2, 'Attaching a data
frame' on page 31.

Figure 3.2. A scatter plot. A line of best fit has been added.

The line of best fit in Figure 3.2 is a regression line. To fit the regression model, summarising the
relationship between y and x, use
> model1 <- lm(y ~ x, data=mydata)      # lm is short for linear model
> class(model1)
[1] "lm"

model1 is an object of class lm, short for linear model. Using the summary(...) function summarises the relationship between y and x.

> summary(model1)

Call:
lm(formula = y ~ x, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.102 -16.274   0.484  15.188  47.290 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.6462    13.6208   0.635    0.527    
x             3.0042     0.1313  22.878   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.47 on 98 degrees of freedom
Multiple R-squared: 0.8423,	Adjusted R-squared: 0.8407 
F-statistic: 523.4 on 1 and 98 DF,  p-value: < 2.2e-16


We created variable y as a function of x and it shows: x is a significant predictor of y at a greater than 99.9% confidence, 84% of the variance in y is explained by x, and the equation of the regression line is y = 8.65 + 3.00x.


Now using the plot(...) function on the object of class lm has an effect that is different from the previous two cases. It produces a series of diagnostic plots to help check the assumptions of regression have been met.
> plot(model1)

The first plot is a check for non-constant variance and outliers, the second for normality of the
model residuals, the third is similar to the first, and the fourth identifies both extreme residuals and
leverage points.

These four plots can be viewed together, changing the default graphical parameters to show the
plots in a 2-by-2 array (as in Figure 3.3).
> par(mfrow = c(2,2))     # Sets the graphical output to be 2 x 2
> plot(model1)

Finally, we might like to go back to our previous scatter plot and add the regression line of best fit
to it,
> par(mfrow = c(1,1))     # Resets the window to a single graph
> plot(mydata)
> abline(model1)

Figure 3.3. Default plots for an object of class linear model

3.2 Data frames



The preceding section introduced the data frame as a class of object containing a table of data where
the variables are the columns of the data and the rows are the observations.


> class(mydata)
> summary(mydata)

Looking at the data summary, the object mydata contains two columns, labelled x and y. These
column headers can also be revealed by using
> names(mydata)
[1] "x" "y"

or with
> colnames(mydata)
[1] "x" "y"


The row names appear to be the numbers from 1 to 100 (the number of rows in the data), though
actually they are character data:
> rownames(mydata)
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  [etc.]
> class(rownames(mydata))
[1] "character"

The column names can be changed either individually or together. Individually:


> names(mydata)[1] <- "v1"
> names(mydata)[2] <- "v2"
> names(mydata)
[1] "v1" "v2"

All at once:
> names(mydata) <- c("x","y")
> names(mydata)
[1] "x" "y"

as can the row names,


> rownames(mydata)[1] <- "0"
> rownames(mydata)
[1] "0"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  [etc.]
> rownames(mydata) = seq(from=0, by=1, length.out=nrow(mydata))
> rownames(mydata)
[1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  [etc.]

The above can be especially useful when merging data tables with GIS shapefiles in R (because the
first entry in an attribute table for a shapefile usually is given an ID of 0). Otherwise, it is usually
easiest for the first row in a data table to be labelled 1, so let's put them back to how they were.
> rownames(mydata) = 1:nrow(mydata)
> rownames(mydata)
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  [etc.]

3.2.1 Referencing rows and columns in a data frame


The square bracket notation can be used to index specific row, columns or cells in the data frame.
For example:
> mydata[1,]                 # The first row of data
         x        y
1 87.47092 264.2619
> mydata[2,]                 # The second row of data
         x        y
2 103.6729 334.8301
> round(mydata[2,],2)        # The second row, rounded to 2 decimal places
       x      y
2 103.67 334.83
> mydata[nrow(mydata),]      # The final row of the data
           x       y
100 90.53199 261.236
> mydata[,1]                 # The first column of data
[1]  87.47092 103.67287  83.28743 131.90562 [etc.]
> mydata[,2]                 # The second column, which here is also
[1] 264.2619 334.8301 242.9887 411.0758 337.5397 [etc.]
> mydata[,ncol(mydata)]      # the final column of data
[1] 264.2619 334.8301 242.9887 411.0758 337.5397 [etc.]
> mydata[1,1]                # The data in the first row of the first column
[1] 87.47092
> mydata[5,2]                # The data in the fifth row of the second column
[1] 337.5397
> round(mydata[5,2],0)
[1] 338

Specific columns of data can also be referenced using the $ notation


> mydata$x       # Equivalent to mydata[,1] because the column name is x
[1]  87.47092 103.67287  83.28743 131.90562 106.59016 [etc.]
> mydata$y
[1] 264.2619 334.8301 242.9887 411.0758 337.5397 290.1211 [etc.]
> summary(mydata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  55.71   90.12  102.30  102.20  113.80  148.00 
> summary(mydata$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  140.4   284.1   314.1   315.6   355.7   447.6 
> mean(mydata$x)
[1] 102.1777
> median(mydata$y)
[1] 314.1226
> sd(mydata$x)   # Gives the standard deviation of x
[1] 17.96399
> boxplot(mydata$y)
> boxplot(mydata$y, horizontal=T, main="Boxplot of variable y")

(Boxplots are sometimes said to be easier to read when drawn horizontally)


One way to avoid the use of the $ notation is to use the function with(...) instead:

> with(mydata, var(x))      # Gives the variance of x
[1] 322.7048
> with(mydata, plot(y, xlab="Observation number"))

However, even this is cumbersome so the attach function may be preferred...


3.2.2 Attaching a data frame

Sometimes any of the ways to access a specific part of a data table becomes tiresome and it is useful
to reference the column or variable name directly. For example, instead of having to type
mean(mydata[,1]), mean(mydata$x) or with(mydata, mean(x)) it would be easier just to refer to the
variable of interest, x, as in mean(x).
To achieve this the attach(...) command is used. Compare, for example,
> mean(x)


Error in mean(x) : object 'x' not found

(which generates an error because there is not an object called x in the workspace; it is only a
column name within the data frame mydata) with
> attach(mydata)
> mean(x)
[1] 102.1777

(which works fine). If, to use the earlier analogy, objects in R's workspace are like box files, then
now you have opened one up and its contents (which include the variable x) are visible.
To detach the contents of the data frame use detach(...)

> detach(mydata)
> mean(x)
Error in mean(x) : object 'x' not found

It is sensible to use detach when the data frame is no longer being used or else confusion can arise
when multiple data frames contain the same column names, as in the following example:


> attach(mydata)
> mean(x)        # This will give the mean of mydata$x
[1] 102.1777
> mydata2 = data.frame(x = 1:10, y=11:20)
> head(mydata2)
  x  y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
> attach(mydata2)
The following object(s) are masked from 'mydata':
    x, y
> mean(x)        # This will now give the mean of mydata2$x
[1] 5.5
> detach(mydata2)
> mean(x)
[1] 102.1777
> detach(mydata)
> rm(mydata2)

3.2.3 Sub-setting the data table and logical queries

Subsets of a data frame can be created by referencing specific rows within it. For example, imagine we want a table only of those observations that have a value above the mean of some variable.

> attach(mydata)
> subset <- which(x > mean(x))
> class(subset)
[1] "integer"
> subset
[1] 2 4 5 7 8 9 11 12 15 18 19 20 21 22 25 30 31 33 [etc.]
> mydata.sub <- mydata[subset,]
> head(mydata.sub)
         x        y
2 103.6729 334.8301
4 131.9056 411.0758
5 106.5902 337.5397
7 109.7486 354.7155
8 114.7665 351.4811
9 111.5156 367.4726

Note how the row names of this subset have been inherited from the parent data frame.
A more direct approach is to define the subset as a logical vector that is either true or false
dependent upon whether a condition is met.


> subset <- x > mean(x)
> class(subset)
[1] "logical"
> subset
[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE [etc.]
> mydata.sub <- mydata[subset,]
> head(mydata.sub)
         x        y
2 103.6729 334.8301
4 131.9056 411.0758
5 106.5902 337.5397
7 109.7486 354.7155
8 114.7665 351.4811
9 111.5156 367.4726

A yet more succinct way of achieving the same is:


> mydata.sub <- mydata[x > mean(x),]
# Selects those rows that meet the logical condition, and all columns
> head(mydata.sub)
         x        y
2 103.6729 334.8301
4 131.9056 411.0758
5 106.5902 337.5397
7 109.7486 354.7155
8 114.7665 351.4811
9 111.5156 367.4726

In the same way, to select those rows where x is greater than or equal to the mean of x and y is
greater than or equal to the mean of y
> mydata.sub <- mydata[x >= mean(x) & y >= mean(y),]
# The symbol & is used for and

Or, those rows where x is less than the mean of x or y is less than the mean of y
> mydata.sub <- mydata[x < mean(x) | y < mean(y),]
# The symbol | is used for or

3.2.4 Missing data


Missing data is given the value NA. For example,


> mydata[1,1] = NA
> mydata[2,2] = NA
> head(mydata)
          x        y
1        NA 264.2619
2 103.67287       NA
3  83.28743 242.9887
4 131.90562 411.0758
5 106.59016 337.5397
6  83.59063 290.1211


R will, by default, report NA or an error when some calculations are tried with missing data:
> mean(mydata$x)
[1] NA
> quantile(mydata$y)
Error in quantile.default(mydata$y) :
missing values and NaN's not allowed if 'na.rm' is FALSE

To overcome this, the default can be changed or the missing data removed.
To ignore the missing data in the calculation,

> mean(mydata$x, na.rm=T)
[1] 102.3263
> quantile(mydata$y, na.rm=T)
# Divides the data into quartiles
      0%      25%      50%      75%     100%
140.4475 282.1064 313.7536 356.5862 447.6020

Alternatively, there are various ways to remove the missing data. For example
> subset <- !is.na(mydata$x)

creates a logical vector which is true where the data values of x are not missing (the ! in the
expression means not):
> head(subset)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
Using the subset,


> x2 <- mydata$x[subset]
> mean(x2)
[1] 102.3263

More succinctly,
> with(mydata, mean(x[!is.na(x)]))
[1] 102.3263

Alternatively, a new data frame can be created without any missing data whereby any row with any
missing value is omitted.

> subset <- complete.cases(mydata)
> head(subset)
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> mydata.complete = mydata[subset,]
> head(mydata.complete)
          x        y
3  83.28743 242.9887
4 131.90562 411.0758
5 106.59016 337.5397
6  83.59063 290.1211
7 109.74858 354.7155
8 114.76649 351.4811

3.2.5 Reading data from a file into a data frame

The accompanying file schools.csv (used in Session 2) contains information about the location and
some attributes of schools in Greater London (in 2008). The locations are given as a grid reference
(Easting, Northing). The information is not real but is realistic.
A standard way to read a file into a data frame, with cases corresponding to lines and variables to
fields in the file, is to use the read.table(...) command.

> ?read.table

In the case of schools.csv, it is comma delimited and has column headers. Looking through the
arguments for read.table the data might be read into R using
> schools.data <- read.table("schools.csv", header=T, sep=",")

This will only work if the file is located in the working directory, else the location (path) of the file
will need to be specified (or the working directory changed). More conveniently, use file.choose()
> schools.data <- read.table(file.choose(), header=T, sep=",")
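If you prefer to set the working directory, or to give the full path to the file, something like the
following would also work (the folder C:/mydata is purely illustrative):
> getwd()
# Shows the current working directory
> setwd("C:/mydata")
# Changes the working directory to a folder of your choice
> schools.data <- read.table("C:/mydata/schools.csv", header=T, sep=",")
# Alternatively, give the full path to the file directly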

Looking through the usage of read.table in the R help page, a variant of the command is found
where the defaults are for comma delimited data. So, most simply, we could use,

schools.data <- read.csv(file.choose())

Having read in the data, some basic checks of it are helpful,


> head(schools.data, n=3)
# Views the first three lines of the data
    FSM   EAL   SEN white blk.car blk.afr indian pakistani [etc.]
1 0.659 0.583 0.031 0.217   0.032   0.222  0.002     0.020
2 0.391 0.424 0.001 0.350   0.087   0.126  0.003     0.012
3 0.708 0.943 0.038 0.048   0.000   0.239  0.000     0.004
> ncol(schools.data)
[1] 17
> nrow(schools.data)
[1] 366
> summary(schools.data)
      FSM              EAL              SEN          [etc.]
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
 1st Qu.:0.1323   1st Qu.:0.1472   1st Qu.:0.00800
 Median :0.2500   Median :0.3165   Median :0.02000
 Mean   :0.2702   Mean   :0.3491   Mean   :0.02308
 3rd Qu.:0.3897   3rd Qu.:0.5122   3rd Qu.:0.03400
 Max.   :0.7730   Max.   :1.0000   Max.   :0.11300
It seems to be fine.
For more about importing and exporting data in R, consult the R help document, R Data
Import/Export (see under the Help menu in R or http://cran.r-project.org/manuals.html).

3.3 Lists
A list is a little like a data frame but offers a more flexible way to gather objects of different classes
together. For example,
> mylist <- list(schools.data, model1, "a")
> class(mylist)
[1] "list"

To find the number of components in a list, use length(...),



> length(mylist)
[1] 3

Here the first component is the data frame containing the schools data. The second component is the
linear model created earlier. The third is the character a. To reference a specific component,
double square brackets are used:
> head(mylist[[1]], n=3)
    FSM   EAL   SEN white blk.car blk.afr indian pakistani [etc.]
1 0.659 0.583 0.031 0.217   0.032   0.222  0.002     0.020
2 0.391 0.424 0.001 0.350   0.087   0.126  0.003     0.012
3 0.708 0.943 0.038 0.048   0.000   0.239  0.000     0.004

> summary(mylist[[2]])
Call:
lm(formula = y ~ x, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-57.102 -16.274   0.484  15.188  47.290

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.6462    13.6208   0.635    0.527
x             3.0042     0.1313  22.878   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.47 on 98 degrees of freedom
Multiple R-squared: 0.8423,    Adjusted R-squared: 0.8407
F-statistic: 523.4 on 1 and 98 DF, p-value: < 2.2e-16


> class(mylist[[3]])
[1] "character"

The double square brackets can be combined with single ones. For example,
> mylist[[1]][1,]
    FSM   EAL   SEN white blk.car blk.afr indian pakistani [etc.]
1 0.659 0.583 0.031 0.217   0.032   0.222  0.002     0.020

is the first row of the schools data. The first cell of the same data is
> mylist[[1]][1,1]
[1] 0.659

3.4 Writing a function



In brief, a function is written in R in the following way,

> function.name <- function(list of arguments) {
+    function code
+    return(result)
+ }

So, a simple function to divide the product of two numbers by their sum could be,
> my.function <- function(x1, x2) {
+    result <- (x1 * x2) / (x1 + x2)
+    return(result)
+ }

Now running the function


> my.function(3, 7)
[1] 2.1
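As a brief aside, arguments can also be given default values so that the function can be called
without supplying them all. A minimal sketch (my.function2 is just an illustrative name):
> my.function2 <- function(x1, x2 = 10) {
+    result <- (x1 * x2) / (x1 + x2)
+    return(result)
+ }
> my.function2(3)
[1] 2.307692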


3.5 R packages for mapping and spatial data analysis


By default, R comes with a base set of packages and methods for data analysis and visualization.
However, there are many other packages available, too, that greatly extend R's value and
functionality. These packages are listed alphabetically at
http://cran.r-project.org/web/packages/available_packages_by_name.html.
Because there are so many, it can be useful to browse the packages by topic (at
http://cran.r-project.org/web/views/). The topic, or 'task view', of particular interest here is the
analysis of spatial data: http://cran.r-project.org/web/views/Spatial.html
3.5.1 Installing and loading one or more of the packages (libraries)


Note: If reading this in class it is likely that the packages have been installed already or you will
not have the administrative rights to install them. If so, this section is for information only. There
is also no need to install the packages if you have done so already (when following the instructions
under 'Before you begin' on p.3).
To install a specific package the install.packages(...) command is used, as in:


> install.packages("ctv")
Installing package(s) into /Users/ggrjh/Library/R/2.13/library
(as lib is unspecified)
trying URL 'http://cran.uk.r-project.org/bin/macosx/leopard/contrib/2.13/ctv_0.74.tgz'
Content type 'application/x-gzip' length 289693 bytes (282 Kb)
opened URL
==================================================
downloaded 282 Kb

The package needs to be installed once but loaded each time R is started, using the library(...)
command
> library("ctv")

In this case what has been installed is a package that will now allow all the packages associated
with the spatial task view to be installed together, using:
> install.views("Spatial")


Note that installing packages may, by default, require access to a directory/folder for which
administrative rights are required. If necessary, it is entirely possible to install R (and therefore the
additional packages) in, for example, 'My Documents' or on a USB stick.
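If administrative rights are an issue, packages can also be installed into, and loaded from, a
user-writable folder. A minimal sketch (the folder D:/Rlibs is purely illustrative and must already
exist):
> install.packages("ctv", lib="D:/Rlibs")
# Installs the package into the user-writable folder
> library("ctv", lib.loc="D:/Rlibs")
# Loads the package from that folder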
3.5.2 Checking which packages are installed

You can see which packages are installed by using


> row.names(installed.packages())

To see whether a specific package is installed use


> is.element("sp",installed.packages()[,1])
# replace sp with another package name to check if that is installed

3.6 Tidying up and quitting



You may want to save and/or tidy up your workspace before quitting R. See sections 1.5 and 2.7 on
pages 11 and 23.
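For example (a minimal reminder; the file name is illustrative):
> ls()
# Lists the objects currently in the workspace
> rm(mydata.sub)
# Removes an object that is no longer needed (mydata.sub is just an example)
> save.image("session3.RData")
# Saves the workspace to the working directory
> q()
# Quits R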


3.7 Further Information


See An Introduction to R, available at CRAN (http://cran.r-project.org/manuals.html) or by using
the drop-down Help menus in the RGui.


Session 4: Using R as a simple GIS


This session introduces some of the libraries available for R that allow it to be used as a simple GIS.
Examples of plotting (x, y) point data, producing choropleth maps, creating a raster, and
undertaking a spatial join are given. The code required may seem quite difficult on first impression
but it becomes easier with practice, plus often it is a case of 'recycling' existing code, making small
changes where required. The advantages of using R instead of a normal standalone GIS are
addressed with an example of mapping the residuals from a simple econometric (regression) model
and looking for a correlation between neighbours.

For this session you will need to have the following libraries installed: sp, maptools, GISTools,
classInt, RColorBrewer, raster and spdep (see 'Installing and loading one or more of the packages
(libraries)', p. 37).

4.1 Simple Maps of XY Data


4.1.1 Loading and Mapping XY Data

We begin by reading some XY Data into R. The file landprices.csv is in a comma separated format
and contains information about land parcels in Beijing, including a point georeference (a centroid)
marking the centre of the land parcel. The data are simulated and not real but they are realistic.


> landdata <- read.csv(file.choose())
> head(landdata, n=3)
         x       y LNPRICE LNAREA DCBD DELE DRIVER DPARK Y0405 Y0607 Y0809
1 454393.1 4417809    9.29  11.84 8.28 7.25   7.38  8.09     0     1     0
2 442744.9 4417781    8.64  10.97 9.04 5.61   8.41  7.51     0     1     0
3 444191.7 4416996    6.69   7.92 8.89 5.62   8.22  7.27     0     1     0

Given the geographical coordinates, it is possible to map the data as they are in much the same way
as we did in Section 2.5 ('Some simple maps'). At its simplest,
> with(landdata, plot(x, y, asp=1))

However, given an interest in undertaking spatial analysis in R, it would be better to convert the
data into what R will explicitly recognise as a spatial object. For this we will require the sp library.
Assuming it is installed,
> library(sp)


We can now coerce landdata into an object of class spatial (sp) by telling R that the geographical
coordinates for the data are found in columns 1 and 2 of the current data frame.
> coordinates(landdata) = c(1,2)
> class(landdata)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"

The locations of the data points are now simply plotted using
> plot(landdata)

or, if we prefer a different symbol,



> plot(landdata, pch=10)


> plot(landdata, pch=4)

(type ?points and scroll down to below 'pch values' to see the options for different types of point
character)
4.1.2 The Coordinate Reference System (CRS)


If you type summary(landdata) you will find it is an object of class SpatialPointsDataFrame,
meaning it is comprised of geographic data (the Spatial Points) and also attribute data (information
about the land parcels at those points). The bounding box (the rectangle enclosing all the points)
has the coordinates (428554.2, 4406739.2) at its bottom-left corner and the coordinates (463693.8,
4440346.6) at its top-right. A six number summary is given for each of the attribute variables to
show their average and range. Above these summaries you will find some NAs, indicating missing
information. What are missing are the details about the coordinate reference system (CRS) for the
data. If the geographical coordinates were in (longitude, latitude) format then we could use the code
proj4string(landdata) <- CRS("+proj=longlat") to set the CRS. In our case, the data use the Xian
1980 / Gauss-Kruger CM 117E projection. If you go to http://www.epsg-registry.org/ you will find
various search options to retrieve the EPSG (European Petroleum Survey Group) code for this
projection. It is 2345.
> crs <- CRS("+init=epsg:2345")
> proj4string(landdata) <- crs
> summary(landdata)
> plot(landdata, pch=21, bg="yellow", cex=0.7)
# cex means character expansion: it controls the symbol size

4.1.3 Adding an additional map layer

The map still lacks a sense of geographical context so we will add a polygon shapefile giving the
boundaries of districts in Beijing. The file is called 'beijing_districts.shp'. This first needs to be
loaded into R which in turn requires the maptools library:
> library(maptools)
> districts <- readShapePoly(file.choose())
> summary(districts)

As before, the coordinate reference system is missing but is the same as for the land price data.
> proj4string(districts) <- crs
> summary(districts)


We can now plot the boundaries of the districts and then overlay the point data on top:
> plot(districts)
> plot(landdata, pch=21, bg="yellow", cex=0.7, add=T)

4.2 Creating a choropleth map using GISTools


A choropleth map is one where the area units (the polygons) are shaded according to some
measurement of them. Looking at the attribute data for the districts shapefile (which is accessed
using the code @data) we find it includes a measure of the population density,


> head(districts@data, n=3)
  SP_ID   POPDEN   JOBDEN
0     0 0.891959 1.495150
1     1 1.246510 0.466264
2     2 5.489660 2.787890

That is a variable we can map. The easiest way to do this is using the GISTools library,
> library(GISTools)

As with most libraries, if we want to know more about what it can do, type ? followed by its name

(so here ?GISTools) and follow the link to the main index. If you do, you will find there is a function
called choropleth with the description: Draws a choropleth map given a spatialPolygons object, a
variable and a shading scheme.
Currently we have the spatialPolygons object. It is the object districts.
> class(districts)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"


The difference between a spatialPolygons object and a SpatialPolygonsDataFrame is that the first
contains only the geographical information required to draw the boundaries of a map whereas the
second also contains attribute data that could be used to shade it (but doesn't have to be; another
variable could be used). We will use the variable POPDEN. All that is left is a shading scheme. In
the first instance, let's allow another function, auto.shading, to do the work for us.
> shades <- auto.shading(districts@data$POPDEN)
> choropleth(districts, districts@data$POPDEN)
> plot(landdata, pch=21, bg="yellow", cex=0.7, add=T)

It would be good to add a legend to the map. To do so, use the following command and then click
towards the bottom-right of your map in the place where you would like the legend to go:

> locator(1)
$x
[1] 462440
$y
[1] 4407000

We can now place the legend at those map units,


> choro.legend(462440,4407000,shades,fmt="%4.1f",title='Population density')

In practice, it may take some trial-and-error to get the legend in the right place.
Similarly, we can add a north arrow and a map scale,

> north.arrow(461000, 4445000, "N", len=1000, col="light gray")


> map.scale(425000,4400000,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')

The result is shown in Figure 4.1, below.


Figure 4.1. An example of a choropleth map produced in R


4.2.1 Some alternative choropleth maps

What we see on a choropleth map and how we interpret it is a function of the classification used to
shade the areas. Typing ?auto.shading we discover that the default for auto.shading is to use a
quantile classification with five (n + 1) categories and a red shading scheme. We might like to
compare this with using a standard deviation based classification and one based on the range.
> x <- districts@data$POPDEN
> shades2 <- auto.shading(x, cutter=sdCuts, cols=brewer.pal(5,"Greens"))
> shades3 <- auto.shading(x, cutter=rangeCuts, cols=brewer.pal(5,"Blues"))


We could now plot these maps in the same way as before. It will work; however, it may also
become tiresome typing the same code each time to overlay the point data and to add the
annotations. We could save ourselves the trouble by writing a simple function to do it:
> map.details <- function(shading) {
+    plot(landdata, pch=21, bg="yellow", cex=0.7, add=T)
+    choro.legend(461000,4407000,shading,fmt="%4.1f",title='Population density')
+    north.arrow(473000, 4445000, "N", len=1000, col="light gray")
+    map.scale(425000,4400000,10000,"km",5,subdiv=2,tcol='black',
+       scol='black',sfcol='black')
+ }

Now we can produce the maps,


> # The quantile classification, as before
> choropleth(districts, x, shades)
> map.details(shades)
> # The standard deviation classification
> choropleth(districts, x, shades2)
> map.details(shades2)

> # The range-based classification


> choropleth(districts, x, shades3)
> map.details(shades3)

4.3 XY Maps with the point symbols shaded


In Section 4.1.1, the maps we produced of the land price data were not very elegant. It would be
nice to alter the size or shading of the points according to some attribute of the data; for example the
land value. This information is contained in the dataset,
> head(landdata@data, n=3)
  LNPRICE LNAREA DCBD DELE DRIVER DPARK Y0405 Y0607 Y0809
1    9.29  11.84 8.28 7.25   7.38  8.09     0     1     0
2    8.64  10.97 9.04 5.61   8.41  7.51     0     1     0
3    6.69   7.92 8.89 5.62   8.22  7.27     0     1     0

Altering the size is achieved simply enough by passing some function of the variable (LNPRICE) to
the character expansion argument (cex). For example,
> x <- landdata@data$LNPRICE
> plot(landdata, pch=21, bg="yellow", cex=0.2*x)

Shading them according to their value is a little harder. The process is to cut the values into groups
(using a quintile classification, for example). Then create a colour palette for each group. Finally,
map the points and shade them by group.

4.3.1 Creating the groups

There are various ways this can be done. The first stage is to find the lower and upper values (the
break points) for each group and an easy way to do that is to use the classInt library that is designed
for the purpose.
> library(classInt)

For example, for a quantile classification with five groups the break points are:
> classIntervals(x, 5, "quantile")

(we can see the number of land parcels in each group is approximately but not exactly equal)
For a 'natural breaks' classification they are
> classIntervals(x, 5, "fisher")


or
> classIntervals(x, 5, "jenks")

For an equal interval classification


> classIntervals(x, 5, "equal")

For a standard deviation classification


> classIntervals(x, 5, "sd")

and so forth (see ?classIntervals for more options)


Let's use a natural breaks classification, extracting the break points from the classIntervals(...)
function and storing them in an object called break.points,

> break.points <- classIntervals(x, 5, "fisher")$brks
# Specifying $brks gives just the values without the number in each group
> break.points
[1]  4.850  6.300  7.105  7.865  8.820 11.060

The second stage is to assign each of the land price values to one of the five groups. This is done
using the cut(...) function.
> groups <- cut(x, break.points, include.lowest=T, labels=F)

We can check the number of land parcels in each group by counting them
> table(groups)
groups
  1   2   3   4   5
159 267 352 227 112

Which should be the same as for
> classIntervals(x, 5, "fisher")
style: fisher
   [4.85,6.3)   [6.3,7.105) [7.105,7.865)  [7.865,8.82)  [8.82,11.06]
          159           267           352           227           112

4.3.2 Creating the colour palette and mapping the data

This is most easily done using the RColorBrewer library which is based on the ColorBrewer
website, http://www.colorbrewer.org and is designed to create nice looking colour palettes
especially for thematic maps.
> library(RColorBrewer)


In fact, you have used this library already when creating the choropleth maps. It is implicit in the
command auto.shading(x, cutter=sdCuts, cols=brewer.pal(5,'Greens')), for example, where
the function brewer.pal(...) is a call to RColorBrewer asking it to create a sequential palette of
five colours going from light to dark green.
We can create a colour palette in the same way,
> palette <- brewer.pal(5, "Greens")
# Use ?brewer.pal to find out about other colour schemes

map the data


> plot(districts)
> plot(landdata, pch=21, bg=palette[groups], cex=0.2*x, add=T)

and add the north arrow and scale bar



> north.arrow(473000, 4445000, "N", len=1000, col="light gray")


> map.scale(425000,4400000,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')

Adding the legend is a little harder and uses the legend(...) function,
> legend("bottomright", legend=c("4.85 to <6.3", "6.3 to <7.105",
+ "7.105 to <7.865","7.865 to <8.82","8.82 to 11.06"), pch=21, pt.bg=palette,
+ pt.cex = c(0.2*5.6, 0.2*6.7, 0.2*7.5, 0.2*8.3, 0.2*9.9), title="Land value (log)")

See ?legend for further details.


Tip When creating maps in colour, be wary of choosing colours that cannot be distinguished from
one another by those with colour-blindness. Red-green combinations are particularly problematic.
Be aware, also, that many publications still require graphics to be in grayscale. You could use, for
example, palette <- brewer.pal(5, "Greys")


Figure 4.2 A shaded point map produced in R

4.4 Creating a raster grid


A problem with the map shown in Figure 4.2 is that of over-plotting: of multiple points
occupying the same space on the map and obscuring each other. One solution is to convert the
points into a raster grid, giving the average land value per grid cell.
We begin by loading the raster library
> library(raster)

then define the length of each raster cell (here giving a 1km by 1km cell size as the units are metres)
> cell.length <- 1000


We will allow the grid to completely cover the districts of Beijing so will base its dimensions on the
bounding box (the minimum enclosing rectangle) for the districts. The bounding box is found using
> bbox(districts)
        min       max
x  418358.5  473517.4
y 4391094.2 4447245.9

and this information will be used to calculate the number of columns (ncol) and the number of rows
(nrow) for the grid:

> xmin <- bbox(districts)[1,1]
> xmax <- bbox(districts)[1,2]
> ymin <- bbox(districts)[2,1]
> ymax <- bbox(districts)[2,2]
> ncol <- round((xmax - xmin) / cell.length, 0)
> nrow <- round((ymax - ymin) / cell.length, 0)
> ncol
[1] 55
> nrow
[1] 56

We then create a blank 55 by 56 raster grid,


> blank.grid <- raster(ncols=ncol, nrows=nrow, xmn=xmin, xmx=xmax, ymn=ymin, ymx=ymax)

The next stage is to define the (x, y) and attribute values of the points that we are going to
aggregate, by averaging, into the blank grid.

> xs <- coordinates(landdata)[,1]
> ys <- coordinates(landdata)[,2]
> xy <- cbind(xs, ys)
> x <- landdata@data$LNPRICE
> land.grid = rasterize(xy, blank.grid, x, mean)

We can now plot the grid using the default values,


> plot(land.grid)
> plot(districts, add=T)

or customise it a little

> break.points <- classIntervals(x, 8, "pretty")$brks
> palette <- brewer.pal(8, "YlOrRd")
> plot(land.grid, col=palette, breaks=break.points, main="Mean log land value")
> plot(districts, add=T)
> map.scale(426622,4394549,10000,"km",5,subdiv=2,tcol='black',scol='black',
+   sfcol='black')
> north.arrow(468230, 4394708, "N", len=1000, col="white")

Figure 4.3. The land price data rasterised


4.5 Spatially joining data


A further example of the sort of GIS functionality available in R is given by assigning to each of the
land parcel points the attribute data for the districts within which they are located: a spatial join
based on a point-in-polygon operation.
This is simple to achieve, with the basis being the over(...) function, here overlaying the points on
the polygons. For example,

> head(over(landdata, districts), n=3)
  SP_ID  POPDEN  JOBDEN
1    57 17.9478 19.3322
2     8 22.9676 20.4468
3    15 32.4849 65.3001

shows that the first point in the land price data is located in the 58th of the districts. The reason that
it is the 58th and not the 57th is that the IDs (SP_ID) are numbered beginning from zero not one,
which is common for GIS. We can easily check this is correct:
> plot(districts[58,])
> plot(landdata[1,], pch=21, add=T)

- the point is indeed within the polygon.


To join the data we will create a new data frame containing the attribute data from landdata and the
results of the overlay function above.

> joined.data <- data.frame(slot(landdata, "data"), over(landdata, districts))
# could also use data.frame(landdata@data, over(landdata, districts))
> head(joined.data, n=3)
  LNPRICE LNAREA DCBD DELE DRIVER DPARK Y0405 Y0607 Y0809 SP_ID  POPDEN  JOBDEN
1    9.29  11.84 8.28 7.25   7.38  8.09     0     1     0    57 17.9478 19.3322
2    8.64  10.97 9.04 5.61   8.41  7.51     0     1     0     8 22.9676 20.4468
3    6.69   7.92 8.89 5.62   8.22  7.27     0     1     0    15 32.4849 65.3001

These data cannot be plotted as they are. What we have is just a data table. It is not linked to any
map.

> class(joined.data)
[1] "data.frame"

However, they can be linked to the geography of the existing map to create a new Spatial Points-with-attribute data object that can then be mapped in the way described in Section 4.3, p.43.
> combined.map <- SpatialPointsDataFrame(coordinates(landdata), joined.data)
> class(combined.map)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
> proj4string(combined.map) <- crs
> head(combined.map@data)


> x <- combined.map@data$POPDEN
# Select the variable to map
> break.points <- classIntervals(x, 5, "quantile")$brks
# Find the breaks between the map classes
> groups <- cut(x, break.points, include.lowest=T, labels=F)
# Place the observations into the groups (map classes)
> palette <- brewer.pal(5, "Blues")
# Create a colour palette
> plot(districts)
> plot(combined.map, pch=21, bg=palette[groups], cex=0.8, add=T)
# Plot the data
> map.scale(426622,4396549,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')
> north.arrow(473000, 4445000, "N", len=1000, col="white")
# Add the map scale and north arrow
> classIntervals(x, 5, "quantile")
# Look at the break points to include in the legend
> legend("bottomright", legend=c("0.87 to <3.32", "3.32 to <10.2",
+ "10.2 to <17.7","17.7 to <25.4","25.4 to 283"), pch=21, pt.bg=palette,
+ pt.cex=0.8, title="Population density")
# Add the legend to the map

Figure 4.4. The population density of the districts at each of the land parcel points
(the map is a result of a spatial join operation)

4.6 Why is any of this useful?


A question you may ask at this stage is why use R when it appears to involve a reasonable amount
of code to do things that are easily completed using the point-and-click interface of a standard GIS.
It's a good question.
There are several answers.
The first is that R produces graphics of publishable quality and you have considerable control over
their design. That might be an advantage.

Secondly, the code used to complete the tasks and produce the maps allows for reproducibility: it
can be shared with (and checked by) other people. It can also be easily changed if, for example, you
wanted to slightly alter the point sizes or change the classes from coloured to grayscale. Making a
few tweaks to a script can be much faster than having to go through a number of drop-down menus,
tabs, right-clicks, etc. to achieve what you want.
Third, precisely because R's spatial capabilities do not stand alone from the rest of its functionality,
they allow for the integration of statistical and spatial ways of working. Three examples follow.
4.6.1 Example 1: mapping regression residuals

We can use the combined dataset to fit a hedonic land price model, estimating some of the
predictors of land price at each of the locations. The variables are:


LNPRICE: Log of the land price: RMB per square metre
LNAREA: Log of the land parcel size
DCBD: distance to the CBD on a log scale
DELE: distance to the nearest elementary school on a log scale
DRIVER: distance to the nearest river on a log scale
DPARK: distance to the nearest park on a log scale
Y0405: A dummy variable (1 or 0) indicating whether the land was sold in the period 2004-5
Y0607: A dummy variable indicating whether the land was sold in the period 2006-7
Y0809: A dummy variable indicating whether the land was sold in the period 2008-9
(All of the remaining land parcels were sold before 2004)
SP_ID: An ID for the polygon (district) in which the land parcel is located
POPDEN: the population density in each district. Data source: the fifth census data in 2000.
JOBDEN: the job density in each district. Data source: the fifth census data in 2000.
Remember: the data are simulated and not genuine.
To fit the regression model we use the function lm(...), short for linear model.
> model1 <- lm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
+   Y0607 + Y0809, data=combined.map@data)
> summary(model1)
Call:
lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809, data = combined.map@data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.83427 -0.59053 -0.03558  0.53943  2.97878

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.916563   0.575031  20.723  < 2e-16 ***
DCBD        -0.259700   0.055202  -4.705 2.87e-06 ***
DELE        -0.079764   0.032784  -2.433  0.01513 *
DRIVER       0.063370   0.029665   2.136  0.03288 *
DPARK       -0.294466   0.046747  -6.299 4.31e-10 ***
POPDEN       0.004552   0.001047   4.346 1.51e-05 ***
JOBDEN       0.006990   0.003009   2.323  0.02037 *
Y0405       -0.183913   0.057289  -3.210  0.00136 **
Y0607        0.152825   0.087152   1.754  0.07979 .
Y0809        0.803620   0.118338   6.791 1.81e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.868 on 1107 degrees of freedom
Multiple R-squared: 0.295,    Adjusted R-squared: 0.2893
F-statistic: 51.47 on 9 and 1107 DF, p-value: < 2.2e-16

To obtain the residuals (errors) from this model we can use any of the functions residuals(...),
rstandard(...) or rstudent(...) to obtain the 'raw' residuals, the standardised residuals and the
Studentised residuals, respectively.
> residuals <- rstudent(model1)
> summary(residuals)


An advantage of using R is that we can now map the residuals to look for geographical patterns
that, if they exist, would violate the assumption of independent errors (and potentially affect both
the estimates of the model coefficients and their standard errors).
> break.points <- c(min(residuals), -1.96, 0, 1.96, max(residuals))
> groups <- cut(residuals, break.points, include.lowest=T, labels=F)
> palette <- brewer.pal(4, "Set2")
> plot(districts)
> plot(combined.map, pch=21, bg=palette[groups], cex=1, add=T)
> map.scale(426622,4396549,10000,"km",5,subdiv=2,tcol='black',scol='black',
+ sfcol='black')
> north.arrow(473000, 4445000, "N", len=1000, col="white")
> legend("bottomright", legend=c("< -1.96", "-1.96 to 0", "0 to 1.96", "> 1.96"),
+ pch=21, pt.bg=palette, pt.cex=0.8, title="Studentised residuals")

Figure 4.5. Map of the regression residuals from a model predicting the land parcel prices


4.6.2 Example 2: Comparing values with those of a nearest neighbour

Is there a geographical pattern to the residuals in Figure 4.5? Perhaps, although this raises the
question of what a random pattern would actually look like. What there definitely is, is a significant
correlation between the residual value at any one point and that of its nearest neighbouring point:


> library(spdep)
# Loads the spatial dependence library
> knn1 <- knearneigh(combined.map, k=1, RANN=F)$nn
# Finds the first nearest neighbour to each point
> head(knn1, 3)
     [,1]
[1,]   10    # The nearest neighbour to point 1 is point 10
[2,]  172    # The nearest neighbour to point 2 is point 172
[3,]  152
[etc.]
> cor.test(residuals, residuals[knn1])
Pearson's product-moment correlation

data:  residuals and residuals[knn1]
t = 13.0338, df = 1115, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3116038 0.4134505
sample estimates:
      cor
0.3636132
# Calculates the correlation between a point and its nearest neighbour

4.6.3 Example 3: Exporting the data as a new shapefile

If we wished, we could now save the residual values as a new shapefile to be used in other GIS.
This is straightforward and uses the same procedure to create a Spatial Points-with-attribute data
object that we used in Section 4.5.

> residuals.map <- SpatialPointsDataFrame(coordinates(landdata), data.frame(residuals))
> getwd()
# This is the working directory where the map will be saved.
# You may want to change it using the drop down menus.
> writePointsShape(residuals.map, "residuals.shp")
# writePointsShape() is used here because residuals.map is a point, not polygon, object

4.7 R Geospatial Data Abstraction Library


Before ending this session we note the package rgdal providing an interface to many of the open
source projects gathered in the Open Source Geospatial Foundation (www.osgeo.org); in particular,
to Frank Warmerdam's Geospatial Data Abstraction Library (http://www.gdal.org). This package is
not installed with the other spatial packages under the Spatial Task View (see 'Installing and loading
one or more of the packages (libraries),' p.37) but can be installed in the normal way from CRAN 1:
> install.packages("rgdal")


Once installed, rgdal can be used for spatial data import and export, and for projection and
transformation, as documented in Chapter 4 of Bivand et al. (2008).
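As an indication of the sort of thing it allows, the following sketch reprojects the districts map to
longitude and latitude and exports the residuals as a shapefile (the projection string and the file
names are illustrative):
> library(rgdal)
> districts.ll <- spTransform(districts, CRS("+proj=longlat +datum=WGS84"))
# Reprojects the districts from the Xian 1980 / Gauss-Kruger projection to longitude-latitude
> writeOGR(residuals.map, dsn=".", layer="residuals", driver="ESRI Shapefile")
# An alternative way of exporting the residuals, here to the working directory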
1 Note: it used to be the case that Mac Intel OS X binaries were not provided on CRAN, but could be installed from
the CRAN Extras repository with
> setRepositories(ind=1:2)
> install.packages("rgdal")

However, at the time of writing the Mac binaries are provided on CRAN and can be downloaded in the normal way
without having to change the source repository.

Reference:
Bivand, R.S., Pebesma, E.J. & Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with R.
Berlin: Springer.

4.8 Getting Help


There is a mailing list for discussing the development and use of R functions and packages for
handling and analysis of spatial, and particularly geographical, data. It can be subscribed to at
www.stat.math.ethz.ch/mailman/listinfo/r-sig-geo.
The spatial cheat sheet by Barry Rowlingson at Lancaster University is really helpful:
http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/cheatsheet.html

There are some excellent R spatial tips and tutorials on Chris Brunsdon's Rpubs site,
http://rpubs.com/chrisbrunsdon, and on James Cheshire's website, http://spatial.ly/r/.
Perhaps the hardest thing is to remember which library to use when. At the risk of over-simplification:


Use library(maptools) to import and export shapefiles
Use library(sp) to create and manipulate spatial objects in R
Use library(GISTools) to help create maps
Use library(RColorBrewer) to create colour schemes for maps
Use library(raster) to create rasters and rasterisations of data
Use library(spdep) to look at the spatial dependencies in data and (as we shall see in the next
session) to create spatial weights matrices
Tip: if you begin by loading library(GISTools) you will find maptools, sp and RColorBrewer
(amongst others) are automatically loaded too.


Session 5: Defining Neighbours


This session demonstrates more of R's capability for spatial analysis by showing how to create
spatial weightings based on contiguity, nearest neighbours and by distance. Methods of inverse
distance weighting are also introduced and used to produce Moran plots and tests of spatial
dependency in data. We find that the assumption of independence amongst the residuals of the
regression model fitted in the previous session appears to be unwarranted.

5.1 Identifying neighbours


5.1.1 Getting Started


If you have closed and restarted R since the last session, load the workspace session5.RData which
contains the districts polygons and combined map created previously in Session 4 (see Section
4.1.3, p.40 and Section 4.5, p.47). All the code for this session is contained in the file Session5.R
> load(file.choose())

# Load the workspace session5.RData

Also load the spdep library,


> library(spdep)

5.1.2 Creating a contiguity matrix

A contiguity matrix is one that identifies polygons that share boundaries and (in what is called the
Queen's case) corners too. In other words, it identifies neighbouring areas. To do this we use the
poly2nb(...) function, which converts the polygons to an object of class neighbours.
> contig <- poly2nb(districts)


A summary of the neighbours object shows that there are 134 regions (districts, which can be
confirmed using nrow(districts)) with each being linked to 5.46 others, on average. There are two
regions with no links (use plot(districts) and you can see them to the east of the map), and two
regions with 10 links. The Queen's case is assumed by default, see ?poly2nb.
> summary(contig)
Neighbour list object:
Number of regions: 134
Number of nonzero links: 732
Percentage nonzero weights: 4.076632
Average number of links: 5.462687
2 regions with no links:
20 80
Link number distribution:

 0  1  2  3  4  5  6  7  8  9 10
 2  1  6  8 23 29 26 21  9  7  2
1 least connected region:
129 with 1 link
2 most connected regions:
90 100 with 10 links

It is helpful to learn a little more about the structure of the contiguity object. It is an object of class
nb which is itself a type of list.
> class(contig)
[1] "nb"
> typeof(contig)
[1] "list"

Looking at the first part of this list we find that the first district has two neighbours: polygons 52
and 54.
> contig[[1]]
[1] 52 54

The second district has five neighbours: polygons 3, 4, 6, 99, 100


> contig[[2]]
[1]   3   4   6  99 100

We can confirm this is correct by plotting the district and its neighbours on a map

> plot(districts)
> plot(districts[2,], col="red", add=T)
> plot(districts[c(3,4,6,99,100),], col="yellow", add=T)

5.1.3 K nearest neighbours (knn)

It would not make sense to evaluate contiguity of point data. Instead, we could find, for example,
the six nearest neighbours to each point:
> knear6 <- knearneigh(combined.map, k=6, RANN=F)


This produces an object of class knn. Looking at its parts we find that the six nearest neighbours
(nn) to point 1 are points 10, 11, 1077, 1076, 93, 453, where 10 is the closest; that there are 1117
points in total (np); that we searched for the six nearest neighbours (k); that the points exist in a
two-dimensional space (dimension); and that we can see the coordinates of the points (labelled x).
> names(knear6)
[1] "nn"        "np"        "k"         "dimension" "x"
> head(knear6$nn)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   10   11 1077 1076   93  453
[2,]  172  110  155  162  153  156
[3,]  152  149  150  151  169  168
[4,]  135  143  679  148  678  674
[5,]   32  164  166  165  167  168
[6,]  432  920  969 1024  919  968
> head(knear6$np)
[1] 1117
> head(knear6$k)
[1] 6
> head(knear6$dimension)
[1] 2
> head(knear6$x, n=3)
            x       y
[1,] 454393.1 4417809
[2,] 442744.9 4417781
[3,] 444191.7 4416996

Imagine we are interested in calculating the correlation between some variable (call it x) at each of
the points and at each of the points' closest neighbour. From the above [head(knear6$nn)] we can
see this is the correlation between x1, x2, x3, x4, x5, x6, (etc.) and x10, x172, x152, x135, x32, x432, (etc.).
For the correlation with the second closest neighbours it would be with x11, x110, x149, x143, x164, x920,
(etc.), for the third closest, x1077, x155, x150, x679, x166, x969, (etc.), and so forth. Using the simulated data
about the price of land parcels in Beijing, we can calculate these correlations as follows:
> x <- combined.map$LNPRICE

# Or, combined.map@data$LNPRICE


> cor(x, x[knear6$nn[,1]])
# Correlation with the 1st nearest neighbour
[1] 0.5275507
> cor(x, x[knear6$nn[,2]])
# Correlation with the 2nd nearest neighbour
[1] 0.4053884
> cor(x, x[knear6$nn[,3]])
# Correlation with the 3rd nearest neighbour
[1] 0.4073915
> cor(x, x[knear6$nn[,6]])
# Correlation with the 6th nearest neighbour
[1] 0.3163395

What these values suggest is that even at the sixth nearest neighbour, the value of a land parcel at
any given point tends to be similar to the value of the land parcels around it: an example of
positive spatial autocorrelation.
An issue is that the threshold of six nearest neighbours is purely arbitrary. An interesting question is
how far (how many neighbours) away we can typically go from a point and still find a similarity
in the land price values. One way to determine this would be to carry on with the calculations
above, repeating the procedure until we get to, say, the 250th nearest neighbour. This is, in fact, what
we will do, but automating the procedure. One way to achieve this is to use a for loop:

20

>
>
>
>
>
+
+
>

knear100 <- knearneigh(combined.map, k=250, RANN=F)


# Find the 250 nearest neighbours
correlations <- vector(mode="numeric", length=250)
# Creates an object to store the correlations
for (i in 1: 250) {
correlations[i] <- cor(x, x[knear250$nn[,i]])
}
correlations
[1] 0.52755068 0.40538835 0.40739150 0.35804349 0.33960454 0.31633952 [etc.]

Another way which amounts to the same thing is to make use of R's ability to apply a function
sequentially to columns (or rows) in an array of data:

> correlations <- apply(knear250$nn, 2, function(i) cor(x, x[i]))
# The 2 means apply the correlation function to the columns of knear250$nn
> correlations
[1] 0.52755068 0.40538835 0.40739150 0.35804349 0.33960454 0.31633952 [etc.]

In either case, once we have the correlations they can be plotted,


> plot(correlations, xlab="nth nearest neighbour", ylab="Correlation")
> lines(lowess(correlations))
# Adds a smoothed trend line

Looking at the plot (Figure 5.1) we find that the land prices become more dissimilar (less
correlated) the further away we go from each point, dropping to zero correlation after about
the 200th nearest neighbour. The rate of decrease in the correlation is greatest to about the 35th
neighbour, after which it begins to flatten.

We can also determine the p-values associated with each of these correlations and identify which
are not significant at a 99% confidence,
> pvals <- apply(knear250$nn, 2, function(i) cor.test(x, x[i])$p.value)
> which(pvals > 0.01)
[1] 63 88 110 115 121 125 134 136 137 138 140 142 145 146 [etc.]

It is from about the 100th neighbour that the correlations begin to become insignificant. Whether this
is useful information or not is a moot point: a measure of statistical significance is really only an
indirect measure of the sample size. It may be better to make a decision about the threshold at
which the neighbours are not substantively correlated based on the actual correlations (the effect
sizes) rather than their p-values. Whilst it remains a subjective choice, here we will use the 35th
neighbour as the limit, before which the correlations are typically equal to r = 0.20 or greater.


> knear35 <- knearneigh(combined.map, k=35, RANN=F)

To now convert this object of class knn to the same class of object that we had in Section 5.1.2
('Creating a contiguity matrix') we use
> knear35nb <- knn2nb(knear35)
> class(knear35nb)
[1] "nb"
> head(knear35nb, n=1)
[[1]]
 [1]  8 10 11 14 19 54 56 57 82 91 92 93 [etc.]

Note that these are now in numeric order.


> knear35nb
Neighbour list object:
Number of regions: 1117
Number of nonzero links: 39095
Percentage nonzero weights: 3.133393
Average number of links: 35
Non-symmetric neighbours list

Figure 5.1. The Pearson correlation between the land parcel values at
each point and their nth nearest neighbour
5.1.4 Identifying neighbours by (physical) distance apart


It is also possible to identify the neighbours of points by their Euclidean or Great Circle distance
apart using the function dnearneigh(...). For example, if we wanted to identify all points between
100 and 1000 metres of each other:
> d100to1000 <- dnearneigh(combined.map, 100, 1000)
> class(d100to1000)
[1] "nb"
> d100to1000
Neighbour list object:
Number of regions: 1117
Number of nonzero links: 9822
Percentage nonzero weights: 0.7872154


Average number of links: 8.793196


48 regions with no links:
31 45 51 52 139 160 193 276 277 289 291 313 336 403 429 508 511 512 516 518 520 535
559 565 567 706 710 716 717 792 796 805 811 817 818 819 860 862 877 878 899 900 947
965 988 992 1037 1039

See ?dnearneigh for further details.

5.2 Creating a Spatial Weights matrix


What we created in Section 5.1 was a list of neighbours where we had flexibility to decide about
what counts as a neighbour. The next stage will be to convert it into a spatial weights matrix so we
can use it for various methods of spatial analysis. This extra stage of conversion may seem like an
unnecessary additional chore. However, the creation of the spatial weights matrix allows us to
define the strength of relationship between neighbours. For example, we may want to give more
weight to neighbours that are located closer together and less weight to those that are further apart
(decreasing to zero beyond a certain threshold).
5.2.1 Creating a binary list of weights

We could create a simple binary 'matrix' from any of our existing lists of neighbours. In principle:
> spcontig <- nb2listw(contig, style="B")
Error in nb2listw(contig, style = "B") : Empty neighbour sets found


Note, however, the error message, which arises because two of the Chinese districts do not share a
boundary with others,
> contig

In this case, we shall have to instruct the function to permit an empty set
> spcontig <- nb2listw(contig, style="B", zero.policy=T)

The same problem does not arise for the thirty-five nearest neighbours (by definition, it cannot:
each point has neighbours) but it does for the distance-based list:
> spknear35 <- nb2listw(knear35nb, style="B")
> spd100to1000 <- nb2listw(d100to1000, style="B")
Error in nb2listw(d100to1000, style = "B") : Empty neighbour sets found
> spd100to1000 <- nb2listw(d100to1000, style="B", zero.policy=T)


What we create are objects of class listw (a list of spatial weights):


> class(spcontig)
[1] "listw" "nb"
> class(spknear35)
[1] "listw" "nb"
> class(spd100to1000)
[1] "listw" "nb"


Looking at the first of these objects we can see how it has been constructed. It contains binary
weights (style B); district 1 has two neighbours, districts 52 and 54; and both of those have been
given a weight of one (all other districts therefore have a weight of zero with district 1). Similarly,
district 2 has neighbours 3, 4, 6, 99 and 100, each with a weight of one.
> names(spcontig)
[1] "style"      "neighbours" "weights"
> spcontig$style
[1] "B"
> head(spcontig$neighbours, n=2)
[[1]]
[1] 52 54

[[2]]
[1]   3   4   6  99 100

> head(spcontig$weights, n=2)
[[1]]
[1] 1 1

[[2]]
[1] 1 1 1 1 1
The other spatial weights 'matrices' have the same form.



5.2.2 Creating a row-standardised list of weights

Using binary weights where (1) indicates two places are neighbours, and (0) indicates they are not,
may create a problem when different places have different numbers of neighbours (as is the case for
both the contiguity and distance-based approaches). Imagine a calculation where the result is in
some way dependent upon the sum of the weights involved. For example,

$y_i = \sum_{j=1}^{n} w_{ij} x_j$

All things being equal, we expect places with more neighbours to generate larger values of y simply
because they have more non-zero values contributing to the sum. A way around this problem is to
scale the weights so that for any one place they sum to one: a process known as row-standardisation,
which is actually the default option:

> spcontig <- nb2listw(contig, zero.policy=T)
> spknear35 <- nb2listw(knear35nb)
> spd100to1000 <- nb2listw(d100to1000, zero.policy=T)
> names(spcontig)
[1] "style"      "neighbours" "weights"
> spcontig$style
[1] "W"
> head(spcontig$neighbours, n=2)
[[1]]
[1] 52 54

[[2]]
[1]   3   4   6  99 100

> head(spcontig$weights, n=2)
[[1]]
[1] 0.5 0.5

[[2]]
[1] 0.2 0.2 0.2 0.2 0.2

In the case of the contiguity matrix, district 1 still has neighbours 52 and 54, and district 2 still has
neighbours 3, 4, 6, 99 and 100, but the weights are now row-standardised (style W) and in each case
they sum to one.
5.2.3 Creating an inverse distance weighting (IDW)


A more ambitious undertaking is to decrease the weighting given to two points according to their
distance apart, reducing to zero, for example, beyond the 35th nearest neighbour. To achieve this, we

begin by calculating the distances between each of the points, using the spDists(...) function. This
calculates the distances between two sets of points where the points' locations are defined by (x, y)
coordinates (or by longitude and latitude: see ?spDists). To obtain the (x, y) coordinates of all the
land parcels contained in our combined map we could use the function coordinates(...), therefore
obtaining the distance-between-points matrix using
> d.matrix <- spDists(coordinates(combined.map), coordinates(combined.map))

However, since combined.map is a SpatialPoints(DataFrame) object we can just reference it directly


without explicitly supplying the point coordinates,
> d.matrix <- spDists(combined.map, combined.map)


Either way, the same result is achieved: an np by np matrix where np is the number of points and
the matrix contains the distances between them:
> d.matrix
# The full matrix. It's too large to show on screen.
> head(d.matrix[1,1:10])
# The distances from point 1 to the first 10 others
[1]     0.000 11648.171 10233.788 10360.952  9345.806  3618.377
> nrow(d.matrix)
[1] 1117

This is showing that the distance from point 1 to point 2 is 11.6km. The distance from point 1 to
itself is, of course, zero and the matrix is symmetric,

> d.matrix[1,2]
# The distance from point 1 to point 2 ...
[1] 11648.17
> d.matrix[2,1]
# ... is the same as from point 2 to point 1
[1] 11648.17

If we are going to reduce the weighting to zero beyond the 35th nearest neighbour then we don't
actually need the full distance matrix, only the distances from each point to those 35 neighbours.
We have already identified the nearest 35 neighbours for each point but for the sake of
completeness let's do it again:


> knear35 <- knearneigh(combined.map, k=35, RANN=F)


# Find the 35 nearest neighbours for each point
> np <- knear35$np
# The total number of points

Looking at the results we know, for example, that the nearest neighbours to point 1 are points 10,
11, 1077, etc. so the distances we need to extract from the distance matrix are row 1, columns 10,
11, 1077, and so forth. For point 2 the nearest neighbours are 172, 110, 155, etc. so from the
distance matrix we need row 2, columns 172, 110, 155,
> head(knear35$nn, n=2)
     [,1] [,2] [,3] [,4] [,5] [etc.]
[1,]   10   11 1077 1076   93 [etc.]
[2,]  172  110  155  162  153 [etc.]

For point 1 we may obtain the distances to its 35 nearest neighbours using,

> i <- knear35$nn[1,]
> head(i)
[1]   10   11 1077 1076   93  453
> distances <- d.matrix[1,i]
> head(distances)
[1] 411.0010 566.2565 570.7371 760.7815 784.0760 788.7454

For point 2:
> i <- knear35$nn[2,]


> head(i)
[1] 172 110 155 162 153 156
> distances <- d.matrix[2,i]
> head(distances)
[1] 220.4186 293.3992 827.5340 844.6190 845.0816 868.7554

The same logic underpins the following code except instead of manually obtaining the distances for
each point in turn, it loops through them all sequentially. It also calculates weights that are
inversely related to the distance from a point to its neighbours (here $w_{ij} = 1 / d_{ij}^{0.5}$ is used).

> d.weights <- vector(mode="list", length=np)    # Creates an empty list to store
                                                 # the inverse distance weights
> for (i in 1:np) {                              # A loop taking each point in turn
+    neighbours <- knear35$nn[i,]                # Finds the neighbours for the ith point
+    distances <- d.matrix[i,neighbours]         # Calculates the distance between the
+                                                # ith point and its neighbours
+    d.weights[[i]] <- 1/distances^0.5           # Calculates the IDW
+ }

We can now create the list of neighbours (an object of class nb) and have a corresponding list of
general weights (based on inverse distance weighting) that together allow for the final spatial
weights matrix to be created:
> knear35nb <- knn2nb(knear35)
> head(knear35nb, n=2)
# The list of neighbours
[[1]]
[1] 10 11 91 92 [etc.]

[[2]]
[1] 61 66 67 77 [etc.]

> head(d.weights, n=2)
# The corresponding list of general weights
[[1]]
[1] 0.04932630 0.04202362 0.04185833 0.03625518 0.03571255 [etc.]

[[2]]
[1] 0.06735594 0.05838087 0.03476219 0.03440880 0.03439938 [etc.]

> spknear35IDW <- nb2listw(knear35nb, glist=d.weights)
# Creates the spatial weights matrix, now with IDW

Looking at the result we find that point 1 still has points 10, 11, 91, 92 and so forth as its neighbours
(as it should, it would be worrying if that had changed!) but looking at their weighting, it decreases
with distance away.
> head(spknear35IDW$neighbours, n=1)
[[1]]
[1] 10 11 91 92 [etc.]

> head(spknear35IDW$weights, n=1)
[[1]]
[1] 0.04698452 0.04002853 0.03987110 0.03453395 [etc.]

Note, however, that the weights are not actually $w_{ij} = 1 / d_{ij}^{0.5}$ but are rescaled to
$w_{ij}^{STD} = w_{ij} / \sum_j w_{ij}$
because they are row-standardised. This means that the inverse distance weighting is a function of
the local distribution of points around each point, not just how far away they are. For example,

An Introduction to Mapping and Spatial Modelling in R. Richard Harris, 2013

60

consider a point where all its neighbours are quite far from it. Using a strictly distance-based
weighting, each of those neighbours should receive a low weight. However, once row
standardisation is applied those low weights will be scaled upwards to sum to one. Reciprocally,
imagine a point where all its neighbours are very close. Using a distance-based weighting those
neighbours should receive a high weight; in effect, though, they will be scaled downwards by the
row standardisation. This may sound undesirable and counter to the objectives of inverse distance
weighting, and can be prevented by changing the weights style:
> spknear35IDWC <- nb2listw(knear35nb, glist=d.weights, style="C")

However, imagine that the points are sampled across both urban and rural areas. The distances between
points will most likely be smaller in the urban regions (where the density of points is greater,
reflecting the greater population density), with greater distances between points in the rural regions.
If row standardisation is not applied then the net result will be to give more weight to the urban
parts of the region, such that any subsequent calculation dependent upon the sum of the weights will
be more strongly influenced by the urban areas than by the rural ones. Therefore careful thought
needs to be given to the style of weights to use.
5.2.4 Variants of the above

Common forms of inverse distance weighting include the bisquare and Gaussian functions. These
are, respectively,

w_ij = (1 - d_ij^2 / d_MAX^2)^2, where d_MAX is the threshold beyond which the weights are set to zero; and

w_ij = exp(-0.5 d_ij^2 / d_MAX^2)

The R code to calculate these weightings is below.


# Calculates the bisquare weighting to the 35th neighbour
knear35 <- knearneigh(combined.map, k=35, RANN=F)
np <- knear35$np
d.weights <- vector(mode="list", length=np)
for (i in 1:np) {
  neighbours <- knear35$nn[i,]
  distances <- d.matrix[i,neighbours]
  dmax <- distances[35]
  d.weights[[i]] <- (1 - distances^2/dmax^2)^2
}
spknear35bisq <- nb2listw(knn2nb(knear35), glist=d.weights, style="C")

# Calculates the Gaussian weighting to the 35th neighbour
knear35 <- knearneigh(combined.map, k=35, RANN=F)
np <- knear35$np
d.weights <- vector(mode="list", length=np)
for (i in 1:np) {
  neighbours <- knear35$nn[i,]
  distances <- d.matrix[i,neighbours]
  dmax <- distances[35]
  d.weights[[i]] <- exp(-0.5*distances^2/dmax^2)
}
spknear35gaus <- nb2listw(knn2nb(knear35), glist=d.weights, style="C")
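A quick sanity check of the two kernels (an informal sketch; the values follow directly from the formulas above) is to evaluate them at the threshold distance itself, where d_ij = d_MAX:
> (1 - 1^2)^2       # The bisquare weight at d = dMAX: exactly zero
[1] 0
> exp(-0.5 * 1^2)   # The Gaussian weight at d = dMAX: about 0.61, so it does not fall to zero
[1] 0.6065307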


5.3 Using the spatial weights


5.3.1 Creating a spatially lagged variable

Once we have the spatial weights we can use them to create a spatially lagged variable. For
example, if x_i is the value of the land parcel at point i, then its spatial lag is the mean value of the
land parcels that are the neighbours of i, where those neighbours are defined by the spatial weights.
More precisely, it is a weighted mean if, for example, inverse distance weighting has been
employed. It is straightforward to calculate the spatially lagged variable. For example,
> x <- combined.map$LNPRICE
> lagx <- lag.listw(spknear35gaus, x)
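To see what lag.listw(...) is doing, note that the lagged value for any one point is simply the sum of its neighbours' values multiplied by the corresponding weights (a weighted mean where the weights are row-standardised). A minimal check for the first point:
> nb1 <- spknear35gaus$neighbours[[1]]   # The neighbours of point 1
> w1 <- spknear35gaus$weights[[1]]       # The corresponding weights
> sum(w1 * x[nb1])                       # Should match ...
> lagx[1]                                # ... the first element of the lagged variable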

Having done so, the correlation between points and their neighbours can be calculated,
> cor.test(x, lagx)

	Pearson's product-moment correlation

data:  x and lagx
t = 14.1361, df = 1115, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3389422 0.4384758
sample estimates:
     cor
0.389847

Here there is evidence of significant positive spatial autocorrelation: the land price at one
point tends to be similar to the land prices of its neighbours. This can be seen if we plot the two
variables on a scatter plot, although the relationship is also somewhat noisy and may not be linear.
> plot(lagx ~ x)
> best.fit <- lm(lagx ~ x)
> abline(best.fit)
> summary(best.fit)

Call:
lm(formula = lagx ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2042 -0.3933 -0.0004  0.3912  1.6712

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.70723    0.12272   46.51   <2e-16 ***
x            0.23176    0.01639   14.14   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5639 on 1115 degrees of freedom
Multiple R-squared: 0.152,	Adjusted R-squared: 0.1512
F-statistic: 199.8 on 1 and 1115 DF,  p-value: < 2.2e-16


Figure 5.1. The relationship between the land price values and the spatial lag of those values
5.3.2 A Moran plot and test

What we created in Figure 5.1 is known as a Moran plot. A more direct way of producing it is to use
the moran.plot(...) function,
> moran.plot(x, spknear35gaus)

which flags potential outliers / influential observations. To suppress their labelling, include the
argument labels=F.
The Moran coefficient and related test provide a measure of the spatial autocorrelation in the data,
given the spatial weightings.
> moran.test(x, spknear35gaus)

	Moran's I test under randomisation

data:  x
weights: spknear35gaus

Moran I statistic standard deviate = 39.663, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
     0.2657983392     -0.0008960573      0.0000452122

Essentially the Moran statistic is a correlation value, although it need not be exactly zero in the
absence of correlation (here the expected value is not zero but slightly negative) and it can go
beyond the range -1 to +1. The interpretation, though, is that the prices of the land parcels and of their
neighbours are positively correlated: there is a tendency for like values to be found near like.
More strictly, we should acknowledge the note found under ?moran.test that the derivation of
the test assumes the weights matrix is symmetric, which it is not (because where A is one of the nearest
neighbours to B it does not follow that B is necessarily one of the nearest neighbours to A):
> spknear35gaus
Characteristics of weights list object:
Neighbour list object:
Number of regions: 1117
Number of nonzero links: 39095
Percentage nonzero weights: 3.133393
Average number of links: 35
Non-symmetric neighbours list    # Note that the 'matrix' is not symmetric

Weights style: C
Weights constants summary:
     n      nn   S0       S1       S2
C 1117 1247689 1117 58.55285 4623.092

To correct for this we can use a helper function, listw2U(...),
> moran.test(x, listw2U(spknear35gaus))
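A brief aside on what listw2U(...) does (a minimal sketch: it symmetrises the weights, in effect working with 0.5 * (W + t(W))). One visible consequence is that the neighbour lists can grow:
> spknear35sym <- listw2U(spknear35gaus)
> length(spknear35gaus$neighbours[[1]])   # 35 neighbours in the original, asymmetric list
> length(spknear35sym$neighbours[[1]])    # At least 35: any point counting point 1 among its
                                          # nearest neighbours is now a neighbour of point 1 too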

5.4 Checking a regression model for spatially correlated errors


We end this session by re-fitting the regression model of Section 4.6.1, 'Example 1: mapping
regression residuals', p. 49, but this time using a Moran test to check whether the assumption of
spatial independence in the residuals is warranted.
First the regression model,


> model1 <- lm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data=combined.map)

Now a Moran plot and test.
> moran.plot(residuals(model1), spknear35gaus)
> lm.morantest(model1, listw2U(spknear35gaus))
# listw2U is used because the weights are not symmetrical

	Global Moran's I for regression residuals

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: listw2U(spknear35gaus)

Moran I statistic standard deviate = 16.0962, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran's I        Expectation           Variance
      9.683406e-02      -3.893216e-03       3.916053e-05

The estimated correlation is about 0.097. Not huge, perhaps, but significant enough to question the
assumption of independence.
Note that this result is dependent on the spatial weightings. If we change them, then the results of
the Moran test will change also. For example, using the bisquare weightings (from Section 5.2.4,
p. 61):
> lm.morantest(model1, listw2U(spknear35bisq))

	Global Moran's I for regression residuals

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: listw2U(spknear35bisq)

Moran I statistic standard deviate = 13.2469, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran's I        Expectation           Variance
      9.651698e-02      -3.961995e-03       5.753359e-05

Here the change is slight, largely because the rate of decay of the inverse distance weighting matters
rather less than the number of neighbours it decays to. In both cases above the threshold is 35, a
number we obtained by judgement from Figure 5.1. Imagine we had chosen 150 instead. The results
we then get (using a Gaussian decay function) are,
Moran I statistic standard deviate = 13.9404, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran's I        Expectation           Variance
      3.230500e-02      -2.469718e-03       6.222665e-06

which, although still statistically significant, reduces the Moran's I value to about a third of its
previous value.

5.5 Summary
The general process for creating spatial weights in R is as follows:
(a) Read the (X, Y) data or shapefile into the R workspace and (in doing so) convert it into a
spatial object (see Session 4).
(b) Decide how neighbouring observations will be defined: by nearest neighbour, by distance or
by contiguity, for example.
(c) Convert the object of class nb into an object of class listw (spatial weights). For k-nearest
neighbours there is a prior stage of converting the knn object into class nb.
(d) At the time of creating the spatial weights object you need to decide what type of weights to
use, for example binary or row-standardised. You may also supply a list of general weights
to produce inverse distance weighting. A minimal sketch drawing these steps together is given
below.
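The sketch assumes the k-nearest neighbours approach and the objects used in this session; only functions that have already appeared above are used.
> library(spdep)
> # (a) combined.map was read-in and converted to a spatial object in Session 4
> # (b) Define the neighbours, here the k = 35 nearest
> knear35 <- knearneigh(combined.map, k=35, RANN=F)
> # (c) Convert the knn object to class nb
> knear35nb <- knn2nb(knear35)
> # (d) Choose the style of weights (and, optionally, supply a list of general weights)
> spknear35 <- nb2listw(knear35nb, style="W")   # Row-standardised weights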


Session 6: Spatial regression analysis


In this session we draw together some of what we have learned about mapping and creating spatial
weights in R to undertake spatial forms of regression, specifically a spatially lagged error model, a
lagged y model, a mixed model, Geographically Weighted Regression and a multilevel model. The
session is more of an introduction to how to fit these sorts of models in R than an in-depth
tutorial on their specification and use. Inevitably some of the models fit the (simulated) data better
than others. However, that is not to say they would necessarily be preferred when using other data
and in other analytical contexts. Which model to choose will depend on the purpose of the analysis
and the theoretical basis for the modelling.
This session requires the spdep, GWmodel and lme4 libraries to have been installed.

6.1 Introduction
6.1.1 Getting Started

If you have closed and restarted R since the last session, load the workspace session6.RData which
contains the districts map and the combined (synthetic) land parcel and district data from Session 4
as well as the spatial weights with a Gaussian decay to the 35th neighbour created in Session 5.
> load(file.choose())
> library(spdep)
> ls()
[1] "combined.map"  "districts"     "spknear35gaus"

Recall that the (log of the) land price values show significant spatial variation,
> moran.test(combined.map$LNPRICE, spknear35gaus)

	Moran's I test under randomisation

data:  combined.map$LNPRICE
weights: spknear35gaus

Moran I statistic standard deviate = 39.663, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
     0.2657983392     -0.0008960573      0.0000452122

Our challenge is to try and explain some of that variation within a regression framework.
6.1.2 OLS regression

We begin by re-fitting the regression model from the end of the previous session, noting once again
the apparent spatial dependencies in the residuals that violate the assumption of independence:
> model1 <- lm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
     Y0607 + Y0809, data=combined.map)
> lm.morantest(model1, listw2U(spknear35gaus))

	Global Moran's I for regression residuals

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: listw2U(spknear35gaus)

Moran I statistic standard deviate = 16.0962, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran's I        Expectation           Variance
      9.683406e-02      -3.893216e-03       3.916053e-05

In addition to their apparent lack of independence, we may also note that the residuals appear to
show evidence of heteroskedasticity (non-constant variance); they are therefore neither independent
nor identically distributed:
> plot(residuals(model1) ~ fitted(model1))    # Plot the residuals against the fitted values
> abline(h=0, lty="dotted")                   # Add a horizontal line at residual value = 0
> lines(lowess(fitted(model1), residuals(model1)), col="red")
# Add a trend line. We are hoping not to find a trend but to see that the
# residuals are random noise around 0. They are not.

More simply, a similar plot can be produced using


> plot(model1, which=3)

However, that same code won't work for the spatial models we produce below. Instead, we can write
a function to produce what we need,
> hetero.plot <- function(model) {
+   plot(residuals(model) ~ fitted(model))
+   abline(h=0, lty="dotted")
+   lines(lowess(fitted(model), residuals(model)), col="red")
+ }
> hetero.plot(model1)

There are no quick fixes for the violated assumption of independent and identically distributed
errors. The violation suggests we cannot take on trust the standard errors, t- and p-values shown
under the model summary. It is likely that the standard errors for at least some of the predictor
variables have been under-estimated (because if we have spatial dependencies in the residuals then
they likely arise from spatial dependencies in the data, which in turn mean we have fewer degrees of
freedom than we think we have).
> summary(model1)
Call:
lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809, data = combined.map)

Residuals:
     Min       1Q   Median       3Q      Max
-2.83427 -0.59053 -0.03558  0.53943  2.97878

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.916563   0.575031  20.723  < 2e-16 ***
DCBD        -0.259700   0.055202  -4.705 2.87e-06 ***
DELE        -0.079764   0.032784  -2.433  0.01513 *
DRIVER       0.063370   0.029665   2.136  0.03288 *
DPARK       -0.294466   0.046747  -6.299 4.31e-10 ***
POPDEN       0.004552   0.001047   4.346 1.51e-05 ***
JOBDEN       0.006990   0.003009   2.323  0.02037 *
Y0405       -0.183913   0.057289  -3.210  0.00136 **
Y0607        0.152825   0.087152   1.754  0.07979 .
Y0809        0.803620   0.118338   6.791 1.81e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.868 on 1107 degrees of freedom
Multiple R-squared: 0.295,	Adjusted R-squared: 0.2893
F-statistic: 51.47 on 9 and 1107 DF,  p-value: < 2.2e-16

If we have doubts about this model because of the spatial patterning of the errors (which are likely
but not necessarily caused by the patterning of the Y variable) then we need to consider other
approaches.

6.2 Spatial Econometric Methods


6.2.1 Spatial Error Model

One option is to fit a spatial simultaneous autoregressive error model, which decomposes the error
into two parts, a spatially lagged component and a remaining error: y = Xβ + λWu + ε

Fitting the model and comparing it with the standard regression model we find that two of the
predictor variables (DELE and JOBDEN) are no longer significant at a conventional level and that
the standard errors for many have risen. The lambda value (a measure of spatial autocorrelation) is
significant. The model fits the data better than the previous model (the AIC score is lower and the
log likelihood value greater, as is the pseudo-R2 value):
> model2 <- errorsarlm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN +
     Y0405 + Y0607 + Y0809, data=combined.map, spknear35gaus)
> summary(model2)
Call:errorsarlm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN
    + Y0405 + Y0607 + Y0809, data = combined.map, listw = spknear35gaus)

Residuals:
      Min        1Q    Median        3Q       Max
-2.724707 -0.555175 -0.050399  0.484010  2.750120

Type: error
Coefficients: (asymptotic standard errors)
              Estimate Std. Error z value  Pr(>|z|)
(Intercept) 12.7929865  1.0803367 11.8417 < 2.2e-16
DCBD        -0.4528868  0.1179775 -3.8388 0.0001237 ***
DELE        -0.0430191  0.0414930 -1.0368 0.2998391
DRIVER       0.0805519  0.0389786  2.0666 0.0387749 *
DPARK       -0.2225801  0.0640377 -3.4758 0.0005094 ***
POPDEN       0.0040076  0.0012339  3.2479 0.0011624 **
JOBDEN       0.0032269  0.0035468  0.9098 0.3629286
Y0405       -0.2185441  0.0548818 -3.9821 6.831e-05 ***
Y0607        0.2488569  0.0832995  2.9875 0.0028127 **
Y0809        0.9301437  0.1131248  8.2223 2.220e-16 ***

Lambda: 0.68882, LR test value: 84.635, p-value: < 2.22e-16
Asymptotic standard error: 0.053718
    z-value: 12.823, p-value: < 2.22e-16
Wald statistic: 164.43, p-value: < 2.22e-16

Log likelihood: -1379.548 for error model
ML residual variance (sigma squared): 0.67987, (sigma: 0.82455)
Number of observations: 1117
Number of parameters estimated: 12
AIC: 2783.1, (AIC for lm: 2865.7)

(I have added the asterisks manually).


The AIC is given above but we can also obtain it using,
> AIC(model1)
[1] 2865.731
> AIC(model2)
[1] 2783.096

The log likelihood scores using,


> logLik(model1)
'log Lik.' -1421.865 (df=11)
> logLik(model2)
'log Lik.' -1379.548 (df=12)

And the pseudo-R2 for the spatial error model as,


> cor(combined.map$LNPRICE, fitted(model2))^2
[1] 0.3590577
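Since the same pseudo-R2 calculation will be useful for the other models fitted below, it can be convenient to wrap it in a small helper function (a sketch that simply repeats the calculation above):
> pseudo.r2 <- function(model) cor(combined.map$LNPRICE, fitted(model))^2
> pseudo.r2(model2)
[1] 0.3590577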

The problem of heteroskedasticity remains, however:


> hetero.plot(model2)

6.2.2 A spatially lagged y model

Although the spatial error model (above) fits the data better than the standard OLS model, it tells us
only that there is an unexplained spatial structure to the residuals, not what caused it. It may
offer better estimates of the model parameters and their statistical significance but it does not
presuppose any particular spatial process generating the patterns in the land price values. A different
model, which explicitly tests whether the land value at a point is functionally dependent on the
values of neighbouring points, is the spatially lagged y model: y = ρWy + Xβ + ε
The model is fitted in R using,
> model3 <- lagsarlm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405
     + Y0607 + Y0809, data=combined.map, spknear35gaus)
> summary(model3)
Call:lagsarlm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809, data = combined.map, listw = spknear35gaus)

Residuals:
      Min        1Q    Median        3Q       Max
-2.818629 -0.577205 -0.051596  0.517413  3.005430

Type: lag
Coefficients: (asymptotic standard errors)
              Estimate Std. Error z value  Pr(>|z|)
(Intercept)  9.3098717  0.8336866 11.1671 < 2.2e-16 ***
DCBD        -0.1753427  0.0585675 -2.9939 0.0027547 **
DELE        -0.0706011  0.0323879 -2.1799 0.0292677 *
DRIVER       0.0334482  0.0304653  1.0979 0.2722435
DPARK       -0.2591509  0.0467432 -5.5441 2.954e-08 ***
POPDEN       0.0042982  0.0010362  4.1482 3.351e-05 ***
JOBDEN       0.0057660  0.0029799  1.9350 0.0529953
Y0405       -0.1869054  0.0565474 -3.3053 0.0009488 ***
Y0607        0.1928584  0.0860329  2.2417 0.0249819 **
Y0809        0.8452182  0.1167707  7.2383 4.545e-13 ***

Rho: 0.23634, LR test value: 18.273, p-value: 1.9136e-05
Asymptotic standard error: 0.055848
    z-value: 4.2318, p-value: 2.3185e-05
Wald statistic: 17.908, p-value: 2.3185e-05

Log likelihood: -1412.729 for lag model
ML residual variance (sigma squared): 0.73357, (sigma: 0.85648)
Number of observations: 1117
Number of parameters estimated: 12
AIC: 2849.5, (AIC for lm: 2865.7)
LM test for residual autocorrelation
test value: 109.4, p-value: < 2.22e-16

The model is an improvement on the OLS model but does not appear to fit the data as well as the
error model (the lagged y model has a greater AIC, a lower log likelihood and a lower pseudo-R2):
> AIC(model3)
[1] 2849.458
> logLik(model3)
'log Lik.' -1412.729 (df=12)
> cor(combined.map$LNPRICE, fitted(model3))^2
[1] 0.3074552

The heteroskedasticity remains,


> hetero.plot(model3)

Moreover (and possibly related to the heteroskedasticity) significant autocorrelation remains in the
residuals from this model. We can, of course, plot these residuals (see Session 4, 'Using R as a
simple GIS' for further details). Here we will write a simple function to do so.
> # A function to draw a quick map
> quickmap <- function(x, subset=NULL) {
+   # The subset is of use for the later GWR model
+   if(!is.null(subset)) {
+     x <- x[subset]
+     combined.map <- combined.map[subset,]
+   }
+   library(classInt)
+   break.points <- classIntervals(x, 5, "fisher")$brks
+   groups <- cut(x, break.points, include.lowest=T, labels=F)
+   library(RColorBrewer)
+   palette <- brewer.pal(5, "Spectral")
+   plot(districts)
+   plot(combined.map, pch=21, bg=palette[groups], add=T)
+   # Some code to create the legend automatically...
+   break.points <- round(break.points, 2)
+   n <- length(break.points)
+   break.points[n] <- break.points[n] + 0.01
+   txt <- vector("character", length=n-1)
+   for (i in 1:(length(break.points) - 1)) {
+     txt[i] <- paste(break.points[i], "to <", break.points[i+1])
+   }
+   legend("bottomright", legend=txt, pch=21, pt.bg=palette)
+ }
> x <- residuals(model3)
> quickmap(x)

Figure 6.1. The residuals from the spatial lagged y model still
display evidence of positive spatial autocorrelation

Note that the beta estimates of the lagged y model cannot be interpreted in the same way as for a
standard OLS model. For example, the beta estimate of 0.004 for the POPDEN variable does not
mean that if (hypothetically) we increased that variable by one unit at each location we should then
expect the (log of the) land parcel price to increase everywhere by 0.004, even holding the other X
variables constant. The reason is that if we did raise the value it would start something akin to a
'chain reaction' through the feedback of Y via the lagged Y values, which will have a different
overall effect at different locations. That (equilibrium) effect is obtained by premultiplying a given
change in x at a location, holding x constant for other locations, by (I - ρW)^-1. The code
below, based on Ward & Gleditsch (2008, p. 47), will do that, taking each location in turn. However
I would advise against running it here as it takes a long time. For further details see Ward &
Gleditsch pp. 44-50.
> ## You are advised not to run this. Based on Ward & Gleditsch p. 47
> n <- nrow(combined.map)
> I <- matrix(0, nrow=n, ncol=n)
> diag(I) <- 1
> rho <- model3$rho
> weights.matrix <- listw2mat(spknear35gaus)
> results <- rep(NA, times=10)
> for (i in 1:10) {
+   cat("\nCalculating for point", i, " of ", n)
+   xvector <- rep(0, times=n)
+   xvector[i] <- 1
+   impact <- solve(I - rho * weights.matrix) %*% xvector * 0.004
+   results[i] <- impact[i]
+ }


6.2.3 Choosing between the models using Lagrange Multiplier (LM) Tests

Before fitting the spatial error and lagged y models (above), we could have looked for evidence in
support of them using the function lm.LMtests(...). This tests the basic OLS specification against
the more general spatial error and lagged y models. Robust tests also are given. There is evidence
in favour of both of the spatial models over the simpler OLS model but it is stronger (in purely
statistical terms) for the error model.
> lm.LMtests(model1, spknear35gaus, test="all")
Lagrange multiplier diagnostics for spatial dependence

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: spknear35gaus
LMerr = 199.8088, df = 1, p-value < 2.2e-16

Lagrange multiplier diagnostics for spatial dependence


data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: spknear35gaus
LMlag = 22.6309, df = 1, p-value = 1.963e-06

Lagrange multiplier diagnostics for spatial dependence


data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: spknear35gaus
RLMerr = 182.3297, df = 1, p-value < 2.2e-16

Lagrange multiplier diagnostics for spatial dependence

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: spknear35gaus
RLMlag = 5.1518, df = 1, p-value = 0.02322

Lagrange multiplier diagnostics for spatial dependence


data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data = combined.map)
weights: spknear35gaus
SARMA = 204.9606, df = 2, p-value < 2.2e-16
Warning message:
In lm.LMtests(model1, spknear35gaus, test = "all") :
Spatial weights matrix not row standardized

6.2.4 Including lagged X variables

So far we have accommodated the spatial dependencies in the data by giving consideration to the error
term and to the dependent (y) variable. Attention now turns to the predictor variables. An
extension to the lagged y model is to lag all the included x variables too.
> model4 <- lagsarlm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405
     + Y0607 + Y0809, data=combined.map, spknear35gaus, type="mixed")
> summary(model4)
Call:lagsarlm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809, data = combined.map, listw = spknear35gaus,
    type = "mixed")

Residuals:
      Min        1Q    Median        3Q       Max
-2.703670 -0.547982 -0.012966  0.496047  2.831382

Type: mixed
Coefficients: (asymptotic standard errors)
                    Estimate  Std. Error z value  Pr(>|z|)
(Intercept)      14.42342251  2.30649891  6.2534 4.017e-10 ***
DCBD             -0.70355874  0.22856175 -3.0782 0.0020826 **
DELE             -0.00261626  0.04397004 -0.0595 0.9525531
DRIVER            0.09144563  0.04468328  2.0465 0.0407043 *
DPARK            -0.06440409  0.08154334 -0.7898 0.4296362
POPDEN            0.00313902  0.00130863  2.3987 0.0164527 *
JOBDEN           -0.00063843  0.00377879 -0.1690 0.8658360
Y0405            -0.21727049  0.05452762 -3.9846 6.760e-05 ***
Y0607             0.22959603  0.08351013  2.7493 0.0059719 **
Y0809             0.90803383  0.11300773  8.0351 8.882e-16 ***
lag.(Intercept)  -9.63048411  2.68285336 -3.5896 0.0003311 ***
lag.DCBD          0.65596211  0.23856166  2.7497 0.0059658 **
lag.DELE         -0.03552211  0.06711110 -0.5293 0.5965952
lag.DRIVER       -0.04970303  0.07485862 -0.6640 0.5067167
lag.DPARK        -0.12152627  0.13681302 -0.8883 0.3743980
lag.POPDEN        0.00164266  0.00261945  0.6271 0.5305917
lag.JOBDEN        0.01011556  0.00665176  1.5207 0.1283262
lag.Y0405         0.23932841  0.25856976  0.9256 0.3546615
lag.Y0607        -0.62537354  0.35590512 -1.7571 0.0788947
lag.Y0809        -1.36572890  0.53207778 -2.5668 0.0102646 *

Rho: 0.57914, LR test value: 49.317, p-value: 2.1777e-12
Asymptotic standard error: 0.066316
    z-value: 8.7331, p-value: < 2.22e-16
Wald statistic: 76.267, p-value: < 2.22e-16
Log likelihood: -1364.385 for mixed model
ML residual variance (sigma squared): 0.66612, (sigma: 0.81616)
Number of observations: 1117
Number of parameters estimated: 22
AIC: 2772.8, (AIC for lm: 2820.1)
LM test for residual autocorrelation
test value: 0.43053, p-value: 0.51173

What we find is that the lag of the distance to the CBD (measured on a log scale) is significant but
note that the direction of the relationship is different from that for the original DCBD variable: the
sign has reversed. The same has happened to the dummy variable indicating the sale of the land
parcel in the years 2008 to 2009. Taking the first case, what it suggests is that the relationship
between land price value and the (log of) distance to the CBD is not linear. Adding the square of
this variable to the original OLS model improves the model fit:
> model1b <- update(model1, . ~ . + I(DCBD^2))   # Adds the square of the variable
> summary(model1b)

Call:
lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809 + I(DCBD^2), data = combined.map)

Residuals:
     Min       1Q   Median       3Q      Max
-2.84512 -0.59732 -0.04968  0.53605  2.97359

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.633565   2.691502   1.722  0.08543 .
DCBD         1.376557   0.593375   2.320  0.02053 *
DELE        -0.066188   0.033051  -2.003  0.04547 *
DRIVER       0.065121   0.029583   2.201  0.02792 *
DPARK       -0.271174   0.047360  -5.726 1.33e-08 ***
POPDEN       0.004193   0.001052   3.985 7.20e-05 ***
JOBDEN       0.010106   0.003204   3.154  0.00165 **
Y0405       -0.187146   0.057129  -3.276  0.00109 **
Y0607        0.139812   0.087018   1.607  0.10840
Y0809        0.809708   0.118003   6.862 1.13e-11 ***
I(DCBD^2)   -0.095239   0.034389  -2.769  0.00571 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8654 on 1106 degrees of freedom
Multiple R-squared: 0.2999,	Adjusted R-squared: 0.2935
F-statistic: 47.37 on 10 and 1106 DF,  p-value: < 2.2e-16

> anova(model1, model1b)
Analysis of Variance Table

Model 1: LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
    Y0607 + Y0809
Model 2: LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
    Y0607 + Y0809 + I(DCBD^2)
  Res.Df    RSS Df Sum of Sq    F   Pr(>F)
1   1107 834.13
2   1106 828.39  1    5.7448 7.67 0.005708 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, doing the same in the spatial models results in the distance to CBD variable no longer
being significant under either the spatial error model or the lagged y model, though in the latter case
it is more borderline and the square of the variable remains significant:
> model2b <- update(model2, . ~ . + I(DCBD^2))


> summary(model2b)
Call:errorsarlm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK +
    POPDEN + JOBDEN + Y0405 + Y0607 + Y0809 + I(DCBD^2), data = combined.map,
    listw = spknear35gaus)

Residuals:
     Min       1Q   Median       3Q      Max
-2.73137 -0.55895 -0.05405  0.48248  2.74627

Type: error
Coefficients: (asymptotic standard errors)
              Estimate Std. Error z value  Pr(>|z|)
(Intercept)  9.2141717  3.7866485  2.4333  0.014961 **
DCBD         0.3841081  0.8585578  0.4474  0.654595
DELE        -0.0366181  0.0420046 -0.8718  0.383337
DRIVER       0.0813644  0.0389130  2.0909  0.036534 **
DPARK       -0.2145776  0.0645093 -3.3263  0.000880 ***
POPDEN       0.0039035  0.0012375  3.1543  0.001609 **
JOBDEN       0.0040846  0.0036396  1.1223  0.261754
Y0405       -0.2191549  0.0548720 -3.9939 6.499e-05 ***
Y0607        0.2466920  0.0833052  2.9613  0.003063 **
Y0809        0.9292095  0.1130977  8.2160 2.220e-16 ***
I(DCBD^2)   -0.0500019  0.0509827 -0.9808  0.326710

Lambda: 0.68506, LR test value: 77.873, p-value: < 2.22e-16
Asymptotic standard error: 0.05426
    z-value: 12.625, p-value: < 2.22e-16
Wald statistic: 159.4, p-value: < 2.22e-16

Log likelihood: -1379.069 for error model
ML residual variance (sigma squared): 0.67948, (sigma: 0.82431)
Number of observations: 1117
Number of parameters estimated: 13
AIC: 2784.1, (AIC for lm: 2860)
> model3b <- update(model3, . ~ . + I(DCBD^2))
> summary(model3b)
Call:lagsarlm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
    JOBDEN + Y0405 + Y0607 + Y0809 + I(DCBD^2), data = combined.map,
    listw = spknear35gaus)

Residuals:
      Min        1Q    Median        3Q       Max
-2.828129 -0.573136 -0.047161  0.531172  2.948786

Type: lag
Coefficients: (asymptotic standard errors)
              Estimate Std. Error z value  Pr(>|z|)
(Intercept)  4.0409714  2.6745551  1.5109 0.1308153
DCBD         1.0508575  0.5886085  1.7853 0.0742086
DELE        -0.0611436  0.0326664 -1.8718 0.0612404
DRIVER       0.0373019  0.0304562  1.2248 0.2206611
DPARK       -0.2445848  0.0472913 -5.1719 2.318e-07 ***
POPDEN       0.0040491  0.0010413  3.8884 0.0001009 ***
JOBDEN       0.0082186  0.0031842  2.5811 0.0098486 ***
Y0405       -0.1890889  0.0564567 -3.3493 0.0008102 ***
Y0607        0.1796598  0.0860161  2.0887 0.0367368 *
Y0809        0.8462846  0.1165821  7.2591 3.897e-13 ***
I(DCBD^2)   -0.0717869  0.0342566 -2.0956 0.0361209 **

Rho: 0.21633, LR test value: 14.876, p-value: 0.00011479
Asymptotic standard error: 0.056612
    z-value: 3.8212, p-value: 0.00013282
Wald statistic: 14.601, p-value: 0.00013282

Log likelihood: -1410.567 for lag model
ML residual variance (sigma squared): 0.73092, (sigma: 0.85494)
Number of observations: 1117
Number of parameters estimated: 13
AIC: 2847.1, (AIC for lm: 2860)
LM test for residual autocorrelation
test value: 100.58, p-value: < 2.22e-16

6.2.5 Including lagged X variables in the OLS model

We can, if we wish, include specific lagged X variables in the OLS model. The process is to create
them and then to include them in the model. The lag of DCBD and the lag of Y0809 are the most obvious
candidates to include (from model 4, above). To create the lagged variables,
> lag.DCBD <- lag.listw(spknear35gaus, combined.map$DCBD)


> lag.Y0809 <- lag.listw(spknear35gaus, combined.map$Y0809)

and then add them to the model,


> model1c <- update(model1, .~. + lag.DCBD + lag.Y0809)

The result seems to fit the data better than the original OLS model (an AIC score of 2848.3 versus 2865.7;
remember, the lower the better) but the lag of DCBD appears not to be significant in this
model whilst significant spatial autocorrelation appears to remain:
> summary(model1c)
Call:
lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
JOBDEN + Y0405 + Y0607 + Y0809 + lag.DCBD + lag.Y0809, data = combined.map)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9443 -0.5892 -0.0318  0.5279  2.8866

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.742616   0.572254  20.520  < 2e-16 ***
DCBD        -0.226122   0.075021  -3.014 0.002636 **
DELE        -0.068113   0.032621  -2.088 0.037026 *
DRIVER       0.105230   0.030779   3.419 0.000652 ***
DPARK       -0.282638   0.046536  -6.074 1.72e-09 ***
POPDEN       0.004162   0.001042   3.995 6.89e-05 ***
JOBDEN       0.008902   0.003012   2.955 0.003188 **
Y0405       -0.198322   0.056897  -3.486 0.000510 ***
Y0607        0.161288   0.087123   1.851 0.064399 .
Y0809        0.852457   0.118536   7.192 1.18e-12 ***
lag.DCBD    -0.056987   0.054031  -1.055 0.291789
lag.Y0809   -2.079764   0.476501  -4.365 1.39e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8605 on 1105 degrees of freedom
Multiple R-squared: 0.3084,	Adjusted R-squared: 0.3015
F-statistic: 44.79 on 11 and 1105 DF,  p-value: < 2.2e-16

> AIC(model1c)
[1] 2848.332
> lm.morantest(model1c, spknear35gaus)

	Global Moran's I for regression residuals

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809 + lag.DCBD + lag.Y0809, data = combined.map)
weights: spknear35gaus

Moran I statistic standard deviate = 15.2495, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran's I        Expectation           Variance
      8.890303e-02      -4.717012e-03       3.769016e-05

6.3 Discussion
Although the revised spatial models are presenting different results in regard to the distance to CBD
variable and its effect on land price, and these differ again from the revised OLS models, there may
be an argument that, in a general sense, the models are indicating the same thing. Distance to CBD
is a geographic measure. It tries to explain something about land values by their distance from the
CBD. The spatial error and lagged y models introduce geographical considerations in other ways
(and in addition to the distance to CBD variable). Looking at the results of the Lagrange Multiplier
tests (but we could also use the AIC or Log Likelihood scores) the spatial error model remains the
'preferred' model. And that, again, is another clue: the complexity of the spatial patterning of the
land parcel prices has yet to be fully captured by our model; its causes remain largely unexplained.
> lm.LMtests(model1b, spknear35gaus, test=c("LMerr", "LMlag"))
Lagrange multiplier diagnostics for spatial dependence

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809 + I(DCBD^2), data =
combined.map)
weights: spknear35gaus
LMerr = 169.6169, df = 1, p-value < 2.2e-16

Lagrange multiplier diagnostics for spatial dependence

data:
model: lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809 + I(DCBD^2), data =
combined.map)
weights: spknear35gaus
LMlag = 18.0596, df = 1, p-value = 2.141e-05


Warning message:
In lm.LMtests(model1b, spknear35gaus, test = c("LMerr", "LMlag")) :
Spatial weights matrix not row standardized

6.4 Geographically Weighted Regression (GWR)


An alternative way of attempting to explain the spatial variation in the land value prices is to allow
the effect sizes of the predictor variables themselves to vary over space. Geographically Weighted
Regression (GWR) offers this: the estimate for x at point location i is not simply the global
estimate for all points in the study region but a local estimate based on surrounding points, weighted
by the inverse of their distance away.
6.4.1 Fitting the GWR model

The stages of fitting a Geographically Weighted Regression model are first to load the GWR library,
calculate a distance matrix containing the distances between points, calibrate the bandwidth for the
local model fitting, fit the model, then look for evidence of spatial variations in the estimates.
First the library. There are two we could use. The first is library(spgwr). However, we will use the
more recently developed library(GWmodel) which contains a suite of tools for geographically
weighted types of analysis.
> library(GWmodel)

Next the distance matrix:


> distances <- gw.dist(dp.locat=coordinates(combined.map))

Now the bandwidth, here using a nearest neighbours metric and a Gaussian decay for the inverse
distance weighting (for other options, see ?bw.gwr). The bandwidth is found using a cross-validation
optimisation procedure.
> bw <- bw.gwr(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
Y0607 + Y0809, data=combined.map, adaptive=T, dMat=distances)
> bw
[1] 30

Here the bandwidth decreases to zero at the 30th neighbour, encouragingly similar to the 35th
neighbour value we have used for our spatial weights throughout this session. Next the model is
fitted:
> gwr.model <- gwr.basic(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN +
     Y0405 + Y0607 + Y0809, data=combined.map, adaptive=T, dMat=distances, bw=bw)

Looking at the results we can see that the model has a better fit to the data than any other fitted thus
far. The summary of the GWR estimates is of how the (local) beta estimates vary across the study
region. For example, from the inter-quartile range, the effect of distance to CBD on land value
prices is typically found to be from -0.682 to -0.030.

> gwr.model
[..]
***********************************************************************
*          Results of Geographically Weighted Regression              *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Adaptive bandwidth: 30 (number of nearest neighbours)
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.

****************Summary of GWR coefficient estimates:******************
                   Min.    1st Qu.     Median    3rd Qu.    Max.
X.Intercept. -2.992e+01  8.776e+00  1.163e+01  1.432e+01 36.7400
DCBD         -2.452e+00 -6.818e-01 -3.166e-01 -3.005e-02  3.6800
DELE         -9.924e-01 -1.839e-01 -2.105e-02  1.184e-01  0.5259
DRIVER       -9.369e-01 -5.216e-02  1.076e-01  2.344e-01  0.7053
DPARK        -1.161e+00 -2.895e-01 -1.756e-01 -3.797e-02  0.7852
POPDEN       -2.440e-02 -7.219e-04  3.197e-03  1.001e-02  0.0293
JOBDEN       -1.965e-02 -2.778e-03  5.276e-03  1.744e-02  0.1056
Y0405        -8.825e-01 -3.227e-01 -2.218e-01 -1.028e-01  0.4346
Y0607        -9.749e-01 -2.447e-01  2.268e-01  7.012e-01  1.3980
Y0809        -1.016e+00 -3.069e-01  8.426e-01  1.869e+00  2.7530
************************Diagnostic information*************************
Number of data points: 1117
Effective number of parameters (2trace(S) - trace(S'S)): 178.6996
Effective degrees of freedom (n-2trace(S) + trace(S'S)): 938.3004
AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 2555.443
AIC (GWR book, Fotheringham, et al. 2002, GWR p. 96, eq. 4.22): 2381.741
Residual sum of squares: 489.1524
R-square value: 0.58657
Adjusted R-square value: 0.5077481
***********************************************************************
Program stops at: 2013-10-23 15:03:25

The estimates themselves for each of the land parcel points are found within the GWR model's spatial
data frame:
> names(gwr.model$SDF)

The parts of this data frame with the original variable names are the local beta estimates, those with
the suffix _SE are the corresponding standard errors, and together they give the t-values, marked _TV.
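Purely as an illustration of the idea (a rough sketch, not a reproduction of gwr.basic's internal calculations; the kernel and the bandwidth rule below are simplifications), a single 'local' estimate can be thought of as an ordinary regression re-fitted with weights that decay with distance from one regression point:
> d <- distances[1, ]           # Distances from point 1 to every point
> h <- sort(d)[bw + 1]          # An adaptive bandwidth: the distance to the bw-th nearest
                                # neighbour (position 1 of sort(d) is point 1 itself)
> w <- exp(-0.5 * (d / h)^2)    # Gaussian kernel weights
> local.fit <- lm(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
+      Y0607 + Y0809, data=combined.map@data, weights=w)
> coef(local.fit)["DCBD"]       # An approximate local estimate for DCBD at point 1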
6.4.2 Mapping the estimates

If we map the local beta estimates for the distance to CBD variable, the spatially variable effect
becomes clear,
> x <- gwr.model$SDF$DCBD
> par(mfrow=c(1,2))
# This will allow for two maps to be drawn side-by-side
> quickmap(x)

However, we might wish to ignore those that are locally insignificant at a 95% confidence level.
> quickmap(x, subset=abs(gwr.model$SDF$DCBD_TV) > 1.96)

Doing so, what we appear to find is that there are clusters of land parcels where distance from the
CBD has a greater effect on their value than is true for surrounding locations.


Figure 6.2. The local beta estimates for the distance to CBD variable estimated
using Geographically Weighted Regression. In the right-side plot the 'insignificant'
estimates are omitted.
6.4.3 Testing for significance of the GWR parameter variability

(Note: at the time of writing there is an error in the function montecarlo.gwr(...) included in
the GWmodel package version 1.2-1 that will be changed in future updates. Load the file
montecarlo.R using the source(file.choose()) function to import a corrected version.)
The function montecarlo.gwr(...) uses a randomisation approach to undertake significance testing
of the variability in the estimated local beta values (the regression parameters). The default
number of simulations is 99 (which, with the actual estimates, gives 100 sets of values in total).
That is quite a low number but can be used for illustrative purposes. In practice it would be better to
raise it to 999, 9999 or even more.
> montecarlo.gwr(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
     Y0607 + Y0809, data=combined.map, adaptive=T, dMat=distances, bw=bw)
Tests based on the Monte Carlo significance test

            p-value
(Intercept)    0.00
DCBD           0.00
DELE           0.00
DRIVER         0.00
DPARK          0.00
POPDEN         0.00
JOBDEN         0.01
Y0405          0.03
Y0607          0.00
Y0809          0.00

All of the variables appear to show significant spatial variation.
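To re-run the test with more simulations, the simulation count in the call needs to be raised. As an assumption to check against ?montecarlo.gwr in your installed version of GWmodel (the argument name is not shown above), the relevant argument is likely nsims:
> # Assumption: nsims controls the number of simulations (check ?montecarlo.gwr)
> montecarlo.gwr(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
+      Y0607 + Y0809, data=combined.map, adaptive=T, dMat=distances, bw=bw, nsims=999)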


6.5 A Simple Multilevel Model


A final modelling approach we might consider here is a multilevel one. At its simplest, a multilevel
approach allows the unexplained residual variance from a linear regression model to be partitioned
at different levels: here, the individual land price values (at the lower level) and the census districts
(at the higher, more aggregate level). We can then assess what amount of the unexplained
variance in the land prices is due to some sort of higher-level 'contextual' effect.
6.5.1 Null models

The simplest multilevel model is one that does nothing more than estimate the mean of the land
price values, uses that as the sole predictor of the land price values, and then partitions the errors in
the way described above. In other words, it is a regression model for which there is only an
intercept term (no slope) and the residual variances are estimated at the lower and higher
geographical levels.
There are a number of packages to fit multilevel models in R. We shall use...
> library(lme4)

To fit the model with an intercept-only using standard OLS estimation and no partitioning of the
residual variance we would type,
> nullLMmodel <- lm(LNPRICE ~ 1, data=combined.map)

Obtaining the log likelihood value as,



> logLik(nullLMmodel)
'log Lik.' -1617.09 (df=2)

To fit the corresponding model using a multilevel approach, where the residual variance is assumed
to be random at both the land parcel and district scales, the notation is similar but includes
parentheses identifying the random part of the model: the 1 denotes the intercept whilst the variable
SP_ID, which arises from the overlay of the point and polygonal data undertaken in Section 4.5, p. 47,
'Spatially joining data', gives a unique ID for each district (the higher level).


> nullMLmodel <- lmer(LNPRICE ~ (1 | SP_ID), data=combined.map@data)
> summary(nullMLmodel)
Linear mixed model fit by REML ['lmerMod']
Formula: LNPRICE ~ (1 | SP_ID)
   Data: combined.map@data

REML criterion at convergence: 2983.876

Random effects:
 Groups   Name        Variance Std.Dev.
 SP_ID    (Intercept) 0.3537   0.5947
 Residual             0.7222   0.8498
Number of obs: 1117, groups: SP_ID, 111

Fixed effects:
            Estimate Std. Error t value
(Intercept)  7.53186    0.06523   115.5

The key thing to note here is the proportion of the residual variance that is at the district level,
> 0.3537 / (0.3537 + 0.7222)
[1] 0.328748
which is almost one third. This is a sizeable amount and is suggestive of the spatial patterning of the land
parcel values. The log likelihood of this model is greater than for the OLS model,
> logLik(nullMLmodel)
'log Lik.' -1491.938 (df=3)
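Rather than typing the variance estimates in by hand, the same proportion can be extracted from the fitted model (a minimal sketch, assuming your version of lme4 allows the VarCorr object to be converted to a data frame in this way):
> vc <- as.data.frame(VarCorr(nullMLmodel))   # The estimated variance components
> vc$vcov[1] / sum(vc$vcov)                   # The district-level share, about 0.33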

6.5.2 The likelihood ratio test

The likelihood ratio test statistic is two times the difference in the log likelihood values for two
models, here
> 2 * as.numeric((logLik(nullMLmodel) - logLik(nullLMmodel)))
[1] 250.3039

Assessing against a chi-squared distribution with 1 degree of freedom (the difference in the degrees
of freedom for the two models, arising from the estimation of the additional, higher-level error
variance) we find 'the probability the result (i.e. the improved likelihood value) has arisen by
chance' is essentially zero:
> 1 - pchisq(250, 1)
[1] 0

6.5.3 A random intercepts model

Having established that there is an appreciable amount of variation in the land parcel prices at the
district scale, our next stage will be to refit our predictive regression model, but now within a
multilevel framework. Recall, for example, (OLS) model1,
> model1$call
lm(formula = LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN +
JOBDEN + Y0405 + Y0607 + Y0809, data = combined.map)

Its multilevel equivalent is,
> MLmodel <- lmer(LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 +
     Y0607 + Y0809 + (1 | SP_ID), data=combined.map@data)
> summary(MLmodel)
Linear mixed model fit by REML ['lmerMod']
Formula: LNPRICE ~ DCBD + DELE + DRIVER + DPARK + POPDEN + JOBDEN + Y0405 + Y0607
    + Y0809 + (1 | SP_ID)
   Data: combined.map@data

REML criterion at convergence: 2817.831

Random effects:
 Groups   Name        Variance Std.Dev.
 SP_ID    (Intercept) 0.1481   0.3848
 Residual             0.6322   0.7951
Number of obs: 1117, groups: SP_ID, 111
[...]

Even with the predictor variables now included there remains an appreciable amount of variation
between districts,
> 0.1481 / (0.1481 + 0.6322)
[1] 0.1897988

Consequently there is strong support in favour of the multilevel model over the OLS one,
> 2 * as.numeric((logLik(MLmodel) - logLik(model1)))
[1] 25.90044
> 1 - pchisq(25.9, 1)
[1] 3.595691e-07


6.5.4 Mapping the district-level residuals

There is much more we could undertake in regard to the multilevel model, including allowing the
effect of each predictor variable to vary from one district to another (a random intercepts and slopes
model). Here, however, we shall confine ourselves to mapping the district-level residuals to
identify those districts where the land parcel values are higher or lower than average.
The process of doing so begins by obtaining the residuals at the higher level of the hierarchy, using
the function ranef(...) (short for random effects),
> district.resids <- ranef(MLmodel)

The output from this function is a list, in this case of length 1 (this being the number of levels
above the land parcel level for which that variance has been estimated i.e. the district level).
> typeof(district.resids)
[1] "list"
> length(district.resids)
[1] 1

Inspecting its contents we find that it is a data frame containing the IDs of the districts and
information telling us whether the district-level effect is one of raising or decreasing the land parcel
prices:
> head(district.resids[[1]], n=3)
    (Intercept)
10   0.22619895
100  0.09476389
101 -0.22773190
> summary(district.resids[[1]][,1])
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.78800 -0.21480  0.03126  0.00000  0.21810  0.61750

Note that not every one of the census districts in Beijing will be included in this output. That is
because not every district contains a land parcel that was sold in the period of the data. We therefore
have to match those districts for which we do have data back to the original map of all districts,
which we can then use to map the residuals. First the matching,
> ids1 <- as.numeric(rownames(district.resids[[1]]))
> ids2 <- as.numeric(as.character(districts$SP_ID))
> matched <- match(ids1, ids2)
> districts2 <- districts[matched,]

Second, the mapping,
> x <- district.resids[[1]][,1]
> group <- cut(x, quantile(x, probs=seq(0,1,0.2)), include.lowest=T, labels=F)
> palette <- brewer.pal(5, "RdYlBu")
> plot(districts)
> plot(districts2, col=palette[group], add=T)
> round(quantile(x, probs=seq(0,1,0.2)), 3)
> legend("bottomright", legend=c("-0.788 to < -0.261","-0.261 to < -0.069",
+    "-0.069 to < 0.105","0.105 to < 0.250",
+    "0.250 to < 0.618"), pch=21, pt.bg=palette, cex=0.8)

Looking at the map, there appear to be clusters of districts with higher than expected land parcel
prices, some contiguous or close to districts with lower than expected prices. Not all these residual
values are necessarily significantly different from zero. Nevertheless, what we seem to have
evidence for again is the complexity of the geographical patterning.


Figure 6.3. The district-level residual estimates (white = no data)

6.6 Consolidation and Conclusion


What this session has demonstrated is a number of spatial models, all within a regression
framework, that can be applied to model land parcel prices in Beijing (but remember that the data
are simulated). Purely in terms of Akaike's Information Criterion (the AIC scores) the GWR and
spatial error models are the best fits to the data although in many respects they (and the multilevel
model) tell us only that geographical complexities exist, not actually about the processes that cause
them. Even so, precisely because they do exist, it should warn us against using standard regression
techniques that assume the residual errors are independently and identically distributed.

6.7 Getting Help


There is an excellent workbook on spatial regression analysis in R by Luc Anselin. It is available at


http://openloc.eu/cms/storage/openloc/workshops/UNITN/20110324-26/Basile/Anselin2007.pdf
The definitive text for Geographically Weighted Regression is Fotheringham et al. (2002):
Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. For more
information about the GWmodel package, see http://arxiv.org/pdf/1306.0413.pdf
An introduction to multilevel modelling in R is offered by Camille Szmaragd and George Leckie at
the Centre for Multilevel Modelling, University of Bristol:
http://www.bristol.ac.uk/cmm/learning/module-samples/5-r-sample.pdf


More to follow. Check at www.social-statistics.org for updates.
