Lecture 12: Debugging and Databases
STAT GR5206 Statistical Computing & Introduction to Data Science
Cynthia Rush
Columbia University
December 9, 2016
Course Notes
Final is next Friday, December 16, 1:10pm - 4:00pm in this room.

Homework is due on Monday.
Last Time
Split/Apply/Combine: A model for working with data.

plyr Package: Similar to the apply() family, but more consistent.
PCA and K-Means Clustering
Topics for Today
Databases: What are databases? Intro to SQL and interfacing R with SQL.
Debugging: Simple techniques for correcting buggy code.
Clustering: Clustering with the K-means algorithm.
Section I
Databases: SQL and Querying
Databases
A record is a collection of fields (records are like rows, fields are like columns).

A table is a collection of records which all have the same fields with
different values. These are like dataframes in R.
A database is a collection of tables.
Databases vs. Dataframes
R's dataframes are actually tables.

R Jargon                       Database Jargon
column                         field
row                            record
dataframe                      table
types of the columns           table schema
bunch of related dataframes    database
Databases
So, Why Do We Need Database Software?

Size
R keeps its dataframes in memory
Industrial databases can be much bigger
Work with selected subsets
Speed
Clever people have worked very hard on getting just what you want fast
Concurrency
Many users accessing the same database simultaneously
Lots of potential for trouble (two users want to change the same record
at once)
Databases

Databases live on a server, which manages them
Users interact with the server through a client program
Lets multiple users access the same database simultaneously
SQL (structured query language) is the standard for database
software
Mostly about queries, which are like doing row/column selections on
a dataframe in R
SQL
Connecting R to SQL
SQL is its own language, independent of R (similar to regular
expressions). But we're going to learn how to run SQL queries
through R.
First, install the packages DBI, RSQLite.
Also, we need a database file: download the file baseball.db and
save it in your working directory.
> library(DBI)
> library(RSQLite)
> drv <- dbDriver("SQLite")
> con <- dbConnect(drv, dbname="baseball.db")
The object con is now a persistent connection to the database
baseball.db.
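If baseball.db is not at hand, the same DBI calls can be tried on a throwaway database. The sketch below is an assumption-laden illustration, not part of the lecture: it uses an in-memory SQLite database and a made-up table called ToyBatting with invented values.

```r
library(DBI)
library(RSQLite)

# Assumption: an in-memory SQLite database stands in for baseball.db.
con.demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# A tiny hypothetical table; the values are invented for illustration.
toy.batting <- data.frame(
  playerID = c("a01", "a01", "b02"),
  yearID   = c(2001L, 2002L, 2001L),
  HR       = c(10L, 20L, 5L),
  stringsAsFactors = FALSE
)
dbWriteTable(con.demo, "ToyBatting", toy.batting)

tables   <- dbListTables(con.demo)                # tables in the database
fields   <- dbListFields(con.demo, "ToyBatting")  # fields of one table
toy.back <- dbReadTable(con.demo, "ToyBatting")   # import as a data frame

dbDisconnect(con.demo)  # release the connection when finished
```

Because the database lives only in memory, nothing is written to disk and the table disappears once the connection is closed.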
SQL
Listing What's Available

> dbListTables(con) # List tables in our database
[1] "AllstarFull" "Appearances"
[3] "AwardsManagers" "AwardsPlayers"
[5] "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost"
[9] "Fielding" "FieldingOF"
[11] "FieldingPost" "HallOfFame"
[13] "Managers" "ManagersHalf"
[15] "Master" "Pitching"
[17] "PitchingPost" "Salaries"
[19] "Schools" "SchoolsPlayers"
[21] "SeriesPost" "Teams"
[23] "TeamsFranchises" "TeamsHalf"
[25] "sqlite_sequence" "xref_stats"
SQL
Listing What's Available
> dbListFields(con, "Batting") # Fields in Batting table
[1] "playerID" "yearID" "stint" "teamID"
[5] "lgID" "G" "G_batting" "AB"
[9] "R" "H" "2B" "3B"
[13] "HR" "RBI" "SB" "CS"
[17] "BB" "SO" "IBB" "HBP"
[21] "SH" "SF" "GIDP" "G_old"
> dbListFields(con, "Pitching") # Fields in Pitching table
[1] "playerID" "yearID" "stint" "teamID" "lgID"
[6] "W" "L" "G" "GS" "CG"
[11] "SHO" "SV" "IPouts" "H" "ER"
[16] "HR" "BB" "SO" "BAOpp" "ERA"
[21] "IBB" "WP" "HBP" "BK" "BFP"
[26] "GF" "R"
SQL
Importing a Table as a Data Frame

> batting <- dbReadTable(con, "Batting")
> class(batting)
[1] "data.frame"
> dim(batting)
[1] 93955 24
Now we can perform R operations on batting, since it's a data frame.

In lecture today, we'll use this route primarily to check our work in
SQL; in general, we want to do as much in SQL as possible, since it's
more efficient and likely simpler.
Check Yourself
Tasks
Using dbReadTable(), grab the table named Salaries and save it as a
data frame called salaries. Using the salaries data frame and ddply(),
compute the payroll (total of salaries) for each team in the year 2010. Find
the 3 teams with the highest payrolls, and the team with the lowest payroll.
Check Yourself
Solutions
> library(plyr)
> salaries <- dbReadTable(con, "Salaries")
> my.sum.func <- function(team.yr.df) {
+ return(sum(team.yr.df$salary))
+ }
> payroll <- ddply(salaries, .(yearID, teamID), my.sum.func)
Check Yourself
Solutions
> payroll <- payroll[payroll$yearID == 2010, ]
> payroll <- payroll[order(payroll$V1, decreasing = TRUE), ]
> payroll[1:3, ]
yearID teamID V1
733 2010 NYA 206333389
719 2010 BOS 162447333
721 2010 CHN 146609000
> payroll[nrow(payroll), ]
yearID teamID V1
737 2010 PIT 34943000
SQL
SELECT
Main tool in the SQL language: SELECT, which allows you to perform
queries on a particular table in a database. It has the form:
SELECT columns or computations

FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count;
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional
Examples
Pick out five columns from the table Batting, and look at the first 10 rows:
> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "LIMIT 10"))
playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aaronha01 1954 468 131 13
7 aaronha01 1955 602 189 27
8 aaronha01 1956 609 200 26
9 aaronha01 1957 615 198 44
10 aaronha01 1958 601 196 30
Examples
This is our first successful SQL query (congrats!)

We can replicate the command on the imported data frame:
> batting[1:10, c("playerID", "yearID", "AB", "H", "HR")]
playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aaronha01 1954 468 131 13
7 aaronha01 1955 602 189 27
8 aaronha01 1956 609 200 26
9 aaronha01 1957 615 198 44
10 aaronha01 1958 601 196 30
Examples
To reiterate: the previous call was simply to check our work, and we
wouldn't actually want to do this on a large database, since it'd be much
less efficient to first read the table into an R data frame and then call R
commands.
SQL
ORDER BY
We can use the ORDER BY option in SELECT to specify an ordering for the
rows.
Default is ascending order; add DESC for descending.
> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "ORDER BY HR DESC",
+ "LIMIT 5"))
playerID yearID AB H HR
1 bondsba01 2001 476 156 73
2 mcgwima01 1998 509 152 70
3 sosasa01 1998 643 198 66
4 mcgwima01 1999 521 145 65
5 sosasa01 2001 577 189 64
Check Yourself
Tasks
Run the following queries and determine what they're doing. Write R code to do
the same thing on the batting data frame.
> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "WHERE yearID >= 1990
+ AND yearID <= 2000",
+ "ORDER BY HR DESC",
+ "LIMIT 5"))
> dbGetQuery(con, paste("SELECT playerID, yearID, MAX(HR)",
+ "FROM Batting"))
Check Yourself
Solutions
> bat.ord <- batting[order(batting$HR, decreasing = TRUE), ]
> subset <- bat.ord$yearID >= 1990 & bat.ord$yearID <= 2000
> columns <- c("playerID", "yearID", "AB", "H", "HR")
> head(bat.ord[subset, columns], 5)
playerID yearID AB H HR
54613 mcgwima01 1998 509 152 70
78578 sosasa01 1998 643 198 66
54614 mcgwima01 1999 521 145 65
78579 sosasa01 1999 625 180 63
31877 griffke02 1997 608 185 56
> batting[which.max(batting$HR), c("playerID","yearID","HR")]
playerID yearID HR
7514 bondsba01 2001 73
Section II
Databases: SQL Computations

R Jargon                                Database Jargon
column                                  field
row                                     record
dataframe                               table
collection of related dataframes        database
conditional indexing                    SELECT, FROM, WHERE, HAVING
d*ply()                                 GROUP BY
order()                                 ORDER BY
SQL
SELECT
Main tool in the SQL language: SELECT, which allows you to perform
queries on a particular table in a database. It has the form:
SELECT columns or computations

FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count;
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional.
Importantly, in the first line of SELECT we can directly specify
computations that we want performed.
Examples
To calculate the average number of homeruns, and average number of hits:

> dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
+ "FROM Batting"))
AVG(HR) AVG(H)
1 2.970549 40.67684
We can replicate this simple command on an imported data frame:

> mean(batting$HR, na.rm = TRUE)
[1] 2.970549
> mean(batting$H, na.rm = TRUE)
[1] 40.67684
GROUP BY
We can use the GROUP BY option in SELECT to define aggregation groups

> dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
+ "FROM Batting",
+ "GROUP BY playerID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
playerID AVG(HR)
1 pujolal01 40.80000
2 howarry01 36.14286
3 rodrial01 36.05882
4 bondsba01 34.63636
5 mcgwima01 34.29412
Note: the order of commands here matters; try switching the order of
GROUP BY and ORDER BY above, and you'll get an error.
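To see the correspondence between GROUP BY and a grouped computation in R on something small enough to check by hand, here is a minimal sketch. It assumes an in-memory SQLite database and a made-up table ToyBatting (both hypothetical, not the lecture's baseball.db), and it uses split() with vapply() in place of plyr.

```r
library(DBI)
library(RSQLite)

# Assumption: a toy in-memory database with invented values.
con.demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "c03"),
  HR       = c(10, 20, 5, 7, 40),
  stringsAsFactors = FALSE
)
dbWriteTable(con.demo, "ToyBatting", toy.batting)

# SQL: average HR per player, largest average first.
sql.avg <- dbGetQuery(con.demo, paste("SELECT playerID, AVG(HR) AS avgHR",
                                      "FROM ToyBatting",
                                      "GROUP BY playerID",
                                      "ORDER BY avgHR DESC"))

# R: split HR by player, average each piece, then sort.
r.means <- vapply(split(toy.batting$HR, toy.batting$playerID), mean, numeric(1))
r.avg   <- sort(r.means, decreasing = TRUE)

dbDisconnect(con.demo)
```

Both routes give the per-player averages 40, 15, and 6 for c03, a01, and b02; the SQL version does the grouping inside the database, so only the summary rows reach R.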
WHERE
We can use the WHERE option in SELECT to specify a subset of the rows to
use (pre-aggregation/pre-calculation)
> dbGetQuery(con, paste("SELECT yearID, AVG(HR)",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY yearID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
yearID AVG(HR)
1 1996 5.073620
2 1999 4.692699
3 2000 4.525437
4 2004 4.490115
5 2001 4.412288
Check Yourself
Tasks
Run the following query and determine what it's doing. Write R code
to do the same thing on the batting data frame. Hint: use daply().
> dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY teamID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
Check Yourself
Solutions
> bat.sub <- batting[batting$yearID >= 1990, ]
> my.mean.func <- function(team.df) {
+ return(mean(team.df$HR, na.rm = TRUE))
+ }
> avg.hrs <- daply(bat.sub, .(teamID), my.mean.func)
> avg.hrs <- sort(avg.hrs, decreasing = TRUE)
> head(avg.hrs, 5)
CHA NYA TOR CAL TEX
6.164251 5.986486 5.760937 5.625731 5.563961
AS
We can use AS in the first line of SELECT to rename computed columns

> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+ "FROM Batting",
+ "GROUP BY yearID",
+ "ORDER BY avgHR DESC",
+ "LIMIT 5"))
yearID avgHR
1 1987 5.300832
2 1996 5.073620
3 1986 4.730769
4 1999 4.692699
5 1977 4.601010
HAVING
We can use the HAVING option in SELECT to specify a subset of the rows
to display (post-aggregation/post-calculation)
> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY yearID",
+ "HAVING avgHR >= 4.5",
+ "ORDER BY avgHR DESC"))
yearID avgHR
1 1996 5.073620
2 1999 4.692699
3 2000 4.525437
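The difference between WHERE (filter rows before aggregation) and HAVING (filter groups after aggregation) is easy to check on a toy table. The following is a minimal sketch under stated assumptions: an in-memory SQLite database and a made-up table ToyBatting with invented values, neither of which comes from the lecture.

```r
library(DBI)
library(RSQLite)

# Assumption: a toy in-memory database with invented values.
con.demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy <- data.frame(yearID = c(1989, 1989, 1990, 1990, 1991, 1991),
                  HR     = c(9, 9, 2, 4, 6, 8))
dbWriteTable(con.demo, "ToyBatting", toy)

# WHERE drops the 1989 rows before averaging;
# HAVING then drops any year whose average is below 5.
res <- dbGetQuery(con.demo, paste("SELECT yearID, AVG(HR) AS avgHR",
                                  "FROM ToyBatting",
                                  "WHERE yearID >= 1990",
                                  "GROUP BY yearID",
                                  "HAVING avgHR >= 5",
                                  "ORDER BY avgHR DESC"))

# The same two-stage filter in R.
toy.sub <- toy[toy$yearID >= 1990, ]
yr.avg  <- vapply(split(toy.sub$HR, toy.sub$yearID), mean, numeric(1))
r.res   <- yr.avg[yr.avg >= 5]

dbDisconnect(con.demo)
```

Here 1989 has the highest average (9) but never reaches the HAVING step, 1990 averages 3 and is removed by HAVING, and only 1991 (average 7) survives.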
Check Yourself
Tasks
Recompute the payroll for each team in 2010, but now with
dbGetQuery() and an appropriate SQL query. In particular, the output of
dbGetQuery() should be a data frame with two columns, the first giving
the team names, and the second the payrolls, just like your output from
ddply() before. (Hint: your SQL query here will have to use GROUP BY.)
Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT teamID, SUM(salary) as SUMsal",
+ "FROM Salaries",
+ "WHERE yearID = 2010",
+ "GROUP BY teamID",
+ "ORDER BY SUMsal DESC",
+ "LIMIT 3"))
teamID SUMsal
1 NYA 206333389
2 BOS 162447333
3 CHN 146609000
Section III
Databases: Join

R Jargon                                Database Jargon
column                                  field
row                                     record
dataframe                               table
collection of related dataframes        database
conditional indexing                    SELECT, FROM, WHERE, HAVING
d*ply()                                 GROUP BY
order()                                 ORDER BY
merge()                                 INNER JOIN or just JOIN
JOIN
Sometimes we need to combine information from many tables.
patient_last  patient_first  physician_id  complaint
Morgan        Dexter         37010         insomnia
Soprano       Anthony        79676         malaise
Swearengen    Albert         NA            healthy
Garrett       Alma           90091         nerves
Holmes        Sherlock       43675         addiction

physician_last  physician_first  physicianID  plan
Meridian        Emmett           37010        UPMC
Melfi           Jennifer         79676        BCBS
Cochran         Amos             90091        UPMC
Watson          John             43675        VA
JOIN
Suppose we want to know which doctors are treating patients for insomnia.
Complaints are in one table and physicians in another.
In R, we use merge() to link the tables by physicianID.
Here physicianID or physician_id is acting as the key or the
identifier.
JOIN
In all we've seen so far with SELECT, the FROM line has just specified one
table. But sometimes we need to combine information from many tables.
Use the JOIN option for this
There are 4 options for JOIN:
1. INNER JOIN or just JOIN: retain just the rows in each table that match
the condition.
2. LEFT OUTER JOIN or just LEFT JOIN: retain all rows in the first
table, and just the rows in the second table that match the condition.
3. RIGHT OUTER JOIN or just RIGHT JOIN: retain just the rows in the
first table that match the condition, and all rows in the second table.
4. FULL OUTER JOIN or just FULL JOIN: retain all rows in both tables
Fields that cannot be filled in are assigned NA values
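The four join types map onto the all.x / all.y / all arguments of merge() in R. As a minimal sketch, the patient and physician tables from the earlier slide are re-entered by hand below (the data frame names patients and physicians are assumptions for illustration).

```r
# The two example tables, typed in as data frames.
patients <- data.frame(
  patient_last = c("Morgan", "Soprano", "Swearengen", "Garrett", "Holmes"),
  physician_id = c(37010, 79676, NA, 90091, 43675),
  complaint    = c("insomnia", "malaise", "healthy", "nerves", "addiction"),
  stringsAsFactors = FALSE
)
physicians <- data.frame(
  physician_last = c("Meridian", "Melfi", "Cochran", "Watson"),
  physicianID    = c(37010, 79676, 90091, 43675),
  plan           = c("UPMC", "BCBS", "UPMC", "VA"),
  stringsAsFactors = FALSE
)

# INNER JOIN: only rows whose key appears in both tables.
inner <- merge(patients, physicians,
               by.x = "physician_id", by.y = "physicianID")
# LEFT JOIN: every patient, with NA where no physician matches.
left  <- merge(patients, physicians,
               by.x = "physician_id", by.y = "physicianID", all.x = TRUE)
# RIGHT JOIN: every physician, with NA where no patient matches.
right <- merge(patients, physicians,
               by.x = "physician_id", by.y = "physicianID", all.y = TRUE)
# FULL JOIN: every row from both tables.
full  <- merge(patients, physicians,
               by.x = "physician_id", by.y = "physicianID", all = TRUE)

# Which doctor is treating a patient for insomnia?
insomnia.doc <- inner$physician_last[inner$complaint == "insomnia"]
```

The inner join has 4 rows because Swearengen has no physician; the left and full joins keep him with NA in the physician columns. In this particular data every physician has a patient, so the right join happens to coincide with the inner join.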
[Figures: diagrams illustrating INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.]
Examples
Suppose we want to find the average salaries of the players with the top 10
highest homerun averages. We need to combine the two tables.
> dbGetQuery(con, paste("SELECT *",
+ "FROM Salaries",
+ "ORDER BY playerID",
+ "LIMIT 8"))
yearID teamID lgID playerID salary
1 2004 SFN NL aardsda01 300000
2 2007 CHA AL aardsda01 387500
3 2008 BOS AL aardsda01 403250
4 2009 SEA AL aardsda01 419000
5 2010 SEA AL aardsda01 2750000
6 1986 BAL AL aasedo01 600000
7 1987 BAL AL aasedo01 625000
8 1988 BAL AL aasedo01 675000
Examples
> query <- paste("SELECT yearID, teamID,
+ lgID, playerID, HR",
+ "FROM Batting",
+ "LIMIT 7")
> dbGetQuery(con, query)
yearID teamID lgID playerID HR
1 2004 SFN NL aardsda01 0
2 2006 CHN NL aardsda01 0
3 2007 CHA AL aardsda01 0
4 2008 BOS AL aardsda01 0
7 1954 ML1 NL aaronha01 13
Examples
We can use a JOIN on the pair: yearID, playerID.
> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+ "FROM Batting JOIN Salaries
+ USING(yearID, playerID)",
+ "LIMIT 7"))
yearID playerID salary HR

1 2004 aardsda01 300000 0
2 2007 aardsda01 387500 0
3 2008 aardsda01 403250 0
4 2009 aardsda01 419000 0
5 2010 aardsda01 2750000 0
6 1986 aasedo01 600000 NA
7 1987 aasedo01 625000 NA
Note that here we're missing one of David Aardsma's records from the Batting
table (i.e., the JOIN discarded 1 record).
Examples
We can replicate this using merge() on imported data frames:

> merged <- merge(x = batting, y = salaries,
+ by.x = c("yearID","playerID"),
+ by.y = c("yearID","playerID"))
> merged[order(merged$playerID)[1:8],
+ c("yearID", "playerID", "salary", "HR")]
16708 2004 aardsda01 300000 0
19378 2007 aardsda01 387500 0
20277 2008 aardsda01 403250 0
21164 2009 aardsda01 419000 0
21990 2010 aardsda01 2750000 0
585 1986 aasedo01 600000 NA
1360 1987 aasedo01 625000 NA
2033 1988 aasedo01 675000 NA
Examples
For demonstration purposes, we can use a LEFT JOIN on the pair: yearID,
playerID:
> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+ "FROM Batting LEFT JOIN Salaries
+ USING(yearID, playerID)",
+ "LIMIT 7"))
yearID playerID salary HR
1 2004 aardsda01 300000 0
2 2006 aardsda01 NA 0
3 2007 aardsda01 387500 0
4 2008 aardsda01 403250 0
5 2009 aardsda01 419000 0
6 2010 aardsda01 2750000 0
7 1954 aaronha01 NA 13
Examples
Now we can see that we have all 6 of David Aardsma's original
records from the Batting table (i.e., the LEFT JOIN used them all,
and just filled in an NA value when it was missing his salary).
At the time of this lecture, RIGHT JOIN and FULL JOIN were not implemented
in SQLite, the engine behind the RSQLite package (SQLite added them in
version 3.39.0).
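Where RIGHT JOIN is unavailable, the same result can be obtained by swapping the table order and using LEFT JOIN. The sketch below illustrates this under stated assumptions: an in-memory SQLite database and two made-up tables, ToyBatting and ToySalaries, with invented values.

```r
library(DBI)
library(RSQLite)

# Assumption: toy in-memory tables with invented values.
con.demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(playerID = c("a01", "a01", "b02"),
                          yearID   = c(2001, 2002, 2001),
                          HR       = c(10, 20, 5),
                          stringsAsFactors = FALSE)
toy.salaries <- data.frame(playerID = c("a01", "c03"),
                           yearID   = c(2001, 2001),
                           salary   = c(300000, 500000),
                           stringsAsFactors = FALSE)
dbWriteTable(con.demo, "ToyBatting", toy.batting)
dbWriteTable(con.demo, "ToySalaries", toy.salaries)

# Equivalent of "ToyBatting RIGHT JOIN ToySalaries":
# put ToySalaries first so that every salary row is retained.
right.like <- dbGetQuery(con.demo,
  paste("SELECT yearID, playerID, salary, HR",
        "FROM ToySalaries LEFT JOIN ToyBatting",
        "USING(yearID, playerID)",
        "ORDER BY playerID"))

dbDisconnect(con.demo)
```

Player c03 has a salary record but no batting record, so the row is kept with HR set to NA; the batting-only rows (a01 in 2002, b02 in 2001) are dropped.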
Examples
Now, as to our original question (average salaries of the players with the top 10
highest homerun averages):
> dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
+ "FROM Batting JOIN Salaries
+ USING(yearID, playerID)",
+ "GROUP BY playerID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 10"))
playerID AVG(HR) AVG(salary)

1 howarry01 45.80000 9051000.0
2 pujolal01 40.80000 8953204.1
3 fieldpr01 38.00000 3882900.0
4 rodrial01 36.05882 15553897.2
5 reynoma01 34.66667 550777.7
6 bondsba01 34.63636 8556605.5
7 mcgwima01 34.29412 4814020.8
8 gonzaca01 34.00000 406000.0
9 dunnad01 33.50000 6969500.0
Check Yourself
Tasks
Using the Fielding table, list the 10 worst (highest) numbers of errors
(E) committed by a player in one season, only considering years 1990
and later. In addition to the number of errors, list the year and player
ID for each record.
By appropriately merging the Fielding and Salaries tables, list the
salaries for each record that you extracted in the last question.
Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT yearID, playerID, E",
+ "FROM Fielding",
+ "WHERE yearID >= 1990",
+ "ORDER BY E DESC",
+ "LIMIT 10"))
yearID playerID E
1 1992 offerjo01 42
2 1993 offerjo01 37
3 1996 valenjo03 37
4 2000 valenjo03 36
5 1998 carusmi01 35
6 1995 offerjo01 35
7 2008 reynoma01 34
8 2010 desmoia01 34
9 1993 cordewi01 33
10 2000 glaustr01 33
Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT yearID, playerID, E, salary",
+ "FROM Fielding LEFT JOIN Salaries
+ USING(yearID, playerID)",
+ "WHERE yearID >= 1990",
+ "ORDER BY E DESC",
+ "LIMIT 10"))
yearID playerID E salary
1 1992 offerjo01 42 135000
2 1993 offerjo01 37 300000
3 1996 valenjo03 37 300000
4 2000 valenjo03 36 1320000
5 1998 carusmi01 35 170000
6 1995 offerjo01 35 1600000
7 2008 reynoma01 34 396500
8 2010 desmoia01 34 400000
9 1993 cordewi01 33 126500
10 2000 glaustr01 33 275000
Section IV
Debugging
Debugging
Bug is the original name for glitches and unexpected defects in code:
dates back to at least Edison in 1876.
Debugging is the process of locating, understanding, and removing
bugs from your code.
Why should we care to learn about this?
The truth: you're going to have to debug, because you're not perfect
(none of us are!) and so you can't write perfect code.
Debugging is frustrating and time-consuming, but essential.
Writing code that makes it easier to debug later is worth it, even if it
takes a bit more time (lots of our design ideas support this).
Simple things you can do to help: use lots of comments, use
meaningful variable names!
Debugging
How?
Debugging is (largely) a process of differential diagnosis. Stages of
debugging:
1. Reproduce the error: can you make the bug reappear?
2. Characterize the error: what can you see that is going wrong?
3. Localize the error: where in the code does the mistake originate?
4. Modify the code: did you eliminate the error? Did you add new ones?
Debugging
Reproduce the bug

Step 0: make it happen again
Can we produce it repeatedly when re-running the same code, with
the same input values?
And if we run the same code in a clean copy of R, does the same
thing happen?
Characterize the bug

Step 1: figure out if it's a pervasive/big problem
How much can we change the inputs and get the same error?
Or is it a different error?
And how big is the error?
Debugging
Localize the bug

Step 2: find out exactly where things are going wrong
This is most often the hardest part!
Today, we'll learn how to understand errors, using print().
There are many more sophisticated debugging tools, like traceback()
or the R tool browser(), which lets you debug interactively.
Unfortunately, we don't have time for these.
Localizing the Bug
Sometimes error messages are easier to decode, sometimes they're harder;
this can make locating the bug easier or harder.
> my.plotter = function(x, y, my.list=NULL) {
+ if (!is.null(my.list))
+ plot(my.list, main="A plot from my.list!")
+ else
+ plot(x, y, main="A plot from x, y!")
+ }
> my.plotter(x=1:8, y=1:8)
> my.plotter(my.list=list(x=-10:10, y=(-10:10)^3))
Localizing the Bug
> my.plotter() # Easy to understand error message
Error in plot(x, y, main = "A plot from x, y!") :

argument "x" is missing, with no default
Localizing the Bug

> my.plotter(my.list=list(x=-10:10, Y=(-10:10)^3))
Error in xy.coords(x, y, xlabel, ylabel, log) :

'x' is a list, but does not have components 'x' and 'y'
Localizing the Bug


Who called xy.coords()? (Not us, at least not explicitly!) And why is it
saying x is a list? (We never set it to be so!)
Localizing the Bug
Let's modify the function by calling print() at various points, to print
out the state of variables, to help localize the error.
> my.plotter = function(x, y, my.list=NULL) {
+ if (!is.null(my.list)) {
+ print("Here is my.list:")
+ print(my.list)
+ print("Now about to plot my.list")
+ plot(my.list, main="A plot from my.list!")
+ }
+ else {
+ print("Here is x:"); print(x)
+ print("Here is y:"); print(y)
+ print("Now about to plot x, y")
+ plot(x, y, main="A plot from x, y!")
+ }
+ }
Localizing the Bug
> my.plotter(my.list=list(x=-10:10, Y=(-10:10)^3))
[1] "Here is my.list:"
$x
[1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
$Y
[1] -1000 -729 -512 -343 -216 -125 -64 -27 -8
[20] 729 1000
[1] "Now about to plot my.list"

Localizing the Bug
> my.plotter(x="hi", y="there")

[1] "Here is x:"
[1] "hi"
[1] "Here is y:"
[1] "there"
[1] "Now about to plot x, y"
Error in plot.window(...) : need finite 'xlim' values In addi

1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by
2: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf
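One way to make failures like this easier to localize is to check the inputs at the top of the function, so the error message names the real problem. The sketch below is an assumption for illustration, not the lecture's my.plotter(): the function name my.checked.plotter() and its error message are hypothetical.

```r
# Hypothetical variant of the plotting function that validates its inputs
# before handing them to plot().
my.checked.plotter <- function(x, y) {
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("x and y must both be numeric; got ",
         class(x)[1], " and ", class(y)[1])
  }
  plot(x, y, main = "A plot from x, y!")
}

# Capture the error message instead of halting, so it can be inspected.
result <- tryCatch(my.checked.plotter(x = "hi", y = "there"),
                   error = function(e) conditionMessage(e))
print(result)
```

With the check in place, the bad call stops before plot() is reached, and the message points at the arguments rather than at plot.window() or xy.coords().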
Check Yourself
Tasks
Below is a random.walk() function like the one you wrote in homework.
Unfortunately, this one has some bugs: find them and fix them! If you forget the
random walk algorithm, it's written on the next slide.
> random.walk = function(x.start = 5, seed = NULL) {
+ if (!is.null(seed)) set.seed(seed)
+ x.vals <- x.start
+ while (TRUE) {
+ r <- runif(1, -2, 1)
+ if (tail(x.vals + r, 1) <= 0) break
+ else x.vals <- c(x.vals, x.vals + r)
+ }
+ return(x.vals = x.vals, num.steps = length(x.vals))
+ }
>
> # random.walk(x.start=5, seed=3)$num.steps # Should print 8
> # random.walk(x.start=10, seed=7)$num.steps # Should print 14
Check Yourself
The Random Walk Procedure

1. Start with an initial value for x.
2. Draw a random number r uniformly between -2 and 1.
3. Replace x with x + r
4. Stop if x <= 0
5. Else repeat.
Check Yourself
Solutions
> random.walk = function(x.start=5, seed=NULL) {
+ if (!is.null(seed)) set.seed(seed)
+ x.vals <- x.start
+ while (TRUE) {
+ r <- runif(1, -2, 1)
+ if (tail(x.vals + r, 1) <= 0) break
+ else x.vals <- c(x.vals, x.vals[length(x.vals)] + r)
+ print(x.vals)
+ }
+ ret.val <- list(x.vals = x.vals, num.steps = length(x.vals))
+ print(ret.val)
+ return(ret.val)
+ }
Check Yourself
Solutions
> random.walk(x.start = 5, seed = 3)$num.steps
[1] 5.000000 3.504125

[1] 5.000000 3.504125 3.926674
[1] 5.000000 3.504125 3.926674 3.081501
[1] 5.000000 3.504125 3.926674 3.081501 2.064704
[1] 5.000000 3.504125 3.926674 3.081501 2.064704 1.871006
[1] 5.000000 3.504125 3.926674 3.081501 2.064704 1.871006
[7] 1.684188
[1] 5.0000000 3.5041246 3.9266738 3.0815008 2.0647038
[6] 1.8710058 1.6841880 0.0580883
$x.vals
[1] 5.0000000 3.5041246 3.9266738 3.0815008 2.0647038
[6] 1.8710058 1.6841880 0.0580883
$num.steps
[1] 8
Check Yourself
Solutions
> random.walk(x.start = 10, seed = 7)$num.steps
[1] 10.00000 10.96673

[1] 10.00000 10.96673 10.15996
[1] 10.000000 10.966728 10.159964 8.507058
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770 5.759958
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770 5.759958 4.257524
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770 5.759958 4.257524
Section V
K-means Clustering
Clustering
Clustering refers to a broad set of techniques for finding subgroups,
or clusters, in a dataset.
The idea is to partition the observations into distinct groups so that
observations within each group are similar, while observations in
different groups are different from each other.
Similarities to PCA
Both clustering and PCA seek to simplify the data via a small number of
summaries:
PCA looks to find a low-dimensional representation of the
observations that explains most of the variance.
Clustering looks to find homogeneous subgroups among the
observations.
K-Means Clustering
K-Means Clustering
Simple approach for partitioning a data set into K distinct,
non-overlapping clusters. Note K is pre-specified.
How Do We Do It?
First we specify the number of desired clusters K.
The K-means algorithm then assigns each observation to exactly one
of the K clusters.
The algorithm boils down to a simple and intuitive mathematical
problem.
Example of K-means
A simulated dataset with 150 observations in two-dimensional space. We
see the results of the K-means algorithm using different values of K.
[Figure: three panels showing the clusterings obtained with K=2, K=3, and K=4.]
Some figures taken from An Introduction to Statistical Learning (Springer, 2013)
with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
K-means Clustering
Notation
Let $C_1, C_2, \ldots, C_K$ denote sets containing the indices of the observations
in each cluster. These sets satisfy the following properties:
1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. (Each observation belongs to at
least one of the K clusters.)
2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. (The clusters are non-overlapping.)
K-means Clustering
Main Idea
The idea behind K-means clustering is that a good clustering is one for
which the within-cluster variation is as small as possible.
Within-Cluster Variation
For cluster $C_k$, denote the within-cluster variation by $W(C_k)$.
$W(C_k)$ measures the amount by which the observations within a
cluster differ from each other.
The algorithm is then an optimization problem:
\[ \min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} W(C_k) \right\}. \]
K-means Clustering
Optimization Task
To solve the optimization problem, we need to define $W(C_k)$.
The most common choice is to use squared Euclidean distance:
\[ W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2, \]
where $|C_k|$ denotes the number of observations in the kth cluster.

In words, this is the sum of the pairwise squared Euclidean distances
between observations in the kth cluster, divided by the total number
of observations in the kth cluster.
Consequently, we want to solve the optimization problem
\[ \min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}. \]
K-means Clustering
How do we Solve this Problem?

Want to come up with an algorithm that partitions the data such that
the following is minimized:
\[ \min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}. \]
This is a hard problem: there are around $K^n$ ways to partition n
observations into K clusters!
The K-means clustering algorithm is very simple and can be shown to
provide a pretty good solution.
K-means Clustering
K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to K, to each observation. This
serves as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
a. For each of the K clusters, compute the cluster centroid. The kth
centroid is the vector of the p feature means (covariate means) for the
observations in the kth cluster.
b. Assign each observation to the cluster whose centroid is closest (where
closest is defined using Euclidean distance).
This algorithm finds a local minimum, not necessarily a global one.

We must decide on how many clusters K exist in the data beforehand.
K-means Clustering
[Figure: six panels illustrating the algorithm: Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results.]
K-means Clustering
Why does the Algorithm Work?

Notice the following identity:
\[ \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \]
where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean of feature j in cluster $C_k$.
In step 2a, the cluster means are the constants that minimize the sum of
squared deviations on the right-hand side for the current assignments.
Reallocating the observations (step 2b) can only improve the above,
so the value of the objective function in the optimization problem
never increases.
As the algorithm runs, the clustering obtained will continually improve
until the result no longer changes.
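The identity can be checked numerically on a small made-up cluster. In this minimal sketch the matrix X (6 observations, p = 2 features, drawn at random) is a hypothetical stand-in for the observations in one cluster $C_k$.

```r
set.seed(1)
# Hypothetical cluster: 6 observations with p = 2 features.
X   <- matrix(rnorm(12), nrow = 6, ncol = 2)
n.k <- nrow(X)  # |C_k|

# Left side: squared Euclidean distances summed over all ordered pairs
# (i, i'), divided by |C_k|. The full distance matrix counts each
# unordered pair twice and has zeros on the diagonal.
pair.d2 <- as.matrix(stats::dist(X))^2
lhs <- sum(pair.d2) / n.k

# Right side: twice the sum of squared deviations from the centroid.
centroid <- colMeans(X)
rhs <- 2 * sum(sweep(X, 2, centroid)^2)

print(c(lhs = lhs, rhs = rhs))
```

The two printed values agree up to floating-point error, which is why minimizing pairwise distances within clusters is the same as minimizing distances to the cluster centroids.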
K-means Clustering
The Initial Cluster Matters

Recall, the algorithm finds a local rather than a global optimum.
Therefore the results will depend on the initial (random) cluster
assignment of each observation in step 1.
It's important to run the algorithm multiple times from different
random initial clusterings. Then select the best solution as
determined by the objective function.
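A minimal sketch of this advice, using R's built-in kmeans() on the petal measurements of the built-in iris data (the choice of data, K = 3, and the seeds 1 to 10 are assumptions for illustration). Note that kmeans() starts from randomly chosen centers rather than random assignments, but the dependence on the starting point is the same.

```r
# Run K-means from ten different single random starts and record
# the objective (total within-cluster sum of squares) each time.
single.start <- sapply(1:10, function(s) {
  set.seed(s)
  kmeans(iris[, 3:4], centers = 3, nstart = 1)$tot.withinss
})

# Pick the best of the single-start runs by hand ...
best.single <- min(single.start)

# ... or let kmeans() try 20 starts itself and keep the best one.
set.seed(1)
multi.start <- kmeans(iris[, 3:4], centers = 3, nstart = 20)$tot.withinss

print(single.start)
print(c(best.single = best.single, multi.start = multi.start))
```

If the single-start values differ from one another, the runs have ended in different local optima; the smallest value is the solution to keep.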
K-means Clustering
K-means performed six times with different starting assignments; the values
of the objective obtained were 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9.
[Figure 10.7 from An Introduction to Statistical Learning: K-means clustering
performed six times on the data from Figure 10.5 with K = 3, each time with a
different random assignment of the observations in Step 1 of the K-means
algorithm. Above each plot is the value of the objective.]
Example of K-means
Recall the iris dataset:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Example of K-means
> library(ggplot2)
> ggplot(data = iris) +
+ geom_point(aes(Petal.Length, Petal.Width,
+ color = Species))
[Figure: scatter plot of Petal.Width against Petal.Length, colored by Species (setosa, versicolor, virginica).]
Example of K-means
Run the K-means algorithm with K = 2.
> km.out <- kmeans(iris[, 3:4], centers = 2, nstart = 20)

> km.out$centers
Petal.Length Petal.Width
1 4.925253 1.6818182
2 1.492157 0.2627451
> km.out$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[82] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
[109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[136] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Example of K-means
> table(km.out$cluster, iris$Species)

    setosa versicolor virginica
  1      0         49        50
  2     50          1         0
> iris$cluster <- as.factor(km.out$cluster)
> ggplot(data = iris) +
+   geom_point(mapping = aes(Petal.Length, Petal.Width,
+                            color = cluster))
Example of K-means

> table(km.out$cluster, iris$Species)
    setosa versicolor virginica
  1     50          0         0
  2      0          2        46
  3      0         48         4
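The table above has three clusters, but the call that produced it does not appear on the slide. A call along the following lines would give a three-cluster fit; this is a sketch that assumes the same two petal columns and the same nstart as the K = 2 run. The name km3 and the seed are hypothetical, and since cluster labels are arbitrary the rows may come out in a different order.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
# Assumed call: same columns and nstart as the K = 2 example, with centers = 3
km3 <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Cross-tabulate the cluster labels against the known species
table(km3$cluster, iris$Species)
```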
Example of K-means
Initial Cluster Matters

To run the kmeans() function in R with multiple initial cluster
assignments, we use the nstart argument.
If a value of nstart greater than one is used, then K-means
clustering will be performed using multiple random assignments in
Step 1 of the K-means algorithm.
km.out$tot.withinss is the total within-cluster sum of squares.
This is what we seek to minimize.
Example of K-means
> set.seed(3)
> km.out$tot.withinss
[1] 31.41289
> km.out$tot.withinss
[1] 31.37136
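To make concrete what tot.withinss measures, the sketch below recomputes it by hand from the fitted centers and cluster labels and compares a single-start fit with a 20-start fit. The names km.single, km.multi, and by.hand are hypothetical, and the printed values need not match the ones shown above, since the calls behind those values are not visible here.

```r
set.seed(3)
x <- as.matrix(iris[, 3:4])

# One random start versus the best of twenty random starts
km.single <- kmeans(x, centers = 3, nstart = 1)
km.multi  <- kmeans(x, centers = 3, nstart = 20)
c(single = km.single$tot.withinss, multi = km.multi$tot.withinss)

# tot.withinss by hand: squared distance from every point to the center
# of the cluster it was assigned to, summed over all points
by.hand <- sum((x - km.multi$centers[km.multi$cluster, ])^2)
all.equal(by.hand, km.multi$tot.withinss)
```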
Example of K-means
Let's actually code the algorithm ourselves!

> data <- iris[, 3:4]
> # First we randomly assign clusters
> clusters <- sample(1:3, nrow(data), replace = TRUE)
> # Next we calculate the centers
> centers <- apply(data, 2, tapply, clusters, mean)
> centers
  Petal.Length Petal.Width
1     3.517647    1.078431
2     4.078261    1.315217
3     3.711321    1.215094
Example of K-means
Finally we need to calculate the new cluster assignments

> # Squared Euclidean distance (note: this masks stats::dist)
> dist <- function(p1, p2) {
+   return(sum((p1 - p2)^2))
+ }
> point.assign <- function(point, centers) {
+   # Input: one point
+   # Output: which cluster center is closest
+   return(which.min(c(dist(point, centers[1, ]),
+                      dist(point, centers[2, ]),
+                      dist(point, centers[3, ]))))
+ }
> new.clusters <- function(points, centers) {
+   # Input: points and centers
+   # Output: new cluster assignment
+   return(apply(points, 1, point.assign, centers))
+ }
Example of K-means
> new.clus <- new.clusters(points = data, centers = centers)
Step 1 is complete; now we iterate!

> while(any(new.clus != clusters)) {
+   clusters <- new.clus
+   centers <- apply(data, 2, tapply, clusters, mean)
+   new.clus <- new.clusters(points = data, centers = centers)
+ }
> table(new.clus, iris$Species)
new.clus setosa versicolor virginica
       1     50          0         0
       2      0          2        46
       3      0         48         4
Check Yourself
Tasks
Use the previous code to write a function K.means which takes as input
two arguments data and K and returns the final clustering the algorithm
finds. Note that in the previous work we assumed K = 3, so generalize
your function in terms of K.
Check Yourself
Solution
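> K.means <- function(data, K) {
+   # Step 1: random initial assignment to one of K clusters
+   clusters <- sample(1:K, nrow(data), replace = TRUE)
+   centers <- apply(data, 2, tapply, clusters, mean)
+   # new.clusters() relies on point.assign(), which must compare
+   # against all K rows of centers for general K
+   new.clus <- new.clusters(points = data,
+                            centers = centers)
+   # Step 2: iterate until the assignments stop changing
+   while(any(new.clus != clusters)) {
+     clusters <- new.clus
+     centers <- apply(data, 2, tapply, clusters, mean)
+     new.clus <- new.clusters(points = data,
+                              centers = centers)
+   }
+   return(list(clusters = new.clus, centers = centers))
+ }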
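One caveat when generalizing to K clusters: point.assign() as written earlier checks exactly three centers. A possible replacement that handles any number of rows in centers is sketched below. The name point.assign.K and the example centers matrix are hypothetical, not from the slides, and the sketch assumes centers is a matrix with one row per cluster (so K is at least 2 and no cluster is empty).

```r
# Hypothetical generalization of point.assign() to any number of centers.
point.assign.K <- function(point, centers) {
  # Input: one point (numeric vector) and a K x p matrix of centers
  # Output: index of the closest center
  # Squared Euclidean distance from the point to every row of centers
  d2 <- rowSums(sweep(centers, 2, point)^2)
  unname(which.min(d2))
}

# Small illustration with four made-up centers
centers <- rbind(c(1.5, 0.25), c(4.3, 1.3), c(5.6, 2.0), c(7.0, 3.0))
point.assign.K(c(1.4, 0.2), centers)
point.assign.K(c(6.9, 2.9), centers)
```

Using this inside new.clusters() in place of point.assign() removes the hard-coded three-center assumption without changing the rest of the algorithm.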
Check Yourself
Solution
> K.means(data, K = 3)
$clusters
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
[82] 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 1 3
[109] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3
[136] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3
$centers
  Petal.Length Petal.Width
1     4.269231    1.342308
2     1.462000    0.246000
3     5.595833    2.037500
Check Yourself
Solution
> km.out$centers
  Petal.Length Petal.Width
1     1.462000    0.246000
2     5.595833    2.037500
3     4.269231    1.342308
Section VI
Exam Review
Format
All written: will last from 1:10pm - 3:30pm.

For no reason should you be on your cell phone/laptop/smart
watch/whatever.
Topics
Topics covered before midterm will be covered implicitly.

Focus on topics covered week 8 - last week.
Anything from lecture, lab, homework is fair game.
Look to homework assignments (and check yourself questions) to get
a feel for what I think are important topics for you to understand
from each lecture.
Topics
Randomness in R/Generating Random Numbers

Simulations: dfoo(), rfoo(), qfoo(), pfoo().
Simulations: sample()
Inverse Transform Method
Rejection Sampling
Topics
Distributions as Models
MLE or MOM estimation.
Testing fit (visually and otherwise).
Permutation tests.
Topics
Optimization
Constrained vs. Unconstrained
Gradient Descent: Generally, how/why does it work? How is it
different from Newton's Method and what are the
strengths/weaknesses of each?
(I will not, for example, ask you to code up your own gradient descent
algorithm, though I may ask you to properly use something already
coded.)
Built-in R optimization functions (like we used in homework).
Topics
Transforming Data
Functions like sort() and order()
Obviously apply(), tapply(), sapply(), lapply(), but also the
form of the output when we use different functions like range() vs.
mean().
Built-in R functions split(), aggregate(), and merge().
Functions in the plyr package to replace the apply() family.
Topics
Unsupervised Learning
The difference between supervised and unsupervised learning.
How to find principal components and interpret them.
Familiarity with and understanding of clustering generally.
