Lecture 12: Debugging and Databases

STAT GR5206 Statistical Computing & Introduction to Data Science

Cynthia Rush Columbia University

December 9, 2016

Course Notes

Final is next Friday, December 16, 1:10pm - 4:00pm in this room.

Homework is due on Monday.

Last Time

Split/Apply/Combine: A model for working with data.

plyr Package: Similar to the apply() family, but more consistent.

PCA and K-Means Clustering

Topics for Today

Databases: What are databases? Intro to SQL and interfacing R with SQL.

Debugging: Simple techniques for correcting buggy code.

Clustering: Clustering with the K-means algorithm.

Section I

Databases: SQL and Querying

Databases

A record is a collection of fields (records are like rows, fields are like columns).

A table is a collection of records which all have the same fields with different values. These are like dataframes in R.

A database is a collection of tables.

Databases vs. Dataframes

R’s dataframes are actually tables

R Jargon                      Database Jargon
column                        field
row                           record
dataframe                     table
types of the columns          table schema
bunch of related dataframes   database

Databases

So, Why Do We Need Database Software?

Size

R keeps its dataframes in memory

Industrial databases can be much bigger

Work with selected subsets

Speed

Clever people have worked very hard on getting just what you want fast

Concurrency

Many users accessing the same database simultaneously

Lots of potential for trouble (two users want to change the same record at once)

Databases

So, Why Do We Need Database Software?

Databases live on a server, which manages them

Users interact with the server through a client program

Lets multiple users access the same database simultaneously

SQL (structured query language) is the standard for database software

Mostly about queries, which are like doing row/column selections on a dataframe in R

SQL

Connecting R to SQL

SQL is its own language, independent of R (similar to regular expressions). But we’re going to learn how to run SQL queries through R.

First, install the packages DBI, RSQLite.

Also, we need a database file: download the file baseball.db and save it in your working directory.

> library(DBI)
> library(RSQLite)
> drv <- dbDriver("SQLite")
> con <- dbConnect(drv, dbname="baseball.db")

The object con is now a persistent connection to the database baseball.db.
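The connection workflow can be tried without baseball.db. The following is a minimal sketch, assuming the DBI and RSQLite packages are installed; the in-memory database, the connection name con_demo, and the table Toy are made up for illustration. It uses RSQLite::SQLite() as the driver, which plays the same role as dbDriver("SQLite") above.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for baseball.db
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a small data frame into the database as a table named Toy
dbWriteTable(con_demo, "Toy", data.frame(id = 1:3, value = c(2.5, 4.0, 6.5)))

dbListTables(con_demo)          # names of the tables in the database
dbListFields(con_demo, "Toy")   # names of the fields in the Toy table

# Read the whole table back into R as a data frame
toy <- dbReadTable(con_demo, "Toy")

# Close the connection when finished
dbDisconnect(con_demo)
```

Closing the connection with dbDisconnect() is good practice once the queries are done.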

SQL

Listing What’s Available

> dbListTables(con) # List tables in our database
 [1] "AllstarFull"         "Appearances"
 [3] "AwardsManagers"      "AwardsPlayers"
 [5] "AwardsShareManagers" "AwardsSharePlayers"
 [7] "Batting"             "BattingPost"
 [9] "Fielding"            "FieldingOF"
[11] "FieldingPost"        "HallOfFame"
[13] "Managers"            "ManagersHalf"
[15] "Master"              "Pitching"
[17] "PitchingPost"        "Salaries"
[19] "Schools"             "SchoolsPlayers"
[21] "SeriesPost"          "Teams"
[23] "TeamsFranchises"     "TeamsHalf"
[25] "sqlite_sequence"     "xref_stats"
SQL

Listing What’s Available

> dbListFields(con, "Batting") # Fields in Batting table
 [1] "playerID"  "yearID"    "stint"     "teamID"
 [5] "lgID"      "G"         "G_batting" "AB"
 [9] "R"         "H"         "2B"        "3B"
[13] "HR"        "RBI"       "SB"        "CS"
[17] "BB"        "SO"        "IBB"       "HBP"
[21] "SH"        "SF"        "GIDP"      "G_old"
> dbListFields(con, "Pitching") # Fields in Pitching table
 [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"
 [6] "W"        "L"        "G"        "GS"       "CG"
[11] "SHO"      "SV"       "IPouts"   "H"        "ER"
[16] "HR"       "BB"       "SO"       "BAOpp"    "ERA"
[21] "IBB"      "WP"       "HBP"      "BK"       "BFP"
[26] "GF"       "R"

SQL

Importing a Table as a Data Frame

> batting <- dbReadTable(con, "Batting")
> class(batting)
[1] "data.frame"
> dim(batting)
[1] 93955    24

Now we can perform R operations on batting, since it’s a data frame

In lecture today, we’ll use this route primarily to check our work in SQL. In general, we want to do as much in SQL as possible, since it’s more efficient and likely simpler.

Check Yourself

Tasks

Using dbReadTable(), grab the table named Salaries and save it as a data frame called salaries. Using the salaries data frame and ddply(), compute the payroll (total of salaries) for each team in the year 2010. Find the 3 teams with the highest payrolls, and the team with the lowest payroll.

Check Yourself

Solutions

> library(plyr)
> salaries <- dbReadTable(con, "Salaries")
> my.sum.func <- function(team.yr.df) {
+   return(sum(team.yr.df$salary))
+ }
> payroll <- ddply(salaries, .(yearID, teamID), my.sum.func)
> payroll <- payroll[payroll$yearID == 2010, ]
> payroll <- payroll[order(payroll$V1, decreasing = TRUE), ]
> payroll[1:3, ]
    yearID teamID        V1
733   2010    NYA 206333389
719   2010    BOS 162447333
721   2010    CHN 146609000
> payroll[nrow(payroll), ]
    yearID teamID       V1
737   2010    PIT 34943000

SQL

SELECT

Main tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:

SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset, count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional
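The LIMIT clause in the template takes an optional offset before the count. The following is a small sketch of both forms, assuming DBI and RSQLite are installed; the in-memory database and the table Toy are hypothetical stand-ins, not part of baseball.db.

```r
library(DBI)

# Assumption: a throwaway in-memory database with a made-up table
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "Toy",
             data.frame(id = 1:6, score = c(10, 60, 30, 50, 20, 40)))

# LIMIT count: the three highest scores
top3 <- dbGetQuery(con_demo,
  "SELECT id, score FROM Toy ORDER BY score DESC LIMIT 3")

# LIMIT offset, count: skip the first two rows, then return the next three
next3 <- dbGetQuery(con_demo,
  "SELECT id, score FROM Toy ORDER BY score DESC LIMIT 2, 3")

dbDisconnect(con_demo)
```

Here top3 holds the scores 60, 50, 40, and next3 holds 40, 30, 20, since the offset form starts after the second row of the ordered result.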

Examples

Pick out five columns from the table “Batting”, and look at the first 10 rows:

> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+                       "FROM Batting",
+                       "LIMIT 10"))
    playerID yearID  AB   H HR
1  aardsda01   2004   0   0  0
2  aardsda01   2006   2   0  0
3  aardsda01   2007   0   0  0
4  aardsda01   2008   1   0  0
5  aardsda01   2009   0   0  0
6  aaronha01   1954 468 131 13
7  aaronha01   1955 602 189 27
8  aaronha01   1956 609 200 26
9  aaronha01   1957 615 198 44
10 aaronha01   1958 601 196 30

Examples

This is our first successful SQL query (congrats!)

We can replicate the command on the imported data frame:

> batting[1:10, c("playerID", "yearID", "AB", "H", "HR")]
    playerID yearID  AB   H HR
1  aardsda01   2004   0   0  0
2  aardsda01   2006   2   0  0
3  aardsda01   2007   0   0  0
4  aardsda01   2008   1   0  0
5  aardsda01   2009   0   0  0
6  aaronha01   1954 468 131 13
7  aaronha01   1955 602 189 27
8  aaronha01   1956 609 200 26
9  aaronha01   1957 615 198 44
10 aaronha01   1958 601 196 30

Examples

To reiterate: the previous call was simply to check our work. We wouldn’t actually want to do this on a large database, since it would be much less efficient to first read the table into an R data frame and then call R commands on it.
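A side note on how the queries in these examples are assembled: each one is an ordinary R character string built with paste(), which joins its arguments with a single space by default. The sketch below uses only base R and shows the string that gets sent to the database; the variable names are illustrative.

```r
# paste() joins its arguments with a single space by default,
# so each SQL clause can sit on its own line of R code
query <- paste("SELECT playerID, yearID, AB, H, HR",
               "FROM Batting",
               "LIMIT 10")
print(query)

# paste0() uses no separator, so the clauses would run together
# and the resulting string would not be a valid query
glued <- paste0("FROM Batting", "LIMIT 10")
print(glued)
```

Printing the query string before passing it to dbGetQuery() is a simple way to check for missing spaces or misplaced clauses.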

SQL

ORDER BY

We can use the ORDER BY option in SELECT to specify an ordering for the rows

Default is ascending order; add DESC for descending

> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+                       "FROM Batting",
+                       "ORDER BY HR DESC",
+                       "LIMIT 5"))
   playerID yearID  AB   H HR
1 bondsba01   2001 476 156 73
2 mcgwima01   1998 509 152 70
3  sosasa01   1998 643 198 66
4 mcgwima01   1999 521 145 65
5  sosasa01   2001 577 189 64

Check Yourself

Tasks

Run the following queries and determine what they’re doing. Write R code to do the same thing on the batting data frame.

> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+                       "FROM Batting",
+                       "WHERE yearID >= 1990 AND yearID <= 2000",
+                       "ORDER BY HR DESC",
+                       "LIMIT 5"))
> dbGetQuery(con, paste("SELECT playerID, yearID, MAX(HR)",
+                       "FROM Batting"))

Check Yourself

Solutions

> bat.ord <- batting[order(batting$HR, decreasing = TRUE), ]
> subset <- bat.ord$yearID >= 1990 & bat.ord$yearID <= 2000
> columns <- c("playerID", "yearID", "AB", "H", "HR")
> head(bat.ord[subset, columns], 5)
       playerID yearID  AB   H HR
54613 mcgwima01   1998 509 152 70
78578  sosasa01   1998 643 198 66
54614 mcgwima01   1999 521 145 65
78579  sosasa01   1999 625 180 63
31877 griffke02   1997 608 185 56
> batting[which.max(batting$HR), c("playerID","yearID","HR")]
      playerID yearID HR
7514 bondsba01   2001 73

Section II

Databases: SQL Computations

Databases vs. Dataframes

R’s dataframes are actually tables

R Jargon                           Database Jargon
column                             field
row                                record
dataframe                          table
types of the columns               table schema
collection of related dataframes   database
conditional indexing               SELECT, FROM, WHERE, HAVING
d*ply()                            GROUP BY
order()                            ORDER BY

SQL

SELECT

Main tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:

SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset, count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional.

Importantly, in the first line of SELECT we can directly specify computations that we want performed.

Examples

To calculate the average number of homeruns, and average number of hits:

> dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
+                       "FROM Batting"))
   AVG(HR)   AVG(H)
1 2.970549 40.67684

We can replicate this simple command on an imported data frame:

> mean(batting$HR, na.rm = TRUE)
[1] 2.970549
> mean(batting$H, na.rm = TRUE)
[1] 40.67684

GROUP BY

We can use the GROUP BY option in SELECT to define aggregation groups

> dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
+                       "FROM Batting",
+                       "GROUP BY playerID",
+                       "ORDER BY AVG(HR) DESC",
+                       "LIMIT 5"))
   playerID  AVG(HR)
1 pujolal01 40.80000
2 howarry01 36.14286
3 rodrial01 36.05882
4 bondsba01 34.63636
5 mcgwima01 34.29412

Note: the order of commands here matters; try switching the order of GROUP BY and ORDER BY above, and you’ll get an error.
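The effect of clause order can be seen on a toy table. This is a minimal sketch, assuming DBI and RSQLite are installed; the in-memory database and the table Toy are made up. The swapped query is wrapped in tryCatch() so the error is captured as a message rather than stopping the script.

```r
library(DBI)

# Assumption: a throwaway in-memory database with a made-up table
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "Toy",
             data.frame(player = c("a", "a", "b", "b"),
                        HR     = c(10, 20, 1, 3)))

# Correct order: GROUP BY comes before ORDER BY
good <- dbGetQuery(con_demo, paste("SELECT player, AVG(HR) AS avgHR",
                                   "FROM Toy",
                                   "GROUP BY player",
                                   "ORDER BY avgHR DESC"))

# Swapped order: the database rejects the query, and tryCatch()
# returns the error message as a character string
bad <- tryCatch(
  dbGetQuery(con_demo, paste("SELECT player, AVG(HR) AS avgHR",
                             "FROM Toy",
                             "ORDER BY avgHR DESC",
                             "GROUP BY player")),
  error = function(e) conditionMessage(e))

dbDisconnect(con_demo)
```

The object good is a two-row data frame, while bad holds the text of the syntax error.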

WHERE

We can use the WHERE option in SELECT to specify a subset of the rows to use (pre-aggregation/pre-calculation)

> dbGetQuery(con, paste("SELECT yearID, AVG(HR)",
+                       "FROM Batting",
+                       "WHERE yearID >= 1990",
+                       "GROUP BY yearID",
+                       "ORDER BY AVG(HR) DESC",
+                       "LIMIT 5"))
  yearID  AVG(HR)
1   1996 5.073620
2   1999 4.692699
3   2000 4.525437
4   2004 4.490115
5   2001 4.412288

Check Yourself

Tasks

Run the following queries and determine what they’re doing. Write R code to do the same thing on the batting data frame. Hint: use daply().

> dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
+                       "FROM Batting",
+                       "WHERE yearID >= 1990",
+                       "GROUP BY teamID",
+                       "ORDER BY AVG(HR) DESC",
+                       "LIMIT 5"))

Check Yourself

Solutions

> bat.sub <- batting[batting$yearID >= 1990, ]
> my.mean.func <- function(team.df) {
+   return(mean(team.df$HR, na.rm = TRUE))
+ }
> avg.hrs <- daply(bat.sub, .(teamID), my.mean.func)
> avg.hrs <- sort(avg.hrs, decreasing = TRUE)
> head(avg.hrs, 5)
     CHA      NYA      TOR      CAL      TEX
6.164251 5.986486 5.760937 5.625731 5.563961

AS

We can use AS in the first line of SELECT to rename computed columns

> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+                       "FROM Batting",
+                       "GROUP BY yearID",
+                       "ORDER BY avgHR DESC",
+                       "LIMIT 5"))
  yearID    avgHR
1   1987 5.300832
2   1996 5.073620
3   1986 4.730769
4   1999 4.692699
5   1977 4.601010

HAVING

We can use the HAVING option in SELECT to specify a subset of the rows to display (post-aggregation/post-calculation)

> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+                       "FROM Batting",
+                       "WHERE yearID >= 1990",
+                       "GROUP BY yearID",
+                       "HAVING avgHR >= 4.5",
+                       "ORDER BY avgHR DESC"))
  yearID    avgHR
1   1996 5.073620
2   1999 4.692699
3   2000 4.525437
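The difference between WHERE (filter rows before aggregation) and HAVING (filter groups after aggregation) is easier to see on a table small enough to check by hand. This is a sketch under the assumption that DBI and RSQLite are installed; the in-memory database and the five-row table Toy are invented for illustration.

```r
library(DBI)

# Assumption: a throwaway in-memory database with a made-up table
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "Toy", data.frame(
  yearID = c(1989, 1990, 1990, 1991, 1991),
  HR     = c(50, 2, 4, 10, 20)))

# WHERE drops the 1989 row before any averaging happens;
# HAVING then drops the 1990 group, whose average (3) is below 5
res <- dbGetQuery(con_demo, paste("SELECT yearID, AVG(HR) AS avgHR",
                                  "FROM Toy",
                                  "WHERE yearID >= 1990",
                                  "GROUP BY yearID",
                                  "HAVING avgHR >= 5"))

dbDisconnect(con_demo)
```

Only the 1991 group survives, with an average of 15. The 1989 row never enters the computation, even though its HR value is the largest in the table.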

Check Yourself

Tasks

Recompute the payroll for each team in 2010, but now with dbGetQuery() and an appropriate SQL query. In particular, the output of dbGetQuery() should be a data frame with two columns, the first giving the team names and the second the payrolls, just like your output from ddply() before. (Hint: your SQL query here will have to use GROUP BY.)

Check Yourself

Solutions

> dbGetQuery(con, paste("SELECT teamID, SUM(salary) as SUMsal",
+                       "FROM Salaries",
+                       "WHERE yearID == 2010",
+                       "GROUP BY teamID",
+                       "ORDER BY SUMsal DESC",
+                       "LIMIT 3"))
  teamID    SUMsal
1    NYA 206333389
2    BOS 162447333
3    CHN 146609000

Section III

Databases: Join

Databases vs. Dataframes

R’s dataframes are actually tables

R Jargon                           Database Jargon
column                             field
row                                record
dataframe                          table
types of the columns               table schema
collection of related dataframes   database
conditional indexing               SELECT, FROM, WHERE, HAVING
d*ply()                            GROUP BY
order()                            ORDER BY
merge()                            INNER JOIN or just JOIN

JOIN

Sometimes we need to combine information from many tables.

patient_last   patient_first   physician_id   complaint
Morgan         Dexter          37010          insomnia
Soprano        Anthony         79676          malaise
Swearengen     Albert          NA             healthy
Garrett        Alma            90091          nerves
Holmes         Sherlock        43675          addiction

physician_last   physician_first   physicianID   plan
Meridian         Emmett            37010         UPMC
Melfi            Jennifer          79676         BCBS
Cochran          Amos              90091         UPMC
Watson           John              43675         VA

JOIN

Suppose we want to know which doctors are treating patients for insomnia.

Complaints are in one table and physicians in another.

In R, we use merge() to link the tables by physicianID.

Here physicianID or physician_id is acting as the key or the identifier.
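The insomnia question can be worked through in base R. The sketch below retypes the two tables from the slide as data frames (the object names patients, physicians, linked, and insomnia.docs are illustrative) and links them with merge(). Because the key is called physician_id in one table and physicianID in the other, by.x and by.y are both needed.

```r
# The two tables from the slide, retyped as data frames
patients <- data.frame(
  patient_last  = c("Morgan", "Soprano", "Swearengen", "Garrett", "Holmes"),
  patient_first = c("Dexter", "Anthony", "Albert", "Alma", "Sherlock"),
  physician_id  = c(37010, 79676, NA, 90091, 43675),
  complaint     = c("insomnia", "malaise", "healthy", "nerves", "addiction"),
  stringsAsFactors = FALSE)

physicians <- data.frame(
  physician_last  = c("Meridian", "Melfi", "Cochran", "Watson"),
  physician_first = c("Emmett", "Jennifer", "Amos", "John"),
  physicianID     = c(37010, 79676, 90091, 43675),
  plan            = c("UPMC", "BCBS", "UPMC", "VA"),
  stringsAsFactors = FALSE)

# The key has a different name in each table, hence by.x and by.y
linked <- merge(patients, physicians,
                by.x = "physician_id", by.y = "physicianID")

# Doctors treating patients for insomnia
insomnia.docs <- linked[linked$complaint == "insomnia",
                        c("physician_first", "physician_last")]
print(insomnia.docs)
```

The patient with no physician on record (NA key) has no match in the physicians table, so the default merge() drops that row and linked has four rows.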

JOIN

In all we’ve seen so far with SELECT, the FROM line has just specified one table. But sometimes we need to combine information from many tables. Use the JOIN option for this

There are 4 options for JOIN:

1. INNER JOIN or just JOIN: retain just the rows in each table that match the condition.

2. LEFT OUTER JOIN or just LEFT JOIN: retain all rows in the first table, and just the rows in the second table that match the condition.

3. RIGHT OUTER JOIN or just RIGHT JOIN: retain just the rows in the first table that match the condition, and all rows in the second table.

4. FULL OUTER JOIN or just FULL JOIN: retain all rows in both tables

Fields that cannot be filled in are assigned NA values
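The four join types have direct counterparts in the all, all.x, and all.y arguments of base R's merge(). The following sketch uses trimmed-down, partly hypothetical versions of the patient and physician tables (three rows each, with a shared key name physician_id) so the row counts can be checked by hand.

```r
# Trimmed-down tables for illustration: one patient has no physician,
# and one physician has no patient
patients <- data.frame(
  name         = c("Morgan", "Soprano", "Swearengen"),
  physician_id = c(37010, 79676, NA),
  stringsAsFactors = FALSE)

physicians <- data.frame(
  physician_id = c(37010, 79676, 90091),
  doctor       = c("Meridian", "Melfi", "Cochran"),
  stringsAsFactors = FALSE)

# INNER JOIN: only rows with a match in both tables
inner <- merge(patients, physicians, by = "physician_id")

# LEFT JOIN: all patients, matched physicians where available
left <- merge(patients, physicians, by = "physician_id", all.x = TRUE)

# RIGHT JOIN: all physicians, matched patients where available
right <- merge(patients, physicians, by = "physician_id", all.y = TRUE)

# FULL JOIN: all rows from both tables
full <- merge(patients, physicians, by = "physician_id", all = TRUE)

# Row counts for each join type
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

The inner join has 2 rows, the left and right joins have 3 each, and the full join has 4. Unmatched fields show up as NA, as described above.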

[Figure slides: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN]

Examples

Suppose we want to find the average salaries of the players with the top 10 highest homerun averages. We need to combine the two tables.

> dbGetQuery(con, paste("SELECT *",
+                       "FROM Salaries",
+                       "ORDER BY playerID",
+                       "LIMIT 8"))
  yearID teamID lgID  playerID  salary
1   2004    SFN   NL aardsda01  300000
2   2007    CHA   AL aardsda01  387500
3   2008    BOS   AL aardsda01  403250
4   2009    SEA   AL aardsda01  419000
5   2010    SEA   AL aardsda01 2750000
6   1986    BAL   AL  aasedo01  600000
7   1987    BAL   AL  aasedo01  625000
8   1988    BAL   AL  aasedo01  675000

Examples

> query <- paste("SELECT yearID, teamID, lgID, playerID, HR",
+                "FROM Batting",
+                "ORDER BY playerID",
+                "LIMIT 7")
> dbGetQuery(con, query)
  yearID teamID lgID  playerID HR
1   2004    SFN   NL aardsda01  0
2   2006    CHN   NL aardsda01  0
3   2007    CHA   AL aardsda01  0
4   2008    BOS   AL aardsda01  0
5   2009    SEA   AL aardsda01  0
6   2010    SEA   AL aardsda01  0
7   1954    ML1   NL aaronha01 13

Examples

We can use a JOIN on the pair: yearID, playerID.

> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+                       "FROM Batting JOIN Salaries USING(yearID, playerID)",
+                       "ORDER BY playerID",
+                       "LIMIT 7"))
  yearID  playerID  salary HR
1   2004 aardsda01  300000  0
2   2007 aardsda01  387500  0
3   2008 aardsda01  403250  0
4   2009 aardsda01  419000  0
5   2010 aardsda01 2750000  0
6   1986  aasedo01  600000 NA
7   1987  aasedo01  625000 NA

Note that here we’re missing one of David Aardsma’s records from the Batting table (i.e., the JOIN discarded 1 record)

Examples

We can replicate this using merge() on imported data frames:

> merged <- merge(x = batting, y = salaries,
+                 by.x = c("yearID","playerID"),
+                 by.y = c("yearID","playerID"))
> merged[order(merged$playerID)[1:8],
+        c("yearID", "playerID", "salary", "HR")]
      yearID  playerID  salary HR
16708   2004 aardsda01  300000  0
19378   2007 aardsda01  387500  0
20277   2008 aardsda01  403250  0
21164   2009 aardsda01  419000  0
21990   2010 aardsda01 2750000  0
585     1986  aasedo01  600000 NA
1360    1987  aasedo01  625000 NA
2033    1988  aasedo01  675000 NA

Examples

For demonstration purposes, we can use a LEFT JOIN on the pair: yearID, playerID:

> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+                       "FROM Batting LEFT JOIN Salaries USING(yearID, playerID)",
+                       "ORDER BY playerID",
+                       "LIMIT 7"))
  yearID  playerID  salary HR
1   2004 aardsda01  300000  0
2   2006 aardsda01      NA  0
3   2007 aardsda01  387500  0
4   2008 aardsda01  403250  0
5   2009 aardsda01  419000  0
6   2010 aardsda01 2750000  0
7   1954 aaronha01      NA 13

Examples

Now we can see that we have all 6 of David Aardsma’s original records from the Batting table (i.e., the LEFT JOIN used them all, and just filled in an NA value when it was missing his salary)

Currently, RIGHT JOIN and FULL JOIN are not implemented in the RSQLite package
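When RIGHT JOIN is unavailable, the same result can usually be obtained by swapping the order of the two tables and using LEFT JOIN. The sketch below shows this on two made-up tables, A and B, in an in-memory database (assuming DBI and RSQLite are installed). Newer SQLite releases are reported to support RIGHT and FULL JOIN directly, so the limitation may not apply to a current installation, but the swap works either way.

```r
library(DBI)

# Assumption: a throwaway in-memory database with two made-up tables
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "A",
             data.frame(id = c(1, 2), x = c("a1", "a2"),
                        stringsAsFactors = FALSE))
dbWriteTable(con_demo, "B",
             data.frame(id = c(2, 3), y = c("b2", "b3"),
                        stringsAsFactors = FALSE))

# Intended: A RIGHT JOIN B (keep every row of B).
# Equivalent: B LEFT JOIN A, with the table order swapped.
res <- dbGetQuery(con_demo, paste("SELECT id, x, y",
                                  "FROM B LEFT JOIN A USING(id)",
                                  "ORDER BY id"))

dbDisconnect(con_demo)
```

Both rows of B are kept. The row with id 3 has no partner in A, so its x field comes back as NA.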

Examples

Now, as to our original question (average salaries of the players with the top 10 highest homerun averages):

> dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
+                       "FROM Batting JOIN Salaries USING(yearID, playerID)",
+                       "GROUP BY playerID",
+                       "ORDER BY Avg(HR) DESC",
+                       "LIMIT 10"))
    playerID  AVG(HR) AVG(salary)
1  howarry01 45.80000   9051000.0
2  pujolal01 40.80000   8953204.1
3  fieldpr01 38.00000   3882900.0
4  rodrial01 36.05882  15553897.2
5  reynoma01 34.66667    550777.7
6  bondsba01 34.63636   8556605.5
7  mcgwima01 34.29412   4814020.8
8  gonzaca01 34.00000    406000.0
9   dunnad01 33.50000   6969500.0
10 kingmda01 32.50000    908750.0

Check Yourself

Tasks

Using the Fielding table, list the 10 worst (highest) numbers of errors (E) committed by a player in one season, only considering years 1990 and later. In addition to the number of errors, list the year and player ID for each record.

By appropriately merging the Fielding and Salaries tables, list the salaries for each record that you extracted in the last question.

Check Yourself

Solutions

> dbGetQuery(con, paste("SELECT yearID, playerID, E",
+                       "FROM Fielding",
+                       "WHERE yearID >= 1990",
+                       "ORDER BY E DESC",
+                       "LIMIT 10"))
   yearID  playerID  E
1    1992 offerjo01 42
2    1993 offerjo01 37
3    1996 valenjo03 37
4    2000 valenjo03 36
5    1998 carusmi01 35
6    1995 offerjo01 35
7    2008 reynoma01 34
8    2010 desmoia01 34
9    1993 cordewi01 33
10   2000 glaustr01 33

Check Yourself

Solutions

> dbGetQuery(con, paste("SELECT yearID, playerID, E, salary",
+                       "FROM Fielding LEFT JOIN Salaries USING(yearID, playerID)",
+                       "WHERE yearID >= 1990",
+                       "ORDER BY E DESC",
+                       "LIMIT 10"))
  yearID  playerID  E salary
1   1992 offerjo01 42 135000
2   1993 offerjo01 37 300000
3   1996 valenjo03 37 300000