SQL and Shell Baseball Analysis

Shell and SQL Baseball and Airport Analysis

Austin Kinion
Part 1 (i): Compute the number of outbound flights for each of the five airports OAK, SMF, LAX,
SFO and JFK, and sort these counts from largest to smallest.
Done in Shell
#Get the data from the website
$ curl -O http://eeyore.ucdavis.edu/stat141/Data/Airline2012_13.tar.gz
#Output showing how long the download took
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291M  100  291M    0     0  8803k      0  0:00:33  0:00:33 --:--:-- 7512k
So it took 33 seconds to download the data from the website in Shell.

#Extract each month's csv file from the archive
tar -zxvf Airline2012_13.tar.gz
#Create a file named ORIGIN.txt with the count of every value in the
#origin column (field 15) over all months, sorted from largest to smallest
time cut -d , -f 15 201[23]_*.csv | sort | uniq -c | sort -nr > ORIGIN.txt
The time it took was: user 1m55.666s

#Pull the lines for the five airports out of ORIGIN.txt. The counts are
#already sorted, so the result goes straight into the new file ORIGIN2.txt
time egrep 'OAK|SMF|LAX|SFO|JFK' ORIGIN.txt > ORIGIN2.txt
The time it took was: real 0m0.004s
So that's a total time of about 2 minutes for the shell.

Taking a look at the file created from shell:
File: ORIGIN2.txt:
222029 "LAX"
169734 "SFO"
105097 "JFK"
44911 "OAK"
43145 "SMF"
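The egrep step matches the three letters anywhere on a line. That works for ORIGIN.txt, where each line holds only a count and one code, but a stricter variant can compare the whole field. The following is only a sketch on a made-up miniature file: the sample data and the function name count_origins are assumptions for illustration, not part of the assignment.

```shell
#!/bin/sh
# Hypothetical miniature of the flight data: only the quoted ORIGIN field
# (column 15 in the real files) is kept here, one flight per line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"LAX"
"SFO"
"LAX"
"OAK"
"BOS"
"LAX"
"SFO"
EOF

# count_origins FILE: count flights per origin, keep only the five airports
# of interest (whole-field match), and sort the counts largest first.
count_origins() {
    sort "$1" | uniq -c |
        awk '$2 ~ /^"(OAK|SMF|LAX|SFO|JFK)"$/ { print $1, $2 }' |
        sort -k1,1nr
}

counts=$(count_origins "$sample")
printf '%s\n' "$counts"
rm -f "$sample"
```

On the sample this keeps LAX, SFO and OAK and drops BOS, because the awk pattern is anchored to the full quoted code.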
Part 1 (i): Done in R

setwd("~/Desktop/Airline2012_13")
list.files()
#Concatenate all five airport names into the variable airports
airports = c("LAX", "SFO", "JFK", "OAK", "SMF")
#Start origincounts as a vector of five 0's to fill in later
origincounts = numeric(5)
#Use filenames to later iterate over all csv files
filenames = list.files(pattern = "\\.csv$")
#Loop over all csv files and pull out the origin column
system.time(for (i in seq_along(filenames))
{
#Read in the i-th monthly csv file
cur.csv = read.csv(filenames[i])
#Create a table with the origin counts for every airport
origintable = table(cur.csv$ORIGIN)
#Add the counts for just the five airports that were asked for
origincounts = origincounts + origintable[airports]
})
The output matches the output from the shell, which is a good sign.
#R output to show it matches the shell output:
origincounts
#LAX SFO JFK OAK SMF
#222029 169734 105097 44911 43145
And the time it took was:
# user system elapsed
# 438.322 15.523 458.014
So, going by the user time, R took about 7.3 minutes, which is over 3 times as long as
the roughly 2 minutes it took in shell. My conclusion is that when you only need to
pull a few items out of a very large data set, it is faster to do it in Shell than in R.
Part 1 (ii) Compute the total number of flights in and out of the five airports, i.e., the sum of both
the inbound and outbound flights. You can do this however you want using a mix of the shell and
R code. One way is to first obtain the lines in the files which involve any of these five airports.
Then obtain a count for each pair of airports, i.e., ORIGIN, DESTINATION pairs. At most, how
many will there be? Then read these counts by ORIGIN, DESTINATION pairs into R and
compute the total number of flights for each of the 5 airports.
First, the code that was done in shell:
#This grabs every line that mentions one of the airport codes out of all
#the csv files:
egrep 'OAK|SMF|LAX|SFO|JFK' 201[23]*.csv > 12_13.txt
#Then this takes the content from the file created above,
#pulls out both the origin and destination columns, and sorts the counts
cut -d , -f 15,25 12_13.txt | sort | uniq -c | sort -nr > ORIGIN_DEST.txt
So the counts for all five airports, whether as a destination or an origin, are all in the
file above. I will now use R to finish cleaning up the data and get the counts for each
of the five airports.
setwd("~/")
data=readLines("ORIGIN_DEST.txt")
#Grab the lines that mention each 3 letter airport code out of the file
LAX=data[grep("LAX", data)]
JFK=data[grep("JFK", data)]
SMF=data[grep("SMF", data)]
SFO=data[grep("SFO", data)]
OAK=data[grep("OAK", data)]
#Name the regular expression to make the later code easier
regex='([^0-9])'
#Substitute everything that is not a digit with nothing to keep just the counts
LAX_NUM=gsub(regex, '', LAX)
#Sum up those numbers obtained
LAX_TOTAL=sum(as.numeric(LAX_NUM))
#Do the same thing as above for the rest of the airports
JFK_NUM=gsub(regex, '', JFK)
JFK_TOTAL=sum(as.numeric(JFK_NUM))
OAK_NUM=gsub(regex, '', OAK)
OAK_TOTAL=sum(as.numeric(OAK_NUM))
SMF_NUM=gsub(regex, '', SMF)
SMF_TOTAL=sum(as.numeric(SMF_NUM))
SFO_NUM=gsub(regex, '', SFO)
SFO_TOTAL=sum(as.numeric(SFO_NUM))
#List each of the airports with their sums from above
A=list(LAX=LAX_TOTAL, SFO=SFO_TOTAL, JFK=JFK_TOTAL, OAK=OAK_TOTAL,
SMF=SMF_TOTAL)
#Make result from above into data frame for readability
result_air=as.data.frame(A)
And the output for this is:

LAX SFO JFK OAK SMF
444080 339463 210175 89820 86293
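
The same totals could also be computed without leaving the shell. The sketch below is a hedged alternative, not the method used above: it assumes each line of the pair file looks like `COUNT "ORIGIN","DEST"` (the layout `uniq -c` would give for two quoted fields), it runs on made-up counts, and the function name total_flights is hypothetical.

```shell
#!/bin/sh
# Hypothetical pair counts in the "uniq -c" layout: COUNT "ORIGIN","DEST"
pairs=$(mktemp)
cat > "$pairs" <<'EOF'
     40 "LAX","SFO"
     35 "SFO","LAX"
     10 "JFK","LAX"
      7 "OAK","SEA"
      5 "BOS","JFK"
EOF

# total_flights FILE: add each pair's count to its origin and to its
# destination, then report the five airports of interest.
total_flights() {
    awk '{
        gsub(/"/, "", $2)            # strip the quotes around the codes
        split($2, ap, ",")           # ap[1] = origin, ap[2] = destination
        total[ap[1]] += $1
        total[ap[2]] += $1
    }
    END {
        n = split("LAX SFO JFK OAK SMF", want, " ")
        for (i = 1; i <= n; i++) print want[i], total[want[i]] + 0
    }' "$1"
}

totals=$(total_flights "$pairs")
printf '%s\n' "$totals"
rm -f "$pairs"
```

A pair such as LAX to SFO is added to both airports, which is the intended behaviour: the flight is outbound for one and inbound for the other.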

Baseball with SQL
Number 1
What years does the data cover? Are there data for each of these years?
#Find the range of years covered by the data. This function is from Nick's OH's
year_range= function(tbl, db){

query= 'SELECT yearID FROM '
query = paste0(query, tbl, ';')
cat(paste0('Query:', query, '\n'))
# Use tryCatch() to catch errors.
tryCatch(dbGetQuery(db, query),
error = function(e) NULL)
}

tables= dbListTables(db)

years = lapply(tables, year_range, db)
u= unlist(years)

first_year = min(u)
last_year = max(u)
So the data begins in the year:
>first_year
1871
And ends in the year:
>last_year
2013 #So the data ranges from 1871-2013
Are there data for all the years?
#If this returns TRUE, then yes:
all(seq(min(u), max(u)) %in% u)
[1] TRUE
So there are data for each of the years in the set.
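
The same kind of coverage check can be run on a plain list of years in the shell. This is a minimal sketch, assuming the years are available as whitespace-separated text; the sample values and the helper name missing_years are hypothetical, and the sample deliberately contains a gap so the check has something to report.

```shell
#!/bin/sh
# Hypothetical list of yearID values, as might be exported from one table.
years='1871 1872 1872 1874 1875'

# missing_years: read whitespace-separated years on stdin and print every
# year between the smallest and largest value that never occurs.
missing_years() {
    tr -s ' ' '\n' | sort -n -u |
        awk 'NR == 1 { prev = $1; next }
             { for (y = prev + 1; y < $1; y++) print y; prev = $1 }'
}

gaps=$(printf '%s\n' "$years" | missing_years)
printf 'missing: %s\n' "${gaps:-none}"
```

An empty result plays the role of the TRUE returned by the R check: no year between the first and the last is absent.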

Number 2
How many (unique) people are included in the database? How many are players, managers, etc?
#help from Piazza was given to make sure count was correct
number_unique= function(tbl, db){
query= 'SELECT playerID FROM '
query = paste0(query, tbl, ';')
cat(paste0('Query:', query, '\n'))
# Use tryCatch() to catch errors.
tryCatch(dbGetQuery(db, query),
error = function(e) NULL)
}
uni_people = lapply(tables, number_unique, db)
playerIDs = unlist(uni_people)
unique_people = length(unique(playerIDs))
unique_people
managers = dbGetQuery(db, 'SELECT COUNT(DISTINCT playerID) FROM
Managers')
managers
Number_players = unique_people-managers
I found 18359 unique people in the set, 682 managers, and 17677 players. Some players
were also managers at some point, so there is some overlap between the tables, and the
subtraction leaves out anyone who both played and managed. Even so, I believe that this
is a good estimate of the number of players and managers from all of the years.
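
To see how the overlap between players and managers affects such counts, the idea can be sketched on two small ID lists in the shell. Everything here is illustrative: the ID lists are made up, and in the real database they would come from SELECT DISTINCT playerID queries on a playing table and on Managers.

```shell
#!/bin/sh
# Use the C locale so that sort and comm agree on the ordering.
LC_ALL=C
export LC_ALL

# Hypothetical ID lists written to a scratch directory.
work=$(mktemp -d)
printf '%s\n' aaronha01 torrejo01 ruthba01 torrejo01 | sort -u > "$work/players"
printf '%s\n' torrejo01 mackco01 | sort -u > "$work/managers"

# Union of both lists, IDs present in both, and IDs only in the managers list.
# comm requires sorted input, which sort -u above guarantees.
everyone=$(sort -u "$work/players" "$work/managers" | wc -l | tr -d ' ')
both=$(comm -12 "$work/players" "$work/managers" | wc -l | tr -d ' ')
managers_only=$(comm -13 "$work/players" "$work/managers" | wc -l | tr -d ' ')

printf 'people=%s player-managers=%s managers-only=%s\n' \
    "$everyone" "$both" "$managers_only"
rm -rf "$work"
```

Subtracting all managers from the total treats the people counted under player-managers as non-players, which is the overlap mentioned above.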
Number 3
What team won the World Series in 2000?
Win_2000= dbGetQuery(db, "SELECT name FROM Teams WHERE WSWin = 'Y' and
yearID = '2000';")
>Win_2000
name
1 New York Yankees
The winner of the World Series in 2000 was the New York Yankees.
Number 4
What team lost the World Series each year?
World_Series_Losers = dbGetQuery(db, "SELECT yearID, name FROM Teams
WHERE LGWin = 'Y' and WSWin = 'N' GROUP BY yearID;")
To save paper, I will show just the first five and the last five years of World Series
losers:
yearID name
1 1884 New York Metropolitans
2 1885 Chicago White Stockings
3 1886 Chicago White Stockings
4 1887 St. Louis Browns
5 1888 St. Louis Browns
... ... ...
112 2009 Philadelphia Phillies
113 2010 Texas Rangers
114 2011 Texas Rangers
115 2012 Detroit Tigers
116 2013 St. Louis Cardinals
Number 5
Do you see a relationship between the number of games won in a season and winning the World
Series?
#I received a lot of help from Charles on this problem.
World_Series_Winners = dbGetQuery(db, "SELECT WSWin,teamID,W, yearID
FROM Teams WHERE LGWin = 'Y' AND WSWin = 'Y' GROUP BY TeamID;")
World_Series_win= World_Series_Winners[,3]
World_Series_year= World_Series_Winners[,4]

plot(World_Series_year,World_Series_win, xlab="Year", ylab="Number of Wins")

World_Series_Losers = dbGetQuery(db, "SELECT WSWin,teamID,W, yearID
FROM Teams WHERE WSWin = 'N' GROUP BY TeamID;")
World_Series_win2= World_Series_Losers[,3]
World_Series_year2= World_Series_Losers[,4]
plot(World_Series_year2,World_Series_win2, xlab="Year", ylab="Number of Wins")

So there does seem to be a relationship between the number of wins and winning the
World Series, since the plots differ substantially: the teams who won the World Series
have a noticeably higher number of wins. Just to make sure that my plot reading skills
are correct, I computed the mean and median for the number of wins in each category:
median(World_Series_win)
95
mean(World_Series_win)
95.16
median(World_Series_win2)
70
mean(World_Series_win2)
67
So this confirms my point from above that there are a higher number of wins for
the teams who won the World Series.
Number 6
In 2003, what were the three highest salaries?

high_salary = dbGetQuery(db, "SELECT salary FROM Salaries WHERE yearID
= '2003' ORDER BY salary DESC limit 3; ")

>high_salary
salary
1 22000000
2 20000000
3 18700000
So the three highest salaries are: $22,000,000, $20,000,000, and $18,700,000.

Number 7
For 1999, compute the total payroll of each of the different teams. Next compute the team
payrolls for all years in the database for which we have salary information.
Payroll_99= dbGetQuery(db, "SELECT teamID,sum(salary) FROM Salaries
WHERE yearID='1999' GROUP BY teamID;")
So the payroll for the year 1999 for each of the teams is:
>Payroll_99
teamID sum(salary)
1 ANA 55388166
2 ARI 68703999
3 ATL 73140000
4 BAL 80605863
5 BOS 63497500
6 CHA 25620000
7 CHN 62343000
8 CIN 33962761
9 CLE 72978462

10 COL 61935837
11 DET 36489666
12 FLO 21085000
13 HOU 54914000
14 KCA 26225000
15 LAN 80862453
16 MIL 43377395
17 MIN 21257500
18 MON 17903000
19 NYA 86734359
20 NYN 65092092
21 OAK 24431833
22 PHI 31692500
23 PIT 24697666
24 SDN 49768179
25 SEA 54125003
26 SFN 46595057
27 SLN 49778195
28 TBA 38870000
29 TEX 76709931
30 TOR 45444333
``````````````````````````````````````````````````````````````````
Payroll= dbGetQuery(db, "SELECT teamID,sum(salary), yearID FROM
Salaries GROUP BY teamID, yearID;")
To save paper, I will just display the first 5 and the last 5 rows of team payrolls:
teamID sum(salary) yearID

1 ATL 14807000 1985
2 BAL 11560712 1985
3 BOS 10897560 1985
4 CAL 14427894 1985
5 CHA 9846178 1985
... ... ... ...
824 SLN 92260110 2013
825 TBA 52955272 2013
826 TEX 112522600 2013
827 TOR 126288100 2013
828 WAS 113703270 2013

Number 8
Study the change in salary over time. Have salaries kept up with inflation, fallen behind, or grown
faster?

#Bring all dollars to 1985 dollars.
CPI=read.table("CPI.txt")

#Divide the salary by the CPI values and then plot.
#Much help from Nick was received to answer this question
year_salary=dbGetQuery(db, "SELECT yearID, sum(salary) AS salary FROM
Salaries GROUP BY yearID")

CPI2=year_salary/CPI
CPI2=t(CPI2) #Take the transpose so it will work
year=as.vector(year_salary$yearID) #For the lines statement

plot(year_salary, type='l', col="blue", lwd=2.5, xlab="Year",
ylab="Salary")
lines(year,CPI2, type='l',col="red", lwd=2.5)
legend(1985,3.0e+09, c("Salary","Salary with no inflation"), #puts text in the legend
lty=c(1,1), #gives the legend appropriate symbols (lines)
lwd=c(1,1),col=c("blue","red"), cex=.55)

Above is a plot which shows the increase in salary since 1985 (in blue), and the
increase in salary if the salary were computed in 1985 dollars (in red). Another way
of explaining the red line is that it is the salary with inflation removed. It is clear
that the salary for MLB players is increasing much faster than inflation, so players are
getting paid a lot more money than they did in 1985.
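
The adjustment to base-year dollars is a single ratio per year: real = nominal * index[base] / index[year]. Below is a minimal sketch of that arithmetic in awk. The three rows are hypothetical round numbers chosen so the result is easy to check by hand; they are not the values from CPI.txt or from the Salaries table, and the function name deflate is an assumption.

```shell
#!/bin/sh
# Hypothetical figures for illustration only: year, nominal total payroll,
# and a price index for that year.
data='1985 100 100
1995 300 150
2005 600 200'

# deflate BASE: read "year nominal index" rows on stdin and print
# "year nominal real", where real is expressed in BASE-year dollars:
#   real = nominal * index[BASE] / index[year]
deflate() {
    awk -v base="$1" '
        { yr[NR] = $1; nominal[NR] = $2; idx[$1] = $3 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%d %.0f %.0f\n", yr[i], nominal[i],
                       nominal[i] * idx[base] / idx[yr[i]]
        }'
}

real=$(printf '%s\n' "$data" | deflate 1985)
printf '%s\n' "$real"
```

In this made-up series the nominal payroll grows sixfold while the real payroll grows threefold, the same kind of gap that the blue and red lines show.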
Number 9
Compare payrolls for the teams that are in the same leagues, and then in the same divisions. Are
there any interesting characteristics? Have certain teams always had top payrolls over the years?
Is there a connection between payroll and performance?
#American League Salary

library(reshape)
library(reshape2)
American_L_Sal= dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM
Salaries WHERE lgID= 'AL' GROUP BY teamID, yearID")
Teams= unique(American_L_Sal[,1])
names(American_L_Sal)= c("Team", "Salary", "Year")
m=melt(American_L_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary",
main="American League")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5,
lwd=1)

Above, we can see the salaries for each of the American League teams. This graph
and the ones following took me several hours and I am very proud of them. The
graph above is not the easiest to read, but I think it is still plenty readable. We can
see that the highest paid team in the American League in recent years is
undoubtedly the Texas Rangers.

#National League Salary
National_L_Sal= dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM
Salaries WHERE lgID= 'NL' GROUP BY teamID, yearID")
Teams= unique(National_L_Sal[,1])
names(National_L_Sal)= c("Team", "Salary", "Year")
m=melt(National_L_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=55:68, xlab="Year", ylab="Salary", main="Nat. League")
legend("topleft", legend=Teams, cex=.4, col=55:68, pch=.5, lty=.5, lwd=1)

Above we can see that for most of the last five years, the LA Dodgers have been the
highest paid team in the National League, until the last 2 years, where we can see that
the New York Yankees salaries have boosted greatly.

#Now the divisions:
#American League West: Used Nick's code from discussion, received help
#from Michael in OH's
American_LW_Sal= dbGetQuery(db, "SELECT a.teamID, sum(a.salary),
a.yearID FROM Salaries AS a, Teams as b WHERE a.teamID = b.teamID AND
a.yearID = b.yearID AND a.lgID = b.lgID AND
a.lgID= 'AL' AND b.divID= 'W' GROUP BY a.yearID, a.teamID;")

Teams= unique(American_LW_Sal[,1])

names(American_LW_Sal)= c("Team", "Salary", "Year")
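m=melt(American_LW_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL WEST")
legend("topleft", legend=Teams, cex=.4, col=50:60, pch=.5, lty=.5, lwd=1)

The highest paid team in the last 5 years looks to be the Seattle Mariners for the AL West.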

#American League East
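American_LE_Sal= dbGetQuery(db, "SELECT a.teamID, sum(a.salary),
a.yearID FROM Salaries AS a, Teams as b WHERE a.teamID = b.teamID AND
a.yearID = b.yearID AND a.lgID = b.lgID AND
a.lgID= 'AL' AND b.divID= 'E' GROUP BY a.yearID, a.teamID;")

Teams= unique(American_LE_Sal[,1])
names(American_LE_Sal)= c("Team", "Salary", "Year")

m=melt(American_LE_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="AL EAST")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)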

Highest paid team in last 5 years looks to be NY Yankees for AL East

#American League Central
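American_LC_Sal= dbGetQuery(db, "SELECT a.teamID, sum(a.salary),
a.yearID FROM Salaries AS a, Teams as b WHERE a.teamID = b.teamID AND
a.yearID = b.yearID AND a.lgID = b.lgID AND
a.lgID= 'AL' AND b.divID= 'C' GROUP BY a.yearID, a.teamID;")
Teams= unique(American_LC_Sal[,1])
names(American_LC_Sal)= c("Team", "Salary", "Year")

m=melt(American_LC_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="AL Central")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)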

Highest paid team in last 5 years looks to be Kansas City Royals for AL Central

#National League West
National_LW_Sal= dbGetQuery(db, "SELECT a.teamID, sum(a.salary),
a.yearID FROM Salaries AS a, Teams as b WHERE a.teamID = b.teamID AND
a.yearID = b.yearID AND a.lgID = b.lgID AND
a.lgID= 'NL' AND b.divID= 'W' GROUP BY a.yearID, a.teamID;")

Teams= unique(National_LW_Sal[,1])
names(National_LW_Sal)= c("Team", "Salary", "Year")

m=melt(National_LW_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL WEST")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)

The highest paid team in the last 5 years looks to be the Arizona Diamondbacks for the
NL West, but the SF Giants seem to have become the highest from 2012-2013.

#National League East
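National_LE_Sal= dbGetQuery(db, "SELECT a.teamID, sum(a.salary),
a.yearID FROM Salaries AS a, Teams as b WHERE a.teamID = b.teamID AND
a.yearID = b.yearID AND a.lgID = b.lgID AND
a.lgID= 'NL' AND b.divID= 'E' GROUP BY a.yearID, a.teamID;")
Teams= unique(National_LE_Sal[,1])
names(National_LE_Sal)= c("Team", "Salary", "Year")

m=melt(National_LE_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL EAST")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)

The highest paid team in the last 5 years looks to be the Miami Marlins for the NL East.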

#National League Central
National_LC_Sal= dbGetQuery(db, "SELECT a.teamID, sum(a.salary),
a.yearID FROM Salaries AS a, Teams as b WHERE a.teamID = b.teamID AND
a.yearID = b.yearID AND a.lgID = b.lgID AND
a.lgID= 'NL' AND b.divID= 'C' GROUP BY a.yearID, a.teamID;")
Teams= unique(National_LC_Sal[,1])
names(National_LC_Sal)= c("Team", "Salary", "Year")

m=melt(National_LC_Sal,id=c("Team", "Year"))
c=cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL Central")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)

The highest paid team in the last 5 years looks to be the Chicago Cubs for the NL
Central, but the Cincinnati Reds seem to have become the highest from 2012-2013.

NUMBER 10
Has the distribution of home runs for players increased over the years?

#Number 10
home_run= dbGetQuery(db, "SELECT yearID,HR AS homerun FROM Batting ")
a=split(home_run$homerun, home_run$yearID)
boxplot(a, outwex=.2, outline=FALSE)

From the plot, we can see that the distribution of homeruns HAS changed over the
years, with a peak in the 1990s and early 2000s, most likely due to steroid use being
unregulated.
BONUS QUESTIONS!
Have the RBI's in the last 13 years gone down due to steroid rules being enforced?
RBI= dbGetQuery(db, "SELECT yearID,sum(RBI) FROM Batting WHERE yearID
BETWEEN 2000 AND 2013 GROUP BY yearID ")
plot(RBI, type='l', ylab="RBI's", xlab="Year", main="RBI's since 2000"
)
#Verify that batting has worsened with the steroid decline by looking at homeruns
home_runs= dbGetQuery(db, "SELECT yearID,sum(HR) AS homerun FROM
Batting WHERE yearID BETWEEN 2000 AND 2013 GROUP BY yearID")
plot(home_runs, type='l', ylab="Homeruns", xlab="Year", main="Homeruns since 2000")

So it is apparent that both the RBI's and the number of homeruns have gone down
recently, and this is likely due to the decrease in steroid use over the past 10-15
years.
NUMBER 2
Look at the number of strikeouts over the years. Have pitchers gotten better?
strike_outs= dbGetQuery(db, "SELECT yearID,sum(SO) AS strikeouts FROM
Pitching GROUP BY yearID")
plot(strike_outs, type='l', ylab="Strikeouts", xlab="Years",
main="Strikeouts over the years")

It does appear that pitchers have gotten better over the years, but that could
also mean that batters have gotten worse while pitchers remained the same.
Question 3
Who are the 5 top managers (managerID's) with the highest number of wins in this
dataset?
top_managers= dbGetQuery(db, "SELECT playerID,yearID,W AS wins FROM
Managers ORDER BY W DESC limit 5")
top_managers
#From baseballreference.com
#Frank Chance, Lou Piniella, Joe Torre, Al Lopez, Fred Clarke
Number 4
In which years were there tie games? and how many were there?
dbGetQuery(db, "SELECT sum(ties), yearID AS year FROM SeriesPost WHERE
ties=1 GROUP BY yearID;")

sum(ties) year
1 1 1885
2 1 1890
3 1 1892
Number 5
Who are all of the pitchers in the MLB for the year 2013 and which team did they
play for?
pitchers=dbGetQuery(db, "SELECT teamID, playerID,yearID, Pos FROM
FieldingPost WHERE yearID=2013 AND Pos='P';")
There are 166 Pitchers, so I will just list the first 5 and the last 5:
teamID playerID yearID POS
1 DET albural01 2013 P
2 DET albural01 2013 P
3 CLE allenco01 2013 P
4 DET alvarjo02 2013 P
5 OAK anderbr04 2013 P
... ... ...... ... ..
162 BOS workmbr01 2013 P
165 TBA wrighja01 2013 P
166 TBA wrighwe01 2013 P
