SFO and JFK, and sort these counts from largest to smallest.
Done in Shell
#Get the data from the website
$ curl -O http://eeyore.ucdavis.edu/stat141/Data/Airline2012_13.tar.gz
#output for the amount of time it took to get it
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291M  100  291M    0     0  8803k      0  0:00:33  0:00:33 --:--:-- 7512k
filenames = list.files(pattern = "\\.csv$")  #assumes the extracted csv files sit in the working directory
#concatenate all five airport names into variable: airports
airports = c("LAX", "SFO", "JFK", "OAK", "SMF")
#start the running totals at zero so the loop can add to them
origincounts = 0
#loop over all csv files and pull out origin column
system.time(for (i in seq_along(filenames)) {
  #Read in the current csv file
  cur.csv = read.csv(filenames[i])
  #Create a table with the origin counts for each airport
  origintable = table(cur.csv$ORIGIN)
  #Keep just the five airports that were asked for and add them to the totals
  origincounts = origincounts + origintable[airports]
})
The R output matches the output from the shell, which is a very good thing.
#R output to show it matches the shell output:
origincounts
#   LAX    SFO    JFK   OAK   SMF
#222029 169734 105097 44911 43145
So, if we use the user time, that is about 7.3 minutes, which is over 3 times as long as it took in the shell. I have therefore determined that when you need to grep a few items out of a very large data set, it is faster to do it in the shell than in R.
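One detail of the R loop is worth a small sketch: table() only reports airports that actually appear in a file, so indexing the table by a missing code gives NA. The toy example below (made-up data frames, not the real airline files; the helper name count_origins is hypothetical) fixes the factor levels so every airport gets a count, then adds the per-file tables together.

```r
# Toy stand-ins for the monthly csv files (hypothetical data, not the real airline files)
monthly <- list(
  data.frame(ORIGIN = c("LAX", "SFO", "LAX", "ORD")),
  data.frame(ORIGIN = c("JFK", "LAX", "SMF"))
)
airports <- c("LAX", "SFO", "JFK", "OAK", "SMF")

# Hypothetical helper: count origins in one data frame for a fixed set of codes
count_origins <- function(df, codes) {
  # factor() with fixed levels keeps a zero count for airports absent from a file;
  # codes outside the list (here ORD) become NA and are dropped by table()
  table(factor(df$ORIGIN, levels = codes))
}

# Add the per-file tables together, then sort from largest to smallest
origincounts <- Reduce(`+`, lapply(monthly, count_origins, codes = airports))
sorted_counts <- sort(origincounts, decreasing = TRUE)
print(sorted_counts)
```

With this approach a file that happens to contain no OAK flights contributes 0 instead of NA to the running total.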
Part 1 (ii) Compute the total number of flights in and out of the five airports, i.e., the sum of both
the inbound and outbound flights. You can do this however you want using a mix of the shell and
R code. One way is to first obtain the lines in the files which involve any of these five airports.
Then obtain a count for each pair of airports, i.e., ORIGIN, DESTINATION pairs. At most, how
many will there be? Then read these counts by ORIGIN, DESTINATION pairs into R and
compute the total number of flights for each of the 5 airports.
#This grabs each of the airport names out of all the csv files from the data:
egrep 'OAK|SMF|LAX|SFO|JFK' 201[23]*.csv > 12_13.txt
#Then this takes the content from the file created above,
#pulls out both the origin and destination columns, and sorts the counts
cut -d , -f 15,25 12_13.txt | sort | uniq -c | sort -nr > ORIGIN_DEST.txt
So the counts for all five airports, whether as a destination or an origin, are all in the file above. I will now use R to finish cleaning up the data and get the counts for each of the five airports.
setwd("~/")
data = readLines("ORIGIN_DEST.txt")
#Grab each of the 3 letter codes for the airports out of the file
LAX = data[grep("LAX", data)]
JFK = data[grep("JFK", data)]
SMF = data[grep("SMF", data)]
SFO = data[grep("SFO", data)]
OAK = data[grep("OAK", data)]
#Substitute all the nonsense with nothing to just obtain the numbers
LAX_NUM = gsub(regex, '', LAX)
#Sum up those numbers obtained
LAX_TOTAL = sum(as.numeric(LAX_NUM))
#Do the same thing as above for the rest of the airports
JFK_NUM = gsub(regex, '', JFK)
JFK_TOTAL = sum(as.numeric(JFK_NUM))
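The regex variable used above is not shown in this write-up, so here is a minimal sketch of the same idea under stated assumptions: the lines are assumed to look like uniq -c output (a count followed by the ORIGIN,DEST pair), and the pattern below is my own guess, not the original one. The sample lines are invented for illustration.

```r
# Hypothetical lines in the shape produced by `uniq -c` (count, then ORIGIN,DEST)
pair_lines <- c(
  '   120 "LAX","SFO"',
  '    80 "SFO","LAX"',
  '    40 "JFK","ORD"',
  '    15 "ORD","OAK"'
)
airports <- c("LAX", "SFO", "JFK", "OAK", "SMF")

# Assumed pattern: keep only the leading count on each line
counts <- as.numeric(sub("^\\s*([0-9]+).*$", "\\1", pair_lines))

# Total for one airport = sum of the counts on every line that mentions it,
# whether as origin or as destination
totals <- sapply(airports, function(code) {
  sum(counts[grepl(code, pair_lines, fixed = TRUE)])
})
print(totals)
```

A flight between two of the five airports (for example LAX to SFO) is counted once in each airport's total, which is what "in and out" asks for.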
Baseball with SQL
Number 1
What years does the data cover? Are there data for each of these years?
#Find the range of years in the data. This function is from Nick's office hours
year_range = function(tbl, db){
  query = 'SELECT yearID FROM '
  query = paste0(query, tbl, ';')
  cat(paste0('Query:', query, '\n'))
  # Use tryCatch() to catch errors.
  tryCatch(dbGetQuery(db, query), error = function(e) NULL)
}
tables = dbListTables(db)
years = lapply(tables, year_range, db)
u = unlist(years)
first_year = min(u)
last_year = max(u)
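The code above gives the first and last year, but the question also asks whether every year in between has data. A minimal sketch of that check, using a made-up vector in place of the real unlist(years) result:

```r
# Hypothetical yearID values standing in for unlist(years); not taken from the database
u <- c(1871, 1872, 1874, 1875, 1875)

first_year <- min(u)
last_year <- max(u)

# Years inside the range that never appear in the data
missing_years <- setdiff(first_year:last_year, unique(u))
print(missing_years)
```

If missing_years comes back empty, there is data for each year between first_year and last_year.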
I found 18359 unique individuals in the set: 682 managers and 17677 players. Some players were managers at some point, so there might be some overlap in the tables, but I believe that this is a very good estimate of the number of players and managers from all of the years.
Number 3
What team won the World Series in 2000?
Win_2000 = dbGetQuery(db, "SELECT name FROM Teams WHERE WSWin = 'Y' and yearID = '2000';")
>Win_2000
              name
1 New York Yankees
The winner of the World Series in 2000 was the New York Yankees.
Number 4
What team lost the World Series each year?
For the sake of saving paper, I will just show the first five years of World Series losers and the last five years:
    yearID name
1     1884 New York Metropolitans
2     1885 Chicago White Stockings
3     1886 Chicago White Stockings
4     1887 St. Louis Browns
5     1888 St. Louis Browns
...    ... ...
112   2009 Philadelphia Phillies
113   2010 Texas Rangers
114   2011 Texas Rangers
115   2012 Detroit Tigers
116   2013 St. Louis Cardinals
Number 5
Do you see a relationship between the number of games won in a season and winning the World
Series?
#I received a lot of help from Charles on this problem.
World_Series_Winners = dbGetQuery(db, "SELECT WSWin,teamID,W, yearID FROM Teams WHERE LGWin = 'Y' AND WSWin = 'Y' GROUP BY TeamID;")
World_Series_win = World_Series_Winners[,3]
World_Series_year = World_Series_Winners[,4]
plot(World_Series_year,World_Series_win, xlab="Year", ylab="Number of Wins")
World_Series_Losers = dbGetQuery(db, "SELECT WSWin,teamID,W, yearID FROM Teams WHERE WSWin = 'N' GROUP BY TeamID;")
World_Series_win2 = World_Series_Losers[,3]
World_Series_year2 = World_Series_Losers[,4]
So there does seem to be a relationship between the number of wins and winning the World Series, since the plots differ substantially. It is apparent that the teams who won the World Series have a higher number of wins. Just to make sure that my plot reading skills are correct, I computed the mean and median for the number of wins in each category:
median(World_Series_win)
95
mean(World_Series_win)
95.16
median(World_Series_win2)
70
mean(World_Series_win2)
67
So this confirms my point from above that there is a higher number of wins for the teams who won the World Series.
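One caveat: because both queries use GROUP BY teamID, each returns at most one row per team, so the comparison rests on one season per franchise. A minimal sketch of the same comparison that keeps every team-season, using an invented data frame in the shape of the Teams table (the team IDs and win totals are hypothetical):

```r
# Hypothetical team-season rows in the shape of the Teams table
teams <- data.frame(
  yearID = c(2001, 2001, 2001, 2002, 2002, 2002),
  teamID = c("AAA", "BBB", "CCC", "AAA", "BBB", "CCC"),
  W      = c(100, 80, 70, 96, 90, 62),
  WSWin  = c("Y", "N", "N", "Y", "N", "N")
)

# Median and mean wins, split by whether the team won the World Series that year
median_by_result <- tapply(teams$W, teams$WSWin, median)
mean_by_result <- tapply(teams$W, teams$WSWin, mean)
print(median_by_result)
print(mean_by_result)
```

With the real database, dropping the GROUP BY clause from the queries would give one row per team-season, which is the shape this sketch assumes.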
Number 6
For 1999, compute the total payroll of each of the different teams. Next compute the team
payrolls for all years in the database for which we have salary information.
Payroll_99 = dbGetQuery(db, "SELECT teamID,sum(salary) FROM Salaries WHERE yearID='1999' GROUP BY teamID;")
So the payroll for the year 1999 for each of the teams is:
>Payroll_99
   teamID sum(salary)
1     ANA    55388166
2     ARI    68703999
3     ATL    73140000
4     BAL    80605863
5     BOS    63497500
6     CHA    25620000
7     CHN    62343000
8     CIN    33962761
9     CLE    72978462
10    COL    61935837
11    DET    36489666
12    FLO    21085000
13    HOU    54914000
14    KCA    26225000
15    LAN    80862453
16    MIL    43377395
17    MIN    21257500
18    MON    17903000
19    NYA    86734359
20    NYN    65092092
21    OAK    24431833
22    PHI    31692500
23    PIT    24697666
24    SDN    49768179
25    SEA    54125003
26    SFN    46595057
27    SLN    49778195
28    TBA    38870000
29    TEX    76709931
30    TOR    45444333
Payroll = dbGetQuery(db, "SELECT teamID,sum(salary), yearID FROM Salaries GROUP BY teamID, yearID;")
For the sake of saving paper, I will just display the first 5 salaries from the first five teams, and the last 5:
Number 8
Study the change in salary over time. Have salaries kept up with inflation, fallen behind, or grown
faster?
#Bring all dollars to 1985 dollars.
CPI = read.table("CPI.txt")
#Divide the salary by the cpi values and then plot.
#Much help from Nick was received to answer this question
year_salary = dbGetQuery(db, "SELECT yearID, sum(salary) AS salary FROM Salaries GROUP BY yearID")
CPI2 = year_salary/CPI
CPI2 = t(CPI2) #Take the transpose so it will work
year = as.vector(year_salary$yearID) #For the lines statement
plot(year_salary, type='l', col="blue", lwd=2.5, xlab="Year", ylab="Salary")
lines(year,CPI2, type='l',col="red", lwd=2.5)
legend(1985,3.0e+09, c("Salary","Salary with no inflation"), # puts text in the legend
       lty=c(1,1), # gives the legend appropriate symbols (lines)
       lwd=c(1,1),col=c("blue","red"), cex=.55)
Above is a plot which shows the increase in salary since 1985 (in blue), and the increase in salary if the salary were computed in 1985 dollars (in red). Another way of explaining the red line is that it is the salary with inflation removed. It is clear that the salary for MLB players is increasing much faster than inflation, so players are getting paid a lot more money than they did in 1985.
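To make the adjustment concrete, here is a minimal sketch of converting nominal totals to 1985 dollars: real = nominal * CPI(1985) / CPI(year). The numbers and the column names (nominal, index) are invented for illustration and are not the contents of CPI.txt or the Salaries table.

```r
# Hypothetical nominal payroll totals and CPI index values (illustrative numbers only)
salary <- data.frame(yearID = c(1985, 1995, 2005),
                     nominal = c(100, 300, 700))
cpi <- data.frame(yearID = c(1985, 1995, 2005),
                  index = c(100, 150, 200))

# Line the two tables up by year so each salary is paired with its own CPI value
adj <- merge(salary, cpi, by = "yearID")

# Convert to 1985 dollars: scale by the ratio of the 1985 index to that year's index
base_index <- adj$index[adj$yearID == 1985]
adj$real_1985 <- adj$nominal * base_index / adj$index

# Growth relative to 1985 after removing inflation
adj$real_growth <- adj$real_1985 / adj$real_1985[adj$yearID == 1985]
print(adj)
```

If real_growth stays above 1 and keeps rising, salaries have grown faster than inflation; merging on yearID also guards against the salary and CPI rows being in a different order.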
Number 9
Compare payrolls for the teams that are in the same leagues, and then in the same divisions. Are
there any interesting characteristics? Have certain teams always had top payrolls over the years?
Is there a connection between payroll and performance?
library(reshape) #provides melt() and cast()
Teams = unique(American_L_Sal[,1])
names(American_L_Sal) = c("Team", "Salary", "Year")
m = melt(American_L_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="American League")
legend("topleft", legend=Teams, cex=.4, col=1:68, pch=.5, lty=.5, lwd=1)
Above, we can see the salaries for each of the American League teams. This graph and the ones following took me several hours and I am very proud of them. The graph above is not the easiest to read, but I think it is still plenty readable. We can see that the highest paid team in the American League (in recent years) is undoubtedly the Texas Rangers.
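The melt/cast step turns one row per team and year into one column per team. As a rough sketch of the same reshaping in base R (not the code used for the plots here), tapply() can build the Year by Team matrix directly; the data frame and the helper name plot_payroll are hypothetical.

```r
# Hypothetical long-format payroll data: one row per team and year
payroll <- data.frame(
  Team = c("AAA", "AAA", "BBB", "BBB"),
  Year = c(2012, 2013, 2012, 2013),
  Salary = c(50, 90, 80, 70)
)

# Wide matrix: rows are years, columns are teams
wide <- tapply(payroll$Salary, list(payroll$Year, payroll$Team), sum)

# Highest-paid team in each year, read off the matrix instead of the plot
top_team <- colnames(wide)[apply(wide, 1, which.max)]
print(wide)
print(top_team)

# Hypothetical helper: one line per team, with the real years on the x axis
# and legend colours taken from the same sequence as the lines
plot_payroll <- function(wide) {
  years <- as.numeric(rownames(wide))
  cols <- seq_len(ncol(wide))
  matplot(years, wide, type = "l", lty = 1, col = cols,
          xlab = "Year", ylab = "Salary")
  legend("topleft", legend = colnames(wide), lty = 1, col = cols)
}
```

Calling plot_payroll(wide) draws the chart. Computing top_team from the matrix gives a check on what the eye picks out of a crowded plot.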
#National League Salary
National_L_Sal = dbGetQuery(db, "SELECT teamID, sum(salary), yearID FROM Salaries WHERE lgID= 'NL' GROUP BY teamID, yearID")
Teams = unique(National_L_Sal[,1])
names(National_L_Sal) = c("Team", "Salary", "Year")
m = melt(National_L_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=55:68, xlab="Year", ylab="Salary", main="Nat. League")
#use the same colors in the legend as in the plot so the lines can be matched to teams
legend("topleft", legend=Teams, cex=.4, col=55:68, pch=.5, lty=.5, lwd=1)
Above we can see that, for most of the last five years, the LA Dodgers have been the highest paid team in the National League, until the last 2 years, where we can see that the New York Yankees salaries have increased greatly.
#Now the divisions:
#American League West: Used Nick's code from discussion, received help from Michael in office hours
American_LW_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID = 'AL' AND b.divID = 'W'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(American_LW_Sal[,1])
names(American_LW_Sal) = c("Team", "Salary", "Year")
m = melt(American_LW_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL WEST")
legend("topleft", legend=Teams, cex=.6, col=50:60, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Seattle Mariners for the AL West.
#American League East
American_LE_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID = 'AL' AND b.divID = 'E'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(American_LE_Sal[,1])
names(American_LE_Sal) = c("Team", "Salary", "Year")
m = melt(American_LE_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=50:60, xlab="Year", ylab="Salary", main="AL EAST")
legend("topleft", legend=Teams, cex=.6, col=50:60, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the NY Yankees for the AL East.
#American League Central
American_LC_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID = 'AL' AND b.divID = 'C'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(American_LC_Sal[,1])
names(American_LC_Sal) = c("Team", "Salary", "Year")
m = melt(American_LC_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="AL Central")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Kansas City Royals for the AL Central.
#National League West
National_LW_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID = 'NL' AND b.divID = 'W'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(National_LW_Sal[,1])
names(National_LW_Sal) = c("Team", "Salary", "Year")
m = melt(National_LW_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL WEST")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Arizona Diamondbacks for the NL West, but the SF Giants seem to have become the highest from 2012-2013.
#National League East
National_LE_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID = 'NL' AND b.divID = 'E'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(National_LE_Sal[,1])
names(National_LE_Sal) = c("Team", "Salary", "Year")
m = melt(National_LE_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL EAST")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Miami Marlins for the NL East.
#National League Central
National_LC_Sal = dbGetQuery(db, "SELECT a.teamID, sum(a.salary), a.yearID
  FROM Salaries AS a, Teams AS b
  WHERE a.teamID = b.teamID AND a.yearID = b.yearID AND a.lgID = b.lgID
  AND a.lgID = 'NL' AND b.divID = 'C'
  GROUP BY a.yearID, a.teamID;")
Teams = unique(National_LC_Sal[,1])
names(National_LC_Sal) = c("Team", "Salary", "Year")
m = melt(National_LC_Sal,id=c("Team", "Year"))
c = cast(m, Year~Team)
matplot(c, type="l", col=1:68, xlab="Year", ylab="Salary", main="NL Central")
legend("topleft", legend=Teams, cex=.6, col=1:68, pch=.5, lty=.5, lwd=1)
The highest paid team in the last 5 years looks to be the Chicago Cubs for the NL Central, but the Cincinnati Reds seem to have become the highest from 2012-2013.
NUMBER 10
Has the distribution of home runs for players increased over the years?
#Number 10
home_run = dbGetQuery(db, "SELECT yearID,HR AS homerun FROM Batting ")
a = split(home_run$homerun, home_run$yearID)
boxplot(a, outwex=.2, outline=FALSE)
From the plot, we can see that the distribution of home runs has changed over the years, with a peak in the 1990s and early 2000s, most likely due to steroid use being unregulated.
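Since the boxplot is drawn with outline=FALSE, the top hitters are hidden, so it can help to put numbers next to the picture. A minimal sketch that summarises the per-year distribution with the median and the 90th percentile, on an invented data frame in the shape of the query result:

```r
# Hypothetical rows in the shape of the home_run query result
batting <- data.frame(
  yearID = c(1950, 1950, 1950, 2000, 2000, 2000),
  homerun = c(0, 5, 10, 2, 20, 40)
)

# One vector of home run counts per year, as in the boxplot call
by_year <- split(batting$homerun, batting$yearID)

# Median and 90th percentile for each year (rows are years)
summary_by_year <- t(sapply(by_year, quantile, probs = c(0.5, 0.9)))
print(summary_by_year)
```

A rising 90th percentile with a flat median would say that the change is in the top hitters rather than in the typical player.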
BONUS QUESTIONS!
Have the RBIs in the last 13 years gone down due to steroid rules being enforced?
RBI = dbGetQuery(db, "SELECT yearID,sum(RBI) FROM Batting WHERE yearID BETWEEN 2000 AND 2013 GROUP BY yearID ")
plot(RBI, type='l', ylab="RBI's", xlab="Year", main="RBI's since 2000")
So it is apparent that the RBIs, and the number of home runs, have gone down recently, and this is likely due to the decrease in steroid use over the past 10-15 years.
NUMBER 2
Look at the number of strikeouts over the years. Have pitchers gotten better?
strike_outs = dbGetQuery(db, "SELECT yearID,sum(SO) AS strikeouts FROM Pitching GROUP BY yearID")
plot(strike_outs, type='l', ylab="Strikeouts", xlab="Years", main="Strikeouts over the years")
It does appear that pitchers have gotten better over the years, but it could also mean that batters have gotten worse while pitchers remained the same.
Question 3
Who are the 5 top managers (managerID's) with the highest number of wins in this dataset?
top_managers = dbGetQuery(db, "SELECT playerID,yearID,W AS wins FROM Managers ORDER BY W DESC limit 5")
top_managers
#From baseballreference.com
#Frank Chance, Lou Piniella, Joe Torre, Al Lopez, Fred Clarke
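The query above ranks single seasons, so it returns the five best manager-seasons. If the question is read as career wins instead, the wins need to be summed per manager first. A minimal sketch of that reading on invented rows (the IDs and win totals are hypothetical); in SQL the same idea would be a sum(W) with GROUP BY playerID.

```r
# Hypothetical manager-season rows in the shape of the Managers table
managers <- data.frame(
  playerID = c("mgrA", "mgrA", "mgrB", "mgrB", "mgrC"),
  yearID = c(2001, 2002, 2001, 2002, 2001),
  W = c(90, 95, 110, 60, 100)
)

# Sum wins over all seasons for each manager
career <- aggregate(W ~ playerID, data = managers, FUN = sum)

# Sort from most to fewest career wins and keep the top of the list
career <- career[order(-career$W), ]
top_career <- head(career, 2)
print(top_career)
```

In this toy data mgrB has the best single season (110 wins) but mgrA has the most career wins, which shows how the two readings can give different answers.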
Number 4
In which years were there tie games, and how many were there?
dbGetQuery(db, "SELECT sum(ties), yearID AS year FROM SeriesPost WHERE ties=1 GROUP BY yearID;")
  sum(ties) year
1         1 1885
2         1 1890
3         1 1892
Number 5
Who are all of the pitchers in the MLB for the year 2013 and which team did they play for?
pitchers = dbGetQuery(db, "SELECT teamID, playerID,yearID, Pos FROM FieldingPost WHERE yearID=2013 AND Pos='P';")
There are 166 pitchers, so I will just list the first 5 and the last 5:
    teamID  playerID yearID POS
1      DET albural01   2013   P
2      DET albural01   2013   P
3      CLE allenco01   2013   P
4      DET alvarjo02   2013   P
5      OAK anderbr04   2013   P
...    ...       ...    ... ...
162    BOS workmbr01   2013   P
163    BOS workmbr01   2013   P
164    BOS workmbr01   2013   P
165    TBA wrighja01   2013   P
166    TBA wrighwe01   2013   P