
Yudelman 1

Predicting Batting Average of Balls in Play in Major League Baseball using StatCast Data

Adam Yudelman
ECN 215
Professor Allin Cottrell
11/23/2015

Table of Contents:
I. Introduction
II. Literature Review
III. Data
IV. Expectations
V. Modeling and Testing
VI. Interpretation and Conclusions
VII. Suggestions for Future Work
VIII. Appendix
   a. Summary Statistics
   b. Actual vs Predicted BABIP
   c. R Script
IX. Works Cited

Introduction:
In Major League Baseball, teams are constantly searching for undervalued players. This process
is a form of arbitrage, in which players are signed to contracts that do not accurately represent their skill
level or value to a team. The search was popularized by the movie Moneyball, in which the Oakland
Athletics target on-base percentage, a measure of how often a player reaches base given his plate
appearances, as an undervalued skill. The movie, which received several Oscar nominations, brought the
idea of sports analytics to the forefront of American pop culture (Moneyball).
Sports analytics crosses econometrics with sports. Baseball in particular is ripe for data analysis; every
pitch is recorded with a speed, location, and break, and every hit is described in a variety of ways, such as
how hard the ball was hit and where it landed, to properly capture the process and
outcome. Moreover, baseball data is almost entirely independent. A single pitcher pitches to a
single hitter, and the outcome depends almost entirely on those two players. However, in some cases, this
does not hold.
As more data is collected, teams are looking for new types of arbitrage. One of the
greatest factors teams have to keep in mind is luck. Over a sample of 162 games, players can in fact get
lucky, improving their hitting statistics. For a hitter, this can mean a weak ground ball finding a gap
between two players. The premise of luck is that a pitcher and hitter cannot control where a defensive
player is set up. Hitting the ball hard and on a line is the best outcome for a hitter, yet sometimes that
batted ball is hit directly at a defender; thus, the player's statistics, particularly his batting average
(defined as (total hits)/(total at bats)), are penalized with an out rather than a hit. Unfortunately, for a
long while, teams looked at batting average as the most important metric for determining a player's skill
level. However, as described above, batting average is dependent on the exogenous variable of
defensive positioning. In an attempt to improve the understanding of batting average, analysts came
up with the statistic Batting Average of Balls in Play (BABIP).

BABIP is defined as:
BABIP = (H - HR) / (AB - K - HR + SF)
where H = hits, HR = home runs, AB = at bats, K = strikeouts, and SF = sacrifice flies.
In words, BABIP is the frequency with which a player gets a hit on a ball in play. It relies on
three factors: skill, defense, and luck. Skill is the ability to hit a ball hard and in a manner
that would likely result in a hit. Defense is the defensive positioning of the opponent, an aspect the
player cannot control. Luck is best exemplified by a weakly hit fly ball that lands right between two
players; it's an unintentional outcome. The idea is that a player who has one season with a high BABIP
compared to the rest of his career may have been the beneficiary of luck more so than an improvement
in skill (BABIP).
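As a minimal sketch of the definition above, the formula can be written as a small Python function. The function name and the season line used in the example are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def babip(hits, home_runs, at_bats, strikeouts, sacrifice_flies):
    """Batting average of balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sacrifice_flies
    if balls_in_play <= 0:
        raise ValueError("the player has no balls in play")
    return (hits - home_runs) / balls_in_play


# Hypothetical season line, invented for illustration only.
example = babip(hits=150, home_runs=20, at_bats=550, strikeouts=120, sacrifice_flies=5)
print(round(example, 3))
```

Home runs are removed from both the numerator and the denominator because they are not fielded, while sacrifice flies are added back because they are balls in play that do not count as at bats.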
This paper investigates BABIP in a non-results based format. Essentially, the research intends to
predict BABIP using batted ball descriptors. This is important because batted ball descriptors are not
subject to luck or defense; they describe only how the ball comes off the bat. These descriptors range from a
subjective categorical measure of how hard a ball is hit to the direction in which the ball is headed. A
good model that predicts BABIP (generally referred to as expected BABIP, or xBABIP) can isolate skill
from the luck factor; thus, large residuals against actual BABIP can identify players who are victims or
beneficiaries of this luck. For teams, identifying those unlucky players, and thus those likely
undervalued, can allow them to sign good players at a below value price. This can be seen as two
separate markets: one for analytically based teams and one for non-analytically based teams. More
often than not, by using these underlying statistics, the analytically based team will be the one to
commit arbitrage through the valuation difference between the two markets.
This paper in particular looks at the release of a new dataset and its potential effects on
predicting BABIP. Last year, Major League Baseball introduced a new ball tracking system called StatCast.
StatCast uses optical tracking technology to measure how fast, with what acceleration, and how far a
defender runs. In addition, and more importantly for this paper, StatCast tracks and publishes the exit
velocity of a baseball hit by the batter. Previous research into predicting a non-results based BABIP has
not used this new StatCast data, so this research intends to see if the data can be used to build a better
model for xBABIP.
Literature Review:
The idea of creating an xBABIP is not new. The rise in popularity of baseball analytics has created
countless dedicated websites where individuals commit their free time to analyzing statistics in hopes of
better understanding the game. Three of the most reputable websites are FanGraphs, Baseball
Prospectus, and The Hardball Times. Searching the archives of these websites returns a modest body of prior
research on the subject.
In 2008, Chris Dutton of The Hardball Times first delved into predictive BABIP. Popular
opinion at the time said that adding .120 to a player's line drive percentage (percentage of batted balls
categorized as line drives) could act as a proxy for what a player's BABIP should look like. Dutton rejected
this as a reasonable explanation, for other factors, such as speed and the ability to control the strike zone,
seemed to be relevant as well. He postulated that a quicker player would be able to
get a hit on a slow grounder to an infielder while a slower player, who hit the ball with the exact same
profile, would not get a hit.
Using data from 2002 to 2008, Dutton developed a model that took into account the batted ball
profile as well as some relevant metrics for the player's personal skill profile. His OLS regression found
positive and significant (at the 1% level) coefficients for a hitter's eye (defined as strikeouts divided by walks), line
drive percentage, speed score (to be discussed later on), and pitches per plate appearance (Dutton,
2008). He found negative coefficients for pitches per extra base hit, fly ball to ground ball ratio, spray (a
measure of how well a hitter disperses his hits all over the field), and contact rate (a metric that looks at
how well a player avoids strikeouts). Dutton also attempted to control for park effects, the year, and
whether a batter hits left-handed, right-handed, or both, but he found these indicator variables to be insignificant.
Dutton's model had an r-squared of .348. During out of sample testing, he found a correlation of 59%
between his xBABIP and actual BABIP. In comparison, the rudimentary formula of xBABIP = (.120 + LD%)
only had an r-squared of .03 and an out of sample correlation of 18% (Dutton, 2008).
Dutton's results very clearly show that a model for expected BABIP is necessary. The conventional,
very simple model using only line drive percentage proved not very predictive at all.
Dutton isolated significant variables, and his work is used as the backbone for other research on the
topic.
In 2010, Matt Swartz of Baseball Prospectus looked once again at batting average of balls in
play. Swartz noted that year-to-year BABIP only has a correlation of about .37. BABIP is highly influenced
by the type of batted ball, as defined by line drive percentage, outfield fly ball percentage, ground ball
percentage, and infield fly ball percentage. The following table shows the league average distribution,
the year-to-year correlation for this distribution, the league average BABIP, and the year-to-year
correlation for BABIP for each type of batted ball.
BABIP by Type of Batted Ball

Batted Ball Type     League Average   Distribution    Average   BABIP
                     Distribution     Year-to-Year    BABIP     Year-to-Year
                                      Correlation               Correlation
Line Drive           .21              .37             .730      .12
Outfield Fly Ball    .44              .72             .240      .22
Ground Ball          .35              .78             .170      .30
Infield Fly Ball     .11              .68             .020      .17

Interpreting the year-to-year correlations, batted ball type distribution is rather steady. This is to say if a
batter hits 30% fly balls one year, the numbers suggest that the batter will hit very close to that same
percentage the next. This is very important, for if a model is based on these batted ball type distributions,
a player has to have similar numbers year to year for the predictions to mean anything. In contrast, the
BABIP average year-to-year correlations suggest that there is far more uncertainty and inconsistency.
In addition to batted ball type, Swartz also looked at the speed of the player, as Dutton did.
Rather than using the speed score, Swartz used triples per at bat as a proxy because triples require
enough speed to beat the ball to third base rather than settle for a double.
Swartz then developed two models. The first used the weighted average of the previous three
years to predict the fourth year's BABIP. This model found positive and statistically significant
coefficients for line drive percentage, ground ball percentage, ground ball BABIP, infield hits per infield
chances, outfield fly ball BABIP, the natural log of home runs per at bat, and the natural log of contact
made per pitch swung at. The model had one variable, infield fly ball percentage, with a negative
coefficient. This model had an r-squared of .31. For this paper's purpose, this model is not very helpful,
for the data being used is only from the 2015 season and does not include any of the previous seasons'
BABIP data. Also, given the BABIP year-to-year inconsistency, it is questionable to have used the
previous years' BABIP as a variable.
Swartz's second model looked only at the previous year's data, a model that is much more like
the one this paper attempts to build. Swartz finds that line drive percentage, ground ball percentage,
infield hits per infield chances, the natural log of home runs per at bat, outfield fly ball percentage, and
triples all have positive coefficients. Again, infield fly ball percentage has a negative coefficient. This
model has an r-squared of just .21 (Swartz, 2010). Swartz's approach shows that batted ball
type is in fact an important variable when predicting BABIP; however, his models did not show
improvement on Dutton's previous research.
The most recent study on expected batting average of balls in play comes from Alex
Chamberlain of FanGraphs. Chamberlain's impetus for the research comes from the release of a few
new batted ball type statistics. Hard%, in conjunction with Medium% and Soft%, measures how often a
player hits a ball hard. Interestingly, Chamberlain notes that Hard% has almost no correlation with line
drive percentage; thus, Hard% captures well hit ground balls as well as well hit fly balls. True FB% is
defined as fly ball percentage minus infield fly ball percentage, and True IFFB% measures
infield fly balls per ball hit in play rather than per fly ball. Finally, Chamberlain also introduces Oppo%,
complemented by Pull% and Center%, which measures the percent of batted balls that are hit to the
opposite field. Using data from 2002 to 2014 (n = 1971), Chamberlain developed the following model:
xBABIP = .1975 - .4838*(True IFFB%) - .0914*(True FB%) + .2594*(LD%) + .1822*(Hard%) +
.1198*(Oppo%) + .0042*(Speed Score)
The model has an adjusted r-squared of .456 as well as a year-to-year correlation of .4712 (Chamberlain,
2015).
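As an illustration, Chamberlain's published equation can be evaluated directly. The sketch below copies the coefficients from the equation above; the function name and the input profile are hypothetical and are not taken from the 2015 sample.

```python
def xbabip_chamberlain(true_iffb, true_fb, ld, hard, oppo, speed_score):
    """Evaluate Chamberlain's xBABIP equation.

    All percentage inputs are in decimal format (10% is passed as 0.10).
    """
    return (
        0.1975
        - 0.4838 * true_iffb
        - 0.0914 * true_fb
        + 0.2594 * ld
        + 0.1822 * hard
        + 0.1198 * oppo
        + 0.0042 * speed_score
    )


# Hypothetical batted ball profile, invented for illustration only.
estimate = xbabip_chamberlain(
    true_iffb=0.03, true_fb=0.31, ld=0.21, hard=0.30, oppo=0.25, speed_score=4.0
)
print(round(estimate, 3))
```

Because the equation is linear, each coefficient can be read as the change in expected BABIP for a one unit change in that metric, holding the others fixed.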
Using these new statistics, this model has significantly more predictive power than the models
discussed previously. Batted ball type and speed remain very important factors in predicting BABIP across
all the research reviewed, and it seems that the new metrics, which further describe the batted ball
profile, improve the model. Given the new dataset upon which this paper intends to build a
model, this is a promising outcome. Because of this, this paper intends to use Chamberlain's model
as the archetype.

Data:
The data for this paper comes from FanGraphs.com and BaseballSavant.com. Both websites
have full data from the 2015 Major League Baseball season for their respective statistics. This paper
limits the sample to only players who qualified for the batting title, which requires 3.1 plate appearances
per game in the season. This translates to at least 502 plate appearances for the entire season. Thus,
the sample is limited to the 141 qualified players from the 2015 season. The training set is a randomly
selected collection of 106 players and the testing set consists of the remaining 36 players.
The dataset currently has 43 columns representing the key (the player) and 42 descriptive
statistics. However, the research presented only builds an OLS model on a selected group of the
variables. The following explains each of these selected variables. The attached appendix also includes
the means, medians, and standard deviations for each statistic. Note that for all percentage metrics, this
paper will be using decimal format (i.e. 10% is represented as .10).
Batting Average of Balls in Play (BABIP):
Calculated as (Hits - Home Runs) / (At Bats - Strikeouts - Home Runs + Sacrifice Flies)
Batting Average of Balls in Play measures how often a ball put in play results in a hit. As
discussed before, BABIP incorporates talent, luck, and defense. No one with over 4,000 career
plate appearances (roughly six seasons) has ever had a BABIP of over .380, and a more
traditional mark of .350 indicates the best players in the league (BABIP). The following box
plot shows the wide spread of BABIP:


The most significant outlier is Albert Pujols at .217. This is a far departure from Pujols's career
average BABIP of .297, so even as he ages, it is unlikely that .217 is a representative measure of
his hitting skill.
Batted Ball Type: Line Drive Percentage (LD%), Groundball Percentage (GB%), Fly Ball Percentage (FB%),
and Infield Fly Ball Percentage (IFFB%):
Calculated as:
Line Drive Percentage = Line Drives / Balls in Play
Fly Ball Percentage = Fly Balls / Balls in Play
Ground Ball Percentage = Ground Balls / Balls in Play
Infield Fly Ball Percentage = Infield Fly Balls / Fly Balls

These four metrics are grouped together, for they are all related. The first three are the categorized
outcomes of a ball in play: LD%, FB%, and GB% sum to 1. IFFB% is a subcategory that describes
what share of fly balls stay in the infield. The following shows the hit distribution for 2015 players (ordered by ascending GB%):

For the majority of players, GB% dominates the hit profile; however, line drive percentage is
rather steady across the board regardless of the other two metrics.
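The batted ball type formulas above can be sketched as follows. The function name and the counts in the example are hypothetical; the sketch assumes, as the definitions above imply, that infield fly balls are already counted among fly balls.

```python
def batted_ball_profile(line_drives, ground_balls, fly_balls, infield_fly_balls):
    """Return LD%, GB%, FB%, and IFFB% in decimal format."""
    balls_in_play = line_drives + ground_balls + fly_balls
    if balls_in_play == 0 or fly_balls == 0:
        raise ValueError("need at least one ball in play and one fly ball")
    return {
        "LD%": line_drives / balls_in_play,
        "GB%": ground_balls / balls_in_play,
        "FB%": fly_balls / balls_in_play,
        # IFFB% is measured per fly ball, not per ball in play.
        "IFFB%": infield_fly_balls / fly_balls,
    }


# Hypothetical counts, invented for illustration only.
profile = batted_ball_profile(
    line_drives=90, ground_balls=180, fly_balls=130, infield_fly_balls=13
)
print(profile)
```

Since LD%, GB%, and FB% share a denominator and sum to 1, any two of them determine the third, which matters later when choosing regressors.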
Batted Ball Direction: Pull Percent (Pull%), Center Percent (Cent%), and Opposite Field Percent (Oppo%):
Calculated as:
Pull% = Pulled Balls/Total Batted Balls
Cent% = Centered Balls/Total Batted Balls
Oppo% = Opposite Balls/Total Batted Balls
Batted ball direction metrics split the field into three equal 30 degree sections. A pulled ball is
one the batter hits towards the side of the field on which he stands at the plate. For example, if a
right handed hitter pulls the ball towards third base, it counts towards the Pull%. If the ball is hit
up the middle, it is counted towards the Cent%, and if the ball is hit towards the first baseman, it
is counted towards the Oppo%. For a left handed hitter, a ball hit to the first base side counts
toward the Pull%, and if the ball is hit towards the third baseman, it is counted towards the
Oppo%. Similar to FB%, GB%, and LD%, these metrics sum to 1 for each player. The following,
courtesy of the FanGraphs page Batted Ball Direction, attempts to describe a player's hitting
style based on the distribution of his batted ball direction breakdown.
Batter Type      Pull%   Cent%   Oppo%
Average          .40     .35     .25
Extreme Pull     .55     .25     .20
Extreme Oppo.    .30     .30     .40

Of note, players want to have as balanced a distribution as possible so that defenses are not
able to position themselves heavily towards one side or the other.
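The direction rule described above can be sketched in code. The spray angle convention (0 degrees at the third base foul line, 90 degrees at the first base foul line), the treatment of the 30 and 60 degree boundaries, and both function names are assumptions made for this illustration; the sources do not specify them.

```python
def batted_ball_direction(spray_angle_deg, bats):
    """Classify a batted ball as "Pull", "Cent", or "Oppo".

    spray_angle_deg: assumed convention, 0 = third base foul line,
                     90 = first base foul line.
    bats: "R" or "L", the side of the plate the batter is hitting from.
    """
    if not 0 <= spray_angle_deg <= 90:
        raise ValueError("spray angle must be between 0 and 90 degrees")
    if bats not in ("R", "L"):
        raise ValueError('bats must be "R" or "L"')
    if 30 <= spray_angle_deg <= 60:
        return "Cent"
    third_base_side = spray_angle_deg < 30
    # A right handed hitter pulls toward third base, a left handed hitter toward first.
    if (bats == "R") == third_base_side:
        return "Pull"
    return "Oppo"


def direction_profile(spray_angles, bats):
    """Return Pull%, Cent%, and Oppo% in decimal format for one batter."""
    if not spray_angles:
        raise ValueError("need at least one batted ball")
    counts = {"Pull": 0, "Cent": 0, "Oppo": 0}
    for angle in spray_angles:
        counts[batted_ball_direction(angle, bats)] += 1
    total = len(spray_angles)
    return {name + "%": count / total for name, count in counts.items()}


# Hypothetical spray angles for a right handed hitter.
sample_profile = direction_profile([5, 20, 25, 28, 40, 45, 50, 55, 70, 85], "R")
print(sample_profile)
```

Switch hitters would need the side of the plate recorded for each plate appearance, which is why the handedness is passed per call rather than stored per player.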
Quality of Contact: Soft Hit Percentage (Soft%), Medium Hit Percentage (Med%), Hard Hit Percentage
(Hard%):
Quality of contact statistics are proprietary metrics from Baseball Information Solutions, and
have only recently been released to the public. While the exact formula is not known, it is
common knowledge that hang time, trajectory, and landing location factor into the calculation
(Quality of Contact Stats). Once again, these metrics sum to 1, so every batted ball is assigned
into one of the three buckets. The following shows the distribution for the qualified players in
2015.


Medium hit balls dominate every player's profile. Based on the research of Alex Chamberlain,
a higher hard hit percentage should lead to a higher BABIP. This study hopes to confirm that.
Speed Score (Spd%):
Speed score is also a proprietary metric. It attempts to capture both the speed and base running
ability of a player. The metric varies depending on the website, but FanGraphs uses a
combination of Stolen Base Percentage, Frequency of Stolen Base Attempts, Percentage of
Triples, and Runs Scored Percentage (Speed Score). The 2015 sample shows that speed scores vary
quite a bit from player to player. The following is a box plot of the distribution:


The wide distribution makes intuitive sense. The nature of baseball is that some positions
require far more athleticism than others, so rosters have a variety of body types and athleticism.
From FanGraphs' own research on speed score over the years, the following, taken from their
Speed Score page, shows how one can rate a player's speed according to his score:
Rating           Speed Score
Excellent        7.0
Great            6.0
Above Average    5.5
Average          4.5
Below Average    4.0
Poor             3.0
Awful            2.0

StatCast Data: Average miles per hour of a ball off the bat (Avg MPH), Average miles per hour of a ball
off the bat for a line drive and fly ball (Avg LD/FB MPH), Average miles per hour of a ball off the bat for
a ground ball (Avg GB MPH)
Before delving into each individual metric, each of which is rather self-explanatory, there must
be a discussion regarding the reliability of StatCast data. Research done at FanGraphs by Tony
Blengino looked at the limitations of the 2015 data. Blengino downloaded all the data from the
first half of the 2015 season and found that for 25.4% of batted balls, the batted ball velocity
was reported as NULL. Hence, for any batter, it is safe to assume that roughly one-fourth of the data is
missing. More troubling is the split between the missing and reported data. Blengino found that
batted balls with reported data were associated with a much higher batting average and slugging
percentage (a measure of a player's ability to hit for extra bases) than those without (Blengino, 2015). Digging
further, StatCast reported infield fly balls as NULL 56.3% of the time and often missed weak
ground balls (Blengino, 2015). This no doubt will have an effect on this paper's analysis and is
important to keep in mind.
Breaking down each of the statistics, the following is a line graph plotting each of the three
metrics:


As expected, line drives are hit at a higher speed than ground balls. However, there is a lot of
spread in the data. Regardless, this paper hopes that StatCast data can be used as a
complement to, or even a substitute for, the quality of contact metrics.
Plate Discipline Statistics: Outside of the Strike Zone Swing Percentage (O-Swing%), Inside of the Strike
Zone Swing Percentage (Z-Swing%), Overall Swing Percentage (Swing%), and Swinging Strike Percentage
(SwStr%)
Calculated as:
O-Swing% = Swings at pitches outside of the strike zone / Total pitches outside of the strike zone
Z-Swing% = Swings at pitches inside of the strike zone / Total pitches inside of the strike zone
Swing% = Swings at pitches / Total pitches
SwStr% = Swings and misses / Total pitches
These metrics represent how well a player is able to control the strike zone. Balls pitched inside
the strike zone are easier to hit, so players with a high O-Swing% lack plate discipline. SwStr% is
a metric that captures a player's ability to make contact consistently. The data summary in the
appendix shows the league averages with rather small standard deviations; thus, players are
rather consistent with these metrics.
Contact Consistency Metrics: Contact Rate for Swings for Pitches Outside of the Strike Zone (O-Contact%),
Contact Rate for Swings for Pitches Inside the Strike Zone (Z-Contact%), Contact Rate for All
Swings (Contact%):
Calculated as:
O-Contact% = Contact made on pitches outside of the strike zone / Swings on pitches outside the zone
Z-Contact% = Contact made on pitches inside the zone / Swings on pitches inside the zone
Contact% = Contact made on a swing / Swings
The ability to avoid swinging at pitches outside of the zone is important because getting ahead
in the count (more balls than strikes) allows hitters to expect pitches closer to the middle of
the zone. Having a high contact rate is itself indicative of a hitter's bat control.
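The plate discipline and contact formulas above all derive from the same six pitch counts, which the following sketch makes explicit. The function name, argument names, and example counts are hypothetical and used only for illustration.

```python
def discipline_metrics(pitches_in, pitches_out, swings_in, swings_out,
                       contact_in, contact_out):
    """Compute swing and contact rates (decimal format) from pitch counts.

    "in" and "out" refer to pitches inside and outside the strike zone.
    """
    pitches = pitches_in + pitches_out
    swings = swings_in + swings_out
    contact = contact_in + contact_out
    if min(pitches_in, pitches_out, swings_in, swings_out) <= 0:
        raise ValueError("every denominator must be positive")
    return {
        "O-Swing%": swings_out / pitches_out,
        "Z-Swing%": swings_in / pitches_in,
        "Swing%": swings / pitches,
        "SwStr%": (swings - contact) / pitches,  # swings and misses per pitch
        "O-Contact%": contact_out / swings_out,
        "Z-Contact%": contact_in / swings_in,
        "Contact%": contact / swings,
    }


# Hypothetical season pitch counts, invented for illustration only.
metrics = discipline_metrics(
    pitches_in=1000, pitches_out=1200,
    swings_in=680, swings_out=360,
    contact_in=600, contact_out=230,
)
print({name: round(value, 3) for name, value in metrics.items()})
```

Writing the metrics this way shows why they are highly correlated with one another: Swing% is a weighted average of O-Swing% and Z-Swing%, and SwStr% equals Swing% times (1 - Contact%).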

Expectations:
The expectation for this paper is that this research will prove to be a more thorough
examination of predictive BABIP methods. Chamberlain's model proved to be the best method
examined, yet he openly admitted that he handpicked statistics that he thought would be helpful in
predicting BABIP. This research intends to look at all the batted ball profile statistics described above to
develop a thorough model complete with diagnostic tests. The expectation is that the StatCast data will
help the model by providing previously unused data. The model must be careful of collinearity, however,
for quality of contact and StatCast MPH are likely related. The model likely will see positive and
significant coefficients associated with LD%, Oppo%, HR/FB%, and Speed Score. Negative coefficients are
to be expected on FB%, GB%, Pull%, and Soft%. Plate discipline metrics are more difficult to project.
O-Swing% should have a negative coefficient, for hitting balls outside of the strike zone is difficult, and
SwStr% should be negative as well, for lots of missed swings are unlikely to correlate with good contact
when the ball is eventually put in play. Z-Swing% should have a positive coefficient, for players are
swinging at strikes early and often, which are usually the easiest pitches to hit. For the StatCast data,
higher MPH should mean more well hit balls resulting in hits, so there should be a positive effect on
BABIP.
Modeling And Testing:
First, the data is split into a 75% training set and a 25% testing set. This leaves 106 players for
training the model and 36 for testing it. Because of the fear of collinearity, the first step to building this
model is making a correlation matrix of the concerned metrics:


By interpreting this matrix, it becomes clear that including certain combinations of metrics will
lead to overfitting of the data. The following bullet points summarize the findings:

- GB% and FB% are highly correlated with each other, but neither is highly correlated with LD%.
- Hard% is highly correlated with Soft% and Med%.
- Hard%/Med%/Soft% are highly correlated with the StatCast data.
- StatCast data for GB and LD/FB are not correlated with each other, so both can be used in
  one model as long as the overall StatCast average is not used.
- Pull% is correlated heavily with Med% and Oppo%, but Oppo% and Med% are not correlated with each other.
- All the swing metrics are correlated.
- All the contact metrics are correlated.
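The screening step summarized above can be sketched as a pairwise Pearson correlation check. This is only an illustration: the five rows of data are hypothetical, and the 0.7 cutoff is an assumed threshold for "highly correlated", since the paper does not state the value it used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def flag_collinear(columns, threshold=0.7):
    """Return (metric_a, metric_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical values for five players, invented for illustration only.
toy_data = {
    "GB%": [0.40, 0.45, 0.50, 0.38, 0.47],
    "FB%": [0.38, 0.35, 0.28, 0.42, 0.31],
    "LD%": [0.22, 0.20, 0.22, 0.20, 0.22],
}
flagged_pairs = flag_collinear(toy_data)
print([(a, b, round(r, 2)) for a, b, r in flagged_pairs])
```

Only one metric from each flagged pair would then be carried into a given regression, which is the reasoning behind building separate Quality of Contact and StatCast models.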

Given these findings, I intend to build two models. One model will use the Quality of Contact data and
the other will use StatCast data as a substitute. Comparing the two models should show whether the
release of StatCast data helps improve the predictive power of xBABIP.
The advantage of this paper's Quality of Contact Model will be the inclusion of several additional
metrics. The model attempts to predict BABIP using LD%, GB%, Oppo%, Hard%, Speed Score, O-Swing%,
, and Contact%. Using these metrics, it accounts for the following: Batted ball type, batted ball direction,
quality of contact, player speed, player discipline, and player contact skills. Thinking through the metrics,
there does not seem to be any variable where diminishing effects would come about; thus, none of the
metrics are transformed. The following is the first model output:

The two insignificant variables, Swing% and Contact%, lead to an omit F-test where the null hypothesis is
β_Swing% = β_Contact% = 0. The following shows the regression output for the new model:


As we see, removing the two variables leaves a model with only significant variables. However, the
omit F-test requires an F-statistic in order to reject the null hypothesis. The following shows the ANOVA
outputs for both models:

Calculation of the F-statistic finds a p-value just below .05, so we reject the null hypothesis. Thus,
Contact% and Swing% jointly improve the model, even though neither is individually significant.
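For reference, the omit F-statistic compares the residual sum of squares (RSS) of the restricted and unrestricted models. The sketch below uses the standard formula F = ((RSS_r - RSS_u) / q) / (RSS_u / (n - k)); the RSS values and the parameter count k = 8 are hypothetical placeholders, not the values from the regression output, and only n = 106 (the training set size) and q = 2 (the two omitted variables) come from the text.

```python
def omit_f_statistic(rss_restricted, rss_unrestricted, n_obs,
                     n_params_unrestricted, n_restrictions):
    """F-statistic for jointly omitting variables from an OLS model.

    n_params_unrestricted counts every estimated coefficient in the
    unrestricted model, including the intercept.
    """
    if rss_restricted < rss_unrestricted:
        raise ValueError("the restricted model cannot fit better than the unrestricted one")
    df_denominator = n_obs - n_params_unrestricted
    if df_denominator <= 0 or n_restrictions <= 0:
        raise ValueError("degrees of freedom must be positive")
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    denominator = rss_unrestricted / df_denominator
    return numerator / denominator


# Hypothetical RSS values and parameter count, for illustration only.
f_stat = omit_f_statistic(
    rss_restricted=0.0640, rss_unrestricted=0.0600,
    n_obs=106, n_params_unrestricted=8, n_restrictions=2,
)
print(round(f_stat, 3))
```

The p-value is the upper tail of the F distribution with (q, n - k) degrees of freedom evaluated at this statistic; a statistics library such as SciPy (scipy.stats.f.sf) can supply it.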

Given this result, the paper will go ahead with the unrestricted model as the Quality of Contact Model.
The following shows the residual plot:

The scatter is random, has no influential outliers, and is centered around zero, so the residuals appear
consistent with i.i.d. errors. Furthermore, a heteroscedasticity test shows no signs of heteroscedasticity:

Moving on to the StatCast model, the variables remain the same except for the replacement of
Hard% with Avg LD/FB MPH. In theory, these metrics are measuring the same skill, so the results
should be analogous. The following shows the first output:


This initial output is very similar to the initial output of the Quality of Contact Model. Again, an omit
F-test is used to test the null hypothesis β_Swing% = β_Contact% = 0. Below is the reduced model:

And the Anova tables:


The omit F-test again gives a p-value below .05, so we can reject the null hypothesis.
Given this result, the research will continue with the unrestricted model. To evaluate the diagnostics,
below are the residual plot and heteroscedasticity test:


Again, the scatter is random, has no influential outliers, and is centered around zero, so the residuals
appear consistent with i.i.d. errors. There is also no sign of heteroscedasticity.
Interpretation and Conclusions:
The coefficients for both models agree with each other. LD%, Oppo%, Hard% and Speed Score all
have positive coefficients. Unsurprisingly, LD% has the largest coefficient, which makes sense given how
high the batting average is on that type of hit. For Swing% and Contact%, both models have negative
coefficients. This was not expected, yet makes sense. If a player swings at and makes contact with the
majority of pitches, he is likely sacrificing the chance to wait for a pitch in his preferred zone in favor of
simply putting any pitch in play. If the player waits for a pitch in a certain zone, he is more likely to make better contact
and thus have a higher chance of getting a hit. The variables on which the models differ, Hard% and
Avg LD/FB MPH, both have positive coefficients. This is not surprising. However, the StatCast Avg
LD/FB MPH is not as significant; thus, it is likely not as good a predictor as Hard%. To
compare the two models, the following graphs plot the actual BABIP against the predicted
BABIP using the 36 player testing set:


The plots act as a complement to the r-squared analysis of the two models. Both plots have fairly linear
relationships, which suggests good predictive power. Interestingly, the StatCast Model seems to be
consistently over-predicting BABIP by .02-.03. This may be a result of the previously discussed data
issues associated with StatCast.
Overall, the models and research show that the Quality of Contact Model is better than the StatCast
Model. This is evident in the significance of the variables, the adjusted r-squared values, and the results of the
testing set. The Quality of Contact Model explains 44.1% of the variation in BABIP while the StatCast
Model explains 40.1% of the variation. This is a disappointing finding. StatCast was announced to much
excitement, yet the data quality issues seem to cloud its ability to be a truly useful dataset. This is in no
way a damning statement for the future of StatCast; the results of the StatCast Model are indeed
promising. Despite the small sample size, it is valuable to have been able to confirm and improve upon
(ever so slightly) the previous research on the topic. To further examine the results of the models, the
actual and predicted values for BABIP for all 141 players are attached in the appendix as Table 2. The full
dataset used is available digitally.
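To illustrate how the residuals might be read, the sketch below computes actual minus predicted BABIP for three players, using their actual and Quality of Contact values from Table 2. The function name is hypothetical, and treating the sign of a residual as a rough luck indicator follows the reasoning in the introduction rather than any formal test.

```python
def babip_residuals(rows):
    """Return (name, actual - predicted) pairs sorted from most negative to most positive."""
    residuals = [(name, round(actual - predicted, 3)) for name, actual, predicted in rows]
    return sorted(residuals, key=lambda item: item[1])


# (name, actual BABIP, Quality of Contact prediction), values taken from Table 2.
players = [
    ("Paul Goldschmidt", 0.382, 0.352),
    ("Mike Trout", 0.344, 0.346),
    ("Evan Gattis", 0.264, 0.291),
]

for name, residual in babip_residuals(players):
    # Negative: actual BABIP fell short of the model (possibly unlucky).
    # Positive: actual BABIP exceeded the model (possibly lucky).
    print(f"{name}: {residual:+.3f}")
```

A player near the top of this list would be a candidate for the undervalued group, subject to the caveat that the model explains less than half of the variation in BABIP.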
Suggestions for Future Work:
In several years, StatCast data should be more reliable, creating a much more thorough and
comprehensive dataset. At that point, this research should be repeated. Moreover, more data on
exogenous variables, such as defensive positioning, should become available in the next few years, so
adding more variables to a larger training set could help the predictive value.
There is also a completely different approach to predicting BABIP. Rather than use
player averages over an entire season to try to predict BABIP, a different attempt at modeling could look at
each at bat individually. Given the depth of StatCast data and the adjoining exit angle, defensive
positioning, and hang time data, a logit model should be able to give a probability that a batted ball
will become a hit. Averaging the results of this model over an entire season could give an expected
BABIP devoid of luck. There is no telling whether this model would be much better, but it is an
alternative worth looking at.
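A rough sketch of this per batted ball approach is given below. It assumes a logit model that uses only exit velocity and exit angle; the coefficients are hypothetical placeholders rather than fitted values, and a real model would be estimated from StatCast data and could add defensive positioning and hang time.

```python
from math import exp


def hit_probability(exit_velocity_mph, exit_angle_deg, coefs):
    """Logistic probability that a single batted ball becomes a hit."""
    z = (
        coefs["intercept"]
        + coefs["exit_velocity"] * exit_velocity_mph
        + coefs["exit_angle"] * exit_angle_deg
    )
    return 1.0 / (1.0 + exp(-z))


def expected_babip(batted_balls, coefs):
    """Average the per-ball hit probabilities over a player's balls in play."""
    if not batted_balls:
        raise ValueError("need at least one batted ball")
    probabilities = [hit_probability(ev, angle, coefs) for ev, angle in batted_balls]
    return sum(probabilities) / len(probabilities)


# Hypothetical coefficients and batted balls, invented for illustration only.
placeholder_coefs = {"intercept": -5.0, "exit_velocity": 0.045, "exit_angle": 0.01}
season = [(95.0, 12.0), (88.0, -5.0), (102.0, 18.0), (79.0, 40.0)]
print(round(expected_babip(season, placeholder_coefs), 3))
```

Because each probability depends only on how the ball left the bat, the season average would not be affected by where the defense happened to stand on any particular play.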

Appendix
Table 1: Summary Statistics for Metrics

Metric             Mean           Median    Standard Deviation
BABIP              0.30893662     0.309     0.033764557
LD%                0.212          0.212     0.028549438
GB%                0.443767606    0.447     0.068567399
FB%                0.34421831     0.3495    0.069931366
IFFB%              0.086190141    0.086     0.043866303
HR/FB              0.120309859    0.1155    0.059034626
IFH%               0.066866197    0.0595    0.035786776
Pull%              0.397605634    0.394     0.059530316
Cent%              0.350443662    0.3495    0.034337908
Oppo%              0.252077465    0.246     0.042932649
Soft%              0.169485915    0.167     0.036206905
Med%               0.525901408    0.525     0.038956333
Hard%              0.304753521    0.306     0.056165649
Spd                4.102816901    3.95      1.684098943
ABs With Data      320.2112676    322.5     44.25579153
Avg - MPH          89.23725352    89.38     2.318449623
Avg - FB/LD MPH    92.27697183    92.39     2.636167724
Avg - GB MPH       86.94746479    87        2.551268791
O-Swing%           0.317943662    0.313     0.057985765
Z-Swing%           0.676295775    0.678     0.060572828
Swing%             0.477323944    0.479     0.051245962
SwStr%             0.091823944    0.087     0.030332268

Table 2: Actual BABIP vs Predicted

Name                 Actual BABIP   Quality of Contact   StatCast
                                    Prediction           Prediction
Lucas Duda           0.285          0.292                0.297
Jose Bautista        0.237          0.259                0.266
Todd Frazier         0.271          0.284                0.275
Brian McCann         0.235          0.248                0.254
Brandon Moss         0.285          0.286                0.282
Kris Bryant          0.378          0.314                0.313
Edwin Encarnacion    0.267          0.277                0.283
Jay Bruce            0.251          0.293                0.287
Justin Upton         0.304          0.302                0.303
Brian Dozier         0.261          0.280                0.291
Nolan Arenado        0.284          0.293                0.286
Asdrubal Cabrera     0.306          0.284                0.291
Anthony Rizzo        0.289          0.295                0.296
J.D. Martinez        0.339          0.319                0.308
Chris Davis          0.319          0.308                0.311
Aramis Ramirez       0.253          0.272                0.265
Carlos Beltran       0.297          0.275                0.280
Jimmy Rollins        0.246          0.272                0.279
Joc Pederson         0.262          0.285                0.292
Mookie Betts         0.310          0.296                0.298
Albert Pujols        0.217          0.263                0.259
Curtis Granderson    0.305          0.319                0.325
Matt Carpenter       0.321          0.327                0.331
Mike Moustakas       0.294          0.282                0.279
Derek Norris         0.310          0.275                0.280
David Ortiz          0.264          0.303                0.296
Kyle Seager          0.278          0.299                0.300
Trevor Plouffe       0.274          0.288                0.287
Addison Russell      0.324          0.281                0.289
Ian Kinsler          0.323          0.299                0.301
Logan Forsythe       0.323          0.290                0.296
Josh Reddick         0.278          0.293                0.309
Evan Longoria        0.309          0.291                0.301
Stephen Vogt         0.290          0.280                0.291
Nick Castellanos     0.322          0.314                0.314
Mark Trumbo          0.313          0.301                0.302
Bryce Harper         0.369          0.315                0.306
Logan Morrison       0.238          0.283                0.280

Yudelman 31
Marcus Semien
Alex Rodriguez
Manny Machado
Mike Trout
Andrew McCutchen
Josh Donaldson
Yoenis Cespedes
Brandon Belt
Evan Gattis
Salvador Perez
Carlos Santana
Yangervis Solarte
Wilmer Flores
Troy Tulowitzki
Freddy Galvis
Charlie Blackmon
Neil Walker
Kevin Pillar
Adrian Gonzalez
Ryan Howard
Carlos Gonzalez
Dexter Fowler
Adam Jones
Marlon Byrd
Daniel Murphy
Jose Reyes
Adrian Beltre
Prince Fielder
Kole Calhoun
Jose Altuve
Matt Kemp
Adam Lind
Paul Goldschmidt
Gregory Polanco
Kendrys Morales
Mitch Moreland
Torii Hunter
Chris Owings
Chris Coghlan
Didi Gregorius
Angel Pagan

0.312
0.278
0.297
0.344
0.339
0.314
0.323
0.363
0.264
0.27
0.261
0.279
0.273
0.331
0.309
0.325
0.306
0.306
0.294
0.272
0.284
0.308
0.286
0.297
0.278
0.301
0.295
0.323
0.304
0.329
0.311
0.309
0.382
0.308
0.319
0.317
0.258
0.305
0.284
0.297
0.31

0.308
0.289
0.296
0.346
0.327
0.307
0.311
0.355
0.291
0.263
0.277
0.280
0.288
0.309
0.307
0.328
0.304
0.304
0.319
0.321
0.294
0.309
0.288
0.314
0.299
0.276
0.308
0.292
0.303
0.280
0.333
0.303
0.352
0.313
0.304
0.301
0.282
0.340
0.323
0.292
0.304

0.318
0.292
0.296
0.348
0.325
0.306
0.309
0.347
0.292
0.269
0.289
0.269
0.285
0.298
0.309
0.322
0.301
0.303
0.316
0.322
0.288
0.316
0.280
0.308
0.296
0.278
0.303
0.287
0.309
0.272
0.309
0.294
0.353
0.313
0.303
0.303
0.280
0.334
0.318
0.297
0.310

Yudelman 32
Nelson Cruz
Brett Gardner
Buster Posey
Brandon Crawford
Matt Duffy
Russell Martin
Kolten Wong
Joey Votto
Brett Lawrie
Miguel Cabrera
Ben Zobrist
Pablo Sandoval
Jose Abreu
Yadier Molina
Elvis Andrus
Jace Peterson
Michael Taylor
Michael Brantley
Billy Butler
Lorenzo Cain
Jhonny Peralta
Ian Desmond
Ryan Braun
Chase Headley
Brandon Phillips
Martin Prado
Alcides Escobar
Melky Cabrera
Gerardo Parra
Kevin Kiermaier
Odubel Herrera
Alexei Ramirez
A.J. Pollock
Starlin Castro
Shin-Soo Choo
Billy Burns
Jason Kipnis
Adam Eaton
Nick Markakis
Francisco Cervelli
Avisail Garcia

0.35
0.312
0.32
0.294
0.336
0.262
0.296
0.371
0.32
0.384
0.288
0.27
0.333
0.295
0.283
0.296
0.311
0.318
0.282
0.347
0.311
0.307
0.322
0.317
0.315
0.313
0.286
0.297
0.325
0.306
0.387
0.264
0.338
0.298
0.335
0.339
0.356
0.345
0.338
0.359
0.32

0.315
0.318
0.313
0.310
0.258
0.296
0.311
0.342
0.304
0.352
0.287
0.290
0.318
0.301
0.308
0.315
0.331
0.310
0.299
0.345
0.313
0.308
0.346
0.308
0.332
0.312
0.316
0.314
0.336
0.320
0.339
0.284
0.337
0.277
0.318
0.314
0.347
0.342
0.313
0.325
0.330

0.320
0.334
0.299
0.307
0.277
0.304
0.307
0.342
0.310
0.349
0.295
0.290
0.313
0.297
0.309
0.320
0.333
0.307
0.303
0.339
0.307
0.313
0.334
0.317
0.327
0.319
0.316
0.316
0.330
0.322
0.344
0.285
0.324
0.277
0.326
0.317
0.353
0.344
0.314
0.324
0.327

Yudelman 33
David Peralta
Matt Duffy
Erick Aybar
Ender Inciarte
Xander Bogaerts
Robinson Cano
Anthony Gose
Wilson Ramos
Austin Jackson
Eric Hosmer
Jean Segura
Jason Heyward
Brock Holt
Yunel Escobar
Starling Marte
Andrelton Simmons
Joe Mauer
Cameron Maybin
DJ LeMahieu
Ben Revere
Dee Gordon
Christian Yelich

0.368
0.336
0.3
0.329
0.372
0.316
0.352
0.256
0.342
0.336
0.298
0.329
0.35
0.347
0.333
0.285
0.309
0.316
0.362
0.338
0.383
0.37

0.337
0.342
0.300
0.337
0.336
0.325
0.349
0.302
0.344
0.349
0.324
0.327
0.336
0.323
0.338
0.304
0.347
0.331
0.379
0.335
0.335
0.364

0.333
0.337
0.293
0.328
0.333
0.322
0.350
0.314
0.342
0.349
0.322
0.322
0.335
0.323
0.337
0.306
0.354
0.345
0.385
0.339
0.334
0.361
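
One way to compare the two prediction columns against actual BABIP is with summary error measures. The sketch below is an illustration rather than part of the author's analysis: it uses only the first five rows of Table 2, copied as printed, so the resulting numbers say nothing about the full table, and the helper names rmse and mae are hypothetical.

```r
# First five rows of Table 2, copied as printed
tab <- data.frame(
  Name     = c("Lucas Duda", "Jose Bautista", "Todd Frazier",
               "Brian McCann", "Brandon Moss"),
  Actual   = c(0.285, 0.237, 0.271, 0.235, 0.285),
  Quality  = c(0.292, 0.259, 0.284, 0.248, 0.286),
  StatCast = c(0.297, 0.266, 0.275, 0.254, 0.282)
)

# Hypothetical helpers: root mean squared error and mean absolute error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

errors <- data.frame(
  Model = c("Quality of Contact", "StatCast"),
  RMSE  = c(rmse(tab$Actual, tab$Quality), rmse(tab$Actual, tab$StatCast)),
  MAE   = c(mae(tab$Actual, tab$Quality),  mae(tab$Actual, tab$StatCast))
)
print(errors)
```

Running the same two functions over all rows of the table, or over the testing set only, would give a like-for-like comparison of the two models.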

R Script:
# Assumes the data frame Data has already been loaded into the R session
# Diagnostic Plots
# Box Plot for BABIP
boxplot(Data$BABIP, main = "BABIP in 2015 for Qualified Hitters")
# Line Graph for Batted Ball Types
plot(Data$GB., type = "b", col = "blue", lwd = 2, ylim = c(0, 1),
     ylab = "Percentage of Hits", xlab = "Player",
     main = "Hit Distribution in 2015 for Qualified Hitters")
lines(Data$FB., col = "red", type = "b", lwd = 2)
lines(Data$LD., col = "green", type = "b", lwd = 2)
legend("topright", legend = c("GB%", "FB%", "LD%"),
       lty = 1, lwd = 2, pch = 21, col = c("blue", "red", "green"),
       ncol = 2, bty = "n", cex = 0.8,
       text.col = c("blue", "red", "green"),
       inset = 0.01)
# Line Graph for Quality of Contact
plot(Data$Soft., type = "b", col = "blue", lwd = 2, ylim = c(0, 1),
     ylab = "Percentage of Hits", xlab = "Player",
     main = "Quality of Contact Distribution in 2015 for Qualified Hitters")
lines(Data$Med., col = "red", type = "b", lwd = 2)
lines(Data$Hard., col = "green", type = "b", lwd = 2)
legend("topright", legend = c("Soft%", "Med%", "Hard%"),
       lty = 1, lwd = 2, pch = 21, col = c("blue", "red", "green"),
       ncol = 2, bty = "n", cex = 0.8,
       text.col = c("blue", "red", "green"),
       inset = 0.01)
# Box Plot for Speed Score
boxplot(Data$Spd, main = "Speed Score in 2015 for Qualified Hitters")
# Line Graph for StatCast data
plot(Data$Avg...GB.MPH, type = "b", col = "blue", lwd = 2, ylim = c(80, 115),
     ylab = "Average MPH", xlab = "Player",
     main = "StatCast Data off the Bat in 2015 for Qualified Hitters")
lines(Data$Avg...FB.LD.MPH, col = "red", type = "b", lwd = 2)
lines(Data$Avg...MPH, col = "green", type = "b", lwd = 2)
legend("topright",
       legend = c("Average MPH for Ground Balls",
                  "Average MPH for Line Drives and Fly Balls",
                  "Average MPH Overall"),
       lty = 1, lwd = 2, pch = 21, col = c("blue", "red", "green"),
       ncol = 2, bty = "n", cex = 0.8,
       text.col = c("blue", "red", "green"),
       inset = 0.01)
# Coerce columns to numeric
Data$HR.FB <- as.numeric(Data$HR.FB)
Data$Z.Swing. <- as.numeric(Data$Z.Swing.)
Data$O.Swing. <- as.numeric(Data$O.Swing.)
Data$SwStr. <- as.numeric(Data$SwStr.)
Data$IFFB. <- as.numeric(Data$IFFB.)
# Randomly subsetting Data into Testing and Training Sets (75% training)
smp_size <- floor(0.75 * nrow(Data))
# Set the seed to make datasets reproducible
set.seed(123)
# Create Samples
train_ind <- sample(seq_len(nrow(Data)), size = smp_size)
TrainingSet <- Data[train_ind, ]
TestingSet <- Data[-train_ind, ]
# Checking correlations with a correlation matrix
d <- data.frame(Data$LD., Data$GB., Data$FB., Data$IFFB., Data$HR.FB,
                Data$Soft., Data$Med., Data$Hard.,
                Data$Avg...MPH, Data$Avg...FB.LD.MPH, Data$Avg...GB.MPH,
                Data$Pull., Data$Cent., Data$Oppo.,
                Data$O.Swing., Data$Z.Swing., Data$SwStr.,
                Data$O.Contact., Data$Z.Contact., Data$Contact.)
M <- cor(d)
install.packages("corrplot")
library(corrplot)
corrplot(M, method = "circle")
# Building Quality of Contact Models
QualityOfContactModel <- lm(BABIP ~ LD. + FB. + Oppo. + Hard. + Spd + Swing. + Contact.,
                            data = TrainingSet)
summary(QualityOfContactModel)
QualityOfContactModelUpdated <- lm(BABIP ~ LD. + FB. + Oppo. + Hard. + Spd, data = TrainingSet)
summary(QualityOfContactModelUpdated)
# Omit F-Test
anova(QualityOfContactModel)
anova(QualityOfContactModelUpdated)
# Checking for IID (residual diagnostic plots)
plot(QualityOfContactModel)
# Check for Heteroscedasticity (Breusch-Pagan test)
install.packages("lmtest")
library(lmtest)
bptest(QualityOfContactModel)
# Building StatCast Model
StatCastModel <- lm(BABIP ~ LD. + FB. + Oppo. + Avg...FB.LD.MPH + Spd + Swing. + Contact.,
                    data = TrainingSet)
summary(StatCastModel)
StatCastModelUpdated <- lm(BABIP ~ LD. + FB. + Oppo. + Avg...FB.LD.MPH + Spd, data = TrainingSet)
summary(StatCastModelUpdated)
# Omit F-Test
anova(StatCastModel)
anova(StatCastModelUpdated)
# Checking for IID (residual diagnostic plots)
plot(StatCastModel)
# Check for Heteroscedasticity (Breusch-Pagan test)
bptest(StatCastModel)
# Plotting Against Testing Set
plot(TestingSet$BABIP, predict(StatCastModel, TestingSet),
     main = "StatCast Model Against Actual")
plot(TestingSet$BABIP, predict(QualityOfContactModel, TestingSet),
     main = "Quality of Contact Model Against Actual")
# Creating Final Dataset
FinalDataSet <- data.frame(Data$Name, Data$BABIP)
FinalDataSet$QualityofContactPrediction <- predict(QualityOfContactModel, Data)
FinalDataSet$StatCastPrediction <- predict(StatCastModel, Data)
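
The "Omit F-Test" steps in the script call anova() on each model separately, which reports sequential tests for individual terms. A joint test of whether the dropped variables (Swing. and Contact.) can be omitted is usually obtained by passing the restricted and full models to anova() together. The sketch below shows that pattern on simulated data. It is an assumption about what the omit test was meant to check, not the author's code: the data frame sim, its values, and the coefficients used to generate BABIP are all hypothetical.

```r
# Simulated stand-in for the training set: column names mirror the script,
# but every value and coefficient here is made up for illustration.
set.seed(123)
n <- 106
sim <- data.frame(
  LD.      = rnorm(n, 0.21, 0.03),
  FB.      = rnorm(n, 0.34, 0.07),
  Oppo.    = rnorm(n, 0.25, 0.04),
  Hard.    = rnorm(n, 0.30, 0.06),
  Spd      = rnorm(n, 4.1, 1.7),
  Swing.   = rnorm(n, 0.48, 0.05),
  Contact. = rnorm(n, 0.80, 0.05)
)
sim$BABIP <- 0.2 + 0.4 * sim$LD. - 0.1 * sim$FB. + 0.1 * sim$Oppo. +
  0.1 * sim$Hard. + 0.005 * sim$Spd + rnorm(n, 0, 0.02)

# Full model and the restricted model that omits Swing. and Contact.
full       <- lm(BABIP ~ LD. + FB. + Oppo. + Hard. + Spd + Swing. + Contact., data = sim)
restricted <- lm(BABIP ~ LD. + FB. + Oppo. + Hard. + Spd, data = sim)

# Joint F-test of H0: the coefficients on Swing. and Contact. are both zero
omit_test <- anova(restricted, full)
print(omit_test)
```

A large p-value in the second row of the output would support dropping the two variables, which is the decision the updated models in the script reflect.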

Works Cited
BABIP. (n.d.). Retrieved November 19, 2015, from http://www.fangraphs.com/library/pitching/babip/
Batted ball direction. (n.d.). Retrieved November 19, 2015, from
http://www.fangraphs.com/library/offense/batted-ball-direction/
Blengino, T. (2015, August 6). The limitations of the 2015 statcast data. Retrieved November 19, 2015, from
http://www.fangraphs.com/blogs/the-limitations-of-the-statcast-data/
Chamberlain, A. (2015, May 6). New hitter xBABIP based on BIS batted ball data. Retrieved November 19, 2015, from
http://www.fangraphs.com/fantasy/new-hitter-xbabip-based-on-bis-batted-ball-data/
Dutton, C. (2008, December 2). Batters and BABIP. Retrieved November 19, 2015, from
http://www.hardballtimes.com/batters-and-babip/
Moneyball. (n.d.). Retrieved November 19, 2015, from Internet Movie Database:
http://www.imdb.com/title/tt1210166/
Quality of contact stats. (n.d.). Retrieved November 19, 2015, from
http://www.fangraphs.com/library/offense/quality-of-contact-stats/
Spd. (n.d.). Retrieved November 19, 2015, from http://www.fangraphs.com/library/offense/spd/
Swartz, M. (2010, March 23). Ahead in the count: Predicting BABIP, part 1. Retrieved November 19, 2015, from
http://www.baseballprospectus.com/article.php?articleid=10333
