A Thesis
Presented to the
Faculty of
In Partial Fulfillment
Master of Science
in
Statistics
by
Summer 2010
Copyright
© 2010
by
Vincent Stanley Dayes
A unique and highly practical system for identifying good and bad bets at the major
Southern California Thoroughbred racetracks is created and analyzed. A probability model
for each individual race is created as a function of odds, Morning Line, each horse’s past
performances, current trainer and jockey, and miscellaneous factors depending on type of
race. A continuous variable, “Perf” (a numerical performance estimator), is used as
the response variable in the regression analysis performed. After obtaining new estimates for
Perf, Monte Carlo methods were then implemented to calculate probabilities of each horse’s
1st, 2nd, 3rd, or 4th place finish. Horses were then grouped according to Odds, and reports
were generated to analyze results and calculate Expected Values. To find the numerous hidden
factors and patterns that only occur under specific conditions, numerous subsets of races and
horses were analyzed using hundreds of covariates. A Baseline of probabilities is created using
a simple model based mainly on odds of a horse. Then the final model probabilities resulting
from the estimated regression parameters equation are compared to the baseline probabilities.
Those that differ significantly are separated into two groups: Estimated probabilities higher
than the baseline’s are considered profitable bets “Overlays”, while those less than the
baseline’s are “Underlays” (unprofitable bets). Each group is displayed in the odds-based
report format. In total, 10 3/4 years of horse-racing data are used, with 8 3/4 years set up as the “Regression”
Dataset and the two most recent complete years as the “Testing” Dataset. Of primary interest is
the Profitability or Expected Values of each group. In sum, various parameters and wagering
options are analyzed for their positive or negative effects on profitability.
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
  1.1 History
  1.2 Statement of Problem
  1.3 Objective
  1.4 A Typical Horse Race
  1.5 Definition of Terms
2 DATA
  2.1 Variables Input into SAS
  2.2 Summary Statistics of Numerical Data
  2.3 Subgroups
  2.4 Data Separated into Odds Ranges
  2.5 The Daily Racing Form for the Serious Handicapper
3 METHODOLOGY
  3.1 Perf: The Important Response Variable
  3.2 Data Preparation in MS ACCESS
  3.3 SAS Operations and Processing
    3.3.1 Non-Indicator Covariates Created in SAS
    3.3.2 Indicator Type Covariates Created in SAS
    3.3.3 WBF Exponent Found Using Box-Cox Method
    3.3.4 SAS Regression and Model Selection
  3.4 Matlab: Simulating Horse Races for Probability Estimates
  3.5 Comparing Probability Files in ACCESS
4 RESULTS
  4.1 Overlays
  4.2 Underlays
  4.3 Comparisons of Results by the Four Major Odds Ranges
5 MULTICOLLINEARITY
6 DISCUSSION
  6.1 Response Variable: Perf (and Power Point)
  6.2 Subgroups
  6.3 Limitations of the Study
  6.4 Predictor Variables Included in the Final Regression Model
  6.5 Miscellaneous
7 CONCLUSIONS
BIBLIOGRAPHY
LIST OF TABLES
Table 2.1 Descriptive Statistics of Numerical Variables
Table 2.2 Regression Data by Collapsing Odds Ranges
Table 3.1 Trainer Names and ID Codes
Table 3.2 Test Data Baseline Model
Table 3.3 Regression Model Coefficients
Table 4.1 Overlays: Comparison of Win Results(A) to Baseline(B)
Table 4.2 Overlays: Comparison of 2nd, 3rd, and 4th Results(B) to Baseline(A)
Table 4.3 Underlays: Comparison of Win Results(A) to Baseline(B)
Table 4.4 Underlays: Comparison of 2nd, 3rd, 4th Results(A) to Baseline(B)
Table 5.1 Covariates With Variance of Inflation Greater Than 2.0
Table 7.1 Comparison of Overlays and Underlays Totals
Table 7.2 Odds Range 9-27 of Overlays and Underlays Totals
LIST OF FIGURES
Figure 1.1 Daily Racing Form. Note abundance of information for each horse.
Figure 4.1 Win percentage comparisons between Underlays, Baseline, and
Overlays by odds ranges. Win percentages for Overlays substantially
greater than those for Underlays.
Figure 4.2 EV comparisons between Underlays, Baseline, and Overlays by odds
ranges. EVs for Overlays are much greater than those for Underlays.
Figure 4.3 Test results: Finish comparisons between Underlays, Baseline, and
Overlays by odds ranges. Total percentages significantly greater for
Overlays than Underlays except for 0-4 range.
CHAPTER 1
INTRODUCTION
Perhaps the most complex and challenging multi-entry competition is the horse race.
Each horse race is essentially unique and independent of the others. The race conditions,
restrictions and eligibility requirements determine which horses are allowed to be entered in a
race. These conditions apply to all the horses in the race; in a “Maiden” race, for example, only
horses which have never won a race in their lives may be entered. Typical restrictions are by
sex, age, state bred in, types of and number of races previously won, etc. Race conditions may
be distance, racing surface (dirt, turf or synthetic track), physical condition of track, purse
offered, etc. Thus each race is a cluster of horses running under race-specific factors.
Horse-specific factors are jockey, trainer, post-position, equipment (blinkers, type of shoe,
etc.), (legal) drugs, assigned weight, etc. But for the serious handicapper, the most important
information is the past-performances for each horse listed in the “Daily Racing Form.” Listed
in chronologically descending order, the previous (up to 10) races of each horse are
capsulized.
1.1 HISTORY
Gambling on horse races has been around since man first started riding horses.
Modern horse racing exists because it is a popular form of legalized gambling and is accepted
as benefitting local and state economies by generating large amounts of tax dollars and
providing jobs and money.
Statisticians have been analyzing horse racing data for many years, with milestone
works by Harville [1], Henery [2], Stern [3] and others. Many other disciplines also have
researchers investigating the ponies. Hausch et al. [4] cover articles from economists,
psychologists, management scientists, probability theorists as well as professional gamblers.
The first model proposed by Harville [1] is a simple way of computing ordering probabilities
based on winning probabilities. Henery [2] suggested using a normal distribution for
estimating running times, whereas Stern [3] recommended using gamma distributions for the
same purpose. Bacon-Shone, Lo and Busche [5] and Lo and Bacon-Shone [6] showed that the
Henery and Stern models were better fits than the Harville model for particular racing data.
However, since both the Henery and Stern models are complicated to use in practice, Lo and
Bacon-Shone [7] suggested a simple approximation for both the Henery and Stern models.
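As a brief illustration of the Harville approach mentioned above, the following sketch computes the probability of a specific first-and-second finishing order from win probabilities alone, using the standard form of the model, P(i first, j second) = p_i p_j / (1 - p_i). The function name and the three-horse probabilities are illustrative assumptions and are not taken from the data in this study.

```python
def harville_exacta(win_probs, first, second):
    """Probability that horse `first` wins and horse `second` runs 2nd.

    Harville's model: once the winner is removed, the remaining horses
    keep their win probabilities in the same proportions.
    """
    if first == second:
        raise ValueError("first and second must be different horses")
    p_first = win_probs[first]
    p_second = win_probs[second]
    return p_first * p_second / (1.0 - p_first)


# Hypothetical three-horse race (win probabilities sum to 1).
probs = {"A": 0.5, "B": 0.3, "C": 0.2}
p_ab = harville_exacta(probs, "A", "B")   # 0.5 * 0.3 / (1 - 0.5)
```

Summing the result over every ordered pair of horses returns 1, which is a quick check that the ordering probabilities are coherent.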
Also heavily investigated is the favorite-longshot bias (where favorites are typically
underbet so odds are too high and longshots overbet so their odds are too low) that has often
been found in gambling data (see Ali [8], Asch et al. [9], Ziemba and Hausch [10], Lo [4], and
Bacon-Shone and Lo [11]). This bias also appears in this study, but is not the main focus.
Essentially all the works cited in this Section were aimed at finding models that would
accurately estimate probabilities that could result in turning a profit at the racetrack, by finding
profitable wagers and/or avoiding the unprofitable ones. This work is aimed at developing a
system that facilitates evaluating whichever patterns, variables, statistics, etc. a
handicapper might wish to investigate, and at continually improving an already useful model by
adding new covariates that are significant.
1.3 OBJECTIVE
The Objective is to create and analyze a practical system for weighing all the positive
and negative factors associated with each horse in a race and then calculating probabilities for
1st, 2nd, 3rd, and 4th place finishes. The horses whose (1st place) probabilities are
significantly higher than the probabilities reflected by their odds should be good bets
(Overlays), and those whose probabilities are significantly lower are wagers to avoid
(Underlays). Finding a response variable that numerically rates a horse’s performance is also
part of the objective.
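The comparison described in the Objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: odds-based probabilities are approximated by normalizing 1/(odds + 1) within a race, and the `margin` parameter is a hypothetical stand-in for "significantly higher or lower". All names and numbers are illustrative.

```python
def reflected_probs(odds):
    """Normalize 1/(odds + 1) so the race's probabilities sum to 1."""
    raw = [1.0 / (o + 1.0) for o in odds]
    total = sum(raw)
    return [r / total for r in raw]


def classify(model_prob, odds_prob, margin=0.25):
    """Label a horse by comparing model and odds-based win probabilities.

    `margin` is a hypothetical relative threshold; the cutoff actually
    used in the study is not assumed here.
    """
    if model_prob > odds_prob * (1.0 + margin):
        return "Overlay"
    if model_prob < odds_prob * (1.0 - margin):
        return "Underlay"
    return "Neither"


# Hypothetical three-horse race: final odds (to 1) and model estimates.
odds = [1.0, 3.0, 3.0]
market = reflected_probs(odds)
model = [0.30, 0.45, 0.25]
labels = [classify(m, q) for m, q in zip(model, market)]
```

Here the favorite is rated below its odds-based probability and the second horse well above it, so the two receive opposite labels while the third horse is left unclassified.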
Baseline : Estimated Perfs are derived from the Simple Regression Model, which is based
only on odds and number of horses in race. From each horse’s estimated Perf,
estimated probabilities are calculated; these are the baseline values against which the estimated
probabilities from the final regression model are compared
Betting Pool : Each type of wager has its own pool of money bet, separate from all other
pools
Blinkers : A hood placed over a horse’s head with cups sewn onto the eye openings. This
restricts a horse’s vision so it can only see straight ahead
Box-Cox Method : Used to find the best fit of the win bet fraction (wbf) to the performance
response variable (Perf) by finding the exponent λ that minimizes Sum of Squares
Error. A new predictor variable, wbfAll, equal to wbf raised to λ, is used in place of wbf
(see Kutner [12])
Breakage : This is due to odds being rounded downward to the nearest tenth of a dollar and
the wagering establishment keeping the difference
Claimed : When a horse runs in a claiming race and is “claimed” by a licensed owner or
trainer it is purchased for the claiming amount specified for that race. The horse must be
in the starting gate when the race goes off. Once the race starts, the horse officially
belongs to the new owner even if it is injured or drops dead, but any monies won go to
the original owner
Claiming Race : Horses which may be claimed (purchased) for a specified price
Class : Level of competition - numerical evaluation of the general strength of a race. The
concept of “Class” is used here to categorize numerically the quality of a race and
therefore its entrants. The strongest runners are in the highest classes (and highest
purses to be won) and vice versa
Daily Double : A wager where the winners of two consecutive races must be picked to win
the bet. Originally the first two races of the day, now most tracks offer this on all
consecutive races
Daily Racing Form : Resembles a small newspaper filled with racing information for each
horse running on a particular day (see Figure 1.1)
Entry : When two or more horses are entered in a race and are considered a single entity for
wagering purposes
Exacta : An exotic wager where the exact order of the first two finishers in a single race is
specified
Exotics : Newer, more complicated bets such as Trifectas, Superfectas on single races and
multiple race bets like the Pick 3, Pick 4, and Pick 6
EV : Expected Value - used here in same sense as Profitability - expected or average return
on a wager
Figure 1.1. Daily Racing Form. Note abundance of information for each horse.
Favorite : The most heavily bet horse in a race
Handicapper : An experienced Daily Racing Form reader able to hold huge amounts of
information in his head and at the same time judge the relative merits of each horse in a
race, coming up with estimates of winning probabilities
Handicap Race : A stakes race where weights (see weight) are assigned according to a horse’s
past performances
House Take : House “Cut”, Track Percentage - the amount taken out of the Betting Pool by
the House or race track. For simple pools like the win, place or show, it is around 14%
to 18%. For the exotic pools, it is around 20%. It varies by state and race track
Indicator 0 or 1 : Covariates are set to 1 if they occur, otherwise they are set to 0
Lasix : Legal anti-bleeding drug - common in California, illegal in some states and countries
Length : About nine feet - the length of a generic horse from the tip of its nose to the end of
its tail (when running) - also a rough time measurement: one length is about a fifth of a
second
Maiden Race : Races only for horses who have never won a race
Major Odds Range : In this study, the (four) major ranges are: 0.1 to 4.0, 4.1 to 9.0, 9.1 to
27.0, and 27.1 and UP
Overlay : When probability of a horse winning is greater than the probability indicated by its
odds
Out Finish : Finishing 5th place or worse - not 1st, 2nd, 3rd or 4th
Pace Style : The usual early-race location of a horse - may be forwardly placed early or in
the rear
Paddock : The area where the horses are viewed before a race
Past Performances : Daily Racing Form information lines (see Figure 1.1)
Photo Finish : A close finish where the finish picture must be examined to determine the
order of finish
Pick 3, 4, or 6 : Wagers where the winners of all the included races must be picked
Place : A wager where a bettor wins if his horse comes in 1st or 2nd. Also the place position
is 2nd place
Post Parade : After horses leave the paddock and before the race, the horses come out onto
the racetrack and parade in front of the grandstands
Post Time : Official time horses are supposed to be at the starting gate. Most races start a few
minutes after post time
Power Point : A numerical indicator of a strength threshold value at the finish of a race: a
function of the number of horses in the race and the distances in lengths between the top
four finishers. Originally equal to the second place finish for races with 7 or fewer
horses, equal to the midpoint between 2nd and 3rd for races with 8, 9, and 10 horses,
and equal to 3rd place finish for races with 11 or more horses. If a horse finished at the
Power Point, it was assigned a Perf of zero. If a horse finished ahead of the Power Point,
Perf equaled the number of lengths ahead times 10; if behind, Perf was minus the number
of lengths behind multiplied by 10
Purse : Prize money offered in a race of which typically 60% goes to the winner, 20% to 2nd
place, 12% 3rd, 6% to 4th and 2% to 5th (the distribution percentages vary from state to
state)
Racetrack : The three California tracks in this study are all flat ovals with Santa Anita and
Del Mar being a mile in circumference and Hollywood being a mile and 1/8. The turf or
grass course is just inside of the main course which up until 2007 was a dirt track.
Racetracks are publicly owned but strictly regulated by state agencies
Reflected Probs : Inverted odds: Probabilities that reflect how a horse is bet - with estimated
Track Percentage taken into account
Regression Data : Data used to develop model and find Estimated Parameters/Regression
Coefficients
Regression Funct. : Model/equation used to predict new values of response variable Perf
from Test Data/Prediction Set
Results Dataset : Subset of Testing Dataset consisting of horses whose estimated win
probabilities (from the Regression Function) differ significantly from the Baseline win
probabilities
Saddle Cloth Number : Official number of horse, used when placing bets or checking
results - frequently the same as the post position, but not always
Saving Ground : Minimizing distance horse has to run by staying close to the inside rail on
turns
Scratch : A horse does not run (for whatever reason) in a race it is entered in
Show : A wager where the bettor wins if his pick comes in 1st, 2nd or 3rd. Usually has a
small return, sometimes 10 cents on the dollar
Stakes Race : Highest class of races with the largest purses, for example, the Kentucky
Derby
Superfecta : An exotic wager where the exact order of the first four finishers in a single race
is specified
Synthetic Track : In May, 2006, the California Horse Racing Board mandated that all
California horse racing tracks had to switch their dirt tracks to synthetic surfaces for the
safety and welfare of horses. Hollywood has been using Cushion Track since November
2006. Santa Anita tried Cushion Track from September 2007 to summer of 2008, when
it switched to Pro-Ride due to drainage problems. Del Mar has been using Polytrack
since July 2007. Polytrack is similar to Pro-Ride and the two have been treated as the
same surface type in this study
Testing Data : Data set aside for model validation - data is not used in creating Regression
Function/Estimated Parameters
Totalizer Board : Huge display at race track displaying all important betting information
Trainer : Responsible for the training and behavior of horses and for overseeing their exercise
routine; also selects races for a horse to run in and picks the jockey
Trifecta : An exotic wager where the exact order of the first three finishers in a single race is
specified
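The Power Point entry above lends itself to a short worked sketch. The code below follows the original rule as stated there (2nd place for 7 or fewer horses, the midpoint between 2nd and 3rd for 8 to 10 horses, 3rd place for 11 or more) and scores each horse at 10 points per length ahead of or behind that point. Representing finishes as lengths behind the winner is an assumption made for illustration, as are the function names and sample values; any later refinement or capping of Perf is not modelled.

```python
def power_point(beaten_lengths, n_horses):
    """Lengths behind the winner at which Perf is zero.

    `beaten_lengths` lists each finisher's distance behind the winner,
    in finishing order (winner first, so the first entry is 0).
    """
    second, third = beaten_lengths[1], beaten_lengths[2]
    if n_horses <= 7:
        return second
    if n_horses <= 10:
        return (second + third) / 2.0
    return third


def perf(beaten, pp):
    """Perf = 10 x lengths ahead of (+) or behind (-) the Power Point."""
    return 10.0 * (pp - beaten)


# Hypothetical 8-horse race: beaten lengths of the first four finishers.
finish = [0.0, 1.0, 3.0, 4.5]
pp = power_point(finish, n_horses=8)      # midpoint of 1.0 and 3.0
perfs = [perf(b, pp) for b in finish]     # [20.0, 10.0, -10.0, -25.0]
```

With the same finish in a 7-horse race the Power Point would sit at the second-place horse, so every Perf in the example would drop by 10.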
CHAPTER 2
DATA
The data comes from the three major southern California Horse Racing Tracks: Santa
Anita (Los Angeles), Hollywood (Los Angeles), and Del Mar (San Diego). These three tracks
form a circuit since only one is open at a time and the same horses, trainers, and jockeys move
from track to track. Thus the data has a consistent, homogenous nature. The races were run
from January 1999 to November 2009 - 10 3/4 years. Out of 23,478 races, 19,930 races were
used for the individual race model (168,253 horses), the others being rejected because of too
few horses (minimum 6 horses in a race), abnormalities, entries (multiple horses coupled
together for betting purposes, causing complications in odds analysis), and corrupted data.
There are three different types of race data. Current Race data is data a
handicapper has before the race goes off (typically found in The Daily Racing Form). Results
data is the results of the Current Races. Past Performances data is how a horse performed in
previous races, so it is a combination of the other two types of data.
The pre-race data was exported from The Daily Racing Form files, imported into MS
ACCESS, checked for errors, processed for easy analysis and then exported from ACCESS in
Comma-Delimited files, which were then read and analyzed by SAS and Matlab. The results
data and the final odds came from Equibase Inc. which specializes in horse-racing results
data. The data was purchased through Post Time Solutions, Inc.
The 10 3/4 years of data was split into two groups: the first (Regression Dataset) is 8
3/4 years of data (1/27/99 to 11/05/07 - 16,284 races/136,855 horses) and the second (Testing
Dataset) is the last two years of data (11/06/07 to 11/07/09 - 3,646 races/31,398 horses).
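A chronological split of this kind can be sketched in a few lines. The example treats 11/05/07, the last day of the Regression Dataset given above, as a single cutoff date; the function name and the sample race dates are hypothetical and serve only to show the assignment rule.

```python
from datetime import date

# Cutoff taken from the text: Regression Data runs through 11/05/07.
REGRESSION_END = date(2007, 11, 5)


def assign_dataset(race_date):
    """Return which dataset a race belongs to, based on its date."""
    return "Regression" if race_date <= REGRESSION_END else "Testing"


# Hypothetical race dates, for illustration only.
races = [date(1999, 1, 27), date(2007, 11, 5), date(2007, 11, 6), date(2009, 11, 7)]
split = [assign_dataset(d) for d in races]
```

Splitting by date rather than at random keeps the Testing Dataset strictly later than the data used to fit the model, which mirrors how the model would be used in practice.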
age : Age of horses allowed in race: Age 2: 13.27% of races, age 3: 19.05%, age 4: 2.61%,
age 3UP: 41.76%, age 4UP: 23.31%. Note that there were 8 races for 3 and 4 year olds
only
blinks : Blinkers changed: X = Blinkers taken off: 2.59%, B = Blinkers put on: 4.95%, No
change in blinkers: 92.45%
Table 2.1. Descriptive Statistics of Numerical Variables
Variable       Minimum   Median  Maximum  Std Dev  10th PCT  90th PCT
days1st              2       29     1876    85.64        15       180
days2nd             11       72     1903   109.39        37       223
days3rd             17      116     1945   124.40        62       297
dist                20       65      140    13.07        55        85
horseAge             2        3       12     1.31         2         5
ML1               0.01     0.10     0.68     0.07      0.03      0.22
monthBorn            1        3       12     1.56         2         5
nhor                 6        9       14     1.95         6        12
numLines             0        6       10     3.83         0        10
numLineDiff      -8.90        0     8.75     2.02     -2.57      2.40
odds              0.10     8.70    243.2    21.69       2.2      44.6
odds1             0.05     9.00    339.5    18.62       2.0      35.6
odds2             0.05    10.40    339.5    16.66       2.0      30.7
perf              -210      -30      121    99.67      -210        55
pp                   1        5       14     2.78         1         9
speed1Diff           0        2       10     3.83         0        10
speed12Diff          0        3       10     3.77         0        10
speed123Diff         0        4       10     3.74         0        10
turfStarts           0        0       78     5.92         0        10
turfWins             0        0       15     1.31         0         2
wbf              0.004     0.10     0.91     0.13      0.02      0.32
wbfAll            0.43     0.70     0.99     0.11      0.55      0.84
wbfOld1          0.003     0.10     0.95     0.13      0.03      0.33
wbfOld2          0.003     0.09     0.95     0.13      0.03      0.33
cl12 : Claim indicator for last three races: 1 = claimed in last race, 2 = claimed in second race
back, 4 = claimed in 3rd race back - 19,366 horses out of 168,253 were claimed in at
least one of their last three races (11.51%)
days2nd : Number of days since 2nd race back (see Table 2.1)
days3rd : Number of days since 3rd race back (see Table 2.1)
dist : Distance of race in tenths of a furlong - furlong is 1/8th of a mile - from 20 to 140 (1/4
mile to 1 3/4 mile); most common distance: 60 or 6 furlongs (3/4 mile): 4,718 out of
19,930 races (see Table 2.1)
horseType : Type of horse: f = filly (female age 4 or less) 35.2%, m = mare (female age 5
and up) 6.5%, c = colt (male age 4 or less) 25.2%, h = horse (male age 5 and up) 5.7%,
g = gelding (castrated male any age) 27.2%
JockID : Three character code for professional race-riders. Of 470 different jockeys, 41 had
1000 or more rides
lasix1st : Indicator-type: 1 if first time horse has had the drug lasix in its life (1st time starters
not included) 3,760/168,253 (2.2%)
lasixL : L if horse has been given lasix, (96.5% have lasix, 3.5% do not)
monthBorn : Month horse is foaled - note that horses born in the same year all are
considered to have the same age whether born January 1 or December 31. 93.9% are
foaled in January thru May (March 27.8%, April 25.4%, and February 21.0%) and the
other 6.1% in June through December, being mainly Southern Hemisphere horses (see
Table 2.1)
nhor : Number of horses in a race. From 6 to 14 (Minimum was set to 6 for analysis
purposes). Percentages by number of horses: 6 - 17.8%, 7 - 19.9%, 8 - 21.1%, 9 -
16.7%, 10 - 13.8%, 11 - 7.9%, 12 - 6.9%, 13 - 2.0%, 14 - 0.8% (see Table 2.1)
numLines : Number of previous races to a maximum of 10. Refers to the number of “lines”
of past performances in the Daily Racing Form (see Table 2.1)
numLineDiff : For each race, the average number of lines is calculated. Then each horse’s
number of lines is subtracted to get numLineDiff (see Table 2.1)
odds : Final odds horse went off at: from 0.1 (minimum by law) to 243.2. For odds
distribution of Regression Data see Table 2.1
odds1 : Odds in last race. Note that in other states minimum odds may be 0.05. (see Table
2.1)
odds2 : Odds in 2nd race back (see Table 2.1)
sex of Race : Race restriction by sex: 41.56% of races were for female horses only - 58.44%
of races were for either sex, although only 270 out of 98,329 horses in them were fillies or mares
(0.3% running against the boys)
speed12Diff : Difference from average speed of the best of each horse’s last two races (see
Table 2.1)
speed123Diff : Difference from average speed of the best of each horse’s last three races (see
Table 2.1)
stateBred : Three character code for state or country horse was bred in. Most common states
are California: 36.4%, Kentucky: 25.4%, Florida: 6.1% and the most common foreign
countries are Ireland: 2.0%, Great Britain: 1.7%, and Argentina: 1.1%
track : Three racetracks: SA had 45.75% of the races, HOL had 36.26%, and DMR had
17.98% races
trainID : Three character code for trainers. There are 985 trainers, of whom Doug O’Neill
had the most horses entered: 4083, Bob Baffert had 3568, and 34 other trainers had
1000 or more horses entered
turf : One character field where “T” indicated a turf race: 27.2%, “P” a race on Polytrack or
Pro-Ride synthetic surfaces: 7.8%, “C” indicated Cushion track synthetic surface:
10.7% and a blank meant dirt surface: 54.3%
turfStarts : Number of lifetime races run on the turf: from 0 to a maximum of 78 (see Table
2.1)
turfWins : Number of lifetime wins on the turf: from 0 to a maximum of 15 (see Table 2.1)
type : Numerical designator of type of race: types are 0, 1, 4, 6, 8, 10, 14, 21, 22, 23, 31, 32,
33. Most common race types: Maiden Claiming (type = 0): 21.2%, Maiden
Allowance (type = 1): 18.7%, Allowance Non-Winners of 1 (31): 13.9%, Claiming
Middle (22): 12.9%, Claiming High (23): 10.0%, and Claiming Low (21): 9.7%
wbfOld1 : Win bet fraction from odds of previous race (see Table 2.1)
wbfOld2 : Win bet fraction from odds of 2nd race back (see Table 2.1)
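The coding of the cl12 variable above (1, 2 and 4 for the last, second and third race back) suggests that the values can be combined for a horse claimed more than once. The sketch below decodes cl12 under that assumption, namely that the codes are summed as bit flags; this reading is not confirmed by the text, and the function and key names are hypothetical.

```python
def decode_cl12(cl12):
    """Split a cl12 code into flags for the last three races.

    Assumes the codes 1, 2 and 4 are summed when a horse was claimed in
    more than one of its last three races (a bit-flag reading of the text).
    """
    return {
        "claimed_last": bool(cl12 & 1),
        "claimed_2nd_back": bool(cl12 & 2),
        "claimed_3rd_back": bool(cl12 & 4),
    }


flags = decode_cl12(5)   # hypothetical value: 1 + 4
```

Under this reading a value of 5 marks a horse claimed in its last race and in its third race back, and each flag can be used as its own 0/1 indicator covariate.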
2.3 SUBGROUPS
Various subsets of races and/or horses were run through the regression stepwise
selection process with all covariates to find predictor variables that are either hidden or are
much more significant in a subgroup than in the overall total set of all races and horses.
Subgroups considered:
MClm : Maiden Claiming - races for horses who have never won a race and can be claimed
for a specified claiming amount - considered to have the most volatile and unpredictable
horses - many veteran jockeys avoid riding in these races - has the lowest purse amounts
- Claimed covariates may be significant in these races
MAlw : Maiden Allowance - Races for horses who have never won a race and are not
claimable - many stars of the future are in these races
NonMaid : Races for horses who have won at least one race
Turf/Grass : Races run on a turf course - turf surface is thought to suit style of running for
some horses and be slippery and unsuitable for others due to different leg action
Cush : Races run on Cushion Track synthetic surface (replacing dirt surfaces)
Alw : Allowance Races - horses are not claimable - various restrictions usually apply limiting
horses eligible for race - not including Allowance races for Non-Winners of one or two
races
AlwNW12 : Allowance races for Non-Winners of either one or two races - these races are a
threshold between horses that go on to have profitable careers and run in Handicap and
Stakes races and those that fade into the lower class Claiming races
Stakes : Special races with highest purse amounts - also important for establishing a horse’s
reputation which directly influences its breeding value
Age2 : Races for two-year-olds - young horses may be quite inconsistent in their
performances
Age3UP : Races for three-year-olds and up and races for four years and up
Sprint : Races for short distances less than 7 furlongs - usually favors horses with early speed
MidDist : Races for distances 8 to 9 furlongs - two turn races where first turn is close to the
start (Santa Anita and Del Mar) so post position may be more significant in these races
LongDist : Races for distances greater than 9 furlongs - favors horses with stamina and lighter
weight assignments
Fill : Races for Fillies and Mares only - these races may have more longshots
Male : Races for any sex - usually all male, but not always so Fill covariate can be analyzed
here
yr9902 : Data set from 1/27/99 to 12/25/2002 - First 3 and 11/12 years of Regression Data
yr0305 : Data set from 12/26/02 to 12/25/2005 - middle 3 years of Regression Data
yr0607 : Data set from 12/26/05 to 11/4/2007 - last two complete years of Regression Data -
may show trends that are changing
yr07 : Data set from 10/29/06 to 11/4/2007 - last complete year of Regression Data - may
show trends that are changing
a67 : Races with 6 or 7 horses in race - predictors may vary especially when compared to
a11UP subgroup
a11UP : Races with 11 or more horses in race - predictors may vary when compared to
a67 subgroup, so post position may be more significant in these races
ClaimLow : Classes with low Claiming Amounts (8,000, 10,000, 12,500) - Claimed
covariates may be significant in these races
ClaimMid : Classes with middle Claiming Amounts (16,000, 20,000, 25,500, 32,000) -
Claimed covariates may be significant in these races
ClaimHigh : Classes with highest Claiming Amounts (40,000 and up) - Claimed covariates
may be significant in these races
DMR : Races run only at Del Mar Race Track - trainers and jockeys may do better here than
at other tracks
HOL : Races run only at Hollywood Race Track - trainers and jockeys may do better here
than at other tracks
SA : Races run only at Santa Anita Race Track - trainers and jockeys may do better here than
at other tracks
T65 : Races run on the downhill Turf course at Santa Anita - these races are so different from
all others that the covariates may greatly change values
number of horses and there would be an appropriate grouping of odds ranges for that number
of horses. For analyses based on large sample sizes (many thousands of horses), the top group
of 16 odds ranges, as presented in Table 2.2, is preferable and thus used in subsequent
analyses. However, in some subgroup analyses, the Overlays and Underlays are shown using
4, 2, and 1 odds range groupings since the number of horses considered is on the order of
1000-1500, too few for the full 16 odds range grouping. Frequently, for very small subsets of 50
or fewer horses, only the totals line is appropriate, but even then it could be of interest to scan upward
even to the 8 and 16 odds ranges to see the odds distribution of the selected horses. Making
the number of lines of each grouping of odds ranges a power of 2 enables the user to scan up
and down the report and get an understanding of the distribution of the odds of the horses
selected and so understand the totals line better.
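The power-of-2 layout described above can be sketched as repeated pairwise merging of adjacent odds ranges. This is only an illustration: the function names and the horse counts in the usage note are hypothetical, and the actual odds-range boundaries of Table 2.2 are not reproduced here.

```python
def collapse_pairs(counts):
    """Merge adjacent odds ranges two at a time (16 -> 8, 8 -> 4, ...)."""
    if len(counts) % 2 != 0:
        raise ValueError("need an even number of odds ranges to merge in pairs")
    return [counts[i] + counts[i + 1] for i in range(0, len(counts), 2)]


def all_groupings(counts):
    """Return every report level, from the finest grouping down to the totals line.

    Assumes len(counts) is a power of 2 (for example 16), as in the report layout.
    """
    n = len(counts)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("number of odds ranges must be a power of 2")
    levels = [list(counts)]
    while len(levels[-1]) > 1:
        levels.append(collapse_pairs(levels[-1]))
    return levels
```

For example, `all_groupings(list(range(1, 17)))` (made-up counts) yields five levels with 16, 8, 4, 2, and 1 entries, and every level sums to the same total, which is why a reader can scan up or down the report without losing horses between groupings.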
CHAPTER 3
METHODOLOGY
• Data is first prepared in MS ACCESS: the response variable Perf is calculated, predictor
variables are created, and two datasets, the Regression Data and the Testing Data, are
prepared and then exported to SAS.
• Regression analysis is performed in SAS on the Regression Data until a suitable model
is found.
• Test Data is then used in the model to generate two files: Baseline and Results Files.
• Estimated values for the response variable, Perf, are found using the model’s regression
coefficients and then exported to Matlab for Monte Carlo type processing.
• For the Baseline Perfs, only a simple formula using two predictor variables is used.
• For the Results file, all the significant predictor variables are used to calculate Perf. At
this point, horses must be grouped together so they can be processed as clusters of
horses within a race.
• Monte Carlo processing produces estimated probabilities of each horse finishing 1st,
2nd, 3rd, or 4th in their race for both the Baseline and Results Datasets. These
probabilities are then exported back into ACCESS for comparison and
report-generation.
• Win probabilities in the Results dataset that differ significantly from the Baseline win
probabilities are separated into two groups: Overlays (profitable bets) where estimated
probabilities are greater than the corresponding Baseline probability, and Underlays
(unprofitable bets) where probabilities are less. Tables are generated showing these
results and key information.
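The Monte Carlo step in the list above can be sketched as follows. This is a minimal illustration, not the thesis's Matlab code: it assumes each horse's simulated performance is its estimated Perf plus independent normal noise, that a higher value finishes ahead, and the names `finish_probabilities`, `noise_sd`, and `n_sims` are hypothetical.

```python
import random


def finish_probabilities(perfs, n_sims=20000, noise_sd=1.0, places=4, seed=0):
    """Estimate P(finish 1st), ..., P(finish places-th) for each horse in one race.

    perfs: estimated Perf values, one per horse in the race.
    Assumption: simulated performance = Perf + Normal(0, noise_sd) noise.
    """
    rng = random.Random(seed)
    n = len(perfs)
    top = min(places, n)
    counts = [[0] * top for _ in range(n)]
    for _ in range(n_sims):
        sims = [p + rng.gauss(0.0, noise_sd) for p in perfs]
        # Assumption: the largest simulated performance wins the race.
        order = sorted(range(n), key=lambda i: sims[i], reverse=True)
        for place, horse in enumerate(order[:top]):
            counts[horse][place] += 1
    # Convert counts to relative frequencies: row = horse, column = place.
    return [[c / n_sims for c in row] for row in counts]
```

Running the same routine once on Baseline Perfs and once on Results Perfs gives the two sets of probabilities that are later compared. Horses are processed race by race because each horse's probability depends on the other horses in its race.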
Perf : Performance Indicator (and Power Point) calculated for each horse
Speed1Diff : Average Speed for race is calculated, then subtracted from each horse’s Speed
Rating
Speed12Diff : Each horse’s maximum speed rating in last 2 races is subtracted from race
average
Speed123Diff : Each horse’s maximum speed rating in last 3 races is subtracted from race
average
Flags : Indicator-type: Flag is zero unless current race is 2nd race within 60 days after
maiden win
Odds1 : Odds in last race
Days1st : Number of days since last race, if zero, set to 180 for processing purposes (1st time
starters had values of zero which would throw off calculations)
Days2nd : Number of days since 2nd race back, if zero, set to Maximum of 200 or Days1st +
20 for processing purposes (1st and 2nd time starters had values of zero which would
throw off calculations)
Days3rd : Number of days since 3rd race back, if zero, set to Maximum of 230 or Days2nd +
30 for processing purposes (1st, 2nd, and 3rd time starters had values of zero which
would throw off calculations)
wbfAll : Win bet fraction raised to an exponent determined through Box-Cox Method
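The zero-replacement rules for Days1st, Days2nd, and Days3rd listed above can be written as a short sketch. One reading is assumed here and should be checked against the original data preparation: when Days2nd or Days3rd is zero, the already-adjusted value of the previous variable is used in the "+ 20" and "+ 30" terms. The function name `impute_days` is hypothetical.

```python
def impute_days(days1st, days2nd, days3rd):
    """Replace zero values (no such prior race) with the stated defaults."""
    # Days1st: zero means a first-time starter; set to 180.
    d1 = 180 if days1st == 0 else days1st
    # Days2nd: zero is set to the maximum of 200 or Days1st + 20.
    # Assumption: the adjusted Days1st (d1) is used here.
    d2 = max(200, d1 + 20) if days2nd == 0 else days2nd
    # Days3rd: zero is set to the maximum of 230 or Days2nd + 30.
    # Assumption: the adjusted Days2nd (d2) is used here.
    d3 = max(230, d2 + 30) if days3rd == 0 else days3rd
    return d1, d2, d3
```

Under this reading a first-time starter gets (180, 200, 230), and a horse returning from a 250-day layoff with only one prior race gets (250, 270, 300), so the ordering Days1st < Days2nd < Days3rd is preserved.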
CHAPTER 4
RESULTS
Test Results are divided into two groups: Overlays, horses with estimated probabilities
significantly greater (> 1.33 ∗ Baseline) than the Baseline’s probabilities, and Underlays,
horses with probabilities significantly less (< 0.6 ∗ Baseline) than the Baseline’s probabilities.
Each group is further divided into two tables, the first being a comparison of win results to
baseline results and the second being a comparison of 2nd, 3rd, and 4th place finishes.
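The grouping rule can be sketched in a few lines using the thresholds stated above (1.33 and 0.6 times the Baseline win probability). The function and constant names are hypothetical, and the label "Neither" for horses between the two thresholds is an illustrative choice.

```python
OVERLAY_FACTOR = 1.33   # Results probability > 1.33 * Baseline
UNDERLAY_FACTOR = 0.6   # Results probability < 0.6 * Baseline


def classify(result_prob, baseline_prob):
    """Label one horse by comparing its Results and Baseline win probabilities."""
    if baseline_prob <= 0:
        raise ValueError("baseline probability must be positive")
    if result_prob > OVERLAY_FACTOR * baseline_prob:
        return "Overlay"
    if result_prob < UNDERLAY_FACTOR * baseline_prob:
        return "Underlay"
    return "Neither"
```

For instance, a horse with a Baseline win probability of 0.10 is an Overlay if the model estimates more than 0.133 and an Underlay if it estimates less than 0.06.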
4.1 OVERLAYS
For the Overlays (Tables 4.1 and 4.2), 1,548 horses were selected from the test group.
Note that the proportion selected for the low (0 to 4) odds range 181/1548 = 11.7% was
much less than the proportion for the whole test group 7588/31398 = 24.2%. Thus one must
be careful comparing totals for the Results to totals for the baseline since the 0 to 4 odds range
has a much higher percentage of 1sts, 2nds, 3rds and 4ths. For example, when looking at Table
4.2, and comparing the “Out” numbers, the total percentage for the Baseline (OutB = 53.6) is
lower than the Results (OutA = 55.2), while at the same time, the OutA (results) is lower than
the OutB for each of the 4 odds ranges. This is a kind of numerical optical illusion due to the
weighted distribution of the Results to high-odds horses. For the Underlays, the weighted
distribution is much more severe. Of the 1036 horses selected in the Underlays group, only 11
were in the 0-4 odds range. The proportions were 11/1036 = 1.1% to 7588/31398 = 24.2%
of the whole test group. Thus care must be taken when looking at totals. For horses with
Results probabilities significantly greater (> 1.33 ∗ Baseline) than the Baseline probabilities,
Table 4.1 (Overlays) shows an overall increase in EV/Profitability from 0.78 to 1.03.
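The "optical illusion" in the totals is a weighting effect, often discussed under the name Simpson's paradox. The sketch below uses made-up numbers: only the 24.2% and 11.7% shares for the 0 to 4 odds range come from the text, while the other shares and all of the "Out" rates are hypothetical and chosen solely to show the mechanism.

```python
def weighted_total(rates, shares):
    """Overall percentage = sum over odds ranges of (rate * share of horses)."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(r * s for r, s in zip(rates, shares))


# Hypothetical "Out" percentages for four odds ranges (low odds to high odds).
out_baseline = [30.0, 50.0, 65.0, 80.0]
out_selected = [28.0, 47.0, 62.0, 77.0]      # lower than Baseline in every range

# Share of horses per odds range; only the first entries are from the text.
shares_baseline = [0.242, 0.258, 0.30, 0.20]
shares_selected = [0.117, 0.233, 0.35, 0.30]  # weighted toward high odds

total_baseline = weighted_total(out_baseline, shares_baseline)
total_selected = weighted_total(out_selected, shares_selected)
```

With these inputs the selected group has a lower "Out" rate in every odds range, yet its total is higher than the Baseline total, because more of its horses sit in the high-odds ranges where "Out" rates are high for everyone.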
4.2 UNDERLAYS
Underlays are the horses whose estimated winning probabilities are significantly
(< 0.6 ∗ Baseline) less than the Baseline’s probabilities. These horses show a marked
decrease in overall EV/Profitability, 0.78 to 0.46 (see Table 4.3). These values are somewhat
misleading since most of the horses selected in this Results group were in the high odds ranges
which had low EVs to begin with (see Table 2.2), but the EVs of 0.0 and 0.67 for the 4-9 and
9-27 odds ranges are lower than the corresponding baseline EVs. Looking at the four odds
ranges breakdown in Table 4.4, the first odds range 0.1 to 4 should be ignored since it only has
11 horses in it. The other three odds ranges showed decreases in 2nd, 3rd, and 4th place
finishes, which resulted in increases in the OutA percentages over the Baseline’s OutB numbers.
Underlays had only 11 horses in that group of which three were winners. Figure 4.2 also
reflects this situation in the 0-4 range. But in general, the two figures show that there is a
substantial increase in win percentage and Expected Value with the Overlays subset and a
definite decrease with the Underlays. Figure 4.3 shows the combined percentages for 1st
through 4th place finishes.
[Plot panels: Win % for the Under, Base, and Over groups.]
[Plot panels: EV for the Under, Base, and Over groups.]
[Plot panels: Total% of 1st, 2nd, 3rd, and 4th place finishes for the Under, Base, and Over groups.]
Figure 4.3. Test results: Finish comparisons between Underlays, Baseline, and
Overlays by odds ranges. Total percentages significantly greater for Overlays
than Underlays except for 0-4 range.
CHAPTER 5
MULTICOLLINEARITY
Shown in Table 5.1 is part of the SAS Variance Inflation Factor (VIF) diagnostic
results. Covariates not shown (mostly trainers, jockeys, and state-bred) all had VIF values less
than 1.4 and so were not flagged for high collinearity concerns. As expected, days1st (days
since last race), days2nd (days since 2nd race back) and days3rd are correlated since days2nd
is, by definition, always larger than days1st, and days3rd is always larger than days2nd. It
cannot be concluded that they are correlated to each other, but when days2nd is by itself in the
Final Model, its VIF drops to 1.12 (Table 3.3), showing small correlation. Similarly,
Speed1Diff, Speed12Diff and Speed123Diff show VIF values over 2 in Table 5.1, but
when Speed12Diff is by itself, the VIF value drops to around 1.25, as shown in Table 3.3. For
the Final Model, looking at Table 3.3, the highest VIF is 1.35, for wbfAll, from which the
conclusion is that there is no serious concern of multicollinearity in the model.
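For reference, the standard definition of the Variance Inflation Factor is given below. The symbols are not from the thesis: R_j^2 denotes the coefficient of determination obtained when covariate j is regressed on all the other covariates in the model, and p is the number of covariates.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad j = 1, \dots, p
```

Read this way, a VIF of 1.35 corresponds to R_j^2 = 1 - 1/1.35, or about 0.26, and a VIF of 1.12 to about 0.11, meaning the other covariates explain only a small part of the variation in that covariate.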
CHAPTER 6
DISCUSSION
The Overlays showed significant improvements in Win EV/Profitability as well as
improvements in 2nd, 3rd and 4th place finish percentages, as noted in Section 4.1.
Conversely, the Underlays indicated horses to be avoided due to low Win EV/Profitability and
lower 2nd, 3rd and 4th place percentages as noted in Section 4.2.
6.2 SUBGROUPS
Subgroups did not yield many new significant covariates. This was a bit of a surprise, as some
covariates and subgroups had been designed with each other in mind, such as fillies and mares
running against males, and post position 1 in the Middle Distance subgroup where the first
turn comes up quickly. Of course the coefficients for individual trainers and jockeys varied from
subgroup to subgroup, but there were no gigantic increases or decreases that coincided with
other strong indicators (p-Values, F-Values, partial R-Squared contributions, etc.). There is
still a lot of valuable information to be gleaned from subgroups - the trick is finding the best
covariates to test against. Oftentimes a handicapper will wonder how a certain pattern looks
in a specific Subgroup. Since almost any pattern can be converted to an indicator-type
covariate, it can then be processed through the system described in this paper to find its value
as a predictor variable. There are undoubtedly numerous (currently unidentified) covariates that
do not show up as significant when looking at the total Regression Dataset, but would be
highly significant if looked at in a certain Subgroup. The potential in this area is enormous.
6.5 MISCELLANEOUS
The best predictors other than the baseline predictors (Intercept, wbfAll, and nHor)
were, in order of strength: numLineDiff, blinkon, days2nd, notLasix, speedDiff2, ppOut, pp2
and pp3. The rest of the predictors were trainers, jockeys, and horses bred in France.
The only time-intensive step in the process was the computation of Monte Carlo
probability estimates, which took around 10 to 25 hours depending on the speed of the
computer used. EVs (Expected Value/Profitability) do not always increase directly with
increases in Perf. It may be that the improvement in Perfs shows up in improved 2nd, 3rd, or
4th place performances.
In an article by Clive Thompson [18] in Wired magazine titled “Advantage: Cyborgs,”
it is pointed out that in a “freestyle” 2005 online chess tournament, where any kind of entrant
was allowed, the most successful players were “Cyborgs,” those able to use computers as
“assistants” most efficiently. That principle undoubtedly holds at the racetracks. The system
described here has tremendous potential for assisting handicappers. Finding accurate
probabilities should translate into high profitability.
CHAPTER 7
CONCLUSIONS
1. The system works. Table 7.1 shows a comparison of the totals for Overlays versus
Underlays. The differences are dramatic even taking into account the differences in
distribution by odds ranges.
2. A better comparison is Table 7.2, since it covers only the 9-27 odds range, where the
percentage of total horses in the range is about the same (32.2% to 38.2%). Horses in the 9-27 odds
range are longshots, basically overlooked or lightly bet. Although a bettor has to be
patient for Overlays and Underlays to happen, they can lead to profitable bets when
used in exotic wagering, especially exactas, trifectas and superfectas, since which
horses to bet and which to avoid are clearly identified. Hitting a 15 or 20 to one longshot
in the correct spot on an exotic bet can really boost the payoff!
BIBLIOGRAPHY
[1] D.A. Harville. Assigning probabilities to the outcomes of multi-entry competitions.
Journal of American Statistical Association, 68:312-316, 1973.
[2] R.J. Henery. Permutation probabilities as models for horse races. Journal of Royal
Statistical Society B, 43:86-91, 1981.
[3] H. Stern. Models for distributions on permutations. Journal of American Statistical
Association, 85:558-564, 1990.
[4] D.B. Hausch, V.S.Y. Lo, and W.T. Ziemba. Efficiency of Racetrack Betting Markets.
Academic Press, New York, NY, 1994.
[5] J.B. Bacon-Shone, V.S.Y. Lo, and K. Busche. Logistics analyses of complicated bets.
Research Report 11, Department of Statistics, the University of Hong Kong, 1992.
[6] V.S.Y. Lo and J. Bacon-Shone. Comparison between two models for predicting ordering
probabilities in multi-entry competitions. The Statistician, 43(2):317-327, 1994.
[7] V.S.Y. Lo and J. Bacon-Shone. Handbook of Investments: Efficiency of Sports and
Lottery Markets. Elsevier, London, England, 2008.
[8] M.M. Ali. Probability and utility estimates for racetrack bettors. Journal of Political
Economy, 84:803-815, 1977.
[9] P. Asch, B. Malkiel, and R. Quandt. Market efficiency in racetrack betting. Journal of
Business, 57:165-174, 1984.
[10] W.T. Ziemba and D.B. Hausch. Dr. Z’s Beat the Racetrack. Morrow, New York, NY,
1987.
[11] J.B. Bacon-Shone and V.S.Y. Lo. Probability and statistical models for racing. Journal of
Quantitative Analysis in Sports, 4(2):2-11, 2008.
[12] M.H. Kutner, C.J. Nachtsheim, and J. Neter. Applied Linear Regression Models.
McGraw-Hill Irwin, New York, NY, 2004.
[13] B. Harris. Emotional Bob Baffert heads into Thoroughbred Racing Hall of Fame. Sports
News, August 12, 2009.
[14] J. Bossert. Trainers bemoan synthetic tracks as Breeders’ Cup approaches. New York
Daily News, October 22, 2008.
[15] Wikipedia. Bob Baffert, 2010. http://en.wikipedia.org/wiki/Bob_Baffert, accessed May
2010.
[16] Wikipedia. Kent Desormeaux, 2010. http://en.wikipedia.org/wiki/Kent_Desormeaux,
accessed May 2010.
[17] Wikipedia. Garrett Gomez, 2010. http://en.wikipedia.org/wiki/Garrett_K._Gomez,
accessed May 2010.
[18] C. Thompson. Advantage: Cyborgs. Wired Magazine, 42, April 2010.