Model Considerations For Multi-Entry Competitions

MODEL CONSIDERATIONS FOR MULTI-ENTRY COMPETITIONS
A Thesis
Presented to the
Faculty of
San Diego State University
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
in
Statistics
by
Vincent Stanley Dayes
Summer 2010
Copyright © 2010
by
Gambler’s Prayer: Dear Lord, please let me break even,
because I really need the money – Mr. X

ABSTRACT OF THE THESIS

Model Considerations for Multi-Entry Competitions
by
Master of Science in Statistics
San Diego State University, 2010
A unique and highly practical system for identifying good and bad bets at the major
Southern California Thoroughbred racetracks is created and analyzed. A probability model
is created for each individual race as a function of odds, Morning Line, each horse’s past
performances, current trainer and jockey, and miscellaneous factors depending on the type of
race. A continuous numerical performance estimator, “Perf”, is used as the response variable
in the regression analysis. After new estimates for Perf were obtained, Monte Carlo methods
were used to calculate the probabilities of each horse’s 1st, 2nd, 3rd, or 4th place finish.
Horses were then grouped according to odds, and reports were generated to analyze results and
calculate Expected Values. To find the numerous hidden factors and patterns that occur only
under specific conditions, numerous subsets of races and horses were analyzed using hundreds
of covariates. A Baseline of probabilities is created using a simple model based mainly on a
horse’s odds. The final model probabilities resulting from the estimated regression equation
are then compared to the baseline probabilities. Those that differ significantly are separated
into two groups: estimated probabilities higher than the baseline’s are considered profitable
bets (“Overlays”), while those lower than the baseline’s are “Underlays” (unprofitable bets).
Each group is displayed in the odds-based report format. The study uses 10 3/4 years of
horse-racing data, with 8 3/4 years set up as the “Regression” Dataset and the two most recent
complete years as the “Testing” Dataset. Of primary interest is the Profitability, or Expected
Value, of each group. In sum, various parameters and wagering options are analyzed for their
positive or negative effects on profitability.
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
1.1 History
1.2 Statement of Problem
1.3 Objective
1.4 A Typical Horse Race
1.5 Definition of Terms
2 DATA
2.1 Variables Input into SAS
2.2 Summary Statistics of Numerical Data
2.3 Subgroups
2.4 Data Separated into Odds Ranges
2.5 The Daily Racing Form for the Serious Handicapper
3 METHODOLOGY
3.1 Perf: The Important Response Variable
3.2 Data Preparation in MS ACCESS
3.3 SAS Operations and Processing
3.3.1 Non-Indicator Covariates Created in SAS
3.3.2 Indicator Type Covariates Created in SAS
3.3.3 WBF Exponent Found Using Box-Cox Method
3.3.4 SAS Regression and Model Selection
3.4 Matlab: Simulating Horse Races for Probability Estimates
3.5 Comparing Probability Files in ACCESS
4 RESULTS
4.1 Overlays
4.2 Underlays
4.3 Comparisons of Results by the Four Major Odds Ranges
5 MULTICOLLINEARITY
6 DISCUSSION
6.1 Response Variable: Perf (and Power Point)
6.2 Subgroups
6.3 Limitations of the Study
6.4 Predictor Variables Included in the Final Regression Model
6.5 Miscellaneous
7 CONCLUSIONS
BIBLIOGRAPHY
LIST OF TABLES
Table 2.1 Descriptive Statistics of Numerical Variables
Table 2.2 Regression Data by Collapsing Odds Ranges
Table 3.1 Trainer Names and ID Codes
Table 3.2 Test Data Baseline Model
Table 3.3 Regression Model Coefficients
Table 4.1 Overlays: Comparison of Win Results(A) to Baseline(B)
Table 4.2 Overlays: Comparison of 2nd, 3rd, and 4th Results(B) to Baseline(A)
Table 4.3 Underlays: Comparison of Win Results(A) to Baseline(B)
Table 4.4 Underlays: Comparison of 2nd, 3rd, 4th Results(A) to Baseline(B)
Table 5.1 Covariates With Variance of Inflation Greater Than 2.0
Table 7.1 Comparison of Overlays and Underlays Totals
Table 7.2 Odds Range 9-27 of Overlays and Underlays Totals
LIST OF FIGURES
Figure 1.1 Daily Racing Form. Note abundance of information for each horse.
Figure 4.1 Win percentage comparisons between Underlays, Baseline, and Overlays by odds
ranges. Win percentages for Overlays substantially greater than those for Underlays.
Figure 4.2 EV comparisons between Underlays, Baseline, and Overlays by odds ranges. EVs
for Overlays are much greater than those for Underlays.
Figure 4.3 Test results: Finish comparisons between Underlays, Baseline, and Overlays by
odds ranges. Total percentages significantly greater for Overlays than Underlays except for
0-4 range.
CHAPTER 1
INTRODUCTION
Perhaps the most complex and challenging multi-entry competition is the horse race.
Horse races are essentially unique and independent of each other. The race conditions,
restrictions and eligibility requirements determine which horses may be entered in a race,
and they apply to every horse in that race. In a “Maiden” race, for example, only horses that
have never won a race may be entered. Typical restrictions are by sex, age, state where bred,
types and number of races previously won, etc. Race conditions include distance, racing
surface (dirt, turf or synthetic track), physical condition of the track, purse offered, etc.
Thus each race is a cluster of horses running under race-specific factors. Horse-specific
factors are jockey, trainer, post position, equipment (blinkers, type of shoe, etc.), (legal)
drugs, assigned weight, etc. But for the serious handicapper, the most important information
is the past performances of each horse listed in the “Daily Racing Form,” which summarizes
up to 10 previous races for each horse in reverse chronological order.
1.1 HISTORY
Gambling on horse races has been around since man first started riding horses.
Modern horse racing exists because it is a popular form of legalized gambling and is accepted
as benefitting local and state economies by generating large amounts of tax dollars and
providing jobs and money.
Statisticians have been analyzing horse racing data for many years, with milestone
works by Harville [1], Henery [2], Stern [3] and others. Many other disciplines also have
researchers investigating the ponies. Hausch et al. [4] cover articles from economists,
psychologists, management scientists, probability theorists as well as professional gamblers.
The first model proposed by Harville [1] is a simple way of computing ordering probabilities
based on winning probabilities. Henery [2] suggested using a normal distribution for
estimating running times, whereas Stern [3] recommended using gamma distributions for the
same purpose. Bacon-Shone, Lo and Busche [5] and Lo and Bacon-Shone [6] showed that the
Henery and Stern models were better fits than the Harville model for particular racing data.
However, since both the Henery and Stern models are complicated to use in practice, Lo and
Bacon-Shone [7] suggested a simple approximation for both the Henery and Stern models.
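As an illustration of the Harville approach, the following minimal sketch computes the probability of a specific finishing order from win probabilities alone, renormalizing over the horses not yet placed at each step. The function name and the example probabilities are hypothetical and are not taken from this study.

```python
from typing import Sequence


def harville_order_prob(win_probs: Sequence[float], order: Sequence[int]) -> float:
    """Probability that the horses in `order` finish in exactly that order
    (first, second, ...) under the Harville model."""
    remaining = 1.0  # total win probability of the horses not yet placed
    prob = 1.0
    for horse in order:
        # Chance this horse beats all horses still unplaced.
        prob *= win_probs[horse] / remaining
        remaining -= win_probs[horse]
    return prob


# Hypothetical 4-horse race (win probabilities sum to 1).
win_probs = [0.5, 0.3, 0.15, 0.05]
exacta = harville_order_prob(win_probs, [0, 1])  # horse 0 first, horse 1 second
print(round(exacta, 4))
```

Under this model the probability that horse 0 wins and horse 1 runs second is p0 * p1 / (1 - p0), so only the win probabilities are needed as input.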
Also heavily investigated is the favorite-longshot bias (where favorites are typically
underbet so odds are too high and longshots overbet so their odds are too low) that has often
been found in gambling data (see Ali [8], Asch et al. [9], Ziemba and Hausch [10], Lo [4], and
Bacon-Shone and Lo [11]). This bias also appears in this study, but is not the main focus.
Essentially all the works cited in this section were aimed at finding models that would
accurately estimate probabilities that could result in turning a profit at the racetrack, by
finding profitable wagers and/or avoiding the unprofitable ones. This work is aimed at
developing a system that facilitates evaluating whichever patterns, variables, statistics, etc.
a handicapper might wish to investigate, and at continually improving an already useful model
by adding new covariates that are significant.
1.2 STATEMENT OF PROBLEM

The problem is to “beat” the odds by finding profitable bets and avoiding losing
propositions. Turning a profit at the horse races involves the basic calculation of return versus
risk, or payoff versus probability. By waiting until the last few minutes before a race goes off,
a bettor has an accurate approximation of the payoff for a straight win bet, which can also be
a strong indicator of the payoffs for other types of bets. So the key to success is obtaining
accurate probabilities of winning and then choosing the wagers where return (odds) far
outweighs risk (win probability).
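The return-versus-risk calculation can be sketched as follows. This is a minimal illustration that assumes odds are quoted as profit per dollar wagered and that ignores breakage. The function names and the example numbers are hypothetical.

```python
def win_bet_ev(win_prob: float, odds: float, stake: float = 1.0) -> float:
    """Expected profit of a straight win bet.

    Assumes `odds` is the profit per unit staked if the horse wins
    (e.g. odds of 4.0 pay 4 units of profit on a 1-unit bet).
    """
    return stake * (win_prob * odds - (1.0 - win_prob))


def break_even_prob(odds: float) -> float:
    """Win probability at which the bet has zero expected profit."""
    return 1.0 / (odds + 1.0)


# Hypothetical horse at odds of 4.0 that is believed to win 25% of the time.
print(win_bet_ev(0.25, 4.0))   # positive: return outweighs risk
print(break_even_prob(4.0))    # win probability needed just to break even
```

A bet is attractive only when the estimated win probability exceeds the break-even probability, which has the same form as the win bet fraction of Equation 1.2.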
Even for the most experienced handicappers, one race may take from 20 minutes to
two hours to produce accurate probabilities for each horse. Factors vary from race to race, but
starting points are: Morning Line, each horse’s past performances, current trainer and jockey,
pace style (relative to the pace styles of the other horses in the race), and miscellaneous
factors and patterns depending on the type of race. The amount of available information
is overwhelming, and only a computer-assisted handicapper can accurately assess each horse’s
probability of winning in a reasonable amount of time, much less calculate the probabilities of
coming in 2nd, 3rd or 4th that are required to bet Superfectas (picking the exact order of the
top four finishers).
The betting public in general does a good job of estimating most horses’ probabilities of
winning, which means the obvious predictor variables, like the best jockeys and trainers, recent
strong performances, sparkling workouts, etc., are reflected in the odds. The problem then is to
find predictor variables that are significant and, at the same time, relatively independent of
odds. Then there needs to be a way to determine how strongly these covariates weigh against
each other and how they relate to a horse’s performance. There also needs to be a numerical
rating system of performance for each horse in a race, one that takes into account the
number of horses in a race and where each horse finishes.
1.3 OBJECTIVE
The objective is to create and analyze a practical system for weighing all the positive
and negative factors associated with each horse in a race and then calculating probabilities for
1st, 2nd, 3rd, and 4th place finishes. Horses whose (1st place) probabilities are
significantly higher than the probabilities reflected by their odds should be good bets
(Overlays), and those whose probabilities are significantly lower are wagers to avoid
(Underlays). Finding a response variable that numerically rates a horse’s performance is also
part of the objective.
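The Overlay/Underlay idea can be sketched as below. The sketch uses the Reflected Probs formula of Equation 1.1 as the odds-implied probability. The `margin` cutoff for a "significant" difference, the function names, and the example values are hypothetical, and the study itself compares against Baseline model probabilities rather than this direct formula.

```python
def reflected_prob(odds: float, take_factor: float = 1.2) -> float:
    """Probability implied by the odds, following Equation 1.1:
    1 / (odds * 1.2 + 1), where 1.2 adjusts for the estimated track percentage."""
    return 1.0 / (odds * take_factor + 1.0)


def classify_bet(model_prob: float, odds: float, margin: float = 0.05) -> str:
    """Label a horse by comparing a model win probability with the
    odds-reflected probability. `margin` is a hypothetical cutoff for a
    'significant' difference, not a value taken from the study."""
    diff = model_prob - reflected_prob(odds)
    if diff > margin:
        return "overlay"
    if diff < -margin:
        return "underlay"
    return "neutral"


# Hypothetical horse at odds of 5.0 under three different model probabilities.
for p in (0.25, 0.15, 0.05):
    print(p, classify_bet(p, 5.0))
```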
1.4 A TYPICAL HORSE RACE

Trainers pick the races for their horses and find a jockey by dealing with jockey
agents. Horses are entered a few days before a race, and post positions are assigned by drawing
small numbered balls (“pills”). The Racing Form is usually available two days before the race
and the Morning Line one day before. Everything is synchronized at the racetrack. Even as
one race is running, the horses for the next race are being led to the Paddock, where they are
saddled, checked over and calmed down. Then the horses are led to a viewing ring surrounded
by a crowd of bettors who are intensely studying the horses for good or bad signs. The
horses are led around the outside of the ring while the owners and their friends and families,
on the inside (usually all dressed up and pretending they do not notice the crowd on the
outside), watch their horse in a confident manner. At this point the trainer gives final
instructions to the jockey (who usually ignores them), an official calls “Riders up,” and
the jockeys mount their horses. The horses are then led through a tunnel to the track, where
there is a post parade in front of the grandstands. The horses warm up by jogging around
the track, and a few minutes before Post Time they start walking to the starting gate. Around
Post Time the horses are “loaded” into the starting gate, all the while being checked over by
the track veterinarian. When the horses are all loaded and calm, the official starter pushes a
button that opens all the gates in front of the horses, and the race is off. The start is the most
chaotic point in the race, with horses frequently going sideways, bumping and cutting each
other off, or sometimes leaving the gate very slowly. At the start, jockeys seek an advantageous
position that gives their horse its best chance of winning. Some horses are front-runners:
they like to be in the lead and “steal” the race by setting the pace without using up their
reserve energy, and then having enough gas in the tank to fight off late challengers (going
“wire to wire”). Other horses may have a “stalking” style, where they stay right behind the
leader or leaders until near the end of the race and then go all out. Then there are the
“closers,” who may stay near the rear of the field and then close strongly in the last part of
the race. All jockeys try to save ground on the turns by staying as near the rail as possible
while at the same time trying to avoid trouble in the form of being blocked by horses in front
or being pinched into the rail. The end of a race can be quite exciting, as horses are frequently
tightly packed at the finish line, separated by inches after running a mile or more. Jockeys are
expert at using the whip. Some horses respond well to whipping and others do not, in which
case the jockey may just show the whip by placing it in front of the horse’s eyes or lightly tap
the horse once or twice. Jockeys are also adept at urging their mounts to give their best efforts,
especially in the final straightaway before the finish line (the “home stretch”). Note that
jockeys do not actually sit on their mounts but balance themselves on the stirrups the whole
race so as not to impede or interfere with their horse’s running action. Jockeys are tremendous
athletes who must have large amounts of strength, courage, lightning reflexes and good
judgment and instincts to succeed.
The flip side to the running of the race is the wagering. At the track there is a huge
Totalizer Board that displays the odds for win bets and the total amounts bet on each horse,
which are updated every minute or so. This information is also available on monitors
displayed all around the public areas. Wagering over the internet is allowed right up until post
time. At the track, wagering is allowed until a loud bell goes off, a few seconds after the race
starts. The difference between these two cutoffs usually ranges from two or three minutes to
10 or more. Thus bettors at the track have an advantage in that they have time to see the
effects of last-minute internet wagering, get an accurate estimate of the final odds, and still
have time to make their own bets.
1.5 DEFINITION OF TERMS
Baseline : Estimated Perfs are derived from the Simple Regression Model, which is based
only on odds and the number of horses in the race. From each horse’s estimated Perf,
estimated probabilities are calculated. These are the baseline values to which the estimated
probabilities from the final regression model are compared
Bay : Reddish brown color of horses
Betting Pool : Each type of wager has its own pool of money bet, separate from all other
pools
Blinkers : A hood placed over a horse’s head with cups sewn onto the eye openings. This
restricts a horse’s vision so it can only see straight ahead
Box-Cox Method : Used to find the best fit of the win bet fraction (wbf) to the performance
response variable (Perf) by finding the exponent λ that minimizes the Sum of Squares
Error. A new predictor variable, wbfAll, equal to wbf raised to λ, is used in place of wbf
(see Kutner [12])
Breakage : This is due to odds being rounded downward to the nearest tenth of a dollar and
the wagering establishment keeping the difference
Breeder : Whoever breeds the horse
Claimed : When a horse runs in a claiming race and is “claimed” by a licensed owner or
trainer, it is purchased for the claiming amount specified for that race. The horse must be
in the starting gate when the race goes off. Once the race starts, the horse officially
belongs to the new owner even if it is injured or drops dead, but any monies won go to
the original owner
Claiming Race : Horses which may be claimed (purchased) for a specified price
Class : Level of competition - numerical evaluation of the general strength of a race. The
concept of “Class” is used here to categorize numerically the quality of a race and
therefore its entrants. The strongest runners are in the highest classes (and highest
purses to be won) and vice versa
Colt : A male horse age 4 or less
Cushion Track : A type of synthetic surface
Daily Double : A wager where the winners of two consecutive races must be picked to win
the bet. Originally the first two races of the day, now most tracks offer this on all
consecutive races
Daily Racing Form : Resembles a small newspaper filled with racing information for each
horse running on a particular day (see Figure 1.1)
Entry : When two or more horses are entered in a race and are considered a single entity for
wagering purposes
Exacta : An exotic wager where the exact order of the first two finishers in a single race is
specified
Exotics : Newer, more complicated bets such as Trifectas, Superfectas on single races and
multiple race bets like the Pick 3, Pick 4, and Pick 6
EV : Expected Value - used here in same sense as Profitability - expected or average return
on a wager
Figure 1.1. Daily Racing Form. Note abundance of information for each horse.
Favorite : The most heavily bet horse in a race
Filly : Female horse age 4 or less
Furlong : An eighth of a mile
Gelding : A castrated male horse of any age
Handicapper : An experienced Daily Racing Form reader able to hold huge amounts of
information in his head and at the same time judge the relative merits of each horse in a
race, coming up with estimates of winning probabilities
Handicap Race : A stakes race where weights (see weight) are assigned according to a
horse’s past performances
Horse : Specifically a male horse (not gelded) of age 5 or greater
House Take : House “Cut”, Track Percentage - the amount taken out of the Betting Pool by
the House or race track. For simple pools like the win, place or show, it is around 14%
to 18%. For the exotic pools, it is around 20%. It varies by state and race track
Indicator 0 or 1 : Covariates are set to 1 if they occur, otherwise they are set to 0
Jockey : Professional rider of horses
Lasix : Legal anti-bleeding drug - common in California, illegal in some states and countries
Length : About nine feet - the length of a generic horse from the tip of its nose to the end of
its tail (when running) - also a rough time measurement: one length is about a fifth of a
second
Line : Refers to a past performance line in the Daily Racing Form
Longshot : General term meaning a horse that is unlikely to win
Maiden Race : Races only for horses who have never won a race
Major Odds Range : In this study, the (four) major ranges are: 0.1 to 4.0, 4.1 to 9.0, 9.1 to
27.0, and 27.1 and UP
Mare : Female horse 5 or more years in age
Monte Carlo Method : Computational system of simulation using repeated random
sampling to compute results
Morning Line : Predicted final odds - may appear in Racing Form one or more days before
race
Odds : Return on investment, should it be successful
Overlay : When probability of a horse winning is greater than the probability indicated by its
odds
Out Finish : Finishing 5th place or worse - not 1st, 2nd, 3rd or 4th
Pace : The speed of the early leaders in a race
Pace Style : The usual early-race location of a horse - may be forwardly placed early or in
the rear
Paddock : The area where the horses are viewed before a race
Past Performances : Daily Racing Form information lines (see Figure 1.1)
Perf : Dependent (Response) Variable - numerical evaluation of a horse’s performance in a
given race, a function of lengths ahead or behind the Power Point. Originally just 10 *
lengths from Power Point (negative if behind Power Point, positive if ahead). For
example, if the winning horse were two lengths ahead of the second place horse in a
7-horse race (where the Power Point equals the second place finish), its Perf would be
20. There was a set minimum for Perf (in this study it is -210)
Photo Finish : A close finish where the finish picture must be examined to determine the
order of finish
Pick 3, 4, or 6 : Wagers where the winners of all the included races must be picked
Place : A wager where a bettor wins if his horse comes in 1st or 2nd. Also the place position
is 2nd place
Polytrack : Synthetic race track (general term)
Post Parade : After horses leave the paddock and before the race, the horses come out onto
the racetrack and parade in front of the grandstands
Post Time : Official time horses are supposed to be at the starting gate. Most races start a few
minutes after post time
Power Point : A numerical indicator of a strength threshold value at the finish of a race: a
function of the number of horses in the race and the distances in lengths between the top
four finishers. Originally equal to the second place finish for races with 7 or fewer
horses, equal to the midpoint between 2nd and 3rd for races with 8, 9, and 10 horses,
and equal to 3rd place finish for races with 11 or more horses. If a horse finished at the
Power Point, it was assigned a Perf of zero. If it finished ahead of the Power Point, its
Perf equaled the number of lengths ahead times 10; if it finished behind, its Perf was
minus the number of lengths behind multiplied by 10
Profitability : The positive or negative return per wager. Synonymous with EV
Pro-Ride : A specific type of synthetic racetrack
Purse : Prize money offered in a race of which typically 60% goes to the winner, 20% to 2nd
place, 12% 3rd, 6% to 4th and 2% to 5th (the distribution percentages vary from state to
state)
Race Restrictions : Restrictions on horses allowed into a specific race
Racetrack : The three California tracks in this study are all flat ovals with Santa Anita and
Del Mar being a mile in circumference and Hollywood being a mile and 1/8. The turf or
grass course is just inside of the main course which up until 2007 was a dirt track.
Racetracks are publicly owned but strictly regulated by state agencies
Racing Form : See Daily Racing Form
Reflected Probs : Inverted odds: Probabilities that reflect how a horse is bet - with estimated
Track Percentage taken into account
1/(odds ∗ 1.2 + 1) (1.1)
Regression Data : Data used to develop model and find Estimated Parameters/Regression
Coefficients
Regression Funct. : Model/equation used to predict new values of response variable Perf
from Test Data/Prediction Set
Results Dataset : Subset of Testing Dataset consisting of horses whose estimated win
probabilities (from the Regression Function) differ significantly from the Baseline win
probabilities
Saddle Cloth Number : Official number of horse, used when placing bets or checking
results - frequently the same as the post position, but not always
Saving Ground : Minimizing distance horse has to run by staying close to the inside rail on
turns
Scratch : A horse does not run (for whatever reason) in a race it is entered in
Show : A wager where the bettor wins if his pick comes in 1st, 2nd or 3rd. Usually has a
small return, sometimes 10 cents on the dollar
Stakes Race : Highest class of races with the largest purses, for example, the Kentucky
Derby
Superfecta : An exotic wager where the exact order of the first four finishers in a single race
is specified
Synthetic Track : In May, 2006, the California Horse Racing Board mandated that all
California horse racing tracks had to switch their dirt tracks to synthetic surfaces for the
safety and welfare of horses. Hollywood has been using Cushion Track since November
2006, Santa Anita tried Cushion Track from September 2007 to summer of 2008 when
it switched to Pro-Ride due to drainage problems. Del Mar has been using Polytrack
since July 2007. Polytrack is similar to Pro-Ride and the two have been treated as the
same surface type in this study
Testing Data : Data set aside for model validation - data is not used in creating the
Regression Function/Estimated Parameters
Totalizer Board : Huge display at race track displaying all important betting information
Trainer : Responsible for the training, behavior, and exercise routine of horses; also
selects the races a horse runs in and picks the jockey
Trifecta : An exotic wager where the exact order of the first three finishers in a single race is
specified
Wbf : Win Bet Fraction, inverted odds:
wbf = 1/(odds + 1) (1.2)
WbfAll : Wbf raised to exponent λ found using the Box-Cox method
wbfAll = wbf^λ (1.3)
Weight : Horses are assigned minimum weights according to race conditions. All jockeys are
weighed before all races - if under the assigned weight they carry extra weights in their
saddle. Overweights don’t matter, except to the horse
Whip : Leather instrument used by a jockey to encourage his horse
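The definitions of Power Point and Perf given above can be sketched in code. This minimal sketch follows only the "original" rules stated in those definitions (the form used in the final model may differ). It assumes each horse's finish is recorded as lengths behind the winner. The function names, the input layout, and the example race are hypothetical.

```python
from typing import List

PERF_MIN = -210.0  # set minimum for Perf stated in the definition of Perf


def power_point(beaten_lengths: List[float]) -> float:
    """Power Point expressed as lengths behind the winner, using the
    'original' rule. `beaten_lengths` lists each horse's distance behind
    the winner in finishing order (winner = 0.0)."""
    n = len(beaten_lengths)
    second, third = beaten_lengths[1], beaten_lengths[2]
    if n <= 7:
        return second                    # 2nd place finish
    if n <= 10:
        return (second + third) / 2.0    # midpoint between 2nd and 3rd
    return third                         # 3rd place finish


def perf_values(beaten_lengths: List[float]) -> List[float]:
    """Perf = 10 * lengths ahead (+) or behind (-) the Power Point,
    floored at PERF_MIN."""
    point = power_point(beaten_lengths)
    return [max(PERF_MIN, 10.0 * (point - b)) for b in beaten_lengths]


# Hypothetical 7-horse race: the winner finishes two lengths ahead of second place.
race = [0.0, 2.0, 3.5, 6.0, 9.0, 15.0, 30.0]
print(perf_values(race))
```

In this example the winner receives a Perf of 20, matching the worked example in the definition of Perf, and the last horse is held at the minimum of -210.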

CHAPTER 2
DATA
The data comes from the three major Southern California horse racing tracks: Santa
Anita (Los Angeles), Hollywood (Los Angeles), and Del Mar (San Diego). These three tracks
form a circuit, since only one is open at a time and the same horses, trainers, and jockeys move
from track to track. Thus the data has a consistent, homogeneous nature. The races were run
from January 1999 to November 2009 - 10 3/4 years. Out of 23,478 races, 19,930 races were
used for the individual race model (168,253 horses), the others being rejected because of too
few horses (minimum 6 horses in a race), abnormalities, entries (multiple horses coupled
together for betting purposes, causing complications in odds analysis), and corrupted data.
There are three different types of race data. Current Race data is the data a
handicapper has before the race goes off (typically found in The Daily Racing Form). Results
data is the results of the Current Races. Past Performances data is how a horse performed in
previous races, so it is a combination of the other two types of data.
The pre-race data was exported from The Daily Racing Form files, imported into MS
ACCESS, checked for errors, processed for easy analysis and then exported from ACCESS in
Comma-Delimited files, which were then read and analyzed by SAS and Matlab. The results
data and the final odds came from Equibase Inc. which specializes in horse-racing results
data. The data was purchased through Post Time Solutions, Inc.
The 10 3/4 years of data was split into two groups: the first (Regression Dataset) is 8
3/4 years of data (1/27/99 to 11/05/07 - 16,284 races/136,855 horses) and the second (Testing
Dataset) is the last two years of data (11/06/07 to 11/07/09 - 3,646 races/31,398 horses).
2.1 VARIABLES INPUT INTO SAS

Note: see Table 2.1 for statistics on numerical data.
age : Age of horses allowed in race: Age 2: 13.27% of races, age 3: 19.05%, age 4: 2.61%,
age 3UP: 41.76%, age 4UP: 23.31%. Note that there were 8 races for 3 and 4 year olds
only
blinks : Blinkers changed: X = Blinkers taken off: 2.59%, B = Blinkers put on: 4.95%, no
change in blinkers: 92.45%
Table 2.1. Descriptive Statistics of Numerical Variables
Variable Minimum Median Maximum Std Dev 10th PCT 90th PCT
days1st 2 29 1876 85.64 15 180
days2nd 11 72 1903 109.39 37 223
days3rd 17 116 1945 124.40 62 297
dist 20 65 140 13.07 55 85
horseAge 2 3 12 1.31 2 5
ML1 0.01 0.10 0.68 0.07 0.03 0.22
monthBorn 1 3 12 1.56 2 5
nhor 6 9 14 1.95 6 12
numLines 0 6 10 3.83 0 10
numLineDiff -8.90 0 8.75 2.02 -2.57 2.40
odds 0.10 8.70 243.2 21.69 2.2 44.6
odds1 0.05 9.00 339.5 18.62 2.0 35.6
odds2 0.05 10.40 339.5 16.66 2.0 30.7
perf -210 -30 121 99.67 -210 55
pp 1 5 14 2.78 1 9
speed1Diff 0 2 10 3.83 0 10
speed12Diff 0 3 10 3.77 0 10
speed123Diff 0 4 10 3.74 0 10
turfStarts 0 0 78 5.92 0 10
turfWins 0 0 15 1.31 0 2
wbf 0.004 0.10 0.91 0.13 0.02 0.32
wbfAll 0.43 0.70 0.99 0.11 0.55 0.84
wbfOld1 0.003 0.10 0.95 0.13 0.03 0.33
wbfOld2 0.003 0.09 0.95 0.13 0.03 0.33
cl12 : Claim indicator for last three races: 1 = claimed in last race, 2 = claimed in second race
back, 4 = claimed in 3rd race back - 19,366 horses out of 168,253 were claimed in at
least one of their last three races (11.51%)
date : Julian date of race - 36187 to 40124 (1/27/1999 to 11/7/2009)
days1st : Number of days since last race (see Table 2.1)
days2nd : Number of days since 2nd race back (see Table 2.1)
days3rd : Number of days since 3rd race back (see Table 2.1)
dist : Distance of race in tenths of a furlong - furlong is 1/8th of a mile - from 20 to 140 (1/4
mile to 1 3/4 mile); most common distance: 60 or 6 furlongs (3/4 mile): 4,718 out of
19,930 races (see Table 2.1)
finish : Place of finish (1 to 14)

flags : Indicator-type: Flag is one when current race is 2nd race within 60 days after maiden
win. 3737 out of 168,253 (2.2%)
horseAge : Age of individual horse: from 2 to 12 (see Table 2.1)
horseType : Type of horse: f = filly (female age 4 or less) 35.2%, m = mare (female age 5
and up) 6.5%, c = colt (male age 4 or less) 25.2%, h = horse (male age 5 and up) 5.7%,
g = gelding (castrated male any age) 27.2%
JockID : Three character code for professional race-riders. Of 470 different jockeys, 41 had
1000 or more rides
lasix1st : Indicator-type: 1 if first time horse has had the drug lasix in its life (1st time starters
not included) 3,760/168,253 (2.2%)
lasixL : L if horse has been given lasix, (96.5% have lasix, 3.5% do not)
ML1 : Inverted Morning Line Fraction - ML1 = 1/(1 + ML) where ML is the original
Morning Line pre-race estimate of the final odds - ML1 is normalized to account for
horses that scratch before the race goes off (see Table 2.1)
monthBorn : Month horse is foaled - note that horses born in the same year all are
considered to have the same age whether born January 1 or December 31. 93.9% are
foaled in January thru May (March 27.8%, April 25.4%, and February 21.0%) and the
other 6.1% in June through December, being mainly Southern Hemisphere horses (see
Table 2.1)
nhor : Number of horses in a race. From 6 to 14 (Minimum was set to 6 for analysis
purposes). Percentages by number of horses: 6 - 17.8%, 7 - 19.9%, 8 - 21.1%, 9 -
16.7%, 10 - 13.8%, 11 - 7.9%, 12 - 6.9%, 13 - 2.0%, 14 - 0.8% (see Table 2.1)
numLines : Number of previous races to a maximum of 10. Refers to the number of “lines”
of past performances in the Daily Racing Form (see Table 2.1)
numLineDiff : For each race, the average number of lines is calculated. Then each horse’s
number of lines is subtracted to get numLineDiff (see Table 2.1)
odds : Final odds horse went off at: from 0.1 (minimum by law) to 243.2. For odds
distribution of Regression Data see Table 2.1
odds1 : Odds in last race. Note that in other states minimum odds may be 0.05. (see Table
2.1)
odds2 : Odds in 2nd race back (see Table 2.1)
perf : Response Variable - numerical evaluation of a horse’s performance in a given race
from 121 to -210 (see Table 2.1)
pp : Post position in race - 1 to 14 (see Table 2.1)
sex of Race : Race restriction by sex: 41.56% of races were for female horses only - 58.44%
of races were for either sex, although in those races only 270 out of 98,329 horses were
fillies or mares (0.3% running against the boys)
speed1Diff : Difference from average speed of race (see Table 2.1)
speed12Diff : Difference from average speed of the best of each horse’s last two races (see
Table 2.1)
speed123Diff : Difference from average speed of the best of each horse’s last three races (see
Table 2.1)
stateBred : Three character code for state or country horse was bred in. Most common states
are California: 36.4%, Kentucky: 25.4%, Florida: 6.1% and the most common foreign
countries are Ireland: 2.0%, Great Britain: 1.7%, and Argentina: 1.1%
track : Three racetracks: SA had 45.75% of the races, HOL had 36.26%, and DMR had
17.98%
trainID : Three character code for trainers. There are 985 trainers, of which Doug O’Neil
had the most horses entered: 4083, Bob Baffert had 3568, and 34 other trainers had
1000 or more horses entered
turf : One character field where “T” indicated a turf race: 27.2%, “P” a race on Polytrack or
Pro-Ride synthetic surfaces: 7.8%, “C” indicated Cushion track synthetic surface:
10.7% and a blank meant dirt surface: 54.3%
turfStarts : Number of lifetime races run on the turf: from 0 to a maximum of 78 (see Table
2.1)
turfWins : Number of lifetime wins on the turf: from 0 to a maximum of 15 (see Table 2.1)
type : Numerical designator of type of race: types are 0, 1, 4, 6, 8, 10, 14, 21, 22, 23, 31, 32,
33. Most common race types: Maiden Claiming (type = 0): 21.2%, Maiden
Allowance (type = 1): 18.7%, Allowance Non-Winners of 1 (31): 13.9%, Claiming
Middle (22): 12.9%, Claiming High (23): 10.0%, and Claiming Low (21): 9.7%
wbfOld1 : Win bet fraction from odds of previous race (see Table 2.1)
wbfOld2 : Win bet fraction from odds of 2nd race back (see Table 2.1)
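The endpoints quoted for the date variable can be checked with a short sketch. It assumes, since the text does not say so explicitly, that the stored values are the day-serial numbers used by MS ACCESS, which count whole days from December 30, 1899. The function name is hypothetical.

```python
from datetime import date, timedelta

# Assumption: the "date" field holds MS ACCESS day-serial numbers,
# i.e. whole days counted from 1899-12-30 (not stated in the text).
ACCESS_EPOCH = date(1899, 12, 30)

def serial_to_date(serial: int) -> date:
    """Convert a day-serial number to a calendar date."""
    return ACCESS_EPOCH + timedelta(days=serial)

first_race = serial_to_date(36187)
last_race = serial_to_date(40124)
print(first_race.isoformat(), last_race.isoformat())
```

Under that assumption the two serial numbers fall on 1/27/1999 and 11/7/2009, the dates given in the variable description.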
2.2 SUMMARY STATISTICS OF NUMERICAL DATA

Some of the interesting statistics from Table 2.1: one horse raced 2 days after its
previous race and came in 5th out of 6. Another, a 9 year old gelding, came back to
the races after a layoff of more than 5 years and came in last in an 8 horse field. The oldest
horse running was 12, and the highest odds at the California tracks was 243.2, while a horse
went off at 339.5 somewhere else.
2.3 SUBGROUPS
Various subsets of races and/or horses were run through the regression stepwise
selection process with all covariates to find predictor variables that are either hidden or are
much more significant in a subgroup than in the overall total set of all races and horses.
Subgroups considered:
MClm : Maiden Claiming - races for horses who have never won a race and can be claimed
for a specified claiming amount - considered to have the most volatile and unpredictable
horses - many veteran jockeys avoid riding in these races - has the lowest purse amounts
- Claimed covariates may be significant in these races
MAlw : Maiden Allowance - Races for horses who have never won a race and are not
claimable - many stars of the future are in these races
NonMaid : Races for horses who have won at least one race
Turf/Grass : Races run on a turfcourse - turf surface is thought to suit style of running for
some horses and be slippery and unsuitable for others due to different leg action
NonTurf : Races not run on a turfcourse
Poly : Races run on Polytrack synthetic surface (replacing dirt surfaces)
Cush : Races run on Cushion Track synthetic surface (replacing dirt surfaces)
Alw : Allowance Races - horses are not claimable - various restrictions usually apply limiting
horses eligible for race - not including Allowance races for Non-Winners of one or two
races
AlwNW12 : Allowance races for Non-Winners of either one or two races - these races are a
threshold between horses that go on to have profitable careers and run in Handicap and
Stakes races and those that fade into the lower class Claiming races
Stakes : Special races with highest purse amounts - also important for establishing a horse’s
reputation which directly influences its breeding value
Age2 : Races for two-year-olds - young horses may be quite inconsistent in their
performances
Age3 : Races for three-year-olds
Age3UP : Races for three-year-olds and up and races for four years and up
Sprint : Races for short distances less than 7 furlongs - usually favors horses with early speed
MidDist : Races for distances 8 to 9 furlongs - two turn races where first turn is close to the
start (Santa Anita and Del Mar) so post position may be more significant in these races
LongDist : Races for distances greater than 9 furlongs - favors horses with stamina and lighter
weight assignments
Fill : Races for Fillies and Mares only - these races may have more longshots
Male : Races for any sex - usually all male, but not always, so the Fill covariate can be
analyzed here
yr9902 : Data set from 1/27/99 to 12/25/2002 - First 3 and 11/12 years of Regression Data
yr0305 : Data set from 12/26/02 to 12/25/2005 - middle 3 years of Regression Data
yr0607 : Data set from 12/26/05 to 11/4/2007 - last two complete years of Regression Data -
may show trends that are changing
yr07 : Data set from 10/29/06 to 11/4/2007 - last complete year of Regression Data - may
show trends that are changing
a67 : Races with 6 or 7 horses in race - predictors may vary especially when compared to
a11UP subgroup
a8910 : Races with 8, 9, or 10 horses in race
a11UP : Races with 11 or more horses in race - predictors may vary when compared to
a67 subgroup, and post position may be more significant in these races
ClaimLow : Classes with low Claiming Amounts (8,000, 10,000, 12,500) - Claimed
covariates may be significant in these races
ClaimMid : Classes with middle Claiming Amounts (16,000, 20,000, 25,500, 32,000) -
Claimed covariates may be significant in these races
ClaimHigh : Classes with highest Claiming Amounts (40,000 and up) - Claimed covariates
may be significant in these races
DMR : Races run only at Del Mar Race Track - trainers and jockeys may do better here than
at other tracks
HOL : Races run only at Hollywood Race Track - trainers and jockeys may do better here
than at other tracks
SA : Races run only at Santa Anita Race Track - trainers and jockeys may do better here than
at other tracks
T65 : Races run on the downhill Turf course at Santa Anita - these races are so different from
all others that the covariates may greatly change values
2.4 DATA SEPARATED INTO ODDS RANGES

Regression data in Table 2.2 is separated into odds ranges for analysis. The top portion
has 16 odds ranges, which are folded into 8 odds ranges just below. The 8 odds ranges are
collapsed into 4 ranges, then 2 and then 1 line of totals. The EV shows that the
Profitability/EV of horses of odds from 0.1 to 9.0 is around 0.82 to 0.90, with an average
about 0.85 (third line from bottom of Table). The EV then tapers off as the EV for the 9 to 27
range varies from 0.76 to 0.84 with an average of 0.81 (fifth row from bottom). EV then
decreases rapidly to a low of 0.30 for the 75 and Up range. This supports the famous
favorite-longshot bias (favorites underbet so odds are too high and longshots overbet so their
odds are too low) that has often been found in gambling data. See Ali [8], Asch et al. [9],
Ziemba and Hausch [10], Lo [4], and Bacon-Shone and Lo [11].
Looking at Table 2.2, the Perf column shows a definite decrease as the rows descend
and the odds increase (and wbf decreases). Although the response variable Perf is independent
of odds when it is calculated (see Section 1.5), it is highly (inversely) correlated to odds: the
lower the odds, the higher the average Perf, and vice versa. This is to be expected since the
best performing horses (judging from previous races and other factors) get bet the most and
thereby have the lowest odds.
The sixteen odds ranges were chosen so that the separations fell on whole integers and
an approximately equal number of horses would accumulate in each range (except for the two
extremes). Having 16 ranges made it easy to convert to 8 ranges, then 4, 2, and 1 (overall
totals). The report that generates Table 2.2 was designed so that it could be used for any
Table 2.2. Regression Data by Collapsing Odds Ranges
Odds-range Total wins win% EV Perf
0-1 2552 1354 53.1 0.90 31.27
1-2 8890 3054 34.4 0.85 7.14
2-3 11889 2961 24.9 0.85 -10.63
3-4 11205 2078 18.5 0.82 -22.63
4-5 9473 1483 15.7 0.85 -34.66
5-6 7748 1009 13.0 0.84 -42.69
6-7 6797 782 11.5 0.85 -47.06
7-9 11136 1069 9.6 0.85 -55.31
9-11 8960 645 7.2 0.78 -65.90
11-14 9976 629 6.3 0.84 -75.66
14-19 10709 477 4.5 0.76 -87.24
19-27 10889 394 3.6 0.84 -99.83
27-35 7005 156 2.2 0.70 -111.82
35-50 8133 119 1.5 0.62 -126.91
50-75 6924 59 0.9 0.52 -145.51
75-UP 4569 15 0.3 0.30 -169.46
0.1-2 11442 4408 38.5 0.86 12.52
2-4 23094 5039 21.8 0.84 -16.45
4-6 17221 2492 14.5 0.84 -38.27
6-9 17933 1851 10.3 0.85 -52.18
9-14 18936 1274 6.7 0.81 -71.04
14-27 21598 871 4.0 0.80 -93.59
27-50 15138 275 1.8 0.66 -119.93
50-UP 11493 74 0.6 0.43 -155.03
0-4 34536 9447 27.4 0.84 -6.85
4-9 35154 4343 12.4 0.85 -45.37
9-27 40534 2145 5.3 0.81 -83.06
27-UP 26631 349 1.3 0.56 -135.08
All 136855 16284 11.9 0.78 -64.27
number of horses and there would be an appropriate grouping of odds ranges for that number
of horses. For analyses based on large sample sizes (many thousands of horses), the top group
of 16 odds ranges, as presented in Table 2.2, is preferable and thus used in subsequent
analyses. However, in some subgroup analyses, the Overlays and Underlays are shown using
4, 2, and 1 odds range groupings since the number of horses considered is on the order of
1000-1500, too few for the full 16 odds range grouping. Frequently, for very small subsets of 50
or fewer, only the totals line is appropriate, but even then it could be of interest to scan upward
even to the 8 and 16 odds ranges to see the odds distribution of the selected horses. Making
the number of lines of each grouping of odds ranges a power of 2 enables the user to scan up
and down the report and get an understanding of the distribution of the odds of the horses
selected and so understand the totals line better.
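The power-of-2 design can be illustrated with a short sketch that starts from the horse and win counts of the 16 finest ranges in Table 2.2 and merges adjacent pairs until one totals line remains. The function and variable names are hypothetical, and the labels are simply built from the endpoints of the merged ranges. EV and Perf are left out because they cannot be rebuilt from the counts alone.

```python
# (label, horses, wins) for the 16 finest odds ranges, copied from Table 2.2.
RANGES_16 = [
    ("0-1", 2552, 1354), ("1-2", 8890, 3054), ("2-3", 11889, 2961),
    ("3-4", 11205, 2078), ("4-5", 9473, 1483), ("5-6", 7748, 1009),
    ("6-7", 6797, 782), ("7-9", 11136, 1069), ("9-11", 8960, 645),
    ("11-14", 9976, 629), ("14-19", 10709, 477), ("19-27", 10889, 394),
    ("27-35", 7005, 156), ("35-50", 8133, 119), ("50-75", 6924, 59),
    ("75-UP", 4569, 15),
]

def collapse(rows):
    """Merge adjacent pairs of odds ranges, halving the number of rows."""
    merged = []
    for (lo_label, n1, w1), (hi_label, n2, w2) in zip(rows[0::2], rows[1::2]):
        label = lo_label.split("-")[0] + "-" + hi_label.split("-")[1]
        merged.append((label, n1 + n2, w1 + w2))
    return merged

# Build the 16, 8, 4, 2, and 1 line groupings.
levels = [RANGES_16]
while len(levels[-1]) > 1:
    levels.append(collapse(levels[-1]))

for rows in levels:
    for label, horses, wins in rows:
        print(f"{label:>6} {horses:>7} {wins:>6} {100 * wins / horses:5.1f}")
    print()
```

Because each grouping is an exact pairwise merge of the one above it, the totals and win counts of every collapsed line can be traced back to the finer lines it was built from.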
2.5 THE DAILY RACING FORM FOR THE SERIOUS HANDICAPPER
Most of the important pre-race information comes from The Daily Racing Form. The
Racing Form is similar to a small newspaper and contains key information on every horse
running in each race for a particular racetrack. Figure 1.1 shows information for a fifth race
(the big 5 in upper left corner) at Santa Anita on March 9, 2007. The race information is given
at the top: distance is 7 furlongs, it is a claiming race which means the horses may be claimed
or purchased (by registered trainers or owners only) for $25,000. The purse or prize money is
$28,000 of which typically 60% goes to the winner, 20% to 2nd place, 12% 3rd, 6% to 4th
and 2% to 5th (the distribution percentages vary from state to state). After the purse amount
comes the race restrictions: this race is open only to three year old fillies. Next is the weight
assignment - all must carry at least 122 pounds (jockey with added weights as needed).
Horses are allowed three pounds off if they have not won a race since January 30th of this
year. Also if the horses run for a lower claiming amount ($22,500) they are allowed two
pounds off. All information pertaining to the race in general is given in this top area.
Below the race information are three sections beginning with “2 Tee Dee,” “3 Brought
It,” and “1 Warrens Grindstone.” Each of these sections is a detailed summary of the key
information for the three fillies: Tee Dee, Brought It, and Warrens Grindstone (truncated).
Horses are listed in post position order (source for pp variable) so Tee Dee, if she runs, will
leave from the inside or post position one. If for some reason she does not run (scratches),
then Brought It starts in the one post position. The big number in front of the names is the
official number used for wagering purposes, known as the Saddle Number or Cloth Number.
All wagers are made by using this number, not to be confused with post position. Below the
Saddle Number is the jockey name and his (or her) record for the year and the record for the
previous year. For Tee Dee, the jockey is “M A Pedroza” and to the right is the trainer name,
“Jeff Mullins.” Just above “Jeff Mullins” is the breeder, “Nicholas ... (Ky),” the Ky indicating
Tee Dee was bred in Kentucky, which is the source for the stateBred variable. Note that if a
horse is from another country, that country’s code would be in parentheses next to the horse’s
name. For this study, country bred in and state bred in were lumped together into the same
field, stateBred. At the top line above trainer and breeder names, is “B f 3 (Jan)” which
indicates Tee Dee is a bay colored, three year old filly who was foaled in January, from which
the monthBorn, horseType and horseAge variables are obtained. To the right of trainer and
breeder names is a large “L 119.” The L signifies the horse will be given the legal
anti-bleeding drug Lasix and the 119 is the assigned weight. To the right of L 119, at the top
is “Life 7.” This is where the variable numLines comes from (to a maximum of 10), 7 being
the number of races Tee Dee has had so far in her racing career. Looking back to the far left
below “Pedroza,” is “11Feb07,” indicating that Tee Dee’s last race was February 11, 2007.
The difference between the current race date and the previous race date is the source of the
variable days1st. Just below “11Feb07” is “12Jan07” which is Tee Dee’s 2nd race back
and below that is “3Dec06” or Tee Dee’s 3rd race back. The two predictor variables, days2nd
and days3rd, come from these dates. In the blank area above and to the left of the large “L
119” is where a short note may appear such as “blinkers off,” “blinkers on,” or “1st time
lasix.” The variables blinksOff, blinksOn and lasix1st come from here. Also not shown here,
but of extreme importance, is the Morning Line, which will appear in large numbers just to
the left of the horse’s name. Directly below the “L 119” in Tee Dee’s section, is “2.60” and
just below that is “16.20.” These are the odds Tee Dee went off at in her last two races and are
the source of the odds1 and odds2 variables. Looking at the line for Brought It that begins
with “11Feb07,” there is a (circled) f followed by “Clm c-(20-18)” which indicates that
Brought It was claimed in her last race where the claiming prices were from $18,000 to
$20,000. Note that Tee Dee has a similar notation in her 2nd race back. These notations are
where the cl12 variable is from.
Obviously there is a lot more information here that is not used in this study. The
number of possible patterns and combinations of variables that could be analyzed is almost
endless.
CHAPTER 3
METHODOLOGY
• Data is first prepared in MS ACCESS, response variable Perf is calculated, predictor
variables are created, two datasets are prepared in MS ACCESS, the Regression Data
and the Testing Data, and then exported to SAS.
• Regression analysis is performed in SAS on the Regression Data until a suitable model
is found.
• Test Data is then used in the model to generate two files: Baseline and Results Files.
• Estimated values for the response variable, Perf, are found using the model’s regression
coefficients and then exported to Matlab for Monte Carlo type processing.
• For the Baseline Perfs, only a simple formula using two predictor variables is used.
• For the Results file, all the significant predictor variables are used to calculate Perf. At
this point, horses must be grouped together so they can be processed as clusters of
horses within a race.
• Monte Carlo processing produces estimated probabilities of each horse finishing 1st,
2nd, 3rd, or 4th in their race for both the Baseline and Results Datasets. These
probabilities are then exported back into ACCESS for comparison and
report-generation.
• Win probabilities in the Results dataset that differ significantly from the Baseline win
probabilities are separated into two groups: Overlays (profitable bets) where estimated
probabilities are greater than the corresponding Baseline probability, and Underlays
(unprofitable bets) where probabilities are less. Tables are generated showing these
results and key information.
3.1 PERF: THE IMPORTANT RESPONSE VARIABLE
For regression analysis, the key step is getting a functional response variable. A
continuous dependent variable, “PERF” is calculated from the “Results” data. Perf is an
estimated numerical evaluation of each horse’s performance in a race, independent of the
race’s “Class” (level of competition). See Section 1.5 for more information and background
on Perf and Power Point. Finding Perf is a two step process: First a “Power Point” for each
race is derived as a function of the number of horses in the race and the distances in lengths
between the top four finishers. Then for each horse, Perf is a function of finish and distance
(in lengths) from the Power Point. The greater the Perf, the stronger the finish and vice versa.
Perf varies from a max of 121 to a min of -210. The minimum Perf, -210, is assigned to all
horses considered to have finished sufficiently far back that there is no value in trying to
evaluate their performance. This cut-off point is around 5 to 8 lengths behind the Power Point,
depending on surface type. Perf also increases as the wbf increases since the strongest horses
have the lower odds (and thus higher wbf) and higher Perfs.
3.2 DATA PREPARATION IN MS ACCESS

Datasets are prepared for SAS to facilitate easy importing and analysis. Although each
race is a cluster of horses, SAS processes each horse record independently (so the order of the
records does not matter). Therefore some predictor variables that are race specific have to be
prepared accordingly. For example, The Daily Racing Form provides a speed rating which
varies from 0 to 117. Horses in high class races will, in general, have high speed ratings and
those in low class races will have low speed ratings. What matters is the relative speed ratings
to the other horses in the race. So to get Speed1Diff, the speed ratings from each horse’s
last race are averaged over the race and the average is then subtracted from each horse’s
speed rating. Thus Speed1Diff indicates the relative speed of each horse to the other horses in
the race, independent of speed rating of all other races and horses. Similar processing in
ACCESS was done for other variables that were
specific to the race. Some variables, including the response variable, Perf, were created using
Visual Basic programs developed within the MS ACCESS framework or by using the flexible
Query system in ACCESS. Other covariates developed like this are MLadj, Odds1, Odds2,
and Flags.
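The race-relative processing described above can be sketched as follows. This is only an illustration: the original work was done with ACCESS queries and Visual Basic, the record layout and the name race_relative are hypothetical, and the sign convention (horse minus race average) follows the description of Speed1Diff in this section.

```python
from collections import defaultdict

def race_relative(records, key="speed"):
    """Return each horse's value minus the mean of that value over its race."""
    by_race = defaultdict(list)
    for rec in records:
        by_race[rec["race_id"]].append(rec[key])
    race_mean = {rid: sum(v) / len(v) for rid, v in by_race.items()}
    return [rec[key] - race_mean[rec["race_id"]] for rec in records]

# Hypothetical speed ratings for two races of different class.
horses = [
    {"race_id": "A", "speed": 95}, {"race_id": "A", "speed": 90},
    {"race_id": "A", "speed": 85},
    {"race_id": "B", "speed": 70}, {"race_id": "B", "speed": 60},
]
speed1_diff = race_relative(horses)
print(speed1_diff)
```

The top horse in the low class race B gets the same relative rating as the top horse in the high class race A, which is the independence from class that the text describes.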
Predictor Variables Requiring Special Processing in ACCESS:
MLadj : Adjusted Morning Line - Morning Lines are normalized so total inverted Morning
Lines add to 1
Perf : Performance Indicator (and Power Point) calculated for each horse
Speed1Diff : Average Speed for race is calculated, then subtracted from each horse’s Speed
Rating
Speed12Diff : Each horse’s maximum speed rating in last 2 races is subtracted from race
average
Speed123Diff : Each horse’s maximum speed rating in last 3 races is subtracted from race
average
NumLinesDiff : Each horse’s number of race-lines (previous races up to a maximum of 10)
is subtracted from race average
Flags : Indicator-type: Flag is zero unless current race is 2nd race within 60 days after
maiden win
Odds1 : Odds in last race
Odds2 : Odds in 2nd race back
Days1st : Number of days since last race, if zero, set to 180 for processing purposes (1st time
starters had values of zero which would throw off calculations)
Days2nd : Number of days since 2nd race back, if zero, set to Maximum of 200 or Days1st +
20 for processing purposes (1st and 2nd time starters had values of zero which would
throw off calculations)
Days3rd : Number of days since 3rd race back, if zero, set to Maximum of 230 or Days2nd +
30 for processing purposes (1st, 2nd, and 3rd time starters had values of zero which
would throw off calculations)
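Two of the rules in this list can be sketched in a few lines. The sketch reads “Maximum of 200 or Days1st + 20” as max(200, Days1st + 20), and assumes that an already substituted Days1st or Days2nd is the value used in the next rule; both readings are assumptions, as are the function names and the example Morning Lines.

```python
def normalize_morning_line(ml_odds):
    """Invert each Morning Line (1 / (1 + ML)) and rescale so the race sums to 1."""
    inverted = [1.0 / (1.0 + ml) for ml in ml_odds]
    total = sum(inverted)
    return [x / total for x in inverted]

def fill_days(days1st, days2nd, days3rd):
    """Replace zero 'days since' values using the substitution rules in the list."""
    if days1st == 0:
        days1st = 180
    if days2nd == 0:
        days2nd = max(200, days1st + 20)
    if days3rd == 0:
        days3rd = max(230, days2nd + 30)
    return days1st, days2nd, days3rd

# Hypothetical six-horse race; Morning Lines are odds-to-1.
ml1 = normalize_morning_line([2.0, 3.0, 4.0, 6.0, 10.0, 20.0])
print([round(p, 3) for p in ml1])
print(fill_days(0, 0, 0))      # first-time starter
print(fill_days(250, 0, 0))    # second-time starter after a long layoff
```

Rescaling within the race is also what keeps the adjusted Morning Lines summing to 1 when a horse scratches and is dropped from the list.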
3.3 SAS OPERATIONS AND PROCESSING

Many predictor variables are created in SAS based on data imported from ACCESS.
Most are the indicator type: 1 if present in a data field, 0 if not present. Specific jockeys and
trainers are examples - 72 individual trainers have their own covariate from the one field
TrainID and from JockID 24 jockey covariates are created. Other predictor variables are
calculated in SAS and have continuous values such as wbfOld1 and wbfOld2 which are the
win bet fractions for odds1 and odds2 respectively.
3.3.1 Non-Indicator Covariates Created in SAS

See Table 2.1 for statistics on these covariates.
wbf : Win bet fraction of odds: 1 / ( 1 + odds )
wbfAll : Win bet fraction raised to an exponent determined through Box-Cox Method
wbfOld1 : Win bet fraction from odds of previous race
wbfOld2 : Win bet fraction from odds of 2nd race back
3.3.2 Indicator Type Covariates Created in SAS

The Post Position field yielded five indicator variables that were of interest: the three
inside posts 1 to 3 and the two far outside post positions: pp1, pp2, pp3, ppOut (far outside
post) and ppInOut (the post just to the left of far outside post). Saving ground (running
distance) on the turns is naturally quite important since the less distance a horse has to run, the
better its chances of a good finish. Post position is a definite factor for getting a horse into
favorable position on turns. On many two-turn races such as a mile at Santa Anita and Del
Mar, the first turn comes up in less than a furlong and the inside positions can be an advantage
for quick starting horses who then save ground on the first turn. However, post position 1 is
considered the most dangerous position because of its proximity to the inside rail where many
horse racing accidents have taken place - oftentimes horses are pinched between the rail and
other horses. Indicator variables for seven countries and eight states came from the stateBred
field.
The jockey field is used to create 24 Indicator-type covariates for individual jockeys.
In a similar fashion, 72 Indicator-type covariates for individual trainers were created (see Table
3.1). Other indicator variables included three Claimed indicators (cl1: horse claimed in last
race, cl2: claimed 2nd back, and cl3: claimed 3rd back) from the cl12 field, two (blinksOn and
blinksOff) from the blinkers field, two (start1st and start2nd) from the numLines field, and two
input fields were changed to indicator types (Lasix1st and notLasix) to facilitate processing.
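A minimal sketch of how such indicator covariates might be derived is given below. It is not the SAS code used in this study. It assumes post positions run consecutively from 1 to nhor, and that the cl12 codes 1, 2, and 4 add together when a horse was claimed in more than one of its last three races; the function names are hypothetical.

```python
def post_indicators(pp, nhor):
    """Five 0/1 post-position covariates for a horse in a field of nhor."""
    return {
        "pp1": int(pp == 1),
        "pp2": int(pp == 2),
        "pp3": int(pp == 3),
        "ppOut": int(pp == nhor),        # far outside post
        "ppInOut": int(pp == nhor - 1),  # one post inside the far outside post
    }

def claim_indicators(cl12):
    """Split cl12 into cl1, cl2, cl3, assuming the codes 1, 2, 4 add up."""
    return {
        "cl1": int(bool(cl12 & 1)),
        "cl2": int(bool(cl12 & 2)),
        "cl3": int(bool(cl12 & 4)),
    }

print(post_indicators(pp=9, nhor=10))
print(claim_indicators(5))   # claimed in last race and in 3rd race back
```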
Table 3.1. Trainer Names and ID Codes

ID Name ID Name ID Name
A Barry Abrams Ag Paul Aguirre AV A. C. Avila
B Bob Baffert Bec Rafael Becerra C Jack Carava
Cad Ruben Cardenas Cec Ben Cecil CJ Julio Canani
Cs James Cassidy CV Vladmir Cerin D Neil Drysdale
DC Caesar Dominguez Dej Jose DeLima DO Craig Dollase
EL Ronald Ellis Eur Peter Eurton F Robert Frankel
FA Jerry Fanning Ga Carla Gaines GL Patrick Gallagher
Gla Mark Glatt Gok Sal Gonzalez Gp Paco Gonzalez
Gre Beau Greely Gut Jorge Guitierrez H Robert B. Hess
HA Mike Harrington Hab Eoin Harty HD Bruce Headley
Hen Dan Hendriks HF David Hofmans Hol Jerry Hollendorfer
Jom Martin F. Jones Kna Steve Knapp Kor Brian Koriner
La David La Croix LE Craig Lewis Ma Michael Machowsky
Ma2 Gary Mandella MC Ronald McAnally MD Richard Mandella
Mii Peter Miller MM Mike Mitchell Mo Henry Moreno
Mog Ed Moger Mul Jeff Mullins Mum Kristin Mulhall
ON Doug O’Neil Paa Christopher Paasch Pei Jorge Periban
Pol Marcelo Polanco Pow Leonard Powell Puy Mike Puype
SA John Sadler SH Sanford Shulman Shc Gary Sherlock
She Art Sherman Shi John Shirreffs Si Clifford Sise
SJ Jenine Sahadi SM Melvin Stute SP William Spawr
Ste Roger Stein Stg Gary Stute TR Eddie Truman
VB Jack Van Berg VD Darrell Vienna Wa Ward Wesley
WK Kathy Walsh WT Ted West Zuc Howard Zucker
3.3.3 WBF Exponent Found Using Box-Cox Method

The best predictor of a horse’s performance is the odds it goes off at, as shown by
Table 2.2 where the two performance measurements, win percentage and Perf, decrease
reading down the table as the odds increase. The powerful betting public made up of
thousands of bettors wagering many thousands and frequently millions of dollars on a single
race, is constantly searching for a “bargain” horse - one whose return is better than expected.
Like the stock market, there are last minute “corrections” to horses that appear to have value.
Although the odds are the best predictor, they do not come in an easy-to-use form since odds
do not translate directly to probabilities and the total odds of all the horses in a race has no
significance. Inverting the odds to get the win bet fraction: wbf = 1/(odds + 1) is a start
since the total win bet fractions would add to one if there was no House Cut. With the House
Cut which varies due to Breakage, the win bet fractions sum to around 1.20. Thus win bet
fractions indicate how strongly each horse is bet relative to each other. In the early stages of
this project, it was noticed that the square root of wbf was a better fit than wbf itself. So it
seemed likely that the best fit was wbf raised to an optimal exponent. Thus the well-known
Box-Cox [12] transformation procedure, based on a maximum likelihood estimation routine,
is used to find the optimal exponent for wbf. Notice that in this instance, wbf is the response
variable and Perf is the predictor variable. This procedure was performed starting with coarse
intervals of 0.1 for the exponent; then intervals of 0.01, 0.001, and 0.0001 were used, reaching
the limits of accuracy for the SAS Box-Cox procedure. Thus an exponent was found to the 4th
decimal place (0.1548). A new predictor variable was then created for each horse:
wbfAll = wbf^0.1548.
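The two transformations can be written out as a short sketch. The function names and the six-horse race are hypothetical; the exponent 0.1548 and the odds values 0.1 and 243.2 are taken from the text and Table 2.1.

```python
LAMBDA = 0.1548  # exponent reported in the text

def wbf(odds):
    """Win bet fraction: inverted odds, 1 / (odds + 1)."""
    return 1.0 / (odds + 1.0)

def wbf_all(odds, lam=LAMBDA):
    """Win bet fraction raised to the Box-Cox exponent."""
    return wbf(odds) ** lam

# Minimum final odds, a mid-range value, and maximum final odds.
for odds in (0.1, 9.0, 243.2):
    print(f"odds {odds:6.1f}  wbf {wbf(odds):.3f}  wbfAll {wbf_all(odds):.2f}")

# Hypothetical six-horse race: the fractions sum to more than 1 (the House Cut).
race_odds = [1.5, 3.0, 4.0, 6.0, 10.0, 20.0]
print(round(sum(wbf(o) for o in race_odds), 3))
```

The small exponent compresses the range considerably: the wbf values of about 0.004, 0.10, and 0.91 map to wbfAll values of about 0.43, 0.70, and 0.99, in line with the minimum, median, and maximum listed for wbfAll in Table 2.1.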
3.3.4 SAS Regression and Model Selection

The REG procedure in SAS fits a linear regression model by least squares to find
estimated coefficients for each predictor variable. The Stepwise, Forward, and Backward
Selection processes are used (with a selection criterion of 0.05) and compared to find the best
model. These selection processes depended on Mallows’ Cp criterion. The Variance Inflation
Factor (VIF) diagnostic is used to check for multicollinearity. After consideration, various
covariates were deleted from the final model due to correlation problems and low significance.
Data Subgroups are run through the same process as described above and if
warranted, new predictor variables are created - always of the indicator type since they are
specific to the subgroups. Note that in some cases original covariates may be set to 0 when the
new covariates are set to 1 to avoid correlation problems.
The regression process is repeated with the new and original covariates. The VIF
diagnostic is especially important for checking for correlation between old and new
covariates. The standard deviation used for Monte Carlo processing of test results is generated
in this step. A Baseline model for testing was created using wbfAll and the number of horses
in the race to get a predicted Perf value for each horse. Table 3.2 presents the ANOVA Table
and parameter estimates.
Table 3.2. Test Data Baseline Model

Variable    Parameter Estimate   Standard Error   P-value   95% CI
Intercept   -412.04              2.28             <.0001    (-416.50, -407.57)
wbfAll      459.28               2.38             <.0001    (454.62, 463.93)
nhor        2.96                 0.13             <.0001    (2.71, 3.22)
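As a worked illustration of how the Baseline model turns odds into a predicted Perf, the sketch below plugs the Table 3.2 estimates into the regression equation, using wbf = 1/(odds + 1) and the Box-Cox exponent 0.1548. The function name and the example odds and field size are hypothetical; only the coefficients and the exponent come from the text.

```python
# Coefficients from Table 3.2 and the Box-Cox exponent from Section 3.3.
INTERCEPT = -412.04
COEF_WBFALL = 459.28
COEF_NHOR = 2.96
EXPONENT = 0.1548

def baseline_perf(odds, nhor):
    """Predicted Perf from the Baseline model for given odds and field size."""
    wbf = 1.0 / (odds + 1.0)      # win bet fraction
    wbf_all = wbf ** EXPONENT     # wbf raised to the Box-Cox exponent
    return INTERCEPT + COEF_WBFALL * wbf_all + COEF_NHOR * nhor

# Hypothetical example: a 4-to-1 horse and a 20-to-1 horse in an 8-horse race.
for odds in (4.0, 20.0):
    print(odds, round(baseline_perf(odds, 8), 1))
```

Lower odds give a larger wbfAll and therefore a higher predicted Perf, which is the behavior the Baseline is meant to capture.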
3.4 MATLAB: SIMULATING HORSE RACES FOR PROBABILITY ESTIMATES
Two files containing the predicted perfs for the horses in the Test Dataset were
created: the Test Baseline File and the Test Results File. They were then exported to Matlab
for Monte Carlo-style processing. The standard deviation used here is generated in the step
described in Section 3.3.4. For this step, horses are grouped together by race. Each horse in
each race has a standard normal random number times the standard deviation added to its
predicted perf to simulate the variance in performance as predicted by the Regression Model of
Table 3.3. Each race was simulated 100,000 times. The number of simulations in which a
particular horse had the highest total was divided by 100,000 to get its estimated probability
of winning. The same process was used to get estimated probabilities for 2nd, 3rd, and 4th place.
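The simulation step can be sketched as follows. This is an illustrative Python version, not the Matlab code used in the thesis; the function name, the predicted perfs, and the standard deviation of 60 are hypothetical placeholders, since the actual value comes from the regression step of Section 3.3.4.

```python
import numpy as np

def finish_probabilities(pred_perf, sd, n_sims=100_000, places=4, seed=0):
    """Estimate each horse's probability of finishing 1st..places-th in one race."""
    rng = np.random.default_rng(seed)
    pred = np.asarray(pred_perf, dtype=float)
    n_horses = pred.size
    # Simulated perf = predicted perf + sd * standard normal draw.
    sims = pred + sd * rng.standard_normal((n_sims, n_horses))
    # order[:, k] holds the index of the horse with the (k+1)-th highest total.
    order = np.argsort(-sims, axis=1)
    probs = np.zeros((n_horses, places))
    for k in range(min(places, n_horses)):
        counts = np.bincount(order[:, k], minlength=n_horses)
        probs[:, k] = counts / n_sims
    return probs

# Hypothetical six-horse race.
probs = finish_probabilities([-30, -50, -70, -90, -110, -130], sd=60.0)
print(np.round(probs, 3))  # rows: horses; columns: 1st, 2nd, 3rd, 4th
```

Each column sums to 1 because exactly one horse occupies each finishing position in every simulated race.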
3.5 COMPARING PROBABILITY FILES IN ACCESS
The two probability files, Test Baseline and Test Results, are exported to ACCESS for
comparison reports. The final model probabilities that originated from the estimated regression
equation are compared to the baseline probabilities. Those that differ significantly
are separated into two groups: estimated probabilities higher than the baseline’s are
considered good bets (“Overlays”), while those lower than the baseline’s are
considered bad bets (“Underlays”). Each group is displayed in an odds-based report format. The
Results File was generated by applying the regression function to each horse in the Test Data,
plugging in the regression coefficients to get predicted values for perf (pred) as shown in
Table 3.3.
Table 3.3. Regression Model Coefficients

Variable    Description                Parm. Est.   Std. Error   P-value   95% CI                VIF
Intercept   intercept                  -403.22      2.37         <.0001    (-407.88, -398.58)    0
wbfAll      wbf to exponent            447.57       2.64         <.0001    (442.40, 452.75)      1.35
nhor        number of horses           3.14         0.13         <.0001    (2.89, 3.41)          1.11
days2nd     days 2nd race back         -0.03        0.023        <.0001    (-0.035, -0.25)       1.12
speed12     diff speed last 2          0.38         0.07         <.0001    (0.24, 0.52)          1.25
numLine     diff number of lines       3.64         0.13         <.0001    (3.34, 3.89)          1.10
blinkon     blinkers on                -16.68       1.10         <.0001    (-18.84, -14.53)      1.00
pp2         post position 2            2.61         0.76         0.0006    (1.11, 4.01)          1.05
ppOut       far outside post pos.      -2.50        0.76         0.0010    (-3.99, -1.01)        1.05
pp3         post position 3            2.20         0.76         0.0039    (0.71, 3.70)          1.05
notLasix    not using lasix            -7.86        1.29         <.0001    (-10.40, -5.33)       1.03
jEsp        jockey - V Espinoza        -3.10        1.02         0.0023    (-5.10, -1.10)        1.03
jBat        jockey - T Baze            9.11         2.72         0.0317    (5.03, 13.25)         1.02
jSol        jockey - A. Solis          4.41         1.17         0.0002    (2.13, 6.71)          1.08
jGar        jockey - M. Garcia         6.00         1.37         <.0001    (3.32, 8.68)          1.01
jSmi        jockey - M. Smith          -5.47        1.71         0.0014    (-8.84, -2.11)        1.01
jBav        jockey - M. Baze           9.91         2.26         <.0001    (5.47, 14.34)         1.00
jSor        jockey - D. Sorenson       6.99         2.00         0.0005    (3.07, 10.91)         1.00
jBj         jockey - R. Bejarano       8.41         2.94         0.0319    (4.27, 12.87)         1.04
jQui        jockey - A. Quimez         6.24         2.70         0.0211    (2.16, 10.36)         1.02
jRor        jockey - J. Rosario        14.98        3.59         <.0001    (10.77, 19.43)        1.03
Bec         trainer - R. Becerra       8.02         2.64         0.0024    (2.84, 13.20)         1.01
Cad         trainer - R. Cardenas      -12.38       3.67         0.0007    (-19.57, -5.20)       1.00
Cec         trainer - B. Cecil         7.48         3.70         0.0433    (0.23, 14.74)         1.00
EL          trainer - R. Ellis         6.49         2.86         0.0235    (0.87, 12.10)         1.00
Gut         trainer - J. Guiterrez     15.19        3.70         <.0001    (7.94, 22.44)         1.00
HD          trainer - B. Headley       6.40         2.45         0.0274    (2.75, 10.55)         1.04
Kna         trainer - S. Knapp         -7.37        2.11         0.0005    (-11.5, -3.24)        1.00
LE          trainer - C. Lewis         -6.85        2.53         0.0067    (-11.81, -1.90)       1.00
Ma2         trainer - G. Mandella      12.12        3.90         0.0018    (4.49, 19.74)         1.00
MD          trainer - R. Mandella      4.40         2.13         0.0391    (0.22, 8.57)          1.02
Puy         trainer - M. Puype         12.70        3.81         0.0008    (5.24, 20.16)         1.00
SA          trainer - J. Sadler        8.34         1.72         <.0001    (4.97, 11.71)         1.02
VB          trainer - J. Van Berg      -9.28        2.25         <.0001    (-13.69, -4.88)       1.00
VD          trainer - D. Vienna        9.25         2.66         0.0005    (4.03, 14.47)         1.00
WK          trainer - K. Walsh         6.28         3.70         0.0353    (3.17, 9.34)          1.00
zFR         bred in France             -7.00        2.60         0.0070    (-12.09, -1.91)       1.01
CHAPTER 4
RESULTS
Test Results are divided into two groups: Overlays, horses with estimated probabilities
significantly greater (> 1.33 ∗ Baseline) than the baseline probabilities, and Underlays,
horses with estimated probabilities significantly less (< 0.6 ∗ Baseline) than the baseline
probabilities. Each group is further divided into two tables, the first being a comparison of
win results to baseline results and the second being a comparison of 2nd, 3rd, and 4th place
finishes.
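The grouping rule stated above can be written as a short function. This is a minimal sketch in Python; the function name and the label for horses that fall in neither group are hypothetical, while the 1.33 and 0.6 thresholds are those given in the text.

```python
OVERLAY_RATIO = 1.33   # Results probability > 1.33 * Baseline
UNDERLAY_RATIO = 0.6   # Results probability < 0.6 * Baseline

def classify(result_prob, baseline_prob):
    """Label a horse by comparing its model probability to the baseline probability."""
    if result_prob > OVERLAY_RATIO * baseline_prob:
        return "Overlay"
    if result_prob < UNDERLAY_RATIO * baseline_prob:
        return "Underlay"
    return "Neither"

# Hypothetical win probabilities for three horses with a 12% baseline.
for result in (0.20, 0.12, 0.05):
    print(result, classify(result, 0.12))
```

Because the thresholds are asymmetric, a horse must fall further below the baseline to count as an Underlay than it must rise above it to count as an Overlay.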
4.1 OVERLAYS
For the Overlays (Tables 4.1 and 4.2), 1,548 horses were selected from the test group.
Note that the proportion selected from the low (0 to 4) odds range, 181/1548 = 11.7%, was
much less than the proportion for the whole test group, 7588/31398 = 24.2%. Thus one must
be careful when comparing totals for the Results to totals for the Baseline, since the 0 to 4
odds range has a much higher percentage of 1sts, 2nds, 3rds and 4ths. For example, when
comparing the “Out” numbers in Table 4.2, the total percentage for the Baseline (OutB = 53.6) is
lower than that for the Results (OutA = 55.2), while at the same time OutA is lower than
OutB in each of the four odds ranges. This is a kind of numerical optical illusion (an instance
of Simpson’s paradox) due to the weighting of the Results toward high-odds horses. For the
Underlays, the skew is much more severe. Of the 1,036 horses selected in the Underlays group,
only 11 were in the 0-4 odds range: 11/1036 = 1.1%, compared with 7588/31398 = 24.2%
for the whole test group. Thus care must be taken when looking at totals.
For horses with Results probabilities significantly greater (> 1.33 ∗ Baseline) than the
Baseline probabilities, Table 4.1 shows an overall increase in EV/Profitability from 0.78 to 1.03.
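The reversal in the “Out” totals can be checked directly from the counts and percentages in Tables 4.1 and 4.2. The sketch below, in Python, recomputes both totals as weighted averages over the four odds ranges and then reweights the Results percentages by the Baseline's distribution; the variable and function names are illustrative.

```python
# Horses per odds range (0-4, 4-9, 9-27, 27-UP) from Table 4.1.
tot_b = [7588, 8519, 9509, 5782]   # Baseline
tot_a = [181, 416, 592, 359]       # Overlays (Results)
# "Out" percentages per odds range from Table 4.2.
out_b = [24.4, 45.0, 64.4, 86.5]
out_a = [23.2, 41.1, 58.6, 82.2]

def weighted_mean(values, weights):
    """Average of values weighted by the number of horses in each range."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

overall_b = weighted_mean(out_b, tot_b)      # Baseline total
overall_a = weighted_mean(out_a, tot_a)      # Results total
reweighted_a = weighted_mean(out_a, tot_b)   # Results rates, Baseline mix

print(round(overall_b, 1), round(overall_a, 1), round(reweighted_a, 1))
```

The first two values come out at roughly 53.5 and 55.2, matching the totals in Table 4.2 to within rounding. With the Baseline's mix of odds ranges, the Results total falls to roughly 49.6, below the Baseline, which is the comparison that the range-by-range numbers suggest.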
Table 4.1. Overlays: Comparison of Win Results(A) to Baseline(B)

range   TotB    TotA   winB   winA   1%B    1%A    EV:B   EV:A   PerfB    PerfA
0-4     7588    181    2022   52     26.6   28.7   0.85   1.09   -0.8     12.1
4-9     8519    416    1019   53     12.0   12.7   0.83   0.94   -35.8    -22.3
9-27    9509    592    533    44     5.6    7.4    0.84   1.20   -71.3    -52.0
27-UP   5782    359    72     6      1.2    1.7    0.53   0.82   -128.5   -113.8
0-9     16107   597    3041   105    18.9   17.6   0.83   0.98   -19.3    -11.9
9-UP    15291   951    605    50     4.0    5.3    0.72   1.06   -92.9    -75.4
All     31398   1548   3646   155    11.6   10.0   0.78   1.03   -55.2    -50.9
Table 4.2. Overlays: Comparison of 2nd, 3rd, and 4th Results(A) to Baseline(B)

range   TotA   2A    3A    2%B    2%A    3%B    3%A    4%B    4%A    OutB   OutA
0-4     181    37    31    20.6   20.4   16.3   17.1   12.1   10.5   24.4   23.2
4-9     416    78    57    14.4   18.8   14.6   13.7   14.0   13.7   45.0   41.1
9-27    592    53    69    7.6    9.0    9.9    11.7   12.4   13.3   64.4   58.6
27-UP   359    9     22    2.3    2.5    3.9    6.1    6.1    7.5    86.5   82.2
0-9     597    115   88    17.3   19.3   15.4   14.7   13.1   12.7   35.3   35.7
9-UP    951    62    91    5.6    6.5    7.6    9.6    10.0   11.1   72.8   67.5
All     1548   177   179   11.6   11.4   11.6   11.6   11.6   11.8   53.6   55.2
4.2 UNDERLAYS
Underlays are the horses whose estimated winning probabilities are significantly
less (< 0.6 ∗ Baseline) than the Baseline probabilities. These horses show a marked
decrease in overall EV/Profitability, from 0.78 to 0.46 (see Table 4.3). These values are
somewhat misleading since most of the horses selected in this Results group were in the high
odds ranges, which had low EVs to begin with (see Table 2.2), but the EVs of 0.0 and 0.67 for
the 4-9 and 9-27 odds ranges are lower than the corresponding baseline EVs. In the breakdown by
the four odds ranges in Table 4.4, the first odds range (0-4) should be ignored since it has only
11 horses in it. The other three odds ranges showed decreases in 2nd, 3rd, and 4th place
finishes, which resulted in increases in the OutA percentages over the Baseline’s OutB numbers.
Table 4.3. Underlays: Comparison of Win Results(A) to Baseline(B)

range   TotB    TotA   winB   winA   1%B    1%A    EV:B   EV:A   PerfB    PerfA
0-4     7588    11     2022   3      26.6   27.3   0.85   1.10   -0.8     -23.5
4-9     8519    61     1019   0      12.0   0.0    0.83   0.00   -35.8    -86.4
9-27    9509    334    533    13     5.6    3.9    0.84   0.67   -71.3    -107.7
27-UP   5782    630    72     5      1.2    0.8    0.53   0.39   -128.5   -160.7
0-9     16107   72     3041   3      18.9   4.2    0.83   0.17   -19.3    -76.8
9-UP    15291   964    605    18     4.0    1.9    0.72   0.49   -92.9    -142.3
All     31398   1036   3646   21     11.6   2.0    0.78   0.46   -55.2    -137.8
4.3 COMPARISONS OF RESULTS BY THE FOUR MAJOR ODDS RANGES
A visual representation of the results is best to highlight the differences among the
three datasets: Underlays, Overlays and the Baseline. Care must be taken when comparing the
overall results since the proportions of horses in the four major Odds Ranges are different for
each of the three datasets, as noted in Sections 4.1 and 4.2. For example, Figure 4.1 shows
that the win percentages are fairly even for the odds range 0-4, but it should be noted that the
Underlays had only 11 horses in that group, of which three were winners. Figure 4.2 also
reflects this situation in the 0-4 range. But in general, the two figures show that there is a
substantial increase in win percentage and Expected Value with the Overlays subset and a
definite decrease with the Underlays. Figure 4.3 shows the combined percentages for 1st
through 4th place finishes.

Table 4.4. Underlays: Comparison of 2nd, 3rd, 4th Results(A) to Baseline(B)

range   TotA   2A   3A   2%B    2%A   3%B    3%A    4%B    4%A    OutB   OutA
0-4     11     1    2    20.6   9.1   16.3   18.2   12.1   9.1    24.4   36.4
4-9     61     6    7    14.4   9.8   14.6   11.5   14.0   11.5   45.0   67.2
9-27    334    10   17   7.6    3.0   9.9    5.1    12.4   11.4   64.4   76.6
27-UP   630    3    15   2.3    0.5   3.9    2.4    6.1    2.1    86.5   94.3
0-9     72     7    9    17.3   9.7   15.4   12.5   13.1   11.1   35.3   62.5
9-UP    964    13   32   5.6    1.3   7.6    3.3    10.0   5.3    72.8   88.2
All     1036   20   41   11.6   1.9   11.6   4.0    11.6   5.7    53.6   86.4
[Figure 4.1: six bar-chart panels of Win % for Under, Base, and Over, one per odds range: 0 to 4, 4 to 9, 9 to 27, 27 and UP, 0 to 9, 9 and UP.]
Figure 4.1. Win percentage comparisons between Underlays, Baseline, and
Overlays by odds ranges. Win percentages for Overlays substantially greater
than those for Underlays.
[Figure 4.2: six bar-chart panels of EV for Underlays, Baseline, and Overlays, one per odds range: 0 to 4, 4 to 9, 9 to 27, 27 and UP, 0 to 9, 9 and UP.]
Figure 4.2. EV comparisons between Underlays, Baseline, and Overlays by
odds ranges. EVs for Overlays are much greater than those for Underlays.
[Figure 4.3: four stacked bar-chart panels of Total% (1st, 2nd, 3rd, 4th), one per odds range: 0 to 4, 4 to 9, 9 to 27, 27 and UP.]
Figure 4.3. Test results: Finish comparisons between Underlays, Baseline, and
Overlays by odds ranges. Total percentages significantly greater for Overlays
than Underlays except for 0-4 range.
CHAPTER 5
MULTICOLLINEARITY
Shown in Table 5.1 is part of the SAS Variance Inflation Factor (VIF) diagnostic
results. Covariates not shown (mostly trainers, jockeys, and state-bred) all had VIF values less
than 1.4 and so were not flagged for high collinearity concerns. As expected, days1st (days
since last race), days2nd (days since 2nd race back) and days3rd are correlated, since days2nd
is, by definition, always larger than days1st, and days3rd is always larger than days2nd. The
VIF values alone do not show which covariates are correlated with each other, but when days2nd
is by itself in the Final Model, its VIF drops to 1.12 (Table 3.3), showing little correlation
with the remaining covariates. Similarly, speed1Diff, speed12Diff and speed123Diff show VIF
values over 2 in Table 5.1, but when speed12Diff is by itself, its VIF drops to around 1.25, as
shown in Table 3.3. For the Final Model, looking at Table 3.3, the highest VIF is 1.35, for
wbfAll, from which the conclusion is that there is no serious concern of multicollinearity in
the model.
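For readers who want to see what the VIF diagnostic measures, the sketch below computes it from its definition, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing covariate j on all the other covariates. It is an illustration in Python, not the SAS output; the function name and the synthetic "days since race" data are hypothetical and only mimic the nesting of days1st, days2nd, and days3rd.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the covariate matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic stand-in: cumulative gaps, so each column exceeds the previous one.
rng = np.random.default_rng(1)
gaps = rng.exponential(30.0, size=(5000, 3))
days = np.cumsum(gaps, axis=1)   # columns play the role of days1st, days2nd, days3rd
print(np.round(vif(days), 2))
```

In this synthetic setup the middle column receives the largest VIF because it shares a component with both of its neighbors, which is qualitatively the same ordering as in Table 5.1.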
Table 5.1. Covariates With Variance Inflation Greater Than 2.0

Variable       Parameter Estimate   Standard Error   P-value   95% CI               Variance Inflation
days1st        0.0072               0.0045           0.1046    (-0.0015, 0.016)     2.56
days2nd        -0.029               0.0049           <.0001    (-0.039, -0.019)     5.07
days3rd        -0.0077              0.0036           0.0323    (-0.015, -0.00065)   3.42
speed1Diff     -0.085               0.13             0.5124    (0.52, 0.087)        4.25
speed123Diff   0.22                 0.12             0.0620    (-0.011, 0.46)       3.51
speed12Diff    0.19                 0.09             <.0001    (0.04, 0.34)         2.08
CHAPTER 6
DISCUSSION
The Overlays showed significant improvements in Win EV/Profitability as well as
improvements in 2nd, 3rd and 4th place finish percentages, as noted in Section 4.1.
Conversely, the Underlays indicated horses to be avoided due to low Win EV/Profitability and
lower 2nd, 3rd and 4th place percentages as noted in Section 4.2.
6.1 RESPONSE VARIABLE: PERF (AND POWER POINT)
Perhaps the most critical component of this system is the response variable Perf. In the
early stages of analysis, a Perf with a minimum of -210 and a simple function of
lengths ahead of or behind the Power Point was used. This was deemed unsatisfactory because
some horses won big, and their added lengths of victory were not nearly as important as,
perhaps, the lengths just ahead of and behind the area of a close finish of two or more horses
vying for the win. So Perf values were increased proportionately more for added distance
directly ahead of the Power Point up to a close winner. For big winners, the Perfs were high,
but extra lengths over a 5-length winning margin did not increase the Perf by much. Similarly,
for horses close to but behind the Power Point, Perf ratings fall more rapidly the further
they are from the Power Point, down to the -210 minimum. Other Perf scores were tried, including
one with a -390 minimum, but it was deemed unacceptable as it seemed to fit the horses in the
middle and rear of the race better than those in the front. The final Perf used here was also a
function of the track surface. We noticed that races run on turf generally have markedly closer
finishes. Perf values were thus adjusted accordingly. Note that the synthetic surfaces,
Polytrack and Cushion Track, had finishes far more similar to those on dirt tracks.
The Power Point is an attempt to numerically evaluate a point in a race’s finish that serves as
a threshold separating strong efforts from lesser efforts. It is closely related to the 2nd
place finish in a race with few horses (6-7), the 3rd place finish in races with 8 to 10 horses,
and some point between 3rd and 4th for races with over 10 horses. Thus it is a function of the
number of horses in a race, the distance in lengths between 2nd and 3rd, the distance between
3rd and 4th, and sometimes the distance between 1st and 2nd.
6.2 SUBGROUPS
Subgroups did not yield many new significant covariates. This was a bit of a surprise, as some
covariates and subgroups had been designed with each other in mind, such as fillies and mares
running against males, and post position 1 in the Middle Distance subgroup, where the first
turn comes up quickly. Of course the coefficients for individual trainers and jockeys varied
from subgroup to subgroup, but there were no large increases or decreases that coincided with
other strong indicators (p-values, F-values, partial R-squared contributions, etc.). There is
still a lot of valuable information to be gleaned from subgroups; the trick is finding the best
covariates to test against. Often a handicapper will wonder how a certain pattern looks
in a specific Subgroup. Since almost any pattern can be converted to an indicator-type
covariate, it can then be processed through the system described in this paper to find its value
as a predictor variable. There are undoubtedly numerous (currently unidentified) covariates that
do not show up as significant when looking at the total Regression Dataset, but would be
highly significant if looked at in a certain Subgroup. The potential in this area is enormous.
6.3 LIMITATIONS OF THE STUDY
Unfortunately, synthetics replaced dirt surfaces right at the end of the Regression Data
period. It is impossible to judge the effect the switch to synthetic track surfaces has had on
this study. Of the 16,284 races making up the Regression Dataset, 10,982 (66.4%) were run
on dirt and 896 (5.5%) on synthetics. For the Testing Dataset, 2,668 of the 3,646 races (73.2%)
were run on synthetics and none on dirt. How great a difference the surface makes to a race
is open to speculation, but many trainers complained about the switch of surfaces and some
owners and trainers took their horses to other tracks [13]. Undoubtedly there has been an
adjustment period for the trainers, jockeys and horses to get used to the synthetics [14].
Unfortunately, that adjustment period occurs mainly during the period of the Testing Dataset.
The majority of covariates analyzed here are either trainers or jockeys, although in
essence they are basically just two covariates: jockey and trainer. Like other humans, trainers
and jockeys come and go, and have ascending and descending periods. Significant new
jockeys and trainers may have appeared at the end of the Regression Data period and so either
do not have regression coefficients or the estimates may be inaccurate due to small sample
size. The subgroup datasets for the last year of Regression Data (2007) and the last two years
(2006 - 2007) helped somewhat in this regard - one jockey and two trainers were deemed
significant enough to be added. Note that there were hundreds of trainers and jockeys who
appeared sporadically and because of their lack of data were not added to the study.
6.4 PREDICTOR VARIABLES INCLUDED IN THE FINAL REGRESSION MODEL
Out of the 72 indicator-type trainer covariates, 16 appeared in the final model as
shown in Table 3.3, with 12 of the 16 having positive parameter estimates, which means a
positive influence on predicted Perf (since they are indicator-type variables). Since the 72
trainers were selected because they had the most horses, and in general that is an indicator of
financial success at horse racing, it was expected that the majority of trainers would have
positive parameters. It is interesting that Bob Baffert [15] (a member of Horse Racing’s Hall of
Fame and the trainer of three Kentucky Derby winners, to name two of his numerous
accolades, and probably the most famous trainer in this study) was not in the final model.
Although he enjoys great success, his celebrity status translates into his horses being heavily
bet. Since he did not have a negative parameter estimate, his success and notoriety appear to
balance each other.
Ten of 24 jockeys were included in the final regression model, and 8 of the 10 had positive
parameter estimates. The two most famous, Kent Desormeaux [16] (holds the current record for most
wins by a jockey in the U.S. in one year and is one of four jockeys to win three national
titles) and Garrett Gomez [17] (U.S. leading jockey in total earnings for the years 2006 to
2009), were not in the final model, probably because their excellent riding skills were offset
by their name recognition.
Of the five post position (pp) covariates, three made the final model. From the stateBred
field, one of the seven countries (France) and none of the eight states made the final cut.
Actually, the stateBred field was not expected to have any covariates in the final model. The
foreign horses especially should be indicator variables in conjunction with number of races in
the U.S. (see item 6 in the Conclusions chapter).
The biggest surprise was the effect of adding blinkers to horses that had
not worn them in their previous race. The blinkon covariate had the largest absolute value of
any indicator-type covariate: -16.68. Handicappers typically consider adding blinkers a
positive sign. The other big surprise was numLineDiff, the difference between a horse’s number
of lines and the race’s average. Basically this says that experience helps. NotLasix was
also an interesting indicator with its -7.86 value. Since so many (California) horses use lasix
(96.5%), and in most races it seems like all the horses have it, it is easy to overlook horses
that are not using it.
Many covariates that were of high interest to us and therefore purposely
included in this study did not show up in the final model. Some of these were: the Claimed
covariates, cl1 (claimed last race), cl2 (claimed 2nd race back), and cl3 (claimed 3rd race
back), the flags covariate (2nd race since maiden win if within 60 days), and odds1 (odds in
last race).
6.5 MISCELLANEOUS
The best predictors other than the baseline predictors (Intercept, wbfAll, and nHor)
were, in order of strength: numLineDiff, blinkon, days2nd, notLasix, speedDiff2, ppOut, pp2
and pp3. The rest of the predictors were trainers, jockeys, and horses bred in France.
The only time-consuming step in the process was the computation of Monte Carlo
probability estimates, which took around 10 to 25 hours depending on the speed of the
computer used. EVs (Expected Value/Profitability) do not always increase directly with
increases in Perf. It may be that the improvement in Perfs shows up in improved 2nd, 3rd, or
4th place performances.
In an article by Clive Thompson [18] in Wired magazine titled “Advantage: Cyborgs,”
it is pointed out that in a “freestyle” 2005 online chess tournament, where any kind of entrant
was allowed, the most successful players were “Cyborgs,” those able to use computers as
“assistants” most efficiently. That principle undoubtedly holds at the racetracks. The system
described here has tremendous potential for assisting handicappers. Finding accurate
probabilities should translate into high profitability.
CHAPTER 7
CONCLUSIONS
1. The system works. Table 7.1 shows a comparison of the totals for Overlays versus
Underlays. The differences are dramatic even taking into account the differences in
distribution by odds ranges.
Table 7.1. Comparison of Overlays and Underlays Totals

Totals of Important Statistics                Underlays   Overlays
Profitability/EV                              0.46        1.03
Winning Percentage                            2.0         10.0
Combined 1st, 2nd %                           3.9         21.6
Combined 1st, 2nd, 3rd %                      7.9         33.2
Combined 1st - 4th %                          13.6        45.0
Average Perf                                  -137.8      -50.9
% of Total Horses in Odds Range 0-4           1.1         11.7
% of Total Horses in Odds Range 27 and UP     60.8        23.2
2. A better comparison is Table 7.2, since it covers only the odds range 9-27, where the
percentage of total horses in the range is about the same for the two groups (32.2% and 38.2%).
Horses in the 9-27 odds range are longshots, basically overlooked or lightly bet. Although a
bettor has to be patient for Overlays and Underlays to occur, they can lead to profitable bets
when used in exotic wagering, especially exactas, trifectas and superfectas, since the
horses to bet and the horses to avoid are clearly identified. Hitting a 15 or 20 to one longshot
in the correct spot on an exotic bet can really boost the payoff!
Table 7.2. Odds Range 9-27 of Overlays and Underlays

Totals of Important Statistics: Odds 9 - 27   Underlays   Overlays
Profitability/EV                              0.67        1.20
Winning Percentage                            3.9         7.4
Combined 1st, 2nd %                           6.9         16.4
Combined 1st, 2nd, 3rd %                      13.0        28.1
Combined 1st - 4th %                          24.4        41.4
Average Perf                                  -107.7      -52.0
3. The system, though it is in its infancy, works well at identifying a predictive
model. Using these regression methods will produce more accurate probabilities on
some horses than those reflected by the odds.
4. The system is usable at the racetrack. Once a regression equation is found, new
estimated probabilities can be generated and calculations quickly made on any new
horse to highlight wagers that are likely to be profitable. This includes not only straight
win bets, but perhaps more importantly, the exotic single-race bets such as Exactas,
Trifectas, and Superfectas, as well as the multiple-race wagers such as the Daily
Doubles, Pick3, Pick4, and Pick 6.
5. Just about any pattern or combination of factors or subset of horses can easily and
quickly be turned into a predictor variable and analyzed to see whether and how it affects a
horse’s probabilities. The flags covariate (see Section 2.1) is an example of an obscure
pattern that we found interesting and wanted to investigate, and we were able to do so just by
making it an indicator-type predictor. This is a tremendous tool for handicappers who
have often wondered about special situations but had no feasible way to get an accurate
answer.
6. Improvements are possible: the Response Variable, Perf and its underlying key statistic,
Power Point can both be tweaked for better overall performance. Possible new predictor
variables with some appropriate variable number N: weight drops from one race to the
next, lowest weight in race by N or more pounds, switching distance type after N or
more races at one specific type, new jockey after previous jockey rode N or more times,
moving up or down in class, three year old horses in races for ages three and up, adding
(or removing) blinkers after N races of not wearing (or wearing) them, are some
possibilities. Others may involve comparing lifetime and current year records for
statistics such as average earnings per race, percentages for winning or placing, or
showing. The foreign horses could also provide valuable predictors like first race in
U.S., 2nd race, etc. or when they switch to dirt or synthetic surface for the first time
(many horses from Europe have run only on turf when they come to the U. S.). There
are numerous possibilities for new predictors.
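As an illustration of how one of the patterns proposed in item 6 could be turned into an indicator-type predictor, the sketch below codes "lowest weight in race by N or more pounds". It is a hypothetical Python example, not part of the system described here; the function name, the tie-handling rule, and the sample weights are assumptions made for illustration.

```python
def lowest_weight_by(weights, n):
    """Return a 0/1 indicator per horse: 1 if it carries the lowest weight
    in the race and is at least n pounds below the next-lowest weight."""
    if len(weights) < 2:
        return [0] * len(weights)
    ordered = sorted(weights)
    lightest, runner_up = ordered[0], ordered[1]
    # A tie for lightest gives a margin of 0, so no horse is flagged when n > 0.
    margin_ok = (runner_up - lightest) >= n
    return [1 if (w == lightest and margin_ok) else 0 for w in weights]

# Hypothetical weights (pounds) for a four-horse race, with N = 5.
print(lowest_weight_by([122, 116, 121, 124], 5))
```

The resulting column of 0s and 1s could then be added to the regression dataset and run through the same selection and VIF checks as any other indicator covariate.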
BIBLIOGRAPHY
[1] D.A. Harville. Assigning probabilities to the outcomes of multi-entry competitions.
Journal of American Statistical Association, 68:312-316, 1973.
[2] R.J. Henery. Permutation probabilities as models for horse races. Journal of Royal
Statistical Society B, 43:86-91, 1981.
[3] H. Stern. Models for distributions on permutations. Journal of American Statistical
Association, 85:558-564, 1990.
[4] D.B. Hausch, V.S.Y. Lo, and W.T. Ziemba. Efficiency of Racetrack Betting Markets.
Academic Press, New York, NY, 1994.
[5] J.B. Bacon-Shone, V.S.Y. Lo, and K. Busche. Logistics analyses of complicated bets.
Research Report 11, Department of Statistics, the University of Hong Kong, 1992.
[6] V.S.Y. Lo and J. Bacon-Shone. Comparison between two models for predicting ordering
probabilities in multi-entry competitions. The Statistician, 43(2):317-327, 1994.
[7] V.S.Y. Lo and J. Bacon-Shone. Handbook of Investments: Efficiency of Sports and
Lottery Markets. Elsevier, London, England, 2008.
[8] M.M.Ali. Probability and utility estimates for racetrack bettors. Journal of Political
Economy, 84:803-815, 1977.
[9] P. Asch, B. Malkiel, and R. Quandt. Market efficiency in racetrack betting. Journal of
Business, 57:165-174, 1984.
[10] W.T. Ziemba and D.B. Hausch. Dr. Z’s Beat the Racetrack. Morrow, New York, NY,
1987.
[11] J.B. Bacon-Shone and V.S.Y. Lo. Probability and statistical models for racing. Journal of
Quantitative Analysis in Sports, 4(2):2-11, 2008.
[12] M.H. Kutner, C.J. Nachtsheim, and J. Neter. Applied Linear Regression Models.
McGraw-Hill Irwin, New York, NY, 2004.
[13] B. Harris. Emotional Bob Baffert heads into Thoroughbred Racing Hall of Fame. Sports
News, August 12, 2009.
[14] J. Bossert. Trainers bemoan synthetic tracks as Breeders’ Cup approaches. New York
Daily News, October 22, 2008.
[15] Wikipedia. Bob Baffert, 2010. http://en.wikipedia.org/wiki/Bob_Baffert, accessed May
2010.
[16] Wikipedia. Kent Desormeaux, 2010. http://en.wikipedia.org/wiki/Kent_Desormeaux,
accessed May 2010.
[17] Wikipedia. Garrett Gomez, 2010. http://en.wikipedia.org/wiki/Garrett_K._Gomez,
accessed May 2010.
[18] C. Thompson. Advantage: Cyborgs. Wired Magazine, 42, April 2010.
