
Blackjack:

Can ADP Bring Down The House?


Cristian Figueroa
December 14, 2012

1 Introduction

The purpose of this project is to better understand the house advantage in Blackjack. Blackjack is a card game commonly played at most casinos; for the purpose of this document we establish the player as a male character and the dealer as a female character. The idea is the following:

- The dealer gives 2 cards per player. The dealer has a facedown card and a face up card.
- There is a set of actions each player can take: Hit, Stand, Double Down and Surrender.
  - Hit: The dealer gives an additional card to the player's hand; he can do this repeatedly.
  - Stand: The player takes no further actions, ending his turn.
  - Double Down: The player increases his bet, but commits to Hit, and afterwards Stand.
  - Surrender: The player surrenders the game in exchange for recovering half of his bet. This can only be done at the beginning of the hand.
- After the player is done with his actions, the dealer gets additional cards (if required) until her hand scores 17 or higher. The score of a hand is the sum of the scores of each particular card. The scores of the cards are as follows:

  - An ace can be counted as a 1 or as 11, whichever is more beneficial.
  - Cards with faces (J, Q, or K) count as 10.
  - Any other card scores according to its numerical value.

If the score of the player's cards goes beyond 21 then the player automatically loses. Otherwise the player closest to 21 wins.
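To make the scoring rule concrete, the following is a minimal Java sketch (not the project's actual code) of how a hand's best score can be computed, counting aces as 1 and upgrading one ace to 11 whenever that does not push the total over 21:

import java.util.List;

public class HandScore {
    // Card values: 1 (ace) through 10 (10, J, Q and K all count as 10).
    // Returns the best score of the hand: aces count as 1, and one ace is
    // upgraded to 11 (i.e., +10) whenever that does not bust the hand.
    public static int bestScore(List<Integer> cardValues) {
        int sum = 0;
        boolean hasAce = false;
        for (int v : cardValues) {
            sum += v;
            if (v == 1) {
                hasAce = true;
            }
        }
        if (hasAce && sum + 10 <= 21) {
            return sum + 10; // count one ace as 11
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(bestScore(List.of(1, 7)));      // 18 (soft 18)
        System.out.println(bestScore(List.of(1, 7, 9)));   // 17 (the ace reverts to 1)
        System.out.println(bestScore(List.of(10, 10, 2))); // 22 (bust)
    }
}

Only one ace ever needs to be upgraded, since counting two aces as 11 always exceeds 21.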

Figure 1: Example of an outcome of a hand. The player on top lost, scoring 22. The player on the bottom scored 18.
Back in the 1960s, professor Edward O. Thorp proved that a card counting scheme could be used to overcome the house advantage in the game. By house advantage we refer to the fact that players using basic strategy are bound to lose in the long run with strictly average luck. This advantage comes from the fact that the dealer gets to play second, and hence it is possible for the player to bust and go over 21 before the dealer even begins playing.

Thorp, with the aid of an IBM 704, used simulations in order to compute probabilities and design a card counting scheme that would improve his odds. Thorp then decided to try his results in real casinos, earning as much as $11,000 in a single weekend; of course, he was then expelled by casino security. His book Beat the Dealer: A Winning Strategy for the Game of Twenty-One [4] quickly became a best-seller, selling more than 700,000 copies.

Thorp's methodology relied mostly on simulations, and at the end of his book one can appreciate many of the several tables used to compute his strategy. Blackjack can be seen as a stochastic dynamic programming problem. The problem is that the number of states can be really high: if we consider a single deck with 52 cards, each possible configuration of the deck (ignoring
suits), dealer's card, and player's score (if we define the score as the sum of the cards), we have on the order of 6.5 billion states¹, more than the number of breaths an average human takes in his lifetime. Even storing the optimal policy would be a challenge. Consider now that most casinos have 5 or 6 decks at a table.
Hence we use approximate dynamic programming (ADP) to solve the problem. The advantage of this method is that it allows a standard way of dealing with the problem and it is easy to adapt to different casino rules. The main disadvantage of the ADP methodology is that it requires simulation in order to compute the optimal action. A complementary work to be considered is the use of machine learning algorithms on the resulting policy in order to obtain rules that approximate it. The approximate dynamic programming scheme we use is the Smoothed Approximate Linear Program (S-ALP) as described in [5].

We compare three different policies: the Wikipedia policy, the simple S-ALP and the smart S-ALP. The Wikipedia policy is the result of a player following the strategy chart that can be found in the Wikipedia entry on Blackjack. The simple S-ALP is a policy with just a constant basis function, representing a policy of a player that just simulates the effect of the next stage. The smart S-ALP is a policy where the basis functions are chosen based on what would be important for the player.

The report is organized as follows: first, in Section 2, we describe the model we use, detailing the controls, state space, probability matrix and dynamic programming formulation. Next, we describe the approximate dynamic programming scheme we use in some detail. The implementation, results and conclusions end the report.

2 Blackjack Model

In order to understand the model, we begin by explaining the sequence of events in a blackjack game. First, the player must decide how much to bet. Once the player has made his bet, he is dealt two cards; the dealer is dealt a facedown card and a face up card. Here the player can take four different actions²: Hit, Stand, Double Down and Surrender.

¹ Deck configurations × dealer's face up card × player's sum of cards = (5^9 · 17) · 10 · 21.
² In reality there are five different actions, but the game was simplified to not include Split for the purpose of this project.

The description of each action is as follows:


- Hit: The dealer gives an additional card to the player's hand. This can be followed by a Hit, Stand or Double Down.
- Stand: The player takes no further actions, ending his turn.
- Double Down: The player increases his bet, but commits to Hit, and afterwards Stand.
- Surrender: The player surrenders the game in exchange for recovering half of his bet. This can only be done at the beginning of the hand.
Once the player is finished with his turn, the dealer checks whether the player went over 21 (busted) or not. If the player busted, then the dealer reveals her facedown card and collects the bet; otherwise the dealer draws additional cards until her score reaches 17 or higher. If the dealer goes over 21 or has a lower score than the player's, then the player is paid twice his bet (1:1 payoffs); if instead the dealer has a higher score than the player, then the dealer collects the player's bet. In case of a tie, the player collects his bet back. The sequence of events can be observed in Figure 2.

Figure 2: Flow Chart of Blackjack.
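To make the resolution just described concrete, here is a small Java sketch of the dealer's draw-to-17 rule and the 1:1 settlement. It is a simplified stand-in (cards drawn uniformly over the values 1-10, dealer aces counted as 1), not the project's Dealer class:

import java.util.Random;

public class HandResolution {
    private static final Random rng = new Random();

    // Placeholder draw: a uniformly random value in 1..10 stands in for drawing
    // from the actual deck (the real simulator tracks the deck composition).
    private static int drawCard() {
        return 1 + rng.nextInt(10);
    }

    // The dealer draws until her score reaches 17 or higher.
    public static int playDealer(int dealerScore) {
        while (dealerScore < 17) {
            dealerScore += drawCard(); // simplification: aces counted as 1 here
        }
        return dealerScore;
    }

    // Net profit for the player at 1:1 payoffs.
    public static double settle(int playerScore, int dealerScore, double bet) {
        if (playerScore > 21) return -bet;                              // player busted
        if (dealerScore > 21 || playerScore > dealerScore) return bet;  // player wins
        if (playerScore == dealerScore) return 0.0;                     // tie: bet returned
        return -bet;                                                    // dealer wins
    }
}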

The uncertain number of stages between bets presents a major challenge for modeling blackjack. The main reason is that not all actions are allowed at all times: betting can only occur prior to knowing the cards to be dealt, and surrender can only be performed as the first action in a hand. Using a discounted scheme is also challenging, since discounting profits based on the length of a game may have undesired consequences, such as incentives for a player to stand just to prevent discounting of profits.

To overcome this challenge we consider the following decisions for the player. At each hand the player decides what the first action is, as well as how much he will bet in the next hand. After making this first decision, he plays the hand by following the rules found in the Wikipedia entry on blackjack under basic strategy. This set of rules is summarized in Figure 3. The idea is that by separating the first play from the rest of the play, in each stage the player has the same actions available: how much to bet in the next hand and what action to take as the first action in the current hand. An additional advantage of this model is that each stage is a hand, and hence discounting has the expected effect. A disadvantage of this model is that the next state is very variable.

A justification for this approach is the following. The main contribution of ADP to the strategy is that we can be strategic in our betting, in order to overcome the house advantage. This effect, as well as taking the first action in the hand into account, should be the most prevalent in obtaining higher profits. What happens after the first decision can be considered a second order effect of some sort. In Figure 4 a flow chart of the model can be appreciated.

Now that the decisions to be modeled have been described, a detailed description of the states, controls, transitions and payoffs can be established.

2.1 States

A state in the problem consists of the dealer's face up card, the player's cards, the deck configuration and the current bet.

- The dealer's face up card is straightforward to describe; we use a numerical value, d, between 1 and 10 to describe it.
- The player's cards can be stored as two values, (p, a), p being the sum of the cards in the player's hand, and a the number of aces held by the player. This significantly reduces the possible number of states as opposed to storing each of the cards received by the player.
- Deck configuration, c: we store the number of cards of each value still left in the deck. This requires an array of length 10 composed of numerical values. For example, if there are only 13 cards scoring 10 points and 5 cards scoring 4 points, the deck configuration is c = (0, 0, 0, 0, 5, 0, 0, 0, 0, 13).
- The current bet, b, can be stored by its numerical value.

Figure 3: Strategy followed after the first action. H: Hit, Su: Surrender, S: Stand, D: Double Down.
Therefore we can describe the state space as the following set:

S = {(d, (p, a), c, b) ∈ ℕ × ℕ² × ℕ¹⁰ × B | d ∈ [1, 10], p ∈ [2, 20], a ∈ [0, 2], c ∈ [0, 4n]⁹ × [0, 16n]}

Where B is the set of possible bets, and n is the number of decks used by the dealer. The size of the state space is considerable: for a dealer using three decks, we have |S| ≈ 10¹³, a number larger than the number of stars estimated to be in the Milky Way.

Figure 4: Flow Chart of the model. Purple represents the decision of the player; the rest is the transition process to a new state.
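For concreteness, a state (d, (p, a), c, b) could be represented in Java roughly as follows. This is a sketch of the representation implied by the description above (field names and the indexing of the deck array are mine), not necessarily the classes used in the project:

// A state of the model: dealer's face up card d, the player's pair (p, a),
// the deck composition c, and the current bet b.
public record State(int dealerUpCard,    // d in [1, 10]
                    int playerSum,       // p in [2, 20]
                    int playerAces,      // a in [0, 2]
                    int[] deckCounts,    // c: length 10, one entry per card value
                    double bet) {        // b, an element of the finite bet set B
}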

2.2 Control

Having described the quantities required in order to completely characterize a game of blackjack, we can begin describing the set of controls. In our case it is the product of two decisions: how much to bet in the next game and what to do as a first action.

In order to have a finite set of states we restrict the betting decisions to a finite set B. The possible set of first actions has been described before; these are: Double Down, Hit, Stand and Surrender. The controls that a player can use are not restricted by the state the player has reached. Hence the set of controls U can be described as:

U = {(b′, s) | b′ ∈ B, s ∈ {Double Down, Hit, Stand, Surrender}}
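Similarly, a control u = (b′, s) can be sketched as a pair of the next bet and the first action (again a sketch, assuming the four simplified actions of this report):

public class Controls {
    // The four first actions considered in this project (Split is excluded).
    public enum Action { HIT, STAND, DOUBLE_DOWN, SURRENDER }

    // A control u = (b', s): the bet for the next hand and the first action of the current hand.
    public record Control(double nextBet, Action firstAction) { }
}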

2.3 Costs

There are two factors that add to the cost of a control given a state: the next bet, and the expected profit of the action taken. Hence the cost of a control u = (b′, s) given a state x can be written as:

g(x, u) = b′ − 2b·ℙ(Win | x)        if s = Stand or Hit
          b′ + b − 4b·ℙ(Win | x)    if s = Double Down
          b′ − 0.5b                 if s = Surrender

Here b′ represents the next bet (part of the control), b represents the previous bet, and ℙ(Win | x) is the probability that the player wins the current hand from state x.
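The cases above translate directly into code. The sketch below assumes a value winProb that estimates ℙ(Win | x), for example by Monte Carlo simulation; how that estimate is produced is left out here, and the enum mirrors the simplified action set of this report:

public class Cost {
    public enum Action { HIT, STAND, DOUBLE_DOWN, SURRENDER }

    // One-stage cost g(x, u) for a control u = (nextBet, action) taken in a state x
    // where the current bet is 'bet' and the estimated win probability is 'winProb'.
    public static double cost(double nextBet, Action action, double bet, double winProb) {
        switch (action) {
            case STAND:
            case HIT:
                return nextBet - 2.0 * bet * winProb;
            case DOUBLE_DOWN:
                return nextBet + bet - 4.0 * bet * winProb;
            case SURRENDER:
                return nextBet - 0.5 * bet;
            default:
                throw new IllegalArgumentException("unknown action");
        }
    }
}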

2.4

Transitions

Given that the post-game actions, as well as the dealer's actions, can lead to many different states, an analytical expression for all the reachable states as well as their transition probabilities is hard to obtain. Since we use simulation in order to approximate the expected basis functions as well as the expected costs, it is not necessary to obtain such an analytical expression.

2.5 Stochastic Path vs. Discounted Profit

Both formulations, average cost or discounted cost, could be used to formulate the objective function of our dynamic programming problem. We decided to formulate the problem as a discounted maximization problem. The main reason is that this is the measure of choice in the paper whose methodology we follow in order to solve the ADP [5]. Also, after consulting V. Farias about why Tetris is solved in [3] as a discounted profit problem rather than a stochastic path problem, he mentioned that the discounted cost setting has better numerical stability.

2.6 Infinite Stages

In order to use the discounted profit setting, we need to have infinite stages. Of course this cannot happen with a finite number of decks unless we decide to shuffle the deck and reset the game. This is exactly what we do: for a fixed number of cards l, if the deck has l or fewer cards at the beginning of a hand, then the deck is reshuffled and the new hand is drawn from a freshly reshuffled deck. It is interesting to examine how the profit of the player changes based on this number.
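A sketch of this reshuffling rule, with a minimal stand-in interface for the project's Deck class (the method names here are assumptions):

public class ReshuffleRule {
    // Minimal stand-in for the project's Deck class.
    interface Deck {
        int cardsLeft();
        void shuffleFull(); // restore all n decks and shuffle
    }

    // Before dealing a new hand: if l or fewer cards remain, reset to a freshly shuffled shoe.
    static void maybeReshuffle(Deck deck, int l) {
        if (deck.cardsLeft() <= l) {
            deck.shuffleFull();
        }
    }
}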

3 The Approximate Dynamic Programming Problem

Given the model described, we can now formulate the dynamic programming
problem associated with it. In this case the Bellman equation is the following:
J*((d, (p, a), c, b)) = min_{u ∈ U} { g((d, (p, a), c, b), u) + α Σ_{s ∈ S((d,(p,a),c,b), u)} p_{(d,(p,a),c,b) s}(u) J*(s) }

Where S(x, u) denotes the set of states reachable from state x under control u, and α is the discount factor.


Solving this problem exactly involves finding J* for all the relevant states. Even storing these values is out of the question, as it involves billions of numbers; storing the transition matrix is also out of the question. Hence we decide to use approximate dynamic programming on the value function, in order to find coefficients r such that Σ_{i=1}^{n} r_i φ_i(s) ≈ J*(s) for carefully chosen basis functions φ_i.

We begin by detailing the method chosen: Smoothed Approximate Linear Programming. We then describe the basis functions chosen for our problem.

3.1 Smoothed Approximate Linear Program: S-ALP

There are several methods in the approximate dynamic programming literature for computing the r coefficients when approximating the value function. The method we decided to use is the smoothed approximate linear program, following the paper by Desai, Farias and Moallemi [5]. The reason for this choice is that, rather than relying on projections and fixed points, it uses linear programming to compute the r coefficients. Being at MIT, we have access to IBM's CPLEX, and hence we can compute the r coefficients efficiently using this software.
The smoothed ALP (Approximate Linear Program) approach is based on the fact that the solution to the following problem is a solution to the Bellman equation:

max   J
s.t.  J ≤ T J

Where we follow the operator notation T J(x) = min_{u ∈ U} { g(x, u) + α Σ_s p(x, s | u) J(s) }. Even though the inequality looks nonlinear, the problem is in fact linear, as we can replace the inequality by the following set of inequalities:

J(x) ≤ g(x, u) + α Σ_s p(x, s | u) J(s)    ∀ u ∈ U

The reason this delivers the solution of the Bellman equation is that a solution to the optimization problem yields the largest of the fixed points of T.
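To spell out this standard argument (sketched here for completeness, under the usual discounted-cost assumptions that T is monotone and a contraction): if J is feasible, i.e. J ≤ T J, then applying T repeatedly and using monotonicity gives

J ≤ T J ≤ T² J ≤ ... ≤ lim_{k→∞} Tᵏ J = J*,

so every feasible J is bounded above by J*; since J* itself satisfies J* = T J*, it is feasible, and the maximization therefore returns J*.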
Moreover, we can replace the objective coefficients by any positive weights, or by a probability distribution c, and the solution to the previous LP still yields the solution to the Bellman equation. The ALP approach suggests solving the following problem:

max   c′J
s.t.  J ≤ T J
      J ∈ span(Φ)

Where Φ is the matrix of the basis functions we want to use to approximate the value function. Equivalently, this problem can be written as the following LP:

max   c′Φr
s.t.  Φr ≤ T Φr
A good reference for the ALP is given by D. P. de Farias and B. Van Roy in [2]. Given the large number of constraints of this LP, in contrast with the small number of variables, constraint sampling can be used. Hence we only need to use a small portion of the constraints in order to obtain a fairly good value for r. An improvement on this method is to bootstrap the policies and re-solve the ALP, as shown in [3].
Further improvements can be made to the solution: in [5], the smoothed linear program is used. The smoothed linear program solves:

max   c′Φr
s.t.  Φr ≤ T Φr + s
      π′s ≤ θ
      s ≥ 0

Here π represents the constraint violation distribution, s the vector of constraint violations, and θ the constraint violation budget. Hence it is a generalization of the previous approach. The parameter θ starts close to zero, resembling the ALP approach, and is then increased as long as it keeps improving the policy.

A case study with the game Tetris has been done in [5] in order to compare several of the popular ADP policies; the comparison table can be found there.

An additional advantage of this methodology is that, in terms of performance, given the large number of constraints relative to the small number of variables, the dual simplex method solves the resulting LP in a matter of seconds.

3.2 Basis Functions

Based on knowledge of the game, we used two concepts to define our basis functions: deck load and cards-to-shuffle. Deck load is defined as the number of high-value cards remaining in the deck (A, K, Q, J, 10) minus the number of low-value cards remaining in the deck (2, 3, 4, 5, 6). A deck that has recently been shuffled has a deck load of zero. By cards-to-shuffle we mean the number of cards to be dealt before the deck is reshuffled.
Using these two notions we define our basis functions φ_0, ..., φ_5 as follows (a sketch of how they can be computed is given after the list):

- φ_0: constant.
- φ_1, ..., φ_4: one for each combination of possible bet and whether the deck load is positive or not. For example, φ_1(s) = 6 and φ_2(s) = ... = φ_4(s) = 0 if the player made a minimum bet in the previous round and the deck load is 6; φ_2(s) = −2 and φ_1(s) = φ_3(s) = φ_4(s) = 0 if the player made a minimum bet in the previous round and the deck load is −2.
- φ_5: cards-to-shuffle.
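The sketch below shows how these six features could be extracted from a state. The method names, the indexing of the deck array and the assumption that B contains only a minimum and a maximum bet are mine, for illustration; the project's ALP class computes its own equivalents:

public class BasisFunctions {
    // Computes phi_0..phi_5 for a state, given:
    //   deckCounts[v-1] = number of cards of value v (1..10) still in the deck,
    //   previousBetWasMinimum = whether the previous round's bet was the minimum bet,
    //   cardsToShuffle = cards left to be dealt before the deck is reshuffled.
    public static double[] features(int[] deckCounts, boolean previousBetWasMinimum,
                                    int cardsToShuffle) {
        // Deck load: high-value cards (A and the 10-valued cards) minus low-value cards (2..6).
        int high = deckCounts[9] + deckCounts[0];
        int low = deckCounts[1] + deckCounts[2] + deckCounts[3]
                + deckCounts[4] + deckCounts[5];
        int deckLoad = high - low;

        double[] phi = new double[6];
        phi[0] = 1.0;                         // phi_0: constant
        // phi_1..phi_4: one per (bet, sign of deck load) combination, set to the deck load.
        int index = (previousBetWasMinimum ? 1 : 3) + (deckLoad >= 0 ? 0 : 1);
        phi[index] = deckLoad;
        phi[5] = cardsToShuffle;              // phi_5: cards-to-shuffle
        return phi;
    }
}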

The reason we separated the combinations of bets and signs of the deck load is that we are going to try to interpret the coefficients that multiply the basis functions.

One of the reasons for settling on these basis functions, as opposed to functions that depend on the dealer's card or the player's cards, is that such values have very high variability, as they are basically an independent sample from the deck. They would therefore require a large amount of simulation in order to obtain reasonable stability of E[φ_i | (b, s)].

We compare the results we obtain by choosing these basis functions with the results we obtain by choosing only the constant basis function (meaning pure simulation).

4 Implementation

Using Java, we generated several classes that allow us to simulate a blackjack game and keep track of the relevant quantities. There are two types of classes considered for this purpose: the classes that model the game (Dealer, Deck, Player) and the classes that simulate games, recording states, shuffling the deck, etc.
In order to generate the S-ALP, we first play games following simple strategies in order to generate a fixed number of states, nStates. Then we create the constraints that go into the S-ALP: we choose a fixed number of samples, nSamples, that we use in order to compute E[Φr | s, u] and g(s, u), given a state s and a control u. We then randomly choose 80% of the generated states and create the constraints for those states through simulation.
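For a sampled pair (s, u), the expectation E[φ(s′) | s, u] can be estimated by plain Monte Carlo along the lines of the sketch below; simulateNextState is a hypothetical stand-in for the project's game-playing code, and g(s, u) is estimated in the same way by averaging realized costs:

import java.util.function.Supplier;

public class ConstraintSampling {
    // Estimates E[phi(s') | s, u] by averaging the feature vectors of nSamples simulated
    // next states. simulateNextState should play out one hand from state s (first action u,
    // then basic strategy, then the dealer) and return phi(s') for the resulting state s'.
    public static double[] estimateExpectedFeatures(Supplier<double[]> simulateNextState,
                                                    int nSamples, int numFeatures) {
        double[] avg = new double[numFeatures];
        for (int i = 0; i < nSamples; i++) {
            double[] phi = simulateNextState.get();
            for (int k = 0; k < numFeatures; k++) {
                avg[k] += phi[k] / nSamples;
            }
        }
        return avg;
    }
}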
Then we call a method that, given a budget for constraint violations θ, generates the S-ALP and solves it using CPLEX. Afterwards we simulate the policy and adjust the parameter θ in order to find a value that improves the performance of the policy.
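To make the optimization step concrete, the sketch below builds and solves an S-ALP of the form given in Section 3.1 using the CPLEX Java (Concert) API. The way the sampled data are passed in (arrays of features, expected features, costs and the distribution π) is an assumption for illustration; the project's Optimizer class may organize this differently:

import ilog.concert.IloException;
import ilog.concert.IloLinearNumExpr;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

public class SalpSolver {
    // Solves  max c'Phi r  s.t.  Phi r <= T Phi r + slack,  pi' slack <= theta,  slack >= 0,
    // with one sampled constraint (and, for simplicity, one slack variable) per (state, control) pair.
    //   objCoef[k]        ~ objective coefficient of r_k (state-relevance-weighted phi_k)
    //   phi[i][k]         ~ phi_k(x_i) for the sampled state x_i of constraint i
    //   expectedPhi[i][k] ~ E[phi_k(s') | x_i, u_i], estimated by simulation
    //   cost[i]           ~ g(x_i, u_i);  pi[i] ~ violation distribution;  alpha ~ discount factor
    public static double[] solve(double[] objCoef, double[][] phi, double[][] expectedPhi,
                                 double[] cost, double[] pi, double alpha, double theta)
            throws IloException {
        int K = objCoef.length;   // number of basis functions
        int M = cost.length;      // number of sampled constraints
        IloCplex cplex = new IloCplex();

        IloNumVar[] r = cplex.numVarArray(K, -1e7, 1e7);
        IloNumVar[] slack = cplex.numVarArray(M, 0.0, Double.MAX_VALUE);

        cplex.addMaximize(cplex.scalProd(objCoef, r));

        // Phi r(x_i) - alpha * E[Phi r | x_i, u_i] - slack_i <= g(x_i, u_i) for each sample i.
        for (int i = 0; i < M; i++) {
            IloLinearNumExpr lhs = cplex.linearNumExpr();
            for (int k = 0; k < K; k++) {
                lhs.addTerm(phi[i][k] - alpha * expectedPhi[i][k], r[k]);
            }
            lhs.addTerm(-1.0, slack[i]);
            cplex.addLe(lhs, cost[i]);
        }
        // Budget on the pi-weighted total constraint violation.
        cplex.addLe(cplex.scalProd(pi, slack), theta);

        if (!cplex.solve()) {
            cplex.end();
            throw new IllegalStateException("S-ALP could not be solved");
        }
        double[] rValue = cplex.getValues(r);
        cplex.end();
        return rValue;
    }
}

Since the number of sampled constraints is much larger than the number of variables, a problem of this shape is exactly the one the dual simplex method handles in seconds, as noted in Section 3.1.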
The diagram depicted in Figure 5 illustrates the process of finding a policy with this methodology. The total process took on the order of hours to solve. The details of the code used to solve the S-ALP can be found at:
http://www.mit.edu/~cfiguero/Blackjack/

Figure 5: Steps involved in the process of finding a policy through S-ALP.


A brief description of the files:

- Player, Deck, Dealer: These classes are straightforward object classes with methods for drawing cards, updating bets, and game actions. These objects are used in most of the other classes.
- Simulator, ALP: These objects are used to simulate several games of blackjack occurring between a dealer and a player. The methods for generating states and constraints can be found in ALP, as can the methods for calculating the expected outcome of an action. Outputs of generating states are saved in states.txt, outputs of generating constraints are saved in constraints.txt, and a log of the games used to generate the constraints is saved in runLog.txt.
- Optimizer: This class reads the outputs from constraints.txt and, given a parameter θ, creates the S-ALP and solves it.
- ALPSimulator: This class was designed to simulate the policy that is obtained through the ALP.
- runner, runSol, NormRunner: Classes designed to use the previous classes in order to obtain results. The way to obtain a policy is by executing runner. To use this policy and see its performance, execute runSol. To see the performance of an out-of-the-box policy, execute NormRunner.
- VisualTable: This is a graphical representation of the performance of the policy.


5 Results and Analysis

As we find a policy and start simulating it, we find that policies with a large θ tend to make inappropriate decisions when playing. After careful consideration, we realized that this mainly occurred due to misestimating future costs. Given the variability of outcomes in the next game, it is hard to appropriately estimate the next state and future costs without exhaustive sampling. Using a small enough value of θ provides better results while still taking future costs into account.

Figure 6: Profit collected after playing 1000 hands, for 100 different samples.

In Figure 6, the variability of the profits can be appreciated. Each of the dots represents the profit after playing a thousand hands. This variability is a well-known fact in the gambling community: independent of the level of skill (how good the policy is), positive profits are not guaranteed in the short run [1]. The effect of a policy's quality can only be appreciated in the long term, and in this case this means more than 1000 hands. The inability to quantify the performance of the policy reliably prevents us from taking full advantage of the S-ALP, as it is not easy to decide whether or not to increase θ.

The average profit of the ADP policy with a small θ is 11.8 over 100 repetitions of 1000 games. The average profit of the policy found on the internet is −11.335 over 100 repetitions of 1000 games.


6 Conclusions

Blackjack is a game with high variability between stages. This in turn requires a scheme that is able to take enough samples to obtain reliable estimates when trying to anticipate future costs. In our case we took enough samples that the problem would not take more than a few hours to solve, and a couple of hours to simulate.

It turns out that even with this number of samples, the variability in the results makes it very hard for the S-ALP policy to estimate future profits. This limits the benefits that come from the ADP solution. This is not to say that an ADP approach could not work; it just requires more simulation than was used in this project.

To counter the effects of a bad estimation of future profits, an increased discounting of future profits can be used. By lowering α, a policy can be obtained that beats the best online policy, but it does not gain as big an advantage as one would desire. The low α can be understood in the following way. In our model, the most important stages to consider when making a decision are the current stage and the one directly after it. The reason is that, in deciding the bet for the next stage, if the policy is too myopic it will always choose the smallest possible bet, but if the policy looks too far into future profits, it will make an inappropriate decision (like Surrender) in the current play. This could potentially be fixed through more simulation.

Using rollout policies or n-step lookahead seems to be a more appropriate ADP method, one that adjusts better to the problem. The reason for this is that there exist very good base policies that these methods can improve. Also, the limited lookahead of the methodology aligns well with the structure of the game, as mentioned before.

Several books about blackjack state that the advantage of a skilled player in blackjack is around 2% [1]; given the high variability, it seems unlikely that ADP can get an advantage much higher than that.

Blackjack turned out to be a very interesting game to study in terms of modeling and analysis of the policy. Much of the role of the different parameters (α, θ, the number of states sampled and the number of samples used for the expectations) came into play in analyzing the performance of the policy obtained through this methodology. It poses an interesting challenge to see whether a policy can be found, through careful modeling and parameter tuning, that obtains a return higher than 10%.


References
[1] American Publishing Corporation. How To Beat The Dealer in Any Casino. American Publishing Corporation, 2000.
[2] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 2003.
[3] V. F. Farias and B. Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty.
[4] E. O. Thorp. Beat the Dealer: A Winning Strategy for the Game of Twenty-One. Vintage, 1966.
[5] V. V. Desai, V. F. Farias, and C. C. Moallemi. Approximate dynamic programming via a smoothed approximate linear program. Operations Research, 2009.

