
CS221: Project Final Report

Beating the Stratego

December 16, 2016

In this document, we describe our work on implementing an AI that can play the game Stratego.
We chose this game because it provides interesting challenges due to the complexity of its state space
and to the high level of uncertainty (unlike Chess, all of the opponent's piece ranks are initially
unknown). We detail the model we chose to represent the game, the approach we followed
and the algorithms we built, present our results, and analyze the sources of errors and potential
improvements.

1 Problem Definition
1.1 Stratego
Stratego is a zero-sum strategy board game. Two players battle on a 10x10 board to capture the
enemy's flag. Each of them starts with 40 pieces, as per figure 1. Each piece is either a number
between 1 and 10 (10 being the strongest rank), a flag, or a bomb. Each player only sees his own
pieces, and an opponent piece is only revealed upon a fight with it (whether we win the fight or not).
There are two 2*2 lakes on the middle rows that can’t be walked
on. This creates around them three corridors that are strategically
important, because they are the only points of passage between the
two sides.
Each turn, the player moves one piece by one square (non-diagonally), except for the scout,
which can cross the board in one turn. Bombs and flags can't be moved.
A combat occurs when one piece walks on an occupied square.
The general rule is that the highest-ranking piece wins, and ties
kill both belligerents. However, bombs will kill any assailant and
can only be destroyed by a miner (rank 3). A spy (rank 1) can kill
a Marshal (rank 10), but only if the spy attacks first. The goal of
the game is to capture the opponent's flag. If a player cannot move
anymore, he loses the game.
Figure 1: Initial position on the Stratego board

1.2 Motivation

Stratego combines a Chess-like strategy game with uncertainty due to the opponent's unknown
pieces. Even though the first moves rely heavily on luck and risk management, the uncertainty is
reduced as the game unfolds and players gain knowledge or refine their beliefs about their
opponent's pieces.
This yields very interesting challenges from an AI perspective, such as making decisions under
uncertainty, and being able to value the information we gather from what we observe (the moves
of our opponents) to then infer the identity of the hidden pieces.
Van den Herik et al. (2002) showed that for Stratego, the complexity of the state space is 10^115,
and that of the game tree is 10^535. In comparison, the corresponding numbers for Chess are 10^46
and 10^143. Hence, it is obviously computationally impossible to evaluate the full game tree, and
we will need to find ways to reduce the search space.

1.3 Goals and constraints


The sheer size of the potential state space forces the AI to make major compromises to drastically
cut the number of states that must be explored. Naively, each unknown piece on the board would

multiply the number of states to explore by 12 (because there are 12 different ranks). On the
other hand, if there were no uncertainty at all, it would be very easy to build an excellent AI (our
oracle player). Hence, we chose to explore a way to remove uncertainty first, and then solve the game
without uncertainty. This decision is the major design choice that separates different Stratego AIs;
in our case, we have chosen a belief-based approach.
Reducing the size of the state space comes at a cost: it greatly decreases the quality of the AI.
Hence, our goal is to provide a reasonable approach to building a Stratego AI, combining several
ideas and algorithms in a way that we have not seen in any other program.
The main constraint is performance: we want our AI to play each move within 10 seconds on average.
We are also limited by the relatively small size of our training set (about 40,000 games),
which is tiny compared to the number of possible games. Hence, our training set only explores a
minuscule fraction of the space, and we use reinforcement learning to generate more data.

We built two algorithms:


Belief Takes as input an opponent's move and the resulting board, and outputs an updated belief on
the identity of the opponent's pieces.
This algorithm uses the belief developed throughout previous moves, and "removes" uncer-
tainty by outputting the most likely position for each of the opponent’s pieces. Implicitly,
this represents a board where all the opponent’s pieces are known, and if this prediction were
100% accurate then we would win very easily. Hence, the goal is to get optimal predictions.
Main AI Takes a state as input (board, turn, belief) and outputs a move (Piece, Vector). This
algorithm plays a reduced Stratego game (without uncertainty) using the most likely board
outputted by the belief algorithm.
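As an illustration, the two components can be summarized by the following minimal sketch (class and method names here are ours for this write-up, not necessarily those of the actual code):

    class Belief:
        """Probability distribution over the ranks of the opponent's unknown pieces."""
        def update(self, opponent_move, resulting_board):
            """Update the belief after observing an opponent move and the resulting board."""
            raise NotImplementedError
        def most_likely_board(self):
            """Return a fully determined board: the most likely rank for each unknown piece."""
            raise NotImplementedError

    class MainAI:
        """Plays the reduced (uncertainty-free) Stratego game on the most likely board."""
        def __init__(self, belief):
            self.belief = belief
        def play(self, board, turn):
            """Return a move (piece, vector) chosen by depth-limited minimax."""
            likely_board = self.belief.most_likely_board()
            raise NotImplementedError  # run minimax on likely_board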

2 Evaluation
In order to evaluate our AI, we defined 2 players:
Baseline At a given state s, it makes a random prediction on the pieces of the opponent, then uses
a Monte-Carlo algorithm. It tries several possible actions a_i ∈ Actions(s), and for each of
them keeps playing randomly against a random opponent for a few moves until it reaches a
state s'_i; it then picks the action that led to the best s'_i (evaluated using a basic score
function: the sum of the ranks of the remaining pieces). A short sketch of this rollout
procedure is given after these definitions.
Oracle Same as the baseline player, but instead of initially making random assumptions on the
opponent's pieces, it is omniscient and knows all the pieces of the opponent. The same Monte-
Carlo algorithm is used to pick a move, with a few handcoded constraints to ensure move
quality (i.e. we aggressively attack every piece we know we can defeat). This player does
not follow good policies, but since it knows the opponent's pieces it still beats any human
player.
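To make the Monte-Carlo step of these two players concrete, here is a sketch of it; the helpers legal_actions, apply_action and score stand in for our game API and are assumptions of this write-up:

    import random

    def monte_carlo_move(state, legal_actions, apply_action, score,
                         n_rollouts=20, rollout_depth=10):
        """Pick the action whose short random rollouts lead to the best-scoring state.
        Assumed helpers: legal_actions(state) -> list of actions,
        apply_action(state, action) -> new state, score(state) -> float."""
        best_action, best_value = None, float("-inf")
        for action in legal_actions(state):
            value = 0.0
            for _ in range(n_rollouts):
                s = apply_action(state, action)
                # Both sides then play randomly for a few moves.
                for _ in range(rollout_depth):
                    moves = legal_actions(s)
                    if not moves:
                        break
                    s = apply_action(s, random.choice(moves))
                value += score(s)
            if value > best_value:
                best_action, best_value = action, value
        return best_action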
We simulated hundreds of games to find the ratio of victories of each player:
• Baseline player vs Random player : 90% of victories for Baseline
• Oracle player vs Baseline player : 87% of victories for Oracle
Our objective is to get as close as possible to the Oracle player. This means beating the baseline
as often as possible without cheating, i.e. without looking at the non-revealed opponent pieces.

3 Literature Review
Compared to other games such as Chess, Checkers or Go, there are few papers on Stratego from
an AI perspective. Some papers were still useful to get insights on how to deal with uncertainty,
evaluate states and speed up the search for the next move:
• Invincible, a Stratego bot - de Boer et al., 2007. This paper researches how to value the pieces
on a board. Unlike Chess, the value of a given piece evolves as the game unfolds, since the
value is correlated with the other pieces still in the game: for instance, a 5 and a 6 have the same
value when the opponent doesn't have any 5 or 6 remaining. However, the authors do not
try to predict moves as we do, but instead construct plans of moves (e.g. "I want to
kill this particular piece within 5 moves").

• Competitive Play in Stratego - A.F.C. Arts. This thesis details a way to build a Stratego AI.
However, it uses the expectiminimax algorithm to deal with uncertainty, whereas we decided
to solve a Constraint Satisfaction Problem to get the most likely board and thus remove
uncertainty. Like much of the related literature, one of its main conclusions is that building a
powerful Stratego AI is a very hard task because of the game's high complexity.
• The strongest Stratego AI is called Probe and is available online for free. Its algorithm is
complex and is based on exhaustive search as well as path-finding algorithms. It is probably
more efficient than our AI in the sense that it enables a deeper search tree.

4 Model
In this section, we describe the state-based model we use to represent a game and the different
choices the AI can make.

4.1 States
The simplest way to fully capture the information of a game of Stratego would be to use states
made of a board, whose turn it is, and the history of all moves. There are two problems with
this approach: the number of potential states is combinatorially huge, and such states do not
generalize well. Indeed, if two states differ only by a single move in one history, they do not appear
any "closer" than two completely different states, even though this notion of distance should be captured.
To improve this, we get rid of the storage of histories. Instead, a state is made of a turn
(player 0 or 1), the actual board that the player sees, and a belief on the unknown pieces. Stratego
is a game where each player has partial information on his opponent's pieces. However, one can
build an "intuition" about the identity of some pieces based on the history of their moves or non-
moves, and on their position on the board. For instance, if two pieces move aggressively together,
they are probably a spy/general pair, because the spy protects the general (rank 9) from the
marshal (rank 10). Similarly, if a piece stays totally immobile even though it could have made various
good moves, it is probably a bomb or the flag. These intuitions are encoded as probabilities, defined
for each potential rank of each unknown opponent piece: we call them a Belief. For
instance, we could have a 20% belief that the piece on square (3, 1) is a flag.

State = (Board, player id, Belief )


The belief is the actual board where each unknown piece is represented by a probability distribution
over the possible ranks, which encodes the player's belief about the identity of that
piece (detailed in the Algorithms section). For instance, upon discovering that a piece p is of rank r,
we set P(p = r) = 1 and P(p = r') = 0 for all r' ≠ r.

4.2 Actions
We define an action as a couple (piece, vector), where 'piece' is a Piece object (defined in the
Architecture section) containing the information on a given piece, such as its value ('B' for a bomb,
10 for the marshal) and its position (x, y) on the board, and 'vector' defines the move. For instance,
a piece following the vector (1, 0) moves one square to the right. An action is called legal if it obeys the

Stratego rules (can’t attack your own pieces, can’t move into a lake or outside the board, can’t
move diagonally).
The set of available actions for a given player is the set of all legal actions of his remaining
pieces. While the number of legal moves is limited at the beginning of a game, it can grow much
larger later on, which widens the game tree. Hence computation time increases as the game
evolves, which is somewhat counterbalanced by the decrease in uncertainty.

5 Infrastructure
We decided to create our own version of Stratego, using Python. This solution gives us more
flexibility as we try several algorithms. The project is divided into several classes:

Game contains the two players, the board, and information relative to the game state such as
whose turn it is, or whether the game is over or not. A game is started using its method
’run’.
Board one of the game’s attributes, gathers everything there is to know about the board’s state:
dimension, lakes’ positions, as well as the position of each remaining piece.
Player two of the game's attributes. Here, we defined a superclass Player and several subclasses,
such as RandomPlayer, BaselinePlayer, etc. Each subclass must implement a method
initialPosition, placing the pieces at the beginning of a game, and a method play, called each
time the player has to make a move. We coded 4 players:
• Random Player: chooses a random action among the possible actions. It is obviously
the fastest player and is useful for basic testing, but it is a very weak opponent and
does not provide a real challenge.
• Baseline Player: cf. section 2.
• Oracle Player: cf. section 2.
• Minimax Player/AI: maintains a belief on the pieces of the opponent and, given this
belief, uses a minimax algorithm on the most likely board to decide on the next action
to take (it has additional methods, such as an evaluation function).
Move defines a ’move’ object with a piece and a vector.
Piece defines a ’piece’ object with its value (’F’ for flag, ’B’ for bombs, 1 to 10 for regular pieces),
its position on the board, its status (revealed or not), its move history (list of moves from
the beginning of the game) and the player it belongs to.
Belief object used by the Minimax Player. A 'belief' object contains a dictionary with the probabilities
for each piece to be in each square (as a 3D matrix (x, y, rank)). A deeper description
is given in section 6.2. A simplified sketch of the core data classes is given below.
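For reference, here is that simplified sketch (attribute names are illustrative and may differ from the actual code):

    class Piece:
        def __init__(self, value, position, player_id):
            self.value = value          # 'F', 'B', or 1..10
            self.position = position    # (x, y) on the board
            self.player_id = player_id  # 0 or 1
            self.revealed = False       # becomes True after a fight
            self.history = []           # moves since the beginning of the game

    class Move:
        def __init__(self, piece, vector):
            self.piece = piece
            self.vector = vector        # e.g. (1, 0) moves one square to the right

    class Player:
        def initialPosition(self):
            """Place the 40 pieces at the beginning of the game."""
            raise NotImplementedError
        def play(self, game):
            """Return a Move given the current game state."""
            raise NotImplementedError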

Figure 2: Board of Stratego generated with our code

6 Algorithms
A game from our AI's perspective can be described sequentially:
• First, our AI picks an initial layout and instantiates its belief on the opponent’s pieces.

• At each turn, assuming we play second (these steps can easily be transposed to the case
where we play first):

– Observe the opponent’s move, and update our belief accordingly.


– Generate the most likely board given our current belief (approximately solving a CSP).
– Run a limited-depth exploration of the game tree to find the best possible move, as-
suming that the actual board is the likely board we found at the previous step. At the
bottom of the tree, we use our TD-learned evaluation function to compute a score for
the state.
– Observe the result of our move, and if we went through a fight, update our belief
accordingly.
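Schematically, one of our AI's turns looks like the following sketch (the helper names on ai, game and outcome are assumptions for illustration, not the exact ones in our code):

    def play_turn(ai, game, opponent_move):
        """One AI turn, following the four steps above (illustrative helper names)."""
        # 1. Update the belief from the observed opponent move.
        ai.belief.update(opponent_move, game.board)
        # 2. Remove uncertainty: build the most likely board (relaxed CSP).
        likely_board = ai.belief.most_likely_board()
        # 3. Depth-limited minimax with the TD-learned evaluation at the leaves.
        move = ai.minimax(likely_board, depth=3)
        # 4. Apply the move; if a fight occurred, the revealed rank refines the belief.
        outcome = game.apply(move)
        if outcome.fight_occurred:
            ai.belief.reveal(outcome.opponent_piece, outcome.opponent_rank)
        return move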

6.1 Picking an initial layout


The initial layout is of utmost importance in Stratego. Fortunately, good layouts are usually very
similar and follow a simple set of sanity rules. From the dataset of 40,000 winning boards we have,
we can just pick one at random as our initial layout. This is almost equivalent to building
a probability distribution over good layouts from those winning boards and sampling from it.
The randomness prevents the AI from being too predictable. We also use basic hardcoded rules
to prevent very bad layouts from happening (the flag is always on the last row, bombs should protect it,
etc.). One issue that arises is that some players use very peculiar initial layouts, targeted towards a
specific style of game (very aggressive or defensive, for instance). Those layouts are fairly frequent
in our database but turn out to be bad for an AI that will not play the specific style associated with
them.
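A possible sketch of this step, assuming the 40,000 layouts are available as a list of 4×10 grids of piece values (the sanity rule shown is only one illustrative example):

    import random

    def pick_initial_layout(winning_layouts):
        """Sample an initial layout from the database of winning games,
        rejecting those that fail our basic sanity rules."""
        while True:
            layout = random.choice(winning_layouts)
            if passes_sanity_rules(layout):
                return layout

    def passes_sanity_rules(layout):
        """Illustrative rule only: the flag must sit on the back row (row index 3 here)."""
        flag_positions = [(x, y) for y in range(4) for x in range(10)
                          if layout[y][x] == 'F']
        return bool(flag_positions) and flag_positions[0][1] == 3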

6.2 Beliefs
The end goal of our belief algorithms is to be able to output at each turn the most likely disposition
of the opponent’s pieces and use it as if it were the actual board in subsequent tree-searching
algorithms. To create a belief, for each piece on the board, we store a dictionary that associates a
rank to the belief that this particular piece has this given rank.

6.2.1 Heuristics
When a combat takes place, both pieces' identities are revealed. If our piece loses, the opponent's
piece is now known: we set the probability of its rank to 1, and of the other ranks to 0. Our normalizing
algorithms then make sure that the probability of this rank for the other pieces decreases accordingly
(in the extreme case, when we discover the last piece of a rank, we set the probability of this rank
to 0 for all the other pieces).
We also use three more advanced heuristics to update our belief:
Missed opportunities When an opponent's piece is on a square adjacent to one of our revealed
pieces (e.g. after a battle), it could attack us. If it does not, the probability that this op-
ponent piece has a lower rank than ours increases, and the probability that it is stronger
decreases. If it keeps not attacking for several turns when it had the possibility, this heuris-
tic keeps tweaking the probabilities turn after turn.

General happiness The idea is that the opponent tries to maximize the happiness of its pieces:
an opponent piece is happier the closer it gets to a weaker revealed piece of ours, and less
happy if our piece is stronger. For each unknown piece and each rank r, we compute
how happy this piece would be if it were of rank r with regard to all our revealed pieces. The
score is simply the sum over P, the set of our revealed pieces (a short sketch of this
computation is given after this list):

∑_{p ∈ P} 1[r_opponent > r_p] / ManhattanDistance(opponent, p)

Initial neighbors For each of the opponent's pieces, we store its initial position on the board.
Upon discovery of an opponent's piece, we can then look at its direct neighbors on the initial
board (between 2 and 4 neighbors).
From our database of games, we learned, for each rank, the probability that another rank
is initially located behind it, ahead of it, or to its side (left and right are grouped because of the
symmetry of the problem). For instance, we learned that for a bomb, the probability that
the flag is located behind it is much higher than the uniform probability of 1/12. Using this
information, we can then update the beliefs on the initial neighbors of each piece we discover.
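Here is the sketch announced above for the general-happiness score; representing our revealed pieces as (rank, position) pairs is an assumption made for this illustration:

    def happiness_score(opponent_position, candidate_rank, revealed_pieces):
        """Happiness of an unknown opponent piece if it were of candidate_rank,
        with respect to our revealed pieces, following the formula above.
        revealed_pieces: list of (rank, (x, y)) tuples."""
        ox, oy = opponent_position
        score = 0.0
        for rank, (px, py) in revealed_pieces:
            distance = abs(ox - px) + abs(oy - py)  # Manhattan distance
            if distance > 0 and candidate_rank > rank:
                score += 1.0 / distance
        return score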
The calibration of these heuristics has a major impact on the prediction of the opponent's board,
and hence on the quality of our AI, which relies heavily on this prediction. The problem we face
is that these heuristics assume that the opponent is a rational human. Thus,
when we make our AI play against non-rational AIs, such
as a random player or even to some extent a Monte-Carlo
player, the inferences on the belief we draw from some be-
haviors are sometimes false. We face a tradeoff for test-
ing: we can run our AI against other simple bots, which
is very fast but does not unleash the full potential of our
AI, or against ourselves, which is much more realistic but
extremely slow in comparison to the first method.

Figure 3: Belief on each square

6.2.2 Normalizing the probabilities


For each piece on the board, we store a dictionary that associates each rank with the belief that this
piece has that rank. For these probabilities to make sense, they need to sum to 1 for each piece
across all ranks, and to sum to the total number n_r of pieces of rank r for each rank r across all
pieces. Letting P be the set of pieces and R the set of ranks, we want:

∀r ∈ R:  ∑_{p ∈ P} P(p = r) = n_r

∀p ∈ P:  ∑_{r ∈ R} P(p = r) = 1

To normalize the probabilities, we use an iterative algorithm that converges towards a normalized
board. We start by normalizing each piece, dividing each of its probabilities by their sum over the
ranks. Then, we normalize the ranks by dividing each probability by its sum across all pieces. We
repeat this two-step process until the changes are below a certain threshold ε. This guarantees that
each of the sums above is less than ε away from n_r and 1, respectively.
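A minimal sketch of this two-step normalization, assuming the belief is stored as a dict mapping each unknown piece to a {rank: probability} dict, and rank_counts gives the number n_r of opponent pieces of each rank still on the board:

    def normalize_belief(belief, rank_counts, eps=1e-3, max_iter=100):
        """Alternately rescale pieces and ranks until both sets of
        constraints are satisfied up to eps."""
        for _ in range(max_iter):
            max_change = 0.0
            # Step 1: each piece's probabilities must sum to 1.
            for probs in belief.values():
                total = sum(probs.values())
                if total > 0:
                    for r in probs:
                        new = probs[r] / total
                        max_change = max(max_change, abs(new - probs[r]))
                        probs[r] = new
            # Step 2: each rank's probabilities must sum to its remaining count n_r.
            for r, n_r in rank_counts.items():
                total = sum(probs.get(r, 0.0) for probs in belief.values())
                if total > 0:
                    for probs in belief.values():
                        if r in probs:
                            new = probs[r] * n_r / total
                            max_change = max(max_change, abs(new - probs[r]))
                            probs[r] = new
            if max_change < eps:
                break
        return belief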

6.2.3 Finding the most likely board


From the distribution of probabilities, we want to extract the most likely board, on which our
minimax algorithm will then work (basically, we are removing the uncertainty).
This is a CSP: we are looking for a board that respects the constraints on the total number of
each rank and maximizes a weight (the product of the probabilities), so as to get the most likely
legal board. However, the size of this CSP would make the resolution take a lot of time, even
using AC-3, LCV and MCV.
Hence, we use a heuristic built using constraint relaxation. We do not enforce the constraints
on the number of pieces for most of them; i.e. if we end up with 7 pieces of rank 2 instead of 8,
we hope that this will not have a major impact on our AI's decisions. We start by attributing the
most important pieces (10, 9, flag, and spy), in this order, to the squares where they have the
highest probabilities. Then, we randomly sample all the other pieces on the remaining unattributed
squares using the probability distributions.

Figure 4: Most likely board
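A sketch of this relaxed assignment; the belief representation (each unknown square mapped to a {rank: probability} dict) is an assumption of this write-up, with 10 the marshal, 9 the general, 'F' the flag and 1 the spy:

    import random

    def most_likely_board(belief, important_ranks=(10, 9, 'F', 1)):
        """Greedy, relaxed assignment: place the unique important pieces on their
        highest-probability squares, then sample the rest from the distributions."""
        assignment = {}
        free_squares = set(belief.keys())
        # 1. Marshal, general, flag, spy: most probable square, in this order.
        for rank in important_ranks:
            if not free_squares:
                break
            best = max(free_squares, key=lambda sq: belief[sq].get(rank, 0.0))
            assignment[best] = rank
            free_squares.remove(best)
        # 2. Remaining squares: sample a rank from each square's distribution.
        for sq in free_squares:
            ranks = list(belief[sq].keys())
            weights = [belief[sq][r] for r in ranks]
            if sum(weights) > 0:
                assignment[sq] = random.choices(ranks, weights=weights, k=1)[0]
            else:
                assignment[sq] = random.choice(ranks)
        return assignment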

6.3 Evaluation function


Since we cannot compute the whole search tree (the depth and the branching factor are hundreds
of orders of magnitude beyond what a computer can handle), we limit our search to a small depth
(from 1 to 5 in our project). Unless we are at a state very close to the end of the

game, we cannot reach the end leaves of the game tree with such a small evaluation depth. Hence,
the evaluation function gives us a heuristic estimate of the value of a state when we reach the
depth limit.
We calculate the score of a state s as Eval(s) = w^T φ(s), where φ(s) is a vector of
features extracted from s, and w is a vector of weights for those features.

6.3.1 Feature extraction


We manually build the feature extractor, using our intuition and knowledge of the game to generate
feature patterns. The goal is to capture variables that reflect the quality of a position while still
having good generalization power. Various research papers give heuristics and scores to evaluate
the quality of a board, and we implemented each of those as a feature, bagged together in φ with
more basic features. For instance, naming φ_1(s), φ_2(s), ... the coordinates of φ(s), we chose:

φ_1(s) = value of the remaining pieces
φ_2(s) = minimum distance between my flag and the enemy
φ_3(s) = number of opponent pieces relative to my number of pieces
...
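For illustration, a simplified sketch of the extractor for the three features above; the state attributes used here (my_pieces, opponent_pieces, my_flag_position) are assumptions:

    def extract_features(state):
        """Compute phi(s) for a state; a simplified sketch of our extractor."""
        numeric_values = [p.value for p in state.my_pieces
                          if isinstance(p.value, int)]
        my_count = len(state.my_pieces)
        opp_count = len(state.opponent_pieces)
        fx, fy = state.my_flag_position
        # Minimum Manhattan distance from any opponent piece to my flag.
        distances = [abs(p.position[0] - fx) + abs(p.position[1] - fy)
                     for p in state.opponent_pieces]
        min_dist_to_flag = min(distances) if distances else 20
        return [
            sum(numeric_values),           # phi_1: value of the remaining pieces
            min_dist_to_flag,              # phi_2: distance between my flag and the enemy
            opp_count / max(my_count, 1),  # phi_3: relative number of opponent pieces
        ]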

6.3.2 Improving the coefficients


We then learn the coefficients using our training dataset of 40,000 Stratego games. We also
generate more data by running our AI against a Monte-Carlo player, which has a relatively rational
behavior, albeit very different from (and arguably worse than) that of a human player. We perform
TD-learning on this training data: for each couple (s, s') of consecutive states in our database, we
update the weight vector θ of the Eval function according to the following rule:

θ ← θ − η (Eval(s) − (r(s') + γ Eval(s'))) φ(s) − η λ θ

We also add a penalization term on the norm of our weights to avoid overfitting: the implied
objective function we want to minimize is

(Eval(s) − (r(s') + γ Eval(s')))^2 + λ‖θ‖^2 = (θ^T φ(s) − (r(s') + γ θ^T φ(s')))^2 + λ‖θ‖^2

We used γ = 1, η = 0.01 and λ = 0.01, and our reward function r is

r(s) = 1 if s is a winning state, −1 if s is a losing state, and 0 otherwise.
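A sketch of one update step with these parameters, assuming the feature vectors phi(s) and phi(s') are already computed (numpy is used for brevity):

    import numpy as np

    def td_update(theta, phi_s, phi_s_next, reward_next,
                  eta=0.01, gamma=1.0, lam=0.01):
        """One TD-learning step on the linear evaluation Eval(s) = theta . phi(s),
        following the update rule above, with L2 regularization."""
        theta = np.asarray(theta, dtype=float)
        phi_s = np.asarray(phi_s, dtype=float)
        phi_s_next = np.asarray(phi_s_next, dtype=float)
        td_error = theta @ phi_s - (reward_next + gamma * (theta @ phi_s_next))
        return theta - eta * td_error * phi_s - eta * lam * theta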

6.4 Finding the next move


Since there is uncertainty in the game, our first idea was to use expectiminimax. The problem
is that there are too many possible "random" boards, and the chance nodes would have too many
children, making the tree computationally intractable. Hence, we decided to use minimax with a
fixed board for the opponent: the most likely one deduced from the probability distribution, as
explained in the section on Beliefs.
Hence, at each turn we compute a belief board, which tries to predict what the actual board is by
guessing the opponent's pieces. Once we have this board, we can run a regular minimax algorithm
with depth-limited search (empirically, depth = 3 is a good trade-off between speed and quality of
play). To speed up our algorithm, we also used several pruning techniques:
• Alpha-beta pruning: we skip branches that cannot affect the final minimax decision
• Zobrist hashing: we keep track of the explored states when looking for the best move by
storing each couple (state, score) in a dictionary, using a hash function applied to each board,
so that two move orders leading to the same position (e.g. move 1 then move 2, or move 2
then move 1) hash to the same entry

Figure 5: Zobrist Hashing principle - Performance of pruning algorithms

The Zobrist hash function Z is applied to a board b and returns the value h = Z(b). Since
there are 24 different kinds of pieces (1 to 10, flag and bomb, for both players) and 92 squares on
the board (10 × 10 − 2 × 4 after removing the lakes), we initialize our Zobrist table t as a 24 × 92
array filled with randomly generated bitmaps (i.e. random integers between 0 and 2^64 − 1).
Then we compute h: we first initialize h to 0, and for each piece on the board, we update

h ← h XOR t[value(piece), position(piece)]

where 0 ≤ value(piece) < 24 and 0 ≤ position(piece) < 92. We finally get our hash value h.
Such an algorithm enables us to efficiently compute the hash value of a board from that of the
previous board. Indeed, since a = (a XOR b) XOR b for any integers a and b, we can update a
board's h value by "XORing out" the pieces that moved from their old positions and "XORing in"
their new positions. Hence, if a piece moves from one square to another without attacking an
opponent's piece, the new h value is

h ← (h XOR t[value(piece), oldPosition(piece)]) XOR t[value(piece), newPosition(piece)]
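A sketch of the table initialization, the full hash and the incremental update described above; the board is assumed to be iterable as (kind, square) pairs:

    import random

    NUM_KINDS, NUM_SQUARES = 24, 92  # 12 piece kinds x 2 players; 100 - 8 lake squares

    # Zobrist table t: one random 64-bit bitmap per (piece kind, square).
    ZOBRIST = [[random.getrandbits(64) for _ in range(NUM_SQUARES)]
               for _ in range(NUM_KINDS)]

    def zobrist_hash(board):
        """Full hash of a board, with 0 <= kind < 24 and 0 <= square < 92."""
        h = 0
        for kind, square in board:
            h ^= ZOBRIST[kind][square]
        return h

    def zobrist_move(h, kind, old_square, new_square):
        """Incremental update when a piece moves without fighting:
        XOR out the old position, XOR in the new one."""
        return h ^ ZOBRIST[kind][old_square] ^ ZOBRIST[kind][new_square]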

7 Results
We plotted on the charts below the results for our minimax player after playing several games:
• Performance of different players vs Baseline: our AI player achieves quite good performance
against the Baseline player (84% victories with depth 3, against 95% victories for our Oracle
player), which is reasonable given that some of the Baseline's moves are random and therefore
harder for our AI to understand and predict.
• Performance of our AI against the Baseline for different depths: the performance increases
with the depth but for depths 4 and 5, the time our AI needs to compute the next move is
too long to play at a normal pace.

Figure 6: Performance of players vs Baseline (left) - Performance with increased depth (right)

8 Error Analysis
We have been able to identify the main sources of error in our AI, i.e. the reasons responsible for
the AI losing or not winning fast enough (making non-optimal moves).

8.1 Opponent’s pieces estimation


We assessed the accuracy of our heuristics by testing them on our dataset of 40K
Stratego games. We launched 1000 games, and at each move we compared the "most likely board"
computed by our heuristics with the actual board. Thus, for each i-th move (x axis), we have
an accuracy rate (y axis), which is the number of piece values correctly predicted over the number
of piece values to predict (i.e. the number of non-revealed opponent pieces). At the beginning of the
game, our accuracy is very low (20% in the best case), since our AI has very little information.
But as the game unfolds, our accuracy rate increases and reaches 80% after 300 moves (a game
usually lasts around 500 moves).
Having better accuracy at the very beginning is not feasible, but we could try to reach good
accuracy sooner (so far we reach 50% accuracy after 150 moves). For this, we would
need better heuristics (detailed in section 9).

Figure 7: Performance of the heuristics

8.2 Evaluation function


The evaluation function is a crucial point of every depth-limited minimax algorithm. In our case,
the quality of a state is very hard to define, especially when the game has just started.

8.2.1 Missing features


Due to lack of time, we could not implement and learn the more sophisticated features we envisioned
initially. For instance, we do not handle the fact that groups of pieces are usually stronger than the
sum of their parts, and the evaluation function should offer an incentive for our AI to group pieces
in a meaningful way (spy with general, for instance). This is only one example; common sense,
experience playing Stratego, and blogs or scientific literature give hundreds of potential features
that could be tested. That said, we feel that the ones we chose are the most important and capture
a good chunk of the variance.

8.2.2 Performance issues


Running the TD-learning algorithm on enough boards to get good convergence, given the mas-
sive diversity of potential states, requires playing a high number of games. Due to performance
limitations, we have not been able to run TD-learning on more than a few hundred games.
Also, we have not been able to tweak our feature generator as much as we wanted, because each
change in the code required a several-hours-long retraining process. We believe that a fully trained
AI could make much better decisions.

8.2.3 Analysis of the weights’ influence


In order to test the importance of weight selection, we ran several games with weights chosen
at random, and the difference in victory ratio between the best and the worst trial was about 30%.
This shows how major the impact of the feature weights is.

8.2.4 Analysis of the evaluation evolution
Figure 8 shows the evolution of the evaluation function for each of the two players during a
Stratego game.

Figure 8: Evolution of the evaluation function for two players playing against each other

As expected, the winner has a greater evaluation score as the game evolves.

9 Next steps
9.1 Learning belief heuristics
Given more time, we would have explored an interesting path: automatically learning heuristics
for the belief update. Right now, we rely on a small set of handcoded heuristics that give good results,
but we feel that we could discern more interesting patterns from the database of 40,000 games we
have access to. Ideally, we would like to learn rules by looking backwards from the time a
piece is revealed to the beginning of the game, and grasp clues that could have allowed us to infer
its identity as early as possible.

9.2 Improving the heuristic for the search of the most likely board
Once we have updated the belief, the player must assign a value to each square of the board. So far,
we use a simple heuristic: we start by assigning the unique pieces (10, 9, flag, spy), because a player
can have at most one of each of these, and then we assign all the remaining pieces to the squares of
the board with the highest probability of containing them. We think of improving this in different
ways:

• Using a CSP on the positions of the important pieces (10, 9, flag, spy), in order to maximize
the probability of the joint positions of these important pieces instead of only maximizing the
probability piece by piece.
• Hardcoding a better heuristic than choosing a piece randomly among the ones that are still
to be assigned and placing it on the square where it has the highest probability (maybe an
ordering on the pieces?).

9.3 Improving minimax


Once we have our 'most likely board', we run a minimax algorithm to find the best action for the
player, given the board we chose. We think that we should relax this condition and run minimax
on maybe 2 or 3 'very likely boards', in order to generalize better and make less 'strong' assumptions
on the board. Our player would then be an expectiminimax player: if we take 3 very likely boards,
it maximizes the expected score, with probability 1/3 given to each board.
