SIMULATION
Alex Nguyen
Prof. Scheffler
1 Project Outline
For the final project I will implement and analyze different strategies in
chess. Writing a program that plays chess autonomously has been a massive
accomplishment in artificial intelligence. Chess is a perfect-information, two-player, zero-sum game. It is a perfect-information game because both players have complete knowledge of the state of the board and can make decisions accordingly. Because it is zero-sum, a win for one player implies a loss for the other.
For the final project, I implement two different kinds of strategies to play
computer chess. Alpha-beta pruning is a classical search technique to search
through possible moves from a position to find the best move. The current best
chess engines use alpha-beta pruning to perform a search through candidate moves from a position, terminating the search at a certain depth and evaluating the board position using a human-made evaluation function. This evaluation function is based on human intuition about how good a chess position is, considering the imbalance of material, the occupancy of central squares, the positioning of the bishop pair, and so on. I also
implemented a radically different strategy, using reinforcement learning combined
with the representation power of neural networks to encode a “strategy” using
parameters of a neural network and learn the strategy purely through self-play.
This is the approach DeepMind used to build a Go-playing engine that defeated Lee Sedol, a feat previously thought impossible because Go is a game of such complexity that it had eluded the grasp of engines using the best search techniques. The goal is to create a chess engine that uses simulation and look-ahead to play chess at close to my amateurish level.
2 Introduction
Figure 1: A Chessboard.
The objective of the game is for one side to checkmate the other, i.e. to attack the opposing king in such a way that no legal move can prevent the king's capture on the following move.
Each piece on the board is allowed to move differently. Pawns move one step ahead at a time and are not allowed to move backward, except on their first move, when they have the option to move two steps ahead. Pawns can only capture diagonally (one step ahead). If a pawn reaches the last rank of the opponent's territory, it is promoted and can become any other piece. Knights move in a (2-by-1) L-shape in any direction. Bishops can move arbitrarily far in any diagonal direction, which ensures that they stay on the same color square for their entire time in the game. Rooks can move arbitrarily far in any horizontal or vertical direction. Queens combine the powers of bishops and rooks, being able to move arbitrarily far in any straight line on the board. Kings can move to any adjacent square in their Moore neighborhood. A piece can be captured if it lies on a square that an opposing piece can move to.
Chess also contains special rules that are history-dependent, such as castling and en passant. Castling can only occur if the king and the chosen rook have not yet moved, and each side can castle only once. It allows the king to move two squares toward either rook (which a king normally cannot do), with the rook placed on the square the king crossed. En passant allows a pawn to capture an enemy pawn that has just advanced two squares to land directly beside it; the capturing pawn stands on the 5th rank (for White) or the 4th rank (for Black).
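These rules are implemented by the python-chess library (the chess module used in the implementation later). A small illustration, where the specific move sequence is my own example:

```python
import chess

board = chess.Board()                      # standard starting position
print(board.legal_moves.count())           # 20: 16 pawn moves + 4 knight moves

# History-dependent rule: after 1.e4 a6 2.e5 d5, Black's d-pawn has just
# advanced two squares past White's e5 pawn, so White may capture en passant.
for uci in ["e2e4", "a7a6", "e4e5", "d7d5"]:
    board.push_uci(uci)
en_passant = chess.Move.from_uci("e5d6")
print(board.is_en_passant(en_passant))     # True
print(en_passant in board.legal_moves)     # True
```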
3 Alpha-Beta Pruning
Consider the different strategies that we can take when choosing a move. Chess
is a highly non-linear game and heavily punishes greedy algorithms. A move that seems optimal right now may not actually be optimal in terms of the final game outcome. For example, capturing a bishop with a queen might look good now, but is sub-optimal if it leads to the queen being captured on the next move. Therefore, we need to simulate the game ahead in order to determine the best move.
The game tree of chess contains an astronomically large number of possible positions. Therefore, expanding the minimax tree all the way until the end is infeasible.
Thus, rather than simulating the game all the way until the end, we can define
a parameter max depth which stops simulation at a certain depth of the tree and
evaluate the board position using a handmade evaluation function. Simply from
the board, we can see that the more pieces that a player has, the more likely they
are to win. Having more pieces allows players to attack and defend appropriately.
However, the number of pieces is not the only factor, their positioning is also
important. It is advantageous to put pieces in positions that will allow them to
control as many squares as possible, similar to a war game over territory. At the
beginning of the game, it is advantageous to control the center of the board because
the center allows for maximal projection of square control. It is also important to
ensure king safety (usually by castling) to minimize the possibility of an attack
that will lead to checkmate. However, at the end of the game a good heuristic is to
move the king towards the center, where it can play a more active role due to the
reduced number of pieces on the board.
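A minimal sketch of such a handmade evaluation function using the chess module: the material weights are the conventional centipawn values, and the 10-centipawn bonus per attacked central square is an assumed, illustrative weight.

```python
import chess

PIECE_VALUE = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
               chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}
CENTER = [chess.D4, chess.E4, chess.D5, chess.E5]

def evaluate(board):
    """Score a position from White's point of view: material imbalance plus
    a small bonus for every attack on a central square."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUE[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    for square in CENTER:
        score += 10 * len(board.attackers(chess.WHITE, square))
        score -= 10 * len(board.attackers(chess.BLACK, square))
    return score

print(evaluate(chess.Board()))  # 0: the starting position is symmetric
```

Real evaluation functions add many more terms (king safety, pawn structure, mobility), but the structure is the same: a weighted sum of handmade features.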
In these handmade evaluation functions, material is typically weighted most heavily. This means that the top chess engines tend to be more conservative with their material and excel at tactics: short sequences of moves that end with material gain. They tend to be poorer at strategy than humans, who tend to use a
more intuitive and positional style to coordinate attacks using a somewhat more
conceptual framework.
By scoring the value of a board, we can try to maximize this score for us,
knowing that the opponent will select a move that will minimize our score (while
themselves assuming that we will select the move that will maximize our own score
in response to that).
One way to improve the minimax tree search is to use alpha-beta pruning. In alpha-beta pruning, we avoid evaluating certain subtrees of the game tree when a better move is already known to be available. As an example, suppose White is about to
make a move.
At each level, White (the maximizing player) keeps a parameter α and Black (the minimizing player) keeps a parameter β. The parameters α and β are the best values that White and Black, respectively, can already guarantee for themselves. We can think of α and β as the heads of two opposing forces: starting from opposite ends, White tries to push α up as much as possible and Black tries to push β down as much as possible.
Suppose the board position is at node A for White, which considers whether it should take action B or C. Considering Black's possible responses at B, the children of B, we arrive at node D. At this node, the maximizing player (White) is playing. If White takes the left action, it pushes the value of α1 to 3 (the subscript 1 indicates that this α belongs to the tree at height 1). If White takes the right action, it pushes the value of α1 to 5. Once all of the children of D are evaluated, we pop up a level, arriving at node B, where Black is to decide. Black sees that White can guarantee a value of 5 at node D, so it sets β2 = 5. Then, at B, Black considers whether it should take action E instead. At E, White has two possible choices. The leftmost choice is 6, which pushes White's α1 to 6, greater than the β2 value of 5. Thus, popping up a level again, Black knows that node E lets White guarantee a value of at least 6, which is worse for Black than the 5 it is already guaranteed by move D. Therefore, Black has no reason to evaluate the right child of E, and White is guaranteed an α3 of 5 for node B. Similarly, as White considers whether to take action C, it sees that Black can take choice F, which gives best value 2. This means that with action F, Black has managed to push its β2 all the way down to 2, already below the α3 of 5 that White is guaranteed at node B. Therefore, White has no reason to continue evaluating the second child of F or any of the children of G. This strategy saves evaluating a large number of nodes and significantly speeds up computation.
The max depth parameter is very important because it controls how far ahead into the future the alpha-beta pruning algorithm sees. If max depth = 1, the chess engine simply avoids moves that end in immediate capture and otherwise selects moves greedily. It has no conception that capturing a piece can result in a recapture just a move later. If max depth = 2, it has basic knowledge of avoiding recapture and can take precautions to run away from attacks. At max depth = 5, the algorithm already takes about one minute per move, even with alpha-beta pruning. On the other hand, at this depth the algorithm is good at tactics, is passable at opening development, and requires some concentration from me to defeat.
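The pruning procedure described in this section can be sketched in Python with the chess module. This is a minimal illustration, assuming a pure material-count evaluation (the full implementation uses a richer evaluation function), not the project's actual engine:

```python
import chess

# Conventional material values; the king is scored 0 since it cannot be captured.
PIECE_VALUE = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
               chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def material(board):
    """Material balance from White's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUE[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def alphabeta(board, depth, alpha, beta):
    """Depth-limited minimax with alpha-beta cutoffs."""
    if depth == 0 or board.is_game_over():
        return material(board)
    if board.turn == chess.WHITE:              # maximizing player
        value = -float("inf")
        for move in list(board.legal_moves):
            board.push(move)
            value = max(value, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            alpha = max(alpha, value)
            if alpha >= beta:                  # cutoff: Black avoids this line
                break
        return value
    else:                                      # minimizing player
        value = float("inf")
        for move in list(board.legal_moves):
            board.push(move)
            value = min(value, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            beta = min(beta, value)
            if beta <= alpha:                  # cutoff: White avoids this line
                break
        return value

def choose_move(board, max_depth=2):
    """Return the legal move with the best alpha-beta score for the side to move."""
    sign = 1 if board.turn == chess.WHITE else -1
    def score(move):
        board.push(move)
        result = alphabeta(board, max_depth - 1, -float("inf"), float("inf"))
        board.pop()
        return sign * result
    return max(list(board.legal_moves), key=score)

# With max_depth = 2, White finds the free queen capture Qxd5.
board = chess.Board("k7/8/8/3q4/8/8/3Q4/K7 w - - 0 1")
print(choose_move(board, 2))  # d2d5
```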
4 Deep Reinforcement Learning
AlphaZero is based on the idea from deep learning of avoiding handmade features handed down from humans, and instead allowing deep neural networks to “discover” their own set of features from a large set of data. To do this, the algorithm uses a deep feedforward neural network (with skip connections; these will not be discussed in detail here). The neural network computes two different outputs: a policy output and a value output. The policy output gives a probability distribution over the most promising moves to explore in a position, and the value output gives the “score” of the position in the interval [−1, 1], where 1 means White wins and −1 means Black wins.
At first, the network plays essentially random moves, since the parameters of the network are initialized randomly. We then generate games of self-play and record the results. From the universal approximation theorem, we know that neural networks are able to approximate any continuous function given an appropriate set of parameters. Note that the value of a position is based purely on whether it leads to a winning outcome. Therefore, the network is free of preconceived human notions about important features such as the presence of material.
We will see in the final result that, even though material is important, the network can make spectacular sacrifices, willingly losing material for long-term positional gains¹ based on its accurate evaluation of the board position from all features of the board, not just the presence of material. Thus, using gradient descent we
can train the network to become better at computing the value of a position (i.e.
allowing the network to discover its own set of features). One disadvantage is that
this set of features is opaque since it is encoded in the parameters of the neural
networks. Therefore, we might never know the “best” way to play chess like the
best chess-playing neural networks.
However, the objective is not to create the most accurate position evaluator but to create the best chess player. The policy output of the neural network directs our attention towards the moves that would maximize the value. Even here, the ideas of minimax feature prominently. The policy network guides a Monte Carlo Tree Search algorithm that simulates the game forward. We can think of the policy network as providing a prior probability (or degree of belief) over the move that would maximize the value. The Monte Carlo Tree Search then uses this prior probability to select the next move to explore. This resembles a Markov chain Monte Carlo (MCMC) method, in that we use knowledge learned about the game of chess itself, stored within the neural network, to move towards the “important” areas of the “distribution” over the most valuable moves. The analogy is not completely accurate, as in MCMC we are able to calculate the posterior density at a point with perfect accuracy. On the other hand, for this situation
¹A positional sacrifice is almost considered an “oxymoron” in chess theory, since no human (or computer) can look far enough into the future to dare attempt losing material for something as elusive as a “positional” advantage.
we do not know the distribution over the best move (even after expansion of a
board state) and must approximate this distribution by learning from lots of data.
As the network becomes more and more accurate in predicting the value of a board
state, the network also propagates this information into its prior distribution of
the move that would maximize this value. Eventually, the network overall becomes
very good at recognizing both the value of a board position and the moves that
would be most likely to bring about an optimal outcome.
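As a schematic sketch of the two-headed network: the real model is a deep residual network, so the single linear layer per head, the random weights, and the 4096-dimensional move encoding (from-square × to-square) used here are assumptions purely for illustration. The input is the (18, 8, 8) board tensor described in the steps below.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MOVES = 4096  # assumed move encoding: 64 from-squares x 64 to-squares
W_policy = rng.normal(0.0, 0.01, (N_MOVES, 18 * 8 * 8))  # stand-in weights
W_value = rng.normal(0.0, 0.01, (1, 18 * 8 * 8))

def forward(planes):
    """Map an (18, 8, 8) board tensor to (policy distribution, value)."""
    x = planes.reshape(-1)                  # flatten the input planes
    logits = W_policy @ x
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                  # softmax: distribution over moves
    value = float(np.tanh(W_value @ x)[0])  # squash the score into [-1, 1]
    return policy, value

policy, value = forward(rng.random((18, 8, 8)))
print(policy.shape, abs(policy.sum() - 1.0) < 1e-9, -1.0 <= value <= 1.0)
```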
1. If we are at a board position (let’s say the root node) where we have never
been before, we call the board position “unvisited” and “expand” the node.
2. To expand the node, we encode the board position into a (18, 8, 8) tensor.
(a) The first 6 (8, 8) matrices encode the positions of our pieces (since there
are six different piece types).
(b) The next 6 matrices encode the positions of the opponent pieces.
(c) The next 4 matrices encode whether there are kingside or queenside
castling options available for our side and the opponent side.
(d) The next matrix encodes the en passant square, and the last matrix encodes the half-move clock (important for determining whether a draw is about to happen).

Figure 3: A Board Position.
3. We then pass the tensor through the network, which returns a policy output
and a value output.
5. If we still have playouts left, consider the children of the node we have expanded (which include all the legal moves for the opponent from that position). Since we have just expanded the root node, all of these child nodes are unvisited.
(a) Each child node is initialized with the following statistics:
i. n is the number of times the node has been visited. Since the node has never been visited, n is initialized to 0.
ii. w gives the total value of the node. Since the node has never been visited, w is initialized to 0.
iii. q gives the average value of the node. Since the node has never been visited, q is initialized to 0.
iv. p is the prior probability of the node. This prior probability is inherited from the parent (the policy output of the network corresponding to that move).
(b) Among the children representing the opponent's legal moves at that position, we need to simulate the opponent's action.
6. Once we have selected the next best node to expand, we repeat step 2 for that node, noting that we have switched perspectives and need to change the value function accordingly (Black tries to minimize the value, while White tries to maximize it).
7. Once the newly selected node has been expanded and evaluated, its value is propagated back up the tree:
(a) The parent node accumulates the child node's computed value: its total value w is incremented by the newly calculated value.
(b) The parent node earns a visit, and thus we increment the parent node's n value by 1.
(c) The parent node's q value is recalculated by dividing the total value w by the current number of visits n.
8. Once we have used up our budget of playouts, we must decide which move to take. The sensible choice is the move that has been visited most often, because the visit counts strike a balance between exploration and exploitation, and the most-visited move generally tends to be the one with the highest value.
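The statistics in step 5 and the updates in steps 7 and 8 can be sketched as follows. The selection rule is the PUCT formula from [1], q + c · p · √N / (1 + n); the exploration constant and the toy usage values at the bottom are assumptions for illustration.

```python
import math

class Node:
    """Tree node carrying the statistics of step 5: visit count n,
    total value w, mean value q, and prior probability p."""
    def __init__(self, prior):
        self.n, self.w, self.q, self.p = 0, 0.0, 0.0, prior
        self.children = {}                   # move -> Node

C_PUCT = 1.0                                 # assumed exploration constant

def select_child(node):
    """Step 5(b): pick the child maximizing q + c * p * sqrt(N) / (1 + n),
    trading off the average value q against the prior-guided exploration term."""
    total_visits = sum(child.n for child in node.children.values())
    def puct(child):
        return child.q + C_PUCT * child.p * math.sqrt(total_visits) / (1 + child.n)
    return max(node.children.items(), key=lambda item: puct(item[1]))

def backpropagate(path, value):
    """Steps 7(a)-(c): push the expanded node's value up the tree,
    flipping the sign at each level because the players alternate."""
    for node in reversed(path):
        node.n += 1                          # (b) one more visit
        node.w += value                      # (a) accumulate the value
        node.q = node.w / node.n             # (c) recompute the average
        value = -value

def most_visited_move(root):
    """Step 8: after the playouts, play the most-visited move."""
    return max(root.children.items(), key=lambda item: item[1].n)[0]

# Toy usage: two candidate moves with assumed priors, one simulated playout.
root = Node(1.0)
root.children = {"e2e4": Node(0.6), "d2d4": Node(0.4)}
backpropagate([root, root.children["e2e4"]], 1.0)
print(most_visited_move(root))  # e2e4
```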
In order to train the network, for every position we save the normalized number
of visits π(s, a) for every possible child node (s represents the board state and a
represents each possible legal move). Once the game is over, we also save the game
result z to the board position. Then, we try to get the network to minimize the following loss function using gradient descent [1]:

ℓ = (z − v)² − π⊤ log p + c‖θ‖²
Thus, the network will try to minimize the difference between the predicted
value and the actual value (i.e. become better at evaluating board positions),
the categorical cross-entropy (i.e. the difference between the distribution of prior probabilities over best moves returned by the network, p, and the actual distribution of best moves computed after the Monte Carlo Tree Search, π), and the sum of the squares of the network parameters θ (a form of regularization to avoid overfitting). As
the value network becomes more and more accurate, the Monte Carlo Tree Search
becomes more and more likely to return the actual best move. Then, as the Monte Carlo Tree Search becomes more and more accurate, the prior probabilities over best moves also become more likely to zero in on the most promising moves.
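The loss described above can be sketched in numpy; z, v, π (pi), p, and θ (theta) are as in the text, and the regularization constant c is an assumed value.

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Squared value error + cross-entropy between the search distribution pi
    and the network's prior p + L2 regularization on the parameters theta."""
    value_term = (z - v) ** 2
    policy_term = -np.sum(pi * np.log(p + 1e-12))  # epsilon guards log(0)
    reg_term = c * np.sum(theta ** 2)
    return value_term + policy_term + reg_term

# When the prediction is perfect (v = z and p = pi) and theta = 0,
# only the entropy of the search distribution remains.
pi = np.array([0.7, 0.2, 0.1])
loss = alphazero_loss(z=1.0, v=1.0, pi=pi, p=pi, theta=np.zeros(4))
print(round(float(loss), 4))
```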
Figure 4: Histogram of Values Evaluated as Child Nodes of g6.
I implemented the algorithm in Python, using the keras and chess modules
in order to implement the networks. In addition, I learned how to optimize
neural network computation from previous chess projects (using threading and
multiprocessing to optimize computation on CPUs) [2]. However, since training
the network would require significant investment in time, computational power,
and cloud computing costs, I did not manage to train the network. Here is the plot
of the distribution of values that were predicted from child nodes of the move g6 in
the board position represented in Figure 3.
Note that the network is untrained. We see that the predicted values are close to 0, since the untrained network is essentially outputting random values. The 95% confidence interval of the value is [−0.04069338, 0.11674655]. This interval should be taken with caution, since the untrained network's predictions are not an accurate representation of the actual value of the position. For a better prediction I would need to generate many games of self-play and train the network to give better value predictions.
To play against either the RL agent or the agent using alpha-beta pruning (which is a faster and better player, simply because the neural network has not been trained), run python run_game.py alphabeta or python run_game.py rl on the command line.
References
[1] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.