2.1. Scaling and parallelizing ES

ES has three aspects that make it particularly well suited to being scaled up to many parallel workers: 1) it operates on complete episodes, thereby requiring only infrequent communication between the workers; 2) the only information obtained by each worker is the scalar return of an episode: provided the workers know what random perturbations the other workers used, each worker only needs to send and receive a single scalar to and from each other worker to agree on a parameter update. ES thus requires extremely low bandwidth, in sharp contrast to policy gradient methods, which require workers to communicate entire gradients; 3) it does not require value function approximations. Reinforcement learning with value function estimation is inherently sequential: in order to improve upon a given policy, multiple updates to the value function are typically needed to get enough signal, and each time the policy is significantly changed, multiple iterations are necessary for the value function estimate to catch up.

A simple parallel version of ES is given in Algorithm 2.
Algorithm 2 Parallelized Evolution Strategies
1: Input: Learning rate $\alpha$, noise standard deviation $\sigma$, initial policy parameters $\theta_0$
2: Initialize: $n$ workers with known random seeds, and initial parameters $\theta_0$
3: for $t = 0, 1, 2, \ldots$ do
4:   for each worker $i = 1, \ldots, n$ do
5:     Sample $\epsilon_i \sim N(0, I)$
6:     Compute returns $F_i = F(\theta_t + \sigma \epsilon_i)$
7:   end for
8:   Send all scalar returns $F_i$ from each worker to every other worker
9:   for each worker $i = 1, \ldots, n$ do
10:    Reconstruct all perturbations $\epsilon_j$ for $j = 1, \ldots, n$
11:    Set $\theta_{t+1} \leftarrow \theta_t + \alpha \frac{1}{n\sigma} \sum_{j=1}^{n} F_j \epsilon_j$
12:  end for
13: end for
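As a concrete illustration of Algorithm 2, the following sketch simulates the worker loop in a single process. It is a minimal toy, not our released implementation: the quadratic objective, the seed bookkeeping, and the mean-return baseline used in the combination step are illustrative assumptions.

```python
import numpy as np

def F(theta):
    # Stand-in for the return of one episode; a real worker would roll out the policy.
    return -float(np.sum(theta ** 2))

def parallelized_es(theta0, alpha=0.05, sigma=0.1, n_workers=20, iters=300):
    theta = theta0.copy()
    seeds = np.arange(n_workers)                      # known to every worker up front
    for t in range(iters):
        # Each "worker" draws its perturbation from its own seed and reports one scalar.
        returns = np.empty(n_workers)
        for i, seed in enumerate(seeds):
            eps_i = np.random.RandomState(10_000 * seed + t).randn(theta.size)
            returns[i] = F(theta + sigma * eps_i)
        # After the all-to-all exchange of scalars, every worker rebuilds all
        # perturbations from the shared seeds and applies the identical update.
        baseline = returns.mean()                     # variance-reduction tweak, not in the pseudocode
        update = np.zeros_like(theta)
        for j, seed in enumerate(seeds):
            eps_j = np.random.RandomState(10_000 * seed + t).randn(theta.size)
            update += (returns[j] - baseline) * eps_j
        theta = theta + alpha * update / (n_workers * sigma)
    return theta

print(np.round(parallelized_es(np.full(5, 1.0)), 2))  # entries should end up near zero
```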
In practice, we implement sampling by having each worker instantiate a large block of Gaussian noise at the start of training, and then perturbing its parameters by adding a randomly indexed subset of these noise variables at each iteration. Although this means that the perturbations are not strictly independent across iterations, we did not find this to be a problem in practice. Using this strategy, we find that the second part of Algorithm 2 (lines 9-12) takes up only a small fraction of total time spent in all our experiments, even when using up to 1,440 parallel workers. When using many more workers still, or when using very large neural networks, we can reduce the computation required for this part of the algorithm by having workers perturb only a subset of the parameters rather than all of them: in this case the perturbation distribution $p$ corresponds to a mixture of Gaussians, for which the update equations remain unchanged. At the very extreme, every worker would perturb only a single coordinate of the parameter vector, which means that we would be using pure finite differences.

To reduce variance, we use antithetic sampling (Geweke, 1988), also known as mirrored sampling (Brockhoff et al., 2010) in the ES literature: that is, we always evaluate pairs of perturbations $\epsilon, -\epsilon$ for each Gaussian noise vector $\epsilon$. We also find it useful to perform fitness shaping (Wierstra et al., 2014) by applying a rank transformation to the returns before computing each parameter update. Doing so removes the influence of outlier individuals in each population and decreases the tendency for ES to fall into local optima early in training. In addition, we apply weight decay to the parameters of our policy network: this prevents the parameters from growing very large compared to the perturbations.
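These two variance-reduction techniques can be sketched in a few lines. The centered-rank shaping below, which maps ranks into the interval [-0.5, 0.5], is one common choice of rank transformation and, like the toy objective, is an assumption made for illustration rather than the exact shaping we use.

```python
import numpy as np

def centered_ranks(returns):
    # Replace raw returns by their ranks, rescaled to [-0.5, 0.5].
    ranks = np.empty(len(returns), dtype=np.float64)
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

def es_step(theta, F, alpha=0.02, sigma=0.1, n_pairs=50, weight_decay=0.005):
    # Antithetic (mirrored) sampling: evaluate +eps and -eps with the same noise vector.
    eps = np.random.randn(n_pairs, theta.size)
    returns = np.array([F(theta + sigma * e) for e in eps] +
                       [F(theta - sigma * e) for e in eps])
    shaped = centered_ranks(returns)
    # Pair up the shaped fitnesses: the mirrored estimator uses their difference.
    weights = shaped[:n_pairs] - shaped[n_pairs:]
    grad = weights @ eps / (2 * n_pairs * sigma)
    # Weight decay keeps the parameters from growing large relative to the perturbations.
    return theta + alpha * grad - weight_decay * theta

theta = np.zeros(20)
for _ in range(300):
    theta = es_step(theta, lambda th: -np.sum((th - 1.0) ** 2))
print(np.round(theta, 2))  # entries should climb toward 1.0
```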
Evolution Strategies, as presented above, works with full-length episodes. In some rare cases this can lead to low CPU utilization, as some episodes run for many more steps than others. For this reason, we cap episode length at a constant m steps for all workers, which we dynamically adjust as training progresses. For example, by setting m to be equal to twice the average number of steps taken per episode, we can guarantee that CPU utilization stays above 50% in the worst case.

We plan to release full source code for our implementation of parallelized ES in the near future.

2.2. The impact of network parameterization

Whereas RL algorithms like Q-learning and policy gradients explore by sampling actions from a stochastic policy, Evolution Strategies derives its learning signal from sampling instantiations of the policy parameters. Exploration in ES is thus driven by parameter perturbation. For ES to improve upon parameters $\theta$, some members of the population must achieve better return than others: i.e. it is crucial that Gaussian perturbation vectors $\epsilon$ occasionally lead to new individuals $\theta + \sigma\epsilon$ with better return.

For the Atari environments, we found that Gaussian perturbations of the parameters of DeepMind's convolutional architectures (Mnih et al., 2015) did not always lead to adequate exploration: for some environments, randomly perturbed parameters tended to encode policies that always took one specific action regardless of the state that was given as input. However, we discovered that we could get performance matching that of policy gradient methods for most games by using virtual batch normalization (Salimans et al., 2016) in the policy specification. Virtual batch normalization is precisely equivalent to batch normalization (Ioffe & Szegedy, 2015), where the minibatch used for calculating the normalizing statistics is chosen at the start of training and is fixed. This change in parameterization makes the policy more sensitive to very small changes in the input image at the early stages of training, when the weights of the policy are random, ensuring that the policy takes a wide-enough variety of actions to gather occasional rewards. For most applications, a downside of virtual batch normalization is that it makes training more expensive. For our application, however, the minibatch used to calculate the normalizing statistics is much smaller than the number of steps taken during a typical episode, meaning that the overhead is negligible.
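A minimal numpy sketch of the idea follows; the toy layer shapes, the per-feature affine parameters, and the way the reference batch is collected are illustrative assumptions and not the architecture used in our experiments.

```python
import numpy as np

class VirtualBatchNorm:
    """Normalize activations with statistics from a reference batch that is chosen
    once, at the start of training, and then kept fixed."""

    def __init__(self, reference_activations, eps=1e-5):
        self.mean = reference_activations.mean(axis=0)
        self.std = reference_activations.std(axis=0) + eps

    def __call__(self, activations, gamma=1.0, beta=0.0):
        # Unlike ordinary batch norm, the statistics do not depend on the current input,
        # so the mapping from parameters to actions stays deterministic per observation.
        return gamma * (activations - self.mean) / self.std + beta

# Reference batch of activations collected once (e.g. from observations gathered
# by taking random actions); sizes here are toy values.
reference_batch = np.random.randn(128, 64)
vbn = VirtualBatchNorm(reference_batch)
single_observation = np.random.randn(1, 64)
print(vbn(single_observation).shape)             # (1, 64)
```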
For the MuJoCo tasks, we achieved good performance on nearly all the environments with the standard multilayer perceptrons mapping to continuous actions. However, we observed that for some environments we could encourage more exploration by discretizing the actions. This forced the actions to be non-smooth with respect to the input observations and parameter perturbations, and thereby encouraged a wide variety of behaviors to be played out over the course of rollouts.
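The binning below is one plausible way to implement such a discretization; the exact mapping we used is not spelled out here, so the evenly spaced grid is our assumption for illustration.

```python
import numpy as np

def discretize_action(raw_action, low, high, n_bins=10):
    # Snap each continuous action dimension to the nearest of n_bins evenly spaced
    # values, making the executed action a step function of the policy output.
    grid = np.linspace(low, high, n_bins)
    idx = np.abs(raw_action[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

raw = np.array([0.13, -0.72, 0.98])
print(discretize_action(raw, low=-1.0, high=1.0))   # approx. [0.11, -0.78, 1.0]
```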
3. Smoothing in parameter space versus smoothing in action space

As mentioned in section 2, a large source of difficulty in RL stems from the lack of informative gradients of policy performance: such gradients may not exist due to non-smoothness of the environment or policy, or may only be available as high-variance estimates because the environment usually can only be accessed via sampling. Explicitly, suppose we wish to solve general decision problems that give a return $R(a)$ after we take a sequence of actions $a = \{a_1, \ldots, a_T\}$, where the actions are determined by either a deterministic or a stochastic policy function $a_t = \pi(s; \theta)$. The objective we would like to optimize is thus
$$F(\theta) = R(a(\theta)).$$
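To make the abstraction concrete, $F(\theta)$ is just the scalar return of one rollout under a deterministic policy. The toy environment and the linear policy below are stand-ins chosen purely for illustration.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D environment: the agent should push its state toward zero."""
    def reset(self):
        self.state, self.t = 3.0, 0
        return self.state
    def step(self, action):
        self.state += float(np.clip(action, -1, 1))
        self.t += 1
        return self.state, -abs(self.state), self.t >= 20

def policy(obs, theta):
    # Deterministic policy a_t = pi(s; theta); here a simple linear map.
    return theta[0] * obs + theta[1]

def F(theta):
    # F(theta) = R(a(theta)): the return of the action sequence produced by theta.
    env = ToyEnv()
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs, theta))
        total += reward
    return total

print(F(np.array([-0.5, 0.0])))   # a single scalar: the only feedback ES ever uses
```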
Since the actions are allowed to be discrete and the policy is allowed to be deterministic, $F(\theta)$ can be non-smooth in $\theta$. More importantly, because we do not have explicit access to the underlying state transition function of our decision problems, the gradients cannot be computed with a backpropagation-like algorithm. This means we cannot directly use standard gradient-based optimization methods to find a good solution for $\theta$.
In order to both make the problem smooth and to have a means of estimating its gradients, we need to add noise. Policy gradient methods add the noise in action space, which is done by sampling the actions from an appropriate distribution. For example, if the actions are discrete and $\pi(s; \theta)$ calculates a score for each action before selecting the best one, then we would sample an action $a(\epsilon, \theta)$ (here $\epsilon$ is the noise source) from a categorical distribution over actions at each time period, applying a softmax to the scores of each action. Doing so yields the objective
$$F_{PG}(\theta) = \mathbb{E}_{\epsilon}\, R(a(\epsilon, \theta)),$$
with gradients
$$\nabla_{\theta} F_{PG}(\theta) = \mathbb{E}_{\epsilon} \left\{ R(a(\epsilon, \theta))\, \nabla_{\theta} \log p(a(\epsilon, \theta); \theta) \right\}.$$

Evolution strategies, on the other hand, add the noise in parameter space. That is, they perturb the parameters as $\tilde{\theta} = \theta + \xi$, with $\xi$ drawn from a multivariate Gaussian distribution, and then pick actions as $a_t = a(\xi, \theta) = \pi(s; \tilde{\theta})$. This can be interpreted as adding a Gaussian blur to the original objective, which results in a smooth, differentiable cost
$$F_{ES}(\theta) = \mathbb{E}_{\xi}\, R(a(\xi, \theta)),$$
this time with gradients
$$\nabla_{\theta} F_{ES}(\theta) = \mathbb{E}_{\xi} \left\{ R(a(\xi, \theta))\, \nabla_{\theta} \log p(\tilde{\theta}(\xi, \theta); \theta) \right\}.$$

The two methods for smoothing the decision problem are thus quite similar, and can be made even more so by adding noise to both the parameters and the actions.
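A toy, state-independent comparison of the two smoothings is sketched below. The bandit-style return, the fixed logits, and the greedy action choice under parameter noise are our illustrative assumptions; note that the two printed estimates need not agree, since they are Monte Carlo gradients of two differently smoothed objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, sigma = 4, 50, 1.0
theta = np.array([0.2, 0.0, -0.1, 0.1])        # logits of a state-independent policy

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def episode_return(actions):
    return float(np.sum(actions == 0))          # toy return: how often action 0 was taken

def pg_sample(theta):
    # Noise in action space: sample a_t from the stochastic policy, sum the score functions.
    p = softmax(theta)
    actions = rng.choice(K, size=T, p=p)
    score = np.eye(K)[actions].sum(axis=0) - T * p
    return episode_return(actions) * score

def es_sample(theta):
    # Noise in parameter space: perturb the logits once, then act deterministically all episode.
    eps = rng.standard_normal(K)
    actions = np.full(T, np.argmax(theta + sigma * eps))
    return episode_return(actions) * eps / sigma

print(np.mean([pg_sample(theta) for _ in range(4000)], axis=0))
print(np.mean([es_sample(theta) for _ in range(4000)], axis=0))
```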
3.1. When is ES better than PG?

Given these two methods of smoothing the decision problem, which one should we use? The answer depends strongly on the structure of the decision problem and on which type of Monte Carlo estimator is used to estimate the gradients $\nabla_{\theta} F_{PG}(\theta)$ and $\nabla_{\theta} F_{ES}(\theta)$. Suppose the correlation between the return and the actions is low. Assuming we approximate these gradients using simple Monte Carlo (REINFORCE) with a good baseline on the return, we have
$$\operatorname{Var}[\nabla_{\theta} F_{PG}(\theta)] \approx \operatorname{Var}[R(a)] \cdot \operatorname{Var}[\nabla_{\theta} \log p(a; \theta)],$$
and
$$\operatorname{Var}[\nabla_{\theta} F_{ES}(\theta)] \approx \operatorname{Var}[R(a)] \cdot \operatorname{Var}[\nabla_{\theta} \log p(\tilde{\theta}; \theta)].$$

If both methods perform a similar amount of exploration, $\operatorname{Var}[R(a)]$ will be similar for both expressions. The difference will thus be in the second term. Here we have that $\nabla_{\theta} \log p(a; \theta) = \sum_{t=1}^{T} \nabla_{\theta} \log p(a_t; \theta)$ is a sum of $T$ uncorrelated terms, so that the variance of the policy gradient estimator will grow nearly linearly with $T$. The corresponding term for evolution strategies, $\nabla_{\theta} \log p(\tilde{\theta}; \theta)$, is independent of $T$. Evolution strategies will thus have an advantage over policy gradients for long episodes with very many time steps. In practice, the effective number of steps $T$ is often reduced in policy gradient methods by discounting rewards. If the effects of actions are short-lasting, this allows us to dramatically reduce the variance in our gradient estimate, and this has been critical to the success of applications such as Atari games. However, this discounting will bias our gradient estimate if actions have long-lasting effects. Another strategy for reducing the effective value of $T$ is to use value function approximation. This has also been effective, but once again runs the risk of biasing our gradient estimates. Evolution strategies is thus an attractive choice if the effective number of time steps $T$ is long, actions have long-lasting effects, and no good value function estimates are available.
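The scaling of the two variance terms can be checked with a small Monte Carlo experiment. The simplifying assumption below is a state-independent uniform policy over K actions; only the growth with T matters, not the absolute numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
K, sigma, n_samples = 4, 0.1, 2000
p = np.ones(K) / K                          # uniform softmax policy, logits all zero

def pg_score(T):
    # Sum of T per-timestep score functions for the state-independent policy.
    actions = rng.choice(K, size=T, p=p)
    return np.eye(K)[actions].sum(axis=0) - T * p

for T in (10, 100, 1000):
    pg_var = np.var([pg_score(T) for _ in range(n_samples)], axis=0).sum()
    es_var = np.var(rng.standard_normal((n_samples, K)) / sigma, axis=0).sum()
    print(f"T={T:4d}  Var[sum_t grad log p(a_t)] = {pg_var:8.1f}   Var[eps/sigma] = {es_var:6.1f}")
# The first column grows roughly linearly with T; the second does not depend on T at all.
```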
3.2. Problem dimensionality
The gradient estimate of ES can be interpreted as a method for randomized finite differences in high-dimensional space. Indeed, using the fact that $\mathbb{E}_{\epsilon \sim N(0, I)} \left\{ F(\theta)\, \epsilon / \sigma \right\} = 0$, we get
$$\nabla_{\theta} \eta(\theta) = \mathbb{E}_{\epsilon \sim N(0, I)} \left\{ F(\theta + \sigma \epsilon)\, \epsilon / \sigma \right\} = \mathbb{E}_{\epsilon \sim N(0, I)} \left\{ \left( F(\theta + \sigma \epsilon) - F(\theta) \right) \epsilon / \sigma \right\},$$
where $\eta(\theta) = \mathbb{E}_{\epsilon \sim N(0, I)}\, F(\theta + \sigma \epsilon)$ denotes the Gaussian-smoothed objective optimized by ES. It is now apparent that ES can be seen as computing a finite-difference derivative estimate in a randomly chosen direction, especially as $\sigma$ becomes small.
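This identity is easy to check numerically; the smooth test function below is an arbitrary toy objective chosen only for the check.

```python
import numpy as np

def f(theta):
    return float(np.sin(theta[0]) + 0.5 * theta[1] ** 2)

theta, sigma = np.array([0.3, -1.2]), 1e-3
eps = np.random.default_rng(2).standard_normal(theta.size)

# One-sample version of the right-hand side above: (F(theta + sigma*eps) - F(theta)) * eps / sigma.
es_sample = (f(theta + sigma * eps) - f(theta)) * eps / sigma
# For small sigma this is a finite-difference derivative taken along the random direction eps.
directional = (f(theta + sigma * eps) - f(theta)) / (sigma * np.linalg.norm(eps))
true_grad = np.array([np.cos(theta[0]), theta[1]])

print(es_sample)                                             # = eps * (directional slope) * ||eps||
print(directional, true_grad @ eps / np.linalg.norm(eps))    # these two should nearly coincide
```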
The resemblance of ES to finite differences suggests the method will scale poorly with the dimension of the parameters $\theta$. Theoretical analysis indeed shows that for general non-smooth optimization problems, the required number of optimization steps scales linearly with the dimension (Nesterov & Spokoiny, 2011). However, it is important to note that this does not mean that larger neural networks will perform worse than smaller networks when optimized using ES: what matters is the difficulty, or intrinsic dimension, of the optimization problem. To see that the dimensionality of our model can be completely separate from the effective dimension of the optimization problem, consider a regression problem where we approximate a univariate variable $y$ with a linear model $\hat{y} = x \cdot w$: if we double the number of features and parameters in this model by concatenating $x$ with itself (i.e. using features $x' = (x, x)$), the problem does not become more difficult. In fact, the ES algorithm will do exactly the same thing when applied to this higher dimensional problem, as long as we divide the standard deviation of the noise by two, as well as the learning rate.

In practice, we observe slightly better results when using larger networks with ES. For example, we tried both the larger network and the smaller network used in A3C (Mnih et al., 2016) for learning Atari 2600 games, and on average obtained better results using the larger network. We hypothesize that this is due to the same effect that makes standard gradient-based optimization of large neural networks easier than of small ones: large networks have fewer local minima (Kawaguchi, 2016).

3.3. Advantages of not calculating gradients

In addition to being easy to parallelize, and to having an advantage in cases with long action sequences and delayed rewards, black box optimization algorithms like ES have other advantages over reinforcement learning techniques that calculate gradients.

The communication overhead of implementing ES in a distributed setting is lower than for reinforcement learning methods such as policy gradients and Q-learning, as the only information that needs to be communicated across processes is the scalar return and the random seed that was used to generate the perturbations $\epsilon$, rather than a full gradient. ES can also deal with maximally sparse and delayed rewards: there is no need for the assumption that timing information is part of the reward signal.

By not requiring backpropagation, black box optimizers reduce the amount of computation per episode by about two thirds, and the memory requirements by potentially much more. In addition, not explicitly calculating an analytical gradient protects against the problems with exploding gradients that are common when working with recurrent neural networks. By smoothing the cost function in parameter space, we reduce the pathological curvature that causes these problems: bounded cost functions that are smooth enough cannot have exploding gradients. At the extreme, ES allows us to incorporate non-differentiable elements into our architecture, such as modules that use hard attention (Xu et al., 2015).

Black box optimization methods are uniquely suited to capitalize on advances in low precision hardware for deep learning. Low precision arithmetic, such as in binary neural networks, can be performed much more cheaply than high precision arithmetic. When optimizing such low precision architectures, biased low precision gradient estimates can be a problem for gradient-based methods. Similarly, specialized hardware for neural network inference, such as TPUs, can be used directly when performing optimization using ES, while their limited memory usually makes backpropagation impossible.

By perturbing in parameter space instead of action space, black box optimizers are naturally invariant to the frequency at which our agent acts in the environment. For reinforcement learning, on the other hand, it is well known that frameskip is a crucial parameter to get right for the optimization to succeed (Braylan et al., 2000). While this is usually a solvable problem for games that only require short-term planning and action, it is a problem for learning longer term strategic behavior.
For these problems, RL may need to act on every frame to perform well in the environment. If, on the other hand, the frameskip is set too low, the effective time length of the episode increases too much, which deteriorates RL performance as analyzed in section 3.1. An advantage of ES as a black box optimizer is that its gradient estimate is invariant to the length of the episode, which makes it much more robust to the action frequency. We demonstrate this by running the Atari game Pong using a frameskip parameter in {1, 2, 3, 4}. As can be seen in Figure 2, the learning curves for each setting indeed look very similar.

Figure 1. Time to reach a score of 6000 on 3D Humanoid with different numbers of CPU cores. Experiments are repeated 7 times and the median time is reported.

Figure 2. Learning curves for Pong using varying frame-skip parameters. Although performance is stochastic, each setting leads to about equally fast learning, with each run converging in around 100 weight updates.

Table 2. Ratio of timesteps needed by TRPO without variance reduction to reach various fractions of the learning progress of TRPO with variance reduction at 5 million timesteps. (∞ means the performance of TRPO with variance reduction was never reached.)

Environment               25%    50%    75%    100%
HalfCheetah               3.76   4.19   3.12   2.85
Hopper                    0.66   1.03   2.58   4.25
InvertedDoublePendulum    1.18   1.26   1.97   ∞
InvertedPendulum          0.96   0.99   0.99   1.92
Swimmer                   0.20   0.20   0.22   0.14
Walker2d                  1.86   3.81   5.28   7.75
…analyzed in the convex setting, e.g., by Duchi et al. (2015). Nesterov (2012) has analyzed randomized block coordinate descent, which takes a gradient step along a randomly chosen dimension of the parameter vector and which is almost equivalent to evolution strategies with axis-aligned noise. Nesterov obtains a surprising result: there is a setting in which randomized block coordinate descent outperforms the deterministic, noiseless gradient estimate in terms of its rate of convergence.

HyperNEAT (Stanley et al., 2009) is an alternative approach to evolving both the weights of the neural networks and their parameters, an approach we believe to be worth revisiting given our promising results with scaling up ES.

The primary difference between prior work and ours is the results: we have been able to show that, when applied correctly, ES is competitive with current RL algorithms in terms of performance on the hardest problems solvable today, and is surprisingly close in terms of data efficiency, while being extremely parallelizable.
References

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016a.

Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L, Sutskever, Ilya, and Abbeel, Pieter. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016b.

Duchi, John C, Jordan, Michael I, Wainwright, Martin J, and Wibisono, Andre. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

Koutník, Jan, Cuccu, Giuseppe, Schmidhuber, Jürgen, and Gomez, Faustino. Evolving large-scale neural networks for vision-based reinforcement learning. In Proceedings of the 15th annual conference on Genetic and evolutionary computation, pp. 1061–1068. ACM, 2013.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nesterov, Yurii and Spokoiny, Vladimir. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pp. 1–40, 2011.

Parr, Ronald and Russell, Stuart. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pp. 1043–1049, 1998.

Rechenberg, I. and Eigen, M. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.

Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2226–2234, 2016.

Schaul, Tom, Glasmachers, Tobias, and Schmidhuber, Jürgen. High dimensions and heavy tails for natural evolution strategies. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp. 845–852. ACM, 2011.

Schmidhuber, Jürgen, Wierstra, Daan, Gagliolo, Matteo, and Gomez, Faustino. Training recurrent networks by Evolino. Neural Computation, 19(3):757–779, 2007.

Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael I, and Moritz, Philipp. Trust region policy optimization. In ICML, pp. 1889–1897, 2015a.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Schwefel, H.-P. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. 1977.

Sehnke, Frank, Osendorfer, Christian, Rückstieß, Thomas, Graves, Alex, Peters, Jan, and Schmidhuber, Jürgen. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Spall, James C. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

Srivastava, Rupesh Kumar, Schmidhuber, Jürgen, and Gomez, Faustino. Generalized compressed network search. In International Conference on Parallel Problem Solving from Nature, pp. 337–346. Springer, 2012.

Stanley, Kenneth O, D'Ambrosio, David B, and Gauci, Jason. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.

Stulp, Freek and Sigaud, Olivier. Policy improvement methods: Between black-box optimization and episodic reinforcement learning. 2012.

Sun, Yi, Wierstra, Daan, Schaul, Tom, and Schmidhuber, Jürgen. Efficient natural evolution strategies. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pp. 539–546. ACM, 2009.

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

Wierstra, Daan, Schaul, Tom, Peters, Jan, and Schmidhuber, Jürgen. Natural evolution strategies. In Evolutionary Computation, 2008 (CEC 2008, IEEE World Congress on Computational Intelligence), pp. 3381–3387. IEEE, 2008.

Wierstra, Daan, Schaul, Tom, Glasmachers, Tobias, Sun, Yi, Peters, Jan, and Schmidhuber, Jürgen. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.
Table 3. Final results obtained using Evolution Strategies on Atari 2600 games (feedforward CNN policy, deterministic policy evaluation, averaged over 10 re-runs with up to 30 random initial no-ops), compared to results for DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016).
Table 4. MuJoCo tasks: ratio of ES timesteps to TRPO timesteps needed to reach various percentages of TRPO's learning progress at 5 million timesteps. These results were computed from ES learning curves averaged over 6 reruns.

Environment               % of TRPO final score   TRPO score   TRPO timesteps   ES timesteps   ES / TRPO timesteps
InvertedDoublePendulum    25%                     2358.98      8.73e+05         3.98e+05       0.46
InvertedDoublePendulum    50%                     4609.68      9.65e+05         4.66e+05       0.48
InvertedDoublePendulum    75%                     6874.03      1.07e+06         5.30e+05       0.49
InvertedDoublePendulum    100%                    9104.07      4.39e+06         5.39e+06       1.23
Table 5. TRPO scores on MuJoCo tasks with and without variance reduction (discounting and value functions). These results were computed from learning curves averaged over 3 reruns.

Environment               % of TRPO final score   TRPO score   TRPO (with VR) timesteps   TRPO (without VR) timesteps   Ratio
InvertedDoublePendulum    25%                     2358.98      8.73e+05                   1.03e+06                      1.18
InvertedDoublePendulum    50%                     4609.68      9.65e+05                   1.22e+06                      1.26
InvertedDoublePendulum    75%                     6874.03      1.07e+06                   2.12e+06                      1.97
InvertedDoublePendulum    100%                    9104.07      4.39e+06                                                 ∞