Documente Academic
Documente Profesional
Documente Cultură
Marin Bukov
Department of Physics,
arXiv:1803.08823v1 [physics.comp-ph] 23 Mar 2018
University of California,
Berkeley, CA 94720,
USA
Charles K. Fisher
Unlearn.AI, San Francisco,
CA 94108
David J. Schwab
Initiative for the Theoretical Sciences,
The Graduate Center,
City University of New York,
365 Fifth Ave., New York,
NY 10016
Machine Learning (ML) is one of the most exciting and dynamic areas of modern research
and application. The purpose of this review is to provide an introduction to the core
concepts and tools of machine learning in a manner easily understood and intuitive
to physicists. The review begins by covering fundamental concepts in ML and modern
statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization
before moving on to more advanced topics in both supervised and unsupervised learning.
Topics covered in the review include ensemble models, deep learning and neural networks,
clustering and data visualization, energy-based models (including MaxEnt models and
Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize
the many natural connections between ML and statistical physics. A notable aspect of
the review is the use of Jupyter notebooks to introduce modern ML/statistical packages
to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations
of supersymmetric decays of proton-proton collisions). We conclude with an extended
outlook discussing possible uses of machine learning for furthering our understanding
of the physical world as well as open problems in ML where physicists maybe able to
contribute.
A. Research at the intersection of physics and ML 107 ity with several topics found in graduate physics curricula
B. Topics not covered in review 108 (partition functions, statistical mechanics) and a fluency
C. Rebranding Machine Learning as “Artificial
Intelligence” 109
in mathematical techniques such as linear algebra, multi-
D. Social Implications of Machine Learning 109 variate calculus, variational methods, probability theory,
and Monte-Carlo methods. It also assumes a familiar-
XIX. Acknowledgments 110 ity with basic computer programming and algorithmic
A. An overview of the datasets used in review 110 design.
1. Ising dataset 110
2. SUSY dataset 110
3. MNIST Dataset 111
A. What is Machine Learning?
References 111
The last three decades have seen an unprecedented in- Any review on ML must simultaneously accomplish
crease in our ability to generate and analyze large data two related but distinct goals. First, it must convey the
sets. This “big data” revolution has been spurred by an rich theoretical foundations underlying modern ML. This
exponential increase in computing power and memory task is made especially difficult because ML is very broad
commonly known as Moore’s law. Computations that and interdisciplinary, drawing on ideas and intuitions
were unthinkable a few decades ago can now be routinely from many fields including statistics, computational neu-
performed on laptops. Specialized computing machines roscience, and physics. Unfortunately, this means mak-
(such as GPU-based machines) are continuing this trend ing choices about what theoretical ideas to include in the
towards cheap, large-scale computation, suggesting that review. This review emphasizes connections with sta-
the “big data” revolution is here to stay. tistical physics, physics-inspired Bayesian inference, and
This increase in our computational ability has been ac- computational neuroscience models. Thus, certain ideas
companied by new techniques for analyzing and learning (e.g., gradient descent, expectation maximization, varia-
from large datasets. These techniques draw heavily from tional methods, and deep learning and neural networks)
ideas in statistics, computational neuroscience, computer are covered extensively, while other important ideas are
science, and physics. Similar to physics, modern ML given less attention or even omitted entirely (e.g., statis-
places a premium on empirical results and intuition over tical learning, support vector machines, kernel methods,
the more formal treatments common in statistics, com- Gaussian processes). Second, any ML review must give
puter science, and mathematics. This is not to say that the reader the practical know-how to start using the tools
proofs are not important or undesirable. Rather, many and concepts of ML for practical problems. To accom-
of the advances of the last two decades – especially in plish this, we have written a series of Jupyter notebooks
fields like deep learning – do not have formal justifica- to accompany this review. The notebooks introduce the
tions (much like there still exists no mathematically well- nuts-and-bolts of how to use, code, and implement the
defined concept of the Feynman path-integral in d > 1). methods introduced in the main text. Luckily, there
are numerous great ML software packages available in
Physicists are uniquely situated to benefit from and Python (scikit-learn, tensorflow, Pytorch, Keras) and we
contribute to ML. Many of the core concepts and tech- have made extensive use of them. We have also made use
niques used in ML – such as Monte-Carlo methods, simu- of a new package, Paysage, for energy-based generative
lated annealing, variational methods – have their origins models which has been co-developed by one of the au-
in physics. Moreover, “energy-based models” inspired by thors (CKF) and maintained by Unlearn.AI (a company
statistical physics are the backbone of many deep learn- affiliated with two of the authors: CKF and PM). The
ing methods. For these reasons, there is much in modern purpose of the notebooks is to both familiarize physicists
ML that will be familiar to physicists. with these resources and to serve as a starting point for
Physicists and astronomers have also been at the fore- experimenting and playing with ideas.
front of using “big data”. For example, experiments such ML can be divided into three broad categories: super-
as CMS and ATLAS at the LHC generate petabytes of vised learning, unsupervised learning, and reinforcement
data per year. In astronomy, projects such as the Sloan learning. Supervised learning concerns learning from la-
Digital Sky Survey (SDSS) routinely analyze and release beled data (for example, a collection of pictures labeled
hundreds of terabytes of data measuring the properties as containing a cat or not containing a cat). Common
of near a billion stars and galaxies. Researchers in these supervised learning tasks include classification and re-
fields are increasingly incorporating recent advances in gression. Unsupervised learning is concerned with find-
ML and data science, and this trend is likely to acceler- ing patterns and structure in unlabeled data. Examples
ate in the future. of unsupervised learning include clustering, dimensional-
Besides applications to physics, part of the goal of this ity reduction, and generative modeling. Finally, in rein-
review is to serve as an introductory resource for those forcement learning an agent learns by interacting with an
looking to transition to more industry-oriented projects. environment and changing its behavior to maximize its
Physicists have already made many important contribu- reward. For example, a robot can be trained to navigate
tions to modern big data applications in an industrial in a complex environment by assigning a high reward to
setting (Metz, 2017). Data scientists and ML engineers actions that help the robot reach a desired destination.
in industry use concepts and tools developed for ML to We refer the interested reader to the classic book by Sut-
gain insight from large datasets. A familiarity with ML ton and Barto Reinforcement Learning: an Introduction
is a prerequisite for many of the most exciting employ- (Sutton and Barto, 1998). While useful, the distinction
ment opportunities in the field, and we hope this review between the three types of ML is sometimes fuzzy and
will serve as a useful introduction to ML for physicists fluid, and many applications often combine them in novel
beyond an academic setting. and interesting ways. For example, the recent success
5
of Google DeepMind in developing ML algorithms that vor of present-day ML problems. By re-analyzing the
excel at tasks such as playing Go and video games em- same datasets with multiple techniques, we hope readers
ploy deep reinforcement learning, combining reinforce- will be able to get a sense of the various, inevitable trade-
ment learning with supervised learning methods based offs involved in choosing how to analyze data. Certain
on deep neural networks. techniques work better when data is limited while others
Here, we limit our focus to supervised and unsuper- may be better suited to large data sets with many fea-
vised learning. The literature on reinforcement learning tures. A short description of these datasets are given in
is extensive and uses ideas and concepts that, to a large the Appendix.
degree, are distinct from supervised and unsupervised This review draws generously on many wonderful text-
learning tasks. For this reason, to ensure cohesiveness books on ML and we encourage the reader to con-
and limit the length of this review, we have chosen not sult them for further information. They include Abu
to discuss reinforcement learning. However, this omis- Mostafa’s masterful Learning from Data, which intro-
sion should not be mistaken for a value judgement on duces the basic concepts of statistical learning theory
the utility of reinforcement learning for solving physical (Abu-Mostafa et al., 2012), the more advanced but
problems. For example, some of the authors have used equally good The Elements of Statistical Learning by
inspiration from reinforcement learning to tackle difficult Hastie, Tibshirani, and Friedman (Friedman et al., 2001),
problems in quantum control (Bukov et al., 2017). Michael Neilsen’s indispensable Neural Networks and
In writing this review, we have tried to adopt a style Deep Learning which serves as a wonderful introduction
that reflects what we consider to be the best of the to the neural networks and deep learning (Nielsen, 2015)
physics tradition. Physicists understand the importance and David MacKay’s outstanding Information Theory,
of well-chosen examples for furthering our understand- Inference, and Learning Algorithms which introduced
ing. It is hard to imagine a graduate course in statistical Bayesian inference and information theory to a whole
physics without the Ising model. Each new concept that generation of physicists (MacKay, 2003). More com-
is introduced in statistical physics (mean-field theory, prehensive (and much longer) books on modern ML
transfer matrix techniques, high- and low-temperature techniques include Christopher Bishop’s classic Pattern
expansions, the renormalization group, etc.) is applied to Recognition and Machine Learning (Christopher, 2016)
the Ising model. This allows for the progressive building and the more recently published Machine Learning: A
of intuition and ultimately a coherent picture of statisti- Probabilistic Perspective by Kevin Murphy (Murphy,
cal physics. We have tried to replicate this pedagogical 2012). Finally, one of the great successes of modern ML is
approach in this review by focusing on a few well-chosen deep learning, and some of the pioneers of this field have
techniques – linear and logistic regression in the case of written a textbook for students and researchers entitled
supervised learning and clustering in the case of unsu- Deep Learning (Goodfellow et al., 2016). In addition to
pervised learning – to introduce the major theoretical these textbooks, we have consulted numerous research
concepts. papers, reviews, and web resources. Whenever possible,
In this same spirit, we have chosen three interest- we have tried to point the reader to key papers and other
ing datasets with which to illustrate the various algo- references that we have found useful in preparing this re-
rithms discussed here. (i) The SUSY data set consists view. However, we are neither capable of nor have we
of 5, 000, 000 Monte-Carlo samples of proton-proton col- made any effort to make a comprehensive review of the
lisions decaying to either signal or background processes, literature.
which are both parametrized with 18 features. The sig- The review is organized as follows. We begin by
nal process is the production of electrically-charged su- introducing polynomial regression as a simple example
persymmetric particles, which decay to W bosons and an that highlights many of the core ideas of ML. The next
electrically-neutral supersymmetric particle, invisible to few chapters introduce the language and major concepts
the detector, while the background processes are various needed to make these ideas more precise including tools
decays involving only Standard Model particles (Baldi from statistical learning theory such as overfitting, the
et al., 2014). (ii) The Ising data set consists of 104 bias-variance tradeoff, regularization, and the basics of
states of the 2D Ising model on a 40 × 40 square lat- Bayesian inference. The next chapter builds on these
tice, obtained using Monte-Carlo (MC) sampling at a examples to discuss stochastic gradient descent and its
few fixed temperatures T . (iii) The MNIST dataset com- generalizations. We then apply these concepts to linear
prises 70000 handwritten digits, each of which comes in and logistic regression, followed by a detour to discuss
a square image, divided into a 28 × 28 pixel grid. The how we can combine multiple statistical techniques to
first two datasets were chosen to reflect the various sub- improve supervised learning, introducing bagging, boost-
disciplines of physics (high-energy experiment, condensed ing, random forests, and XG Boost. These ideas, though
matter) where we foresee techniques from ML becom- fairly technical, lie at the root of many of the advances
ing an increasingly important tool for research. The in ML over the last decade. The review continues with
MNIST dataset, on the other hand, introduces the fla- a thorough discussion of supervised deep learning and
6
neural networks, as well as convolutional nets. We then One of the most important observations we can make
turn our focus to unsupervised learning. We start with is that the out-of-sample error is almost always greater
data visualization and dimensionality reduction before than the in-sample error, i.e. Eout ≥ Ein . We explore
proceeding to a detailed treatment of clustering. Our this point further in Sec. VI and its accompanying note-
discussion of clustering naturally leads to an examina- book. Splitting the data into mutually exclusive train-
tion of variational methods and their close relationship ing and test sets provides an unbiased estimate for the
with mean-field theory. The review continues with a predictive performance of the model – this is known as
discussion of deep unsupervised learning, focusing on cross-validation in the ML and statistics literature. In
energy-based models, such as Restricted Boltzmann Ma- many applications of classical statistics, we start with a
chines (RBMs) and Deep Boltzmann Machines (DBMs). mathematical model that we assume to be true (e.g., we
Then we discuss two new and extremely popular model- may assume that Hooke’s law is true if we are observing
ing frameworks for unsupervised learning, generative ad- a mass-spring system) and our goal is to estimate the
versarial networks (GANs) and variational autoencoders value of some unknown model parameters (e.g., we do
(VAEs). We conclude the review with an outlook and not know the value of the spring stiffness). Problems in
discussion of promising research directions at the inter- ML, by contrast, typically involve inference about com-
section physics and ML. plex systems where we do not know the exact form of the
mathematical model that describes the system. There-
fore, it is not uncommon for ML researchers to have mul-
II. WHY IS MACHINE LEARNING DIFFICULT? tiple candidate models that need to be compared. This
comparison is usually done using Eout and the model that
A. Setting up a problem in ML and data science minimizes this out-of-sample error is chosen as the best
model (i.e. model selection). Note that once we select
Almost every problem in ML and data science starts the best model on the basis of its performance on Eout ,
with the same ingredients. The first ingredient is the the real-world performance of the winning model should
dataset X. The second is the model g(w), which is a be expected to be slightly worse because the test data
function of the parameters w. The final ingredient is the was now used in the fitting procedure.
cost function C(X, g(w)) that allows us to judge how well
the model g(w) explains, or in general performs on, the
observations X. The model is fit by finding the value of B. Polynomial Regression
w that minimizes the cost function. For example, one
commonly used cost function is the squared error. Min- In the previous section, we mentioned that multiple
imizing the squared error cost function is known as the candidate models are typically compared using the out-
method of least squares, and is typically appropriate for of-sample error Eout . It may be at first surprising that
experiments with Gaussian measurement errors. the model that has the lowest out-of-sample error Eout
ML researchers and data scientists follow a standard usually does not have the lowest in-sample error Ein .
recipe to obtain models that are useful for prediction Therefore, if our goal is to obtain a model that is use-
problems. We will see why this is necessary in the fol- ful for prediction we do not want to choose the model
lowing sections, but it is useful to present the recipe up that provides the best explanation for the current obser-
front to provide context. The first step in the analysis vations. At first glance, the observation that the model
is to randomly divide the dataset X into two mutually providing the best explanation for the current dataset
exclusive groups Xtrain and Xtest called the training and probably will not provide the best explanation for future
test sets. The fact that this must be the first step should datasets is very counter-intuitive.
be heavily emphasized – performing some analysis (such Moreover, the discrepancy between Ein and Eout be-
as using the data to select important variables) before comes more and more important, as the complexity of our
partitioning the data is a common pitfall that can lead data, and the models we use to make predictions, grows.
to incorrect conclusions. Typically, the majority of the As the number of parameters in the model increases,
data are partitioned into the training set (e.g., 90%) with we are forced to work in high-dimensional spaces. The
the remainder going into the test set. The model is fit by “curse of dimensionality” ensures that many phenomena
minimizing the cost function using only the data in the that are absent or rare in low-dimensional spaces become
training set ŵ = argminw {C(Xtrain , g(w))}. Finally, the generic. For example, the nature of distance changes in
performance of the model is evaluated by computing the high dimensions, as evidenced in the derivation of the
cost function using the test set C(Xtest , g(ŵ)). The value Maxwell distribution in statistical physics where the fact
of the cost function for the best fit model on the training that all the volume of a d-dimensional sphere of radius
set is called the in-sample error Ein = C(Xtrain , g(ŵ)) r is contained in a small spherical shell around r is ex-
and the value of the cost function on the test set is called ploited. Almost all critical points of a function (i.e., the
the out-of-sample error Eout = C(Xtest , g(ŵ)). points where all derivatives vanish) are saddles rather
7
y
Training 1.0
−2 Linear
Poly 3 0.5
−4 Poly 10
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.25 0.50 0.75 1.00 1.25
x x
Training
−2 Linear 20
Poly 3
−4 Poly 10
0
0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.25 0.50 0.75 1.00 1.25
x x
FIG. 1 Fitting versus predicting for noiseless data. Ntrain = 10 points in the range x ∈ [0, 1] were generated from a
linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all
polynomials of order 3 (yellow), all polynomials of order 10 (green) and used to make prediction on Ntest = 20 new data points
with xtest ∈ [0, 1.2] (shown on right). Notice that in the absence of noise (σ = 0), given enough data points that fitting and
predicting are identical.
y
Training
−2 Linear 0
Poly 3
−5
−4 Poly 10
−10
0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.25 0.50 0.75 1.00 1.25
x x
Training
−2 Linear 0
Poly 3
−5
−4 Poly 10
−10
0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.25 0.50 0.75 1.00 1.25
x x
FIG. 2 Fitting versus predicting for noisy data. Ntrain = 100 noisy data points (σ = 1) in the range x ∈ [0, 1] were
generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear
models (red), all polynomials of order 3 (yellow), all polynomials of order 10 (green) and used to make prediction on Ntest = 20
new data points with xtest ∈ [0, 1.2](shown on right). Notice that even when the data was generated using a tenth order
polynomial, the linear and third order polynomials give better out-of-sample predictions, especially beyond the x range over
which the model was trained.
We will refer to the f (xi ) as the function used to generate that these three model classes contain different number
the data, and σ as the noise strength. The larger σ is the of parameters. Whereas g1 (x; w1 ) has only two parame-
noisier the data; σ = 0 corresponds to the noiseless case. ters (the coefficients of the zeroth and first order terms
in the polynomial), g3 (x; w3 ) and g10 (x; w10 ) have four
To make predictions, we will consider a family of func- and eleven parameters, respectively. This reflects the fact
tions gα (x; wα ) that depend on some parameters wα . that these three models have different model complexities.
These functions represent the model class that we are us- If we think of each term in the polynomial as a “feature”
ing to model the data and make predictions. Note that in our model, then increasing the order of the polyno-
we choose the model class without knowing the function mial we fit increases the number of features. Using a
f (x). The gα (x; wα ) encode the features we choose to more complex model class may give us better predictive
represent the data. In the case of polynomial regression power, but only if we have enough statistical power to
we will consider three different model classes: (i) all poly- accurately learn the model parameters associated with
nomials of order 1 which we denote by g1 (x; w1 ), (ii) all these extra features from the training dataset.
polynomials up to order 3 which we denote by g3 (x; w3 ),
and (iii) all polynomials of order 10, g10 (x; w10 ). Notice To learn the parameters wα , we will train our models
9
y
for a simple task like polynomial regression. Training
To illustrate these ideas, we encourage the reader to
experiment with the accompanying notebook to gener- −2 Linear
ate data using a linear function f (x) = 2x and a tenth Poly 3
order polynomial f (x) = 2x − 10x5 + 15x10 and ask −4 Poly 10
how the size of the training dataset Ntrain and the noise
strength σ affect the ability to make predictions. Obvi- 0.0 0.2 0.4 0.6 0.8 1.0
ously, more data and less noise leads to better predic- x
tions. To train the models (linear, third-order, tenth-
order), we uniformly sampled the interval x ∈ [0, 1] and
N =10000, σ =1 (pred.)
20
constructed Ntrain training examples using (1). We then Test
fit the models on these training samples using standard 15 linear
least-squares regression. To visualize the performance of
the three models, we plot the predictions using the best 10 3rd order
fit parameters for a test set where x are drawn uniformly 10th order
from the interval x ∈ [0, 1.2]. Notice that the test interval 5
y
is slightly larger than the training interval.
Figure 1 shows the results of this procedure for the 0
noiseless case, σ = 0. Even using a small training set
with Ntrain = 10 examples, we find that the model class −5
that generated the data also provides the best fit and the
most accurate out-of-sample predictions. That is, the
−10
0.00 0.25 0.50 0.75 1.00 1.25
linear model performs the best for data generated from a x
linear polynomial (the third and tenth order polynomials
perform similarly), and the tenth order model performs
the best for data generated from a tenth order polyno-
mial. While this may be expected, the results are quite
different for larger noise strengths.
Figure 2 shows the results of the same procedure for
noisy data, σ = 1, and a larger training set, Ntrain = 100.
As in the noiseless case, the tenth order model provides FIG. 3 Fitting versus predicting for noisy data.
the best fit to the data (i.e., the lowest Ein ). In contrast, Ntrain = 104 noisy data points (σ = 1) in the range x ∈ [0, 1]
the tenth order model now makes the worst out-of-sample were generated from a tenth-order polynomial. This data was
predictions (i.e., the highest Eout ). Remarkably, this is fit using three model classes: linear models (red), all polyno-
mials of order 3 (yellow), all polynomials of order 10 (green)
true even if the data were generated using a tenth order
and used to make prediction on Ntest = 100 new data points
polynomial. with xtest ∈ [0, 1.2](shown on right). The tenth order polyno-
At small sample sizes, noise can create fluctuations in mial gives good predictions but the model’s predictive power
the data that look like genuine patterns. Simple mod- quickly degrades beyond the training data range.
els (like a linear function) cannot represent complicated
patterns in the data, so they are forced to ignore the
fluctuations and to focus on the larger trends. Complex use less expressive models with fewer parameters, or we
models with many parameters, such as the tenth order can collect more data so that the likelihood that the noise
polynomial in our example, can capture both the global appears patterned decreases. Indeed, when we increase
trends and noise-generates patterns at the same time. In the size of the training data set by two orders of mag-
this case, the model can be tricked into thinking that the nitude to Ntrain = 104 (see Figure 3) the tenth order
noise encodes real information. This problem is called polynomial clearly gives both the best fits and the most
“overfitting” and leads to a steep drop-off in predictive predictive power over the entire training range x ∈ [0, 1],
performance. and even slightly beyond to approximately x ≈ 1.05.
We can guard against overfitting in two ways: we can This is our first experience with what is known as the
10
bias-variance tradeoff, c.f. Sec. III.B. When the amount f (x) as best as possible, namely, we would like to find
of training data is limited as it is when Ntrain = 100, h ∈ H such that h ≈ f in some strict mathematical
one can often get better predictive performance by using sense which we specify below. If this is possible, we say
a less expressive model (e.g., a lower order polynomial) that we learned f (x). But if the function f (x) can, in
rather than the more complex model (e.g., the tenth- principle, take any value on unobserved inputs, how is it
order polynomial). The simpler model has more “bias” possible to learn in any meaningful sense?
but is less dependent on the particular realization of the The answer is that learning is possible in the restricted
training dataset, i.e. less “variance”. Finally we note that sense that the fitted model will probably perform approx-
even with ten thousand data points, the model’s perfor- imately as well on new data as it did on the training data.
mance quickly degrades beyond the original training data Once an appropriate error function E is chosen for the
range. This demonstrates the difficulty of predicting be- problem under consideration (e.g. sum of squared errors
yond the training data we mentioned earlier. in linear regression), we can define two distinct perfor-
This simple example highlights why ML is so difficult mance measures of interest. The in-sample error, Ein ,
and holds some universal lessons that we will encounter quantifies performance of a particular hypothesis on the
repeatedly in this review: training data we used to fit the model. The out-of-sample
or generalization error, Eout is the performance on new
• Fitting is not predicting. Fitting existing data well
data. Recall that our goal is to make the out-of-sample
is fundamentally different from making predictions
error, Eout , as small as possible. However, when we are
about new data.
training our model, we only have access to the in-sample
• Using a complex model can result in overfitting. In- error, Ein , on the training data. Note that these two er-
creasing a model’s complexity (i.e number of fitting rors are generally distinct because finite-size effects stem-
parameters) will usually yield better results on the ming from sampling noise ensure that Ein will not be the
training data. However when the training data size same as Eout , and the parameters are tuned to minimize
is small and the data are noisy, this results in over- Ein . This is precisely the distinction between fitting and
fitting and can substantially degrade the predictive predicting introduced in Sec II.
performance of the model. This raises a natural question: Can we say something
general about the relationship between Ein and Eout ?
• For complex datasets and small training sets, sim- Surprisingly, the answer is ‘Yes’. We can in fact say
ple models can be better at prediction than com- quite a bit. This is the domain of statistical learning
plex models due to the bias-variance tradeoff. It theory, and we give a brief overview of the main results
takes less data to train a simple model than a com- in this section. Our goal is to briefly introduce some
plex one. Therefore, even though the correct model of the major ideas from statistical learning theory be-
is guaranteed to have better predictive performance cause of the important role they have played in shaping
for an infinite amount of training data (less bias), how we think about machine learning. However, this is
the training errors stemming from finite-size sam- a highly technical and theoretical field, so we will just
pling (variance) can cause simpler models to out- briefly skim over some introductory topics. An in-depth
perform the more complex model when sampling is and more thorough introduction to statistical learning
limited. theory can be found in the introductory textbook by Abu
• It is difficult to generalize beyond the situations Mustafa (Abu-Mostafa et al., 2012).
encountered in the training data set.
In this section, we briefly summarize and discuss the The basic intuitions of statistical learning can be sum-
sense in which learning is possible, with a focus on su- marized in three simple schematics. The first schematic,
pervised learning. We begin with an unknown function shown in Figure 4, shows the typical out-of-sample er-
y = f (x) and fix a hypothesis set H consisting of all func- ror, Eout , and in-sample error, Ein , as a function of the
tions we are willing to consider, defined also on the do- amount of training data. In making this graph, we have
main of f . This set may be uncountably infinite (e.g. if assumed that the true data is drawn from a sufficiently
there are real-valued parameters to fit). The choice of complicated distribution, so that we cannot exactly learn
which functions to include in H usually depends on our the function f (x). Hence, after a quick initial drop (not
intuition about the problem of interest. The function shown in figure), the in-sample error will increase with
f (x) produces a set of pairs (xi , yi ), i = 1 . . . N , which the number of data points, since our models are not pow-
serve as the observable data. Our goal is to select a func- erful enough to learn the true function we are seeking to
tion from the hypothesis set h ∈ H which approximates approximate. In contrast, the out-of-sample error will de-
11
Optimum
E out
E out
Error
Variance
Variance
{ }
Error
}
Bias
Bias
Model Complexity
E in
B. Bias-Variance Decomposition
High variance, x x x x In this section, we dig further into the central princi-
low-bias model ple that underlies much of machine learning: the bias-
x x x x
x variance tradeoff. Roughly speaking, this says that there
x x x True model
x x x x
is a tradeoff between how expressive our model class is
x x x x x and how sensitive the fitted model is to sample fluctu-
x x x x x ations in the training data. That is, the more (less)
x x xx x x expressive the model, the larger (smaller) the fluctua-
x x x
x x x tions. Less expressive models exhibit bias in what they
x Low variance, are able to fit, and thus there is a tradeoff between the
xx x x x x high-bias model bias and the variance of the fitted model. Oftentimes in
physics, we are mostly concerned with expressivity, e.g.
whether the true ground state wavefunction can be well-
approximated by a class of variational wavefunctions such
FIG. 6 Bias-Variance tradeoff. Another useful depiction as a matrix product state. In the learning context, there
of the bias-variance tradeoff is to think about how Eout varies is the additional challenge of finding the best variational
as we consider different training data sets of a fixed size. A state with finite sampling. We will see that while this
more complex model (green) will exhibit larger fluctuations
concept is a generally useful heuristic to keep in mind,
(variance) due to finite size sampling effects than the sim-
pler model (black). However, the average over all the trained it is a mathematically precise statement when decompos-
models (bias) is closer to the true model for the more complex ing the squared error. Finally, we note that a better term
model. would be the bias-variance decomposition, as it is possible
to have high bias and high variance.
We will discuss the bias-variance tradeoff in the con-
text of continuous predictions such as regression. How-
ever, many of the intuitions and ideas discussed here also
carry over to classification tasks. Consider a dataset L
reason for this is that, even though using a more com- consisting of the data XL = {(yj , xj ), j = 1 . . . N }. Let
plicated model always reduces the bias, at some point us assume that the true data is generated from a noisy
the model becomes too complex for the amount of train- model
ing data and the generalization error becomes large due
y = f (x) + (2)
to high variance. Thus, to minimize Eout and maximize
our predictive power, it may be more suitable to use a where is normally distributed with mean zero and stan-
more biased model with small variance than a less-biased dard deviation σ .
model with large variance. This important concept is Assume that we have a statistical procedure (e.g. least-
commonly called the bias-variance tradeoff and gets at squares regression) for forming a predictor ĝL (x) that
the heart of why machine learning is difficult. gives the prediction of our model for a new data point x.
This estimator is chosen by minimizing a cost function
Another way to visualize the bias-variance tradeoff is which we take to be the squared error
shown in Figure 6. In this figure, we imagine training X
a complex model (shown in green) and a simpler model C(X, ĝ(x)) = (yi − ĝL (xi ))2 . (3)
i
(shown in black) many, many times on different training
sets of a fixed size N . Due to the sampling noise from We are interested in the generalization error on all data
having finite size data sets, the learned models will differ drawn from the true model, not just the error on the
for each choice of training sets. In general, more complex particular training dataset L that we have in hand. This
models need a larger amount of training data. For this is just the expectation of the cost function over many
reason, the fluctuations in the learned models (variance) different data sets {Lj }. Denote this expectation value
will be much larger for the more complex model than the by EL . In other words, we can view ĝL as a stochastic
simpler model. However, if we consider the asymptotic functional that depends on the dataset L and we can
performance as we increase the size of the training set think of EL as the expected value of the functional if we
(the bias), it is clear that the complex model will even- drew an infinite number of datasets {L1 , L2 , . . .}.
tually perform better than the simpler model. Thus, de- We would also like to average over different instances
pending on the amount of training data, it may be more of the “noise” and we denote the expectation value over
favorable to use a less complex, high-bias model to make the noise by E . Thus, we can decompose the expected
predictions. generalization error as
13
" #
X
EL, [C(X, ĝ(x))] = EL, (yi − ĝL (xi ))2
i
" #
X
= EL, (yi − f (xi ) + f (xi ) − ĝL (xi ))2
i
X
= E [(yi − f (xi ))2 ] + EL, [(f (xi ) − ĝL (xi ))2 ] + 2E [yi − f (xi )]EL [f (xi ) − ĝL (xi )]
i
X
= σ2 + EL [(f (xi ) − ĝL (xi ))2 ], (4)
i
where in the last line we used the fact that our noise has zero mean and variance σ2 and the sum over i applies to all
terms. It is also helpful to further decompose the second term as follows:
EL [(f (xi ) − ĝL (xi ))2 ] = EL [(f (xi ) − EL [ĝL (xi )] + EL [ĝL (xi )] − ĝL (xi ))2 ]
= EL [(f (xi ) − EL [ĝL (xi )])2 ] + EL [(ĝL (xi ) − EL [ĝL (xi )])2 ]
+ 2EL [(f (xi ) − EL [ĝL (xi )])(ĝL (xi ) − EL [ĝL (xi )])]
= (f (xi ) − EL [ĝL (xi )])2 + EL [(ĝL (xi ) − EL [ĝL (xi )])2 ]. (5)
The first term is called the bias which is a function of the parameters θ and a cost func-
X tion C(X, g(θ)) that allows us to judge how well the
Bias2 = (f (xi ) − EL [ĝL (xi )])2 (6) model g(θ) explains the observations X. The model is
i fit by finding the values of θ that minimize the cost func-
tion. In this section, we discuss one of the most powerful
and measures the deviation of the expectation value of
and widely used classes of methods for performing this
our estimator (i.e. the asymptotic value of our estimator
minimization – gradient descent and its generalizations.
in the infinite data limit) from the true value. The second
The basic idea behind these methods is straightforward:
term is called the variance
iteratively adjust the parameters in the direction where
the gradient of the cost function is large and negative.
X
V ar = EL [(ĝL (xi ) − EL [ĝL (xi )])2 ], (7)
i
In this way, the training procedure ensures the parame-
ters flow towards a local minimum of the cost function.
and measures how much our estimator fluctuates due to However, in practice gradient descent is full of surprises
finite-sample effects. Combining these expressions, we and a series of ingenious tricks have been developed by
see that the expected out-of-sample error of our model the optimization and machine learning communities to
can be decomposed as improve the performance of these algorithms.
The underlying reason why training a machine learn-
Eout = EL, [C(X, ĝ(x))] = Bias2 + V ar + N oise. (8)
ing algorithm is difficult is that the cost functions we
wish to optimize are usually complicated, rugged, non-
The bias-variance tradeoff summarizes the fundamen-
convex functions in a high-dimensional space with many
tal tension in machine learning, particularly supervised
local minima. To make things even more difficult, we
learning, between the complexity of a model and the
almost never have access to the true function we wish
amount of training data needed to train it. Since data
to minimize but, instead must estimate this function di-
is often limited, in practice it is often useful to use a
rectly from data. In modern applications, both the size
less-complex model with higher bias – a model whose
of the dataset and the number of parameters we wish to
asymptotic performance is worse than another model –
fit is often enormous (millions of parameters and exam-
because it is easier to train and less sensitive to sampling
ples). The goal of this chapter is to explain how gradient
noise arising from having a finite-sized training dataset
descent methods can be used to train machine learning
(smaller variance). This is the basic intuition behind the
algorithms even in these difficult settings.
schematics in Figs. 4, 5, and 6.
This chapter seeks to both introduce commonly used
methods and give intuition for why they work. We
IV. GRADIENT DESCENT AND ITS GENERALIZATIONS also include some practical tips for improving the per-
formance of stochastic gradient descent (Bottou, 2012;
Almost every problem in ML and data science starts LeCun et al., 1998b). To help the reader gain more in-
with the same ingredients: a dataset X, a model g(θ), tuition about gradient descent and its variants, we have
14
y
η =0.1
hyper-parameters can effect performance). The reader
η =0.5
may also wish to consult useful reviews that cover these −2.5
η =1
topics (Ruder, 2016) and this blog http://ruder.io/
optimizing-gradient-descent/. η =1.01
−5.0
−4 −2 0 2 4
x
A. Gradient Descent and Newton’s method
FIG. 7 Gradient descent exhibits three qualitatively
We begin by introducing a simple first-order gradient different regimes as a function of the learning rate.
descent method and comparing and contrasting it with Result of gradient descent on surface z = x2 + 10y 2 − 1 for
another algorithm, Newton’s method. Newton’s method learning rate of η = 0.1, 0.9, 1.01. Notice that the trajectory
converges to the global minima in multiple steps for small
is intimately related to many algorithms (conjugate gra-
learning rates (η = 0.1). Increasing the learning rate fur-
dient, quasi-Newton methods) commonly used in physics ther (η = 0.9) causes the trajectory to oscillate around the
for optimization problems. Denote the function we wish global minima before converging. For even larger learning
to minimize by E(θ). In the context of machine learning, rates (η = 1.01) the trajectory diverges from the minima. See
E(θ) is just the cost function E(θ) = C(X, g(θ)). As we corresponding notebook for details.
shall see for linear and logistic regression in Secs. VI, VII,
this energy function can almost always be written as a
sum over n data points, To better understand this behavior and highlight some
of the shortcomings of GD, it is useful to contrast GD
n
X with Newton’s method which is the inspiration for many
E(θ) = ei (xi , θ). (9) widely employed optimization methods. In Newton’s
i=1
method, we choose the step v for the parameters in such
For example, for linear regression ei is just the mean a way as to minimize a second-order Taylor expansion to
square-error for data point i whereas, for logistic regres- the energy function
sion, it is the cross-entropy. To make analogy with phys-
1
ical systems, we will often refer to this function as the E(θ + v) ≈ E(θ) + ∇θ E(θ)v + vT H(θ)v,
“energy”. 2
In the simplest gradient descent (GD) algorithm, we where H(θ) is the Hessian matrix of second derivatives.
iteratively update the parameters according to the fol- Differentiating this equation respect to v and noting that
lowing rule. Initialize the parameters to some value θ0 for the optimal value vopt we expect ∇θ E(θ + vopt ) = 0,
and iteratively update the parameters according to the yields the following equation
equation
0 = ∇θ E(θ) + H(θ)vopt . (11)
vt = ηt ∇θ E(θt ),
θt+1 = θt − vt (10) Rearranging this expression results in the desired update
rules for Newton’s method
where ∇θ E(θ) is the gradient of E(θ) w.r.t to θ and we
have introduced a learning rate, ηt , that controls how big vt = H −1 (θt )∇θ E(θt ) (12)
a step we should take in the direction of the gradient θt+1 = θt − vt . (13)
at time t. It is clear that for sufficiently small choice of
the learning rate ηt this methods will converge to a local Since we have no guarantee that the Hessian is well con-
minimum of the cost function. However, choosing a small ditioned, in almost all applications of Netwon’s method,
ηt comes at a huge computational cost. The smaller ηt , one replaces the inverse of the Hessian H −1 (θt ) by some
the more steps we have to take to reach the local mini- suitably regularized pseudo-inverse such as [H(θt )+I]−1
mum. In contrast, if ηt is too large, we can overshoot the with a small parameter (Battiti, 1992).
minimum and the algorithm becomes unstable (it either For the purposes of machine learning, Newton’s
oscillates or even moves away from the minimum). This method is not practical for two interrelated reasons.
is shown in Figure 7. In practice, one usually specifies First, calculating a Hessian is an extremely expensive
a “schedule” that decreases ηt at long times. Common numerical computation. Second, even if we employ first-
schedules include power law and exponential decays in order approximation methods to approximate the Hes-
time. sian (commonly called quasi-Newton methods), we must
15
store and invert a matrix with n2 entries, where n is the A E(θ) B E(θ)
number of parameters. For models with millions of pa-
η<ηopt η=ηopt
rameters such as those commonly employed in the neu-
ral network literature, this is close to impossible with
present-day computational power. Despite these practi-
cal shortcomings, Newton’s method gives many impor-
tant intuitions about how to modify GD algorithms to
improve their performance. Notice that, unlike in GD θ θ
where the learning rate is the same for all parameters, θmin θmin
D E(θ)
Newton’s method automatically “adapts” the learning C E(θ)
rate of different parameters depending on the Hessian η>ηopt η>2ηopt
matrix. Since the Hessian encodes the curvature of the
surface we are trying to find the minimum of – more
specifically, the singular values of the Hessian are in-
versely proportional to the squares of the local curvatures
of the surface – Newton’s method automatically adjusts θ θ
the step size so that one takes larger steps in flat di-
θmin θmin
rections with small curvature and smaller steps in steep
directions with large curvature. FIG. 8 Effect of learning rate on convergence. For a
Our derivation of Newton’s method also allows us to one dimensional quadratic potential, one can show that there
develop intuition about the role of the learning rate in exists four different qualitative behaviors for gradient descent
GD. Let’s first consider the special case of using GD to (GD) as a function of the learning rate η depending on the
find the minimum of a quadratic energy function of a sin- relationship between η and ηopt = [∂θ2 E(θ)]−1 . (a) For η <
ηopt , GD converges to the minimum. (b) For η = ηopt , GD
gle parameter θ (LeCun et al., 1998b). Given the current
converges in a single step. (c) For ηopt < η < 2ηopt , GD
value of our parameter θ, we can ask what is the optimal oscillates around the minima and eventually converges. (d)
choice of the learning rate ηopt , where ηopt is defined as For η > 2ηopt , GD moves away from the minima. This figure
the value of η that allows us to reach the minimum of the is adapted from (LeCun et al., 1998b).
quadratic energy function in a single step (see Figure 8).
To find ηopt , we expand the energy function to second
order around the current value introduction to SVD) and consider the singular values
1 {λ} of the Hessian. If we use a single learning rate for all
E(θ + v) = E(θc ) + ∂θ E(θ)v + ∂θ2 E(θ)v 2 . (14) parameters, in analogy with (16), convergence requires
2
that
Differentiating with respect to v and setting θmin = θ − v
2
yields η< , (17)
λmax
θmin = θ − [∂θ2 E(θ)]−1 ∂θ E(θ). (15)
where λmax is the largest singular value of the Hessian.
Comparing with (10) gives, If the minimum eigenvalue λmin differs significantly from
the largest value λmax , then convergence in the λmin -
ηopt = [∂θ2 E(θ)]−1 . (16) direction will be extremely slow! One can actually show
that the convergence time scales with the condition num-
One can show that there are four qualitatively different ber κ = λmax /λmin (LeCun et al., 1998b).
regimes possible (see Fig. 8) (LeCun et al., 1998b). If
η < ηopt , then GD will take multiple small steps to reach
the bottom of the potential. For η = ηopt , GD reaches B. Limitations of the simplest gradient descent algorithm
the bottom of the potential in a single step. If ηopt <
η < 2ηopt , then the GD algorithm will oscillate across The last section hints at some of the major shortcom-
both sides of the potential before eventually converging ings of the simple GD algorithm described in (10). Before
to the minima. However, when η > 2ηopt , the algorithm proceeding, we briefly summarize these limitations and
actually diverges! discuss general strategies for modifying GD to overcome
It is straightforward to generalize this to the multidi- these deficiencies.
mensional case. The natural multidimensional general-
ization of the second derivative is the Hessian H(θ). We • GD finds local minima of our function. Since the
can always perform a singular value decomposition (i.e. GD algorithm is deterministic, if it converges, it
a rotation by an orthogonal matrix for quadratic minima will converge to a local minimum of our energy
where the Hessian is symmetric, see Sec. VI.B for a brief function. Because in ML we are often dealing with
16
extremely rugged landscapes with many local min- us to keep track of not only the gradient but second
ima, this can lead to poor performance. A similar derivatives of the energy function (note as discussed
problem is encountered in physics. To overcome above, the ideal scenario would be to calculate the
this, physicists often use methods like simulated Hessian but this proves to be too computationally
annealing that introduce a fictitious “temperature” expensive).
which is eventually taken to zero. The “tempera-
ture” term introduces stochasticity in the form of • GD can take exponential time to escape saddle
thermal fluctuations that allow the algorithm to points, even with random initialization. As we men-
thermally tunnel over energy barriers. This sug- tioned, GD is extremely sensitive to initial condi-
gests that, in the context of ML, we should modify tion since it determines the particular local mini-
GD to include stochasticity. mum GD would eventually reach. However, even
with a good initialization scheme, through the in-
• GD is sensitive to initial conditions. One con- troduction of randomness to be introduced later,
sequence of the local nature of GD is that initial GD can still take exponential time to escape sad-
conditions matter. Depending on where one starts, dle points, which are prevalent in high-dimensional
one will end up at a different local minima. There- space, even for non-pathological objective functions
fore, it is very important to think about how one (Du et al., 2017). Indeed, there are modified GD
initializes the training process. This is true for GD methods developed recently to accelerate the es-
as well as more complicated variants of GD intro- cape. The details of these boosted method are be-
duced below. yond the scope of this review, and we refer avid
readers to (Jin et al., 2017) for details.
• Gradients are computationally expensive to calcu-
late for large datasets. In many cases in statistics In the next few subsections, we will introduce variants
and ML, the energy function is a sum of terms, of GD that address many of these shortcomings. These
with one term for each P data point. For example, in generalized gradient descent methods form the backbone
n
linear regression, E ∝ i=1 (yi − wT · xi )2 ; for lo- of much of modern deep learning and neural networks,
gistic regression, the square error is replaced by the see Sec IX. For this reason, the reader is encouraged to
cross entropy, see Secs. VI, VII. Thus, to calculate really experiment with different methods in landscapes
the gradient we have to sum over all n data points. of varying complexity using the accompanying notebook.
Doing this at every GD step becomes extremely
computationally expensive. An ingenious solution
to this, discussed below, is to calculate the gra- C. Stochasticity Gradient Descent (SGD) with mini-batches
dients using small subsets of the data called “mini
batches”. This has the added benefit of introducing One of the most widely-applied variants of the gra-
stochasticity into our algorithm. dient descent algorithm is stochastic gradient descent
(SGD)(Bottou, 2012; Williams and Hinton, 1986). As
• GD is very sensitive to choices of learning rates. As the name suggests, unlike ordinary GD, the algorithm
discussed above, GD is extremely sensitive to the is stochastic. Stochasticity is incorporated by approx-
choice of learning rates. If the learning rate is very imating the gradient on a subset of the data called a
small, the training process take an extremely long minibatch 2 . The size of the minibatches is almost always
time. For larger learning rates, GD can diverge and much smaller than the total number of data points n,
give poor results. Furthermore, depending on what with typical minibatch sizes ranging from ten to a few
the local landscape looks like, we have to modify hundred data points. If there are n points, and the mini-
the learning rates to ensure convergence. Ideally, batch size is M , there will be n/M minibatches. Let us
we would “adaptively” choose the learning rates to denote these minibatches by Bk where k = 1 . . . n/M .
match the landscape. Thus, in SGD, at each gradient descent step we approx-
imate the gradient using a minibatch Bk ,
• GD treats all directions in parameter space uni-
n
formly. Another major drawback of GD is that X X
unlike Newton’s method, the learning rate for GD ∇θ E(θ) = ∇θ ei (xi , θ) −→ ∇θ ei (xi , θ), (18)
is the same in all directions in parameter space. For i i∈Bk
cycling over all M minibatches. A full iteration over all Before proceeding further, let us try to get more in-
n data points – in other words using all M minibatches tuition from these equations. It is helpful to consider a
– is called an epoch. For notational convenience, we will simple physical analogy with a particle of mass m moving
denote the mini-batch approximation to the gradient by in a viscous medium with drag coefficient µ and potential
E(w) (Qian, 1999). If we denote the particle’s position
M
X by w, then its motion is described by
∇θ E M B (θ) = ∇θ ei (xi , θ). (19)
i∈Bk d2 w dw
m 2
+µ = −∇w E(w). (23)
dt dt
With this notation, we can rewrite the SGD algorithm as
We can discretize this equation in the usual way to get
vt = ηt ∇θ E M B (θ),
θt+1 = θt − vt . (20) wt+∆t − 2wt + wt−∆t wt+∆t − wt
m +µ = −∇w E(w).
(∆t)2 ∆t
Thus, in SGD, we replace the actual gradient over the (24)
full data at each gradient descent step by an approxima- Rearranging this equation, we can rewrite this as
tion to the gradient computed using a minibatch. This
has two important benefits. First, it introduces stochas- (∆t)2 m
∆wt+∆t = − ∇w E(w) + ∆wt . (25)
ticity and decreases the chance that our fitting algorithm m + µ∆t m + µ∆t
gets stuck in isolated local minima. Second, it signifi-
cantly speeds up the calculation as one does not have to Notice that this equation is identical to Eq. (22) if we
use all n data points to calculate the gradient. Empirical identify the position of the particle, w, with the parame-
and theoretical work suggests that SGD has additional ters θ. This allows us to identify the momentum param-
benefits. Chief among these is that introducing stochas- eter and learning rate with the mass of the particle and
ticity is thought to act as a natural regularizer that pre- the viscous drag as:
vents overfitting in deep, isolated minima (Bishop, 1995a; m (∆t)2
Keskar et al., 2016). γ= , η= . (26)
m + µ∆t m + µ∆t
Thus, as the name suggests, the momentum parameter
D. Adding Momentum is proportional to the mass of the particle and effec-
tively provides inertia. Furthermore, in the large vis-
In practice, SGD is almost always used with a “mo- cosity/small learning rate limit, our memory time scales
mentum” or inertia term that serves as a memory of the as (1 − γ)−1 ≈ m/(µ∆t).
direction we are moving in parameter space. This is typ- Why is momentum useful? SGD momentum helps
ically implemented as follows the gradient descent algorithm gain speed in directions
vt = γvt−1 + ηt ∇θ E(θt ) with persistent but small gradients even in the presence
of stochasticity, while suppressing oscillations in high-
θt+1 = θt − vt , (21)
curvature directions. This becomes especially important
where we have introduced a momentum parameter γ, in situations where the landscape is shallow and flat in
with 0 ≤ γ ≤ 1, and for brevity we dropped the ex- some directions and narrow and steep in others. It has
plicit notation to indicate the gradient is to be taken been argued that first-order methods (with appropriate
over a different mini-batch at each step. We call this al- initial conditions) can perform comparable to more ex-
gorithm gradient descent with momentum (GDM). From pensive second order methods, especially in the context
these equations, it is clear that vt is a running average of complex deep learning models (Sutskever et al., 2013).
of recently encountered gradients and (1 − γ)−1 sets the Empirical studies suggest that the benefits of including
characteristic time scale for the memory used in the av- momentum are especially pronounced in complex models
eraging procedure. Consistent with this, when γ = 0, in the initial “transient phase” of training, rather than
this just reduces down to ordinary SGD as described in during a subsequent fine-tuning of a coarse minimum.
Eq. (20). An equivalent way of writing the updates is The reason for this is that, in this transient phase, corre-
lations in the gradient persist across many gradient de-
∆θt+1 = γ∆θt − ηt ∇θ E(θt ), (22) scent steps, accentuating the role of inertia and memory.
These beneficial properties of momentum can some-
where we have defined ∆θt = θt − θt−1 . In what should times become even more pronounced by using a slight
be a familiar scenario to many physicists, momentum modification of the classical momentum algorithm called
based methods were first introduced in old, largely for- Nesterov Accelerated Gradient (NAG) (Nesterov, 1983;
gotten (until recently) Soviet papers (Nesterov, 1983; Sutskever et al., 2013). In the NAG algorithm, rather
Polyak, 1964). than calculating the gradient at the current parameters,
18
∇θ E(θt ), one calculates the gradient at the expected A related algorithm is the ADAM optimizer. In
value of the parameters given our current momentum, ADAM, we keep a running average of both the first and
∇θ E(θt + γvt−1 ). This yields the NAG update rule second moment of the gradient and use this information
to adaptively change the learning rate for different pa-
vt = γvt−1 + ηt ∇θ E(θt + γvt−1 )
rameters. In addition to keeping a running average of the
θt+1 = θt − vt . (27) first and second moments of the gradient (i.e. mt = E[gt ]
One of the major advantages of NAG is that it allows for and st = E[gt2 ], respectively), ADAM performs an addi-
the use of a larger learning rate than GDM for the same tional bias correction to account for the fact that we are
choice of γ. estimating the first two moments of the gradient using a
running average (denoted by the hats in the update rule
below). The update rule for ADAM is given by (where
E. Methods that use the second moment of the gradient multiplication and division are once again understood to
be element-wise operations below)
In stochastic gradient descent, with and without mo-
mentum, we still have to specify a “schedule” for tuning gt = ∇θ E(θ) (29)
the learning rates ηt as a function of time. As discussed in mt = β1 mt−1 + (1 − β1 )gt
the context of Newton’s method, this presents a number st = β2 st−1 + (1 − β2 )gt2
of dilemmas. The learning rate is limited by the steep- mt
est direction which can change depending on the current m̂t =
1 − β1t
position in the landscape. To circumvent this problem, st
ideally our algorithm would keep track of curvature and ŝt =
1 − β2t
take large steps in shallow, flat directions and small steps
m̂t
in steep, narrow directions. Second-order methods ac- θt+1 = θt − ηt √ ,
complish this by calculating or approximating the Hes- ŝt +
sian and normalizing the learning rate by the curvature. (30)
However, this is very computationally expensive for ex-
where β1 and β2 set the memory lifetime of the first and
tremely large models. Ideally, we would like to be able to
second moment and are typically taken to be 0.9 and 0.99
adaptively change the step size to match the landscape
respectively, and η and are identical to RMSprop.
without paying the steep computational price of calcu-
Like in RMSprop, the effective step size of a parameter
lating or approximating Hessians.
depends on the magnitude of its gradient squared. To
Recently, a number of methods have been introduced
understand this better, let us rewrite this expression in
that accomplish this by tracking not only the gradient,
terms of the variance σt2 = ŝt − (m̂t )2 . Consider a single
but also the second moment of the gradient. These
parameter θt . The update rule for this parameter is given
methods include AdaGrad (Duchi et al., 2011), AdaDelta
by
(Zeiler, 2012), RMS-Prop (Tieleman and Hinton, 2012),
and ADAM (Kingma and Ba, 2014). Here, we discuss m̂t
the last two as representatives of this class of algorithms. ∆θt+1 = −ηt p . (31)
σt2 + m2t +
In RMS prop, in addition to keeping a running average
of the first moment of the gradient, we also keep track of We now examine different limiting cases of this expres-
the second moment denoted by st = E[gt2 ]. The update sion. Assume that our gradient estimates are consistent
rule for RMS prop is given by so that the variance is small. In this case our update
gt = ∇θ E(θ) (28) rule tends to ∆θt+1 → −ηt (here we have assumed that
m̂t ). This is equivalent to cutting off large persis-
st = βst−1 + (1 − β)gt2
tent gradients at 1 and limiting the maximum step size
gt
θt+1 = θt − ηt √ , in steep directions. On the other hand, imagine that the
st + gradient is widely fluctuating between gradient descent
where β controls the averaging time of the second mo- steps. In this case σ 2 m̂2t so that our update becomes
ment and is typically taken to be about β = 0.9, ηt is a ∆θt+1 → −ηt m̂t /σt . In other words, we adapt our learn-
learning rate typically chosen to be 10−3 , and ∼ 10−8 ing rate so that it is proportional to the signal-to-noise
is a small regularization constant to prevent divergences. ratio (i.e. the mean in units of the standard deviation).
Multiplication and division by vectors is understood as From a physical point of view, this is extremely desir-
an element-wise operation. It is clear from this formula able. The standard deviation serves as a natural adap-
that the learning rate is reduced in directions where the tive scale for deciding whether a gradient is large or small.
norm of the gradient is consistently large. This greatly Thus, ADAM has the beneficial effects of adapting our
speeds up the convergence by allowing us to use a larger step size so that we cut off large gradient directions (and
learning rate for flat directions. hence prevent oscillations and divergences) and measur-
19
y
noting that recent studies have shown adaptive methods GDM
like RMSProp, ADAM, and AdaGrad tend to general- NAG
−2
ize worse than SGD in classification tasks, though they RMS
achieve smaller training error. Such discussion is beyond ADAMS
the scope of this review so we refer readers to (Wilson −4
et al., 2017) for more details. −4 −2 0 2 4
x
FIG. 9 Comparison of GD and its generalization for
F. Comparison of various methods
Beale’s function. Trajectories from gradient descent (GD;
black line), gradient descent with momentum (GDM; magenta
To better understand these methods, it is helpful to line), NAG (cyan-dashed line), RMSprop (blue dash-dot line),
visualize the performance of the five methods discussed and ADAM (red line) for Nsteps = 104 . The learning rate
above – gradient descent (GD), gradient descent with for GD, GDM, NAG is η = 10−6 and η = 10−3 for ADAM
momentum (GDM), NAG, ADAM, and RMSprop. To and RMSprop. β = 0.9 for RMSprop, β1 = 0.9 and β2 =
do so, we will use Beale’s function: 0.99 for ADAM, and = 10−8 for both methods. Please see
corresponding notebook for details.
f (x, y) = (1.5 − x + xy)2 (32)
2 2 3 2
+(2.25 − x + xy ) + (2.625 − x + xy ) .
practices laid out in (Bottou, 2012; LeCun et al., 1998b;
This function has a global minimum at (x, y) = (3, 0.5) Tieleman and Hinton, 2012).
and an interesting structure that can be seen in Fig. 9.
The figure shows the results of using all five methods • Randomize the data when making mini-batches. It
for Nsteps = 104 steps for three different initial condi- is always important to randomly shuffle the data
tions. In the figure, the learning rate for GD, GDM, and when forming mini-batches. Otherwise, the gra-
NAG are set to η = 10−6 whereas RMSprop and ADAM dient descent method can fit spurious correlations
have a learning rate of η = 10−3 . The learning rates resulting from the order in which data is presented.
for RMSprop and ADAM can be set significantly higher
than the other methods due to their adaptive step sizes. • Transform your inputs. As we discussed above,
For this reason, ADAM and RMSprop tend to be much learning becomes difficult when our landscape has
quicker at navigating the landscape than simple momen- a mixture of steep and flat directions. One simple
tum based methods (see Fig. 9). Notice that in some trick for minimizing these situations is to standard-
cases (e.g. initial condition of (−1, 4)), the trajectories ize the data by subtracting the mean and normaliz-
do not find the global minimum but instead follow the ing the variance of input variables. Whenever pos-
deep, narrow ravine that occurs along y = 1. This kind of sible, also decorrelate the inputs. To understand
landscape structure is generic in high-dimensional spaces why this is helpful, consider the case of linear re-
where saddle points proliferate. Once again, the adaptive gression. It is easy to show that for the squared
step size and momentum of ADAM and RMSprop allows error cost function, the Hessian of the energy ma-
these methods to traverse the landscape faster than the trix is just the correlation matrix between the in-
simpler first-order methods. The reader is encouraged to puts. Thus, by standardizing the inputs, we are
consult the corresponding Jupyter notebook and exper- ensuring that the landscape looks homogeneous in
iment with changing initial conditions, the loss surface all directions in parameter space. Since most deep
being minimized, and hyper-parameters to gain more in- networks can be viewed as linear transformations
tuition about all these methods. followed by a non-linearity at each layer, we expect
this intuition to hold beyond the linear case.
G. Gradient descent in practice: practical tips • Monitor the out-of-sample performance. Always
monitor the performance of your model on a valida-
We conclude this chapter by compiling some practical tion set (a small portion of the training data that is
tips from experts for getting the best performance from held out of the training process to serve as a proxy
gradient descent based algorithms, especially in the con- for the test set – see Sec. XI for more on validation
text of deep neural networks discussed later in the review, sets). If the validation error starts increasing, then
see Secs. XVI.B, IX. This section draws heavily on best the model is beginning to overfit. Terminate the
20
learning process. This early stopping significantly The posterior distribution describes our knowledge about
improves performance in many settings. the unknown parameter w after observing the data X.
In many cases, it will not be possible to analytically
• Adaptive optimization methods don’t always have compute the normalizing constant in the denominator
good generalization. As we mentioned, recent stud- of the posterior distribution (i.e. the partition function
ies have shown that adaptive methods such as p(X) = dw p(X|w)p(w)) and Markov Chain Monte
R
ADAM, RMSPorp, and AdaGrad tend to have poor Carlo (MCMC) methods are needed to draw random
generalization compared to SGD or SGD with mo- samples from p(w|X).
mentum, particularly in the high-dimensional limit
(i.e. the number of parameters exceeds the number The likelihood function p(X|w) is a common feature
of data points) (Wilson et al., 2017). Although it is of both classical statistics and Bayesian inference, and
not clear at this stage why these methods perform is determined by the model and the measurement noise.
so well in training deep neural networks such as Many common statistical procedures such as least-square
generative adversarial networks (GANs) (Goodfel- fitting can be cast into the formalism of Maximum Like-
low et al., 2014), simpler procedures like properly- lihood Estimation (MLE). In MLE, one chooses the pa-
tuned SGD may work as well or better in these rameters ŵ that maximize the likelihood (or equivalently
applications. the log-likelihood since log is a monotonic function) of the
observed data:
In other words, we are looking to find the w which min- Jupyter notebooks. The upshot is, following our defini-
imizes the L2 error. Geometrically speaking, the predic- tion of Ēin and Ēout in Section III, the average in-sample
tor function g(xi ; w) = xTi w defines a hyperplane in Rp . and out-of-sample error can be shown to be
Minimizing the least squares error is therefore equivalent p
to minimizing the sum of all projections (i.e. residuals) Ēin = σ 2 1 − (39)
n
for all points xi to this hyperplane (see Fig. 10). For- p
mally, we denote the solution to this problem as ŵLS : Ēout = σ 2 1 + , (40)
n
provided we obtain the least squares solution ŵLS from
ŵLS = arg min ||Xw − y||22 , (37) i.i.d. samples X and y generated through y = Xwtrue +
w∈Rp 3 . Therefore, we can calculate the average generaliza-
tion error explicitly:
which, after straightforward differentiation, leads to
p
|Ēin − Ēout | = 2σ 2 . (41)
T
ŵLS = (X X) −1 T
X y. (38) n
This imparts an important message: if we have p n
Note that we have assumed that X T X is invertible, which (i.e. high-dimensional data), the generalization error is
is often the case when n p. Formally speaking, if extremely large, meaning the model is not learning. Even
rank(X) = p, namely, the predictors X1 , . . . , Xp (i.e. when we have p ≈ n, we might still not learn well due
columns of X) are linearly independent, then ŵLS is to the intrinsic noise σ 2 . One way to ameliorate this
unique. In the case of rank(X) < p, which happens when is, as we shall see in the following few sections, to use
p > n, X T X is singular, implying there are infinitely regularization. We will mainly focus on two forms of
many solutions to the least squares problem, Eq. (37). regularization: the first one employs an L2 penalty and
In this case, one can easily show that if w0 is a solution, is called Ridge regression, while the second uses an L1
w0 +η is also a solution for any η which satisfies Xη = 0 penalty and is called LASSO.
(i.e. η ∈ null(X)). Having determined the least squares
solution, we can calculate y, the best fit of our data X,
as ŷ = X ŵLS = PX y, where PX = X(X T X)−1 X T , y
c.f. Eq. (36). Geometrically, PX is the projection matrix span( { X 1 , · · · , X p } )
which acts on y and projects it onto the column space of
X, which is spanned by the predictors X1 , · · · , Xp (see y − ŷ
FIG. 11). Notice that we found the optimal solution ŵLS
in one shot, without doing any sort of iterative optimiza-
tion like that discussed in Section IV. ŷ
Rp
B. Ridge-Regression
This problem is equivalent to the following constrained which implies that the Ridge predictor satisfies
optimization problem
ŷRidge = X ŵRidge
ŵRidge (t) = arg min ||Xw − y||22 . (43) = U D(D2 + λI)−1 DU T y
w∈Rp : ||w||22 ≤t p
X d2j
= uj 2 uTj y (48)
d + λ
This means that for any t ≥ 0 and solution ŵRidge in j=1 j
Eq. (43), there exists a value λ ≥ 0 such that ŵRidge ≤ UUT y (49)
solves Eq. (42), and vice versa4 . With this equivalence, it
= X ŷ ≡ ŷLS , (50)
is obvious that by adding a regularization term, λ||w||22 ,
to our least squares loss function, we are effectively con- where uj are the columns of U . Note that in the in-
straining the magnitude of the parameter vector learned equality step we assumed λ ≥ 0 and used SVD to sim-
from the data. plify Eq. (38). By comparing Eq. (48) with Eq. (50), it is
To see this, let us solve Eq. (42) explicitly. Differenti- clear that in order to compute the fitted vector ŷ, both
ating w.r.t. w, we obtain, Ridge and least squares linear regression have to project
y to the column space of X. The only difference is that
ŵRidge (λ) = (X T X + λIp×p )−1 X T y. (44) Ridge regression further shrinks each basis component j
by a factor d2j /(d2j + λ). We encourage the reader to do
In fact, when X is orthogonal, one can simplify this ex- the exercises in Notebook 3 to develop further intuition
pression further: about how Ridge regression works.
ŵLS
ŵRidge (λ) = , for orthogonal X, (45) C. LASSO and Sparse Regression
1+λ
In this section, we study the effects of adding an L1 reg-
where ŵLS is the least squares solution given by Eq. (38). ularization penalty, conventionally called LASSO, which
This implies that the ridge estimate is merely the least stands for “least absolute shrinkage and selection opera-
squares estimate scaled by a factor (1 + λ)−1 . tor”. Concretely, LASSO in the penalized form is defined
Can we derive a similar relation between the fitted by the following regularized regression problem:
vector ŷ = X ŵRidge and the prediction made by least
squares linear regression? To answer this, let us do a sin- ŵLASSO (λ) = arg min ||Xw − y||22 + λ||w||1 . (51)
gular value decomposition (SVD) on X. Recall that the w∈Rp
LASSO Ridge
scaled by
λ
λ 1+ λ
FIG. 13 [Adapted from (Friedman et al., 2001)] Illustra-
tion of LASSO (left) and Ridge regression (right). The blue
λ
concentric ovals are the contours of the regression function
while the red shaded regions represent the constraint func-
tions: (left) |w1 | + |w2 | ≤ t and (right) w12 + w22 ≤ t. In-
tuitively, since the constraint function of LASSO has more
FIG. 12 [Adapted from (Friedman et al., 2001)] Comparing protrusions, the ovals tend to intersect the constraint at the
LASSO and Ridge regression. The black 45 degree line is vertex, as shown on the left. Since the vertices correspond to
the unconstrained estimate for reference. The estimators are parameter vectors w with only one non-vanishing component,
shown by red dashed lines. For LASSO, this corresponds to LASSO tends to give sparse solution.
the soft-thresholding function Eq. (53) while for Ridge regres-
sion the solution is given by Eq. (45)
1.0
How different are the solutions found using LASSO Train (Ridge)
Test (Ridge)
and Ridge regression? In general, LASSO tends to give 0.8 Train (LASSO)
sparse solutions, meaning many components of ŵLASSO Test (LASSO)
are zero. An intuitive justification for this result is pro-
Performance
0.6
vided in Fig. 13. In short, to solve a constrained op-
timization problem with a fixed regularization strength 0.4
t ≥ 0, for example, Eq. (43) and Eq. (52), one first carves
out the “feasible region" specified by the regularizer in the
0.2
{w1 , · · · , wd } space. This means that a solution ŵ0 is le-
gitimate only if it falls in this region. Then one proceeds
0.0
by plotting the contours of the least squares regressors in 10
2
10
1
10
0
10
1
10
2
1.0
Ridge LASSO
600 0.8
600
Train (OLS)
500
500 0.6 Test (OLS)
R2
400 400 Train (Ridge)
|wi|
Let us make a few remarks: (i) the regularization pa- 5 Look closer, and you will see that LASSO actually splits the
rameter λ affects the Ridge and LASSO regressions at weights rather equally for the periodic boundary condition ele-
scales separated by a few orders of magnitude. Notice ment at the edges of the anti-diagonal.
26
FIG. 17 Learned interaction matrix Jij for the Ising model ansatz in Eq. (55) for ordinary least squares (OLS) regression
(left), Ridge regression (middle) and LASSO (right) at different regularization strengths λ. OLS is λ-independent but is shown
for comparison throughout.See Notebook 4.
27
E. Convexity of regularizer From the setup of linear regression, the data D used to
fit the regression model is generated through y = xT w +
In the previous section, we mentioned that the analyt- . We often assume that is a Gaussian noise with mean
ical solution of LASSO can be found by invoking its con- zero and variance σ 2 . To connect linear regression to the
vexity. In this section, we provide a gentle introduction Bayesian framework, we often write the model as
to convexity theory and highlight a few properties which
can help us understand the differences between LASSO p(y|x, θ) = N (y|µ(x), σ 2 (x)). (60)
and Ridge regression. First, recall that a set C ⊆ Rn is
called convex if for any x, y ∈ C and t ∈ [0, 1], In other words, our regression model is defined by a con-
ditional probability that depends not only on data x but
tx + (1 − t)x ∈ C. (58) on some model parameters θ. For example, if the mean
is a linear function of x given by µ = xT w, and the vari-
In other words, every line segment joining x, y lies en-
ance is fixed σ 2 (x) = σ 2 , then θ = (w, σ 2 ). In statistics,
tirely in C. A function f : Rn → R is called con-
many problems rely on estimation of some parameters of
vex if its domain, dom(f ), is a convex set, and for any
interest. For example, suppose we are given the height
x, y ∈dom(f ) and t ∈ [0, 1] we have
data of 20 junior students from a regional high school,
f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y), (59) but what we are interested in is the average height of all
high school juniors in the whole county. It is conceivable
That is, the function lies on or below the line segment that the data we are given are not representative of the
joining its evaluation at x and y. This function f is student population as a whole. It is therefore necessary
called strictly convex if this inequality holds strictly for to devise a systematic way to preform reliable estima-
x 6= y and t ∈ (0, 1). Now, it turns out that for con- tion. Here we present the maximum likelihood estima-
vex functions, any local minimizer is a global minimizer. tion (MLE), and show that least squares regression can
Algorithmically, this means that in the optimization pro- be derived from this framework.
cedure, as long as we are “going down the hill” and agree MLE is defined by
to stop when we reach a minimum, then we have hit
the global minimum. In addition to this, there is an θ̂ ≡ arg max log p(D|θ). (61)
θ
abundance of rich theory regarding convex duality and
optimality, which allow us to understand the solutions Using the assumption that samples are i.i.d, we can write
even before solving the problem itself. We refer interested the log-likelihood as
readers to (Boyd and Vandenberghe, 2004; Rockafellar,
n
2015). X
l(θ) ≡ log p(D|θ) = log p(yi |xi , θ) (62)
Now let us examine the two regularizers we introduced
i=1
earlier. A close inspection reveals that LASSO and Ridge
regressions are both convex problems but only Ridge re- Using Eq. (60), we get
gression is a strictly convex problem (assuming λ > 0). n
From convexity theory, this means that we always have a 1 X n
l(θ) = − (yi − xTi w)2 − log(2πσ 2 ) (63)
unique solution for Ridge but not necessary for LASSO. 2σ 2 i=1 2
In fact, it was recently shown that under mild conditions, 1
such as demanding general position for columns of X, =− ||Xw − y||22 + const. (64)
2σ 2
the LASSO solution is indeed unique (Tibshirani et al.,
2013). Apart from this theoretical characterization, (Zou By comparing Eq. (37) and Eq. (64), it is clear that per-
and Hastie, 2005) introduced the notion of Elastic Net to forming least squares is the same as maximizing the log-
retain the desirable properties of both LASSO and Ridge likelihood of this model.
regression, which is now one of the standard tools for What about adding regularization? In Section V, we
regression analysis and machine learning. We refer to introduced the maximum a posteriori probability (MAP)
reader to explore this in Notebook 2. estimate. Here we show that it actually corresponds to
regularized linear regression, where the choice of prior
determines the type of regularization. Recall Bayes’ rule
F. Bayesian formulation of linear regression
p(θ|D) ∝ p(D|θ)p(θ). (65)
In Section V, we gave an overview of Bayesian inference
and phrased it in the context of learning and uncertainty Now instead of maximizing the log-likelihood, l(θ) =
quantification. In this section we formulate least squares log p(D|θ), let us maximize the log posterior, log p(θ|D).
regression from a Bayesian point of view. We shall see Invoking Eq. (65), the MAP estimator becomes
that regularization in learning will emerge naturally as
part of the Bayesian inference procedure. θ̂MAP ≡ arg max log p(D|θ) + log p(θ). (66)
θ
28
Suppose we use the Gaussian Qprior6 with zero mean and In your investigation you would probably record many
variance τ , namely, p(w) = j N (wj |0, τ 2 ), we can re-
2
things (hopefully including the length and mass!) in an
cast the MAP estimator into effort to give yourself the best possible chance of deter-
mining the unknown relationship, perhaps writing down
n n
1 X 1 X the temperature of the room, any air currents, if the ta-
θ̂MAP ≡ arg max − 2 (yi − xTi w)2 − 2 w2
θ 2σ i=1 2τ j=1 j ble were vibrating, etc. What you have done is create a
high-dimensional dataset for yourself. However you actu-
1 1 ally possess an even higher-dimensional dataset than you
= arg max − 2 ||Xw − y||2 − 2 ||w||2 . (67)
2 2
θ 2σ 2τ probably would admit to yourself. For example you are
probably aware of the time of day, that it is a Wednes-
Note that we dropped constant terms that don’t depend
day, your friend Alice being in attendance, your friend
on the parameters. The equivalence between MAP esti-
Bob being absent with a cold, the country in which you
mation with a Gaussian prior and Ridge regression is
are doing the experiment, and the planet you are on, but
established by comparing Eq. (67) and Eq. (43) with
you almost assuredly haven’t written these down in your
λ ≡ σ 2 /τ 2 . We relegate the analogous derivation for
notebook. Why not? The reason is because you entered
LASSO to an exercise in Notebook 3.
the classroom with strongly held prior beliefs that none of
those things affect the physics takes place in that room.
G. Recap and a general perspective on regularizers Even of the things you did write down in an effort to
be a careful scientist you probably hold some doubt as
In this section, we explored least squares linear regres- to their importance to your result and what is serving
sion with and without regularization. We motivated the you here is the intuition that probably only a few things
need for regularization due to poor generalization, in par- matter in the physics of pendula. Hence again you are
ticular in the “high-dimensional limit" (p n). Instead approaching the experiment with prior beliefs about how
of showing the average in-sample and out-of-sample er- many features you will need to pay attention to in order
rors for the regularized problem explicitly, we conducted to predict what will happen when you swing an unknown
numerical experiments in Notebook 3 on the diabetes pendulum. This example might seem a bit contrived, but
dataset and showed that regularization typically leads the point is that we live in a high-dimensional world of
to better generalization. Due to the equivalence between information and while we have good intuition about what
the constrained and penalized form of regularized regres- to write down in our notebook for well-known problems,
sion (in LASSO and Ridge, but not generally true in cases often in the field of ML we cannot say with any confi-
such as L0 penalization), we can regard the regularized dence a priori what the small list of things to write down
regression problem as an un-regularized problem but on will be, but we can at least use regularization to help us
a constrained set of parameters. Since the size of the al- enforce that the list not be too long so that we don’t end
lowed parameter space (e.g. w ∈ Rp when un-regularized up predicting that the period of a pendulum depends on
vs. w ∈ C ⊂ Rp when regularized) is roughly a proxy for Bob having a cold on Wednesdays.
model complexity, solving the regularized problem is in Of course, in both LASSO and Ridge regression there
effect solving the un-regularized problem with a smaller is a parameter λ involved. In principle, this hyper-
model complexity class. This implies that we’re less likely parameter is usually predetermined, which means that
to overfit. it is not part of the regression process. As we saw in
We also showed the connection between using a reg- Fig. 15, our learning performance and solution depends
ularization function and the use of priors in Bayesian strongly on λ, thus it is vital to choose it properly. As
inference. This connection can be used to develop more we discussed in Sec. V.C, one approach is to assume an
intuition about why regularization implies we are less uninformative prior on the hyper-parameters, p(λ), and
likely to overfit the data: Let’s say you are a young average the posterior over all choices of λ following this
Physics student taking a laboratory class where the goal distribution. However, this comes with a large computa-
of the experiment is to measure the behavior of several tional cost. Therefore, it is simpler to choose the regular-
different pendula and use that to predict the formula ization parameter through some optimization procedure.
(i.e. model) that determines the period of oscillation. We’d like to emphasize that linear regression can be
applied to model non-linear relationship between input
and response. This can be done by replacing the input
x with some nonlinear function φ(x). Note that doing
6 Indeed, a Gaussian prior is the conjugate prior that gives a so preserves the linearity as a function of the parame-
Gaussian posterior. For a given likelihood, conjugacy guar-
ters w, since model is defined by the their inner product
antees the preservation of prior distribution at the posterior
level. For example, for a Gaussian(Geometric) likelihood with φT (x)w. This method is known as basis function expan-
a Gaussian(Beta) prior, the posterior distribution is still Gaus- sion (Bishop, 2006; Murphy, 2012).
sian(Beta) distribution. Recent years have also seen a surge of interest in un-
29
0 1 2 3
(i.e yi = 0 or yi = 1). In many cases, it is favorable to The maximum likelihood estimator is defined as the set
have a “soft” classifier that outputs the probability of a of parameters that maximize the log-likelihood:
given category rather than a single value. For example, n
given xi , the classifier outputs the probability of being
X
yi log f (xTi w)+(1−yi ) log 1 − f (xTi w) .
ŵ = arg max
in category m. Logistic regression is the most canonical θ i=1
example of a soft classifier. In logistic regression, the (75)
probability that a data point xi belongs to a category Since the cost (error) function is just the negative log-
yi = {0, 1} is is given by likelihood, for logistic regression we have that
where θ = w are the weights we wish to learn from the Eq. (76) is known in statistics as the cross entropy. Fi-
data. To gain some intuition for these equations, consider nally, we note that just as in linear regression, in practice
a collection of non-interacting two-state systems coupled we usually supplement the cross-entropy with additional
to a thermal bath (e.g a collection of atoms that can be regularization terms, usually L1 and L2 regularization
in two states). Furthermore, denote the state of system (see Sec. VI for discussion of these regularizers).
i by a binary variable: yi ∈ {0, 1}. From elementary
statistical mechanics, we know that if the two states have
B. Minimizing the cross entropy
energies 0 and 1 the probability for finding the system
in a state yi is just:
The cross entropy is a convex function of the weights w
e−β0 1 and, therefore, any local minimizer is a global minimizer.
P (yi = 1) = = , Minimizing this cost function leads to the following equa-
e−β0
+ e−β1 1 + e−β∆
tion
P (yi = 1) = 1 − P (yi = 0). (71)
n
X
f (xTi w) − yi xi , (77)
Notice that in these expressions, as is often the case in 0 = ∇C(w) =
i=1
physics, only energy differences are observable. If the
difference in energies between two states is given by ∆ = where we made use of the logistic function identity
xTi w, we recover the expressions for logistic regression. ∂z f (z) = f (z)[1 − f (z)]. Equation (77) defines a tran-
We shall use this mapping between partition functions scendental equation for w, the solution of which, unlike
and classification to generalize the logistic regressor to linear regression, cannot be written in a closed form.
soft-max regression in Sec. VII.D. Notice that in terms For this reason, one must use numerical methods such
of the logistic function, we can write as those introduced in Sec. IV to solve this optimization
problem.
P (yi = 1) = f (xTi w) = 1 − P (yi = 0). (72)
where the lattice site indices i, j run over all nearest (blue) and test (red) accuracy curves being very close to
neighbors of a 2D square lattice, and J is an interaction each other. Interestingly, the liblinear minimizer outper-
energy scale. We adopt periodic boundary conditions. forms SGD on the training and test data, but not on the
Onsager proved that this model undergoes a phase tran- near-critical data for certain values of the regularization
sition in the thermodynamic limit from an ordered fer- strength λ. Moreover, similar to the linear regression
romagnet with all spins aligned to a disordered √ phase at examples, we find that there exists a sweet spot for the
the critical temperature Tc /J = 2/ log(1 + 2) ≈ 2.26. SGD regularization strength λ that results in optimal
For any finite system size, this critical point is expanded performance of the logistic regressor, at about λ ∼ 10−1 .
to a critical region around Tc . We might expect that the difficulty of the phase recogni-
An interesting question to ask is whether one can tion problem depends on the temperature of the queried
train a statistical classifier to distinguish between the two sample. Looking at the states in the near-critical region,
phases of the Ising model. If successful, this can be used c.f. Fig. 20, it is no longer easy for a trained human eye
to locate the position of the critical point in more compli- to distinguish between the ferromagnetic and the disor-
cated models where an exact analytical solution has so far dered phases close to Tc . Therefore, it is interesting to
remained elusive (Morningstar and Melko, 2017; Zhang also compare the training and test accuracies to the ac-
et al., 2017b). In other words, given an Ising state, we curacy of the near-critical state predictions. (Recall that
would like to classify whether it belongs to the ordered or the model is not trained on near-critical states.) Indeed,
the disordered phase, without any additional information the liblinear accuracy is about 7% smaller for the criti-
other than the spin configuration itself. This categorical cal states (green curves) compared to the test data (red
machine learning problem is well suited for logistic re- line).
gression, and will thus consist of recognising whether a Finally, it is important to note that all of Scikit’s logis-
given state is ordered by looking at its bit configurations. tic regression solvers have a in-built regularizers. We did
Notice that, for the purposes of applying logistic regres- not emphasize the role of the regularizers in this section,
sion, the 2D spin state of the Ising model will be flattened but they are crucial in order to prevent overfitting. We
out to a 1D array, so it will not be possible to learn in- encourage the interested reader to play with the different
formation about the structure of the contiguous ordered regularization types and numerical solvers in Notebook 6
2D domains [see Fig. 20]. Such information can be incor- and compare model performances.
porated using deep convolutional neural networks, see
Section IX.
To this end, we consider the 2D Ising model on a 40×40 2. SUSY
square lattice, and use Monte-Carlo (MC) sampling to
prepare 104 states at every fixed temperature T out of In high energy physics experiments, such as the AT-
a pre-defined set. We furthermore assign a label to each LAS and CMS detectors at the CERN LHC, one major
state according to its phase: 0 if the state is disordered, hope is the discovery of new particles. To accomplish this
and 1 if it is ordered. task, physicists attempt to sift through events and clas-
It is well-known that near the critical temperature Tc , sify them as either a signal of some new physical process
the ferromagnetic correlation length diverges, which leads or particle, or as a background event from already un-
to, amongst other things, critical slowing down of the MC derstood Standard Model processes. Unfortunately, we
algorithm. Perhaps identifying the phases is also harder don’t know for sure what underlying physical process oc-
in the critical region. With this in mind, consider the curred (the only information we have access to are the
following three types of states: ordered (T /J < 2.0), final state particles). However, we can attempt to de-
near-critical (2.0 ≤ T /J ≤ 2.5) and disordered (T /J > fine parts of phase space that will have a high percentage
2.5). We use both ordered and disordered states to train of signal events. Typically this is done by using a se-
the logistic regressor and, once the supervised training ries of simple requirements on the kinematic quantities
procedure is complete, we will evaluate the performance of the final state particles, for example having one or
of our classification model on unseen ordered, disordered, more leptons with large amounts of momentum that are
and near-critical states. transverse to the beam line (pT ). Instead, here we will
Here, we deploy the liblinear routine (the default for use logistic regression in an attempt to find the relative
Scikit’s logistic regression) and stochastic gradient de- probability that an event is from a signal or a background
scent (SGD, see Sec. IV for details) to optimize the logis- event. Rather than using the kinematic quantities of fi-
tic regression cost function with L2 regularization. We nal state particles directly, we will use the output of our
define the accuracy of the classifier as the percentage of logistic regression to define a part of phase space that is
correctly classified data points. Comparing the accuracy enriched in signal events (see Jupyter notebookNotebook
on the training and test data, we can study the degree of 5).
overfitting. The first thing to notice from Fig. 21 is the The dataset we are using comes from the UC Irvine
small degree of overfitting, as suggested by the training ML repository and has been produced using Monte Carlo
32
10 10 10
20 20 20
30 30 30
0 20 0 20 0 20
FIG. 20 Examples of typical states of the 2D Ising model for three different temperatures in the ordered phase (T /J = 0.75,
left), the critical region (T /J = 2.25, middle) and the disordered phase (T /J = 4.0, right). The linear system dimension is
L = 40 sites.
FIG. 22 The probability of an event being a classified as a signal event for true signal events (left, blue) and background events
(right, red).
important feature. leading leptons for both signal and background events.
Figure 24 shows these two distributions, and one can see
that while some signal events are easily distinguished,
many live in the same part of phase space as the back-
ground. This effect can also be seen by looking at fig-
ure 22 where you can see that some signal events ‘look’
like background events and vice-versa.
requirements on all of them. While putting more re- from which we can define the cost function in a similar
quirements would indeed increase background rejection, fashion:
it would also decrease signal efficiency. Hence, the cut- n M −1
based approach will never yield as strong discrimination
X X
E(w) = − yim log P (yim = 1|xi , wm )
as logistic regression. One other interesting point about i=1 m=0
these results is that the higher-order variables noticeably + (1 − yim ) log (1 − P (yim = 1|xi , wm )) . (81)
help the ML techniques. In later sections, we will return
to this point to see if more sophisticated techniques can As expected, for M = 1, we recover the cross entropy for
provide further improvement. logistic regression, cf. Eq. (76).
FIG. 26 Visualization of the weights wj after training a SoftMax Regression model on the MNIST dataset (see Notebook
7). We emphasize that SoftMax Regression does not have explicit 2D spatial knowledge; the model learns from data points
flattened out in a one-dimensional array.
one can also imagine that the ensemble predictions can and less sensitive to sampling noise arising from having a
be much worse than the predictions from each of the in- finite-sized training dataset (i.e. smaller variance). Here,
dividual models that constitute the ensemble, especially we will revisit the bias-variance tradeoff in the context
when pooling reinforces weak but correlated deficiencies of ensembles, drawing upon the beautiful discussion in
in each of the individual predictors. Thus, it is impor- Ref. (Louppe, 2014).
tant to understand when we expect ensemble methods to A key property that will emerge from this analysis is
work. the correlation between models that constitute the en-
In order to do this, we will revisit the bias-variance semble. The degree of correlation between models is
tradeoff, discussed in Sec. III, and generalize it to con- important for two distinct reasons. First, holding the
sider an ensemble of classifiers. We will show that the ensemble size fixed, averaging the predictions of corre-
key to determining when ensemble methods work is the lated models reduces the variance less than averaging
degree of correlation between the models in the ensemble uncorrelated models. Second, in some cases, correla-
(Louppe, 2014). Armed with this intuition, we will intro- tions between models within an ensemble can result in
duce some of the most widely-used and powerful ensem- an i ncrease in bias, offsetting any potential reduction in
ble methods including bagging (Breiman, 1996), boosting variance gained from ensemble averaging. We will dis-
(Freund et al., 1999; Freund and Schapire, 1995; Schapire cuss this in the context of bagging below. One of the
and Freund, 2012), random forests (Breiman, 2001), most dramatic examples of increased bias from correla-
and gradient boosted trees such as XGBoost (Chen and tions is the catastrophic predictive failure of almost all
Guestrin, 2016). derivative models used by Wall Street during the 2008
financial crisis.
ing of data XL = {(yj , xj ), j = 1 . . . N }. Let us assume limit of infinite data) from the true value. The variance
that the true data is generated from a noisy model X
V ar = EL [(ĝL (xi ) − EL [ĝL (xi )])2 ], (86)
i
y = f (x) + , (82)
measures how much our estimator fluctuates due to
finite-sample effects. The noise term
where is a normally distributed with mean zero and
standard deviation σ .
X
N oise = σ2i (87)
Assume that we have a statistical procedure (e.g. least- i
squares regression) for forming a predictor ĝL (x) that
is the part of the error due to intrinsic noise in the data
gives the prediction of our model for a new data point x
generation process that no statistical estimator can over-
given that we trained the model using a dataset L. This
come.
estimator is chosen by minimizing a cost function which,
Let us now generalize this to ensembles of estimators.
for the sake of concreteness, we take to be the squared
Given a dataset XL and hyper-parameters θ that param-
error
eterize members of our ensemble, we will consider a pro-
cedure that deterministically generates a model ĝL (xi , θ)
given XL and θ. We assume that the θ includes some
X
C(X, g(x)) = (yi − ĝL (xi ))2 . (83)
i random parameters that introduce stochasticity into our
ensemble (e.g. an initial condition for stochastic gradient
The dataset L is drawn from some underlying distri- descent or a random subset of features or data points used
bution that describes the data. If we imagine drawing for training). We will be concerned with the expected
many datasets different datasets {Lj } of the same size as prediction error of the aggregate ensemble predictor
L from this distribution, we know that the correspond- M
ing estimators ĝLj (x) will differ from each other due to 1 X
A
ĝL (xi , {θ}) = ĝL (xi , θm ). (88)
stochastic effects arising from sampling noise. For this M m=1
reason, we can view our estimator ĝL (x) as a random
variable (technically a functional) and define an expec- For future reference, let us define the mean, variance,
tation value EL in the usual way. EL is computed by and covariance (i.e. the connected correlation function
by drawing infinitely many different datasets {Lj } of the in the language of physics), and the normalized correla-
same size, fitting the corresponding estimator, and then tion coefficient with respect to θ of the estimators in our
averaging the results. We will also average over different ensemble as
instances of the “noise” . The expectation value over the
Eθ [ĝL (x, θ)] = µL,θ (x)
noise will be denoted by E .
Eθ [ĝL (x, θ) ] − Eθ [ĝL (x, θ)]2 = σL,θ
2 2
(x)
After defining these expectation values, as discussed in
Sec. III, we can decompose the expected generalization Eθ [ĝL (x, θm )ĝL (x, θm0 )] − Eθ [ĝL (x, θm )]2 = CL,θm ,θm0 (x)
error as CL,θm ,θm0 (x)
ρ(x) = 2 . (89)
σL,θ
EL, [C(X, g(x))] = Bias2 + V ar + N oise. (84) Note that by definition, we assume m 6= m0 in CL,θm ,θm0 .
We can now ask about the expected generalization
where the bias, (out-of-sample) error for the ensemble
" #
X
A A 2
X EL,,θ C(X, ĝL (x)) = EL,,θ (yi − ĝL (xi , {θ})) .
Bias2 = (f (xi ) − EL [ĝL (xi )])2 , (85)
i
i
(90)
As in the single estimator case, we decompose the error
measures the deviation of the expectation value of our es- into a noise term, a bias-term, and a variance term. To
timator (i.e. the asymptotic value of our estimator in the see this, note that
37
" #
X
A A
EL,,θ [C(X, ĝL (x))] = EL,,θ (yi − f (xi ) + f (xi ) − ĝL (xi , {θ}))2
i
X
= EL,,θ [(yi − f (xi ))2 + (f (xi ) − ĝL
A
(xi , {θ}))2 + 2(yi − f (xi ))(f (xi ) − ĝL
A
(xi , {θ})]
i
X X
= σ2i + A
EL,θ [(f (xi ) − ĝL (xi , {θ}))2 ], (91)
i i
where in the last line we have used the fact that E [yi ] = f (xi ) to eliminate the last term. We can further decompose
the second term as
A
EL,θ [(f (xi ) − ĝL (xi , {θ}))2 ] = EL,θ [(f (xi ) − EL,θ [ĝL
A A
(xi , {θ})] + EL,θ [ĝL A
(xi , {θ})] − ĝL (xi , {θ}))2 ]
A
= EL,θ [(f (xi ) − EL,θ [ĝL (xi , {θ})])2 ] + EL,θ [(EL,θ [ĝL
A A
(xi , {θ})] − ĝL (xi , {θ}))2 ]
A A A
+ 2EL,θ [(EL,θ [ĝL (xi , {θ})] − ĝL (xi , {θ}))(f (xi ) − EL,θ [ĝL (xi , {θ})])
A
= (f (xi ) − EL,θ [ĝL (xi , {θ})])2 + EL,θ [(ĝL
A A
(xi , {θ}) − EL,θ [ĝL (xi , {θ})])2 ]
≡ Bias2 (xi ) + V ar(xi ), (92)
where we have defined the bias of an aggregate predictor So far the calculation for ensembles is almost iden-
as tical to that of a single estimator. However, since the
aggregate estimator is a sum of estimators, its variance
Bias2 (x) ≡ (f (x) − EL,θ [ĝL
A
(x, {θ})])2 (93) implicitly depends on the correlations between the indi-
vidual estimators in the ensemble. Using the definition
and the variance as of the aggregate estimator Eq. (88) and the definitions in
Eq. (89), we see that
A
V ar(x) ≡ EL,θ [(ĝL A
(x, {θ}) − EL,θ [ĝL (x, {θ})])2 ]. (94)
A A
V ar(x) = EL,θ [(ĝL (x, {θ}) − EL,θ [ĝL (x, {θ})])2 ]
1 X X
= 2 EL,θ [ĝL (x, θm )ĝL (x, θm0 )] − M 2 [µL,θ (x)]2
M 0 i
m,m
1 − ρ(x) 2
2
= ρ(x)σL,θ + σL,θ . (95)
M
This last formula is the key to understanding the power single model
of random ensembles. Notice that by using large ensem-
bles (M → ∞), we can significantly reduce the variance,
and for completely random ensembles where the mod-
Bias2 (x) = (f (x) − EL,θ [ĝL
A
(x, {θ}])2
els are uncorrelated (ρ(x) = 0), maximally suppresses
M
the variance! Thus, using the aggregate predictor beats 1 X
= (f (x) − EL,θ [ĝL (x, θm ])2 (96)
down fluctuations due to finite-sample effects. The key, M m=1
as the formula indicates, is to decorrelate the models as
much as possible while still using a very large ensemble. = (f (x) − µL,θ )2 . (97)
One can be worried that this comes at the expense of a
very large bias. This turns out not to be the case. When
models in the ensemble are completely random, the bias Thus, for a random ensemble one can always add more
of the aggregate predictor is just the expected bias of a models without increasing the bias. This observation lies
behind the immense power of random forest methods dis-
cussed below. For other methods, such as bagging, we
will see that the bootstrapping procedure actually does
increase the bias. But in many cases, this increase in bias
is negligible compared to the reduction in variance.
38
oretical discussion above, we know that this can signifi- a stable procedure the accuracy is limited by the bias
cantly reduce the variance without increasing the bias. introduced by using bootstrapped datasets. This means
While simple and intuitive, this form of aggregation that there is an instability-to-stability transition point
clearly works only when we have enough data in each par- beyond which bagging stops improving our prediction.
titioned set Li . To see this, one can consider the extreme
limit where Li contains exactly one point. In this case,
the base hypothesis gLi (x) (e.g. linear regressor) becomes
extremely poor and the procedure above fails. One way
to circumvent this shortcoming is to resort to empir-
ical bootstrapping, a resampling technique in statis-
tics introduced by Efron (Efron, 1979) (see accompany-
ing box and Fig. 28). The idea of empirical bootstrap-
ping is to use sampling with replacement to create new
“bootstrapped” datasets {LBS 1 , . . . , LM } from our origi-
BS
and
M
X
BS
ĝL (x) = arg max I[gLBS
i
(x) = j]. (101)
j M̂ n ( D )
i=1
small changes in the training set result in large changes for k = 1, · · · , B. These are then used to evaluate the accu-
in predictions (Breiman, 1996). When the procedure is racy of M̂n (D) (see also box on Bootstrapping in main text).
unstable, the prediction error is dominated by the vari- It can be shown that in the n → ∞ limit the distribution
?(k)
ance and one can exploit the aggregation component of of Mn would be a Gaussian centered around M̂n (D) with
bagging to reduce the prediction error. In contrast, for variance σ 2 defined by Eq. (102) scales as 1/n.
40
x2
placement from D to get a new set D?(1) =
?(1) ?(1) 0
{X1 , · · · , Xn }, called a bootstrap sample,
which possibly contains repetitive elements. Then −1
we repeat the same procedure to get in total B
−2
such sets: D?(1) , · · · , D?(B) . The next step is to
use these B bootstrap sets to get the bootstrap −3
−1.0 −0.5 0.0 0.5 1.0
estimate of the quantity of interest. For example, x1
?(k)
let Mn = M edian(D?(k) ) be the sample median
of bootstrap data D?(k) . Then we can construct FIG. 29 Bagging applied to the perceptron learning
the variance of the distribution of bootstrap medi- algorithm (PLA). Training data size n = 500, number of
ans as : bootstrap datasets B = 25, each contains 50 points. Colors
corresponds to different classes while the marker indicates the
B 2 how these points are labelled: cross for true label and circle
1 X ?(k) for that obtained by bagging. Each gray dashed line indicates
Vd
arB (Mn ) = Mn − M̄n? , (102)
B−1 the prediction made, based on every bootstrap set while the
k=1
dark dashed black line is the average of these.
where
B
1 X ?(k) more ‘democratic’ ways as is done in bagging. In all cases,
M̄n? = Mn (103) the idea is to build a strong predictor by combining many
B
k=1
weaker classifiers.
is the mean of the median of all bootstrap sam- In boosting, an ensemble of weak classifiers {gk (x)}
ples. Specifically, Bickel and Freedman (Bickel and is combined into an aggregate, boosted classifier. How-
Freedman, 1981) and Singh (Singh, 1981) showed ever, unlike bagging, each classifier is associated with a
that in the n → ∞ limit, the distribution of the weight αk that indicates how much it contributes to the
bootstrap estimate will be a Gaussian centered aggregate classifier
around M̂n (D) = M edian(X1 , · · · √
, Xn ) with stan- M
dard deviation proportional to 1/ n. This means gA (x) =
X
αk gk (x), (104)
that the bootstrap distribution M̂n? − M̂n approxi- K=1
mates fairly well the sampling distribution M̂n −M
from which we obtain the training data D. Note where k αk = 1. For the reasons outlined above, boost-
P
that M is the median based on which the true dis- ing, like all ensemble methods, works best when we com-
tribution D is generated. In other words, if we plot bine simple, high-variance classifiers into a more complex
?(k)
the histogram of {Mn }B k=1 , we will see that in
whole.
the large n limit it can be well fitted by a Gaus- Here, we focus on “adaptive boosting” or AdaBoost,
sian which sharp peaks at M̂n (D) and vanishing first proposed by Freund and Schapire in the mid 1990s
variance whose definition is given by Eq. (102) (see (Freund et al., 1999; Freund and Schapire, 1995; Schapire
Fig. 28). and Freund, 2012). The basic idea behind AdaBoost, is
to form the aggregate classifier in an iterative process.
Importantly, at each iteration we reweight the error func-
C. Boosting tion to reinforce data points where the aggregate classifier
performs poorly. In this way, we can successively ensure
Another powerful and widely used ensemble method is that our classifier has good performance over the whole
Boosting. In bagging, the contribution of all predictors dataset.
is weighted equally in the bagged (aggregate) predictor. We now discuss the AdaBoost procedure in greater
However, in principle, there are myriad ways to combine detail. Suppose that we are given a data set L =
different predictors. In some problems one might prefer {(xi , yi ), i = 1, · · · , N } where xi ∈ X and yi ∈ Y =
to use an autocratic approach that emphasizes the best {+1, −1}. Our objective is to find an optimal hypoth-
predictors, while in others it might be better to opt for esis/classifier g : X → Y to classify the data. Let
41
E. Gradient Boosted Trees and XGBoost where the index i runs over data points and the index j
runs over decision trees in our ensemble. In XGBoost,
Before we turn to applications of these techniques, we the regularization function is chosen to be
briefly discuss one final class of ensemble method that
has become increasingly popular in the last few years: λX 2
T
(M ) 1.00
Note that by definition yi = gA (xi ). The central idea Train (coarse)
is that for large t, each decision tree is a small pertur- 0.95 Test (coarse)
Critical (coarse)
bation to the predictor (of order 1/K) and hence we can 0.90 Train (fine)
perform a Taylor expansion on our loss function to second Test (fine)
0.85 Critical (coarse)
Accuracy
order:
0.80
N
X (t−1) 0.75
Ct = l(yi , ŷi + gt (xi )) + Ω(gt )
0.70
i=1
≈ Ct−1 + ∆Ct , (110) 0.65
10 20 30 40 50 60 70 80 90 100
Nestimators
with
90
(t−1) 1 Fine
∆Ct = bi l(yi , ŷi )gt (xi ) + ai gt (xi )2 + Ω(gt ), (111) 80
2 Coarse
70
where
60
10
We then choose the t-th decision tree gt to minimize ∆Ct .
0
This is almost identical to how we derived the Newton 10 20 30 40 50 60 70 80 90 100
Nestimators
method update in the section on gradient descent, see
Sec. IV.
FIG. 32 Using Random Forests (RFs) to classify Ising Phases.
We can actually derive an expression for the param-
(Top) Accuracy of RFs for classifying the phase of samples
eters of gt that minimize ∆Ct analytically. To simplify from the Ising mode for the training set (blue), test set (red),
notation, it is useful to define the set of points xi that get and critical region (green) using coarse trees with a few leaves
mappedP to leaf j: Ij = {i P : qt (xi ) = j} and the functions (triangles) and fine decision trees with many leaves (filled cir-
Bj = i∈Ij bi and Aj = i∈Ij ai . Notice that in terms cles). RFs were trained on samples from ordered and disor-
of these quantities, we can write dered phases but were not trained on samples from the critical
region. (Bottom) The time it takes to train RFs scales linearly
T with the number of estimators in the ensemble.
X 1
∆Ct = [Bj wj + (Aj + λj )wj2 ] + γT, (114)
j=1
2
minimum of ∆Ctopt which is then added to the ensemble.
where we made the t-dependence of all parameters im- We emphasize that this is only a very high level sketch of
plicit. To find the optimal wj , just as in Newton’s method how the algorithm works. In practice, additional regular-
we take the gradient of the above expression with respect ization such as shrinkage and feature subsampling is also
to wj and set this equal to zero, to get used. In addition, there are many numerical and techni-
cal tricks used for the approximate algorithm and how to
Bj find splits of the data that give good decision trees (Chen
wjopt = − . (115)
Aj + λ and Guestrin, 2016).
Plugging this expression into ∆Ct gives
K F. Applications to the Ising model and Supersymmetry
1 X Bj2
∆Ctopt = − + γT. (116) Datasets
2 j=1 Aj + λ
We now illustrate some of these ideas using two exam-
It is clear that ∆Ctopt measures the in-sample performance ples drawn from physics: (i) classifying the phases of the
of gt and we should find the decision tree that minimizes spin configurations of the 2D-Ising model above and be-
this value. In principle, one could enumerate all possi- low the critical temperature using random forests and (ii)
ble trees over the data and find the tree that minimizes classifying Monte-Carlo simulations of collision events in
∆Ctopt . However, in practice this is impossible. Instead, the SUSY dataset as supersymmetric or standard using
an approximate greedy algorithm is run that optimizes an XGBoost implementation of gradient-boosted trees.
one level of the tree at a time by trying to find optimal Both examples were analyzed in Sec. VII.C using logistic
splits of the data. This leads to a tree that is a good local regression. Here we show that on the Ising dataset, the
44
Features
The Ising dataset used for classification by RFs here lepton 1 pT 351
lepton 2 pT 348
is identical to that used to study logistic regression in S_R 346
M_R 331
Sec. VII.C. We assign a label to each state according to R 319
lepton 1 phi 314
its phase: 0 if the state is disordered, and 1 if it is ordered. MT2 314
lepton 2 phi 290
We divide the dataset into three categories according to M_TR_2 283
the temperature at which samples are drawn: ordered M_Delta_R 273
(T /J < 2.0), near-critical (2.0 ≤ T /J ≤ 2.5) and disor- 0 100 200 300 400 500
F score
dered (T /J > 2.5) (see Figure 20). We use the ordered
and disordered states to train a random forest and eval- FIG. 33 Feature Importance Scores in SUSY dataset from
uate our learned model on a test set of unseen ordered applying XGBoost to 100, 000 samples. See Notebook 10 for
and disordered states (test sets). We also ask how well more details.
our RF can predict the phase of samples drawn in the
critical region (i.e. predict whether the temperature of
a critical sample is above or below the critical tempera- sification. The higher the Fscore, the more important
ture). Since our model is never trained on samples in the the feature for classification. Figure 33 shows the feature
critical region, prediction in this region is a test of the scores from our XGBoost algorithm for the production of
algorithm’s ability to generalize to new regions in phase electrically-charged supersymmetric particles (χ±) which
space. decay to W bosons and an electrically neutral supersym-
The results of fits using RFs to predict phases are metric particle χ0 , which is invisible to the detector. The
shown in Figure 32. We used two types of RF classifiers, features are a mix of eight directly measurable quanti-
one where the ensemble consists of coarse decision trees ties from the detector, as well as ten hand crafted fea-
with a few leaves and another with finer decision trees tures chosen using physics knowledge. Consistent with
with many leaves (see corresponding notebook). RFs the physics of these supersymmetric decays in the lepton
have extremely high accuracy on the training and test channel, we find that the most informative features for
sets (over 99%) for both coarse and fine trees. How- classification are the missing transverse energy along the
ever, notice that the RF consisting of coarse trees per- vector defined by the charged leptons (Axial MET) and
form extremely poorly on samples from the critical region the missing energy magnitude due to χ0 .
whereas the RF with fine trees classifies critical samples
with an accuracy of nearly 85%. Interestingly, and unlike
with logistic regression, this performance in the critical IX. AN INTRODUCTION TO FEED-FORWARD DEEP
region requires almost no parameter tuning. This is be- NEURAL NETWORKS (DNNS)
cause, as discussed above, RFs are largely immune to
overfitting problems even as the number of estimators in Over the last decade, neural networks have emerged
the ensemble becomes large. Increasing the number of es- as the one of most powerful and widely-used supervised
timators in the ensemble does increase performance but learning techniques. Deep Neural Networks (DNNs) have
at a large cost in computational time (Fig. 32 bottom). a long history (Bishop, 1995b; Schmidhuber, 2015), but
In the second application of ensemble methods to remerged to prominence after a rebranding as “Deep
physics-related datasets, we used the XGBoost imple- Learning” in the mid 2000s (Hinton et al., 2006; Hinton
mentation of gradient boosted trees to classify Monte- and Salakhutdinov, 2006). DNNs truly caught the atten-
Carlo collisions from the SUSY dataset. With default tion of the wider machine learning community and indus-
parameters using a small subset of the data (100, 000 out try in 2012 when Alex Krizhevsky, Ilya Sutskever, and
of the full 5, 000, 000 samples), we were able to achieve Geoff Hinton used a GPU-based DNN model (AlexNet)
a classification accuracy of about 79%, which could be to lower the error rate on the ImageNet Large Scale
improved to nearly 80% after some fine-tuning (see ac- Visual Recognition Challenge (ILSVRC) by an incred-
companying notebook). This is comparable to published ible twelve percent from 28% to 16% (Krizhevsky et al.,
results (Baldi et al., 2014) and those obtained using lo- 2012). Just three years later, a machine learning group
gistic regression in earlier chapters. One nice feature of from Microsoft achieved an error of 3.57% using an ultra-
ensemble methods such as XGBoost is that they auto- deep residual neural network (ResNet) with 152 layers
matically allow us to calculate feature scores (Fscores) (He et al., 2016)! Since then, DNNs have become the
that rank the importance of various features for clas- workhorse technique for many image and speech recogni-
45
tion based machine learning tasks. The large-scale indus- In physics, DNNs and CNNs have already found
trial deployment of DNNs has given rise to a number of numerous applications. In statistical physics, they
high-level libraries and packages (Caffe, Keras, Pytorch, have been applied to detect phase transitions in 2D
and TensorFlow) that make it easy to quickly code and Ising (Tanaka and Tomiya, 2017a) and Potts (Li
deploy DNNs. et al., 2017) models, lattice gauge theories (Wetzel and
Conceptually, it is helpful to divide neural networks Scherzer, 2017), and different phases of polymers (Wei
into four categories: (i) general purpose neural networks et al., 2017). It has also been shown that deep neural net-
for supervised learning, (ii) neural networks designed works can be used to learn free-energy landscapes (Sidky
specifically for image processing, the most prominent ex- and Whitmer, 2017). At the same time, methods from
ample of this class being Convolutional Neural Networks statistical physics have been applied to the field of deep
(CNNs), (iii) neural networks for sequential data such as learning to study the thermodynamic efficiency of learn-
Recurrent Neural Networks (RNNs), and (iv) neural net- ing rules (Goldt and Seifert, 2017) to explore the hy-
works for unsupervised learning such as Deep Boltzmann pothesis space that DNNs span, make analogies between
Machines. Here, we will limit our discussions to the first training DNNs and spin glasses (Baity-Jesi et al., 2018;
two categories (unsupervised learning is discussed later Baldassi et al., 2017), and to characterize phase transi-
in the review). Though increasingly important for many tions with respect to network topology in terms of er-
applications such as audio and speech recognition, for rors (Li and Saad, 2017). In relativistic hydrodynam-
the sake of brevity, we omit a discussion of sequential ics, deep learning has been shown to capture features of
data and RNNs from this review. For an introduction non-linear evolution and has the potential to accelerate
to RNNs and LSTM networks see Chris Olah’s blog, and numerical simulations (Huang et al., 2018), while in me-
Chapter 13 of (Bishop, 2006) as well as the introduction chanics CNNs have been used to predict eigenvalues of
to RNNs in Chapter 10 of (Goodfellow et al., 2016) for photonic crystals (Finol et al., 2018). Recently, DNNs
sequential data. have been used to improve the efficiency of Monte-Carlo
Due to the number of recent books on deep learning algorithms (Shen et al., 2018).
(see for example Michael Nielsen’s introductory online Deep learning has also found interesting applications
book (Nielsen, 2015) and the more advanced (Goodfellow in quantum physics. Various quantum phase transi-
et al., 2016)), the goal of this section is to give a high- tions (Arai et al., 2017; Broecker et al., 2017; Iakovlev
level introduction to the basic ideas behind of DNNs, et al., 2018; van Nieuwenburg et al., 2017b; Suchs-
and provide some practical knowledge for coding simple land and Wessel, 2018) can be detected and studied
neural nets for supervised learning tasks. This section using DNNs and CNNs, including the transverse-field
assumes the reader is familiar with the basic concepts Ising model (Ohtsuki and Ohtsuki, 2017), topological
introduced in earlier sections on logistic and linear re- phases (Yoshioka et al., 2017; Zhang et al., 2017a,b),
gression. Throughout, we strive to provide intuition be- and even non-equilibrium many-body localization (van
hind the inner workings of DNNs, as well as highlight Nieuwenburg et al., 2017a,b; Schindler et al., 2017;
limitations of present-day algorithms. Venderley et al., 2017). Representing quantum states
The influx of corporate and industrial interests has as DNNs (Gao et al., 2017; Gao and Duan, 2017; Levine
rapidly transformed the field in the last few years. This et al., 2017; Saito and Kato, 2017) and quantum state
massive influx of money and researchers has given rise to tomography (Torlai et al., 2017) are among some of
new dogmas and best practices that change rapidly. As the impressive achievements to reveal the potential of
with most intellectual fields experiencing rapid expan- DNNs to facilitate the study of quantum systems. Ma-
sion, many commonly accepted heuristics many turn out chine learning techniques involving neural networks were
not to be as powerful as thought (Wilson et al., 2017), also used to study quantum (Baireuther et al., 2017;
and widely held beliefs not as universal as once imagined Breuckmann and Ni, 2017; Davaasuren et al., 2018; Kras-
(Lee et al., 2017; Zhang et al., 2016). This is especially tanov and Jiang, 2017; Maskara et al., 2018) and fault-
true in modern neural networks where results are largely tolerant (Chamberland and Ronagh, 2018) error correc-
empirical and heuristic and lack the firm footing of many tion, estimate rates of coherent and incoherent quantum
earlier machine learning methods. For this reason, in processes (Greplova et al., 2017), and the recognition of
this review we have chosen to emphasize tried and true state and charge configurations and auto-tuning in quan-
fundamentals, while pointing out what, from our current tum dots (Kalantre et al., 2017). In quantum information
vantage point, seem like promising new techniques. The theory, it has been shown that one can perform gate de-
field is rapidly evolving and readers are urged to read compositions with the help of neural nets (Swaddle et al.,
papers and to implement these algorithms themselves in 2017). In lattice quantum chromodynamics, DNNs have
order to gain a deep appreciation for the incredible power been used to learn action parameters in regions of pa-
of modern neural networks, especially in the context of rameter space where PCA fails (Shanahan et al., 2018).
image, speech, and natural language processing, as well Last but not least, DNNs also found place in the study
as limitations of the current methods. of quantum control (Yang et al., 2017), and in scattering
46
{ output
layer
see Figure 34.
Historically in the neural network literature, common
{
choices of nonlinearities included step-functions (percep-
{
input trons), sigmoids (i.e. Fermi functions), and the hyper-
layer
bolic tangent. More recently, it has become more com-
mon to use rectified linear functions (ReLUs), leaky rec-
tified linear units (leaky ReLUs), and exponential linear
units (ELUs) (see Figure 35). Different choices of non-
FIG. 34 Basic architecture of neural networks. (A)
linearities lead to different computational and training
The basic components of a neural network are stylized neu-
rons consisting of a linear transformation that weights the properties for neurons. The underlying reason for this is
importance of various inputs, followed by a non-linear activa- that we train neural nets using gradient descent based
tion function. (b) Neurons are arranged into layers with the methods, see Sec. IV, that require us to take derivatives
output of one layer serving as the input to the next layer. of the neural input-output function with respect to the
weights w(i) and the bias b(i) . Notice that the derivatives
of the aforementioned non-linearities σ(z) have very dif-
theory to learn s-wave scattering length (Wu et al., 2018) ferent properties. The derivative of the perceptron is zero
of potentials. everywhere except where the input is zero. This discon-
tinuous behavior makes it impossible to train perceptrons
using gradient descent. For this reason, until recently the
A. Neural Network Basics most popular choice of non-linearity was the tanh func-
tion or a sigmoid/Fermi function. However, this choice
Neural networks (also called neural nets) are neural- of non-linearity has a major drawback. When the input
inspired nonlinear models for supervised learning. As weights become large, as they often do in training, the
we will see, neural nets can be viewed as natural, more activation function saturates and the derivative of the
powerful extensions of supervised learning methods such output with respect to the weights tends to zero since
as linear and logistic regression and soft-max methods. ∂σ/∂z → 0 for z 1. Such “vanishing gradients” are
a feature of any saturating activation function (top row
of Fig. 35), making it harder to train deep networks. In
contrast, for a non-saturating activation function such as
1. The basic building block: neurons
ReLUs or ELUs, the gradients stay finite even for large
inputs.
The basic unit of a neural net is a stylized “neuron” i
that takes a vector of inputs x = (x1 , x2 , . . . , xd ) and pro-
duces a scalar output ai (x). A neural network consists of 2. Layering neurons to build deep networks: network
many such neurons stacked into layers, with the output architecture.
of one layer serving as the input for the next (see Figure
34). The first layer in the neural net is called the input The basic idea of all neural networks is to layer neurons
layer, the middle layers are often called “hidden layers”, in a hierarchical fashion, the general structure of which is
and the final layer is called the output layer. known as the network architecture (see Fig. 34). In the
The exact function ai varies depending on the type of simplest feed-forward networks, each neuron in the in-
non-linearity used in the neural network. However, in put layer of the neurons takes the inputs x and produces
essentially all cases ai can be decomposed into a linear an output ai (x) that depends on its current weights, see
operation that weights the relative importance of the var- Eq. (118). The outputs of the input layer are then treated
ious inputs and a non-linear transformation σi (z) which as the inputs to the next hidden layer. This is usually
is usually the same for all neurons. The linear trans- repeated several times until one reaches the top or output
formation in almost all neural networks takes the form layer. The output layer is almost always a simple clas-
of a dot product with a set of neuron-specific weights sifier of the form discussed in earlier sections: a logistic
(i) (i) (i)
w(i) = (w1 , w2 , . . . , wd ) followed by re-centering with regression or soft-max function in the case of categorical
47
0 0 0
-1 -1 -1
-5 0 5 -5 0 5 -5 0 5
ReLU Leaky ReLU ELU
6 6 6
4 4 4
2 2 2
0 0 0
-5 0 5 -5 0 5 -5 0 5
FIG. 35 Possible non-linear activation functions for neurons. In modern DNNs, it has become common to use non-linear
functions that do not saturate for large inputs (bottom row) rather than saturating functions (top row).
data (i.e. discrete labels) or a linear regression layer in hidden layers (hence the ‘deep’ in deep learning). There
the case of continuous outputs. Thus, the whole neural are many ideas of why such deep architectures are fa-
network can be thought of as a complicated nonlinear vorable for learning. Increasing the number of layers in-
transformation of the inputs x into an output ŷ that de- creases the number of parameters and hence the represen-
pends on the weights and biases of all the neurons in the tational power of neural networks. Indeed, recent numer-
input, hidden, and output layers. ical experiments suggests that as long as the number of
The use of hidden layers greatly expands the represen- parameters is larger than the number of data points, cer-
tational power of a neural net when compared with a sim- tain classes of neural networks can fit arbitrarily labeled
ple soft-max or linear regression network. Perhaps, the random noise samples (Zhang et al., 2016). This suggests
most formal expression of the increased representational that large neural networks of the kind used in practice can
power of neural networks (also called the expressivity) is express highly complex functions. Adding hidden layers
the universal approximation theorem which states that a is also thought to allow neural nets to learn more complex
neural network with a single hidden layer can approxi- features from the data. Work on convolutional networks
mate any continuous, multi-input/multi-output function suggests that the first few layers of a neural network learn
with arbitrary accuracy. The reader is strongly urged simple, “low-level” features that are then combined into
to read the beautiful graphical proof of the theorem in higher-level, more abstract features in the deeper layers.
Chapter 4 of Nielsen’s free online book (Nielsen, 2015). Other works suggest that it is computationally and al-
The basic idea behind the proof is that hidden neurons gorithmically easier to train deep networks rather than
allow neural networks to generate step functions with ar- shallow, wider nets, though this is still an area of major
bitrary offsets and heights. These can then be added controversy and active research (Mhaskar et al., 2016).
together to approximate arbitrary functions. The proof Choosing the exact network architecture for a neural
also makes clear that the more complicated a function, network remains an art that requires extensive numer-
the more hidden units (and free parameters) are needed ical experimentation and intuition, and is often times
to approximate it. Hence, the applicability of the ap- problem-specific. Both the number of hidden layers and
proximation theorem to practical situations should not the number of neurons in each layer can affect the per-
be overemphasized. In physics, a good analogy are ma- formance of a neural network. There seems to be no
trix product states, which can approximate any quantum single recipe for the right architecture for a neural net
many-body state to an arbitrary accuracy, provided the that works best. However, a general rule of thumb that
bond dimension can be increased arbitrarily – a severe seems to be emerging is that the number of parameters in
requirement not met in any practical implementation of the neural net should be large enough to prevent under-
the theory. fitting (see theoretical discussion in (Advani and Saxe,
Modern neural networks generally contain multiple 2017)).
48
Empirically, the best architecture for a problem de- For categorical data, the most commonly used loss
pends on the task, the amount and type of data that is function is the cross-entropy (Eq.(76) and Eq.(81)), since
available, and the computational resources at one’s dis- the output layer is often taken to be a logistic classifier
posal. Certain architectures are easier to train, while for binary data with two types of labels, or a soft-max
others might be better at capturing complicated depen- classifier if there are more than two types of labels. The
dencies in the data and learning relevant input features. cross-entropy was already discussed extensively in earlier
Finally, there have been numerous works that move be- sections on logistic regression and soft-max classifiers, see
yond the simple deep, feed-forward neural network ar- Sec. VII. Recall that for classification of binary data, the
chitectures discussed here. For example, modern neural output of the top layer of the neural network is the prob-
networks for image segmentation often incorporate “skip ability ŷi (w) = p(yi = 1|xi ; w) that data point i is pre-
connections” that skip layers of the neural network (He dicted to be in category 1. The cross-entropy between
et al., 2016). This allows information to directly propa- the true labels yi ∈ {0, 1} and the predictions is given by
gate to a hidden or output layer, bypassing intermediate
n
layers and often improving performance. X
E(w) = − yi log ŷi (w) + (1 − yi ) log [1 − ŷi (w)] .
i=1
B. Training deep networks More generally, for categorical data, y can take on M
values so that y ∈ {0, 1, . . . , M − 1}. For each datapoint
In the previous section, we introduced the basic ar- i, define a vector yim called a ‘one-hot’ vector, such that
chitecture for neural networks. Here we discuss how to
efficiently train large neural networks. Luckily, the basic
(
1, if yi = m
procedure for training neural nets is the same as we used yim = (121)
0, otherwise.
for training simpler supervised learning algorithms, such
as logistic and linear regression: construct a cost/loss
We can also define the probability that the neural net-
function and then use gradient descent to minimize the
work assigns a datapoint to category m: ŷim (w) = p(yi =
cost function. Neural networks differ from these simpler
m|xi ; w). Then, the categorical cross-entropy is defined
supervised procedures in that generally they contain mul-
as
tiple hidden layers that make taking the gradient more
computationally difficult. We will return to this in the n M
X X −1
next section which discusses the “backpropagation” algo- E((w)) = − yim log ŷim (w)
rithm for computing gradients. i=1 m=0
Like all supervised learning procedures, the first thing + (1 − yim ) log [1 − ŷim (w)] . (122)
one must do to train a neural network is to specify a loss
As in linear and logistic regression, this loss function is
function. Given a data point (xi , yi ), the neural network
often supplemented by additional terms that implement
makes a prediction ŷi (w), where w are the parameters
regularization.
of the neural network. Recall that in most cases, the
Having defined an architecture and a cost function,
top output layer of our neural net is either a continuous
we must now train the model. As with other supervised
predictor or a classifier that makes discrete (categorical)
learning methods, we make use of gradient descent-based
predictions. Depending on whether one wants to make
methods to optimize the cost function. Recall that the
continuous or categorical predictions, one must utilize a
basic idea of gradient descent is to update the parame-
different kind of loss function.
ters w to move in the direction of the gradient of the cost
For continuous data, the loss functions that are com-
function ∇w E(w). In Sec. IV, we discussed numerous
monly used to train neural networks are identical to those
optimizers that implement variations of stochastic gra-
used in linear regression, and include the mean squared
dient descent (SGD, Nesterov, RMSProp, Adam, etc.)
error
Most modern neural network packages, such as Keras,
1 X allow the user to specify which of these optimizers they
E(w) = (yi − ŷi (w))2 , (119)
N i would like to use in order to train the neural network.
Depending on the architecture, data, and computational
where N is the number of data points, and the mean- resources, different optimizers may work better on the
absolute error (i.e. L1 norm) problem, though vanilla SGD is a good first choice.
1 X Finally, we note that unlike in linear and logistic re-
E(w) = |yi − ŷi (w)|. (120) gression, calculating the gradients for a neural network
N i
requires a specialized algorithm, called backpropogation
The full cost-function often includes additional terms (often abbreviated backprop) which forms the heart of
that implement regularization (e.g. L1 or L2 regulariz- any neural network training procedure. Backpropaga-
ers). tion has been discovered multiple times independently
49
load_data()
15 # reshape data form, we can build our first neural network. Let’s cre-
16 X_train = X_train.reshape(X_train.shape[0], ate an instance of Keras’ Sequential() class, and call
N_features).astype(’float32’) it model. As the name suggests, this class allows us to
50
model accuracy
volutional layers. Every Dense() layer accepts as its first 0.92
required argument an integer which specifies the number
of neurons. The type of activation function for the layer 0.90
is defined using the activation optional argument, the
0.88
input of which is the name of the activation function
in string format. Examples include ’relu’, ’tanh’, 0.86
’elu’, ’sigmoid’, ’softmax’, see Fig. 35. In order
for our DNN to work properly, we must ensure that 0.84
the numbers of input and output neurons for each layer
0 2 4 6 8
match. Therefore, we specify the shape of the input in epoch
the first layer of the model explicitly using the optional
argument input_shape=(N_features,), see line 37 be- FIG. 37 Model accuracy of the DNN defined in the main text
low. The sequential construction of the model then al- to study the MNIST problem as a function of the training
lows Keras to infer the correct input/output dimensions epochs.
of all hidden layers automatically. Hence, we only need
to specify the size of the softmax output layer to match
the number of categories, see line 45. arguments for the optimizer, loss, and the validation
metric as follows:
33 ##### create deep neural network
34 # instantiate model 47 ##### choose loss function, optimizer, and metric
35 model = Sequential() 48 # compile the model
36 # add a dense all-to-all sigmoid layer 49 model.compile(
37 model.add(Dense(100,input_shape=(N_features,), 50 optimizer=keras.optimizers.SGD(lr=0.01,
activation=’sigmoid’)) momentum=0.9),
38 # add a dense all-to-all tanh layer 51 loss=keras.losses.
39 model.add(Dense(400, activation=’tanh’)) categorical_crossentropy,
40 # add a dense all-to-all relu layer 52 metrics=[’accuracy’]
41 model.add(Dense(400, activation=’relu’)) 53 )
# add a dense all-to-all elu layer
Training the DNN is a one-liner using the fit()
42
43 model.add(Dense(50, activation=’elu’))
44 # add a dense soft-max layer method of the Sequential class. The first two re-
45 model.add(Dense(N_categories, activation=’softmax’)) quired arguments are the training input and output
data. As optional arguments, we specify the mini-
Next, we choose the loss function according to which we batch_size, the number of training epochs, and the test
will train the DNN. For classification problems, this is the or validation_data. To monitor the training procedure
cross-entropy, and since the output data was cast in cate- for every epoch, we set verbose=True.
gorical form, we choose the categorical_crossentropy
defined in Keras’ losses module. Depending on the 55 ##### train model using minibatches
# train DNN
problem of interest, one can pick another suitable loss
56
57 history=model.fit(X_train, Y_train,
function. To optimize the weights of the net, we choose 58 batch_size=64,
SGD. This algorithm is available to use under Keras‘ 59 epochs=10,
optimizers module, and we could use Adam() or any 60 validation_data=(X_test, Y_test),
other built-in algorithm as well. The parameters for the 61 verbose=True
optimizer, such as lr (learning rate) or momentum are 62 )
passed using the corresponding optional arguments of the
SGD() function. All available arguments can be found in
Keras’ online documentation. While the loss function D. The backpropagation algorithm
and the optimizer are essential for the training proce-
dure, to test the performance of the model one may want In the last section, we saw how to deploy a high-level
to look at a particular metric of performance. For in- package, Keras, to design and train a simple neural net-
stance, in categorical tasks one typically looks at their work. This training procedure requires us to be able to
’accuracy’, which is defined as the percentage of cor- calculate the derivative of the cost function with respect
rectly classified data points. To complete the definition of to all the parameters of the neural network (the weights
our model, we use the compile() method, with optional and biases of all the neurons in the input, hidden, and
51
0.4 ∂E
∆L
j = . (125)
∂zjL
model loss
0.3
This definition is the first of the four backpropagation
equations.
0.2 We can analogously define the error of neuron j in
layer l, ∆lj , as the change in the cost function w.r.t. the
0.1 weighted input zjl :
0 2 4 6 8 ∂E ∂E 0 l
∆lj = l
= σ (zj ), (I)
epoch ∂zj ∂alj
FIG. 38 Model loss of the DNN defined in the main text where σ 0 (x) denotes the derivative of the non-linearity
to study the MNIST problem as a function of the training σ(·) with respect to its input evaluated at x. Notice
epochs. that the error function ∆lj can also be interpreted as the
partial derivative of the cost function with respect to the
bias blj , since
visible layers). A brute force calculation is out of the
question since it requires us to calculate as many gradi-
∂E ∂E ∂blj ∂E
ents as parameters at each step of the gradient descent. ∆lj = = = l, (II)
The backpropagation algorithm (Rumelhart and Zipser, ∂zjl ∂blj ∂zjl ∂bj
1985) is a clever procedure that exploits the layered struc-
ture of neural networks to more efficiently compute gra- where in the last line we have used the fact that
dients (for a more detailed discussion with Python code ∂blj /∂zjl = 1. This is the second of the four backprop-
examples see Chapter 2 of (Nielsen, 2015)). agation equations.
We now derive the final two backpropagation equations
using the chain rule. Since the error depends on neurons
1. Deriving and implementing the backpropagation equations in layer l only through the activation of neurons in the
subsequent layer l + 1, we can use the chain rule to write
At its core, backpropagation is simply the ordinary
∂E X ∂E ∂z l+1
chain rule for partial differentiation, and can be summa- ∆lj = = k
rized using four equations. In order see this, we must first ∂zjl k
∂zk
l+1 ∂z l
j
establish some useful notation. We will assume that there X ∂zkl+1
are L layers in our network with l = 1, . . . , L indexing the = ∆l+1
k
layer. Denote by wjk l
the weight for the connection from k
∂zjl
the k-th neuron in layer l − 1 to the j-th neuron in layer
!
X
l. We denote the bias of this neuron by blj . By construc- = ∆l+1 l+1
k wkj σ 0 (zjl ). (III)
tion, in a feed-forward neural network the activation alj k
of the j-th neuron in the l-th layer can be related to the This is the third backpropagation equation. The final
activities of the neurons in the layer l − 1 by the equation equation can be derived by differentiating of the cost
X
! function with respect to the weight wjk
l
as
l−1
l
aj = σ wjk ak + bj = σ(zjl ),
l l
(123)
k ∂E ∂E ∂zjl
l
= = ∆lj al−1
k (IV)
∂wjk ∂zjl wjk
l
where we have defined the linear weighted sum
X Together, Eqs. (I), (II), (III), and (IV) define the four
zjl = l
wjk akl−1 + blj . (124)
backpropagation equations relating the gradients of the
k
activations of various neurons alj , the weighted inputs
P l l−1 l
By definition, the cost function E depends directly on zjl = k wjk ak +bj , and the errors ∆lj . These equations
the activities of the output layer aL
j . It of course also indi- can be combined into a simple, computationally efficient
rectly depends on all the activities of neurons in lower lay- algorithm to calculate the gradient with respect to all
ers in the neural network through iteration of Eq. (123). parameters (Nielsen, 2015).
52
Standard
1. Implicit regularization using SGD: initialization,
Neural Net
hyper-parameter tuning, and early stopping
Batch Normalization is a regularization scheme that In Sec. IX.C, we demonstrated that the numerical
has been quickly adopted by the neural network com- implementation of DNNs is greatly facilitated by open
munity since its introduction in 2015 (Ioffe and Szegedy, source python packages, such as Keras, TensorFlow, and
2015). The basic inspiration behind Batch Normalization Pytorch, just to name a few. The complexity and learn-
is the long-known observation that training in neural net- ing curves for these packages differ, depending on the
works works best when the inputs are centered around reader’s level of familiarity with python. There are DNN
zero with respect to the bias. The reason for this is that packages written in other languages, such as Caffe which
it prevents neurons from saturating and gradients from is written in C++, but we do not use them in this review.
vanishing in deep nets. In the absence of such center- Keras is a high-level framework which does not require
ing, changes in parameters in lower layers can give rise any knowledge about the inner workings of the under-
to saturation effects in higher layers, and vanishing gra- lying deep learning algorithms. Coding DNNs in Keras
dients. The idea of Batch Normalization is to introduce is particularly simple, see Sec. IX.C, and allows one to
additional new “BatchNorm" layers that standardize the quickly grasp the big picture behind the theoretical con-
inputs by the mean and variance of the mini-batch. cepts which we introduced above. However, for advanced
Consider a layer l with d neurons whose inputs are applications, which may require more direct control over
(z1l , . . . , zdl ). We standardize each dimension so that the operations in between the layers, Keras’ high-level
design may prove insufficient.
zkl − E[zkl ]
ẑkl = q , (127) If one opens up the Keras black box, one will find that
Var[zkl ] it wraps the functionality of another package – Tensor-
Flow8 . Over the last years, TensorFlow, which is sup-
where the mean and expectation are taken over all sam- ported by Google, has been gaining popularity and has
ples in the mini-batch. One problem with this procedure become the preferred library for deep learning and is used
is that it may change the representational power of the in Kaggle competitions, university classes, and industry.
neural network. For example, for tanh non-linearities, it In TensorFlow one constructs data flow graphs, the nodes
may force the network to live purely in the linear regime of which represent mathematical operations, while the
around z = 0. Since non-linearities are crucial to the edges encode multidimensional tensors (data arrays). A
representational power of DNNs, this could dramatically deep neural net can then be thought of as a graph with
alter the power of the DNN. For this reason, one intro- a particular architecture. One needs to understand this
duces two new parameters γkl and βkl for each neuron that concept well before one can truly unleash TensorFlow’s
can shift and scale the normalized input full potential. Unfortunately, the learning curve can be
ŷkl = γkl ẑkl + βkl . (128) rather steep for TensorFlow, and requires a certain de-
gree of perseverance and time to internalize the underly-
These new parameters are then learned just like the
ing idea.
weights and biases using backpropagation (since this is
There are, however, many other open source packages
just an extra layer for the chain rule). We initialize the
which allow for control over the inter- and intra-layer op-
neural network so that at the beginning of training the
erations, without the need to introduce computational
inputs are being standardized. Backpropagation than ad-
graphs. Such an example is Pytorch, which offers li-
justs γ and β during training.
braries for automatic differentiation of tensors at GPU
In practice, Batch Normalization considerably im-
speed. As we discussed above, manipulating neural nets
proves the learning speed by preventing gradients from
boils down to fast array multiplication and contraction
vanishing. However, it also seems to serve as a power-
operations and, therefore, the torch.nn library often
ful regularizer for reasons that are not fully understood.
does the job of providing enough access and controlla-
One plausible explanation is that in batch normalization,
bility to manipulate the linear algebra operations under-
the gradient for a sample depends not only on the sam-
lying deep neural nets.
ple itself but also on all the properties of the mini-batch.
For the benefit of the reader, we have prepared Jupyter
Since a single sample can occur in different mini-batches,
notebooks for DNNs using all three packages for the deep
this introduces additional randomness into the training
learning problems we discuss below. We invite the reader
procedure which seems to help regularize training.
to carefully examine the differences in the code which
should help them decide on which package they prefer to
F. Deep neural networks in practice: examples use.
2. Approaching the learning problem the original values of the data differ by orders of magni-
tude, training can be slowed down or impeded. This is
Let us now analyze a typical procedure for using neu- related to the vanishing and exploding gradient problem
ral networks to solve supervised learning problems. As in backprop discussed in Sec. IX.D. Therefore, one uses
can be seen already from the code snippets in Sec. IX.C, two tricks here: (i) all data should be mean-centered,
constructing a deep neural network to solve ML problems i.e. from every data point we subtract the mean of the
is a multiple-stage process. Generally, one can identify a entire dataset, and (ii) rescale the data, for which there
set of key steps: are two ways: if the data is approximately normally
distributed, one can rescale by the standard deviation.
1. Collect and pre-process the data.
Otherwise, it is typically rescaled by the maximum abso-
2. Define the model and its architecture. lute value so the rescaled data lies in the interval [−1, 1].
Rescaling ensures that the weights of the DNN are of a
3. Choose the cost function and the optimizer. similar order of magnitude.
4. Train the model. Oftentimes insufficient data serves as a major bottle-
neck on the ultimate performance of DNNs. In such cases
5. Evaluate and study the model performance one can consider data augmentation, i.e. distorting data
on the validation and test data. samples from the existing dataset in some way to enhance
size the dataset. Obviously, if one knows how to do this,
6. Adjust the hyperparameters (and, if neces-
one already has partial information about the important
sary, network architecture) to optimize per-
features in the data.
formance for the specific dataset.
Whereas it is always possible to view Steps 1-5 as
At this point, a few remarks are in order. While we generic and independent of the particular problem we
treat Step 1 above as consisting mainly of loading and re- are trying to solve, it is only when these steps are put
shaping a dataset prepared ahead of time, we emphasize together in Step 6 that the real benefit of deep learning
that getting data into an appropriate form is an insepa- is revealed, compared to less sophisticated methods such
rable part of the learning process and usually difficult. as regression or bagging, see Secs. VI, VII, and VIII. The
One of the first questions we are usually faced with is optimal choice of network architecture, cost function, and
how to determine the sizes of the training and test data optimizer is determined by the properties of the training
sets. The MNIST dataset, which has 10 classification and test datasets, which are only revealed when we try to
categories, uses 80% of the available data for training improve the model. A typical strategy of exploring the
and 20% for testing. On the other hand, the ImageNet hyperparameter landscape is to use grid searches.
data which has 1000 categories is split 50% − 50%. As While there is no single recipe to approach all ML prob-
a rule of thumb, the more classification categories there lems, we believe that the above to-do list gives a good
are in the task, the closer the sizes of the training and overview and can be a useful guideline to the layman.
test datasets should be in order to prevent overfitting. Furthermore, as should be clear, this ‘recipe’ can be ap-
Once the size of the training set is fixed, it is common to plied to general ML tasks, not just DNNs. We refer the
reserve 20% of it for validation, which is used to fine-tune reader to Sec. XI for more useful hints and tips on how
the hyperparameters of the model. to use the validation data during the training process.
The next issue is how to choose the right hyperpa-
rameters to begin training the model with. For instance,
according to Bengio, the optimal learning rate is often an 3. SUSY dataset
order of magnitude lower than the smallest learning rate
that blows up the loss (Bengio, 2012). One should also In this section, we discuss a DNN approach to the
keep in mind that, depending on how ambitious a prob- SUSY dataset introduced in the context of logistic regres-
lem one is tackling, training the model can take a consid- sion in Sec. VII.C.2, and Bagging in Sec. VIII.F. There
erable amount of time. This can severely slow down any is, however, an interest in using deep learning methods
progress on improving the model in Step 6. Therefore, it to automatically discover collision features. Benchmark
is usually a good idea to play with a small enough per- results using Bayesian Decision Trees from a standard
centage of the training data to get a rough feeling about physics package, and five-layer neural networks using the
the correct hyperparameter regimes, the usefulness of the dropout algorithm were presented in the original pa-
DNN architecture, and to debug one’s code. The size of per (Baldi et al., 2014) to compare the ability of deep
this small ‘play set’ should be such that training on it can learning to bypass the need of using such high-level fea-
be done fast and in real time to allow to quickly adjustthe tures. For a detailed description of the SUSY dataset
hyperparameters. and the corresponding classification problem, we refer the
Also related to data preprocessing is the standardiza- reader to Sec. VII.C.2. Our goal here is to study system-
tion of the dataset. It has been found empirically that if atically the accuracy of a DNN classifier as a function of
56
1.0 1.0
1e-05 0.0001 0.001 0.01 0.1
1e-06 1e-05 0.0001 0.001 0.01 0.1
1000 54.5% 64.0% 73.5% 77.0% 72.5% 0.8 0.8
1 50.5% 49.4% 53.7% 49.5% 52.6% 43.0%
hidden neurons
data set size
10000 55.0% 57.2% 74.8% 78.1% 79.0% 0.6 10 37.2% 55.4% 68.3% 90.1% 96.4% 99.6% 0.6
100000 47.8% 63.5% 74.4% 78.6% 79.6% 0.4 100 55.6% 64.9% 43.2% 90.6% 99.6% 99.9% 0.4
1.0
FIG. 40 Grid search results for the test set accuracy of the
DNN for the SUSY problem as a function of the learning rate
1e-06 1e-05 0.0001 0.001 0.01 0.1
and the size of the dataset. The data used includes all high-
and low-level features. 0.8
1 50.2% 49.7% 66.7% 49.9% 52.3% 47.1%
hidden neurons
the learning rate and the dataset size. 10 47.1% 67.6% 63.8% 67.9% 83.5% 84.4% 0.6
Unlike the MNIST example where we used Keras, here
we use the opportunity to introduce the Pytorch package, 100 55.8% 58.3% 54.9% 67.5% 73.4% 91.2% 0.4
c.f. corresponding notebook. We leave the discussion of
the code-specific details for the accompanying notebook.
1000 42.7% 43.0% 56.7% 58.5% 74.5% 96.2%
To classify the SUSY collision events, we construct a 0.2
DNN with two dense hidden layers of 200 and 100 neu- learning rate
rons, respectively. We use ReLU activation between the
input and the hidden layers, and a sofmax output layer. 0.0
We apply dropout regularization on the weights of the
DNN. Similar to MNIST, we use the cross-entropy as a FIG. 41 Grid search results for the test set accuracy (top)
cost function and minimize it using SGD with batches of and the critical set accuracy (bottom) of the DNN for the
size 10% of the training data size. We train the DNN Ising classification problem as a function of the learning rate
over 10 epochs. and the number of hidden neurons.
Figure 40 shows the accuracy of our DNN on the test
data as a function of the learning rate and the size of the
dataset. It is considered good practice to start with a hidden neurons and the learning rate. The discussion is
logarithmic scale to search through the hyperparameters, accompanied by a notebook written in TensorFlow. As in
to get an overall idea for the order of magnitude of the the previous example, the interested reader can find the
optimal values. In this example, the performance peaks discussion of the code-specific details in the notebook.
at the largest size of the dataset and a learning rate of To classify whether a given spin configuration is in the
0.1, and is of the order of 80%. For comparison, in the ordered or disordered phase, we construct a minimalistic
original study (Baldi et al., 2014), the authors achieved model for a DNN with a single hidden layer containing
≈ 89% by using the entire dataset with 5, 000, 000 points a number of hidden neurons. The network architecture
and a more sophisticated network architecture, trained thus includes a ReLU-activated input layer, the hidden
using GPUs. layer, and the softmax output layer. We pick the cate-
gorical cross-entropy as a cost function and minimize it
using SGD with mini-batches of size 100. We train the
4. Phases of the 2D Ising model DNN over 100 epochs.
Figures 41 show the outcome of a grid search over a
In this section, we discuss a DNN approach to the Ising log-spaced learning rate and the number of neurons in the
dataset introduced in Sec. VII.C.1. We study the prob- hidden layer. We see that about 10 neurons are enough
lem of classifying the states of the 2D Ising model with at a learning rate of 0.1 to get to a very high accuracy on
a DNN (Tanaka and Tomiya, 2017a), focussing on the the test set. However, if we aim at capturing the physics
model performance as a function of both the number of close to criticality, clearly more neurons are required to
57
reliably learn the more complex correlations in the Ising physical models for quantum condensed matter systems
states. (Levine et al., 2017).
One of the core lessons of physics is that we should ex- A convolutional neural network is a translationally in-
ploit symmetries and invariances when analyzing physi- variant neural network that respects locality of the in-
cal systems. Properties such as locality and translational put data. CNNs are the backbone of many modern deep
invariance are often built directly into the physical laws. learning applications and here we just give a high-level
Our statistical physics models often directly incorporate overview of CNNs that should allow the reader to delve
everything we know about the physical system being an- directly into the specialized literature. The reader is also
alyzed. For example, we know that in many cases it is strongly encouraged to consult the excellent, succinct
sufficient to consider only local couplings in our Hamilto- notes for the Stanford CS231n Convolutional Neural Net-
nians, or work directly in momentum space if the system works class developed by Andrej Karpathy and Fei-Fei Li
is translationally invariant. This basic idea, tailoring our (https://cs231n.github.io/). We have drawn heavily
analysis to exploit additional structure, is a key feature of on the pedagogical style of these notes in crafting this
modern physical theories from general relativity, through section.
gauge theories, to critical phenomena. There are two kinds of basic layers that make up a
Like physical systems, many datasets and supervised CNN: a convolution layer that computes the convolution
learning tasks also possess additional symmetries and (as a mathematical operation, see this practical guide to
structure. For instance, consider a supervised learning image kernels) of the input with a bank of filters, and
task where we want to label images from some dataset as pooling layers that coarse-grain the input while main-
being pictures of cats or not. Our statistical procedure taining locality and spatial structure, see Fig. 42. For
must first learn features associated with cats. Because two-dimensional data, a layer l is characterized by three
a cat is a physical object, we know that these features numbers: height Hl , width Wl , and depth Dl . The height
are likely to be local (groups of neighboring pixels in the and width correspond to the sizes of the two-dimensional
two-dimensional image corresponding to whiskers, tails, spatial (Wl , Hl )-plane (in neurons), and the depth Dl the
eyes, etc). We also know that the cat can be anywhere number of filters in that layer. All neurons correspond-
in the image. Thus, it does not really matter where in ing to a particular filter have the same parameters (i.e.
the picture these features occur (though relative positions shared weights and bias).
of features likely do matter). This is a manifestation of In general, we will be concerned with local spatial fil-
translational invariance that is built into our supervised ters (often called a receptive field in analogy with neuro-
learning task. This example makes clear that, like many science) that take as inputs a small spatial patch of the
physical systems, many ML tasks (especially in the con- previous layer at all depths. For instance, a square filter
text of computer vision and image processing) also pos- of size F is a three-dimensional array of size F ×F ×Dl−1 .
sess additional structure, such as locality and translation The convolution consists of running this filter over all lo-
invariance. cations in the spatial plane. To demonstrate how this
The all-to-all coupled neural networks in the previous works in practice, let us a consider the simple example
section fail to exploit this additional structure. For exam- consisting of a one-dimensional input of depth 1, shown
ple, consider the image of the digit ‘four’ from the MNIST in Fig. 43. In this case, a filter of size F × 1 × 1 can
dataset shown in Fig. 36. In the all-to-all coupled neural be specified by a vector of weights w of length F . The
networks used there, the 28 × 28 image was considered stride, S, encodes by how many neurons we translate the
a one-dimensional vector of size 282 = 796. This clearly filter by when performing the convolution. In addition,
throws away lots of the spatial information contained in it is common to pad the input with P zeros (see Fig. 43).
the image. Not surprisingly, the neural networks com- For an input of width W , the number of neurons (out-
munity realized these problems and designed a class of puts) in the layer is given by (W − F + 2P )/S + 1. We
neural network architectures, convolutional neural net- invite the reader to check out this visualization of the
works or CNNs, that take advantage of this additional convolution procedure for a square input of unit depth.
structure (locality and translational invariance) (LeCun After computing the filter, the output is passed through
et al., 1995). Furthermore, what is especially interesting a non-linearity, a ReLU in Fig. 43. In practice, one of-
from a physics perspective is that it has been recently ten inserts a BatchNorm layer before the non-linearity,
shown that these CNN architectures are intimately re- cf. Sec. IX.E.3.
lated to models such as tensor networks (Stoudenmire, These convolutional layers are interspersed with pool-
2018; Stoudenmire and Schwab, 2016) and, in particu- ing layers that coarse-grain spatial information by per-
lar, MERA-like architectures that are commonly used in forming a subsampling at each depth. One common pool-
58
H Fully
Connected
Layer
W
Convolution Coarse-graining Convolution Coarse-graining
(pooling) (pooling)
FIG. 42 Architecture of a Convolutional Neural Network (CNN). The neurons in a CNN are arranged in three
dimensions: height (H), width (W ), and depth (D). For the input layer, the depth corresponds to the number of channels (in
this case 3 for RGB images). Neurons in the convolutional layers calculate the convolution of the image with a local spatial
filter (e.g. 3 × 3 pixel grid, times 3 channels for first layer) at a given location in the spatial (W, H)-plane. The depth D of the
convolutional layer corresponds to the number of filters used in the convolutional layer. Neurons at the same depth correspond
to the same filter. Neurons in the convolutional layer mix inputs at different depths but preserve the spatial location. Pooling
layers perform a spatial coarse graining (pooling step) at each depth to give a smaller height and width while preserving the
depth. The convolutional and pooling layers are followed by a fully connected layer and classifier (not shown).
ing operation is the max pool. In a max pool, the spatial with similar phenomena in physics: e.g. in translation-
dimensions are coarse-grained by replacing a small region ally invariant systems we can parametrize all eigenmodes
(say 2×2 neurons) by a single neuron whose output is the by specifying only their momentum (wave number) and
maximum value of the output in the region. In physics, functional form (sin, cos, etc.), while without translation
this pooling step is very similar to the decimation step invariance much more information is required.
of RG (Iso et al., 2018; Koch-Janusz and Ringel, 2017;
Lin et al., 2017; Mehta and Schwab, 2014). This gener-
ally reduces the dimension of outputs. For example, if B. Example: CNNs for the 2D Ising model
the region we pool over is 2 × 2, then both the height
and the width of the output layer will be halved. Gen- The inclusion of spatial structure in CNNs is an impor-
erally, pooling operations do not reduce the depth of the tant feature that can be exploited when designing neural
convolutional layers because pooling is performed sepa- networks for studying physical systems. In the accompa-
rately at each depth. A simple example of a max-pooling nying notebook, we used Pytorch to implement a simple
operation is shown in Fig. 44. There are some studies CNN composed of a single convolutional layer followed by
suggesting that pooling might be unnecessary (Springen- a soft-max layer. We varied the depth of the CNN layer
berg et al., 2014), but pooling layers remain a staple of from unity – a single set of weights and one bias – to a
most CNNs. depth of 50 distinct weights and biases. The CNN was
In a CNN, the convolution and max-pool layers are then trained using SGD for five epochs using a training
generally followed by an all-to-all connected layer and a set consisting of samples from far in the paramagnetic
high-level classifier such as a soft-max. This allows us and ordered phases. The results are shown in Fig. 45.
to train CNNs as usual using the backprop algorithm, The CNN achieved a 100% accuracy on the test set for
cf. Sec. IX.D. From a backprop perspective, CNNs are all architectures, even for a CNN with depth one. We also
almost identical to fully connected neural network archi- checked the performance of the CNN on samples drawn
tectures except with tied parameters. from the near-critical region for temperatures T slightly
Apart from introducing additional structure, such as above and below the critical temperature Tc . The CNN
translational invariance and locality, this convolutional performed admirably even on these critical samples with
structure also has important practical and computational an accuracy of between 80% and 90%. As is the case
benefits. All neurons at a given layer represent the same with all ML and neural networks, the performance on
filter, and hence can all be described by a single set of parts of the data that are missing from the training set
weights and biases. This reduces the number of free pa- is considerably worse than on test data that is similar
rameters by a factor of H ×W at each layer. For example, to the training data. This highlights the importance of
for a layer with D = 102 and H = W = 102 , this gives properly constructing an accurate training dataset and
a reduction in parameters of nearly 106 ! This allows for the considerable obstacles of generalizing to novel situ-
the training of much larger models than would otherwise ations. We encourage the interested reader to explore
be possible with fully connected layers. We are familiar the corresponding notebook and design better CNN ar-
59
(a) F=3
weight=[1,-1,1] 3 0 1 0
S=1
(units to shift bias=-2 ReLU
0 (unit slope) 0 1 1 1 3 1
filter by)
1 1 -1 0 4 2
2 3 2 1
2 1 -1 0
4 1 0 1
P=1 2 -1 -3 0
-1 6 4 4 . .
3 -4 -6 0
.. ..
0 output
input 2 5 2 1
W=5
1 1 0 1 5 2
(b) F=4
S=2 weight=[1,-1,2,1] 1 0 1 0 3 1
(units to shift
filter by) bias=-1 ReLU
3 1 0 1
1 (unit slope)
2
2 1 1 FIG. 44 Illustration of Max Pooling. Illustration of max-
pooling over a 2 × 2 region. Notice that pooling is done at
2 each depth (vertical axis) separately. The number of outputs
P=0 is halved along each dimension due to this coarse-graining.
-1 1 0 0
0 output C. Pre-trained CNNs and transfer learning
70 test simultaneously.
critical All these procedures can be carried out easily by using
60 1 5 10 20 50
packages such as Caffe or the Torch Vision library in
PyTorch. PyTorch provides a nice python notebook that
Depth of hidden layer serves as tutorial on transfer learning and the reader is
strongly urged to read the tutorials carefully if interested
in this topic.
FIG. 45 Single-layer convolutional network for classi-
fying phases in the Ising mode. Accuracy on test set and
critical samples for a convolutional neural network with sin-
gle layer of varying depth with filters of size 2, max-pool layer XI. HIGH-LEVEL CONCEPTS IN DEEP NEURAL
with receptive field of size 2, followed by soft-max classifier. NETWORKS
Notice that the test accuracy is 100% even for a CNN of depth
one with a single set of weights. Accuracy on the near-critical In the previous sections, we introduced deep neural
dataset is significantly below that for the test set. networks and discussed how we can use these networks
to perform supervised learning. Here, we take a step back
and discuss some high-level questions about the practice
this turns out to be true for many tasks one might be
and performance of neural networks. The first part of this
interested in.
section presents a deep learning workflow inspired by the
There are three distinct ways one can take a pre-
bias-variance tradeoff. This workflow is especially rele-
trained CNN and repurpose it for a new task. The follow-
vant to industrial applications where one is often trying
ing discussion draws heavily on the notes from the course
to employ neural networks to solve a particular problem.
CS231n mentioned in the introduction to this section.
In the second part of this section, we shift gears and ask
• Use CNN as fixed feature detector at top the question, why have neural networks been so success-
layer. If the new dataset we want to train on is ful? We provide three different high-level explanations
small and similar to the original dataset, we can that reflect current dogmas. Finally, we end the section
simply use the CNN as a fixed feature detector and by discussing the limitations of supervised learning meth-
retrain our classifier. In other words, we remove the ods and current neural network architectures.
classifier (soft-max) layer at the top of the CNN and
replace it with a new classifier (linear SVM or soft-
max) relevant to our supervised learning problem. A. Organizing deep learning workflows using the bias-variance
In this procedure, the CNN serves as a fixed map tradeoff
from images to relevant features (the outputs of the
top fully-connected layer right before the original Imagine that you are given some data and asked
classifier). This procedure prevents overfitting on to design a neural network for learning how to per-
small, similar datasets and is often a useful starting form a supervised learning task. What are the best
point for transfer learning. practices for organizing a systematic workflow that
allows us to efficiently do this? Here, we present
• Use CNN as fixed feature detector at inter- a simple deep learning workflow inspired by think-
mediate layer. If the dataset is small and quite ing about the bias-variance tradeoff (see Figure 46).
different from the dataset used to train the origi- This section draws heavily on Andrew Ng’s tuto-
nal image, the features at the top level might not rial at the Deep Learning School (available online
be suitable for our dataset. In this case, one may at https://www.youtube.com/watch?v=F1ka6a13S9I)
want to instead use features in the middle of the which readers are strongly encouraged to watch.
CNN to train our new classifier. These features are
thought to be less fine-tuned and more universal The first thing we would like to do is divide the data
(e.g. edge detectors). This is motivated by the idea into three parts. A training set, a validation or dev
that CNNs learn increasingly complex features the (development) set, and a test set. The test set is the
61
data on which we want to make predictions. The dev set Establish proxy for op;mal error rate (e.g. expert performance)
is a subset of the training data we use to check how well
we are doing out-of-sample, after training the model on
the training dataset. We use the validation error as a
proxy for the test error in order to make tweaks to our Yes
Bigger model
model. It is crucial that we do not use any of the test Training error high? Train longer
data to train the algorithm. This is a cardinal sin in Underfi9ng New model architecture
ML. We thus suggest the following workflow: No
Performance
ing paradigm is the ability to learn relevant features Medium NN
of the data with relatively little domain knowledge, i.e.
with minimal hand-crafting. Often, the power of deep Small NN
learning stems from its ability to act like a black box
that can take in a large stream of data and find good
features that capture properties of the data we are in- Traditional
terested in. This ability to learn good representations (e.g. logistic reg)
with very little hand-tuning is one of the most attractive
properties of DNNs. Many of the other supervised learn-
ing algorithms discussed here (regression-based models,
ensemble methods such as random forests or gradient-
boosted trees) perform comparably or even better than
Amount of Data
neural networks but when using hand-crafted features
FIG. 47 Large neural networks can exploit the vast
with small-to-intermediate sized datasets. amount of data now available. Schematic of how neural
The hierarchical structure of deep learning models is network performance depends on amount of available data
thought to be crucial to their ability to represent com- (Figure based on Andrew Ng’s talk at the 2016 Deep Learn-
plex, abstract features. For example, consider the use ing School available at https://www.youtube.com/watch?v=
of CNNs for image classification tasks. The analysis of F1ka6a13S9I.)
CNNs suggests that the lower-levels of the neural net-
works learn elementary features, such as edge detectors,
which are then combined into higher levels of the net- is that they are able to exploit the additional signal in
works into more abstract, higher-level features (e.g. the large datasets for difficult supervised learning tasks. Fun-
famous example of a neuron that “learned to respond to damentally, modern DNNs are unique in that they con-
cats”) (Le, 2013). More recently, it has been shown that tain millions of parameters, yet can still be trained on
CNNs can be thought of as performing tensor decompo- existing hardwares. The complexity of DNNs (in terms
sitions on the data similar to those commonly used in of parameters) combined with their simple architecture
numerical methods in modern quantum condensed mat- (layer-wise connections) hit a sweet spot between expres-
ter (Cohen et al., 2016). sivity (ability to represent very complicated functions)
One of the interesting consequences of this line of and trainability (ability to learn millions of parameters).
thinking is the idea that one can train a CNN on one Indeed, the ability of large DNNs to exploit huge
large dataset and the features it learns should also be use- datasets is thought to differ from many other commonly
ful for other supervised tasks. This results in the ability employed supervised learning methods such as Support
to learn important and salient features directly from the Vector Machines (SVMs). Figure 47 shows a schematic
data and then transfer this knowledge to a new task. In- depicting the expected performance of DNNs of differ-
deed, this ability to learn important, higher-level, coarse- ent sizes with the number of data samples and compares
grained features is reminiscent of ideas like the renormal- them to supervised learning algorithms such as SVMs or
ization group (RG) in physics where the RG flows sep- ensemble methods. When the amount of data is small,
arate out relevant and irrelevant directions, and certain DNNs offer no substantial benefit over these other meth-
unsupervised deep learning architectures have a natural ods and often perform worse. However, large DNNs seem
interpretation in terms of variational RG schemes (Mehta to be able to exploit additional data in a way other meth-
and Schwab, 2014). ods cannot. The fact that one does not have to hand
engineer features makes the DNN even more well suited
for handling large datasets. Recent theoretical results
2. Neural networks can exploit large amounts of data suggest that as long as a DNN is large enough, it should
generalize well and not overfit (Advani and Saxe, 2017).
With the advent of smartphones and the internet, there In the data-rich world we live in (at least in the context
has been an explosion in the amount of data being gen- of images, videos, and natural language), this is a recipe
erated. This data-rich environment favors supervised for success. In other areas where data is more limited,
learning methods that can fully exploit this rich data deep learning architectures have (at least so far) been less
world. One important reason for the success of DNNs successful.
63
3. Neural networks scale up well computationally • Homogeneous data.—Almost all DNNs deal
with homogeneous data of one type. It is very hard
A final feature that is thought to underlie the success to design architectures that mix and match data
of modern neural networks is that they can harness the types (i.e. some continuous variables, some discrete
immense increase in computational capability that has variables, some time series). In applications beyond
occurred over the last few decades. The architecture of images, video, and language, this is often what is
neural networks naturally lends itself to parallelization required. In contrast, ensemble models like random
and the exploitation of fast but specialized processors forests or gradient-boosted trees have no difficulty
such as graphical processing units (GPUs). Google and handling mixed data types.
NVIDIA set on a course to develop TPUs (tensor pro-
cessing units) which will be specifically designed for the • Many physics problems are not about
mathematical operations underlying deep learning archi- prediction.—In physics, we are often not inter-
tectures. The layered architecture of neural networks also ested in solving prediction tasks such as classifi-
makes it easy to use modern techniques such as automatic cation. Instead, we want to learn something about
differentiation that make it easy to quickly deploy them. the underlying distribution that generates the data.
Algorithms such as stochastic gradient descent and the In this case, it is often difficult to cast these ideas in
use of mini-batches make it easy to parallelize code and a supervised learning setting. While the problems
train much larger DNNs than was thought possible fifteen are related, it’s possible to make good predictions
years ago. Furthermore, many of these computational with a “wrong” model. The model might or might
gains are quickly incorporated into modern packages with not be useful for understanding the physics.
industrial resources. This makes it easy to perform nu-
merical experiments on large datasets, leading to further Some of these remarks are particular to DNNs, oth-
engineering gains. ers are shared by all supervised learning methods. This
motivates the use of unsupervised methods which in part
circumnavigate these problems.
C. Limitations of supervised learning with deep networks
signal
noise
N −1 i xi of these data points is zero9 . Denote N × in those components. Low values of explained variance
P
D design matrix as X = [x1 , x2 , . . . ; xN ]T whose rows may imply that the intrinsic dimensionality of the data
are the data points and columns correspond to different is high or simply that it cannot be captured by a lin-
features. The D × D (symmetric) covariance matrix is ear representation. For a detailed introduction to PCA,
therefore see the introductory tutorial by Shlens (Shlens, 2014) or
Bishop (Bishop, 2006).
1
Σ(X) = X T X. (129)
N −1
Notice that the j-th diagonal entry of Σ(X) corresponds C. Multidimensional scaling
to the variance of j-th feature and the Σ(X)ij measures
the covariance (i.e. connected correlation in the language Multidimensional scaling (MDS) is a non-linear dimen-
of physics) between feature i and feature j. sional reduction technique which preserves the pairwise
We are interested in finding a new basis for the data distance or dissimilarity dij between data points (Cox
that emphasizes highly variable directions while reduc- and Cox, 2000). Moving forward, we use the term “dis-
ing redundancy between basis vectors. In particular, we tance” and “dissimilarity” interchangeably. There are two
will look for a linear transformation that reduces the co- types of MDS: metric and non-metric. In metric MDS,
variance between different features. To do so, we first the distance matrix is computed under a pre-defined met-
perform singular value decomposition (SVD) on design ric and the latent coordinates Ỹ are obtained by minimiz-
matrix X, namely, X = U SV T , where S is a diagonal ing the difference between the distance matrix in the orig-
matrix of singular value si , the orthogonal matrix U con- inal space (dij (X)) and that in the latent space (dij (Y )):
tains (as its columns) the left singular vectors of X, and X
similarly V contains (as its columns) the right singular Ỹ = arg min wij |dij (X) − dij (Y )|, (131)
Y
vectors of X. With this, one can rewrite the covariance i<j
matrix as
where wij is a weight value: wij ≥ 0. The weight ma-
1
Σ(X) = V SU T U SV T trix wij is a set of free parameters that specify the level
N −1 of confidence (or precision) in the value of dij (X). If
S2
=V VT Euclidean metric is used, it is the same as PCA and is
N −1 usually referred to as classical scaling (Torgerson, 1958).
≡ V ΛV T . (130) Thus MDS is often considered as a generalization of PCA.
In non-metric MDS, dij can be any distance matrix. The
where Λ is a diagonal matrix with eigenvalues λi in the objective function is then to preserve the ordination in
decreasing order along the diagonal (i.e. eigendecompo- the data, i.e. if d12 (X) < d13 (X) in the original space,
sition). It is clear that the right singular vectors of X then in the latent space we should have d12 (Y ) < d13 (Y ).
(i.e. columns of V ) are principal directions of Σ(X), and Both MDS and PCA can be implemented using standard
the singular values of X are related to the eigenvalues of Python packages such as Scikit. MDS algorithms typi-
covariance matrix Σ(X) via λi = s2i /(N − 1). To reduce cally have a scaling of O(N 3 ) where N corresponds to the
the dimensionality of data from D to D̃ < D, we first con- number of data points. In the case of PCA, if one is only
struct the D × D̃ projection matrix ṼD0 by selecting the interested in a small fraction of the principal components
singular components with the D̃ largest singular values. with largest variance (which is usually the case), efficient
The projection of the data from D to a D̃ dimensional implementations based on Lanczos methods can achieve
space is simply Ỹ = X ṼD0 . scaling of O(N 2 ) for dense matrices (Halko et al., 2011).
The singular vector with the largest singular value (i.e PCA and MDS are often among the first data visualiza-
the largest variance) is referred to as the first principal tion techniques one resorts to.
component, the singular vector with the second largest
singular value as the second principal component,
PD and so
on. An important quantity is the ratio λi / i=1 λi which D. t-SNE
is referred as the percentage of the explained variance
contained in a principal component (see FIG. 51.b). It is often desirable to preserve local structures in high-
It is common in data visualization to present the data dimensional dataset. This is, however, typically not pos-
projected on the first few principal components. This is sible using linear techniques such as PCA. Many non-
valid as long as a large part of the variance is explained linear techniques such as non-classical MDS(Cox and
Cox, 2000), self-organizing map (Kohonen, 1998), Isomap
(Tenenbaum et al., 2000) and Locally Linear Embedding
(Roweis and Saul, 2000) have been proposed recently.
9 We can always center around the mean: x̄ ← xi − x̄ These techniques are generally good at preserving local
67
structures in the data but typically fail to capture struc- This is achieved by gradient descent, see Sec. IV.
tures at the larger scale such as the clusters in which the When visualizing data with t-SNE it is important to
data is organized(Maaten and Hinton, 2008). Recently, understand what is being plotted. Here we list some
t-stochastic neighbor embedding (t-SNE) has emerged as important properties that one should bear in mind when
one of the go-to methods for visualizing high-dimensional analyzing t-SNE plots.
data. t-SNE is a non-parametric method that utilizes
non-linear embeddings. When used appropriately, t-SNE • t-SNE can rotate data. The KL divergence is in-
is a powerful technique for unraveling the hidden struc- variant under rotations in the latent space, since it
ture of high-dimensional datasets while at the same time only depends on the distance between points. For
preserving locality. In physics, t-SNE has recently been this reason, t-SNE plots that are rotations of each
used to reduce the dimensionality and classify spin con- other should be considered equivalent.
figurations, generated with the help of Monte Carlo sim- • t-SNE results are stochastic. Note that although
ulations, for the Ising (Carrasquilla and Melko, 2017) KL divergence is convex in the domain of distribu-
and Fermi-Hubbard models at finite temperatures (Ch’ng tions, it is generally not in the domain of qij (i.e.
et al., 2017). It was also applied to study clustering latent coordinate y). Therefore, in applying gradi-
transitions in random satisfiability problems which bears ent descent the solution will depend on the initial
close resemblance to spin glass models(Day, 2018). seed. Thus, the map obtained may vary depending
The idea of stochastic neighbor embedding is to asso- on the seed used and different t-SNE runs will give
ciate a probability distribution to the neighborhood of slightly different results.
each data(note x ∈ Rs , s the number of features):
• t-SNE generally preserves short distance informa-
exp(−||xi − xj ||2 /2σi2 ) tion. As a rule of thumb, one should expect that
pi|j =P 2 2 , (132)
k6=i exp(−||xi − xk || /2σi ) nearby points on the t-SNE points are also closeby
in the original space. The reason is the nature of
where pi|j can be interpreted as the likelihood that xj
the mapping discussed in Figure 52.
is xi ’s neighbor. σi are free parameters that are usually
fixed by fixing the local entropy (or equivalently the per- • Scales are deformed in t-SNE. Since a scale-free dis-
plexity) of each data point10 . Intuitively, the Gaussian tribution is used in the latent space, one should
likelihood (i.e. short-tailed) means that only points that not put too much emphasis on the meaning of the
are nearby xi contribute to its probability distribution. A variances of the any clusters observed in the latent
symmetrized probability distribution is constructed from space.
(132) : pij ≡ (pi|j + pj|i )/(2N ). This symmetrization en-
sures that even outliers contribute pij and as such, have • t-SNE is computationally intensive. Finally, a di-
meaningful embedding coordinates (Maaten and Hinton, rect implementation of t-SNE has an algorithm
2008). complexity of O(N 2 ) which is only applicable to
t-SNE, on the other hand, constructs an equivalent small to medium data sets. Improved scaling
probability distribution in a low dimensional latent space of the form O(N log N ) can be achieved at the
(with coordinates yi ∈ Rt , t < s, with t the dimension of cost of approximating (134) by using Barnes-Hut
the latent space: method(Maaten and Hinton, 2008).
short-tail
yi = arg min |p(xi ) q(y)|
y
long-tail
p
q
x0,y0
y1 x2 y2
x1
FIG. 52 Illustration of the t-SNE embedding. xi points cor-
respond to the original high-dimensional points while the yi
points are the corresponding low-dimensional map points pro-
duced by t-SNE. Here we consider two points, x1 , x2 , that are
respectively “close” and “far” from x0 . The high-dimensional
gaussian (short-tail) distribution p(x) of x0 ’s neighbors is
shown in blue. The low-dimensional Cauchy (fat-tail) distri-
bution q(y) of x0 ’s neighbors is shown in red. The map point FIG. 53 Different visualizations of a gaussian mixture formed
yi , are obtained by minimizing the difference |q(y) − p(xi )| of K = 30 mixtures in a D = 40 dimensional space. The
(similar to minimizing the KL divergence). We see that point gaussians have the same covariance but have means drawn
x1 is mapped to short distances |y1 −y0 |. In contrast, far-away uniformly at random in the space [−10, 10]40 . (a) Plot of the
points such as x2 are mapped to relatively large distances first two coordinates. The labels of the different gaussian is
|y2 − y0 |. indicated by the different colors. Note that in a realistic set-
ting, label information is of course not available, thus making
it very hard to distinguish the different “blobs". (b) Random
XIII. CLUSTERING projection of the data onto a 2 dimensional space. (c) pro-
jection onto the first 2 principal components. Only a small
fraction of the variance is explained by those components (the
In this section, we continue our discussion of unsuper-
ratio is indicated along the axis). (d) t-SNE embedding (per-
vised learning methods. Unsupervised learning is con- plexity = 60, # iteration = 1000) in a 2 dimensional latent
cerned with discovering structure in unlabeled data (for space. t-SNE captures correctly the local structure of the
instance learning local structures for data visualization, data.
see section XII). The lack of labels make unsupervised
learning much more difficult and subtle than its super-
vised counterpart. What is somewhat surprising is that
even without labels it is still possible to uncover and ex-
ploit the hidden structure in the data. Perhaps, the sim-
plest example of unsupervised learning is clustering. The
aim of clustering is to group unlabelled data into clusters
according to some similarity or distance measure. Infor-
mally, a cluster is thought of as a set of points sharing
some pattern or structure.
Clustering finds many applications throughout data
mining (Larsen and Aone, 1999), data compression and
signal processing (Gersho and Gray, 2012; MacKay, FIG. 54 Visualization of the MNIST handwritten dataset
2003). Clustering can be used to identify coarse features (here N = 10000). (a) First two principal components. (b)
t-SNE applied with a perplexity of 30, a Barnes-Hut angle of
or high level structures in an unlabelled dataset. The 0.5 and 1000 gradient descent iterations. In order to reduce
technique also finds many applications in physical sci- the noise and speed up computation, PCA was first applied
ences, ranging from detecting celestial emission sources to the dataset to project it down to 40 dimensions.
in astronomical surveys (Sander et al., 1998) to inferring
groups of genes and proteins with similar functions in
biology (Eisen et al., 1998), and building entanglement and for this reason, is among the most widely used and
classifiers (Lu et al., 2017). Clustering is perhaps the employed data analysis and machine learning techniques.
simplest way to look for hidden structure in a dataset The field of clustering is vast and there exists a
69
flurry of clustering methods suited for different purposes. cluster means {µ} and the data point assignments in or-
Some common considerations one has to take into ac- der to minimize the following objective function:
count when choosing a particular method is the distribu-
K X
N
tion of the clusters (overlapping/noisy clusters vs. well- X
separated clusters), the geometry of the data (flat vs. J({x, µ}) = rnk (xn − µk )2 , (135)
k=1 n=1
non-flat), the cluster size distribution (multiple sizes vs.
uniform sizes), the dimensionality of the data (low vs. where rnk is a binary variable (rnk ∈ {0, 1}) called the
high dimensional) and the computational efficiency of the assignment. The assignment rnk is 1 if xP n is assigned to
desired method (small vs. large dataset). cluster and 0 otherwise. Notice that
k k rnk = 1 ∀ n
We begin section XIII.A with a focus on popular prac- and n rnk ≡ Nk , where Nk the number of points as-
P
tical clustering methods such as K-means clustering, hi- signed to cluster k. The minimization of this objective
erarchical clustering and density clustering. Our goal is function can be understood as trying to find the best
to highlight the strength, weaknesses and differences be- cluster means such that the variance within each cluster
tween these techniques, while laying out some of the theo- is minimized. In physical terms, J is equivalent to the
retical framework required for clustering analysis. There sum of the moments of inertia of every cluster. Indeed,
exist many more clustering methods beyond those dis- as we will see below, the cluster means µk correspond to
cussed in this section 11 . The methods we discuss were the centers of mass of their respective cluster.
chosen for their pedagogical value and/or their applica-
bility to problems in physics.
In section XIII.B we discuss gaussian mixture models
K-means algorithm K-means algorithm alternates be-
and the formulation of clustering through latent variable
tween two steps:
models. This section introduces many of the methods we
will encounter when discussing other unsupervised learn- 1. Expectation: Given a set of assignments {rnk },
ing methods later in the review. Finally, in section XIII.C minimize J with respect to µk . Taking a simple
we discuss the problem of clustering in high-dimensional derivative and setting it to zero yields the follow-
data and possible ways to tackle this difficult problem. ing update rule:
The reader is also urged to experiment with various clus-
tering methods using Notebook 15. 1 X
µk = rnk xn . (136)
Nk n
A common drawback of hierarchical methods is that that are outliers are expected to form regions of low den-
they do not scale well: at every step, a distance ma- sity. Density clustering has the advantage of being able to
trix between all clusters must be updated/computed. consider clusters of multiple shapes and sizes while identi-
Efficient implementations achieve a typical computa- fying outliers. The method is also suitable for large-scale
tional complexity of O(N 2 ) (Müllner, 2011), making the applications.
method suitable for small to medium-size datasets. A The core assumption of DB clustering is that a rel-
simple but major speed-up for the method is to initial- ative local density estimation of the data is possible.
ize the clusters with K-means using a large K (but still In other words, it is possible to order points according
a small fraction of N ) and then proceed with hierarchi- to their densities. Density estimates are usually accu-
cal clustering. This has the advantage of preserving the rate for low-dimensional data but become unreliable for
large-scale structure of the hierarchy while making use of high-dimensional data due to large sampling noise. Here,
the linear scaling of K-means. In this way, hierarchical for brevity, we confine our discussion to one of the most
clustering may be applied to very large datasets. widely used density clustering algorithms DBSCAN. We
have also had great success with another recently in-
0.9 troduced variant of DB clustering (Rodriguez and Laio,
a)
1 2014) that is similar in spirit which the reader is urged
0.8 0 to consult. One of the authors has also created a Python
package which makes use of accurate density estimates
2
0.7 via kernel methods combined with agglomerative cluster-
ing to produce fast and accurate density clustering (see
0.6
3 GitHub).
0.5
DBSCAN algorithm Here we describe the most promi-
5
4 nent DB clustering algorithm: DBSCAN, or density-
0.4
based spatial clustering of applications with noise (Es-
0.4 0.6 0.8 ter et al., 1996). Again, consider a set of N data points
N
X ≡ {xn }n=1 .
0.30 b) We start by defining the ε-neighborhood of point xn
0.25 as follows:
→ Return the cluster assignments C1 , · · · , Ck , with k B. Clustering and Latent Variables via the Gaussian Mixture
the number of clusters. Points that have not been Models
assigned to a cluster are considered noise or out-
liers. In the previous section, we introduced several practical
methods for clustering. In this section, we will approach
Note that DBSCAN does not require the user to specify clustering from a more abstract vantage point, and in
the number of clusters but only ε and minPts. While, it the process, introduce many of the core ideas underlying
is common to heuristically fix these parameters, methods unsupervised learning. A central concept in many un-
such as cross-validation can be used for their determi- supervised learning techniques is the idea of a latent or
nation. Finally, we note that DBSCAN is very efficient hidden variable. Even though latent variables are not di-
since efficient implementations have a computational cost rectly observable, they still influence the visible structure
of O(N log N ). of the data. For example, in the context of clustering we
can think of the cluster identity of each datapoint (i.e.
which cluster does a datapoint belongs to) as a latent
a) variable. And even though we cannot see the cluster la-
bel explicitly, we know that points in the same cluster
tend to be closer together. The latent variables in our
data (cluster identity) are a way of representing and ab-
stracting the correlations between datapoint.
In this language, we can think of clustering as an algo-
" rithm to learn the most probable value of a latent variable
(cluster identity) associated with each datapoint. Cal-
culating this latent variable requires additional assump-
tions about the structure of our dataset. Like all unsu-
pervised learning algorithms, in clustering we must make
an assumption about the underlying probability distribu-
tion from which the data was generated. Our model for
how the data is generated is often called our generative
minPts=4 model. In clustering, we assume that data points are as-
signed a cluster, with each cluster characterized by some
High cluster-specific probability distribution (e.g. a Gaussian
b) with some mean and variance that characterizes the clus-
ter). We then specify a procedure for finding the value
of the latent variable. This is often done by choosing the
values of the latent variable that minimizing some cost
function.
Relative density
point x in a GMM is given by The γ(zk ) are often referred to as the “responsibility”
that mixture k takes for explaining x. Just like in our
K
X discussion of soft-max classifiers, this can be made into
p(x|{µk , Σk , πk }) = N (x|µk , Σk )πk . (144) a “hard-assignment” by assigning each point to the clus-
k=1
ter with the largest probability: arg maxk γ(zk ) over the
Given a dataset X = {x1 , · · · , xN }, we can write the responsibilities.
likelihood of the dataset as The complication is of course that we do not know that
the parameters θ of the underlying GMM but instead
N
Y must also learn them from the dataset X. As discussed
p(X|{µk , Σk , πk }) = p(xi |{µk , Σk , πk }) (145) above, ideally we could do this by choosing the param-
i=1
eters that maximize the likelihood (or equivalently the
For future reference, let us denote the set of parameters log-likelihood) of the data
{µk , Σk , πk } by θ.
To see how we can use GMM and MLE to perform
clustering, we introduce discrete binary K-dimensional
latent variables z for each data point x whose k-th com-
ponent is 1 if point x was generated from the k-th Gaus- θ̂i = arg max log p(X|θ) (150)
θi
sian and zero otherwise (these are often called “one-hot
variables”). For instance if we were considering a Gaus-
sian mixture with K = 3, we would have three possible
values for z ≡ (z1 , z2 , z3 ) : (1, 0, 0), (0, 1, 0) and (0, 0, 1).
We cannot directly observe the variable z. It is a latent
where θi ∈ {µk , Σk , πk }. Once we knew the MLEs θ̂i , we
variable that encodes the cluster identity of point x. Let
could use Eq. (149) to calculate the optimal hard cluster
us also denote all the N latent variables corresponding
to a dataset X by Z. assignment arg maxk γ̂(zk ) where γ̂(zk ) = p(zk = 1|x; θ̂).
Viewing the GMM as a generative model, we can write
the probability p(x|z) of observing a data point x given In practice, due to the complexity of Eq. (145), it is
z as almost impossible to find the global maximum of the like-
K lihood function. Instead, we must settle for a local max-
imum. One approach to finding a local maximum of the
Y
p(x|z; {µk , Σk }) = N (x|µk , Σk )zk (146)
k=1 likelihood is to use a method like stochastic gradient de-
scent on the negative log-likelihood, cf. Sec IV. Here, we
as well as the probability of observing a given value of introduce an alternative, powerful approach for finding
latent variable local minima in latent variable models using an iterative
K procedure called Expectation Maximization (EM). Given
an initial guess for the parameters θ(0) , the EM algorithm
Y
p(z|{πk }) = πkzk . (147)
k=1 iteratively generated new estimates for the parameters
θ(1) , θ(2) , . . .. Importantly, the likelihood is guaranteed to
Using Bayes rule, we can write the joint probability of be non-decreasing under these iterations and hence EM
a clustering assignment z and a data point x given the converges to a local maximum of the likelihood (Neal and
GMM parameters as Hinton, 1998).
p(x, z; θ) = p(x|z; {µk , Σk })p(z|{πk }). (148)
The central observation underlying EM is that it is
We can also use Bayes rule to rearrange this expression often much easier to calculate the conditional likelihoods
to give the conditional probability of the data point x of the latent variables p̃(t) (Z) = p(Z|X; θ(t) ) given some
being in the k-th cluster, γ(zk ), given model parameters choice of parameters and the maximum of the expected
θ as log likelihood given an assignment of the latent variables:
θ(t+1) = arg maxθ Ep(Z|X;θ(t) ) [log p(X, Z; θ)]. To get an
πk N (x|µk , Σk )
γ(zk ) ≡ p(zk = 1|x; θ) = PK . (149) intuition for this later quantity notice that we can write
j=1 πj N (x|µj , Σj )
N X
K
(t)
X
Ep̃(t) [log p(X, Z; θ)] = γik [log N (xi |µk , Σk ) + log πk ] , (151)
i=1 k=1
74
(t)
where have used the shorthand γik = p(zik |X; θ(t) ) with
zik the k-th component of zi . Taking the derivative of
this equation with
P respect to µk , Σk , and πk (subject to
the constraint k πk = 1) and setting this to zero yields
the intuitive equations
PN (t)
(t+1) i γik xi
µk = P (t)
i γik
PN (t)
(t+1) γik (xi − µk )(xi − µk )T
i
Σk = P (t)
i γik
(t+1) 1 X (t)
πk = γik (152)
N
k
These are just the usual estimates for the mean and vari-
ance, with each data point weighed according to our cur-
rent best guess for the probability that it belongs to clus-
ter k. We can then use our new estimate θ(t+1) to cal-
(t+1)
culate new memberships γik and repeat the process.
This is essentially the K-Means algorithm discussed in
the first section.
This discussion of the Gaussian mixture model intro-
duces several concepts that we will return to repeatedly
in the context of unsupervised learning. First, it is often
useful to think of the visible correlations between features
in the data as resulting from hidden or latent variables.
Second, we will often posit a generative model that en-
codes the structure we think exists in the data and then
find parameters that maximize the likelihood of the ob-
served data. Third, often we will not be able to directly FIG. 58 (a) Application of gaussian mixture modelling to
estimate the MLE, and will have to instead look for a the Ising dataset. The normalized histogram corresponds to
computationally efficient way to find a local minimum of the first principal component distribution of the dataset (or
the likelihood. equivalently the magnetization in this case). The 1D data
is fitted with a K = 3-component gaussian mixture. The
likehood of the fitted gaussian mixture is represented in red
C. Clustering in high-dimension and is obtained via the expectation-maximization algorithm
(b) The gaussian mixture model can be used to compute
posterior probability (responsibilities), i.e. the probability
Clustering data in high-dimension can be very chal- of being in one of the phases. Note that the point where
lenging. One major problem that is aggravated in high- γ(1) = γ(2) = γ(3) can be interpreted as the critical point.
dimensions is the accumulation of noise coming from spu- Indeed the crossing occurs at T ≈ 2.26.
rious features that tends to “blur” distances (Domingos,
2012; Kriegel et al., 2009; Zimek et al., 2012). Many
clustering algorithms rely on the explicit use of a simi- the two-dimensional embedding that is presented. Using
larity measure or distance metrics that weigh all features t-SNE directly on original data leads to a “blurring” of
equally. For this reason, one must be careful when us- the clusters (the reader is encouraged to test this them-
ing an off-the-shelf method in high dimensions. In order seleves).
to perform clustering on high-dimensional data, it is of- However, simple feature selection or feature denoising
ten useful to denoise the data before proceeding with (using PCA for instance) can sometimes be insufficient
using a standard clustering method such as K-means for learning clusters due to the presence of large vari-
(Kriegel et al., 2009). Figure 54 illustrates an applica- ations in the signal and noise of the features that are
tion of denoising to high-dimensional data. PCA (sec- relevant for identifying the underlying clusters (Kriegel
tion XII.B) was used to denoise the MNIST dataset by et al., 2009). Recent promising work suggests one way
projecting the 784 original dimensions onto the 40 di- to overcome these limitations is to learn the latent space
mensions with the largest principal components. The re- and the cluster labels at the same time (Xie et al., 2016).
sulting features were then used to construct a Euclidean Finally we end the clustering section with a short dis-
distance matrix which was used by t-SNE to compute cussion on clustering validation, which can be particu-
75
larly difficult for high-dimensional data. Often clustering the partition function
validation, i.e. verifying wether the obtained labels are
“valid” is done by direct visual inspection. That is, the Zp = Trx e−βE(x) , (154)
data is represented in a low-dimensional space and the
(where the trace is taken over all possible configurations
cluster labels obtained are visually inspected to make
x) since
sure that different labels organize into distinct “blobs”.
For high-dimensional data, this is done by performing e−βE(x1 )
dimensional reduction (section XII). However, this can p(x) = . (155)
Zp
lead to the appearance of spurious clusters since dimen-
sional reduction inevitably loses information about the In general, calculating the partition function Zp is ana-
original data. Thus, these methods should be used with lytically and computationally intractable.
care when trying to validate clusters (see (Wattenberg For example, for the Ising model with N binary spins,
et al., 2016) for an interactive discussion on how t-SNE the trace involves calculating a sum over 2N terms, which
can sometime be misleading and how to effectively use is a difficult task for most energy functions. For this rea-
it). son, physicists (and machine learning scientists) have de-
A lot of work has been done to devise ways of validating veloped various numerical and computational methods
clusters based on various metrics and measures(Kriegel for evaluating such partition functions. One approach
et al., 2009). Perhaps one the most intuitive way of defin- is to use Monte-Carlo based methods to draw samples
ing a good clustering is by measuring how well clusters from the underlying distribution (this can be done know-
generalize. Recently, clustering methods based on lever- ing only the relative probabilities) and then use these
aging powerful classifiers to measure the generalization samples to numerically estimate the partition function.
errors of the clusters have been developed by some of This is the philosophy behind powerful methods such as
the authors (Day and Mehta, 2018) and we believe this Markov Chain Monte Carlo (MCMC) (Andrieu et al.,
represent an especially promising research direction for 2003) and annealed importance sampling (Neal and Hin-
high-dimensional clustering. Finally, we emphasize that ton, 1998) which are widely used in both the statistical
this discussion is far from exhaustive and we refer the physics and machine learning communities. An alterna-
reader to (Rokach and Maimon, 2005), Chapter 15, for tive approach – which we focus on here – is to approxi-
an in-depth survey of the various validation techniques. mate the the probability distribution p(x) and partition
function using a “variational distribution” q(x; θq ) whose
partition function we can calculate exactly. The varia-
XIV. VARIATIONAL METHODS AND MEAN-FIELD tional parameters θq are chosen to make the variational
THEORY (MFT) distribution as close to the true distribution as possible
(how this is done is the focus of much of this section).
A common thread in many unsupervised learning tasks One of the most-widely applied examples of a varia-
is accurately representing the underlying probability dis- tional method in statistical physics is Mean-Field Theory
tribution from which a dataset is drawn. Unsuper- (MFT). MFT can be naturally understood as a procedure
vised learning of high-dimensional, complex distributions for approximating the true distribution of the system by
presents a new set of technical and computational chal- a factorized distribution. The deep connection between
lenges that are different from those we encountered in MFT and variational methods is discussed below. These
a supervised learning setting. When dealing with com- variational MFT methods have been extended to under-
plicated probability distributions, it is often much easier stand more complicated spin models (also called graph-
to learn the relative weights of different states or data ical models in the ML literature) and form the basis of
points (ratio of probabilities), than absolute probabili- powerful set of techniques that go under the name of Be-
ties. In physics, this is the familiar statement that the lief Propagation and Survey Propagation (MacKay, 2003;
weights of a Boltzmann distribution are much easier to Wainwright et al., 2008; Yedidia et al., 2003).
calculate than the partition function. The relative prob- Variational methods are also widely used in ML to ap-
ability of two configurations, x1 and x2 , are proportional proximate complex probabilistic models. For example,
to the difference between their Boltzmann weights below we show how the Expectation Maximization (EM)
procedure, which we discussed in the context of Gaus-
p(x1 ) sian Mixture Models for clustering, is actually a general
= e−β(E(x1 )−E(x2 )) , (153)
p(x2 ) method that can be derived for any latent (hidden) vari-
able model using a variational procedure (Neal and Hin-
where as is usual in statistical mechanics β is the inverse ton, 1998). This section serves as an introduction to this
temperature and E(x; θ) is the energy of state x given powerful class of variational techniques. For readers in-
some parameters (couplings) θ . However, calculating the terested in an in-depth discussion on variational infer-
absolute weight of a configuration requires knowledge of ence for probabilistic graphical models, we recommend
76
the simplest MFT of the Ising model, the variational dis- variational parameters for all the spins are identical, with
tribution is chosen so that all spins are independent: θi = θ for all i. Then, the mean-field equations reduce
! to their familiar textbook form, m = tanh(θ) and θ =
1 X Y eθi si β(zJm(θ) + h), where z is the coordination number of
q(s, θ) = exp θi si = . (163)
Zq i i
2 cosh θi the lattice (i.e. the number of nearest neighbors).
Equations (165) and (168) form a closed system, known
In other words, we have chosen a distribution q which
as the mean-field equations for the Ising model. To find
factorizes on every lattice site. An important property
a solution to these equations, one method is to iterate
of this functional form is that we can analytically find a
through and update each θi , once at a time, in an asyn-
closed-form expression for the variational partition func-
chronous fashion. To see the relationship of this approach
tion Zq . This simplicity also comes at a cost: ignor-
to solving the MFT equations to Expectation Maximiza-
ing correlations between spins. These correlations be-
tion (EM), it is helpful to explicitly spell out the iterative
come less and less important in higher dimensions and
procedure to find the solutions to Eq. (168). We start by
the MFT ansatz becomes more accurate.
initializing our variational parameters to some θ (0) and
To evaluate the variational free-energy, we make use
repeat the following until convergence:
of Eq. (160). First, we need the entropy Hq of the dis-
tribution q. Since q factorizes over the lattice sites, the
entropy separates into a sum of one-body terms 1. Expectation: Given a set of assignments at iteration
X t, θ (t) , calculate the corresponding magnetizations
Hq (θ) = − q(s, θ) log q(s, θ) m(t) using Eq. (165)
{si =±1}
X
=− qi log qi + (1 − qi ) log(1 − qi ), (164)
2. Maximization: Given a set of magnetizations mt ,
i
find new assignments θ(t+1) which minimize the
variational free energy Fq . From, Eq. (168) this
θi
where qi = 2 cosh
e
θi is the probability that spin si is in the
+1 state. Next, we need to evaluate the average of the is just
Ising energy E(s, J ) with respect to the variational dis-
tribution q. Although the energy contains bilinear terms, (t+1) (t)
X
θi =β Jij mj + hi . (169)
we can still evaluate this average easily, because the spins j
are independent (uncorrelated) in the q distribution. The
mean value of spin si in the q distribution, or the on-site
magnetization, is given by From these equations, it is clear that we can think of the
X eθi si MFT of the Ising model as an EM-like procedure similar
mi = hsi iq = si = tanh(θi ). (165) to the one we used for k-means clustering and Gaussian
2 cosh θi
si =±1 Mixture Models in Sec. XIII.
Since the spins are independent, we have As if well known in statistical physics, even though
MFT is not exact, it can often yield qualitatively and
1X
even quantitatively precise predictions (especially in high
X
hE(s, J )iq = − Jij mi mj − hi mi . (166)
2 i,j i dimensions). The discrepancy between the true physics
and MFT predictions stems from the fact that the varia-
The total variational free-energy is tional distribution q we chose does not model the correla-
βFq (J , θ) = βhE(s, J )iq − Hq , tions between spins. For instance, it predicts the wrong
value for the critical temperature for the 2D Ising model.
and minimizing with respect to the variational parame- It even erroneously predicts the existence of a phase tran-
ters θ, we obtain sition in one dimension at a non-zero temperature. We
refer the interested reader to standard textbooks on sta-
∂ dqi X tistical physics for a detailed analysis of applicability of
βFq (J , θ) = 2 −β Jij mj + hi + θi . MFT to the Ising model. However, we emphasize that
∂θi dθi j
the failure of any particular variational ansatz does not
(167) compromise the power of the approach. In some cases,
Setting this equation to zero, we arrive at one can consider changing the variational ansatz to im-
X prove the predictive properties of the corresponding vari-
θi = β Jij mj (θj ) + hi . (168)
ational MFT (Yedidia, 2001; Yedidia et al., 2003). The
j
take-home message is that variational MFT is a powerful
For the special case of a uniform field hi = h and uniform tool but one that must be applied and interpreted with
nearest neighbor couplings Jij = J, by symmetry the care.
78
the statistical uncertainty one has about the value of a − i λi fi (x). The values of the Lagrange multipliers are
P
random variable x drawn from a probability distribution chosen to match the observed averages for the set of func-
p(x). The Shannon entropy of the distribution is defined tions {fi (x)} whose average value is being fixed:
as
∂ log Z
Z
hfi imodel = dxp(x)fi (x) = = hfi iobs . (179)
Sp = −Trx p(x) log p(x) (176) ∂λi
where the trace is a sum/integral over all possible values In other words, the parameters of the distribution can be
a variable can take. Jaynes showed that the Boltzmann chosen such that
distribution follows from the Principle of Maximum En- ∂λi log Z = hfi idata . (180)
tropy. A physical system should be described by the
probability distribution with the largest entropy subject To gain more intuition for the MaxEnt distribution, it
to certain constraints (often provided by measuring the is helpful to relate the Lagrange multipliers to the famil-
average value of conserved, extensive quantities such as iar thermodynamic quantities we use to describe physical
the energy, particle number, etc.) The principle uniquely systems (Jaynes, 1957a). Our x denotes the microscopic
specifies a procedure for parameterizing the functional state of the system, i.e. the MaxEnt distribution is a
form of the probability distribution. Once we have spec- probability distribution over microscopic states. How-
ified and learned this form we can, of course, generate ever, in thermodynamics we only have access to average
new examples by sampling this distribution. quantities. If we know only the average energy hE(x)iobs ,
Let us illustrate how this works in more detail. Sup- the MaxEnt procedure tells us to maximize the entropy
pose that we have chosen a set of functions {fi (x)} whose subject to the average energy constraint. This yields
average value we want to fix to some observed values
1 −βE(x)
hfi iobs . The Principle of Maximum Entropy states that p(x) = e . (181)
we should choose the distribution p(x) with the largest Z
uncertainty (i.e. largest Shannon entropy Sp ), subject to where we have identified the Lagrange multiplier conju-
the constraints that the model averages match the ob- gate to the energy λ1 = −β = 1/kB T with the (negative)
served averages: inverse temperature. Now, suppose we also constrain the
Z particle number hN (x)iobs . Then, an almost identical
hfi imodel := dxfi (x)p(x) = hfi iobs (177) calculation yields a MaxEnt distribution of the functional
form
We can formulate the Principle of Maximum Entropy 1 −β(E(x)−µN (x))
p(x) = e , (182)
as an optimization problem using the method of Lagrange Z
multipliers by minimizing:
where we have rewritten our Lagrange multipliers in
X Z the familiar thermodynamic notation λ1 = −β and
L[p] = −Sp + λi hfi iobs − dxfi (x)p(x) λ2 = µ/β. Since this is just the Boltzmann distribu-
Z
i
tion, we can also relate the partition function in our
+γ 1− dxp(x) , MaxEnt model to the thermodynamic free-energy via
F = −β −1 log Z. The choice of which quantities to
constrain is equivalent to working in different thermo-
where the first set of constraints enforce the requirement dynamic ensembles.
for the averages and the last constraint enforces the nor-
malization that the trace over the probability distribu-
tion equals one. We can solve for p(x) by taking the 2. From statistical mechanics to machine learning
functional derivative and setting it to zero
The MaxEnt idea also provides a general procedure
δL
for learning a generative model from data. The key dif-
X
0= = (log p(x)) + 1) − λi fi (x) − γ.
δp i ference between MaxEnt models in (theoretical) physics
and ML is that in ML we have no direct access to ob-
The general form of the maximum entropy distribution served values hfi iobs . Instead, these averaged must be
is then given by directly estimated from data (samples). To denote this
1 Pi λi fi (x) difference, we will call empirical averages calculated from
p(x) = e (178) data as hfi idata . We can think of MaxEnt as a statisti-
Z
cal inference procedure simply by replacing hfi iobs by
where Z(λi ) = dxe i λi fi (x) is the partition function. hfi idata above.
R P
The maximum entropy distribution is clearly just This subtle change has important implications for
the usual Boltzmann distribution with energy E(x) = training MaxEnt models. First, since we do not know
83
these averages exactly, but must estimate from the data, The resulting probability density function is,
our training procedures must be careful not to overfit to
the observations (our samples might not be reflective of p(x) = Z −1 e−E(x)
the true values of these statistics). Second, the averages 1 1 T −1 T 1 T
D. Computing gradients exactly for most interesting models in both physics and
ML.
We still need to specify a procedure for minimizing the There are exceptional cases in which we can calcu-
cost function. One powerful and common choice that late expectation values analytically. When this hap-
is widely employed when training energy-based models pens, the generative model is said to have a Tractable
is stochastic gradient descent (SGD) (see Sec. IV). Per- Likelihood. One example of a generative model with a
forming MLE using SGD requires calculating the gradi- Tractable Likelihood is the Gaussian MaxEnt model for
ent of the log-likelihood Eq. (186) with respect to the real valued data discussed in Eq. (185). The param-
parameters θi . To simplify notation and gain intuition, eters/Lagrange multipliers for this model are the local
it is helpful to define “operators” Oi (x), conjugate to the fields a and the pairwise coupling matrix J. In this case,
parameters θi the usual manipulations involving Gaussian integrals al-
low us to exactly find the parameters µ = −J −1 a and
∂E(x; θi ) Σ = −J −1 , yielding the familiar expressions µ = hxidata
Oi (x) := . (190)
∂θi and Σ = h(x − hxidata )(x − hxidata )T idata . These are
Since the partition function is just the cumulant gener- the standard estimates for the sample mean and covari-
ative function for the Boltzmann distribution, we know ance matrix. Converting back to the Lagrange multipli-
that the usual statistical mechanics relationships between ers yields
expectation values and derivatives of the log-partition
J = −h(x − hxidata )(x − hxidata )T i−1
data . (193)
function hold:
∂ log Z({θi }) Returning to the generic case where most energy-based
hOi (x)imodel = Trx pθ (x)Oi (x) = − . (191) models have intractable likelihoods, we must estimate ex-
∂θi
pectation values numerically. One way to do this is draw
In terms of the operators {Oi (x)}, the gradient of samples Smodel = {x0i } from the model pθ (x) and evalu-
Eq. (186) takes the form (Ackley et al., 1987) ate expectation values using these samples:
∂L({θi }) D ∂E(x; θi ) E ∂ log Z({θi }) Z
−
X
= + hh(x)imodel = dxpθ (x)h(x) ≈ g(x0i ). (194)
∂θi ∂θi data ∂θi
x0i ∈Smodel
= hOi (x)idata − hOi (x)imodel . (192)
These equations have a simple and beautiful interpre- The samples from the model x0i ∈ Smodel are often re-
tation. The gradient of the log-likelihood with respect to ferred to as fantasy particles in the ML literature and
a model parameter is a difference of moments – one calcu- can be generated using simple MCMC algorithms such
lated directly from the data and one calculated from our as Metropolis-Hasting which are covered in most modern
model using the current model parameters. The data- statistical physics classes. However, if the reader is unfa-
dependent term is known as the positive phase of the miliar with MCMC methods or wants a quick refresher,
gradient and the model-dependent term is known as the we recommend the concise and beautiful discussion of
negative phase of the gradient. This derivation also gives MCMC methods from both the physics and ML point-
an intuitive explanation for likelihood-based training pro- of-view in Chapters 29-32 of David MacKay’s masterful
cedures. The gradient acts on the model to lower the en- book (MacKay, 2003).
ergy of configurations that are near observed data points Finally, we note that once we have the fantasy particles
while raising the energy of configurations that are far (samples) from the model, we can also easily calculate
from observed data points. Finally, we note that all infor- the gradient of an arbitrary expectation value hg(x)imodel
mation about the data only enters the training procedure using what is commonly called the “log-derivative trick”
through the expectations hOi (x)idata and our generative in ML (Fu, 2006; Kleijnen and Rubinstein, 1996):
model is blind to information beyond what is contained
∂ ∂pθ (x)
Z
in these expectations. hg(x)imodel = dx g(x)
∂θi ∂θi
To use SGD, we must still calculate the expectation D ∂ log p (x)
values that appear in Eq. (192). The positive phase of the
E
θ
= g(x)
gradient – the expectation values with respect to the data ∂θi model
– can be easily calculated using samples from the training = hOi (x)g(x)imodel
dataset. However, the negative phase – the expectation
X
≈ Oi (x)g(x0i ). (195)
values with respect to the model – are generally much x0i ∈Smodel
more difficult to compute. We will see that in almost
all cases, we will have to resort to either numerical or This expression allows us to take gradients of more com-
approximate methods. The fundamental reason for this plex cost functions beyond the MLE procedure discussed
is that it is impossible to calculate the partition function here.
86
E. Summary of the training procedure layers of latent variables, we can even construct powerful
deep generative models that possess many of the same
We now summarize the discussion above and present a desirable properties as the deep, discriminative neural
general procedure for training an energy based model us- networks.
ing SGD on the cost function (see Sec. IV). Our goal is to We begin with a discussion that tries to provide a sim-
fit the parameters of a model pλ ({θi }) = Z −1 e−E(x,{θi }) . ple intuition for why latent variables are such a powerful
Training the model involves the following steps: tool for generative models. Next, we introduce a power-
1. Read a minibatch of data, {x}. ful class of latent variable models called Restricted Boltz-
mann Machines (RBMs) and discuss techniques for train-
2. Generate a random samplea (fantasy partscles) ing these models. After that, we discuss Deep Boltzmann
{x0 } ∼ pλ using an MCMC algorithm (e.g., Machines (DBMs), which have multiple layers of latent
Metropolis-Hastings). variables. We then introduce the new Paysage package
for training energy-based models and demonstrate how
3. Compute the gradient of log-likelihood using these to use it on the MNIST dataset and samples from the
samples and Eq. (192), where the averages are Ising model. We conclude by discussing recent physics
taken over the minibatch of data and the fantasy literature related to energy-based generative models.
particles/samples from the model, respectively.
– i.e., a visible and hidden layer that interact with each To understand what correlations are captured by p(v) it
other but not among themselves – is often depicted using is helpful to introduce the distribution
a graph of the form shown in Fig. 61.
ebµ (hµ )
An RBM can have different properties depending on qµ (hµ ) = (205)
Z
whether the hidden and visible layers are taken to be
Bernoulli or Gaussian. The most common choice is to of hidden units hµ , ignoring the interactions between v
have both the visible and hidden units be Bernoulli. This and h, and the cumulant generating function
is what is typically meant by an RBM. However, other tn
Z X
Kµ (t) := log dhµ qµ (hµ )ethµ = κµ(n) . (206)
combinations are also possible and used in the ML lit- n!
n
erature. When all the units are continuous, the RBM
(n)
reduces to a multi-dimensional Gaussian with a very par- Kµ (t) is defined such that the nth cumulant is κµ =
ticular correlation structure. When the hidden units are ∂tn Kµ |t=0 .
continuous and the visible units are discrete, the RBM is The cumulant generating function appears in the
equivalent to a generalized Hopfield model (see discussion marginal free-energy of the visible units, which can be
above). When the the visible units are continuous and rewritten (up to a constant term) as:
the hidden units are discrete, the RBM is often called a !
Gaussian Bernoulli Restricted Boltzmann Machine (Dahl
X X X
E(v) = − ai (vi ) − Kµ Wiµ vi
et al., 2010; Hinton and Salakhutdinov, 2006). It is even i µ i
possible to perform multi-modal learning with a mixture X XX (
P
Wiµ vi )n
of continuous and discrete variables. For all these archi- =− ai (vi ) − κ(n)
µ
i
n!
tectures, the important point is that all interactions occur i µ n
only between the visible and hidden units and there are
!
X X X
no interactions between units within the hidden or visi- =− ai (vi ) − κ(1)
µ Wiµ vi
ble layers (see Fig. 61). This is analogous to Quantum i i µ
action structure has two major advantages: (i) it enables We see that the marginal energy includes all orders of in-
capturing both pairwise and higher-order correlations be- teractions between the visible units, with the n-th order
tween the visible units and (ii) it makes it easier to sample cumulants of qµ (hµ ) weighting the n-th order interac-
from the model using an MCMC method known as block tions between the visible units. In the case of the Hop-
Gibbs sampling, which in turn makes the model easier to field model we discussed previously, qµ (hµ ) is a standard
train. (1)
Gaussian distribution where the mean is κµ = 0, the
Before discussing training, it is worth better under- (2)
variance is κµ = 1, and all higher order cumulants are
standing the kind of correlations that can be captured zero. Plugging these cumulants into Eq. (207) recovers
using an RBM. To do so, we can marginalize over the hid- Eq. (199).
den units and ask about the resulting distribution over These calculations make clear the underlying reason
just the visible units for the incredible representational power of RBMs with
a Bernoulli hidden layer. Each hidden unit can encode in-
e−E(v,h)
Z Z
p(v) = dhp(v, h) = dh (203) teractions of arbitrarily high order. By combining many
Z different hidden units, we can encode very complex in-
teractions at all orders. Moreover, we can learn which
(where the integral should be replaced by a trace in all
order of correlations/interactions are important directly
expressions for discrete units).
from the data instead of having to specify them ahead of
We can also define a marginal energy using the expres-
time as we did in the MaxEnt models. This highlights
sion
the power of generative models with even the simplest in-
e−E(v) teractions between visible and latent variables to encode,
p(v) := . (204) learn, and represent complex correlations present in the
Z
data.
Combining these equations,
Z
C. Training RBMs
E(v) = − log dhe−E(v,h)
X X Z P RBMs are a special class of energy-based generative
=− ai (vi ) − log dhµ ebµ (hµ )+ i vi Wiµ hµ models, which can be trained using the Maximum Like-
i µ lihood Estimation (MLE) procedure described in detail
89
in Sec. XV. To briefly recap, first, we must choose a cost A Alternating Gibbs Sampling
function – for MLE this is just the negative log-likelihood t=0 t=1 t=2 t = οο
with or without an additional regularization term to pre-
vent overfitting. We then minimize this cost function us-
ing one of the Stochastic Gradient Descent (SGD) meth-
data
ods described in Sec. IV.
The gradient itself can be calculated using Eq. (192). B Contrastive Divergence (CD-n)
For example, for the Bernoulli-Bernoulli RBM in t=0 t=1 t=2 t=n
Eq. (202) we have
∂L({Wiµ , ai , bµ })
= hvi hµ idata − hvi hµ imodel (208)
∂Wiµ data
∂L({Wiµ , ai , bµ })
∂ai
= hvi idata − hvi imodel C Persistent Contrastive Divergence (PCD-n)
t=0 t=1 t=2 t=n
∂L({Wiµ , ai , bµ })
= hhµ idata − hhµ imodel ,
∂bµ
where the positive expectation with respect to data is un- fantasy particles
derstood to mean sampling from the model while clamp- from last SGD step
ing the visible units to their observed values in the data.
As before, calculating the negative phase of the gradi- FIG. 62 (Top) To draw fantasy particles (samples from
ent (the expectation value with respect to the model) model) we can perform alternating (block) Gibbs sampling
requires that we draw samples from the model. Luck- between the visible and hidden layers starting with a sam-
ple from the data using the marginal distributions p(h|v)
ily, the bipartite form of the interactions in RBMs were and p(v|h). The “time” t corresponds to the time in the
specifically chosen with this in mind. Markov chain for the Monte Carlo and measures the num-
ber of passes between the visible and hidden states. (Middle)
In Contrastive Divergence (CD), we approximately sample
1. Gibbs sampling and contrastive divergence (CD) the model by terminating the Gibbs sampling after n steps
(CD-n) starting from the data. (C) In Persistent Contrastive
The bipartite interaction structure of an RBM makes it Divergence (PCD), instead of restarting our sample from the
possible to calculate expectation values using a Markov data, we initialize the sampler with the fantasy particles cal-
culated from the model at the last SGD step.
Chain Monte Carlo (MCMC) method known as Gibbs
sampling. The key reason for this is that since there are
no interactions of visible units with themselves or hidden
values with respect to the data. To calculate expectation
units with themselves; the visible and hidden units of an
values with respect to the model, we use (block) Gibbs
RBM are conditionally independent:
sampling. The idea behind (block) Gibbs sampling is
Y to iteratively sample from the conditional distributions
p(v|h) = p(vi |h)
ht+1 ∼ p(h|vt ) and vt+1 ∼ p(v|ht+1 ) (see Figure 62,
i
Y top). Since the units are conditionally independent, each
p(h|v) = p(hµ |v), (209) step of this iteration can be performed by simply draw-
µ ing random numbers. The samples are guaranteed to
converge to the equilibrium distribution of the model in
with
the limit that t → ∞. At the end of the Gibbs sampling
procedure, one ends up with a minibatch of samples (fan-
X
p(vi = 1|h) = σ(ai + Wiµ hµ ) (210)
µ tasy particles).
X One drawback of Gibbs sampling is that it may take
p(hµ = 1|v) = σ(bµ + Wiµ vi )
many back and forth iterations to draw an independent
i
sample. For this reason, the Hinton group introduced
and where σ(x) = 1/(1 + e−x ) is the sigmoid function. an approximate Gibbs sampling technique called Con-
Using these expressions it is easy to compute expec- trastive Divergence (CD) (Hinton, 2002; Hinton et al.,
tation values with respect to the data. The input to 2006). In CD-n, we just perform n iterations of (block)
gradient descent is a minibatch of observed data. For Gibbs sampling, with n often taken to be as small as 1
each sample in the minibatch, we simply clamp the visi- (see Figure 62)! The price for this truncation is, of course,
ble units to the observed values and apply Eq. (210) using that we are not drawing samples from the true model dis-
the probability for the hidden variables. We then average tribution. But for our purpose – using the expectations
over all samples in the minibatch to calculate expectation to estimate the gradient for SGD – CD-n has been proven
90
to work reasonably well. As long as the approximate gra- Deep Boltzmann Layerwise Fine-tuning with
dients are reasonably correlated with the true gradient, Machine (DBM) Pretraining PCD on full DBM
SGD will move in a reasonable direction. CD-n of course
does come at a price. Truncating the Gibbs sampler pre-
vents sampling far away from the starting point, which
for CD-n are the data points in the minibatch. Therefore,
our generative model will be much more accurate around
regions of feature space close to our training data. Thus,
as is often the case in ML, CD-n sacrifices the ability
to generalize to some extent in order to make the model
easier to train.
Some of these undesirable features can be tempered
by using a slightly different variant of CD called Persis- FIG. 63 Deep Boltzmann Machine contain multiple hidden
tent Contrastive Divergence (PCD) (Tieleman and Hin- layers. To train deep networks, first we perform layerwise
ton, 2009). In PCD, rather than restarting the Gibbs training where each two layers are treated as a RBM. This
sampler from the data at each gradient descent step, we can be followed by fine-tuning using gradient descent and per-
start the Gibbs sampling at the fantasy particles (sam- sistent contrastive divergence (PCD).
ples from the model) in the last gradient descent step (see
Figure 62). Since parameters change slowly compared to
CD and PCD and to result in more interpretable
the Gibbs sampling, samples that are high probability at
learned features.
one step of the SGD are also likely to be high probabil-
ity at the next step. This ensures that PCD does not • Learning Rates Typically, it is helpful to reduce
introduce large errors in the estimation of the gradients. the learning rate in later stages of training.
The advantage of using fantasy particles to initialize the
Gibbs sampler is to allow PCD to explore parts of the • Updates for CD and PCD There are several
feature space that are much further from the training computational tricks one can use for speeding up
dataset than once could reach with ordinary CD. the alternating updates in CD and PCD (see Sec-
tion 3 in (Hinton, 2012)).
2. Practical Considerations
D. Deep Boltzmann Machine
The previous section gave an overview of how to train
RBMs. However, there are many “tricks-of-the-trade” In this section, we introduce Deep Boltzmann Ma-
that are missing from this discussion. Luckily, a succinct chines (DBMs). Unlike RBMs, DBMs possess multi-
summary of these has been compiled by Geoff Hinton ple hidden layers and were the first models rebranded
and published as a note that readers interested in train- as “deep learning” (Hinton et al., 2006; Hinton and
ing RBMs are urged to consult (Hinton, 2012). Salakhutdinov, 2006) 13 . Many of the advantages that
For completeness, we briefly list some of the important are thought to stem from having deep layers were al-
points here: ready discussed in Sec. XI in the context of discrimina-
tive DNNs. Here, we revisit many of the same themes
• Initialization. The model must be initialized. with emphasis on energy-based models.
Hinton suggests taking the weights Wiµ from a An RBM is composed of two layers of neurons that
Gaussian with mean zero and standard deviation are connected via an undirected graph, see Fig. 61. As
σ = 0.01 (Hinton, 2012). An alternative initial- a result, it is possible to perform sampling v ∼ p(v|h)
ization scheme proposed by Glorot and Bengio in- and inference h ∼ p(h|v) with the same model. As with
stead chooses the standard deviation
√ to scale with the Hopfield model, we can view each of the hidden units
the size of the layers: σ = 2/ Nv + Nh where Nv as representative of a pattern, or feature, that could be
and Nh are number of visible and hidden units re- present in the data. (In general, one should think of ac-
spectively (Glorot and Bengio, 2010). The bias of tivity patterns of hidden units representing features in
the hidden units is initialized to zero while the bias the data.) The inference step involves assigning a proba-
of the visible units is typically taken to be inversely bility to each of these features that expresses the degree
proportional to the mean activation, ai = hvi i−1 data . to which each feature is present in a given data sample.
• Regularization One can of course use an L1 or L2
penalty, typically only on the weight parameters,
not the biases. Alternatively, Dropout has been 13 Technically, these were Deep Belief Networks (DBNs) where only
shown to decrease overfitting when training with the top layer was undirected
91
In an RBM, hidden units do not influence each other dur- blow up – a problem that we discussed extensively in the
ing the inference step, i.e. hidden units are conditionally earlier sections of DNNs. It is worth noting that once pre-
independent given the visible units. There are a num- trained, we can use the usual Boltzmann learning rules
ber of reasons why this is unsatisfactory. One reason is (Eq. (192)) to fine-tune the weights and improve the per-
the desire for sparse, distributed representations, where formance of the DBM (Hinton et al., 2006; Hinton and
each observed visible vector will strongly activate a few Salakhutdinov, 2006). As we demonstrate in the next
(i.e. more than one but only a very small fraction) of the section, the Paysage package presented here can be used
hidden units. In the brain, this is thought to be achieved to both construct and train DBNs using such a pretrain-
by inhibitory lateral connections between neurons. How- ing procedure.
ever, adding lateral intra-layer connections between the
hidden units makes the distribution difficult to sample
from, so we need to come up with another way of creat-
ing connections between the hidden units.
With the Hopfield model, we saw that pairwise linear
connections between neurons can be mediated through
another layer. Therefore, a simple way to allow for ef-
fective connections between the hidden units is to add
another layer of hidden units. Rather than just having
two layers, one visible and one hidden, we can add addi-
tional layers of latent variables to account for the corre-
lations between hidden units. Ideally, as one adds more FIG. 64 Samples (“fantasy particles” ) generated using the
and more layers, one might hope that the correlations indicated model trained on MNIST dataset. Samples were
generated by running (alternative) layerwise Gibbs sampling
between hidden variables become smaller and smaller
for 100 steps. This allows the final sample to be very far
deeper into the network. This basic logic is reminiscent of away from the starting point in our feature space. Notice
renormalization procedures that seek to decorrelate lay- that the generated samples look much less like hand writ-
ers at each step (Li and Wang, 2018; Mehta and Schwab, ten reconstructions than in Fig. 60 which uses a single max-
2014; Vidal, 2007). The price of adding additional layers probability iteration of the Gibbs sampler, indicating that
is that the models become harder to train. training is much less effective when exploring regions of prob-
Training DBMs is more subtle than RBMs due to the ability space faraway from the training data. In the next
section, we will argue that this is likely generic feature of
difficulty of propagating information from visible to hid-
Likelihood-based training.
den units. However, Hinton and collaborators realized
that some of these problems could be alleviated via a lay-
erwise procedure. Rather than attempting to the train
the whole DBM at once, we can think of the DBM as a
E. Example: Using Paysage for MNIST
stack of RBMs (see Fig. 63). One first trains the bottom
two layers of the DBM – treating it as if it is a stand-
alone RBM. Once this bottom RBM is trained, we can In this section, we demonstrate how to use
generate “samples” from the hidden layer and use these the new open source package Paysage (French
samples as an input to the next RBM (consisting of the for landscape) for training unsupervised energy-
first and second hidden layer – purple hexagons and green based models on the MNIST dataset. Paysage’s
squares in Fig. 63). This procedure can then be repeated documentation is available on GitHub under
to pretrain all layers of the DBM. https://github.com/drckf/paysage/tree/master/docs.
This pretraining initializes the weights so that SGD The package was developed by one of the authors (CKF)
can be used effectively when the network is trained in a along with his colleagues at Unlearn.AI and makes it
supervised fashion. In particular, the pretraining helps easy to build, train, and deploy energy-based generative
the gradients to stay well behaved rather than vanish or models with different architectures.
Below, we show how to build and train four different kinds of models: (i) a “Hopfield” type RBM with Gaussian
hidden units and Bernoulli (binary) visible units, (ii) a conventional RBM where both the visible and hidden units are
Bernoulli, (iii) a conventional RBM with an additional L1 -penalty that enforces sparsity, and (iv) a Deep Boltzmann
Machine (DBM) with three Bernoulli layers with L1 penalty each. In the following, we demonstrate the simplicity of
using Paysage walking the reader step-by-step through shorts snippets of code.
We kick off by loading the required packages. Note that Paysage requires Python 3.6 or higher (see additional guides
to install the package in Notebook 17). We also fix the seed of the random number generator to ensure reproducibility
of our numerical experiment.
92
We want to study the MNIST digit dataset. A preprocessed version of the data is conveniently built into Paysage,
and our first task is to download it. To this end, let us fetch the directory Paysage was installed in and print it:
15 ##### download and preprocess MNIST data in Paysage
16 # fetch paysage directory
17 paysage_path=os.path.dirname(os.path.dirname(paysage.__file__))
18 mnist_path=os.path.join(paysage_path, "mnist", "mnist.h5")
19 shuffled_mnist_path=os.path.join(paysage_path, "mnist", "shuffled_mnist.h5")
20 # print path to Paysage directory
21 print(paysage_path)
To download the data, open up a terminal and navigate to the Paysage directory to run python
mnist/download_mnist.py. We can also check if the data have been successfully downloaded:
22 # check if data has been loaded
23 if not os.path.exists(mnist_path):
24 raise IOError("{} does not exist. run mnist/download_mnist.py to fetch from the web".format(mnist_path)
)
If this is the first time using the data set, we need to shuffle it. This step is necessary, since we shall shortly
employ SGD-based algorithms in the training process (cf. Sec. IV) which requires using small minibatches of data to
compute the gradient at each step. If the data have an order, then the estimates for the gradients computed from the
minibatches will be biased. Shuffling the data ensures that the gradient estimates are unbiased (though still noisy).
The data can be compressed by setting ‘complevel > 0‘, but we won’t use that here.
26 ##### set up minibatch data generator
27 # shuffle data if running for the first time
28 if not os.path.exists(shuffled_mnist_path):
29 DataShuffler(mnist_path,shuffled_mnist_path,complevel=0).shuffle()
Next, we create a python generator, which splits the data into a training and validation sets, and separates them into
minibatches of size batch_size. Before we begin training, we set data to training mode.
30 # batch size
31 batch_size=100
32 # create data generator object with minibathces
33 data=HDFBatch(shuffled_mnist_path,’train/images’, batch_size,
34 transform=paysage.preprocess.binarize_color,train_fraction=0.95)
35 # reset the data generator in training mode
36 data.reset_generator(mode=’train’)
To monitor the progress of performance metrics during training, we define the variable performance which tells
Paysage to measure the reconstruction error from the validation set. Possible metrics include the reconstruction error
(used in this example) and metrics related to difference in energy of random samples and samples from the model
(see metrics.md in Paysage documentation for a complete list).
37 # the reconstruction error will be computed from the validation set
38 performance=ProgressMonitor(data,metrics=[’ReconstructionError’])
Having loaded and preprocessed the data, we now move on to construct a hopfield model. To do this, we use the
Model class and with a visible BernoulliLayer and a hidden GaussianLayer. Note that the visible layer has the
same size as the input data points, which is read off data.ncols. The number of hidden units is num_hidden_units.
93
We also standardize the mean and variance of the Gaussian layer setting them to zero and unity, respectively (the
nomenclature of Paysage here is inspired by the terminology in Variational Autoencoders, cf. Sec. XVII).
40 ##### create hopfield model
41 # hidden units
42 num_hidden_units=200
43 # set up the model
44 hopfield=Model([BernoulliLayer(data.ncols), # visible layer
45 GaussianLayer(num_hidden_units) # hidden layer
46 ])
47 # set mean and standard deviation of hidden layer to to 0 and 1, respectively
48 hopfield.layers[1].set_fixed_params([’loc’, ’log_var’])
We choose to train the model with the Adam optimizer. To ensure convergence, we attenuate the learning_rate
hyperparameter according to a PowerLawDecay schedule: learning_rate(t) =initial/(1+coefficient×t). It will
prove convenient to define the function Adam_optimizer for this purpose.
49 # set up an optimizer method (ADAM in this case)
50 def ADAM_optimizer(initial,coefficient):
51 # define learning rate attenuation schedule
52 learning_rate=PowerLawDecay(initial=initial,coefficient=coefficient)
53 # return optimizer object
54 return optimizers.ADAM(stepsize=learning_rate)
Next, we have to create the model. First, we initialize the model using the initialize function attribute which
accepts the data as a required argument. We choose the initialization routine glorot, cf. discussion in Sec XVI.C.2.
Second, we define an optimizer calling the function Adam_optimizer, and store the object under the name opt. To
create an MCMC sampler, we use the method from_batch of the SequentialMC class, passing the model and the
data. Last, we create an SGD object called trainer to train the model using Persistent Contrastive Divergence (pcd)
with a fixed number of monte_carlo_steps. We can also monitor the reconstruction error during training. Last,
we train the model in epochs (cf. variable num_epochs), calling the train() method of trainer. These steps are
universal for shallow generative models, and it is convenient to combine them in the function train_model, which we
shall use repeatedly.
55 # define function to compile and train model
56 num_epochs=20 # training epochs
57 monte_carlo_steps=1 # number of MC sampling steps
58 def train_model(model,num_epochs,monte_carlo_steps,performance):
59 # make a simple guess for the initial parameters of the model
60 model.initialize(data,method=’glorot_normal’)
61 # set optimizer
62 opt=ADAM_optimizer(1E-2,1.0)
63 # set up a Monte Carlo sampler
64 sampler=SequentialMC.from_batch(model,data)
65 # use persistent contrastive divergence to fit the model
66 trainer=SGD(model,data,opt,num_epochs,sampler,
67 method=pcd,mcsteps=monte_carlo_steps,monitor=performance)
68 # train model
69 trainer.train()
70 # train hopfield model
71 train_model(hopfield,num_epochs,monte_carlo_steps,performance)
Let us now show how to build a few more generative models with Paysage. We can easily create a Bernoulli RBM
and train it using the functions defined above as follows:
73 ##### Bernoulli RBM
74 rbm = Model([BernoulliLayer(data.ncols), # visible layer
75 BernoulliLayer(num_hidden_units) # hidden layer
76 ])
77 # train Bernoulli RBM
78 train_model(rbm,num_epochs,monte_carlo_steps,performance)
Constructing a Bernoulli RBM with L1 regularization is also straightforward in Paysage, using the add_penalty
method which accepts a dictionary as an input. Some layers may have multiple properties (such as the location and
scale parameters of a Gaussian layer) so the dictionary key specifies to which property the penalty should be applied
94
sampler.state as required arguments. We can combine these steps in the function compute_reconstructions.
Figure 60 shows the result.
125 ##### compute reconstructions
126 def compute_reconstructions(model, data):
127 """
128 Computes reconstructions of the input data.
129 Input v -> h -> v’ (one pass up one pass down)
130
131 Args:
132 model: a model
133 data: a tensor of shape (num_samples, num_visible_units)
134
135 Returns:
136 tensor of shape (num_samples, num_visible_units)
137
138 """
139 # a configuration sampled from an RBM needs to specify the values
140 # of both the visible and hidden units
141 # since the data only specify the visible units, we need to initialize
142 # some hidden unit values
143 # the visible and hidden units are stored in a State object
144 data_state=State.from_visible(data,model)
145 # define MC sampler
146 sampler=SequentialMC(model)
147 # define a starting point for MC sampler
148 sampler.set_state(data_state)
149 # compute reconstructions
150 recons=model.deterministic_iteration(1,sampler.state).units[0]
151 #
152 return paysage.backends.to_numpy_array(recons)
Once we have the trained models ready, we can use MCMC to draw samples from the corresponding probability
distributions, called “fantasy particles”. To this end, let us draw a random_sample from the validation data and
compute the model_state. Next, we define an MCMC sampler based on the model, and set its state to model_state.
To compute the fantasy particles, we do layer-wise Gibbs sampling for a total of n_steps equilibration steps. The last
step (controlled by the boolean mean_field) is a final mean-field iteration (see tricks discussed in (Hinton, 2012)).
Figure 64 shows the result.
154 ##### compute fantasy particles
155 def compute_fantasy_particles(model,data,num_steps,mean_field=True):
156 """
157 Draws samples from the model using Gibbs sampling Markov Chain Monte Carlo .
158 Starts from randomly initialized points.
159
160 Args:
161 model: a model
162 data: a tensor of shape (num_samples, num_visible_units)
163 num_steps (int): the number of update steps
164 mean_field (bool; optional): run a final mean field step to compute probabilities
165
166 Returns:
167 tensor of shape (num_samples, num_visible_units)
168
169 """
170 # compute random data sample
171 random_sample=model.random(data)
172 # get model state from visible layer
173 model_state=State.from_visible(random_sample,model)
174 # define MC sampler
175 sampler=SequentialMC(model)
176 # change sampler state
177 sampler.set_state(model_state)
178 # does n_steps forward and backward passes
96
179 sampler.update_state(num_steps)
180 if mean_field: # see Hinton’s 2012 paper: trick (practical guide for training)
181 fantasy_particles=model.mean_field_iteration(1,sampler.state).units[0]
182 else:
183 fantasy_particles=sampler.state.units[0]
184 #
185 return paysage.backends.to_numpy_array(fantasy_particles)
One can use generative models to reduce the noise in images (de-noising). Let us randomly flip a fraction,
fraction_to_flip, of the black&white bits in the validation data, and use the models defined above to reconstruct
(de-noise) the digit images. Figure 65 shows the result.
187 ##### denoise MNIST images
188 # get validation data
189 examples = data.get(mode=’validate’) # shape (batch_size, 784)
190 # reset data generator to beginning of the validation set
191 data.reset_generator(mode=’validate’)
192 # add some noise to the examples by randomly flipping some pixels 0 -> 1 and 1 -> 0
193 fraction_to_flip=0.15
194 # create flipping mask
195 flip_mask=paysage.backends.rand_like(examples) < fraction_to_flip
196 # compute noisy data
197 noisy_data=(1-flip_mask) * examples + flip_mask * (1 - examples)
198 # define number of digits to display
199 num_to_display=8
200 # compute de-noised images
201 hopfield_denoised=compute_reconstructions(hopfield,noisy_data[:num_to_display])
202 rbm_denoised=compute_reconstructions(rbm,noisy_data[:num_to_display])
203 rbmL1_denoised=compute_reconstructions(rbmL1,noisy_data[:num_to_display])
204 dbm_L1_denoised=compute_reconstructions(dbm_L1,noisy_data[:num_to_display])
The full code used to generate Figs. 60, 64 and 65 is available in Notebook 17.
F. Example: Using Paysage for the Ising Model One of the lessons from this problem is that, similar to
real-life problems, this task is computationally intensive
We can also use Paysage to analyze the 2D Ising data (Notebook 17). The training time on present-day laptops
set. In previous sections, we used our knowledge of the easily exceeds that of previous studies from this review.
critical point at Tc /J ≈ 2.26 (see Onsager’s solution) to Thus, we encourage the interested reader to try GPU-
label the spin configurations and study the problem of based training and study the resulting speed-up.
classifying the states according to their phase of matter. Figures 66, 67 and 68 show the results of the numerical
However, in more complicated models, where the precise experiment at T /J = 1.75, 2.25, 2.75 respectively, for a
position of Tc is not known, one cannot label the states DBM with Nhidden = 800. Looking at the reconstructions
with such an accuracy, if at all. and the fantasy particles, we see that our DBM works
well in the disordered and critical regions. However, the
As we explained, generative models can be used to
chosen layer architecture is not optimal for T = 1.75 in
learn a variational approximation for the probability dis-
the ordered phase.
tribution that generated the data points. By using only
the 2D spin configurations, we now want to train a
Bernoulli RBM, the fantasy particles of which are ther- G. Generative models in physics
mal Ising configurations. Unlike in previous studies of
the Ising dataset, here we perform the analysis at a fixed Generative models have been studied in the context of
temperature T . We can then apply our model at three physics. For instance, in Biophysics, dynamic Boltzmann
different values T = 1.75, 2.25, 2.75 in the ordered, criti- distributions have been used as effective models in chem-
cal and disordered regions, respectively. ical kinetics (Ernst et al., 2018). In Statistical Physics,
We define a Deep Boltzmann machine with two hidden they were used to identify the criticality in the Ising
layers of Nhidden and Nhidden /10 units, respectively, and model (Morningstar and Melko, 2017). In parallel, tools
apply L1 regularization to all weights. As in the MNIST from Statistical Physics have been applied to analyze the
problem above, we apply layer-wise pre-training, and de- learning ability of RBMs (Decelle et al., 2018; Huang,
ploy Persistent Contrastive Divergence to train the DBM 2017b), characterizing the sparsity of the weights, the ef-
using ADAM. fective temperature, the nonlinearities in the activation
97
FIG. 66 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the ordered
phase of the the 2D Ising data set at T /J = 1.75. We used two hidden layers of 1000 and 100 layers, respectively.
FIG. 67 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the critical
regime of the the 2D Ising data set at T /J = 2.25. We used two hidden layers of 1000 and 100 layers, respectively.
A. The limitations of maximizing Likelihood A related quantity called the Jensen-Shannon divergence,
1
p + q
p + q
The Kullback-Leibler (KL)-divergence plays a central JS(p, q) = DKL p
+ DKL q
2 2
2
role in many generative models. Developing an intuition
about KL-divergences is one of the keys to understanding does satisfy all of the properties of a squared metric (i.e.,
why adversarial learning has proved to be such a power- the square root of the Jensen-Shannon divergence is a
ful method for generative modeling. Here, we revisit the metric). An important property of the KL-divergence
KL-divergence with an eye towards understanding GANs that we will make use of repeatedly is its positivity:
and motivate adversarial learning. The KL-diveregence DKL (p||q) ≥ 0 with equality if and only if p(x) = q(x)
measures the similarity between two probability distribu- almost everywhere.
tions p(x) and q(x). Strictly speaking, the KL divergence In generative models in ML, the two distributions
is not a metric because it is not symmetric and does not we are usually concerned with are the model distribu-
satisfy the triangle inequality. tion pθ (x) and the data distribution pdata (x). We of
Given two distributions, there are two distinct KL- course would like these models to be as similar as possi-
divergences we can construct: ble. However, as we discuss below, there are many sub-
tleties about how we measure similarities that can have
Z
p(x) large consequences for the behavior of training proce-
DKL (p||q) = dxp(x) log (211) dures. Maximizing the log-likelihood of the data under
q(x)
the model is the same as minimizing the KL divergence
q(x)
Z
DKL (q||p) = dxq(x) log . (212) between the data distribution and the model distribu-
p(x) tion DKL (pdata ||pθ ). To see this, we can rewrite the KL
99
FIG. 68 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the disordered
phase of the the 2D Ising data set at T /J = 2.75. We used two hidden layers of 1000 and 100 layers, respectively.
∆ = 2.0 ∆ = 5.0
Model Model
Data Data
15
10
0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 −4 −2 0 2 4
∆ x
FIG. 69 KL-divergences between the data distribution pdata
and the model pθ . Data is drawn from a bimodal Gaus- Data-Model
sian distribution with unit variances peaked at ±∆ with 5
Model-Data
∆ = 2.0 and the model pθ (x) is a Gaussian with mean zero 4
and same variance as pθ (x). (Top) pdata and pθ for ∆ = 2.
(Bottom) DKL (pdata ||pθ ) (Data-Model) and DKL (pθ ||pdata ) 3
(Model-Data) as a function of ∆. Notice that DKL (pdata ||pθ )
is insensitive to placing weight in the model distribution in 2
regions where pdata ≈ 0 whereas DKL (pθ ||pdata ) punishes this
harshly. 1
0
0 2 4 6 8 10
tutorial by Goodfellow (Goodfellow, 2016). GANs are ∆
also notorious for being hard to train. For this reason,
FIG. 70 KL-divergences between the data distribution pdata
readers wishing to play with GANs should also consider
and the model pθ . Data is drawn from a Gaussian mixture
the very nice practical discussion entitled “How to train of the form pdata = 0.25N (−∆) + 0.25 ∗ N (∆) + 0.5N (0)
a GAN” (affectionately labeled “ganhacks”) available at where N (a) is a normal distribution with unit variance cen-
https://github.com/soumith/ganhacks. tered at x = a. pθ (x) is a Gaussian with σ 2 = 2. (Top)
The central idea of GANs is to construct two differ- pdata and pθ for ∆ = 5. (Middle) pdata and pθ for ∆ = 1.
entiable neural networks (see Fig. 71). The first neural (Bottom) DKL (pdata ||pθ ) [Data-Model] and DKL (pθ ||pdata )
network, usually a (de)convolutional network based on [Model-Data] as a function of ∆. Notice that DKL (pθ ||pdata )
is insensitive to placing weight in the model distribution in
the DCGAN architecture (Radford et al., 2015), approx- regions where pθ ≈ 0 whereas DKL (pdata ||pθ ) punishes this
imates a generator function G(z; θG ) that takes as input harshly .
a z sampled from some prior on the latent space, and out-
puts a x from the model. The second network approxi-
mates a discriminator function D(x; θD ) that is designed To define the cost function for training, it is useful to
to distinguish between x from the data and samples gen- define the functional
erated by the model: x = G(z; θG ). The scalar D(x) rep-
resents the probability that x came from the data rather V (D, G) = Ex∼pdata log D(x) (215)
than the model pθG . We train D to distinguish actual + Ez log [1 − D(G(z))]. (216)
data points from synthetic examples and the generative
network to fool the discriminative network. In the version of GANs most amenable to theoretical
101
“Generator”
x sampled from network G x-space
data
latent space
Dataset: D
input z
analysis – though not the version usually implemented We now turn our attention to another class of powerful
in practice – we take the cost function for the discrimi- latent-variable, generative models called Variational Au-
nator and generators to be C (G) = −C (D) = 12 V (D, G). toencoders (VAEs). VAEs exploit the variational/mean-
This choice of cost functions corresponds to what is called field theory ideas presented in Sec. XIV to build complex
a zero-sum game. Since the discriminator is maximized, generative models using deep neural networks (DNNs).
we can write a cost function for the generator as The central idea behind VAEs is to represent the map
from latent variables to observable variables using a
C(G) = max V (G, D). (217) DNN. The use of latent variables is a common theme
D
in many of the generative models we have encountered in
It turns out that this cost function is related to the unsupervised learning tasks from Gaussian Mixture Mod-
Jensen-Shannon Divergence in a simple manner (Good- els (see Sec. XIII) to Restricted Boltzmann Machines.
fellow, 2016; Goodfellow et al., 2014): However, in VAEs this mapping, p(x|z, θ) is much less
restrictive and much more complicated since it takes the
C(G) = − log 4 + 2JS(pdata , pθG ). (218) form of a DNN. This added complexity means we can-
not use techniques such as Expectation Maximization to
This brings us back full circle to the discussion in the last train the model and instead must rely of methods based
section on KL-divergences. on backpropagation.
102
1. VAEs as variational models the log-likelihood. For this reason, in the VAE literature,
it is often called the Evidence Lower BOund or ELBO.
We start by discussing VAEs from a variational per- Equation (221) has a beautiful interpretation. The
spective. We will make extensive use of the concepts first term in this equation can be viewed as a “recon-
introduced in Sec. XIV and the reader is strongly- struction error”, where we start with data x, encode it
encouraged to refresh their memory of this section before into the latent representation using our approximate pos-
proceeding. A VAE is a latent-variable model pθ (x, z) terior qφ (z|x), and then evaluate the log probability of
with a latent space z and observed variables x. The latent the original data given the inferred latents. For binary
variables are drawn from some pre-specified prior distri- variables, this is just the cross-entropy which we first en-
bution p(z). In practice, p(z) is almost always taken to countered when studying logistic regression, cf. Sec. VII.
be a multivariate Gaussian. The conditional distribution The second term acts as a regularizer and encourages the
pθ (x|z) maps points in the latent space to new examples posterior distributions to be close to p(z). By maximizing
(see Fig. 72). This is often called a “stochastic decoder” the ELBO, we minimize the KL-divergence between the
and defines the generative model for the data. The re- approximate and true posterior. By choosing a tractable
verse mapping that gives the posterior over the latent qφ (z|x), we make this feasible (see Fig. 72).
variables pθ (z|x) is often called the “stochastic encoder”.
A central challenge in latent variable modeling is to in-
fer the posterior distribution of the latent variables given 2. Training via the reparametrization trick
a sample from the data. This can in principle be done
via Bayes’ rule: pθ (z|x) = p(z)p θ (x|z)
. For some models, VAEs train models by minimizing the variational free
pθ (x)
we can calculate this analytically. In this case, we can energy (maximizing the ELBO). Training a VAE is some-
use techniques like Expectation Maximization (EM) (see what complicated because we must simultaneously learn
Sec. XIV). However, in general this is intractable since two sets of parameters: the parameters θ that define our
the denominator requires computing a sum generative model pθ (x, z) as well as the variational pa-
R over all con-
figurations of the latent variables, p θ (x) = pθ (x, z)dz = rameters φ in qφ (z|x). The basic approach will be the
pθ (x|z)p(z)dz (i.e. a partition function in the language same as for all DNN models: we will use gradient de-
R
of physics), which is often intractable for large models. scent with the variational free energy as the objective
In VAEs, where the pθ (x|z) is modeled using a DNN, this (cost) function. For a dataset L, we can write our cost
is impossible. function as
A first attempt to address the issue of computing X
p(x) could be through importance sampling (Neal, 2001). Cθ,φ (L) = −Fqφ (x). (222)
That is, we choose a proposal distribution q(z|x) which x∈L
is easy to sample from, and rewrite the sum as an expec- Taking the gradient with respect to θ is easy since only
tation with respect to this distribution: the first term in Eq. (221) depends on θ,
p(z)
Z
pθ (x) = pθ (x|z) qφ (z|x)dz. (219) Cθ,φ (x) = Eqφ (z|x) [∇θ log pθ (x, z)]
qφ (z|x)
∼ ∇θ log pθ (x, z) (223)
Thus, by sampling from qφ (z|x) we can get a Monte Carlo
where in the second line we have replaced the expec-
estimate of p(x). However, this requires generating sam-
tation value with a single Monte-Carlo sample z drawn
ples and thus our estimates will be noisy. If our proposal
from qφ (z|x) (see Fig. XVII.C.2). When pθ (x|z) is ap-
distribution is poor, the variance in the estimate can be
proximated by a neural network, this can be calculated
very high.
using backpropagation with the reconstruction error as
An alternative approach that avoids these sampling
the objective function.
issues is to use the variational approach discussed in
On the other hand, calculating the gradient with re-
Sec. XIV. We know from Eq. (160) that we can write
spect to the parameters φ is more complicated since φ
the log-likelihood as
also appears in the expectation value Eqφ (z|x) . Ideally, we
log p(x) = DKL (qφ (z|x)kpθ (z|x, θ)) − Fqφ (x), (220) would like to also use backpropagation to calculate this
as well. It turns out that this can be done by a simple
where the variational free energy is defined as change of variables that often goes under the name the
“reparameterization trick” (Kingma and Welling, 2013;
− Fqφ (x) ≡ Eqφ (z|x) [log pθ (x, z)] − DKL (qφ (z|x)|p(z)). Rezende et al., 2014). The basic idea is to change vari-
(221) ables so that φ no longer appears in the distribution we
In writing this term, we have used Bayes rule and are taking an expectation value with respect to. To do
Eq. (172). Since the KL-divergence is strictly positive, this, we express the random variable z ∼ qφ (z|x) as some
the (negative) variational free energy is a lower-bound on differentiable and invertible transformation of another
103
Datapoint
Negative
Variational Free Energy:
(ELBO)
E [log pθ(x|z) - KL(qφ(z|x)||p(z))]
q (z|x)
φ
FIG. 73 Schematic explaining the computational flow of VAEs. Figure based on Kingma Ph.D. dissertation Chapter 2. (Kingma
et al., 2017).
log qφ (z|x) = log p() − log dφ (x, φ). (228) There is a fundamental connection between the vari-
ational autoencoder objective and the information bot-
Since we can calculate gradients, we can now use back- tleneck (IB) for lossy compression (Tishby et al., 2000).
propagation on the full the ELBO objective function (we The information bottleneck imagines we have input data
return to this below when we discuss concrete architec- x that is correlated with another variable of interest, y,
tures and implementations of VAE). and we are given access to the joint distribution, p(x, y).
One of the problems that commonly occurs when train- Our task is to take x as input and compress it in such a
ing VAEs by performing a stochastic optimization of the way as to retain as much information as possible about
ELBO (variational free energy) is that it often gets stuck the relevance variable, y. To do this, Tishby et al. pro-
in undesirable local minima, especially at the beginning pose to maximize the objective function
of the training procedure (Bowman et al., 2015; Kingma
et al., 2017; Sønderby et al., 2016). The underlying rea-
son for this is that the ELBO objective function can be LIB = I(y; z) − βI(x; z) (230)
improved in two qualitatively different ways correspond-
ing to each of the two terms in Eq. (221): by minimizing over a stochastic encoding distribution q(z|x), where z is
the reconstruction error or by making the posterior dis- our compression of the input, and β is a tradeoff param-
tribution qφ (z|x) to be close to p(z) (Of course, the goal eter that sets the relative preference of compression and
104
1X
J
The real advantage that VAEs offer over embeddings
−DKL (qφ (z|x)|p(z)) = 1 + log σj2 (x) − µ2j (x) − σj2 (x) such
. as t-SNE is that they are generative models. Given
2 j=1 a set of examples, we can generate new examples – or fan-
(238) tasy particles as they are commonly called in ML – by
sampling the latent space z and then using the decoder to
This analytic expression allows us to implement the map these latent variables to new examples. The results
Gaussian VAE in a straight forward way using neural net- of this procedure are shown in Figure XVII.D.2. In the
works. The computational graph for this implementation top figure, we sample the latent space uniformly in a 5×5
is shown in Fig. 74. Notice that since the parameters are grid. Notice that this results in extremely similar exam-
all compositions of differentiable functions, we can use ples through much of the latent space. The underlying
standard backpropagation algorithms to train VAEs. reason for this is that uniform sampling does not respect
the underlying Gausssian structure of the latent space z.
In the bottom figure, we perform a uniform sampling on
2. VAEs for the MNIST dataset
the probability p(z) and mapped this back to the latent
space using the inverse Cumulative Distribution Func-
In Notebook 19, we have implemented a VAE using
tion (CDF) of the Gaussian. We see that the diversity of
Keras and trained it using the MNIST dataset. The ba-
the generated examples is much higher for this sampling
sic architecture is the one describe above. All figures
procedure.
were generated with a VAE that has a latent space of
dimension 2. The architecture of both the encoder and This example is indicative of a more general problem:
decoder is a Multi-layer Perceptron (MLPs) – neural net- once we have learned a generative model how should we
works with a single hidden layer. For this example, we sample latent spaces (White, 2016). This is especially
take the dimension of the hidden layer for both neural important in high-dimensional spaces where direct visu-
networks to be 256. We trained the VAE using the RMS- alization is not possible. Often certain directions in the
prop optimizer for 50 epochs. latent space can have different meanings. A particularly
We can visualize the embedding in the latent space striking visual illustration is the “smile vector” that in-
by plotting z of the test set and coloring the points by terpolates between smiling and frowning faces (White,
digit identity [0-9] (see Figure XVII.D.2). Notice that 2016).
106
physics that may not be immediately obvious, such that the insights remain scattered and a cohesive
as quantum physics (Dunjko and Briegel, 2017). theoretical understanding is lacking.
For example, recent works have used ideas from ML
to investigate disparate topics such as disordered • Biological physics and ML. Biological physics
materials and glasses (Schoenholz, 2017), electronic is generating ever more datasets in fields ranging
structure calculations (Grisafi et al., 2017), design- from neuroscience to evolution and immunology. It
ing and analyzing quantum materials by integrat- is likely that ML will be an important part of the
ing ML with existing techniques such as Dynami- biophysics toolkit in the future. Many of the au-
cal Mean Field Theory (DMFT) (Arsenault et al., thors of this review were inspired to engage with
2014), and even experimental learning of quantum ML for this reason.
states by using ML to aid in quantum tomogra-
phy (Rocchetto et al., 2017). • Using ideas from physics to develop new
ML algorithms. Many of the core ideas of ML
• Machine Learning on quantum computers. from Monte-Carlo techniques to variational meth-
Another interesting area of research that is likely ods have their origin in physics. There has been a
grow is asking if and how quantum comput- tremendous amount of recent work developing tools
ers can help improve state-of-the art ML algo- to understand physical systems that may be of po-
rithms (Arunachalam and de Wolf, 2017; Benedetti tential use to ML. For example, in quantum con-
et al., 2016, 2017; Bromley and Rebentrost, 2018; densed matter techniques such as DMRG, MERA,
Ciliberto et al., 2017; Daskin, 2018; Innocenti et al., etc. have enriched both our practical and con-
2018; Mitarai et al., 2018; Perdomo-Ortiz et al., ceptual understandings (Stoudenmire and White,
2017; Rebentrost et al., 2017; Schuld et al., 2017; 2012; Vidal, 2007; White, 1992). It will be inter-
Schuld and Killoran, 2018; Schuld et al., 2015). esting to figure how and if these numerical methods
Concrete examples that seek to extend some of can be translated from a physics to a ML setting.
the basic ideas and methods we introduced in There are tantalizing hints that this is likely to be
this review to the quantum computing realm in- a fruitful direction (Han et al., 2017; Stoudenmire,
clude: algorithms for quantum-assisted gradient 2017; Stoudenmire and Schwab, 2016).
descent (Kerenidis and Prakash, 2017; Rebentrost
et al., 2016), classification (Schuld and Petruccione,
2017), and Ridge regression (Yu et al., 2017). Inter- B. Topics not covered in review
est in this field will undoubtedly grow once reliable
quantum computers become available (see also this Despite the considerable length of the review, we have
recent review (Dunjko and Briegel, 2017) ). had to make many omissions for the sake of brevity. It
is our hope and belief that after reading this review the
• Monte-Carlo Methods. An interesting area that reader will have the conceptual and practical knowledge
has seen a renewed interest with Bayesian modeling to quickly learn about these other topics. Among the
is the development of new Monte-Carlo methods for most prominent topics missing from this review are:
sampling complex probability distributions. Some
of the workhorses of modern Machine Learning – • Temporal/Sequential Data. We have not cov-
Annealed Importance Sampling (AIS) (Neal, 2001) ered techniques for dealing with temporal or se-
and Hamiltonian or Hybrid Monte-Carlo (HMC) quential data. Here, too there are many connec-
(Neal et al., 2011) – are intimately related to tions with statistical physics. A powerful class of
physics. As pointed out by Radford Neal, AIS is models for sequential data called Hidden Markov
just the Jarzynski inequality (Jarzynski, 1997) as Models (Rabiner, 1989) that utilize dynamical
a Monte-Carlo method and HMC was developed programming techniques have natural statistical
by physicists and exploits Hamiltonian dynamics physics interpretations in terms of transfer matri-
to improve proposal distributions. ces (see (Mehta et al., 2011) for explicit exam-
ple of this). Recently, Recurrent Neural Networks
• Statistical physics style theory of Deep (RNNs) have become an important and powerful
Learning. Many techniques in ML have origins tool for dealing with sequence data (Goodfellow
in statistical physics. Yet, a physics-style theory et al., 2016). RNNs generalize many of the ideas
of Deep Learning remains elusive. A key question discussed in the DNN section to deal with temporal
is to ask when and why these models manage to data.
generalize well. Physicists are only beginning to
ask these questions (Advani and Saxe, 2017; Mehta • Reinforcement Learning. Many of the most ex-
and Schwab, 2014; Saxe et al., 2013; Shwartz-Ziv citing developments in the last five years have come
and Tishby, 2017). But right now, it is fair to say from combining ideas from reinforcement learning
109
with deep neural networks (Mnih et al., 2015; Sut- that form the basis of modern ML with the more com-
ton and Barto, 1998). RL traces its origins to be- monplace notions about what humans and behavioral sci-
haviourist psychology, when it was conceived as entists mean by intelligence (see (Lake et al., 2017) for
a way to explain and study reward-based deci- an enlightening and important modern discussion of this
sion making. RL was put on solid mathematical distinction from a quantitative cognitive science point of
grounds in the 50’s by Richard Bellman and col- view as well as (Dreyfus, 1965) for a surprisingly relevant
laborators, and has by now become an inseparable philosophy-based critique from 1965).
part of robotics and artificial intelligence. RL is a Almost all the techniques discussed here rely on op-
field of Machine Learning, in which an agent learns timizing a pre-specified objective function on a given
how to master performing a specific task through dataset. Yet, we know that for large, complex models
an interaction with its environment. Depending on changing the data distribution or the goal can lead to an
the reward it receives, the agent chooses to take an immediate degradation of performance. Deep networks
action affecting the environment, which in turn de- have poor generalizations to even a slightly different con-
termines the value of the next received reward, and text (the infamous Validation-Test set mismatch). This
so on. The long-term goal of the agent is to max- inability to abstract and generalize is a common criticism
imise the cumulative expected return, thus improv- lobbied against branding modern ML techniques as AI
ing its performance in the longer run. Shadowed by (Lake et al., 2017). For all these reasons, we have chosen
more traditional optimal control algorithms, Rein- to use the term Machine Learning rather than artificial
forcement Learning has only recently taken off in intelligence through out the review.
physics (Albarran-Arriagada et al., 2018; August This is far from the first time we have seen the use of
and HernÃąndez-Lobato, ????; Bukov et al., 2017; the term artificial intelligence and the grandiose promises
Cárdenas-López et al., 2017; Chen et al., 2014; that it implies. In fact, the early 1950’s and 1960’s as
Dunjko et al., 2017; Fösel et al., 2018; Lamata, well as the early 1980’s saw similar AI bubbles (see this
2017; Melnikov et al., 2017; Neukart et al., 2017; interesting summary by Luke Muehlhauser for Open Phi-
Niu et al., 2018; Ramezanpour, 2017; Reddy et al., lanthropy (Muehlhauser, 2016)). These AI bubbles have
2016b; Sriarunothai et al., 2017; Zhang et al., 2018). been followed by what have been dubbed “AI Winters”
Of particular interest are biophysics inspired works (McDermott et al., 1985).
that seek to use RL to understand navigation and The “Singularity” may not be coming but the advances
sensing in turbulent environments (Colabrese et al., in computing and the availability of large data sets likely
2017; Masson et al., 2009; Reddy et al., 2016a; Ver- ensure that the kind of statistical learning frameworks
gassola et al., 2007). discussed are here to stay. Rather than a general arti-
ficial intelligence, the kind of techniques presented here
• Support Vector Machines (SVMs) and Ker- seem to be best suited for three important tasks: (a) au-
nel Methods. SVMs and kernel methods are a tomating prediction from lots of labeled examples in a
powerful set of techniques that work well when the narrowly-defined setting (b) learning how to parameter-
amount of training data is limited (Burges, 1998). ize and capture the correlations of complex probability
The mathematics and theory of SVM are very dif- distributions, and (c) finding policies for tasks with well-
ferent from statistical physics and for this reason defined goals and clear rules. We hope that this review
we chose not to include them here. However, SVMs has given the reader enough conceptual tools to start
and kernel methods have played an extremely im- forming their own opinions about reality and hype when
portant role in ML and are worth understanding. it comes to modern ML research.
The immense scientific progress in ML has also been The last decade has also seen a systematic increase in
accompanied by a massive public relations effort centered the use and deployment of Machine Learning techniques
around Silicon Valley. Starting with the success of Ima- into new areas of life and society. Some of the readers of
geNet (the most prominent early use of GPUs for train- this review may currently be (or eventually be) employed
ing large models) and the widespread adoption of Deep in industrial settings that seek to harness ML for practi-
Learning based techniques by the Silicon Valley compa- cal purposes. However, caution is in order when applying
nies, there has been a deliberate move to rebrand modern ML. Without foresight and accountability, the scale and
ML as “artificial intelligence” or AI (see graphs in (Katz, scope of modern ML algorithms can lead to large scale
2017)). unaccountable and undemocratic outcomes that can re-
AI, by design, is an ambiguous term that mixes aspi- inforce or even worsen existing inequality and inequities.
rations with reality. It also conflates the statistical ideas Mathematician and data scientist turned social commen-
110
tator Cathy O’Neil has dubbed the indiscriminate use of 16 × 10000 samples of 40 × 40 spin configurations (i.e.
these Big Data techniques “Weapons of Math Destruc- the design matrix has 160000 samples and 1600 features)
tion” (O’Neil, 2017). drawn at temperatures 0.25, 0.5, · · · 4.0. The samples
When ML is used in a social context, abstract statis- are drawn for the Boltzmann distribution of the two-
tical relationships have real social consequences. False dimensional ferromagnetic Ising model on a 40×40 square
positives can mean the difference between life and death lattice with periodic boundary conditions.
(for example in the context of “signature drone strikes”)
(Mehta, 2015). ML algorithms, like all techniques, have
important limitations and should be employed with great 2. SUSY dataset
caution. It is our hope that ML practitioners keep this
in mind when working in social settings. The SUSY dataset was generated by Baldi et al (Baldi
All algorithms involve inherent tradeoffs in fairness, a et al., 2014) to explore the efficacy of using Deep Learning
point formalized by computer scientist Jon Kleinberg and for classifying collision events. The dataset is download-
collaborators in a very interesting recent paper (Klein- able from the UCI Machine Learning Repository, a won-
berg et al., 2016). It is far from clear how to make al- derful resource for interesting datasets. Here we quote
gorithms fair for all people involved. This is even more directly from the paper:
true with methods like Deep Learning that are hard to
interpret. All ML algorithms have implicit assumptions The data has been produced using Monte
and choices reflected in the datasets we use to the kind Carlo simulations and contains events with
of functions we choose to optimize. It is important to two leptons (electrons or muons). In high
remember that there is no “ view from nowhere” (Adam, energy physics experiments, such as the AT-
2006; Katz, 2017) – all ML algorithms reflect a point of LAS and CMS detectors at the CERN LHC,
view and a set of assumptions about the world we live one major hope is the discovery of new par-
in. For this reason, we hope that ML practitioners and ticles. To accomplish this task, physicists at-
data scientists will take the time to consider the social tempt to sift through data events and classify
consequences of their actions. For example, developing a them as either a signal of some new physics
Hippocratic Oath for data scientists is now being consid- process or particle, or instead a background
ered (Simonite, 2018). Doing no harm seems like a good event from understood Standard Model pro-
start for making sure that we harness ML for the benefit cesses. Unfortunately we will never know for
of all members of society. sure what underlying physical process hap-
pened (the only information to which we have
access are the final state particles). How-
XIX. ACKNOWLEDGMENTS
ever, we can attempt to define parts of phase
space that will have a high percentage of sig-
PM, CHW, and AD were supported by Simon’s Foun-
nal events. Typically this is done by using a
dation in the form of a Simons Investigator in the MMLS
series of simple requirements on the kinematic
and NIH MIRA program grant: 1R35GM119461. MB
quantities of the final state particles, for ex-
acknowledges support from the Emergent Phenomena in
ample having one or more leptons with large
Quantum Systems initiative of the Gordon and Betty
amounts of momentum that is transverse to
Moore Foundation and the ERC synergy grant UQUAM.
the beam line ( pT ). Here instead we will
DJS was supported as a Simons Investigator in the
use logistic regression in order to attempt to
MMLS and by NIH K25 grant GM098875-02. PM and
find out the relative probability that an event
DJS would like to thank the NSF grant: PHYD1066293
is from a signal or a background event and
for supporting the Aspen Center for Physics (ACP) for
rather than using the kinematic quantities of
facilitating discussions leading to this work. PM and DJS
final state particles directly we will use the
would also like to thank Anirvan Sengupta, Justin Kin-
output of our logistic regression to define a
ney, and Ilya Nemenman for useful conversations during
part of phase space that is enriched in sig-
the ACP working group.
nal events. The dataset we are using has the
value of 18 kinematic variables ("features") of
Appendix A: An overview of the datasets used in review the event. The first 8 features are direct mea-
surements of final state particles, in this case
1. Ising dataset the pT , pseudo-rapidity, and azimuthal angle
of two leptons in the event and the amount
The Ising dataset we use throughout the review was of missing transverse momentum (MET) to-
generated using the standard Metropolis algorithm to gether with its azimuthal angle. The last ten
generate a Markov Chain. The full dataset consist of features are functions of the first 8 features;
111
these are high-level features derived by physi- Adam, Alison (2006), Artificial knowing: Gender and the
cists to help discriminate between the two thinking machine (Routledge).
classes. You can think of them as physicists Advani, Madhu, and Surya Ganguli (2016), “Statistical me-
attempt to use non-linear functions to classify chanics of optimal convex inference in high dimensions,”
Physical Review X 6 (3), 031034.
signal and background events and they have Advani, Madhu, Subhaneil Lahiri, and Surya Ganguli (2013),
been developed with a lot of deep thinking “Statistical mechanics of complex neural systems and high
on the part of physicist. There is however, dimensional data,” Journal of Statistical Mechanics: The-
an interest in using deep learning methods to ory and Experiment 2013 (03), P03014.
obviate the need for physicists to manually Advani, Madhu S, and Andrew M Saxe (2017), “High-
develop such features. Benchmark results us- dimensional dynamics of generalization error in neural net-
ing Bayesian Decision Trees from a standard works,” arXiv preprint arXiv:1710.03667.
Aitchison, Laurence, Nicola Corradi, and Peter E Latham
physics package and 5-layer neural networks (2016), PLoS computational biology 12 (12), e1005110.
and the dropout algorithm are presented in Albarran-Arriagada, F, J. C. Retamal, E. Solano, and
the original paper to compare the ability of L. Lamata (2018), “Measurement-based adapta-
deep-learning to bypass the need of using such tion protocol with quantum reinforcement learning,”
high level features. We will also explore this arXiv:1803.05340 .
topic in later notebooks. The dataset con- Alemi, Alexander A, Ian Fischer, Joshua V Dillon, and Kevin
sists of 5 million events, the first 4,500,000 of Murphy (2016), “Deep variational information bottleneck,”
arXiv preprint arXiv:1612.00410.
which we will use for training the model and Alemi, Alireza, and Alia Abbara (2017), “Exponential capac-
the last 500,000 examples will be used as a ity in an autoencoder neural network with a hidden layer,”
test set. arXiv preprint arXiv:1705.07441 .
Amit, Daniel J (1992), Modeling brain function: The world of
attractor neural networks (Cambridge university press).
3. MNIST Dataset Amit, Daniel J, Hanoch Gutfreund, and Haim Sompolinsky
(1985), “Spin-glass models of neural networks,” Physical
The MNIST dataset is one of the simplest and most Review A 32 (2), 1007.
widely used Machine Learning Datasets. The MNIST Andrieu, Christophe, Nando De Freitas, Arnaud Doucet, and
Michael I Jordan (2003), “An introduction to mcmc for
dataset consists of hand-written images of numerical
machine learning,” Machine learning 50 (1-2), 5–43.
characters 0−9 and consists of a training set of 60,000 ex- Arai, Shunta, Masayuki Ohzeki, and Kazuyuki Tanaka
amples, and a test set of 10,000 examples (LeCun et al., (2017), “Deep neural network detects quantum phase tran-
1998a). Information about the MNIST database and its sition,” arXiv preprint arXiv:1712.00371 .
historical importance can be found at Yann Lecun’s wed- Arimoto, Suguru (1972), “An algorithm for computing the
site: http://yann.lecun.com/exdb/mnist/. A brief capacity of arbitrary discrete memoryless channels,” IEEE
description from the website: Transactions on Information Theory 18 (1), 14–20.
Arsenault, Louis-François, Alejandro Lopez-Bezanilla,
The original black and white (bilevel) images O Anatole von Lilienfeld, and Andrew J Millis (2014),
from NIST were size normalized to fit in a “Machine learning for many-body physics: the case of the
20x20 pixel box while preserving their aspect anderson impurity model,” Physical Review B 90 (15),
ratio. The resulting images contain grey lev- 155136.
Arunachalam, Srinivasan, and Ronald de Wolf (2017),
els as a result of the anti-aliasing technique
“A survey of quantum learning theory,” arXiv preprint
used by the normalization algorithm. the im- arXiv:1701.06806 .
ages were centered in a 28x28 image by com- August, Moritz, and JosÃľ Miguel HernÃąndez-Lobato
puting the center of mass of the pixels, and (????), arXiv:1802.04063.
translating the image so as to position this Baireuther, P, TE O’Brien, B Tarasinski, and CWJ
point at the center of the 28x28 field. Beenakker (2017), “Machine-learning-assisted correction
of correlated qubit errors in a topological code,” arXiv
The MNIST is often included by default in many modern preprint arXiv:1705.07855 .
ML packages. Baity-Jesi, M, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous,
C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli (2018),
“Comparing dynamics: Deep neural networks versus glassy
REFERENCES systems,” .
Baldassi, Carlo, Federica Gerace, Hilbert J Kappen, Carlo
Abu-Mostafa, Yaser S, Malik Magdon-Ismail, and Hsuan- Lucibello, Luca Saglietti, Enzo Tartaglione, and Riccardo
Tien Lin (2012), Learning from data, Vol. 4 (AMLBook Zecchina (2017), “On the role of synaptic stochasticity
New York, NY, USA:). in training low-precision neural networks,” arXiv preprint
Ackley, David H, Geoffrey E Hinton, and Terrence J Se- arXiv:1710.09825 .
jnowski (1987), “A learning algorithm for boltzmann ma- Baldassi, Carlo, Federica Gerace, Luca Saglietti, and Ric-
chines,” in Readings in Computer Vision (Elsevier) pp. cardo Zecchina (2018), “From inverse problems to learning:
522–533. a statistical mechanics approach,” in Journal of Physics:
112
Conference Series, Vol. 955 (IOP Publishing) p. 012001. Breiman, Leo (2001), “Random forests,” Machine learning
Baldi, Pierre, Peter Sadowski, and Daniel Whiteson (2014), 45 (1), 5–32.
“Searching for exotic particles in high-energy physics with Breuckmann, Nikolas P, and Xiaotong Ni (2017), “Scalable
deep learning,” Nature communications 5, 4308. neural network decoders for higher dimensional quantum
Barber, David (2012), Bayesian reasoning and machine learn- codes,” arXiv preprint arXiv:1710.09489 .
ing (Cambridge University Press). Broecker, Peter, Fakher F Assaad, and Simon Trebst (2017),
Barra, Adriano, Alberto Bernacchia, Enrica Santucci, and “Quantum phase recognition via unsupervised machine
Pierluigi Contucci (2012), “On the equivalence of hopfield learning,” arXiv preprint arXiv:1707.00663 .
networks and boltzmann machines,” Neural Networks 34, Bromley, Thomas R, and Patrick Rebentrost (2018),
1–9. “Batched quantum state exponentiation and quantum heb-
Barra, Adriano, Giuseppe Genovese, Peter Sollich, and bian learning,” arXiv:1803.07039 .
Daniele Tantari (2016), “Phase transitions in restricted Bukov, Marin, Alexandre G.R. Day, Dries Sels, Phillip Wein-
boltzmann machines with generic priors,” arXiv preprint berg, Anatoli Polkovnikov, and Pankaj Mehta (2017), “Re-
arXiv:1612.03132 . inforcement learning in different phases of quantum con-
Barra, Adriano, Giuseppe Genovese, Daniele Tantari, and Pe- trol,” arXiv preprint arXiv:1705.00565 .
ter Sollich (2017), “Phase diagram of restricted boltzmann Burges, Christopher JC (1998), “A tutorial on support vector
machines and generalised hopfield networks with arbitrary machines for pattern recognition,” Data mining and knowl-
priors,” arXiv preprint arXiv:1702.05882 . edge discovery 2 (2), 121–167.
Battiti, Roberto (1992), “First-and second-order methods for Cárdenas-López, FA, L Lamata, JC Retamal, and E Solano
learning: between steepest descent and newton’s method,” (2017), “Generalized quantum reinforcement learning with
Neural computation 4 (2), 141–166. quantum technologies,” arXiv preprint arXiv:1709.07848 .
Benedetti, Marcello, John Realpe-Gómez, Rupak Biswas, Carleo, Giuseppe, Yusuke Nomura, and Masatoshi Imada
and Alejandro Perdomo-Ortiz (2016), “Quantum-assisted (2018), “Constructing exact representations of quantum
learning of graphical models with arbitrary pairwise con- many-body systems with deep neural networks,” arXiv
nectivity,” arXiv preprint arXiv:1609.02542 . preprint arXiv:1802.09558 .
Benedetti, Marcello, John Realpe-Gómez, and Alejandro Carleo, Giuseppe, and Matthias Troyer (2017), “Solving the
Perdomo-Ortiz (2017), “Quantum-assisted helmholtz ma- quantum many-body problem with artificial neural net-
chines: A quantum-classical deep learning framework for works,” Science 355 (6325), 602–606.
industrial datasets in near-term devices,” arXiv preprint Carrasquilla, Juan, and Roger G Melko (2017), “Machine
arXiv:1708.09784 . learning phases of matter,” Nature Physics 13 (5), 431.
Bengio, Yoshua (2012), “Practical recommendations for Chalk, Matthew, Olivier Marre, and Gasper Tkacik (2016),
gradient-based training of deep architectures,” in Neural “Relevant sparse codes with variational information bottle-
networks: Tricks of the trade (Springer) pp. 437–478. neck,” in Advances in Neural Information Processing Sys-
Bény, Cédric (2018), “Inferring relevant features: from qft to tems, pp. 1957–1965.
pca,” arXiv preprint arXiv:1802.05756 . Chamberland, Christopher, and Pooya Ronagh (2018), “Deep
Berger, James O, and José M Bernardo (1992), “On the devel- neural decoders for near term fault-tolerant experiments,”
opment of the reference prior method,” Bayesian statistics arXiv preprint arXiv:1802.06441 .
4, 35–60. Chechik, Gal, Amir Globerson, Naftali Tishby, and Yair
Bickel, Peter J, and David A Freedman (1981), “Some asymp- Weiss (2005), “Information bottleneck for gaussian vari-
totic theory for the bootstrap,” The Annals of Statistics , ables,” Journal of machine learning research 6 (Jan), 165–
1196–1217. 188.
Bishop, Chris M (1995a), “Training with noise is equivalent to Chen, Chunlin, Daoyi Dong, Han-Xiong Li, Jian Chu, and
tikhonov regularization,” Neural computation 7 (1), 108– Tzyh-Jong Tarn (2014), “Fidelity-based probabilistic q-
116. learning for control of quantum systems,” IEEE transac-
Bishop, Christopher M (1995b), Neural networks for pattern tions on neural networks and learning systems 25 (5), 920–
recognition (Oxford university press). 933.
Bishop, Christopher M (2006), Pattern recognition and ma- Chen, Jing, Song Cheng, Haidong Xie, Lei Wang, and Tao Xi-
chine learning (springer). ang (2018), “Equivalence of restricted boltzmann machines
Blahut, Richard (1972), “Computation of channel capacity and tensor network states,” Phys. Rev. B 97, 085104.
and rate-distortion functions,” IEEE transactions on Infor- Chen, Tianqi, and Carlos Guestrin (2016), “Xgboost: A scal-
mation Theory 18 (4), 460–473. able tree boosting system,” in Proceedings of the 22nd acm
Bottou, Léon (2012), “Stochastic gradient descent tricks,” in sigkdd international conference on knowledge discovery and
Neural networks: Tricks of the trade (Springer) pp. 421– data mining (ACM) pp. 785–794.
436. Cheng, Song, Jing Chen, and Lei Wang (2017), “Information
Bowman, Samuel R, Luke Vilnis, Oriol Vinyals, Andrew M perspective to probabilistic modeling: Boltzmann machines
Dai, Rafal Jozefowicz, and Samy Bengio (2015), “Gener- versus born machines,” arXiv preprint arXiv:1712.04144 .
ating sentences from a continuous space,” arXiv preprint Ch’ng, Kelvin, Nick Vazquez, and Ehsan Khatami
arXiv:1511.06349. (2017), “Unsupervised machine learning account of mag-
Boyd, Stephen, and Lieven Vandenberghe (2004), Convex netic transitions in the hubbard model,” arXiv preprint
optimization (Cambridge university press). arXiv:1708.03350 .
Bradde, Serena, and William Bialek (2017), “Pca meets rg,” Christopher, M Bishop (2016), PATTERN RECOGNI-
Journal of Statistical Physics 167 (3-4), 462–475. TION AND MACHINE LEARNING. (Springer-Verlag
Breiman, Leo (1996), “Bagging predictors,” Machine learning New York).
24 (2), 123–140. Ciliberto, Carlo, Mark Herbster, Alessandro Davide Ialongo,
113
Friedman, Jerome H (2001), “Greedy function approximation: Alexander Lerchner (2016), “beta-vae: Learning basic vi-
a gradient boosting machine,” Annals of statistics , 1189– sual concepts with a constrained variational framework,”
1232. .
Fu, Michael C (2006), “Gradient estimation,” Handbooks in Hinton, Geoffrey E (2002), “Training products of experts by
operations research and management science 13, 575–616. minimizing contrastive divergence,” Neural computation
Gao, Jun, Lu-Feng Qiao, Zhi-Qiang Jiao, Yue-Chi Ma, Cheng- 14 (8), 1771–1800.
Qiu Hu, Ruo-Jing Ren, Ai-Lin Yang, Hao Tang, Man-Hong Hinton, Geoffrey E (2012), “A practical guide to training re-
Yung, and Xian-Min Jin (2017), “Experimental machine stricted boltzmann machines,” in Neural networks: Tricks
learning of quantum states with partial information,” arXiv of the trade (Springer) pp. 599–619.
preprint arXiv:1712.00456 . Hinton, Geoffrey E, Simon Osindero, and Yee-Whye Teh
Gao, Xun, and Lu-Ming Duan (2017), “Efficient representa- (2006), “A fast learning algorithm for deep belief nets,”
tion of quantum many-body states with deep neural net- Neural computation 18 (7), 1527–1554.
works,” arXiv preprint arXiv:1701.05039 . Hinton, Geoffrey E, and Ruslan R Salakhutdinov (2006), “Re-
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dun- ducing the dimensionality of data with neural networks,”
son, Aki Vehtari, and Donald B Rubin (2014), Bayesian science 313 (5786), 504–507.
data analysis, Vol. 2 (CRC press Boca Raton, FL). Ho, Tin Kam (1998), “The random subspace method for con-
Gersho, Allen, and Robert M Gray (2012), Vector quantiza- structing decision forests,” IEEE transactions on pattern
tion and signal compression, Vol. 159 (Springer Science & analysis and machine intelligence 20 (8), 832–844.
Business Media). Hopfield, John J (1982), “Neural networks and physical sys-
Geurts, Pierre, Damien Ernst, and Louis Wehenkel (2006), tems with emergent collective computational abilities,”
“Extremely randomized trees,” Machine learning 63 (1), Proceedings of the national academy of sciences 79 (8),
3–42. 2554–2558.
Glorot, Xavier, and Yoshua Bengio (2010), “Understanding Huang, Haiping (2017a), “Mean-field theory of input dimen-
the difficulty of training deep feedforward neural networks,” sionality reduction in unsupervised deep neural networks,”
in Proceedings of the Thirteenth International Conference arXiv preprint arXiv:1710.01467 .
on Artificial Intelligence and Statistics, pp. 249–256. Huang, Haiping (2017b), “Statistical mechanics of unsuper-
Goldt, Sebastian, and Udo Seifert (2017), “Thermodynamic vised feature learning in a restricted boltzmann machine
efficiency of learning a rule in neural networks,” arXiv with binary synapses,” Journal of Statistical Mechanics:
preprint arXiv:1706.09713 . Theory and Experiment 2017 (5), 053302.
Goodfellow, Ian (2016), “Nips 2016 tutorial: Generative ad- Huang, Hengfeng, Bowen Xiao, Huixin Xiong, Zeming Wu,
versarial networks,” arXiv preprint arXiv:1701.00160. Yadong Mu, and Huichao Song (2018), “Applications
Goodfellow, Ian, Yoshua Bengio, and Aaron of deep learning to relativistic hydrodynamics,” arXiv
Courville (2016), Deep Learning (MIT Press) preprint arXiv:1801.03334 .
http://www.deeplearningbook.org. Hubbard, J (1959), “Calculation of partition functions,” Phys-
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, ical Review Letters 3 (2), 77.
David Warde-Farley, Sherjil Ozair, Aaron Courville, and Iakovlev, I A, O. M. Sotnikov, and V. V. Mazurenko
Yoshua Bengio (2014), “Generative adversarial nets,” in Ad- (2018), “Supervised learning magnetic skyrmion phases,”
vances in neural information processing systems, pp. 2672– arXiv:1803.06682 .
2680. Innocenti, Luca, Leonardo Banchi, Alessandro Ferraro,
Greplova, Eliska, Christian Kraglund Andersen, and Klaus Sougato Bose, and Mauro Paternostro (2018), “Supervised
Mølmer (2017), “Quantum parameter estimation with a learning of time-independent hamiltonians for gate design,”
neural network,” arXiv preprint arXiv:1711.05238 . arXiv:1803.07119 .
Grisafi, Andrea, David M Wilkins, Gábor Csányi, and Ioffe, Sergey, and Christian Szegedy (2015), “Batch normal-
Michele Ceriotti (2017), “Symmetry-adapted machine- ization: Accelerating deep network training by reducing in-
learning for tensorial properties of atomistic systems,” ternal covariate shift,” in International Conference on Ma-
arXiv preprint arXiv:1709.06757 . chine Learning, pp. 448–456.
Halko, Nathan, Per-Gunnar Martinsson, Yoel Shkolnisky, Iso, Satoshi, Shotaro Shiba, and Sumito Yokoo (2018),
and Mark Tygert (2011), “An algorithm for the principal “Scale-invariant feature extraction of neural net-
component analysis of large data sets,” SIAM Journal on work and renormalization group flow,” arXiv preprint
Scientific computing 33 (5), 2580–2594. arXiv:1801.07172 .
Han, Zhao-Yu, Jun Wang, Heng Fan, Lei Wang, and Pan Jarzynski, Christopher (1997), “Nonequilibrium equality for
Zhang (2017), “Unsupervised generative modeling using free energy differences,” Physical Review Letters 78 (14),
matrix product states,” arXiv preprint arXiv:1709.01662 . 2690.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun Jaynes, Edwin T (1957a), “Information theory and statistical
(2015), “Delving deep into rectifiers: Surpassing human- mechanics,” Physical review 106 (4), 620.
level performance on imagenet classification,” in Proceed- Jaynes, Edwin T (1957b), “Information theory and statistical
ings of the IEEE international conference on computer vi- mechanics. ii,” Physical review 108 (2), 171.
sion, pp. 1026–1034. Jaynes, Edwin T (1968), “Prior probabilities,” IEEE Trans-
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun actions on systems science and cybernetics 4 (3), 227–241.
(2016), “Deep residual learning for image recognition,” in Jaynes, Edwin T (1996), Probability theory: the logic of sci-
Proceedings of the IEEE conference on computer vision and ence (Washington University St. Louis, MO).
pattern recognition, pp. 770–778. Jaynes, Edwin T (2003), Probability theory: the logic of sci-
Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, ence (Cambridge university press).
Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Jeffreys, Harold (1946), “An invariant form for the prior prob-
115
ability in estimation problems,” Proceedings of the Royal Krzakala, Florent, Marc Mézard, Francois Sausset, Yifan Sun,
Society of London. Series A, Mathematical and Physical and Lenka Zdeborová (2012b), “Probabilistic reconstruc-
Sciences , 453–461. tion in compressed sensing: algorithms, phase diagrams,
Jin, Chi, Praneeth Netrapalli, and Michael I Jordan (2017), and threshold achieving matrices,” Journal of Statistical
“Accelerated gradient descent escapes saddle points faster Mechanics: Theory and Experiment 2012 (08), P08009.
than gradient descent,” arXiv preprint arXiv:1711.10456. Lake, Brenden M, Tomer D Ullman, Joshua B Tenenbaum,
Jordan, Michael I, Zoubin Ghahramani, Tommi S Jaakkola, and Samuel J Gershman (2017), “Building machines that
and Lawrence K Saul (1999), “An introduction to vari- learn and think like people,” Behavioral and Brain Sciences
ational methods for graphical models,” Machine learning 40.
37 (2), 183–233. Lamata, Lucas (2017), “Basic protocols in quantum reinforce-
Kalantre, Sandesh S, Justyna P Zwolak, Stephen Ragole, ment learning with superconducting circuits,” Scientific Re-
Xingyao Wu, Neil M Zimmerman, MD Stewart, and Ja- ports 7.
cob M Taylor (2017), “Machine learning techniques for Larsen, Bjornar, and Chinatsu Aone (1999), “Fast and effec-
state recognition and auto-tuning in quantum dots,” arXiv tive text mining using linear-time document clustering,” in
preprint arXiv:1712.04914 . Proceedings of the fifth ACM SIGKDD international con-
Katz, Yarden (2017), “Manufacturing an artificial intelligence ference on Knowledge discovery and data mining (ACM)
revolution,” SSRN Preprint . pp. 16–22.
Kerenidis, Iordanis, and Anupam Prakash (2017), “Quan- Le, Quoc V (2013), “Building high-level features using large
tum gradient descent for linear systems and least squares,” scale unsupervised learning,” in Acoustics, Speech and Sig-
arXiv preprint arXiv:1704.04992 . nal Processing (ICASSP), 2013 IEEE International Con-
Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, ference on (IEEE) pp. 8595–8598.
Mikhail Smelyanskiy, and Ping Tak Peter Tang (2016), LeCun, Yann, Yoshua Bengio, et al. (1995), “Convolutional
“On large-batch training for deep learning: Generalization networks for images, speech, and time series,” The hand-
gap and sharp minima,” arXiv preprint arXiv:1609.04836. book of brain theory and neural networks 3361 (10), 1995.
Kingma, Diederik P, and Jimmy Ba (2014), “Adam: LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick
A method for stochastic optimization,” arXiv preprint Haffner (1998a), “Gradient-based learning applied to docu-
arXiv:1412.6980. ment recognition,” Proceedings of the IEEE 86 (11), 2278–
Kingma, Diederik P, and Max Welling (2013), 2324.
“Auto-encoding variational bayes,” arXiv preprint LeCun, Yann, Léon Bottou, Genevieve B Orr, and Klaus-
arXiv:1312.6114. Robert Müller (1998b), “Efficient backprop,” in Neural net-
Kingma, DP, et al. (2017), “Variational inference & deep works: Tricks of the trade (Springer) pp. 9–50.
learning,” PhD thesis 978-94-6299-745-5. Lee, Jason D, Ioannis Panageas, Georgios Piliouras, Max Sim-
Kleijnen, Jack PC, and Reuven Y Rubinstein (1996), “Op- chowitz, Michael I Jordan, and Benjamin Recht (2017),
timization and sensitivity analysis of computer simulation “First-order methods almost always avoid saddle points,”
models by the score function method,” European Journal arXiv preprint arXiv:1710.07406.
of Operational Research 88 (3), 413–427. Lehmann, Erich L, and George Casella (2006), Theory of
Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan point estimation (Springer Science & Business Media).
(2016), “Inherent trade-offs in the fair determination of risk Lehmann, Erich L, and Joseph P Romano (2006), Testing
scores,” arXiv preprint arXiv:1609.05807. statistical hypotheses (Springer Science & Business Media).
Koch-Janusz, Maciej, and Zohar Ringel (2017), “Mutual in- Levine, Yoav, David Yakira, Nadav Cohen, and Amnon
formation, neural networks and the renormalization group,” Shashua (2017), “Deep learning and quantum entangle-
arXiv preprint arXiv:1704.06279 . ment: Fundamental connections with implications to net-
Kohonen, Teuvo (1998), “The self-organizing map,” Neuro- work design.” arXiv preprint arXiv:1704.01552 .
computing 21 (1-3), 1–6. Li, Bo, and David Saad (2017), “Exploring the func-
Krastanov, Stefan, and Liang Jiang (2017), “Deep neural tion space of deep-learning machines,” arXiv preprint
network probabilistic decoder for stabilizer codes,” arXiv arXiv:1708.01422 .
preprint arXiv:1705.09334 . Li, Chian-De, Deng-Ruei Tan, and Fu-Jiun Jiang (2017),
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek (2009), “Applications of neural networks to the studies of
“Clustering high-dimensional data: A survey on subspace phase transitions of two-dimensional potts models,” arXiv
clustering, pattern-based clustering, and correlation clus- preprint arXiv:1703.02369 .
tering,” ACM Transactions on Knowledge Discovery from Li, Richard Y, Rosa Di Felice, Remo Rohs, and Daniel A
Data (TKDD) 3 (1), 1. Lidar (2018), “Quantum annealing versus classical ma-
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton chine learning applied to a simplified computational biology
(2012), “Imagenet classification with deep convolutional problem,” npj Quantum Information 4 (1), 14.
neural networks,” in Advances in neural information pro- Li, Shuo-Hui, and Lei Wang (2018), “Neural network renor-
cessing systems, pp. 1097–1105. malization group,” arXiv preprint arXiv:1802.02840 .
Krzakala, Florent, Andre Manoel, Eric W Tramel, and Lenka Lim, Tjen-Sien, Wei-Yin Loh, and Yu-Shan Shih (2000), “A
Zdeborová (2014), “Variational free energies for compressed comparison of prediction accuracy, complexity, and train-
sensing,” in Information Theory (ISIT), 2014 IEEE Inter- ing time of thirty-three old and new classification algo-
national Symposium on (IEEE) pp. 1499–1503. rithms,” Machine learning 40 (3), 203–228.
Krzakala, Florent, Marc Mézard, François Sausset, YF Sun, Lin, Henry W, Max Tegmark, and David Rolnick (2017),
and Lenka Zdeborová (2012a), “Statistical-physics-based “Why does deep and cheap learning work so well?” Journal
reconstruction in compressed sensing,” Physical Review X of Statistical Physics 168 (6), 1223–1247.
2 (2), 021005. Liu, Zhaocheng, Sean P Rodrigues, and Wenshan Cai
116
(2017), “Simulating the ising model with a deep convo- ment learning,” Nature 518 (7540), 529.
lutional generative adversarial network,” arXiv preprint Morningstar, Alan, and Roger G Melko (2017), “Deep
arXiv:1710.04987 . learning the ising model near criticality,” arXiv preprint
Loh, Wei-Yin (2011), “Classification and regression trees,” arXiv:1708.04622 .
Wiley Interdisciplinary Reviews: Data Mining and Knowl- Muehlhauser, Luke (2016), “What should we learn from past
edge Discovery 1 (1), 14–23. ai forecasts?” Open Philanthropy Website .
Louppe, Gilles (2014), “Understanding random forests: From Müllner, Daniel (2011), “Modern hierarchical, agglomerative
theory to practice,” arXiv preprint arXiv:1407.7502. clustering algorithms,” arXiv preprint arXiv:1109.2378 .
Lu, Sirui, Shilin Huang, Keren Li, Jun Li, Jianxin Chen, Murphy, Kevin (2012), Machine Learning: A Probabilistic
Dawei Lu, Zhengfeng Ji, Yi Shen, Duanlu Zhou, and Perspective (MIT press).
Bei Zeng (2017), “A separability-entanglement classifier via Nagai, Yuki, Huitao Shen, Yang Qi, Junwei Liu, and Liang
machine learning,” arXiv preprint arXiv:1705.01523 . Fu (2017), “Self-learning monte carlo method: Continuous-
Maaten, Laurens van der, and Geoffrey Hinton (2008), “Vi- time algorithm,” arXiv preprint arXiv:1705.06724 .
sualizing data using t-sne,” Journal of machine learning re- Neal, Radford M (2001), “Annealed importance sampling,”
search 9 (Nov), 2579–2605. Statistics and computing 11 (2), 125–139.
MacKay, David JC (2003), Information theory, inference and Neal, Radford M, and Geoffrey E Hinton (1998), “A view
learning algorithms (Cambridge university press). of the em algorithm that justifies incremental, sparse, and
Maskara, Nishad, Aleksander Kubica, and Tomas other variants,” in Learning in graphical models (Springer)
Jochym-O’Connor (2018), “Advantages of versatile neural- pp. 355–368.
network decoding for topological codes,” arXiv preprint Neal, Radford M, et al. (2011), “Mcmc using hamiltonian dy-
arXiv:1802.08680 . namics,” Handbook of Markov Chain Monte Carlo 2 (11).
Masson, JB, M Bailly Bechet, and Massimo Vergassola Nesterov, Yurii (1983), “A method of solving a convex pro-
(2009), “Chasing information to search in random environ- gramming problem with convergence rate o (1/k2),” in So-
ments,” Journal of Physics A: Mathematical and Theoret- viet Mathematics Doklady, Vol. 27, pp. 372–376.
ical 42 (43), 434009. Neukart, Florian, David Von Dollen, Christian Seidel, and
Mattingly, Henry H, Mark K Transtrum, Michael C Abbott, Gabriele Compostella (2017), “Quantum-enhanced rein-
and Benjamin B Machta (2018), “Maximizing the infor- forcement learning for finite-episode games with discrete
mation learned from finite data selects a simple model,” state spaces,” arXiv preprint arXiv:1708.09354 .
Proceedings of the National Academy of Sciences 115 (8), Nguyen, H Chau, Riccardo Zecchina, and Johannes Berg
1760–1765. (2017), “Inverse statistical problems: from the inverse ising
McDermott, Drew, M Mitchell Waldrop, B Chandrasekaran, problem to data science,” Advances in Physics 66 (3), 197–
John McDermott, and Roger Schank (1985), “The dark 261.
ages of ai: a panel discussion at aaai-84,” AI Magazine Nielsen, Michael A (2015), Neural networks and deep learning
6 (3), 122. (Determination Press).
Mehta, Pankaj (2015), “Big data’s radical potential,” Jacobin van Nieuwenburg, Evert, Eyal Bairey, and Gil Refael (2017a),
. “Learning phase transitions from dynamics,” arXiv preprint
Mehta, Pankaj, and David J Schwab (2014), “An exact map- arXiv:1712.00450 .
ping between the variational renormalization group and van Nieuwenburg, Evert PL, Ye-Hua Liu, and Sebastian D
deep learning,” arXiv preprint arXiv:1410.3831. Huber (2017b), “Learning phase transitions by confusion,”
Mehta, Pankaj, David J Schwab, and Anirvan M Sengupta Nature Physics 13 (5), 435.
(2011), “Statistical mechanics of transcription-factor bind- Niu, Murphy Yuezhen, Sergio Boixo, Vadim Smelyanskiy,
ing site discovery using hidden markov models,” Journal of and Hartmut Neven (2018), “Universal quantum con-
statistical physics 142 (6), 1187–1205. trol through deep reinforcement learning,” arXiv preprint
Melnikov, Alexey A, Hendrik Poulsen Nautrup, Mario Krenn, arXiv:1803.01857 .
Vedran Dunjko, Markus Tiersch, Anton Zeilinger, and Nomura, Yusuke, Andrew Darmawan, Youhei Yamaji, and
Hans J Briegel (2017), “Active learning machine learns Masatoshi Imada (2017), “Restricted-boltzmann-machine
to create new quantum experiments,” arXiv preprint learning for solving strongly correlated quantum systems,”
arXiv:1706.00868 . arXiv preprint arXiv:1709.06475 .
Metz, Cade (2017), “Move over, coders- Ohtsuki, Tomi, and Tomoki Ohtsuki (2017), “Deep learning
physicists will soon rule silicon valley,” the quantum phase transitions in random electron systems:
Https://deepmind.com/blog/deepmind-ai-reduces-google- Applications to three dimensions,” Journal of the Physical
data-centre-cooling-bill-40/. Society of Japan 86 (4), 044708.
Mezard, Marc, and Andrea Montanari (2009), Information, O’Neil, Cathy (2017), Weapons of math destruction: How big
physics, and computation (Oxford University Press). data increases inequality and threatens democracy (Broad-
Mhaskar, Hrushikesh, Qianli Liao, and Tomaso Poggio way Books).
(2016), “Learning functions: when is deep better than shal- Papanikolaou, Stefanos, Michail Tzimas, Hengxu Song, An-
low,” arXiv preprint arXiv:1603.00988. drew CE Reid, and Stephen A Langer (2017), “Learning
Mitarai, Kosuke, Makoto Negoro, Masahiro Kitagawa, and crystal plasticity using digital image correlation: Exam-
Keisuke Fujii (2018), “Quantum circuit learning,” arXiv ples from discrete dislocation dynamics,” arXiv preprint
preprint arXiv:1803.00745 . arXiv:1709.08225 .
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, An- Pedregosa, F, G. Varoquaux, A. Gramfort, V. Michel,
drei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour-
et al. (2015), “Human-level control through deep reinforce- napeau, M. Brucher, M. Perrot, and E. Duchesnay (2011),
117