AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Corinna Cortes 1, Xavier Gonzalvo 1, Vitaly Kuznetsov 1, Mehryar Mohri 2 1, Scott Yang 2

1 Google Research, New York, NY, USA. 2 Courant Institute, New York, NY, USA. Correspondence to: Vitaly Kuznetsov <vitalyk@google.com>.
Abstract

Our algorithms (ADANET) adaptively learn both the structure of the network and its weights. They are based on a solid theoretical analysis, including data-dependent generalization guarantees that we prove and discuss in detail. We report the results of large-scale experiments with one of our algorithms on several binary classification tasks extracted from the CIFAR-10 dataset. The results demonstrate that our algorithm can automatically learn network structures with very competitive performance accuracies when compared with those achieved for neural networks found by standard approaches.

1. Introduction

Deep neural networks form a powerful framework for machine learning and have achieved remarkable performance in several areas in recent years. Representing the input through increasingly abstract layers of feature representation has proven extremely effective in areas such as natural language processing, image captioning, speech recognition and many others (Krizhevsky et al., 2012; Sutskever et al., 2014). However, despite the compelling arguments for using neural networks as a general template for solving machine learning problems, training these models and designing the right network for a given task have been filled with many theoretical gaps and practical concerns.

To train a neural network, one needs to specify the parameters of a typically large network architecture with several layers and units, and then solve a difficult non-convex optimization problem. From an optimization perspective, there is no guarantee of optimality for a model obtained in this way, and often, one needs to implement ad hoc methods (e.g. gradient clipping or batch normalization (Pascanu [...]

[...] trained using back-propagation, the model will always have as many layers as the one specified, because there needs to be at least one path through the network in order for the hypothesis to be non-trivial. While single weights may be pruned (Han et al., 2015), a technique originally termed Optimal Brain Damage (LeCun et al., 1990), the architecture itself is unchanged. This imposes a stringent lower bound on the complexity of the model. Since not all machine learning problems admit the same level of difficulty and different tasks naturally require varying levels of complexity, complex models trained with insufficient data can be prone to overfitting. This places a burden on the practitioner to specify an architecture at the right level of complexity, which is often hard and requires significant experience and domain knowledge. For this reason, network architecture is often treated as a hyperparameter which is tuned using a validation set. The search space can quickly become exorbitantly large (Szegedy et al., 2015; He et al., 2015), and large-scale hyperparameter tuning to find an effective network architecture is wasteful of data, time, and resources (e.g. grid search, random search (Bergstra et al., 2011)).

In this paper, we attempt to remedy some of these issues. In particular, we provide a theoretical analysis of a supervised learning scenario in which the network architecture and parameters are learned simultaneously. To the best of our knowledge, our results are the first generalization bounds for the problem of structural learning of neural networks. These general guarantees can guide the design of a variety of different algorithms for learning in this setting. We describe in depth two such algorithms that directly benefit from the theory that we develop.

In contrast to enforcing a pre-specified architecture and a corresponding fixed complexity, our algorithms learn the requisite model complexity for a machine learning problem in an adaptive fashion. Starting from a simple linear model, we add more units and additional layers as needed. The additional units that we add are carefully selected and penalized according to rigorous estimates from the theory of statistical learning. Remarkably, optimization problems [...]
[...] H̃_k: H = ∪_{k=1}^{l} H̃_k. Then, F coincides with the convex hull of H: F = conv(H).

For any k ∈ [l] we will also consider the family H*_k derived from H_k by setting Λ_{k,s} = 0 for s < k − 1, which corresponds to units connected only to the layer below. We similarly define H̃*_k = H*_k ∪ (−H*_k) and H* = ∪_{k=1}^{l} H*_k, and define F* as the convex hull F* = conv(H*). Note that the architecture corresponding to the family of functions F* is still more general than standard feedforward neural network architectures, since the output unit can be connected to units in different layers.

3. Learning problem

We consider the standard supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {−1, +1}, and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m. [...]

[...] As pointed out earlier, the family of functions F is the convex hull of H. Thus, generalization bounds for ensemble methods can be used to analyze learning with F. In particular, we can leverage the recent margin-based learning guarantees of Cortes et al. (2014), which are finer than those that can be derived via a standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), and which admit an explicit dependency on the mixture weights w_k defining the ensemble function f. This leads to the following learning guarantee.

Theorem 1 (Learning bound). Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^{l} w_k · h_k ∈ F:

$$R(f) \;\le\; \widehat R_{S,\rho}(f) \;+\; \frac{4}{\rho}\sum_{k=1}^{l} \|w_k\|_1\, \mathfrak R_m(\widetilde H_k) \;+\; \frac{2}{\rho}\sqrt{\frac{\log l}{m}} \;+\; C(\rho, l, m, \delta),$$

[...]
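Theorem 1 can be evaluated term by term once per-layer weights and complexity estimates are available. Below is a minimal NumPy sketch of the data-dependent part of the bound. The definition of the empirical margin error R̂_{S,ρ}(f) used here (fraction of points with margin at most ρ) is the standard one and is an assumption, since the definition falls outside the extracted text; the confidence term C(ρ, l, m, δ) is omitted.

```python
import numpy as np

def empirical_margin_error(scores, y, rho):
    # R_hat_{S,rho}(f): fraction of sample points with margin y_i * f(x_i) <= rho
    # (assumed standard definition; not spelled out in this excerpt).
    return float(np.mean(y * scores <= rho))

def theorem1_data_terms(scores, y, rho, layer_w_l1, layer_rademacher, l, m):
    # Data-dependent part of the Theorem 1 bound:
    #   R_hat_{S,rho}(f) + (4/rho) * sum_k ||w_k||_1 * R_m(H~_k) + (2/rho) * sqrt(log(l)/m).
    emp = empirical_margin_error(scores, y, rho)
    complexity = (4.0 / rho) * float(np.dot(layer_w_l1, layer_rademacher))
    slack = (2.0 / rho) * np.sqrt(np.log(l) / m)
    return emp + complexity + slack

# Toy usage: ensemble scores f(x_i) on m = 5 points, l = 2 layers.
scores = np.array([0.9, -0.2, 0.4, 1.1, -0.7])
y = np.array([1, -1, 1, 1, -1])
print(theorem1_data_terms(scores, y, rho=0.5,
                          layer_w_l1=np.array([0.6, 0.4]),
                          layer_rademacher=np.array([0.05, 0.12]),
                          l=2, m=5))
```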
[...] be more conveniently used in our algorithms. The next results in this section provide precisely such upper bounds, thereby leading to a more explicit generalization bound.

We will denote by q the conjugate exponent of p, that is, 1/p + 1/q = 1, and define r_∞ = max_{i∈[1,m]} ‖Ψ(x_i)‖_∞.

Our first result gives an upper bound on the Rademacher complexity of H_k in terms of the Rademacher complexities of the other layer families.

Lemma 1. For any k > 1, the empirical Rademacher complexity of H_k for a sample S of size m can be upper-bounded as follows in terms of those of the families H_s with s < k:

$$\widehat{\mathfrak R}_S(H_k) \;\le\; 2\sum_{s=1}^{k-1} \Lambda_{k,s}\, n_s^{1/q}\, \widehat{\mathfrak R}_S(H_s).$$

For the family H*_k, which is directly relevant to many of our experiments, the following more explicit upper bound can be derived using Lemma 1.

Lemma 2. Let Λ_k = Π_{s=1}^{k} 2Λ_{s,s−1} and N_k = Π_{s=1}^{k} n_{s−1}. Then, for any k ≥ 1, the empirical Rademacher complexity of H*_k for a sample S of size m can be upper bounded as follows:

$$\widehat{\mathfrak R}_S(H_k^*) \;\le\; r_\infty\, \Lambda_k\, N_k^{1/q}\, \sqrt{\frac{\log(2 n_0)}{2m}}.$$

Note that N_k, which is the product of the numbers of units in the layers below k, can be large. This suggests that values of p closer to one, that is larger values of q, could be more helpful to control complexity in such cases. More generally, similar explicit upper bounds can be given for the Rademacher complexities of subfamilies of H_k with units connected only to layers k, k − 1, ..., k − d, for a fixed d < k. Combining Lemma 2 with Theorem 1 helps derive the following explicit learning guarantee for feedforward neural networks with an output unit connected to all the other units.

Corollary 1 (Explicit learning bound). Fix ρ > 0. Let Λ_k = Π_{s=1}^{k} 4Λ_{s,s−1} and N_k = Π_{s=1}^{k} n_{s−1}. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^{l} w_k · h_k ∈ F*:

$$R(f) \;\le\; \widehat R_{S,\rho}(f) \;+\; \frac{2}{\rho}\sum_{k=1}^{l} \|w_k\|_1\, r_\infty\, \Lambda_k\, N_k^{1/q}\, \sqrt{\frac{2\log(2 n_0)}{m}} \;+\; \frac{2}{\rho}\sqrt{\frac{\log l}{m}} \;+\; C(\rho, l, m, \delta),$$

where $C(\rho, l, m, \delta) = \sqrt{\Big\lceil \tfrac{4}{\rho^2}\log\big(\tfrac{\rho^2 m}{\log l}\big)\Big\rceil \tfrac{\log l}{m} + \tfrac{\log\frac{2}{\delta}}{2m}}$ [...]

The learning bound of Corollary 1 is a finer guarantee than previous ones by Bartlett (1998), Neyshabur et al. (2015), or Sun et al. (2016). This is because it explicitly differentiates between the weights of different layers, while previous bounds treat all weights indiscriminately. This is crucial for algorithm design, since the network complexity no longer needs to grow exponentially as a function of depth. Our bounds are also more general and apply to other network architectures, such as those introduced in (He et al., 2015; Huang et al., 2016).

5. Algorithm

This section describes our algorithm, ADANET, for adaptive learning of neural networks. ADANET adaptively grows the structure of a neural network, balancing model complexity with empirical risk minimization. We also describe in detail in Appendix C another variant of ADANET which admits some favorable properties.

Let x ↦ Φ(−x) be a non-increasing convex function upper-bounding the zero-one loss, x ↦ 1_{x≤0}, such that Φ is differentiable over R and Φ′(x) ≠ 0 for all x. This surrogate loss Φ may be, for instance, the exponential function Φ(x) = e^x as in AdaBoost (Freund & Schapire, 1997), or the logistic function Φ(x) = log(1 + e^x) as in logistic regression.

5.1. Objective function

Let {h_1, ..., h_N} be a subset of H*. In the most general case, N is infinite. However, as discussed later, in practice the search is limited to a finite set. For any j ∈ [N], we will denote by r_j the Rademacher complexity of the family H_{k_j} that contains h_j: r_j = R_m(H_{k_j}).

ADANET seeks to find a function f = Σ_{j=1}^{N} w_j h_j ∈ F* (or neural network) that directly minimizes the data-dependent generalization bound of Corollary 1. This leads to the following objective function:

$$F(w) \;=\; \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} w_j\, h_j(x_i)\Big) \;+\; \sum_{j=1}^{N} \Gamma_j |w_j|, \qquad (4)$$

where w ∈ R^N and Γ_j = λ r_j + β, with λ ≥ 0 and β ≥ 0 hyperparameters. The objective function (4) is a convex function of w. It is the sum of a convex surrogate of the empirical error and a regularization term, which is a weighted-l1 penalty containing two sub-terms: a standard norm-1 regularization which admits β as a hyperparameter, and a term that discriminates the functions h_j based on their complexity.

The optimization problem consisting of minimizing the objective [...]
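To make these quantities concrete, the sketch below (a NumPy illustration under stated assumptions, not the authors' implementation) computes the Lemma 2 upper bound on R_m(H*_k) for p = q = 2 and plugs it in as the complexity estimate r_j inside the objective (4), with Γ_j = λ r_j + β and the logistic surrogate Φ(x) = log(1 + e^x). The layer widths, Λ_{s,s−1} values and candidate outputs are made-up toy numbers.

```python
import numpy as np

def lemma2_complexity(k, Lambda, widths, r_inf, m, q=2.0):
    # Lemma 2 upper bound on the empirical Rademacher complexity of H*_k:
    #   r_inf * Lambda_k * N_k^(1/q) * sqrt(log(2 n_0) / (2 m)),
    # with Lambda_k = prod_{s=1..k} 2*Lambda_{s,s-1} and N_k = prod_{s=1..k} n_{s-1}.
    Lambda_k = np.prod([2.0 * Lambda[s] for s in range(k)])
    N_k = np.prod([widths[s] for s in range(k)])
    return r_inf * Lambda_k * N_k ** (1.0 / q) * np.sqrt(np.log(2 * widths[0]) / (2 * m))

def adanet_objective(w, H, y, r, lam, beta):
    # Objective (4): (1/m) sum_i Phi(1 - y_i sum_j w_j h_j(x_i)) + sum_j Gamma_j |w_j|,
    # with the logistic surrogate Phi(x) = log(1 + exp(x)) and Gamma_j = lam * r_j + beta.
    surrogate = np.log1p(np.exp(1.0 - y * (H @ w))).mean()
    return surrogate + np.sum((lam * r + beta) * np.abs(w))

# Toy usage: m = 4 points, N = 3 candidate subnetworks taken from layers 1, 1 and 2.
m, widths, Lambda, r_inf = 4, [154, 16, 16], [1.0, 1.0, 1.0], 1.0
r = np.array([lemma2_complexity(k, Lambda, widths, r_inf, m) for k in (1, 1, 2)])
H = np.array([[0.2, -0.1, 0.5], [0.4, 0.3, -0.2], [-0.5, 0.1, 0.3], [0.1, -0.4, 0.6]])
y = np.array([1.0, 1.0, -1.0, 1.0])
print(adanet_objective(np.array([0.3, 0.1, 0.2]), H, y, r, lam=0.1, beta=1e-3))
```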
Figure 3. Pseudocode of the AdaNet algorithm. On line 3, two candidate subnetworks are generated (e.g. randomly or by solving (6)). On lines 3 and 4, (5) is solved for each of these candidates. On lines 5-7 the best subnetwork is selected, and on lines 9-11 the termination condition is checked.

The option selected is the one leading to the best reduction [...]

[...] h and R_m(H_{l_{t−1}+1}) otherwise. In other words, if min_w F_t(w, h) ≤ min_w F_t(w, h′), then

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^B} F_t(w, h), \qquad h_t = h,$$

and otherwise

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^B} F_t(w, h'), \qquad h_t = h'.$$

If F(w_{t−1} + w*) < F(w_{t−1}), then we set f_t = f_{t−1} + w* · h_t, and otherwise we terminate the algorithm.

There are many different choices for the WEAKLEARNER algorithm. For instance, one may generate a large number of random networks and select the one that optimizes (5). Another option is to directly minimize (5) or its regularized version:

$$\widetilde F_t(w, h) \;=\; \frac{1}{m}\sum_{i=1}^{m} \Phi\big(1 - y_i f_{t-1}(x_i) - y_i\, w \cdot h(x_i)\big) \;+\; R(w, h), \qquad (6)$$

over both w and h. Here R(w, h) is a regularization term that, for instance, can be used to enforce that ‖u_s‖_p ≤ Λ_{k,s} in (2). Note that, in general, (6) is a non-convex objective. However, we do not rely on finding a global solution to the corresponding optimization problem. In fact, standard guarantees for regularized boosting only require that each h that is added to the model decreases the objective by a constant amount (i.e. that it satisfies a δ-optimality condition) for a boosting algorithm to converge (Rätsch et al., 2001; Luo & Tseng, 1992).

Furthermore, the algorithm that we present in Appendix C uses a weak-learning algorithm that solves a convex sub- [...]

6. Experiments

In this section we present the results of our experiments with the ADANET algorithm.

6.1. CIFAR-10

In our first set of experiments, we used the CIFAR-10 dataset (Krizhevsky, 2009). This dataset consists of 60,000 images evenly categorized in 10 different classes. To reduce the problem to binary classification, we considered five pairs of classes: deer-truck, deer-horse, automobile-truck, cat-dog, dog-horse. Raw images have been preprocessed to obtain color histograms and histogram-of-gradient features. The result is 154 real-valued features with ranges [0, 1].

We compared ADANET to standard feedforward neural networks (NN) and logistic regression (LR) models. Note that convolutional neural networks are often a more natural choice for image classification problems such as CIFAR-10. However, the goal of these experiments is not to obtain state-of-the-art results for this particular task, but to provide a proof of concept illustrating that our structural learning approach can be competitive with traditional approaches for finding efficient architectures and training the corresponding networks.

Note that the ADANET algorithm requires knowledge of the complexities r_j, which in certain cases can be estimated from data. In our experiments, we have used the upper bound in Lemma 2. Our algorithm admits a number of hyperparameters: the regularization hyperparameters λ and β, the number of units B in each layer of the new subnetworks used to extend the model at each iteration, and a bound Λ_k on the weights (u_0, u) in each unit. As discussed in Section 5, there are different approaches to finding candidate subnetworks in each iteration. In our experiments, we searched for candidate subnetworks by minimizing (6) with R = 0. This also requires a learning rate hyperparameter η. These hyperparameters have been optimized over the [...]
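As a rough sketch of how these pieces fit together, the loop below mirrors the selection and termination rule of Section 5.2 in NumPy: at each round two candidate subnetworks are compared via a one-dimensional fit of their weight, the better one is kept, and the process stops as soon as the objective no longer decreases. Everything specific here is a stand-in assumption: candidates are drawn at random from a precomputed pool (rather than obtained by minimizing (6) with R = 0 and learning rate η, as in the experiments), and a crude grid search replaces the inner solver for (5).

```python
import numpy as np

def objective(scores, y, penalty):
    # Current value of (4): logistic surrogate plus the accumulated weighted-l1 penalty.
    return np.log1p(np.exp(1.0 - y * scores)).mean() + penalty

def adanet_loop(candidates, r, y, lam=0.1, beta=1e-3, T=20, seed=0):
    # candidates[j]: outputs h_j(x_i) of a candidate subnetwork on the sample;
    # r[j]: complexity estimate for h_j (e.g. the Lemma 2 bound).
    rng = np.random.default_rng(seed)
    scores, penalty = np.zeros_like(y), 0.0
    best, model = objective(scores, y, penalty), []
    for t in range(T):
        trials = []
        for j in rng.choice(len(candidates), size=2, replace=False):
            grid = np.linspace(-2.0, 2.0, 201)          # crude stand-in for solving (5)
            vals = [objective(scores + w * candidates[j], y,
                              penalty + (lam * r[j] + beta) * abs(w)) for w in grid]
            trials.append((min(vals), grid[int(np.argmin(vals))], int(j)))
        val, w_star, j = min(trials)
        if val >= best:                                  # no decrease of the objective: terminate
            break
        scores += w_star * candidates[j]
        penalty += (lam * r[j] + beta) * abs(w_star)
        best, model = val, model + [(j, w_star)]
    return model, best

# Toy usage: 3 precomputed candidate subnetworks evaluated on 5 labeled points.
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
cands = [np.array([0.5, -0.4, 0.3, 0.6, -0.2]),
         np.array([-0.1, 0.2, 0.4, -0.3, 0.5]),
         np.array([0.2, -0.5, 0.1, 0.4, -0.6])]
print(adanet_loop(cands, r=np.array([0.05, 0.08, 0.06]), y=y))
```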
Table 1. Experimental results for ADANET, NN, LR and NN-GP for different pairs of labels in CIFAR-10. Boldfaced results are statistically significant at a 5% confidence level.

Table 3. Experimental results for different variants of ADANET.

Table 4. Experimental results for the Criteo dataset.
References

Kuznetsov, Vitaly, Mohri, Mehryar, and Syed, Umar. Multi-class deep boosting. In NIPS, 2014.

Kwok, Tin-Yau and Yeung, Dit-Yan. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.

LeCun, Yann, Denker, John S., and Solla, Sara A. Optimal brain damage. In NIPS, 1990.

Lehtokangas, Mikko. Modelling with constructive backpropagation. Neural Networks, 12(4):707–716, 1999.

Leung, Frank HF, Lam, Hak-Keung, Ling, Sai-Ho, and Tam, Peter KS. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, 2003.

Lian, Xiangru, Huang, Yijun, Li, Yuncheng, and Liu, Ji. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, pp. 2719–2727, 2015.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In NIPS, pp. 855–863, 2014.

Luo, Zhi-Quan and Tseng, Paul. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Ma, Liying and Khorasani, Khashayar. A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51:361–385, 2003.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.

Sun, Shizhao, Chen, Wei, Wang, Liwei, Liu, Xiaoguang, and Liu, Tie-Yan. On the depth of deep neural networks: A theoretical view. In AAAI, 2016.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott E., Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR, 2015.

Telgarsky, Matus. Benefits of depth in neural networks. In COLT, 2016.

Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memisevic, Roland, Salakhutdinov, Ruslan, and Bengio, Yoshua. Architectural complexity measures of recurrent neural networks. CoRR, 2016.

Zhang, Yuchen, Lee, Jason D., and Jordan, Michael I. ℓ1-regularized neural networks are improperly learnable in polynomial time. arXiv:1510.03528, 2015.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. CoRR, 2016.
[...] where the second inequality holds by Talagrand's contraction lemma.

Proof. Since F* is the convex hull of H*, we can apply Theorem 1 with R_m(H̃*_k) instead of R_m(H̃_k). Observe [...]

[...] optimization problem:

$$\operatorname*{argmin}_{h \in \cup_{s=1}^{l_{t-1}+1} H'_s} \; \min_{w \in \mathbb{R}} F_t(w, h).$$

[...]

$$u_i^{(s)} \;=\; \frac{\Lambda_{s,s-1}}{\|\epsilon_{t,h_{s-1,t-1}}\|_q^{q/p}}\, \big|\epsilon_{t,h_{s-1,t-1},i}\big|^{q-1},$$

which is a consequence of Banach space duality. To see this, note first that by Hölder's inequality, every u ∈ R^{n_{s−1,t−1}} with ‖u‖_p ≤ Λ_{s,s−1} satisfies:

$$u \cdot \operatorname*{E}_{i \sim D_t}\big[y_i\,(\varphi_{s-1} \circ h_{s-1,t-1})(x_i)\big] \;\le\; \|u\|_p \,\Big\|\operatorname*{E}_{i \sim D_t}\big[y_i\,(\varphi_{s-1} \circ h_{s-1,t-1})(x_i)\big]\Big\|_q \;\le\; \Lambda_{s,s-1}\,\|\epsilon_{t,h_{s-1,t-1}}\|_q,$$

where we write ε_{t,h_{s−1,t−1}} for the vector E_{i∼D_t}[y_i (φ_{s−1} ∘ h_{s−1,t−1})(x_i)]. At the same time, our choice of u^{(s)} also attains this upper bound:

$$u^{(s)} \cdot \epsilon_{t,h_{s-1,t-1}} \;=\; \sum_{i=1}^{n_{s-1,t-1}} u_i^{(s)}\, \epsilon_{t,h_{s-1,t-1},i} \;=\; \sum_{i=1}^{n_{s-1,t-1}} \frac{\Lambda_{s,s-1}}{\|\epsilon_{t,h_{s-1,t-1}}\|_q^{q/p}}\, \big|\epsilon_{t,h_{s-1,t-1},i}\big|^{q} \;=\; \frac{\Lambda_{s,s-1}}{\|\epsilon_{t,h_{s-1,t-1}}\|_q^{q/p}}\, \|\epsilon_{t,h_{s-1,t-1}}\|_q^{q} \;=\; \Lambda_{s,s-1}\, \|\epsilon_{t,h_{s-1,t-1}}\|_q.$$

The theorem above defines the choice of descent coordinate at each round and motivates the following algorithm, ADANET.CVX. At each round, ADANET.CVX can design the optimal candidate subnetwork within its search space in closed form, leading to an extremely efficient update. However, this comes at the cost of a more restrictive search space than the one used in ADANET. The pseudocode of ADANET.CVX is provided in Figure 5.
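The closed-form choice of u^{(s)} can also be checked numerically. The short sketch below (an illustration, not part of the paper) draws a random non-negative vector in place of ε_{t,h_{s−1,t−1}}, builds u from the formula above, and verifies that ‖u‖_p equals Λ_{s,s−1} and that u · ε attains the Hölder upper bound Λ_{s,s−1}‖ε‖_q.

```python
import numpy as np

# Numerical sanity check of the closed-form maximizer u^{(s)} (illustration only).
rng = np.random.default_rng(0)
p, Lam = 3.0, 2.5
q = p / (p - 1.0)                       # conjugate exponent: 1/p + 1/q = 1
eps = np.abs(rng.normal(size=8))        # stands in for epsilon_{t,h}; taken non-negative here

u = Lam * eps ** (q - 1.0) / np.linalg.norm(eps, q) ** (q / p)

print(np.linalg.norm(u, p), Lam)                  # the constraint ||u||_p <= Lam is tight
print(u @ eps, Lam * np.linalg.norm(eps, q))      # Hölder bound attained: u.eps = Lam * ||eps||_q
```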