AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Corinna Cortes 1, Xavier Gonzalvo 1, Vitaly Kuznetsov 1, Mehryar Mohri 2 1, Scott Yang 2

1 Google Research, New York, NY, USA. 2 Courant Institute, New York, NY, USA. Correspondence to: Vitaly Kuznetsov <vitalyk@google.com>.
Abstract

Our algorithms (ADANET) adaptively learn both the structure of the network and its weights. They are based on a solid theoretical analysis, including data-dependent generalization guarantees that we prove and discuss in detail. We report the results of large-scale experiments with one of our algorithms on several binary classification tasks extracted from the CIFAR-10 dataset. The results demonstrate that our algorithm can automatically learn network structures with very competitive performance accuracies when compared with those achieved for neural networks found by standard approaches.

1. Introduction

Deep neural networks form a powerful framework for machine learning and have achieved remarkable performance in several areas in recent years. Representing the input through increasingly abstract layers of feature representation has proven extremely effective in areas such as natural language processing, image captioning, speech recognition and many others (Krizhevsky et al., 2012; Sutskever et al., 2014). However, despite the compelling arguments for using neural networks as a general template for solving machine learning problems, training these models and designing the right network for a given task have been filled with many theoretical gaps and practical concerns.

To train a neural network, one needs to specify the parameters of a typically large network architecture with several layers and units, and then solve a difficult non-convex optimization problem. From an optimization perspective, there is no guarantee of optimality for a model obtained in this way, and often, one needs to implement ad hoc methods (e.g. gradient clipping or batch normalization (Pascanu [...]

[...] trained using back-propagation, the model will always have as many layers as the one specified, because there needs to be at least one path through the network in order for the hypothesis to be non-trivial. While single weights may be pruned (Han et al., 2015), a technique originally termed Optimal Brain Damage (LeCun et al., 1990), the architecture itself is unchanged. This imposes a stringent lower bound on the complexity of the model. Since not all machine learning problems admit the same level of difficulty and different tasks naturally require varying levels of complexity, complex models trained with insufficient data can be prone to overfitting. This places a burden on the practitioner to specify an architecture at the right level of complexity, which is often hard and requires significant experience and domain knowledge. For this reason, network architecture is often treated as a hyperparameter which is tuned using a validation set. The search space can quickly become exorbitantly large (Szegedy et al., 2015; He et al., 2015), and large-scale hyperparameter tuning to find an effective network architecture is wasteful of data, time, and resources (e.g. grid search, random search (Bergstra et al., 2011)).

In this paper, we attempt to remedy some of these issues. In particular, we provide a theoretical analysis of a supervised learning scenario in which the network architecture and parameters are learned simultaneously. To the best of our knowledge, our results are the first generalization bounds for the problem of structural learning of neural networks. These general guarantees can guide the design of a variety of different algorithms for learning in this setting. We describe in depth two such algorithms that directly benefit from the theory that we develop.

In contrast to enforcing a pre-specified architecture and a corresponding fixed complexity, our algorithms learn the requisite model complexity for a machine learning problem in an adaptive fashion. Starting from a simple linear model, we add more units and additional layers as needed. The additional units that we add are carefully selected and penalized according to rigorous estimates from the theory of statistical learning. Remarkably, optimization problems [...]
[...] H̃_k: H = ∪_{k=1}^{l} H̃_k. Then, F coincides with the convex hull of H: F = conv(H).

For any k ∈ [l] we will also consider the family H*_k derived from H_k by setting Λ_{k,s} = 0 for s < k − 1, which corresponds to units connected only to the layer below. We similarly define H̃*_k = H*_k ∪ (−H*_k) and H* = ∪_{k=1}^{l} H*_k, and define F* as the convex hull F* = conv(H*). Note that the architecture corresponding to the family of functions F* is still more general than standard feedforward neural network architectures, since the output unit can be connected to units in different layers.

3. Learning problem

We consider the standard supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {−1, +1}, and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m. [...]

[...] As pointed out earlier, the family of functions F is the convex hull of H. Thus, generalization bounds for ensemble methods can be used to analyze learning with F. In particular, we can leverage the recent margin-based learning guarantees of Cortes et al. (2014), which are finer than those that can be derived via a standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), and which admit an explicit dependency on the mixture weights w_k defining the ensemble function f. This leads to the following learning guarantee.

Theorem 1 (Learning bound). Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^{l} w_k · h_k ∈ F:

$$R(f) \;\le\; \widehat R_{S,\rho}(f) \;+\; \frac{4}{\rho}\sum_{k=1}^{l} \|w_k\|_1\, \mathfrak R_m(\widetilde H_k) \;+\; \frac{2}{\rho}\sqrt{\frac{\log l}{m}} \;+\; C(\rho, l, m, \delta),$$

[...]
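Theorem 1 can be evaluated term by term once per-layer weights and complexity estimates are available. Below is a minimal NumPy sketch of the data-dependent part of the bound. The definition of the empirical margin error R̂_{S,ρ}(f) used here (fraction of points with margin at most ρ) is the standard one and is an assumption, since the definition falls outside the extracted text; the confidence term C(ρ, l, m, δ) is omitted.

```python
import numpy as np

def empirical_margin_error(scores, y, rho):
    # R_hat_{S,rho}(f): fraction of sample points with margin y_i * f(x_i) <= rho
    # (assumed standard definition; not spelled out in this excerpt).
    return float(np.mean(y * scores <= rho))

def theorem1_data_terms(scores, y, rho, layer_w_l1, layer_rademacher, l, m):
    # Data-dependent part of the Theorem 1 bound:
    #   R_hat_{S,rho}(f) + (4/rho) * sum_k ||w_k||_1 * R_m(H~_k) + (2/rho) * sqrt(log(l)/m).
    emp = empirical_margin_error(scores, y, rho)
    complexity = (4.0 / rho) * float(np.dot(layer_w_l1, layer_rademacher))
    slack = (2.0 / rho) * np.sqrt(np.log(l) / m)
    return emp + complexity + slack

# Toy usage: ensemble scores f(x_i) on m = 5 points, l = 2 layers.
scores = np.array([0.9, -0.2, 0.4, 1.1, -0.7])
y = np.array([1, -1, 1, 1, -1])
print(theorem1_data_terms(scores, y, rho=0.5,
                          layer_w_l1=np.array([0.6, 0.4]),
                          layer_rademacher=np.array([0.05, 0.12]),
                          l=2, m=5))
```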
[...] be more conveniently used in our algorithms. The next results in this section provide precisely such upper bounds, thereby leading to a more explicit generalization bound.

We will denote by q the conjugate exponent of p, that is, 1/p + 1/q = 1, and define r_∞ = max_{i∈[1,m]} ‖Ψ(x_i)‖_∞.

Our first result gives an upper bound on the Rademacher complexity of H_k in terms of the Rademacher complexities of the other layer families.

Lemma 1. For any k > 1, the empirical Rademacher complexity of H_k for a sample S of size m can be upper-bounded as follows in terms of those of the families H_s with s < k:

$$\widehat{\mathfrak R}_S(H_k) \;\le\; 2\sum_{s=1}^{k-1} \Lambda_{k,s}\, n_s^{1/q}\, \widehat{\mathfrak R}_S(H_s).$$

For the family H*_k, which is directly relevant to many of our experiments, the following more explicit upper bound can be derived using Lemma 1.

Lemma 2. Let Λ_k = Π_{s=1}^{k} 2Λ_{s,s−1} and N_k = Π_{s=1}^{k} n_{s−1}. Then, for any k ≥ 1, the empirical Rademacher complexity of H*_k for a sample S of size m can be upper bounded as follows:

$$\widehat{\mathfrak R}_S(H_k^*) \;\le\; r_\infty\, \Lambda_k\, N_k^{1/q}\, \sqrt{\frac{\log(2 n_0)}{2m}}.$$

Note that N_k, which is the product of the numbers of units in the layers below k, can be large. This suggests that values of p closer to one, that is larger values of q, could be more helpful to control complexity in such cases. More generally, similar explicit upper bounds can be given for the Rademacher complexities of subfamilies of H_k with units connected only to layers k, k − 1, ..., k − d, for a fixed d < k. Combining Lemma 2 with Theorem 1 helps derive the following explicit learning guarantee for feedforward neural networks with an output unit connected to all the other units.

Corollary 1 (Explicit learning bound). Fix ρ > 0. Let Λ_k = Π_{s=1}^{k} 4Λ_{s,s−1} and N_k = Π_{s=1}^{k} n_{s−1}. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^{l} w_k · h_k ∈ F*:

$$R(f) \;\le\; \widehat R_{S,\rho}(f) \;+\; \frac{2}{\rho}\sum_{k=1}^{l} \|w_k\|_1\, r_\infty\, \Lambda_k\, N_k^{1/q}\, \sqrt{\frac{2\log(2 n_0)}{m}} \;+\; \frac{2}{\rho}\sqrt{\frac{\log l}{m}} \;+\; C(\rho, l, m, \delta),$$

where $C(\rho, l, m, \delta) = \sqrt{\Big\lceil \tfrac{4}{\rho^2}\log\big(\tfrac{\rho^2 m}{\log l}\big)\Big\rceil \tfrac{\log l}{m} + \tfrac{\log\frac{2}{\delta}}{2m}}$ [...]

The learning bound of Corollary 1 is a finer guarantee than previous ones by Bartlett (1998), Neyshabur et al. (2015), or Sun et al. (2016). This is because it explicitly differentiates between the weights of different layers, while previous bounds treat all weights indiscriminately. This is crucial for algorithm design, since the network complexity no longer needs to grow exponentially as a function of depth. Our bounds are also more general and apply to other network architectures, such as those introduced in (He et al., 2015; Huang et al., 2016).

5. Algorithm

This section describes our algorithm, ADANET, for adaptive learning of neural networks. ADANET adaptively grows the structure of a neural network, balancing model complexity with empirical risk minimization. We also describe in detail in Appendix C another variant of ADANET which admits some favorable properties.

Let x ↦ Φ(−x) be a non-increasing convex function upper-bounding the zero-one loss, x ↦ 1_{x≤0}, such that Φ is differentiable over R and Φ′(x) ≠ 0 for all x. This surrogate loss Φ may be, for instance, the exponential function Φ(x) = e^x as in AdaBoost (Freund & Schapire, 1997), or the logistic function Φ(x) = log(1 + e^x) as in logistic regression.

5.1. Objective function

Let {h_1, ..., h_N} be a subset of H*. In the most general case, N is infinite. However, as discussed later, in practice the search is limited to a finite set. For any j ∈ [N], we will denote by r_j the Rademacher complexity of the family H_{k_j} that contains h_j: r_j = R_m(H_{k_j}).

ADANET seeks to find a function f = Σ_{j=1}^{N} w_j h_j ∈ F* (or neural network) that directly minimizes the data-dependent generalization bound of Corollary 1. This leads to the following objective function:

$$F(w) \;=\; \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} w_j\, h_j(x_i)\Big) \;+\; \sum_{j=1}^{N} \Gamma_j |w_j|, \qquad (4)$$

where w ∈ R^N and Γ_j = λ r_j + β, with λ ≥ 0 and β ≥ 0 hyperparameters. The objective function (4) is a convex function of w. It is the sum of a convex surrogate of the empirical error and a regularization term, which is a weighted-l1 penalty containing two sub-terms: a standard norm-1 regularization which admits β as a hyperparameter, and a term that discriminates the functions h_j based on their complexity.

The optimization problem consisting of minimizing the objective [...]
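To make these quantities concrete, the sketch below (a NumPy illustration under stated assumptions, not the authors' implementation) computes the Lemma 2 upper bound on R_m(H*_k) for p = q = 2 and plugs it in as the complexity estimate r_j inside the objective (4), with Γ_j = λ r_j + β and the logistic surrogate Φ(x) = log(1 + e^x). The layer widths, Λ_{s,s−1} values and candidate outputs are made-up toy numbers.

```python
import numpy as np

def lemma2_complexity(k, Lambda, widths, r_inf, m, q=2.0):
    # Lemma 2 upper bound on the empirical Rademacher complexity of H*_k:
    #   r_inf * Lambda_k * N_k^(1/q) * sqrt(log(2 n_0) / (2 m)),
    # with Lambda_k = prod_{s=1..k} 2*Lambda_{s,s-1} and N_k = prod_{s=1..k} n_{s-1}.
    Lambda_k = np.prod([2.0 * Lambda[s] for s in range(k)])
    N_k = np.prod([widths[s] for s in range(k)])
    return r_inf * Lambda_k * N_k ** (1.0 / q) * np.sqrt(np.log(2 * widths[0]) / (2 * m))

def adanet_objective(w, H, y, r, lam, beta):
    # Objective (4): (1/m) sum_i Phi(1 - y_i sum_j w_j h_j(x_i)) + sum_j Gamma_j |w_j|,
    # with the logistic surrogate Phi(x) = log(1 + exp(x)) and Gamma_j = lam * r_j + beta.
    surrogate = np.log1p(np.exp(1.0 - y * (H @ w))).mean()
    return surrogate + np.sum((lam * r + beta) * np.abs(w))

# Toy usage: m = 4 points, N = 3 candidate subnetworks taken from layers 1, 1 and 2.
m, widths, Lambda, r_inf = 4, [154, 16, 16], [1.0, 1.0, 1.0], 1.0
r = np.array([lemma2_complexity(k, Lambda, widths, r_inf, m) for k in (1, 1, 2)])
H = np.array([[0.2, -0.1, 0.5], [0.4, 0.3, -0.2], [-0.5, 0.1, 0.3], [0.1, -0.4, 0.6]])
y = np.array([1.0, 1.0, -1.0, 1.0])
print(adanet_objective(np.array([0.3, 0.1, 0.2]), H, y, r, lam=0.1, beta=1e-3))
```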
Figure 3. Pseudocode of the AdaNet algorithm. On line 3, two candidate subnetworks are generated (e.g. randomly or by solving (6)). On lines 3 and 4, (5) is solved for each of these candidates. On lines 5-7 the best subnetwork is selected, and on lines 9-11 the termination condition is checked.

The option selected is the one leading to the best reduction [...]

[...] h and R_m(H_{l_{t−1}+1}) otherwise. In other words, if min_w F_t(w, h) ≤ min_w F_t(w, h′), then

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^B} F_t(w, h), \qquad h_t = h,$$

and otherwise

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^B} F_t(w, h'), \qquad h_t = h'.$$

If F(w_{t−1} + w*) < F(w_{t−1}), then we set f_t = f_{t−1} + w* · h_t, and otherwise we terminate the algorithm.

There are many different choices for the WEAKLEARNER algorithm. For instance, one may generate a large number of random networks and select the one that optimizes (5). Another option is to directly minimize (5) or its regularized version:

$$\widetilde F_t(w, h) \;=\; \frac{1}{m}\sum_{i=1}^{m} \Phi\big(1 - y_i f_{t-1}(x_i) - y_i\, w \cdot h(x_i)\big) \;+\; R(w, h), \qquad (6)$$

over both w and h. Here R(w, h) is a regularization term that, for instance, can be used to enforce that ‖u_s‖_p ≤ Λ_{k,s} in (2). Note that, in general, (6) is a non-convex objective. However, we do not rely on finding a global solution to the corresponding optimization problem. In fact, standard guarantees for regularized boosting only require that each h that is added to the model decreases the objective by a constant amount (i.e. that it satisfies a δ-optimality condition) for a boosting algorithm to converge (Rätsch et al., 2001; Luo & Tseng, 1992).

Furthermore, the algorithm that we present in Appendix C uses a weak-learning algorithm that solves a convex sub- [...]

6. Experiments

In this section we present the results of our experiments with the ADANET algorithm.

6.1. CIFAR-10

In our first set of experiments, we used the CIFAR-10 dataset (Krizhevsky, 2009). This dataset consists of 60,000 images evenly categorized in 10 different classes. To reduce the problem to binary classification, we considered five pairs of classes: deer-truck, deer-horse, automobile-truck, cat-dog, dog-horse. Raw images have been preprocessed to obtain color histograms and histogram-of-gradient features. The result is 154 real-valued features with ranges [0, 1].

We compared ADANET to standard feedforward neural networks (NN) and logistic regression (LR) models. Note that convolutional neural networks are often a more natural choice for image classification problems such as CIFAR-10. However, the goal of these experiments is not to obtain state-of-the-art results for this particular task, but to provide a proof of concept illustrating that our structural learning approach can be competitive with traditional approaches for finding efficient architectures and training the corresponding networks.

Note that the ADANET algorithm requires knowledge of the complexities r_j, which in certain cases can be estimated from data. In our experiments, we have used the upper bound in Lemma 2. Our algorithm admits a number of hyperparameters: the regularization hyperparameters λ and β, the number of units B in each layer of the new subnetworks used to extend the model at each iteration, and a bound Λ_k on the weights (u_0, u) in each unit. As discussed in Section 5, there are different approaches to finding candidate subnetworks in each iteration. In our experiments, we searched for candidate subnetworks by minimizing (6) with R = 0. This also requires a learning rate hyperparameter η. These hyperparameters have been optimized over the [...]
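As a rough sketch of how these pieces fit together, the loop below mirrors the selection and termination rule of Section 5.2 in NumPy: at each round two candidate subnetworks are compared via a one-dimensional fit of their weight, the better one is kept, and the process stops as soon as the objective no longer decreases. Everything specific here is a stand-in assumption: candidates are drawn at random from a precomputed pool (rather than obtained by minimizing (6) with R = 0 and learning rate η, as in the experiments), and a crude grid search replaces the inner solver for (5).

```python
import numpy as np

def objective(scores, y, penalty):
    # Current value of (4): logistic surrogate plus the accumulated weighted-l1 penalty.
    return np.log1p(np.exp(1.0 - y * scores)).mean() + penalty

def adanet_loop(candidates, r, y, lam=0.1, beta=1e-3, T=20, seed=0):
    # candidates[j]: outputs h_j(x_i) of a candidate subnetwork on the sample;
    # r[j]: complexity estimate for h_j (e.g. the Lemma 2 bound).
    rng = np.random.default_rng(seed)
    scores, penalty = np.zeros_like(y), 0.0
    best, model = objective(scores, y, penalty), []
    for t in range(T):
        trials = []
        for j in rng.choice(len(candidates), size=2, replace=False):
            grid = np.linspace(-2.0, 2.0, 201)          # crude stand-in for solving (5)
            vals = [objective(scores + w * candidates[j], y,
                              penalty + (lam * r[j] + beta) * abs(w)) for w in grid]
            trials.append((min(vals), grid[int(np.argmin(vals))], int(j)))
        val, w_star, j = min(trials)
        if val >= best:                                  # no decrease of the objective: terminate
            break
        scores += w_star * candidates[j]
        penalty += (lam * r[j] + beta) * abs(w_star)
        best, model = val, model + [(j, w_star)]
    return model, best

# Toy usage: 3 precomputed candidate subnetworks evaluated on 5 labeled points.
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
cands = [np.array([0.5, -0.4, 0.3, 0.6, -0.2]),
         np.array([-0.1, 0.2, 0.4, -0.3, 0.5]),
         np.array([0.2, -0.5, 0.1, 0.4, -0.6])]
print(adanet_loop(cands, r=np.array([0.05, 0.08, 0.06]), y=y))
```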
Table 1. Experimental results for ADANET, NN, LR and NN-GP for different pairs of labels in CIFAR-10. Boldfaced results are statistically significant at a 5% confidence level.

Table 3. Experimental results for different variants of ADANET.

Table 4. Experimental results for the Criteo dataset.
References

Kuznetsov, Vitaly, Mohri, Mehryar, and Syed, Umar. Multi-class deep boosting. In NIPS, 2014.

Kwok, Tin-Yau and Yeung, Dit-Yan. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.

LeCun, Yann, Denker, John S., and Solla, Sara A. Optimal brain damage. In NIPS, 1990.

Lehtokangas, Mikko. Modelling with constructive backpropagation. Neural Networks, 12(4):707–716, 1999.

Leung, Frank HF, Lam, Hak-Keung, Ling, Sai-Ho, and Tam, Peter KS. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, 2003.

Lian, Xiangru, Huang, Yijun, Li, Yuncheng, and Liu, Ji. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, pp. 2719–2727, 2015.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In NIPS, pp. 855–863, 2014.

Luo, Zhi-Quan and Tseng, Paul. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Ma, Liying and Khorasani, Khashayar. A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51:361–385, 2003.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.

Sun, Shizhao, Chen, Wei, Wang, Liwei, Liu, Xiaoguang, and Liu, Tie-Yan. On the depth of deep neural networks: A theoretical view. In AAAI, 2016.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott E., Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR, 2015.

Telgarsky, Matus. Benefits of depth in neural networks. In COLT, 2016.

Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memisevic, Roland, Salakhutdinov, Ruslan, and Bengio, Yoshua. Architectural complexity measures of recurrent neural networks. CoRR, 2016.

Zhang, Yuchen, Lee, Jason D., and Jordan, Michael I. ℓ1-regularized neural networks are improperly learnable in polynomial time. arXiv:1510.03528, 2015.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. CoRR, 2016.
[...] where the second inequality holds by Talagrand's contraction lemma.

Proof. Since F* is the convex hull of H*, we can apply Theorem 1 with R_m(H̃*_k) instead of R_m(H̃_k). Observe [...]

[...] optimization problem:

$$\operatorname*{argmin}_{h \in \cup_{s=1}^{l_{t-1}+1} H'_s} \; \min_{w \in \mathbb{R}} F_t(w, h).$$

[...]

$$u_i^{(s)} \;=\; \frac{\Lambda_{s,s-1}}{\|\epsilon_{t,h_{s-1,t-1}}\|_q^{q/p}}\, \big|\epsilon_{t,h_{s-1,t-1},i}\big|^{q-1},$$

which is a consequence of Banach space duality. To see this, note first that by Hölder's inequality, every u ∈ R^{n_{s−1,t−1}} with ‖u‖_p ≤ Λ_{s,s−1} satisfies:

$$u \cdot \operatorname*{E}_{i \sim D_t}\big[y_i\,(\varphi_{s-1} \circ h_{s-1,t-1})(x_i)\big] \;\le\; \|u\|_p \,\Big\|\operatorname*{E}_{i \sim D_t}\big[y_i\,(\varphi_{s-1} \circ h_{s-1,t-1})(x_i)\big]\Big\|_q \;\le\; \Lambda_{s,s-1}\,\|\epsilon_{t,h_{s-1,t-1}}\|_q,$$

where we write ε_{t,h_{s−1,t−1}} for the vector E_{i∼D_t}[y_i (φ_{s−1} ∘ h_{s−1,t−1})(x_i)]. At the same time, our choice of u^{(s)} also attains this upper bound:

$$u^{(s)} \cdot \epsilon_{t,h_{s-1,t-1}} \;=\; \sum_{i=1}^{n_{s-1,t-1}} u_i^{(s)}\, \epsilon_{t,h_{s-1,t-1},i} \;=\; \sum_{i=1}^{n_{s-1,t-1}} \frac{\Lambda_{s,s-1}}{\|\epsilon_{t,h_{s-1,t-1}}\|_q^{q/p}}\, \big|\epsilon_{t,h_{s-1,t-1},i}\big|^{q} \;=\; \frac{\Lambda_{s,s-1}}{\|\epsilon_{t,h_{s-1,t-1}}\|_q^{q/p}}\, \|\epsilon_{t,h_{s-1,t-1}}\|_q^{q} \;=\; \Lambda_{s,s-1}\, \|\epsilon_{t,h_{s-1,t-1}}\|_q.$$

The theorem above defines the choice of descent coordinate at each round and motivates the following algorithm, ADANET.CVX. At each round, ADANET.CVX can design the optimal candidate subnetwork within its search space in closed form, leading to an extremely efficient update. However, this comes at the cost of a more restrictive search space than the one used in ADANET. The pseudocode of ADANET.CVX is provided in Figure 5.
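The closed-form choice of u^{(s)} can also be checked numerically. The short sketch below (an illustration, not part of the paper) draws a random non-negative vector in place of ε_{t,h_{s−1,t−1}}, builds u from the formula above, and verifies that ‖u‖_p equals Λ_{s,s−1} and that u · ε attains the Hölder upper bound Λ_{s,s−1}‖ε‖_q.

```python
import numpy as np

# Numerical sanity check of the closed-form maximizer u^{(s)} (illustration only).
rng = np.random.default_rng(0)
p, Lam = 3.0, 2.5
q = p / (p - 1.0)                       # conjugate exponent: 1/p + 1/q = 1
eps = np.abs(rng.normal(size=8))        # stands in for epsilon_{t,h}; taken non-negative here

u = Lam * eps ** (q - 1.0) / np.linalg.norm(eps, q) ** (q / p)

print(np.linalg.norm(u, p), Lam)                  # the constraint ||u||_p <= Lam is tight
print(u @ eps, Lam * np.linalg.norm(eps, q))      # Hölder bound attained: u.eps = Lam * ||eps||_q
```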