\[
x(v) = \sigma\Big(W\,l(v) + \sum_{j=1}^{k} \widehat{W}^{\,j}\, q_j^{-1}\, x(v)\Big) \qquad (3)
\]
where σ_i(u) = σ(u_i) (sigmoidal function), l(v) ∈ R^n is a vertex label, W ∈ R^{m×n} is the free-parameter (weight) matrix associated with the label space, and Ŵ^j ∈ R^{m×m} is the free-parameter (weight) matrix associated with the state information of the j-th child of v. Note that the bias vector is included in the weight matrix W. The stationarity assumption allows a model with a fixed number of parameters to process structures of variable size.
Concerning the output function g, it can be defined as a map g : R^m → R^z implemented, for example, by a standard feed-forward network.
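To make the causal processing concrete, here is a minimal sketch (ours, not the chapter's; the accessors label and children, the weight shapes, and the plain recursive visit are illustrative assumptions) of the state transition of Equation 3 together with a linear readout g applied at the supersource:

import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))              # component-wise sigmoid

def state(v, W, W_child, label, children):
    """x(v) = sigma(W l(v) + sum_j Whad^j x(ch_j(v))) -- Equation 3, computed bottom-up."""
    net = W @ label(v)                            # label contribution (bias absorbed into W)
    for j, c in enumerate(children(v)):           # the j-th child has its own weight matrix
        net += W_child[j] @ state(c, W, W_child, label, children)
    return sigma(net)

def supersource_output(root, W, W_child, V, label, children):
    """Readout g, here a single linear map applied to the state of the supersource."""
    return V @ state(root, W, W_child, label, children)

For a genuine DPAG (shared substructures) the recursive call would be memoised per vertex; the sketch treats the structure as a tree for brevity.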
Learning algorithms can be based on the typical gradient descent approach used for neural networks, adapted to recursive processing (Back-Propagation Through Structure and Real-Time Recurrent Learning algorithms [34]).
The problem with this approach is the a priori definition of a proper network topology. Constructive methods make it possible to start with a minimal network configuration and to add units and connections progressively, allowing automatic adaptation of the architecture to the computational task. Constructive learning algorithms are particularly suited to structured inputs, where the complexity of learning is so high that an incremental approach is preferable.
Recursive Cascade Correlation (RCC) [34] is an extension of the Cascade Correlation algorithms [10, 9] to structured data. Specifically, in RCC, recursive hidden units are added incrementally to the network, so that their functional dependency can be described as follows:
\[
\begin{aligned}
x_1(v) &= f_{s_1}\big(l(v),\, q^{-1} x_1(v)\big)\\
x_2(v) &= f_{s_2}\big(l(v),\, q^{-1}[x_1(v), x_2(v)]\big)\\
&\;\;\vdots\\
x_m(v) &= f_{s_m}\big(l(v),\, q^{-1}[x_1(v), x_2(v), \ldots, x_m(v)]\big)
\end{aligned} \qquad (4)
\]
The training of a new hidden unit is based on the already frozen units. The number of hidden units realizing the f_s() function depends on the training process.
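To illustrate the cascade structure of Equation 4, the following hedged sketch (our illustration; the states dictionary, the per-unit weight lists, and the bottom-up visit ordering are assumptions) computes the state of the i-th unit at a vertex from the vertex label and the states of units 1..i at its children, with units 1..i-1 already frozen:

import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def rcc_unit(i, v, W_label, W_child, label, children, states):
    """x_i(v) as in Equation 4: input is l(v) plus q^{-1}[x_1, ..., x_i] at the children of v.
    Units 1..i-1 are frozen; only the weights of unit i are adjusted during training."""
    net = W_label[i] @ label(v)
    for j, c in enumerate(children(v)):
        # states of units 1..i at the j-th child; filled beforehand by a bottom-up visit
        net += W_child[i][j] @ np.array([states[(k, c)] for k in range(1, i + 1)])
    states[(i, v)] = sigma(net)
    return states[(i, v)]

The outer training loop (not shown) visits the vertexes of every structure bottom-up, so that the required child states are always available.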
The computational power of RNNs has been studied in [33] by using hard threshold units and frontier-to-root tree automata. In [5], several strategies for encoding finite-state tree automata in high-order and first-order sigmoidal RNNs have been proposed. Complexity results on the amount of resources needed to implement frontier-to-root tree automata in RNNs are presented in [14]. Finally, results on function approximation and theoretical analysis of learnability and generalization of RNNs (referred to as folding networks) can be found in [15, 16, 17].
3.2 Contextual Models
Adaptive recursive processing systems can be extended to contextual transductions by proper construction of contextual representations. The main representative of this class of models is Contextual Recursive Cascade Correlation (CRCC), which appeared in [24] applied to the sequence domain and was subsequently extended to structures in [25, 26]. A theoretical study of the universal approximation capability of the model appears in [18].
The aim of CRCC is to overcome the limitations of causal models. In fact, due to the causality assumption, a standard RNN cannot represent contextual transductions in its hypothesis space. In particular, for an RNN the context of a given vertex in a DPAG is restricted to its descending vertexes. In contrast, in a contextual transduction the context is extended to all the vertexes represented in the structured data.
As for RCC, CRCC is a constructive algorithm. Thus, when training hidden unit i, the state variables x_1, . . . , x_{i-1} for all the vertexes of all the DPAGs in the training set are already available and can be used in the definition of x_i. Consequently, the equations of causal RNNs, and of RCC in particular (Equation 4), can be expanded in a contextual fashion by using, where possible, the variables q^{+1} x_i(v). The equations of the state transition function for a CRCC model are defined as:
\[
\begin{aligned}
x_1(v) &= f_{s_1}\big(l(v),\, q^{-1} x_1(v)\big)\\
x_2(v) &= f_{s_2}\big(l(v),\, q^{-1}[x_1(v), x_2(v)],\, q^{+1}[x_1(v)]\big)\\
&\;\;\vdots\\
x_m(v) &= f_{s_m}\big(l(v),\, q^{-1}[x_1(v), x_2(v), \ldots, x_m(v)],\, q^{+1}[x_1(v), x_2(v), \ldots, x_{m-1}(v)]\big)
\end{aligned} \qquad (5)
\]
Hence, in the CRCC model, the frozen state values of both the children and
parents of each vertex can be used as input for the new units without intro-
ducing cycles in the dynamics of the state computing system.
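The only change with respect to the RCC unit sketched above is the extra, already frozen, parent contribution; a hedged fragment (again ours, with parents as an assumed accessor and a single parent weight matrix for brevity) could read:

import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def crcc_unit(i, v, W_label, W_child, W_parent, label, children, parents, states):
    """x_i(v) as in Equation 5: the RCC input plus q^{+1}[x_1, ..., x_{i-1}] at the parents of v."""
    net = W_label[i] @ label(v)
    for j, c in enumerate(children(v)):                      # causal part, as in RCC
        net += W_child[i][j] @ np.array([states[(k, c)] for k in range(1, i + 1)])
    if i > 1:                                                # unit 1 has no contextual input (Eq. 5)
        for p in parents(v):                                 # contextual part: frozen units only
            net += W_parent[i] @ np.array([states[(k, p)] for k in range(1, i)])
    states[(i, v)] = sigma(net)
    return states[(i, v)]

Since only the states of units 1..i-1 are read at the parents, these values never depend on unit i itself and no cycle is introduced, exactly as stated above.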
Following the line developed in Section 3.1, it is easy to define neural network models that realize the CRCC approach. In fact, it is possible to realize each f_{s_j} of Equation 5 by a neural unit, associating each input argument of f_{s_j} with free parameters (weights) of the neural unit. In particular, given a vertex v, in the CRCC model each neural unit owns:
in the CRCC model each neural unit owns:
connections associated to the input label of v, and connections associated
to the state variables (both the output of already frozen hidden units and
the current j-th hidden unit) of the children of the current vertex v (as
for the RNN/RCC model presented in Section 3.1);
connections associated to the state variables of the parents of v which are
already frozen (i.e., the output of already frozen hidden units computed
for the parents of v). The connections associated to parents add new pa-
rameters to the model that are fundamental to the contextual processing.
The major aspect of this computation is that an increasing context is taken into account as m grows. In fact, the hidden unit of each layer directly takes a local context (parents and children) of each vertex as input and, progressively, by composition of the context developed in the previous steps, it extends the context to other vertexes, up to the vertexes of the whole structure.
The context window follows a nested development; as an example, consider a vertex v and its parent v′ = pa[v]: x_2(v′) depends on x_1(pa[v′]), and x_3(v) depends on x_2(pa[v]) = x_2(v′), so that the context of x_3(v) indirectly reaches pa[pa[v]].
A path v_0, v_1, . . . , v_h from the supersource v_0 to a vertex v = v_h in a DBAG can be represented by the bit string s_1 . . . s_h (its linear path representation) with s_i = 0 iff v_i is the left child of v_{i-1}. Note that the absolute position of a vertex in the DBAG can be recovered from any given linear path representation. The dynamics of a CRCC as given by Equation 5 allows one to recover all paths leading to v starting from the supersource, provided the number of context units is large enough. More precisely, assume the maximum length of a path leading to v is h. Then all linear path representations of paths to v can be recovered from the functional representation given by x_{h+2}(v) as defined in Equation 5.
This can easily be seen by induction: for the supersource, the only linear path representation is the empty one. This can be recovered from the fact that q^{+1}[x_1(v)] corresponds to an initial context for the root. For some vertex v with maximum length h of a path leading to v, we can recover the linear path representations of the parents from q^{+1}[x_1(v), . . . , x_{h+1}(v)] by induction. Further, it is clear from this term whether v is the left or right child of the considered parent because of the compatibility of the ordering. Thus we obtain the linear path representations of all paths from the supersource to v by extending the linear path representations of the parents of v by 0 or 1, respectively.
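As a small worked illustration (ours, not from the chapter): if a vertex v of a DBAG is reachable from the supersource v_0 both as the right child of the left child of v_0 and as the left child of the right child of v_0, its two linear path representations are 01 and 10, and the argument above recovers both from x_{h+2}(v) = x_4(v), since here h = 2.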
Step 2: Assume h is the maximum length of a path in a given DBAG. Assume v is a vertex. Then one can recover from x_{2h+2}(v) the set {(l(v′), B(v′)) | v′ is a vertex of the DBAG}, where B(v′) is the lexicographically ordered set of all linear path representations of the paths from the supersource to v′.
This fact is obvious for an empty DBAG. Otherwise, from the term computed at the supersource v_0 one can recover the labels and linear path representations of all children v′ of v_0. Thus, we can continue this way, adding the terms (l(v′), B(v′)).
Step 3: For any given DBAG, the set {(l(v′), B(v′)) | v′ is a vertex in the DBAG, B(v′) defined as above} uniquely identifies the DBAG. The same argumentation holds if the functional dependence of a unit is restricted to l(v), q^{-1}[x_m(v)], q^{+1}[x_{m-1}(v)], since all relevant information used in the proof is contained in this representation.
It should be noticed that the condition imposed by DBAGs is only a sufficient condition to ensure CRCC completeness for transductions defined on the class of DPAGs, which includes DBAGs. In fact, as pointed out in [18], the limitation occurs only in highly symmetric cases. Thus, the overall picture shown in Fig. 5 results if the functional form of the definitions is considered.

[Fig. 5. Classes of functions supported by RNN, RCC, and CRCC because of the functional form: supersource and I/O-isomorphic transductions over sequences, trees, DBAGs, and DPAGs.]
5 Approximation capability
An important property of a function class used for learning functions from examples is its universal approximation ability. This refers to the fact that, given an unknown regularity, there exists a function in the class which can represent this regularity up to a small distance ε. Obviously, the universal approximation capability is a condition for a function class to be successful as a learning tool in general situations. There exist various possibilities to make this term precise. We will be interested in the following two settings:
Definition 6 (Approximation completeness). Assume X is an input set. Assume Y is an output set. Assume l : Y × Y → R is a loss function. Assume F and G are classes of functions from X to Y. F is approximation complete for G in the maximum norm if for every g ∈ G and ε > 0 there exists a function f ∈ F such that l(f(x), g(x)) ≤ ε for all x ∈ X.
Assume X is measurable with probability measure P. Assume all compositions of functions in F and G with l are (Borel-)measurable. F is approximation complete for G in probability if for every g ∈ G and ε > 0 there exists a function f ∈ F such that P(x | l(f(x), g(x)) > ε) ≤ ε.
We are interested in the cases where X consists of sequences, trees, DPAGs, or DBAGs, and Y is either a real-vector space for supersource transductions or the same structural space for I/O-isomorphic transductions. We equip X with the σ-algebra induced by the sum of structures of a fixed form, whereby structures of a fixed form are identified with a real-vector space by concatenating all labels. We choose l as the Euclidean distance if Y is a real-vector space, and l(s, s′) = max{|l(v) - l(v′)| : v and v′ are vertexes at corresponding positions of s and s′} if Y is a structural space.
[Fig. 6. Approximation capability by RCC and RNN networks when considering approximation in the maximum norm (classes shown: functions computable by RCC, functions computable by RNN, finite automata/tree automata, general functions).]
We only consider continuous supersource transductions from structures to [0, 1], the codomain of σ, and I/O-isomorphic transductions where labels are restricted to [0, 1], such that the restriction of the function to one label is continuous.
The following result has been shown in [16], which extends well-known approximation results of recurrent networks for sequences, such as [20], to tree structures:

Theorem 12 (Completeness of RNNs in probability). Consider RNNs with the logistic activation function σ. Then every continuous supersource transduction from tree structures can be approximated arbitrarily well in probability by an RNN network.

Obviously, at most causal I/O-isomorphic transductions can be approximated by RNNs. Note that the approximation of a causal transduction reduces to the approximation on all substructures of a given structure; thus, it follows immediately that Theorem 12 also holds for causal I/O-isomorphic transductions.
We now consider the RCC and CRCC models. A general technique to prove approximation completeness in probability has been provided in [18]:
1. Design a unique linear code in a real-vector space for the considered input structures, assuming a finite label alphabet. This needs to be done for every finite but fixed maximum size.
2. Show that this code can be computed by the recursive transition function provided by the model.
3. Add a universal neural network in the sense of [19] as readout; this can be integrated in the specific structure of CC.
4. Transfer the result to arbitrary labels by integrating a sufficiently fine discretization of the labels.
5. Approximate specific functions of the construction, such as the identity or multiplication, by the considered activation function σ.
Steps 3-5 are generic. They require some general mathematical argumentation which can be transferred immediately to our situation, together with the property that σ is squashing and continuously differentiable (such that the identity can be approximated). Therefore, we do not go into details concerning these parts but refer to [18] instead.
Step 1 has already been considered in the previous section: we have investigated whether the functional form of the transition function, i.e. the sequence term(x(v)), v being the supersource resp. root vertex, allows one to differentiate all inputs. If so, it is sufficient to construct a representation of the functional form within a real-vector space and to show that this representation can be computed by the respective recursive mechanism. We briefly explain the main ideas for RCC and CRCC. Technical details can be found in [18].
Theorem 13 (Completeness of RCC for sequences in probability). Every continuous supersource transduction from sequences to a real-vector space can be approximated arbitrarily well by an RCC network in probability.

Proof. We only consider steps 1 and 2. Assume a finite label alphabet L = {l_1, . . . , l_n} is given. Assume the symbols are encoded as natural numbers with at most d digits. Adding leading 0 digits, if necessary, we can assume that every number has exactly d digits. Given a sequence (l_{i_1}, . . . , l_{i_t}), the real number 0.l_{i_t} . . . l_{i_1} constitutes a unique encoding. This encoding can be computed by a single RCC unit because it holds that
\[
0.1^{d} \cdot l_{i_t} + 0.1^{d} \cdot 0.l_{i_{t-1}} \ldots l_{i_1} = 0.l_{i_t} \ldots l_{i_1}
\]
This is a simple linear expression depending on the input l_{i_t} at time step t and the context 0.l_{i_{t-1}} . . . l_{i_1}, which can be computed by a single RCC unit with linear activation function, resp. approximated arbitrarily well by a single neuron with activation function σ.
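A hedged numerical illustration of this update (ours, not part of the proof; the function name and the use of plain floating-point arithmetic are assumptions) computes the encoding of a sequence symbol by symbol with exactly the linear expression above:

def encode_sequence(symbols, d):
    """Return the code 0.l_it ... l_i1 for the sequence (l_i1, ..., l_it) of d-digit symbols."""
    shift = 10.0 ** (-d)            # the factor 0.1^d from the proof
    context = 0.0                   # context of the empty sequence
    for s in symbols:               # one linear RCC-style update per time step
        context = shift * s + shift * context
    return context

# d = 2, sequence of symbols 12, 07, 31: the code is 0.310712 (up to float precision)
print(encode_sequence([12, 7, 31], 2))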
As beforehand, this result transfers immediately to continuous causal I/O-isomorphic transductions, since this relates to the approximation of prefixes.
For tree structures, we consider more general multiplicative neurons which
also include a multiplication of input values.
Definition 7 (Multiplicative neuron). A multiplicative neuron of degree d with n inputs computes a function of the form
\[
x \mapsto \sigma\Big(\sum_{i=1}^{d} \;\sum_{j_1, \ldots, j_i \in \{1, \ldots, n\}} w_{j_1 \ldots j_i}\, x_{j_1} \cdot \ldots \cdot x_{j_i} \;+\; \theta\Big)
\]
where θ denotes the bias.
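A minimal sketch of Definition 7 (our illustration, assuming a logistic activation and a dictionary of weights keyed by index tuples; neither detail is prescribed by the text beyond the definition itself):

import itertools
import math

def multiplicative_neuron(x, weights, theta, degree):
    """Degree-d multiplicative neuron: sigma of the weighted sum, over all orders i = 1..d,
    of products x[j1] * ... * x[ji], plus the bias theta."""
    net = theta
    n = len(x)
    for i in range(1, degree + 1):
        for idx in itertools.product(range(n), repeat=i):   # all tuples (j1, ..., ji)
            w = weights.get(idx, 0.0)                        # absent weights default to 0
            prod = 1.0
            for j in idx:
                prod *= x[j]
            net += w * prod
    return 1.0 / (1.0 + math.exp(-net))                      # logistic activation

# degree-2 neuron with two inputs: net = 0.5*x0 + 1.5*x0*x1 - 1
print(multiplicative_neuron([0.2, 0.8], {(0,): 0.5, (0, 1): 1.5}, -1.0, 2))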
l(v), q^{-1}[x_m(v)], q^{+1}[x_{m-1}(v)]