
A Hybrid Recurrent Neural Networks Architecture Inspired by Hidden Markov

Models: Training and Extraction of Deterministic Finite Automaton

Rohitash Chandra
School of Science and Technology
The University of Fiji
rohitashc@unifiji.ac.fj

Christian W. Omlin
The University of the Western Cape

Abstract

We present a hybrid recurrent neural network architecture inspired by hidden Markov models. We train the architecture to model dynamical systems such as deterministic finite automata using a genetic training algorithm. We then use a machine learning approach for the extraction of deterministic finite automata; we apply a generalisation of the Trakhtenbrot-Barzdin algorithm to extract DFAs in symbolic form using the string labelling assigned by the trained network. The results demonstrate that the approach successfully extracts the correct deterministic equivalent automaton for strings much longer than the longest string in the training set. Thus, our hybrid recurrent neural network architecture inspired by hidden Markov models can train and represent an important class of discrete dynamical systems.

1. Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1,2,3]. Hidden Markov models, on the other hand, have been very popular in the field of speech recognition [4]. They have also been applied to other problems such as gesture recognition [5]. Recurrent neural networks are capable of modeling difficult classification tasks. They have shown higher accuracy than hidden Markov models in speech recognition on low-quality, noisy data. However, hidden Markov models have been shown to perform better when it comes to large vocabulary speech recognition. One limitation of hidden Markov models in speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t-1. This first-order assumption is inappropriate for speech signals, where dependencies often extend through several states.

Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [6]. In this paper, we will show that deterministic finite automata can also be learned and represented by the proposed architecture of hybrid recurrent neural networks.

The structural similarity between hidden Markov models and recurrent neural networks is the basis for constructing a hybrid recurrent neural network architecture. The recurrence equation in the recurrent neural network resembles the equation of the forward algorithm in hidden Markov models. The combination of the two paradigms into a hybrid system may provide better generalization and training performance, which would be a useful contribution to the field of machine learning and pattern recognition. We have previously introduced a slight variation of this architecture and shown that it can represent dynamical systems [7]. In this paper, we will show how the hybrid recurrent neural network architecture can learn and represent dynamical systems such as deterministic finite automata. We will show their application to speech recognition in future work.

Evolutionary optimization techniques such as genetic algorithms have been popular for training neural networks as an alternative to gradient descent learning [8]. It has been observed that genetic algorithms can overcome the problem of local minima, whereas in gradient descent search for the optimal solution it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time. This paper will show how genetic algorithms can be applied to train the proposed hybrid recurrent neural network architecture inspired by hidden Markov models.
Neural networks were once considered black boxes,
i.e. it was believed to be difficult to understand the
knowledge represented in their weights as part of the
information processing in the network. Knowledge
extraction from recurrent neural networks aims at finding a model of the knowledge encoded in the trained network [9]. There are two main approaches for knowledge extraction from trained recurrent neural networks: inducing finite automata by clustering the activation values of hidden state neurons [10], and applying machine learning methods to induce automata from the observation of input-output mappings
obtained from the network [11]. We will use the
Trakhtenbrot-Barzdin algorithm for the induction of
finite-state automata from the trained hybrid network.
Figure 1: First-order recurrent neural network architecture. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

2. Definition and Methods

2.1 Recurrent Neural Networks

Recurrent neural networks maintain information about their past states for the computation of future states using feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Figure 1. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a non-linear function of the weighted sum of their inputs. Popular architectures of recurrent neural networks include first-order recurrent networks [12], second-order recurrent networks [13], NARX networks [14] and LSTM recurrent networks [15]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper; however, we will discuss the dynamics of first-order recurrent neural networks as given in Equation 1:

    S_i(t) = g\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right)    (1)

where S_k(t) and I_j(t) represent the outputs of the state neurons and input neurons, respectively, and V_{ik} and W_{ij} represent their corresponding weights; g(.) is a sigmoidal discriminant function. We will use this architecture to construct the hybrid recurrent neural network architecture inspired by hidden Markov models and show that it can learn deterministic finite automata.

2.2 Hidden Markov Models

A hidden Markov model (HMM) describes a process which goes through a finite number of non-observable states whilst generating either a discrete or continuous output signal. In a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times [16]. Figure 2 shows an example of a Markov model containing three states in a stochastic automaton.

Figure 2: A Markov model. Π_i is the probability that the system will start in state S_i and a_ij is the probability that the system will move from state S_i to state S_j.
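The role of the initial probabilities Π_i and the transition probabilities a_ij can be illustrated with a short sketch. The following Python fragment samples a state sequence from a three-state Markov model of the kind drawn in Figure 2; the probability values are placeholders chosen for illustration, since the actual values of Figure 2 are not given in the text.

    import numpy as np

    # Illustrative three-state Markov model; the values are not those of Figure 2.
    pi = np.array([0.6, 0.3, 0.1])            # pi[i]: probability of starting in state S_(i+1)
    A = np.array([[0.7, 0.2, 0.1],            # A[i, j]: probability of moving from S_(i+1) to S_(j+1)
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])

    def sample_state_sequence(length, rng=np.random.default_rng(0)):
        # Draw the first state from pi, then repeatedly draw the next state
        # from the row of A belonging to the current state.
        states = [rng.choice(3, p=pi)]
        for _ in range(length - 1):
            states.append(rng.choice(3, p=A[states[-1]]))
        return states

    print(sample_state_sequence(10))          # a list of state indices 0, 1 or 2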
The model probabilistically links the observed signal to the state transitions in the system. The theory provides a means by which:

1. The probability P(O|λ) of a HMM with parameter set λ generating a particular observation sequence O can be calculated, through what is called the Forward algorithm.

2. The most likely state sequence the system went through in generating the observed signal can be found, through the Viterbi algorithm.

3. A set of re-estimation formulas can be used for iteratively updating the HMM parameters given an observation sequence as training data. These formulas strive to maximize the probability of the sequence being generated by the model. The algorithm is known as the Baum-Welch or Forward-backward procedure.

The term "hidden" hints at the process' state transition sequence, which is hidden from the observer. The process reveals itself to the observer only through the generated observable signal. A HMM is parameterized through a matrix of transition probabilities between states and output probability distributions for observed signal frames given the internal process state. These probabilities are used in the algorithms mentioned above for achieving the desired results.

2.3 Finite-state Automata as Test Beds for Training Recurrent Neural Networks

A finite-state automaton is a device that can be in one of a finite number of states. In certain conditions, it can switch to another state; this is called a transition. When the automaton starts processing input, it can be in one of its initial states. There is also another important subset of states of the automaton: the final states. After processing an input sequence, the automaton is said to accept or reject its input according to whether the last state reached is a final state. Finite-state automata have been used as test beds for training recurrent neural networks. Strings used for training do not need to undergo any feature extraction, and they are used to show that recurrent neural networks, which have dynamical characteristics, can represent dynamical systems. Figure 3 shows the DFA which will be used for training the hybrid recurrent network architecture inspired by hidden Markov models. The training and testing sets are obtained upon presentation of strings to this automaton, which gives an output, i.e. a rejecting or accepting state, depending on the state reached when the last symbol of the string has been presented. For an input string of length 7, e.g. 0100101, the automaton reaches state 5, which is an accepting state; thus, the automaton's output is 1.

Figure 3: Deterministic finite automaton. State 1 is the automaton's start state; accepting states are drawn with double circles.

2.4 Evolutionary Training of Recurrent Neural Networks

Recurrent neural networks can be trained by evolutionary computation methods such as genetic algorithms. In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights in a network which minimizes the error function. The fitness function must reflect the performance of the neural network; thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, genetic algorithms attempt to find a set of weights which minimizes the error function of the network. Compared to gradient descent learning, genetic algorithms can help the network to escape from local minima.

2.5 Knowledge Extraction from Recurrent Neural Networks using Machine Learning

The network is initially trained with the training data set, which represents some finite automaton. After successful training and testing, the network is presented with a set of strings of increasing lengths up to length L. In this way, a data set with input-output mappings from the trained network is obtained. The trained network's generalisation is a measure of the knowledge acquired by the network in the training process. Finally, we apply the Trakhtenbrot-Barzdin algorithm, which takes as input the string labels assigned by the trained network and produces a unique, minimal finite-state automaton which represents the knowledge stored in the trained network.
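The labelling step described above can be sketched in a few lines of Python. The classify_string function below is a hypothetical stand-in for querying the trained hybrid network, whose programming interface the paper does not specify; the resulting dictionary of string labels is what the Trakhtenbrot-Barzdin algorithm of Section 2.5.1 takes as input.

    from itertools import product

    def build_labelled_dataset(classify_string, max_length):
        # Present every binary string of length 1..max_length to the trained network
        # and record its accept (1) / reject (0) label.
        dataset = {}
        for length in range(1, max_length + 1):
            for symbols in product("01", repeat=length):
                string = "".join(symbols)
                dataset[string] = classify_string(string)
        return dataset

    # Example with a toy classifier (odd number of 1s) standing in for the trained network.
    labels = build_labelled_dataset(lambda s: s.count("1") % 2, max_length=4)
    print(len(labels), labels["0101"])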
2.5.1 Trakhtenbrot-Barzdin Algorithm

The Trakhtenbrot-Barzdin algorithm [17] extracts minimal DFAs from a data set of input strings and their respective labels assigned by the trained network. The algorithm is guaranteed to induce a DFA in polynomial time. A restricting premise of the Trakhtenbrot-Barzdin algorithm is that we must know the labels of all strings up to a certain length L if the desired DFA is to be extracted correctly. In this paper, we apply the Trakhtenbrot-Barzdin algorithm for the extraction of DFAs from trained hybrid recurrent networks inspired by hidden Markov models. The algorithm requires that the labels of all strings up to length L be known in order for a DFA to be extracted unambiguously; these strings can be represented in a so-called prefix tree. This prefix tree is collapsed into a smaller graph by merging all pairs of nodes that represent compatible mappings from suffixes to labels. The algorithm visits all nodes of the prefix tree in breadth-first order; all pairs (i, j) of nodes are evaluated for compatibility of the subtrees rooted at i and j, respectively. The subtrees are compatible if the labels of the nodes in corresponding positions in the respective trees are identical. If all corresponding labels are the same, then the edge from i's parent to i is changed to point at j instead. Nodes which become inaccessible are discarded. The result is the smallest automaton consistent with the data set. The algorithm is summarized as follows:

Algorithm 1: Let T be a complete prefix tree with n nodes 1, ..., n

    for i = 1 to n do
        for j = 1 to i - 1 do
            if subtree(i) ≡ subtree(j) then
                redirect the edge from parent(i) so that it points to j

Using Algorithm 1, we generate a complete prefix tree for strings of up to length L = 1. For each successive string length, we extract a DFA from the corresponding prefix tree of increasing depth using the Trakhtenbrot-Barzdin algorithm. The algorithm checks if the extracted DFA is consistent with the entire training data set, i.e. the input-output labels of the entire training data set must be explained by the extracted DFA. If the DFA is consistent, then the algorithm terminates. The algorithm for running the extraction process is summarized as follows:

Algorithm 2: Let M be the universal DFA that rejects (or accepts) all strings, and let string length L = 0.

    Repeat
        L ← L + 1
        Generate all strings of length up to L
        Extract M_L using Trakhtenbrot-Barzdin(L)
    Until M_L is consistent with the test set

3. The Hybrid Recurrent Neural Networks Architecture Inspired by Hidden Markov Models

3.1 Motivation

We have stated earlier that the structural similarities of hidden Markov models and recurrent neural networks form the basis for combining the two paradigms into a hybrid architecture. Why is this a good idea? Most often, first-order hidden Markov models are used in practice, in which the state transition probabilities depend only on the previous state. This assumption is unrealistic for many real-world applications of hidden Markov models. It has been shown that recurrent neural networks can learn higher-order dependencies from training data [18]. Furthermore, the number of states in a hidden Markov model needs to be fixed beforehand for a particular application, so the number of states varies between applications. The theory on recurrent neural networks and hidden Markov models suggests that the combination of the two paradigms may provide better generalization and training performance. Our proposed architecture of hybrid recurrent neural networks may also have the capability of learning higher-order dependencies, and one does not need to fix the number of states beforehand as is done in the case of hidden Markov models.

3.2 Derivation

Consider the equation of the forward procedure for the calculation of the probability P(O|λ) of the observation O given the model λ in hidden Markov models:

    α_j(t) = \left[ \sum_{i=1}^{N} α_i(t-1) a_{ij} \right] b_j(O_t),    1 ≤ j ≤ N    (2)
where N is the number of hidden states in the HMM, a_{ij} is the probability of making a transition from state i to j, and b_j(O_t) is the Gaussian distribution for the observation at time t. The calculation in Equation 2 is inherently recurrent and bears resemblance to the recursion of recurrent neural networks shown in Equation 3:

    x_j(t) = f\left( \sum_{i=1}^{N} x_i(t-1) w_{ij} \right),    1 ≤ j ≤ N    (3)

where f(.) is a non-linearity such as the sigmoid, N is the number of hidden neurons, and w_{ij} are the weights connecting the neurons with each other and with the input nodes. The dynamics of the first-order recurrent neural network as given by Equation 1 are combined with the Gaussian distribution feature of Equation 2 to form the hybrid architecture. Hence, the dynamics of the hybrid recurrent network architecture are given by:

    S_i(t) = f\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \left( \sum_{j=1}^{J} W_{ij} I_j(t-1) \right) b_{t-1}(O) \right)    (4)
where b_{t-1}(O) is the Gaussian distribution. Note that the time subscript in b_{t-1}(O) in Equation 4 differs from the subscript of the Gaussian distribution in Equation 2. The dynamics of hidden Markov models and recurrent networks differ in this respect; however, we can adjust the time parameter t as shown in Equation 4 in order to map hidden Markov models onto recurrent neural networks. For a single input, the univariate Gaussian distribution is given by Equation 5:

    b_t(O) = \frac{1}{\sqrt{2πσ^2}} \exp\left( -\frac{(O - µ)^2}{2σ^2} \right)    (5)

where O is the observation at time t, µ is the mean and σ^2 is the variance. For multiple inputs to the hybrid recurrent network, the multivariate Gaussian for d dimensions is given by Equation 6:

    b_t(O) = \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp\left( -\frac{1}{2} (O - µ)^T Σ^{-1} (O - µ) \right)    (6)

where O is a d-component column vector, µ is a d-component mean vector, Σ is a d-by-d covariance matrix, and |Σ| and Σ^{-1} are its determinant and inverse, respectively.

Figure 4: The architecture of hybrid recurrent neural networks. The dashed lines indicate that the architecture can represent more neurons in each layer if required.

Figure 4 shows how the Gaussian distribution of the hidden Markov model is used to build hybrid recurrent neural networks. The output of the univariate Gaussian function depends solely on two input parameters, the mean and the variance. These parameters will also be represented in the chromosomes together with the weights and biases and will be trained by the genetic algorithm.

4. Empirical Results and Discussion

4.1 Training Hybrid Recurrent Neural Networks on DFAs

In the hybrid recurrent neural network architecture, the neurons in the hidden layer compute the weighted sum of their inputs and multiply it by the output of the corresponding Gaussian function, which gets its inputs from the input layer. The product of the neuron and the output of the Gaussian function is then propagated from the hidden layer to the output layer, as shown in Figure 4.

We modified the crossover and mutation operators in the genetic algorithm so that the genes could represent real-numbered weight values in the hybrid architecture. Prior to the training process, a population size is defined and then the algorithm randomly chooses two parent chromosomes; they are combined into a child chromosome using the crossover operator. The child chromosome is further mutated according to a mutation probability. The mutation operator adds a small real random number to a random gene in the child chromosome. The child chromosome then becomes part of the new generation. A chromosome represents the weights, biases, mean and variance as parameters of the hybrid recurrent network architecture. The fitness function computes the reciprocal of the squared error for each chromosome; therefore, the genetic algorithm is used for reducing the squared error. The chromosome with the least squared error for the hybrid architecture then slowly begins to affect the entire population until a solution is reached. Evolutionary computation such as the genetic algorithm thus finds the best chromosomes representing the weights, biases and other parameters of the hybrid system.
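To make the evaluation of a chromosome concrete, the sketch below decodes a flat parameter vector into the weights and the Gaussian mean and variance of the hybrid network, runs the state update of Equation 4 with the univariate Gaussian of Equation 5, and returns the reciprocal of the sum of squared errors as the fitness. The chromosome layout, the initial value of b_{t-1}(O) and the read-out through a single output neuron are assumptions made for the sketch (bias weights are omitted for brevity); the paper does not fix these details.

    import numpy as np

    def gaussian(o, mean, var):
        # Univariate Gaussian b_t(O) of Equation 5.
        return np.exp(-0.5 * (o - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def hybrid_forward(inputs, V, W, mean, var):
        # State update of Equation 4 for one input neuron and K hidden/state neurons.
        K = V.shape[0]
        S = np.zeros(K)
        b_prev = 1.0                        # assumed initial value of b_{t-1}(O)
        for o in inputs:
            S = 1.0 / (1.0 + np.exp(-(V @ S + (W * o) * b_prev)))
            b_prev = gaussian(o, mean, var)
        return S

    def fitness(chromosome, training_set, K):
        # Assumed chromosome layout: recurrent weights, input weights, output weights, mean, variance.
        V = chromosome[:K * K].reshape(K, K)
        W = chromosome[K * K:K * K + K]
        w_out = chromosome[K * K + K:K * K + 2 * K]
        mean, var = chromosome[-2], abs(chromosome[-1]) + 1e-6
        sse = 0.0
        for string, label in training_set:
            state = hybrid_forward([float(c) for c in string], V, W, mean, var)
            output = 1.0 / (1.0 + np.exp(-(w_out @ state)))
            sse += (output - label) ** 2
        return 1.0 / (sse + 1e-9)           # reciprocal of the sum of squared errors

    # Toy usage: score one random chromosome on two labelled strings.
    K = 5
    chromosome = np.random.default_rng(1).uniform(-7, 7, K * K + 2 * K + 2)
    print(fitness(chromosome, [("0100101", 1), ("000", 0)], K))

In the experiments below, such a fitness value would guide selection, crossover and mutation over the population of chromosomes.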
We obtained the training and testing data sets from the deterministic finite automaton shown in Figure 3. The training and testing sets included string lengths of 1-10 and 1-15, respectively. We trained all the parameters of the hybrid architecture, i.e. the weights connecting the input to the hidden layer, the weights connecting the hidden to the output layer and the weights connecting the context to the hidden layer. We also trained the bias weights and the mean and variance as parameters of the univariate Gaussian function obtaining its inputs from the input layer.

The network topology used for this experiment is as follows: we used one neuron in the input layer for the string input and one output neuron in the output layer. We experimented with different numbers of neurons in the hidden layer. We ran a few sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 showed good genetic training performance. Hence, we used these values for all our experiments. We ran three major experiments with different bounds on the weight initialization prior to training. Illustrative results for the experiments are shown in Table 1, Table 2, and Table 3, respectively.

Table 1: Experiment 1

    No. of Hidden Neurons    No. of generations    Training Performance    Generalization Performance
    5                        max                   0%                      0%
    10                       max                   0%                      0%
    15                       max                   0%                      0%
    20                       max                   0%                      0%

    Weight initialisation of -3 to 3. 'max' denotes the limit of 100 training generations used for training hybrid recurrent neural networks.

Table 2: Experiment 2

    No. of Hidden Neurons    No. of generations    Training Performance    Generalization Performance
    5                        8                     100%                    100%
    10                       26                    100%                    100%
    15                       40                    100%                    100%
    20                       9                     100%                    100%

    Weight initialisation of -7 to 7.

Table 3: Experiment 3

    No. of Hidden Neurons    No. of generations    Training Performance    Generalization Performance
    5                        max                   0%                      0%
    10                       3                     100%                    100%
    15                       max                   0%                      0%
    20                       max                   0%                      0%

    Weight initialisation of -15 to 15. 'max' denotes the limit of 100 training generations used for training hybrid recurrent neural networks.

Experiment 2 reveals 100% training and generalization performance when the weights are initialised in the range -7 to 7 prior to training, while experiments 1 and 3 show poor results. The generalization performance is based upon the presentation of unknown strings to the trained network. The results demonstrate that our proposed hybrid recurrent network architecture can train and represent dynamical systems such as deterministic finite automata.

4.2 Knowledge Extraction from Hybrid Recurrent Neural Networks

As discussed in the previous section, we applied the knowledge extraction method in which the extraction depends on the input-output mapping of DFA strings obtained through the generalisation made by the trained network. Upon successful training and testing of the network, we proceeded with knowledge extraction in order to identify the knowledge represented in the weights of the hybrid recurrent network.
We recorded the prediction made by the network for increasing lengths of string, as discussed in Section 2.5. A prefix tree was built from the sample input strings and their corresponding output membership values. We applied the DFA extraction algorithm shown in Algorithm 2; for each string length L, we recorded the extracted DFA's string classification performance on the training set.

Table 4: DFA extraction results

    String Length    Percentage correctly consistent with testing set
    2                0%
    3                0%
    4                90.02%
    5                100%
    6                100%
    7                100%
    8                100%
    9                100%
    10               100%

The recorded classification performance of the DFAs extracted with increasing string length L is shown in Table 4. We note that the DFAs extracted from lengths L = 2 and 3 show 0% accuracy, as these string lengths were too small to represent the deterministic finite automaton. The string classification accuracy jumps to 90.02% for L = 4 and remains at 100% for all prefix trees of depth L = 5 and larger. The extracted deterministic finite automaton was identical to the automaton used for training the hybrid recurrent neural network architecture, as shown in Figure 3.

We also ran experiments where the training set consisted of 50%, 30% and 10% of all strings up to length 10, i.e. the training data itself no longer embodied all the knowledge about the output values necessary in order to induce DFAs. In this case, the trained network had to rely on its generalization capability in order to assign the output values missing in the prefix tree. Our experiments show that, even when "holes" are present in the training set, it is possible to extract the ideal deterministic finite acceptor by making use of the hybrid recurrent network for the missing output membership values.

5. Conclusions

We have successfully combined strengths of hidden Markov models and recurrent neural networks to construct the hybrid recurrent neural network architecture. The structural similarities between hidden Markov models and recurrent neural networks have been the basis for the successful mapping in the hybrid architecture. We have used genetic algorithms to train the hybrid system of hidden Markov models and recurrent neural networks. We used the Trakhtenbrot-Barzdin algorithm for knowledge extraction. The knowledge extraction results show that the ideal deterministic acceptor could be extracted from prefix trees of depth at least 4, i.e. the extracted acceptor could explain the entire training data. Our results show that deterministic finite automata can be trained and represented in hybrid recurrent neural networks. They also demonstrate that hybrid recurrent neural networks can represent deterministic finite automata in similar ways to recurrent neural networks. Therefore, the hybrid architecture can model dynamical systems, making it suitable for modeling temporal sequences.

6. References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 298-305.

[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 1997, pp. 253-259.

[3] K. Marakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 1991, pp. 237-242.

[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, vol. 12, 1998, pp. 75-98.

[5] T. Kobayashi and S. Haruyama, "Partly-Hidden Markov Model and its Application to Gesture Recognition", Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 1997, p. 3081.

[6] C. Lee Giles, C. W. Omlin and K. Thornber, "Equivalence in Knowledge Representation: Automata, Recurrent Neural Networks, and Dynamical Systems", Proc. of the IEEE, vol. 87, no. 9, 1999, pp. 1623-1640.

[7] R. Chandra and C. W. Omlin, "Evolutionary training of hybrid systems of recurrent neural networks and hidden Markov models", Transactions on Engineering, Computing and Technology, vol. 15, October 2006, pp. 58-63.

[8] C. Kim Wing Ku, M. Wai Mak, and W. Chi Siu, "Adding learning to cellular genetic algorithms for training recurrent neural networks", IEEE Transactions on Neural Networks, vol. 10, no. 2, 1999, pp. 239-252.

[9] H. Jacobsson, "Rule extraction from recurrent neural networks: A taxonomy and review", Neural Computation, vol. 17, no. 6, 2005, pp. 1223-1263.
used genetic algorithms to train the hybrid system of
[10] S. Das & R. Das, "Induction of discrete state-machine by stabilizing a continuous recurrent neural network using clustering", Journal of Computer Science and Informatics, vol. 2, no. 2, 1991, pp. 35-40.

[11] A. Vahed & C. W. Omlin, "Rule extraction from recurrent neural networks using a symbolic machine learning algorithm", Proc. of the 6th International Conference on Neural Information Processing, Dunedin, New Zealand, 1999, pp. 712-717.

[12] P. Manolios and R. Fanelli, "First order recurrent neural networks and deterministic finite state automata", Neural Computation, vol. 6, no. 6, 1994, pp. 1154-1172.

[13] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Processing Systems, California, USA, 1992, pp. 309-316.

[14] T. Lin, B. G. Horne, P. Tino & C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, vol. 7, no. 6, 1996, pp. 1329-1338.

[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.

[16] E. Alpaydin, Introduction to Machine Learning, The MIT Press, London, 2004, pp. 306-311.

[17] B. Trakhtenbrot & Y. Barzdin, Finite Automata: Behaviour and Synthesis, North-Holland, Amsterdam, 1973.

[18] Y. Bengio, Neural Networks for Speech and Sequence Recognition, International Thompson Computer Press, London, UK, 1996.
