
MULTI-LAYER PERCEPTRONS AND PROBABILISTIC NEURAL NETWORKS FOR PHONEME RECOGNITION

Kjell O. E. Elenius* and Hans G. C. Trávén+

*Department of Speech Communication and Music Acoustics, +Department of Numerical Analysis and Computing Science (NADA), KTH, S-100 44 Stockholm, Sweden

ABSTRACT
Two artificial neural networks have been trained to recognise phonemes in continuous speech: multi-layer perceptron (MLP) nets and probabilistic neural networks (PNN). The speech material was recorded by one male Swedish speaker and the sentences were phonetically labelled. Fifty sentences were used for training and another fifty were used for testing. Both networks had a single hidden layer and 38 output nodes corresponding to Swedish phonemes. The MLP was trained by the supervised back-propagation algorithm. The PNN was trained by a self-organising clustering algorithm, a stochastic approximation to the expectation maximisation algorithm. The classification results for a feed-forward MLP and the PNN were rather similar, but an MLP with simple recurrency using context nodes gave the best performance. Several other differences of practical value were noted.

Keywords: phoneme recognition, coarticulation, back-propagation, multi-layer perceptron, simple recurrency, probabilistic neural network, expectation maximisation, supervised/unsupervised training.

1. BACKGROUND
Phoneme recognition can be viewed as a pattern classification problem, the task being that of classifying multivariate measurements derived from a speech signal. Multi-layer perceptrons (MLP) and probabilistic neural networks (PNN) are both feed-forward neural networks that can be used as general purpose classifiers. Viewed within the framework of Bayesian classification, the MLP and the PNN are related, but use two essentially complementary statistical models. The MLP models the discriminant functions for the different phoneme categories, essentially by a piece-wise planar approximation. It can be shown that several of the commonly used cost functions are minimised when the network outputs correspond to a posteriori class probabilities, see Richard and Lippmann [1]. The PNN approximates the class conditional probability densities by a Gaussian mixture, but has no explicit model of the discriminants. In the two networks, the connection weights correspond to normal vectors and mean values, respectively.

2. SPEECH MATERIAL
The speech material consisted of 100 Swedish sentences recorded by one male speaker. It was sampled at 16 kHz using a 6.3 kHz low-pass filter, see Hunnicutt [2]. The smoothed outputs of 16 Bark-scaled filters in the range 200 Hz to 6.0 kHz were input to the network. The interval between successive spectral frames was 10 ms and the integration time was 25.6 ms, compare Figure 1. Fifty sentences were used for training and fifty for testing. The training material contained 2202 phonemes and a total of 15258 10-ms frames (2064 and 13671 for the test material). The sentences were phonetically labelled (Nord [3]) and had a natural (large) variation in phoneme distribution.

Figure 1. Example spectrogram of the speech material using the network speech input representation of 10 ms frame rate and 16 filter amplitudes. Part of the sentence "omsorgsfullt bilen" (carefully the car). The manual labelling is also shown.
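The two statistical views contrasted in the background section (MLP outputs as estimates of a posteriori class probabilities, PNN hidden nodes as a Gaussian mixture model of the class conditional densities) can be made concrete with a toy Bayes classifier. The following is an illustrative sketch only, not the networks described in this paper; the mixture parameters, class names and priors are invented for the example.

```python
import numpy as np

def gaussian_mixture_density(x, means, variances, weights):
    """Evaluate a 1-D Gaussian mixture p(x | class) at point x."""
    comps = (weights * np.exp(-0.5 * (x - means) ** 2 / variances)
             / np.sqrt(2.0 * np.pi * variances))
    return comps.sum()

# Two classes, each modelled by its own small mixture (the PNN view).
class_models = {
    "A": dict(means=np.array([0.0, 1.0]), variances=np.array([0.5, 0.5]),
              weights=np.array([0.6, 0.4])),
    "B": dict(means=np.array([3.0, 4.0]), variances=np.array([0.5, 0.5]),
              weights=np.array([0.5, 0.5])),
}
priors = {"A": 0.7, "B": 0.3}  # a priori class probabilities

def posterior(x):
    """Bayes rule: P(c | x) is proportional to p(x | c) * P(c)."""
    joint = {c: gaussian_mixture_density(x, **m) * priors[c]
             for c, m in class_models.items()}
    total = sum(joint.values())
    return {c: j / total for c, j in joint.items()}

# Classify by the maximum posterior (the quantity an MLP estimates directly).
post = posterior(0.5)
print(max(post, key=post.get))  # near the class-A mixture, "A" wins
```

The MLP models the posteriors (the decision boundary) directly, while the PNN models the densities and obtains posteriors through Bayes rule, which is why unlabelled data can help the latter but not the former.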

3. NETWORKS
Both networks had a single hidden layer, containing 64 nodes in the MLP and 128 in the PNN. There were 38 output nodes corresponding to Swedish phonemes. The MLP was trained with the back-propagation algorithm and the cross-entropy criterion (Solla et al. [4]). The PNN was trained by a self-organising clustering algorithm, a stochastic approximation to the expectation maximisation (EM) algorithm, compare Trávén [5]. The MLP network and some of its results have earlier been described in Elenius & Blomberg [6], whereas the PNN experiments have been described in Trávén [7].

A straightforward way to include coarticulation information, which naturally is important in phoneme recognition, is to use an input window to the network that extends over several spectral frames. We used this technique for both types of networks. Another way of including this information is to add recurrency to the network, compare Watrous [8] and Robinson & Fallside [9]. We have added simple recurrency to the MLP network using the technique of context nodes, compare Elman [10] and Servan-Schreiber et al. [11]. The context nodes contain delayed values of other nodes in the network. We used context nodes for the hidden layer that stored the hidden-node values delayed by one frame (10 ms). The context nodes were connected to the hidden nodes with trainable weights. We also used 10 ms delayed context nodes for the output nodes, connected to the outputs. Using both types of context nodes simultaneously gave somewhat better performance than using either of them separately.

The PNN has a connectivity structure similar to the time-delay neural network described by Waibel et al. [12, 13]. The hidden nodes have an input window extending over three spectral frames (30 ms). Hidden nodes are connected to output nodes through time-delayed connections (five connections 30 ms apart between each pair of hidden and output nodes). The training of the hidden nodes is unsupervised. The hidden nodes learn to recognise time-invariant features using an EM algorithm for invariant Gaussian mixtures, described in Trávén [14]. Briefly, hidden node outputs are smoothed over five translations in 10 ms steps (from -20 to +20 ms) of the input window. During learning, the contributions from different translations are weighted with the probability of the translated pattern. The output nodes are linear and trained with the least-mean-square algorithm. The PNN did not have any recurrent connections.
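The context-node mechanism can be sketched as a forward pass over a frame sequence. This is a minimal sketch with random, untrained weights; the dimensions follow the paper (16 filter channels, 64 hidden nodes, 38 phoneme nodes), but everything else is invented for illustration. Hidden context nodes hold the one-frame (10 ms) delayed hidden values, and output context nodes hold the delayed outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 16, 64, 38  # filter channels, hidden nodes, phoneme nodes

# Trainable weights (random here): input->hidden, hidden-context->hidden,
# hidden->output, output-context->output.
W_ih = rng.normal(0.0, 0.1, (n_hidden, n_in))
W_ch = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden context nodes
W_ho = rng.normal(0.0, 0.1, (n_out, n_hidden))
W_co = rng.normal(0.0, 0.1, (n_out, n_out))        # output context nodes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(frames):
    """Run the net over a sequence of spectral frames (one per 10 ms)."""
    h_ctx = np.zeros(n_hidden)  # one-frame-delayed hidden values
    o_ctx = np.zeros(n_out)     # one-frame-delayed output values
    outputs = []
    for x in frames:
        h = sigmoid(W_ih @ x + W_ch @ h_ctx)
        o = sigmoid(W_ho @ h + W_co @ o_ctx)
        outputs.append(o)
        h_ctx, o_ctx = h, o     # context nodes store the delayed copies
    return np.array(outputs)

out = forward(rng.normal(size=(5, n_in)))  # 5 frames -> 5 output vectors
print(out.shape)  # (5, 38)
```

Because the context values feed back at the next frame, the net can retain a "memory" of past frames without widening the input window, which is the property exploited in the experiments below.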

4. RESULTS
The classification performance of the networks was evaluated against the manual segmentation. It was first evaluated at the frame level: a frame was counted as correctly labelled if the phoneme node with the maximum output value corresponded to the label assigned by the human labeller. Note that although the output is associated with a specific frame, the input window usually contains more than one frame. The relation between performance and the size of the input spectral window for the feed-forward MLP is shown in Table 1. The spectral amplitudes were fed to a hidden layer of 64 nodes.

Table 1. Frame-level phoneme recognition performance for different input window sizes of the feed-forward MLP network.

window size    % correct frames
10 ms          58
30 ms          64
50 ms          66
70 ms          70

We also evaluated the segment-level classification performance for some conditions. In this case the mean output over all consecutive frames with the same (manually assigned) phonetic label was used. We noted whether the correct phoneme had the maximum mean output, and likewise whether it was among the two (or three) nodes with the highest mean output. Results for the PNN with a 150 ms input window and the feed-forward MLP with a 70 ms window are shown in Figure 2, together with the result of a simple recurrent MLP with a 50 ms window. The results of the MLP without simple recurrence and the PNN were rather similar. Introducing coarticulation information by expanding the size of the input spectral window significantly improved performance for both methods, compare Table 1. However, an MLP with a 100 ms input window did not give better results than one with a 70 ms window, probably because it had too many weights, compare below. Table 2 lists the most frequent confusions made by the PNN. Confusions are generally towards acoustically similar phonemes with a high a priori probability. The same type of errors was observed for the MLP network. The MLP with recurrence performed significantly better than the other networks, except when the correct phoneme segment was allowed to be among the three best, where the PNN performed equally well.

Table 2. The 16 most frequent confusions made by the PNN, in absolute numbers. All other confusions occurred fewer than 10 times. (A dash marks a phoneme symbol lost in this copy.)

correct   reported   frequency
-         e          24
m         n          19
p         t          18
-         e          18
r         a          18
r         e          17
k         t          13
f         t          12
a:        a          12
j         i:         12
l         r          12
s         t          11
-         n          11
d         n          10
o         o:         10
i         e          10

Adding simple recurrence to an MLP net with a 10 ms input window gave a substantial improvement, from 58% to 69% correct frames. This is only 1% less than the result for the 70 ms window feed-forward MLP, which has more than double the number of connections (9600 weights compared to 4196). This illustrates the power of recurrency: the net can keep a "memory" of relevant past events, and it can adjust its integration time to what is optimal in each case. Adding context nodes to a net with a 50 ms input window resulted in a 6% improvement (to 73%), which shows that recurrency may be combined with an input window, compare Cho et al. [15]. However, these types of nets have many weights (13092 in our case), which makes training problematic with the roughly 15,000 frames available. This is far too few, considering the rule of thumb that the number of training examples should be around 10 times the number of weights. Accordingly, using context nodes with a 70 ms input window did not improve performance, most probably because of the limited training set. Also, adding a second set of context nodes keeping the 10 ms delayed values of the first set, which should give a more flexible recurrency, gave no improvement, probably due to the increased number of weights. (Lee et al. have shown a small positive effect using a similar technique [16].) A large number of weights makes it possible for the net to adjust itself to the training material, which counteracts generalisation. The performance on the training material of the 50 ms net with context nodes was 84% correct frames, 11% better than on the test set (Figure 2). This corresponds to one erroneous frame per phoneme border, which we consider a good result. The result for the test material is the best we have found, indicating good generalisation, although the performance difference between the training and test sets is fairly large. There are, of course, other, more fully recurrent network architectures (Pearlmutter [17]), and it would be interesting to compare their performance on the same speech material.

[Figure 2: bar chart with values between 67% and 87% correct, comparing the PNN (150 ms window), the feed-forward MLP (70 ms window) and the recurrent MLP (50 ms window) on five measures: frames (training), frames (test), and segments with the correct phoneme among the top 1, 2 or 3 nodes.]

Figure 2. The performance of the PNN and the MLP with and without simple recurrence, for the same speech material of a single male Swedish speaker, evaluated at the frame and segment (phoneme) level. Segment-level results indicate when the correct phoneme is among the 1, 2 or 3 nodes with the highest mean output values. Frame-level results are also shown for the training material.

We have also tested two other, comparatively simpler classifiers, both realisable as one-layer neural networks (without a hidden layer). The first was a linear discriminant classifier (a one-layer perceptron). The second was a parametric maximum likelihood classifier, with one Gaussian for each phoneme. Both classifiers achieved lower results than the multi-layer networks: 37% correct segments for the linear classifier and 41% for the parametric Gaussian classifier, with a 70 ms input window in both cases. These results illustrate that neither a parametric Gaussian nor a linear model fits the available spectral data very well.
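The frame- and segment-level scoring described above can be sketched as follows: a frame is correct if the argmax node matches the label, segment scores are the mean output over all consecutive frames sharing a manual label, and a segment counts as correct at rank k if the labelled phoneme is among the k nodes with the highest mean output. This is a sketch, not the authors' evaluation code; the toy arrays below are invented.

```python
import numpy as np
from itertools import groupby

def frame_accuracy(outputs, labels):
    """Fraction of frames whose max-output node matches the manual label."""
    return float(np.mean(outputs.argmax(axis=1) == labels))

def segment_top_k(outputs, labels, k):
    """Average outputs over runs of identical labels; score top-k hits."""
    hits = total = 0
    i = 0
    for label, run in groupby(labels):
        n = len(list(run))                    # length of this labelled segment
        mean_out = outputs[i:i + n].mean(axis=0)
        top = np.argsort(mean_out)[-k:]       # k nodes with highest mean output
        hits += int(label in top)
        total += 1
        i += n
    return hits / total

# Toy example: 6 frames, 3 phoneme nodes, two labelled segments (0, then 2).
outputs = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                    [0.1, 0.5, 0.4], [0.1, 0.4, 0.5], [0.2, 0.2, 0.6]])
labels = np.array([0, 0, 0, 2, 2, 2])
print(frame_accuracy(outputs, labels))    # 5 of 6 frames correct
print(segment_top_k(outputs, labels, 1))  # both segments correct -> 1.0
```

Averaging over the segment smooths out single misclassified frames (the fourth frame above is wrong at the frame level but does not hurt the segment score), which is why segment-level figures in Figure 2 are higher than frame-level ones.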

5. CONCLUSIONS
In our studies we have worked with simple network topologies, using the same hidden nodes for all phonemes. We think this is an advantage when doing comparison studies, and it also facilitates the training of the network. To optimise recognition performance, some researchers have used more complex, modular networks with different sets of nodes for different phoneme classes, compare Watrous [8], Waibel et al. [18] and Altosaar & Karjalainen [19].

Our results show that it is essential to capture coarticulation information when doing phoneme recognition. One way to include it is to expand the size of the input spectral window used by the network; this significantly improved the performance of both methods. We have also shown that another way to include context information is the use of recurrent neural networks. This is the most powerful approach we have tested.

The PNN generally requires more hidden nodes than the MLP to reach comparable performance. This is natural, since the training of the hidden nodes in the PNN is unsupervised. This disadvantage is traded for the advantage that non-labelled speech can be used for training. The training time needed for the network to reach its best performance, measured as the number of iterations of the learning algorithm, is also significantly shorter for the PNN. It is known that an MLP can give high confidence to novel, previously unseen speech segments, while the PNN always gives low confidence to everything far from the training set. Similar observations have been made by Lee [20] when comparing different networks in a handwritten character recognition application.

It was also conjectured that the large number of weights in the more complex nets tested could not be satisfactorily trained with the limited speech material available. This calls for the build-up of larger data banks of recorded speech, something we are currently undertaking at our department.

ACKNOWLEDGEMENT
This work has been supported by grants from The Swedish National Language Technology Program.

REFERENCES
[1] Richard, M.D. and Lippmann, R.P. (1991) "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computation, Vol. 3, No. 4, pp. 461-483.
[2] Hunnicutt, S. (1987) "Acoustic correlates of redundancy and intelligibility," STL-QPSR, Dept. of Speech Communication, KTH, No. 2-3, pp. 7-14.
[3] Nord, L. (1988) "Acoustic-phonetic studies in a Swedish speech data bank," pp. 1147-1152 in Proc. SPEECH'88, Book 3 (7th FASE Symposium), Institute of Acoustics, Edinburgh.
[4] Solla, S., Levin, E. and Fisher, M. (1988) "Accelerated learning in layered neural networks," Complex Systems, Vol. 2, pp. 625-640.
[5] Trávén, H.G.C. (1991) "A neural network approach to statistical pattern classification by 'semi-parametric' estimation of probability density functions," IEEE Trans. on Neural Networks, Vol. 2, No. 3, pp. 366-377.
[6] Elenius, K. and Blomberg, M. (1992) "Experiments with artificial neural networks for phoneme and word recognition," Proceedings of ICSLP 92, Banff, Vol. 2, pp. 1279-1282.
[7] Trávén, H.G.C. (1993) On Pattern Recognition Applications of Artificial Neural Networks, Dissertation, Thesis TRITA-NA-P9318, Dept. of Numerical Analysis and Computing Science (NADA), KTH.
[8] Watrous, R. (1990) "Phoneme discrimination using connectionist networks," JASA, Vol. 87, No. 4, pp. 1753-1772.
[9] Robinson, T. and Fallside, F. (1991) "A recurrent error propagation network speech recognition system," Computer Speech and Language, Vol. 5, No. 3, pp. 259-274.
[10] Elman, J.L. (1988) "Finding structure in time," Technical Report 8801, Centre for Research in Language, University of California, San Diego.
[11] Servan-Schreiber, D., Cleeremans, A. and McClelland, J.L. (1988) "Encoding sequential structure in simple recurrent networks," Technical Report CMU-CS-88-183, Carnegie Mellon University.
[12] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. (1989) "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 37, No. 3, pp. 328-339.
[13] Lang, K.J., Waibel, A.H. and Hinton, G.E. (1990) "A time-delay neural network architecture for isolated word recognition," Neural Networks, Vol. 3, pp. 23-43.
[14] Trávén, H.G.C. (1993) "Invariance constraints for improving generalisation in probabilistic neural networks," Proc. IEEE ICNN'93, San Francisco, pp. 1348-1353.
[15] Cho, Y.D., Kim, K.C., Yoon, H.S., Maeng, S.R. and Cho, J.W. (1990) "Extended Elman's recurrent neural network for syllable recognition," Proceedings of ICSLP 90, Kobe, Vol. 2, pp. 1057-1060.
[16] Lee, S.J., Kim, K.C., Yoon, H.S. and Cho, J.W. (1991) "Application of fully recurrent neural networks for speech recognition," Proceedings of ICASSP-91, Toronto, pp. 77-80.
[17] Pearlmutter, B.A. (1990) "Dynamic recurrent neural networks," Technical Report CMU-CS-88-191, Computer Science Department, Carnegie Mellon University, Pittsburgh.
[18] Waibel, A., Sawai, H. and Shikano, K. (1989) "Modularity and scaling in large phonemic neural networks," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 37, No. 12, pp. 1888-1898.
[19] Altosaar, T. and Karjalainen, M. (1992) "Diphone-based speech recognition using time-event neural networks," Proceedings of ICSLP 92, Banff, Vol. 2, pp. 979-982.
[20] Lee, Y. (1991) "Handwritten digit recognition using K-nearest-neighbour, radial-basis function, and backpropagation neural networks," Neural Computation, Vol. 3, No. 3, pp. 440-449.
