
Exploring the Land of RNNs

Stefan Postavaru
spostavaru@bitdefender.com
Bitdefender ML Team

University of Bucharest,
Faculty of Mathematics and Computer Science

April 21, 2017

Why RNNs?

- Model distributions over sequences of elements, $P(x_1, x_2, \ldots, x_n)$, using parametrizable and differentiable functions (see the factorization below)
- One way we could do it: feed every element of the sequence to a single model at once
- [figure]
- Problems: possibly too many elements, and no straightforward way to generalize to variable-length sequences
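
One standard way to make the first point concrete (our addition, not stated on the slide) is to factor the joint distribution with the chain rule and let a recurrent model with shared parameters represent each conditional:

P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1})

Because the same parameters are reused at every step, one model covers sequences of any length, which is exactly what the fixed-input approach above lacks.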

Basic idea behind RNNs

- We would like the network to process one input at a time and pass information to its future self
- [figure]
- Vanilla RNN (SRNN), as presented in Elman [1990]:
- $h_t = \sigma(W_h h_{t-1} + W_x x_t + b_h)$
- $o_t = \sigma(W_o h_t + b_o)$
- Essentially a single-hidden-layer perceptron whose hidden state is fed back in as an input (see the sketch after this list)
- [figure]
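
A minimal NumPy sketch of the two update equations above; the choice of nonlinearity, the variable names, and the sizes are illustrative assumptions (tanh for the hidden state, the logistic function for the output):

```python
import numpy as np

def elman_step(x_t, h_prev, Wx, Wh, bh, Wo, bo):
    """One step of a vanilla (Elman) RNN:
    h_t = tanh(Wh @ h_prev + Wx @ x_t + bh)
    o_t = sigmoid(Wo @ h_t + bo)
    """
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t + bh)
    o_t = 1.0 / (1.0 + np.exp(-(Wo @ h_t + bo)))
    return h_t, o_t

# Illustrative sizes: 4-dimensional inputs, 8-dimensional hidden state, 2 outputs.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2
Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
bh = np.zeros(n_hidden)
Wo = rng.normal(scale=0.1, size=(n_out, n_hidden))
bo = np.zeros(n_out)

# Process a variable-length sequence one element at a time,
# passing the hidden state to the network's "future self".
h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # a sequence of length 5
    h, o = elman_step(x_t, h, Wx, Wh, bh, Wo, bo)
```

The loop makes explicit why sequence length is no longer an issue: the same weights are applied at every step.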

Are they any good?

- SRNNs are universal approximators for dynamical systems of the following form (?) (a concrete instance is sketched after this list):
- $h_t = f(h_{t-1}, x_t)$
- $o_t = g(h_t)$
- with $f$ measurable and $g$ continuous
- For at least some tasks, SRNNs are actually optimal in terms of memory capacity per parameter (!?) (?)
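
For illustration, here is one hypothetical system of exactly that form (our example, not from the slides): the running parity of a bit stream, with a measurable state update $f$ and a continuous read-out $g$.

```python
# A toy dynamical system in the stated form: h_t = f(h_{t-1}, x_t), o_t = g(h_t).
# Hypothetical example, not from the slides: running parity of a bit stream.

def f(h_prev: int, x_t: int) -> int:
    """State update: parity of all bits seen so far."""
    return h_prev ^ x_t

def g(h_t: int) -> int:
    """Read-out: the identity map (continuous)."""
    return h_t

h = 0
for x_t in [1, 0, 1, 1, 0, 1]:
    h = f(h, x_t)
o = g(h)  # 0 here: an even number of 1s has been seen
```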

Measuring memory

- Synthetic task: label random binary sequences into two classes, minimizing cross-entropy
- [figure]
- Measure the mutual information in bits (see the sketch after this list):
- $I(L, Y) = H(L) - H(L \mid Y)$
- $H(L)$ is the entropy of a discrete uniform distribution, i.e. $k$ bits for $k$ labeled sequences
- $H(L \mid Y)$ is modeled as a Bernoulli distribution with $p = k_{\text{correct}} / k$
- $I(L, Y) = k + k\,(p \log_2 p + (1 - p) \log_2 (1 - p))$
- Divide by the number of parameters to obtain a bits/parameter measure
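
A small helper evaluating this estimate (a sketch under the assumptions above; the function and variable names are ours, and the numbers in the example are purely illustrative):

```python
import math

def mutual_information_bits(k: int, k_correct: int) -> float:
    """I(L, Y) = k + k * (p*log2(p) + (1-p)*log2(1-p)),
    with p = k_correct / k the fraction of labels the model reproduces."""
    p = k_correct / k
    if p == 0.0 or p == 1.0:
        return float(k)  # the binary-entropy term vanishes at the endpoints
    return k + k * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bits_per_parameter(k: int, k_correct: int, n_params: int) -> float:
    """Recovered label information divided by the number of parameters."""
    return mutual_information_bits(k, k_correct) / n_params

# Illustrative numbers only: 10,000 random labels, 9,500 reproduced, 2,000 parameters.
print(bits_per_parameter(k=10_000, k_correct=9_500, n_params=2_000))
```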

Results

- [figure]
- Fun fact: in ? the authors estimated the memory capacity of a synapse to be around 4.7 bits.

Why not stop at vanilla RNNs?

- For complex tasks, they are much harder to train
- They cannot robustly learn dependencies over long spans of time (a numerical illustration follows this list)
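
One common way to see the second point (our illustration; the slide itself gives no argument) is that the sensitivity of a late hidden state to an early one is a product of per-step Jacobians, which for a tanh SRNN with small recurrent weights shrinks roughly exponentially with the time gap:

```python
import numpy as np

# How much does h_t still depend on h_0 in a vanilla tanh RNN?
# The Jacobian dh_t/dh_0 is a product of per-step Jacobians
#   J_t = diag(1 - h_t**2) @ Wh,
# whose norm typically decays (or explodes) exponentially with t.
rng = np.random.default_rng(0)
n_hidden = 32
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # small recurrent weights
Wx = rng.normal(scale=0.1, size=(n_hidden, 1))

h = np.zeros(n_hidden)
J = np.eye(n_hidden)  # accumulated Jacobian dh_t/dh_0
for t in range(1, 51):
    x_t = rng.normal(size=1)
    h = np.tanh(Wh @ h + Wx @ x_t)
    J = np.diag(1 - h**2) @ Wh @ J
    if t % 10 == 0:
        print(f"t = {t:3d}   ||dh_t/dh_0||_F = {np.linalg.norm(J):.3e}")
```

Gradients propagated back through many steps are attenuated in the same way, which is why long-range dependencies are hard to learn reliably with the vanilla architecture.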

References I
J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

