
Mixture Density Networks with TensorFlow

NOVEMBER 25, 2015


A short while ago Google open-sourced TensorFlow, a library designed to allow
easy computations on graphs. The main applications are targeted at deep
learning, since neural networks are represented as graphs. I've spent a few days
reading their APIs and tutorials and I like what I see. Although other libraries offer
similar features, such as GPU computations and symbolic differentiation, the
cleanliness of the API and the familiarity of the IPython stack make TensorFlow
appealing to use.

In this post I try to use TensorFlow to implement the classic Mixture Density
Networks (Bishop, 1994) model. I have implemented MDNs before in an earlier blog
post. This is my first attempt at learning to use TensorFlow, and there are probably
much better ways to do many things, so let me know in the comment section!

Simple Data Fitting with TensorFlow


To get started, let's try to quickly build a neural network to fit some fake data. As
neural nets of even one hidden layer can be universal function approximators, we
can see if we can train a simple neural network to fit noisy sinusoidal data
generated like this, where $\epsilon$ is just standard gaussian random noise:

$$y = 7.0 \sin(0.75 x) + 0.5 x + \epsilon$$

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import math

After importing the libraries, we generate the sinusoidal data we will train a neural
net to fit later:

NSAMPLE = 1000
x_data = np.float32(np.random.uniform(-10.5, 10.5, (1, NSAMPLE))).T
r_data = np.float32(np.random.normal(size=(NSAMPLE, 1)))
y_data = np.float32(np.sin(0.75*x_data)*7.0 + x_data*0.5 + r_data*1.0)

plt.figure(figsize=(8, 8))
plot_out = plt.plot(x_data, y_data, 'ro', alpha=0.3)
plt.show()

TensorFlow uses placeholders as variables that will eventually hold data, in order
to do symbolic computations on the graph later on.

x = tf.placeholder(dtype=tf.float32, shape=[None, 1])
y = tf.placeholder(dtype=tf.float32, shape=[None, 1])

We will define a simple neural network with one hidden layer of 20 nodes:

NHIDDEN = 20
W = tf.Variable(tf.random_normal([1, NHIDDEN], stddev=1.0, dtype=tf.float32))
b = tf.Variable(tf.random_normal([1, NHIDDEN], stddev=1.0, dtype=tf.float32))

W_out = tf.Variable(tf.random_normal([NHIDDEN, 1], stddev=1.0, dtype=tf.float32))
b_out = tf.Variable(tf.random_normal([1, 1], stddev=1.0, dtype=tf.float32))

hidden_layer = tf.nn.tanh(tf.matmul(x, W) + b)
y_out = tf.matmul(hidden_layer, W_out) + b_out

We can define the loss function as the sum of squared errors between the output
and the data (we could also add regularisation if we wanted; a small sketch follows
below).

lossfunc = tf.nn.l2_loss(y_out - y)
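
If we did want regularisation, one simple option would be to add a small
weight-decay penalty on the weight matrices to the loss. A minimal sketch,
assuming an arbitrary illustrative coefficient of 0.001 (not tuned):

REG = 0.001  # illustrative weight-decay coefficient, an assumption for this sketch
lossfunc_reg = lossfunc + REG*(tf.nn.l2_loss(W) + tf.nn.l2_loss(W_out))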

We will also define a training operator that will tell TensorFlow to minimise the loss
function later. With just one line, we can use the fancy RMSProp gradient descent
optimisation method.

train_op = tf.train.RMSPropOptimizer(learning_rate=0.1, decay=0.8).minimize(lossfunc)

To start using TensorFlow to compute things, we have to define a session object. In
an IPython shell, we use InteractiveSession. Afterwards, we need to run a
command to initialise all variables, at which point the computation graph will also
be built inside TensorFlow.

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())

We will run gradient descent for 1000 iterations to minimise the loss function, with
the data fed in via a dictionary. As the dataset is not large, we won't use
mini-batches (though a sketch of how that could look follows the loop). After the
code below is run, the weight and bias parameters will be automatically stored in
their respective tf.Variable objects.

NEPOCH = 1000
for i in range(NEPOCH):
  sess.run(train_op, feed_dict={x: x_data, y: y_data})
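
If the dataset were much larger, we might sample a random mini-batch at each
step instead. A minimal sketch, where BATCHSIZE is an arbitrary illustrative
choice, not from the original experiment:

BATCHSIZE = 100  # illustrative mini-batch size, an assumption for this sketch
for i in range(NEPOCH):
  batch = np.random.choice(NSAMPLE, BATCHSIZE, replace=False)
  sess.run(train_op, feed_dict={x: x_data[batch], y: y_data[batch]})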

What I like about the API is that we can simply use sess.run() again to generate
data from any operator or node within the network. So after the training is finished,
we can just use the trained model, and another call to sess.run(), to generate the
predictions, and plot the predicted data against the training data set.

We should also close() the session afterwards to free resources when we are done
with this exercise.

x_test = np.float32(np.arange(-10.5, 10.5, 0.1))
x_test = x_test.reshape(x_test.size, 1)
y_test = sess.run(y_out, feed_dict={x: x_test})

plt.figure(figsize=(8, 8))
plt.plot(x_data, y_data, 'ro', x_test, y_test, 'bo', alpha=0.3)
plt.show()

sess.close()

We see that the neural network can fit this sinusoidal data quite well, as expected.
However, this type of fitting method only works well when the function we want to
approximate with the neural net is a one-to-one, or many-to-one function. Take,
for example, what happens if we invert the training data:

temp_data = x_data
x_data = y_data
y_data = temp_data

plt.figure(figsize=(8, 8))
plot_out = plt.plot(x_data, y_data, 'ro', alpha=0.3)
plt.show()

If we were to use the same method to fit this inverted data, it obviously wouldn't
work well: we would expect to see a neural network trained to fit only something
like the conditional mean of the data at each point.

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())

for i in range(NEPOCH):
  sess.run(train_op, feed_dict={x: x_data, y: y_data})

x_test = np.float32(np.arange(-10.5, 10.5, 0.1))
x_test = x_test.reshape(x_test.size, 1)
y_test = sess.run(y_out, feed_dict={x: x_test})

plt.figure(figsize=(8, 8))
plt.plot(x_data, y_data, 'ro', x_test, y_test, 'bo', alpha=0.3)
plt.show()

sess.close()

Our current model only predicts one output value for each input, so this approach
will fail miserably. What we want is a model that has the capacity to predict a range
of different output values for each input. In the next section we implement a
Mixture Density Network (MDN) to achieve this task.

Mixture Density Networks


Mixture Density Networks (MDNs), developed by Christopher Bishop in the 90s,
attempt to address this problem. Rather than having the network predict a single
output value, the network predicts an entire probability distribution for the output.
This concept is quite powerful, and can be employed in many current areas of
machine learning research. It also allows us to calculate some sort of confidence
factor in the predictions that the network is making.

The inverse sinusoidal data we chose is not merely a toy problem, as there are
applications in the field of robotics, for example, where we want to determine
which angle to move a robot arm by in order to reach a target location. MDNs can
also be used to model handwriting, where the next stroke is drawn from a
probability distribution of multiple possibilities, rather than sticking to one
prediction.

Bishop's implementation of MDNs predicts a class of probability distributions
called mixture Gaussian distributions, where the output value is modelled as a sum
of many gaussian random values, each with different means and standard
deviations. So for each input $x$, we will predict a probability distribution function
(pdf) $P(Y = y \mid X = x)$ that is a probability-weighted sum of smaller gaussian
probability distributions:

$$P(Y = y \mid X = x) = \sum_{k=0}^{K-1} \Pi_{k}(x)\,\phi\big(y, \mu_{k}(x), \sigma_{k}(x)\big)$$

Each of the parameters $\Pi_{k}(x)$, $\mu_{k}(x)$, $\sigma_{k}(x)$ of the
distribution will be determined by the neural network, as a function of the input
$x$. There is a restriction that the $\Pi_{k}(x)$ values add up to one, to ensure that
the pdf integrates to 100%. In addition, each $\sigma_{k}(x)$ must be strictly
positive.

In our implementation, we will use a neural network with one hidden layer of 24
nodes, and we will also generate 24 mixtures, hence there will be 72 actual outputs
of our neural network for a single input. Our definition will be split into two parts:

$$z = W_{o} \tanh(W_{h} x + b_{h}) + b_{o}$$

$z$ is a vector of 72 values that will then be split up into three equal parts,
$z_{\Pi}$, $z_{\sigma}$, and $z_{\mu}$, each of length 24.

The parameters of the pdf will be defined as below to satisfy the earlier conditions:

$$\Pi_{k} = \frac{\exp(z_{\Pi,k})}{\sum_{i=0}^{K-1} \exp(z_{\Pi,i})}, \qquad \sigma_{k} = \exp(z_{\sigma,k}), \qquad \mu_{k} = z_{\mu,k}$$

The $z_{\Pi}$ values are essentially put into a softmax operator to ensure that the
sum adds up to one, and that each mixture probability is positive. Each
$\sigma_{k}$ will also be positive, due to the exponential operator. It gets deeper
than this though! In Bishop's paper, he notes that the softmax and exponential
terms have some theoretical interpretations from a Bayesian framework of looking
at probability. Note that in the actual softmax code, both numerator and
denominator are divided by the exponential of the largest value, to avoid the exp
operator easily blowing up. In TensorFlow, the network is described below:

NHIDDEN = 24
STDEV = 0.5
KMIX = 24  # number of mixtures
NOUT = KMIX * 3  # pi, mu, stdev

x = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="x")
y = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="y")

Wh = tf.Variable(tf.random_normal([1, NHIDDEN], stddev=STDEV, dtype=tf.float32))
bh = tf.Variable(tf.random_normal([1, NHIDDEN], stddev=STDEV, dtype=tf.float32))

Wo = tf.Variable(tf.random_normal([NHIDDEN, NOUT], stddev=STDEV, dtype=tf.float32))
bo = tf.Variable(tf.random_normal([1, NOUT], stddev=STDEV, dtype=tf.float32))

hidden_layer = tf.nn.tanh(tf.matmul(x, Wh) + bh)
output = tf.matmul(hidden_layer, Wo) + bo

def get_mixture_coef(output):
  # split the 72 network outputs into the three parameter groups
  out_pi, out_sigma, out_mu = tf.split(1, 3, output)

  # softmax for pi, subtracting the max first for numerical stability
  max_pi = tf.reduce_max(out_pi, 1, keep_dims=True)
  out_pi = tf.sub(out_pi, max_pi)
  out_pi = tf.exp(out_pi)
  normalize_pi = tf.inv(tf.reduce_sum(out_pi, 1, keep_dims=True))
  out_pi = tf.mul(normalize_pi, out_pi)

  # exponentiate to keep each sigma strictly positive
  out_sigma = tf.exp(out_sigma)

  return out_pi, out_sigma, out_mu

out_pi, out_sigma, out_mu = get_mixture_coef(output)

Let's define the inverted data we want to train our MDN to predict later. As this is a
more involved prediction task, I used a higher number of samples compared to the
simple data fitting task earlier.

NSAMPLE = 2500

y_data = np.float32(np.random.uniform(-10.5, 10.5, (1, NSAMPLE))).T
r_data = np.float32(np.random.normal(size=(NSAMPLE, 1)))  # random noise
x_data = np.float32(np.sin(0.75*y_data)*7.0 + y_data*0.5 + r_data*1.0)

plt.figure(figsize=(8, 8))
plt.plot(x_data, y_data, 'ro', alpha=0.3)
plt.show()

We cannot simply use the mean squared error L2 loss function for this task,
because the output is an entire description of the probability distribution. A more
suitable loss function is to minimise the negative logarithm of the likelihood of the
distribution against the training data:

$$CostFunction(y \mid x) = -\log\left( \sum_{k=0}^{K-1} \Pi_{k}(x)\,\phi\big(y, \mu_{k}(x), \sigma_{k}(x)\big) \right)$$

So for every $(x, y)$ point in the training data set, we can compute a cost function
based on the predicted distribution versus the actual point, and then attempt to
minimise the sum of all the costs combined. To those who are familiar with logistic
regression and cross-entropy minimisation with softmax, this is a similar approach,
but with non-discretised states.

We implement this cost function in TensorFlow below as an operation:

oneDivSqrtTwoPI = 1 / math.sqrt(2*math.pi)  # normalisation factor for gaussian; a constant, so not strictly needed for optimisation
def tf_normal(y, mu, sigma):
  result = tf.sub(y, mu)
  result = tf.mul(result, tf.inv(sigma))
  result = -tf.square(result)/2
  return tf.mul(tf.exp(result), tf.inv(sigma)) * oneDivSqrtTwoPI

def get_lossfunc(out_pi, out_sigma, out_mu, y):
  result = tf_normal(y, out_mu, out_sigma)
  result = tf.mul(result, out_pi)
  result = tf.reduce_sum(result, 1, keep_dims=True)
  result = -tf.log(result)
  return tf.reduce_mean(result)

lossfunc = get_lossfunc(out_pi, out_sigma, out_mu, y)
train_op = tf.train.AdamOptimizer().minimize(lossfunc)

We will train the model below, while collecting the progress of the loss function
over the training iterations. TensorFlow offers some useful tools to visualise
training progress, but we didn't use them here.

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())

NEPOCH = 10000
loss = np.zeros(NEPOCH)  # store the training progress here
for i in range(NEPOCH):
  sess.run(train_op, feed_dict={x: x_data, y: y_data})
  loss[i] = sess.run(lossfunc, feed_dict={x: x_data, y: y_data})
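
As a side note, the loop above evaluates the graph twice per iteration. The two
calls could likely be combined into a single sess.run() that fetches both the
training op and the loss at once; a small variant:

for i in range(NEPOCH):
  _, loss[i] = sess.run([train_op, lossfunc], feed_dict={x: x_data, y: y_data})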

As a side note, the impressive thing is that TensorFlow automatically calculated
the gradients of the log likelihood cost function and applied those gradients in the
optimisation. For this problem, there are actually very optimised gradient formulas
available (see the derivations in Bishop's original paper, equations 33-39), and I
highly doubt the gradient formula TensorFlow calculated automatically will be as
optimised and elegant, so there would be room for performance improvements by
building a custom operator into TensorFlow with the pre-optimised gradient
formulas for this loss function, as they have done for the cross-entropy loss
function. I had implemented the optimised closed-form gradient formulas before,
with all the numerical gradient testing that came with it. If you want to implement
them, please make sure you do the gradient testing! It is difficult to get right the
first time.
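
To illustrate what such a gradient test looks like, here is a minimal sketch in plain
numpy (not part of the model; the function names and test values are made up for
this example). It checks the closed-form derivative of a single gaussian's negative
log likelihood against a centred finite difference:

def nll(y, mu, sigma):
  # negative log likelihood of y under a gaussian N(mu, sigma^2)
  return 0.5*((y - mu)/sigma)**2 + np.log(sigma) + 0.5*np.log(2*math.pi)

def dnll_dmu(y, mu, sigma):
  # closed-form derivative of the nll with respect to mu
  return (mu - y)/sigma**2

y0, mu0, sigma0, eps = 1.3, 0.4, 0.9, 1e-5
numeric = (nll(y0, mu0 + eps, sigma0) - nll(y0, mu0 - eps, sigma0))/(2*eps)
print abs(numeric - dnll_dmu(y0, mu0, sigma0))  # should be vanishingly small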

Let's plot the progress of the training, and see if the cost decreases over iterations:

plt.figure(figsize=(8, 8))
plt.plot(np.arange(100, NEPOCH, 1), loss[100:], 'r')
plt.show()

We found that it more or less stopped improving after 6000 iterations or so. The
next thing we want to do is to have the model generate distributions for us, say,
across a bunch of points along the x-axis, and then, for each distribution, draw 10
points randomly from it, to produce ensembles of generated data on the y-axis.
This gives us a feel for whether the generated pdf matches the training data.

To sample from a mixed gaussian distribution, we randomly select one of the
distributions based on the set of $\Pi_{k}$ probabilities, and then proceed to draw
the point from the selected gaussian distribution.

x_test = np.float32(np.arange(-15, 15, 0.1))
NTEST = x_test.size
x_test = x_test.reshape(NTEST, 1)  # needs to be a matrix, not a vector

def get_pi_idx(x, pdf):
  # inverse-CDF sampling: return the index of the mixture component
  # in which the uniform draw x falls
  N = pdf.size
  accumulate = 0
  for i in range(0, N):
    accumulate += pdf[i]
    if (accumulate >= x):
      return i
  print 'error with sampling ensemble'
  return -1

def generate_ensemble(out_pi, out_mu, out_sigma, M=10):
  NTEST = x_test.size
  result = np.random.rand(NTEST, M)  # initially random [0, 1]
  rn = np.random.randn(NTEST, M)  # normal random matrix (0.0, 1.0)
  mu = 0
  std = 0
  idx = 0

  # transforms result into random ensembles
  for j in range(0, M):
    for i in range(0, NTEST):
      idx = get_pi_idx(result[i, j], out_pi[i])
      mu = out_mu[i, idx]
      std = out_sigma[i, idx]
      result[i, j] = mu + rn[i, j]*std
  return result
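
As a side note, the double loop above is easy to follow but slow in pure Python. If
speed mattered, the component selection could be vectorised with numpy instead;
a sketch, assuming the same array shapes as above (generate_ensemble_vec is a
hypothetical helper, not from the original notebook):

def generate_ensemble_vec(out_pi, out_mu, out_sigma, M=10):
  # choose a mixture component for every (test point, sample) pair at once
  # by comparing uniform draws against the cumulative mixture weights
  NTEST = out_pi.shape[0]
  u = np.random.rand(NTEST, M)
  cdf = np.cumsum(out_pi, axis=1)  # shape (NTEST, KMIX)
  idx = (u[:, :, None] > cdf[:, None, :]).sum(axis=2)
  idx = np.minimum(idx, out_pi.shape[1] - 1)  # guard against rounding error
  rows = np.arange(NTEST)[:, None]
  # then draw from the selected gaussian for each pair
  return out_mu[rows, idx] + np.random.randn(NTEST, M)*out_sigma[rows, idx]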

Let's see how the generated data looks:

out_pi_test, out_sigma_test, out_mu_test = sess.run(get_mixture_coef(output), feed_dict={x: x_test})

y_test = generate_ensemble(out_pi_test, out_mu_test, out_sigma_test)

plt.figure(figsize=(8, 8))
plt.plot(x_data, y_data, 'ro', x_test, y_test, 'bo', alpha=0.3)
plt.show()

In the above graph, we plot the generated data drawn from the MDN distribution
in blue, over the original training data in red. Apart from a few outliers, the
distributions seem to match the data. We can also plot a graph of $\mu(x)$ as well,
and see what the neural network is sort of doing: for every point on the x-axis,
there are multiple lines or states where $y$ may be, and we select among these
states with the probabilities defined by $\Pi(x)$.

plt.figure(figsize=(8, 8))
plt.plot(x_test, out_mu_test, 'go', x_test, y_test, 'bo', alpha=0.3)
plt.show()

Lastly, if we want, we can try to draw the entire mixture pdf at each point on the
x-axis to get a heatmap.

x_heatmap_label = np.float32(np.arange(-15, 15, 0.1))
y_heatmap_label = np.float32(np.arange(-15, 15, 0.1))

def custom_gaussian(x, mu, std):
  x_norm = (x - mu) / std
  result = oneDivSqrtTwoPI * math.exp(-x_norm*x_norm/2) / std
  return result

def generate_heatmap(out_pi, out_mu, out_sigma, x_heatmap_label, y_heatmap_label):
  N = x_heatmap_label.size
  M = y_heatmap_label.size
  K = KMIX

  z = np.zeros((N, M))  # the heatmap values, accumulated below

  mu = 0
  std = 0
  pi = 0

  # accumulate the weighted pdf of each mixture component over the grid
  for k in range(0, K):
    for i in range(0, M):
      pi = out_pi[i, k]
      mu = out_mu[i, k]
      std = out_sigma[i, k]
      for j in range(0, N):
        z[N-j-1, i] += pi * custom_gaussian(y_heatmap_label[j], mu, std)

  return z

def draw_heatmap(xedges, yedges, heatmap):
  extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
  plt.figure(figsize=(8, 8))
  plt.imshow(heatmap, extent=extent)
  plt.show()

z = generate_heatmap(out_pi_test, out_mu_test, out_sigma_test, x_heatmap_label, y_heatmap_label)
draw_heatmap(x_heatmap_label, y_heatmap_label, z)

sess.close()

Some Closing Thoughts

I think MDNs are a great way to model data, especially if what we are modelling
has either multiple states, or is inherently a random variable that cannot be
predicted with absolute certainty. I believe even simple single-layer MDNs like
these can be put to good use, as they are so simple to build!

TensorFlow has a solid and clean API which will make implementing and testing
machine learning algorithms more efficient, especially combined with the IPython
stack.

There are many issues related to performance which need to be ironed out. Also,
although the library can in theory be used for distributed computing, so far only
the non-distributed version has been open sourced.

Symbolic differentiation is great, but at some point it would be nice to be able to
manually implement efficient solved derivative formulas in a simpler way than
adding an operation and needing to recompile the entire library. That being said,
TensorFlow's graph construction seems much faster than Theano's compiling
process.

The full IPython notebook for this post can be found here.

This blog post was exported from an IPython notebook. Here is the link to the original
notebook, for better viewing on a non-mobile device.
