
In this video, I'm going to talk about some recent work on learning a joint model of captions and feature vectors that describe images.
In the previous lecture, I talked about
how we might extract semantically
meaningful features from images.
But we were doing that with no help from
the captions.
Obviously the words in a caption ought to
be helpful in extracting appropriate
semantic categories from images.
And similarly, the images ought to be
helpful in disambiguating what the words
in the caption mean.
So the idea is we're going to train a great big net that gets as its input standard computer vision feature vectors extracted from images and bag-of-words representations of captions, and learns how the two input representations are related to each other.
At the end of the video, I'll show you a movie of the final network using words to create feature vectors for images and then showing you the closest image in its database, and also using images to create bags of words.
I'm now going to describe some work by Nitish Srivastava, who's one of the TAs for this course, and Ruslan Salakhutdinov, that will appear shortly.
The goal is to build a joint density model of captions and images, except that the images are represented by the features standardly used in computer vision rather than by the raw pixels. This needs a lot more computation than building a joint density model of labels and digit images, which we saw earlier in the course.
So what they did was they first trained a
multi-layer model of images alone.
That is, it's really a multi-layer model of the standard computer vision features extracted from the images.
Then, separately, they trained a multi-layer model of the word-count vectors from the captions. Once they'd trained both of those models, they added a new top layer connected to the top layers of both of the individual models. After that, they used further joint training of the whole system so that each modality could improve the earlier layers of the other modality.
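To make that layout concrete, here's a rough Python sketch of the kind of two-pathway architecture being described. All layer sizes, names, and the deterministic up-pass are made-up placeholders for illustration, not values or code from the paper, and biases are omitted.

```python
# Hypothetical sketch of the two-pathway layout just described: a stack of
# layers for the image features, a stack for the caption word counts, and a
# joint top layer connected to the top of both stacks.
import numpy as np

rng = np.random.default_rng(0)

def weights(n_in, n_out):
    """Small random weight matrix between two layers (placeholder init)."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Image pathway: standard computer-vision feature vector -> two hidden layers.
W_img = [weights(4096, 1024), weights(1024, 1024)]

# Caption pathway: word-count vector over a 2000-word vocabulary -> two hidden layers.
W_txt = [weights(2000, 1024), weights(1024, 1024)]

# Joint top layer, connected to the top hidden layer of *both* pathways.
# Joint training through this layer is what lets each modality influence
# the earlier layers of the other modality.
W_joint_img = weights(1024, 2048)
W_joint_txt = weights(1024, 2048)

def up_pass(v_img, v_txt):
    """One deterministic bottom-up pass to the joint layer (illustration only)."""
    h_img = sigmoid(sigmoid(v_img @ W_img[0]) @ W_img[1])
    h_txt = sigmoid(sigmoid(v_txt @ W_txt[0]) @ W_txt[1])
    return sigmoid(h_img @ W_joint_img + h_txt @ W_joint_txt)
```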
Instead of using a deep belief net, which is what you might expect, they used a deep Boltzmann machine, with symmetric connections between all pairs of adjacent layers.
The further joint training of the whole
deep Boltzmann machine is then what
allows each modality to change the
feature detectors in the early layers of
the other modality.
That's the reason they used a deep
Boltzmann machine.
They could've also used a deep belief
net, and done generative fine tuning with
contrastive wake sleep.
But the fine tuning algorithm for deep
Boltzmann machines may well work better.
This leaves the question of how they pretrained the hidden layers of a deep Boltzmann machine, because what we've seen so far in the course is that if you train a stack of restricted Boltzmann machines and combine them together into a single composite model, what you get is a deep belief net, not a deep Boltzmann machine.
So I'm now going to explain how, despite what I said earlier in the course, you can actually pre-train a stack of restricted Boltzmann machines in such a way that you can then combine them to make a deep Boltzmann machine.
The trick is that the top and the bottom restricted Boltzmann machines in the stack have to be trained with weights that are twice as big in one direction as in the other.
So the bottom Boltzmann machine, the one that looks at the visible units, is trained with the bottom-up weights being twice as big as the top-down weights. Apart from that factor of two, the weights are symmetrical. So this is what I call scale-symmetric: the bottom-up weights are always twice as big as their top-down counterparts.
This can be justified, and I'll show you
the justification in a little while.
The next restricted Boltzmann machine in the stack is trained with symmetrical weights. I've called them 2W2 here, rather than W2, for reasons you'll see later.
We can keep training restricted Boltzmann machines like that, with genuinely symmetrical weights. But then the top one in the stack has to be trained with the bottom-up weights being half of the top-down weights.
So again, these are scale-symmetric weights, but now the top-down weights are twice as big as the bottom-up weights. That's the opposite of what we had when we trained the first restricted Boltzmann machine in the stack.
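To make that recipe concrete, here's a rough sketch of one contrastive divergence step for the bottom, scale-symmetric restricted Boltzmann machine, where the up pass goes through 2*W and the down pass through W. This is just my illustrative reading of "bottom-up weights twice as big as top-down", with made-up names, shapes, and learning rate; it is not code from the paper.

```python
# Illustrative CD-1 step for the bottom, scale-symmetric RBM: hidden units are
# driven bottom-up through 2*W, reconstructions come top-down through W, and
# there is still only one shared matrix W being learned.  Names, sizes and the
# learning rate are hypothetical; biases are omitted.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_scale_symmetric(v_data, W, lr=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)

    # Up pass with doubled weights: p(h = 1 | v) uses 2*W.
    p_h = sigmoid(v_data @ (2 * W))
    h = (rng.random(p_h.shape) < p_h).astype(float)

    # Down pass (reconstruction) with the un-doubled weights W.
    p_v = sigmoid(h @ W.T)

    # Up pass again from the reconstruction, through 2*W.
    p_h_recon = sigmoid(p_v @ (2 * W))

    # Usual CD-1 update applied to the single shared matrix W.
    return W + lr * (np.outer(v_data, p_h) - np.outer(p_v, p_h_recon))
```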
After having trained these three restricted Boltzmann machines, we can then combine them to make a composite model, and the composite model looks like this. For the restricted Boltzmann machine in the middle, we simply halved its weights. That's why they were 2W2 to begin with.
For the one at the bottom, we've halved
the up-going weights but kept the
down-going weights the same.
And for the one at the top we've halved
the down-going weights and kept the
up-going weights the same.
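In code form, that combination step might look something like this. W1, two_W2, and W3 are hypothetical names for the three trained weight matrices, with made-up shapes; this is only a sketch of the bookkeeping described above.

```python
# Sketch of combining the three pretrained RBMs into one deep Boltzmann
# machine.  W1 is the top-down matrix of the bottom RBM (its bottom-up weights
# were 2*W1), two_W2 is the symmetric matrix of the middle RBM, and W3 is the
# bottom-up matrix of the top RBM (its top-down weights were 2*W3).
import numpy as np

rng = np.random.default_rng(0)
W1     = rng.normal(0.0, 0.01, size=(784, 500))    # V  <-> H1
two_W2 = rng.normal(0.0, 0.01, size=(500, 500))    # H1 <-> H2
W3     = rng.normal(0.0, 0.01, size=(500, 1000))   # H2 <-> H3

# Each pair of adjacent layers in the DBM gets ONE symmetric weight matrix:
#  - bottom: halve the up-going weights (2*W1 -> W1) and keep W1 going down,
#  - middle: halve both directions (2*W2 -> W2),
#  - top:    halve the down-going weights (2*W3 -> W3) and keep W3 going up.
dbm_weights = {
    "V-H1":  W1,
    "H1-H2": two_W2 / 2,
    "H2-H3": W3,
}
```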
Now the question is: why do we do this funny business of halving the weights?
The explanation is quite complicated but
I'll give you a rough idea of what's
going on.
If you look at the layer H1, we have two different ways of inferring the states of the units in H1 in the stack of restricted Boltzmann machines on the left. We can either infer the states of H1 bottom-up from V, or we can infer the states of H1 top-down from H2.
When we combine these Boltzmann machines together, what we're going to do is take an average of those two ways of inferring H1. And to take a geometric average, what we need to do is halve the weights.
So we're going to use half of what the bottom-up model says. So that's half of 2W1. And we're going to use half of what the top-down model says. That's half of 2W2.
And if you look at the deep Boltzmann
machine on the right, that's exactly
what's being used to infer the state of
H1.
In other words, if you're given the
states in H2, and you're given the states
in V, those are the weights you'll use
for inferring the states of H1.
The reason we need to halve the weights
is so that we don't double count.
You see, in the Boltzmann machine on the right, the state of H2 already depends on V. At least it does after we've done some settling down in the Boltzmann machine.
So if we were to use the bottom-up input coming from the first restricted Boltzmann machine in the stack, and we also used the top-down input coming from the second Boltzmann machine in the stack, we'd be counting the evidence twice, because we'd be inferring H1 from V, and we'd also be inferring it from H2, which itself depends on V.
In order not to double count the
evidence, we have to halve the weights.
That's a very high-level, and perhaps not totally clear, description of why we have to halve the weights.
If you want to know the mathematical
details, you can go and read the paper.
But that's what's going on.
And that's why we need to halve the weights: so that the intermediate layers can be doing geometric averaging of the two different models of that layer, coming from the two different restricted Boltzmann machines in the original stack.
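As a small numerical illustration of that averaging argument (with made-up sizes and random weights), here's how H1 would be inferred in the combined deep Boltzmann machine: the total input to each H1 unit is half of the bottom-up term that used 2*W1 plus half of the top-down term that used 2*W2.

```python
# Hypothetical illustration of the halving argument: in the combined DBM, a
# unit in H1 sums half of the bottom-up input (through 2*W1/2 = W1) and half
# of the top-down input (through 2*W2/2 = W2), so the evidence from V is not
# counted twice.  Sizes and weights are made up; biases omitted.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng    = np.random.default_rng(0)
v      = rng.integers(0, 2, size=784).astype(float)   # visible vector
h2     = rng.integers(0, 2, size=500).astype(float)   # current state of H2
two_W1 = rng.normal(0.0, 0.01, size=(784, 500))       # bottom-up weights, bottom RBM
two_W2 = rng.normal(0.0, 0.01, size=(500, 500))       # symmetric weights, middle RBM

# The two separate ways of inferring H1 in the original stack of RBMs:
p_h1_bottom_up = sigmoid(v @ two_W1)      # from V, through 2*W1
p_h1_top_down  = sigmoid(two_W2 @ h2)     # from H2, through 2*W2

# In the combined DBM the weights are halved, so the total input to H1 is the
# average of the two total inputs above -- which amounts to a (renormalized)
# geometric average of the two distributions over H1.
p_h1_dbm = sigmoid(0.5 * (v @ two_W1) + 0.5 * (two_W2 @ h2))
```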
