The Next Generation of Neural Networks

Uploaded by GoogleTechTalks on 04.12.2007

>> It's always fun introducing people who need no introduction. But for those of you
who don't know Geoff and his work, he pretty much created--he helped create the field of
machine learning as it now exists and was on the cutting edge back when it was the bleeding
edge of statistical machine learning and neural nets when they first made their resurgence
for the first time in our lifetime, and has been a constant force pushing it--pushing
the analysis in the field away from just sort of the touchy-feely, let's tweak something
until it thinks and towards getting--building systems that we can understand and that actually
do useful things that make our lives better. So you--if you read the talk announcement,
you've seen all of his many accomplishments and members of various royal societies, etcetera,
so I won't list those. I think instead of taking up more of his time, I'm just going
to hand the microphone over to Geoff. >> HINTON: Thank you. I've got--I got it.
So the main aim of neural network research is to make computers recognize patterns better
by emulating the way the brain does it. We know the brain learns to extract many layers
of features from the sensory data. We don't know how it does it. So it's a sort of joint
enterprise of science and engineering. The first generation of neural networks--I can
give you a two minute history of neural networks. The first generation was things like Perceptrons, where you had hand-coded features that didn't adapt. So you might put the pixels of an image here, have some hand-coded features, and you'd learn the weights to decision units--and if you wanted funding, you'd make decision units like that. These were fundamentally limited in what they could do, as Minsky and Papert pointed out in 1969, and so people stopped doing them.
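As a sketch of that first-generation setup (a toy in numpy; the feature map and the AND task are my own illustrative choices, not anything from the talk): fixed, hand-coded features feed a single decision unit, and only the weights to the decision unit learn.

```python
import numpy as np

# First-generation setup: hand-coded (fixed) features feed one
# decision unit whose weights are learned with the perceptron rule.
def hand_coded_features(x):
    # A fixed, non-adaptive feature map (an arbitrary illustrative choice).
    return np.array([x[0], x[1], x[0] * x[1], 1.0])

def train_perceptron(inputs, targets, epochs=20):
    w = np.zeros(4)
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            phi = hand_coded_features(x)
            y = 1 if phi @ w > 0 else 0
            w += (t - y) * phi          # classic perceptron update
    return w

# AND is linearly separable in these fixed features, so this converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 0, 0, 1]
w = train_perceptron(X, T)
preds = [1 if hand_coded_features(x) @ w > 0 else 0 for x in X]
```

The hand-coded features do all the representational work here; nothing about them adapts, which is exactly the limitation being described.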
Then sometime later, people figured out how to change the weights of the feature detectors
as well as the weights of the decision units. So what you would do is take an image here, you'd go forwards through a feed-forward neural network, you'd compare the answer the network gave with the correct answer, you'd take some measure of that discrepancy and send it backwards through the net, and as you go backwards through the net, you compute the derivatives, for all of the connection strengths here--both those ones and those ones and those ones--of the discrepancy between the correct answer and what you got, and you change all these weights to get closer to the correct answer. That's backpropagation, and it's just the chain rule. It works for non-linear units, so potentially these can learn very powerful things. And it was a huge disappointment. I can say that now because I've got something better.
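A minimal numpy sketch of that procedure (the toy sizes and the XOR task are my own choices): forward pass, compare with the correct answer, send derivatives of the discrepancy backwards with the chain rule, and update the feature-detector weights as well as the decision unit's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: the classic task a Perceptron with fixed features cannot learn,
# but a net with adaptive hidden features can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(0, 1.0, (2, 4)), np.zeros(4)   # adaptive feature detectors
W2, b2 = rng.normal(0, 1.0, (4, 1)), np.zeros(1)   # decision unit

def loss():
    h = sigmoid(X @ W1 + b1)
    return float(np.mean((sigmoid(h @ W2 + b2) - T) ** 2))

before = loss()
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)            # forward pass
    y = sigmoid(h @ W2 + b2)
    dy = (y - T) * y * (1 - y)          # derivative of the discrepancy at the output
    dh = (dy @ W2.T) * h * (1 - h)      # chain rule: one layer further back
    W2 -= 0.5 * h.T @ dy
    b2 -= 0.5 * dy.sum(0)
    W1 -= 0.5 * X.T @ dh                # the feature detectors adapt too
    b1 -= 0.5 * dh.sum(0)
after = loss()
```

The error should drop well below its starting value as both layers of weights adapt.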
Basically, we thought when we got this that we could learn anything, and we'd get lots and lots of features, object recognition, speech recognition, it'll be easy. There were some problems. It worked for some things, and [INDISTINCT] can make it work for more or less anything. But in the hands of other people it had its limitations, and something else
came along so there was a temporary digression called kernel methods where what you do is
you do Perceptrons in a cleverer way. You take each training example and you turn the training example into a feature. Basically, the feature is: how similar are you to this training example? And then you have a clever optimization algorithm that decides to throw away some of those features and also decides how to weight the ones it keeps. But when you're finished, you've just got these fixed features, produced according to a fixed recipe that didn't learn, and some weights on those features to make your decision. So it's just a Perceptron.
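A toy sketch of that view of kernel methods (a kernel perceptron with an RBF similarity; this is illustrative, not any particular SVM solver): each training example becomes a fixed similarity feature, and only the weights on those features are learned.

```python
import numpy as np

# Each training example becomes a fixed feature: "how similar are you
# to this example?"  Only the weights on those features are learned.
def similarity(x, example, width=1.0):
    # RBF similarity to one stored training example.
    return np.exp(-np.sum((x - example) ** 2) / (2 * width ** 2))

def features(x, train_X):
    return np.array([similarity(x, e) for e in train_X])

train_X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
train_T = np.array([0, 1, 1, 0])     # XOR: fine once the examples themselves are features

alpha = np.zeros(len(train_X))       # one learned weight per training example
for _ in range(50):
    for i, (x, t) in enumerate(zip(train_X, train_T)):
        y = 1 if features(x, train_X) @ alpha > 0 else 0
        alpha[i] += t - y            # adjust the weight on this example's feature

preds = [1 if features(x, train_X) @ alpha > 0 else 0 for x in train_X]
```

The features themselves never change; all the cleverness is in weighting and discarding them, which is the sense in which it's still just a Perceptron.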
There's a lot of clever math to how you optimize it, but it's just a Perceptron. And what happened
was people forgot all of Minsky and Papert's criticisms about Perceptrons not being able to do much. Also, it worked better than backpropagation on quite a few things, which was deeply embarrassing, but that says a lot more about how bad backpropagation was than about how good support vector machines are. So if you ask what's wrong with backpropagation: it requires labeled data,
and some of you here may know it's easier to get data than labels. If you have a--there's
a model of the brain, you [INDISTINCT] about that many parameters and you [INDISTINCT]
for about that many seconds. Actually, twice as many which is important to some of us.
There's not enough information in labels to constrain that many parameters. You need ten
to the five bits or bytes per second. There's only one place you're going to get that and
that's the sensory input. So the brain must be building a model of the sensory input,
not of these labels. The labels don't have enough information. Also the learning time
didn't scale well. You couldn't learn lots of layers. The whole point of backpropagation
was to learn lots of layers and if you gave it like ten layers to learn, it would just
take forever. And then there's some neural things I won't talk about. So to overcome these limitations, we want to keep the efficiency of a gradient method for
updating the parameters but instead of trying to learn the probability of a label given
an image, where you need the labels, we're just going to try and learn the probability
of an image. That is, we're going to try and build a generative model that if you run it
will produce stuff that looks like the sensory data. Another way to say it: we're going to try and learn to do computer graphics, and once we can do that, computer vision is just going to be inferring how the computer graphics produced this image. So what kind of a model could
the brain be using for that? The building blocks I'm going to use are a bit like neurons.
They're intended to be a bit like neurons. They're these binary stochastic neurons. They get some input, and their output is either a one or a zero, so it's easy to communicate, and it's probabilistic. So this is the probability of giving a one as a function of the total input you get, which is your external input plus what you get from other neurons times the weights on the connections. And we're going to hook those up into a little
module that I call a restricted Boltzmann Machine. This is the module here, it has a
layer of pixels and a layer of feature detectors. So it looks like we're never going to learn lots and lots of layers of feature detectors. It looks like we've thrown out the baby with the bathwater and we're now just restricted to learning one layer of features, but we'll fix that later. We're going to have very restricted connectivity, hence the name,
where this is going to be a bipartite graph. The visible units for now don't connect to
each other and the hidden units don't connect to each other. The advantage of that is if
I tell you the state of the pixels, these become independent and so you can update them
independently and in parallel. So given some pixels and given that you know the weights
on the connections, you can update all these units in parallel, and so you've got your
feature activations very simply, there's no lateral interactions there. These networks
are governed by an energy function and the energy function determines the probability
of the network adopting particular states just like in a physical system. These stochastic
units will kind of rattle around and they'll tend to enter low energy states and avoid
high energy states. The weights determine the energies linearly. The probabilities are an exponential function of the energies, so the log probabilities are a linear function of the weights, and that makes learning easy. There's a very simple algorithm that Terry Sejnowski and I invented back in 1982. In a general network, you can run
it but it's very, very slow. In this restricted Boltzmann Machine, it's much more efficient.
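The pieces so far can be sketched in code (the sizes and variable names are mine, not from the talk): a binary stochastic unit turns its total input into a probability of outputting a one, and because the connectivity is bipartite, all the hidden units can be updated in parallel; the energy is linear in the weights.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A binary stochastic neuron: its probability of outputting a 1 is a
# logistic function of its total input.
def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

# Restricted Boltzmann Machine: bipartite, so given the pixel states
# every feature detector is independent and can be updated in parallel.
n_vis, n_hid = 6, 3
W = rng.normal(0, 0.1, (n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

def hidden_given_visible(v):
    return sigmoid(v @ W + b_hid)    # one parallel update, no lateral interactions

def energy(v, h):
    # The weights determine the energy linearly; probabilities are an
    # exponential function of the energies, so log probabilities are
    # linear in the weights.
    return float(-(v @ b_vis) - (h @ b_hid) - (v @ W @ h))

v = np.array([1., 0., 1., 1., 0., 0.])
p_h = hidden_given_visible(v)
h = sample_bernoulli(p_h)
```

Given the visible states, one matrix multiply and one vector of coin flips updates every hidden unit at once.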
And I'm just going to show you what the Maximum Likelihood Learning Algorithm looks like. That is, suppose you take one of your parameters, a weight on a connection: how do I change that parameter so that when I run this machine in generative mode, in computer graphics mode, it's more likely to generate stuff like the stuff I've observed? And so here's what you
should do, you should take a data vector, an image, and you should put it here on the
visible units and then you should let the visible units via their current weights activate
the feature detectors. So you provide input to each feature detector, and you now make a stochastic decision about whether the feature detector should turn on. Lots of positive input, it almost certainly turns on; lots of negative input, it almost certainly turns off. Then, given the binary state of the feature detectors, we reconstruct the pixels from
the feature detectors, and we just keep going like that. If we run this chain for a long time--this is called a Markov chain, and this process is called alternating Gibbs sampling--we'll get fantasies from the model.
This is the kind of stuff the model would like to produce. These are the things that
the model shows you when it's in its low energy states given its current parameters. So that's
the sort of stuff it believes in, this is the data and obviously you want to say to
it, believe in the data, not your own fantasies. And so we'd like to change the parameters--the weights on the connections--so as to make this more likely and that less likely.
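One way to sketch the learning rule that follows in code (using the one-step reconstruction shortcut--contrastive divergence--that comes up a little later in the talk; the toy data and sizes are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Raise each weight by how often pixel i and feature j are on together
# on the data; lower it by how often they're on together when the model
# is running on its own (here approximated by one reconstruction step).
def cd1_step(W, data, lr=0.1):
    p_h = sigmoid(data @ W)
    h = sample(p_h)
    recon = sigmoid(h @ W.T)            # reconstruct pixels from the features
    pos = data.T @ p_h                  # <v_i h_j> with the data clamped
    neg = recon.T @ sigmoid(recon @ W)  # <v_i h_j> on the reconstructions
    return W + lr * (pos - neg) / len(data)

# Two strongly correlated "pixel" patterns to model.
data = np.array([[1., 1., 0., 0.],
                 [0., 0., 1., 1.]] * 10)
W = rng.normal(0, 0.01, (4, 8))
for _ in range(1000):
    W = cd1_step(W, data)

# After training, reconstructions should beat the untrained baseline,
# which sits near mean squared error 0.25.
recon = sigmoid(sample(sigmoid(data @ W)) @ W.T)
err = float(np.mean((recon - data) ** 2))
```

When the reconstruction statistics match the data statistics, the two terms cancel and learning stops, just as described.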
And the way to do that is to say: measure how often a pixel i and a feature detector j are on together when I'm showing you the data vector v. And then measure how often they're
on together when the model is just fantasizing and raise the weights by how often they're
on together when it's seeing data and lower the weights by how often they're on together
when it's fantasizing. And what that will do is it'll make it happy with the data, low
energy, and less happy with its fantasies. And so it will--its fantasies will gradually
move towards the data. If its fantasies are just like the data, then these correlations,
the probability of pixel i and feature detector j being on together in the fantasies will
be just the same as in the data, and so it'll stop learning. So it's a very simple local
learning rule that a neuron could implement, because it just involves the activities of a neuron and of the other neuron it connects to. And that will do Maximum Likelihood Learning,
but it's slow. You have to settle for like a hundred steps. So I thought about how to make this algorithm go a hundred thousand times faster. The way you do it is instead
of running for a hundred steps, you just run for one step. So now you go up, you come down
and you go up again. And you take this difference in statistics and that's quite efficient to
do. It took me 17 years to figure this out and in that time computers got a thousand
times faster. So, the change in the weight now is the difference--is a learning rate
times the difference between statistics measured with data and statistics measured with reconstructions
of the data. That's not doing Maximum Likelihood Learning but it works well anyway. So I'm
going to show you a little example. We're going to take a little image of handwritten digits--this is just a toy example. We're going to put random weights on the connections, then we're going to activate the binary feature detectors given the input they're getting from the pixels, then we're going to reconstruct the image. Initially we'll get a lousy reconstruction; it will be very different from the data because of the random weights. And then we're going to activate the feature detectors again, and we're going to increment the connections on the data and decrement the connections on the reconstructions, and that is eventually going to learn nice weights for us, as I'll show you--nice connection strengths that will make this a very good model of [INDISTINCT]. It's
important to run the algorithm the right way: on the data you increment connection strengths, and on the reconstructions--which are really a sort of screwed-up version of the data that's been infected by the prejudices of the model--you decrement them. The model kind of interprets the data in terms of its features, then it reconstructs something it would rather see than the data. Now, you could try running a learning algorithm where you take the data, you interpret it, you imagine the data is what you would like to see, and then you learn on that. That's the algorithm George Bush runs and it doesn't work very well. So, after you've been doing some learning
on this for not very long, I'm now showing you 25,000 connection strengths. Each of these is one of the features--take this one. That's a feature, and the intensity here shows you the strength of the connection to the pixels. So this feature really wants to have these pixels off, and it really wants to have these pixels on, and it doesn't care much about the other ones; mid-gray means zero. And you can see the features are fairly local, and these
features are now very good at reconstructing twos. It was trained on twos. So if I show it some twos it never saw before and get it to reconstruct them, you can see it reconstructs them pretty well. The funny pixels here that aren't quite right are because I'm using Vista. So you can see the reconstruction is very like the data--it's not quite identical, but it's a very good reconstruction for a wide variety of twos, and these are ones it didn't see during training, okay. Now, that's not that surprising--if you just copied the pixels and copied them back, you'd get the same thing, right? So that would work very well. But now I'm going to show it something it didn't train on. And
what you have to imagine is that Iraq is made of threes but George Bush thinks it's made
of twos, okay? So here's the real data and this is what George Bush sees. That's actually
inconsistent with my previous joke because [INDISTINCT] this learning algorithm. Sorry
about that. Okay, so you see that it perverts the data into what it would like to believe
which is like what it's trained on. Okay, that was just a toy example. Now what we're going to do is train a layer of features like that, in the way I just showed you. We get these features that are good at reconstructing the data, at least for the kind of data it's trained on. And then we're going to take the activations of those features, and we're going to make those the data and train another layer, okay. And then we're going to keep doing that, and for reasons that are slightly complicated, and that I will partially explain, this works extremely
well. You get more and more abstract features as you go up, and once you've gone up through about three layers, you've got very nice abstract features that are then very good for doing things like classification. But all these features were learned without ever knowing the labels. It can be proved that every time we add another layer, we get a better model of the training data--or, to be more precise, we improve a lower bound on how good a model we've got of the training data. So here's a quick explanation of what's going on. When we learn
the weights in this little restricted Boltzmann Machine, those weights define the probability, given a vector here, of reconstructing a particular vector there. So that's the probability of a visible vector given a hidden vector. They also define this whole Markov chain, if you went backwards and forwards many times. And so if you went backwards and forwards many times and then looked to see what you got here, you'd get some probability distribution over the hidden vectors, and the weights define that. And so you can think of the weights as defining both a mapping from these vectors of activity over the hidden units to the pixels, to images--that's this term--and the same weights define a prior over these vectors of hidden activities. When you learn the next level of Boltzmann Machine up, you're going to say, "Let's keep this mapping, and let's learn a better model of the distribution that we've got here when we use this mapping," and you keep replacing the implicit prior defined by these weights with a better one, defined by the next Boltzmann Machine. And so what you're really doing is dividing this task into two tasks.
One is, find me a distribution that's a little bit simpler than the data distribution. Don't
go the whole way to try and find a full model, just find me something a bit simpler than
the data distribution. This is going to be easier for the next Boltzmann Machine to model--that part's very nonparametric. And then find me a parametric mapping from that slightly simpler distribution to the data distribution. So I call this creeping parameterization. What you're really doing is--it's like taking a shell off an onion. You've got this distribution you want to model. Let's take off one shell, which is this, and get a very similar distribution that's a bit easier to model, plus some parameters that tell us how to turn this one into that one, and then that's going to solve the problem of modeling this distribution. So that's what's
going on when you learn these multiple layers. After you've learned say three layers, you
have a model that's a bit surprising. This is the last restricted Boltzmann Machine we learned. So here we have this sort of model that says, "To generate from the model, go backwards and forwards." But because we just kept the p of v given h from the previous models, below that it's a directed model where you just go chunk, chunk to generate. So the right way to generate from this combined model, when you've learned three layers of features, is to take the top two layers and go backwards and forwards for a long time. Fortunately, you don't actually need to generate from it; I'm just telling you how you would if you did. We want this for perception, so really you just need to do perceptual inference, which is chunk, chunk, chunk--it's very fast. But to generate, you'd have to go backwards and forwards for a long time, and then once you've decided on a pattern here, you just go chunk, chunk--that's very directed and easy. So I'm now going to learn a particular model
of some handwritten digits, but all the digit classes now. So we're going to put in slightly bigger images of handwritten digits from a very standard data set where we know how well other methods do. In fact it's a data set on which support vector machines beat backpropagation, which was bad news for backpropagation, but we're going to reverse that in a minute. We're going to learn 500 features now instead of 50. Once we've learned those, we're going to take the data, map it through these weights--which are just these weights in the opposite direction--and get some feature vectors. We're going to treat those as data and learn this guy, then we're going to take these feature vectors and tack on ten label units. So now we need the labels, but I'll get rid of that later. And so we've got a 510 dimensional vector here, and we're going to learn a joint density model of the labels and the features.
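A sketch of the whole recipe up to this point, with tiny stand-in sizes (the real net used 784 pixels, 500-unit feature layers, and a 2,000-unit top level): greedily train an RBM per layer with a bare-bones CD-1 trainer, feed each layer's feature activations to the next as data, then tack ten one-hot label units onto the top features and learn a joint model of the combined vectors.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bare-bones CD-1 trainer for one restricted Boltzmann Machine.
def train_rbm(data, n_hid, epochs=200, lr=0.1):
    W = rng.normal(0, 0.01, (data.shape[1], n_hid))
    for _ in range(epochs):
        p_h = sigmoid(data @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        recon = sigmoid(h @ W.T)
        W += lr * (data.T @ p_h - recon.T @ sigmoid(recon @ W)) / len(data)
    return W

data = (rng.random((50, 20)) < 0.3).astype(float)   # stand-in "images"
labels = rng.integers(0, 10, 50)

x = data
weights = []
for n_hid in (15, 10):                  # two greedy feature layers
    W = train_rbm(x, n_hid)
    weights.append(W)
    x = sigmoid(x @ W)                  # activations become the next layer's "data"

one_hot = np.eye(10)[labels]            # the ten label units
joint_visible = np.concatenate([x, one_hot], axis=1)
W_top = train_rbm(joint_visible, 25)    # joint density model of labels AND features
```

The top-level machine models why features and labels go together, rather than mapping one to the other.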
We're not trying to get from the features to the labels, we're trying to say why do
these two things go together? So we're learning a joint model of both, not a discriminative
model. When we've completed this learning, what we're going to end up with is, the top
level here is a Boltzmann Machine and so it has an energy function, and you can think
of that as a landscape. When the weights are all small here or close to zero, then the
energy landscape is very flat. All the different configurations here are more or less equally
good. As it learns, it's going to carve ravines in this energy landscape. If you think of
it as a 510 dimensional energy landscape, these ravines are going to have the property
that in the floor of the ravine, there's about ten degrees of freedom and those are the ways
in which a digit can [INDISTINCT] and still be a good instance of that digit, like a two
with a bigger loop or a longer tail. Up the sides of the ravine, there's like 490 directions, and those are the ways in which, if you vary the image, it wouldn't be such a good two anymore. But the nice thing is, it's going to learn long narrow ravines, so that one two can be very different from another two and yet connected by this ravine--the ravine's captured the manifold--so it can wander from one to another in a way that it won't wander from a two to a three, even though the three might be more similar in pixels to the two. Okay.
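Generation from the combined model, as described above--clamp a label unit, run alternating Gibbs sampling at the top level to settle into that digit's ravine, then do a single directed down pass--might be sketched like this (all weights here are random stand-ins, so the resulting "mental image" is noise; only the mechanics are meant):

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

W_top = rng.normal(0, 0.1, (510, 2000))  # top-level RBM: features + labels <-> 2000 units
W2 = rng.normal(0, 0.1, (500, 500))      # directed down-weights, layer 2 -> layer 1
W1 = rng.normal(0, 0.1, (500, 784))      # directed down-weights, layer 1 -> pixels

def generate(label, gibbs_steps=50):
    labels = np.zeros(10)
    labels[label] = 1.0                   # clamp the one label unit
    features = sample(np.full(500, 0.5))  # start at a random point
    for _ in range(gibbs_steps):          # settle into that label's ravine
        v = np.concatenate([features, labels])
        top = sample(sigmoid(v @ W_top))
        features = sample(sigmoid(top @ W_top.T)[:500])  # labels stay clamped
    # Directed "chunk, chunk" down pass to pixel space.
    h1 = sample(sigmoid(features @ W2))
    return sigmoid(h1 @ W1)               # pixel probabilities

image = generate(2)
```

Only the top two layers go backwards and forwards; everything below is a fast directed pass.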
I want to show you this generative model actually generating. Before I do that, I want to own
up, we did a little bit of fine tuning which actually took longer than the original learning,
where you--after you've done that greedy layer by layer learning, you do a bit of fine tuning
where you put in images, you do a forward pass, bottom up, with binary states, and as you do this forward pass, you adjust the connections slightly so that what you get in one layer is better at reconstructing what caused it in the layer below. Then you do a few iterations of the top-level Boltzmann Machine--you go backwards and forwards a few times to get the learning signal there. And then you do a down pass, and during the down pass you adjust the connections going upwards so they're better at reconstructing what caused the activity in that layer. During the down pass you know what caused the activity, because you caused it, and you're trying to recover those causes. That fine tuning helps, but it will work without it. So now I'm going to attempt to show you a movie. That's not very
nice. Okay, there's that network. Here's where we're going to put images. Here's 500 features, 500 features, 2,000 features and the ten labels. First of all, we're going to do some perception.
So I'm going to give it an image and tell it to run forwards. Oops, sorry? I didn't
mean that. I meant that. And you'll see, these are stochastic, they keep changing, but it's
very sure that it's a four. See, those are the identities of these neurons. It knows
that's a four and it has no doubt about it, even though its feature detectors are fluctuating
a bit. If I give it a five, hopefully it'll think it's a five. Yeah, it doesn't have any
doubt. So now let's be mean to it because that's a lot more fun. I'm going to give it
that. So, it says, so, four, six, eight, four, eight, eight, eight, eight, eight, eight,
four. It can't make up its mind whether it's a four or an eight, and that's pretty reasonable
in those circumstances. It will actually, for that one, say eight a bit more often than
anything else. So, we've classed it as getting that right but it's very unsure whether it's
an eight or a four. And just occasionally, it thinks it can be other things like a two,
but it basically thinks four or eight. I can make it run faster so you can--okay. It's
basically four or eight, an occasional six. I could give it something like this and it
thinks basically one or seven and occasionally a four. Because I programmed this myself, I want to point out--this is my baby--that it's very reasonable for it to think that might be a four, because, look, you can see the four in there, okay.
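The perception pass in that demo--a fast feed-forward "chunk, chunk" up through the layers, then repeated stochastic readings of the ten label units, so an ambiguous image splits the vote between four and eight--might be sketched like this (the weights are random stand-ins, and the voting scheme is my simplification of the backwards-and-forwards settling):

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 0.1, (784, 500))
W2 = rng.normal(0, 0.1, (500, 500))
W_top = rng.normal(0, 0.1, (510, 2000))   # joint top-level model, features + labels

def recognize(image, n_samples=50):
    h = sigmoid(sigmoid(image @ W1) @ W2)  # deterministic "chunk, chunk" up-pass
    votes = np.zeros(10)
    for _ in range(n_samples):
        v = np.concatenate([h, np.full(10, 0.1)])     # labels start uncertain
        top = (rng.random(2000) < sigmoid(v @ W_top)).astype(float)
        label_probs = sigmoid(top @ W_top.T)[500:]    # read the ten label units
        votes[int(np.argmax(label_probs))] += 1       # tally this stochastic reading
    return votes / n_samples

votes = recognize(rng.random(784))
```

With trained weights, a clean four would take nearly all the votes, and a four/eight hybrid would split them.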
Okay. Now, that was just doing perception but the very same model does generation. So,
what I can do is fix a top-level label unit, and all I've done is fix the state of one neuron. There's a million connections there, because that's 2,000 by 500. I just fixed this one neuron, but when I fix that state, then the weights--the 2,000 weights coming out of there to these neurons here--will lower the energy of the ravine for twos and raise the energy of the ravine for all of the other guys. So, now
we've got this landscape in which you got all these ravines but the two ravine has been
lowered. And if you put it at a random point, it will eventually stumble into the two ravine
and then it will stay there and wander around. So, let's see if we can do that. So, what's
really going on here is I'm just going backwards and forwards up here. Ignore that for now.
I'm going backwards and forwards here and letting it gradually settle into a state that this network's happy with. So, that's its brain state, and that doesn't mean much to you. If you look at that, you don't really know what it means. So, what we're
going to do is, as it's settling, we're going to play out the generative model here. We're
going to do computer graphics to see what that would have generated. And so, what you
got here is that's what's going on in its brain and this is what's going on in its mind.
So, you can see what this is thinking, and I'm serious about that. I know it sounds crazy, but when I say to you I'm seeing a pink elephant, what I mean is, I've got a brain state such that, if there were a pink elephant out there, this would be perception. That's how mental states work. They're funny because they're hypothetical, not because they're made of spooky stuff. So, I use this language where the terms refer to things in
the world because I was saying, "What would have to be in the world for this brain state
to be perception?" Now, if I've got a generative model, I can take the brain state and say, "Well, what would have to be in the world for that to be perception?" Well, that. So, that's what it's thinking; that's its mental state right there. So, you've got brain
states and mental states and most psychologists won't show you both. Let's go a bit faster.
And it still hasn't settled into the two ravine. And now it's about in the two ravine. And
now it's just wandering around in that two ravine and this is what it's thinking. It
knows about all sorts of different twos and it's very good that it does because that means
it can recognize weird twos. Let's give it another one. It hasn't got into the eight ravine properly yet. It will jump between the ravines; it's not really there. But by now it will be in the eight ravine, and it will show you all the sorts of different eights it believes in, if you run it long enough. If you ran it for an hour now, it would probably just stay in the eight ravine, showing you all sorts of different eights, okay. Let's do one more, because I liked it so much. Again, it's not really in the five ravine properly yet. No, that was a six. By now it's in the five ravine, and it will show you all sorts of weird fives--ones without tops, some occasional sixes. And it ends up with a pretty weird one, but that's definitely a five, and it's very good that it knows that's definitely a five, because it likes to recognize things like that. Okay. That's it for the demo. I
have to get rid of that. Okay. So, here's some examples of things it can recognize.
These are all the ones it got right, and you can see it recognizes a wide variety of twos. It recognizes that this is a one despite that, and it recognizes that this is a seven because of that. If you try writing a program by hand that will do that, you'll find it's kind of tricky if you'd never thought of these examples in advance. If you compare it with
support vector machines, now what we're doing here is we're taking a pure machine learning
task. We're not giving it any prior knowledge about pixels being next to other pixels. We're
not giving it extra transformations of the data. So, this is without--it's a pure machine
learning task without any extra help. If you get extra help, you can make all the methods a lot better. But a very good support vector machine, done by DeCoste and Scholkopf, got 1.4%. The best you can do with standard backpropagation is about 1.6%. This gets 1.25%, and significance here is about a difference of 0.1, so this is significantly better than that. [INDISTINCT] maybe gets 3.3%. Now, I fine-tuned that to be good at generation, so
I could show you it generating using this sort of up down algorithm but we can also
use backpropagation for fine-tuning. And now that I've got this way of finding features
from the sensory data, I can say things like nobody in their right mind would ever suggest
that you would use a local search technique like backpropagation to search some huge non-linear
space by starting with small random weights. It will get stuck in local [INDISTINCT]. And
that is indeed true. What we're going to do is search this huge non-linear space of possible features by finding features in the sensory data, and then finding features in the combinations of features we found in the sensory data, and keep doing that. And we'll design our features like that. So we don't need labels, we just need sensory data.
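A sketch of that division of labor (the pretrained feature weights are faked with random values here, and for simplicity only the new label weights move; the point is that backpropagation's job shrinks to adjusting the decision boundaries): features designed without labels, then discriminative fine-tuning on a softmax of ten label units.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W1 = rng.normal(0, 0.1, (784, 500))   # stand-in for pretrained, label-free features

def fine_tune_step(W_out, image, label, lr=0.1):
    h = sigmoid(image @ W1)                         # features designed without labels
    p = softmax(h @ W_out)
    # One cross-entropy gradient step on the label weights only.
    W_out = W_out - lr * np.outer(h, p - np.eye(10)[label])
    return W_out, p

W_out = np.zeros((500, 10))           # ten attached label units, learned from scratch
x = rng.random(784)
W_out, p_before = fine_tune_step(W_out, x, 3)
p_after = softmax(sigmoid(x @ W1) @ W_out)
```

Each step provably raises the probability of the correct label on that example, without touching the feature design.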
Once we've designed all our features, we can then use backpropagation to slightly fine-tune them to make the category boundaries be in the right place. So, a pure version of that
would be to say let's learn the same net but without any labels. Okay? So, we do all the
pre-training like this. After we pre-trained now, what we're going to do is we're going
to attach ten label units to the top and we're going to use backpropagation to fine-tune
these, and the fine-tuning is hardly going to change the weights at all, but it is going to make the discrimination performance a lot better. So, this is going to be discriminative fine-tuning, and [INDISTINCT] 1.15% errors. And all the code for doing the pre-training
and the fine-tuning is on my webpage, if you want to try it. Now, given that we now know
how to get features from data, we can now train things we never used to be able to train
with backpropagation. If you take a net like this where we're going to put in the digit,
and we're going to try and get out the same digit but we're going to put like eight layers
of non-linearities in between, then if you start with small random weights and you backpropagate, you get small times small times small, and by the time you get back here, you've got small to the power eight and you don't get any gradient. If you put in big random weights, you'll get a gradient, but you'll have decided in advance where you're going to be in the search space. What we're going to do is learn this Boltzmann Machine here. After we've learned that, we're going to map the data through it to get activity patterns and then learn this Boltzmann Machine. Then we're going to learn this Boltzmann Machine, but with linear hidden units. And then what we're going to do is put the transposed weights here, because this is good at reconstructing that, so this should be good, and so on. And we're going to use that as a starting point, and then we do backpropagation from there, and it will slightly change all of these weights and make this work really well. And so now what it's done is it's communicated this 28 by 28 image via this bottleneck of 30 units, but using a highly non-linear transformation
to compress it. If you make everything linear here--you leave out all these layers and make everything linear--this is PCA, Principal Components, which is a standard way to compress things. If you put in all these non-linear layers, it's much better than PCA. So, this
is all done without labels, now. You just give it the digits, you don't tell it which
is which. These are examples of the real digits, just one example of each class. These are
the reconstructions from those 30 activities in the hidden layer and you can see they're
actually better than the data. This is a dangerous line of thought. PCA does this and you can
see it's kind of hopeless compared to this method. At least that's what you're meant
to see. Now, we can apply this to document vectors. I don't find documents as interesting
as digits but I know some people are interested in them. You could take a document vector
and you could take the counts of the 2000 most common words and there's a big database
like this of 800,000 documents. And so we took 400,000--sorry. Yeah, I know. I see people
smiling. [INDISTINCT] 100,000, I'm an academic, okay. We then train up a neural net like this,
where these are now [INDISTINCT] units. For those of you who know machine learning, we
can use any units in the exponential family, where the log probability is linear in the
parameters. So, we train up this to get some features, we train up this to get some features,
and then we train up this until you get just two linear features. That seems a little excessive
and obviously when we reconstruct, we're not going to get quite the right counts. But you'll
get counts that are much closer to the right counts than the base rates are. So, if
down here you have a high count for Iraq and Cheney and torture, up here, you'll
get high counts for similar things. So, we can turn a document into a point in the two
dimensional space. And of course once we got a point in two dimensional space, we can plot
it in 2D. And for this database, someone had gone through by hand, more or less by hand,
and labeled all the documents. We didn't use the labels, okay. But now when we plot the
point in 2D, we can color the point by the class of the document. So, if you do the standard
technique, which is Latent Semantic Analysis, which is just a version of PCA, and you lay out
these documents in 2D, that's what you get. And you can see the green ones are in a slightly
different place from these blue ones but it's a bit of a mess. If you use our method, it
does a little bit better. You get that. And so now, if you look at these documents--these
are business documents, right? If you look at these documents here, you can see there's
lots of different kinds of documents about accounts and earnings. Presumably, there's
an Enron cluster in here somewhere and it would be very nice to know which are the companies
that are in this Enron cluster. Okay. But there's something more interesting you can
do. That's just for visualization. But now I'm going to show you how to solve the following
problem. Suppose I'd give you a document. So, this isn't like what I call Google Search
where you use a few key words and you find what you want. This is--I give you a document
and I ask you to find similar documents to the one I gave you. Okay? Documents with similar
semantic content. So, I'm using a document as a query. What we're going to do is we're
going to take our big database of documents, a whole million of them, and we're going to
train up this network and it's going to convert these documents into 30 numbers. I'm going
to use logistic units here, that is, numbers that range between 0 and 1, and we're going
to train it as Boltzmann Machines. Then we're going to back propagate and we'll get intermediate
values here that convey lots of information. And then we're going to start adding noise
here and we're going to add lots and lots of noise. Now, if I add lots and lots of noise
to something that has an output between 0 and 1, there's only one way it can transmit
a lot of information. It's got to make the total input that comes from below be either
very big and positive, in which case it'll give one, or very big and negative, in which
case it'll give a zero. And in both those cases, it will resist the noise. If it uses
any intermediate value, the outcome will be determined by the noise. So, it won't transmit
information, so it won't be very good at getting the right answers.
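The argument about noise forcing binary codes can be checked numerically. This is a toy illustration only; the `flip_rate` helper and the noise level are invented for the demonstration, and the real system anneals the noise during training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
SIGMA = 10.0   # a large noise standard deviation, chosen for the demo

def flip_rate(total_input, n=10000):
    """Fraction of noisy outputs that land on the wrong side of 0.5."""
    noisy = sigmoid(total_input + rng.normal(0.0, SIGMA, n))
    intended = sigmoid(total_input) > 0.5
    return np.mean((noisy > 0.5) != intended)

# A small total input is swamped by the noise: the output is mostly noise.
print(flip_rate(0.5))
# A very big total input resists the noise and transmits a reliable bit.
print(flip_rate(50.0))
```

With a small total input the unit flips sides almost half the time, so it carries almost no information; with a very big input it almost never flips, which is exactly why the units are driven to saturate.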
>> So the noise is something like Gaussian, it's not binary flipping.
>> HINTON: It's Gaussian noise. And we gradually increase the standard deviation and it's noise
in the input to the unit. And we gradually increase this, and we use a funny kind of
noise that I don't want to get into, that makes it easier to use conjugate gradient
descent. And what will happen is, these will turn into binary units. So, we now have a
way of converting the word count vector of a document into a 30-bit binary vector. And
now we can do what I call supermarket search. So, suppose you want to find things that are
like a can of sardines. What you do is you go to your local supermarket and you say to
the cashier, "Where do you keep the sardines?" And you go to where the sardines are and then
you just look around and there's all the things similar to sardines because the supermarket
arrange things sensibly. Now, it doesn't quite work because you don't find the anchovies,
as I discovered when I came to North America, I couldn't find the anchovies. They weren't
anywhere near the sardines and the tuna. That's because they're near the pizza toppings. But
that's just because it's a three dimensional supermarket. If there was a 30 dimensional
supermarket, they could be close to the pizza toppings and close to the sardines. So, what
we're going to do is we're going to take a document and using our learned network, we're
going to hash it to this 30-bit code. But this is a hash code that was learned. It's
not some random little thing. It was learned with lots of machine learning. So, it has
the property that similar documents map to similar codes. So, now we can use hashing
for doing approximate matches. Everybody knows hashing is nice and fast, and usually you
can't do approximate matches. But with machine learning, you can have both. So, we take our
document, we hash it to a code and in this memory space, at each point in the memory
space, we put a pointer to the document that has that code and your [INDISTINCT] so if
two documents have the same code, you can figure out what to do. So now, with the query
document, we just go there and now we just look around like in the supermarket. And the
nearby similar documents will have nearby codes. And so, all you need to do to find
a similar document is flip a bit and do a memory access. Okay. That's two machine instructions.
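The flip-a-bit lookup can be sketched in a few lines. The codes below are random integers standing in for the learned codes (with a real trained network, similar documents would land at nearby addresses, which is the whole point):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
N_BITS = 20   # the talk's experiments used 20-bit codes

# Random integers standing in for the learned codes of 100,000 documents.
codes = rng.integers(0, 2**N_BITS, size=100_000)

# Memory: each address holds pointers to the documents that hash there.
memory = defaultdict(list)
for doc_id, code in enumerate(codes):
    memory[int(code)].append(doc_id)

def supermarket_search(query_code):
    """Return documents whose code matches exactly or differs by one bit."""
    hits = list(memory.get(query_code, []))
    for bit in range(N_BITS):                # flip a bit, do a memory access
        hits.extend(memory.get(query_code ^ (1 << bit), []))
    return hits

results = supermarket_search(int(codes[0]))
print(len(results))   # the query document itself is always among the results
```

The cost per neighbour probed is one XOR and one memory access, independent of the size of the collection, which is the two-machine-instructions point made above.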
So, if you were to have a database of, let's say, 10 billion documents, and I give you one
and say, "Give me 100,000 documents similar to this one," because my other search technique
can only cope with 100,000, then 100,000 times you're going to have to flip a bit and do
a memory access. So, that's only 200,000 machine instructions. You need two machine
instructions per document. It's completely independent
of the size of your database. Okay. Because you've laid things out like in a supermarket,
you've got a document supermarket now [INDISTINCT] so, if you compare it with--well, we've actually
only tried it because we're academic, on 20-bit codes and a million documents and it works
just fine, but nothing could possibly go wrong when you scale it up. It's actually quite
accurate. That is, if you compare it with a sort of gold standard method, it's about
the same accuracy and when you now take your shortlist that you find in this very fast
way and you give those guys in the shortlist to the gold standard method, it works better
than the gold standard method alone. It's much better than locality sensitive hashing
but in terms of speed, we used the code that's on the web for that, and ours is about
50 times faster. And in terms of accuracy, locality sensitive hashing will always be
less good than this because it's just a hack for doing this. And locality sensitive hashing
works on the count vector. If you work on the count vector, you will never understand
the similarity between the document that says, "Gonzales quits," and the document that
says "Wolfowitz resigns." They're very similar but not in the word count vector. But if you
compress it down to some semantic features, they're very similar documents. So, the summary
is that I showed you how to use this simple little Boltzmann Machine with the bipartite
connections to learn a layer of features. Then I showed you that if you take those features,
you can learn more features. And as you go up this hierarchy, you get more and more complicated
features that are going to be better and better for doing classification. This produces good
generative models. So they're good at reconstructing data, or producing data like the data you
saw. If you fine-tune with this [INDISTINCT] algorithm which has this funny name, if you
want good discriminative models, what you do is then you fine-tune with backpropagation.
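The recipe summarized here, greedy layerwise RBM pretraining before any fine-tuning, can be sketched as follows. This is a toy version under simplifying assumptions: tiny synthetic binary data, one-step contrastive divergence, and the backpropagation fine-tuning stage is only indicated in a comment, not implemented:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Restricted Boltzmann Machine trained with one-step contrastive divergence."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_vis, n_hid))
        self.a = np.zeros(n_vis)   # visible biases
        self.b = np.zeros(n_hid)   # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)

    def cd1_step(self, v0):
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.a)     # reconstruction
        h1 = self.hidden_probs(v1)
        # "Correlations in the data minus correlations in the reconstructions."
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.a += self.lr * (v0 - v1).mean(axis=0)
        self.b += self.lr * (h0 - h1).mean(axis=0)

# Toy binary data: two repeated patterns standing in for images.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

# Greedy layerwise pretraining: train one RBM, feed its hidden activities
# to the next. Fine-tuning would then unroll the stack into an encoder and
# decoder and backpropagate; that stage is only indicated here.
rbm1, rbm2 = RBM(6, 4), RBM(4, 2)
for _ in range(500):
    rbm1.cd1_step(data)
features = rbm1.hidden_probs(data)
for _ in range(500):
    rbm2.cd1_step(features)
features2 = rbm2.hidden_probs(features)   # second layer of features

# Pretraining alone already gives decent reconstructions of the data.
recon = sigmoid(rbm1.hidden_probs(data) @ rbm1.W.T + rbm1.a)
print(np.mean((recon - data) ** 2))
```

The point of the sketch is the order of operations: each RBM is trained on the activities of the layer below, and only after the whole stack is trained does backpropagation slightly adjust all the weights.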
But the good news is you don't need labels for all of your training data. You can learn
all these features on a very big data set, then with just a few million labels or even
a few hundred labels, you can backpropagate to fine-tune it for discrimination. And that
will work much better than for example using any machine learning method that just uses
the label data. It's a huge win. You can use the unlabeled data very effectively. And I've
shown you that it can also be used for explicit dimensionality reduction where you get [INDISTINCT]
bottleneck and that you can do search for similar things very fast. And of course we'd
like to apply it to images, but for images you have a problem, which is: in documents,
a word is very indicative of what the document is about. In an image, what's indicative of
what the image is about is a recognized object and so what we are trying to do now is make
it recognize objects so that [INDISTINCT] then we can get the objects in the image and
then apply the semantic hashing technique. But we haven't done that yet. I see I've managed
to talk very fast so I can show you a little bit about how we're going to do the image
recognition. Suppose you want to do a generative model, a graphics model, which would allow you
to take a type of an object and produce an image of that object. So, I say square and
I say what its pose is, its position and orientation. Then we might have a top-down model that, from
this and this, predicts where the parts might be. And if it's a kind of sloppy model, it'll
say this [INDISTINCT] to be round about there, and this [INDISTINCT] to be round about there.
And if we pick randomly from these distributions, we'll get a square where the edges don't meet
up. Now, one way we can solve that is to generate very accurately here. We could say, I'm going
to generate each piece just right. But that requires high bandwidth and lots of work.
We're going to generate sloppily. We're going to generate a redundant set of pieces and
then we're going to know how the pieces fit together. We're going to know a corner must
be co-linear with an edge and the edges here must be co-linear with corners. And now, by
lateral interactions here, using something called a Markov Random Field, we can get it
to settle into that. And so now, [INDISTINCT] process is at each level, the level above
says where the major pieces should be, roughly, and a level that knows about how these pieces
go together, like how eyes and noses and mouths go together, says, "Okay, the nose should
be exactly above the middle of the mouth and the eyes should be at exactly the same height."
The level above doesn't need to specify that, that's known locally. So, how are we going
to learn that? Well, we're going to introduce lateral interactions among the visible units.
That's fine. The real crucial thing in these nets is you don't have lateral interactions
among the hidden units. So, we can learn that, and the way we learn it is we put an
image in here, we activate the features then with the features fixed providing constant
top-down input, we run these lateral interactions to let this network settle down and we replace
the binary variables by real value variables. So, we're doing something called mean-field.
We let this settle down to something it is happier with, a reconstruction. It doesn't
need to get all the way to equilibrium, it just needs to get a bit better than this.
And then, we apply a normal learning algorithm to these correlations and these correlations,
like this. But we can also learn the lateral interactions by saying, "Take the correlations
in the data minus the correlations in the reconstructions," and that'll learn all these
lateral interactions. So now what we're going to do is, we're going to learn a network with
400 input units for a 20 by 20 patch of an image. This is just preliminary work. When we learn
the first network, these aren't connected. Then when we use these feature activities
to learn the second level Boltzmann Machine, we connect these together and we learn these
and these. Then when we learn the top Boltzmann Machine, we connect these together and we
learn these weights and these weights. When we're finished, we can generate from the model.
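The mean-field settling just described can be illustrated on a toy visible layer. Everything here is invented for the illustration (a ring of eight units, hand-set lateral weights, a frozen top-down input); the point is only that lateral connections pull an ambiguous unit into line with its neighbours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eight visible units in a ring; lateral weights say "agree with your neighbours".
n = 8
L = np.zeros((n, n))
for i in range(n):
    L[i, (i - 1) % n] = L[i, (i + 1) % n] = 2.0   # symmetric, no self-connections

# Frozen top-down input from the hidden features: it sloppily favours the
# all-ones pattern, but unit 3 gets an ambiguous signal.
top_down = np.array([3.0, 3.0, 3.0, 0.0, 3.0, 3.0, 3.0, 3.0])

# Mean-field: replace binary states by real values and let the visibles settle.
v = sigmoid(top_down)
for _ in range(20):
    v = sigmoid(top_down + L @ v)

print(np.round(v, 2))   # unit 3 has been pulled up by its neighbours
```

This mirrors the description above: the top-down input is held constant while the lateral connections do the local cleanup, and the settled values need only be a bit better than the starting point, not a full equilibrium.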
And so as a control, what we're going to do is, we're going to learn this model on patches
of natural images, which are notoriously difficult things to model because anything
could happen in a patch of a natural image. So, it's a very hard thing to build a density
model of. We're going to learn it without lateral connections and we get a model that's
very like many other models. When you generate from it, what you get is clouds. So, here's
natural image patches and they have the property that there's not much going on and then there's
a sudden [INDISTINCT] of structure like here. So, if you apply a linear filter to these
things, the linear filter will usually produce a zero and occasionally produce a huge output.
If you apply a linear filter to these things, it will produce some kind of Gaussian distribution.
These have exactly the same [INDISTINCT] of spectrum as these. What they don't have is
this sort of heavy tailed distribution where there's not much happening and then a lot
happening, and long range structure. So, now what happens if we put in the lateral interactions
and do the learning again? If you put the lateral interactions in, they can say things
like if you have a piece [INDISTINCT] and you'd like a piece of that somewhere around
here, put it here where it lines up. So, that will make much longer range interactions.
And so now when we generate from the model with lateral interactions, we get that and
you can see that these are much more like real image patches. They pass many of the
statistical tests for being real image patches. They've got this kind of much longer range
structure. They've got sort of co-linear things and things at right angles and all sorts of
nice structure in them, which we didn't have before. And so we're getting--this is probably
the best model there is of natural image patches. If you ask anybody else who models them, "Show
me samples from your generative model." They say, "Oh, well, we tried that and it looked
terrible. So we never published those." This is, I think, the first model that generates
nice samples from the model. [INDISTINCT] has models that are maybe comparable. What we'd
like to do now is make more layers and we'd also like to have attention. So, as you go
up, you focus on parts of the image. And what I want to do is get something--you're given
an image, you go up, it's focusing on parts and it gives you a figure at the top. It gives
you what you see, which is you look at an image and you see a face. And then you look
again, you see the eye. Then you look again, you see a group of four people. And those
are the things that come out, and those are going to be like the words that go into
an image retrieval system. This is going to run for a long time
learning and then it's going to run for quite a long time on each image, but that's all
[INDISTINCT] okay. I'm done. >> So, it looks like we've got time for questions.
If you have questions can you--if you have questions, can you please hit the mic in the
middle so that the folks at their offices can hear.
>> Okay. Hi. So, you were saying that this method doesn't require labels. I was just
wondering if it would actually help if you have labels for at least some of your training
data? >> HINTON: Oh, yes. Labels help. The main
thing is to show that you can do a lot without them and therefore you can have much more
leverage from a few labels. Yeah. >> Okay. Thanks.
>> HINTON: So, for example in the semantic hashing idea, you could, as you're learning
those 30 dimensional codes, you could say if two things are from the same class and
the codes are far apart, introduce a small force pulling them together. And we've got
a paper on that in [INDISTINCT] last year. And that will improve the sort of clustering
of things of the same class. But the point is you can do it without knowing the classes
as well. >> Hi. Now, so, people have built auto
encoders for a long time before and they use regular sigmoid units and use backprop to
train them. >> HINTON: But they never work very well.
>> Correct. Would--if we actually have multiple layers of these regular sigmoid units
and train them the same way as you're doing, one layer at a time, would
it work as well as RBMs or not? >> HINTON: Okay. That's a very good question.
So, it's a bit confusing. This deep thing with multiple layers trained with RBMs is
called a multi-layer auto encoder. But you could also have a very small auto encoder with one
hidden layer that's non-linear and train that up. And the RBM is just like that. So, you
could train these little auto encoders and stack them together and then train the whole
thing with backprop. That's what the question was. And that will work much better than the
old way of training auto encoders, but not quite as well as this. So, Yoshua Bengio has a paper
where he compared doing auto encoders with doing restricted Boltzmann Machines, and the
restricted Boltzmann Machines worked better, especially for things like [INDISTINCT] backgrounds.
>> I've got a--I've got a question which--if I could ask...
>> HINTON: Okay. >> ...because I'm holding a microphone. So,
this morning we were talking about the--about news with--where the problem with news is
that everything changes from day to day. Do you have any intuition--this is one of those
unfair, "What do you think would happen," do you have any intuition on how hard it would
be to adapt a deep network like this once your input distribution changes or as it continues
to change? >> HINTON: Okay. So one good thing about this
learning is everything scales linearly with the amount of training data. There's no quadratic
optimization anywhere that's going to screw you for big databases. The other thing is,
because it's basically stochastic online learning, if your distribution changes slightly, you
can track that very easily. You don't have to start again. So, if it's the case that
the news tomorrow has quite a lot in common with the news over the last few months and
few years, and you just need to change your model a bit rather than start again, then
this is going to be good for tracking and it's not going to be as much work as learning
it all in the first place. And in fact, once you've got all of these layers of features, basically
changing the interactions in high level features will get you lots of mileage without much
work. >> Sir, I have another question about--so,
about the supermarket search. You were saying you just flip a bit in your hash code. So,
what I'm wondering is, you know, one thing that I'm not sure about is, like, if you flip
one of these bits, you might not necessarily get something there?
>> HINTON: That's fine. >> I mean, how do you know that you're going
to find something there? And then also, maybe, is there some way of finding better bits to
flip and, like, how do you decide which ones? >> HINTON: So, of course. If you make the
number of addresses be about the same as the number of documents, the average answer is
one. >> Right.
>> HINTON: Okay. And you've--if there's nothing there, you can flip more bits.
>> Sure. >> HINTON: So, yes. You'll get some misses
but that's just sort of constant. >> Right.
>> HINTON: We can look at, actually, how evenly spread over addresses it is and typically,
most of the addresses won't be used and a typical address would be used like three or
four times. So, it's not as uniform as we'd like but that could all be improved. And we've
only done this once. We've just trained this network once on one data set and that's all
the research we've done so far, really. If we could get a tiny bit of money from someone,
we could make this whole thing work much better. >> So, one thing that is special about digits
is that they evolved in a way that makes them discriminative.
>> HINTON: Yes. >> So, you would hope--it's not that surprising
that an unsupervised method can then extract features that are discriminative. I was wondering
what happens with [INDISTINCT] the other applications where--so clearly, when you do unsupervised
learning, you might throw away some very indicative features right there.
>> HINTON: Yes. So, basically, there's two kinds of learning, there's discriminative
learning where you take your input and your whole aim in life is to predict the label.
And then there's generative learning where you take your input and your whole aim in
life is to understand what's going on in this input. You want to build a model that explains
why you got these inputs and not other inputs. Now, if you do that generative approach, you
need a big computer and you're going to explain all sorts of stuff that's completely irrelevant
to the task you're interested in. So, you're going to waste lots of computation. On the
other hand, you're not going to need as much training data because each image is going
to contain lots of stuff and you can start building your features without yet using information
in the labels. So, if you've got a very small computer, what you should do is discriminative
learning so you don't waste any effort. If you've got a big computer, do generative learning,
you'll waste lots of the cycles but you'll make better use of the limited [INDISTINCT]
label data. That's my claim. >> Hi Geoff. I have a question. What happened
to regularization? What kind of regularization is implicit in all of your stages?
>> HINTON: Okay. So, we're using a little bit of weight decay and the way we set the
weight decay was just--we fiddled about with it for a bit to see what worked on the--on a
validation set, the usual method. And if you don't use any weight decay, it works. If you
use weight decay, it works a bit better. And it's not crucial how much you use. So, we
are using some weight decays here but it's not a big deal. And like I say, all of the
code is in [INDISTINCT] on my web page. There's a pointer on my web page. So, you can go and
look at all those things and all the little fudges we use.
>> Right. But the Boltzmann Machine is fundamentally sort of entropic regularization and then your
little pieces of tuning with weight decay are from the other family. So, you're blending
both [INDISTINCT] >> HINTON: No. The Boltzmann Machine, it's
true. There's a lot of regularization that comes from the fact that the hidden units are
binary stochastic. So, they can't transmit much information.
>> Yes. >> HINTON: That does lots of regularization
for you, compared with the normal auto encoder. But in addition, we say don't make the weights
too big. And one reason for that is not just regularization, it's--it makes the Markov
chain mix faster if you don't make the weights too big.
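Where such a weight-decay term sits in the update can be sketched like this. The function name and the coefficient are illustrative; the talk only says that a small, non-critical amount of decay was tuned on a validation set:

```python
import numpy as np

def cd_weight_update(W, pos_corr, neg_corr, lr=0.1, weight_decay=0.0002):
    """One contrastive-divergence weight step with an L2 weight-decay term.
    The decay coefficient here is illustrative, not the value used in the talk."""
    return W + lr * (pos_corr - neg_corr - weight_decay * W)

# With no correlation difference, decay alone slowly shrinks the weights,
# which also helps the Markov chain mix.
W = np.full((2, 2), 5.0)
zeros = np.zeros((2, 2))
for _ in range(1000):
    W = cd_weight_update(W, zeros, zeros)
print(W[0, 0])   # slightly below its starting value of 5.0
```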
>> Thanks. >> Hi. So, in your example of digits, you
actually tell them--tell the algorithm that there are ten classes.
>> HINTON: Yes. >> So, I wonder, well, what is the impact
if we do not give the correct number? So, yeah.
>> HINTON: Okay. So, what you can do is you can take this auto encoder that goes down
to 30 real numbers and not tell it how many classes there are, just give it the images,
get these 30 real numbers. Then you can take those 30 real numbers and apply dimensionality
reduction technique that Sam Roweis and I have developed, and the latest version of
that, you can lay them out in 2D and you will get 11 classes. And it did that without ever
knowing any labels. You'll get just these 11 clusters, which is close to 10. It often
thinks that the continental sevens are a separate cluster.
>> So you are saying this is [INDISTINCT] you have tried and that's what happened, or?
>> HINTON: I might even have it in this talk somewhere. I might not, though. It's on my--it's--oh,
there you go. That's pure unsupervised on the digits. Now in this case, these are twos
and these are twos. In 30D, it's got the clusters. When you force it down to 2D, it wants to
keep the twos next to each other but it also wants these--these are the spiky twos
and these are the sevens, and it wants those close. And these are the loopy twos and these
are the threes, and it wants those close. But it also wants the threes close to the
eights. And so in 2D, there just isn't enough space to make ten clusters. But look, it made
11 there and if I don't cheat and do this in black and white, you can still see there's
sort of roughly 11 clusters. So, this was pure unsupervised and it found that structure
in the data. So, when psychologists tell you that you impose categories on this data, that they aren't
really there in the world, it's rubbish. I mean, they're really there.
>> So the magic number is 30. Is it--if I choose another number, will it be fine with
it? >> HINTON: If you choose a smaller number,
you might not preserve enough information to be able to keep the classes. And if you
choose a bigger number, then PCA will do it better. So your comparison with PCA won't
be as good. >> Thank you.
>> How does the performance of the digit classification vary according to the number of layers you
are using? >> HINTON: Okay. Obviously, using the number
of layers I showed you is one of the best numbers to use. If you use fewer layers, it
works a bit worse. If you use more layers, it works about the same. I've now got a--I've
got a very good Dutch student who has the [INDISTINCT] he doesn't believe a word I say,
and we will know--he's using like 40 cluster machines and he's going to get the answer
to this. But so far, I'm right that using fewer layers isn't as good, and he hasn't got
to more layers yet. With the same number of layers, he's actually made it
work better, and we'll see if he makes it work better with more layers.
>> Just [INDISTINCT] guess a related question. So, it's clear how to evaluate these models
say if you have some labeled data and [INDISTINCT] you can try to see if you predict it similarly.
But if you try the generative ones, these Boltzmann Machines with, especially, [INDISTINCT]
interactions in the same levels and so on, if I gave you another data set, can you say how
good it is generatively, and is that easy? >> HINTON: Okay.
>> How do you evaluate... >> HINTON: Yeah.
>> ...that kind of part of it? >> HINTON: So, the problem with these Boltzmann
Machines is the partition function, and what you'd love to do is take your data set,
hold out some examples, train your generative model on the training set and then say what
is the log probability of these held out examples? >> Exactly.
>> HINTON: And that would be the sort of gold standard. And that's very hard to do. You
know the log probability up to a constant but you don't know the constant. So, people
in my group and I are working very hard on a method for interpolating between Boltzmann
Machines that allows you to use a Boltzmann Machine with zero weights which is a pretty
dumb model and then gradually change the weights towards the Boltzmann Machine that you eventually
learned and you can get the ratio of the partition functions of all these Boltzmann Machines
so in the end, you can get the partition function. You can get a pretty good estimate. This is
called--it's a version of [INDISTINCT] importance sampling, something called bridging. And we think we're
going to be able to get pretty accurate estimates of the partition function now by running for
like, you know, a 100 hours. >> Yes. Yes.
>> HINTON: You do this after you've learned just to show how good you are. But the other
thing you can do is you can generate from the model and you can see that the stuff it
generates looks good and you can then take the stuff you generated from the model and
you can apply statistical tests to that and statistical tests to the real data and statistical
tests to the other guy's data, the other guy's generated data. And if you choose the right
statistical test, you can make the other guy's data look terrible.
>> Okay. Okay. I think we're out of time now. I'd like to thank Geoff again and...