Learning From Examples Using Quantum Annealing (Google Workshop on Quantum Biology)


Uploaded by GoogleTechTalks on 28.10.2010

Transcript:
>> NEVEN: In this talk, I want to come, sort of, from the other side, essentially the user side of quantum computing, in particular quantum annealing. Earlier this morning, I gave the basic motivation: how can you do certain key tasks in artificial intelligence better by using quantum resources? We have looked in particular at learning, supervised learning, and at how you can do supervised learning better using quantum annealing. I should say that a lot of this work has been done by Vasil Denchev at Purdue; we collaborate closely with the folks from D-Wave, Geordie, who talked earlier, and Bill Macready, who was also in his previous life a learning person. We get some help from Edward Farhi and his student David Gosset on the quantum Monte Carlo simulations we conducted, and we get some help from Vishwanathan, also at Purdue; he's a specialist on using optimization methods for machine learning. Maybe I'll skip the outline and immediately jump in; this is not too small, it should be readable. I assume a part of you is familiar to some degree with machine learning, but many of you won't be, so we'll discuss the simplest situation in machine learning: binary classification. So you have
two types of patterns. A good example would be face detection. You know, I give you pictures, and some pictures contain a face and other pictures don't. And then I ask you, face or no face; or rather, I ask the algorithm, face or not a face. But it can
be many other things. It can be a vector of data from some medical diagnosis and then
the machine could say, "Should you do a secondary diagnosis?" or "no problem here" or it could
be some credit report, and then the algorithm could say "issue credit or not." So binary classification is sort of the staple of machine learning, and all it really is, is a binary classifier: once it's trained, it takes input patterns 'X' and maps them onto outputs 'Y', which are (-1, +1) binary labels. Typically, the input vectors, like, let's say, an image, live in high-dimensional spaces. That's of course a large part of the problem we are facing, so you can, of course, not
visualize this. So here's a simple visualization where, let's say, I just have two features, two elements in the vector. So the blue guys are, let's say, the faces and the red guys are the non-faces. And then doing classification means you want to find a function that partitions the space into two areas, the +1 and the -1 area, and hopefully you do it such that you don't have any of the red examples being classified as '+' and
vice versa. Now of course, this can be more difficult, with more rugged boundaries. And again, one danger with all of these pictures, and it's a basic lesson that unfortunately you never internalize well enough, is that we are playing in very high-dimensional spaces, and geometry in very high-dimensional spaces is just very different. So a lot of intuitions drawn from these little diagrams tend to mislead. So now, how do we find these separating functions that separate the positive from the negative examples? So,
modern machine learning literature likes to state training of a classifier as an optimization
problem. So typically, you start out with a classifier 'H(x)' but it's not just 'H(x)',
it has a set of parameters, think of them as little knobs, typically called the weights,
and if you change the weights, then this separating function changes and what gets labeled as
plus or minus one on the 'Y' side changes. So you can, if you go back to here, let's
say, one set of weights would maybe have produced this rugged separating line, and another set of weights would have produced this smoother line; and then training means you want to find
a good set of weights, such that this gets nicely separated. But there's one more complication
in machine learning, and in some ways that makes it a difficult area to apply quantum computing to. Why? In supervised learning, the way this typically works is that you take your data and separate it into two sets. One is the training set; the other part is your holdout set. And then on the training set, you do what we will discuss in a second; you use
it to form your classifier. But of course, what you're really interested in is not making
very few mistakes on the training set. That is of course a necessary condition: if you are already bad on your training data, you cannot expect to be good on your test data. But unfortunately, the converse is not true, meaning that if you are perfect on your training data, you can still make a lot of mistakes on your test data. So you need to set up your optimization problem such that you generalize well, because what you're really after is new, unseen examples; on those you want to have small error rates. And one way to achieve this is to set up a training objective that typically consists of two parts. So again, training means finding a good set of knobs that characterize the classifier, and the optimization problem that you set up has two components. One is the so-called loss function, and the second term is the regularization. The loss term is fairly easily understood: loss is just a measure of how well I do on my training data. Now, to appreciate what we do later, I need to discuss with you the different types of loss functions. And to appreciate those different loss functions, for those who are not machine learning people, I need to explain one item, and that's the so-called 'margin.' The margin is essentially the smallest distance
to the separating function between the two classes. And you can imagine that the larger
the margin, the more robust you are and the better you will probably generalize. Imagine if I had put the separating line just very close here; then a little bit of noise on the red data would have put it on the other side of my line, and I would have misclassified it. So, what we try to do is maximize the margin; that's typically considered good for classifiers because you get better generalization. And now, the most naive way to
measure, on a training set, of "How good is my classifier doing?" is to just count. Whenever I make a mistake, that is, I have a negative margin, I'm on the wrong side, I count one error, another one, and so on. If you classified it correctly, then you say that's fine. So that's called the zero-one loss: simply counting the number of errors you make on your training set. But
then there are the other losses. There are different ways to motivate them, but for starters: if you set up your loss function as zero-one loss, then it can be shown that the resulting optimization problem is formally NP-hard. Machine learning theorists of course didn't like this, and they use instead what the optimization people call relaxations, convex relaxations. They don't work with the non-convex zero-one loss; instead, let's say, they approximate it by some upper bound like an exponential or quadratic function. That on the one hand gets you off the hook, it's not NP-hard anymore, and it's not necessarily all that bad. You get other advantages too, because the zero-one loss is not margin-maximizing, while those convex relaxations are. You see, the larger your margin is, the lower your objective value; and if you try to minimize the objective value, then a large margin gets rewarded.
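To make the margin idea concrete, here is a small illustrative sketch (mine, not from the talk) of the 0-1 loss and two common convex relaxations, each written as a function of the margin m = y * f(x):

```python
import numpy as np

# Margin of one example: m = y * f(x). Positive means correctly
# classified; a larger positive margin means a more robust decision.
def margin(y, f_x):
    return y * f_x

# 0-1 loss: count an error whenever the margin is non-positive.
# Non-convex, and training with it is formally NP-hard.
def zero_one_loss(m):
    return np.where(m <= 0, 1.0, 0.0)

# Two common convex relaxations (upper bounds of the 0-1 loss):
def square_loss(m):        # quadratic in the margin
    return (1.0 - m) ** 2

def exponential_loss(m):   # used, for example, in boosting
    return np.exp(-m)

ms = np.linspace(-2, 2, 9)
print(zero_one_loss(ms), square_loss(ms), exponential_loss(ms))
```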
So essentially, the further away your samples get from the separating line, the more you'll
like this. So that's a reasonable measure to say how good is your classifier coming
along. But there's another term, as I said. The first term, when you try to find the right set of knobs, is to minimize the loss. The second is regularization; that's a little bit harder to understand, but it essentially makes precise the notion that the simplest explanation tends to be the best one. It's probably easiest to understand this in the realm of curve fitting. So here's this green curve; the process has generated some data for me, it's a little bit noisy, and I want to fit those dots that came out of it. And I want to fit it with a polynomial, and I have different polynomials of order zero, order one, and so on. And you see, if the model with which I try to fit this is too simple, let's say order zero, fitting the data with a constant function, that of course cannot work too well; and here, with order one, a linear function, I will still do poorly. Here, order three seems to be good. But you don't get better and better as you make your model more complex. You know, at order nine you do, as I mentioned earlier, perfectly on your training data, but this will generalize very poorly. You can imagine, I'll give you a new example and
it will be completely off. And so, there are a lot of adages about this, as when Einstein said you should make things as simple as possible, but not simpler. You can see this here: this is too simple, and here it doesn't pay to make your model overly complex. You know, Occam's razor: the simplest explanation tends to be the best one. And in machine learning theory there are actually theorems showing that if you pick a classifier with lower complexity, then the bounds on the generalization error will be lower. One way, for example, to choose a regularization term is to put some norm on W and say, I want this norm to be small. For example, you can say, "I want the zero norm to be small." That simply means you switch certain parts of your classifier off, so you make it as compact as possible. But actually, this choice again would render the problem formally NP-hard. So, what we see is that for important choices of the loss function and the regularization, the training problems are formally NP-hard. I have discussed enough with complexity people to know that you'd better always put "formally" in front of NP-hard, because we don't know whether, let's say, a real-world
recognition problem like face recognition or car recognition is truly, genuinely NP-hard. I don't know. But that doesn't really help us all that much, because to the algorithms it looks like an NP-hard problem. So now, this is a good motivation to look at quantum algorithms that, again, are not known to solve NP-hard problems exactly, but it's very reasonable to expect that they give us better solutions than are classically available.
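As a minimal sketch of the kind of training problem just described (the names H, y, lam and the brute-force search are my own stand-ins, not the talk's notation): squared loss on the margins plus a zero-norm regularizer over binary weights, which makes the search combinatorial.

```python
import numpy as np
from itertools import product

def objective(w, H, y, lam=1.0):
    """Regularized training objective over binary weights w.

    H: weak-classifier outputs, shape (n_examples, n_weights);
    y: labels in {-1, +1}. Square loss on the margins plus a
    zero-norm regularizer that rewards compact classifiers.
    """
    margins = y * (H @ w)
    return np.sum((1.0 - margins) ** 2) + lam * np.count_nonzero(w)

def train_exact(H, y, lam=1.0):
    # Brute force over all 2^n binary weight vectors; only feasible
    # for a handful of weights, which is exactly the hardness at play.
    n = H.shape[1]
    return np.array(min(product([0, 1], repeat=n),
                        key=lambda w: objective(np.array(w), H, y, lam)))
```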
And so, a quick excursion to optimization. In optimization, you have an objective function and you want to find the minimum of this function; that's an optimization problem. And this was the picture from this morning. Here I have this video, and I apologize, several people here have seen this already many times, but I keep showing it to give people who are not familiar with it the basic idea of a quantum resource that can help you find the global minimum here. Classically, this ball could only have rolled down and would have been stuck in a local minimum. If you have a second resource, tunneling, then of course, if this barrier had been too fat, the tunneling amplitude would have been very small and the advantage would have been tiny. But that actually is the beauty of the adiabatic quantum algorithm: it deforms its landscape slowly, and hence, at any point in time, you have a good chance of having a reasonable tunneling amplitude.
And overall, of course, physics determines what optimization problems you can solve, and bringing in this additional resource of quantum tunneling expands the reach of your optimization algorithms. So that was a bit the Scientific American view; Geordie, this morning, didn't quite have the time to explain how the D-Wave chips really work, and they actually do exactly this process we just looked at. But here's a slightly nicer quantum explanation. Quantum mechanics got its name from certain entities in nature, such as energy, being quantized. The key idea in adiabatic quantum computing is that you start out with an optimization problem, the beginning Hamiltonian as it is often referred to, which has only one easy-to-find global minimum, so any algorithm gets you there. And then you slowly deform it until the Hamiltonian has the shape that encodes the problem you're really interested in. And now, if you do this the right way, the following happens. Here is the quantized energy spectrum of the beginning Hamiltonian, and you start out in the ground state. And now I start to deform it, and by time one, I am in the Hamiltonian that has the shape I'm really interested in.
You see, there's a difference in energy between the ground state and the first excited state; and if you zoom in a little closer, you can see this entity here, which is called the minimum gap. If I do this evolution, meaning the transformation from the beginning Hamiltonian to the end Hamiltonian, carefully enough that I don't inject enough energy into the system for it to jump into the next excited state, then at all points you stay in the ground state; and at the end, you are still at the minimum of the function you're interested in. So that's the idea of adiabatic quantum computing. And of course,
as you can imagine, the critical entity here is the minimum gap: the smaller it is, the more carefully, meaning the more slowly, you have to do the evolution. And then of course there's a question that is currently under hot debate, with many people looking at it: for certain problems, how does the gap scale as, let's say, my learning problem gets harder, as I look at more examples or larger classifiers? As my problems get larger, how does the gap shrink? If it shrinks exponentially fast, then I'm screwed; then adiabatic quantum computing would probably not give me good solutions. If it only shrinks polynomially, that bodes well. I think I'll skip this slide;
D-Wave has built a system that executes this. And of course, we haven't discussed yet what Hamiltonian, what energy landscape, we are realizing. I don't know whether Geordie has shown this; this is my rendition of what happens inside the D-Wave chip. Essentially, what it realizes is sort of the simplest many-body physical system, known as the Ising model. The qubits are organized on a grid, and they are nearest-neighbor connected. Actually, that's not quite true; they're nearest-neighbor-plus: you saw the connections earlier, so there are connections beyond just nearest neighbors. But so, what
you do now is start with a strong, what's called transverse, magnetic field. So that's your beginning Hamiltonian; all these guys are randomized. And now you fade in those connections between the qubits, which are here the blue and the red arrows; they really encode your problem, and you fade out the transverse magnetic field. And then at the end of this evolution, some spins will be up, some spins will be down; and this you can read out; this is the solution to your problem. Again, programming this chip means
you give it the blue and the red arrows and say how strongly negative or positive they're supposed to be. And actually, there are also linear terms, called the bias terms, that bias each qubit, saying I'd rather want it to be up or down. So, what this creates is a new computational resource. A physicist would say that it realizes the Ising model, as a quantum process for finding the minimum, the ground state, of the Ising energy function; a mathematician would prefer to call this a quadratic binary optimization problem.
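For concreteness, a small sketch of the quadratic function in question (the variable names are mine): couplings J on the edges, the blue and red arrows, and per-qubit biases h.

```python
import numpy as np

def ising_energy(s, J, h):
    """Energy of a spin configuration s with entries in {-1, +1}.

    J holds the pairwise couplings (the blue/red arrows), h the
    per-qubit bias terms; the annealer looks for the configuration
    that minimizes this quadratic function.
    """
    return s @ J @ s + h @ s

# Tiny example: two ferromagnetically coupled spins plus a bias.
J = np.array([[0.0, -1.0],
              [-1.0, 0.0]])
h = np.array([0.2, 0.0])
for s in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(s, ising_energy(np.array(s), J, h))
```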
And using this new computational resource on a problem like machine learning essentially means you have to take your problem and map it such that it fits the format of a quadratic optimization problem. Now, this may look a little bit limiting. For example, what if I need more than quadratic, say third-order or fourth-order terms? Then, yes, by introducing auxiliary variables, you can map it back onto quadratic. Also you may say, "Oh, I don't like this nearest-neighbor business; I need to connect the qubits in a wider way." Yes, by introducing auxiliary variables, you can also bring this back onto the existing architecture. So the good news is, in principle, pretty
much any problem you can map onto the Ising format. Unfortunately, the fine print is that you need auxiliary qubits, and often you need polynomially many. So, sometimes, let's say, if you have a third-order interaction and a hundred variables, then you would need a hundred squared, ten thousand, auxiliary qubits to get rid of this third-order interaction. And at the given moment, when we are talking about hundreds of qubits being available in the chip, that can of course be a deal breaker. So, that is also why software guys like us are needed: you have to give some thought to how you map your problem, and carefully look for problems that map well onto this architecture without the need of too many auxiliary qubits.
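One standard version of that auxiliary-variable trick, sketched here on a cubic term (this particular penalty is the textbook reduction, not necessarily the one used on the chip): introduce a binary y meant to equal x1*x2, so x1*x2*x3 becomes the quadratic y*x3 plus a penalty enforcing the identification.

```python
from itertools import product

# Penalty that is zero exactly when y == x1 * x2 and at least one
# otherwise, for binary x1, x2, y. Adding it (scaled up enough) to
# the objective lets y*x3 stand in for the cubic term x1*x2*x3.
def penalty(x1, x2, y):
    return x1 * x2 - 2 * (x1 + x2) * y + 3 * y

for x1, x2, y in product([0, 1], repeat=3):
    p = penalty(x1, x2, y)
    assert (p == 0) == (y == x1 * x2) and p >= 0
print("reduction verified over all binary assignments")
```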
So now, I go back to training a binary classifier. And I know some of you have heard this talk before; here comes something new now, and I'm pretty happy about it. We sort of taught this thing a new trick. So we want to train a classifier of this nature: we take the sign of a sum over w_i h_i. The h_i are often referred to as weak classifiers or features. Again, the y is the plus-one, minus-one output. The w's, in our case, we choose to be binary zero-or-one weights, and this is the set we want to optimize. The overall classifier is often referred to as the strong classifier. And then we set up the optimization. We have a set of training
examples, and using those, we will set up our training objective. I discussed the two terms a little bit earlier. You remember: there's the loss, which measures how well I am doing on the training data, and the regularization, which essentially says, try to keep your explanation as simple as possible. So, you may ask, which loss or which regularization is best? Unfortunately, there's no good answer to this. It depends on the structure of your noise and the data. There is something called the no-free-lunch theorem in machine learning, which shows that there is not a single training objective that is better on all datasets than every other
machine learning algorithm. So, we looked at something very specific: this new work looks at how we can deal with outliers. You see, again those two clouds, but we put a few outliers in there. And that's a situation that happens all the time. Let's say, to train something like a face detector, we're talking about on the order of a hundred thousand to two million images; and of course, you can imagine, wrong ones creep in there, or just very unlikely faces. And faces are probably not even the best example: we looked at datasets for optical character classification, and there were plenty of wrong labels in there. It's just a fact of life. And that's actually something an intelligent system should cope with. In elementary school, you teach the kids, but maybe once in a while you say something wrong, and the smart kid will eventually figure out, "Hey, that's not right." So, we want our classifier to be able to do this too: to find what the outliers are and discard them. So how do we do this?
Here, I just want to motivate that with the typical losses you cannot do this. Let's say we had used just a square loss. What I have put in here is the optimal possible classifier, the one with the best margins; it's called the Bayes classifier. The Bayes classifier is sort of the gold standard; that's as good as you can do. If we had used the square loss, then, if you remember, the square loss really doesn't like big negative margins, and the outliers, of course, create big negative margins. So the optimization would put a lot of emphasis on making these big negative margins not quite as bad, dragging the separating line toward the outliers. So, square loss wouldn't do well here, but how
about 0-1 loss? 0-1 loss essentially just pays attention to "let's get everything right." So in these metaphorical pictures it would maybe create a classifier like this: here I get roughly the right separation, but over here, because it really doesn't like to get anything wrong, it puts a little box or fence around the outliers and treats those as a different class. And of course, when I then see new, real test data, it might fall into this box, and I make mistakes there. So you can see that none of the traditionally used loss functions would do any good here. So what we did is create a family of loss functions that can be tuned to any given dataset such that it operates reasonably well. How do we do this? So we
start out with square loss. If I could cover this part here, this would just be the square error (with a one over S it would be the mean square error); but by introducing this binary variable, we allow certain labels to be flipped. So you can essentially look at an example and say, "You know what? I don't think it's a positive one; I think it's a negative one." The variable can be zero or one, but we want to put a cost on this: you cannot willy-nilly just change everything; it comes at a cost. You have to think about when you want to say, "This example is probably from the other class." And we also introduced a second parameter, which, given the time, I should maybe not bother to explain. So essentially, what is happening is that we have now a two-parameter family of loss functions. We call
it the new family of loss functions. You see a generic example here, and you can essentially see that the loss is really low if you have a positive margin and you classified the example correctly. The cost you pay gets higher and higher as the margin becomes more negative, meaning you misclassified it. And at some point the pain becomes so large, it just doesn't fit into my picture anymore, that the system says, "You know what, screw it, I flip it." And you can see the point at which the system will flip it. It comes at a cost, but the cost doesn't keep growing the way an exponential loss or a square loss would, so you can limit your damage, so to speak.
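Qualitatively, the behavior just described can be sketched as a capped square loss (the cap constant here is my own stand-in; the actual family has two tunable parameters): past the point where flipping the label is cheaper, the loss stops growing.

```python
import numpy as np

def flip_tolerant_square_loss(m, flip_cost=4.0):
    """Square loss capped by the option to flip the label.

    m: margin y * f(x). Keeping the label costs (1 - m)^2; declaring
    the example an outlier costs a flat flip_cost. Taking the cheaper
    option limits the damage an outlier can do to the objective.
    """
    return np.minimum((1.0 - m) ** 2, flip_cost)

ms = np.linspace(-4.0, 1.0, 6)
print(flip_tolerant_square_loss(ms))  # flat at flip_cost for bad outliers
```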
So if you look at the complete family, again, it's a two-parameter family. Here we looked at a generic loss, but you can see I can find pairs of parameter values where it becomes the square loss, or a truncated square loss, or where we nicely approximate the 0-1 loss. So those are all contained in the family. And now, we did a neat experiment with this. We did the following thing: we
start out with a toy experiment, and it goes like this: you start out with a classifier, the sign of a sum of w_i h_i; we dial in a random set of w's, so some are one, some are zero; and then we give the system input values X, and it generates output Y's for us. That generates a set of training examples we can use. And what we want to do is recover the original: can we learn what the original classifier was that produced the data for us? And to make it a little bit meaner, we flipped some of the labels. Now you can easily see, or maybe just trust me on this, that if you had used
0-1 loss or square loss or any of the other known losses, you would not find this classifier
back. Let's maybe just discuss it for 0-1 loss: 0-1 loss could have retrieved it in the absence of label noise, because then it wouldn't want to get anything wrong, so on the training set it would have to do exactly the same thing as the original classifier. But as soon as you add some noise, it is a loss that tries to get this noise right. So we asked the question, "Can we retrieve it?" We used our new family of loss functions, and
we played with different parameter settings; there's a technique called cross-validation for finding those meta-parameters. And we had to keep the problem small, because this is formally NP-hard and there's no good heuristic solver for it, so we had to use an exact solver; so we kept it at 16. We had a strong classifier of 16 weak ones and 200 examples. And we added 10% label noise, meaning randomly flipped labels; and then we did 60,000 runs, and in 99.9% of the cases of this toy problem, admittedly, we were able to retrieve the original classifier.
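In spirit, the experiment can be re-created with a scaled-down toy (all sizes, the loss cap, and the brute-force trainer below are my own stand-ins; the talk used 16 weak classifiers, 200 examples, and an exact solver):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_weak, n_examples = 8, 200                 # scaled-down stand-ins

H = rng.choice([-1.0, 1.0], size=(n_examples, n_weak))  # weak outputs
w_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])             # hidden weights
y = np.sign(H @ w_true + 0.5)                           # +0.5 breaks ties

noisy = y.copy()
noisy[rng.random(n_examples) < 0.10] *= -1              # 10% label noise

def objective(w):
    if w.sum() == 0:
        return np.inf
    m = noisy * (H @ w) / w.sum()                       # normalized margins
    loss = np.minimum((1.0 - m) ** 2, 4.0).sum()        # outlier-tolerant loss
    return loss + 0.1 * w.sum()                         # small zero-norm term

# Exhaustive search over all 2^8 binary weight vectors; with this
# setup the hidden weights are typically recovered despite the noise.
best = min((np.array(w) for w in product([0, 1], repeat=n_weak)), key=objective)
print("hidden weights recovered:", np.array_equal(best, w_true))
```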
Now, I'm not sure to what degree you're amazed by this; it's actually pretty neat. There's no system in typical use that would have been able to do this. Essentially, you hand your system a bunch of data, put noise on top of it, and ask it, "Please make sense of it. Try to find the best possible explanation for this data." And the system was able to do this. So that's the new piece we did. I think I'm out of time. I wanted to show you, maybe if you forgive
me, some of the older work that people hadn't seen. So, just using the square loss, we had done in winter an experiment on trying to classify cars: we trained a car detector, and we used the actual D-Wave hardware. At the time, it had 52 variables available; we set up our training objective, solved it on the D-Wave hardware, and used 20,000 images from Street View scenes. The chip was supposed to look like this, with 128 qubits, but this was a very early one, so not all qubits were functional and also not all the connections existed. So on this somewhat beaten-up graph, we trained our classifier; and to our delight, we found, if we compare this hardware solver to heuristic solvers that would run tabu search on a conventional machine, that we were able to eke out not a tremendous, but nevertheless
a small advantage. I will skip the quantum Monte Carlo piece, and maybe just conclude with what you saw. So far, it seems to us that quantum optimization is a useful new resource for intelligent systems. Between the older work and the newer work, I think we were able to show that training maps well onto quantum optimization. We found that we get lower generalization error; we get more compact classifiers, which means the classifier executes faster; we could train faster, using fewer training cycles; and the new result, the new achievement, is that we can better cope with outliers. So this is just a motivation for why intelligent systems really could benefit from quantum optimization. Is it really useful? I had a slide on this,
but I know it's late, so thank you everybody. >> Thank you. Are there any questions for
Hartmut? Could you use the microphone, please? >> So, where is the Ising model in all of
this? And is it a quantum or classical Ising? Is it a 2D or 3D Ising model, if it's there in your algorithm? >> NEVEN: Yeah, the Ising model comes in, I guess, in the hardware. It's a quantum Ising model, and it's not a planar graph. Actually, I think there's a result showing that the planar case can be solved efficiently by a classical algorithm. So it's not a planar graph, if that's what you mean by a 3D Ising model. >> Yeah, yeah, yeah. And a thought on this, which I don't know the answer to,
but a friend of mine suggested there might be a similar concept at work in the NMR of proteins, because you have something very similar: you can describe it with the Ising model, the spins of the protons, and you have the transverse field, and you have the different frequencies sweeping the range. So the question is, can you think of a protein as your quantum computer? >> NEVEN: Yeah, this is what we love about the Ising
model: as I said, it's the simplest many-body model known to physicists, and many physical systems we see can, to a first approximation, be modeled by an Ising model. And Eric once pointed out to me that a quantum Ising model is powerful enough that pretty much any piece of the universe as far as we know it right now, maybe barring quantum gravity effects, can be mapped onto a quantum Ising model and be properly described. Which is not true for, let's say, a Google server center, which couldn't even describe a single protein, as you know. >> Luca. >> This is not so much a question as a statement. I wanted, probably
on behalf of everybody, to thank you for organizing this. >> NEVEN: Thanks and same to Stuart,
and Annibonn and Geordie and... >> It's a daring thing to do and it worked, so thank
you very much. >> NEVEN: Thank you. Okay, shall we...I think after this it's not good
to have any more questions. >> Well, thank you. Let's thank Hartmut one last time, thank
you very much Hartmut.