Uploaded by GoogleTechTalks on 28.10.2010

Transcript:

>> NEVEN: In this talk, I want to come, sort of, from the other side, essentially

the user side of quantum computing, in particular quantum annealing. And we have studied earlier

this morning; I sort of gave a basic motivation. How can you do certain or key tasks in artificial

intelligence better by using quantum resources and we have looked in particular at learning,

supervised learning. And how can you do supervised learning better using quantum annealing. I

should say that a lot of this work has been done by Vasil Denchev at Purdue; we collaborate

closely with the folks from D-Wave, Geordie who talked earlier and Bill Macready, who

was also in his previous life a learning person. We get some help from Edward Farhi and his student David Gosset on the quantum Monte Carlo simulations we conducted, and we get some help from Vishwanathan, also at Purdue; he's a specialist on using optimization

methods for machine learning. Maybe I'll skip the outline and immediately jump to--this

is not too small, it should be good. I assume sort of half of you are--a part of you is familiar to some degree with machine learning, but many of you won't be, so we'll discuss the

simplest situation in machine learning. So there's binary classification. So you have

two types of patterns. As a good example would be face detection. You know, I give you pictures

and some pictures contain a face and other pictures don't. And then I'll ask you face

or no face--or I'd rather ask the algorithm: is it a face or not a face. But it can

be many other things. It can be a vector of data from some medical diagnosis and then

the machine could say, "Should you do a secondary diagnosis?" or "no problem here" or it could

be some credit report and then the algorithm could say "issue a credit or not." So binary

classification is sort of the staple of machine learning and all it really is, is a binary

classifier. Once it's done, it takes input patterns 'X' and maps them onto output 'Y'

and yeah, (-1, +1) binary labels. So here, a depiction: typically these input vectors, like let's say an image, live in high-dimensional spaces.

That's of course a large part of the problem we are facing, so you can, of course, not

visualize this. So here's a simple visualization where you have--let's say I just have two

features or two elements in the vector. So the blue guys are, let's say, the faces and

the red guys are the non-faces. And then doing classification means, you want to find a function

that divides--partitions the space in two areas, the +1 and the -1 area, and hopefully

you do it such that you don't have any of the red examples being classified as '+' and

vice-versa. Now of course, this can be more difficult, with more rugged boundaries. And again, one danger with all of these pictures--it's sort of a basic lesson that unfortunately you never internalize well enough--is that we are playing in very high-dimensional spaces, and geometries in very high-dimensional spaces are just very different. So a lot of intuitions drawn from these little diagrams tend to mislead. So now, how do we find these separating functions that separate the positive from the negative examples? So,

modern machine learning literature likes to state training of a classifier as an optimization

problem. So typically, you start out with a classifier 'H(x)' but it's not just 'H(x)',

it has a set of parameters--think of them as little knobs--typically called the weights

and if you change the weights, then this separating function changes and what gets labeled as

plus or minus one on the 'Y' side changes. So you can, if you go back to here, let's

say, one set of weights would have maybe done this rugged separating line and another set

of weights would have done this smoother line, and then training means you want to find

a good set of weights, such that this gets nicely separated. But there's one more complication

in machine learning and in some ways that makes it maybe a difficult area to apply quantum

computing to. Why? In supervised learning, how this works is: typically, you take your data

and you separate it into two sets. One is a training set; the other part is your holdout

set. And then on the training set, you will do what we will discuss in a second; you use

it to form your classifier. But of course, what you're really interested in is not making

very few mistakes on the training set. That is of course a good condition or a good--or

if you are already bad with your training data, you cannot expect to be good on your

test data. But unfortunately, the converse is not true meaning that if you are perfect

on your training data, you can still make a lot of mistakes on your test data. So you

need to set up your optimization problem such that you generalize well, because what you're

really after, is new unseen examples on those you want to have little error rates, small

error rates. And one way to achieve this is to set up a training objective that typically consists of two parts. So again, training means to find a good set of knobs that characterize the classifier, and the optimization problem that you set up has two components. One is the so-called loss function, and the second term is the regularization. So the loss term

is fairly easily understood. So loss is just a measure on how good do I do on my training

data. Now to appreciate a little bit more on what we do later, I need to discuss with

you a little bit the different types of loss functions. And to appreciate those different loss functions, for those who are not machine learning people, I need to explain to you one item, and that's the so-called 'margin.' The margin is essentially the smallest distance

to the separating function between the two classes. And you can imagine that the larger

the margin, the more robust you are, the better you will probably generalize. Imagine if I

would have put the separating line just very close here, then the little bit of noise on

the red data would have put it on the other side of my line, and I would have misclassified it. So, what we try to do is maximize the margin. That's typically considered good

for classifiers because you--better generalization. And now if you look, the most naive way to

measure on a training set, "How good is my classifier doing," is you just count. So whenever

I do a mistake, that is I have a negative margin, I'm on the wrong side. You know, and

then I say, "Okay, +1." You know, you get a count one error, another one, and so on.

If you classified it correctly, then you say that's fine. So that's called the zero-one

loss; simply counting the number of errors you make on your training set, you know. But

then the other losses--first off, there are different ways to motivate them but maybe

for starters, if you set up your loss function as zero-one loss, then it can be shown that

the resulting optimization problem is formally NP-hard. So, machine learning theories--theorists

of course didn't like this and they use instead what the optimization people call relaxations,

convex relaxations. They don't work with the non-convex zero-one loss but instead, let's

say, they approximate it by some upper bound like an exponential or quadratic function.

And that on one hand gets you off the hook; it's not NP-hard anymore, and it's not necessarily

all that bad. You get other advantages also because the zero-one loss is not margin maximizing,

while those convex relaxations are. You see, the larger your margin is, the lower your

objective value and if you try to minimize the objective value, then it gets rewarded.
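The loss shapes just described can be sketched in a few lines (my own illustration, not from the talk; the specific formulas for the square and exponential losses are standard textbook choices): the 0-1 loss just counts errors, while the convex relaxations upper-bound it and keep rewarding larger margins.

```python
import math

def zero_one_loss(margin):
    # Count an error whenever the margin y*H(x) is negative (wrong side).
    return 1.0 if margin < 0 else 0.0

def square_loss(margin):
    # Convex relaxation: penalizes small and negative margins quadratically.
    return (1.0 - margin) ** 2

def exponential_loss(margin):
    # Convex relaxation used, for example, in boosting.
    return math.exp(-margin)

# Both relaxations upper-bound the 0-1 loss at every margin...
for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert square_loss(m) >= zero_one_loss(m)
    assert exponential_loss(m) >= zero_one_loss(m)

# ...and unlike the 0-1 loss, they keep decreasing as the margin grows,
# which is what makes them margin-maximizing.
assert zero_one_loss(2.0) == zero_one_loss(0.5)
assert exponential_loss(2.0) < exponential_loss(0.5)
```

Minimizing the relaxed losses therefore pushes training examples away from the separating line, which the raw error count never does.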

So essentially, the further away your samples get from the separating line, the more you'll

like this. So that's a reasonable measure to say how good is your classifier coming

along. So--but there's another term as I said. So the first term is, when you try to find

the right set of knobs, is to try to minimize the loss. The second is regularization; that's

a little bit harder to understand, but it essentially makes precise the notion that the simplest explanation tends to be the best one. And it's probably easiest to understand this in the realm of curve fitting. So here's this green curve: the process has generated

some data for me and it's a little bit noisy, and I want to fit those dots that came out of there. And I want to fit it with a polynomial, and I have different polynomials: of order zero, of order one and so on. And you see, if my model with which I try to fit this is too simple--let's say that's order zero, and I try to fit this data with a constant function--of course that cannot work too well. And then here, if it's order one, the linear function, I will still do poorly. Here, order three seems to be good, but you don't get better and better as you make your model more complex. You know, if you go to order nine, then you do--as what I've mentioned earlier--perfectly on your training data,

but this will generalize very poorly. You can imagine, I'll give you a new example and

it will be completely off. And so, there are a lot of adages and sayings about this--as when

Einstein said, "You should make things as simple as possible but not simpler." You can

see this here. This is too simple, now and here it doesn't pay to make your model overly

complex. You know, Occam's Razor, the simplest explanation tends to be the best one and in

machine learning theory there are actually theorems that show that if you pick a classifier with lower complexity, then the bounds on generalization error will be lower. Okay,

so--and yeah, I didn't give you my example. One way for example to choose a regularization

term is you put some norm on W and say I want this norm small. For example, you can say,

"I want the zero norm to be small." That simply means you switch certain parts of your classifier

off, so you make it as compact as possible. But actually, this choice again would have rendered the problem formally NP-hard. So, what we see is that for important choices

of the loss function and the regularization, the training problems are--and I have discussed enough with complexity people to know that you'd better always put "formally" in front of it--formally NP-hard; because we don't know whether, let's say, a real-world recognition problem like face recognition or car recognition is truly, generically NP-hard. I don't know. But that doesn't really help us all that much, because to the algorithms it looks like an NP-hard problem. So now, this is a good motivation to try to look at

quantum algorithms that, again, are not known to solve NP-hard problems exactly; but it's

very reasonable to expect that they give us better solutions than classically available.

And so, a quick excursion to optimization. So, optimization, you have an objective function

and you want to find is the minimum of this function, that's an optimization problem.

And this was the picture from this morning, here I have this video and I apologize, several

people here have seen this already many times. And I keep showing it again to give people

who are not familiar with it, the basic idea of quantum resource that can help you better

with finding the global minimum here. Classically, you could only have--this ball could have

only rolled down and would have been stuck in a local minimum. If you have a second resource, tunneling--then of course, if this barrier had been too fat, the tunneling amplitude would have been very small and the advantage would have been tiny. But that actually is the beauty of the adiabatic quantum algorithm: it deforms its landscape slowly. And hence,

at any point in time, you have a good chance of having a reasonable tunneling amplitude.

And overall, of course, physics determines what optimization problems you can solve.
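The "ball rolling downhill" picture can be made concrete with a tiny sketch (my own toy example, not from the talk; the double-well function and step size are assumptions): plain gradient descent settles into whichever valley it starts near, which is exactly the failure mode that tunneling is meant to address.

```python
def f(x):
    # Tilted double well: local minimum near x ~ +0.96,
    # lower global minimum near x ~ -1.04.
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x):
    # Derivative of f.
    return 4.0 * x * (x * x - 1.0) + 0.3

# Purely "classical" descent: the ball only ever rolls downhill.
x = 0.5
for _ in range(10000):
    x -= 0.01 * grad(x)

# Starting in the right-hand basin, descent is stuck at the local minimum,
# even though the left-hand well has lower energy.
assert x > 0.8
assert f(x) > f(-1.0)
```

A resource like tunneling, which can pass through the barrier between the wells, is what a purely downhill method is missing here.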

And bringing in this additional resource of quantum tunneling expands the reach of your optimization algorithms. And so, this was a bit the Scientific American view. Geordie, this morning, didn't have time to show a figure to explain how the D-Wave chips really

work. And they actually do exactly this process we just looked at. But here's a little bit

nicer quantum explanation. And quantum mechanics got its name from certain entities in nature,

such as energy being quantized. So, what you do here is--oh, I should have explained this

also here. The principle here, or the key idea, in adiabatic quantum computing is that

you start out with an optimization problem at the beginning--the beginning Hamiltonian, as it's often referred to--which has only one easy-to-find global minimum. So any algorithm gets you there.

And then, you slowly deform it until the Hamiltonian has the shape that encodes the problem you're really interested in. And now, if you do this the right way,

the following happens. So, here is the energy spectrum, quantized, for the beginning Hamiltonian, and you start out in the ground state. And now, I start to deform it, and by time one, I am, sort of, in the Hamiltonian that has the shape I'm really interested in.

If I do this carefully enough, such that--you see, there's a difference in energy between

the ground state and the first excited state. And actually, if I look--look in them, zoom

in a little closer, you can see this entity here which is called the minimum gap. If I

do this evolution, meaning the transformation from beginning Hamiltonian to end Hamiltonian,

carefully enough that I don't inject enough energy into the system for it to jump into the next excited state, then at all points you would stay in the ground state; and then at the end, you would still be at the minimum of the function

you're interested in. So that's the idea of adiabatic quantum computing. And of course,

as you can imagine, the critical entity here is the minimum gap. If the minimum gap is

too small, you have to--the smaller it is, the more careful you have to do it, means

the slower you have to do it. And then of course, there's a question that is currently

under hot debate, and many people looking at it. For certain problems, how does a gap

scale as, let's say, my learning problem gets harder, I look at more examples or larger

classifiers. As my problems get larger, how is the gap shrinking. If it's shrinking exponentially

fast, then I'm screwed; then adiabatic quantum computing would probably not give me good

solutions for this. If it only shrinks polynomially, that bodes well. I think I'll skip this slide,

D-Wave has now built a system that executes this. And of course, we haven't discussed yet what Hamiltonian, what energy landscape we are realizing. I don't know whether Geordie has shown this; this is my rendition of what happens inside the D-Wave chip. Essentially, what it realizes is sort of the simplest many-body physical system known, the Ising model. The qubits are organized on a grid, and they are nearest-neighbor connected. Actually,

that's not quite true; but they're nearest neighbor plus, so they're also--you saw the

connections earlier, so there are connections beyond just nearest neighbors. But so, what

you do now: you start with a strong, what's called a transverse magnetic field. So that's your beginning Hamiltonian; all these guys are randomized. And now, you phase in those connections between the qubits--that's really the blue and the red arrows; they encode your problem--and you fade out the transverse magnetic field. And then at the

end of this evolution, some spins will be up, some spins will be down; and this you

can read out, this is the solution to your problem. Again, programming this chip means

you give it the blue and the red arrows and say how strongly negative or positively they're

supposed to be. And actually, there are also linear terms, called the bias terms, that bias every qubit, saying: I'd rather want this one to be up, or down. So, what this creates is a new computational resource. A physicist would say it realizes the Ising model, as a quantum process for finding the minimum--the ground state--of the Ising energy function; a mathematician would prefer to call this a quadratic optimization problem.
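To make the format concrete, here is a small sketch (my own illustration, not from the talk) of the Ising objective: couplings J_ij on the edges between qubits, biases h_i on the qubits, and spins s_i = +/-1, with a brute-force search standing in for the annealer.

```python
from itertools import product

def ising_energy(spins, J, h):
    # E(s) = sum over edges of J_ij * s_i * s_j  +  sum over sites of h_i * s_i
    e = sum(Jij * spins[i] * spins[j] for (i, j), Jij in J.items())
    e += sum(hi * si for hi, si in zip(h, spins))
    return e

def brute_force_ground_state(n, J, h):
    # Stand-in for the quantum annealer: exhaustively check all 2^n spin settings.
    return min(product([-1, +1], repeat=n), key=lambda s: ising_energy(s, J, h))

# A 4-spin ferromagnetic chain: negative couplings reward aligned neighbors.
n = 4
J = {(0, 1): -1.0, (1, 2): -1.0, (2, 3): -1.0}
h = [0.0] * n

ground = brute_force_ground_state(n, J, h)
# Ground state: all spins aligned, with energy -(n-1) = -3.
assert abs(sum(ground)) == n
assert ising_energy(ground, J, h) == -3.0
```

Programming the chip, in these terms, amounts to choosing the J's (the arrows) and the h's (the biases), then reading off which spins end up up or down.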

And using this new computational resource in a problem like machine learning essentially

means you have to take your problem and map it, such that it fits onto the format of a

quadratic optimization problem. And the good news here is--well, it looks a little bit limiting. For example, what if I want more than quadratic and need third-order terms, fourth-order terms? Then you can, by introducing auxiliary variables, map it back onto quadratic.
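The standard reduction gadget can be checked in a few lines (my own sketch, not from the talk): replace a product of two binary variables x*y by a fresh auxiliary variable z, and add the penalty P*(x*y - 2*x*z - 2*y*z + 3*z), which is zero exactly when z = x*y. A cubic term x1*x2*x3 then becomes the quadratic z*x3.

```python
from itertools import product

P = 2  # penalty weight; any P > 1 suffices here, since the cubic term is at most 1

def cubic(x1, x2, x3):
    return x1 * x2 * x3

def reduced(x1, x2, x3, z):
    # z stands in for the product x1*x2; the penalty term is 0 when z = x1*x2
    # and at least 1 otherwise, so low-energy settings must respect z = x1*x2.
    penalty = x1 * x2 - 2 * x1 * z - 2 * x2 * z + 3 * z
    return z * x3 + P * penalty

# Minimizing over the auxiliary variable recovers the cubic term exactly.
for x1, x2, x3 in product([0, 1], repeat=3):
    assert min(reduced(x1, x2, x3, z) for z in [0, 1]) == cubic(x1, x2, x3)
```

Each such substitution costs one auxiliary variable, which is why, as the talk notes next, the qubit overhead can add up quickly.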

Also you may say, "Oh, this nearest neighbor business I didn't like. I need to connect

the qubits in a wider way." Yeah, you can also, by introducing auxiliary variables, bring this back onto the existing architecture. So the good news is, in principle, pretty

much any problem you can map onto the Ising format. Unfortunately, the fine print is that

you need auxiliary qubits, and often you need polynomially many. So, sometimes, let's say, if you have a third-order interaction and you have a hundred variables, then you would need a hundred squared--ten thousand--auxiliary qubits to get rid of this third-order interaction. And at the present moment, when we talk about hundreds of qubits

being available in the chip; that can of course be a deal breaker. So, that is also why, software

guys like us are needed--you have to give some thought to how you can map your problem, and then carefully look for problems that map well onto this architecture without the

need of too many auxiliary qubits. So now, I go back to training a binary classifier.

And I know some of you have heard this talk before. Here comes something new now; and

I'm pretty happy about this. So--I mean, sort of taught this thing a new trick. So we want

to train a classifier of this nature. So, we take the sign of a sum over "w_i h_i." The "h_i" are often referred to as weak classifiers or features. Again, the "y" is the plus-one, minus-one output. The "w's", in our case, we choose just to be binary zero-or-one weights, and this is the set we want to optimize. The overall classifier is often referred to as

the strong classifier. And then, we set up the optimization. We have a set of training

examples; and using those, we will set up our training objective. And I discussed earlier

a little bit the two terms. You remember, there's the loss that measures how well am

I doing on training; and regularization that essentially says, try to keep your explanation

as simple as possible. So, may I ask, which loss or which regularization is best? Unfortunately,

there's no good answer to this. It depends on the structure of your noise and the data.

And there is something called the no-free-lunch theorem in machine learning that shows that there is not a single training objective that is better on all datasets than some other machine learning algorithm. So, we looked at something very specific. So, this is new work where we

looked at how we can deal with outliers. You see, I've--again those two clouds, but we

put a few outliers in there. And that's situations that happen all the time. Let's say, to train

something like a face detector, we're talking about the order of between a hundred thousand

to two million images. And of course, you can imagine the wrong ones creep in there,

or just very unlikely faces. And faces probably are not even the best example, we looked at

datasets for character classification--optical character classification--and there were plenty of wrong labels in there. It's just a fact of life. And that's actually what an intelligent system you keep teaching should cope with--it's like in an elementary school, you know: you teach the kids, but maybe once in a while you say something wrong, and the smart kid will eventually figure out, "Hey, that's not right." So, we want our classifier to

be able to do this too; to find what the outliers are and discard them. So how do we do this?

Here, I just want to motivate that with the typical losses you can do this. Let's say,

if we would have used just a square loss--which is put in here--what's the optimal possible classifier here, the one with the best margins? It's called the Bayes classifier. The Bayes classifier is sort of the gold standard; that's as good as you can do. So if we would

have used the square loss, because if you remember, here the square loss really doesn't

like big negative margins. And the outliers, of course, create big negative margins. So

the optimization would put a lot of emphasis on making these big negative margins not quite as bad, to explain the bad things. It doesn't really fit the picture. So, square loss wouldn't do well here, but how

about 0-1 loss, 0-1 loss essentially just pays attention, "Let's get everything right."

So it would probably create--I mean, in these metaphorical pictures it would create maybe a

classifier like this. Here I get roughly the right separation, but over here, because it really doesn't like to get anything wrong, it puts a little box or a fence around the outliers and puts those in a different class. And of course, when I then see new real

test data, then I might fall into this box and make mistakes there. So, you can see that

there's not really one of the traditionally used loss functions that would do any good here. So what we did: we created a family of loss functions that can be tuned

to any given dataset such that it operates reasonably well. How do we do this? So we

start out with square loss--and it would be nice if I could cover this part here; this would be gone, okay? It's just a squared error, [INDISTINCT] one over S would make it the mean squared error--but by introducing this variable, we allow certain labels to be flipped. So you can essentially look at an example and say, "You know what? I don't think it's a positive one, I think it's a negative one." And we use this binary variable, which can be zero or one, but the first term puts a cost on this: you cannot willy-nilly just change everything; it comes at a cost. You have to think about when you want to say, "Is this example probably from the other class?" And, we also introduced a second

parameter, maybe which I should--given the time--not bother to explain. So essentially

what is happening is that we have now a two parameter family of loss functions. We call

it the new family of loss functions. And a generic example you see here and here you

can essentially see there's a loss that's really low if you have a positive margin and

you classified it correctly. And the cost you pay gets higher and higher as the margin

becomes more negative, meaning you misclassified it. And at some point, the pain you [INDISTINCT] becomes so large that the system would say, "You know what, screw it, I flip." And you can see at this point where the system will flip it. It comes at a cost, but the cost wouldn't keep growing the way an exponential loss or

a square loss would do, so you can limit your damage, so to speak. So if you look at the

complete family--again, it's a two-parameter family. Here we looked at a generic loss, but you can see I can find a pair of parameters where it becomes the square loss, or the truncated square loss, or we can nicely approximate the 0-1 loss. So those are all contained in the

family. And, now we did a neat experiment with this. We did the following thing. We

start out with--it's a toy experiment and it goes like this: you start out with the

classifier, the sign of a sum of w_i h_i; and we dial in a random set of w's, so some are

one, some are zero; and then we give the system input values X and it generates output Y's

for us. And that generates a set of training examples that I can use, but--and what we

want to do is we want to recover the original--can we learn what the original classifier was

that produced the data for us? And to make it a little bit meaner, we flipped some of

the labels. Now you can easily see or maybe--just trust me on this--that if you would have used

0-1 loss or square loss or any of the other known losses, you would not find this classifier

back. Let's maybe just discuss it for 0-1 loss: 0-1 loss could have retrieved it in the absence of label noise, because it wouldn't want to do anything wrong. So on the training

set, you would have to do exactly the same thing as the original classifier, but as soon

as you add some noise, then the loss tries to get this noise right. So we ask the

question, "Can we retrieve it?" And then we used our new family of loss functions and

we played around with different settings of those parameters--there's a technique called cross-validation; it's a technique for finding those meta-parameters. And then, for a small problem--because this is formally NP-hard and there's no good solver for it, we had to use an exact solver, so we needed to keep it small: 16. So we had a classifier of 16

weak classifiers, a strong classifier of 16 weak ones and 200 examples. And we added 10%

of label noise, meaning randomly flipped labels, and then we did 60,000 runs; and in 99.9% of cases on this toy problem, admittedly, we were able to retrieve the original classifier.
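A scaled-down sketch of this toy experiment (my own reconstruction, with assumed details: 10 weak classifiers instead of 16, 50 examples instead of 200, and a truncated square loss standing in for the full two-parameter family): generate labels from a hidden binary weight vector, flip 10% of them, and brute-force the robust training objective.

```python
import random

random.seed(0)
N, S = 10, 50               # weak classifiers, training examples
FLIP_COST, LAMBDA = 2.0, 0.1  # cap on per-example loss, L0 regularization weight

# Hidden classifier: binary weights with an odd number of ones,
# so the vote sum is never exactly zero.
w_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Weak classifier outputs h_i(x) = +/-1 for each training example.
H = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(S)]
y = [1 if sum(w * h for w, h in zip(w_true, row)) > 0 else -1 for row in H]

# Inject 10% label noise: flip S//10 randomly chosen labels.
for i in random.sample(range(S), S // 10):
    y[i] = -y[i]

def objective(w):
    # Truncated square loss (one member of the robust family) plus L0 term:
    # a big negative margin costs at most FLIP_COST, so outliers can't dominate.
    loss = 0.0
    for row, label in zip(H, y):
        m = label * sum(wi * hi for wi, hi in zip(w, row))
        loss += min((1 - m) ** 2 if m < 1 else 0.0, FLIP_COST)
    return loss + LAMBDA * sum(w)

# Brute force over all 2^N weight vectors (stand-in for the quantum optimizer).
best = min(range(2 ** N), key=lambda b: objective([(b >> i) & 1 for i in range(N)]))
w_best = [(best >> i) & 1 for i in range(N)]

# The robust objective prefers the true weights to the empty classifier,
# and the brute-force optimum is at least as good as the true weights.
assert objective(w_true) < objective([0] * N)
assert objective(w_best) <= objective(w_true)
```

Because the loss on a mislabeled example is capped, the true weights keep a low objective despite the noise, which is the mechanism behind the retrieval result described above.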

Now, I'm not sure, you know, to what degree, sort of, you're amazed by this, it's actually

pretty neat. There's a--there's no sort of system that's typically used that would have

been able to do this. Essentially, you hand your system a bunch of data, put

noise on top of it, and you ask it, "Please make sense of it. Try to find the best possible

explanation for this data." And the system was able to do this. So that's a new piece

we did. I think I'm out of time. I wanted to show you some older--maybe if you forgive

me, some of the older works that people hadn't seen. So just using the square loss, we had

done in winter an experiment on trying to classify cars; so we tried to train a car detector, and we used--actually, the D-Wave hardware. At the time, it had 52 variables; we used our training objective, solved it on the D-Wave hardware, and used 20,000 images from Street View scenes. And at the time--now the chip is supposed to look like this, 128 qubits--but this was a very early one, so not all

chips were functional and also, not all the connections existed. So on this somewhat beaten

up graph, we trained our classifier and to our delight, we found, if we compare this

hardware solver to heuristic solvers that would run tabu search, which would run, let's

say on a conventional machine, we were able to eke out not a tremendous, but nevertheless

a small advantage. And we also did--I will skip the Monte Carlo piece--or maybe just to conclude with what you saw. So far, it seems to us that quantum optimization is a useful new resource for intelligent systems, and between the older work and the newer one, I think we were able to show that training maps well onto quantum optimization. We found that we get lower generalization error; we get more compact classifiers, which means they execute faster; we could train faster, using fewer training cycles; and here's the new result, the new achievement: we can better cope with outliers. So this is just a motivation for why intelligent systems really could benefit from quantum optimization. Is it really useful? I had a slide on this

but I know it's late, so thank you everybody. >> Thank you. Are there any questions for

Hartmut? Could you use the microphone, please? >> So, where is the Ising model in all of

this? And is it a quantum or classical Ising? Is it 2D or 3D Ising model, if it's there

in your algorithm? >> NEVEN: It is--yeah, the Ising model comes; I mean there's a question

I guess to the hardware. So the--it's a quantum Ising model and it's not a planar graph. Actually,

I think there's a result that shows that the planar graph could be--efficiently be solved

by a classical algorithm. So it's not a planar graph, if that's what you mean by a 3D Ising

model. >> Yeah, yeah, yeah. And a thought on this, which--I don't know the answer to,

but somebody's--a friend of mine suggested there might be a similar concept at work,

which is the use of NMR of proteins, because you have something very similar: you can describe it with the Ising model--the spins of the protons--and you have the transverse field, and you have the different frequencies sweeping the range. So the question is, can you think

of a protein as your quantum computer? >> NEVEN: Yeah, this is what we love about the Ising

model is, as I say, that it's the simplest many-body model known to physicists, and many physical systems we see can, in a first-order approximation, be modeled by an Ising model. And Eric once pointed out to me that a quantum Ising model is powerful enough that pretty much any piece of the universe as far as we know it right now--maybe barring quantum gravity effects--can be mapped onto a quantum Ising model and be properly described. Which is not true for, let's say, a Google server center, which couldn't even describe a single protein,

as you know. >> Luca. >> This is not so much a question as a statement. I wanted, probably

on behalf of everybody, to thank you for organizing this. >> NEVEN: Thanks and same to Stuart,

and Annibonn and Geordie and... >> It's a daring thing to do and it worked, so thank

you very much. >> NEVEN: Thank you. Okay, shall we...I think after this it's not good

to have any more questions. >> Well, thank you. Let's thank Hartmut one last time, thank

you very much Hartmut.


and yeah, (-1, +1) binary labels. So here, typically, the input vectors, like, let's say, an image, live in high-dimensional spaces.

That's of course a large part of the problem we are facing, so you can, of course, not

visualize this. So here's a simple visualization where you have--let's say I just have two

features or two elements in the vector. So the blue guys are, let's say, the faces and

the red guys are the non-faces. And then doing classification means, you want to find a function

that divides--partitions the space in two areas, the +1 and the -1 area, and hopefully

you do it such that you don't have any of the red examples being classified as '+' and

vice-versa. Now of course, this can be more difficult, more rugged boundaries and again,

one danger with all of these pictures--and it's sort of a basic lesson that unfortunately you never internalize well enough--is that we are playing in very high-dimensional spaces, and geometries in very high-dimensional spaces are just very different. So a lot of intuitions drawn from these little diagrams tend to mislead. So now, how do we find these separating functions

that separate the positive from the negative examples? So,

modern machine learning literature likes to state training of a classifier as an optimization

problem. So typically, you start out with a classifier 'H(x)' but it's not just 'H(x)',

it has a set of parameters, like think of it as little knobs, typically called the weights

and if you change the weights, then this separating function changes and what gets labeled as

plus or minus one on the 'Y' side changes. So you can, if you go back to here, let's

say, one set of weights would have maybe done this rugged separating line and another set

of weight would have done this more smooth line and then training means you want to find

a good set of weights, such that this gets nicely separated. But there's one more complication

in machine learning and in some ways that makes it maybe a difficult area to apply quantum

computing to. Why? In supervised learning, the way this typically works is you take your data

and you separate it into two sets. One is a training set; the other part is your holdout

set. And then on the training set, you will do what we will discuss in a second; you use

it to form your classifier. But of course, what you're really interested in is not making

very few mistakes on the training set. That is of course a necessary condition--if you are already bad on your training data, you cannot expect to be good on your

test data. But unfortunately, the converse is not true meaning that if you are perfect

on your training data, you can still make a lot of mistakes on your test data. So you

need to setup your optimization problem such that you generalize well, because what you're

really after, is new unseen examples on those you want to have little error rates, small

error rates. And one way to achieve this is to set up a training objective that typically

consists of two parts. So again, training means to find a good set of knobs that characterize

the classifier and the optimization problem that you setup has two components. One is

the so-called loss function, and the second term is the regularization. So the loss thing

is fairly easily understood. So loss is just a measure on how good do I do on my training

data. Now to appreciate a little bit more what we do later, I need to discuss with you the different types of loss functions. And to appreciate those different loss functions, for those who are not machine learning people, I need to explain to you one item, and that's the so-called 'margin.' The margin is essentially the smallest distance

to the separating function between the two classes. And you can imagine that the larger

the margin, the more robust you are, the better you will probably generalize. Imagine if I

would have put the separating line just very close here, then a little bit of noise on the red data would have put it on the other side of my line, and I would have misclassified

it. So, what I--we tried to do is maximize the margins. That's typically considered good

for classifiers because you get better generalization. And now if you look, the most naive way to

measure on a training set, "How good is my classifier doing," is you just count. So whenever

I do a mistake, that is I have a negative margin, I'm on the wrong side. You know, and

then I say, "Okay, +1." You know, you get a count one error, another one, and so on.
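This counting can be written down directly; a minimal sketch (the function and variable names here are mine, not from the slides):

```python
import numpy as np

def zero_one_loss(w, H, y):
    """Count training errors: an error is a non-positive margin y * sign(H(x)).

    H : weak-classifier outputs, one row per training example
    y : labels in {-1, +1}
    w : the weight vector being trained
    """
    predictions = np.sign(H @ w)
    return int(np.sum(y * predictions <= 0))
```

Each misclassified example adds exactly one to the count, which is the behavior described here.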

If you classified it correctly, then you say that's fine. So that's called the zero-one

loss; simply counting the number of errors you make on your training set, you know. But

then the other losses--first off, there are different ways to motivate them but maybe

for starters, if you set up your loss function as zero-one loss, then it can be shown that

the resulting optimization problem is formally NP-hard. So, machine learning theorists

of course didn't like this and they use instead what the optimization people call relaxations,

convex relaxations. They don't work with the non-convex zero-one loss but instead, let's

say, they approximate it by some upper bound like an exponential or quadratic function.

And that on one hand gets you off the hook, it's not NP-hard anymore, and it's not necessarily

all that bad. You get other advantages also because the zero-one loss is not margin maximizing,

while those convex relaxations are. You see, the larger your margin is, the lower your

objective value and if you try to minimize the objective value, then it gets rewarded.
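One can check numerically that these relaxations really do upper-bound the 0-1 loss at every margin, and that the exponential one strictly rewards larger margins; a small sketch (function names mine):

```python
import numpy as np

def zero_one(m):
    """Non-convex 0-1 loss on the margin m = y * H(x): 1 for a mistake, else 0."""
    return (m <= 0).astype(float)

def square(m):
    """Quadratic relaxation: (m - 1)^2 upper-bounds the 0-1 loss everywhere."""
    return (m - 1.0) ** 2

def exponential(m):
    """Exponential relaxation: exp(-m) also upper-bounds the 0-1 loss."""
    return np.exp(-m)

m = np.linspace(-3.0, 3.0, 601)
assert np.all(square(m) >= zero_one(m))       # both are valid upper bounds
assert np.all(exponential(m) >= zero_one(m))
assert np.all(np.diff(exponential(m)) < 0)    # exp(-m) strictly rewards larger margins
```

Both surrogates are convex in the margin, which is what makes the relaxed training problem tractable.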

So essentially, the further away your samples get from the separating line, the more you'll

like this. So that's a reasonable measure to say how good is your classifier coming

along. So--but there's another term as I said. So the first term is, when you try to find

the right set of knobs is to try to minimize the loss. The second is regularization; that's

a little bit harder to understand, but it essentially makes precise the notion that the simplest explanation tends to be the best one. And it's probably easiest to understand this in

the realm of curve fitting. So here's this green curve, the process has generated

some data for me and it's a little bit noisy and I want to fit those dots that came out of it. And I want to fit it with a polynomial, and I have different polynomials of order zero, order one and so on. And you see, if the model with which I try to fit this is too simple--let's say order zero, and I try to fit this data with a constant function--that, of course, cannot work too well; and then here, if it's order one, the linear function, I will still do poorly. Here, order three seems to be good, but you don't get better and better as you make your model more complex. You know, if you go to order nine, then you do--as I've mentioned earlier--perfectly on your training data,

but this will generalize very poorly. You can imagine, I'll give you a new example and

it will be completely off. And so, there are a lot of adages and sayings about this--as when Einstein said, "You should make things as simple as possible but not simpler." You can

see this here. This is too simple, now and here it doesn't pay to make your model overly

complex. You know, Occam's Razor, the simplest explanation tends to be the best one and in

machine learning theory there are actually theorems that show that if you pick a classifier with lower complexity, then the bounds on the generalization error will be lower. Okay,

so--and yeah, I didn't give you my example. One way for example to choose a regularization

term is you put some norm on W and say I want this norm small. For example, you can say,

"I want the zero norm to be small." That simply means you switch certain parts of your classifier

off, so you make it as compact as possible. But actually, this choice again would have rendered the problem formally NP-hard. So, what we see is that for important choices of the loss function and the regularization, the training problems are--and, you know, I have discussed enough with complexity people to know that you better always put "formally" in front of NP-hard, because we don't know whether, let's say, a real-world recognition problem like face recognition or car recognition is truly NP-hard. I don't know. But that doesn't really help us all that much, because to the algorithms, it looks like an NP-hard problem. So now, this is a good motivation to try to look at

quantum algorithms that, again, are not known to solve NP-hard problems exactly; but it's

very reasonable to expect that they give us better solutions than classically available.

And so, a quick excursion to optimization. So, optimization, you have an objective function

and you want to find is the minimum of this function, that's an optimization problem.
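A toy one-dimensional example of why plain downhill search is not enough (the double-well function and all names here are mine, for illustration only):

```python
def greedy_descent(f, x0, step=0.1, iters=1000):
    """Classical 'ball rolling downhill': move to a neighbor only if f decreases.
    This gets stuck in the nearest local minimum."""
    x = x0
    for _ in range(iters):
        for cand in (x - step, x + step):
            if f(cand) < f(x):
                x = cand
                break
    return x

# Double well: a local minimum near x = -1, the global minimum near x = +2.
f = lambda x: (x + 1.0) ** 2 * (x - 2.0) ** 2 - x

x_stuck = greedy_descent(f, x0=-1.5)
assert x_stuck < 0          # the ball settled in the left (local) well
assert f(2.0) < f(x_stuck)  # the right well is strictly better
```

Started on the left, the purely downhill search never crosses the barrier; this is exactly the limitation that the tunneling resource in the video is meant to address.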

And this was the picture from this morning, here I have this video and I apologize, several

people here have seen this already many times. And I keep showing it again to give people

who are not familiar with it, the basic idea of a quantum resource that can help you better with finding the global minimum here. Classically, this ball could have

only rolled down and would have been stuck in a local minimum. If you have a second resource, tunneling, then of course, if this barrier had been too fat, the tunneling amplitude would have been very small and the advantage would have been tiny. But, that actually is

the beauty of the adiabatic quantum algorithm, that it deforms its landscape slowly. And hence,

at any point in time, you have a good chance of having a reasonable tunneling amplitude.

And overall, of course, physics determines what optimization problems you can solve.

And bringing in this additional resource of quantum tunneling expands the reach of your optimization algorithms. And so, this was a bit the Scientific American view. Geordie,

this morning, didn't have the time to explain how the D-Wave chips really work. And they actually do exactly this process we just looked at. But here's a little bit

nicer quantum explanation. And quantum mechanics got its name from certain entities in nature,

such as energy being quantized. So, what you do here is--oh, I should have explained this

also here. The principle, or the key idea, in adiabatic quantum computing is that you start out with an optimization problem--the beginning Hamiltonian, as it's often referred to--which has only one easy-to-find global minimum. So any algorithm gets you there.
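In symbols, the deformation described next is conventionally written as a linear interpolation between the beginning Hamiltonian and the problem Hamiltonian (this is the standard textbook form, not taken from the slides):

```latex
H(s) = (1 - s)\, H_B + s\, H_P, \qquad s = t/T \in [0, 1],
\qquad\text{with runtime roughly}\quad
T \;\gg\; \frac{\max_s \lVert \partial_s H(s) \rVert}{g_{\min}^2},
```

where $g_{\min}$ is the minimum spectral gap between the ground state and the first excited state along the evolution; the $1/g_{\min}^2$ dependence is why the gap-scaling question discussed below matters so much.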

And then, you slowly deform it, until the Hamiltonian has the shape that

encodes a problem you're really interested in. And now, if you do this the right way,

the following happens. So, here is the quantized energy spectrum for the beginning Hamiltonian, and you start out in the ground state. And now, I start to deform it, and by time one, I am, sort of, in the Hamiltonian that has the shape I'm really interested in.

If I do this carefully enough, such that--you see, there's a difference in energy between

the ground state and the first excited state. And actually, if I zoom in a little closer, you can see this entity here, which is called the minimum gap. If I

do this evolution, meaning the transformation from beginning Hamiltonian to end Hamiltonian,

careful enough that I don't inject enough energy into the system for it to jump into the next excited state, then at all points you would stay in the ground state; and then at the end, you would still be at the minimum of the function

you're interested in. So that's the idea of adiabatic quantum computing. And of course,

as you can imagine, the critical entity here is the minimum gap. If the minimum gap is

too small--well, the smaller it is, the more carefully you have to do it, meaning

the slower you have to do it. And then of course, there's a question that is currently

under hot debate, and many people are looking at it: for certain problems, how does the gap scale as, let's say, my learning problem gets harder, as I look at more examples or larger

classifiers. As my problems get larger, how is the gap shrinking. If it's shrinking exponentially

fast, then I'm screwed; then adiabatic quantum computing would probably not give me good

solutions for this. If it only shrinks polynomially, that bodes well. I think I'll skip this slide,

you know that D-Wave has now built a system that executes this. And of course, we haven't discussed

yet what Hamiltonian, what energy landscape are we realizing. I don't know whether Geordie

has shown this; this is my rendition of what happens inside the D-Wave chip. Essentially, what is realized is sort of the simplest many-body physical system, known as the Ising model. The qubits are organized on a grid, and they are nearest-neighbor connected. Actually, that's not quite true; they're nearest neighbor plus--you saw the connections earlier, so there are connections beyond just nearest neighbors. But so, what

you do now: you start with a strong, what's called a transverse magnetic field. So that's your beginning Hamiltonian; all these guys are randomized. And now, you phase in those connections between the qubits--these are the blue and the red arrows. They really encode your problem, and you fade out the transverse magnetic field. And then at the

end of this evolution, some spins will be up, some spins will be down; and this you

can read out, this is the solution to your problem. Again, programming this chip means

you give it the blue and the red arrows and say how strongly negative or positively they're

supposed to be. And actually, there are also linear terms called the bias terms. That bias

every qubit, saying, I'd rather want this one to be up or down. So, what this creates is a new computational resource. A physicist would call this a realization of the Ising model, with a quantum process for finding the minimum--the ground state--of the Ising energy function; a mathematician would prefer to call this a quadratic optimization problem.
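The quadratic form being referred to can be made concrete; a minimal brute-force sketch of finding an Ising ground state (the couplings and all names are mine, and real chips obviously do not enumerate configurations):

```python
import itertools
import numpy as np

def ising_energy(s, J, h):
    """E(s) = sum_{i<j} J_ij s_i s_j + sum_i h_i s_i, with spins s_i in {-1, +1}.
    J is upper-triangular so each coupling (a blue/red arrow) is counted once;
    h holds the per-qubit linear bias terms."""
    return float(s @ J @ s + h @ s)

def ground_state(J, h):
    """Exhaustive search over all 2^n spin configurations -- only for tiny n."""
    n = len(h)
    configs = (np.array(c) for c in itertools.product([-1, 1], repeat=n))
    best = min(configs, key=lambda s: ising_energy(s, J, h))
    return best, ising_energy(best, J, h)

# Ferromagnetic 3-spin chain with a small bias on the first spin.
J = np.zeros((3, 3))
J[0, 1] = J[1, 2] = -1.0        # negative coupling: neighbors prefer to align
h = np.array([-0.1, 0.0, 0.0])  # bias: spin 0 slightly prefers +1
```

For this instance the ground state is all spins up, with energy -2.1; programming the chip amounts to choosing J and h, and the annealing process plays the role of `ground_state`.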

And using this new computational resource in a problem like machine learning essentially

means you have to take your problem and map it, such that it fits onto the format of a

quadratic optimization problem. And the good news here is that, although it looks a little bit limiting--for example, what if I want more than quadratic and need third-order terms, fourth-order terms? Then, yeah, you can, by introducing auxiliary variables, map it back onto quadratic.
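One standard way to do this reduction for 0/1 variables is a penalty gadget that forces an auxiliary variable w to equal the product x*y, so a cubic term x*y*z becomes the quadratic term w*z plus the penalty; a sketch verifying the gadget (this particular penalty is the well-known Boros-Hammer form, not something specific to the talk):

```python
import itertools

def and_penalty(x, y, w):
    """Quadratic penalty that is 0 iff w == x*y, and >= 1 otherwise (x, y, w in {0,1}).
    Added to an objective with a large enough weight, it lets w stand in for x*y."""
    return x * y - 2 * (x + y) * w + 3 * w

# Verify: the penalty vanishes exactly on consistent assignments.
for x, y, w in itertools.product([0, 1], repeat=3):
    if w == x * y:
        assert and_penalty(x, y, w) == 0
    else:
        assert and_penalty(x, y, w) >= 1
```

Each higher-order term consumes one such auxiliary variable, which is exactly the qubit overhead discussed below.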

Also you may say, "Oh, this nearest neighbor business I didn't like. I need to connect

the qubits in a wider way." Yeah, you can also, by introducing auxiliary variables, bring this back onto the existing architecture. So the good news is, in principle, pretty

much any problem you can map onto the Ising format. Unfortunately, the fine print is that

you need auxiliary qubits, and often you need polynomially many. So, sometimes,

let's say, if you have a third-order interaction and you have a hundred variables, then you would need a hundred squared--ten thousand--auxiliary qubits to get rid of this third-order interaction. And that of course, at a moment where we talk about hundreds of qubits

being available in the chip; that can of course be a deal breaker. So, that is also why software guys like us are needed--you have to give it some thought, how you can map your problem, and then carefully look for problems that map well onto this architecture without the

need of too many auxiliary qubits. So now, I go back to training a binary classifier.

And I know some of you have heard this talk before. Here comes something new now; and

I'm pretty happy about this. So--I mean, we sort of taught this thing a new trick. So we want

to train a classifier of this nature. So, we take the sign of a sum over "wihi." The

"hi" are often referred to as weak classifiers or features. Again, the "y" is the plus one,

minus one output. The "w's" in our case, we choose them just to be binary zero or one weights, as this is the set we want to optimize. The overall classifier is often referred to as

the strong classifier. And then, we set up the optimization. We have a set of training

examples; and using those, we will set up our training objective. And I discussed earlier

a little bit the two terms. You remember, there's the loss that measures how well am

I doing on training; and regularization that essentially says, try to keep your explanation

as simple as possible. So, you may ask, which loss or which regularization is best? Unfortunately,

there's no good answer to this. It depends on the structure of your noise and the data.
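To make the setup concrete, here is a minimal sketch of the strong classifier with binary weights and one concrete training objective of the kind described, square loss plus a 0-norm regularizer (one choice among many; the names are mine):

```python
import numpy as np

def strong_classifier(w, H):
    """y = sign(sum_i w_i * h_i(x)), with binary weights w_i in {0, 1}.
    H holds the weak-classifier outputs h_i(x) in {-1, +1}, one row per example."""
    return np.sign(H @ w)

def training_objective(w, H, y, lam):
    """Square loss on the margin plus L0 regularization. For binary weights
    the 0-norm is simply w.sum(): the number of switched-on weak classifiers."""
    margins = y * (H @ w)
    return np.mean((margins - 1.0) ** 2) + lam * w.sum()
```

Because the w's are binary and the loss is quadratic in w, minimizing this objective is a quadratic binary optimization problem, which is what makes it a candidate for the Ising-format hardware described earlier.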

And there is something called the no-free-lunch theorem in machine learning that shows that there is not a single training objective that is better on all datasets than some other machine learning algorithm. So, we looked at something very specific. So, this is new work where we

looked at how we can deal with outliers. You see, I've--again those two clouds, but we

put a few outliers in there. And that's a situation that happens all the time. Let's say, to train

something like a face detector, we're talking about the order of between a hundred thousand

to two million images. And of course, you can imagine the wrong ones creep in there,

or just very unlikely faces. And faces probably are not even the best example, we looked at

datasets for character classification, optical character classification. And there were plenty of wrong labels in there. It's just a fact of life. And that's actually like an intelligent system you keep teaching--like in an elementary school, you know, you teach the kids; but maybe once in a while you say something wrong, and the smart kid will eventually figure out, "Hey, that's not right." So, we want our classifier to

be able to do this too; to find what the outliers are and discard them. So how do we do this?

Here, I just want to motivate that with the typical losses you cannot do this. Let's say,

if we would have used just a square loss--and what is put in here is the optimal possible classifier, the one with the best margins; it's called the Bayes classifier. The Bayes classifier is sort of the gold standard. That's as good as you can do. So if we would

have used the square loss, because if you remember, here the square loss really doesn't

like big negative margins. And the outliers, of course, create big negative margins. So

the optimization would put a lot of emphasis on making these big negative margins look not quite as bad, to explain the bad things away, and the result doesn't really fit the picture. So, square loss wouldn't do well here, but how

about 0-1 loss, 0-1 loss essentially just pays attention, "Let's get everything right."

So it would probably create--I mean, in these metaphorical pictures--maybe a classifier like this. Here it gets the separation roughly right, but over here, because it really doesn't like to get anything wrong, it puts a little box or a fence around the outliers and puts those in a different class. And of course, even when I then see new real

test data, it might fall into this box and make mistakes there. So, you can see that there isn't really one of the traditionally used loss functions that would

do any good here. So what we did, we created a family of loss functions that can be tuned

to any given dataset such that it operates reasonably well. How do we do this? So we

start out with square loss, and I should--it would be nice if I could cover this here; this would be gone, okay? It's just a squared error--[INDISTINCT] one over S would make it a mean squared error--but by introducing this variable, we allow certain labels to be flipped. So you can essentially look at an example and say, "You know what? I don't think it's a positive one, I think it's a negative one." And by using this binary variable, which can be zero or one, we also put a cost to this; you cannot willy-nilly just change everything, it comes at a cost. You have to think about when you want to say, "Was this example probably from the other class?" And we also introduced a second parameter, which I should--given the time--not bother to explain. So essentially

what is happening is that we have now a two parameter family of loss functions. We call

it the new family of loss functions. And a generic example you see here and here you

can essentially see there's a loss that's really low if you have a positive margin and

you classified it correctly. And the cost you pay gets higher and higher as the margin

becomes more negative, meaning you misclassified it. And at some point, the pain you [INDISTINCT] becomes so large that the system would say, "You know what, screw it, I flip it." And you can see the point here where the system will flip it. It comes at a cost, but the loss wouldn't keep growing the way an exponential loss or a square loss would, so you can limit your damage, so to speak. So if you look at the

complete family--again, it's a two-parameter family. Here we looked at a generic loss, but you can see I can find a pair of parameters where it becomes a square loss, or a truncated square loss, or we can nicely approximate the 0-1 loss. So those are all contained in the

family. And, now we did a neat experiment with this. We did the following thing. We

start out with--it's a toy experiment and it goes like this: you start out with the

classifier, the sign of a sum of the "wihi"; and we dial in a random set of w's, so some are

one, some are zero; and then we give the system input values X and it generates output Y's

for us. And that generates a set of training examples that I can use, but--and what we

want to do is we want to recover the original--can we learn what the original classifier was

that produced the data for us? And to make it a little bit meaner, we flipped some of

the labels. Now you can easily see or maybe--just trust me on this--that if you would have used

0-1 loss or square loss or any of the other known losses, you would not find this classifier

back. Let's maybe just discuss it for 0-1 loss: 0-1 loss could have retrieved it in the absence of label noise, because then it wouldn't want to do anything wrong, so on the training set it would have to do exactly the same thing as the original classifier; but as soon as you add some noise, then it's a loss that tries to get this noise right. So we ask the

question, "Can we retrieve it?" And then we used our new family of loss functions and

we played, sort of, with different sets of parameters, and there's a technique called cross-validation; it's a technique for how you can find those meta-parameters. And then, because this problem is formally NP-hard and there's no good solver for it, we had to use an exact solver, so we needed to keep it small--16. So we had a classifier of 16

weak classifiers, a strong classifier of 16 weak ones and 200 examples. And we added 10%

of label noise, meaning, randomly flipping labels and then we did 60,000 runs and in

99.9% of cases of this toy problem, admittedly, we were able to retrieve the original classifier.
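The shape of the loss family described above--square-like near the decision boundary but capped for large negative margins, so a single outlier can only do bounded damage--can be sketched as a truncated square loss (an illustration of the idea only; the talk's actual two-parameter family may differ in details, and the parameter name q is assumed):

```python
import numpy as np

def capped_square_loss(m, q):
    """Truncated square loss on the margin m = y * H(x).

    For m >= 1 the loss is 0; for q < m < 1 it behaves like (1 - m)^2;
    for m <= q it is capped at the constant (1 - q)^2, so an outlier with a
    huge negative margin cannot dominate the objective. (Sketch only.)"""
    return np.minimum((1.0 - q) ** 2, np.maximum(0.0, 1.0 - m) ** 2)

# An outlier at m = -100 costs no more than a mild mistake at m = -5:
assert capped_square_loss(-100.0, q=-2.0) == capped_square_loss(-5.0, q=-2.0) == 9.0
assert capped_square_loss(2.0, q=-2.0) == 0.0  # correctly classified beyond margin 1
```

This capping is what lets the optimizer effectively "write off" a mislabeled example instead of bending the separating function to accommodate it.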

Now, I'm not sure, you know, to what degree, sort of, you're amazed by this, it's actually

pretty neat. There's no system in typical use that would have been able to do this. Essentially, you hand your system a bunch of data, put

noise on top of it, and you ask it, "Please make sense of it. Try to find the best possible

explanation for this data." And the system was able to do this. So that's a new piece

we did. I think I'm out of time. I wanted to show you some older--maybe if you forgive

me, some of the older work that people hadn't seen. So just using the square loss, we had done in winter an experiment on trying to classify cars, so we trained a car detector and we used, actually, the D-Wave hardware. At the time, it had 52 variables with which to set up our training objective; we solved the training objective on the D-Wave hardware and used 20,000 images from Street View scenes. At the time, the chip was supposed to look like this, with 128 qubits, but this was a very early one, so not all qubits were functional and also, not all the connections existed. So on this somewhat beaten

up graph, we trained our classifier and, to our delight, we found, if we compare this hardware solver to heuristic solvers that would run tabu search, let's say, on a conventional machine, we were able to eke out not a tremendous, but nevertheless a small advantage. And we also did--I will skip the Monte Carlo piece--or maybe just

conclude with what you saw. So far, it seems to us that quantum optimization is a useful new resource for intelligent systems, and between the older work and the newer one, I think we were able to show that training maps well onto quantum optimization. We found that we get lower generalization error; we get more compact classifiers, which means they execute faster; we could train faster, using fewer training cycles; and here, a new result, a new achievement, is that we can better cope with outliers. So this is just a motivation why intelligent systems really could benefit from quantum optimization. Is it really useful? I had a slide on this, but I know it's late, so thank you everybody. >> Thank you. Are there any questions for

Hartmut? Could you use the microphone, please? >> So, where is the Ising model in all of

this? And is it a quantum or classical Ising? Is it 2D or 3D Ising model, if it's there

in your algorithm? >> NEVEN: It is--yeah, the Ising model comes; I mean there's a question

I guess, to the hardware. So it's a quantum Ising model, and it's not a planar graph. Actually, I think there's a result that shows that the planar case could be efficiently solved by a classical algorithm. So it's not a planar graph, if that's what you mean by a 3D Ising

model. >> Yeah, yeah, yeah. And a thought on this, which--I don't know the answer to,

but somebody--a friend of mine--suggested there might be a similar concept at work, which is the use of NMR on proteins, because you have something very similar: you can describe it with the Ising model--the spins of the protons--and you have the transverse field and you have the different frequencies sweeping the range. So, the question is, can you think

of a protein as your quantum computer? >> NEVEN: Yeah, this is what we love about the Ising model: as I said, it's the simplest many-body model known to physicists, and many physical systems we see could, in first approximation, be modeled by an Ising model. And Eric once pointed that out to me: a quantum Ising model is powerful enough that pretty much any piece of the universe as we know it right now--maybe barring quantum gravity effects--can be mapped onto a quantum Ising model and be properly described. Which is not true for, let's say, a Google server center, which couldn't even describe a single protein, as you know. >> Luca. >> This is not so much a question as a statement. I wanted, probably

on behalf of everybody, to thank you for organizing this. >> NEVEN: Thanks and same to Stuart,

and Annibonn and Geordie and... >> It's a daring thing to do and it worked, so thank

you very much. >> NEVEN: Thank you. Okay, shall we...I think after this it's not good

to have any more questions. >> Well, thank you. Let's thank Hartmut one last time, thank

you very much Hartmut.