Uploaded by GoogleTechTalks on 02.09.2010

Transcript:

>>

I met him actually last month at CVPR where he gave a very interesting talk and I thought

it would be great if he shares some of his work with us here at Google. And just to spell

his full name, so it's Hamed Masnadi-Shirazi, and he is in the middle of his PhD

at UC San Diego, you know, in the Statistical Visual Computing Lab.

And, yeah, I'm very pleased to hear about On the Design of Bayes Consistent Loss Functions

for Classification. Mr. Hamed, please. >> MASNADI-SHIRAZI: So thank you for the introduction.

Yeah, so the title is On the Design of Bayes Consistent Loss Functions for Classification,

and this is work that I've done at the Statistical Visual Computing Lab at the University of

California - San Diego. So during this talk, we're going to basically be talking about

classification. Many classification algorithms, some of the most successful ones such as SVMs

and Boosting are basically based on minimizing an expected value of a loss function phi,

which is both margin enforcing and Bayes consistent. So such loss functions assign

a large penalty to points with negative margin, and they generally assign a small penalty

to points of small positive margin and zero or close to zero penalty to points of large

positive margin. So we also need to talk about Bayes consistent loss functions and like I

said, Bayes consistent loss functions are very special and important for classification.

They--and the reason for that is because they produce the Bayes optimal decision rule without

having to know anything about the distribution of the data. So, an example for general Bayes

consistent loss function phi, we would take the--we would take the expected loss to find

the risk and we would minimize the risk to find our predictor, F star. And given that

our loss function was Bayes consistent, we're guaranteed that F star, our predictor, is

Bayes optimal. It implements the Bayes decision rule. And you can see that through this progression,

we didn't make any assumptions about the model of the data. We didn't know the distribution

of the data, nothing like that, but yet we were still able to arrive at the Bayes optimal

decision rule. So, on the other hand, unfortunately, only a handful of Bayes consistent loss functions

were previously known in the literature. Some examples include the well known Least Squares

loss; that's used all over the place. Another example is the Hinge loss that's used in SVM

classifiers; or the exponential loss that's used in Boosting. Another example is the Log

loss that's used in Logistic Regression and also in Boosting. And you can see that each

one of these previously known Bayes consistent loss functions has led to a highly successful

classification algorithm. We've plotted, actually, these Bayes consistent loss functions, and

just by looking at them, you can see that they all generally have a few properties in

common. One is that, they're all convex loss functions. And the ones that are considered

here are all unbounded for large negative margins. And number three is that, they all

have a zero or close to zero penalty at positive margins; of course, this excludes the Least Squares

that increases on both sides. And the last property is that they are all cost insensitive,

meaning that they assign the same amount of loss to both the positive class and

the negative class. So they don't prefer one class over the other. So the natural question

that we could ask at this point is, "Are we stuck with just these 4 or 5 loss functions

with those common shapes?" And can we actually derive other Bayes consistent loss functions

that are custom tailored for our application? Can we derive Bayes consistent loss functions

that have other shapes and other properties? For example, can we have a Bayes consistent

loss function that's non convex like the plotted loss function? Can we have a loss function

that's Bayes consistent and yet has a bounded penalty for negative margins? As you can see, this loss

function flattens out. Can we have a loss function that has a nonzero, bounded penalty at positive

margins unlike previous loss functions? Can we have a loss function that's Bayes consistent

and is cost sensitive? So here we have two loss functions; one is assigned to the positive

class and then--and the other is assigned to the negative class. And in general, can

we have a Bayes consistent loss function that's just any shape we want it to be given our

application? So these problems and questions were previously known and researchers generally

would try to start off with a known Bayes consistent loss function. And in order to

alter it to their application, they would either make alterations to the loss directly

or alter the classification algorithm derived from the loss. And the problem with that is

that once you actually alter the Bayes consistent loss function in any way, it ceases to be

Bayes consistent. So during this talk, what we're going to show is how to actually derive

novel Bayes consistent loss functions, and this is work that has been covered in our

NIPS '08 paper. We're also going to talk about designing novel loss functions--Bayes consistent

loss functions for specific applications. And two of the applications that we'll cover

during this talk--yeah, the first one is going to be the designing loss functions that are

robust or outlier resistant for robust classification, and this is work that's been covered in our

CVPR 2010 paper. And we're also going to talk about designing loss functions that are Bayes

consistent and cost sensitive. So this is work that's been covered in our ICML 2010

and IEEE PAMI 2010 papers. So now that we know the importance of having Bayes consistent

loss functions for classification, we can talk about some specific problems and applications.

An interesting problem is the computer vision problem. Here, large margin loss functions

generally don't overcome the unique challenges posed by computer vision problems. And one

major difficulty is the prevalence of noise and outliers or class ambiguity that exists

in computer vision data sets. So here, for example, we have an image of a famous

street in San Francisco and this is clearly an image of a street. But if we were to do

patch-based image classification on this, we would get many patches that include trees,

and cars, and buildings even though this is supposed to be classified as an image of a

street. So in order to be able to do robust classification, we're going to try to design

a robust loss function. One limitation of current loss functions is their unbounded

penalty at negative margins. So for this simple classification problem we have at the top, we clearly have

three outliers. Those are the three Xs--the green Xs to the left are obviously outliers.
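
To see how lopsided this weighting gets, here is a small numerical sketch (the margin values below are made up for illustration): a handful of outliers at large negative margins end up dominating the empirical exponential loss almost entirely.

```python
import numpy as np

# Margins for a hypothetical fit: most points are correctly classified with
# healthy positive margins, but three outliers sit at large negative margins.
margins = np.array([2.0, 2.5, 3.0, 3.5, 2.2, 2.8, -4.0, -4.5, -5.0])

exp_losses = np.exp(-margins)
share = exp_losses[-3:].sum() / exp_losses.sum()
print(f"outliers' share of total exponential loss: {share:.4f}")
# The three outliers dominate the total loss almost entirely, so minimizing
# this loss concentrates on them instead of disregarding them.
```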

And if we were to use the well-known exponential loss to classify this problem--the exponential

loss has been plotted in light blue, the light blue dotted line. And because of its exponential

shape, it basically will emphasize and concentrate on these three outliers and assign to them

an exponentially high loss, whereas, what we'd really like it to do is to disregard

these three outliers. So, recent improvements have been made in terms of dealing with this

type of outliers. One is the linearly growing Logit loss used in LogitBoost introduced by

Friedman in 2000. And that's the green line. You can see that it grows linearly instead

of exponentially, so that's an improvement. The other is the bounded Savage loss used

in SavageBoost that we introduced in '08, and that's the dotted blue line in the bottom.

You can see that it's actually bounded; it flattens out after a while. And so these help

with this type of outliers. But we argued that the negative margins are not the entire

problem and that it's also important to penalize large positive margins. So in order to explain

this further, we can consider this simple classification algorithm--I mean, this simple

classification problem we have where the data is distributed uniformly in the vertical direction

and is Gaussian with equal variants and different means of +3 and -3 in the horizontal direction.

So for these two classes, it's easy to see that the Bayes decision rule would be a line

at X equals zero. And in fact, if any of the previous Bayes consistent loss functions are

used on this exact data set, they would all place their decision boundaries where the

red line is which is very close to the black Bayes decision rule. So the--all the previous

Bayes consistent loss functions will approximate the Bayes decision rule almost exactly.
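
This is easy to reproduce in a few lines (a sketch, not the speaker's actual code; the sample sizes, the random seed, and the restriction to boundaries of the form f(x) = x - t are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes, Gaussian in x with means +3 and -3. The y-coordinate is
# uniform and carries no information, so we work directly in 1-D.
x_pos = rng.normal(+3.0, 1.0, 50)   # label y = +1
x_neg = rng.normal(-3.0, 1.0, 50)   # label y = -1
x = np.concatenate([x_pos, x_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

def exp_risk(t):
    # Empirical risk of the exponential loss for the predictor f(x) = x - t.
    return np.mean(np.exp(-y * (x - t)))

# Exhaustive search over vertical boundaries, as the speaker mentions doing
# for this toy problem later in the talk.
ts = np.linspace(-6.0, 6.0, 1201)
t_star = ts[int(np.argmin([exp_risk(t) for t in ts]))]
print(t_star)  # lands close to the Bayes boundary at x = 0
```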

>> [INDISTINCT] >> MASNADI-SHIRAZI: I didn't get the question.

What was that? >> [INDISTINCT]

>> MASNADI-SHIRAZI: [INDISTINCT], yeah. The fact that I know that the optimal Bayes decision

rule is the x = 0 line is because I know that they're Gaussian. But if all I had were these

few points and I used the Bayes consistent loss function, any one of these loss functions

that I have on the right, they would in--actually for this data set, give you the red line,

which is almost on top of the black line, which is the optimal. So even if you didn't

know anything about this distribution and all I gave you were these few points and you

were to follow the procedure of--start off with the Bayes consistent loss function, compute

the risk, minimize the risk, you would get the red line without any information about

the distribution. So that's the power of Bayes consistent loss functions.
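
That procedure can also be checked at the level of a single point: for the exponential loss, minimizing the conditional risk eta * exp(-f) + (1 - eta) * exp(f) over f numerically recovers the closed-form minimizer f* = (1/2) log(eta / (1 - eta)) given later in the talk, and its sign implements the Bayes decision rule (a sketch using a plain grid search):

```python
import numpy as np

def exp_conditional_risk(eta, f):
    # Conditional risk of the exponential loss phi(v) = exp(-v):
    # C(eta, f) = eta * phi(f) + (1 - eta) * phi(-f)
    return eta * np.exp(-f) + (1.0 - eta) * np.exp(f)

fs = np.linspace(-5.0, 5.0, 100001)
for eta in [0.1, 0.3, 0.7, 0.9]:
    f_star = fs[int(np.argmin(exp_conditional_risk(eta, fs)))]
    closed_form = 0.5 * np.log(eta / (1.0 - eta))
    # sign(f_star) agrees with the Bayes rule: positive iff eta > 1/2.
    print(eta, round(f_star, 4), round(closed_form, 4))
```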

>> [INDISTINCT] >> MASNADI-SHIRAZI: I'm not assuming anything

about the data, whatsoever. All I have is just the data. I know that the blue circles

are one class, and the red Xs are another class. Given their

coordinates, I can--but these points were actually sampled from a Gaussian which I do

not know whatsoever. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Sure, sure, sure. I'm going to actually consider what happens when

you have, you know, a rogue point or something like that. It could actually change immediately.

So what I'd--what I'm going to do next is actually introduce an outlier here. So let's

say I had one sample point at that point. So the impact of adding a single X at that

point--which could still happen as part of the data, right, because it's Gaussian in

the horizontal, so that could be one of the Xs that come up--the impact--but in essence,

it is kind of an outlier for this data because it's highly unlikely. But, in any case, if

you do have a point at -2 and zero, what would happen is that all of these previous loss

functions would drastically shift their decision boundary to where the purple line

is. And they would place their decision boundary at about X equals -2.3. And the reason for

that is because all of these loss functions assign zero or close to zero loss for being

correct, right? This includes both the Savage loss which is the green one or the--sorry--the

Log is the green one and the Savage is the blue one. So you can see that they assign

zero penalty for being correct, and since placing the line at -2.3 classifies

everything correctly, that's just where they're going to put their boundary, which is very

far away from the Bayes optimal, had we known the true distribution. On the other hand,

the Tangent loss which is a loss function that we're going to derive during this talk

is also Bayes consistent. But since it penalizes being correct, it penalizes both the positive

side and the negative side. It discourages solutions that are too correct. So it will

actually place the decision boundary where the red line is given this exact data, which

is still very close to the Bayes decision rule. Yeah.
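
A toy reconstruction of this experiment follows (the data values, grids, and search procedure are my assumptions; the Tangent loss formula phi(v) = (2 arctan(v) - 1)^2 is taken from the speaker's CVPR 2010 paper):

```python
import numpy as np

# Assumed toy data, not the talk's exact points: Xs clustered around +3
# (label +1), Os around -3 (label -1), plus one stray X at x = -2
# playing the role of the outlier.
x_vals = np.concatenate([np.linspace(2.0, 4.0, 20), [-2.0],
                         np.linspace(-4.0, -2.5, 20)])
y_vals = np.concatenate([np.ones(21), -np.ones(20)])

def exp_loss(v):
    # Unbounded for negative margins, zero loss for large positive margins.
    return np.exp(-v)

def tangent_loss(v):
    # Tangent loss (CVPR 2010 paper): bounded on both sides, so it also
    # penalizes being "too correct".
    return (2.0 * np.arctan(v) - 1.0) ** 2

def best_boundary(loss):
    # Exhaustive search over scaled vertical boundaries f(x) = w * (x - t),
    # in the spirit of the exhaustive search the speaker describes.
    best_risk, best_t = np.inf, None
    for w in np.geomspace(0.05, 20.0, 80):
        for t in np.linspace(-3.0, 1.0, 161):
            risk = np.mean(loss(y_vals * w * (x_vals - t)))
            if risk < best_risk:
                best_risk, best_t = risk, t
    return best_t

t_exp = best_boundary(exp_loss)
t_tan = best_boundary(tangent_loss)
print(t_exp)  # jumps left of -2 to separate the outlier
print(t_tan)  # stays near the middle, close to the Bayes boundary
```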

>> [INDISTINCT] >> MASNADI-SHIRAZI: No, I'm not assuming anything

other than just the data. I mean, if-- >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah, yeah, that's true. If there was another, let say, another, you

know--let's say there were two Gaussians that together made up the red class.

That's true. That red point could be one sample from that. But the problem is that right now,

given the fact that it's just one point, it's still considered kind of an outlier. I will

actually deal with this problem later on because once you derive the Tangent loss, you can--you

can actually change the amount that it penalizes being too correct. So you could even actually,

you know, flatten this to make it look like--something like the Savage loss which wouldn't penalize

it and actually place the decision boundary very close to where the purple one is. So

you can have a control over this. It's just a matter of--given the data right now, I mean,

just looking at it, it would seem better to place the lines somewhere close to the black

one rather than drastically move it all the way to the left. Even if that were a distribution

and what--actually when I was building this, you know, toy example to get the point across

of why it's important to sometimes penalize being correct, if you actually add another

X to where that far left one is, the Tangent loss would also place its boundary close to

-2.3 because now its--you know, there's too much penalty for classifying two points wrong.

And you would expect that if that was a true--another mode. You would have more examples from it.

And once you actually add another one, it'll shift; it will be just too costly. It doesn't

want to make a mistake on two points, so. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah, yeah. But, I mean, this toy example is just to get across the

importance of actually penalizing, you know, outliers that are of this type. So you have

to--in order to deal with these, you have to kind of penalize being too correct sometimes

as well. So I don't know if I--yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: That's true. I mean, that's why you...

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah.

>> [INDISTINCT] >> MASNADI-SHIRAZI: That's true. But that's

usually what we have to--I mean, that's what we deal with in the real world. We'd always

have only a few samples. We never have the true distribution of the data. And so, I mean,

the purpose of this talk is to say that it's important to have a Bayes consistent loss

function that gives you that flexibility and that control. So when you are dealing with

situations where you have small amounts of data and the possibility of having outliers

is increased in that case, you want to be able to, you know, have loss functions that

are not only Bayes consistent but also have, you know, these different types of shapes

that can deal with these potential problems. >> [INDISTINCT]

>> MASNADI-SHIRAZI: So I actually--if you were to use the Hinge loss, which is what

the SVM does, I didn't--I didn't run an SVM because this is so simple, you can solve it

directly. You can just like do an exhaustive search over the space of lines here. But if

you were to use the Hinge loss, which is what the SVM uses, it does still place, you know--because

there's no penalty for being right, so why not place it where the purple ones--

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah, I didn't do a pass

like that on this, so. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: The Square loss? The Square loss--the Least Squares loss is actually almost

never used in classification because--let me go back to the shape. That purple line--I'll

give you a better example. So the red line is the Square loss, right? And you can see

that it penalizes being correct almost as much as it penalizes being wrong because of

its symmetry. It's just X squared, so it's never used for classification. If I were to

use--I actually--that's an interesting question. I--maybe I should have used it here and see

what would have happened. If I had actually used the Square loss--because then it would

penalize this, as well. And, let's see. If it penalizes that point so--actually the Square

loss wouldn't work. It would--it would place the line at--where the purple is, I think,

I'm kind of guessing, because there's going to be such a huge penalty for getting this

wrong, that it would prefer to get it right rather than skip over it. So the Tangent loss

here assigns a bounded penalty for being right; that's why it can disregard this outlier.

You see that it's kind of--you see that it tapers off here, right? It's bounded in that.

Whereas the square loss would go up, it would assign a huge loss for being--for any incorrectly

classified points of that sort. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. So--exactly. That's the whole point of a loss like this because

it wants to, you know, keep--it wants--it actually--in the paper, I actually have an

example where I've plotted like the histogram of the distance. This is kind of the margin.

So it actually encourages small margins that are placed at that dip. You know, I think

it's--that dip, that minimum is like 0.5. But the interesting thing is that when you

see that example, even though most of the points have a small margin of about 0.5, meaning

that--that minimum is actually enforced, it does better in classifying a data set that

has lots of outliers, so we were--I think, in that--in that example, we had some digit

classification problem and the margin is smaller but the classification output is better than

all of these other loss functions. So that's a--that's a problem that we actually considered

in the paper. So, just because the margin is smaller, it doesn't necessarily

mean you're doing worse, especially when you have outliers, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: That's true. It's just

how much--for example, when you're using Exponential loss, it tries so hard to increase the margin

that it, you know, screws up the classification which is the goal, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: For the--I wanted to give

these--I know what you're saying, but that's regression. It's not classification because

in regression--in classification, you actually have the two classes and you're trying to

minimize the risk. And in regression, yeah, if you allowed it to, you know, plot a line

through here, that would happen. It would place a horizontal line, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: What I think I did for

this data set was I--let's see. I think I limited the--I didn't--yeah, I didn't--maybe

I didn't want that to happen so I limited to vertical lines. That might have been what

I did. So if I remember correctly, that's probably what I did to avoid that. In any

case, I haven't even considered the Least Squares, so. Any--yeah.

>> [INDISTINCT] >> MASNADI-SHIRAZI: I--I think what I considered

was like from -10 to +10. Something like that. >> [INDISTINCT]

>> MASNADI-SHIRAZI: I wanted to put a Y dimension here just so it would look nice when I plot

it. Because if I only had one--that's why they're uniform in the Y direction. So I already

know that the Y direction is, you know, useless. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. Because if I had only plotted these points in the X direction,

you would get these circles on top of each other; you couldn't really see anything. But

that's why I said it's uniform in the vertical so you can disregard it completely. Yeah.

But the point of this toy problem is something completely different. You guys should let

me go further. What I wanted to just get across was the idea of the fact that some--first

of all, we have problems where we have outliers. And second of all, we want to deal with outliers

and there's two types of outliers in a sense. One is when you're wrong and--so one is this

type of outliers that I considered here, right? There's no--and the other is that when you

can be actually right, completely right, and still have, you know, an outlier. So the Savage

and the Log kind of deal with the first type, but you also want a loss that looks like this

that can, you know, penalize being too correct, right? When you have outliers, that's kind

of--that's just to get across the point of having a loss function that has a shape like

this. Now, of course, we'll go into more, you know, advanced and practical examples

so don't worry. But in order to derive something like the Tangent loss that I just talked about,

we need to talk a bit about risk minimization. So we define a feature, X, and the class label,

Y, and the--and we call our classifier, H, and our predictor, F. So our classifier is

just going to be the sign of our predictor, F. And we also have the loss function, L.

And so what we're going to be doing is minimizing the risk or the expected loss which is equivalent

to minimizing the conditional risk. And here we're going to define eta to be the probability

that Y equals one given X. So with these in mind, we'll consider the traditional approach of dealing

with a Bayes consistent loss function. We're going to actually talk about what does Bayes

consistency mean in more detail. So a good example is the exponential loss that we have

here. And the traditional approach is to, of course, take the--take the expectation,

find the risk, C. And then you want to minimize the risk, and find your predictor, F star.

And for the exponential, you can actually show that F star will be one half log of the

probability ratio. So the question now is to see whether our predictor is Bayes consistent

and we've plotted this F star and you can see that it is indeed Bayes consistent because

it assigns positive values to probabilities that are higher than a half and negative values

to probabilities, eta, that are less than a half. So it's implementing the Bayes decision

rule which is good. So we're good there. We take our F star and we plug it back into our

risk to find our minimum risk which we call C star. And then what we have to do here is

check to see whether C star is convex or not. And the reason for that is because it can

be shown that in order to have a convex optimization problem, you need to have a minimum conditional

risk, C star, that's convex. So, for the exponential loss, everything works out fine, but

you can immediately see from this example that this traditional approach doesn't actually

let you design Bayes consistent loss functions. It only lets you check to see whether a loss

function is Bayes consistent or not. And we can kind of say that we got lucky with the

exponential because all these things, you know, fall into place. So in our next section,

we'd like to talk about how can we actually--we're going to propose a new path for designing

Bayes consistent loss functions. And the general approach is going to be that we can choose

any strictly concave function, C star, and any invertible function, F star, such that

they satisfy some simple symmetry constraints, and we plug these two functions into this

formula and we get a phi, a loss, that is guaranteed to be Bayes consistent. So this

gives us a principled way of deriving, designing novel Bayes consistent loss functions. Now--I'm

not going to actually go over the proof of that formula. The proof can be found

in our NIPS '08 paper and it basically deals with probability elicitation and Bregman divergences

and so I'm going to kind of skip over these. But we're going to go over an example of actually

deriving--using the formula and deriving a novel Bayes consistent loss function. Here

we're going to derive the Savage loss. So, let's say we start off with the Logistic

Link function that we actually saw in the Exponential loss as well, and we pair it with

the Least Squares minimum risk. And these two satisfy the symmetry constraints. All

we have to do is take these and put them into the formula and we'd derive a novel loss that

we called the Savage loss and we've plotted it. And you can see that this loss function,

unlike all previous Bayes consistent loss functions, is first of all not convex and

also it is bounded, so it flattens out. It has a bounded penalty. And we use this loss

function in--to do some robust classification. But for a truly robust loss function, we saw

in our previous discussion that for a hypothetical loss, we--it should have a certain number

of properties. We would, first of all, like it to be bounded for large negative margins.

And we'd also like it to be--have a smaller bounded penalty--assign a smaller bounded

penalty for large positive margins. And we'd also like it to be margin enforcing of course.

And it can be shown that under Bayes consistency, those three properties are satisfied if and

only if these three conditions or requirements hold for our choice of F star and C star.

And, unfortunately, existing link functions, F star, don't comply with those requirements.

So we introduced the Tangent Link which does. So this is the plot of the Tangent Link and

we pair it up with the Least Squares minimum conditional risk again, plug it into the

formula, and we'd get what we called the Tangent loss which actually has the desired shape

of course. So now that we have our loss function, we can easily use it to do classification

and derive like, for example, a Boosting algorithm to minimize the risk associated with this

loss. I'm not going to go over deriving the Boosting algorithm and so on; the

details can be found in the paper. I'll just go over some examples and experiments that

we did. We've done experiments on image classification and also Multiple Instance Learning data sets.
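
As a recap before the experiments, the design recipe described above can be sketched numerically. The generator formula below, phi(v) = C*(eta) + (1 - eta) C*'(eta) with eta = [f*]^{-1}(v), is quoted from the NIPS '08 paper rather than from this talk, so treat its exact form (and the scaling of C* used for the Tangent loss) as an assumption:

```python
import numpy as np

def make_loss(C_star, dC_star, eta_of_v):
    # Loss-design formula from the NIPS '08 paper (as stated there):
    #   phi(v) = C*(eta) + (1 - eta) * C*'(eta),  eta = [f*]^{-1}(v)
    def phi(v):
        eta = eta_of_v(v)
        return C_star(eta) + (1.0 - eta) * dC_star(eta)
    return phi

v = np.linspace(-3.0, 3.0, 601)

# Logistic link + least-squares minimum risk  ->  the Savage loss.
savage = make_loss(lambda e: e * (1.0 - e),
                   lambda e: 1.0 - 2.0 * e,
                   lambda v: 1.0 / (1.0 + np.exp(-2.0 * v)))
assert np.allclose(savage(v), 1.0 / (1.0 + np.exp(2.0 * v)) ** 2)

# Tangent link + scaled least-squares minimum risk  ->  the Tangent loss.
# (The tangent link lets eta leave [0, 1]; the algebraic identity still holds.)
tangent = make_loss(lambda e: 4.0 * e * (1.0 - e),
                    lambda e: 4.0 - 8.0 * e,
                    lambda v: np.arctan(v) + 0.5)
assert np.allclose(tangent(v), (2.0 * np.arctan(v) - 1.0) ** 2)

# Sanity check: logistic link + C*(eta) = 2*sqrt(eta*(1-eta)) recovers the
# plain exponential loss exp(-v), as in the traditional example earlier.
expo = make_loss(lambda e: 2.0 * np.sqrt(e * (1.0 - e)),
                 lambda e: (1.0 - 2.0 * e) / np.sqrt(e * (1.0 - e)),
                 lambda v: 1.0 / (1.0 + np.exp(-2.0 * v)))
assert np.allclose(expo(v), np.exp(-v))
print("Savage, Tangent, and exponential losses all recovered")
```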

But here, I'm going to talk about the scene classification, the 15 class scene classification

problem. We used the semantic space representation from Rasiwasia and Vasconcelos CVPR '09. And

the only thing that we changed was to replace the SVM classifier with the Tangent loss classifier.

Now, their scene classification method was already state of the art, but we

got 4% improvement with just simply, you know, an outlier resistant robust classifier instead

of the SVM. So--and actually, the interesting thing about this data set is that most of

the improvements were seen between classes that were--I'm starting to lose my voice.

>> [INDISTINCT] >> MASNADI-SHIRAZI: What was that?

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah. We're doing one

versus all. The most improvement was seen in--between classes that are easily confusable.

For example, when you're trying to distinguish between the highway class, or the street class,

or the tall building class, or when you're trying to distinguish between the mountain,

and forest, and open country class, these are actually one of the--like these are some

of the 15 classes out there. There's also living room, and bedroom, and kitchen, and

these are similar but they actually produce the most improvements which is expected when

you're doing a robust classification. We've also done experiments on tracking and we used

the Discriminant Saliency Tracker of Mahadevan and Vasconcelos CVPR '09. And we use TangentBoost

to combine the saliency maps in a discriminant manner. Here, in these two examples, tracking

examples, the red box is for when the Tangent loss is used to do tracking and the dotted

blue is when the Exponential loss, which isn't robust, is used. This one--let's see if it

works. It still doesn't work on my computer; that's weird. But--okay, this one should work.

So you can see that the Tangent loss keeps track of the object whereas the Exponential

loss drifts off and loses track. And this is because the person keeps changing shape

and there are plenty of outliers so it's much more robust. Now a natural question that was

already kind of asked was, "Can we control the shape of the Tangent loss?" So, there

is actually a parameter, A, that controls the ratio between the negative margin and

the positive margin. So the general formula is--includes the A. And we've plotted the

Tangent loss for different values of A. And you can see that--you can actually either--actually,

the best results are found through cross-validation for A or you can kind of like guess A depending

on how many outliers you expect to have in your data. But in general, you can kind of--you

can control the amount of penalty that you assign to the different margins. So, you have

that flexibility. Another question is the question of convexity. So, here we have examples

of a convex function, a quasi-convex function, and a function that's neither convex nor quasi-convex.

And so, for--a quasi-convex function isn't convex but it still has a unique minimum.

On the other hand, all convex functions are also quasi-convex. And we can actually prove

that Bayes consistent loss functions from this formula are all quasi-convex. Now, the

question--the main question is, "Is our optimization problem still convex even when the loss is

not convex?" So, if--so, the thing to keep in mind here is that we're not minimizing

the loss, we're minimizing the expected loss or the risk. So with that in mind, we can

prove that the risk is actually still convex in the probability space even though the loss

is not. We can also prove that the risk is quasi-convex in functional space. And this

is an important proof because when you're doing classification, you're usually not working

in probability space. You're working directly in the functional space. You're trying to

directly find the function that classifies--that minimizes your risk. So it's still quasi-convex

even though the loss is not. Also, more importantly for practice, the empirical risk

is also quasi-convex in functional space. So in conclusion, we can say that the optimization

problem is quasi-convex and has a unique minimum that can be found with functional gradient

descent such as Boosting even though the loss function is not necessarily convex. So--
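
That conclusion can be spot-checked numerically for the two-point case raised in the discussion that follows (a grid check over a one-parameter family of boundaries, not a proof; the Tangent loss formula is taken from the CVPR 2010 paper):

```python
import numpy as np

def tangent_loss(v):
    # Tangent loss (CVPR 2010): quasi-convex but not convex.
    return (2.0 * np.arctan(v) - 1.0) ** 2

# Tiny problem: one positive point and one negative point, and the
# one-parameter boundary family f(x) = x - t. Empirical risk vs. t:
x_pos, x_neg = 1.0, -1.0
t = np.linspace(-10.0, 10.0, 4001)
risk = tangent_loss(x_pos - t) + tangent_loss(-(x_neg - t))

# Quasi-convexity check: the discrete slope changes sign at most once
# (decreasing, then increasing), i.e. no spurious local minima.
slopes = np.sign(np.diff(risk))
slopes = slopes[slopes != 0]          # ignore numerically flat stretches
changes = int(np.count_nonzero(np.diff(slopes) != 0))
print("sign changes in slope:", changes)
```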

>> [INDISTINCT] >> MASNADI-SHIRAZI: The optimization is not

by definition convex, but it's quasi-convex, meaning that it still has a unique minimum.

So you can use quasi-convex methods to find the minimum. The easiest thing that works

is just do functional gradient descent which is what Boosting does. That's why the Boosting

algorithms work even though these loss functions have all these, you know, quasi-convex shapes.

So the important thing is that it still has a unique minimum.

>> It's [INDISTINCT] minimize [INDISTINCT] so you can have [INDISTINCT] function [INDISTINCT].

>> MASNADI-SHIRAZI: No, no, no, no, no. That's what I mean by unique minimum; it doesn't

have local minima. Even the empirical risk won't have local minima. That you'd--that's

what's... >> [INDISTINCT] I believe this is to be true,

but [INDISTINCT] is true [INDISTINCT] some [INDISTINCT].

>> MASNADI-SHIRAZI: Well, I can actually prove this, I mean, if you want later. Initially,

that--I mean, that's the question that people have. They think that, "Oh, since you have

a bunch of these little dips for these different points." But, think of it this way. Let's

say you only had one point and you're trying to do--like the worse case would be to just

have like one or two points. You can still show that the sum of two quasi-convex functions

that have this formula are themselves quasi-convex, meaning that the sum of even two--like the

simplest classification problem in the world is just one point here and one point there,

right? The sum of these two would still be another quasi-convex function. That's the

empirical risk, right, the sum of two of these losses. Let's say you sum two of those.

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yup. You can sum any two

and they'll still be quasi-convex as long as they come from this formula.

>> [INDISTINCT] >> MASNADI-SHIRAZI: Maybe I should have. But

the thing is that before giving--I got this question during my CVPR talk and I had gotten

the first question when I gave the NIPS paper. So I knew I was good in terms of probability

space, but that's not the space we really were--are dealing with. Because when you do

a transition from the probability space, the transform doesn't necessarily preserve

convexity. So when you transform from a probability space into functional space, there's a chance

that you'll lose convexity. And so I had to actually go and prove two and three as well

which--that's why I wanted two weeks before I came to give this talk because I spent the

first week actually proving two and three. >> [INDISTINCT]

>> MASNADI-SHIRAZI: It's actually quite simple to do. If you take the derivative

of this function down here, you can prove that it's quasi-convex by the classic way

of proving quasi-convexity, which is to show that the derivative is negative up to a certain

point and then positive from then on, meaning that there's only one sign change. That's--that's

the general--one of the ways to prove quasi-convexity. So--
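That derivative test can be sketched directly, again under the assumption that the Tangent loss has the hypothetical form phi(v) = (2*arctan(v) - 1)^2: its derivative is negative up to the single stationary point v* = tan(1/2) and positive afterwards, i.e. exactly one sign change:

```python
import math

def tangent_loss_grad(v):
    # Derivative of the assumed Tangent loss (2*atan(v) - 1)^2.
    # Its sign equals the sign of (2*atan(v) - 1), since 4/(1+v^2) > 0.
    return 4.0 * (2.0 * math.atan(v) - 1.0) / (1.0 + v * v)

v_star = math.tan(0.5)   # the single stationary point, ~0.546

# Negative up to v*, positive from then on: one sign change,
# which is the classic one-dimensional test for quasi-convexity.
assert all(tangent_loss_grad(v) < 0 for v in (-100.0, -1.0, 0.0, 0.5))
assert all(tangent_loss_grad(v) > 0 for v in (0.6, 1.0, 10.0, 100.0))
assert abs(tangent_loss_grad(v_star)) < 1e-10
```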

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah. So the sum, you

know, adds a little bit more but it's still nothing more than adding one function phi

of V plus, let's say, alpha times phi of negative V which is the positive class plus the negative

class. So--and in fact, once you prove two, three, which is the case that

you're talking about, it's not that bad afterwards. So we can go over it--yeah. Maybe that's what

I should have concentrated on more. Later. Yeah.

>> [INDISTINCT] >> MASNADI-SHIRAZI: This, I--this, I haven't

actually included in any of the papers. Part one is in the NIPS paper, theorem one, but

two and three were added to the paper that--like the journal version of the paper that I'm

writing and it's almost done. This is kind of like the missing link, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: No, just any. I just gave

the two simple examples. But the point is that you can add alpha of the positive

terms and beta of the negative terms and still have a quasi-convex sum. So the thing about convexity is

that when they're talking about empirical risk, we all know that the sum of a bunch

of convex functions is still going to be convex. And that's why when you're doing empirical

risk minimization using convex loss functions, the thing works well because you know that

when you sum up a bunch of convex functions, you're still going to have convexity. The

same theorem, unfortunately, doesn't exist for quasi-convex functions in general. You

can't say just summing up a bunch of quasi-convex functions will ensure quasi-convexity. But, fortunately,

if they come from this formula, they do. So-- >> [INDISTINCT]

>> MASNADI-SHIRAZI: The Sigmoid is almost like Savage but it's not exactly. It's not

one of these--I'm trying to find the formula for Savage. Here it is. It looks almost like

the Savage but it's not exactly. The Savage does, too. I mean, it's quasi-convex. It's

derived from that formula. It will still work. Yeah. It seems remarkable that you would

have a loss function looking like that which would still work well. So--so I can

move on, hopefully. Okay. So in conclusion for this section, we showed how to derive

novel Bayes consistent loss functions for classification and we derived the set of requirements

that a robust Bayes consistent loss function should have. And we also showed that Bayes

consistent loss functions can be non-convex. And we gave some examples of using the Tangent

loss functions for doing classification. I don't know if we have enough time but we can--I

also wanted to talk about deriving Bayes consistent loss functions that are cost sensitive. So

I've been talking for about 50 minutes already. Do you want me to continue or...?

>> [INDISTINCT] so many questions so I think it's okay [INDISTINCT]

>> MASNADI-SHIRAZI: Okay. So many classification problems are actually naturally cost sensitive.

For example, when we're doing detection, or fraud detection, or medical diagnosis, or

business decision making, the cost of misclassifying one class is usually much higher than the

other. So for medical diagnosis, for example, missing that tumor might be considered as

having an infinite amount of loss because you, you know, the--it might result in a death

of a patient; whereas, on the other hand, a false alarm or a false positive would just

cost a bit more money because you have to do further experiments. So there's obviously

a demand for cost-sensitive extensions of state of the art techniques. And we--here

we propose an alternative strategy for the design of cost-sensitive algorithms. And rather

than directly manipulating the algorithm and risk compromising Bayes consistency or Bayes

optimality, we extended the underlying loss function and derived a cost-sensitive learning

algorithm from that loss that implements the Bayes decision rule and also approximates

the Bayes risk. So here--this is just the same slide that we had before to remind us

of the general path that we took before, for deriving Bayes consistent loss functions.

We would choose any strictly concave function, C star, and invertible function,

F star, such that they satisfy these

constraints. We'd plug it into the formula and we were guaranteed to have a Bayes consistent

loss function. And the question now is, "Can we have a general procedure for deriving cost-sensitive

loss function as well?" So here, we're going to define C1 as the cost of a false negative

and C -1 as the cost of a false positive, and we'd like to ask what conditions

our choice of the F star and the C star must now satisfy such that we have a cost-sensitive

loss that's also Bayes consistent. So generally, we would have a loss phi1 assigned to the

positive class and the loss phi -1 assigned to the negative class, but we'd still want

to conserve Bayes consistency. So the generic procedure is to start from an invertible predictor,

F star that has this symmetry and the concave function, C star that has this symmetry. And

this symmetry is basically the same symmetry that the Bayes risk has. So we're ideally

trying to find a risk that has the same behavior as the Bayes risk, and then we define J to

be negative of our C star, and we derive these auxiliary functions I1 and I -1 from these

two formulas, and our loss functions can be shown to basically be the negative of I1.

So phi1 is going to be negative I1, and phi -1 is going to be negative I -1. And these

come from probability elicitation. The proof comes from there. But in general, the theorem

is that the loss derived through this procedure is cost-sensitive and Bayes consistent. So

here, we're going to go over an example which would be the exponential--we're going to try

to derive the exponential cost-sensitive loss. We choose F star to be this formula which

is the cost-sensitive version of the Log link, and you can see that it is cost-sensitive.

It implements the cost-sensitive Bayes decision rule because now, the boundary has shifted

from a half to, in this case, 0.3 because C1 was chosen to be two. So all points that

have probabilities higher than 0.3 are now assigned positive values and all points that

have probability less than 0.3 are assigned negative values. So we've shifted the boundary

to make it--to prefer one class. And we also have the cost-sensitive version of the risk.

And you can see that it's still concave but it has an asymmetric shape. And

the point here is that the maximum of this risk is, again, at 0.3. So the maximum risk

is being assigned to the boundary which is what we want. Going through all the steps,

we can find phi1 to be like this, which is now a cost-sensitive version, and there

are now two loss functions. One is the red one that's used for the negative class and

the blue one is the loss that's used for the positive class, and it's still Bayes consistent.
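A sketch of one plausible cost-sensitive exponential loss consistent with this description (the exact formulas in the papers may differ): phi1(v) = exp(-C1*v) for the positive class and phi-1(v) = exp(C-1*v) for the negative class. Minimizing the conditional risk gives a link whose sign flips at eta = C-1/(C1 + C-1), which is 1/3 for C1 = 2 and C-1 = 1 — roughly the 0.3 boundary mentioned:

```python
import math

def cs_exp_loss(y, v, c1=2.0, c_1=1.0):
    # Hypothetical cost-sensitive exponential loss: the positive class
    # (miss cost C1) and the negative class (false-positive cost C-1)
    # get different exponential rates.
    return math.exp(-c1 * v) if y == 1 else math.exp(c_1 * v)

def conditional_risk(eta, v, c1=2.0, c_1=1.0):
    # Expected loss at a point with posterior P(y=1|x) = eta.
    return eta * cs_exp_loss(1, v, c1, c_1) + (1 - eta) * cs_exp_loss(-1, v, c1, c_1)

def link(eta, c1=2.0, c_1=1.0):
    # Closed-form minimizer of the conditional risk:
    # f*(eta) = log(C1*eta / (C-1*(1-eta))) / (C1 + C-1)
    return math.log(c1 * eta / (c_1 * (1 - eta))) / (c1 + c_1)

# The sign of f* flips at eta = C-1/(C1 + C-1) = 1/3 for C1=2, C-1=1,
# so the boundary shifts from 1/2 toward preferring the positive class.
assert abs(link(1.0 / 3.0)) < 1e-12
assert link(0.4) > 0 and link(0.2) < 0

# f* really is the risk minimizer (spot check by perturbation).
v0 = link(0.4)
assert conditional_risk(0.4, v0) <= conditional_risk(0.4, v0 + 0.01)
assert conditional_risk(0.4, v0) <= conditional_risk(0.4, v0 - 0.01)

# With equal unit costs it reduces to the standard AdaBoost link
# (1/2) * log(eta / (1 - eta)).
assert abs(link(0.7, 1.0, 1.0) - 0.5 * math.log(0.7 / (1 - 0.7))) < 1e-12
```

The equal-cost reduction is the same sanity check the speaker applies to the cost-sensitive SVM later: setting C1 = C-1 = 1 must recover the cost-insensitive algorithm.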

So we've actually used the... >> [INDISTINCT] properties that we had earlier

[INDISTINCT] >> MASNADI-SHIRAZI: You could--I haven't--in

order to get a loss function that looks different, you would have to choose different F stars

and C stars. So if you could go back and derive the cost-sensitive asymmetric version of the--well,

you already have it here--combine it with the cost-sensitive Tangent Link, you could

probably get an analogous version for the cost-sensitive one. I haven't done it but,

you know, maybe that's up for grabs. Yeah. So the Least Squares is the same C star that's

used in Tangent. The difference here is that the Tangent uses a different F star. So if

you derive the cost-sensitive version of that one, you could get something similar. But

here we used just the Exponential and the reason for that was because--the history of

this is because I wanted to derive the cost-sensitive version of AdaBoost, and that uses the Exponential

loss. But here, using this cost-sensitive and still Bayes consistent version of Boosting, we did

some--we've done a lot of experiments for pedestrian, car, and face detection. We've

even used it for Brain EEG Surprise Signal Detection. But I'm going to move on to deriving

the cost-sensitive SVM. >> I remember, there was an asymmetrical version

of AdaBoost, but I only [INDISTINCT] >> MASNADI-SHIRAZI: So there have been lots

of papers on making AdaBoost cost-sensitive. In my--actually in the PAMI paper, that's

basically what it's all about. It's going through the history of this and finally, you

know, deriving this which is the cost--an interesting thing about this is that it's

still Bayes consistent. All of the previous methods usually dealt with altering the AdaBoost's

algorithm directly. What they would generally do is change the weights. So the AdaBoost

has its weight update formula. They would go at--introduce costs directly to the weights

and depending on how you'd kind of change the weights, you would get different algorithms.

But no one--when you go and alter the algorithm like that, you have to still kind of show

that it's Bayes consistent because otherwise, you don't know what you're doing anymore;

are you really minimizing, you know, what you should be? And given an infinite amount of

data, will you still converge to the Bayes decision rule which is what you should be

doing? So here, when you show that it's Bayes consistent, you answer those problems and

you don't need to go directly in it, like manipulate the algorithm. Once you have the

correct loss, you can just derive the classifier from it and that's what we did. So for the--we're

going to try to do the same thing for the SVM because there have been many papers on

cost-sensitive SVMs but, again, what they would do is, you know, go into the algorithm

and, you know, tweak this, or change that portion, or add the cost here or there. But

the principled way to do it is to actually derive the cost-sensitive loss function, show

that it's Bayes consistent, and then once you have that, you know, the rest is just--you

know, go through the steps of finding the algorithm. So here, we're going to--we're

going to start with F--we're going to choose our F star to be this which can be shown to

be Bayes consistent and it implements the Bayes decision rule. And in fact, it's just

the cost-sensitive version of the link used in SVM. So if you set--sorry--C1

and C -1 = 1, it will just, you know, simplify to the sign of 2 eta - 1, which is the exact same

thing used in SVMs. So this suggests the cost-sensitive minimum conditional

risk. And this has four parameters, A, B, D, and E, that can be shown to satisfy the

symmetric conditions for Bayes consistency if they satisfy these conditions. But in any

case, if you go through the steps, you can find the cost-sensitive SVM loss function

which is what I have there. So this loss has four degrees of freedom; A, B, E, and D. And

for the positive samples, basically, they're assigned a margin of E/D and a

hinge slope of D. So if the red line is the positive hinge loss, they'd get a slope of

D and a margin of E/D. For the negative, you have a margin of B/A and a hinge slope of

A. So the blue line is the hinge loss that's used for the negative samples. And after some

parameter normalization, you can find C star to be like this, but you go through the steps.
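A minimal sketch of the four-parameter hinge being described — an assumed form matching the stated slopes and margins, not necessarily the paper's exact parameterization — with the speaker's sanity check that it collapses to the standard SVM hinge when all parameters are 1:

```python
def cs_hinge(y, v, a=1.0, b=1.0, d=1.0, e=1.0):
    # Hypothetical cost-sensitive hinge: positives get slope D and
    # margin E/D, negatives get slope A and margin B/A.
    if y == 1:
        return max(0.0, e - d * v)
    return max(0.0, b + a * v)

# Positive-class hinge: zero for v >= E/D (its margin), slope D below it.
assert cs_hinge(1, 3.0, d=2.0, e=4.0) == 0.0      # 3 >= 4/2
assert cs_hinge(1, 0.0, d=2.0, e=4.0) == 4.0      # penalty E at v = 0

# Negative-class hinge: zero for v <= -B/A, slope A above it.
assert cs_hinge(-1, -2.0, a=3.0, b=3.0) == 0.0    # -2 <= -3/3
assert cs_hinge(-1, 0.0, a=3.0, b=3.0) == 3.0

# Sanity check from the talk: with all four parameters equal to 1 the
# loss collapses to the standard SVM hinge max(0, 1 - y*v).
for y in (1, -1):
    for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
        assert cs_hinge(y, v) == max(0.0, 1.0 - y * v)
```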

And now that you have the loss, the risk basically leads to the cost-sensitive SVM primal formulation

and this is assuming that you--you're preferring the positive class. You can derive a similar

primal formulation [INDISTINCT] equal to 1, which is a good check, right? Because if it

doesn't return to the standard SVM when I set the cost equal to 1, I've probably done

something wrong, but it does actually simplify to the standard SVM when you set the costs

equal. But--here we have some experiments that we've done using the cost-sensitive SVM

and we've compared it to the previous algorithms; the boundary movement SVM and the bias penalty

SVM. So the bias penalty SVM would introduce costs on the slack variables directly, and

the boundary movement SVM would basically just change the bias of your final SVM. So

compared to these two competitors, the--we did experiments on 10 UCI data sets and cost-sensitive

SVM has smaller error on seven out of the 10 data sets and performs equal to the bias penalty SVM

on two other ones. We also did experiments on the German Credit data set. And here the

goal is to identify bad credit customers to be denied credit by the bank. Each person

is described by 24 features, such as the person's age and income, and we're

trying to decide whether the bank should, you know, give this person credit or not.

An interesting thing is that, here, the cost matrix is actually defined in the problem

because if a good credit customer is denied credit, the bank has a loss of one; meaning

that the bank loses a good customer. Whereas if you give credit to a bad customer, the

bank suffers a loss of five, meaning that the bank loses money. So this suggests

the following constraint on our choice of C1 and C -1. Their ratio should be 5. And

this is defined in the problem, so it's part of the data set. But the final result is that

if you use cost-sensitive SVM, you get a total reduction in cost of 37%. So in conclusion,

we extended the--uh-huh? >> [INDISTINCT]

>> MASNADI-SHIRAZI: So here is--no. When you're--there are different ways to compute the error. So

here, what we did was just--yeah, just count the number of errors that you have over the

data. There isn't any cost-sensitivity incorporated within the data set itself because I don't

know how much a miss would cost when I'm doing, let's say, heart disease or I'm trying to

do Pima. I wouldn't know--I mean, you can define these to be whatever you want. It would--it

wouldn't change the result. It would just be multiplying these by a number.
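The two error measures being discussed can be sketched as follows; the cost values are illustrative, and the speaker's point is exactly the last line — a cost-weighted error only multiplies the two error counts by numbers:

```python
def error_rate(y_true, y_pred):
    # Plain error: just count the number of mistakes over the data.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def cost_weighted_error(y_true, y_pred, c1=1.0, c_1=1.0):
    # Same counts, but misses (true +1, predicted -1) are weighted by C1
    # and false positives by C-1.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == -1:
            total += c1          # miss
        elif t == -1 and p == 1:
            total += c_1         # false positive
    return total / len(y_true)

y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, -1, 1, -1, 1, -1]   # one miss, one false positive

assert error_rate(y_true, y_pred) == 2 / 6
# With equal unit costs the two measures coincide ...
assert cost_weighted_error(y_true, y_pred) == error_rate(y_true, y_pred)
# ... and with a 5:1 ratio (as in German Credit) only the weights change.
assert cost_weighted_error(y_true, y_pred, c1=5.0, c_1=1.0) == (5.0 + 1.0) / 6
```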

>> I was wondering, let's say, you changed the [INDISTINCT].

>> MASNADI-SHIRAZI: Yeah. So what I--okay. So to go into--yeah--to go into the details

of how I actually did this, the way we did it was we--all of these algorithms have their

different parameters, right, like the C1, C -1. So you--we tried all the different parameters

such that all of these algorithms attained a certain detection rate on a validation set.

So, for example, we said we want to have all of these such that they have 90% detection

rate. Once they're equal in that sense, I would go to the test set, and test it, and

see how many false positives they gave me. And of course, the one with the smallest false

positive would be the best because I've ensured that their detection rates are fixed given

the amount--you know, like, the amount of flexibility these have. The cost-sensitive

and bias penalty SVMs, they have the same number of parameters that you can tweak. But

the boundary movement SVM, the only thing that you can change there is just the bias,

so. But with that in mind, that's the fair way of comparing them. So for this section,

we extended the probability elicitation view of loss function design to cost-sensitive

problems, and we specified the generic procedure for deriving cost-sensitive algorithms, and

also derived the cost-sensitive Bayes consistent Boosting and SVM loss functions, and also

went over some experimental results. And thank you. So I'm open to any more questions.

I met him actually last month at CVPR where he gave a very interesting talk and I thought

it would be great if he shares some of his work with us here at Google. And just to spell

his full name, so it's Hamed Masnadi-Shirazi, and he works at--is in the middle of his PhD

at UCS, San Diego Western, you know, in the Department for Statistical Visual Computing.

And, yeah, I'm very pleased to hear about On the Design of Bayes Consistent Loss Functions

for Classification. Mr. Hamed, please. >> MASNADI-SHIRAZI: So thank you for the introduction.

Yeah, so the title is On the Design of Bayes Consistent Loss Functions for Classification,

and this is work that I've done at the Statistical Visual Computing Lab at the University of

California - San Diego. So during this talk, we're going to basically be talking about

classification. Many classification algorithms, some of the most successful ones such as SVMs

and Boosting are basically based on minimizing an expected value of a loss function phi,

which is close margin enforcing and Bayes consistent. So such loss functions assign

a large penalty to points with negative margin, and they generally assign a small penalty

to points of small positive margin and zero or close to zero penalty to points of large

positive margin. So we also need to talk about Bayes consistent loss functions and like I

said, Bayes consistent loss functions are very special and important for classification.

They--and the reason for that is because they produce the Bayes optimal decision rule without

having to know anything about the distribution of the data. So, an example for general Bayes

consistent loss function phi, we would take the--we would take the expected loss to find

the risk and we would minimize the risk to find our predictor, F star. And given that

our loss function was Bayes consistent, we're guaranteed that F star, our predictor, is

Bayes optimal. It implements the Bayes decision rule. And you can see that through this progression,

we didn't make any assumptions about the model of the data. We didn't know the distribution

of the data, nothing like that, but yet we were still able to arrive at the Bayes optimal

decision rule. So, on the other hand, unfortunately, only a handful of Bayes consistent loss functions

were previously known in the literature. Some examples include the well known Least Squares

loss; that's used all over the place. Another example is the Hinge loss that's used in SVM

classifiers; or the exponential loss that's used in Boosting. Another example is the Log

loss that's used in Logistic Regression and also in Boosting. And you can see that each

one of these previously known Bayes consistent loss functions has led to a highly successful

classification algorithm. We've plotted, actually, these Bayes consistent loss functions, and

just by looking at them, you can see that they all generally have a few properties in

common. One is that, they're all convex loss functions. And the ones that are considered

here are all unbounded for large negative margins. And number three is that, they all

have a zero or close to zero positive margins; of course, this excludes the Least Squares

that increases on both sides. And the last property is that they are all cost insensitive,

meaning that they assign the same amount of loss to both the positive class from--and

the negative class. So they don't prefer one class over the other. So the natural question

that we could ask at this point is, "Are we stuck with just these 4 or 5 loss functions

with those common shapes?" And can we actually derive other Bayes consistent loss functions

that are custom tailored for our application? Can we derive Bayes consistent loss functions

that have other shapes and other properties? For example, can we have a Bayes consistent

loss function that's non convex like the plotted loss function? Can we have a loss function

that's Bayes consistent and yet has bounded negative margins? As you can see, this loss

function flattens out. Can we have a loss function that has a non zero bounded positive

margins unlike previous loss functions? Can we have a loss function that's Bayes consistent

and is cost sensitive? So here we have two loss functions; one is assigned to the positive

class and then--and the other is assigned to the negative class. And in general, can

we have a Bayes consistent loss function that's just any shape we want it to be given our

application? So these problems and questions were previously known and researchers generally

would try to start off with the known Bayes consistent loss function. And in order to

alter it to their application, they would either make alterations to the loss directly

or alter the classification algorithm derived from the loss. And the problem with that is

that once you actually alter the Bayes consistent loss function in any way, it ceases to be

Bayes consistent. So during this talk, what we're going to show is how to actually derive

novel Bayes consistent loss functions, and this is work that has been covered in our

NIPS '08 paper. We're also going to talk about designing novel loss functions--Bayes consistent

loss functions for specific applications. And two of the applications that we'll cover

during this talk--yeah, the first one is going to be the designing loss functions that are

robust or outlier resistant for robust classification, and this is work that's been covered in our

CVPR 2010 paper. And we're also going to talk about designing loss functions that are Bayes

consistent and cost sensitive. So this is work that's been covered in our ICML 2010

and IEEE PAMI 2010 paper. So now that we know the importance of having Bayes consistent

loss functions for classification, we can talk about some specific problems and applications.

An interesting problem is the computer vision problem. Here, we have large margin loss functions,

generally, don't overcome the unique challenges posed by computer vision problems. And one

major difficulty is the prevalence of noise and outliers or class ambiguity that exists

in loss--in computer vision data sets. So here, for example, we have an image of a famous

street in San Francisco and this is clearly an image of a street. But if we were to do

patch-based image classification on this, we would get many patches that include trees,

and cars, and building even though this is supposed to be classified as an image of a

street. So in order to be able to do robust classification, we're going to try to design

a robust loss function. One limitation of current loss functions is their unbounded

negative margin. So for this simple classification problem we have at the top, we clearly have

three outliers. Those are the three Xs--the green Xs to the left are obviously outliers.

And if we were to use the well-known exponential loss to classify this problem--the exponential

loss has been plotted in light blue, the light blue dotted line. And because of its exponential

shape, it basically will emphasize and concentrate on these three outliers and assign to them

an exponentially high loss, whereas, what we'd really like it to do is to disregard

these three outliers. So, recent improvements have been made in terms of dealing with this

type of outliers. One is the linearly growing Logit loss used in LogitBoost introduced by

Friedman in 2000. And that's the green line. You can see that it grows linearly instead

of exponentially, so that's an improvement. The other is the bounded Savage loss used

in SavageBoost that we introduced in '08, and that's the dotted blue line in the bottom.

You can see that it's actually bounded; it flattens out after a while. And so these help

with this type of outliers. But we argued that the negative margins are not the entire

problem and that it's also important to penalize large positive margins. So in order to explain

this further, we can consider this simple classification algorithm--I mean, this simple

classification problem we have where the data is distributed uniformly in the vertical direction

and is Gaussian with equal variants and different means of +3 and -3 in the horizontal direction.

So for these two classes it's easy to see that the Bayes decision rule would be aligned

at X equals zero. And in fact, if any of the previous Bayes consistent loss functions are

used on this exact data set, they would all place their decision boundaries where the

red line is which is very close to the black Bayes decision rule. So the--all the previous

Bayes consistent loss functions will approximate the Bayes decision rule almost exactly.

>> [INDISTINCT] >> MASNADI-SHIRAZI: I didn't get the question.

What was that? >> [INDISTINCT]

>> MASNADI-SHIRAZI: [INDISTINCT], yeah. The fact that I know that the optimal Bayes decision

rule is the X line is because I know that they're Gaussian. But if all I had were these

few points and I used the Bayes consistent loss function, anyone of these loss functions

that I have on the right, they would in--actually for this data set, give you the red line,

which is almost on top of the black line, which is the optimal. So even if you didn't

know anything about this distribution and all I gave you were these few points and you

were to follow the procedure of--start off with the Bayes consistent loss function, compute

the risk, minimize the risk, you would get the red line without any information about

the distribution. So that's the power of Bayes consistent loss functions.

>> [INDISTINCT] >> MASNADI-SHIRAZI: I'm not assuming anything

about the data, whatsoever. All I have is just the data. I know that the blue line is

one class, and the blue circles is one class, and the red Xs are another class. Given their

coordinates, I can--but these points were actually sampled from a Gaussian which I do

not know whatsoever. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Sure, sure, sure. I'm going to actually consider what happens when

you have, you know, a rogue point or something like that. It could actually change immediately.

So what I'd--what I'm going to do next is actually introduce an outlier here. So let's

say I had one sample point at that point. So the impact of adding a single X at that

point--which could still happen as part of the data, right, because it's Gaussian in

the horizontal, so that could be one of the Xs that come up--the impact--but in essence,

it is kind of an outlier for this data because it's highly unlikely. But, in any case, if

you do have a point at -2 and zero, what would happen is that all of these previous loss

functions would drastically shift to their decision boundary to where the purple line

is. And they would place their decision boundary at about X equals -2.3. And the reason for

that is because all of these loss functions assigned zero or close to zero loss, for being

correct, right? This includes both the Savage loss which is the green one or the--sorry--the

Log is the green one and the Savage is the blue one. So you can see that they assigned

zero penalty for being correct, and since by placing the line at -2.3, you can classify

everything correctly. That's just where they're going to put their boundary which is very

far away from the Bayes optimal, had we known the true distribution. On the other hand,

the Tangent loss which is a loss function that we're going to derive during this talk

is also Bayes consistent. But since it penalizes being correct, it penalizes both the positive

side and the negative side. It discourages solutions that are too correct. So it will

actually place the decision boundary where the red line is given this exact data, which

is still very close to the Bayes decision rule. Yeah.

>> [INDISTINCT] >> MASNADI-SHIRAZI: No, I'm not assuming anything

other than just the data. I mean, if-- >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah, yeah, that's true. If there was another, let say, another, you

know--let's say it was like two Gaussians, all of them consist--made up the red class.

That's true. That red point could be one sample from that. But the problem is that right now,

given the fact that it's just one point, it's still considered kind of an outlier. I will

actually deal with this problem later on because once you derive the Tangent loss, you can--you

can actually change the amount that it penalizes being too correct. So you could even actually,

you know, flatten this to make it look like--something like the Savage loss which wouldn't penalize

it and actually place the decision boundary very close to where the purple one is. So

you can have a control over this. It's just a matter of--given the data right now, I mean,

just looking at it, it would seem better to place the lines somewhere close to the black

one rather than drastically move it all the way to the left. Even if that were a distribution

and what--actually when I was building this, you know, toy example to get the point across

of why it's important to sometimes penalize being correct, if you actually add another

X to where that far left one is, the Tangent loss would also place its boundary close to

-2.3 because now its--you know, there's too much penalty for classifying two points wrong.

And you would expect that if that was a true--another mode. You would have more examples from it.

And once you actually add another one, it'll shift; it will be just too costly. It doesn't

want to make a mistake on two points, so. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah, yeah. But, I mean, this toy example is just to get across the

importance of actually penalizing, you know, outliers that are of this type. So you have

to--in order to deal with these, you have to kind of penalize being too correct sometimes

as well. So I don't know if I--yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: That's true. I mean, that's why you...

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah.

>> [INDISTINCT] >> MASNADI-SHIRAZI: That's true. But that's

usually what we have to--I mean, that's what we deal with in the real world. We'd always

have only a few samples. We never have the true distribution of the data. And so, I mean,

the purpose of this talk is to say that it's important to have a Bayes consistent loss

function that gives you that flexibility and that control. So when you are dealing with

situations where you have small amounts of data and the possibility of having outliers

is increased in that case, you want to be able to, you know, have loss functions that

are not only Bayes consistent but also have, you know, these different types of shapes

that can deal with these potential problems >> [INDISTINCT]

>> MASNADI-SHIRAZI: So I actually--if you were to use the Hinge loss, which is what

the SVM does, I didn't--I didn't run an SVM because this is so simple, you can solve it

directly. You can just like do a exhaustive search over the space of lines here. But if

you were to use the Hinge loss, which is what the SVM uses, it does still place, you know--because

there's no penalty for being right, so why not place it where the purple ones--

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah, I didn't do a pass

like that on this, so. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. >> [INDISTINCT]

>> MASNADI-SHIRAZI: The Square loss? The Square loss--the Least Squares loss is actually almost

never used in classification because--let me go back to the shape. That purple line--I'll

give you a better example. So the red line is the Square loss, right? And you can see

that it penalizes being correct almost as much as it penalizes being wrong because of

its symmetry. It's just X squared, so it's never used for classification. If I were to

use--I actually--that's an interesting question. I--maybe I should have used it here and see

what would have happened. If I had actually used the Square loss--because then it would

penalize this, as well. And, let's see. If it penalizes that point so--actually the Square

loss wouldn't work. It would--it would place the line at--where the purple is, I think,

I'm kind of guessing, because there's going to be such a huge penalty for getting this

wrong, that it would prefer to get it right rather than skip over it. So the Tangent loss

here assigns a bounded penalty for being right; that's why it can disregard this outlier.

You see that it's kind of--you see that it tapers off here, right? It's bounded in that sense.

Whereas the Square loss would go up; it would assign a huge loss for any incorrectly

classified points of that sort. >> [INDISTINCT]
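The bounded-penalty point can be checked numerically. A minimal sketch, assuming the Tangent loss formula phi(v) = (2*atan(v) - 1)^2 from the speaker's CVPR 2010 paper and writing the Square loss in margin form as (1 - v)^2; neither formula is written out in the transcript itself:

```python
import math

def square_loss(v):
    # Least Squares classification loss in margin form: (1 - yf)^2.
    # Symmetric in the sense that very correct points are penalized
    # almost as heavily as very wrong ones.
    return (1.0 - v) ** 2

def tangent_loss(v):
    # Tangent loss (CVPR'10): bounded on both sides, so neither kind
    # of outlier can dominate the risk.
    return (2.0 * math.atan(v) - 1.0) ** 2

for v in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"margin {v:+5.1f}: square {square_loss(v):7.2f}  tangent {tangent_loss(v):5.2f}")
```

At margin -10 the Square loss is 121 while the Tangent loss is about 15.5; at margin +10 (the "too correct" side) they are 81 versus about 3.8, and the Tangent loss never exceeds (pi + 1)^2, roughly 17.2.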

>> MASNADI-SHIRAZI: Yeah. So--exactly. That's the whole point of a loss like this because

it wants to, you know, keep--it wants--it actually--in the paper, I actually have an

example where I've plotted like the histogram of the distance. This is kind of the margin.

So it actually encourages small margins that are placed at that dip. You know, I think

it's--that dip, that minimum is like 0.5. But the interesting thing is that when you

see that example, even though most of the points have a small margin of about 0.5, meaning

that--that minimum is actually enforced, it does better in classifying a data set that

has lots of outliers, so we were--I think, in that--in that example, we had some digit

classification problem and the margin is smaller but the classification output is better than

all of these other loss functions. So that's a--that's a problem that we actually considered

in the paper. So did I--the--it's--just because the margin is smaller, it doesn't necessarily

mean you're doing worse, especially when you have outliers, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: That's true. It's just

how much--for example, when you're using Exponential loss, it tries so hard to increase the margin

that it, you know, screws up the classification which is the goal, so.
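The contrast between the Exponential loss and a bounded loss shows up immediately on a single bad outlier. A quick sketch, using the Savage loss formula phi(v) = 1/(1 + e^v)^2 from the NIPS'08 paper (the margin value below is illustrative, not from the talk):

```python
import math

def exp_loss(v):
    # Exponential loss (AdaBoost): grows without bound for negative margins.
    return math.exp(-v)

def savage_loss(v):
    # Savage loss (NIPS'08): bounded above by 1, so a single outlier
    # can never dominate the empirical risk.
    return 1.0 / (1.0 + math.exp(v)) ** 2

outlier_margin = -5.0  # a badly misclassified point, far on the wrong side
print(exp_loss(outlier_margin))     # about 148: swamps the rest of the sum
print(savage_loss(outlier_margin))  # about 0.99: saturates near its bound of 1
```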

>> [INDISTINCT] >> MASNADI-SHIRAZI: For the--I wanted to give

these--I know what you're saying, but that's regression. It's not classification because

in regression--in classification, you actually have the two classes and you're trying to

minimize the risk. And in regression, yeah, if you allowed it to, you know, plot a line

through here, that would happen. It would place a horizontal line, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: What I think I did for

this data set was I--let's see. I think I limited the--I didn't--yeah, I didn't--maybe

I didn't want that to happen so I limited to vertical lines. That might have been what

I did. So if I remember correctly, that's probably what I did to avoid that. In any

case, I haven't even considered the Least Squares, so. Any--yeah.
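The exhaustive search mentioned in the answer is simple to reproduce: with 1-D data and a vertical boundary, a classifier is just a threshold t with predictor f(x) = x - t, so you can scan a grid of thresholds and keep the one with the lowest empirical risk. A sketch with hypothetical data (one "too correct" positive outlier at x = 9), using the Tangent loss formula phi(v) = (2*atan(v) - 1)^2 from the CVPR'10 paper:

```python
import math

def tangent_loss(v):
    # Tangent loss: penalizes being very wrong AND being "too correct".
    return (2.0 * math.atan(v) - 1.0) ** 2

def best_threshold(xs, ys, loss, n_grid=2001):
    # Exhaustive search over thresholds t for the predictor f(x) = x - t,
    # minimizing the empirical risk (average loss over margins y * (x - t)).
    lo, hi = min(xs) - 1.0, max(xs) + 1.0
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    def risk(t):
        return sum(loss(y * (x - t)) for x, y in zip(xs, ys)) / len(xs)
    return min(grid, key=risk)

# Hypothetical toy data: negatives near -1, positives near +1,
# plus one positive outlier that is extremely (too) correct.
xs = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 9.0]
ys = [-1, -1, -1, 1, 1, 1, 1]
print(best_threshold(xs, ys, tangent_loss))  # lands between the two clusters
```

Because the Tangent loss is bounded on the correct side, the distant outlier cannot drag the threshold toward itself.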

>> [INDISTINCT] >> MASNADI-SHIRAZI: I--I think what I considered

was like from -10 to +10. Something like that. >> [INDISTINCT]

>> MASNADI-SHIRAZI: I wanted to put a Y dimension here just so it would look nice when I plot

it. Because if I only had one--that's why they're uniform in the Y direction. So I already

know that the Y direction is, you know, useless. >> [INDISTINCT]

>> MASNADI-SHIRAZI: Yeah. Because if I had only plotted these points in the X direction,

you would get these circles on top of each other; you couldn't really see anything. But

that's why I said it's uniform in the vertical so you can disregard it completely. Yeah.

But the point of this toy problem is something completely different. You guys should let

me go further. What I wanted to just get across was the idea of the fact that some--first

of all, we have problems where we have outliers. And second of all, we want to deal with outliers

and there's two types of outliers in a sense. One is when you're wrong and--so one is this

type of outliers that I considered here, right? There's no--and the other is that when you

can be actually right, completely right, and still have, you know, an outlier. So the Savage

and the Log kind of deal with the first type, but you also want a loss that looks like this

that can, you know, penalize being too correct, right? When you have outliers, that's kind

of--that's just to get across the point of having a loss function that has a shape like

this. Now, of course, we'll go into more, you know, advanced and practical examples

so don't worry. But in order to derive something like the Tangent loss that I just talked about,

we need to talk a bit about risk minimization. So we define a feature, X, and the class label,

Y, and the--and we call our classifier, H, and our predictor, F. So our classifier is

just going to be the sign of our predictor, F. And we also have the loss function, L.
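In code, the pieces defined so far, a predictor F, the classifier H as its sign, and a loss L evaluated on margins, can be sketched like this (the predictor and data below are made up for illustration):

```python
import math

def empirical_risk(loss, f, data):
    # data is a list of (x, y) pairs with y in {-1, +1};
    # the margin of a point is v = y * f(x).
    return sum(loss(y * f(x)) for x, y in data) / len(data)

exp_loss = lambda v: math.exp(-v)      # one concrete choice of loss L
f = lambda x: 2.0 * x - 1.0            # a hypothetical predictor F
h = lambda x: 1 if f(x) >= 0 else -1   # classifier H = sign of the predictor

data = [(0.0, -1), (0.2, -1), (0.8, 1), (1.0, 1)]
print(empirical_risk(exp_loss, f, data))  # sample average of the loss
```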

And so what we're going to be doing is minimizing the risk or the expected loss which is equivalent

to minimizing the conditional risk. And here we're going to define eta to be the probability

of one given X. So with these in mind, we'll consider the traditional approach of dealing

with a Bayes consistent loss function. We're going to actually talk about what does Bayes

consistency mean in more detail. So a good example is the exponential loss that we have

here. And the traditional approach is to, of course, take the--take the expectation,

find the risk, C. And then you want to minimize the risk, and find your predictor, F star.

And for the exponential, you can actually show that F star will be one half log of the

probability ratio. So the question now is to see whether our predictor is Bayes consistent

and we've plotted this F star and you can see that it is indeed Bayes consistent because

it assigns positive values to probabilities that are higher than a half and negative values

to probabilities, eta, that are less than a half. So it's implementing the Bayes decision

rule which is good. So we're good there. We take our F star and we plug it back into our

risk to find our minimum risk which we call C star. And then what we have to do here is

check to see whether C star is concave or not. And the reason for that is because it can

be shown that in order to have a convex optimization problem, you need to have a minimum conditional

risk, C star, that's concave. So, for the exponential loss where--everything works out fine, but

you can immediately see from this example that this traditional approach doesn't actually

let you design Bayes consistent loss functions. It only lets you check to see whether a loss

function is Bayes consistent or not. And we can kind of say that we got lucky with the

exponential because all these things, you know, fall into place. So in our next section,

we'd like to talk about how can we actually--we're going to propose a new path for designing

Bayes consistent loss functions. And the general approach is going to be that we can choose

any strictly concave function, C star, and any invertible function, F star, such that

they satisfy these simple symmetry constraints, and we plug these two functions into this

formula and we get a phi, a loss, that is guaranteed to be Bayes consistent. So this

gives us a principled way of deriving, designing novel Bayes consistent loss functions. Now--I'm

not going to actually go over the proof of that formula. So if--the proof can be found

in our NIPS '08 paper and it basically deals with probability elicitation and Bregman divergences

and so I'm going to kind of skip over these. But we're going to go over an example of actually

deriving--using the formula and deriving a novel Bayes consistent loss function. Here

we're going to derive the Savage loss. So, let's say we start it off with the Logistic

Link function that we actually saw in the Exponential loss as well, and we pair it with

the Least Squares minimum risk. And these two satisfy the symmetric constraints. All

we have to do is take these and put them into the formula and we'd derive a novel loss that

we called the Savage loss and we've plotted it. And you can see that this loss function,

unlike all previous Bayes consistent loss functions, is first of all not convex and

also it is bounded, so it flattens out. It has a bounded penalty. And we use this loss

function in--to do some robust classification. But for a truly robust loss function, we saw

in our previous discussion that for a hypothetical loss, we--it should have a certain number

of properties. We would, first of all, like it to be bounded for large negative margins.

And we'd also like it to be--have a smaller bounded penalty--assign a smaller bounded

penalty for large positive margins. And we'd also like it to be margin enforcing of course.

And it can be shown that under Bayes consistency, those three properties are satisfied if and

only if these three conditions or requirements hold for our choice of F star and C star.

And, unfortunately, existing link functions, F star, don't comply with those requirements.

So we introduced the Tangent Link which does. So this is the plot of the Tangent Link and

we pair it up with the Least Squares minimum conditional risk again, plug it into our--the

formula, and we'd get what we called the Tangent loss which actually has the desired shape

of course. So now that we have our loss function, we can easily use it to do classification

and derive like, for example, a Boosting algorithm to minimize the risk associated with this

loss. I'm not going to go over deriving the Boosting algorithm and stuff; you--the

details can be found in the paper. I'll just go over some examples and experiments that

we did. We've done experiments on image classification and also Multiple Instance Learning data sets.
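The design formula itself is only gestured at in the transcript. In the NIPS'08 paper it reads phi(v) = C*(eta) + (1 - eta)[C*]'(eta) with eta = [F*]^(-1)(v), and it is easy to check numerically that this recipe reproduces the losses discussed in the talk. The closed forms of C* and the links below are taken from the papers; the factor-of-4 scaling on the Least Squares risk is one common convention:

```python
import math

def phi_from(c_star, dc_star, inv_link):
    # NIPS'08 recipe: phi(v) = C*(eta) + (1 - eta) * [C*]'(eta),
    # where eta = [F*]^{-1}(v) inverts the link.
    def phi(v):
        eta = inv_link(v)
        return c_star(eta) + (1.0 - eta) * dc_star(eta)
    return phi

# Exponential: C*(eta) = 2*sqrt(eta*(1-eta)), link F*(eta) = 0.5*log(eta/(1-eta)).
exp_phi = phi_from(
    lambda e: 2.0 * math.sqrt(e * (1.0 - e)),
    lambda e: (1.0 - 2.0 * e) / math.sqrt(e * (1.0 - e)),
    lambda v: 1.0 / (1.0 + math.exp(-2.0 * v)),
)

# Savage: C*(eta) = eta*(1-eta), logistic link F*(eta) = log(eta/(1-eta)).
savage_phi = phi_from(
    lambda e: e * (1.0 - e),
    lambda e: 1.0 - 2.0 * e,
    lambda v: 1.0 / (1.0 + math.exp(-v)),
)

# Tangent: C*(eta) = 4*eta*(1-eta), Tangent link F*(eta) = tan(eta - 0.5);
# the inverse link eta = atan(v) + 0.5 is a probability only for |v| < tan(0.5).
tangent_phi = phi_from(
    lambda e: 4.0 * e * (1.0 - e),
    lambda e: 4.0 - 8.0 * e,
    lambda v: math.atan(v) + 0.5,
)

for v in [-0.4, 0.1, 0.4]:
    assert abs(exp_phi(v) - math.exp(-v)) < 1e-12
    assert abs(savage_phi(v) - 1.0 / (1.0 + math.exp(v)) ** 2) < 1e-12
    assert abs(tangent_phi(v) - (2.0 * math.atan(v) - 1.0) ** 2) < 1e-12
print("recipe reproduces the Exponential, Savage, and Tangent losses")
```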

But here, I'm going to talk about the scene classification, the 15 class scene classification

problem. We used the semantic space representation from Rasiwasia and Vasconcelos CVPR '09. And

the only thing that we changed was replace the SVM classifier with the Tangent loss classifier

and we got for--now, their scene classification method was already state of the art, but we

got 4% improvement with just simply, you know, an outlier resistant robust classifier instead

of the SVM. So--and actually, the interesting thing about this data set is that most of

the improvements were seen between classes that were--I'm starting to lose my voice.

>> [INDISTINCT] >> MASNADI-SHIRAZI: What was that?

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah. We're doing one

versus all. The most improvement was seen in--between classes that are easily confusable.

For example, when you're trying to distinguish between the highway class, or the street class,

or the tall building class, or when you're trying to distinguish between the mountain,

and forest, and open country class, these are actually one of the--like these are some

of the 15 classes out there. There's also living room, and bedroom, and kitchen, and

these are similar but they actually produce the most improvements which is expected when

you're doing a robust classification. We've also done experiments on tracking and we used

the Discriminant Saliency Tracker of Mahadevan and Vasconcelos CVPR '09. And we use TangentBoost

to combine the saliency maps in a discriminant manner. Here, in these two examples, tracking

examples, the red box is for when the Tangent loss is used to do tracking and the dotted

blue is when the Exponential loss, which isn't robust, is used. This one--let's see if it

works. It still doesn't work on my computer; that's weird. But--okay, this one should work.

So you can see that the Tangent loss keeps track of the object whereas the Exponential

loss drifts off and loses track. And this is because the person keeps changing shape

and there are plenty of outliers so it's much more robust. Now a natural question that was

already kind of asked was, "Can we control the shape of the Tangent loss?" So, there

is actually a parameter, A, that controls the ratio between the negative margin and

the positive margin. So the general formula is--includes the A. And we've plotted the

Tangent loss for different values of A. And you can see that--you can actually either--actually,

the best results are found through cross-validation for A or you can kind of like guess A depending

on how many outliers you expect to have in your data. But in general, you can kind of--you

can control the amount of penalty that you assign to the different margins. So, you have

that flexibility. Another question is the question of convexity. So, here we have examples

of a convex function, a quasi-convex function, and a function that's neither convex nor quasi-convex.

And so, for--a quasi-convex function isn't convex but it still has a unique minimum.

On the other hand, all convex functions are also quasi-convex. And we can actually prove

that Bayes consistent loss functions from this formula are all quasi-convex. Now, the

question--the main question is, "Is our optimization problem still convex even when the loss is

not convex?" So, if--so, the thing to keep in mind here is that we're not minimizing

the loss, we're minimizing the expected loss or the risk. So with that in mind, we can

prove that the risk is actually still convex in the probability space even though the loss

is not. We can also prove that the risk is quasi-convex in functional space. And this

is an important proof because when you're doing classification, you're usually not working

in probability space. You're working directly in the functional space. You're trying to

directly find the function that classifies--that minimizes your risk. So it's still quasi-convex

even though the loss is not. Also, more importantly for practice, the empirical risk

is also quasi-convex in functional space. So in conclusion, we can say that the optimization

problem is quasi-convex and has a unique minimum that can be found with functional gradient

descent such as Boosting even though the loss function is not necessarily convex. So--

>> [INDISTINCT] >> MASNADI-SHIRAZI: The optimization is not

by definition convex, but it's quasi-convex, meaning that it still has a unique minimum.

So you can use quasi-convex methods to find the minimum. The easiest thing that works

is just do functional gradient descent which is what Boosting does. That's why the Boosting

algorithms work even though these loss functions have all these, you know, quasi-convex shapes.

So the important thing is that it still has a unique minimum.

>> It's [INDISTINCT] minimize [INDISTINCT] so you can have [INDISTINCT] function [INDISTINCT].

>> MASNADI-SHIRAZI: No, no, no, no, no. That's what I mean by unique minimum; it doesn't

have local minima. Even the empirical risk won't have local minima. That you'd--that's

what's... >> [INDISTINCT] I believe this is to be true,

but [INDISTINCT] is true [INDISTINCT] some [INDISTINCT].

>> MASNADI-SHIRAZI: Well, I can actually prove this, I mean, if you want later. Initially,

that--I mean, that's the question that people have. They think that, "Oh, since you have

a bunch of these little dips for these different points." But, think of it this way. Let's

say you only had one point and you're trying to do--like the worse case would be to just

have like one or two points. You can still show that the sum of two quasi-convex functions

that have this formula are themselves quasi-convex, meaning that the sum of even two--like the

simplest classification problem in the world is just one point here and one point there,

right? The sum of these two would still be another quasi-convex function. That's the

empirical risk, right, the sum of two of these losses. Let's say you sum two of those.

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yup. You can sum any two

and they'll still be quasi-convex as long as they come from this formula.

>> [INDISTINCT] >> MASNADI-SHIRAZI: Maybe I should have. But

the thing is that before giving--I got this question during my CVPR talk and I had gotten

the first question when I gave the NIPS paper. So I knew I was good in terms of probability

space, but that's not the space we really were--are dealing with. Because when you do

it--like you do a transition from the probability space, a transform doesn't necessarily preserve

convexity. So when you transform from a probability space into functional space, there's a chance

that you'll lose convexity. And so I had to actually go and prove two and three as well

which--that's why I wanted two weeks before I came to give this talk because I spent the

first week actually proving two and three. >> [INDISTINCT]

>> MASNADI-SHIRAZI: It's actually quite simple to do. If you'd--if you'd take the derivative

of this function down here, you can prove that it's quasi-convex by the classic way

of proving quasi-convexity which is to show that the derivative is negative up to a certain

point and then positive from then on, meaning that there's only one sign change. That's the--that's

the general--one of the ways to prove quasi-convexity. So--
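That derivative test is easy to run numerically, both on the Tangent loss itself and on the two-point empirical risk phi(v) + phi(-v) that comes up in the next question; the grid check below is a sketch, not a proof:

```python
import math

def tangent_loss(v):
    # phi(v) = (2*atan(v) - 1)^2: quasi-convex (a single dip) but not convex.
    return (2.0 * math.atan(v) - 1.0) ** 2

def sign_changes(f, lo=-10.0, hi=10.0, n=20001):
    # Count sign changes of the finite-difference slope on a grid.
    # Quasi-convex with a unique minimum means exactly one change:
    # slope negative up to the minimum, then positive.
    step = (hi - lo) / (n - 1)
    diffs = [f(lo + (i + 1) * step) - f(lo + i * step) for i in range(n - 1)]
    signs = [d > 0 for d in diffs if d != 0.0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# One sign change for the loss itself...
print(sign_changes(tangent_loss))
# ...and still one for a one-positive-plus-one-negative empirical risk.
two_point_risk = lambda v: tangent_loss(v) + tangent_loss(-v)
print(sign_changes(two_point_risk))
```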

>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah. So the sum, you

know, adds a little bit more but it's still nothing more than adding one function phi

of V plus, let's say, alpha times phi of negative V which is the positive class plus the negative

class. So--and in fact, once you--let's see--once you prove two, then three, which is the case that

you're talking about, is not that bad afterwards. So we can go over it--yeah. Maybe that's what

I should have concentrated on more. Later. Yeah.

>> [INDISTINCT] >> MASNADI-SHIRAZI: This, I--this, I haven't

actually included in any of the papers. Part one is in the NIPS paper, theorem one, but

two and three were added to the paper that--like the journal version of the paper that I'm

writing and it's almost done. This is kind of like the missing link, so.

>> [INDISTINCT] >> MASNADI-SHIRAZI: No, just any. I just gave

the simple two examples. But the point is that you can add alpha number of the positive

numbers and beta number of negatives and still have a sum. So the thing about convexity is

that when they're talking about empirical risk, we all know that the sum of a bunch

of convex functions is still going to be convex. And that's why when you're doing empirical

risk minimization using convex loss functions, the thing works well because you know that

when you sum up a bunch of convex functions, you're still going to have convexity. The

same theorem, unfortunately, doesn't exist for quasi-convex functions in general. You

can't say just summing up a bunch of quasi-convex functions will ensure quasi-convexity. But, fortunately,

if they come from this formula, they do. So-- >> [INDISTINCT]

>> MASNADI-SHIRAZI: The Sigmoid is almost like Savage but it's not exactly. It's not

one of these--I'm trying to find the formula for Savage. Here it is. It looks almost like

the Savage but it's not exactly. The Savage does, too. I mean, it's quasi-convex. It's

derived from that formula. It will still work. Yeah. It seems more remarkable that you would

have a loss function looking like that. That would still work well. So are we--so I can

move on, hopefully. Okay. So in conclusion for this section, we showed how to derive

novel Bayes consistent loss functions for classification and we derived the set of requirements

that a robust Bayes consistent loss function should have. And we also showed that Bayes

consistent loss functions can be non-convex. And we gave some examples of using the Tangent

loss functions for doing classification. I don't know if we have enough time but we can--I

also wanted to talk about deriving Bayes consistent loss functions that are cost sensitive. So

I've been talking for about 50 minutes already. Do you want me to continue or...?

>> [INDISTINCT] so many questions so I think it's okay [INDISTINCT]

>> MASNADI-SHIRAZI: Okay. So many classification problems are actually naturally cost sensitive.

For example, when we're doing detection, or fraud detection, or medical diagnosis, or

business decision making, the cost of one--misclassifying one class is usually much higher than the

other. So for medical diagnosis, for example, missing that tumor might be considered as

having an infinite amount of loss because, you know, it might result in the death

of a patient; whereas, on the other hand, a false alarm or a false positive would just

cost a bit more money because you have to do further experiments. So there's obviously

a demand for cost-sensitive extensions of state of the art techniques. And we--here

we propose an alternative strategy for the design of cost-sensitive algorithms. And rather

than directly manipulating the algorithm and risk compromising Bayes consistency or Bayes

optimality, we extended the underlying loss function and derived a cost-sensitive learning

algorithm from that loss that implements the Bayes decision rule and also approximates

the Bayes risk. So here--this is just the same slide that we had before to remind us

of the general path that we took before, for deriving Bayes consistent loss functions.

We would choose any strictly concave function, F star and C star--sorry--strictly concave

function, C star, and invertible function, F star, such that they satisfy these

constraints. We'd plug it into the formula and we were guaranteed to have a Bayes consistent

loss function. And the question now is, "Can we have a general procedure for deriving cost-sensitive

loss function as well?" So here, we're going to define C1 as the cost of a false negative

and C -1 as the cost of a false positive, and we'd like to ask what conditions

our choice of the F star and the C star must now satisfy such that we have a cost-sensitive

loss that's also Bayes consistent. So generally, we would have a loss phi1 assigned to the

positive class and the loss phi -1 assigned to the negative class, but we'd still want

to conserve Bayes consistency. So the generic procedure is to start from an invertible predictor,

F star that has this symmetry and the concave function, C star that has this symmetry. And

this symmetry is basically the same symmetry that the Bayes risk has. So we're ideally

trying to find a risk that has the same behavior as the Bayes risk, and then we define J to

be negative of our C star, and we derive these auxiliary functions I1 and I -1 from these

two formulas, and our loss functions can be shown to basically be the negative of I1.

So phi1 is going to be negative I1, and phi -1 is going to be negative I -1. And these

come from probability elicitation. The proof comes from there. But in general, the theorem

is that the loss derived through this procedure is cost-sensitive and Bayes consistent. So

here, we're going to go over an example which would be the exponential--we're going to try

to derive the exponential cost-sensitive loss. We choose F star to be this formula which

is the cost-sensitive version of the--that Log link and you can see that it is cost-sensitive.

It implements the cost-sensitive Bayes decision rule because now, the boundary has shifted

from a half to, in this case, 0.3 because C1 was chosen to be two. So all points that

have probabilities higher than 0.3 are now assigned positive values and all points that

have probability less than 0.3 are assigned negative values. So we've shifted the boundary

to make it--to prefer one class. And we also have the cost-sensitive version of the risk.

And you can see that in this, the--it's still concave but it has an asymmetric shape. And

the point here is that the maximum of this risk is, again, at 0.3. So the maximum risk

is being assigned to the boundary which is what we want. Going through all the steps,

we can find phi1 to be like this which is now a cost-sensitive version and there's--there

are now two loss functions. One is the red one that's used for the negative class and

the blue one is the loss that's used for the positive class, and it's still Bayes consistent.

So we've actually used the... >> [INDISTINCT] properties that we had earlier

[INDISTINCT] >> MASNADI-SHIRAZI: You could--I haven't--in

order to get a loss function that looks different, you would have to choose different F stars

and C stars. So if you could go back and derive the cost-sensitive asymmetric version of the--well,

you already have it here--combine it with the cost-sensitive Tangent Link, you could

probably get an analogous version for the cost-sensitive one. I haven't done it but,

you know, maybe that's up for grabs. Yeah. So the Least Squares is the same C star that's

used in Tangent. The difference here is that the Tangent uses a different F star. So if

you derive the cost-sensitive version of that one, you could get something similar. But

here we used just the Exponential and the reason for that was because--the history of

this is because I wanted to derive the cost-sensitive version of AdaBoost, and that uses the Exponential

loss. But here, using the cost-sensitive version and still Bayes consistent Boosting, we did

some--we've done a lot of experiments for pedestrian, car, and face detection. We've

even used it for Brain EEG Surprise Signal Detection. But I'm going to move on to deriving

the cost-sensitive SVM. >> I remember, there was an asymmetrical version

of AdaBoost, but I only [INDISTINCT] >> MASNADI-SHIRAZI: So there have been lots

of papers on making AdaBoost cost-sensitive. In my--actually in the PAMI paper, that's

basically what it's all about. It's going through the history of this and finally, you

know, deriving this which is the cost--an interesting thing about this is that it's

still Bayes consistent. All of the previous methods usually dealt with altering the AdaBoost's

algorithm directly. What they would generally do is change the weights. So the AdaBoost

has its weight update formula. They would go at--introduce costs directly to the weights

and depending on how you'd kind of change the weights, you would get different algorithms.

But no one--when you go and alter the algorithm like that, you have to still kind of show

that it's Bayes consistent because otherwise, you don't know what you're doing anymore;

are you really minimizing, you know, what you should be? And given infinite amount of

data, will you still converge to the Bayes decision rule which is what you should be

doing? So here, when you show that it's Bayes consistent, you answer those problems and

you don't need to go directly in it, like manipulate the algorithm. Once you have the

correct loss, you can just derive the classifier from it and that's what we did. So for the--we're

going to try to do the same thing for the SVM because there have been many papers on

cost-sensitive SVMs but, again, what they would do is, you know, go into the algorithm

and, you know, tweak this, or change that portion, or add the cost here or there. But

the principled way to do it is to actually derive the cost-sensitive loss function, show

that it's Bayes consistent, and then once you have that, you know, the rest is just--you

know, go through the steps of finding the algorithm. So here, we're going to--we're

going to start with F--we're going to choose our F star to be this which can be shown to

be Bayes consistent and it implements the Bayes decision rule. And in fact, it's just

the cost-sensitive version of the link used in SVM. So if you set--sorry--C1

and C -1 = 1, it will just, you know, simplify to the sign of 2 eta - 1 which is the exact same

thing used in SVMs. So this suggests the cost-sensitive minimum conditional

risk. And this has four parameters, A, B, D, and E, that can be shown to satisfy the

symmetric conditions for Bayes consistency if they satisfy these conditions. But in any

case, if you go through the steps, you can find the cost-sensitive SVM loss function

which is what I have there. So this loss has four degrees of freedom; A, B, E, and D. And

for the positive samples, we've--basically, they're assigned the margin of E/D and the

hinge slope of D. So if the red line is the positive hinge loss, they'd get a slope of

D and a margin of E/D. For the negative, you have a margin of B/A and a hinge slope of

A. So the blue line is the hinge loss that's used for the negative samples. And after some

parameter normalization, you can find C star to be like this, but you go through the steps.
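One way to read the parametrization just described (this is my reconstruction, not the slide's exact formula): positives pay a hinge with slope D that vanishes once the margin exceeds E/D, and negatives pay a hinge with slope A that vanishes past margin B/A. Setting A = B = D = E = 1 should then recover the standard hinge loss, which is the same sanity check the talk applies to the primal:

```python
def cs_svm_loss(y, f, A=1.0, B=1.0, D=1.0, E=1.0):
    # Cost-sensitive SVM loss (my reading of the talk's description):
    # positives: hinge slope D, margin E/D; negatives: hinge slope A, margin B/A.
    if y == 1:
        return max(0.0, E - D * f)
    return max(0.0, B + A * f)

def hinge(y, f):
    # Standard SVM hinge loss: max(0, 1 - y*f).
    return max(0.0, 1.0 - y * f)

# Sanity check: all four parameters equal to 1 recovers the standard hinge.
for y in (-1, 1):
    for fv in (-2.0, -0.3, 0.0, 0.7, 2.0):
        assert cs_svm_loss(y, fv) == hinge(y, fv)

# Preferring the positive class: a steeper, wider hinge on false negatives.
print(cs_svm_loss(1, 0.2, D=2.0, E=3.0))  # positive still inside its margin E/D = 1.5
print(cs_svm_loss(-1, 0.2))               # mildly penalized negative
```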

And now that you have the loss, the risk basically leads to the cost-sensitive SVM primal formulation

and this is assuming that you--you're preferring the positive class. You can derive a similar

primal formulation [INDISTINCT] equal to 1, which is a good check, right? Because if it

doesn't return to the standard SVM when I set the cost equal to 1, I've probably done

something wrong, but it does actually simplify to the standard SVM when you set the costs

equal. But--here we have some experiments that we've done using the cost-sensitive SVM

and we've compared it to the previous algorithms; the boundary movement SVM and the bias penalty

SVM. So the bias penalty SVM would introduce costs on the slack variables directly, and

the boundary movement SVM would basically just change the bias of your final SVM. So

compared to these two competitors, the--we did experiments on 10 UCI data sets and cost-sensitive

SVM has smaller error on seven out of the 10 data sets and does equal to bias penalty

on two other ones. We also did experiments on the German Credit data set. And here the

goal is to identify bad credit customers to be denied credit by the bank. And each person

gets 24 features, such as the person's age and income, and we're

trying to decide whether the bank should, you know, give this person credit or not.

An interesting thing is that, here, the cost matrix is actually defined in the problem

because if a good credit customer is denied credit, the bank has a loss of one; meaning

that the bank loses a good customer. Whereas if you give credit to a bad customer, the

bank suffers a loss of five; meaning that the--so the bank loses money. So this suggests

the following constraint on our choice of C1 and C -1. Their ratio should be 5. And

this is defined in the problem, so it's part of the data set. But the final result is that

if you use cost-sensitive SVM, you get a total reduction in cost of 37%. So in conclusion,

we extended the--uh-huh? >> [INDISTINCT]

>> MASNADI-SHIRAZI: So here is--no. When you're--there are different ways to compute the error. So

here, what we did was just--yeah, just count the number of errors that you have over the

data. There isn't any cost-sensitivity incorporated within the data set itself because I don't

know how much a miss would cost when I'm doing, let's say, heart disease or I'm trying to

do Pima. I wouldn't know--I mean, you can define these to be whatever you want. It would--it

wouldn't change the result. It would just be multiplying these by a number.

>> I was wondering, let's say, you changed the [INDISTINCT].

>> MASNADI-SHIRAZI: Yeah. So what I--okay. So to go into--yeah--to go into the details

of how I actually did this, the way we did it was we--all of these algorithms have their

different parameters, right, like the C1, C -1. So you--we tried all the different parameters

such that all of these algorithms attained a certain detection rate on a validation set.

So, for example, we said we want to have all of these such that they have 90% detection

rate. Once they're equal on that sense, I would go to the test set, and test it, and

see how many false positives they gave me. And of course, the one with the smallest false

positive would be the best because I've ensured that their detection rates are fixed given

the amount--you know, like, the amount of flexibility these have. The cost-sensitive

and bias penalty SVMs, they have the same number of parameters that you can tweak. But

the boundary movement SVM, the only thing that you can change there is just the bias,

so. But with that in mind, that's the fair way of comparing them. So for this section,

we extended the probability elicitation view of loss function design for cost-sensitive

problems, and we specified the generic procedure for deriving cost-sensitive algorithms, and

also derived the cost-sensitive Bayes consistent Boosting and SVM loss functions, and also

went over some experimental results. And thank you. So I'm open to any more questions.