On the Design of Bayes Consistent Loss Functions for Classification

Uploaded by GoogleTechTalks on 02.09.2010

I met him actually last month at CVPR where he gave a very interesting talk and I thought
it would be great if he shares some of his work with us here at Google. And just to spell
his full name, so it's Hamed Masnadi-Shirazi, and he works at--is in the middle of his PhD
at UCS, San Diego Western, you know, in the Department for Statistical Visual Computing.
And, yeah, I'm very pleased to hear about On the Design of Bayes Consistent Loss Functions
for Classification. Mr. Hamed, please. >> MASNADI-SHIRAZI: So thank you for the introduction.
Yeah, so the title is On the Design of Bayes Consistent Loss Functions for Classification,
and this is work that I've done at the Statistical Visual Computing Lab at the University of
California - San Diego. So during this talk, we're going to basically be talking about
classification. Many classification algorithms, some of the most successful ones such as SVMs
and Boosting are basically based on minimizing an expected value of a loss function phi,
which is close margin enforcing and Bayes consistent. So such loss functions assign
a large penalty to points with negative margin, and they generally assign a small penalty
to points of small positive margin and zero or close to zero penalty to points of large
positive margin. So we also need to talk about Bayes consistent loss functions and like I
said, Bayes consistent loss functions are very special and important for classification.
They--and the reason for that is because they produce the Bayes optimal decision rule without
having to know anything about the distribution of the data. So, an example for general Bayes
consistent loss function phi, we would take the--we would take the expected loss to find
the risk and we would minimize the risk to find our predictor, F star. And given that
our loss function was Bayes consistent, we're guaranteed that F star, our predictor, is
Bayes optimal. It implements the Bayes decision rule. And you can see that through this progression,
we didn't make any assumptions about the model of the data. We didn't know the distribution
of the data, nothing like that, but yet we were still able to arrive at the Bayes optimal
decision rule. So, on the other hand, unfortunately, only a handful of Bayes consistent loss functions
were previously known in the literature. Some examples include the well known Least Squares
loss; that's used all over the place. Another example is the Hinge loss that's used in SVM
classifiers; or the exponential loss that's used in Boosting. Another example is the Log
loss that's used in Logistic Regression and also in Boosting. And you can see that each
one of these previously known Bayes consistent loss functions has led to a highly successful
classification algorithm. We've plotted, actually, these Bayes consistent loss functions, and
just by looking at them, you can see that they all generally have a few properties in
common. One is that, they're all convex loss functions. And the ones that are considered
here are all unbounded for large negative margins. And number three is that, they all
have a zero or close to zero positive margins; of course, this excludes the Least Squares
that increases on both sides. And the last property is that they are all cost insensitive,
meaning that they assign the same amount of loss to both the positive class from--and
the negative class. So they don't prefer one class over the other. So the natural question
that we could ask at this point is, "Are we stuck with just these 4 or 5 loss functions
with those common shapes?" And can we actually derive other Bayes consistent loss functions
that are custom tailored for our application? Can we derive Bayes consistent loss functions
that have other shapes and other properties? For example, can we have a Bayes consistent
loss function that's non convex like the plotted loss function? Can we have a loss function
that's Bayes consistent and yet has bounded negative margins? As you can see, this loss
function flattens out. Can we have a loss function that has a non zero bounded positive
margins unlike previous loss functions? Can we have a loss function that's Bayes consistent
and is cost sensitive? So here we have two loss functions; one is assigned to the positive
class and then--and the other is assigned to the negative class. And in general, can
we have a Bayes consistent loss function that's just any shape we want it to be given our
application? So these problems and questions were previously known and researchers generally
would try to start off with the known Bayes consistent loss function. And in order to
alter it to their application, they would either make alterations to the loss directly
or alter the classification algorithm derived from the loss. And the problem with that is
that once you actually alter the Bayes consistent loss function in any way, it ceases to be
Bayes consistent. So during this talk, what we're going to show is how to actually derive
novel Bayes consistent loss functions, and this is work that has been covered in our
NIPS '08 paper. We're also going to talk about designing novel loss functions--Bayes consistent
loss functions for specific applications. And two of the applications that we'll cover
during this talk--yeah, the first one is going to be the designing loss functions that are
robust or outlier resistant for robust classification, and this is work that's been covered in our
CVPR 2010 paper. And we're also going to talk about designing loss functions that are Bayes
consistent and cost sensitive. So this is work that's been covered in our ICML 2010
and IEEE PAMI 2010 paper. So now that we know the importance of having Bayes consistent
loss functions for classification, we can talk about some specific problems and applications.
An interesting problem is the computer vision problem. Here, we have large margin loss functions,
generally, don't overcome the unique challenges posed by computer vision problems. And one
major difficulty is the prevalence of noise and outliers or class ambiguity that exists
in loss--in computer vision data sets. So here, for example, we have an image of a famous
street in San Francisco and this is clearly an image of a street. But if we were to do
patch-based image classification on this, we would get many patches that include trees,
and cars, and building even though this is supposed to be classified as an image of a
street. So in order to be able to do robust classification, we're going to try to design
a robust loss function. One limitation of current loss functions is their unbounded
negative margin. So for this simple classification problem we have at the top, we clearly have
three outliers. Those are the three Xs--the green Xs to the left are obviously outliers.
And if we were to use the well-known exponential loss to classify this problem--the exponential
loss has been plotted in light blue, the light blue dotted line. And because of its exponential
shape, it basically will emphasize and concentrate on these three outliers and assign to them
an exponentially high loss, whereas, what we'd really like it to do is to disregard
these three outliers. So, recent improvements have been made in terms of dealing with this
type of outliers. One is the linearly growing Logit loss used in LogitBoost introduced by
Friedman in 2000. And that's the green line. You can see that it grows linearly instead
of exponentially, so that's an improvement. The other is the bounded Savage loss used
in SavageBoost that we introduced in '08, and that's the dotted blue line in the bottom.
You can see that it's actually bounded; it flattens out after a while. And so these help
with this type of outliers. But we argued that the negative margins are not the entire
problem and that it's also important to penalize large positive margins. So in order to explain
this further, we can consider this simple classification algorithm--I mean, this simple
classification problem we have where the data is distributed uniformly in the vertical direction
and is Gaussian with equal variants and different means of +3 and -3 in the horizontal direction.
So for these two classes it's easy to see that the Bayes decision rule would be aligned
at X equals zero. And in fact, if any of the previous Bayes consistent loss functions are
used on this exact data set, they would all place their decision boundaries where the
red line is which is very close to the black Bayes decision rule. So the--all the previous
Bayes consistent loss functions will approximate the Bayes decision rule almost exactly.
>> [INDISTINCT] >> MASNADI-SHIRAZI: I didn't get the question.
What was that? >> [INDISTINCT]
>> MASNADI-SHIRAZI: [INDISTINCT], yeah. The fact that I know that the optimal Bayes decision
rule is the X line is because I know that they're Gaussian. But if all I had were these
few points and I used the Bayes consistent loss function, anyone of these loss functions
that I have on the right, they would in--actually for this data set, give you the red line,
which is almost on top of the black line, which is the optimal. So even if you didn't
know anything about this distribution and all I gave you were these few points and you
were to follow the procedure of--start off with the Bayes consistent loss function, compute
the risk, minimize the risk, you would get the red line without any information about
the distribution. So that's the power of Bayes consistent loss functions.
>> [INDISTINCT] >> MASNADI-SHIRAZI: I'm not assuming anything
about the data, whatsoever. All I have is just the data. I know that the blue line is
one class, and the blue circles is one class, and the red Xs are another class. Given their
coordinates, I can--but these points were actually sampled from a Gaussian which I do
not know whatsoever. >> [INDISTINCT]
>> MASNADI-SHIRAZI: Sure, sure, sure. I'm going to actually consider what happens when
you have, you know, a rogue point or something like that. It could actually change immediately.
So what I'd--what I'm going to do next is actually introduce an outlier here. So let's
say I had one sample point at that point. So the impact of adding a single X at that
point--which could still happen as part of the data, right, because it's Gaussian in
the horizontal, so that could be one of the Xs that come up--the impact--but in essence,
it is kind of an outlier for this data because it's highly unlikely. But, in any case, if
you do have a point at -2 and zero, what would happen is that all of these previous loss
functions would drastically shift to their decision boundary to where the purple line
is. And they would place their decision boundary at about X equals -2.3. And the reason for
that is because all of these loss functions assigned zero or close to zero loss, for being
correct, right? This includes both the Savage loss which is the green one or the--sorry--the
Log is the green one and the Savage is the blue one. So you can see that they assigned
zero penalty for being correct, and since by placing the line at -2.3, you can classify
everything correctly. That's just where they're going to put their boundary which is very
far away from the Bayes optimal, had we known the true distribution. On the other hand,
the Tangent loss which is a loss function that we're going to derive during this talk
is also Bayes consistent. But since it penalizes being correct, it penalizes both the positive
side and the negative side. It discourages solutions that are too correct. So it will
actually place the decision boundary where the red line is given this exact data, which
is still very close to the Bayes decision rule. Yeah.
>> [INDISTINCT] >> MASNADI-SHIRAZI: No, I'm not assuming anything
other than just the data. I mean, if-- >> [INDISTINCT]
>> MASNADI-SHIRAZI: Yeah, yeah, that's true. If there was another, let say, another, you
know--let's say it was like two Gaussians, all of them consist--made up the red class.
That's true. That red point could be one sample from that. But the problem is that right now,
given the fact that it's just one point, it's still considered kind of an outlier. I will
actually deal with this problem later on because once you derive the Tangent loss, you can--you
can actually change the amount that it penalizes being too correct. So you could even actually,
you know, flatten this to make it look like--something like the Savage loss which wouldn't penalize
it and actually place the decision boundary very close to where the purple one is. So
you can have a control over this. It's just a matter of--given the data right now, I mean,
just looking at it, it would seem better to place the lines somewhere close to the black
one rather than drastically move it all the way to the left. Even if that were a distribution
and what--actually when I was building this, you know, toy example to get the point across
of why it's important to sometimes penalize being correct, if you actually add another
X to where that far left one is, the Tangent loss would also place its boundary close to
-2.3 because now its--you know, there's too much penalty for classifying two points wrong.
And you would expect that if that was a true--another mode. You would have more examples from it.
And once you actually add another one, it'll shift; it will be just too costly. It doesn't
want to make a mistake on two points, so. >> [INDISTINCT]
>> MASNADI-SHIRAZI: Yeah, yeah. But, I mean, this toy example is just to get across the
importance of actually penalizing, you know, outliers that are of this type. So you have
to--in order to deal with these, you have to kind of penalize being too correct sometimes
as well. So I don't know if I--yeah. >> [INDISTINCT]
>> MASNADI-SHIRAZI: That's true. I mean, that's why you...
>> [INDISTINCT] >> MASNADI-SHIRAZI: That's true. But that's
usually what we have to--I mean, that's what we deal with in the real world. We'd always
have only a few samples. We never have the true distribution of the data. And so, I mean,
the purpose of this talk is to say that it's important to have a Bayes consistent loss
function that gives you that flexibility and that control. So when you are dealing with
situations where you have small amounts of data and the possibility of having outliers
is increased in that case, you want to be able to, you know, have loss functions that
are not only Bayes consistent but also have, you know, these different types of shapes
that can deal with these potential problems >> [INDISTINCT]
>> MASNADI-SHIRAZI: So I actually--if you were to use the Hinge loss, which is what
the SVM does, I didn't--I didn't run an SVM because this is so simple, you can solve it
directly. You can just like do a exhaustive search over the space of lines here. But if
you were to use the Hinge loss, which is what the SVM uses, it does still place, you know--because
there's no penalty for being right, so why not place it where the purple ones--
>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah, I didn't do a pass
like that on this, so. >> [INDISTINCT]
>> MASNADI-SHIRAZI: The Square loss? The Square loss--the Least Squares loss is actually almost
never used in classification because--let me go back to the shape. That purple line--I'll
give you a better example. So the red line is the Square loss, right? And you can see
that it penalizes being correct almost as much as it panelizes being wrong because of
its symmetry. It's just X squared, so it's never used for classification. If I were to
use--I actually--that's an interesting question. I--maybe I should have used it here and see
what would have happened. If I had actually used the Square loss--because then it would
penalize this, as well. And, let's see. If it penalizes that point so--actually the Square
loss wouldn't work. It would--it would place the line at--where the purple is, I think,
I'm kind of guessing, because there's going to be such a huge penalty for getting this
wrong, that it would prefer to get it right rather than skip over it. So the Tangent loss
here assigns a bounded penalty for being right that's why it can disregard to this outlier.
You see that it's kind of--you see that it tapers off here, right? It's bounded in that.
Whereas the square loss would go up, it would assign a huge loss for being--for any incorrectly
classified points of that sense. >> [INDISTINCT]
>> MASNADI-SHIRAZI: Yeah. So--exactly. That's the whole point of a loss like this because
it wants to, you know, keep--it wants--it actually--in the paper, I actually have an
example where I've plotted like the histogram of the distance. This is kind of the margin.
So it actually encourages small margins that are placed at that plate. You know, I think
it's--that dip, that minimum is like 0.5. But the interesting thing is that when you
see that example, even though most of the points have a small margin of about 0.5, meaning
that--that minimum is actually enforced, it does better in classifying a data set that
has lots of outliers, so we were--I think, in that--in that example, we had some digit
classification problem and the margin is smaller but the classification output is better than
all of these other loss functions. So that's a--that's a problem that we actually considered
in the paper. So did I--the--it's--just because the margin is smaller, it doesn't necessarily
mean you're doing worse, especially when you have outliers, so.
>> [INDISTINCT] >> MASNADI-SHIRAZI: That's true. It's just
how much--for example, when you're using Exponential loss, it tries so hard to increase the margin
that it, you know, screws up the classification which is the goal, so.
>> [INDISTINCT] >> MASNADI-SHIRAZI: For the--I wanted to give
these--I know what you're saying, but that's regression. It's not classification because
in regression--in classification, you actually have the two classes and you're trying to
minimize the risk. And in regression, yeah, if you allowed it to, you know, plot a line
through here, that would happen. It would place a horizontal line, so.
>> [INDISTINCT] >> MASNADI-SHIRAZI: What I think I did for
this data set was I--let's see. I think I limited the--I didn't--yeah, I didn't--maybe
I didn't want that to happen so I limited to vertical lines. That might have been what
I did. So if I remember correctly, that's probably what I did to avoid that. In any
case, I haven't even considered the Least Squares, so. Any--yeah.
>> [INDISTINCT] >> MASNADI-SHIRAZI: I--I think what I considered
was like from -10 to +10. Something like that. >> [INDISTINCT]
>> MASNADI-SHIRAZI: I wanted to put a Y dimension here just so it would look nice when I plot
it. Because if I only had one--that's why they're uniform in the Y direction. So I already
know that the Y direction is, you know, useless. >> [INDISTINCT]
>> MASNADI-SHIRAZI: Yeah. Because if I had only plotted these points in the X direction,
you would get these circles on top of each other; you couldn't really see anything. But
that's why I said it's uniform in the vertical so you can disregard it completely. Yeah.
But the point of this toy problem is something completely different. You guys should let
me go further. What I wanted to just get across was the idea of the fact that some--first
of all, we have problems where we have outliers. And second of all, we want to deal with outliers
and there's two types of outliers in a sense. One is when you're wrong and--so one is this
type of outliers that I considered here, right? There's no--and the other is that when you
can be actually right, completely right, and still have, you know, an outlier. So the Savage
and the Log kind of deal with the first type, but you also want a loss that looks like this
that can, you know, penalize being too correct, right? When you have outliers, that's kind
of--that's just to get across the point of having a loss function that has a shape like
this. Now, of course, we'll go into more, you know, advanced and practical examples
so don't worry. But in order to derive something like the Tangent loss that I just talked about,
we need to talk a bit about risk minimization. So we define a feature, X, and the class label,
Y, and the--and we call our classifier, H, and out predictor, F. So our classifier is
just going to be the sign of our predictor, F. And we also have the loss function, L.
And so what we're going to be doing is minimizing the risk or the expected loss which is equivalent
to minimizing the conditional risk. And here we're going to define eta to be the probability
of one given X. So with these in mind, we'll consider the traditional approach of dealing
with a Bayes consistent loss function. We're going to actually talk about what does Bayes
consistency mean in more detail. So a good example is the exponential loss that we have
here. And the traditional approach is to, of course, take the--take the expectation,
find the risk, C. And then you want to minimize the risk, and find your predictor, F star.
And for the exponential, you can actually show that F star will be one half log of the
probability ratio. So the question now is to see whether our predictor is Bayes consistent
and we've plotted this F star and you can see that it is indeed Bayes consistent because
it assigns positive values to probabilities that are higher than a half and negative values
to probabilities, eta, that are less than a half. So it's implementing the Bayes decision
rule which is good. So we're good there. We take our F star and we plug it back into our
risk to find our minimum risk which we call C star. And then what we have to do here is
check to see whether C star is convex or not. And the reason for that is because it can
be shown that in order to have a convex optimization problem, you need to have a minimum conditional
risk, C star, that's convex. So, for the exponential loss where--everything works out fine, but
you can immediately see from this example that this traditional approach doesn't actually
let you design Bayes consistent loss functions. It only lets you check to see whether a loss
function is Bayes consistent or not. And we can kind of say that we got lucky with the
exponential because all these things, you know, fall into place. So in our next section,
we'd like to talk about how can we actually--we're going to propose a new path for designing
Bayes consistent loss functions. And the general approach is going to be that we can choose
any strictly concave functions, C star, and any invertible function, F star, such that
they have the simple symmetry constraints and we plug these two functions into this
formula and we get a phi, a loss, that is guaranteed to be Bayes consistent. So this
gives us a principled way of deriving, designing novel Bayes consistent loss functions. Now--I'm
not going to actually go over the proof of that formula. So if--the proof can be found
in our NIPS '08 paper and it basically deals with probability elicitation and Bregman divergences
and so I'm going to kind of skip over these. But we're going to go over an example of actually
deriving--using the formula and deriving a novel Bayes consistent loss function. Here
we're going to derive the Savage loss. So, let's say we start it off with the Logistic
Link function that we actually saw in the Exponential loss as well, and we pair it with
the Least Squares minimum risk. And these two satisfy the symmetric constraints. All
we have to do is take these and put them into the formula and we'd derive a novel loss that
we called the Savage loss and we've plotted it. And you can see that this loss function,
unlike all previous Bayes consistent loss functions, is first of all not a convex and
also it is bounded, so it flattens out. It has a bounded penalty. And we use this loss
function in--to do some robust classification. But for a truly robust loss function, we saw
in our previous discussion that for a hypothetical loss, we--it should have a certain number
of properties. We would, first of all, like it to be bounded for large negative margins.
And we'd also like it to be--have a smaller bounded penalty--assign a smaller bounded
penalty for large positive margins. And we'd also like it to be margin enforcing of course.
And it can be shown that under Bayes consistency, those three properties are satisfied if and
only if these three conditions or requirements hold for our choice of F star and C star.
And, unfortunately, existing link functions, F star, don't comply with those requirements.
So we introduced the Tangent Link which does. So this is the plot of the Tangent Link and
we pair it up with the Least Squares at minimum conditional risk again, plug it into our--the
formula, and we'd get what we called the Tangent loss which actually has the desired shape
of course. So now that we have our loss function, we can easily use it to do classification
and derive like, for example, a Boosting algorithm to minimize the risk associated with this
loss. I'm not going to go over deriving the Boosting, and algorithm, and stuff; you--the
details can be found in the paper. I'll just go over some examples and experiments that
we did. We've done experiments on image classification and also Multiple Instance Learning data sets.
But here, I'm going to talk about the scene classification, the 15 class scene classification
problem. We used the semantic space representation from Rasiwasia and Vasconcelos CVPR '09. And
the only thing that we changed was replace the SVM classifier with the Tangent loss classifier
and we got for--now, their scene classification method was already state of the art, but we
got 4% improvement with just simply, you know, an outlier resistant robust classifier instead
of the SVM. So--and actually, the interesting thing about this data set is that most of
the improvements were seen between classes that were--I'm starting to lose my voice.
>> [INDISTINCT] >> MASNADI-SHIRAZI: What was that?
>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah. We're doing one
versus all. The most improvement was seen in--between classes that are easily confusable.
For example, when you're trying to distinguish between the highway class, or the street class,
or the tall building class, or when you're trying to distinguish between the mountain,
and forest, and open country class, these are actually one of the--like these are some
of the 15 classes out there. There's also living room, and bedroom, and kitchen, and
these are similar but they actually produce the most improvements which is expected when
you're doing a robust classification. We've also done experiments on tracking and we used
the Discriminant Saliency Tracker of Mahadevan and Vasconcelos CVPR '09. And we use TangentBoost
to combine the saliency maps in a discriminant manner. Here, in these two examples, tracking
examples, the red box is for when the Tangent loss is used to do tracking and the dotted
blue is when the Exponential loss, which isn't robust, is used. This one--let's see if it
works. It still doesn't work on my computer; that's weird. But--okay, this one should work.
So you can see that the Tangent loss keeps track of the object whereas the Exponential
loss drifts off and loses track. And this is because the person keeps changing shape
and there are plenty of outliers so it's much more robust. Now a natural question that was
already kind of asked was, "Can we control the shape of the Tangent loss?" So, there
is actually a parameter, A, that controls the ratio between the negative margin and
the positive margin. So the general formula is--includes the A. And we've plotted the
Tangent loss for different values of A. And you can see that--you can actually either--actually,
the best results are found through cross-validation for A or you can kind of like guess A depending
on how much outliers you expect to have in your data. But in general, you can kind of--you
can control the amount of penalty that you assign to the different margins. So, you have
that flexibility. Another question is the question of convexity. So, here we have examples
of a convex function, a quasi-convex function, and a function that's neither convex nor quasi-convex.
And so, for--a quasi-convex function isn't convex but it still has a unique minimum.
On the other hand, all convex functions are also quasi-convex. And we can actually prove
that Bayes consistent loss functions from this formula are all quasi-convex. Now, the
question--the main question is, "Is our optimization problem still convex even when the loss is
not convex?" So, if--so, the thing to keep in mind here is that we're not minimizing
the loss, we're minimizing the expected loss or the risk. So with that in mind, we can
prove that the risk is actually still convex in the probability space even though the loss
is not. We can also prove that the risk is quasi-convex in functional space. And this
is an important proof because when you're doing classification, you're usually not working
in probability space. You're working directly in the functional space. You're trying to
directly find the function that classifies--that minimizes your risk. So it's still quasi-convex
even though the loss is not in. Also, more importantly for practice, the empirical risk
is also quasi-convex in functional space. So in conclusion, we can say that the optimization
problem is quasi-convex and has a unique minimum that can be found with functional gradient
descent such as Boosting even though the loss function is not necessarily convex. So--
>> [INDISTINCT] >> MASNADI-SHIRAZI: The optimization is not
by definition convex, but it's quasi-convex, meaning that it still has a unique minimum.
So you can use quasi-convex methods to find the minimum. The easiest thing that works
is just do functional gradient descent which is what Boosting does. That's why the Boosting
algorithms work even though these loss functions have all these, you know, quasi-convex shapes.
So the important thing is that it still has a unique minimum.
>> It's [INDISTINCT] minimize [INDISTINCT] so you can have [INDISTINCT] function [INDISTINCT].
>> MASNADI-SHIRAZI: No, no, no, no, no. That's what I mean by unique minimum; it doesn't
have local minimum. Even the empirical risk won't have local minimum. That you'd--that's
what's... >> [INDISTINCT] I believe this is to be true,
>> MASNADI-SHIRAZI: Well, I can actually prove this, I mean, if you want later. Initially,
that--I mean, that's the question that people have. They think that, "Oh, since you have
a bunch of these little dips for these different points." But, think of it this way. Let's
say you only had one point and you're trying to do--like the worse case would be to just
have like one or two points. You can still show that the sum of two quasi-convex functions
that have this formula are themselves quasi-convex, meaning that the sum of even two--like the
simplest classification problem in the world is just one point here and one point there,
right? The sum of these two would still be another quasi-convex function. That's the
empirical risk, right, the sum of two of these losses. Let's say you sum two of those.
>> [INDISTINCT] >> MASNADI-SHIRAZI: Yup. You can sum any two
and they'll still be quasi-convex as long as they come from this formula.
>> [INDISTINCT] >> MASNADI-SHIRAZI: Maybe I should have. But
the thing is that before giving--I got this question during my CVPR talk and I had gotten
the first question when I gave the NIPS paper. So I knew I was good in terms of probability
space, but that's not the space we really were--are dealing with. Because when you do
it--like you do a transition from the probability space, a transform doesn't necessarily preserve
convexity. So when you transform from a probability space into functional space, there's a chance
that you'll lose convexity. And so I had to actually go and prove two and three as well
which--that's why I wanted two weeks before I came to give this talk because I spent the
first week actually proving two and three. >> [INDISTINCT]
>> MASNADI-SHIRAZI: It's actually quite simple to do. If you'd--if you'd take the derivative
of this function down here, you can prove that it's quasi-convex by the classic way
of proving quasi-convexity which is to show that the derivative is positive up to a certain
point and then negative from then on, meaning that there's only one change. That's the--that's
the general--one of the ways to prove quasi-convexity. So--
>> [INDISTINCT] >> MASNADI-SHIRAZI: Yeah. So the sum, you
know, adds a little bit more but it's still nothing more than adding one function phi
of V plus, let's say, alpha times phi of negative V which is the positive class plus the negative
class. So--and in fact, once you--let's see--once you prove two, three, which is the case that
you're talking about, is not that bad afterwards. So we can go over it--yeah. Maybe that's what
I should have concentrated on more. Later. Yeah.
>> [INDISTINCT] >> MASNADI-SHIRAZI: This, I--this, I haven't
actually included in any of the papers. Part one is in the NIPS paper, theorem one, but
two and three were added to the paper that--like the journal version of the paper that I'm
writing and it's almost done. This is kind of like the missing link, so.
>> [INDISTINCT] >> MASNADI-SHIRAZI: No, just any. I just gave
the simple two examples. But the point is that you can add alpha number of the positive
numbers and beta number of negatives and still have a sum. So the thing about convexity is
that when they're talking about empirical risk, we all know that the sum of a bunch
of convex functions is still going to be convex. And that's why when you're doing empirical
risk minimization using convex loss functions, the thing works well because you know that
when you sum up a bunch of convex functions, you're still going to have convexity. The
same theorem, unfortunately, doesn't exist for quasi-convex functions in general. You
can't say just summing up a bunch of quasi-convex functions will ensure convexity. But, fortunately,
if they come from this formula, they do. So-- >> [INDISTINCT]
>> MASNADI-SHIRAZI: The Sigmoid is almost like Savage but it's not exactly. It's not
one of these--I'm trying to find the formula for Savage. Here it is. It looks almost like
the Savage but it's not exactly. The Savage does, too. I mean, it's quasi-convex. It's
derived from that formula. It will still work. Yeah. It seems more remarkable that you would
have a loss function looking like that. That would still work well. So are we--so I can
move on, hopefully. Okay. So in conclusion for this section, we showed how to derive
novel Bayes consistent loss functions for classification and we derived the set of requirements
that a robust Bayes consistent loss function should have. And we also showed that Bayes
consistent loss functions can be non-convex. And we gave some examples of using the Tangent
loss functions for doing classification. I don't know if we have enough time but we can--I
also wanted to talk about deriving Bayes consistent loss functions that are cost sensitive. So
I've been talking for about 50 minutes already. Do you want me to continue or...?
>> [INDISTINCT] so many questions so I think it's okay [INDISTINCT]
>> MASNADI-SHIRAZI: Okay. So many classification problems are actually naturally cost sensitive.
For example, when we're doing detection, or fraud detection, or medical diagnosis, or
business decision making, the cost of one--misclassifying one class is usually much higher than the
other. So for medical diagnosis, for example, missing that tumor might be considered as
having an infinite amount of loss because you, you know, the--it might result in a death
of a patient; whereas, on the other hand, a false alarm or a false positive would just
cause a bit more money because you have to do further experiments. So there's obviously
a demand for cost-sensitive extensions of state of the art techniques. And we--here
we propose an alternative strategy for the design of cost-sensitive algorithms. And rather
than directly manipulating the algorithm and risk compromising Bayes consistency or Bayes
optimality, we extended the underlying loss function and derived a cost-sensitive learning
algorithm from that loss that implements the Bayes decision rule and also approximates
the Bayes risk. So here--this is just the same slide that we had before to remind us
of the general path that we took before, for deriving Bayes consistent loss functions.
We would choose any strictly concave function, F star and C star--sorry--strictly concave
function, C star, and invertible function, F star, such that they've satisfied these
constraints. We'd plug it into the formula and we were guaranteed to have a Bayes consistent
loss function. And the question now is, "Can we have a general procedure for deriving cost-sensitive
loss function as well?" So here, we're going to define C1 as the cost of a false negative
and C -1 as the cost of a false positive, and we'd like to ask what conditions need
our choice of the F star and the C star now satisfy such that we have a cost-sensitive
loss that's also Bayes consistent. So generally, we would have a loss phi1 assigned to the
positive class and the loss phi -1 assigned to the negative class, but we'd still want
to conserve Bayes consistency. So the generic procedure is to start from an invertible predictor,
F star that has this symmetry and the concave function, C star that has this symmetry. And
this symmetry is basically the same symmetry that the Bayes risk has. So we're ideally
trying to find a risk that has the same behavior as the Bayes risk, and then we define J to
be negative of our C star, and we derive these auxiliary functions I1 and I -1 from these
two formulas, and our loss functions can be shown to basically be the negative of I1.
So phi1 is going to be negative I1, and phi -1 is going to be negative--I -1. And these
come from probability elicitation. The proof comes from there. But in general, the theorem
is that the loss derived through this procedure is cost-sensitive and Bayes consistent. So
here, we're going to go over an example which would be the exponential--we're going to try
to derive the exponential cost-sensitive loss. We choose F star to be this formula which
is the cost-sensitive version of the--that Log link and you can see that it is cost-sensitive.
It implements the cost-sensitive Bayes decision rule because now, the boundary has shifted
from a half to, in this case, 0.3 because C1 was chosen to be two. So all points that
have probabilities higher than 0.3 are now assigned positive values and all points that
have probability less than 0.3 are assigned negative values. So we've shifted the boundary
to make it--to prefer one class. And we also have the cost-sensitive version of the risk.
And you can see that in this, the--it's still concave but it has an isometric shape. And
the point here is that the maximum of this risk is, again, at 0.3. So the maximum risk
is being assigned to the boundary which is what we want. Going through all the steps,
we can find phi1 to be like this which is not a cost-sensitive version and there's--there
are now two loss functions. One is the red one that's used for the negative class and
the blue one is the loss that's used for the positive class, and it's still Bayes consistent.
So we've actually used the... >> [INDISTINCT] properties that we had earlier
[INDISTINCT] >> MASNADI-SHIRAZI: You could--I haven't--in
order to get a loss function that looks different, you would have to choose different F stars
and C stars. So if you could go back and derive the cost-sensitive asymmetric version of the--well,
you already have it here--combine it with the cost-sensitive Tangent Link, you could
probably get an analogous version for the cost-sensitive one. I haven't done it but,
you know, maybe that's up for grabs. Yeah. So the Least Squares is the same C star that's
used in Tangent. The difference here is that the Tangent uses a different F star. So if
you derive the cost-sensitive version of that one, you could get something similar. But
here we used just the Exponential and the reason for that was because--the history of
this is because I wanted to derive the cost-sensitive version of AdaBoost, and that uses the Exponential
loss. But here, using the cost-sensitive version and still Bayes consistent Boosting, we did
some--we've done a lot of experiments for pedestrian, car, and face detection. We've
even used it for Brain EEG Surprise Signal Detection. But I'm going to move on to deriving
the cost-sensitive SVM. >> I remember, there was an asymmetrical version
of AdaBoost, but I only [INDISTINCT] >> MASNADI-SHIRAZI: So there have been lots
of papers on making AdaBoost cost-sensitive. In my--actually in the PAMI paper, that's
basically what it's all about. It's going through the history of this and finally, you
know, deriving this which is the cost--an interesting thing about this is that it's
still Bayes consistent. All of the previous methods usually dealt with altering the AdaBoost's
algorithm directly. What they would generally do is change the weights. So the AdaBoost
has its weight update formula. They would go at--introduce costs directly to the weights
and depending on how you'd kind of change the weights, you would get different algorithms.
But no one--when you go and alter the algorithm like that, you have to still kind of show
that it's Bayes consistent because otherwise, you don't know what you're doing anymore;
are you really minimizing, you know, what you should be? And given infinite amount of
data, will you still converge to the Bayes decision rule which is what you should be
doing? So here, when you show that it's Bayes consistent, you answer those problems and
you don't need to go directly in it, like manipulate the algorithm. Once you have the
correct loss, you can just derive the classifier from it and that's what we did. So for the--we're
going to try to do the same thing for the SVM because there have been many papers on
cost-sensitive SVMs but, again, what they would do is, you know, go into the algorithm
and, you know, tweak this, or change that portion, or add the cost here or there. But
the principled way to do it is to actually derive the cost-sensitive loss function, show
that it's Bayes consistent, and then once you have that, you know, the rest is just--you
know, go through the steps of finding the algorithm. So here, we're going to--we're
going to start with F--we're going to choose our F star to be this which can be shown to
be Bayes consistent and it implements the Bayes decision rule. And in fact, it's just
the cost-sensitive version of the link used in SVM. So if you set C1 and C2--if you set--sorry--C1
and C -1 = 1, it will just, you know, the--simplify the sign of 2 eta -1 which is the exact same
thing used in SVMs. So for cost-sensitive--so this suggest the cost-sensitive minimum conditional
risk. And this has four parameters, A, B, D, and E, that can be shown to satisfy the
symmetric conditions for Bayes consistency if they satisfy these conditions. But in any
case, if you go through the steps, you can find the cost-sensitive SVM loss function
which is what I have there. So this loss has four degrees of freedom; A, B, E, and D. And
for the positive samples, we've--basically, they're assigned the margin of E/D and the
hinge slope of D. So if the red line is the positive hinge loss, they'd get a slope of
D and a margin of E/D. For the negative, you have a margin of B/A and a hinge slope of
A. So the blue line is the hinge loss that's used for the negative samples. And after some
parameter normalization, you can find C star to be like this, but you go through the steps.
And now that you have the loss, the risk basically leads to the cost-sensitive SVM primal formulation
and this is assuming that you--you're preferring the positive class. You can derive a similar
primal formulation [INDISTINCT] equal to 1, which is a good check, right? Because if it
doesn't return to the standard SVM when I set the cost equal to 1, I've probably done
something wrong, but it does actually simplify to the standard SVM when you set the costs
equal. But--here we have some experiments that we've done using the cost-sensitive SVM
and we've compared it to the previous algorithms; the boundary movement SVM and the bias penalty
SVM. So the bias penalty SVM would introduce costs on the slack variables directly, and
the boundary movement SVM would basically just change the bias of your final SVM. So
compared to these two competitors, the--we did experiments on 10 UCI data sets and cost-sensitive
SVM has smaller error on seven out of the 10 data sets and does equal to bias penalty
on two other ones. We also did experiments on the German Credit data set. And here the
goal is to identify bad credit customers to be denied credit by the bank. And each person,
each--gets 24 attributes such as--24 features, such as the person's age and income, and we're
trying to decide whether the bank should, you know, give this person credit or not.
An interesting thing is that, here, the cost matrix is actually defined in the problem
because if a good credit customer is denied credit, the bank has a loss of one; meaning
that the bank looses a good customer. Whereas if you give credit to a bad customer, the
bank suffers a loss of five; meaning that the--so the bank looses money. So this suggests
the following constraint on our choice of C1 and C -1. Their ratio should be 5. And
this is defined in the problem, so it's part of the data set. But the final result is that
if you use cost-sensitive SVM, you get a total reduction in cost of 37%. So in conclusion,
we extended the--uh-huh? >> [INDISTINCT]
>> MASNADI-SHIRAZI: So here is--no. When you're--there are different ways to compute the error. So
here, what we did was just--yeah, just count the number of errors that you have over the
data. There isn't any cost-sensitivity incorporated within the data set itself because I don't
know how much a miss would cost when I'm doing, let's say, heart disease or I'm trying to
do PEMA. I wouldn't know--I mean, you can define these to be whatever you want. It would--it
wouldn't change the result. It would just be multiplying these by a number.
>> I was wondering, let's say, you changed the [INDISTINCT].
>> MASNADI-SHIRAZI: Yeah. So what I--okay. So to go into--yeah--to go into the details
of how I actually did this, the way we did it was we--all of these algorithms have their
different parameters, right, like the C1, C -1. So you--we tried all the different parameters
such that all of these algorithms attained a certain detection rate on a validation set.
So, for example, we said we want to have all of these such that they have 90% detection
rate. Once they're equal on that sense, I would go to the test set, and test it, and
see how many false positives they gave me. And of course, the one with the smallest false
positive would be the best because I've ensured that their detection rates are fixed given
the amount--you know, like, the amount of flexibility these have. The cost-sensitive
and bias penalty SVMs, they have the same number of parameters that you can tweak. But
the boundary movement SVM, the only thing that you can change there is just the bias,
so. But with that in mind, that's the fair way of comparing them. So for this section,
we extended the probability elicitation view of loss function designed for cost-sensitive
problems, and we specified the generic procedure for deriving cost-sensitive algorithms, and
also derived the cost-sensitive Bayes consistent Boosting and SVM loss functions, and also
went over some experimental results. And thank you. So I'm open to any more questions.