Bay Area Vision Meeting: Visual Recognition via Feature Learning

Uploaded by GoogleTechTalks on 15.04.2011

Kai Yu sets up. Let me go ahead and introduce him. He is the head of mediate Analytics Department
at NEC Laboratories America. Before joining NEC, he was a senior research scientist at
Siemens Corporate Technology. His research interests are learning formulations, collaborative
filtering, multitask learning, computer vision, image classification, deep learning and sparse
coding. And I should note that he lead the first place team in the ImageNet Large Scale
Visual Recognition challenge. And with that, look forward to hearing his talk.
>> YU: So, how is everyone? Today, I'm going to talk about Image Recognition using Unsupervised
Feature Learning. So, if you look at the title, you can notice that there's a big overlap
with Andrew's talk because we both believe--actually, a bunch of, you know, machine learning guys
really believe by learning from data, we can actually obtain very powerful features for
image recognition. There's one part actually I don't want to talk is deep learning because
I believe it's hard to be as deep as Andrew did, so. So let me focus on unsupervised feature
learning. So, in the past two years, I worked on this with a large group of people. Those
are my colleagues from NEC Labs and also we collaborated with professors like Tong Jiang,
Thomas Huang and John Lafferty. And by the way, we also have some secret weapons which
are--who are our interns. They are very smart and productive. So, what do we did, I believe,
is pretty amazing. So, this is one example. So, in 2009, this is a PASCAL Visual Object
Class Challenge. I think this is probably one of the most prestigious benchmark in computer
vision for object recognition. So, the task is to classify images to one or multiple of
the 20 categories. Right. So, typically, very often one image here can be multi label because
it can turn in several different category of objects, right? So, in 2009, the task is,
okay, given about 8,000 images we're training to classify a test to set of 6,650 images.
So, this is the result. And we achieved the number one positions for 18 of the categories
out of total 20 categories, right. Those here for each category will list our performance
and the best of performance from other teams. Here is the difference. Right. So, it's actually
pretty encouraging for us because we use the learning features, right? We don't, you know,
combine all kind of different features like appearance, shape, textual, those features.
We just use a HOG feature on gray images, right. So, it's pretty standard and simple
feature. But what we really did is learning features from these simple features. And this
is a very encouraging result. This shows the potential of learning features front data.
Well, you might argue that, okay, we all know humans can recognize tens of thousands of
objects, right? So, the Pascal challenge is pretty small scale. So, we should really,
you know, challenge our self, you know, to push to the boundary and to recognize more
and more type of objects. So what's happened last year is pretty amazing. Fei Fei and her
colleagues organized this ImageNet Large Scale Visual Recognition challenge. So, in this
task we have 1,000 categories of object to recognize and 1.2 million images for training
compared to 20 categories and the 10,0000 images in Pascal challenge. So, it's much
bigger scale of the problem. And, in a way, they did really well. This shows the result
of the--of the different teams. And, actually, in terms of classification accuracy, we achieved
the 52% for 1,000 class classification. So from, you know, from machine learning point
of view this is a pretty high classification accuracy, right? And that we all know image
classification actually is really, really challenging, right, much more difficult than
a text classification. So I think Andrew have given a quite deep insight about why feature
learning is useful. So, I think I want to skip these slides. And in particular I think
for image recognition, you know, we also have a strong evidence that biological visual systems
actually learned features or, you know, a very long time of evolution, right. So, I
guess I'm not an expert to talk too many details about this biological system but I just want
to point out that simply the discovery of the visual processing architecture in the
primary V1 layer, namely this complex cells and the simple cells, those kind of behaviors
actually inspire the committed vision researchers to develop a very powerful visual recognition
systems, right. So, for example, this is Convolution Neural Networks, right. So here's the input
image, right, then you apply some learning [INDISTINCT], right, to convolute the input
data and then--as different locations. Then you get the response map, right? Then to further
handle this local invariance, right, so we do, like, pooling. So, for example, average
pooling, right, pool a local response together to get a single response, right? Again, you
get a smaller--sorry. You get a smaller feature map. And then again, on top of that, you apply
convolution filter again and then pooling again. So, you can repeat this process several
times and in the end put into a classifier, right? So this is basically the architecture
of a Convolution Neural Network. So, you can see here the convolution of filtering, the
filter start to simulate the behavior of the so called the simple cells, right? And the
pooling step simulates the processing of the complex cells, right? I think what's amazing
here is the Convolution of Neural Network actually learn a whole--the whole structure
simultaneously. That means unlike many other vision models, which kind of first handcraft
or whatever to get some image features and then put into SVM or [INDISTINCT] to get a
final outcome. And this model actually learns the whole thing from the pixel level, right?
So, it's achieved a state of the arts performance on a number different tasks, for example,
one thing is digit recognition and also a [INDISTINCT] full face recognition is also
one of the best models. So, it turns our actually this model probably was developed like--more
than 20 years ago, right. And recently, people--this is a very popular thing in the computer vision
community especially for scene recognition and object recognition which is so called
the bag of words--bag of visual words model. And in particular, it's a combined of this
spatial pyramid matching to handle the spatial structure. It's achieve a really state of
the art performance on a bunch of benchmarks, for example, Caltech 101 and Pascal Object
Recognition and also like [INDISTINCT] evaluation. But if you look at actually the details of
this model, it still follows pretty much the same architecture. For example, here the first
layer, you use local gradient, you know, applied everywhere across the image. And then I do
this kind of pooling. So by this way you obtain sort of a dense safety feature or dense HOG
feature. And on top of that, you'll apply vector quantization coding and then apply
average of code--average pooling. That's the way you get the--[INDISTINCT] one, right?
In the end, you apply linear SVM. So you can see, the structure is pretty similar but in
the first two layers, are handicraft, right? And these VQ coding layer is some of like
a learning because [INDISTINCT] learning--the learning [INDISTINCT] you have to apply K
means to a large tool of unlabeled image of patches. So this is--actually what's learning,
right? So, I think our observation so far on this model is--probably there are several
ways to start from here--to improve this model. So one thing is; first, this nonlinear SVM
is not scalable, right? Even though it's a [INDISTINCT] very impressive performance but
we all know in the computer vision--really, if you want to slower the problem, we have
to use a lot of training data, right? So, for nonlinear SVM, you cannot--probably it's--it's
[INDISTINCT] to handle for example or 10,000 of the training examples. I'm thinking of
ways the VQ, Vector Quantization coding might be too discreet but might be too cost, not
fine enough, right? Because we end up--the sensory data structure is a continuous, right?
So, probably we need a better coding methods. Then, the next one is average pooling, right?
So probably average pooling is a suboptimum. Actually, I think many researchers observe
this and using something like a, for example,[INDISTINCT]. You can achieve a much better performance,
right? So we also need to [INDISTINCT]. What's the--what--how to develop? What's the principle
to develop a better pooling method? Right. The last [INDISTINCT] is, why don't we learn
hosting together, right? Instead of feeding, you know, fixing--the initial two layers are
fixed, right? So these pretty much motivate our work. So our work is pretty much lets
say, don't try to do anything too fancy that's except, that's--so this kind of architecture
is the thing we want to stay with. And then, let's--based on this observations, can we
do something better, right? So the first thing--what we did is replace the nonlinear SVM by simple
linear classifier. Then the question is, "How to computation in between a linear classifier
with millions or hundred of millions or even billions for training examples, right? So
we developed sarcastic training. I mean, for--to handle this problem. And the second one is
the coding part. So, how to do a lot better coding methods from--on label data, right?
And the key question we want to ask here is, how do we know? I mean, learning from--unlabeled
data actually can help you to achieve a [INDISTINCT] wise learning tasks, right? There's--then
a [INDISTINCT] is a pooling to develop better pooling methods. The last one is deep learning,
that means, learning the whole scene together--multiple layers. All the--all that filters together.
I think in this talk I will focus on only two part. The first is the nonlinear coding
we are [INDISTINCT] learning. So I think if I have some time then I would talk a bit more
on the pooling side. So here's a notation. This is an input image then if we extract
the features in the local patches everywhere, then what we get is sort of dense local features.
So in this case, each image is represented as a bag of feature vectors, right? In a case
of--if the feature is safety feature, it's a 128 dimensional feature vector. Then, we
perform nonlinear mapping to map the features into a new space, right? So this is a nonlinear
coding. And typically, the new space--the dimension [INDISTINCT] is much higher and
the end--the end [INDISTINCT] is some representation, not just the high dimensional but also high
sparse, right? Next, the way I perform linear pooling so which is a [INDISTINCT] average
of the--of the nonlinear codes together in the end that we perform linear SVM or any
linear classified. So if we look at this whole structure and look at the list of classifier,
so this is a linear model applied on the pooling codes. So--because this is a summation then
you can move the linear operator into the--inside of the summation. So--then this is a function
on the images. These are classified on image level and this is the function on local patches,
right. So in this framework, an interesting insight is a--so learning a classifier on
the level of images actually is so--closely connected to learn some nonlinear function
on local patches. So here's a very high level idea about learning nonlinear features on
the local patches, right. So supposed this is a--the distribution of the local patches,
right. And here's unknown nonlinear function on local patches, right. So we don't know
this nonlinear function. This is something what we want to learn. So, the assumption
here is the image of patches x follow a nonlinear manifold structure and the target of function
is smooth on its manifold. So that's our assumption, right? I think this is the key assumption
to connect unsupervised learning to supervised learning, right. So that means by learning
this interesting structure of unlabeled data so we can get some information about the target
function, right. And that's the information that actually can help us to learn the classified,
okay. So the coding is there is to learn nonlinear mapping from the input data to a high dimensional
space, right. Typically, this is the high dimensional and a sparse, right. And then
the way to learn such a nonlinear function is to actually put a linear function on this
nonlinear mapping, right, because the mapping is nonlinear then the linear function here
actually is a li--nonlinear function on the original input, okay. It's clear, yes. So
here's one idea where [INDISTINCT] which called local coordinated coding. So the intuition
is the following. Supposed this is the input data, right. We don't know this nonlinear
function, right. And then we want to figure out what are the function values, you know,
at this particular location, right. So what do we do is--okay, let's say we have a set
of bases, right, by whatever method we have this bases and then we find this on the local
bases, right, to encode--to linearly encode, to state the point but those are linear coefficients.
So we know this target of function is nonlinear but is it reasonable to assume it's locally
linear? Okay. And if it's locally linear then what we get is here in a local region, right.
So list the function value as a list location can be represented as a linear combination
of the function values on those activated bases because those bases is also in the local
neighborhood, right. So based on the local linear assumption, you have this approximation,
right. So this is the way how we interpret this target function value. This is the kind
of high level intuition. How do I use the coding method to learn the nonlinear function?
So the procedure is like the following; first, we learn the dictionary from unlabeled data,
right. Those are the basis sort of element in a dictionary, right. And the second step
is to learn--use the dictionary to encode the data point. So the way we so is, okay,
for each data point to be encoded and we find it at the neighborhood to activate the neighboring
bases, right. And that we get a set of coefficients, right. Typically, this coefficient is highly
sparse, right. So--because each of data point only activate a small number of neighboring
bases, right. That means other coefficients; most of them are zero, right? Only a small
number of bases none zero. So this is somehow connected to the sparse coding idea. Then
the next step is to estimate the parameters because based on the local linear assumption,
we have this approximation. Normally, each data point, the function of audio on each
tested point can be resented as a linear form, right? These are linear coefficients of the
coding; and those are kind of fixed standard of parameters which corresponding to the function
radius on the location over the bases, right? So here, it turns out that these--those parameters
are global linear ways to be learned. So this is pretty interesting because somehow it turned
a learning problem into kind of a simply estimating a linear function, right, estimating on some
linear parameters. So we can say something about at least a function approximation performance.
So this is the approximation to the function [INDISTINCT] at this location, right? It turns
out that these are approximation error is bounded--is up upper bounded by a kind of
uncivilized learning objective term, right, which is the coding error, right, so how close
you encoded the data point, right, plus additional of term which is--this is sort of like--regularization
on the coefficients, right? So, it encourages sparsity right? And at least, it's a way to
[INDISTINCT] distance turn which is the distance between the encoding to the point and the
bases, right? So, if you look at this, if we minimize this upper bound, right, that's
actually--it will push down the function approximation error. But something good is about this term,
encourage local kind of a local basis are turned in to be picked up, right? So, the
key message here is a good encoding scheme should be--yes, should be--have a small coding
error and also be sufficiently local. So, based on this principle, we can do a lot pretty
simple, very simple algorithm. So let's say, we use dictionary learning like a k means
or hierarchal k means to learn our dictionary, right? So then, we encode the data points,
that's the first thing that you show locally. That means--that's a pickup in neighboring
bases, right? So that's similar or somehow similar to vector quantization. But, you know,
in VQ, you pick up only one bases. Here, we pick up multiple bases, right? And the second
step, we show a low-coding error. That's means that we solve this minimization problem based
on constraint, right? Then, we get to the code. So, we apply this to my source. For
example, here is Caltech-101 and Caltech-256. So, on Caltech-101, it's pretty much comparable
with state-of-the-art message using fast coding on top of a safety feature, okay. And on Caltech-256,
it's much better, right? You can see here, performance. So, the next is another message
pretty much followed a similar idea of exploring the data geometry, right? So, this is also
called a Super-vector Coding. And so, it's--let's say, we want to--want to figure out what's
the function [INDISTINCT] at this location, right, the same as before. So what do we do
is to find that the nearest--the one bases, right? And so, one idea is--so, let's say,
we--let's use the function [INDISTINCT] under this particular location, right, to--as the
approximation of the target function [INDISTINCT], right, and the way--even the way to do this
for all the data point, so what do we get is something like a piecewise local constant
approximation of [INDISTINCT] smooth function, right. Actually this is a pretty much--explains
how the vector quantization for better words presentation, right? It does. Let's say if
we can't do something better. So again--so we want to figure out the function [INDISTINCT]
of this location, and then let's find the nearest bases. Then, let's pick up the nearest
basis here. So, instead of doing zero under approximation, maybe we can explore the high
order structure, let's say, the first order which is the local tangent, right? So, if
we do these for all the data point, what do we get is a very simple piecewise linear--local
linear approximation, right? So, we can also look at to how bad the function approximation
error, right? So, this error is [INDISTINCT] by simply the vector quantization error. That
means that--that's very nice, right? That means we can simply apply VQ to learn the
bases and then trust it to a Lipshitz on top of VQ then we can get a better performance,
right? So, here is the local tangent turn at the location of the bases. So, the function
approximation can be represented as a linear form. So, this a super-vector code of data
and those are unknown parameters to be estimated, right? So--I think here is the summary of
a--the three coding methods I mentioned. The first one is the popular Vector Quantization
Coding, second one is the Local Coordinate Coding, the third one is the Super-Vector
Coding. So, they all lead to kind of a high dimensional sparse, and they localized their
coding schemes. And they--by learning that from an unlabeled data, they all explore the
geometry structure of data, right? But the new coding methods are more suitable, kind
of they explore finer structure of data and they are all suitable for linear classifiers,
right? So, their implementations are pretty straightforward. So, I have a--how many minutes
do I have? >> Ten.
>> YU: Okay, good. Okay. So, in the next, I'm going to talk about--since we have this
nonlinear pooling, right? How do we do pooling right? I think pooling is a pretty much a--kind
of an unexplored area. I think there--people have been--especially in feature learning
community, people have intensively worked on how to learn nonlinear features from data.
But coding and stuff is pretty much like a--in a--kind of at HOG stage, right? I think there's
only one work done by [INDISTINCT] and the current last year [INDISTINCT] to analyze--[INDISTINCT]
the--so-called the max coding which is pretty much they've want message for pooling stay--stage,
right? But I think this deserve more higher level of interest for further investigation.
So, let's look at how do we represent the images by local features, okay? So, for each
image, we extract a bunch of local features at different locations, right? In that sense,
each image represented as a set of features, which is a feature set right? A feature set
is an empirical distribution, right? So, that means, each image is an empirical distribution
like this, right? So, one idea to perform image or classification, right? Is kind of
to embed the empirical distribution into a feature space, well, let's say a Hilbert space,
right? And in that space, we do linear classifier, right? This is a pretty much like the kernel
trick, okay? So, let's say this is the embedding of the distribution. I think why embedding--people
have started this problem. Let's say this work have done by Smola and his colleagues,
they consider embedding in such a very simple form, right? Simply average of the linear
mapping of the local features, right? So, this is pretty much, it is actually the average
pooling, right? There's some problem about this average pooling. I think fundamentally,
one issue is, in this embedding, the kind of norm of the embedding is not properly normalized.
This could have been unbonded. So, there seem to be a mathematical problem but it's actually
also an issue in practice. That means, in the image, if some features, some patterns--kind
of like noise or whatever, right? Repeated and occurring too much--too often, then this
component will dominate, you know, the representation of the image data, right? So, this is kind
of, you know, intuition level. Why? Because this is a simple average pooling is not good.
And empirically, this has been observed by many division researchers, right? Max pooling
is much better, way better than average pooling, okay? And then let's look at another way of
defining a kernel between distributions. This is so-called a Bhattacharyya Kernel, which
is a pretty simple form, right? This is an integration of this, like a--this a square
route of the two densities, right? You take the integration, right? This is a very well-performed--behaved
in the sense that the norm of the--of this embedding is pretty normalized, right? So,
it's too long, it's simply one, right? But one problem of this Bhattacharyya Kernel between
the two distributions is the following; here is the simple example, right? Suppose you
have two distributions, like two p(x), right? Somehow if you apply this kernel on--in this
case or another case, right? Pretty much you get a similar result because this kernel ignore
the structure of data, right? If the--these two p(x) are close or faraway, this kernel
is insensitive, right? Let's say if you have a [INDISTINCT], right? And have always so
far each image. In the extremely simple case, each image you end up always on one pin, right?
So if two image and the two beams of two images, actually they are very close in the feature
space, right? And the--this Bhattacharyya Kernel that distributes two images is totally
different, right? Somehow it ignores the distance, the structure of the data. So, there we introduce
a smoothing turn. So, here is the new kernel function we define between two distributions,
right? So, what do we have here is a linear kernel between on the features space, right?
It doesn't have the problem of the previous Bhattacharyya Kernel because of this smoothing
turn. Actually, it is incorporated as just small structure of data, right? The--however,
this definition requires to know the density which is impossible, right? So, there we introduced
a nano formulation which requiring only a rough estimation of the two densities which
is the p-theta and q-theta here. So, this formulation endures a new feature representation
which is at this formulation that will be--yes. So, this is a--if we apply this method into
an empirical distribution. So, what you get is the following. So, this the empirical distribution,
right? The embedding of this empirical distribution is average but it's weighted average of the
nonlinear [INDISTINCT] of the features, right? Here is the weights, right? The weight is
based on the estimation of the density, right? So, interesting in seeing here is--a nice
[INDISTINCT] actually this density estimator doesn't have to be very accurate, right? So,
as long as the ratio between the estimator of the density and the true density, right,
is bonded from above and below and if the estimate--estimator itself converged to a
limit, right? So mathematically, it's real--be real behaved, it will achieve desire to properties.
And we have done actually extensive experiments show that as this new measures perform much
better than average pooling and in most of the cases, it's like equally well or even
better than max pooling, right. So, let me tell you in which cases this is better than
max pooling. Actually max pooling implicitly assumes the--the coding is non-hectic, right,
so you can't afford picking up the most prominent response, right. But what about--I mean you
know, if we want to just use an average rate coding method, just want to explore the geometry,
structure of data, right. So the coding is--doesn't have to be non-hectic, right, in that case,
fast coding will fair, right. And in this case--this new pooling methods would be much
better, yes. So, here I will conclude my talk. So, we have developed a series of feature
learning methods, we are learning from unlabeled data and they achieve the state-of-the-art
performance on several benchmarks, right. Actually, I have a few demos if I have time
or maybe we can show them demo offline. I'm done basically. Okay, let me finish the conclusion--yes.
And actually this is the last second slide, I have one more slide. So, and we proposed
a new pooling methods--it's just very interesting, these radical properties. And also empirically
show the better results than average of pooling and to in--for sparse coding or local coding
type of measures, it's just the same level as max pooling but for fewer record type of
codings is way better than max pooling. There's a bunch of interesting directions I think
we would like to explore further. So one thing is definitely to the deep learning, you know,
so why don't we learn the whole thing, you know. Multiple layers of allowing your feature
transformation, you know, together, right. I think Angel has given a very excellent coverage
on that topic. It is something we are also working on. And the second is definitely interesting
is the fast faster measures, you know, for our future extraction, right. Because we have
to think about how do we handle, you know, extract it--once the learning is done--the
feature learning is done, how do we use to learn a model to extract the image features,
for example, in a level of like a 50 millisecond, right. So, it's still very trying to do. The
sort of thing I think is probably the most interesting to myself is learning powerful
features from the richer structure of data. For example, from the video, I think Angel
also mentioned from the temporal consistency assumption to learn, to explore the kind of,
you know, temporal consistency but also to learn kind of the environment features, for
example, out plain rotation of the objects, right. So, those are the papers that we published
over the last two years and thank you very much.