Uploaded by GoogleTechTalks on 15.04.2011

Transcript:

>>

Kai Yu sets up. Let me go ahead and introduce him. He is the head of the Media Analytics Department at NEC Laboratories America. Before joining NEC, he was a senior research scientist at Siemens Corporate Technology. His research interests are learning formulations, collaborative filtering, multitask learning, computer vision, image classification, deep learning and sparse coding. And I should note that he led the first-place team in the ImageNet Large Scale Visual Recognition Challenge. And with that, we look forward to hearing his talk.

>> YU: So, how is everyone? Today, I'm going to talk about image recognition using unsupervised feature learning. If you look at the title, you will notice that there's a big overlap with Andrew's talk, because we both believe--actually, a bunch of machine learning people really believe--that by learning from data, we can obtain very powerful features for image recognition. One part I don't want to talk about is deep learning, because I believe it's hard to be as deep as Andrew was. So let me focus on unsupervised feature learning. In the past two years, I worked on this with a large group of people: my colleagues from NEC Labs, and we also collaborated with professors like Tong Zhang, Thomas Huang and John Lafferty. And by the way, we also have some secret weapons, who are our interns. They are very smart and productive. So, what we did, I believe,

is pretty amazing. Here is one example. In 2009, there was the PASCAL Visual Object Classes Challenge. This is probably one of the most prestigious benchmarks in computer vision for object recognition. The task is to classify images into one or multiple of 20 categories. Very often one image can be multi-label, because it can contain several different categories of objects, right? So in 2009 the task was: given about 8,000 training images, classify a test set of 6,650 images. This is the result. We achieved the number-one position for 18 of the 20 categories. Here, for each category, we list our performance and the best performance from the other teams, and here is the difference. This was actually pretty encouraging for us, because we used learned features. We didn't combine all kinds of different features like appearance, shape and texture; we just used a HOG feature on gray images, which is a pretty standard and simple feature. What we really did is learn features from these simple features. This is a very encouraging result: it shows the potential of learning features from data.

Well, you might argue that we all know humans can recognize tens of thousands of objects, so the PASCAL challenge is pretty small-scale, and we should really challenge ourselves to push the boundary and recognize more and more types of objects. What happened last year is pretty amazing. Fei-Fei and her colleagues organized the ImageNet Large Scale Visual Recognition Challenge. In this task we have 1,000 categories of objects to recognize and 1.2 million images for training, compared to 20 categories and about 10,000 images in the PASCAL challenge, so it's a much bigger-scale problem. And we did really well. This shows the results of the different teams. In terms of classification accuracy, we achieved 52% for 1,000-class classification. From a machine learning point of view this is a pretty high classification accuracy, right? And we all know image classification is really, really challenging--much more difficult than text classification. I think Andrew has already given quite deep insight into why feature learning is useful, so I want to skip these slides. And in particular, I think

for image recognition, we also have strong evidence that biological visual systems actually learned their features over a very long time of evolution. I'm not an expert to talk about too many details of the biological system, but I just want to point out that simply the discovery of the visual processing architecture in the primary visual cortex V1--namely the complex cells and the simple cells, and their behaviors--actually inspired computer vision researchers to develop very powerful visual recognition systems. For example, this is the Convolutional Neural Network. Here's the input image; you apply some learned filters to convolve with the input data at different locations, and you get a response map. Then, to further handle local invariance, we do pooling--for example, average pooling, which pools local responses together to get a single response--and you get a smaller feature map. Then again, on top of that, you apply convolutional filters and pooling again. You can repeat this process several times and in the end feed into a classifier. This is basically the architecture of a Convolutional Neural Network. You can see that the convolutional filtering simulates the behavior of the so-called simple cells, and the pooling step simulates the processing of the complex cells. I think what's amazing here is that the Convolutional Neural Network actually learns the whole structure simultaneously. That means, unlike many other vision models, which first handcraft some image features and then put them into an SVM or [INDISTINCT] to get a final outcome, this model actually learns the whole thing from the pixel level, right?
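To make the convolution-plus-pooling loop just described concrete, here is a minimal NumPy sketch of one stage. The 16x16 image, the fixed 2x2 edge filter, and the pooling size are illustrative stand-ins; a real network learns its filters and stacks several such stages before the classifier.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide a filter over the image ('valid' mode), recording each response."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def average_pool(fmap, size=2):
    """Pool neighbouring responses into one (the 'complex cell' step)."""
    h, w = fmap.shape[0] // size * size, fmap.shape[1] // size * size
    f = fmap[:h, :w]
    return f.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

# One convolution + pooling stage; the filter here is fixed for illustration.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # toy "simple cell" filter
fmap = convolve2d_valid(image, edge_filter)          # 15x15 response map
pooled = average_pool(fmap, size=2)                  # 7x7, smaller feature map
```

Stacking more such stages, and feeding `pooled` into a classifier at the end, gives the architecture sketched on the slide.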

It achieved state-of-the-art performance on a number of different tasks--for example, digit recognition--and for [INDISTINCT] face recognition it is also one of the best models. It turns out this model was developed more than 20 years ago. More recently, a very popular model in the computer vision community, especially for scene recognition and object recognition, is the so-called bag-of-visual-words model. In particular, combined with spatial pyramid matching to handle the spatial structure, it achieves really state-of-the-art performance on a bunch of benchmarks--for example, Caltech 101 and PASCAL object recognition, and also the [INDISTINCT] evaluation. But if you look at the details of this model, it still follows pretty much the same architecture. For example, in the first layer you use local gradients applied everywhere across the image, and then you do pooling; in this way you obtain a dense SIFT feature or dense HOG feature. On top of that, you apply vector quantization coding and then average pooling; that's the way you get the [INDISTINCT], right?
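The coding-plus-pooling stage just described--hard vector quantization followed by average pooling--can be sketched as follows. The codebook here is random, standing in for one learned by k-means, and the 128-dimensional patch vectors stand in for dense SIFT/HOG descriptors.

```python
import numpy as np

def vq_code(x, codebook):
    """Hard-assignment VQ: a 1-of-K code with a 1 at the nearest codeword."""
    dists = np.linalg.norm(codebook - x, axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(dists)] = 1.0
    return code

def bow_histogram(patch_features, codebook):
    """Average-pool the VQ codes over all patches -> visual-word histogram."""
    codes = np.array([vq_code(x, codebook) for x in patch_features])
    return codes.mean(axis=0)

rng = np.random.default_rng(0)
codebook = rng.random((8, 128))   # stand-in for a K-means codebook (K = 8)
patches = rng.random((50, 128))   # stand-in for dense SIFT/HOG descriptors
hist = bow_histogram(patches, codebook)
```

The resulting `hist` is the per-image representation that then goes into the classifier.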

In the end, you apply a linear SVM. So you can see the structure is pretty similar, but the first two layers are handcrafted, right? The VQ coding layer involves some learning, because to learn the codebook you have to apply k-means to a large pool of unlabeled image patches. So that part actually is learned. I think our observation on this model is that there are several ways to start from here and improve it. First, the nonlinear SVM is not scalable. Even though it gives very impressive performance, we all know that in computer vision, if you really want to solve the problem, you have to use a lot of training data, and a nonlinear SVM probably cannot handle, for example, tens of thousands of training examples. Second, the VQ, vector quantization, coding might be too discrete--too coarse, not fine enough--because the underlying data structure is continuous, so probably we need a better coding method. The next one is average pooling: average pooling is probably suboptimal. Actually, many researchers observed this, and using something like [INDISTINCT] you can achieve much better performance, so we also need to ask: what is the principle for developing a better pooling method? The last point is: why don't we learn the whole thing together, instead of keeping the initial two layers fixed? These observations pretty much motivate our work. Our work pretty much says: don't try to do anything too fancy; this kind of architecture is the thing we want to stay with. Then, based on these observations, can we do something better? The first thing we did is replace the nonlinear SVM by a simple linear classifier. Then the question is how to train a linear classifier with millions, or hundreds of millions, or even billions of training examples, so we developed stochastic training to handle this problem. The second part is the coding: how to learn better coding methods from unlabeled data. The key question we want to ask here is: how do we know that learning from unlabeled data can actually help you with a supervised learning task? The third part is pooling--to develop better pooling methods. The last one is deep learning: learning the whole thing together--multiple layers, all the filters, together.

In this talk I will focus on only two parts. The first is the nonlinear coding we learn [INDISTINCT]. If I have some time, I will also talk a bit more about the pooling side. Here's the notation. This is an input image; we extract features from local patches everywhere, and what we get is a set of dense local features. In this case, each image is represented as a bag of feature vectors. If the feature is a SIFT feature, each is a 128-dimensional feature vector. Then we perform a nonlinear mapping to map the features into a new space--this is the nonlinear coding. Typically the new space has much higher dimension, and the resulting representation is not just high-dimensional but also highly sparse. Next, we perform linear pooling, which is a [INDISTINCT] average of the nonlinear codes, and in the end we apply a linear SVM or any linear classifier. If we look at this whole structure and at the final classifier, it is a linear model applied to the pooled codes. Because the pooling is a summation, you can move the linear operator inside the summation. Then this is a function on the images--the classifier at the image level--and this is a function on local patches. So in this framework, an interesting insight is that learning a classifier at the level of images is closely connected to learning some nonlinear function on local patches. Here's the very high-level idea about learning nonlinear functions on local patches. Suppose this is the distribution of the local patches, and here is an unknown nonlinear function on those patches. We don't know this nonlinear function; it is something we want to learn. The assumption here is that the image patches x follow a nonlinear manifold structure and the target function is smooth on this manifold. I think this is the key assumption that connects unsupervised learning to supervised learning. It means that by learning the intrinsic structure of the unlabeled data, we can get some information about the target function, and that information can actually help us learn the classifier. So the goal of coding is to learn a nonlinear mapping from the input data to a high-dimensional space--typically high-dimensional and sparse. The way to learn such a nonlinear function is then to put a linear function on top of this nonlinear mapping: because the mapping is nonlinear, the linear function is actually a nonlinear function of the original input. Is that clear? Yes. So here's one idea, which is called Local Coordinate Coding. The intuition is the following. Suppose this is the input data, and we don't know the nonlinear function; we want to figure out the function value at this particular location. What we do is: say we have a set of bases--by whatever method we obtained them--and then we find the local bases to linearly encode this data point; those are the linear coefficients.

We know this target function is nonlinear, but is it reasonable to assume it's locally linear? If it is locally linear, then within a local region the function value at this location can be represented as a linear combination of the function values at the activated bases, because those bases are also in the local neighborhood. So based on the locally linear assumption, you have this approximation. This is how we interpret the target function value--that's the high-level intuition. How do we use the coding method to learn the nonlinear function? The procedure is the following. First, we learn a dictionary from unlabeled data; those bases are the elements of the dictionary. The second step is to use the dictionary to encode the data points. For each data point to be encoded, we look at its neighborhood and activate the neighboring bases, and we get a set of coefficients. Typically these coefficients are highly sparse, because each data point activates only a small number of neighboring bases; most of the coefficients are zero, and only a small number are nonzero. This is somehow connected to the sparse coding idea. The next step is to estimate the parameters. Based on the locally linear assumption, we have this approximation, so the function value at each data point can be represented in a linear form. These are the linear coefficients from the coding, and those are a fixed set of parameters corresponding to the function values at the locations of the bases. It turns out that those parameters are global linear weights to be learned. This is pretty interesting, because it turns a nonlinear learning problem into simply estimating a linear function--estimating some linear parameters. So we can also say something about the function approximation performance.
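A minimal sketch of the encoding step just described, in the spirit of local coordinate coding: pick the few nearest bases, solve a small constrained least-squares problem, and emit a sparse, local code. The solver below follows the common LLC-style approximation with a sum-to-one constraint; the dictionary is random, standing in for one learned from unlabeled patches.

```python
import numpy as np

def lcc_encode(x, bases, k=3, reg=1e-4):
    """Encode x with its k nearest bases (locality), solving
    min ||x - sum_i c_i b_i||^2 subject to sum_i c_i = 1."""
    nn = np.argsort(np.linalg.norm(bases - x, axis=1))[:k]
    Z = bases[nn] - x                    # shift the nearest bases to the origin
    G = Z @ Z.T                          # local Gram matrix
    G += reg * np.trace(G) * np.eye(k)   # regularize for numerical stability
    c = np.linalg.solve(G, np.ones(k))
    c /= c.sum()                         # enforce the sum-to-one constraint
    code = np.zeros(len(bases))          # full code: high-dimensional, sparse
    code[nn] = c
    return code

rng = np.random.default_rng(0)
bases = rng.random((64, 2))              # stand-in for a learned dictionary
x = rng.random(2)
code = lcc_encode(x, bases, k=3)
recon = code @ bases                     # locally linear reconstruction of x
```

A linear function of `code` is then a locally linear (hence globally nonlinear) function of `x`, which is exactly the connection used above.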

This is the approximation to the function value at this location. It turns out that the approximation error is upper bounded by a kind of unsupervised learning objective: one term is the coding error--how closely you encode the data point--plus an additional term which is sort of a regularization on the coefficients, so it encourages sparsity; and it also weights a distance term, the distance between the encoded point and the bases. So if we minimize this upper bound, it will push down the function approximation error. And something good about the distance term is that it encourages local bases to be picked up. The key message here is that a good encoding scheme should have a small coding error and also be sufficiently local. Based on this principle, we can use a pretty simple algorithm. Let's say we use dictionary learning, like k-means or hierarchical k-means, to learn the dictionary. Then we encode the data points. The first thing is to ensure locality: we pick the neighboring bases, which is somehow similar to vector quantization--but in VQ you pick up only one basis, while here we pick up multiple bases. The second step is to ensure a low coding error: we solve this constrained minimization problem, and then we get the code. We applied this to benchmarks, for example Caltech-101 and Caltech-256. On Caltech-101 it is pretty much comparable with state-of-the-art methods using sparse coding on top of SIFT features, and on Caltech-256 it is much better--you can see the performance here.

The next method pretty much follows a similar idea of exploring the data geometry. It is called Super-Vector Coding. Again, say we want to figure out the function value at this location, the same as before. What we do is find the nearest basis. One idea is to use the function value at that basis location as the approximation of the target function value. If we do this for all the data points, what we get is something like a piecewise-constant approximation of the [INDISTINCT] smooth function. Actually, this pretty much explains how vector quantization works for the bag-of-words representation. Let's see if we can do something better. Again, we want to figure out the function value at this location, so let's find the nearest basis. Instead of doing a zeroth-order approximation, maybe we can explore higher-order structure--say, the first order, which is the local tangent. If we do this for all the data points, what we get is a very simple piecewise-linear, locally linear approximation. We can also look at how bad the function approximation error is: this error is bounded simply by the vector quantization error. That's very nice, because it means we can simply apply VQ to learn the bases, and then for a Lipschitz-smooth function built on top of VQ we can get a better performance. Here is the local tangent term at the location of the basis, and the function approximation can be represented in a linear form. This is the super-vector code of the data, and those are the unknown parameters to be estimated.

I think here is the summary of the three coding methods I mentioned. The first is the popular Vector Quantization Coding, the second is Local Coordinate Coding, and the third is Super-Vector Coding. They all lead to high-dimensional, sparse, and localized coding schemes. By learning from unlabeled data, they all explore the geometric structure of the data. But the new coding methods explore finer structure of the data, and they are all suitable for linear classifiers. Their implementations are pretty straightforward. So--how many minutes do I have?
>> Ten.
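A minimal sketch of the super-vector code just summarized, assuming a VQ (k-means-style) codebook: only the nearest codeword's block is active, and it carries a constant part plus the local tangent x - mu_k, so a linear function of such codes realizes a piecewise-linear approximation. The sizes and the scaling constant s are illustrative.

```python
import numpy as np

def super_vector_code(x, codebook, s=1.0):
    """Super-vector code of x: activate only the nearest codeword k and store
    a constant part plus the local tangent (x - mu_k). A linear function of
    this code is piecewise *linear* in x, not piecewise constant as in VQ."""
    K, d = codebook.shape
    k = np.argmin(np.linalg.norm(codebook - x, axis=1))
    code = np.zeros(K * (d + 1))
    block = slice(k * (d + 1), (k + 1) * (d + 1))
    code[block] = np.concatenate(([s], x - codebook[k]))
    return code

rng = np.random.default_rng(0)
codebook = rng.random((8, 2))    # stand-in for K-means (VQ) centers
x = rng.random(2)
code = super_vector_code(x, codebook)
```

Note the code is again high-dimensional, sparse and localized, like the other two schemes in the summary.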

>> YU: Okay, good. Next, I'm going to talk about pooling: now that we have this nonlinear coding, how do we do the pooling? I think pooling is pretty much an unexplored area. In the feature learning community, people have worked intensively on how to learn nonlinear features from data, but pooling is still pretty much at an ad hoc stage. I think there's only one work, done by [INDISTINCT] last year, to analyze the so-called max pooling, which is pretty much the dominant method for the pooling stage. But I think this deserves a much higher level of interest and further investigation. So, let's look at how we represent images by local features. For each image, we extract a bunch of local features at different locations, so each image is represented as a set of features--a feature set. A feature set is an empirical distribution, which means each image is an empirical distribution like this. One idea for performing image classification is to embed the empirical distribution into a feature space--say, a Hilbert space--and in that space apply a linear classifier. This is pretty much like the kernel trick. So let's say this is the embedding of the distribution. People have studied this embedding problem: in work done by Smola and his colleagues, they consider an embedding of a very simple form--simply the average of the mapped local features. This is pretty much average pooling. But there are problems with this average pooling. I think fundamentally, one issue is that the norm of this embedding is not properly normalized; it can be unbounded. That seems to be a mathematical problem, but it's also an issue in practice: if some feature or pattern in the image--noise, for example--is repeated too often, then that component will dominate the representation of the image. So this is the intuition-level reason why simple average pooling is not good.
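The dominance problem just described is easy to reproduce numerically. In this toy example, a noise pattern repeated across most patches swamps the average-pooled representation, while max pooling registers both patterns equally; the code matrix is made up for illustration.

```python
import numpy as np

# Codes for one image's patches: one informative pattern (dimension 0)
# appearing a few times, and a "noise" pattern (dimension 1) repeated often.
codes = np.zeros((100, 2))
codes[:5, 0] = 1.0    # 5 patches carry the informative feature
codes[5:, 1] = 1.0    # 95 patches carry the repeated noise pattern

avg_pooled = codes.mean(axis=0)   # noise dominates: [0.05, 0.95]
max_pooled = codes.max(axis=0)    # both patterns register equally: [1.0, 1.0]
```

This is the intuition behind why max pooling so often beats plain averaging in practice.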

And empirically, this has been observed by many vision researchers: max pooling is much better, way better, than average pooling. Then let's look at another way of defining a kernel between distributions, the so-called Bhattacharyya kernel, which has a pretty simple form: you take the square root of the product of the two densities and integrate it. This is very well behaved in the sense that the norm of the embedding is properly normalized--it's simply one. But one problem of this Bhattacharyya kernel between two distributions is the following. Here is a simple example: suppose you have two distributions, two p(x). Whether the two distributions are close together or far apart, you get pretty much the same result, because this kernel ignores the structure of the data--if the two densities are close or far away, this kernel is insensitive. Say you have a histogram for each image; in the extremely simple case, each image ends up entirely in one bin. Even if the two bins of two images are actually very close in the feature space, this Bhattacharyya kernel says the two images are totally different--it ignores the distance structure of the data. So we introduce a smoothing term. Here is the new kernel function we define between two distributions: what we have here is a linear kernel on the feature space. It doesn't have the problem of the previous Bhattacharyya kernel, because the smoothing term actually incorporates the distance structure of the data. However, this definition requires knowing the densities, which is impossible. So we introduce a new formulation which requires only a rough estimation of the two densities--the p-theta and q-theta here. This formulation induces a new feature representation, which is this formula here. If we apply this method to an empirical distribution, what you get is the following. The embedding of the empirical distribution is an average--but a weighted average--of the nonlinear mappings of the features. Here are the weights; each weight is based on the estimate of the density. An interesting and nice property is that this density estimator doesn't have to be very accurate: as long as the ratio between the estimated density and the true density is bounded from above and below, and the estimator itself converges to a limit, then mathematically it is well behaved and achieves the desired properties.
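One plausible reading of this weighted-average pooling, sketched with a rough kernel density estimate and a 1/sqrt(p) weighting in the spirit of the square-root-of-densities kernel above. The bandwidth, the clusters, and the exact weighting form are illustrative assumptions, not the talk's estimator.

```python
import numpy as np

def density_weighted_pool(codes, features, bandwidth=0.5):
    """Weighted-average pooling: each patch's code is down-weighted by a rough
    kernel density estimate of its feature, so patterns that recur very often
    cannot dominate the pooled representation."""
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    p_hat = np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)  # rough KDE
    w = 1.0 / np.sqrt(p_hat)      # square-root-of-density style weighting
    w /= w.sum()
    return w @ codes

rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 0.1, (95, 2)),   # dense, repeated pattern
                      rng.normal(5, 0.1, (5, 2))])   # rare, informative pattern
codes = np.zeros((100, 2))
codes[:95, 0] = 1.0
codes[95:, 1] = 1.0

plain_avg = codes.mean(axis=0)                 # rare pattern contributes ~0.05
weighted = density_weighted_pool(codes, features)
```

The rare pattern gets a noticeably larger share in `weighted` than in `plain_avg`, which is the qualitative behavior the smoothed kernel is after.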

We have done extensive experiments showing that this new method performs much better than average pooling, and in most cases it is equally good or even better than max pooling. Let me tell you in which cases it is better than max pooling. Max pooling implicitly assumes the coding is nonnegative, so you can afford to pick up the most prominent response. But if we want to use an arbitrary coding method, just to explore the geometric structure of the data, the coding doesn't have to be nonnegative; in that case max pooling will fail, and this new pooling method will be much better.

So, here I will conclude my talk. We have developed a series of feature learning methods that learn from unlabeled data, and they achieve state-of-the-art performance on several benchmarks. Actually, I have a few demos if I have time, or maybe we can show the demos offline. I'm basically done--okay, let me finish the conclusion; this is the second-to-last slide, I have one more slide. We also proposed a new pooling method with very interesting theoretical properties, and empirically showed better results than average pooling; for sparse coding or local coding types of methods it is at the same level as max pooling, but for other types of coding it is way better than max pooling. There's a bunch of interesting directions we would like to explore further. One is definitely deep learning: why don't we learn the whole thing--multiple layers of nonlinear feature transformation--together? I think Andrew has given excellent coverage of that topic; it is something we are also working on. The second interesting direction is faster methods for feature extraction. We have to think about, once the feature learning is done, how do we use the learned model to extract image features at the level of, say, 50 milliseconds? That is still very challenging. The third thing, probably the most interesting to me, is learning powerful features from richer structure in the data--for example, from video. I think Andrew also mentioned the temporal consistency assumption: to exploit temporal consistency but also to learn invariant features, for example to out-of-plane rotation of objects. So, those are the papers we published over the last two years, and thank you very much.

Kai Yu sets up. Let me go ahead and introduce him. He is the head of mediate Analytics Department

at NEC Laboratories America. Before joining NEC, he was a senior research scientist at

Siemens Corporate Technology. His research interests are learning formulations, collaborative

filtering, multitask learning, computer vision, image classification, deep learning and sparse

coding. And I should note that he lead the first place team in the ImageNet Large Scale

Visual Recognition challenge. And with that, look forward to hearing his talk.

>> YU: So, how is everyone? Today, I'm going to talk about Image Recognition using Unsupervised

Feature Learning. So, if you look at the title, you can notice that there's a big overlap

with Andrew's talk because we both believe--actually, a bunch of, you know, machine learning guys

really believe by learning from data, we can actually obtain very powerful features for

image recognition. There's one part actually I don't want to talk is deep learning because

I believe it's hard to be as deep as Andrew did, so. So let me focus on unsupervised feature

learning. So, in the past two years, I worked on this with a large group of people. Those

are my colleagues from NEC Labs and also we collaborated with professors like Tong Jiang,

Thomas Huang and John Lafferty. And by the way, we also have some secret weapons which

are--who are our interns. They are very smart and productive. So, what do we did, I believe,

is pretty amazing. So, this is one example. So, in 2009, this is a PASCAL Visual Object

Class Challenge. I think this is probably one of the most prestigious benchmark in computer

vision for object recognition. So, the task is to classify images to one or multiple of

the 20 categories. Right. So, typically, very often one image here can be multi label because

it can turn in several different category of objects, right? So, in 2009, the task is,

okay, given about 8,000 images we're training to classify a test to set of 6,650 images.

So, this is the result. And we achieved the number one positions for 18 of the categories

out of total 20 categories, right. Those here for each category will list our performance

and the best of performance from other teams. Here is the difference. Right. So, it's actually

pretty encouraging for us because we use the learning features, right? We don't, you know,

combine all kind of different features like appearance, shape, textual, those features.

We just use a HOG feature on gray images, right. So, it's pretty standard and simple

feature. But what we really did is learning features from these simple features. And this

is a very encouraging result. This shows the potential of learning features front data.

Well, you might argue that we all know humans can recognize tens of thousands of objects, and the PASCAL challenge is pretty small scale, so we should really challenge ourselves to push the boundary and recognize more and more types of objects. What happened last year is pretty amazing: Fei-Fei and her colleagues organized the ImageNet Large Scale Visual Recognition Challenge. In this task there are 1,000 categories of objects to recognize and 1.2 million images for training, compared to 20 categories and about 10,000 images in the PASCAL challenge, so it's a much bigger problem. And we did really well. This shows the results of the different teams; in terms of classification accuracy, we achieved 52% on the 1,000-class classification. From a machine learning point of view, this is a pretty high classification accuracy, and we all know image classification is really, really challenging, much more difficult than text classification.

I think Andrew has already given quite a deep insight into why feature learning is useful, so I want to skip those slides. In particular, for image recognition we also have strong evidence that biological visual systems learned their features over a very long time of evolution. I'm not an expert on the details of the biological system, but I just want to point out that simply the discovery of the visual processing architecture in the primary V1 layer, namely the complex cells and the simple cells, inspired computer vision researchers to develop very powerful visual recognition systems. For example, this is a Convolutional Neural Network. Here is the input image; you apply some learned filters to convolve with the input at different locations, and you get a response map. Then, to handle local invariance, you do pooling, for example average pooling, which pools local responses together into a single response, and you get a smaller feature map. On top of that, you apply convolutional filters again and pooling again. You can repeat this process several times and, in the end, feed the result into a classifier. That is basically the architecture of a Convolutional Neural Network. You can see that the convolutional filtering simulates the behavior of the so-called simple cells, and the pooling step simulates the processing of the complex cells. What's amazing here is that the Convolutional Neural Network learns the whole structure simultaneously. Unlike many other vision models, which first handcraft some image features and then feed them into an SVM or another classifier to get the final outcome, this model learns the whole thing from the pixel level.
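To make those two steps concrete, here is a minimal numpy sketch of one convolution-plus-average-pooling stage. The filter here is a hand-made illustration, not a learned one, and the loop-based convolution is for clarity, not speed:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide a filter over the image and record the response at each location
    (the 'simple cell' step)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def average_pool(response, size=2):
    """Average-pool non-overlapping size x size neighborhoods
    (the 'complex cell' step), yielding a smaller feature map."""
    H, W = response.shape
    H2, W2 = H // size, W // size
    r = response[:H2*size, :W2*size].reshape(H2, size, W2, size)
    return r.mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # hypothetical filter

fmap = convolve2d_valid(image, edge_filter)   # 7x7 response map
pooled = average_pool(fmap, size=2)           # 3x3 smaller map
print(fmap.shape, pooled.shape)
```

A real network stacks several such stages and learns the filters jointly with the final classifier.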

It achieved state-of-the-art performance on a number of different tasks, for example digit recognition, and it is also one of the best models for face recognition. It turns out this model was developed more than 20 years ago. More recently, a very popular approach in the computer vision community, especially for scene recognition and object recognition, is the so-called bag-of-visual-words model, in particular combined with spatial pyramid matching to handle the spatial structure. It achieves state-of-the-art performance on a bunch of benchmarks, for example Caltech-101, PASCAL object recognition, and also the [INDISTINCT] evaluation. But if you look at the details of this model, it still follows pretty much the same architecture. In the first layer, you compute local gradients everywhere across the image and then do a kind of pooling; this way you obtain dense SIFT features or dense HOG features. On top of that, you apply vector quantization coding and then average pooling. That's how you get the histogram representation, right?
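As a rough sketch of this coding-and-pooling pipeline: assume the dictionary has already been learned (for example by k-means over unlabeled patches), and let random vectors stand in for the dense SIFT descriptors of one image:

```python
import numpy as np

def vq_code(x, dictionary):
    """Vector quantization coding: a one-hot code activating the nearest codeword."""
    dists = np.linalg.norm(dictionary - x, axis=1)
    code = np.zeros(len(dictionary))
    code[np.argmin(dists)] = 1.0
    return code

def bow_histogram(local_features, dictionary):
    """Average-pool the one-hot codes of all patches into a normalized
    codeword histogram: the bag-of-visual-words image representation."""
    codes = np.stack([vq_code(x, dictionary) for x in local_features])
    return codes.mean(axis=0)

rng = np.random.default_rng(0)
dictionary = rng.standard_normal((1024, 128))   # e.g. 1024 codewords over 128-d SIFT
features = rng.standard_normal((200, 128))      # stand-in for one image's descriptors
h = bow_histogram(features, dictionary)         # fed to the classifier at the end
print(h.shape)
```

Spatial pyramid matching repeats this pooling over sub-regions of the image and concatenates the histograms.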

In the end, you apply a linear SVM. So you can see the structure is pretty similar, but the first two layers are handcrafted, right? The VQ coding layer involves some learning, because to learn the dictionary you have to apply k-means to a large pool of unlabeled image patches, so that part actually is learned. Our observation on this model is that there are several places to start improving it. First, the nonlinear SVM is not scalable. Even though it gives very impressive performance, we all know that in computer vision, if you really want to solve the problem, you have to use a lot of training data, and a nonlinear SVM is probably intractable for, say, hundreds of thousands of training examples. Second, vector quantization coding might be too discrete, too coarse, not fine enough, because the underlying data structure is continuous, right?

So, probably we need a better coding method. Then, the next one is average pooling, right?
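To illustrate on toy numbers the issue with average pooling: a rare but informative activation gets diluted by frequent ones, while taking the maximum preserves it. The numbers below are purely illustrative:

```python
import numpy as np

# Codes for 5 patches over a 4-codeword dictionary; the last codeword is a
# rare but salient pattern, activated strongly in only one patch.
codes = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.3, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

avg = codes.mean(axis=0)  # rare activation is diluted by the frequent one
mx = codes.max(axis=0)    # rare activation is preserved at full strength
print(avg, mx)
```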

Average pooling is probably suboptimal. Many researchers have observed this: using something like max pooling, you can achieve much better performance. So we also need to ask what the principle is for developing a better pooling method. The last observation is: why don't we learn the whole thing together, instead of keeping the initial two layers fixed? These observations pretty much motivate our work. Our philosophy is: don't try to do anything too fancy; this kind of architecture is the thing we want to stay with. Based on these observations, can we do something better? The first thing we did is replace the nonlinear SVM with a simple linear classifier. Then the question is how to train a linear classifier with millions, or hundreds of millions, or even billions of training examples, so we developed stochastic training to handle this problem. The second part is the coding: how to learn better coding methods from unlabeled data. The key question we want to ask here is: how do we know that learning from unlabeled data can actually help a supervised learning task? The third part is pooling, to develop better pooling methods. The last one is deep learning, which means learning the whole thing together: multiple layers, all the filters together.

In this talk I will focus on only two parts. The first is the nonlinear coding learned from unlabeled data, and if I have some time I will talk a bit more about the pooling side. So here is the notation. This is an input image; we extract features from local patches everywhere, and what we get is a set of dense local features.

So in this case, each image is represented as a bag of feature vectors. If the local feature is a SIFT feature, each one is a 128-dimensional feature vector. Then we perform a nonlinear mapping to map the features into a new space; this is the nonlinear coding. Typically the new space has a much higher dimension, and what we end up with is a representation that is not just high dimensional but also highly sparse. Next, we perform linear pooling, which is an average of the nonlinear codes, and in the end we apply a linear SVM or any linear classifier. If we look at this whole structure and look at the classifier, it is a linear model applied to the pooled codes. Because the pooling is a summation, you can move the linear operator inside the summation. So this is a function on images, a classifier at the image level, and this is a function on local patches. In this framework, an interesting insight is that learning a classifier at the level of images is closely connected to learning some nonlinear function on local patches.

Here is a very high-level idea about learning nonlinear functions on local patches. Suppose this is the distribution of the local patches, and here is an unknown nonlinear function on local patches. We don't know this function; it is what we want to learn. The assumption here is that the image patches x follow a nonlinear manifold structure and the target function is smooth on this manifold. I think this is the key assumption that connects unsupervised learning to supervised learning: by learning this structure of the unlabeled data, we can get some information about the target function, and that information can help us learn the classifier. The goal of the coding is to learn a nonlinear mapping from the input data to a high-dimensional space; typically the codes are high dimensional and sparse. The way to learn such a nonlinear function is to put a linear function on top of this nonlinear mapping; because the mapping is nonlinear, the linear function is actually a nonlinear function of the original input. Is that clear? Yes.

So here is one idea, which we call Local Coordinate Coding. The intuition is the following. Suppose this is the input data, and we don't know the nonlinear function; we want to figure out its value at this particular location. What we do is: say we have a set of bases, obtained by whatever method, and then we find the local bases to linearly encode the data point; those are the linear coefficients.
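A minimal sketch of this local encoding step, assuming we activate the k nearest bases and fit sum-to-one coefficients with a small regularized least-squares; the exact formulation and regularization in the published method differ in details:

```python
import numpy as np

def lcc_encode(x, dictionary, k=5, reg=1e-4):
    """Encode x with its k nearest bases: solve a tiny constrained least-squares
    min ||x - sum_i g_i b_i||^2  s.t.  sum_i g_i = 1,
    with all other coefficients fixed at zero."""
    dists = np.linalg.norm(dictionary - x, axis=1)
    idx = np.argsort(dists)[:k]            # activate only the neighboring bases
    B = dictionary[idx]                    # k x d local bases
    z = B - x                              # shift bases to the data point
    G = z @ z.T
    C = G + reg * np.trace(G) * np.eye(k)  # regularized local Gram matrix
    g = np.linalg.solve(C, np.ones(k))
    g /= g.sum()                           # enforce the sum-to-one constraint
    code = np.zeros(len(dictionary))       # high-dimensional, highly sparse code
    code[idx] = g
    return code

rng = np.random.default_rng(0)
D = rng.standard_normal((512, 128))        # dictionary, e.g. from k-means
x = rng.standard_normal(128)               # one local descriptor
print(np.count_nonzero(lcc_encode(x, D, k=5)))
```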

We know this target function is nonlinear, but is it reasonable to assume it is locally linear? If it is, then within a local region the function value at this location can be represented as a linear combination of the function values on the activated bases, because those bases are also in the local neighborhood. Based on the local linearity assumption, you get this approximation. This is how we interpret the target function value; that's the high-level intuition. So how do we use this coding method to learn the nonlinear function?

The procedure is the following. First, we learn a dictionary from unlabeled data; those bases are the elements of the dictionary. The second step is to use the dictionary to encode the data points: for each data point to be encoded, we find its neighborhood and activate the neighboring bases, which gives a set of coefficients. Typically these coefficients are highly sparse, because each data point activates only a small number of neighboring bases; most of the coefficients are zero, and only a small number are nonzero. This is somehow connected to the sparse coding idea. The next step is to estimate the parameters. Based on the local linearity assumption, the function value at each test point can be represented in a linear form: these are the linear coefficients from the coding, and those are a fixed set of parameters corresponding to the function values at the locations of the bases. It turns out those parameters are global linear weights to be learned. This is pretty interesting, because it turns a nonlinear learning problem into simply estimating a linear function, estimating some linear parameters. And we can say something about the function approximation performance.
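In symbols, the bound at play is roughly of this form (following the local coordinate coding analysis; the constants and exact form are hedged here). For a suitably smooth f and a code with coefficients summing to one,

```latex
\left| f(x) - \sum_{v \in B} \gamma_v(x)\, f(v) \right|
\;\le\;
\alpha \,\bigl\| x - \gamma(x) \bigr\|
\;+\;
\beta \sum_{v \in B} |\gamma_v(x)|\, \bigl\| v - \gamma(x) \bigr\|^{2},
\qquad
\gamma(x) = \sum_{v \in B} \gamma_v(x)\, v,
```

where the first term is the coding error and the second penalizes coefficients on bases far from the encoding, encouraging sparse, local codes.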

So this is the approximation to the function value at this location. It turns out the approximation error is upper bounded by a kind of unsupervised learning objective: the coding error, which measures how well you encode the data point, plus an additional term which is a sort of regularization on the coefficients, so it encourages sparsity, weighted by a distance term, the distance between the encoding of the point and the bases. If we minimize this upper bound, we push down the function approximation error, and the nice thing about the second term is that it encourages local bases to be picked up. So the key message here is that a good encoding scheme should have a small coding error and also be sufficiently local.

Based on this principle, we can use a pretty simple algorithm. We use dictionary learning, like k-means or hierarchical k-means, to learn the dictionary. Then we encode the data points. First, we ensure locality: we pick up the neighboring bases. That is somewhat similar to vector quantization, but in VQ you pick up only one basis, while here we pick up multiple bases. Second, we ensure a low coding error: we solve this constrained minimization problem, and then we get the code. We applied this to benchmarks, for example Caltech-101 and Caltech-256. On Caltech-101, it is pretty much comparable with the state-of-the-art methods using sparse coding on top of SIFT features, and on Caltech-256 it is much better; you can see the performance here.

The next method pretty much follows a similar idea of exploring the data geometry; it is called Super-Vector Coding. Again, say we want to figure out the function value at this location, the same as before. What we do is find the one nearest basis. One idea is to use the function value at that basis as the approximation of the target function value; if we do this for all the data points, what we get is a piecewise-constant approximation of the smooth function. This pretty much explains how vector quantization works for the bag-of-words representation. Can we do something better? Again, we want to figure out the function value at this location, so we pick up the nearest basis. Instead of a zeroth-order approximation, maybe we can explore higher-order structure, say the first order, which is the local tangent. If we do this for all the data points, what we get is a simple piecewise-linear, locally linear approximation. We can also look at how bad the function approximation error is: this error is bounded by simply the vector quantization error. That is very nice, because it means we can simply apply VQ to learn the bases, add the first-order term on top of VQ, and get better performance. Here is the local tangent term at the location of the basis, so the function approximation can again be represented in a linear form: this is the super-vector code of the data, and those are the unknown parameters to be estimated.

Here is a summary of the three coding methods I mentioned: the first is the popular Vector Quantization Coding, the second is Local Coordinate Coding, and the third is Super-Vector Coding. They all lead to high-dimensional, sparse, and localized coding schemes, and by learning from unlabeled data, they all explore the geometry structure of the data. But the newer coding methods explore finer structure of the data and are more suitable for linear classifiers, and their implementations are pretty straightforward. So, how many minutes do I have?

>> Ten.

>> YU: Okay, good. Next, since we have this nonlinear coding, how do we do pooling? I think pooling is a pretty much unexplored area. In the feature learning community, people have worked intensively on how to learn nonlinear features from data, but the pooling side is still pretty much at an ad hoc stage. I think there is only one work, done by [INDISTINCT] last year, that analyzed the so-called max pooling, which is pretty much the dominant method for the pooling stage. But I think this deserves a much higher level of interest and further investigation.

So, let's look at how we represent images by local features. For each image, we extract a bunch of local features at different locations, so each image is represented as a set of features. A feature set is an empirical distribution; that means each image is an empirical distribution like this. So one idea for image classification is to embed the empirical distribution into a feature space, say a Hilbert space, and in that space apply a linear classifier. This is pretty much like the kernel trick. Say this is the embedding of the distribution. People have studied this embedding problem; for example, in work done by Smola and his colleagues, they consider an embedding in a very simple form: simply the average of a feature mapping of the local features. This is pretty much average pooling. There are problems with this average pooling. Fundamentally, one issue is that the norm of the embedding is not properly normalized.

It could be unbounded. This seems like a mathematical problem, but it is also an issue in practice: in an image, if some features or patterns, like noise or whatever, repeat and occur too often, that component will dominate the representation of the image. So at an intuitive level, this is why simple average pooling is not good, and empirically this has been observed by many vision researchers: max pooling is much better, way better, than average pooling.

Then let's look at another way of defining a kernel between distributions, the so-called Bhattacharyya kernel, which has a pretty simple form: it is the integral of the square root of the product of the two densities. It is well behaved in the sense that the norm of the embedding is properly normalized; it is simply one. But one problem with this Bhattacharyya kernel between two distributions is the following. Here is a simple example: suppose you have two distributions, two p(x). Whether the two densities are close or far away, you pretty much get a similar result, because this kernel ignores the structure of the data; it is insensitive to the distance between them. Say you have a histogram for each image, and in an extremely simple case each image always ends up in one bin. Even if the two bins of two images are actually very close in the feature space, the Bhattacharyya kernel still treats the two images as totally different. It ignores the distance structure of the data.

So we introduce a smoothing term. Here is the new kernel function we define between two distributions: what we have is a linear kernel on the feature space. It does not have the problem of the previous Bhattacharyya kernel, because the smoothing term incorporates the distance structure of the data. However, this definition requires knowing the densities, which is impossible. So we introduce a new formulation which requires only a rough estimation of the two densities, the p-theta and q-theta here. This formulation induces a new feature representation, which is this one here. If we apply this method to an empirical distribution, what we get is the following: the embedding of the empirical distribution is an average, but a weighted average, of the nonlinear mapping of the features, and here are the weights; each weight is based on the estimate of the density. An interesting and nice property is that this density estimator does not have to be very accurate: as long as the ratio between the estimated density and the true density is bounded from above and below, and the estimator itself converges to a limit, then mathematically it is well behaved and achieves the desired properties.
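A minimal sketch of this kind of density-weighted pooling, assuming a crude k-nearest-neighbor density estimate and weights proportional to the inverse square root of the density (the Bhattacharyya-motivated choice); the exact estimator and weighting in the actual method may differ:

```python
import numpy as np

def knn_density(features, k=5):
    """Very rough density estimate: inverse of the k-th nearest-neighbor distance."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    kth = np.sort(d, axis=1)[:, k]          # column 0 is the self-distance (zero)
    return 1.0 / (kth + 1e-8)

def density_weighted_pool(codes, features, k=5):
    """Weighted average pooling: frequent patterns (high density) are
    down-weighted, w_i ~ p_hat(x_i)^(-1/2), so they cannot dominate
    the image representation the way they do under plain averaging."""
    p_hat = knn_density(features, k)
    w = p_hat ** -0.5
    w /= w.sum()
    return (w[:, None] * codes).sum(axis=0)
```

With uniform weights this reduces to average pooling; the density weighting is what tames over-represented patterns.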

We have done extensive experiments showing that this new measure performs much better than average pooling, and in most cases it is equally good or even better than max pooling. Let me tell you in which cases it is better than max pooling. Max pooling implicitly assumes the coding is non-negative, so you can afford to pick up the most prominent response. But if we want to use an arbitrary coding method that just explores the geometry structure of the data, the coding does not have to be non-negative; in that case max pooling will fail, and this new pooling method is much better.

So here I will conclude my talk. We have developed a series of feature learning methods that learn from unlabeled data, and they achieve state-of-the-art performance on several benchmarks. I have a few demos if I have time, or maybe we can show the demos offline. I'm basically done; let me finish the conclusion.

Actually, this is the second-to-last slide; I have one more. We proposed a new pooling method with very interesting theoretical properties, and we also empirically showed better results than average pooling; for sparse coding or local coding types of methods it is at the same level as max pooling, but for other types of coding it is way better than max pooling.

There is a bunch of interesting directions we would like to explore further. One is definitely deep learning: why don't we learn the whole thing, multiple layers of nonlinear feature transformation, together? I think Andrew has given excellent coverage of that topic; it is something we are also working on. The second interesting direction is faster methods for feature extraction: once the feature learning is done, how do we use the learned model to extract image features at the level of, say, 50 milliseconds? That is still very challenging. The third thing, probably the most interesting to me, is learning powerful features from richer structure in the data, for example from video. I think Andrew also mentioned the temporal consistency assumption: the idea is to exploit temporal consistency and also to learn invariant features, for example to out-of-plane rotation of objects. Those are the papers we published over the last two years. Thank you very much.