Assistive Vision Technology for the Blind: Recognizing...

Uploaded by GoogleTechTalks on 08.10.2007

Okay, this project is nicknamed GroZi --
that's spelt G-R-O-Z-I -- and I'll explain where that nickname
comes from later. This
is a collaboration with many different people
both at Calit2,
UC San Diego, and also University of Kentucky and
I'll describe the different parts of that collaboration shortly. But
the focus of the project is
to develop a handheld device, which we call the Mozi box, and
that device
will help blind and visually impaired people locate
objects in a grocery store. For
example, to prepare
shopping list and take that to a grocery store and use it to find
products on the shelf.
And just as a side comment, this is an
interesting and different project for my group for a couple reasons. One
is, because it actually involves a hardware device, which is something I rarely do.
Second, the lead
of the research here, really the lead researcher on this, is an undergraduate.
So this represents a big push forward in my group
to get undergraduates more involved in research.
Another person working on this project is Carolina, who
is here
interning at Google.
So, a little bit about the motivations of the project.
The idea
here is to increase the independence of people with low vision, or who are blind, in doing their grocery shopping in a supermarket or store.
Certainly there are other solutions to this. A blind person could just
order groceries by phone or online and have the groceries delivered,
but in some cases the blind person might actually want to go to the grocery store and have the independence to
locate the groceries themselves.
And --
so, the
idea here is we would like ideally in the scope of the project, we would like to cover
everything from
the home
of the user along
the walking path to the store and then navigating inside the store to find the products and back, and
that would include also the process of payment at the checkout counter.
So, the market so to speak is the
1.3 million or so legally blind people in the
US. But we also think this could be useful to the much larger group of visually impaired people that could use some help locating
products in what can be a visually overwhelming stimulus when you look at a grocery aisle shelf. There's really a
lot of things there, and if you are visually impaired, that can still be a challenging problem. Now
there are two ways of looking at the current state of the art. One is from a cold marketing perspective.
Grocery store managers think of blind people as high-cost customers, okay.
In other words they're a pain.
The people that have
to help the blind customers would otherwise
be stocking shelves and
packing boxes and so forth. So, if a blind customer comes in and needs help,
that takes that worker away
from his other task.
But from another point of view,
these grocery stores are underselling to the blind
customers. So there is demand from blind customers to
purchase these products,
and if there were technology or just some kind of solution more generally to help them in
this setting, then grocery stores could actually sell more to
that market.
More broadly speaking, the type of device that we're making
will have applications outside the grocery store. For example,
one of the applications we want to move into is an airport terminal. But
-- so,
from a pure computer vision standpoint, you can think of this
presentation as being about a mobile device that
runs computer vision algorithms and that can do object recognition. But
we've chosen this specific
problem setting of a grocery store
as a way of getting started in
this problem of navigation for the visually impaired.
Okay, let me show an
overview of
this system. This is a slide that Carolina actually put together for
us. So there are three stages of
this project. There's the
part that takes place at home
where we see the user there in front of
the computer with
a visually-impaired-accessible website, and
the device itself, the so-called Mozi box, is depicted there on top of the CPU sitting there. It
is connected to the computer by a USB port, and
at this stage at home
the user finds the products
online, for example, using one
of these
online grocery stores and
maybe chooses twenty or so products to
put on to the shopping list and
then a couple of things get
uploaded to that device.
The walking path to the
store, and then an assortment of training images of the different products, because what we would like to do is
spot those products on the
shelf. So,
training images of all the objects in the shopping list go onto the Mozi
box. Then comes the outdoor part, which is something our
collaborators at the University of Kentucky are working on, which consists of
navigating crosswalks,
visual landmarks, visual way points, and so
forth. And this --
so, the example
here is the Starbucks coffee logo;
you could in principle have a GPS system that just indicates
where you are located. The
problem is that GPS has
blind spots: it can be affected by the architecture in the vicinity, and it
may not be particularly accurate or you may not have GPS available at all. But
to have the feedback
from visual waypoints can help you know that you are moving in the right
direction. So it is an extra affirmation that you are along the correct walking path. So
that part is being done by the University of Kentucky. That is David Nistér,
Henrik Stewénius, and Melody Carswell.
The part that we're focusing on is inside the store.
And broadly speaking, the problems inside the store include
locating the
aisle signs. So, detecting the aisle signs and reading them and
then once you are inside an aisle that you think contains the products you want,
spotting the
products on the shelf
and providing haptic or tactile feedback through the box to direct your hand to
that product on the shelf.
The Mozi box itself
is shown here. The size, it is about the
size of
a typical,
maybe if you took two
flip cell phones and stacked them up,
it is about
that size. So you can hold it in
your hand. It has two orthogonal servos on it with little plastic tabs
and those plastic tabs provide directional feedback and
they are inexpensive; all the hardware in the Mozi box, in
quantities of one, is about $300 or so. The
servos themselves are the type of servos you'd find in a remote controlled car. The
Mozi box itself
is actually the marriage of two different projects. There was one project called the
Zig Zag and that was a
box developed by John Miller at Calit2, and
it had no
camera on it, just a servo and
the idea there was that a
so-called remote sighted guide: a sighted person would be sitting somewhere.
For example, imagine the person is sitting in the
bleachers at
an athletic field and
then you have a blind or blindfolded person on the track
holding this box with the servo,
and then the remote-sited guide can set the servo angle to direct the person around the track. One
of the motivations for that besides helping
visually impaired people was actually helping
first responders at a disaster site
where the room might be filled with smoke and they are able to get directional feedback from an external source.
So the ZigZag contributed the servo part for the haptic feedback, and
then there was another project called MoVS, the Mobile Vision System, and
the idea there was to take the Intel OpenCV library
and get it working on a low power,
low cost mobile platform. So,
we put the two together and we
got Mozi, and
GroZi just became the nickname for the application of Mozi to the grocery store. So, it
does not actually stand for anything anymore, but that
is the big
thing -- [Laughter]. So
this is under
development in parallel. So, I won't say too much about it
except that it
will very shortly have the functionality to run all the algorithms that I am showing, because everything we are writing runs in OpenCV. For now, what I am showing are simulations that run on a regular PC,
and there will be some challenges in optimizing the code because we can only get up to about 400 MHz on this thing.
So, that
is just a quick look at the Mozi box.
Now let me jump into some of
the different problems
on the computer vision side of things, because
this --
the broad picture of the GroZi project, as I showed in those different panels, encompasses the website design, the blind-accessible technology for finding the groceries, and
the haptic feedback. So
there are many different issues besides the actual computer vision part. Because
at its heart, the computer vision part is
really using object recognition, pattern recognition things, but
that alone would not make a very useful product at
all. So,
Melody Carswell at University of Kentucky is really critical in this, and
so is John Miller at Calit2 to make sure that
we have volunteers from NFB involved and lots of
people in the community that are testing this with us, and
actually telling us what's stupid and what works well and so forth. So there is really a lot of back and forth here
between the
computer vision people and then actual members of the blind community that are giving us feedback. What
I am focusing on in this talk is really the
computer vision part and then maybe at
the end we can talk more about
these other issues that affect
the user interface.
The first problem here -- well,
what I am showing here is kind of a mix of
examples of processed data from real supermarkets,
which we just recently got permission to go into
legally and
then also
experiments from a convenience store on campus. As
you might know, grocery stores are not thrilled about having
people come in and take pictures and video because there is a lot
of intellectual property in a way
present inside that grocery store just in terms of where the products are
located on the shelf, and which products they are choosing to sell.
We finally did find a manager of a supermarket near campus who is happy to help us. So now
we are able to go in and take as many pictures as we want as
long as we call in advance. These pictures were taken before we had that permission, so we just snuck in
and took pictures and then ran out -- [Laughter]. So,
the idea here, what's shown in the picture here is
a sample photograph taken looking down an aisle, and there is an aisle sign. It happens to be aisle
21 in a
Safeway Supermarket, and
it has
pasta sauce, import pasta, canned meat, rice cakes.
And --
so, one of the reasons, you know, I mentioned that this Mozi box has limited
processing power. The reason that the processing power is important at all
is because one
of the biggest challenges here in making this useful is just
getting the device pointed in the right direction to
begin with.
So, we need this thing,
when it is set to read aisle signs, to run all the time,
because a blind person might walk into the aisle thinking they are pointing in the
right direction when they are actually pointing
straight down the aisle instead of up enough to take in the aisle sign.
This thing needs to be interactive enough to
keep them updated about what is in
the frame. So what we are shooting for is about six frames per second, and
we have a text-to-speech
module on this box that
just speaks whenever it detects these different words.
So, I will
just say a little bit about how the text detection and recognition works. We
are using a technique which was originally developed
at the Smith-Kettlewell Institute and UCLA by Alan Yuille
and Xiangrong Chen.
This is the Chen and Yuille CVPR '04 paper, and
that in turn was based on a face detection
approach by Viola and Jones. And
so they adapted that to work with text, and we adapted that to our
problem. It is based on what is called an AdaBoost cascade on Haar wavelets,
and the way this works is you start with this huge training phase where you take a whole bunch of photographs
of scenes that contain aisle signs,
and scenes that don't.
You take all that data and you get a human to label it, to draw rectangles around the boxes that contain
aisle sign text. So
far this doesn't have anything to do with what the text is saying. It
is just the process of detecting things that could be words on an aisle sign. So
you get all that training data.
Then you come up with
features that
in practice are convolution kernels, and for efficiency these convolution kernels are just Haar wavelets, that is, differences of boxes that can be computed extremely fast at different
scales. And these little differences of boxes are able to describe the textural features of text. So,
what you do
is you use this learning algorithm called AdaBoost in an efficient
implementation called a cascade. You
train this thing for a couple of hours
and then the result is that you just put in a photograph of a scene in a grocery store that may or may not contain
aisle signs and
the cascade processes it
in a very efficient fashion and pops out rectangles that could contain
aisle sign text. As
you can see, it is detecting the words on the aisle sign. It
is not detecting the large
number above the aisle signs that says 21, because we didn't train it for that type of
text. You will also notice in the background there are signs that say yogurt and butter, and
there is a wavy line
passing through them,
and this is kind of bizarre, because some of you might know about visual CAPTCHAs. Have
you heard
of these? So a
visual CAPTCHA is something --
Yahoo uses these; I think Google might also. Is that
right? So
these are designed to be difficult for a computer to read and easy for a human
to read.
It is really a case where
something is purposely made difficult to read
to throw off
algorithms that would detect
it. And it
seemed bizarre that a grocery store would put something on the wall that is effectively a visual CAPTCHA. So we,
perhaps it was
there to throw off robots that
are trying to spy on the store; we do not know what the
motivation was. But we do know that two weeks later, when we came back, those signs, the wavy lines were gone. Really. We did not say anything, but for some reason they took them
away. So,
this is an example of the aisle sign detection. And so, once
we detect the text, we do some adaptive thresholding and then feed it through
optical character recognition
and, as any of you who have used OCR know, these
algorithms are difficult enough to use on regular text, much less text in an unconstrained environment. So
in general, it is very difficult to take
these thresholded letters and
simply produce
the ASCII string that
represents these words. But we have a big benefit here in
the sense that our universe,
our lexicon of words, is highly restricted to the case
of a grocery store.
In a way, you can think of it as
low-level OCR with a
spell-check that is highly biased towards grocery
food aisle signs.
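A minimal sketch of what that biased spell-check might look like: snap each noisy OCR string to the nearest word in the restricted lexicon by edit distance. The toy word list below is an illustrative stand-in, not the project's actual lexicon.

```python
def edit_distance(a, b):
    # classic Levenshtein dynamic program over two strings
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def correct(ocr_word, lexicon):
    # snap a noisy OCR string to the closest aisle-sign word
    return min(lexicon, key=lambda w: edit_distance(ocr_word.lower(), w))

# hypothetical aisle-sign lexicon for illustration only
AISLE_WORDS = ["pasta", "sauce", "rice", "cakes", "canned", "meat"]
```

With a small, biased lexicon like this, even badly thresholded OCR output such as "pastq" still lands on "pasta".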
So, in
its raw implementation it is not
terrifically usable right now, but this
thing runs. On the PC
it is running at about 10 frames per second and
every time it detects one of these rectangles, it binarizes it, runs it through OCR, does spell-check and
then this little prerecorded wave file just says the word.
So it is actually kind of obnoxious right now, but we
are working on getting a better interface so
you can set the Mozi box to read aisle signs
and just start pointing it at things and the speaker just
talks and says what it has
seen. This concept
goes back to Smith-Kettlewell's so-called Talking Signs concept,
which I think is maybe 10 years old. That was based on a
line-of-sight infrared system
for street signs in San Francisco, so you could spot the sign through an infrared beacon, and
then you'd carry a box that says the name of the street. So,
we are trying to do that type of thing
for food aisle signs just using computer vision.
So this is a part of the project that is actually being done in grocery stores now with permission,
but the
recognition of
the products
themselves is actually taking place in a small convenience store on campus,
which is called The Sunshine Store.
The basic reason we are doing that is because grocery stores, at this stage of the project, are just way too big.
A typical grocery store has about 25,000 square feet,
around 30,000 different products,
and we do intend to do that at some point, and we have collaborators at
Evolution Robotics, which is in Pasadena, and
they are offering to lend us a robot to use
to help collect training data and do the testing, and
so forth. At this stage, we have this great opportunity to use this
small convenience store for
beta testing. It
has less than 2,000 square feet. You can
think of it as a miniature grocery store. It
has got about 4000 items in stock, it
has the drawback of not having
a bakery or produce, but
in all other respects it represents a miniature version
of this supermarket problem.
And the nice thing is that it has regularly scheduled maintenance hours. So, we can just go there a couple of times per month for
a couple of hours at a time with students and
cameras, digital video cameras, and so forth, and collect all the data we
want. So this is where our pilot testing is taking place.
And specifically that
is where we are getting the testing data. So
all the testing data is being captured
with video cameras, and
the target resolution for this Mozi box is a 1.3 Megapixel camera.
So that
is better than VGA. So
all the testing data is being captured like that.
The training data on the other
hand comes
from the web.
So the idea here is that, --
so take the case
of this convenience store. They
sell 4000 items.
The manager of the convenience store
gave us the
inventory, a
printout of the store inventory.
So there is 4000 items.
He gave us a UPC
or barcode
for all the products plus a textual description of what the product is. So
all the things that could be in the store are included in
that list, and the
nice thing about groceries, compared to, say, the airport terminal navigation problem or general outdoor navigation problems, is
that there are all these different
online grocery stores. And
in our case Google's Froogle website was extremely useful for this purpose. We can go find the training images online, and
not just one, but we can find lots of different views of these
different products. And of those 4,000, we picked 120 to
build up our
database and these are just sample pictures of
images that we grabbed from Froogle for
different types of products.
And this
is what we use. So one of the artifacts of this project is a
large database of training and testing images for
all these groceries, and I
can answer questions about that later if you would like. But
just to describe the
data collection task: we use the nicknames In Vitro and In Situ to describe the different modalities of data. In
Vitro describes the clean
images that we capture in the lab, or
rather from the web,
that typically have the products
professionally photographed at a stock photo agency with a white background labeled perfectly.
And then the In Situ is on site. It
is where the
actual products are
located and those,
a student, McCauley,
painstakingly went through
30 minutes of video,
cropping all of these 120 different products in every fifth frame by hand.
For that entire list of products, he just clicked through every single frame,
drew a perfect rectangle around each product
and even did the outline of the mask so that we knew how to separate the foreground from the
background. The testing data, over here, this is the In Situ side of things.
The In Vitro side really has two parts.
Ideally what we would do is just punch the UPC code
into some query
-- just do
a query into a database punching the UPC code
and pop out the appearance
of the
product. Unfortunately that database
is not available or at least is not available for free yet.
So what we have is a sort of two-step process. You
can go to Google or the UPC database lookup:
you can just grab any
product, look at the UPC code, and punch that into the Google search field, and
it will automatically query the UPC database and pop up a
little textual product description. Then
you can write a script that grabs that textual description, does
a Google image query
and then gets you things that may or may not be that product.
And what we are doing there is exploiting some redundancy: if
you are searching for example for
Sun Chips,
this represents seven different
pictures of Sun Chips or a Neosporin box that you might get out on those queries. But you will also get lots of other stuff
in there that just happens to get returned when you type in those words. So
there is a sort of
content based image retrieval problem there, just in building up the database.
So you can think of it
as a semi-automated
problem of taking that inventory list from the store
and doing the queries to
get those
images. Starting around six months ago, I started trying to get
direct access to these databases through one
company that provides these images to lots of different online grocers and
they are just ignoring me and it is difficult to get that. So, my group just came up with our own solution to build up this database. So,
this is
what some of the
data looks like. So, the In Vitro
or the online images,
these are captured from Froogle and
that is what the masks look like that McCauley prepared by hand. And
the In Situ images were captured, in this case, with a VGA-resolution camera, and
they're clearly not as high quality. So this is the type of challenge: we have to try to find things where the
training images look like that and the testing images look like this. To
give some
idea of the number of different examples
we have, over
on the training data, the X-axis here
is showing
the 120 different
products and the Y-axis is showing how many of each one we have
and the mean is about 5.8. So
we have
close to an average of six examples
per training product.
And on the testing side where
each of these examples represents
an appearance of that particular product in a
frame that
had a box drawn around it by McCauley.
So that
has an average of --
per item.
So that sets us up for a kind of
focused computer vision study.
This then decouples into these two problems where you can study object recognition performance and
you can also study object localization. So
just by taking these
training images and
the extracted testing images, you
can create a
database, and you can
forget about the particular application of groceries and just treat it as testing and training data. You can produce ROC curves
and look at
the difficulty of that problem with respect to different types of recognition algorithms.
Similarly you can study how well
any given detection algorithm does at locating the object within the frame.
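One common way to score localization against hand-labeled rectangles like McCauley's is box overlap; this is a hedged sketch of a standard intersection-over-union measure, not necessarily the exact metric the project uses.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # overlap rectangle, clamped to zero width/height when boxes are disjoint
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0
```

A score of 1.0 means the predicted box matches the ground truth exactly, 0.0 means no overlap; thresholding this score answers the coarse "is it roughly there?" question as well as the tight few-pixel one.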
You could start with the most basic question just asking,
is the object in the frame at
all, all the way down to localizing it within a couple of pixels, and
that is stuff we are still in the process
of studying. So,
this is a snap shot of some of the different object detection and recognition algorithms we are looking at. We do
not expect any particular algorithm to solve the whole problem.
We think that we will need a combination of algorithms.
Some of them include color histogram matching
where in this case we are just using the chrominance channels to get
more invariance to illumination. We
can also use invariant interest points; this example is SIFT features from David
Lowe. We may want
to use something like
the Haar-like features in the AdaBoost framework, which traditionally was used more for object
detection, but perhaps we can think of this as an object detection problem just multiplied by the
length of the shopping list. So,
this is
just a snapshot of different plausible algorithms. We could also create new ones and we are thinking of different ones. But
one important thing to keep in
mind is that when the user goes into the store -- because we want this device to work independently -- we are not expecting it to stream video over EVDO and do classification on some backend server. We
want all of that to happen on the box with a quick turnaround time, so it is highly interactive and
maximally usable by that person. And
because of that, it
is very important that --.
Okay, the
point there is that we
are not expecting the device to be able to recognize any of the 30,000 items in the
store. The priority is only that it be able to recognize what is on the shopping list. So
in fact, this device they are carrying does
not know what it is looking at most of the time.
All we are expecting it to do is
to vibrate or activate in some way when it sees an object that is on that shopping list.
Now another implication of that is that we could do a lot of processing
offline on
the PC
when you have all the power of the PC and the Internet available.
So when the user prepares their particular shopping list, they
might put 10 items on there;
while you are at your
PC, you could optimize
your object detection and recognition algorithms to work specifically for that set
of objects that you picked.
Okay, this is something that we haven't yet explored. What
we are doing at the moment is just studying the way that these different
algorithms work
in isolation so
that we know what we are dealing with.
So, what I will do here is
just show a snapshot of the performance of some of these different
algorithms. We are using the standard approach here for
quantifying recognition performance, called the Receiver Operating Characteristic or ROC curve. So we
are plotting true positives against false positives as
a function of the threshold used to declare a
match or not.
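The ROC computation itself is straightforward; here is a minimal sketch, with toy match scores standing in for the histogram or SIFT similarity values the real system would produce.

```python
def roc_points(pos_scores, neg_scores):
    """Sweep the match threshold; one (FPR, TPR) point per threshold."""
    thresholds = sorted(set(pos_scores + neg_scores), reverse=True)
    pts = []
    for t in thresholds:
        # declare a "match" whenever the score reaches the threshold
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        pts.append((fpr, tpr))
    return pts
```

A good recognizer's curve hugs the top-left corner: high true-positive rate at near-zero false-positive rate.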
And in the case of the color histogram matching we have a lot of different ways of quantifying
the similarity between color histograms. So,
as I said, the example of a feature-matching
based approach we
are using is SIFT,
which finds interest points on the image
and attempts to match it and look for
geometrical consistency.
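The matching step can be sketched with Lowe's nearest-neighbor ratio test over descriptor vectors; the toy 2-D descriptors below are stand-ins for real 128-dimensional SIFT output, and the geometric-consistency check is omitted.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    # For each descriptor in desc_a, accept its nearest neighbour in desc_b
    # only if it is clearly closer than the second-nearest (Lowe's ratio test).
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        if len(order) > 1 and dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))
    return matches
```

Ambiguous descriptors, whose two best matches are nearly tied, are simply discarded, which is what makes this test robust against the cluttered shelf background.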
The color histogram approach is actually looking at a
histogram of two different
chrominance channels, the
Cb and Cr channels in the chrominance plane.
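A minimal sketch of that chrominance-only comparison, assuming RGB input and the standard YCbCr conversion coefficients; the bin count and the chi-square distance form here are illustrative choices, not necessarily the project's exact ones.

```python
import numpy as np

def cbcr_histogram(rgb, bins=8):
    # rgb: (H, W, 3) float array in [0, 1]; histogram over Cb/Cr only,
    # discarding luma for some robustness to illumination changes
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = -0.169 * r - 0.331 * g + 0.5 * b + 0.5
    cr = 0.5 * r - 0.419 * g - 0.081 * b + 0.5
    h, _, _ = np.histogram2d(cb.ravel(), cr.ravel(),
                             bins=bins, range=[[0, 1], [0, 1]])
    return h / h.sum()

def chi_square(h1, h2, eps=1e-10):
    # chi-square distance between two normalized histograms
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

Two crops of the same product should give a near-zero distance even under moderate lighting changes, while differently colored products land in different Cb/Cr bins.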
And I have a little Demo Video here that shows what
it looks like in the case
of trying to find the Tide box.
Here the video camera starts out pointing at breath mints and cough drops and
deodorants, and the person is scanning around the boxes looking for Tide
laundry detergent. So,
it is kind of slowed down here, but eventually the Tide appears in the window. This
is running at 25 frames per second and is capable of detecting matching histograms at 5 different scales.
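That multi-scale search can be sketched as a sliding window over a few scales, keeping the best-scoring window; the scoring function passed in here is a stand-in for the histogram distance, and the step and scale values are illustrative.

```python
import numpy as np

def best_window(image, scales, score_fn, step=4):
    # Slide square windows of several sizes over the image and keep the
    # window whose contents score_fn likes best (lower score = better match).
    h, w = image.shape[:2]
    best = (None, float("inf"))
    for s in scales:
        for y in range(0, h - s + 1, step):
            for x in range(0, w - s + 1, step):
                d = score_fn(image[y:y + s, x:x + s])
                if d < best[1]:
                    best = ((x, y, s), d)
    return best
```

In the real system each window would be scored by the chrominance-histogram distance against the shopping-list product; here a trivial mean-intensity score suffices to show the mechanics.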
And so this is just a simulation here. The algorithm itself is not a simulation; it is actually running. The part that is directing the user's hand to the object is still a simulation, because we do not have
the haptic feedback working yet. But
the way that would work is once the color histogram match is
above threshold, the haptic feedback
would get activated and then direct the
user's hand to that box. That
mostly solves
the problem, but at the
end what is happening here is the user
is searching
for the barcode
and then once the barcode is visible, you
can verify that
it is in fact the correct
product. That last step, yes
--. [Audience]: It's a stupid question,
but if you're
holding this
in your
hand and
it's directing you, how can you grab the box? You've got this
other thing in your
hand. Oh,
it directs your hand --. Oh. Use the
other hand.
Oh, yeah, so I think in
practice the way it would work is that you actually keep -- you zoom in until
you are touching
the actual box. But that is a problem
that remains open. My assumption was that you
essentially collide the box with the object and
then you
pick it up. You might pick up the wrong thing, in which case the --. [Audience]: I'm just saying, if your hand is full -- the Mozi box you have to hold in your hand. That is right, yeah.
So that is something we still need to figure out in
terms of ergonomics.
This barcode thing is a bit of a workaround; it
is nice in terms of validating that it is the correct product, but one
of the other collaborations we have going on is
a radio-frequency ID tag technology from Intel. Specifically
it is a
particular style of RFID tag, and
it works in
close proximity, about 5 to 8 cm.
And the
idea is that in
the 5-to-10-year
timeframe,
most grocery products are going to have
passive RFID
tags on them.
And they are short-range in order to be safe for humans; it is a close-range kind
of thing, you can only read it
from a range of a few centimeters. But
eventually that will eliminate this barcode part. You will
still need the computer vision to get you into
the neighborhood, to get your hand on the
product. So we
are going
to use computer vision to get you into the
correct aisle by reading the aisle sign and get
you into the neighborhood of
the correct product. But then the hope is that
you can wear one of these; right now the form factor is
a wristband.
So you can either wear a wristband or just embed that into the Mozi
box and have it
take care of that final validation step. So
we are
doing that collaboration with
Intel Research in Seattle right now. These are
some examples of the ROC curves we get
for some different
actual products. On the
left, where I
am showing
a training image that we got from Froogle, a
ReNu contact lens solution, and
next to it is a snapshot, a picture from the actual video,
just cropped out of the video.
That is one of the 120 objects, so there were 119 possible
distracter objects and the ROC curve was produced for
that based on true positives and
false positives.
In this case, you
do not really have
to pay attention to these different
color codings. They are just different ways of comparing color histograms. But
the ROC curve in this case, for the chi-square distance between histograms, is very good.
This is something that has a very distinct color histogram. It has got
a bunch of
white and then some blue and green and that
makes it very distinct with respect to the other
products. Yeah.
[Audience]: You know, you're comparing
color histograms of
the objects --
but in a real implementation you have
to look at every box at every scale in the image? That
is absolutely right.
So this sub-problem, this database that we created that just has the cropped out products
is artificial in the sense that it is just object versus
object. In practice, you will not only have the multi-scale but
you will also have clutter. You could have the
fluorescent lights. That
is something whether we like it or not, that is going to show
up very soon.
This ROC curve is just studying inter-product confusion
at this point. But
you are right, it
is artificially separated so we can just
take off-the-shelf
object recognition algorithms and just study
their performance. [Audience]: But you know, the demo that
you showed earlier is actually looking
at every
box, every
rectangular region --. That
is right and
you could see sometimes it misfired at
a completely wrong scale. But
one thing to keep in mind here, and one of the reasons we are not
terribly worried about having
the distracters be things besides
neatly cropped other
product images, is that there
is a temporal element to this. There
is temporal continuity in the recognition, which is something that is not captured in the static database. So
whether it be aisle signs or the actual products, if
you are looking at the correct object,
then for several successive frames that should continue to recognize it as the right
thing and as you approach it the recognition should get better.
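A hedged sketch of what such temporal smoothing might look like: only report a product once it has won an absolute majority of the last few frames. The window size and voting rule here are illustrative choices, not the project's final design.

```python
from collections import Counter, deque

class TemporalVote:
    """Smooth noisy per-frame labels with a sliding majority vote."""

    def __init__(self, window=5):
        self.window = window
        self.history = deque(maxlen=window)

    def update(self, label):
        # label: recognized product name for this frame, or None
        self.history.append(label)
        best, count = Counter(self.history).most_common(1)[0]
        # require an absolute majority of the full window before reporting
        if best is not None and count > self.window // 2:
            return best
        return None
```

A single spurious detection never fires the haptic feedback, while a product that stays recognized as you approach it is reported within a few frames.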
Now we do expect the occasional
incorrectly recognized thing. But
there will be a whole dimension of this project, which is to incorporate temporal continuity, and
that is something we just have not
gotten into yet. Here
is another good example. The Tide box has become our poster child for this project, because it is such a nice
product with an attractive design and a
nice solid color histogram. It
has got good SIFT features too. I will show this again for SIFT matching.
So that is the training image up there. This is an example of a test image down there. The
ROC curve is so good, you almost cannot see it. It is way up there in the top left corner. Here is
an Arm & Hammer box and
again it has a relatively good color
histogram; so it is doing pretty well.
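For reference, per-product ROC curves like the ones mentioned throughout can be computed from genuine (same product) and impostor (other product) match scores roughly like this (an illustrative sketch; the project's actual scoring pipeline is not shown):

```python
import numpy as np

def roc_points(genuine, impostor):
    """Sweep a decision threshold over all observed scores and return
    (false-positive-rate, true-positive-rate) pairs, highest threshold first."""
    thresholds = sorted(set(genuine) | set(impostor), reverse=True)
    pts = []
    for t in thresholds:
        tpr = np.mean([s >= t for s in genuine])   # correct accepts
        fpr = np.mean([s >= t for s in impostor])  # false accepts
        pts.append((float(fpr), float(tpr)))
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve (origin prepended)."""
    pts = [(0.0, 0.0)] + points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A curve "way up in the top left corner," like the Tide box's, corresponds to an area under the curve near 1.0.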
Some bad examples, this
is real data, so
things happen that we do not want to happen. There is
a training image of a Skittles bag there. Well,
in the convenience store, Skittles
bags are at the very bottom of the shelf
and they are inside a cardboard
crate and
they are poorly lit. And
I was
the one capturing the data here
and I just pointed it down where the Skittle bags were and it was pretty dark.
And you can hardly see anything there, but if you look very closely, or do gamma correction, you can actually see some Skittles
lettering there.
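The gamma correction mentioned here is a simple pixel remapping; a minimal sketch (the `gamma` value is an arbitrary choice for illustration):

```python
import numpy as np

def gamma_correct(image, gamma=0.4):
    """Apply I_out = 255 * (I_in / 255) ** gamma via a lookup table.
    gamma < 1 brightens, pulling detail out of underexposed regions
    like that dark Skittles bag."""
    table = (255.0 * (np.arange(256) / 255.0) ** gamma).astype(np.uint8)
    return table[image]
```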
Okay, and this was a legitimately cropped region that McCauley outlined
and it just does not work. Okay,
but that is real; you look in the data, and that type of thing
happens. You could argue it should be solved using better illumination, but that is
a fact of life. The way we set this
project up, that
is the kind of thing that is going to happen and
needs to be solved
somehow. Here
is a Dove. I think this is antiperspirant. So
again we have an illumination problem and if you look closely
that product because this is a small convenience
store, a lot of products are actually turned to
the side a little bit just to fit more stuff on the shelf. So,
in terms
of color histograms, besides
those couple of problems, so much of
it is white that it tends to get confused with other things. And the one distinctive part, which would be the cap, was so dark that it did not get a good
enough match. Yes. Have you considered putting illumination in the
Mozi box itself so that you have more control
of it? We have and as
a result of this first data pass, we did two things. One, we decided we wanted to put
an array of
LED's around the
front of the
device, and the other was to put a polarizer on
the lens, because with the potato chip bags there was glare, and
the most salient features that we extracted from the potato chips were
erroneous. They
were just caused by reflection, and when we put the polarizer on, it fixed that.
In retrospect that is just completely obvious, but you
tend to forget everything when you
start a
new project. Here is another one,
this is
Milano cookies and
in this
case, this is the kind of
thing that would work well with
invariant feature matching approaches, but the color histogram of a Pepperidge Farm cookie
bag is just not
very distinctive at all.
So this thing did not perform well
with color histogram.
How does it do on
cans? Soda cans. Soup cans.
Soda cans are even
more reflective. Soup cans I guess have paper around
them, but with the
curvature? That is a good question. So, the curvature is really only an issue if the
cans are
actually rotated -- if the pose of the can changes with respect to the training images that were captured.
I don't
have those ROC curves to show, but we do have internally the ROC curves for every product, all
of those 120. So I could check afterwards to see what they were.
But I think that they worked well
for the soup cans. The
soda cans were more difficult because they are behind glass in the refrigerated area, so they
had more problems with reflection. So how do you plan to deal with the problem of the poses of the cans with respect to
the training data, because that is going to happen, right?
Yeah, --
so there are two ways. One is through,
when you, okay --. We want to have highly heterogeneous training data.
One way to get that heterogeneity is that there are many different online grocery stores
that have different
pictures of the products.
So when we looked for Neosporin for example,
we actually found eight unique pictures of a Neosporin box and
they all have slightly different poses. It
is actually enough to build a three-image reconstruction of it.
Sometimes you don't get that, though,
and I think that one
excellent way of
building this heterogeneous training set is
to have the user community share images. So,
for example,
Trader Joe's might sell things that
simply don't appear online. Well,
you might
have it in your
pantry though. So, if you have
it in your pantry, you can capture an image of it. You
could share it with the user community
and that's as good as grabbing it from the Web. In fact, it is probably better, because it is an actual photograph from the device itself. So
right now it is a bit
more difficult I
think than
it could be because we are forcing the training data to come from this clean
stock photography domain and
all the testing data comes from
the noisy real world. But
the idea you suggest with cans is interesting because if you have
domain knowledge about a type of food -- that
it is cylindrical or square or crinkly plastic --
you can use that to generate synthetic training views. But
the way the grocery store stocks these, all the soup cans are stacked together. So
in fact that is actually a
cue that you are in fact in the soup aisle versus not in the cracker aisle. The cracker aisle is all square boxes. Yeah, -- that is true and that brings up an issue that there is a lot more than just the aisle sign text that can tell where you
are. One
of the ways that came up was that some
of the products that
we want to find are very small, like
a toothbrush.
Now it is tall and skinny,
but toothbrushes tend to be close to certain other products statistically speaking. Yeah that's true. So the challenge might be what is the closest large
object that is correlated with the thing you are looking for,
download images of that on to the
device, find that and then look for the toothbrush. So,
these are
things that are a little bit outside the
scope of
the core computer vision problem. But
these are machine learning problems that I think that the system will need in order to be successful. So
this is a snapshot of -- SIFT
features on the Tide box; again the ROC curve is so good
you can't see
it. The Tide
box -- for those of you that ever worked in content-based image retrieval, the
Tide box is the sunset.
So, sunsets were the super easy
example back in the content-based image retrieval days. So Tide
boxes -- those are easy.
You should tell Tide
that. [Laughter] It is really -- it's basically the informal logo of our
project. Here is a Raisin Bran box
SIFT matching. There's quite
a bit
of pose variation here and
that is really the type of thing SIFT was built for. A Raisin Bran box is basically a planar object with an interesting pattern painted on it;
ROC curve is very good for Raisin Bran.
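The SIFT matching step being shown can be sketched at the descriptor level like this (a generic nearest-neighbor matcher with Lowe's ratio test; descriptor extraction itself is omitted, and the `ratio` value is an assumption, not the project's setting):

```python
import numpy as np

def ratio_test_matches(train_desc, test_desc, ratio=0.8):
    """For each training descriptor, find its two nearest neighbors among
    the test-image descriptors and keep the match only if the closest is
    clearly better than the runner-up (Lowe's ratio test).
    Returns (train_index, test_index) pairs that pass."""
    matches = []
    for i, d in enumerate(train_desc):
        dists = np.linalg.norm(test_desc - d, axis=1)  # Euclidean distances
        order = np.argsort(dists)
        nearest, runner_up = dists[order[0]], dists[order[1]]
        if nearest < ratio * runner_up:                # unambiguous match
            matches.append((i, int(order[0])))
    return matches
```

The number of surviving matches is what drives examples like the Raisin Bran box: under moderate pose variation on a planar, textured surface, many features pass the test; on the failure cases below, that count drops to zero.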
So in terms of
end-user effectiveness, if you could make
the training really, really trivial
for an end user to
do it --
Yeah. Then it would actually probably work a lot better, because
the end user could actually train it a couple
of times in the store that the
user shops at. Yeah. Then you have training data and your usage data really match so closely. I think you are absolutely right and the nice thing about the grocery store domain is
that we would
encourage all the users of this system to do exactly what you said and
then there is this
UPC code that makes it all so easy to index.
So, if somebody in Lexington, Kentucky
finds these vegetables, say,
that product has a unique UPC code. They add that to the database. Of
course it is probably still a good idea to do some quality control to make sure that it is
the right thing, but that we can have this clearing house database
of in situ
and in vitro training images that
eventually will make the
system much more effective. So
I think that
we do not have that benefit right now because we do not have any users, but I do foresee that it will get much --
quality will go way up when more people start using it.
So that's
-- here are some bad examples
with SIFT
matching. So this Cheez-It bag --
again it is very dark. But if you stare at
it a little bit, you will see that
the Cheez-It bag is not only
reclined back but also distorted a little bit. So
the word 'cheese' is
bent and in
fact the number
of matching SIFT features was zero.
It just did not work at
all. Here
is that Skittles bag again.
It looks really dark, but there actually are all these letters that say Skittles.
No SIFT features fired on it.
Here is another problem. This is Pepto-Bismol. It
just so happened,
in the testing
data that the
Pepto-Bismol appeared with
motion blur
in a lot of these
frames. Some of the
frames containing Pepto-Bismol did not have motion blur and
got recognized correctly. But
when motion blur like that happens, it throws off the interest point detector. This happens
to be one that color histograms did work well on by
the way, but in the case of the SIFT
features, it only matched maybe one or two different features on
it. Yet to the human eye these clearly looked quite similar.
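One plausible way to cope with the frame-rate/motion-blur trade-off discussed here is to measure per-frame sharpness and simply skip frames that are too blurry. A minimal sketch using the common variance-of-Laplacian heuristic (the kernel and `threshold` are illustrative assumptions, not part of the project):

```python
import numpy as np

def laplacian_variance(gray):
    """Convolve a grayscale frame with a 3x3 Laplacian and return the
    response variance. Sharp frames have strong edges, so variance is
    high; motion blur smears edges and drives it toward zero."""
    g = gray.astype(float)
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return lap.var()

def is_sharp(gray, threshold=50.0):
    """Gate frames before running interest-point detection."""
    return laplacian_variance(gray) >= threshold
```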
So motion blur is a fact of life. It's the kind
of thing that gets worse as you increase the frame rate of the device and
it's one of the trade-offs that we need to consider.
Do we want a
lower frame rate with
risk of motion blur
or a higher frame rate and err on the side of having the device make mistakes now and then, but
constantly give feedback to the user about what is being spotted? So
let me
say a little about future directions. One
thing that is not mentioned here but actually represents the very
first stage of the
project, maybe about three
months of it, is something called the remote sighted guide study. I mentioned the remote sighted guide earlier in the context of
Zig Zag. But what we are going to do in that same convenience store is simulate this whole experiment in
a kind of Wizard of Oz style, where instead of having the computer do it, we
are going to have
a person sitting
under the
check-out counter with a
laptop. So they are sitting there with the laptop, the
user has the Mozi box but all it is doing is
transmitting the video to the
user, the remote-sited guide.
That user has a control panel,
which can do two things; it can set the servo angles and it
can also control the text-to-speech unit.
So we plan to use this remote sighted guide study
to learn about
the usability and
protocols for directing people in an environment like a grocery store. We
already have a lot of
feedback that
we can import from the Zig Zag studies
about what kind of protocols people worked out for
how much to turn. Do
you want to turn 90 degrees or 270 degrees or
what is the right way to get
the person to respond to the haptic feedback? So now
we have those issues plus the visual information provided by the camera. We also have text to speech, which the Zig Zag did not
have before. So a big part of this study will not use computer vision at all. It
is just the process
of being a fly on the wall and watching the way that a human operator would interact
with a blind or blindfolded person in the store and
then try to
use that to
guide the way we develop the automated system.
So, some of the things we
are doing: we are expanding the core object recognition ability, bringing in
new features that
kind of combine
SIFT with color histogram type descriptors, though
I should say this
is not really where -- we
are not looking to do rocket science in that regard. We believe that most of the object recognition technology that is out there is actually quite well suited to
the Grozi problem domain. It
is just a question of putting it together in the right way and making
it usable. We need to get these algorithms ported to the Mozi box,
and we're actually aiming to have the
aisle sign reading working by mid-October and then in
the coming months start
getting more and more product recognition working on the
device. And as many of you were asking, we need to
do actual user studies to figure out
what this person is supposed to do with their
hands. Suppose they've got a shopping cart and
a cane and
the Mozi box, or a shopping cart and a guide dog and a
Mozi box and
maybe something else in their hand -- how much stuff can they manage, and what
is the right way to get their hand to that box? So
that is an interesting problem
and we recently submitted a proposal to NIDRR, the
National Institute on Disability
and Rehabilitation Research, that would be
a three-year proposal and
so we are hoping to get that funded. We will find out in October or November.
This is just a snapshot of the collaborators here
and that picture shows John Miller who
is a blind researcher at Calit2, and he is teaching my
student Vincent how to use a cane to cross
a street and
that is Franklin Yang from Polycom, who
is watching along. And
as I mentioned, we have
collaborators
at Kentucky
who are working on the outdoor problem and
then some machine learning collaborators that
are helping us with the semi-supervised learning problem at
the Max Planck Institute in Tübingen, and
then James Coughlan and
John Brabyn at
Smith-Kettlewell are also advisors on this
project. And,
that's it. And, I thank you for
your attention. [Applause]