Real-Time Crowd Support for People with Disabilities - Jeffrey Bigham


Uploaded by Dartmouth on 22.11.2011

Transcript:
>> He is an assistant professor there, has been there two years.
He got his PhD in 2009 from the University of Washington
and he's had a fairly successful career.
Many awards and so forth including the Andrew Mellon
Foundation Award for technology collaboration.
And the MIT Technology Review top 35 innovators under 35,
which you see in their magazine once a year
and he's had several Best Paper awards.
And I was interested in inviting him because I'm interested
in HCI, we don't see enough of that in our department.
And also I'm interested in health care and this connection
to disabilities I thought was interesting
and so I'm sure it will be fun and please ask questions.
Right?
>> Yeah. Please interrupt me.
All right, well thank you, I thank all of you for coming.
I've had a great visit so far.
So I am Jeff Bigham from the University of Rochester
where I lead this group called ROC HCI there
at the University.
So today I'm going to be talking about my research which is
in human computer interaction.
My background is kind of in artificial intelligence,
or it was, and I found HCI as a way to kind of allow me
to get done what I wanted to get done.
I wanted to help people interact with information better,
I wanted to help them, I guess, in the real world
in their everyday lives.
And in particular a lot of my work is motivated
by helping people
with disabilities lead more independent lives,
better interact with what's around them.
And so today we'll be talking
about in particular some work my group and I have been doing
around the idea of real time Crowd support.
And so this talk is going to be a mix of motivation and history
and also a bunch of projects
that we have been working on around this theme.
So it turns out that people with disabilities have relied
on the assistance of others in their community for centuries.
So a blind person may ask a fellow traveler the number
of an approaching bus, or a person
with a motor impairment may ask somebody for assistance
with getting something off a shelf.
Individually, a lot of the challenges that people
with disabilities face are actually pretty small,
but if you collect them together it can lead to a lack
of independence and bigger problems.
It turns out also that despite what many of us
in this room may think, that this idea of Crowd sourcing
or human computation is kind of a more recent phenomenon,
people with disabilities have been soliciting the assistance
of others for a very long time, with structure
around that assistance.
So for instance, volunteers may go to a person's home
to help a blind person read their mail,
and sign language interpreters can help convert speech
into sign language to help a deaf or hard
of hearing student follow along in class.
And recently, you know, technology has even been applied
in this space so things like video relay services
which allow a deaf person to actually make a phone call
and the idea then that the deaf person is actually communicating
over Skype or something like Skype
and a person they call is then actually making the voice call.
So this is pretty fascinating.
And what I think is really interesting is
that connectivity has really allowed us
to engage the Crowd in new ways.
Allowed us to form new Crowds that may provide a new route
to assistance for people with disabilities.
And so just to talk briefly about the broad nature
of what I think of as the Crowd.
This ranges from services we in computer science
might especially think of as the Crowd,
things like Mechanical Turk, these micro-task marketplaces
where you can pay people small amounts of money to do jobs
for you, and it turns out that Mechanical Turk is not alone
in this space anymore.
There is a whole slew of startups doing similar things,
like CrowdFlower and others.
There's also a bunch of people who,
if given the opportunity, would be willing to volunteer
to help somebody out, especially for a good cause.
And because of connectivity, meaning
that I am always connected
to everyone else via my phone, maybe I can actually harness the
few seconds of time I have, the few minutes
of time I have waiting in line or, you know,
sitting on the bus, to help somebody out.
And then our friends and family are also really easily connected
to us by social networks, Facebook,
Twitter, things like that.
And so a lot of people have been thinking about how
to engage the Crowd, how to solve problems
that are too difficult for computers
by engaging the intelligence of people.
What I think differs about the research that I've been doing
at the University of Rochester is
that we've really been targeting this idea
of real-time human computation,
or Crowdsourcing that's interactive, and so this means
that we can build tools that we'd
like maybe eventually to be done with automatic technology,
so intelligent user interfaces, but that can be deployed today
by engaging people in real time.
Though this is hard because of the nature of Crowds.
Crowds are interesting, and the way we think
about Crowds is that they are this dynamic group of people,
so that means that people come and go, and recruiting workers
to do your job can actually be kind of slow.
If you think about Mechanical Turk, it actually takes a while
after you post a job to get somebody to come do it.
And Crowds are unreliable.
You never really know who is part of that Crowd,
and for a whole variety of reasons they may not provide
the level of assistance that you would like them to,
and so we need to build models that allow us to take
these messy Crowds and extract out the good parts,
the intelligence that these Crowds contain,
while still making guarantees about the quality.
So I'd like to take just a second to talk
about who would be part of one of these Crowds.
So how many people here have actually gone
on Mechanical Turk and done jobs?
Nobody. All right.
How many people have answered a question on a social network?
Yeah, come on, you've all probably done this.
All right, well so maybe we're outliers,
maybe we're afraid to raise our hands.
But this chart, as I add things into it,
is going to kind of illustrate
the promise of Crowdsourcing,
because as it turns out we actually have a lot of time.
Okay, so this data is actually a little bit old
but the relative scale is going to make sense.
So at one point in time Twitter had
about 80 million people using it.
I'm actually pretty active on Twitter and so there's all kinds
of time I'm wasting there.
Facebook of course, is much greater than this.
I couldn't even show it; if I drew it at the correct scale
it would go up to the ceiling.
But what surprises a lot
of people is actually this next piece
of this chart which is Farmville.
Just coming in right below Twitter.
So I don't know if you've played Farmville, but this is a game
by Zynga, and essentially you, for free,
are able to have your own virtual farm, and this game is
so monotonous and tedious that Zynga actually makes huge amounts
of money selling you ways to get
around the tedium of playing Farmville.
But it's actually a very addictive game.
A lot of people play this game
and they've done a really great job of kind
of making this a social experience
so you can actually have friends on Farmville
and they can have neighboring farms.
You get all these little rewards so like if you have enough money
that you earn by cashing in your virtual crops you can get these
rewards like houses and new tractors
and all of this great stuff.
So obviously people, if you make it fun for them are willing
to spend time doing work.
Wouldn't it be great if we could harvest that work?
Regardless of who you are,
regardless of your expertise you spend a lot
of time really just waiting.
So here is a line at a DMV.
What I like about this picture is you can imagine these people
maybe they're not skilled workers
but maybe they're doctors right?
Maybe they're rocket scientists.
Maybe they're computer scientists, even better.
And here they are just wasting their time in line.
So a lot of people have time that's not even being used.
And there is one final example.
Time adds up.
So this example is a few months old, but the iPad 2 came
out a few months ago and there were reports
about this fancy new cover,
and what this cover did was, once you opened
up the cover, you didn't have to slide
to unlock your iPad, it would just unlock automatically, and so
that saved you five seconds.
Right? Well, man.
If you could just imagine, if you took those five seconds,
somebody made an estimate that this was
about six minutes a week if you use your iPad a few times a day,
and they are projecting to sell 40 million iPads in 2011.
Well, that's approximately 40 million hours
of time that's just being wasted, and you might say,
what in the world could you possibly do with five seconds?
Well, people have already figured out how
to harvest five seconds.
This is reCAPTCHA out of CMU, and the idea is that instead
of just solving a CAPTCHA and having that work wasted,
you actually help to OCR documents that are too difficult
for automatic OCR alone.
So it's automatic OCR plus human OCR, via reCAPTCHA.
So there's a lot of potential in the Crowd and a lot of time
that all of us have, we might be able to harness in good ways.
All right.
So that was kind of my introduction to you know,
people with disabilities having used human assistance
for a long time, the potential of human computation
and why many of us, even
if we think we don't have time might actually participate
in it.
So the first project that I actually want to talk to you
about that we've done at the University of Rochester is
on usable captchas, and this is in particular targeted
not just at regular captchas but at audio captchas.
So probably most people in the room know what captchas are,
they're these problems that are designed to be difficult
for computers but relatively easy for humans.
The most common instantiation of these looks something
like this, where you have some text that is somehow obfuscated,
and the task that you have to do to show that you're a person is
to figure out what text is in these captchas.
The initial ones that look like this,
these visual captchas, required vision,
and so obviously this excludes people who are blind,
and so not long after the visual captchas were introduced,
people introduced audio captchas. The idea
with audio captchas is pretty much the exact same thing,
where you have some text or characters or numbers
and you distort them in some way to make them difficult
for computers to pick out but hopefully still easy
for humans to pick out.
And so what I have here is essentially a test
for all of you.
I'm going to play an audio captcha you just listen
to the audio captcha and I'm going
to later ask you what it says.
So here we go, three, two, one.
>> [ Inaudible ]
>> All right, so did anyone get that?
>> [Inaudible]
>> How many digits do you think was in that audio captcha?
>> Six, seven.
>> Six or seven?
20? There are actually 10 digits, so it turns out the problem
with audio captchas is not just
to identify how many digits there were, it's actually
to identify the digits that make it up.
So I'm going to play it one more time
and this time I'll animate those digits to convince you.
>> [ Inaudible ]
So, audio captchas. And this, again,
is not an unusual audio captcha;
this is actually the Microsoft audio captcha they used
for a number of years.
So a couple of years ago we did a study of these audio captchas
and also visual captchas, we took 10 examples
from 10 different audio and visual captchas, we had a bunch
of blind people and a bunch of sighted people attempt
to solve these captchas and here are the results.
So essentially the first-try results are here,
where blind people were a little
over 40 percent successful at solving the audio captchas,
and sighted people were about the same
on the audio captchas, a little lower.
But sighted people were about twice as successful
at the visual captchas.
So it probably doesn't surprise you given
that you just heard one of the audio captchas
and probably found it pretty tough.
But we did this study
to kind of get a quantitative comparison,
and then we started talking to the people, the blind people
who participated in this study, to ask them
about their experiences solving audio captchas,
and not surprisingly many
of them said I hate audio captchas because they're hard.
But many of them also cited the user interface that was provided
to them to solve the audio captcha as a big challenge.
And so to get a sense of why that might be,
I'm going to play a video here
of an audio captcha being solved with a screen reader.
And so essentially what's going to happen is this is going
to start playing, we're going to read these instructions
and then the person in the video is going
to solve the audio captcha so here we go.
>> The audio file is intentionally distorted
to prevent automated programs from reading it.
If the file plays in another program, remember the numbers,
return to this page and then type them in the field below.
Play audio button, browse off.
Play audio button.
Audio number [inaudible] browse off, numbers in edit box.
Question. Question.
>> Okay. So you might have gotten a sense
from that why the interface was actually getting in the way.
When the audio captcha started playing
of course it overlapped the audio of the screen reader
which was the interface the user had to the window here
to solve the audio captcha.
Fortunately, you know, someone there, again,
not to pick on Microsoft but to pick
on Microsoft, someone thought of this problem
and gave some helpful hints about how to solve it,
and that was just to remember the arbitrary sequence of 10 digits
when the audio plays and then type it in the box later.
Imagine, that's a little difficult to do.
All right.
So we simplified the interface: basically you have one box,
you press a period in that box,
and it starts the audio captcha playing. The advantage
of this is that you're already in the box, so you don't have
to do any of this navigation, there are no boxes that pop up,
and you're just right there and ready to type
when the audio captcha starts playing.
We redid our study and just
from this simple interface change users had a 59 percent
higher success rate on audio captchas.
Without changing the underlying audio captcha,
without changing anything therefore about the security
of those audio captchas but just by changing the interface.
We took some lessons from this.
Usability is more than accessibility:
it's possible that a blind person could solve the other one,
but that doesn't mean that they were going to be able to,
and interfaces just can't be naively adapted.
So those are maybe some general lessons that people can take away
when they're designing interfaces for blind people.
But just to continue with this idea a little bit farther,
what do people do if there is no audio captcha?
So I was kind of presenting this whole work
like audio captchas are very popular.
They are getting more popular;
reCAPTCHA has an audio version, for instance.
But a lot of Web sites just don't have audio captchas,
so what do you think blind people do
when there is no audio captcha?
Well, they ask friends around them
to just solve the visual captcha, and so I think this is neat,
and this leads very well to the next project that I'm going
to talk about, which is VizWiz.
So just as thinking
about problems that computers have a difficult time solving
but humans are good at led to captchas, right, to be able
to differentiate the two, the same line
of thinking actually led Luis von Ahn and others, who were some
of the pioneers of this idea of human computation, to think
that maybe we could actually engage people as part
of computational processes, and VizWiz is one example
of where we're doing that.
And this is an iPhone application that we've developed
that allows blind people to take a picture, speak a question
and then get answers back in nearly real time from the Crowd.
So essentially we're solving visual questions with the Crowd
and so one question you might ask is, well,
why do you need to do that?
There is all of this great technology that's been developed
over the years to solve visual problems.
So there is Optical Character Recognition,
there are color recognizers; I guess I have speech recognition
on here, which is just one example of another technology
that interprets sensory information.
But there are big problems with this technology.
It's very limited in scope, so your OCR program might be able
to work pretty well on a really high resolution
picture of a scanned document, but it's probably not going
to work once you give the camera to a blind person
who takes a picture of a street sign.
And it has unacceptable error rates, so even something
as simple, or seemingly simple,
as a color recognizer actually trips up pretty substantially
when used in a real situation.
I took a picture of my shirt,
what color do you think color recognizer would say?
If I took a picture, here is my jacket, it's actually one color,
I took a picture of this, what do you think it would say?
Well, unless I really carefully framed it,
and even if I did there's actually a lot of colors
that kind of make up this shirt, so what do you choose?
But the real picture would be something like this,
right where I have a little bit of the shirt
and a lot of background.
So it's actually pretty tricky to get the right answer.
Technology that's available tends to be really expensive.
This is a mobile OCR phone from Kurzweil.
This cost, I think it was $1,000 for the OCR software
to run on this phone, plus $500 for the phone.
This is not atypical of assistive technology,
and in general I also argue that
usually this technology isn't exactly what people want anyway.
And to get a sense of that I have this big giant menu
from a brewery.
So imagine we had an OCR program that could reliably pick
out all the text in this menu.
It would still be pretty hard to ask questions
that we might actually want to know about this menu.
So it wasn't long ago that I was a graduate student.
And one of the things I would normally look
for when I was a graduate student is what's the cheapest X;
in this case, what's the cheapest burger is what I'm
going to ask.
If you just had one big long stream
of text, it would actually be pretty difficult
to answer that question.
Now it turns out that the answer is right there,
and it's still kind of blurry I guess,
but it's called the no frills burger, I think it's $7.95,
and if you're sighted you can find this information
pretty quickly.
You're using all kinds of clues, you know,
you're able to scan this outline that's provided,
you're able to look and see that it says burgers,
and well, probably the cheapest burger is under burgers.
It doesn't take a genius to do this,
but automatic programs are not able to do this,
at least not in a general way.
And so that's why we thought it might make sense
to engage the Crowd.
So again VizWiz is very simple.
Basically blind people can just take a picture with it,
speak a question into their phones
and then they get multiple answers back from the Crowd.
Really quickly.
You might wonder, how does a blind person use an iPhone,
right?
So this might seem interesting.
We actually did some work on this
when I was a PhD student back at the University of Washington
on accessible touch screens, and some of the ideas
from that work actually made it into every iPhone,
so now all I have to do is triple-click on my phone.
>> [Inaudible]
>> Now I can use my phone even
if I can't see the screen although I'm not very good
at it so I keep repeating.
>> [Inaudible] Maps, music, mail, calendar.
>> All right.
And there are actually very interesting
types of things you can do with a touch-sensitive surface,
things that are really difficult to do
if you are restricted to physical buttons, even though
we don't really care about the screen if we're talking
about non-visual interfaces.
That's a different story for a different day.
And so just to give you an idea of some of how this really works
in the real world I thought I would show a video
and what's pretty fascinating, kind of fast forwarding,
is that we actually released VizWiz out on the App store,
a bunch of blind people started using it
and this particular person actually thought it was cool
enough that he independently from us,
actually recorded a video of him using it
in some common scenarios and so I think it's fascinating
because here is a real user really using it and you get
to see a lot of things about how he interacts with his iPhone
and then also how he takes pictures,
which is another thing, how does a blind person take a picture.
So assuming this works, wait.
Oops. Well, somehow it should work.
>> [Inaudible] demonstration of VizWiz for the iPhone,
well it's really for iOS, and I have an iPhone 4 and I'm going
to show you a couple of common household things
that I tested VizWiz with today.
And I just wanted to kind of show you some of the results.
I'm pretty impressed.
So I got my thermostat here and I set it way down and I'm going
to take a picture here, so I place this,
move it back a little bit.
Take a picture.
I'm going to record my question, what does this thermostat say?
I have some options as to what sources I'm going
to pull the results from.
I'm going to select Web workers and IQ engines
and so I'm just going to go ahead and send.
>> Please wait while your question is sent.
>> So it's going to send my question.
So Web workers will actually pull information
from human sources; IQ Engines, I imagine, pulls it
from image databases, somewhat like Google Images.
Notice I have the option to pull results from social networks.
I think Twitter is the only option here now,
but on the Web site for the application they talk about
Facebook being an option.
So let's see.
It already has some answers available.
IQ Engines has it labeled as a thermostat, so we knew that.
That doesn't do me too much good if I wanted
to get a temperature reading, so we'll just hold on here
for a couple more seconds.
Earlier I got a result probably within about 45 seconds,
so just hang tight here.
It did identify the temperature.
>> [Inaudible]
>> Okay, a Web worker said that 75 is on the left
and 71 is on the right.
They are not sure what the current temperature is
but I can tell you that 71 is what I have set at
and 75 is the current temperature.
Is that correct?
>> Yes.
>> All right.
So we have two pairs of eyes that says that that's correct.
>> All right.
So that's just a quick example.
He goes on to some other scenarios,
telling what color his shoes are, he wants to match it
with his outfit and a few other things.
I think a lot of this video is pretty fascinating.
I mean you got that sense
of how a blind person will use a touch screen,
how they take a picture.
Some of the interesting qualities
of the answers you already see in this video: so 71
on the left, 75 on the right,
I'm not sure which is the right one.
Imagine, you know, building an automatic question answering
system that could give you that kind of an answer,
to be able not only to do OCR
in general but to express a confidence like that.
So just a little bit about how this works.
Essentially what happens is these questions are forwarded
to our server, they go out on our Web page,
and we recruit workers, making sure we recruit multiple workers.
So the user takes a picture, asks a question,
and then the answers go back, and again, we get multiple workers,
and there are a number of reasons for that, but they all relate
to these qualities of the Crowd that we expect.
So Crowd workers will take varying amounts of time
to answer the question, and they may
or may not answer the question correctly.
In this case we've actually made the choice of not trying
to verify that the answers are correct
but letting the user do that instead,
returning multiple answers.
We do this for latency reasons:
if we try to verify or somehow validate
that each answer is correct, that actually adds time,
and so we feel it's much better to let the user pick it out.
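A minimal sketch of that flow, for illustration only: each question is posted redundantly to several workers, and every answer is forwarded straight back so that the user, not the server, judges correctness. The names here are hypothetical, not the actual VizWiz server code.

```python
# Illustrative sketch only; not the real VizWiz server.
import time
import uuid

ANSWERS_PER_QUESTION = 3   # redundancy instead of up-front verification
pending = {}               # question_id -> answers received so far

def post_to_worker_queue(qid, photo, question):
    # Stand-in for putting the task on the recruiting page / Mechanical Turk.
    print(f"task {qid} posted for a worker")

def send_answer_to_phone(qid, answer):
    # Stand-in for pushing an answer back to the phone.
    print(f"answer for {qid}: {answer}")

def submit_question(photo, question):
    """Called when a photo plus spoken question arrives from the phone."""
    qid = str(uuid.uuid4())
    pending[qid] = []
    for _ in range(ANSWERS_PER_QUESTION):   # multiple workers per question
        post_to_worker_queue(qid, photo, question)
    return qid

def receive_answer(qid, answer):
    """Called whenever any worker submits an answer."""
    pending[qid].append((time.time(), answer))
    send_answer_to_phone(qid, answer)       # forward immediately, unverified
```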
>> Is there guidance that you give to the Crowd?
Like if you're not sure, say that or give more details?
>> Yeah, yeah.
So I'm not sure.
There's this little paragraph you can't really read here;
it says, unable to answer the question, and it kind
of gives them feedback about what to do.
Exactly. So as we'll see through some more examples here,
a lot of times you can't answer the question
because the photo isn't of the right thing, or it's too blurry,
or a number of other things.
>> Do you give the Crowd instructions on that?
>> [Inaudible] to text first?
>> So we try to do that; this one says, which can is the corn?
Speech recognition doesn't always work,
and so we also just play the audio here.
I don't have the chart in this slide deck
but it's kind of interesting.
If you look at a breakdown
of the questions, the answers come back the fastest
when the speech recognition works, so they don't have
to listen to the audio, and also
when the question was actually answerable.
So the workers get a little bit nervous
about answering something if they can't.
Which kind of makes sense.
So what I think is fascinating about this is
that this is essentially an example
of deployable Wizard of Oz.
Right? So in HCI we often talk about this idea
of a Wizard of Oz study, where instead of going right off
and trying to build all of the complex algorithms and AI
and everything else that might go into a system
so that it might do something useful, you instead kind
of put people behind the curtain to control the system,
and let people try it out and see if it's actually useful
before you go to all that effort.
And so with things like Mechanical Turk,
with the Crowd, we can actually build systems
that are essentially Wizard of Oz but are also deployable.
So I have this cute little animation
of the Wizard going into the phone.
So here's the kind of example of this idea
of deployable Wizard of Oz.
Before we released it on the App store we did this initial field
deployment with 11 blind iPhone users
and I have these examples here of questions
that they really asked and so I like these because they kind
of illustrate the diversity of questions that are asked
and kind of hint at why it would be really difficult
to make something automatic that could solve all of these.
So the first question is what color is this pillow?
The first person says, I can't tell,
probably because the picture is kind of bad.
The next person says multiple shades
of soft green, blue and gold.
Pretty interesting answer.
What denomination is this bill?
20, 20; people get that.
You know, do you see picnic tables across the parking lot?
So people are trying to get a sense of space,
and no technology really helps people answer that now.
The answers here are no.
What temperature is my oven set to?
This comes up with all kinds of appliances,
which have inaccessible interfaces.
Can you tell me what this can is?
You can tell this picture was taken by a blind person;
it's actually the back of the can.
Nevertheless, people are actually able
to answer this, it just took them a little longer
than the other ones.
They say chick peas, beans, and Goya beans, and it turns
out that the primary ingredient
in chick peas is actually chick peas.
So you can tell from the ingredient list what this is.
What kind of drink does this can hold?
It might not be obvious that there is anything here,
but it's actually a little Red Bull can,
so the answers were energy, or energy drink.
So I think this is interesting.
It kind of gives you a sense of the types
of questions people wanted answered
and the kind of answers they get.
We did a lot of work to try to get these answers back quickly,
and there are a number of strategies we used for this.
I'm going to go over just two of them.
One is we think about how tasks
on Mechanical Turk are completed.
There's an amount of time it takes for workers to be recruited,
and then it takes them some amount of time
to actually do the task.
This should be pretty obvious.
So an easy thing you can do is,
once you've recruited a worker you can actually have them do
multiple tasks and that way they're ready at multiple points
and so if a question comes in they are more likely
to be there to answer it.
And then you can also recruit multiple workers
who will each have their own recruiting time
and their own time to complete each task and so if you do this,
do enough of these, recruit enough workers
and have them do enough tasks you actually have a worker ready
pretty much any time that you want.
As an added benefit, you might be wondering,
if we don't have a question for them
to answer right now, what are they doing?
We actually have them answer old questions, and so this is kind
of an implicit learning phase.
They learn what it's like, what to expect;
maybe they read those instructions about what to do
if they can't answer the question.
And so this actually works pretty well.
The other thing that we can do is get a jump start
on recruiting these workers, and we do
that by sending a signal off
when the user starts taking a picture,
before they've actually sent it in.
So if someone starts using the app and they start, you know,
trying to take a picture, there's a really good chance they're
going to ask a question,
and so we can actually optimistically recruit workers
beforehand.
So ideally we can actually get rid of a lot
of that recruiting time just by covering it
up with the time it takes the person
to formulate their question.
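A rough sketch of those two recruiting tricks, assuming a simple in-memory worker pool; the function names are illustrative and the real system is not shown in the talk.

```python
# Illustrative sketch: keep recruited workers warm on old questions,
# and recruit optimistically as soon as the camera is opened.
import collections
import random

ready_workers = collections.deque()   # recruited workers waiting for work
old_questions = ["what color is this pillow?", "what does this can say?"]

def practice_on(worker_id, question):
    print(f"{worker_id} practicing on: {question}")

def worker_arrives(worker_id):
    # Keep the worker busy (and implicitly trained) on previously
    # answered questions until a live question shows up.
    ready_workers.append(worker_id)
    practice_on(worker_id, random.choice(old_questions))

def recruit(n):
    # Stand-in for posting n recruiting tasks, e.g. on Mechanical Turk.
    for _ in range(n):
        worker_arrives(f"worker-{random.randrange(10_000)}")

def on_user_opens_camera():
    # Recruiting starts here, so it overlaps with the user framing the
    # photo and speaking the question.
    recruit(3)

def on_question_submitted(question):
    # Hand the live question to whoever is already waiting.
    while ready_workers:
        print(f"{ready_workers.popleft()} assigned: {question}")
```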
If we pull out all the stops and we're pretty aggressive
in pre-recruiting workers
and all this stuff, we actually showed we could get answers back
to people in an average of 27 seconds.
Which is getting pretty fast.
Believe it or not, even though 27 seconds sounds short,
it does still feel like you're waiting pretty long,
but it means
that you can actually use it in a real situation.
So you can take a picture of a menu in a restaurant
and get an answer back before you leave the restaurant,
for instance.
>> Jeff, do you have to [inaudible] people?
>> Yeah, so I kind of made this whole thing nebulous.
We used Turk for the study; now we're doing mostly volunteers,
and we're trying to kind of build our own Crowd because,
especially for this kind of an application, we think
that people would want to participate, and so we think
that we can actually build our own Crowd. We have a number
of people, something around 10 or 20,
who have volunteered as kind of a pilot group, and a whole bunch
of other people who are kind of on our waiting list,
who have actually sent me emails
out of the blue saying hey, I want to volunteer.
And so we hope that this type of approach,
of our own Crowd, will allow us not to have to charge for it.
>> [Inaudible] how much does VizWiz cost you [inaudible]?
>> On average it was seven and a half cents per question
which resulted in three answers, about three answers on average.
So we did release VizWiz out on the App store, so here's our page
at vizwiz.org, and it's been pretty great to see kind
of the uptake by blind people. We released this
at the very end of May;
we've had about 2,000 downloads, we keep getting
about 200 questions per day, and overall we've had
about 20,000 questions answered.
So this is kind of a test again.
A bit of a trick.
Some of these pictures come from VizWiz users,
and some of these pictures come from one
of the more popular computer vision data sets.
Your job as the audience is to tell me
which ones you think came
from the popular computer vision sets
and which ones you think came from VizWiz users.
I don't want all.
>> [Inaudible]
>> Well, so just as one of the implications
of this, what we're building with VizWiz is this big data set
of real questions that blind people want to know
about their environment, and we hope to be able to package
it up in a way that computer vision researchers can use
as kind of a challenge.
And one of the ways we imagine this working is
as a two-stage process.
The first stage would be basically taking a new question
and categorizing it into some kind of category, say
color recognition, or getting a sense of space,
or object recognition, etc. And then maybe eventually being able
to solve some of these categories with computer vision.
And so it's been fun for us to kind of look
at the difference between, you know, the pictures that we get
from our users and the types of pictures that are used
in these computer vision data sets.
Well, you might look at these, and there are some common problems
that blind people have taking pictures: framing
an entire object is difficult, you know,
making sure you have the right side of it, getting the object
in the frame at all is tough.
And so we've also been thinking at the same time
about how we might help blind people take better pictures.
Here's a quick example
of helping a blind person take a better picture of documents.
So you see a person here with an iPhone getting assistance taking
a picture of this document.
[ video ]
That's an example of one thing that we've been playing with.
We built a different application called Easy Snap
that helps blind people take pictures
of arbitrary objects in their environment.
This is actually based off of discussions
with existing blind photographers
about how they take pictures of objects.
The idea is basically that they will take a --
they will take their phone or their camera, hold it up close
to the object and then move back.
Right, so you can get a very good sense, and you kind
of saw this actually in the video
with the person taking the picture of the thermostat.
So we added a little bit of technological support.
We basically have the user take a close-up picture
and then we use computer vision descriptors to allow them
to move farther back than they would be able
to with just their arm's reach.
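A rough sketch of that idea, assuming OpenCV's ORB features (the talk doesn't say which descriptors the real Easy Snap app uses): features from the initial close-up are matched against each live camera frame as the user backs away, and the result can drive audio or vibration feedback.

```python
# Illustrative sketch, not the actual Easy Snap implementation.
import cv2

orb = cv2.ORB_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def remember_object(closeup_bgr):
    """Store descriptors from the initial close-up photo of the object."""
    _, descriptors = orb.detectAndCompute(closeup_bgr, None)
    return descriptors

def object_still_in_frame(reference_descriptors, live_frame_bgr,
                          min_matches=15, max_distance=50):
    """Check that enough of the close-up's features are still visible
    in the live frame as the user moves back."""
    _, live_descriptors = orb.detectAndCompute(live_frame_bgr, None)
    if reference_descriptors is None or live_descriptors is None:
        return False
    matches = matcher.match(reference_descriptors, live_descriptors)
    good = [m for m in matches if m.distance < max_distance]
    return len(good) >= min_matches
```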
And so that's what Easy Snap is, and here are some examples
of pictures taken without and with it in a randomized study.
We had people take pictures of random objects in our labs,
so blind people brought into our labs, taking pictures
of objects.
We actually crowdsourced whether
or not the pictures were good, because it's kind
of a subjective measure in and of itself.
You know, taking a picture of the tea kettle,
which one has the tea kettle better centered.
And we got okay results; we did better
with feedback, right, so 57 percent
of the pictures taken with Easy Snap were preferred
over the ones without. Probably a lot more room
to improve there.
There's one other example of how we're kind
of combining this idea of human computation
with computer vision.
We've noticed that a lot of the questions
that people ask are actually motivated
by finding the right thing. So finding, say, the right can
in the cabinet, where you have a bunch of cans
that all feel the same, are about the same weight,
and you're looking for the corn or whatever else,
or you're in the cereal aisle and you're looking
for the right box, the box of Wheaties, say.
We built a prototype
where essentially the user takes an overview picture
of all of the objects and speaks what they want to find;
in this case we're looking for the Wheaties.
We crowdsource just the identification of the Wheaties,
so the Crowd worker just draws a box around the Wheaties,
and then we build a computer vision descriptor that's
specific to that box of Wheaties,
in those lighting conditions,
from approximately the same perspective.
That then drives kind of a Geiger counter type of audio interface
to help the user locate the object in space.
And so there's kind of this neat idea where the hard part
of computer vision is done by the Crowd,
but the really interactive piece is still done
with automatic computer vision.
And then you could always, you know, combine them, and use it
as a way to find the nutrition information
of the object we found.
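A sketch of how the Crowd-drawn box and the automatic matching might fit together with Geiger-counter-style feedback; the box format and the mapping to a beep interval are assumptions for illustration, not the app's actual code.

```python
# Illustrative sketch: the Crowd supplies the bounding box once, ORB
# matching does the interactive part, and the matched object's offset
# from the frame center sets the beep rate.
import statistics
import cv2

orb = cv2.ORB_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def template_from_crowd_box(overview_bgr, box):
    """Crop the Crowd-drawn box (x, y, w, h) and keep its features."""
    x, y, w, h = box
    crop = overview_bgr[y:y + h, x:x + w]
    return orb.detectAndCompute(crop, None)      # (keypoints, descriptors)

def beep_interval(live_bgr, template, slowest=1.0, fastest=0.1):
    """Shorter interval (faster clicks) the closer the object is to
    the center of the live camera frame."""
    _, tmpl_des = template
    live_kp, live_des = orb.detectAndCompute(live_bgr, None)
    if tmpl_des is None or live_des is None:
        return slowest
    matches = matcher.match(tmpl_des, live_des)
    if not matches:
        return slowest
    xs = [live_kp[m.trainIdx].pt[0] for m in matches]
    ys = [live_kp[m.trainIdx].pt[1] for m in matches]
    h, w = live_bgr.shape[:2]
    # Normalized distance of the matched points' median from the center.
    off = ((statistics.median(xs) - w / 2) ** 2 +
           (statistics.median(ys) - h / 2) ** 2) ** 0.5 / (w / 2)
    return max(fastest, min(slowest, fastest + off * (slowest - fastest)))
```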
All right.
So that's VizWiz, but you know,
this talk was about real-time Crowd support,
and so that's what this next project is about,
which we call Legion.
And the idea with Legion is to outsource control
of existing computer interfaces, so the computer interface
on my laptop here or on my desktop, to the Crowd,
and this is very different than what people have done
with Crowd sourcing before,
where normally it's pretty asynchronous.
Even, you know, with VizWiz we had that delay of 27 seconds;
now we're really targeting these real-time control tasks
from Crowds, so Crowds formed of dynamic groups
of varying reliability, and somehow we need to make it
so that Crowds can both be real time but also reliable.
So these seem like a natural tension, but Legion helps
to make this possible.
So the idea is essentially, well, wouldn't it be great if,
when I'm having trouble with my computer,
say I'm a person who's blind and some part
of my software is not accessible, I could get help.
Or, one of the things
that actually motivated us was we were trying to figure
out if there was a way to apply Crowd sourcing
in assistive robotics.
So imagine controlling a robot from the Crowd.
And so all of these things we kind of captured
as controlling an existing interface
on your existing computer, and so the idea would be
to have a kind of remote desktop setup, but where instead
of having a single worker, a single person,
controlling the interface,
we actually replace the single worker with a Crowd and have
that Crowd collectively control the existing interface.
And so really the challenge here is we have all of these people
who are trying to control the interface, you know,
pressing keys, making mouse clicks, and then we have
to have something that takes in all of these simultaneous inputs
and figures out a way to intelligently merge them
and send one signal stream out to your existing interface.
I'll show you an example of what this looks
like for end users first, so how would you
as a user set the system up; quick little video,
media not found, oops.
Well, essentially what you do is you describe what you want
somebody to do with your task, and then you draw a box
around your existing desktop to say what part
of the screen, essentially, the Crowd can have control of.
And now if that one didn't work, this one's probably not going
to work either.
But this is what the worker interface would look like.
In this case we've crowdsourced control
of a webcam robot, so a little remote control webcam.
The Crowd sees this view of the webcam on the robot,
they're given instructions on what keys they can press,
in this case just up, down, left, and right, and space
to indicate they don't want to do anything.
We give them minimal feedback.
It's 5:00.
It's great technology, right, so I feel like I'm in Star Trek.
So anyway, we give them minimal feedback on kind
of what's happening. One tension here, if you think
about the usability of something like this, is that
if there are all these people all simultaneously trying
to control this one interface,
it means that the system is necessarily not listening
to everyone.
And if you're not being listened to, then you don't get feedback.
So if I say robot turn left and the robot does nothing
or turns right, then I might think, well,
the system isn't even listening to me.
And so we give them some minimal feedback
about whether they're being listened to,
or the Crowd is being listened to.
>> [Inaudible]
>> In this case, in this particular case,
we've set a goal, and the goal is to drive the robot into a kettle
that we've set up in a little pen in our lab,
and this is basically just a really boiled-down
navigation task.
So there are a lot of great stories
about what happened before we had our pen,
about the robot being controlled by the Crowd,
and how it was really fun just to kind of wander around the lab,
looking at people and exploring, so there are other problems
that happen when you include real people.
Well, I'll see if this video plays.
It doesn't.
I could open it up, but I think for time I'll just wait.
And just describe how we do the merging, because I think
that from the kind of computer science perspective
from these perspectives that's kind
of the most interesting thing.
So I mentioned the system Legion disguises the Crowd
as a single user, and so in essence what we need
to do is take these users as they provide input
and somehow merge that together.
And so what I'm going to do is go through five ways
of merging this input, five input mediators,
is what we call them, and these are going to be in order
from kind of dumb to increasingly smart, so bear with me.
The first is we could just kind of forward it all, right,
so as people type something we just combine it,
we just merge it into a single stream.
We call this the mob input mediator,
and as you'd expect it doesn't work all that well,
because it's kind of just chaos.
This was kind of chaos
because each individual had their own plan potentially,
but we were just kind of mashing that all together.
The other dumb thing we could do
is just pick an individual Crowd worker,
and this is what we call solo.
The problem with solo is you might get a malicious Crowd
worker, or that Crowd worker might just decide to disappear.
I mentioned that this Crowd is dynamic, so these people come
and go, and the person that you picked might just leave
and never come back.
So that's not very smart.
So one level on top of solo,
which is a little bit smarter, is active,
and with active essentially we pick a random Crowd worker
to control the system until they stop controlling the system,
and then we pick a new one.
So here we've picked a random Crowd worker and partway
through we'll pick somebody else.
That gets us a little closer, but it still doesn't address
that problem of quality.
So you get a bad apple in there, the upside-down red guy
as I like to call him, and it doesn't matter
if other people are providing good input, you might have just
randomly selected the bad guy.
And so you might think, well, the next thing: this is
like a lot of people who all have their own ideas
and they're all providing input,
so it sounds a lot like democracy.
So why don't we just take a vote? We could vote, and what
that means is that we actually delay our input a lot,
and the reason for that is that to vote you have to divide time
up into time segments over which the vote is taken,
because there's no other way to easily align these inputs.
And so regardless of how long this is, no input is output
until the end of the time segment.
So the vote has to take place over, say, a second in our case,
and output is only forwarded after that second.
And so the final variation of this,
which is what we call leader, uses the vote not
to select the input to forward on but actually
to elect a temporary leader who assumes full control,
and so the idea is that using Crowd agreement we can actually
select an individual Crowd worker who for the moment seems
to best represent the rest of the Crowd, and forward along
that person's input directly.
And so that means that that person has full control,
but if they go against the Crowd wisdom they'd be dethroned
and someone else would be elected.
So we have this idea of kind of balancing the tension
between reliability and real time.
Now certainly, once somebody was elected they could do
something malicious, but we assume this won't happen
because we give them no clue
that they're actually the leader.
Whether that always holds is a question to consider.
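A minimal sketch of the leader mediator as described, under the simplifying assumption that input arrives already grouped into one-second windows as worker-to-key dictionaries; Legion's actual implementation is not shown in the talk.

```python
# Illustrative sketch of vote-based leader election.
from collections import Counter

class LeaderMediator:
    def __init__(self):
        self.leader = None

    def merge(self, window_inputs):
        """window_inputs maps worker_id -> the key pressed this window.
        Returns the single key to forward to the interface."""
        if not window_inputs:
            return None
        majority_key, _ = Counter(window_inputs.values()).most_common(1)[0]
        # Dethrone a leader who left or who goes against the Crowd.
        if (self.leader not in window_inputs or
                window_inputs[self.leader] != majority_key):
            self.leader = next(w for w, k in window_inputs.items()
                               if k == majority_key)
        # Forward the leader's own input; in the full system this happens
        # continuously rather than once per window, which is what recovers
        # the latency lost by plain voting.
        return window_inputs[self.leader]

# Example: three workers, one of them misbehaving.
mediator = LeaderMediator()
print(mediator.merge({"a": "left", "b": "left", "c": "right"}))  # left
print(mediator.merge({"a": "up", "b": "up", "c": "right"}))      # up
```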
All right, so just a simple experiment
that we did: we set up this task.
This looks deceptively simple, so we have this robot here
on one side of our pen, and we told them
to drive over to this teapot.
The problem is this robot was abused by undergraduates
for several years and veers severely to the left,
so it's actually very hard to control this robot.
But nevertheless, we were able
to take this rundown, cheap webcam robot
and turn it into a robot that basically responded
to natural language commands in real time.
And we were able to compare our various input mediators
on this robot.
So this chart represents the completion,
the average completion time over 10 trials for each one
of the input mediators, and just several things to point out.
First solo was actually the fastest, so I was not
so positive about solo, but it actually worked out pretty well
because if you happened to get a good worker,
the good worker could actually complete the task pretty quick.
These numbers over the top of each bar, however,
represent the number of successful trials, so it turns
out that only four of the 10 trials were successful
with solo.
Mob was more successful but it took longer
and vote took even longer, active worked pretty well
for this particular task because eventually you'd find a worker
that would be able to drive you to the goal,
and then finally the leader in input mediator was faster
than the rest and actually completed all the trials.
Here's a visual of what this kind of looks like.
These are just chosen to be what we thought
of as representative. With solo, somebody gets control
of this, and in this case they kind
of think they're going straight, they just veer off
into the wall, and then they're like, what happened, and quit.
With vote, there's a lot of just back and forth
about what should be done, so there's a lot
of thrashing in the vote, right,
and then finally they kind of figure out a strategy
and go, but it takes a while.
In active, it works out pretty well in the beginning,
and then however this person disconnects
and somebody else just thinks they're going to mess around
and starts exploring around the area of the teapot
and never actually drives to it,
so we've got a bad worker in that case.
And then leader actually works pretty well, so it more
or less goes toward the tea kettle.
And this is kind of what we observed, just qualitatively,
across the different trials.
So this idea is pretty interesting.
It allows us to do pretty cool stuff.
Just one example of what we thought was interesting
is it allows us to create what we call desktop mashups.
So in this case we're trying to do something very similar
to that setup with the existing webcam robot,
except what we've done is we've taken this little Scribbler
robot, which can be controlled by this terminal program here,
and we put an iPod Touch on it,
and that iPod Touch is running Skype.
And so we have the Skype program located next
to the terminal program,
we outsource both of these to the Crowd,
and now we've taken a robot,
which previously could not have been controlled remotely,
let alone intelligently, and we've turned it into, you know,
again the natural language following robot
controlled by the Crowd.
So this is kind of neat; you can group
these applications together
and create these new kinds of mashups.
In general what I'm excited about, and what we're working
on now, is applying this idea of Crowd support,
real-time Crowd support, to other problems.
We've been looking at how the Crowd can
label activities in real time, so we have an ongoing project
at the University of Rochester where we're trying
to identify what people are doing in their home.
It's actually targeted at providing support for people
with cognitive impairments.
We have an extension of VizWiz
in which we're sending a video stream out to the Crowd
and then we're trying to get them
to basically have a conversation with the user.
And what does that mean, right?
So basically, can we adopt this idea of converting this Crowd
into a single person, so that you can have a single consistent,
coherent conversation with a Crowd?
And in our case we're applying
that to helping a blind person navigate or understand more
about what's around them. And then also audio description:
we've talked about description
of visual information, but what about audio information
for people who are deaf or hard of hearing?
The activity recognition problem we actually just last night
submitted a paper about to Pervasive, and it's kind of neat.
The interface changes a little bit, but it's very similar.
We show workers a live video feed of what's going on
and we ask the Crowd to just type,
in natural language, the activities that they see.
Using slight variations
on the same input mediators that we talked about before,
we're able to extract a consistent set
of activity labels from the Crowd, and we use those
in turn to train an HMM.
So in real time, in a really deployable way, we're able
to do activity recognition for tasks we've never seen,
in situations we haven't explicitly trained the
system for.
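A sketch of the label-extraction step only, assuming simple normalization plus a majority vote per time window; the actual mediator and HMM training from the paper are not shown here.

```python
# Illustrative sketch of merging free-text Crowd labels per window.
from collections import Counter

def normalize(label):
    # Collapse minor wording differences ("Washing dishes!" -> "washing dishes").
    return " ".join(label.lower().strip(" .!?").split())

def window_label(crowd_labels):
    """crowd_labels: free-text labels typed by workers during one window."""
    if not crowd_labels:
        return None
    counts = Counter(normalize(label) for label in crowd_labels)
    return counts.most_common(1)[0][0]

# One merged label per window; the resulting sequence is the (noisy)
# training data that a sequence model such as an HMM could learn from.
windows = [
    ["washing dishes", "Washing dishes!", "cleaning"],
    ["cooking", "making food", "cooking"],
]
print([window_label(w) for w in windows])   # ['washing dishes', 'cooking']
```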
All right, so I am out of time anyway, and so in conclusion,
you know, human computation is really interesting,
and I think it's even more interesting when we apply it
in real time or nearly real time.
This whole approach builds
on how people with disabilities get answers now.
It just puts them in more control;
it allows more ubiquitous access to that human support.
It enables us to build quick deployable prototypes, you know,
deployable Wizard of Oz, that sort of thing.
And I think this idea that I talked about with Legion,
where we're turning this dynamic group
of unreliable people
into perhaps a more reliable single user
that can use our existing interfaces, is the new model
for HCI that I'm excited to explore,
for the problems I talked about and maybe more.
So thank you very much and thanks again for having me.
I'll take any questions that you have.
>> My question is about Legion.
[ Inaudible ]
>> Yeah, so I think on average the leader changed three times
in our task.
I didn't say how long this took.
As a baseline, it took some undergraduates, who we told
to do this as quickly as possible,
46 seconds to drive the robot
from the beginning to the kettle.
And so over that time we actually had the leader change,
and it took the Crowd a little longer,
as you saw from the chart there,
a bit longer, so 100 seconds for the leader.
It changed, I think, about three times, as I remember.
>> [Inaudible] what do you think that the sort of down side?
You know, can you imagine any [inaudible]
>> Instability in the sense of?
>> Say you can sort of an example here.
[Inaudible] I'm just saying what is the down side to this?
[ Inaudible ]
>> There are all kinds of limitations
in this particular setup.
I mean, one is privacy, so you have to be pretty careful
about what you actually send out to the Crowd.
The way we deal with this now is you get to choose the part
of your interface that you send,
and you actually whitelist the commands
that they're able to send.
The other thing, I guess,
that's a potential issue is you could imagine malicious workers
somehow figuring out how to game the system, gain leadership,
and do something undesirable.
So there are these sorts of things.
I didn't talk about it here, but we actually ran a
very similar study and got similar results,
where the leader also came out on top. We took a picture
of a hand-drawn table, like on a whiteboard, and we
drew a box around that and also around part
of a spreadsheet program, it's actually Google Spreadsheets,
and we asked the Crowd to transcribe the table
into the spreadsheet. So it is actually robust
across different types of applications,
but I'm sure there are limitations, and we're only beginning
to explore, you know, what those limitations are
and to what extent we can get around them.
>> [ Inaudible ]
Is it possible to come along the lines with
[ inaudible ]
>> Yeah, well, so in general that's been somewhat
of a problem on Mechanical Turk.
I mean, there's a study out of UCSD that showed,
I think, that the captcha market actually makes it
so that Mechanical Turk actually seems expensive;
there are actually captcha farms out there
that will solve, I think it's a thousand captchas
for like 80 cents or something.
So it's not clear that this would be the most efficient way
to do this.
We have had people actually send in pictures of captchas
like from the phone and to the best we can tell those are
actual users who have hit upon a captcha they can't solve.
And so I usually solve them.
But, you know, there's perhaps that potential,
and we get asked a lot
about whether we've seen malicious users
and what we would do if there were malicious users.
The good side about people kind of answering these questions is
that they can apply this high-level reasoning, potentially to
say, well, maybe this isn't the best question to be asking,
maybe you want to look at this user.
And so we can block particular users, for instance,
and that's probably what we would do if it seemed
as though people were trying to do that.
>> [Inaudible] VizWiz you had that
[ inaudible ]
recruit at real time being interactive or [inaudible]
>> Yeah, and so this is exactly what I was getting
at with our future work.
The reason we started with the first version of VizWiz
with just images was actually very much a trade-off
between what we thought we could actually deploy,
because of the current bandwidth
that people have access to,
versus what we would really like to do.
Certainly people who use VizWiz want a more interactive
experience; they want to ask follow-up questions,
they don't want to have to take like five pictures
of a box before they get the right side to show the label,
they want to kind of just do that in real time.
So on the one hand it was the technical constraints:
we didn't want to have too high of a bandwidth requirement
because we didn't think
that people had ubiquitous access to that.
In fact we see it even with the pictures now:
we are told that our cell connections are one thing,
but what we observe
in reality is actually much flakier and slower than that.
On the other hand, when we were developing VizWiz we hadn't
thought a lot about what it would be like to have, you know,
with my little quote here, a conversation with the Crowd,
and we didn't think it would be a great idea to pair a blind user
with a random Mechanical Turk worker,
for all of the reasons that we think the solo input mediator
wasn't a great idea. So certainly we want to move
in that direction, and that's what this VizWiz stream,
or whatever we end up calling it, is designed to do,
because exactly what you described is what blind
people want.
But I think it takes a little work to actually get there,
technologically, and then also just figuring
out the right interaction.
Yeah.
>> I mean the whole original sort
of online Crowd is [inaudible] where you had some
of the same kind of concerns about [inaudible] early days
of experts putting the content there and the quality in it.
As soon as you don't then you might blow quality
but the advantage there is that it's so asynchronous
that over time you expect problems
to be corrected or whatever.
And if not, at least in the short term, you know,
it seems like you could still use that.
So if the person, I mean it doesn't make it unusable but,
you know, I might be concerned if the robot
or whether it's a blind person getting a response
and [inaudible] responses at all the same thing or, you know,
ways to verify the content coming back
or if it's high quality or if it's more reliable or whatever.
>> Yeah, I think there's a lot of things
that we could do there.
Certainly we made one choice based on, you know,
one optimization of the constraints
of this type of a problem, right?
So our perception with VizWiz was that most
of the questions that the blind people we talked to wanted
answered were of the type where we thought it would be fine
for them to kind of make the judgment
of whether the answer was correct or not, and so
we would prioritize sending an answer back quickly versus,
you know, making sure the answer is correct before we
did anything.
Certainly you could make different choices and maybe
for certain types of questions you want
to be really sure for instance.
The user can, in fact, kind of do that themselves
so once they get their three answers they can kind
of take a vote right?
So the user can actually do the vote.
We've talked about other mechanisms,
so for instance, what do you do when you have a bunch of workers
and you want to ensure they are doing a good job?
Well, you add middle management, arguably.
You could actually have,
and this is what a lot of other systems do, right,
workers who evaluate the work of others.
And so this is a completely reasonable thing to do.
We can do that in VizWiz, and we can also do that in Legion and all
of the follow-on projects we have imagined for that,
where in real time you could actually have people observing
other people's real-time answers and just kind of giving a yes,
yes, that's good, or no, no, that's bad, right?
And there's a little question there
about what the interface would look like,
how do you actually keep people paying attention and that stuff.
But it's certainly something that we could set up.
>> Do you see [inaudible] the Crowd first, that would be,
I mean, you could expand the number
of people giving the correct answers more quickly [inaudible]
before you have to [inaudible]
>> Yeah, right.
Well, yeah.
It's hard to tell what the best approach is.
The good thing is, people often think,
and I think this is getting less common, but when I first started talking
about VizWiz people were very concerned
that there wouldn't actually be enough people on Mechanical Turk
to answer our questions.
It turns out that it's not a problem.
We've very consistently been able
to attract pretty big Crowds quickly.
There was a paper at UIST
by Michael Bernstein at MIT; he was able to get Crowds
of five people in two seconds.
So you can get these Crowds quickly,
and there are all kinds of models for how you do that.
Essentially he paid people to wait.
We had a different model where we kind of queued
up the workers and had them iterate over old tasks.
But there are all kinds of things you can do
to get these workers quickly.
You can get big Crowds,
and really the question then is what you do with them,
and that's really, you know, an algorithms question, right?
So people talk about Crowd algorithms:
how do you get the best result out of these kinds
of noisy black boxes, and it's really about how you structure that.
It could be middle management, it could just be, you know,
a redundancy thing; there are a lot of options.
>> Great, well, I think we'll take any other questions
offline.
>> All right.
Thank you again.