Uploaded by Google on 25.07.2007

Transcript:

ADEL: So, thank you for coming today.

Today's talk is about GPS tracking, and what you can do with information collected from GPS.

Our speaker today is John Krumm, from Microsoft Research.

John graduated from CMU in 1993. He got his PhD in robotics, and he worked at the Robotics Center at SNL for four years, in object detection and, I think, detecting individuals inside cars.

And then he joined Microsoft Research in 1997.

And since then, he has been working on location determination and location-based services.

He has worked on many systems, actually. SmartMoveX, which is RF-based location determination. Locadio, which is based on radio signal strengths. And SPOT, which is based on FM signal strengths.

So without any further ado, I introduce John.

Thank you for coming, John.

JOHN KRUMM: Thanks.

Thanks Adel.

So the reason I know Adel is he was a summer intern of mine at Microsoft, a few years ago-- one of my favorite ones.

And I was going to be down here, and I got in contact with him, and I said, I just want to see if you really have free lunches here.

And so he said, he's going to take me to a free

lunch after my talk.

So we'll see if it's really true.

This is not really part of the main outline of my talk, but I

said that Adel was a summer intern with me.

And I just want to tell you what he worked

on, and what he did.

Microsoft had this product-- we still do-- called the SPOT watch, which gets updates on weather and dining and traffic and movies over sidebands of commercial FM radio stations.

And the SPOT team came to us, and said, could you make that

watch location aware?

It's got this FM tuner in it.

We can tune to any FM station, and measure how strong the

signal strengths are.

So what Adel did-- he did a really nice job, and wrote

this paper on making that watch,

actually, location aware.

So it kind of knows where it is, just based on measuring

signal strengths from commercial FM radio stations.

So that was a cool thing.

AUDIENCE: That actually works in production?

JOHN KRUMM: No, it's not in the product.

AUDIENCE: So I mean, does it have a database of where all the stations are?

JOHN KRUMM: Basically.

What it does is it looks at their fingerprints.

As a function of latitude/longitude, what are

the set of signal strengths that you are going to see?

So that's the database it's--

AUDIENCE: And it knows where you're from?

JOHN KRUMM: --based on.

No, it actually doesn't.

You do wardriving beforehand to get the ground truth.

So what I'm going to talk about, the rest of my talk, is

projects that we've been doing on top of some GPS data that

we've been collecting.

So we went out, and we bought 55 of

these Garmin GPS receivers.

And we're loaning them out to volunteer drivers, to leave on

the dashboard of their car for two weeks or

four weeks at a time.

They record up to 10,000 timestamped latitude/longitudes.

And so far, we've had 227 people go through the program,

1.77 million points, and 12,500 separate trips.

And down below, there, you can see what some of the data

looks like.

So these are mostly all Microsoft employees, so

they're all, usually, around the Seattle area.

In fact, that's Microsoft, right there, so it's

particularly dense there.

And when we started with these GPS receivers, we didn't really know, exactly, what we were going to do with them.

But since then, we've thought of some research projects on

top of them.

So I'm going to talk about those.

And the first one is destination modeling.

What we have here-- these are the destinations of

people in our study.

So what we do, is we segment the GPS into discrete trips,

and the last lat-long of each trip is the

destination of that trip.

And we looked at, well, what can we say about people's destinations?

One thing, is we can look at the types of places they seem

to like to go.

So the map on the right there is a US geological survey map

of the Seattle area, where each 30 meter by 30 meter

square on the ground is categorized into 1 of 21

different ground cover types.

And what we did, is we looked at the type of ground cover

that people tend to like to go to.

And it turned out that commercial and low-intensity residential are the most popular ones.

And as you'd expect, emergent herbaceous wetlands and woody

wetlands, which, I think, is another word for swamp,

basically, are not really all that popular as destinations.

And then, a little later on, I'll show how we actually use

this to do something useful.

AUDIENCE: Is the fact that high intensity residential shows up so low-- would that indicate that it didn't get a signal there?

JOHN KRUMM: Let's see.

Oh yeah, this one here?

That's a good point.

Yeah, maybe that's what happened.

Because that's where the urban canyons would be,

and things like that.

AUDIENCE: Yeah, surprised it has that much in the forest

areas, considering how poorly GPS usually does when there's

a lot of trees around.

JOHN KRUMM: Right.

Yeah, yeah.

We also looked at how people behave, as a function of time.

So this top plot, here--

it's as a function of a 24-hour clock.

And here are the number of trips, per week, that people

make, versus the time of the day.

So there are not too many trips in the

middle of the night.

You see quite a few, here, in the morning--

maybe people commuting to work.

And down here, in the evening-- maybe people are

going home or going out to restaurants, like that.

So this kind of makes sense.

Then you can also look at how many of these trips are trips

to new destinations, that we haven't observed the person to

go to before?

And that peaks right around the middle of the day.

You can also look at how those things vary, as a function of the day of the week.

So that's what we have over here, the days of the week.

And here are the number of trips taken per week.

And so it turns out, the most trips are taken on Thursday.

And I'm not really sure why that would be.

And then you can look at the number of new trips-- trips to new destinations where the person hasn't been before.

And that peaks kind of as you'd expect, on Saturday,

when people might be going off to try new things.

But it's also pretty high, here, on Thursday.

And then, finally, on this destination modeling thing--

we looked at how fast do people, generally, fall into a

rut of not visiting new destinations at all.

In general, on the first day, when we first gave people a GPS, they visited a little over three and a half destinations.

And then, we looked at how many destinations they visited

on the second day, that they hadn't visited

on the first day.

And it's down here.

A little over one and a half, and it goes down

exponentially, kind of like that.

So people fall into a rut, sort of.

After about two weeks, they've come fairly close to hitting a

steady state.

We also have demographics on all these people.

So we were interested in, does this rate of decay vary with

the demographic makeup of the people who are in the study?

And we found no difference between single people and

people with a partner, children versus no children,

extended family nearby or not.

There was no significant difference.

What we did find--

a small, but statistically significant, difference

between men and women.

And women's rate of decay is slightly higher than men's.

They go down, and hit the steady state

a little bit faster.

AUDIENCE: I have a question.

In the sample, were there people driving as well as

public transportation and--

JOHN KRUMM: This is all, pretty much, driving, because

they just put it on the dashboard of their car.

Yeah.

AUDIENCE: On what day of the week were the GPS

units issued to people?

I'm curious if it was Thursday.

JOHN KRUMM: No.

It's totally random.

We just sent them out via the mail.

So whatever day they get there.

AUDIENCE: With the bias of new trips, especially, every

trip's a new trip when you first get it.

Was that accounted for?

JOHN KRUMM: No we didn't account for that.

We knew that the starting dates were all random, so we

were hoping the randomness would average

out to general behavior.

But you're right.

If like, Saturday was the second day, here, then you

probably wouldn't see quite the same behavior.

So I don't know why women decline a little bit faster than men, and I don't want to speculate on it either.

So the next project was one called Predestination.

And this is where we're trying to predict where you're

driving to, as you drive.

So you get in your car, start on a trip, and can we predict when you're going to stop and get out of your car?

And the reason why you'd ever want to do that, is

anticipatory information.

If you could warn someone of an upcoming traffic jam,

before they get there, that would be useful.

And of course, you can get traffic incident data.

But you can only get in a radius of where you are, right

at a certain point.

And if we could actually tell you that you're coming up on something, but not tell you about the things that you've left behind, then that's less distraction for the driver.

Location-based advertising is another one.

No one wants to hear about a sale on bananas as they drive 40 miles an hour past a grocery store.

But at least in the Seattle area, we've got these

third-party parking lots around the airport.

So if 10 minutes before I got to the airport, I could get an

ad that said there's a discount on parking at some

place, then I actually have time to act on that ad, and I

might actually welcome it.

And then finally, this is kind of an interesting application

for hybrid vehicles.

The way most hybrid vehicles work, is they have a gas

engine, that can charge up a battery.

And the vehicle has to decide when to turn on the gas

engine, to do some charging.

So if a car's coming up a hill, say, it might find that

the battery's a little bit low, and say, well, I guess I

need to turn on the gas engine, to charge the battery

at this point.

But if the car knew that it was going to have a good charging opportunity, by going down a long hill, then it wouldn't, necessarily, have to turn on the gas engine before it gets to the top.

And so, it could be made more efficient that way.

So our basic approach to this, to predicting where you're driving, is we split up the region into one-kilometer cells.

And we compute a probability, for each one of those cells, as to whether or not that's your destination.

And then that's just a regular, old, discrete probability distribution, where they all have to sum to one.

And so now, we get back to that ground cover model, that

I was talking about.

So that's the same bar graph over here.

And if you apply those normalized probabilities, to

the ground covers that are actually inside those

one-kilometer cells, then this is the prior probability

distribution that you get.
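The step he's describing, turning per-cover popularity into a prior over one-kilometer cells, can be sketched in a few lines. This is illustrative only: the cover types, popularity weights, and cell contents below are made-up values, not the study's data.

```python
# Hypothetical popularity weight for each ground-cover type.
cover_weight = {"commercial": 0.30, "low_res": 0.25, "forest": 0.05, "wetland": 0.01}

# Each 1 km cell holds counts of its 30 m squares by cover type (made up).
cells = {
    (0, 0): {"commercial": 600, "low_res": 400},
    (0, 1): {"forest": 900, "wetland": 100},
}

def ground_cover_prior(cells, cover_weight):
    """Score each cell by the popularity of the ground cover inside it,
    then normalize so the scores form a discrete distribution summing to 1."""
    scores = {}
    for cell, counts in cells.items():
        total = sum(counts.values())
        scores[cell] = sum(cover_weight[c] * n / total for c, n in counts.items())
    z = sum(scores.values())
    return {cell: s / z for cell, s in scores.items()}

prior = ground_cover_prior(cells, cover_weight)
```

With these made-up numbers, the mixed commercial/residential cell comes out far more probable than the forest-and-wetland cell, mirroring the bar graph.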

And you can see that the probability of ending in a

lake is pretty low, which is good.

And also rural areas are pretty low.

This is a rural area, out here, but you do see some

little pops, here, by these old towns

and these rural areas.

So this kind of makes sense.

We also looked at a prior on personal destinations.

So the easiest way to predict where someone is going, is to

look at the places that they've been before.

And so here are the destinations of one person in

our study, so those cells would have a higher

probability.

We also looked at something to try to make this work a little

better out of the box.

So each image, here, is a 40 pixel by 40 pixel image, where

each pixel corresponds to one of those cells.

And on day one, this person visited these cells, day two,

these, and on down to day 14.

So you can see how the destinations for this person

evolved over time.

Now if we're doing predictions for day two, based on day one,

we're never going to get that destination up there, because

a person hasn't visited it before.

So we wanted to try to account for the fact that out of the

box isn't, necessarily, going to work all that well.

So what we did, is we tried to anticipate what the

probability distribution would be on this day, based on just

the days that we've observed.

And we noticed two things-- that new destinations sometimes sparkle.

So they'll just pop up, where there wasn't any one before,

and not nearby any other ones.

They also tend to cluster so you can see some clusters of

destinations in through here.

So what we do, is we add a little bit of background probability, to account for that sparkling.

And we blur the destinations a little bit, with a wedding

cake model like this, which tends to model that

clustering.

And as time goes on, this is the effect of the background

and that wedding cake model.

They drop down to zero, because the steady state is

being approached.

So after about two weeks, here, again, we're almost at

steady state.

So this red line means that we're really paying attention

to just the places the person has actually gone, and not

trying to modify the probability

distributions anymore.

The third thing we looked at for predicting where you're going-- and then these all get glommed together at the end-- is efficient driving.

And what we're trying to model, here, is the fact that, if you're driving along here, say, and you head north, chances are that you're not going anyplace south, like this.

And so what we did, is we looked at the trips people are

actually taking, and kind of learned how efficient most

drivers are.

And they're not 100% efficient getting to their destination,

but they're fairly efficient.

So for a candidate destination, one of those squares, we could say, if the person has passed up a lot of opportunities for an efficient route to that cell, then the probability of that cell has to be pretty low.

And here's how that looks.

Here's a person who started a trip, and they've gone four

squares south.

We've recomputed the probabilities, just based on efficient driving.

And you see this, basically, upper triangle has been

eliminated, where a darker outline means a higher

probability.

And they continue their trip down here, and now the only

probabilities that are really non-zero are down here.

So that's doing a reasonable job, it seems like.
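The talk doesn't give the efficiency formula, so here is one plausible way to score a candidate cell: compare an efficient (here, straight-line) distance from the trip start to the cell against the distance actually driven plus an efficient finish. The `sharpness` exponent is an invented tuning knob, not a value from the work.

```python
import math

def efficiency_weight(start, current, cell, driven_dist, sharpness=8.0):
    """Down-weight a candidate destination cell when the path driven so far,
    plus an efficient finish, is much longer than an efficient route from
    the start. Hypothetical scoring, not the paper's actual model."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    best = dist(start, cell)               # efficient route, start -> cell
    implied = driven_dist + dist(current, cell)
    if implied == 0:
        return 1.0
    efficiency = best / implied            # 1.0 = perfectly efficient so far
    return efficiency ** sharpness
```

A cell straight ahead of a driver who has headed directly toward it keeps a weight near 1, while a cell behind the driver, whose efficient routes have all been passed up, drops to nearly zero.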

Then finally, a very simple one.

We look at trip times. You get in your car here, and you're probably not going to drive to Chicago.

You're probably driving between 0 or 40 minutes, or

something like that.

So that just tends to bound how far away the

destinations are.

And I'll skip the math here, but this is just to say that,

we put this all together in Bayesian probability

framework, so it's all mathematically principled.

And we didn't have to come up with how much we're going to

weight the different components,

and things like that.

It just all fell out of math nicely.

And the result looks like this, and this is the line to

look at, here.

This is the fraction of the trip that's been completed. And this is the median error in predicting the cell that you're going to stop in, where we just take the highest probability cell as the one we think you're going to stop in.

So the error goes down as a function of time, as you'd expect. And at the halfway point, our median error is about two kilometers in predicting where you're going.

So another project that we did was--

Oh, yes.

AUDIENCE: How did you determine when a driver was

stopping-- like that was the destination?

JOHN KRUMM: Yeah, that involved

segmenting the GPS data.

And that's something that we don't know how

to do all that well.

But what we did, is we looked for gaps of greater than five

minutes in the data.

And then we had to eliminate some little, short garbage

trips when we did that.

But yeah, we just looked for five-minute gaps.
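That segmentation rule is simple enough to sketch; the minimum-trip-size filter for the "garbage trips" is an assumed threshold, not a number from the talk.

```python
from datetime import datetime, timedelta

def segment_trips(points, gap=timedelta(minutes=5), min_points=3):
    """Split a stream of (timestamp, lat, lon) fixes into trips wherever the
    time between consecutive fixes exceeds `gap` (five minutes, per the talk),
    dropping tiny garbage trips shorter than `min_points` fixes (assumed)."""
    trips, current = [], []
    for p in points:
        if current and p[0] - current[-1][0] > gap:
            if len(current) >= min_points:
                trips.append(current)
            current = []
        current.append(p)
    if len(current) >= min_points:
        trips.append(current)
    return trips
```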

AUDIENCE: OK.

So they weren't instructed to push a button for start,

versus stop.

JOHN KRUMM: That's right.

Yeah.

The way the GPSs are, we had to modify them a little bit,

but they just set them on the dashboard of their car, plug

them into the cigarette lighter, and then just forget

about them for the whole two weeks.

They don't have to touch them again.

AUDIENCE: Well, if it's plugged into the cigarette

lighter, then you could tell, from the voltage, whether the

engine's running or not.

JOHN KRUMM: I guess you could, if the GPS could measure that,

but it doesn't.

Some car cigarette lighters stay on when the key's off, and some come on--

AUDIENCE: I know, but it'll drop to just the battery

voltage, whereas when the engine's running, it goes up,

because it's the alternator voltage.

JOHN KRUMM: I see.

OK.

Yeah, right.

For the next project I want to talk about, it requires this,

as a prerequisite.

And this is snapping to a road, which turns out to be a deceptively hard problem.

We have all this GPS data, and we need to know which road the

person was actually driving on.

And the most common way to solve this problem, is just to

snap the GPS to the nearest road.

But that causes a disaster sometimes.

So for example, here I've got these three GPS points--

one, two, and three.

And what we want the output of the program to be, is the fact

that you were driving on this road, here, even though this

GPS point didn't fall on a road, or even

that close to it.

And it fell fairly close to some other roads as well.

If you just do a nearest road snapping, you get a problem.

Here are three GPS points connected by this black line,

one, two, and three.

If you just do snap to nearest road, that one snaps correctly, and that one snaps correctly.

This one, here, instead of snapping to the east-west

road, snapped to the north-south road.

And what happened, is the algorithm, then, fills in the

roads that didn't have a GPS point.

So it reasons that, well, if you were on this north-south

road, and next you were over here, you would have had to

drive north, to the first U-turn opportunity, turn

around, come back down, and get over here.

OK?

So the snap to nearest road, when you do the fill

in, fails like that.

And so what we did, is instead of looking at just the

locations of the GPS points, we looked at the

timestamps as well.

So point number one, here--

this is just a pretend scenario-- snapped here, point

number two snaps here.

And the question is, where does point number three snap?

Should it snap over here?

Or over to there, instead?

And if you just do nearest road, then it

ought to snap here.

But what we can also do, is we can just run a little routing

algorithm, and predict how long it would take to drive

from here to here, or how long it would take to drive from

here over to there.

We can see which one of those times best matches the

timestamps that we actually got from the GPS data.

AUDIENCE: Can't you guys just use the heading that's published with the GPS feed, to correct for this thing?

JOHN KRUMM: That would be another good clue.

Our GPSs, actually, don't record that, though.

So we just have the timestamp lat-long altitude.

AUDIENCE: Doesn't most commercial software already do

this, including Microsoft?

I think it's called streets and trips.

I mean, how recent is this research?

JOHN KRUMM: Well, this is pretty recent.

And what makes it slightly different from the nav system in your car is that the nav system uses odometry and a compass, as well as the GPS data.

Yeah.

AUDIENCE: Do you have the speed, as well as the position

in the data you're using?

JOHN KRUMM: No.

No.

Of course, you could get that.

AUDIENCE: Because a lot of GPS receivers do record that. I don't know if these particular ones do.

JOHN KRUMM: Yeah, these are really simple, so they don't

record that.

Yeah.

AUDIENCE: So what is the frequency of the GPS recording? Did it depend on just the time, or even the distance from the last recorded point?

JOHN KRUMM: We put them in adaptive recording mode, so we

wouldn't fill up the memory.

So if you're on a long straightaway, going at a

constant speed, it won't record that many points.

But if you're weaving around a lot, then it records more.

The average time between points was six seconds, or 60 meters.

Yeah.

So we had to weight those two errors.

How far can we snap in distance, and how far can we

snap in time?

Basically, how much can we ignore what the time to the

planned route was?

So we actually computed some statistics, assumed that the

spatial error of GPS was Gaussian--

came out to about 7.6 meters.

And we looked at the error in our route planner.

So when you're planning these little short routes, what's

the error between the estimated time and the actual

time, to drive between these points.

And that was a Gaussian that looks, kind of, like this.

So now we could weight those two errors, when we're trying

to find the optimal route.

This ends up just being a hidden Markov model, where

this is the same thing that's used in speech processing.

Because in speech, when someone is talking along, it's pretty easy to antici--

[PAUSE]

--pate what they're going to say next.

So it's kind of like, in this case, we've got these noisy

GPS measurements.

And we have to anticipate what road the person

is driving on next.

So the hidden Markov model just looks at a whole bunch of

different possible road snaps, for every point

in time, like this.

And then picks the optimal path through these road snaps, using the Viterbi algorithm, that minimizes the error in both space and time.

And so an example looks like this.

Here, point one can match to here, here, or here.

And GPS point number two can match to here, here, or here.

So there are three possibilities over here, four

possibilities over there.

So there are 12 possible little, short routes

between these two.

So we, actually, give those to our route planner, and ask it,

well, how long would it take to drive those

12 different routes?

And how does the timing of those routes, the duration, compare to the actual time it took?

And then we do the optimal snap.

So point three, over here, might have

three different snaps.

And then you've got, 3 times 4 times 3.

And the Viterbi algorithm, in the hidden Markov model is

just really good at efficiently finding the best

way through, that minimizes those errors.
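To make that concrete, here is a minimal Viterbi sketch over candidate snaps. It is not the actual implementation: the 7.6 m spatial sigma is the figure from the talk, but the time sigma, the Gaussian cost structure, and the toy inputs are simplifying assumptions.

```python
import math

def gauss_nll(x, sigma):
    # Negative log-likelihood of a zero-mean Gaussian, constants dropped.
    return 0.5 * (x / sigma) ** 2

def viterbi_snap(snap_dists, route_time_errs, sigma_gps=7.6, sigma_time=10.0):
    """snap_dists[t][i]: meters from GPS point t to its i-th candidate snap.
    route_time_errs[t][i][j]: seconds of planned-vs-measured travel-time
    error going from candidate i at point t to candidate j at point t+1.
    Returns one candidate index per point. sigma_time is an assumed value;
    the 7.6 m spatial sigma is the figure quoted in the talk."""
    best = [gauss_nll(d, sigma_gps) for d in snap_dists[0]]
    back = []
    for t in range(1, len(snap_dists)):
        new_best, ptr = [], []
        for j, d in enumerate(snap_dists[t]):
            costs = [best[i] + gauss_nll(route_time_errs[t - 1][i][j], sigma_time)
                     for i in range(len(best))]
            i = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i] + gauss_nll(d, sigma_gps))
            ptr.append(i)
        back.append(ptr)
        best = new_best
    # Trace the lowest-cost path backwards through the pointers.
    path = [min(range(len(best)), key=best.__getitem__)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

In a toy case where the nearest candidate at each point implies a wildly wrong travel time, the time evidence overrides the pure nearest-road choice.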

AUDIENCE: If there's a traffic jam, or something, do you

start assuming they went really far away and came back?

JOHN KRUMM: You could, yeah.

That's a great point, because our router doesn't know about

traffic jams. So it's just assuming clear traffic.

So, yup.

AUDIENCE: You can, actually, take those into account, right? The history-- 10 points in the past, and 10 points in the future-- and that would be a much better estimate--

JOHN KRUMM: Well, yeah.

So this algorithm is--

AUDIENCE: Just not isolated within one or two points.

JOHN KRUMM: Right.

Well this algorithm is, actually, looking at all the

points on the trip, from beginning to end.

So it's kind of a non-causal model, because it's looking to

the future as well.

AUDIENCE: Right.

JOHN KRUMM: Yeah.

So here's one of the results.

The black, again, are the measured GPS points.

Just connect the dots.

This point, here, on the nearest road snapping, for

some reason, snapped to the road down below.

This point snapped to the correct road.

So nearest road snapping said, well, if you're on this highway, under here, and then ended up over here, what you

would have to do, is turn around, come back, take the

nearest entrance to the highway, get on, pick up that

snapped point, and now the next snapped

point is over here.

So I have to keep going, get off the highway, come back

around here, over here, over here, and now, pick up that

snapped point.

But our algorithm cuts through, so it ignores that

big excursion.

And then we just looked at a lot of different examples of where our algorithm works, and the nearest-road snapping doesn't.

So here's a GPS problem.

This point gets snapped here, so nearest road gets pulled

over there, but our algorithm doesn't.

Distractions with crossovers, like the bridge I was just talking about.

Parallel roads-- sometimes the point would get snapped over

the center median, into the oncoming lane of traffic.

Our algorithm doesn't fall for that trick.

Distractions are just little spurs off the road, like this.

Cutting through spaghetti intersections, like this.

The gray is the nearest road snapping algorithm, and the white is our algorithm.

Passing around spaghetti intersections.

Looping through spaghetti-- so in here, there's a lot of

different choices for which road you can be on.

OK.

So the reason we did that, is we wanted to do this next project on personalized routes.

And this is something I did with another summer intern,

Julie Letchner, from the University of Washington.

And what she was looking at is, you know, when you go to a website, now, and ask for directions between two points, you can ask for the shortest route or the fastest route.

But you can't ask for a route that, say, sticks to the roads

that you tend to like to drive on.

And what Julie found, is we looked at the routes that

people actually took in our data.

And 60% of the people were taking neither the fastest,

nor the shortest, routes between two points.

So because these people live in the area, we assume that

they were familiar with where they were driving.

So they must have had some other reason to pick the

routes that they were picking.

In fact, here are four different routes between this

point and this point.

There's the empirically fastest, that paid attention

to traffic.

There's the plan we got from MapPoint, the shortest distance route, and the driver's route.

And all four of those are different from each other.

So what we did, is we looked at two ways of making the

routes better.

Hopefully, to better match what people actually did.

One was to look at traffic, so we could use all our GPS data,

just as traffic probes, for each little section of road,

to see how fast traffic was moving on there.

And we split those up, as a function of 15-minute

segments, over the time of the day.

So given the time of day that you wanted to make the trip,

we could anticipate what traffic would be at that time.

And so here's an example of someone who--

this is just a regular commuter, who starts his trip

here, and ends here.

At midnight, he sticks to the highway, like this.

But during rush hour, he has a trick, and he goes off the

highway for as long as he can.

And then gets on, down here.

The other thing we did, which is the more interesting part,

I think, was we looked at the roads that each person drove

on, and we would deflate the cost of those

roads a little bit.

So when you submit the route-planning problem to your

A*search, or your Dykstra search, it looks at the cost

of the roads.

And we could say, multiply the cost of the roads that you'd

already driven on, by 0.9, or something.

So they're a little bit cheaper, and the router would

tend to prefer those instead.
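A minimal sketch of that cost deflation on top of a plain Dijkstra search, assuming a simple adjacency-map road graph. The 0.9 multiplier is the figure he mentions; the graph representation and the edge costs in the example are invented for illustration.

```python
import heapq

def personalized_route(graph, start, goal, driven_edges, deflate=0.9):
    """Dijkstra over road-segment costs, multiplying the cost of segments in
    `driven_edges` (roads this driver has used before) by `deflate`, so the
    router tends to prefer familiar roads. `graph[u]` maps neighbor -> cost."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if (u, v) in driven_edges:
                cost *= deflate  # familiar road: make it a little cheaper
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

On a toy diamond-shaped graph, marking one side as previously driven flips the chosen route to the familiar one even though it is nominally slightly longer.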

And so we tested this on 2,500 trips, from our GPS data.

And we found that close to 47% of the per slice trips match

the actual routes that people took.

And that was quite a bit better than what we were

getting from just using a standard router.

So I think, that was pretty successful in making the

router more personalized.

The last thing I want to talk about, is location privacy.

And the motivation here, is there are a lot of reasons to send your latitude/longitude to a third party. One of those is congestion pricing.

So toll roads-- but all roads can be a toll road.

And you just get charged, at the end of the month, for

which roads you happen to be driving on, and when you were

driving on them.

Location-based services, we all know about.

Pay-as-you-drive insurance.

So your insurance rates vary, depending on where

and when you drive.

This is an interesting device, the Dash, which was just announced recently. It's a navigation system that you can put on your dashboard.

And it can give you real-time traffic updates.

But where it gets the updates from, is other drivers who are

driving around.

If they encounter some slow traffic, their GPS data is

continually getting sent into a central server, which is

looking for these traffic slowdowns,

based on this GPS data.

And then it can instantly broadcast to the other people

who have these systems. And then,

finally, research purposes.

So a bunch of people ran around London with GPSs and

made this map, just by walking around.

But those are all reasons that you might want to send your

data off to a third party.

And people say that, well, even if this data is anonymized, which is what, for instance, the Dash system uses, everyone has this theory, well, you know, if

someone knew where you go, they could easily figure out

who you are, just by figuring out where your home is.

And then, from that, they could

figure out your identity.

And everyone's got this theory, but no one has tested

it, up to now.

So we were able to actually test it.

So we start off with these pseudonymized GPS tracks. That's just the proper term for anonymizing something by replacing the name with a unique ID.

Then we infer the home location of the people, and

then we do a reverse white-pages look up, to see if

we can get their identity back.

And so we had four simple algorithms, for trying to

infer where this person lives, based on the GPS data.

And the first of these four, was called the Last

Destination algorithm.

And we segmented all our data into discrete trips.

And for every day, we look at the last destination the

person went to, before 3:00 in the morning.

And we took the median of all those, for every day that we

observed the person.

And we assumed that that is the location of their home.

And that was accurate, down to about 61 meters median error.
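A sketch of that Last Destination rule. Exactly how trips that end after midnight are credited to a day isn't specified in the talk, so the day-bucketing below (endpoints before 3 AM count toward the previous evening) is an assumption.

```python
from datetime import datetime, timedelta
from statistics import median

def last_destination_home(trip_ends):
    """For each day, keep the last trip endpoint before 3:00 AM -- endpoints
    before 3 AM are credited to the previous evening (an assumed reading of
    the rule) -- then take the componentwise median over all observed days.
    trip_ends: list of (end_datetime, lat, lon)."""
    last_by_day = {}
    for end, lat, lon in trip_ends:
        day = end.date() if end.hour >= 3 else end.date() - timedelta(days=1)
        prior = last_by_day.get(day)
        if prior is None or end > prior[0]:
            last_by_day[day] = (end, lat, lon)
    return (median(v[1] for v in last_by_day.values()),
            median(v[2] for v in last_by_day.values()))
```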

AUDIENCE: Well, AOL demonstrated a somewhat similar concept a few months ago.

JOHN KRUMM: Who did?

AUDIENCE: AOL.

[OVERLAPPING VOICES]

JOHN KRUMM: Oh, right.

Exactly.

AUDIENCE: --nicer data.

JOHN KRUMM: Yeah, yeah.

Another algorithm, we called the weighted median algorithm.

We didn't have to segment the data for this.

We just looked at all the GPS points.

And we computed the median of all the GPS points, for a

particular person.

But the median was weighted by the amount of time you spent

at that particular location.

That had an error of about 67 meters.
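The weighted median itself is a standard construction; a sketch, using dwell time at each fix (say, the gap to the next fix) as the weight. The componentwise treatment of latitude and longitude is an assumption.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight
    (a standard lower weighted median)."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v

def weighted_median_home(points):
    """points: list of (lat, lon, dwell_seconds). Componentwise weighted
    median -- a sketch of the algorithm described, not the actual code."""
    w = [p[2] for p in points]
    return (weighted_median([p[0] for p in points], w),
            weighted_median([p[1] for p in points], w))
```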

We looked at a clustering algorithm, where we just

clustered all the destinations that a person had.

We clustered, I think, until the cluster size got to about 100 meters in diameter.

And we, then, just took the cluster that had the most

points in it, figuring that your most frequent

destination is home.

And the median error, here, was about 67 meters.

So the same as the previous one.

Then, finally, we looked at this one, involving

probabilities.

And this is the most principled of the four, but

it's the one that works, by far, the worst. Since we know

the home addresses of all the people in our study--

by the way, that's how we can figure out what the error in

this home-finding algorithm is.

So we've got all their home addresses.

So what we did, is we computed the relative probability that

you're at home, as a function of time of day.

So this is a 24-hour clock, along the bottom, here.

And this is relative probability that you'll be

found at home.

So you can see, it starts to drop, here, at 8:00 AM, kind

of like you'd expect.

So this is, sort of, like the complement to the number of

trips you take, as a function of the time of day.

It goes down, here.

It's lower during the middle of the day, and then starts to

rise, here, around 6:00 PM, again.

And so what the algorithm does--

for every person, it tries to find if there's a GPS point,

in the highest probability bin.

And if it is, then it says, OK, that must

be where they live.

If there's not one in that bin, then it looks at the next

highest probability, and on down the line, like that.

AUDIENCE: How do you calculate these probabilities?

JOHN KRUMM: How do we calculate the probabilities?

We would, for any given time of day, look at all the GPS

points in that one-hour bin.

And see if any of those points were at the home location of

the person.

And so just relative frequency of being at home or not.

Yeah.
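
Putting those two pieces together, here is a minimal Python sketch of that probability-based algorithm, with invented sample data. Assume training pairs of (hour, at-home?) from people with known homes; the attack then scans hour bins from most to least likely-at-home and takes the first GPS point it finds.

```python
from collections import defaultdict

def hourly_home_probabilities(samples):
    """samples: (hour, at_home) pairs from people with known homes.
    Returns the relative frequency of being at home per hour bin."""
    totals = defaultdict(int)
    at_home = defaultdict(int)
    for hour, home in samples:
        totals[hour] += 1
        if home:
            at_home[hour] += 1
    return {h: at_home[h] / totals[h] for h in totals}

def guess_home(points_by_hour, probs):
    """Scan hour bins from highest to lowest at-home probability,
    and return the first GPS point found in such a bin."""
    for hour in sorted(probs, key=probs.get, reverse=True):
        if points_by_hour.get(hour):
            return points_by_hour[hour][0]
    return None
```

If the highest-probability bin (say, 3 AM) contains no points, because an adaptive recorder was off overnight, the guess falls through to daytime bins, which is exactly the failure mode described above.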

But this worked terribly.

Over two kilometers median error.

And I think the reason it didn't work very well, is that

we, actually, don't have much data from the middle of the

night, like this.

Because our GPSs are set up in this adaptive recording mode.

So when the car is parked, they don't record any data.

And that's just to save memory.

So if you had a different GPS receiver, that was recording

all the time, then this would probably work better.

AUDIENCE: But if you know that, then couldn't you say,

look, I've got a point at this time.

I've got a point at this time.

And so presumably, the receiver was just sitting

there for that entire time?

JOHN KRUMM: Yeah.

You probably could.

That would be a good thing to do.

So why wasn't this more accurate?

Well, so our GPS recording interval--

6 seconds and 63 meters.

So you could be fairly far away from your home, before

you get a GPS reading.

GPS satellite acquisition takes a little bit of time.

When you just start up the car, and the GPS turns on, it

can take about 45 seconds before you get the first

reading from a satellite.

Most GPSs have a cold-start time, kind of like that.

And so you can drive 300 meters, if you're going 50

miles an hour, in that 45 seconds.

So you can be pretty far away from your home.

A lot of people have covered parking, so you don't get a

GPS signal there.

And a lot of people park far from their home.

Or they have to park far from their home. So the GPS in the

car is not, necessarily, a good proxy for where the

person actually lives.

So the next step--

now that we had the home location down to about, I

think, 61 meters was the best we could do--

was to see, could we do a reverse white-pages look up,

to get the identity of the person.

And so just like you guys, we have an API into our search,

and one component of it is a white pages look up.

So this is just a little program I wrote, that lets you

click somewhere in the map.

And it gives you the name, address, and phone number, of

the person who lives there, just based on their latitude

and longitude.

And it's not just because of where I work that I have access

to this. Anyone can download the API, and

could do this too.

So you wouldn't have to be an evil genius,

necessarily, to do this.

Anyone could do it.

And so here's how well it worked.

Here's our attack.

So we started off with GPS tracks of 172 people.

We got their home location down to about 61 meters.

We actually get the identity--

the correct name of the person-- with 5% of those.

So this is kind of low.

I thought it would be higher.

And part of the reason, is we're not getting the home

locations all that accurately.

But part of the reason is, that reverse geocoding doesn't

really work all that well.

So given a lat-long, to get the house number, is not,

necessarily, all that accurate.

And also, the white pages are somewhat out of date.

Some of this data is kind of old.

So we didn't try to attack as soon as we got the fresh data

from a person.

We let it accumulate for, like, a year and a half.

So some people moved away.

Also some people live in multi-unit buildings, like a

condo or apartment.

So you're really not going to get what apartment number, or what

condo number they live in.

And so we also looked at, well, what if we were just

satisfied with getting their home address?

Which gets around the problem of, you know, is the white

pages up to date?

And we could actually get the home address of 12% of the

people in the study.

So the first lesson from this, is that everyone's theorized

about these attacks, but the attacks actually work.

So there is a problem.

Even if the data is anonymized, someone can

actually get your identity.

And we all thought that before, but now

we know it's true.

We were in a position to look at some countermeasures, now,

for protecting your identity from someone, doing

an attack like this.

And so a lot of people have written papers on this-- how

to protect your identity.

And they propose different ways of doing it.

Of course, they haven't been able to test them, so now

we're able to test them.

And one of the ways that they talk about, is

corrupting the data.

So for instance, here, this is some of the

original data in our study.

That's where the person lives.

And here is Gaussian noise added to the GPS data, with 50

meters standard deviation.

So it looks like that.
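
A rough sketch of that noising step in Python (the meters-to-degrees conversion is an approximation I'm assuming here; the study's exact procedure isn't shown):

```python
import math
import random

def add_gaussian_noise(points, sigma_m=50.0, seed=None):
    """Perturb each (lat, lon) point with independent Gaussian
    noise of the given standard deviation in meters."""
    rng = random.Random(seed)
    deg_per_m_lat = 1.0 / 111_320.0
    noisy = []
    for lat, lon in points:
        # A degree of longitude shrinks with latitude.
        deg_per_m_lon = deg_per_m_lat / math.cos(math.radians(lat))
        noisy.append((lat + rng.gauss(0.0, sigma_m) * deg_per_m_lat,
                      lon + rng.gauss(0.0, sigma_m) * deg_per_m_lon))
    return noisy
```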

So the question is, now, if we run this noised data through

the same algorithms we ran before, how does

it affect the algorithm?

And what we did, is we looked at, as a function of the

amount of noise you add to the data-- that's the standard

deviation in meters--

how many correct home address inferences do you get?

And these are just using those four different algorithms that

I talked about.

So that's why there are four different

bars for each number.

And as you'd expect, the number goes down, the more

noise you add.

But out here, you're adding five kilometers of noise--

standard deviation--

to the signal, before you've really reduced the threat to

almost zero.

AUDIENCE: [INAUDIBLE]

You can maybe do some sort of [INAUDIBLE] or something

[INAUDIBLE]

If you are using Gaussian noise to do this, maybe

someone can [INAUDIBLE] maybe locally [INAUDIBLE]

JOHN KRUMM: Sure.

Yeah.

I think, in general, if the attacker knows how you were

trying to protect the data, then the attacker can figure

out how to undo that.

And so if they know it's a simple thing, like adding

Gaussian noise, then there's a lot of statistics out there to

help you with that.

So you're right, yup.

AUDIENCE: Maybe you have to add cryptographic noise.

JOHN KRUMM: Maybe, yeah.

So another technique you can try that's similar, is just

discretizing the data.

So it's just coarser.

And so there's original data again.

And here, each point is snapped to a point on a grid,

where the grid points are 50 meters apart.

So the data ends up looking like that, at the end.
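
The snapping itself is simple; a Python sketch, again assuming an approximate meters-to-degrees conversion rather than the study's exact code:

```python
import math

def snap_to_grid(points, spacing_m=50.0):
    """Snap each (lat, lon) point to the nearest node of a square
    grid with the given spacing in meters (approximated in degrees;
    good enough at city scale)."""
    step_lat = spacing_m / 111_320.0
    snapped = []
    for lat, lon in points:
        step_lon = step_lat / math.cos(math.radians(lat))
        snapped.append((round(lat / step_lat) * step_lat,
                        round(lon / step_lon) * step_lon))
    return snapped
```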

So here's the same kind of plot, as a function of the

coarseness of this grid-- how far apart these

points are.

And as you'd expect, the number does go down, the more

coarse you make the grid.

But you've got to get out, here, to a grid where the

points are two kilometers apart, before you've really,

reduced to almost zero, the effect of the attack.

AUDIENCE: Does the functionality drop

proportional to this also?

In the sense that the actual [INAUDIBLE]

are figuring out where you're going?

Do they drop [INAUDIBLE]

if you discretize or add noise?

JOHN KRUMM: Yes.

So that's a good question.

The general question is, if you're trying to use these

techniques for location-based services, how much does affect

the usefulness of the service?

For instance, predicting where you're going,

or things like that.

And that's something that we didn't look at at all.

And I think that would be a very interesting

thing to look at next.

Let's see.

For some things, like if you just want a

weather report, right?

You don't really care if you're a couple kilometers

off, because the weather's not going to change.

But if you want to know the next time you get to a cheap

gas station, or something, then it's important to be

fairly accurate.

And then the final algorithm that we looked at, for

protecting you, is just cloaking the data.

So we delete the data in a circle, that's near the

person's home.

And, specifically, what we do, is we draw a small circle

around the person's home.

That's this circle, here.

And pick a random point inside that circle.

And then, around that random point, we draw a bigger

circle, and delete all the data inside that big circle.

And the real reason that we've got the randomness in there,

is if you center this cloaking disk on the actual home, then

it would be pretty easy for an attacker to figure that out,

and say, OK, well, it had to be at the middle of this

place, where all the data's missing.

AUDIENCE: That's what they call a void at CSI.

JOHN KRUMM: OK.

A void.

And so here's how that worked.

Here's how many correct home addresses we got, as a

function of the radius of that little circle, the radius of

the big circle.

The little circle radius didn't have that much of an

effect on the algorithm.

But the big circle had to be about a kilometer in radius,

before the effect of the attack was almost zero.

So the lesson here, on this privacy part, is that yes, an

attack really does work, even if the data is anonymized.

We all kind of knew that, but we verified it.

But the other thing is, that some of these proposed

techniques for protecting you from an attack like this, are

not very good.

Because you have to corrupt the data so much, that it's

really, probably, not going to be good, for a lot of

location-based services that might want to use it.

AUDIENCE: Unless you did that processing locally, where

you're not worried about it.

JOHN KRUMM: Yeah, if you're not sending the data off your

device, to a central server, and then right.

Yup.

OK.

So that's the end of my talk.

I talked about our gathering of GPS data.

The driver models, including a part that was based on places

that you tend to like to go, as a function

of the ground cover.

Our predestination algorithm, which gets your destination

down to about two kilometers at the halfway

point of the trip.

A new algorithm for doing map matching. Personalized

routing, which pays attention to the roads you tend to like

to drive on.

And then, location privacy.

So that's the end of my talk, and thanks for listening.