What to do with Thousands of GPS Tracks?

Uploaded by Google on 25.07.2007


ADEL: So I thank you for coming today.
Today's talk is about GPS tracking, and what do you do
with information collected from GPS.
Our speaker,today, is John Krumm,
from Microsoft Research.
John graduated from CMU in 1993.
He was a PhD in robotics, and he worked at the Robotics
Center in SNL for four years, in Object Detection and, I
think, individuals inside car detections.
And then he joined Microsoft Research in 1997.
And since then, he was working in location determination,
location-based services.
He works in many systems, actually.
SmartMoveX, which is RF-based location determination.
Locadio, which is based on radio signal strengths.
And the SPOT, which is based on FM signal strengths.
So without any further direction, I introduce John.
Thank you for coming, John.
Thanks Adel.
So the reason I know Adel, is he was summer intern of mine
at Microsoft, a few years ago-- one of my favorite ones.
And I was going to be down here, and I got in contact
with, and I said, I just want to see if you really have free
lunches here.
And so he said, he's going to take me to a free
lunch after my talk.
So we'll see if it's really true.
This is not really part of the main outline of my talk, but I
said that Adel was a summer intern with me.
And I just want to tell you what he worked
on, and what he did.
Microsoft had this product-- we still do-- called the SPOT
watch, which gets actual updates on weather and dining
and traffic and movies over sidebands of commercial FM
radio stations.
And the SPOT team came to us, and said, could you make that
watch location aware?
It's got this FM tuner in it.
We can tune to any FM station, and measure how strong the
signal strengths are.
So what Adel did-- he did a really nice job, and wrote
this paper on making that watch,
actually, location aware.
So it kind of knows where it is, just based on measuring
signal strengths from commercial FM radio stations.
So that was a cool thing.
AUDIENCE: That actually works in production?
JOHN KRUMM: No, it's not in the product.
AUDIENCE: So I mean, does it have a database of where all
the conversations are?
JOHN KRUMM: Basically.
What it does is it looks at their fingerprints.
As a function of latitude/longitude, what are
the set of signal strengths that you are going to see?
So that's the database it's--
AUDIENCE: And it knows where you're from?
JOHN KRUMM: --based on.
No, it actually doesn't.
You do wardriving before hand to get their ground truth.

So what I'm going to talk about, the rest of my talk, is
projects that we've been doing on top of some GPS data that
we've been collecting.
So we went out, and we bought 55 of
these Garmin GPS receivers.
And we're loaning them out to volunteer drivers, to leave on
the dashboard of their car for two weeks or
four weeks at a time.
They record up to 10,000 timestamp latitude/longitudes.
And so far, we've had 227 people go through the program,
1.77 million points, and 12,500 separate trips.
And down below, there, you can see what some of the data
looks like.
So these are mostly all Microsoft employees, so
they're all, usually, around the Seattle area.
In fact, that's Microsoft, right there, so it's
particularly dense there.
And when we started with these GPS receivers, we didn't
really know, exactly, what we're going to do with them.
But since then, we've thought of some research projects on
top of them.
So I'm going to talk about those.
And the first one is destination modeling.

What we have here-- these are the destinations of
people in our study.
So what we do, is we segment the GPS into discreet trips,
and the last lat-long of each trip is the
destination of that trip.
And we looked at well what can we say about people's

One thing, is we can look at the types of places they seem
to like to go.
So the map on the right there is a US geological survey map
of the Seattle area, where each 30 meter by 30 meter
square on the ground is categorized into 1 of 21
different ground cover types.
And what we did, is we looked at the type of ground cover
that people tend to like to go to.
And it turned out that a commercial and low intensity
residential are the most popular ones.
And as you'd expect, emergent herbaceous wetlands and woody
wetlands, which, I think, is another word for swamp,
basically, are not really all that popular as destinations.
And then, a little later on, I'll show how we actually use
this to do something useful.
AUDIENCE: Is the fact that high intensity residential
shows up so low- would that indicate that it didn't get a
signal there?
JOHN KRUMM: Let's see.
Oh yeah, this one here?
That's a good point.
Yeah, maybe that's what happened.
Because that's where the urban canyons would be,
and things like that.
AUDIENCE: Yeah, surprised it has that much in the forest
areas, considering how poorly GPS usually does when there's
a lot of trees around.
Yeah, yeah.

We also looked at how people behave, as a function of time.
So this top plot, here--
it's as a function of a 24-hour clock.
And here are the number of trips, per week, that people
make, versus the time of the day.
So there are not too many trips in the
middle of the night.
You see quite a few, here, in the morning--
maybe people commuting to work.
And down here, in the evening-- maybe people are
going home or going out to restaurants, like that.
So this kind of makes sense.
Then you can also look at how many of these trips are trips
to new destinations, that we haven't observed the person to
go to before?
And that peaks right around the middle of the day.
You can also look at how those things very, as
the day of the week.
So that's what we have over here, the days of the week.
And here are the number of trips taken per week.
And so it turns out, the most trips are taken on Thursday.
And I'm not really sure why that would be.
And then you can look at number of new trips--
places to new destinations where the
person hasn't been before.
And that peaks kind of as you'd expect, on Saturday,
when people might be going off to try new things.
But it's also pretty high, here, on Thursday.

And then, finally, on this destination modeling thing--
we looked at how fast do people, generally, fall into a
rut of not visiting new destinations at all.

In general, when we first gave the people a GPS, they visited
a little over three and a half destinations.
And then, we looked at how many destinations they visited
on the second day, that they hadn't visited
on the first day.
And it's down here.
A little over one and a half, and it goes down
exponentially, kind of like that.
So people fall into a rut, sort of.
After about two weeks, they've come fairly close to hitting a
steady state.
We also have demographics on all these people.
So we were interested in, does this rate of decay vary with
the demographic makeup of the people who are in the study?
And we found no difference between single people and
people with a partner, children versus no children,
extended family nearby or not.
There was no significant difference.
What we did find--
a small, but statistically significant, difference
between men and women.
And women's rate of decay is slightly higher than men's.
They go down, and hit the steady state
a little bit faster.
AUDIENCE: I have a question.
In the sample, were there people driving as well as
public transportation and--
JOHN KRUMM: This is all, pretty much, driving, because
they just put it on the dashboard of their car.
AUDIENCE: On what day of the week were the GPS
units issued to people?
I'm curious if it was Thursday.
It's totally random.
We just sent them out via the mail.
So whatever day they get there.
AUDIENCE: With the bias of new trips, especially, every
trip's a new trip when you first get it.
Was that accounted for?
JOHN KRUMM: No we didn't account for that.
We knew that the starting dates were all random, so we
were hoping the randomness would average
out to general behavior.
But you're right.
If like, Saturday was the second day, here, then you
probably wouldn't see quite the same behavior.

So I don't know why that women decline a little bit faster
than men, and I don't want to speculate on it either.

So the next project was one called Predestination.
And this is where we're trying to predict where you're
driving to, as you drive.
So you get your car, start on a trip, and can we predict
when you're going to stop, and get out of your car.

And the reason why you'd ever want to do that, is
anticipatory information.
If you could warn someone of an upcoming traffic jam,
before they get there, that would be useful.
And of course, you can get traffic incident data.
But you can only get in a radius of where you are, right
at a certain point.
And if we could actually tell you that your coming up on
something, but not tell you about the things that you've
left behind, then that's less distraction for the driver.
Location-based advertising is another one.
No one wants to hear about a sale on bananas, as drive 40
miles an hour by a grocery store.
But at least in the Seattle area, we've got these
third-party parking lots around the airport.
So if 10 minutes before I got to the airport, I could get an
ad that said there's a discount on parking at some
place, then I actually have time to act on that ad, and I
might actually welcome it.
And then finally, this is kind of an interesting application
for hybrid vehicles.
The way most hybrid vehicles work, is they have a gas
engine, that can charge up a battery.
And the vehicle has to decide when to turn on the gas
engine, to do some charging.
So if a car's coming up a hill, say, it might find that
the battery's a little bit low, and say, well, I guess I
need to turn on the gas engine, to charge the battery
at this point.
But if the car knew that is was going to have a good
charging opportunity, by going down a long hill, then it
wouldn't, necessarily, have to turn on the gas engine before
it gets to the top.
And so, it could be made more efficient that way.

So our basic approach to this, to predicting where your're
driving, is we split up the region into
one-kilometer cells.
And we compute a probability, for each one of those cell's,
as to whether or not that's your destination.
And then that's just regular, old, discreet probability
distribution, where they all have to sum to one.
And so now, we get back to that ground cover model, that
I was talking about.
So that's the same bar graph over here.
And if you apply those normalized probabilities, to
the ground covers that are actually inside those
one-kilometer cells, then this is the prior probability
distribution that you get.
And you can see that the probability of ending in a
lake is pretty low, which is good.
And also rural areas are pretty low.
This is a rural area, out here, but you do see some
little pops, here, by these old towns
and these rural areas.
So this kind of makes sense.
We also looked at prior on personal destination.
So the easiest way to predict where someone is going, is to
look at the places that they've been before.
And so here are the destinations of one person in
our study, so those cells would have a higher
We also looked at something to try to make this work a little
better out of the box.
So each image, here, is a 40 pixel by 40 pixel image, where
each pixel corresponds to one of those cells.
And on day one, this person visited these cells, day two,
these, and on down to day 14.
So you can see how the destinations for this person
evolved over time.
Now if we're doing predictions for day two, based on day one,
we're never going to get that destination up there, because
a person hasn't visited it before.
So we wanted to try to account for the fact that out of the
box isn't, necessarily, going to work all that well.
So what we did, is we tried to anticipate what the
probability distribution would be on this day, based on just
the days that we've observed.
And we noticed two things-- that new
destination sometimes sparkle.
So they'll just pop up, where there wasn't any one before,
and not nearby any other ones.
They also tend to cluster so you can see some clusters of
destinations in through here.
So we do, is we add a little bit of background probability,
to account for that sparkling.
And we blur the destinations a little bit, with a wedding
cake model like this, which tends to model that
And as time goes on, this is the effect of the background
and that wedding cake model.
They drop down to zero, because the steady state is
being approached.
So after about two weeks, here, again, we're almost at
steady state.
So this red line means that we're really paying attention
to just the places the person has actually gone, and not
trying to modify the probability
distributions anymore.

The third thing we looked at for predicting where we're
going-- and then these all get glommed together at the end--
is efficient driving.
And that we're trying to model, here, is the fact that.
If you're driving along here, say, and you head north,
chances are that you're not going
anyplace south, like this.
And so what we did, is we looked at the trips people are
actually taking, and kind of learned how efficient most
drivers are.
And they're not 100% efficient getting to their destination,
but they're fairly efficient.
So for a candidate destination, of one of those
squares, we could say, if the person is passed up a lot of
opportunities for an efficient route to that to that cell,
then the probability of that cell has to be pretty low.
And here's how that looks.
Here's a person who started a trip, and they've gone four
squares south.
We've recomputed the probabilities, is just based
on efficient driving.
And you see this, basically, upper triangle has been
eliminated, where a darker outline means a higher
And they continue their trip down here, and now the only
probabilities that are really non-zero are down here.
So that's doing a reasonable job, it seems like.
Then finally, a very simple one.
We look at a trip times.
You get your car here, and you're probably not going to
drive to Chicago.
You're probably driving between 0 or 40 minutes, or
something like that.
So that just tends to bound how far away the
destinations are.
And I'll skip the math here, but this is just to say that,
we put this all together in Bayesian probability
framework, so it's all mathematically principled.
And we didn't have to come up with how much we're going to
weight the different components,
and things like that.
It just all fell out of math nicely.
And the result looks like this, and this is the line to
look at, here.
This is the fraction of the trip it's been completed.
And this is the median error in predicting the cell that
you're going to stop in.
where we just take the highest probability cell as the one we
think you're going to stop in.
So the error goes down as a function of time, as you'd
expect, And at the halfway point, our median error is
about two kilometers in predicting where you're going.

So another project that we did was--
Oh, yes.
AUDIENCE: How did you determine when a driver was
stopping-- like that was the destination?
JOHN KRUMM: Yeah, that involved
segmenting the GPS data.
And that's something that we don't know how
to do all that well.
But what we did, is we looked for gaps of greater than five
minutes in the data.
And then we had to eliminate some little, short garbage
trips when we did that.
But yeah, we just looked for five-minute gaps.
So they weren't instructed to push a button for start,
versus stop.
JOHN KRUMM: That's right.
The way the GPSs are, we had to modify them a little bit,
but they just set them on the dashboard of their car, plug
them into the cigarette lighter, and then just forget
about them for the whole two weeks.
They don't have to touch them again.
AUDIENCE: Well, if it's plugged into the cigarette
lighter, then you could tell, from the voltage, whether the
engine's running or not.
JOHN KRUMM: I guess you could, if the GPS could measure that,
but it doesn't.
Some car cigarette lighters, it turns on, stay on when the
key's off, and some come on--
AUDIENCE: I know, but it'll drop to just the battery
voltage, whereas when the engine's running, it goes up,
because it's the alternator voltage.
Yeah, right.

For the next project I want to talk about, it requires this,
as a prerequisite.
And this is snapping to a road, which was, deceptively,
a hard problem, it turns out.
We have all this GPS data, and we need to know which road the
person was actually driving on.
And the most common way to solve this problem, is just to
snap the GPS to the nearest road.
But that causes a disaster sometimes.
So for example, here I've got these three GPS points--
one, two, and three.
And what we want the output of the program to be, is the fact
that you were driving on this road, here, even though this
GPS point didn't fall on a road, or even
that close to it.
And it fell fairly close to some other roads as well.
If you just do a nearest road snapping, you get a problem.
Here are three GPS points connected by this black line,
one, two, and three.
If you just do snap to nearest road, that one snaps
correctly, that on snapped correctly.
This one, here, instead of snapping to the east-west
road, snapped to the north-south road.
And what happened, is the algorithm, then, fills in the
roads that didn't have a GPS point.
So it reasons that, well, if you were on this north-south
road, and next you were over here, you would have had to
drive north, to the first U-turn opportunity, turn
around, come back down, and get over here.
So the snap to nearest road, when you do the fill
in, fails like that.
And so what we did, is instead of looking at just the
locations of the GPS points, we looked at the
timestamps as well.
So point number one, here--
this is just a pretend scenario-- snapped here, point
number two snaps here.
And the question is, where does point number three snap?
Should it snap over here?
Or over to there, instead?
And if you just do nearest road, then it
ought to snap here.
But what we can also do, is we can just run a little routing
algorithm, and predict how long it would take to drive
from here to here, or how long it would take to drive from
here over to there.
We can see which one of those times best matches the
timestamps that we actually got from the GPS data.
AUDIENCE: Can't you guys just use headings that's published
with the GPS feed, to correct for this thing?
JOHN KRUMM: That would be another good clue.
Our GPSs, actually, don't record that, though.
So we just have the timestamp lat-long altitude.
AUDIENCE: Doesn't most commercial software already do
this, including Microsoft?
I think it's called streets and trips.
I mean, how recent is this research?
JOHN KRUMM: Well, this is pretty recent.
And what makes it slightly different from the nav system
in your car.
The nav system in your car uses odometry and a compass,
as well as the GPS data.
AUDIENCE: Do you have the speed, as well as the position
in the data you're using?
Of course, you could get that.
AUDIENCE: Because a lot of the GPS receivers to record that.
I don't know if these particular ones do.
JOHN KRUMM: Yeah, these are really simple, so they don't
record that.
AUDIENCE: So what is frequency with the GPS recorder?
Did it depend on just the time, or even the data from
the last recorded point?
JOHN KRUMM: We put them in adaptive recording mode, so we
wouldn't fill up the memory.
So if you're on a long straightaway, going at a
constant speed, it won't record that many points.
But if you're weaving around a lot, then it records more.
The average time between points with six
seconds or 60 meters.

So we had to weight those two errors.
How far can we snap in distance, and how far can we
snap in time?
Basically, how much can we ignore what the time to the
planned route was?
So we actually computed some statistics, assumed that the
spatial error of GPS was Gaussian--
came out to about 7.6 meters.
And we looked at the error in our route planner.
So when you're planning these little short routes, what's
the error between the estimated time and the actual
time, to drive between these points.
And that was a Gaussian that looks, kind of, like this.
So now we could weight those two errors, when we're trying
to find the optimal route.
This ends up just being a hidden Markov model, where
this is the same thing that's used in speech processing.
Because in speech, when someone stalking along it's
pretty easy to antici--
--pate what they're going to say next.
So it's kind of like, in this case, we've got these noisy
GPS measurements.
And we have to anticipate what road the person
is driving on next.
So the hidden Markov model just looks at a whole bunch of
different possible road snaps, for every point
in time, like this.
And then picks the optimal path through these road snaps,
using the Viterbi algorithm, that minimizes both the air
and space in time.
And so an example looks like this.
Here, point one can match to here, here, or here.
And GPS point number two can match to here, here, or here.
So there are three possibilities over here, four
possibilities over there.
So there are 12 possible little, short routes
between these two.
So we, actually, give those to our route planner, and ask it,
well, how long would it take to drive those
12 different routes?
And how does that the timing of those routes, the duration,
compared to the actual time it took.
And then we do the optimal snap.
So point three, over here, might have
three different snaps.
And then you've got, 3 times 4 times 3.
And the Viterbi algorithm, in the hidden Markov model is
just really good at efficiently finding the best
way through, that minimizes those errors.
AUDIENCE: If there's a traffic jam, or something, do you
start assuming they went really far away and came back?
JOHN KRUMM: You could, yeah.
That's a great point, because our router doesn't know about
traffic jams. So it's just assuming clear traffic.
So, yup.
AUDIENCE: You can, actually, take those kind of right?
In the history and 10 points in the past, and 10 points in
the future and that would be a much better estimate--
JOHN KRUMM: Well, yeah.
So this algorithm is--
AUDIENCE: Just not isolated within one or two points.
Well this algorithm is, actually, looking at all the
points on the trip, from beginning to end.
So it's kind of a non-causal model, because it's looking to
the future as well.

So here's one of the results.
The black, again, are the measured GPS points.
Just connect the dots.
This point, here, on the nearest road snapping, for
some reason, snapped to the road down below.
This point snapped to the correct road.
So nearest road snapping said, well, if you're on this a
highway, under here, and then ended up over here, what you
would have to do, is turn around, come back, take the
nearest entrance to the highway, get on, pick up that
snapped point, and now the next snapped
point is over here.
So I have to keep going, get off the highway, come back
around here, over here, over here, and now, pick up that
snapped point.
But our algorithm cuts through, so it ignores that
big excursion.
And then we just looked at a lot of different examples of
where our algorithm and works, in the nearest doesn't work.
So here's a GPS problem.
This point gets snapped here, so nearest road gets pulled
over there, but our algorithm doesn't.
Distraction with cross overs, like the bridge I was just
talking about.
Parallel roads-- sometimes the point would get snapped over
the center median, into the oncoming lane of traffic.
Our algorithm doesn't fall for that trick.
Distractions are just little spurs off the road, like this.
Cutting through spaghetti intersections, like this.
The gray is the nearest road snapping algorithm, and the
white as our algorithm.
Passing around spaghetti intersections.
Looping through spaghetti-- so in here, there's a lot of
different choices for which road you can be on.

So of the reason we did that, is we wanted to do this next
project on personalized routes.
And this is something I did with another summer intern,
Julie Letchner, from the University of Washington.
And what she was looking at, is, you know, when you go to a
website, now, and ask for directions between two points,
you can you can ask for the shortest route
or the fastest route.
But you can't ask for a route that, say, sticks to the roads
that you tend to like to drive on.
And what Julie found, is we looked at the routes that
people actually took in our data.
And 60% of the people were taking neither the fastest,
nor the shortest, routes between two points.
So because these people live in the area, we assume that
they were familiar with where they were driving.
So they must have had some other reason to pick the
routes that they were picking.
In fact, here are four different routes between this
point and this point.
There's the empirically fastest, that paid attention
to traffic.
There's the plan we got from map point, the shortest
distance route, and the drivers route.
And all four of those are different from each other.

So what we did, is we looked at two ways of making the
routes better.
Hopefully, to better match what people actually did.
One was to look at traffic, so we could use all our GPS data,
just as traffic probes, for each little section of road,
to see how fast traffic was moving on there.
And we split those up, as a function of 15-minute
segments, over the time of the day.
So given the time of day that you wanted to make the trip,
we could anticipate what traffic would be at that time.
And so here's an example of someone who--
this is just a regular commuter, who starts his trip
here, and ends here.
At midnight, he sticks to the highway, like this.
But during rush hour, he has a trick, and he goes off the
highway for as long as he can.
And then gets on, down here.
The other thing we did, which is the more interesting part,
I think, was we looked at the roads that each person drove
on, and we would deflate the cost of those
roads a little bit.
So when you submit the route-planning problem to your
A*search, or your Dykstra search, it looks at the cost
of the roads.
And we could say, multiply the cost of the roads that you'd
already driven on, by 0.9, or something.
So they're a little bit cheaper, and the router would
tend to prefer those instead.

And so we tested this on 2,500 trips, from our GPS data.
And we found that close to 47% of the per slice trips match
the actual routes that people took.
And that was quite a bit better than what we were
getting from just using a standard router.
So I think, that was pretty successful in making the
router more personalized.

The last thing I want to talk about, is location privacy.

And the motivation here, is there are a lot of reasons to
send your latitude/longitude to a third party, One of those
is congestion pricing.
So toll roads-- but all roads can be a toll road.
And you just get charged, at the end of the month, for
which roads you happen to be driving on, and when you were
driving on them.
Location-based services, we all know about.
Pay-as-you-drive insurance.
So your insurance rates vary, depending on where
and when you drive.
This is an interesting device.
This dash, which was just announced recently, it's a
navigation system, that you can put on your dashboard.
And it can give you real-time traffic updates.
But where it gets the updates from, is other drivers who are
driving around.
If they encounter some slow traffic, their GPS data is
continually getting sent into a central server, which is
looking for these traffic slowdowns,
based on this GPS data.
And then it can instantly broadcast to the other people
who have these systems. And then,
finally, research purposes.
So a bunch of people ran around London with GPSs and
made this map, just by walking around.
But those are all reasons that you might want to send your
data off to a third party.
And people say that, well, even if this data is an
anonymized, which is what, for instance, the dash system
uses, everyone has this theory, well, you know, if
someone knew where you go, they could easily figure out
who you are, just by figuring out where your home is.
And then, from that, they could
figure out your identity.
And everyone's got this theory, but no one has tested
it, up to now.
So we were able to actually test it.
So we start off with these pseudonomized GPS tracks.
That's just the proper term for an anonymizing something,
but replacing the name with a unique ID.
Then we infer the home location of the people, and
then we do a reverse white-pages look up, to see if
we can get their identity back.
And so we had four simple algorithms, for trying to
infer where this person lives, based on the GPS data.
And the first of these four, was called the Last
Destination algorithm.
And we segmented all our data into discreet trips.
And for every day, we look at the last destination the
person went to, before 3:00 in the morning.
And we took the median of all those, for every day that we
observed the person.
And we assumed that that is the location of their home.
And that was accurate, down to about 61 meters median error.
AUDIENCE: Well, AOL demonstrated a somewhat
similar concept a few minutes ago.
JOHN KRUMM: Who did?
JOHN KRUMM: Oh, right.
AUDIENCE: --nicer data.
JOHN KRUMM: Yeah, yeah.

Another algorithm, we called the weighted median algorithm.
We didn't have to segment the data for this.
We just looked at all the GPS points.
And we computed the median of all the GPS points, for a
particular person.
But the median was weighted by the amount of time you spent
at that particular location.
That had an error of about 67 meters.
We looked at a clustering algorithm, where we just
clustered all the destinations that a person had.
We clustered up to, I think, until the cluster size got to
about 100 meters in diameter.
And we, then, just took the cluster that had the most
points in it, figuring that your most frequent
destination is home.
And the median error, here, was about 67 meters.
So the same as the previous one.
Then, finally, we looked at this one, involving
And this is the most principled of the four, but
it's the one that works, by far, the worst. Since we know
the home addresses of all the people in our study--
by the way, that's how we can figure out what the error in
this home-finding algorithm is.
So we've got all their home addresses.
So what we did, is we computed the relative probability that
you're at home, as a function of time of day.
So this is a 24-hour clock, along the bottom, here.
And this is relative probability that you'll be
found at home.
So you can see, it starts to drop, here, at 8:00 AM, kind
of like you'd expect.
So this is, sort of, like the complement to the number of
trips you take, as a function of the time of day.
It goes down, here.
It's lower during the middle of the day, and then starts to
rise, here, around 6:00 PM, again.
And so what the algorithm does--
for every person, it tries to find if there's a GPS point,
in the highest probability bin.
And if it is, then it says, OK, that must
be where they live.
If there's not one in that bin, then it looks at the next
highest probability, and on down the line, like that.
AUDIENCE: How do calculate these probabilities?

JOHN KRUMM: How do we calculate the probabilities?

We would, for any given time of day, look at all the GPS
points in that one-hour bin.
And see if any of those points were at the home location of
the person.
And so just relative frequency of being at home or not.

But this worked terribly.
Over two meters median error.
And I think the reason it didn't work very well, is that
we, actually, don't have much data from the middle of the
night, like this.
Because our GPSs are set up in this adaptive recording mode.
So when the car is parked, they don't record any data.
And that's just to save memory.
So if you had a different GPS receiver, that was recording
all the time, then this would probably work better.

AUDIENCE: But if you know that, then couldn't you say,
look, I've got a point at this time.
I've got a point at his time.
And so presumably, the receiver was just sitting
there for that entire time?
You probably could.
That would be a good thing to do.

So why wasn't this more accurate?
Well, so our GPS recording interval--
6 seconds and 63 meters.
So you could be fairly far away from your home, before
you get a GPS reading.
GPS satellite acquisition takes a little bit of time.
When you just start up the car, and the GPS turns on, it
can take about 45 seconds before you get the first
reading from a satellite.
Most GPSs have a cold-start time, kind of like that.
And so you can drive 300 meters, if you're going 50
miles an hour, in that 45 five seconds.
So you can be pretty far away from your home.
A lot of people have covered parking, so you don't get a
GPS signal there.
And a lot of people park far from their home.
And they have to park far from their home, So the GPS in the
car is not, necessarily, a good proxy for where the
person actually lives.

So the next step--
now that we had the home location down to about, I
think, 61 meters was the best we could do--
was to see, could we do a reverse white-pages look up,
to get the identity of the person.
And so just like you guys, we have is API into our search,
and one component of it is a white pages look up.
So this is just a little program I wrote, that lets you
click somewhere in the map.
And it gives you the name, address, and phone number, of
the person who lives there, just based on their latitude
and longitude.
And it's not just because of where I work, I have access to
this, but anyone can download the API, and
could do this too.
So you wouldn't have to be an evil genius,
necessarily, to do this.
Anyone could do it.
And so here's how well it worked.
Here's our attack.
So we stared off with GPS tracks of 172 people.
We got their home location down to about 61 meters.
We actually get the identity--
the correct name of the person-- with 5% of those.
So this is kind of low.
I thought it would be higher.
And part of the reason, is we're not getting the home
locations all that accurately.
But part of the reason is, that reverse geocoding doesn't
really work all that well.
So given a lat-long, to get the house number, is not,
necessarily, all that accurate.
And also, the white pages are somewhat out of date.
Some of this data is kind of old.
So we didn't try to attack as soon as we got the fresh data
from a person.
We let it accumulate for, like, a year and a half.
So some people moved away.
Also some people live in multi-unit buildings, like a
condo or apartment.
So you really not going to get what apartment number, or what
condo number they live in.
And so we also looked at, well, what if we are just
satisfied in getting their home address.
Which gets around the problem of, you know, is the white
pages up to date?
And we could actually get the home address of a 12% of the
people in the study.
So the first lesson from this, is that everyone's theorized
about these attacks, but the attacks actually work.
So there is a problem.
Even if the data is anonymized, someone can
actually get your identity.
And we all thought that before, but now
we know it's true.
We were in a position to look at some countermeasures, now,
for protecting your identity from someone, doing
an attack like this.
And so a lot of people have written papers on this-- how
to protect your identity.
And they propose different ways of doing it.
Of course, they haven't been able to test them, so now
we're able to test them.
And one of the ways that they talk about, is
corrupting the data.
So for instance, here, this is some of the
original data in our study.
That's where the person lives.
And here is Gaussian noise added to the GPS data, with 50
metes standard deviation.
So it looks like that.
So the question is, now, if we run this noised data through
the same algorithms we ran before, how does
it affect the algorithm?
And what we did, is we looked at, as a function of the
amount of noise you add to the data-- that's the standard
deviation in meters--
how many correct home address inferences do you get?
And these are just using those four different algorithms that
I talked about.
So that's why there are four different
bars for each number.
And as you'd expect, the number goes down, the more
noise you add.
But out here, you're adding five kilometers of noise--
standard deviation--
to the signal, before you've really reduced the threat to
almost zero.

You can maybe do some sort of [INAUDIBLE] or something
If you are using Gaussian noise to do this, maybe
someone can [INAUDIBLE] maybe locally [INAUDIBLE]

I think, in general, if the attacker knows how you were
trying to protect the data, then the attacker can figure
out how to undo that.
And so if they know it's a simple thing, like adding
Gaussian noise, then there's a lot of statistics out there to
help you with that.
So you're right, yup.
AUDIENCE: Maybe you have to add cryptographic noise.
JOHN KRUMM: Maybe, yeah.

So another technique you can try that's similar, is just
discretizing the data.
So it's just courser.
And so there's original data again.
And here, each point is snapped to a point on a grid,
where the grid points are 50 meters apart.
So the data ends up looking like that, at the end.
So here's the same kind of plot, as a function of the
coarseness of this of this grid-- how far apart these
points are.
And as you'd expect, the number does go down, the more
coarse you make the grid.
But you've got to get out, here, to a grid where the
points are two kilometers apart, before you've really,
reduced to almost zero, the effect of the attack.
AUDIENCE: Does the functionality drop
proportional to this also?
In the sense that the actual [INAUDIBLE]
are figuring out where you're going?
Do they drop [INAUDIBLE]
if you discretize or add noise?
So that's a good question.
The general question is, if you're trying to use these
techniques for location-based services, how much does affect
the usefulness of the service?
For instance, predicting where you're going,
or things like that.
And that's something that we didn't look at at all.
And I think that would be a very interesting
thing to look at next.
Let's see.
For some things, like if you just want a
weather report, right?
You don't really care if you're a couple kilometers
off, because the weather's not going to change.
But if you want to know the next time you get to a cheap
gas station, or something, then it's important to be
fairly accurate.

And then the final algorithm that we looked at, for
protecting you, is just cloaking the data.
So we delete the data in a circle, that's near the
person's home.
And, specifically, what we do, is we draw a small circle
around the person's home.
That's this the circle, here.
And pick a random point inside that circle.
And then, around that random point, we draw a bigger
circle, and delete all the data inside that big circle.
And the real reason that we've got the randomness in there,
is if you center this cloaking disk on the actual home, then
it would be pretty easy for an attacker to figure that out,
and say, OK, well, it had to be at the middle of this
place, where all the data's missing.
AUDIENCE: That's what they call a void at CSI.
A void.
And so here's how that worked.
Here's how many correct home addresses we got, as a
function of the radius of that little circle, the radius of
the big circle.
The little circle radius didn't have that much of an
effect on the algorithm.
But the big circle had to be about a kilometer in radius,
before the effect of the attack was almost zero.
So the lesson here, on this privacy part, is that yes, an
attack really does work, even if the data isn't anonymized.
We all kind of knew that, but we verified it.
But the other thing is, that some of these proposed
techniques for protecting you from an attack like this, are
not very good.
Because you have to corrupt the data so much, that it's
really, probably, not going to be good, for a lot of
location-based services that might be wanting to try it.
AUDIENCE: Unless you did that processing locally, where
you're not worried about it.
JOHN KRUMM: Yeah, if you're not sending the data off your
device, to a central server, and then right.
So that's the end of my talk.
I talked about our gathering of GPS data.
The driver models, including a part that was based on places
that you tend to like to go, as a function
of the ground cover.
Our predestination algorithm, which gets your destination
down to about two kilometers at the halfway
point of the trip.
A new algorithm for doing map matching, Personalized
routing, that pays attention to the roads you tend to like
to drive on.
And then, location privacy.
So that's the end of my talk, and thanks for listening.