Google Refine 2.0 - Data Augmentation (3 of 3) (video version 2)

Uploaded by GoogleRefine on 19.07.2011


Welcome back to another video on Google Refine.
Previously, we have shown you how to use Google Refine to
clean up messy data and to transform it from one format
to another.
In this video, we focus on scenarios in which the data
you need is not all in the data you have.
For instance, consider this data set of biotech companies'
names and street addresses.
We have their addresses, but say we want their latitudes
and longitudes.
There are web services like OpenStreetMap's Nominatim that
can translate from a street address to a
latitude/longitude pair.
We can get Google Refine to call such a web service on the
Address cell of each row.
Simply invoke the command Add Column By Fetching URLs.

In this dialogue box, we need to formulate the URL to fetch.
The documentation of the service will tell
us how to do so.
In this case, here's what the URL looks like.
Note that we need to Escape the cell's value in order to
use it as a URL parameter.
We also need to include the email address and the app's
name, as requested by the service's usage policy.
The service also limits the rate of access to at most one
request per second, so we set the delay between consecutive
URL fetches to any value larger than 1,000
We name the new column JSON and click OK.
This process takes a few minutes for 117 addresses.
When that's done, we have a new column of JSON results.
From this column, we invoke the command Add Column Based
On This Column in order to extract out the
latitude/longitude pairs.
First, we parse the value of the cell into JSON and access
its first result.
We use the With construct to create a shorthand for that
expression and call it Pair.
The final result is the concatenation of the latitude
field and the longitude field of Pair.
Looking up latitude/longitude for street addresses is only
one use of this URL-fetching feature in Google Refine.
Consider a more esoteric use.
Say we have a data set with many text blurbs in many
languages, and we would like to know which blurb is in
which language so that we can divide the data by language
and send the subsets to different translators.
We can invoke the same command, Add Column By
Fetching URLs.
But this time, we use Google's language detection service.
This service does require an API key, and we also want to
substring the cells' text just to keep the URL short.
When the fetching is done, we invoke the command Add Column
Based On This Column, parse the result JSON, and extract
out the language ID.
Augmenting your data with data from web
services can be that easy.
But with a bit more work, you can do even more
sophisticated things.
Sometimes your data contains names of people or places
stored in some database.
And by associating each name in your data with the
corresponding key in that database, you can then
retrieve more data from that database.
We call this process of associating names to database
keys reconciliation.
And we use such a fancy word because this process requires
a bit more thinking than it might seem at first.
Say we have a list of favorite movies and our
own rating for each.
We want to compose a web page listing for these movies, not
just to show their names and ratings, but also to show
their directors and release dates and so forth.
One source for such information is Freebase.
And as you can see here, for a movie, Freebase contains
plenty of data.
So we first need to reconcile the movie names in our
database to the IDs on Freebase.
We invoke the Reconcile command and pick the Freebase
reconciliation service.
Using that service, Google Refine takes a few seconds to
guess what type of names these are.
And the top guess is that these are names of films.
We simply click Start Reconciling.
This process takes a bit of time, and when it's done, we
can see that some of the cells have already
turned into web links.
For example, clicking on The Dark Knight, the link takes us
to the Freebase page about The Dark Knight, the film.
Thus that cell and our data has been reconciled with or
linked to the appropriate record within Freebase.
For some other cells, it's not entirely obvious which record
in Freebase is the right match.
Consider the cell containing the string "terminator." There
are at least three movies whose titles match that text
closely enough, so it's difficult for the computer to
decide with confidence which one we mean.
Say we mean The Terminator.
Then we can click on the check mark in front of the first one
to match it, and the cell turns into a web link.
Next, consider Ocean's Eleven.
This is ambiguous, because there are two movies with that
exact same name.
The first one is a remake in 2001 with George Clooney, and
the second one is the original in 1960.
Only by seeing more details for each movie can we decide
that we mean the first candidate.
And so while some movie names can be matched automatically
to the appropriate Freebase IDs, others
require our human judgment.
When we're done and every cell has now been reconciled, we
can fetch more data from Freebase.
We invoke the command Add Columns From Freebase, and
we're presented with a long list of properties of these
films that we can retrieve.
Let's pick the Directed By property, the Initial Release
Date property, as well as the Netflix ID.
On the right, we get a preview of these new columns.
Click OK, and Google Refine will fetch the extra
date for all rows.
Thus, just by reconciling the names within our data to IDs
databases such as Freebase, we can enrich our data.
Not only that, when we give this enriched data to someone
else, they know exactly which Ocean's Eleven movie we mean.
Previously ambiguous names in our data now carry precise
So Google Refine is not just for cleaning up messy data or
transforming data from one format to another.
It's also for enriching data, making it less ambiguous and
more precise.