Google Refine 2.0 - Introduction (1 of 3) (video version 2)

Uploaded by GoogleRefine on 19.07.2011


So you're looking at some data journalism project like
ProPublica's Dollars for Docs and say, wow, that's cool.
They're putting together data from seven different drug
companies in order to discover which drug company paid which
doctor to recommend their drugs.
So you decide to dig up some public data yourself to do
some data journalism on important social issues.
Like, let's say you go to that government's IT dashboard and
get data about projects that the government has contracted
out to private companies, download and open it in a
spreadsheet program, and check out the
Type of Contract column.
Firm Fixed Price.
What could that possibly mean?
But whatever it means, FFP probably means the same thing.
Time and Materials, T&M, wouldn't those be
the same, as well?
And should there be an S here, or not?
You would quickly discover that public, free, open data
can be inconsistent and messy.
Think of it is as raw materials that you have to
refine before it's useful.
And that's where Google Refine comes in.
It's a free power tool for working with messy data.
So let's load that same data into Google Refine and look at
the Type of Contract column again.
One core feature of Refine is the text facet.
When created from a column, the text facet groups together
identical cells into that column across rows and shows
you the number of rows in each group.
For example, 512 rows contain FFP in their
Type of Contract cells.
Clicking on FFP inside the text facet filters the data
table on the right to show only those 512 rows out of
5,200 rows in total.
Now there are two other groups that look like FFP, as well.
Clicking Edit on the first one shows us that it has a
trailing space.
Removing that space would merge it into the first group
and increase the count to 513.
By condensing all 5,200 rows into 800-something groups,
Refine makes it easier to locate these inconsistencies.
In fact, if we suspect that there are trailing white space
in other groups, as well, we could apply a trimming
transformation on the whole column to fix that whole
family of problem in one shot.
And now we're down to 785 groups.
We could even rename FFP to Firm Fixed Price.
That performed a Find and Replace
operation on 513 cells.
And we can also change T&M to Time and
Materials, and so forth.

You can sort the groups in the facet by count to find the
biggest groups.
Firm Fixed Price is the biggest group, consisting of
800-something rows.
But it would have been even bigger if its many alternative
forms were written in the same way.
Time and Materials has the very same
inconsistency problem.
Google Refine has a clustering feature that helps you fix
this family of problem.
Essentially, the clustering feature tries to group the
groups based on some heuristics.
And you can pick different heuristics to adjust how
aggressive the feature works.
Select any group of groups that you want to merge, and
set the desired new cell value.
When these groups get merged, every cell belonging to each
of these groups will be replaced with
this new cell value.
We could select all groups of groups to be merged and click
Merge Selected and Close.
That changed several hundreds of cells in a few clicks.
By grouping the cells and then grouping the groups, Google
Refine shows you a big picture of the data and surfaces its
inconsistencies, and lets you fix those inconsistencies.
Now, those are very powerful editing operations that affect
hundreds or thousands of cells.
What if you make a mistake?
Not to worry.
Google Refine tracks all editing operations that you
have taken, so that you can roll back to the previous
state or undo all the way back to the beginning when you just
loaded the data.
So go ahead, make mistakes.
This freedom to make mistakes encourages us to play more
with the data.
For example, let's look at the Total
Value of Contract column.
Since this is a numeric column, we create a numeric
facet, rather than a text facet.
The contract value ranges from zero to 20 million.
The distribution is extremely skewed, so a common trick is
to use the log scale.
We can change the expression that defines
this numeric facet.
Google Refine provides a very powerful expression language,
but in this case, we only need to take the log
of the cell's value.
Here's a preview of the expression we have entered, as
applied on the first few visible rows.
When the preview looks right, click OK.
Now the distribution is more informative.
But first, let's look at the choices below.
There are 5,000-something rows with numerical values and 27
with errors.
Selecting only the error rows shows us the problem.
Namely, they're all zeroes.
And the log of zero is negative
infinity, hence the errors.
But how can any contract have zero cost?
The distribution histogram is also intriguing.
Let's filter for rows in the lower end of the range.
Since the log values are negative, we expect the
original values to be less than one.
Obviously, no contract can cost less than $1.
These numbers must be in the thousands,
millions, or billions.
Since this one is a three-year contract, the only unit that
makes sense is billion, which makes the contract's value to
be roughly $700,000.
On the other end of the range, we find a contract with a
value of 19 million.
This is already too large to be in a unit of a billion, so
apparently, the numbers in this one column can be in
different units.
So that's what Google Refine is for: seeing the big picture
of your data, discovering
inconsistencies, and fixing them.
Google Refine helps you refine messy, raw data into something
useful, and ask questions in the process.
And if you're working on some super-sensitive data for a
juicy news story, know that Google Refine is not a web
service, so you don't have to upload your data to use it.
It's a desktop application that you download, install,
and run on your own computer, even though you interact with
it through your web browser.
So your super-sensitive data never has to leave your
machine to get cleaned up.
We can only show you a small fraction of what Google Refine
can do in seven minutes, so check out the rest of the
videos and give Google Refine a try.