Google Refine 2.0 - Data Transformation (2 of 3) (video version 2)


Uploaded by GoogleRefine on 19.07.2011

Transcript:

Welcome back to another video on Google Refine.
Previously, we've said that Google Refine is a free power
tool for working with messy data.
But the data doesn't have to be messy.
It might just not be in the form that you want.
And using Google Refine, you can quickly transform it into
the form that you do want.
For example, pretend that you are the author of a list on a
wiki page, such as this list of Filmfare Best Actress Award
winners and nominees for the last 50 years or so.
You started recording this data in a list, but now you
want to convert it into a table.
First, switch to the Edit mode and copy out the wiki text of
the list. Paste it into a text editor and save. Next, import
that file into Google Refine.
We don't want Google Refine to split the lines into columns.
We will do this ourselves.
Also, there is no header in the line, so we set the number
of header lines to zero.
Here is the wiki text, one line per
row, 234 rows in total.
First, we want to remove all of the year section headings.
It's easy to describe them.
Any line starting with three equal signs is a
year section heading.
So we create a text filter and type in three equal signs.
That filters the table on the right to the expected rows
containing the year section headings.
And now we invoke the Remove All Matching Rows command to
move those seven rows.
We then clear the text filter so that the table
is no longer filtered.
And we can see the remaining 227 rows, down from 234 rows
originally.
What just happened is that we used a filter to isolate the
rows that we wanted to act upon, and then we invoke a
command to affect all of those rows together.
That is essentially how to use Google Refine.
Isolate all the rows you want to change using filters and
facets, and then change them all in one shot.
Moving on.
Let's note that in the original list, there's a lot
of bolded text.
We want to bold the text in the final table differently,
so we need to remove all existing bolding, which is
encoded as triple quotes.
So we invoke the Transform command on this one column.
In this dialog box, we will specify an expression that
takes a cell's value and performs a search for three
consecutive single quote marks and replaces them with an
empty string.
We get a preview of the expression as applied on the
first few visible rows, so we can check the results before
clicking OK.
That changed 57 cells.
Next, we note that for each year, the winner is listed
first, and then the nominees are shown as an indented
sublist. In the wiki text, the indented list items
start with two stars.
Using that hint, we can create a new column that indicates
whether a row corresponds to a winner or a nominee.
We invoke the command Add Column Based On This Column
and enter the expression, value starts with two star.

And name that new column Is Winner.
We create a text facet from the Is Winner column to help
us work with the winners and the nominees separately.
First, we select True to isolate the winner rows.
Then we want to extract out the years.
We invoke the command Add Column Based On This Column
and specify the character range from one to five, and
name that column Year.
Having extracted out the years, we can remove them from
the original column by invoking the Transform command
and taking the substring from character index six onward.
We now turn our attention to the nominees by selecting
False in the Is Winner facet.
They have no years, so we only need to
remove the leading stars.
Again, we invoke the Transform command and take the substring
from character index two onward.
We remove the Is Winner facet, as we no longer need it.
The winners and nominees row should now look the same,
except the nominees are missing the years.
But since each nominee is listed immediately after the
winner of the same year, we can invoke the Fill Down
command to copy each year onto the blank cells below it.
Next, we need to split the first column into three
separate columns: Actress, Film, and Character.
We invoke the command Split Into Several Columns and
specify the separator to be space, dash, space.
We want to get up to two columns only.
Then we invoke the command Split Into Several Columns
again and specify the separator to
be space, as, space.

Finally, we rename the columns to
Actress, Film, and Character.
We also want to append Winner to the Actress column for the
Winner rows.
Instead of using a text facet to isolate the Winner rows,
we're going to do something different.
We invoke the Transform command on the Actress column.
Now we need to refer to the Is Winner column to know if a row
is a Winner row or not.
We do so by referring to the cell's variable, accessing the
Is Winner field in it, and accessing the Value field.
This gets us the value inside the Is Winner column of the
current row.
Below, we get to see such values, which are
either True or False.
If it is True, we append Winner to the
current cell's value.
Otherwise, we keep the same cell's value unchanged.
To implement that logic, we use the If
construct as follows.
And thus, there are different ways to accomplish the same
thing in Google Refine.
Now that the data has been transformed from nested lists
into a proper table, we still need to export the data out in
wiki text syntax, so we invoke the Templating Exporter.
By default, the Templating Exporter provides a template
that generates JSON.
We can tweak this to generate a wiki text table.
The first text field contains the initial chunk of generated
text that does not repeat.
In our case, this is the table column headings.
The next chunk is the text that gets generated
repeatedly, one for each row.
The embedded expressions within this text field are
evaluated to retrieve the actual data inside the cells.

The next text field is the text that gets generated to
separate the rows, and the final text field is the final
chunk of generated text that does not repeat.
In our case, this marks the end of the table.
We get a preview of our table on the right.
When it all looks right, click Export.
Open the exported file, copy the wiki text, paste it back
into the original wiki page, and preview it.
There is our table of Filmfare Best Actress
Award winners and nominees.
Now let's say that we want to transform not just the Best
Actress list, but also the Best Actor list, which looks
pretty much the same.
This is going to be easy.
Again, copy off the wiki text to a file, and load it into
Google Refine.

Now, let's switch back to the Best Actresses project to see
what we've done.
The Undo/Redo History shows us a list of editing operations
that we have performed.
Not only that, it allows us to extract those operations out.
The operations we select on the left are encoded as JSON
on the right.
We can simply copy that JSON, switch back to the Best Actors
project, invoke the Apply Operations command, paste in
that JSON, and click Apply.
And voila, our Best Actors list has been
transformed to a table.
I hope this video has given you some ideas of data
transformations that you can do using Google Refine.
Although our examples are small data sets with a few
hundred rows, Google Refine works just fine with up to a
few hundred thousand rows if your
computer has enough memory.
And with more data, the difference between Google
Refine and spreadsheet software is even more
pronounced.
That ability to interactively change in bulk a large chunk
of data matching some particular criteria really
sets Google Refine apart.
And that's not the whole story.
Check out the rest of the videos for even more
data-wrangling magic that Google Refine can
do for you for free.