Google BigQuery: Batch, Connector for Excel and Office Hours!


Uploaded by GoogleDevelopers on 31.08.2012

Transcript:

BILL LUAN: Shanghai GDG is a very
interesting developer community.
FEMALE SPEAKER 1: I'm glad somebody
has asked this question.
MALE SPEAKER 1: This is where the magic happens.
FEMALE SPEAKER 2: This is primarily a question and
answer show.
So if any of you out there would like to ask questions--

MICHAEL MANOOCHEHRI: All right.
Hey, Google Developers.
Welcome to The Michael and Ryan Show.
I'm Michael Manoochehri.
RYAN BOYD: And I'm Ryan Boyd.
MICHAEL MANOOCHEHRI: And we're here to talk about some new
features of Google BigQuery as well as answer
some of your questions.
So if you have some questions for us, visit us at our Google
Developers Live page--
that's developers.google.com/live--
and leave your question on the Moderator link below.
RYAN BOYD: And if you're seeing a recording of this
later, you can feel free to ask your
questions on Stack Overflow.
Michael was just answering some questions here--
MICHAEL MANOOCHEHRI: As I always am.
RYAN BOYD: --as we were preparing, so--
MICHAEL MANOOCHEHRI: Use the Google-BigQuery tag for that.
Great.
So if you've been following our recent Google BigQuery
blog posts, you know the theme for this week is making
BigQuery more flexible.
So today we're going to start by talking about some new
features that we've just recently launched to help make
BigQuery more flexible.
The first is the batch processing for BigQuery, which
allows a more flexible pricing model for your queries.
And the second is the Google BigQuery connector for Excel,
which will allow you to use it in a more flexible environment
such as spreadsheets.
So we'll talk about that later.
So let's talk about the batch
processing for Google BigQuery.
So, you know, Google BigQuery is great for interactive
queries, so queries over large data sets
that return very fast.
We found some customers using Google BigQuery for batch
processing as well, so queries run overnight, queries that
cache data for later use, where interactivity is
not the main goal.
So we've introduced a new feature for this use case at a
much reduced price, which is Google BigQuery Batch
Processing.
This allows you to set the priority of a query to batch
mode, which allows the query to run later.
So the query will be run somewhere between 30 minutes
and three hours after you actually insert the query at a
much reduced cost.
This is great for applications that have a data set and
allows you to run queries either interactively or in a
batch mode, letting you build much better applications for
this kind of a use case.
RYAN BOYD: Yeah, and it's really easy to do.
It's a simple new property in your JSON configuration.
You just specify the priority mode as Batch, and our client
libraries already support this as well.
So try it out.
And I wanted to actually stop over to our computer here and
show you an example of a dashboard.
This is the BIME dashboard, which shows some retail
statistics.
And this is the type of thing that you might want to have
some of the queries running in batch mode while other of the
queries run in sort of an interactive manner.
So this is a dashboard showing the retail statistics, sales
from a retail outlet.
And a lot of different graphs here going over--
what do we have?
Like 500,000 rows--
whoops.
Sorry for that.
But going over a large number of rows here, but overall--
oh sorry, 500 million rows.
And we have profit and discount per category, top
products per category, profit in various ways.
So these types of things you might be able to calculate on
a nightly basis in this batch mode.
Whereas maybe if you want to allow users to drill down on
certain bits of data, for instance, the sales just over
the last week or something like that, you might want to
have that running more live against the live data set and
allow people to drill down into particular categories or
drill down into particular types of products
and things like that.
So this is one of those use cases where you might want
some things in batch and really caching those query
results and then other things you want to perform live.
MICHAEL MANOOCHEHRI: And on the flip side, it's easy to
make an application that's actually
doing more batch queries.
And when you need that interactive query immediately
for a presentation or for some kind of analysis, you can run
them when you need them.
So it really expands the types of applications you can build
with Google BigQuery.
RYAN BOYD: Exactly.
Now, I'm going to start talking a little bit about the
other feature that we launched, which is the Google
BigQuery connector for Excel.
The BigQuery connector for Excel is basically a tool
which allows you to--
let me hop over to it here.
It allows you to basically pull in your data into Excel
from BigQuery.
So you could run queries off of hundreds of millions or
billions of rows of data and pull your aggregate results
down into Excel for further analysis.
And this is very useful if you're spending a lot of your
time running data in Excel.
We actually have had a feature for a while now which allows
you to do this type of thing with Google Spreadsheets, with
Google Apps Script and Google Spreadsheets.
And I'll show you that here quickly before we dive too
much into the Excel version.
This is a dashboard I built using Google Spreadsheets and
Google Apps Script that basically I can hit here.
I can run the query.
It dumps the results in from BigQuery and then does a graph
right here within Google Spreadsheets.
Well, we wanted to bring the same capability to Excel
users, and we've done that with this BigQuery connector.
You can find it at bigquery-connector.appspot.com.

And essentially what it is, is a proxy server for your query.
So when you run your queries in Excel, it makes HTTP
requests to the BigQuery connector for Excel.
And those HTTP requests then gets translated into API calls
to the BigQuery REST API.
But it all happens seamlessly for you as a user.
There's just a few setup processes,
and it's really simple.
So the first thing that you need to do here is actually
authorize the connector for Excel.
I've already granted OAuth permission here, but if you
haven't, you'll just be prompted to
approve OAuth request.
And then you create this key.
And this key is a value that you'll just need to copy and
paste into Excel.
So we going to do that here in a second.
But the other thing you want to do is
download this IQY file.
And the IQY file is really just a configuration file that
tells Excel the information that it needs in order to
query against the BigQuery connector.
So we're going to hop over into Excel here a second and
show you how some of this stuff is used.
So I'm going to actually run a query here.
The query is going to be against the US birth
statistics data, so all the births since 1969 in the
United States.
It's, I think, 137 million rows.
MICHAEL MANOOCHEHRI: It's a pretty big data set.
RYAN BOYD: Yeah, it's pretty big.
It's still small by BigQuery standards, but it's a great
demonstration data set.
So I'm going to run that query.
And I've actually templatized that query a little bit in a
similar way to what you could do in Excel.
And I've put in here additional fields, average
mother age, and then I have this full query here.
So let's show you how we would run the query.
We're going to need a few bits of data for setup.
But I'm going to go to the Data menu here and say Get
External Data and say Run Saved Query.
And then I'm going to select that IQY file that is in my
Downloads folder.
So the IQY file is right here, connector.iqy.
And I'm going to say Get Data.
And it's going to ask where I want to put the data, and I
want to put the data into a new sheet.
And I want to set up the parameters that are required
to connect to the connector.
The parameters are the query itself, which I'm going to get
the value from this cell.
Then the Project ID, and the Project ID is also
pretty easy to get.
We can get that from the BigQuery UI here.
So I have the Project ID.
It comes up in a hover, and I can copy and paste that, and
enter in the value in this box here.
And then the key is that generated key that we use to
pass our authorization over to the BigQuery connector, sort
of copy that in.

And I'm going to hit OK.
And again, we are saying populate the data into the new
sheet, run that query.
So this query is going to get us the state and year of every
birth, but aggregated, based off of the average age of the
mother when the child was born.
MICHAEL MANOOCHEHRI:And this is doing a table scan of how
many records, did you say?
RYAN BOYD: I think it was 137 million.
MICHAEL MANOOCHEHRI: 137 million records.
RYAN BOYD: We'll actually show you running the query in the
BigQuery UI here a second.
You'll see some more statistics data.
But let's make this a little bit larger for you here.
So here's the data.
And it's basically every state and the year, and then what
the average age of the mother's birth in that state
in that year.
So that's really simple.
Now let's just say we wanted to change this because I
actually want the average weight of the babies rather
than the average age of the mother.
So I'm going to hop over to this first sheet here.
And I'm going to change this to, say,
average weight pounds.
And you could, of course, get these values, build these
values up from a--
you can build these values up from a--
sorry.
It looks like we had a little bit of a glitch
here with my sharing.
So you could build these values up from other fields
within your Excel spreadsheet.
So--
MICHAEL MANOOCHEHRI: So using some of the features of your
Excel spreadsheet, you can actually incorporate that into
the query that you're running now.
RYAN BOYD: Exactly.
So I'm going to hop over here, and I need to reshare my
presentation window.
It looks like we had a little bit of a glitch here, but let
me do that.
MICHAEL MANOOCHEHRI: So while we're doing that-- yeah, I
think it's great that for both of these applications with
Google Spreadsheets or Excel, it lets you use what you're
used to for analysis with Google BigQuery.

RYAN BOYD: So, for example, like in there, I built up the
formula, and I just did a simple Replace on the fields
that I was trying to substitute in the formula.
And that was pretty easy.
But you might want to actually do something more complicated,
where you're using the built-in Excel functions and
accepting the analyst input because not all analysts will
necessarily know how to use SQL.
But as long as you can set up this initial spreadsheet for
them with SQL, it's fairly easy for you to do then.
And I'm hoping I can--
yeah, here we go.
So I'm going to reshare my screen here.
MICHAEL MANOOCHEHRI: All right, looks good.

RYAN BOYD: There we go.
I've reshared my screen, and let's hop
back over into Excel.
Sorry about that little glitch here.
So here's my query.
So I wanted to modify this to say the average weight in
pounds instead of the average mother's age.
So I've modified just this field here quickly.
And you could specify any the other fields that
are part of the table.
But I modified that.
And then I'm just going to go over to the Data tab here and
just say Refresh.
And if we hop over to our sheet, we can see we had the
mother's age when they gave birth broken
down by year and state.
And when this refreshes here, instead of the mother's age,
we're going to get the birth weight.
So we'll see here-- and it's a little small again for you, so
I'll enlarge it.

And let's see here.
So we can see that refreshed it in really easy.
All I had to do was press one button to refresh that
information.
That was very simple.
If you're actually using the Windows version of Excel, you
can actually set up dependent fields.
So you can basically say, if the fields that were used to
build up the query change at all, you can ask it to
automatically refresh.
So there's none of that manual refreshing.
MICHAEL MANOOCHEHRI: So that was the same data set, the
same original 100 and something
million record data set.
You just changed one parameter and re-run the query.
RYAN BOYD: Exactly.
And it was really easy for us to do.
But, you know, we can run this type of query directly in the
BigQuery UI, right?
You know, just getting this data, and you can see
I ran it over here.
It took about seven seconds, processed 2 and 1/2
gigabytes of data.
So that was pretty simple to run within the BigQuery UI.
But the advantage of Excel is the analysis that you could do
after you run the query.
So let's say that I want to take this data and I want to
go over and do a pivot table.
So I just say Pivot Table.
And it creates the pivot table here.
And then what I can do is, this is just the count to
start with.
Let's just change this to be the average
instead of the count.
And so now we can see here for every single state, this
massive table that shows you what the weight of the babies
were in that state broken down by year.
But if we want to take this from a pivot table and take it
into a chart, a chart might be more easy for some people to
read although I don't want to chart out every state.
So let's limit here to just pick a few states.
So we'll pick California, Colorado, Alaska.
Let's pick Alaska.
MICHAEL MANOOCHEHRI: That sounds pretty good.
RYAN BOYD: And so just a few states here.
And I don't know--
Louisiana.
And we'll pick a few states here.
And now we're going to take this and make it into a graph.
So I have that data, and now let's take here, and we'll
just say Charts, and we'll say Align Charts.

And it looks like I actually selected a value here, so that
didn't quite work out.
So let me recreate this here without me
having selected a value.
So we just say Line Chart.
This is going to work.
MICHAEL MANOOCHEHRI: There it is.
RYAN BOYD: All right.
There we go.
So here we have our graph.
And these are aggregate results, but the source data
is over 100 million rows of data.
So it's pretty fantastic.
MICHAEL MANOOCHEHRI: So with very little
SQL, actually no SQL.
No SQL is coming out.
You've actually been able to pull all the data from
BigQuery into a pivot table and chart it, just using what
you normally are used to in these
spreadsheet tools like Excel.
RYAN BOYD: Yeah, I mean, I did create that initial SQL, but
for the analyst, you know, the IT person can create the SQL
and the analyst can just plug-in the fields that
they're interested in.
And it's really easy.
And we can then look down and see hey, we have 7.6 or so
pounds per baby in Alaska.
But then if we go down and see Colorado, it
looks like about 7.1.
So--
MICHAEL MANOOCHEHRI: You know what I'm about to say.
Is that statistically significant?
RYAN BOYD: Yeah, it's not statistically significant.
MICHAEL MANOOCHEHRI: We don't know.
RYAN BOYD: I don't know.
So you could talk to the CDC or talk to Michael if you're
trying to figure out whether this is--
MICHAEL MANOOCHEHRI: Export your data and run it in R. You
can do all kinds of stuff with this data.
RYAN BOYD: Yeah, exactly.
So we can see here, we are able to really easily pull in
data, run data, run huge queries on large
data sets in BigQuery--
pull the results into Excel, throw them into pivot tables,
throw them into graphs.
It's a really powerful feature.
MICHAEL MANOOCHEHRI: Right.
I love this idea of using the workflows you're used to and
using the power of BigQuery for what
it's good at, so great.
We'd love to get your feedback on some of these new features.
And a great way to get the feedback to us is to join our
circle, our new Google Cloud Platform
Developers Google+ page.
So if you search now for Google Platform Developers in
Google+, you'll find our page.
Please add us to your circles.
You can actually comment on our threads.
We post a lot of things that are coming up, for example,
some new events that we're going to be at and some
upcoming conferences.
So let's talk about some of those.
So Ryan, can you talk about some of the conferences you've
been to or you're going to be at in the near future.
RYAN BOYD: Yeah.
I recently just did a conference or actually a local
meetup last weekend on Saturday.
It was a great event.
It went a lot of the day on Saturday.
But it was a big data science meetup, and it was in Fremont,
But at the SGI building in Fremont, and it was fantastic.
We had three separate speakers, and those three
speakers were myself, as well as some folks from Zementis
and Pervasive, all talking about some big data
technologies everywhere, from kind of predictive modeling to
ETL to running queries using a tool like BigQuery.
And it was a fantastic event.
The meetup has only been going--
I think this was the sixth session of the meetup, but
very well organized, and they had over 60
people at the event.
So it's great to go to these local meetups.
Let us know if you're an organizer.
Let us know what types of meetups you're having.
Reach out to us.
It's possible we may be able to attend.
So that's just one of the meetups.
But then--
MICHAEL MANOOCHEHRI: We'll definitey
see that group again.
RYAN BOYD: Yeah, exactly.
And coming up, we have a couple other events.
So I think next week, we have the--
I believe it's on the fifth-- the Google Developer Groups in
Silicon Valley, and the Google Developer Group in Silicon
Valley is hosting myself to talk about BigQuery and answer
questions that you have as developers.
And fortunately, it's turned into a rather popular event,
fortunately or unfortunately, however you
want to look at it.
So it's fully booked, but get on the wait list and maybe
you'll be able to get in.
But that should be great.
And then after that, the following week we have the
XLDB Conference in Stanford.
And I'll be speaking there about BigQuery to a group of
hardcore scientists.
So hopefully I don't disappoint them, but--
anyway, Michael, I guess we have some other events coming
up in the fall too, right?
MICHAEL MANOOCHEHRI: I was just about to say we're both
going to be at the upcoming Strata conferences.
One is in London at the beginning of October, October
first and second, and there's a Strata Conference in New
York at the end of October.
So at Strata London, Ryan's going to be
talking about BigQuery.
And we're going to have Kathryn Brisbin, another one
of our teammates, talk about data journalism and data
gathering there.
So that'll be really exciting.
We're also planning on some community events around that
time with the London GDG group and some other groups.
So follow our Cloud Platform Developers page for more
information about that.
And I will be speaking at Strata New
York along with Kathryn.
We'll also have a talk about using BigQuery with GitHub
data, which is very exciting.
And Ryan, you have some community events around the
time on the East coast as well.
RYAN BOYD: Yeah, exactly.
So we've worked with some of the community organizers to
plan some events over in both Boston and New York, so local
Google developer groups in Boston and New York, and that
should be fantastic as well.
So let us know if you have other opportunities for us.
We're happy to look at our travel schedule and see if we
can fit those in.
But there should be some great events.
MICHAEL MANOOCHEHRI: We'd love to see you guys there.
RYAN BOYD: So this has been fantastic.
I hope you have enjoyed some of our new features that we've
been launching in BigQuery.
Let's actually jump over.
We actually do have some questions
over here on our Moderator.
So if you're trying to get into the Moderator, you can
look at Google Developers Live page, and you can click on our
event on the Google Developers Live page if you're not
already on that page.
And you can see the Moderator below.
And we have a couple questions here.
The top question coming in here is, "Could you also point
out scale-wise where you will use BigQuery versus Google
Cloud SQL?" So we're happy to talk about that.
Google Cloud SQL is our MySQL-based solution that runs
in the cloud.
It's a really simple kind of database offered as a service
that you can use from App Engine and maybe even from
some other environments.
But basically, it's a MySQL database.
It's a relational database.
So from a scale perspective, it's not going to have the
same scaling properties as something like BigQuery.
So let's just give you some basic properties of these
different types of databases.
Your standard relational databases use indexes to find
results quickly.
And indexes are fantastic if what you're looking for is
available in that index.
You can get fairly quick results.
But indexes have two properties which aren't so
good, and one is that it takes a while to
build up the indexes.
So as you ingest the data, it takes a while
to build those up.
But then secondly, if you're trying to do large-scale
aggregate data in an ad hoc fashion, you don't necessarily
know what columns you're trying to query against or
what sort of aggregates you're looking for.
And the aggregates oftentimes are kind of impossible to
really include in the indexes.
MICHAEL MANOOCHEHRI: You end up running table scans for
some of these things, over enormous amounts of tables
with lots of links.
RYAN BOYD: Yeah, exactly.
So full table scans used to be considered kind of the devil.
MICHAEL MANOOCHEHRI: Right.
RYAN BOYD: We have this quote from our Google I/O
presentations, one of our colleagues, Jordan, who
basically said a quote from a 16th century French-Italian
philosopher or something like that, where he's like, if
you're running over a full table scan over hundreds of
millions of rows, you're going to have a bad time.
And typically that's how it is.
But BigQuery actually does full table scans, and it's
designed to do full table scans.
Pretty much every query you run against BigQuery does a
full table scan, but since we can distribute that over tens,
hundreds of machines around the world, those queries can
still run really quickly even though they're doing a full
table scan.
Each machine is only processing a very small chunk
of data, so it allows us to do the full table scans quickly.
So I would say use Google Cloud SQL if you're trying to
power your web application for just sort of--
let's say you have an online forum or something like that
and you want to power that.
You need a lot of write capabilities.
You need to update data.
You need to insert new data, add and delete columns.
Cloud SQL would be really good for that.
But if you really need to do the performance of large-scale
analytics, that's where BigQuery comes in.
BigQuery doesn't allow you to modify the data.
It allows you to insert new data.
But then you can run queries over it much, much faster.
MICHAEL MANOOCHEHRI: And I think of it that way, too.
Cloud SQL or SQL in general is great for transactional data,
maintaining state about a database.
BigQuery is really great for asking questions about large
data sets, right?
So it's really great as an analysis tool.
It's not designed to be a relational database.
It's designed for aggregate queries, asking questions, and
dealing with data sizes that are massive, that a SQL
database just can't handle very well.
RYAN BOYD: Yeah, and I mean, typically, you actually
denormalize your data for BigQuery just to get that
additional performance gained by not having to scan over
multiple tables.
MICHAEL MANOOCHEHRI: Right.
And it seems like it leads well into the next question,
which is a very similar question.
So the question there--
RYAN BOYD: Yeah.
Sorry, one second.
But I just wanted to address with MySQL that usually you
can use these technologies together.
So don't think of it as you have to use one or the other.
You really think about using them together.
And we have a lot of our customers that dump data out
of relational databases into BigQuery before analysis.
MICHAEL MANOOCHEHRI: Right.
So let's take a look at the next question.
So the next question is just what Ryan was talking about.
"Can this sit on top of a MySQL database, and can it
replace Lucene for querying?" So yes, I've used all these
technologies together, too.
So just as Ryan said, typically in a workflow like
this where you're using a MySQL database for things like
transactional data and the kind of things you would
normally use a SQL-style database for, you would
actually export the data into cloud storage and then ingest
that into BigQuery.
And sometimes this does take a denormalization step since
BigQuery is often, at times, used for large tables rather
than relational table design.
But basically, you can use them together.
So I would take the data out of MySQL and place it into
BigQuery for some of these kind of analysis tools.
Now, you mentioned Lucene in this question.
Lucene is great for things like searching.
It's kind of a search tool.
And App Engine has a full-text search API
for stuff like that.
However, you can use BigQuery for some types of search.
One thing that may be difficult for a full-text type
search is that we have 64-kilobyte record size.
So if you have unlimited text, it's hard to dump megabytes
into a single BigQuery record.
You can break that text up and have an index sort of type
situation in BigQuery.
But in general, for a full-text search kind of
application, I might use App Engine's Full Text Search API.
BigQuery is great for things like word counts, right?
So things like we have a Wikipedia data set, which we
have and we're going to have revision history.
You could do word counts, word statistics, things like that.
That be a great use case for BigQuery because the data in
those situations can be very big if you have a lot of text.
RYAN BOYD: So one thing that were actually powerful,
though, in terms of text searching with BigQuery is our
regular expressions capabilities.
So although the row size is a little limited, for the text
that you do have stored in those rows, you could run
regular expressions on it.
And you might think running regular expressions on
hundreds of millions or billions of rows would take a
long time, but BigQuery can actually do that in seconds
simply because it's distributed to such a wide
number of machines.
So try out the regular expressions search if you're
trying to just search, for instance, titles or abstracts
or something like that as opposed to a full document.
BigQuery might be good for some of those use cases with
big data sets.
MICHAEL MANOOCHEHRI: And we have that public data set in
BigQuery now, the Ngrams data set.
So we have a selection of Ngrams from the
Google Books project.
We can see how we've used Ngram data as sort of a way
you can do searches.
You can search for phrases and search for particular Ngrams.
And so check out how we're doing that.
I guess the short answer is you can use all of these.
These are all complementary technologies.
And real big data workflow, you can use them all together.
RYAN BOYD: And there's actually one other thing we
want to talk with you about is just kind of share a little
bit with you guys about what we've been up to
over the last week.
So we've been having a lot of fun over the last week,
pulling late nights.
I think I was up to 2:00 or 3:00 AM a couple times playing
with some of the Wikipedia data that we found.
We actually found it at dumps.wikimedia.org.
They actually have the number of page views by hour for
every page in Wikipedia going back for about
five years, I believe.
And so we've been crunching along this data and trying to
use some of the other Google Cloud technologies just to get
familiar with some of the other Google cloud
technologies like the Pipelines API on App Engine.
Michael, do you want to talk a little more about that?
MICHAEL MANOOCHEHRI: Yeah, sure.
So we're building a complete pipeline.
We're extracting the data from the Wikipedia dumps page.
We're transforming it into CSV data, into Google cloud
storage, and we're ingesting that into Google BigQuery
using 100% Google Cloud Platform technologies,
basically App Engine and the Pipelines API
to build the workflow.
So for each step, we have a pipelines API class that we're
using to build this workflow.
RYAN BOYD: Yeah, and it's--
I think it's 2 terabytes of raw source data.
That's actually gzipped data.
We haven't uncompressed all of it, but I'm guessing from what
we've seen, somewhere between 6 and 10 terabytes of
uncompressed data.
So it's a pretty fantastic size of data, and it's really
great to play around with that data in BigQuery.
Like I said, we've been having a lot of fun, pulling a lot of
late nights.
MICHAEL MANOOCHEHRI: It's been fun, and eventually we'll
share the code from some of these projects so you can
actually check out what we're doing yourself.
RYAN BOYD: And I don't think we actually have any more
questions from the audience on our Moderator, so we'll wrap
it up for this week.
But we wanted to thank you all for joining us and invite you
to come out to some of the upcoming community events that
we're doing, whether you're in the Bay Area, whether you're
in Boston, or New York, or London.
We may actually even be out to Sydney--
MICHAEL MANOOCHEHRI: Yep.
RYAN BOYD: --in the fall.
So let us know where you're going to be and join us where
we're going to be.
We'd love to talk with you about the big data challenges
that you're having and see which of the Google Cloud
Platform technologies are a good fit.
I love the questions.
MICHAEL MANOOCHEHRI: Don't forget to add to the Google
Cloud Platform Developers page to your circle so you can find
out about all these upcoming events.
RYAN BOYD: Great.
That's it for the Michael and Ryan show.
MICHAEL MANOOCHEHRI: We'll see ya.
RYAN BOYD: Take care.

[MUSIC PLAYING]