GDD-BR 2010 [2F] Storage, BigQuery and Prediction APIs

Uploaded by GoogleDevelopers on 14.12.2010

>> CHANEZON: Okay. Hello, everybody. Again, I'm Patrick Chanezon, developer advocate at
Google. I'm French but I live in San Francisco. And this is the second of the talks about our
cloud computing offerings. So in the first talk, I talked about our platform offering
which was Google App Engine and Google App Engine for Business. And now in this talk,
I'm going to tell you about some new offerings that are more at the infrastructure layer.
So what is called infrastructure as a service that Google has announced at Google I/O in
May. So these three offerings are not available for everybody yet. It's in trusted testers
right now. So if you're interested in participating in the trusted tester program, I'll show you or
I'll tweet the URL later today. Just follow my Twitter account where you can sign up to
be--to be part of the Trusted Tester Program. So I'm going to tell you about Google Storage,
BigQuery, and the Prediction API. For me, these are the most exciting of our cloud computing
offerings, especially BigQuery and the Prediction API, because they solve problems that are
very hard to solve manually by yourself and they expose some of the Google infrastructure
and Google services for which I don't think there are equivalents at other cloud providers.
Again, the mobile agenda for the day is there, and you can rate the sessions over there.
Okay, so Google Storage for Developers, Prediction API, and BigQuery. So cloud computing, as I
told you in the previous session, it's a very loaded word. Everybody's trying to mark their
offering with the term cloud. At Google, the way we understand cloud is like having some
providers reach some economies of scale when they have lots of servers and they can automate
lots of processes, and then offering that as a service for third-party developers. So
that doesn't encompass what some vendors are calling private clouds; this is really the
public clouds. Okay. So, yeah, the analysts classify stuff as infrastructure, platform,
and software as a service. In this session, we're going to talk really about the infrastructure
part. So Google cloud offerings, I told you about that before. So we have software as
a service with Google Apps, the Apps Marketplace, and then the apps you can build on top of
the Google App Engine platform. For us, it's Google App Engine with Java and Python. And then
we start to offer infrastructure with Storage, Prediction, and BigQuery. So let's talk about
Storage first. Basically, Google Storage is a very simple storage service with a REST
API, and we're using the same API as Amazon S3, so you can use the tools that you're using
with S3 with Google Storage. But we added a few features to them like, for example,
for access control. Okay. So you can use that for hosting static content, for doing backups.
There are lots of companies who do synchronization software where you install some software on Windows
or Mac, and it completely integrates with the Finder or with Windows Explorer. And then,
you can mark some folders to be synchronized in the cloud and you can get them back on
other machines. So, some of these providers are using Google Storage behind the scenes.
You can use that for sharing data. It can be used as data storage for applications.
And last but not least--and I think it's one of the most interesting use cases--it can
be used as storage for computation, which means Google Storage is an entry point. When
you put your data in there, it's an entry point to access other Google services like
Prediction and BigQuery. So basically, once your data is in Google's cloud, we're going
to offer you more services so that you can do useful stuff with them. So, the benefits
of Google Storage: it's high-performance. Basically, we are--we are leveraging all the magical
infrastructure that we have at Google with data centers all around the world; some replication
layer for the content and the caching servers. And there are some people who did some evaluations
of the latency of Google Storage all around the world and it's very, very good. Security
and privacy, a very important aspect and differentiator for Google Storage. In Storage, we have some
pretty deep access control rules that you can set where you can say--and it's integrated
with Google Accounts, so that means at the API level, you can say, "This file will be
accessible only by these Google accounts or this Google group (which is a group of accounts)
in view or edit mode." And then these rules can be enforced at the browser level. So
that means you don't need to have a server to enforce access control between
the end-client and your file. And this is very useful when you're building applications;
we take care of access control for you. And then it's super easy to use because we leverage
the existing--all the existing tools that were used with the S3 API because we support
the same API. So that means that you can use tools like the Boto library in Python, or S3Fox,
the Firefox extension to manage your files, or Java libraries or things like that. So
technical details; it's a very simple REST API. Who among you have been using Amazon
S3 before? Okay. So basically, you'll feel at home; it's the same API. So, resources
are organized by buckets. Buckets are just flat containers; there's no hierarchy in there.
And then under that, you put resources that are represented by--identified by a URI. And
some people have been using, like, resource naming with slashes in order to replicate file
system hierarchy; but it's not exactly the same semantics. You can put any type of object.
The limit in size is 100 gigabytes per object. So, access control, I told you about that.
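
The flat bucket namespace and the slash-naming convention mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Google Storage API: the bucket is modelled as a plain dictionary, and the object names and the list_prefix helper are made up for the example.

```python
# Illustrative only: a bucket modelled as a flat mapping of object names to data.
# There are no real folders; "images/2010/" is just part of each name.
bucket = {
    "images/2010/logo.png": b"...",
    "images/2010/banner.png": b"...",
    "images/readme.txt": b"...",
    "backup.tar": b"...",
}


def list_prefix(objects, prefix, delimiter="/"):
    """Emulate a directory listing over flat object names.

    Returns (object names directly under prefix, pseudo-folder prefixes).
    """
    names, folders = [], set()
    for name in sorted(objects):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Anything with a further delimiter is reported as a pseudo-folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            names.append(name)
    return names, sorted(folders)


print(list_prefix(bucket, "images/"))
```

The listing is computed purely from name prefixes, which is why the semantics differ from a real file system: there is nothing to create, rename, or delete at the "folder" level.
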
There are two ways to authenticate the request; either you sign them using your access key
that you get when you sign-up for the API or you use the web browser login with the
Google Account. So in terms of performance and scalability, yes, as I told you, it's
100 gigabytes per object maximum. You can have thousands of buckets and an unlimited number
of objects. The data itself is replicated within multiple U.S. data centers. And so--but
then, we are leveraging all the Google network of edge servers that we have all around the world
that provide caching for these files. So what that means is that when you're measuring latency
to get your files from anywhere around the world, usually, it's pretty fast. Okay. Yes.
And then another--a few other differentiators is this very strong data consistency which
is, when you have written a file, if it's--if it gets read from anywhere around the world,
once you have received the acknowledgement that it was written, you read the same file.
And you can do range GET queries as well. So privacy and security, it's a key-based
authentication system. We also support authenticated downloads from the web browser using Google
Accounts. You can share with individuals, or with Google groups, and you can--at the
API level, and that's, I think, a differentiator with S3. We added a few API calls that let
you specify an ACL for buckets and objects. So in terms of tools, we have one tool called
the Google Storage Manager. Maybe I'll show it to you. Let me see where my browser is.
Okay. Okay, so the Google Storage Manager is a very simple user interface to manage
your buckets--so these are my buckets--and then, manage your files in there. So these
are the files that are uploaded. So once you have uploaded the files using the API, then
you can decide to share them publicly. So this is an image that I decide to share publicly.
And when I share something publicly, I get a link that I can then share with people.
And if I had access control on that, I would be obliged to log in to access it. So it's
a pretty simple, basic tool where you can upload files manually and things like that.
There's another tool called GSUtil which is a command line tool that allows you to do
all the operations that you can do with the API but from the command line. And GSUtil
is itself built in Python on top of the Boto Python library that was used with S3 and they
added some Google Storage specific commands to it for managing ACLs, for example. So,
with GSUtil, you can say--actually, I can show it to you maybe. So here, basically,
I'm copying all these images under the Chanezon bucket that I have created for my account.
And the first time you use GSUtil, you need to identify with a token. And then it's copying
all the files. Yes, connection is pretty slow there. So maybe I'm just going to close it.
And just to show you--actually I had these files already in there. So, if I sort by last
updated... Okay. I don't know. You get the ID. Okay, so that's GSUtil. And GSUtil allows
you to manage your access control rules as well. So Google Storage is used within Google
by lots of Google services already. So, we used it for the Haiti relief imagery. When
there was an earthquake in Haiti, we just put some Maps imagery in there. Google Buttons
is using it; Picnik is using it for all their images; Panoramio for images as well; DoubleClick
and YouTube are using it as well. And it's used by BigQuery and the Prediction API that
I'm going to tell you about. There are lots of companies who have started using Google
Storage for developers, so VMWare. Syncplicity is a solution for syncing your files, as I
told you before. Memeo is using it; it's the same kind of solution. The Guardian has been
using it for some of their apps. Socialwok, they're using every product we have. Widgetbox
has been using it for serving images and resizing them. The pricing is--so, it's 17 cents per
gigabyte per month and then 10 cents per gigabyte for upload, and 30 cents per gigabyte in APAC
for downloads, and $0.15 in America and EMEA; and then, the requests have different pricing.
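
As a worked example of the figures just quoted, the sketch below adds up a monthly bill. It uses only the preview prices mentioned in the talk and leaves out the per-request charges, whose amounts are not given; the function name and the region keys are assumptions made for the illustration.

```python
# Prices as quoted in the talk for the 2010 preview, in US dollars.
# Per-request charges are deliberately left out because no figures are given.
STORAGE_PER_GB_MONTH = 0.17
UPLOAD_PER_GB = 0.10
DOWNLOAD_PER_GB = {"americas_emea": 0.15, "apac": 0.30}


def monthly_cost(stored_gb, uploaded_gb, downloaded_gb, region="americas_emea"):
    """Estimate one month's bill from storage and transfer volumes."""
    total = (
        stored_gb * STORAGE_PER_GB_MONTH
        + uploaded_gb * UPLOAD_PER_GB
        + downloaded_gb * DOWNLOAD_PER_GB[region]
    )
    return round(total, 2)


# 100 GB stored, 20 GB uploaded, 50 GB downloaded:
# 100 * 0.17 + 20 * 0.10 + 50 * 0.15 = 17 + 2 + 7.5 = 26.5
print(monthly_cost(100, 20, 50))
print(monthly_cost(100, 20, 50, region="apac"))
```

With the same volumes, serving the downloads in APAC instead of America or EMEA raises the total from 26.5 to 34.0 dollars, since only the download rate changes.
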
So right now, this is a preview and it's in the US only; but we've been doing lots of
developer events around the world and everybody's interested in using that plus using BigQuery
or the Prediction API. So, we asked the product team and they gave us the okay to put a form
out for non-US developers to sign up for it. So there's some waivers in there to tell you,
"This is not supported yet," and all that, and it's only for the US, but if you want
to play with it, just go ahead. So, I'll--I'll Tweet the link later. So, Google Storage,
pretty simple; you can store any kind of data in there. There are many tools and libraries
available, and it's a pretty simple API to play with. Now, where it gets interesting
is that Google Storage is only the first layer of our infrastructure offering. Once you put
your data in there, you can start applying, like, Google services that don't exist anywhere
else. Two of them are the Prediction API and BigQuery. So, I'm going to start with the
Prediction API. Prediction API is basically a packaging of our machine learning algorithms
that we're using at Google for many different products, especially in Ads. And how many
of you are familiar with machine learning here? Pretty good. Pretty good turnout. So
as you may know, applying machine learning to your data can be time-consuming; you need
to know the algorithms pretty well. Usually, you need to tune them. There's lots of--lots
of technically involved decisions and iteration to do in order to leverage that in a web
application. So with the Prediction API, we try to make that easier for developers. So,
the idea is very simple, it's--you're creating a CSV file with lots of columns. The first
column is the value you want to predict and all the other columns are considered features.
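
A minimal sketch of that file layout, using Python's standard csv module: each row starts with the value to predict, followed by the feature columns. The sample rows and the quoting style are assumptions for illustration; the exact quoting the Prediction API expected is not stated in the talk.

```python
import csv
import io


def build_training_csv(examples):
    """Turn (label, feature, ...) tuples into CSV text with the label first."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for label, *features in examples:
        writer.writerow([label, *features])
    return buf.getvalue()


# Hypothetical rows: the first column is what we want predicted,
# the remaining column is the feature the model learns from.
rows = [
    ("English", "The quick brown fox jumps over the lazy dog."),
    ("French", "Portez ce vieux whisky au juge blond qui fume."),
    ("Spanish", "La casa es muy grande."),
]
print(build_training_csv(rows))
```

Letting the csv module do the quoting matters once features contain commas or quotes, which free text almost always does.
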
You put that in Google Storage, then you ask the Prediction API to train a model on this
data, and then you have another API call to check the status on the model. Usually, after
ten minutes or maybe an hour--I don't remember well what's the latency--but less than an
hour, your model is trained and is ready, and it gives you an estimated accuracy for
the model. And once you've done that, you can start asking for prediction. This is a
very simple REST API where you send JSON payloads to the API. And what you send is all the features
except the first one that you want to predict and it will give you a prediction for what
that first column should be. So, this is a very simple example where we have put a
CSV file where the first column is the language, and then the second column is the sentence.
And so we put thousands of sentences in various languages; and then after that, you just send
it a sentence and it will detect the language for you. So this is what the model looks like.
So, you create this file, CSV file with all these sentences, then you ask--and their language.
Then you ask the Prediction API to train on that and once it's trained, you send it new
sentences and it should detect what language it is. So that can be used--that kind of API
can be used in many different contexts where the barrier to entry to using machine learning
was too high; the Prediction API makes it super easy to get started with. Basically,
in a few hours, you can build your file, put it to Google Storage, train a model, and you're
up and running.
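
Because training runs in the background, the client side of the workflow above boils down to a polling loop. The sketch below shows one way to write it, under clear assumptions: get_status is a caller-supplied, hypothetical function standing in for the real status request, and the dictionary keys "done" and "accuracy" are invented for the example rather than taken from the Prediction API.

```python
import time


def wait_for_training(get_status, poll_seconds=60, timeout_seconds=3600,
                      sleep=time.sleep):
    """Poll get_status() until it reports a finished model or we time out.

    get_status is a hypothetical callable returning a dict such as
    {"done": False} or {"done": True, "accuracy": 0.9}.
    """
    waited = 0
    while True:
        status = get_status()
        if status.get("done"):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("model still training after %d seconds" % waited)
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for the real service: two "still training" answers, then a result.
fake_responses = iter([{"done": False}, {"done": False},
                       {"done": True, "accuracy": 0.9}])
result = wait_for_training(lambda: next(fake_responses), sleep=lambda s: None)
print(result["accuracy"])
```

Passing the sleep function as a parameter keeps the loop testable without real waiting; in real use the default time.sleep applies.
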
So this is an example that my colleagues built where--it's kind of a business-oriented example
where you want to categorize and respond to emails based on the language in which they
are written. And so, you want to predict the language of emails; so it's pretty much the
same thing. So you upload your data, your training data, to Google Storage; so the training
data would look something like that. To upload it to Google Storage, you use GSUtil, the
command line I showed you where you copy your data to your buckets. Once you have it, you
can train a model. So for that, you use a POST to the Prediction API, to the training
endpoint where you specify as a parameter the bucket where you stored your data. So
then, the training runs asynchronously and it can take from ten minutes to an hour and
during that time, you can do a GET to just get the status of the training. And so, when
it's not ready, it will tell you, "In process." When it's ready, it will tell you, okay, it's
ready and the accuracy is a figure between zero and one. And then after that, you can
apply the trained model to new data. So, when you have new data which is a sentence that
comes in an email, you just send that to the Prediction API wrapped into a JSON payload
like this. And then, the result will be a JSON result where it will tell you the output
label--here it's French. And you can also have multiple outputs with different scores;
they added that pretty recently. So that's how we would use that in Python. So it's like
a few lines of Python just to get a prediction result. So in terms of capabilities, input
features can be numeric or unstructured text. Be careful about the unstructured text. I
will show you an example I built where I scrape HTML pages and you need to clean them up pretty
well if you want good results. The output can be up to hundreds of discrete categories.
They use--so the kind of technique that's used there is called Supervised Learning.
So they're using many different techniques but the tuning of the algorithm is completely
automated and it's done asynchronously. So, you won't have access, you don't know which
algorithms are going to be applied. And then, you can access that from anywhere where you
could access HTTP, it's just a REST API. But it's very easy to use from App Engine with
URL Fetch, or from Apps Script. And Apps Script is a pretty cool case where it allows you
to add scripts to your--to Google Apps, for example, in a spreadsheet. So you could have
a spreadsheet pulling predictions from the content of cell values, and you can use that
in your desktop apps, as well. So, the--what they added in the version 1.1 which is from
a month ago, I think, is multi-category prediction, continuous output and mixed inputs where you
can put inputs in numeric or text. So, that's prediction. Let me show you my example. Actually,
I'm going to go down there. I think I put a slide down there that explains--yeah. So
when I--when I saw that API coming out, I was pretty excited. And what I wanted to do
is--I've been using Delicious for tagging links in the past five years. And I think
I have 6,000 URLs in there that I categorized in 14,000 tags. And these tags are very personal
to me, that's the way I see information, the way I categorize it, really very personally
for the topics I'm interested in which is HTML5 and cloud computing and social software
and stuff like that. So I read all these articles and I tag them. Now, what I'd like to happen
is that instead of having to tag them myself, I said, "This system would be good to learn
from my habits and take all this knowledge," and now, when I find a new article, it would
provide me with the tags I would be more likely to use for the article. So, that's what I
was trying to do. So, I used the Delicious API to get all my tags and then all my tags
and URLs. I wrote a small Python script to scrape all the pages and clean them up. And
actually, that's where I need to do a better job. So, I end up with a CSV file that has
a tag, a URL, and then the full text and I remove all the HTML tags of all the pages.
So that file was too big for--I think Prediction right now is limited to 100 megabytes, so
I had to clean up and make a smaller data set. But after that, I put that data set into
Google Storage and then I asked for training--I asked the Prediction API to train it for me.
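
The clean-up step described here, turning a scraped page into plain text before it goes into the CSV, might look like the sketch below. It is not the speaker's script: it uses Python's standard html.parser module as a stand-in for whatever cleaner was actually used, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><style>p{color:red}</style></head><body>"
        "<h1>Nexus</h1><p>Hands on <b>review</b>.</p>"
        "<script>var x=1;</script></body></html>")
print(html_to_text(page))
```

Dropping script and style contents is the bare minimum; navigation menus, ads, and comment threads also end up in the text unless they are filtered out, which is one plausible reason for poor accuracy on scraped pages.
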
So, there are some command-line examples... Actually, let me show you [INDISTINCT]. Yes,
so that's why the next slide [INDISTINCT] is pretty small. So the bucket in which I
have uploaded--so I uploaded my training set which is called smalltags.CSV because I had
to reduce the size into Chanezon, so that's where my model has trained. And then there's
the token thing for authentication and all that. For--yes. So for the model, what I'm
doing is I'm going to pass as a parameter to that script a URL of an article that I've
read. And here, what it's going to do is it's going to get the URL, remove all the--remove
all the HTML tags, and then build a JSON data structure with the text as an input. And then,
I'm just going to ask a prediction for it. So I pass the authentication data, and then
I say, "For that model," which is the model there, "Please predict the tags I would have
used for it." And--okay, let's not cheat and let's use a random article. So I'm going to
take Techmeme from today. I'm completely addicted to Techmeme. Let's say, "Hands on the Nexus
Two by Samsung." Okay. So I take this random new article that I haven't read yet. And the
goal of this is--what I'm planning to do is to build a Chrome extension and an App Engine
backend for it where, when I browse, it will tell me the categories automatically. Let's
see if it works. So first, it's pulling the content of it then cleaning it up. So I tried
different cleaners in there, like Beautiful Soup and HTML to or something like
that. And they didn't give me very good results. Actually, the accuracy of my training, once
I've done it, I think was like 005% or something which is not very good. So this is a typical
example where it fails. It tells me it's about RSS. Not really. I had another article that
I just tried before coming which was about Apple, and this one worked a little bit better.
So I received this JSON payload that tells me the output should be that. Okay, so this
one is well categorized. So that's an example I built in a few hours and I need to work
on it again, obviously. My colleague, Nick Johnson, wrote an excellent blog post. His
example is much better. I think he has a 63% accuracy for it. So he took a different data
set; he took all the--do you guys know reddit? Yes. So, reddit is a site where you can categorize
articles and it's done by lots of people. So in my case, I had used Delicious which
I use as a single user. What he did is that he took all the content of reddit for one
week, like a few thousand posts and subreddits which are the categories that users are categorizing
them in. He put that as the training set and then he reserved a part of the training set
for verification, and he had a 63% success on that one. And he built an App Engine app
out of it. I think I can show you. Okay. And he has a live demo, so let's try something.
Okay. So you have a--you have to put a title and a domain. Actually, let's try the one
that I tried before, this guy. So--and he just uses the domain, not the full URL. I
guess that may help as well.
Okay, let's see if it works. So then, he's calling App Engine, which is taking this data,
submitting it to the Prediction API, getting a prediction back and telling it to us if
it works. Tough luck. At least he puts the code in there so you can do it yourself, and
see for yourself whether it works. So I highly recommend his blog post; it's pretty good.
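
The verification approach described for the reddit example, reserving part of the labelled data and measuring how often the predictions match, can be sketched as follows. This is a generic illustration, not the code from the blog post: the function names are invented, and a trivial majority-class baseline stands in for the trained model so that the example runs on its own.

```python
import random
from collections import Counter


def holdout_split(examples, holdout_fraction=0.2, seed=0):
    """Shuffle (label, features) pairs and reserve a fraction for verification."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, held_out):
    """Fraction of held-out examples whose predicted label is correct."""
    correct = sum(1 for label, features in held_out if predict(features) == label)
    return correct / len(held_out)


def train_majority_baseline(training):
    """Stand-in model: always predicts the most frequent training label."""
    top_label = Counter(label for label, _ in training).most_common(1)[0][0]
    return lambda features: top_label


# Hypothetical labelled posts: (category, title).
posts = [("programming", "title %d" % i) for i in range(7)]
posts += [("science", "title %d" % i) for i in range(7, 10)]

training, held_out = holdout_split(posts)
model = train_majority_baseline(training)
print(len(training), len(held_out), accuracy(model, held_out))
```

Comparing a real model's accuracy with such a baseline gives a sense of how much it has actually learned: with many categories, a figure like 63% is far above what always guessing the most common category would achieve.
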
Okay. So now, let's talk about BigQuery. So, Prediction API allows you to use machine learning
in your own application. Another issue that people have with big data, you have lots of
data like maybe billions of rows. If you put them into a relational database, it's very
slow. You can put it into a proprietary analytics engine, but it's expensive. What we've done
at Google, because we have the same kind of issue, we need to analyze billions of rows
of data, for example, for all the clicks when people click on ads. So Googlers developed
internal infrastructure for doing queries on all these data. And this infrastructure
has been exposed to developers in a service called BigQuery. So that's what we're using.
Basically, you can do SQL-like queries on huge sets of data and the response is very
fast. The API is a REST API using JSON the same way, and it's very simple to use. So
you can use that for building interactive tools for embedding into Google spreadsheets
because you can use it from Apps Script using the REST API. It's scalable to billions of
rows, very fast response and the queries are in SQL, so it's very simple. The way it works,
the same way as the Prediction API which is you upload your raw data as a CSV into Google
Storage, then you have a REST API which I think is not ready yet but should be very
soon. You have a REST API that allows you to send a JSON payload that defines your schema.
Basically defines the name of the fields in your CSV and the type of these fields. In
my example, I had to send my CSV to Google Storage and tell the team to just apply my
schema. So you define your schema and then after that, you just query it using a REST
API and sending your SQL statements that look something like that. So you can do "GROUP
BY", "ORDER BY." You have some functions like Math, String and Time, and you have some statistical
functions like "TOP" or "COUNT DISTINCT." The API is very simple; you do a GET on the
table name. And the table name is essentially the bucket and file where you have put it
in Google Storage, then you can query it. And the query--so that's the query of the
results. No, that's the results. So you send a query that's a SQL query; and the results
you receive look something like that, a bunch of results. Security, it's like ClientLogin
or OAuth or AuthSub. It's typical Google security stuff for our APIs. It supports HTTPS and
they have--they have some example where they took the whole revision history of all Wikipedia,
put that into storage, defined a table for it. And then you can start to do things like
select the top title--the top five titles and give me the number of revisions for these
from the Wikipedia table and WPM space--I don't know what it is. But basically, you
can say, for a specific topic, give me the top pages that talk about that and that have
the most revisions. And so that's the result. And there's a command line as well. I wanted
to show it to you but I have an old version of it. And so, there's a command line, you
just can type your query and you get the result right away. And the most impressive aspect
is when you integrate that using Apps Script, because it's a REST API giving you back JSON.
So in Apps Script, you can just make a call using input data. You put the URL to your
query because it's just a get and then the results, you can put it in your table. So
here in this example, we're going to type a search term and then it's going to use the
search term as a parameter in the query, and it's going to give us the title and number
of revisions for the documents that have the most revisions for that topic in Wikipedia.
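
The shape of that query, a search term passed in as a parameter, grouped by title and ordered by revision count, can be tried locally with the sketch below. It runs on SQLite through Python's sqlite3 module purely as a stand-in: this is not BigQuery or its SQL dialect, and the table, its columns, and the sample rows are invented for the example.

```python
import sqlite3

# Local stand-in table: one row per revision of a page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikipedia (title TEXT, revision_id INTEGER)")
conn.executemany(
    "INSERT INTO wikipedia VALUES (?, ?)",
    [("Google", 1), ("Google", 2), ("Google", 3),
     ("Google Maps", 4), ("Google Maps", 5),
     ("Brazil", 6)],
)


def top_titles(conn, search_term, limit=5):
    """Titles containing search_term, most-revised first."""
    query = """
        SELECT title, COUNT(*) AS revisions
        FROM wikipedia
        WHERE title LIKE '%' || ? || '%'
        GROUP BY title
        ORDER BY revisions DESC, title
        LIMIT ?
    """
    return conn.execute(query, (search_term, limit)).fetchall()


print(top_titles(conn, "Google"))  # [('Google', 3), ('Google Maps', 2)]
```

The search term is bound as a query parameter rather than pasted into the SQL string, which is the safer habit whenever the term comes from user input such as a spreadsheet cell.
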
So, you type Google and you get these results and then you can assemble them in diagrams
like this. So basically, the kind of heavy duty data processing that used to require
a lot of server side infrastructure and experience; now, you just need to put your data in Google
Storage and then, you build your front-end using very lightweight JavaScript infrastructure,
or even directly in Google Apps. Yes, so I told you about my tagger example, the subreddit.
So that's Nick's blog post, and that's it. So basically, with what we announced in May
is that Google is starting to offer infrastructure-type services in the cloud. First one is Google
Storage, and it's the point of entry for more exciting services like Prediction or BigQuery
that are hard to do when you're doing that by yourself. You can learn more over there,
and when this session is finished--and I have five minutes--I'll just upload my
slides to SlideShare, and will Tweet about it. And I will Tweet the URL to the sign-up
sheet as well where you can sign up for Google Storage, Prediction and BigQuery. Okay. So
we have to finish. I'll be in the--I think there's a room about cloud stuff there. I'll
go on. Thanks.