Google I/O 2009 - App Engine: Scalability, Fault Tolerance..


Uploaded by GoogleDevelopers on 02.06.2009

Transcript:
>> KOCHER: All right. Hello, everyone. Welcome to App Engine Nitty Gritty. We put up one of those Google Moderator pages, so if you want to jump over there, there's a tiny URL called App Engine Nitty Gritty. You're welcome to submit questions there, and we'll take them from the mics as well. As people are trickling in here and grabbing food, I'm just going to tell you a little bit about what we're covering today. For folks who went to Brett's session yesterday afternoon on scaling apps on App Engine: there's going to be no overlap at all with that one. We'll also be covering things about scalability, but none of the same things. We definitely recommend his talk, though, if you didn't catch it. So, we're going to talk about some things that are unique about building on App Engine.
When you're on a system where you expect to have some errors, how can you build something that's stable and reliable on top of that, knowing that some of those parts can and will at times fail? We'll also talk about some of the lessons we learned and recommendations about how to make your app scale well, and we'll also talk about integrating external services. In our case, that was Amazon EC2. We'll talk about why we did that and why you might want to (it could be a different service), looking at how you might have things that don't fit onto App Engine but are important to your app. We're from FrontSeat.org, which is a civic software company
up in Seattle. And, we were founded around the idea that software is getting cheaper
and cheaper to build, and because of that we can start applying software to, you know,
civic issues, things where in the past it might have been prohibitively expensive, but
with the falling costs, there really are some new opportunities out there. I'm Jesse. This
is Dave and Josh, and I'm the lead developer at Front Seat. These guys each run their own
consulting companies and have been working with us for a long time, and they've been
very involved in all of our work with App Engine. I'm going to give you a couple of
examples of the kind of civic software I'm talking about before we dive in. This
is the site we did just after the election last fall, where Obama had said, "I want to appoint a chief technology officer for the country," but there was no definition of what that role was, and so we put up a site. We did it really quickly; it took us about a day to launch. It let people submit, discuss, and vote on ideas for what that position should be about, what the priorities for that position should be. You can see the top result here was about a combination of net neutrality and accessibility. Earlier, last summer, we did
a site where students who went to school in a state other than their home state could go and say, "You know, I'm from this state. I'm at school in this other state," and the website would tell them where their presidential vote would count more, and then it would help them register to vote and get an absentee ballot. So we have a whole bunch of projects; I'm going to skip over these other ones. If you want to know more about the kind of stuff we're doing, you can check out frontseat.org, which links to all of these things, everything from utility bill designs that promote conservation to satire about the payday loan industry.
The project we're going to talk about today is called Walk Score. Before we dive into that, the context for all of these is that we're a very small team, we have a lot of projects, and we need to minimize our overhead and our maintenance; that's the context for our use of App Engine. We wanted to bring a civic software service to scale on a budget with a very small staff, with no systems staff to sit around and monitor it all the time. So, on to Walk Score. Walk Score is a website that lets you measure the
walkability of any neighborhood in terms of access to amenities; how much can you do without getting in a car? I'm going to give you a demo really quickly of the original Walk Score website, which started out as a Google Maps mashup. You can search here for any address; I'm going to search for the Moscone Center address. So you can see, if our network cooperates; my monitor also seems to be a little bit confused about what size it is, so sorry that this is off-center here. So, as these searches
complete, it pulls up a bunch of amenities in the neighborhood and it calculates the
score, up here at the top, from zero to a hundred of how walkable that neighborhood
is. So, there's a ton of stuff around here; you can do a lot of things without getting in a car, so you get a very high score, 98 out of 100. And on this side here you can see
things like the restaurants, the coffee shops, got parks and schools, all these different
amenities. And, we've got integration with Yelp. So if you look at the Yerba Buena, it'll
search for reviews and pops up a little picture there, you can click through the reviews.
And we also have Street View, so maybe you want to get coffee and see, you know, what's
the closest thing that's not Starbucks. Jump over here, and I think that's it. Yeah, there's
that coffee shop. So, this Walk Score indicates, you know, how much you can do without getting
in a car, and it's also an indicator of the vibrancy of the neighborhood. And, this part
that I just showed you, this is great for checking out a specific address, but what if you want to get to know a city at a larger scale? We did some work where we generated these heat maps that show you, in green, the most walkable areas, fading out to red for the least walkable. So this is the map for Seattle. For those who know the city a little bit, you can see downtown, Capitol Hill, the University District, and then these little pockets down in Columbia City and West Seattle where you have these small walkable neighborhoods. And this is a great way to get a quick insight into the shape of the city in terms of, you know, where people actually hang out on the streets and where the street life is. So why are we doing all this work about walkability?
Well, it turns out that walkability encapsulates a whole bunch of great things into one concept;
it's very easy for people to understand.
So, here are those benefits very quickly. Climate: when people walk more, they drive less, and they emit fewer greenhouse gases. Health: people weigh about seven pounds less on average in walkable neighborhoods. These neighborhoods tend to have strong social capital, good transit options, and very few auto-related deaths. There are also a lot of economic benefits: home values tend to be higher, appreciate faster, or in this climate, fall more slowly. And transportation costs are a big part, about 18% of household income, and much less in walkable neighborhoods. They tend to have very strong local economies where local businesses can thrive. It's also something that people rank very highly in choosing where to live; they say walkability, or access to amenities, is one of the top two things, higher than property taxes or schools. So, what we tried to do is create demand for walkable
neighborhoods by educating people about the benefits, and then fill that demand by providing transparency about the walkability of every property. We want to help people find these places and also create more of them. This kind of transparency matters most when people are choosing a place to live, and that means we need to be on real estate sites. So, the first thing we did for that is build
this widget that's very easy for people to embed, and this is basically a miniature version of what I showed you before. It's that rectangle part; it's what can be embedded into a real estate site, and we've seen a lot of adoption of it. We get about twice as much traffic to that tile as we do to our website, so it's been really successful. But they often put us down in the kind of "about the neighborhood" section, and that's not really the ideal placement. What we really want is this: we want to be right in the primary information about a property: three bedrooms, two bathrooms, Walk Score: 86, and we want "Search by Walk Score." And to do those things, we need to build an API that gives scores back to a real estate site very quickly, so they can just pass us a latitude and longitude and we give them a score; they don't want to wait for all those local searches to happen and all the restaurants to load and all that. So we're going to jump into the tech stuff now, and
we'll talk a little bit about why we brought EC2 into the picture and why we didn't just build everything for the API on App Engine. And we'll offer some guidance on making apps scale, some issues that we ran into, and thoughts about working within the App Engine environment. I'm going to hand it over to Josh to take
us in.
>> LIVNI: Thanks, Jesse. All right. So, when we first started talking about the API, this was pretty much our ideal workflow for the entire API. All we really want to do is have a request come in (latitude, longitude, a location), have it go to this magic box, and have it spit out a score response; that's pretty much it. Of course, things get a little bit more complicated than that, but before we get to the complications, we had to figure out where we were going to host it. It's pretty straightforward; we could have run our own servers. So when we were considering where to put it, these were some of the things we thought about; unfortunately, these are just the negative considerations of App Engine at first glance. One was vendor lock-in: if we write code for App Engine, is it portable? It's similar to Django, but we were going to have to rewrite some things in case we had to move. Cost: when we started this
and launched it, there was no pricing announced, so we didn't know what we were in for; we assumed it would be a competitive cost to host on App Engine. And it's a beta product, so we didn't know: is it going to potentially go down for six or eight hours and make us look bad? Are we going to have various other problems? After all of these, we decided, you know, that Google engineers are probably going to keep things up better than we would with a series of EC2 or other virtual machines, that the cost would be good, and we weren't super worried about the portability of the code, so we went with App Engine. And the
next step was to figure out what I considered the core functionality, and the core functionality is just this; it's really all we need: the ability to return responses really fast, always on. So the approach for us was: let's go ahead and separate out this core functionality from these other pieces, which we considered secondary functionality. If we put the secondary functionality on App Engine, might it conflict with, or cause a problem for, our core functionality of returning responses? Should we put it somewhere else? How do we integrate these other pieces? I'm going to spend a little time talking about some of these secondary pieces, some that we decided not to put on App Engine and some that we did. So, App Engine,
of course, is really, really, really good for simple operations: a request comes in, does something basic such as look up the score, and returns it. It's not so good for certain other things that it's just maybe not designed for. Some of these are things Jesse mentioned, such as the rankings. We have a fairly complex procedure to figure out the rankings. It's not just counting up points in a polygon for different neighborhoods and cities; we bring in all kinds of demographic data and weight Walk Score by population, and App Engine is not really set up for that kind of geo-capability. Some folks I know have done some work with this, but we do all of this offline in a PostGIS database. The other is, what if we just want to look at the API usage? Who's using what
in a day or a week or a month, and where are the queries coming from? So this is Seattle API usage over a given time, and we want to know a little bit about it. Are people coming in and maybe doing a kind of survey or academic study where they're going to request a couple of hundred thousand points in a specific area? Is it people just looking at houses all over the place, and when? This helps us decide where to pre-calculate points. We want to make sure we can respond really fast, so we seed the API with some obvious places, such as maybe the top thousand cities, where we'll score all of the possible walk scores. But where else do we go next? By looking at the usage, we could see things like, oh, people query for houses mostly within two miles of a city center. And also, when we're doing the rankings, are we going to rank over urban areas? Are we going to rank over the metropolitan statistical areas from the census? So we did this offline and then preseeded the cache with places of interest. And all of this pre-calculation
stuff brings in a complication to that really, really simple workflow, which is that we have to actually calculate the score before we can give it out. So the API gets a little bit more complicated: if we don't have a score, we return saying "We don't have a score," and then we have to go and do something about that.
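As a rough sketch, the request flow described here might look like the following in Python App Engine's webapp framework. The handler is illustrative, not the actual Walk Score code, and enqueue_point is a hypothetical helper shown a bit later:

```python
from google.appengine.api import memcache
from google.appengine.ext import webapp

class ScoreHandler(webapp.RequestHandler):
    def get(self):
        lat = self.request.get('lat')
        lng = self.request.get('lng')
        # Fast path: the score was pre-calculated (a real handler would also
        # fall back to the data store on a cache miss).
        score = memcache.get('score:%s,%s' % (lat, lng))
        if score is not None:
            self.response.out.write(score)
        else:
            enqueue_point(lat, lng)        # hypothetical: add to the data store queue
            self.response.set_status(202)  # "we don't have a score yet; try back later"
```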
To explain a little bit about why we decided to offload the scoring portion outside of App Engine, I'll hand it over to Dave.
>> PECK: Okay, so, Josh showed us a number of the reasons that we couldn't strictly build
the Walk Score API on top of App Engine. And I'm going to talk through sort of the central
reason that we had to make use of Amazon EC2, and that's that, as Josh hinted, the Walk Score calculation takes time. Now, calculating the Walk Score is not CPU intensive, but it
is rather I/O intensive. At a minimum, Walk Score requires us to make 17 URL fetches to
Google local search, and potentially talk to other services, census services, geo-coding,
et cetera. Now, if you're an individual user and you go to our website and type in an address,
your browser and our JavaScript will do most of the work for you. But if you're a real
estate company what you really want is programmatic access and you want us to do the work of calculating
Walk Score on your behalf, and so that's why we built this API. From a customer's perspective, the request-response cycle, as Josh mentioned, is: give us the latitude and longitude, and get back a score if we've already calculated it; otherwise, we queue it up. And we've actually built a reliable message queue abstraction on top of the App Engine data store to hold on to the latitudes and longitudes that we haven't yet calculated.
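A minimal sketch of such a queue entity, with illustrative names rather than the actual Walk Score schema:

```python
from google.appengine.ext import db

class QueueItem(db.Model):
    # One not-yet-calculated point waiting to be scored.
    lat = db.FloatProperty(required=True)
    lng = db.FloatProperty(required=True)
    enqueued_at = db.DateTimeProperty(auto_now_add=True)

def enqueue_point(lat, lng):
    QueueItem(lat=float(lat), lng=float(lng)).put()
```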
Of course, at some point, we have to turn around and actually service that queue. And so, we looked at a few options when we were building our API. The first option we looked at was App Engine
Cron Jobs, and as you know, App Engine Cron jobs allow you to regularly ping an App Engine
URL handler. That URL handler is subject to many of the same restrictions that standard
URL handlers are subject to on App Engine. So, in our case, making 17 URL fetches is probably not going to happen in the timeframe that an App Engine request-response cycle is allowed. Unfortunately, cron jobs didn't look like a particularly good answer. The second thing we looked at, of course, is whether App Engine supported background tasks, and obviously, if you were at Brett's talk earlier, you'll know that the new Task Queue API is going online in a few weeks, and we're very excited about that. We believe that we can move a lot of the functionality of our Walk Score calculation onto App Engine. But six months ago, when we started building this API, background tasks weren't available on App Engine, so it wasn't really an option for us. So, where did we turn? Well, we turned to Amazon EC2.
For those who don't know EC2, it's Amazon's API to spin up virtual servers. You have full control of the machine; you can choose whatever operating system image you want. For us, the things that it gave us were a place to do background processing and the ability to do an arbitrary number of URL fetches, of course, really arbitrary I/O, since you're on the machine. And the design of our calculator is very parallel; we have lots of processes running, working on different latitudes and longitudes, making different network requests, so we needed a place to run lots of processes in parallel. In other words, the Walk Score API is built with both App Engine and EC2, and when you start to build a service with multiple cloud computing environments, there are a few important considerations to keep in mind. So, I just wanted to show this perspective. Here, we've got the customer's code on the
left, our App Engine code in the middle and our Amazon EC2 code on the right. And here's
a customer request where we've already calculated the score: they give us the latitude and longitude, we check to see that it's already calculated, and we package up a response and send it back to them. And here's the other type of customer request, where we haven't actually seen that point before. The details of what we do on the App Engine side aren't really important, and as Josh mentioned, it's not very complex. But the important thing to see here is that during a customer request-response cycle, Amazon EC2 is never touched. Amazon EC2 is simply our queue-servicing code, and customer requests never get to Amazon. And that's a really important design point that I'd urge you to think about if you're architecting an API across multiple cloud services. For us, what it means is that our uptime and our scalability are not a function of both App Engine's and EC2's uptime; or rather, our downtime is not a function of both of them; it's strictly tied only to App Engine. If Amazon EC2 goes offline,
what that means is that it'll take a little longer for us to process requests in our queue.
So, we've architected this API and it's been running for the last six months and I want
to talk a little bit about the behavior that we've seen during the last six months. So,
this is sort of the bottom-line thing that I'd like to stress: of course, App
Engine is a beta service so some amount of unpredictable performance is predictable.
But, we saw a number of concerning things along the way which have really smoothed out
over this beta period. So, the number one thing that we struggled with while running
our API in production is high data store contention, and by contention what I mean is: when we make a fetch or a put to the data store, we see a timeout, so contention is the percentage of timeouts we see in a given set of requests. In particular, where we saw contention was in accessing our queue data structure. Obviously, customer requests are coming in at a rather rapid rate. They're adding new latitudes and longitudes that we haven't calculated
yet to the queue, at the same time our calculators are trying to pull them off, computing them,
and then pushing them back to App Engine and removing those entries from the queue once
they're calculated. So, a typical day for us today is about a half percent failure on
reading from the queue. Just two days ago at around 1:00 a.m., I think, we saw App Engine
contention rise up to about 50% or 60% for about six hours, so a really big surprise;
something that we had to architect for on the Amazon EC2 side where we're servicing
our queue. Obviously, we'd buffer up a lot of latitudes and longitudes there so that
if we're unable to get data from App Engine for a while, we can still continue
to calculate Walk Scores. Another thing that can happen, especially during the beta period, is that the data store goes offline or simply goes into read-only mode. For those of you who have had an application running over the last six months, you'll know that this has happened from time to time. What you normally see in this case is that when you make a data store request, you see a capability-disabled exception.
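In the Python runtime that exception is catchable, so a write path can tolerate a read-only window instead of erroring out; a minimal sketch (the safe_put helper is illustrative):

```python
from google.appengine.runtime import apiproxy_errors

def safe_put(entity):
    # Sketch: survive a data store maintenance window instead of failing the request.
    try:
        entity.put()
        return True
    except apiproxy_errors.CapabilityDisabledError:
        # The data store is read-only right now; buffer the write or drop it.
        return False
```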
Another thing that we've seen a lot is that the App Engine response time increases, and we measured
this on the EC2 side, so as our calculator pulls new points from our App Engine queue,
normally those requests take about half a second to a second. Every once in a while
the response time goes up to several seconds; in other words, latency increases. App Engine is a bit of a black box, so it's hard to always understand why this is
happening. You know, with no changes to your code and with no changes to your underlying
data models you might see this. And this is a very rare occurrence but has happened once
or twice in the last half year. So, with those performance issues in mind, what can you
do to make sure that your application scales as far as you need it to? And so what I'm
going to talk about here are sort of the steps that you can follow to make your applications
scale as far as you'd like. These steps are distilled from our experience building the
Walk Score API but also building other APIs on top of App Engine that we're working on.
So, they may not all apply in your situation; they're just good rules of thumb to keep in
mind as you're building your application. So, scalability on any system is really about stair steps: you do a certain amount of work and you get to the next step; getting to the step above that might require substantially more work on your part. At the bottom rung for App Engine, when you're just getting your feet wet, here are some of the things your code might have. First of all, inconsistent model design: either your models are very large and you only access one or two properties at any given time, or your models are very small and distributed and you end up accessing a cluster of different models in a single request. In that case, you probably want to rethink the design of your models and shape them a little bit differently. This is a characteristic I've seen in just-starting-out App Engine applications, especially from people who've worked in the SQL world before and are just starting to move over. Uneven or no memcache use: it's a lot of fun
and it's a lot easier to write your application without thinking about memcache at first. Unfortunately, pretty much all good App Engine applications are going to make some, or very heavy, use of memcache early in their life cycle. The final thing is, for those
who don't know: every time you make a data store request, a fetch or a put, you're effectively making an RPC request somewhere in Google's data center. You're able to batch these requests, but it can be difficult to design your code that way, so a lot of early-stage code that I've seen doesn't do that. So, with this sort of naive style of coding with App Engine, we've been able to see something like five queries a second handled, which is actually really rather amazing. I should point out that this number is from our experience; depending on the type of application you're building, you may see something different, but this is approximately what we saw when we were starting out with the Walk Score API. Five queries a second is over 10 million requests a month. That's already a rather large application, and it speaks to the power of App Engine as a platform to get you scaling very fast right out the door. So, where do you go from there, once you want to go past
that five or so queries a second? Well, the very first thing I'd urge you to think about is starting to use memcache, and the easiest way to use memcache is to just slather it on everywhere you read data from your data store. The basic behavior, of course, is: if you're going to read an entity from the data store, check to see if you've cached it first. If you have, great, you're done; if you haven't, read it and add it to the cache. Oh, and be sure, when you're writing back or updating that entity in the data store, to either clear memcache or update it. And if you go and look at the Google App Engine cookbooks, somebody uploaded, I think, about a month or two ago, a really great shim for the App Engine data store that just causes memcache-on-read to happen everywhere; as a first step, that might be something to look into.
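The "cache on read" behavior described above is only a few lines; here is a minimal sketch of the idea (not the cookbook shim itself):

```python
from google.appengine.api import memcache
from google.appengine.ext import db

def cached_get(key):
    # Cache on read: try memcache first, fall back to the data store on a miss.
    entity = memcache.get(str(key))
    if entity is None:
        entity = db.get(key)
        memcache.set(str(key), entity)
    return entity

def cached_put(entity):
    # Keep the cache consistent: update (or delete) the cached copy on every write.
    entity.put()
    memcache.set(str(entity.key()), entity)
```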
Another thing to think about is to batch your data store requests where it's easy; where it's easy, for us, was anywhere we had non-nested loops. So anywhere we had a loop where we were putting single entities, we inverted that; now we call one put with lots of entities. And I should mention that batching is a little bit tricky, because you can only fetch up to a thousand entities at a time from the data store, and puts are limited based on the size of the entities you're putting; a good rule of thumb is 50 at a time. So, you may need multiple batches depending on the size of the operations you're performing.
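Inverting a put-per-entity loop into batched calls looks something like this sketch, using the 50-per-batch rule of thumb just mentioned:

```python
from google.appengine.ext import db

def put_in_batches(entities, batch_size=50):
    # One RPC per batch instead of one RPC per entity.
    for i in range(0, len(entities), batch_size):
        db.put(entities[i:i + batch_size])

# Before: for entity in entities: entity.put()   -> one round trip per entity
# After:  put_in_batches(entities)               -> one round trip per 50 entities
```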
So, with that work, we saw pretty much a doubling in our ability to handle load, which is, again, really rather impressive for not much engineering work on our part. But we wanted to go a lot further, and of course, the running Walk Score API sees substantially more than 10 queries a second on any given day. So, what did we have to do next? The next of the tips are really
a grab bag of things that you might want to think about as you really scale out your application.
The first is that I'd urge you to think long and hard about how your data store is accessed and what types of usage patterns you see. Memcache is really great in particular for two types of usage pattern: one is repeated requests for the same data, and the other is predictable sequential access to data. For repeated requests for the same data, if you know your customers are going to hit that same entity again and again, "cache on read," like we discussed in the previous step, is really a great strategy. But consider predictable sequential access: for example, in the case of the Walk Score queue, we know that we're going to pull out those queue items in order, or our calculator code is going to request them in a certain order. So our calculator talks to App Engine and requests 50 items at a time from the queue. But on the App Engine side, what we do is actually pull a full thousand items out of our queue and put that entire list in memcache, and that obviously cuts down substantially on the number of data store accesses we need to make.
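A sketch of that prefetching pattern, reusing the hypothetical QueueItem model from earlier and glossing over concurrency and memcache's per-value size limit:

```python
from google.appengine.api import memcache

QUEUE_BUFFER_KEY = 'queue_buffer'  # illustrative key name

def next_queue_items(count=50):
    # Hand out 50 items per request, but hit the data store only once per 1,000.
    buffered = memcache.get(QUEUE_BUFFER_KEY) or []
    if len(buffered) < count:
        buffered = QueueItem.all().order('enqueued_at').fetch(1000)
    handout, rest = buffered[:count], buffered[count:]
    memcache.set(QUEUE_BUFFER_KEY, rest)
    return handout
```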
So, think about memcache usage patterns. Batch all of your data store calls; this went a long way for us in scaling outward, and for us what that meant was unraveling nested loops. Sometimes they were loops that crossed method boundaries, so we had to flatten a lot of our code out, but it really helped us a lot. As important as using memcache carefully is, sometimes using memcache is just not the right thing to do, depending on your access pattern. Memcache may not provide a very meaningful barrier between your users and the data store, and memcache is a limited resource: you obviously don't want to populate memcache with entities that aren't useful to you and lose the entities that are. And when you add items to memcache, you can, of course, think about how long they stay there; setting expiration times is a great way to keep memcache pressure low. And then, I would urge you to load test your application. We actually
built a load testing harness on Amazon EC2. We're able to hit our running Walk Score instances with over a hundred queries a second, and that's real load, talking to real URL handlers that actually do real work with the data store. So, load test; there are third-party products to help you with that too. App Engine's behavior under load tends to be surprising once you get pretty high, past 10 queries a second. And the last thing is, of course, monitor
your performance. Now, the front line of defense for monitoring performance is the App Engine Dashboard, which is a great place to go, and of course your logs, to look at individual requests that, for example, gave you a data store timeout. The System Status page is also a good place to go if you're seeing a lot of latency, just to look at the overall behavior of App Engine. We actually built our own performance dashboard for the Walk Score API, which we're happy to show you after this talk if you'd like; it monitors specific things about the behavior of our EC2 calculator code's communication with App Engine. And just sort of one last technique, if you're getting started with App Engine
coding, that I think everybody should know, which is: if at first you don't succeed, try, try again. What we have here is basically an attempt to write to the data store that, if it times out, turns around right away and writes again. Just a few milliseconds later, you may not get a timeout exception. This is something we do everywhere in our code base now, on both the read and write sides, and it's actually extremely helpful.
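The slide itself isn't in the transcript, but the retry idea is roughly this (the attempt count and sleeps are illustrative):

```python
import time
from google.appengine.ext import db

def put_with_retry(entity, attempts=3):
    # If at first you don't succeed: a timed-out put often works milliseconds later.
    for attempt in range(attempts):
        try:
            return entity.put()
        except db.Timeout:
            if attempt == attempts - 1:
                raise
            time.sleep(0.01 * (attempt + 1))
```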
So, I've talked a little bit about general principles for scaling out App Engine applications, and what I want to do now is turn it back over to Josh, who's going to talk about one specific instance where we had some difficult scalability issues that we needed to tackle and the sort of unorthodox technique we used to solve them. So, thanks.
>> LIVNI: Thanks, Dave. So, yeah, I started out talking about core functionality and secondary functionality. Obviously, your applications are going to have different pieces of secondary functionality that may or may not fit in different places. A really common use case on App Engine is counting stuff, and I'm going to talk about a specific issue we had counting our users' requests. We have a basic quota system; we have an API key. We want to know who's doing what, so certain folks might get, you know, a couple of hundred thousand requests a day; others might get maybe a couple million a day. But we want to make sure that people can't just run away and abuse the system, and again, that we understand who's doing what and where. So, counting every request that comes in and matching it to users should be pretty straightforward, and the answer that's usually given for this is, "I'll just use sharded counters": every time a request comes in, write to one of these shards and you're good to go.
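For reference, the standard data store sharded counter recipe looks roughly like this (the shard count and names are illustrative):

```python
import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    count = db.IntegerProperty(default=0)

def increment(counter_name):
    # Spread writes across NUM_SHARDS entities so no single entity is a hot spot.
    def txn():
        shard_name = '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1))
        shard = CounterShard.get_by_key_name(shard_name)
        if shard is None:
            shard = CounterShard(key_name=shard_name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)
```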
But if requests are coming in at, you know, a hundred requests a second, that's still a lot of data store writes, and you're going to hit a lot of data store contention. So, we found that these sharded counters didn't scale as well as we had hoped. What happened in our case (this was from about six months ago, and things are a bit different now) was that around 30 to 40 queries a second, our response
time started creeping up, you know, two, four, six-plus seconds, along with the error rates, and then all of a sudden everything was 503s. This is a really big problem because, again, core functionality should not be hurt by this sort of icing on the cake of counting who's doing what. We were finding that just because we wanted to have a quota system, nobody could even get a score, because all the requests were just bombing out entirely. So, the solution that we came up with: how many of you guys were in the talk just before this, Brett's talk next door? A few of you. So, one of the things I thought was a really interesting example of the new task queue was the back end sort of writing to a cache. We implemented something with similar concepts. Dave talked a little about using
memcache cleverly on read, and this is using memcache on write, which, in the Google Groups discussions and other places, people are going to tell you not to do: "You know, memcache is unreliable. Things might go away." But in my opinion, there are certain cases, and this is one of them for us, where you don't absolutely need exact accuracy. If you're writing a banking application and counting people's pennies, you know, don't do this. But in a lot of cases you just need a general idea of a quota, of how many things are generally around at a given time. And again, knowing things might fail, you can have other processes in place to check that things happened accurately, do things twice, and so forth. Using memcache on write can be a really interesting idea. So this
diagram basically shows that when a request comes in to the API, rather than writing to your sharded data store counter, we write to sharded memcache counters. Only when a memcache counter fills up to 100, or whatever threshold, do we write that bit of information to the data store. And the task queue makes this a little bit nicer: as in the talk just before this, between the memcache and the data store there's this task queue, and it's a lot faster to write to the upcoming task queue, something we certainly weren't expecting when we started writing this. But you still have the same issue whether you're writing to a task queue or writing to the data store, which is this small time period where a request can come in and increment your memcache shard after you've read the count but before you've written it out and cleared the shard. So, you just have to sort of keep track of what you're writing to the data store versus the things that came in before you got confirmation; the same goes for the task queue.
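A minimal sketch of that memcache-on-write counter; the shard count, key names, and flush threshold are illustrative, and the race at flush time is only noted, not solved:

```python
import random
from google.appengine.api import memcache

NUM_SHARDS = 20
FLUSH_THRESHOLD = 100

def count_request(api_key):
    # Increment a random memcache shard instead of touching the data store.
    shard = 'quota:%s:%d' % (api_key, random.randint(0, NUM_SHARDS - 1))
    n = memcache.incr(shard)
    if n is None:                 # shard missing: brand new, or evicted
        memcache.add(shard, 1)
        n = 1
    if n >= FLUSH_THRESHOLD:
        # Clear first, then persist: if the write fails we lose counts, so any
        # error is an undercount, never an overcount.
        memcache.delete(shard)
        persist_count(api_key, n)  # hypothetical data store (or task queue) write
```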
So, as I mentioned, the advantage of this solution is really nice: instead of, you know, a hundred data store hits a second, you get a couple of orders of magnitude fewer, and that just scales up really, really nicely. But the disadvantage (I don't know where my disadvantage slide went) is that there's a possibility you could lose a little bit of data. For our use case, we were okay if it was off a little bit, as long as we didn't overcount. We didn't want to accidentally charge someone for quota they didn't use, so we were just really, really careful that in case of a loss of data, it's always a slight undercount. And again, memcache has proven really, really reliable; there were a couple of, you know, two- or three-hour outages where we just lost counters. But aside from that, as a general rule, you can pretty much rely on it. So, after we implemented
the memcache counters, things went like this. Again, this graph is from a while ago; these days it's smoother, and things stay well under a second for the response time, but it's really a huge difference. What it means is we can actually count stuff now at a very, very high rate per second and not have to worry about hitting these issues where our core functionality is compromised. So the summary here is: it's not necessarily a bad idea to think about using memcache on write. Again, the possibility of lossiness is always there, but in many cases that you might come across, it can save you not only this possibility of data store contention but also some money, because you're charged for the CPU time on the data store, and if you're not in the data store, you know, that can add up to a few bucks a day over time. And then, the
other piece is: early on, decide whether a piece of functionality is a good fit for App Engine. Certain things, for example, we could have offloaded: for the quota, we could have parsed logs offline every couple of hours and come back. We decided, "Oh, this is a good piece to fit on App Engine; we ought to be able to count stuff." For other things, we decided, "No, this piece, although integrated with App Engine and the API, maybe belongs on our own box." So just deciding which pieces fit in App Engine is important. And then the final one is, of course, when you're putting those pieces into App Engine, make sure that you write them in such a way that, should strange and weird things happen (all of a sudden you get popular, you hit 80 queries a second, data store contention hits), those features that you wanted as optional extras don't compromise the really core functionality of your application. And so with that, we'll turn it back to Jesse to talk about some of the overall results.
>> KOCHER: So, you've seen some of the, you know, stair steps to scalability hotness that Dave went through; where has all that gotten us? Well, the three of us, a very small team, have built an App Engine API that handles over 1.5 million requests a day, with peak rates up to about 80 requests per second; that's quite a bit higher than where we started. And we've come to think of App Engine as really being a black box. It's a place where things may shift in unexpected ways internally, and you may not always know what's going on in there. But it's also very well documented, and not just the main documentation online but also this dynamic support: the IRC channel, the Google Groups, the issue tracker; and the App Engine team is really friendly; they frequent all of these places. So the degree of communication is pretty amazing, and it makes working inside of a black box considerably less daunting.
So, our amended statement would be: it's a very good black box. I think, looking back, we would make the same choice that we made, even knowing a lot of the things that we've learned along the way, and I can say, having watched App Engine progress over these last few months, that the work we have done would be considerably easier starting now than it was starting back then. So we hope this talk has helped you understand some of the risks and rewards of using App Engine and some things about how you get apps to scale well. We've put up our contact info for a second here, so if you want to get in touch with us, you can jot that down. We've also got some time for questions here, and we'll also be at the demo spot for most of the rest of the afternoon until about 4 o'clock, so if you want to catch us there, please do. And we'll
move into Q&A now; we're happy to take questions on anything. We'll take them at the mic, and I'll also flip over to the Moderator site in a few minutes. But before I do that, I want to put up a few things that we didn't put into our talk but definitely have more to say about; if any of those interest you, feel free to ask about them. And then, Google sent us a note earlier today asking us to direct you all to haveasec.com/io. If you go there, you put in the time slot and the session, and you can give feedback, so we'd appreciate any feedback you have for us. And with that, we will
move to questions, so go ahead.
>> Two-part question about your counters. Why did you still shard them when you put them on memcache?
it very intensively. >> How many request per second when you put
that or…? >> LIVNI: Well, we found that once we hit
our, there's a couple of reasons we sharded the memcache counter originally. One is there
is that possibility of contention, and the other is it's cheap to make lots of memcache
shards, and one of the issues I mentioned in that slide is that as you're writing the
shard to the data store, you might not get a response back for a half second or four
seconds. So, all of that time, you're having a lot more requests coming in. We wanted to
minimize to some degree in order to keep the count a little--as accurate as possible, that
very, very small timeframe from when we clear the memcache shard as, "Okay, we got a confirmation
back" and then we've written it out. We don't want 30 new requests coming in in that timeframe
because there is still a little bit of latency even writing to memcache. It's, you know,
much, much faster writing to data store and so by making a lot of shards, it's not really
any overhead for the system that we can sort of spread out the number of requests in that
just a couple of milliseconds, it's less likely we'll lose something. Does that make sense?
>> Okay. The second part is: before this morning, you know, I didn't really know how many memcache instances Google would have for my app. But now it sounds like I have to assume that if I write to a key, it will always be the same one memcache, because otherwise these counters wouldn't work. You know, if I have one instance of my Java app running in Australia…
>> PECK: No, that's correct, memcache is a distributed API, and so when you write, you don't have any knowledge of what CPU your code is executing on inside the App Engine data center.
>> Right. But now it sounds like I'm guaranteed that when I write to memcache, all my 500 app instances write to the same memcache instance.
>> PECK: Yeah, you're effectively writing to the same store. That's correct.
>> Do you think that will be future proof?
>> PECK: I don't know how they propagate that data across, or how long it takes. So if you're running code (not that you can specify this when you're writing an App Engine application right now), but if some of your code is running in Australia and some of it is running here, I don't know how long that would take to propagate. But, yes, it looks like you see a consistent view of the memcache world regardless of where your code is.
>> I'm just surprised that that's future proof, because, you know, once you get to the Facebook scale, maybe you need multiple memcache instances.
>> PECK: Yeah. Yeah.
>> For the same write.
>> LIVNI: Yeah, I mean, we've had good luck with it, but I'd say that's a really good question where the folks who actually built it would have better answers than we would.
>> KOCHER: Also, you know, memcache has a certain size, and the more you put in there, the more you increase pressure on your memcache and the sooner things will expire from it. So there's really a balancing act in figuring out which things to put in there, and, you know, the size of your entities matters when you're putting things in. In choosing that, you're making a lot of decisions that will affect how available that data is when you go back to look for it later.
>> LIVNI: Actually, there's one small thing I should mention on that, which is that when you write to memcache, I believe things get evicted first-created-first-out, not last-updated-out. And so when we refresh our memcache key, I believe we delete the key and then recreate it, rather than just updating it. That way it's got a fresher date and is less likely to get evicted should more pressure happen in memcache.
>> I have a couple of questions. The first is about your API. You had a very simple diagram, but the queuing that you do when you don't have the latitude and longitude: does your API just return "queued up"? Like, I mean, in the case of queries for a latitude and longitude that you haven't calculated, do you just return "queued up"?
>> PECK: Yeah, that's right. In our API, we basically tell our customer to try back later. If you think about the typical use case of embedding in a real estate site, what that means is the customer doesn't display the Walk Score at that time. But the next time one of their users comes and looks at that same property, we'll probably have that Walk Score calculated and ready for them.
>> Okay. And the other question:
so, you use this metric of queries per second when you were describing the techniques to speed up, like using memcache and everything.
>> PECK: Yeah.
>> So, these queries per second, I'm assuming, are data store queries, right?
>> PECK: Oh no, when I was showing those queries-per-second numbers, what I was showing is the number of successful URL requests that someone outside of the App Engine data center can make to our application and have handled correctly. And those numbers were rough; they won't apply to your application exactly, but that's approximately what we saw in the development of Walk Score.
>> LIVNI: They're also a little bit better now.
>> PECK: Yeah, they really are.
>> LIVNI: I think if you wrote the same app that we did early on, you would get, I think, much higher queries per second today than eight months ago.
>> PECK: Yeah, I mean, the bottom-line message is that even with naive code, you can actually scale pretty far, and that's just pretty impressive.
>> I have two questions. The first one is:
are you now running your entire production front end off of App Engine?
>> PECK: Yes, we do.
>> So, if we go to www.walkscore.com, that's actually…
>> KOCHER: No, sorry; no, we're only doing the API there. We are looking at, we're considering moving other parts, you know, either the tile or potentially even the whole website, over. And you could mention the other stuff quickly.
>> PECK: Yeah, yesterday. So, the Walk Score website is actually written in PHP, and as you know the JVM is now on App Engine, and there's a company here called Caucho which makes a PHP implementation for the JVM. And actually, just yesterday, we were able to port our entire PHP website over to App Engine. We've just started testing it, but it looks like it's extremely performant, which is impressive; it's a lot faster than running Apache on a private server at some random ISP. The API itself, however, is of course all Python
App Engine.
>> And the second question is: have you run into any of the big limits of App Engine in terms of storage and access that you mentioned in your last slide? Have there been any big things where you ended up having to be billed
for a lot more than you expected?
>> KOCHER: We haven't. As early, high-traffic users, they gave us access to go beyond those initial free quotas, you know, before billing was enabled. Now that billing is enabled, there's a lot more flexibility; you can use a lot, so we haven't really run into any of those limits. I think the things that are interesting are the per-request limits on, you know, how much CPU time you can use and how much wall-clock time you can use. And we have
some EC2-to-App Engine communication where some of those requests use way more CPU time than App Engine is really happy with. But because they're much less frequent than the public calls, which come in at a much higher rate, we can manage them, and we use some backing-off techniques: if EC2 has trouble communicating, it can say, "Oh, we're going to stop talking to App Engine for a little while and stop hitting it with these CPU-intensive requests."
>> PECK: Do you want to take this?
>> KOCHER: Sure.
>> PECK: So, the question is: considering that calculating Walk Scores requires considerable time, would we consider using web hooks? And actually, I was in Brett's talk just previous, and the answer is yes, I think we'd love to use web hooks. We haven't used them yet, but it might be something for us to look at. And certainly, for those who were there, I think you probably guessed that a lot of our calculator code fits nicely with the Task Queue API that he described.
>> KOCHER: Yes, go ahead.
>> Could you please elaborate a little bit on how you implemented the queries to App Engine and the computationally intensive part on Amazon EC2?
>> PECK: Sure, I mean, I can dig a little more into the design of the calculator. Well, okay, we have a bunch of other stuff, but the calculator is, of course, running completely on EC2. And it's actually running on a single machine, but as lots and lots of processes.
>> KOCHER: Do you want to jump to this thing here?
>> PECK: Oh, actually, I wanted to jump to that one. So, the key thing, because of the unpredictable performance reading from the queue that we've implemented on the App Engine side, is that we wanted to decouple the computation of the Walk Score from the I/O requests we make. What we have is a master process that spawns up a ton of slaves. Some of those slaves are responsible for talking to App Engine and requesting new latitudes and longitudes that we need to calculate. And we keep a rather large buffer on the Amazon EC2 side, with a lot of latitudes and longitudes coming in, so that if we can't talk to App Engine for a while, we can continue to calculate scores; that's one major thing. We basically decoupled our I/O from our computation on the EC2 side, and we buffered all our I/O.
And sort of one last point, which Jesse alluded to, is that we dynamically respond to changing I/O conditions. If contention is very high for more than, say, 10 or 15 requests in a row, what we do is actually back off and stop talking to App Engine for a while. For various reasons based on the internal design of the data store, that's something that can alleviate contention; we come back 20 minutes later and just work through the data that's buffered on the EC2 side. So, that's sort of the big picture of the design of the calculator.
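The shape of that EC2-side fetch loop, as a sketch; the thresholds, the request_points helper, and the error handling are all illustrative:

```python
import time

def fetch_loop(buffer, max_failures=10, backoff_seconds=20 * 60):
    # Slave process: keep the local buffer full, back off when App Engine struggles.
    failures = 0
    while True:
        try:
            buffer.extend(request_points(50))  # hypothetical HTTP call to the queue API
            failures = 0
        except IOError:
            failures += 1
            if failures >= max_failures:
                time.sleep(backoff_seconds)    # stop talking to App Engine for a while
                failures = 0
```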
>> All right. So, is it roughly one database, and you use rsync or something like that between…
>> PECK: Oh, so, actually, App Engine is our data master; all of our data is there. All of the points that we're working on on the EC2 side are simply held in RAM. If that process crashes, it's not the end of the world; we basically might redo a few points. Actually, the EC2 code is also Python, so if we get an exception out of a process, we actually just pickle out all the current data.
>> Okay. Thanks. >> KOCHER: Any other questions?
>> PECK: Any other questions?
>> KOCHER: That was the only… >> LIVNI: I think that was it.
>> PECK: Oh, thank you. >> LIVNI: Thanks, guys.
>> KOCHER: Yup. Thanks.