Tools for Continuous Integration at Google Scale

Uploaded by GoogleTechTalks on 26.01.2011

Let me introduce Nathan York; he's an engineering manager at Google. He's been with Google about six years and has been integral to ensuring that the engineering tools and systems have grown with the scale of Google's growth, both in terms of the users we have and also the hardware [INDISTINCT] first the applications and the offerings we have. So, Nathan calls himself a tools guy and I'll let him share why he thinks that, and also talk about the tools for continuous integration. >> Okay. Thank you. So, yes, I'm here to talk about Tools for Continuous Integration at Global Scale. This talk will really be about the build system and how we're using the build system to enable this type of continuous integration system. So a quick question: how many people here interact with a build system of some sort? Quite a few. Lots of managers in the audience. Okay. That's okay. So how many people actually really, truly love their build system? Totally happy with it? Yes? Okay. That's kind of the response I was expecting. So, a bit about what we're doing at Google--not just what we're doing, but how we're doing it.
We have a dedicated tools team within the productivity area and we've been working on
the scaling issue at Google in particular, the challenges that are specific to the way
we work at Google. And we've been working on this for about six years with a fairly
decent-sized engineering team. And so this talk is really distilling the key insights from six years of development into this one talk to share with you guys. And the reason I really want to share this is that I'm excited about it. This is the first time we've ever really talked in depth about what we're doing, and we want to share this because we feel like build systems are really not talked about much. They're often overlooked and in many ways underappreciated. And this leads to what I refer to as a software engineering gap, where we spend a lot of time and resources on the very low-level platforms and
compilers. We have very specialized people in this area. And then of course we hire specialized
teams of people to build applications and features, but there's this gap in between
where the tools that everyone is interacting with are, in many ways, just overlooked and underprovisioned. Often they're developed in-house, and many times it's just some poor guy who gets stuck with it, usually the new guy on the team, or the intern, or something like this. In most cases the bar is pretty low. You get to the point where the tools are good enough to use and then you move on. And hopefully you'll see in this talk that we really believe
that there's so much more potential there. So, where this often leads to, and this is
probably why a lot of people are frustrated with their build systems, are things like: incorrect and flaky builds; slow builds--that's a very common one; spending a lot of time waiting for the build system; having a cumbersome and/or fragmented build system. Many companies, I believe, once they [INDISTINCT] to a certain size, see pieces of the code base start fragmenting off into different kinds of code silos, and then of course there needs to be a process for, you know, integrating this stuff all back together, and it can become a very heavyweight process. And of course, usually this is undermaintained overall. And so,
one of the insights we had here very early on was that a "make clean" is actually the sign of a deep flaw in the system. If you think about it--if you don't trust your ability to do an incremental rebuild, what you're really saying is that you don't believe the system has correct information, that it's not going to be able to understand what you're doing in a correct sense. And this is one of the things we set out to address. And so,
there's also a question of why build systems matter, right? And of course, in an agile, test-driven development type of model, you know, however long the test and the build take, that's how long people are waiting in each iteration [INDISTINCT]. This also affects, transitively, things like automated build systems, and also building and releasing products. Basically the build system forms the core of productivity, and in the end this is all about getting feedback to the user, right? And the build system is kind of sitting at the core of this feedback loop. One point here is also that one of the things we discovered very early was that, in the race to get speed, oftentimes correctness is overlooked. And we found that this usually is counterproductive: it's easy to spend much more time diagnosing incorrect build results, flaky build results, than the actual gains produced by the build system. So the challenge we're facing at Google:
Six thousand engineers spread out across worldwide offices; one code base. Everyone's working
on the same code base. People don't generally work from branches. Everything is built from source. We don't really build from libraries, except things like system libraries. This has advantages. We don't have things like big integrates, so-called merge hell. It also makes the code very open and transparent; people are free to experiment. It is very easy to go in and change some core piece of the system. This doesn't mean that they can just check in what they're experimenting with, but nevertheless it's very easy to try ideas out, and there's a set process for how to get these changes back into the main code line.
And of course one of the keys to making all this work so that we're not all just sitting
around fighting broken builds all the time, is very, very extensive automated testing.
And there's a link here on the slides to a talk that Patrick Copeland, the director of engineering productivity, gave earlier this year in Paris; he kind of goes over some of this stuff at a higher level of detail. So some of this is a repeat of what he talked about earlier, but I'm going to go into more detail. This is an eye chart. It describes really
a rough outline of our developer workflow. And this is kind of a very rough cut, but this is our understanding of how engineers at Google work. And for those, probably even in the front row, who can't read this, the process is really: start with a clean client; check out some code--obviously you have to do a clean build the first time to get the initial build results; make some modifications; run some tests; iterate--the code review process includes this iteration. And at the point where we actually submit, which is down in the bottom left box there, once that submit goes in, everyone in the company sees it. And if you broke something, everyone's going to see that breakage. So, you know, this is one of the reasons that having good tools for this type of working environment is critically important. So we realized that our older make-based system was really insufficient for what
we needed. We needed a better build system. And one of the insights we had was that build metadata--things like descriptions of the dependencies and the inputs and the outputs of the source code--is really a type of source code in and of itself. And as such, it should be treated as any other kind of compiled language. It should be deterministic and well specified. It should force things like dependencies and inputs and outputs to be declared. So we built a better build system. It's really an optimized and tuned implementation of a build language; it does dependency analysis and scheduling. It's not just a build language but a complete build system. And it leverages Google infrastructure to provide scalability.
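To make the idea of build metadata as source code concrete, here is a hypothetical example in the style of the build language Google later open-sourced as Bazel; the rule names and targets are illustrative, not taken from the talk. Each rule declares its inputs, outputs, and dependencies explicitly, so the system has complete, deterministic information:

```python
# Hypothetical, Bazel-style build metadata (illustrative names).
# Everything an action needs -- sources, headers, dependencies --
# must be declared; nothing is discovered implicitly at build time.
cc_library(
    name = "geometry",
    srcs = ["geometry.cc"],
    hdrs = ["geometry.h"],
    deps = ["//base:math"],   # dependency on another declared target
)

cc_test(
    name = "geometry_test",
    srcs = ["geometry_test.cc"],
    deps = [":geometry"],
)
```

Because every input is declared, the system can compute exactly what a change affects, which is what makes trustworthy incremental builds possible.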
And, kind of the bottom line in all this is, we took this as our opportunity to really
re-envision what a build system could be. And again, getting back to the speed versus
correctness kind of tradeoff, we really believe that this is a false choice. People talk about choosing one or the other, and we decided let's choose both correctness and speed. So, moving on to another insight. A core piece of how the build system works was this notion of content-addressable storage: that we can refer to files not by filenames and timestamps, but rather by the digest of the content itself. And where we do need to use some type of path to set up an environment for executing some type of compiler tool, we should really be using relative paths instead of absolute paths. This is all really about trying to eliminate global state in the system, and to have universal handles to the resources in the build, so that we can have a functional build model. The idea is that, in this functional build model, we have files as values, and expressions--or actions as we call them--as transformations of files. So we have this very simple functional equation, and of course outputs from one action can be inputs to another action. This is how we set up the dependency graph within the build system. It's all well
and good, but of course one of the issues with a digest of the content is that you need to actually read the content to get the digest, and we're talking about, again, building everything from the source code. Eventually, we got to the point where we were reading a very, very large amount of source code. People were checking out very large amounts of source code; it was taking a very long time to get. The build system was taking a lot of time just doing I/O, reading the contents to get the digests. So we actually built a special corpus file system for providing source to the build system. For those of you familiar with FUSE, the file system in user space: this is a FUSE-based system, and it provides only read-only access. It actually gives a view of the entire code repository over all points in time. So we can see snapshots of the repository at any arbitrary point in time. It does this efficiently by only pulling content on demand. As users or tools like, you know, GCC or javac or whatever are requesting and opening files, at that point we go and pull the content down. But we do pay for the content transfer [INDISTINCT]. So we do very aggressive caching and reuse, and the immutable part of the file system is really key to this, right? We know that the files in the file system don't change until we change the snapshot we're looking at. This also means, of course, that we can keep all of the source code in the cloud until such point as it's needed on the workstation. And also, this means that we can provide the digest of the content as part of the metadata in the file system. So, in FUSE we're using the extended attributes [INDISTINCT]. This means that the build system
can then get this digest without actually reading the content. And one interesting observation
from this when we rolled this system out was that only about 10% of the files that were being synced by developers were actually being read during a build, and this was after, you know, doing a very good job of trying to limit the scope of what we were pulling down. We were still trying to pull just the dependencies that were being declared. We found that in many cases even that was much too broad. So the next challenge, once we got the source code, was to make the build fast. And what we find is that this notion of a functional build system, where we don't have global state and the source code is already in the cloud, allows us to execute build actions in arbitrary locations. Build actions can be, again, a [INDISTINCT], a compile, a link, a javac, a jar, whatever. And so this allows us to do large-scale distributed execution of arbitrary actions. This is language agnostic. We could even be running shell scripts or whatever the case may be, as long as those conform to the functional notion of the build model. The scheduling of the parallelism is really only limited by the shape of the dependency graph. And so there you have it. We can basically distribute the load across, you know, Google production infrastructure.
So, we can do things very quickly, but of course, the faster we make engineers, the more builds they do, and the more builds they do, the more load it creates. And so our next issue is really how do we scale this up so that we're not spending a huge amount of resources on doing a bunch of builds. In many cases, two builds share overlapping information. And so the insight was that, with this functional build model, we can do caching at the build action layer. If we see the same inputs--and again, the inputs are defined by their digests--if we see the same digests as inputs and the same action that we've seen previously, we can just return the cached result for that, and we avoid actually re-executing it. This means that, in most cases, when engineers are doing builds they're actually getting the build results from people who had done builds previously. It's really the equivalent of, like, an automatic binary release every time someone checks in code and does a build. But it all happens seamlessly and no one actually sees it as a release. And the overall amortized cost is very low. We'll get into some numbers on the results of this in a bit. And then
finally, the last scaling problem we hit was how to deal with the build outputs. We were so effective in distributing these across a large number of machines that we were basically saturating the network links and overloading the workstations. So we had to build another file system, for storing the outputs from the distributed build on the back end. The idea here is that, again, we can present to the engineer the illusion that the build outputs are available locally, but in fact they're all being stored on this back-end system, and they're only downloaded at the point where one of these files is being read--for example, to be executed or, you know, to run a debugger on it.
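A minimal sketch of this lazy-download behavior (illustrative only, not the actual FUSE code): the build records just path-to-digest metadata on the workstation, and the bytes are pulled from the object store the first time a file is actually read:

```python
# Toy model of the outputs file system: at build time only
# path -> digest metadata reaches the workstation; content is
# downloaded only when some process actually opens the file.
class LazyOutputFS:
    def __init__(self, object_store):
        self.object_store = object_store   # remote: digest -> content
        self.entries = {}                  # local: path -> digest
        self.local_cache = {}              # digests fetched so far

    def record_output(self, path, digest):
        # Cheap: the build ships only metadata to the workstation.
        self.entries[path] = digest

    def read(self, path):
        # The expensive part is deferred until the file is read,
        # e.g. to execute the binary or attach a debugger to it.
        d = self.entries[path]
        if d not in self.local_cache:
            self.local_cache[d] = self.object_store[d]  # one download
        return self.local_cache[d]

store = {"abc123": b"...binary contents..."}
fs = LazyOutputFS(store)
fs.record_output("bin/server", "abc123")   # build done: nothing downloaded yet
data = fs.read("bin/server")               # first read pulls the bytes
```

Because outputs are immutable and digest-addressed, a fetched object can be cached indefinitely, which is what lets the back end sit far away (even on another continent) without hurting the interactive experience much.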
And this is also a FUSE-based system. So this is a system view of how this works. We have the workstation on the top; we have the better build system, which kind of sits as the user interface between two different file systems and another client that talks to the distributed build on the back end. The back end is storing and retrieving objects directly in the object back end. For engineers, builds look like they're happening locally. They see all of the warnings and things going by on the screen; it looks just like it's local. It's going extremely fast, and in fact all of the activity is happening in a data center. And the interesting thing is that, you know, the data center can even be on another continent and it's fine; we can locate the back ends in multiple locations. It doesn't necessarily need to be all that close to the engineer. So what does this have
to do with testing? I've been going on and on about build systems. Well, this is really a backbone for automated testing systems. This is a platform to build on, and one of the insights we had was that executing a test is just another form of build action. The action being executed is the test binary itself; the inputs are the data files, if there are any of those being passed as input; and the outputs are the test results--you know, the test log, essentially. And we can break up very large tests; we can shard them into smaller tests. This allows us to run a very large number of automated tests in parallel, and it's fully integrated with the build system. So engineers can say, you know, "I want to make my project and then run all the tests." And this all happens in the cloud, so to speak. This means that at this point we can do automated testing of all changes going into the repository. Currently about 50% of the activity we see is from automated systems, and the automated systems share the same action cache as the rest of the engineers as well. So, yes, the interesting thing here is this means that if someone [INDISTINCT]
breakage, for example, the automated systems pick it up very quickly and, you know, emails go out and alarms go off, and people get nag-mailed: "Hey, roll back your change," like, you know, "You're breaking everyone." So, we get really quick feedback to engineers on the state of their change. So, this is all well and good, you know, theoretically. But what
kind of results are we getting? So our code base actually changes about 20 times per minute. About 50% of the code changes in a month. But we're able to keep up with this change rate: running the builds, we do about 65 thousand builds per day, 20 million per year. We run seven and a half million test suites per day. This requires, you know, on the back-end system, about 10,000 CPUs with about 50 terabytes of memory, and it produces about 1 [INDISTINCT] bytes of output per 7-day window. This is unique output. Going back to the scaling and the importance of the caching in the distributed build system: we have about a 94% cache hit rate. So in most cases, you know, 94% of the time, what people are seeing as build results are things that are coming back from cache. Only about 10% of the builds are clean. These tend to be things where people are creating new clients, or automated systems that are, you know, doing this out of paranoia. But we believe that most engineers now are accustomed to the idea that it's safe to do incremental builds. And even though we're building all from source and we're generating, you know, a very large amount of output, build times are typically 5 minutes or less, with some variation, you know, depending on the
size of the build. But also, going to the title of this talk, why I put "global" on there: we actually get very similar build and test times in offices around the world. So it's very interesting if you think about that. We have the same tool, the same command line, in all the offices around the world, and they all have very similar response times, and it's all very fast. It's all, you know, within the 5-minute area. So, this is an example of what we're seeing globally. These are average build times, you know, over a period of days in September, from different offices, in milliseconds. So we see that there's, you know, some variation there, but we have Munich, Sydney, New York, and Belo Horizonte in Brazil. And they're all fairly close together in terms of their performance. It just goes to show that, you know, we are in fact seeing the performance we expect. Going back to the clean
versus incremental builds. The far left bar here, over the zero, those are the clean builds.
And the bars over on the right, those are the incremental builds; those are the small incremental builds where, like, a single file has changed. And so we see that, you know, indeed, most of the build activity is now just that--the small incremental builds--which is what we want. We want people on this kind of tight edit-build-debug cycle. And then also, we did some analysis based on how we believe users are interacting with the system--basically, the frequency of different actions in the workflow--and estimated that (this is a very conservative estimate) the tools we built are saving about 600 person-years. So again, this is a fairly large engineering organization, but it's probably even more beneficial than this, because we know people change their behavior when build times improve, right? People get up less to go get a coffee and, you know, talk and chat and stuff. So we know that the benefits are probably greater than what we're seeing. So, my
conclusion in all of this: the build system really should be viewed as a core component of software engineering. It's unfortunate that in many cases it's not, and many of the kind of esoteric principles of build systems--things like hermeticism, correctness, reproducibility--are often kind of pushed to the side in the name of expediency. What we're actually finding is that correctness is just as important to speed and scalability as, you know, distributing load across many machines. And so, kind of, the questions I want to leave you with before I take questions from you guys are: how much does "good enough" cost? What's your experience with your build systems? I'm hoping you guys will come up and talk to me during the mingling time, because I'm interested in, you know, getting more data points on what people's experiences are with this--again, to get people kind of thinking about this. So with that, any questions? >> Sorry. I just [INDISTINCT]--you wrote something about cache [INDISTINCT] with interactive users; what do you mean by that?
>> Sorry. Yes, because this was not very clear: we have very extensive automated test systems.
So as, again, as changes go into the repository, automated systems see that and start running
the tests straight away. And so those automated systems are using the same caching mechanisms that the interactive users, the developers, are using. And so what happens is, as those tests are executing, it's actually pre-warming the cache, so that when the engineers come along and are doing the next build, they're hitting a warm cache. Makes sense?
>> Hello. Usually any build cycles we do have mainly been based on a time cycle, the timestamp. Here, you're talking about metadata which actually has all the information for the build to happen. So, what information does the metadata contain? I mean, when I check in code, what should I be mentioning, and what does the compiler do? It should run on all that information. Because every time I check in, do I have to update the metadata? Is that the expectation?
>> No. I mean, obviously at Google with our current build system, you know, we're not dealing with as many platforms, maybe, as some places. So, the goal is to not have
people changing the build metadata frequently, although they can and it's written in a fairly
high level language. But in general we don't expect that people are changing the build
metadata every time they're checking in. Does that make sense?
>> So what triggers the build cycle? Is it a check-in? A check-in based on timestamp?
>> Yes. So several things can trigger a build cycle. So, one of course is the engineer that's
requesting a build. But the--in the case of automated build systems, we have things listening
to changes as they go into the repository, and as soon as, you know, a change goes into
the repository, various automated systems are triggered.
>> Okay. Thanks a lot. >> So the question is how much of the 5-minute build time is from the cache versus distributing the load? And it really depends, right? I mean, we have peak load times where there's not a lot of spare capacity and, you know, a totally clean build may take longer than it would otherwise if the cache is not warm. But the primary feature of the cache, I would say, is just the ability to scale. If we look at the cache hit rate that we see and the amount of resources the builds are currently taking, you can kind of extrapolate how many CPUs we would need to have, without that caching, to do the same level of activity--it would be dramatically more expensive. It's more of a resource usage issue. And in the back here.
>> If I understand it correctly, the complete build is controlled using the build config files and other config files. So, how much [INDISTINCT] is spent on constructing or building those config files? And how easy or difficult is it to...?
>> Yes. So the build language itself--we've gone through a fair amount of effort to keep
it clean and simple. It's fairly declarative. But of course there are some ways to extend
it and add more features to it, as teams need to add those features. The goal though is
to make it as simple as possible for people to deal with that. It's hard to make a comparison on how difficult it is, but it's not overly complicated, I think. The main feature is, again, we try to focus on declaring the dependencies and the inputs and the outputs, along with any other metadata required--things like, you know, options or whatever. We try to focus on what the user actually needs, rather than forcing them to think about how am I going to distribute this, or how am I going to scale this. There's no notion, for example, really of--well, there's some notion of threading, but not really. More questions? >> [INDISTINCT]
>> Yes. So the question is how to take care of multiple platforms in a build? And so,
within Google, we do have several different platforms we use, and there is a notion of,
like a host versus target platform. And there are ways to do that. If I could really distill
it down to the content of the talk, it's really--the definition of the action includes the binary
that's being run--so for example, if it is a cross compiler--and it includes the options. So if there are some types of options to target the different host platforms, that's all part of the action description. >> Some part of the code can be common between platforms. Do you, like, I mean, do you have a separate compilation for the common part of the code, or...? >> So we do--we have ways of handling that.
I can't really go into it here, but... >> Okay.
>> [INDISTINCT] >> Yes. We do have ways of doing that.