Site Reliability Engineers — Keeping Google up and running 24/7




Uploaded by GoogleStudents on Mar 25, 2013

Transcript:

ANDREW WIDDOWSON: Greetings, everyone, and welcome to the
Google Students Hangout On Air about site reliability
engineering.
My name is Andrew Widdowson.
I've worked here at Google as an SRE for the last six years
specializing in web search.
And I came here from Carnegie Mellon University.
I'm joined by a panel of SREs who are very excited to share
with you a little bit about what SRE is all about.
We'll start off with introductions from Leslie.
LESLIE CHEUNG: Hi, my name is Leslie Cheung. I've been an SRE
at Google for over three years now.
I work on the main web server, the front-end for all of your
search results and your home page.
And my alma mater is UCLA.
ANDREW WIDDOWSON: Thank you, Leslie.
Dina?
DINA BETSER: Hi, everyone.
I'm Dina Betser.
I'm an ads quality SRE in the Pittsburgh office.
I actually work on the system that automatically decides
which ads should be shown to users.
And this is one of the largest machine-learning systems in
production.
I've been a full-time SRE since I graduated MIT in July.
And I'd been a software engineering intern at Google
before that.
ANDREW WIDDOWSON: Very cool.
And last but not least, Aaron.
AARON JOINER: Hi, my name is Aaron Joiner.
I come to Google from a less traditional computer science
background, more from a music performance background and the
IT industry writ large.
I've been here for about seven and a half years now.
And I work down in the bowels of the Google infrastructure,
working on the building blocks that keep all of the wonderful
services we run able to scale to the size they do.
ANDREW WIDDOWSON: Very cool, indeed.
All right, so what we're going to do today is talk through
and answer some of the most common
questions that we get about SRE.
Let's go ahead and start off.
Site reliability engineering--
what the heck does that mean?
I think it means different things to different folks.
So I'd like to get a take from--
let's start with Leslie.
LESLIE CHEUNG: Yeah, so site reliability engineering is
about balance.
So a big production service has two things
that are going on.
There's the reliability aspect.
You don't want this service to go down.
You want it to be available to users at all times.
And at the same time, you also want to be able to push out
new features--
that newfangled feature that is going to
serve our users better.
And the thing is these are sometimes seen as two forces
that are colliding.
And the SRE role is about trying to make those things
work together, so that way we can push out new features in a
reliable way.
We want to be able to make changes, but make sure that
we're still serving our users.
ANDREW WIDDOWSON: Absolutely right, Leslie.
I think that site reliability engineering is definitely--
it enables both agility and stability in our approach.
So these are not two sides of the same coin.
They are, in fact, constructive positives, right?
For another take then on what site
reliability engineering is--
Dina?
DINA BETSER: Thanks.
I also think of being an SRE as being a paramedic for
production issues.
As people who carry a pager, we're the first
responders when something goes wrong in the system.
But part of that is being responsible for triaging the
issue, seeing how severe it is.
Is it actually something that's harming live users?
Or is it something that we can deal with on our own that
users will never see?
So being able to see how serious an issue is, and
getting in touch with the right people who can solve an
issue is really important to Google.
ANDREW WIDDOWSON: Absolutely.
I like to think of site reliability engineering kind
of as a fun mix.
I think of it as equal parts cloud
mechanic and systems scientist.
But that's just my own hyperbole.
We'll find out more about this as we go.
So next up, let me just go ahead and take care of the
elephant in the room.
A lot of folks assume that site reliability engineering
is just some fancy-pants term for the
heavy-duty operations role.
But the reality is it's quite different than that.
It is both operations, and it's also software engineering
development and planet-scale science and engineering.
So let's get a little bit more of a sense for that.
I'm going to send this over to Aaron, who has
some points to make.
AARON JOINER: Yeah, so I guess essentially we're trying to
figure out what do we do in terms of being able to scale
the systems we run.
So clarify for me again, I'm sorry--
ANDREW WIDDOWSON: Sure.
So, is SRE just some fancy ops role?
Or how do we differ?
AARON JOINER: Right, sorry.
So essentially my background is in the ops world.
So I've spent many years in the operational
trenches, if you will.
And the nice thing about being here at Google is that we
really don't just run around and push buttons in
production.
It's our job to figure out how to find those issues that we
run across from time to time.
And ultimately, as we were saying earlier, automate
ourselves out of a job.
So the ability to take whatever the new exciting
thing that's happened today is, and turn that into a way
that we can prevent that from ever happening again.
ANDREW WIDDOWSON: Absolutely.
Thank you, Aaron.
And, Leslie, what's your take on this?
LESLIE CHEUNG: Yeah, as Aaron was saying, we're not
interested in handling the burning building-- the
firefighting that sometimes happens when services are
burdened by a lot of things that are going on.
Like let's say, we have to move jobs from a machine or
restart a machine.
And that's not what we're interested in at all.
That may come about from time to time.
But the real interest that we have is identifying
failure scenarios, things that can go wrong, and figuring out
how we can design solutions that either fix themselves, so
we don't even get woken up in the middle of the night with
these problems, or just handle themselves transparently.
By design, if something fails, they'll still
continue working.
And we don't have to worry so much about it.
And that's the real engineering part of this.
There's a reason why it's called site reliability
engineering.
Because we're actually engineering solutions in the
long term to make our own lives easier.
And that frees up our own time--
we say automate ourselves out of a job-- but that doesn't
mean that we're not going to have a job.
Because there are always longer-term, big-picture things
that we have to think about, and new features that we need
to work with in order to make sure that
they roll out reliably.
ANDREW WIDDOWSON: Absolutely.
I like to think of this automate-ourselves-out-of-a-job
thing as being able to do more with even more, right?
To use a silly analogy, in a lot of ops work, it's a lot of
repetitive banal barnacle scrubbing.
And who wants to just scrub barnacles on the ship?
You want to be able to steer that ship, right?
So at Google, we are lucky enough to have some fantastic
infrastructure that takes care of a lot of the mundane and the
repetitive.
And we build from that.
So some pretty good stuff.
And we're using our own science and engineering
principles to go levels beyond that.
So cool.
All right, now, of course we have the traditional software
engineering role here at Google.
Many SREs and software engineers are working
together every day.
So what I want to know is, is there a crossover
between the two roles?
What sort of collaborations exist?
And we'll send this right back to you Leslie.
LESLIE CHEUNG: Yeah.
I mean, just to kind of preface this, I was hired as a
software engineer at Google.
And then, just before I started, SRE called me
and said, hey, would you like to join our organization?
And this kind of shows that SRE is pulling in software
engineering talent in order to make sure that they've got the
knowledge there in order to design the solutions that make
our systems reliable.
We definitely will do consultations with software
dev groups on new features, making sure that--
we've seen the failures that have occurred in production,
what we've learned from a lot of mistakes.
And there's always still new mistakes to be made, new
learning things.
And SREs are right there responding on these incidents.
So we have a lot of knowledge.
And we also understand the software systems that are
being deployed.
And we can work with developers to make sure that
these things get rolled out in a good fashion.
ANDREW WIDDOWSON: Absolutely.
And Dina, what's your take?
DINA BETSER: Yeah, absolutely everything that Leslie
says is 100% true.
I think that SREs work with software engineers along every
step of the way.
And that includes when they're working on a
new push, a new binary.
And an example of this is just last week when I was on call,
one of the developers on the ads quality infrastructure
team was making an improvement to the system, porting
and replacing a component.
And as the on-call, I was responsible for making sure
that as he was doing the push, everything was proceeding
smoothly and there were no unexpected alerts.
This whole time we are working very closely together, and
making sure that the push was a success.
So I like to think of this as the SREs and the SWEs are
working together, just like Mario and Luigi work together
to save the princess.
ANDREW WIDDOWSON: That's absolutely
what we're doing here.
Now, thank you for that, Leslie and Dina.
I'd like to just mention for all of those of you who are
following along to our Google+ Hangout on Air, if you have
questions about the site reliability engineering role,
which we're discussing today here at Google Students, you
can tag a question to us.
Use the hashtag, srehangout.
That's no dashes, no spaces-- tag srehangout.
OK, we'll take some questions a little bit later on.
All right, I'd like to mention some of the ways that I feel
SREs and software engineers engage with each other.
I spent a lot of my time for Google web search doing
proactive sorts of things.
SRE roles are often an equal mix of proactive
and reactive work.
What I'm trying to do is get out ahead of new exciting
features in web search.
And one of the perks in the job is I get to see where web
search is going next.
I know you're going to enjoy the next several months of
Google's search evolution.
But so when I'm meeting with software engineers who are
excited to deploy these features, I'm having
conversations with them to make sure that their designs
are the best they can be so that they'll scale to the sort
of challenges that we have, right?
A common conversation between myself and another developer
might be something like, well, how many users do you think
your feature can support on our infrastructure?
100,000 users?
OK, well, what if you have 10 million users?
Where are the bottlenecks?
Where are the slow downs?
Where are the things that you'll need to scale?
And how can we make sure that we do this right the first
time, so that on opening day, you have the least eventful
launch, infrastructurally, that you can have?
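That kind of capacity conversation can be sketched as a back-of-envelope calculation. Everything below-- the per-user query rate, the per-replica capacity, the headroom fraction-- is a hypothetical illustration, not a real Google figure:

```python
import math

# Back-of-envelope capacity check for a hypothetical new feature.
# All figures are made-up illustrations, not real Google numbers.

def required_replicas(users, qps_per_user, qps_per_replica, headroom=0.5):
    """How many serving replicas are needed, keeping spare headroom
    so the service survives load spikes and replica failures."""
    peak_qps = users * qps_per_user
    usable_qps = qps_per_replica * (1 - headroom)
    return math.ceil(peak_qps / usable_qps)

# 100,000 users vs. 10 million users: where do we hit a wall?
small = required_replicas(100_000, 0.25, 500)
large = required_replicas(10_000_000, 0.25, 500)
print(small, large)  # 100 10000
```

The point of the exercise is the same one Andrew makes: a design that is comfortable at 100,000 users may need two orders of magnitude more serving capacity at 10 million, and it is cheaper to find that out before launch day.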
So having seen the different failures and the different
sorts of ways that software can explode here at Google, in
our very fault-tolerant systems, I'm able to share
that experience with folks who are maybe writing a
feature for the first time, or writing a feature that's 10
times as large as they've ever written.
That's the sort of value that SRE can bring to the software
engineering organization, working in tandem with them.
OK, so that was kind of my parting
comment for that question.
So let's see.
If we've established then how SREs and software engineers
work together, how do the scope of SRE work and software
engineering work differ?
Let's go back to you, Dina.
DINA BETSER: Right, SREs are software engineers.
They're working on our live production software.
So one thing that SREs focus on is that everything they
work on has to do with production code.
So they might be working on monitoring code or
instrumentation, figuring out how things will run when it
actually goes live on our network.
So one of the things that we've been saying repeatedly
is we write things to automate ourselves out of a job.
When we do this, we try to pick the pain points that we
currently see, and make sure that they never happen again.
So while developers may focus most on just making their
software work, we make sure that the software works on
tons of machines spread across many, many data
centers around the world.
ANDREW WIDDOWSON: Absolutely.
And Aaron, what do you think?
AARON JOINER: So, like Dina, I certainly think we spend a lot
of time working on the monitoring infrastructure
aspects, more focusing on making sure that--
if a developer comes to us and says, we're going to launch
this new feature, we're going to launch this new small
product, or even a large planetary-scale thing like
web search-- we've had a chance in advance to put in
place the tools that are going to be able to allow them to do
that productively.
And large portions of our lower-level infrastructure are
written directly by the SREs, from the monitoring
infrastructure and the frameworks that make that easy,
to roll-out infrastructure, to systems that let you
deploy that software easily to
our quintillions of machines in production.
It's really that scale that I think is--
the familiarity with the production scale is the key
differentiator.
ANDREW WIDDOWSON: Absolutely.
And we're really lucky here at Google to have code that is
highly instrumented, very scalable in its
base libraries.
Being able to navigate the large grid-scheduling
system that we have at Google--
I count myself lucky to be able to do that.
And it's part of the everyday job in site reliability
engineering.
OK so we have these site reliability engineers.
We've talked a bit about what they do, but for which teams
do they do what they do, right?
Do Google SREs work on certain products?
Do they work on all products?
Do all Google teams have an SRE working on things?
How does this exactly work?
And we'll go back to you, Aaron, for that.
AARON JOINER: So for my part, I've spent the last probably
three or four years mostly taking stuff where they did
not currently have SRE support, and helping bootstrap
it and get it up to the standards where we're willing
to have an SRE support it.
This might make you wonder, well, who ran it before that?
Typically when new software is introduced by a developer,
it's run by the developer for at least the first six months
of its production life cycle.
And that can include rolling it out to live users.
And then we usually do encourage those developers to
have SREs involved in helping them get ready to roll it out.
It's often at a more mature point in the product life
cycle-- where they've had a chance to stabilize it, get
the kinks worked out of it, if you will, and run a few
hurdles through it themselves-- that the SREs then come in
and help them really prepare it to be long-term supportable.
And of course, it's not throw it over the fence and it
will be run from then on, which is one of the nice
differentiators there.
ANDREW WIDDOWSON: Yeah, so what Aaron's talking about as
far as bootstrapping a service up to the quality standard for
SRE teams, I think that is a very valid point.
That's a crucial time when there's a hand-off of part of
the responsibilities for a product to the SRE teams.
Leslie, where do you want to take this from here?
I know you have some thoughts.
LESLIE CHEUNG: Yeah, so Aaron had said that we don't start
out-- when a new service comes up, it's not staffed
immediately with SRE.
We make sure that the devs are responsible for making sure
that it's stable.
And that they understand some of the common pitfalls that
occur when you're running in production systems.
SRE teams are only deployed for mission-critical
services, or services where Google sees the need for
making sure that we're serving users very
reliably on that service.
And they'll either work with the developers to make sure
that this comes up to speed, and then we can deploy a team
of people who can work with this.
Or once the project reaches maturity we can actually have
a real team work with it and take it over completely, and
then work with the developers on the future, long-term
prospects of that service.
ANDREW WIDDOWSON: Absolutely.
It's interesting to me that we have the standards that we do
for the SRE organization.
I think this actually separates out substantially
from a traditional operations organization, where it's like,
here's what you get.
We're giving this to you, right?
And in fact, most interestingly enough, though
SRE provides a lot of value for production services, if we
find that it's not up to our standards, we'll hand it right
back to the developers, right?
So there's a time and a place for us to get involved.
And we do that as a whole team.
Speaking of which, Dina, what's your take on which
projects Google SREs work on?
DINA BETSER: Right.
So SRE as a whole really tries to support organizations with
a group of SREs.
So one thing that is important to know is that a small
service is unlikely to get a singleton
SRE working for them.
Because we believe that a group of SREs working together
are much more likely to have a bigger impact.
So for instance, sometimes smaller services are grouped
together such that a group of SREs can support all of those
services and collaboratively work on larger and more
complex problems.
ANDREW WIDDOWSON: Absolutely.
Cool.
So it turns out we have a question from our Google+
Hangout on Air audience.
This question comes from [? Yukon ?]
[? Wong ?].
And his question is what kinds of projects have you guys
worked on in the past?
Aaron, why don't we start with you.
AARON JOINER: So when I came to Google, I guess the first
thing I got handed was our serial console infrastructure.
I had worked on serial console things before I came here.
And this sort of feeds back into the
engineering aspect of SRE.
Pretty much me and a small team of a couple of other
people wrote all of the code, many thousands of lines of
Python, from scratch, to manage a planetary-scale
serial console infrastructure out of the gate.
It was quite a challenge from what I was
used to in the past.
It was a lot of fun.
I've worked on what we like to call--
well, I guess the best way to say it, because I work in the
bowels of infrastructure on things that aren't public
facing, I have to choose my words a little carefully.
I've worked on turning up our new data centers and machines
when they come online.
So you can imagine the large chunks of machines that we
bring on, and we turn web search on on top of them, or
Gmail, and various other sundry services.
And automating that process so that it's not done by a human
is something that allows us to greatly increase our velocity.
Every time any of those machines is sitting idle that
is dead time that is literally money
ticking away on the clock.
So we work very hard to make that
process fast and expedient.
And these days, I've started to work pretty much in the
bowels of the infrastructure still.
So doing things like working on the system that safely and
carefully rolls out our production Linux image to all
the machines in the fleet.
You can imagine that if we make errors there, that it can
be catastrophically bad.
So we have to be very slow and very careful so that we don't
disrupt the systems running on top of that, such as web
search and Gmail, et cetera.
ANDREW WIDDOWSON: Good stuff.
We all have a story about things that we've worked on
here for Google SRE.
I have a couple I'd like to share.
So I happen to, I think like Aaron, have a background in
and appreciation for music.
And so I jumped at the chance when we launched a
music-related feature on Google web search.
In fairness, it was only in the Americas at the time.
It was a feature where you could search for an artist,
album or track, and it would pop up playing a
streaming bit of music.
The thing was, we weren't hosting that music ourselves.
We were hosting it with some external partners.
And like any good site reliability engineer, I had to
make sure that the entire end-to-end user experience was
going to be fast, friendly, perfect the first time.
So I actually worked with those external companies to
load test them through the commodity internet.
We used the same infrastructure that the Googlebot
uses to send a representative
sample of user traffic.
And we streamed at tremendous bandwidth rates over the
internet to make sure things were going to work perfectly.
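A ramped load test like the one Andrew describes can be sketched as below. This is an illustrative stand-in, not Google's tooling: the `send_batch` hook represents whatever traffic-replay infrastructure actually pushes queries, and the rates and error budget are invented:

```python
# Sketch of a ramped load test: start gently, double the rate each
# step, and stop ramping if the partner's error rate climbs past
# budget. send_batch is a stand-in for real traffic-replay tooling.

def ramp_schedule(start_qps, max_qps, factor=2):
    """Yield the QPS levels for a ramped load test."""
    qps = start_qps
    while qps <= max_qps:
        yield qps
        qps *= factor

def run_load_test(send_batch, start_qps=10, max_qps=1000, error_budget=0.01):
    """Ramp traffic until max_qps, or until errors exceed the budget.
    send_batch(qps) must return the observed error rate at that level."""
    passed = []
    for qps in ramp_schedule(start_qps, max_qps):
        error_rate = send_batch(qps)
        if error_rate > error_budget:
            return passed, qps  # levels that passed, level that failed
        passed.append(qps)
    return passed, None

# Fake partner that starts failing above 300 QPS:
levels, failed_at = run_load_test(lambda q: 0.0 if q <= 300 else 0.05)
print(levels, failed_at)  # [10, 20, 40, 80, 160] 320
```

The doubling ramp is the important design choice: it finds the rough capacity ceiling in logarithmically many steps while giving the partner time to degrade gracefully rather than being hit with peak load cold.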
I've also had the pleasure and privilege of working on
scaling up Google's encrypted search, which we launched a
few years back.
So if you use HTTPS Google web search, myself, another SRE,
and a software engineer worked together in tandem intensely
for a couple of weeks to get that out the
door in a timely manner.
So privacy and freedom of information being what it is,
I was very proud as a Googler and as an individual to be
able to launch that feature--
couldn't have happened without SRE.
So those are some ideas of some things that we worked on.
And thank you, [? Yukon, ?] for asking that question.
If other folks have questions, you can tag us at #SREhangout
on Google+.
OK.
So coming up next, another question that many people ask
is what does a typical day look like for SRE?
What are some of the typical problems you face?
And the answer is, there really is no such thing as a
typical day for SRE.
In fact, that quite frankly hits the nail on the head.
It turns out that, as an SRE, you are a Jack
or Jill of all trades.
You work across multiple teams.
Your challenges adapt on a daily basis.
Where perhaps a traditional software engineer may be
working on the one feature that they're going to launch
this quarter, or this month, or the one or two features,
SRE are moving within several different domains, doing the
reactive consulting--
excuse me, the proactive consulting we mentioned.
Is something going to scale?
Or spending some of their time being in an on-call
rotation, some of the more reactive stuff, meetings with
developers, meetings with their teammates
to get stuff done.
So in my particular case, just as the example I'll take for
this question, I spend about 20% of my time in meetings,
keeping track of what's coming up, communicating changes to
folks, making sure everyone's on board.
I spend about another 20% of my time as a senior SRE here
at Google, teaching newer SRE how to be a good on-call
engineer, how to do the diagnostics and sorts of
things that we do.
I spend maybe another 40% of my time writing code.
As a software engineer, I'm the tech lead of a project
that prevents abuse against web search, denial of service
attacks, people who want to unfairly copy our search
results, that sort of thing.
And the remainder of my time is spent occasionally serving
shifts in our on-call rotation.
So that's a typical day or an atypical day in SRE.
You never know quite what you're going to get.
So we just wanted to make sure we covered that topic.
Now another topic that people particularly have questions
about is, what's it's like to be on call at Google?
What happens if you get paged, and you're dealing with an
emergency alert?
And I will send this off to Aaron to lead that answer.
AARON JOINER: So I'll start off by admitting upfront that
I'm somewhat of an on-call junkie.
I am delighted when the pager goes off, because of the
intellectual puzzle that it often presents.

Being woken up at 2:00 in the morning and having to go fix
something is really something I thrive on.
That said, I have to say that at least for our particular
group-- and I think this is broadly
representative of SRE--
we work very hard to make sure that our pager rotations are
not the meat grinder you might imagine on-call rotations
can be in some parts of the industry.
We tend to have very well-established procedures for the
typical things that can go wrong, in the form of playbooks,
and very good, concise monitoring alerts.
And we make sure that you don't have a pager that goes off
when there's no need for a human to respond to it.
Getting back into that whole automate
yourself out of a job.
Let the machines fix it where it's practical
for them to do so.
So that's my take on it.
Yeah, it's a lot of fun to hold a pager.
But it's really not nearly as bad here as it might be
elsewhere, so to speak.
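Aaron's rule-- page only when a human is actually needed-- can be sketched as an alert gated on automated repair. The thresholds and the `auto_repair` hook here are hypothetical illustrations, not Google's actual monitoring configuration:

```python
# Sketch: page a human only when automation cannot fix the problem.
# Thresholds and the auto_repair hook are hypothetical illustrations.

def should_page(error_rate, slo_error_rate, auto_repair):
    """Try self-healing first; page only if the condition persists."""
    if error_rate <= slo_error_rate:
        return False            # within SLO: no human needed
    if auto_repair():           # e.g. restart a task, drain traffic
        return False            # machine fixed it: let the on-call sleep
    return True                 # automation failed: wake a human

# Automation handles it:
print(should_page(0.05, 0.01, auto_repair=lambda: True))   # False
# Automation can't:
print(should_page(0.05, 0.01, auto_repair=lambda: False))  # True
```

The ordering is the point: the pager fires on symptoms that exceed the service-level objective *and* survive the automated fixes, which is exactly the "let the machines fix it where it's practical" idea.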
ANDREW WIDDOWSON: Cool.
That's definitely one take on things.
I like the adrenaline junkies that many SREs are.
If that's something that intrigues you, take a look.
Leslie, I'd like to hear your take on things.
LESLIE CHEUNG: Yeah.
When you get an emergency alert, maybe you're woken up,
or maybe you're just out and about.
It's like getting a problem dropped in front of you.
It's like a puzzle and you have to solve it.
And there might be some time constraints with that as well.
It makes it a little exciting.
I think that when you're on call and get an
alert, you have to--
you're now the lead person.
You're coordinating the effort to fix it.
But that doesn't mean that you're the one person who's on
call for this event.
There are many other engineers who are also on call that you
can tap into.
There are other developers you can ask what's going on.
They may be experts in a certain part of the system
that you need to look into.
And that's part of your reverse-engineering effort as
you're trying to figure out what went wrong.
You can tap into this vast network of people.
So although you might be the lead on this, you're never
really alone when you're on call.
You have a very good support network.
ANDREW WIDDOWSON: Absolutely.
And I'd like to underscore some of your points, Leslie.
You're fully supported by Google and Google teams when
you're on call.
It's thrilling to think that, as the incident commander for
an outage, what you say goes.
And it is your responsibility to divine the ultimate root
cause of a problem, and make sure to see it through all the
way to its completion.
At the same time, you have to prioritize several things.
This is where the excitement and the puzzle stuff comes in.
How do we prevent the breakage that our
users would see, right?
It's actually amazing to consider that, of all the
different things going on at Google, there's brokenness
going on every day.
But because we design our systems to expect failure and
to route around things, much of it is not seen.
That being said, the SREs can swoop in and make sure that it
doesn't become a problem of epic proportions.
So that to me is really fascinating.
And when I'm on call, to reiterate Aaron's point about
the adrenaline of things, it's absolutely a rush.
Because you're out there defending our users from
brokenness and making sure they get the right results.
Thinking of it less from an operations standpoint and more
from a scientist and engineer standpoint, you can imagine
that any Google engineer's time is very valuable.
But for an SRE who's on call, it's doubly or triply so,
because of all of the stuff that's on the line.
Think for a moment about the hundreds of different factors
that one could imagine that might contribute to any sort
of problem that we have at Google.
And like I said, there are an unlimited
number of these things.
I like to think of it as kind of a decision tree, or even a
search space, of things that we have to consider before we
can find the root cause of a problem.
It's my job to take my experiences and my skills that
I brought from school, from work, and figure out how to
prune and navigate through that search space so that I
spend my time most effectively, nipping the
problem in the bud.
And I think that's something that SRE definitely
brings to the table.
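Andrew's search-space metaphor can be made concrete as a tiny diagnosis tree, where each cheap check prunes a whole branch of possible root causes. The questions and verdicts below are invented examples, not a real Google playbook:

```python
# Toy diagnosis tree: each node is a cheap check that prunes part of
# the search space of possible root causes. Checks/causes are invented.

DIAGNOSIS_TREE = {
    "question": "errors only in one datacenter?",
    "yes": {"question": "did a rollout land there recently?",
            "yes": "bad rollout: roll it back",
            "no": "local infrastructure issue: drain the datacenter"},
    "no": {"question": "did global traffic spike?",
           "yes": "overload: add capacity / shed load",
           "no": "bad config push: revert last global change"},
}

def diagnose(tree, answers):
    """Walk the tree using answers: {question: bool}."""
    while isinstance(tree, dict):
        tree = tree["yes"] if answers[tree["question"]] else tree["no"]
    return tree

verdict = diagnose(DIAGNOSIS_TREE, {
    "errors only in one datacenter?": True,
    "did a rollout land there recently?": True,
})
print(verdict)  # bad rollout: roll it back
```

Two yes/no answers have eliminated three of the four candidate causes, which is the pruning Andrew describes: spend on-call time on the checks that cut the search space fastest.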
So perhaps a bit more of an insight into how we deal with
emergencies here at Google.
But it's definitely part of our company culture-- pitch
in, get done, fix the problem, and fight for the user.
That's the SRE way and the Google way.
OK, so we have another question from our audience.
As a reminder, you can #SREhangout if you'd like to
contribute.
The question from [INAUDIBLE]
is, do you make use of machine learning and artificial
intelligence algorithms for your projects?
If yes, could you provide an example where such algorithms
have proven useful during your projects?
All right, I know several of us can answer to that.
I'll take a first pass at this.
As a matter of fact, we use machine learning and
artificial intelligence in our abuse detection software that
we use here at Google.
I'm, in fact, leading a team of folks who are doing all
sorts of clustering and analysis to try to figure out
what behaviors for web search are totally above board,
totally normal--
our users are on a variety of different platforms--
versus all sorts of nefarious behavior that's trying to
attack and shut us down.
It's an interesting arms race.
And so to the extent that we can do some of our pre-work by
automatic classification or through supervised learning
systems, we do that.
And then ultimately we'll step back and
measure and cut again.
So it's definitely an exciting thing on the
face of abuse detection.
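The unsupervised side of what Andrew describes can be sketched with a toy pass over per-client query features. Everything here-- the two features, the centroid-distance rule, the threshold-- is an illustrative assumption, not the real abuse pipeline:

```python
# Toy illustration of unsupervised abuse detection: model "normal"
# query behavior as a centroid of per-client feature vectors and flag
# clients far from it. Real systems use far richer features and models.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def flag_outliers(clients, threshold):
    """clients: {name: (queries_per_min, error_fraction)}. Flag clients
    whose behavior is farther than `threshold` from the centroid."""
    center = centroid(list(clients.values()))
    return sorted(name for name, v in clients.items()
                  if distance(v, center) > threshold)

traffic = {
    "browser-a": (2.0, 0.01),
    "browser-b": (3.0, 0.02),
    "mobile":    (1.0, 0.01),
    "scraper":   (500.0, 0.40),   # hammering us with bad queries
}
print(flag_outliers(traffic, threshold=200.0))  # ['scraper']
```

Even this toy shows the arms-race dynamic: one abusive client drags the centroid toward itself, so thresholds have to be re-measured as behavior shifts, which is the "measure and cut again" loop.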
I know that perhaps Aaron would also like to contribute
to this question?
AARON JOINER: Sure, yeah.
Like I said, down in the bowels of the infrastructure,
we deal with literally all of the machines at Google.
And so as you can imagine, that's a lot of machines.
In order to keep track of what's gone wrong with any
particular machine at a point in time, we need to be able to
diagnose what that machine's problem is, and be able to fix
it quickly and reliably.
There's a couple neat things we do here.
One of them, there's actually some insight into it in a
paper we published about hard drive failure
rates a few years ago.
Most of the information gleaned from that came from
applying machine learning to figure out what the
signals are that a machine will go bad in time.
And that's not something I personally worked on, but it
was something I was closely associated with.
And there are things I've personally done for monitoring the
software on all those machines-- the sort of lowest
layer of software, the actual production image itself.
As we roll out changes to the fleet,
things will break.
And there will be coincidental breakages.
And being able to determine the signal from the noise in
that condition can be very tricky.
And we apply some neat machine learning tricks to make sure
that when we're rolling out a bad change, we can tell the
difference from merely some three or four or five benign
events that happen to fail coincident with that roll-out.
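One simple way to frame Aaron's signal-versus-noise problem is to ask whether machines on the new image are failing faster than the fleet's background rate. This sketch uses a normal approximation to the binomial; the rates and the three-sigma cutoff are made-up illustrations, not the actual technique Google uses:

```python
# Sketch: is the failure rate among machines on the new image high
# enough that it's unlikely to be background noise? Compare observed
# failures to the fleet's baseline rate. All numbers are made up.

def excess_failures(rolled_out, failed, baseline_rate, sigmas=3.0):
    """Flag the rollout if observed failures exceed the expected
    baseline count by more than `sigmas` standard deviations
    (normal approximation to the binomial)."""
    expected = rolled_out * baseline_rate
    stddev = (rolled_out * baseline_rate * (1 - baseline_rate)) ** 0.5
    return failed > expected + sigmas * stddev

# Baseline: 0.1% of machines fail per day anyway.
print(excess_failures(10_000, 12, 0.001))   # False: noise (expect ~10)
print(excess_failures(10_000, 60, 0.001))   # True: the image is bad
```

Twelve failures out of ten thousand is within coincidence when ten were expected anyway; sixty is not, and that is the moment to halt the rollout.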
ANDREW WIDDOWSON: So as you can imagine, at Google we do
things at what we've been referring to as planet
scale, right?
And you can imagine the different sorts of data
sources we have on everything from machine failures to
queries per second coming in through our border routers or
what have you.
It's a tremendous and interesting data problem for
anyone who's interested in machine learning and
artificial intelligence.
And having the keys to the castle to be able to make
changes in an on-call capacity also allows you to
spend time writing software in that same
environment, which is pretty darn cool,
if I do say so myself.
Maybe I'm biased.
All right, so we're going to go on to our
next question then.
So a lot of students ask me at job and career fairs that we
host for Google, what are some of the skills needed to be an
SRE, right?
So in fact, one of our students online here has asked
a question.
Matthew has asked what is the typical education level for an
SRE, which I think is a good follow-on question.
So let's start off, Leslie, with what are some of the
skills needed to be an SRE?
LESLIE CHEUNG: Well, a lot of people think that in all these
engineering roles you need to have some type of a CS
background.
And while that does help, that's not necessarily the
only thing that we're looking for.
We're also looking for people who can be assertive when a
situation needs someone to take the lead.
You need to be able to tell people, this
is not a good design.
You need to be able to have confidence in your knowledge
and be able to assert yourself, either in an
emergency situation, or even in design discussions.
You need to be able to say, no, this may not
be the right way.
Maybe you should look at this design because it will be more
fault tolerant.
You also need to have a good head for triaging.
When a few different problems come up all at once,
you want to be able to weigh which one is most severe,
which one is going to lead to the most user impact, and then
attack that problem.
And those are skills that aren't necessarily taught
explicitly in formal education.
But those are skills that are really good to have as an SRE.
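The triage idea Leslie describes-- weighing simultaneous problems by severity and user impact-- can be sketched as a simple prioritization. Everything here (the Incident fields, the impact formula, the example incidents) is hypothetical, just to make the idea concrete:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    name: str
    affected_users: int  # rough estimate of users exposed to the problem
    error_rate: float    # fraction of their requests actually failing

def triage(incidents):
    """Order simultaneous incidents by estimated user impact, so the
    most severe one gets attacked first."""
    return sorted(incidents,
                  key=lambda i: i.affected_users * i.error_rate,
                  reverse=True)

queue = triage([
    Incident("slow homepage in one region", 50_000, 0.02),
    Incident("search results 500s globally", 2_000_000, 0.10),
    Incident("logging pipeline lag", 0, 0.0),
])
print(queue[0].name)  # the global search outage comes out on top
```

In practice the weighing is a judgment call with many more factors, but the habit of explicitly asking "which of these hurts users most right now?" is the skill being described.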
ANDREW WIDDOWSON: Absolutely the case.
So Dina, what's your take on this?
DINA BETSER: Yeah, I definitely agree with what
Leslie said.
I think it's really important, in talking about the kind of
work that SREs do, to remember the reactive and proactive
kinds of work.
Since one of the things we do is react to pages and
interrupt work, it's important to be able to
balance our project work with
that interrupt work.
One of the things that can require is being able to
focus on multiple projects at the same time, which is a good
skill for SREs.
Another good skill is being able to communicate
succinctly.
When you're in the middle of an outage, you want to make
sure that all of the stakeholders and users of your
system know what's going on every step of the way, and
that everything is communicated clearly.
Also, to respond to the audience question about the typical
education level for a Google SRE: just on
my team alone, there's a wide array of education levels.
We have people who majored in CS-- typical.
We have people who have master's.
We also have people who learned mostly by doing, and
may not even have a degree.
But really it's about what you can do, and the skills that
you have that you can bring to Google.
ANDREW WIDDOWSON: Absolutely.
And so, to the idea of skills that we may not have learned
in the classroom as well, let's talk about some of the
skills and extra things you can practice to help put
yourself on the career path for site reliability
engineering.
Aaron, what do you think?
AARON JOINER: So for my two cents, I'm probably the least
traditional in that respect, in that I did not come from a
CS background.
So I certainly had to learn it the hard way,
if you will.
And in SRE we see, as far as educational
backgrounds go, that people who did come from a CS
background often need help learning some of those more
operational skill sets.
If you're currently a student, that's probably
something you could think about as well, to answer your
question directly.
Any chance you can get to work with production Linux
systems-- either at home, with your local Linux users
group, or at a computer lab at the university--
any hands-on exposure you get to dealing with file systems
and actual LAMP stacks serving websites, those
kinds of things are very practical skills, running
the gamut across into networking as well.
So those are the sort of core operational skill sets
that software engineers end up having to pick up as they go--
not because they're something you necessarily need on a
day-to-day basis.
Our systems tend to abstract away a lot of the
mundanity of running a job on a Linux machine at Google.
But that's true only until it isn't.
When things start to fail, those skills
become extremely valuable.
So are the intuitions that Leslie was talking about
earlier-- being able to triage those problems as you see them.
So again, as far as education backgrounds, myself, I came as
a conservatory musician.
I certainly did not come from a traditional CS background.
So you certainly can succeed in this space on your merits,
as Dina was saying earlier.
You can prove that you can do it.
And that's all we're really after: success over
paperwork, if you will.
ANDREW WIDDOWSON: Absolutely the case.
As a more traditional computer scientist myself, I took it
upon myself to diversify my out-of-classroom
experience as well.
For example, I was the IT manager of sorts at the campus
radio station at Carnegie Mellon.
And I also interned with the campus's network engineering
and network development group, to get a whole
different perspective.
For those of you who are computer science and
electrical and computer engineering students perhaps
out there, and variations thereof, obviously be solid on
your data structures and algorithms.
Understand computational complexity and run times.
Understand the fundamentals really of what makes computer
science computer science.
But venture out a little bit too.
Getting that extra experience with Linux or servers or that
sort of stuff will definitely make you stand out as far as
site reliability engineering is concerned.
And the balance of those two things, regardless of which
way you came into SRE, is a great thing.
And we will build you up on whichever one--
software engineering or the more traditional systems
engineering skills--
you are less familiar with.
So with that then, I think we have spent a lot of good time
here talking about the site reliability engineering role.
And I'm so glad we could take this time to
discuss all of this.
I'm sure there will be more questions.
If you have others, please feel free to continue tagging
SREhangout on G+, and we'll help answer
some of those questions.
For now, for myself, for Dina, for Aaron, and for Leslie, my
name is Andrew.
And thanks so much for attending our Google Students
Hangout On Air about site reliability engineering.
For more information about Google and its job roles, you
can visit google.com/students, your portal
to hiring at Google.
And of course, you can also plus us on Google+.
That's plus.google.com/plusGoogleHangout.

Thanks so much.
Everyone say goodbye.
LESLIE CHEUNG: Bye.
DINA BETSER: Bye.
AARON JOINER: Bye.