Jamie and I often get asked,
"How should you raise your kids to become computational biologists, just like us?"
And we say, "Don't go to a museum."
"Don't go to a planetarium."
"Especially do not go to any of our lectures."
"Go to a baseball game."
We became friends based on our mutual interest in and love of baseball statistics.
So over the years that we have been hanging out together,
going to baseball games, watching baseball games, talking about baseball,
we have noticed that baseball statistics isn't just something we are interested in as a hobby;
it actually plays a big role in how we think about what we do in the lab and in our work.
And we have also noticed that we are not alone in this.
There are many computational biologists who really got their training
in how to think about computational biology from obsessively analyzing baseball statistics.
Right, so today we are going to walk you through a few examples of
why we think these fields actually have a lot in common.
And how it is not an accident that all of these people who are now successful computational biologists
grew up being obsessive baseball geeks.
And it really starts with the history of how the data is collected. So baseball statistics used to be collected
at an individual game, where people would just keep score on a score sheet.
And DNA sequences were obtained by a cumbersome method
where you took a piece of DNA and you did a bunch of biochemistry
and you ran it out on a sequencing gel and recorded the DNA sequence
of what you had obtained by writing it down on a single piece of paper.
And then eventually those results would be disseminated,
in the case of baseball statistics, through a box score in a newspaper.
In the case of DNA sequences, they were printed in a paper in a journal.
But it wasn't really until those statistics and that sequencing data became systematically organized
that people began to appreciate just how powerful they could be.
Right, it was some time in the mid-1970s, after almost a century of collecting baseball statistics
and nearly a decade of collecting DNA sequence data,
that visionaries in both fields realized that it would be valuable to put this information all together
in one place where you could look up the past performance of any baseball player,
or you could look up the DNA sequence of any gene whose sequence had been determined.
And so what we are going to talk about today is the way that we use those databases
to try and make some predictions about baseball
players, in this case, how many homeruns Dustin Pedroia might hit,
and the functions of different proteins.
So let's start with an example from baseball. Say I have a player;
in this case we are going to look at Boston Red Sox second baseman Dustin "The Laser Show" Pedroia.
We have a collection of his statistics over his career.
He came up to the majors in 2006. We know how many at bats he had, how many homeruns he hit,
and how many stolen bases he had and so forth.
We can take that statistical profile and search it against a database
of known players, of previous players, who are older than Pedroia,
and look for players who had similar statistical profiles up to the age of 27,
which is how old Dustin Pedroia was last year.
And when we do that we find a collection of second basemen
who can roughly be described as slugging second basemen.
This includes Hall of Famer and Yankees great Tony Lazzeri; Robinson Cano,
a contemporary New York Yankee who is just a few years older than Dustin Pedroia;
Yankee scrub Chuck Knoblauch;
and former Montreal Expo Jose Vidro.
So we know that those players had similar performances up to the age of 27,
and because they are older than Dustin Pedroia,
we also know what they did when they turned 28.
We can flip over their baseball cards
and look at their performance at age 28, and from that
we can see, for example, that Jose Vidro hit 15 homeruns when he was 28.
Tony Lazzeri also hit 15 homeruns.
Chuck Knoblauch only hit 9, and Robinson Cano hit 28.
So it is a range, the range is from 9 to 28,
but we can see that essentially the average second baseman with a performance
like Dustin Pedroia's up to the age of 27 hit around 18 homeruns.
So we can make the prediction that Dustin Pedroia is going to hit 18 homeruns in 2012.
And you know, since we have a lot of data like this,
we know that these predictions are actually fairly accurate,
that using historical data we can make real, meaningful, and accurate predictions
about baseball player performance in the future.
Although, of course, they are not perfect.
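The arithmetic behind a projection like this can be sketched in a few lines of Python. This is only an illustration built from the four age-28 homerun totals quoted above; the function name `project_from_comparables` is made up for this sketch, and a plain unweighted average is an assumption about the method. The four named players average a little under 17, so the figure of about 18 quoted above may reflect a larger or weighted set of comparable players.

```python
from statistics import mean

# Age-28 homerun totals for the comparable players, as quoted in the talk.
AGE_28_HOME_RUNS = {
    "Jose Vidro": 15,
    "Tony Lazzeri": 15,
    "Chuck Knoblauch": 9,
    "Robinson Cano": 28,
}


def project_from_comparables(totals):
    """Return (low, high, average) over the comparable players' totals.

    Hypothetical helper: an unweighted mean is assumed here, not confirmed
    as the method the speakers actually used.
    """
    values = list(totals.values())
    return min(values), max(values), mean(values)


low, high, average = project_from_comparables(AGE_28_HOME_RUNS)
print(f"range: {low} to {high}, unweighted average: {average:.2f}")
```

Reporting the range alongside the average keeps the spread of outcomes visible, which matters because the comparable players differ so much from one another.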
And we use a similar logic to ask questions about proteins.
In this case we have the sequence of a protein,
all of the amino acids that make it up,
and we want to ask the question, "What does this protein do?
What does it look like? And is there anything that is weird about its function?"
So we take that protein and we search it against a sequence database,
using a computer program called BLAST
and out of the millions of sequences that we have for proteins
we find the ones that match up best.
And we can take those proteins that match up best to our protein of interest
and we can align them against each other and look for
regions of the proteins that have
very similar amino acid sequences.
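To make the idea of "matching up best" concrete, here is a toy Python sketch that ranks database entries by percent identity to a query. It is a deliberately simplified stand-in and not how BLAST works: BLAST uses heuristic local alignment with substitution scoring, while this sketch assumes the sequences are already aligned and the same length. The sequences and the names `percent_identity` and `best_matches` are invented for illustration.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def best_matches(query, database, top=2):
    """Rank database entries by identity to the query, best first."""
    scored = [(percent_identity(query, seq), name) for name, seq in database.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top]]


# Hypothetical toy sequences, invented for this sketch (not real proteins).
QUERY = "MKTAYIAKQR"
DATABASE = {
    "family_member_1": "MKTAYIAKQH",
    "family_member_2": "MKSAYLAKQH",
    "unrelated": "GGSPLWVDNE",
}

print(best_matches(QUERY, DATABASE))
```

With these toy inputs the two family members score 90% and 70% identity and the unrelated sequence scores 0%, so only the family members are returned.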
And just like we were making predictions about baseball players
based on the performance of very similar baseball players,
we can do a similar thing with proteins.
So we can say for example, that this protein sequence is very similar
to this related family member, for which we know the three dimensional structure.
And then we can infer that the sequence we were originally interested in
probably has a very similar structure.
Moreover, we can look at what is different about the sequence that we are interested in
versus all of its other family members, and one of the things that is different
is that it contains a mutation at a very conserved
position, so all of the other family members have a histidine at a certain position,
whereas this protein sequence has an arginine.
And that mutation actually has a pretty dramatic effect.
Individuals who carry that mutation are very highly prone to
developing a serious disease, amyotrophic lateral sclerosis.
Many of you will know this as Lou Gehrig's disease,
which affected the great slugging first baseman of the New York Yankees in the '20s and '30s.
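Spotting a change at a conserved position can be sketched in Python as follows. This is a minimal illustration that assumes the sequences are already aligned to the same length; the aligned fragments are invented for this sketch (they are not the real protein family), and `conserved_differences` is a hypothetical helper name.

```python
def conserved_differences(query, family):
    """Find columns where every family member agrees but the query differs.

    All sequences must already be aligned to the same length.
    Returns (position, family_residue, query_residue) tuples, 1-based.
    """
    if any(len(seq) != len(query) for seq in family):
        raise ValueError("all sequences must have the same aligned length")
    hits = []
    for i, query_residue in enumerate(query):
        column = {seq[i] for seq in family}
        if len(column) == 1:  # position is fully conserved in the family
            (family_residue,) = column
            if family_residue != query_residue:
                hits.append((i + 1, family_residue, query_residue))
    return hits


# Hypothetical aligned fragments, invented for this sketch.
FAMILY = ["GDHVTK", "GDHITK", "GEHVTK"]
QUERY = "GDRVTK"

print(conserved_differences(QUERY, FAMILY))
```

Here the only flagged column is position 3, where every family member has a histidine (H) and the query has an arginine (R). Positions 2 and 4 vary within the family, so they are not treated as conserved.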
So both of these prediction problems,
trying to figure out the future performance of Dustin Pedroia
and trying to learn something about a protein, follow a very similar logic.
We have in one case the back of Dustin Pedroia's baseball card,
and in the other case, a sequence of amino acids.
And we use that information about our baseball player or our protein to search
against a database of all known baseball players, or all known amino acid sequences,
to find that small subset of them that look very similar
to the baseball player or protein that we are interested in.
And then we take that collection of matches and try to build up a model that takes into account
something about the history of how those players performed in their subsequent years
or something about the evolution that led to the protein having specific sequences in specific places.
And we also then can try to infer unique features about the protein,
for example, the mutation that causes ALS, which might be similar to
a player who has a gap in his career due to maybe a broken leg or something for a year.
Right, and in both cases we are constantly comparing the predictions that we make
to what we subsequently learn,
either because we do additional experiments in the laboratory or because another baseball season passes.
And so, we can use the data collected every year to improve our ability to make predictions.
And not just improve our ability to make predictions, but learn things in the process.
We can learn what it is that is likely to make a baseball player's career last longer,
or lets him hit more homeruns,
and we can learn what it is that makes a protein function in a particular way.
And it is often from the players or the proteins that conform the least to our expectations
that we end up learning the most interesting stuff.
So we are constantly trying to refine these models
both in baseball statistics and in computational biology.
I think one of the reasons that there is a similarity in the people
who are interested in both of these things is that in both cases these predictions are imperfect.
They are messy. They are working in a world that defies
any effort to describe it from a simple set of models and first principles.
And there is a further comparison in that both of these fields
are experiencing a huge explosion of really interesting data.
We have basically had the same underlying type of data for more than a century
in the case of baseball,
and for a few decades in the case of genetics.
We've had the box score and the gene sequence.
And you know, we had larger and larger sets of them, but we basically
have been using the same fundamental type of information.
But this is all changing.
So now, every single major league pitch is actually tracked from the time it leaves the pitcher's hand
until it is either hit or crosses the plate,
and we can try to learn something from the velocity of the pitch,
or the pitcher's arm angle, or the relationship between these things,
to try to make better predictions in the future about player performance
and about the game of baseball itself.
And in the case of DNA sequence data
a new generation of DNA sequencing machines has increased our ability to sequence
DNA by five or six orders of magnitude.
So we can now essentially take any sample we can collect,
whether it is a well-studied laboratory organism,
or samples scraped from the bottom of a cow living in the fields of Montana,
or the spit of a baseball player,
and sequence essentially every piece of DNA that lives inside of those samples.
But just like we are still trying to figure out exactly what we can learn
by collecting all of the PITCHf/x data in one place,
we are still trying to figure out exactly what we can learn by sequencing
every piece of DNA that exists anywhere on the planet.
Now we are both extremely confident that we are going to learn something from that,
but learning something from that data is going to require developing a whole new set of tools,
and we expect that the same types of tools that are developed to study the PITCHf/x data
in baseball are going to use the same types of underlying algorithms
and logic as the tools that we are developing to analyze new sequencing data.
And that's why both Jamie and I, whose careers and futures depend upon
thinking of clever ways of using high throughput DNA sequence data
still keep our eyes on what is going on with the analysis
of PITCHf/x data and other baseball data.
And it is why we think that people who are interested in becoming computational biologists...
I think everywhere you look this is pitched as the emerging field, right?
Biology is becoming a data rich field. There are lots of experiments to do.
And in fact, you know, every time I walk down the street with Mike, we get asked this question,
"How do you become such a famous computational biologist?"
And after we sign autographs for these people, we want to give them some useful kind of advice,
and tell them, you know,
what kinds of books they should be reading. But, you know,
it is important to say you shouldn't necessarily be opening a molecular biology textbook,
or a programming textbook, or even a math textbook.
You should be opening up this. This is the book.
This is the book that lays out the logic of the kind of baseball statistical analysis
that we have been talking about.
In fact, it is something that Jamie and I used as a textbook
in a class we taught at UC Berkeley in which we tried
to teach freshmen how to think like a computational biologist
by teaching them how to analyze baseball statistics.
And the other book that is up there is Moneyball by Michael Lewis,
which is now a motion picture starring Brad Pitt, that lays out how this type of thinking was used
by the front office of the Oakland Athletics in the early 2000s.
And for those Hollywood producers out there, Jamie and I are available to star in GenomeBall. Yeah...