David Botstein speaking at the GenBank 25th Anniversary

Uploaded by NCBINLM on 03.05.2010

I have to say that I want to congratulate NCBI, especially Jim and David for having done that most strange
and bizarre of things, produced a government business that actually works and does more than was asked of it.
Okay, it really has been a pleasure to get into this sort of genome business with this kind of backup.
And in fact my talk is not about GenBank. My talk could have been called 'Beyond GenBank.'
Because as you have ably heard, most recently from Jim, the business about what the genes do
was not taken as the remit of GenBank or any of the originating people. For some reason, and this is
something I could never understand, the early advocates of sequencing seemed truly to believe that if
you knew the sequence of everything, you would understand how it all worked. Okay? And I never believed that.
Geneticists were sort of very restricted in their enthusiasm for this point of view to begin with.
This was a view they attributed to biochemists, I should say, but as Rich told you this morning,
the real biochemists didn't believe it either and certainly they don't believe it now, and so the whole question of
how we put biological understanding into the picture is really the theme of my talk. All of this
presupposes that GenBank exists and that it is as excellent as it is. So I want to do a little bit of
history of genomics from the point of view of a simple geneticist. Okay? First of all genomics is not a new word.
Genomics was introduced - this is from the OED - by Winkler in 1920 or something like that.
Okay, Winkler was an embryologist. What it meant then is pretty much what it means now.
And there is some difference in this business of prediction. In the OED I found this
1970 Scientific American article about how many genes they expected to find, and I always imagined
that this must have been a neurobiologist, because they've always had trouble with how you can
think with a finite number of genes. Now the history is basically the history of genetics. I put it into
a couple of slides. At first we were worried about the mechanics of inheritance, and this is now
all in the textbooks. Then we were worried about how the DNA information is expressed,
and we ended up with the ineffable central dogma. This is basically orthogonally arranged,
the same order that Jim had in his schema of GenBank going from the bottom to the top. Okay?
And that may be why he as a biologist sounded natural, because the central dogma was there.
In the 1960s, of course, this began to change and a few pioneers, the Phage Group, became interested
in the question of what do those genes do, and they did this by using conditional lethal mutants,
and the theory was that if you got a conditional lethal, it must be essential, and if it's essential and
if the method by which you got conditional lethals was general, you should be able to get such
mutants in any gene, and if you saturated the genome with them, you could find out all the
functional parts of the genome. Notice that we're going from the top to the bottom here, not
from the bottom to the top. Okay? And the phages were attractive, of course, because it was
clear already that they had a very small number of genes to worry about, and this is the first
genome paper that I could find. You can see it's a genome paper, very many authors, different
labs, different continents, and the central author is not the first or last author. It's Edgar. Okay?
And on the left you see a portrait of the organism, and on the right you see the first genome database.
On the inside of the ring is the map order done by mapping, and on the outside is a little ideogram
and it is that ideogram that tells you what goes wrong when the gene on the inside is mutant.
Okay? That's the essential pieces, and in fact this was such a powerful idea that a number of other
pioneers began to do the same kind of thing with simple eukaryotes that also had, arguably, smaller
numbers of genes, and they were chosen also because they were easy to manipulate
and you could do experiments in them, internally to them without reference to outside experiments.
And the founders of these communities had a philosophy of science best summarized by Delbrück.
The idea was that we all worked together, we compete with each other, but we share the same information.
This was not universally the case in experimental biology. This was something quite remarkable
even for the time, and these of course became the model organisms, and when the time came
they were the ones who were sequenced first. So I want then to sort of summarize what the
geneticist's view of the essential part of understanding is. Geneticists distinguish between structure,
the genotype. That means basically genetic maps. And the ultimate in the genetic map is the
sequence of the DNA especially if the mutants are on there. And the second thing they're interested in
and actually arguably more interested in is function. They're pleased to have named this phenotype,
and that is the biological roles of whatever those elements on the genes that are inherited are.
And they can be genes, classically defined sequence motifs, individual bases, and historically
these were inferred from the biological consequences of mutations - white eyes or something.
Today the inference is still at the end of the day from knockout mice. Okay, and things like that.
And these data unlike the data on the top part of this chart, these data are much more difficult
to summarize. Okay? Lists of genes or annotations and references to the literature, and these things
are stored in databases by and large not so much at NCBI, although there are some, but in other
databases that are designed to understand the information that is above the sequence, if you like,
the phenotype level as opposed to the genotype level information. I saw this at the Museum of Natural History.
They had a Darwin exhibit. He made this sketch in his notebook just after he got off the Beagle,
and I think it's one of the great premature insights, you know, that is pretty close to the way it
actually looks, and this idea, of course, has animated biology for a long time, but in molecular biology
it became relevant relatively late. Here are two viruses. This is phage P22. This is phage lambda.
They grow on what was regarded as different species of bacteria. They look different, but it turns out
they have similar genetic maps. When you look at the maps and you include what the genes do,
you discover a striking regularity, namely all the regulatory stuff to a huge level of detail is the same,
and then there's DNA and control and lysis and heads and tails and so on. Okay? And it turned out
that you can make hybrids between these two viruses by genetic manipulations that only the ancients knew.
And when you do that, you get viable things out of it. So these things have interchangeable parts,
different species allegedly. This is an early genetic map of E. coli, and again the E. coli people
named the genes in such a way that you had some idea of what they did, and I show you this
to show you that in fact there is a logical progression from map to a higher structure map to the sequence.
This is Blattner's annotation of one or two minutes, and what you see is, yeah, there's a lot of
stuff in there, but basically this is just the ultimate in genetic maps from the functional point of view.
So the sequences were sequenced and the sequences went into GenBank, and there was a lot of
back and forth about whether the whole sequence could go into GenBank but only 350 kb,
and you heard all about that. I'm going to skip that. And the clear thing which had been prefigured
in earlier literature is that for these eukaryotes the basic cellular functions are carried out by
proteins whose structure and function are conserved. You can follow them all the way back to the beginning
in many cases. That doesn't mean that their regulation is conserved. It doesn't mean a lot of things,
but it does mean that a lot of the function is conserved, and I'll show you some example of that.
Here's a map of Saccharomyces cerevisiae, well before sequencing and like E. coli one - I won't show
you a zillion slides - it comes out the same, and of course humans it was more complicated,
and one had to use the restriction polymorphisms and other DNA polymorphisms to figure out mapping
and eventually one got maps that looked like this, but at the end of the day, we got sequences again.
Now what do those sequences mean in the human? Well, in the human the idea was that there are some
diseases like these for which it was obvious from the structure of the inheritance patterns that there
are likely to be single Mendelian genes, and there's a list of these here, and so the idea is very simple.
If you find the right gene, you will find a correlation between the defect in that gene and the phenotype,
and it will be the only gene for which that is true. And that turned out to be the case using these methods.
And at the end of the day, you ended up just like in E. coli and just like in phage, and just like in yeast
you ended up with a sequence. This happens to be the BRCA2 gene, comes off the NCBI website.
And you see there's splicing and there's other various details, but at the end of the day it's a gene and a protein.
And interestingly, most of the - we did this review awhile ago - most of the genes that cause
Mendelian diseases do change proteins, RNAs and regulation notwithstanding, only at that time
only 213 were not visibly altering a protein. Now complex inheritance on the other hand is
completely different story. Many of us, me included, very vociferously I should say, thought it
was just going to be a couple of genes, the three genes, the combinatorial complexity and so
forth and so on was hiding the fact that this was really simple proposition. Nothing turns out to
be farther from the truth, and I think a fair way to put this is to say that although we keep finding
statistically significant undoubted reproducible findings, what they amount to in terms of
explaining the disease is very small, and we don't understand why, and in fact one of the great
open questions in my mind is having a robust theory of how genes interact to produce phenotypes
and complex inheritance. Here's an example of failure. Tom Marr was involved in this, along with a long
cast of thousands, and we got lots and lots of exactly the right families to look at bipolar disease,
high heritability, simple apparent inheritance, and the conclusion was the results exclude monogenic
models and make it unlikely that two genes account for the disease in this sample, two independently
segregating genes. The answer really is that we don't have a clue, and it was really hard for the
psychiatrists to accept this. It took a very long time to get this published because I wanted to
get all the authors in the tent. There's a lot of statistical education involved. Okay. If we're going to
talk seriously about sequence and function, we have to confront an actual sequence. Okay?
This is a sequence. All right? What can you tell about it? Virtually nothing. Okay? Some of you
in the audience either have heard me before or know how I've spent my career in some detail,
and will guess what this sequence is. But there's no internal information unless you have a terabyte
memory that can tell you what this sequence is. Okay? However, if you do a sequence comparison
you compare it to - you get a human hit, and that human hit - this is the last hit for this sequence,
and it's actin, and the important thing, the burden of my talk here is how do I know it's actin.
By looking at the sequence, I could tell nothing. How do I know it's actin, because at the end
of a chain of homology searches, there is a sequence, a protein sequence on which somebody
did an experiment. In this case, it's a rabbit sequence, and somebody figured out that this gene
is the gene that hydrolyzes the ATP when your muscles move. It's actin. Right? And what you
need to remember about all this sequenceology, at least in my humble opinion is at the end of
every trail is an experiment, just like Rich Roberts said this morning. Having a sequence gives
you a hint maybe, tells you where to look but at the end of the day there has to be an experiment.
It doesn't have to be a biochemical experiment but it has to be an experiment. Okay? In 1988
in response to a particularly ignorant article in the obscure journal, Science, Gary Fink and I
were moved to write this article, and this has a sort of a spell-like diagram in it, and basically
it shows you at some epistemological level what the relationship is between genes and proteins
and function, and how these things are inferred with any comfort. And the article here was to say
first of all that the previous article was mistaken when it thought that yeast was a prokaryote.
And it was even more mistaken when it thought that nothing would ever be learned from yeast.
This is in the late 1980s I should say. And the idea of this was that in yeast, you can do the experiments
internally in a single system to try to figure out what a particular protein does with maybe the
greatest ease for any eukaryote. That was the burden of the argument. Okay, so the kind of
thing - this is a table from that paper - the kind of thing that we had in mind was something like this.
Now I've updated the table. You can go back to the original Science paper, where only about half as many function things
were filled in. But basically this is the story as it is as of about ten years ago. There's a bunch of
conserved proteins. These proteins are sorted by their degree of amino acid identity, and what
function means here is that when you blow away the yeast gene and you get a phenotype,
usually death, and you replace the coding sequence with the mammalian or the human coding sequence,
you get the phenotype reversed. That's like saying in the car example where if you have a Mercedes
over here and a Model T over there, Model T doesn't run, you reach into the Mercedes, grab a part,
slap it into the Model T and drive away. Of course, in the 120 or so years of evolution of motor cars,
it's hopeless. Okay? But this is two billion or so years. I don't know what the taxonomists say about
the distance between human and yeast, but this works much of the time especially for these high ranking hits.
Some of these genes are being used in ways that are completely dissimilar. Even actin after all is
essential in functions the yeast doesn't have, like thinking. I take an inhibitor of this one, okay, and
in order to reduce my cholesterol, but the truth of the matter is that yeast don't even make cholesterol,
they make some other steroid, but nature doesn't reinvent. And that's really what this is about. Okay?
So this was a huge favor to the biology community because we could do an experiment over here in yeast,
get an answer that was an even much broader hint about what the protein is doing over here in my cancer cell.
Of course, I do this just to emphasize not to put down the GenBank-like databases, but to show you that
they don't mention this here. There are all these tracks, and there's all this homology but it never tells you
what the thing does. For that you've got to go to the literature, and that was the most brilliant thing
that NCBI did. They connected in a really robust way the sequences with what is known about the sequences
through PubMed and all the rest of that sort of stuff, and when we get to PubMed Central, and you
can search on virtually anything, that's going to become even more robust, and that's in fact
when I say the nature of biological understanding, that's how we work. We get a sequence.
The sequence leads us around to papers and lots of different taxa that tell us what RAS does in
all of these different circumstances because somebody did an experiment. Here is another.
Now the next issue that I want to address is that, of course, proteins don't work alone.
Proteins work in great gangs of assemblies. Actin is - which I happen to work on - a lot of these examples
are taken from my own work, because I know it very well, and not because I couldn't make examples
from 1,000 other people's work that's better than mine. But, you know, I'm me. Okay. So actin
which I happen to work on has very, very many ligands, other proteins that bind to it. And many of them
are essential to its function in one or another aspect. The number these days is somewhere in the 60s
of well documented ones. At this time it was more like 50. So that means that we have a big problem.
We now have biology as an information science. Information why? Because of the hugeness of the genome.
Information, because we can't function without a computer, and with the combinatorial complexity
and the relationships that we have to remember. It's hopeless even if you could sequence and assemble
without a computer. It's hopeless to think about the biology without a computer today. The combinatorial
complexity is huge, and finally there is this problem that there's way more data there than we understand.
We have as biologists what I like to call a NASA problem. NASA took all these pictures all over
and they're sitting as electrons in some computer somewhere, maybe Los Alamos for all I know,
and no one ever looks at them, okay, because we have no way to make any sense of them, people sample them.
And we have that kind of problem. We have much more information than we understand, and part of
that problem is that we're pretty good at computing, but we're not very good at communicating
between the computer and ourselves. So to address this problem and more or less anticipating this problem,
we started to think for the yeast genome long before the genome was actually sequenced, we began
to think about the question of could we organize the yeast data so that it could be computed upon,
that we could search and find with reliability relevant data with a sensible query. And we started that
when we started sequencing yeast, and we did lots of work using those data that were used to make the maps.
I showed you the Mortimer map, and before that I showed you the coli maps, and those in fact
functioned as genome databases before there were genome databases. If you found a gene was mutant
in your experiment, you went to look there to see if anyone had mapped that gene or anything near it
and what was known about it and the guide to the literature, it had all those elements, and so we
decided to computerize that. And we decided first of all as did NCBI that the place would be
run by biologists, not people who knew how to run the computers, although it was important to run the computers,
and we were very much taken with the fact that the point of this was to talk to the working biologists.
This was not a compilation of data that could be taken down by command line by a bunch of experts
and crunched upon, although that was an important function. This was for a simple query, and so
I used to go around at the time that GenBank was moving, and there were other databases that
shall remain nameless, using a test, and the test was I asked each database what is hemoglobin
or what do you know about hemoglobin or what do you know about HB2B, okay, and if I didn't get
an answer, I declared it useless. Okay? We, of course, didn't do that, and NCBI didn't do that.
That whole thing - Jim's talk was masterful in showing you how much they were listening to what people
needed and wanted. So what we ended up doing was making what is really a mini version of NCBI,
but focused on something entirely different which is what do the genes do. This is usage up to
about - I don't know - '05 or something. Okay, so as soon as a lot of sequences became available
it became clear that we had a different kind of problem and that different kind of problem was
beautifully put by my friend Michael Ashburner. Biologists would rather share a toothbrush than a gene name.
And this went double across species. Okay? So you ended up with situations like this, CDC25 is son
of sevenless in flies. Talk about useless naming compared to, let's say, the E. coli people, who look like heroes in this regard.
Right? This thing turns out to be RAS. Right? Most of you unless you're experts had no idea
that either of these things is orthodox RAS. Right? So the idea was to produce a controlled vocabulary
to describe the gene products and their functions in such a way that if a yeast biologist queried -
blasted the world, discovered that Drosophila had a gene that the thing they got back made
some biological sense to them. Okay? That was the idea. Okay, so as soon as we started thinking
about this seriously - this is the great thing about computerization, it sometimes makes you think
clearly about the most basic things when you never had to before, and trying to get a computer
to remember what a function is, it became immediately clear to us that we had to distinguish at least
three aspects of function. One is the process as defined here. The other is the actual molecular function,
ATPase or something, and the last, of course, is the cellular component or where. And again to
go back to the car analogy, if your Model T isn't running and you push it out of the way, and you find
a bolt on the ground that came from your car, and you look at the bolt and you ask Swiss-Prot what is it
and Swiss-Prot says fastener, four inch - actually, four centimeters if it's Swiss-Prot. Okay?
And so the question is how much are you better off knowing fastener four inch. You can see that. Right?
It will give you the pI. It will give you the molecular weight and stuff like that. That's not so useful.
Okay? What you really want to know is what did it do for the car. Well, the molecular function may be
fastener, but the cellular component could be the steering gear. Okay? Or it could hold on the vanity mirror
on the dashboard. Okay? The biological process could be locomotion or primping. Right? Okay?
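[Editor's illustration] The three aspects of function described here can be sketched as a small record type. This is a minimal sketch built from the car analogy only: the class name, field names, and the two example records are hypothetical and are not real GO terms or any real database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One item described along the three aspects named in the talk."""
    name: str
    molecular_function: str   # what it does at the molecular level
    biological_process: str   # the larger goal it contributes to
    cellular_component: str   # where it does it


# Hypothetical records taken from the car analogy, not real annotations.
ANNOTATIONS = [
    Annotation("bolt_A", "fastener", "locomotion", "steering gear"),
    Annotation("bolt_B", "fastener", "primping", "vanity mirror"),
]


def find_by(aspect: str, term: str) -> list:
    """Return the names of all items whose given aspect equals the term."""
    return [a.name for a in ANNOTATIONS if getattr(a, aspect) == term]
```

Querying by molecular function alone returns both bolts, which is the "fastener, four inch" problem; querying by process or component separates them.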
All right, and so you want this to be a structured vocabulary, and of course GO, incomplete and imperfect
as it is has helped a lot in the bioinformatics, and it has helped hugely in the endeavor of individual
biologists who want to know what's known. Okay. So I think I'm going to move on here if my machine -
Yeah, and then of course we started to build tools that allow you to find the GO terms when you do some
high throughput experiment or other, and that has helped a lot. Now so the intellectual impact of the genome
has been to a very large extent one of unification. People used to laugh at this, including my friend
who wrote this silly Science article that Fink got excited about, and it has also opened up a new frontier,
and that new frontier is at this interaction level because the genome and the computer make it possible
to do experiments at a different level than was possible before. And I'll give you an anecdote.
When I was an undergraduate, I did undergraduate research. I worked in the lab of a great microbiologist,
Boris Magasanik, and we were taking some organism whose name has been changed many times,
and I don't know what it's called now. With all this proliferation of species, Jim, why don't they
keep the old names the same? Anyway, this thing was called Aerobacter aerogenes. It's called something
else now. Anyway, and the idea was that we soaked this organism as it was growing in tryptophan,
and asked what kind of new enzymes we could find, and the new enzyme we were looking for
of course was something that did away with tryptophan. We found something like that, and then Boris
had me in his office, he said it's time to figure out whether this is specific, this induction by tryptophan,
of this tryptophanase or whatever it was. And I said, yeah, how do we do that, and he said, well,
just measure a few other enzymes. So I measured ornithine transcarbamylase, and I measured some other
NADPH-using enzyme, and those weren't induced, and so we concluded it was specific. Of course,
this experiment in the modern day is completely incompetent. It didn't make a lot of sense at the time,
I have to say, but it was the best you could do. Now you can do much better. Before I show you any of this,
I want to deliver another message, and that message is about ourselves. In 1940 or so, a professor
at Columbia, Taves, had a bunch of people in a dark room, undergraduates at Columbia, and they were
taking his course. And they were shown in the dark a number of spots of light for a second or less,
and then they were asked to write down, A, how many did they see, and B, how sure were they.
And what you see here is the confidence with which they saw how many spots of light, and you see
this hypergeometric fall in confidence after the number six. They were great at six, and then it went away.
Now in fact the rest of this paper is saying that even this is too confident. We're worse than it appears.
The number of genes in the human genome is, I don't know, but it's in the tens of thousands
much bigger than six. And that illustrates my problem. You get a result that involves thousands of genes
and how do you understand it when your brain, which is good at six and not bigger numbers - so
you have to compute somehow first. You have to do something to show you the result, and this is now,
I think, a major challenge, not just remembering the data but presenting it in a way that means something.
So microarray is the example. Microarray data starts out as numbers, gets reduced into a certain
kind of picture which isn't very useful and then Mike Eisen figured out this picture, which is useful.
And what you're looking at here is not a number but a color, and the color is determined by how
far away you are from the median, for the ratio of expression between the red and the green. And this thing
has been clustered, and so you can see that this set all have in common that they go from low to high.
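[Editor's illustration] The idea of showing a color instead of a number can be sketched as follows. This is a minimal sketch, not the actual display code used for these figures: it assumes the red/green ratio has already been centered, maps its log2 value onto the blue-to-yellow scale mentioned for colorblind viewers, and uses a hypothetical `saturation` parameter (the log2 ratio at which the color is fully bright).

```python
import math


def log_ratio(red: float, green: float) -> float:
    """log2 of the red/green expression ratio; 0 means no change."""
    return math.log2(red / green)


def ratio_to_rgb(red: float, green: float, saturation: float = 3.0):
    """Map a log2 ratio to an (R, G, B) tuple: blue = low, black = 0, yellow = high.

    `saturation` is an assumed display parameter: ratios beyond
    2**saturation in either direction are clipped to full brightness.
    """
    x = max(-1.0, min(1.0, log_ratio(red, green) / saturation))
    level = round(255 * abs(x))
    if x >= 0:
        return (level, level, 0)   # shades of yellow for high ratios
    return (0, 0, level)           # shades of blue for low ratios
```

With the default setting, an 8-fold increase maps to full yellow, an 8-fold decrease to full blue, and equal signals to black.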
And those of you who are colorblind, you can see from blue to yellow, low to high. Okay, and you can
actually remember what all the genes are even if there are 10,000 of them, because you see this
on a computer screen, and you can use hyperlinks, and if you have GO and you have something
which is called GO-Slim, which just gives you a couple of words for each function, you can actually
get some idea of the function, and if all of these genes all say ribosomal - all say translation,
you say, gee, that's a biological result. Okay. We can do cell cycle experiment in which we synchronize
cells and what you see is that if you compare the synchronized to the unsynchronized, you can
calculate some kind of Fourier score, and you can see these stripes. Okay? But suppose I didn't have the stripes.
Suppose I didn't have the idea, the prejudice, about that. Could I just look at internal data?
And in this case you can look - you can cluster the data on the basis of some measure of similarity,
and when you do that, you get a pattern here. This is a Venn diagram showing the Pearson correlation coefficients
and what you see here is that the stripes are visible. So you would have discovered the stripe
without the prejudice. But even more interesting is if you look at the annotations for each of these
subsets here, these deep clusters here, you see that they are functionally related according to GO.
And if you read all the papers, you find that they are functionally related according to all the papers,
because GO is bad, but not that bad. Okay? So now you can do a different kind of experiment.
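[Editor's illustration] Clustering on "some measure of similarity" can be sketched with the Pearson correlation coefficient named above. This is a deliberately simplified sketch: it links genes whose profiles correlate above a threshold and reports the connected groups, which is a stand-in for, not a reproduction of, the hierarchical clustering behind the figure. The gene names, expression values, and the 0.9 threshold are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_groups(profiles, threshold=0.9):
    """Group genes linked by pairwise correlation >= threshold (union-find)."""
    names = list(profiles)
    parent = {g: g for g in names}

    def find(g):
        # Follow parent links to the group representative.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for g in names:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Made-up log-ratio profiles over four time points.
PROFILES = {
    "geneA": [-2.0, -1.0, 1.0, 2.0],
    "geneB": [-1.0, -0.5, 0.5, 1.0],
    "geneC": [2.0, 1.0, -1.0, -2.0],
    "geneD": [1.9, 1.2, -0.8, -2.1],
}
```

On these toy profiles the rising genes (A, B) and the falling genes (C, D) come out as two separate groups without any prior idea of what pattern to look for.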
Different kind of experiment goes something like this. I have a hypothesis - genes whose patterns
of expression are strongly correlated are likely to function together. Okay? So the test is, I'm going to
look at all these genes, and I'm going to ask if they are over represented in a particular GO category.
Hypothesis 2, same thing for position. Okay? Both tests consist of just doing a simple statistical test,
and when I do that test - I did this for a subset of this particular group for clarity. This is the GO graphical
output. The different colors mean different degrees of enrichment, and here is three of these genes,
a prior probability - I mean the bootstrap probability that they would have gotten there by chance
is between 1 in 1,000,000 and 1 in 10,000, and for the other case it's between 1 in 100,000,000
and 1 in 1,000,000 and so we have a kind of inference that wasn't possible before, and it's very general
I should say. Here you can look at the patterns of gene expression with my collaborator, Pat Brown.
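[Editor's illustration] The over-representation test described a moment ago can be sketched as a hypergeometric tail probability. Note the assumption: the talk mentions a bootstrap probability, so the calculation below is a commonly used substitute and not necessarily the one behind the slide. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes in the genome being considered
    annotated:  number of those genes carrying the GO category of interest
    sample:     number of genes in the co-expressed cluster
    hits:       number of cluster genes that carry the category
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For example, `enrichment_p_value(6000, 100, 20, 10)` asks how likely it is that 10 of 20 clustered genes fall in a category that covers only 100 of 6000 genes; the numbers are illustrative only.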
We had a lab that did this kind of stuff for about a decade, and you can see all kinds of things,
and I want you to notice two things. You can see that they're all different just by looking.
The calculation has been done for you, but presented in a way, in a Taves-proof way. You're not
looking at any numbers. You're not doing any calculations in your head. You can just see it, because
we are evolved to see things coming and going. You can do the same thing for tumors, and when you do
for tumors, you get a very good result right away. All the breast tumors, the purple ones, are all together,
which means that they're more similar to each other than they are to other tumors, and so it's not very
likely that you're going to have an ovarian tumor that looks like a breast tumor. You can do more than that.
You can cluster now also the patterns of gene expression among the samples, and then you get arguably
four subclasses of breast cancer. And you can do this again on somebody else's dataset, so it's beginning
to look very much like sequence information, this information. It's worth keeping. It's worth cataloguing.
It's worth annotating. It's worth having in GenBank or some NCBI thing, which there is, and because
I can now take somebody else's data if they're available - and I'll come back to that in a minute - and then see
that the red ones are still together, the blue ones are still together, the light blue ones are still together,
and the magenta ones would be together if there were more than six of them. And this has clinical
consequences as well as a biological consequence, because in both datasets, the different patterns
of breast cancer are characteristic of tumors with different lethalities. So now I can make a different kind of
genomic hypothesis test. Now this is really a genetics experiment. The four breast cancer types,
I argue, are different diseases. If so, then women who inherit one of the genes that predisposes to cancer
might, at least by Occam's razor and by logic, be expected to have the same tumor subtype,
and so we can just do the same kind of thing that we did with [indiscernible], and what you see is that
the BRCA1 carriers are all the dark red type, and the BRCA2 carriers - this is still an anecdote in this dataset,
because we don't have the genotypes of all of them, because BRCA2 wasn't - it's a long story.
But you get the picture. Okay? So my last post-genome anecdote has to do with understanding -
I've got five minutes, okay - really what the classical - now it's called systems biology. This is not
a new idea like all the others. Ira Herskowitz made this diagram summarizing decades of yeast work,
of phage lambda work, and this is all still pretty much true. What's interesting about this diagram
besides its complexity is the fact that all of these relationships are unquantitative, they're digital,
goes up, goes down, and you can do new kinds of data like mass spectrometry. It's now really possible
to measure compounds, and if we can't understand metabolism at the system level, what will we
be able to understand? You can do experiments in which you can follow not only all the genes,
but all the metabolites, and when you do that, you get sensible results. If you use the same kind of
bioinformatic approach on the amounts of metabolite under - this is starving E. coli for carbon or nitrogen.
This is starving yeast for carbon or nitrogen. And you get some really amazing results. Notice also
that the dynamic range of this assay is 256-fold in each direction. So we're not talking about
small differences here. This is not little noise, and you can't read these. It doesn't matter because
the result here - this is now a singular value decomposition or a principal component analysis,
and what you see is that there are three major vectors, and the three major vectors are organism-dependent,
metabolite-dependent. Most of you probably thought that was going to be the major class.
Organism-independent but metabolite-dependent, that's a reasonable - but a whopping 42 percent
of the variation in this experiment is due - is independent of both metabolite and organism.
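
A minimal sketch of how a "percent of the variation" figure is read off a singular value decomposition: the share carried by each singular vector is its squared singular value divided by the sum of all squared singular values. The function name and the singular values below are hypothetical, chosen only for illustration; they are not the values from the experiment described in the talk.

```python
def variance_fractions(singular_values):
    """Return the fraction of total variance carried by each singular vector.

    The variance attributed to a component is proportional to the square
    of its singular value.
    """
    squared = [s * s for s in singular_values]
    total = sum(squared)
    if total == 0:
        raise ValueError("all singular values are zero")
    return [sq / total for sq in squared]


# Hypothetical singular values, NOT taken from the talk's dataset.
example_singular_values = [6.0, 4.0, 3.0, 1.0]
fractions = variance_fractions(example_singular_values)

for index, fraction in enumerate(fractions, start=1):
    print(f"component {index}: {fraction:.1%} of the variation")
```

In practice the singular values would come from decomposing the matrix of log-ratio measurements (metabolites by conditions); only the bookkeeping step is shown here.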
What does that mean? It means that we all live on the same metabolic plan, that the connections
no matter what you want to say about what happened, the connections in these two free-living organisms
that are millions of years apart, the connections between the metabolites are such on the building plan
that when you starve for carbon, you get a stereotyped response that both remember. We can do
experiments in which we find out things in yeast that remind us of things in tumors. It's not important
what we found, but you can see that these curves are different, and so we interpret that as being
something like the Warburg effect, which is that tumor cells waste glucose and normal cells
never do in culture or in tissue slices. And so what we did was we looked for mutants, okay, that
suppressed this effect. Because we're still geneticists, we're doing basically what Rich wants us to do,
is to investigate what's actually going on, and we do a lot of that, and we get mutants. Now because
it's the post-genome era, and we have a chip on which we can find a single mutation in the whole
genome, we don't have to go through a lot of rigmarole to find out where these mutations lie,
and the first two mutations we found were in a gene which we had never heard of before called PPM1,
and this PPM1 is part of a complex that has to do with protein phosphatase 2A, which is very highly
conserved everywhere, and as you can see it really does suppress the effect. This is the effect,
this is the mutant. You can see we're back to normal. And so why is this here? This looks like ordinary
microbiology, right? Not so, because we can now look at all these things that we got when we do
a much bigger screen, and then when we ask for all the things that we believe in, we ask where
are they in the diagram of related genes that's derived bioinformatically from all the data that's around
out of SGD by Olga Troyanskaya's group. What you see is that the gray ones are the ones we found.
Okay, we found PPM1 is in here somewhere, and we found all these guys, and if you ask who is the
center of this, it's CDC55, and if you ask who are the closest relatives by all kinds of interaction methods
to CDC55, they include something that is the protein phosphatase 2A itself, something that's involved
in mitosis, another thing involved in mitosis, something involved in entry into the cell cycle
at the very beginning, something that is the other subunit of that, spindle assembly,
phosphofructokinase, remember this is all about wasting glucose. That is the central regulator of
glycolysis. Bud emergence, which is in the middle there somewhere, and glucose repression.
So by experiment I blundered into a little knot of 50 out of the 6,000 yeast genes, 6,700 yeast genes
that are all stuck together. This is something worth spending a lot of effort to figure out. So that's
the end of my story. I want to summarize only in the following way. I think the time has come to become
much more serious about the things that are above Jim Ostell's line, above the proteins. What do the
proteins do? What do the other elements in the genome do? And only in that way are we going to
get any real purchase on what is actually happening and real purchase on how complex diseases
and gene interactions work, and to that end, we need to get much better at remembering this information
and serving it up in a way so that someone who doesn't have a degree in biophysics can understand it
or at least look at it, and we've done extremely well. NCBI has done, I think, an exemplary job
in getting us to there on the sequence level, but now it's time to spend comparable effort at least
on the higher-up level. So this is basically to credit individual people. I think Mike Eisen needs to be
remembered because this visualization of gene expression - he was really the central actor.
Kara Dolinski is a major player in the GO business and SGD business, and all of these people
that I had the pleasure of collaborating with - especially Mike Cherry. Mike Cherry is sort of - I don't know -
soul brother of Jim Ostell, same kind of thing, he got into computers when he was a graduate student
who should have been doing something else, and I was so fortunate to recruit him. I knew he
needed to do SGD; it would never have happened without Mike, and he now runs it, and it's a great thing.
And Pat Brown, of course, invented the microarrays, and we had this fantastic collaboration.
Ashburner and Judy Blake were co-conspirators in GO. I'm really a minor conspirator. Olga is sort of
one of the new bioinformatics gurus. Josh Rabinowitz does the mass spectrometry at our place,
and Mike Ashburner and Judy Blake, of course - and Anne-Lise is who I left out, was where the
breast cancer stuff came from, and so I thank you again for the pleasure and the honor of haranguing
you about what you should do next, and I'm happy to take any questions. [Applause]
Question: You said that part of the big challenge going forward is the ability to analyze
high dimensional datasets. I recently co-chaired a Keystone Biomarker Symposium where
by the end of the meeting there was this sinking feeling that the community is creating a lot of
red herrings by statistically over-fitting high dimensional datasets, and one of the examples
that was used was one of the predicted breast cancer datasets from the Dutch group.
Botstein: Right.
Question: That data was over-fit.
Botstein: Right.
Question: And when you reexamine it the effect is still there, but it's been exaggerated by 100 percent
by the over-fitting, so how do we avoid, you know, going down that path?
Botstein: Okay. First of all, you're right. In that Greek
paper where he talks about the Dutch dataset, the Dutch dataset was the second dataset we analyzed.
Okay? Our effect, which is not done by over-fitting but is done unsupervised, is completely there.
Their effect, which is completely over-fit, is not present in our dataset or any other dataset. That's the problem.
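
A hedged illustration of the kind of over-fitting discussed in this exchange, assuming it arises from using the same samples both to pick a predictor and to score it. The data below are pure random noise generated for the sketch (hypothetical, not the Dutch dataset or any real one): with many candidate features and few samples, the best feature looks predictive on the samples it was selected from, and only a held-out set shows what it is really worth.

```python
import random


def best_feature(features, labels):
    """Select the feature (optionally inverted) that best matches the labels.

    Returns (feature_index, inverted, in_sample_accuracy).
    """
    n = len(labels)
    best = (0, False, 0.0)
    for j, column in enumerate(features):
        matches = sum(1 for x, y in zip(column, labels) if x == y)
        # A feature that is mostly wrong is just as "useful" when inverted.
        for inverted, acc in ((False, matches / n), (True, (n - matches) / n)):
            if acc > best[2]:
                best = (j, inverted, acc)
    return best


def accuracy(column, labels, inverted):
    """Accuracy of one feature as a predictor of the labels."""
    hits = sum(1 for x, y in zip(column, labels) if (x != y) == inverted)
    return hits / len(labels)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_features, n_train, n_test = 2000, 20, 20

# Every feature and every label is a coin flip: there is nothing to find.
train_features = [[rng.randint(0, 1) for _ in range(n_train)] for _ in range(n_features)]
test_features = [[rng.randint(0, 1) for _ in range(n_test)] for _ in range(n_features)]
train_labels = [rng.randint(0, 1) for _ in range(n_train)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

index, inverted, in_sample = best_feature(train_features, train_labels)
held_out = accuracy(test_features[index], test_labels, inverted)

print(f"accuracy on the samples used for selection: {in_sample:.2f}")
print(f"accuracy on samples never seen before:      {held_out:.2f}")
```

The in-sample number is inflated by the selection step itself; the held-out number is the one to report.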
Okay. I'm afraid that the problem is exactly the same problem that happens when you have a teenager
and you give them a Ferrari. Okay? You have to know how to drive the damn thing at 120 miles per hour.
The fact that it will go 120 miles an hour is interesting but dangerous in the wrong hands. The problem
really is that statistics - we teach undergraduates and medical students in particular the wrong math.
They have no idea. This over-fitting is not surprising. There's endless statistical literature that says
you can't do what they did. You can't mix the same dataset and analyze it twice and so forth and so on.
Right? Which they did. Okay, so I'm afraid that what I have to say in this case is that they were
less than fully competent. Now the question you should be asking is why did the New England Journal
print the thing? That is much harder to explain, because presumably they sent it to someone who
knows some statistics. Okay? I told you that it took me ten years to publish that negative result
on manic depression, which now everybody believes. The reason was that my collaborators were
optimists, and they kept hoping that by reanalyzing the data, by dredging through the data, they would find
something that - positive to publish, and they didn't have to publish bad news. Okay? Even though,
I was happy to have it published in the Proceedings of the National Academy. That's why they
made me a member, I assume, is that I can publish things like that. Okay? And I did. Okay?
The point, however, is that there is a natural human desire to have a good result, and the whole
idea behind statistics is to be able to do that without bias, and it's somehow hard to convince
ordinary humans that multiple hypothesis testing requires a discount. That's where these guys all go wrong.
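
One common form of the "discount" for multiple hypothesis testing is the Bonferroni correction, sketched below as an assumption about what is meant; the talk does not name a specific method. The function names and the numbers (a 0.05 threshold, 20 tests) are illustrative only, and the calculation assumes independent tests with no real effect in any of them.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


alpha = 0.05   # conventional single-test threshold (illustrative)
n_tests = 20   # number of hypotheses tried on the same data (illustrative)

uncorrected = family_wise_error_rate(alpha, n_tests)
per_test = bonferroni_threshold(alpha, n_tests)
corrected = family_wise_error_rate(per_test, n_tests)

print(f"no discount:   {uncorrected:.0%} chance of at least one false positive")
print(f"with discount: per-test threshold {per_test}, overall chance {corrected:.1%}")
```

The more ways the data are reanalyzed, the stricter each individual test has to be for a "positive" to mean anything.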
So what can I do? I mean my thing they wanted and sort of published. It was inherited but it had a
parent-of-origin effect. You can imagine where that came from, right? It was just a false positive.
Right? So I don't know what to do. At Princeton, we're trying to teach everybody statistics
as freshmen, in the hope that, you know, we'll do our small bit to have one fewer optimist out there.
That was a good question. Got more?
Landsman: Thank you.
Botstein: My pleasure.