Science Reporters' Seminar on Genome-Wide Association Studies

Uploaded by GenomeTV on 06.01.2010

Dr. Francis Collins: Glad to see everybody turning out at 9:30
in the morning here in downtown D.C. to have what I think is going to be a pretty interesting
day, because, from the perspective of many of us, we are witnessing a real avalanche
of exciting data coming out of a new approach to try and understand the genetics of common
disease, an area that has, frankly, been pretty frustrating until recently, with the power
of the methods we were using to identify hereditary factors in diseases like diabetes or coronary
artery disease being quite limited and, therefore, relatively little progress and an
unfortunate number of claims, based on candidate genes that didn't hold up when they were attempted
in a validated follow-up. All of that is changing, and I think you’re going to see, and have
already seen over the last month or so, a real deluge of reports of people who have
applied this new approach to common diseases and, I think, as reporters, you’re going
to be challenged to try to figure out exactly how to interpret these and what the significance
is, what the impact is likely to be on the future of medicine.
We planned this workshop oh, you know, all of two and a half weeks ago, sort of realizing
that this was coming: getting a number of queries from some of you about interest in
getting more of a scientific background assembled into one place and we thought this would be
timely to try to do this and I really want to thank Larry Thompson and his staff for
putting together, in an incredibly short period of time, a program of this sort: finding the
space, sending out the invitations, beating on all of us to come up with our presentations,
which I think were complete by about midnight last night so they are very fresh [laughter].
3:00 AM was -- okay -- [inaudible] [laughter] [inaudible] [laughter]
I do also want to thank the speakers, many of whom have very busy
lives in their jobs at the Genome Institute, for basically setting aside quite a chunk
of time to come to meet with you. But we are excited about this and we hope this will be
an opportunity for a very interactive day. Each of the speakers has been asked to present
for no more than 20 minutes so that there's plenty of time for discussion and questions,
and I think we also want to make ourselves available to you at the breaks and after today
if you are looking for information about how to interpret these kinds of studies. I think
all of us would welcome those kinds of inquiries in the future.
So I just have a few words of introduction and then we’re going to go into the first
presentation from Emily Harris but just to set the stage a little bit: what's the fuss
about anyway?
The sequencing of the human genome which, as you know, was completed in April of 2003
was basically focused on a reference DNA sequence: the 99.9% of the genome that we all have in
shared form but, obviously, while that gave us an enormous wealth of information about
how biology works, it didn't shed a whole lot of light on how variations in that sequence
result in individual risks of disease. We know that virtually all diseases have some
hereditary contribution. We know that simply by observing that family history is often
the highest risk factor for a disease like cancer or heart disease or diabetes. The reference
sequence was a wonderful foundation but not for that part so we really had to develop
a rich catalog of genetic variation and develop the tools to be able to survey that in people
with diseases in order to see where those variations lie that play a role in that kind
of risk.
Essentially, we've been trying to develop the tools to find the ticking time bombs in
all of our genomes and we all have them and various estimates have been made about how
many and I don't think we really know but certainly dozens that each of us walk around
with: generally common variations which may have no significance unless you get a certain
set of them to push you over a threshold, especially if the environment comes along
as an additional risk factor, and nudges you into a disease situation.
These ticking time bombs, though, for common diseases like asthma or hypertension, carry
a relatively low risk. These are not like Huntington's Disease or Cystic Fibrosis where
the gene mutation is almost deterministic. For common diseases, these are going to increase
your risk by maybe 20% and otherwise you won't know they are there because you may very well
not see the consequences if you don't have a sufficient collection of them.
So how to find these ticking time bombs has been the struggle of the field for quite some
long time. Most of the variation in the human genome, and you'll hear more about this from
Larry Brody, is of the simple sort where it's a single letter that's different and, of course,
we call these SNPs or single nucleotide polymorphisms, here diagrammed as a simple C instead of a
G, in two different versions of the same chromosome. Probably something like 90% of human variation
is of this type. The other 10% is more complicated, involving either small or even large insertions
and deletions which we have somewhat more difficulty measuring right now so most of
the studies that you’re going to be reading about and we’ll be talking about today will
be focused on SNPs as the major part of where the variation is in the genome and how it might
correlate with disease.
And the idea of doing a genome-wide association study is a very simple one and isn't that
nice? This isn't like linkage where you have to have LOD scores and all sorts of other
complex concepts about how to analyze a family. This is really a very simple idea. The idea
is, you find individuals with the disease and you find individuals who don't have the
disease. Here, colon cancer is the example and, of course, you want to be sure the people
with disease really have that disease and not something else that looks like it and
you want to be sure that unaffected people, to the extent that you can, are really unaffected,
and not just that you've missed the diagnosis and, of course, that's always a problem for
an adult onset disease, because your so-called unaffected people might actually get affected
later on, and you don’t have any way of knowing that and that's going to dilute out
your power a little bit but we live with that.
And then you're going to want to test all of the SNPs in the genome if you're going
to be thorough about this to see are there ones like SNP B, where there is a skewing
of the appearance of the two different spellings: here color coded orange and blue so that the
people with colon cancer have more of the orange spelling than the unaffecteds do. Most
of the SNPs you test are going to be unrelated to the disease so they're going to look like
SNP A where the proportionalities of the two different alleles as we call them of the SNP,
two different spellings, are the same in the affecteds and the unaffecteds.
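
The comparison just described can be sketched in a few lines of Python. This is only an illustration, assuming a plain Pearson chi-square test on a 2x2 table of spelling counts; the function name and the counts are hypothetical and are not taken from any study discussed in the seminar.

```python
import math

def allele_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 degree of freedom) on a 2x2 table of allele counts.

    case_a / case_b: counts of the two spellings among cases.
    ctrl_a / ctrl_b: counts of the two spellings among controls.
    Assumes every row and column total is non-zero.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    chi2 = 0.0
    for observed, row, col in [
        (case_a, row_case, col_a), (case_b, row_case, col_b),
        (ctrl_a, row_ctrl, col_a), (ctrl_b, row_ctrl, col_b),
    ]:
        expected = row * col / n  # count expected if spelling and disease are unrelated
        chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: a SNP with identical proportions, and a strongly skewed one
chi2_same, p_same = allele_chi_square(5, 5, 5, 5)
chi2_skew, p_skew = allele_chi_square(9, 1, 1, 9)
print(chi2_same, p_same)
print(chi2_skew, p_skew)
```

A statistic near zero corresponds to the "same proportions" picture; a large statistic corresponds to the skewed picture.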
Now, already, this cartoon has committed a variety of serious distortions, particularly
by showing you only 10 cases and 10 controls because I think you could obviously determine,
without even being particularly mathematically inclined, that you would occasionally see
something like SNP B, not because SNP B was involved, but just by chance. If you're going
to test hundreds of thousands of SNPs, as you will see we are going to, then once in
a while, just as when you flip a coin 10 times, sometimes it will come up nine heads and one
tails, you'll see something like SNP B, and you shouldn't be misled by that, so, despite
Mark Twain's comments about statistics, we need statistics in this particular kind of
analysis in order not to leap to conclusions about positives that are really false positives.
The consequence of that is you need very large numbers of cases and controls.
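
The coin-flip point can be made concrete with a short calculation. The sketch below assumes a fair coin and uses 300,000 SNPs purely as an illustrative figure for "hundreds of thousands"; the Bonferroni-style threshold at the end is one common correction and is an assumption here, not something stated by the speaker.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k heads in n independent flips of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance of seeing nine or ten heads in ten flips of a fair coin
p_skew = prob_at_least(9, 10)

# Illustrative number of SNPs tested (assumption)
n_snps = 300_000

# How many SNPs would look this skewed by chance alone
expected_chance_hits = n_snps * p_skew

# A Bonferroni-style per-SNP threshold for an overall 5% false positive rate (assumption)
per_snp_threshold = 0.05 / n_snps

print(f"P(at least 9 heads in 10 flips) = {p_skew:.4f}")
print(f"Expected chance hits across {n_snps:,} SNPs = {expected_chance_hits:,.0f}")
print(f"Per-SNP significance threshold = {per_snp_threshold:.2e}")
```

With only ten observations per group, thousands of unrelated SNPs would be expected to show a nine-to-one skew by chance, which is why small samples cannot separate real signals from false positives.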
The other violence that this cartoon does to the concept of what a genome-wide association
is, is that it shows you a rather drastic difference for SNP B between the colon cancer
and the unaffected individuals: sort of nine oranges in the colon cancer and only one in
the unaffecteds. In general that difference is going to be much more subtle because, again,
you're looking at factors that play only a modest role in disease risk so instead you
might have seen five in the colon cancer and four in the unaffected. That might be the
best difference you'd expect to see which, again, tells you that you have to have very large
numbers in order to be able to assess whether that means anything, or whether it's just chance.
Now, in the past, people wanting to conduct this kind of experiment pretty much had to
satisfy themselves with picking some candidate genes, hoping that they picked wisely, and
finding SNPs in those candidate genes and then trying to see whether any of them looked
like SNP B, and that candidate gene strategy is pretty much the entire literature of association
studies for common disease for the past many decades, and some of those were very successful:
the idea that HLA, for instance, is associated with Type I diabetes or with a variety of
other autoimmune diseases. That's an association study that has held up very well. But a lot
of association studies focused on candidate genes have not fared so well, and the problem
has been that when you're going after a candidate gene, you're essentially committing the same kind
of blunder as the guy who lost his keys after a night at the bar and came out realizing
they were somewhere on the street, and limited his searching strategy to one place, namely
under the lamppost because that's where he could see and, of course, the keys, sadly,
are often not where you want them to be and so the candidate gene approach for diseases
like diabetes or cancer or heart disease where, frankly, we don't know enough to know a good
candidate gene when we see one, has often come up empty, or was thought to have come
up with something that looked like keys and turned out actually not to be keys, so candidate genes
have a checkered career of false positives.
Obviously, that's not what you'd like to do. You'd like to sample the entire genome. So,
just five years ago, when people talked about doing this, if you actually went through what you
would have to do, it was pretty daunting. In 2002, if you were going to propose getting
beyond a candidate gene and looking at the whole genome, you would need to have a catalog
of essentially all of the common SNPs in the genome and there are about 10 million of them.
That's sort of a useful number to keep in mind, so 10 million places in this three billion
base pair genome where there are common variations, where the less common spelling, the less common allele,
is present at least 1% of the time. You'd want to test all of those. First you’ve
got to find them of course and then you want to test all of them. Again, you want to test
a lot of people, so 1000 cases and 1000 controls, as we'll see in some of the examples today,
is sort of a minimum if you are really serious about finding variations in common disease,
because you're not expecting to see big effects.
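
Why modest effects demand large samples can be illustrated by scaling up the five-versus-four example from the cartoon. The sketch assumes a simple two-proportion z statistic, which is a choice made here for illustration and not necessarily the test used in the studies discussed; the function name is hypothetical.

```python
import math

def two_proportion_z(p_cases, p_controls, n_cases, n_controls):
    """z statistic for the difference between two observed proportions (pooled variance)."""
    pooled = (p_cases * n_cases + p_controls * n_controls) / (n_cases + n_controls)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    return (p_cases - p_controls) / standard_error

# Same proportions (50% in cases, 40% in controls), increasing group sizes
for n in (10, 100, 1000):
    z = two_proportion_z(0.5, 0.4, n, n)
    print(f"{n:>5} cases, {n:>5} controls: z = {z:.2f}")
```

The observed proportions never change, but the statistic grows with the square root of the sample size, so a difference that is indistinguishable from noise at 10 per group stands out clearly at 1000 per group.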
And then you would have to take each of those DNA samples and, using technology you'll hear
about, genotype each DNA sample for each of those 10 million SNPs, so do the math. That
would be 20 billion lab tests, genotypes, and in 2002 a genotype cost you about 50 cents,
even in a very good lab, so that would have cost you $10 billion to do the kind of genome
wide association that you’re now seeing all around us so it was totally out of the
question in 2002. I mean, just so far out of the question that the idea we are now doing
it is pretty astounding, just five years later.
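
The arithmetic behind that 2002 estimate, using only the figures quoted above (1000 cases, 1000 controls, 10 million common SNPs, 50 cents per genotype), can be written out as follows. The variable names are illustrative.

```python
# Figures quoted in the talk for a hypothetical 2002 study
cases = 1000
controls = 1000
common_snps = 10_000_000
cost_per_genotype_2002 = 0.50  # dollars

# One genotype per SNP per DNA sample
genotypes_2002 = (cases + controls) * common_snps
cost_2002 = genotypes_2002 * cost_per_genotype_2002

print(f"Genotypes needed: {genotypes_2002:,}")
print(f"Cost in 2002:     ${cost_2002:,.0f}")
```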
So what happened? Well, one thing that happened was this project called the HapMap, which
many of you wrote stories about at the time when it was published in October 2005, so
this was an international collaboration involving six countries and more than 1000 scientists
all working together and, basically, the plan was to lay out the landscape of genetic variation
across the whole genome: first of all, to build up that catalog of SNPs because before
HapMap came along, we only knew of about 2 million of them. Now we’re pretty close
to the 10 million that we wanted to see, but not only to catalog them, but to take advantage
of something about variation across the genome which turns out to be incredibly useful and
time-saving and that is that SNPs don't travel independently of their neighbors. These SNPs
tend to be clumped together in terms of which particular spelling you're going to find in
neighborhoods so that if you have tested this SNP and there's a SNP next door, you can probably
predict what it will also have on that same chromosome and this -- well, you'll hear more
about, is a phenomenon that geneticists have labeled in their inimitable way with a term
that almost nobody loves called linkage disequilibrium or LD: linkage disequilibrium. Well, what
does that mean?
Linkage equilibrium would mean that two SNPs that were next to each other were truly random
as far as their association with each other: that knowing the spelling of this one would
tell you nothing about the spelling of that one. Linkage disequilibrium means that they
are not random. They're actually coordinated, correlated and that turns out to save you
a ton of work because the neighborhoods over which this linkage disequilibrium operates
are actually pretty good sized: in the neighborhood of 20 or 30,000 base pairs so something like
30 or 40 SNPs will all be traveling in lockstep together on a chromosome and that means if
you know that and you know the boundaries of those neighborhoods, which is what HapMap
tells us, you can pick two or three SNPs in that neighborhood and if you test those you
can infer what all the others would have been without actually having to measure them so
the two or three you test are essentially proxies for all the rest.
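
The idea that one SNP can stand in for its neighbor can be sketched numerically. The example below assumes r-squared, a commonly used measure of linkage disequilibrium that the speaker does not name, and uses made-up two-letter haplotypes; none of the data or names come from HapMap.

```python
def r_squared(haplotypes, allele_1, allele_2):
    """Squared correlation between two SNPs across a list of two-letter haplotypes.

    Each haplotype is a string such as "CA": the first letter is the spelling at
    SNP 1 and the second letter is the spelling at SNP 2 on the same chromosome.
    Assumes both SNPs actually vary in the sample.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n
    p12 = sum(h[0] == allele_1 and h[1] == allele_2 for h in haplotypes) / n
    d = p12 - p1 * p2  # departure from independence (zero under linkage equilibrium)
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))

# Hypothetical chromosomes where the two SNPs travel in lockstep
lockstep = ["CA"] * 4 + ["GT"] * 4
# Hypothetical chromosomes where the two SNPs are unrelated
independent = ["CA", "CT", "GA", "GT"] * 2

r2_lockstep = r_squared(lockstep, "C", "A")
r2_independent = r_squared(independent, "C", "A")
print(r2_lockstep, r2_independent)
```

A value of 1 means knowing the spelling at the first SNP tells you the spelling at the second, so only one of them needs to be measured; a value of 0 means the first SNP is useless as a proxy.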
Well, that saves you a lot of work and you'll see how much in a second here because in 2007,
instead of 10 million SNPs, which is what we would have to do if it were not for this
phenomenon of linkage disequilibrium, a carefully chosen set of 300,000 SNPs is enough to stand
in for all the rest at a pretty high degree of coverage. You're essentially covering 85
to 90% of the genome as if you had actually done all of that work by picking a carefully
chosen set of 300,000 SNPs. That happens to be true for European samples and Asian samples.
You need more for African samples. The reasons will become clear why that is. So that’s
pretty good. We just saved a factor of 30 but if you remember our $10 billion cost, that's
still not going to make this affordable so you still have to collect your cases and controls.
You still have to do the genotypes. You still have a lot of genotypes. The other thing that
happened, and Larry Brody will mention a bit about the technology, is that in part because
of HapMap but mostly because of just really remarkable ingenuity on the part of private
sector genotyping platform developers, the cost of a genotype in five years, in five
years, has dropped from $.50 to about an eighth of a penny so when you put that all together
what used to be a $10 billion project is now less than a million. That makes it actually
quite within reach, especially because if you’d really collected 1000 cases and 1000
controls and done all that clinical work, you would have spent more than $1 million,
so that now means that the genotyping part of this is not the cost driver anymore. The
clinical work is the cost driver.
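
Putting the two savings together, with the figures quoted in the talk (300,000 chosen SNPs, 1000 cases and 1000 controls, and "an eighth of a penny" taken as $0.00125 per genotype), gives the 2007 estimate. The variable names are illustrative.

```python
# Figures quoted in the talk
samples = 1000 + 1000            # cases plus controls
common_snps = 10_000_000         # all common SNPs
tag_snps = 300_000               # carefully chosen proxy SNPs
cost_per_genotype_2002 = 0.50    # dollars
cost_per_genotype_2007 = 0.00125 # dollars, an eighth of a penny

# Saving from linkage disequilibrium: fewer SNPs to genotype
snp_reduction = common_snps / tag_snps

# Saving from technology: cheaper genotypes
price_reduction = cost_per_genotype_2002 / cost_per_genotype_2007

cost_2007 = samples * tag_snps * cost_per_genotype_2007

print(f"Fewer SNPs by a factor of about {snp_reduction:.0f}")
print(f"Cheaper genotypes by a factor of {price_reduction:.0f}")
print(f"Genotyping cost in 2007: ${cost_2007:,.0f}")
```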
Now, many groups had already done the clinical work over the course of the last many years,
anticipating that a day might eventually come where this kind of approach could be taken
and they are plunging in with wild abandon from studies on diabetes to the Framingham
Study, which is doing this kind of thing, to studies on schizophrenia, on autism, and
a long list of others that are in the works and, essentially, hopefully without being
too grandiose, what this is doing, is allowing you no longer to be relegated to looking under
that single lamppost, but lighting up the whole street by having this kind of technology
now in hand, so you don't have to know the answer. You don't have to guess the right
candidate gene.
You can now systematically and comprehensively survey the genome and find what's there. And
so, if you have enough cases and controls and you've done the matching properly --
that's important. You don't want to be finding things that have nothing to do with the disease
and everything to do with the fact that your cases and controls were actually different
in terms of their ancestral origins -- and you apply the technology appropriately, you get
quality data and you do your math right, you should be able to find variations that are
associated with common disease and that is, in fact, what is happening. Again, what is
being found are variations that have relatively modest effects on disease risk: maybe increasing
the risk in somebody who carries the risk spelling by 20%, not multiplying it by 10,
but each one of those, then, points you toward a pathway you didn't know about, and I think
one of the things that's coming out of this early phase of this is that most of the genes
people are discovering for common disease are genes that they never would have guessed
had anything to do with that particular illness. They would not have been on anybody's short
list of candidate genes.
So that's a brief introduction to the topic with much more detail to be filled in during
the course of the day.