The Challenges to Aggregating and Analyzing Data Sets from Sequencing Studies - Francis Collins

Uploaded by GenomeTV on 13.06.2012

Francis Collins: Boy does this have déjà vu written all over
it. It's great to be here to welcome all of you to what I think is a very important meeting,
to talk about how we're going to make the most of some technological advances that have
enormous potential for teaching us about human biology and genetics and a wide variety of
exciting applications. I wanted the chance to come and say just a couple words of welcome
because, from the perspective of all of NIH, I think the meeting you're initiating right
now and going through tomorrow afternoon has great significance for our ability to make
the most of this remarkable moment where we have, by various counts, and Adam Felsenfeld
will give you more details about this, something in excess of 70,000 individuals for whom whole-exome
or whole-genome sequence will be in hand by the end of this calendar year. It's pretty
amazing to just say that and not have anybody fall off their chair, but -- because a few
years ago that would have been almost unthinkable and now it's like, "Well, yeah, okay." Well,
the "okay" part is how do we make sure that that data, which is at the moment scattered
in quite a few places with lots of different ways that people can get access or not, could
be assembled together, so that the whole would be greater than the sum of the parts, and
we would have a chance collectively across NIH and across other scientific sectors to
make the most of what we could learn from this exceptionally powerful data.
There are of course many kinds of questions that could be posed if we had that data set
in front of us. Some of us, I'm thinking David Cox and Teri Manolio and Eric Green and Lisa
Brooks and David Altshuler and I, and maybe a few others, I'm not sure I got the whole
roster, spent Thursday and Friday of last week at a meeting in Boston, which was co-sponsored
by NIH and industry with multiple pharmaceutical companies represented there at a relatively
high level, to talk about how one could utilize this kind of genome sequence data to do a
better job of identifying the right targets for the next generation of therapeutic development,
effectively imagining being able to take advantage of the human knockout project that nature
has already carried out to identify individuals with heterozygous or homozygous loss of function
of virtually all of the protein-coding loci, maybe some of the non-protein-coding loci
as well, and figure out what their phenotypes are in anticipation that that would be a great
way of predicting what the consequence would be if you developed an antagonist against
that particular product and it could not only tell you what the likely efficacy would be
of such a drug development effort, but perhaps also whether there would be toxicity or not
since the natural experiment is already potentially in hand.
That was an exciting conversation, driven of course by some examples like PCSK9 which
everybody likes to refer to where this has been reduced to practice already and clearly
is presenting opportunities for therapeutic development that are quite powerful. And the
question is how generalizable could that be, if one had the data in front of you, the opportunity
to go back and carry out phenotyping on those individuals whose genotypes particularly strike
you as interesting. That's one of the things one might talk about in terms of having access
to this kind of data in a large-scale fashion in an accessible database. There are many
others, certainly the ability to understand biology from the perspective of gene-gene
and gene-environment interactions would be greatly assisted by having data of this sort
assembled in one place with standardization, not only of the DNA sequence, which of course
needs to be of the highest quality and isn't always, let's be honest, as well as having
environmental information and phenotypic information collected in a fashion that will be most useful.
And frankly I think that's even harder but something that we should begin to tackle.
Clearly this kind of approach is happening across NIH, almost all of the 27 institutes
and centers have something going in the direction of utilizing genome sequence information to
try to answer questions, but it has not been fully pulled together in the way that everybody
now agrees it needs to be. And that's why we are grateful to all of you, and I guess
especially to Mike Boehnke and Wylie Burke as your co-chairs who agreed to shepherd this
enterprise, for being here because we do expect this is not a meeting to give your standard
talk, not at all. It's a genome meeting. I know that never happens at genome meetings.
It is a meeting to really roll up your sleeves and try to figure out what are the barriers
to getting this particular outcome to happen and how can we knock those down, and do so
in a way that is both scientifically rigorous and highly respectful of the concerns about
confidentiality and privacy, which we must pay attention to if we are going to maintain
the confidence of those who have given biological samples for us to learn from.
So the problems will be numerous, certainly some of the hard issues you'll be talking
about will be data access policies: how can they be maximized to benefit science while
preserving privacy and confidentiality? Comparability of variant data: how do you put together some
of these datasets that are actually collected in different ways and have different ways
of recording the information that's been collected on the participants? Particularly, what do
we do about phenotype and environmental data? How is that best displayed in a fashion that
you could compute across multiple datasets? What about the simple problem of the computing
power? The kinds of questions being asked here are going to be very challenging, and
some of them very expensive in terms of cycles. And what analysis tools do we not have that
we are going to wish we did and how could we start down the pathway of getting them
sooner rather than later? Those are just a few of the issues that I know you will be
wrestling with.
So again it's really wonderful to be able to be here to issue a word of welcome on behalf
of all of NIH, but I do want to thank NHGRI and Eric and Lisa and others at the -- the
NHGRI staff who have worked very hard to pull together the details. And I'm counting on
this to be a meeting that has a lot of substantive outcomes that will lead us forward to get
to that goal that I think we all can see ahead of us, but it's still a bit blurry and there
are still some potholes in the road, and the challenge is to figure out how to get to that
goal in a fashion that does the most to advance the cause of biomedical research.
So that's all I wanted to say. I'm going to invite Eric Green to come up and say a few
words, and if you have questions then for Eric and me, I'll stay for a little bit. Unfortunately
there are other fires burning back on the NIH campus which I will need to go to attend
to in a bit. But Eric, why don't you come forth?