Whole-Exome Sequencing: Technical Details - James Mullikin


Uploaded by GenomeTV on 14.10.2011

Transcript:
>> Thank you, Eric, for the wonderful overview of where we are today in the world of genomics and sequencing. As he said, I'll be talking about the how-tos of exome sequencing. It was quite amazing, when we put this course together and announced it on the web, that we had such an amazing response to be here for this series of lectures today. The storm threw a little wrench in the works, I think, but luckily this will all be recorded and people can view it live now or later.
So, I am Jim Mullikin, director of the NIH Intramural Sequencing Center; I was just appointed to that a month ago. Prior to that I was acting director for two years. The center itself has been in existence since 1997; Eric put it together and headed it until I became acting director. I also head the Comparative Genomics Unit. I want to focus on the technical details, the hows, but first let's ask why. With the amazing turn-out we have for this course today, there is clearly great interest, so I thought I would highlight a few of the reasons we brought up in an opinion piece you'll see in Genome Biology, written by Les Biesecker and myself as the argument for the exome; the argument for the genome was from Kevin (indiscernible) from the Duke Center for Human Genome Variation.
So what are the reasons we say people are interested in exome data? It focuses on the part of the genome that we understand best: the exons of genes. Certain changes in those regions of a gene can change the way the protein works. That is what we are after with exomes: they help us understand highly penetrant variants and their relationship to phenotype. You'll see that later today in applications of this technology.
Getting down to one of the fundamental reasons why the exome is a good first step in the process: whole-exome sequencing is currently around one sixth the cost of whole-genome sequencing. If you compare apples to apples, what it costs within one center to do a genome versus an exome, that ratio has stayed fairly constant at about one sixth. So that's quite substantial. Another huge impact: an exome is only about one fifteenth of the data. If you have storage devices that cost you money and you want to store the data for a long time, fifteen times more data means a lot more storage. Another aspect is machine time: it takes roughly fifteen times longer to generate a whole genome on the same machine than it does to generate a whole exome, so you tie up your machines much longer doing genomes than doing whole exomes. So there are quite a few reasons to stick with the exome for the near term and even the longer term.
Now I'm going to jump into more of the technical details, and I want to say that the approaches and platforms I'll describe in this presentation do not constitute an endorsement of any product or commercial entity and are instead used solely for the purpose of illustrating general principles.
Our group occupies the top floor of the research building you see here out in Rockville. We moved there in 2004; prior to that we were at another location. We have been in this facility since 2004, and this is what the sequencing floor looked like in February 2010, about a year and a half ago. You can see in this area six of the GA IIx instruments we had at the time, two more instruments at this position, and here are the 3730s. A year later we installed our first two production machines of the HiSeq technology and have been using those in conjunction with the other machines as well. As Eric was saying, the sequencing technologies are racing along and you have to keep up and keep upgrading your technology as the field changes so rapidly.
To compare the different types of platforms, from the 3730 onward: this is not a timeline; it shows how much data each machine produces today, the amount of data you can generate per run, and it's a log scale along this axis, going from kilobases up to terabases per instrument run. You can see that the different technologies have hugely different outputs, with the HiSeq generating around 600 gigabases of sequence per machine run. If you convert that into cost, the cost per base is being driven quite low by these newest technologies. There are advantages and disadvantages to all of them: turnaround time is very nice for some, while the cost per base is much lower for others.
We are taking on a lot of sequencing projects, and as this graph shows we are well over a thousand exomes processed so far, while also doing other types of experiments with our machines for various investigators and other institutes. Looking at the throughput by month since we started running this in production: as we added more machines and the technology improved, output grew to billions of reads per month. We then added the HiSeqs and decreased the GAs, focusing on running the HiSeqs, and in the month of August we were able to hit 35 billion reads. Running in 100-base read mode, that is about 3.5 terabases per month that we can generate with the sequencing machines we have today.
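As a quick sanity check of that throughput arithmetic, here is a minimal sketch; the read count and read length are the figures quoted above, and the rest is just unit conversion.

```python
# Back-of-the-envelope throughput check: reads/month x read length -> bases/month.
reads_per_month = 35e9        # ~35 billion reads in the best month (quoted above)
read_length = 100             # running in 100-base read mode

bases_per_month = reads_per_month * read_length
print(f"{bases_per_month / 1e12:.1f} terabases per month")  # -> 3.5 terabases per month
```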
With thousands of samples coming through, it's very important to track the data. Investigators come to us with projects, we take their samples, and we need to make sure we track them properly in a laboratory information management system. It is based on Cimarron software that we have been using for many years, and we have written a new system on top of it for managing these types of projects. Here is a flow cell layout: the HiSeq or GA flow cell has eight lanes, and one operation at this stage is assigning each sample to whichever lane you want it to go on.
There are computational needs for the data flowing off these machines. For the production processing of the data that comes from the six GAs and three HiSeqs, we have a Linux cluster on that floor with a thousand cores, 250 of which are available for production operations. There is a petabyte of disk, and it gets eaten up quickly, so we're looking at how to expand those systems or compress the data in new ways to stretch how long those storage systems will last. Networking is also very important: you need high-bandwidth networking to handle the data flowing from those machines.
Today we want to hear about whole-exome sequencing, so let me walk through that particular part of our pipeline. In the exome sequencing pipeline I have highlighted the major topics, and I'll go through them quickly here. The DNA needs to be fragmented into shorter fragments so we can make a library according to whatever protocol we're using. Then the exome enrichment stage happens, which I'll detail. Samples are loaded on the sequencing machine, and sequencing and base calling are performed. Sequence read alignment, a critical stage prior to variant detection, is then performed.
So now I'll go through each of these steps. As I said, first the sample is fragmented: we fragment the DNA into lengths of 300 to 400 bases, then the fragmented ends are repaired and an A overhang is added. This is the Illumina protocol, from their website, if you want to see it in more detail with an explanation. You have adapters with a T overhang that ligate to the end-repaired fragments, and we can select out the ones that are properly formed by amplifying with PCR for five cycles at this stage.
Now we can enter the process of enriching for the exome portion of the genome. This example uses the Agilent technology we have used quite a bit: the 38-megabase capture and 50-megabase capture kits take the same type of approach. I have already described this part, fragmenting the DNA and making the initial library for the sequencing platform. Then you mix that together with reagents and a biotinylated RNA bait library designed against the regions that are exons of the genome. Once these are hybridized together, you can mix them with streptavidin-coated magnetic beads, and by holding those captured fragments with a magnet you can wash away the unbound fraction; that is the enrichment process happening here. You can then release the library from the beads by digesting the RNA. At that point you have the same library you started with, but enriched for the regions targeted by whatever capture kit we were using. Then one more round of amplification occurs, typically ten PCR cycles, prior to having a library ready to load on a sequencing machine.
We also use the Illumina TruSeq exome kit; it is DNA-based instead of RNA-based, and the latest kit targets a little bit more of the genome.
the genome. So once you have enriched a library for the
exome and another process that can be implemented along the way is you can use a portion of
the sequence to read an index. There's 12 indices you can use right now with
aLumina system. You can pool six tags before you do the exome
enrichment so you're enriching six libraries at once.
That have been tagged uniquely. You need to be able to balance these libraries
correctly so that each of these will be equally represented once you get the sequence back
out so we use QPCR to figure out to quantitate the concentration of these libraries independently
and we can balance those an pool them together and once enriched load them on two lanes of
the latest flow cell from aLumina where the next step of the aLumina process occurs, the
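As an illustration of that balancing step, here is a minimal sketch, under the assumption that you simply pool inversely to each library's measured concentration to reach equal representation; the concentrations and the target amount are made-up numbers for illustration.

```python
# Hypothetical qPCR-based balancing: pool six indexed libraries so each is
# equally represented. Volumes are inversely proportional to the measured
# concentration of each library (all numbers are illustrative).
qpcr_nM = {                      # qPCR-quantitated concentration, nanomolar
    "index01": 12.0, "index02": 8.5, "index03": 15.2,
    "index04": 9.8,  "index05": 11.1, "index06": 7.4,
}
fmol_per_library = 50.0          # amount of each library to put into the pool

for tag, conc in qpcr_nM.items():
    volume_uL = fmol_per_library / conc   # nM == fmol/uL, so fmol / (fmol/uL) = uL
    print(f"{tag}: add {volume_uL:.2f} uL")
```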
The next step of the Illumina process is cluster generation, shown in this diagram: the DNA fragments have special ends that attach to the flow cell I just showed you. Then the wonderful step of bridge amplification occurs, which amplifies each fragment into a cluster large enough to image, so you can get a good signal from these clusters as sequencing is performed, which I'll show next. Here is the sequencing primer. As each base is added, we see the fluorescence that comes from it, and we read each base as sequencing occurs in a cyclical fashion; the different colored fluorophores tell us which base is which, and that generates the base calls. The HiSeq flow cell has eight lanes and can generate one and a half billion clusters, which converts to about 300 gigabases of sequence per flow cell.
Next is sequencing and data processing. The data initially runs through Illumina's whole pipeline up here, their CASAVA pipeline for processing the data. We run the machine in 2 x 100 paired-end read mode and we also read the index. The system sees the index tag and demultiplexes that information; since the libraries were pooled in one lane, they need to be pulled apart bioinformatically, so each sample becomes an independent data set referring back to the original sample it was associated with.
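To make the demultiplexing idea concrete, here is a minimal sketch of splitting reads by their index tag; the index-to-sample mapping and the read records are hypothetical, and real pipelines such as CASAVA also handle index-read quality and mismatches, which this ignores.

```python
from collections import defaultdict

# Hypothetical index -> sample assignments for one pooled lane.
sample_of_index = {
    "ATCACG": "sample_A", "CGATGT": "sample_B", "TTAGGC": "sample_C",
    "TGACCA": "sample_D", "ACAGTG": "sample_E", "GCCAAT": "sample_F",
}

def demultiplex(reads):
    """Group (index_seq, read_seq) pairs into per-sample bins.

    Reads whose index does not match any expected tag go to 'undetermined'.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = sample_of_index.get(index_seq, "undetermined")
        bins[sample].append(read_seq)
    return bins

# Toy usage: two reads carrying sample_A's tag, one unrecognized tag.
reads = [("ATCACG", "ACGT" * 25), ("ATCACG", "TTGC" * 25), ("NNNNNN", "GGGG" * 25)]
for sample, seqs in demultiplex(reads).items():
    print(sample, len(seqs))
```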
The CASAVA pipeline includes a stage called ELAND, and with that the reads are aligned to the human genome, or in another case we have also done the mouse genome. So this also works if, say, you captured the exome of the mouse; you can have it processed against the mouse genome. The bulk of what we have done is against the human genome, so the alignment is performed against that. To give you an idea of the amount of sequence per sample: if the pool is properly balanced, each sample will generate about 10 gigabases of sequence from its two lanes. There are six samples across the two lanes, so about 60 gigabases of sequence from the two lanes, but 10 gigabases per sample. That converts to about 100x coverage of the targeted regions of the exome.
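A rough sketch of that conversion from gigabases to fold coverage is below; the on-target fraction is my own illustrative assumption, not a figure from the talk, since some reads always fall outside the targeted regions.

```python
# Rough conversion from total sequence per sample to mean coverage of the target.
total_bases = 10e9        # ~10 gigabases of sequence per sample (quoted above)
target_size = 50e6        # ~50 megabase capture design
on_target_fraction = 0.5  # assumed fraction of bases landing on target (illustrative)

mean_coverage = total_bases * on_target_fraction / target_size
print(f"~{mean_coverage:.0f}x mean coverage of the targeted regions")  # ~100x
```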
This gives you an idea of what the raw coverage looks like across many exons; you'll see different patterns. Here, for isolated targeted regions, you get fairly consistent coverage. This is one sample; if you look vertically, that's four samples, and the consistency across samples is fairly nice. There's more variability as you look along: where regions need a lot of probes you get overlapping information in that region. If you zoom in on that particular cluster, you'll see in the zoomed-in view that the coverage is fairly consistent across samples but variable along a single sample across the different targeted regions.
This leads to a question about exome sequencing: if the exome is only 2% of the genome, why isn't it just 2% of the sequence? The reason you need more than the direct ratio is that in an ideal system doing whole-genome sequencing you get something close to a Poisson distribution of the reads across the genome. If you sequence to 60x coverage you get that kind of distribution, nicely covering every base. But because the affinities are different for the different probes, and sometimes probes are close together like I showed earlier, you'll get some regions with high depth of coverage and other areas with lower depth, because those regions didn't enrich as well. So you push more sequence through the system to make sure you cover that more difficult fraction. I want to make sure you have that in mind: the variability here translates into a broadening of this distribution, which is why the exome takes about a fifteenth of the sequence instead of a fiftieth.
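As a toy illustration of why the broadened distribution costs extra sequence, here is a sketch comparing an idealized Poisson coverage model with an overdispersed one; the overdispersion parameters are purely illustrative, and the point is only that at the same mean depth more target bases fall below a usable threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bases = 1_000_000   # toy number of targeted bases
mean_depth = 60       # same mean coverage in both models

# Idealized whole-genome-like case: per-base depth ~ Poisson(mean_depth).
poisson_depth = rng.poisson(mean_depth, n_bases)

# Capture-like case: probe-to-probe efficiency differences broaden the
# distribution; modeled here as a gamma-distributed per-base rate (illustrative).
rates = rng.gamma(shape=4.0, scale=mean_depth / 4.0, size=n_bases)
capture_depth = rng.poisson(rates)

threshold = 10  # minimum depth considered usable here (illustrative)
for name, depth in [("Poisson", poisson_depth), ("broadened", capture_depth)]:
    frac_low = np.mean(depth < threshold)
    print(f"{name}: {frac_low:.2%} of bases below {threshold}x")
```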
So, after we have the ELAND alignment, I'm going to talk about a specific refinement of the alignment that my group developed. Since ELAND is part of the standard pipeline, we can leverage it, because ELAND places the reads in the correct genomic location, but I'll show you a couple of images shortly where the fine-scale alignment isn't as good as we would like. So we also use cross_match, a local aligner that does a good job of spanning across indels. Another thing we do here: if a read was never placed but its mate was, we throw the unaligned read into the same bin as its aligned mate. This cross_match step works in regions of the genome: it trusts what ELAND says is the correct general location, but performs a realignment within localized 100-kilobase bins across the genome of whatever reads fall in those bins. This is what the alignment looked like from ELAND to start with: there's a six-base deletion here, and it couldn't quite capture that information, so the alignment went quite wrong on either side. If you then look after the cross_match improvement, everything looks nice, and you can see clearly there's a six-base deletion at this location of the genome.
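Here is a minimal sketch of the binning idea described above: reads, and the unaligned mates of aligned reads, are grouped into 100 kb windows keyed by the aligned position, so that each window can later be handed to a local realigner such as cross_match. The data structure and field names are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # 100 kb realignment windows, as described in the talk

def bin_read_pairs(pairs):
    """Assign read pairs to (chromosome, 100 kb window) bins for local realignment.

    Each pair is a dict with 'chrom', 'pos' (None if the read did not align),
    'seq', 'mate_pos', and 'mate_seq'. An unaligned read is binned with its
    aligned mate so the local realigner gets a chance to place it.
    """
    bins = defaultdict(list)
    for p in pairs:
        anchor = p["pos"] if p["pos"] is not None else p["mate_pos"]
        if anchor is None:
            continue  # neither read aligned; nothing to anchor the pair to
        key = (p["chrom"], anchor // BIN_SIZE)
        bins[key].append((p["seq"], p["mate_seq"]))
    return bins

# Toy usage: one fully aligned pair and one pair whose first read is unplaced.
pairs = [
    {"chrom": "chr1", "pos": 1_234_567, "seq": "ACGT", "mate_pos": 1_234_890, "mate_seq": "TTGC"},
    {"chrom": "chr1", "pos": None,      "seq": "GGCC", "mate_pos": 1_250_001, "mate_seq": "AATT"},
]
for (chrom, window), members in bin_read_pairs(pairs).items():
    print(chrom, window * BIN_SIZE, len(members), "pairs")
```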
We developed this whole approach about two years ago. Other groups are pushing along rapidly with other alignment methods, and we always want to make sure that we're at least as good as, if not the best in, the field. To compare aligners you need to know the truth, and the only way you can know the truth is to simulate. So these are simulated data, and this axis shows the percentage of reads correctly placed versus the simulated variant size: zero means a single-nucleotide substitution, and the other points are insertions and deletions of various sizes. We simulate with an appropriate depth of coverage, like you saw in the previous slides, and we model the depth and the error profiles. In the outcome, the green line is ELAND; because our refinement rescues the unaligned mates, we get a higher percentage of reads aligned. It didn't hold up quite as well as we would think at the larger events: six-base deletions and insertions are more challenging for that aligner.
Next, after you have a good alignment of the reads to the genome, you need to convert it to variation calls. It is more accurate to say that what you want from the data is a genotype call at every position. All the autosomes are diploid and you want to know the genotype at most positions, so this program, MPG, is a Bayesian genotyping package that models the ten possible diploid genotypes at every base; we look at every base in the genome to figure out what the call would be, even if it's homozygous reference. So we have a genotype call for every position that we have enough coverage for. Take a position with eleven A reads and three T reads: if you thought of these as fourteen coin flips you would expect something close to 50/50 from a heterozygote, but there are cases where you see a skew like this. The system gives the probability of each genotype given the data; in this case the most probable genotype is AT, and subtracting the score of the next most probable genotype, 14 lower on the log scale, gives an overall MPG score of 14 for the AT call. We have found empirically that scores of ten or higher are generally good calls, but you can raise that threshold if you need to. We don't just use depth of coverage to say a base is covered; we require a good-quality MPG score at each base of the genome, and then we know whether it's covered properly or not.
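Here is a minimal sketch of that style of genotype scoring, assuming a simple symmetric per-base error rate and a uniform prior; it is not the actual MPG implementation, just an illustration of scoring the ten diploid genotypes and taking the log-probability gap between the best and second-best call as the score. A `ploidy=1` option restricts the call to the four single-allele genotypes, as done for the X and Y in male samples (discussed next).

```python
import itertools
import math

BASES = "ACGT"
ERROR_RATE = 0.01  # assumed symmetric per-base sequencing error rate (illustrative)

def genotype_scores(observed_bases, ploidy=2):
    """Score genotypes at one position from observed base calls.

    Returns (best_genotype, score): the log10-probability gap between the most
    probable and next most probable genotype, in the spirit of the MPG score
    described above (simplified: uniform prior, fixed error rate).
    """
    if ploidy == 2:
        genotypes = [a + b for a, b in itertools.combinations_with_replacement(BASES, 2)]
    else:
        genotypes = list(BASES)  # haploid: only the four single-allele calls

    log_likes = {}
    for g in genotypes:
        ll = 0.0
        for obs in observed_bases:
            # Each chromosome copy emits the observed base correctly with
            # probability 1 - ERROR_RATE, otherwise one of the other three bases.
            p = sum((1 - ERROR_RATE) if allele == obs else ERROR_RATE / 3 for allele in g) / len(g)
            ll += math.log10(p)
        log_likes[g] = ll

    ranked = sorted(log_likes, key=log_likes.get, reverse=True)
    best, second = ranked[0], ranked[1]
    return best, log_likes[best] - log_likes[second]

# Toy usage: 11 A reads and 3 T reads, as in the example above.
print(genotype_scores("A" * 11 + "T" * 3))   # calls the AT heterozygote with a positive score
print(genotype_scores("A" * 14, ploidy=1))   # haploid call, e.g. male X or Y
```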
There's another part of the genome that we need to take care of. I already mentioned that the autosomes are normally diploid in human samples, though cancer samples can be quite different. MPG is designed for two alleles on the autosomes, and on the X chromosome for a female sample, but for male samples we change the mode it works in: on the X and Y we attempt to call one of the four bases, because you only expect one of the four nucleotides there instead of all ten possible genotypes of a diploid region of the genome. So this is for the X and Y chromosomes in male samples.
Another test we've looked at is to figure out, for a given kit, the TruSeq kit here, the optimal amount of sequence to generate for a given experiment. For the human exome we had one data set with a lot of coverage, and we titrated it back in total gigabases, from 19 downward, so you can see how the coverage of all the coding exon bases changes as you increase the amount of sequence. Typically we feel that right around 5 or 6 gigabases would be enough. I told you earlier we're generating about ten gigabases, but there is going to be variability in the balancing of these sequences from those two lanes, so the lower ones will probably still hit enough coverage and the higher ones will definitely have enough. As for all the coding exons: the TruSeq kit didn't design baits for every one of them, so the fraction covered isn't at 100%, but it's very high, as you can see here.
We have run three capture kits. We started with the 38-megabase SureSelect capture from Agilent, with around 600 samples through the pipeline; this axis is month, up to September of this year, against the number of samples we processed. We switched to the 50-megabase kit because it was an improvement; we need to keep moving as the technology moves, as we do with the sequencing technologies, but also with the capture technologies. Then the Illumina TruSeq kit came along and allowed us to capture pooled, indexed samples, so there were some efficiency savings there, and a little bit more of the genome was captured, so we just recently switched to that kit as well.
To show an example of a gene with different coverage designs: the top three tracks here are the baits designed by these different kits. This gene was not included at all in the 38-megabase capture; for the 50-megabase capture they added it, and TruSeq has it. This is all because the kits were designed at different time points. And there is a part that TruSeq targets that wasn't captured by the 50-megabase design. You can ask not just where the baits are but what actually gets covered: with the 38-megabase capture, even though this gene wasn't targeted, there is off-target capture of these regions. Those calls wouldn't be as trusted because the regions weren't targeted, but in this case the gene was targeted by the 50-megabase capture and the TruSeq capture. It's hard to see what's going on at this resolution, so I can zoom in on this end of the gene: nothing is targeted by the 38-megabase design, the 50-megabase design targets the coding exons, and TruSeq targets the coding exons and the UTR. What comes through is good coverage of this gene, even the UTR, from TruSeq, and coverage of the exons from the 50-megabase Agilent SureSelect capture.
Then you want to figure out: what do you get from all of this? I've told you a lot of the hows, but what do you get out of the system? With the latest capture kit, this is a fairly typical example; sometimes we get more, and we don't get a lot less than this, so it is kind of the lower end of the typical total bases with high-quality genotype calls. The kit only targets 62 megabases, but we're getting more than that, roughly double the amount of sequence with good-quality calls after this analysis. The number of single nucleotide variants detected is this figure here, about 142,000 variants, homozygous and heterozygous. If you look at the heterozygous fraction on the autosomes of this sample: an individual carries two copies of each autosome, one from their mother and one from their father, and if you compare those to each other, the typically quoted figure is one variant per thousand bases. You'll see here it's a little less than that. Why is that? Because we're focusing on the coding regions of the genome, which are more highly conserved and don't vary as greatly. So this number is less than the typical value: around 0.00076 instead of 0.001. It's less because we are targeting a much more conserved region of the genome.
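A sketch of that within-sample heterozygosity calculation is below; the counts are hypothetical placeholders, chosen only to show how a value near 0.00076 would arise from heterozygous call counts and callable bases.

```python
# Within-sample heterozygosity on the autosomes:
# heterozygous positions / bases with a confident genotype call.
het_calls = 45_600            # hypothetical count of heterozygous autosomal calls
callable_bases = 60_000_000   # hypothetical autosomal bases with MPG score >= 10

heterozygosity = het_calls / callable_bases
print(f"{heterozygosity:.5f}")   # ~0.00076, below the ~0.001 genome-wide figure
```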
We don't get these values for the X or the Y here because this is a male sample. If we look at a female sample, we can again calculate the within-sample heterozygosity, the same as before for a different individual, but the X chromosome has a lower overall heterozygosity value. There are population-history reasons for that, and also the fact that there are only three X chromosomes in the population for every four copies of each autosome. This gives you an idea of the amount of data generated by this process.
If you're interested in seeing a few variants, this is one example of a heterozygous position, a single-nucleotide change at this position in the genome, and this is a heterozygous deletion in this individual. These can be reviewed: typically we rely on the MPG calls for the data, but you can go back and look at the raw versions of the data to make sure things make sense.
So what does the coverage look like for the different kits, and even for whole genome if you want to compare to that? Here is the total input sequence for the different capture methods, and coverage is counted wherever we have an MPG score of ten or higher. If you look at the aligned sequence for the SureSelect 38-megabase kit, we have about 5 gigabases of aligned sequence and 131x coverage; the SureSelect 50-megabase has a little more sequence but lower coverage because it targets more; and here, with more sequence and more targeted, 114x coverage. So the total aligned sequence is fairly similar across the different kits. If you look just at the consensus coding sequence, the CCDS portion of the genome, which all of these kits were designed for, we end up with about 90% coverage of that exome. When you look at whole genome, people think whole genome saves the day because you get everything. Well, in this case these were slightly older sequencing technologies, and this would probably be better today, but there are still reasons why you won't get complete coverage of the protein-coding regions: there were GC biases and the coverage is not as uniform as you would like over the coding sequences of the genome. Looking more widely at the UCSC coding genes, a more inclusive set but still just coding sequence: the first kit was not designed for everything, so it comes in lower than the 62-megabase capture, which is up there in the high 80s or near 90%, but the whole genome is significantly less.
How accurate are these genotype calls from exome sequencing? Here again are the different capture kits, plus whole-genome shotgun, and when you compare the overlapping genotypes between the two platforms, sequencing versus the Illumina genotyping chip, the concordance between the two is at the 99.9% level across the board.
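Here is a minimal sketch of that concordance check, comparing genotype calls from sequencing against array genotypes at the sites both platforms report; the call dictionaries are toy stand-ins, and genotypes are normalized as sorted allele pairs so that "AG" and "GA" compare equal.

```python
def normalize(gt):
    """Represent a diploid genotype order-independently, e.g. 'GA' -> 'AG'."""
    return "".join(sorted(gt))

def concordance(seq_calls, chip_calls):
    """Fraction of sites called by both platforms where the genotypes agree."""
    shared = set(seq_calls) & set(chip_calls)
    if not shared:
        return float("nan")
    agree = sum(normalize(seq_calls[s]) == normalize(chip_calls[s]) for s in shared)
    return agree / len(shared)

# Toy usage with made-up sites (chrom, pos) and genotypes.
seq_calls = {("chr1", 1000): "AG", ("chr1", 2000): "CC", ("chr2", 500): "TT"}
chip_calls = {("chr1", 1000): "GA", ("chr1", 2000): "CC", ("chr2", 500): "TC"}
print(f"{concordance(seq_calls, chip_calls):.3f}")  # 2 of 3 sites agree -> 0.667
```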
So we're doing very well in that regard. What are we applying this to? As you'll hear in the later talks, we're applying it to many different projects: the Undiagnosed Diseases Program is one, with hundreds of samples; the ClinSeq project, with over a thousand samples; and a variety of other PI-driven projects. You'll hear from those investigators later today, with examples in cancer and other rare diseases. To give an idea of the throughput per year per type of machine: you get around 200 exomes per GA and six times that for a HiSeq 2000. We cover 90% or more of the coding bases, and the accuracy of the genotype calls is also quite high.
Those are the areas I wanted to cover today in the exome sequencing pipeline. But now that we've generated these hundred-thousand-plus variants per sample, the next hundred samples will give rise to 600,000 or more variants. What do you do with those? How do you work with such a large data set? The next speaker, Dr. Jamie Teer, will address those steps and how to annotate and then work with these large data sets.
Finally, in closing, I would like to thank all the people who have worked on this: the sequencing operations headed by Bob Blakesley, Alice Young, and the lab staff who worked on this; the extensive bioinformatics needs of this pipeline, headed by Gerry Bouffard; Linux support from Jessie Becker; and my research group, Nancy Hansen, who has pioneered a lot of the work you have seen here today, and also Pedro Cruz and others heavily involved in the pipeline, who recently left to go to other places and work on new things. And then, from the Biesecker lab, Jamie Teer, who will give the next talk.
Thank you for your attention.