Whole-Exome Sequencing to Identify Somatic Variants in Cancer - Yardena Samuels

Uploaded by GenomeTV on 14.10.2011

>> All right. Everyone can hear me?
I'll be approaching this by looking at cancer genomes, specifically sporadic cancers.
I'm a PI who works on melanoma genetics, but this can be applied to other cancer types.
Cancer is a genetic disease. One of the best examples is colorectal cancer, where you
have progression from an epithelial cell to metastasis, and these morphological stages are linked
to various molecular changes right here. This has become a paradigm for endsomal cancers.
Even though cancer has been looked at genetically, we are still at the tip of the iceberg and more
alterations remain to be found. So I'll start by talking about the platform for finding
somatic mutations in cancer genomes. There are three main hurdles for this.
Base pairs for these mutations. These are just two examples of discoveries made using
unbiased sequencing approaches: two kinases found to be highly mutated, and a
recently FDA-approved drug. So though there are hurdles, it's obviously worth moving forward
and looking for these alterations. So I'll talk about tumor bank establishment.
We have here an example from breast cancer: you're often getting the primary breast tumor, and
you can get a metastatic brain tumor, as fresh tumors or embedded.
Some tumors can be turned into cell lines used for the tumor bank, and some of these samples could
be turned into a xenograft in an immunodeficient mouse.
Once you have this genomic DNA you can start looking at genome-wide DNA sequencing.
So what are the advantages and challenges of these various sources?
The advantage of the fresh frozen tumor or OCT block is obviously that you have highly
reliable data; this is closest to the actual tumor that was resected.
The challenge is that usually you have limited amounts of DNA from this source; it is heterogeneous, with
different clones of cells populating it; and you need a pathologist to
find where the tumor lies, and then macrodissection or laser capture dissection to get out the tumor cells.
Paraffin-embedded tissue: again, highly reliable data.
The challenge, again, is that you'll have limited amounts of heterogeneous DNA, and you need a pathologist
to determine where the tumor cells are. With the (inaudible) we'll have DNA quality
issues as well because of the fixation procedure and the embedding procedure.
Cell line: a lot of DNA, mostly homogeneous because you have a clone of cells being expanded
in tissue culture. The extraction is usually simple, but it's
more of a challenge in melanoma because you have melanin being produced, so you have to make
sure you don't have melanin in that DNA. A cell line is useful for downstream functional
studies. We have to make sure the alterations identified
in the cell line represent the original tumor, so it is worth going back to the original
tumor to make sure the alterations are found there as well.
With xenografts, we have a lot of DNA, it's homogeneous, and the extraction is simple.
However, again, we need to make sure the alterations are similar to the original tumor, and
it is more expensive to make xenografts than other sources.
You will get mouse DNA contamination that affects analysis down the line, so keep that
in mind. The other part of your tumor bank is going
to be your normal tissue, because we're looking for somatic mutations.
Usually blood would be an excellent source for this DNA; however, it's not always available.
And if you're looking at leukemia, it's not going to be relevant.
So you can go to neighboring tissue, but there you have to be careful because you might have
some contaminating tumor cells. And now that we're looking at in-depth sequencing,
we want as much clinical information as possible about the patients: date of birth, date
of death, diagnosis, malignancy, and the therapies that the patients underwent.
This is an example of the tumor bank we established, but it is ultimately applicable to other cancer types.
So in this case we have metastatic tumor DNA with matched normal; in our case it's blood.
We have the OCT blocks of the original fresh tumor, as well as the matched cell line.
Alterations can be gone back to and tested again in the fresh tumor.
The cell line allows RNA to be made as well as protein lysates.
And finally, clinical information. I would like to point out we have three cohorts.
It's really worthwhile getting several cohorts: you will be able to validate genetically, and
each one of your cohorts has particular bottlenecks. You want to make sure that the filters being
applied to the various cohorts are not affecting any results.
So it really is worth spending time getting additional cohorts.
So once you have the tumor bank, several quality controls are worth considering.
The first is SNP detection, to make sure that the tumor and your normal tissues are matched.
The second is to implement an assay to determine that the fraction of tumor cells is 75% or above.
That is because if it's below that, there's a chance you will not see loss of heterozygosity
or homozygous alterations. The assay we use in our case is looking for
melanoma antigens using immunohistochemistry. But obviously for different cancers, different
assays would be applied. Then a third quality control is mutation
analysis of known mutated genes, to see whether the percentage you identify in your tumor bank
is similar to what's known in the literature. So this is just a schematic of somatic mutation
analysis: the patient has the tumor resected, you have the normal tissue, in some cases you
can get a cell line, you get DNA, you sequence the genome of interest, you do the same for
the normal tissue, and you compare the sequences. We're looking for somatic mutations: any difference
between normal and tumor. In this case we look at coding regions as
well as the regions flanking the exons, to look for splice site variation
as well. We used candidate approaches; now, with whole
exome and genome, we can start being much less biased.
As mentioned previously, TCGA, The Cancer Genome Atlas, is applying whole-exome and
genome sequencing to cancers, so this is the largest industry provider of this data, and
they said there will be at least 3,000 new cancer cases by the end of September 2011,
and they are on target. So I'll move on to looking at exomes and touch
on how we choose the genomic DNA source when we're looking at whole exomes.
We have to consider DNA quality. From fresh tumor we have limited DNA; from the cell
line it is unlimited. How do you choose it, how do you assess it?
In our case we were fortunate to be able to do whole-genome analysis, and this led us to
decide which DNA to use for whole exomes. In the whole-genome study we compared fresh tumor
to a cell line derived from that fresh tumor, as well as blood.
And we ran it through the Illumina sequencer and looked for melanoma somatic variation. These
are the statistics, and just like Jim and Les mentioned, note the passing-filter coverage, which is
between 34 and 67X; it's much lower than required for exomes, since obviously it's more homogeneous
in its coverage. And you see coverage as we get whole exomes.
This allows for 92% of the genotypes across the entire genome.
So the bottom line of the study was that 97% of the alterations identified in the coding
region overlapped between fresh tumor and cell line.
However, the copy number variations were less concordant compared to the SNVs.
So based on this we decided to use the low passage cell line DNA, not the fresh tumor.
We were interested in finding the single nucleotide variations, and we could derive those from the
cell line, whereas getting them from the fresh tumor was more of a challenge. Also, we knew that we
wouldn't have stroma contamination when looking at the exomes.
So we did that with the cell lines. For the study design, we used exome capture.
In this case 14 were done in parallel; the normals were done as well.
That makes 28 exomes. In this case the 37-megabase capture was used, so we captured
20,000 genes. And this was done at NISC; we had (inaudible)
cross_match applied. This was the discovery. Every one of these
studies is followed by a validation, and the validation was done in this case using Sanger
sequencing. So we have seen a couple of these this morning already;
this is an example from this study where you can see the BRAF gene exons, you can see the
SureSelect oligos covering these various exons, and you can see the coverage itself.
Like I said, the coverage is variable across the sections, and that's why you need to go
very much in depth in your coverage. What quality tests do you need to do once you get
your exome data? We'll start off by looking at the coverage.
So this is an example of some of the samples that we did.
The coverage was extremely high; it was a minimum of 180X.
This allowed us a target coverage of 90% on average.
So this is the performance summary: 12 gigabases, 180X, and 90% coverage with high-quality
genotypes. The next item, once the exomes are done: it's
worth looking at the specificity in your exomes. So this is some of the data as we get it in the Excel
file. What we can see here is the gene name, the reference
allele and the variant allele, and the same for amino acids.
In addition to this we have the MPG score that Jamie was talking about earlier,
as well as coverage. In this case we're using the MPG-to-coverage ratio.
What we see is that if you have a ratio of 0.5 and above, and this occurs in both the tumor and the normal
sample, then when you look at validation, most of the time this is a real alteration.
If that ratio is below 0.5 in either sample, tumor or normal, the Sanger sequencing will not show
you that there is a real mutation there. Looking at it this way: for example, you take 90
regions and assess them by sequencing, divided by this ratio, so the ones that are at or above the 0.5 ratio
will validate. The ones below won't.
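The ratio filter described above can be sketched roughly as follows. This is only an illustration: the `Call` record, its field names, and the example values are hypothetical, and the only things taken from the talk are the 0.5 cutoff and the requirement that both tumor and normal pass.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate variant with per-sample MPG score and read coverage (hypothetical layout)."""
    gene: str
    tumor_mpg: float
    tumor_cov: int
    normal_mpg: float
    normal_cov: int


def passes_ratio(call: Call, cutoff: float = 0.5) -> bool:
    """Keep a call only if MPG/coverage is at or above the cutoff in BOTH samples."""
    if call.tumor_cov == 0 or call.normal_cov == 0:
        return False  # no coverage means no reliable genotype
    tumor_ratio = call.tumor_mpg / call.tumor_cov
    normal_ratio = call.normal_mpg / call.normal_cov
    return tumor_ratio >= cutoff and normal_ratio >= cutoff


# Made-up example values, chosen only to exercise each branch.
calls = [
    Call("GENE_A", 120.0, 200, 90.0, 150),  # 0.60 and 0.60: kept
    Call("GENE_B", 40.0, 200, 90.0, 150),   # tumor ratio 0.20: dropped
    Call("GENE_C", 100.0, 200, 30.0, 150),  # normal ratio 0.20: dropped
]
kept = [c.gene for c in calls if passes_ratio(c)]
print(kept)
```

Requiring the ratio in both samples matters because a weak genotype call in the normal can make a polymorphism look somatic.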
This allows you to calculate specificity. You get 97.9 percent coverage and 2.4% false
negative rate. Once you apply this you remove 18% of the
alterations. The next quality test worth doing, once you
get your whole-exome data, is to look at the sensitivity.
So in this case we had been doing candidate approaches before doing the exomes, so we already
had 47 somatic substitutions we expected to see in the exomes.
Out of those 47, only 38 were present in our exome study, which means we had 81% sensitivity.
The missed alterations were not missed because they weren't captured; they were captured and sequenced
but simply missed. And we're not sure why at this point.
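The sensitivity figure above is just the fraction of previously known mutations that the exome calls recovered. A minimal sketch, assuming variants can be compared by some identifier (the `known_*` and `novel_*` labels below are placeholders, not real variants):

```python
def sensitivity(expected, called):
    """Return (fraction of expected variants that were called, list of missed ones)."""
    expected = set(expected)
    if not expected:
        raise ValueError("need at least one expected variant")
    found = expected & set(called)
    return len(found) / len(expected), sorted(expected - found)


# 47 known somatic substitutions, 38 of which show up in the exome calls.
expected = [f"known_{i}" for i in range(47)]
called = expected[:38] + ["novel_1", "novel_2"]

rate, missed = sensitivity(expected, called)
print(f"sensitivity = {rate:.0%}, missed = {len(missed)}")
```

Returning the missed variants as well as the rate makes it easy to go back and check whether each one was captured and sequenced, which is the follow-up question raised in the talk.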
So the next item that's worth considering when looking at the exome data is the number of somatic
mutations identified per tumor. So these are the various samples sequenced, and
you can see the total number of mutations in each.
There's variation. There's a certain range.
But clearly there's one tumor here that has an extremely high number of mutations compared
to the others. It's worth not only looking for the biological
reason behind it but also noting it and deciding whether this is a relevant sample for this
particular study. There could be all kinds of artifacts, and these
are hard to predict. It's worth looking out for them.
A few examples. So in one example, when we were looking at the data,
we sorted not only by sample but also by chromosome. So in this case it was patient number
9, and there seemed to be an inordinately large number of somatic mutations on chromosome X, so we looked
more closely and found that the genotype on chromosome X was one allele.
We knew the patient was a male. But the tumor had two alleles in the same
precise location, as seen in this table. So what's happening here is there's a copy
number variation. This is an important item, especially when looking at cancer genomes:
you do have copy number variation. So there's a Y chromosome deletion, X chromosome
duplication; you can check by FISH, we did not.
But we know that something (inaudible), so it's worth investigating the underlying regions
where we find alterations, as in patient 9 on chromosome X.
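One way to catch the patient 9 pattern automatically is to flag chromosome X sites where the normal shows one allele but the tumor shows two. The sketch below is an assumption about how that check could look: the genotype dictionaries, positions, and function name are all hypothetical, and it ignores the pseudoautosomal regions, where two alleles are expected even in a male.

```python
def gained_allele_sites(normal, tumor, chrom="X"):
    """Positions on `chrom` with a single allele in the normal but two in the tumor.

    Both inputs map (chromosome, position) -> tuple of called alleles.
    """
    flagged = []
    for (c, pos), normal_alleles in normal.items():
        tumor_alleles = tumor.get((c, pos))
        if c != chrom or tumor_alleles is None:
            continue
        if len(set(normal_alleles)) == 1 and len(set(tumor_alleles)) == 2:
            flagged.append(pos)
    return sorted(flagged)


# Made-up genotypes for a male patient.
normal = {("X", 1000): ("A",), ("X", 2000): ("G",), ("7", 500): ("C", "C")}
tumor = {("X", 1000): ("A", "T"), ("X", 2000): ("G",), ("7", 500): ("C", "T")}

flagged = gained_allele_sites(normal, tumor)
print(flagged)
```

Many flagged sites on X in a male sample point to a copy number change rather than a run of independent somatic mutations, so they are better reviewed together than counted one by one.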
So we went through various steps: we have a discovery screen and then we have a validation screen.
So we selected tumors, and we did both the tumor and the normal sample.
It is worthwhile doing the normal in parallel rather than relying on dbSNP, for various reasons
that Jim, Jamie and Les were talking about.
So here is an example where we used the 14 samples and we found over 300,000 variants
compared with the reference sequence. Once we applied the normal information, it
went down to 5,000. So the point here is that it's worth doing the normal
as well. Here is the validation screen.
Then you go on to validate interesting genes in a larger number of samples.
I put an X here for the number of samples that you have; the more the better.
Here you need to compare gene mutation frequency to the expected background.
We'll get back to this later in my talk. It's important to do this because this allows you
to find out whether the genes that have the mutations are candidate cancer genes. Again,
I'll emphasize the discovery and validation scheme; for budgetary constraints it's worth doing exactly
like this: going from discovery and then applying validation,
scaling up just a subset of the interesting genes.
So let's go into the filtering process. The number of potential somatic mutations is
seen here. We filtered against dbSNP and went down to 141,000.
We applied the somatic filter: 58,000. Now,
some of the alterations that are actually polymorphisms could have been missed, and could come up in another normal
sample, not necessarily the matched normal. It's worth taking that into account because
this is going to be a polymorphism. When you find that alteration in another normal,
you filter that out; then you remove the non-coding variant alterations, and you're down to 5,000 alterations.
This is recapitulated in this part of the slide.
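The filtering steps above amount to a sequence of set subtractions followed by a restriction to coding variants. Here is a minimal sketch under the assumption that each variant is represented as a `(chromosome, position, ref, alt)` tuple; the function name, labels, and toy variants are illustrative and not the pipeline actually used in the study.

```python
def filter_cascade(tumor, dbsnp, matched_normal, other_normals, coding):
    """Apply the filters in the order described; return survivors and per-step counts."""
    remaining = set(tumor)
    counts = [("called in tumor", len(remaining))]
    for label, known in (
        ("not in dbSNP", dbsnp),
        ("not in matched normal", matched_normal),
        ("not in any other normal", other_normals),
    ):
        remaining -= set(known)
        counts.append((label, len(remaining)))
    remaining &= set(coding)  # drop non-coding variants last
    counts.append(("coding only", len(remaining)))
    return remaining, counts


# Toy variants with made-up coordinates.
v1, v2, v3, v4, v5 = (("chr1", p, "A", "T") for p in (100, 200, 300, 400, 500))
survivors, counts = filter_cascade(
    tumor={v1, v2, v3, v4, v5},
    dbsnp={v1},
    matched_normal={v2},
    other_normals={v3},
    coding={v1, v4},
)
for label, n in counts:
    print(f"{label}: {n}")
```

Keeping the count after every step gives the same kind of running tally shown on the slide and makes it obvious when one filter removes far more than expected.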
We have about 5,000 alterations. Now when you apply the coverage ratio of 0.5, we remove 18%,
so you have about 4,400 alterations, and this is the way the alterations can be divided: by
missense, nonsense, insertion, deletion, and synonymous. It's important to keep track of
these synonymous mutations as well, because then you can find out the ratio of non-synonymous
to synonymous mutations. This is important because the ratio is expected
to be 2:1 if there's no selection. However, if there's a significant difference from the 2:1
ratio, it suggests there's been selection for these mutations, so this particular gene
has a role to play in the cancer. So obviously it's worth tracking down the synonymous
mutations. What kind of data do we need to evaluate the
drivers and passengers? Which have a role to play, and which are just
there and have a neutral effect? This is a challenging question being dealt
with a lot by the field. And this is an attempt to answer the question.
It can be answered by statistics, bioinformatics and functional studies.
So if we look at statistics first: the non-synonymous to synonymous ratio.
If you look at the full exome study it was 2:1, suggesting most alterations identified
are passengers. Next, mutations above the background mutation rate.
When you do exomes, track this down. The mutation rate is the number of mutations
per megabase of DNA, derived from the exomes.
For melanoma the number is 11.4 mutations per megabase. The next is to see whether you have hot spot
mutations, meaning the same exact alteration in the exact
same amino acid in different samples. The likelihood of this happening by chance is low, so again
this suggests it would be a driver. Then look for the highly mutated genes.
So first, the hot spots: in our case we found nine genes with recurring mutations.
So obviously this is 14 samples, so we scaled it up in additional sets and went out to a larger
number of samples. This is a summary of the hot spots and how they
scaled up. So this is a non-synonymous alteration.
It occurs in our sample sets six times; in one of the cases it's found in a commercial cell
line. It is worth including commercial cell lines in the exomes and the validation step because
that allows you to do functional studies and also allows the rest of the community
to actually have a sample that has the alterations that you identified and validated, and also to do functional
studies. So another interesting point is we found a synonymous
mutation that scaled up. It occurred in exactly the same position in
three different samples. This is interesting because, although it won't affect
the protein per se, it is being selected. So it is worth capturing these as well, and
we may be following up on them. This is a schematic of a hot spot.
You want to see whether this is happening by chance. In order to calculate this you need the background
mutation rate that we talked about before, and you obviously take into account the number
of samples that were sequenced. So in this case this likelihood is extremely low, so
this is being selected for and is worth probing functionally. We know this protein is a
histone acetyltransferase, and when disrupted in mice it causes lethality and defects in
cell cycle progression. Bioinformatics is worth applying in this case:
you can see that the particular altered residue is highly conserved in various orthologs.
So suggesting it is very important. It's hard to predict which gene
will be found, but if possible it is worth doing functional analysis to see and validate whether the mutation
has an effect on the actual function. This is an example of a functional study that
we did, where we knocked down this particular protein in cell lines that were wild type
or mutant for the protein using shRNA, and then checked the effects of this knockdown
on cell growth and apoptosis. When we knocked it down in cell lines that
had the mutation, we could see an increase in apoptosis when the cells were
grown in serum. But this does not happen when we knock down
the protein in cells that are wild type for this protein.
This is oncogene addiction: the cells are dependent on this particular mutation, and when
you knock it down they're more sensitive and they will die.
So it is worth trying to do functional studies on the alterations identified.
Highly mutated genes: in this case there were 16 highly mutated genes. What do I mean by
highly mutated? We looked to see which were mutated in at least two samples out of
the 14 exomes captured. It wasn't enough to look at the percentage of mutations; we also
needed to do a binomial test, taking into account the background mutation rate and the size
of the transcript of a particular gene. We found the 16 genes and we validated them, so
we scaled them up in the additional samples seen here.
So here you can see number 4. When you look at these, when you do this P
value calculation, always take the longest transcript of this particular gene, to push this
formula to the extreme. And then you take into account the background
mutation rate as well. So here is where you only consider the percentage
of the tumors being affected. When you don't consider that P value, if you
sort by percentage, these look highly mutated. You see titin: titin is the largest gene
in the human genome, and it's a false positive. Just because it's such a large gene and there is such
a large background mutation rate, it's going to be highly mutated of course.
However, when you start applying the P value and you sort your data, you get a completely
different table. In this case you get BRAF at the top of the
list, known to be highly mutated and to have a lower P value, and you get the additional
genes. You do have a control here, because when you scale
up in additional samples you can see there is concordance between the percentage
identified in the larger number of samples compared to the original 14 samples.
So we knew that we were probing in the right direction here.
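To make the P value idea concrete, here is a rough sketch of a binomial test of the kind described. The exact formula used in the study is not given in the talk, so this is an assumed simplification: every base mutates independently at the background rate, the gene size is the longest transcript in base pairs, and the question asked is how likely it is that at least k of n samples carry a mutation in the gene by chance. The 11.4 mutations per megabase and the 14 samples come from the talk; the two transcript lengths are invented for illustration.

```python
from math import comb


def p_sample_mutated(rate_per_base: float, transcript_bp: int) -> float:
    """Chance that one sample carries at least one background mutation in the gene."""
    return 1.0 - (1.0 - rate_per_base) ** transcript_bp


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def gene_p_value(mutated_samples, total_samples, rate_per_base, transcript_bp):
    """P value for seeing this many mutated samples under background mutation alone."""
    p = p_sample_mutated(rate_per_base, transcript_bp)
    return binomial_tail(mutated_samples, total_samples, p)


RATE = 11.4e-6  # 11.4 mutations per megabase, expressed per base

# Same observation (2 of 14 samples mutated), two hypothetical gene sizes.
small_gene = gene_p_value(2, 14, RATE, 2_000)
large_gene = gene_p_value(2, 14, RATE, 100_000)
print(f"2 kb transcript:   P = {small_gene:.3f}")
print(f"100 kb transcript: P = {large_gene:.3f}")
```

The same two-of-fourteen observation is unremarkable in a very long gene and much less expected in a short one, which is why sorting by P value instead of by percentage pushes a gene like titin down the list.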
Then when you look at the actual gene, it's well characterized but it was never shown to
be mutated; it is now highly mutated. If you focus in on this, we did what I talked
about earlier: looked at additional cohorts to see if we can see a similar percentage
in additional samples. When you look at different cohorts you see
these there. It is important to apply additional patient
cohorts. This is a schematic of the alterations in
this particular protein. I'm putting it up here because it's worth looking
at the particular alterations, once you have a gene of interest, in COSMIC. COSMIC is a
database developed by the Sanger Institute; it summarizes alterations in different cancer
types, so if you find a matching alteration in your gene of interest in COSMIC, it is likely to
be important. So it's worth going back to COSMIC.
In this case there are two alterations we found, and a third one in COSMIC.
It's likely this particular alteration is important.
Okay. What do we do with complex exomes?
We have been talking about using cell lines, but what happens once you start using the
fresh tumor? We don't have much experience with this; we
only started looking at it. We do know that when you apply the MPG-to-coverage
ratio you can see particular mutations in the normal tissue: we have the BRAF V600E in the normal, clearly
contamination of the normal sample with tumor cells, which is what I said could be a problem.
Also, probably heterogeneity; we haven't determined this, but that's something to keep in mind.
Probably when looking at fresh tissues, different algorithms and different filters will have to
be applied. So we have obviously been looking at a large
number of genes in the genome. With a lot of exomes, besides looking at genes, it's also worth looking at pathways.
This was done for pancreatic cancer, where 12 core pathways were identified as altered in this particular
cancer type. Which means if you look at two different patients,
even though they have different mutations in different genes, these still affect
the same core pathways.
So it is worthwhile applying your exome data
to pathway analysis and seeing what kinds of pathways come up as significant.
Now, is it worth delving deeper? We talked about 14 exomes; once you complete your
first exomes, how do you know whether it's worth doing more, or are we done?
So this is just a graph showing mutations per megabase for different cancer types.
Since we focus on melanoma, I have these numbers right here, derived from different studies.
What seems to be the case is that, at least in melanoma, there is a large number of mutations
per megabase, in some cases an order of magnitude higher.
It does seem worth doing more exomes, because there will be more passengers.
When doing exomes it's worth finding out what the mutation rate per megabase is, comparing it to
other cancer types, and seeing whether it is worth delving deeper.
What are the future challenges?
Finding the drivers versus the passengers
from all these alterations; how do we analyze and interpret the data?
The functional studies are pretty important, but how do we do this in a high-throughput
fashion? And ultimately, how do we apply all this data
back into the clinic? Once we have the genes we get the pathways,
but we could also integrate all this further and see how various pathways cross-talk.
So this is the kind of database we hope to develop from many, many exomes.
At this point I would like to acknowledge all these people in the slide. Thank you.