Variant Annotation and Viewing Exome Sequencing Data - Jamie Teer

Uploaded by GenomeTV on 14.10.2011

Questions will take place at the end of the session.
>> Can everybody hear me okay? Is that better?
Still no. How about now?
Any better? About the same?
All right. So I'll try and project and speak loudly so
that you can hear what I'm saying. Please, if you can't hear, wave, say something
in case I -- so you can hear me. So I'm Jamie Teer.
I'm a post-doctoral fellow working with Les Biesecker and Jim Mullikin. I work on technology
aspects, including informatics analyses, and on end-user analyses, trying to determine
which variants could be biologically interesting. So today I'll talk about what we're doing
with variant annotation and introduce the topic of viewing exome sequencing data. Like
Jim, I'll talk about many different software tools available and tell you about some of
the ones we are using, but that doesn't imply any type of endorsement or anything like that.
So you heard from Jim about sequence generation as well as the alignment and calling of genotypes,
and this will often be done by a sequence provider for you.
Certainly alignment and calling of genotypes can be complex; that can be further refined,
and is often done by informatics teams. Today what I'll talk about are the next steps:
first, annotation of the variants, and then I'll introduce analysis of these variants and tools
you can use to look at your data. These steps require some informatics experience,
but hopefully they can be done by end users who want to look at their own data.
So I'll frame these general considerations as sort of who, what, and where questions.
We'll start with where the reads are: I'll briefly talk about programs to view the
alignments, because there is value in looking at the more raw data, and I'll tell you
a little more about that. We'll talk about effects of variants, consequence:
determining the context and what predicted detriment your variant might have.
Who else has the variant? We'll talk about variant databases, looking
at sequencing data across many different populations. Finally, how can this be done?
I'll talk about pipeline software to pull these tools together and generate, as a pipeline,
a final set of data. With a brief shift, we'll talk about how to identify
important variants, and I'll introduce the tool we have to allow end users access to the
data. The tools that I'll talk about require varying degrees of computational and
informatics expertise.
So we'll talk about -- there will be a range of tools, and some of these are easier to use
and require less informatics experience. This generally means a graphical interface
with buttons and lists and things. Those tools will be represented by the
picture of the laptop here. On the other end of the spectrum, more challenging tools
require much more experience in Linux; these tools tend to be command-line driven,
so you type things out at the command line. These tools will be represented by a picture
of a server. So let's start with where are the reads.
Are the reads aligned correctly? What does the environment look like?
Jim showed a picture where the alignment looked messy, and that gives us a clue that
something might be going wrong in that region, so there is value in looking at your raw data.
A low-level approach is to look at the file format.
One file format we use, which is becoming quite common, is the SAM/BAM format. This file
will contain 20 to 100 million aligned reads per sample, so there's a lot of data, and
there are a variety of programs to view and manipulate it. Most are command-line driven,
and many are actually libraries for programming languages, so you need to write your own
programs to use them. You probably can't see the data here, and it's not important exactly
what's going on, but we do have reads; we have the sequences encoded, the quality scores
encoded in a special way, and the alignments.
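To make that "special encoding" concrete, here is a minimal Python sketch (with a made-up read, not real data) that splits a single SAM alignment line into its fields and decodes the Phred+33 quality string:

```python
# Hypothetical SAM alignment line (tab-separated); field layout per the SAM spec.
sam_line = "read1\t0\tchr1\t10468\t60\t5M\t*\t0\t0\tACGTA\tIIIII"

fields = sam_line.split("\t")
record = {
    "qname": fields[0],      # read name
    "flag": int(fields[1]),  # bitwise flags (strand, pairing, ...)
    "rname": fields[2],      # reference (chromosome) name
    "pos": int(fields[3]),   # 1-based leftmost mapping position
    "mapq": int(fields[4]),  # mapping quality
    "seq": fields[9],        # read sequence
    "qual": fields[10],      # base qualities, Phred+33 encoded
}

# Decode the quality string: ASCII value minus 33 gives the Phred score.
phred = [ord(c) - 33 for c in record["qual"]]
```

Real BAM files are binary and compressed, which is why libraries like SAMtools exist; this only illustrates the text (SAM) layout.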
Clearly, looking at this is perhaps not terribly informative; you're really just getting
lists of reads, and that's not so useful. Some tools are more useful. I apologize if it
doesn't show up so well, but this is an example of a program, part of SAMtools, called
tview. This is something we use quite a bit because it's very fast.
You have dots and commas illustrating forward and reverse reads that are the same as the
reference. When you see something different from the reference, you see the base.
So similar to the slide Jim showed, when you have reference and non-reference bases,
that is a heterozygous variant.
We have one over here as well. There are colors here that give you a sense
of the quality, just as a quick look. Another thing you can get from looking at this:
in this particular case, you'll notice that when there is a variant here, there is not a
variant on the same read over here. These variants are on different chromosomes.
This program is very fast; it is text based, and that adds to its speed. However, the
text-based nature also really limits functionality. It has very basic functionality;
in our case it's a go-to tool to spot-check variants and regions.
So, a different tool, more user friendly: the UCSC Genome Browser. Many of you are familiar
with this browser. It does now accept BAM files, and shown here
are the alignments. You can see lots of different reads; again, the two positions I showed you
previously. The view is a little bit dense, though.
The UCSC browser really allows you to view your data together with the UCSC tracks,
so that's powerful. You do need a public-facing server to hold the
data: what happens is the UCSC browser looks at your data to fetch what it needs, and that
takes expertise to set up. The viewing options are a little more limited; you can't change
too much of the views, but it is still a powerful approach.
is called IGV, the Integrative Genomics Viewer from the Broad Institute.
I should point out that if you downloaded the presentation, the last three slides have
links to all these tools. If you can't see those slides, you can search the name of the
tool, add "genomics" to it, and find it. IGV is similar to UCSC: you have your ideogram
and your chromosome location, and down here you have gene models, so you can see where
variants and reads align compared to genes. Here are the substitutions we were looking
at before. The nice thing about IGV is that you can zoom in closely to your data, zoom
out, and highlight the reads to see quality. Here they have a nice histogram of the depth
you're looking at, so this is a very powerful tool.
It allows zooming. You can highlight to get more info.
It has many features, and it was designed to be integrative, so it takes many kinds of
genomics data. So I encourage you to look at this one.
It does have a web launcher in addition to being able to run locally.
So it seems to be easy to use. Now I'll get more into the meat of annotation.
You have been given a list of variants. What information is there?
Are they in a gene? Are they coding? Could it be a detrimental change?
These are the questions we seek to answer. When you get your list of variants, you're
getting just that: a chromosome, a position, and the change. By itself there's not a lot
of information there, especially biological information about potential function.
So the annotation really provides context. What you're getting then: we have this
position, and now we can see annotation. It's in a gene, this particular gene, it's
coding, it's in an exon here, the change would cause an amino acid substitution, and
here it's highly conserved; that's quite interesting. If you dig into the literature,
you might identify this position as associated with various diseases, and that way you
gain a lot of information about your variant. The first step of basic annotation has the
goal of determining variant context. All these tools do the basics of identifying
whether your variant falls within a gene; they will tell you if it's coding, what amino
acid position it's in, and what the amino acid change could be if it is a non-synonymous variant.
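As an illustration of the coding-context step these tools perform, here is a toy Python sketch. The sequence, offsets, and the tiny codon table are invented for the example and cover only these codons; real annotation tools use full transcript models:

```python
# Minimal sketch: determine the amino acid change for a coding substitution,
# assuming we already know the coding sequence (CDS) and a 0-based CDS offset.
# This tiny codon table covers only the codons used in this example.
CODON_TABLE = {"ATG": "Met", "AAG": "Lys", "AAA": "Lys", "GGC": "Gly"}

def amino_acid_change(cds, cds_offset, alt_base):
    codon_start = (cds_offset // 3) * 3          # start of the codon holding the variant
    ref_codon = cds[codon_start:codon_start + 3]
    within = cds_offset % 3                      # position of the variant inside its codon
    alt_codon = ref_codon[:within] + alt_base + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind

# A T->A change in the first codon: methionine to lysine (non-synonymous).
change = amino_acid_change("ATGAAGGGC", 1, "A")
```

Here `amino_acid_change("ATGAAGGGC", 5, "A")` would instead be synonymous (AAG and AAA both encode lysine), which is the distinction the annotation column reports.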
Each tool offers a little bit extra. ANNOVAR offers splicing annotation, common variant
formats, intergenic descriptions, and things like that. CDPred offers a conserved domain
prediction, which I'll talk about in a second. SeattleSeq offers a broad variety of
annotation tools, so many different data sets you can compare your variants to, to see
what could be going on; really, I think this one has the most features.
And then this one here, SnpEff, integrates with GATK, an analysis pipeline written by
the Broad, as well as with Galaxy, which I'll talk about more later.
This one can read and write a more common format, the Variant Call Format, to describe
your variants and their annotations. Some of these programs run locally.
ANNOVAR and SnpEff run on local data and are therefore very, very fast.
CDPred uses local scripts, but it accesses an external server for the annotation
information. This one here is external, so you have to upload all your data to their
server. That's something to be aware of: this particular tool's terms of service say it
is possible that someday the data may be released, so if you have private data you need
to be aware of where you're actually putting it and what that might entail.
We have some idea of what the variant is; now, consequence: how detrimental is the
variant? Most tools are amino acid centric, but certainly, as more and more information
about the genome becomes available, I definitely hope to see tools that go beyond amino
acids and are able to predict consequence in other regions.
I'll start with SIFT. SIFT uses the degree of conservation among related proteins to
predict a detrimental effect. PolyPhen uses a variety of features, including sequence
conservation and structure; if there's a known structural model, this tool takes that
into account to determine whether a variant could be causing a biological effect.
CDPred uses the Conserved Domain Database, so if a variant changes away from the
ancestral position, that's predicted to be detrimental. HGMD is a different type of
tool, not really a prediction tool but a database: a curation of the literature and
locus-specific databases that is constantly updated. I believe you'll hear more about
this later, so I just want to mention that though it is subscription based, NIH has a
license for it.
So if you go to the website on the slide, it gives you instructions to set up an account
and access those data. These tools are powerful; they can guide an analysis, but they
are not perfectly predictive, so the information you receive from them should be
carefully considered, not just blindly used to determine what your most important
variants could be. This is just an example of the types of annotations we do and what
the data look like. Here we are using ANNOVAR, CDPred, and HGMD; this is a screen shot
from our program, VarSifter, which I'll talk about later.
Every row is a different variant. Here are the variant types: we have synonymous and
non-synonymous single nucleotide variants and intergenic regions. Gene names are here:
which gene does the variant fall in? Here is the consequence, so in this case we get the
gene name, a transcript ID, the exon number, the cDNA change, and the protein change;
here we have a nucleotide change giving a methionine-to-lysine substitution.
Here are the predictions, the CDPred scores. For this particular predictive tool, a
lower score is predicted to be detrimental; the synonymous ones are not predicted to
cause much difference, but some of these numbers are quite low.
If you look at HGMD, now we have diseases associated with these variants. In this
example, the CFTR gene is associated with cystic fibrosis, and some of these detrimental
variants are identified as being associated with the disease.
Looking at all this information together, you can begin to see how you might start to
sort and filter the data and identify positions that might be biologically relevant to
a disease or phenotype of interest.
So now we have talked about where the variant might be and what consequence it might
have biologically. Now I would like to talk about who else has the variant.
Is the variant common in a population? For example, if you are studying a rare dominant
disorder and your variant is common in another population, that variant is not causing
the disorder, or else all those people would have it as well. Is it seen in other
populations, in certain populations, or has it been observed in a disease cohort?
There are many studies now looking at certain disease cohorts, and this information can
be helpful when deciding whether your variant is important.
So here are human variation databases available; this list will definitely grow, I
believe, in the next few years. We'll start with dbSNP.
It includes everything: common variation, but also variants associated with disease,
really everything. The SNPs do have information about origin, so you can see the project
in which they might have been identified, and when the projects are larger, you can get
frequency information: how common was it in a particular population?
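The frequency idea can be sketched in a few lines of Python; the diploid genotype strings here are hypothetical, not database output:

```python
# Sketch: allele frequency of a variant from per-sample genotypes, assuming
# diploid genotypes written as two-character allele pairs, e.g. "AA", "AG".
def allele_frequency(genotypes, alt_allele):
    alleles = [a for gt in genotypes for a in gt]   # flatten to individual alleles
    return alleles.count(alt_allele) / len(alleles)

# 10 samples, one heterozygote: the alternate allele is 1 of 20 chromosomes.
gts = ["AG"] + ["AA"] * 9
freq = allele_frequency(gts, "G")
```

A frequency like this, from a suitably matched population, is exactly what lets you argue that a variant is too common to cause a rare dominant disorder.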
You may have heard of the 1000 Genomes Project. This is a project designed to generate
a lot of sequence for a large number of populations across the globe.
The current data set has more than a thousand low-coverage whole genomes, and low
coverage limits sensitivity, so the most rare variants might not be detected. They say
their sensitivity is for variants at 1% frequency or higher, so pretty good, but the
rarest of the rare variants won't be in that data set. They have also sequenced a large
number of exomes; those will be high coverage and therefore will have the power to
detect the rarest variants. Then there's the ClinSeq project, which you have heard
about. Currently we have data for 650 exomes, with a plan to increase to 1,500.
These individuals have extensive phenotypes, which will be deposited in dbGaP, and soon
we will be releasing data from these individuals to dbSNP as well. The NHLBI exome
sequencing project currently has 2,500 exomes, with phenotypes to be deposited into dbGaP.
This will be in dbSNP, and VCF files will be available. So I have given these a big
server and a little laptop icon: most data sets come as flat text files or some other
format and require informatics tools to access the data. However, some do have web
interfaces, and this will be true for some of the other tools I talk about. The web
interfaces are easy to use, but they generally only allow you to check one variant at a
time; the facilities to check hundreds or thousands of variants aren't as easy to use.
So these can certainly be used in a web-based way, but that might not be as useful as
getting to the command line to examine thousands or hundreds of thousands of variants.
Just to highlight a personal story in our group regarding dbSNP: in the search for the
cause of a syndrome, we had identified a variant in AKT1.
This was a non-synonymous variant, like the ones I showed before, highly conserved, but
there was a known SNP at that position. So we looked at that SNP carefully and found
that the SNP is a deletion, whereas our change was a substitution.
We're fine. We don't have to worry about that.
But in the latest dbSNP build there's a new entry that is the exact substitution we
observed. Had we filtered out the dbSNP positions, we would have missed the cause of the
syndrome. It's a great resource, but you have to be careful when you use it, because it
is not exclusively common polymorphisms. How are these things run?
Is there a graphical interface? How do we tie it all together to go from the genotype
calls down to annotated data? I'll briefly discuss what we're doing at NIH. This is a
bioinformatics-heavy approach: we write scripts in Perl and use a variety of tools to
make this run smoothly, and even then it requires a large degree of experience and
hand-holding to watch the data and make sure things run correctly.
So we start with a sample genotypes file. This is basically the genotypes, as Jim
described, for one sample. The first thing we do is identify the variants in each sample
and make a list of variants. We then take the positions in this list of variants, go
back, and determine genotypes from all the different files at those positions.
Really, the point is to say: at this position in this sample, we have a non-reference
genotype; or we don't have a non-reference genotype, because we have a reference
genotype; or we can't tell, because we don't have enough data.
Being able to say it's not a variant because we know it is reference is very powerful.
We can then determine frequencies and actually say with certainty that this variant is
rare in our sample set or in our population. We call this process back genotyping, and
it takes a little computational time to do, but we find it to be quite valuable.
So now we have a file of back genotypes: all variant positions and genotypes for each
sample, reference or non-reference. Now we undertake the annotation process.
Again, we take the data and calculate annotation based on the three programs I talked
about, then we merge it back with the file. We calculate frequencies in our population,
get them from other databases, and merge those back with the file as well. We end with
an output genotype file that's currently in a structured, tab-delimited text format,
but in the future we hope to offer the more common file format being accepted more and
more, the Variant Call Format (VCF) file. So that's great if you have a lot of
experience; a lot of big sequencing centers roll their own pipeline, they create their
own. But is there another way? A team developed a cool piece of software
called Galaxy. This is web-based software; it can be run locally on your own servers or
through a publicly accessible web page, I believe at Penn State. The basic idea of
Galaxy is to allow investigators to use a wide variety of tools for DNA sequence
analysis through a graphical user interface.
So the idea is that you load your data over here; you can then choose many different
annotation, or really analysis, programs, including in this case amino acid changes.
Here is SIFT, which I talked about, along with a wide variety of others.
You then pick one of the tools you want to use, select the data set that you loaded,
select some other options, and then click Execute. This will submit the job to run in
the background on the Galaxy servers and return the data here.
It's nice: this gives you a history of the tools you have applied to your data.
You can even tie together different tools to make what they call a workflow, which is
really a pipeline that can be run over and over again on different files.
Galaxy is also quite nice in that they have spent time integrating with the Amazon
cloud. As Jim told you, it really takes a lot of computational hardware to do these
types of analyses; sequencing costs beating Moore's law is frightening to folks like me.
However, the integration with the Amazon cloud means you don't necessarily have to buy
all this hardware and maintain it; you can set up an account with Amazon and basically
rent their servers. Galaxy is then basically installed for you, and you can access it
as your own cluster. That's a very powerful approach. Finally, I'll say the team behind
Galaxy has made an effort to make this easy to use: there are a variety of tutorials,
how-tos, and videos describing how to use the tool to make it as user friendly as possible.
Just to show a few features: it has next-generation sequencing tools, including a
variety of tools to help you deal with SAM and BAM alignment files, and a lot of tools
for text manipulation. I've sort of hinted that a lot of the data that comes out of
bioinformatics analysis is textual in nature. In many cases you want to rearrange
columns or reformat the data output from one program in order to be able to input it
into another program, and Galaxy offers you some ways to do that as well.
Okay.
So I've described how variants are annotated; at this point you have a list of variants
and genotypes, as well as a large number of annotations with which to determine
biological importance. So now, how do you identify important variants? One approach is
user-independent prioritization; the other is the opposite: you as the user have
generated a career's worth of information, so use that information to determine which
variants could be biologically relevant. And of course: are the tools easy to use?
We'll start at the low-level file format.
The Variant Call Format used by the 1000 Genomes Project is a text format. There are
ways to compress it to make it a much smaller format, and very fast to use, but those
tools are generally libraries for programming languages. The details are not important,
but it's quite dense, and reading it by eye isn't really terribly useful.
A structured text file, something we often work with, contains a header line telling
you what's in the file, with columns separated in some standard way, including
annotations, samples, and different types of information. The advantage of this type of
file is that it can be loaded into spreadsheet programs like Excel. People have
certainly done that successfully, but we have found some features wanting in Excel, so
there are other approaches. At this point, as Jim said, for each sample you're looking
at 70,000 to 150,000 variants, and each sample you add brings more variants, so we're
talking about a large amount of data. You certainly can't do biological experiments on
each variant, so how can you reduce this number to something much more manageable?
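The structured, tab-delimited file described above can be read with very little code; the column names and the row here are hypothetical:

```python
# Sketch: reading a structured, tab-delimited variant file whose header line
# names the columns. Column names are invented for this example; "." marks an
# empty database ID, as in VCF.
header = "Chr\tPos\tRef\tAlt\tType\tGene\tdbID"
row = "chr7\t117199644\tA\tT\tnon-synonymous\tCFTR\t."

columns = header.split("\t")
variant = dict(zip(columns, row.split("\t")))
variant["Pos"] = int(variant["Pos"])   # positions are numbers, not strings
```

Once each row is a dictionary like this, the sorting and filtering operations described next are one-line comprehensions.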
The first kind of tool is a variant prioritization tool that is simply run, so it does
not require any user input. One tool is called VAAST; it prioritizes variants using a
probabilistic approach in a case-control, GWAS-style framework, but it also includes
more information: it uses amino acid substitution information to decide how detrimental
a change might be. It uses aggregation, a neat idea where you look at a region of the
genome and combine the variants in that gene region, considering them as one entity.
For rare variants this helps: you account for any rare variant in a gene to gain more
power. This tool can also use inheritance information. The tool is free for academic
research use, but check the license to make sure it applies to you.
VarMD is a tool you'll be hearing about later; it uses inheritance patterns, studying a
certain type of inheritance to identify variants that fit your model. It's available on
Helix, in the Galaxy development section, for NIH folks.
Finally, I'll talk about a tool we developed to help those with limited informatics
experience get their hands dirty with this vast amount of data and really try to
identify what could be biologically relevant. The tool is VarSifter, and it allows the
viewing, sorting, and filtering of variants; it's available inside NIH at this website
here. Basically, what we're looking at is a view of annotation. Each row in this view
is a different variant; you have your chromosome and position location information and
all the annotations I described: gene name, mutation type, database ID, and things like
that. You can click each of these columns to sort your data, so you can sort by gene
name or anything that's in this table. Clicking any row will show you the genotypes for
your samples in this window. Even if you have hundreds of samples, they're all
displayed and sortable, so you can sort by genotype and see who actually has the
variant; coverage and genotype quality score information are visible as well.
Over on this side you have the filters with which you can reduce the number of variants
under consideration. So I'll walk you through several examples using this tool.
this tool. First I would like to impress in this file
we have 76,000 variantsch this is a test file, two samples but really only one sample that
have copied and made a single artificial change. So presumably the am of data you have will
be much higher than this. 76,000 isn't a mall number so how can you
begin to get down to those variants most interesting. So let's start by fill filtering mutation
type. This is a list in the upper corner generated
from all the different types in your data file.
For fun we'll click on the most detrimental so we click on splice and click on stop boxes
here this is an important button that applies the filters you selected.
Multiple filters can be chosen in any combination and applying them will filter the data.
We have only stops an splice sites. You can sort any of these columns to look
at and look into what is going on here. And we have reduced the number down to 133.
That wasn't so bad, but 133 is still a lot. How can you reduce it further?
If we look, we notice a lot of variants previously described in a population database.
Given the caveats mentioned before, let's see what happens if we eliminate the things
already reported in this database. We have an exclude box, and we'll check
exclude-by-database-ID. Now the list is down to 25 variants. These are getting to be
numbers that are handleable, where you can do literature searching and think about
biological experiments.
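Those two filtering steps, keeping only stop and splice variants and then excluding anything already carrying a database ID, can be sketched in Python; the variant records and field names are invented:

```python
# Sketch of the two filtering steps: keep the most detrimental types, then
# drop anything with a known database ID ("." marks an empty ID, as in VCF).
variants = [
    {"gene": "CFTR",  "type": "stop",       "db_id": "rs0000001"},  # hypothetical ID
    {"gene": "GENE1", "type": "splice",     "db_id": "."},
    {"gene": "GENE2", "type": "synonymous", "db_id": "."},
]

severe = [v for v in variants if v["type"] in ("stop", "splice")]
novel = [v for v in severe if v["db_id"] == "."]   # exclude known database entries
```

Note the caveat from the AKT1 story above: excluding database IDs is convenient, but it can also throw away the causal variant.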
Say this filter failed; what can we do next? We clear the filters, back out to our
original list, and start again. This is the idea of the tool: to allow you to design
filters to dig deeper and deeper into smaller sets of data, examine the data, and if
nothing is apparently interesting, back out a little bit, or back out a lot, and really
just work with the data, diving in and backing out, to allow you to see what's
going on. Say you have information about your samples: you have pairs. This is the
thing to look at if you have a tumor-normal pair, where samples are the same except for
the important differences you're interested in. If you have multiple pairs, you can
determine how many you're interested in and dial in your sensitivity.
So now click the apply filter button. Here is the single artificial change I introduced
in this practice data set. That was great, but can you define this yourself?
You can: click on File, then Sample Settings, and you see a list like this that lists
your samples, then the status for each sample, and boxes you can check to set it.
Samples used to have to come in pairs, but cases and controls are more flexible: you
can identify any number of cases and any number of controls, but having at least one of
each is required.
All right. Now we will un-check affected/normal, check case and control, and hit OK.
Now down here a new filter is available: the case/control filter. This basically says:
I want to see positions that are variant in X or more cases but Y or fewer controls.
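That X-or-more-cases, Y-or-fewer-controls rule can be sketched as a small Python predicate; the sample calls here are invented:

```python
# Sketch of the case/control filter: keep a position if it is variant in at
# least `min_cases` cases and at most `max_controls` controls.
def passes(case_calls, control_calls, min_cases, max_controls):
    # calls are booleans: True means the sample carries the variant
    return (sum(case_calls) >= min_cases and
            sum(control_calls) <= max_controls)

keep = passes(case_calls=[True, True, False],
              control_calls=[False, False, False],
              min_cases=2, max_controls=0)
```

Loosening `min_cases` or `max_controls` is the "dialing in sensitivity" described here: it tolerates missed calls in cases or contamination in controls at the cost of more false positives.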
These numbers can be dialed in to adjust the sensitivity as you would like, and it will
identify the same artificial variant I introduced. What's next? Filtering by gene name.
Most of you have favorite genes you would be interested in seeing the variants in.
Down here is the search-gene-names box: type in the gene name, and here we have, in our
example, CFTR, the variants that lie in CFTR. Of course, if there were a gene called
CFTRQ, that would show up as well, because it does match CFTR. Using this box you can
use a syntax called regular expressions, which is a very powerful text searching
method. If you don't know what that is, I urge you to check it out this evening,
because it's cool and gives you power in searching, and it's supported here as well.
So what other types of things can you do?
Say you don't have one interesting gene; you have a whole list, genes in a pathway or
a gene family. You can create a text file, one gene name per line, load it into the
program, and identify variants that fall within those genes. Similarly, if you have a
list of genes you're not interested in (they have false positives, and you just don't
want to look at them), you can load the same kind of gene file and exclude those genes
from consideration.
If you have linkage regions or a chromosomal region that could be interesting, these
can be loaded as a BED file, and the program will filter to the variants that fall
within that region of interest. Those are options that come standard with
the program, but say there's something else you want to do.
Perhaps you have a VCF file. Unfortunately, the VCF standard doesn't specify a column
for the gene or mutation type, so the program won't know about those. But using the
custom filter, you can still filter on columns like that which the program is not aware
of. So what we're looking at here is the same part of VarSifter: a central window that
shows all the queries you have designed and the logic of how they link together. Here
we have sample options and, over here,
annotation options. Just to walk you through how this would work: we'll say View Custom
Query, and now we have a blank slate. We start with an annotation query, using this
example of a type of annotation the program does not know about.
We'll click on Type up here. Notice this becomes populated with information, and these
buttons are active. So we have chosen a type, and now we have to choose an action;
we'll say Exactly Matches. Now we have to choose something that it exactly matches.
This list will contain all the values observed in your file in that column.
You can select any one of these values, or type in some search text here, again using
the regular expression syntax. So in this case we will click "stop".
And now we have a query block here: type equals stop. So that's one query.
Let's add a sample query. We'll click Affected, and now we have action buttons
available. This time we'll choose Does Not Match. Now we can click a genotype
(homozygous reference or variant) or a different sample. In this case we had clicked
Affected and Does Not Match, and now we'll click Normal. So now we have a new block:
affected does not equal normal. You can use this to define your own Mendelian
inheritance filtering: parent one equals heterozygous, parent two equals heterozygous,
and the affected offspring equals non-reference, to allow you to identify homozygous
recessive variants.
Now we have to link them together: we draw a box around them, or shift-click them, and
they are highlighted. Down here we have logical statements, AND and OR being very
common (so A AND B, or A OR B). Here we'll click AND: affected does not equal normal,
AND type equals stop. You have to finalize the query; this is just like any other
query, so you check the box and you can combine it with any of the other filters I
showed you on the main page. In this case, were we to apply this query, we would get an
empty list back, because the single change I introduced was not a stop variant. So you
would back out again and start a different search.
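The AND-linked query from this example can be sketched as plain Python predicates; the genotype strings and records are invented, and this is not VarSifter's actual implementation:

```python
# Sketch: combining two query blocks with AND, as in the custom query example:
# (affected genotype does not match normal) AND (type equals "stop").
def affected_not_equal_normal(v):
    return v["affected_gt"] != v["normal_gt"]

def type_equals_stop(v):
    return v["type"] == "stop"

def and_query(v):
    return affected_not_equal_normal(v) and type_equals_stop(v)

variants = [
    {"type": "stop",           "affected_gt": "AT", "normal_gt": "AA"},
    {"type": "non-synonymous", "affected_gt": "AT", "normal_gt": "AA"},
]
hits = [v for v in variants if and_query(v)]   # only the stop variant survives
```

Swapping `and` for `or`, or adding predicates for parental genotypes, reproduces the OR queries and the Mendelian-inheritance filters described above.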
Consequence prediction guides the analysis, but it's a guide, not absolute truth.
The tools I have shown do require varying degrees of experience. There is more of an
effort now to make them easier to use, but certainly some personal experience, or
collaboration with those who have more experience, can be a valuable thing when running
these very powerful tools. Once you have annotation information, you can use
prioritization tools that return user-independent information, or you can do
user-guided analysis, using your own knowledge to determine which variants could be the
most interesting. Finally, I showed you a tool we have written to make this hands-on
analysis easy to use, and hopefully that will be helpful for you.
With that, I would like to close and say thank you very much.