Insights from integrative analysis of the D. melanogaster genome - Manolis Kellis

Uploaded by GenomeTV on 28.06.2012

Manolis Kellis: Thank you very much for the opportunity. This
is a real pleasure to be here. I’d like to tell you about the insights that we’ve
gained from the Drosophila modENCODE integration, and also follow up a little bit on a theme
that Mark set up of sort of reflecting how that has given us insights into human biology
and specifically human disease. So, what I’m going to present today is the results of the
Analysis Working Group and the Data Analysis Center, which you can see here is very friendly
and working closely together. And this has been really a tremendous adventure, I would
say, through integrated genomics. So, the folks doing the work are sitting here in the
audience: Ben Brown for example; Peter Park; Peter Kharchenko; Matt Eaton; Dave MacAlpine;
Eric Lai; Steve Henikoff; Casey Brown and Nicolas Negre from Kevin White’s group;
as well as, of course, Lincoln Stein, Gos Micklem, and the DCC, to name a few. And they’re
contributing from very different classes of functional elements, but, at the same time,
each of these folks has sort of really come together to contribute to the integrated analysis.
So, what I’m showing here is the work of many, many people.
So I would like to, again, use the same slide that Elise showed earlier to basically tell
you how we’re actually thinking about all these different classes of elements with a
lot of different assays coming together and giving us different glimpses of what is the
underlying sort of biological story of the genome. And you’ve also heard from the Data
Coordination Center -- this pointer doesn’t really work -- from the Data Coordination
Center as to how even putting all these datasets together has been a tremendous challenge, and
this is the thousand or so datasets that have been linked through the Science papers for
immediate download as of December 2010, with a lot of work from the DCC. And this is the
total tally as of today. You can see that the number of datasets has actually doubled
across the different types of regulatory items.
So Mark already did a great job in sort of introducing all these classes of elements
in terms of mRNAs, non-coding RNAs, microRNAs, siRNAs, piRNAs. Also, comparative transcriptomics,
looking at variants of chromatin as well as nucleosome turnover. Again, histone modifications,
chromosomal proteins, and understanding the complexes of replication, origins of replication,
timing and differential replication, and then transcription factors. So, when you start
simply overlapping all these elements together, what you end up with is this astonishing picture
where if you started out with, say, the protein coding annotation of the Drosophila melanogaster
genome, you would cover about 20 percent of the genome. And as you start adding these
small RNAs, 3' UTRs, non-coding RNAs, Pol II, TF binding, transcription factor binding,
insulators, additional bound proteins, Polycomb domains, origins of replication, enhancer
and promoter states, transcribed states, heterochromatic states, introns, you end up seeing what fraction
of the genome each of these elements covers by themselves.
And as you start piling them onto each other, you end up with more than 75 percent of the
genome or nearly 85 percent of the genome covered for both the overall genome, as well
as the conserved genome shown in red here; you see an even higher fraction of the genome
being covered. So what we’ve gone from is about 20 percent of the genome being quote
unquote “interpretable,” based on overlap with protein coding regions, to nearly 80
percent or 90 percent using these diverse assays. And what you also see is instead of
just covering the genome each base once, what we see is the number of assays that multiply
overlap these regions in multiple ways. So, for example, you see that about 5 percent
of the genome is covered by more than 14 different regulators and, you know, 65 percent of the
genome is covered by at least one; and, similarly here, you can see the number of overlapping
transcripts, the number of overlapping different classes of chromatin elements. And, then,
if you pile all these elements together, you have about 50 percent of the genome covered
by at least four assays, and 30 percent of the genome covered by at least eight assays.
So this is not just painting of the genome; this is a multiple painting by many colors.
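As a rough illustration of the tally described above (what fraction of the genome is covered by at least a given number of assays), here is a minimal Python sketch. The track names, coordinates, and the 20-base toy genome are invented for illustration and are not modENCODE data; a real analysis would work on interval files rather than per-base lists.

```python
def coverage_depth(genome_length, tracks):
    """Return depth[i] = number of tracks covering base i.

    tracks maps an assay name to a list of half-open (start, end) intervals.
    A track counts at most once per base, even if its own intervals overlap.
    """
    depth = [0] * genome_length
    for intervals in tracks.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, genome_length)))
        for pos in covered:
            depth[pos] += 1
    return depth


def fraction_covered(depth, min_assays):
    """Fraction of bases covered by at least min_assays tracks."""
    return sum(d >= min_assays for d in depth) / len(depth)


# Hypothetical toy genome of 20 bases with three made-up tracks.
toy_tracks = {
    "protein_coding": [(0, 4)],
    "tf_binding": [(2, 10)],
    "chromatin_state": [(8, 16), (12, 18)],
}
toy_depth = coverage_depth(20, toy_tracks)
for k in (1, 2, 3):
    print(k, fraction_covered(toy_depth, k))
```

Calling `fraction_covered` with increasing thresholds gives the kind of "at least one", "at least four", "at least eight" curve mentioned in the talk.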
So, for example, if you see this region here where we have several protein coding and non-coding
transcripts sitting here on the right, and then this large unannotated region on the
left, the moment you overlap modENCODE datasets on it, you see this increase in the coverage
of these coding regions, where you can see a lot of small RNAs and non-coding RNAs sort
of lighting up and coming from these transcripts. But you can also see these vast regions of
bound proteins, for example, that are happening in the middle of these genes here in regulatory
elements, as well as in the middle of this large intergenic region, which now has a new
gene model sitting right in it with many different regulatory elements that are annotated by
So what we’re looking at here is really much of what was promised by this encyclopedia
of non-coding elements, this encyclopedia of both coding and non-coding DNA elements,
and sort of the emphasis on these large non-coding regions. And Elise, just to clarify, I see
the timer went for 25 minutes, is that correct? Okay, just checking.
So, where do we go from here? So we can certainly pile all these elements together and sort
of count the amount of the genome that we’re covering, but what I’d like to tell you
about today is what do we gain by actually putting all these elements together across
these different data types? So, for example, we can start annotating coding and non-coding
genes and actually distinguishing different classes of transcripts by actually overlapping
the transcriptional information across different types of mRNAs, or different types of transcripts
in the cell, and also overlaying that with evolutionary signatures of the patterns of
change of these regions. So, for example, if we look here in this very small new transcript
that sits, you know, between these previously annotated genes, you see that, in fact, the
single transcript here contains, within a single exon, two independent, small peptides
on the order of 20 amino acids each, which we would never be able to recognize unless
we had both the extremely precise transcriptional evidence, as well as the comparative evidence
showing us that the patterns of change here precisely match what you would expect for
protein coding regions.
You can also go within extremely well-studied transcripts, such as these heat shock response
elements or these roX [spelled phonetically] orthologs that actually determine X chromosome dosage, and actually
discover new transcribed regions within them that actually fold into well-defined structures
that are evolutionary conserved. And you can also go within protein coding exons and discover
microRNA genes that are actually overlapping the protein coding regions and encoding both
amino acids as well as short regulatory RNAs that can actually target downstream genes.
You can also discover downstream of conserved stop codon regions of overlapping functions
here, where this serves as both 3' UTR of this particular translation termination region,
but you also have an alternative translation termination that simply reads through the
stop codon and translates these additional regions; and we found, actually, 300 of those
examples in the fly genome, and actually an additional four examples in the human genome
that, again, started out from the model organisms before we knew they also existed in humans.
But beyond these protein-coding regions we’d like to also annotate non-coding elements,
and in particular, we’ve been working with Gary Karpen and Peter Kharchenko from Peter
Park’s lab to actually annotate chromatin regulatory elements, such as enhancers,
promoters, and a diversity of different classes of regulatory regions. So as you heard in
the previous talk, DNA is wrapped around these nucleosomes, each of which is made up of eight
histone proteins and each of which has a long tail of amino acids that can undergo
post-translational modifications. And there’s a large number of these modifications creating
many distinct combinations of histone marks, which are very difficult to interpret because
you end up with a large number of genome-wide tracks mapping the locations of these modifications
across the genome. But we and others have developed algorithms for actually learning
the hidden chromatin states that are actually responsible for the observed combinations
of chromatin marks that we can learn completely de novo across the genome
and only then overlap with existing functional elements.
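To make the idea of hidden chromatin states concrete, here is a small Python sketch of decoding with a hidden Markov model. It assumes binarized marks with independent Bernoulli emissions per state, and every parameter below (state names, start, transition, and emission probabilities) is hypothetical and hand-set. The talk describes learning such states from the data; this sketch only shows the decoding step (Viterbi) once parameters exist.

```python
import math


def emission_logp(obs, probs):
    """Log-probability of a binary mark vector under independent Bernoulli emissions."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, probs))


def viterbi(observations, states, start, trans, emit):
    """Most likely state path for a sequence of binary mark vectors."""
    # best[s] = log-probability of the best path ending in state s
    best = {s: math.log(start[s]) + emission_logp(observations[0], emit[s])
            for s in states}
    back = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            src = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            best[s] = prev[src] + math.log(trans[src][s]) + emission_logp(obs, emit[s])
            pointers[s] = src
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters; each bin is observed as (H3K4me3 present, H3K27me3 present).
STATES = ["active_TSS", "polycomb", "background"]
START = {s: 1.0 / 3.0 for s in STATES}
TRANS = {r: {s: (0.8 if r == s else 0.1) for s in STATES} for r in STATES}
EMIT = {
    "active_TSS": [0.9, 0.05],
    "polycomb": [0.05, 0.9],
    "background": [0.02, 0.02],
}
bins = [(1, 0), (1, 0), (0, 0), (0, 0), (0, 1), (0, 1)]
path = viterbi(bins, STATES, START, TRANS, EMIT)
print(path)
```

The decoded path assigns one state label per genomic bin, and those labels are what can then be intersected with known functional elements.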
So, in collaboration with Peter Park, as well as Jason Ernst from my group, we basically
set out to annotate the chromatin states of the Drosophila genome. So, first of all, this
is a picture by Peter Park that actually shows how each of these chromatin marks is in fact
very strongly positionally biased to be in different places with respect to different
classes of genes, or origins of replication, or specific insulator proteins, or specific
transcription factors. So you can see here that each combination of chromatin marks,
in fact, perhaps, uniquely defines each one of these regions in a very consistent way
across the whole genome. So we can use that, for example, to discover new, and surprising
sometimes, classes of elements. For example, we can use that to define promoter signatures
based on the presence of, for example, H3K36, H2B, H3K4 trimethylation, that is associated
with both promoter regions as well as transcribed regions. And you can also use that to actually
define new classes of elements. For example, we actually found a new class of H3K36 mono-methylation
marks that are associated with replication origins, and then collaborated with Dave MacAlpine
to map those across the genome.
You can also systematically learn combinations of these modifications, as I mentioned earlier.
So this is learning not just the specific combinations, but also the intensity of each
of these combinations, and you end up with nine different chromatin states, which we
can simply refer to as active transcription start sites where you can see a high intensity
of H3K4 trimethylation, and to a lower degree, H3K4 dimethylation and H3K9 acetylation.
And then this is extremely enriched for TSS-proximal regions. You can define active exons in the
elongation elements, active introns, both enhancers as well as intergenic, based on,
again, specific combinations of these chromatin marks and at specific intensities with respect
to each other. There’s a specific chromatin state that, in fact, defines male X genes
that contains H4K16 acetylation, and is, again, specifically enriched for, you know, this
class of genes. You can define Polycomb-repressed elements that are marked by H3K27 trimethylation,
heterochromatin elements that are marked instead by H3K9 trimethylation, and other basal and
repressed elements in the genome.
So this now gives us a handle for going off and annotating the genome using these large
classes, but we can also go further and actually define more discrete states where specific
combinations are marked, are defined regardless of the intensity, and we’ve used that to
actually define 30 different chromatin states which correspond to these nine states as you
see here. And when we intersected those with a very large array of functional elements
such as nucleosome solubility, HOT spots, nucleosome turnover, different classes of
insulators, different classes of histone deacetylases, HOT regions, early origins, origins of replication
that [unintelligible] early, regions of origin recognition complex binding, as well as different
classes of transcription factors. What’s really astonishing is that these histone modifications
alone can actually pick out each of these classes of elements in different states, suggesting
that, in fact, chromatin is encoding a much more diverse array of functions than previously
thought. It is not just encoding active and inactive regions. It is, instead, encoding
a vast array of different classes of annotations.
So we can now go beyond just simply annotating regions of the genome. So before we looked
at both coding and non-coding transcripts, and then we looked at regulatory regions.
We can now start connecting these regions together to actually piece together regulatory
networks that Mark Gerstein alluded to earlier. So the first thing that we can do is, in fact,
learn the hierarchy of this network by actually combining transcription factors and microRNAs,
and then I’m going to switch to these regions of high occupancy. So just looking at the
physical regulatory network, namely which transcription factor is actually physically
contacting which target gene, we again find a hierarchical structure where most of the
links are, in fact, pointing down and only a small number of links are pointing up, if
you arrange the regulators in this particular way. And a very interesting picture emerges
if you include microRNAs in this picture that are shown in red surrounding these transcription
factors, which is that the feedback from the bottom layers of the hierarchy to the top
layers of the hierarchy is, in fact, predominantly happening through microRNA regulators that
are increasingly targeted by the bottom layers of the hierarchy, and increasingly targeting
the top layers of the hierarchy, which is rather surprising. And this is also found
if you study the specific structural motifs of this regulatory network; these recurrent
patterns of connectivity where you see these cascades of transcription factors targeting
each other and then feedback coming back through a microRNA layer, just like we see here. So
both at the low level of the network, where you study specific patterns of connectivity,
and at the high level of the network, where you study the overall hierarchical layout,
you can, in fact, see this feedback of regulatory information from transcription factors through
microRNA to other transcription factors, namely the master regulators that Mark was talking
about earlier.
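The hierarchy argument above can be sketched in a few lines of Python: given a layer for each transcription factor, count how many TF-to-TF edges point down versus up, and list the paths that return to a higher layer through a microRNA. The node names, layers, and edges here are a made-up toy network, and the layer assignment is taken as given rather than inferred.

```python
def edge_directions(levels, edges):
    """Count TF-to-TF edges pointing down, up, or within a layer.

    levels maps each transcription factor to an integer layer (0 = top).
    edges is a list of (source, target) pairs; edges touching nodes that
    have no layer (for example microRNAs) are ignored here.
    """
    counts = {"down": 0, "up": 0, "same": 0}
    for source, target in edges:
        if source not in levels or target not in levels:
            continue
        if levels[target] > levels[source]:
            counts["down"] += 1
        elif levels[target] < levels[source]:
            counts["up"] += 1
        else:
            counts["same"] += 1
    return counts


def mirna_feedback(levels, edges, mirnas):
    """List TF -> microRNA -> TF paths that end in a higher layer than they start."""
    targets_of = {}
    for source, target in edges:
        targets_of.setdefault(source, []).append(target)
    loops = []
    for tf, targets in targets_of.items():
        if tf in mirnas or tf not in levels:
            continue
        for mir in targets:
            if mir not in mirnas:
                continue
            for top in targets_of.get(mir, []):
                if top in levels and levels[top] < levels[tf]:
                    loops.append((tf, mir, top))
    return loops


# Hypothetical toy network: three TFs in three layers plus one microRNA.
levels = {"tfA": 0, "tfB": 1, "tfC": 2}
mirnas = {"mir1"}
edges = [
    ("tfA", "tfB"), ("tfB", "tfC"), ("tfA", "tfC"),  # downward links
    ("tfC", "tfB"),                                  # one direct upward link
    ("tfC", "mir1"), ("mir1", "tfA"),                # feedback through a microRNA
]
print(edge_directions(levels, edges))
print(mirna_feedback(levels, edges, mirnas))
```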
So, again, just like in human and in worm, in fly, you find these regions of very high
occupancy by multiple transcription factors. So, if you look at the average number of transcription
factors found with each of the regulators that was profiled in the Drosophila genome,
you see that only a small number of factors are, in fact, binding alone. But most of the
factors are binding with another six or sometimes 10 partners. So -- and this is the median
of partners that every location that these transcription factors bound has, suggesting
that, in fact, there’s some regions in the genome that are just very, very widely bound,
and that’s where the name of high occupancy target regions comes from; and, in fact, this
term was coined by Kevin White in the Drosophila genome.
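A minimal Python sketch of the co-binding statistics just described: the complexity of a site is the number of distinct factors bound there, and for each factor we take the median number of partners across its bound sites. The site names, factor names, and the HOT threshold of three factors are invented for illustration; the talk does not state the cutoff that was actually used.

```python
import statistics


def site_complexity(binding_sites):
    """Map each site to the number of distinct factors bound there."""
    return {site: len(set(factors)) for site, factors in binding_sites.items()}


def median_partners(binding_sites):
    """For each factor, the median number of other factors at its bound sites."""
    per_factor = {}
    for factors in binding_sites.values():
        unique = set(factors)
        for factor in unique:
            per_factor.setdefault(factor, []).append(len(unique) - 1)
    return {factor: statistics.median(counts) for factor, counts in per_factor.items()}


def hot_regions(binding_sites, min_factors):
    """Sites bound by at least min_factors distinct factors (hypothetical cutoff)."""
    complexity = site_complexity(binding_sites)
    return sorted(site for site, n in complexity.items() if n >= min_factors)


# Hypothetical toy binding data.
sites = {
    "site1": ["tf1", "tf2", "tf3", "tf4"],
    "site2": ["tf1"],
    "site3": ["tf2", "tf3"],
    "site4": ["tf1", "tf2", "tf3"],
}
print(median_partners(sites))
print(hot_regions(sites, min_factors=3))
```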
What’s interesting here is that we can now bring in these different classes of functional
elements to help annotate and understand these HOT regions. The first thing that we can do
is, in fact, look at regulatory motifs, and what we can ask is, given a particular complexity
of a transcription factor which tells us the number of other factors that are binding with
it, are the regulatory motifs more enriched to the right or to the left of this class?
And what we’re finding is that most of the time there’s actually a depletion for these
regulatory motifs, suggesting that regions of increased complexity are less likely to
contain regulatory motifs, suggesting that as more and more transcription factors are
binding, you are less and less likely to bind to your specific motif, and therefore you
increase the non-specificity of binding, or the non-specific binding. You can also overlay
that with the chromatin state annotations that we had earlier, and what you’re finding
is that specific chromatin states are enriched for either HOT or cold regions of transcription
factor binding. And a striking finding comes if you actually overlap that with regions
of origin recognition complex binding. So, as you increase the complexity of a transcription
factor binding site, you also increase in an almost linear way the likelihood of binding
of the origin recognition complex, suggesting an intricate interplay between replication
and transcription factor binding that was previously unappreciated.
Now you can look within these regions of high occupancy and search for regulatory motifs
that are specifically enriched within these regions, and you end up with a large number
of specific regulatory motifs that are predominantly found within these high occupancy target regions
and that do not match other existing regulators, suggesting that perhaps a different class
of binding may be actually guided to these regions in a sequence specific way, and then
enabling this nonspecific binding by additional regulators.
And we’re finding a very interesting story that sort of mirrors that in the human genome.
When we actually study the interplay of transcription factors, regulatory motifs, and chromatin,
we find, first of all, that transcription factors show distinct chromatin preferences.
Different transcription factors shown here are, in fact, matching different classes of
chromatin states, and even though they’re all matching regions of open chromatin, they’re
matching differentially promoters, poised promoters, enhancers, weak enhancers, a class of enhancers
that’s, in fact, lacking histone modification marks, and so on and so forth.
Now, when we look at the regulatory motif preferences for each of these factors, we
do find, indeed, that the motifs for these transcription factors are enriched in the
regions of binding which is reassuring; but you see additional binding beyond these motifs,
namely you find that the motifs are, in fact, just like we saw in fly, depleted amongst
all regions of TF binding; they’re depleted in the regions of high occupancy, suggesting,
indeed, that, in addition to this specific binding, there’s some nonspecific
binding happening within these regions. And particularly surprising were these three chromatin
states that lacked histone modifications, that showed abundant binding, but also showed
no nonspecific binding. Therefore, these regions, in order to become permissive and bind without
a motif, they actually require the histone modifications. Open chromatin is not enough,
suggesting that open chromatin alone, again, is not sufficient information; that you need,
perhaps, chromatin regulators to recognize these marks to enable the nonspecific binding.
The state preferences also predict the pairwise transcription factor co-occurrence patterns
that we observed in the fly. So, now, if you correct for the chromatin state preferences,
you actually remove a lot of these TF co-occurrences that we observed across human, fly, and worm.
So we can use this information now to build predictive models of gene regulation. So,
before, we looked at physical regulatory networks of which transcription factor is actually
physically contacting what target gene, or, at least, what upstream region of a target
gene. We can now start building functional regulatory networks by actually integrating
all that information together. So we’re looking at motif instances that are conserved
across different species, and therefore more likely to be functional. We’re integrating
with that the physical evidence of binding, using chromatin immunoprecipitation. We’re
also using correlation information between transcription factors and their target genes,
both in terms of their chromatin marks, as well as in terms of their gene expression
patterns. And then we’re putting all of that into a learning framework that predicts,
given the vector of information across all of these different patterns, whether a particular
transcription factor is, in fact, targeting a particular target gene.
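One way to picture such a learning framework is a classifier that takes the evidence vector for a (transcription factor, gene) pair and returns a probability that the link is functional. The sketch below uses a small logistic regression trained by stochastic gradient descent. That choice of model, the three features, and the toy training pairs are all assumptions made for illustration, not the method or data used in the analysis.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that a TF-gene pair with evidence vector x is a functional link."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(features, labels, rate=0.5, epochs=1000):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = y - predict(weights, bias, x)
            bias += rate * error
            weights = [w + rate * error * v for w, v in zip(weights, x)]
    return weights, bias


# Hypothetical evidence vectors:
# [conserved motif present, ChIP binding present, TF-target expression correlation]
TRAIN_X = [
    [1, 1, 0.9], [1, 1, 0.8], [0, 1, 0.9], [1, 0, 0.9],     # made-up positives
    [0, 0, 0.0], [0, 0, -0.2], [1, 0, -0.5], [0, 1, -0.4],  # made-up negatives
]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]
WEIGHTS, BIAS = train_logistic(TRAIN_X, TRAIN_Y)

print(predict(WEIGHTS, BIAS, [1, 1, 0.85]))
print(predict(WEIGHTS, BIAS, [0, 0, -0.1]))
```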
So we can use that to actually define functional enrichments across the target genes of
the same transcription factor, namely, what we’re finding is that depending on what
gene cluster, gene expression cluster you are in and what transcription factor is
targeting you, you’re much more likely to have the same function, and we can see these
functional enrichments across different lines of evidence here. And we can use that information
to now predict the likely functions of genes that were previously unannotated. For example,
by observing that genes that are targeted by specific regulators are, in fact, involved
in cellular respiration, we can predict that additional genes that were previously unannotated
are also involved in cellular respiration. And if we do that in a cross-validation framework,
we actually have very strong predictive value for several of these annotation terms, enabling
us to predict more than a thousand new functions for previously unannotated genes in the Drosophila
We can also predict regulators based on the stage of development at which they’re acting.
Looking here at embryo, larva, pupa, and adult in different days or hours of development
here, you can see that at specific branch points, the expression of these regulators,
in fact, changes, just as the expression of their target gene changes, predicting, you
know, the action of specific regulators at specific branch points. And we can also develop
predictive regulatory models that use this targeting information from the functional
network to actually predict the expression level of target genes based on the expression
level of the corresponding regulators. So this is, for example, the true expression
pattern for the groucho gene, and these are five of its predicted regulators. So we can,
in fact, learn a function for each of these edges that predicts as a sum -- as a linear
function the expression level at each different stage of development for that gene, and we
can compare that to a random prediction based on a randomized network, and we see that, indeed,
the randomized network doesn’t do very well at all. And we can do that for a very large
number of genes; for example, here I’m showing the top 1,000 genes, or at least sampling
from the top 1,000 genes that are the best predicted, and you can see here both negative
correlations for genes whose expression is predicted based on repressors and positive
correlations for genes whose expression is predicted based on activators.
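The linear model described here can be sketched as ordinary least squares: the target's expression at each stage is an intercept plus a weighted sum of its regulators' expression. The sketch below solves the normal equations in pure Python. The regulator profiles are invented, and the target is constructed to follow the line exactly so the fit is easy to check; positive weights play the role of activators and negative weights of repressors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(regulators, target):
    """Least-squares weights (intercept first) for target ~ sum of regulators."""
    rows = [[1.0] + [reg[t] for reg in regulators] for t in range(len(target))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def predict_expression(weights, regulators, t):
    """Predicted target expression at stage t."""
    return weights[0] + sum(w * reg[t] for w, reg in zip(weights[1:], regulators))


def r_squared(weights, regulators, target):
    """Fraction of variance in the target explained by the fitted model."""
    mean = sum(target) / len(target)
    ss_res = sum((y - predict_expression(weights, regulators, t)) ** 2
                 for t, y in enumerate(target))
    ss_tot = sum((y - mean) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical profiles over six developmental stages.
activator = [0, 1, 2, 3, 2, 1]
repressor = [3, 2, 2, 1, 0, 0]
target = [1 + 2 * a - r for a, r in zip(activator, repressor)]

weights = fit_linear([activator, repressor], target)
fit_quality = r_squared(weights, [activator, repressor], target)
print(weights, fit_quality)
```

Running the same fit with regulators drawn from a randomized network and comparing the two `r_squared` values would be one way to set up the control mentioned in the talk.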
So, how does that all translate now to actually interpreting human disease? So, in the ENCODE
project we’ve actually similarly mapped these chromatin states using combinations
of chromatin marks across numerous human cell lines to actually define different classes
of enhancer, promoter, and transcribed regions, enabling us to now look at any region of the
human genome and, at a glance, observe its activity patterns across different cell types.
Now what’s really exciting here is that we can now use correlations between these
activity patterns, just like we did in the fly, to actually link together not just enhancers
to their target genes, but also enhancers to their likely trans-regulators that are
sitting upstream of them by actually studying these vectors of activity for gene expression,
for chromatin, for regulatory motif enrichment, and for transcription factor expression across
the different cell types. And we can use that to actually predict activators, as well as
repressors, for each of the cell types based on the joint action of the regulatory motifs,
the expression of the transcription factor, as well as the activity of the chromatin state.
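A simple way to picture linking by activity correlation is to compare an enhancer's activity vector across cell types with each gene's expression vector and keep the strongly correlated pairs. The sketch below uses Pearson correlation with a hypothetical cutoff of 0.8; the enhancer, gene names, and values are invented, and the actual analysis combined several kinds of evidence rather than a single correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def link_enhancers(enhancer_activity, gene_expression, min_r):
    """Return (enhancer, gene, r) for every pair correlated at or above min_r."""
    links = []
    for enhancer, activity in enhancer_activity.items():
        for gene, expression in gene_expression.items():
            r = pearson(activity, expression)
            if r >= min_r:
                links.append((enhancer, gene, r))
    return links


# Hypothetical activity across five cell types (1 = enhancer state present).
enhancer_activity = {"enh1": [1, 0, 0, 1, 0]}
gene_expression = {
    "geneA": [9, 1, 2, 8, 1],  # high where enh1 is active
    "geneB": [1, 5, 6, 2, 5],  # low where enh1 is active
}
links = link_enhancers(enhancer_activity, gene_expression, min_r=0.8)
print(links)
```

The same comparison, run between an enhancer's activity and a transcription factor's expression, gives a first guess at upstream activators (positive correlation) and repressors (negative correlation).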
So we can use that to actually define a number of activators and repressors for each of the
cell types, and we can now use that information of these predicted regulatory regions in each
of the cell types, and then the predicted linking between the transcription factors
and these regulatory regions, both downstream in terms of what enhancer is, in fact, targeting
what gene, and upstream in terms of what regulator is targeting what enhancer to actually start
interpreting disease association studies. For example, we find that the top scoring
regions for a genome-wide association study for systemic lupus erythematosus, in fact,
contain 18 genome-wide significant SNPs, six of which are, in fact, falling specifically
within the GM enhancers that we have defined using these chromatin states. And if you look
within one of them in this particular example, you find that the SNP that is associated with
the disease phenotype is, in fact, disrupting a predicted causal motif for the Ets1 regulator,
and therefore resulting in the inactivation of this particular enhancer and likely changing
the expression of the downstream gene, which is predicted to be a target of this region
based on our activity profiles.
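The idea of a SNP disrupting a motif can be sketched by scoring the reference and alternate alleles against a position weight matrix and comparing the best match in each. The 4-bp matrix below is hypothetical (it is not a real Ets1 matrix), as are the sequence and the SNP position, and a uniform background of 0.25 per base is assumed.

```python
import math

# Hypothetical position probability matrix for a 4-bp motif (not a real Ets1 matrix).
MOTIF = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def motif_score(window, motif=MOTIF):
    """Log-odds score (bits) of one window the same width as the motif."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(motif, window))


def best_score(seq, motif=MOTIF):
    """Best log-odds score over all windows of the sequence (forward strand only)."""
    width = len(motif)
    return max(motif_score(seq[i:i + width], motif)
               for i in range(len(seq) - width + 1))


def snp_effect(ref_seq, snp_pos, alt_base, motif=MOTIF):
    """Change in best motif score when the base at snp_pos becomes alt_base."""
    alt_seq = ref_seq[:snp_pos] + alt_base + ref_seq[snp_pos + 1:]
    return best_score(alt_seq, motif) - best_score(ref_seq, motif)


# Hypothetical SNP: G -> C at position 3 of a sequence containing GGAA.
delta = snp_effect("TTGGAATT", 3, "C")
print(delta)
```

A negative change suggests the variant weakens a predicted binding site, while a positive change suggests it creates or strengthens one, as in the repressor motif example that follows in the talk.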
So we have automated this so we can now do that for any one region. We can, basically,
read off our regulatory map that Ets1 is a predicted activator of GM cell lines; that
Gfi1 is a predicted repressor of K562. In this particular case, the disease-associated
variant is, in fact, creating a motif for the repressor Gfi1, which is then predicted
to repress the activity of this enhancer region, and therefore lead to inactivation of this
particular gene, and therefore lead to the disease phenotype. And we can also, of course,
leverage information from comparative studies across 29 mammals in this particular case,
where the specific SNPs that are disrupting conserved instances of regulatory motifs are
much more likely to be associated with disease. We’re automating this process in a tool
anyone can use called HaploReg, where you can actually go and mine the entire ENCODE
database of regulatory annotations across specific regulatory motifs, binding of specific
regulators, DNase hypersensitivity across 80 different cell types, and the chromatin
state annotation maps across the nine cell types, as well as conserved elements across
the 29 mammals, and, of course, the coding and non-coding gene annotations in order to
interpret for every SNP that’s associated with a disease which of the neighboring SNPs
might actually be responsible for the disease phenotype.
So, overall, what I want to leave you with is that you can, in fact, use this type of
information to annotate coding and non-coding regions; to annotate chromatin regulatory
elements; to define networks of regulator targets and their downstream genes; and to
build predictive models of gene regulation. And putting all that together, in the example
of human disease, you can use that to actually annotate non-coding SNPs and also link them,
both to the upstream transcription factors that bind them, as well as the target genes
that they regulate.
So, ultimately, our goal is to be able to use that information to systematically annotate
human disease, and what that requires is a systematic understanding of gene regulation,
where we can predict for every coding or non-coding mutation in the genome exactly what the functional
implications are likely to be, and that’s what the goal of the ENCODE project is. So
what we’re doing now is comparing fly and worm, and, of course, comparing that to human,
and we’ve done a lot of work in defining these orthologs, and tomorrow you’ll hear
a lot more about sort of how each of these stories plays out when you compare flies,
worms, and human.
So, ultimately, I believe that model organisms can be extremely powerful for actually understanding
the relationship between genotype and phenotype because we can study at a systems level the
effect of these functional elements and selective pressures on trait-associated regions. And
also, given the powerful genetics and the short time spans, we can use them for systematic
mutations, as well as drug screening.
So, again, this has been a wonderful collaboration with the entire Analysis Working Group, across
Drosophila and worm. So I acknowledged some of the key contributors at the beginning of
the talk, and this is the full set of authors here for our integrated paper. You can see
the stars here of a very large number of equal contributors because this has really been
an incredibly sort of collaborative team effort. And again a set of PIs here that are all sort
of, again, equal contributors. So I’ll stop there and take questions.
Male Speaker: So this sort of gets to the importance of
doing the functional analyses, and so Barbara Wall [spelled phonetically] and others, specifically
looking at transcription factors --
Manolis Kellis: Can you speak into the mic?
Male Speaker: I’ll have to lower myself. [laughs]
Manolis Kellis: To our level.
Male Speaker: So, anyway, the point is that the transcription
factor binding seems to be excessive, right? There’s a lot of sites in the genome; I’m
not talking about HOT regions, I’m talking about, you know, tens of thousands of sites
but only a few of them actually seem to regulate gene expression. And so what I’m specifically
wondering about is in terms of using SNPs for doing the analysis of human disease, how
much that confounds the analysis, and do we need to have depletions of the transcription
factors and an analysis of how that affects transcription before we can actually do the
linkage to human disease?
Manolis Kellis: You have a fantastic point, and I -- so, as
you’ll notice from -- I mean, in the next phase of ENCODE there have been several technologies
that are funded for actually systematically validating the functional consequences of
our regulatory elements. And we have been involved in the development of one of those
in collaboration with Dr. Nicholson [spelled phonetically], where we can now test thousands
of enhancers that are designed from scratch, using plasmids, certainly, and therefore,
it’s, you know, the native chromatin context might not be the same but we can now test
the effect of individual mutations on the expression of downstream genes in reporter
assays. What we’re finding is that the causal motifs that we’re predicting here will,
in fact, disrupt the expression of downstream genes in a reporter assay, suggesting that,
in fact, those specific motifs are responsible for controlling enhancer activity, in a way.
And, in fact, if you do neutral mutations within the binding site itself, then you maintain
activity of the downstream gene. If you shuffle the binding site, or if you make a mutation
that changes the high information content bases, then you, in fact, disrupt downstream
So I think this is one type of technology where you can test individual regulatory elements
in isolation. I think what we’re going to see in the future is -- and there’s a lot
of technologies underway for doing that -- is ways to actually massively test elements in
their native chromatin context, to actually integrate them into the genome. And I think
that’s, in a way, one of the powers of the model organisms, that, in fact, you can do
that systematically. So I think we’re in for many surprises on the technological end.
Every, you know, few years we think that, wow, that was a great few years, but I think
looking forward, we have much, much more to expect.
[end of transcript]