Regulatory Conservation and Divergence Among Organisms - Michael Snyder

Uploaded by GenomeTV on 28.06.2012

Male Speaker: So I'd like to go ahead and get started with
our first speaker who's Mike Snyder. Mike Snyder's a modENCODE researcher, he is the Stanford
Ascherman professor and chair of genetics at Stanford and one of his specialties is personalized
medicine. Thanks, Mike.
Michael Snyder: Great, well, thanks very much for having me
here and giving me a chance to tell you what we're up to in trying to look at regulatory information
across different organisms, and I think the modENCODE consortia, along with ENCODE, has
really provided some nice opportunities here.
Let's see. So, why would you want to compare regulatory information across organisms? Well,
first, if you do see conserved elements you know they're functional, and if they're functional
that can be quite useful to you; it implies something you might want to study, or understand,
if it gets mutated.
I think the other reason you want to look at comparison of regulatory information across
organisms is that you'd like to try and understand basic principles of design. What kinds of
circuits lead to what kinds of outcomes and, of course, ultimately how you modulate those
circuits and change outcomes is of high interest.
And lastly I would argue that, probably many more than this, but the third possibility
is that you'd like to get a general understanding of how we're similar and different from one
another, both within a species and between species. I think many of you may be aware
of this general question is just how much are we identical to one another, how much
of that is due to differences in, say, amino acids or the genes we have, or how much is
it due to the fact that we have the same genes but how they're regulated makes us different
from one another.
Along the lines of this last point, there is some controversy in the field about how
much is binding site, transcription factor binding site, conserved or differing amongst
different individuals. So we had a paper out along with Duncan Odom in about 2007 suggesting,
in fact, there is extensive divergence in binding sites amongst different organisms, and the
pilot phase of the ENCODE consortium had similar conclusions.
But there have been other papers out there suggesting that, in fact, binding site information
is incredibly conserved between different organisms or within the same species. And
so there is this unresolved issue that's there as well.
So what I'm going to talk about today is general characteristics and so -- I should
back up and say so modENCODE, along with ENCODE, does provide a really nice opportunity because
there's a huge amount of data to really get at this question. Although I have to admit
there's still many more questions to get answered as you'll see at the end.
Okay, anyway, here's what I'm going to talk about in today's talk. So first I'm going
to tell you a little bit about general transcription factor binding data, and this talk will really
just focus on transcription factor binding information. Jason will talk more about the
chromatin information which is obviously another aspect of regulatory information. And then
I'm going to give you a few vignettes, if you will, of some of the comparisons we've
done. This is a work in progress and some of those questions are just how much are the
features of binding site of different transcription factors and their binding sites similar or
different to one another. How much are the partners of transcription factors differing
between different organisms? And even a fundamental question of do transcription factors like
to bind the same types of genes between different organisms. And most of what I have to talk
about is worm, fly, and human comparisons but for reference I'll sprinkle in some
mouse ENCODE data as well.
Okay, let's start with the data sets. As they say there's really a lot of data sets out
there now with transcription factor binding, and since humans have been studied the most
with the most groups and the most amount of time, they have the most. So 700 different
data sets, for worm there's 236, flies there's 102, but there's a lot of chromatin data sets
here, again, we'll talk about that, and for mouse there's quite a few as well.
These represent a number of different transcription factors, generally 50 to 168. This just -- oops
-- this just shows that, in general, there'll be a lot of transcription factor binding on
one or two cell lines, although for a number of factors we can get them across different
developmental stages or different conditions. So for example, here's human you can see that
there's a lot of effort put on a few limited lines, but you will find a decent number of
data sets, tens of data sets across multiple lines. And the same is true for worms. A little
bit less for flies and mouse.
Okay, so all the data was processed using a very common pipeline for mapping, for calling
peaks and then for running through various quality control measures to make sure the
data sets are of high quality. And at the end we do come up with these data sets. There
are some additional data sets that are put aside that don't make this analysis, I'll
tell you about.
If we look within these data sets, how many so called orthologous data sets are there?
That is to say we have orthologous transcription factors, how many are in common? And there's
zero between all four organisms, so there's not huge amounts. There's the most between
human and mouse, not surprisingly because they're the most closely related. There's
a decent number between, say, human and worm, and human and fly, and so on and so forth.
And again, that's what you would expect because these things -- it's reflecting how close
the organisms are relative to one another. And if you look at the transcription factors
involved, in fact they are quite similar based on expectations; so red means you're 80 percent
or more identical in amino acid sequence, and you can see most human/mouse orthologs
are quite similar. And if you start taking the worm and comparing it to the fly/human, these
are really quite divergent with only one of them really being super close in terms of
its amino acid sequence.
So, within this first thing we did was analyze all of these, and sure enough there are these
hot regions which Mark told you about, I won't say a whole lot about them other than that there
are thousands of these things in the various organisms, and you can identify them by statistical
-- over representation of regions. And they do change across development, so that is one
of the conclusions that does come up. And so for example, there's only 212 of the hot
regions in worms that are similar between the various stages that were looked at, embryo,
first larval, second, third, and fourth. So, there's only a small number that are conserved,
most are stage specific and they do change quickly. So for example, the embryonic ones
are very specific to embryos, they will shift down. Many of them are lost as soon as you
have the L1, and likewise when you move from L3 to L4 you get a lot of stage specific hot
regions. Okay.
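The stage comparison of hot regions described above can be sketched in Python. This is only an illustration under stated assumptions: the talk says hot regions are called by statistical over-representation, whereas this sketch swaps in a simple count threshold (`min_factors`, a hypothetical parameter), and the region ids, factor names and stage data are made up.

```python
def call_hot_regions(region_factors, min_factors=3):
    """Regions bound by at least `min_factors` distinct factors.

    Assumption: a plain count threshold stands in for the statistical
    over-representation test mentioned in the talk.
    """
    return {region for region, factors in region_factors.items()
            if len(factors) >= min_factors}

# Hypothetical toy data: stage -> region id -> set of factors bound there.
binding_by_stage = {
    "embryo": {"r1": {"A", "B", "C"}, "r2": {"A", "B", "C", "D"}, "r3": {"A"}},
    "L1":     {"r1": {"A", "B", "C"}, "r2": {"A"}, "r4": {"B", "C", "D"}},
    "L4":     {"r1": {"B", "C", "D"}, "r4": {"A", "B", "C"}, "r5": {"A", "C", "D"}},
}

hot_by_stage = {stage: call_hot_regions(regions)
                for stage, regions in binding_by_stage.items()}

# Hot in every stage examined.
constitutive = set.intersection(*hot_by_stage.values())

# Hot in one stage and in no other stage.
stage_specific = {
    stage: hot - set().union(*(h for s, h in hot_by_stage.items() if s != stage))
    for stage, hot in hot_by_stage.items()
}
```

In this toy input only `r1` is hot at every stage, which mirrors the point that a small constitutive set sits alongside many stage-specific regions.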
First question is how conserved are the binding sites. Now worms, flies, and humans are quite
far apart so you can't really get syntenic relationships there, but you can ask some
general principles about this. Are they binding the same motifs, are they binding in the same
locations? So step one is to look are they binding at the same motifs, and this is the
work of Pouya Kheradpour from Manolis' lab, who has done this very systematically
across all the different motifs. And just looking at his data set it's over 80 percent
of them seem to be identical or very similar between the different organisms, that is between
flies, worms, and humans. This is just one example here of Blimp 1, you can see it looks
pretty similar between flies and humans.
But there are some interesting differences there and here's one case here from Alan Boyle's
slide, from our lab. So
this is C. elegans ZAG-1 and its homolog ZEB1, which actually have quite different binding
sites. Okay? And it's not simply a matter of well we only happen to look at one stage,
if you look at different stages you'll see that you get the same binding site for worm
across different stages, and so it's not really changing, but the bottom line then is that
this factor does have different targets in the two different organisms.
All right. So binding sites are usually conserved but not always. What about binding locations?
Well, we thought we'd start with the simplest case and if you compare orthologous factors
between human and mouse you can see are they binding near promoters or enhancers, or what
have you. So if you look at transcription factors they fall into three classes I would
say. There're those that love to be around promoters, there're those that love to be
around enhancers, which are -- tend to be fairly distant in humans and mouse, and then
there are those that like to be at both.
Now the way we have the relationship set up here, this is the TSS and that's 20 Kb
away, you can plot in a cumulative fashion whether something likes to be close or further
away. If it's red in this region, it means it likes to be around the promoter, and if
it's green that means it likes to be distal, typically at an enhancer.
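The cumulative view just described can be sketched as follows. This is a minimal illustration, not the pipeline used for the slide: the function names, the 1 Kb step and the single-chromosome toy coordinates are all assumptions. For each peak it finds the distance to the nearest TSS and then reports the fraction of peaks within each distance cutoff out to 20 Kb.

```python
from bisect import bisect_left

def nearest_tss_distance(peak_pos, sorted_tss):
    """Absolute distance (bp) from a peak position to the closest TSS."""
    i = bisect_left(sorted_tss, peak_pos)
    # Only the TSS just below and just above the peak can be the nearest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(peak_pos - t) for t in candidates)

def cumulative_fraction(peaks, tss, max_dist=20_000, step=1_000):
    """Fraction of peaks within d bp of a TSS, for d = step, 2*step, ..., max_dist."""
    sorted_tss = sorted(tss)
    dists = [nearest_tss_distance(p, sorted_tss) for p in peaks]
    return [sum(d <= cutoff for d in dists) / len(dists)
            for cutoff in range(step, max_dist + 1, step)]

# Hypothetical toy coordinates on one chromosome.
tss = [10_000, 50_000, 90_000]
promoter_like = [10_100, 49_800, 90_300, 10_050]   # peaks hugging a TSS
distal_like = [25_000, 30_000, 70_000, 65_000]     # peaks 15-20 Kb away

prox_curve = cumulative_fraction(promoter_like, tss)
dist_curve = cumulative_fraction(distal_like, tss)
```

A promoter-loving factor saturates in the first bin, while a distal factor's curve stays low until the larger cutoffs.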
So the bottom line is that if you start comparing human and mouse, you'll discover that, in
fact, in general they're clustered, obviously, by these are more proximal. And in general,
there's some trend towards this, but there was surprisingly a lot of variation in this.
When one digs into this you realize it's actually because we're comparing different cell types;
and this is interesting in its own right. So if you look within orthologous cell types,
so for example K562 and MEL cells, which are semi-orthologous if you will, they're cell
culture cells. You can see there's actually quite a bit of conservation in terms of location
of these orthologous factors, so these factors, again, are near the promoter and they're near
the promoter in mouse -- okay, both mouse and human.
Okay, so what that implies then is that these factors do vary a bit amongst different cell
types and, in fact, we do know that -- this is something from Carlos Araya in our lab,
and he's actually looked at the binding relationship of the different factors across different
development and different -- well just different developmental stages in this case. And the
bottom line is sometimes they're together and sometimes, very often in fact, they're
not together. We'd seen that from other studies from Wayzong [spelled phonetically] as well.
I should point out all this is done in collaboration with the team that I'll mention at the end.
So the bottom line is for certain kinds of factors you can see that they'll cluster based
on like these blue factors; these are different stages that like to be clustered together,
but there'll be times when they're further apart. [unintelligible] be binding with -- at
different locations and with different partners. Something I'll get to, in fact, in a minute.
So, just for completeness, we did in fact compare humans and worms and, again, the promoter-loving
factors for humans sometimes are promoter-loving for worms, sometimes not so. And we varied
this whether it's 20 Kb or 2 Kb; you get pretty much the same result.
Okay, so the conclusion then is that these orthologous factors, they're often binding
the same motifs but they're not necessarily in the same general locations across different
organisms. What about partners? Well initially, so we want to see whether the different factors
work together in different organisms. Excuse me. And so first we'd set up a general clustering
scheme and, in fact, we didn't see any relationship that is the co-associations of one organism
did not match up with another. So then we said, well let's make this -- let's define
the context of this so we decided to just look around promoter regions. So you look
around a 2 Kb promoter region, and you ask, are factors working together in these regions
and are they conserved across different organisms. So again, we first set up a co-association
relationship in a 2Kb region around promoters, and it's almost somewhat of a standard clustering
if you will, not quite standard, and we look for within species comparison to find relationships
and then from that we then look for those that are conserved across species.
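The two-step comparison described above (co-association within a species around promoters, then a check of which pairs hold up between orthologs) can be sketched like this. It is a rough illustration under several assumptions: the window is taken as 1 Kb on either side of the TSS, Jaccard overlap stands in for whatever clustering statistic was actually used, and the factor names, coordinates, scores and the 0.5 threshold are hypothetical.

```python
from itertools import combinations

def promoter_targets(peaks, tss_list, window=2_000):
    """Map each factor to the set of TSS indices with a peak inside the window."""
    half = window // 2
    return {
        factor: {i for i, tss in enumerate(tss_list)
                 if any(abs(p - tss) <= half for p in positions)}
        for factor, positions in peaks.items()
    }

def co_association(targets):
    """Jaccard overlap of promoter target sets for every factor pair."""
    scores = {}
    for a, b in combinations(sorted(targets), 2):
        union = targets[a] | targets[b]
        scores[(a, b)] = len(targets[a] & targets[b]) / len(union) if union else 0.0
    return scores

def conserved_pairs(scores_a, scores_b, orthologs, threshold=0.5):
    """Pairs co-associated in species A whose orthologs co-associate in species B."""
    kept = []
    for (x, y), score in scores_a.items():
        if score < threshold or x not in orthologs or y not in orthologs:
            continue
        key = tuple(sorted((orthologs[x], orthologs[y])))
        if scores_b.get(key, 0.0) >= threshold:
            kept.append((x, y))
    return kept

# Hypothetical toy data for species A.
tss_list = [1_000, 10_000, 20_000]
peaks_a = {"F1": [1_100, 10_200], "F2": [900, 10_500, 30_000], "F3": [20_100]}
scores_a = co_association(promoter_targets(peaks_a, tss_list))

# Hypothetical precomputed scores for species B, plus an ortholog table.
scores_b = {("G1", "G2"): 0.8, ("G1", "G3"): 0.0, ("G2", "G3"): 0.0}
orthologs = {"F1": "G1", "F2": "G2", "F3": "G3"}
shared = conserved_pairs(scores_a, scores_b, orthologs)
```

Pairs without an ortholog on both sides are skipped, which matches the restriction that only factors with orthologs in both species are compared.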
What I'm going to show you is this slide here which is the most complicated slide, I think,
of the presentation. So just focus on the right for the moment. Because we're going
to compare co-association of these different factors, if you will, with these different
factors within an organism and the only comparisons I'm making are the cases where we have orthologs
between, in this case, worms and humans. Okay, so we're going to compare relationships within
an organism and then we're going to ask which ones are the same between. Is that clear?
So, let's look here. If you take this factor EGL 38 and you say who does it like to be
associated with around promoters you'll see it has -- it turns out three partners, PHA
4, there's a PHA 4 data set here and here and it likes to be at both of those, also,
this one EFL1 and MDL1. So those are its partners, if you will, its so-called co-association partners.
We can then say which of these are conserved in humans? And here's the ortholog, EGL 38,
unfortunately, doesn't have the same name; so we should rename all these things, mind you.
And PAX5 is the same as EGL38, and what you notice is that when you look at its orthologs
here, you don’t see co-association, all right?
So we do it again, we get the same result. ZAG 1, which I mentioned before co-associates
with BLIMP 1, its homolog ZEB1 does not associate with that fellow there, and then
we do it again and get the same result. So we still didn't get any co-associations. So
that didn't sound so good.
So then Dan Xie and Alan Boyle said well, wait a minute, maybe we're asking too hard
a question. We're asking these things to co-associate across all the genome, initially, then across
all of the promoter regions. What if we start just asking in specific segments of the
genome, and this makes a lot of sense because it turns out different transcription factors
work together with one another at different gene targets, or different specific locations.
And so when you do this across a genome, or across a broad range of segments, you'll have
a hard time finding these relationships. So they came up with an idea to use these self-organizing
maps and this kind of approach was first started by Ali Mortazavi when he was
in Barbara Wold's lab, and he's now at UC Irvine.
And so the idea's the following: You'll take 1 Kb stretches, and within these 1 Kb stretches,
you look for enrichments, partners of factors and they [unintelligible] to be multiple partners
across, essentially, the whole genome; you're looking for enrichments. You then, basically,
generate these neurons which are -- show that big one there -- which are statistically significant
enrichments of these transcription factor co-associations. That's the general principle.
Clear? Okay.
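To make the general principle concrete, here is a very small self-organizing map in plain Python. It is a sketch under stated assumptions rather than the analysis shown in the talk: each 1 Kb segment is reduced to a 0/1 vector of which factors bind it, the grid size, learning rate and neighborhood width are arbitrary, no statistical enrichment test is applied, and the `PolIII_a` / `PolIII_b` labels are hypothetical placeholders.

```python
import math
import random

def best_unit(weights, row):
    """Grid coordinate of the neuron whose weight vector is closest to the row."""
    return min(weights,
               key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], row)))

def train_som(rows, grid_w, grid_h, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(x, y): weight_vector}."""
    rng = random.Random(seed)
    dim = len(rows[0])
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(grid_w) for y in range(grid_h)}
    total = epochs * len(rows)
    step = 0
    for _ in range(epochs):
        order = rows[:]
        rng.shuffle(order)
        for row in order:
            frac = step / total
            lr = lr0 * (1 - frac)                    # learning rate decays to 0
            sigma = sigma0 * (1 - frac) + 0.01       # neighborhood shrinks
            bx, by = best_unit(weights, row)
            for (x, y), w in weights.items():
                d2 = (x - bx) ** 2 + (y - by) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (row[i] - w[i])
            step += 1
    return weights

# Hypothetical toy input: one row per 1 Kb segment, one column per factor.
factors = ["CTCF", "RAD21", "SMC3", "PolIII_a", "PolIII_b"]
segments = [[1, 1, 1, 0, 0]] * 5 + [[0, 0, 0, 1, 1]] * 5

som = train_som(segments, grid_w=3, grid_h=3)
assignments = [best_unit(som, seg) for seg in segments]
```

Segments with the same binding pattern always land on the same neuron, and a neuron's weight vector can be read as the factor combination that its segments share.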
So first you do this with humans and you say, let's look at all the 1 Kb segments of humans
and see which ones like to be co-associated. And here's the answer, crystal clear, I know.
All right.
You look into these at the neurons and you see things that make sense. So for example,
in this neuron all the PolIII -- nearly all PolIII, there's a few other things that belong
there -- they're all together binding near one another as you might expect for PolIII
components. Same is true for some of these other things. Here's an enhancer type area,
which has all of these factors that like to be together. So you can start finding relationships
that come up. This is the regulatory code within an organism, again, makes a lot of sense.
Now can we find similar relationships between organisms? And first, we were thinking well
this is going to be complicated. You've got to find all of these and compare to all of
the other organism's. But then actually Alan and Dan came up with
a clever idea, let's just mix them all together. We'll take all of the worm 1 Kb segments,
all of the human 1 Kb segments, all of the factor binding information, put it all in
the same pot and let it self-organize. Okay, and if they're the same relationships they
should show up on the same neurons. That's the general scheme here. You take all the
binding information here, all the binding information here, you've got mouse, human
-- funny looking human, throw them together, let them self organize and see who belongs
together, and then see now if they're human specific ones you'll only get human specific
relationships on one of the neurons. If they're mouse ones there'll be mouse neurons, and
if they're mixed neuron you'll get a mixture of both and the ratio of mouse to human tells
you how much is mouse and how much is human.
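Once segments from two species have been pooled and assigned to neurons, the per-neuron species ratio described above is a simple tally. The sketch below assumes the assignments already exist (here they are typed in by hand) and uses a hypothetical 0.9 cutoff to call a neuron species specific; in practice the two species would likely contribute different numbers of segments, so some normalization would be needed.

```python
from collections import Counter, defaultdict

def neuron_species_mix(assignments, species):
    """For each neuron, the fraction of its segments coming from each species."""
    counts = defaultdict(Counter)
    for neuron, sp in zip(assignments, species):
        counts[neuron][sp] += 1
    return {neuron: {sp: c / sum(cnt.values()) for sp, c in cnt.items()}
            for neuron, cnt in counts.items()}

def classify(mix, specific_cutoff=0.9):
    """Label a neuron '<species>-specific' or 'shared' (cutoff is an assumption)."""
    labels = {}
    for neuron, fractions in mix.items():
        top_species, top_frac = max(fractions.items(), key=lambda kv: kv[1])
        labels[neuron] = (f"{top_species}-specific"
                          if top_frac >= specific_cutoff else "shared")
    return labels

# Hypothetical neuron assignments for nine pooled 1 Kb segments.
assignments = [(0, 0), (0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (2, 1), (2, 1), (2, 1)]
species = ["human", "mouse", "human", "mouse", "human", "human",
           "mouse", "mouse", "mouse"]

mix = neuron_species_mix(assignments, species)
labels = classify(mix)
```

The `shared` label corresponds to the mixed neurons in the talk, and the two species-specific labels to the single-species ones.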
We mix them together and this is what we get. We get a lot of species specific information
which -- there can be technical reasons for some of this, but the idea this is a neuron
with a lot of co-associations that are primarily mouse. The red ones are primarily human, but
you do get plenty of these yellow and orange ones which mean they're conserved across -- these
relationships are conserved across organisms. Okay?
These are the ones that are conserved. You can see things here, and I don't know what
a lot of these factors are, but I know this one makes a lot of sense. CTCF is one of its
good partners is RAD21, and SMC3, and this relationship, you can see, it's shared in
both humans and mouse because it's orange, yellowish-orange. Okay? So in fact that makes
sense. And you can find others as well.
And then there are these species specific ones; so the green ones and the red ones presumably
are species specific although there can be technical reasons for this and so we have
to look at that harder.
So then we say all right we can find relationships between mouse and human, that's good. We should
be able to do that. What about humans and worms? And lo and behold it actually does
work. So in humans and worms you can find -- again you'll find these species specific
-- green is worm in this case -- so species specific relationships for humans, the red
ones, but you find the yellow and orange ones, again, and these will be shared relationships
that you can see between humans and worms. Okay? So these kinds of relationships are
shared across organisms. So this is one way you can tease this out. And what makes it
special is it's not across the genome, it's a very specific location and now we need to
do the GO analysis to see exactly where those locations are.
We have done general GO analysis to see where the different orthologous factors bind at
various gene locations, and what we've discovered is the same general message I'm giving you
which is sometimes information's conserved and sometimes it's not. So this is a case
where these are the different worm transcription factors. These are the different GO categories,
and the bottom line is you can see that some factors and clusters of factors, if you will,
have certain functional relationships. You can do the same for fly and get the same result,
only the color has changed. Okay? And then you can compare the two between worms and
flies. Okay?
So here's a factor in worm, and you compare what factors it has in common based on the
GO categories of its targets. So are they binding the same kinds of genes? And the answer
is what I told you, sometimes yes and sometimes no. So this factor here on 52 has very similar
GO relationships with these four factors here, two of which are in fact orthologs, and two
which are not. Okay. So sometimes it shares with its ortholog, and sometimes with other
things, and the same is true for this relationship. And this is a complicated slide but the general
point is the one I just said.
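The question of whether two factors bind the same kinds of genes can be framed as comparing their GO profiles. The sketch below is one possible way to do that, not necessarily what was done for the slide: each factor gets a vector of enrichment scores over a shared list of GO categories, and factors are ranked by cosine similarity. The profiles, factor names and scores are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two enrichment vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_matches(query_profile, other_profiles):
    """Factors from the other species, most similar GO profile first."""
    scored = [(name, cosine(query_profile, profile))
              for name, profile in other_profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical enrichment scores over four GO categories, same order in every vector.
worm_factor = [5.0, 0.0, 3.0, 0.0]
fly_profiles = {
    "fly_ortholog": [4.0, 0.0, 2.0, 0.0],
    "fly_unrelated_1": [0.0, 6.0, 0.0, 1.0],
    "fly_unrelated_2": [3.0, 0.0, 3.0, 0.0],
}

ranked = rank_matches(worm_factor, fly_profiles)
```

In this toy case the ortholog ranks first, but a non-orthologous factor scores nearly as high, which is the "sometimes yes and sometimes no" pattern described above.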
So in conclusion, we've compiled a massive set of ChIP-seq data sets, which are out there
for everyone to use and we're glad lots of people are using them. We can use this to
look at regulatory variations within a species and Mark talked about some of that yesterday,
it was actually quite a bit of analysis of this and I think it's quite interesting because
regulatory information does change across developmental stages which is something you
can tease out from having worm and fly data.
And then lastly I can tell you across species that binding locations, co-associations in
particular, and even GO analysis can be different. The types of genes they're binding can be very
different across different organisms between the various homologous orthologs. So regulatory
information can be quite divergent.
So for future directions we need to understand what these rules are. We're seeing some conservation
and we're seeing some that's not, and we want to understand what those basic principles
are. And there's a lot of ways to attack this particular problem even with the data sets
we have. You can look at different kinds of relationships. One area that we're particularly
interested in is this idea of looking at things in terms of networks. How much can these principles
be deciphered based on regulatory networks. And Mark mentioned this yesterday, how
you can organize regulatory information into hierarchical networks, and then you can start
looking at various principles of what regions are most interacting, what regions have certain
properties, and one very interesting property related to conservation, either within an
organism or between organisms, is that when you have these various layers -- so this is
the top, the middle manager, if you will, and the lower levels, peons if you will, in
a regulatory network. Basically it turns out that there's much more conservation at the
top. That is these are -- I should say these are under negative selection at the top of
the network which is something that came out of the human ENCODE project and, of course,
this is what we now like to look at in terms of worms and flies, and see if that same general
principle is true. And then you can map the conservation information I just mentioned
previously about partners and other things based on the context of where you are in these
networks. So there's really lots of interesting ways to attack this problem. Okay.
So I gave you the conclusions already. Here, in fact, are the people who did all the work,
none of it was done by me. These are the people in my lab who did the work. Let's see, well
I'll zip through a bunch of these: Yong Cheng, Dan Xie, Alan Boyle, Manoj Hariharan, and
Carlos Araya, and Philip Cayting. These are the analysis folks, these are the people who
did a lot of the ChIP-seq and Valerie, where are you? There you are -- she's over here.
We've had merge people, double-dipping here. Okay. We work a lot with Mark Gerstein's lab,
those folks who did the analysis part are over here, at least some of them. Anshul Kundaje,
of course, is the hero in all of this, he processes lots of data sets and is up here.
I mentioned Pouya on the motif stuff, and of course they work with Manolis Kellis right
now, and then for fly work, we have various people, including Lijia Ma. We work a lot
with Bob's lab, Mihail Sarov and Tony Hyman. Lincoln
Stein's group, of course, is essential and we worked a lot with Stuart Kim's lab. So
I think I have time for some questions. Thanks.
Male Speaker: So Mike, when you look at the hot spots that
are constitutive, that are not developmentally different, do you find a DNA configuration that's
inimical to nucleosome formation? Is there a binding characteristic?
Mike Snyder: So do I find what? I
Male Speaker: A DNA sequence that doesn't bend very well,
so it would be disinclined to form nucleosomes?
Mike Snyder: Well, we haven't looked at the structural
aspects. That's a great question. We should look at that. I don't know if Mark or anyone
else has looked at that. We certainly have not. Yeah, to see whether it's got bent DNA
or poly-A runs and that sort of thing. Yeah, it would be a great thing to do. Great suggestion.
Male Speaker: Thanks. Very interesting analyses. This isn't
my real area of expertise, so this may be a naïve question, but in the self organizing
map that you showed initially for humans, that huge cluster of neurons with -- that
was very dense with multiple factors and things. I mean, what was all that? I mean, the two
you called out were -- looked to be -- looked to have far fewer members. I was just
wondering what that huge dense --
Mike Snyder: Yeah, so you can find -- you mean that one
region up above that was -- well we're on to Jason's talk --
Male Speaker: The very first picture you showed, it was
black and white and you called out the two --
Mike Snyder: Yeah, and then there were these regions with
lots of edges basically.
Male Speaker: Yeah, that whole cluster of many, many neurons
with --
Mike Snyder: Yeah, this can look at all possible relationships
so it can look at pair-wise relationships. I can't remember if we put a filter of three
or more for this one. We've done this several ways. So it can look at all possible 3-mer
relationships, if you will, as well as 4-mer, as well as 5-mer,
all the way up to 168 for the case of humans. Does that make sense? So some of them
will have lots of partners that could be working, and the goal is to look for them being enriched
in a 1 Kb region over the genome as a whole, and I'm sure we didn't do any of the fancy
genome structure correction stuff that Ben Brown does, but hopefully we're still very
healthy here. I don't know, did that answer your question though? So you're looking at
all possible combinations -- combinatory relationships.
Male Speaker: That was great, Mike. I haven't been on the
joint DAY-WD [spelled phonetically] calls in a little while -- it was great to see
Mike Snyder: That's right, all this was done in the last week
Male Speaker: Oh, okay, good. [laughs] But I'm a little
concerned about one part which is -- when you were showing the clusters of co-binding
in one of the groups that you called the enhancer group, there was RAD21, which is a cohesin
subunit, and that's something that is associated with insulator elements but it should be very
broadly distributed across all sites that have --
Mike Snyder: That's right --
Male Speaker: So I'm wondering about whether, you know,
the list should be culled for things like that or more --
Mike Snyder: Well this is what the data --
Male Speaker: -- importantly, why is that showing up, why
is something that should be more in Jason's area [laughs]. But just more broadly distributed
across all regions that are probably going to be active in the genome, why is that [inaudible]
that one?
Mike Snyder: Yeah. What's nice about this is that this
is unbiased. This pulls these things out and it is what it is. And it may not be what you
want to see. Having said that, we've looked a lot at actually RAD21 and CTCF and it's
clearly the guys that are near insulators, and others can comment on this as well. We've
looked at this a lot in humans and they're clearly about one-third of them are sitting
on enhancers and another third around promoters and such. So they're not just insulators,
they're really at different locations. CTCF and RAD21, they're pretty much together
nearly all the time. So in fact there are different classes of CTCF/RAD21, and I think
that's what that data set's telling you.
Male Speaker: Can I comment on that as well? So when we're
looking at the chromatin state annotations and how transcription factors associate with
that we actually find RAD21 clustering with CTCF in a specific insulator state, and very
rarely do we see it sort of enriched in binding in promoters and enhancers. It's found there
but it's actually not enriched there. So most of the enrichment is actually found in
insulator regions as you would expect.
Mike Snyder: Yeah, that might differ a little with some
of our results. We might want to talk about that later.
Male Speaker: But I think the point is that you can't use
that data to make the conclusion that RAD21 is specifically associated with those --
Mike Snyder: Oh, I hope I didn't say it was specific.
Male Speaker: And not [unintelligible]
Mike Snyder: All I'm saying is that this is a cluster of
things that are enriched together over the genome as a whole. So, yeah, there's plenty
of other RAD21 buddies out there, too.
Female Speaker: So, my question. I don't know much about self-organizing
maps but my question is: So a 2 Kb region is very different on a worm than human, right?
Because of the size of the intergenic sequence. So have you varied
that across the different types of organisms and have them re-socialize?
Mike Snyder: Yeah, so we've done a lot of that for the
promoter proximal part I showed you and we've done a limited amount of that for the self-organizing
maps. So we can do it more extensively and we will do it more extensively for the worm,
human self-organizing map. We've done 500 bases and 1 Kb for a number of the relationships.
We haven't gone from 100, which we will do, to much higher numbers and see what relationships
still hold and such. In our experience with some of this when we varied the windows it
didn't make a whole lot of difference but it is something good to do. Yep, Ross.
Ross Cagan: My question is so you made certain predictions
for example from human to fly, which I care about, and there have been many examples where
we've put -- or the community has put human transcription factors into flies and shown
that they drive at some level transcription in reasonable tissues and so on. Have you
compared that data to your predictions? So in some cases you have predictions that it
shouldn't bind, say in transcription factors, or that they use different co-factors and
so on. Have you actually compared them to the data that sits there and, I guess, in
the future you could actually make direct tests of those by throwing those back in and
seeing if they can drive transcription in a reasonable place.
Mike Snyder: Right, no, great comment and question. Or
comment I guess. Good question. So we can -- I think we would do it -- first, it suggests
some obvious follow-ups. We should see those that have different motifs, the prediction
is they probably wouldn’t rescue, or maybe we're not looking at the right cell types
or conditions. That's a possibility.
Ross Cagan: Unfortunately, those may not be published.
The negatives may not have been published, unfortunately.
Mike Snyder: Yeah, well, we should do -- you put it a case
where you get lots of positive results, then you can publish it. And you can do -- what
was the other comment? Oh, so you could do obvious follow-ups like that. Yeah, you would
say from the co-association none of this stuff should work. I think it comes back to the
point that was raised yesterday. We don't know which of these sites are functional or
not, we just know these are binding relationships. So, it's conceivable that some of the more
functional ones are primarily driven by one or a limited number of conserved things -- Jason,
I'm flipping through your talk here -- anyway, so it's conceivable -- two stories -- all
Anyway, the --
I'm not touching anything. I'm moving away from this. So it's conceivable we -- it'd
be nice to figure out which ones are really truly functional as far as driving genes because
that might help explain that relationship. And the other thing I think I could say is
that, you know, small amounts of activity can rescue phenotypes, and you pointed out,
they're almost always partial rescue. I'm very familiar with the yeast experiment. And
yes, they rescue but they're always wimpy -- you know, there're various grades of rescue.
And I think that might be what you're seeing because the partner relationships aren't
perfectly established.
Ross Cagan: Right, but you -- you do --
Mike Snyder: So that would be my [unintelligible] --
Ross Cagan: Yes, I agree and you -- but you make hard
predictions, or fairly hard predictions, about ones that work better or worse.
Mike Snyder: Yeah, no, I think this -- like any good experiment
it suggests more things to do. Okay.
[end of transcription]