Panel discussion: Use of modENCODE data for understanding basic biology


Uploaded by GenomeTV on 28.06.2012

Transcript:
Elise Feingold: Great. So we're now going to move into a panel
discussion, and what we have here, this panel is focused on basic biology and the use of
modENCODE for understanding basic biology. It's going to be chaired by Gary Karpen, and
there'll be additional speakers who Gary will introduce.
Gary Karpen: So welcome. There we go. Okay, so, first,
as Elise said, we are having a discussion, which means your involvement. We've sort of,
as you'll see, we've come up with some questions but that's pretty much it for us. We have
no more ideas. So, maybe I should just speak for myself. Please wave as I say your name.
So, yes, I'm Gary Karpen, I'm the moderator; the fact that Elise thought that I could do
anything moderately was --
[laughter]
-- cute. Dave MacAlpine, Brenton Graveley, Valerie Reinke and Susan Mango. Okay, so,
just to start out, the way we're going to organize this is we're going to have these
four questions, we're going to -- yes. Sorry, that's maybe somewhat of an in-joke. You
have to either be from New York, or Ashkenazi Jewish, or something like that. Anyway, we
have these four questions. We're going to have roughly 10 minutes total directed toward
each question so we'll -- each of us will get up and present the context for the
question, and then we'll go through -- have a discussion. We'll limit the total time to
about 10 minutes. So we're giving you this in advance so that if you really like question
four and have some ideas, wait until then and jump in. So, we're -- the four questions
are going to be: How can we quantitatively analyze chromatin protein localization dynamics,
genome-wide, at high resolution, in single cells? How do we develop tools and approaches
to quantitatively describe protein-DNA interactions and occupancy? How do we comprehensively characterize
the transcriptome? And what are the conceptual and practical barriers to effective use of
modENCODE data by the research community, for those of you who can actually see the
bottom. And the point is we're actually, as you can tell from questions, not really here
to talk about the beauty of modENCODE and what we've produced, but looking forward,
what's needed to advance the field, to advance our understanding of chromosome biology and
the other areas that modENCODE touches.
Okay, so I'll start out, and I'll just give a very brief introduction to the question
that I want to ask about dynamics, and that is that actually this is, since we have other
institutes here, I thought I'd point out this comes from a project funded by NIGMS, the
General Medical Sciences institute. And I'll just point out, you know, within cells you have,
as we've heard already, a couple of different types of chromatin: heterochromatin here in
green, euchromatin in red, and heterochromatin is marked by proteins like heterochromatin
protein one, discovered by Sally Elgin a while ago. And this just shows you that if you do
look dynamically, if you do live analysis with, for example, HP1a-GFP, you see the HP1
domain, the nucleus is bigger, and it's basically the cell was moving around -- whoops [laughs]
-- cell was moving around. After you irradiate, this domain just enlarges dramatically; 1.5-fold
volume increase, it starts very quickly, as soon as you can start to image, within three
minutes after radiation, it lasts about five hours. And you have all these dynamic protrusions,
which have now disappeared, even though it was supposed to keep looping.
So, the thing is, and this is the last little bit of data, is just basically you see these
dynamic protrusions and you see this dynamic expansion 1.5-fold over the normal volume,
but, in fact, when you do a ChIP experiment and you look at HP1a, and you look at untreated
30 minutes after irradiation, 60 minutes after irradiation, you still have HP1 in the heterochromatin.
Now this is good. That means this is a -- this is an expansion of this domain, not spreading
of the protein into euchromatin. But it also tells you that if I'm just doing a ChIP experiment,
I'm going to say nothing's happened, and in fact, a lot has happened. So this gives rise
to this question or basically the point that ChIP, the different Cs, Hi-C, 5C, 4C,
I don't know what's going to be next, really only provide static maps of cell populations,
and this is a real problem looking forward and thinking about how we're going to understand
function using the kind of data modENCODE produces. And so this raises the question,
well, I guess you could say, should we or can we quantitatively analyze chromatin protein
localization dynamics in a genome-wide basis, at high resolution, in single cells. And Susan's
discussion, I think, bears directly on this. But we're here talking about if we look at
the kind of data that we've generated from modENCODE, for example, and histone mark distributions,
how can you really analyze that dynamically in cells at the single cell level.
And so I guess we'll just open it up for a few minutes of discussion. Again, I would
encourage not just our panel members but the people in the audience to get up to the microphone
-- Manolis [laughs] -- and also if you have other questions, just these are launching
pads for whatever it is you'd like to discuss.
[laughter]
Manolis Kellis: Gary. So, I have to question the question
[laughs], since you ask me to. In other words, modENCODE has done so much with technologies
that were developed prior to modENCODE, and I think what you're asking is a huge break
in technology in terms of sort of transitioning from, you know, being on earth to being in
space, basically. And it's laudable, and it's visionary, and it's extremely useful, but
what I'd like to ask is, could you get there with existing technologies? In other words,
is there anything you can do to use chromatin immunoprecipitation, and use the Cs, and
so on and so forth, in order to get to aspects of this question that you can then reconstruct,
perhaps, computationally, or by integrating different data sets, and so on and so forth?
So I think I'd like to perhaps turn the question around and say, or preface it with, in absence
of, you know, sort of breaking technologies or of sort of non-continuous type of technological
development, can you envision a path with existing technologies of components of this
question that perhaps you could obtain?
Gary Karpen: Turn on the button, okay. No, I think the
point is that we can't do it with existing technologies. We can get there partly. Steve
Henikoff has done the CATCH-IT type of analysis for H3.3; you can map that and other kind
of variants across the genome. I think what I'm mostly trying to point out is that if
we want to understand genome function, which is the goal of the ENCODE project and many
of our own individual research, we need to go beyond what we currently have. And the
problems that I see now with current technologies are, we can do a lot with respect to populations
of cells, and ChIP for histone modifications is the best example, I think, where we don't
really have the resources to be tracking what's happening in real time, not just with cytology,
but with any of these other methods, and really gain an understanding of what is the basic
biology that's underlying what we're seeing. And so we are probably getting, and we'll
discuss this a little more next in a couple of minutes, we're probably getting misled
[laughs] by a lot of the -- not a lot but by the type of technology that we have available
today.
Sorry?
Gary Karpen: Sorry, the question was what about the PacBio
technology, or a lot of the nanopore sequencing methods that are coming down, which are single
molecule methods. This can tell you, yes, what's happening for DNA sequence at any particular
time, but it's not clear how one can do, for example, ChIP for histone modification. Where
the limitations are currently technical, which is how do you get antibody concentration high
enough to actually be able to do a ChIP, for example. So just having the sequencing capacity
doesn't answer that.
Yes, sorry. Sure.
Male Speaker: Well, there are methods for a single cell
methylation, for instance, but that -- the thing that bothers me is all the animal cell
work of the recycling. At one site, you know, or minutes you would -- they'd add a chromatin
modifier, take it off ahead of a sequence-specific site, take it off again. And that, on one
hand and the other hand, most of the literature suggests that there're bursts of synthesis,
such that you see the RNA levels of housekeeping genes are not constant from time to time or
cell to cell. So I think what you'd have to do is not only get it in single cells but
get it in an array of single cells in order to get the picture. But I think the exception
-- single cell methods are coming along for analysis.
Male Speaker: Yeah, I'm not sure I have a whole lot to add other
than that, I mean, there are groups working on this, and I think it has more potential
for histone modifications and such where you probably have, you know, multiple events over
a localized area versus, say, transcription factors which might be, you know, single mers
[spelled phonetically] or a few mers, and, to be clear, the efficiency of ChIP is quite low.
I do think it is plausible for, you know, again I know Sherm's [spelled phonetically]
group is working on it, our group's working on it, for getting this working for the histone
modifications but...
Male Speaker: Could I just add one thing? Again, there -- cross-linking
is the other issue, and I think Gordon Hager has used, with isolated nuclei, used laser
pulse cross-linking to get very instantaneous pictures.
Male Speaker: So, Gary, how high-throughput are some of
these microscopy approaches? Can you also look at -- you showed HP1, but H2Av and other
histone variants that might be involved here?
Gary Karpen: Yeah, I mean, you can certainly do the microscopy.
The problem is that you're still looking at every -- you're looking at the blob. [laughs]
You're not looking at a particular sequence. It is sort of similar to the question that
I asked Susan about, about the array. You, at least from my perspective, you want to
understand what's happening at the level of resolution that we can get for ChIP-seq,
say, rather than -- rather than -- but I'm not saying the cytology and the imaging
isn't useful; I think it's very useful and tells you a lot. But we seem to have this
gap, right, between what you can see in the cell and what you can see biochemically.
So maybe -- I think we should probably just move on. Save it, if we have time at the end
as people get tired. Dave, why don't you go ahead.
Dave MacAlpine: All right. So my name is Dave MacAlpine, and
I just want to talk to you guys briefly about defining occupancy. Occasionally, you know,
I'll have a student come to me, and they're like, "All right, Dave, you know, this is
my favorite gene, and obviously, it's regulated by factor A." And I'm like, "Well, how do
you know it's regulated by factor A?" And they'll say "Well, I looked at the modENCODE
browser and there's a black box over a, you know, factor A." And you're like, Okay so
you've got a black box but what does that really mean? And then you can look at that
in a little bit more depth, and you can see the two peaks here that I draw, and we've
got a big peak and a little peak, but what does the size of the peak really mean? So
this brings up another question. All right, we've got a factor bound there, but we're
talking about diploid genomes, right? So you've got two genes. Is it bound at both alleles
or just one allele? And oftentimes, as in work from the Snyder group, you can use SNPs to
detect differences in these, if you have a SNP right where your transcription factor
is binding. But in the case that you don’t, then you really still don't know whether you're
occupied at one allele or the other.
We also have the occupancy. These are all population-based studies where you do an enrichment, and again,
what does it mean? Do we have one factor bound at a handful of genes, or is it more commonly
distributed and bound to many of the same -- many copies of the same gene?
And finally, a lot of the talks today alluded to hot spots -- this was a big topic that
came up in modENCODE and ENCODE as well. A very large fraction of the binding sites in the
genome are occupied by multiple transcription factors which often -- whoops, wrong button
-- which gives you that view there, but when you expand this out, there's the alternative,
that you have individual factors bound at many different loci. And these are the things
that the chromatin immunoprecipitation, the ChIP-seq approaches, have not quite enabled
us to resolve quite yet, and there are ideas out there on improving this, and I just wanted
to bring that up to the panel and to the audience.
Gary Karpen: Any thoughts? [laughter]
Valerie Reinke: I was just going to say I'd also add that
it looks like for all of these different events that happen, only a tiny fraction of them
can even be associated with a simple change in gene expression, if you do something like
knock down the transcription factor. So even if you set aside hot spots and other things,
there are still a lot of mystery binding sites that seem to be true, seem to be reproducible
but yet don't directly lead to gene expression. We also have no idea what is going on with
those as well.
Manolis Kellis: So, I'm just thinking out loud here, what
about -- and you -- this might be totally naïve of this and stupid, but what about
some kind of tagging of these different cells by, say, mutagenesis. So suppose that we were
bombarding a set population of cells, leading to different mutations in each of them, and
then you could, you know, just like you can tell allele-specific activity by looking at,
you know, SNPs between individuals, suppose that then you could distinguish whether whatever
ChIP signal you are getting, is that coming from only one, many copies thereof, or different
variants, if they have a high enough mutation rate. What do you guys think, is that plausible?
David MacAlpine: I think you'd probably have to, you know,
to get a high enough hit rate to accumulate enough SNPs in a population that are cell-specific,
that, you know, affected enough binding sites to be meaningful, that would be a pretty messed
up cell, I imagine. But --
Manolis Kellis: It doesn't need to survive for that long.
David MacAlpine: That's true. [laughter] You know, right.
Female Speaker: How do you interpret it at that point?
Manolis Kellis: I mean, you have the sequence, once you do
the ChIP-seq experiment, and then you can tell, well, where's the -- you know, was the
binding site mutated, and so on and so forth, I mean. Anyway, just one idea, starting up
the conversation.
Female Speaker: It's kind of a brainbow for ChIP. Yeah, it
would be a cool, cool idea if one could pull it off.
Male Speaker: I mean, I think a standard -- you raise a
good question. I think for the standard issue of multiple proteins binding at once versus individual
proteins bound, a sequential ChIP, right, is a way of getting at this. On the issue
of quantifying the signals and what they really mean, I think that's a really important problem,
and I haven't seen that perfectly addressed in any system. And certainly things we've
considered but not -- have not executed would be to compare occupancy with DNase footprints
and such, where you try, in a quantitative fashion, where you can see how much, hopefully
project how much is really truly occupied in vivo versus accessible and such. That's
one way of doing it. I also think it's worth doing an experiment that we proposed in our
grant but never got to, it seems, [laughs] is we might actually -- I should watch what
I say, I guess.
[laughter]
I'm fairly honest. I think if we could somehow set up -- it's not a simple experiment but
to set up a truly open region where you put in something like lac repressor with known
sites. And then actually put this in and get it bound, you know, presumably overexpressed
at a sufficient amount, and then actually see what that ChIP signal looks like with
one site, two sites, et cetera, where you really reconstruct the site in vivo. What
makes it complicated is that you actually need to make sure it's open and it's not all
occupied with histones and things like that, and so presumably put bent DNA and things
like that around it. But to truly recreate a site that you can do quantitative experiments
on would be probably a useful thing to do.
David MacAlpine: Yeah, I mean, I think we are getting close,
I mean, with the MNase and DNase footprinting, the ChIP-exo approaches that you can start
to see the specific footprints and binding sites, and can you start to resolve, you know,
HOT spots there, what's the, you know, seed [spelled phonetically] factor that's binding.
You know, we are getting there but maybe some of the stuff that Jason Lieb is doing, looking
at the dynamics and turnover of specific, you know, factors with these competitive experiments
as well.
Male Speaker: So, maybe one imperfect way to approach what
Manolis is suggesting is to make use of natural diversity, so, for instance, in the population
like the B cells with the immunoglobulin genes, you've got a lot of variation there, you know
that it differs by cell, and so maybe that's one way you could...
Gary Karpen: Sorry, I'm not sure the microphone -- did
you hear --
[laughter]
Male Speaker: Oh, sorry, let me try it again. So Manolis
was suggesting that it would be nice to have cells that you knew were the same, had the
same promoter but were -- the alleles were different, and so you could mark what was
going on, and therefore try and disentangle what's happening at the single cell level
versus the population. And I'm suggesting you could use immunoglobulin gene promoters,
and -- because the variable region acts as a marker. So it is your built-in variability;
in other senses the promoters ought to be the same, in inverted commas.
Gary Karpen: But I think the issue there is that you can
use that variability for that locus, not for the whole genome, so, yeah. [laughs] But thanks.
[laughter]
Manolis Kellis: So regarding the levels, I mean, what about
just systematically doing enhancer experiments for regions that are bound at different levels,
and then trying to get a functional read out of what the implications of that are. In other
words, I mean, we're observing these differences in intensity in the binding, but perhaps with
some of these large-scale validation techniques we can now start applying them to test, you
know, all of these different signatures. All of which we're calling peaks, and basically
see if we should be thinking of big peaks, and small peaks, and wide peaks, and narrow
peaks, and...
Gary Karpen: Are you talking about in vitro or like an in vivo
titration experiment?
Manolis Kellis: Oh, I actually wasn't thinking about sort
of a biochemical binding experiment. I was actually thinking of a reporter assay, where
you're asking, is that functioning as an enhancer, and then that reporter assay can be, you know,
any of the above. But I -- so basically besides just validating the biochemical activity
of what's happening there, asking if whatever signal we're observing in fact has functional
ramifications, because I don't think we even understand that. I mean, if you see it, would
you bet, I don't know, your cat, that a big peak is going to respond more than a small
peak?
David MacAlpine: Right, but maybe you have a factor that binds
-- that's very good at turning on a specific enhancer or promoter but it only binds a very
small fraction of the time. So I don't know how you separate the biology from the -- I
mean, the downstream output from the input.
Manolis Kellis: So what you're saying is that in different
loci, basically that the -- I could misinterpret your question as to sort of, you know, in
a different context, a little bit will matter more, and in another context you will need
a lot of the same factor, and therefore, you need to do the sort of functional experimentation
with many different reporters to actually test the importance of the context. Is that
what you were saying or something else?
David MacAlpine: I think so.
Gary Karpen: How about Mike, and then we'll move on --
Male Speaker: So to me, this goal is really lofty
right now, and I don't see that this is something that's achievable today, but it is important
to know that parts of this are addressable. Like Mike Snyder said, if you want to know
about any two factors at a time on the same piece of DNA, you can do sequential ChIP;
it doesn't work well for all antibodies, it's not easy, but you can get at one part of this
problem. Similarly, you bring up the issue of the tall versus short peak, and the ENCODE
project has done a significant amount of work asking are those changes significant? And
it looks like they are quantitative. It doesn't tell you is the occupancy 10 percent or 50
percent, but it does tell you that a big peak, more occupancy. Is it distributed among different
cells, is it uniformly distributed in the population; it doesn't speak to that. But it gets at that
a little bit.
And finally, I'm aware of one publication, Gary Felsenfeld, back in '95, did, for histone
modifications, attempt to come up with quantitative numbers by doing a titration method. Now,
again, this was an ensemble measurement, populations of cells, but they could estimate in the population
the average histone modification at a site was 80 percent, 20 percent. Again, it doesn't
speak to, is it 20 percent in 100 percent of the cells or is it 100 percent in 20 percent
of the cells. But there are ways to kind of approach this today.
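The ambiguity raised here, that a population-averaged measurement cannot separate "20 percent in 100 percent of the cells" from "100 percent in 20 percent of the cells," can be illustrated with a toy calculation. This is a sketch that was not presented by the panel; the cell counts and the per-cell occupancy values are illustrative assumptions.

```python
from statistics import mean, pvariance

# Hypothetical per-cell occupancy at one site, for 100 cells (illustrative values).
# Scenario 1: every cell has the site 20 percent occupied.
uniform = [0.2] * 100
# Scenario 2: 20 cells have the site fully occupied, 80 cells not at all.
bimodal = [1.0] * 20 + [0.0] * 80

# A pooled (ensemble) measurement only reports the population average.
uniform_avg = mean(uniform)
bimodal_avg = mean(bimodal)

# The cell-to-cell spread is what distinguishes the two scenarios,
# and it is only visible with single-cell measurements.
uniform_spread = pvariance(uniform)
bimodal_spread = pvariance(bimodal)

print("population averages:", uniform_avg, bimodal_avg)
print("cell-to-cell variance:", uniform_spread, bimodal_spread)
```

Both scenarios give the same population average, while the per-cell variance differs, which is why the speakers keep returning to single-cell methods.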
Gary Karpen: I think the problem is that you won't understand
how it impacts the biology in the end, so anyway [laughs]. Let's move on to Brent.
Brenton Graveley: Okay. So I'm going to talk about some of the
transcriptome features. Okay. So, the transcriptome projects in all the organisms that have been
done, so the fly, the worm, and the human, have all been really successful and each has
led to the discovery of thousands of new genes, and Bob did a really good job highlighting
all of that. But what I'd like to tell you is -- go over some of the things that we've
learned which actually tell you about the things that we don't know. So, for instance,
this graph here, this is all data from fly but the same principles are true in all the
species, and so this is like a cumulative count for how many genes we see expressed
over this developmental time course of 30 samples, and what you can see is -- like you
look at this line here it's going up and it seems to plateau a bit here but it's actually
still going up as we add samples on. And this graph down here is actually indicating how
many genes are expressed in each of the different experimental types we have, so this is the
developmental timecourse. And if we look at all the tissue culture cell lines, the tissues
that we've done and treatment samples, every single sample that we look at we discover
new genes or we can see the expression of genes that we haven't seen in other samples.
So, so far we haven't saturated things, so there's a lot of new genes out there yet to
be discovered even though I think the fly project, and the worm, and human have all
done a good job.
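The cumulative-count curve described above can be sketched as a running union of the genes detected in each sample. This is a minimal illustration, not the project's analysis code; the function name `cumulative_discovery` and the gene identifiers are hypothetical.

```python
def cumulative_discovery(samples):
    """Return the running count of distinct genes seen after each sample."""
    seen = set()
    counts = []
    for genes in samples:
        seen.update(genes)        # add any genes not detected before
        counts.append(len(seen))  # total distinct genes so far
    return counts

# Hypothetical detected-gene sets for four samples (illustrative only).
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneB", "geneC", "geneD"},
    {"geneA", "geneD"},
    {"geneE"},
]

curve = cumulative_discovery(samples)
print(curve)
```

If the curve were saturated, later samples would stop raising the count; a curve that keeps rising with each added sample is the situation the speaker describes.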
Another thing in here is, in the fly project and the other projects, the vast majority
of the sequencing data for RNA has been done on poly (A) plus RNA, so there's all the poly
(A) minus RNA left to discover. And so in the fly project, we've done a little bit,
and the difference between these two lines here is actually how much additional discovery
we can make in the poly (A) minus, and we've only done it on 12 samples so there's a lot
left, I think, in that aspect of the transcriptome to discover.
So that's just looking at discovery, then over here, this is looking at actually splicing,
but the idea is to look at the dynamics of gene expression. And if we look at tissues,
we can see that basically there's a lot of splicing changes that change dramatically
between different tissues, but these changes are actually diminished when you look at whole
animals, and this is because you have two tissues where the splicing is very different,
but when you grind up a whole animal, it sort of looks like nothing's really happening,
okay? So this is the same at the gene expression levels. So as we get finer and finer in detail,
going down to single cells, we'll get more and more information about the dynamics. So
I think this is pinpointing that we need to really get into doing single cell expression
analyses to figure this out.
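The dilution effect described here, where a large tissue-specific splicing change nearly vanishes in a whole-animal sample, amounts to a weighted average. The sketch below is an assumption-laden illustration: the function name `whole_animal_inclusion`, the transcript shares, and the exon inclusion levels are all made up for the example.

```python
def whole_animal_inclusion(tissues):
    """Weighted mean exon inclusion.

    tissues: list of (transcript_share, inclusion_level) pairs, one per tissue.
    """
    total_share = sum(share for share, _ in tissues)
    return sum(share * level for share, level in tissues) / total_share

# Hypothetical case: a small tissue contributes 5 percent of the transcripts
# and switches exon inclusion from 0.1 to 0.9 between two conditions, while
# the rest of the animal stays at 0.5.
before = whole_animal_inclusion([(0.05, 0.1), (0.95, 0.5)])
after = whole_animal_inclusion([(0.05, 0.9), (0.95, 0.5)])

tissue_change = 0.9 - 0.1
whole_animal_change = after - before

print("change within the tissue:", round(tissue_change, 3))
print("change in the whole animal:", round(whole_animal_change, 3))
```

Under these assumed numbers the change measured in the ground-up animal is a small fraction of the change inside the tissue, and the same logic applies one level down, from tissues to single cells.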
And the final one down here, which you probably can't see over the heads, is really getting
at this issue of connectivity between transcripts, so this is the Dscam gene in Drosophila,
which makes 38,000 isoforms, but the point is, using the data that we have, it's impossible
to tell whether exons on this end of the transcript are on the same exact molecule as exons on
this end. And so what we really need is this super long single molecule sequencing technology.
So if, for instance, Oxford Nanopore actually produces something, and it does anywhere near
what's advertised, things like that might really go a long way towards addressing these
issues.
Gary Karpen: Manolis is just faster. [laughs]
Male Speaker: I was going to ask, is there really any bottom
to the number of transcripts, you know, with enhancer transcripts of the sense and anti-sense
[spelled phonetically] for transcripts around promoters? But more specifically, is there
anything you can say about these non-poly (A) RNAs that might be insightful? Are they
long RNAs?
Brenton Graveley: So, a lot of the poly (A) minus RNAs that
we did discover were previously unannotated snoRNAs; a lot of them were microRNA precursors
that were, in some cases, you know, 10, 20 kb long, and then just a lot of like non-coding
RNA type things that we don't have any idea what they are.
Male Speaker: Right, I wonder whether you could unmask some, as
groups have been, by things like knocking out the stem-loop binding proteins for histones,
so would that go on to a poly (A) site?
Brenton Graveley: Right, right, yeah, so then there are the
poly (A) tailless, like histone transcripts and things like that, yeah.
Manolis Kellis: Thank you. So I think what we need here is
the same thing as, you know, I was describing on the microphone, namely, disrupting each
of these sites and then asking what do they actually do? In other words, I mean, biology's
messy, and what makes it so wonderful is that it can cope with how messy it is, and that's
something that, you know, it is part of the design principle, in a way, that it can cope
with stuff happening. And as you start sequencing deeper and deeper and deeper, I mean, you'll
find stuff that happens that might not actually happen for any good reason. It might not actually
have any good function, and so on and so forth. I think what we're still lacking is the ability
to sort of knock out that weird random -- or not random but rare, you know, junction that
happens only in one cell out of 10,000 and see what happens. And I think that, sort of,
from the discovery to the validation, perhaps the next challenge ahead is to not discover,
you know, things any further down the rabbit hole, but instead sort of take the ones we
already have and make them or see how often they're actually made, or sort of localize
them extremely precisely, or, you know, disrupt them and see if they have any kind of consequence,
and so on and so forth. So I'm wondering if you want to comment on technologies that can
do that, whether that's even a feasible endeavor to sort of start, you know, checking them
off on the functionality side of things.
Brenton Graveley: Yeah, well, I think that's an area where the
model organisms, in particular, are great to add, you know, so you can actually go in
and disrupt all these different elements. So, in the fly, you know, recently, there's
been this use of BAC recombineering that Hugo Bellen's lab has pioneered, that you
can now basically change anything to anything else and put it in the proof of sites place
in the genome. So it's actually feasible to go and do these things on a fairly large scale
now in the fly. I think it's more challenging with worms, but it's doable as well.
Male Speaker: I wanted to take it in a little different
direction. So, your Venn diagram there is interesting. When you take cell lines or something,
you get some transcripts. What happens if you sequence the same thing, make -- grow
-- you know, take the second larval stage from fly and assay that 10 times, 10 different
samples?
Brenton Graveley: Yeah, well, that's good but we have not done
that experiment.
Male Speaker: We haven't either.
Brenton Graveley: Yeah --
[laughter]
-- I don't know, but, yeah. Yeah, I don't know. I mean, I don't know what these things
that are in tissue culture cell lines only are. I mean, we know what the lists are but
--
Male Speaker: No, it's --
Brenton Graveley: We don't know why they would be. I mean, some
of these are annotated genes so they're like real genes but --
Male Speaker: Yeah.
Brenton Graveley: Maybe they're --
Male Speaker: But how much of this is --
Brenton Graveley: I don't know.
Male Speaker: -- I don't know, biological fluctuation? Or
methodologic? Yeah, I don't know. I just wonder where this -- where these extra things are
coming from, whether it's -- it's not really that they belong in cell lines, it's just
that they happen to reach your threshold in this -- in that sample and not --
Brenton Graveley: Yeah, yeah, that's absolutely true, but they're,
you know, they're below any threshold where we would call them as a gene in the other
samples.
Gary Karpen: But I think isn't the point, even with your
analysis of -- I mean, the point is to ask what's the total capacity of the genome, and
if you think in an evolutionary context you think there must be genes [laughs] sitting
there that we don't discover as transcripts because we haven't figured out what conditions
would induce them.
Male Speaker: [affirmative]
Gary Karpen: And so, you know, the more treatments you
do, the more challenges you provide, the higher the probability that you'll find things
like that.
Male Speaker: But there, I mean, evolution certainly helps.
If the thing doesn't function well enough or have enough of a role so that evolution
exerts selection to maintain this sequence, then presumably we're not going to see that
-- we're not going to see a function for that in the lab.
Brenton Graveley: Well, actually, a lot -- I mean, one of the
features of the new genes that we've discovered in the project is that they're more poorly
conserved than the annotated ones.
Male Speaker: Those are actually positively selected so
that you're selecting -- I mean, they're still under selection, they're just -- you're being
-- they're being selected for change.
Brenton Graveley: Right. Yeah, but they're --
Male Speaker: And so that's also an evolutionary signature.
Because it's be -- well, you compare it to neutral sites and you show that it's higher.
It's like a Ka/Ks ratio. I mean, that sex -- sex is very good for this. [laughs]
Male Speaker: My question was we know a lot about this transcriptome,
but what is modENCODE's vision on the proteome [spelled phonetically], with additional layers
of microRNA regulation, et cetera, the proteome becomes very complex. So, is it like mass
spectrometry, or are there other things that you are thinking about for the proteome? That's
where the function lies.
Brenton Graveley: Well, being an RNA person, I would say the
RNA does a lot of the stuff in the cell. [laughs] But, I mean, certainly the discoveries that
have been made in the transcriptome will significantly expand the proteome for sure. But, you know,
within the modENCODE projects, you know, I'm not aware of any efforts to do any mass spec
type stuff.
Gary Karpen: The worm, Bob, Bob did it for the worm, it
just wasn't done for the fly. And their funding for modENCODE is finished so there won't be
any from modENCODE for proteome but there --
Male Speaker: Thanks.
Male Speaker: Pardon me, the words were taken out of my
mouth, but even given the known transcriptome, there's a lot to be discovered about the proteome,
you know, there's discoveries of initiation with near-canonical
start sites. All the translation of 5' UTRs in higher cells, how these change with
physiology, even with the constant transcriptome.
Brenton Graveley: Yeah, I agree. Yeah, so for instance, I think
the ribosome profiling technology would be really great to use on the model organisms
to really look at this.
Gary Karpen: There's always the post-translatome [spelled
phonetically].
[laughter]
Okay, and last but probably most important, Valerie is going to discuss what I think ultimately
will be the most important question for the impact upon ENCODE, which is, can people actually
get to the data and use it.
Valerie Reinke: Yeah, so I was inspired to bring up this topic
because I get emails like this one, which I do have permission to put up here, which
I hope you can read. It says: "I'm working in the lab of Barbara Conradt on the control
of apoptosis. We are interested in finding out what factors bind to the egl-1 locus,
and in particular, whether ceh-30 is one of these factors. And so I have a list of questions:
Is egl-1 a target of ceh-30? What factors bind to egl-1? Which factors bind to the larger
egl-1 locus? How can I get a list of all binding sites for ceh-30? How can I judge their relevance?"
[laughs] I'd like to know the answer to that one too.
[laughter]
"Do you also have genome-wide information on another factor, how about ces-1? Any information
on that?" And I get these on a, you know, fairly regular basis, and I try to help out and point
them to the cool sites that Gos brought up, but this is the kind of questions that a lot
of the people out there in the community have, and I can't actually do all their analyses
for them. And so I think that there is still some really key issues out there, despite
our best efforts, you know, that there's still some problems with finding and accessing the
right data for a lot of people, understanding what kind of analyses have been performed.
In particular, I think, you know, scientists like to know what it means when they mouse
over a peak and they see a statistical value of some sort, and they don't really understand
what that statistical test means or what sort of test was even performed, is tricky, and
then they get suspicious, and then they don't know what to do with the data. And then interpreting
the importance and reliability of individual data points, and, for instance, within the
consortium we talk a lot about HOT spots and understand that, you know, every single binding
event might not have an immediate impact on the expression of that gene, but people out
in the broader community aren't necessarily thinking about these questions in this way.
And then also a lot of people would like to do sort of intermediate types of analyses,
not be capable of downloading huge data sets and doing really complex things, but sort
of do sort of mix and match kind of mid-level types of analyses. And I think it's -- we
either enable them to look at individual loci or pull down large data sets, but how to do
sort of medium level analyses is still missing. So anyway, that was my point.
Manolis Kellis: Can I --
Male Speaker: Go ahead.
Manolis Kellis: Can I rephrase your question as we need Siri
for modENCODE? [laughs] In other words --
Gary Karpen: No, actually, I think that's part of the problem.
It's now like Siri for modENCODE.
[laughter]
Manolis Kellis: So, in other words, what I'm trying to say
is that you have some level of interaction that can be automated and some other interaction
which would require sort of somebody hiding under the table and pretending to, you know,
be a machine, so artificial intelligence. So, my question is two-fold, so A), can we
add some AI so that, you know, people can sort of interact with it in a fuzzy way, and
ask, you know, human questions and get reasonable answers, and perhaps even translate these
human questions into lines of code that we can then give back to them so that they can
modify these lines of code to sort of ask more precise questions. So that would be one
way of sort of using AI to automate human questions into sort of code that people can
then run and sort of, you know, show that, and maybe show a set of examples where we
can -- you can take all of the emails that you've received and have Gos translate them
into lines of code, and you know, sort of have, sort of find your -- the closest example
and modify kind of thing. The second one is, and I don't know if you want to go first and
I can continue after?
Gary Karpen: You're on a roll, you're doing good. You're
doing well.
Manolis Kellis: So the second one is -- go ahead. I can't
keep going. I'll wait until the end.
Gary Karpen: Actually, maybe, I don't know if there are
comments on that, I mean, yeah, essentially it's -- we've reached the point where essentially
we need the price of bioinformaticists to go down. [laughs] Or the money to pay for
them to go up. And -- because I think that without this -- I think that's a very good
idea but I don't know anything about the field of artificial intelligence and whether or
not that's actually doable. It is doable; I think Gos, in fact, and collaborators have
made really good advances, really great advances in terms of the new formats that bring us
closer to that, but, you know, it's a huge problem out there in the --
Manolis Kellis: So, I --
Gary Karpen: -- NIH universe where unless you have bioinformaticists,
you're basically in deep trouble because you can't interpret this stuff.
Manolis Kellis: I mean, instead of having Valerie answer these
questions, I mean, sort of having an army of, you know, sort of, kind of like a call
center --
Gary Karpen: Sure, if --
Manolis Kellis: Over in, you know, somewhere else.
Gary Karpen: So NHGRI should have a --
Manolis Kellis: Yeah, just a call center.
Male Speaker: In India.
[laughter]
Gary Karpen: Yes. Well, only in certain times of day.
Male Speaker: I would certainly support the AI, and the
call center, and the more money for bioinformaticists, and some more money for fly biologists as
well, would be great. So I just wanted to, I guess, amplify the point that you were making,
you were trying to solve, which is we've gone into modENCODE multiple times, and because
I know people in the field who are in modENCODE, I've then called them up and said, could you
please explain to me this simple question, which is, I've got a promoter, what binds
to it, this sort of thing. And they've walked us through it or they've just gone ahead and
done it for us.
So, in addition to amplifying that point, it would certainly be helpful if you just
-- there's two basic things that non-transcription people generally query, which is, I have a
promoter what binds to it, or I have a transcription factor, what does it bind. And if there was
just a simple shell way of walking into that question, you'd probably solve 70 to 80 percent
of the queries into modENCODE, just being able to do that. And I understand there's
lots of caveats and so on, and you can find ways to put that in, but if you just pluck
a random worm or fly biologist off the street and say, okay, take your favorite gene, here's
modENCODE, I'm not going to tell you anything, go answer a question that you've made up.
I think that would be very helpful, because it really is quite difficult and it's intimidating
to try to get at it, really simple questions like that.
Male Speaker: [inaudible]
Male Speaker: So I get a lot of these kinds of emails also,
and -- but I seem to be getting less of them and I was wondering if that's -- other people
are having the same kind of reaction because I think people are getting better at digging
data out of modENCODE, so that's kind of an open question, I guess, for the panel.
Gary Karpen: Maybe -- maybe what kind of answers have you
been giving them? [laughs]
[laughter]
Male Speaker: I usually say, "You should email Gary that
question."
Gary Karpen: That's not the impression that I get. I just
think it's the opposite. I think it's more people are now accessing it and --
Male Speaker: That's certainly true.
Gary Karpen: -- there's more problems. And I think also
some people just give up.
Male Speaker: Well, I think that a lot of the ways of looking
at this and a lot of these kind of questions, you know, really they're -- if they're just
interested in their particular gene, they really need to go to the browser, that's probably
the best way to look at that, and the kind of things that are in InterMine for dealing
with lists is just fantastic. You know, but the problem that we're facing now with modENCODE
winding down is that all this is going to go static, and it's going to go into clouds,
and at least in some places that I know of, you know, getting the IT people to let you
actually access some of the sites is difficult. So, the -- I would really hope that as ENCODE
goes forward, and this is directed, I guess, to NHGRI until 2016, at least, that they really
make a strong effort to absorb the modENCODE data and to keep it alive, and to keep, you
know, improving the tools for people to access the data and to analyze it.
So, I think that, you know, modENCODE kind of started as producing data and building
a very complicated data infrastructure at the same time and, you know, it's like trying,
you know, you don't normally build an airplane, you know, at 30,000 feet, which is kind of
the way things happen, and now it's actually, I think, in pretty good shape and people,
in my opinion, are having an easier time getting the data out. And so if we could just, you
know, make sure that that kind of continues to improve a little bit, I think a lot of
these kinds of difficulties will go away.
Valerie Reinke: Yeah, I -- just to say I've also had plenty
of people come up to me and talk about how they have been able to use the data. It's
not like a, you know, it doesn't happen a lot so there are plenty of people who've figured
it out. And I think with the increasing familiarity with the existing databases and better incorporation
into WormBase and FlyBase, that people will figure out more and more how to use it, and
at least get some utility out of it, I think.
Susan Mango: For us, it was really opaque, and even looking
at a browser was difficult. So if you want to hear what it's like to be a clueless wet
bench scientist trying to tackle this stuff, I'm happy to fill you in. And one aspect is
having that sort of -- you know, email is one way, but really being able to talk to
someone. So I wonder, you know, if it were possible to have say a workshop at, let's
say, the worm meeting, or the fly meeting, or various things like that, I think it would
be packed because I bet there're a lot of people who would love to be able to do this,
and if you walk them through it and you could talk to whoever was explaining things, you
could begin to disseminate that a little bit. And, I mean, a course would be ideal, but
that's probably not possible, but if you could even have sort of some of these simple questions
addressed at workshops in meetings or something like that, I think that might be really useful.
Gary Karpen: So FlyBase does that, and I think the DCC
was at the last fly meeting at the modENCODE workshop. It is a very effective way, and
I think as Bryan [spelled phonetically] pointed out, I think that a lot of this is migrating,
all of this is migrating to FlyBase and that people are going to end up using the browser
to go through there. So I think that at the fly side that'll happen. I assume WormBase,
because it's really excellent, will be following suit.
Okay, we have time for two more, and Manolis will have to do it over dinner. So --
Male Speaker: It's a nice one, you'll like it.
Gary Karpen: First one.
Male Speaker: Oh, sure. Just really quickly, I think it
was very valuable that the way Valerie put this piece of data, this real question up
in front of people. And I was wondering if there's any way people can think of kind of
collecting a lot of these questions and, you know, sort of tabulating them together to
get some sense of what are the really common questions that people have in terms of the
data resource, and ways of thinking about prioritizing, you know, building of a tool,
or doing some mid-level analyses. I mean, people talk about this a lot, but it'd be
nice to get some data, you know, in terms of what, what're the mid-level analyses that
really, you know, the majority of people want. And I'm just putting it out there to think
about.
Gary Karpen: I think that's a great idea. I think part
of the problem has been that some of it goes to the DCC, some of it goes to individual
investigators, and there has yet to be a real way to conglomerate it.
Yeah.
Male Speaker: I'd like to address the issue to understanding
what [unintelligible] have been performed for collecting data, and probably like question.
Would be as a solution, if you have centralized data storage for methods, for tools, when
customer would be able to log in for web interface, check all steps over [unintelligible], if
it requires, change some parameters, and your own tools.
Male Speaker: Sorry [inaudible].
Valerie Reinke: So are you suggesting that in the web databases
that there be a way to sort of reiteratively run an analysis, a query, over and over again?
Male Speaker: Yes. Well, all tools will be presented in
a form, some boxes where you can check all parameters, all the agents that you use, was
used, and easily run on your data.
Valerie Reinke: We're going to let Gos handle this one.
Male Speaker: Yeah, whatever microphone you want, but if
you use that one, get really close.
Gos Micklem: I wasn't going to handle that question.
[laughter]
Gary Karpen: But you did something towards -- you mentioned
that you can save at least certain configurations of tracks.
Gos Micklem: You can save configurations of tracks, and
then in modMine you can save lists and queries and things like
that. I just felt I had to say something since this is about data access, and Bryan has encouraged
me to point out that actually we have these things called template searches, and the idea
of a template search is we try to guess what people want to do, and then write a little
page, which is really easy to fill out, where you can paste your gene in or your gene list
in, and actually, there's a template that answers several of those questions. And we've
kind of somehow, in spite of standing up and giving lots of talks at fly meeting, worm
meeting tutorials, and at this, kind of people don't listen. And we haven't had -- we haven't
had just that many questions come through help@modENCODE as we expected, and it's interesting
that it's been picking up quite a lot in the last few months, and we do get some questions
like that. We also get ones like, "I'm studying this genome and I'd like to design primers
for PCRing it; can you do it for me?" It's just fantastic. And so, you know, one practical
suggestion that we could have -- if we'd had the conversation a year earlier, would be
make sure you forward all your questions to the help, because that then helps us to know
what the common questions are and then try to maybe make the really simple questions
easier to answer.
Female Speaker: Can I also answer the other question that
was asked? So, right now there is an effort to put up an instance of Galaxy onto what
will be the permanent housing on the Amazon cloud so that you can then write your own
pipelines and be able to run them over and over again. Now in terms of the automatic
population of whatever the parameters were for all of the tools that were done on the
analysis of the existing data, now that won't be pre-populated into the Galaxy instance,
but certainly we try to collect as many of those parameters as possible, and they're
annotated in the protocols that are in the wiki. So it's accessible but there's a lot
of it, so...
Male Speaker: I just want to make a quick comment, and it
contains the words legacy of modENCODE in it, so...
Gary Karpen: What's that?
Manolis Kellis: [laughs] All right, quick comment. And you
don't have to answer it but -- I just, like, we're thinking maybe the data will become
obsolete very soon because everybody has or will have very soon the capability of generating
modENCODE-scale data which has been amazing for 2010 and 2011, 2012 but perhaps not for
'13, '14 and '15. Perhaps one of the legacies of modENCODE should be to actually educate
people to use this type of genomic data, and I think the effort should be placed now, not
just in generating additional data of the same type, but also in educating everyone
who generates this type of data to integrate it with the existing resources and to use
this type of resource in any kind of project. And I think sort of funding the kind of computational
and sort of also educational efforts for using this type of genomic data sets could be part
of what's unique to modENCODE rather than just the generation aspect, which I think
has been democratized.
Elise Feingold: Thank you.
Gary Karpen: Thanks to the entire panel, and especially
what's left of the audience for [laughs] --
[applause]
-- being here. And we'll solve all these problems tomorrow.
Elise Feingold: Tomorrow, exactly. So I want to thank the
panel members, as well as all the speakers in the session. We are going to start tomorrow
morning promptly at 8:30, and for anyone who's going to dinner, the PIs, please just meet
at the front of the room; we want to talk about logistics very quickly. Thank you.
[end of transcript]