Andy Baxevanis: All right, good morning everyone and welcome
to this final week of the 2012 Current Topics in Genome Analysis series. Just a little bit of housekeeping
before we get to today's lecture, for any of you who missed any of the lectures along
the way, just by way of reminder, you can view any of the lectures by just going to
the course website, which is, again, at genome.gov/course2012, and all of the lectures to date have been
posted for viewing. And we're very heartened by the fact that there's already been over
25,000 views of the lectures in this year's series, and I would encourage any of you who
are engaged in any sort of teaching activities to use these lectures as an educational resource
to supplement your own curricula. All of the handouts will continue to be available on
the website as well until the next offering of this course in the spring of 2014.
For those of you who have been signing in each week for your CME credits, there's one
final thing we have to ask you to do, and that's complete an online questionnaire, which
will be hitting your inboxes later this week. And once we have that in place we'll be submitting
all of the registrations into the CME office so to make sure that you get your credits
and a certificate for participating in the course.
For everyone, we also ask that you complete a brief online survey, and let us know what
you liked about the course, what not so much, any comments that you have about specific
lectures, any suggestions that you have for improvement, basically any constructive criticisms
that you have regarding any aspect of the course would be welcome. So, the survey invitations
will go out to everyone on the course mailing list at the end of today's lecture. Please
rest assured all of the survey responses are collected anonymously. We read each and every
one of the survey responses we get back, and the changes that we make based on feedback
from all of you are really a key part of why this course is successful from year to year.
So, this will take you less than five minutes to complete. So, I would very much appreciate
your time in completing the survey once you receive the invitation email today.
So, just in closing, I really hope you found the course both interesting and informative,
and Tyra and I really encourage all of you to apply the concepts and the methods that
we presented to you over the last 13 weeks to your own research interests. So, thank
you for your participation and support, and we look forward to seeing all of you again
in the spring of 2014 for Current Topics 2014.
So, to today's lecture, it's my pleasure to introduce to you Dr. Julie Segre, who is a
senior investigator in the NHGRI Intramural Research Program. And Dr. Segre's main focus
is on the body's largest organ, namely the skin. And over the years her research program
has provided us with a great deal of insight about the genetic pathways that are involved
in building and in repairing the skin. Now, obviously, the skin provides a critical barrier
to invasion by microbes, but it also at the same time provides a major home to them as
well. And through her lab's work and her work as part of the Human Microbiome Project, Julie's
efforts continue to provide us with new insights into how the bacteria that constitute the
skin microbiome contribute to both chronic skin disorders such as psoriasis and eczema,
but also to overall human health. So, given her role as a thought leader in the field,
I'm quite happy that Julie could join us this morning to share with her her perspectives
-- share with us her perspectives on the genomics of microbiomes and microbes. So, please join
me in welcoming today's speaker, Dr. Julie Segre.
[applause]
Julie Segre: Okay, thank you. So, I'm going to just launch
right in. And, so, no financial disclosures, one of the great benefits of being an NIH
employee makes all that a lot easier. So, actually, I've been involved in the Human
Genome Project for 20 years now. And this part of the Human Microbiome Project really
started about five years ago when we started thinking about, you know, the fact that humans
are really super-organisms. And that the -- what's contributing to health and disease is the
chromosomes that encode the human, but also the multiple microbes that live in and on
our body, including fungi, bacteria, viruses, and archaea. Now, of those 23 human chromosomes
that you've heard so much about -- there's 25,000 genes, you know, I mean, lots of alternative
splicing, but in fact, each of those cells has more or less the same gene encoding potential,
whereas the microbes that live in and on our body actually have quite varied protein coding
potential, and as you could imagine the bacteria that live in your gut are quite different
than those that live on your skin. As well, when we start to think about some disorders
like allergic disorders, asthma, hay fever, that have really increased in the last 20
to 30 years, it can't be our genetic material that's changed in that short of a time span,
so it's something about the gene-environment interaction. And we use that word a lot, but
what really would be integrating this gene environment interaction? And one of the ideas
of this project is that perhaps the gene-environment interaction is really being integrated by
the microbes that live as, you know, together with the human cells.
And as you could imagine, although our human DNA is evolving in a very slow way, the microbes
that live on us could be evolving more rapidly as we integrate antibiotics into, you know,
common human health, which is to say that, you know, my mother probably didn't take antibiotics
very much, because growing up as a kid during the war they weren't very available. But I
took antibiotics as a kid. My children take antibiotics as a kid. So, are we going through
a bottleneck where we are actually changing the microbe -- microbial diversity that lives
on our human body? And that's really part of it. It's just to set a baseline and understand
what are the microbes that live on us, how do they change during disease states, and
how does that integrate with human health?
So, this part of the Human Microbiome Project that is a large project that is part of the
NIH Common Fund, born as the NIH Roadmap, that the goal of this project is to assess
the microbial diversity of 250 healthy individuals at five sites and to make all this data publicly
available. And the data has now been collected. The papers are under review, but the data
is already publicly available, although some of the data about the clinical features of
the patients is in controlled access, but the DNA sequencing of the microbes is in open
access and would be, you know, freely available for anyone wanting to use it as controls for
their own experiments. I'm sorry, and the five sites that are being studied are the
gut, the nasal passage, the oral cavity, the vagina, and the skin. In several cases there
are multiple sites like in the oral cavity being sampled, so that you could compare the
left, you know, the left, you know, the left cheek versus the right cheek of the mouth
and -- or the left arm versus the right arm and understand what the variation is between
individuals and between sites in an individual.
So, here the goal is to sequence bacterial reference genomes. In the first paper the
first 180 bacterial reference genomes have been published. And here it's really to expand
the repertoire of bacteria that have been sequenced. Predominantly bacteria that have
been sequenced are ones that are involved in disease. So, there's, like the MRSAs that
have been sequenced are the ones that are circulating in hospital, but what are the
MRSEs, the Staph epidermidis that have been circulating in hospitals. But what are the
bacteria that are part of the normal healthy human microbiome? And part of the reason for
doing these bacterial reference genomes is really to enable metagenomics, which I'll
get to as the final topic of today. Metagenomics is the analysis of the combined coding potential
of a mixed population. So, imagine that, you know, a spaceship comes down in New York City
in a, you know, in a crowded street where people are crossing the street. Basically,
sucks them up and takes the DNA from all of them simultaneously. You know, that's what
metagenomics is. Like, you would just be scraping your skin and sequencing everything, and then
you try to sort out which one came from which genome. So, instead of sequencing microbe
by microbe we'd like to eventually sequence all together the entire gene encoding potential
of the bacterial community, because there is such an interaction between the bacteria
of how they control and interact with each other.
The goal is also to look at the correlation of changes in the microbial communities with
disease states. So there are some classic projects being looked at here, inflammatory
bowel disease, Crohn's disease, psoriasis, eczema, and, you know, really to understand
what is the relationship of these disease onsets with bacterial communities. Now, it's
very interesting because, in fact, what we're seeing now is that it's not just that the
bacteria of the gut controls gut disease, but in fact, they're seeing that enzymes produced
by the gut are having an effect on coronary heart disease, because drugs that are used
to treat coronary heart disease or even products that contribute to coronary heart disease
are metabolized by bacteria that inhabit the gut. Similarly, oral cavity seems to have
an effect on multiple systems of the body. So, we're beginning to understand that the
immune system is educated by the microbes, but also far away sites are being affected
by the microbes.
As well, we're really -- as with the Human Genome Project, we're taking this project
and also exploring the ethical, legal, and social implications of this field of research.
So, for example, it's really unclear how probiotics will be regulated in this country. That is
to say that right now probiotics can be used as drugs if they go through the full IND process
of a drug, but it's very hard to do that, but on the other hand, if you -- most of the
things that you buy in Whole Foods are actually being regulated just as natural products or
food additives, and therefore there's not the same level of clinical scrutiny on the
manufacturing and the efficacy of those. So, you know, that's one of the things that we're
looking at as part of the project, is how to really regulate probiotics. Also, what
will people think about, you know, how can we change the impressions of people in this
country, in that, you know, there are these nasty bugs out there that you certainly would
want to avoid, but it's not necessarily that the language of warfare is always applicable
here. There are a lot of healthy bacteria, and our goal should not be to just kill them
all. And it's interesting, because, again, people are very interested in the probiotics
that go into their gut, and then they want to sterilize their exterior using all these,
you know, hand sanitizer products, which are -- have their role, but, you know, we can't
lose sight of the fact that hand washing has been, you know, extremely effective and been
really tested over the years.
So, how did this project start? Well, really, the microbial diversity has been studied in
the environment for decades before the Human Microbiome Project started. And these are
just some early articles about microbial diversity that was studied in the environment where
in this example they go around and they're taking all these different spots in the Sargasso
Sea and they're looking at what the microbial diversity is. And what they found here was
that by DNA sequencing you could recover a much greater diversity of microbes than you
could by culturing. And we call this -- this has been well-known also, you know, suspected
in human studies, which we call "the great plate count anomaly," but you can see when
you take a sample that there's a great diversity of bacteria. But when you try to grow them
up, they're really not as diverse as what you could see in the original sample, because
you really have certain bacteria that are really lab weeds, you know. And for the skin
it's really the Staphylococcus that just grow tremendously well when you put them on, you
know, agar plates and let them grow as individuals. And these environmental samples have been
sampled -- you know, that was studying different places in the Sargasso Sea, but this is another
one which is a saline mat where they sample at all these different sites as you go down
in depth of the mat. And they're looking, then, at what are the biological functions
that are being performed. These sorts of things are also being done in the ocean at different
depths. And what you can see is the effect of, you know, sunlight, because you go from
much more photosynthesis to less and those kinds of processes that you can really see
the extent to which the bacteria are responding to the environment.
Okay, so there's the, you know, the environment of the oceans and the sea, but there's also
the environment in which the humans live. And we rarely, you know, think about this,
but this was an experiment that Norm Pace's lab did where they went and tested showerheads
all over the U.S. and looked to see what was in showerheads. Now, this is an example of
something where you think you're standing in the shower and you are, you know, getting
clean, and, in fact, you are, but no one in their home very commonly changes the showerhead.
And it turns out that that is a moist, warm environment that we are creating in our homes
that really has a great potential to grow bacteria. So, part of this is also to think
about where are the, you know, environmental point sources. And it turns out that there's lots
of bacteria that live in your showerhead, and that some of the bacteria are dependent
on what type of showerhead you have. And it doesn't necessarily mean that you should run
off and change your showerhead, although I've often meant to do that after reading this
article.
So, how do we look at bacterial diversity? I mean, you know, there's the culturing, but
now what we've entered is this realm where we think that the DNA sequencer is a very
powerful microscope that can tell us what are the bacteria that are in a certain place.
So, the way that we look at the bacterial diversity is by sequencing the 16S rRNA gene.
So, ribosomes, as I'm sure you all know, are made up of proteins and these ribosomal RNAs
that guide the tRNAs through. And now a ribosome is actually 70 percent ribosomal RNA, and
the crystal structure's been solved and that's been, you know, really one of the beautiful
works of biology. But these 16S ribosomal RNAs, which means that they're not translated
into proteins, but they have a lot of secondary structure, because they're part of this ribosome.
And the 16S rRNA gene has been used as a signature for bacterial genes for a very long time,
because there are regions that are more highly conserved, because they have to form these
stems. And there are regions that are less conserved, because they form these loops.
But, in fact, there is a phylogenetic distinction where all the Firmicutes are more like each
other. And then all the Staphs are like each other, and all the streps, streptococcus are
more like each other. And you can use the 16S sequence, the similarity to go from the
phylum to the order to the family to the genus to the species level. And, so it's the 16S
gene that Carl Woese and then Norm Pace really developed as a molecular signature for bacterial
diversity. And I'm just going to sort of spend the first part of this talk talking about
how we use the 16S gene for characterizing bacterial diversity.
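To make the idea of conserved and variable regions concrete, here is a minimal Python sketch that scores each column of a sequence alignment by Shannon entropy: a fully conserved column scores 0 bits, and a column where all four bases are equally common scores 2 bits. The toy alignment is made up for illustration and is not real 16S data, and the function names are hypothetical. A real analysis would start from a curated alignment of full-length 16S sequences.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def entropy_profile(aligned_seqs):
    """Entropy for every column of a set of equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [column_entropy(col) for col in zip(*aligned_seqs)]

# Toy alignment (invented): the first four columns are identical in every
# sequence, the last four columns differ in every sequence.
toy_alignment = [
    "ACGTACGT",
    "ACGTTGCA",
    "ACGTGTAC",
    "ACGTCATG",
]
profile = entropy_profile(toy_alignment)
conserved_columns = [i for i, h in enumerate(profile) if h == 0.0]
print(conserved_columns)
```

In this picture, primers would sit in stretches of low-entropy columns, and the high-entropy stretches between them carry the phylogenetic signal.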
So, here's another display of that same 16S gene, and what you can see here is that, again,
I've laid out the 16S gene. It's 1,500 base pairs, and I've laid it out where you can
see now what is the, you know, what's the sequence similarity of these different regions.
So, like these regions up here are those stem regions, you know, that have to be highly
conserved. They probably interact with the tRNAs. And you really just can't, you know,
you don't have much wiggle room on those. But then there are other regions, and you
can see that they vary, that are very highly diverse, or less diverse. Now, in fact, we
use different regions to get different levels of specificity in that these most diverse
regions are sometimes hard to use if you want to get to the level of phylum or something,
because they are so variable. And I'll kind of go through that, but the first thing is,
and sometimes you just want to know how much bacterial -- bacteria is there. And, so, quantitative
PCR primers have been designed that are in fairly conserved regions that you can put
ends in, and you can get most of them to then do a qPCR and figure out, like for example,
is, you know, do these mice have greater bacterial load than these mice, or does this site in
the human, you know, like the oily sites of the human skin have a greater bacterial load
than the dry sites?
So, actually, this needs to be sort of standardized and so, for example, this is how we calculate
the bacterial load where we actually took bacterial DNA and, you know, using Avogadro's
number we figured out, you know, how many picograms of DNA we were putting in of the
bacterial DNA, and then spiked that with an increasing amount of human DNA, because when
you get samples, some, you know, some of those samples have a significant amount of human
DNA in them. And, then, we did the qPCR curve where as you increase, sorry, as you decrease
the amount of bacterial DNA by tenfold, you're now increasing the number of cycles by three
and by three again. And from this we can calculate what the bacterial load is.
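The standard-curve arithmetic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the lab's actual protocol: the 650 g/mol per base pair figure is a common approximation for double-stranded DNA, the 2.5 Mb genome size is a placeholder, and the Ct values are idealized numbers invented so that each tenfold dilution adds 3.3 cycles. At perfect doubling efficiency the expected shift is about 3.32 cycles per tenfold dilution, which is the "by three and by three again" pattern.

```python
AVOGADRO = 6.022e23          # molecules per mole
GRAMS_PER_MOLE_PER_BP = 650  # assumed average mass of one double-stranded base pair

def genome_copies(picograms, genome_size_bp):
    """Approximate number of genome copies in a given mass of DNA."""
    grams = picograms * 1e-12
    return grams * AVOGADRO / (genome_size_bp * GRAMS_PER_MOLE_PER_BP)

def fit_standard_curve(log10_copies, ct_values):
    """Least-squares line: Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copies for an unknown sample."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical tenfold dilution series (1e6 down to 1e3 copies), idealized Cts.
standards_log10 = [6, 5, 4, 3]
standards_ct = [15.0, 18.3, 21.6, 24.9]

slope, intercept = fit_standard_curve(standards_log10, standards_ct)
efficiency = 10 ** (-1 / slope) - 1      # 1.0 would mean perfect doubling
unknown_copies = copies_from_ct(20.0, slope, intercept)

print(f"slope={slope:.2f} cycles per tenfold, efficiency={efficiency:.2f}")
print(f"estimated copies at Ct 20: {unknown_copies:.0f}")
print(f"copies in 1 pg of a 2.5 Mb genome: {genome_copies(1.0, 2_500_000):.0f}")
```

Spiking the standards with human DNA, as described, would then show whether the fitted slope and intercept hold up when host DNA is present in the sample.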
So for the skin, we were wondering, which, as Dr. Baxevanis mentioned, is my lab's area
of focus, we were wondering, you know, how much bacteria were we getting by the different
methods. So, this way we were able to calculate that when you swab someone's skin with, like,
a Q-tip, you can release 10,000 bacteria per square centimeter, whereas if you scrape the
skin so removing that, like, white, you know, the white things that form the dust bunnies
in some peoples' bedrooms that would yield 50,000 bacteria, but if you use a biopsy,
you'll actually generate a million bacteria per square centimeter. And that's because
the bacteria don't just live superficially, they, in fact, live very deep into the skin,
in the hair follicles, and the sweat glands. So, you can generate more, although you don't
need to, because we can get most of our answers with a lower -- with a subsample of the bacteria.
So, when just thinking about how to study microbial diversity, there's really an emerging,
you know -- this is sort of -- one of the questions that's really been emerging is how
do you study microbial diversity. So, the earliest studies would take the 16S DNA, amplify
it, and then they'd do sort of a fingerprint. And based on the number of fragments, they'd
calculate how many different types of 16S gene they had. But that's really based on
the limitation of a gel. So, that's the cheapest method, but it's very limited in resolution.
The PhyloChip or the GeoChip are kind of like microarrays, where they have the different
16S sequences laid out on a slide, and you can use that to say I have this much of, you
know, the Staph epidermidis, this much of the strep agalactiae, this much of the strep
pyogenes. And I think that, you know, this will -- this not only has a role right now,
but it has a continuing role in that the analysis of these types of microarrays is often much
easier than using sequence analysis. But the problem is that with any of these microarrays,
you're only going to find what you know is already there. And you'll never find the unique
species. So, what we need in order to build these chips is a very good reference library
of what are all the possible bacteria that could be found on this site so that you can
interrogate that rather than thinking, you know, well, you know, you'd hate that the
population you are looking at had some unique bacteria and you're just not assaying it.
So that brings us to sequencing, because sequencing is, you know, gene discovery. This is how,
you know, you can find, you know, a full dynamic range and compare multiple complex samples.
So, for a small study, you know, the sequencing may be limiting, but for a large study, and
I would actually even say for a small study also, the bioinformatics becomes limiting
as you go through this, because most of the programs that I'll talk about for sequence
analysis do require you to kind of dive right into this, and do, you know, some of this
from in line programming, or, you know, at least run it on the command line, and have
some understanding of what may be the issues associated with your sequencing. So, this
is an example of PhyloChip, just to show you how this type of data can be used, if this
is, you know, the type of experiment you want to use. This was looking at the intestinal
microbiota in the first year of life of children. So, on the X-axis is days and on the Y-axis
is, you know, the percent of sequences in relative abundance that belong to these classes
of bacteria. And, you know, the punch line of the story was that there's great diversity
between infants and between time points with the sort of spikes that you can see, these
blooms here, and that that is part of the normal process; that for infants, as you can,
you know, imagine, the child makes major shifts as they go from breast milk to cereal to,
you know, eating a diverse diet, much as they do on their skin microbe -- the skin microbes
as they go from always being held to being seated to, then, crawling around and exploring
their environment. Also, to say that as they are infants they have these, like, rolls of
fat and then their skin kind of changes as they start running around and become leaner.
So, something like PhyloChip can give you this overall perspective of what is, you know,
the microbial diversity over time and how, you know, how stable is it.
So -- but if we want to get to sequencing, it, again, becomes this issue of where are
we going to put our primers? In terms of you want the primers to have the specificity that
you could amplify as many of the types of bacteria as possible, accepting the fact that
there will always be some amplification bias of any primers. But you put the primers into
conserved regions and then the phylogeny is determined by the variable regions. In -- the
size of these amplicons is really, again, technology limited. So, if you were using
Sanger sequencing doing, you know, amplifying 16S gene doing ligation and having it sequenced
with Sanger sequencing, then you might, you know, many -- what many of the early studies
did was amplify the full length 16S. But most projects, if not all projects, have now switched
over to pyrosequencing, and the Human Microbiome Project has been using the V6-V9 region, and
there's actually a V3-V5 primer set, and also from the five prime end of the gene that's
not shown on here, there's actually a conserved region here into the V3 region.
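As a rough sketch of what putting primers into conserved regions means computationally, the Python below does a simple in-silico PCR: it finds the forward primer, then the reverse complement of the reverse primer downstream of it, and returns the stretch in between. The template and both primers are short invented sequences, not real 16S primers, and the function names are hypothetical. Only exact matches are allowed, apart from IUPAC degenerate bases in the primer, so this ignores the mismatch tolerance and amplification bias of a real reaction.

```python
# IUPAC nucleotide codes: each primer symbol maps to the bases it can match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Reverse complement, including degenerate IUPAC symbols."""
    return seq.translate(COMPLEMENT)[::-1]

def find_primer(template, primer, start=0):
    """Index of the first degeneracy-aware match at or after start, or -1."""
    for i in range(start, len(template) - len(primer) + 1):
        if all(template[i + j] in IUPAC[p] for j, p in enumerate(primer)):
            return i
    return -1

def in_silico_amplicon(template, forward, reverse):
    """Region amplified by the primer pair (primers included), or None."""
    fwd_pos = find_primer(template, forward)
    if fwd_pos < 0:
        return None
    rev_site = reverse_complement(reverse)
    rev_pos = find_primer(template, rev_site, fwd_pos + len(forward))
    if rev_pos < 0:
        return None
    return template[fwd_pos:rev_pos + len(rev_site)]

# Invented example: the forward primer carries one degenerate base (Y = C or T).
template = "TTGATTACACCCGGGTTTAAACTGCAGTGG"
forward = "GATYACA"
reverse = "ACTGCAG"   # binds the opposite strand at CTGCAGT

amplicon = in_silico_amplicon(template, forward, reverse)
print(amplicon, len(amplicon))
```

A template whose primer site has drifted returns None here, which is the all-or-nothing version of the amplification bias mentioned above.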
And one of the things I would say is that there's a fair amount of variability. So,
if you take the same sample and amplify it with the V1-V3 primers at the five prime end,
and that same sample and go for the middle region and go for the end, they aren't going
to be exactly the same. Some of those primers are better at amplifying Firmicutes. Some
of them are better at amplifying the streptococcus. Some are better for lactobacillus. And, so,
really the region that you pick is often driven by what type of bacterial diversity you are
expecting to see in your sample type. For example, we for the skin always use the V1-V3
primers, because it's very important to us to get a very good handle on the Staphylococcus.
That's just one of the important bacterial genera in skin diseases, including atopic
dermatitis. But for people who are studying vaginal microbiota, they may use the V6-V9
region, because that gives them better resolution. Also, for the oral cavity they typically use
the V1-V3, because they need to differentiate the different streptococcus. So, what region
you use is often dependent on what body site you're looking at. However, that does bring
up an issue that if you are sequencing the V3-V5 region, you can't use as a reference
someone who studied healthy controls but sequenced the V6-V9 region. You may find that there
are great differences between your disease state and the healthy controls, but that really could
be driven by primer choice, rather than by the disease state.
So, these are some of the complications of these studies, is that we are still utilizing
these kinds of sequencing techniques that clearly have biases. So, it sort of -- I sort
of said this, but just to reiterate, there's the Sanger sequencing, there's 454, and I
would say keep your eye on Illumina. The sequencing read length of, you know, 70 to 100 base pairs
has often been too short to really get that much specificity from Illumina. But, especially
now as the MiSeq comes online, and people get paired ends of two times 150, the, you
know, sequencing 150 base pairs from the V3 and 150 base pairs from the V1. You're getting
in exactly the same range as the 400 base pairs of a 454/Roche. And as those read lengths
go up to 200 base pairs, 250, you're going to be getting a much more powerful dataset
from Illumina at a much lower cost than you are from the 454/Roche. It's not clear that
as you go, if, you know, if Roche really does go up to 600 or 800 base pairs of DNA in an
amplicon, whether that additional sequence would really be useful and important for sequencing
the 16S gene, because you typically get enough resolution to get all the way to the species
level with 400 base pairs. And that's sort of laid out here, although in this reference
where they talk about, you know, the sequence length, and if you just had 50 base pairs,
you may not really be able to get even to the genus level, but you could get to the
phylum level. So, it really depends on the sequence length and the primers that you could
then use and how that really, what type of specificity you want for your bacterial diversity
sequencing. Maybe it's enough for you to just even know has there been a shift in the phylum
or the class, and then you can think about using different sequencing platforms.
Okay, so you've got a sequence and, you know -- you've got a sequence, say you've got a
400 base pair sequence, you know, or you have 2,400 base pairs sequences. Well, you can't
just BLAST them anymore, and this is kind of frustrating to a number of people who work
even in clinical microbiology labs. You used to be able to BLAST something and get it to
match something, but by now so much bacterial sequencing has been dumped into BLAST that
if you put in a sequence, I mean, when we put in a sequence and we're trying to say
what is the sequence, what we usually get is that this matches thousands of other sequences
that our lab has deposited into GenBank that are uncultured, and, you know, it just comes
back uncultured skin bacteria. Well, that really doesn't actually help us that much.
And so, this is one example where more data has actually, you know, I mean, it's great,
but it has gone beyond what you can do with BLAST.
So, you know, fortunately there is a solution, which is that bacterial sequences do have
their own classification systems, and this is not, you know, this is -- I refer you here
that there are these tutorials which actually will take you through this. We have the Ribosomal
Database Project that was curated by Jim Cole's lab and contains approximately a million 16S
sequences that have all been classified based on, you know, common microbiology technique
taxonomy. And I do get into this because for clinical microbiologists there are different
taxonomic systems. There's Bergey's, Euzeby, NCBI, but they're basically, you know -- the
Phil Hugenholtz -- they're basically the same. You know, you will find bacteria that they've
been reclassified or renamed, and you can look into all of that. But, so, within RDP
you can find what is the bacteria classification; you can do things like find probes if you
want to find something to, you know, try to make either a little microarray or to do in
situ hybridization; you can do Seqmatch. But basically by this point you need to move into
these kinds of bacterial programs, because BLAST is now quite limited. So, this is just
an example of sort of RDP Pyrosequencing pipeline where the data is processed and formatted
and then RDP will already give you some of the analysis tools that tell you from a sample
how diverse is this sample, have you sequenced enough to achieve saturation. I'm going to
kind of go through some of those. But it, you know, it's a sort of -- it's a -- it's
one of those projects that -- one of those programs that will take you sort of soup to
nuts.
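One common way to ask whether a sample has been sequenced to saturation is a rarefaction curve: the expected number of distinct taxa you would see if you had drawn fewer reads. The Python sketch below uses the classical analytic rarefaction formula and is not taken from the RDP pipeline itself. The read counts per taxon are invented, and the function name is hypothetical.

```python
from math import comb

def rarefaction_expected_taxa(counts, depth):
    """Expected number of taxa observed when drawing `depth` reads
    without replacement from a sample with the given per-taxon counts."""
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must be between 0 and the total read count")
    all_draws = comb(total, depth)
    # For each taxon, 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, depth) / all_draws for c in counts)

# Invented sample: 100 reads spread over six taxa, two of them singletons.
taxon_counts = [50, 30, 15, 3, 1, 1]
depths = [1, 10, 25, 50, 75, 100]
curve = [rarefaction_expected_taxa(taxon_counts, d) for d in depths]

for d, expected in zip(depths, curve):
    print(f"{d:>3} reads -> {expected:.2f} expected taxa")
```

If the curve is still climbing steeply at the full read depth, more sequencing would probably turn up more taxa. If it has flattened, the sample is close to saturation at that taxonomic resolution.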
Okay, if you are dealing with human sequences, then for 16S I would say that we very rarely,
almost never, end up amplifying human DNA with those 16S primers, but it is something
that you need to have a filter in, because ethically you really -- because we release
all our data into the public. There's a distinction and when we consent patients that we would
put their microbial sequence in open access, but we wouldn't put their human DNA into open
access. So, just sort of a shout out that, you know, one of the issues is to really think
about that. So, at the level where I was saying that you could gain insights even from the
highest level, this is an early paper from Jeff Gordon's lab looking at what is the difference
between lean mice and obese mice, and these are genetically obese mice. They have a mutation
in the leptin pathway. And what you can see, even at this level the shift is quite great.
So, what the Gordon Lab can see here is that the -- oops, oh that's not good -- the obese
mice have an increase in the amount of Firmicutes and a decrease in the amount of Bacteroidetes.
And it's really this kind of a wide sweep.
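A phylum-level comparison like this comes down to counting how many reads were classified into each phylum and converting those counts to relative abundance. The Python sketch below shows that bookkeeping. The read counts are invented purely for illustration and are not the values from the Gordon lab paper, and the function names are hypothetical.

```python
import math
from collections import Counter

def relative_abundance(phylum_calls):
    """Fraction of reads assigned to each phylum."""
    counts = Counter(phylum_calls)
    total = sum(counts.values())
    return {phylum: n / total for phylum, n in counts.items()}

def firmicutes_to_bacteroidetes(phylum_calls):
    """Ratio of Firmicutes to Bacteroidetes reads (inf if no Bacteroidetes)."""
    abundance = relative_abundance(phylum_calls)
    bacteroidetes = abundance.get("Bacteroidetes", 0.0)
    if bacteroidetes == 0.0:
        return math.inf
    return abundance.get("Firmicutes", 0.0) / bacteroidetes

# Invented per-read phylum calls for two hypothetical samples (100 reads each).
lean_sample = ["Bacteroidetes"] * 60 + ["Firmicutes"] * 35 + ["Other"] * 5
obese_sample = ["Bacteroidetes"] * 30 + ["Firmicutes"] * 65 + ["Other"] * 5

lean_ratio = firmicutes_to_bacteroidetes(lean_sample)
obese_ratio = firmicutes_to_bacteroidetes(obese_sample)
print(f"lean F/B ratio:  {lean_ratio:.2f}")
print(f"obese F/B ratio: {obese_ratio:.2f}")
```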
Now, actually some of the most interesting things that are now coming out in mice studies
are how the effect of having the mice in the same cage affects them, in that mice will
-- are coprophagic. They will eat each other's poop, and that actually kind of makes their
microbiomes conform. So, you can actually even see differences; if mice are housed in
the same cage, they are much more consistent than if they're in different individual cages.
And they will go towards a norm if mice are in the same cage with each other. And they
can actually even transfer the microbiome from an obese mouse to a lean mouse. Those
sorts of things, actually Richard Flavell was here two weeks ago talking about experiments
that he's done with Jeff Gordon's lab on that topic.
So, again, it is very interesting: when you start to think about setting up mouse studies,
you have to think about how you're housing them, because these microbiomes are not unique
to the mouse. There is a community aspect to these microbes. Now, in humans I would
say we are just at the beginning of studying this, you know. There are these sort of small
studies that report that, you know, a couple living together will start to conform to the
same -- not to the same, but to a similar microbiome. And, certainly, that twins have
greater similarity either monozygotic or dizygotic twins than siblings. It, you know, so we're
just at the beginning of understanding how microbes are shared between people. With mice
there's a lot more sharing that goes on.
So, the studies on obesity also have been shown. Now, in humans this was sort of one
of the studies that kind of got a lot of press when it came out five years ago looking at
what are the bacteria that live in the human gut. And this is sort of these people where
they are put on a diet, and as they become leaner the amount of the Bacteroidetes increase
and the amount of Firmicutes -- wait, so the amount of Bacteroidetes increases and the amount
of Firmicutes decreases. And that is correlated with changes in the body weight.
But, you know, I mean, Jeff is very clear on this, and I -- so I want to be, you know,
sure to say this, that, you know, in terms of weight the microbes are, you know, playing
a role and perhaps being educated and selected by the diet that you are picking, but there
is still a tremendous role here for what your diet is and how many calories you are consuming
in terms of body weight. So, it's not to say that microbes are the whole story if, you
know, if you're consuming many more calories than you need.
Okay, so, coming back to sequence analysis. This is another one of the dirty little secrets
about bacterial sequence analysis, which is chimeras. When you're doing PCR to look for
the 16S gene, and this would be true even if you were sequencing or if you were going
to a PhyloChip or any other means. You are starting, you know, by amplifying a Staphylococcus,
the PCR cycle is over, you're not yet done amplifying this 400 base pair product. And
then when the next round of PCR starts, you have, you know, almost an exact sequence identity
at the three prime end of that gene or that product to match any other sequence in there.
So these are what chimeras will look like, where you start, you know, by amplifying parent,
you know, start by amplifying parent B and then you switch over and now you're amplifying
parent A. So, the reason that chimeras are this dirty little secret and are so pernicious
is that when you're thinking about, "How diverse is my sample," we did an experiment where
we took 20 known bacteria and we mixed them together, and we did the 16S PCR, and we generated
thousands of unique bacterial sequences by generating chimeras between the sequences.
And so when you think about, "How diverse is my sample," we knew in that case we had
only put 20 bacteria in, but because of the multiple ways, you know, that a chimera could
be formed either, you know, here with the -- here, or here or here. Those each would
be viewed as unique sequences, so you can't consider, how diverse is your sample, without
correcting for chimeras. And we use these 20 bacterial DNAs that had been all integrated
together to then use that as a training set to develop a chimera detection program. And
the one that basically everyone is using now is called ChimeraSlayer. There were other
programs earlier, Pintail and something else and something else. But really this is the
most well tested program by now and it really reduces -- oh, sorry Bellerophon was the other
one -- and it really reduces the false sequences that you could otherwise generate with that
kind of PCR amplification.
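
A minimal sketch of why chimeras inflate apparent diversity, assuming a toy model in which any two parents can join at any internal breakpoint. The sequences, the function names, and the breakpoint range are made up for illustration; this is not how ChimeraSlayer works, only a count of the distinct products such a model can generate.

```python
from itertools import permutations

def make_chimera(parent_a, parent_b, breakpoint):
    """Join the 5' part of parent_a to the 3' part of parent_b."""
    return parent_a[:breakpoint] + parent_b[breakpoint:]

def count_unique_products(parents, breakpoints):
    """Count distinct sequences among the parents plus all two-parent chimeras."""
    products = set(parents)
    for a, b in permutations(parents, 2):
        for bp in breakpoints:
            products.add(make_chimera(a, b, bp))
    return len(products)

# Toy "amplicons": three invented 12-base sequences of equal length.
parents = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]
n_unique = count_unique_products(parents, range(1, 12))
print(n_unique)  # 3 parents + 6 ordered pairs x 11 breakpoints = 69
```

Under the same toy model, 20 parents and a 400 base pair product allow up to 20 x 19 x 399 two-parent chimeras, which is in line with seeing thousands of unique sequences from a 20-member mixture.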
So, then you want to figure out, like, how many different bacterial species do you have
in this sample. And so you have to start binning the sequences. And you want to start by doing
an alignment of your sequences. But these 16S sequences, we know a lot about the structure
of a ribosomal RNA and we want to use that information to generate an alignment. So,
many of the alignment programs that have been generated for looking at DNA sequence are
based on the fact that they should form a protein, and that therefore you would penalize
something that had a one base pair insertion or deletion because that would throw off the
frame shift. But what we know about the 16S gene is that there are gaps and that those
gaps in the different regions might mean something different. So if you have a gap in a stem
that should actually be penalized more than if you have a gap in a loop, and also indels
may not be as, you know, they may not be that different than here where you have, you know,
a base pair that doesn't, you know, match up.
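
The idea of penalizing a gap differently depending on where it falls can be sketched as a scoring function over an already-aligned pair of sequences. The penalty values, the 'S'/'L' structure mask, and the function name are assumptions for illustration; they are not taken from NAST or any other aligner.

```python
def score_alignment(ref, query, structure,
                    match=1.0, mismatch=-1.0, gap_stem=-4.0, gap_loop=-1.0):
    """Score an aligned query against a reference, column by column.

    structure holds one character per column: 'S' for stem, 'L' for loop.
    A gap in a stem column costs more than a gap in a loop column.
    """
    if not (len(ref) == len(query) == len(structure)):
        raise ValueError("ref, query and structure must have equal length")
    score = 0.0
    for r, q, s in zip(ref, query, structure):
        if r == "-" or q == "-":
            score += gap_stem if s == "S" else gap_loop
        elif r == q:
            score += match
        else:
            score += mismatch
    return score

ref       = "GGGAAACCC"
structure = "SSSLLLSSS"   # invented hairpin: stem, loop, stem
gap_in_stem = score_alignment(ref, "GG-AAACCC", structure)
gap_in_loop = score_alignment(ref, "GGGA-ACCC", structure)
print(gap_in_stem, gap_in_loop)  # 4.0 7.0
```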
So, again, this is something that the Human Microbiome Project has worked hard on to sort
of come up with a fixed-width character alignment format: NAST. And again, NAST was the original
program that really, you know, just specifically is designed for aligning 16S sequences. NAST
has now been changed, or, you know, made better. And it's now called NASTier, which again is
from the Broad site. And NAST is the original alignment based on this Ribosomal Database
Project, you know, the curated dataset. NASTier, the difference is that NASTier now allows
you to have -- if you have paired end sequencing and you don't have the middle region, it doesn't
count -- it doesn't penalize you for putting Ns in the middle of your sequence. You can
still do that type of alignment and it is aware of -- that you could have a gap in your
sequence.
So, the thing is that what you have now is that you have these sequences aligned but
now you want to build a phylogenetic tree and you want to calculate the branch length
between each of these sequences and start to bin them. And for this, typically people
are using ARB, which is based on the Silva Database, to build the phylogenetic tree.
And so you'll end up with a parsimony-generated dendrogram. And then this tree is then input
into the next step which is typically to define these taxonomic groups by sequence similarity.
And now you kind of, I mean, and actually now, MOTHUR -- there's two programs -- there's
MOTHUR and there's QIIME and I'll sort of get to QIIME. Either one of
these is again now this whole sort of soup to nuts program of how to do everything from
taking your bacterial sequences and bringing them through to an analysis and a visualization
and a display tool. But MOTHUR will take your sequences and it will group them, once you
have that phylogenetic tree, it will group them based on what sequences are what, you
know, are similar to each other. And you can set that similarity and you can say, "I want
to," you know, "I want the groups to be 99 percent identical"; "I want them to be 97
percent identical." And, again, it depends on what level of specificity you want to have.
A lot of our projects will look at 99 percent, other projects will look at 98, 97 percent
and you do sort of want -- it is a craft, it's like FACS sorting, you know. If you say
you want 99 percent similarity you will have many more groups. And it's really about what
level of specificity are you trying to do your analysis?
The other thing that you have to be aware of when you're forming these taxonomic groups
is the Nearest Neighbor Joining Method versus the Furthest Neighbor Joining Method. And
all of this is really documented and explained, but Furthest Neighbor means that any two sequences
have to be at least 99 percent identical to each other, whereas the other method means
that you kind of pick a root sequence and these two can be 99 percent identical and
these two can be 99 percent identical but then these two other sequences might be 98
percent identical to each other. How that becomes important is if you think about what
the error rate of the sequencing instruments is. If, you know, in other applications, when
you think about the sequence error, it is not as big of an issue because you are often
doing alignment and you're looking at multiple reads of the human genome, you know, and you
have like 50 reads of this region of the human genome, and you're then saying 25 of them
are As and 25 of them are Cs and you call the genotype as AC. But in our case, again,
like I stressed about this issue about chimeras, each one of these sequences is being taken
as its own sequence that is uniquely representing a single bacteria, so we don't have the same
way where we are correcting sequencing errors. And probably the more realistic view of the
human genome data would be instead of 25 As and 25 Cs, you'd be getting 24 As, 24 Cs,
one G and one T and you ignore the G and the T.
We don't have a reference by which to know whether this is improbable or not. There could
be one bacteria out of 25 that actually does have a G at this position. So, again, this
is where you have to kind of, you know, understand the data and start to think about the number
of sequences. This is old Sanger data, I can see, because now we would no longer take it
up to 140 sequences; it would be going to, you know, 3,000. But the idea remains the same,
that you look at the percent, you know, if you're looking for 100 percent identity then,
you know, is what you're measuring here sequencing errors or is what you're measuring here actually
bacterial diversity? So, most experiments classify at the 97 or the 99 percent identity.
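
The difference between the nearest and furthest neighbor methods can be sketched with a small agglomerative clustering routine. This is a toy written for illustration, not MOTHUR's implementation; the three sequence names and their distances (1 percent, 1 percent, 2 percent) mirror the example in the talk, and the function name is hypothetical.

```python
def cluster(names, dist, cutoff, linkage):
    """Greedy agglomerative clustering on pairwise distances.

    linkage=min gives nearest neighbor (one close pair is enough to merge);
    linkage=max gives furthest neighbor (every pair must be within cutoff).
    """
    clusters = [[n] for n in names]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist[frozenset((a, b))]
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break  # no remaining pair of groups is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# A-B and B-C are 99 percent identical; A-C is only 98 percent identical.
dist = {
    frozenset(("A", "B")): 0.01,
    frozenset(("B", "C")): 0.01,
    frozenset(("A", "C")): 0.02,
}
nearest = cluster(["A", "B", "C"], dist, cutoff=0.01, linkage=min)
furthest = cluster(["A", "B", "C"], dist, cutoff=0.01, linkage=max)
print(nearest)   # [['A', 'B', 'C']]
print(furthest)  # [['A', 'B'], ['C']]
```

At a 99 percent cutoff the nearest neighbor rule chains all three sequences into one group, while the furthest neighbor rule keeps C apart because it is only 98 percent identical to A.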
Okay, now when we do the types of analyses, there's really two different types of methodologies
that we can use and these are the most common. We look at community membership, which the
term Jaccard, it's known as the Jaccard, and the community structure
is the theta. But these are two different ways of looking at the datasets. I've sort
of diagramed it out here, pretending that we're talking about a fruit bowl. And the
question that you would ask here is, you've got these two fruit salads. And if I want
to say how -- what categories of fruit do they have in common? And this would be like,
you know, do they all have Streptococcus, do they all have Staphylococcus? I would say
actually they're not that similar, because only two out of five of the communities are
shared between the two groups. But if I said, you know, then the other way of saying it
is the community structure. If I took, you know, if I took 100 pieces of fruit out of,
you know, the first fruit bowl, would I find -- how many of those 100 pieces of fruit would
I find in the second bowl? And there the answer is about 90 percent.
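
The two views of the fruit bowls can be sketched numerically. The fruit counts below are invented so that two of five fruit types are shared; the Jaccard index is computed on presence or absence, and the structure measure is assumed here to be the Yue and Clayton theta similarity, computed on relative abundances. The theta value is not the same quantity as the "about 90 percent" overlap described in the talk, but it behaves the same way: it is high when the abundant members are shared.

```python
def jaccard(a, b):
    """Community membership: shared types divided by all types seen."""
    present_a = {k for k, v in a.items() if v > 0}
    present_b = {k for k, v in b.items() if v > 0}
    return len(present_a & present_b) / len(present_a | present_b)

def theta_yc(a, b):
    """Community structure: Yue-Clayton theta similarity on relative abundances."""
    total_a, total_b = sum(a.values()), sum(b.values())
    keys = set(a) | set(b)
    pa = {k: a.get(k, 0) / total_a for k in keys}
    pb = {k: b.get(k, 0) / total_b for k in keys}
    shared = sum(pa[k] * pb[k] for k in keys)
    denom = sum(p * p for p in pa.values()) + sum(p * p for p in pb.values()) - shared
    return shared / denom

# Hypothetical fruit bowls: the two dominant fruits are shared, the rare ones are not.
bowl1 = {"apple": 50, "banana": 45, "cherry": 5}
bowl2 = {"apple": 48, "banana": 46, "grape": 3, "kiwi": 3}

membership = jaccard(bowl1, bowl2)   # 2 shared types out of 5
structure = theta_yc(bowl1, bowl2)   # close to 1: dominant fruits match
print(membership, round(structure, 3))
```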
So, the question is, "What's important? Is it community membership or is it community
structure?" If you think about two bacterial communities, whether or not they are similar
probably has to do with what kind of protein encoding potential they have. And so, you'd
think about community structure. But if what you want you to say is, "Does this bacterial
community have the potential to bloom?" And maybe there's some infectious, you know, bacteria
that certain people are susceptible to having these kinds of infections and other people
aren't. Then you're worried about community membership. Because it could be a bacteria
that's there at very low levels, but under other circumstances could bloom and suddenly
you have a Staph aureus, you know, infection, so that's what you want to look at. So, we
typically calculate both of these and then we look at, you know, they can be used for
very different types of questions.
This is an example of using the community membership, where what we are -- what -- again,
this data is showing and it's looking again at those obese mice. It's laying out that
this is the genotype of the animal, but then also it's which mother is it from? So, M1-1,
M1-2, and M1-3, those are the pups of this mother one and these are the pups of -- oh,
sorry, that's too small even for me to read -- that's the -- those are the pups of mother
three and mother three clustered together the pups of mother one and mother one clustered
together and mother one and mother three are sisters. So, what you see here is that at
the level of community membership, pups are most like their mothers and the next step
is that they are most like their cousins if their mothers are sisters. And so, this is
saying that microbes are inherited, at least in mice, microbes are inherited from their
mother.
But when we looked at a mouse mutant, what we saw was that with community structure -- and
you saw this also with the OB mice versus the wild type mice -- that they are most like
other mice of the same genotype because what bacteria they have may be defined by their
mother but the proportions of those bacteria are then defined by their genotypes. So that's
why community membership and structure can give you different types of information.
And UniFrac, like MOTHUR, this is part of QIIME now, these are, I would say, these are
sort of two of the most highly used methods, UniFrac and MOTHUR, for generating these same
kinds of data. This will get you the same, you know, calculating the branch length, giving
you statistical analysis, and it generates the same kind of data. Again, I always think
it's useful to use two independent methods to look at your datasets, because if you see,
you know, something being statistically significant with one method and with another method, you
have additional confidence that you're really looking at something real.
How much diversity is there in the population? Here we calculate these rarefaction curves
where we just say, you know, how many OTUs am I seeing as I add additional sequences
and how many would I predict? And again, here what we're seeing is that that's very dependent
on the body site.
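
A rarefaction curve can be sketched by repeatedly subsampling the reads at increasing depths and counting how many OTUs turn up. The read labels, depths, replicate count, and function name are assumptions for illustration; real tools add confidence intervals and richness estimators on top of this.

```python
import random

def rarefaction_curve(otu_labels, depths, replicates=50, seed=0):
    """Mean number of distinct OTUs seen when subsampling to each depth."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        observed = [len(set(rng.sample(otu_labels, depth)))
                    for _ in range(replicates)]
        curve.append(sum(observed) / replicates)
    return curve

# Hypothetical sample: 100 reads, one dominant OTU and three rare ones.
reads = ["otu1"] * 90 + ["otu2"] * 5 + ["otu3"] * 3 + ["otu4"] * 2
curve = rarefaction_curve(reads, depths=[1, 10, 50, 100])
print(curve)
```

A curve that is still climbing at the deepest subsample suggests that more sequencing would reveal more OTUs; a curve that has flattened suggests the site has been sampled to saturation.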
As you delve into it, we have measures for all of these things. These are, pretty much
have been developed from environmental sequencing, so the richness is the number of OTUs or,
you know, species. The diversity accounts for -- sorry -- the evenness is the distribution.
Do you have, you know, 90,1,1,1 or do you have 10, 10, 10, 10, 10? And then Shannon
Diversity, which is pretty much what people put out there -- the Shannon Diversity Index
takes into account the richness and the evenness. So all of these are the kinds of ways that
you would characterize community structure beyond even saying "What are the bacteria,"
and that people use to talk -- to compare.
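
These measures can be written out in a few lines. Richness is the number of OTUs with at least one read, the Shannon index is H = -sum(p_i ln p_i) over relative abundances p_i, and evenness is assumed here to be Pielou's J = H / ln(richness), which is one common choice rather than the only one. The two count vectors follow the 90, 1, 1, 1 and 10, 10, 10, 10, 10 examples from the talk.

```python
import math

def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon diversity index using natural logarithms."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Shannon index scaled by its maximum for this richness (assumed measure)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

skewed = [90, 1, 1, 1]
even = [10, 10, 10, 10, 10]
for name, counts in (("skewed", skewed), ("even", even)):
    print(name, richness(counts), round(shannon(counts), 3),
          round(pielou_evenness(counts), 3))
```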
So this is an example of work from my own lab, where we looked at a survey of what are
the bacteria on the different parts of the human skin. And what you can see here, this
is those plots from the RDP where each of the bacteria are just classified at the phylum
and then genus level. What you can see is the blue sites here are the oily sites and
they have a high preponderance of this Propionibacterium, which are lipophilic bacteria. The moist sites,
these creases, have a lot of the Corynebacterium and also in cases have a lot of the Staphylococcus.
And the dry sites, which are, you know, typically, well, the buttocks and the arms, those actually
have the greatest diversity. So, you know, there's a lot of different ways that you can
look at all these communities, this is probably the easier way to see that. Now if I am showing
you four different healthy volunteers, you can see, again, that all four of them have
a lot of Propionibacterium bacteria on their back.
And, you know, what you could really see from here, the antecubital crease is the bend of
the elbow. So, you could see that, you know, these people, their backs are more similar
to each other than their back is to their arm. And then, really what we see from this
is that the ecology of the site is dominating what bacteria live there rather than the individual.
So the bacteria are responding to, is this an oily site, is this a dry site more than
who is this that I am living on? Because, in general, humans provide many different
microbiological niches for bacteria.
This is an early analysis of the different habitats. And the colors here, the greens
up here, these are all sites from the oral cavity, these are all from the gut. And then,
these are different sites from the skin including hairy sites and inside the ear. And, you know,
there is wider diversity than you see for the oral cavity and for the gut but what you
see here is basically again at a higher level. It is the body site that is defining what
bacteria live there. So, stay tuned for further insights from the Human Microbiome Project
and further tool development.
That was talking a lot about bacterial diversity; we're also doing work to look at fungal diversity.
Fungi are eukaryotes and have an 18S and an ITS, internal transcribed spacer, sequence. And we're
using time to adopt the same methods that we did for bacterial sequences to look at
fungi because there's probably a tremendous amount of a relationship between the bacteria
and the fungi.
So, now I'm going to talk about sequencing bacterial genomes and this sort of goes hand
in hand. But, again, I would say that these sequencing instruments that are coming online
and that are, you know, present in many of the sequencing centers and may come into people's
labs, also, soon, these are really ideal for sequencing microbial genomes. Microbial genomes
are about three to six million base pairs. And the type of data that you get from a HiSeq
or from a Roche is really perfect. You get a lot of depth of coverage and they are fairly
affordable to do now. So what you get is you get these short reads, and I know other people
have talked about that, and you then align these reads into contigs. There are also ways
that you can get paired end reads, and depending on the size of the paired end read you can
bridge these contigs.
There are several different assemblers that are used for assembling bacterial genomes.
This is also sort of fairly out of the box. We just use standard assemblers. And it really
is, for these that you can make a library, get your, you know, give your DNA, make a
library, and for many of these instruments you will now get back large contigs of DNA
sequence. Now you don't get back a finished, total reference genome -- Velvet is another
one that we've used -- so in, you know, so you do have to think about what level of coverage
that you need. So how big are the contigs going to be and how many of them are there?
And you have to sort of start thinking about that because you kind of need to pilot that.
For a six-megabase genome, you can make these calculations of how much sequence you would
need. But there are things that break contigs. Like any time you have a ribosomal RNA operon,
that will break the contig because there are many of those copies of that in a gene -- genome.
And you aren't sure if you start on this contig and you hit 16S operon or the RNA operon,
you know, if there's five copies of that you don't know then which is the next contig that
you're going into. So, and transposons, phage inserts -- there are things like that that
will break contigs.
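
The coverage calculation can be sketched with the usual relation coverage = (number of reads x read length) / genome size. The 100 base read length and the 30-fold target are assumptions chosen for the example, and the function names are hypothetical; the calculation ignores uneven coverage and reads lost to quality filtering.

```python
import math

def expected_coverage(n_reads, read_length, genome_size):
    """Average depth of coverage for a given amount of sequence."""
    return n_reads * read_length / genome_size

def reads_needed(genome_size, target_coverage, read_length):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(genome_size * target_coverage / read_length)

genome_size = 6_000_000   # six-megabase genome, as in the talk
n = reads_needed(genome_size, target_coverage=30, read_length=100)
print(n)                                        # 1800000
print(expected_coverage(n, 100, genome_size))   # 30.0
```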
But we can also use that information to try to generate information about what is the
genome that we're looking at. So here is the Staphylococcus genome that we sequenced in
the lab. And you can see here -- this is the contig length, and you can see that most of
the contigs are here. I'm sorry, I can't even read this. But this is sequenced at about
30-fold depth. And you can see that most of the contigs are this size. We don't really
trust anything that's, you know, a tiny contig, but what you can see is that, over here, this
sequence, these three all assembled, is present two times the amount as the other parts of
the DNA. This is found at five times, and these are plasmids that are high copy. And
that gives you an idea that these encode non-, either repetitive sequences or plasmids that
are in the genome.
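
Reading copy number off the depth of coverage can be sketched by dividing each contig's mean depth by the median depth of the trusted contigs. The contig names, lengths, depths, and the 1,000 base minimum length are invented for illustration and are not the values from the slide.

```python
from statistics import median

def estimate_copy_number(contigs, min_length=1000):
    """Estimate copy number per contig relative to the median depth.

    contigs is a list of (name, length, mean_depth) tuples. Contigs shorter
    than min_length are dropped, since tiny contigs are not trusted.
    """
    trusted = [c for c in contigs if c[1] >= min_length]
    baseline = median([depth for _, _, depth in trusted])
    return {name: round(depth / baseline) for name, _, depth in trusted}

# Hypothetical assembly sequenced at roughly 30-fold depth.
contigs = [
    ("c1", 250_000, 30.0),
    ("c2", 180_000, 31.0),
    ("c3", 120_000, 29.0),
    ("c4", 90_000, 30.0),
    ("c5", 5_000, 60.0),    # about twice the depth: likely a repeat
    ("c6", 3_000, 150.0),   # about five times the depth: likely a high-copy plasmid
    ("c7", 300, 400.0),     # tiny contig, ignored
]
copies = estimate_copy_number(contigs)
print(copies)
```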
One of the things that we have found as a field, you know, people talk about Staph epidermidis
or they talk about Streptococcus agalactiae, and morphologically, they look indistinguishable,
different, you know, isolates, will look indistinguishable by traditional biochemical means. But then
what you can see is that you know some of them have drug resistance to this or that
but they also sometimes have different invasive properties and so what we find is that, in
fact, there is a pan-genome where something like a strep agalactiae, when you sequence,
and this project was based on sequencing here when you sequence, like, 11 or 12 of these
genomes, what you see is that about 80 percent of the genes are found in every genome. But
this is looking at, so the number of core genes, as you start sequencing, you'll find
that there are approximately 1,800 core genes. But in fact each of these genomes has an additional
200 -- I'm sorry, 400, genes that aren't part of this core genome. Those are sort of a random
mixture that are found in some of them but not others of them. That's called the flexible
genome, the open genome, the variable genome, but what we basically are seeing is that,
of course, we talk about bacteria as species but they're not engaging in, you know, sex.
I mean, this is -- they can switch around their genetic information with recombination,
horizontal gene transfer; they don't have the same constraints. So there are bacteria
that are, you know, extremely similar, but there also are a lot of bacteria that have
a core genome and then have a flexible genome. And that actually starts to get at why, perhaps,
are some strains more invasive, some strains, you know, able to, you know, metabolize this
or live in the blood stream versus the urinary tract.
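
The split into a core genome and a flexible genome can be sketched with set operations on per-isolate gene lists. The isolates and their gene content are a toy example ("capsuleX" and "phageY" are placeholder names), and the sketch assumes genes have already been grouped into orthologs, which is the hard part in practice.

```python
def pan_genome(genomes):
    """Split gene content into core, pan, and per-isolate flexible sets.

    genomes maps an isolate name to the set of gene families it carries.
    """
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)   # present in every isolate
    pan = set.union(*gene_sets)           # present in at least one isolate
    flexible = {name: genes - core for name, genes in genomes.items()}
    return core, pan, flexible

genomes = {
    "isolate1": {"gyrA", "recA", "rpoB", "mecA"},
    "isolate2": {"gyrA", "recA", "rpoB", "capsuleX"},
    "isolate3": {"gyrA", "recA", "rpoB", "mecA", "phageY"},
}
core, pan, flexible = pan_genome(genomes)
print(sorted(core))                   # ['gyrA', 'recA', 'rpoB']
print(len(pan))                       # 6
print(sorted(flexible["isolate3"]))   # ['mecA', 'phageY']
```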
So that tells us about the genome structure, but also sequencing the bacterial genomes,
we can also pick up mutations that have occurred either as insertions, deletions, or mutations.
This is a project that we did in my lab for the NIH Clinical Center where we had three
different bacterial strains that were of Acinetobacter, that were circulating in the hospital. We
wanted to understand what is the phylogenetic relationship. So we sequenced these three
genomes and then these, exactly as I told you, they formed contigs. And now we're using
Marco Marra's program Circos, which was actually designed for cancer genome
sequencing, but I can tell you it's perfect for microbial genomes, which actually are
circles. We now are looking at it. And what we're looking at is any time that -- there's
these different strains that were in the NIH clinical center in 2007. We're looking at
strain A and we're saying any time that strain A is different from the reference to which
we aligned this, you code that SNP blue. Any time B is different than A you code it as
red. Any time C is different than B or different than A you code it as green.
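
The coloring rule can be sketched as a column-by-column comparison, assuming the reference and the three strains have already been aligned to equal length. The sequences are invented, positions are 0-based, and the function only builds the three lists of positions; it does not draw the circular plot.

```python
def color_snps(ref, a, b, c):
    """Assign SNP positions to color tracks following the rule in the talk.

    blue:  strain A differs from the reference
    red:   strain B differs from strain A
    green: strain C differs from strain B or from strain A
    """
    tracks = {"blue": [], "red": [], "green": []}
    for pos, (r, x, y, z) in enumerate(zip(ref, a, b, c)):
        if x != r:
            tracks["blue"].append(pos)
        if y != x:
            tracks["red"].append(pos)
        if z != y or z != x:
            tracks["green"].append(pos)
    return tracks

ref      = "ACGTACGT"
strain_a = "ACGTTCGT"   # differs from the reference at position 4
strain_b = "ACGTTCGA"   # differs from A at position 7
strain_c = "ACCTTCGA"   # differs from A and B at position 2, from A at position 7
tracks = color_snps(ref, strain_a, strain_b, strain_c)
print(tracks)  # {'blue': [4], 'red': [7], 'green': [2, 7]}
```

Runs of closely spaced positions in one track are what would show up as the blocks of recombination, as opposed to isolated SNPs scattered around the genome.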
And what you can see is that these three genomes are different, but they're different because
there have been these regions of homologous recombination. There are also SNPs that distinguish
them, but really what was confusing to us was that there are these blocks that you can
now see are blocks of homologous recombination when you line them up like that. And that
allowed us -- sorry -- that allowed us to then understand the phylogenetic relationship
that B and C were more closely related to each other than they were to A. And if you
look at these blocks of recombination, this is exactly what you are seeing, is that, this
again Marco Marra's program, where you're seeing it now is a pinwheel. So these genes
here, from A, are identical and found in B and found in C. But in fact these genes in
the middle region, this is defining exactly the block of recombination. These genes are
all unique to A, B, and C. And these encode, actually, the O-antigen biosynthetic locus
that is on the outside of the cell and is used for detection by the human immune system.
So this is an example where you really get down into the nitty-gritty of understanding
the different bacteria.
So I've sort of talked about bacteria, I've talked about fungi, I sort of want to give
a two-minute shout out to viruses because I think that this is going to be another really
important area of microbial genomics, but it's the hardest one, which is to find novel
viruses or even to understand what is the viral diversity that lives, you know, again,
on us. And it's the hardest because you really have to do sort of de novo sequencing and
there's RNA viruses, DNA viruses. You have to think about what are you going to use as
your control. But I want to just sort of show this as one example where they used genomic
sequencing to identify a novel virus. This was a case of someone who was an organ donor
and then the three people who received organs from this person who, of course these three
people are now immune suppressed because they've just received an organ transplant, they then
all died within a month from this fever and this sepsis. So the question is, was there
an underlying viral origin to these tissues?
So what you can do is then these can all be sequenced independently, but you're really
looking here for the needle in the haystack. You're looking for something that you see
in these sequences that you never see in a human genome. So, I'm sorry, this ended up
finding a novel arenavirus. This type of method has also been used for finding Merkel cell
carcinoma. But this really, you know, for diarrhea, this is really an area that needs
a lot of, you know, sequence being thrown at it.
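
The needle-in-the-haystack search can be sketched as subtraction: discard every read that looks like host sequence and examine what is left. The sketch below uses shared k-mers against a short invented host string; the value of k, the sequences, and the function names are assumptions, and a real analysis would align reads against the full human genome with a dedicated aligner before assembling and classifying the remainder.

```python
def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def non_host_reads(reads, host_reference, k=8):
    """Keep only reads that share no k-mer with the host reference."""
    host_kmers = kmers(host_reference, k)
    return [read for read in reads if not (kmers(read, k) & host_kmers)]

host = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"   # stand-in for the human genome
reads = [
    "TAGCCGATAGGC",       # matches the host, so it is subtracted
    "GGGGTTTTCCCCAAAA",   # shares nothing with the host, so it is kept
]
candidates = non_host_reads(reads, host)
print(candidates)  # ['GGGGTTTTCCCCAAAA']
```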
So coming back even to the regulatory issue of how are we going to keep a healthy microbe.
You know, sequencing is just the start. If you want to talk about a microbe being associated
with a disease, then historically you should satisfy Koch's postulates: that the microorganism
is found in abundance from organisms suffering from the disease but not in healthy animals;
you should be able to isolate it in culture and then transfer it into a healthy organism
and recreate the disease. It's not clear to me that when we now start to think about microbiome,
that it is going to be individual organisms, it may really be that it is, that even introduction
of something like, vancomycin-resistant enterococci, you know, the VRE, but that is only pathogenic
in the context of limited microbial diversity of the gut, and that perhaps if there is a
VRE, but there's also sufficient amounts of the commensal Bacteroidetes
that they would keep that VRE in check. So that makes it difficult for us to move, and
the sequencing is all about generating hypotheses, but then thinking about how we're going to
test them becomes complicated because we may not be able to satisfy what are the original
tenets.
In the last few minutes I'm just going to talk about what is the most complex part of
what we do, which is metagenomics. So again, I was saying this, and, you know, about the
spaceship coming down and sequencing the DNA from all those people in the middle of the
crosswalk in New York. But that's really what we would like to ultimately achieve, which
is to understand who are all the players all together, which would get us all the bacteria,
fungal, viral, archaeal DNA all together. In some cases, probably you'd also generate
human DNA because the bacteria live in such intricate association with the human. You
would end up -- you will end up with a very complex mixture and the computational analysis
is very complex. So what do I mean by that? With metagenomics, you know, we sort of talked
about this in the context of the pan-genome, that you could imagine that you'd be looking
at two different populations and that, again, you know, that you'd see the pink, the green,
and the blue, but it's really about getting at the level of the green gene is enriched
and the pink gene is reduced in this population. And you wouldn't get this by looking at 16S
because maybe these are all the same type of bacteria but within that type of bacteria
they have diversity. So for example, when I look at 16S sequences, I can't tell you
if this is a methicillin-resistant Staph epidermidis or that this is a methicillin-sensitive Staph
epidermidis. So I would need to do this kind of metagenomic sequencing to understand what
is really in those genomes. But oftentimes the sequences will then be discontinuous.
So in humans there are the first studies of metagenomics; there has been a lot of metagenomic
sequencing also from this group MetaHIT. And it's generated a lot of controversy about
how many different types of microbiomes are there. Are there two? Are there seven? Are
there eight? How many different vaginal types are there? Are there five? Are there three?
And, you know, what we're getting at here is what is the diversity that constitutes
normal and what is going to constitute dysbiosis, or, you know, deviations from the norm. And
there's a lot of room for argument here because we have not yet solidified how we will analyze
these rich, complex microbiome sets. So the tools don't yet exist to catalogue and comprehend
microbiome data. These are from the human gut microbiome and, you know, what's really
kind of sad about this is that in this rich dataset -- and then you can, from this, sort
of look at what bacterial phyla are present or what KEGG-COG terms are present. And they
look fairly similar. But in fact you are taking a very detailed information set and you're
reducing it to sort of 20 categories. And that may not be the level of resolution that
you need to really understand what are the differences between these two bacterial communities.
But it's really hard to know, you know, at what altitude you need to be looking at this
kind of metagenomic data.
Metagenomic data has been very useful in these types of experiments, where you're looking
for new metabolic enzymes. You know, and this is sort of in terms of what are the new energy
sources that the world could be harnessing. So the two energy sources that have been examined
from a metagenomic perspective is the termite hindgut, you know, so how does it take wood
and create that into energy? The cow rumen, so what they do is they put this into the
cow stomach and, you know, they incubate it and then they look at what bacteria they will
find in that cow rumen after that food has been digested. And they're using this, they're
actually, in these cases they're getting to a level of specificity where they know that
they're looking for certain classes of enzymes, and they can find these with metagenomics.
So these are some of the examples where metagenomics is now being used to find new enzymes that
could be used in energy production. But in terms of the human genome, we really still
need some computational tools to think about -- not if you're looking at one, you know,
looking for one gene, but if you're looking at the whole classification, how would you
really deal with metagenomic complexity? And that is just very much an open question. So,
that's my presentation for today, thank you all for coming and for participating in the
course series. It's really a pleasure for NHGRI to host this. Thank you.
[applause]
Andy Baxevanis: You can ask a couple of questions if you could
come to the microphone. Thank you.
Male Speaker: Wow, thank you for a comprehensive presentation.
And it looks like we have seen the process of discovering most of those variations and
all their implications. And, so since we are putting all the money, is there any way we
could get something back in terms of investment? I attended a talk on autoimmune disease, so
at the end of the presentation -- it was a beautiful talk showing us all of the issues.
And then the question came out that maybe we need -- our immune system is so aggressive,
that we need more challenges for them every day. So --
Julie Segre: Right, right.
Male Speaker: So the question is, I asked, so what is a
good system? They say, we need good parasites. I say okay, what is a good parasite and where
should we put them? In what body cavities? Skin? Gut? Or somewhere else? So, since you're
covering some of the areas do you think we could get some idea of these sequences?
Julie Segre: Right, so I think that, in terms of human
disease, I think the first progress will be in the context of using the microbiome as
biomarkers. In the sense, in the same way that, you know, a diabetic checks their blood
sugar level, we would hope that a kid who has eczema would be able to check their skin
microbial diversity and see when they were about to have a flare. So that's one possibility.
I think in the intensive care units it would be used on a rectal swab to say, is this antibiotic,
basically, you know, doing something bad to the person's, you know, GI population. That
this person is now at increased risk for developing a VRE infection.
More generally, in terms of health, you're getting to the ideas of Stan Falkow and Marty
Blaser, that we've gone through these bottlenecks and that we're not properly educating our
immune system because, first of all, the use of antibiotics in early life and lack of understanding
of how that may affect kids six months and a year later and 20 years later and 40 years
later. There is something called the hygiene hypothesis that believes that kids who are
in, you know, that shows kids in daycare or kids on farms have less allergic disease.
But I think that's why we just sort of need a baseline to understand what is the microbial
diversity now and are we messing with it, you know? And the same way that my husband
would love to see what Chicago looked like 500 years ago, I'd love to know what the microbiome
looked like 100 years ago, before we started using antibiotics and also the urbanization
of our society.
Male Speaker: So we also use a lot of antibiotics after
a lot of infections and other things, so what is a good way to repopulate the good bacteria
in your gut?
Julie Segre: You know, I certainly have to encourage people
to take antibiotics if you have strep throat. And I don't really, you know, I think Activia
is a marketing genius, but it's not clear to me that it changes the microbial diversity
of your gut, if you're a normally healthy individual, and so I don't really have any
comments other than, you know, eat a healthy diet, get exercise, and don't smoke.
[laughter]
Andy Baxevanis: Generalized [inaudible].
Julie Segre: Yeah.
Andy Baxevanis: Any other questions? All right, let's take
a moment to thank Julie once again.
[applause]