Morphologic Analysis for Characterization of Disease Subtype: Dr. Lee Cooper

Uploaded by NCIevents on 18.04.2012

>> Thank you for the introduction and thanks for having me today
to present our work at the CBIIT Speaker Series.
I'll be talking about some of the work that we've been doing at our NCI caBIG
In Silico Research Center of Excellence.
We tend to have a mix of science and informatics people and we're looking at aspects
of glioma brain tumors.
We have a strong image analysis component in a lot of our projects, working
with both pathology and radiology images to try
to understand how these can add value to the study of the disease.
So, this is the agenda for the talk today.
I'm just going to talk a little bit about the background to provide context
about why we're doing this and what is glioblastoma and a little bit
about the imaging technology involved.
I'll talk about the actual topics in the paper,
the pipeline that we've developed to perform integrated morphologic analysis,
the results and the validation.
Then I'll talk about some of our efforts to scale up to larger datasets.
So we're developing several tools that you can describe as software infrastructure to deal
with larger and larger datasets.
Finally I'll wrap things up with discussing our future work and conclusions.
Let's move on to the background material.
So, the NCI caBIG In Silico Brain Tumor Research Center is a collaboration
between several groups.
It's led by Joel Saltz and Dan Brat, who are both at Emory University.
Joel is our informatics guy; Dan is our neuropathologist.
So we also have collaborations with several hospitals.
There's Adam Flanders at Jefferson who is working on some of the radiology.
Lisa Scarpace and Tom Mikkelsen at Henry Ford Hospital
and also Dave Rubin at Stanford.
The main theme of our center is to look at glioblastoma.
This is the most common primary brain tumor in adults.
It is a uniformly fatal disease.
Median survival is typically 50 weeks.
So the goals of our center are to leverage some of the rich datasets
that are now available and to perform in silico analysis to try
to understand the mechanisms of progression of this disease, and we also want
to share the data and datasets that we have with other researchers.
These are semantically complex and this is not an insignificant effort.
So, one of the things that makes glioblastoma an interesting topic is the sort
of variation that you see at the microscopic level.
So, if you look at the tumor cells,
sometimes they take on oligodendroglial characteristics like you see on the left here:
the sort of round, compact stark cells with this perinuclear halo.
Sometimes they resemble the more astrocytic phenotype.
So, these are kind of rougher; they're more potato-like.
There really is a continuum between these two classes.
So, if you talk to a neuropathologist, they say this is not really understood.
You see a lot of cells that are not clearly oligo or astro but are tumor cells.
We also see a lot of interesting macro structures.
So this right here is an example of pseudopalisades around this necrosis here.
So, what happens is you reach a critical mass of cells and there's an insult
to the vasculature in the area, and so the cells become starved of oxygen
and they sort of become this wave that propagates across the tissue.
They search for more oxygen-rich areas.
This is another thing that sort of drives the disease and is a very hot topic.
We also see angiogenesis around abnormal vessels.
So, in terms of the pathology,
there's really a lot to talk about, a lot to analyze.
Fortunately, we have the Cancer Genome Atlas dataset.
So the idea with TCGA is to characterize 500 tumors for each of a variety
of cancers and so for each cancer type, glioblastoma was the first by the way,
we get the clinical records of the patient.
Things like outcome and what their treatment was, demographics.
We also get genomics.
I would say genomics is really the focus of TCGA so there's a variety
of platforms that are offered there to measure different phenomena.
We also have imaging that's available.
So, for all of the tissue that's sent in we have pathology slides, glass slides,
of that tissue, and we have radiology on a subset of the patients.
So, we're working on collecting that from the institutions
and developing that as a resource.
What we want to do is relate all these modes of data
to see what the pathology can tell us about the genomics, the clinical data
or the radiology.
So, TCGA really represents a unique opportunity
where you can do this type of analysis.
I'll just talk a little bit about some of this technology for scanning slides.
So, just your standard glass slides
that pathologists produce to give a diagnosis.
We have machines that can scan these at very high resolution
and high throughput.
So, it's really an informatics challenge to deal with this data
as it comes off the machine.
It's not quite as bad as next generation sequencing,
but it's still far from trivial.
So, you can, you know, fill these machines up with
about 250 slides and walk away overnight.
When you come back, you'll have 250 images, each on the order of 100,000 pixels square.
So you're really talking about massive amounts of data.
How do you organize that data?
How do you store it?
How do you transfer it?
How do you describe it?
So, it's really an area where we're kind of lagging behind radiology.
So, we don't have a DICOM standard for pathology images, for example.
So, it's a developing area of informatics.
I would say it's not quite as advanced as radiology,
and there are reasons for that.
One of the things about glioblastoma that's interesting is that, you know,
coming out of TCGA there was a study that was done.
It looked mostly at genomics.
It found that, given a set of tumors, if you look
at their gene expression they tend to cluster into four different groups
and these groups have different gene expression characteristics.
They also have mutations that are associated with them.
So you have, they're shown in this heat map here, you have roughly the proneural group,
the neural group, the classical group and the mesenchymal group.
There are differences in outcome among these groups, and there's a lot still to come
out about what this really means,
whether they come from different cells of origin, et cetera.
So this is really, this idea
of subtyping tumors using genomics has been a very hot topic, especially
since TCGA came out.
So the idea is to separate these patients into groups.
Maybe if you bundle them all together and you have a clinical trial
of a drug, it doesn't seem very effective.
But if you know you're dealing
with four different groups here, maybe it was effective in one group
but just not effective in the others.
So that's some of the thinking behind doing this is
to better personalize therapy by targeting specific mechanisms.
So we have these themes of, you know, morphology in glioblastoma,
these molecular subtypes and we have rich datasets of TCGA.
So one of the things that we thought we'd go about doing is ask,
can we use image analysis to find whether there are natural clusters of GBM
defined in terms of morphology instead of genomics?
Also, if there are these clusters,
what are the links between these morphology clusters and patient outcome,
and also the molecular characteristics: the gene expression,
the mutations and the copy number?
So, what we did is come up with a methodology.
This was published in JAMIA in March.
You can go look it up.
I provide a link at the end of the presentation.
This is Figure 1 from that paper and it really lays out what the analysis is.
So, we have sort of four layers.
The top layer is responsible for analyzing the image so segmenting the cells,
calculating descriptions of the cells.
All of these descriptions for each cell go into the database,
and for each patient we look across the cells
and find what does the average cell look like in terms
of size and texture and shape.
Does it have an irregular boundary, does it have a regular boundary,
is it dark, is it light, is it textured, is it smooth.
This top method right here really gives us our model.
So, you can think of these morphology profiles per patient
as like a gene expression experiment.
So, just like gene expression, you can cluster them.
So we have some things to do that sort of address the numerical parts of clustering:
how to get rid of redundancy and how we do the clustering of the patients.
Once we have a set of patients and we're grouping them by the morphology,
we can play games looking at what are the differences between these groups
in terms of survival or response to treatment.
Do we have any associations with the molecular classes
that have been described in the literature, and how do they stack up with some
of the well known genetic alterations
that are mentioned frequently in the literature.
Then in the final layer what we do is take these patient groups
that are defined by morphology and do a very brief genome wide analysis to look
at differential expression of genes and differential methylation of DNA,
and so it's a very [inaudible] analysis to find the individual genes
that are associated with each cluster.
This goes into a little more detail about the morphology in general.
We have our slides and for each slide we go through and have an algorithm
that will delineate the boundaries of the nuclei shown in red there.
For each nucleus we define this cytoplasmic space, which is the green ring.
Then what we do is calculate features that describe, you know,
the shape of the nucleus, its color, its texture,
and likewise for this green ring that's around the cell.
So, you can imagine a cell that's closely crowded by other cells is going
to have a different sort of characteristic ... the cell is off
by itself.
Once we calculate these features for each nucleus we dump them into a database.
So, you're talking about hundreds of millions of objects, and it's really a challenge
to manage this information.
Once we put these features and the objects
into the database then we can start to ask questions.
So, you can formulate queries using ..., you know,
for this patient look at all the cells and give me the average of each feature.
You get an average profile.
In the model that we used in this paper, the representation
of morphology is the average profile.
You essentially ask what's the appearance of the average cell measured
in terms of shape and texture and so forth.
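
To make the averaging query concrete, here is a minimal Python sketch. It is only an illustration, not the database query or the MATLAB pipeline from the paper; the function name `average_profiles`, the patient IDs, and the feature names `area` and `texture` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_profiles(cells):
    """Group per-nucleus feature records by patient and average each feature.

    `cells` is an iterable of (patient_id, {feature_name: value}) pairs.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, features in cells:
        for name, value in features.items():
            grouped[patient_id][name].append(value)
    return {
        patient_id: {name: mean(values) for name, values in feats.items()}
        for patient_id, feats in grouped.items()
    }


# Hypothetical toy data: a few nuclei from two patients.
cells = [
    ("patient_A", {"area": 100.0, "texture": 0.2}),
    ("patient_A", {"area": 140.0, "texture": 0.4}),
    ("patient_B", {"area": 60.0, "texture": 0.9}),
]
profiles = average_profiles(cells)
```

Each patient ends up with one row of averaged features, which is the "average cell" that the later clustering steps operate on.
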
The next thing we do, once we have the patient morphology profiles, is
to feed them into the clustering engine.
So, you can imagine there are features that have different scales.
So size may range from 10 up to 200, whereas the statistical measures
of texture range from 0 to 1.
So, there is normalization involved
to make the data well behaved before we cluster.
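
One common way to put features on a shared scale is to z-score each feature across patients. The talk does not say which normalization was used, so treat the choice of z-scoring, and the name `zscore_columns`, as assumptions in this sketch.

```python
from statistics import mean, pstdev


def zscore_columns(profiles):
    """Standardize each feature across patients to mean 0 and unit variance.

    `profiles` maps patient_id -> {feature_name: value}; every patient is
    assumed to carry the same set of features.
    """
    names = sorted(next(iter(profiles.values())))
    stats = {}
    for name in names:
        column = [p[name] for p in profiles.values()]
        stats[name] = (mean(column), pstdev(column))
    return {
        pid: {
            # A constant feature has zero spread; map it to 0 instead of dividing by zero.
            name: (p[name] - stats[name][0]) / stats[name][1] if stats[name][1] > 0 else 0.0
            for name in names
        }
        for pid, p in profiles.items()
    }


# Hypothetical profiles: a size feature on a 10-200 scale and a texture feature on 0-1.
raw = {
    "patient_A": {"area": 10.0, "texture": 0.1},
    "patient_B": {"area": 200.0, "texture": 0.9},
    "patient_C": {"area": 105.0, "texture": 0.5},
}
normalized = zscore_columns(raw)
```

After this step the size and texture features contribute on comparable scales, so neither dominates a distance-based clustering.
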
Another step that we take before clustering is to eliminate features
that are redundant. So, you know, we use a lot of statistical measures
and they're related in many ways.
We don't want to essentially pad the dataset with sort
of redundant information before we cluster
because that can really hide relationships that we're trying to find.
So we have this entropy selection where we just look at each feature
and calculate the entropy of the dataset and we just take the top 75%
of features that are contributing to the entropy.
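
A rough sketch of entropy-based feature ranking follows. The exact entropy criterion used in the paper is not given in the talk, so this assumes a simple per-feature Shannon entropy over an equal-width histogram and keeps the top 75% of features by that score; `histogram_entropy`, `select_by_entropy`, the bin count, and the toy values are all hypothetical.

```python
import math
from collections import Counter


def histogram_entropy(values, bins=4):
    """Shannon entropy (in bits) of a feature after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_by_entropy(columns, keep_fraction=0.75, bins=4):
    """Return the names of the highest-entropy features.

    `columns` maps feature name -> list of values across patients.
    """
    ranked = sorted(
        columns,
        key=lambda name: histogram_entropy(columns[name], bins),
        reverse=True,
    )
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]


columns = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],       # evenly spread
    "texture": [0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.5, 0.5],     # three distinct levels
    "roundness": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.8],   # nearly constant
    "constant": [1.0] * 8,                                   # no variation at all
}
selected = select_by_entropy(columns)
```

Note that a per-feature score like this drops uninformative features but does not by itself detect two features that are redundant with each other; that would need a joint measure.
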
The next step what we do is perform a technique called consensus clustering.
So, in many biology data sets, you can cluster things in a way
that confirms what you want to find.
So, you know, a lot of times people will cluster until they find results
that they like and they'll use that particular clustering.
It's an idiosyncratic process.
It depends on how you initialize it.
So, the consensus clustering overcomes this in the sense
that it takes many random initializations
of the clustering algorithm and measures how frequently samples are
clustered together.
You can think of it as kind of ensemble clustering
over many thousands of initializations.
So it's very robust and it tends to find the groups very well
and it's not very easy to overfit using this method.
We like to use the consensus clustering method.
It also gives you this graph
where you can see, pairwise for each sample, how frequently they're clustered together
over all initializations.
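
The idea of counting how often pairs of samples co-cluster over many random initializations can be sketched as below. This assumes plain k-means as the base algorithm and toy two-dimensional points; the talk does not name the base clustering algorithm, and published consensus clustering methods often also resample the data, which this sketch leaves out.

```python
import random


def kmeans(points, k, rng, iterations=20):
    """Plain k-means with random initial centers; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def consensus_matrix(points, k, runs=200, seed=0):
    """Fraction of runs in which each pair of samples lands in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / runs for count in row] for row in together]


# Hypothetical data: two well-separated groups of three samples each.
points = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4), (5.0, 5.0), (5.3, 5.2), (5.1, 5.6)]
consensus = consensus_matrix(points, k=2)
```

Entries near 1 mean two samples almost always cluster together and entries near 0 mean they almost never do; intermediate values flag unstable assignments.
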
Once we [inaudible] cluster label to the patients then we can start
to play games just like people do with genomics,
which is to look at things like differences in survival across the groups.
Are there any associations between these morphology clusters
and the molecular classes?
So, you know, the classical, mesenchymal, neural and proneural classes that were defined in previous publications.
One thing we do is go back and look at human annotations of pathology
where pathologists have gone in and created a categorical rating
of different pathological features like how strong is the angiogenesis
or are there infiltrating immune cells in the tumor.
We can correlate that with these groups to see
if there are any strong associations.
Another thing we do is to look at a very limited set of genetic alterations
that are published frequently in the literature.
So, maybe it's things like TP53 mutations, amplification of EGFR,
deletion of CDKN2A.
That's kind of a very shallow look into the molecular world.
So, to go beyond that we actually go deep into the raw array data
and try to examine differential expression, DNA methylation and copy number
to really find the small events that may not get published
but that are very significant either in terms of, you know,
effect on the outcome or association with morphology clustering.
To do this we need a suite of tools that we borrowed from people on the web.
So, SAM, Significance Analysis of Microarrays, is a tool for testing significance in microarray data.
This was developed by Rob Tibshirani's group at Stanford.
We use the GISTIC method to look at significant differences in copy number.
I think this was developed by the Broad Institute.
We end up with a lot of lists of genes and how do we make sense out of those?
Well, we use the gene ontology capabilities of the DAVID database.
We also use Ingenuity Pathway Analysis to try to distill this information
into something that, you know,
our collaborators can digest as opposed to a list of 2,000 genes.
We'll look now and talk about the experiments and the results.
So, we applied this methodology to 162 tumors from TCGA.
They have 462 corresponding slides.
In these slides, we found 200 million nuclei.
This is quite a large dataset.
We found three clusters using our methodology and we named these clusters
for the genes that are associated with them.
So, one cluster is called the cell cycle cluster;
another is the chromatin modification cluster;
and the third is the protein biosynthesis cluster.
So, if you look here at this figure where you see the heat map in the middle,
each column is a morphology profile.
I've grouped them by clustering.
So you can see the patterns there.
Each row is some feature, size, area, texture, things like that.
So I think you can see clearly that they do form a cohesive set of groups.
There is some variation within each group especially the cell cycle cluster,
but you can see how the clustering would come about.
I think the really interesting thing is that when we looked
at these groups we found that there were significant differences in survival.
So, this chromatin modification cluster has poor outcome when compared
to the protein biosynthesis cluster.
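
Survival differences between groups like these are usually shown with Kaplan-Meier curves. The sketch below is a minimal estimator under that assumption; the talk does not say which estimator or test was used, and the follow-up times and group names here are invented for illustration only.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier estimate. Returns [(time, survival_probability), ...].

    `durations` are follow-up times; `events[i]` is True if the patient died
    at that time and False if the observation was censored.
    """
    survival = 1.0
    curve = []
    death_times = sorted(set(d for d, e in zip(durations, events) if e))
    for t in death_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if e and d == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Invented follow-up times in weeks for two hypothetical patient groups.
group_1 = kaplan_meier([20, 35, 40, 52, 60], [True, True, True, True, False])
group_2 = kaplan_meier([45, 60, 75, 90, 110], [True, True, False, True, True])
```

Censored patients (the `False` entries) still count in the at-risk set until their last follow-up, which is what distinguishes this from a naive fraction-alive calculation.
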
Another thing we did is to check
and make sure we're not recapitulating these molecular clusters
that were defined by the TCGA, you know, the classical, mesenchymal and proneural.
So, this essentially shows, for the chromatin modification and the protein biosynthesis clusters,
what the proportions of each of the molecular subtypes are
within these morphology classes.
So it's really more or less uniform.
There really aren't any strong associations.
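
A comparison like this is a cross-tabulation of morphology cluster against molecular subtype. A minimal sketch, with invented patient assignments and a hypothetical function name `subtype_proportions`:

```python
from collections import Counter, defaultdict


def subtype_proportions(assignments):
    """For each morphology cluster, the fraction of patients in each molecular subtype.

    `assignments` is a list of (morphology_cluster, molecular_subtype) pairs,
    one pair per patient.
    """
    counts = defaultdict(Counter)
    for cluster, subtype in assignments:
        counts[cluster][subtype] += 1
    return {
        cluster: {subtype: n / sum(c.values()) for subtype, n in c.items()}
        for cluster, c in counts.items()
    }


# Invented assignments; CC, CM and PB abbreviate the three morphology clusters.
assignments = [
    ("CC", "proneural"), ("CC", "classical"), ("CC", "mesenchymal"), ("CC", "neural"),
    ("CM", "proneural"), ("CM", "classical"), ("CM", "mesenchymal"), ("CM", "mesenchymal"),
    ("PB", "neural"), ("PB", "classical"),
]
proportions = subtype_proportions(assignments)
```

If the morphology clusters merely recapitulated the molecular subtypes, each row would be dominated by a single subtype instead of being spread across several.
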
We took it a step further and did a hazard analysis to see, you know,
which set of clusters is more predictive of outcome?
Is it the morphology clustering or is it the gene expression
based clustering?
What we found in our analysis is
that the morphology clustering is highly significant
whereas the Verhaak expression class is not significant,
which is an interesting finding I think.
We also validated this in a separate dataset.
Our collaborators at Henry Ford sent us a selection of 84 glioblastomas
and we performed the same analysis.
I think we found the same clusters.
Visually we confirmed this.
We also used a tool called ClusterRepro which says if you have a set of clusters
in one dataset can you find the same clusters in another dataset?
We found that for the cell cycle cluster, you know, it's highly significant.
We found that cluster in the Henry Ford data.
The same thing with the chromatin modification.
We did not observe the protein biosynthesis.
We have this group that's kind of a mixed group
that somewhat resembles the protein biosynthesis, but not exactly.
I would also mention that, you know, the relative survival of the two clusters
that were reproduced remains the same as it did in the TCGA dataset.
So, to visualize the results,
what we came up with was this idea to show each patient's average cell,
because that's essentially what we're using to do the clustering.
So, for each patient we took their morphology profile and compared it
to all their cells and we picked the cell that best represents that patient.
Then we organized them in the groups here
so you can see the different morphology clusters.
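
Picking the cell that best represents a patient can be sketched as a nearest-neighbor lookup against the patient's profile. This assumes squared Euclidean distance over normalized features, which the talk does not specify; `representative_cell` and the toy values are hypothetical.

```python
def representative_cell(cells, profile):
    """Return the index of the cell whose features are closest to the patient profile.

    `cells` is a list of {feature_name: value} dicts for one patient and
    `profile` is that patient's average profile with the same feature names.
    """
    def squared_distance(cell):
        return sum((cell[name] - profile[name]) ** 2 for name in profile)

    return min(range(len(cells)), key=lambda i: squared_distance(cells[i]))


# Hypothetical normalized features for three nuclei from one patient.
cells = [
    {"area": -1.0, "texture": 0.5},
    {"area": 0.1, "texture": -0.1},
    {"area": 1.2, "texture": 0.9},
]
profile = {"area": 0.1, "texture": 0.4}
best = representative_cell(cells, profile)
```

The returned index can then be used to look up the cell's location and crop it from the slide for display.
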
So, in the cell cycle cluster, the nuclei tend to be somewhat larger.
They're more hyperchromatic.
So, they have darker staining when you compare with the other groups.
In the chromatin modification cluster, these tend
to be smaller nuclei; they're lighter.
They have a more eosinophilic cytoplasm.
It's a little bit rougher; it's a little bit pinker.
Then in the PB cluster the characteristics are sort of intermediate.
So that may explain, you know, why we're not finding that cluster
when we look at the Henry Ford dataset.
Maybe it's some type of artifact.
So, this is Table 1 from the paper.
What we did, once we had the clusters and had done the survival analysis
and visualized them, was to look at all the possible associations
with molecular data and genetics and pathology.
So we found quite a few things.
I'm not going to go into details.
This is discussed in detail in the paper, and it's probably not of much interest to an informatics audience.
I will point out that in the results here we find a lot
of differential expression across the groups.
So, when you compare the cell cycle cluster to the CM cluster
and the PB cluster, you find that there are, you know,
more than 3,000 genes that are different across these clusters.
So, what we measure in the image really does translate
into some molecular difference, or you can look at it the other way around.
So when we distilled these lists of genes that were found to be significant
into actual, you know, concepts using DAVID, gene ontology and IPA,
we found some interesting things.
We found that the nuclear lumen localization is most highly enriched
in all genes for all clusters.
So, if you look at the genes that are differentially expressed,
all of them have nuclear lumen localization,
which is a kind of interesting point because, you know,
we were analyzing nuclei so it seems likely we would find, you know,
in groups of patients that have different shaped nuclei,
we would find genes associated with the nuclei.
Other terms that we found were things like DNA repair, M phase of the cell cycle,
and of course the names of the clusters; protein biosynthesis
and chromatin modification.
So when we looked at IPA, we found a lot of notable cancer pathways.
So, ATM and TP53 damage checkpoints,
the NFKB pathway, Wnt signaling, PTEN/AKT.
So these are things that are related to cancer and it's just nice to see that,
you know, that the cancer pathway is somehow related to morphology.
So I think I'll talk now about some software infrastructure that we're trying
to use as we scale this up.
So the original paper was written using a pipeline and a prototype I developed
in MATLAB, which works okay for 500 images,
but we want to analyze the rest of TCGA.
So, how do we scale this thing up to 14,000 images?
It's quite a leap to address the other 19 cancer types in TCGA.
Some of the problems we had: how do we physically calculate annotations
and extract features and do the clustering?
How can we organize the results?
Instead of 200 million nuclei now we're talking
about billions and trillions of nuclei.
Then the question of what's the best way to share our imaging
and our results with the outside world?
So we have a series of solutions for this that we've been developing.
The first is a high performance computing pipeline
for doing the actual image analysis tasks.
So a couple of guys here, Tony Pan and George Teodoro,
ported all of our MATLAB code into C and set it up to run on large computers
with multiple nodes and also accelerated by specialized graphics cards.
It was run on a system called Keeneland;
it's a sort of hybrid system
between these graphics processors and regular processors.
We are achieving good performance. We typically break these images up into large pieces
of 4000 pixels square, and now we can analyze those in terms
of thousands per second given enough nodes.
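
Breaking a large image into 4000-pixel tiles can be sketched as below. The image dimensions are made up and `tile_grid` is a hypothetical helper; a real pipeline would also have to deal with nuclei that straddle tile borders, which this sketch ignores.

```python
def tile_grid(width, height, tile=4000):
    """Yield (x, y, w, h) boxes that cover an image; edge tiles may be smaller."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)


# Hypothetical slide dimensions in pixels.
tiles = list(tile_grid(10000, 9000))
```

Because each tile is independent, the list of boxes can be handed out to as many worker nodes as are available.
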
So, we're realizing our dream of being able to analyze, you know,
not only TCGA data but also someday, you know,
we can scan larger archives even and analyze those too.
Another tool that's been developed by Professor Fusheng Wang here
at CCI is this PAIS database.
So, he developed a logical model to represent analysis results
from pathology images, and he's got a chart of it here on this slide.
One of the things it captures is objects that have been segmented.
Any kind of features that have been calculated
with those objects like texture or shape.
He also captures a lot of provenance related to the versions of software
that were used to generate this analysis.
So, you know, it's kind of a virtual experiment.
If we need to go back and repeat it, we can just use that provenance
and recalculate everything.
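
As a loose illustration of storing provenance next to each segmented object, here is a small sketch using Python dataclasses. This is not the PAIS schema; every class and field name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    algorithm: str
    version: str
    parameters: tuple  # (name, value) pairs, kept as a tuple so the record is hashable


@dataclass
class SegmentedObject:
    slide_id: str
    boundary: list  # polygon vertices [(x, y), ...] in slide pixel coordinates
    features: dict  # feature name -> value
    provenance: Provenance


def objects_to_recompute(objects, current):
    """Return objects that were produced by anything other than the current run settings."""
    return [obj for obj in objects if obj.provenance != current]


old = Provenance("nuclear_segmentation", "1.0", (("threshold", 0.5),))
new = Provenance("nuclear_segmentation", "1.1", (("threshold", 0.5),))
objects = [
    SegmentedObject("slide_001", [(0, 0), (4, 0), (4, 3)], {"area": 6.0}, old),
    SegmentedObject("slide_001", [(10, 10), (12, 10), (12, 14)], {"area": 4.0}, new),
]
stale = objects_to_recompute(objects, new)
```

Keeping the algorithm name, version and parameters with every result is what makes it possible to tell which results need to be regenerated after a software change.
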
So this model is very useful for what you can do with it.
All of our image analysis results get pushed into this database.
Then we have all of these tools where we can ask questions and do queries.
So, for our study here and the JAMIA paper,
we can push all the nuclei in there and ask, you know,
for patient X what does the average cell look like and, you know,
show me where the cell is on the image.
It really makes things a lot easier than having to script that kind
of stuff with MATLAB and C++.
So, we are also working out some sort
of parallel programming techniques using MapReduce to speed up some
of these queries, you know, for when we start putting
in 14,000 slides worth of data.
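
The averaging query maps naturally onto a map-and-reduce pattern: each chunk of nuclei produces partial sums and counts per patient, and the partial results are merged. The sketch below runs in a single process with `functools.reduce` just to show the shape of the computation; it does not use a MapReduce framework, and the function names and data are hypothetical.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: partial (sum, count) of one feature per patient for one chunk."""
    partial = {}
    for patient_id, value in chunk:
        total, count = partial.get(patient_id, (0.0, 0))
        partial[patient_id] = (total + value, count + 1)
    return partial


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = dict(left)
    for patient_id, (total, count) in right.items():
        t, c = merged.get(patient_id, (0.0, 0))
        merged[patient_id] = (t + total, c + count)
    return merged


# Hypothetical chunks of (patient_id, nuclear_area) records, e.g. one chunk per tile.
chunks = [
    [("patient_A", 100.0), ("patient_B", 60.0)],
    [("patient_A", 140.0)],
    [("patient_B", 80.0), ("patient_B", 70.0)],
]
totals = reduce(merge, map(map_chunk, chunks), {})
averages = {pid: total / count for pid, (total, count) in totals.items()}
```

Carrying sums and counts rather than per-chunk averages is the design choice that lets chunks be merged in any grouping, up to floating-point rounding.
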
So that's on the last page.
You can visit and check out the PAIS database and we have some pages on that.
The last tool I'll talk about is the Cancer Digital Slide Archive.
It was developed by Professor David Gutman at CCI.
What Professor Gutman has done is to make an archive that's
in a web format of all of the pathology images associated with TCGA.
So you can see on this menu here you can go in and select your tumor type,
you can scroll down through the thumbnails of the slide
and then you can select a slide and view it in full detail and actually look
at the image analysis results on that slide too.
So this is a very useful tool for reviewing our results
and also showing people what we do.
It also has some sort of plugins where you can basically go in
and look at clinical information.
So, you know, let's say you're looking
at a slide and you see some funny characteristic,
you can go in and ask questions about what kind
of drugs did this person receive, how long did they survive,
what hospital are they from, et cetera.
It also has some capabilities to look at radiology imaging if it's available.
So you can bring up this web page here on top of the slide just by clicking
in here and you can look at the brain scan to see, you know,
what part of the brain does this tumor come from?
Which is a very useful tool for our radiologists.
So the Cancer Digital Slide Archive is very interesting.
It's probably better to visit it and play with it yourself than for me
to describe it here, but we've got the address here at the top.
You can go in there and play around with the data and give us some feedback.
I'll talk now for a couple of slides about some of our future work,
what we're planning to do next along the same lines.
Professor Gutman is also performing a similar study to our JAMIA study
but using radiology data instead.
So in a collaboration with people at Henry Ford and Thomas Jefferson, Stanford,
other places, they're developing datasets and annotations of radiology
that describe things like where is the tumor located?
How much necrosis is there in the tumor?
Then they'll be playing the same games with, you know,
correlating that with outcome and gene expression and so forth.
So, they have a manuscript that's under review on this.
There are several abstracts on our website at CCI where you can come
and take a look and read about this radiology analysis that's the companion
to the pathology study.
Another thing we're doing is using a technique called Quantum Dot
Immunohistochemistry to examine kinds of protein expression in tumors.
So, the images we were looking at before you can mostly just see structure.
If you use a different protocol for staining the tissue,
you can actually measure the activation of key pathways and you can nail
down whether it's in the cytoplasm, whether it's in the nucleus,
whether it's close to a certain structure. You can really look at the issues
of molecular heterogeneity that are so important in tumors.
So, we're developing our image analysis capabilities
to go after this problem next.
So you can see we've developed, with our collaborator, this 5 step staining
where we can measure five proteins
simultaneously.
You can see there are a lot of differences in tumor cells
that we're looking at here.
So, the colors represent protein expression and there's really a lot
of variation across the slide.
It's definitely worth examining.
So that's sort of the next area of image analysis we're forging into.
Some conclusions on the work that we've shown today.
Pathology imagery contains important cues, and what you see
in the image is definitely related to, you know, the patient's outcome
and the molecular characteristics.
Now, of course, the pathologist says, you know,
we've known that for a long time, it's true.
They've been using slides for a long time.
I think this really bears repeating
and it is receiving a lot of attention.
TCGA has some things that can really complement genomics
but we really need to develop new informatics methods to integrate imaging into our studies.
So, you know, what was presented here was a pipeline
for analyzing whole slide imagery and sort of correlating
with other clinical and molecular factors.
This is generalizable and we're trying to extend this
to other tumor types in the TCGA.
We're also trying to improve it
through developing richer descriptions of image content.
Things like, you know, classifying cells instead
of just calculating the average cell for a patient.
Being able to correlate specific cellular populations
with molecular traits and outcome.
Really teasing things apart at a finer level.
Check back with us periodically to see what we're doing.
Our websites are posted here.
Department of Biomedical Informatics at Emory and also the Center
for Comprehensive Informatics.
We have a lot of descriptions of our work there.
The Cancer Digital Slide Archive is online.
Go take a look at that and let us know what you think.
We also have links to the JAMIA paper and also the TCGA symposium talk
that was delivered on the same topic.
So here are some acknowledgements.
I'm not going to go through the names, but there are quite a lot of people we work
with at different institutions, and we couldn't do it without our friends
at Henry Ford, Thomas Jefferson and Stanford.