Science Commons


Uploaded by Google on 25.07.2007

Transcript:

DAPHNE KELLER: Hi Everyone.
I'm Daphne Keller.
I'm one of your friendly neighborhood product counsel.
And it's my pleasure to, once again, introduce Jamie Boyle
for his second tech talk in recent memory.
His first one, on seven ways to ruin a technological
revolution, was well received here and in lots and lots of
repeat video views.
Jamie gave a great overview of some current problems in IP
and especially copyright law.
Jamie is--
I don't know your title.
You're some kind of professor with a fancy endowed chair at
Duke Law School.
JAMIE BOYLE: That's the very name.
DAPHNE KELLER: And has all kinds of other great
credentials, but the important thing about him
is that he is awesome.
He was a mentor to a lot of the lawyers here, and he's
training a generation of lawyers who, in all modesty,
are the lawyers you want to work with.
They're the people working on ways to use IP and other areas
of law to facilitate tech innovation, to facilitate
sharing information and not just constraining it.
So Jamie is here today to talk about some new projects
spinning out of Creative Commons that focus on science.
And I'm not going to step on his lines by telling you what
they are, but they're super interesting.
With that, let me introduce Jamie Boyle.

JAMIE BOYLE: Thanks, Daphne.
It's great to be here.
There are--

anthropologists search for universal proscriptions,
universal norms that apply across all cultures, you know,
against murder, against incest. There's a very strong
one which says, don't play cards with a guy called Doc.
Never eat at a place called Mom's.
But the strongest of all, which is found in nearly all
cultures, is don't go to Google and give a talk about
search technology.
And I'm going to violate that norm here, but I am a law
professor, so that should be OK.
We believe ourselves capable of doing anything.
I'm going to talk about the Creative Commons mission in
the world of science.
The talk is about an organization
called Science Commons.
I'm one of the people who helped to found it.
It's part of Creative Commons, an
offshoot of Creative Commons.
And it's the attempt to take Creative Commons' mission and
move it into the world of science.
The Creative Commons mission could be
described a couple of ways.
One way is that, where the legal rules were bad or poorly
suited to individuals dealing with the net, then we tried to
create hacks, legal hacks, licensing hacks that solve
that and let individuals share their content with the world
under the terms that they wanted.
Whereas copyright automatically applied a kind
of default, all rights reserved, you can't do
anything with this except perhaps read it mode.
And so one way of looking at Creative Commons is that it's
an attempt to produce sane rules by private agreement
when the default rules--
the rules the legal system gives you just when you write
something or create something or take a picture--
don't fit the world of the net very well.
That's one way of looking at what creative commons does.
A second way is that Creative Commons is an attempt to
create an enormous commons of licensed content.
So that if you're a documentarian and you don't
have much money and you want to be able to get a great
soundtrack and to get, maybe, still pictures of the New York
skyline, if you want to get access to poetry or to text,
and there are people out there who want to share their
material, maybe without attribution, that you can go
out and you can assemble all of that and you can make
something even though you've never met them.
Glenn Brown, who now works at Google, put together the first
and still best explanation of Creative Commons, in which
the key phrase was, permission has already been given.
And this is a phrase that is actually--
which I think Glenn came up with--
which has really become central to the way I
understand Creative Commons.
It's a world in which you don't have to have lawyers,
and you don't have to have contracts that are other than
the Creative Commons license, because permission has already
been given.
The final way we can look at Creative Commons is that it's
an attempt to reduce transaction costs.
"Transaction costs" is the economist's term for all the
hassles that actually make up everyday life.
Doing stuff, talking to lawyers, filling in forms,
these are all transaction costs.
So how to take all of those ideas, getting rid of or
customizing bad default rules that the law sets up, creating
a commons of licensed content, cutting down transaction
problems, and move them to the world of science.
So how did we do it.
The first thing we did was to think about
the research cycle.
This is the research cycle.
Here are materials in the top, left-hand box, that are
actually being described, that are annotated, you've got
PubMed links.
But this is basically stuff, here's the actual physical
stuff to which they refer.
It might be plasmids.
It might be cell lines.
If it's more ambitious, it's zebra fish or genetically
engineered mice.
It's basically the raw material that biomedical-- and
I'm going to focus on biomedical science here,
although that's not our only charge in Science Commons--
it's the raw material that science works with.
And as you work with it, and you come up, you formulate
your hypothesis, you end up publishing.
And you publish over there in an article.
And there's your article.
The article actually says, this is what I'm doing, this
is the experiment that I ran, here are my results.
That produces some form of innovation, perhaps in the form
of a technology, which might go into a repository, maybe
adds to an actual plasmid collection, which then goes back
in, and the cycle starts again.
The point here is that what you've got is raw material,
experimentation, documentation through literature and then
scientists saying, I want to replicate that, I want to
extend it, I need to find it, and I want to move on forward.
The idea is to say, where all of these arrows are, the
moment when you're moving from finding the existence of a
plasmid to getting it in your hot, sweaty, little hands to
do experiments on it.
Or when you've done your experiment, to
putting it out there.
Or somebody's trying to find the experiment or the
literature you did, the description in the literature.
Each of these stages is a moment where transaction costs
can be added, and the Science Commons role is to try and
minimize them.
I'm going to talk about two projects that we've got going
that I think are particularly exciting, where I think that
some of the minds at Google might be able to help.

How many of you know something or anything about biology?
OK, great, that puts you a little bit ahead of me.
This is a double area of ignorance for me.
But let me describe for you my rough understanding of the way
that we do scientific research, particularly trying
to solve disease problems.
This is, as you can see, a slide representing some of the
processes that go on to make Huntington's Disease.
Most of these are cell signaling
pathways, proteins, genes.
The huntingtin protein, right here, is the thing that
causes the disease.
Here you have a series of processes
leading to cell death.
The question is, how do you stop that
process from happening.
Each line here represents some process, either cell
signaling, replication, the creation of a protein.
And each of these is a potential drug target, a place
where you might try and break the link with the drug.
And the idea is, break the link, stop the process.
But, of course, the difficulty is when you break these links,
what if you screw up something else.
Using a metaphor that I find explains it very well, think
of this as an airline map.
Some of these nodes are like the Jacksonville Airport.
You take it out, air travel in the US is
not noticeably affected.
Some of them are like O'Hare or SFO.
You take them out, the whole system shuts down.
But the trouble is, until you do it, you're not terribly
sure which is which.
And the only large scale model which correctly replicates
what will happen is called the human body.
And there are obviously problems in working only
inside a human body.
Here are the numbers of papers on each of a few of the
selected likely targets: 27,000, 4,000,
128,000, 41,000, 10,000.
One of the first things that we found when we started
investigating the world of science, investigating the
empirical literature, talking to scientists is that the
scientists said, for god's sake don't give us
access to more data.
We have data coming out our ears.
That's scientists in the developed world at rich
institutions.
Scientists outside of the developed world and not at
rich institutions have a very different attitude.
But let's just start with the ones in the developed world,
at somewhere like Berkeley or Stanford or Harvard, with full
access to all of the expensive databases and
journals and libraries.
They say, we have a mass of information, and we have very
primitive tools for figuring out which pieces of that
information are valuable.
For example, just one of these-- one of my colleagues
calculated--
it would take you 111 years to read all of the available
papers on the subject.
So the first thing you're going to do is to throw away
99% of all of the potential data out there, because you
simply can't afford to process it.
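
(A back-of-envelope check on the order of magnitude of that
figure. The paper counts are the ones quoted from the slide;
the one-hour-per-paper reading rate and the working year are
assumptions, not numbers from the talk.)

    # Back-of-envelope check on the "111 years" figure.
    paper_counts = [27_000, 4_000, 128_000, 41_000, 10_000]  # counts quoted from the slide
    total_papers = sum(paper_counts)                         # 210,000 papers

    hours_per_paper = 1        # assumption: one hour to read each paper
    hours_per_year = 8 * 250   # assumption: 8-hour days, ~250 reading days a year

    years = total_papers * hours_per_paper / hours_per_year
    print(f"{total_papers} papers is roughly {years:.0f} years of reading")  # ~105 years
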
You're going to do that according to some rough
heuristics that your mentors and professors have, kind of,
programmed into you.
Like, we work only on Huntington, so let's get rid
of all the rheumatology and all the cancer biology, all
the stuff that actually might be deeply, deeply relevant,
but I don't have time to look at it, so we're just
going to move on.
I'll do some text searches.
I'll use Google, ironically, which is hardly optimized--
wonderful thing though it is--
for searches like this.
I'll use the internal search engines, and I'll try and come
up with a better model.
What would you do it on.
This is an abstract of an article in
something called PubMed.
PubMed articles are freely available to everyone.
The articles they refer to often aren't, but the
abstracts contain a lot of information, as you can see.
Here's the key set of insights.
The study explores the hypothesis that transglutaminase
catalyzes cross-linking of huntingtin into intranuclear
inclusions, and then goes on to say, here's what
the data supports, here's the hypothesis.
The question is, how are you going to be able to find that
kind of assertion.
How are you going to be able to find that kind of assertion
and understand how it was linked to the
techniques, the methods--
I guess I should use the pointer here--
the methods by which the insights were generated.
And what are the materials on which this was tested, so that
if you wanted to replicate it, you would know what materials
you needed to replicate or expand the science.
And what's the bottom line, the conclusion.
Based on these and other studies, modulation of
transglutaminase activity could be explored as a
treatment for Huntington's Disease.

Second problem.
You understand, far better than I, the world of the
internet and the way in which websites and linkages among
websites provide a knowledge model which search engines can
rank usefully and well, albeit with secret algorithms. This
is a discussion of a neuron.
These are all of the places where you might actually find
out about such a thing.
These are all publicly or largely publicly available
repositories in which information, which might be
relevant to the person who's doing
Huntington's Disease, is contained.
What we did was we spent a long time talking to people
saying, we think that if someone was to go through and
take all this publicly available data, all this
publicly available text, all the publicly available
databases and run some text-recognition software on
it which actually understands, in quotes here, which actually
recognizes scientific terms, which actually understands
assertions of causation--
so x causes y, inflammation up-regulates this protein for
example, like to understand that that's a subject, verb,
object assertion--
that you could actually generate a search engine that
scientists could use to mine this vast agglomeration of
data in a much better way than they currently could.
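
(A minimal sketch of the kind of pattern such software looks
for: a subject-verb-object causal assertion pulled out of a
sentence. The verb list and the regular expression are
illustrative toys, not the actual pipeline used by the
project.)

    import re

    # Toy extractor for subject-verb-object assertions; real systems use curated
    # term dictionaries and full NLP rather than a single regular expression.
    CAUSAL_VERBS = r"(?:causes|up-regulates|down-regulates|inhibits|activates|catalyzes)"
    PATTERN = re.compile(
        rf"(?P<subject>[\w\- ]+?)\s+(?P<verb>{CAUSAL_VERBS})\s+(?P<object>[\w\- ]+)",
        re.IGNORECASE,
    )

    def extract_assertions(sentence):
        """Return (subject, verb, object) triples found in one sentence."""
        return [(m.group("subject").strip(), m.group("verb"), m.group("object").strip())
                for m in PATTERN.finditer(sentence)]

    print(extract_assertions("Transglutaminase catalyzes cross-linking of huntingtin"))
    # [('Transglutaminase', 'catalyzes', 'cross-linking of huntingtin')]
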
Now, let me tell you, when this stuff actually gets run--
we used a proprietary technology called Temis and ran
it on this stuff--
it turns out, of course, it produces
lots and lots of mistakes.
So you have lots of assertions like, hunger causes the
absence of food, where it gets the basic terms there, but it
reverses the causal arrow.
Nevertheless, it produces something which is actually
pretty good.
It can, for example, understand some key points.
It can understand that the same gene has four or five
different names in the scientific literature, and see
that gene in each place and say, this is the same as this
is the same as this is the same as this.
What it does by doing that is, it enables something that's
probably the single most important contribution of
this, it enables a kind of Rosetta Stone between
otherwise incompatible databases.
So if you use naming structure one for your database and he
uses naming structure two, the two databases can talk to each
other because this software understands that those two
genes are actually the same gene and that you're both
making assertions about the same gene.
This is a kind of query that a scientist might want to run on
this material, which so far as I know, if they are outside of
a large pharmaceutical company, there is no really
great, available software that actually looks for this in
other than a simple text-search way.
That is to say, it goes out and finds the words gene,
signal transduction, and pyramidal neurons.
What does that mean.
Here's a human readable version of what that query is.
Find me potential drug targets for Alzheimer's Disease based
on what's publicly known.
Pyramidal neurons are apparently a key area for the
progression of Alzheimer's Disease, an area people think
is a possible area, just like the Huntington's chart that I
showed you, for interruption with some kind of drug target.
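
(A sketch of how a query like that might be posed to an RDF
store from Python. The endpoint URL and the predicate and
class names below are invented stand-ins, not the actual
NeuroCommons vocabulary.)

    from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

    endpoint = SPARQLWrapper("http://example.org/sparql")  # placeholder endpoint
    endpoint.setQuery("""
        PREFIX ex: <http://example.org/terms/>
        SELECT DISTINCT ?gene ?paper WHERE {
            ?assertion ex:subject   ?gene ;
                       ex:predicate ex:expressed_in ;
                       ex:object    ex:pyramidal_neuron ;
                       ex:source    ?paper .
            ?gene ex:involved_in ex:signal_transduction .
        }
    """)
    endpoint.setReturnFormat(JSON)

    # Each row is a candidate gene plus the paper the assertion came from.
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["gene"]["value"], row["paper"]["value"])
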

The first project I'm going to talk to you about, the
NeuroCommons, took four publicly available
databases or text repositories.
MeSH, which is a, kind of, universal description of
diseases, classifications, neurons, and so forth and
basically gives something a single ID so you could know
you're talking about the same thing.
PubMed, which provides the abstract
to the journal articles.
Entrez Gene, which is a listing of all of these genes.
This is the signal transduction repository.
And what it does is take all of those and render that
same query, in what is obviously extremely ugly, kind
of, early-stage code, to produce an answer.
And here is actually the answer that it got, which
we're told by some scientists passes the BS test. These are
actually all genes which are reasonably involved in this.
There's the actual way that the search engine works.
There's the NeuroCommons Virtuoso RDF store.
This is how we write the queries right now.
As you can see it's, kind of, like the early days of HTML
where people coded web pages by going into View Source of
other people's web pages and taking it out and, like,
pasting it in.
It's all done in extremely only-coder-friendly form, but
obviously a human interface is a key thing that we'd be
moving on next.
The thing is that there you can actually store your
request, and it produces a stable URL.
This is that search, and all of the components here can be
substituted.
So pyramidal neurons, you could say, take exactly the
same query but maybe look for Malaria drug targets, so that
it's the same, the structure is the same.
This is an early stage, maybe not even quite alpha-stage,
software research project.
So that's the kind of thing that you can produce out of
it, saved queries.
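
(A sketch of what "all of the components can be substituted"
might look like: one query template, a disease-specific term
swapped in, and the filled-in query encoded into a stable,
shareable URL. The endpoint and URIs are hypothetical.)

    from urllib.parse import urlencode

    # One query template; only the disease-specific term changes.
    QUERY_TEMPLATE = """
    SELECT DISTINCT ?gene WHERE {{
        ?gene <http://example.org/terms/drug_target_for> <{disease_uri}> .
    }}
    """

    def stable_query_url(disease_uri, endpoint="http://example.org/sparql"):
        query = QUERY_TEMPLATE.format(disease_uri=disease_uri)
        return endpoint + "?" + urlencode({"query": query})  # same inputs, same URL

    print(stable_query_url("http://example.org/disease/Alzheimers"))
    print(stable_query_url("http://example.org/disease/Malaria"))  # same structure, new target
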
Then the next thing, which I'm going to talk about in a
moment, is once you've done that, how do you move on to
developing--
how do you move on to actually finding the materials that
would enable you to do the science.
But I want to pause for a moment and explain exactly why
I think this is important.
There's a war currently going on about access to scientific
literature.
Any of you who've been involved in science know that
there are open-access journals, journals which are
often under Creative Commons licenses, which are put up on
the web for free.
Other journals most explicitly aren't.
Science and Nature, for example, are much harder
to get access to.
And there's a big debate about whether or not when
governments fund scientific inquiry, they should require
the scientific literature, the research that comes out of
that, to be put on open access, so that when your tax
dollars go to pay for research on Rheumatoid Arthritis, you
don't also have to pay $20 to read the article which comes
out of that research.
It's a fight, and it's going on.
There will be a solution sometime in the next couple of
years, probably some kind of mandate for delayed
open-access publication: it has to come out within six
months or a year if it's government funded.
That's going to be hugely important.
But what I think this project, in all of its crudeness, in
all of its buggy, sort of, basic text file with lots of
incorrect assertions shows is that when we think about open
access to scientific literature, we cannot think
about it the way most people are thinking about it now,
which is that the mark one eyeball gets to see a picture
of text on a page and that counts as open access.
That if I can read Glenn's article, in Nature, six months
afterwards, so long as there's a JPEG of each page up there,
that should be enough, because I get to read it.
What this points to is the fact that it's not just human
beings getting access to the literature that we ought to
care about, it's computers.
Because if that literature is not in machine-readable form,
if that database is not accessible to this kind of
activity, then we will never have the ability to build
something of which this is the very crude, first prototype.
What is that thing.
Well, you could think of it as a more precise search engine.
I've been told that when you come to Google, you're simply
not allowed to say semantic web.
So it's a blank, blank for science.
A rough draft of a blank, blank for science.
Applying blank, blank technologies to the idea here
by forming simple RDF blank, blank statements, which
couple together to form a blank,
blank network for science.
You could think of it, as I said, a Rosetta Stone.
But I think of it as the demonstration project of why
our ideas about open access need to be much broader.
Even if this is total nonsense--
that is to say, it doesn't work, it's too buggy, it ends
up having too many wrong statements in it like hunger
causes the absence of food--
another version of it might be better.
Stage one, better search.
Why is better search important.
We can afford a grotesquely inefficient science research
process in the United States because
it's a wealthy society.
But even in a wealthy society, some candidates for scientific
research are marginal economically.
Orphan diseases, for example, Huntington's Disease doesn't
afflict that many people, which means that there isn't a
large commercial market.
Erectile dysfunction, don't you worry.
No matter how inefficient the science ecology, there will be
research done on Erectile Dysfunction, not to worry.
Obesity, I think we can all sleep safe in our beds and not
worry about obesity research being pursued.
Huntington's Disease?
And then when we get to the diseases of the global poor,
even a small level of friction can be
enough to kill the project.
And in some cases, even with almost zero friction, the
project might still not be viable.
So efficiency in search is a human, morally compelling
imperative.
We have to have better data retrieval in the sciences or
people will die.
And they will die disproportionately when they
are poor or when their diseases are suffered by
relatively few other people.
This strikes me as the kind of message that a moral search
engine company might take to heart.
Just any moral search engine company, not, like, focusing
on any particular one.
It's something where fixing the problem would lead to a
better FDB index.
The FDB index, I get from my colleague Hal Abelson, Fewer
Dead Babies.
Fewer dead babies, this is actually a
morally compelling thing.
I guess, subtitle of this talk is, what if we put the same
ingenuity that we put into allowing teenagers to flirt on
Myspace, or for you to find just exactly that tennis
racket you're looking for, into finding the cell
signaling pathways that are crucial to Huntington's
Disease, to allowing people to do that at almost zero cost.
So, compelling idea.
But even if you find the data, the articles-- you find that
one key article in a discipline you would never
have thought of looking in, because you're a cancer
biologist, so you didn't think of looking in Rheumatoid
Arthritis journals although they both deal with
inflammation, importantly, and that's actually
an important linkage.
Even if you find that thing, so what?
The next thing you have to be able to do is actually to be
able to work with the stuff, the raw materials that
scientists have used in order to do their experiments.
Now, surprisingly, the single biggest impediment to
biomedical research is the
unavailability of physical things.
You're all thinking, OK we're an information society right.
It ought to be--
the information flows.
No, it's the physical stuff.
There's the results of some of our research into the
empirical literature here. "47% of academic geneticists
had been rejected in their efforts to secure access to
data or materials related to research." And a lot of this
was concentrated in the materials.
Again, these are plasmids, these are cell lines, these
are mice, these are zebra fish.
Why does this rejection happen.
Well, a pie, think of a pie chart.
Reason number one, it takes time to make the materials.
You're sitting in your lab, you've got a
bunch of cell lines.
I say, hey I want to do an experiment on it.
You're going to have to take a grad student and make the grad
student go and make a version of the cell line.
It just takes time.
It's a hassle.
It's a hassle factor.
So that's friction.
The second kind of friction, legal friction.
Every transfer should be accompanied by a contract, a
material transfer agreement.
Here you get a great quote from two tech transfer
officers who work on these things.
"One of your colleagues at BigAg says she'd be happy to
send you her transposon insertion lines that saturate
the right arm of chromosome nine.
You'll just need to have a material transfer agreement
signed by your institution.
Six months later, the terms of the agreement are still under
negotiation, you've missed your field season, your
grant's expired, and there's now a better resource, if you
just start negotiating a material transfer agreement
now."
Science is really important.
The process of getting material across labs is not
even like Web 0.5.
It's not even like buying a book pre-Amazon.com.
It's like ordering from the Sears Roebuck catalogue when
you live on the Western frontier and things had to
come by stagecoach.
I mean, science hasn't even made it into the '70s when it
comes to facilitating legal transactions between labs,
because permission should already have been given, so that
none of that was necessary.
Now, why else is this hard?
It's not just that it's hard to make the cell line.
The process of getting the material
transfer done is often difficult.
The tech transfer officers often customize them.
They're supposed to be standard agreements, but there
are customizations.
Oh, well, what if this is going to lead to a new
infertility drug or a drug to cure infertility or stop hair
loss or whatever, we want to make sure we get enough money
out of that.
There are back-and-forth fights in all of this.
And all of that just slows things down.
Plus, you just have to sign things.
It just takes time.
The third reason, maybe one of the most important, secrecy.
Scientists are refusing to share at an
ever increasing rate.
The rate of denials is going up, which is a really bad
trend line for us to have. Why.
Well, you've got this cell line.
You think you can get three papers out of Science and
Nature, maybe a fancy new grant, maybe tenure, maybe a
fancy chaired professorship of the kind that Daphne was
talking about.
Why should I give you my cell line.
I'm just enabling you to write those articles yourself.
Right now the credit economy favors secrecy.
The informal credit economy favors secrecy.
This is something that Web 2.0 thought could be applied to.
So there are the problems, massive set
of problems at MTAs.
What's the solution.
Well, there's a beginning.
There is this Universal Biological
Material Transfer Agreement.
This is the first page of it.
There are nine pages.
It's long.
Even the short letter agreement is quite long.
But it's a pretty good agreement.
It's a good working agreement.
It's hard to understand.

Even this part, which is just the listing here, is
relatively hard to understand.
It gets harder, even though it's a pretty simple
agreement; what it does is pretty simple.
So wouldn't it be nice if there were a Creative Commons
for materials, in which you could just go in and, as easy
as it is for you to license your photo on
Flickr or your video--
Creative Commons license it and stick it up on YouTube, on
Google Video--
wouldn't it be great if you could automatically put things
under the UBMTA or the SLA and have them up there and
generate a very simple Commons Deed, which just tells you
what you're supposed to do.
You can't use the materials for clinical
and commercial purposes.
That's what it says.
This nine-page thing, that's what it says.
And you can't distribute without permission.
So that the scientists could understand it.
This is what we call the human readable layer.
Forget the lawyers, this is the human readable layer.
The scientists could understand it.
And imagine a process where social engineering--
that is to say pressure--
said to entities like the NIH or Duke University, couldn't
you get your labs to pre-license all these plasmids
under this agreement, just automatically give permission,
so it's done already.
Sure, you can reserve the patented stuff, the totally
proprietary stuff, but the rest, couldn't you just make
it a Commons.
And here's the really cool part.
Once you've done that, every time that I go to Glenn's lab
and say I need one of your plasmids, then that's going to
create an electronic record.
And the electronic record is going to say, Glenn Brown's
lab produced 3000 plasmids last year.
Three thousand plasmids were demanded by other people.
500 zebrafish, 20 genetically engineered mice, knockout
mice, SCID mice.
That's more than anyone else at Berkeley, Stanford,
wherever by a factor of five.
That Glenn Brown must be a real hotshot scientist.
Suddenly we have a number.
We all love numbers, rankings.
We all look at the awful US News and World Report survey
of colleges, look at citation counts.
Academics nowadays basically check their citation counts,
like, every morning, like the weather.
It's raining.
My citation count's up 2%.
Citation counts are hugely important.
They're the way that the informal economy runs.
This is an idea to flip the informal credit economy so
that the benefit is to sharing not to secrecy.
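
(A sketch of the flipped credit economy: each fulfilled
request leaves a record, and the records roll up into a
per-lab sharing score that could sit next to citation
counts. The record format is hypothetical.)

    from collections import Counter

    # Hypothetical records, one written each time a repository ships a material.
    fulfilled_requests = [
        {"lab": "Brown Lab", "material": "plasmid"},
        {"lab": "Brown Lab", "material": "zebrafish line"},
        {"lab": "Smith Lab", "material": "plasmid"},
    ]

    sharing_score = Counter(r["lab"] for r in fulfilled_requests)
    for lab, count in sharing_score.most_common():
        print(f"{lab}: {count} materials supplied")
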
And this is a second idea that I think Creative Commons
really got.
If you can make sharing easy enough and get benefits out of
it and have those benefits flow back to people, then more
people will share.
The economists who look at this stuff tend to
look at it as static.
They say OK, people aren't sharing, and they aren't
sharing for a good reason.
And the good reason is they want to get benefit out of the
stuff they're holding onto.
They never think, could we flip that system technically
and legally so that they can get just as much or maybe even
more benefit by doing the more socially desirable thing.
So again, that's an idea which is generalizable beyond this.
That would be the metadata that would
travel with the material.
So standardized licenses, permission given in advance,
allowing repositories to fill requests.
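
(A sketch of the metadata that could travel with a material:
what it is, which article describes it, where to get it, and
the pre-given permissions. The identifiers and field names
are made up for illustration; only the license terms echo
the deed described above.)

    import json

    material_record = {
        "id": "plasmid:pEX-0001",                      # made-up identifier
        "described_in": "pmid:0000000",                # placeholder article reference
        "repository": "http://example.org/repo/pEX-0001",
        "license": {
            "name": "UBMTA",
            "commercial_or_clinical_use": False,       # not for clinical or commercial purposes
            "redistribution": "with permission only",
        },
    }
    print(json.dumps(material_record, indent=2))
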
So now imagine the final link.
Remember we started off with the journal article.
What if in the journal article, when Glenn Brown
writes up his research, in the footnote it says, and I used
this plasmid, hyperlink.
And the hyperlink goes directly to a description of
the plasmid with its license attached in a repository that
can provide that plasmid.
Now that could be a for-profit repository, the eBay, or
FedEx, or Amazon.com of cell lines.
It's certainly a business I'd love to get into, just you
know, all that goop in FedEx containers
flying across the country.
There is a lot of it, by the way.
That's the way scientists-- everything is done by FedEx.
But imagine that process, where the article is itself
the link to the material needed for the next iteration.
The thing is, as I'm describing this, those of you
who don't know the world of biology will go, well, what
he's describing is like not even Web 2.0 stuff.
This is, like, pretty trivial stuff.
But we don't have it in science.
I really want to stress this.
Science isn't even Web 0.5.
And these are just two ideas, better search and closing the
cycle of journal repository credit, so that you could say,
wow, I like the sound of Brown's work.
Click here.
Hey, it's available for $24.95 from Abgene,
their processing costs.
Maybe it's for a profit, maybe it's not for profit.
I don't care.
I would love for people to get into lowering the price,
commoditizing cell lines.
It's got to be an accredited repository.
Of course, you've got to make sure that it's actually
correctly produced, because the scientist is going to have
a vested interest in making sure that cell line's not
contaminated.
So, you know, some safety concerns there, but nothing
that we couldn't possibly deal with.
And then the cycle starts again.
So basically, thinking about lowering transaction costs at
each stage of this cycle, using some of the insights
that are jejune, uninteresting, mundane in the
existing world of CC-licensed content, search engines, Web
2.0, whatever.
Just taking some of that ingenuity and applying it.
Whose job is that in science right now.
Kind of NIH, kind of the National Academies.
But no one is sitting here going, how could
we transform this.
And the thing is, in many cases they're not doing it
because they don't see the opportunity
for making a profit.
The private pharma is doing stuff like this,
but it's all internal.
And good for them, I mean, I'm not putting that down.
This is great.
They're doing their versions, but it's inside Merck.
Merck has its own versions of this kind of
stuff inside Merck.
But those are all walled gardens.
And the point about walled gardens is they don't give you
the benefit of the large commons.
It's like, again, early-day search engines, intra-company.

These are people who've helped us with funding: the HighQ
Foundation, Teranode, MacArthur.
This whole thing that I've described to you is being done
on $600,000, which is trivial money by science standards.
I mean, it's actually really embarrassing amounts of money.
We have to not tell scientists because they basically don't
start paying attention until you get up above the $50
million level.
It's too small.

That's the idea behind Science Commons.
It's an idea that I actually think--

the biggest problem here is a cultural gap.
We actually have a lot of the tools that we need, the
conceptual tools, licensing tools, metadata tools, search
engine tools, sitting over in the world of searches for
Britney Spears.
And all we need to do is drag and drop that stuff into the
science room.
All we need to do, no.
Because it turns out that there's huge differences.
We have to learn to stop dealing with copyright
licenses and start thinking about licenses over stuff.
When I talked to Daphne about it, she goes, well, what kind
of intellectual property right are you talking about over
these cell lines.
I said, no Daphne, this is old-fashioned
property, like, things.
She goes, oh yeah, I've heard of that.
You know, remember the old days, when
we had, like, stuff.
Yeah, physical property rights.
Wow, that's so cute.
There's lots of customization.
There are lots of problems that I haven't mentioned,
ranging from patient privacy to bioterrorism fears.
There's all kinds of stuff that needs to be dealt with.
There are enormous problems. There are also commercial
entities, some of whose interests run along these
lines and some of whose interests don't entirely run
along these lines, but definitely a lot who do.
So, all I'm saying is, that pursuing this kind of thing
would be a good idea.
Here's a second version of this that I just think is a
nice little lagniappe at the end.
This is the Allen Brain Atlas, an amazing atlas of mouse
brains developed with funding from Paul Allen, one of the
co-founders of Microsoft.
And it basically cost $100 million to develop.
It has all these really cool ways of imaging and slicing
mouse brains, so you can basically do all kinds of
virtual stuff on mouse brains even if you're sitting at a
terminal in Kinshasa or Kyoto or wherever.

We got some of the publicly available material from the
Allen site and simply made it a mashup as a layer on
Google Maps.
Now you can layer it.
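
(One way a mashup like that can be built, sketched under
assumptions: slice a large section image into the 256x256
tile pyramid that slippy-map viewers such as the Google Maps
API expect. File names and the zoom scheme are generic, not
the project's actual code.)

    from PIL import Image  # pip install pillow

    TILE = 256

    def tile_image(path, out_prefix, zoom=0):
        """Cut one large section image into 256x256 tiles named like a web-map layer."""
        img = Image.open(path)
        for x in range(0, img.width, TILE):
            for y in range(0, img.height, TILE):
                tile = img.crop((x, y, x + TILE, y + TILE))
                tile.save(f"{out_prefix}_{zoom}_{x // TILE}_{y // TILE}.png")

    # tile_image("mouse_brain_section.png", "tiles/section01")  # hypothetical file names
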
And again this was something--
I didn't think anybody had thought about doing it, and
this is already being picked up by one of the big
bioinformatics networks, which has $100
million worth of funding.
This was done with, basically, chewing gum and baling wire
and 25 hours' worth of volunteers' time.
The point is that stuff that, on the level of traffic
maps or crime data, all of you have thought about a million
times and said, that's trivial, I'll never get
anywhere by doing that, in the science world has
not yet been tried.
I really want to say, big frontier
here, really big frontier.
To be done, what's needed.
Money, smart people.
It is embarrassing to come to Google and talk about search
technology, particularly one that talks about the [COUGH]
semantic [COUGH]
web.
I'm sure that there are all kinds of things wrong with the
NeuroCommons model we've got here.
I'm sure that it's buggy.
It may be that the process we imagine
going forward is an almost, like, Wikipedia-like
process of correction.
Where somebody who's working on Alzheimer's says to his
graduate students, OK your job is to go in and make sure that
this incredibly buggy NeuroCommons thing is right
about our area of science, so we can use it in the lab.
And that's your responsibility.
Go in and check it.
And since each assertion leads back to an article, it's a
simple footnote that just refers to an article, it's
easy for you to read the article and go, wait, hunger
doesn't cause the absence of food, it's the absence of food
that causes hunger, you idiot.
And flip the causal sign and so make the
NeuroCommons better.
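
(A sketch of that correction step, assuming each assertion
carries a pointer back to its source: a curator who reads
the abstract can flip a reversed causal arrow. The record
format and the PMID are placeholders.)

    # Hypothetical assertion record: the provenance pointer is what makes checking cheap.
    assertion = {"subject": "hunger", "predicate": "causes",
                 "object": "absence of food", "source": "pmid:0000000"}

    def flip_direction(a):
        """Curator action: the extractor got the causal direction backwards."""
        return {**a, "subject": a["object"], "object": a["subject"]}

    corrected = flip_direction(assertion)
    print(corrected["subject"], corrected["predicate"], corrected["object"])
    # absence of food causes hunger
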
That's the way we imagine it getting better, a bigger and
bigger community with more and more people thinking about
what they can do.
That's the great thing about the Commons: there's
always someone who has something smarter to do with
your content than you thought of.
That's the secret of the Commons.
Smart people, money, more experimentation, and, I think,
a conceptual shift.
A shift that starts to look at the world of science and say,
what if we took everything that we have learned about
networks, about the way that networks function--
I think, the world wide web of networks--
about the way that search works on networks, about the
way that distributed creativity could work across
networks, and let's figure out which of those can be
translated into the world of science.
Not all of them can, of course, but some of them
could, and it would be really exciting.
I hope that you find that of interest. Sciencecommons.org
is where you can find a lot of this information, and that's
my contact.
Thank you very much.

Questions.

Sorry.
AUDIENCE: Is this software open source?
JAMIE BOYLE: So, I've been asked to repeat the questions
just so the remote audience can hear them.
So the question is, is the software open source?
Right now, all of the data and the tools that we are
developing are open source, BSD licensed.
The tools we use to do the initial text recognition, that
is proprietary, although its output is, of course,
available freely.
Next stage would be developing open-source, excellent
scientific text recognition software for lots and lots of
areas of science, because the
proprietary software is expensive.
Interestingly we've had very generous support from two
proprietary companies, Teranode and Millennium
Pharmaceuticals, who have allowed us to use their
systems and, in some cases, even to release versions of
their systems under open licenses.
They believe, I think--
I don't want to put words in their mouth,
let me rephrase this.
It may be that they believe that there is a, kind of,
open-source model here where there's room for a lot of
companies to exist on top of a giant Commons of Science
information the same way that say IBM or Red Hat exist in
the free software world.
This is something we don't currently have in the
sciences, though obviously you know it very well.
But in answer to your question, I think one next
stage would be, let's not just deliver the stack, the
NeuroCommons kernel as open source or even our search interface
as open source, let's have the stuff that generated it in the
first place and make that open source too.
There are open source versions of that search technology, but
right now it's not as good, the stuff that does the text
recognition in the first place.

AUDIENCE: Are there any efforts underway to get
scientific researchers to deliver their work product in
the form of structured data in the first place?
JAMIE BOYLE: Great question.
So are there any efforts to get scientists to deliver
their data pre-structured, pre-metadata encoded.
There have been some attempts, and
they've all failed massively.
One reason why we went this route, which is trying to
figure it out post hoc with software, is because it was
impossible to make it happen beforehand.
I can imagine one way in which that would happen, which is if
the NIH said, 0.5% of every grant is for metadata encoding
of the results.
And it's a deliverable.
You don't get your final payment until we see your
whatever it would be, RDF triples, whatever it is, that
says, here are my hypotheses, materials, and whatever.
That would definitely work.
Like Creative Commons, this is the second best private fix,
post hoc, rather than the public structuring ex ante.
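
(A sketch, using the rdflib library, of what a tiny "RDF
triples" deliverable might contain; the namespace, terms,
and identifiers are invented for illustration.)

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/lab/")  # invented namespace

    g = Graph()
    experiment = EX["experiment-42"]
    g.add((experiment, EX.hypothesis,
           Literal("transglutaminase modulates huntingtin aggregation")))
    g.add((experiment, EX.material, EX["plasmid-pEX-0001"]))
    g.add((experiment, EX.reportedIn, Literal("doi:placeholder")))

    print(g.serialize(format="turtle"))
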
AUDIENCE: Actually, my question is about the idea of
using a graduate student powered Wiki to improve the
semantics [INAUDIBLE]
sort of like a secondary grade, and so my actual
question is if that happens, is there an issue with
validating the results and making sure that--
do they need peer review?
Or does there need to be [UNINTELLIGIBLE]?
JAMIE BOYLE: Excellent.
So the question is, how are we going to improve the results.
One idea I had mentioned was the idea of having curation by
graduate students after the fact to make sure that the
assertions were correct.
And one obvious concern, which is often raised about things
like Wikipedia, is can we trust this material.
Obviously, in the sciences, you would say this is, maybe,
even more important than knowing whether someone has
sung a particular song or not.
Does it need peer review.
I think the answer of how we might end up having various
forms of peer review is still TBD, To Be Decided.
That will be developed.
But here's the reason why I don't think it's
too much of a problem.
Each one of these assertions is in the form of x causes y
or so says this article, at this page, or this abstract.
That's what it is.
When that is being used by someone, all they're getting
is what is, effectively, a hyperlink to an article.
The actual information they need is in the article.
And it's not being given to, you know, Joe lay person who's
going to go on and do, like, brain surgery on himself.
Its users are scientists who are expert in
their particular area.
I think, therefore, that while we would, obviously, never
want to accept any kind of errors, if there are errors,
they are easily discarded by the person looking at them and
easily, perhaps even you would hope, corrected at the moment
when they're discovered.
I got this in the search page and it's wrong.
Just as when you scan a search engine page in an area you're
familiar with, you rule out those that are nonsense, that's
just garbage, that's an accident of the search, and
focus on the others.
I think that makes it acceptable to have a less
expensive level of peer review.
Having said that, it might be that there are sub-specialties
or particular areas where people might want to say, we
want our core of material in our scientific field to be,
like, beyond reproach with a zero error rate.
And that would simply require them then to come in and
insert their excellent validation
structure on top of it.
And, of course, the benefit of an open system is, they could.
And it would be distributed and that would be
entirely up to them.
But that's me just guessing of how it might evolve.
We've certainly got some Autism, Alzheimer's, and
Huntington's researchers who seem interested in doing
something like that.

AUDIENCE: You mentioned that you were only talking about
the biomedical aspects of Science Commons, what other
parts are there of Science Commons?
JAMIE BOYLE: Excellent question.
So what beyond biomed.
Biomed we picked because the problems are huge.
The human impact is obviously enormous.
And there had been some existing efforts, particularly
in the form of open journals, where we thought we could, you
know, maybe leverage some stuff.
We have preliminary contacts with people in the
geospatial data area.
I think it would be very important for things like
environmental science.
We've worked, in the past, with people like the people
who are doing the national map, where there's, sort of,
all these different layers of data about
the country in there.
There's been a lot of interest from anthropologists and
archaeologists who would love the idea that everything that
you have about a site would be able to play together with all
of the other information about that site, so that researchers
from other disciplines could come in and actually find the
thing that they needed, because the two types of data,
which are currently incompatible, would actually
be meshed together.
The model of the NeuroCommons strikes me as scalable across
many disciplines.
Obviously different text search mechanisms, you'd need
an archaeology specific one, an anthropology specific one.
But we have a lot of interest for people in those areas.
In physics, in high-energy physics, there's already much
more of a going concern that's working on these.
The high-energy physics and astronomical people are way
ahead on this.
arXiv.org has really done excellent stuff on this.
The Human Genome Project has obviously released a vast
amount of material.
Sanger and Wellcome Trust and the whole push behind the
Human Genome Project and the Bermuda Accord was to get this
kind of data out there.
So research on evolutionary biology and so forth is,
perhaps, more advanced in certain areas.
But basically, I think, it's just a matter of enthusiasm
and money to scale this out to other areas.

AUDIENCE: Scientists, in academia in particular, seem
to be more reticent to do anything new, especially
something dramatic.
How do you get momentum for this and convince people that
it's a really good idea, that they should spend their grad
students' time on this instead of on their paper?
JAMIE BOYLE: Right.
So, the question is, how do you get scientists to buy into
this, because they can be more reticent about getting it
going, which, by the way, I entirely agree with.
Justifiably, scientists don't want to waste their time doing
something if it turns out that it's not going to yield.
The first thing is you have to give them something that's
already useful.
Don't come to them with an idea.
Come to them with a search engine that already yields
some results, and ideally, some of which are wrong.
Because then they go, well this is
right, but that's stupid.
The harnessing of the "that's stupid" impulse is a very
powerful idea.
You know, that needs to be fixed.
This is the way open-source software works.
It's like, the person who reports the bug is much more
likely to fix it.
I think that's one thing.
I think a second thing is, for a lot of these people, the
abstractions of the open-source Web 2.0 search
community are completely outside of
their frame of reference.
And they also, understandably, are in a world which has a
prestige economy that's set up in a particular way.
So they might be very keen to share, but they'll go, you
know I spent all this time developing this plasmid and I
send it over to that woman in the other lab, and now she
scooped me.
I think it's hard for you to convince someone to say, no
you'll be better off if you share so long as lots of
other people do.
There's a scale problem.
So the answer there is to move at every level.
For example, the Howard Hughes Medical Institute, which is,
after the Nobel, probably the single most prestigious prize
in biomedical research and so forth, fully funds, and
I mean fully, every single thing in the labs of the
people that it supports, who are brilliant, genius
scientists.
Those people already operate in a, kind of, semi-Commons, a
closed Commons, where they are--
the Howard Hughes Medical Institute says we will
encourage you to share among all of the other people in
this network.
So you find networks like that, that already exist, that
have already begun the process.
And you find people, like funders, who have some ability
to say, this will lead to bigger grants for you or
renewal of your grants, and you encourage them to do it.
And finally, you work with the journals.
For example, we've been talking to the Public Library of
Science about PLoS ONE, which is for putting up journal
articles at an early stage or without going through a complete--
it is peer reviewed, but it's not completely peer reviewed
in the way that their flagship journals are.
And they're thinking about implementing this, so if you
talk about a material in here, we want
you to use this system.
I think if you approach it at every level and give people
incentives to do it, then that will build
the level of content.
And it's very similar to what we did with Creative Commons
licenses in the beginning.
How do you persuade people to do it when
there is no other material.
You try and get it embedded in Flickr.
You try and get it with companies like Magnatune.
You basically make it pervasive, so that you see it
everywhere.
And then you get people to uptake.
And then, at that point, the cycle starts.
It's a great question.

Just a little louder.
AUDIENCE: How much interest have you had internationally?
JAMIE BOYLE: A lot of interest internationally.
We're only limited by the fact that we--
I mean, we already have people who we want to be working with
internationally; it's basically a limitation of funds.
Two kinds of partnerships here, I think.
One is that a lot of foundations who have been
resistant to funding science infrastructure in the
developed world because what they want to do is build
research capacity in the developing world, I think
we're beginning to convince them that if you fix the
problem globally then you fix the problem globally, which,
like, includes everywhere.
So that's a bridge which has enabled us to get people in
other countries to work with it.
The other thing is that there are existing efforts, through
things like the European Union and the OECD, where we've had
a lot of interest in these kinds of initiatives.
And again, it's just a matter of--
once you've got a working prototype, you're a lot
further along.
And, basically, these two things that I described to
you, the NeuroCommons and the MTA project, are ready for
launch in the next two months.
We think, we hope we launch at a very alpha, crude, buggy
level, but launch nevertheless.
We'll be launching with 5000 plasmids already pre-licensed,
for example, and lots of scientists and
labs committing too.
Once we get that, then I think the international interest
will accelerate.
One more question and then we're done.

Thank you very much.