Why Most Published Research Findings Are False


Uploaded by C0nc0rdance on 24.06.2011

Transcript:
The artificial sweetener video I've been working on is dragging out. I always have more ideas
than I have time to make videos about, unfortunately, and I appreciate everyone who sent in suggestions.
Just be patient, and I'll do my best to get to your topic. Fluoride and alcohol are priorities
at the moment, then smoking and some positive applications of cannabis to provide a little
balance to my previous vid.
Most published research findings are false, according to a recent paper in the journal
PLoS Medicine, written by John Ioannidis (Yo-Nee-Dees), a professor of epidemiology at Tufts University
School of Medicine.
Well, I suppose we had a good run. I'll inform the other scientists, and we can start packing
up our desks. Science as a human endeavor is dead. We'll just have to find something
else to do. World domination has always held an attraction for me, I think I might check
the want ads for something in that field. Start out as a minion and work my way up to
evil genius.
This kind of headline is A. completely accurate, and B. completely misleading to non-scientists.
In fact, if someone quotes this at you, there's a good chance they've never read it, and also
a reasonable chance that they are on the road to science denialism. I want to discuss what
it actually has to say.
First, I want us to pick a medically relevant hypothesis. Let's say we hypothesize that
black licorice significantly increases the risk of cancer in rats. We actually need
to set up two competing hypotheses: the alternative hypothesis, which we just stated, and the
null hypothesis, which is that black licorice does NOT significantly increase the risk of
cancer in rats.
The word significant here has a specific meaning. It doesn't mean important or meaningful, so
much as that it is unlikely to have occurred by chance. If we feed licorice to 1000 rats,
and no licorice to another 1000 rats, the number who get cancer is going to be affected
by chance and also, possibly, the treatment.
If we find, for example, that 50 of the control rats and 55 of the licorice-fed rats get
cancer, is that a significant finding, or is it simply what we should have expected,
given random variation in cancer risk?
What about 50 and 60? Is that enough difference to say that black licorice is a scourge to
mankind?
Ronald Fisher, the evolutionary biologist and statistician, provides a way of answering
this question. He set up a way to express whether something was significantly above
random chance.
First, we define our hypotheses, as we've already done.
Second, we clearly define the statistical relationships and assumptions.
Third, we make the observations and apply the appropriate test statistic.
Fourth, we make a decision on the outcome of the test statistic, and we reject or fail
to reject the null hypothesis.
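To make those four steps concrete, here is a minimal sketch in Python, assuming SciPy is available and using the hypothetical counts from earlier (50 of 1000 control rats versus 55 or 60 of 1000 licorice-fed rats). For yes/no outcomes like cancer counts, a chi-square test on the 2x2 table is one reasonable choice of test statistic; the t-test discussed below is another common workhorse.

    # A sketch of the four-step procedure, applied to the hypothetical rat
    # counts from the example above (numbers assumed, for illustration only).
    from scipy.stats import chi2_contingency

    ALPHA = 0.05  # part of step two: the significance level we commit to

    def test_counts(control_cancer, licorice_cancer, n_per_group=1000):
        # Step 1: H0 = licorice does not change cancer risk; H1 = it does.
        # Step 2: model each group as cancer / no-cancer counts and use a
        #         chi-square test of independence on the 2x2 table.
        table = [
            [control_cancer, n_per_group - control_cancer],
            [licorice_cancer, n_per_group - licorice_cancer],
        ]
        # Step 3: apply the test statistic to the observations.
        chi2, p_value, dof, expected = chi2_contingency(table)
        # Step 4: decide -- reject or fail to reject the null.
        decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
        print(f"{control_cancer} vs {licorice_cancer} of {n_per_group}: "
              f"p = {p_value:.2f} -> {decision}")

    test_counts(50, 55)  # not significant
    test_counts(50, 60)  # still not significant at the 0.05 level
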
We always start from the position that the null cannot be rejected, in this case that
licorice is safe. If it were a new drug, the null hypothesis would be that it is NOT safe,
or that it is NOT effective. Only successfully rejecting the null will be a vindication of
that new drug. Our prior assumption for something we've never tested before is that it doesn't
have the desired effect.
Our test statistic today, and this is a common one, is the Student's t-test. It was developed
by William Gosset, an Oxford-trained chemist working for the Guinness brewery, as a way of
monitoring the quality of the stout being produced. It's not called Gosset's test because
Claude Guinness didn't want his competitors to know he was using statistics to monitor
quality. Instead, Gosset published under a pen name, "Student". Fisher actually created the form
we use today, but it still bears the pen name of a secret brewery mathematician.
The Student's t-distribution comes with certain assumptions: a small sample size and an unknown
standard deviation for the population. It looks a bit like a normal distribution,
but with heavier tails on either side.
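As a minimal sketch of what applying the t-test looks like in practice, assuming Python with SciPy and two small sets of invented, continuous measurements (say, tumor mass per rat; the t-test compares the means of measurements rather than yes/no counts):

    # Two-sample Student's t-test on small, invented measurements
    # (e.g., tumor mass in grams per rat) -- values assumed for illustration.
    from scipy.stats import ttest_ind

    control = [1.2, 0.9, 1.1, 1.4, 1.0, 1.3, 0.8, 1.1]
    licorice = [1.3, 1.5, 1.1, 1.6, 1.2, 1.4, 1.7, 1.3]

    # equal_var=True is the classic Student's t-test (Gosset's test, in the
    # form Fisher standardized); equal_var=False would give Welch's variant.
    t_stat, p_value = ttest_ind(control, licorice, equal_var=True)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # With samples this small, the heavy-tailed t-distribution, not the
    # normal, is the right reference for judging the statistic.
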
We take our two populations of rats and apply the test statistic. Now we need to decide
at what level we will consider the results significant. Fisher proposed a standard 5%,
or 0.05, significance level, a value denoted as alpha. (Beta, which we'll meet shortly, is the
rate of the opposite kind of error, and 1-beta is the power of the test.)
What does that mean? It means there is a 5 percent chance that we will falsely conclude
that there is a difference between the groups by chance alone. Take our example: with 50
rats in the control group and 68 rats in the licorice group getting cancer, we conclude that
the difference is significant, but at this level there's always a 5% chance we're falsely
concluding there's a difference when no real difference exists. Out of every 100 such studies
where the null is actually true, we'll reach a false conclusion in about 5. We'll call this the
false positive, or Type I error.
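A quick way to see what that 5% really means is to simulate it. In this sketch, assuming Python with NumPy and SciPy, both groups are drawn from the same population, so the null hypothesis is true by construction, and yet roughly 5 out of every 100 experiments still come up "significant":

    # Simulating the Type I (false positive) error rate: both groups come
    # from the SAME population, yet about 5% of tests look "significant".
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    ALPHA = 0.05
    n_experiments = 10_000
    false_positives = 0

    for _ in range(n_experiments):
        group_a = rng.normal(loc=1.0, scale=0.3, size=20)  # null is true:
        group_b = rng.normal(loc=1.0, scale=0.3, size=20)  # identical populations
        _, p = ttest_ind(group_a, group_b)
        if p < ALPHA:
            false_positives += 1

    print(f"False positive rate: {false_positives / n_experiments:.3f}")  # about 0.05
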
There's the opposite error. This is where we fail to reject the null even though there really
is a difference: we conclude the drug is not safe or not effective when it really is, or that
the licorice is not a carcinogen when it really is one. This is a Type II error, or false negative.
The false negative rate is strongly affected by the power of a test. In our example, suppose
we only had 10 rats in each group, and exactly 3 in each group got cancer. We would conclude
that there is no significant difference between these two groups, but that's because the noise
contributed by randomness is so high. If we repeated this study 100 times, we might never
be able to detect a very real 20% difference in cancer rates between these two populations.
So there are two ways that we can be wrong in every test of a hypothesis, and two ways
we can be right. What can we say, generally, about the numbers of false and true alternative
hypotheses? How many false hypotheses exist, and how many true? There's a very large number
of hypotheses that can be false, but only a relatively small number that can be true.
We can hypothesize a lot more things than can actually be demonstrated to be true.
So in our four quadrant display, the majority of null hypotheses are likely to be true.
This makes that small 5% false positive error a pretty major contributor to coming to the
wrong conclusions. I'm going to borrow an example from an excellent article by economist
Alex Tabarrok on his blog, Marginal Revolution. Suppose we take 1000 hypotheses. Assume that
most of them will be false, for reasons we just discussed. 200 are true and 800 are false,
for our purposes. Our level of significance allows 5% of the 800 false hypotheses to come up
positive anyway, so about 40 will be false positives. From the true set, let's assume the power
of our tests allows us to pick up 60% of the 200 true hypotheses, or 120 true positives. That
gives us 160 total cases where we reject the null, or if you prefer, where we support the
alternative hypothesis, but only 120 of those 160 are true positives, or 75%. The other 25% of
the positive results are giving us the wrong answers, without any need for publication bias,
or researchers with conflicts of interest.
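That arithmetic is simple enough to write down directly; here is the same calculation as a few lines of Python:

    # Re-doing the arithmetic from the example: 1000 hypotheses, 200 true
    # and 800 false, alpha = 0.05, power = 0.60.
    n_true, n_false = 200, 800
    alpha, power = 0.05, 0.60

    false_positives = alpha * n_false       # 40 false hypotheses test "positive" anyway
    true_positives = power * n_true         # 120 true hypotheses we actually detect
    total_positives = false_positives + true_positives  # 160 "findings"

    ppv = true_positives / total_positives  # fraction of positive findings that are real
    print(f"{true_positives:.0f} of {total_positives:.0f} positive results are real "
          f"({ppv:.0%} right, {1 - ppv:.0%} wrong)")
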
There's more to it, of course; this is just where the objective numbers get us. Bad research
can also bias results and so can bad researchers: our expectations, our conflicts of interest,
our simple bad statistical assumptions. One of the most important elements is the economics
of sample size. It's cost-prohibitive to run trials with thousands of animal models or
enrolled patients, yet that's what's needed when the effect size is relatively small.
We often do "good enough" research, and add caveats to the conclusions, saying that further
study is warranted, or calling the study a pilot study.
So what should we do? Should I go get fitted out for my minion uniform of a spandex jumpsuit
and hard hat with a stripe down the middle? No, and none of this is surprising to me.
I think most young grad students learn this intuitively by conducting their own research
and following the literature. Results are variable, and what starts as a very interesting
outcome is less interesting when it is repeated, or when a larger sample size is used. Engaging
in research has a way of grinding out the bright-eyed optimism in surprising results,
even if your understanding of statistics isn't very deep.
So when someone sends me a single study in support of a surprising contradiction, I am
always very skeptical, even a little cynical, that this single outcome has destroyed our
prior knowledge on the topic. Likewise, numerous small studies, each of them inappropriately
small for the effect size, do not add up to a strong conclusion.
What is needed is some way of applying a little extra caution to the really outrageous conclusions,
and some have proposed just such a mechanism: Bayesian prior probability. This is a slightly
subjective way of estimating how likely an alternative hypothesis is before the experiment.
Hypotheses that represent major departures from existing models need a stronger effect, a larger
population, or less contribution from random noise. It's a way of codifying the basic skeptical,
rational principle expounded by Carl Sagan: "Extraordinary claims require extraordinary evidence". When
we apply this to concepts like psi or precognition, which frequently find support in individual studies
but would require new theoretical models in multiple fields of science, it allows us to cut through
the 90% of such research that is published with small sample sizes or with very small, but significant, differences.
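One way to see what the prior buys us is to fold it into the same arithmetic as before. The sketch below keeps the 5% significance level and 60% power from the earlier example and asks: given a "significant" result, how likely is the effect to be real, as a function of how plausible the hypothesis was to begin with? The specific priors are assumptions chosen purely for illustration.

    # Probability that a "significant" finding reflects a real effect, as a
    # function of the prior probability of the hypothesis. Keeps alpha = 0.05
    # and power = 0.60 from the earlier example; the priors are assumed values.
    ALPHA, POWER = 0.05, 0.60

    def prob_finding_is_real(prior):
        true_pos = POWER * prior         # real effects that test positive
        false_pos = ALPHA * (1 - prior)  # null effects that test positive anyway
        return true_pos / (true_pos + false_pos)

    for prior, label in [(0.200, "plausible hypothesis, as in the example above"),
                         (0.010, "long-shot hypothesis"),
                         (0.001, "extraordinary claim, e.g. precognition")]:
        print(f"prior {prior:.3f} ({label}): "
              f"P(real | significant) = {prob_finding_is_real(prior):.1%}")
    # Roughly 75%, 11%, and 1% respectively: the lower the prior, the less a
    # single significant study should move us.
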
Scientists learn very early on in their careers that individual papers are interesting, and
can lead to marvelous discoveries, but cherry picking single studies is never as reliable
as looking at an entire body of literature on a topic. A single result, unless it is
a truly definitive study, is not very persuasive when stacked against dozens of other papers
with contradictory findings.
In spite of what you may have been taught, a single documented exception to a theory is
not sufficient evidence to overturn it. What is needed, most of all, is repetition of good
research, skepticism about weak or small studies, skepticism about very small effect sizes,
and a good theoretical underpinning to the finding.
We also need to learn the difference between evidence based medicine, where a difference
can be demonstrated between treatment and controls, and science based medicine, where
not only is there evidential support, but it also agrees with a theoretical or mechanistic
understanding of the disease, condition, or pathway. Surprising results whose mechanism we don't
understand should be regarded with more skepticism.
There are many documented cases where a researcher made bad choices in experimental design either
intentionally or through simple bad judgement. I think the effect of researcher bias is a
lot less prevalent than most people think, and researcher incompetence or simply sloppy
design is a lot more common.
We scientists need to be a little bit more strict about misuse of the p-value, or of
measurements of significance generally, as the sole determinant of whether a result is real or
not. There is more to research than simply observing statistical differences. It should
certainly inform further research, but not be the end of it. We also need to get a lot
better at communicating these concepts to journalists and policy makers.
So how about our poor rats fed that nasty licorice? Do you feel a little differently
about the importance of the outcome of our study? Even if we find a significant result,
that result on its own is probably not enough to ban licorice. Politicians and journalists,
like most of the general public, don't get this, and so policy choices are often made
on a poor understanding of research outcomes.
I hope when you are presented with some similar finding on a new cure for cancer, or some
new association between watching YouTube videos and IQ, you'll apply
a little bit of extra skepticism. Yes, many or most research findings are false. But the
more good studies there are on a topic, the better the chance that we'll converge on something
like the correct result. Also important is that prior probability, the theoretical basis
or mechanism by which something might have occurred. Extraordinary claims require extraordinary
evidence.
And if you find someone who's quoting Ioannidis, check to see if they even understand why findings
are false. If they yammer on about corrupted researchers, industry meddling, political
bias and global conspiracies, you know they never read the paper. The real reason why
studies are false is much simpler.
Scientific testing is not perfect. It is a human endeavor carried out in the darkness of ignorance,
building, through numerous missteps, fruitless pursuits, and false confidence, a useful tool
to light our way. Ultimately, though, it works better than any other method we have
so far discovered. We just have to learn to manage it better, to compensate for our failings
and to accept that it is not a simple way to comprehend the world around us.
Thanks for watching.