Uploaded by C0nc0rdance on 24.06.2011

Transcript:

The artificial sweetener video I've been working on is dragging out. I always have more ideas

than I have time to make videos on them, unfortunately, and I appreciate everyone who sent in suggestions.

Just be patient, and I'll do my best to get to your topic. Fluoride and alcohol are priorities

at the moment, then smoking and some positive applications of cannabis to provide a little

balance to my previous vid.

Most published research findings are false, according to a recent paper in the journal

PLoS Medicine, written by John Ioannidis (Yo-Nee-Dees), a professor of epidemiology at Tufts University

School of Medicine.

Well, I suppose we had a good run. I'll inform the other scientists, and we can start packing

up our desks. Science as a human endeavor is dead. We'll just have to find something

else to do. World domination has always held an attraction for me, I think I might check

the want ads for something in that field. Start out as a minion and work my way up to

evil genius.

This kind of headline is A. completely accurate

B. completely misleading to non-scientists

In fact, if someone quotes this at you, it's a good chance they've never read it, and also

a reasonable chance that they are on the road to science denialism. I want to discuss what

it actually has to say.

First, I want us to pick a medically relevant hypothesis. Let's say we decide that we hypothesize

that black licorice can be shown to significantly increase cancer in rats. We actually need

to set up two competing hypotheses. The alternative hypothesis, which we just stated, and the

null hypothesis, which is that black licorice CANNOT be shown to significantly increase

risk of cancer in rats.

The word significant here has a specific meaning. It doesn't mean important or meaningful, so

much as that it is unlikely to have occurred by chance. If we feed licorice to 1000 rats,

and no licorice to another 1000 rats, the number who get cancer is going to be affected

by chance and also, possibly, the treatment.

If we, for example, find that 50 of the control rats, and 55 of the licorice fed rats get

cancer, is that a significant finding, or was it simply what we should have expected,

given a normal random distribution of cancer risk?

What about 50 and 60? Is that enough difference to say that black licorice is a scourge to

mankind?

Ronald Fisher, the evolutionary biologist and statistician, provides a way of answering

this question. He set up a way to express whether something was significantly above

random chance.

First, we define our hypotheses, as we've already done.

Second, we clearly define the statistical relationships and assumptions.

Third, we make the observations, and apply the appropriate test statistic.

Fourth, we make a decision on the outcome of the test statistic, and we reject or fail

to reject the null hypothesis.

We always start from the position that the null cannot be rejected, in this case that

licorice is safe. If it were a new drug, the null hypothesis would be that it is NOT safe,

or that it is NOT effective. Only successfully rejecting the null will be a vindication of

that new drug. Our prior assumption for something we've never tested before is that it doesn't

have the desired effect.

Our test statistic today, and this is a common one, is Student's t-test. It was developed

by William Gosset, an Oxford graduate and chemist working for the Guinness brewery, as a way of

monitoring the quality of the stout being produced. It's not called Gosset's test because

Claude Guinness didn't want his competitors to know he was using statistics to monitor

quality. Instead, Gosset used a pen name, "Student". Fisher actually created the form

we use today, but it still bears the pen name of a secret brewery mathematician.

Student's t-distribution suits certain conditions: a small sample size and an unknown

standard deviation for the population. It looks a bit like a normal distribution,

but with heavier tails on either side.

We take our two populations of rats, and apply the test statistic. Now we need to decide

at what level we will consider the results significant. Fisher proposed a standard 5%,

or 0.05 significance level, a value denoted as alpha.

What does that mean? It means that there is a 5 percent chance that we will falsely conclude

that there is a difference between the groups by chance alone. Take our example, with 50

rats in the control group and 68 rats in the licorice group with cancer, we conclude that

the difference is significant, but at this level there's always a 5% chance we're falsely

concluding there's a difference when no real difference exists. Out of every 100 such studies

where no real effect exists, we'll get a false conclusion in about 5. We'll call this the false positive, or Type 1 error.
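
As a rough illustration of the rat counts above, here is a minimal sketch in Python. It assumes 1000 rats per group, and it swaps in a pooled two-proportion z-test rather than the Student's t-test named in the transcript, because the data are counts of rats with and without cancer. The test is one-sided, since the hypothesis is specifically that licorice increases cancer. The function name `one_sided_p` is made up for this example.

```python
import math

ALPHA = 0.05  # significance level from the transcript


def one_sided_p(cases_control, cases_treated, n_per_group):
    """One-sided p-value for 'treated rate > control rate'.

    Assumption: pooled two-proportion z-test with equal group sizes.
    """
    p_control = cases_control / n_per_group
    p_treated = cases_treated / n_per_group
    pooled = (cases_control + cases_treated) / (2 * n_per_group)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_treated - p_control) / se
    # Upper-tail probability of a standard normal: P(Z > z)
    return 0.5 * math.erfc(z / math.sqrt(2))


for treated in (55, 60, 68):
    p = one_sided_p(50, treated, 1000)
    verdict = "reject the null" if p < ALPHA else "fail to reject the null"
    print(f"50 vs {treated} cancers per 1000 rats: p = {p:.3f} -> {verdict}")
```

Under these assumptions, 50 versus 55 and 50 versus 60 do not reach the 0.05 level, while 50 versus 68 does. A two-sided version of the same test would double each p-value, which would put 50 versus 68 above 0.05 as well, so the choice of test matters.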

There's the opposite error. This is where we fail to reject the null even though there really

is a difference. We conclude that the drug is not safe or not effective, or that the licorice

is not a carcinogen, when the opposite is true. This is a Type 2 error, or false negative.

The false negative is strongly affected by the power of a test. In our example, suppose

we only had 10 rats in each group, and exactly 3 in each group got cancer. We might conclude

that there is no significant difference between these two groups, but that's because the noise

level contributed by randomness is so high. If we repeated this 100 times, we would usually

fail to detect even a very real 20% difference in cancer rates between these two populations.
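
The effect of sample size on power can be sketched with a small simulation. The true cancer rates used here (30% in controls, 50% in licorice-fed rats) are an assumption chosen to give a 20 percentage point difference; the transcript does not state them. The test is the same one-sided pooled two-proportion z-test as a stand-in, which is only a rough approximation with 10 rats per group. The names `one_sided_p` and `estimated_power` are made up for this example.

```python
import math
import random


def one_sided_p(cases_control, cases_treated, n_per_group):
    """One-sided p-value for 'treated rate > control rate' (pooled z-test)."""
    pooled = (cases_control + cases_treated) / (2 * n_per_group)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    if se == 0:
        return 1.0  # all rats or no rats got cancer: nothing to compare
    z = (cases_treated - cases_control) / n_per_group / se
    return 0.5 * math.erfc(z / math.sqrt(2))


def estimated_power(rate_control, rate_treated, n_per_group, trials, rng, alpha=0.05):
    """Fraction of simulated experiments that detect the real difference."""
    detections = 0
    for _ in range(trials):
        # Each rat independently gets cancer with its group's true rate.
        control = sum(rng.random() < rate_control for _ in range(n_per_group))
        treated = sum(rng.random() < rate_treated for _ in range(n_per_group))
        if one_sided_p(control, treated, n_per_group) < alpha:
            detections += 1
    return detections / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
power_small = estimated_power(0.30, 0.50, 10, 2000, rng)
power_large = estimated_power(0.30, 0.50, 1000, 200, rng)
print(f"Estimated power with 10 rats per group:   {power_small:.2f}")
print(f"Estimated power with 1000 rats per group: {power_large:.2f}")
```

With these assumed rates, the small experiment misses the real difference in most runs, while the large one almost always finds it.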

So there are two ways that we can be wrong in every test of a hypothesis, and two ways

we can be right. What can we say, generally, about the numbers of false and true alternative

hypotheses? How many false hypotheses exist, and how many true? There's a very large number

of hypotheses that can be false, but only a relatively small number that can be true.

We can hypothesize a lot more things than can actually be demonstrated to be true.

So in our four quadrant display, the majority of null hypotheses are likely to be true.

This makes that small 5% false positive error a pretty major contributor to coming to the

wrong conclusions. I'm going to borrow an example from an excellent article by economist

Alex Tabarrok on his blog, Marginal Revolution. Suppose we take 1000 hypotheses. Assume that

most of them will be false, for reasons we just discussed. 200 are true and 800 are false,

for our purposes. Our level of significance allows 5% of the false hypotheses to appear true,

so about 40 will be false positives. From the true set, let's assume the power of our

tests allows us to pick up 60% of the 200 true hypotheses, or 120. That gives us 160 total cases

where we reject the null, or if you prefer, where we support the hypothesis, but

only 120 of that 160 are true positives, or 75%. The other 25% of the results are giving

us the wrong answers, without any need for publication bias, or researchers with conflicts

of interest.
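
The arithmetic in this example can be checked with a few lines of Python. The numbers (1000 hypotheses, 200 true, a 5% significance level, 60% power) come from the example above; the function name `significant_results` is made up for this sketch.

```python
def significant_results(n_hypotheses, n_true, alpha, power):
    """Count the 'significant' findings and how many of them are real."""
    n_false = n_hypotheses - n_true
    false_positives = alpha * n_false   # false hypotheses that pass by chance
    true_positives = power * n_true     # true hypotheses the tests detect
    total = false_positives + true_positives
    return {
        "false_positives": false_positives,
        "true_positives": true_positives,
        "total_significant": total,
        "fraction_true": true_positives / total,
    }


result = significant_results(n_hypotheses=1000, n_true=200, alpha=0.05, power=0.60)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Changing `n_true` or `power` shows how quickly the fraction of real findings falls when fewer tested hypotheses are true or when studies are underpowered.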

There's more to it, of course; this was just where the objective numbers get us. Bad research

can also bias results and so can bad researchers: our expectations, our conflicts of interest,

our simple bad statistical assumptions. One of the most important elements is the economics

of sample size. It's cost-prohibitive to run trials with thousands of animal models or

enrolled patients, yet that's what's needed when the effect size is relatively small.

We often do "good enough" research, and add caveats to the conclusions, saying that further

study is warranted, or calling the study a pilot study.

So what should we do? Should I go get fitted out for my minion uniform of a spandex jumpsuit

and hard hat with a stripe down the middle? No, and none of this is surprising to me.

I think most young grad students learn this intuitively by conducting their own research

and following the literature. Results are variable, and what starts as a very interesting

outcome is less interesting when it is repeated, or when a larger sample size is used. Engaging

in research has a way of grinding out the bright-eyed optimism in surprising results,

even if your understanding of statistics isn't very deep.

So when someone sends me a single study in support of a surprising contradiction, I am

always very skeptical, even a little cynical, that this single outcome has destroyed our

prior knowledge on the topic. Likewise, numerous small studies, each of them inappropriately

small for the effect size, do not add up to a strong conclusion.

What is needed is some way of applying a little extra caution for the really outrageous conclusions,

and some have proposed just such a mechanism: Bayesian prior probability. This is a slightly

subjective way of measuring what the likelihood of an alternative hypothesis is. Those that

represent major departures from existing models need a stronger effect, a larger population,

or less contribution from random noise. It's a way of codifying the basic skeptical rational

principle expounded by Carl Sagan: "Extraordinary claims require extraordinary evidence". When

we apply this to concepts like psi or precognition, which frequently find support for their hypotheses

but require a new theoretical model in multiple fields of science, it allows us to cut through

90% of research published with small sample sizes, or very small, but significant differences.
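
One way to see why prior probability matters is to apply Bayes' rule to a single significant result. This is a minimal sketch, assuming the same 5% significance level and 60% power as the earlier example; the prior values in the loop are illustrative assumptions, not figures from the paper, and the function name is made up here.

```python
def probability_true_given_significant(prior, alpha=0.05, power=0.60):
    """Bayes' rule: chance a hypothesis is true after one significant result."""
    true_positive = power * prior          # true and detected
    false_positive = alpha * (1 - prior)   # false but significant by chance
    return true_positive / (true_positive + false_positive)


# Illustrative priors, from a coin-flip hypothesis to an extraordinary claim.
for prior in (0.5, 0.2, 0.01, 0.001):
    posterior = probability_true_given_significant(prior)
    print(f"prior {prior:>5}: probability true after one significant result = {posterior:.3f}")
```

With a prior of 0.2 this reproduces the 75% figure from the 1000-hypotheses example. For a claim given only a 1-in-100 prior, a single significant result still leaves it more likely false than true.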

Scientists learn very early on in their careers that individual papers are interesting, and

can lead to marvelous discoveries, but cherry picking single studies is never as reliable

as looking at an entire body of literature on a topic. A single result, unless it is

a truly definitive study, is not very persuasive when stacked against dozens of other papers

with contradictory findings.

In spite of what you may have been taught, a single documented exception to theory is

not sufficient evidence to overturn it. What is needed, most of all, is repetition of good

research, skepticism about weak or small studies, skepticism about very small effect sizes,

and a good theoretical underpinning to the finding.
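
The value of repetition can be sketched by updating the same probability one study at a time. This assumes every study is independent and has the same 5% significance level and 60% power, which real replications rarely satisfy exactly; the starting prior of 0.2 is borrowed from the 1000-hypotheses example. The function names are made up for this illustration.

```python
def update_after_study(prior, significant, alpha=0.05, power=0.60):
    """Update the probability that a hypothesis is true after one study."""
    if significant:
        likelihood_true, likelihood_false = power, alpha
    else:
        likelihood_true, likelihood_false = 1 - power, 1 - alpha
    numerator = likelihood_true * prior
    return numerator / (numerator + likelihood_false * (1 - prior))


def belief_history(prior, outcomes):
    """Probability after each study; outcomes is a list of True/False results."""
    history = []
    belief = prior
    for significant in outcomes:
        belief = update_after_study(belief, significant)
        history.append(belief)
    return history


replicated = belief_history(0.2, [True, True, True])
contradicted = belief_history(0.2, [True, False, False])
print("Three significant studies:         ", [round(b, 3) for b in replicated])
print("One significant, then two failures:", [round(b, 3) for b in contradicted])
```

Under these assumptions, repeated significant results push the probability up quickly, while a single positive followed by failed replications ends up less convincing than the positive result looked on its own.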

We also need to learn the difference between evidence based medicine, where a difference

can be demonstrated between treatment and controls, and science based medicine, where

not only is there evidential support, but it also agrees with a theoretical or mechanistic

understanding of the disease or condition or pathway. Surprising results whose mechanism we

don't understand should be regarded with more skepticism.

There are many documented cases where a researcher made bad choices in experimental design either

intentionally or through simple bad judgement. I think the effect of researcher bias is a

lot less prevalent than most people think, and researcher incompetence or simply sloppy

design is a lot more common.

We scientists need to be a little bit more strict about misuse of the p value statistic,

or measurements of significance as the sole determinant of whether a result is real or

not. There is more to research than simply observing statistical differences. It should

certainly inform further research, but not be the end of it. We also need to get a lot

better at communicating these concepts to journalists and policy makers.

So how about our poor rats fed that nasty licorice? Do you feel a little differently

about the importance of the outcome of our study? Even if we find a significant result,

that result on its own is probably not enough to ban licorice. Politicians and journalists,

like most of the general public, don't get this, and so policy choices are often made

on bad understanding of research outcomes.

I hope when you are presented with some similar finding on a new cure for cancer, or some

new association is found between people who watch YouTube videos and IQ, you'll apply

a little bit of extra skepticism. Yes, many or most research findings are false. But the

more good studies there are on a topic, the better the chance that we'll converge on something

like the correct result. Also important is that prior probability, the theoretical basis

or mechanism by which something might have occurred. Extraordinary claims require extraordinary

evidence.

And if you find someone who's quoting Ioannidis, check to see if they even understand why findings

are false. If they yammer on about corrupted researchers, industry meddling, political

bias and global conspiracies, you know they never read the paper. The real reason why

studies are false is much simpler.

Scientific testing is not perfect. It is a human endeavor done in the darkness of ignorance,

but it builds a useful tool to light our way, through numerous missteps, fruitless pursuits

and false confidence. Ultimately, though, it works better than any other method we have

so far discovered. We just have to learn to manage it better, to compensate for our failings

and to accept that it is not a simple way to comprehend the world around us.

Thanks for watching.
