
An unhealthy obsession with p-values is ruining science

By Julia Belluz @juliaoftoronto julia.belluz@voxmedia.com Mar 15, 2016, 11:00am EDT


Over the past couple of years, Stanford meta-researcher John Ioannidis and several
colleagues have been working on a paper that should make any nerd think twice
about p-values, those tests of statistical significance that are now commonly
perceived as a signal of a study's worth.

Their paper, published today in JAMA, examines p-values across 25 years of biomedical research. That involved doing some seriously impressive data crunching: The researchers analyzed more than 1.6 million study abstracts and more than 385,000 full-text papers, all of which included p-values.

What they found was "an epidemic" of statistical significance: 96 percent of the papers that included a p-value in their abstract boasted statistically significant results. (P-values run from 0 to 1; by convention, a value of 0.05 or lower counts as statistically significant.)

What’s more, Ioannidis told Vox, "the proportion of papers that use p-values is going
up over time, and the most significant results have become even more significant over
time." Only about 10 percent of the papers he looked at mentioned effect sizes in
their abstracts, for example, and even fewer mentioned measures of uncertainty,
such as confidence intervals. So very rarely were researchers giving any context
about the real importance of their p-value findings.

All this means that as p-values have become more popular, they've also become more
meaningless.

"If you’re a pessimist," Ioannidis added, "this may be called p-value trash."

But even if you’re an optimist, the new study suggests the entire biomedical world has
been furiously chasing statistical significance, potentially giving dubious results the
appearance of validity by churning them through this increasingly popular statistical
method, or simply suppressing important results that don't look significant enough.

In the biomedical context, this finding is worrying. It means drugs and medical devices
that don't work so well may be sold using p-values that suggest they do.

"So the big picture," Ioannidis concluded, "is that there are these millions and millions
of papers with millions and millions of p-values floating around, and many are
misleading."

Good luck trying to find a really clear definition of a p-value


If you're struggling to wrap your head around the definition of a p-value, you're not
alone.

In the broadest sense, it's simply one of many ways researchers can test a hypothesis
using statistics.

A more detailed and still comprehensible definition is actually shockingly hard to come by.

Here's a recent stab from the American Statistical Association:

"Informally, a p-value is the probability under a specified statistical model that a statistical
summary of the data (for example, the sample mean difference between two compared groups)
would be equal to or more extreme than its observed value."
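
In symbols, that informal definition boils down to something like the sketch below, where T stands for whichever statistical summary was chosen and H0 for the specified statistical model (the notation here is mine, not the ASA's):

```latex
p \;=\; \Pr\left( T \text{ at least as extreme as } t_{\mathrm{obs}} \;\middle|\; H_0 \right)
```

Note that nothing in that expression is the probability that the hypothesis itself is true, a distinction that matters for everything that follows.
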
I called Rebecca Goldin, the director for Stats.org and a professor at George Mason
University, for help parsing that still perplexing definition. She walked me through an
example using drug studies, the kind Ioannidis and his colleagues examined.

Say a researcher has run a study testing the effect of a drug on an outcome like
cholesterol, and she's trying to see whether the people on the drug (group A)
improved their cholesterol levels more than the people who did not take the drug
(group B). Let's say she finds that patients in group A (who got the medicine) also
happened to lower their cholesterol more than those in group B (who didn't get the
medicine).

The researcher has no way of knowing whether that difference in cholesterol levels is
because of the medicine or some other difference between the two groups. "She
cannot 'see' with her data alone whether, behind the scenes, God was rolling dice or
whether the medicine was influencing cholesterol levels," said Goldin. In other words,
the difference in cholesterol levels between the two groups may have occurred
because of chance or because of the medicine — but that's a question the researcher
can't answer using the data she has.

But there is something she can answer: If it were randomness alone ("God rolling the
dice"), then how likely would it be that people's cholesterol levels came out as they
did in this study? This is where the p-value comes in.

She can use a statistical method (in this case, resulting in a p-value) to check the
probability that she would see the difference in cholesterol between the groups (or
more extreme differences) under the assumption that the medicine had nothing to do
with the difference. This assumption is called the "null hypothesis," and generating a
p-value always starts with a null hypothesis.

To actually calculate the p-value, the researcher would plug a bunch of numbers about
her data — the number of people in the study, the average change in cholesterol for
both groups, the standard deviation for each group, etc. — into a calculator. Again, the
p-value that the calculator spits out will be the probability of seeing this data (the
difference in cholesterol levels between the two groups) or more extreme data, given
the null hypothesis (the medicine didn't work). A p-value of less than 0.05 is
considered "statistically significant" by many in the medical community — an
indicator that the data are unlikely, though still possible, if the medicine weren't
working.
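
To make that "calculator" step concrete, here is a minimal sketch in Python using SciPy's two-sample t-test; the group sizes, mean cholesterol changes, and standard deviations below are made-up numbers for illustration, not figures from any real trial:

```python
# A sketch of the calculation described above: summary statistics for two groups
# go in, a t statistic and a p-value come out. All numbers are hypothetical.
from scipy import stats

# Group A (took the medicine): average drop in cholesterol, SD, and group size.
drug_mean, drug_sd, drug_n = 18.0, 30.0, 150
# Group B (did not take the medicine).
placebo_mean, placebo_sd, placebo_n = 10.0, 30.0, 150

# Null hypothesis: the medicine makes no difference between the two groups.
t_stat, p_value = stats.ttest_ind_from_stats(
    drug_mean, drug_sd, drug_n,
    placebo_mean, placebo_sd, placebo_n,
)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 is what many in the medical community would call
# "statistically significant": data this extreme would be unlikely, though
# still possible, if the medicine weren't working.
```
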
To be clear: The p-value will not tell the researcher how likely it is that the medicine is
working (or not working). So it won't tell her whether her original hypothesis (about
whether the medicine works) is true or false. Instead, the p-value tells her the probability of seeing her data (the difference between groups A and B) given the null hypothesis. And, again, if the p-value is low (less than 0.05), the probability that data like this would arise under the null is low, providing some evidence that the medicine is having an impact.
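
One way to see that distinction in practice is a quick simulation (my own sketch, not part of the JAMA analysis): generate many fake trials in which the "medicine" truly does nothing, and roughly 5 percent of them will still produce a p-value below 0.05.

```python
# Simulate trials where the null hypothesis is true by construction (both groups
# drawn from the same distribution) and count how often the two-sample t-test
# still comes out "significant" at the 0.05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_group = 10_000, 150

false_alarms = 0
for _ in range(n_trials):
    group_a = rng.normal(loc=0.0, scale=30.0, size=n_per_group)  # no real effect
    group_b = rng.normal(loc=0.0, scale=30.0, size=n_per_group)
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_alarms += 1

print(f"'Significant' results with no real effect: {false_alarms / n_trials:.1%}")
# Expect roughly 5%: a low p-value says the data would be surprising under the
# null hypothesis, not that the hypothesis being tested is likely to be true.
```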

Why the p-value crisis matters


Ioannidis's paper, which raises questions about the trustworthiness of p-values,
doesn't come in isolation.

Though statisticians have long been pointing out problems with "significance doping" and "P-dolatory" (the "worship of false significance"), journals have increasingly relied on p-values to determine whether a study should be published.

"We fear the p-value is used as a gatekeeper for determining what’s publishable
research," said Ron Wasserstein, the executive director of the American Statistical
Association. This means that good research with higher p-values is being turned away, that authors may not submit to journals at all when they get a high p-value, or, even worse, that authors game their p-values or selectively report only low ones (dubbed "p-value hacking") in order to make their results appear statistically significant and therefore publishable.

"I'm concerned that work that’s important doesn't see the light of day because p-
values didn't come out to be below 0.05," said Wasserstein. "I'm concerned that work
that is published is considered successfully evidentiary based on low
p-values."

When I asked Wasserstein how we arrived at this moment, he had a couple of guesses. First, software makes churning p-values out easier than ever. And second, a
p-value is a temptingly easy figure to rely on when deciding whether research is
valuable. "It's this number that looks like you could use it to make a decision that
might otherwise be difficult to make or require a whole lot more effort to make," he
said. Unfortunately, that's not true.

It doesn't have to be this way


What's most ironic about this state of affairs is that the p-value had much more modest
origins, as statistician Regina Nuzzo reported in Nature: When p-values were
introduced by UK statistician Ronald Fisher in the 1920s, he intended them to be "one
part of a fluid, non-numerical process that blended data and background knowledge
to lead to scientific conclusions." They weren't the be-all and end-all of significance;
again, they were intended as just one tool in the statistical toolbox.

But even with all of this controversy, few are suggesting abandoning the p-value altogether. Instead, the American Statistical Association just released guidance on p-value principles in an effort to encourage more conservative and accurate use of the method:

1. P-values can indicate how incompatible the data are with a specified statistical model.

2. P-values do not measure the probability that the studied hypothesis is true, or the probability
that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a
p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance
of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or
hypothesis.
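
Principle 5 is especially easy to illustrate with a toy calculation (the numbers below are hypothetical, not drawn from the JAMA paper): a trivially small effect measured in a very large sample can yield a far smaller p-value than a substantial effect measured in a small one.

```python
# A sketch of ASA principle 5: the p-value reflects sample size as much as effect size.
from scipy import stats

# Tiny effect (1-point difference between group means), huge sample.
_, p_tiny_effect = stats.ttest_ind_from_stats(
    mean1=101.0, std1=15.0, nobs1=50_000,
    mean2=100.0, std2=15.0, nobs2=50_000,
)

# Large effect (10-point difference), small sample.
_, p_big_effect = stats.ttest_ind_from_stats(
    mean1=110.0, std1=15.0, nobs1=10,
    mean2=100.0, std2=15.0, nobs2=10,
)

print(f"tiny effect, 50,000 per group: p = {p_tiny_effect:.1e}")
print(f"large effect, 10 per group:    p = {p_big_effect:.3f}")
# The negligible effect comes out "more significant," which is why effect sizes
# and confidence intervals belong alongside p-values.
```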

Even Ioannidis doesn't think the p-value should be thrown out. Instead, he said,
journals should crack down on their use of p-values. "They should insist on more
[information] about what is the effect size, the uncertainty around effect size, and
how likely [the results are] to be true."
