The problem with statistical significance

Statistical significance isn’t the be-all and end-all of truth. Is there a better way forward?

Lisa Shepherd
8 min read · Dec 2, 2023

Welcome to 1920s England. A group of academics have gathered for tea. Algae researcher Muriel Bristol has just made a rather outrageous claim — she swears that she can tell whether the tea or the milk was added first to a cup.

The mathematician Ronald Fisher scoffs at this claim. William Roach — a biochemist and, incidentally, Bristol’s future husband — overhears the argument and proposes a solution. Fisher will give Bristol eight cups of tea, four of each variety, in random order.

Bristol correctly identifies all eight cups of tea. After calculating the probability of Bristol identifying all the cups by chance — a meagre 1 in 70 — Fisher concedes that “Bristol divined correctly more than enough of those cups into which tea had been poured first to prove her case”.
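
For the curious, the 1-in-70 figure is simple combinatorics: guessing amounts to choosing which four of the eight cups had the milk added first, and only one of the 70 possible choices is entirely correct. A quick sketch in Python:

```python
from math import comb

# Eight cups, four with milk poured first. A pure guess amounts to
# choosing 4 cups out of 8 -- comb(8, 4) = 70 possibilities,
# only one of which is exactly right.
n_ways = comb(8, 4)
print(n_ways)          # 70
print(1 / n_ways)      # ~0.014, Fisher's "1 in 70"
```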

And thus, a scientific ritual is born.

What is statistical significance?

Imagine we want to find out whether chihuahuas live longer than pugs. We can do this by collecting data about the lifespans of each breed and investigating whether there is a statistically significant difference between the two groups. The null hypothesis is the idea that there is no difference between how long pugs and chihuahuas live.

Statistical significance is typically represented by the p-value, which is the probability of getting the observed result (or a more extreme result) if the null hypothesis is true. The p-value depends both on how much the two groups differ and on how many individuals we included in our study. Learning that one chihuahua lived six years longer than one pug tells us very little. Perhaps the chihuahua just got lucky — or, more likely, the pug got unlucky. If the average lifespan of 1,000 chihuahuas was six years longer than the average lifespan of 1,000 pugs, that’s a different story.
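
To make that concrete, here is a minimal sketch of the usual approach, a two-sample t-test, with lifespan numbers invented purely for illustration. The same average gap between breeds is easy to dismiss with ten dogs per group but overwhelming with two thousand:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def lifespan_pvalue(n_per_group):
    # Invented numbers: chihuahuas average ~15 years, pugs ~14, both sd 3.
    chihuahuas = rng.normal(loc=15, scale=3, size=n_per_group)
    pugs = rng.normal(loc=14, scale=3, size=n_per_group)
    return stats.ttest_ind(chihuahuas, pugs, equal_var=False).pvalue

print(lifespan_pvalue(10))    # usually well above 0.05 -- too little data
print(lifespan_pvalue(2000))  # the same one-year gap is now "highly significant"
```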

The typical threshold for significance is p < 0.05, otherwise known as the 5% significance level. If there was less than a 5% chance of you obtaining the observed result under the null hypothesis, congratulations! You’ve got a significant result. Frontiers in Veterinary Science awaits your contribution to canine research.

The loss of nuance

Ronald Fisher never intended for the p-value to be a definitive test. He envisioned the p-value as part of a larger, non-numerical process that took both data and background knowledge into account.

The p < 0.05 threshold came later, as we fought to bring more rigour and objectivity to science. This is a noble goal. We can’t rely on human intuition to work out if and how data relate to one another — the human brain is wired to see patterns everywhere. Just ask a child to tell you what the clouds look like.

The p-value’s appeal is easy to understand. It’s a single number that we can use to interpret data objectively. It’s flexible, too. We can compare p-values across different types of studies and different statistical tests.

But p-values have received plenty of criticism. They’ve been compared to mosquitoes (annoying and difficult to get rid of) and the emperor’s new clothes (fraught with problems that we all see, but won’t point out). One researcher suggested renaming the p-value approach to “statistical hypothesis inference testing”, which seems benign until you consider the acronym.

Few people would argue that the p-value is an inherently bad statistic — rather, the problem is with the dichotomous way in which we interpret it. You don’t have to be an expert statistician to question why we consider a p-value of 0.04 significant while completely dismissing a p-value of 0.06.

As observed by Andrew Gelman and Hal Stern, the difference between “significant” and “non-significant” is not itself statistically significant. Only small changes are required to move a result from the 5.1% significance level to 4.9%. Conversely, large changes in significance levels can reflect negligible changes in the underlying data.
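
Their point is easy to reproduce with two hypothetical estimates (the numbers below are invented for illustration): one is comfortably “significant”, one is not, yet the difference between them is nowhere near significant.

```python
from scipy import stats

def two_sided_p(z):
    return 2 * stats.norm.sf(abs(z))

# Hypothetical estimates, expressed as effect and standard error.
effect_a, se_a = 25, 10      # study A
effect_b, se_b = 10, 10      # study B

print(two_sided_p(effect_a / se_a))   # ~0.012 -> "significant"
print(two_sided_p(effect_b / se_b))   # ~0.32  -> "non-significant"

# The difference between the two estimates is itself unremarkable:
diff = effect_a - effect_b
se_diff = (se_a**2 + se_b**2) ** 0.5
print(two_sided_p(diff / se_diff))    # ~0.29
```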

Not all hypotheses are created equal

The fact that the p-value is just a probability is only part of the issue. It’s also easy for us to forget what that probability actually represents.

Imagine a new drug has been shown to give mice healthier hearts. We decide to test it on humans. 25 people receive the drug while another 25 get a placebo. After a year, we work out whether the people who received the drug have significantly better cardiovascular health than the people who received the placebo. We get a p-value of 0.01.

Few people would challenge us for saying there is a 99% chance that the drug improves cardiovascular health.

Now let’s say I decide our pharmacological research isn’t getting me enough attention and claim that I can predict the future. I toss a coin seven times, and successfully manage to predict which side it will land on every time. Again, our statistical analyses return a p-value of 0.01.

Is there a 99% chance that I have supernatural powers?

No. p = 0.01 means that, if my predictions were nothing more than guesses, there was less than a 1% probability of me calling seven tosses in a row correctly. It does not mean that there is only a 1% probability that my results were down to chance.

A p-value of 0.01 actually corresponds to a false-alarm probability of at least 11%, depending on the plausibility of the hypothesis. There’s a higher chance that our drug worked than that I’m some kind of coin wizard.
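
Both numbers are easy to check. The coin-toss p-value is just the probability of seven correct calls under pure guessing, and the “at least 11%” figure can be reproduced with the Sellke–Berger lower bound on the false-alarm probability, under the (generous) assumption of 50:50 prior odds that the effect is real:

```python
import math

# Probability of calling seven fair coin tosses correctly by pure guessing.
print(0.5 ** 7)                        # ~0.0078, reported here as p = 0.01

# Sellke-Berger lower bound on the false-alarm probability for a p-value,
# assuming 50:50 prior odds that the null hypothesis is true.
def min_false_alarm(p):
    bf = -math.e * p * math.log(p)     # smallest possible Bayes factor for H0
    return bf / (1 + bf)

print(min_false_alarm(0.01))           # ~0.11 -> "at least 11%"
print(min_false_alarm(0.05))           # ~0.29
```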

The problem goes the other way, too. It’s incorrect to say there is “no difference” between two groups because the p-value exceeded 0.05 — yet scientific papers and conference talks are full of such conclusions.

The size of an effect

“Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude — not just, does a treatment affect people, but how much does it affect them.”

Gene V. Glass, statistician and researcher

Another issue with our usual interpretation of the p-value is how we treat it as the only statistic that matters. There’s plenty of information that p-values can’t tell us.

The p-value doesn’t measure the magnitude of an effect. With a sufficiently large sample size, even a tiny effect can be statistically significant.

A study with 22,000 subjects found that aspirin reduced the risk of heart attacks with a p-value of less than 0.00001. Yet aspirin accounted for only about one-tenth of 1% of the variance in heart attack risk.
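
As a rough sketch of how that happens (the counts below are invented stand-ins in the same ballpark, not the study’s actual data): with roughly 11,000 people per arm, even a fraction-of-a-percent difference in heart attack rates produces a minuscule p-value, while the share of variance explained stays around 0.1%.

```python
import numpy as np
from scipy import stats

# Invented counts: ~11,000 participants per arm, heart attacks in
# roughly 0.9% of the aspirin group and 1.7% of the placebo group.
table = np.array([[100, 10900],    # aspirin:  events, no events
                  [190, 10810]])   # placebo:  events, no events

chi2, p, dof, expected = stats.chi2_contingency(table)
print(p)                     # well below 0.00001

# Effect size for a 2x2 table: phi squared ~ share of variance explained.
print(chi2 / table.sum())    # ~0.001, about one-tenth of one percent
```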

Because of this study, many people were advised to take aspirin despite the risk of adverse effects. Further research suggested that aspirin may have an even smaller effect on the risk of heart attacks, and the recommendation to use aspirin has since been changed.

Fixation on statistical significance damages academic integrity

As a general rule, if a researcher fails to find a significant result, they’ve failed. Scientists know results that don’t support a hypothesis are unlikely to get published, so rarely even bother writing them up.

Annie Franco and colleagues researched publication bias in the social sciences by tracking 221 studies. They discovered that strong results were 40 percentage points more likely to be published than non-significant results and 60 percentage points more likely to be written up.

Because publishing is essential to an academic career, some researchers may resort to ethically questionable practices to nudge their results past the p < 0.05 threshold.

While only 1.97% of researchers admit to having fabricated, falsified or modified data, a third have confessed to engaging in other questionable research practices. Examples of such practices include attempting a study several times but only reporting the iteration that returned a significant result; collecting many variables but only reporting the ones that showed significant effects; and terminating data collection once a significant result is achieved.

Of course research fraud is immoral — but given the system, are we really surprised that it happens?

The replication crisis

The replication crisis is a major concern in the scientific community. Researchers are finding it difficult — or even impossible — to reproduce the results of many scientific studies. A survey by Nature revealed that more than 70% of researchers have failed to reproduce another scientist’s results, and more than half have failed to reproduce their own results.

It’s possible for one researcher to find a statistically significant relationship between two variables and for another to find no such thing, while actually having perfectly compatible results. And under plausible assumptions about how many of the hypotheses we test are actually true, a result that only just reaches p = 0.05 can carry a false positive risk of 23–50%.
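
Where do figures like that come from? A back-of-the-envelope version (the prevalence and power values below are illustrative assumptions, not measurements): if only a minority of the hypotheses we test are actually true, a respectable share of the “significant” results will come from the large pool of true nulls.

```python
def false_positive_risk(alpha=0.05, power=0.8, prior_true=0.1):
    """Share of 'significant' results that are false positives, given how
    many of the tested hypotheses are actually true (all values assumed)."""
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return false_positives / (true_positives + false_positives)

# If 10% of tested hypotheses are true and studies have 80% power:
print(false_positive_risk(prior_true=0.1, power=0.8))   # ~0.36
# With underpowered studies, it gets worse:
print(false_positive_risk(prior_true=0.1, power=0.4))   # ~0.53
```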

Add those shady research practices we mentioned into the mix, and the replication crisis loses all mystery. According to simulations by behavioural analyst Uri Simonsohn and colleagues, small changes in data analysis approaches can increase the chances of a false positive result in a single study to 60%.
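
A toy simulation in the same spirit (not Simonsohn’s actual code): even with no real effect at all, measuring several outcome variables and reporting whichever one “works” pushes the false positive rate of a single study far above the nominal 5%. Combining several such flexibilities, as Simonsohn and colleagues did, pushes it higher still.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def hacked_study(n=20, n_outcomes=5):
    """Two identical groups (no real effect), several outcome variables,
    and we report only the best-looking p-value."""
    p_values = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
                for _ in range(n_outcomes)]
    return min(p_values) < 0.05

false_positive_rate = np.mean([hacked_study() for _ in range(2000)])
print(false_positive_rate)   # roughly 0.2 instead of the nominal 0.05
```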

Should we retire statistical significance?

Perhaps the most obvious solution would be to abandon the concept of statistical significance altogether. In 2016, the American Statistical Association (ASA) released a policy statement on p-values and statistical significance. “No single index should substitute for scientific reasoning,” they wrote.

The ASA argued that neither scientific conclusions nor policy decisions should be based on whether a p-value passes a certain threshold. Instead, researchers must account for several contextual factors, including the quality of the measurements, study design, background knowledge and the validity of the assumptions underlying the analysis.

But not everyone is keen to dispense with statistical significance. Physician-scientist John Ioannidis asserts that statistical significance is an important obstacle to unfounded claims. He worries that without the p < 0.05 threshold, people would be free to draw whatever conclusions they liked from a study. What would stop a company from claiming that any results supported the licensing of their new product?

Despite the clear problems with our dichotomous interpretation of p-values, sometimes our inferences must be dichotomous. Consider medical research. Either a drug will be licensed or it won’t.

Dispensing with statistical significance allows for more nuance — but would more nuance really be appropriate?

Ironically, it depends.
