Statistical Significance

What is Statistical Significance? 

Statistical significance is a measure used in hypothesis testing to determine whether the observed results in a study are unlikely to have occurred by chance alone, typically assessed using a p-value threshold (e.g., p < 0.05). It provides evidence that the effect or relationship observed in the data reflects more than random variation. Understanding statistical significance is crucial for making informed, evidence-based decisions in research and data analysis.

The Basic Idea

How many times in life have you noticed a strange phenomenon? Maybe the same song you were just humming came on the radio, you bumped into someone from your hometown in a completely different part of the world, or you’ve noticed that every time you have oatmeal for breakfast, you hit all the green lights on the way to work. While each of these could be a coincidence, wouldn’t it be nice to have a way to figure out whether or not they truly were? Or at least determine the odds that these events were due to chance alone? While it may not be entirely possible to calculate whether each of these personal situations was completely coincidental, the concept of assessing the likelihood that there is a force besides chance at play, known as statistical significance, can be applied to many other phenomena in the world. 

Statistical significance is used in many types of research to determine whether the results of a study are likely to be due to a specific factor rather than occurring by chance. Researchers use statistical tests to assess this, including calculating a p-value. The p-value is a measure that helps determine the likelihood of observing the obtained results if the null hypothesis (which suggests no effect or no difference between groups) is true. Basically, the p-value answers the question: “Is this result unlikely to be due to random chance?” If the p-value falls below a predetermined threshold, the result is deemed statistically significant. Usually, the threshold of 0.05 is used (meaning you’re looking for a p-value of 0.05 or lower) because it means that, if chance alone were at work, results at least this extreme would appear less than 5% of the time.1 
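
To make this concrete, here is a minimal sketch in Python of how a p-value might be computed for a simple two-group comparison. The data and group labels are invented for illustration, not drawn from any real study:

```python
# A minimal sketch (data and labels invented) of how a p-value is
# computed in practice, here with an independent-samples t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=15, size=50)    # e.g., untreated scores
treatment = rng.normal(loc=108, scale=15, size=50)  # e.g., treated scores

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Compare against the conventional 5% threshold:
print("statistically significant" if p_value < 0.05 else "not significant")
```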

However, statistical significance only speaks to the presence of an effect, not its magnitude or practical importance. Findings from a study could be statistically significant but have a negligible effect in real-world applications, whether because the methodology was flawed, the effect size was too small, or some other factor is at play. Although researchers should consider statistical significance in any study, many other considerations are needed to understand the practical implications of statistical results.2

“A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions.”

— M. J. Moroney, statistician and author of Facts from Figures.

Key Terms

P-value: The quantified probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A lower p-value suggests stronger evidence against the null hypothesis.1

Null Hypothesis (H₀): The default assumption that there is no effect or relationship between variables. It's the hypothesis that researchers aim to test against to identify whether their findings are statistically significant.1

Alternative Hypothesis (H₁): The supposition that there is an effect or relationship between variables. Rejecting the null hypothesis lends support to the alternative hypothesis.1

Type I Error: When the null hypothesis is incorrectly rejected, suggesting an effect or relationship where none actually exists. It's also known as a "false positive."1

Type II Error: When the null hypothesis is incorrectly retained, resulting from a failure to detect a true effect or relationship. This is also referred to as a "false negative."1

Effect Size: The magnitude of a relationship or difference, offering context to the statistical significance. Larger effect sizes indicate more substantial findings. While a p-value answers, “Is there an effect?” the effect size answers, “How big is the effect?”2

Sample Size: The number of observations or participants in a study. Larger sample sizes reduce variability and increase the power of statistical tests, making it easier to accurately detect significant effects.1

Sampling Bias: When a sample is not representative of the population being studied, leading to systematic errors in the results. This can happen if certain groups are overrepresented or underrepresented due to flaws in the sampling process. Sampling bias can compromise the validity of statistical tests by distorting p-values, potentially leading to incorrect conclusions about whether or not findings are significant. 
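
Several of these terms can be seen working together in a quick simulation. The sketch below uses invented numbers, not data from any study: when the null hypothesis is true, roughly 5% of tests still come out “significant” (Type I errors), and a larger sample size raises the power to detect a real effect (fewer Type II errors):

```python
# A simulation sketch (all numbers invented) tying several key terms
# together: with no true effect, ~5% of tests are still "significant"
# (Type I errors); with a real effect, a larger sample raises power,
# reducing Type II errors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant_fraction(effect, n, trials=2000, alpha=0.05):
    """Fraction of simulated two-group studies with p < alpha."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)     # control group
        b = rng.normal(effect, 1.0, n)  # treatment group, shifted by `effect`
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / trials

print(significant_fraction(0.0, n=30))   # Type I error rate: ~0.05
print(significant_fraction(0.5, n=20))   # power with a small sample: low (~0.3)
print(significant_fraction(0.5, n=100))  # power with a larger sample: high (~0.9)
```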

History

In the 1700s, researcher John Arbuthnot became the first person to calculate statistical significance while trying to understand a mysterious phenomenon regarding births in London. Using 82 years of data, from 1629 to 1710, he noticed that more male babies were being born than female babies, year after year. Since the biological understanding of the time suggested that the odds of having a girl or boy should be roughly 50/50, the size of the difference was hard to account for. In order to understand the likelihood of such a disparity in sex at birth occurring due to chance, Arbuthnot conducted what is now recognized as the first statistical significance test. By taking the probability of males outnumbering females in a given year under even odds (½) and raising it to the power of the number of years of data (82, giving the equation p = (½)⁸²), he found that the odds of such a sex difference occurring by chance alone were 1 in 4,836,000,000,000,000,000,000,000, an infinitesimal p-value.4 Several decades later, statistician Pierre-Simon Laplace tackled the same question of differences in female versus male births using slightly different techniques. Although he also used a p-value calculation to show that the excess of boys being born was a statistically significant difference, the difference remained unexplained.5,6
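
Arbuthnot’s arithmetic is easy to reproduce; a quick check in Python:

```python
# Reproducing Arbuthnot's arithmetic: under 50/50 odds, the probability
# that males outnumber females in all 82 recorded years is (1/2)^82.
p = 0.5 ** 82
print(p)      # ≈ 2.07e-25
print(1 / p)  # ≈ 4.836e24, i.e., 1 in ~4,836,000,000,000,000,000,000,000
```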

It was the mathematicians Karl Pearson and Ronald Fisher who, at the turn of the 20th century, first formally introduced the p-value, which quantifies how extreme a set of observations would be if the null hypothesis were true. The p-value debuted with Pearson's chi-squared test, which is used to assess how well recorded data fit a hypothesized distribution. Fisher is credited with defining the concept of statistical significance tests, establishing a p-value of 0.05 as a threshold for determining whether a result is statistically significant. Both contributions helped bring more objective measures to the sciences, giving researchers the tools to quantify the strength of their results. 

If you’re in the field of psychology, you may be familiar with the prejudice that the social sciences sometimes face. Compared to the “hard sciences” like math and chemistry, the social sciences are often less tangible and can be harder to quantify. Particularly in the early 20th century and during the space race, other fields of study were much more likely to be funded and published, while psychological experiments seemed to lack the objective measures valued by publications at the time. Under pressure to adopt more consistent statistical methods to establish psychology’s scientific rigor, psychologists began employing p-value calculations in an increasing number of experiments.6 

As statistical significance tests became more common in the social sciences, psychology journals began publishing only papers with reported p-values below 0.05, indicating statistically significant findings. Unfortunately, this backfired in a predictable way: an increasing number of researchers began “adjusting” their data, either by outright lying or by skewing their data collection and analysis processes in response to the pressures placed on them by journals, a dubious practice known as p-hacking.2,6 

In response, some critics protested the standards of statistical significance being imposed upon researchers at the cost of quality research. In the 1990s, psychology journal editor Geoffrey Loftus wrote a piece outlining the dangers of mindlessly calculating whether experimental results are statistically significant or not. In the paper, he succinctly explains, “Significance testing is all about how the world isn’t and says nothing about how the world is." Although his criticism was minimally effective in shifting the scientific community’s long-term stance, Loftus’ critique highlighted the limitations of the p-value and the dangers of over-relying on significance tests to determine the validity of results.7 

People

John Arbuthnot 

A Scottish physician and mathematician who is credited with pioneering the concept of statistical significance in 1710 by analyzing birth records. He demonstrated that an excess of male births over female births was statistically unlikely to be due to chance, which was an early example of hypothesis testing using probability theory. Although he ultimately theorized that the difference in sex at birth was due to divine intervention, his work was among the first to use statistical significance in research.

Pierre-Simon Laplace

A French scholar and the most prominent mathematician studying probability theory in the late 18th and early 19th centuries. He continued to study the same birthrate question as John Arbuthnot, using his own theories of mathematical probability and statistics.5 

Karl Pearson

An English mathematician and statistician who introduced the idea of correlation and applied statistics to biological problems of evolution and heredity. The Pearson correlation coefficient is named after him, as he developed the method to measure the strength and direction of a linear relationship between two variables.8 

Ronald Fisher

A British mathematician, statistician, biologist, geneticist, and academic who is widely credited with establishing the concept of statistical significance and, in particular, popularizing the use of a p-value threshold of 0.05 to determine whether a result is statistically significant. Under the null hypothesis, a result at least this extreme would occur by random chance alone only 5% of the time; this level is sometimes referred to as the “Fisherian significance level.”8

Geoffrey Loftus 

An American psychologist who has been a prominent critic of the field’s overreliance on statistical significance. His research focuses on memory and attention.7


Impacts

Statistical significance profoundly impacts how we interpret data, advance scientific knowledge, and apply findings to the real world. We can use significance tests to assess how likely it is that the effects we observe are due to chance, shaping how research is used and communicated. By understanding its strengths and limitations, we can move beyond treating statistical significance as an endpoint and instead use it as a starting point for deeper inquiry, making sure to acknowledge its potential for misinterpretation or misrepresentation. 

Confidence Intervals Increase Confidence 

Statistical significance helps create a framework for assessing the reliability of data by distinguishing between random noise and genuine effects. If statistical analysis suggests the effect is genuine, the researchers and those interpreting the data can feel more confident in the results. Whether in economics, psychology, or biology, the ability to rely on consistent metrics fosters trust across disciplines, and the shared language makes it easy to collaborate.1 If, for example, a group of economists and a team of epidemiologists were both working on poverty interventions, both teams could use statistically significant data to identify programs that genuinely make an impact despite the likely differences in their methodological approaches.

As in any research or experimental setting, uncertainty is inevitable, but statistical significance can offer a way to quantify that uncertainty. This type of benchmark is important for many decision-makers; imagine if you, as a researcher, proposed a new life-saving medication and said, “It seems like this works, but I’m not sure how likely it is that it actually does anything.” It would be hard for any institution to sign off on something with so much uncertainty. Statistically significant results obtained from a clinical trial could provide reassurance that a new drug’s efficacy is unlikely to be due to random variation. Statistical significance is a helpful—if not necessary—metric to get stakeholders to sign off on major decisions.

Increased Awareness Means a Risk of Misinterpretation 

As discussion of the p-value becomes more common, the risk of misinterpretation by the public (or even experts in the field) increases. In many domains, statistical significance is highlighted to validate results when conveying them to broader audiences. Unfortunately, its technical nuances are rarely communicated, which can distort the public’s understanding of what these values really mean.2 Maybe you’ve seen a study claiming something like, “Eating blueberries reduces cancer risk,” with a reported p-value of p < 0.01. If there isn’t an accompanying explanation outlining the absolute risk reduction or what the p-value represents, audiences might assume the effect is more transformative than it is.

Although significance offers a structured lens to interpret findings, overreliance on any statistic risks oversimplification. When a statistically significant outcome is treated as definitive proof of an effect, other important considerations, like effect size or the study’s methodology, are unfairly sidelined. This can lead to overconfident decisions.2 For example, an education program could be deemed “effective” because it caused students to improve significantly in one area, but the actual effects of the education program may do little to address long-term learning or equity issues. Meanwhile, researchers, policymakers, and the general public may be distracted by a search for programs with the strongest statistical significance, ignoring the possibility of a more broadly effective program.2 

Decision-Making Frameworks

Statistical significance is now used in almost every field. In marketing or operations, for example, statistical significance often informs business strategies.1 For instance, if a company runs an A/B test and finds that its website’s new design improves customer conversions with a significant p-value, teams might adopt it. It’s important to note, though, that decisions based solely on statistical significance will likely overlook important operational considerations, like whether the improvement actually justifies implementation costs or if the change aligns with other company or user experience goals.
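
As a sketch of what such an A/B analysis might look like (the visitor and conversion counts below are invented), a simple chi-squared test on the two conversion rates yields the p-value a team would act on:

```python
# A sketch of checking an A/B test for significance (visitor and
# conversion counts are invented). A chi-squared test on the 2x2 table
# of conversions vs. non-conversions yields the p-value.
from scipy.stats import chi2_contingency

table = [
    [480, 10_000 - 480],  # design A: 480 conversions out of 10,000 visitors
    [560, 10_000 - 560],  # design B: 560 conversions out of 10,000 visitors
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p = {p_value:.4f}")  # below 0.05 here: the lift is unlikely to be chance
# Whether a 4.8% to 5.6% lift justifies implementation costs is a
# separate, non-statistical question, exactly the point made above.
```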

Controversies 

Critics have long pointed out that overreliance on statistical significance can lend itself to misleading conclusions. These issues have fueled ongoing debates about the utility and limitations of statistical significance, leading some to advocate for abandoning p-values altogether in favor of more nuanced approaches like effect size and confidence intervals.

P-Value vs. Effect Size

Although the magnitude of a p-value can be helpful in assessing statistical significance, the value does nothing to explain the size or importance of an effect. To reiterate, a p-value tells us the likelihood of observing our data (or something more extreme) if the null hypothesis is true, but it only speaks to the possible presence of an effect, not its magnitude or practical importance.2,6

Here’s where the nuance comes in: a small p-value doesn’t automatically mean the effect is large, meaningful, or actionable. Similarly, a larger p-value doesn’t necessarily mean the effect is negligible. In fact, it could simply mean the sample size was too small to detect the effect reliably. That’s why focusing solely on the p-value without considering effect size or real-world relevance can lead to misleading conclusions. 

The importance of the effect size, which quantifies the magnitude of a difference or relationship in a study, is clear when we consider the following two possibilities:6

  1. Statistically significant but small effect: A study might find that a new drug reduces blood pressure by 1 mmHg compared to a placebo, with a p-value of 0.01. While statistically significant, a 1 mmHg reduction is likely too small to have any meaningful clinical impact on an individual. Focusing solely on the p-value might mislead researchers, clinicians, or policymakers into overestimating the drug’s usefulness.
  2. Large effect but not statistically significant: Imagine another study testing a rare but powerful intervention on a small sample of participants. The intervention reduces blood pressure by 20 mmHg, but the p-value is 0.08. While not statistically significant, the effect size suggests the intervention could potentially be highly impactful if tested on a larger sample.

Thus, ignoring the effect size can have real consequences. Imagine if policymakers prioritized interventions based on statistically significant results, even when the actual impact is minimal. They might invest millions in an education program, for example, that increases test scores by only 0.5% (but does so with statistical significance) instead of investing in a program with the potential for an effect 20 times that size but which lacks proven statistical significance and may simply require further testing. 
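
A short simulation makes the first scenario above tangible. In this sketch (the blood pressure figures are invented to mirror the example), a huge sample renders a clinically trivial 1 mmHg difference highly “significant,” while the effect size, measured here as Cohen’s d, stays negligible:

```python
# A simulation sketch of scenario 1 above (numbers invented): with a
# huge sample, a clinically trivial 1 mmHg difference becomes highly
# "significant," yet the effect size (Cohen's d) remains negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000
placebo = rng.normal(120.0, 15.0, n)  # systolic blood pressure, mmHg
drug = rng.normal(119.0, 15.0, n)     # true effect is only 1 mmHg

t_stat, p_value = stats.ttest_ind(drug, placebo)

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_sd = np.sqrt((placebo.var(ddof=1) + drug.var(ddof=1)) / 2)
d = (placebo.mean() - drug.mean()) / pooled_sd

print(f"p = {p_value:.2e}")  # astronomically small: "statistically significant"
print(f"d = {d:.3f}")        # ~0.07: a negligible effect in practical terms
```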

Size Isn’t Everything

Furthermore, the p-value only describes the statistical significance of the results found in the specific situation being studied, but it doesn’t account for study design or real-world relevance. Statistical significance is inherently tied to the internal conditions of the study: the sample size, the statistical test used, and the specific dataset being analyzed.6 While defining the statistical significance of a study can help us interpret its results, it doesn’t help us answer some of the more important questions about the study’s applicability to the real world. For example, was the study designed to answer the right question? How well do the study conditions mirror the complexities of real-world scenarios? Is the observed significance practically meaningful outside the study’s controlled environment?

The design of a study plays a critical role in shaping the p-value at almost every step of the methodology. When selecting participants, if the sample isn’t representative of the larger population, the findings may not generalize, an unfortunately common issue known as sampling bias. A study on a new medication conducted only on young volunteers, for example, might yield statistically significant results, yet the same effect may not hold for older or more diverse populations. More broadly, studies are disproportionately conducted on people raised in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) places.3 These populations are far from representative of the planet as a whole, limiting the applicability of findings. 

The p-value also assumes that the data being analyzed are accurate representations of the phenomena being studied, but if a study uses poorly designed surveys or other inappropriately designed measurements, then the significance test simply measures how strongly the survey (or interview, experimental condition, etc.) relates to the outcome. 

Even if a p-value suggests significance within a specific study, we really only know that the results are significant within the controlled setting of the experiment. While a controlled study works to minimize extraneous factors to isolate the relationship being tested and establish internal validity, the lab setting isn’t always a realistic representation of real life. The real world is much more nuanced, and it’s impossible to perfectly replicate the social and cultural (or even physical) influences of the outside world in the safety of a lab. 

P-Hacking 

Moreover, the fixation on thresholds like 0.05 has encouraged questionable practices such as p-hacking, where researchers manipulate data or analyses to achieve "significant" results. Overemphasizing p-values has contributed to the replication crisis in science, as an increasing number of researchers have skewed (or downright fabricated) their results in order to achieve statistically significant findings and have their work published.2,8 
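
A quick simulation shows why unchecked flexibility invites p-hacking. In the sketch below (pure noise, with no real effects anywhere), testing 20 outcomes typically yields about one “significant” result at the 0.05 threshold:

```python
# A sketch of the multiple-comparisons mechanism behind p-hacking
# (pure noise, no real effects anywhere): testing 20 independent
# outcomes yields about one spuriously "significant" result on average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_outcomes, n_per_group = 20, 50
a = rng.normal(0.0, 1.0, (n_outcomes, n_per_group))  # group A, all null
b = rng.normal(0.0, 1.0, (n_outcomes, n_per_group))  # group B, all null

p_values = [stats.ttest_ind(a[i], b[i]).pvalue for i in range(n_outcomes)]
false_hits = [i for i, p in enumerate(p_values) if p < 0.05]
print(f"spuriously 'significant' outcomes: {false_hits}")
# Reporting only these hits, without disclosing the other tests run,
# is the essence of p-hacking.
```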

This is true even for researchers who have found large but non-significant effects. Their work may be potentially groundbreaking, but if they lacked an adequate sample size or failed to meet other standards to prove statistical significance, the results may be lost in the abyss of unpublished findings. Unfortunately, this can skew the body of evidence in a field, hindering scientific progress. The pressure to publish pushes some researchers to frame data in a misleading way in the hopes of achieving statistical significance, which can undermine the public’s trust in the scientific community, as well as delegitimize the integrity of the scientific process.2,6,7,8 

Case Studies

The Lady Tasting Tea

While many case studies documenting the importance of statistical significance are related to clearly consequential topics like public health, income inequality, or scientific research, it may be more fun to highlight how statistical significance can also help us understand relatively insignificant phenomena. In one famous randomized experiment, statistician Sir Ronald Fisher tested a woman named Muriel Bristol’s claim that she could tell whether the tea or the milk was added to a cup first.9 Although this seemed unlikely, Fisher agreed to design a simple experiment, wherein he prepared eight cups of tea for her, with four having milk put in the cup first and the other four having tea poured into the cup first. He had her try each of these cups in random order and identify how each was prepared. 

Thus, the null hypothesis in this case was that the lady had no ability to distinguish between the teas and that lucky guesses could explain her accuracy in other situations. In this experiment, the test statistic was a simple count of the number of successful attempts to select the four cups prepared by a given method. The distribution of possible successes, assuming the null hypothesis is true, can be computed from the number of possible combinations of guesses. Because there were eight total cups, of which four had to be chosen, the number of possible combinations was 70. The usual criterion requires that the probability of the observed performance under the null hypothesis fall below 5%. So, in order to reject the null hypothesis and demonstrate statistical significance, Bristol needed to correctly identify all four cups, because the chance of getting all four correct by sheer luck is one in 70 (≈ 1.4%, less than 5%). 
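
The arithmetic behind the one-in-70 figure is a single combinatorial count, easy to verify:

```python
# Verifying the tea-tasting arithmetic: with 8 cups, choosing which 4
# were milk-first has C(8, 4) = 70 equally likely answers under the
# null hypothesis, so guessing all four correctly by luck is 1 in 70.
from math import comb

total = comb(8, 4)
print(total)      # 70
print(1 / total)  # ≈ 0.014, about 1.4%, below the 5% threshold
```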

Ultimately, after having all eight cups of tea, Bristol guessed each one correctly, leading Fisher to reject the null hypothesis (and most likely leading Bristol to run to the restroom). Although the experiment itself was rather inconsequential, the simplicity of the experiment is a wonderful demonstration of how the process of determining statistical significance works and the importance of randomization in experiments.9 

The Salk Polio Vaccine Trials 

Ideally, the benefits of the polio vaccine would be widely accepted at this point; in reality, the history of the vaccine’s development seems to have been forgotten by some. In the early 20th century, polio was a devastating disease causing widespread paralysis and death, especially among children.10 In 1954, field trials of Jonas Salk’s vaccine took place; they were among the largest and most rigorously controlled studies conducted up to that time. Over 1.8 million children in the United States participated in the trials, divided into vaccinated and placebo groups, to determine whether the vaccine significantly reduced the risk of contracting polio.

At the end of the trials, the results showed that 115 cases of paralytic polio were reported in the placebo group, as opposed to 33 cases in the vaccinated group.11 That’s obviously a big difference, and when the results were analyzed, the p-value was far below the typical threshold of 0.05, indicating that the observed difference was highly unlikely to be due to random chance. This statistical significance provided strong evidence that the vaccine was effective in preventing polio.10
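
For readers curious about the arithmetic, here is a hedged re-analysis sketch. The case counts come from the trial as reported above, but the group sizes of roughly 200,000 per arm are an assumption made for illustration, not figures from the text:

```python
# A hedged re-analysis sketch: the case counts (115 vs. 33) are from the
# trial as reported above, but the group sizes of ~200,000 per arm are
# an assumption made for illustration, not figures from the text.
from scipy.stats import chi2_contingency

n_per_group = 200_000  # assumed group size
table = [
    [115, n_per_group - 115],  # placebo: paralytic cases vs. non-cases
    [33, n_per_group - 33],    # vaccinated: paralytic cases vs. non-cases
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p = {p_value:.2e}")  # far below 0.05, consistent with the account above
```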

The demonstration of statistical significance had immediate and profound consequences for the nation and the world. The vaccine was approved for widespread use, which led to mass immunization campaigns that drastically reduced polio cases worldwide.10 Importantly, Salk’s contributions are still celebrated today, not only because his scientific efforts saved so many lives but also because his perspective on vaccine patents sharply contrasts with what we typically see in modern pharmaceutical development. Instead of capitalizing on a patent for the vaccine, Salk chose to keep it as equitable and widely accessible as possible, knowing that was necessary for the intervention to succeed. The vaccine was indeed a hit, and Salk never profited from its success. Today, polio has been nearly eradicated, with only a few cases reported annually in isolated regions, though it has recently made a comeback in some areas, a resurgence linked to unfounded vaccine skepticism.10 The efficacy of the polio vaccine, like that of many others, illustrates how statistical significance, when combined with robust study design, can lead to lifesaving breakthroughs.

Related TDL Content

Instrumental Variables Estimation 

Statistical significance plays a central role in almost all methods used for statistical analysis. This article explains one such method, instrumental variables (IV) estimation, which is used in statistics and econometrics to address the problem of endogeneity. Learn more about what this means and how the approach uses variable correlation to obtain consistent estimates.

Inferential Statistics 

Inferential statistics is a branch of statistics that allows researchers to make generalizations about a larger population based on a sample of data. This article explores techniques such as hypothesis testing and confidence intervals and explains how inferential statistics helps estimate population parameters, test relationships between variables, and make predictions beyond the immediate dataset. 

Sources

  1. Tenny, S., & Abdelgawad, I. (2023, November 23). Statistical significance. StatPearls.
  2. Dyer, I. (1998). The significance of statistical significance. Accident and Emergency Nursing, 6(2), 92–98. https://doi.org/10.1016/S0965-2302(98)90006-6
  3. Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? SSRN Electronic Journal. https://doi.org/10.2139/ssrn.1601785
  4. Arbuthnot, J. (1710). An argument for divine providence. Philosophical Transactions, 27, 186–190.
  5. Bijak, J., & Bryant, J. (2016). Bayesian demography 250 years after Bayes. Population Studies, 70(1), 1–19. https://doi.org/10.1080/00324728.2015.1122826
  6. Curran-Everett, D. (2020). Evolution in statistics: P values, statistical significance, kayaks, and walking trees. Advances in Physiology Education, 44(3), 324–332. https://doi.org/10.1152/advan.00054.2020
  7. Bower, B. (2021, August 12). How the strange idea of ‘statistical significance’ was born. Science News. https://www.sciencenews.org/article/statistical-significance-p-value-null-hypothesis-origins
  8. Kennedy-Shaffer, L. (2019). Before p < 0.05 to beyond p < 0.05: Using history to contextualize p-values and significance testing. The American Statistician, 73(Suppl 1), 82–90. https://doi.org/10.1080/00031305.2018.1537891
  9. Fisher, R. A. (1971). The design of experiments (9th ed.). Macmillan. (Original work published 1935)
  10. World Health Organization. (2024). History of polio vaccination. Retrieved January 19, 2025, from https://www.who.int/news-room/spotlight/history-of-vaccination/history-of-polio-vaccination
  11. Francis, T., Jr. (1955, April 12). The first press release on polio vaccine evaluation results. The University of Michigan Information and News Service. https://sph.umich.edu/polio/

About the Author

Annika Steele

Annika completed her Master’s at the London School of Economics in an interdisciplinary program combining behavioral science, behavioral economics, social psychology, and sustainability. Professionally, she’s applied data-driven insights in project management, consulting, data analytics, and policy proposals. Passionate about the power of psychology to influence an array of social systems, she has researched reproductive health, animal welfare, and perfectionism in female distance runners.
