Around 1889, the English polymath Sir Francis Galton began to suspect a divergence between statistics and causation. Studying hereditary data sets, he noticed that tall men had longer-than-average forearms, though their forearms were not as far above average as their height. It was clear to Galton that height did not cause forearm length, nor did forearm length cause height; rather, both were likely caused by genetic inheritance. He began to use a new term for relationships such as the one between height and forearm length: they were “co-related.”1
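Galton's observation can be illustrated with a small simulation. The sketch below is not Galton's data or method: it assumes a hypothetical shared factor (a stand-in for inheritance) that feeds both traits, with independent noise added to each, all in arbitrary standardized units. Neither trait influences the other, yet the two end up correlated, and the tallest individuals have forearms that are above average but less extreme than their height.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Assumption: one shared factor drives both traits (hypothetical model).
shared = [rng.gauss(0, 1) for _ in range(n)]

# Each trait = shared factor + its own independent noise.
# Height never enters the forearm formula, and vice versa.
height = [s + rng.gauss(0, 1) for s in shared]
forearm = [s + rng.gauss(0, 1) for s in shared]

r = pearson_r(height, forearm)

# Look only at the tallest individuals (height well above the mean of 0).
tall = [(h, f) for h, f in zip(height, forearm) if h > 1.5]
mean_height_tall = sum(h for h, _ in tall) / len(tall)
mean_forearm_tall = sum(f for _, f in tall) / len(tall)

print(f"correlation between height and forearm: {r:.2f}")
print(f"tall group, mean height deviation:  {mean_height_tall:.2f}")
print(f"tall group, mean forearm deviation: {mean_forearm_tall:.2f}")
```

Under these assumptions the correlation should come out close to 0.5, and the tall group's mean forearm deviation should be positive but smaller than its mean height deviation, which mirrors the pattern Galton described.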
In 1892, another English statistician, Karl Pearson, drew on Galton’s work when he claimed that causation can never be proven and that mere data is all there is to science. In the early 20th century, Pearson and his assistant would write about “spurious correlations,” a category illustrated today by the correlation between a country’s per capita chocolate consumption and its number of Nobel Prize winners. However, as Judea Pearl points out in The Book of Why, despite Pearson’s hostility toward causation, by calling a correlation spurious he was implicitly appealing to causation. In other words, to say that chocolate consumption does not cause Nobel Prizes is to presume that causation does in fact exist somewhere. So while the statistics community had by this time agreed that correlation does not imply causation, there was little agreement on how to actually determine causation.
Around 1918, Sewall Wright, a guinea pig caretaker with the US Department of Agriculture, began going beyond his job duties by building a causal model to evaluate direct dependencies in the guinea pigs’ genetic data. His ingenious work using “path diagrams” would later become the foundation of causal inference. As Pearl writes, “This idea must have seemed simple to Wright but turned out to be revolutionary because it was the first proof that the mantra ‘correlation does not imply causation’ should give way to ‘some correlations do imply causation.’”1
Beyond the analysis of existing data sets, randomized controlled trials (RCTs) eventually gained popularity in science and statistics as a way to determine causality experimentally rather than relying solely on mathematics. Now often referred to as the “gold standard” of clinical evidence, RCTs are essential to sound medical research, because distinguishing between correlation and causation is paramount to understanding the efficacy of a new treatment or medical procedure.
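Why randomization helps can be shown with a toy simulation. This is a minimal sketch under invented assumptions, not a model of any real trial: the treatment has an assumed true benefit of 2.0 outcome points, a hypothetical `severity` variable lowers outcomes, and in the observational scenario sicker patients are more likely to be treated. The same difference-in-means calculation is applied to both scenarios.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
TRUE_EFFECT = 2.0       # assumed benefit of treatment (hypothetical)


def outcome(severity, treated):
    """Health outcome: worse with severity, better with treatment, plus noise."""
    return 10.0 - 3.0 * severity + TRUE_EFFECT * treated + rng.gauss(0, 1)


def diff_in_means(records):
    """Mean outcome of treated patients minus mean outcome of controls."""
    treated = [y for t, y in records if t]
    control = [y for t, y in records if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


n = 20_000
observational = []
randomized = []
for _ in range(n):
    severity = rng.gauss(0, 1)

    # Observational data: sicker patients are more likely to get treated,
    # so severity influences both treatment and outcome (a confounder).
    t_obs = severity + rng.gauss(0, 1) > 0
    observational.append((t_obs, outcome(severity, t_obs)))

    # Randomized trial: a coin flip decides treatment, independent of severity.
    t_rct = rng.random() < 0.5
    randomized.append((t_rct, outcome(severity, t_rct)))

obs_estimate = diff_in_means(observational)
rct_estimate = diff_in_means(randomized)

print(f"assumed true effect:    {TRUE_EFFECT:.2f}")
print(f"observational estimate: {obs_estimate:.2f}")
print(f"randomized estimate:    {rct_estimate:.2f}")
```

With these made-up numbers, the observational comparison should make the treatment look harmful, because treated patients were sicker to begin with, while the randomized comparison should land near the assumed effect. The coin flip breaks the link between severity and treatment, so the remaining difference between groups can be attributed to the treatment itself.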