Over the past 20 years, a wave of improbable-sounding scientific research has come under the microscope. Are Asian Americans really prone to heart attacks on the fourth day of every month? Do power poses really increase testosterone? Do men really eat more pizza when women are around? Are people named Brady really more susceptible to bradycardia (a slower-than-normal heart rate)? As early as 2005, alarm bells were going off over unrigorous social-science research — that was the year John P.A. Ioannidis, a Stanford professor of medicine, published “Why Most Published Research Findings Are False” in PLOS Medicine. Since then, self-appointed “data thugs” have championed more transparent research practices, watchdog projects including the Center for Open Science and the Meta-Research Innovation Center at Stanford have attempted to tackle the problem, and reproducibility efforts have gained steam in disciplines ranging from medicine to psychology to economics.
And yet, after decades of awareness efforts, dubious research still finds a home in scholarly journals. Surgeries are more likely to be fatal if they are done on the surgeon’s birthday, argues a medical paper. Fatal motorcycle accidents are more common when there is a full moon, claims a paper by a medical researcher and a psychologist. Bitcoin prices correlate with stock prices in the health-care industry, posits an economics paper.
To understand the persistence of dodgy research, it helps to consider the motivation and methods.
The inherent randomness in scientific experiments is handled by calculating the p-value, the probability that random assignment might be responsible for the observed disparity in outcomes. How low does the p-value have to be to be considered “statistically significant” evidence? The great British statistician Ronald Fisher chose a p-value cutoff of 0.05, which quickly became gospel.
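One way to make the idea concrete is a permutation test, which asks how often a gap at least as large as the observed one would arise if the group labels were simply reshuffled at random. The sketch below is only an illustration: the function name `permutation_p_value`, the made-up scores, and the choice of 10,000 shuffles are all assumptions, not anything taken from the studies discussed here.

```python
import random


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    n_treated = len(treated)
    observed = abs(sum(treated) / n_treated - sum(control) / len(control))

    pooled = list(treated) + list(control)
    n_control = len(pooled) - n_treated
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Re-deal the outcomes to the two groups purely at random.
        rng.shuffle(pooled)
        gap = abs(sum(pooled[:n_treated]) / n_treated
                  - sum(pooled[n_treated:]) / n_control)
        # Small tolerance so floating-point ties count as "as extreme".
        if gap >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Made-up outcome scores, purely for illustration.
treated = [10, 11, 12, 13, 14, 15]
control = [1, 2, 3, 4, 5, 6]
print(permutation_p_value(treated, control))
```

A small result says that random assignment alone would rarely produce a gap this large; it does not say how plausible the underlying theory is.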
Fisher’s argument that we need to assess whether empirical results might be explained by simple chance is compelling. However, any hurdle for statistical significance is bound to become a target that researchers strive mightily to hit. Fisher declared that we should “ignore entirely all results which fail to reach this level.” No researchers want their findings to be ignored entirely, so many work to get their p-values below 0.05. If journals require statistical significance, researchers will give them statistical significance.
The result is p-hacking — trying different combinations of variables, looking at subsets of the data, discarding contradictory data, and generally doing whatever it takes until something with a low p-value is found and then pretending that this is what you were looking for in the first place. As Ronald Coase, an economics Nobel laureate, cynically observed: “If you torture data long enough, they will confess.”
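A rough simulation shows why this works so reliably. Suppose a study has no real effect at all, but the researcher measures several outcomes and reports whichever one gives the lowest p-value. The sketch below assumes independent outcomes, normally distributed data with a known standard deviation of 1, and a simple two-sided z-test; the function names and sample sizes are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(a, b):
    """Two-sided z-test for equal-sized groups, assuming a known sd of 1."""
    n = len(a)
    z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_studies_with_a_hit(n_outcomes, n_studies=2000, n=20, seed=1):
    """Fraction of no-effect studies whose best p-value falls below 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: no true effect.
            a = [rng.gauss(0, 1) for _ in range(n)]
            b = [rng.gauss(0, 1) for _ in range(n)]
            p_values.append(z_test_p(a, b))
        # The p-hacker reports only the most favorable outcome.
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_studies


print("one outcome: ", share_of_studies_with_a_hit(1))
print("20 outcomes: ", share_of_studies_with_a_hit(20))
```

With a single prespecified outcome, about 5 percent of these no-effect studies clear the bar, as the cutoff intends. With 20 independent tries, the chance of at least one "significant" result is 1 − 0.95^20, or roughly 64 percent.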
Consider a 2020 BMJ article (picked up by dozens of news outlets) claiming that surgeries are more likely to be fatal if they are done on the surgeon’s birthday. If true, it is a damning indictment: patients are dying because surgeons are distracted by birthday plans and good wishes from colleagues. The conclusion is implausible, but it is provocative and media friendly, something that is often true of p-hacked studies.
It is difficult to prove p-hacking, but one sign is when the research involves many selection choices, what Andrew Gelman, professor of statistics and political science at Columbia University, has likened to a “garden of forking paths.” The birthday study involved Medicare patients who underwent one of 17 common types of surgery between 2011 and 2014: four cardiovascular surgeries and the 13 most common noncardiovascular, noncancer surgeries in the Medicare population. The use of 2011-14 data in a paper published in 2020 is perplexing. The choice of 17 surgeries is baffling. P-hacking would explain all of this.
The authors justified their surgery selections by referencing several studies that had used Medicare data to investigate the relationship between surgical mortality and other variables. One of those four cited papers considered 14 cardiovascular or cancer operations but reported results for only four cardiovascular procedures and four cancer resections; two papers examined four cardiovascular and four cancer operations; and the fourth paper considered four cardiovascular surgeries and the 16 most common noncardiovascular surgeries in the Medicare population.
The four cardiovascular procedures considered in the birthday paper are identical or nearly identical to those reported in the four cited papers. However, the inclusion of 13 other procedures is suspicious. Why didn’t the authors use a more natural number, like 10, or perhaps 16, so that the total would be 20? Did 13 procedures give the lowest p-value? It is also striking that none of the four referenced studies excluded patients with cancer, but the birthday study did. The authors unconvincingly claim that this was “to avoid patients’ care preferences (including end-of-life care) affecting postoperative mortality.”
Even with all these possible p-hacks, the reported p-value is 0.03, only marginally under Fisher’s 5-percent rule. One sign of widespread p-hacking by researchers is the suspicious clustering of reported p-values slightly below 0.05. A 0.03 p-value does not necessarily mean that there was p-hacking — but when there are many forking paths and peculiar forks are chosen, a marginal p-value is not compelling evidence.
Brian Wansink retired from his position as a professor of marketing at Cornell University and director of the university’s Food and Brand Lab after a variety of problems were discovered with his studies, including extensive p-hacking. One smoking gun was an email to a co-author lamenting that a p-value was 0.06: “If you can get the data, and it needs some tweaking, it would be good to get that one value below 0.05.”
In Gelman’s garden-of-forking-paths analogy, p-hacking occurs when a researcher seeks empirical support for a theory by trying several paths and reporting the path with the lowest p-value. Other times, a researcher might wander aimlessly through the garden and make up a theory after reaching a destination with a low p-value. This is hypothesizing after the results are known — HARKing.
A good example is a 2018 National Bureau of Economic Research study of bitcoin prices. Bitcoin is particularly interesting because there is no logical reason why bitcoin prices should be related to anything other than investor expectations about future prices, or perhaps market manipulation. Unlike bonds that pay interest and stocks that pay dividends, bitcoin doesn’t yield any income at all, so there is no logical way to value bitcoin the way investors might value bonds and stocks.
Nonetheless, the NBER working paper reported hundreds of estimated statistical relationships between bitcoin prices and various variables, including such seemingly random items as the Canadian dollar–U.S. dollar exchange rate; the price of crude oil; and stock returns in the automobile, book, and beer industries. I am not making this up.
Of the 810 statistical relations they do report, 63 are statistically significant at the 10-percent level — which is somewhat fewer than the 81 statistically significant relationships that would be expected if they had just correlated bitcoin prices with random numbers.
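The arithmetic behind that comparison is simple to check. The sketch below assumes, for illustration only, that the 810 tests are independent and that none of the relationships is real, so the count of "significant" results follows a binomial distribution; the helper name `prob_at_most` is hypothetical.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_tests = 810   # relationships reported in the working paper
alpha = 0.10    # significance level used
found = 63      # relationships reported as significant

expected = n_tests * alpha
print(f"expected by chance alone: {expected:.0f}")
print(f"P(at most {found} by chance): {prob_at_most(found, n_tests, alpha):.3f}")
```

Under these assumptions, pure noise would be expected to produce 81 significant results, so 63 is no evidence that anything real was found. The independence assumption is a simplification, since many of the paper's variables are likely to be related to one another.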
The occasional justifications the authors offer are seldom persuasive. For example, they acknowledge that, unlike stocks, bitcoins don’t generate income or pay dividends, so they “proxy” this value using the number of bitcoin-wallet users:
Obviously, there is no direct measure of dividend for the cryptocurrencies. However, in its essence, the price-to-dividend ratio is a measure of the gap between the market value and the fundamental value of an asset. The market value of cryptocurrency is just the observed price. We proxy the fundamental value by using the number of Bitcoin wallet users.
The number of bitcoin-wallet users is not analogous to the income corporations earn or the dividends paid to stockholders and is not a valid proxy for the fundamental value of bitcoin — which is a big fat zero.
Among the 63 statistical relationships that were significant at the 10-percent level, the researchers reported finding that bitcoin returns were positively correlated with stock returns in the consumer-goods and health-care industries, and negatively correlated with stock returns in the fabricated-products and metal-mining industries. These correlations don’t make any sense, and the authors did not try to explain them: “We don’t give explanations, we just document this behavior.” Academics surely have better things to do than document coincidental correlations.
Some are tempted by an even easier strategy — simply make up whatever data are needed to support the desired conclusion. When Diederik Stapel, a prominent social psychologist, was exposed in 2011 for having made up data, it led to his firing and the eventual retraction of 58 papers. His explanation: “I was not able to withstand the pressure to score points, to publish, to always have to be better.” He continued: “I wanted too much, too fast.”
It is just a short hop, skip, and jump from making up data to making up entire papers. In 2005, three MIT graduate students created a prank program they called SCIgen that used randomly selected words to generate bogus computer-science papers. Their goal was to “maximize amusement, rather than coherence” and, also, to demonstrate that some academic conferences will accept almost anything.
They submitted a hoax paper with this gibberish abstract to the World Multiconference on Systemics, Cybernetics and Informatics:
Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public-private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.
The conference organizers accepted the prank paper and then withdrew their acceptance after the students revealed their hoax. The pranksters have now gone on to bigger and better things, but SCIgen lives on. Believe it or don’t, but some researchers have used SCIgen to bolster their CVs.
Cyril Labbé, a computer scientist at Grenoble Alps University, wrote a program to detect hoax papers published in real journals. Working with Guillaume Cabanac, a computer scientist at the University of Toulouse, he found 243 bogus published papers written entirely or in part by SCIgen. A total of 19 publishers were involved, all reputable and all claiming that they publish only papers that pass rigorous peer review. One of the embarrassed publishers, Springer, subsequently announced that it was teaming with Labbé to develop a tool that would identify nonsense papers. The obvious question is why such a tool is needed. Is the peer-review system so broken that reviewers cannot recognize nonsense when they read it?
P-hacking and HARKing were less of a problem when it was not practical to estimate zillions of models. Now, computers can do in seconds what it would take humans years to do by hand. James Tobin, a Nobel laureate in economics, once told me that the bad old days when researchers had to do calculations by hand were actually a blessing. The calculations were so hard that people thought hard before calculating. Today, with terabytes of data and lightning-fast computers, it is too easy to calculate first, think later. This is a flaw, not a feature.
P-hacking, HARKing, and dry labbing (making up data) inevitably lead to the publication of fragile studies that do not hold up when tested with fresh data, which has created our current replication crisis. In 2019 it was reported that 396 of the 3,017 randomized clinical trials published in three premier medical journals were medical reversals that concluded that previously recommended medical treatments were worthless, or worse.
In 2015, Brian Nosek’s Reproducibility Project reported the results of attempts to replicate 100 studies that had been published in what are arguably the top three psychology journals. Only 36 continued to have p-values below 0.05 and to have effects in the same direction as in the original studies.
In December 2021, the Center for Open Science (co-founded by Nosek, a psychology professor at the University of Virginia) and Science Exchange reported the results of an eight-year project attempting to replicate 23 highly cited in-vitro or animal-based preclinical-cancer biology studies. The 23 papers involved 158 estimated effects. Only 46 percent replicated, and the median effect size was 85 percent smaller than originally estimated.
In 2016 a team led by Colin Camerer, a behavioral economist at Caltech, attempted to replicate 18 experimental economics papers published in two top economics journals. Only 11 were successfully replicated. In 2018 another Camerer-led team reported that it had attempted to replicate 21 experimental social-science studies published in Nature and Science and found only 13 continued to be statistically significant and in the same direction with fresh data.
An interesting side study was done while Nosek’s Reproducibility Project was underway. Approximately two months before 44 of the replication studies were scheduled to be completed, auction markets were set up for researchers in the field of psychology to bet on whether each replication would be successful. People doing the studies were not allowed to participate. The final market prices indicated that researchers believed that these papers had, on average, slightly more than a 50-percent chance of a successful replication. Even that dismal expectation turned out to be overly optimistic: Only 16 of the 41 studies that were completed on time replicated. The skepticism that psychology researchers have for work in their field is sobering — and justified.
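Putting the replication counts quoted above side by side makes the pattern easier to see. The short sketch below simply restates those figures as rates; the labels and the helper `replication_rate` are illustrative. The cancer-biology project is left out because only its percentage, not its count of replicated effects, is given, and the rows are not pooled because the prediction-market studies are a subset of the psychology project.

```python
# (successful replications, attempts), as quoted in the text above
replications = {
    "Psychology, Reproducibility Project (2015)": (36, 100),
    "Experimental economics (2016)": (11, 18),
    "Social science in Nature and Science (2018)": (13, 21),
    "Psychology studies in the betting markets": (16, 41),
}


def replication_rate(successes, attempts):
    """Share of attempted replications that succeeded."""
    return successes / attempts


for label, (successes, attempts) in replications.items():
    rate = replication_rate(successes, attempts)
    print(f"{label}: {successes}/{attempts} = {rate:.0%}")
```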
There are several ways to alleviate the replication crisis and restore the luster of science. Here are four of the most promising directions.
1. The first step for slowing the p-hacking/HARKing express is for researchers to recognize the seriousness of the problem. In 2017, Joseph Simmons, Leif Nelson, and Uri Simonsohn wrote:
We knew many researchers — including ourselves — who readily admitted to dropping dependent variables, conditions, or participants so as to achieve significance. Everyone knew it was wrong, but they thought it was wrong the way it’s wrong to jaywalk. … Simulations revealed it was wrong the way it’s wrong to rob a bank.
Michael Inzlicht, a professor of psychology at the University of Toronto, spoke for many but not all when he wrote:
I want social psychology to change. But, the only way we can really change is if we reckon with our past, coming clean that we erred; and erred badly. … Our problems are not small and they will not be remedied by small fixes. Our problems are systemic and they are at the core of how we conduct our science.
Statistics courses in all disciplines should include substantial discussion of p-hacking and HARKing.
2. A direct way to fight p-hacking and HARKing is to eliminate the incentive by removing statistical significance as a hurdle for publication. P-values can help us assess the extent to which chance might explain empirical results, but they should not be the primary measure of a model’s success. Artificial thresholds like p < 0.05 encourage unsound practices.
3. Peer review is often cursory. Compensating reviewers for thorough reviews might help screen out flawed research.
4. Replication tests need replicators, and would-be replicators need incentives. Highly skilled researchers are generally enmeshed in their own work and have little reason to spend their time trying to replicate other people’s research. One alternative is to make a replication study of an important paper a prerequisite for a Ph.D. or other degree in an empirical field. Such a requirement would allow students to see firsthand how research is done and would also generate thousands of replication tests.
None of these steps are easy, but they are all worth trying.