The science crisis

It has been called “galling” and “worse than we thought.” It risks demolishing the technocratic case for “expertise.” Nearly two decades into science’s replication crisis, have scholars, researchers, and funding agencies learned anything?

Like a plumbing nuisance-turned-emergency, the replication crisis emerged in dribs and drabs before gushing violently into the public’s consciousness. As early as the 1950s, Democratic Sen. Estes Kefauver was holding congressional hearings on “the sorry state of science supporting drug effectiveness,” a critique that led to stricter FDA requirements. The year 1977 saw the publication of Michael J. Mahoney’s landmark study exposing confirmation bias in the peer-review process. (“Reviewers were strongly biased against manuscripts which reported results contrary to their theoretical perspective.”)

Yet, despite these early warnings, the true clarion call was sounded in 2005 with the appearance of John P. A. Ioannidis’s “Why Most Published Research Findings Are False.” A shock treatise that has since approached 3 million views on PLOS Medicine’s open-access website, the physician-scientist’s paper argued that “in modern [epidemiological] research, false findings may be the majority or even the vast majority of published research claims.” Ioannidis and others had begun to notice that a number of famous and influential experiments could not be rerun with similar results. The consequence, as the PLOS Medicine paper tersely declared, was a scientific establishment rife with “confusion and disappointment.”

Though Ioannidis was ostensibly writing for an audience of specialists, evidence of an irreproducibility virus could not be kept from the public forever, especially given the bug’s simultaneous presence in disparate fields. By the time Perspectives on Psychological Science was running a 2012 special issue on replicability in that discipline, news of the emergency was only a year or two away from appearing in major American and European newspapers.

And appear it did. Between 2014 and 2021, the New York Times, Washington Post, and Times (London) alone ran no fewer than 23 pieces considering the crisis in whole or in part. Vox, the Atlantic, NPR, and Fox News had their say, as well. Perhaps sensing that the public’s disdain for elite institutions had reached an inflection point, the professional organizations themselves leaped into action, producing a series of articles intended to address the confusion head-on. For the American Psychological Association, communicating through its in-house magazine, one seeming priority was the deflection of attention away from psychologists specifically. (“Reproducibility is a concern throughout science,” it insisted in 2015.) The Association of American Medical Colleges, meanwhile, took to the digital digest AAMCNews to declare that “there is no evidence to suggest that irreproducibility is caused by scientific misconduct.”

Whatever their actual explanation, the failures that had dragged the hard and social sciences under the public’s microscope were stark indeed. According to the Reproducibility Project, a crowdsourced enterprise launched in 2011 under the leadership of University of Virginia psychologist Brian Nosek, an attempt to replicate 100 key studies from three years prior resulted in a success rate of only 39%. Similarly distressing was the work of three Bayer scientists, that same year, examining reproducibility in oncology, women’s health, and cardiovascular disease. As stated in analyses eventually published in Nature Reviews Drug Discovery, the Bayer team was unable to replicate nearly two-thirds of the external studies under review.

Perhaps because its findings are easily regurgitated as popular news bites, the field of behavioral economics took a particularly hard fall in the years after the reproducibility crisis struck. “Priming,” the foundational theory behind subliminal advertising, was called into question in 2012 when a team of researchers could not replicate the concept’s most famous study, in which participants exposed to old-age stereotypes walked more slowly upon exiting a lab. “Loss aversion,” the well-known idea that individuals weigh losses more heavily than equivalent gains, suffered a similar fate in 2018 when an article in Psychological Research alleged outright misconduct in previous experiments. Among the discipline’s gravest failures has been the collapse of implicit bias theory, which holds that closet racists will struggle to pair black and brown faces with words such as “good” in laboratory experiments. An obvious example of pseudo-scientific quackery, IBT was shown, in 2017, to suffer from “low test-retest reliability,” another way of saying that the same individual, tested twice, will often receive markedly different scores.

The consequence of these and related discoveries has been a classic internecine feud, in which scholars have argued among themselves about how, and whether, the replication crisis ought to be addressed. For standpatters, including the authors of an oft-cited 2015 article in American Psychologist, the problem is very likely reducible to the “low statistical power [of] single replication studies” — we ought not to dismiss original findings until multiple replication attempts have failed. Others in this camp point to the notion that investigative mistakes tend eventually to be exposed through existing processes. No systemic reforms are thus necessary.

For still others, including Daniele Fanelli of the London School of Economics and Political Science, the true predicament is not the collapse of reproducibility itself but the “narrative of crisis” that has arisen in recent years. According to Fanelli’s 2018 article in the peer-reviewed Proceedings of the National Academy of Sciences, reproducibility issues are “not distorting the majority of the literature, in science as a whole as well as within any given discipline.” Additionally, “scientific misconduct and questionable research practices occur at frequencies that, while nonnegligible, are relatively small and therefore unlikely to have a major impact.”

Arrayed across the field from these crisis naysayers is a considerably larger army of scholars for whom the replication dilemma is rather more serious business. A 2016 survey by Nature, for example, found that 90% of scientist respondents believed that a “slight” or “significant” crisis was at hand. A full 70% of those surveyed had “tried and failed to reproduce another scientist’s experiments,” the journal reported.

Accompanying this general sentiment has been an outpouring of peer-reviewed scholarship attempting to describe and address the problem’s root causes. One popular theory, aired by UC San Diego’s Harold Pashler and others, holds that publication bias, the habit of circulating only positive findings, is a major culprit. The Reproducibility Project’s Nosek has suggested that academic norms and incentives might themselves be to blame, writing in 2012 that “to the extent that publishing itself is rewarded, then it is in scientists’ personal interests to publish, regardless of whether the published findings are true.”

Beneath these easily digestible suppositions lies a series of more technical theories that require some explanation. “P-hacking” (the “p” stands for the probability that a result at least as extreme would arise by chance alone) occurs when a researcher conducts many similar tests, then selectively reports only those results that rise to the level of “significance.” (An amusing online example involves a spurious link between M&M consumption and baldness.) “Null hypothesis significance testing,” the default practice in nearly all biomedical and psychological research, allows a scientist to search for statistical deviations without developing a precise hypothesis first. Despite the fact that the latter has been controversial since at least the 1960s, and the former is flatly unethical, both practices are a part of academic science as it is actually conducted. One needn’t be a specialist to see how “false positives” might arise from such behavior.
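The arithmetic behind that last point is easy to demonstrate. The following sketch (a hypothetical simulation, not drawn from any of the studies discussed here) assumes a researcher measuring 20 unrelated outcomes in a world where no real effect exists; reporting whichever comparison happens to clear the p < 0.05 bar inflates the false-positive rate far above the nominal 5%.

```python
# Simulating p-hacking under a true null: comparing a "honest" researcher,
# who runs one pre-specified test, with one who tries 20 outcomes and
# keeps whichever turns out "significant."
import math
import random
import statistics

random.seed(42)

def t_test_p(a, b):
    """Approximate two-sided Welch t-test p-value, using a normal
    approximation (adequate for groups of 30)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_study(n_outcomes):
    """True if any of n_outcomes null comparisons reaches p < 0.05."""
    for _ in range(n_outcomes):
        control = [random.gauss(0, 1) for _ in range(30)]
        treated = [random.gauss(0, 1) for _ in range(30)]  # no real effect
        if t_test_p(control, treated) < 0.05:
            return True
    return False

trials = 2000
honest = sum(one_study(1) for _ in range(trials)) / trials
hacked = sum(one_study(20) for _ in range(trials)) / trials

print(f"false-positive rate, one pre-specified test: {honest:.1%}")
print(f"false-positive rate, best of 20 tests:       {hacked:.1%}")
```

The honest rate stays close to the nominal 5%, while the best-of-20 strategy produces a "finding" in roughly 1 − 0.95²⁰ ≈ 64% of studies — which is why pre-specifying a single hypothesis matters.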

As for what has been done in the 17 years since John Ioannidis threw down his methodological gauntlet, opinion here, as elsewhere, is decidedly mixed. Speaking with the National Institutes of Health in recent days, I was assured that the agency “requires grant recipients to address rigor in [their] applications and as part of [their] annual progress reports.” While the National Science Foundation declined to provide a quote for the record, representatives did direct me to a stylishly produced report funded by the NSF. Opening that document (and studying the NIH’s online materials), one finds ample guidance on how to conduct effective research but far less clarity concerning how specific funding practices have changed. This gap between guidance and practice aligns squarely with what Harold Pashler told me in an email in early April: “People like NIH director Francis Collins regularly reassure Congress that they are aware of and focused on this issue, but they haven’t done nearly as much as they could have to promote replicable research.”

Where practices have begun to evolve, slowly but surely, is in the rules and norms that govern the article submission process. One such development is the increasing use of “pre-registration,” which requires scholars to share their research plans online before conducting studies. Employed correctly, such a prerequisite can do much to eliminate p-hacking and may even, in the long run, help correct the publication bias in favor of positive results. Yet even pre-registration, as currently performed, is unlikely to be a silver bullet. When I asked Brian Nosek whether researchers are actually altering their designs based on internet feedback, his answer was cautious. For Registered Reports, “a special case of pre-registration” in which proposed methodologies are submitted for peer review before the experiment or study actually commences, changes are adopted in almost every instance. For regular pre-registration, “this rarely occurs.”

Examining both peer-reviewed and popular sources, one finds still more possibilities for reform. In a 2015 article in Frontiers in Psychology, the University of Oxford’s Jim Everett and Brian Earp proposed requiring Ph.D. students to conduct replication attempts as part of their training. A recent piece in the Guardian, meanwhile, suggested that research might be published on the basis of methodology alone, irrespective of results. (To be fair, the same article also called for the total abolition of scientific papers.) Whatever improvements are forthcoming, they are likely to be accompanied by a steady stream of further bad news, at least in the near future. To name just one of the horrifying discoveries made in recent months, a meta-study published in Science Advances found that unreplicable studies in top psychology and economics journals are cited more frequently than experiments that replicate. Furthermore, “only 12% of post-replication citations of nonreplicable findings acknowledge the replication failure.”

As has been widely remarked, the reproducibility crisis is not mere inside baseball but a matter of some urgency for a liberal order under fire from both the Left and Right. Until actual science gets its house in order, hysterical worship of “The Science” will remain exactly what it is today: an implausible posture that only emboldens those who would tear down America’s institutions. In this sense, the replication disaster has more than a little in common with the COVID adventurism practiced by previously respected organizations such as the Centers for Disease Control and Prevention. Even as experts themselves have begun to flounder openly, leftist paeans to expertise in the abstract have grown ever more shrill. Something, eventually, will have to give.

As with the public health establishment’s COVID response, what is needed to address the replication crisis is not only a new set of protocols but a marked uptick in professional humility. It may well be the case that scholars and researchers are commencing that long journey. But they have not yet arrived.

Graham Hillard is managing editor of the James G. Martin Center for Academic Renewal.
