News Feature: Data Gaps

Issue #69

Data, Numbers

On 21 February 2024 Gary Smith wrote a Retraction Watch blog post, “How (not) to deal with missing data: An economist’s take on a controversial study”.¹ Smith offered a brief history of the origin of 5% as the cutoff p-value for statistically significant results, based on the story of the British statistician Ronald Fisher’s tea-tasting experiment with Muriel Bristol a century ago: “When she got all eight [answers] correct, Fisher calculated the probability that a random guesser would do so as well – which works out to 1.4%. …He soon recognized that the results of agricultural experiments could be gauged in the same way – by the probability that random variation would generate the observed outcomes. If this probability (the P-value) is sufficiently low, the results might be deemed statistically significant. How low? Fisher recommended we use a 5% cutoff and ‘ignore entirely all results which fail to reach this level.’”¹
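The 1.4% figure follows from simple counting. In Fisher’s design there were eight cups, four prepared milk-first and four tea-first, and Bristol knew the split; a random guesser picks which four cups to call “milk-first”, and only one of the C(8,4) = 70 possible selections is entirely correct. A quick check:

```python
from math import comb

# Eight cups, four of each preparation; a guesser selects 4 cups
# to label "milk-first". Exactly 1 of the C(8, 4) selections is
# fully correct, so that is the chance of a perfect score by luck.
p_value = 1 / comb(8, 4)
print(f"{p_value:.1%}")  # prints 1.4%
```

This matches the 1.4% Smith cites, comfortably below the 5% threshold Fisher would later recommend.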

What makes this story interesting is that it demystifies the origin of the 5% p-value by showing that it was, in a sense, an arbitrary number, as professional statisticians know. Nonetheless, many publishers have made it a necessary goal, which tempts authors to do whatever it takes to reach it. There are many ways to improve a p-value, one of which is to include more values that fit the desired range. This happened in a recent paper by Almas Heshmati and Mike Tsionas “…on green innovations in 27 countries during the years 1990 through 2018.”¹ When questioned about apparent gaps and inconsistencies in the data, the primary author “…revealed in an email that these gaps had been filled by using Excel’s autofill function…” He justified his use of “imputed” data: “In an email exchange with Retraction Watch, Heshmati said, ‘If we do not use imputation, such data is [sic] almost useless.’”¹

As Smith notes, there are cases where “imputation sometimes seems reasonable.” He describes an example involving population statistics in a stable setting with little change, where imputing missing values could make sense. But mechanically filling in numbers to reach the magic 5% cutoff for publication is hard to justify, even though that cutoff was an arbitrary number to begin with.
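To illustrate the benign case Smith describes, here is a minimal sketch (with hypothetical numbers, not data from the paper) of linear interpolation: a gap in a slowly changing series is filled by drawing a straight line between its known neighbors, which is roughly what a spreadsheet’s fill feature does. The method is only defensible when the series really is stable; applied mechanically to noisy or trending data, it manufactures observations.

```python
def interpolate_gaps(values):
    """Fill runs of None by linear interpolation between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# Hypothetical annual population figures (millions) with two missing years.
population = [50.1, 50.3, None, None, 50.9]
print([round(v, 1) for v in interpolate_gaps(population)])
# prints [50.1, 50.3, 50.5, 50.7, 50.9]
```

The interpolated values are plausible precisely because the surrounding series changes smoothly; nothing in the procedure checks that assumption, which is why imputation done solely to reach a significance threshold is a different matter.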


1: Smith, Gary N. ‘How (Not) to Deal with Missing Data: An Economist’s Take on a Controversial Study’. Retraction Watch (blog), 21 February 2024.

