Inspired by the statistics classes I’m currently teaching, and a conversation I recently had in the pub with some colleagues (because I’m just that exciting), I’ve been wondering why p < 0.05 is the most common threshold for statistical significance, at least in the psychological sciences. I realised that the choice of threshold was probably arbitrary to a certain extent, but I thought that maybe it was at least a useful arbitrary value for whatever purpose p values were first used for. I had been teaching about t-tests, so they were on my mind. I knew that Student’s t-test was created by William Gosset to help with quality control at the Guinness brewery (the brewery forced him to publish under a pseudonym – Student – to conceal from competitors that they were using statistics). Perhaps a false positive rate of 1 in 20 was considered a reasonable error rate in brewery quality control? Apparently not…
The 0.05 threshold, or indeed any fixed threshold, doesn’t seem to have arisen with Gosset. P values certainly pre-date Gosset and the t-test anyway, but the publication of his tables of the t-statistic (or rather, what he referred to as the z-statistic), and the tables of his colleague Pearson’s χ² distribution, provided precise p values to 4 decimal places for a given value of t or χ². Instead, our fixation on p < 0.05 seems to be at least in part due to the friction between Pearson and another statistician, R.A. Fisher. Fisher had developed further statistical tests and wanted to reproduce Gosset’s tables. However, permission was refused because of financial disputes over copyright and theoretical disagreements between Pearson and Fisher, so Fisher had to re-create the tables himself. In doing so, Fisher rearranged the data: instead of providing exact p values for a given value of t, he provided t values for given values of p.
Although it is apparently “a matter of historical fact that Fisher was the first to have published tables in this form”, there is evidence pre-dating Fisher and Pearson that p values were considered an indication of findings worth further interest, and the threshold of interest was usually around 0.05. Warnings about the overuse of significance thresholds were also surfacing as early as 1919, six years before Fisher’s tables. So it seems unfair to lay the blame for our p-value obsession at Fisher’s door, but the publication and widespread use of his tables in a form that focused on round p values seems to have helped reinforce the habit. Fisher doesn’t appear to have recommended the use of absolute thresholds of significance; he considered p values above 0.2 to be indicative of no effect, but values between 0.05 and 0.2 to suggest that an effect might be detectable with sufficient modification of the experiment. Most of his tables reflected this: they provided values of several test statistics for a range of p values. However, when he produced tables for his newly introduced F statistic, values were given only for p = 0.05, for simplicity. Although later versions expanded to include other p values, people seem to have latched on to 0.05 as an important value.
Perhaps because the tables opened up the arcane world of statistics to a wider audience, or maybe because of some historical tendency towards 1 in 20 as an intuitive compromise between sensitivity and false positives, Fisher’s tables seem to have left us with the one thing that everyone who knows anything about statistics ‘knows’. Maybe if Fisher and Pearson had been on better terms, undergraduate statistics might have looked very different…
Clauser, B. (2009). War, enmity, and statistical tables. CHANCE, 21(4), 6-11. DOI: 10.1007/s00144-008-0004-8
Stigler, S. (2009). Fisher and the 5% level. CHANCE, 21(4), 12. DOI: 10.1007/s00144-008-0033-3
Also, see Gerard Dallal’s article Why P = 0.05? for more detail, or if you can’t access the papers.