Typically, when a researcher conducts a psychology experiment, they compute something called a ‘p value’. This statistic has become controversial, and yet critics face a paradox.
A p value is (typically) the probability of observing the experimental data, or more extreme data, assuming something called the ‘null hypothesis’ is true. To illustrate what the null hypothesis represents, it is best to picture a simple experiment. Imagine investigating the claim that drinking water during a mathematics test improves test performance. A sample of students is randomly allocated to one of two groups - a group that drinks water during a test and a group that completes the same test under the same conditions but without drinking water. The null hypothesis would be that drinking water makes no difference.
The experiment is completed and the water-drinking group scores an average of 13.8/20, while those without water score an average of 12.2/20. On the face of it, this appears to disprove the null hypothesis and suggests that drinking water improves performance. However, such a result could arise by chance. The experimenters could have randomly allocated a few more higher-performing students to the water-drinking group than to the other group, for example.
What the p value would tell us in this experiment is the probability of obtaining a difference in average scores as large as this or larger, given the assumption that the null hypothesis - drinking water makes no difference - is true. Typically, we look for p values of less than 5%. If the p value is less than 5%, we say the result is ‘statistically significant’. This would not definitively prove that drinking water makes a difference in maths tests, but it would make researchers take the idea seriously. If the experiment is repeated, ideally by a different team of researchers, and another p value of less than 5% is obtained, our data is increasingly at odds with the null hypothesis and consistent with the idea that drinking water improves maths test performance. We would say the result has been ‘replicated’.
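To make this concrete, below is a minimal sketch in Python of how such a p value could be computed with a permutation test. The individual scores are invented purely so that the group averages match the 13.8 and 12.2 above - the experiment is hypothetical and no real data is implied.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical per-student scores out of 20, invented for illustration only;
# they are constructed so the group means match 13.8 and 12.2.
water    = np.array([16, 14, 12, 15, 13, 14, 11, 15, 14, 14])   # mean 13.8
no_water = np.array([13, 11, 12, 14, 10, 13, 12, 13, 11, 13])   # mean 12.2

observed_diff = water.mean() - no_water.mean()

# Permutation test: if the null hypothesis is true (water makes no difference),
# the group labels are arbitrary, so we can shuffle them and count how often a
# difference at least as large as the observed one arises by chance alone.
combined = np.concatenate([water, no_water])
n_water = len(water)
n_permutations = 100_000

count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(combined)
    diff = shuffled[:n_water].mean() - shuffled[n_water:].mean()
    if diff >= observed_diff:          # one-sided: "water improves scores"
        count += 1

p_value = count / n_permutations
print(f"Observed difference: {observed_diff:.2f}")
print(f"One-sided permutation p value: {p_value:.3f}")
```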
So, what’s not to like about p values? It appears that there is plenty. Oddly, we seem to have reached something approaching a consensus about the flawed nature of p values while the vast majority of researchers continue to use them, a contradiction that should at least give us pause for thought and is at the root of the paradox.
Some of the main criticisms are:
Researchers are confused about what the p value represents and assume it is the probability that the null hypothesis is true. It does not actually tell us anything about that probability, which has to be evaluated separately. For instance, if the null hypothesis were that teaching children ‘philosophy’ would have no effect on reading and maths scores, we may consider that null hypothesis highly likely and so, even if we found a p value of less than 5%, this may not be sufficient to cause us to reject it (a simulation sketch after this list illustrates the distinction).
Researchers ‘hack’ p values. For instance, imagine we conducted the experiment about drinking water during a mathematics test but, instead of just comparing water drinkers with those without access to water, we further divided the sample into males versus females and low versus high socioeconomic status (SES). We now have three attempts at calculating a p value - overall, sex, SES. The probability that we will obtain at least one ‘significant result’, even if the null hypothesis is true, will now be greater than 5% (another sketch after this list illustrates this inflation). If all the data is reported, this may not be an issue, but if nonsignificant results are not reported - e.g. if the data is sliced six ways but only two slices are reported - we are given a distorted view of the evidence.
p value calculations assume that test subjects are randomly allocated to conditions, but most experiments fail to live up to this standard. So, the p value gives us the probability of observing the experimental data, or more extreme data, assuming both that the null hypothesis is true and that the subjects have been randomly allocated. If we have reason to doubt this second assumption, then we cannot interpret the calculated p value.
p values do not give any indication of the size of an effect and so an effect may be ‘statistically significant’ but practically meaningless.
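To illustrate the first criticism, here is a rough simulation sketch. The assumption that the null hypothesis is true for 90% of the hypotheses we test, and the assumed size of the effect when it is false, are both invented for the illustration; the point is simply that the proportion of ‘significant’ results for which the null hypothesis is actually true need not be anywhere near 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Assumption for illustration: the null hypothesis is true in 90% of the
# hypotheses we test; when it is false, the real effect is 0.5 standard
# deviations. Neither figure comes from the post.
prior_null = 0.9
true_effect = 0.5
n_studies = 20_000
n_per_group = 30

significant_total = 0
significant_and_null = 0

for _ in range(n_studies):
    null_is_true = rng.random() < prior_null
    shift = 0.0 if null_is_true else true_effect
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(shift, 1.0, n_per_group)
    p = stats.ttest_ind(treatment, control).pvalue
    if p < 0.05:
        significant_total += 1
        significant_and_null += null_is_true

# A p value below 5% does not mean the null has only a 5% chance of being true.
print(f"Share of 'significant' results where the null was in fact true: "
      f"{significant_and_null / significant_total:.2f}")
```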
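And here is a rough sketch of the ‘hacking’ criticism: in a simulated world where the null hypothesis is true, running an overall test plus two subgroup tests (a simplification of the sex and SES slicing described above) produces at least one ‘significant’ result noticeably more often than 5% of the time. The group sizes and score distribution are assumptions made purely for the simulation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

n_experiments = 10_000
n_per_group = 40          # assumed group size, for illustration only
alpha = 0.05
at_least_one_hit = 0

for _ in range(n_experiments):
    # Null world: water makes no difference, so both groups share one distribution.
    water = rng.normal(loc=12, scale=3, size=n_per_group)
    no_water = rng.normal(loc=12, scale=3, size=n_per_group)

    # Three looks at the same data: overall, plus two arbitrary subgroups.
    half = n_per_group // 2
    p_values = [
        stats.ttest_ind(water, no_water).pvalue,
        stats.ttest_ind(water[:half], no_water[:half]).pvalue,
        stats.ttest_ind(water[half:], no_water[half:]).pvalue,
    ]
    if min(p_values) < alpha:
        at_least_one_hit += 1

# This comes out comfortably above the nominal 5% false positive rate.
print(f"Share of null experiments with at least one 'significant' result: "
      f"{at_least_one_hit / n_experiments:.3f}")
```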
There are other criticisms. So why are we still clinging on to such a flawed measure?
The first point to make is that p values are not a universal measure to be used to the exclusion of all others and were never designed to answer certain questions. If you want to know the size of an effect, calculate an effect size. Effect sizes have their own problems, which I won’t go into now, except to suggest that many of these also arise from assuming that effect size is a universal measure to be used to the exclusion of all others.
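As a footnote, here is a minimal sketch of one common effect size, Cohen’s d, computed from the same invented scores used earlier. Cohen’s d is just one option among several; nothing here depends on that particular choice.

```python
import numpy as np

# Cohen's d: the difference in group means divided by the pooled standard
# deviation, using the same hypothetical scores as in the earlier sketch.
water    = np.array([16, 14, 12, 15, 13, 14, 11, 15, 14, 14])
no_water = np.array([13, 11, 12, 14, 10, 13, 12, 13, 11, 13])

mean_diff = water.mean() - no_water.mean()
n1, n2 = len(water), len(no_water)
pooled_sd = np.sqrt(((n1 - 1) * water.var(ddof=1) + (n2 - 1) * no_water.var(ddof=1))
                    / (n1 + n2 - 2))

cohens_d = mean_diff / pooled_sd
print(f"Cohen's d: {cohens_d:.2f}")
```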
Yet this is perhaps a more trivial point than the one that leads to the paradox.