To P-Value or Not to P-Value - That is the Question

For nearly a century, the p-value has been the gold standard of statistical testing. Whether the goal is to determine if a specific result is significant or to decide if a study is publishable, the science and business communities have used p-values as a main criterion. If the p-value is less than 0.05, we reject the null and conclude that something is going on. If the p-value is greater than 0.05, we fail to reject the null and conclude that there’s nothing to see here; move along.

However, over the past few years, more and more disciplines have been questioning the validity of the p-value. For example, some psychology journals have stopped using the p-value as a criterion for publication, and at least one, Basic and Applied Social Psychology, has banned its use entirely.

The p-value doesn’t give any indication of how important the results are (that is, it doesn’t measure the magnitude of the effect); it doesn’t even tell you how likely it is that the results are due to more than random chance. All a p-value communicates is how likely a result at least this extreme would be IF the phenomenon under review WASN’T there. If you find this confusing, you’re not alone; it’s a very peculiar way of looking at a question.

Let’s take a concrete example. Imagine scientists wanted to find a connection between jelly beans and cancer. They could collect a lot of data about people’s jelly bean consumption habits and the incidence of cancer, and then perform statistical analysis to see if there’s a relationship. Well, spurious correlations are abundant in the real world, so we’d expect at least a slight connection, just by random chance.

The question is: are the patterns we’re seeing in the data GREATER than what we’d expect by random chance? If there were no correlation between jelly beans and cancer, each sample would give a slightly different result, but they’d all be pretty close to showing no relationship. At a certain point, a result would be far enough from showing no relationship that we’d say “Hey, it’s REALLY unlikely that we’d see this if there were no relationship; so there probably is one.”

Usually, our threshold for REALLY unlikely is 0.05. If the null hypothesis were true (if there were no relationship between jelly beans and cancer), we’d only see results this extreme 5% of the time. We consider that unusual enough that we could say, hey! There’s something going on here.
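One way to make “how often would chance alone produce a result this extreme” concrete is a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as large as the observed one. The sketch below is only an illustration. The function name `permutation_p_value`, the two groups, and every number in them are invented for this example and are not real jelly bean or health data.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling mimics the null hypothesis: group membership is irrelevant
        rng.shuffle(pooled)
        fake_a, fake_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(fake_a) / len(fake_a) - sum(fake_b) / len(fake_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical health scores (made-up numbers, for illustration only)
jelly_bean_eaters = [52, 48, 55, 61, 47, 58, 50, 53]
non_eaters = [49, 51, 46, 54, 50, 45, 52, 48]

p = permutation_p_value(jelly_bean_eaters, non_eaters)
print(f"Estimated p-value: {p:.3f}")
print("Below the 0.05 threshold" if p < 0.05 else "Not below the 0.05 threshold")
```

The returned number is exactly the quantity described above: the share of “no relationship” worlds in which the data look at least as extreme as what we actually observed.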

Source: xkcd

This means that if we run 20 studies where nothing is going on, we’d expect, on average, one of them to come out statistically significant at p<.05 just by chance. Let’s say we do 20 studies, and three of them end up significant. Roughly speaking, about one of the three is likely due to chance, and the other two reflect an actual phenomenon. However, we have no way to identify which one is the fluke. Furthermore, a result of p=.04999 and a result of p=.05001 are virtually identical, but one is “significant” and the other is not.
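The 1-in-20 figure can be checked with a small simulation. The sketch below runs many studies in which the null hypothesis is true by construction (both groups come from the same distribution) and counts how often p falls below 0.05. It assumes normally distributed data with a known standard deviation of 1, so that a simple z-test applies, and it assumes the studies are independent; the function names and sample sizes are illustrative choices.

```python
import random
from statistics import NormalDist, mean


def null_study_p_value(rng, n_per_group=30):
    """Simulate one study with no real effect and return its two-sided p-value."""
    # Both groups are drawn from the same distribution, so the null is true
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    # z-test for a difference in means, using the known standard deviation of 1
    standard_error = (2 / n_per_group) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_studies=4000, alpha=0.05, seed=1):
    """Fraction of no-effect studies that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(null_study_p_value(rng) < alpha for _ in range(n_studies))
    return hits / n_studies


def chance_of_at_least_one(n_studies=20, alpha=0.05):
    """Probability that at least one of n independent null studies is significant."""
    return 1 - (1 - alpha) ** n_studies


print(f"Simulated false positive rate: {false_positive_rate():.3f}")
print(f"Chance of at least one false positive in 20 studies: {chance_of_at_least_one():.2f}")
```

The simulated rate should land close to 0.05. The second function adds a related point: under the same independence assumption, the chance that at least one of 20 no-effect studies crosses the threshold is well over half.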

This doesn’t mean p-values are worthless. But it does mean that researchers (and consumers of research) need to be thoughtful when interpreting them. A p-value by itself doesn’t tell you much, and simply knowing that a result is “significant” tells you even less. More useful is an estimate of the effect size, or a review of multiple studies looking at the same phenomenon.
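To see why an effect size adds information that a p-value lacks, the sketch below simulates two very large groups whose true means differ by only 0.05 standard deviations. It reports Cohen’s d (a common standardized effect size) next to a p-value from a large-sample normal approximation. The data, the sample size, and the helper names are all assumptions made for illustration.

```python
import random
from statistics import NormalDist


def mean_and_variance(xs):
    """Return the sample mean and the unbiased sample variance."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, var


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    pooled_var = ((len(a) - 1) * var_a + (len(b) - 1) * var_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / pooled_var ** 0.5


def approx_p_value(a, b):
    """Two-sided p-value from a normal approximation (reasonable for large samples)."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    z = (mean_a - mean_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(42)
n = 50_000  # very large groups
treated = [rng.gauss(0.05, 1) for _ in range(n)]  # tiny true effect
control = [rng.gauss(0.00, 1) for _ in range(n)]

print(f"p-value:   {approx_p_value(treated, control):.2e}")
print(f"Cohen's d: {cohens_d(treated, control):.3f}")
```

With groups this large, the p-value comes out far below 0.05 even though the effect is tiny, which is why “significant” on its own says little about whether a result matters.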

To learn more, check out this great post from PLOS.

About the Author

Jenny is a member of the HBX Course Delivery Team and currently works on the Business Analytics course for the Credential of Readiness (CORe) program, and supports the development of a new course in Management for the HBX platform. Jenny holds a BFA in theater from New York University and a PhD in Social Psychology from University of Massachusetts at Amherst. She is active in the greater Boston arts and theater community, and she enjoys solving and creating diabolically difficult word puzzles.
