A Correlation Paradox
Paradoxes pop up pretty often in quantitative research, more often than most of us realize. They can be wonderful things: surprising, puzzling, even funny. And they are often instructive. I discovered the correlation paradox when studying trends in state support for public higher education. My colleagues and I were discussing the best way to demonstrate trends in our data, and someone said, “How about a correlation between state income and higher education spending?” I said, “Sure, let me call up the data and see what we get.” A minute later, when the answer appeared on the screen, one colleague (who didn’t like me) beamed maliciously; he said, “That’s impossible; as we all know blah, blah, blah.” The smartest among us said nothing, but she did utter a barely audible “harrumph.” The project director said, playfully I think, “So Paul, remind me again, how much are we paying you?” The statistical result I projected onto the screen, the correlation I calculated, seemed impossible. Everybody thought I must have made an error. And I did too. But it wasn’t an error. It was a paradox. To describe it, I’ll sketch in some background information.
A correlation is a co-relation. It is a measure of the association between two things (called variables). The correlation is a number (called a coefficient) whose size ranges from zero to 1. When there is no relation between the two things, the number is zero. When the relation is perfect, the coefficient is 1. An example of a strong correlation might be the ages and heights of the children in a pediatrician’s practice. The older children would usually be taller; the younger would usually be shorter. There might be some exceptions, but a strong general pattern would probably hold. The correlation might be around 0.75, not perfect (which would be 1.0), but pretty strong. That’s because when age goes up, so too does height. When two variables, such as age and height, move in the same direction, this is called a direct relation or a positive correlation.
Sometimes correlations are “inverse.” When one thing goes up, the other goes down. For example, when the price of gasoline goes up, sales of gas guzzlers go down. The numbers used to describe the relation between cost of gas and sales of gas hogs are the same in size: they range from 0 to 1. The only difference is that with an inverse relation the number is negative. It might be around −0.75. Since the number used to describe them is negative, such inverse relations are often called “negative correlations.” That is not because there is anything undesirable about them. The negative number is just a way to describe an inverse relation. For example, the more people exercise, the less likely they are to have heart attacks. Because the amount of exercise and the number of heart attacks go in opposite directions, they are negatively correlated, but more exercise and fewer heart attacks are “positive” things.
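For readers who like to see the arithmetic, here is a minimal sketch of how such a coefficient can be computed. The numbers are invented for illustration (they do not come from any real pediatric practice or car market), and the helper name pearson and all variable names are assumptions of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How far the two variables move away from their means together.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


# Hypothetical data: children's ages (years) and heights (cm).
ages = [2, 4, 6, 8, 10, 12]
heights = [90, 100, 118, 125, 141, 147]

# Hypothetical data: gasoline price (dollars) and gas-guzzler sales (units).
gas_prices = [2.5, 3.0, 3.5, 4.0, 4.5]
guzzler_sales = [120, 112, 95, 97, 70]

r_age_height = pearson(ages, heights)            # direct relation
r_price_sales = pearson(gas_prices, guzzler_sales)  # inverse relation

print(f"age vs. height:  {r_age_height:+.2f}")
print(f"price vs. sales: {r_price_sales:+.2f}")
```

The first coefficient comes out positive and the second negative; the sign tells you the direction of the relation, and the size tells you how strong it is.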
These are very well-known statistical facts. My students almost always get them right on exams. So where does the paradox described above come in? It occurs with trend data. When studying trend data, variables that are moving in opposite directions over time do not always, as they “should,” have a negative correlation. This ought to be impossible by definition: the correlation coefficient describing the association between two variables moving in opposite directions must be negative. However, when looking at the trend data for two variables that are clearly moving in opposite directions, sometimes the correlation is positive, not negative. That is the paradox. A paradox is something that is very surprising, because it is impossible by definition or by logic, but actually seems to be true.
I was shocked (and embarrassed) the first time I encountered a positive correlation between variables that I knew were moving in opposite directions. My surprised colleagues and I were working on a project studying correlations between trends in states’ incomes (Gross State Product, or GSP) and state appropriations for public higher education. In many states, the trend over the years in state income was clearly upward, but the trend in state spending on higher education was just as clearly downward over the decades we studied. But the correlation between the two variables really was positive, not negative, despite the fact that they were moving in opposite directions.
My colleagues were very surprised. They thought I had made some stupid mistake. And I even thought I must have messed up. I recalculated the correlations. I made sure I had entered the data correctly. I double-checked to make sure I had pointed at and clicked the right options in the software. I even calculated the correlations by hand, for the first time in many years. But the correlation would not change. The correlation between two variables moving in opposite directions remained positive.
So what happened? What caused the paradoxical results? Paradoxes like this are not unknown. The two most familiar among statisticians are Simpson’s Paradox and the related Lord’s Paradox. Naturally I named mine “Vogt’s Paradox” (we do so love to name things after ourselves) and I described it briefly in my Dictionary of Statistics and Methodology.[1] The paradox is not much of a mystery once you know the source of the anomaly. Like the Simpson and Lord paradoxes, the correlation paradox has a fairly simple explanation. It can occur when change scores (the year-to-year changes) rather than absolute values are used to calculate the correlations. In this case, what was happening is that when a state’s income went up, so too did its expenditure on higher education, but not by as much. When a state’s income went down, spending on higher education went down too, but by more. In most individual years the two variables (income and expenditure) moved in the same direction; that accounts for the positive correlation. But the magnitudes of the movements were such that over time the two got further and further apart. Each year that the two increased, income increased more. And each year that the two decreased, expenditures decreased more. That accounted for the overall inverse trend.
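The mechanism can be sketched with a small invented data set. The figures below are made up purely for illustration and are not the state data from the project; the starting value of 100, the helper names pearson and levels_from_changes, and all variable names are assumptions of this sketch. In every year the two changes have the same sign, but income rises by more in good years and spending falls by more in bad years.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


def levels_from_changes(start, changes):
    """Turn a starting value and yearly changes into yearly levels."""
    levels = [start]
    for change in changes:
        levels.append(levels[-1] + change)
    return levels


# Hypothetical yearly changes (change scores). The signs always match:
# both go up in good years, both go down in bad years.
income_changes = [3, 4, -1, 3, 5, -2, 4, 3, -1, 4]
spending_changes = [1, 1, -5, 1, 2, -6, 1, 1, -5, 1]

# Absolute values over time, both starting from an arbitrary 100.
income_levels = levels_from_changes(100, income_changes)
spending_levels = levels_from_changes(100, spending_changes)

r_changes = pearson(income_changes, spending_changes)
r_levels = pearson(income_levels, spending_levels)

print("income levels:  ", income_levels)
print("spending levels:", spending_levels)
print(f"correlation of change scores: {r_changes:+.2f}")
print(f"correlation of levels:        {r_levels:+.2f}")
```

On these made-up figures income drifts upward and spending drifts downward, so the levels are negatively correlated, yet the change scores are positively correlated because the two series move the same way in each individual year.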
As with other paradoxes, explaining it in words often doesn’t help. It can be hard to see what is happening until you look at some worked-out examples. For Simpson’s Paradox, the Wikipedia entry has some good ones. If you want to see a small set of example data illustrating Vogt’s Paradox, send me an e-mail (wpaulvogt@gmail.com) and I’ll reply with a short document containing an illustration.