Three ways that data may be deceiving you

That pretty graph you just saw? It may not be as trustworthy as you think.

We live in a modern world where data reigns supreme, and “evidence” in the form of numbers is widely accepted as truth. Data visualizations, even on contentious topics such as COVID-19, are now widely available to the public. Chartr sends a weekly newsletter communicating news updates purely through charts and graphs, and the subreddit r/dataisbeautiful has 15.9 million members and is growing.

The problem, however, is that data sets are not infallible. In particular, the presentation of data in the form of visualizations should not be treated as undeniable truth. Numbers can lie too, and poorly designed visualizations can easily mislead readers.

Take, for example, the popular Information is Beautiful website of COVID-related data visualizations. The very first visualization states, “The majority of infections are mild,” showing a bar chart in which 80.9% of cases are “Mild,” 13.8% “Severe,” and 4.7% “Critical.” One way this graph can mislead is by inviting the base-rate fallacy. A casual viewer of this chart may immediately think, “I have an 80.9% probability of experiencing mild symptoms if I get COVID.” But that interpretation ignores several prior probabilities that must be factored in.
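To see why the headline percentage is not a personal probability, here is a minimal sketch with entirely made-up risk groups and severity rates (none of these numbers come from the chart or any real data set): the aggregate figure is just a weighted average over groups whose individual rates can look very different.

```python
# A minimal sketch, using made-up numbers, of why a population-wide figure like
# "80.9% mild" is an average over very different groups, not any one person's odds.

# Hypothetical risk groups: (share of cases, P(mild), P(severe), P(critical)).
# These values are invented purely to show the arithmetic.
groups = {
    "lower-risk":  (0.80, 0.90, 0.08, 0.02),
    "higher-risk": (0.20, 0.46, 0.36, 0.18),
}

# The headline number is a weighted average across groups (law of total probability).
p_mild_overall = sum(share * p_mild for share, p_mild, _, _ in groups.values())
print(f"Population-wide P(mild) = {p_mild_overall:.1%}")  # ~81% with these made-up inputs

# But the probability for any individual depends on which group they belong to.
for name, (share, p_mild, _, p_critical) in groups.items():
    print(f"{name}: P(mild) = {p_mild:.0%}, P(critical) = {p_critical:.0%}")
```

The aggregate only looks like a single probability because it averages over a breakdown the chart never shows.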

If you have underlying conditions, for instance, your base-rate probability of experiencing severe or critical symptoms is higher, and so your actual probability of experiencing critical symptoms from COVID-19 is much higher than the headline figure suggests.

Second, there’s the problem of omitted variables, also known as the endogeneity problem. Data visualizations are particularly tricky because there are only so many variables you can capture in a single image before it gets too complicated. As a result, some variables are left out, and if those variables are important to understanding the data, the result is ripe ground for misinterpretation. Consider that we know COVID-19 affects age groups differently. Leaving age out of the picture (literally) masks the fact that older individuals have a much higher probability of experiencing critical symptoms from COVID-19. Sure, age may be a fairly obvious omitted variable, but one can easily imagine how many less obvious omitted variables might change or even reverse the direction of an effect.
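As a toy illustration of how an omitted variable can flip a conclusion, here is a short sketch with invented case counts (the regions, age cutoff, and numbers are all hypothetical): one region looks far worse in the aggregate, yet has the lower critical rate within every age group.

```python
# Toy example (invented counts) of how leaving out a variable such as age can
# reverse the apparent direction of an effect -- a case of Simpson's paradox.

# Hypothetical (critical cases, total cases) for two regions, split by age group.
cases = {
    "Region A": {"under 60": (10, 1000), "60+": (20, 100)},
    "Region B": {"under 60": (1, 200),   "60+": (150, 1000)},
}

for region, groups in cases.items():
    critical = sum(c for c, _ in groups.values())
    total = sum(t for _, t in groups.values())
    by_age = ", ".join(f"{age}: {c / t:.1%}" for age, (c, t) in groups.items())
    print(f"{region}: overall {critical / total:.1%} critical ({by_age})")

# Region A looks better overall (~2.7% vs. ~12.6%), yet Region B has the lower
# critical rate within each age group: the aggregate comparison tells the
# opposite story from the age-adjusted one.
```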

Finally, here’s another example on a completely unrelated topic. During the 2020 election, the Financial Times published a visualization depicting mail-in vs. in-person voting with the headline “Trump supporters are much more likely to say they will vote in person rather than by mail.” The newspaper’s bar chart had two horizontal bars, one for Trump/Lean Trump voters and one for Biden/Lean Biden voters, each broken down into how many voters in that category said they would vote in person vs. by mail.

This presents an example of a third problem, one of scale. The two horizontal bars aren’t on the same “scale,” at least not in terms of the number of voters. Fifty percent of Trump/Lean Trump voters means something completely different from 51% of Biden/Lean Biden voters, and this chart doesn’t tell us how many voters are in each group. Given what we now know about the actual vote numbers, this chart would more accurately state that 37.1 million Trump voters plan to vote in person, while 41.5 million Biden voters plan to vote by mail. Because the two bars sit on different scales, we’re led to believe that far more people will vote by mail for Biden than for Trump when, in actuality, the numbers could be much closer than 51% to 25%.
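To make the scale issue concrete, here is a minimal sketch that keeps the chart’s survey shares but plugs in entirely hypothetical group sizes (the totals below are invented for illustration and are not the FT’s or the election’s actual figures): the same percentage bars can imply very different, or nearly identical, absolute counts depending on how big each group is.

```python
# Minimal sketch of the scale problem: a percentage only becomes a count of
# voters once you know the size of each group, which the chart does not show.
# All group sizes below are invented for illustration.

shares_by_mail = {"Trump / Lean Trump": 0.25, "Biden / Lean Biden": 0.51}

# Two hypothetical scenarios for group sizes, in millions of voters.
scenarios = {
    "similar group sizes":        {"Trump / Lean Trump": 75,  "Biden / Lean Biden": 80},
    "very different group sizes": {"Trump / Lean Trump": 110, "Biden / Lean Biden": 55},
}

for label, sizes in scenarios.items():
    counts = {g: shares_by_mail[g] * sizes[g] for g in shares_by_mail}
    summary = ", ".join(f"{g}: ~{n:.0f}M by mail" for g, n in counts.items())
    print(f"{label} -> {summary}")

# With similar group sizes the 51% bar really does mean far more mail voters;
# with very different sizes, 25% and 51% can translate into almost the same count.
```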

To be clear, I don’t believe that the people who make these charts and graphs are purposely trying to mislead others. Data visualization is hard, and there are a lot of complicated principles that go into it. We need to remember that no graph is perfect, and even casual readers should think critically and ask difficult questions before they accept a piece of data-based “evidence” as truth. The next time you see an article provide a pretty graph in support of an argument, start by asking at least these three questions: What is the base rate? Are there any omitted variables? And is everything on the same scale?

Steven Zhou (@szzhou4) is a Ph.D. student in industrial-organizational psychology at George Mason University, where he researches leadership, personality, and psychometrics. He previously worked in HR data analytics at a large international consumer services start-up and in college student affairs.
