Tuesday, November 29, 2016

Bar charts must start at zero (or something)

The other day I was mentioning to a colleague that a bar chart should start at zero, and I may have given the impression that it was just my personal taste. It is not. It is a universal standard in statistical visualisation. However, since it is very easy to get distracted or to misinterpret it, I'll summarise it here very quickly, and include some links below if you want to read more.
  1. It is OK to set the Y axis to whatever range you want (the min and max of the data are good choices). However, in that case do not use bar charts. Use mean-and-error plots, boxplots, lines, scatterplots, violin plots, etc.
  2. Bar charts start at the bottom of the plot for a reason: you want to compare the heights of the bars easily. Our eyes try to find the proportion of increase or decrease between them. For example, if your Y axis starts at a higher value, a 0.1% increase may make one bar look twice as tall as the other.
  3. The Y axis doesn't always need to start at zero, but it must start at a value that makes sense as a baseline. For example, if you are comparing ratios, one may be a better baseline, and for a time series the baseline can be the value at day one (as in climate change graphs). The idea is that there are "natural" baseline values, such that the proportions between bars are meaningful.
  4. In all cases, ask yourself whether a bar chart is really necessary, and whether you can justify it against the other options. This may help you pay attention to the interpretation of the baseline.
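The arithmetic behind point 2 can be sketched in a few lines of Python. The function name `apparent_ratio` and the counts are made up for illustration; the only assumption is that a bar is drawn from the baseline up to its value, so its drawn height is the value minus the baseline.

```python
def apparent_ratio(smaller, larger, baseline=0.0):
    """Ratio of the drawn bar heights when the bars start at `baseline`."""
    if baseline >= smaller:
        raise ValueError("baseline must be below the smaller value")
    return (larger - baseline) / (smaller - baseline)


# Hypothetical counts: 1001 is a 0.1% increase over 1000.
honest = apparent_ratio(1000, 1001)                   # bars start at zero
truncated = apparent_ratio(1000, 1001, baseline=999)  # axis starts at 999

print(f"zero baseline: second bar is {honest:.3f}x the first")
print(f"baseline 999:  second bar is {truncated:.1f}x the first")
```

With the axis starting at 999 the second bar is drawn twice as tall as the first, although the underlying increase is still 0.1%.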
I believe that the same rules apply to an impulse plot or a stem plot -- like the ones used in ACF plots -- but I am not so sure. Anyway, here are some links:

It’s OK not to start your y-axis at zero — Quartz
Of course column and bar charts should always have zeroed axes, since that is the only way for the visualization to accurately represent the data. Bar and column charts rely on bars that stretch to zero to accurately mirror the ratios between data points. Truncating the axis breaks the relationship between the size of the rectangle and the value of the data. There is no debating this one (except for a few exceptions).
(The exception linked above is about a ratio, where the natural baseline is one instead of zero.)

Bar Chart Baselines Start at Zero | FlowingData
The main argument for bar charts without a zero baseline is this: There’s no point in extending the range of the value axis if the range of the data never includes zero. Ok.
Now instead of weight, let’s look at height. I think we can agree that it’s difficult to find people who are zero inches tall. I’m 70 inches tall, and my son is half my height at 35 inches. The bar chart on the left shows the comparison with a zero baseline, and as expected, the bar for me is twice the length of the bar for my son. On the right, I take it to the extreme and set the baseline to 35 inches, and the bar for my son disappears.


Maybe the latter communicates that I’m much taller, but the magnitude is infinitely exaggerated.
(The link above also provides an elegant solution if you insist on using barplots: explicitly model your baseline such that you can compare all values to it.)
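One way to read that suggestion, sketched in Python: subtract a chosen baseline from every value and plot the differences, so that the bars start at zero again and zero means "equal to the baseline". The function name, the labels and the choice of baseline are illustrative assumptions, not taken from the linked post; only the two heights come from its example.

```python
def differences_from_baseline(values, baseline):
    """Return each value minus the baseline, keyed by label."""
    return {label: value - baseline for label, value in values.items()}


# Heights in inches from the quoted example; the baseline is the son's height.
heights = {"father": 70, "son": 35}
deltas = differences_from_baseline(heights, baseline=35)

for label, delta in deltas.items():
    print(f"{label}: {delta:+d} inches relative to the baseline")
```

A bar chart of `deltas`, with the axis labelled as a difference from 35 inches, keeps the comparison honest: a bar twice as long then really does mean twice the difference.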

Kick the bar chart habit : Nature Methods : Nature Research
Instead of comparing to an abstract zero level, scientists often compare multiple experimental samples to one another. Because the samples are usually generated from populations with a potentially large and irregular underlying variation, graphing their means using bar charts misleadingly assigns importance to the distance of the means from zero and poorly represents the distribution of the data used to calculate the means. Instead of bar charts, mean-and-error plots and box plots should be used for statistical sample data.

Zero is zero - Statistical Modeling, Causal Inference, and Social Science
The idea is that the area of the bar represents “how many” or “how much.” The bar has to go down to 0 for that to work. You don’t have to have your y-axis go to zero, but if you want the axis to go anywhere else, don’t use a bar graph, use a line graph. Usually line graphs are better anyway. I’m sure this is all in a book somewhere.

10 Ways Charts Can Lie, Cheat & Lead Astray | QVDesign
Here we’re looking at exactly the same basic dataset covering Visitor numbers per year, the chart on the left seems very placid and visitor numbers appear to be relatively steady year on year; not bad but not great, whilst the chart on the right looks much more positive; visitor numbers took a huge leap in 2010; cue extra investment and pay rises all round! The charts show identical data so what’s going on?

The ONLY difference between these 2 charts is that the one on the left has the Properties > Axis > ‘Forced 0’ checked for the Y-Axis whilst the other doesn’t; one check box and we’re completely changing the message of this chart. Now of course if you spend a little time looking at this chart and read off the numbers it becomes clear that the increase isn’t all that marked, however; (...)  To accentuate the effect even further you can use the ‘Static Min/Max’ settings to zero in even more on the difference making the smaller value seem even smaller and the increase all the larger.
