Box Plot

A boxplot, or box-and-whisker plot, is a graph that shows the variability, or dispersion, of numerical data by depicting each group through its quartiles. It can tell you whether your data has outliers and what their values are. It can also tell you whether your data is symmetrical, how tightly it is grouped, and whether and in which direction it is skewed.

The visualization of data through a boxplot is based on a five-number summary: “minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”.

The makeup of a boxplot

median (50th Percentile): the middle value of the dataset.

first quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.

third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.

interquartile range (IQR): the distance between the 25th and the 75th percentile, that is, Q3 - Q1.

whiskers: the two lines outside the box that extend to the highest and lowest observations that are not outliers, that is, observations lying within the “minimum” and “maximum”.

outlier: a data point that lies outside the overall pattern of a distribution (below the “minimum” and/or above the “maximum”).

“maximum”: Q3 + 1.5*IQR

“minimum”: Q1 - 1.5*IQR
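
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name box_plot_stats and the sample data are hypothetical, and it assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data (statistical software uses several slightly different quartile conventions, so results may differ from other tools).

```python
from statistics import median


def box_plot_stats(values):
    """Return the five-number summary, the IQR, and the outliers of a sample."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")

    # Assumption: Q1 and Q3 are the medians of the lower and upper halves.
    # When n is odd, the overall median is left out of both halves.
    lower = data[: n // 2]
    upper = data[(n + 1) // 2:]
    q1, med, q3 = median(lower), median(data), median(upper)

    iqr = q3 - q1
    minimum = q1 - 1.5 * iqr  # the "minimum" defined in the text
    maximum = q3 + 1.5 * iqr  # the "maximum" defined in the text
    outliers = [x for x in data if x < minimum or x > maximum]

    return {
        "minimum": minimum,
        "q1": q1,
        "median": med,
        "q3": q3,
        "maximum": maximum,
        "iqr": iqr,
        "outliers": outliers,
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

For this sample, Q1 is 3 and Q3 is 8, so the IQR is 5, the “minimum” is -4.5, and the “maximum” is 15.5. The value 30 lies above the “maximum” and is reported as an outlier.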

In a box plot, we draw a box from the first quartile to the third quartile. A line goes through the box at the median (a vertical line when the plot is laid out horizontally). The whiskers go from each quartile to the “minimum” or “maximum”.
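
To make the drawing steps concrete, here is a rough sketch that renders a horizontal box plot as one line of text. The function name ascii_box_plot, the symbols, and the sample data are all hypothetical choices for illustration. It assumes quartiles are the medians of the lower and upper halves, and that each whisker stops at the most extreme observation that still lies within the “minimum” and “maximum”, which is a common convention; points beyond that are drawn separately as outliers.

```python
from statistics import median


def ascii_box_plot(values, width=30):
    """Draw a horizontal box plot as a single line of text."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")

    # Assumption: quartiles are the medians of the lower and upper halves.
    q1 = median(data[: n // 2])
    med = median(data)
    q3 = median(data[(n + 1) // 2:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]

    lo, hi = data[0], data[-1]
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal

    def col(x):
        # Map a data value to a character column.
        return round((x - lo) / span * (width - 1))

    row = [" "] * width
    for c in range(col(inside[0]), col(inside[-1]) + 1):
        row[c] = "-"  # whiskers
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="  # box from Q1 to Q3
    row[col(inside[0])] = "|"   # end of the lower whisker
    row[col(inside[-1])] = "|"  # end of the upper whisker
    row[col(med)] = "#"         # median line
    for x in outliers:
        row[col(x)] = "o"       # outliers drawn as individual points
    return "".join(row)


picture = ascii_box_plot([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])
print(picture)
```

In the printed line, the box of = characters spans Q1 to Q3, the # marks the median, and the outlier 30 shows up as a lone o at the right edge, well away from the upper whisker.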

In addition to the traditional square box plots, box plots in other shapes (e.g. violin plots, bean plots, notched box plots) exist to answer different needs. Sometimes an organically shaped box plot is more effective in showing the range and tendency of the data. Adding colors, or points (called jitters) that represent each individual data point, is also common.
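
As a small illustration of how jitters are typically produced, the sketch below gives every observation a small random offset along the axis that carries no data, so that identical values no longer sit on top of each other. The function name jitter and its parameters (center, spread, seed) are hypothetical and do not come from any particular plotting library.

```python
import random


def jitter(values, center=1.0, spread=0.1, seed=0):
    """Pair each value with a slightly randomized position around `center`."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(center + rng.uniform(-spread, spread), v) for v in values]


# Three identical observations would overlap without jitter.
points = jitter([4.1, 4.1, 4.1, 5.0, 5.0])
for position, value in points:
    print(f"{position:.3f}  {value}")
```

The data values themselves are left unchanged; only the position across the width of the box is randomized. A larger spread separates the points more, but can make them bleed into a neighboring box.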

Good examples:

The use of contrasting colors enables direct visual comparison between the two data groups. Outliers are laid out clearly. The organization of the boxplot makes it easily understandable.
The highlighted jitters translate the raw data very directly and effectively. 
The use of colors allows for direct comparisons between the variables. Its horizontal composition adds to the clarity. The footnote clearly explains the chart.

Bad examples:

The use of jitters here interferes with reading the boxplot as they are highly similar in range across the samples. Plus, the minimums and maximums are not marked. 
The width differences between the boxes are unnecessary and instead create a confusing hierarchy.
The information included in different boxplots is inconsistent. Some have only the box while others have whiskers or outliers. Plus, there are too many overlapping marks in overly small increments.