This one’s a request! My friend Jesse suggested I do a bit of a dive into box plots, and the timing couldn’t be better. You’ll recall from yesterday that a stone’s clarity looked like a predictable way to make assumptions about price per carat. Take the same data and compose it into the same graph as a box plot (price per carat vs. carat, split by clarity) and check it out:
– Let’s be honest: box plots are not sexy.
– They are also not super intuitive. A box plot (or, more formally, a box and whisker plot) represents a few pieces of data: the box spans the first to third quartiles of a data set, with the center line marking the median (aka the second quartile – TIL!). Each whisker extends from its hinge (the edge of the box) to the most extreme value that is still within 1.5 * IQR of that hinge, where IQR is the inter-quartile range, or the distance between the first and third quartiles. If you’re still reading, here’s the Wikipedia page. Any data points that fall outside of the whiskers are officially outliers, and are marked with points.
– What this does illustrate is that while the rightmost clarity does indeed reach for the stars when it comes to price per carat, its stones are not reliably expensive. In fact, its median price per carat is not seriously different from that of the other clarities.
– The real workhorse in terms of price per carat turns out to be the second clarity from the left, which has the highest median value.
– Here’s the same graph overlaid with the tiny blue points from yesterday’s graph.
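To make the box-and-whisker arithmetic from the list above concrete, here is a minimal sketch in base R. The vector `x` is made-up toy data (not from the diamonds set), and the variable names are only for illustration. It uses R's default `quantile()` method, so other tools may place the quartiles slightly differently.

```r
# Hypothetical toy data, not taken from the diamonds data set
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)

# First quartile, median (second quartile) and third quartile
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]  # inter-quartile range: the height of the box

# Whiskers may reach at most 1.5 * IQR beyond each hinge
lower_limit <- q[[1]] - 1.5 * iqr
upper_limit <- q[[3]] + 1.5 * iqr

# Each whisker stops at the most extreme observation inside those limits
inside <- x[x >= lower_limit & x <= upper_limit]
whiskers <- range(inside)

# Anything beyond the limits is drawn as an individual outlier point
outliers <- x[x < lower_limit | x > upper_limit]

print(q)
print(whiskers)
print(outliers)
```

With this toy vector the single large value ends up outside the whiskers, which is exactly the kind of point the plot marks on its own.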
> library(ggplot2)
> boxplot <- ggplot(diamonds, aes(carat, price/carat)) + geom_boxplot()
> boxplot + facet_grid(. ~ clarity)
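If you would rather check the medians numerically than eyeball the center lines, something like the following should work. It is a rough sketch that assumes the `diamonds` data set shipped with ggplot2; `ppc` and `medians` are names made up for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Price per carat for every stone
ppc <- diamonds$price / diamonds$carat

# Median price per carat within each clarity grade
medians <- tapply(ppc, diamonds$clarity, median)

# Highest median first, to see which clarity leads
print(sort(medians, decreasing = TRUE))
```

The clarity listed first in the sorted output is the one whose box has the highest center line in the plot.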