Tag: data

DVC Day 20: Hotter Heat Map

(This Post is part of my 30 day Data Visualization Challenge – you can follow along using the ‘challenge’ tag!)

After spending some more time thinking about heat maps, and reading the way that a few other R practitioners approach them (huge thanks to R-bloggers), I noodled around a bit with our earlier heat map to produce this, which I think is a bit more readable, and also provides more visually appetizing data bites:

[Screenshot: the reworked correlation heat map]

Thoughts:
– Using only one triangle of the correlation matrix means we don’t repeat data, and it makes it a bit easier to pick out what you’re seeing.
– It’s interesting to see that depth and table have nearly no impact on price, and are in fact negatively correlated with one another – making negative correlations more visually apparent is a good move, I think, as it was tough to tell “uncorrelated” from “negatively correlated” in the last iteration of the heat map.

Code:

> library(ggplot2)
> library(reshape2)
> dnum <- diamonds[c(1,5:10)]
> dnum <- sapply(dnum, as.numeric)
> dcor <- round(cor(dnum), 2)
> # Blank out the lower triangle so only the upper triangle (and diagonal) remains
> get_upper_tri <- function(cormat){
+ cormat[lower.tri(cormat)] <- NA
+ return(cormat)
+ }
> dcor <- get_upper_tri(dcor)
> melted_dcor <- melt(dcor)
> ggplot(data = melted_dcor, aes(Var2, Var1, fill=value)) + geom_tile(color="white") + scale_fill_gradient2(low="blue", high="orange", midpoint=0, limits=c(-1,1)) + theme_minimal()
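
For anyone curious what the triangle trick is actually doing, here's a minimal sketch on a made-up 3x3 matrix (the values and the names m, upper_only and long are mine, purely for illustration). It uses base R's as.data.frame(as.table(...)) as a stand-in for reshape2's melt so it runs without extra packages, and it drops the NA cells, which is an optional extra step the code above doesn't take.

```r
# Toy 3x3 correlation-style matrix (values are invented for illustration).
m <- matrix(c( 1.0, 0.8, -0.3,
               0.8, 1.0,  0.1,
              -0.3, 0.1,  1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# lower.tri() gives a logical mask of the cells strictly below the diagonal.
# Setting those cells to NA leaves the diagonal and the upper triangle intact.
upper_only <- m
upper_only[lower.tri(upper_only)] <- NA

# Base R stand-in for reshape2::melt(): one row per matrix cell.
long <- as.data.frame(as.table(upper_only), responseName = "value")

# Drop the blanked-out half so it never reaches the plot.
long <- long[!is.na(long$value), ]
print(long)
```

Of the nine cells, three sit below the diagonal, so six rows survive. With reshape2 itself, melt(dcor, na.rm = TRUE) should have the same effect.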

DVC Day 19: The World’s Smallest Violin

Another way we can approach the overplotting problem (especially now that we recognize how useful but not-sexy box plots are) is a ggplot2 plot type called the violin plot – here’s the same data as with the box-and-whiskers, but in the shape of everyone’s second-favorite string instrument:

[Screenshot: violin plots of price per carat vs. carat, faceted by clarity]

Thoughts:
– Much sexier than box-and-whisker!
– Despite being a bit nicer to look at, things like the median, outliers, etc, are not quite as easy to distinguish.
– The space to the right of each of the graphs is distracting – in the future I would probably do a harder line around each individual clarity box.

Code:

> library(ggplot2)
> fiddle <- ggplot(diamonds, aes(carat, price/carat)) + geom_violin(alpha=.65, fill="blue") + facet_grid(. ~ clarity)
> fiddle

DVC Day 18: Heat Maps for Everybody!

It turns out heat map style representations of correlations are pretty easy in ggplot2 – here’s one that includes all of our roughly numerical values in the diamonds data set:

[Screenshot: correlation heat map of the diamonds sample]

Thoughts:
– This is so far the most pre-processing we’ve had to do during the challenge. First, grab a sample, then grab only the numbers-based columns, then convert them all into R-recognized numeric values, then create the correlation, then melt the table into a more heatmap friendly format, and then plot that data. Phew.
– The visualization itself is sort of neat, but it doesn’t really bring us any new insights. It’s kind of interesting to see that table and depth are not all that correlated. It makes some sense after reading this, but I’m not totally sure I understand, to be honest.
– I can see how a heatmap style correlation matrix like this would be very handy for more numerically-oriented data sets. I wonder if there’s any way to include non-numerical values in this type of visualization.

Code:

> library(ggplot2)
> library(reshape2)
> set.seed(1117)
> dsmall <- diamonds[sample(nrow(diamonds), 1000), ]
> dnum <- dsmall[c("carat", "clarity", "depth", "table", "price")]
> dnum <- sapply( dnum, as.numeric )
> dcor <- round(cor(dnum), 2)
> melted_dcor <- melt(dcor)
> ggplot(data=melted_dcor, aes(x=Var1, y=Var2, fill=value)) + geom_tile(color="white")
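
One step worth pausing on is what as.numeric does to clarity, since that column isn't really a number. Here's a small sketch using a hypothetical four-stone sample (the stones are invented; the level order is the one the diamonds data set uses, worst to best):

```r
# Hypothetical mini-sample of clarity grades, using the same level order
# as the diamonds data set (worst to best).
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")
clarity <- factor(c("SI2", "IF", "VS1", "I1"),
                  levels = clarity_levels, ordered = TRUE)

# as.numeric() on a factor returns the integer position of each level,
# not anything measured: I1 -> 1, SI2 -> 2, ..., IF -> 8.
codes <- as.numeric(clarity)
print(codes)
```

So the clarity row in the heat map treats the grades as evenly spaced steps from 1 to 8, which is an assumption rather than a fact about diamonds. If that feels too strong, cor(dnum, method = "spearman") only uses rank order and might be a gentler choice for ordered categories.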

DVC Day 17: Who you calling a box plot?

This one’s a request! My friend Jesse suggested I do a bit of a dive into box plots – and the timing couldn’t be better. You’ll recall from yesterday that a stone’s clarity looked like a predictable way to make assumptions about price per carat. Here’s the same data in the same layout (price per carat vs. carat, split by clarity), this time as box plots – check it out:

[Screenshot: box plots of price per carat vs. carat, faceted by clarity]

Thoughts:
– Let’s be honest: box plots are not sexy.
– They are also not super intuitive. A box plot (or, more formally, a box and whisker plot) represents a few pieces of data: the box itself represents the first, second and third quartiles of a data set, with the center line representing the median (aka the second quartile – TIL!). The top and bottom whiskers extend to the farthest data point that is still within 1.5 * IQR of the hinge (the edge of the box), where IQR is the inter-quartile range, or distance between the first and third quartiles. If you’re still reading, here’s the Wikipedia page. Any data points that fall outside of the whiskers are officially outliers, and are marked with points.
– What this does illustrate is that while the rightmost clarity does indeed reach for the stars when it comes to price per carat, the stones are not reliably massively priced – in fact the median price per carat is not seriously different from the other clarities.
– The real workhorse in terms of price per carat turns out to be the second from the left, with the highest median value.
Here’s the same graph with some tiny blue points from yesterday’s graph.

Code:

> library(ggplot2)
> boxplot <- ggplot(diamonds, aes(carat, price/carat)) + geom_boxplot()
> boxplot + facet_grid(. ~ clarity)
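
The whisker rule from the thoughts above is easier to believe once you work it out by hand. This is a rough sketch on made-up price-per-carat numbers (the vector ppc and every value in it are invented, with one deliberately silly stone at the end). It uses R's default quantile method, so the exact hinge values may differ a little from other tools.

```r
# Made-up price-per-carat values, with one deliberately extreme point.
ppc <- c(2100, 2500, 2800, 3000, 3200, 3600, 4100, 4400, 5000, 12000)

# First quartile, median (second quartile), third quartile: the box.
q <- quantile(ppc, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]   # inter-quartile range: height of the box

# The fences sit 1.5 * IQR beyond the edges of the box.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences;
# anything beyond the fences gets drawn as an outlier point.
whisker_low  <- min(ppc[ppc >= lower_fence])
whisker_high <- max(ppc[ppc <= upper_fence])
outliers     <- ppc[ppc < lower_fence | ppc > upper_fence]

print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```

Note that the whiskers end on real data points (2100 and 5000 here), not on the fences themselves, and the 12000 stone is the only one flagged as an outlier.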

DVC Day 16: Over Halfway

I am seriously spending a lot of time thinking about diamonds, you guys. One thing that I’ve asked myself is about value – or rather, price. Are there reliable ways to tell if a particular diamond will be more expensive? Are there certain clarities or colors with more outliers or deviations in general? Looking at you, J. Here’s one way to answer that question, by looking into price per carat mapped against carat weight, split up between clarities:

[Screenshot: scatter plot of price per carat vs. carat, faceted by clarity]

Thoughts:
– We can see that even though the leftmost diamonds tend to be the largest, the price paid per carat for them stays roughly equal as they increase in size.
– We can also see that as we progress to the right, there are fewer diamonds per set (probably; the dot density is a problem here), but the cost per carat climbs skyward, peaking at almost four times the per-carat price of the leftmost clarity.
– This doesn’t do a great job of representing deviation from the norm, though. There is probably a better way to represent that visually.

Code:

> library(ggplot2)
> qplot(carat, price/carat, data=diamonds, alpha=I(.25)) + facet_grid(. ~ clarity)
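
One non-visual way to get at the "deviation from the norm" question is to summarize each clarity with a median and a spread. Here's a tiny sketch of the idea on an invented stand-in for the diamonds data (the data frame toy and all of its values are made up; swapping in diamonds should work the same way, though I haven't run that here):

```r
# Toy stand-in for the diamonds data (values invented for illustration).
toy <- data.frame(
  clarity = c("SI2", "SI2", "SI2", "VS1", "VS1", "VS1", "IF", "IF", "IF"),
  carat   = c(1.0, 1.5, 2.0, 0.5, 1.0, 1.5, 0.25, 0.5, 1.0),
  price   = c(4000, 6300, 9000, 1500, 5000, 7200, 900, 2500, 12000)
)
toy$ppc <- toy$price / toy$carat   # price per carat

# Median and inter-quartile range of price per carat within each clarity.
med    <- tapply(toy$ppc, toy$clarity, median)
spread <- tapply(toy$ppc, toy$clarity, IQR)

summary_tbl <- data.frame(median_ppc = med, iqr_ppc = spread[names(med)])
print(summary_tbl)
```

In this toy data the IF group has a median close to the others but a much wider IQR, which is the kind of thing the dot cloud makes hard to see.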