Tag: R

STILL Visualizing the Support Driven Survey

I have been away from the blog for a bit – during the time I normally spend blogging and thinking about blogging, I’ve instead been getting to know a new tool for my R toolbox, a web app platform for R called Shiny.

Those of you who have been around for a while are familiar with my bizarre love of the intersection of information and design that data visualization represents, especially given that I am neither a statistician nor an artist.

(The heart wants what the heart wants!)

As demonstrated previously on this blog, I learn best by doing (hence my 30 day visualization sprint, wherein I took a deep dive into the R library ggplot2) – so after going through the Shiny tutorial, I gave it a try and pushed live my first-ever web app: a super rudimentary, user-adjustable visualization of the recent Support Driven compensation survey.

So, here’s a link. Check it out. I would genuinely and with a full heart appreciate your feedback. 
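(If you’re curious what a Shiny app actually looks like under the hood, it’s just a UI definition plus a server function. Here’s a minimal sketch – the data frame, inputs, and plot below are hypothetical stand-ins, not the actual survey app.)

library(shiny)
library(ggplot2)

# Hypothetical stand-in for the survey data
survey <- data.frame(
  salary = rnorm(200, mean = 55000, sd = 12000),
  experience = sample(1:15, 200, replace = TRUE)
)

# The UI: a slider the user can adjust, and a plot that reacts to it
ui <- fluidPage(
  titlePanel("Survey Explorer (sketch)"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("histPlot"))
  )
)

# The server: re-renders the histogram whenever the slider moves
server <- function(input, output) {
  output$histPlot <- renderPlot({
    ggplot(survey, aes(x = salary)) +
      geom_histogram(bins = input$bins, fill = "steelblue") +
      theme_bw()
  })
}

shinyApp(ui = ui, server = server)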

Munging NASA’s Open Meteor Data

In snooping around the US Government’s open data sets a few months back, I found out that NASA has an entire web site dedicated to their publicly available data: https://data.nasa.gov/

Surely, you understand why that would excite me!

I dug around a bit and pulled out some information on meteor landings in the United States – tons of detail for each landing, including mass and date.

To simplify the data set and make things tidy for R, I wrote a quick Python script to strip out some columns and clean up the dates. Here’s the gist if you want to have a go at the data as well.
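The script itself was Python, but for the R-inclined, the same cleanup might be sketched like this – the file name, column names, and date format here are assumptions about NASA’s CSV, so check them against the actual download:

# Assumed file and column names for NASA's meteorite landings CSV
meteors <- read.csv("Meteorite_Landings.csv", stringsAsFactors = FALSE)

# Keep only the columns of interest
# (read.csv mangles a header like "mass (g)" into "mass..g.")
meteors <- meteors[, c("name", "mass..g.", "year")]
names(meteors) <- c("name", "mass", "year")

# The raw dates look like "01/01/1880 12:00:00 AM" (assumed); keep just the year
meteors$year <- as.numeric(substr(meteors$year, 7, 10))

# Drop rows missing a mass or a year
meteors <- meteors[complete.cases(meteors), ]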

I ended up looking for a trend between date and meteor mass, hoping for obvious cycles or other interesting patterns, but a few super-massive meteors shoved the data into pretty uninteresting visualizations, which is too bad.

We can do some simpler stuff, even with some super-massive meteors. For instance, here’s a log(mass) histogram of all of the meteors:

[Figure: log(mass) histogram of all of the meteors]

Check it out! It results in a somewhat normal, slightly right-skewed distribution. That means we can use inferential statistics on it, although I am not sure why you would want to! The R code is a super quick ggplot2 script.
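The gist isn’t embedded here, but the script amounts to something like this (assuming a cleaned-up data frame like the one sketched above):

library(ggplot2)

# Log-transform the mass so the super-massive outliers don't flatten everything
ggplot(meteors, aes(x = log(mass))) +
  geom_histogram(fill = "steelblue") +
  xlab("log(mass)") +
  ylab("Count") +
  theme_bw()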

It’s pretty amazing how easily we can access so, so much information. The trouble is figuring out how to use it in an actionable and simply explained way. The above histogram is accurate, and looks pretty (steelblue, the preferred default color of data folks everywhere), but it isn’t actually helpful in any way.

Just because we can transform a dense .csv into a readable chart doesn’t mean it’s going to be useful.

30 Day Challenge Post-Mortem


I’ll admit up front that working with R every day for thirty days, producing a new visualization every day, was both harder and easier than I thought it was going to be.

There were days when I felt like I was on fire – I’d find an interesting thread and produce four or five days of visualizations all at once. There were also days when it felt like a real drag, just trying to find something that looked even a little interesting.

There is some debate on the internet about whether a thirty day time period is sufficient to make something a habit – I can’t really speak to that, as creating a habit wasn’t the goal. The goal was to become familiar with a particular R library (ggplot2), and I think that goal has certainly been accomplished.

I really liked this format – thirty days is long enough to feel possible, with the finish line always in sight, but it still requires discipline and buy-in. As a way to jump-start a new skill, we’ll have to see how it holds up farther down the line, but I certainly feel about a hundred times more comfortable with ggplot2 than I did when I started the whole thing.

I’d recommend this format to folks looking to mix up their personal development. The hardest part is choosing an activity that will stay interesting and challenging for thirty straight days, without picking something so large that it becomes onerous or needlessly stressful.

I had considered, for instance, trying a new statistical analysis every day for thirty days. That would probably have been too large a bite for me, and I would have really struggled to pull it off.

Now, the only question remaining is: what should my next challenge be?

DVC Day 30: EL FIN

(This Post is part of my 30 day Data Visualization Challenge – you can follow along using the ‘challenge’ tag!)

[Figure: number of posts published per day of the challenge]

Thoughts:

– Boom.
– Longer post-mortem in the works about both this challenge in particular and doing a thing every day for 30 days in general.
– Thanks for tuning in 🙂

Code:

library(ggplot2)

# dh is assumed to hold one row per challenge day:
# `day` (1-30) and `n` (posts published that day)
qplot(day, n, data = dh) +
  scale_y_continuous(limits = c(0, 3)) +
  geom_smooth() +
  ylab("Number of Posts Published") +
  xlab("Day of the Challenge") +
  theme_bw()