R – Nest Cafe ~ Cafe Nido

This post title plays on Nathan Vass’s book, but doesn’t have to do with Seattle bus transportation!

I’ve been studying how to use R programming for data analysis. We learned about the “geometry” layer of creating graphs using a neat function called ggplot(), from the ggplot2 package. (In case you wondered like me, ‘gg’ stands for Grammar of Graphics, which is a book written about data visualization. Nothing to do with Gigi’s nor gee’s). Compared to using graphing functions like matplot() (“matrix plot”) that come standard in “base R”, graphing tools from packages created by others tend to take care of a lot of details that would be cumbersome to set manually.

We took a set of 562 American movies spanning many genres from 2007-2011, their Rotten Tomatoes ratings (from critics and public audience) and budget in dollars. The following shows how Critic versus Audience ratings compared, by genre (color) and budget (line thickness & dot size, millions of dollars).

Movies lines dots — ggplot: Critic vs Audience Movie Ratings, by genre and budget, 2007-2011. Created using R programming.

Isn’t this graph crazy? Like rainbow paint was splashed everywhere. This left me wondering what is the point of having a graph with lines overlaid above dots like this? It’s hard to distinguish between line strokes and dots, not to mention its not at all necessary to ‘connect the dots’ with geom_line’s.

That’s when I started creating a different graph. I have this banana plant that is unfurling long, luscious green leaves, each one bigger than the next. I counted each leaf (number 1 being the first/oldest, and smallest leaf), and measured the length in centimeters. Ultimately I’d like to create a predictive model (comparing regression versus machine learning) to forecast how long the next leaf will be. Anyways, seeing this graph, this is a case where having lines and dots isn’t so chaotic, and actually aesthetically makes sense.

Banano Line and Dots — Banana plant leaf length. Created using R programming.

The lesson here is that combining both geometric points and lines – geom_line() + geom_point() is better suited when:

Dots and their connecting lines are distinct colors
Lines are not so thick that the dots are hard to distinguish
Generally speaking, that it makes sense to have a reason to draw lines (to emphasize trends, for example).

None of these were the case with the movies graph – so while it may be worthy of being displayed in an art museum, the simpler Banana Growth graph makes a much better use of ggplot geometric dots and lines.

Remember ‘box and whisker’ plots? Along the same vein as ‘stem and leaf’, these are a type of plot learned in school to show the distribution or spread of data, but I’d be stunned if you, dear reader, have actually made use of these outside of math class. Please comment if you have had the honor!

I’ve done a number of statistical analyses for school and tutoring, but I’ve yet to encounter a situation where a good old box and whisker rises as the best contender to display data.

Nonetheless, I learned to make one by coding in R today.
We’re working with a set of data showing the popularity of google-searching the term “entrepreneur” by state. The data has been standardized so that a state with an ‘entrepreneur’ value of 0 is at the mean; a positive value means ‘entrepreneur’ was googled relatively more than average. A negative value means that ‘entrepreneur’ was googled relatively less than average.

Out of the box, the boxplot() function is quite bare.

05_02 Out of the Box Boxplot — Out of the Box boxplot()

No titles, no labels! Those are premium services!
Curiously, the outlier ‘circle’ on the right, around 2.55, is actually TWO data points but they overlap and appear as one. I discovered this only upon summoning some descriptive statistics in the console viewer using:

boxplot.stats(df$entrepreneur)

So, let’s:
– Label the descriptive statistics (min, max, quartiles) and outlier
– Make the labels a fun color while at it

# Boxplot with Quartile Labels

boxplot(df$entrepreneur,
notch = F, horizontal = T,
main = “Distribution of Googling ‘Entrepreneur’ by State”,
xlab = “Standard Deviations”)

text(x = fivenum(df$entrepreneur),
labels = fivenum(df$entrepreneur),
y = 1.25,
col = “#990066”)

05_02 Boxplot Googling Entrepreneur by State and Color Quartile Labels

Concluding here, and next steps:
– It looks like the fivenum() five number summary considers the outliers as the max, so the max whisker is not labelled.
– I’d like to label the outlier with the State Abbreviations (any guesses?)

Thanks for reading!

* The outliers are DE and UT! Surprised?

Tag: R

The Lines and Dots that Make Us

Box and Whiskers