TidyTuesday Tidbits from Animal Crossing

There is a popular video game captivating folks at home these days called “Animal Crossing”. When I first heard this, I imagined animals crossing a road to get to the other side, like the old game Frog Xing or Frogger. Alas, the game has nothing to do with crossing busy traffic roads!

“Animal Crossing” is a Japanese game whose original title (Doubutsu No Mori どうぶつの森) translates to something more like “Animal Forest”. From what I gather, it’s a SIMs-like game where you build up an island, interacting with both virtual (in-game) and real characters (actual people). My friend was totally unfazed when I showed her a picture of fish whose face only a mother could love (aka a Sculpin) because she’d grown accustomed to catching valuable tarantulas by night and selling them by day for the game. A rather strange-sounding way to make money…

Every Tuesday, some wonderful people share an interesting data set that becomes a source of data analysis. It’s a great way to practice “data wrangling” – that is, filtering, cleaning, and taking care of problematic data in a quest to tame it, understand it, and find some interesting insights. It’s called #TidyTuesday and comes with a fresh dataset, background information, and article that makes use of the data. This past Tuesday the data just so happened to pertain to Animal Crossing!

The “Tidy” aspect of “TidyTuesday” refers to ‘cleaning’ or ‘tidying’ the data. In R Programming, there’s a package called the “Tidyverse” that comes with many tools and enhanced functions great for tidying data. Why on Tuesday? I’m not sure, but it does add a spark of excitement to this day of the week for me. #TidyFriday or #TidyThursday wouldn’t sound bad, either.

This week’s Animal Crossing data included stats on all of the villagers (computer characters) on the island and items that can be bought/sold. There’s a plethora of information on each, like a villager’s personality type, birthday, and theme song! I decided to practice my R data wrangling skills while answering curiosities that popped into my mind as I examined this data.

(Check out my github site for the full analysis & R code, which I’ll update every Tuesday as I’m able!)

Items Analysis

# Analysis 1: Which items can be bought with ‘miles’ and then sold for bells?
# Answer: 19 “Nook Inc” items from the Nook Miles System. Not the e-reader Nook, from the Tanuki character Tom Nook.

items_miles % filter(buy_currency == "miles" & sell_currency == "bells")
view(items_miles)

# Analysis 2: Which ‘items’ give the highest profit (sell minus buy value)?
# Answer: None. All items have a sell_value < buy_value.
I guess you can’t sell things for more than you bought them for…

# Analysis 3: Which items have a greatest difference between buy and sell value?
# Answer: Royal Crown, Crown, Gold Armor, Golden Casket (?!?), Grand Piano

items_bells % filter(buy_currency == "bells" & sell_currency == "bells")
items_bells %>% filter(sell_value > buy_value)
# Create value difference column and add to items_bells table
value_dif <- items_bells$buy_value - items_bells$sell_value items_bells$value_dif % top_n(wt = value_dif, n = 5) # Done!! Expand Console width to view last col. The trick was to create the value_dif in a new column and add it to the table. top_n(items_bells %>% filter(buy_value > sell_value), n = 5)

My question is…is a Golden Casket really what I think it is? Who would use this?? (I can only think of one person, who is rumored to have a golden toilet).

…oh! It’s real! (The casket is in the lower left corner, I believe.)

Golden Casket 1

# Analysis 4: Which category of items is the most expensive?
# Furniture, Hats, Bugs
# Cheapest: Flowers, Fruit, Photos, Socks, Tools

items_bells %>%
group_by(category) %>%
dplyr::summarize_at(vars(buy_value, sell_value), funs(mean(., na.rm = TRUE)))

Villagers Analysis

# Analysis 5: Are there more male or female characters?
# Answer: 187 Females, 204 males! Slightly more males!
# count() does group_by() + tally() https://dplyr.tidyverse.org/reference/tally.html
villagers %>%
count(gender)

# Analysis 6: Which Personality Types are the most common?
# Answer: Lazy, Normal, Cranky/Jock/Snooty, Peppy

# Analysis 6b:Are there any personality types with only 1 character?
# Nope! But the least common type is ‘uchi’ (a translation of “sisterly/big sister” in Japanese).

villagers %>%
count(personality) %>%
arrange(desc(n))

Uchi

To be honest…Agnes the black pig scares me…

# Analysis 7: Who has their own song?
# Four special villagers have their own song: Angus, Diva, Gwen, and Zell.

villagers %>%
add_count(song) %>%
filter(n == 1)

# Analysis 8: Who has a birthday today (5/6)?
# Answer: Tank the Rhino! He looks cuter than I imagined from his name.

villagers %>%
filter(birthday == "5-6")

Tank Rhino

The Lines and Dots that Make Us

This post title plays on Nathan Vass’s book, but doesn’t have to do with Seattle bus transportation!

I’ve been studying how to use R programming for data analysis. We learned about the “geometry” layer of creating graphs using a neat function called ggplot(), from the ggplot2 package. (In case you wondered like me, ‘gg’ stands for Grammar of Graphics, which is a book written about data visualization. Nothing to do with Gigi’s nor gee’s). Compared to using graphing functions like matplot() (“matrix plot”) that come standard in “base R”, graphing tools from packages created by others tend to take care of a lot of details that would be cumbersome to set manually.

We took a set of 562 American movies spanning many genres from 2007-2011, their Rotten Tomatoes ratings (from critics and public audience) and budget in dollars. The following shows how Critic versus Audience ratings compared, by genre (color) and budget (line thickness & dot size, millions of dollars).

Movies lines dots
ggplot: Critic vs Audience Movie Ratings, by genre and budget, 2007-2011. Created using R programming.

Isn’t this graph crazy? Like rainbow paint was splashed everywhere. This left me wondering what is the point of having a graph with lines overlaid above dots like this? It’s hard to distinguish between line strokes and dots, not to mention its not at all necessary to ‘connect the dots’ with geom_line’s.

That’s when I started creating a different graph. I have this banana plant that is unfurling long, luscious green leaves, each one bigger than the next. I counted each leaf (number 1 being the first/oldest, and smallest leaf), and measured the length in centimeters. Ultimately I’d like to create a predictive model (comparing regression versus machine learning) to forecast how long the next leaf will be. Anyways, seeing this graph, this is a case where having lines and dots isn’t so chaotic, and actually aesthetically makes sense.

Banano Line and Dots
Banana plant leaf length. Created using R programming.

The lesson here is that combining both geometric points and lines – geom_line() + geom_point() is better suited when:

  • Dots and their connecting lines are distinct colors
  • Lines are not so thick that the dots are hard to distinguish
  • Generally speaking, that it makes sense to have a reason to draw lines (to emphasize trends, for example).

None of these were the case with the movies graph – so while it may be worthy of being displayed in an art museum, the simpler Banana Growth graph makes a much better use of ggplot geometric dots and lines.

Box and Whiskers

Remember ‘box and whisker’ plots? Along the same vein as ‘stem and leaf’, these are a type of plot learned in school to show the distribution or spread of data, but I’d be stunned if you, dear reader, have actually made use of these outside of math class. Please comment if you have had the honor!

I’ve done a number of statistical analyses for school and tutoring, but I’ve yet to encounter a situation where a good old box and whisker rises as the best contender to display data.

Nonetheless, I learned to make one by coding in R today.
We’re working with a set of data showing the popularity of google-searching the term “entrepreneur” by state. The data has been standardized so that a state with an ‘entrepreneur’ value of 0 is at the mean; a positive value means ‘entrepreneur’ was googled relatively more than average. A negative value means that ‘entrepreneur’ was googled relatively less than average.

Out of the box, the boxplot() function is quite bare.

05_02 Out of the Box Boxplot
Out of the Box boxplot()

No titles, no labels! Those are premium services!
Curiously, the outlier ‘circle’ on the right, around 2.55, is actually TWO data points but they overlap and appear as one. I discovered this only upon summoning some descriptive statistics in the console viewer using:

boxplot.stats(df$entrepreneur)

So, let’s:
– Label the descriptive statistics (min, max, quartiles) and outlier
– Make the labels a fun color while at it


# Boxplot with Quartile Labels

boxplot(df$entrepreneur,
notch = F, horizontal = T,
main = “Distribution of Googling ‘Entrepreneur’ by State”,
xlab = “Standard Deviations”)

text(x = fivenum(df$entrepreneur),
labels = fivenum(df$entrepreneur),
y = 1.25,
col = “#990066”)

05_02 Boxplot Googling Entrepreneur by State and Color Quartile Labels

Concluding here, and next steps:
– It looks like the fivenum() five number summary considers the outliers as the max, so the max whisker is not labelled.
– I’d like to label the outlier with the State Abbreviations (any guesses?)

Thanks for reading!

* The outliers are DE and UT! Surprised?