Saturday, April 7, 2012

R

We're getting more focused on analyzing data for SimCity. Telemetry (gameplay analytics) and server metrics are both getting some attention from the team. There are lots of tools to help with this, from real-time graphing systems such as Graphite to the venerable Excel.

But working on SimCity has given me a taste for interesting forms of data visualization. The standard charts serve a purpose, of course, but working on the game has exposed me to newer developments in the visualization field. We're always passing around this or that interesting visualization from the Internet, because showing data to the player is one of the core things we have to do.

We're trying, as often as we can, to put the data a player cares about in the game world. The most extreme form of this comes from what the game's simulation engine, GlassBox, can give us. Everything going on in the world is tied to what's really happening in the engine, not some statistical abstraction as in previous SimCity games. A puff of smoke from a factory isn't just an effect; it's actually a cue that we have written to the pollution map. I like to say that our game is the ultimate data visualization.

Inspired by all this, I started learning R, a language for statistical data analysis and presentation. Along with being tailored for this purpose, it can also be run in batch scripts, which will be a key feature as we automate reports about various aspects of system activity.
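To make the batch angle concrete, here is a minimal sketch of what a scripted report could look like. It is not one of our actual reports: the file name `report.R`, the input CSV, and the `latency_ms` column are all made up for illustration. The only real pieces are base R's `commandArgs()`, `read.csv()`, and `summary()`, plus the `Rscript` command-line front end.

```r
# report.R: a hypothetical report script, run as `Rscript report.R metrics.csv`.
# The file name and the `latency_ms` column are assumptions for illustration.

# Read a CSV of server metrics and return a five-number summary plus the mean.
summarize_metrics <- function(path) {
  metrics <- read.csv(path)
  summary(metrics$latency_ms)
}

# Only run the report when a file path is supplied on the command line.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) >= 1) {
  print(summarize_metrics(args[1]))
}
```

A scheduler such as cron could then call the script on whatever interval the report needs.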

I first learned of R when reading Nathan Yau's Visualize This, a book I recommend for getting your hands dirty with practical data visualization. Unfortunately, the nature of his book allows for little more than a cursory explanation of R.

This time around, I picked up R in Action, a much deeper look at the language. While a good chunk of the book is aimed at people who remember their statistics better than I do, the introductory chapters will give you the basics of slicing and dicing data and presenting it in a useful form.

Here's one visualization I did about some data from a focus test. I've stripped off the titles, and I'm not going to say exactly what's going on here (NDA and all that), but the gist is that it's a particular facet of player activity we were measuring during the focus group. The darker the color in each graph, the more players who did that activity in the time frame specified. The taller the boxes, the more players overall who did it. Each graph represents a particular subset of that activity. (See small multiples.)

I had a few goals with this visualization. One, obviously, was to apply R to a real-world problem so I could learn it better. Another was to push it outside of the realm of bar charts and line graphs. Obviously it can do those, but so can everything else on the planet. If I'm going to be inspired to do interesting visualizations, I want a language that will support me.

I came away impressed. R has specialized data structures that make it easy to throw data around in any old way. The standard install was able to do everything I wanted, though a number of packages make various pieces even easier.

The code to prepare that visualization is on the order of 50 lines. (And that's without any real experience with the language.) Calculating the quantiles that make up the gradient? One line. The actual graphing work? Ten lines. Most of the work, as is always the case, was just cleaning and prepping the data. How many lines would it be in Ruby? Or, god forbid, Java? 
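To give a sense of what that one-liner looks like, here is a rough sketch. It is not the code from the focus test (that stays under NDA); the `players` vector is invented data, and the four-step gray ramp is just one way to turn quantile bins into a gradient. `quantile()`, `cut()`, and `gray()` are all part of the standard install.

```r
# Hypothetical data: number of players doing an activity in each time bucket.
players <- c(2, 5, 9, 14, 20, 3, 7, 12, 18, 25)

# The one-liner: break points at the 0%, 25%, 50%, 75%, and 100% quantiles.
breaks <- quantile(players, probs = seq(0, 1, by = 0.25))

# Assign each value to a quartile bin (1 = fewest players, 4 = most).
bins <- cut(players, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Map the bins onto a light-to-dark gray ramp, so busier buckets come out darker.
shades <- gray(seq(0.9, 0.2, length.out = 4))[bins]
```

The resulting `shades` vector can be handed straight to the `col` argument of the base plotting functions.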

As powerful as R is, it's also maddening. It's a language designed by and for academics, so it lacks a lot of the niceties that you find in more widespread languages. Functions are haphazardly named, based on the whim of whatever grad student added them back in the day. The documentation is quite good if you know what you're looking for, but frustrating if you want to query it more abstractly. Some large percentage of my current R knowledge comes from Stack Overflow, the end point of seemingly every query about R in Google.

Still, if analyzing data is part of your job, R's a powerful tool. It can work with large data sets (and there are packages that let it work with very large data sets), and you can quickly aggregate and manipulate data to understand it better.
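As a small illustration of that kind of aggregation, the sketch below uses an invented `sessions` data frame (the column names and values are made up, not real telemetry) and rolls it up two ways with base R's `aggregate()` and `tapply()`.

```r
# Hypothetical telemetry: one row per player session.
sessions <- data.frame(
  player   = c("a", "a", "b", "b", "c"),
  activity = c("build", "zone", "build", "build", "zone"),
  minutes  = c(12, 5, 30, 8, 14)
)

# Total minutes spent on each activity.
by_activity <- aggregate(minutes ~ activity, data = sessions, FUN = sum)

# Number of distinct players who did each activity.
players_per_activity <- tapply(sessions$player, sessions$activity,
                               function(p) length(unique(p)))
```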