Saturday, September 24, 2011

Protovis And Wine Visualization: California Crush Statistics

Radio station visualizations are fun and all, but I realized that I should research data visualization by looking at data I actually care about. That way, I can provide context and ask deeper questions about the subject matter at hand.

As an occasional wine writer, data about the wine industry seemed like a good start.

Harvest — "crush" in wine industry jargon — is afoot here in California, and that spurred me to search for data on previous harvests. The National Agricultural Statistics Service publishes a range of interesting data for wine geeks, some of which I've been using for experiments and explorations with Protovis.

The first one I've made public shows harvest statistics over 20 years for the 15 grapes with the highest crush numbers in California in 2010. The interactive version gives you a deeper view, with detailed per-year statistics as you mouse over, but here's a static version to give you an idea.

Groovy, eh?

Wine geeks will know many of this visualization's stories well. The California wine industry has grown tremendously over the last 20 years, thanks to increased consumption in the United States. Grape gluts are periodic, but 2005 was a particularly grape-heavy year. Industrial grapes such as French Colombard, Rubired, and Ruby Cabernet are mainstays of the bulk-wine industry led by Gallo. Pinot Noir tonnage surpassed Syrah tonnage in 2008, about 4 years — when vines start producing worthwhile fruit — after Sideways, the movie that told everyone about Pinot Noir. (Though I should note that I prefer the Pinots of Oregon and the Sonoma Coast to those of Santa Barbara, the setting for the movie. But, really, I prefer the Pinots of Burgundy to those from anywhere else.)

But some items in the data surprised me. Merlot, a common Bordeaux variety, went from almost nothing in 1991 to a dominant grape in 2010. Grenache, the popular, fruity darling of the Rhone Rangers, has actually seen lower crush values in the last 20 years. Pinot Gris has gone from a nonexistent grape in California to one of the top 15 in the state in just over a decade. Tonnage of French Colombard has gone down, which makes me wonder how the industrial market is doing overall.

But if you're reading this blog, you're probably more interested in the technical aspects of this data. I used Protovis, and I have repeatedly found that getting a basic visualization up and running with the library is very fast. Getting the fine details right, however, is much slower. It takes a lot of trial and error to get the language to do what you want. I might switch to D3, its successor, for my next projects. It supposedly gives finer control over your visualization.

What I also keep realizing is that visualizing some set of data isn't really an issue. Organizing the data is. I know this isn't news to anyone who works with data, but these projects are good reminders of how much work that can be.

I started with 20 separate spreadsheets from the NASS and wrote a Ruby script to extract the bits of data I wanted and compile them into a JSON object I could serve to this chart's HTML page. But even then, the page's JavaScript has to do some processing of its own to get the data into a format that Protovis can easily work with. The Underscore JavaScript library is a handy tool for doing data transformations.
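That kind of in-page reshaping might look something like the minimal sketch below. The record fields (`year`, `variety`, `tons`), the function name `toSeries`, and the sample numbers are all assumptions for illustration, not the actual NASS columns or the code behind the chart. It also uses plain JavaScript array methods rather than Underscore, just to stay self-contained.

```javascript
// Hypothetical input: one flat record per variety per year, roughly what a
// preprocessing script might emit. Field names and numbers are made up.
const records = [
  { year: 2009, variety: "Merlot", tons: 300 },
  { year: 2010, variety: "Merlot", tons: 280 },
  { year: 2009, variety: "Pinot Noir", tons: 150 },
  { year: 2010, variety: "Pinot Noir", tons: 170 },
];

// Group flat records into one series per variety. Each series holds one
// value per year, in ascending year order, with 0 where a year is missing.
// A stacked area chart generally wants one array per layer like this.
function toSeries(rows) {
  const years = [...new Set(rows.map((r) => r.year))].sort((a, b) => a - b);

  // variety -> (year -> tons)
  const byVariety = new Map();
  for (const r of rows) {
    if (!byVariety.has(r.variety)) byVariety.set(r.variety, new Map());
    byVariety.get(r.variety).set(r.year, r.tons);
  }

  const series = [];
  for (const [variety, tonsByYear] of byVariety) {
    series.push({
      variety,
      values: years.map((y) => (tonsByYear.has(y) ? tonsByYear.get(y) : 0)),
    });
  }
  return { years, series };
}

const chartData = toSeries(records);
console.log(JSON.stringify(chartData));
```

Filling missing years with 0 keeps every series the same length, which matters when a grape such as Pinot Gris simply has no rows for the early years.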

I also used that preprocessing to cache certain items, such as the pretty-printed numbers, the colors to use for the different areas (which I calculated with a separate tool), and other useful items.
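Caching a display value such as a pretty-printed number can be as simple as the sketch below. The `tons` field, the `label` property, and both helper names are hypothetical. The point is only that the formatting happens once during preprocessing instead of on every mouseover.

```javascript
// Format a number with thousands separators after rounding to an integer,
// for example 1234567 becomes "1,234,567".
function prettyNumber(n) {
  return String(Math.round(n)).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

// Return copies of the records with a cached `label` string attached, so
// event handlers can read it directly instead of reformatting each time.
// `tons` and `label` are assumed names, not fields from the real data.
function withLabels(rows) {
  return rows.map((r) => Object.assign({}, r, { label: prettyNumber(r.tons) }));
}

const labeled = withLabels([{ variety: "Merlot", tons: 1000 }]);
console.log(labeled[0].label);
```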

Saturday, September 3, 2011

Radio Station Playlist Data Visualization, Part 2

As soon as I did my visualization of 99.7's music selection for a week, I asked the obvious next question: How does 99.7 compare to other "adult contemporary" radio stations?

There's an interactive version that lets you drill down into the graph, but here's a screenshot.

People listen to radio stations for all sorts of reasons, of course, so I don't know that anyone actually cares about this. But it did give me a chance to look at Protovis and compare it to Processing as I learn about data visualization toolkits.

Gathering Data
When I gathered data for my first visualization, I wrote a simple script that grabbed songs from the 99.7 website. I set that up as a cron job on an EC2 instance and let it go.

I did the same thing for the other four radio stations I decided to look at. 97.3 uses the same website tech as 99.7, and KFOG and KBAY share a different website tech, so those got me two stations for the price of one. 101.3 uses yet another system. Once I had my scripts running, I just had to wait until I had the same week's worth of data from all stations. A bit of cleanup on the data, a quick change to JSON from comma-separated values, and I was ready to go.
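The change from comma-separated values to JSON can be sketched roughly as follows. The column names (`station`, `played_at`, `artist`, `title`) and the sample row are assumptions for illustration, not the real scraper output. The parser handles double-quoted fields because song titles can contain commas, but it assumes no field contains a line break.

```javascript
// Split one CSV line into fields, honoring double-quoted fields so that a
// comma inside a quoted song title does not split it. A doubled quote ("")
// inside a quoted field becomes a single literal quote.
function parseCsvLine(line) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Convert CSV text whose first line is a header row into an array of
// objects keyed by the header names. Blank lines are ignored.
function csvToObjects(text) {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return [];
  const header = parseCsvLine(lines[0]);
  return lines.slice(1).map((line) => {
    const cells = parseCsvLine(line);
    const row = {};
    header.forEach((name, i) => {
      row[name] = cells[i];
    });
    return row;
  });
}

// Hypothetical sample with a made-up play record.
const sample =
  "station,played_at,artist,title\n" +
  '99.7,2011-08-29T08:00,Example Artist,"Song, With Comma"\n';
console.log(JSON.stringify(csvToObjects(sample), null, 2));
```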

I decided to use the concept of small multiples to provide a quick comparison between stations, and then to show an enlarged version for deeper exploration. Each small graph in the chart represents one station across the same span of time.
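The layout side of small multiples is mostly grid arithmetic. Here's a minimal sketch that assigns each station a panel position. The function name, its parameters, and the pixel sizes are assumptions for illustration and not the values used in the actual chart.

```javascript
// Compute the position of each small panel in a grid, filling left to
// right, then top to bottom. All sizes are in pixels. Every panel gets the
// same width and height so the graphs stay visually comparable.
function layoutSmallMultiples(names, columns, panelWidth, panelHeight, gap) {
  return names.map((name, i) => ({
    name,
    left: (i % columns) * (panelWidth + gap),
    top: Math.floor(i / columns) * (panelHeight + gap),
    width: panelWidth,
    height: panelHeight,
  }));
}

// The five stations compared in this post, in a hypothetical 3-column grid.
const stations = ["99.7", "97.3", "KFOG", "KBAY", "101.3"];
const panels = layoutSmallMultiples(stations, 3, 200, 120, 10);
console.log(JSON.stringify(panels, null, 2));
```

The enlarged view can reuse the same drawing code with a bigger width and height, as long as the small panels all share the same time span.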

Protovis Vs. Processing
It took me some time to learn Protovis. I feel that only now, after finishing one visualization, do I really have a grasp on how it works. It seeks to be a declarative language, which means that you define the result and let the under-the-hood bits figure out how to get you there, but I found myself struggling against the lack of control.

Processing gives you that control. You have vast amounts of control, but that's because it starts you with a blank slate. You can probably do anything you want, but the flip side is that you have to do everything you want.

But Processing comes with a strong disadvantage: It creates Java applets. Remember those? I barely do, and I was actually writing Java when that's all people did with it. An applet takes a long time to load in a world where website visitors are accustomed to instant gratification from your page. An applet also won't work on your iOS device. So my first visualization was completely unusable by iPad owners.

(Yes, there is Processing.js, but my attempts to use it only frustrated me. It didn't support Java generics, and even when I removed them from my code, it failed with cryptic errors that were impossible to debug.)

As with so many things, deciding on a visualization toolkit means figuring out what's best for your job. If you're doing something complex and custom, you'll probably want Processing. But for a lot of web-based visualizations, I think Protovis will give you what you need once you figure out how to use it. It can certainly do a lot in that space.

I have still more visualizations in mind for this same set of data, and I'm planning on starting with Protovis (or its successor, D3). The Java applet problems are too big.