Saturday, April 7, 2012

R

We're getting more focused on analyzing data for SimCity. Telemetry (gameplay analytics) and server metrics are both getting some attention from the team. There are lots of tools to help with this, from real-time graphing systems such as Graphite to the venerable Excel.

But working on SimCity has given me a taste for interesting forms of data visualization. The standard charts serve a purpose, of course, but working on the game has exposed me to newer developments in the visualization field. We're always passing around this or that interesting visualization from the Internet, because showing data to the player is one of the core things we have to do.

We're trying, as often as we can, to put the data a player cares about in the game world. The most extreme form of this comes from what the game's simulation engine, GlassBox, can give us. Everything going on in the world is tied to what's really happening in the engine, not some statistical abstraction as in previous SimCity games. A puff of smoke from a factory isn't just an effect; it's a cue that the simulation has actually written to the pollution map. I like to say that our game is the ultimate data visualization.

Inspired by all this, I started learning R, a language for statistical data analysis and presentation. Along with being tailored for this purpose, it can also be run in batch scripts, which will be a key feature as we automate reports about various aspects of system activity.

I first learned of R when reading Nathan Yau's Visualize This, a book I recommend for getting your hands dirty with practical data visualization. Unfortunately, the nature of his book allows for little more than a cursory explanation of R.

This time around, I picked up R In Action, a much deeper look at the language. While a good chunk of the book is aimed at people who remember their statistics better than I do, the introductory chapters will give you the basics of slicing and dicing data and presenting it in a useful form.

Here's one visualization I did of some data from a focus test. I've stripped off the titles, and I'm not going to say exactly what's going on here (NDA and all that), but the gist is that it's a particular facet of player activity we were measuring during the focus group. The darker the color in each graph, the more players who did that activity in the time frame specified. The taller the boxes, the more players overall who did it. Each graph represents a particular subset of that activity. (See small multiples.)

I had a few goals with this visualization. One, obviously, was to apply R to a real-world problem so I could learn it better. Another was to push it outside of the realm of bar charts and line graphs. Obviously it can do those, but so can everything else on the planet. If I'm going to be inspired to do interesting visualizations, I want a language that will support me.

I came away impressed. R has specialized data structures that make it easy to throw data around in any old way. The standard install was able to do everything I wanted, though a number of packages make various pieces even easier.

The code to prepare that visualization is on the order of 50 lines. (And that's without any real experience with the language.) Calculating the quantiles that make up the gradient? One line. The actual graphing work? Ten lines. Most of the work, as is always the case, was just cleaning and prepping the data. How many lines would it be in Ruby? Or, god forbid, Java? 

As powerful as R is, it's also maddening. It's a language designed by and for academics, so it lacks a lot of the niceties you find in more widespread languages. Functions are haphazardly named, based on the whim of whatever grad student added them back in the day. The documentation is quite good if you know what you're looking for, but frustrating if you want to query it more abstractly. Some large percentage of my current R knowledge comes from Stack Overflow, the end point of seemingly every Google query about R.

Still, if analyzing data is part of your job, R's a powerful tool. It can work with large data sets (and there are packages that let it work with very large data sets), and you can quickly aggregate and manipulate data to understand it better.

Monday, December 19, 2011

Books For Systems Geeks

I've been involved in lots of system releases over the years: not only major versions and upgrades for the various companies I've worked for, but also the monthly or weekly minor (or "minor") releases that are the norm at startups everywhere.

And like everyone else who's gone through those circuits, I've picked up a bunch of ad hoc, empirically proven knowledge about how to do these things. I know what's worked, and I know what hasn't.

But as the person helming the online systems for my studio's next major game, I thought it was a good idea to see what other people have to say about the complex art of launching a big system. Not just making it scale but the processes and practices that make it a smooth, worry-free (or as close as one can hope) launch. No sense in avoiding the wisdom of others, after all.

Here are my thoughts on a few of the books I've read recently.

The Art of Scalability - This is one of my favorites, and it's a book I'll come back to again and again. While it's light on actual technical details, it does a great job of explaining how scalability decisions are actually business decisions and how you need a culture of scalability, not just a decision to shove it all in at the end.

That sounds obvious, right? That's the other thing the book does really well: encapsulating ideas you, like me, have probably learned on your own into nice, articulate concepts. I know about horizontal scaling, sharding, and splitting up servers based on functionality. But the authors' X-, Y-, and Z-axis scaling cube summarizes all of that quickly. Little phrases like "technology-agnostic design" and "swim lanes" become keywords you can call up when you're thinking, "Something about this doesn't sound right."

Release It! - I was scribbling down notes constantly while reading this book. This is the distilled advice of someone who's seen lots of systems work well and poorly. It's one of the few tech books that is super relevant even five years or so after it was published. (Though it merely hints at the cloud technology that has sprung up since then.)

It documents a slew of patterns and antipatterns that will have you nodding your head constantly. A lot of the topics are things that you probably kind of know if you've done a bunch of launches. But this book takes them out of the realm of intuition and into concrete knowledge, real experience, and practical advice.

Its notion that the time before launch is comparatively short next to the time your product spends live won't necessarily apply if, like me, you work in the games industry, where years of work go into a product whose user base, under normal circumstances, drops off rather than grows. But that doesn't diminish the value of the text. (And I won't always be in games.)

Scalability Rules - This book is the sequel to The Art of Scalability, though perhaps companion is a better word. Whereas the authors were light on tech and heavy on business in the first book, this book is all about the technical concepts you need to launch scalable systems. Again, lots of great advice that I scribbled into my notes. (This book has more to say on cloud systems than the original, earlier book did.)

Best of all, they provide a handy list of the 50 rules in order of "amount of risk management you get" and also provide a rough guess at the amount of work. Sort by most bang for the least buck, and you'll be knocking out a few in no time.

Scalable Internet Architectures - On the other hand, if you really want technical advice, this book gets way down to the nitty-gritty. There's a lot of good info here, but it's going to be most useful if you're on the IT/operations side of the fence versus the development side. That said, his argument that you should focus on scaling down (i.e., cutting costs) as well as scaling up has become a fixture in my architecture.

Continuous Delivery - The notion of continuous delivery, which means that your software is always ready for launch, is a compelling one. The authors are basically trying to convert the seemingly inevitable tension around a production release into a ho-hum experience that most of the team probably doesn't even need to be aware of.

It sounds great, but I'm not sure how much I'll be able to apply to my day-to-day work. The authors set a high bar when they cavalierly suggest that you should have 90 percent coverage just from unit tests. I used to feel fairly good about the fact that I had 40 percent coverage from unit tests and integration tests.

Still, I definitely came out with some ideas that will make our own launches more relaxed once I implement them.


Saturday, December 3, 2011

My Radio Broadcast Podcast

While I have an enthusiasm for boppy, bubble-gum music, particularly as a backdrop to coding, I have also had a passion for opera in the past. I've even had questions used in the Opera Quiz on the Metropolitan Opera radio broadcasts.

Unfortunately, that passion is hard to fit into my life these days. Opera tickets are expensive unless you want to stand, and performances take a long time. This is the nature of opera.

This used to be easier. On Saturday mornings, I'd turn on the radio broadcast and listen to the opera for several hours as I went about my morning. Then I got into food. And farmers markets, my favorite of which are on Saturday mornings. That then became my normal food shopping day, and the radio broadcast went by the wayside. I'm usually coming home from shopping right as the opera ends.

Sure, you can listen to albums, but the Met broadcasts are fun because they provide context. The host describes the costumes, experts provide backstory, and they do the quiz.

When podcasts became popular, I realized that a podcast of the Met's Saturday matinees would be perfect. I sent letters asking the Met to do it. When they called and asked for money, I'd mention it. I'd even tell them that I would pay for such a podcast. Imagine: paying for Internet content!

They've never done it. And so I've fallen behind on opera, enjoying it as much as possible with a ticket or two a year to the San Francisco Opera.

Earlier this year, I was ranting about this yet again when I realized that I could probably craft my own podcast based on the radio broadcasts. What I wanted was the ability to call up a podcast on my iPod and see the latest opera broadcast, already synced. So began a day or two, off and on, of work on a podcast-creation script that would use radio stations as its source material.

Rube Goldberg would like this one.

First, I needed to figure out how to capture the music. A friend suggested FStream, Mac OS X software that had two valuable features: I could open a wide range of streaming URLs with it, and it was scriptable. I like scriptable apps. And these days, one can even use Ruby to do that scripting.

What I ultimately wanted was to not even think about this. That meant that my script would need to know a schedule. It reads in a config file with YAML entries that contain the name of the item, the start time, end time, and streaming URL. When the script runs, it parses the file and checks once a minute to see if it should be recording. If it should (and it isn't), it starts up FStream, points it to the appropriate URL, and tells it to start recording. When it reaches the end time, it tells FStream to stop recording.
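
To make that concrete, here's the shape of an entry in opera_schedule.yaml. The station URL is a placeholder, and the times use the "YYYY-MM-DD HH:MM" format that the script's parse_time method expects:

- name: Metropolitan Opera Broadcast
  start_time: "2011-12-10 10:00"
  end_time: "2011-12-10 14:00"
  from_url: http://example.com/met-stream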

Once the file is closed, the script uploads it to S3 and creates an XML file that links to all of the uploaded files. Voilà: a podcast of streaming radio.

Though I did this originally with the Metropolitan Opera broadcasts in mind, it obviously works for any streaming radio. I've set up new entries for CapRadio's Friday Night at the Opera and KDFC's San Francisco Opera broadcasts.

There are a couple of problems with this script. One is a bug that causes it to throw an exception when uploading the file. I'll fix that at some point. It just means I have to manually upload the files to S3. The other problem is logistical: Neither of our computers is on all the time. To add one more step to my baroque script, I set myself a reminder so that I know to set up my computer during the appropriate time period. I wonder if iCal is scriptable …

In addition to all the stuff this script is supposed to do, it passes its config file through Ruby's ERB system. That means I can set up my config file so that the start times are programmatically driven (e.g., 9:00am on the coming Saturday).
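
Since parse_schedule_file hands ERB its own binding, the template can see the time variable (the current time as of each parse). A sketch of an entry that always targets the coming Saturday might look like this:

<% saturday = time + ((6 - time.wday) % 7) * 24 * 60 * 60 %>
- name: Metropolitan Opera Broadcast
  start_time: "<%= saturday.strftime('%Y-%m-%d') %> 09:00"
  end_time: "<%= saturday.strftime('%Y-%m-%d') %> 14:00"
  from_url: http://example.com/met-stream

And because the schedule is re-parsed every time the script checks whether it should be recording, the ERB gets re-evaluated each minute against the current time.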

I'd still like the Met to do a podcast of their own. I'd even still pay for it. But until they do, I not only have their broadcasts in a podcast, I have a wealth of others.

Here's the script, with various sensitive bits taken out. One thing I've found useful is to put my S3 connectivity information in a separate, included file so that I can distribute the main script without accidentally including my S3 credentials. On the off chance you want to use this script, you'll need a file named awsinfo.rb in the same directory that defines three constants: S3_BUCKET_NAME, S3_ACCESS_KEY, and S3_SECRET_ACCESS_KEY.
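
A minimal awsinfo.rb just defines those three constants (the values here are obviously placeholders):

# awsinfo.rb -- kept next to the main script, but out of any copies you share
S3_BUCKET_NAME = 'my-radio-podcast'
S3_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
S3_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET_KEY'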

require './awsinfo'
require 'rubygems'

require 'appscript'
include Appscript

require 'fileutils'
require 'time'
require 'yaml'
require 'aws/s3'
require 'erb'
require 'rss/1.0'
require 'rss/2.0'
require 'rss/maker'

class File
  def name
    pieces = self.path.split('/')
    pieces[pieces.length - 1]
  end
end

SLEEP_INTERVAL = 60
PODCAST_FILE = "podcast.xml"
PODCAST_URL = "https://s3.amazonaws.com/#{S3_BUCKET_NAME}"

$schedule_file = 'opera_schedule.yaml'
$is_recording = false
$current_schedule = nil

# constants are defined in awsinfo
def start_s3
  AWS::S3::Base.establish_connection!(
    :access_key_id     => S3_ACCESS_KEY,
    :secret_access_key => S3_SECRET_ACCESS_KEY
  )
end

def stop_s3
  AWS::S3::Base.disconnect!
end

def parse_time(time_string)
  regex = /^(\d*?)-(\d*?)-(\d*?)\s*?(\d*?):(\d*)/
  year   = time_string[regex,1].to_i
  month  = time_string[regex,2].to_i
  day    = time_string[regex,3].to_i
  hour   = time_string[regex,4].to_i
  minute = time_string[regex,5].to_i
  Time.local(year,month,day,hour,minute)
end

def parse_schedule_file(filename = $schedule_file,time=Time.new)
  schedule = File.open(filename,'r') {|file| YAML::load(ERB.new(file.read).result binding)}
end

#side effect: sets $current_schedule if appropriate
def should_be_recording(time=Time.new)
  schedule = parse_schedule_file($schedule_file, time)
  schedule_flag = false
  schedule.each do |schedule_entry|
    start_time = parse_time(schedule_entry['start_time'])
    end_time   = parse_time(schedule_entry['end_time'])
    if (start_time..end_time) === time then
      schedule_flag = true
      $current_schedule = schedule_entry
      break
    end
  end
  schedule_flag
end

def next_scheduled_task(time=Time.new)
  schedule = parse_schedule_file($schedule_file, time)
  sorted_schedules = schedule.sort {|a,b| parse_time(a['start_time']) <=> parse_time(b['start_time'])}
  return_schedule = nil

  sorted_schedules.each do |entry|
    if parse_time(entry['start_time']) > time then
      return_schedule = entry
      break
    end
  end
  return_schedule
end

def add_file_to_podcast(file,schedule_info=$current_schedule)
  xml_file = file_from_pieces(fstreams_dir,PODCAST_FILE)
  rss = nil
  if !File.exists?(xml_file) then
    rss = RSS::Maker.make("2.0") do |maker|
      maker.channel.title = "Derrick's Radio Podcast"
      maker.channel.link = "#{PODCAST_URL}/#{PODCAST_FILE}"
      maker.channel.description = "Radio programs captured by script"
      maker.items.do_sort = true # sort items by date
    end
    File.open(xml_file,"w") {|file| file.write(rss)}
  end

  content = ""
  File.open(xml_file,"r") do |existing_file|
    content = existing_file.read
  end
  rss = RSS::Parser.parse(content,false)

  item = RSS::Rss::Channel::Item.new
  item.title = schedule_info['name']
  item.date = File.mtime(file)
  item.link = "#{PODCAST_URL}/#{File.new(file).name}"
  item.pubDate = File.mtime(file)
  item.enclosure = RSS::Rss::Channel::Item::Enclosure.new(item.link, File.size(file), 'audio/mpeg')
  rss.items << item

  File.open(xml_file,"w") {|file| file.write(rss)}
end

# todo: use fstream.recording flag
def is_recording
  fstream = app('FStream')
  puts fstream.status
  $is_recording && fstream.status == 3
end

def fstreams_dir
  fstream_path = './fstreams'
  FileUtils.mkdir_p(fstream_path)
  Dir.new(fstream_path)
end

def file_from_pieces(dir,file)
  "#{dir.path}/#{file}"
end

def sync_dir
  dir = fstreams_dir
  start_s3
  dir.entries.each do |filename|
    next if filename =~ /^\..*/
    file_path = file_from_pieces(dir,filename)

    # if the file doesn't exist (or it's the podcast file), upload it
    if !AWS::S3::S3Object.exists?(S3_BUCKET_NAME,filename) || filename == PODCAST_FILE then
      AWS::S3::S3Object.store(filename,open(file_path),S3_BUCKET_NAME)
    end
  end
  stop_s3
end

def s3_safe_name(english_name)
  english_name.gsub(/\s/,'_').downcase
end

def start_recording
  $is_recording = true
  puts 'Starting to record'
  fstream = app('FStream')
  fstream.openStreamWithURL($current_schedule['from_url'])
  fstream.startRecording
end

def stop_recording
  puts 'Stopping recording'
  fstream = app('FStream')
  fstream.stopRecording
  fstream.stopPlaying

  # find the file most recently created, and rename it
  dir = fstreams_dir
  filepaths = []
  dir.entries.each do |filename|
    next if filename =~ /^\..*/
    filepaths << file_from_pieces(dir,filename)
  end
  filepaths.sort! {|a,b| File.mtime(b) <=> File.mtime(a)}
  filepath = file_from_pieces(dir,s3_safe_name($current_schedule['name'])+".mp3")

  FileUtils.mv(filepaths[0],filepath)

  # file can take a while to close
  while true
    begin
      sleep(10)
      File.mtime(filepath)
      break
    rescue Exception
      puts "Waiting for #{filepath} to close"
    end
  end

  # update podcast file and S3
  add_file_to_podcast(filepath)
  sync_dir

  $is_recording = false
  $current_schedule = nil
end

def do_start_app
  while(true)
    if !is_recording && should_be_recording then
      start_recording
    elsif is_recording && !should_be_recording then
      stop_recording
    else
      next_sched = next_scheduled_task
      puts "#{Time.new} Next scheduled task is #{next_sched['name']} starting at #{next_sched['start_time']}" if next_sched
    end
    sleep(SLEEP_INTERVAL)
  end
end

$schedule_file = ARGV[0] if ARGV.length >= 1
do_start_app if !$testing_mode





Thursday, November 3, 2011

Partitioned Concurrency

Here is a common block of code in Java. Assume cache is a Map of some form.

if (!cache.containsKey(someKey)) {
    cache.put(someKey, someValue);
}


Or: if a given object is not in the map with a certain key, put it there. Otherwise, proceed along your merry way.

I was writing code like this in a controller the other day, because I wanted to cache Broadcaster objects (from the Atmosphere framework) that would manage separate chat channels. Requests would come in and get attached to one or another Broadcaster based on a channel ID.

But that code snippet, in its current form, isn't thread-safe. You could have two threads get past the if and thus both think that the map needs the object inserted. That isn't always a problem — my reflection-based API marshaling layer caches Field and Method objects that are constant for a given class, making me unconcerned about threads that replace values — but in this case, you could have the second thread replace the Broadcaster that the first one inserted, meaning that some chunk of the chat clients wouldn't see messages because they'd be attached to the wrong object.

The standard way to solve this is to put a mutex around the cache-modification code:
synchronized(this) {
    if (!cache.containsKey(someKey)) {
        cache.put(someKey, someValue);
    }
}


But this is code that is likely to have a lot of requests fired at it, and synchronizing this way introduces a major performance bottleneck. Every single request would have to wait for that lock. Yes, I know: Premature optimization is the root of all evil. Still, it seemed problematic. Synchronization at that high of a level can cause major issues when you're dealing with hundreds of thousands of concurrent users, which we may very well be when we launch.

As I was pondering this, I remembered a technique for increasing concurrency. I had discussed it with friends before, but never implemented it. Still, it's straightforward enough. (In fact, it's how java.util.concurrent.ConcurrentHashMap is implemented, or near enough.) Note that this technique should also work in Objective-C and any other language with similar semantics, though I haven't tried it anywhere other than Java.

Here's the idea: If you've got some sort of cache, you only really need to make sure that two threads aren't working on the same cache entry at the same time. If I have a Broadcaster, and I attach it to some ID for the channel, I only care about isolating activity around that ID. I don't care if someone's working on some other ID. (except when I do: see below) In other words, if I'm working with an ID of 3, I don't care if some other thread is checking about whether or not ID 1 already exists; I only care that someone else asking about ID 3 doesn't cause problems.

So if you could create a sequence of locks, and just ensure that anyone working on the same ID ends up synchronizing on the same lock, you're good to go.

Consider this code:
public class ConcurrentCache {
    private static Object[] locks = new Object[10];
    static {
        for (int lockIndex = 0; lockIndex < locks.length; lockIndex++) {
            locks[lockIndex] = new Object();
        }
    }

    private Map cache = new HashMap();

    public Object getOrInsertIntoCache(Object key) {
        // mask off the sign bit so a negative hashCode can't produce a negative index
        synchronized(locks[(key.hashCode() & Integer.MAX_VALUE) % locks.length]) {
            if (!cache.containsKey(key)) {
                cache.put(key, new Object());
            }
            return cache.get(key);
        }
    }
}


The static initialization code gives you ten locks to work with, and you can get to one by modding the hash code by the length of the array (after masking off the sign bit, since hash codes can be negative). Any given ID will always end up with the same lock, assuming a consistent hashCode result, which is true of Long, Integer, String, and the other object representations of Java primitives typically used as keys.

To test this, I wrote a simple program that would put 5,000 tasks on an ExecutorService with ten threads. Each task would generate a random number between one and 100, which became the key to use on the cache. Depending on some command-line arguments, the tasks would either lock on the whole cache or on a partitioned lock as above. Each task captured the current system time when it was constructed (put on the queue) and then, when it eventually ran, printed the difference between the current time and that start time. I ran the program and eagerly checked the average wait times the tasks experienced.

Only to discover that they had almost identical performance. The partitioned code showed threads waiting a mere millisecond less, on average, than the "block on everything" code. And that was averaged over 5,000 jobs, remember. That's a lot of complexity for not a lot of gain.

I figured out what was going on with a bit of poking. The basic code says: See if the key's already there, and, if it's not, insert it. What that really meant was that 4,900 of my jobs called containsKey, saw that the key was there, and exited. I had an extraordinarily high cache hit rate. The lock was acquired and released so quickly that there simply wasn't noticeable lock contention.

Once I realized that, I made a simple change. After doing the cache logic, I simply had each thread sleep for one millisecond. And I ran my program again.

That produced the results I expected! Threads in the "block on everything" version waited almost precisely ten times as long, on average, as their partitioned cousins (29 seconds versus 3).

Real-world caches are messy things. In fact, my simple test case wouldn't really be done this way at all: You'd prefill the cache with the fixed items you wanted and avoid all this nonsense. But real caches need to expire items, cached results can be more or less complex, and so forth. The caching situation I have involves lots of different IDs with corresponding amounts of object churn. So the in situ results will likely look very different from the almost exact ten-way split I saw in the test. Still, it obviously made a big difference for time-consuming activity and at least didn't hurt performance in the simple case. You have more code complexity, which means more potential bugs and less maintainability, but in my system this logic is tucked into a class by itself, so no clients need to worry about these details: They just request a Broadcaster and don't worry about how it gets to them.

Still, if your cache handling is no more sophisticated than this, synchronizing on the whole thing is basically just as fast, and you should avoid the readability/maintainability cost altogether. If you're worrying about a one-millisecond difference across 5,000 tasks in your server code, you're farther along in your optimizations than I am.

There's also the question of what to do when you do care about the overall state of the cache. For instance, what if you want to get the size of this cache? The above technique won't work, because no one lock guarantees a fixed state. (Ignoring the reality that you probably don't actually care about the exact size of the cache: You simply want the approximation, in which case you're fine.)

Actually, there is one lock that will guarantee the state of the overall cache, and that's the cache itself. If you really care about exact counts, you can synchronize on the cache itself for anything that changes its size. For instance:
public class ConcurrentCache {
    private static Object[] locks = new Object[10];
    static {
        for (int lockIndex = 0; lockIndex < locks.length; lockIndex++) {
            locks[lockIndex] = new Object();
        }
    }

    private Map cache = new HashMap();

    public Object getOrInsertIntoCache(Object key) {
        synchronized(locks[(key.hashCode() & Integer.MAX_VALUE) % locks.length]) {
            if (!cache.containsKey(key)) {
                synchronized(cache) {
                    cache.put(key, new Object());
                }
            }
            return cache.get(key);
        }
    }
}


You re-introduce the global lock, but you minimize when it's acquired. If you have a high cache hit rate, this should still give you better concurrency while allowing for across-the-board thread safety. (Though, again, do you really care that your size is 99 and not 98?)

This isn't a new technique, but it was worth jotting down so I don't forget it.

Saturday, October 15, 2011

Faking Block Programming In Java

While I was writing some server-side code to talk to a web service, I did the total n00b thing of forgetting to close my connection when I was done with the call. This works fine in test cases, but in any sort of real-world situation, you'll quickly exhaust your connection pool as connections linger, unavailable, until they time out. And as I fixed the problem, I realized I had done the same thing in one other place — I haven't written client code in a while. Then I realized that I had a recurring pattern:


HttpMethod get = null;
try {
    get = new GetMethod(url);
    get.execute();

    // pull the response body out and do stuff with it
} finally {
    if (get != null) {
        get.releaseConnection();
    }
}


This ensures that, even in the case of an Exception, the connection gets released. Fairly straightforward. But who wanted little copies of that code all over the system? And, worse, what if someone forgot to do this, just as I did? Anytime you set up code people have to remember to type, you ensure that someday someone will forget.

This would be an obvious use case for a closure. In fact, it reminded me of the File.open method in Ruby that takes a block of code as an argument. The method creates a File object, calls the block of code you pass in with said File object, and then closes the file even if there was an exception.
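
If you haven't seen that pattern, it looks like this; process_response stands in for whatever you'd actually do with the file:

File.open('response.xml') do |file|
  process_response(file) # even if this raises, File.open closes the file on the way out
end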

The only problem: pure Java doesn't support closures. (Some JVM-compatible languages like Scala do, however.)

But you can mimic the behavior to some degree with anonymous inner classes, and you can use Java Generics to provide type checking. I created a MethodOperator interface that looks like this:


public interface MethodOperator<T> {
    public T actOnMethodPostResponse(HttpMethod method) throws Exception;
}


The <T> and public T … bits basically mean that when I have to instantiate one of these, I can also declare it as being of some type, which then gets returned from the one method.

Once I had that code, I added a simple utility method:

protected <T> T actOnHttpMethod(HttpMethod method, MethodOperator<T> operator) throws Exception {
    try {
        executeMethod(method); // utility method that checks for errors and so forth on the method
        return operator.actOnMethodPostResponse(method);
    } catch (Exception e) {
        log.error("Error talking to http server", e);
        throw e;
    } finally {
        if (method != null) {
            method.releaseConnection();
        }
    }
}


And I can invoke it with something like this:

IdResponse response = actOnHttpMethod(post, new MethodOperator<IdResponse>() {
    public IdResponse actOnMethodPostResponse(HttpMethod method) {
        // unmarshal the response and create an IdResponse object with it
    }
});


The actOnHttpMethod method will do the request, hand my object the response, and then close the connection for me.

Inner classes definitely suffer from readability problems, but this setup ensures that it's very easy for developers to not even think about connection management. Furthermore, I can add features and have them automatically used by every client; for instance, I could profile the request/response time or add logging. If I ever want to add support for asynchronous calls, I can write a new utility method that does all the work of enqueuing the method and so forth, invoking the MethodOperator code as needed, and then change specific call sites to say actOnHttpMethodAsync or something instead of actOnHttpMethod. A minimal change in client code plus a utility method, and I've added a more scalable alternative for situations where I don't care about waiting for the response.

Once I refactored all that away, I then realized that I could refactor even more. At the moment, I handle a response in one of two ways: I either ignore it (for things like DELETEs) or I unmarshal the contents from XML into Java. Once I had my whole framework in place, I realized I could make implementations of the MethodOperator interface that would cover these two cases.

I created the following:

public class IgnoreResponseMethodOperator implements MethodOperator<Object> {
    public Object actOnMethodPostResponse(HttpMethod method) {
        return null;
    }
}

public class XmlUnmarshallingMethodOperator<T> implements MethodOperator<T> {
    public T actOnMethodPostResponse(HttpMethod method) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Constants.JAXB_PACKAGES);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return (T)unmarshaller.unmarshal(method.getResponseBodyAsStream());
    }
}


Now my client code actually looks like this:

IdResponse response = actOnHttpMethod(post,new XmlUnmarshallingMethodOperator<IdResponse>());


I still get all the value of the code that manages connections around my code, but now I don't even have to worry about the unreadability of anonymous inner classes.

(You could make the case that this will create a lot of object churn. If it does, I can look at making a thread-safe implementation that will let me re-use the MethodOperator objects. That's very easy to do with the IgnoreResponseMethodOperator, but tougher with the type-safe xml unmarshaller. I imagine I'd have to create instances for each type of object I might get back. Given that there aren't too many, this wouldn't be too bad. But first I'll see if that's actually a problem.)

Saturday, September 24, 2011

Protovis And Wine Visualization: California Crush Statistics

Radio station visualizations are fun and all, but I realized that I should research data visualization by looking at data I actually care about. That way, I can provide context and ask deeper questions about the subject matter at hand.

As an occasional wine writer, data about the wine industry seemed like a good start.

Harvest — "crush" in wine industry jargon — is afoot here in California, and that spurred me to search for data on previous harvests. The National Agricultural Statistics Service publishes a range of interesting data for wine geeks, some of which I've been using for experiments and explorations with Protovis.

The first public one shows harvest statistics over 20 years for the 15 grapes with the highest crush numbers in California in 2010. The interactive version gives you a deeper view, with detailed per-year statistics as you mouse over, but here's a static version to give you an idea.


Groovy, eh?

Wine geeks will know many of this visualization's stories well. The California wine industry has grown tremendously over the last 20 years, thanks to increased consumption in the United States. Grape gluts are periodic, but 2005 was a particularly grape-heavy year. Industrial grapes such as French Colombard, Rubired, and Ruby Cabernet are mainstays of the bulk-wine industry led by Gallo. Pinot Noir tonnage surpassed Syrah tonnage in 2008, about four years after Sideways, the movie that told everyone about Pinot Noir (roughly the time it takes newly planted vines to start producing worthwhile fruit). (Though I should note that I prefer the Pinots of Oregon and the Sonoma Coast to those of Santa Barbara, the setting for the movie. But, really, I prefer the Pinots of Burgundy to those from anywhere else.)

But some items in the data surprised me. Merlot, a common Bordeaux variety, went from almost nothing in 1991 to a dominant grape in 2010. Grenache, the popular, fruity darling of the Rhone Rangers, has actually seen lower crush values in the last 20 years. Pinot Gris has gone from a nonexistent grape in California to one of the top 15 in the state in just over a decade. Tonnage of French Colombard has gone down, which makes me wonder how the industrial market is doing overall.

But if you're reading this blog, you're probably more interested in the technical aspects of this data. I used Protovis, and I have repeatedly found that getting a basic visualization up and running with the library is very fast. Getting the fine details right, however, is much slower. It takes a lot of trial and error to get the language to do what you want. I might switch to D3, its successor, for my next projects. It supposedly gives finer control over your visualization.

What I also keep realizing is that visualizing some set of data isn't really an issue. Organizing the data is. I know this isn't news to anyone who works with data, but these projects are good reminders of how much work that can be.

I started with 20 separate spreadsheets from the NASS and wrote a Ruby script to extract the bits of data I wanted and compile them into a JSON object I could serve to this chart's HTML page. But even then, the page's JavaScript has to do some processing as well to get the data into a format that Protovis can easily work with. The Underscore JavaScript library is a handy tool for doing those data transformations.
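
The extraction script itself isn't anything special. Here's a stripped-down sketch of the approach, assuming each year's spreadsheet has been exported to CSV first; the file names and column headers are invented, and the real data needed more massaging than this:

require 'csv'
require 'json'

# One NASS export per year, e.g. crush_1991.csv through crush_2010.csv (hypothetical names)
crush = Hash.new {|hash, grape| hash[grape] = {}}

(1991..2010).each do |year|
  CSV.foreach("crush_#{year}.csv", :headers => true) do |row|
    # keep only the columns the chart needs; 'Variety' and 'Tons Crushed' are placeholder headers
    crush[row['Variety']][year] = row['Tons Crushed'].to_f
  end
end

File.open('crush.json', 'w') {|file| file.write(JSON.generate(crush))}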

But I also used that preprocessing to cache certain items such as the pretty-printed numbers, the colors to use for the different areas (which I calculated with the excellent 0to255.com) and other useful items.

Saturday, September 3, 2011

Radio Station Playlist Data Visualization, Part 2

As soon as I did my visualization of 99.7's music selection for a week, I asked the obvious next question: How does 99.7 compare to other "adult contemporary" radio stations?

There's an interactive version that lets you drill down into the graph, but here's a screenshot.

People listen to radio stations for all sorts of reasons, of course, so I don't know that anyone actually cares about this. But it did give me a chance to look at Protovis and compare it to Processing as I learn about data visualization toolkits.

Gathering Data

When I gathered data for my first visualization, I wrote a simple script that grabbed songs from the 99.7 website. I set that up as a cron job on an EC2 instance and let it go.

I did the same thing for the other four radio stations I decided to look at. 97.3 uses the same website tech as 99.7, and KFOG and KBAY share a different website tech, so each of those scrapers got me two stations for the price of one. 101.3 uses yet another system. Once I had my scripts running, I just had to wait until I had the same week's worth of data from all the stations. A bit of cleanup on the data, a quick conversion from comma-separated values to JSON, and I was ready to go.

I decided to use the concept of small multiples to provide a quick comparison between stations, while also showing an enlarged version for deeper exploration. Each small graph in the chart represents one station across the same span of time.

Protovis Vs. Processing

It took me some time to learn Protovis. I feel that only now, after finishing one visualization, do I really have a grasp on how it works. It seeks to be a declarative language, which means that you define the result and let the under-the-hood bits figure out how to get you there, but I found myself struggling against the lack of control.

Processing gives you that control. You have vast amounts of control, but that's because it starts you with a blank slate. You can probably do anything you want, but the flip side is that you have to do everything you want.

But Processing comes with a strong disadvantage: It creates Java applets. Remember those? I barely do, and I was actually writing Java when that's all people did with it. An applet takes a long time to load in a world where website visitors are accustomed to instant gratification from your page. An applet also won't work on your iOS device. So my first visualization was completely unusable by iPad owners.

(Yes, there is Processing.js, but my attempts to use it only frustrated me. It didn't support Java generics, and even when I removed them from my code, it failed with cryptic errors that were impossible to debug.)

As with so many things, deciding on a visualization toolkit means figuring out what's best for your job. If you're doing something complex and custom, you'll probably want Processing. But for a lot of web-based visualizations, I think Protovis will give you what you need once you figure out how to use it. It can certainly do a lot in that space.

I have still more visualizations in mind for this same set of data, and I'm planning on starting with Protovis (or its successor, D3). The Java applet problems are too big.