An Obsession With Programming: 2011

Monday, December 19, 2011

Books For Systems Geeks

I've been involved in lots of system releases over the years. Not only major versions and upgrades for the various companies I've worked for but the monthly or weekly minor (or, "minor") releases that are the norm in startups everywhere.

And like everyone else who's gone through those circuits, I've picked up a bunch of ad hoc, empirically proven knowledge about how to do these things. I know what's worked, and I know what hasn't.

But as the person helming the online systems for my studio's next major game, I thought it was a good idea to see what other people have to say about the complex art of launching a big system. Not just making it scale but the processes and practices that make it a smooth, worry-free (or as close as one can hope) launch. No sense in avoiding the wisdom of others, after all.

Here are my thoughts on a few of the books I've read recently.

The Art of Scalability - This is one of my favorites, and it's a book I'll come back to again and again. While it's light on actual technical details, it does a great job of explaining how scalability decisions are actually business decisions and how you need a culture of scalability, not just a decision to shove it all in at the end.

That sounds obvious, right? That's the other thing the book does really well: Encapsulating ideas you, like me, have probably learned on your own into nice, articulate concepts. I know about horizontal scaling, sharding, and splitting up servers based on functionality. But when they talk about their X-, Y-, and Z-axis scaling cube, it summarizes it quickly. Little phrases like "technology agnostic design" and "swim lanes" become keywords that you can quickly call up when you're thinking, "Something about this doesn't sound right."

Release It! - I was scribbling down notes constantly while reading this book. This is the distilled advice of someone who's seen lots of systems work well and poorly. It's one of the few tech books that is super relevant even five years or so after it was published. (Though it merely hints at the cloud technology that has sprung up since then.)

It documents a slew of patterns and antipatterns that will have you nodding your head constantly. A lot of the topics are things that you probably kind of know if you've done a bunch of launches. But this book takes them out of the realm of intuition and into concrete knowledge, real experience, and practical advice.

Its notion that the time before launch is comparatively short next to the time a product spends live won't necessarily apply if, like me, you work in the games industry, where years of work go into a product whose user base drops off, under normal circumstances, rather than grows. But that does not diminish the value of the text. (And I won't always be in games.)

Scalability Rules - This book is the sequel to The Art of Scalability, though perhaps companion is a better word. Whereas the authors were light on tech and heavy on business in the first book, this book is all about the technical concepts you need to launch scalable systems. Again, lots of great advice that I scribbled into my notes. (This book has more to say on cloud systems than the first book did.)

Best of all, they provide a handy list of the 50 rules in order of "amount of risk management you get" and also provide a rough guess at the amount of work. Sort by most bang for the least buck, and you'll be knocking out a few in no time.

Scalable Internet Architectures - On the other hand, if you really want technical advice, this book gets way down to the nitty-gritty. There's a lot of good info here, but it's going to be most useful if you're on the IT/operations side of the fence versus the development side. That said, his argument that you should focus on scaling down (i.e., cutting costs) as well as scaling up has become a fixture in my architecture.

Continuous Delivery - The notion of continuous delivery, which means that your software is always ready for launch, is a compelling one. The authors are basically trying to convert the seemingly inevitable tension around a production release into a ho-hum experience that most of the team probably doesn't even need to be aware of.

It sounds great, but I'm not sure how much I'll be able to apply to my day-to-day work. The authors set a high bar when they cavalierly suggest that you should have 90 percent coverage just from unit tests. I used to feel fairly good about the fact that I had 40 percent coverage from unit tests and integration tests.

Still, I definitely came out with some ideas that will make our own launches more relaxed once I implement them.

Saturday, December 3, 2011

My Radio Broadcast Podcast

While I have an enthusiasm for boppy, bubble-gum music, particularly as a backdrop to coding, I have also had a passion for opera in the past. I've even had questions used in the Opera Quiz on the Metropolitan Opera radio broadcasts.

Unfortunately, that passion is hard to fit into my life these days. Opera tickets are expensive unless you want to stand, and performances take a long time. This is the nature of opera.

This used to be easier. On Saturday mornings, I'd turn on the radio broadcast and listen to the opera for several hours as I went about my morning. Then I got into food. And farmers markets, my favorite of which are on Saturday mornings. That then became my normal food shopping day, and the radio broadcast went by the wayside. I'm usually coming home from shopping right as the opera ends.

Sure, you can listen to albums, but the Met broadcasts are fun because they provide context. The host describes the costumes, experts provide backstory, and they do the quiz.

When podcasts became popular, I realized that a podcast of the Met's Saturday matinees would be perfect. I sent letters asking the Met to do it. When they called and asked for money, I'd mention it. I'd even tell them that I would pay for such a podcast. Imagine: paying for Internet content!

They've never done it. And so I've fallen behind on opera, enjoying it as much as possible with a ticket or two a year to the San Francisco Opera.

Earlier this year, I was ranting about this yet again when I realized that I could probably craft my own podcast based on the radio broadcasts. What I wanted was the ability to call up a podcast on my iPod and see the latest opera broadcast, already synced. So began a day or two, off and on, of work on a podcast-creation script that would use radio stations as its source material.

Rube Goldberg would like this one.

First, I needed to figure out how to capture the music. A friend suggested FStream, Mac OS X software that had two valuable features: I could open a wide range of streaming URLs with it, and it was scriptable. I like scriptable apps. And these days, one can even use Ruby to do that scripting.

What I ultimately wanted was to not even think about this. That meant that my script would need to know a schedule. It reads in a config file with YAML entries that contain the name of the item, the start time, end time, and streaming URL. When the script runs, it parses the file and checks once a minute to see if it should be recording. If it should (and it isn't), it starts up FStream, points it to the appropriate URL, and tells it to start recording. When it reaches the end time, it tells FStream to stop recording.
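To make the schedule check concrete, here is a small sketch of the idea in Ruby. It is separate from the full script further down: the entry name, times, and stream URL are made up for illustration, and entry_to_record is a hypothetical helper name. The keys (name, start_time, end_time, from_url) are the ones the script reads.

```ruby
require 'yaml'
require 'time'

# Hypothetical schedule with one entry; a real config would list one entry per broadcast.
SCHEDULE_YAML = <<~CONFIG
  - name: Example Opera Broadcast
    start_time: "2011-12-03 10:00"
    end_time: "2011-12-03 14:00"
    from_url: http://example.com/stream
CONFIG

# Turns "YYYY-MM-DD HH:MM" into a local Time.
def parse_time(text)
  Time.strptime(text, '%Y-%m-%d %H:%M')
end

# Returns the entry whose time window contains `now`, or nil if nothing is scheduled.
def entry_to_record(schedule, now = Time.now)
  schedule.find do |entry|
    (parse_time(entry['start_time'])..parse_time(entry['end_time'])).cover?(now)
  end
end

schedule = YAML.safe_load(SCHEDULE_YAML)
entry = entry_to_record(schedule, Time.local(2011, 12, 3, 11, 0))
puts entry ? "Should be recording #{entry['name']}" : 'Nothing scheduled'
```

A loop that calls something like this once a minute only has to compare the answer with whether a recording is already in progress.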

Once the file is closed, the script uploads it to S3 and creates an XML file that points to all the appropriate links of the files. Voila: a podcast of streaming radio.

Though I did this originally with the Metropolitan Opera broadcasts in mind, it obviously works for any streaming radio. I've set up new entries for CapRadio's Friday Night at the Opera, and KDFC's San Francisco Opera broadcasts.

There are a couple of problems with this script. One is a bug that causes it to throw an exception when uploading the file. I'll fix that at some point. It just means I have to manually upload the files to S3. The other problem is logistical: Neither of our computers is on all the time. To add one more step to my baroque script, I set myself a reminder so that I know to set up my computer during the appropriate time period. I wonder if iCal is scriptable …

In addition to all the stuff this script is supposed to do, the script passes its config file through Ruby's ERB system. That means that I can actually set up my config file so that the start times are programmatically driven (e.g., 9:00am on the coming Saturday).
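As a rough illustration of what an ERB-driven config could look like, the sketch below renders a schedule whose dates are computed at load time. The helper coming_saturday, the method render_schedule, the entry name, the times, and the URL are all assumptions made for this example; only the ERB-then-YAML order comes from the script.

```ruby
require 'erb'
require 'yaml'
require 'date'

# Date of the next Saturday on or after `today` (Date#wday is 6 for Saturday).
def coming_saturday(today = Date.today)
  today + ((6 - today.wday) % 7)
end

# Hypothetical config: the ERB tags are evaluated before YAML ever sees the text.
SCHEDULE_TEMPLATE = <<~CONFIG
  - name: Example Opera Broadcast
    start_time: "<%= coming_saturday(today).strftime('%Y-%m-%d') %> 09:00"
    end_time: "<%= coming_saturday(today).strftime('%Y-%m-%d') %> 13:00"
    from_url: http://example.com/stream
CONFIG

# Runs the template through ERB, then parses the result as YAML.
# `today` is visible to the template through the binding.
def render_schedule(template, today = Date.today)
  YAML.safe_load(ERB.new(template).result(binding))
end

puts render_schedule(SCHEDULE_TEMPLATE).first['start_time']
```

Because the dates are recomputed every time the file is parsed, the same config keeps working week after week without edits.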

I'd still like the Met to do a podcast of their own. I'd even still pay for it. But until they do, I not only have their broadcasts in a podcast, I have a wealth of others.

Here's the script, with various sensitive bits taken out. One thing I've found useful is to put my S3 connectivity information in a separate, included script so that I can distribute the main script and not accidentally include my S3 credentials. On the off chance you want to use this script, you'll need a file named awsinfo.rb in the same directory as this script that defines three constants: S3_BUCKET_NAME, S3_ACCESS_KEY, and S3_SECRET_ACCESS_KEY.

require './awsinfo'
require 'rubygems'

require 'appscript'
include Appscript

require 'fileutils'
require 'time'
require 'yaml'
require 'aws/s3'
require 'erb'
require 'rss/1.0'
require 'rss/2.0'
require 'rss/maker'

class File
 def name
     pieces = self.path.split('/')
     pieces[pieces.length - 1]
 end
end

SLEEP_INTERVAL = 60
PODCAST_FILE = "podcast.xml"
PODCAST_URL = "https://s3.amazonaws.com/#{S3_BUCKET_NAME}"

$schedule_file = 'opera_schedule.yaml'
$is_recording = false
$current_schedule = nil

#constants are defined in awsinfo
def start_s3
 AWS::S3::Base.establish_connection!(
    :access_key_id     => S3_ACCESS_KEY,
    :secret_access_key => S3_SECRET_ACCESS_KEY
  ) 
end

def stop_s3
 AWS::S3::Base.disconnect!
end

def parse_time(time_string)
  regex = /^(\d*?)-(\d*?)-(\d*?)\s*?(\d*?):(\d*)/
  year = time_string[regex,1].to_i
  month = time_string[regex,2].to_i
  day = time_string[regex,3].to_i
  hour = time_string[regex,4].to_i
  minute = time_string[regex,5].to_i
  Time.local(year,month,day,hour,minute)
end

def parse_schedule_file(filename = $schedule_file, time = Time.new)
  # the config goes through ERB first, so entries can compute values from `time`
  File.open(filename, 'r') { |file| YAML::load(ERB.new(file.read).result(binding)) }
end

#side effect: sets $current_schedule if appropriate
def should_be_recording(time = Time.new)
  schedule = parse_schedule_file($schedule_file, time)
  schedule_flag = false
  schedule.each do |schedule_entry|
    start_time = parse_time(schedule_entry['start_time'])
    end_time = parse_time(schedule_entry['end_time'])
    # cover? only compares against the endpoints; a Time range cannot be iterated
    if (start_time..end_time).cover?(time) then
      schedule_flag = true
      $current_schedule = schedule_entry
      break
    end
  end
  schedule_flag
end

def next_scheduled_task(time=Time.new)
    schedule = parse_schedule_file(filename=$schedule_file,time=time)
    sorted_schedules = schedule.sort {|a,b| parse_time(a['start_time']) <=> parse_time(b['start_time'])}
    return_schedule = nil
   
    sorted_schedules.each do |entry|
      if parse_time(entry['start_time']) > time then
        return_schedule = entry
        break
      end
   end
   return_schedule
end

def add_file_to_podcast(file, schedule_info = $current_schedule)
  xml_file = file_from_pieces(fstreams_dir, PODCAST_FILE)

  # create an empty feed the first time through
  if !File.exist?(xml_file) then
    rss = RSS::Maker.make("2.0") do |maker|
      maker.channel.title = "Derrick's Radio Podcast"
      maker.channel.link = "#{PODCAST_URL}/#{PODCAST_FILE}"
      maker.channel.description = "Radio programs captured by script"
      maker.items.do_sort = true # sort items by date
    end
    File.open(xml_file, "w") { |f| f.write(rss) }
  end

  # re-read the feed from disk and append an item for the new recording
  rss = RSS::Parser.parse(File.read(xml_file), false)

  item = RSS::Rss::Channel::Item.new
  item.title = schedule_info['name']
  item.date = File.mtime(file)
  item.link = "#{PODCAST_URL}/#{File.new(file).name}"
  item.pubDate = File.mtime(file)
  item.enclosure = RSS::Rss::Channel::Item::Enclosure.new(item.link, File.size(file), 'audio/mpeg')
  rss.items << item

  File.open(xml_file, "w") { |f| f.write(rss) }
end

#todo: use fstream.recording flag
def is_recording
 fstream = app('FStream')
 puts fstream.status
 $is_recording && fstream.status == 3
end

def fstreams_dir
  fstream_path = './fstreams'
  FileUtils.mkdir_p(fstream_path)
  Dir.new(fstream_path)
end

def file_from_pieces(dir,file)
  "#{dir.path}/#{file}"
end

def sync_dir
  dir = fstreams_dir
  start_s3
  dir.entries.each do |filename|
    next if filename =~ /^\..*/
    file_path = file_from_pieces(dir, filename)

    # if the file doesn't exist (or it's the podcast file), upload it
    # (aws-s3 takes the object key first, then the bucket name)
    if !AWS::S3::S3Object.exists?(filename, S3_BUCKET_NAME) || filename == PODCAST_FILE then
      File.open(file_path) { |f| AWS::S3::S3Object.store(filename, f, S3_BUCKET_NAME) }
    end
  end
  stop_s3
end

def s3_safe_name(english_name)
  english_name.gsub(/\s/,'_').downcase
end

def start_recording
 $is_recording = true
 puts 'Starting to record'
 fstream = app('FStream')
 fstream.openStreamWithURL($current_schedule['from_url'])
 fstream.startRecording
end

def stop_recording
 puts 'Stopping recording'
 fstream = app('FStream')
 fstream.stopRecording
 fstream.stopPlaying

  # find the file most recently created, and rename it
  dir = fstreams_dir
  filepaths = []
  dir.entries.each do |filename|
    next if filename =~ /^\..*/
    filepaths << file_from_pieces(dir, filename)
  end
  # sort! reorders in place, newest first (plain sort would discard its result)
  filepaths.sort! { |a, b| File.mtime(b) <=> File.mtime(a) }
  filepath = file_from_pieces(dir, s3_safe_name($current_schedule['name']) + ".mp3")

  FileUtils.mv(filepaths[0], filepath)

  # file can take a while to close
  while true
    begin
      sleep(10)
      File.mtime(filepath)
      break
    rescue StandardError
      puts "Waiting for #{filepath} to close"
    end
  end
  
  # update podcast file and S3
  add_file_to_podcast(filepath)
  sync_dir
  
  $is_recording = false
  $current_schedule = nil
end

def do_start_app
  while(true)
    if !is_recording && should_be_recording then
      start_recording
    elsif is_recording && !should_be_recording then
      stop_recording
    else
      next_sched = next_scheduled_task
      puts "#{Time.new} Next scheduled task is #{next_sched['name']} starting at #{next_sched['start_time']}" if next_sched
    end
    sleep(SLEEP_INTERVAL)
  end
end

$schedule_file = ARGV[0] if ARGV.length >= 1
do_start_app if !$testing_mode

Thursday, November 3, 2011

Partitioned Concurrency

Here is a common block of code in Java. Assume cache is a Map of some form.

if (!cache.containsKey(someKey)) {
  cache.put(someKey,someValue);
}

Or: if a given object is not in the map with a certain key, put it there. Otherwise, proceed along your merry way.

I was writing code like this in a controller the other day, because I wanted to cache Broadcaster objects (from the Atmosphere framework) that would manage separate chat channels. Requests would come in and get attached to one or another Broadcaster based on a channel ID.

But that code snippet, in its current form, isn't thread-safe. You could have two threads get past the if and thus both think that the map needs the object inserted. That isn't always a problem — my reflection-based API marshaling layer caches Field and Method objects that are constant for a given class, making me unconcerned about threads that replace values — but in this case, you could have the second thread replace the Broadcaster that the first one inserted, meaning that some chunk of the chat clients wouldn't see messages because they'd be attached to the wrong object.

The standard way to solve this is to put a mutex around the cache-modification code:

synchronized(this) {
  if (!cache.containsKey(someKey)) {
      cache.put(someKey,someValue);
  }
}

But this is code that is likely to have a lot of requests fired at it, and synchronizing this way introduces a major performance bottleneck. Every single request would have to wait for that lock. Yes, I know: Premature optimization is the root of all evil. Still, it seemed problematic. Synchronization at that high a level can cause major issues when you're dealing with hundreds of thousands of concurrent users, which we may very well be when we launch.

As I was pondering this, I remembered a technique for increasing concurrency. I had discussed it with friends before, but never implemented it. Still, it's straightforward enough. (In fact, it's how java.util.concurrent.ConcurrentHashMap is implemented, or near enough.) Note that this technique should also work in Objective-C and any other language with similar semantics, though I haven't tried it anywhere other than Java.

Here's the idea: If you've got some sort of cache, you only really need to make sure that two threads aren't working on the same cache entry at the same time. If I have a Broadcaster, and I attach it to some ID for the channel, I only care about isolating activity around that ID. I don't care if someone's working on some other ID. (Except when I do; see below.) In other words, if I'm working with an ID of 3, I don't care if some other thread is checking whether or not ID 1 already exists; I only care that someone else asking about ID 3 doesn't cause problems.

So if you could create a sequence of locks, and just ensure that anyone working on the same ID ends up synchronizing on the same lock, you're good to go.

Consider this code:

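import java.util.HashMap;
import java.util.Map;

public class ConcurrentCache {
    private static final Object[] locks = new Object[10];
    static {
        for (int lockIndex = 0; lockIndex < locks.length; lockIndex++) {
            locks[lockIndex] = new Object();
        }
    }

    // Caveat: a plain HashMap is not itself safe for concurrent modification,
    // so writers holding different locks can still collide inside the map.
    // One map per lock (or a ConcurrentHashMap) closes that gap.
    private final Map<Object, Object> cache = new HashMap<Object, Object>();

    public Object getOrInsertIntoCache(Object key) {
        // Mask off the sign bit: hashCode() may be negative, and a negative
        // remainder would be an invalid array index.
        int lockIndex = (key.hashCode() & 0x7fffffff) % locks.length;
        synchronized (locks[lockIndex]) {
            if (!cache.containsKey(key)) {
                cache.put(key, new Object());
            }
            return cache.get(key);
        }
    }
}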

The static initialization code gives you ten locks to work with, and you can get to one by just modding the hash code by the length of the array. Any given ID will always end up with the same lock (assuming a consistent hashCode result, which is true of Long, Integer, String, and the other types typically used as keys).
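Since the technique isn't tied to Java, here is a rough sketch of the same idea in Ruby, the language of the podcast script elsewhere on this page. It is a variation rather than a translation: each stripe pairs a Mutex with its own Hash, so no table is ever touched without its lock held. The class name StripedCache and its methods are made up for this example.

```ruby
# Lock striping: a key always hashes to the same stripe, and each stripe
# guards its own Hash with its own Mutex.
class StripedCache
  def initialize(stripe_count = 10)
    @stripes = Array.new(stripe_count) { { lock: Mutex.new, entries: {} } }
  end

  # Returns the cached value for key, building it with the block on a miss.
  # The check and the insert happen under the same stripe lock.
  def get_or_insert(key)
    stripe = stripe_for(key)
    stripe[:lock].synchronize do
      entries = stripe[:entries]
      entries[key] = yield(key) unless entries.key?(key)
      entries[key]
    end
  end

  # Sums the stripe sizes one lock at a time, so the total is only
  # approximate while other threads are inserting.
  def size
    @stripes.sum { |stripe| stripe[:lock].synchronize { stripe[:entries].size } }
  end

  private

  # Ruby's % with a positive divisor never returns a negative number,
  # so a negative hash still lands on a valid stripe.
  def stripe_for(key)
    @stripes[key.hash % @stripes.length]
  end
end

cache = StripedCache.new
workers = 10.times.map do
  Thread.new do
    500.times { cache.get_or_insert(rand(1..100)) { |id| "broadcaster-#{id}" } }
  end
end
workers.each(&:join)
puts "Cached #{cache.size} entries"
```

Threads asking about the same ID serialize on one Mutex, while threads asking about IDs in other stripes never wait on each other.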

To test this, I wrote a simple program that would put 5,000 tasks on an ExecutorService with ten threads. Each thread would generate a random number between one and 100. That would become the key to use on the cache. Depending on some command-line arguments, those threads would either lock on the cache or on a partitioned lock as above. Any given thread captured current system time when it was constructed (put on the queue) and then printed the difference between the new system time and the start time when it eventually ran. I ran the program and eagerly checked the average wait times each thread experienced.

Only to discover that they had almost identical performance. The partitioned code showed threads waiting a mere millisecond less, on average, than the "block on everything" code. And that was averaged over 5,000 jobs, remember. That's a lot of complexity for not a lot of gain.

I figured out what was going on with a bit of poking. The basic code says: See if the key's already there, and, if it's not, insert it. What that really meant was that 4,900 of my jobs called containsKey, saw that the key was there, and exited. I had an extraordinarily high cache hit rate. The lock was acquired and released so quickly that there simply wasn't noticeable lock contention.

Once I realized that, I made a simple change. After doing the cache logic, I simply had each thread sleep for one millisecond. And I ran my program again.

That produced the results I expected! Threads in the "block on everything" version waited almost precisely ten times as long, on average, as their partitioned cousins (29 seconds versus 3).

Real-world caches are messy things. In fact, my simple test case wouldn't really be done this way at all: You'd prefill the cache with the fixed items you wanted and avoid all this nonsense. But real caches need to expire items, cached results can be more or less complex, and so forth. The caching situation I have involves lots of different IDs with corresponding amounts of object churn. So the in situ results will likely be very different than an almost exact division. Still, it obviously made a big difference for time-consuming activity and at least didn't hurt performance in the simple case. You have more code complexity, which means more potential bugs and less maintainability, but in my system this logic is tucked into a class by itself, so no clients need to worry about these details: They just request a Broadcaster and don't worry about how it gets to them.

Still, if your cache handling is no more sophisticated than this, synchronizing on the whole thing was basically equal in speed, and you should therefore avoid the readability/maintainability cost altogether. If you're worrying about a one-millisecond difference across 5,000 tasks in your server code, you're farther along in your optimizations than I am.

There's also the question of what to do when you do care about the overall state of the cache. For instance, what if you want to get the size of this cache? The above technique won't work, because no one lock guarantees a fixed state. (Ignoring the reality that you probably don't actually care about the exact size of the cache: You simply want the approximation, in which case you're fine.)

Actually, there is one lock that will guarantee the state of the overall cache, and that's the cache itself. If you really care about exact counts, you can synchronize on the cache itself for anything that changes its size. For instance:

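import java.util.HashMap;
import java.util.Map;

public class ConcurrentCache {
    private static final Object[] locks = new Object[10];
    static {
        for (int lockIndex = 0; lockIndex < locks.length; lockIndex++) {
            locks[lockIndex] = new Object();
        }
    }

    private final Map<Object, Object> cache = new HashMap<Object, Object>();

    public Object getOrInsertIntoCache(Object key) {
        // Mask off the sign bit so a negative hashCode() still yields a valid index
        int lockIndex = (key.hashCode() & 0x7fffffff) % locks.length;
        synchronized (locks[lockIndex]) {
            if (!cache.containsKey(key)) {
                // anything that changes the size also takes the cache-wide lock
                synchronized (cache) {
                    cache.put(key, new Object());
                }
            }
            return cache.get(key);
        }
    }
}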

You re-introduce the global lock, but you minimize when it's acquired. If you have a high cache hit rate, this should still give you better concurrency while allowing for across-the-board thread safety. (Though, again, do you really care that your size is 99 and not 98?)

This isn't a new technique, but it was worth jotting down so I don't forget it.

Saturday, October 15, 2011

Faking Block Programming In Java

While I was writing some server-side code to talk to a web service, I did the total n00b thing of forgetting to close my connection when I was done with the call. This works fine in test cases, but in any sort of real-world situation, you'll quickly exhaust your connection pool as connections linger, unavailable, until they time out. And as I fixed the problem, I realized I had done the same thing in one other place — I haven't written client code in a while. Then I realized that I had a recurring pattern:


HttpMethod get = null;
try {
    get = new GetMethod(url);
    client.executeMethod(get); // client is an HttpClient instance

    // pull the response body out and do stuff with it
} finally {
    if (get != null) {
        get.releaseConnection();
    }
}

This ensures that, even in the case of an Exception, the connection gets released. Fairly straightforward. But who wants little copies of that code all over the system? And, worse, what if someone forgot to do this, just as I did? Anytime you set up code people have to remember to type, you ensure that someday someone will forget.

This would be an obvious use case for a closure. In fact, it reminded me of the File.open method in Ruby that takes a block of code as an argument. The method creates a File object, calls the block of code you pass in with said File object, and then closes the file even if there was an exception.
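For readers who haven't seen the Ruby idiom, the sketch below shows the shape of it applied to a connection. FakeConnection and with_connection are stand-ins invented for this example, not part of any HTTP library; the point is only that the ensure clause runs whether or not the block raises.

```ruby
# Stand-in for a real HTTP connection so the example runs on its own.
class FakeConnection
  attr_reader :released

  def initialize
    @released = false
  end

  def execute
    'response body'
  end

  def release
    @released = true
  end
end

# Executes the request, hands the response to the caller's block, and
# releases the connection afterwards, even if the block raises.
def with_connection(connection)
  yield connection.execute
ensure
  connection.release
end

connection = FakeConnection.new
length = with_connection(connection) { |body| body.length }
puts "Read #{length} characters; released: #{connection.released}"
```

The caller supplies only the interesting part, and the cleanup can't be forgotten because it lives in exactly one place.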

The only problem: pure Java doesn't support closures. (Some JVM-compatible languages like Scala do, however.)

But you can mimic the behavior to some degree with anonymous inner classes, and you can use Java Generics to provide type checking. I created a MethodOperator interface that looks like this:


public interface MethodOperator<T> {
    public T actOnMethodPostResponse(HttpMethod method) throws Exception;
}

The <T> and public T … bits basically mean that when I have to instantiate one of these, I can also declare it as being of some type, which then gets returned from the one method.

Once I had that code, I added a simple utility method:


    protected <T> T actOnHttpMethod(HttpMethod method, MethodOperator<T> operator) throws Exception {
        try {
            executeMethod(method); // utility method that checks for errors and so forth on the method
            // operator is typed as MethodOperator<T>, so no cast is needed
            return operator.actOnMethodPostResponse(method);
        } catch (Exception e) {
            log.error("Error talking to http server",e);
            throw e;
        } finally {
            if (method != null) {
                method.releaseConnection();
            }
        }
    }

And can invoke it with something like this:


 IdResponse response = actOnHttpMethod(post,new MethodOperator<IdResponse>() {
    public IdResponse actOnMethodPostResponse(HttpMethod method) {
        // unmarshal the response and create an IdResponse object with it
     }
 });

The actOnHttpMethod will do the request, hand my object the response, and then close the connection for me.

Inner classes definitely suffer from readability problems, but this setup ensures that it's very easy for developers to not even think about connection management. Furthermore, I can add features and have them automatically used by every client. For instance, I could profile the request/response time or add logging. If I ever want to add support for asynchronous calls, I can write a new utility method that does all the work of enqueuing the method and so forth, invoking the MethodOperator code as needed, and then change specific code to say actOnHttpMethodAsync or something instead of actOnHttpMethod. A minimal change in client code plus a utility method, and I've added a more scalable alternative in situations where I don't care about waiting for the response.

Once I refactored all that away, I then realized that I could refactor even more. At the moment, I handle a response in one of two ways: I either ignore it (for things like DELETEs) or I unmarshal the contents from XML into Java. Once I had my whole framework in place, I realized I could make implementations of the MethodOperator interface that would cover these two cases.

I created the following:


public class IgnoreResponseMethodOperator implements MethodOperator<Object> {
    public Object actOnMethodPostResponse(HttpMethod method) {
        return null;
    }
}

public class XmlUnmarshallingMethodOperator<T> implements MethodOperator<T> {
    @SuppressWarnings("unchecked") // unmarshal returns Object; the caller picks T
    public T actOnMethodPostResponse(HttpMethod method) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Constants.JAXB_PACKAGES);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return (T)unmarshaller.unmarshal(method.getResponseBodyAsStream());
    }
}

Now my client code actually looks like this:


       IdResponse response = actOnHttpMethod(post,new XmlUnmarshallingMethodOperator<IdResponse>());

I still get all the value of the code that manages connections around my code, but now I don't even have to worry about the unreadability of anonymous inner classes.

(You could make the case that this will create a lot of object churn. If it does, I can look at making a thread-safe implementation that will let me re-use the MethodOperator objects. That's very easy to do with the IgnoreResponseMethodOperator, but tougher with the type-safe XML unmarshaller. I imagine I'd have to create instances for each type of object I might get back. Given that there aren't too many, this wouldn't be too bad. But first I'll see if that's actually a problem.)

Saturday, September 24, 2011

Protovis And Wine Visualization: California Crush Statistics

Radio station visualizations are fun and all, but I realized that I should research data visualization by looking at data I actually care about. That way, I can provide context and ask deeper questions about the subject matter at hand.

As an occasional wine writer, data about the wine industry seemed like a good start.

Harvest — "crush" in wine industry jargon — is afoot here in California, and that spurred me to search for data on previous harvests. The National Agricultural Statistics Service publishes a range of interesting data for wine geeks, some of which I've been using for experiments and explorations with Protovis.

The first public one shows harvest statistics over 20 years for the 15 grapes with highest crush numbers in California in 2010. The interactive version gives you a deeper view, with detailed per-year statistics as you mouse over, but here's a static version to give you an idea.

Groovy, eh?

Wine geeks will know many of this visualization's stories well. The California wine industry has grown tremendously over the last 20 years, thanks to increased consumption in the United States. Grape gluts are periodic, but 2005 was a particularly grape-heavy year. Industrial grapes such as French Colombard, Rubired, and Ruby Cabernet are mainstays of the bulk-wine industry led by Gallo. Pinot Noir tonnage surpassed Syrah tonnage in 2008, about 4 years — when vines start producing worthwhile fruit — after Sideways, the movie that told everyone about Pinot Noir. (Though I should note that I prefer the Pinots of Oregon and the Sonoma Coast to those of Santa Barbara, the setting for the movie. But, really, I prefer the Pinots of Burgundy to those from anywhere else.)

But some items in the data surprised me. Merlot, a common Bordeaux variety, went from almost nothing in 1991 to a dominant grape in 2010. Grenache, the popular, fruity darling of the Rhone Rangers, has actually seen lower crush values in the last 20 years. Pinot Gris has gone from a nonexistent grape in California to one of the top 15 in the state in just over a decade. Tonnage of French Colombard has gone down, which makes me wonder how the industrial market is doing overall.

But if you're reading this blog, you're probably more interested in the technical aspects of this data. I used Protovis, and I have repeatedly found that getting a basic visualization up and running with the library is very fast. Getting the fine details right, however, is much slower. It takes a lot of trial and error to get the language to do what you want. I might switch to D3, its successor, for my next projects. It supposedly gives finer control over your visualization.

What I also keep realizing is that visualizing some set of data isn't really an issue. Organizing the data is. I know this isn't news to anyone who works with data, but these projects are good reminders of how much work that can be.

I started with 20 separate spreadsheets from the NASS and wrote a Ruby script to extract the bits of data I wanted and compile them into a JSON object I could serve to this chart's HTML page. But even then, the page's JavaScript has to do some processing as well to get the data in a format that Protovis can easily work with. The Underscore JavaScript library is a handy tool for doing data transformations.

But I also used that preprocessing to cache certain items such as the pretty-printed numbers, the colors to use for the different areas (which I calculated with the excellent 0to255.com) and other useful items.
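
A rough sketch of that kind of preprocessing, written in Ruby for illustration even though the caching described above happened in the page's JavaScript. The rows, field names, and helper names (`pretty_number`, `build_chart_data`) are all made up for the example; it only shows the idea of grouping rows per variety and computing display labels once, up front.

```ruby
require 'json'

# Made-up rows standing in for what the spreadsheet extraction might yield.
ROWS = [
  { year: 2009, variety: 'Pinot Noir', tons: 1_200 },
  { year: 2010, variety: 'Pinot Noir', tons: 12_345 },
  { year: 2010, variety: 'Merlot', tons: 1_234_567 }
].freeze

# Format a non-negative integer with thousands separators.
def pretty_number(number)
  number.to_s.reverse.scan(/\d{1,3}/).join(',').reverse
end

# Group rows by variety and attach the display label once,
# so the chart code never has to format numbers while drawing.
def build_chart_data(rows)
  rows.group_by { |row| row[:variety] }.map do |variety, entries|
    values = entries.sort_by { |entry| entry[:year] }.map do |entry|
      { year: entry[:year], tons: entry[:tons], label: pretty_number(entry[:tons]) }
    end
    { variety: variety, values: values }
  end
end

puts JSON.pretty_generate(build_chart_data(ROWS))
```

Colors could be cached the same way, as one more key on each variety's hash.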

Saturday, September 3, 2011

Radio Station Playlist Data Visualization, Part 2

As soon as I did my visualization of 99.7's music selection for a week, I asked the obvious next question: How does 99.7 compare to other "adult contemporary" radio stations?

There's an interactive version that lets you drill down into the graph, but here's a screenshot.

People listen to radio stations for all sorts of reasons, of course, so I don't know that anyone actually cares about this. But it did give me a chance to look at Protovis and compare it to Processing as I learn about data visualization toolkits.

Gathering Data

When I gathered data for my first visualization, I wrote a simple script that grabbed songs from the 99.7 website. I set that up as a cron job on an EC2 instance and let it go.

I did the same thing for the other four radio stations I decided to look at. 97.3 uses the same website tech as 99.7, and KFOG and KBAY share a different website tech, so those got me two stations for the price of one. 101.3 uses yet another system. Once I had my scripts running, I just had to wait until I had the same week's worth of data from all stations. A bit of cleanup on the data, a quick change to JSON from comma-separated values, and I was ready to go.
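
The cleanup and CSV-to-JSON step could look something like the sketch below. It is not the script used for the chart: the column names (`station`, `played_at`, `artist`, `title`), the sample rows, and the `csv_to_plays` helper are assumptions, and the "cleanup" here is nothing more than trimming stray whitespace.

```ruby
require 'csv'
require 'json'

# Hypothetical sample of scraped rows; the real files would be much longer.
RAW_CSV = <<~CSV
  station,played_at,artist,title
  99.7,2011-08-22T08:05:00, Lady Gaga ,The Edge of Glory
  99.7,2011-08-22T08:09:00,Adele,Rolling In The Deep
  KFOG,2011-08-22T08:06:00,Example Artist,Example Song
CSV

# Parse the CSV text into an array of hashes, trimming whitespace on every field.
def csv_to_plays(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    {
      station: row['station'].strip,
      played_at: row['played_at'].strip,
      artist: row['artist'].strip,
      title: row['title'].strip
    }
  end
end

plays = csv_to_plays(RAW_CSV)

# One key per station makes it easy to draw one small graph per station.
by_station = plays.group_by { |play| play[:station] }
puts JSON.generate(by_station)
```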

I decided to use the concept of small multiples to provide a quick comparison between stations, with an enlarged version of each graph available for deeper exploration. Each small graph in the chart represents one station across the same span of time.

Protovis Vs. Processing

It took me some time to learn Protovis. I feel that only now, after finishing one visualization, do I really have a grasp on how it works. It seeks to be a declarative language, which means that you define the result and let the under-the-hood bits figure out how to get you there, but I found myself struggling against the lack of control.

Processing gives you that control. You have vast amounts of control, but that's because it starts you with a blank slate. You can probably do anything you want, but the flip side is that you have to do everything you want.

But Processing comes with a strong disadvantage: It creates Java applets. Remember those? I barely do, and I was actually writing Java when that's all people did with it. An applet takes a long time to load in a world where website visitors are accustomed to instant gratification from your page. An applet also won't work on your iOS device. So my first visualization was completely unusable by iPad owners.

(Yes, there is Processing.js, but my attempts to use it only frustrated me. It didn't support Java generics, and even when I removed them from my code, it failed with cryptic errors that were impossible to debug.)

As with so many things, deciding on a visualization toolkit means figuring out what's best for your job. If you're doing something complex and custom, you'll probably want Processing. But for a lot of web-based visualizations, I think Protovis will give you what you need once you figure out how to use it. It can certainly do a lot in that space.

I have still more visualizations in mind for this same set of data, and I'm planning on starting with Protovis (or its successor, D3). The Java applet problems are too big.

Wednesday, August 24, 2011

Hiring Online Engineers

(By the way, Maxis is hiring.)

I've interviewed a lot of candidates over the years. At my last job, we seemed to have an unending stream of candidates because we had learned to be picky. And I didn't even do the phone screens that kept the numbers down. But I did do what we affectionately called the "make them cry" part of the interview, drilling down on Java knowledge as much as I could.

At Maxis, we have less candidate churn, but I now have to sift through the resumés and do the phone screens. And I'd like to pass along some advice about getting your resumé to the top of my pile. None of these are absolutes, but the more flaws in your resumé, the greater your strengths need to be.

First, have someone proofread your resumé. I don't care if English is your first language or not; get a good editor to give you feedback. I'll wince at one typo but let it slide. Much more than that, and I begin to wonder if your code will be as unprofessional as your text. A recent candidate said in the header of his resumé that he's a "self-mortification" person. I'm guessing he meant self-motivated, but clearly not self-motivated enough to have someone sanity check his text.

In a similar vein, make sure you use the correct terms when discussing things you know. You do not "program in" AJAX, HTML, XML, or CSS. You may understand them, but it doesn't look like you do if you call them programming languages.

Next, keep your resumé brief. I've always liked the saying, "the only thing on the second page of a resumé should be the Nobel Prize you won," but I am in the minority. The style du jour is to make your resumé as long as possible.

I'll grudgingly accept a two-page resumé even without the Nobel Prize, but a recent applicant I saw had a many-page resumé with this line item: "Coded Perl functions, invoking subroutines and functions calling functions." In other words, you did some programming? And that was simply the most ridiculous in a long list. "Set up Object-Relational Mapping with Hibernate" is another common line item. For those who don't know, this involves writing a config file. If you have items such as these on your resumé, you're padding, and I'll think you need to pad because you don't have any real skills.

If you're a mid-level Java engineer with some enterprise software experience, I'm afraid your resumé looks like approximately 100,000 others. Spring, Hibernate, JUnit, Struts, MySQL, Oracle. I'm yawning already. So make your projects sound interesting. We don't all get to work on famous video games, but before I did, I worked on projects you've never heard of. And I made them sound neat. I focused on the compelling problems and described those. If you can't find interesting problems in your work, you're not a programmer I want to work with: Every problem is interesting in its own way.

I also look for personal projects on a resumé. Yes, people have families and whatnot. But think of great writers. They don't just write because they are required to: They write because they need to.

If you're working on personal projects, no matter how esoteric, you're telling me that you're so in love with programming that you can't simply kick it aside when you clock out. You're telling me you love the craft, the problem-solving, the tinkering. You're telling me that you're a programmer I want to work with.

Sunday, August 7, 2011

Visualizing My Bike Rides

My three main interests are programming, writing, and food and wine.

You'll notice exercise isn't on that list. In fact, quite the opposite: My primary interests are all anti-exercise. But I grudgingly acknowledge its value, and I've decided to "trick" myself into exercising more by turning it into a programming project.

I live close enough to work that I can commute on my bike, and so I've started a project to gather data on my rides and do interesting things with that data, inspired in part by Cooper Smith's visualization of Nike+ data from New York City. Since I need a lot of data to make useful visualizations, I'm riding more consistently to get it. And hopefully by the time I have enough data, the bike riding will seem routine.

I picked up a couple of bike tracking apps for the iPhone, but have settled on Abvio's Cyclemeter, based on the recommendation of a co-worker who is both a data geek like me and an avid cyclist. You press Start on your phone, ride your bike, and press Stop when you're done. It gives graphs, maps, and all sorts of other goodies.

Getting to the data is then just a minor step: All of these apps seem to support exports in KML and GPX. Since these are actually just my rides, that data isn't all that interesting by itself. I know how I get to work.

But with that data, I can create meta-analyses. For instance, how does my speed look across a given ride? Here's a ride I took from the Saturday Berkeley farmers market to Berkeley Bowl, our preferred grocery store.

Green lines indicate places where I was faster than my average speed for that ride. Red lines indicate places where I was slower. I add in the Start and Stop pins, and also provide meta information about the ride off of the extra data in Cyclemeter's GPX file: total distance, average speed, and so forth.

Ruby made this work pretty straightforward. I use Nokogiri to parse a GPX file and calculate the velocity between subsequent points. Each velocity item has the coordinates and timestamps of the two points as well as the calculated velocity. I then use an ERB template for the KML I want to create. That ERB template sets up the styles and other items, and then uses the state variables to construct the line segments, the start/stop pins, and other items.
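
To make the velocity step concrete, here is a minimal sketch of computing speed between subsequent points and tagging each segment green or red. It leaves out the Nokogiri parsing and the ERB template, and it is not the actual script: the point hashes, the haversine distance, the choice to compute the ride average from the segments themselves, and treating "exactly average" as red are all assumptions.

```ruby
EARTH_RADIUS_M = 6_371_000.0

# Great-circle distance in meters between two lat/lon pairs (haversine formula).
def haversine_m(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
end

# Build one segment per pair of consecutive track points.
def segments(points)
  points.each_cons(2).filter_map do |from, to|
    seconds = to[:time] - from[:time]
    next if seconds <= 0 # skip duplicate or out-of-order fixes

    meters = haversine_m(from[:lat], from[:lon], to[:lat], to[:lon])
    { from: from, to: to, meters: meters, seconds: seconds, mps: meters / seconds }
  end
end

# Green where the segment beats the ride's average speed, red otherwise.
def colorize(segs)
  average = segs.sum { |seg| seg[:meters] } / segs.sum { |seg| seg[:seconds] }
  segs.map { |seg| seg.merge(color: seg[:mps] > average ? :green : :red) }
end

# Hypothetical track points, as they might look after parsing a GPX file.
POINTS = [
  { lat: 37.8700, lon: -122.2700, time: Time.utc(2011, 8, 6, 10, 0, 0) },
  { lat: 37.8705, lon: -122.2700, time: Time.utc(2011, 8, 6, 10, 0, 10) },
  { lat: 37.8707, lon: -122.2700, time: Time.utc(2011, 8, 6, 10, 0, 20) }
].freeze

colorize(segments(POINTS)).each do |seg|
  puts format('%.2f m/s -> %s', seg[:mps], seg[:color])
end
```

Each colored segment carries both endpoints, which is what a KML template would need to draw one styled line per segment.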

Eventually, I want to add arrows indicating the direction (which is more useful when you're looking at lots of overlapping routes), pins for the slowest point and the fastest point, and other items.

That will do for individual routes, but I also plan to start aggregating my rides to show even more data.

Sunday, July 24, 2011

Visualizing A Week Of 99.7

When we're driving, Melissa and I often listen to Bay Area radio station 99.7 FM. It specializes in dance-focused pop, club, and hip-hop songs. In other words: boppy, brainless music.

But at any given point in time, it feels like they just replay the hits du jour. How much diversity do they really have? I wanted to know.

The answer? Not much. A mere nine songs made up half the station's rotation in the week I measured. During that week, about half the songs you would have heard would have been one of those nine songs.

I also made an interactive version that requires a Java-enabled browser. It shows the song titles as you hover over them.

Here are the top 9 songs:

Song Title	Artist	Times Played
The Edge of Glory	Lady Gaga	100
I Wanna Go	Britney Spears	75
Till The World Ends	Britney Spears	69
How To Love	Lil Wayne	59
Stereo Love	Gym Class Heroes	54
Rolling In The Deep	Adele	53
Written In The Stars	Tinie Tempah	52
The Lazy Song	Bruno Mars	46
Cheers (Drink to That)	Rihanna	39

To gather the data for this chart, I used the station's published playlist, which lists 25 songs at a time. I set up a micro-instance on Amazon.com's EC2 to run a Ruby script every 45 minutes that fetched that playlist page, extracted the information I wanted (via regexes and the excellent Nokogiri library) and appended it to a file.

Then I set up a node.js server* that returned a cleaned-up version of the raw playlist data I amassed. It removed duplicates caused not only by my data fetching script, whose 45-minute interval meant that there was always some degree of overlap, but also by a peculiarity in the data. Remixes on the site get two or more entries with the same timestamp, one for the original song and one for the remix title, and I collapsed those down into the original song. A remix of Britney Spears' "Till the World Ends" might be different in some ways than the original, but to me it counts as playing the same song. I've published the final dataset I used for this chart.
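
The cleanup logic can be sketched in a few lines. This version is in Ruby rather than node.js, and it is only an illustration: the entries are invented, and picking the shortest title among entries that share a timestamp is a stand-in heuristic for "the original song", not necessarily the rule the real server used.

```ruby
# Hypothetical raw entries: an exact duplicate from overlapping fetches,
# plus a remix listed at the same timestamp as the original.
RAW = [
  { played_at: '2011-07-18 08:05', artist: 'Britney Spears', title: 'Till The World Ends' },
  { played_at: '2011-07-18 08:05', artist: 'Britney Spears', title: 'Till The World Ends (Remix)' },
  { played_at: '2011-07-18 08:05', artist: 'Britney Spears', title: 'Till The World Ends' },
  { played_at: '2011-07-18 08:09', artist: 'Adele', title: 'Rolling In The Deep' }
].freeze

def clean(entries)
  entries
    .uniq                                   # drop exact repeats from overlapping fetches
    .group_by { |entry| entry[:played_at] } # entries sharing a timestamp are one play
    .map { |_, same_time| same_time.min_by { |entry| entry[:title].length } }
    .sort_by { |entry| entry[:played_at] }
end

clean(RAW).each { |entry| puts "#{entry[:played_at]}  #{entry[:artist]} - #{entry[:title]}" }
```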

Along the way, I discovered two lacunae in the website's playlist — there are probably more — which affect the numbers a bit. As Nathan Yau says in his new book Visualize This, "Just because it’s data doesn’t make it fact."

Katy Perry's "Last Friday Night" and Pitbull's "Give Me Everything" got plenty of rotation on the station during this week but never showed up in the published playlist. I checked my raw data, the cleaned-up form, and did spot checks on the site whenever I heard one of those songs. They're just skipped. I assume this is some discrepancy in the database, since other songs from the same Katy Perry album show up in the list.

I don't know that it would change the numbers very much. There might be a more gradual drop-off from "Edge of Glory" to "I Wanna Go," but, if anything, the halfway mark would be closer in. The gap doesn't change the premise: 99.7 replays a lot of the same music.

For visualizing the data, I put it into a couple of tools -- a custom tool I'm writing as well as R -- but ultimately decided on Processing, the big gun in any data visualizer's arsenal. Processing is a full programming language aimed specifically at making digital images, with an emphasis on visualization. I could both fully churn through and munge the semi-raw data and quickly visualize it, all with the same tool. And since Processing is basically Java with some handy utility methods, I'm already very comfortable with the language.

Inspired by Yau's book, which encourages a storytelling mindset, I decided to add visual cues and callouts for "points of interest" to my graphic: the most popular song, the cutoff line for the songs that made up fifty percent of the total, and the cutoff line for the songs that made up eighty percent of the total.

Because Processing is a programming language, I drove everything off the data itself. While I obviously had to program the callouts I wanted on the chart, I don't have a line that says, "Draw 'Edge of Glory, Lady Gaga' at these coordinates." Instead I have a line that says, "draw the name of the song that got played the most next to the leftmost bar." I used the same mindset for all the callouts. Change the dataset, and the callouts change with it.
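
Here is one way the data-driven callouts could be computed, sketched in Ruby rather than Processing. The play counts are placeholders (not the real dataset), and `cutoff_rank` and `callouts` are hypothetical helper names. The point is only that nothing about a specific song is hard-coded.

```ruby
# Placeholder play counts; swap in any dataset and the callouts follow.
PLAY_COUNTS = {
  'Song A' => 40,
  'Song B' => 25,
  'Song C' => 15,
  'Song D' => 10,
  'Song E' => 6,
  'Song F' => 4
}.freeze

# How many of the most-played songs it takes to reach `percent` of all plays.
# Integer math avoids floating-point surprises right at the boundary.
def cutoff_rank(play_counts, percent)
  total = play_counts.values.sum
  running = 0
  play_counts.values.sort.reverse.each_with_index do |count, index|
    running += count
    return index + 1 if running * 100 >= percent * total
  end
  play_counts.length
end

def callouts(play_counts)
  top_title, top_plays = play_counts.max_by { |_, plays| plays }
  {
    most_played: "#{top_title} (#{top_plays} plays)",
    half_cutoff: cutoff_rank(play_counts, 50),
    eighty_cutoff: cutoff_rank(play_counts, 80)
  }
end

puts callouts(PLAY_COUNTS)
```

The drawing code would then place the labels relative to the bars at those ranks instead of at fixed coordinates.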

Once you have all this shiny, pretty data, you start looking at other ways to explore it. For instance, what's the average number of songs played in each hour? A little bit of modification to my Processing program, and I had a new chart ready to go from the same data.
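
As an illustration of how little it takes to ask a new question of the same data, here is a Ruby sketch (again, not the Processing code) that averages plays per hour of the day. The timestamps and their format are invented, and `average_plays_by_hour` is a hypothetical name.

```ruby
require 'time'

# Average number of plays for each hour of the day, across all days in the data.
def average_plays_by_hour(timestamps)
  times = timestamps.map { |stamp| Time.parse(stamp) }
  day_count = times.map { |time| time.strftime('%Y-%m-%d') }.uniq.length
  times.group_by(&:hour)
       .transform_values { |plays| plays.length.to_f / day_count }
       .sort
       .to_h
end

# Invented sample: two days of data with a handful of plays.
SAMPLE = [
  '2011-07-18 08:05', '2011-07-18 08:20',
  '2011-07-19 08:10', '2011-07-19 09:00'
].freeze

average_plays_by_hour(SAMPLE).each do |hour, average|
  puts format('%02d:00  %.1f plays', hour, average)
end
```

Grouping on `wday` instead of `hour` would answer the weekday-versus-weekend kind of question in the same way.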

Then Melissa and I wondered if Bruno Mars' "The Lazy Song" gets played a lot more on weekends, since it's, well, about being lazy and deciding to do nothing with your day. Not really. As a rule, expect to hear it six to eight times a day at the moment, and not more on weekends.

I have more ideas for this, but they're going to take a bit more data collection, so stay tuned for more in the coming weeks.

*There was no need to use node.js here. It just gave me a chance to play with it for something deeper than the "Hello, World!" example.

Monday, May 9, 2011

More With Google's WebGL Globe: Legen - Wait For It - Dary

In my last post, I talked about Google's WebGL data visualization globe and briefly mentioned the "legend" format for the data array.

I finally got a chance to do something with it, and it is both simple and powerful. If you're in legend mode, your big data array is four pieces of data at a time, not three. The first two pieces are latitude and longitude. The third is the magnitude of the line that will be drawn (divided by 200 to compensate for their multiplying by 200). So what is the fourth value?

Anything you want. When you specify legend mode, the color function you pass in to the globe's constructor gets that fourth value for each point on the globe. That in turn allows you to define the color of the line based on something other than magnitude (the default).

What that really means is that you get an extra dimension in your data. Height is always height, but legend mode allows you to add non-height information based on the color of the line.

Google's initial example — search language by volume — is a great example. The color of the line comes from the dominant language of the area. Looking at their globe, you can immediately figure out where the high search volume comes from (big cities, mainly) just by looking for the tall lines. You can also see which languages dominate search in a given area. English (blue) covers the United States and the United Kingdom. French (light green) covers France and also Quebec. Portuguese (dark green) dominates Portugal and Brazil, and also Madeira. Likewise, the Canary Islands are yellow, because they're a part of Spain.

For a new visualization at work I did with the globe (as usual, NDAs make me cautious about giving more exact details), people at my studio wanted to see the height of the line represent the number of events at a given location. But they wanted the color to represent whether the average data from that spot represented a "good" or "bad" user experience. Little red lines would be unfortunate, but might not get flagged as high priority. Big red lines would be a problem, because it would mean that a lot of players were having a bad experience. Fortunately, there weren't any of those, but there were some long yellow lines, which suggests an area where we could improve the player's interaction with our game.

To get that view, I set the globe to legend mode and set the magnitude field of each point to the sample size. Then I set the fourth data point to a very heterogeneous number that represented the user experience. My color function looks at that number and puts it into a bucket. Good returns green, bad returns red, and a middle ground returns yellow.
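
The shape of that data and the bucketing logic can be sketched as follows. The real color function is JavaScript passed to the globe, so this Ruby version only mirrors the logic. The thresholds, the 0-to-1 score scale (higher meaning worse), and the field names are all assumptions for the example.

```ruby
# Hypothetical thresholds on a 0..1 "badness" score.
GOOD_MAX = 0.3
BAD_MIN = 0.7

# The globe multiplies magnitude by 200, so divide first to compensate.
GLOBE_SCALE = 200.0

# Put the fourth value into one of three buckets.
def bucket_color(score)
  if score >= BAD_MIN
    :red
  elsif score > GOOD_MAX
    :yellow
  else
    :green
  end
end

# Legend format: latitude, longitude, magnitude, then the extra value, repeated.
def legend_series(samples)
  samples.flat_map do |sample|
    [sample[:lat], sample[:lng], sample[:count] / GLOBE_SCALE, sample[:score]]
  end
end

SAMPLES = [
  { lat: 37.8, lng: -122.3, count: 400, score: 0.9 },
  { lat: 51.5, lng: -0.1, count: 100, score: 0.1 }
].freeze

p legend_series(SAMPLES)
SAMPLES.each { |sample| puts "#{sample[:lat]}, #{sample[:lng]}: #{bucket_color(sample[:score])}" }
```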

That means each line on my globe gives three pieces of information: location on the planet, sample size from that location, and quality of user experience from that location. It's easy to spin the globe and look for red hot spots. It's also just a pleasure to play with the visual, even though you can find the worst spots pretty quickly.

One person saw my work and suggested adding a calculation that would make the line more or less red, for instance, depending on how far over the bad threshold it went. Bright red would mean a really bad experience; dim red would be right at the line. I guess that would give us three and a half data points. Before, we knew it was bad. Now, we'll know how bad it is.
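
That suggestion amounts to a small formula: scale the red channel by how far past the threshold the score sits. A hedged sketch, using a made-up 0-to-1 score, a made-up threshold, and an arbitrary "dim" floor of 0.4:

```ruby
BAD_THRESHOLD = 0.7 # hypothetical score at which an experience counts as bad
WORST_SCORE = 1.0   # hypothetical worst possible score
DIM_RED = 0.4       # red channel right at the threshold
BRIGHT_RED = 1.0    # red channel at the worst score

# Map a bad score to a red-channel intensity between DIM_RED and BRIGHT_RED.
def red_intensity(score)
  overshoot = (score - BAD_THRESHOLD) / (WORST_SCORE - BAD_THRESHOLD)
  DIM_RED + (BRIGHT_RED - DIM_RED) * overshoot.clamp(0.0, 1.0)
end

[0.7, 0.85, 1.0].each do |score|
  puts format('score %.2f -> red %.2f', score, red_intensity(score))
end
```

Clamping keeps scores outside the expected range from producing colors outside the valid channel range.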

Legend mode makes this all possible.

Thursday, May 5, 2011

Working With Google's WebGL Globe

Today, Google released a data visualization that shows search volume by language across the globe. It uses WebGL, so you'll need a recent version of Chrome and decent video drivers to see it.

It's different than the normal "things on a globe" visualizations you see in — for instance — Google Earth, because it incorporates height as an additional dimension. Google Maps and Google Earth give flat perspectives: You can only guess magnitude based on clusters of the familiar red, upside-down teardrops.

By itself, the visualization would be a five-minute folderol to enjoy on a lunch break. But Google's Data Arts team released all the source. Take your own data, muck with it a bit, and you too can have an interactive globe dripping in an oh-so-modern, shades-of-black aesthetic.

I took the bait.

Over the course of today, I took some of our game analytics data and built a local WebGL globe visualization. We have an established workflow for creating latitude/longitude tables from our data — this isn't the first map visualization I've done — and I built on top of that. Once I extracted the information I wanted, I wrote a quick Ruby script that converts the exported data into the format their code looks for.

I can't share a link with you — it's on our internal network — and I can't tell you what I mapped. But I can tell you that it got lots of oohs and aahs in the office.

I can also tell you that it wasn't a simple "replace their data with our data" exercise. If you're planning on doing something with the tool, here are a few things I figured out by trial and error.

Don't rely on their README: It's wrong. At least, the JSON format is. (Note: This has since been fixed.) It's much better to read the globe.js file to see what it wants, though that requires you to know JavaScript. Rather than a complex set of nested arrays, the code prefers one long array with latitude, longitude, and magnitude strung together like beads on a string. (There's an optional "legend" mode that requires a fourth point, but I haven't played with it. I assume it lets you define different data series. Their addData method takes the data block and a set of options, and one of those options is format: The default is magnitude, with three data points, and you can specify legend, with four data points.)
A portion of their code multiplies the magnitude by 200 to ensure the small numbers in their data — percentages, I suppose — become big enough for bars that reach high into the sky. For our particular data set, I had to divide the number by 200 in order to get the bars to be correct once they multiply it by 200.
The default coloring of the lines relates to the height, but it looks like you can pass a custom "get the color" function for different logic.
Their source code relies on a top-level directory on your website named /globe for the location of the map image that wraps onto the sphere. You can change that easily enough. Search for "imgDir" in the source.
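
For anyone attempting the same conversion, the core of it is small. The sketch below is not the script from work: the input rows and field names are invented, and it only produces the flat latitude, longitude, magnitude array described above, leaving out any wrapping the globe's loader might expect around it.

```ruby
require 'json'

# The globe code multiplies magnitude by 200, so pre-divide to cancel that out.
GLOBE_SCALE = 200.0

# Flatten rows into one long array: lat, lng, magnitude, lat, lng, magnitude, ...
def to_magnitude_series(rows)
  rows.flat_map { |row| [row[:lat], row[:lng], row[:value] / GLOBE_SCALE] }
end

# Invented rows standing in for an exported latitude/longitude table.
EXPORTED_ROWS = [
  { lat: 37.8, lng: -122.3, value: 100 },
  { lat: 51.5, lng: -0.1, value: 50 }
].freeze

puts JSON.generate(to_magnitude_series(EXPORTED_ROWS))
```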

Tuesday, March 15, 2011

Sunday, January 16, 2011

One-Button Loadtesting With EC2

I recently realized a minor ambition.

As I've said before, EA's loadtesting group doesn't meet all my needs: They come in near the end of a project and test the nearly-finished system, which means that it's a big headache all around. Fixing performance issues means rummaging through months — or even years — of code to try and suss out issues they're seeing. Also, the mere act of getting them up and running is usually a big disruption for one or more members of your team.

I had loadtesting scripts running already, but they suffered in a few ways. JMeter is good at what it does, but I want easier programmatic control of the scripts (especially for measuring things it's not so good at, such as the lag time on asynchronous processes under load). Also, I didn't have a loadtesting environment. I was running the scripts on a dev server against that same server. I could have put together an environment of multiple servers, but that would incur expense to the studio for machines that were largely idle.

Then I read an article somewhere about a company that uses EC2 as a dynamic loadtest environment. Yes, I thought, that's what I want.

Off and on over the last few weeks, I put together a Ruby script that can, with a single command, set up a full environment in Amazon's cloud-based servers. That one command instantiates an RDS database, an EC2 server that will act as our application server, and another EC2 server that acts as an instance for running The Grinder, a loadtesting tool that another team in the studio is using and that I liked for its extensibility. That one command also builds a custom version of our war file (to handle all the dynamic addresses you get from EC2/RDS), sets up access permissions between all the servers, and runs checks to make sure the whole environment is ready to go.

A second command fires off the loadtest itself, running the test script through The Grinder and collecting the results back to the local machine.

Finally, a third command shuts the whole system down. I do have one command that runs all three steps, but at the moment I tend to run each one manually.

Each run takes less than an hour for now, so the cost to the studio is on the order of $0.28 per loadtesting run. Not something you'd want to run continually, but certainly something you could run once a day. As we move along in development, we'll probably need to spend more to set up bigger, more realistic environments, but that will also be when the studio has more budget for my project. The environments we need will scale up alongside the money we have.

One of the key pieces that I set up was the concept of a profile. A config file lets you specify which script to run and how many of each item to set up. So, for my first profile, I just set up one each of the different servers. But you could imagine some profile later down the road that sets up 2 of each machine with some sort of autoscaling system. And you could imagine one much further down the road that sets up something approximating our production environment and runs tests against it. All that is mostly supported.
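
To give a feel for the profile idea, here is a minimal sketch. It is not the real tool: the YAML keys, the script path, the `Profile` struct, and the `run_all` helper are all hypothetical, and the three steps are stand-in lambdas rather than actual EC2 or RDS calls.

```ruby
require 'yaml'

# Hypothetical profile: which test script to run and how many of each machine.
PROFILE_YAML = <<~YAML
  name: one-of-each
  test_script: scripts/basic_session.py
  app_servers: 1
  grinder_agents: 1
  databases: 1
YAML

Profile = Struct.new(:name, :test_script, :app_servers, :grinder_agents, :databases,
                     keyword_init: true)

def load_profile(yaml_text)
  settings = YAML.safe_load(yaml_text).transform_keys(&:to_sym)
  profile = Profile.new(**settings)
  counts = [profile.app_servers, profile.grinder_agents, profile.databases]
  unless counts.all? { |count| count.is_a?(Integer) && count.positive? }
    raise ArgumentError, 'machine counts must be positive integers'
  end
  profile
end

# Run setup, the test, and teardown in order. The ensure block tears the
# environment down even if the test step fails, so nothing is left running.
def run_all(profile, steps)
  steps.fetch(:setup).call(profile)
  begin
    steps.fetch(:run).call(profile)
  ensure
    steps.fetch(:teardown).call(profile)
  end
end

# Stand-in steps that only print what a real implementation would do.
STEPS = {
  setup: ->(profile) { puts "starting #{profile.app_servers} app server(s) for #{profile.name}" },
  run: ->(profile) { puts "running #{profile.test_script}" },
  teardown: ->(profile) { puts "shutting down #{profile.name}" }
}.freeze

run_all(load_profile(PROFILE_YAML), STEPS)
```

A bigger profile would then be a different YAML file, not a different script.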

The advantages to this system are huge. First, it makes real loadtesting feasible even early on in our project. But it also empowers developers to do rapid iteration on performance fixes without having to push them live and see if they make a difference. See a problem in the results, figure out a fix, run the script again to test. Repeat and then release when you can prove that your fix makes a difference. Because each script fires up its own environment, I'll be able to distribute performance fix work to my team.

This ad hoc loadtesting tool is already proving its worth, and I've only just started employing it. I can't wait to see how effective it is going forward.