An Obsession With Programming: 2010

Tuesday, November 23, 2010

Dynamic Feature Disabling

Often, when I'm rolling out a new, substantial feature, I add a config file property that marks it as enabled or disabled. This lets the team shut it down quickly if it's causing problems. It's a good practice, overall.

Keeping the status in a config file is easy to maintain, but it's problematic. Being able to "switch off" a feature in a config file really means, for most Java deployments, rebuilding and redeploying the war file, which can take a while if you're including a bunch of big libraries.

Recently, inspired by two pieces I read about how disabling features helps scale a site and mitigate risk, I came up with an enhancement to my older idea, and it's one of those brain-dead-obvious changes that I wish I had done before: A feature's enabled or disabled status now lives in the database.

I have a simple table (with Hibernate object and service) that stores a feature name, its current enabled status, and the date on which that status changed. Any other service or controller in the system can query that service to see if a given feature is enabled. (Since I run in the Spring framework, access to the service is a simple matter.)

What that means practically is that I can disable an entire subsystem across all instances of my application simply by updating a database row. (In the live system, these will probably be cached and refreshed every few minutes so that incoming requests aren't slowed down.) It also means I can get the status of the systems across an entire bank of servers with a simple query.

I put it in as groundwork for one feature — capturing real incoming traffic to a site to then play back as loadtesting scripts — but quickly back-ported a few other systems that have single points of entry. Our profiling system, for instance, can now be turned off application-wide simply by setting the enabled status of "Profiling" to false.

Of course, once you have subsystems that have a dynamic mechanism for checking their enabled/disabled status, it's a small step to enabling/disabling on a per-user basis. Which means we can gradually roll out new features, checking the load on the system at each pass and fixing bugs.

It's not an earth-shattering idea, but I was immediately entranced by the power it gives to my system.

Saturday, October 2, 2010

Integration Tests

I've been a fan of unit tests for a couple of years now. Once I buckled down and wrote some, I realized how powerful they could be. They give you more stability out of the gate and they provide good regression tests. They allow me, the developer, to keep moving forward in tasks and minimize the time I spend going backward to debug or test older functions.

But I've always held the stance that unit tests shouldn't test things such as interactions with Hibernate or Spring. Those are well-tested frameworks with strong community support. Writing unit tests that touch those layers always struck me as a waste of time.

That said, most server-side Java code ends up moving into those layers. Virtually everything I write ends up talking to a database. And the server code itself is a loosely knit tapestry of services that chat with each other. That code can certainly have bugs — incorrect queries, edge conditions, bad assumptions, whatever. So how do I get tests against it?

Integration tests. Like unit tests, integration tests demonstrate that function A returns result X given inputs Y and Z. The difference is that in an integration test, you are testing the interaction between systems versus the simple in and out of a self-contained function.

I finally decided that I wasn't getting the test coverage I wanted with just unit tests. I was finding subtle bugs tucked away in database calls and service-to-service communication. Mock objects — code that presents the expected interface to a layer without providing the full functionality — only get you part of the way. If your mock object is keeping information in a local map instead of hitting the database, you're not testing that the query to the database returns the right thing.

Once I decided to incorporate integration tests, however, I went down a bit of rabbit hole. Your services need to talk to each other, so you need to wire them up to each other. But in the real, running code, I use Spring to manage all that. Fortunately, I can use Spring in the test environment, too. And Spring allows me to have a supplemental bean configuration file that overrides the production-code config file. So, for instance, I can have a bean named "telemetryService" that overrides the bean of the same name in the main config file. The test version doesn't actually send telemetry information. It effectively becomes a mock object. (Though in that particular case, it's done with a simple boolean.) I have an S3 service layer in the test config file that points to a test S3 bucket instead of our development one. Any beans that aren't overridden are pulled from the main file.

My integration tests do have to call a method to set up that configuration, however. Since the Spring config doesn't change at runtime, I have a static utility method that checks to see if the configuration is set up and, if it isn't, sets it up. This violates the principle that a test should not rely on existing state from another test, but that configuration load is time consuming.

But services don't just talk to each other. They also talk to the database. You don't want them to point to a database you actually use, so you need a test database. And that test database needs to a) be kept up to date, b) have some test data, and c) not let data linger (to ensure that no test ends up passing because of the coincidence of some other test running before it).

For the last two requirements, I incorporated DBUnit. DBUnit can do unit testing against the database, but I don't really need its form of that. Because I'm using Spring to set up my app in test mode, all my services can continue to use Hibernate for database work. (They can also use my JRuby-based query files.) But DBUnit offers two key services: It can take a data file of test data and load the database with it, and it can wipe any desired tables clean between test runs. When a test suite starts, it calls a method that ensures the Spring configuration is read (if it hasn't already been) and wipes/reloads the test database. When a test suite starts, I know that there is predictable data in the database and that that's the only data in the database.

What about between tests, though? I could do a wipe and reload around every test, but that's time-consuming. Instead, I have a setup method that configures a transaction and a teardown method that rolls back that transaction. All of this means that developers have to remember to do all this when writing an integration test, which I dislike, but I hope to integrate some tools that will automate that.

My test database also needs to be current with the latest schema. For that, I incorporated Autopatch, which will automatically get your database up to date when a code base runs against it.

So far, so good. But integration tests take a while to run, and I worry that developers might, faced with progressively longer test runs, disable tests on local builds. To keep our tests spry, I separate integration tests so that they're only run if you tell Buildr you're in the "longtests" environment. (That environment also specifies the test database properties.) The build system, which runs within minutes of any check-in, always runs them, however, so even a developer who gets hasty has an automatic monitoring.

All of this took three solid days to get working properly, but now that's it running, I got exactly what I wanted: the ability to have much deeper test coverage. The other day, I wrote a test that called service methods that altered the database and moved files around on S3. At each step in the chain, I could check the state of each component and verify that it had worked succesfully. And I knew that it was executing real queries and running real code. No mock objects.

The ability to have deeper test coverage means I've set myself a goal of writing those tests. Every feature I work on now has tests against it, even if they require database work or multiple services. And when I find a bug in my existing code, I force myself to write a test to reproduce it and then work on the code until the test passes. That way I get free regression testing whenever a check-in happens.

I have noticed one negative aspect, however. Unit tests, by their very nature, force you to write smaller and smaller methods that are easier to test. With no limit on the amount of activity you can do in an integration test, however, I find that my methods aren't channeled into smaller pieces. I try and do that because I know that ultimately produces more maintainable code, but my integration test system doesn't give an almost automatic push the way unit tests do.

Big methods or not, I still get to run a complete suite of extensive, realistic tests with a single command. That's pretty powerful.

Tuesday, September 28, 2010

Config Files As Programs

While working on a few recent tasks, I've hit upon a small but powerful technique: My configuration files are actually programs in their own right.

Expression languages of some form are de rigueur in modern frameworks. Especially in server code, where you need to deploy the same code base to different servers and have them point to different databases, mail servers, or whatever. In Spring, for instance, you can write something like ${project.database.url} and it will assume you mean to use not the literal string but the value of the project.database.url property, which is probably different in your dev environment and your production environment.

But my technique is vastly more powerful. I'm sure others have thought of it, too, but it sort of snuck up on me.

The first time I used this trick was when I put my work's SQL into Ruby scripts. I did that for readability's sake, but as I added more queries, I began to do the things one does does with programs. I refactored two similar queries into a method that took a differentiating argument and returned the query. I declared constants to hold common values. I started to ponder a Ruby module that I could import for various utility functions. The end result from the Java code's view was a set of query definitions. But within the query definition itself, I had normal programming tools at my disposal.

On a personal project (which I hope to write about soon), I decided to use YAML for the config files. I've gotten used to the format for config files, thanks to my work with AppEngine, and Ruby can easily work with it. My original idea for the config file was just fixed dates and times when I wanted my script to wake up and record an Internet radio stream. But then I realized that I wanted to record some streams on every weekday or on the first Sunday of the month.

So I thought, what if, before I feed the file to the YAML parser, I first run it through ERB, Ruby's powerful templating system? As long as the YAML parser sees something that looks like YAML, it doesn't matter how it gets there, right?

Right. I defined a method in my config file — in my config file! — that would return the next weekday after the given date. Then, when defining the value for a field, I called that method to calculate the value. My "YAML" file looks something like: start_date: <%next_weekday.strftime('%Y-%m-%d')%> 18:00. Which won't give you anything useful from the YAML parser. But run it through the ERB parser, and the YAML parser sees start_date: 2010-09-28 18:00, which is perfectly valid YAML syntax and is what you intend for the value to be.

Friday, September 3, 2010

Words With Friends Bingos, Redux

In my last post, I described writing a program to find the five-letter combinations most likely to yield a bingo in Words With Friends.

Then I realized there was a subtle bug.

Consider the word abreast. When you look at how Ruby's combination method splits it up, you'll see two entries equivalent to STARE, because there are two a's. This is correct from a programming view, but from a player's perspective, it doesn't matter. Having STARE is sufficient to get to abreast. You don't view the a's as different.

Here is the new word list with the corrected code. TEARS and TIERS are still your best bet, but the list has shifted a bit and the numbers are different.


aerst 1295
eirst 1237
aeirs 988
aeist 984
einst 983
einrs 972
eorst 949
aelrs 940
eerst 920
aeins 854
aeirt 816
aenrs 805
aenst 799
eeirs 792
aelst 788
eilst 782
aeint 765
aeils 765
ainst 751
eilrs 748

And here's the corrected script.


# find the 5-letter combos most represented in 7- or 8-letter words

wordlist = ARGV[0]

combos_to_count = Hash.new(1)

File.open(wordlist,"r") do |file|
   line = file.gets
   while(line)
      line = file.gets      
      if line then
          line.chomp!
          next if line.length != 7 && line.length != 8
          chars = line.split(//)
          unique_combos = []
          chars.combination(5) do |combo| 
              combo.sort! {|a,b| a <=> b}
              unique_combos << combo.join
          end
          unique_combos.uniq!
          unique_combos.each {|combo| combos_to_count[combo] = combos_to_count[combo] + 1}
      end
   end
end

sorted_combos = combos_to_count.sort {|a,b| b[1] <=> a[1]}
(0...20).each {|num| puts sorted_combos[num][0] + " " + sorted_combos[num][1].to_s}

Optimize For Bingos In Words With Friends

See the updated version here.

Somewhere in my library of books is one called Everything Scrabble, a guide to improving your Scrabble game. I read through it a few years back and have probably forgotten half its contents, but I do remember the author's comment that highly ranked Scrabble players often swap tiles in an effort to increase their chances of hitting a bingo (a word that uses all seven letters on your tray).

I've been playing Words With Friends (screen name: linechef) and have been thinking about that Scrabble strategy. Bingoing is less of an advantage in WWF. They're only worth 35 points instead of 50, and they often open up large swaths of bonus tiles for your opponent to gobble up. But they are usually worth it if you can hit a bonus tile yourself.

Everything Scrabble provides the scoop on which five tiles you should keep in your tray to maximize bingoing, but my mind let that knowledge go into the abyss. So I decided to write a program to rediscover it.

You can imagine how the program works: Go through every seven- and eight-letter word in the Words With Friends dictionary, the ENABLE wordlist, and find every unique combination of five letters in it. Keep track of how many copies of that combination you've seen throughout the dictionary, and then sort.

Let's start with the answers first. Here are the top 20 combinations, along with the number of times they occurred (the ENABLE list has 172,823 words):


aerst 2413
eirst 2199
einst 1780
eorst 1698
eerst 1646
einrs 1596
aeist 1587
aeirs 1484
aelrs 1468
aelst 1350
aenst 1346
eilst 1338
eiprs 1286
aeirt 1273
aeprs 1266
aenrs 1258
aeins 1245
einrt 1238
deirs 1233
ainst 1232

And here's the Ruby code. Ruby 1.8.7 added a combination method to the Array class, which makes this program tiny. You tell it how big you want each subset to be, and you pass it a block which takes each subarray as an argument.


wordlist = ARGV[0]

combos_to_count = Hash.new(1)

File.open(wordlist,"r") do |file|
   line = file.gets
   while(line)
      line = file.gets      
      if line then
          line.chomp!
          next if line.length != 7 && line.length != 8
          chars = line.split(//)
          chars.combination(5) do |combo|
             combo.sort! {|a,b| a <=> b}
             combo_str = combo.join
             combos_to_count[combo_str] = combos_to_count[combo_str] + 1
          end
      end
   end
end

sorted_combos = combos_to_count.sort {|a,b| b[1] <=> a[1]}
(0...20).each {|num| puts sorted_combos[num][0] + " " + sorted_combos[num][1].to_s}

Once you have the code, you can start asking other questions. Is the list different for words with exactly seven letters? A little bit:


aerst 609
eirst 493
eorst 450
aelrs 413
eerst 404
einst 385
aeprs 379
einrs 348
aelst 346
eilst 335
acers 331
aeirs 330
aenst 327
aeist 322
deirs 320
aders 318
eiprs 317
eilrs 297
aerss 296
deers 295

Note that there are only 676 permutations of the other two letters you could fit in your tray, so with STARE you can bingo with almost any combination. I leave learning those 609 words as an exercise for the reader.

What about a different dictionary? Here's the same run I first did but with the Scrabble Official Dictionary, Third Edition.


aerst 2395
eirst 2182
einst 1759
eorst 1670
eerst 1628
einrs 1581
aeist 1562
aeirs 1473
aelrs 1458
aelst 1335
aenst 1327
eilst 1317
eiprs 1278
aeprs 1258
aeirt 1257
aenrs 1239
aeins 1232
ainst 1224
einrt 1219
deirs 1218

Of course, some of this is common sense: EST gives you the superlative of many words. Likewise, ER gives you the comparative along with the prefix RE and the "person who does" suffix. Those four letters plus vowels other than U and Y form top contenders in many of the runs.

If you're playing WWF (or Scrabble), work to keep STARE in your tray, playing or swapping other tiles, until you get a bingo. Placing it on the board, of course, is a different matter.

Wednesday, August 25, 2010

Rails Scaffolding In Java

I recently needed to add a new business object to a system I'm developing at work, and it was a bit of a drag.

If you follow any of the normal Java patterns, you probably know what I'm talking about. You need the file for your business object. You need a service class responsible for returning instances or sets of instances from the DAO layer. You need a SQL file that defines the database table the business objects will be stored in. In our case, we also have queries in resource files, so we need a file for that as well. You need to edit your Spring config files to point to the new service. And so on. Your environment probably has some overlap with mine, and probably requires pieces mine doesn't.

You can simplify a lot of this, of course. You can make a service base class with Java Generics that will give you a lot of the type-safe methods you'd want. IDEs will let you set up templates, but you'll still have to click through a few menu options to get you the files you need. And the more files, the more clicks. And I like running an IDE-neutral team, so I wouldn't want to do something specific to one workflow.

I wanted a better way. Specifically, I wanted what Rails provides. You type a command at the command line, and you get all the files you need for working with the object. (XCode offers similar functionality.)

And I thought, "Well, why not?" My build system is written in Ruby, and Ruby's ERB templating system is built in to the language.

It took about an hour to get the main system up and running. I now type buildr :biz_object object=com.ea.foo.TestTest and I get a src/main/java/com/ea/foo directory, a TestTest.java and TestTestService.java file in that same directory, a sql file with the table definition (named test_test), and an empty query file. I also get a perforce changelist (via p4Ruby) with the description filled in and all the new files added.

Here's the heart of the code. template_to_final is a hash of template file name locations to the end file destination. The local variables exposed by the call to binding include the package name, the Java object name, and the SQL-friendly name:


       template_to_final.keys.each do |key|
           File.open(key) do |file|
               b = binding
               erb = ERB.new(file.read)
               outfile = File.new(template_to_final[key],"w")
               puts "Creating #{outfile.path}"
               outfile.write(erb.result(b))
               outfile.close
           end
       end

To give you an idea of what the templates look like, here's some code from the java file templates:


package <%=java_pkg%>;

@Entity
@Table(name="<%=java_sql_name%>")
@SequenceGenerator(name="objIdGenerator",sequenceName="object_id_seq",allocationSize=1)
public class <%=java_obj%> {

I don't yet go the full Rails route and specify all the properties on the command line, but this takes care of getting the boring parts of business object development out of the way, enforcing consistent naming schemes, and ensuring that the developer doesn't forget to check in some file. I also don't yet modify my config files, but that won't be too tough to add.

Thursday, August 5, 2010

Scripting Campfire

My team uses Campfire, a web-based chat tool from 37signals, to communicate throughout the day. We've divided it into various rooms, one of which is a Status room where we do a virtual version of the daily stand-up that many teams do: a quick meeting where everyone says what they're working on. (One of our team members is in England, and our hours vary a bit, so a classic stand-up isn't practical.)

About a month ago, one team member started posting the date in the status room before anyone gave their update. For some reason, Campfire itself doesn't do this reliably, and when you have a room with nothing but status updates, it's not always clear where one day's messages end and another's begin.

Usually one or two people on the team did this, but I followed suit on a couple of occasions, and of course thought about automating it. We already had a bot account for some very early automation, and Campfire has a good web service. Insert Tab A into Slot B.

Here's the relevant Ruby source code, which I have hooked up to a cron job (Note that if you want to script Campfire, I suggest creating a "Bot Test Room" that you can use for experiments without spamming your real rooms):


require 'date'
require 'json'
require 'net/http'
require 'net/https'
require 'uri'

def send_text_to_campfire(text)
    message = {
       :message => {
           :type => "TextMessage",
           :body => text
       }
    }
    send_to_campfire(message)
end

def send_to_campfire(message)
    url = URI.parse("https://<your base URL>/room/<your room number>/speak.json")

    request = Net::HTTP::Post.new(url.path)
    request.basic_auth(<your auth token>,<any password string>)
    request.content_type = 'application/json'
    request.body = message.to_json
    request.content_length = request.body.length

    http = Net::HTTP.new(url.host,url.port)
    http.use_ssl = true
    response = http.start do |http|
        http.request(request)
    end

    puts response.body.to_s
end

# construct message
today = Date::today
formattedDateString = sprintf("%02d/%02d/%4d",today.mon,today.mday,today.year)
dateStatusString = "=== Today is #{Date::DAYNAMES[today.wday]}, #{formattedDateString} ==="

#send status message to campfire
send_text_to_campfire(dateStatusString)

Filling in the date each day isn't a huge time savings: It will take us a lot of days to recoup the time I spent automating a five-second typing task. But once I figured out the gist of posting to Campfire, I started adding new functionality. Our "Today is ..." status message now includes a few choice statistics — gleaned from our telemetry system — about gameplay from the previous day. I also wrote a "canary" script that does a health check on our dev server and posts to our general chat room if it seems to be slow.

Naturally, some of my co-workers have suggested writing an adventure game on top of the API. That may be a bit silly, but it does emphasize a point I often make: You can't really imagine all the possibilities for a technology until you get your hands dirty a bit and play with it.

This whole experience underlines again why web applications should have APIs. The lack of a date is probably a Campfire bug, but we don't have to wait for them to fix it. And we've added functionality that is only relevant for our team, resulting in something that more closely ties in with our real needs.

Sunday, August 1, 2010

Watch Your Users Use

Time and time again, I re-learn this lesson: Nothing makes your system more usable than watching your users use it.

As I mentioned in my last post, I'm building a telemetry system on top of Google AppEngine so that my studio can understand how our games will be played. As I've demoed pieces of it, I've put together web pages and interactive charts and other nice little visualizations. I've made JavaScript libraries to make it easy to build those sorts of things, and I've demoed some of the prettier ones to the stable.

But I also put in a way to export data as tab-delimited fields. I figured this would be a good last resort if the JavaScript version wasn't quite there yet, or if we hadn't written a specialized view on the data, but I figured we'd favor those.

Naturally, then, my principal users have all just used the tab-delimited export and not pushed beyond that. In particular, they grab a bunch of tab-delimited data, import it into Excel, and then do manipulations on it there. Though there are macros and various other things, it still seems to take a senior engineer a couple of hours to compile all the data for his weekly reports.

I know this because I sit with him a fair amount as he explains his process.

The other day, I was thinking of his process and realized that Python can create Excel spreadsheets. So I could remove one minor barrier by just letting him get the data in Excel format instead of tab-delimited text. Easy.

But that doesn't buy him much. For his weekly reports, he trims some of the unneccesary columns (my exports grab all the fields and dump them out, including some that are system-level fields), runs some macros to insert formulas to give him percentages, sorts the data, and then copies and pastes that into an email. He does that for about 10-15 reports, in one form or another.

What if, I thought, I could let him define the format of the spreadsheet the system generates? Then he could say, "give me this data, but put it in this layout." That would shave a big chunk of time from his flow.

One thing about working with him is that he's not an online engineer. This turns out to be very good, because it makes me think in terms of letting him modify config files instead of modifying server code. (A medium-term goal is building a real interface on top of my system so that anyone in the studio can interact with it, but for now, config files are how we do things.) One of our other online engineers used the system by writing a custom request handler that interacted with the AppEngine datastore directly to generate a customized HTML page. Very neat, but if he were my only user, my system wouldn't be very evolved. I'd just say to someone new, "write a request handler."

I mentally sketched out what the config file would need to contain, what abilities it would need to give him, and within 45 minutes had a system in place that will let him generate something much closer to the final spreadsheets he needs for his reports.

A key component of the new feature is the Django templating language that comes with AppEngine. I actually dislike the language for HTML generation, but for letting users construct simple templates, it's pretty good. One important feature: It can render a template contained in a string, not just one contained in a file. And you can give the renderer a context, filling in some variables that are accessible within the template. In my case, I create a context that contains the current piece of aggregation data (which is a summary object of a large number of events) and the current row number.

From his perspective, to generate a report about how much damage a creature is doing in our testing, he creates a YAML config file that looks like this:


header:
    - value: Name
    - value: Avg Damage
    - value: Total Damage
row:
    - value: "{{aggregation.groupByValue}}"
    - value: "{{aggregation.average}}"
    - value: "{{aggregation.sum}}"

Then, when he requests data in Excel format, he can tell it to go through the layout he specifies. Rather than getting a dump of every piece of data in each object, he gets a spreadsheet with just the information he needs. The {{}} is part of the templating system and translates as "spit out the result of this expression."

Let's say he also wanted to capture the number of times that creature did damage. That's actually available to him in another field in the aggregation, but let's say it wasn't. My system also supports formulas. So he could do this:


header:
    - value: Name
    - value: Count of Attacks
    - value: Avg Damage
    - value: Total Damage
row:
    - value: "{{aggregation.groupByValue}}"
    - formula:"ROUND(D{{row_num}} / C{{row_num}},0)"
    - value: "{{aggregation.average}}"
    - value: "{{aggregation.sum}}"

That formula will become, in the output for the first row, ROUND(D1/C1,0).

This is still a long way from user-friendly, but I think it's going to give him back an hour and a half of time each week. And building the system this way means that a user-friendly version just needs to write the format into this config file, and away it will go.

Tuesday, July 20, 2010

Google AppEngine

Three months ago or so, I had an interesting idea for Google's AppEngine, an environment for building web applications. I wanted to use it for telemetry.

Telemetry, literally, means measuring something from afar. (Merriam-Webster's definition is somewhat circular if not outright tautological.) In the gaming industry and perhaps others, it means capturing information about how users interact with your system. Think of stats packages for web pages; they're the same thing. You also hear the terms business intelligence and plain old data analysis.

I want to be clear: This isn't about watching you, in particular, play a game. If you die in a particular spot on a particular level when our newly announced Darkspore ships, I don't really care. But if lots of people die in that same spot, it tells you something about the difficulty of that spot, doesn't it?

I needed something that could take in a large stream of data without hiccuping and allow me to process it and report on it. I went with AppEngine.*

A big selling point of AppEngine is scalability. Google obviously knows something about running big systems, and they take control of scaling your app. At any given point, I don't know how many servers are running my code: AppEngine starts up new instances as traffic grows and shuts them down when traffic slows. It's not like Amazon.com's EC2, where you manually say, "I want this many servers. No! Now I want this many!"

Closely tied to its scalability is the datastore behind AppEngine, which is a proprietary NoSQL technology called BigTable. When you run queries against BigTable, you're using a MapReduce algorithm — the same algorithm that allows Google to index and search the web so quickly. I had a potentially huge set of relatively unconnected data: AppEngine seemed like a good fit.**

And, for the most part, it has been.

But the datastore that makes AppEngine so powerful has some quirks that I've had to work around. Keep in mind that it's not a relational database, so it doesn't have the same behaviors. Most notably, it's not very good at aggregating data: sums, averages, whatever. And remember where I said that telemetry is all about aggregation? In the long run, I'm hoping BigQuery will make this easy, but in the mean time, I had to build my own aggregation system on top of AppEngine's cron jobs and task queues. The system is pretty neat, including support for a "group by" concept so that I can get views on subsets of the data in a given aggregation, but I'll be happy to let someone else handle that heavy lifting.

Another issue I've run into is that AppEngine sets hard caps on certain aspects of the system. For instance, any given request can't run too long or it will be terminated. On my own servers, if a webpage for an admin tool takes thirty seconds to load, I'm not too upset. It's an admin tool; I can wait. Google isn't so forgiving, presumably to prevent some process of yours from running amok on their machines. This has led to some early performance tweaking that's probably, in the long run, better to get out of the way now.

But these have been interesting problems to solve, so I haven't minded them so much as been annoyed by them. We're currently pulling in 100,000 events or so on any given day (as testers play the game and developers work on it), and I've only had to do optimizations on the reporting side and aggregation system. There's a whole other round of work I can do on optimizing writes, but I haven't bothered yet. Granted, 100,000 events isn't that significant, but it's nice to defer some of the deeper tuning until we're a bit closer to shipping. Meanwhile, we've got graphs and charts and spreadsheets with our telemetry data served up in a useful way to different people in the studio.

It's always hard for me to gauge the ease-of-use of a web system for someone who's not a long-time server engineer, but AppEngine feels like it is very good for either the people who know nothing about web stuff or people who know a lot about it. For middle ground types — maybe a bit of PHP or other dynamic site system — who won't be facing scalability issues, I wonder if Rails might be a better choice.

But I do know that one of our non-online engineers was able to install a new index on AppEngine for a query he wanted without even asking me. I didn't know about it until I happened to see his check-in note. That's pretty powerful. AppEngine supports both Python and Java, but I deliberately went the Python route first, despite not knowing the language at all, because there's more Python knowledge in general in my studio. And that language I didn't know? The code I wrote to take in telemetry events is virtually unchanged from the first day I wrote it. So it's easy to get started on AppEngine, and the cost is free well into the needs of most websites.

If you're interested in developing on AppEngine, I recommend the yet-to-be-released Code in the Cloud.The writing in the early drafts is a bit too exuberant, but there's solid information from the get-go, and Pragmatic Programmers delivers polished final products.

* Yes, yes. I don't trust Google, either, though, as always in blanket statements, the AppEngine team seems to have their hearts in the right place. What's to prevent them from pilfering through our data and gleaning their own information about our users? Well, there is a clause about that in the Terms of Service — they won't do it — but I'm not naive. We won't be storing information that anyone else could use to trace back to our users. An ID that makes sense inside EA? Yes, sometimes. Anything that makes any sense anywhere else? Nope. Our legal department was pretty firm on that.

** What if AppEngine goes down on our big launch day? Am I nervous about tying my company's game to a service that may not be there when I need it? In this case, not really. Let's say AppEngine goes down for a full 24 hours. The whole thing, not just pieces. Such an event would be unprecedented in my experience, but still. So we lose a day of telemetry data. Big deal. People will play the game one day the same way they played it the day before and the day after. Since I'm dealing with aggregates, I can cheerfully ignore missing data.

Wednesday, May 26, 2010

What's The Probability Of Two Boys?

There is a link going around about the most recent Gathering For Gardner. The writer hooks the reader with an intro featuring Gary Foshee (a top-notch designer of "secret box" puzzles, though they're rarely boxes), who poses this question to the crowd of mathematicians, puzzlers, and magicians: "I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?"

The article then describes the Gathering. (I'm on the invite list, but I've not yet been.) It eventually explains the answer, but these probability questions never make sense. So I wrote a program to illustrate the first oddity, which is that announcing you have two children and one is a boy makes the probability of the other one being a boy only 1/3. Basically, create 10,000 sets, and remove any that are two girls. Now count up the total number of pairs left and the total number of those that are two boys, and you end up with something around one-third.



num_bbs = 0
total_pairs = 0

(0...10000).each do |count|
  children = [Kernel.rand(2),Kernel.rand(2)]
  next if children[0] == 1 && children[1] == 1

  total_pairs = total_pairs + 1
  num_bbs = num_bbs + 1 if children[0] == 0 && children[1] == 0
end

puts "#{num_bbs} pairs of boys out of #{total_pairs} valid pairs = #{(num_bbs.to_f/total_pairs.to_f) * 100}"

It's weird, but it's true.

Tuesday, May 18, 2010

We Rule: Maximize Your Time

I've been playing We Rule on my iPhone. I have to say, I don't really get the point of it. It seems like the only way to progress is to plant and harvest crops and exchange services with other players. But there are tons of things like trees and banners and what not that don't seem to do anything. Should I place a water tower? Who knows?

But this is not a game review. When you plant items and harvest them, you get paid some amount of gold depending on the crop. Planting each crop takes some defined amount of money (usually), and each crop requires some defined amount of real-world time to mature.

So, naturally, I wondered: Which plants are the best investment?

A simple spreadsheet offers the answer*. I wrote down the profit of the crop and divided that by the number of minutes it would take to grow. I did the same thing for experience points, which allow you to level up. (This list reflects the crops I currently have access to. I'll update as I go.)

Crop	Profit	Profit/Minute	XP/Minute
Magic Asparagus	1250	.87	.24
Potatoes	200	.56	.22
Pumpkins	160	.89	.36
Strawberries	120	1.33	.56
Carrots	260	.36	.15
Squash	180	.6	.25
Beans	340	.24	.10
Onions	100	1.67	.67
Wheat	20	4	1.60
Corn	5	6.67	1.33

Your definition of a good investment may be different than mine. Clearly the best investment is to plant and harvest corn like a madman every 45 seconds. You have fun with that. Wheat is somewhat more tolerable, with a 5-minute harvest time, but I like to set up a bunch of crops and then go do something else until they're done. Or set up a bunch overnight to have ready the next morning.

Strawberries and onions are good investments, with onions beating out strawberries for both cash and experience points. Plus, they only take an hour: That's about right for mid-day play. In the crops that take all day, Magic Asparagus is your best best, with beans not even close. Beans also give you one of the worst XP/Minute ratios. Don't plant beans.

Derive your own fun theories from the list.

*Yes, this is a programming blog. Spreadsheets are programs, too, though they're rarely treated as such.

Friday, May 14, 2010

Rails Rules

I get now why people love Rails.

I had a personal project in mind: a web-based application for organizing all the notes you collect while working on an article. Like the index cards you used to use for school reports. I decided to try Ruby on Rails, a web application framework built on top of the Ruby language, mostly to see what all the fuss was about.

Rails is the framework of choice for a wide range of startups because it lets you get up and running quickly. It works by driving home a brutal truth: Your website is not that unique.

Some sizable percentage of what you need to do for a real-world website is what everyone needs to do for a real-world website: working with databases, adding CRUD functionality, rendering HTML, mapping handlers to URLs, and so forth. "All of this has happened before and all of it will happen again."

Rails acknowledges this and takes care of most of that functionality out of the box. A lot of what it does seems like magic, but really it just imposes naming conventions that will let it do the right thing most of the time. Put your User objects into a users table, and you rarely have to write SQL. Organize your URLs as /controller/action/id, and you don't have to map anything manually. Call a controller method create, and it will automatically be called to handle POST operations. And you don't even have to remember any of this: The scripts that come with Rails generate tons of the code for you, so adding a new business object into the database and HTML-based CRUD pages to manage said objects take two scripts. Two lines at a command prompt, and poof.

I gave it a whirl and came away impressed. In five hours of flying to the East Coast this weekend, I, who know very little about Rails and something about Ruby, had a functional site for creating, viewing, and editing each of the key object types my application needs. I even added a whole bunch of "it would be nice if it did this" types of features. All on the way to New Hampshire. I wrote minimal amounts of code to do it, too. I have an app that I can use — indeed, I've started using it for real-world stuff.

On the way back, I started worrying about the user interface, and that's where I began to flounder a bit. Rails is smooth sailing if you're using all of its tools, but if you want to use something such as jQuery for your front-end JavaScript/AJAX, things get tougher. Or perhaps it's that the book I used doesn't explain the depths well enough to let me figure it out. A friend of mine says it's easy, and I'm sure it is, but I haven't quite grokked everything I need to do yet.

But even with this hiccup, Rails is clearly a valuable tool that any web developer should consider.

Tuesday, April 13, 2010

Ruby + Outlook + Perforce

At my work, we have a practice that I've always found curious. Usually, when people check something in, they send out an email to the whole studio with the change list.

This is, by itself, a fine practice. The curious part to me has always been that Perforce (our version control software) has the ability to send out emails for every check-in. My old boss would subscribe to every check-in in our branch, for instance, so she could keep tabs on what the group was doing. But that's not what we use.

Over time, I've come to understand the rationale of our redundant workflow. People can attach screenshots (I do work for a game company, after all) and the subject line can provide project categorization and a short description that perforce doesn't know about. Also, culturally speaking, if everyone in your company does this and you don't, people have a harder time figuring out what you do.

But I still find the workflow annoying. You submit your changes in Perforce. Then you go to the "submitted changelists" pane and double-click on the changelist you just submitted. You select all the text and then copy it (this view has more info in it — such as the list of files — than the text you wrote when submitting, so you can't just use that). You alt-tab over to Outlook, open a new message, address it to the studio email address, attach your pithy subject, and then copy and paste in the changelist info. (You also use this moment to attach screenshots, if relevant.) You send the email and then alt-tab back over to Perforce (or your IDE) to do more work.

I finally decided to automate the process a bit using Ruby to tie together Outlook and Perforce. I set it up to work with my needs, so it may be of limited use to anyone else. For instance, we have a custom of using "mini" to indicate minor, one-liner types of fixes. Otherwise, I tend to use "submit." Also, sometimes I want to aggregate a few recent changelists in one email. That's what the -count argument does. I also put all the config info (username, password, etc.) into a separate yaml file so that I can distribute the script without sending around my network password. Finally, I specified a sendMode of either Display or Send. Display opens the email for you and lets you customize it. Send just kicks it out the door. The former is useful for screenshots and the like. The sendMode and project config file options provide defaults, but they can be overridden. Sometimes I do work in another functional project and need to change the email accordingly.

You'll need p4Ruby to make this work. (I think the OLE stuff is built in to the Ruby for Windows installation.) Perforce returns information in sort of an odd way: a changelist will have the list of revisions as one field and the list of files as another. The indexes line up, but it takes a bit more work to get the info you want.

There are no doubt better ways to do this: I'm still fumbling around with Ruby.


require 'win32ole'
require 'P4'
require 'yaml'

is_mini = false
subject = ""
num_cls = 1

def assert_value(obj,message_on_fail)
   if !obj
      puts message_on_fail
      Kernel.exit
   end
end

# load the yaml settings first
assert_value((File.exists? 'p4_email.yaml'),'You must have a p4_email.yaml file in the same directory')

config = YAML.load_file 'p4_email.yaml'
sendMode = config['sendMode']
project = config['project']

assert_value(config['p4User'],'No p4User specified!')
assert_value(config['p4Password'], 'No p4Password specified!')
assert_value(config['p4Client'], 'No p4Client specified!')
assert_value(config['p4Host'],'No host specified!')

# now parse ARGs. In particular, see if the user has overridden config settings
ARGV.each_index do |index|
   if ARGV[index] == '-sendMode'
      sendMode = ARGV[index+1]
      # check value
      assert_value(sendMode == 'Display' || sendMode == 'Send',"Invalid sendMode value: #{sendMode}")
   end

   if ARGV[index] == '-project'
      project = ARGV[index+1]
   end

   if ARGV[index] == '-mini'
      is_mini = true
   end

   if ARGV[index] == '-subject'
       subject = ARGV[index+1]
   end

   if ARGV[index] == '-count'
      num_cls = ARGV[index+1].to_i
   end
end

puts "count: #{num_cls}" 

assert_value(project,'No project specified! Add to p4_email.yaml or use the -project command-line argument')

# set up p4 connections
p4 = P4.new
p4.client = config['p4Client']
p4.password = config['p4Password']
p4.user = config['p4User']
p4.host= config['p4Host']

p4.connect
p4.run_login

# retrieve recent changelists
lists =  p4.run_changes('-u',p4.user,'-m',num_cls, '-s','submitted')

#get the id
msg_body = ""
(0...(num_cls)).each_with_index do |obj,index| 
    cl_num = lists[index]['change']

    #get the full details for that cl
    cl_full = p4.run_describe(cl_num)[0]
    cl_action_list = cl_full['action']
    cl_rev_list = cl_full['rev']
    msg_body = msg_body + "Change #{cl_num} by #{p4.user}@#{p4.client}\n\n"
    msg_body = msg_body + cl_full['desc'] + "\nAffected files ...\n\n"
    cl_full['depotFile'].each_index do |index|
        msg_body = msg_body + cl_full['depotFile'][index] +\
                   "##{cl_rev_list[index]} " + cl_action_list[index] + "\n"
    end
end
p4.disconnect

#compose email
outlook = WIN32OLE.new('Outlook.Application')

message = outlook.CreateItem(0)
submit_type = "submit"
if is_mini
   submit_type = "mini"
end
message.Subject = "p4 [#{project}] #{submit_type}: #{subject}"
message.Body = msg_body
message.To = '[studioemail]'
# todo: should invoke the method by using reflection
if sendMode == 'Display'
   message.Display
elsif sendMode == 'Send'
   message.Send
end

Sunday, April 4, 2010

Thoughts On Core Data

I started a new iPhone app, and I decided to use the Core Data framework.

For my first app, I built an object wrapper around calls to sqlite, the embedded database built in to the iPhone frameworks. Core Data didn't exist, so everyone had to roll their own solution to this problem. I thought about just using my original solution again — it's well tested, it's a few tweaks from total reusability, and I know SQL well — but my iPhone programming is mostly about learning new technologies, so I gave Core Data a try.

Core Data is basically an ORM system. I've used a number of these over the years; I've even written some, including, in a minor way, the sqlite wrapper I mentioned above. All the ones I've seen abstract away the notion of a "database" so that the bulk of the system just sees objects without knowing their origin.

Here are some of my initial thoughts on Core Data.

Core Data abstracts the database away so much that you can't actually get to it. I recognize that Core Data can run on top of any number of storage solutions, but I feel like if I know it's running over a database, I should be able to manipulate the database myself. Bulk updates of database info — versus loading each object and modifying it — are just one scenario where that would be useful.

Objects managed by Core Data have to extend a single base class. This isn't a huge problem for my model, but it does mean you use up the one inheritance you have in Objective-C. Java has the same limitation, and most of its ORM solutions don't require you to extend a class, which gives you more flexibility in the long run.

Migrating a model should not be an "advanced" topic. One minor change to a model, and you have to nuke the data for your app, which is a bother when you're actually using it. Yes, there are a range of ways to accomplish your goal. But in my first iPhone app, I just wrote a few lines of SQL and had them run against the database at startup: Migration to new models was a snap.

The NSFetchedResultsController is a delight to use. With a few short lines of code, you have a model object you can use to drive table views of data.

Maybe I haven't read up on it enough, but when Core Data is running against a database, I'd like to see explain plans for its queries and be able to check its index usage.

Running arbitrary queries is extremely verbose, again because of the inability to run SQL directly. I wanted the ability to display a unique list of existing non-null values for an object's property in my app so that a user could either enter a new one or select an existing one. In SQL, that would be something like SELECT DISTINCT property_column FROM object_table WHERE property_column IS NOT NULL ORDER BY property_column. The Core Data version of this is:


    NSFetchRequest *request = [[NSFetchRequest alloc] init];
    [request retain];
    NSEntityDescription *entity = [NSEntityDescription entityForName:@"CallSlip" inManagedObjectContext:[self managedObjectContext]];
    [request setEntity:entity];
    [request setResultType:NSDictionaryResultType];
    NSExpression *keyPathExpression = [NSExpression expressionForKeyPath:researchAreaField];
    
    NSExpressionDescription *expressionDescription = [[[NSExpressionDescription alloc] init] autorelease];
    [expressionDescription setName:researchAreasKey];
    [expressionDescription setExpression:keyPathExpression];
    
    [request setPropertiesToFetch:[NSArray arrayWithObject:expressionDescription]];
    
    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"%@ != nil", researchAreaField];
    [request setPredicate:predicate];
    
    [request setReturnsDistinctResults:YES];
    
    NSSortDescriptor *descriptor = [[[NSSortDescriptor alloc] initWithKey:researchAreaField ascending:YES selector: @selector(caseInsensitiveCompare:)] autorelease];
    NSArray *descriptors = [NSArray arrayWithObject:descriptor];
    [request setSortDescriptors:descriptors];
    
    NSError *error = nil;
    NSArray *results = [[self managedObjectContext] executeFetchRequest:request error:&error];

That version isn't exactly shorter.

Compared to other, similar frameworks, I'd rank Core Data as decent. I imagine it's scalable enough for a client application, where you probably don't have to worry about anything larger than 50,000 records. And, if you don't know SQL, it's probably better than just dumping an object tree into a file. But if you know databases, you're likely to find it frustrating as often as you find it useful.

Wednesday, March 17, 2010

Dynamic XML Schema Elements With JAXB

I solved an interesting problem the other day at work, and I wanted to write about it here, mostly so I don't forget how I did it.

I can't talk about what I'm doing at work at the moment, but I came up with a scenario from a different industry that presents some of the same problems.

Let's say you're building a web service for a stock trading desk. You generate a variety of reports in XML and JSON so that you can build a robust, AJAX-y front end solution.

Imagine that the head of the desk wants a report that gives a summary of all the stocks traded that day, complete with volume bought and volume sold. The desk trades different stocks each day, of course, and there are a vast number of valid ticker symbols that could appear, with more coming online as companies go public and with some disappearing as companies get delisted.

You might end up with an XML structure that looks something like this:


<report>
    <tradedStocks>
        <stock>
            <symbol>AAPL</symbol>
            <bought>100</bought>
            <sold>100</sold>
        </stock>
        …
    </tradedStocks>
</report>

With a corresponding JSON structure like this:


{"report":{"tradedStocks":[
    {"symbol":"AAPL","bought":"100","sold":"100"},
    …]}}

So far, so good. Now imagine that the head of the desk wants to treat AAPL differently. Instead of being mixed in with the other stocks traded that day, it should be at the head of the list and printed in green.

When you code this special-case logic on the front end, it will probably look something like this (in JavaScript):


    function findSymbol(symbolName) {
        for (stock in report.tradedStocks) {
            if(stock.symbol == symbolName) {
               return stock;
            }
        }
        return undefined;
    }

    aapl = findSymbol("AAPL");

Even wrapped in a function, that's a bit cumbersome. Plus, if the desk has traded hundreds of stocks, that iteration can be time-consuming, especially if the head of the desk wants to call out stocks that the desk shouldn't be trading: The code goes through the entire list of stocks only to return undefined.

It would be easier and more readable to be able to do something like this:


aapl = report.tradedStocks.AAPL

But that would necessitate an XML structure in which the element name stock was replaced by the element name AAPL. Which would in turn mean that the elements allowed under tradedStocks were drawn from a very large list of ever-changing element names. In essence, subelements under tradedStocks could literally be anything.

XML doesn't really allow for that. It assumes you have a well-defined structure. And tools like JAXB build on that.

You could put the special-case logic on the server-side, of course. Grab the AAPL data from the list of traded stocks, and make a sub-element of report be AAPL with further subelements showing the data. That would repeat data, which may not be the end of the world, but what about the next time the head of the desk wanted special-case logic? More server-side custom logic.

Here's how I solved the problem.

First, I changed the logic in our XML formatter. My view layer is isolated from the rest of the application, and within it it shuffles the model to different formatters depending on the format that was requested: xml goes to the XML formatter, json goes to the JSON formatter, and so forth.

The first incarnation of our XML formatter did the obvious thing: It just used the JAXB engine to spit out the XML based on the JAXB annotations. But I made it act more like our JSON formatter, which receives events from a parser that analyzes the JAXB annotations and constructs the output based on those events.

Why switch? Because I wanted custom annotations. By making our XML formatter act as an event listener too, I could interject events based on annotations that JAXB doesn't know about. The new XML formatter (and the JSON formatter) wouldn't notice anything odd about the data, because it would just be another event.

Next I created a FlattenableMap annotation, which our JAXB parser spies and interprets as "take this map, and for each key-value pair, fire an appropriate event." To use the above example, there would be a Map in the report object that would key stock ticker symbols to stock objects. In that case, our parser would say "I'm starting a complex object whose name is 'AAPL'." All the infrastructure would then fire appropriately, and you'd end up with


<report>
    <tradedStocks>
        <AAPL>
           <bought>100</bought>
           <sold>100</sold>
        </AAPL>
    </tradedStocks>
</report>

On the front-end side, a JavaScript programmer merely writes:


   report.tradedStocks.AAPL

This implementation also means that no matter what stocks show up in the report, they'll be pushed out to the client (assuming the stocks get loaded into the Map). There's no need to maintain the code to add "allowed" stocks. So it's automatically maintainable based on the real data in front of it, even if that data changes (which, in my case, it certainly will).

You could fairly point out that this means we're not using valid XML. You could certainly not construct a DTD or schema for this logic. But the reality is that eventually, we probably will have needs defined enough to allow us to construct the equivalent of report.AAPL, at which point we can have a more rigid schema. In effect, our schema is still so much in flux that every week or so creates a different schema. My code keeps up with those changes, even without me doing any new work.

Thursday, March 4, 2010

Highest-Scoring Word In WordCrasher

I've become a fan of WordCrasher on the iPhone. Think of it as <insert your favorite word game> meets Tetris: You tap on bubbles, which are falling from the top, to make words and clear them from the screen.

I've unlocked most of the achievements, but there are two secret achievements that have escaped me. I'm convinced that one of them is finding the highest-scoring word in the dictionary. The in-game view of the leaderboards, as of this writing, shows just four people who have managed to find a 2,300-point word, the highest recorded score.

What is the highest-scoring word? Well, I don't know. But I wrote a Ruby script to make an educated guess. I assumed Kevin Ng, the developer, used the Official Scrabble Dictionary as his dictionary (though his game omits at least ort and gams).

So I wrote this script, which takes a path to a dictionary file as a command-line argument:



$letterScores = {
  'a' => 10,
  'b' => 20,
  'c' => 20,
  'd' => 20,
  'e' => 10,
  'f' => 30,
  'g' => 30,
  'h' => 20,
  'i' => 10,
  'j' => 50,
  'k' => 20,
  'l' => 10,
  'm' => 30,
  'n' => 10,
  'o' => 10,
  'p' => 20,
  'q' => 80,
  'r' => 10,
  's' => 10,
  't' => 10,
  'u' => 10,
  'v' => 30,
  'w' => 30,
  'x' => 50,
  'y' => 30,
  'z' => 50 }
  
def calc_word_score(word) 
   sum = 0
   word.split(//).each do |char|
      sum = sum + $letterScores[char] if $letterScores[char]
   end
   sum * word.length
end


File.open(ARGV[0]) do |file|
    file.each_line do |word|
        score = calc_word_score(word.downcase)
        isBest = (score == 2300)
        puts "#{score} #{word}" if isBest
    end
end

I put in 2,300 as a score because that's what people have achieved. However, in the basic Scrabble dictionary, there's no word that scores exactly 2,300; there's one word that scores 2,350: zyzzyvas. So instead of the official Scrabble dictionary, I used the Enable wordlist (both wordlists are available from the National Puzzlers' League

With the Enable list, I found two words that scored exactly 2,300: showbizzy and whizzbang. I have yet to get a board where I can spell any of these, but I'm going to be trying all of them as soon as I can.

Saturday, February 6, 2010

Tail Recursion

The annoying thing about being a writer who has focused a lot on learning his craft is that I now have a constant editorial chatter now when I'm reading. Typos, awkward sentences, factual problems. They all crop up and prevent me from just taking in what I'm reading.

I was reading through a Scala book the other day, and I noticed this blurb in a section about tail recursion.

(If you don’t fully understand tail recursion yet, see Section 8.9).

8.10 Conclusion

The editor in my brain pounced on this end sentence, which cross-references to the same section the reader has just finished.

It took a beat before the programmer side of my brain woke up and noticed the joke.

Tail recursion is when the last instruction in a method is a call to the same method. To take an example from the book,


def boom(x: Int): Int =
         if (x == 0) throw new Exception("boom!")
        else boom(x - 1) + 1

In this specific case, boom calls itself on the last line of the method. That's tail recursion.

And the last sentence in that section is a perfect example.

Monday, February 1, 2010

Print To URL Via Smartphone

According to this Mashable post, Microsoft has unveiled a new "tagging" system that would let print publication have a smartphone-readable link so that readers could visit a webpage referenced in an article by pointing their smartphones at it.

Saturday, January 30, 2010

JAXB -> JSON

I'm working on a new project at work, and it has a by-now-standard RESTful API web service layer. And, of course, like all such layers, it needs to support output in XML or JSON.

Supporting XML is easy in Java, thanks to a technology called JAXB. Among its many capabilities is one which lets you annotate Java objects and generate an XML document from those annotations.

For instance, you could write an object with the following:


@XmlElement(name="name")
private String name = "Joe";

Pass that to the JAXB marshaller, and you'd get this XML:


…
<name>Joe</name>
…

And the reverse is true: Make JAXB parse the XML, and you'd end up with an object whose name instance variable was set to Joe.

So our XML version was done in no time. But JSON support is a less-entrenched technology, and thus there's no elegant built-in solution. My initial idea was to just use one of the JSON libraries out there and have it construct a JSON object from the XML document generated by JAXB: Dump the XML into a buffer, convert it, and print that text to the HTTP response output stream. That works (though it's not very efficient), except in this particular case:


<things>
    <thing>
         <name>Joe</name>
    </thing>
<things>

Your schema might define things as a collection with zero or more items, but the XML-to-JSON converter doesn't see the schema, and so it creates a things object instead of a things collection. Once there are two items in the collection, everything works fine. (It's possible that if we had an actual schema doc somewhere, the converter could know about it. But since the annotations define the output, we don't have an xsd file around at the moment. )

I looked at various options for mapping objects, but most felt like duplicate work: I'd have to maintain a JSON mapping in sync with the XML mapping, a setup that's guaranteed to get messed up in some subtle way someday. What I really wanted to do was use the JAXB annotations as the definition for the JSON output.

Jersey looks like it's trying to accomplish the same thing, but it also seemed (when I first looked at it) to be not yet ready for prime time.

So I came up with my own solution. While I can't post the code, I can give you a good sense of how it worked.

I created a concrete object,JAXBBridgeParser. I also created an interface, JAXBBridgeParserListener. You throw a JAXB annotated object and an implementation of JAXBBridgeParserListener at the parser, and it uses reflection to find the annotated fields and methods in that object. For each annotated field/method, it calls some appropriate method on the listener.

In addition to the ubiquitous startParsing and finishParsing messages, the parser fires special-purpose messages at the listener. The easiest scenario is what I call the "simple type" field or method. A String, a Java primitive, a Date, etc. In that case, the parser says to the listener, "Here's the name of this field and here's the value." In the JSON scenario, this translates to a key-value pair.

Next up is what I call the "complex type" field or method. This is a value that is itself an annotated object. In that scenario, the parser first tells the listener that it's beginning a complex object with a given name; then it recurses into a processObject method with that new object. That will in turn trigger its own "simple type" or "complex type" messages. When it comes out of the recursive call, the parser tells the listener that it's done with the complex object of the given name. This corresponds to a JSON key-value pair where the value is an object.

Finally, I have to worry about collections. These could contain simple types or complex types. When the parser sees an annotated field or method that is a collection, it tells the listener that it's starting a collection via the beginCollection method in the interface. For each item in the collection, it sends a message to the listener telling it that it's processing an object inside a collection. There are separate methods for simple types and complex types in collections. When it's done with the collection, it tells the listener that it's finished. In JSON, that corresponds to a list that might look like this: ["a","b",{c: "c",d:"d"}].

The end result works like a charm: My JSON objects line up perfectly with our XML documents, and I don't have to do anything to get them there. The JSON listener is about 20 lines of code. Any object that can be converted to XML can also be converted to JSON simply by passing it through the system. (A custom Spring ViewResolver figures out the right converter to use based on the request, so anything that can be served up as XML also automatically gets a JSON version.)

And I can support new formats pretty easily as they rise in prominence in the web world. I've thought about wiring up an HTML formatter that would give default browser versions of the data. HTML is a little trickier because I'd want some of the items — object IDs, for instance — to be attributes and some of the items — names, for instance — to be page content. But it should be doable. I've also thought of wiring up YAML support, which should be brain-dead simple, just because I can.

My particular scenario is pretty simple: We don't use some of the deeper features of JAXB, so I don't have to worry about handling them. I do, however, support @XmlJavaTypeAdapter, running the referenced adapter on a field before the parser sends the value to the listener. And my version has the downside that I, and not some larger open-source community, have to support and extend it. Still, it was a pleasant little exercise, and it's working well.

If you go this route, I encourage you not only to have lots of unit tests to catch subtle edge cases, but also to set up a more behavior-focused test. In my case, I made a test listener that simply counts the number of messages it gets from the parser. I set up an object with a fixed set of annotations, and then passed it and the listener to the parser. My "unit test" then checks the counts for each message in the listener.

Saturday, January 23, 2010

Polyglot Programming

I recently read through The ThoughtWorks Anthology, a collection of essays by Big Thinkers in the realm of systems design. The essays were largely interesting, but one in particular resonated with me: Polyglot Programming. The author made a compelling case for using the Java Virtual Machine — a robust, mature, well-tested infrastructure — as a platform in which any number of languages can co-exist.

Java's a good language, of course, but it's not good at everything. Why not mix in other languages that can run in the virtual machine but offer strengths in the face of Java's weaknesses, asked the author. I've toyed with this idea before, especially with adding Scala's potential for highly concurrent code, but the essay lit a new fire in me.

I came up with a way to try out the concept. We have a bunch of queries in our service layers, and Java blows when it comes to formatting long strings. Without the ability to have one string span multiple lines, you end up with something like this:


String query = "SELECT table_a.*,table_b.* FROM table_a,table_b,table_c " +
                          "WHERE table_a.some_column = table_b.some_column " +
                          "AND table_b.some_column = table_c.some_column " +
                          "AND table_c.id = :idValue";

Easy reading, right? Not only is it annoying to read, it's error prone. I almost always forget a space in one of these long strings, causing SQL exceptions that don't show up until runtime.

Ruby, like many other scripting languages, allows for a "here document" which basically says, "Treat this text following double less-than signs as a double quote and just pull in everything after it until you see the same text again." In Ruby, you might write the query above as follows:


query = <<QUERY
SELECT
     table_a.*,
     table_b.*
FROM
     table_a,
     table_b,
     table_c
WHERE
     table_a.some_column = table_b.some_column
     AND table_b.some_column = table_c.some_column
     AND table_c.id = :idValue
QUERY

Which is more readable. I admit this is not a monumental problem, but it did offer an opportunity.

Enter JRuby. JRuby is a Ruby interpreter written in Java. Curiously, it now outperforms the C-based interpreter in lots of benchmarks, which is not only a testament to the maturity of the JVM but to the dedicated open-source team that have devoted themselves to improving JRuby. JRuby's main benefit is that you can access the sweeping Java API from within your Ruby scripts, but you can also invoke Ruby scripts from your Java code.

I made a new class called QueryContainer that would serve as a facade for managing the Ruby invocations and giving Hibernate Query objects back to the service layer. No other layer in the code would need to know about invoking Ruby: QueryContainer would translate the scripts into objects useful elsewhere in the system. Inside each Ruby script, I made a class to act as a namespace (because I opted for a singleton of the Ruby interpreter instead of multiple copies), and then inside each class defined hash literals that looked something like this:


QUERY_1 = {
:type=>AppConstants::SQL,
:query => <<QUERY
SELECT
     table_a.*,
     table_b.*
FROM
     table_a,
     table_b,
     table_c
WHERE
     table_a.some_column = table_b.some_column
     AND table_b.some_column = table_c.some_column
     AND table_c.id = :idValue
QUERY
}

What's that AppConstants::SQL thing? AppConstants is a Java class in our system that has some globally useful constants. Because it's JRuby, I can use constants from my Java classes. We have two query languages in our system: normal SQL and Hibernate's abridged SQL. QueryContainer needs to know which query language it is because Hibernate defines a createQuery method for HSQL and a createSQLQuery method for SQL.

But it gets more complicated. If you have a SQL query that returns everything you need to construct a Hibernate object, you need to tell Hibernate what kind of object it is. (You don't need to do this for HSQL.) I added an entityClass key to the SQL hash literals, and had it reference a Java class object (.java_class when you're in JRuby, since .class has meaning in the Ruby world. In other words:


QUERY_1 = {
:type => AppConstants::SQL,
:entityClass => BusinessObject.java_class,
…
}

Here's the final flow. Some method in the service layer wants to run a query. It calls a method in the base class called getQueryForKey, passing in the query key it wants. That base class method calls a similar method on a QueryContainer instance variable held by the base class. QueryContainer was initialized with the Ruby script that will act as a resource, and it reaches into it to find the keys in the hash literal with the same name as the key that's been moving through the chain. e.g.,: QuerySet1::QUERY_1[:query]. If it's an HSQL query (QuerySet1::QUERY_1[:type]), QueryContainer just constructs a regular Query object. If it's a SQL query, QueryContainer constructs a SQLQuery object and calls addEntity on it, passing in the Java class from the :entityClass key of the hash literal.

So how does it work? Well, on the one hand, it accomplishes what I wanted. My queries have been factored out into new files, and they've been re-produced in a format that's easier to read and less error-prone. The entire rest of the system is ignorant of their source. It makes the case for adding languages that have strengths (in this case, the relatively minor advantage of string literals that can span multiple lines) to a deploy.

But on the other hand, JRuby seems to have added a sizable chunk of memory to our app. Shortly after I put in this system, our dev server started running into OutOfMemory errors on a regular basis, a process that I've contained somewhat by disabling some other systems. And this is with a singleton of the interpreter. I've found little information about this, and so I'm wondering if JRuby is the way to go. I haven't hooked up a profiler yet to determine the real source, but it's the only thing that's changed.

I've started looking at Groovy as an alternate. At least if I go that route, only QueryContainer needs to change.

Wednesday, January 20, 2010

Menu For Hope Source

For the past few years, I've run the raffle program that generates the winners for Menu For Hope, an annual fundraising event started by Pim that incorporates food bloggers from around the world and raises tons of money for hunger relief.

It's been a couple of years since I first posted the source code, so I thought I'd post the current version. It hasn't changed much: I made the parsing a bit smarter and added support for giving people tickets even if they asked for fewer tickets than they bought (I used to skip those as invalid). Looking at it now, I see a number of ways it could be cleaner and better architected (can you say regex?), but it continues to work. If you see any bugs, feel free to mention them!

Keep in mind that I don't run the program blindly: I run it and then look over every line of output to make sure the program is behaving the way I expect. If it's not, or I see an error, I usually end up fixing some minor data issue (People often think O is the same as 0, for some reason, or they write UW01x2 UW02x3 instead of 2xUW01, 3xUW02. My program handles the latter but not the former because it attaches the 2 to the second string in the field instead of the first.)

Here's the main class.


  public class MFHRaffle {
  
  
   public static void main(String[] args) {
     List prizeCodes = new ArrayList();
     try {
  RandomAccessFile prizeFile = 
      new RandomAccessFile("prizeFile.txt","r");
  String curPrize;
  while ((curPrize = prizeFile.readLine()) != null) {
      prizeCodes.add(curPrize.trim());
      System.out.println("Added prize code: " + curPrize);
  }
     } catch (IOException ioe) {
  System.err.println("Error reading prize code file");
  ioe.printStackTrace();
     }
    // divide args into buckets
    // there's only one input arg (file) so we can size list in advance
    List commandArgs = new ArrayList(args.length - 1);
    Map params = new HashMap();
    String filename = null;
    for (int i = 0; i < args.length; i++) {
     if (args[i].startsWith("-")) {
      commandArgs.add(args[i].substring(1,args[i].length()));
      if (args[i].startsWith("-oneprize") ) {
       params.put("oneprize",args[i+1]);
      }
     } else {
      filename = args[i];
     }
    }
  System.out.println("Using file: " + filename);
  
  if (commandArgs.contains("testrandomdraw")) {
   testRandomDraw();
  }
  
  boolean debug = true;
  if (commandArgs.contains("debug")) {
   debug = true;
  }
  
  MFHDataParser parser = null;
  if (commandArgs.contains("csv")) {
   parser = new CSVDataParser();
     } else {
      parser = new ExcelDataParser();
     } 
     parser.setDebug(debug);
            parser.setValidPrizes(prizeCodes);
     
     Map<String,List<String>> entries = parser.extractData(filename);
     Map<String,Integer> prizeCounts = new HashMap<String,Integer>();
     Map<String,String> prizeToWinner = new HashMap<String,String>();
     
     //produce a sorted list
     List<String> sortedPrizes = new ArrayList<String>(entries.keySet());
     Collections.sort(sortedPrizes);
     
     System.out.println("*******************************");
     if (!commandArgs.contains("oneprize")) {
   // for every entry in map, throw list to randomDraw
   for (Iterator<String> prizeIt = sortedPrizes.iterator();
    prizeIt.hasNext();) {
     // drumroll please...
     String prize = prizeIt.next();       
     List<String> bidders= entries.get(prize);
     prizeCounts.put(prize,new Integer(bidders.size() * 10));
     
     String winnerEmail = randomDraw(bidders);
     prizeToWinner.put(prize,winnerEmail);
   }
  } else {
   String prizeCode = (String)params.get("oneprize");
      String winnerEmail = randomDraw(entries.get(prizeCode));
      prizeToWinner.put(prizeCode,winnerEmail);
  }

   
  // tab-delimited output for fatemeh
  System.out.println("********** TEXT ****************");
  for (Iterator<String> prizeIt = sortedPrizes.iterator();
   prizeIt.hasNext();) {
    String prize = prizeIt.next();
    System.out.println(prize+"\t$"+prizeCounts.get(prize)+"\t"+
     parser.getNameForEmail(prizeToWinner.get(prize)) +"\t"+
     prizeToWinner.get(prize));
  }
  
  // html markup for brett
  System.out.println("*********************************");
  System.out.println("********** HTML *****************");
  System.out.println("<table rules=\"rows\" >");
  for (Iterator<String> prizeIt = sortedPrizes.iterator();
   prizeIt.hasNext();) {
    String prize = prizeIt.next();
    System.out.println("<tr><td>"+prize+"</td><td>$"+
        prizeCounts.get(prize) + "</td><td>" +
        parser.getNameForEmail(prizeToWinner.get(prize)) +
         "</td><td>" + prizeToWinner.get(prize) +
          "</td></tr>");
  }
  System.out.println("</table>");

           
 }

Here's the base class for parsers:


public abstract class MFHDataParser {
     
    private boolean debug=false;
    
    private char[] delims = {',',' ','.',';'};
    
    private Map<String,String> emailToName = new HashMap<String,String>();
    
    private List<String> validPrizes = new ArrayList<String>();
    
    public void setValidPrizes(List<String> prizeList) {
     this.validPrizes = prizeList;
    }
    
    protected void mapEmailToName(String email, String name) {
     this.emailToName.put(email, name);
    }
    
    public String getNameForEmail(String email) {
     return this.emailToName.get(email);
    }
    
    public abstract Map<String,List<String>> extractData(String filename);
    
 /** Do our darndest to figure out what prizes a donator has mentioned on
  *  a given line. Note: Be sure to complain if we don't recognize a prize.
  *  
  */
    protected List<String> extractPrizes(String prizes, int amount) 
     throws NoPrizeFoundException {
    
     String ucPrizes = prizes.toUpperCase().trim(); //for consistency's sake
     if (isDebug()) {
      System.out.println("Incoming prize string " + ucPrizes);
     }
    
     // basic strategy
     // find two numbers followed by a delim (, space, ., eol, ;)
     // back up and find two letters before it
     // then back up (not into an earlier code) and find a #
     // can't use java's regex abilities because i need to divide into larger chunks
     
     // put that many copies into the list
     // verify that size of list = donation /10. bark if not
     // List = one of each real raffle ticket (5xUW03 -> 5 entries in List)
     
     List<String> prizeList = new ArrayList<String>();
     List<Integer> prizeCounts = new ArrayList<Integer>();
     
     if (ucPrizes.length() < 4 ) {
      throw new NoPrizeFoundException(ucPrizes);
     } else if (ucPrizes.length() == 4) {
      // exact count. easy case, but verify that it's legit
      
      String prizeCode = findPrizeCodeInTextBlock(ucPrizes); 
      if (!validPrizes.contains(prizeCode)) {
       System.out.println("INVALID PRIZE CODE: " + prizeCode);
         }
      for (int i =1;i<= amount/10; i++) {
       prizeList.add(prizeCode);
      }
     } else {
      // in this case we need to walk through the list, divided it into
      // chunks, and find the prize code in each
      int chunkStart = 0;
      for (int i = 0; i < ucPrizes.length();i++) {
       if (i == ucPrizes.length() - 1 || 
        isDelim(ucPrizes.charAt(i))) {
        String curChunk = null;
        if (isDelim(ucPrizes.charAt(i))) {
          curChunk = ucPrizes.substring(chunkStart,i);
        }
        
        if (i == ucPrizes.length() - 1) {
         curChunk = ucPrizes.substring(chunkStart,i+1);
        }
        if (curChunk.length() < 4) {
         continue;
        }
        String prizeCode = findPrizeCodeInTextBlock(curChunk);
        if (prizeCode.equals("")) {
         continue; // not in this text block
        } else if (!this.validPrizes.contains(prizeCode)) {
         System.out.println("INVALID PRIZE CODE: " + prizeCode);
           }
        
        
           prizeList.add(prizeCode);
        int prizeCount = new Integer(
         findPrizeQuantityInTextBlock(curChunk,prizeCode));
        if (prizeCount == -1) {
         prizeCounts.add(new Integer(1));
        } else {
               prizeCounts.add(new Integer(prizeCount));
           }
        
        chunkStart = i + 1;
       }
      }
     }
      
  // expand prize list as needed
  if (prizeList.size() == amount /10) {
   return prizeList; // if there are as many prizes as the amount
         // would suggest, do one ticket each
  } else if (prizeList.size() == 1) {
   // create an expanded list that has one entry for each ticket
   List<String> newPrizeList = new ArrayList<String>(amount/10);
   for (int i = 0;i < (amount /10); i++) {
     newPrizeList.add(prizeList.get(0));
   }
   return newPrizeList;
  } else {
   // we have a mix of amounts and quantities
   List<String> newPrizeList = new ArrayList<String>(amount/10);
   for (int i = 0; i < prizeList.size();i++) {
    Integer count = prizeCounts.get(i);
    for (int j = 0; j < count.intValue(); j++) {
     newPrizeList.add(prizeList.get(i));
    }
   }
   return newPrizeList;
  }
 }
    
    protected int parseAmount(String amtString) {
        if (amtString.trim().equals("")) {
            return 0;
        }
     return (int)(Double.parseDouble(amtString));
    }
    
    /** Takes a guess at the prize quantity in a given text block */
    private int findPrizeQuantityInTextBlock(String chunk, String prizeCode) {
     // make a spaced version so we can look for UC 01 as well as UC01
     // walk over the string looking for numbers, skipping the prize code
     String spacedPrizeCode = prizeCode.substring(0,2) + " " +
                           prizeCode.substring(2,4);
                
        String newChunk = chunk.replace(prizeCode,"");
        newChunk = newChunk.replace(spacedPrizeCode,"");
                           
     for (int i = 0; i < newChunk.length(); i++) {
      if (Character.isDigit(newChunk.charAt(i))) {
       if (i < newChunk.length() - 1 &&
        Character.isDigit(newChunk.charAt(i+1))) {
        // two-digit quantity
        char[] digits = {newChunk.charAt(i),newChunk.charAt(i+1)};
        return Integer.parseInt(new String(digits));
       } else {
           return Integer.parseInt(newChunk.substring(i,i+1));
       }
      }
  }
  
  // one last check. some bidders wrote "TWO" instead of 2
  if (newChunk.indexOf("TWO") != -1) {
   return 2;
  }
  
  return -1;
    }
        
 /** Returns the offset of something that looks like a prize code. */
    private String findPrizeCodeInTextBlock(String chunk)  {
     
     // look for 2 letters followed by 2 numbers => prize code
     for (int i = 3; i < chunk.length(); i++) {
      if (Character.isDigit(chunk.charAt(i)) && 
       Character.isDigit(chunk.charAt(i - 1))) {
       if (chunk.charAt(i-2) == ' ') {
        if (Character.isLetter(chunk.charAt(i-3)) &&
         Character.isLetter(chunk.charAt(i-4))) {
            return chunk.substring(i-4,i-2) +
                 chunk.substring(i-1,i+1);
        }
       } else {
        if (Character.isLetter(chunk.charAt(i-2)) &&
         Character.isLetter(chunk.charAt(i-3))) {
            return chunk.substring(i-3,i+1);
        }
         }
      }
        }
        return "";
    }
    
    protected boolean isDebug() {
     return this.debug;
    }
    
    public void setDebug(boolean debug) {
     this.debug = debug;
    }
    
    private boolean isDelim(char c) {
     for (int i = 0; i < this.delims.length; i++) {
      if (c == delims[i]) {
       return true;
      }
     }
     return false;
    }
    
}

And here's the Excel-parsing subclass of Data Parser (you can see that it's got some logic that should be in DataParser, but in truth we've always used the Excel format, so it hasn't been an issue.


public class ExcelDataParser extends MFHDataParser {

    public Map<String,List<String>> extractData(String filename) {
     Map<String,List<String>> retVal = new HashMap<String,List<String>>();
     try {
      Workbook wkbk = Workbook.getWorkbook(new File(filename));
      Sheet sheet = wkbk.getSheet(0);
      for (int i = 0; i < sheet.getRows(); i++) {
    if (i == 0) { continue; }// skip headers
    String name = sheet.getCell(0,i).getContents();
    String email = sheet.getCell(1,i).getContents();
    String date = sheet.getCell(2,i).getContents();
    if (isDebug()) {
     System.out.println("amount: " + sheet.getCell(3,i).getContents());
    }
    int amt = parseAmount(sheet.getCell(3,i).getContents());
    String comment = sheet.getCell(4,i).getContents();
    if (email == null || email.trim().equals("")) {
     throw new IllegalArgumentException("No email found at line " + i);
    }
    mapEmailToName(email,name);

    
    if (comment == null || comment.trim().length() == 0) {
     System.out.println("No comment on line " + (i+1));
     System.out.println("");
     continue;
    }
    List<String> prizes = extractPrizes(comment,amt);
    if (isDebug()) {
     System.out.print( "prizes for line " + (i+1) + " ");
     for (Iterator<String> prizeIt = prizes.iterator();
          prizeIt.hasNext();) {
          System.out.print(prizeIt.next() + " " );
     }
     System.out.print("\n");
    }
    
    if (prizes.size() != amt / 10) {
     System.out.println("Line " + (i+1) + 
     " does not have the right number of prizes for the amt");
     System.out.println("");
    }
    
    // compress the lists down to MFHPair, which includes an email
    // and a count. Insert into map, keyed by prize code
    Collections.sort(prizes); // make sure they're in order
    String curPrize = "";
    int curCount = 0;
    for (Iterator<String> prizeIt = prizes.iterator();
         prizeIt.hasNext();) {
         String prizeFromList = prizeIt.next();
         List<String> bidders = null;
         if (retVal.containsKey(prizeFromList)) {
          bidders = retVal.get(prizeFromList);
         } else {
          bidders = new ArrayList<String>();
          retVal.put(prizeFromList,bidders);
         }
         bidders.add(email);
     }
     System.out.println("");
     }
     return retVal;
  }
  catch (Exception e) {
   System.err.println("There was a problem: " + e);
   e.printStackTrace();
   System.exit(1);
  }
     return null;
    }
}