Tuesday, January 1, 2013

Hirsute: Fake Data for the Real World

At a previous job, which sold web services to help school districts track their tests and their progress against state tests, the sales team thought they could be more effective if our demo district was more realistic. The demo district, which had survived years of service, only had a few hundred students. Superintendents wanted to get a sense of what the system could really do for their needs.

You couldn't just show a district some other district's data, though, for confidentiality reasons. So one of our engineers came up with a solution. He took data from other districts and munged it together. He didn't just swap names around, though. Because we tracked demographic information, he squished students together within demographics. So you'd end up with Hispanic names together. And those students would have test scores that mirrored the Hispanic population in your district (since, at that point, we probably had data from nearby districts in your state). Kids of different socioeconomic status would also have similar test scores, and so forth. I seem to remember that one district half-jokingly suggested they just run the district off the demo they saw, since it was so close to their own.

I thought of that recently when working with our loadtesting group. I'm sure EA's centralized loadtesting group is no different than that of any other corporation in that they lack intimate details about our business objects. On Spore, I remember that our loadtesting database was set up with something like 100,000 users (we have 3 million or so now), each of whom had something like 10,000 creations (most users probably have 20). Or all the sporecasts had 5,000 items (most held somewhere in the few tens of creations). Early work on SimCity's system produced similarly unrealistic data. That kind of data makes it hard to tune queries and indexes, figure out caching strategies, and all the other normal stuff one needs to do with data for a system.

What I wanted, I thought, was a DSL that would let me do what we did for school districts: Specify how the data should kind of look, and let the system generate that data for me.

Thus, Hirsute was born. Hirsute is an internal DSL built on top of Ruby and its extensive metaprogramming facilities (an aside: I recommend Metaprogramming Ruby for its solid information, though most emphatically not for its trumped-up dialog and narrative structure). In Hirsute, you build templates that define how objects should look, and then you create collections of objects derived from those templates. You can specify histograms of probabilities so that you don't just get a random distribution among options but a distribution that reflects your real-world requirements. You can read in data from files and randomly combine them to get plausible composite strings. You can then flush the collections out to files ready-made for loading into a database (mysql at the moment).

For instance, here's some Hirsute code from a fictional wine-cellar-management service that I created as a sample (since the data requirements most on my mind are for SimCity, which I can't talk about). There's also a manual that goes into greater detail.

# This defines a tasting note that a single user might write about a single bottle. It pulls descriptors from various files.
a('tastingNote') {
  has :tasting_note_id => counter(1),
      :description => combination(
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","}),
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","},
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","},
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","})
      :rating => one_of([1,2,3,4,5],[0.1,0.1,0.4,0.3,0.1]),
      :bottle_id => 1, # filled in later
      :user_id => 1    # filled in later
    is_stored_in 'tasting_note'

tastingNotes = collection_of tastingNote

That sample defines a template for tasting notes. The description field comprises 1 to 6 lines pulled randomly from a file of aromas combined with 1-3 lines randomly read from a file containing wine flavors, all joined to one another with commas. The rating is from 1 to 5, but weighted such that most wines will have either a 3 or a 4. The tasting_note_id is a counter that's incremented with each new object. The bottle_id and user_id fields are filled in later when an actual tasting note is created.

Then you define a collection to hold them. You could also do this by using

   tastingNotes = tastingNote * 6000

Which would create 6,000 tasting note objects based on the formula you provide.

So far, the system is pretty simple, but it gets the job done. And because it's a Ruby DSL, you can always just write raw Ruby to fill in what you need. I definitely plan to keep adding to it, though, with a newborn in the house, maybe not quite yet.