Tuesday, January 1, 2013

Hirsute: Fake Data for the Real World

At a previous job, which sold web services that helped school districts track their students' test results and progress against state tests, the sales team thought they could be more effective if our demo district were more realistic. The demo district, which had survived years of service, had only a few hundred students. Superintendents wanted to get a sense of what the system could really do for their needs.

You couldn't just show a district some other district's data, though, for confidentiality reasons. So one of our engineers came up with a solution. He took data from other districts and munged it together. He didn't just swap names around, though. Because we tracked demographic information, he squished students together within demographics. So you'd end up with Hispanic names together. And those students would have test scores that mirrored the Hispanic population in your district (since, at that point, we probably had data from nearby districts in your state). Kids of different socioeconomic status would also have similar test scores, and so forth. I seem to remember that one district half-jokingly suggested they just run the district off the demo they saw, since it was so close to their own.

I thought of that recently when working with our loadtesting group. I'm sure EA's centralized loadtesting group is no different from that of any other corporation in that they lack intimate knowledge of our business objects. On Spore, I remember that our loadtesting database was set up with something like 100,000 users (we have 3 million or so now), each of whom had something like 10,000 creations (most users probably have 20). Or all the sporecasts had 5,000 items (most held a few dozen creations at most). Early work on SimCity's system produced similarly unrealistic data. That kind of data makes it hard to tune queries and indexes, figure out caching strategies, and do all the other normal things one needs to do with data for a system.

What I wanted, I thought, was a DSL that would let me do what we did for school districts: Specify how the data should kind of look, and let the system generate that data for me.

Thus, Hirsute was born. Hirsute is an internal DSL built on top of Ruby and its extensive metaprogramming facilities (an aside: I recommend Metaprogramming Ruby for its solid information, though most emphatically not for its trumped-up dialog and narrative structure). In Hirsute, you build templates that define how objects should look, and then you create collections of objects derived from those templates. You can specify histograms of probabilities so that you don't just get a random distribution among options but a distribution that reflects your real-world requirements. You can read in data from files and randomly combine them to get plausible composite strings. You can then flush the collections out to files ready-made for loading into a database (MySQL at the moment).

For instance, here's some Hirsute code from a fictional wine-cellar-management service that I created as a sample (since the data requirements most on my mind are for SimCity, which I can't talk about). There's also a manual that goes into greater detail.

# This defines a tasting note that a single user might write about a single bottle. It pulls descriptors from various files.
a('tastingNote') {
  has :tasting_note_id => counter(1),
      :description => combination(
         subset(
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","}),
         subset(
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","},
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","},
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","})
         ),
      :rating => one_of([1,2,3,4,5],[0.1,0.1,0.4,0.3,0.1]),
      :bottle_id => 1, # filled in later
      :user_id => 1    # filled in later
  is_stored_in 'tasting_note'
}

tastingNotes = collection_of tastingNote

That sample defines a template for tasting notes. The description field comprises 1 to 6 lines pulled randomly from a file of aromas combined with 1 to 3 lines randomly read from a file containing wine flavors, all joined to one another with commas. The rating is from 1 to 5, but weighted such that most wines will have either a 3 or a 4. The tasting_note_id is a counter that's incremented with each new object. The bottle_id and user_id fields are filled in later when an actual tasting note is created.
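To make the weighted rating concrete: histogram selection of the one_of variety boils down to walking a cumulative distribution. Here's a minimal Ruby sketch of that idea (my own illustration, not Hirsute's actual implementation; weighted_pick is a made-up name):

```ruby
# Sketch of histogram-weighted selection, in the spirit of
# one_of([1,2,3,4,5], [0.1,0.1,0.4,0.3,0.1]). Not Hirsute's actual code.
def weighted_pick(values, weights)
  target = rand * weights.inject(:+)   # weights needn't sum to exactly 1.0
  values.zip(weights).each do |value, weight|
    return value if target < weight
    target -= weight
  end
  values.last                          # guard against floating-point drift
end

# Ratings cluster around 3 and 4, mirroring the distribution above.
counts = Hash.new(0)
10_000.times { counts[weighted_pick([1, 2, 3, 4, 5], [0.1, 0.1, 0.4, 0.3, 0.1])] += 1 }
```

Over 10,000 picks, roughly 40 percent of the ratings land on 3 and 30 percent on 4, rather than the flat 20 percent each that a plain rand would give you.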

Then you define a collection to hold them. You could also do this by using


   tastingNotes = tastingNote * 6000


which would create 6,000 tasting note objects based on the template you defined.

So far, the system is pretty simple, but it gets the job done. And because it's a Ruby DSL, you can always just write raw Ruby to fill in what you need. I definitely plan to keep adding to it, though with a newborn in the house, maybe not quite yet.

Sunday, November 18, 2012

Scripting Campfire, Again

A little over two years ago, I wrote a post about scripting Campfire, the group chat tool from 37signals. At the time, my script posted a routine "today's date is" message with a variety of statistics. Over time, the statistics have disappeared — though I now post charts from Graphite — but the bot has been tirelessly plopping the date into each room each weekday (and now, in crunch time, each day).

Then I watched a video that mentioned the Campfire interface to HUBOT, GitHub's little robot assistant that handles all sorts of tasks. What if we could type a command into Campfire and have it actually do something?

But what? A first use case quickly suggested itself.

One of our Campfire rooms is devoted to server issues, and my boss and I often, while chatting in there, make a comment such as "todo: update deployment instructions."

How often do you think we actually remember those todos? Did you guess "at least rarely"? You may have overshot.

Some after-hours refactoring of the Ruby scripts I originally wrote plus a bit of tinkering with a new script, and I had a very simple command parser. Now, when you type "@SimCityBot todo blah blah blah," you'll get an email saying something like "You wanted to be reminded to blah blah blah." That doesn't guarantee the task will get done, of course, but it does make it less ephemeral.

The script is pretty straightforward: it polls each room looking for messages that start with "@SimCityBot," and then invokes a method with the same name as the first word in the text. That means adding a new command is now a simple matter of adding a single method to a file. Yay for Ruby's metaprogramming support! The script also maintains a YAML file that keeps track of the most recent messages in each room. This ensures that when the script is restarted, it doesn't respond a second time to every command it sees.
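The dispatch itself is only a few lines. This is a simplified sketch of the pattern, not the actual SimCityBot source (the class and method names are invented):

```ruby
# Simplified sketch of command dispatch via metaprogramming.
# Not the real SimCityBot code; names are invented for illustration.
class CommandBot
  BOT_NAME = '@SimCityBot'

  # Each command is just a method; adding a method adds a command.
  def todo(room, args)
    # The real script sends an email; here we just build the reminder text.
    "You wanted to be reminded to #{args.join(' ')}"
  end

  def handle(room, message)
    return unless message.start_with?(BOT_NAME)
    command, *args = message.sub(BOT_NAME, '').split
    # The first word of the message names the method to invoke.
    send(command, room, args) if command && respond_to?(command)
  end
end
```

Because send routes on the first word, adding a command really is just adding a method; anything unrecognized simply falls through.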

I had to add support to our Campfire library for uploading images in order to post our "slowest calls" graphs each day. Once that work was done, adding an "image" command was a single call. Give it a URL, and the script downloads that image and re-uploads it to whatever room the command appeared in.

Next up is a command to kick off Hudson builds. For that one, of course, I'll want to spin off a process that can monitor the build and report back when it's done. A co-worker suggested a command that lists Emeryville food trucks (which I maintain as a Twitter list).

There are lots of things the script isn't good at. Handling a command blocks until it's done. Error reporting is minimal. It doesn't support multi-word commands. But it's a little trinket I can poke at and have fun with.

Is this the most important thing I could be doing for SimCity? Let's hope not. But at the end of a long day, sometimes I need a break, and my breaks from programming are … other programming projects! SimCityBot provides a refreshing distraction that often buoys my mood and gives me a nice close to the day. That's also part of why I've not just installed HUBOT. There's less fun in that.

But the catalog of all HUBOT scripts is an inspiring read. AWS status checks? Graphite graphs? The latest XKCD? When is break time again?


Sunday, November 11, 2012

Scala and Java: A Simple Application

My boss recently asked me how I'd build out SimCity's online systems today if I knew everything three years ago that I know now. I didn't hesitate. "I'd take a good, long look at Scala and, by extension, Lift."

Scala has a lot that appeals to the me of today. I like its hand-in-hand support for functional programming and emphasis on immutable objects. I've gotten used to both concepts with Erlang, and I've come to appreciate that programming paradigm for building robust, scalable systems. But Scala also offers imperative syntax and mutable objects if you need them.

Scala has native support for the actor model abstraction of concurrency, which I first encountered with Erlang (Scala's syntax is openly lifted from that language's). The actor model makes it much easier to manage and reason about concurrent code, and Scala supports two major implementations: actors tied to particular threads or event-based actors thrown onto whatever thread is available, maximizing resource utilization.

And, unlike something like node.js or even Erlang, Scala has a huge universe of libraries at its disposal thanks to its bytecode compatibility with Java.

All good stuff. I thought it was time to do something real with it.

Before I dove into a large system, I thought I'd write a simple application. We have a group of binary files on SimCity that are very important to the game but, being binary, aren't easy to debug when something goes wrong. So I thought I'd do a quick project on my own to write a Java-based parser for the files and compare it to a Scala-based version. Little admin tools like this or other low-risk sections of the code base are often good ways to try out new tech and see if it will fit into the larger project. This code didn't leverage any of the concurrency systems in Scala; I just wanted a simple program.

One reason people like Scala — and, indeed, the many other JVM-compatible languages — is conciseness. Spend any time at all with Scala, Ruby, or — it sometimes seems — any other language, and Java's verbosity begins to feel like cement around your hands, sucking time and productivity away from your programming. Plus, more code, even Java's boilerplate, means more potential bugs.

As a simple measure, I compared the non-whitespace characters in my two versions. I structured the programs the same way, mirroring class structures and refactored methods. But I used the support that each language gave me for keeping things concise.

The Java version was 5258 characters. The Scala version? 3099. The Java version was almost 70 percent larger.
Java Code    Scala Code
5258         3099

Scala's biggest single win was with a file full of small classes that defined types of data within the file I was parsing. The Java version was 160 percent bigger than the Scala one.

This makes sense. Let's say you wanted an immutable class in Java to represent a point in 3D space. This is about as concise as you can get it.

public class Point {
   public final float x, y, z;
   public Point(float _x, float _y, float _z) {
      x = _x;
      y = _y;
      z = _z;
   }
}

Here's the equivalent Scala code.

class Point(val x: Float, val y: Float, val z: Float)

But Scala offers lots of little aids as well. You rarely need Java's omnipresent semicolons; you don't need to declare types as often, since Scala can usually infer them; you don't need to explicitly type "return" at the end of a function, because the last result in the function is the return value; you don't have to declare that you throw exceptions. The list of little things goes on and obviously adds up.

Functional programming, too, offers some conciseness. I needed a routine to read an unsigned int out of a variable-length byte array. In the Java version, I wrote this:


private static int byteArrayToInt(byte[] bytes) throws IOException {
    long retVal = 0;
    for (int i = 0; i < bytes.length; i++) {
        retVal = (retVal << 8) | ((long)bytes[i] & 0xff);
    }
    return (int)retVal;
}


In the Scala version, I wrote this:

private def byteArrayToInt(bytes: Array[Byte]) = {
    ((0L /: bytes) {(current,newByte) => (current << 8).toLong | (newByte & 0xff).toLong}).toInt
}

(The references to longs in this int-parsing code are to cope with the fact that I needed to read very large unsigned ints from the files, which Java defaults to interpreting as signed integers. The way to get around that is to write into a larger memory space, namely a long.)

You could argue that the Scala version is concise to the point of obtuseness, even if you're familiar with the functional-programming mainstay foldLeft operation it represents. I agree that there's a balance to be struck. In particular, I'm not sold on the /: operator for foldLeft; I might opt for spelling it out to be more clear.

For functional programming geeks, note that, to the extent it can, Scala offers tail-call optimization on recursive calls.

But things weren't all sunshine and roses on the Scala side. Here is the average time to run my program for each version, timed over 1000 iterations.

Java Time    Scala Time
88ms         182ms


To some extent, I expected this. Scala has to compile down to Java bytecode, which means that all that syntactic sugar and functional programming and closure support must turn into Java concepts somewhere. Even my little program generates a slew of extra classes and, presumably, lots of extra code that has to be navigated. Also, I think it's reasonable to imagine that immutable objects necessarily mean that new objects have to be created more often than they would in mutable space, where you can change an object directly. Finally, I've been working with Java in one form or another for 16 years or so; I've been working in Scala for about three days. So I'm likely missing out on performance tips.

Though I admit this seems like a huge difference for some extra classes and objects and missing an optimization step or two. Even if it's correct, I'm still of the mindset that greater productivity and easier, safer concurrency are big wins. (Note that you could always switch to imperative mode in key sections if performance demanded it, in much the same way that some sites offload work to C programs.)

If I were really honest about how I'd rebuild SimCity, I'd probably use Erlang, where you have to do things functionally, have a virtual machine that supports what you're doing, and have native systems for handling failures with aplomb. But Scala at least offers the potential of hiring from the pool of Java programmers, whereas Erlang really doesn't. (On the other hand, the vast majority of Java programmers I've seen seem to be couched safely and comfortably in Java, so wouldn't necessarily adapt. But Erlang would be a way bigger change, I think.)

I'm going to keep plunking away at Scala and try to build something a bit more real with it. Event-based actors might be a bit slower, but if they can scale vastly better, that may matter more to a site.

Thursday, November 8, 2012

Copying On S3

The question recently arose: Is it faster to copy within buckets on S3 than it is to copy between buckets?

A quick script provided an answer. I copied a 100K file 100 times for each test and averaged the results (which are in seconds).


Avg. time to make copy between buckets: 0.10705331
Avg. time to make copy within bucket: 0.10522299

A second test produced similar results (very slightly slower in both cases).

And here's the Ruby script I threw together. It uses the aws-sdk gem.



require 'aws-sdk'

# get buckets
s3 = AWS::S3.new
bucket1 = s3.buckets['dfsbucket1']
bucket2 = s3.buckets['dfsbucket2']

# get an object from bucket 1
random_file = bucket1.objects['191111308/state_file']

start = Time.now
copies = 100
(1 .. copies).each do |i|
  random_file.copy_to("test_file#{i}", {
     :bucket => bucket2
  })
end
puts "Avg. time to make copy between buckets: #{(Time.now - start)/copies}"

start = Time.now
(1..copies).each {|i| random_file.copy_to("test_file#{i}")}
puts "Avg. time to make copy within bucket: #{(Time.now - start)/copies}"



Sunday, November 4, 2012

Grokking Graphite


We started using Graphite at work six or so months ago, largely because there was already support for it in the metrics library we're using. If you don't know Graphite, it's a system for accumulating and, obviously, graphing time series. Most people use it for systems monitoring.

When we first set it up, I played with a few graphs of key metrics over time. Pick the metric from a list; Graphite shows you the graph. Easy. I also set up some basic dashboards that showed a few graphs. Again, easy.

But that's not always all you need. I wanted larger pictures of the whole system: hot spots, accumulated data across servers (in our setup, each server is its own metrics hierarchy in Graphite), and more. I pondered various ways to get the data out of Graphite (which it supports) and into R.

Then I discovered its functions library. And I went crazy.

First was a graph that showed every call in our system over a certain time threshold.

Then came one that combined a number of metrics to estimate mean time to first byte, a common metric for website performance.

And then another. And another. These days, I set up my laptop to run Chrome in full-screen mode so that it can fit all the graphs on one of my dashboards. But that's just one tab: I have dashboards for different environments and dashboards that focus on subsystems within those environments. A graph showing our 10 slowest calls gets uploaded to Campfire each day.

Our ten slowest calls as of today, with proprietary information removed. The lines are flat because of lack of activity on the server.


So far Graphite — especially version 0.9.10 — has been able to keep up with almost all my needs, and I haven't even hit all the functions. It even has a command-line interface that I just started playing with. (It allows faster iteration and finer control over each graph in a dashboard, but also allows you to keep a dashboard-building script under source control.) There are also a wide range of tools that work with it (including, of course, my own metrics relay system).

When I first read Graphite's documentation, I was struck by the author's right-up-front advice to consider your metrics naming scheme carefully. It seemed very nitty-gritty so early in the manual.

But now I understand. A consistent naming scheme and hierarchy depth allows for much simpler construction of useful graphs. To some extent, our profiling code, our package hierarchy, and our metrics library give us this for free. But the other day I realized I had made a mistake in naming a metric that captures all invocations of methods with a particular annotation, and it made it much more difficult to assemble a meaningful graph. I got it to work, but it required some wrestling. If you're using Graphite, I recommend auditing your metrics periodically to make sure you can get the most out of them.
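For example, with a consistent depth and ordering (these metric names are hypothetical), a single wildcard is enough to aggregate a stat across every server:

```
# Hypothetical naming scheme: <env>.<server>.<subsystem>.<method>.<stat>
prod.web01.accounts.login.mean_time
prod.web02.accounts.login.mean_time

# Consistent depth means one wildcard target rolls it up across servers:
averageSeries(prod.*.accounts.login.mean_time)
```

If one metric buries its stat a level deeper than its siblings, that wildcard no longer lines up, and you're back to enumerating series by hand.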

Sunday, September 16, 2012

Tsung vs. JMeter

I first learned of tsung from a colleague in EA's Pogo division, and I was intrigued by his description of the load it could generate. I've been pushing it at Maxis as a result.

There's sound architectural reasoning for its abilities. It's written in Erlang, which achieves massive concurrency with lightweight processes rather than operating-system threads. The various Java-based loadtesting tools use threads to simulate users, which necessarily limits their abilities.

There's also good anecdotal evidence favoring tsung, not only from the Pogo folks but also from the fact that it's used to loadtest ejabberd, the chat system Facebook uses that is generally considered to be the most scalable around. (It's also written in Erlang.)

But how much better is it than JMeter, the go-to Java-based loadtesting tool? JMeter is probably easier to configure -- it has a GUI versus moderately documented XML -- and it probably has a richer feature set. So is it worth going to tsung?

There are some numbers out there, such as this slideshow, which found that tsung could generate up to 50,000 simultaneous users while JMeter couldn't go above 1,000 requests per second, but I wanted to find out for myself.

I set up a simple web server (mochiweb, an Erlang-based web server that also has a good scalability track record -- sensing a pattern here?) that served a simple, static page. Remember the goal wasn't to test the server under load; it was to test how much load the clients could generate. I also ran top to get a sense of how much CPU my process was using. There are probably lots of things wrong with my methodology, but, again, I just wanted a sense of the difference.

Here are some graphs I made comparing CPU usage by tsung for given rates of users. This was on my MacBook Pro with a minimal set of other applications open.

Note that the numbers on the x-axis show the number of users up to that point. The web server responds very quickly in this scenario, so the number of simultaneous users is probably very small.

Tsung Graphs



And here are the graphs I could collect from JMeter, which I configured via the GUI and then ran in command-line mode. Once I tried to get to 100 users per second, the JVM never had enough memory to create all the threads, no matter how I adjusted the heap size. This is, of course, the other major resource that threads consume.

So tsung, even on my machine, could handle about 50 times more users than JMeter, which is roughly consistent with the numbers in the report I link to above. That said, for tiny amounts of load, JMeter compares favorably with tsung and is a lot quicker to get up and running. 




JMeter Graphs




I've noticed a tendency in Java shops to rely solely on tools written in Java. It drives me crazy; why put blinders on to a big chunk of the software universe? I (more or less) like Java; I use it all the time. But it's not good for everything, and it's worth knowing what else is out there that might be better at solving the problem at hand.

Yes, tsung is "harder" than JMeter. It's written in another language. Boo hoo. Use the tool that makes sense; in this case it's hard to argue with tsung's capabilities.

Friday, August 31, 2012

metricsmaw

Web applications have always been heterogeneous environments -- Apache or Nginx talking to PHP or Java, which is talking to MySQL or Oracle. But these days, it seems like the number of potential components in a system has exploded. And those components, often drawn from the front lines of new tech, come with different levels of maturity.

That realization happened again as I evaluated node.js at work and wanted to capture metrics about the system. How many of X could node.js do versus a Java version? And it's occurred to me at home when working with Ruby, Erlang, or Scala. I can't always get the metrics I want from the environment I want in a consistent way.

So I wrote a program to fix that. For lack of a better name, I called it metricsmaw, and I checked the very early version into GitHub.

metricsmaw is an Erlang server that does one thing: It receives data into metrics and relays those metrics to other places. Currently, the other places are csv files and a Graphite server. The metrics it supports right now are counters, gauges, and a by-the-minute rate meter that provides rates for the last minute, the last five minutes, and the last 15 minutes. Metrics and reporters are set up as Erlang behaviours so I can add new ones easily.
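To illustrate what such a rate meter tracks (metricsmaw itself is Erlang; this Ruby sketch and its names are mine, not the actual implementation):

```ruby
# Illustrative sketch of a by-the-minute rate meter: count events into
# per-minute buckets, then average over trailing 1-, 5-, or 15-minute windows.
# Not metricsmaw's actual (Erlang) implementation.
class RateMeter
  def initialize(clock = -> { Time.now.to_i })
    @clock = clock            # injectable clock makes the meter testable
    @buckets = Hash.new(0)    # minute-aligned timestamp => event count
  end

  def mark(count = 1)
    @buckets[current_minute] += count
  end

  # Average events per minute over the trailing window.
  def rate(minutes)
    newest = current_minute
    (0...minutes).map { |i| @buckets[newest - i * 60] }.inject(:+) / minutes.to_f
  end

  private

  def current_minute
    (@clock.call / 60) * 60
  end
end
```

A meter like this answers "how busy were we lately?" at three zoom levels from the same buckets, which is exactly the shape Graphite likes to ingest.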

While this solves a problem I often run into, it's also a good chance to flex my Erlang muscles and dig more into a language and environment I'm really enjoying. Erlang seems like a good fit here, as it excels at long-lived software that needs to be fault-tolerant and highly concurrent.

And since I first thought of this in the context of node.js, I added a node.js library for talking to it. I haven't really dug into the node.js idioms, so it doesn't handle, say, disconnected sockets very well, but it solves the basic problem.