An Obsession With Programming: February 2013

Thursday, February 28, 2013

Advances In Hirsute

Since I first launched Hirsute, I've been plunking away at it, making little changes here and there. I thought I'd do a quick post about the changes, some of which I'm quite happy with.

Specifying Histograms

One of the main things I wanted out of Hirsute was the ability to generate data based on non-uniform histograms. For instance, if most of your users have 0-10 friends, some other percentage has 10-50, and a small amount has 50-100.

But specifying that distribution was non-intuitive. You had to create an array of probabilities, they had to add up to 1, and they had to be the same length as your buckets.

Pondering how I might make it easier, I realized that what I wanted to do was draw out the histogram and let the system figure it out. So that's what I did.

This is now valid:

star_rankings = <<-HIST

****

********

**

HIST

and then you can add a generator as follows:

one_of([1,2,3],star_rankings)

Histograms no longer have to add up to 1 — the system will scale values appropriately — and they can be different lengths, though a histogram with more probabilities than values will throw an exception, while a histogram that has fewer probabilities will generate a warning.

Ranges As Results

If your generator returns a Ruby Range object, Hirsute will return a random value (based on a uniform distribution) from within that range. That lets you easily construct a script for the friends example above:

one_of([1..10,11..50,51..100],[0.75,0.2,0.05])

MySQL Batching And CSVs

The MySQL outputter now bundles up inserts for faster loading. CSV is now a supported output format.

Post-Generator Blocks Run Within Object

When you attach a block to a generator, the code in that block will run within the context of the generated object. This lets you access existing fields within the newly-minted object.

Sunday, February 10, 2013

All Together Now: A Look at Concurrent Languages

Over the past several months, I've been taking a look at various languages that advertise easy concurrency and scalability. It's too late to use any of them on SimCity, but I'm always thinking about what I'll build next and how I'll build it, and these languages are on my radar. Java is increasingly cumbersome to me as a language, and its concurrency constructs are too error-prone even for senior programmers, not to mention that the thread model Java exposes has serious performance issues if not carefully managed.

Here's my quick take on the three languages I focused on: Erlang, Scala, and Go. The TL;DR version: I'd use Erlang for infrastructure in a heartbeat, Scala will make your developers more productive at the possible expense of application performance, and Go is fast but less fun to work in.

For each language, I worked through at least one book on the subject and then built something for myself. The personal projects ranged from small to sizable.

Erlang

Erlang was my favorite of the three languages, and it would be hard to ever argue against using Erlang for back-end infrastructure pieces that require high scalability. Say, message queues, or chat systems, or NoSQL databases, or the backbones of prominent first-person shooters.

Like all these languages, it has a high-level abstraction for concurrency, but, unlike the others, it easily supports passing messages between machines, which bodes well for a cluster of servers. It has extensive fault tolerance mechanisms, even across machines, allowing for robust systems. It has support for hot code swapping, opening the possibility of upgrading a system while it's still live and reducing maintenance windows to nil. It has great support for extracting values out of binary data, which is invaluable when dealing with network traffic and proprietary data formats. It's a mature, proven technology. And it has the benefits of functional programming: more concise code that reduces the number of potential bugs and immutable objects that prevent weird thread-safety issues.

On the other hand, I can probably count on one hand the number of other Erlang programmers I've met. And it's not like taking a C programmer and teaching them Java; functional programming is a distinct mental shift from imperative programming, and it can be hard to get your head around it. That means that you can write your Erlang code all you want, but what about the people who will have to maintain your system beyond you? Its small community also means that while there are certainly lots of third-party libraries for it, it's not the vast universe that Java enjoys. And while immutable objects are easy to work with, they're also expensive because new ones are constantly being made.

Scala

While Twitter's Scala School argues for treating Scala as a separate language, it's hard not to compare it to Java, since it compiles down to Java bytecode and runs on the same virtual machine. And in terms of developer productivity, Scala rockets past Java in my book.

As an application layer language, it has tremendous advantages. You can accomplish complicated tasks with much fewer keystrokes. You can use functional paradigms and immutable objects, but also use imperative style and mutable objects if you need performance, or, crucially, if you need to bring another developer on board with your system. You can enjoy the same concurrency abstraction Erlang provides. You can leverage Java's seemingly infinite supply of open-source libraries. You can incorporate it into an existing Java application, giving you the ability to bring it in without rewriting everything. You can even easily build internal DSLs with it to make your system more expressive and easier to maintain.

But, in my own experiments and in anecdotal evidence, it suffers from sluggish performance. All that great functionality makes your developers more productive, but potentially at the expense of speed. This makes sense; all that pretty code needs to be contorted and converted into Java with who-knows-how-many object creations along the way. Obviously Twitter and Foursquare manage to be fairly fast, but how much engineering time is spent to get them there? On the other hand, a system that enjoys greater and easier concurrency than Java might be more scalable and have more consistent performance under load, even if any given call could be faster in another language.

Go

Go is Google's attempt to build a better C, with a focus on developing distributed systems at a scale that Google needs. Its concurrency model is distinct from Erlang and Scala's, preferring the notion of Communicating Sequential Processes to the Actor model, but neither is particularly superior; each has strengths and weaknesses that fit different situations.

The big win with Go is its speed: Go programs are compiled down to machine code. And while the built-in garbage collection probably means that C would win a horse race between the two, Go is a much less error-prone language to work in. Its community is still young, but it seems eager to improve the language, and a wide variety of useful go libraries already exist. It's hard to compete with the many years of robust Java libraries out there, but Go nuts seem to have filled in the most obvious needs.

I have to admit that I dislike working in the language itself; it lacks the cleanliness of Erlang and the depth of Scala. But there's no denying that its concurrency model is easy to work with, and the programs that you create are nice and zippy relative to their Java counterparts.

Others

I've yet to dig deep into Clojure, though it's the obvious next one. I figure if I'm going to be a fan of functional programming, I might as well go into crazy Lisp land. But I'd worry that it would suffer from the same performance problems — for the same reasons — as Scala.

It seems funny to mention node.js in this post, since in some ways it's all about zero concurrency: a single thread of execution is all you get. Of course, under the hood there's lots of asynchronous work, but it's tied directly to the operating system's I/O. We use it for a subsystem in SimCity, and it, like everything, has strengths and weaknesses. It can do lots of I/O tasks concurrently. Lots. But it's very sensitive to slow code, since that code will block the entire thread when it runs. It appeals to the game developers on my team, since single threads, event loops, and performance-sensitive code are the norm for them.

However, it's not very mature, and the libraries for it can be buggy and incomplete at this stage. I think we made the right decision switching our SimCity subsystem to node.js, since it outperforms its Java predecessor by a long shot, but it hasn't been simple or without issues.

Tuesday, February 5, 2013

@AsyncToExecutorService

When I first discovered Spring's @Async annotation, I thought it was a great idea. Slap an annotation on a method, and that method would be turned into a task on a work queue serviced by a different thread. A large number of tasks in a web server can be done asynchronously with respect to the incoming request, which means you can respond more quickly to your user. (These days, I'd write a system around events and ignore threads, but that's not the environment I'm in.)

But after a couple of months of using @Async, I found it annoying. It's applied with Spring AOP, which means it can only be applied to public methods on top-level beans and won't take effect on intra-object methods. It also seemed to inevitably cause circular dependency issues. And, perhaps most annoying, all @Async methods go into a single work queue. Together. With no priority, no distinct properties for different kinds of jobs, no control.

While this last issue has been dealt with in Spring 3.2, that's not what I'm using (though support for Servlet 3.0's asynchronous requests is a tempting carrot). And it still has the Spring AOP limitations.

I moved us off of Spring AOP a few months back in favor of straight AspectJ, which has been invaluable. We can declare control-flow pointcuts, pointcuts on private methods, and more. And during that migration I read AspectJ in Action, which features an example that basically does the same thing as the @Async annotation.

So what if I just rolled my own asynchronous execution aspect that allowed me to specify a thread pool to receive the work?

On a quiet morning, I did just that. I started with the example in the book, and then added a couple of my own twists. My @AsyncToExecutorService takes the name of a Spring bean that is an ExecutorService, and routes the join point to it within a Runnable. If you specify an invalid bean name, it throws a runtime exception. I also added some flags so you could declare that you need to run under the aegis of Spring's @Transactional annotation and whether you could be pointed to a read slave.

While Spring's XML files can get a bit wordy, it's also handy to be able to construct a large range of objects directly from that XML. When I want to define a new thread pool, I do it completely in the XML. I also set up our metrics system to automatically grab any ExecutorService beans on startup and record metrics about their current queue size and active threads so that the thread pools could be easily monitored.

I did the work because I needed it for a performance improvement, but I've yet to check in that optimization and have already heavily leveraged my new annotation into several other areas. My current favorite is a thread pool designed to discard tasks when its backing queue fills up. Non-critical tasks go into this queue and, if we're under load, they just start getting tossed. It's a built-in safety valve.