Monday, December 23, 2013

CAD For Programmers

I'm a back-end programmer. I deal with databases and servers, and I think a terminal window is a perfectly reasonable interface. And my normal creative outlets are writing and, yes, programming.

As you might imagine, then, CAD programs fill me with dread. They seem so cool! And yet, I spend vastly more time learning how to do basic stuff in them than I ever spend doing said stuff. I keep thinking, "Can't I just write a program to do this?"

And the answer, it turns out, is yes. If you use OpenSCAD. I learned about it from Make magazine's special issue on 3D printers (a topic I'm this close to really digging into). While a real CAD user may find it limiting, for someone like me, who works in programming languages and usually has no design more complex than a mechanical puzzle piece, it's perfect. You -- get this! -- just write a program to do all the work you need! You refactor with variables, functions, and modules, and you provide instructions through a declarative programming language.
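To give a flavor of it, here's a small OpenSCAD sketch of my own (a toy example, not from any real design): a parameterized base with a peg-shaped hole subtracted from it.

peg_r = 3;             // variables drive the whole design
base = [20, 20, 8];

module peg(h) {
   cylinder(r = peg_r, h = h);
}

difference() {
   cube(base);                            // a solid base...
   translate([10, 10, 0]) peg(base[2]);   // ...minus a peg-shaped hole
}

Change peg_r once and the whole model updates -- exactly the refactoring workflow a programmer expects.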

Finally, a CAD tool that works the way I do.

Saturday, September 7, 2013

pomaid - a tool for creating pom.xml

One of the many things I dislike about Maven is its use of XML as a language format. This was au courant back when Maven was released, but by now, many people realize that this is a poor use for XML.

It's not hard to figure out why. A pom.xml file (the script that drives Maven) consists of XML syntax plus the data that is specific to your project. That project-specific data is not just information; it's the most important information in the file for anyone on your project reading the pom.

Most programmers would probably agree with this, but then they all forget information visualization 101: Keep important information front and center. Most people think of data visualizations when they hear information visualization, but I would argue that any information you produce should be easy for people to understand, and the theory behind visualizations is still relevant.

I grabbed a random pom off the Internet and broke it apart into the Maven syntax and the data you actually care about. Syntax is the start and end tags of XML. Data is the values relevant to your project. Here's the breakdown.

More than 75 percent of this pom.xml file was the syntax for Maven. Less than one-quarter of the text in the file was the data pertinent to the project.

I recently read an interesting article about "Invisible XML," which argued for the creation of DSLs that present information clearly and concisely but transform into whatever XML your system might need.

I thus decided to make a tool that would provide a minimalist syntax that could be used to generate pom.xml files needed by Maven. (And, inspired by the "Little Languages" chapter in The AWK Programming Language, I decided to do it in that language.) The result is pomaid.

Since dependencies are the meat of most pom files, I focused on that. Here's an excerpt from the pom I linked to earlier:


And here's the equivalent in pomaid:

And here's the syntax vs. data breakdown of the entire pom file I mention above, redone as a pomaid script.


In other words, the pomaid file pushes syntax into the background to produce something vastly more information-rich than its pom.xml equivalent.

While this may or may not be useful, I think it illustrates the point well enough: If you have to deal with XML, you may be better off writing a language that can provide clarity to the reader and then translate into the syntax-heavy, data-light XML.
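To make that concrete, here's a toy awk script in the same spirit (a sketch of the idea, not pomaid's actual syntax): it turns one-line dependency declarations like junit:junit:4.11 into the XML Maven wants.

# Toy translator: each input line is groupId:artifactId:version
BEGIN { FS = ":"; print "<dependencies>" }
NF == 3 {
   print "  <dependency>"
   print "    <groupId>" $1 "</groupId>"
   print "    <artifactId>" $2 "</artifactId>"
   print "    <version>" $3 "</version>"
   print "  </dependency>"
}
END { print "</dependencies>" }

One line of nearly pure data in, the usual five-line stanza out.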

Sunday, July 14, 2013

Internet Radio Podcast, the Raspberry Pi Edition

I recently bought a Raspberry Pi*, the small-and-cheap-but-powerful computer taking the geek world by storm. The hope is that someday I can use this little board or its successor to teach my baby daughter more about my own interests. But given my current time constraints, I figured I should get one now to ensure I've done at least one significant project with it by the time she's old enough to grok what this thing is.

That's right; I think I can do one useful project with my Pi in the next ten years.

People go crazy with their Pis: rigging up their microwaves to cook food according to the bar code on it, taking pictures from very, very high in the sky, and building souped-up irrigation systems that would be the envy of pot farmers everywhere.

I do have ambitious goals in mind, but I wanted to start with something a little simpler. Some of you may remember my Rube-Goldberg-esque "Radio Broadcast Podcast." An obvious failing of my original setup was that it required me to leave my MacBook running. Predictably, this has turned out to add enough friction that the recordings don't get made.

At the time, commenters offered suggestions to fix that obstacle. One was to set up an EC2 server that could run all the time. A good idea, and I'm certainly comfortable with EC2, but I never got around to it. Between launching a big game and having a baby, the priority on the whole endeavor kept dropping. Still another suggestion was to just buy a server for the home. A very good idea, but we didn't have space for it even before we had a baby.

Enter the Raspberry Pi. Since it's just a normal computer that runs Linux, it can also be a server. Not a super-powerful one, but I don't need a super-powerful server for my house. And it's hard to argue with the physical footprint: the Pi and its case are roughly the same size as two or three stacked decks of cards. I just tucked it into a small space behind our TV.

I decided to start from scratch on my radio-to-podcast scripts, mostly because I wanted to fix some problematic and error-prone behaviors from version 1. It ended up being a few independent parts:

  1. A Python script runs every five minutes as a cron job and uses a config file to determine if it should be capturing streaming audio. If yes, and streamripper isn't running, it starts the program, figuring out a file name based on parameters in the config file (including adding numbers if a file of that name already exists).
  2. Another cron job looks in the download directory for any mp3 files that haven't been touched in five minutes, uploads any it finds to S3 via s3cmd, and then deletes them.
  3. Yet another cron job creates a podcast-compliant XML file by getting a listing of items on S3, running them through an awk script that generates the necessary XML, and then taking that output and uploading it to S3.
This may still seem Rube-Goldbergian, but it's designed to be more resilient than its previous incarnation. If streamripper crashes or the network goes out, my script will ensure that recording starts again at the next opportunity without overwriting the first part. If the Pi can't reach S3 for some reason, the files will stay in place until they can be uploaded. There's no cleverness around updating the podcast XML file; it's created from scratch every time. (Note that doing a bucket listing on S3 very much exposes you to eventual consistency issues, but even a day's worth of latency is probably fine for my purposes.)
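As a sketch, the heart of step 1 might look like this (the schedule, paths, and stream URL here are hypothetical placeholders; the real script reads them from the config file):

#!/usr/bin/env python
# Sketch of the cron-driven recorder check.
import os
import subprocess
import time

STREAM_URL = "http://example.com/stream"   # placeholder
DOWNLOAD_DIR = "/home/pi/recordings"       # placeholder

def recording_window_active():
    # hypothetical schedule: record weekdays from 9 to 11 AM
    now = time.localtime()
    return now.tm_wday < 5 and 9 <= now.tm_hour < 11

def streamripper_running():
    return subprocess.call(["pgrep", "-x", "streamripper"]) == 0

def unique_name(base):
    # add numbers until we find a name that doesn't already exist
    name, n = base, 1
    while os.path.exists(os.path.join(DOWNLOAD_DIR, name + ".mp3")):
        name = "%s_%d" % (base, n)
        n += 1
    return name

if recording_window_active() and not streamripper_running():
    subprocess.Popen(["streamripper", STREAM_URL,
                      "-d", DOWNLOAD_DIR,
                      "-a", unique_name(time.strftime("show_%Y%m%d"))])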

The onus is still on me to clear old files from S3, but that's something I can do every few months. And once I do, the XML file will reflect that within an hour.

The next step is to run an HTTP server on the Pi so that I can log in and make configuration changes from any of the devices on the network instead of sshing in and modifying the config file directly. And after that? I'm tempted to add AirPlay capability so that if I happen to be home and awake, I can send the current download to the living room speakers.

Overall, I'm pretty impressed with the Pi's capabilities. I doubt I can pile too much onto it, but I'm curious to see what its actual limits are as a household server. Future projects involve using its ability to talk to other electronic components, which is one area where the Pi shines.

* I bought the Maker Shed Starter Kit. It comes with a good book and most of the stuff you'll need to get going and do the book's exercises.

Tuesday, July 9, 2013

Screen Time

This is a bit of a tangent for a programming blog, but I figure this is a relevant topic for many people reading this blog.

The TL;DR review: If you're interested in kids and how they interact with media, buy this book.

I love technology. Probably more than most people outside the technology world really understand. I love its potential, the changes it makes to the world, and the ideas that drive it forward. And I look forward to sharing that love with my baby daughter. Some dads can't wait until their kid can go to a baseball game; I can't wait until she is able to grasp that technology is a thing within her ability to control and mold. And the day she expresses an interest in controlling that technology herself? They'll be able to see me smiling from the International Space Station.

That's not always a popular view here in Hippyville, USA, otherwise known as Berkeley. Paradoxically, a casual drive away from some of the biggest tech companies on the planet, there is a pervasive notion that mixing technology and kids is bad. One extreme -- Waldorf schools -- came up in conversation with my wife. Here's an excerpt from http://www.whywaldorfworks.org:

"Waldorf teachers are concerned that electronic media hampers the development of the child's imagination. They are concerned about the physical effects of the medium on the developing child as well as the content of much of the programming."

While I think Waldorf has some good ideas, that first sentence sent me into a fury. You can't lump all electronic media together unless you're driving your philosophy through some misguided notion that "things were better in the old days." I'm not going to argue that Angry Birds or Robot Unicorn Attack is good for a kid's development, but I know of plenty of apps and "games" that are all about imagination. And I feel that the Waldorf view actually cuts off areas of exploration. (Let alone its effect on kids who, let's face it, are coming of age in a technology-driven society.)

But what do I know? Just what I feel in my gut. One article I read about kids and technology suggested a book called Screen Time, by Lisa Guernsey (the title seems to be messed up in the Amazon.com database), so I bought it.

In the end, she says, there's no easy answer. But she has a recurring theme she's derived from her research that she applies to her own children: What content is your child seeing, what context are they seeing it in, and where is your individual child in their development? Are you destroying your kids by letting them watch TV while you shower? No. Are there long-term downsides to media access if parents aren't careful? Yes.

The book focuses heavily on television time, which is less relevant to me since we rarely watch TV or even have it on. Her epilogue notes the emergence of the iPad and its potential, but there's precious little research about it so far (however, it does solve one problem with previous incarnations of interactive toys: the abstraction needed for "I press this thing on this thing way over here and the screen does something.") She does list links to sites that do thorough reviews of children's interactive media, however. My favorite is Children's Technology Review.

Still, the book is full of eye-opening explorations of the research that's out there, which she has extensively plumbed, as evidenced by her endnotes and massive bibliography. She divides the book into chapters that each address a prevailing question about kids and screens: What Exactly Is This Video Doing To My Baby's Brain? Could My Child Learn From Baby Videos? My Toddler Doesn't Seem to Notice When the TV Is On -- Or Does He? And so on.


Here were some of my key takeaways, in no particular order:

  • TV is not turning your kid into a zombie through passive engagement. She says that the research that argued that (heavily cited by another book on my queue, Waldorf-darling Endangered Minds) has been widely debunked. Kids engage heavily with what's on screen, trying to understand it.
  • The standard advice from the American Academy of Pediatrics about no screens before two years old? Not really based on any research.
  • TV does reduce creative play time and time with parents.
  • TV in the background is a bad idea. It exposes kids to adult-themed content and makes it harder for them to develop language skills. Make it a foreground activity when it's on. (And only allow short, controlled watching times. Most parents paying attention have a "you can watch this half-hour video and that's it" mentality.)
  • Telling your toddler that something scary on TV or in a movie is "not real" is useless. They're toddlers; everything is real.
  • If your child has an interactive electronic thing, you will naturally steer towards saying, "Push that button now, or don't do that yet." Don't. Let the kids explore on their own. Also, most interactive media sucks: It imposes artificial constraints on what can and can't be done at any moment. There are plenty of apps that don't fall into that trap.
  • Dora the Explorer is awesome for kids. So is Sesame Street, Mister Rogers, some show called Dragon Tales, and, yes, even Barney.
  • Products that purportedly improve your kid are rarely based on real research. But it is possible for kids to learn from DVDs and the like. Choose wisely.
  • Kids' brains are weird.
  • Products and shows often have very specific age ranges they're aimed at. Pay attention to those (but note that publishers will often advertise a wider audience to get more sales.)
  • Your involvement is important. Not just in an "Are you having fun?" way but in a "What did you think of this or that?" way. Engage your kids about what they're seeing.
  • Moderation is important: Video entertainment is just one facet of experience (duh.)
I can't praise this book highly enough. The author has had the ability (helped along by various grants) to really dig in deep. She looks at academic research, talks to scores of parents and social workers, looks at her own kids, and tries to shape what little material is out there. 

It's possible that I highlighted more of this book than not.

Sunday, July 7, 2013

DSByte: Scala-based DSL for binary data structures

About a year ago, I wrote about a tool for creating binary structures more easily. It was inspired by Erlang's binary packing syntax and was built for a scenario where a server was creating small binary messages for a client that didn't speak Java (i.e., simple serialization wasn't a possibility).

I recently decided to recreate it, but this new version is a whole new beast. It's still inspired by Erlang's syntax, but I've enhanced that syntax, rearchitected the system, and written it in Scala.

The very early version of DSByte, as I call it, can be found under my GitHub account. It currently only supports packing from objects into binary, but I'll add unpacking from binary into objects soon.

One of the things I like about Scala is its built-in support for simple DSLs, and this was a good opportunity to exercise that feature. While the parser combinator feature takes a bit of getting used to, it's quicker and friendlier than ANTLR, though also less powerful. For my needs, it fit the bill nicely.
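To give a taste of the feature (a self-contained toy grammar of my own, not DSByte's actual syntax), here's a parser for little field specs like size:16, flags:8:

import scala.util.parsing.combinator.JavaTokenParsers

// Toy grammar: a comma-separated list of name:bits fields.
object FieldSpecParser extends JavaTokenParsers {
  case class Field(name: String, bits: Int)

  def field: Parser[Field] = ident ~ (":" ~> wholeNumber) ^^ {
    case name ~ bits => Field(name, bits.toInt)
  }

  def spec: Parser[List[Field]] = repsep(field, ",")

  def main(args: Array[String]) {
    // prints something like: [1.17] parsed: List(Field(size,16), Field(flags,8))
    println(parseAll(spec, "size:16, flags:8"))
  }
}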

Friday, April 19, 2013

Awkward And Upward

The subject of awk, a text-processing tool on UNIX systems, came up recently in a number of contexts — a book I was reading, a test harness at work, and a blog post I had bookmarked — and I decided it was time to learn a bit more about it. Like anyone who's been around servers for a while, I've used awk here and there, but never in earnest.

So I picked up sed & awk and worked through the code. (An aside: The book is 16 years old at this point; it's fascinating to remind oneself what the "download the source code" options were at the time: FTP, UUCP, and more, but not HTTP. But downloading code samples is a less effective learning strategy for me compared to typing it in myself.)

I'm sold. I've written any number of little Ruby scripts for processing text files over the years, but awk is a DSL designed just for this purpose. awk handles the mechanical aspects of opening a file and reading in each line. You write the business logic you care about.

Here's an example: As an exercise, I took a Ruby script I had written to extract stack traces from a Java thread dump and rewrote it in awk. I wanted every stack trace where any line in the trace matched the regex I passed in. This is useful for, say, finding every thread that uses a particular library or stems from a particular source or goes through a particular code path. A Java server can have hundreds of threads, which can make it tough to focus on one subsection of code. This lets me zero in on the stacks I care about.


# extracts bits of a thread dump based on a passed in regex
# usage: awk -f extract_stacks_by_regex -v regex=[regex] [file]

# lines with stuff when we've established we're in a stack dump that matches the regex
$0 ~ /^..*$/ && matches == 1 {
   print
}

# lines with stuff when we haven't yet found a match: build up current stack and check to see if this is a match
$0 ~ /^..*$/ && matches == 0 {
   current_stack = current_stack "\n" $0
   if ($0 ~ regex) {
      matches = 1
      print current_stack
   }
}

# lines without stuff: reset 
/^$/ {matches = 0;current_stack = ""}




This script is about half the size of the Ruby version.

As someone who's often digging into logs or massaging data for visualizations, a few hours spent getting really solid with awk will no doubt pay for themselves over and over again.

Tuesday, April 16, 2013

Better Visualization Of Baby Sleep

My wife uses an app from Similac to track various things about our baby: diaper changes, sleep schedules, and feeding schedules. Friends of ours recommended it to us, and we've recommended it to other people.

But I find its visualization for sleep schedules lacking. Anything beyond the daily view just shows you how many hours your baby slept on a given day; it doesn't show you when on that day your baby slept. And if a sleep session starts on one day and ends on another, the hours are only counted for the first day. In the summary the app provides, this probably doesn't matter; the hours will even out. But it's confusing.

Fortunately, the app exports its data. So I figured I could get the export and then produce the chart I had in my head.

The first step was cleaning the data. Here's a sample of what the app sends.
Start Time4/11/13, 2:09 PM
Duration2 hrs 11 min
Time of DayDaytime
Laid Down AwakeNo
Not very program friendly, but it's easy enough to fix with awk.

function quote(string) {
   return "\"" string "\""
}

BEGIN {
   FS = "[, \t]"
   print quote("line") "," quote("start_date") "," quote("start_time") "," quote("duration")
}

/Start Time/ {
   date = $3
   date_and_timestamp = quote(date) "," quote(date " " $5 " " $6)
}

/Duration.*hr.*min/ {
   print quote(NR) "," date_and_timestamp "," quote((($2 * 60) + $4))
}

/Duration.*hrs?$/ {
   print quote(NR) "," date_and_timestamp "," quote(($2 * 60))
}

$3 ~ /min$/ {
   print quote(NR) "," date_and_timestamp "," quote($2)
}

That turns a chunk of text like the one above (and its variations) into a line like this:

"8","4/11/13","4/11/13 2:09 PM","131"


(I add the line number at the beginning so that R has a primary key to work with on the import.)

Once I made the CSV, I pulled it into R. As usual in R, drawing the chart was straightforward once the data was correct. Here's the meat of it:


rect(xleft=data$sleep_offset,
     ybottom=data$y_value-.25,
     xright=(data$sleep_offset + data$duration),
     ytop=data$y_value+.25,
     col=data$rect_color,
     border=NA)


But even my cleaned data needed some cleaning within R. The data the app exports suffers from the same problem when it comes to sleep sessions that cross midnight. You might see an entry for 04/11/13, 9:00 PM with a duration of five hours. But you won't see any data for 4/12/13, midnight to 2:00 AM.

So I first added new entries that duplicated those records but provided a "start date" (which translates into the y-axis) of the "other side of midnight" time frame. So the theoretical 9:00 PM sleep session above would produce another row of data where the start date was 04/12/13 and the start time was 04/11/13 9:00 PM. That meant that the "sleep offset" (position on the x-axis) was negative, which is what I wanted.

Finally, I wanted to draw any chunks of sleep that fell outside a given date (because of the overlap) in a different color. So I broke the overlap rows apart: to use the same example, a row for 04/11/13 with a start time of 9:00 PM and a duration of 3 hours, and a row for 04/12/13 from midnight to 2:00 AM. I stored the color in the data as well, so I could tell R to just draw the rectangles based on each row's coordinates and with the color specified in one of the fields.
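In code, that split looks roughly like this (a simplified sketch using the column names described above; the actual values and color are illustrative, and the real script also handles the negative-offset duplicates):

# Sketch: split one overlap row into a same-day piece and a next-day piece.
# sleep_offset and duration are in hours relative to midnight.
split_overlap <- function(row) {
   end <- row$sleep_offset + row$duration
   if (end <= 24) return(list(row))            # no overlap: leave as-is
   first <- row
   first$duration <- 24 - row$sleep_offset     # e.g., 9 PM to midnight = 3 hours
   second <- row
   second$start_date <- row$start_date + 1     # plotted on the next day's line
   second$sleep_offset <- 0                    # starts at midnight...
   second$duration <- end - 24                 # ...and runs the remaining hours
   second$rect_color <- "lightgreen"           # overlap chunks drawn differently
   list(first, second)
}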

Here's what I came up with. Note that this particular chart is based on fake data (since I have a tool that makes that easy), because I didn't want to expose the baby's personal data for all the world to see. But the chart gives the gist.



Each horizontal line is a day, with the night before and morning after visible on the chart but unobtrusive. The dark green represents sleep sessions within that calendar day, spanning the time listed on the x-axis. I often feel that every visualization I make ends up being a small multiple, but it is true that I often want to quickly look at a large mass of data and make comparisons within it.

The visualization is a work in progress. I'd like to put the total on the right, and a friend suggested that I add a heat map to show when parents have the best chance of getting in a long nap.

But compare this grid to what, in the app, would simply be a line graph with a single number for each day. That doesn't tell you anything about, say, how long your baby's been sleeping at night this week versus last week. Or whether her individual sleep sessions have gotten longer. We recently put up blackout curtains, and so we can see what effect that's had on the baby's sleep. We can see particularly sleepless days (or particularly sleepful ones) and correlate them with other variables such as her mood and sleep schedule.

Of course, even with the awk and R scripts written, the data still has to be emailed to me, and I still have to process it. But I think the extra detail is worth the small effort to get it.

Thursday, February 28, 2013

Advances In Hirsute

Since I first launched Hirsute, I've been plunking away at it, making little changes here and there. I thought I'd do a quick post about the changes, some of which I'm quite happy with.

Specifying Histograms

One of the main things I wanted out of Hirsute was the ability to generate data based on non-uniform histograms: for instance, most of your users might have 0-10 friends, some other percentage 10-50, and a small fraction 50-100.

But specifying that distribution was unintuitive. You had to create an array of probabilities, the probabilities had to add up to 1, and the array had to be the same length as your list of buckets.

Pondering how I might make it easier, I realized that what I wanted to do was draw out the histogram and let the system figure it out. So that's what I did.

This is now valid:

star_rankings = <<-HIST
****
********
**
HIST

and then you can add a generator as follows:

one_of([1,2,3],star_rankings)

Histograms no longer have to add up to 1 — the system will scale values appropriately — and they can be different lengths, though a histogram with more probabilities than values will throw an exception, while a histogram that has fewer probabilities will generate a warning.

Ranges As Results

If your generator returns a Ruby Range object, Hirsute will return a random value (based on a uniform distribution) from within that range. That lets you easily construct a script for the friends example above:

one_of([1..10,11..50,51..100],[0.75,0.2,0.05])

MySQL Batching And CSVs

The MySQL outputter now bundles up inserts for faster loading. CSV is now a supported output format.

Post-Generator Blocks Run Within Object

When you attach a block to a generator, the code in that block will run within the context of the generated object. This lets you access existing fields within the newly-minted object.

Sunday, February 10, 2013

All Together Now: A Look at Concurrent Languages


Over the past several months, I've been taking a look at various languages that advertise easy concurrency and scalability. It's too late to use any of them on SimCity, but I'm always thinking about what I'll build next and how I'll build it, and these languages are on my radar. Java is increasingly cumbersome to me as a language, and its concurrency constructs are too error-prone even for senior programmers, not to mention that the thread model Java exposes has serious performance issues if not carefully managed.

Here's my quick take on the three languages I focused on: Erlang, Scala, and Go. The TL;DR version: I'd use Erlang for infrastructure in a heartbeat, Scala will make your developers more productive at the possible expense of application performance, and Go is fast but less fun to work in.

For each language, I worked through at least one book on the subject and then built something for myself. The personal projects ranged from small to sizable.

Erlang

Erlang was my favorite of the three languages, and it would be hard to ever argue against using Erlang for back-end infrastructure pieces that require high scalability. Say, message queues, or chat systems, or NoSQL databases, or the backbones of prominent first-person shooters.

Like all these languages, it has a high-level abstraction for concurrency, but, unlike the others, it easily supports passing messages between machines, which bodes well for a cluster of servers. It has extensive fault tolerance mechanisms, even across machines, allowing for robust systems. It has support for hot code swapping, opening the possibility of upgrading a system while it's still live and reducing maintenance windows to nil. It has great support for extracting values out of binary data, which is invaluable when dealing with network traffic and proprietary data formats. It's a mature, proven technology. And it has the benefits of functional programming: more concise code that reduces the number of potential bugs and immutable objects that prevent weird thread-safety issues.

On the other hand, I can probably count on one hand the number of other Erlang programmers I've met. And it's not like taking a C programmer and teaching them Java; functional programming is a distinct mental shift from imperative programming, and it can be hard to get your head around it. That means that you can write your Erlang code all you want, but what about the people who will have to maintain your system beyond you? Its small community also means that while there are certainly lots of third-party libraries for it, it's not the vast universe that Java enjoys. And while immutable objects are easy to work with, they're also expensive because new ones are constantly being made.

Scala

While Twitter's Scala School argues for treating Scala as a separate language, it's hard not to compare it to Java, since it compiles down to Java bytecode and runs on the same virtual machine. And in terms of developer productivity, Scala rockets past Java in my book.

As an application layer language, it has tremendous advantages. You can accomplish complicated tasks with far fewer keystrokes. You can use functional paradigms and immutable objects, but also use imperative style and mutable objects if you need performance, or, crucially, if you need to bring another developer on board with your system. You can enjoy the same concurrency abstraction Erlang provides. You can leverage Java's seemingly infinite supply of open-source libraries. You can incorporate it into an existing Java application, giving you the ability to bring it in without rewriting everything. You can even easily build internal DSLs with it to make your system more expressive and easier to maintain.
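A contrived but representative example of that concision (my own toy snippet, not from any production code):

case class Order(customer: String, amount: Double, paid: Boolean)

object Concision {
  def main(args: Array[String]) {
    val orders = List(
      Order("ann", 10.0, paid = true),
      Order("bob", 5.0, paid = false),
      Order("ann", 2.5, paid = true))

    // filter, group, and sum in one readable expression, immutable throughout
    val totals = orders.filter(_.paid)
                       .groupBy(_.customer)
                       .mapValues(_.map(_.amount).sum)

    println(totals) // Map(ann -> 12.5)
  }
}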

But, in my own experiments and in anecdotal evidence, it suffers from sluggish performance. All that great functionality makes your developers more productive, but potentially at the expense of speed. This makes sense; all that pretty code needs to be contorted and converted into Java bytecode, with who-knows-how-many object creations along the way. Obviously Twitter and Foursquare manage to be fairly fast, but how much engineering time is spent to get them there? On the other hand, a system that enjoys greater and easier concurrency than Java might be more scalable and have more consistent performance under load, even if any given call could be faster in another language.

Go

Go is Google's attempt to build a better C, with a focus on developing distributed systems at the scale Google needs. Its concurrency model is distinct from Erlang's and Scala's, preferring the notion of Communicating Sequential Processes to the Actor model, but neither is particularly superior; each has strengths and weaknesses that fit different situations.
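For a sense of the CSP style (a minimal example of my own, not from any Go documentation): goroutines share data by communicating over channels instead of locking shared memory.

package main

import "fmt"

// A worker receives jobs over one channel and sends results back over
// another; no shared state, no locks.
func main() {
    jobs := make(chan int)
    results := make(chan int)

    go func() {
        for j := range jobs {
            results <- j * j
        }
        close(results)
    }()

    go func() {
        for i := 1; i <= 3; i++ {
            jobs <- i
        }
        close(jobs)
    }()

    for r := range results {
        fmt.Println(r) // 1, 4, 9
    }
}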

The big win with Go is its speed: Go programs are compiled down to machine code. And while the built-in garbage collection probably means that C would win a horse race between the two, Go is a much less error-prone language to work in. Its community is still young, but it seems eager to improve the language, and a wide variety of useful Go libraries already exist. It's hard to compete with the many years of robust Java libraries out there, but Go nuts seem to have filled in the most obvious needs.

I have to admit that I dislike working in the language itself; it lacks the cleanliness of Erlang and the depth of Scala. But there's no denying that its concurrency model is easy to work with, and the programs that you create are nice and zippy relative to their Java counterparts.

Others


I've yet to dig deep into Clojure, though it's the obvious next one. I figure if I'm going to be a fan of functional programming, I might as well go into crazy Lisp land. But I'd worry that it would suffer from the same performance problems — for the same reasons — as Scala.

It seems funny to mention node.js in this post, since in some ways it's all about zero concurrency: a single thread of execution is all you get. Of course, under the hood there's lots of asynchronous work, but it's tied directly to the operating system's I/O. We use it for a subsystem in SimCity, and it, like everything, has strengths and weaknesses. It can do lots of I/O tasks concurrently. Lots. But it's very sensitive to slow code, since that code will block the entire thread when it runs. It appeals to the game developers on my team, since single threads, event loops, and performance-sensitive code are the norm for them.

However, it's not very mature, and the libraries for it can be buggy and incomplete at this stage. I think we made the right decision switching our SimCity subsystem to node.js, since it outperforms its Java predecessor by a long shot, but it hasn't been simple or without issues.

Tuesday, February 5, 2013

@AsyncToExecutorService


When I first discovered Spring's @Async annotation, I thought it was a great idea. Slap an annotation on a method, and that method would be turned into a task on a work queue serviced by a different thread. A large number of tasks in a web server can be done asynchronously with respect to the incoming request, which means you can respond more quickly to your user. (These days, I'd write a system around events and ignore threads, but that's not the environment I'm in.)

But after a couple of months of using @Async, I found it annoying. It's applied with Spring AOP, which means it can only be applied to public methods on top-level beans and won't take effect on intra-object methods. It also seemed to inevitably cause circular dependency issues. And, perhaps most annoying, all @Async methods go into a single work queue. Together. With no priority, no distinct properties for different kinds of jobs, no control.

While this last issue has been dealt with in Spring 3.2, that's not what I'm using (though support for Servlet 3.0's asynchronous requests is a tempting carrot). And it still has the Spring AOP limitations.

I moved us off of Spring AOP a few months back in favor of straight AspectJ, which has been invaluable. We can declare control-flow pointcuts, pointcuts on private methods, and more. And during that migration I read AspectJ in Action, which features an example that basically does the same thing as the @Async annotation.

So what if I just rolled my own asynchronous execution aspect that allowed me to specify a thread pool to receive the work?

On a quiet morning, I did just that. I started with the example in the book, and then added a couple of my own twists. My @AsyncToExecutorService takes the name of a Spring bean that is an ExecutorService, and routes the join point to it within a Runnable. If you specify an invalid bean name, it throws a runtime exception. I also added some flags so you could declare that you need to run under the aegis of Spring's @Transactional annotation and whether you could be pointed to a read slave.
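Reconstructed as a sketch, the shape of it looks something like this (names, defaults, and details are illustrative, not the actual code):

import java.lang.annotation.*;
import java.util.concurrent.ExecutorService;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.beans.factory.BeanFactory;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AsyncToExecutorService {
    String value();                         // Spring bean name of the ExecutorService
    boolean transactional() default false;  // run the work under @Transactional semantics
    boolean readSlaveOk() default false;    // the work may be pointed at a read slave
}

@Aspect
class AsyncRoutingAspect {
    private BeanFactory beanFactory; // injected by Spring

    @Around("execution(@AsyncToExecutorService void *(..)) && @annotation(config)")
    public Object route(final ProceedingJoinPoint pjp, AsyncToExecutorService config) {
        // getBean throws a runtime exception if the bean name is invalid
        ExecutorService pool = beanFactory.getBean(config.value(), ExecutorService.class);
        pool.submit(new Runnable() {
            public void run() {
                try { pjp.proceed(); } catch (Throwable t) { /* log it */ }
            }
        });
        return null; // fire-and-forget, so this only makes sense for void methods
    }
}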

While Spring's XML files can get a bit wordy, it's also handy to be able to construct a large range of objects directly from that XML. When I want to define a new thread pool, I do it completely in the XML. I also set up our metrics system to automatically grab any ExecutorService beans on startup and record metrics about their current queue size and active threads so that the thread pools could be easily monitored.
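For example, a fixed-size pool can be declared entirely in XML via a static factory method (the bean name here is made up):

<!-- hypothetical pool: four threads behind the default queue -->
<bean id="notificationPool" class="java.util.concurrent.Executors"
      factory-method="newFixedThreadPool">
   <constructor-arg value="4"/>
</bean>

A method annotated with @AsyncToExecutorService("notificationPool") would then route its work to that pool.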

I did the work because I needed it for a performance improvement, but while I've yet to check in that optimization, I've already heavily leveraged my new annotation in several other areas. My current favorite is a thread pool designed to discard tasks when its backing queue fills up. Non-critical tasks go into this queue and, if we're under load, they just start getting tossed. It's a built-in safety valve.
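The JDK supports that discard behavior directly; a sketch with made-up sizes:

import java.util.concurrent.*;

class LossyPools {
    // Bounded queue plus DiscardPolicy: when the queue fills up, new
    // tasks are silently dropped instead of blocking or throwing.
    static ExecutorService newLossyPool() {
        return new ThreadPoolExecutor(
            2, 2,                                    // fixed pool of two threads
            60, TimeUnit.SECONDS,                    // idle keep-alive
            new ArrayBlockingQueue<Runnable>(1000),  // bounded backing queue
            new ThreadPoolExecutor.DiscardPolicy());
    }
}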

Tuesday, January 1, 2013

Hirsute: Fake Data for the Real World

At a previous job, which sold web services to help school districts track their tests and their progress against state tests, the sales team thought they could be more effective if our demo district was more realistic. The demo district, which had survived years of service, only had a few hundred students. Superintendents wanted to get a sense of what the system could really do for their needs.

You couldn't just show a district some other district's data, though, for confidentiality reasons. So one of our engineers came up with a solution. He took data from other districts and munged it together. He didn't just swap names around, though. Because we tracked demographic information, he squished students together within demographics. So you'd end up with Hispanic names together. And those students would have test scores that mirrored the Hispanic population in your district (since, at that point, we probably had data from nearby districts in your state). Kids of different socioeconomic status would also have similar test scores, and so forth. I seem to remember that one district half-jokingly suggested they just run the district off the demo they saw, since it was so close to their own.

I thought of that recently when working with our loadtesting group. I'm sure EA's centralized loadtesting group is no different from that of any other corporation in that they lack intimate details about our business objects. On Spore, I remember that our loadtesting database was set up with something like 100,000 users (we have 3 million or so now), each of whom had something like 10,000 creations (most users probably have 20). Or all the sporecasts had 5,000 items (most held somewhere in the few tens of creations). Early work on SimCity's system produced similarly unrealistic data. That kind of data makes it hard to tune queries and indexes, figure out caching strategies, and do all the other normal stuff one needs to do with data for a system.

What I wanted, I thought, was a DSL that would let me do what we did for school districts: Specify how the data should kind of look, and let the system generate that data for me.

Thus, Hirsute was born. Hirsute is an internal DSL built on top of Ruby and its extensive metaprogramming facilities (an aside: I recommend Metaprogramming Ruby for its solid information, though most emphatically not for its trumped-up dialog and narrative structure). In Hirsute, you build templates that define how objects should look, and then you create collections of objects derived from those templates. You can specify histograms of probabilities so that you don't just get a random distribution among options but a distribution that reflects your real-world requirements. You can read in data from files and randomly combine them to get plausible composite strings. You can then flush the collections out to files ready-made for loading into a database (MySQL at the moment).

For instance, here's some Hirsute code from a fictional wine-cellar-management service that I created as a sample (since the data requirements most on my mind are for SimCity, which I can't talk about). There's also a manual that goes into greater detail.

# This defines a tasting note that a single user might write about a single bottle. It pulls descriptors from various files.
a('tastingNote') {
  has :tasting_note_id => counter(1),
      :description => combination(
         subset(
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","},
           read_from_file('wine_cellar_aromas.txt') {|text| text + ","}),
         subset(
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","},
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","},
           read_from_file('wine_cellar_flavors.txt') {|text| text + ","})
         ),
      :rating => one_of([1,2,3,4,5],[0.1,0.1,0.4,0.3,0.1]),
      :bottle_id => 1, # filled in later
      :user_id => 1    # filled in later
    is_stored_in 'tasting_note'
      
}

tastingNotes = collection_of tastingNote

That sample defines a template for tasting notes. The description field comprises 1-6 lines pulled randomly from a file of aromas combined with 1-3 lines randomly read from a file containing wine flavors, all joined to one another with commas. The rating is from 1 to 5, but weighted such that most wines will have either a 3 or a 4. The tasting_note_id is a counter that's incremented with each new object. The bottle_id and user_id fields are filled in later when an actual tasting note is created.

Then you define a collection to hold them. You could also do this by using


   tastingNotes = tastingNote * 6000


which would create 6,000 tasting note objects based on the formula you provide.

So far, the system is pretty simple, but it gets the job done. And because it's a Ruby DSL, you can always just write raw Ruby to fill in what you need. I definitely plan to keep adding to it, though, with a newborn in the house, maybe not quite yet.