Wednesday, May 30, 2012

Packing bytes with templates

I recently came up with a little tool that solves a tangential problem we have.

We're building a binary protocol for talking to our servers. We have some initial work in place, and I have a battalion of little functions for creating a byte array from a number, creating a number from a byte array, etc. For a given message in the protocol, you need to compose a byte array to send over the wire, so you might write out a header, an ID or two, and maybe some other stuff.

Something like:
public byte[] createJavaInterestingMessage(Long id, String message) {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();





We had our first case working when the game-side engineer I'm working with said he wanted additional data in the message. Easy enough to add, but I was reminded of erlang's bit syntax. You can create elaborate binary data with a concise syntax that says, effectively, "create a binary blob by writing this variable into the first 3 bits of the binary, this variable into the next 15 bits, etc."

Something like:
ErlangInterestingMessage = <<Var1:3,Var2:15>>.

Wouldn't that be nice? No adding another call to writeLong just for another value. I could just write out the structure I wanted. I didn't care about the bit-level specification that erlang offers, but specifying byte lengths is valuable. If you know a particular int value is never going to go above 65,000, you can encode it in just two bytes, not four. Compare that to Java's built-in DataOutput classes; they always write an int into four bytes.

There's another benefit. Tell me quickly what the format of JavaInterestingMessage is. Now tell me quickly what the format of ErlangInterestingMessage is. One documents itself; one requires a bit of digging.

Other tools, notably Google's protocol buffers, solve this problem in various ways, but many of those were overkill for what, in my case, is just a simple template.

If you're a Java programmer, you might be envisioning something with StringTokenizer and/or regular expressions to parse my little template. I thought about those, but I didn't want to fritter away time trying to get a regex just right or handling funny little edge cases. I decided to use antlr to generate a parser for me. Antlr is a tool that generates parser code based on a language spec. The language spec is written in a language designed for defining language specs (if that doesn't make your head hurt a little), but antlr lets you insert some Java (or other language) into the language spec so that your parser can collect information along the way for you.

I had to refresh my memory of using antlr, but within an evening -- this is incidental work, so I did it after hours -- I had a parser up and running. I shamelessly lifted from erlang's syntax, only replacing the comma with a pipe. The system has two pieces: the parser generated by antlr and some Java code for translating the parsed template into an actual byte array. I can generate a byte array from a List of objects, an array of objects, and a map of objects keyed by String. And because antlr does all the work of making the parser, I allowed for white space and various other freebies such as requiring that you only use positions or field names, not both.

Now you can do stuff like this:

// writes item 0 of list into the first two bytes of the array and writes item 1 into however many bytes it needs
pack(list, "<< $0:2 | $1 >>"); 

 // writes the "id" item of map into however many bytes it needs and then the "flag" element into the next byte.
pack(map, "<< $id | $flag:1 >>");

Right now, I support longs, ints, and byte arrays, but I have plans to support arbitrary data with type converters that can go from one type (Date, say) to another (long, say). (This is inspired by Scala.)

Once I had data packing working, I realized that my parsing of incoming binary data now seemed grotesque. Figure out offsets into byte arrays, do bit-shifting and bitwise comparisons as needed, yada yada yada.

Naturally, I wanted to use my same templates for unpacking data.

That's a trickier problem. Bytes are just bytes; if one says 65, are you supposed to interpret it as 65 or as "A"? But erlang has a syntax for that as well; you can specify a data type of the bytes you've unmarshalled with "/" followed by the data type.

As of tonight, my little template language supports the same thing. You can now do something like:
unpackIntoMap(byte[] data, "<< $id:8/long | $text >>"); // treats the first 8 bytes as a long that is put into the "id" field of a map and consumes all the rest of the data and stores it in the map as a byte array keyed by "text"

and get back a Map populated as you want. With, again, the expected format being very easy to see.

I expect this to save a lot of programming and debugging time as we move forward with our protocol, but it was also fun to play with creating a custom language (however small) to reduce boilerplate code.



  1. github link or it doesn't exist.

  2. Ha!

    I'm actually going to shoot a note to legal to see what can be done in that regard. I don't know how far I'll get, but it's not proprietary or patentable and doesn't give us a competitive advantage.