Tuesday, July 20, 2010

Google AppEngine

Three months ago or so, I had an interesting idea for Google's AppEngine, an environment for building web applications. I wanted to use it for telemetry.

Telemetry, literally, means measuring something from afar. (Merriam-Webster's definition is somewhat circular if not outright tautological.) In the gaming industry and perhaps others, it means capturing information about how users interact with your system. Think of stats packages for web pages; they're the same thing. You also hear the terms business intelligence and plain old data analysis.

I want to be clear: This isn't about watching you, in particular, play a game. If you die in a particular spot on a particular level when our newly announced Darkspore ships, I don't really care. But if lots of people die in that same spot, it tells you something about the difficulty of that spot, doesn't it?

I needed something that could take in a large stream of data without hiccuping and allow me to process it and report on it. I went with AppEngine.*

A big selling point of AppEngine is scalability. Google obviously knows something about running big systems, and they take control of scaling your app. At any given point, I don't know how many servers are running my code: AppEngine starts up new instances as traffic grows and shuts them down when traffic slows. It's not like Amazon.com's EC2, where you manually say, "I want this many servers. No! Now I want this many!"

Closely tied to its scalability is the datastore behind AppEngine, which is a proprietary NoSQL technology called BigTable. When you run queries against BigTable, you're using a MapReduce algorithm — the same algorithm that allows Google to index and search the web so quickly. I had a potentially huge set of relatively unconnected data: AppEngine seemed like a good fit.**

And, for the most part, it has been.

But the datastore that makes AppEngine so powerful has some quirks that I've had to work around. Keep in mind that it's not a relational database, so it doesn't have the same behaviors. Most notably, it's not very good at aggregating data: sums, averages, whatever. And remember where I said that telemetry is all about aggregation? In the long run, I'm hoping BigQuery will make this easy, but in the mean time, I had to build my own aggregation system on top of AppEngine's cron jobs and task queues. The system is pretty neat, including support for a "group by" concept so that I can get views on subsets of the data in a given aggregation, but I'll be happy to let someone else handle that heavy lifting.

Another issue I've run into is that AppEngine sets hard caps on certain aspects of the system. For instance, any given request can't run too long or it will be terminated. On my own servers, if a webpage for an admin tool takes thirty seconds to load, I'm not too upset. It's an admin tool; I can wait. Google isn't so forgiving, presumably to prevent some process of yours from running amok on their machines. This has led to some early performance tweaking that's probably, in the long run, better to get out of the way now.

But these have been interesting problems to solve, so I haven't minded them so much as been annoyed by them. We're currently pulling in 100,000 events or so on any given day (as testers play the game and developers work on it), and I've only had to do optimizations on the reporting side and aggregation system. There's a whole other round of work I can do on optimizing writes, but I haven't bothered yet. Granted, 100,000 events isn't that significant, but it's nice to defer some of the deeper tuning until we're a bit closer to shipping. Meanwhile, we've got graphs and charts and spreadsheets with our telemetry data served up in a useful way to different people in the studio.

It's always hard for me to gauge the ease-of-use of a web system for someone who's not a long-time server engineer, but AppEngine feels like it is very good for either the people who know nothing about web stuff or people who know a lot about it. For middle ground types — maybe a bit of PHP or other dynamic site system — who won't be facing scalability issues, I wonder if Rails might be a better choice.

But I do know that one of our non-online engineers was able to install a new index on AppEngine for a query he wanted without even asking me. I didn't know about it until I happened to see his check-in note. That's pretty powerful. AppEngine supports both Python and Java, but I deliberately went the Python route first, despite not knowing the language at all, because there's more Python knowledge in general in my studio. And that language I didn't know? The code I wrote to take in telemetry events is virtually unchanged from the first day I wrote it. So it's easy to get started on AppEngine, and the cost is free well into the needs of most websites.

If you're interested in developing on AppEngine, I recommend the yet-to-be-released Code in the Cloud.The writing in the early drafts is a bit too exuberant, but there's solid information from the get-go, and Pragmatic Programmers delivers polished final products.

* Yes, yes. I don't trust Google, either, though, as always in blanket statements, the AppEngine team seems to have their hearts in the right place. What's to prevent them from pilfering through our data and gleaning their own information about our users? Well, there is a clause about that in the Terms of Service — they won't do it — but I'm not naive. We won't be storing information that anyone else could use to trace back to our users. An ID that makes sense inside EA? Yes, sometimes. Anything that makes any sense anywhere else? Nope. Our legal department was pretty firm on that.

** What if AppEngine goes down on our big launch day? Am I nervous about tying my company's game to a service that may not be there when I need it? In this case, not really. Let's say AppEngine goes down for a full 24 hours. The whole thing, not just pieces. Such an event would be unprecedented in my experience, but still. So we lose a day of telemetry data. Big deal. People will play the game one day the same way they played it the day before and the day after. Since I'm dealing with aggregates, I can cheerfully ignore missing data.