Sunday, November 4, 2012

Grokking Graphite

We started using Graphite at work six or so months ago, largely because there was already support for it in the metrics library we're using. If you don't know Graphite, it's a system for accumulating and, obviously, graphing time series. Most people use it for systems monitoring.

When we first set it up, I played with a few graphs of key metrics over time. Pick the metric from a list; Graphite shows you the graph. Easy. I also set up some basic dashboards that showed a few graphs. Again, easy.

But that's not always all you need. I wanted larger pictures of the whole system: hot spots, accumulated data across servers (in our setup, each server is its own metrics hierarchy in Graphite), and more. I pondered various ways to get the data out of Graphite (which it supports) and into R.

Then I discovered its functions library. And I went crazy.

First was a graph that showed every call in our system over a certain time threshold.

Then came one that combined a number of metrics to estimate mean time to first byte, a common metric for website performance.

And then another. And another. These days, I set up my laptop to run Chrome in full-screen mode so that it can fit all the graphs on one of my dashboards. But that's just one tab: I have dashboards for different environments and dashboards that focus on subsystems within those environments. A graph showing our 10 slowest calls gets uploaded to Campfire each day.

Our ten slowest calls as of today, with proprietary information removed. The lines are flat because of lack of activity on the server.

So far Graphite — especially version 0.9.10 — has been able to keep up with almost all my needs, and I haven't even hit all the functions. It even has a command-line interface that I just started playing with. (It allows faster iteration and finer control over each graph in a dashboard, but also allows you to keep a dashboard-building script under source control.) There are also a wide range of tools that work with it (including, of course, my own metrics relay system).

When I first read Graphite's documentation, I was struck by the author's right-up-front advice to consider your metrics naming scheme carefully. It seemed very nitty-gritty so early in the manual.

But now I understand. A consistent naming scheme and hierarchy depth allows for much simpler construction of useful graphs. To some extent, our profiling code, our package hierarchy, and our metrics library give us this for free. But the other day I realized I had made a mistake in naming a metric that captures all invocations of methods with a particular annotation, and it made it much more difficult to assemble a meaningful graph. I got it to work, but it required some wrestling. If you're using Graphite, I recommend auditing your metrics periodically to make sure you can get the most out of them.

1 comment:

  1. I love reading actual blog posts... (Consider this your Pavlovian feedback to make sure you continue... :-) )