Friday, April 19, 2013

Awkward And Upward

The subject of awk, a text-processing tool on UNIX systems, came up recently in a number of contexts — a book I was reading, a test harness at work, and a blog post I had bookmarked — and I decided it was time to learn a bit more about it. Like anyone who's been around servers for a while, I've used awk here and there, but never in earnest.

So I picked up sed & awk and worked through the code. (An aside: The book is 16 years old at this point; it's fascinating to remind oneself what the "download the source code" options were at the time: FTP, UUCP, and more, but not HTTP. For me, though, downloading code samples is a less effective way to learn than typing them in myself.)

I'm sold. I've written any number of little Ruby scripts for processing text files over the years, but awk is a DSL designed just for this purpose. awk handles the mechanical aspects of opening a file and reading in each line. You write the business logic you care about.
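
To make that division of labor concrete, here is a minimal sketch. The sample file names and sizes, and the sum_sizes function name, are invented for illustration. awk does the reading and splits each line into fields ($1, $2, and so on); the only thing written by hand is a condition and what to do when it holds.

```shell
# Hypothetical helper: reads "name size" pairs on stdin, one per line.
sum_sizes() {
  awk '$2 > 100 { total += $2; count++ }   # runs for each line whose 2nd field exceeds 100
       END { print count, total }'          # runs once, after the last line
}

# Made-up sample input.
printf 'a.log 50\nb.log 200\nc.log 300\n' | sum_sizes
```

With the three sample lines above, this prints 2 500: two lines pass the test, and their sizes add up to 500. There is no open, no loop, and no split call anywhere in the program.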

Here's an example: As an exercise, I took a Ruby script I had written to extract stack traces from a Java thread dump and rewrote it in awk. I wanted every stack trace where any line in the trace matched the regex I passed in. This is useful for, say, finding every thread that uses a particular library or stems from a particular source or goes through a particular code path. A Java server can have hundreds of threads, which can make it tough to focus on one subsection of code. This lets me zero in on the stacks I care about.


# Extracts the stack traces from a thread dump that match a passed-in regex.
# usage: awk -f extract_stacks_by_regex -v regex=[regex] [file]

# Non-empty line inside a stack already known to match: print it.
/./ && matches == 1 {
   print
}

# Non-empty line, no match yet: add it to the current stack, then check it against the regex.
# The leading "\n" in current_stack puts a blank line before each printed stack.
/./ && matches == 0 {
   current_stack = current_stack "\n" $0
   if ($0 ~ regex) {
      matches = 1
      print current_stack
   }
}

# Empty line: the stack has ended, so reset.
/^$/ { matches = 0; current_stack = "" }




This script is about half the size of the Ruby version.
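
For comparison, here is a sketch of a different approach to the same job. It is not what the script above does. It leans on awk's paragraph mode: setting RS to the empty string makes each blank-line-separated block a single record, so a whole stack can be tested at once. The thread dump below is a made-up, heavily trimmed sample, and extract_stacks is a hypothetical name.

```shell
# Hypothetical, heavily trimmed thread dump: stacks separated by blank lines.
dump='"worker-1" prio=5
   at com.example.Db.query(Db.java:10)
   at com.example.Worker.run(Worker.java:5)

"worker-2" prio=5
   at com.example.Cache.get(Cache.java:7)

"worker-3" prio=5
   at com.example.Db.query(Db.java:10)'

# RS="" switches awk to paragraph mode: each stack becomes one record.
# ORS="\n\n" puts a blank line back after every stack that is printed.
# A pattern with no action prints the matching record.
extract_stacks() {
  awk -v RS= -v ORS='\n\n' -v regex="$1" '$0 ~ regex'
}

# [.] rather than \. because -v assignments interpret backslash escapes.
printf '%s\n' "$dump" | extract_stacks 'Db[.]query'
```

On the sample, this prints the worker-1 and worker-3 stacks and skips worker-2. The trade-offs: the regex is matched against the whole stack rather than line by line, so ^ and $ anchor to the start and end of the stack, and each stack is held in memory before anything is printed.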

As someone who's often digging into logs or massaging data for visualizations, I expect a few hours spent getting really solid with awk to pay for themselves over and over again.
