An Obsession With Programming: April 2013

The subject of awk, a text-processing tool on UNIX systems, came up recently in a number of contexts — a book I was reading, a test harness at work, and a blog post I had bookmarked — and I decided it was time to learn a bit more about it. Like anyone who's been around servers for a while, I've used awk here and there, but never in earnest.

So I picked up sed & awk and worked through the code. (An aside: The book is 16 years old at this point; it's fascinating to remind oneself what the "download the source code" options were at the time: FTP, UUCP, and more, but not HTTP. But downloading code samples is a less effective learning strategy for me compared to typing it in myself.)

I'm sold. I've written any number of little Ruby scripts for processing text files over the years, but awk is a DSL designed just for this purpose. awk handles the mechanical aspects of opening a file and reading in each line. You write the business logic you care about.

Here's an example: As an exercise, I took a Ruby script I had written to extract stack traces from a Java thread dump and rewrote it in awk. I wanted every stack trace where any line in the trace matched the regex I passed in. This is useful for, say, finding every thread that uses a particular library or stems from a particular source or goes through a particular code path. A Java server can have hundreds of threads, which can make it tough to focus on one subsection of code. This lets me zero in on the stacks I care about.

# extracts bits of a thread dump based on a passed in regex
# usage: awk -f extract_stacks_by_regex -v regex=[regex] [file]

# lines with stuff when we've established we're in a stack dump that matches the regex
$0 ~ /^..*$/ && matches == 1 {
   print
}

# lines with stuff when we haven't yet found a match: build up current stack and check to see if this is a match
$0 ~ /^..*$/ && matches == 0 {
   current_stack = current_stack "\n" $0
   if ($0 ~ regex) {
      matches = 1
      print current_stack
   }
}

# lines without stuff: reset 
/^$/ {matches = 0;current_stack = ""}

This script is about half the size of the Ruby version.

As someone who's often digging into logs or massaging data for visualizations, a few hours of time getting really solid with awk will no doubt pay for itself over and over again.

My wife uses an app from Similac to track various things about our baby: diaper changes, sleep schedules, and feeding schedules. Friends of ours recommended it to us, and we've recommended it to other people.

But I find its visualization for sleep schedules lacking. Anything beyond the daily view just shows you how many hours your baby slept on a given day; it doesn't show you when on that day your baby slept. And if a sleep session starts on one day and ends on another, the hours are only counted for the first day. In the summary the app provides, this probably doesn't matter; the hours will even out. But it's confusing.

Fortunately, the app exports its data. So I figured I could get the export and then produce the chart I had in my head.

The first step was cleaning the data. Here's a sample of what the app sends.

Start Time	4/11/13, 2:09 PM
Duration	2 hrs 11 min
Time of Day	Daytime
Laid Down Awake	No

Not very program friendly, but it's easy enough to fix with awk.

function quote(string) {

   return "\"" string "\""
}

BEGIN {
   FS = "[, \t]"
   print quote("line") "," quote("start_date") "," quote("start_time") "," quote("duration")
}
/Start Time/ {
   date = $3
   date_and_timestamp = quote(date) "," quote(date " " $5 " " $6)
}
/Duration.*hr.*min/ {
   print quote(NR) "," date_and_timestamp "," quote((($2 * 60) + $4))
}
/Duration.*hrs?$/ {
   print quote(NR) "," date_and_timestamp "," quote(($2 * 60))
}
$3 ~ /min$/ {
   print quote(NR) "," date_and_timestamp "," quote($2)
}

That turns a chunk of text like the one above (and its variations) into a line like this:

"8","4/11/13","4/11/13 2:09 PM","131"

(I add the line number at the beginning so that R has a primary key to work with on the import.)

Once I made the csv, I pulled it into R. As usual in R, drawing the chart was straightforward once the data was correct. Here's the meat of it:

rect(xleft=data$sleep_offset,

     ybottom=data$y_value-.25,

     xright=(data$sleep_offset + data$duration),

     ytop=data$y_value+.25,

     col=data$rect_color,

     border=NA)

But even my cleaned data needed some cleaning within R. The data the app exports suffers from the same problem when it comes to sleep sessions that cross midnight. You might see an entry for 04/11/13, 9:00 PM with a duration of five hours. But you won't see any data for 4/12/13, midnight to 2:00 AM.

So I first added new entries that duplicated those records, but provided a "start date" (which translates into the y axis) of the "other side of midnight" time frame. So the theoretical 9:00 PM sleep session above would produce another row of data where the start date was 04/12/13 and the start time was 04/11/13 9:00 PM. That meant that the "sleep offset" (position on the x-axis) was negative, which is what I wanted.

Finally, I wanted to draw any chunks of sleep that fell outside of a given date (because of the overlap) in a different color, so I broke the overlap rows into rows that would give, to use the same example, a row for 04/11/13 with a start time of 9:00 PM and a duration of 3 hours and a row for 04/12/13 from midnight to 2:00 AM. I stored the color in the data as well, so I could tell R to just draw the rectangles based on each row's coordinates and with the color specified in one of the fields.

Here's what I came up with. Note that this particular chart is based off of fake data (since I have a tool that makes that easy), because I didn't want to expose the baby's personal data for all the world to see. But the chart gives the gist.

Each horizontal line is a day, with the night before and morning after visible on the chart but unobtrusive. The dark green represents sleep sessions within that calendar day, spanning the time listed on the x-axis. I often feel that every visualization I make ends up being a small multiple, but it is true that I often want to quickly look at a large mass of data and make comparisons within it.

The visualization is a work in progress. I'd like to put the total on the right, and a friend suggested that I add a heat map to show when parents have the best chance of getting in a long nap.

But compare this grid to what, in the app, would simply be a line graph with a single number for each day. That doesn't tell you anything about, say, how long your baby's been sleeping at night this week versus last week. Or whether her individual sleep sessions have gotten longer. We recently put up blackout curtains, and so we can see what effect that's had on the baby's sleep. We can see particularly sleepless days (or particularly sleepful ones) and correlate to other variables such as her mood and sleep schedule.

Of course, even with the awk and R scripts written, the data still has to be emailed to me, and I still have to process it. But I think the extra detail is worth the small effort to get it.

An Obsession With Programming

Friday, April 19, 2013

Awkward And Upward

Tuesday, April 16, 2013

Better Visualization Of Baby Sleep