Monday, August 1, 2016

PRisk - A Chrome Extension For Pull Requests

Here's what I see when I look at a pull request on github.

I've read lots of academic papers and books about defect prediction in code bases. And most of them read something like this: We ran a bunch of analyses on code bases using a bug-reporting system to correlate where faults are. But often when I read these, I thought: Why can't we take established knowledge like this and frontload it into a pull request? Why not highlight risk factors before it even gets merged?

Well, it turns out I'm a programmer.

I've been working for a little while on PRisk, a Chrome extension that highlights items in a PR that might need extra attention if you're the reviewer. At the top of a pull request, you'll see overall risk factors. If the author of the PR doesn't know the code base, that's a risk. If there's a lot of complexity amongst the diffs, that's a risk. If there are a lot of files, that's a risk. If there are a lot of changes, that's a risk.

For each diff within a pull request, PRisk gets more specific. Is the code perhaps too complex? Is it a file that gets a lot of activity? Are there a lot of contributors to the file? Is the file relatively young? And finally, the extension figures out some likely owners of the file, so you could ping them as a reviewer. I have more plans for this section, but this is a good start.

The Clunky UI

There is a UI, but it needs some work. Depending on the settings for a given repo, you might need to generate an access token for PRisk. Once you've generated it, click on the P that appeared when you installed the extension and fill in your username and the access token. That token will get used for API calls against private repos.

Monday, May 9, 2016

GitHub API For Reports

We use GitHub Enterprise at my work, and it sports a fairly comprehensive REST API.

While there are lots of ways to use this API, my most common usage has been creating reports about what my team is doing on github.

What PRs Are Waiting For Me?

At my company, engineering leads are in a github group that gets notified about repositories throughout the company. But I'm unlikely to have meaningful input on a pull request generated, say, against the analytics team's code. So I wanted a quick way to find PRs that I actually care about.

This turns out to be easy to get via the Github search API:
curl -LGs 'https://|your_github_host|/api/v3/search/issues' --data-urlencode "q=type:pr state:open repo:|repo|" --data-urlencode "per_page=100"

If you want all the repos for a given organization, use "user:|org_name|". You can string together any number of repos and organizations, and they'll be ORed together. In fact, the script I use constructs the query I need from its arguments, figuring out which syntax is needed. I run the results through jq and an awk-based formatting script, and I get a nice report of outstanding PRs in the repos I actually care about.
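In sketch form, that query-building logic looks something like this (the argument names are placeholders, not my actual script):

```shell
# Sketch: build the search query from arguments. Anything containing a
# slash is treated as a repo, anything else as an organization.
build_query() {
  query="type:pr state:open"
  for arg in "$@"; do
    case "$arg" in
      */*) query="$query repo:$arg" ;;   # e.g. myteam/backend
      *)   query="$query user:$arg" ;;   # e.g. analytics
    esac
  done
  echo "$query"
}

build_query myteam/backend analytics
# prints: type:pr state:open repo:myteam/backend user:analytics
```

The resulting string just gets dropped into the "q=" parameter of the curl call above.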

As my team grew, I wanted to also look at PRs by members of my team, even if those PRs are outside repos we own. This is particularly true of junior team members or new colleagues coming onto the team — I want a sense of their coding style and areas where they can grow. Again, this is pretty easy.

curl -LGs 'https://|your_github_host|/api/v3/search/issues' --data-urlencode "q=type:pr state:open author:|username|" --data-urlencode "per_page=100"

As with repos, you can have any number of "author:" items in your search string. I run this through the same formatting steps above.

Finally, sometimes people want me to weigh in on a pull request outside of the repositories my team owns. So I added another stanza to my wrapper script (actually, the wrapper script factors out the common parts of the URL, so each section just passes in the new part of the query):

curl -LGs 'https://|your_github_host|/api/v3/search/issues' --data-urlencode "q=type:pr state:open mentions:|my_username|" --data-urlencode "per_page=100"

Throughout the day, I run my script, and a few seconds later I have a complete view of all the PRs I should at least be aware of.
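To give a flavor of the formatting step, here's roughly what the jq part does, fed here with a canned response (the fields I actually pull are a bit different):

```shell
# Extract title, author, and URL from the search results as tab-separated
# lines; an awk script can then pad and align them into a report.
response='{"items":[{"title":"Fix cache bug","user":{"login":"alice"},"html_url":"https://example.com/pr/1"}]}'
echo "$response" |
  jq -r '.items[] | [.title, .user.login, .html_url] | @tsv'
```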

What Self-Merges Have Happened?

We have a strong policy against self-merges, thanks to a culture in which pull requests and continuous code review are the norm. We wanted an easy way to find what self-merges had happened in a given repo. This quick one-liner I assembled will give you the pull requests where the author was also the merger:

curl -LGs 'https://|your_github_host|/api/v3/search/issues' \
  --data-urlencode "q=type:pr is:merged user:|org_name|" \
  --data-urlencode "per_page=100" |
  jq -r '.items | .[] | .pull_request.url' |
  xargs -I {} curl -LGs {} |
  jq -r '[.user.login, .merged_by.login, .html_url] | @tsv' |
  awk '$1 == $2'

Who Knows About A Repo?

We recently wanted to do some cleanup on our account, which has a host of repos that may or may not still be active and may or may not need access removed for people who have left, contractors, and the like. I came up with this quick script that, for each repo in a given organization, prints a list of contributors and the number of commits they've authored across the last 100 commits, ending with the ones who have contributed the most and who are thus the most likely to know the state of the repo. It requires the user to create an API token so that it can access private repos.

curl -u |username|:|github_api_token| -LGs 'https://|your_github_host|/api/v3/orgs/|your_org|/repos' |
  jq '.[] | .name' | tr -d '"' |
  xargs -I {} sh -c "echo {} && curl -u |username|:|github_api_token| -LGs 'https://|your_github_host|/api/v3/repos/|your_org|/{}/commits?per_page=100' | jq -r '.[] | .commit.author.name' | sort | uniq -c | sort -n"

These are just a sampling of how I use the API. I wrote a bunch of scripts to generate data for my self-review; I have a longer script that identifies which repos my team is working in on a monthly basis, part of my effort to take my somewhat siloed team members and push them into areas where they're less comfortable; I wrote a Chrome extension that runs lots of queries to flag risk factors in an incoming PR. Just today, I was concocting a way to move a large number of users between different groups of contributors to change permissions on a repo. The github API has a wealth of possibilities once you're aware of it.

Tuesday, April 19, 2016


Many modern web services give you data in JSON so that you can easily incorporate it into web applications. But a lot of automation tasks are done in the shell, where the built-in tools predate JSON and thus can't handle that particular structure.

Sound familiar? You should check out jq, a tool that interprets JSON data and gives you a language for extracting, reformatting, and manipulating it. It uses a data-flow metaphor analogous to awk's, working with one chunk of data at a time.
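Here's a quick taste, using a made-up blob of JSON:

```shell
# Given some JSON, pull out just the names of the open items:
echo '{"items":[{"name":"build","open":true},{"name":"deploy","open":false}]}' |
  jq -r '.items[] | select(.open) | .name'
# prints: build
```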

Large percentages of my automation scripts now incorporate jq, and it sometimes seems like my main contribution at my current job was introducing it to the development toolkit.

Monday, December 23, 2013

CAD For Programmers

I'm a back-end programmer. I deal with databases and servers, and I think a terminal window is a perfectly reasonable interface. And my normal creativity outlets are writing and, yes, programming.

As you might imagine then, CAD programs fill me with dread. They seem so cool! And yet, I spend vastly more time learning how to do basic stuff in them than I ever spend doing said stuff. I keep thinking, "Can't I just write a program to do this?"

And the answer, it turns out, is yes. If you use OpenSCAD. I learned about it from Make magazine's special issue on 3D printers (a topic I'm this close to really digging into). While a real CAD user may find it limiting, for someone like me, who works in programming languages and usually has no design more complex than a mechanical puzzle piece, it's perfect. You -- get this! -- just write a program to do all the work you need! You refactor with variables, functions, and modules, and you provide instructions through a declarative programming language.

Finally, a CAD tool that works the way I do.

Saturday, September 7, 2013

pomaid - a tool for creating pom.xml

One of the many things I dislike about Maven is its use of XML as a language format. This was au courant back when Maven was released, but by now, many people have realized that XML is a poor fit for this job.

It's not hard to figure out why. A pom.xml file (the script that drives Maven) consists of XML syntax plus the data that is specific to your project. That project-specific data is not just information; it's the most important information for someone on your project looking at the pom.

Most programmers would probably agree with this, but then they all forget information visualization 101: Keep important information front and center. Most people think of data visualizations when they hear information visualization, but I would argue that any information you produce should be easy for people to understand, and the theory behind visualizations is still relevant.

I grabbed a random pom off the Internet and broke it apart into the Maven syntax and the data you actually care about. Syntax is the start and end tags of the XML; data is the values relevant to your project. Here's the breakdown.

More than 75 percent of this pom.xml file was the syntax for Maven. Less than one-quarter of the text in the file was the data pertinent to the project.

I recently read an interesting article about "Invisible XML," which argued for the creation of DSLs that present information clearly and concisely but transform into whatever XML your system might need.

I thus decided to make a tool that would provide a minimalist syntax that could be used to generate pom.xml files needed by Maven. (And, inspired by the "Little Languages" chapter in The AWK Programming Language, I decided to do it in that language.) The result is pomaid.

Since dependencies are the meat of most pom files, I focused on that. Here's an excerpt from the pom I linked to earlier:

And here's the equivalent in pomaid:

And here's the syntax vs. data breakdown of the entire pom file I mention above, redone as a pomaid script.

In other words, the pomaid file pushes syntax into the background to produce something vastly more information-rich than its pom.xml equivalent.

While this may or may not be useful, I think it illustrates the point well enough: If you have to deal with XML, you may be better off writing a language that can provide clarity to the reader and then translate into the syntax-heavy, data-light XML.
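To make the idea concrete, here's a toy awk translator for a made-up one-line-per-dependency syntax. This is just an illustration of the approach, not pomaid's actual grammar:

```shell
# Turn lines like "org.slf4j:slf4j-api:1.7.5" into Maven <dependency> blocks.
printf '%s\n' 'org.slf4j:slf4j-api:1.7.5' 'junit:junit:4.11' |
  awk -F: '{
    printf "<dependency>\n"
    printf "  <groupId>%s</groupId>\n", $1
    printf "  <artifactId>%s</artifactId>\n", $2
    printf "  <version>%s</version>\n", $3
    printf "</dependency>\n"
  }'
```

Each colon-separated line carries only the data; awk supplies all the XML syntax around it.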

Sunday, July 14, 2013

Internet Radio Podcast, the Raspberry Pi Edition

I recently bought a Raspberry Pi*, the small-and-cheap-but-powerful computer taking the geek world by storm. The hope is that someday I can use this little board or its successor to teach my baby daughter more about my own interests. But given my current time constraints, I figured I should get one now to ensure I've done at least one significant project with it by the time she's old enough to grok what this thing is.

That's right; I think I can do one useful project with my Pi in the next ten years.

People go crazy with their Pis: rigging up their microwaves to cook food according to the bar code on it, taking pictures from very, very high in the sky, and building souped-up irrigation systems that would be the envy of pot farmers everywhere.

I do have ambitious goals in mind, but I wanted to start with something a little simpler. Some of you may remember my Rube-Goldberg-esque "Radio Broadcast Podcast." An obvious failing of my original setup was that it required me to leave my MacBook running. Predictably, this has turned out to add enough friction that the recordings don't get made.

At the time, commenters offered suggestions to fix that obstacle. One was to set up an EC2 server that could run all the time. A good idea, and I'm certainly comfortable with EC2, but I never got around to it. Between launching a big game and having a baby, the priority on the whole endeavor kept dropping. Still another suggestion was to just buy a server for the home. A very good idea, but we didn't have space for it even before we had a baby.

Enter the Raspberry Pi. Since it's just a normal computer that runs Linux, it can also be a server. Not a super-powerful one, but I don't need a super-powerful server for my house. And it's hard to argue with the physical footprint: the Pi and its case are roughly the same size as two or three stacked decks of cards. I just tucked it into a small space behind our TV.

I decided to start from scratch on my radio-to-podcast scripts, mostly because I wanted to fix some problematic and error-prone behaviors from version 1. It ended up being a few independent parts:

  1. A Python script runs every five minutes as a cron job and uses a config file to determine if it should be capturing streaming audio. If yes, and streamripper isn't running, it starts the program, figuring out a file name based on parameters in the config file (including adding numbers if a file of that name already exists).
  2. Another cron job looks in the download directory for any mp3 files that haven't been touched in 5 minutes, uploads any it finds to S3 via s3cmd, and then deletes them.
  3. Yet another cron job creates a podcast-compliant XML file by getting a listing of items on S3, running them through an awk script that generates the necessary XML, and then taking that output and uploading it to S3.
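The heart of that first job is just a time-window check. Here's the idea sketched in shell (the real version is in Python, and the window, stream URL, and file name here are made up):

```shell
# Sketch of the cron check. Times are HHMM; the real script reads the
# window and stream URL from the config file.
should_record() {
  # $1 = window start, $2 = window end, $3 = current time
  [ "$3" -ge "$1" ] && [ "$3" -lt "$2" ]
}

if should_record 2100 2300 "$(date +%H%M)"; then
  # Start streamripper if it isn't already running; the real script also
  # picks a non-clobbering file name instead of this fixed pattern.
  pgrep -x streamripper >/dev/null ||
    streamripper 'http://stream.example.com/' -a "show_$(date +%Y%m%d).mp3" &
fi
```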
This may still seem Rube-Goldbergian, but it's designed to be more resilient than its previous incarnation. If streamripper crashes or the network goes out, my script will ensure that the music starts recording at the next opportunity and it won't overwrite the first part. If the Pi can't get to S3 for some reason, the files will stay in place until needed. There's no cleverness around updating the podcast xml file; it's created from scratch every time. (Note that doing a bucket listing on S3 very much exposes you to eventual consistency issues, but even a day's worth of latency is probably fine for my purposes.) 

The onus is still on me to clear old files from S3, but that's something I can do every few months. And once I do, the XML file will reflect that within an hour.

The next step is to run an HTTP server on the Pi so that I can log in and make configuration changes from any of the devices on the network instead of sshing in and modifying the config file directly. And after that? I'm tempted to add AirPlay capability so that if I happen to be home and awake, I can send the current download to the living room speakers.

Overall, I'm pretty impressed with the Pi's capabilities. I doubt I can pile too much onto it, but I'm curious to see what its actual limits are as a household server. Future projects involve using its ability to talk to other electronic components, which is one area where the Pi shines.

* I bought the Maker Shed Starter Kit. It comes with a good book and most of the stuff you'll need to get going and do the book's exercises.

Tuesday, July 9, 2013

Screen Time

This is a bit of a tangent for a programming blog, but I figure this is a relevant topic for many people reading this blog.

The TL;DR review: If you're interested in kids and how they interact with media, buy this book.

I love technology. Probably more than most non-technology people really understand. I love its potential, the changes it makes to the world, and the ideas that drive it forward. And I look forward to sharing that love with my baby daughter. Some dads can't wait until their kid can go to a baseball game; I can't wait until she is able to grasp that technology is a thing within her ability to control and mold. And the day she expresses an interest in controlling that technology herself? They'll be able to see me smiling from the International Space Station.

That's not always a popular view here in Hippyville, USA, otherwise known as Berkeley. Paradoxically, just a casual drive away from some of the biggest tech companies on the planet, there is a pervasive notion that mixing technology and kids is bad. One extreme -- Waldorf schools -- came up in conversation with my wife. Here's an excerpt:

"Waldorf teachers are concerned that electronic media hampers the development of the child's imagination. They are concerned about the physical effects of the medium on the developing child as well as the content of much of the programming."

While I think Waldorf has some good ideas, that first sentence sent me into a fury. You can't lump all electronic media together unless you're driving your philosophy through some misguided notion that "things were better in the old days." I'm not going to argue that Angry Birds or Robot Unicorn Attack is good for a kid's development, but I know of plenty of apps and "games" that are all about imagination. And I feel that the Waldorf view actually cuts off areas of exploration. (To say nothing of its effect on kids who, let's face it, are coming of age in a technology-driven society.)

But what do I know? Just what I feel in my gut. One article I read about kids and technology suggested a book called Screen Time, by Lisa Guernsey (the title seems to be messed up in the database), so I bought it.

In the end, she says, there's no easy answer. But she has a recurring theme she's derived from her research that she applies to her own children: What content is your child seeing, what context are they seeing it in, and where is your individual child in their development? Are you destroying your kids by letting them watch TV while you shower? No. Are there long-term downsides to media access if parents aren't careful? Yes.

The book focuses heavily on television time, which is less relevant to me since we rarely watch TV or even have it on. Her epilogue notes the emergence of the iPad and its potential, but there's precious little research about it so far (however, it does solve one problem with previous incarnations of interactive toys: the abstraction needed for "I press this thing on this thing way over here and the screen does something.") She does list links to sites that do thorough reviews of children's interactive media, however. My favorite is Children's Technology Review.

Still, the book is full of eye-opening explorations of the research that's out there, which she has extensively plumbed, as evidenced by her endnotes and massive bibliography. She divides the book into chapters that each address a prevailing question about kids and screens: "What Exactly Is This Video Doing to My Baby's Brain?" "Could My Child Learn From Baby Videos?" "My Toddler Doesn't Seem to Notice When the TV Is On - Or Does He?" And so on.

Here were some of my key takeaways, in no particular order:

  • TV is not turning your kid into a zombie through passive engagement. She says that the research that argued that (heavily cited by another book on my queue, Waldorf-darling Endangered Minds) has been widely debunked. Kids engage heavily with what's on screen, trying to understand it.
  • The standard advice from the American Academy of Pediatrics about no screens before two years old? Not really based on any research.
  • TV does reduce creative play time and time with parents.
  • TV in the background is a bad idea. It exposes kids to adult-themed content and makes it harder for them to develop language skills. Make it a foreground activity when it's on. (And only do short, controlled watching times. Most parents paying attention have a "you can watch this half-hour video and that's it" mentality.)
  • Telling your toddler that something scary on TV or in a movie is "not real" is useless. They're toddlers; everything is real.
  • If your child has an interactive electronic thing, you will naturally steer towards saying, "Push that button now, or don't do that yet." Don't. Let the kids explore on their own. Also, most interactive media sucks: It imposes artificial constraints on what can and can't be done at any moment. There are plenty of apps that don't fall into that trap.
  • Dora the Explorer is awesome for kids. So is Sesame Street, Mister Rogers, some show called Dragon Tales, and, yes, even Barney.
  • Products that purportedly improve your kid are rarely based on real research. But it is possible for kids to learn from DVDs and the like. Choose wisely.
  • Kids' brains are weird.
  • Products and shows often have very specific age ranges they're aimed at. Pay attention to those (but note that publishers will often advertise a wider audience to get more sales.)
  • Your involvement is important. Not just in an "are you having fun?" way but in a "what did you think of this or that?" way. Engage your kids about what they're seeing.
  • Moderation is important: Video entertainment is just one facet of experience (duh.)
I can't praise this book highly enough. The author has had the ability (helped along by various grants) to really dig in deep. She looks at academic research, talks to scores of parents and social workers, looks at her own kids, and tries to shape what little material is out there. 

It's possible that I highlighted more of this book than not.