Monday, August 1, 2016

PRisk - A Chrome Extension For Pull Requests

Here's what I see when I look at a pull request on GitHub.

I've read lots of academic papers and books about defect prediction in code bases. And most of them read something like this: We ran a bunch of analyses on code bases using a bug-reporting system to correlate where faults are. But often when I read these, I thought: Why can't we take established knowledge like this and frontload it into a pull request? Why not highlight risk factors before it even gets merged?

Well, it turns out I'm a programmer.

I've been working for a little while on PRisk, a Chrome extension that highlights items in a PR that might need extra attention if you're the reviewer. At the top of a pull request, you'll see overall risk factors. If the author of the PR doesn't know the code base, that's a risk. If there's a lot of complexity amongst the diffs, that's a risk. If there are a lot of files, that's a risk. If there are a lot of changes, that's a risk.

For each diff within a pull request, PRisk gets more specific. Is the code perhaps too complex? Is it a file that gets a lot of activity? Are there a lot of contributors to the file? Is the file relatively young? And finally, the extension figures out some likely owners of the file, so you could ping them as a reviewer. I have more plans for this section, but this is a good start.

The Clunky UI

There is a UI, but it needs some work. Depending on the settings for a given repo, you might need to generate an access token for PRisk. Once you've generated it, click on the P that appeared when you installed the extension and fill in your username and the access token. That token will get used for API calls against private repos.

Monday, May 9, 2016

GitHub API For Reports

We use GitHub Enterprise at my work and, like github.com, it sports a fairly comprehensive REST API.

While there are lots of ways to use this API, my most common usage has been creating reports about what my team is doing on GitHub.

What PRs Are Waiting For Me?

At my company, engineering leads are in a github group that gets notified about repositories throughout the company. But I'm unlikely to have meaningful input on a pull request generated, say, against the analytics team's code. So I wanted a quick way to find PRs that I actually care about.

This turns out to be easy to get via the GitHub search API:

curl -LGs 'https://|your_github_host|/api/v3/search/issues' --data-urlencode "q=type:pr state:open repo:|repo|" --data-urlencode "per_page=100"

If you want all the repos for a given organization, use "user:|org_name|". You can string together any number of repos and organizations, and they'll be ORed together. In fact, the script I use constructs the query I need from its arguments, figuring out which syntax is needed. I run the results through jq and an awk-based formatting script, and I get a nice report of outstanding PRs in the repos I actually care about.
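As a rough sketch of that argument handling (not the actual script), a helper could decide between "repo:" and "user:" for each argument. The function name build_pr_query, the rule that an argument containing a slash is a repo, and the sample names are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a search query from repo and org names.
# Assumption: "owner/name" arguments are repos; bare names are organizations.
build_pr_query() {
    query="type:pr state:open"
    for target in "$@"; do
        case "$target" in
            */*) query="$query repo:$target" ;;
            *)   query="$query user:$target" ;;
        esac
    done
    printf '%s\n' "$query"
}

# Example with made-up names: one repo plus one whole organization.
# Prints: type:pr state:open repo:platform/api user:analytics
build_pr_query platform/api analytics
```

The result can then be handed to the search call as --data-urlencode "q=$(build_pr_query platform/api analytics)", so curl takes care of escaping the spaces and colons.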

As my team grew, I wanted to also look at PRs by members of my team, even if those PRs are outside repos we own. This is particularly true of junior team members or new colleagues coming onto the team — I want a sense of their coding style and areas where they can grow. Again, this is pretty easy.

curl -LGs 'https://|your_github_host|/api/v3/search/issues' --data-urlencode "q=type:pr state:open author:|username|" --data-urlencode "per_page=100"

As with repos, you can have any number of "author:" items in your search string. I run this through the same formatting steps above.

Finally, sometimes people want me to weigh in on a pull request outside of the repositories my team owns. So I added another stanza to my wrapper script (actually, the wrapper script factors out the common parts of the URL, so each section just passes in the new part of the query):

curl -LGs 'https://|your_github_host|/api/v3/search/issues' --data-urlencode "q=type:pr state:open mentions:|my_username|" --data-urlencode "per_page=100"

Throughout the day, I run my script, and a few seconds later I have a complete view of all the PRs I should at least be aware of.
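The awk-based formatting step isn't shown in this post, so here is a minimal sketch of what such a step might look like. It assumes jq has already reduced each search result to a tab-separated line of repository, PR number, author, and title; that column layout, the function name format_pr_report, and the sample rows are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical formatter: reads tab-separated lines of
#   repository <TAB> PR number <TAB> author <TAB> title
# (an assumed column layout) and prints an aligned report with a total.
format_pr_report() {
    awk -F '\t' '
        { printf "%-24s #%-6s %-14s %s\n", $1, $2, $3, $4 }
        END { printf "%d open pull request(s)\n", NR }
    '
}

# Example with two made-up rows standing in for real search results.
printf 'platform/api\t42\talice\tFix pagination\nanalytics/etl\t7\tbob\tAdd retry logic\n' |
    format_pr_report
```

Splitting on tabs rather than on whitespace keeps titles containing spaces in a single column.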

What Self-Merges Have Happened?

We have a strong policy against self-merges thanks to a culture in which pull requests and continuous code reviews are the norm. We wanted an easy way to find what self-merges had happened in a given repo. This quick one-liner I assembled will give you those pull requests where the author was also the merger:

curl -LGs https://|your_github_host|/api/v3/search/issues --data-urlencode "q=type:pr is:merged user:|org_name|" --data-urlencode "per_page=100" | jq -r '.items | .[] | .pull_request.url' | xargs -I {} curl -LGs {} | jq -r '[.user.login, .merged_by.login, .html_url] | @tsv' | awk '$1 == $2'
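To try the final filtering stage without touching the API, it can be isolated and fed sample rows. This is only a sketch: the function name find_self_merges, the choice to print just the URL, and the sample logins and host are assumptions for illustration.

```shell
#!/bin/sh
# Reads tab-separated lines of author <TAB> merger <TAB> URL, the shape
# produced by the @tsv stage above, and prints the URL of each self-merge.
find_self_merges() {
    awk -F '\t' '$1 == $2 { print $3 }'
}

# Made-up sample rows; the host and logins are placeholders.
# Prints: https://example.test/org/repo/pull/1
printf 'alice\talice\thttps://example.test/org/repo/pull/1\nbob\tcarol\thttps://example.test/org/repo/pull/2\n' |
    find_self_merges
```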

Who Knows About A Repo?

We recently wanted to do some cleanup on our account, which has a host of repos that may or may not still be active and may or may not need to have accounts cut off for people who have left, contractors, and the like. I came up with this quick script that, for each repo in the given organization, prints a list of contributors and the number of commits they've authored among the last 100 commits, ending with the ones who have contributed the most and who are thus the most likely to be knowledgeable about the state of the repo. It requires the user to create an API token so that it can access private accounts.

curl -u |username|:|github_api_token| -LGs|your_org|/repos | jq '.[] | .name' | tr -d '"' | xargs -I {} sh -c "echo {} && curl -u |username|:|github_api_token| -LGs|your_org|/{}/commits?per_page=100 | jq '.[] |' | sort | uniq -c | sort -n"
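The counting at the tail of that pipeline can be tried on its own. The sketch below assumes one author identifier per line on standard input, however it was extracted from the commit data; the function name rank_authors and the sample authors are made up.

```shell
#!/bin/sh
# Reads one author identifier per line (however it was extracted from the
# commit data) and prints "count author" pairs, busiest author last.
rank_authors() {
    sort | uniq -c | sort -n
}

# Made-up authors standing in for the last few commits of a repo.
printf 'alice\nbob\nalice\ncarol\nalice\nbob\n' | rank_authors
```

The first sort is needed because uniq -c only collapses adjacent duplicates; the final sort -n orders by the leading count, which is why the most active contributor ends up at the bottom.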

These are just a sampling of how I use the API: I wrote a bunch of scripts to generate data for my self-review; I have a longer script that will identify which repos my team is working in on a monthly basis, part of my effort to take my somewhat siloed team members and push them into other areas where they're less comfortable; I wrote a Chrome plugin that does lots of queries to flag risk factors in an incoming PR. Just today, I was concocting a way to move a large number of users between different groups of contributors to change permissions on a repo. These are just examples: The GitHub API has a wealth of possibilities once you're aware of it.

Tuesday, April 19, 2016


Many modern web services give you data in JSON so that you can easily incorporate it into web applications. But a lot of automation tasks are done in the shell, where the built-in tools predate JSON and thus can't handle that particular structure.

Sound familiar? You should check out jq, a tool that interprets JSON data and gives you a language for extracting, reformatting, and manipulating it. It uses a data-flow metaphor analogous to awk where you work with chunks of data at a time.
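To give a flavor of that, here is a small sketch that turns a JSON document into tab-separated lines the classic shell tools can handle. The field names mirror GitHub's search results, but the data and the function name summarize_items are made up, and jq is assumed to be installed.

```shell
#!/bin/sh
# Extract one tab-separated line per item from a JSON document on stdin.
# The field names mirror GitHub's search results, but the data is made up.
summarize_items() {
    jq -r '.items[] | [.number, .user.login, .title] | @tsv'
}

cat <<'EOF' | summarize_items
{"items": [
  {"number": 42, "user": {"login": "alice"}, "title": "Fix pagination"},
  {"number": 7,  "user": {"login": "bob"},   "title": "Add retry logic"}
]}
EOF
```

Each element of .items flows through the rest of the filter one at a time, much as awk runs its program once per input line, and -r prints the result as raw text rather than as a quoted JSON string.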

A large share of my automation scripts now incorporate jq, and it sometimes seems like my main contribution at my current job was introducing it to the development toolkit.