A guide to distributed work

Oct 31st, 2017 6:30 pm

See all of my remote/working-from-home articles here.

Whether it is working as the one remote employee at a traditional company or being one of many at a distributed company, remote work is becoming an option for many of us.

The number of employees that work remotely is growing¹. Technology improvements, pervasive Internet, and mindset changes are some of the drivers for this change.

The remainder of this article highlights some observations from my last few years of working for distributed companies. I’ve also interviewed and corresponded with many other remote and formerly remote workers. Their contributions, along with books and articles on remote work, have influenced my thinking about what it means to have a successful distributed company. I’ve been working remotely since October 2013 in roles that have ranged from being a team lead to software developer to CTO. Each company I’ve worked for was fully distributed. There were no offices.

The types of distributed companies I’ve worked with have not been asynchronous. They have had core working hours centered around one of the middle time zones in the continental United States. You could work from anywhere, but you needed to be willing to work while others were working. I consider this synchronous remote work.

Much of the following also applies to individuals working remotely for a not-entirely-distributed company or team. Being the only remote individual in a non-remote team comes with its own set of challenges that I’m not going to attempt to present. If you are part of a team where some members work remotely, my recommendation is that you should treat that team as a remote team. If you don’t, the remote worker will have a harder time keeping up with the rest of the team.

Benefits

If you ask someone who has never worked remotely to name the benefits of remote work, they can probably guess the most obvious ones. The top two responses I’ve received to this question are having no commute and more flexibility in your schedule. These two advantages are huge. There are other, less obvious advantages as well.

No commute

This is the benefit that most workers, remote and non-remote, identify when asked about benefits of remote work. This benefit stands out because it is huge and obvious. In the United States, the average one-way commute is 25 minutes long, so the average worker spends about 50 minutes a day, nearly one hour, going to and from work.

If you go to work five days a week, you’re spending over four hours a week commuting (2 × 25 minutes × 5 days = 250 minutes). That is half of a full workday spent riding a bicycle, car, train, or bus. In the best case, you use that time to read, listen to a podcast, or think deeply about a problem. In reality, you’re trying to do one of those activities, but you are continuously distracted by the world around you. You have to worry about avoiding accidents, driving safely, or another distracting concern.

Commuting has been shown to have negative effects on commuters². Being able to work remotely lets you avoid those problems.

Flexibility in schedule

This is another benefit that most people can immediately identify. Remote work gives you more power over your schedule. Even if you’re part of a synchronous distributed team, you gain flexibility. All of a sudden your breaks become time you can use to enrich your non-work life.

Picking up or dropping off your children at daycare or school becomes easier. Taking your dog for a midday walk becomes possible. Since you aren’t commuting, you have more time to make breakfast for you and your family. You can run errands or go to your favorite neighborhood lunch spot during the day. These errands and restaurant trips are quicker since you’re effectively doing them at off hours as most of your neighbors are working at their office.

More time with family

If you’re working at home, then it becomes easy to see your family more. You can say hi to your kids when they get home from school. You have more time in the evening to spend with your baby before bedtime.

Customize your workspace

You get to choose where you work. For many, this will be in a home office. This is your space. You get to make it your own.

Do you like to work in a cave? Paint the walls a dark color, block out the windows and enjoy your cave-like experience. Do you prefer sunlight and plants? Work near a window and add houseplants to your space. Do you want an awesome sit-stand desk and chair? Buy them.

One of my former colleagues liked to walk on a treadmill while standing and, when sitting, enjoyed having pedals under his desk. These are customizations he would have a hard time getting at most offices.

Eat the food you want to eat

Many of us have a preferred diet (or a diet we’re forced to follow). When you work from home, it is easier to eat the food you know you should eat.

When you work from an office, you have a few food options. You can bring your food, go out to eat, or (if your employer offers it) eat food provided by your company. If you follow a restrictive diet, all of these options are more hassle than making your lunch at home every day.

Feel like eating food someone else has prepared? You can still do that while working from home.

Fewer interruptions

As a remote worker, you can choose your working location. This lets you select a spot with fewer distractions. You can pick an ideal location that helps you achieve a state of flow.

Minimizing interruptions is one of the keys to accomplishing challenging tasks. After an interruption, it can take up to half an hour to get back to your original task³.

Off-hours support

This is a benefit I have not seen mentioned in many other places. Off-hours support becomes much easier if you are working remotely. The actions you would take for an urgent alert at 1 AM are the same actions you would take at 1 PM.

When you get that 1 AM page you don’t have to struggle to remember how you check production while at home; this is an activity you do on a regular basis. You know what you need to do. You don’t have to remember how to VPN into your company’s network; you do that every day.

No one likes getting woken up by a support call. At least this way you get to use your normal tools for solving problems.

Recruiting

Since you aren’t limited to your locale, you can recruit from a much broader region. This means you can find the top talent for your company. This is huge.

Back in late 2013, it was quite challenging to find experienced Clojure programmers. Because Outpace is a distributed company, we were able to hire experts from across the entire United States. We would not have been able to recruit nearly as well if we were limited to a single location.

Employee retention

If your company supports remote work, then you remove an entire reason for an employee leaving. Sometimes, a person needs to move. Working for a company that supports remote work allows them to move and not leave the company.

Reduced office costs

Having a distributed workforce can reduce office costs drastically. In a fully distributed company, it could reduce the cost to zero. Realistically, I’d expect the company I work for to provide computer hardware so there are still some costs⁴. Unless the company pays for Internet and phone, the recurring costs are minimal.

Downsides

While there are many benefits, there are also downsides to working remotely. When I talk to non-remote workers about working remotely, I typically hear “I don’t know how you do it, I’d be so distracted.” This statement touches on one of the downsides of remote work, but it isn’t the largest one. Below are some downsides that I and others have observed.

Loneliness and isolation

Nearly everyone I’ve talked to, including myself, puts this as the top downside.

Most of us are social creatures. You do not get the same type of social interaction with your coworkers when you are working remotely. Instead of bumping into a wide range of people in the office, you interact with a smaller group through video chats, phone calls, chat rooms, and email. Depending on what you are doing, you might not even get that interaction.

This is very unfamiliar to most of us. We’re used to being in physical proximity to other humans. We’re used to having small talk and grabbing coffee or lunch with other people.

You can combat these feelings by setting up communication with other employees at your company. Have some dedicated chat rooms for non-work discussions. Have a daily, short video meeting that is a status check within the team so that everyone gets to see another person’s face at least once a day. If you work in a city that has other remote workers from your company then meet up occasionally for dinner, lunch, or happy hour.

If you are having trouble with loneliness and isolation, try to find an area where you can work surrounded by other people. Two options are co-working spaces and coffee shops. Alternatively, try to have social activities you regularly do with non-coworkers in your area. Having strong connections with non-coworkers can help combat loneliness.

If I stay inside for more than a couple days, I get grumpy. I didn’t realize this when I worked in an office. Noticing this has benefited my rock climbing, as I’ve made that my main non-work social activity. Even if I merely go bouldering by myself, being around other humans helps. If you’re working remotely and feeling grumpy, try to find an activity you can regularly do and see if doing that helps.

Distractions

This is the downside that non-remote workers most often identify. People assume that television and other distractions in your home are irresistible and will cause you not to get work done. When you are working 100% remotely, you don’t have the same distractions you have when you are only occasionally working remotely. You can’t do laundry every day. You only have so much TV you can watch.

Personally, I don’t have a problem with distractions when working at home. I know others who do, mostly when they first started working from home. Back then, they found themselves doing too much around the house. As a result, they worked late hours or felt like they weren’t getting enough work done. Once you recognize the problem, it is possible to train yourself not to get distracted.

Roommates, kids, and family are another (sometimes welcome) distraction. You can combat interruptions from others by setting boundaries. Many of my coworkers have a rule that when their office door is closed, they are unavailable. I’ll claim that coworkers interrupting you in an office are more distracting than the much rarer interruption from someone within your home.

Employees that work remotely are typically choosing to work remotely. Once they get used to working remotely, distractions stop being a problem. They know they need to produce quality work and will take steps to make sure they do.

Working too much

When you first start working from home, you suddenly find yourself living in the same space where you work. This lack of change in location and commute makes it easy to keep working. You get invested in a problem and all of a sudden it is past the time when you should have stopped working.

Even if you manage to stop working on time, it is easy to slip back into work mode. The computer setup I like to use in the evening is in the same location as my work setup. This makes it easy for me to take one more peek at our monitoring dashboards or check my work email.

You do not want to overwork and you do not want your teammates to overwork. In the short-term, overwork can be beneficial. Long-term it leads to burnout and poor outcomes.

Fewer interactions

This is both a negative and a positive. When you are working remotely, you have fewer random interactions with coworkers. You most likely interact regularly with your team plus the same handful of people outside of your team, but you rarely interact with others.

In an office, there is a chance you’ll run into people outside your usual circle of communication. You might eat lunch with a broader variety of people. You may bump into others while getting a coffee or a snack.

You can help increase interactions on a distributed team by having some chat rooms that encourage random discussions. Another option is to have a regular and optional meeting scheduled where people can give an informal presentation on something that interests them.

Tools

You will need to select tools that work for distributed teams. Most computer or web-based tools can work in a distributed setting. Any physical tools (such as pen and paper) will not work.

A prime example of this is the classic card wall for tracking work. A physical wall with actual cards will not work as soon as there is a single remote worker on a team. Instead of a physical wall, you’ll need to use something like Trello.

It is less important to get stuck on a particular tool recommendation and more essential to pay attention to the categories of tools. Categories of tools tend to be more stable than a specific recommendation.

Text chat

You’ll want a chat application. Slack and Stride are just two of the many available services.

Video conference

Video conferencing is a technology that you should embrace. It is much better than talking to either an individual or a group on the phone. Being able to read body language makes communication far better. Personally, I’ve used Google Hangouts and Zoom for hundreds of hours each and prefer Zoom. appear.in is another option that doesn’t require installing anything. There are many options in this space and more keep appearing. It is even built into Slack.

Phone conferences

I’d try to get rid of phone conferences in favor of video conferences. Video chat has many benefits over conference calls. I actually can’t recommend any phone conferencing tools, but I will mention that Zoom supports people dialing into a video conference.

Screen sharing

You’ll want to have a way to show another person or group what is on your screen. It is even better if someone else can take control of your machine or use their cursor to point towards something on your screen.

Most of my experience with this is using the feature built into Zoom. Pretty much every video conference tool I’ve used (appear.in, Google Hangouts, etc.) has screen sharing built in.

Real-time collaboration on a document

Being able to collectively edit a document with a group is pretty amazing. Etherpad and Google Docs are two options. Most of my experience is with Google Docs.

When a document supports real-time collaboration, you can do amazing things. You can use it to capture ideas from a remote group. An extreme version of this can be viewed by opening this page and searching for “Google doc.”

You can use a shared document to facilitate a remote meeting (this goes incredibly well once you get some practice with it). Having a document that everyone in a meeting can edit is so much better than a whiteboard that only one or two people can simultaneously use.

Whiteboards

Whiteboards are an example of a tool that is always brought up, even by remote workers, as something that distributed teams miss. There are alternatives.

Whiteboards are a very convenient tool when meeting in-person, but there are other ways of collaborating when working remotely. Shared documents and screen sharing go a long way towards enabling collaboration. Tools that work well for remote collaboration often have another benefit over whiteboards; they are easier to persist and reference later.

One whiteboard alternative is Zoom’s built-in whiteboard. It works fairly well. Another is to use Google Drawings. Precursor is a design-focused collaborative tool that can also work.

Even after four years, I occasionally find myself missing a whiteboard or shared piece of paper. Drawing with a mouse isn’t ideal. I know some developers that use an iPad or a Wacom tablet to help them quickly sketch diagrams for their team.

Communication

Communication is inherently different on a distributed team. You cannot just walk across an office to interrupt someone and ask them a question. Communication happens mostly through text. You need to be skilled at written communication.

You lose context with written communication when compared to vocal or in-person communication. You no longer have body language or tone of voice to help you interpret what someone means. Lack of tone is huge. This is one reason that text communication is interpreted as more emotionally negative or neutral than intended⁵. If you’re reading text communication, try to read it with a positive tone.

It can also be useful to have rules around the expectations of different forms of communication. How quickly do you need to respond to an email? How quick should a response be to a chat room message? When should you pick up the phone and call someone?

Chat rooms

Chat room applications (IRC, Slack, Stride, Flowdock, etc.) are pretty great. They provide a medium of communication that has a lower barrier to entry than email. Chat tools have a place in the distributed team’s tool chest.

The chat room becomes even more central to work once you start including useful bots in the room. These bots can perform many functions. You can look in the Slack App Directory to see some of the bots that people are producing.

If you start adding bots and other automated messages to your chat application, you might want to think about setting up a separate channel. Some messages are not worthy of being interjected into your team’s main chat. They can be distracting and hurt the flow of conversation. These messages tend to be ones that are informative, but not critical. For my teams, these messages include things like git commits and Trello card updates. It is great to see these in a single spot but annoying when they interrupt a conversation.

Chat rooms can also be a big time sink. They are a source of concentration-interrupting notifications. The feeling of missing out on conversations can drive people to join a large number of channels. This piles on the potential for distraction.

Chat rooms also provide a feeling of immediacy that isn’t actually there. You don’t know if key people have seen your message or have had time to respond.

Despite having search functionality, I’ve found it hard to find previous conversations in chat applications. If something important appears in chat, I’d recommend extracting it from the chat application and recording it somewhere else.

I’d also recommend turning off notifications for all but the most definite “someone is trying to reach me” triggers. Encourage members of your chat to use channel-wide notifications sparingly and only for messages that need everyone’s attention. There are not many messages that immediately require everyone’s attention.

It can be a challenge to follow chat conversations, especially if they span a larger unit of time. Don’t be afraid to move a conversation to email or another medium that is better suited for longer and more complex discussions.

Many chat applications offer the ability to have private rooms or to send direct messages to a user. Don’t be afraid of using these private channels, but if your communication can be public, it should be public. It can be challenging to ask a question and admit you don’t know something but seeing that dialogue might help others. Similarly, having discussions about a feature, bug, or task can help spread knowledge.

Email

Despite all of the efforts to replace it, email is still useful. It is the most common form of communication between companies, it is pervasive, and it usually comes with good search capabilities.

A good email thread can keep a topic contained in a form that is possible to follow. Unlike a chat room, there (usually) aren’t off-topic interjections from uninvolved parties.

Phone

You shouldn’t be afraid of calling someone. Just recognize that this is an interruption. Your company should have a directory of telephone numbers that is accessible to everyone.

One downside of any voice conversation is that it is not automatically persisted. It can be worth following up a phone call with an email summarizing the discussion and the next steps.

Picking the right communication medium

When you are working on a distributed team, you can no longer walk over to someone’s desk and interrupt them. This is great. Not every question deserves an immediate answer.

Agree with your team when to use different forms of communication. Set expectations with regard to response times and urgency for different mediums. Maybe direct chat messages are expected to get a response in under 10 minutes. Perhaps a delay of a few hours is OK for emails. This is something your group will need to decide.

Practices

These are some practices I’ve seen work well with distributed teams. Many of them are slight variations on what you might have experienced on a co-located team.

Stand-ups

Most of the teams I’ve been part of, whether distributed or co-located, have had a daily stand-up meeting. The intention of this meeting was to provide a short, scheduled time for communicating any roadblocks to progress, interesting information, status updates, and desire for help.

For a distributed stand-up, the team joins a video conference and we gather around a shared Google Doc that has prompts similar to the snippet below.

Day: 2017-07-11

What's interesting?

Want help?

Meetings:

These prompts provide a starting point for team members to add additional text. Our stand-ups were at the beginning of the day, so frequently team members would add text to the document at the end of the prior day. Filling in the document at the end of the day instead of right before the stand-up was useful as memories and thoughts were often fresher without having to remember them the following morning.

After being filled in, the document would look like the example below.

Day: 2017-07-11

What's interesting?
  - New deploy lowered response time by 20% [Jake]
  - Discovered bug in date-time library around checking if date is within interval [Sue]
  - Greg joined the team!
  - Adding blocker functionality going to take longer than expected [Donald]

Want help?
  - Having difficulties running batch process locally. [Tom]
  - I'm having a hard time understanding propensity calculation [Mike]

Meetings:
  - API overview with client @ 2 PM Central [Jake/Jeremy]

We would gather around the Google Doc and everyone would take a couple of minutes to read silently. If anyone felt like something was worth talking about they would bold the text and then we’d work from top to bottom and have a quick discussion. For our Want help? section we’d solicit volunteers and move on. The Meetings section was primarily there to provide visibility as to when certain members might not be available. After we worked through the Want help? section we’d pop over to Trello and review the work in progress and make sure everyone had an idea of what they would be doing that day.

The nice thing about doing a stand-up around a shared Google Doc is that you can put in richer media than just text. Screenshots of monitoring graphs were a regular addition to the What's interesting? section.

Every day a new section was added to the top of the Google Doc and the previous day was pushed lower on the page. Having this written history of stand-ups was useful as it allowed us to notice patterns through a persisted medium instead of relying on our memory. It also let someone who was on vacation come back and have an idea of what had happened while they were gone. Below is what the document would look like on the next day (comments removed to keep the example shorter).

Day: 2017-07-12

What's interesting?
  - [...]

Want help?
  - [...]

Meetings:
  - [...]

-------
Day: 2017-07-11

What's interesting?
  - [...]

Want help?
  - [...]

Meetings:
  - [...]

Above is an example from one of the teams I led. Another team used the following prompts.

Accomplished Yesterday

Requires Attention/Roadblocks

Scope Creep Alerts

Would like to Do Today

The important thing is to find something that works for your team. Different teams are going to prefer different formats.
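The daily rotation described earlier (a new dated section goes on top and older days are pushed down) is simple enough to script if your stand-up notes live in plain text. The following is only a minimal sketch under that assumption; my teams did this by hand in a Google Doc, and the names `prompts`, `day-section`, and `prepend-day` are hypothetical helpers invented for this illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for illustration only. The prompts mirror the
;; stand-up example earlier in this article.
(def prompts ["What's interesting?" "Want help?" "Meetings:"])

(defn day-section
  "Returns an empty stand-up section for the given date string."
  [date]
  (str "Day: " date "\n\n" (str/join "\n\n" prompts) "\n"))

(defn prepend-day
  "Adds a fresh section for date to the top of doc, pushing older days down."
  [doc date]
  (if (str/blank? doc)
    (day-section date)
    (str (day-section date) "\n-------\n" doc)))

;; Two consecutive days: the newest day ends up at the top of the document.
(def doc
  (-> ""
      (prepend-day "2017-07-11")
      (prepend-day "2017-07-12")))
```

Swapping in a different team's prompts only requires changing the `prompts` vector.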

Another interesting benefit of using a Google Doc to drive your stand-up is that it can be visible to other teams. You can even combine teams into a single document. Below is an example with two teams in a single document.

**Everyone**
  - [...]

**Team Events**
  1. Accomplished Yesterday
     - [...]
  2. Requires Attention/Roadblocks
     - [...]
  3. Scope Creep Alerts
     - [...]
  4. Would like to do today
     - [...]

**Team Engine**
What's interesting?
  - [...]
Want help?
  - [...]
Meetings:
  - [...]

I’ve seen this work successfully with five related teams in a single document. News and information that affects everyone is added in the Everyone section. Team-specific information is put in the team sections. Each team still has its own stand-up where they only look at their section. But since their section is part of the larger document, they get a taste of what is going on in the other related teams. This helps replace the random hallway chatter you get in a shared office and gives everyone a slightly broader picture.

This worked shockingly well. I’ve had colleagues reach out, some years after leaving, to ask for the template we used for this multi-team stand-up.

Stand-downs

A stand-down is a meeting that provides time to informally chat with a group. I’ve seen them used as an optional end-of-day water cooler activity for a group to talk about whatever. In an office, these chats often happen without being scheduled.

These meetings should be an optional, short meeting scheduled near the end of the day. This gives team members a good excuse (socializing with their coworkers) to stop working (which helps with the problem of overwork). No one should feel pressure to be at these meetings.

The conversation may or may not be work-related. Maybe you discuss a language feature someone learned. You might talk about a book you started reading. It doesn’t matter what is discussed; these meetings can help a team get closer and promote more social interaction.

I’ve also worked with teams that play games, such as Jackbox Party Packs, through video conferences.

Remote Pair Programming

Pair programming is a technique that people often employ when working in-person. It works even better when remote and helps solve some of the difficulties of working remotely.

Remote pair programming helps fight the feeling of loneliness and isolation that remote workers feel. Remote pairing forces intense interaction between two people. It also helps keep the two people focused on work. It is easy for a single person to get distracted by browsing Hacker News but much harder for both people to get sucked into it.

The ideal in-person pair programming setup is when you take a single computer and hook up two large monitors, two keyboards, and two mice and then mirror the monitors. This lets both programmers use their personal keyboard and stare straight ahead at their monitor.

Remote pair programming is an even better setup. One developer, the host, somehow shares their environment with the other developer. This can be done through screen sharing using Zoom, Slack, VNC, tmate, or some other way. The important part is that both developers can see the code and, if necessary, have some way of viewing a UI component. They should both be able to edit the code.

Like the ideal local pair programming environment, each developer uses their personal keyboard, mouse, and monitor. Unlike the local pair programming environment, they each also have their own computer. This lets one developer look up documentation on their computer while the host developer continues to write code. It also allows the non-host developer to shield the host from distractions like answering questions in the chat room or responding to emails.

When pairing remotely, it is easier for the non-host developer to stop paying attention. It is easier to be rude and not pay attention to your pair when you are not sitting next to them. If you notice your pair has zoned out, nicely call them out on it or ask them a question to get them to re-engage.

One-on-ones

One-on-ones are a useful practice in both a co-located and distributed team. For those who aren’t familiar with one-on-ones, they are meetings between you and your manager (flip that if you are the manager). They should be scheduled regularly and be a time where you discuss higher-level topics than the daily work. They are extremely useful for helping you develop professionally and helping a team run smoothly. If you currently have one-on-ones and you aren’t finding them useful, I’d recommend reading some articles with tips for making them useful. Here are a couple articles that give some pretty good advice. As both a team lead and a team member I’ve found one-on-ones extremely useful. I thought they were even more useful when on a distributed team.

With a distributed team you lose out on a lot of the body language you can pick up on when in person. It is harder to tell when someone is frustrated or when a pair is not working well together. Burnout is harder to notice. One-on-ones provide a time for that information to come out.

Meet in person

You should have your distributed team or company meet in person. This shouldn’t happen too regularly; I think this ideally happens two to four times a year. Even if you see someone every day on video, there is still some magic that happens when you meet in person⁶.

You can use this time to do the typical day-to-day work, but I think it is more productive to try other activities. The most successful in-person meetups I’ve been part of consisted mostly of group discussions. This can take the form of a mini-conference or Open Space. Another option is to brainstorm some potentially wild ideas and try implementing them to see how far you can get.

Use this time to have some meals and drinks with your coworkers. Play some board games and talk about things other than work. Sing some karaoke. Get to know your coworkers as more than someone inside your computer. Doing so can help with communication and understanding personalities.

End

It is an exciting time to be a remote worker. New tools are emerging that try to make remote work easier. New techniques are being discovered. Old techniques are being adapted.

I hope you’ve found this article useful. If you are a remote worker, maybe you’ve picked up some ideas to bring into your remote work. If you work in an office, perhaps you’ve found some useful arguments for moving towards remote work.

There is much more I could write about remote work and distributed teams. Some of these sections deserve their own posts and extended examples. You can browse the remote category of my site to find other articles I’ve already written.

If you’ve enjoyed this article, consider sharing (tweeting) it to your followers.

Acknowledgments

This article came to life from the notes and research I did prior to speaking at the 2016 AIT Workshop. Some of those notes came from correspondence with Timothy Pratley, Rusty Bentley, Carin Meier, Devin Walters, Paco Viramontes, Jeff Bay, and Michael Halvorson. Discussions at the conference, with the above individuals, and working remotely at Outpace and Lumanu really helped solidify my thoughts.

Other references

1. http://globalworkplaceanalytics.com/telecommuting-statistics
2. Various articles: one, two, three, four, five, six, seven
3. http://www.npr.org/2015/09/22/442582422/the-cost-of-interruptions-they-waste-more-time-than-you-think
4. Though, you may want to pay for the Internet or provide a budget to help remote employees set up their home office.
5. “Carrying too Heavy a Load? The Communication and Miscommunication of Emotion by Email” and “Why It’s So Hard To Detect Emotion In Emails And Texts”
6. Like being surprised at how tall or short your coworkers are. It gets me every time.

Measuring aggregate performance in Clojure

Sep 29th, 2017 8:48 am

Last time I needed to speed up some code, I wrote a Clojure macro that recorded the aggregate time spent executing the code wrapped by the macro. Aggregate timings were useful since the same functions were called multiple times in the code path we were trying to optimize. Seeing total times made it easier to identify where we should spend our time.

Below is the namespace I temporarily introduced into our codebase.

(ns metrics)

(defn msec-str
  "Returns a human readable version of milliseconds based upon scale"
  [msecs]
  (let [s 1000
        m (* 60 s)
        h (* 60 m)]
    (condp >= msecs
      1 (format "%.5f msecs" (float msecs))
      s (format "%.1f msecs" (float msecs))
      m (format "%.1f seconds" (float (/ msecs s)))
      h (format "%02dm:%02ds" (int (/ msecs m))
                (mod (int (/ msecs s)) 60))
      (format "%dh:%02dm" (int (/ msecs h))
              (mod (int (/ msecs m)) 60)))))

(def aggregates (atom {}))

(defmacro record-aggregate
  "Records the total time spent executing body across invocations."
  [label & body]
  `(do
     (when-not (contains? @aggregates ~label)
       (swap! aggregates assoc ~label {:order (inc (count @aggregates))}))
     (let [start-time# (System/nanoTime)
           result# (do ~@body)
           result# (if (and (seq? result#)
                            (instance? clojure.lang.IPending result#)
                            (not (realized? result#)))
                     (doall result#)
                     result#)
           end-time# (System/nanoTime)]
       (swap! aggregates
              update-in
              [~label :msecs]
              (fnil + 0)
              (/ (double (- end-time# start-time#)) 1000000.0))
       result#)))

(defn log-times
  "Logs time recorded by record-aggregate and resets the aggregate times."
  []
  (doseq [[label data] (sort-by (comp :order second) @aggregates)
          :let [msecs (:msecs data)]]
    (println "Executing" label "took:" (msec-str msecs)))
  (reset! aggregates {}))

record-aggregate takes a label and code and times how long that code takes to run. If the executed code returns an unrealized lazy sequence, it also evaluates the sequence¹.
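
Forcing the lazy sequence matters because otherwise the timer would stop before any of the work had happened. The same pitfall can be sketched in Python, where generators play the role of lazy sequences. This is an illustrative analogue, not a port of the macro above; `record_aggregate`, `aggregates`, and `slow_items` are made-up names for this sketch.

```python
import time
import types
from collections import defaultdict

# Total seconds recorded per label (hypothetical store for this sketch).
aggregates = defaultdict(float)


def record_aggregate(label, thunk):
    """Run thunk, add its elapsed time to label's total, and return its result."""
    start = time.perf_counter()
    result = thunk()
    # A generator has done no work yet; realize it so the work is timed.
    if isinstance(result, types.GeneratorType):
        result = list(result)
    aggregates[label] += time.perf_counter() - start
    return result


def slow_items(n):
    """Lazily yield n items, sleeping briefly before each one."""
    for i in range(n):
        time.sleep(0.01)
        yield i


# Two invocations under the same label accumulate into one total.
items = record_aggregate("slow-items", lambda: slow_items(5))
record_aggregate("slow-items", lambda: slow_items(5))
print(items, aggregates["slow-items"])
```

Without the `list(result)` step, the recorded time would be close to zero because none of the sleeps would have run inside the timed region.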

Below is an example of using the above code. When we used it, we looked at the code path we needed to optimize and wrapped chunks of it in record-aggregate. At the end of the calculations, we inserted a call to log-times so timing data would show up in our logs.

(ns work
  (:require [metrics :as m]))

(defn calculation [x]
  (m/record-aggregate ::calculation
                      (Thread/sleep (+ 300 (rand-int 60)))
                      x))

(defn work [x]
  (m/record-aggregate ::work
                      (repeatedly 10 (fn []
                                       (Thread/sleep 5)
                                       x))))

(defn process-rows [rows]
  (let [rows (m/record-aggregate ::process-rows
                                 (->> rows
                                      (mapv calculation)
                                      (mapcat work)))]
    (m/log-times)
    rows))

Now, when (process-rows [:a :a]) is called, output similar to the following is printed.

Executing :work/process-rows took: 780.9 msecs
Executing :work/calculation took: 664.6 msecs
Executing :work/work took: 115.8 msecs

Using this technique, we identified the slow parts of our process and optimized those chunks of our code. There are potential flaws with measuring time like this, but they were not a problem in our situation².

See Measure what you intend to measure ↩
See Nanotrusting the Nanotime ↩

My current Leiningen profiles.clj

Aug 27th, 2017 7:06 pm

Nearly three years ago I wrote an overview of my Leiningen profiles.clj. That post is one of my most visited articles, so I thought I’d give an update on what I currently keep in ~/.lein/profiles.clj.

profiles.clj

{:user {:plugin-repositories [["private-plugins" {:url "private url"}]]
        :dependencies [[pjstadig/humane-test-output "0.8.2"]]
        :injections [(require 'pjstadig.humane-test-output)
                     (pjstadig.humane-test-output/activate!)]
        :plugins [[io.sattvik/lein-ancient "0.6.11"]
                  [lein-pprint "1.1.2"]
                  [com.jakemccrary/lein-test-refresh "0.21.1"]
                  [lein-autoexpect "1.9.0"]]
        :signing {:gpg-key "B38C2F8C"}
        :test-refresh {:notify-command ["terminal-notifier" "-title" "Tests" "-message"]
                       :quiet true
                       :changes-only true}}}

The biggest difference between my profiles.clj from early 2015 and now is that I’ve removed all of the CIDER-related plugins. I still use CIDER, but CIDER no longer requires you to list its dependencies explicitly.

I’ve also removed Eastwood and Kibit from my toolchain. I love static analysis, but these tools fail too frequently with my projects. As a result, I rarely used them and I’ve removed them. Instead, I’ve started using joker for some basic static analysis and am really enjoying it. It is fast, and it has made refactoring in Emacs noticeably better.

lein-test-refresh, lein-autoexpect, and humane-test-output have stuck around and have been updated to the latest versions. These tools make testing Clojure much nicer.

I’m also taking advantage of some new features that lein-test-refresh provides. These settings enable the most reliable, fastest feedback possible while writing tests. My recommended testing setup article goes into more details.

lein-ancient and lein-pprint have stuck around. I rarely use lein-pprint but it comes in handy when debugging project.clj problems. lein-ancient is great for helping you keep your project’s dependencies up to date. I use a forked version that contains some changes I need to work with my company’s private repository.

And there you have it. My updated profiles.clj¹.

Some of you might wonder why I don’t just link to this file in version control somewhere. Well, it is kept encrypted in a git repository because it also contains some secrets that should not be public. I’ve removed those for this post.↩

Using my phone's voice control for a month

Jul 28th, 2017 9:06 am

From May 6th to June 2nd the screen of my phone had a crack. I have an Android phone, and the crack was through the software buttons at the bottom of the screen. As a result, I could not touch the back, home, or overview (app switching) buttons. For nearly a month I never saw my home screen, couldn’t go back, or switch apps through touching my phone. I was very reliant on arriving notifications giving me an opportunity to open apps.

It took me some time, but I realized I could use voice commands to replace some of the missing functionality. Using voice commands, I could open apps and no longer be at the whim of notifications.

Here is an example of my phone usage during this month. My thoughts are in [brackets]. Italics indicate actions. Talking is wrapped in “ ”.

[Alright, I want to open Instagram] “Ok Google, open Instagram.”
[Sweet, it worked] scrolls through feed
WhatsApp notification happens [Great, a notification, I can click it to open WhatsApp]
I read messages in WhatsApp.
[Time to go back to Instagram] “Ok Google, open Instagram”
[sigh, voice command failed, let’s try again] “Ok Google, open Instagram”
Instagram opens [Great, time to scroll through more pictures]

As you can see, it is a bit more painful than clicking buttons to switch between different apps. Voice commands fail sometimes and, at least for me, generally take more effort than tapping the screen. That’s ok though; I was determined to embrace voice commands and experience what a future of only voice commands might feel like.

Below are some observations from using my voice to control my phone for a month.

It is awkward in public

My phone usage in public went way down. There was something about having to talk to your phone to open an app that made me not want to pull out my phone.

It is much more obvious you are using your phone when you use your voice to control it. It makes casual glances at your phone while hanging out with a group impossible. You can’t sneak a quick look at Instagram when you need to say “Ok Google, open Instagram” without completely letting everyone around you know you are no longer paying attention.

This also stopped me from using my phone in Ubers/Lyfts/cabs. I often talk to the driver or other passengers anyway, but this cemented that. I realize it is completely normal to ignore the other people in a car but I felt like a (small) asshole audibly calling out that I’m ignoring other people in the car.

You become more conscious of what apps you use

When you have to say “Ok Google, open Instagram” every time you want to open Instagram, you become way more aware of how often you use Instagram. Using your voice instead of tapping a button on your screen is a much bigger hurdle between having the urge to open something and actually opening it. It gives you more time to observe what you are doing.

You become more conscious of using your phone

Using your phone becomes a lot harder. This increased difficulty helped highlight when I was using my phone. My phone’s functionality dropped drastically and, as a result, I stopped reaching for it as much.

This reminded me of when I used a dumb (feature) phone for a couple of months a few years ago. Using a non-smartphone after using a smartphone for years was weird. It helped me rein in my usage¹.

Voice control can be pretty convenient

Even after repairing my screen, I still find myself using some voice commands. While making my morning coffee, I often ask my phone for the weather forecast. This is more convenient than opening an app and it lets me continue to use both hands while making coffee.

Setting alarms, starting countdown timers, adding reminders, and checking the weather are all things I do through voice commands now.

I wish it worked all the time

I suppose this is an argument for getting a Google Home or Amazon Echo. I have to wake up my phone to use voice commands with it. This limits the usefulness of voice commands since I need to be within reach of my phone.

I wish it could do more

At some point, I got used to asking my phone to do things. Then I started giving it more complicated commands, and it would fail. I found myself giving it multi-stage commands such as “Ok Google, turn on Bluetooth and play my playlist Chill on Spotify.” That doesn’t work but it would be amazing if it did.

Recommendations

I recommend that you force yourself to use voice commands for some period of time. Pretend your home button is broken and you have to use voice control to move around your phone. You’ll become more aware of your phone usage and you’ll learn some useful voice commands that will make your technology usage nicer.

My non-smartphone experiment four years ago is what resulted in me no longer using Facebook or Twitter on my phone. It also is the reason I silenced most notifications, including email, on my phone.↩

Speeding up this site by optionally loading Disqus comments

Jun 30th, 2017 7:37 pm

Earlier this month I took another look at what was required for reading an article on this site. What else could I do to make this site load faster?

To do this, I loaded up WebPageTest and pointed it towards one of my posts. To my shock, it took 113 requests for a total of 721 KB to load a single post. This took WebPageTest 6.491 seconds. The document complete event triggered after 15 requests (103 KB, 1.6 seconds).

113 requests to load a static article was ridiculous. Most of those requests happened as a result of loading the Disqus JavaScript. I find comments valuable and want to continue including them on my site. Because of this, I couldn’t remove Disqus. Instead, I made loading Disqus optional.

After making the required changes, it only takes 11 requests for 61 KB of data to fully load the test post. The document complete event only required 8 requests for 51 KB of data. Optionally loading the Disqus JavaScript resulted in a massive reduction of data transferred.

How did I do it? The template that generates my articles now only inserts the Disqus JavaScript when a reader clicks a button. My final template is at the bottom of this post.

The template adds an insertDisqus function that inserts a <script> element when a reader clicks a button. This element contains the original JavaScript that loads Disqus. When the <script> element is inserted into the page, the Disqus JavaScript is loaded and the comments appear.

My exact template might not work for you, but I’d encourage you to think about optionally loading Disqus and other non-required JavaScript. Your readers will thank you.

{% if site.disqus_short_name and page.comments == true %}
  <noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
  <div id="disqus_target">
    <script>
     var insertDisqus = function() {
       var elem = document.createElement('script');
       elem.innerHTML =  "var disqus_shortname = '{{ site.disqus_short_name }}'; var disqus_identifier = '{{ site.url }}{{ page.url }}'; var disqus_url = '{{ site.url }}{{ page.url }}'; (function () {var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true; dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js'; (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);}());"
       var target = document.getElementById('disqus_target');
       target.parentNode.replaceChild(elem, target);
     }
    </script>
    <button class="comment-button" onclick="insertDisqus()"><span>ENABLE COMMENTS AND RECOMMENDED ARTICLES</span></button>
  </div>
{% endif %}

Adding a JSON Feed to an Octopress/Jekyll generated site

May 30th, 2017 10:31 pm

I went to a coffee shop this last weekend with the intention of writing up a quick article on comm. I sat down, sipping my coffee, and wasn’t motivated. I didn’t feel like knocking out a short post, and I didn’t feel like editing a draft I’ve been sitting on for a while. I wanted to do some work though, so I decided to add a JSON Feed to this site.

JSON Feed is an alternative to Atom and RSS that uses JSON instead of XML. I figured I could add support for it in less than the time it would take to enjoy my coffee and maybe some readers would find it useful. I’d be shocked if anyone actually finds this useful, but it was a fun little exercise anyway.

An old version of Octopress (2.something), which uses an old version of Jekyll (2.5.3), generates this site. Despite this, I don’t think the template would need to change much if I moved to a new version. The template below is saved as source/feed.json in my git repository.

---
layout: null
---
{
  "version": "https://jsonfeed.org/version/1",
  "title": {{ site.title | jsonify }},
  "home_page_url": "{{ site.url }}",
  "feed_url": "{{site.url}}/feed.json",
  "favicon": "{{ site.url }}/favicon.png",
  "author" : {
      "url" : "https://twitter.com/jakemcc",
      "name" : "{{ site.author | strip_html }}"
  },
  "user_comment": "This feed allows you to read the posts from this site in any feed reader that supports the JSON Feed format. To add this feed to your reader, copy the following URL - {{ site.url }}/feed.json - and add it to your reader.",
  "items": [{% for post in site.posts limit: 20 %}
    {
      "id": "{{ site.url }}{{ post.id }}",
      "url": "{{ site.url }}{{ post.url }}",
      "date_published": "{{ post.date | date_to_xmlschema }}",
      "title": {% if site.titlecase %}{{ post.title | titlecase | jsonify }}{% else %}{{ post.title | jsonify }}{% endif %},
      {% if post.description %}"summary": {{ post.description | jsonify }},{% endif %}
      "content_html": {{ post.content | expand_urls: site.url | jsonify }},
      "author" : {
        "name" : "{{ site.author | strip_html }}"
      }
    }{% if forloop.last == false %},{% endif %}
    {% endfor %}
  ]
}

I approached this problem by reading the JSON Feed Version 1 spec and cribbing values from the template for my Atom feed. The trickiest part was filling in the "content_html" value. It took me a while to figure out that jsonify needed to be at the end of {{ post.content | expand_urls: site.url | jsonify }}. That translates the post’s HTML content into its JSON representation. You’ll notice that any template expression with jsonify at the end also isn’t wrapped in quotes. This is because jsonify is doing that for me.

The {% if forloop.last == false %},{% endif %} is also important. Without this, the generated JSON has an extra , after the final element in items. This isn’t valid JSON.
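
The trailing comma failure is easy to reproduce outside of the template. Below is a small Python sketch, using only the standard library's json module, that shows a trailing comma being rejected along with the position the parser reports. The `validate` helper and the sample strings are made up for illustration and are not part of the feed template.

```python
import json


def validate(text):
    """Return None if text is valid JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg} at line {err.lineno}, column {err.colno}"
    return None


# Hypothetical feed fragments: the first mimics a template without the forloop.last guard.
with_trailing_comma = '{"items": [{"id": 1}, {"id": 2},]}'
without_trailing_comma = '{"items": [{"id": 1}, {"id": 2}]}'

print(validate(with_trailing_comma))
print(validate(without_trailing_comma))
```

The exact wording of the error message depends on the Python version, so the sketch only relies on an error being reported for the first string and not for the second.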

I caught that by using the command line tool json. If you ever edit JSON by hand or generate it from a template then you should add this tool to your toolbox. It will prevent you from creating invalid JSON.

How did I use it? I’d make a change in the feed.json template and generate an output file. Then I’d cat that file to json --validate. When there was an error, I’d see a message like below.

0 [last: 5s] 12:43:47 ~/src/jakemcc/blog (master *)
$ cat public/feed.json | json --validate
json: error: input is not JSON: Expected ',' instead of '{' at line 25, column 5:
            {
        ....^
1 [last: 0s] 12:43:49 ~/src/jakemcc/blog (master *)
$

And there would be zero output on success.

0 [last: 5s] 12:45:25 ~/src/jakemcc/blog (master)
$ cat public/feed.json | json --validate
0 [last: 0s] 12:45:30 ~/src/jakemcc/blog (master)
$

It was pretty straightforward to add a JSON Feed. Was it a good use of my time? ¯\_(ツ)_/¯. In the process of adding the feed I learned more about Liquid templating and figured out how to embed liquid tags into a blog post. Even adding redundant features can be a useful exercise.

Using comm to verify file content matches

May 29th, 2017 10:45 am

I recently found myself in a situation where I needed to confirm that a process took in a tab separated file, did some processing, and then output a new file containing the original columns with some additional ones. The feature I was adding allowed the process to die and restart while processing the input file and pick up where it left off.

I needed to confirm the output had data for every line in the input. I reached for the command line tool comm.

Below is a made up input file.

UNIQUE_ID    USER
38101838
19183819
19123811
10348018
19881911
29182918

And here is some made up output.

UNIQUE_ID    USER    MESSAGE
38101838    A01
19183819    A05
19123811    A02
10348018    A01
19881911    A02
29182918    A05

With files this size, it would be easy enough to check visually. In my testing, I was dealing with files that had thousands of lines. This is too many to check by hand. It is a perfect amount for comm.

comm reads two files as input and then outputs three columns. The first column contains lines found only in the first file, the second column contains lines only found in the second, and the last column contains lines in both. If it is easier for you to think about it as set operations, the first two columns are similar to performing two set differences and the third is similar to set intersection. Below is an example adapted from Wikipedia showing its behavior.

$ cat foo.txt
apple
banana
eggplant
$ cat bar.txt
apple
banana
banana
zucchini
$ comm foo.txt bar.txt
                  apple
                  banana
          banana
eggplant
          zucchini
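
The three-column behavior shown above boils down to a single merge pass over two sorted inputs. Here is a rough Python sketch of that idea, assuming plain string comparison for ordering (the real tool follows the locale's collation rules). The function name `comm` and the `(column, line)` row format are choices made for this sketch.

```python
def comm(first, second):
    """Merge two sorted lists into (column, line) rows.

    column is 1 (only in first), 2 (only in second), or 3 (in both),
    mirroring the three output columns of the comm tool.
    """
    rows = []
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            rows.append((3, first[i]))
            i += 1
            j += 1
        elif first[i] < second[j]:
            rows.append((1, first[i]))
            i += 1
        else:
            rows.append((2, second[j]))
            j += 1
    # Whatever is left over exists in only one of the inputs.
    rows.extend((1, line) for line in first[i:])
    rows.extend((2, line) for line in second[j:])
    return rows


foo = ["apple", "banana", "eggplant"]
bar = ["apple", "banana", "banana", "zucchini"]

for column, line in comm(foo, bar):
    # Each column is indented with one more tab than the previous one.
    print("\t" * (column - 1) + line)
```

Because lines are paired off one at a time, the second banana in bar.txt has no partner in foo.txt and lands in the second column, just as in the output above.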

So how is this useful? Well, you can also tell comm to suppress outputting specific columns. If we send the common columns from the input and output file to comm and suppress comm’s third column then anything printed to the screen is a problem. Anything printed to the screen was found in one of the files and not the other. We’ll select the common columns using cut and, since comm expects input to be sorted, then sort using sort. Let’s see what happens.

$ comm -3 <(cut -f 1,2 input.txt | sort) <(cut -f 1,2 output.txt | sort)
$

Success! Nothing was printed to the console, so there is nothing unique in either file.
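
For comparison, the same check can be sketched in Python as a symmetric difference over the shared columns. This is only an approximation of the pipeline above: sets ignore duplicate rows, which comm does not, and the function names and sample data are invented for this example.

```python
def shared_columns(text, count):
    """Return the rows of tab-separated text, cut down to the first count columns."""
    return {tuple(line.split("\t")[:count]) for line in text.splitlines()}


def mismatches(input_text, output_text, count):
    """Rows present in only one of the two files, similar to the output of comm -3."""
    return shared_columns(input_text, count) ^ shared_columns(output_text, count)


# Made-up sample data: the output repeats the input columns and adds one more.
input_text = "ID\tUSER\n1\talice\n2\tbob\n"
complete_output = "ID\tUSER\tMESSAGE\n1\talice\tok\n2\tbob\tok\n"
partial_output = "ID\tUSER\tMESSAGE\n1\talice\tok\n"

print(mismatches(input_text, complete_output, 2))  # set()
print(mismatches(input_text, partial_output, 2))   # {('2', 'bob')}
```

An empty result corresponds to comm printing nothing. Anything in the result is a row that went missing or was altered along the way.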

comm is a useful tool to have in your command line toolbox.

Send a push notification when your external IP address changes

May 15th, 2017 10:15 pm

I need to know when my external IP address changes. Whenever it changes, I need to update an IP whitelist and need to re-login to a few sites. I sometimes don’t notice for a couple of days and, during that time, some automatic processes fail.

After the last time this happened, I whipped up a script that sends me a push notification when my IP address changes.

The script uses Pushover to send the push notification. Pushover is great. I have used it for years to get notifications from my headless computers. If you use the below script, replace ${PUSHOVER_TOKEN} and ${PUSHOVER_USER} with your own details.

#!/bin/bash

set -e

previous_file="${HOME}/.previous-external-ip"

if [ ! -e "${previous_file}" ]; then
    dig +short myip.opendns.com @resolver1.opendns.com > "${previous_file}"
fi

current_ip=$(dig +short myip.opendns.com @resolver1.opendns.com)

previous_ip=$(cat "${previous_file}")

if [ "${current_ip}" != "${previous_ip}" ]; then
    echo "external ip changed"
    curl -s --form-string "token=${PUSHOVER_TOKEN}" \
         --form-string "user=${PUSHOVER_USER}" \
         --form-string "title=External IP address changed" \
         --form-string "message='${previous_ip}' => '${current_ip}'" \
         https://api.pushover.net/1/messages.json
fi

echo "${current_ip}" > "${previous_file}"
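
The core of the script is the compare-then-store step. Below is a minimal Python sketch of just that logic, leaving out the DNS lookup and the Pushover request so it can run anywhere. The `check_ip` function, the file name, and the sample addresses (taken from the ranges reserved for documentation) are assumptions made for this sketch.

```python
import tempfile
from pathlib import Path


def check_ip(previous_file, current_ip):
    """Compare current_ip with the stored one, store it, and return a message on change."""
    path = Path(previous_file)
    # On the first run there is nothing to compare against, so seed with the current IP.
    previous_ip = path.read_text().strip() if path.exists() else current_ip
    path.write_text(current_ip + "\n")
    if current_ip != previous_ip:
        return f"'{previous_ip}' => '{current_ip}'"
    return None


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "previous-external-ip"
    print(check_ip(state, "203.0.113.7"))   # first run, nothing to report
    print(check_ip(state, "203.0.113.7"))   # unchanged
    print(check_ip(state, "198.51.100.4"))  # changed, returns the message text
```

Passing the current address in as an argument keeps the comparison testable. In a real version, the lookup and the notification would wrap around this function.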

What are the most used Clojure libraries?

Apr 17th, 2017 10:07 am

In a previous post, we used Google’s BigQuery and the public GitHub dataset to discover the most used Clojure testing library. The answer wasn’t surprising. The built-in clojure.test was by far the most used.

Let’s use the dataset to answer a less obvious question: what are the most used libraries in Clojure projects? We’ll measure this by counting references to libraries in project.clj and build.boot files.

Before we can answer that question, we’ll need to transform the data. First, we create the Clojure subset of the GitHub dataset. I did this by executing the following queries and saving the results to tables¹.

-- Save the results of this query to the clojure.files table
SELECT
  *
FROM
  [bigquery-public-data:github_repos.files]
WHERE
  RIGHT(path, 4) = '.clj'
  OR RIGHT(path, 5) = '.cljc'
  OR RIGHT(path, 5) = '.cljs'
  OR RIGHT(path, 10) = 'build.boot'

-- Save the results to clojure.contents
SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM clojure.files)

Next we extract the dependencies from build.boot and project.clj files. Fortunately for us, both of these files specify dependencies in the same format, so we’re able to use the same regular expression on both types.

The query below identifies project.clj and build.boot files, splits each file into lines, and extracts referenced library names and versions using a regular expression. Additional filtering is done to get rid of some spurious results.

SELECT
  REGEXP_EXTRACT(line, r'\[+(\S+)\s+"\S+"]') AS library,
  REGEXP_EXTRACT(line, r'\[+\S+\s+"(\S+)"]') AS version,
  COUNT(*) AS count
FROM (
  SELECT
    SPLIT(content, '\n') AS line
  FROM
    [clojure.contents]
  WHERE
    id IN (
    SELECT
      id
    FROM
      [clojure.files]
    WHERE
      path LIKE '%project.clj'
      OR path LIKE '%build.boot')
      HAVING line contains '[')
GROUP BY
  library, version
HAVING library is not null and not library contains '"'
ORDER BY
  count DESC

The first five rows from the result are below. Let’s save the entire result to a clojure.libraries table.

| library             | version | count |
|---------------------+---------+-------|
| org.clojure/clojure | 1.6.0   | 7015  |
| org.clojure/clojure | 1.5.1   | 4251  |
| org.clojure/clojure | 1.7.0   | 4093  |
| org.clojure/clojure | 1.8.0   | 3016  |
| hiccup              | 1.0.5   | 1280  |
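
To get a feel for what the regular expression in the extraction query matches, it can be tried against a few sample lines. The Python sketch below combines the query's two patterns into one with two capture groups and mimics the filter that drops library names containing a quote. The `extract` helper and the sample lines are invented for illustration, and Python's `re` module is standing in for BigQuery's regex engine here.

```python
import re

# Same shape as the patterns in the query, with a capture group for each field.
DEPENDENCY = re.compile(r'\[+(\S+)\s+"(\S+)"]')


def extract(line):
    """Return (library, version) for a dependency-looking line, else None."""
    match = DEPENDENCY.search(line)
    if match is None:
        return None
    library, version = match.groups()
    # Mirrors the query's filter that drops library names containing a quote.
    if '"' in library:
        return None
    return library, version


# Hypothetical lines of the kind found in a project.clj file.
lines = [
    '  :dependencies [[org.clojure/clojure "1.8.0"]',
    '                 [hiccup "1.0.5"]]',
    '  :repositories [["releases" "https://example.com/repo"]]',
    '  :description "A sample project"',
]
for line in lines:
    print(extract(line))
```

The third line shows the kind of spurious match the extra filtering is there for: the pattern matches a repository entry, but the captured name still contains a quote and is discarded.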

Now we can start answering all sorts of interesting questions.

What is the most referenced library put out under the org.clojure group?

SELECT library, sum(count) count
FROM clojure.libraries
WHERE library CONTAINS 'org.clojure'
GROUP BY library
ORDER BY count desc

| Row | library                        | count |
|-----+--------------------------------+-------|
|   1 | org.clojure/clojure            | 20834 |
|   2 | org.clojure/clojurescript      |  3080 |
|   3 | org.clojure/core.async         |  2612 |
|   4 | org.clojure/tools.logging      |  1579 |
|   5 | org.clojure/data.json          |  1546 |
|   6 | org.clojure/tools.nrepl        |  1244 |
|   7 | org.clojure/java.jdbc          |  1064 |
|   8 | org.clojure/tools.cli          |  1053 |
|   9 | org.clojure/tools.namespace    |   982 |
|  10 | org.clojure/test.check         |   603 |
|  11 | org.clojure/core.match         |   578 |
|  12 | org.clojure/math.numeric-tower |   503 |
|  13 | org.clojure/data.csv           |   381 |
|  14 | org.clojure/math.combinatorics |   372 |
|  15 | org.clojure/tools.reader       |   368 |
|  16 | org.clojure/clojure-contrib    |   335 |
|  17 | org.clojure/data.xml           |   289 |
|  18 | org.clojure/tools.trace        |   236 |
|  19 | org.clojure/java.classpath     |   199 |
|  20 | org.clojure/core.cache         |   179 |

Clojure and ClojureScript are at the top, which isn’t surprising. I’m surprised to see tools.nrepl in the next five results (rows 3-7). It is the only library out of those five that I haven’t used.

What testing library is used the most? We already answered this in my last article but let’s see if we get the same answer when we’re counting how many times a library is pulled into a project.

SELECT library, sum(count) count
FROM [clojure.libraries]
WHERE library in ('midje', 'expectations', 'speclj', 'smidjen', 'fudje')
GROUP BY library
ORDER BY count desc

| Row | library                | count |
|-----+------------------------+-------|
|   1 | midje                  |  1122 |
|   2 | speclj                 |   336 |
|   3 | expectations           |   235 |
|   4 | smidjen                |     1 |

Those results are close to the previous results. Of the non-clojure.test libraries, midje still ends up on top.

What groups (as identified by the Maven groupId) have their libraries referenced the most? Top 12 are below but the full result is available.

SELECT REGEXP_EXTRACT(library, r'(\S+)/\S+') AS group, sum(count) AS count
FROM [clojure.libraries]
GROUP BY group
HAVING group IS NOT null
ORDER BY count DESC

| Row | group                 | count |
|-----+-----------------------+-------|
|   1 | org.clojure           | 39611 |
|   2 | ring                  |  5817 |
|   3 | com.cemerick          |  2053 |
|   4 | com.taoensso          |  1605 |
|   5 | prismatic             |  1398 |
|   6 | org.slf4j             |  1209 |
|   7 | cljsjs                |   868 |
|   8 | javax.servlet         |   786 |
|   9 | com.stuartsierra      |   642 |
|  10 | com.badlogicgames.gdx |   586 |
|  11 | cider                 |   560 |
|  12 | pjstadig              |   536 |

And finally, the question that inspired this article, what is the most used library?

SELECT library, sum(count) count
FROM [clojure.libraries]
WHERE library != 'org.clojure/clojure'
GROUP BY library
ORDER BY count desc

| Row | library                     | count |
|-----+-----------------------------+-------|
|   1 | compojure                   |  3609 |
|   2 | lein-cljsbuild              |  3413 |
|   3 | org.clojure/clojurescript   |  3080 |
|   4 | org.clojure/core.async      |  2612 |
|   5 | lein-ring                   |  1809 |
|   6 | cheshire                    |  1802 |
|   7 | environ                     |  1763 |
|   8 | ring                        |  1678 |
|   9 | clj-http                    |  1648 |
|  10 | clj-time                    |  1613 |
|  11 | hiccup                      |  1591 |
|  12 | lein-figwheel               |  1582 |
|  13 | org.clojure/tools.logging   |  1579 |
|  14 | org.clojure/data.json       |  1546 |
|  15 | http-kit                    |  1423 |
|  16 | lein-environ                |  1325 |
|  17 | ring/ring-defaults          |  1302 |
|  18 | org.clojure/tools.nrepl     |  1244 |
|  19 | midje                       |  1122 |
|  20 | com.cemerick/piggieback     |  1096 |
|  21 | org.clojure/java.jdbc       |  1064 |
|  22 | org.clojure/tools.cli       |  1053 |
|  23 | enlive                      |  1001 |
|  24 | ring/ring-core              |   995 |
|  25 | org.clojure/tools.namespace |   982 |

Compojure takes the top slot. Full results are available.

Before doing this research I tried to predict what libraries I’d see in the top 10. I thought that clj-time and clj-http would be up there. I’m happy to see my guess was correct.

It was pretty pleasant using BigQuery to do this analysis. Queries took a few seconds at most to execute. This quick feedback let me play around in the web interface without feeling like I was waiting for computers to do work. This made the research into Clojure library usage painless and fun.

I did this in early March 2017.↩

Which Clojure testing library is most used?

Mar 31st, 2017 9:54 pm

I’ve always assumed that the built-in clojure.test is the most widely used testing library in the Clojure community. Earlier this month I decided to test this assumption using Google’s BigQuery GitHub dataset.

The BigQuery GitHub dataset contains over three terabytes of source code from more than 2.8 million open source GitHub repositories. BigQuery lets us quickly query this data using SQL.

Below is a table with the results (done in early March 2017) of my investigation. Surprising no one, clojure.test comes out as the winner and it is a winner by a lot.

| Library      | # Repos Using |
|--------------+---------------|
| clojure.test |         14304 |
| midje        |          1348 |
| expectations |           429 |
| speclj       |           207 |
| smidjen      |             1 |
| fudje        |             1 |

23,243 repositories were identified as containing Clojure (or ClojureScript) code. This means there were about 6,953 repositories that didn’t use any testing library¹. This puts the “no tests or an obscure other way of testing” in a pretty solid second place.
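
The arithmetic behind that figure is the total minus the sum of the counts in the table. The short Python sketch below spells it out. It assumes each repository is counted under at most one library, which is a simplification and one reason to read the result as approximate.

```python
# Counts copied from the table above.
repos_using = {
    "clojure.test": 14304,
    "midje": 1348,
    "expectations": 429,
    "speclj": 207,
    "smidjen": 1,
    "fudje": 1,
}
total_clojure_repos = 23243

# Assumes no repository appears under more than one library.
with_tests = sum(repos_using.values())
without_tests = total_clojure_repos - with_tests

print(with_tests)     # 16290
print(without_tests)  # 6953
```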

You should take these numbers as ballpark figures and not exact answers. I know from using GitHub’s search interface that there are three public projects using fudje².

So, why don’t all three of those projects show up? The dataset only includes projects where Google could identify the project as open source and the GitHub licenses API is used to do that³. Two of those three projects were probably unable to be identified as something with an appropriate license.

Another small problem is that since expectations is an actual word, it shows up outside of ns declarations. I ended up using a fairly simple query to generate this data and it only knows that expectations shows up somewhere in a file. I experimented with some more restrictive queries but they didn’t drastically change the result and I wasn’t sure they weren’t wrong in other ways. If you subtract a number between 100 and 150 you’ll probably have a more accurate expectations usage count.

Keep reading if you want to hear more about the steps to come up with the above numbers.

If you have other Clojure questions you think could be answered by querying this dataset, let me know in the comments or on Twitter. I have some more ideas, so I wouldn’t be surprised if at least one more article gets written.

The Details

The process was pretty straightforward. Most of my time was spent exploring the tables, figuring out what the columns represented, figuring out what queries worked well, and manually confirming some of the results. BigQuery is very fast. Very little of my time was spent waiting for results.

1. Set up the data

You get 1 TB of free BigQuery usage a month. You can blow through this in a single query. Google provides sample tables that contain less data but I wanted to operate on the full set of Clojure(Script) files, so my first step was to execute some queries to create tables that only contained Clojure data.

First, I queried the github_repos.files table for all the Clojure(Script) files and saved that to a clojure.files table.

SELECT
  *
FROM
  [bigquery-public-data:github_repos.files]
WHERE
  (RIGHT(path, 4) = '.clj'
    OR RIGHT(path, 5) = '.cljc'
    OR RIGHT(path, 5) = '.cljs')

The above query took only 9.2 seconds to run and processed 328 GB of data.

Using the clojure.files table, we can select the source for all the Clojure code from the github_repos.contents. I saved this to a clojure.contents table.

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM clojure.files)

This query processed 1.84 TB of data in 21.5 seconds. So fast. In roughly 30 seconds of query time, I’ve blown through the free limit.
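As a rough back-of-the-envelope check on how far past the free tier these two queries go, here is a small Python sketch. The price per TB is an assumption (a commonly quoted on-demand rate), not a figure from this article, and 328 GB is treated as 0.328 TB for simplicity.

```python
FREE_TB_PER_MONTH = 1.0   # free tier mentioned in the article
PRICE_PER_TB_USD = 5.00   # ASSUMPTION: on-demand price per TB, not from the article


def billable_usd(tb_per_query, free_tb=FREE_TB_PER_MONTH, price=PRICE_PER_TB_USD):
    """Cost of the queries after the monthly free allowance is used up."""
    billable_tb = max(0.0, sum(tb_per_query) - free_tb)
    return billable_tb * price


# The two setup queries: 328 GB (treated as 0.328 TB) and 1.84 TB.
setup_queries_tb = [0.328, 1.84]

print(round(sum(setup_queries_tb), 3))           # total TB processed
print(round(billable_usd(setup_queries_tb), 2))  # estimated charge in USD
```

Under that assumed price, the two setup queries process a bit over 2 TB and would account for most of what this article ended up costing; the remainder would come from exploratory queries.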

2. Identify what testing library (or libraries) a repo uses

We can guess that a file uses a testing library if it contains a certain string. The strings we’ll search for are the namespaces we’d expect to see required or used in an ns declaration. The query below does this for each file and then rolls up the results by repository. It took 3 seconds to run and processed 611 MB of data.

SELECT
  files.repo_name,
  MAX(uses_clojure_test) uses_clojure_test,
  MAX(uses_expectations) uses_expectations,
  MAX(uses_midje) uses_midje,
  MAX(uses_speclj) uses_speclj,
  MAX(uses_fudje) uses_fudje,
  MAX(uses_smidjen) uses_smidjen,
FROM (
  SELECT
    id,
    contents.content LIKE '%clojure.test%' uses_clojure_test,
    contents.content LIKE '%expectations%' uses_expectations,
    contents.content LIKE '%midje%' uses_midje,
    contents.content LIKE '%speclj%' uses_speclj,
    contents.content LIKE '%fudje%' uses_fudje,
    contents.content LIKE '%smidjen%' uses_smidjen,
  FROM
    clojure.contents AS contents) x
JOIN
  clojure.files files ON files.id = x.id
GROUP BY
  files.repo_name

Below is a screenshot of the first few rows in the result.

BigQuery results for test library usage by repo
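For readers less comfortable with SQL, the same logic as the step 2 query can be sketched in plain Python: flag each file by substring match, then OR the flags together per repository (which is what MAX over booleans does). This is an illustrative sketch with made-up names (`flag_file`, `rollup`) and in-memory inputs, not how the analysis was actually run.

```python
from collections import defaultdict

# Column name -> substring searched for, mirroring the LIKE patterns.
LIBRARIES = {
    "uses_clojure_test": "clojure.test",
    "uses_expectations": "expectations",
    "uses_midje": "midje",
    "uses_speclj": "speclj",
    "uses_fudje": "fudje",
    "uses_smidjen": "smidjen",
}


def flag_file(content):
    """One boolean per library: does the file contain the substring?"""
    return {col: needle in content for col, needle in LIBRARIES.items()}


def rollup(files, contents):
    """files: iterable of (repo_name, file_id); contents: file_id -> source.

    Files without contents are skipped, like the inner JOIN in the query.
    """
    repos = defaultdict(lambda: dict.fromkeys(LIBRARIES, False))
    for repo_name, file_id in files:
        if file_id not in contents:
            continue
        row = repos[repo_name]
        for col, hit in flag_file(contents[file_id]).items():
            row[col] = row[col] or hit  # same effect as MAX over booleans
    return dict(repos)


files = [("a/b", 1), ("a/b", 2), ("c/d", 3)]
contents = {
    1: "(ns x (:require [clojure.test :refer :all]))",
    2: "(ns y (:use midje.sweet))",
    3: "(ns z (:require [smidjen.core]))",
}
result = rollup(files, contents)
print(result["a/b"]["uses_clojure_test"], result["a/b"]["uses_midje"])
print(result["c/d"]["uses_smidjen"], result["c/d"]["uses_midje"])
```

The last line shows a side effect of plain substring matching: any file that contains smidjen also contains midje, so it gets flagged for both.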

3. Export the data

At this point, we could continue doing the analysis using SQL and the BigQuery UI but I opted to explore the data using Clojure and the repl. There were too many rows to directly download the query results as a csv file, so I ended up having to save the results as a table and then export it to Google’s cloud storage and download from there.

The first few rows of the file look like this:

files_repo_name,uses_clojure_test,uses_expectations,uses_midje,uses_speclj,uses_fudje,uses_smidjen
wangchunyang/clojure-liberator-examples,true,false,false,false,false,false
yantonov/rex,false,false,false,false,false,false

4. Calculate some numbers

The code takes the csv file and does some transformations. You could do this in Excel or using any language of your choice. I’m not going to include code here, as it isn’t that interesting.
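For anyone who wants a starting point anyway, here is a minimal sketch of that kind of transformation in Python (not the Clojure code used for this article). It assumes the column layout shown in step 3; the function name `summarize` and the `"none"` key are made up for illustration. The sample input is just the two rows shown above.

```python
import csv
import io
from collections import Counter


def summarize(csv_text):
    """Count repos per library column, plus repos with no library flagged."""
    counts = Counter()
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        used = [col for col, value in row.items()
                if col.startswith("uses_") and value == "true"]
        counts.update(used)
        if not used:
            counts["none"] += 1
    return total, counts


sample = """\
files_repo_name,uses_clojure_test,uses_expectations,uses_midje,uses_speclj,uses_fudje,uses_smidjen
wangchunyang/clojure-liberator-examples,true,false,false,false,false,false
yantonov/rex,false,false,false,false,false,false
"""

total, counts = summarize(sample)
print(total, counts["uses_clojure_test"], counts["none"])
```

Because a repo can be flagged for more than one library, the per-library counts can add up to more than the number of repos that use at least one library. Counting the "none" rows directly avoids that double counting.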

BigQuery thoughts

This was my first time using Google’s BigQuery. This wasn’t the most difficult analysis to do, but I was impressed by the speed and ease of use. The web UI, which I used exclusively for this, is neither really great nor extremely terrible. It mostly just worked and I rarely had to look up documentation.

I don’t really feel comfortable making a judgment call on whether the cost is expensive, but this article cost a bit less than seven dollars to write. This doesn’t seem too outrageous to me.

Based on my limited usage of BigQuery, it is something I’d look into further if I needed its capabilities.

¹ Probably higher, as projects can and do use more than one testing library.
² Those projects are jumarko/clojure-random, dpassen1/great-sort, and jimpil/fudje.
³ Source is a Google Developer Advocate’s response on an old HN post.
