Jake McCrary

Measuring aggregate performance in Clojure

The last time I needed to speed up some code, I wrote a Clojure macro that recorded the aggregate time spent executing the code it wrapped. Aggregate timings were useful because the same functions were called multiple times in the code path we were trying to optimize; seeing total times made it easier to identify where to focus our effort.

Below is the namespace I temporarily introduced into our codebase.

(ns metrics)

(defn msec-str
  "Returns a human readable version of milliseconds based upon scale"
  [msecs]
  (let [s 1000
        m (* 60 s)
        h (* 60 m)]
    (condp >= msecs
      1 (format "%.5f msecs" (float msecs))
      s (format "%.1f msecs" (float msecs))
      m (format "%.1f seconds" (float (/ msecs s)))
      h (format "%02dm:%02ds" (int (/ msecs m))
                (mod (int (/ msecs s)) 60))
      (format "%dh:%02dm" (int (/ msecs h))
              (mod (int (/ msecs m)) 60)))))

(def aggregates (atom {}))

(defmacro record-aggregate
  "Records the total time spent executing body across invocations."
  [label & body]
  `(do
     (when-not (contains? @aggregates ~label)
       (swap! aggregates assoc ~label {:order (inc (count @aggregates))}))
     (let [start-time# (System/nanoTime)
           result# (do ~@body)
           result# (if (and (seq? result#)
                            (instance? clojure.lang.IPending result#)
                            (not (realized? result#)))
                     (doall result#)
                     result#)
           end-time# (System/nanoTime)]
       (swap! aggregates
              update-in
              [~label :msecs]
              (fnil + 0)
              (/ (double (- end-time# start-time#)) 1000000.0))
       result#)))

(defn log-times
  "Logs time recorded by record-aggregate and resets the aggregate times."
  []
  (doseq [[label data] (sort-by (comp :order second) @aggregates)
          :let [msecs (:msecs data)]]
    (println "Executing" label "took:" (msec-str msecs)))
  (reset! aggregates {}))
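
A few REPL calls from within the metrics namespace show how msec-str picks a scale; return values are shown as comments:

;; Example invocations of the msec-str fn defined above
(msec-str 0.25)  ;=> "0.25000 msecs"
(msec-str 500)   ;=> "500.0 msecs"
(msec-str 5000)  ;=> "5.0 seconds"
(msec-str 90000) ;=> "01m:30s"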

record-aggregate takes a label and code and times how long that code takes to run. If the executed code returns an unrealized lazy sequence, it also realizes the sequence so the deferred work is included in the timing.
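
The realization step matters because a lazy sequence defers its work until it is consumed. A quick REPL sketch using clojure.core’s time illustrates the pitfall (elapsed times are approximate):

;; Building the lazy sequence is nearly instant; the sleeps happen
;; later, outside the timing.
(time (map #(do (Thread/sleep 100) %) (range 5)))
;; "Elapsed time: ~0.1 msecs"

;; Forcing realization with doall, as record-aggregate does, captures
;; the real cost.
(time (doall (map #(do (Thread/sleep 100) %) (range 5))))
;; "Elapsed time: ~500 msecs"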

Below is an example of using the above code. When we used it, we looked at the code path we needed to optimize and wrapped chunks of it in record-aggregate. At the end of the calculations, we inserted a call to log-times so timing data would show up in our logs.

(ns work
  (:require [metrics :as m]))

(defn calculation [x]
  (m/record-aggregate ::calculation
                      (Thread/sleep (+ 300 (rand-int 60)))
                      x))

(defn work [x]
  (m/record-aggregate ::work
                      (repeatedly 10 (fn []
                                       (Thread/sleep 5)
                                       x))))

(defn process-rows [rows]
  (let [rows (m/record-aggregate ::process-rows
                                 (->> rows
                                      (mapv calculation)
                                      (mapcat work)))]
    (m/log-times)
    rows))

Now, when (process-rows [:a :a]) is called, output similar to the following is printed.

Executing :work/process-rows took: 780.9 msecs
Executing :work/calculation took: 664.6 msecs
Executing :work/work took: 115.8 msecs

Using this technique, we identified the slow parts of our process and optimized those chunks of our code. There are potential flaws with measuring time like this, but they were not a problem in our situation.

My current Leiningen profiles.clj

Nearly three years ago I wrote an overview of my Leiningen profiles.clj. That post is one of my most visited articles, so I thought I’d give an update on what I currently keep in ~/.lein/profiles.clj.

profiles.clj
{:user {:plugin-repositories [["private-plugins" {:url "private url"}]]
        :dependencies [[pjstadig/humane-test-output "0.8.2"]]
        :injections [(require 'pjstadig.humane-test-output)
                     (pjstadig.humane-test-output/activate!)]
        :plugins [[io.sattvik/lein-ancient "0.6.11"]
                  [lein-pprint "1.1.2"]
                  [com.jakemccrary/lein-test-refresh "0.21.1"]
                  [lein-autoexpect "1.9.0"]]
        :signing {:gpg-key "B38C2F8C"}
        :test-refresh {:notify-command ["terminal-notifier" "-title" "Tests" "-message"]
                       :quiet true
                       :changes-only true}}}

The biggest difference between my profiles.clj from early 2015 and now is that I’ve removed all of the CIDER related plugins. I still use CIDER, but CIDER no longer requires you to list its dependencies explicitly.

I’ve also removed Eastwood and Kibit from my toolchain. I love static analysis, but these tools fail too frequently with my projects. As a result, I rarely used them, so I’ve removed them. Instead, I’ve started using joker for some basic static analysis and am really enjoying it. It is fast, and it has made refactoring in Emacs noticeably better.

lein-test-refresh, lein-autoexpect, and humane-test-output have stuck around and have been updated to the latest versions. These tools make testing Clojure much nicer.

I’m also taking advantage of some new features that lein-test-refresh provides. These settings enable the most reliable, fastest feedback possible while writing tests. My recommended testing setup article goes into more details.

lein-ancient and lein-pprint have stuck around. I rarely use lein-pprint but it comes in handy when debugging project.clj problems. lein-ancient is great for helping you keep your project’s dependencies up to date. I use a forked version that contains some changes I need to work with my company’s private repository.

And there you have it. My updated profiles.clj1.


  1. Some of you might wonder why I don’t just link to this file in version control somewhere. It is kept encrypted in a git repository because it also contains some secrets that should not be public, which I’ve removed for this post.

Using my phone's voice control for a month

From May 6th to June 2nd, the screen of my phone had a crack. I have an Android phone, and the crack was through the software buttons at the bottom of the screen. As a result, I could not touch the back, home, or overview (app switching) buttons. For nearly a month I never saw my home screen, couldn’t go back, and couldn’t switch apps by touching my phone. I relied on incoming notifications to give me an opportunity to open apps.

It took me some time, but I realized I could use voice commands to replace some of the missing functionality. Using voice commands, I could open apps and no longer be at the whim of notifications.

Here is an example of my phone usage during this month. My thoughts are in [brackets]. Italics indicate actions. Talking is wrapped in “ ”.

  1. [Alright, I want to open Instagram] “Ok Google, open Instagram.”
  2. [Sweet, it worked] scrolls through feed
  3. WhatsApp notification happens [Great, a notification, I can click it to open WhatsApp]
  4. I read messages in WhatsApp.
  5. [Time to go back to Instagram] “Ok Google, open Instagram”
  6. [sigh, voice command failed, let’s try again] “Ok Google, open Instagram”
  7. Instagram opens [Great, time to scroll through more pictures]

As you can see, it is a bit more painful than clicking buttons to switch between different apps. Voice commands fail sometimes and, at least for me, generally take more effort than tapping the screen. That’s ok though; I was determined to embrace voice commands and experience what a future of only voice commands might feel like.

Below are some observations from using my voice to control my phone for a month.

It is awkward in public

My phone usage in public went way down. There was something about having to talk to your phone to open an app that made me not want to pull out my phone.

It is much more obvious you are using your phone when you use your voice to control it. It makes casual glances at your phone while hanging out with a group impossible. You can’t sneak a quick look at Instagram when you need to say “Ok Google, open Instagram” without completely letting everyone around you know you are no longer paying attention.

This also stopped me from using my phone in Ubers/Lyfts/cabs. I often talk to the driver or other passengers anyway, but this cemented that. I realize it is completely normal to ignore the other people in a car, but I felt like a (small) asshole audibly announcing that I was ignoring them.

You become more conscious of what apps you use

When you have to say “Ok Google, open Instagram” every time you want to open Instagram, you become way more aware of how often you use Instagram. Using your voice instead of tapping a button on your screen is a much bigger hurdle between having the urge to open something and actually opening it. It gives you more time to observe what you are doing.

You become more conscious of using your phone

Using your phone becomes a lot harder. This increased difficulty helped highlight when I was using my phone. My phone’s functionality dropped drastically and, as a result, I stopped reaching for it as much.

This reminded me of when I used a dumb (feature) phone for a couple of months a few years ago. Using a non-smartphone after using a smartphone for years was weird. It helped me rein in my usage1.

Voice control can be pretty convenient

Even after repairing my screen, I still find myself using some voice commands. While making my morning coffee, I often ask my phone for the weather forecast. This is more convenient than opening an app and it lets me continue to use both hands while making coffee.

Setting alarms, starting countdown timers, adding reminders, and checking the weather are all things I do through voice commands now.

I wish it worked all the time

I suppose this is an argument for getting a Google Home or Amazon Echo. I have to wake up my phone to use voice commands with it. This limits the usefulness of voice commands since I need to be within reach of my phone.

I wish it could do more

At some point, I got used to asking my phone to do things. Then I started giving it more complicated commands, and it would fail. I found myself giving it multi-stage commands such as “Ok Google, turn on Bluetooth and play my playlist Chill on Spotify.” That doesn’t work but it would be amazing if it did.

Recommendations

I recommend that you force yourself to use voice commands for some period of time. Pretend your home button is broken and you have to use voice control to move around your phone. You’ll become more aware of your phone usage and you’ll learn some useful voice commands that will make your technology usage nicer.


  1. My non-smartphone experiment four years ago is what resulted in me no longer using Facebook or Twitter on my phone. It is also the reason I silenced most notifications, including email, on my phone.

Speeding up this site by optionally loading Disqus comments

Earlier this month I took another look at what was required for reading an article on this site. What else could I do to make this site load faster?

To do this, I loaded up WebPageTest and pointed it towards one of my posts. To my shock, it took 113 requests for a total of 721 KB to load a single post. This took WebPageTest 6.491 seconds. The document complete event triggered after 15 requests (103 KB, 1.6 seconds).

113 requests to load a static article was ridiculous. Most of those requests happened as a result of loading the Disqus JavaScript. I find comments valuable and want to continue including them on my site, so I couldn’t remove Disqus. Instead, I made loading it optional.

After making the required changes, it only takes 11 requests for 61 KB of data to fully load the test post. The document complete event only required 8 requests for 51 KB of data. Optionally loading the Disqus JavaScript resulted in a massive reduction of data transferred.

How did I do it? The template that generates my articles now only inserts the Disqus javascript when a reader clicks a button. My final template is at the bottom of this post.

The template adds an insertDisqus function that inserts a <script> element when a reader clicks a button. This element contains the original JavaScript that loads Disqus. When the <script> element is inserted into the page, the Disqus JavaScript is loaded and the comments appear.

My exact template might not work for you, but I’d encourage you to think about optionally loading Disqus and other non-required JavaScript. Your readers will thank you.

{% if site.disqus_short_name and page.comments == true %}
  <noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
  <div id="disqus_target">
    <script>
     var insertDisqus = function() {
       var elem = document.createElement('script');
       elem.innerHTML =  "var disqus_shortname = '{{ site.disqus_short_name }}'; var disqus_identifier = '{{ site.url }}{{ page.url }}'; var disqus_url = '{{ site.url }}{{ page.url }}'; (function () {var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true; dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js'; (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);}());"
       var target = document.getElementById('disqus_target');
       target.parentNode.replaceChild(elem, target);
     }
    </script>
    <button class="comment-button" onclick="insertDisqus()"><span>ENABLE COMMENTS AND RECOMMENDED ARTICLES</span></button>
  </div>
{% endif %}

Adding a JSON Feed to an Octopress/Jekyll generated site

I went to a coffee shop this last weekend with the intention of writing up a quick article on comm. I sat down, sipping my coffee, and wasn’t motivated. I didn’t feel like knocking out a short post, and I didn’t feel like editing a draft I’ve been sitting on for a while. I wanted to do some work though, so I decided to add a JSON Feed to this site.

JSON Feed is an alternative to Atom and RSS that uses JSON instead of XML. I figured I could add support for it in less than the time it would take to enjoy my coffee and maybe some readers would find it useful. I’d be shocked if anyone actually finds this useful, but it was a fun little exercise anyway.

An old version of Octopress (2.something), which uses an old version of Jekyll (2.5.3), generates this site. Despite this, I don’t think the template would need to change much if I moved to a new version. The template below is saved as source/feed.json in my git repository.

---
layout: null
---
{
  "version": "https://jsonfeed.org/version/1",
  "title": {{ site.title | jsonify }},
  "home_page_url": "{{ site.url }}",
  "feed_url": "{{site.url}}/feed.json",
  "favicon": "{{ site.url }}/favicon.png",
  "author" : {
      "url" : "https://twitter.com/jakemcc",
      "name" : "{{ site.author | strip_html }}"
  },
  "user_comment": "This feed allows you to read the posts from this site in any feed reader that supports the JSON Feed format. To add this feed to your reader, copy the following URL - {{ site.url }}/feed.json - and add it your reader.",
  "items": [{% for post in site.posts limit: 20 %}
    {
      "id": "{{ site.url }}{{ post.id }}",
      "url": "{{ site.url }}{{ post.url }}",
      "date_published": "{{ post.date | date_to_xmlschema }}",
      "title": {% if site.titlecase %}{{ post.title | titlecase | jsonify }}{% else %}{{ post.title | jsonify }}{% endif %},
      {% if post.description %}"summary": {{ post.description | jsonify }},{% endif %}
      "content_html": {{ post.content | expand_urls: site.url | jsonify }},
      "author" : {
        "name" : "{{ site.author | strip_html }}"
      }
    }{% if forloop.last == false %},{% endif %}
    {% endfor %}
  ]
}

I approached this problem by reading the JSON Feed Version 1 spec and cribbing values from the template for my Atom feed. The trickiest part was filling in the "content_html" value. It took me a while to figure out that jsonify needed to be at the end of {{ post.content | expand_urls: site.url | jsonify }}. That translates the post’s HTML content into its JSON representation. You’ll notice that any template expression with jsonify at the end also isn’t wrapped in quotes. This is because jsonify does the quoting for me.

The {% if forloop.last == false %},{% endif %} is also important. Without this, the generated JSON has an extra , after the final element in items. This isn’t valid JSON.

I caught that by using the command line tool json. If you ever edit JSON by hand or generate it from a template then you should add this tool to your toolbox. It will prevent you from creating invalid JSON.

How did I use it? I’d make a change in the feed.json template and generate an output file. Then I’d pipe that file to json --validate. When there was an error, I’d see a message like the one below.

0 [last: 5s] 12:43:47 ~/src/jakemcc/blog (master *)
$ cat public/feed.json | json --validate
json: error: input is not JSON: Expected ',' instead of '{' at line 25, column 5:
            {
        ....^
1 [last: 0s] 12:43:49 ~/src/jakemcc/blog (master *)
$

And there would be zero output on success.

0 [last: 5s] 12:45:25 ~/src/jakemcc/blog (master)
$ cat public/feed.json | json --validate
0 [last: 0s] 12:45:30 ~/src/jakemcc/blog (master)
$

It was pretty straightforward to add a JSON Feed. Was it a good use of my time? ¯\_(ツ)_/¯. In the process of adding the feed, I learned more about Liquid templating and figured out how to embed Liquid tags into a blog post. Even adding redundant features can be a useful exercise.

Using comm to verify file content matches

I recently found myself in a situation where I needed to confirm that a process took in a tab-separated file, did some processing, and then output a new file containing the original columns with some additional ones. The feature I was adding allowed the process to die, restart, and pick up where it left off while processing the input file.

I needed to confirm the output had data for every line in the input. I reached for the command-line tool comm.

Below is a made-up input file.

UNIQUE_ID    USER
1 38101838
2 19183819
3 19123811
4 10348018
5 19881911
6 29182918

And here is some made-up output.

UNIQUE_ID    USER    MESSAGE
1 38101838    A01
2 19183819    A05
3 19123811    A02
4 10348018    A01
5 19881911    A02
6 29182918    A05

With files this size, it would be easy enough to check visually. In my testing, though, I was dealing with files that had thousands of lines. That is too many to check by hand, but it is a perfect amount for comm.

comm reads two files as input and then outputs three columns. The first column contains lines found only in the first file, the second column contains lines only found in the second, and the last column contains lines in both. If it is easier for you to think about it as set operations, the first two columns are similar to performing two set differences and the third is similar to set intersection. Below is an example adapted from Wikipedia showing its behavior.

$ cat foo.txt
apple
banana
eggplant
$ cat bar.txt
apple
banana
banana
zucchini
$ comm foo.txt bar.txt
                  apple
                  banana
          banana
eggplant
          zucchini

So how is this useful? Well, you can also tell comm to suppress the output of specific columns. If we send the common columns from the input and output files to comm and suppress comm’s third column, then anything printed to the screen is a problem: it was found in one of the files but not the other. We’ll select the common columns using cut and, since comm expects sorted input, sort using sort. Let’s see what happens.

$ comm -3 <(cut -f 1,2 input.txt | sort) <(cut -f 1,2 output.txt | sort)
$

Success! Nothing was printed to the console, so there is nothing unique in either file.

comm is a useful tool to have in your command line toolbox.

Send a push notification when your external IP address changes

I need to know when my external IP address changes. Whenever it changes, I need to update an IP whitelist and re-login to a few sites. I sometimes don’t notice for a couple of days and, during that time, some automatic processes fail.

After the last time this happened, I whipped up a script that sends me a push notification when my IP address changes.

The script uses Pushover to send the push notification. Pushover is great. I have used it for years to get notifications from my headless computers. If you use the below script, replace ${PUSHOVER_TOKEN} and ${PUSHOVER_USER} with your own details.

#!/bin/bash

set -e

previous_file="${HOME}/.previous-external-ip"

if [ ! -e "${previous_file}" ]; then
    dig +short myip.opendns.com @resolver1.opendns.com > "${previous_file}"
fi

current_ip=$(dig +short myip.opendns.com @resolver1.opendns.com)

previous_ip=$(cat "${previous_file}")

if [ "${current_ip}" != "${previous_ip}" ]; then
    echo "external ip changed"
    curl -s --form-string "token=${PUSHOVER_TOKEN}" \
         --form-string "user=${PUSHOVER_USER}" \
         --form-string "title=External IP address changed" \
         --form-string "message='${previous_ip}' => '${current_ip}'" \
         https://api.pushover.net/1/messages.json
fi

echo "${current_ip}" > "${previous_file}"

What are the most used Clojure libraries?

In a previous post, we used Google’s BigQuery and the public GitHub dataset to discover the most used Clojure testing library. The answer wasn’t surprising. The built-in clojure.test was by far the most used.

Let’s use the dataset to answer a less obvious question: what are the most used libraries in Clojure projects? We’ll measure this by counting references to libraries in project.clj and build.boot files.

Before we can answer that question, we’ll need to transform the data. First, we create the Clojure subset of the GitHub dataset. I did this by executing the following queries and saving the results to tables1.

-- Save the results of this query to the clojure.files table
SELECT
  *
FROM
  [bigquery-public-data:github_repos.files]
WHERE
  RIGHT(path, 4) = '.clj'
  OR RIGHT(path, 5) = '.cljc'
  OR RIGHT(path, 5) = '.cljs'
  OR RIGHT(path, 10) = 'build.boot'

-- Save the results to clojure.contents
SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM clojure.files)

Next, we extract the dependencies from build.boot and project.clj files. Fortunately for us, both of these files specify dependencies in the same format, so we’re able to use the same regular expression on both types.

The query below identifies project.clj and build.boot files, splits each file into lines, and extracts referenced library names and versions using a regular expression. Additional filtering is done to get rid of some spurious results.
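
For reference, the dependency shape those regular expressions match looks like this (a representative project.clj fragment, using libraries that appear in the results below):

;; Dependencies are vectors of [library "version"]; the first regex
;; captures the symbol, the second the quoted version string.
:dependencies [[org.clojure/clojure "1.8.0"]
               [hiccup "1.0.5"]]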

SELECT
  REGEXP_EXTRACT(line, r'\[+(\S+)\s+"\S+"]') AS library,
  REGEXP_EXTRACT(line, r'\[+\S+\s+"(\S+)"]') AS version,
  COUNT(*) AS count
FROM (
  SELECT
    SPLIT(content, '\n') AS line
  FROM
    [clojure.contents]
  WHERE
    id IN (
    SELECT
      id
    FROM
      [clojure.files]
    WHERE
      path LIKE '%project.clj'
      OR path LIKE '%build.boot')
  HAVING line CONTAINS '[')
GROUP BY
  library, version
HAVING library is not null and not library contains '"'
ORDER BY
  count DESC

The first five rows from the result are below. Let’s save the entire result to a clojure.libraries table.

| library             | version | count |
|---------------------+---------+-------|
| org.clojure/clojure | 1.6.0   | 7015  |
| org.clojure/clojure | 1.5.1   | 4251  |
| org.clojure/clojure | 1.7.0   | 4093  |
| org.clojure/clojure | 1.8.0   | 3016  |
| hiccup              | 1.0.5   | 1280  |

Now we can start answering all sorts of interesting questions.

What is the most referenced library put out under the org.clojure group?

SELECT library, sum(count) count
FROM clojure.libraries
WHERE library CONTAINS 'org.clojure'
GROUP BY library
ORDER BY count desc

| Row | library                        | count |
|-----+--------------------------------+-------|
|   1 | org.clojure/clojure            | 20834 |
|   2 | org.clojure/clojurescript      |  3080 |
|   3 | org.clojure/core.async         |  2612 |
|   4 | org.clojure/tools.logging      |  1579 |
|   5 | org.clojure/data.json          |  1546 |
|   6 | org.clojure/tools.nrepl        |  1244 |
|   7 | org.clojure/java.jdbc          |  1064 |
|   8 | org.clojure/tools.cli          |  1053 |
|   9 | org.clojure/tools.namespace    |   982 |
|  10 | org.clojure/test.check         |   603 |
|  11 | org.clojure/core.match         |   578 |
|  12 | org.clojure/math.numeric-tower |   503 |
|  13 | org.clojure/data.csv           |   381 |
|  14 | org.clojure/math.combinatorics |   372 |
|  15 | org.clojure/tools.reader       |   368 |
|  16 | org.clojure/clojure-contrib    |   335 |
|  17 | org.clojure/data.xml           |   289 |
|  18 | org.clojure/tools.trace        |   236 |
|  19 | org.clojure/java.classpath     |   199 |
|  20 | org.clojure/core.cache         |   179 |

Clojure and ClojureScript are at the top, which isn’t surprising. I’m surprised to see tools.nrepl in the next five results (rows 3-7); it is the only library in that group that I haven’t used.

What testing library is used the most? We already answered this in my last article but let’s see if we get the same answer when we’re counting how many times a library is pulled into a project.

SELECT library, sum(count) count
FROM [clojure.libraries]
WHERE library in ('midje', 'expectations', 'speclj', 'smidjen', 'fudje')
GROUP BY library
ORDER BY count desc

| Row | library                | count |
|-----+------------------------+-------|
|   1 | midje                  |  1122 |
|   2 | speclj                 |   336 |
|   3 | expectations           |   235 |
|   4 | smidjen                |     1 |

Those results are close to the previous results. Of the non-clojure.test libraries, midje still ends up on top.

What groups (as identified by the Maven groupId) have their libraries referenced the most? The top 12 are below, but the full result is available.

SELECT REGEXP_EXTRACT(library, r'(\S+)/\S+') AS group, sum(count) AS count
FROM [clojure.libraries]
GROUP BY group
HAVING group IS NOT null
ORDER BY count DESC

| Row | group                 | count |
|-----+-----------------------+-------|
|   1 | org.clojure           | 39611 |
|   2 | ring                  |  5817 |
|   3 | com.cemerick          |  2053 |
|   4 | com.taoensso          |  1605 |
|   5 | prismatic             |  1398 |
|   6 | org.slf4j             |  1209 |
|   7 | cljsjs                |   868 |
|   8 | javax.servlet         |   786 |
|   9 | com.stuartsierra      |   642 |
|  10 | com.badlogicgames.gdx |   586 |
|  11 | cider                 |   560 |
|  12 | pjstadig              |   536 |

And finally, the question that inspired this article, what is the most used library?

SELECT library, sum(count) count
FROM [clojure.libraries]
WHERE library != 'org.clojure/clojure'
GROUP BY library
ORDER BY count desc

| Row | library                     | count |
|-----+-----------------------------+-------|
|   1 | compojure                   |  3609 |
|   2 | lein-cljsbuild              |  3413 |
|   3 | org.clojure/clojurescript   |  3080 |
|   4 | org.clojure/core.async      |  2612 |
|   5 | lein-ring                   |  1809 |
|   6 | cheshire                    |  1802 |
|   7 | environ                     |  1763 |
|   8 | ring                        |  1678 |
|   9 | clj-http                    |  1648 |
|  10 | clj-time                    |  1613 |
|  11 | hiccup                      |  1591 |
|  12 | lein-figwheel               |  1582 |
|  13 | org.clojure/tools.logging   |  1579 |
|  14 | org.clojure/data.json       |  1546 |
|  15 | http-kit                    |  1423 |
|  16 | lein-environ                |  1325 |
|  17 | ring/ring-defaults          |  1302 |
|  18 | org.clojure/tools.nrepl     |  1244 |
|  19 | midje                       |  1122 |
|  20 | com.cemerick/piggieback     |  1096 |
|  21 | org.clojure/java.jdbc       |  1064 |
|  22 | org.clojure/tools.cli       |  1053 |
|  23 | enlive                      |  1001 |
|  24 | ring/ring-core              |   995 |
|  25 | org.clojure/tools.namespace |   982 |

Compojure takes the top slot. Full results are available.

Before doing this research I tried to predict what libraries I’d see in the top 10. I thought that clj-time and clj-http would be up there. I’m happy to see my guess was correct.

It was pretty pleasant using BigQuery to do this analysis. Queries took at most a few seconds to execute. This quick feedback let me play around in the web interface without feeling like I was waiting for computers to do work. It made the research into Clojure library usage painless and fun.


  1. I did this in early March 2017.

Which Clojure testing library is most used?

I’ve always assumed that the built-in clojure.test is the most widely used testing library in the Clojure community. Earlier this month I decided to test this assumption using Google’s BigQuery GitHub dataset.

The BigQuery GitHub dataset contains over three terabytes of source code from more than 2.8 million open source GitHub repositories. BigQuery lets us quickly query this data using SQL.

Below is a table with the results (gathered in early March 2017) of my investigation. Surprising no one, clojure.test comes out as the winner, and it wins by a lot.

| Library      | # Repos Using |
|--------------+---------------|
| clojure.test |         14304 |
| midje        |          1348 |
| expectations |           429 |
| speclj       |           207 |
| smidjen      |             1 |
| fudje        |             1 |

23,243 repositories were identified as containing Clojure (or ClojureScript) code, and the libraries above account for 16,290 of them. This means there were about 6,953 repositories that didn’t use any testing library1. This puts the “no tests, or some obscure other way of testing” group in a pretty solid second place.

You should take these numbers as ballpark figures and not exact answers. I know from using GitHub’s search interface that there are three public projects using fudje2.

So, why don’t all three of those projects show up? The dataset only includes projects that Google could identify as open source, which is determined using the GitHub licenses API3. Two of those three projects probably could not be identified as having an appropriate license.

Another small problem is that since expectations is an actual English word, it shows up outside of ns declarations. I used a fairly simple query to generate this data, and it only knows that expectations shows up somewhere in a file. I experimented with some more restrictive queries, but they didn’t drastically change the result, and I wasn’t sure they weren’t wrong in other ways. If you subtract a number between 100 and 150, you’ll probably have a more accurate expectations usage count.

Keep reading if you want to hear more about the steps to come up with the above numbers.

If you have other Clojure questions you think could be answered by querying this dataset, let me know in the comments or on Twitter. I have some more ideas, so I wouldn’t be surprised if at least one more article gets written.

The Details

The process was pretty straightforward. Most of my time was spent exploring the tables, figuring out what the columns represented, figuring out what queries worked well, and manually confirming some of the results. BigQuery is very fast. Very little of my time was spent waiting for results.

1. Set up the data

You get 1 TB of free BigQuery usage a month. You can blow through this in a single query. Google provides sample tables that contain less data but I wanted to operate on the full set of Clojure(Script) files, so my first step was to execute some queries to create tables that only contained Clojure data.

First, I queried the github_repos.files table for all the Clojure(Script) files and saved that to a clojure.files table.

SELECT
  *
FROM
  [bigquery-public-data:github_repos.files]
WHERE
  (RIGHT(path, 4) = '.clj'
    OR RIGHT(path, 5) = '.cljc'
    OR RIGHT(path, 5) = '.cljs')

The above query took only 9.2 seconds to run and processed 328 GB of data.

Using the clojure.files table, we can select the source for all the Clojure code from the github_repos.contents table. I saved this to a clojure.contents table.

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM clojure.files)

This query processed 1.84 TB of data in 21.5 seconds. So fast. In just under 30 seconds, I’ve blown through the free limit.

2. Identify what testing library (or libraries) a repo uses

We can guess that a file uses a testing library if it contains a certain string. The strings we’ll search for are the namespaces we’d expect to see required or used in a ns declaration. The below query does this for each file and then rolls up the results by repository. It took 3 seconds to run and processed 611 MB of data.

SELECT
  files.repo_name,
  MAX(uses_clojure_test) uses_clojure_test,
  MAX(uses_expectations) uses_expectations,
  MAX(uses_midje) uses_midje,
  MAX(uses_speclj) uses_speclj,
  MAX(uses_fudje) uses_fudje,
  MAX(uses_smidjen) uses_smidjen,
FROM (
  SELECT
    id,
    contents.content LIKE '%clojure.test%' uses_clojure_test,
    contents.content LIKE '%expectations%' uses_expectations,
    contents.content LIKE '%midje%' uses_midje,
    contents.content LIKE '%speclj%' uses_speclj,
    contents.content LIKE '%fudje%' uses_fudje,
    contents.content LIKE '%smidjen%' uses_smidjen,
  FROM
    clojure.contents AS contents) x
JOIN
  clojure.files files ON files.id = x.id
GROUP BY
  files.repo_name

Below is a screenshot of the first few rows in the result.

[Screenshot: BigQuery results for test library usage by repo]

3. Export the data

At this point, we could have continued the analysis using SQL and the BigQuery UI, but I opted to explore the data using Clojure and the REPL. There were too many rows to directly download the query results as a CSV file, so I ended up having to save the results as a table, export that to Google’s cloud storage, and download it from there.

The first few rows of the file look like this:

files_repo_name,uses_clojure_test,uses_expectations,uses_midje,uses_speclj,uses_fudje,uses_smidjen
wangchunyang/clojure-liberator-examples,true,false,false,false,false,false
yantonov/rex,false,false,false,false,false,false

4. Calculate some numbers

The code takes the csv file and tallies how many repositories use each testing library. You could do this in Excel or using any language of your choice; the code isn’t that interesting, but a minimal sketch follows.
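
Here is a minimal sketch of that tally, assuming the clojure.data.csv library and a hypothetical results.csv export:

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; For each uses_* column, count the repositories flagged "true".
(with-open [r (io/reader "results.csv")]
  (let [[header & rows] (csv/read-csv r)]
    (->> rows
         (mapcat #(map vector header %)) ; pair column names with values
         (filter #(= "true" (second %)))
         (map first)
         frequencies)))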

BigQuery thoughts

This was my first time using Google’s BigQuery. This wasn’t the most difficult analysis to do, but I was impressed by the speed and ease of use. The web UI, which is all I used for this, is neither really great nor extremely terrible. It mostly just worked, and I rarely had to look up documentation.

I don’t feel comfortable making a judgment call on whether the cost is expensive or not, but this article cost a bit less than seven dollars to write. That doesn’t seem too outrageous to me.

Based on my limited usage of BigQuery, it is something I’d look into further if I needed its capabilities.


  1. Probably higher, as projects can and do use more than one testing library.
  2. And those projects are jumarko/clojure-random, dpassen1/great-sort, and jimpil/fudje.
  3. Source is a Google Developer Advocate’s response on an old HN post.

Using lein-test-refresh with expectations

The 2.2.0 release1 of expectations adds a clojure.test compatible syntax. The release adds the defexpect macro, which forces you to name your test but then generates code that is compatible with clojure.test.
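
As a minimal sketch of the difference, a bare expectation picks up a name under the new syntax:

;; Original expectations syntax: an anonymous expectation
(expect 2 (+ 1 1))

;; clojure.test compatible syntax: the same assertion, now named
(defexpect two
  2 (+ 1 1))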

Why would you want this? Because clojure.test is the built-in testing library for Clojure, an entire ecosystem has been built around it. Tool support for clojure.test is always going to be ahead of support for the original expectations. By using the new clojure.test compatible syntax, expectations can take advantage of all the tools built for clojure.test.

Using lein-test-refresh with expectations

If you move to the new clojure.test compatible syntax, you can start using lein-test-refresh to automatically rerun your tests when your code changes. lein-test-refresh is a fork of the original expectations autorunner, lein-autoexpect, but it has grown to have more features than its original inspiration. Now you can use it with expectations2.

Below is a sample project.clj that uses lein-test-refresh with the latest expectations.

(defproject expectations-project "0.1.0-SNAPSHOT"
  :description "Sample project using expectations"
  :dependencies [[org.clojure/clojure "1.8.0"]]
  :plugins [[com.jakemccrary/lein-test-refresh  "0.18.1"]]
  :profiles {:dev {:dependencies [[expectations "2.2.0-beta1"]]}})

Here is an example test file.

(ns expectations-project.core-test
  (:require [expectations :refer :all]
            [expectations.clojure.test :refer [defexpect]]))

(defexpect two
  2 (+ 1 1))

(defexpect three
  3 (+ 1 1))

(defexpect group
  (expect [1 2] (conj [] 1 5))
  (expect #{1 2} (conj #{} 1 2))
  (expect {1 2} (assoc {} 1 3)))

And here is the result of running lein test-refresh.

$ lein test-refresh
*********************************************
*************** Running tests ***************
:reloading (expectations-project.core-test)

FAIL in (group) (expectations_project/core_test.clj:11)
expected: [1 2]
  actual: [1 5] from (conj [] 1 5)

FAIL in (group) (expectations_project/core_test.clj:11)
expected: {1 2}
  actual: {1 3} from (assoc {} 1 3)

FAIL in (three) (expectations_project/core_test.clj:8)
expected: 3
  actual: 2 from (+ 1 1)

Ran 3 tests containing 5 assertions.
3 failures, 0 errors.

Failed 3 of 5 assertions
Finished at 11:53:06.281 (run time: 0.270s)

After some quick edits to fix the test errors and saving the file, here is the output from the tests re-running.

*********************************************
*************** Running tests ***************
:reloading (expectations-project.core-test)

Ran 3 tests containing 5 assertions.
0 failures, 0 errors.
:reloading ()

Ran 3 tests containing 5 assertions.
0 failures, 0 errors.

Passed all tests
Finished at 11:53:59.045 (run time: 0.013s)

If you’re using expectations and switch to the new clojure.test compatible syntax, I’d encourage you to start using lein-test-refresh.


  1. As of 2016-02-27, 2.2.0 isn’t out yet, but 2.2.0-beta1 has been released and contains the changes.
  2. In fact, you have to use it if you use Leiningen and the new syntax and want your tests to run automatically.