I’ve always assumed that the built-in
clojure.test is the most
widely used testing library in the Clojure community. Earlier this
month I decided to test this assumption using Google’s BigQuery GitHub dataset.
The BigQuery GitHub dataset contains over three terabytes of source code from more than 2.8 million open source GitHub repositories. BigQuery lets us quickly query this data using SQL.
Below is a table with the results of my investigation (done in early March 2017). Surprising no one, clojure.test comes out as the winner, and it wins by a lot.
[Table omitted: repository counts per testing library, with clojure.test far ahead.]
23,243 repositories were identified as containing Clojure (or ClojureScript) code. This means there were about 6,953 repositories that didn’t use any testing library¹. That puts “no tests, or some other obscure way of testing” in a pretty solid second place.
So why don’t all three of those projects show up? The dataset only includes projects that Google could identify as open source, which is determined using the GitHub licenses API³. Two of the three projects probably couldn’t be identified as having an appropriate license.
Another small problem is that since expectations is an actual word, it shows up outside of ns declarations. I ended up using a fairly simple query to generate this data, and it only knows that expectations shows up somewhere in a file. I experimented with some more restrictive queries, but they didn’t drastically change the results, and I wasn’t sure they weren’t wrong in other ways. If you subtract somewhere between 100 and 150, you’ll probably have a more accurate expectations usage count.
Keep reading if you want to hear more about the steps I took to come up with the numbers above.
If you have other Clojure questions you think could be answered by querying this dataset, let me know in the comments or on Twitter. I have some more ideas, so I wouldn’t be surprised if at least one more article gets written.
The process was pretty straightforward. Most of my time was spent exploring the tables, figuring out what the columns represented, figuring out what queries worked well, and manually confirming some of the results. BigQuery is very fast. Very little of my time was spent waiting for results.
1. Set up the data
You get 1 TB of free BigQuery usage a month. You can blow through this in a single query. Google provides sample tables that contain less data but I wanted to operate on the full set of Clojure(Script) files, so my first step was to execute some queries to create tables that only contained Clojure data.
First, I queried the github_repos.files table for all the Clojure(Script) files and saved the results to a clojure.files table.
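The original query isn’t reproduced here, but a sketch along these lines would do the job. The bigquery-public-data.github_repos.files table and its columns are real; the exact file-extension filter is my reconstruction:

```sql
-- Find every Clojure(Script) file in the public GitHub dataset and save the
-- result (via the UI's "Save as table") to a clojure.files table.
SELECT
  repo_name,
  path,
  id  -- blob id, used later to join against github_repos.contents
FROM
  `bigquery-public-data.github_repos.files`
WHERE
  path LIKE '%.clj'
  OR path LIKE '%.cljs'
  OR path LIKE '%.cljc'
```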
The above query took only 9.2 seconds to run and processed 328 GB of data.
Now that we have a clojure.files table, we can select the source for all the Clojure code from github_repos.contents. I saved this result to another table.
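A sketch of that query, assuming the clojure.files table created in the previous step (the destination table name clojure.contents is my guess, not necessarily what the original used):

```sql
-- Pull the source of every file listed in clojure.files by joining on the
-- blob id, and save the result to another table (call it clojure.contents).
SELECT
  f.repo_name,
  f.path,
  c.content
FROM
  `bigquery-public-data.github_repos.contents` AS c
JOIN
  clojure.files AS f
ON
  c.id = f.id
```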
This query processed 1.84 TB of data in 21.5 seconds. So fast. In just under 30 seconds, I’ve blown through the free limit.
2. Identify what testing library (or libraries) a repo uses
We can guess that a file uses a testing library if it contains certain strings. The strings we’ll search for are the namespaces we’d expect to see required or used in an ns declaration. The query below does this for each file and then rolls up the results by repository. It took 3 seconds to run and processed 611 MB of data.
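The original query isn’t shown, but a sketch of the idea looks like this. The namespace strings, the library list, and the clojure.contents table name are my reconstruction; REGEXP_CONTAINS and LOGICAL_OR are standard BigQuery SQL:

```sql
-- For each file, flag which testing namespaces appear in its source, then
-- roll the flags up per repository with LOGICAL_OR so a repository counts
-- as using a library if any of its files mention it.
SELECT
  repo_name,
  LOGICAL_OR(REGEXP_CONTAINS(content, r'clojure\.test'))  AS uses_clojure_test,
  LOGICAL_OR(REGEXP_CONTAINS(content, r'midje\.sweet'))   AS uses_midje,
  LOGICAL_OR(REGEXP_CONTAINS(content, r'expectations'))   AS uses_expectations,
  LOGICAL_OR(REGEXP_CONTAINS(content, r'speclj\.core'))   AS uses_speclj
FROM
  clojure.contents
GROUP BY
  repo_name
```

Note that the expectations pattern is what causes the over-counting described earlier: it matches the bare word anywhere in a file, not just inside an ns declaration.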
Below is a screenshot of the first few rows in the result.
3. Export the data
At this point, we could continue the analysis using SQL and the BigQuery UI, but I opted to explore the data using Clojure and the repl. There were too many rows to download the query results directly as a csv file, so I ended up saving the results as a table, exporting that to Google Cloud Storage, and downloading it from there.
The first few rows of the file contain, for each repository, the repository name followed by a column per testing library.
4. Calculate some numbers
The code takes the csv file and does some transformations to produce the numbers above. You could do this in Excel or in any language of your choice. I’m not going to include the code here, as it isn’t that interesting.
This was my first time using Google’s BigQuery. This wasn’t the most difficult analysis to do, but I was impressed by BigQuery’s speed and ease of use. The web UI, which I used exclusively for this, is neither really great nor terrible. It mostly just worked, and I rarely had to look up documentation.
I don’t really feel comfortable making a judgment call on whether the cost is expensive or not, but this article cost a bit less than seven dollars to write. That doesn’t seem too outrageous to me.
Based on my limited usage of BigQuery, it is something I’d look into further if I needed its capabilities.