Jake McCrary

Using comm to verify file content matches

I recently found myself in a situation where I needed to confirm that a process took in a tab separated file, did some processing, and then output a new file containing the original columns with some additional ones. The feature I was adding allowed the process to die and restart while processing the input file and pick up where it left off.

I needed to confirm the output had data for every line in the input. I reached to the command line tool comm.

Below is a made up input file.

1
2
3
4
5
6
7
UNIQUE_ID    USER
1 38101838
2 19183819
3 19123811
4 10348018
5 19881911
6 29182918

And here is some made up output.

1
2
3
4
5
6
7
UNIQUE_ID    USER    MESSAGE
1 38101838    A01
2 19183819    A05
3 19123811    A02
4 10348018    A01
5 19881911    A02
6 29182918    A05

With files this size, it would be easy enough to check visually. In my testing, I was dealing with files that had thousands of lines. This is too many to check by hand. It is a perfect amount for comm.

comm reads two files as input and then outputs three columns. The first column contains lines found only in the first file, the second column contains lines only found in the second, and the last column contains lines in both. If it is easier for you to think about it as set operations, the first two columns are similar to performing two set differences and the third is similar to set intersection. Below is an example adapted from Wikipedia showing its behavior.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ cat foo.txt
apple
banana
eggplant
$ cat bar.txt
apple
banana
banana
zucchini
$ comm foo.txt bar.txt
                  apple
                  banana
          banana
eggplant
          zucchini

So how is this useful? Well, you can also tell comm to suppress outputting specific columns. If we send the common columns from the input and output file to comm and suppress comm’s third column then anything printed to the screen is a problem. Anything printed to the screen was found in one of the files and not the other. We’ll select the common columns using cut and, since comm expects input to be sorted, then sort using sort. Let’s see what happens.

1
2
$ comm -3 <(cut -f 1,2 input.txt | sort) <(cut -f 1,2 output.txt | sort)
$

Success! Nothing was printed to the console, so there is nothing unique in either file.

comm is a useful tool to have in your command line toolbox.

Looking forward to the next article? Never miss a post by subscribing using e-mail or RSS. The e-mail newsletter goes out periodically (at most once a month) and includes reviews of books I've been reading and links to stuff I've found interesting.

Comments