I recently found myself needing to confirm that a process took in a tab-separated file, did some processing, and then output a new file containing the original columns plus some additional ones. The feature I was adding allowed the process to die and restart while working through the input file, picking up where it left off.
I needed to confirm the output had data for every line in the input, so I reached for the command line tool comm.
Below is a made-up input file with two tab-separated columns.

1	alpha
2	bravo
3	charlie
4	delta
5	echo
6	foxtrot
7	golf
And here is some made-up output. The original columns are intact, with a new one appended.

1	alpha	done
2	bravo	done
3	charlie	done
4	delta	done
5	echo	done
6	foxtrot	done
7	golf	done
With files this size, it would be easy enough to check visually. In my testing, though, I was dealing with files that had thousands of lines. That is too many to check by hand, but a perfect amount for comm.
comm reads two files as input and outputs three columns: the first contains lines found only in the first file, the second contains lines found only in the second file, and the third contains lines found in both. If it is easier to think about in terms of set operations, the first two columns are roughly the two set differences and the third is roughly the set intersection. Below is an example adapted from Wikipedia showing its behavior.
$ cat foo
apple
banana
eggplant
$ cat bar
apple
banana
banana
zucchini
$ comm foo bar
		apple
		banana
	banana
eggplant
	zucchini
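As a quick way to try this out yourself, the following sketch builds two small pre-sorted files (the names foo and bar are arbitrary) and runs comm on them:

```shell
# Create two small, already-sorted input files.
printf 'apple\nbanana\neggplant\n' > foo
printf 'apple\nbanana\nbanana\nzucchini\n' > bar

# Column 1: lines only in foo. Column 2: lines only in bar.
# Column 3: lines in both. Columns are separated by tabs.
comm foo bar
```

Note that the duplicate banana in bar shows up once in the "both" column and once in the "only in bar" column: comm matches lines pairwise, it does not deduplicate.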
So how is this useful? Well, you can also tell comm to suppress specific columns of its output. If we send the common columns from the input and output files to comm and suppress comm's third column, then anything printed to the screen is a problem: it appeared in one file but not the other. We'll select the common columns using cut and, since comm expects its input to be sorted, sort both streams using sort. Let's see what happens.
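Here is a sketch of that pipeline. The file names input.tsv and output.tsv, and the assumption that the input holds two tab-separated columns while the output appends a third, are illustrative; adjust the cut field list to match your own data.

```shell
#!/bin/bash
# Assumes input.tsv holds the original two columns and output.tsv
# holds those columns plus one appended by the process.

# cut selects the shared columns from the output file, sort puts both
# streams in the order comm expects, and -3 suppresses the column of
# common lines, so any output at all signals a mismatch.
comm -3 <(sort input.tsv) <(cut -f1,2 output.tsv | sort)
```

Process substitution (the <( ) syntax) is a bash feature; in a strictly POSIX shell, write the two sorted streams to temporary files and pass those to comm instead.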
Success! Nothing was printed to the console, so there is nothing unique in either file.
comm is a useful tool to have in your command line toolbox.