Shell tricks for one-liner bioinformatics

There is always more than one way to accomplish a computational task, but some solutions are better than others. For bioinformatics processing, we want solutions that are easy and quick to implement and efficient enough to get the job done quickly. You can always write a simple program, but for many tasks, using Linux/Unix shell commands provides quick solutions, without the need to explicitly open/parse files, set up data structures, etc. Instead, a one-liner command in your shell takes care of the overhead so that you write a small amount of code and move on.

The examples below are meant to illustrate some of the most useful shell commands for bioinformatics processing. The syntax is from bash, but most examples can easily be modified to work in other shells. This is by no means exhaustive and I hope to expand on this in a series of pages in the next few months. Now, let’s look at some examples.

Looping with for

Datasets are often distributed with separate files for each chromosome, and often we need to perform the same operation on each one. To do this, we can use the shell’s for loop construct! If we had files called data_chr[chromosome].txt, where [chromosome] ranges from 1 to 22, and we wanted to run ./command on them, we would do:

for chr in {1..22}; do ./command data_chr$chr.txt; done

The above code loops over all the values between 1 and 22, setting the shell variable $chr to each value in succession. The syntax for specifying the ranges of numbers is a bit odd, but it does the job: {A..B} gives a range between A and B, where A and B are integer values (note that A < B or A > B are both valid; if A > B, the numbers are assigned in descending order).
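A quick way to see brace expansion in action (this is a bash feature; other shells may not support it) is with echo, since bash expands the braces before the command ever runs:

```shell
# Brace expansion happens in the shell, before the command runs:
echo {1..5}    # 1 2 3 4 5
echo {5..1}    # 5 4 3 2 1  (descending, since A > B)
echo {01..05}  # 01 02 03 04 05  (zero-padded; bash 4 and later)
```

The same expansions work anywhere a filename or argument would, which is what makes them handy in for loops.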

We can also use for to loop over file extensions. Suppose we had some PLINK format data called data.bed, data.bim, and data.fam. We could list these files individually by running:

for ext in bed bim fam; do ls data.$ext; done

Here the variable $ext (for extension) is set in succession to bed, bim, and fam. This particular example isn’t very useful, since we could have just typed ls data.* and seen that these three files exist. What is useful is being able to rename the base dataset with one line of code in the shell. If I wanted to rename these files human_data.bed, human_data.bim, and human_data.fam, I could write:

for ext in bed bim fam; do mv -i data.$ext human_data.$ext; done

For those who know bash well, there are some more elaborate ways of doing the above renaming, but for is easy to remember and does the job.
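One such alternative combines the loop with bash brace expansion, which expands data.{bed,bim,fam} into the three filenames in place, so the extensions don't need their own variable:

```shell
# Same rename as above: data.{bed,bim,fam} expands to the three
# filenames, and "human_$f" prepends the new prefix to each one.
for f in data.{bed,bim,fam}; do mv -i "$f" "human_$f"; done
```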

Another useful for construct is to loop over groups of files. We can do the following to execute ./command on every file with a .txt extension in our current directory:

for file in *.txt; do ./command $file; done
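A common variation is to write each file's output to a matching output file. The parameter expansion ${file%.txt} strips the .txt suffix from the variable's value, so we can build per-file output names; here wc -l stands in for whatever per-file command you actually need:

```shell
# Process every .txt file, writing each result to a matching .out file.
# ${file%.txt} strips the .txt suffix, so a.txt produces a.out.
# (wc -l is a stand-in for your real per-file command.)
for file in *.txt; do wc -l "$file" > "${file%.txt}.out"; done
```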

Extracting fields with awk

Many genomic file formats are text-based, with information about specific variables listed in columns in the file. Sometimes we need to extract only a specific variable (i.e., column or field) and feed this into some other command. We can do this with either awk or cut. We’ll start with awk. Suppose we wanted to know the physical positions for all the variants in the PLINK file data.bim. This is column number 4 in that file, with the first column being number 1 (i.e., awk and cut do not use zero-based column indexes). To extract column number 4 from data.bim in awk, do:

awk '{print $4}' data.bim

This will print the positions to the terminal, which is usually not what we want, so we can print them to a file called data_positions.txt:

awk '{print $4}' data.bim > data_positions.txt

Or we can pipe the output to a program, in this case ./command:

awk '{print $4}' data.bim | ./command

This will send the physical positions to stdin for ./command.
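As a concrete stand-in for ./command, standard tools work on the receiving end of the pipe too. For example, this sketch finds the smallest position in data.bim by sorting the extracted positions numerically and keeping the first line:

```shell
# Pipe the positions into standard tools: sort numerically, then
# keep only the first line, i.e. the minimum position in the file.
awk '{print $4}' data.bim | sort -n | head -n 1
```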

The syntax here may seem a bit cumbersome, but awk is very powerful: in fact, awk is a programming language. Everything inside ‘{ … }’ is the program, and it is run on every line of the input file. By default, awk splits each line up into fields that are separated by white space (arbitrary numbers of spaces and tabs), and assigns them to special variables named $1, $2, $3, $4, … $N, where N is the last field on the current line; it also stores the complete input line in $0. There are several built-in commands, including print. You can use print to produce lines with arbitrary text that includes the field values from the input file:

awk '{ print "SNP",$2,"is on chromosome",$1,"at position",$4 }' data.bim

If the first entry in the .bim file was for SNP rs16823551, with the first column (chromosome) being 1, and the fourth column being 3141616, this would print: “SNP rs16823551 is on chromosome 1 at position 3141616”, and similarly for every other line in the .bim file. Commas in the print command insert spaces by default, and are optional, so:

awk '{ print "chromosome" $1 }' data.bim

Would print “chromosome1” for the .bim file line described above, with no space between the string and the $1 field value.

We can also use awk to extract only lines with certain field values. For example, if we had a file called data.vcf and wanted to pull out lines with a FILTER value of “PASS”, knowing that the FILTER field is column 7 in a VCF file, we would do:

awk '$7 == "PASS" { print }' data.vcf

Here, we’ve put a boolean expression outside the curly braces. awk evaluates boolean expressions for each line and only executes the code in the curly braces on lines for which the boolean expression is true. Note that a print command with no arguments simply prints the original line.

The boolean expressions that precede the code to run on each line can be arbitrarily complex. For example, if we wanted to pull out only lines from data.vcf with a FILTER value of “PASS” that are on chromosome 5, we would do:

awk '$1 == 5 && $7 == "PASS" { print }' data.vcf

Here, only lines with a chromosome (first field in the VCF file) value of 5 and a FILTER value of “PASS” will be printed. Note that with no code, awk will simply print the lines for which the initial boolean expression is true, so an equivalent and shorter version of the previous command is:

awk '$1 == 5 && $7 == "PASS"' data.vcf
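If we only want to know how many lines match rather than see them, piping the filtered output to wc -l counts them:

```shell
# Count, rather than print, the matching lines by piping to wc -l:
awk '$1 == 5 && $7 == "PASS"' data.vcf | wc -l
```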

awk can be used with any delimiter, and the -F option specifies the delimiter. So, if we were dealing with a .csv (Comma Separated Values) file containing SNP ids in column 3 and we wanted to extract those ids, we would do:

awk -F, '{ print $3 }' data.csv

The -F, option tells awk to split fields on the ‘,’ character.
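awk also has an output field separator, the built-in variable OFS, which defaults to a single space; setting it lets us write comma-separated output as well as read it. A sketch on the same hypothetical data.csv:

```shell
# Read comma-separated input (-F,) and also write comma-separated
# output by setting the output field separator OFS:
awk -F, -v OFS=, '{ print $3, $1 }' data.csv
```

The commas in the print statement insert OFS between fields, which is why changing OFS changes what print emits.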

Let’s finish the awk examples by extracting multiple fields from a file. Returning again to data.vcf, the first five columns of the file tell us about the variants in the dataset, with remaining fields containing information about data quality and the data itself. If we wanted only the first five columns and nothing else, we could do the following:

awk '{ print $1,$2,$3,$4,$5 }' data.vcf

That’s a lot of $ signs and numbers! This is a bit cumbersome, and the good news is there’s a simpler way to do this.

Extracting fields with cut

The program cut can be used to extract ranges of fields from a file. So continuing from the above example, we could instead do:

cut -f1-5 data.vcf

That’s a lot shorter. The -f option to cut specifies the range of fields to extract and is fairly flexible. For example, if we wanted to pull out fields 1 through 5 and 10 through 20 of data.txt, we would do:

cut -f1-5,10-20 data.txt

Or if we wanted all the fields starting from column 6 to the end of each line, we would do:

cut -f6- data.txt

Note that cut assumes by default that fields are separated by a tab character. If the fields in data.txt were instead separated by a single space, we would do the following to get column 6 and higher:

cut -d' ' -f6- data.txt

This seems better in every respect than awk for simple column extraction, but there is a limitation: cut cannot output columns in a different order than they appear in the original file. The command cut -f5,2 data.txt still prints field 2 before field 5, because cut always emits fields in file order regardless of the order given to -f. Printing fields in a different order is possible in awk.
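When the output order does matter, awk handles it directly, since print emits fields in exactly the order you list them:

```shell
# Print field 5 before field 2 -- awk emits fields in the order
# listed, which cut cannot do:
awk '{ print $5, $2 }' data.txt
```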

More advanced awk

As mentioned above, awk is very powerful, and we won’t go through most of its features, but let’s go through three more examples.

Suppose we have a file called data.txt and we want to break out the information corresponding to each chromosome into separate files. We might want to use a for loop to go through each chromosome number and then a boolean expression in awk to pull out only the lines corresponding to that chromosome. Let’s begin our task with a program that doesn’t quite work and then fix it. If column 2 of data.txt contains the chromosome numbers, and we want separate files for each chromosome, we might think that the following would work (recall that, by default, awk prints the lines that match the given boolean expression):

for chr in {1..22}; do awk '$2 == $chr' data.txt > data_chr$chr.txt; done

This is almost correct, but the expression given to awk is '$2 == $chr', and awk doesn’t know anything about the shell variable $chr. Instead of referencing a variable in the shell, we can explicitly assign variables for awk to use with the -v option:

for chr in {1..22}; do awk -v chr=$chr '$2 == chr' data.txt > data_chr$chr.txt; done

Not bad! We can provide awk with as many -v options as we want for all the variables we might care to assign:

for chr in {1..22}; do awk -v chr=$chr -v threshold=10 '$2 == chr && $4 > threshold' data.txt > data_chr$chr.txt; done

This breaks out each chromosome into a separate file and requires that the values in column 4 are greater than 10.

What if we have a file containing two columns with a count of “successes” and “attempts” for some process, each collected from a different source, and we want to compute the overall success rate across all the sources? We can use awk to sum each column and then print the totals and the rate at the end using a special syntax. If the file data.txt has the successes and attempts in columns 2 and 3, respectively, we could do this:

awk 'BEGIN { total_successes = total_attempts = 0 } { total_successes += $2; total_attempts += $3 } END { print "Successes:", total_successes, "Attempts:", total_attempts, "Rate:", total_successes / total_attempts }' data.txt

awk executes the code in the BEGIN section before any lines in the input are processed, and the code in the END section after reading the input. The BEGIN section initializes two variables to 0, the main code section sums the columns on each line, and the END section prints the totals and the rate.
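As a smaller, self-contained instance of the same pattern, this sketch averages the values in column 1 of data.txt; it relies on awk initializing unset variables to zero, so the BEGIN section can be omitted entirely (the variable names sum and n are just illustrative):

```shell
# Average column 1: accumulate a sum and a count on every line,
# then divide in the END block after all input has been read.
# (awk initializes sum and n to 0 automatically.)
awk '{ sum += $1; n++ } END { print sum / n }' data.txt
```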

The last example uses two of awk’s built-in variables. Sometimes we need to extract data but want to remove the first header line in a file, and, for files with variable numbers of fields, we may wish to pull out lines with some specific number of fields. If we have a file named data.txt and we want to omit the first line and only process lines with 4 fields, we can do the following:

awk 'NR > 1 && NF == 4' data.txt

The variable NR is the number of input records processed so far: typically the current line number, since records are separated by the newline character by default. The variable NF is the number of fields in the current record (line), and by default, fields are separated by white space. A complete list of awk’s built-in variables is available in the awk manual.
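NF is useful outside of boolean tests as well: since it holds the field count, $NF refers to the last field on each line, no matter how many fields the line has. A small sketch on data.txt:

```shell
# NF is the field count, so $NF is always the last field on the line,
# even when different lines have different numbers of fields:
awk '{ print $NF }' data.txt
```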

There is much more to awk and shell scripting, but this is a start. I hope to add more introductions in the coming months.