Shell tricks for one-liner bioinformatics: part 2

This post gives more examples of GNU/Linux and Unix shell commands applicable to bioinformatics processing. Here, we will go through examples of the tools grep, sort, uniq, wc, join, and combinations of these, including their use in performing set operations on two or more files. The examples below assume that the reader is familiar with part 1, which discusses for loops, awk, and cut. As before, we will use bash syntax; several examples are readily transferable to other shells.

Searching files with grep

Often data or code are contained in many files and we do not recall which file contains a string or pattern we are looking for. grep searches a file or files for a string or, more generally, for a regular expression. In the simplest case, we want to search for some string within a given file. For example, if we wanted to find the string “awk” in ~/.bash_history, we would do:

grep awk ~/.bash_history

The above would print out all lines in which the text “awk” occurs anywhere. If we instead wanted to search for “awk” in the files with names ending in .sh in the current directory, we would do:

grep awk *.sh

This would print all lines in which “awk” occurs, each prefixed with the name of the file in which the line appears.

grep has many features that are quite useful, especially its ability to search using regular expressions. This post does not include details of regular expressions, but they are very powerful and worth learning. We provide one example here. Suppose we ran a tool that produced many pieces of information about a dataset, with each line prefixed by the type of information that line contains. If that file is called “output” and we are currently interested in lines that begin with the string “TSTV”, we could use the following to print out only those lines:

grep "^TSTV" output

This is a regular expression: the ^ at the beginning of the search pattern indicates that the text that follows must occur at the beginning of the string being searched. Since grep matches against each successive line by default, this will match any line that starts with “TSTV”.
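As a concrete sketch, here is a toy run with a made-up “output” file (the TSTV line and the other lines are invented for illustration):

```shell
# Create a hypothetical tool-output file; the contents are made up.
printf 'SN number of samples: 25\nTSTV ratio: 2.07\nnote: TSTV also appears mid-line here\n' > output
# The ^ anchor restricts matches to the start of each line,
# so only the second line is printed:
grep '^TSTV' output > tstv_lines
cat tstv_lines
```

Without the ^ anchor, grep would also print the third line, since “TSTV” occurs mid-line there.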

Sorting files with sort

The program sort sorts a file, as its name suggests. In the context of bioinformatics, sort can be used on its own to order a file based on a set of fields within it. For example, suppose the file “sites.txt” contains a set of sites that we wish to sort first by chromosome number and then, within each chromosome, by physical position. If column 2 contains the chromosome and column 3 the physical position, the following produces a file called “sites-sort.txt” that is sorted in this way:

sort -k 2n -k 3n sites.txt > sites-sort.txt

The -k option tells sort which column to sort by. By default, sort uses alphabetical sorting, in which the string “1000” comes before “11”; this is not our aim, so we use -k 2n, with the n indicating a numerical sort. If sort receives multiple -k options, it uses later keys to break ties in earlier ones. Thus, the above command sorts numerically on column 2 and, within each chromosome, numerically on column 3.
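To see why the n modifier matters, here is a toy “sites.txt” with made-up data laid out as above (chromosome in column 2, position in column 3):

```shell
# Made-up sites: marker id, chromosome, physical position
printf 'rs1 11 500\nrs2 2 1000\nrs3 2 30\n' > sites.txt
# Numeric sort on chromosome, then on position within each chromosome:
sort -k 2n -k 3n sites.txt > sites-sort.txt
cat sites-sort.txt
```

Chromosome 2 now precedes chromosome 11, and position 30 precedes 1000 within chromosome 2; a plain alphabetical sort would have placed “11” before “2” and “1000” before “30”.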

Although bioinformatic data are often already sorted before we analyze them, sort is essential to using uniq and join (each discussed below) as well as other shell tools such as comm.

Getting unique lines and counting with uniq

The uniq command removes adjacent duplicate lines, so given sorted input it produces a list of the unique lines. Run in this way, the command isn’t very useful for bioinformatics, but it has several options that are helpful. Suppose we have a file called “samples.ind” with a list of samples, and that column 3 gives the population of each sample. (This is the format of Eigenstrat .ind files.) If we want a count of the number of samples from each population, we can do:

awk '{print $3}' samples.ind | sort | uniq -c

Here, the -c option counts the number of copies of each unique line input to uniq; since we use awk to extract the populations, this yields the number of samples from each. This example also shows piping, introduced in part 1; here we pipe between multiple commands in order to avoid writing intermediate files.
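For instance, with a made-up three-sample “samples.ind” (population in column 3, as in Eigenstrat .ind files):

```shell
# Hypothetical samples: id, sex, population
printf 'S1 M Yoruba\nS2 F Yoruba\nS3 F French\n' > samples.ind
# Extract populations, sort them, and count copies of each:
awk '{print $3}' samples.ind | sort | uniq -c > pop_counts
cat pop_counts
```

This prints a count next to each population name: 1 for French and 2 for Yoruba.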

Set operations on files with uniq

Now suppose we have a set of markers in “data1.bim” and “data2.bim” with marker ids in column 2. We could determine the set of markers that are shared in common between these two files—assuming they use the same marker ids for the same variants—in the following way:

cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d > intersection

Here, we use cat to concatenate the two files, extract column 2 from this input, sort this column—the set of marker ids—and pass them to uniq -d. The -d option causes uniq to print a line only if the given line appears more than once. As long as the marker ids appear only once in each file, duplicates will only occur for markers in both datasets.
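A miniature run with made-up .bim-style files (marker id in column 2; the other columns are abbreviated for brevity):

```shell
# Toy stand-ins for the two marker files:
printf '1 rs1 0 100\n1 rs2 0 200\n' > data1.bim
printf '1 rs2 0 200\n1 rs3 0 300\n' > data2.bim
# Only rs2 appears in both files, so only it is duplicated after sorting:
cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d > intersection
cat intersection
```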

The above is an example of generating the intersection of the two marker sets. To perform set union, we can do the following:

cat data1.bim data2.bim | awk '{print $2}' | sort | uniq > union

We may wish to determine the set of markers that are in “data1.bim” but not in “data2.bim”; i.e., to perform set subtraction. We can do this using a two-step process: first compute the intersection, as we did above. With the intersection stored in a file called “intersection”, we can now determine the markers that are unique to “data1.bim” in the following way:

awk '{print $2}' data1.bim | cat - intersection | sort | uniq -u > data1-only

The above extracts the marker ids from “data1.bim”, concatenates them with the markers from the intersection, sorts these (necessary to use uniq), and then passes them to uniq -u. The -u option causes uniq to print only lines that appear once. Since the marker ids in “intersection” are guaranteed to be in “data1.bim”, they will appear more than once to uniq and will not be printed. The suppressed sites are exactly those that are also in “data2.bim”, so this produces the set subtraction we desire. (Note that for many shell utilities, '-' represents stdin instead of a file, so the piped input from awk gets concatenated with “intersection” in the second part of the command.)
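Continuing the toy .bim example from the intersection step (regenerated here so the snippet is self-contained):

```shell
# Toy inputs: marker id in column 2; rs2 is the only shared marker
printf '1 rs1 0 100\n1 rs2 0 200\n' > data1.bim
printf '1 rs2 0 200\n1 rs3 0 300\n' > data2.bim
cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d > intersection
# rs2 appears in both data1.bim and intersection, so uniq -u drops it:
awk '{print $2}' data1.bim | cat - intersection | sort | uniq -u > data1-only
cat data1-only
```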

Combining information across files using join

Consider a situation in which two files give information about the same set of samples. We may wish to combine these into one file. Here we can use join, which matches fields between two files, producing lines that combine all fields from the two lines that match between the input files.

This is best illustrated by example. Suppose we have files “BMI.txt” and “height.txt” which both contain two fields: the sample id and the given phenotype (BMI and height for the two respective files). join requires that the input files are sorted for the field that is joined on, so we can do the following:

sort -k 1 BMI.txt > BMI.sort.txt
sort -k 1 height.txt > height.sort.txt
join -j 1 BMI.sort.txt height.sort.txt > BMI_height.txt

The -j 1 option tells join to find matching strings in field 1 in each of the input files. Note that -j 1 is the default, so this last line could also have been

join BMI.sort.txt height.sort.txt > BMI_height.txt

Often the fields that we wish join to match on are different between the two files. Consider a file “pheno.txt” that contains many fields (e.g., phenotypes), with the sample id in column 3. If we wish to create a new file that adds the BMI information to that contained in “pheno.txt”, we could do the following:

sort -k 3 pheno.txt > pheno.sort.txt
join -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_BMI.txt

The -1 3 -2 1 options tell join to identify matches using field 3 within the first supplied file—here that’s “pheno.sort.txt”—and using field 1 within the second supplied file.

The resulting output can shift the field order somewhat: the matching field always (a) occurs once even though it is present in both files, and (b) is field 1 in the output from join. Other than this shift of the match field, the other fields from file 1 occur in their original order, and these are followed by the fields from file 2 (omitting the match field) in their original order.
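This field shifting is easiest to see with tiny made-up files; here “pheno.sort.txt” has two phenotype fields before the sample id in column 3, and “BMI.sort.txt” has the id in column 1:

```shell
# Made-up inputs, already sorted on their respective id fields
printf 'a1 b1 S1\na2 b2 S2\n' > pheno.sort.txt
printf 'S1 22.5\nS2 31.0\n' > BMI.sort.txt
# Match field 3 of file 1 against field 1 of file 2:
join -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_BMI.txt
cat pheno_BMI.txt
```

Each output line starts with the match field (the sample id), followed by the remaining fields of file 1 and then those of file 2: e.g., “S1 a1 b1 22.5”.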

By default, join prints only lines that have a matching field in both files. Thus, it performs an intersection on an arbitrary field, which is often convenient in itself. However, sometimes we wish to retain all the information from one of the files.

If, in the above example, “pheno.txt” contained information on more samples than appear in “BMI.txt”, and we wish to retain all the information in “pheno.txt”, we could do the following:

join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_all_BMI.txt

The -a # option tells join to print all lines from the indicated file number (1 or 2), including those that do not match any line from the other file. Note that any unmatched lines printed as a result of -a # will have fewer fields than the other lines. If we know how many fields occur on matched lines, we can use awk to fill in, as in the following adaptation of the above:

join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt | awk '{ if (NF < 10) { print $0, "NA" } else { print } }' > pheno_all_BMI2.txt

This assumes that there are 10 fields for lines that successfully join “BMI.sort.txt” with “pheno.sort.txt”. Recall from part 1 that the awk variable NF gives the number of fields on each line.
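In this miniature version (made-up data; matched lines have 4 fields rather than 10), sample S3 has no BMI entry, so awk pads its line with “NA”:

```shell
# Toy inputs: S3 appears in pheno but not in BMI
printf 'a1 b1 S1\na3 b3 S3\n' > pheno.sort.txt
printf 'S1 22.5\n' > BMI.sort.txt
# -a 1 keeps S3's line even though it has no match in BMI.sort.txt;
# that unmatched line has only 3 fields, so awk appends "NA":
join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt \
  | awk '{ if (NF < 4) { print $0, "NA" } else { print } }' > pheno_all_BMI2.txt
cat pheno_all_BMI2.txt
```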

Reverting to the original line order after using join

Using join on files that have a specific order that must be maintained presents a problem. For example, a PLINK .fam file describes the order of samples in a corresponding .bed file, so we cannot reorder the .fam file. A workaround is to keep a field that lists the original line number and to numerically sort by this field once we finish modifying the lines.

For example, suppose “data.fam” is a PLINK .fam file that does not have phenotype information in column 6. If we have a separate file “case-control.txt” that contains the sample ids and the (1 or 2) case-control status of these samples, we can do the following (note that the sample ids in the PLINK .fam file are in column 2):

sort -k 1 case-control.txt > case-control.sort.txt
awk '{print $0,NR}' data.fam | sort -k 2 | join -1 2 -2 1 - case-control.sort.txt | sort -k 7n | awk '{print $2,$1,$3,$4,$5,$8}' > data-pheno.fam

The second line is long, but it does what we want. We first add a field with the line number (the NR variable in awk) to the .fam data; this gives us 7 fields. Next, we sort these lines on column 2, the field containing the sample id that we wish to join on. Then we run join, indicating that field 2 is to be matched from the lines coming from stdin (the '-' file name indicates that stdin should be used for file 1), and matching on field 1 for “case-control.sort.txt”.

This gives us all the information we need, but the lines are out of order compared to the original “data.fam” and have extra fields. We fix the ordering by numerically sorting on column 7, which holds the original line numbers, putting the lines back in their original order. Lastly, we pipe this to awk and reorder the columns. As noted above, join shifts the fields slightly, with field 1 being the matched field. Here, field 1 is the sample id, which needs to be field 2 in the resulting .fam file, so we print field 2 first (the PLINK format family id) followed by field 1, the sample id. We know the family id has to be field 2 since the fields from each file maintain their order (with the exception of the match field). We then print fields 3, 4, and 5 in their original order. Field 6 in a PLINK .fam file gives the case-control phenotype; this is the information we used join to pull from “case-control.txt”, and it must be the last field: field 8, since the input before running join had 7 fields. And there we have it, a new .fam file with case-control status added from a separate file.
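The whole pipeline can be tried on a two-sample toy dataset (all values made up; column 6 holds the -9 missing-phenotype placeholder):

```shell
# Made-up .fam: family id, sample id, paternal id, maternal id, sex, phenotype
printf 'F1 S2 0 0 1 -9\nF2 S1 0 0 2 -9\n' > data.fam
printf 'S1 2\nS2 1\n' > case-control.txt
sort -k 1 case-control.txt > case-control.sort.txt
# Tag lines with their number, join on sample id, restore the order,
# and rebuild the 6 .fam columns with case-control status last:
awk '{print $0,NR}' data.fam | sort -k 2 | join -1 2 -2 1 - case-control.sort.txt \
  | sort -k 7n | awk '{print $2,$1,$3,$4,$5,$8}' > data-pheno.fam
cat data-pheno.fam
```

The output keeps the original sample order (S2 first) with the case-control status in column 6.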

Redirecting output in bash: process substitution

In many examples above, we used a separate sort command to produce a file that we then used with join. Using a bash feature called process substitution, it is possible to sort both files on the fly in a single line, as we explain below, a convenience that allows us to avoid storing sorted files that we likely don’t need. The syntax resembles I/O redirection (i.e., < and >).

For the first join command given above, we first generated files called “BMI.sort.txt” and “height.sort.txt”. Assuming we don’t need these sorted files after the join command, what we really want to do is to give the output of both these commands to join directly. This is possible in bash using the following command:

join -j 1 <(sort -k 1 BMI.txt) <(sort -k 1 height.txt) > BMI_height.txt

These process substitutions—the two sort commands given in the line above—can be used in contexts where a filename would otherwise appear. (In fact, bash substitutes a filename, such as /dev/fd/63, from which the output of the process within the <(...) construct can be read.)
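A self-contained sketch using made-up phenotype files (this requires bash, since process substitution is not part of POSIX sh):

```shell
# Deliberately unsorted made-up inputs:
printf 'S2 31.0\nS1 22.5\n' > BMI.txt
printf 'S2 165\nS1 170\n' > height.txt
# Both files are sorted on the fly; no *.sort.txt files are written:
join -j 1 <(sort -k 1 BMI.txt) <(sort -k 1 height.txt) > BMI_height.txt
cat BMI_height.txt
```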

Counting with wc

The final example is a simple command: wc, for word count. This program lists the number of lines, words (defined as sequences of characters delimited by white space), and characters in an input file. Line counts are often useful in bioinformatics work; wc -l gives only the line count of the input and is slightly more efficient than running wc alone.

Given PLINK format files “data.fam” and “data.bim”, a quick way to determine the number of samples and the number of markers contained in the dataset is to use the following two commands:

wc -l data.fam
wc -l data.bim

wc is also useful in combination with the set operations mentioned above, where we can pipe the output of these operations to wc -l to get counts in lieu of producing files when these are not needed.
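For example, the intersection count can be computed directly, using the same made-up .bim-style files as in the uniq -d example:

```shell
# Toy inputs: marker id in column 2; rs2 is the only shared marker
printf '1 rs1 0 100\n1 rs2 0 200\n' > data1.bim
printf '1 rs2 0 200\n1 rs3 0 300\n' > data2.bim
# Count the shared markers without writing an "intersection" file:
cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d | wc -l > shared_count
cat shared_count
```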