Chapter 5 Connecting Programs and Compressing output

Now that we’ve been through the whole alignment and filtering pipeline, let’s look at the output. Specifically lets compare the sizes of the files we used. Recall that we can do that with ls -alh

On my folder I get this (some columns and files removed for clarity)

49M 29 Nov 10:46 aln.filtered.sam
59M 28 Nov 16:28 aln.sam
5.0M  2 Jul 15:04 ecoli_genome.fa
18M 28 Nov 15:53 ecoli_left_R1.fq
18M 28 Nov 15:53 ecoli_right_R2.fq

The file sizes are in the left-most column. Check out the relative size of the two read files (18M each) and the alignment SAM files (59M and 49M). The output file is much larger than the input. This has implications for storage when the files are really large (many GB) and there are lots of them. The disk space gets used really quickly. Consider also the redundancy we have - that aln.filtered.sam is the one we’re interested in, not the aln.sam so it is taking up unnecesary disk space. It’s easy to see that when you are doing a real experiment with lots of samples and hundreds of GB file size, you’re going to eat up disk space. Also larger files take longer to process, so you’re going to have a long wait. This has implications too when you get to later stages in the analysis

In this chapter we’re going to look at a technique for reducing those housekeeping overheads and speeding things up.

5.1 BAM Files

BAM files are a binary compressed version of SAM files. They contain identical information in a more computer friendly way. This means that people can’t read it, but it is rare in practice that you’ll directly read much of a SAM file with your own eyes. Let’s look at the command to do that

samtools view -S -b aln.filtered.sam > aln.filtered.bam

Again we’re using samtools view and our options are -S which means SAM format input and the new one is -b means BAM format output. Our input file is aln.filtered and we’re sending the output to aln.filtered.bam.

If we check the files with ls -alh now we get

9.2M 29 Nov 14:05 aln.filtered.bam
49M 29 Nov 10:46 aln.filtered.sam
59M 28 Nov 16:28 aln.sam
5.0M  2 Jul 15:04 ecoli_genome.fa
18M 28 Nov 15:53 ecoli_left_R1.fq
18M 28 Nov 15:53 ecoli_right_R2.fq

The BAM file is about a fifth of the size of the SAM file. So we can save space in this way. We have another trick up our sleeve though. We can connect together command lines, so that we don’t have to create intermediate files - this reduces the number of files we have to save. We can do this by using something called pipes.

5.2 Connecting Program Input and Output With Pipes

Most command line programs print their results straight out without sending it to a file. This seems strange, but it adds a lot of flexibility. If also set up our programs to read in this output then we can connect them together. We can do this with pipes. The usual way to do this is to use the | operator. Let’s look at a common example.

Here we’ll use the command ls and shuf to see how this works. We know ls will ‘list’ our directory contents, shuf shuffles lines of text sent to it. If we use | in between we can connect the output of one to the other. Try running ls a couple of times to verify you get the same output both times and then try this a few times

ls | shuf

you should get different output everytime. The important thing to note is that shuf is doing its job on the data sent from ls, which sends consistent data every time. We don’t have to create an intermediate file for shuf to work from. The | character joing two commands is the key.

We can apply this to our minimap2 and samtools commands.

5.3 From reads to filtered alignments in one step

So let’s try reducing the original alignment pipeline to one step with pipes. We’ll work in the BAM file bit later.

Simply take away the output file names (except the last one!) and replace with pipes. It looks like this

minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -q 25 -f 3 > aln.filtered.from_pipes.sam

when you do ls -alh you should see the new aln.filtered.from_pipes.sam file, its size is identical to the file we generated when we created the intermediate aln.sam file, but this time we didnt need to, saving that disk space.

5.3.1 From reads to filtered alignments in a BAM file in one step

Let’s modify the command to give us BAM not SAM, saving a further step. We already know that samtools view can output BAM instead of SAM, so lets add that option (-b) in to the samtools part.

minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -b -q 25 -f 3 > aln.filtered.from_pipes.bam

If you check the files with ls -alh now you should see that you have the new aln.filtered.from_pipes.bam file with no extra intermediate file and the smallest possible output file. Congratulations, you know now the fastest and most optimal way to make alignments and filter them.

5.4 Sorting BAM files

In practice a BAM file of alignments needs to be ordered with the alignments at the start of the first chromosome at the start of the file and the alignments on the end of the last chromosome at the end of the file. This is for computational reasons we don’t need to worry about, but it does mean we need to do another sorting step to make our files useful downstream.

Because all the alignments need to be present before we can start we can’t use the pipe technique above. So we use an input and output file. The command is samtools sort and looks like this.

samtools sort aln.filtered.from_pipes.bam -o aln.filtered.from_pipes.sorted.bam

Doing ls -alh shows a new sorted BAM aln.filtered.from_pipes.sorted.bam that contains the same information but is actually a little smaller due to being sorted. We can safely delete the unsorted version of the BAM file.

5.4.1 Automatically deleting the unsorted BAM

If the sorting goes fine, we have two BAM files with essentially the same information and don’t need the unsorted file. We can of course remove this with rm aln.filtered.from_pipes. A neat space saving trick is to combine the rm step with the successful completion of the sort. We can do this by joining the commands with &&.

That looks like this

samtools sort aln.filtered.from_pipes.bam -o aln.filtered.from_pipes.sorted.bam && rm aln.filtered.from_pipes.bam

The && doesn’t connect the data between the two commands, it just doesn’t let the second one start until the first one finishes successfully (computers have an internal concept of whether a command finished properly).

This means if the samtools sort goes wrong the rm part will not run and the input file won’t be deleted so you won’t have to remake it. This is especially useful later when we wrap all this into an automatic script.

5.5 Indexing the sorted BAM

Many downstream applications need the BAM file to have an index, so they can quickly jump to a particular part of the reference chromosome. This is a tiny file and we usually don’t need to worry about it. To generate it use samtools index

samtools index aln.filtered.from_pipes.sorted.bam

Using ls -lah we can see a tiny file called aln.filtered.from_pipes.sorted.bam.bai, this is the index.

5.6 Further Reading

For a primer on some more aspects of samtools see this tutorial