Chapter 5 Connecting Programs and Compressing output
Now that we’ve been through the whole alignment and filtering pipeline, let’s look at the output. Specifically lets compare the sizes of the files we used. Recall that we can do that with ls -alh
On my folder I get this (some columns and files removed for clarity)
49M 29 Nov 10:46 aln.filtered.sam
59M 28 Nov 16:28 aln.sam
5.0M 2 Jul 15:04 ecoli_genome.fa
18M 28 Nov 15:53 ecoli_left_R1.fq
18M 28 Nov 15:53 ecoli_right_R2.fq
The file sizes are in the left-most column. Check out the relative size of the two read files (18M each) and the alignment SAM files (59M and 49M). The output file is much larger than the input. This has implications for storage when the files are really large (many GB) and there are lots of them. The disk space gets used really quickly. Consider also the redundancy we have - that aln.filtered.sam
is the one we’re interested in, not the aln.sam
so it is taking up unnecesary disk space. It’s easy to see that when you are doing a real experiment with lots of samples and hundreds of GB file size, you’re going to eat up disk space. Also larger files take longer to process, so you’re going to have a long wait. This has implications too when you get to later stages in the analysis
In this chapter we’re going to look at a technique for reducing those housekeeping overheads and speeding things up.
5.1 BAM Files
BAM files are a binary compressed version of SAM files. They contain identical information in a more computer friendly way. This means that people can’t read it, but it is rare in practice that you’ll directly read much of a SAM file with your own eyes. Let’s look at the command to do that
samtools view -S -b aln.filtered.sam > aln.filtered.bam
Again we’re using samtools view
and our options are -S
which means SAM format input and the new one is -b
means BAM format output. Our input file is aln.filtered
and we’re sending the output to aln.filtered.bam
.
If we check the files with ls -alh
now we get
9.2M 29 Nov 14:05 aln.filtered.bam
49M 29 Nov 10:46 aln.filtered.sam
59M 28 Nov 16:28 aln.sam
5.0M 2 Jul 15:04 ecoli_genome.fa
18M 28 Nov 15:53 ecoli_left_R1.fq
18M 28 Nov 15:53 ecoli_right_R2.fq
The BAM file is about a fifth of the size of the SAM file. So we can save space in this way. We have another trick up our sleeve though. We can connect together command lines, so that we don’t have to create intermediate files - this reduces the number of files we have to save. We can do this by using something called pipes.
5.2 Connecting Program Input and Output With Pipes
Most command line programs print their results straight out without sending it to a file. This seems strange, but it adds a lot of flexibility. If also set up our programs to read in this output then we can connect them together. We can do this with pipes. The usual way to do this is to use the |
operator. Let’s look at a common example.
Here we’ll use the command ls
and shuf
to see how this works. We know ls
will ‘list’ our directory contents, shuf
shuffles lines of text sent to it. If we use |
in between we can connect the output of one to the other. Try running ls
a couple of times to verify you get the same output both times and then try this
a few times
ls | shuf
you should get different output everytime. The important thing to note is that shuf
is doing its job on the data sent from ls
, which sends consistent data every time. We don’t have to create an intermediate file for shuf
to work from. The |
character joing two commands is the key.
We can apply this to our minimap2
and samtools
commands.
5.3 From reads to filtered alignments in one step
So let’s try reducing the original alignment pipeline to one step with pipes. We’ll work in the BAM file bit later.
Simply take away the output file names (except the last one!) and replace with pipes. It looks like this
minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -q 25 -f 3 > aln.filtered.from_pipes.sam
when you do ls -alh
you should see the new aln.filtered.from_pipes.sam
file, its size is identical to the file we generated when we created the intermediate aln.sam
file, but this time we didnt need to, saving that disk space.
5.3.1 From reads to filtered alignments in a BAM file in one step
Let’s modify the command to give us BAM not SAM, saving a further step. We already know that samtools view
can output BAM instead of SAM, so lets add that option (-b
) in to the samtools
part.
minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -b -q 25 -f 3 > aln.filtered.from_pipes.bam
If you check the files with ls -alh
now you should see that you have the new aln.filtered.from_pipes.bam
file with no extra intermediate file and the smallest possible output file. Congratulations, you know now the fastest and most optimal way to make alignments and filter them.
5.4 Sorting BAM files
In practice a BAM file of alignments needs to be ordered with the alignments at the start of the first chromosome at the start of the file and the alignments on the end of the last chromosome at the end of the file. This is for computational reasons we don’t need to worry about, but it does mean we need to do another sorting step to make our files useful downstream.
Because all the alignments need to be present before we can start we can’t use the pipe technique above. So we use an input and output file. The command is samtools sort
and looks like this.
samtools sort aln.filtered.from_pipes.bam -o aln.filtered.from_pipes.sorted.bam
Doing ls -alh
shows a new sorted BAM aln.filtered.from_pipes.sorted.bam
that contains the same information but is actually a little smaller due to being sorted. We can safely delete the unsorted version of the BAM file.
5.4.1 Automatically deleting the unsorted BAM
If the sorting goes fine, we have two BAM files with essentially the same information and don’t need the unsorted file. We can of course remove this with rm aln.filtered.from_pipes
. A neat space saving trick is to combine the rm
step with the successful completion of the sort. We can do this by joining the commands with &&
.
That looks like this
samtools sort aln.filtered.from_pipes.bam -o aln.filtered.from_pipes.sorted.bam && rm aln.filtered.from_pipes.bam
The &&
doesn’t connect the data between the two commands, it just doesn’t let the second one start until the first one finishes successfully (computers have an internal concept of whether a command finished properly).
This means if the samtools sort
goes wrong the rm
part will not run and the input file won’t be deleted so you won’t have to remake it. This is especially useful later when we wrap all this into an automatic script.
5.5 Indexing the sorted BAM
Many downstream applications need the BAM file to have an index, so they can quickly jump to a particular part of the reference chromosome. This is a tiny file and we usually don’t need to worry about it. To generate it use samtools index
samtools index aln.filtered.from_pipes.sorted.bam
Using ls -lah
we can see a tiny file called aln.filtered.from_pipes.sorted.bam.bai
, this is the index.
5.6 Further Reading
For a primer on some more aspects of samtools
see this tutorial