Chapter 6 Automating The Process

We now know everything we need to do an alignment of reads against a reference in an efficient way. What’s next is to consider that this process needs to be done for every set of reads you might generate. That’s a lot of typing of the same thing over and over, which can get tedious. In this section we’ll look at how we can automate the process to make it less repetitive using a script.

6.1 Shell scripts

Scripts that contain commands we usually run in the Terminal are called shell scripts. They’re generally just the command we want to do one after another and saved in a file. We can then run that file as if it were a command and all the commands we put in the file are

Shell scripts must be a simple text file, so you can’t create them in programs like Word, you’ll need a special text editor. On most systems we have one called nano built into the Terminal.

6.1.1 Using `nano` to create a shell script

To open a file in nano type nano and the name of the file, if the file doesn’t exist it will be created.

nano my_script.sh

Will create a file and open it. To save and exit type press Ctrl then X (thats what ^X means in the help at the bottom. You can enter your script in here. Remember its not a word processor, its a Terminal text editor, so you have to use the mouse to move round and cutting and pasting is a bit clunky.

6.2 Creating a script that automates our alignment pipeline.

Let’s enter our script into nano. We’ll do it as we did in the earlier chapters, but we’ll change file names to make it clear which files are coming from the script.

First, create a script called do_aln.sh

nano do_aln.sh

Once nano opens, add the following into it

minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -b -q 25 -f 3 > aln.script.bam
samtools sort aln.script.bam -o aln.script.sorted.bam && rm aln.script.bam
samtools index aln.script.sorted.bam

That’s all the steps we want to do. Use Ctrl-X to save the changes to the file.

6.3 Running the script

To run the script we use the sh command and the script name. Try

sh do_aln.sh

You should see progress from the script as it does each step in turn. When it’s done you can ls -alh to see the new sorted BAM file from the script.

Congratulations! You just automated an entire analysis pipeline!

6.4 Running on different input files

So our script is great but the input filenames will be the same every time we run it meaning we’d need to go through the whole file and change them which is error prone. Also the output files are the same each time, meaning we could accidentally overwrite any previous work in there, which is frustrating. We can overcome this with a couple of simple changes in our script that make use of variables.

Variables are place holders for values that the script will replace when it runs. Consider these two commands

MY_MESSAGE="Hello, world!"
echo $MY_MESSAGE

Recall that echo just prints whatever follows it. Try running this, you get Hello, world! which shows that the process created a variable called MY_MESSAGE and stored the message in it. When used by a command the $ showed the command that it should use the message stored in the variable and printed Hello, world!. We can use this technique in our scripts. Note the command MY_MESSAGE="Hello, world!" must not have spaces around the equals sign.

Now we can expand our script to take advantage. Look at this script.

LEFT_READS="ecoli_left_R1.fq"
RIGHT_READS="ecoli_right_R2.fq"
REFERENCE_SEQUENCE="ecoli_genome.fa"
SAMPLE_NAME="ecoli"

minimap2 -ax sr $REFERENCE_SEQUENCE $LEFT_READS $RIGHT_READS | samtools view -S -h -b -q 25 -f 3 > $SAMPLE_NAME.bam
samtools sort $SAMPLE_NAME.bam -o $SAMPLE_NAME.sorted.bam && rm $SAMPLE_NAME.bam
samtools index $SAMPLE_NAME.sorted.bam

Right at the top we create a variable for each of our read files (LEFT_READS and RIGHT_READS), our reference files (REFERENCE_SEQUENCE) and a unique sample name (ecoli). These variables get used whenever we need them, saving us from typing the information over and over. The practical upshot of this being that we only need to change the script in one place every time we reuse it for a different sample and set of reads.

Now try this out. Save the new script in a file called do_aln_variables.sh and run it as before with sh do_aln_variables.sh. When it’s run you should see an output called ecoli.sorted.bam.