Bioinformatics_Overview

Introduction to Bash Part 3

Learning outcomes

In our last practical, we introduced you to the concept of regular expressions (regex), and played with a few basic datasets. In today’s practical we will build on this further, introducing the command sed and writing a couple of basic scripts. By the end of today’s practical you should be able to:

sed: The Stream Editor

One additional and very useful command in the terminal is sed, which is short for stream editor. This is a very powerful command which can be a little overwhelming at first. If using this for your own scripts & you can’t figure something out, remember ‘Google is your friend’ & sites like https://www.stackoverflow.com are full of people wrestling with similar problems to you. These are great places to start looking for help & even advanced programmers use these tools.

For today, there are two key sed functionalities that we want to introduce.

  1. Using sed to alter the contents of a file/input;
  2. Using sed to print regions of a file

Altering a file or other input

sed uses regular expressions that we have come across under the grep section, and we can use these to replace strings or characters within a text string. The command works in the form sed 'SCRIPT' INPUT, and the script section is where all the action happens. Input can be given to sed as either a file, or just as a text stream via the pipe that we have already introduced.

In the following example the script begins with an s to indicate that we are going to make a substitution. The beginning of the first pattern (i.e. the regex we are searching for) is denoted with the slash, with the identical delimiter indicating the replacement pattern, and this is in turn completed with the same delimiter. Try this simple example from the link https://www.grymoire.com/Unix/Sed.html which is a very detailed & helpful resource about usage of sed. Here we are sending the input (echo Sunday) to the command via the pipe, so no INPUT section is required:

echo Sunday | sed 's/day/night/'

Here you are passing sed the string Sunday, and sed takes day and turns it into night.
sed will only replace the first instance of the string on any line, so try:

echo Sundayday | sed 's/day/night/'

It only replaced the first instance of day and left the second. You can make it global, where it switches every instance by using the g option at the end of the pattern like this:

echo Sundayday | sed 's/day/night/g'

You can ‘capture’ parts of the pattern in parentheses and access that in the second part of the regular expression (what you are switching to) using \1, \2, etc., to denote the number of the captured string. If you wanted to match ATGNNNTGA, where N is any base, and just output these three bases you could try the following

echo 'ATGCCAGTA' | sed -r 's/ATG(.{3})GTA/\1/g'

Or if we needed to replace those three bases with an expanded tri-nucleotide repeat of them, you could do the following where we capture the undefined string between ATG & GTA, and expand it to three times:

echo 'ATGCCAGTA' | sed -r 's/ATG(.{3})GTA/ATG\1\1\1GTA/g'

The \1 entry takes the contents of the first parenthesis and uses it in the substitution, even though you don’t know what the bases are. Note that the -r option was set for these operations, which turns on extended regular expression capabilities. This can be a powerful tool & multiple parentheses can also be used:

echo 'ATGCCAGTA' | sed -r 's/(ATG)(.{3})(GTA)/\3\2\2\1/g'

In this command we captured each codon separately, then switched the order of the first & last triplet and expanded the middle unknown string twice. Note how quickly this starts to look confusing though! Taking care to be clear when writing these types of procedures can be an important idea when you have to go back & re-read your code a year or two later. (Yes this will happen a lot!!!)

Displaying a region from a file

The command sed can also be used to replicate the functionality of the head & grep commands, but with a little more power at your fingertips. By default sed will print the entire input stream it receives, but setting the option -n will turn this off. Try this by adding an n immediately after the -r in one of the above lines & you will notice you receive no output. This is useful if we wish to restrict our output to a subset of lines within a file, and by including a p at the end of the script section, only the section matching the results of the script will be printed.

Make sure you are in the Practical_1 directory & we can look through the file Drosophila_melanogaster.BDGP6.ncrna.fa again.

sed -n '1,10 p' Drosophila_melanogaster.BDGP6.ncrna.fa

This will print the first 10 lines, like the head command will by default. However, we could now print any range of lines we choose. Try this by changing the script to something interesting like 101,112 p. We could also restrict the range to specific lines by using the sed increment operator ~.

sed -n '1~4p' Drosophila_melanogaster.BDGP6.ncrna.fa

This will print every 4th line, beginning at the first, and is very useful for files with multi-line entries. The fastq file format from NGS data, which we’ll look at in the next topic, uses this format.

We can also make sed operate like grep by making it only print the lines which match a pattern.

sed -rn '/TGCAGGCTC.+(GA){2}.+/ p' Drosophila_melanogaster.BDGP6.ncrna.fa

Note however, that the line numbers are not present in this output, and the pattern highlighting from grep is not present either.

Writing Scripts

Now we’ve had a look at many of the key tools, we’ll move on to writing scripts which is one of the most common things a bioinformatician will do. We often do this on a HPC to run long data processing pipelines (or workflows).

Scripts are commonly written to perform repetitive tasks on multiple files, or need to perform complex series of tasks and writing the set of instructions as a script is a very powerful way of performing these tasks. They are also an excellent way of ensuring the commands you have used in your research are retained for future reference. Keeping copies of all electronic processes to ensure reproducibility is a very important component of any research. Writing scripts requires an understanding of several key concepts which form the foundation of much computer programming, so let’s walk our way through a few of them.

Some Important Concepts

Two of the most widely used techniques in programming are that of the for loop, and logical tests using an if statement.

for Loops (Recursion)

A for loop is what we use to cycle through an input one item at a time

for i in 1 2 3; do (echo "$i^2 = $(($i*$i))"); done

In the above code the fragment before the semi-colon asked the program to cycle through the values 1, 2 & 3, letting the variable i take each value in order of appearance.

Note that the value of the variable i was prefaced by the dollar sign ($). This is how the bash shell knows it is a variable, not the letter i. The command done then finished the do command. All commands like do, if or case have completing statements, which respectively are done, fi & esac.

An important concept which was glossed over in the previous paragraph is that of a variable. These are essentially just placeholders which have a value that can change. In the above loop, the same operation was performed on the variable i, but the value changed from 1 to 2 to 3. Variables in shell scripts can hold numbers or text strings and don’t have to be formally defined as in some other languages. We will commonly use this technique to list files in a directory, then to loop through a series of operations on each file.

If Statements (Conditional statements)

if statements are those which only have a binary yes or no response. For example, we could specify things like:

Notice that in the second if statement, there was a double equals sign (==). This is the programmers way of saying compare the first argument with the second argument. A single equals sign is generally interpreted by a program as assign the first argument to be what is given in the second argument. This use of double operators is very common, notably you will see && to represent the command and, and || to represent or.

A final useful trick to be aware of is the use of an exclamation mark to reverse a command. A good example of this is the use of the command != as the representation of not equal to in a logical test.

Shell Scripts

Now that we’ve been through just some of the concepts & tools we can use when writing scripts, it’s time to tackle one of our own where we can bring it all together.

Every bash shell script begins with what is known as a shebang, which we would commonly recognise as a hash sign followed by an exclamation mark, i.e #!. This is immediately followed by /bin/bash, which tells the interpreter to run the command bash in the directory /bin. This opening sequence is vital & tells the computer how to respond to all of the following commands. As a string this looks like:

#!/bin/bash

The hash symbol generally functions as a comment character in scripts. Sometimes we can include lines in a script to remind ourselves what we’re trying to do, and we can preface these with the hash to ensure the interpreter doesn’t try to run them. It’s presence as a comment here, followed by the exclamation mark, is specifically looked for by the interpreter but beyond this specific occurrence, comment lines are generally ignored by scripts & programs.

An Example Script

Let’s now look at some simple scripts. These are really just examples of some useful things you can do & may not really be the best scripts from a technical perspective. Hopefully they give you some pointers so you can get going.

Don’t try to enter these commands directly in the terminal!!! They are designed to be placed in a script which we will do after we’ve inspected the contents of the script. First, let’s just have a look through the script & make sure we understand what the script is doing.

#!/bin/bash

# First we'll declare some variables with some text strings
ME='Put your name here'
MESSAGE='This is your first script'

# Now we will place these variables into a command to get some output
echo -e "Hello ${ME}\n${MESSAGE}\nWell Done!"

Question

In the above script, there are two variables. Although we have initially set them to be one value, they are still variables. What are their names?

Writing and Executing Our First Script

Let’s create an empty file which will become our script. We’ll give it the suffix .sh as that is the common convention for bash scripts. Make sure you’re in the Practical_1 folder, then enter:

touch wellDone.sh

Now open this using the using the text editor nano:

nano wellDone.sh

Enter the above code into this file setting your actual name as the ME variable. Once you’re finished, you can save and exit the nano editor by hitting Ctrl+x (written as ^X), then confirm saving changes with y, and finally press Enter to return to the terminal. Assuming that you’ve entered everything correctly, we can now execute this script by simply entering

bash wellDone.sh

Setting File Permissions

Unfortunately, this script cannot be executed without calling bash explicitly but we can also enable execution of the file directly by setting the execute flag in the file permissions. First let’s look at what permissions we have:

ls -lh *.sh

You should see output similar to this:

-rw-rw-r-- 1 a1234567 a1234567  247 Aug  17 14:48 wellDone.sh

Question

Interpret the final triplet? What are these permissions indicating, and for whom?

As you can see, the x flag has not been set in any of the triplets, so this file is not executable as a script yet. To do this, we simply need to set the x flag, then we’ll look again using long-listing format.

chmod +x wellDone.sh
ls -lh *.sh

Note how the file now has the x flag set for every user, which means every user can execute this script. Now we can execute the script by calling it using the file path. One of the settings in bash though won’t allow you to execute the file from the same folder, so we need to add the ./ prefix to the script.

./wellDone.sh

We can set each of these flags for all triplets using + to turn the flag on, or - to turn the flag off. If we wanted to remove write permissions for all users we could simply use the command:

chmod -w wellDone.sh
ls -lh *.sh

This can be a very useful trick for write-protecting files!

These flags actually represent binary bits that are either on or off. Reading the permission triplets from right to left:

  1. the first bit is the execute flag, which has value 1
  2. the second bit is the write flag, which has the value 2
  3. the third bit is the read flag, which has the value 4

Thus each combination of flags can be represented by a single integer, as shown in the following table:

Value Binary Flags Meaning
0 000 --- No read, no write, no execute
1 001 --x No read, no write, execute
2 010 -w- No read, write, no execute
3 011 -wx No read, write, execute
4 100 r-- Read, no write, no execute
5 101 r-x Read, no write, execute
6 110 rw- Read, write, no execute
7 111 rwx Read, write, execute

We can now set permissions using a 3-digit code, where 1) the first digit represents the file owner, 2) the second digit represents the group permissions and 3) the third digit represents all remaining users.

To set the permissions for our script to read-write-executefor you and any other users in the group you belong to, we could now use

chmod 774 wellDone.sh
ls -lh *sh

Question

What will the final 4 in the above settings do?

Modifying our script

In the initial script we used two variables ${ME} and ${MESSAGE}. Now let’s change the variable ${ME} in the first line of the script to read as ME=$1. First we’ll create a copy of the script to edit, and then we’ll edit using nano

cp wellDone.sh wellDone2.sh
nano wellDone2.sh

Once you’ve finished editing, you can save and exit nano as before.

We may need to set the execute permissions again.

chmod +x wellDone2.sh
ls -lh *sh

This time we have set the script to receive input from stdin (i.e. the terminal), and we will need to supply a value (as we have done when passing an argument to a command), which will then be placed in the variable ${ME}. Choose whichever random name you want (or just use “Boris” as in the example) and enter the following

./wellDone2.sh Boris

As you can imagine, this style of scripting can be useful for iterating over multiple objects. A trivial example, which builds on a now familiar concept would be to try the following.

for n in Boris Fred; do (./wellDone2.sh ${n}); done

As a good example, this script could summarise key features in a file. Then we could simply pass the script multiple files using this strategy, and write the output to another file using the > symbol.

Using for Loops

Here’s an example of a script which uses a for loop.

#!/bin/bash

FILES=$(ls)

COUNT=0
for f in ${FILES};
  do
    ((COUNT++))
    ln=$(wc -l ${f} | sed -r 's/([0-9]*).+/\1/g')
    echo "File number ${COUNT} (${f}) has ${ln} lines"
  done

Task

Save this as a script in the Practical_1 folder called lineCount.sh. Add comments where you think you need them to make sure you understand what’s happening.

For this particular task, do you really need the /g in the sed command?

A More Advanced Script

In this section we’ll write a script for the dm6:ncrna fasta file. Briefly inspect the file before checking the script to remind yourself what it looks like. We’re going to extract some key information from those sequence header lines.

head Drosophila_melanogaster.BDGP6.ncrna.fa

Now let’s look through the following script before we run it. There is a lot of extra information here!

#!/bin/bash

INFILE=$1

# Check the file has the suffix .fa or .fasta
SUFFIX=$(echo ${INFILE} | sed -r 's/.+(fasta|fa)$/\1/')
if [ ${SUFFIX} == "fa" ] || [ ${SUFFIX} == "fasta" ]; then
  echo File has the suffix ${SUFFIX}
else
  echo File does not have the suffix 'fa' or 'fasta'. Exiting with error.
  exit 1
fi

# Define the output file by changing the suffix to .locations
OUTFILE=${INFILE%.${SUFFIX}}.locations
echo Output will be written to $OUTFILE

# Get the header lines which correspond to chromosomes, then collect the
# gene id, chromosome, start and end and write to the output file
egrep '^>.+chromosome' ${INFILE} | \
  sed -r 's/.+BDGP6:(.*):([0-9]+):([0-9]+):.+gene:([^ ]+).+/\4\t\1\t\2\t\3/g' \
  > ${OUTFILE}

echo Done

Save this as the file getLocations.sh and make it executable using chmod +x Now run it passing the .fa file as the first argument.

./getLocations.sh Drosophila_melanogaster.BDGP6.ncrna.fa

To test regular expressions regex the following web sites may be useful here and here.