The goal of this page is to highlight a number of tools that can be used to view and manipulate sequence alignment (BAM) files. These tools are all included in the gkno package, so no additional tools need to be downloaded to work through these exercises.
The command NAME=VALUE
means that we have created an environment variable with the name NAME
which contains the value VALUE
. Whenever we use the variable, we need to prepend $
to it and it will assume the value we gave to it. From now on, every time we need to access the directory where gkno
is installed, we will just use the environment variable $GKNO
. User-defined environment variables like this only exist for the duration of the session. If you log out and log in again, the environment variable will no longer exist. There are ways to make them permanent, but that is beyond the scope of this tutorial.
Next, we will use the bash 'alias' command to define some simple commands on the command line. This command lets us assign an operation to a command of our choice. In this case, we will link the command samtools to the samtools tool executable and, similarly, use the bamtools command to link to the bamtools tool executable.
- alias samtools="$GKNO/tools/samtools/samtools"
- alias bamtools="$GKNO/tools/bamtools/bin/bamtools"
bamtools and samtools
We will be using the tools bamtools
as part of this tutorial. These tools are common tools for working with BAM files and have a lot of overlap. For most basic tasks, you can use either one of these tools. There are certain functions that can only be performed by one or the other, so we will make ourselves familiar with both. There are still other tools that perform additional tasks that we are not looking at here.
To see help for samtools, type:
You can see that there are a lot of subcommands for samtools. Each of these performs a different task, some of which we will go through in this tutorial. To get help on individual subcommands, simply include the subcommand you want to know about.
- samtools view
- samtools sort
The subcommand help gives you the format of the command line expected by samtools, including the input and output files as well as a list of all of the additional arguments that can be selected along with a brief description of their role. We can see more detailed help with the following command:
Similar to samtools, to see the available commands for bamtools, type:
To see help for the bamtools subcommands, type:
- bamtools convert help
- bamtools count help
Again, like samtools, a description of the tool and the arguments that can be supplied is given.
The first thing that we probably want to do with the BAM file is look at it. Since the BAM file is a compressed binary format, using standard bash commands like 'more' or 'less', won't work.
- samtools view sample.bam | less
The | character 'pipes' the output from the samtools command into the bash command. The command less prints the results of the samtools command to screen. This is really difficult to read, since the lines are longer than the screen width and 'wrap' around. Press 'q' to terminate this view, and try the same command as above, but include -S. This makes the lines far easier to read.
- samtools view sample.bam | less -S
You can type -S and press return while the less view is open and you can toggle between these two views.
The BAM header
You probably noticed that we didn't see the BAM header in the previous views. It is there, we just need to tell samtools that we want to see it.
- samtools view -H sample.bam | less -S
Including -H now tells samtools to only view the header and so this is all we see. If we instead use -h, we will see the whole BAM file, header, alignments and all.
- samtools view -h sample.bam | less -S
You may have also noticed that the BAM file is not sorted. Aligners typically take FASTQ files and try to align them to the reference sequence. The resulting BAM file contains all of the alignments in the same order that they appeared in the FASTQ file, which is completely random in terms of genomic coordinates. For most downstream tools (e.g. iobio
or variant callers), the BAM file needs to be sorted by genomic coordinates. We can easily do this:
- samtools sort sample.bam sample_sorted
Note that samtools only wants a prefix for the output file. If you set the output file as sample_sorted.bam, you will have generated a file called sample_sorted.bam.bam!! Now let's take a look at the new BAM file and check that it's sorted properly.
- samtools view -h sample_sorted.bam | less -S
You can see that the alignments are now in coordinate order, but you can also see that the first line in the BAM header states (with the SO tag in the @HD line), that the Sort Order is 'coordinate' as required.
Now we have a sorted BAM file, but this usually isn't enough for most tools. For example, both iobio
want to look at specific regions within the BAM file and don't want to have to search through the entire file to find it, since the search would take an extremely long time. To achieve this, BAM files need to be indexed. This index provides tools a map of the data in the BAM file, so that they can quickly jump to the regions of interest. So, let's index our BAM file:
- samtools index sample_sorted.bam
If we look in our directory (ls), we will see the index file with the extension '.bai'. Let's exploit this index to see the reads that fall into the region '20000-30000' on chromosome 20.
- samtools view sample_sorted.bam 20:20000-30000 | less -S
What if we want to know how many reads fall into this region?
- samtools view sample_sorted.bam 20:20000-30000 | wc -l
Remember to not include the header in the above command, since the wc -l command just counts lines. We want to count the number of alignments and not the total number of lines in the BAM file.
The BAM flag
The BAM flag (the second column) encodes lots of useful information, but isn't particularly readable by humans. We can use samtools
to convert the bitwise flag into a list of characters. This is more readable than the simple number, but is still not great. A better way of viewing the meaning of the flag is here
, where you can type in the number and see a full explanation of the flag.
What if we want to look at all the reads in the BAM file that are marked as being in a pair. We can use the following command to do this:
- samtools view -f 0x1 sample_sorted.bam | less -S
You will see that most of the reads have flags '99' or '163'. If we look at the flag explanations here
, we can see that both of these flags include the 1 (or 0x1) bit referring to the reads being paired. While this is functional, it isn't that easy to use. This is somewhere where we can use bamtools
more effectively. We can generate the same list of reads with the command:
- bamtools filter -isPaired true -in sample_sorted.bam | samtools view - | less -S
There are a few differences between the samtools and the bamtools command lines. Firstly, bamtools requires that the input BAM file is identified with the -in argument. Secondly, by default, 'samtools view' outputs an uncompressed SAM file, but bamtools will only output a SAM file when instructed to do so with the bamtools convert -format sam command. So in the above command, we pipe the output of the bamtools command to samtools for viewing. We can check that these two commands are equivalent by counting the number of reads:
- samtools view -f 0x1 sample_sorted.bam | wc -l
- bamtools filter -isPaired true -in sample_sorted.bam | samtools view - | wc -l
This is much easier to see what we are filtering on - isPaired makes more sense than 0x1! The command bamtools filter help gives a list of all the different filters that can be applied, which are more readable for people!
is a very useful tool for looking at alignments together, but sometimes you just want a quick look at the data in a particular region. For example, we might want to see all the reads spanning a particular exon. Let's just quickly take a look at the reads that fall into the range 71,800 - 71,900 on chromosome 20.
- samtools tview -p 20:71800-71900 sample_sorted.bam chr20_fragment.fa
We can clearly see an insertion of the sequence 'ACCAGT' in most of these reads. Lower case letters (or ',') represent alignment to the reverse strand, whereas uppercase letters (and '.') represent alignment to the forward strand. Press ? to view options for changing the view in tview, for example changing the colour scheme, or showing all bases, not just those that differ from the reference. Press q to quit out of tview.
Having completed the tutorial, we can clean up all the files we created.
The flag -r performs a recursive delete which is required for deleting directories.