Advanced Sequencing Technologies and Applications course: Variant discovery practical
This tutorial will starts with a set of FASTQ files and a reference genome FASTA file and walks through all necessary steps to generate alignments, variant calls and interrogate those calls. All the necessary files are part of the gkno tutorial resources, so no additional files need to be downloaded.
We will be making use of these help messages as we progress through this tutorial.
Looking through all the available pipelines, we can see that 'bwa-index' does exactly what we need. We can get help on specific pipelines by using the -h
argument after specifying the pipeline of choice.
This help tells us a number of things about the pipeline. Firstly, the tools that are being run (in this case, there is only a single tool) and then a list of parameter sets
. These can be very useful, but will not be used in this tutorial. Finally, we have a list of all of the command line arguments that can be used with this pipeline. For simplicity, gkno
tries to use the same arguments for similar uses across all the pipelines (e.g. --fasta-argument (-r)
is used across all pipelines when a FASTA reference file is being specified).
In general, output files do not need to be specified. If no value is provided, gkno will generate an output filename based on input filenames. So, to generate the index file we need, we only need to supply the name of the reference FASTA.
- gkno bwa-index -r chr20_fragment.fa
Now take a look in this directory (ls
) and you will see a whole set of new files that will let us use bwa
for aligning our FASTQ files.
We will now go ahead and align the reads using bwa
. We will be using the mem
option within bwa
to generate split read alignments. We are going to use the full human reference for these alignments, not the reference we just created, so please ensure that you have the human resource bundle downloaded.
- cd ../analysis
- gkno bwa -h
Let's start by aligning the sample 'NA12878'. We will use the human parameter set using -ps human. This will pull the human reference from the gkno resources without us having to manually specify it.
- gkno bwa -r $REF -p ILLUMINA -s NA12878 -id NA12878 -q ../fastq/NA12878_1_sample.fastq -q2 ../fastq/NA12878_2_sample.fastq
If we look at what we just created, we can see that the output files have long, cumbersome names. This is because gkno wasn't given any output names so built the names based on the input files and the tasks that were performed. It would be nicer to have more succinct names, so let's redo thie alignment (we can get away with doing this because these files are so small. Don't do this with real files!!!).
IMPORTANT - We are going to delete all the files, so make sure you are working in the 'analysis' folder that we created. Do not delete all files if you are working in a different directory.
- rm -f *
- gkno bwa -r $REF -p ILLUMINA -s NA12878 -id NA12878 -q ../fastq/NA12878_1_sample.fastq -q2 ../fastq/NA12878_2_sample.fastq -o NA12878.bam
Now, we have created three different files here. First, we aligned the reads, marked duplicates and sorted the BAM file NA12878.bam. We then indexed this BAM file (NA12878.bam.bai) and generated statistics on the alignments (NA12878.stats). Let's align the other two samples. We will use some bash magic to do both of these with one command.
- for NAME in NA12891 NA12892; do
- gkno bwa -r $REF -p ILLUMINA -s $NAME -id $NAME -q ../fastq/$NAME\_1_sample.fastq -q2 ../fastq/$NAME\_2_sample.fastq -o $NAME.bam
What did we just do? The first line created a loop over a list of values (NA12891 and NA12892 in this case). At each step in the loop, the variable NAME is given the current value. The last line (done) terminates this loop. For each value in the loop, we then ran the gkno bwa pipeline using the variable NAME where required. Note that we used the escape character \ in the names of the FASTQ files to make it clear where the variable name ends and we needed to prepend the variable NAME with a $ to use it.
Check the BAM files
The first thing we can do is just look at the alignment statistics that gkno generated.
We can see that the alignments are good! Over 99% of the reads are mapped and roughly evenly between the forward and reverse strand. We also have a very low percentage of duplicate reads, which is definitely good for whole genome sequence data. But what is our average depth? We can use the gkno coverage pipeline to generate plots of the coverage by chromosome. Since the data we are working with only has sequences on chromosomes 17, 20, 21, and 22, we only want our plots to include these sequences, so we will generate a file containing these reference sequences first.
- echo 20 > samples.txt
- echo 21 >> samples.txt
- echo 22 >> samples.txt
- gkno coverage -i NA12878.bam -s samples.txt
Even for this small amount of data, this command takes quite a long time. But we can get at this data much quicker by using iobio
. If you are working on a local computer, use the 'choose bam file' button and navigate to the data. You will need to select both the BAM file and its index together. If working on the cloud, you can use the 'choose bam url' to specify the url of one of the BAM files we just created.
Since the BAM files we created are very small and only have coverage in a couple of places in the genome, there aren't enough reads to generate statistics genome-wide. We can use the top 'Read Coverage' panel to select chromosome 20 and the zoom in on the area where alignments exist. We now generate statistics including the percentage of mapped reads etc in the region. We can now say that we have an average coverage of ~50X.
Just for comparison, let's look at a regular whole genome BAM file. Reload the bam.iobio
web page and click 'choose bam url' and click 'Go' on the default file. This is a 1000 Genomes BAM file for the NA12878 sample. bam.iobio
is sampling the BAM file in real time and generating the statistics based on this sampling. Instead of waiting for possibly hours to generate these statistics, we have them in seconds. We can get a good sense of important statistics like duplicate rates and quality score distributions immediately using this app.
We can also use bam.iobio
to look at exome BAM files. Reload the web page, click 'choose bam url' and add the url https://s3.amazonaws.com/iobio/1000gSV/NA19238.mapped.ILLUMINA.bwa.YRI.exome.20130415.bam
, then click 'Go'. We see the same statistics as before, but the 'Read Coverage Distribution' is not that useful. Since the majority of the genome has no coverage from exome sequencing, the distribution is heavily weighted towards very low read counts. To generate statistics only in relevant regions, you can upload a BED file, or just use the default BED supplied. Do this by clicking 'Default Bed' in the upper right corner of the top 'Read Coverage' panel. Now the 'Read Coverage Distribution' looks more like what we'd expect.
Now let's use gkno to call variants on our BAM files. We will use the freebayes pipeline to do this.
There are a huge number of commands available for freebayes
, but we can safely use the default values. We will again use the human parameter set to pull in the human reference genome FASTA file.
- gkno freebayes -r $REF.fa -i *bam -o all.vcf.gz
So that took a little while to generate our results. It would be better if we could do it faster! Let's redo that calculation, but speed it up by parallelising the calling over different chromosomes. All you need to do is tell gkno some genomic regions to look at and it will parallelize the calling over these regions and automatically merge, compress and index the final files.
- rm -f f*
- rm -f all*
- gkno freebayes -r $REF.fa -i *bam -o all.vcf.gz -rg 17 -rg 20 -rg 21 -rg 22 -nj 4
The -nj or --number-jobs command tells gkno to keep up to four separate processes running at a time. This time, we saw that the calculation was faster. To review the format of VCF files, click the link below.
The header describes many aspects of the contents of a VCF file. The INFO and FORMAT lines are particularly important as they describe how information is encoded in each VCF record for the variant in general (INFO) and the samples (FORMAT). For example, the allele frequency of each variant is described by the 'AF' attribute in each variant record.
Variant coordinate (CHROM and POS)
For each variant record, the first two columns indicate the genomic coordinate of the variant.
Variant id (ID)
The third column shows the rsid (reference SNP id). This is the generated by comparing the VCF with the dbSNP database, and is absent unless the VCF file has been specifically annotated with this information.
Alleles (REF and ALT)
The REF (reference) allele is the allele as it appears in the reference genome which was used for variant calling. The length of the reference allele is dependent on the type of variant being described. A SNP will have a 1bp reference allele, whereas a reference relative deletion will have have a reference allele that encompasses all of the deleted bases in the mutation.
The ALT (alternate allele) is a comma separated list of all variants seen at this locus. It is possible that there are multiple alternate alleles; for example, consider microsatellites. We will discuss multi-allelic sites and complex alleles later in this tutorial.
Variant quality (QUAL)
The PHRED score for the variant. Often, these scores are not well calibrated and do not have the probabilistic interpretation usually ascribed to PHRED values. But, higher quality variants will generally have larger quality scores, so these are still of value when trying to filter out low quality variants.
Often, post-processing on VCF files is performed to filter out variants with specific properties. After such filtering, the FILTER field is populated with values such as 'PASS' to indicate passing variants, or, e.g. QUAL<2, ro indicate variants that failed the specified filter.
Information on the variant (INFO)
A host of key, value pairs in the form 'KEY=VALUE'. In a well-formed VCF (not all VCF files are!), each KEY will have an INFO line in the header explaining what it describes and the format of the value. There are a set of reserved KEYS (e.g. DP which describes the depth across all samples).
Sample information description (FORMAT)
The FORMAT field describes the information that is stored for each sample in the final columns of the variant record, in the form A:B:C etc. Each key (A, B, C etc.) should have a FORMAT line in the header describing the information it encodes.
Every sample in the VCF file is given a column containing the information described in the FORMAT line.
Check the VCF files
The quickest way to get a global view of our variant calls is to use vcf.iobio
. In your web browser, go to the vcf.iobio
website and select 'choose vcf file' (or 'choose vcf url' if your file is on the cloud). Select the compressed VCF file (this will have the extension vcf.gz
) and the index file (with extension vcf.gz.tbi
The application shows information only in chromosomes 17, 20, 21, and 22, and also shows that the density of variants is restricted to very small regions. If we knew nothing about this VCF file until now, we get a great deal of information from this simple view of the data. This is a great way to see if the data is what you expected it to be. Is the sample supposed to be female, but has lots of calls on the Y chromosome? Is this supposed to be whole genome data, but large regions of the genome have no variants? The information supplied here gives a quick and easy way to ask these questions.
To see vcf.iobio
showing information on the entire ExAC
database, click here
. It is clear to see features like the centromere on chromosome 1 where no calls have been made. Also, metrics like allele frequency and the distribution of base changes are available.
Manipulate VCF files
We will take a look at two packages that are useful for working with VCF files, vt
. Both of these packages can perform all sorts of different actions, so we'll play with just a couple. These tools were actually used in the gkno freebayes
pipeline, but we'll use the tools directly here. To being, we'll create an 'alias' for the vt
executable and an environment variable to point to the vcflib
directory where it's executables live.
- alias vt="$GKNO/tools/vt/vt"
First, let's use vt
to view a specific portion of our VCF file and include the header.
- vt view -h -i 17 all.vcf.gz | less -S
Use the space bar and u
to navigate up and down through the VCF file by pages, or the up and down cursor keys to move up and down line by ine. The left and right cursor keys will let you move left and right through the whole VCF records. Let's look at a couple of variants that may need to be modified depending on your analysis goals. First, we can pull out all 'complex' alleles from the VCF file. For more information on filtering rules in vt
, see the help
provided for the tool.
- vt view -f "INFO.TYPE=='complex'" all.vcf.gz | less -S
The first variant that we see is 'TGGGT' changing to 'GGGGG'. This is really a combination of two T -> G SNPs, separated by three reference matching G's. When analysing all the reads, freebayes
considers a genomic window and looks for all observed haplotypes in a window. Instead of reporting this single allele, freebayes
could have reported the two SNPs, but by reporting this way, we know that the two SNPs are phased, i.e. there are no samples that have only one of the SNPs. Chromosomes either have both alternate alleles, or both reference alleles. There may be times, though, when we want these variants to be written as separate SNPs.
- vt view -h -f "INFO.TYPE=='complex'" all.vcf.gz | $VCFLIB/vcfallelicprimitives | less -S
Now we can see that the single variant has been replaced with the two SNPs. Let's look at one more record.
- vt view -f "N_ALLELE>2" -i 22:39636944 all.vcf.gz | less -S
By scrolling to the right, we can see that the three samples have genotypes '1/2', '1/2' and '1/3'. So all of our samples are heterozygous and none of them have the reference allele in any chromosome. '1' refers to the first alternate allele, '2' the second alternate etc. We can break these multi-allelic sites into multiple lines - one for each allele.
- vt decompose -i 22:39636944 all.vcf.gz -o decomposed.vcf.gz
- less -S decomposed.vcf.gz
After we ran the decompose command, we get a summary. In this case a single multi-allelic variant was broken down. If we look at the variants in the file decomposed.vcf.gz, we can see that the single variant with three alternate alleles now appears as three different variants at the same genomic coordinate. You may also notice that the prepresentation of the alleles is unusual. The first variant is a deletion, but the second variant is SNP, but the base that is mutated is the last base in a five base sequence. If we perform a comparison between different VCF files, representing a SNP in this way will almost certainly mean that even if the two files being compared share the SNP, the comparison would not fail. It is extremely important that when comparing VCF files, we have a common, unambiguous representation of the alleles. We can normalize the variants to achieve this.
- vt normalize -r $REF.fa decomposed.vcf.gz 2>/dev/null | less -S
The 2>/dev/null is to stop the summary from the tool being written to screen. Now we have a simple deletion and two normal looking SNPs! Normalizing variants is an important step that should not be forgotten!
Subsetting on samples
If we perform joint calling with a large population of samples, we will often have a VCF file that contains hundreds or thousands of samples. But what if we're only interested in one or two of these samples?
There are two useful tools that we might want to use here, depending on our ultimate goal. The first tool, vcflib, will just keep the samples that we want, but not change the number of variant records in the file.
- $VCFLIB/vcfkeepsamples all.vcf.gz NA12878 > NA12878.vcf
- gkno compress-vcf -i NA12878.vcf
We have generated the new VCF file, NA12878.vcf.gz, which is exactly the same as all.vcf.gz, except that there is only one sample column: that for the sample NA12878. We could have listed as many samples as we liked in the command line, and each of those would have been kept.
What is the problem with this? The main problem here, is that we often want a VCF file that contains information about the requested samples, but only keeps variants that are polymorphic in those samples. For example, let's take a look at this new file in a bit more detail.
- less NA12878.vcf.gz | grep "0/0" | less -S
Every one of these lines is a variant that the sample NA12878 does not have - the genotype is 0/0, or homozygous reference. So what if we want to only keep the sample information for NA12878, but additionally, remove all homozygous reference variants?
- echo "NA12878" > samples.txt
- vt subset -s samples.txt all.vcf.gz -o NA12878_only.vcf.gz
- less NA12878.vcf.gz | grep -v "^#" | wc -l
- less NA12878_only.vcf.gz | grep -v "^#" | wc -l
With those last bash commands, we can see that the NA12878_only.vcf.gz file has 119 less variants than NA12878.vcf.gz. Let's just check that that makes sense.
- less NA12878.vcf.gz | grep "0/0" | wc -l
Exactly 119 variants kept in NA12878.vcf.gz were homozygous reference and were removed by using the vt subset command. Again, we can subset on multiple samples by adding more sample IDs to the samples.txt file.