Advanced Sequencing Technologies and Applications: iobio tutorial
The goal of this page is to highlight the iobio real-time analysis and visualization suite. We will discuss the philosophy of the project and interactively showcase the available iobio apps. For more information on iobio, visit the website and, in particular, the blog.
What is iobio?
iobio is a platform on which web-based genomic analysis apps can be built. The apps are designed around real-time, visually driven analysis of data, which promotes an exploratory approach to data analysis.
Humans are very visual animals. We are very good at spotting patterns in data, and quickly developing questions and theories about data, if we can only see the data presented in an intuitive manner. With iobio, we are trying to provide such visualizations for a broad set of genomic analysis problems. As we proceed through this tutorial, we will see the different iobio apps and how we present and interact with our data.
To truly explore our data, we need to be able to visualize it very quickly. In fact, our goal is to visualize data immediately after it is selected. We then want an analysis loop in which we can form questions about the data, change analysis parameters, and immediately see the results of these changes. With such quick turnaround, we can dig into and understand our data very quickly.
Genomic data is big though, so how can we do this? Traditional methods perform a "global" analysis. We instead use two different methods, depending on the analysis being performed.
Our first analysis method is to sample data. Rather than look at everything, for some tasks, we can estimate statistics by sampling across the genome.
These sampling methods are well suited to checking the quality of data, for example, asking whether we believe that a BAM file consists of high-quality data, or whether we believe that there were problems with the sequencing experiment. Currently, the bam.iobio, vcf.iobio, and taxonomer.iobio apps are based on a sampling approach.
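The idea of estimating a statistic from a sample can be sketched in a few lines of Python. This is a minimal illustration using simulated per-base depths, not the code that bam.iobio runs; the function name, window sizes, and data are all assumptions made for the example.

```python
import random

def estimate_mean_coverage(coverage, n_windows=200, window_size=100, seed=0):
    """Estimate mean depth by averaging randomly placed windows."""
    rng = random.Random(seed)
    last_start = len(coverage) - window_size
    total = 0
    for _ in range(n_windows):
        start = rng.randint(0, last_start)
        total += sum(coverage[start:start + window_size])
    return total / (n_windows * window_size)

# Simulated per-base depth for a 200 kb toy "genome" (made-up data, roughly 20X).
_rng = random.Random(42)
genome_coverage = [max(0, round(_rng.gauss(20, 4))) for _ in range(200_000)]

true_mean = sum(genome_coverage) / len(genome_coverage)
sampled_mean = estimate_mean_coverage(genome_coverage)
print(f"true mean: {true_mean:.2f}, sampled estimate: {sampled_mean:.2f}")
```

Here only 10% of the positions are read, yet the estimate lands close to the true mean. On a real genome the sampled fraction can be far smaller, which is what makes an immediate first view of the data feasible.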
The next method is to read all data, but in a small region.
As an example, if we want to search for genetic variants in a child with a known phenotype, it makes sense to start by looking only at genes that are known to be associated with the phenotype, rather than considering all variants from the outset. The gene.iobio tool uses this methodology.
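A region-restricted analysis can be sketched as a simple interval filter. The variants and gene coordinates below are made up for illustration, and the helper name is hypothetical. In practice an index (for example a tabix or BAI index) lets a tool fetch only the requested region instead of scanning a whole file as this toy example does.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str

def variants_in_region(variants, chrom, start, end):
    """Return variants on `chrom` with start <= pos <= end (1-based, inclusive)."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]

# Made-up variants and a made-up gene interval, for illustration only.
all_variants = [
    Variant("chr1", 1_500, "A", "G"),
    Variant("chr17", 10_250, "C", "T"),
    Variant("chr17", 10_900, "G", "A"),
    Variant("chr17", 55_000, "T", "C"),
]
gene_region = ("chr17", 10_000, 11_000)  # hypothetical coordinates

candidates = variants_in_region(all_variants, *gene_region)
print(candidates)
```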
Using iobio tools
For the remainder of this tutorial, we will work through some examples together, using iobio tools to understand the data. As we step through the tutorial, open the relevant iobio app, play with the data, and see whether you notice anything interesting about the datasets.
Let's start by looking at a BAM file hosted on AWS. Follow this link and click "choose bam url", then enter the following URL: https://s3.amazonaws.com/iobio/samples/bam/Sample12.bam
This is what we are looking at.
A chromosome selector,
The average coverage across the genome,
The read coverage distribution,
The fragment length distribution,
A host of other useful statistics
Is there anything about this file that doesn't look as you might expect? For this BAM file, the read coverage distribution appears to be Poisson distributed, as expected, and has an average coverage of ~20X. This is always a good place to start when checking that the file is as expected.
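To get a feel for what a Poisson-shaped coverage distribution with a mean of about 20X implies, the expected distribution can be computed directly. This is a small sketch under the assumption that depth is exactly Poisson with mean 20; real data are usually somewhat more dispersed, so treat the numbers as a reference shape rather than a prediction.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing depth k under a Poisson model with the given mean."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

mean_depth = 20  # approximate average coverage reported for this file
expected = {k: poisson_pmf(k, mean_depth) for k in range(0, 61)}

most_likely_depth = max(expected, key=expected.get)
fraction_below_10x = sum(p for k, p in expected.items() if k < 10)
print(f"most likely depth: {most_likely_depth}X")
print(f"expected fraction of bases below 10X: {fraction_below_10x:.4f}")
```

If the observed histogram has a much heavier low-coverage tail than this model suggests, that is a hint that something about the sample or the sequencing run deserves a closer look.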
Does the mapping rate seem to be as expected?
What could cause the value to be this low?
If so many reads do not appear to come from the human genome, where do they come from? We can use the metagenomics algorithm, Taxonomer, to identify what was present in this DNA sample, using taxonomer.iobio. (Note that there is a commercial version of this tool available at taxonomer.com.) This tool takes the raw FASTQ files as input. Since a FASTQ file is unordered, simply reading through the file from the start is essentially equivalent to sampling reads.
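The point about unordered FASTQ files can be illustrated with a short sketch: taking the first N records of an unordered stream gives a usable sample without reading the rest. The parser below is a simplified, hypothetical helper that assumes four lines per record, and the FASTQ content is made up; it is not how taxonomer.iobio is implemented.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

# Tiny made-up FASTQ standing in for a large remote file.
fastq_text = "".join(
    f"@read{i}\nACGTACGTAC\n+\nIIIIIIIIII\n" for i in range(1000)
)

# Because read order carries no information, the first N records
# behave like a random sample of the whole file.
sample = list(islice(read_fastq(io.StringIO(fastq_text)), 100))
print(len(sample), sample[0])
```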
Open the taxonomer.iobio app using this link, then choose "Enter URL for FASTQ/FASTA(s)" and finally copy and paste the following URL: https://s3.amazonaws.com/iobio/samples/fastq/unmapped_fragment.fastq
When the pie chart appears, click on "Bacteria" - the orange portion of the pie chart.
What can we conclude about this sample?
If you have thousands of BAM files and want to compare them all, bam.iobio is not the most convenient tool. Alternatively, maybe you have one or two samples, but you want to know how they compare to the state of the art in sequencing. One of the projects we are currently funded to work on is multibam.iobio. This allows us to compare and interrogate many samples together and to identify samples that appear to be outliers.
Now that we have determined the origin of this DNA sample (a spit sample, rather than a blood draw), we can look at statistics associated with the variants in the VCF file. To do this, we can use the vcf.iobio tool. Open this link and paste in the following URL:
We can now see statistics for the sample under study.
gene.iobio is an app designed to allow you to investigate variants in a disease-affected child, primarily when we also have genetic data for the parents (and maybe siblings). In this section of the tutorial, we will step through some of the functionality that gene.iobio offers.
To begin, open the application here. We have preselected a demonstration dataset for use in this tutorial.
Tutorials and interesting articles can be found in the iobio blogs, which are constantly updated. You can always refer to these or get to the tutorials directly from gene.iobio.
In this example, we are looking at the RAI1 gene, showing variants in a proband and her parents. (How can we use gene.iobio to determine that the proband is female?) We can now work through the genes MYLK2, PDGFB, PDHA1, and AIRE to demonstrate other features of gene.iobio.
The blog post on ClinVar shows how we can use gene.iobio to look at ClinVar variants.
Using Rufus with gene.iobio
Rufus is a reference-free algorithm that compares samples to identify sequence reads in each individual that contain unique sequence. When used to compare a child's reads with the combined reads of the parents, this method identifies de novo mutations in the child. The VCF generated by this method thus contains a very small number of variants (~100) for the child. Of these, only a subset will hit gene regions. Below is a clinical example, showing a structural variant (duplication) in the proband.
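The notion of reads that "contain unique sequence" can be illustrated with a toy k-mer comparison. The sketch below assumes a plain k-mer set difference as a stand-in for the idea; it is not Rufus's implementation, the function names are hypothetical, and the reads are made up. Real data would also need count thresholds to separate true variants from sequencing errors.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmers_from_reads(reads, k):
    """Return the union of k-mers over a collection of reads."""
    out = set()
    for read in reads:
        out |= kmers(read, k)
    return out

def reads_with_unique_kmers(child_reads, parent_reads, k=5):
    """Return child reads containing at least one k-mer absent from the parents."""
    parental = kmers_from_reads(parent_reads, k)
    return [r for r in child_reads if kmers(r, k) - parental]

# Made-up reads: the child's second read carries a single-base change (G -> T).
mother = ["ACGTACGGATCC", "TTGACCGGTAAC"]
father = ["ACGTACGGATCC", "TTGACCGGTAAC"]
child = ["ACGTACGGATCC", "TTGACCTGTAAC"]

candidates = reads_with_unique_kmers(child, mother + father)
print(candidates)
```

Only the read carrying the change is reported, because every k-mer in the other read is also present in the combined parental reads. No reference genome is consulted at any point, which is the sense in which such a comparison is reference-free.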