How to build a pipeline configuration file
We are going to build a new pipeline configuration to create a new pipeline based on the bwa pipeline. The pipeline we build is going to be the bwa-multi pipeline, so the complete version of this pipeline can always be referred to if you get lost or confused! As part of the guide, we will work through some execution features of gkno and highlight some common errors.
The bwa pipeline takes an arbitrary number of pairs of fastq files and aligns them as paired end reads using bwa mem. If n pairs of fastq files are supplied, then n bam files will be produced by the pipeline. The assumption is that each pair of fastq files corresponds to a single sample. If the n pairs of files correspond to different lanes for a single sample, then we want each pair to be aligned separately, but once complete, all the individual bam files need to be merged into a single file. Here, we will create a new pipeline to perform this function.
How to execute the bwa pipeline
Before we build the new pipeline, let's try and run the bwa pipeline. We will use tutorial data to illustrate the steps as we go. So, to begin, let's set up some paths. From a suitable working directory, type the following:
  • mkdir gkno_dev
  • cd gkno_dev
  • GKNO=<path to gkno>
  • RES=$GKNO/resources/tutorial/current
  • gkno=$GKNO/gkno
Note that <path to gkno> above should point to the gkno_launcher directory. Ok, now let's run bwa:
  • gkno bwa -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o out.bam
Let's go over each aspect of the command line. gkno bwa executes the bwa pipeline, and then we set a number of options. The -ps test option selects the test parameter set. See here for more details on parameter sets; in our case, it sets the reference genome to align against. Next, -q $RES/*_1.fq defines the first-mate paired-end read files, using the path we defined earlier and picking all files in that folder that end with _1.fq. We similarly set the second-mate files using -q2. gkno will throw an error if a different number of files is given to each argument, but will do some string comparisons to determine which file passed to -q matches which file passed to -q2. Finally, -o out.bam defines the name of the output bam we wish to create. Now run the pipeline.
You will no doubt have noticed that this fails! The error message may not be totally clear, but this is basically what has happened. We have three pairs of fastq files, so gkno expects there to be three output files. Since we have only specified one, we would have three alignments all pointing to the same output file, which would be bad. So, we need to supply gkno with three output files. We can either supply the -o argument three times on the command line:
  • gkno bwa -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o 1.bam -o 2.bam -o 3.bam
Or we can just provide three values to the -o argument:
  • gkno bwa -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o 1.bam 2.bam 3.bam
Or, finally, we can create a file listing the three output files, give it the extension .list, and supply that file:
  • echo 1.bam > outputs.list
  • echo 2.bam >> outputs.list
  • echo 3.bam >> outputs.list
  • gkno bwa -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o outputs.list
If you try and rerun the above commands without deleting the finished bam files first, gkno will not execute, telling you that the files already exist; you will need to delete all the files (rm -f *) after each execution. If the fastq files changed between executions, gkno will see this, determine that the inputs are newer than the outputs, and so rerun the pipeline and overwrite the original files.
One last useful piece of information! Since we are running three alignments, the pipeline will finish quicker if we run three parallel jobs. We can set the maximum number of parallel jobs with the -nj argument, so putting -nj 3 at the end of the command line really speeds things up!
Start building a new pipeline
Ok, we can run the bwa pipeline, so now let's build a new pipeline. We want this to do almost the same thing as bwa, but merge the three bam files together at the end. I'm going to assume that you are familiar with vim for the remainder of this guide, but if that is not the case, use whichever text editor you are comfortable with. We will use the configuration file for the bwa pipeline as a starting point, so let's make a copy of it, call it new-bwa, and store it in a new directory.
  • rm -f *
  • mkdir config
  • cp $GKNO/config_files/pipes/bwa.json config/new-bwa.json
If you try and run this new pipeline, though, it will fail.
  • gkno new-bwa
The reason is that gkno knows where all its configuration files are stored, and it can't find anything for the new-bwa pipeline. We can tell gkno to look for configuration files elsewhere, though, so this isn't a problem.
  • gkno new-bwa -cp config
Now the pipeline runs, and asks us to provide the necessary information. You can see all arguments that can be applied regardless of pipeline using the command gkno -ga. Ok, let's start editing the configuration file using your editor of choice. To start, we will edit some top level details of the pipeline. This is how it should look to start.
The first changes we need to make are the name and the description. The "configuration type" can be tool or pipeline, but this is a pipeline, so this doesn't need to be modified. The "categories" are used to define which help categories this pipeline falls into. In this case, we want our new pipeline to appear in the Alignment category, so we can leave this value alone as well. After changing these details, our configuration should look like this:
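As a rough sketch of the result (assuming the pipeline name is stored under an "id" key; the exact key names and description text should be copied from the existing bwa.json), the top of the file might now read:

```json
{
  "id" : "new-bwa",
  "configuration type" : "pipeline",
  "description" : "Align paired-end fastq files with bwa mem and merge the resulting bam files into a single file, preserving the read groups.",
  "categories" : ["Alignment"]
}
```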
The original bwa pipeline
Below is a representation of the original bwa pipeline that we are going to modify. The blue rectangles represent the tasks defined in the "pipeline tasks" section of the configuration file, and the green diamonds represent files. These are either defined in the "unique graph nodes" section of the configuration file, or created by gkno if they are required but undefined. As we update the pipeline in the configuration, I will draw new representations of the pipeline so we can understand what we are doing!
Adding tasks to a pipeline
Now let's add a new task to this pipeline. What do we need the pipeline to do, once the individual bam files are created? Unfortunately, if we run samtools merge on a set of bam files, it will keep the header from the first file, so we will lose all read group information specific to each alignment. We really don't want this to happen, so we need to construct a new bam header that contains all the read groups. Our first job, then, is to output the header from each created bam file, so let's add a new task to the list of tasks to do this. All we need to add is a unique task name and the name of the tool we will use to perform the task (to see all available tools, look in $GKNO/config_files/tools), in this case samtools-header. Here is the updated task list:
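Sketched from the description above (the "task" and "tool" key names are assumptions; mirror the style of the existing entries in the "pipeline tasks" list), the appended entry might look like:

```json
  {
    "task" : "header",
    "tool" : "samtools-header"
  }
```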
Ok, we have a new task, now let's create a new node to describe the new files that are created: the header files. If we want to be able to interact with any files created in the pipeline, we need to describe them with a node. So, in the "unique graph nodes" section of the configuration file, let's create a new node to describe the header file. To describe a node, we give it a unique id: let's call it hdr. This id must be unique not only among the node ids, but must also not clash with the name of any task defined in the tasks section. If you input a non-unique value, gkno should warn you of your error. We also define the task and argument with which this node is associated. In this case, the header file is associated with the --out argument in the header task. Just for fun, let's deliberately make an error and use the argument --fake to create the node list shown below. The node with id "output" should have been the last node in the configuration file, so remember to put a comma at the end of the previous node description.
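Sketched as a fragment (the "task" and "task argument" key names are assumptions; follow the style of the surrounding node definitions), the new node with the deliberate error might read:

```json
  {
    "id" : "hdr",
    "task" : "header",
    "task argument" : "--fake"
  }
```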
Great, our pipeline is updated, so let's run it with all our parameters set.
  • gkno new-bwa -cp config -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o 1.bam 2.bam 3.bam
It failed. The reason is that gkno tried to build a node in the pipeline to contain the files produced by the --fake argument in the header task. gkno is now telling us that no such argument exists. It also tells us that this task is associated with samtools-header, so we can look at the configuration file for this tool to find all the valid arguments.
  • more $GKNO/config_files/tools/samtools-header.json
In the arguments section (shown below), we can see the valid arguments for samtools-header (look for the long form argument definitions): the input file --in, the output file --out, and a single option, --header. We wanted the node we created to represent the output of the task, so change the --fake in our pipeline configuration node to --out. Now that the argument is valid, rerun the gkno command above.
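The relevant part of the samtools-header arguments section might look roughly like this (the section names "Inputs", "Outputs", and "Options", and the descriptions, are illustrative; check the actual file with the more command above):

```json
"arguments" : {
  "Inputs" : [
    {
      "description" : "The input bam file.",
      "long form argument" : "--in"
    }
  ],
  "Outputs" : [
    {
      "description" : "The output header file.",
      "long form argument" : "--out"
    }
  ],
  "Options" : [
    {
      "description" : "Output the header only.",
      "long form argument" : "--header"
    }
  ]
}
```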
Hmm, we still fail. Now what's the problem? The problem is that we haven't defined the names of the outputs of the header task. For most tools in gkno, if you don't specify the filename on the command line, the filename will be constructed from input files provided to the task. In this case, gkno is telling us that the input file for this task is not defined. The specific error we are given pertains to creating output filenames, but the header task cannot function without an input file. Having made the above modifications to the configuration file, this is what the pipeline looks like:
As we can see, we have created the new header task and its output, the "hdr" file - but no files go into the task. So, what we want to do is give the task an input: specifically, we want the bam file output from the sort task to feed in. Since nodes exist for both the header task and the bam file output by the sort task, we just need to make a connection. We do this by modifying the "connect nodes" section of the configuration file. Below is the update we need to make.
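Sketched as a fragment (the "source", "target", and "argument" key names are assumptions; follow the style of any existing entries in the "connect nodes" section), the added connection might be:

```json
  {
    "source" : "output",
    "target" : "header",
    "argument" : "--in"
  }
```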
Here, we added a connection from the bam file (from the "unique graph nodes", we can see that node's id is output) to the header task. Finally, we associated this file with the --in argument in the header task. Again, remember that the code blocks in the configuration file are separated by commas!
Ok, this is what our pipeline looks like now:
Rerun the gkno command, and this time it should execute successfully.
  • gkno new-bwa -cp config -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o 1.bam 2.bam 3.bam -nj 3
And we now have the additional files 1.header, 2.header, and 3.header.
Merge the individual header files
We don't actually want these individual header files, but a single file containing the read groups from all the headers. To achieve this, I wrote a simple Python script and added it as a tool in gkno. To see how to add such a tool, see the additional tutorial here. The tool is already available within gkno, though, so you do not need to follow that tutorial if you don't want to.
The pipeline now needs to use this tool (called merge-bam-headers) to merge these created header files. We don't have to do anything that we haven't done before, namely:
  • Add a new task, merge-headers.
  • Add a new "unique graph node" to represent the merged header file.
  • Connect the hdr outputs to the new task.
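Collected into one sketch (the "task", "tool", "task argument", "source", "target", and "argument" key names are assumptions; follow the style of the existing configuration), the additions to each section might be:

```json
{
  "pipeline tasks" : [
    { "task" : "merge-headers", "tool" : "merge-bam-headers" }
  ],
  "unique graph nodes" : [
    { "id" : "header-file", "task" : "merge-headers", "task argument" : "--out" }
  ],
  "connect nodes" : [
    { "source" : "hdr", "target" : "merge-headers", "argument" : "--in" }
  ]
}
```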
When these blocks have been added, our pipeline looks like this.
Delete the test files, rerun gkno with the previous command, and we see that we get an error. This one is a little bit more complicated! The problem here is that we are running the pipeline three times (once for each set of fastq files). The merge-headers task will create a file with the name merged-bam-headers.header, but since we are running it three times, it's going to create the same named file three times. gkno doesn't let you run a pipeline if it knows that you're going to create the same file multiple times, since this can only lead to disaster! What we need to do, then, is tell gkno to take all the created header files, give them all as input to the merge-headers task, and run that task only once. Basically, the merge-headers task is "greedy". This is easy to do. Just update the description of the merge-headers task to the following:
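A sketch of the updated task entry (the exact key used to mark a task as greedy is an assumption here; "greedy argument" should appear alongside it - check an existing greedy task in the gkno configuration files for the precise spelling):

```json
  {
    "task" : "merge-headers",
    "tool" : "merge-bam-headers",
    "greedy task" : true,
    "greedy argument" : "--in"
  }
```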
The task is now defined as "greedy" and the "greedy argument" is --in. This means that if there are multiple files in the node associated with the --in argument, all of these values will be given to the argument. At the end of this tutorial, we will take a look at the makefile that gkno generates and uses to execute the pipeline. If we run gkno again, it will now execute successfully, and you will see that we now generate the merged-bam-headers.header file.
Merge the bam using our new header
Now we have a bam header with the read groups from all the bam files, we can merge the bam files together. The code snippets can be found in the expandable image below, but you need to do the following tasks:
  • Add a final-merge task (note that this must also be a "greedy" task).
  • Add a "unique graph node" associated with the final merged bam file.
  • Connect the header file we just created to the --header argument of the final-merge task.
  • Connect the bam files from the sort task to the --in argument of the final-merge task.
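Collected into one sketch (the key names and the samtools-merge tool name are assumptions; check the actual tool configurations under $GKNO/config_files/tools), the additions might be:

```json
{
  "pipeline tasks" : [
    {
      "task" : "final-merge",
      "tool" : "samtools-merge",
      "greedy task" : true,
      "greedy argument" : "--in"
    }
  ],
  "unique graph nodes" : [
    { "id" : "final-bam", "task" : "final-merge", "task argument" : "--out" }
  ],
  "connect nodes" : [
    { "source" : "header-file", "target" : "final-merge", "argument" : "--header" },
    { "source" : "output", "target" : "final-merge", "argument" : "--in" }
  ]
}
```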
This is now what our pipeline looks like:
Finish the pipeline
Ok, there are only a couple of extra tasks needed to finish this pipeline. Now we have a final, merged bam file, we want to index it and generate alignment statistics on it. You should try to add these tasks and connect them up yourself, but the code snippets are included below, along with the final pipeline representation. Technically, you do not need to create new "unique graph nodes" for the index and stats files: gkno will automatically build these nodes itself, since it knows that these tasks generate outputs. We really only need to define the nodes ourselves if we want to a) attach a command line argument to the node, b) explicitly connect the node to another task, or c) delete the files associated with the node. Since these files constitute the end of the pipeline, and we don't need to define their names, we could just omit this step.
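As a sketch (the task names, tool names, and key names here are all hypothetical; reuse the tools already named in the original index and stats tasks of bwa.json), the additions might look like:

```json
{
  "pipeline tasks" : [
    { "task" : "final-index", "tool" : "bamtools-index" },
    { "task" : "final-stats", "tool" : "bamtools-stats" }
  ],
  "connect nodes" : [
    { "source" : "final-bam", "target" : "final-index", "argument" : "--in" },
    { "source" : "final-bam", "target" : "final-stats", "argument" : "--in" }
  ]
}
```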
This is now what our pipeline looks like:
Now you can run the pipeline again, and it will run all the way to the end and generate your new merged bam file, preserving all the read groups from the original files. Unfortunately, we're not quite done yet, though!
Add an output argument
When you run the pipeline now, it is generating the final output bam file name based on the names of input files. We really need to let the user define the name of the output, though. This is easy. We just need to update the node that the current --out argument points to (if we want to retain the ability to name the intermediate bam files, we would just add a new argument). In the "Outputs" section of "arguments", modify the "node id" from output to final-bam. Now, we can set the name of this output file.
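Sketched, the modified entry in the "Outputs" section of "arguments" might read (only "node id" : "final-bam" is the change described; the other keys and values are illustrative):

```json
  {
    "description" : "The output merged bam file.",
    "long form argument" : "--out",
    "short form argument" : "-o",
    "node id" : "final-bam"
  }
```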
Now try running the same gkno command we have been using up to this point:
  • gkno new-bwa -cp config -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o 1.bam 2.bam 3.bam -nj 3
This command now fails because we are giving three output file names, and gkno is only expecting a single value for the final merged bam file. So run the following command instead:
  • gkno new-bwa -cp config -ps test -q $RES/*_1.fq -q2 $RES/*_2.fq -o all.bam -nj 3
This generates the three files that we want (all.bam, all.bam.bai, and all.stats), but we also have all the intermediate bam, index, stats, and header files that we don't really need. So let's get rid of them.
Cleaning up intermediate files
Nearly done. First of all, we don't actually need to index and generate stats on the individually aligned bam files, so let's just delete the index and stats tasks, and the two connections in the "connect nodes" section that connect the bam files to these tasks. If we rerun the pipeline now, these intermediate bai and stats files are no longer there since we never created them.
If a node is associated with files that we don't want to keep, we can just tell gkno to delete them. This is useful, since gkno will delete the files as soon as all tasks that use them are complete. It won't wait until the end of the pipeline, and thus avoids building up an unnecessarily large volume of files that we are ultimately going to get rid of (if the pipeline fails for some reason, the files required to pick up where the pipeline left off will still be available though). So, in our case, the "unique graph nodes" with ids output, hdr, and header-file can have the line "delete files" : true added to delete the files that they create.
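For example, the hdr node might now read (the "task" and "task argument" key names are assumptions; "delete files" is the key described above):

```json
  {
    "id" : "hdr",
    "task" : "header",
    "task argument" : "--out",
    "delete files" : true
  }
```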
And we're done! Try rerunning the pipeline one last time, and you should see that the only files created (excluding the stdout, stderr, and ok files indicating that the pipeline ran successfully) are the files that we need.
If you want to see the makefile that gkno generates and uses to actually run the pipeline, just add -dne to the command line. This means do not execute: gkno will generate the makefile as usual, but not execute it. You can now look at the file and see all the magic that ensures the pipeline runs. If you want to run the makefile yourself, you can just type:
  • make -f makefile -j number_of_parallel_jobs
A different way to build a pipeline
This tutorial built a new pipeline by defining every task in the pipeline from start to finish. This isn't the only way, however. In the "pipeline tasks" section, each task was performed by an individual tool, but gkno also allows you to use a defined pipeline as a task. If you want to learn more, this tutorial builds the same pipeline as we just did here, but uses the bwa pipeline as one of the pipeline tasks.