Build a pipeline that contains another pipeline
In this tutorial, we are going to build an extension of the existing bwa pipeline. This is going to be the same pipeline we built in the previous tutorial. I recommend that you work through the previous tutorial first, as you will need to understand that content before moving on to the content here. The previous tutorial also has more general gkno information on running pipelines.
The bwa pipeline takes an arbitrary number of pairs of fastq files and aligns them as paired end reads using bwa mem. If n pairs of fastq files are supplied, then n bam files will be produced by the pipeline. The assumption is that each pair of fastq files corresponds to a single sample. If the n pairs of files correspond to different lanes for a single sample, then we want each pair to be aligned separately, but once complete, all the individual bam files need to be merged into a single file. Here, we will create a new pipeline to perform this function.
Initial setup
Before we build the new pipeline, we need to set up some paths and executables. From a suitable working directory, type the following:
  • mkdir gkno_dev
  • cd gkno_dev
  • GKNO=<path to gkno>
  • RES=$GKNO/resources/tutorial/current
  • gkno=$GKNO/gkno
Just check that everything is working fine by typing gkno and observing that gkno successfully runs.
Start building a new pipeline
Let's build a new pipeline. We want this to do almost the same thing as bwa, but merge all the resulting bam files together at the end (three of them for the test data used later in this tutorial). In this tutorial, instead of defining all the steps in the pipeline, we are going to include the bwa pipeline explicitly as a task. To begin, let's copy the pipeline template configuration file over to a working directory:
  • mkdir config
  • cp $GKNO/config_files/template_pipeline.json config/new-bwa2.json
Add the original bwa pipeline
First of all, give the pipeline the name new-bwa2 in the "id" field at the top of the configuration file. If you are building a real pipeline for sharing with others, make sure that you also add a description and define the help categories with which the pipeline is associated. Now, we need to add the original bwa pipeline as the first task in our new pipeline. Inside the "pipeline tasks" section, include the following code:
  • {
  •   "task" : "align",
  •   "pipeline" : "bwa"
  • }
This should look familiar from adding tasks in the previous tutorial, with the exception that we are now defining the task as a pipeline and not a tool. As we add more content to the configuration file, we can run the new pipeline with the following command:
  • gkno new-bwa2 -cp config
The -cp argument tells gkno to look for the configuration file in the config directory. The pipeline runs and asks you to enter some required arguments. We never built any arguments into our new pipeline, so gkno assumes that it should pull in the arguments from the contained pipeline. We can see this by asking for help on the pipeline:
  • gkno new-bwa2 -cp config -h
Now compare this with the help from the original bwa pipeline. Things look pretty much the same. The pipeline name and description are different, as is the parameter set information, since we haven't added any parameter sets to the pipeline yet. In the "Workflow" for our new pipeline, each task is prefixed with align. when compared with the bwa pipeline. This is because our new pipeline runs a pipeline that we have called align as its first task, and within this task are all the tasks that the bwa pipeline performs. All the arguments are the same, though, so we can run our new pipeline in the same way we would run bwa.
Adding tasks to the pipeline
Now we need to add some additional tasks to our pipeline. We will follow the same process as with the previous tutorial:
  • Add a task to the "pipeline tasks" section,
  • Create "unique graph nodes" as needed,
  • Connect necessary files and tasks together.
So, the first step is to add a new task. We will start by adding the header task to generate files containing the header from each created bam file. We do this by adding the following code to the "pipeline tasks":
  • {
  •   "task" : "header",
  •   "tool" : "samtools-header"
  • }
Now we just need to connect the created bam files to the input argument of this new task. This is a little different to before though, since we don't have any defined graph nodes in this pipeline. We can look in the bwa pipeline configuration file to see which node in that pipeline we want to attach to the header task. This is the output node (the same as in the previous tutorial). If we add this connection in the "connect nodes" section:
  • {
  •   "source" : "output",
  •   "target" : "header",
  •   "argument" : "--in"
  • }
and then try and run the pipeline:
  • gkno new-bwa2 -cp config
Then the pipeline predictably fails. gkno informs us that the source node, output, is invalid. This makes sense because there are no defined nodes in this pipeline. What we really wanted to do was connect the header task to the output node in the contained bwa pipeline, which we defined as the align task. We can modify the source node to "source" : "align.output". This tells gkno that instead of looking for the node in its own "unique graph nodes" section, it should look in the pipeline configuration file associated with the align task. Since the output node does exist in this (bwa) pipeline, the connection is valid, and the pipeline should now run.
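The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not gkno's actual implementation: the function name resolve_node and the data structures are made up for this example, and the node list for the contained pipeline is deliberately incomplete.

```python
def resolve_node(reference, local_nodes, contained_pipelines):
    """Return (owning task, node id) for a node reference, or raise ValueError.

    A plain name such as "output" is looked up among this pipeline's own
    nodes. A dotted name such as "align.output" is looked up in the
    pipeline that runs as the named task.
    """
    if "." in reference:
        task, node_id = reference.split(".", 1)
        if task not in contained_pipelines:
            raise ValueError(f"unknown pipeline task: {task}")
        if node_id not in contained_pipelines[task]:
            raise ValueError(f"node {node_id} is not defined in task {task}")
        return task, node_id
    if reference not in local_nodes:
        raise ValueError(f"node {reference} is not defined in this pipeline")
    return None, reference


# Illustrative data (an assumption, not read from any real file): new-bwa2
# has no nodes of its own yet, and the bwa pipeline, run as the task
# "align", defines at least the nodes "output" and "fastq".
local_nodes = set()
contained_pipelines = {"align": {"output", "fastq"}}

print(resolve_node("align.output", local_nodes, contained_pipelines))
```

With this data, asking for "output" without the prefix raises an error, which mirrors the failure we saw before changing the source to "align.output".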
To really check that the pipeline is running, we need to give it some files. Rather than type out all the files on the command line, we'll just add a test parameter set to the pipeline. I recommend that every pipeline you create has a test parameter set which uses only files contained in the tutorial resources directory. This will mean that every pipeline can be quickly tested to make sure it works. Copy the test parameter set from the end of the file $GKNO/config_files/pipes/bwa-multi.json and paste it into the "parameter sets" section of our new-bwa2 configuration file, then try running with this set:
  • gkno new-bwa2 -cp config -ps test -nj 3
Here, we have used the -nj 3 argument to parallelize and speed up execution. This command fails and, as before, the problem is that we are pointing to nodes that don't exist in this pipeline. So, for each node, e.g. "node" : "fastq", we need to tell gkno to look in the align task pipeline. This node needs to become "node" : "align.fastq". Make this change for every node in the test parameter set, and the pipeline should then run, creating all the header files we asked for.
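If the parameter set has many entries, the renaming can be done with a small helper instead of by hand. The following is a minimal sketch: the function prefix_nodes is hypothetical, and the layout of example_set (the "data" and "values" keys, the node name fastq2 and the file names) is a placeholder rather than the real contents of the test parameter set. The only thing the helper relies on is that nodes appear as "node" : "<name>" pairs somewhere in the JSON.

```python
import json


def prefix_nodes(obj, prefix):
    """Return a copy of obj in which every "node" value starts with prefix."""
    if isinstance(obj, dict):
        result = {}
        for key, value in obj.items():
            if key == "node" and isinstance(value, str):
                # Leave values alone if they already carry the prefix.
                result[key] = value if value.startswith(prefix) else prefix + value
            else:
                result[key] = prefix_nodes(value, prefix)
        return result
    if isinstance(obj, list):
        return [prefix_nodes(item, prefix) for item in obj]
    return obj


# Placeholder parameter set; every key other than "node" is an assumption.
example_set = {
    "id": "test",
    "data": [
        {"node": "fastq", "values": ["placeholder_1.fq"]},
        {"node": "align.fastq2", "values": ["placeholder_2.fq"]},
    ],
}

updated = prefix_nodes(example_set, "align.")
print(json.dumps(updated, indent=2))
```

The helper returns a new structure and leaves its input untouched, so it is safe to run more than once on the same data.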
Adding the remaining tasks
We now need to finish adding all the tasks to the pipeline, along with adding necessary nodes and connecting everything together. This follows the exact steps of the previous tutorial. I will just provide the instructions here, but if you need more explanation, please refer back to the previous tutorial.
Add the tasks to merge the header files, merge the bam files, and finally index and generate stats on the final bam file. The code to be added to the "pipeline tasks" is:
  • {
  •   "task" : "merge-headers",
  •   "tool" : "merge-bam-headers",
  •   "greedy task" : true,
  •   "greedy argument" : "--in"
  • },
  • {
  •   "task" : "final-merge",
  •   "tool" : "samtools-merge",
  •   "greedy task" : true,
  •   "greedy argument" : "--in"
  • },
  • {
  •   "task" : "final-index",
  •   "tool" : "bamtools-index"
  • },
  • {
  •   "task" : "final-stats",
  •   "tool" : "bamtools-stats"
  • }
Next, add nodes to the "unique graph nodes" section. We only need to add nodes that we want to interact with, e.g. delete associated files, link a command line argument to, or pass on to other tasks. Here, we need to add the following nodes:
  • {
  •   "id" : "hdr",
  •   "task" : "header",
  •   "task argument" : "--out"
  • },
  • {
  •   "id" : "header-file",
  •   "task" : "merge-headers",
  •   "task argument" : "--out"
  • },
  • {
  •   "id" : "final-bam",
  •   "task" : "final-merge",
  •   "task argument" : "--out"
  • }
And finally, connect the nodes together with the following additions to the "connect nodes" section:
  • {
  •   "source" : "hdr",
  •   "target" : "merge-headers",
  •   "argument" : "--in"
  • },
  • {
  •   "source" : "header-file",
  •   "target" : "final-merge",
  •   "argument" : "--header"
  • },
  • {
  •   "source" : "align.output",
  •   "target" : "final-merge",
  •   "argument" : "--in"
  • },
  • {
  •   "source" : "final-bam",
  •   "target" : "final-index",
  •   "argument" : "--in"
  • },
  • {
  •   "source" : "final-bam",
  •   "target" : "final-stats",
  •   "argument" : "--in"
  • }
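To see how these pieces fit together, it can help to work out which task feeds which. The sketch below copies the nodes and connections from this tutorial into Python and derives the task-to-task edges. It assumes that a node defined on a task's --out argument is produced by that task, and that a dotted source such as align.output is produced by the contained pipeline task. The function task_edges is made up for illustration and is not part of gkno.

```python
def task_edges(nodes, connections):
    """Return (producer task, consumer task) pairs implied by the connections."""
    produced_by = {node["id"]: node["task"] for node in nodes}
    edges = []
    for connection in connections:
        source = connection["source"]
        if "." in source:
            # Dotted source: the producer is the contained pipeline's task.
            producer = source.split(".", 1)[0]
        else:
            producer = produced_by[source]
        edges.append((producer, connection["target"]))
    return edges


# Nodes and connections as given in this tutorial.
nodes = [
    {"id": "hdr", "task": "header", "task argument": "--out"},
    {"id": "header-file", "task": "merge-headers", "task argument": "--out"},
    {"id": "final-bam", "task": "final-merge", "task argument": "--out"},
]
connections = [
    {"source": "align.output", "target": "header", "argument": "--in"},
    {"source": "hdr", "target": "merge-headers", "argument": "--in"},
    {"source": "header-file", "target": "final-merge", "argument": "--header"},
    {"source": "align.output", "target": "final-merge", "argument": "--in"},
    {"source": "final-bam", "target": "final-index", "argument": "--in"},
    {"source": "final-bam", "target": "final-stats", "argument": "--in"},
]

for producer, consumer in task_edges(nodes, connections):
    print(f"{producer} -> {consumer}")
```

The edges show align feeding both header and final-merge, and final-merge feeding both final-index and final-stats, which is the workflow we set out to build.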
Update output argument
Our configuration file will now run the pipeline and generate the required merged bam file. The problem we have is that the name of the final file is mutated_genome_1_samblaster_sorted.bam, since gkno constructed the filename itself. This is not the name that we want, and we don't have a command line argument to change the name. You can try using the --out argument to set the name of the output file:
  • gkno new-bwa2 -cp config -ps test -nj 3 -o all.bam
The pipeline fails because the --out argument is still as defined in the original bwa pipeline, and so is trying to set the outputs of that pipeline, not the final merged output. Since we are performing three independent alignments here, we would need to provide three output filenames to the argument. So, let's update the pipeline to address this. We have already defined a node associated with the final output: final-bam. So all we need to do is add an argument in the Outputs section of arguments. We will give this the value --out, and it will overwrite any arguments of the same name in the contained pipeline. So add the following to the Outputs:
  • {
  •   "description" : "The output BAM file.",
  •   "long form argument" : "--out",
  •   "short form argument" : "-o",
  •   "node id" : "final-bam"
  • }
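The way the new --out argument takes precedence over the one inherited from bwa can be pictured as one dictionary being laid over another. This is just a mental model written as Python, not a description of how gkno stores arguments internally. The argument name --fastq is made up for the example, while --out and the node names come from the text above.

```python
def effective_arguments(contained, outer):
    """Combine two {argument: node} maps; entries in outer take precedence."""
    merged = dict(contained)
    merged.update(outer)
    return merged


# Arguments pulled in from the contained pipeline (illustrative only).
contained = {"--out": "align.output", "--fastq": "align.fastq"}
# Arguments defined in new-bwa2 itself.
outer = {"--out": "final-bam"}

arguments = effective_arguments(contained, outer)
print(arguments["--out"])
```

Arguments that the outer pipeline does not redefine are still available unchanged, which is why the rest of the bwa command line keeps working.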
Clear up intermediate files
Just like in the previous tutorial, we want to delete all the intermediate files. Getting rid of all the header files is straightforward since we created the nodes associated with them: namely hdr and header-file. For both of these nodes, add the line "delete files" : true in the node description (e.g. after the "task argument" line).
Next we will delete the bai and stats files for the intermediate bam files. In the original bwa pipeline, no nodes were created for these files, so we create the nodes in this pipeline and delete the associated files. We can do that by adding the following to the unique graph nodes:
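As a sketch of what the header nodes look like after this change, the snippet below applies it to the three nodes defined earlier and prints the result as JSON. The helper mark_for_deletion is hypothetical. Only the "delete files" key and the node definitions are taken from this tutorial.

```python
import json


def mark_for_deletion(nodes, ids):
    """Return copies of nodes, adding "delete files": true to the listed ids."""
    wanted = set(ids)
    return [
        {**node, "delete files": True} if node["id"] in wanted else dict(node)
        for node in nodes
    ]


# The nodes defined earlier in this tutorial.
graph_nodes = [
    {"id": "hdr", "task": "header", "task argument": "--out"},
    {"id": "header-file", "task": "merge-headers", "task argument": "--out"},
    {"id": "final-bam", "task": "final-merge", "task argument": "--out"},
]

marked = mark_for_deletion(graph_nodes, ["hdr", "header-file"])
print(json.dumps(marked, indent=2))
```

The final-bam node is left alone, since that is the file we want to keep.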
  • {
  •   "id" : "int-index",
  •   "task" : "align.index",
  •   "task argument" : "--out",
  •   "delete files" : true
  • },
  • {
  •   "id" : "int-stats",
  •   "task" : "align.stats",
  •   "task argument" : "--out",
  •   "delete files" : true
  • }
Remember that, since the task we want to work with is in the contained bwa pipeline, we need to define the task with the "align." prefix.
Finally, we want to remove the intermediate bam files. The contained bwa pipeline already has a node called output which points to these files, so we can't define a new node on the task argument as we did for the bai and stats files: we are not allowed to define the same node twice. We can, however, build a node in this configuration file that simply points to the node in the contained pipeline. In practice, this means that we add essentially the same information, but with the task set to align. Since this task is an entire pipeline, it doesn't make sense to include a task argument as we did above; instead, we give the id of the node (output) within the align task.
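A node of this kind might look like the one built below. Treat this as a sketch under two assumptions that the text above does not settle: the id int-bam is just a name chosen to match int-index and int-stats, and the key "node id" is borrowed from the arguments section, so check it against the bwa pipeline configuration files shipped with gkno before relying on it.

```python
import json

# Sketch of a node that points at the "output" node of the contained pipeline.
int_bam_node = {
    "id": "int-bam",        # hypothetical id, chosen for this example
    "task": "align",        # the task that runs the contained bwa pipeline
    "node id": "output",    # assumed key name for the node inside that pipeline
    "delete files": True,   # remove the intermediate bam files when done
}

print(json.dumps(int_bam_node, indent=2))
```

The printed JSON can be pasted into the "unique graph nodes" section alongside int-index and int-stats, adjusting the key names if gkno reports them as invalid.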