Overview
gkno is a system for building and executing bioinformatics pipelines using a range of different tools. Installation generates a copy of the gkno source code and also pulls in the source code of all of the constituent tools. Building gkno therefore means compiling the tools contained within the package, rather than compiling gkno itself. In addition, gkno provides a set of resource files that can be used in conjunction with the pipelines. The overall aim of the project is to simplify the identification and execution of bioinformatics pipelines, and to provide a community resource through which pipelines and analysis-specific parameters can be easily shared and reused.
General architecture
When running a gkno pipeline, a pipeline configuration file is used to create a graph representation of the pipeline using functionality provided by the Python networkx library. Tool configuration files are then used to define all the executables and to describe expected file formats, required arguments, and detailed behaviour such as the handling of streams. Defined parameter sets and user-supplied arguments are parsed and all values are added to the graph. Where possible, gkno fills in any missing required arguments (for example, by creating output filenames where none are defined). All details are then checked to ensure that the supplied information is consistent with the tool requirements.
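The fill-in and checking steps described above can be sketched as follows. This is only an illustration under stated assumptions: gkno itself builds its graph with networkx, whereas this sketch uses plain dictionaries to stay dependency-free, and the task names, tool names, argument names and the `<task>.out` naming rule are all hypothetical rather than taken from gkno.

```python
# Hypothetical, simplified stand-ins for a pipeline description.
# None of these names or structures are gkno's real configuration format.
tasks = {
    "index": {"tool": "indexer", "args": {"--in": "ref.fa", "--out": None}},
    "align": {"tool": "aligner",
              "args": {"--index": None, "--reads": "reads.fq", "--out": None}},
}

# Edges of the graph: (producer task, producer argument,
#                      consumer task, consumer argument).
links = [("index", "--out", "align", "--index")]

# Required arguments per tool (in gkno this would come from the tool
# configuration files; here it is a hand-written assumption).
required = {
    "indexer": ["--in", "--out"],
    "aligner": ["--index", "--reads", "--out"],
}


def fill_missing(tasks, links):
    """Construct missing output filenames, then pass values along edges."""
    for name, task in tasks.items():
        if task["args"].get("--out") is None:
            task["args"]["--out"] = f"{name}.out"  # assumed naming rule
    for src, src_arg, dst, dst_arg in links:
        if tasks[dst]["args"].get(dst_arg) is None:
            tasks[dst]["args"][dst_arg] = tasks[src]["args"][src_arg]
    return tasks


def check(tasks, required):
    """Return (task, argument) pairs for required arguments still unset."""
    problems = []
    for name, task in tasks.items():
        for arg in required[task["tool"]]:
            if task["args"].get(arg) is None:
                problems.append((name, arg))
    return problems


filled = fill_missing(tasks, links)
problems = check(filled, required)
```

After this runs, the 'align' task reads the index file that the 'index' task was assigned, and `problems` is empty because every required argument has a value.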
Before creating makefiles for execution, gkno identifies dependent tasks in the pipeline and creates phases and subphases to allow execution to be parallelised. In the figure below, the task 'generate reference index' needs to be completed before anything else in the pipeline can run. This task is placed in phase 1, and no other task can run until it has finished. Once phase 1 is complete, phase 2 is executed. Within phase 2 there may be multiple alignments, all of which are independent, so the 'align' and 'sort' tasks for each set of files form a subphase that can run in parallel with the others. No variant calling (phase 3) can occur until the alignments are complete, so phase 3 is not started until all subphases in phase 2 have finished. If run on a cluster, the pipeline shown in the figure below would result in a single makefile for phase 1, n makefiles for phase 2 and a single makefile for phase 3.
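The phase and subphase layout described above can be sketched as a nested list, with one makefile per subphase. This is a minimal illustration, not gkno's implementation: the sample names, the 'call variants' task label and the makefile naming scheme are hypothetical.

```python
def plan_phases(samples):
    """Return phases as lists of subphases; each subphase is a task list."""
    return [
        # Phase 1: must finish before anything else starts.
        [["generate reference index"]],
        # Phase 2: one independent subphase per set of input files.
        [[f"align {sample}", f"sort {sample}"] for sample in samples],
        # Phase 3: starts only when every phase 2 subphase is complete.
        [["call variants"]],
    ]


def makefile_names(phases):
    """Name one makefile per subphase (the naming scheme is assumed)."""
    names = []
    for phase_number, subphases in enumerate(phases, start=1):
        for subphase_number, _tasks in enumerate(subphases, start=1):
            names.append(f"phase{phase_number}_sub{subphase_number}.make")
    return names


samples = ["sampleA", "sampleB", "sampleC"]  # hypothetical input sets
phases = plan_phases(samples)
makefiles = makefile_names(phases)
```

With n = 3 input sets this gives one makefile for phase 1, three for phase 2 and one for phase 3, matching the 1, n, 1 pattern described in the text.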
For more information on parallelising execution of gkno, see the tutorial here.
File structure
In the initial set-up, gkno creates a file structure so that all files are available and in a known location. This file structure should never be modified or removed. Many of the functions within gkno make use of the known file structure, so any changes could cause a total failure. The directory structure is as follows:
From left to right, the file structure contains the following elements:
Configuration files
The 'config_files' directory contains the configuration file gknoConfiguration.json, which describes the arguments that are not specific to any pipeline but perform general, high-level functions. These arguments are discussed in more detail here. Each tool that can be built into pipelines is described in its own configuration file, stored in the 'tools' directory, and the 'pipes' directory contains all of the pipeline descriptions. There are additional resources to describe the contents of these tool and pipeline configuration files.
Executable
The gkno executable can be run as soon as gkno has been built. It allows you to manage versions and resources, and to run any of the available pipelines.
Resources
When cloned, gkno comes with a set of 'tutorial' resources. These are generally small files that allow pipelines to be tested rapidly. Additional resources can be added, updated or removed, and each added resource is stored in its own directory under the resources directory. See the resources tutorial to understand how to manage gkno resource files. These files should not be removed manually, as some parameter sets may expect them to exist.
Source files
All of the source files needed to run gkno are stored here and should be left alone.
Tools
All of the tools available in gkno are stored in the 'tools' directory. Within this directory, the 'R' directory contains scripts for running 'R' analyses and 'scripts' contains scripts that have been wrapped as gkno tools. It is possible to include your own tool in gkno by creating a configuration file that describes it and creating a directory for it in the 'tools' directory. You can then either keep the tool in that directory or provide a soft link to the tool executable. Including your own tools in gkno is described in more detail here.
Resources
Details coming soon.
Tool configuration files
Every tool, script or 'R' script that can be used by gkno has to be described using a configuration file. These are JSON-format files that identify the tool's location, executable commands and command line arguments. Any idiosyncrasies of the tool are also described in this file, for example, how the tool's behaviour or arguments change if it is accepting or outputting a stream rather than a file. Details on all aspects of tool configuration files, including writing them for your own tools, can be found under the 'Tool configuration files' heading on the how-to page.
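To make the idea concrete, the sketch below loads a small JSON description of a tool and turns it into a command line, dropping the output argument when the tool writes to a stream. The field names ('path', 'executable', 'arguments', 'if_output_is_stream') and the tool itself are invented for illustration and are not gkno's actual schema; the how-to page remains the reference for the real format.

```python
import json

# Hypothetical tool description. The keys below are assumptions made for
# this sketch and do not reflect gkno's real configuration schema.
TOOL_CONFIG = json.loads("""
{
  "tool": "example-sorter",
  "path": "tools/example-sorter",
  "executable": "sorter",
  "arguments": [
    {"long": "--in", "required": true},
    {"long": "--out", "required": true, "if_output_is_stream": "omit"}
  ]
}
""")


def build_command(config, values, output_is_stream=False):
    """Build an argument list from a tool description and supplied values."""
    command = [f'{config["path"]}/{config["executable"]}']
    for arg in config["arguments"]:
        name = arg["long"]
        # Stream handling: some arguments are dropped when piping output.
        if output_is_stream and arg.get("if_output_is_stream") == "omit":
            continue
        if name not in values:
            if arg.get("required"):
                raise ValueError(f"missing required argument {name}")
            continue
        command += [name, values[name]]
    return command


file_command = build_command(TOOL_CONFIG, {"--in": "a.bam", "--out": "b.bam"})
stream_command = build_command(TOOL_CONFIG, {"--in": "a.bam"},
                               output_is_stream=True)
```

Keeping the stream rule in the tool description, rather than in the code that builds the command, means the same builder can serve every tool regardless of its quirks.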
Pipeline configuration files
As with tools, every pipeline in gkno has to be described using a JSON-format configuration file. This file defines all of the tasks in a pipeline and the tools used to execute each task. It is also possible to use another gkno pipeline to execute a task, so pipelines can, in fact, be pipelines of pipelines. A lot of information can be used to define pipelines, and all the details, including how to create files for your own pipelines, can be found under the 'Pipeline configuration files' heading on the how-to page.
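The 'pipelines of pipelines' idea can be illustrated by expanding a nested description into a flat list of tasks and tools. The structure and names below are hypothetical and do not reflect the real gkno pipeline format; the sketch also assumes that no pipeline refers to itself, directly or indirectly.

```python
# Hypothetical pipeline descriptions: a task names either a tool or
# another pipeline. This is not gkno's real configuration format.
PIPELINES = {
    "align-and-sort": {
        "tasks": [
            {"task": "align", "tool": "aligner"},
            {"task": "sort", "tool": "sorter"},
        ]
    },
    "call": {
        "tasks": [
            {"task": "prepare", "pipeline": "align-and-sort"},
            {"task": "call-variants", "tool": "caller"},
        ]
    },
}


def expand(name, pipelines, prefix=""):
    """Flatten a pipeline into (task label, tool) pairs, in order."""
    flat = []
    for entry in pipelines[name]["tasks"]:
        label = prefix + entry["task"]
        if "pipeline" in entry:
            # The task is itself a pipeline: expand it recursively and
            # prefix its task names with the enclosing task's label.
            flat.extend(expand(entry["pipeline"], pipelines, label + "."))
        else:
            flat.append((label, entry["tool"]))
    return flat


flat_tasks = expand("call", PIPELINES)
```

Here `flat_tasks` lists 'prepare.align' and 'prepare.sort' ahead of 'call-variants', so the nested pipeline's tasks are addressed through the name of the task that wraps them.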