Categories
  • Alignment
    • bwa-se
      Align single-end fastq files using BWA mem, converting the result to BAM.
    • bwa
      Align paired-end fastq files using BWA mem, converting the result to BAM.
    • mosaik
      Align fastq files using Mosaik. In this version, additional 'special' reference sequences (usually mobile element insertions) are included in the reference.
  • Annotation
  • BAM-processing
    • add-read-group
      Add or remove sample and read group information to a BAM file.
    • bam-to-fastq
      Convert a BAM file back into a FASTQ file.
    • bedtools-coverage
      Calculate genome coverage from a sorted BAM file using bedtools.
    • filter-bam
      Filter a set of BAM files.
    • gatk-gvcf
      Prepare the BAM file for the haplotype caller by realigning indels and recalibrating the base qualities, then generate a GVCF file for each sample.
    • get-region-coverage
      Calculate the coverage of a BAM file within a genomic region.
    • index-bam
      Index BAM files.
    • merge-bam
      Merge a set of BAM files.
    • tangram-bam
      Add tags expected by Tangram to BAM files generated by aligners other than Mosaik.
  • FASTA-processing
    • build-mosaik-reference
      Concatenate reference fasta files and generate a Mosaik reference and jump database.
    • bwa-index
      Generate the FM-index for use with bwa alignment.
    • fasta-dictionary
      Generate a dictionary containing all of the sequences in the input reference fasta.
    • index-fasta
      Index a FASTA file.
    • tangram-index
      Population call variants using Freebayes, filtering the results with standard filtering methods.
  • SV-discovery
    • tangram-index
      Population call variants using Freebayes, filtering the results with standard filtering methods.
    • tangram
      INCLUDE DESCRIPTION
  • Scripts
    • find-cds
      Search through a GFF (GTF) file and extract all CDS entries. Two output files are generated: a file containing a list of all genomic regions and a file with the corresponding gene name, transcript id and exon number.
    • regions-coord
      Generate a list of regions based on a list of chromosome coordinates.
    • regions-dict
      Generate a list of genomic regions based on a FASTA dictionary file.
  • Tools
    • bedtools-coverage
      Calculate genome coverage from a sorted BAM file using bedtools.
  • VCF-processing
    • get-indels
      Get all indels from a VCF file and standardize the resulting VCF file.
    • index-vcf
      Index VCF files.
    • merge-vcf
      Merge a set of VCF files.
    • simplify-vcf
      Break up all complex alleles into their primitive components, e.g. SNPs and indels. Normalize and compress the resulting VCF file.
    • snpeff
      Annotate a VCF file using SnpEff.
    • standardize-vcf
      Standardize a VCF file. This involves steps including normalization of variants, removing duplicates etc.
    • subset-vcf
      Subset a compressed VCF file on a list of samples and normalize the records.
    • vcf-extract
      Extract VCF records with a given filter field value.
    • vcf-primitives
      Break complex variants into their constituent primitive units and normalize.
  • Variant-discovery
    • compress-vcf
      Compress a VCF file using bgzip and index with tabix.
    • freebayes
      Population call variants using Freebayes, filtering the results with standard filtering methods.
    • gatk-genotype
      Joint call genotypes from a list of GVCF files.
    • gatk-gvcf
      Prepare the BAM file for the haplotype caller by realigning indels and recalibrating the base qualities, then generate a GVCF file for each sample.
    • gatk-vqsr
      Recalibrate the variant scores. It is recommended that this is performed if there are sufficient variants (one whole exome or at least 30 exomes).
    • normalize-vcf
      Normalize a VCF file using vt and index with tabix.
    • tangram-index
      Population call variants using Freebayes, filtering the results with standard filtering methods.
    • tangram
      INCLUDE DESCRIPTION
  • Visualisation
    • coverage
      Calculate the genome coverage across reference sequences (usually chromosomes), generating R-plots.
  • kmer-processing
    • kmer-histogram
      Count the number of occurrences of kmers of a specified length and then construct a histogram of the counts.
Pipelines
  • add-read-group
    Add or remove sample and read group information to a BAM file.

    Arguments
    -c --clear flag Removes all read groups from the header.
    -rg --region string Only read data from this genomic region.
    -v --remove string Removes this sample name and all associated read groups from the header.
    -s --sample string Apply this sample name to the BAM file.
    -i --in string The input BAM file(s).
    -g --read-group string Apply this read group to the BAM file.
  • bam-to-fastq
    Convert a BAM file back into a FASTQ file.

    Arguments
    -o --out string Base output name for generated output files.
    -f --first-mate-extension string Read name extension to use for the first read in a pair. Default is '/1'.
    -a --add-name-to-plus flag Add the read name/extension to the '+' line of the fastq records.
    -i --in string The input SAM / BAM file to convert.
    -r --fasta-reference string Reference file for converting '=' in the sequence to the actual base. If '=' are found and the reference file is not specified, 'N' is written to the FASTQ.
    -p --params flag Print the parameter settings to stderr.
    -n --read-name flag Process the BAM as read-name sorted instead of coordinate sorted if the header does not indicate a sort order.
    -v --no-reverse flag Do not reverse complement reads marked as reverse.
    -m --merge flag Generate 1 interleaved (merged) FASTQ for paired-ends (unpaired in a separate file). Use firstOut to override the filename of the interleaved file.
    -e --no-eof flag Do not expect an EOF block on a bam file.
    -s --second-mate-extension string Read name extension to use for the second read in a pair. Default is '/2'.
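
    The effect of --no-reverse, --add-name-to-plus and the mate extensions can be pictured with a small Python sketch. This only illustrates the behaviour described above and is not the converter's own code; the function names and the record layout are assumptions made for this example.

```python
# Illustrative sketch only; the names below are hypothetical.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_record(name, seq, qual, is_reverse, is_first_mate,
                 first_ext="/1", second_ext="/2",
                 no_reverse=False, add_name_to_plus=False):
    """Format one read as a four-line FASTQ record."""
    # Reads marked as reverse are restored to their original orientation
    # unless no_reverse is set; the qualities are reversed to match.
    if is_reverse and not no_reverse:
        seq = reverse_complement(seq)
        qual = qual[::-1]
    read_name = name + (first_ext if is_first_mate else second_ext)
    plus = "+" + read_name if add_name_to_plus else "+"
    return "@{}\n{}\n{}\n{}\n".format(read_name, seq, plus, qual)
```

    For example, fastq_record("r1", "AACG", "IIH#", True, True) yields a record named r1/1 with sequence CGTT and qualities #HII.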
  • bedtools-coverage
    Calculate genome coverage from a sorted BAM file using bedtools.

    Arguments
    -g --genome-file string The input genome file. This is a tab delimited file, with structure: .
    -bg --bed-graph flag Report depth in BedGraph format. For details, see: genome.ucsc.edu/goldenPath/help/bedgraph.html.
    -o --out string The output coverage file.
    -s --split flag Treat "split" BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).
    -t --strand string Calculate coverage of intervals from a specific strand. With BED files, requires at least 6 columns (strand is column 6). Can take the values + or -.
    -5 --five-prime flag Calculate coverage of 5' positions (instead of entire interval).
    -3 --three-prime flag Calculate coverage of 3' positions (instead of entire interval).
    -i --in string The input sorted BAM file.
    -l --track-line flag Adds a UCSC/Genome-Browser track line definition in the first line of the output. See here for more details about track line definition: http://genome.ucsc.edu/goldenPath/help/bedgraph.html. NOTE: When adding a trackline definition, the output BedGraph can be easily uploaded to the Genome Browser as a custom track, BUT CAN NOT be converted into a BigWig file (w/o removing the first line).
    -c --scale flag Scale the coverage by a constant factor. Each coverage value is multiplied by this factor before being reported. Useful for normalizing coverage by, e.g., reads per million (RPM). Default is 1.0; i.e., unscaled.
    -bg0 --bed-graph-with-zero flag Report depth in BedGraph format, as with --bed-graph. However with this option, regions with zero coverage are also reported. This allows one to quickly extract all regions of a genome with 0 coverage by applying: "grep -w 0$" to the output.
    -d0 --depth-zero-based flag Report the depth at each genome position (with zero-based coordinates). Reports only non-zero positions. Default behavior is to report a histogram.
    -d1 --depth-one-based flag Report the depth at each genome position (with one-based coordinates). Default behavior is to report a histogram.
    -m --combine-max-depth integer Combine all positions with a depth >= max into a single bin in the histogram. Irrelevant for --depth-one-based and --bed-graph.
    -to --track-options flag Writes additional track line definition parameters in the first line. Example: --track-options 'name="My Track" visibility=2 color=255,30,30'. Note the use of single-quotes if you have spaces in your parameters.
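
    As a rough illustration of --combine-max-depth and --scale, the Python sketch below bins per-position depths into a histogram and multiplies depths by a constant factor. It is a simplified model under stated assumptions, not the bedtools implementation; all function names are hypothetical.

```python
# Simplified model of the histogram and scaling options; names are hypothetical.
from collections import Counter


def depth_histogram(depths, combine_max_depth=None):
    """Count positions per depth, pooling depths >= combine_max_depth."""
    hist = Counter()
    for depth in depths:
        if combine_max_depth is not None and depth >= combine_max_depth:
            depth = combine_max_depth
        hist[depth] += 1
    return dict(sorted(hist.items()))


def scale_depths(depths, factor=1.0):
    """Multiply every depth by a constant factor (1.0 leaves it unscaled)."""
    return [depth * factor for depth in depths]


def rpm_factor(total_reads):
    """Scale factor that expresses coverage per million reads."""
    return 1_000_000 / total_reads
```

    With depths [0, 1, 1, 5, 9, 12] and a maximum of 5, the three positions at depth 5 or more share a single bin.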
  • build-mosaik-reference
    Concatenate reference fasta files and generate a Mosaik reference and jump database.

    Arguments
    -r --fasta-reference string Reference fasta file.
    -o --out string The output fasta reference including the moblist.
    -m --mobile-element-fasta string The mobile element reference fasta file.
    -hs --hash-size integer Record all hashes in the genome of this size. [4 - 32]
  • bwa
    Align paired-end fastq files using BWA mem, converting the result to BAM.

    Arguments
    -o --out string The output BAM file.
    -q2 --fastq2 string The input fastq file (second mate).
    -tb --threads-bwa integer The number of threads for BWA-mem.
    -q --fastq string The input fastq file (first mate).
    -r --reference-prefix string The reference FASTA file prefix.
    -s --sample-id string The sample id.
    -p --platform-id string The platform id (e.g. ILLUMINA).
    -id --read-group-id string The read group id.
  • bwa-index
    Generate the FM-index for use with bwa alignment.

    Arguments
    -r --fasta-reference string The reference FASTA file.
    -x --index string The output FM index filename stub.
    -a --bwt-algorithm string BWT construction algorithm: bwtsw or is [auto].
  • bwa-se
    Align single-end fastq files using BWA mem, converting the result to BAM.

    Arguments
    -o --out string The output BAM file.
    -tb --threads-bwa integer The number of threads for BWA-mem.
    -q --fastq string The input fastq file (first mate).
    -r --reference-prefix string The reference FASTA file prefix.
    -s --sample-id string The sample id.
    -p --platform-id string The platform id (e.g. ILLUMINA).
    -id --read-group-id string The read group id.
  • compress-vcf
    Compress a VCF file using bgzip and index with tabix.

    Arguments
    -i --in string The file to be compressed.
  • coverage
    Calculate the genome coverage across reference sequences (usually chromosomes), generating R-plots.

    Arguments
    -os --out-scaled string The output pdf absolute coverage plot file with zero coverage bases removed.
    -z --include-zero flag Plot the bin of reads with zero coverage.
    -oc --out-absolute string The output pdf absolute coverage plot file.
    -i --in string The input sorted BAM file.
    -x --x-axis-maximum integer The maximum value for the x axis for the absolute coverage plot [determined by data, or --combine-max-depth].
    -ops --out-proportional-scaled string The output pdf scaled proportional coverage plot file.
    -r --read-counts flag Plot the raw read counts, rather than the percentage of the reads with each coverage.
    -or --out-proportional string The output pdf proportional coverage plot file.
    -l --log-scale flag Use a log scale for the y-axis.
    -c --combine-max-depth integer Combine all positions with a depth >= max into a single bin in the histogram.
    -s --reference-sequences string A sorted list of reference sequences to use when plotting the histogram.
  • fasta-dictionary
    Generate a dictionary containing all of the sequences in the input reference fasta.

    Arguments
    -r --fasta-reference string The input reference FASTA file.
    -o --out string The output sequence dictionary.
  • filter-bam
    Filter a set of BAM files.

    Arguments
    -imr --is-mate-reverse-strand bool Keep only alignments with mate on reverse strand?
    -rg --region string Only read data from this genomic region.
    -imm --is-mate-mapped bool Keep only alignments with mates that mapped.
    -g --length string Keep reads with length that matches pattern.
    -t --tag string Keep reads with this key=>value pair.
    -ism --is-second-mate bool Keep only alignments marked as second mate?
    -isi --is-singleton bool Keep only singletons?
    -o --out string The output filtered and merged BAM file.
    -i --in string The input BAM file.
    -id --is-duplicate bool Keep only alignments that are marked as duplicate?
    -n --name string Keep reads with name that matches pattern.
    -ipr --is-paired bool Keep only alignments that were sequenced as paired?
    -mq --mapping-quality string Keep reads with mapping qualities that match pattern.
    -a --alignment-flag integer Keep reads with this *exact* alignment flag (for more detailed queries, see 'Alignment flag filters').
    -q --query-bases string Keep reads with motif that matches pattern.
    -z --insert-size string Keep reads with insert size that matches pattern.
    -im --is-mapped bool Keep only alignments that were mapped?
    -if --is-failed-qc bool Keep only alignments that failed QC?
    -ir --is-reverse-strand bool Keep only alignments on reverse strand?
    -s --script string The filter script file (see bamtools documentation for more information).
    -ifm --is-first-mate bool Keep only alignments marked as first mate?
    -ipa --is-primary-alignment bool Keep only alignments marked as primary?
    -ipp --is-proper-pair bool Keep only alignments that passed PE resolution?
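
    Most of the boolean filters above correspond to individual bits of the SAM alignment flag that --alignment-flag matches exactly. The Python sketch below decodes a flag into those properties. The bit values come from the SAM specification; the mapping onto these option names is an assumption made for illustration, and the tool's own handling may differ.

```python
# Bit values follow the SAM specification; pairing them with the option
# names is an assumption made for this sketch.
# Each entry is (bit, value of the property when the bit is set).
FLAG_BITS = {
    "is-paired": (0x1, True),
    "is-proper-pair": (0x2, True),
    "is-mapped": (0x4, False),             # bit set means unmapped
    "is-mate-mapped": (0x8, False),        # bit set means mate unmapped
    "is-reverse-strand": (0x10, True),
    "is-mate-reverse-strand": (0x20, True),
    "is-first-mate": (0x40, True),
    "is-second-mate": (0x80, True),
    "is-primary-alignment": (0x100, False),  # bit set means secondary
    "is-failed-qc": (0x200, True),
    "is-duplicate": (0x400, True),
}


def describe_flag(flag):
    """Decode an integer alignment flag into named boolean properties."""
    return {
        name: bool(flag & bit) == when_set
        for name, (bit, when_set) in FLAG_BITS.items()
    }
```

    A flag of 99 (1 + 2 + 32 + 64), for instance, decodes to a paired, properly paired first mate whose mate is on the reverse strand.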
  • find-cds
    Search through a GFF (GTF) file and extract all CDS entries. Two output files are generated: a file containing a list of all genomic regions and a file with the corresponding gene name, transcript id and exon number.

    Arguments
    -tl --transcript-list string The list of gene names, transcript ids and exon numbers corresponding to the CDS (ordered to correspond with the regions file).
    -i --in string The input gff file.
    -rl --region-list string The list of genomic regions corresponding to the CDS (ordered to correspond with the transcripts file).
    -s --sequences string The input list of reference sequences to consider.
  • freebayes
    Population call variants using Freebayes, filtering the results with standard filtering methods.

    Arguments
    -sf --standard-filters flag Use stringent input base and mapping quality filters. Equivalent to -m 30 -q 20 -R 0 -S 0
    -mre --min-repeat-entropy integer To detect interrupted repeats, build across sequence until it has entropy > N bits per bp. (default: 0, off)
    -Q --mismatch-base-quality-threshold integer Count mismatches toward --read-mismatch-limit if the base quality of the mismatch is >= Q. default: 10
    -r --fasta-reference string Use FILE as the reference sequence for analysis.
    -T --theta float The expected mutation rate or pairwise nucleotide diversity among the population under analysis. This serves as the single parameter to the Ewens Sampling Formula prior model. default: 0.001.
    -rg --region string :... Limit analysis to the specified region, 0-base coordinates, end_position not included (same as BED format).
    -u --no-complex flag Ignore complex events (composites of other classes).
    -X --no-mnps flag Ignore multi-nucleotide polymorphisms, MNPs.
    -mmq --min-mapping-quality integer Exclude alignments from analysis if they have a mapping quality less than Q. default: 0
    -rgl --report-genotype-likelihood-max flag Report genotypes using the maximum-likelihood estimate provided from genotype likelihoods.
    -ce --contamination-estimates string A file containing per-sample estimates of contamination, such as those generated by VerifyBamID. The format should be: sample p(read=R|genotype=AR) p(read=A|genotype=AA). Sample '*' can be used to set default contamination estimates.
    -S --genotype-variant-threshold integer Limit posterior integration to samples where the second-best genotype likelihood is no more than log(N) from the highest genotype likelihood for the sample. default: ~unbounded.
    -rq --reference-quality string Assign mapping quality of MQ to the reference allele at each site and base quality of BQ. default: 100,60
    -gq --genotype-qualities flag Calculate the marginal probability of genotypes and report as GQ in each sample field in the VCF output.
    -pc --pooled-continuous flag Output all alleles which pass input filters, regardless of genotyping outcome or model.
    -z --read-max-mismatch-fraction float Exclude reads with more than N [0,1] fraction of mismatches where each mismatch has base quality >= mismatch-base-quality-threshold. default: 1.0
    -n --use-best-n-alleles integer Evaluate only the best N SNP alleles, ranked by sum of supporting quality scores. (Set to 0 to use all; default: all)
    -k --no-population-priors flag Equivalent to --pooled-discrete --hwe-priors-off and removal of Ewens Sampling Formula component of priors.
    -x --index string bam index file.
    -o --out string The output filtered VCF file.
    -f --filter-expression string Specifies a filter to apply to the info fields of records, removes alleles which do not pass the filter.
    -pv --pvar float Report sites if the probability that there is a polymorphism at the site is greater than N. default: 0.0001.
    -maf --min-alternate-fraction float Require at least this fraction of observations supporting an alternate allele within a single individual in order to evaluate the position. default: 0.2
    -mbq --min-base-quality integer Exclude alleles from analysis if their supporting base quality is less than Q. default: 0
    -i --in string Add FILE to the set of BAM files to be analyzed.
    -gmb --genotyping-max-banddepth integer Integrate no deeper than the Nth best genotype by likelihood when genotyping. default: 6.
    -hl --haplotype-length integer Allow haplotype calls with contiguous embedded matches of up to this length. default: 3
    -ni --no-indels flag Ignore insertion and deletion alleles.
    -hba --haplotype-basis-alleles string When specified, only variant alleles provided in this input VCF will be used for the construction of complex or haplotype alleles.
    -fd --filter-tag-description string The description of the filter to include in the VCF header.
    -V --binomial-obs-priors-off flag Disable incorporation of prior expectations about observations. Uses read placement probability, strand balance probability, and read position (5'-3') probability.
    -ob --observation-bias string Read length-dependent allele observation biases from FILE. The format is [length] [alignment efficiency relative to reference] where the efficiency is 1 if there is no relative observation bias.
    -cnv --cnv-map string Read a copy number map from the BED file FILE, which has the format: reference sequence, start, end, sample name, copy number ... for each region in each sample which does not have the default copy number as set by --ploidy.
    -p --ploidy integer Sets the default ploidy for the analysis to N. default: 2.
    -Z --use-reference-allele flag This flag includes the reference allele in the analysis as if it is another sample from the same population.
    -pco --prob-contamination float An estimate of contamination to use for all samples. default: 10e-9
    -v --variant-input string Use variants reported in VCF file as input to the algorithm. Variants in this file will be treated as putative variants even if there is not enough support in the data to pass input filters.
    -e --read-indel-limit integer Exclude reads with more than N separate gaps. default: ~unbounded
    -j --use-mapping-quality flag Use mapping quality of alleles when calculating data likelihoods.
    -U --read-mismatch-limit integer Exclude reads with more than N mismatches where each mismatch has base quality >= mismatch-base-quality-threshold. default: ~unbounded
    -rsl --read-snp-limit integer Exclude reads with more than N base mismatches, ignoring gaps with quality >= mismatch-base-quality-threshold. default: ~unbounded
    -bqc --base-quality-cap integer Limit estimated observation quality by capping base quality at Q.
    -Y --min-supporting-mapping-qsum integer Consider any allele in which the sum of mapping qualities of supporting reads is at least Q. default: 0
    -ud --use-duplicate-reads flag Include duplicate-marked alignments in the analysis. Default: exclude duplicates marked as such in alignments.
    -w --hwe-priors-off flag Disable estimation of the probability of the combination arising under HWE given the allele frequency as estimated by observation frequency.
    -mrs --min-repeat-size integer When assembling observations across repeats, require the total repeat length to be at least this many bp. default: 5
    -msa --min-supporting-allele-qsum integer Consider any allele in which the sum of qualities of supporting observations is at least Q. default: 0
    -H --harmonic-indel-quality integer Use a weighted sum of base qualities around an indel, scaled by the distance from the indel. By default use a minimum BQ in flanking sequence.
    -ft --filter-tag string The text to add to the filter field for each record in the VCF file that satisfies the filter expression.
    -D --read-dependence-factor float Incorporate non-independence of reads by scaling successive observations by this factor during data likelihood calculations. default: 0.9
    -E --max-complex-gap integer Allow complex alleles with contiguous embedded matches of up to this length.
    -mac --min-alternate-count integer Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position. default: 1
    -npo --no-partial-observations flag Exclude observations which do not fully span the dynamically-determined detection window. default: use all observations, dividing partial support across matching haplotypes when generating haplotypes.
    -mc --min-coverage integer Require at least this coverage to process a site. default: 0
    -a --allele-balance-priors-off flag Disable use of aggregate probability of observation balance between alleles as a component of the priors.
    -W --posterior-integration-limits string Integrate all genotype combinations in our posterior space which include no more than N samples with their Mth best.
    -pd --pooled-discrete flag Assume that samples result from pooled sequencing. Model pooled samples using discrete genotypes across pools. When using this flag, set --ploidy to the number of alleles in each sample or use the --cnv-map to define per-sample ploidy.
    -fx --fasta-index string The FASTA reference index file.
    -lg --legacy-gls flag Use legacy (polybayes equivalent) genotype likelihood calculations.
    -t --targets string Limit analysis to targets listed in the BED-format FILE.
    -maq --min-alternate-qsum integer Require at least this sum of quality of observation supporting an alternate allele within a single individual in order to evaluate the position. default: 0
    -N --exclude-unobserved-genotypes flag Skip sample genotypings for which the sample has no supporting reads.
    -dla --dont-left-align-indels flag Turn off left-alignment of indels, which is enabled by default.
    -mat --min-alternate-total integer Require at least this count of observations supporting an alternate allele within the total population in order to use the allele in analysis. default: 1
    -gmi --genotyping-max-iterations integer Iterate no more than N times during genotyping step. default: 1000.
    -I --no-snps flag Ignore SNP alleles.
  • gatk-genotype
    Joint call genotypes from a list of GVCF files.

    Arguments
    -t --threads integer The number of threads.
    -x --index string The reference FASTA index file.
    -rg --region string The target genomic region.
    -n --annotation string One or more specific annotations to apply to variant calls.
    -i --in string The input GVCF file(s).
    -r --fasta-reference string The reference FASTA file.
    -a --analysis-type string The type of analysis to run.
    -l --logging-level string The minimum level of logging.
    -d --fasta-dictionary string The reference FASTA dictionary file.
  • gatk-gvcf
    Prepare the BAM file for the haplotype caller by realigning indels and recalibrating the base qualities, then generate a GVCF file for each sample.

    Arguments
    -dt --data-threads integer The number of data threads. Each data thread uses the full amount of memory normally given to a single run. For example, if a run typically uses 2Gb, using 2 data threads will require 4Gb of memory.
    -o --out string The output GVCF file.
    -rg --region string The genomic regions to analyze.
    -t --cpu-threads integer The number of CPU threads allocated to each data thread. CPU threads share the memory allocated to the data thread, so increasing this value does not affect the memory usage.
    -kr --known-recalibration string VCF file(s) containing known variants that will be supplied to the base quality recalibration step alone.
    -k --known-sites string A VCF file containing known variant alleles.
    -i --in string The input bam file(s).
    -r --fasta-reference string The FASTA reference.
  • gatk-vqsr
    Recalibrate the variant scores. It is recommended that this is performed if there are sufficient variants (one whole exome or at least 30 exomes).

    Arguments
    -r --fasta-reference string The reference FASTA file.
    -stf --snp-tranches-file string The output tranches file used by ApplyRecalibration for SNPs.
    -o --out string The output filtered and recalibrated VCF file.
    -oi --out-indel-recal string The output recal file used by ApplyRecalibration for INDELs.
    -rd --resource-dbsnp string A list of dbsnp sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run).
    -ro --resource-omni string A list of omni sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run).
    -ri --resource-indel string A list of indel sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run).
    -rs --resource-1000g string A list of 1000G SNP sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run).
    -mg --max-gaussians integer Maximum number of Gaussians for the positive model.
    -os --out-snp-recal string The output recal file used by ApplyRecalibration for SNPs.
    -i --in string The joint called VCF file to be recalibrated.
    -sv --snp-vcf string The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value for SNPs.
    -c --tranche float The levels of novel false discovery rate (FDR, implied by ti/tv) at which to slice the data. (in percent, that is 1.0 for 1 percent).
    -itf --indel-tranches-file string The output tranches file used by ApplyRecalibration for INDELs.
    -n --annotation string One or more specific annotations to apply to variant calls.
    -s --truth-sensitivity integer The truth sensitivity level at which to start filtering.
    -rh --resource-hapmap string A list of hapmap sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run).
  • get-indels
    Get all indels from a VCF file and standardize the resulting VCF file.

    Arguments
    -o --out string The output standardized VCF file including indels only.
    -rg --region string The genomic region to analyse.
    -f --filter-expression string The filter expression (enclose in quotes on the command line).
    -i --in string The input sorted VCF file.
    -r --fasta-reference string The reference FASTA file.
    -s --samples-files string A text file containing a list of samples on which to subset.
  • get-region-coverage
    Calculate the coverage of a BAM file within a genomic region.

    Arguments
    -p --pileup string The output pileup data.
    -t --transcripts string An input file list of all transcripts.
    -rg --region string The genomic region to consider.
    -i --in string The input BAM file whose coverage will be calculated.
  • index-bam
    Index BAM files.

    Arguments
    -i --in string The input BAM file.
    -o --out string The index file.
    -b --depth-based-index flag Create a non-standard (depth based) index file (*.bti). Default behaviour is to create a standard BAM index (*.bai).
  • index-fasta
    Index a FASTA file.

    Arguments
    -r --fasta-reference string The input FASTA file.
    -o --out string The output FASTA index file.
  • index-vcf
    Index VCF files.

    Arguments
    -o --out string The output index file.
    -f --force-overwrite flag Overwrite existing index.
    -i --in string The file to be indexed.
  • kmer-histogram
    Count the number of occurrences of kmers of a specified length and then construct a histogram of the counts.

    Arguments
    -t --threads integer The number of threads to use for counting kmers.
    -f --full flag Generate a full histogram. Don't skip count 0. (default: true).
    -c --canonical flag Count both strands, using the canonical representation (default: true).
    -p --percent float Only include bars in histogram if their counts are greater than this percentage of the maximum observed count.
    -q2 --fastq2 string The second mate FASTQ file.
    -k --kmer integer The kmer length.
    -x --x-label string The x axis label [bin].
    -l --title string The plot title.
    -q --fastq string The first mate FASTQ file.
    -s --size integer The initial hash size.
    -o --out string The output file containing the kmer counts (extension: .jf).
    -y --y-label string The y axis label [value].
    -hs --histogram string The output histogram plot.
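
    The idea behind canonical kmer counting and the histogram of counts can be sketched in a few lines of Python. This is a toy illustration rather than the counting tool used by the pipeline; the function names, and the choice to skip kmers containing N, are assumptions made for this example.

```python
# Toy illustration of canonical kmer counting; names are hypothetical.
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_comp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_comp)


def count_kmers(reads, k, use_canonical=True):
    """Count every kmer of length k across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue  # assumption: skip kmers with ambiguous bases
            counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts


def count_histogram(counts):
    """Map each occurrence count to the number of distinct kmers with that count."""
    return dict(sorted(Counter(counts.values()).items()))
```

    For the single read ACGTA with k = 2, the kmers AC and GT collapse into the canonical AC, so the histogram reports two kmers seen once and one kmer seen twice.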
  • merge-bam
    Merge a set of BAM files.

    Arguments
    -x --index string The BAM index file(s).
    -o --out string The output merged BAM file.
    -rg --region string Only read data from this genomic region.
    -i --in string The input BAM file(s).
  • merge-vcf
    Merge a set of VCF files.

    Arguments
    -o --out string The output merged, compressed VCF file.
    -i --in string The input VCF files to merge.
  • mosaik
    Align fastq files using Mosaik. In this version, additional 'special' reference sequences (usually mobile element insertions) are included in the reference.

    Arguments
    -t --threads integer Specifies the number of threads to use for the aligner.
    -l --lane string The library name. e.g. g1k-sc-NA18944-JPT-1.
    -m --mosaik-reference string The Mosaik format reference genome (.dat).
    -q2 --fastq2 string The input fastq file (second mate).
    -sh --special-reference-hashes integer Specifies the maximum number of hashes for the special reference sequences.
    -sam --sample-name string The sample name. e.g. NA12878.
    -st --sequencing-technology string Sequencing technology: '454', 'helicos', 'illumina', 'illumina_long', 'sanger' or 'solid'.
    -j --jump-database string The input Mosaik jump database stub.
    -as --ann-se string Neural network file for Mosaik mapping quality scores (single end).
    -s --special-reference-prefix string The prefix attached to 'special' reference sequences.
    -act --alignment-candidate-threshold integer Specifies the minimum length of an alignment candidate.
    -q --fastq string The input fastq file (first mate).
    -ap --ann-pe string Neural network file for Mosaik mapping quality scores (paired end).
    -c --center-name string The sequencing center name.
    -hs --hash-size integer The hash-size used in Mosaik [4 - 32].
    -id --read-group-id string The read group id, e.g. SRR009060.
    -mfl --median-fragment-length integer The median fragment length.
    -a --read-archive string The output read archive.
    -mhp --maximum-hashes-per-seed integer Specifies the maximum number of positions stored for each hash.
    -pu --platform string The platform unit, e.g. IL12_490_5.
  • normalize-vcf
    Normalize a VCF file using vt and index with tabix.

    Arguments
    -rg --region string The genomic region in which to perform the analysis.
    -i --in string The input VCF file to be normalized.
    -r --fasta-reference string The FASTA reference sequence file.
    -rgf --regions-file string A file containing a list of genomic regions to analyse.
    -fx --fasta-index string The FASTA reference index file.
    -w --window integer Window size for local sorting of variants [10000].
  • regions-coord
    Generate a list of regions based on a list of chromosome coordinates.

    Arguments
    -x --maximum integer The value to add to each given coordinate to define the upper value of the window.
    -c --chromosome integer The chromosome on which the windows are to be applied.
    -o --out string A list of region windows.
    -i --in string A file containing a list of genomic positions around which to generate windows (chromosome entered with --chromosome).
    -m --minimum integer The value to subtract from each given coordinate to define the lower value of the window.
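    The window arithmetic described by --minimum and --maximum can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name windows_around, the chrom:start-end output format, and the clamping of the lower bound to 1 are all assumptions.

```python
def windows_around(positions, chromosome, minimum, maximum):
    """Build one region string per input position.

    Hypothetical helper: the 'chrom:start-end' format and the clamp
    to position 1 are assumptions, not documented gkno behaviour.
    """
    regions = []
    for pos in positions:
        start = max(1, pos - minimum)  # lower value: coordinate minus --minimum
        end = pos + maximum            # upper value: coordinate plus --maximum
        regions.append(f"{chromosome}:{start}-{end}")
    return regions


if __name__ == "__main__":
    # Positions on chromosome 20, 100 bases below and 200 bases above each.
    for region in windows_around([1000, 50], 20, 100, 200):
        print(region)
```

    Each position yields exactly one window, so overlapping windows are not merged in this sketch.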
  • regions-dict
    Generate a list of genomic regions based on a FASTA dictionary file.

    Arguments
    -w --window-size integer The size of the genomic regions.
    -o --out string A list of genomic regions.
    -v --invert-sequences bool Generate regions only for the reference sequences not contained in the file specified with --reference-sequences.
    -i --in string A reference fasta dictionary.
    -s --reference-sequences string Only regions from the reference sequences in this file will be output. If --invert-sequences is also set, the sequences not contained in the file will be output.
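    The way --window-size, --reference-sequences and --invert-sequences interact can be sketched as follows. This is an illustration only: it assumes the dictionary uses SAM-header style @SQ lines with SN (name) and LN (length) tags, that regions are 1-based and written as name:start-end, and the function names parse_dict and dict_regions are hypothetical.

```python
def parse_dict(lines):
    """Extract (name, length) pairs from @SQ lines of a sequence dictionary.

    Assumes tab-separated SAM-header style lines with SN and LN tags.
    """
    seqs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue  # skip @HD and any other header lines
        tags = dict(field.split(":", 1) for field in fields[1:])
        seqs.append((tags["SN"], int(tags["LN"])))
    return seqs


def dict_regions(seqs, window_size, keep=None, invert=False):
    """Tile each sequence with fixed-size windows (the last may be shorter).

    keep   -- optional set of sequence names (cf. --reference-sequences)
    invert -- if True, use the sequences NOT in keep (cf. --invert-sequences)
    """
    regions = []
    for name, length in seqs:
        if keep is not None:
            listed = name in keep
            if listed == invert:
                continue  # unlisted when not inverting, or listed when inverting
        for start in range(1, length + 1, window_size):
            end = min(start + window_size - 1, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0\tSO:unsorted",
        "@SQ\tSN:chr1\tLN:250",
        "@SQ\tSN:chr2\tLN:100",
    ]
    print(dict_regions(parse_dict(example), 100))
```

    With keep left as None every sequence is tiled, which corresponds to running without --reference-sequences.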
  • simplify-vcf
    Break up all complex alleles into their primitive components, e.g. SNPs and indels. Normalize and compress the resulting VCF file.

    Arguments
    -r --fasta-reference string The reference FASTA file.
    -o --out string The output simplified VCF file.
    -i --in string The input sorted VCF file.
  • snpeff
    Annotate a VCF file using SnpEff.

    Arguments
    -d --database string The snpEff annotation database. This must have been downloaded (using snpeff-download) and be present in /gkno_launcher/tools/snpEff/data.
    -o --out string The output annotated, compressed VCF file.
    -i --in string The input VCF file.
  • snpeff-download
    Download a SnpEff database.

    Arguments
    -o --out string The output database.
    -d --database string The SnpEff database name.
  • standardize-vcf
    Standardize a VCF file. The steps include normalizing variants and removing duplicates.

    Arguments
    -r --fasta-reference string The reference FASTA file.
    -s --samples-files string A text file containing a list of samples on which to subset.
    -o --out string The output standardized VCF file.
    -rg --region string The genomic region to analyse.
    -i --in string The input sorted VCF file.
  • subset-vcf
    Subset a compressed VCF file on a list of samples and normalize the records.

    Arguments
    -o --out string The output subsetted compressed VCF file.
    -rg --region string The genomic region in which to perform the analysis.
    -f --filter-expression string The filter expression (enclose in quotes on the command line).
    -s --samples-file string A file containing a list of samples on which to subset.
    -i --in string The input compressed VCF file.
    -r --fasta-reference string The FASTA reference.
    -rgf --regions-file string A file containing a list of genomic regions to analyse.
  • tangram
    Call mobile element insertions (MEIs) from a list of BAM files using Tangram, writing the calls to a VCF file.

    Arguments
    -t --threads integer The number of threads to use.
    -o --out string The output VCF file containing the MEI calls.
    -mf --minimum-fragments integer The minimum number of normal fragments in a library.
    -rg --region string The region of interest in the genome.
    -i --in string An input file containing a list of BAM files.
    -p --path string The output path for Tangram files.
    -a --tangram-reference string The reference file created by Tangram index.
    -l --library-file string Library information file (generated by tangram-scan).
    -ht --histogram-file string Fragment length distribution information file (generated by tangram-scan).
  • tangram-bam
    Add tags expected by Tangram to BAM files generated by aligners other than Mosaik.

    Arguments
    -o --out string The output BAM file.
    -m --mobile-element-fasta string The mobile element fasta file.
    -rg --region string The genomic region to consider (whole genome or chromosome recommended).
    -i --in string The input BAM file.
  • tangram-index
    Generate the Tangram reference file from a reference FASTA file and a file of insertion sequences.

    Arguments
    -r --fasta-reference string The input reference FASTA file.
    -s --special-reference string The input reference file containing the insertion sequences.
    -a --tangram-reference string The output reference file.
  • vcf-extract
    Extract VCF records with a given filter field value.

    Arguments
    -f --filter-expression string The filter expression (enclose in quotes on the command line).
    -rgf --regions-file string A file containing a list of genomic regions to analyse.
    -o --out string The output VCF file with the requested records.
    -rg --region string The genomic region in which to perform the analysis.
    -i --in string The input VCF file to be viewed.
  • vcf-primitives
    Break complex variants into their constituent primitive units and normalize.

    Arguments
    -r --fasta-reference string The FASTA reference.
    -o --out string The output compressed VCF file.
    -i --in string The input compressed VCF file.