Operations

This page describes all available operations in GBprocesS.
GBprocesS requires to specify parameters such as restriction enzyme cutsite remnants, sequencing primers, and other library-specific features.
Please see the explanatory illustration in the next tab to help identify these features in your own libraries.
Detailed description of a common library preparation workflow and how these relate to the subsequent steps of read preprocessing (“what goes on, must come off”) can be found on the page How it Works.
Examples of configuration .ini files for common scenarios can be found on the Example-page.

AverageQualityFilter

class AverageQualityFilter(average_quality, output_file_name_template, output_directory)

Remove records from a .fastq file based on the average quality of the whole read.

The ASCII quality score line of the .fastq records are interpreted using the Illumina 1.8+ Phred+33 encoding. An average of the resulting integers is compared to a set threshold. Reads with an average quality score below this threshold are discarded.

Parameters:

average_quality (int) – Minimum average quality threshold (Phred+33) to retain a read in the output.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

Raises:

ValueError – The average quality threshold could not be interpreted as an integer.

CutadaptDemultiplex

class CutadaptDemultiplex(error_rate, barcodes, output_file_name_template, output_directory, barcode_side_cutsite_remnant='', anchored_barcodes=True)

Perform sequence read demultiplexing using cutadapt.

Split a .fastq file into one or more samples based on the presence of ‘barcodes’: a short sequence of nucleotides that is uniquely linked to one sample. Barcodes can be ‘anchored’, meaning that they must be present at the beginning of the read. Additionally, in case of GBS data, the barcode sequence is typically followed by the remnant of the restriction enzyme cutsite that was used for digestion of the genomic DNA and ligation of the adapters. Specifying the barcode-side cutsite remnant sequence improves the demultiplexing specificity, as there is a longer sequence to match with.

This operation adds the {sample_name} field to the file name template.

Parameters:

error_rate (float) – relative fraction of allowed mismatches for a barcode or barcode+remnant combination to be matched (0 for exact match; or between 0 and 1 to allow mismatches).
barcodes (Union[Path, str]) – path to an existing barcodes .fasta file, where the name of the fasta record indicates the sample name, and the respective sequence defines the nucleotide sequence of its sample-specific barcode.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.
barcode_side_cutsite_remnant (str, optional) – String that describes the nucleotide sequence of the barcode-side cutsite remnant. defaults to “”
anchored_barcodes (bool, optional) – Only look for the barcode sequence (or barcode + barcode-side cutsite remnant combination) at the beginning of the read. defaults to True

Raises:

ValueError – error rate could not be interpreted as a floating point number.
AssertionError – The error rate must have a value between 0 (incl.) and 1 (excl.).
ValueError – The barcodes .fasta file does not exist or is not a file.
ValueError – The barcode-side cutsite remnant sequence must only contain IUPAC nucleotide characters.

CutadaptTrimmer

class CutadaptTrimmer(barcode_side_sequencing_primer, common_side_sequencing_primer, barcode_side_cutsite_remnant, common_side_cutsite_remnant, barcodes, minimum_length, error_rate, output_file_name_template, output_directory, anchored_adapters=True, spacer=None)

Perform trimming of barcodes, restriction enzyme cutsite remnants, adapters, and possible variable spacers.

Scheme of sequenced reads in a double-digest GBS experiment:

(top strand of GBS fragment is shown)
...|--barcode adapter--|-barcode-|-RE1-|-genomic insert-|-RE2-|-(spacer)-|--common adapter--|... # noqa: E501

  F sequencing primer->|-------------------- forward read ---------------->
            <---------------- reverse read ------------------------------|<-R sequencing primer # noqa: E501

after first trimming round:
                                       |---- forward read ---------------->
            <---------------- reverse read -------------|

The operation consists of two subsequent phases: The behavior of the first phase is dependant on whether or not the data contains spacers. If spacers are present, the 5’ sides of the forward and reverse reads are trimmed utilizing pattern trimming. In the forward reads, the barcodes and RE are removed. In the reverse reads (spacers and) restriction site remnants are removed. For each read the length difference between the barcode or spacer is calculated with the maximum barcode and spacer length. In order to level the reads for further operations, all reads are positionally trimmed at their 3’ side by an amount of bases equal the previously calculated length difference.

forward read:
(sequenced read is shown as it appears in the .fastq file, after positional trimming)
|-genomic insert-|-RE2-|-(spacer)-|--common adapter--|...

reverse read:
(sequenced read is shown as it appears in the .fastq file, after positional trimming)
|-genomic insert-|-RE1-|-barcode-|--barcode adapter--|...

The behavior of the first phase differs when no spacers are present. In this case, the sequences at the beginning of the forward (barcode + RE1) and reverse (RE2) are removed not by using the pattern of nucleotices, but by removing a fixed number of bases, regardless of which bases they may be (positional trimming).

In the second phase, pattern matching is used to trim the possible remaining constructs at the 3’ sides of the forward and reverse reads.

In the forward read (if the length of the read is a bit longer than the length of the genomic insert), a restriction enzyme cutsite remnant followed by an adapter sequence may be expected towards the 3’ end (see scheme above). The adapter sequence usually corresponds with the Illumina sequencing primer used to obtain the reverse read. For forward reads, the restriction enzyme cutsite remnant to be removed is corresponds to the ‘common-side cutsite remnant’, and the adapter sequence corresponds to the ‘common-side sequencing primer’.

For reverse reads, a restriction enzyme cutsite remnant followed by the sample-specific barcode and an adapter sequence may be expected towards the 3’ end (see scheme above). The adapter sequence usually corresponds with the Illumina sequencing primer used to obtain the forward read. For reverse reads, the restriction enzyme cutsite remnant to be removed corresponds to the ‘barcode-side cutsite remnant’, and the adapter sequence corresponds to the ‘barcode-side sequencing primer’.

If a variable spacer was used, then in the previous round, the variable spacer was determined for each read and can now be used to trim of the possible 3’ restriction enzyme cutsite remnants, spacers, and common adapters from the forward reads.

With pattern trimming, the reads are trimmed even if the searched pattern is only partially present at the 3’ end of the read. Conversely, if the pattern is found internally in the read sequence, the remainder of the read towards the 3’ end is also trimmed off.

forward read:
(sequenced read is shown as it appears in the .fastq file, after positional trimming)
|-genomic insert-|

reverse read:
(sequenced read is shown as it appears in the .fastq file, after positional trimming)
|-genomic insert-|

This operation is tailored for GBS methods where the barcode and restriction enzyme cutsite remnant (RE) sequences must be removed from the 5’ side of forward reads and the 3’ side of reverse reads, and the restriction enzyme cutsite remnants and possible spacers from the 5’ side of reverse reads and the 3’ side of forward reads. The operation can handle variable length sample-specific barcodes and read-specific spacers.

The user must provide the sequences of the restriction enzyme cutsite remants, the sequencing primers, the sample-specific barcodes, and possible variable spacers to Trimming.

Depending on the protocol, one (single-digest GBS) or two (double-digest GBS) restriction enzymes are used to cut the genomic DNA before adapter ligation.

In single-digest GBS, the restriction enzyme cutsite remnant sequence is the same for both the ‘barcode-side cutsite remnant’ and the ‘common-side cutsite remnant’. In double-digest GBS, the restriction enzyme sequence used in the adapters that also carry the barcode is called the ‘barcode-side cutsite remnant’, while the restriction enzyme sequence used in the common adapters (without barcode) is called the ‘common-side cutsite remnant’.

How to determine the restriction enzyme cutsite remnant?

The easiest way to deduce the restriction enzyme cutsite remnant is to first look up the restriction enzyme recognition site on e.g. NEB. If the result of double strand digestion would leave an overhang (sticky end) on the top strand, then the restriction enzyme cutsite remnant is equal to the sequence of the top strand up to the cut strand breakpoint (indicated by little diamonds on the top strand on NEB).In contrast, if the double strand digestion would leave an overhang (sticky end) on the bottom strand, then the restriction enzyme cutsite remnant is equal to the complement of the bottom strand sequence up to the cut strand breakpoint (indicated by little diamonds on the bottom strand on NEB)

Below we show examples for PstI, MspI, ApeKI, and EcoRI where the ‘|’ symbol indicates where the endonuclease would cut the strand in the nucleotide sequence.

#PstI
before digestion  -->    after digestion
5'--CTGCA|G--3'          5'--CTGCA|  Overhang located on top strand.
3'--G|ACGTC--5'          3'--G|
Sequence of top strand up to cut site: CTGCA
Restriction enzyme cutsite remnant: CTGCA

#MspI
before digestion  -->    after digestion
5'--C|CGG--3'            5'--C|
3'--GGC|C--5'            3'--GGC|    Overhang located on bottom strand.
Sequence of bottom strand up to cut site: GGC
Restriction enzyme cutsite remnant: CCG

#ApeKI
before digestion  -->    after digestion
5'--G|CWGC--3'           5'--G|
3'--CGWC|G--5'           3'--CGWC|  Overhang located on bottom strand.
Sequence of bottom strand up to cut site: CGWC
Restriction enzyme cutsite remnant: GCWG

#EcoRI
before digestion:  -->   after digestion:
5'--G|AATTC--3'          5'--G|
3'--CTTAA|G--5'          3'--CTTAA|  Overhang located on bottom strand.
Sequence of bottom strand up to cut site: CTTAA
Restriction enzyme cutsite remnant: GAATT

Parameters:

barcode_side_sequencing_primer (str) – Either the nucleotide sequence of the primer that was used to initiate sequencing on the barcode-side or ‘Nextera’ or ‘TruSeq’ to use a predefined value.
common_side_sequencing_primer (str) – Either the nucleotide sequence of the primer that was used to initiate sequencing on the common-side or ‘Nextera’ or ‘TruSeq’ to use a predefined value.
barcode_side_cutsite_remnant (str) – Nucleotide sequence of the barcode- side cutsite remnant. May also be provided in a .fasta file where each sample is followed by its barcode_side_cutsite_remnant.
common_side_cutsite_remnant (str) – Nucleotide sequence of the common- side cutsite remnant. May also be provided in a .fasta file where each sample is followed by its common_side_cutsite_remnant.
barcodes (Union[Path, str]) – path to an existing barcodes .fasta file, where the name of the fasta record indicates the sample name, and the respective sequence defines the nucleotide sequence of its sample-specific barcode.
anchored_adapters (bool) – Adapters must be present at the end of the read. In other words: the adapters must be the suffixes of the reads in order for them to be trimmed. defaults to True.
spacer (str) – sequence of the longest possible variable spacer in 5’-3’ direction of the genomic fragment. GbprocesS will consider every iteratively smaller fragment, until only the 5’ base is left, as a possible spacer. Parameter only available for paired-end data.
minimum_length (int) – Minimum length of the reads after trimming, shorter reads are discarded from the output.
error_rate (float) – rate of errors allowed between query sequences and the reads. Expressed as a floating point number from the interval ]0,1]. The error rate is only applied when removing sequences from the 5’ of the reads (phase 2).
output_file_name_template (str) – Template used to create the name of the output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

Raises:

ValueError – The minimum length of the trimmed read must be interpretable as an integer.
ValueError – The minimum length of the trimmed read must be an integer above 0.
ValueError – The error rate must be interpretable as a float.
ValueError – The barcode-side sequencing primer is neither a predefined value (e.g. “TruSeq” or “Nextera”) nor a valid DNA sequence.
ValueError – The common-side sequencing primer is neither a predefined value (e.g. “TruSeq” or “Nextera”) nor a valid DNA sequence.
ValueError – The common-side sequencing primer and the common- side cutsite remnant must be defined.
ValueError – The barcode-side cutsite remnant is not a valid DNA sequence.
ValueError – The common-side cutsite remnant is not a valid DNA sequence.
ValueError – If either the barcode-side sequencing primer or the barcode-side cutsite remnant are specified (to trim reverse reads), the other one must be defined as well.
ValueError – The barcodes .fasta file does not exist or is not a file.
ValueError – The error rate must have a value equal to 0 (perfect match) or larger than 0 and smaller than one ([0,1[).

LengthFilter

class LengthFilter(minimum_length, output_file_name_template, output_directory)

Remove records from a .fastq file with a read length smaller than a predefined size.

Parameters:

minimum_length (int) – Minimum length of the sequencing read to be retained in the output.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

Raises:

ValueError – The mimimum length of the reads could not be interpreted as an integer.
ValueError – The mimimum length of the reads should be expressed as a POSITIVE integer (0 included).

MaxNFilter

class MaxNFilter(max_n, output_file_name_template, output_directory)

Filter .fastq reads based on the number of uncalled nucleotides present. Reads are discarded if more than the specified threshold of ‘N’ nucleotides are present.

Parameters:

max_n (int) – Maximum number of allowed uncalled nucleotides in the reads.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

Raises:

ValueError – The maximum number of uncalled nucleotides could not be interpreted as an integer.
ValueError – The maximum number of uncalled nucleotides should be expressed as a POSITIVE integer (0 included).

Pear

class Pear(minimum_overlap, minimum_length, output_file_name_template, output_directory)

Perform merging: the construction of a single read from the overlap between pairs of forward and reverse reads. Merging is performed by PEAR.

Parameters:

minimum_overlap (int) – minimum overlap between the two reads to be considered for merging.
minimum_length (int) – The minimum length of the reads after merging. Reads shorter than the specified length will be discarded.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called. For this operation, an extra condition applies for the output filename template: it must end on ‘.assembled{extension}.’, as this is the way PEAR formats the output file names.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

Raises:

ValueError – The minimum overlap is not an integer.
ValueError – The minimum overlap must have a value larger than 0.
ValueError – The minimum length must be an integer.
ValueError – The minimum length must have a value larger than 0.
ValueError – The output files from PEAR end with ‘.assembled{extension}.’ Please specify this in the template for the output files in the configuration .ini file.

RemovePatternFilter

class RemovePatternFilter(pattern, output_file_name_template, output_directory)

Remove reads from fastq files based on the presence of a sequence pattern. The pattern must occur in the read without error.

Parameters:

pattern (str) – The pattern that will be searched for in the read.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

SlidingWindowQualityFilter

class SlidingWindowQualityFilter(window_size, average_quality, count, output_file_name_template, output_directory)

Remove records from a .fastq file based on quality scores within small regions of the read.

The whole read is scanned by focussing on subsequent fixed-sized regions of the read (‘windows’). For each window, the avarage quality score is calculated from the quality scores per nucleotide within that window. The quality scores are assumed to be Illumina 1.8+ Phred+33 encoded. If the average quality score of the window is smaller then a defined threshold, it is counted as a “bad” window. Once calculations for a window have finished, the next window is selected by moving the selection one nucleotide forward. If the number of windows counted as “bad” is above a user defined limit, the entire read is discarded.

example:

window_size = 21
count = 1 (maximum number of "bad" windows allowed to retain the read)
average_quality = 25 (minimum required threshold to retain a window as "good")

@EU861894-140/1
CCGATCTCTCGGCCTGCCCGGGGA
??CFFF?;HHAH#III??CFFF?,
|.........20........|       average quality score < 25 : total count of "bad" windows = 1
 |.........25........|      average quality score = 25 : total count of "bad" windows = 1
  |.........30........|     average quality score > 25 : total count of "bad" windows = 1
   |.........24........|    average quality score < 25 : total count of "bad" windows = 2

total count of bad windows (2) is greater than user defined count threshold (1).
So, the read is discarded.

Parameters:

window_size (int) – Length of the nucleotide sequence per ‘window’, used to extract quality scores from the fastq file.
average_quality (int) – Quality threshold for a window to be treated as having a “bad” average quality score.
count (int) – Threshold for the maximum number of “bad” windows allowed to retain the read.
output_file_name_template (str) – Template used to create the name of output files. The output file name template follows the syntax of format strings as descibed in PEP 3101. The template consists of text data that is transferred as-is to the output file name and replacement fields (indicated by curly braces) that describe what should be inserted in place of the field. The field name, the element inside the curly braces of the replacement field, must refer to a property attribute of the .fastq file if perform() is called.
output_directory (Union[Path, str]) – Path to an existing directory that will hold the output for this operation.

Raises:

ValueError – The average quality within the window could not be interpreted as an integer.
ValueError – The window size could not be interpreted as an integer.
ValueError – The window size of the reads should be expressed as a strictly POSITIVE integer.
ValueError – The count could not be interpreted as an integer.
ValueError – The count should be expressed as a strictly POSITIVE integer.