Scope & Quick Start

Scope

GBprocesS performs read preprocessing for Genotyping-By-Sequencing (GBS) libraries. Preprocessing is executed as an ordered, linear workflow, built upon a set of predefined operations that perform:

  1. Demultiplexing (Cutadapt).

  2. Trimming of barcodes, spacers, and restriction enzyme cutsite remnants at the 5’ end of the reads (compensating for variable-length barcodes and spacers by trimming at the 3’ end of forward reads), and of restriction enzyme cutsite remnants, barcodes, and adapter sequences at the 3’ end of the reads (Cutadapt).

  3. Merging of forward and reverse reads.

  4. Removal of reads with low quality base-calling (Python).

  5. Removal of reads with internal restriction sites (Python).

The names of all available operations can be listed with:

gbprocess --operations

Users can adjust the functionality of GBprocesS by listing the required operations, their execution order, and the run parameters in a configuration .ini file. For a detailed explanation of what each operation does and how to configure it, see the Operations section.


Quick Start

The configuration syntax used by GBprocesS follows the INI file format. This format defines sections, parameters, and comments. Note that section names and parameter names are case-sensitive.

Sections start with a section header between square brackets (e.g. [header1]). GBprocesS parses sections in order, starting at the top of the configuration .ini file. GBprocesS recognizes two types of sections: the [General] section, and sections that define an operation to be executed. Below each section header, parameters are defined with the syntax parameter_name = parameter_value, i.e. the name of the parameter and its value separated by an equals sign (=).
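
The section and parameter syntax described above can be illustrated with Python's standard configparser module. This is only a sketch of the INI format itself, not the parser GBprocesS uses internally; [header1] and parameter_name are the placeholder names from the text, not real operations or parameters.

```python
import configparser

config_text = """
[General]
# Use 1 CPU core
cores = 1
input_directory = /data/run/

[header1]
parameter_name = parameter_value
"""

# interpolation=None keeps values exactly as written in the file.
parser = configparser.ConfigParser(interpolation=None)
# configparser lower-cases parameter names by default; keep them as written
# so that lookups are case-sensitive.
parser.optionxform = str
parser.read_string(config_text)

# Sections are reported in the order in which they appear in the file.
print(parser.sections())
# Values are read as strings; lines starting with '#' are comments.
print(parser["General"]["cores"])
```

With optionxform set to str, a lookup for Cores in [General] does not find the cores parameter, which mirrors the case sensitivity noted above.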

General

The first section that must be specified is the [General] section. It configures pipeline behavior that is independent of the operations performed by the pipeline:

[General]
# Use 1 CPU core
cores = 1
# Location of the input files
input_directory = /data/run/
# Paired-end sequencing
sequencing_type = pe
# Template to parse the input files.
## For example, 17146FL-13-01-01_S9_L002_R1_001.fastq.bz2: run = 17146FL-13-01-01_S9_L002; orientation = 1; extension = .fastq.bz2
input_file_name_template = {run:24}_R{orientation:1}_001{extension:10}
# Location to store temporary files
temp_dir = /tmp/
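
To make the input_file_name_template example concrete, the sketch below splits the example file name into its fields. It assumes that the number after each colon is the exact width of the field in characters, which is consistent with the example above (24, 1, and 10 characters); the helper template_to_regex is hypothetical and is not part of GBprocesS, whose own template parsing may differ.

```python
import re

def template_to_regex(template):
    """Turn '{name:width}' fields into named groups of exactly `width` characters."""
    pattern = ""
    pos = 0
    for field in re.finditer(r"\{(\w+):(\d+)\}", template):
        # Literal text between fields (e.g. '_R', '_001') must match as-is.
        pattern += re.escape(template[pos:field.start()])
        pattern += "(?P<%s>.{%s})" % (field.group(1), field.group(2))
        pos = field.end()
    pattern += re.escape(template[pos:])
    return re.compile(pattern)

template = "{run:24}_R{orientation:1}_001{extension:10}"
file_name = "17146FL-13-01-01_S9_L002_R1_001.fastq.bz2"

match = template_to_regex(template).fullmatch(file_name)
fields = match.groupdict()
for name, value in fields.items():
    print(name, "=", value)
```

Under this assumption, a file name whose run part is not 24 characters long would not match the template, so the widths need to be adapted to the naming scheme of the input files.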

Operations

Any section following the [General] section will be interpreted as being an operation added to the workflow.
Nine optional operations are available, each with several parameters. A detailed explanation of every operation can be found in the Operations section. The scheme below depicts the operations in a logical order.
_images/Workflow_operations.png

Starting the pipeline

Once your custom configuration .ini file is finished, run the program with the following command:

gbprocess -c /path/to/config.ini

Because all settings are read from the configuration file, a template configuration .ini file can be re-used by changing only a few parameters and the paths to the data that need to be preprocessed.
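
One possible way to re-use a template is to derive a new configuration file per data set with a small script. The sketch below uses Python's configparser; the helper derive_config, the file names template.ini and run2.ini, and the path /data/run2/ are hypothetical. Note that configparser drops comments when writing, so editing a copy of the template by hand is equally valid.

```python
import configparser
import os
import tempfile

def derive_config(template_path, output_path, input_directory):
    """Copy a template configuration, replacing only the input directory."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    with open(template_path) as handle:
        parser.read_file(handle)
    parser["General"]["input_directory"] = input_directory
    with open(output_path, "w") as handle:
        parser.write(handle)
    # Command to start the pipeline with the new configuration file.
    return ["gbprocess", "-c", output_path]

# Minimal template, written to a temporary directory for this example.
workdir = tempfile.mkdtemp()
template_path = os.path.join(workdir, "template.ini")
with open(template_path, "w") as handle:
    handle.write("[General]\ncores = 1\ninput_directory = /data/run/\n")

run2_path = os.path.join(workdir, "run2.ini")
command = derive_config(template_path, run2_path, "/data/run2/")
print(" ".join(command))
```

The returned list can be passed to subprocess.run to start the pipeline, provided gbprocess is installed and available on the PATH.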


Output

By default, the run directories are created as defined by the user in the configuration .ini file (one per operation), and the output FASTQ files are placed in the respective directories, either uncompressed or compressed, matching the input. Log files of the third-party components (Cutadapt, PEAR) are placed in the respective directories and list the command line parameters and summary statistics per operation. Note that GBprocesS may be run on multiple cores, in which case FASTQ files may be split and processed in parallel, so that information on a given sample may appear in multiple log files (.out).


Debugging

By default, only run information is reported when executing GBprocesS, and no stack trace is provided on error. Use the --debug flag to report debugging information:

gbprocess --debug -c /path/to/config.ini