Configuring INI-files: Examples

This page provides quick-start guidelines and example config.ini files for six common scenarios, each using a combination of the following steps:

  1. Demultiplexing (Cutadapt).

  2. Trimming (Cutadapt): at the 5’ end of the reads, barcodes, spacers, and restriction enzyme cutsite remnants are trimmed; variable-length barcodes and spacers are compensated for by trimming at the 3’ end of the forward reads. At the 3’ end of the reads, restriction enzyme cutsite remnants, barcodes, and adapter sequences are trimmed.

  3. Merging of forward and reverse reads.

  4. Removal of reads with low quality base-calling (Python).

  5. Removal of reads with internal restriction sites (Python).


Restriction enzyme cutsite remnants and sequencing primers

To determine the cutsite remnants of the applied restriction enzymes (REs), look up each RE recognition site on the NEB website.
After removing every nucleotide to the right of the cut markers (triangles), one of two scenarios occurs:
  1. An overhang remains on the top strand (e.g. Pst I)

    In this case, the RE cutsite remnant is equal to the top strand overhang.

    _images/PstI_cutsite_remnant.png
  2. An overhang remains on the bottom strand (e.g. Msp I)

    In this case, the RE cutsite remnant is equal to the complement of the bottom strand overhang.

    _images/MspI_cutsite_remnant.png

In both scenarios, the RE cutsite remnant is independent of the side (barcode or common). For further detail, see the elaborate library preparation example of a double-digest paired-end sequencing locus in the third tab.
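The two scenarios above can be sketched in a few lines of Python. This is only an illustration of the rule as described on this page, not part of the pipeline: the function name `cutsite_remnant` and the way the cut positions are passed (number of bases to the left of the cut marker on each strand, counted along the top strand) are assumptions made for this example. The cut positions used are those of PstI (CTGCA^G) and MspI (C^CGG); check the result against the figures above before using it in a config.ini file.

```python
# Hypothetical helper, written for illustration only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def cutsite_remnant(site, top_cut, bottom_cut):
    """Return the cutsite remnant of a recognition site.

    site       -- recognition site, top strand, written 5' to 3'
    top_cut    -- number of bases left of the cut marker on the top strand
    bottom_cut -- number of bases left of the cut marker on the bottom strand
    """
    # Keep only the nucleotides to the left of each cut marker.
    top_left = site[:top_cut]
    # Bottom strand as drawn under the top strand (3' to 5').
    bottom_left = site[:bottom_cut].translate(COMPLEMENT)

    if top_cut > bottom_cut:
        # Scenario 1: the overhang is on the top strand.
        return top_left[bottom_cut:]
    # Scenario 2: the overhang is on the bottom strand,
    # so the remnant is the complement of that overhang.
    return bottom_left[top_cut:].translate(COMPLEMENT)


print(cutsite_remnant("CTGCAG", 5, 1))  # PstI: TGCA
print(cutsite_remnant("CCGG", 1, 3))    # MspI: CG
```

In both branches the result is the stretch of the recognition site between the two cut positions, which is why the remnant does not depend on the side (barcode or common).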


Examples starting with Demultiplexing

Single-digest GBS and single-end sequencing

_images/sdse_scheme.png

Single-digest GBS and paired-end sequencing

_images/sdpe_scheme_sep.png

Double-digest GBS and single-end sequencing

_images/ddse_scheme.png

Double-digest GBS and paired-end sequencing

_images/ddpe_scheme_sep.png

Starting after Merging

[General]
cores = 32
input_directory = /home/User/GBS_preprocessing/03_merging
sequencing_type = se
# After merging, only a single FASTQ file remains per sample. This file is essentially equivalent to a single-end FASTQ file, so the sequencing type should be se.
temp_dir = /tmp/
input_file_name_template = {sample_name:35}.assembled{extension:10}
# Example: 006_015_170516_001_0251_069_01_1081.assembled.fastq.bz2 => sample_name = 006_015_170516_001_0251_069_01_1081; extension = .fastq.bz2
# Essential information on run, read orientation, and file extension is obtained from the structure of the name of the original FASTQ file as provided by the service provider.
# {:XX} marks the number of characters in the file name that contain the specified information. One wildcard is allowed; it is invoked by leaving out the character length, e.g. {run}_R{orientation:1}{extension:10}.
# The fields "orientation" and "extension" are automatically transferred to all new file names created in the next steps.

[MaxNFilter]
max_n = 0
output_directory = ./05_max_n_filter
output_file_name_template = {sample_name}{extension}
# removes reads with N base calls.

[SlidingWindowQualityFilter]
window_size = 2
average_quality = 20
count = 1
output_directory = ./06_sliding_window
output_file_name_template = {sample_name}{extension}
# removes reads with low quality base calls.
# Translation of default values: For any given read, if any 2 consecutive bases (window_size) have an average Phred quality lower than 20 (average_quality) at least 1 (count) time, then remove the read.

[AverageQualityFilter]
average_quality = 25
output_directory = ./07_average_quality_filter
output_file_name_template = {sample_name}{extension}
# removes low quality reads.

[RemovePatternFilter.CTGCAG]
pattern = CTGCAG
# recognition site of the first enzyme (e.g. PstI)
output_directory = ./08_remove_chimera_partial_digest
output_file_name_template = {sample_name}{extension}
# removes reads with intact internal restriction enzyme recognition sites.

[RemovePatternFilter.CCGG]
pattern = CCGG
# recognition site of the second enzyme (e.g. MspI)
output_directory = ./09_remove_chimera_partial_digest
output_file_name_template = {sample_name}{extension}
# removes reads with intact internal restriction enzyme recognition sites.
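
To make the filter settings above more concrete, the following Python sketch applies the same four checks to a single read. It is a simplified illustration under stated assumptions, not the code used by the pipeline: all function names are hypothetical, quality strings are assumed to be Phred+33 encoded, and "lower than" is interpreted strictly (a window or read whose average equals the threshold is kept).

```python
# Hypothetical helpers, written for illustration only.

def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def passes_max_n(seq, max_n=0):
    """MaxNFilter: keep the read if it has at most max_n N base calls."""
    return seq.upper().count("N") <= max_n


def passes_sliding_window(quals, window_size=2, average_quality=20, count=1):
    """SlidingWindowQualityFilter: remove the read once `count` windows of
    `window_size` consecutive bases average below `average_quality`."""
    low_windows = 0
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < average_quality:
            low_windows += 1
            if low_windows >= count:
                return False
    return True


def passes_average_quality(quals, average_quality=25):
    """AverageQualityFilter: keep the read if its mean quality is high enough."""
    return bool(quals) and sum(quals) / len(quals) >= average_quality


def passes_pattern(seq, patterns=("CTGCAG", "CCGG")):
    """RemovePatternFilter: remove reads with an intact internal recognition site."""
    return not any(pattern in seq for pattern in patterns)


def keep_read(seq, quals):
    """Apply the four filters in the order used in the config above."""
    return (
        passes_max_n(seq)
        and passes_sliding_window(quals)
        and passes_average_quality(quals)
        and passes_pattern(seq)
    )


print(keep_read("ACGTACGT", [30] * 8))  # True: passes every filter
print(keep_read("AACCGGTT", [30] * 8))  # False: contains CCGG
```

With window_size = 2 and average_quality = 20, a read with qualities 30, 30, 10, 10, 30, 30 is removed because the window (10, 10) averages 10, whereas the window (30, 10) averages exactly 20 and does not count as low under the strict interpretation assumed here.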