This pipeline is distributed as a self-contained Apptainer (formerly known as Singularity) container, which includes all necessary dependencies and tools pre-installed.
To use the pipeline, you will need to have Apptainer installed on your system. Installation instructions are available on the Apptainer documentation site.
Once Apptainer is installed, no further setup is required. Simply download the pipeline container with this command (it requires ca. 2.5 GB of disk space):
apptainer pull library://jack2knife/pristine/pristine:latest
and then run it from within your project directory (see below for detailed instructions) with the following command:
./pristine_latest.sif
This single .sif file encapsulates the entire environment, ensuring consistent and reproducible results across different systems.
To run the pipeline, you need:
- A configuration file named config.yaml
- A properly structured input directory with raw genome assemblies
The file config.yaml must be placed in the same directory as the container (pristine_latest.sif). It defines all global parameters, paths, and settings required for the analysis (for details, see below).
A toy dataset is provided in the toy_dataset/ directory. It includes:
- A small set of raw genome FASTA files organized in the correct input structure
- A ready-to-use configuration file: config.yaml
To quickly test the pipeline:
- Download the toy_dataset/ directory
- Place the Apptainer container (pristine_latest.sif) inside the toy_dataset/ folder
- From within the toy_dataset/ directory, run:
./pristine_latest.sif
Currently, the pipeline supports input in the form of raw genome FASTA files (annotation-based support coming soon).
The input directory is organized by species. Each target species must have its own folder, named in the format Genus_species, where:
Genus is the genus name (capitalized)
species is the species name (lowercase)
Inside each species folder:
There must be one or more genome files in FASTA format
Each FASTA file must be named exactly as the folder, with a suffix _i, where i is a unique integer (e.g., Yersinia_pseudotuberculosis_1.fasta)
In addition, each species folder contains a subfolder named non-targets/, which holds genome files for closely related non-target species. These files follow the same naming rule: Genus_species_i.fasta.
At the top level of the input directory, there may also be a folder named non-targets/. This contains additional non-target genome files that will be used in all comparisons, regardless of species.
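The naming rules above can be checked before a run with a short script. The following is a minimal sketch, not part of the pipeline: the function names and regular expressions are assumptions that encode a strict reading of the rules (a single capitalized genus, a single lowercase species, an integer suffix, and a .fasta extension as in the examples).

```python
import re
from pathlib import Path

# Assumed patterns: a strict reading of the Genus_species and
# Genus_species_i.fasta naming rules described in this section.
SPECIES_DIR = re.compile(r"^[A-Z][a-z]+_[a-z]+$")
GENOME_FILE = re.compile(r"^([A-Z][a-z]+_[a-z]+)_(\d+)\.fasta$")


def check_genomes(folder, expected_prefix=None):
    """Check every file in `folder` against the genome naming rule."""
    problems = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        match = GENOME_FILE.match(path.name)
        if match is None:
            problems.append(f"{path}: not named Genus_species_i.fasta")
        elif expected_prefix is not None and match.group(1) != expected_prefix:
            problems.append(f"{path}: prefix does not match folder {expected_prefix}")
    return problems


def check_input_dir(root):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue
        if entry.name == "non-targets":
            # Top-level non-targets: any Genus_species_i.fasta file is fine.
            problems += check_genomes(entry)
            continue
        if not SPECIES_DIR.match(entry.name):
            problems.append(f"{entry.name}: folder is not named Genus_species")
            continue
        if not any(p.is_file() for p in entry.iterdir()):
            problems.append(f"{entry.name}: no target genome files")
        # Target genomes must carry the folder name as their prefix.
        problems += check_genomes(entry, expected_prefix=entry.name)
        non_targets = entry / "non-targets"
        if non_targets.is_dir():
            problems += check_genomes(non_targets)
        else:
            problems.append(f"{entry.name}: missing non-targets/ subfolder")
    return problems
```

Calling check_input_dir("input") returns an empty list when the layout matches these assumptions, and one message per offending file or folder otherwise.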
An example input directory structure:
input/
├── Yersinia_pseudotuberculosis/
│   ├── Yersinia_pseudotuberculosis_1.fasta
│   ├── Yersinia_pseudotuberculosis_2.fasta
│   └── non-targets/
│       ├── Yersinia_enterocolitica_1.fasta
│       └── Yersinia_intermedia_1.fasta
│
├── Escherichia_coli/
│   ├── Escherichia_coli_1.fasta
│   ├── Escherichia_coli_2.fasta
│   └── non-targets/
│       ├── Shigella_sonnei_1.fasta
│       └── Klebsiella_pneumoniae_1.fasta
│
└── non-targets/
    ├── Salmonella_enterica_1.fasta
    ├── Vibrio_cholerae_1.fasta
    └── Campylobacter_jejuni_1.fasta

The configuration file defines all input paths, pipeline settings, and primer design parameters. Below is a breakdown of each section:
input_type: Specifies the format of the input data.
- raw – Input consists of raw genome FASTA files (currently fully supported)
- prokka – Input is in the form of Prokka-annotated files (not yet supported)
- panaroo – Input comes from a Panaroo output folder (not yet supported)

- raw_dir: Path to the directory containing raw genome input folders (one per species)
- prokka_dir: Placeholder for Prokka input (set to null)
- panaroo_dir: Placeholder for Panaroo input (set to null)
- Directory where all output results, logs, and intermediate files will be stored.
- Maximum number of CPU cores to use for multi-threaded steps (e.g. alignment, BLAT, Panaroo).
- Tool used for multiple sequence alignment.
- Options: mafft, clustal, prank
- Minimum average proportion of informative SNPs a locus must have to be considered diagnostic and retained in downstream analysis.
- Path to an external Primer3 config file (not fully implemented).
- If set, it overrides the primer3.global_params section.
- Keep as null for now.
Parameters passed directly to Primer3 for primer design. These include:
- PRIMER_NUM_RETURN: Number of primer pairs Primer3 should return (default = 1)
- PRIMER_MAX_NS_ACCEPTED: Max number of ambiguous bases allowed in the template
- PRIMER_LIBERAL_BASE: Whether to use relaxed base recognition
- PRIMER_MIN_SIZE / PRIMER_MAX_SIZE: Minimum and maximum allowed primer lengths
- PRIMER_MIN_TM / PRIMER_MAX_TM: Minimum and maximum melting temperatures (°C)
- PRIMER_PRODUCT_SIZE_RANGE: Desired size range for PCR products (e.g. "500-1800")
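Because the minimum and maximum settings come in pairs, it can help to sanity-check them before starting a long run. The sketch below is illustrative only: check_primer3_params is a hypothetical helper, not part of the pipeline, and every value in the example dictionary except the "500-1800" product size range is a placeholder rather than a pipeline default.

```python
def check_primer3_params(params):
    """Return a list of inconsistencies found in a dict of Primer3 settings."""
    problems = []
    if params["PRIMER_MIN_SIZE"] > params["PRIMER_MAX_SIZE"]:
        problems.append("PRIMER_MIN_SIZE exceeds PRIMER_MAX_SIZE")
    if params["PRIMER_MIN_TM"] > params["PRIMER_MAX_TM"]:
        problems.append("PRIMER_MIN_TM exceeds PRIMER_MAX_TM")
    # Primer3 accepts one or more space-separated "low-high" ranges.
    for size_range in params["PRIMER_PRODUCT_SIZE_RANGE"].split():
        low, high = (int(bound) for bound in size_range.split("-"))
        if low > high:
            problems.append(f"product size range {size_range} is reversed")
    return problems


# Placeholder values for illustration; only the product size range
# is taken from the example in this section.
example_params = {
    "PRIMER_NUM_RETURN": 1,
    "PRIMER_MAX_NS_ACCEPTED": 0,
    "PRIMER_LIBERAL_BASE": 1,
    "PRIMER_MIN_SIZE": 18,
    "PRIMER_MAX_SIZE": 25,
    "PRIMER_MIN_TM": 57.0,
    "PRIMER_MAX_TM": 63.0,
    "PRIMER_PRODUCT_SIZE_RANGE": "500-1800",
}

print(check_primer3_params(example_params))
```

An empty list means the paired settings are mutually consistent; it says nothing about whether they suit a particular assay.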
Settings for SNP-aware primer design strategy:
- snp_window_size: Size of the window (in bp) used to scan for SNP-rich regions
- snp_top_n: Number of top-ranked loci to attempt primer design on
- min_snps: Minimum number of informative SNPs required within a window to proceed with primer design
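To make the interplay of snp_window_size and min_snps concrete, here is a minimal sketch of a sliding-window scan. It is an illustration under stated assumptions, not the pipeline's own implementation: the function name, the 0-based half-open windows, and the 1 bp step are all choices made for this example.

```python
from bisect import bisect_left


def snp_rich_windows(snp_positions, locus_length, snp_window_size, min_snps):
    """Return (start, end, count) for each window with at least `min_snps` SNPs.

    Positions are 0-based; windows are half-open [start, end) and slide by 1 bp.
    """
    positions = sorted(snp_positions)
    last_start = max(locus_length - snp_window_size, 0)
    hits = []
    for start in range(last_start + 1):
        end = start + snp_window_size
        # Count SNPs with start <= position < end using binary search.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count >= min_snps:
            hits.append((start, end, count))
    return hits


# Five SNPs on a 100 bp locus, scanned with a 10 bp window.
windows = snp_rich_windows([10, 12, 15, 40, 90], locus_length=100,
                           snp_window_size=10, min_snps=3)
print(windows)
```

With these inputs only the windows starting at positions 6 through 10 contain three SNPs (10, 12, and 15), so raising min_snps to 4 would leave no window eligible for primer design.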
Optional post-design validation settings using pBLAT:
- perform: Set to yes or no to enable/disable in silico validation
- database: Path to the reference .2bit database used for pBLAT validation
- pblat_min_identity: Minimum identity threshold (%) for considering alignments
- match_median_filter_tolerance: Allowed deviation in median match length between targets and non-targets
The pipeline generates an organized output directory containing results for each species. The top-level output/ folder includes the following subdirectories:
Contains Prokka annotation results for each genome. Automatically generated from raw input if input_type is set to raw.
Stores results from Panaroo pan-genome analysis, used internally for identifying shared and variable loci.
This folder contains loci identified as having high diagnostic potential based on SNP profiles:
- FASTA files for each high-value locus
- heatmap_informativeness.png: visual overview of SNP informativeness across loci and species
- snp_summary.csv: tabular summary of all loci, including SNP positions, proportions, and informativeness scores
The file snp_summary.csv provides a detailed summary of all analyzed loci with respect to their SNP-based diagnostic potential. For each locus, it includes:
- The number and proportion of informative SNPs (per non-target species and averaged)
- The positions of informative SNPs
- The median length of aligned sequences in target and non-target genomes
- The difference in median lengths, which may reflect indel-based divergence
- A list of informative SNP positions and the position range that captures the majority of SNPs (±2 SD)
This file serves as the analytical foundation for selecting loci for downstream primer design.
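The ±2 SD position range can be illustrated with a small calculation. This is a sketch under assumptions: the text above does not say whether the range is centred on the mean, or whether a population or sample standard deviation is used, so the version below (mean with population SD) is only one plausible reading, and snp_position_range is a hypothetical helper name.

```python
from statistics import mean, pstdev


def snp_position_range(snp_positions, n_sd=2.0):
    """Return (low, high) as mean -/+ n_sd population standard deviations."""
    centre = mean(snp_positions)
    spread = pstdev(snp_positions)
    return centre - n_sd * spread, centre + n_sd * spread


# Evenly spaced example positions; mean is 140, population SD is sqrt(800).
low, high = snp_position_range([100, 120, 140, 160, 180])
print(round(low, 2), round(high, 2))  # 83.43 196.57
```

For roughly bell-shaped position distributions such a range covers most of the SNPs, which is why it is a convenient target region for amplicon placement.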
- consensus_sequences/: consensus FASTA sequences of informative loci, used for primer design
- snp_density_plots/: line plots showing SNP distributions across loci for visualization and interpretation
Contains designed primer pairs targeting the most informative SNP-rich regions:
- primer_design_summary.csv: detailed summary of primer sequences, SNP coverage, Tm, GC content, and amplicon size
- One FASTA file per locus containing the left and right primer sequences (Locus_primers.fasta)
All outputs are grouped by species to keep analyses modular and easily navigable.