BLAST Primer Filter

BLAST_primer_filter parses BLASTN hits for primer pairs against a genome assembly, keeps the hit combinations that could yield a PCR product, and writes outputs ready for genome browsers and plotting. It is meant for marker-validation workflows where you already have a primer FASTA, a genome FASTA, and BLAST tabular output.

What It Produces

  • *_amplicons.tsv: viable amplicons, coordinates, identity metrics, and sequences.
  • *_amplicons.gff3: IGV/JBrowse-ready PCR product annotations.
  • *_failed.tsv: primer hits or primer pairs that failed filters, with reasons.
  • *_plot.pdf: contig-scaled overview of viable products and failed hits.

The plot uses a black horizontal bar for each chromosome or contig, dark green for normal viable amplicons, light green for inverted viable amplicons, red vertical ticks for failed primer hits, and numbered labels tied to primer names in the legend.

Example output

Features

  • Handles normal and inverted primer orientations.
  • Filters partial alignments so amplicons are built from full-length primer hits.
  • Exports viable amplicons, failed hits, GFF3 browser tracks, and PDF plots.
  • Supports configurable axis units: bp, kb, Mb, and Gb.
  • Annotates plots with primer-number legends for compact multi-marker views.

Installation

Create the conda/mamba environment from the project file:

mamba env create -f environment.yml
conda activate blast-primer-filter

If you prefer to keep the environment inside the repository:

mamba env create --prefix ./.conda -f environment.yml
conda activate ./.conda

The environment includes Python, Biopython, Matplotlib, BLAST+, and pytest.

Input Files

Primer FASTA

Each primer ID must end in F (forward) or R (reverse). Everything before that final letter is treated as the primer pair name, so both primers of a pair must share the same prefix. The recommended style is primername_F and primername_R.

>MyPrimer_F
ATGCGTACGTTAGC
>MyPrimer_R
CGTACGACTTACGA

With this style, the internal pair name is MyPrimer_; GFF3 display names trim the trailing underscore for readability.
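
The naming rule can be illustrated with a minimal sketch. The helper names pair_name, display_name, and group_pairs are hypothetical and only mirror the rule described above; they are not taken from blast_primer_analysis-v3.py.

```python
def pair_name(primer_id):
    """Split a primer ID into (pair name, direction) using its last letter."""
    direction = primer_id[-1:]
    if direction not in ("F", "R"):
        raise ValueError(f"primer ID must end in F or R: {primer_id!r}")
    return primer_id[:-1], direction


def display_name(pair):
    """Trim one trailing underscore, as described for GFF3 display names."""
    return pair[:-1] if pair.endswith("_") else pair


def group_pairs(primer_ids):
    """Group primer IDs by pair name: {pair: {"F": id, "R": id}}."""
    pairs = {}
    for primer_id in primer_ids:
        name, direction = pair_name(primer_id)
        pairs.setdefault(name, {})[direction] = primer_id
    return pairs


pairs = group_pairs(["MyPrimer_F", "MyPrimer_R"])
print(pairs)                       # {'MyPrimer_': {'F': 'MyPrimer_F', 'R': 'MyPrimer_R'}}
print(display_name("MyPrimer_"))   # MyPrimer
```

A pair that ends up with only an F or only an R entry in such a grouping is a sign that the two IDs do not share an identical prefix.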

Genome FASTA

A reference genome or assembly FASTA used for sequence extraction and plotting contig lengths.

BLAST TSV

Run BLASTN with this exact tabular output layout. The twelve columns must appear in the order shown:

makeblastdb -in genome.fasta -dbtype nucl
blastn -query primers.fasta -db genome.fasta \
  -outfmt "6 qseqid sseqid sstart send sstrand pident length mismatch gapopen evalue bitscore sseq" \
  -out blast.tsv
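
To show why the column order matters, here is a minimal sketch of reading that layout into typed records. The function name read_blast_rows and the example row are assumptions made for illustration; the script's own parser may differ.

```python
import csv
import io

# Column order copied from the -outfmt string above.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "sstart", "send", "sstrand", "pident",
    "length", "mismatch", "gapopen", "evalue", "bitscore", "sseq",
]
INT_FIELDS = ("sstart", "send", "length", "mismatch", "gapopen")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def read_blast_rows(handle):
    """Yield one dict per BLAST hit, converting numeric columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(BLAST_COLUMNS):
            raise ValueError(
                f"expected {len(BLAST_COLUMNS)} columns, got {len(fields)}"
            )
        row = dict(zip(BLAST_COLUMNS, fields))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        yield row


# Made-up example row, not real BLAST output.
example = io.StringIO(
    "MyPrimer_F\tchr1\t1001\t1014\tplus\t100.000\t14\t0\t0\t0.002\t28.2\tATGCGTACGTTAGC\n"
)
rows = list(read_blast_rows(example))
print(rows[0]["sseqid"], rows[0]["sstart"], rows[0]["send"], rows[0]["sstrand"])
# chr1 1001 1014 plus
```

A file produced with the default -outfmt 6 columns has a different column set and order, so a reader like this would either fail on the numeric conversions or assign values to the wrong fields.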

Usage

python blast_primer_analysis-v3.py \
  --primers primers.fasta \
  --blast blast.tsv \
  --genome genome.fasta \
  --out_prefix results \
  --min_len 80 \
  --max_len 3000 \
  --tick_units Mb \
  --tick_step 1

By default this writes results_amplicons.tsv, results_amplicons.gff3, results_failed.tsv, and results_plot.pdf.

For output columns, GFF3 attributes, and plot legend details, see docs/outputs.md.

You can also convert an existing amplicon table to GFF3 without the original genome FASTA:

python blast_primer_analysis-v3.py \
  --amplicons_tsv examples/second/Blast_SRR_primers_amplicons.tsv \
  --gff3 converted_amplicons.gff3
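
For orientation, a GFF3 feature line has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes. The sketch below builds one such line for an amplicon. The function gff3_line, the source label, the PCR_product feature type, and the ID/Name attributes are assumptions for illustration; docs/outputs.md describes what the script actually writes.

```python
from urllib.parse import quote


def gff3_line(contig, start, end, strand, pair, index,
              source="BLAST_primer_filter", feature_type="PCR_product"):
    """Format one amplicon as a GFF3 feature line (hypothetical layout)."""
    # GFF3 coordinates are 1-based with start <= end.
    low, high = sorted((start, end))
    # Trim the trailing underscore of the internal pair name for display.
    name = pair[:-1] if pair.endswith("_") else pair
    # Percent-encode characters that are reserved in GFF3 attributes.
    safe_name = quote(name, safe="")
    attributes = f"ID={safe_name}_{index};Name={safe_name}"
    columns = [contig, source, feature_type, str(low), str(high),
               ".", strand, ".", attributes]
    return "\t".join(columns)


print("##gff-version 3")
print(gff3_line("chr1", 1001, 1500, "+", "MyPrimer_", 1))
```

The index keeps IDs unique when one primer pair yields several amplicons, which genome browsers generally expect.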

Examples

  • examples/synthetic/ contains a tiny fully runnable example with a synthetic genome and demo_amplicon_F / demo_amplicon_R primers. Run it with:

    make synthetic
  • examples/second/ contains a soybean SRR marker example with the primer FASTA, BLAST TSV, expected amplicon table, expected failed-hit table, expected GFF3 track, and PDF plot.

  • The original soybean genome FASTA is not committed because it is a large external reference. Use the committed amplicon table to smoke-test GFF3 conversion, or provide the genome FASTA locally to rerun the full analysis.

Parameters

Required for full analysis:

  • --primers: primer FASTA file.
  • --blast: BLAST results in the required format.
  • --genome: genome FASTA file.

General:

  • --out_prefix: output prefix. Default: results.
  • --gff3: custom GFF3 output path. Default: <out_prefix>_amplicons.gff3.
  • --no_gff3: skip GFF3 output.
  • --amplicons_tsv: convert an existing amplicon TSV to GFF3 and exit.

Filtering:

  • --min_len: minimum allowed amplicon length. Default: 80.
  • --max_len: maximum allowed amplicon length. Default: 3000.
  • --require_3p: number of 3-prime bases that must match exactly. Default: 3.
  • --max_mismatches: maximum mismatches allowed in primer hits. Default: 10.
  • --min_pident: minimum percent identity required for a primer hit. Default: 0.0, which applies no identity cutoff.
  • --len_tolerance: allowed difference between primer length and hit length. Default: 5.
  • --min_fail_len_frac: minimum failed-hit alignment length fraction to report. Default: 0.8.
  • --min_fail_pident: minimum failed-hit percent identity to report. Default: 70.0.
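
As a rough illustration of how the hit-level thresholds interact, the sketch below classifies a single primer hit using the defaults listed above. The function names and the exact comparisons are assumptions, not the script's code. In particular, three_prime_ok assumes the hit sequence is already in primer orientation and spans the full primer.

```python
def classify_hit(primer_len, hit_len, mismatches, pident,
                 len_tolerance=5, max_mismatches=10, min_pident=0.0):
    """Return 'discarded', 'failed', or 'passed' for one primer hit."""
    # Partial alignments are dropped before any other check.
    if abs(hit_len - primer_len) > len_tolerance:
        return "discarded"
    # Identity checks mark the hit as failed but keep it for reporting.
    if mismatches > max_mismatches or pident < min_pident:
        return "failed"
    return "passed"


def three_prime_ok(primer_seq, hit_seq, require_3p=3):
    """True if the last require_3p bases of primer and hit are identical."""
    if require_3p <= 0:
        return True
    return primer_seq[-require_3p:].upper() == hit_seq[-require_3p:].upper()


def report_failed(primer_len, hit_len, pident,
                  min_fail_len_frac=0.8, min_fail_pident=70.0):
    """True if a failed hit is close enough to the primer to be reported."""
    return (hit_len / primer_len >= min_fail_len_frac
            and pident >= min_fail_pident)


print(classify_hit(20, 12, 0, 100.0))   # discarded (8 bases shorter than the primer)
print(classify_hit(20, 20, 11, 45.0))   # failed (more than 10 mismatches)
print(three_prime_ok("ATGCGTACGTTAGC", "ATGCGTACGTTAGT"))  # False
print(report_failed(20, 18, 75.0))      # True (0.9 length fraction, 75% identity)
```

Under these assumptions, a hit that is more than --len_tolerance away from the primer length never reaches the failed-hit report, regardless of the --min_fail_* settings.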

Plotting:

  • --tick_units: one of bp, kb, Mb, or Gb. Default: Mb.
  • --tick_step: spacing between x-axis ticks in the chosen units. Default: 1.0.
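
The two plotting options combine into a tick spacing in base pairs. The sketch below shows one way to compute tick positions and labels for a contig; tick_positions and UNIT_SCALE are hypothetical names, and the script's own axis code may differ.

```python
# Base pairs per axis unit.
UNIT_SCALE = {"bp": 1, "kb": 1_000, "Mb": 1_000_000, "Gb": 1_000_000_000}


def tick_positions(contig_len_bp, tick_units="Mb", tick_step=1.0):
    """Return (position_in_bp, label) pairs from 0 up to the contig length."""
    step_bp = tick_step * UNIT_SCALE[tick_units]
    if step_bp <= 0:
        raise ValueError("tick_step must be positive")
    n_ticks = int(contig_len_bp // step_bp)
    return [
        (i * step_bp, f"{i * tick_step:g} {tick_units}")
        for i in range(n_ticks + 1)
    ]


for position, label in tick_positions(2_500_000):
    print(int(position), label)
# 0 0 Mb
# 1000000 1 Mb
# 2000000 2 Mb
```

For small assemblies, a combination such as --tick_units kb --tick_step 50 keeps the number of ticks readable; for chromosome-scale contigs, the Mb default is usually the better starting point.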

Filter Order

The script applies filters in this order:

  1. --len_tolerance silently discards partial primer alignments whose length differs from the primer length by more than the tolerance. These hits are dropped entirely rather than marked as failed.
  2. --max_mismatches and --min_pident mark hits as failed, but keep them available for failed-hit reporting.
  3. --require_3p enforces exact matching at the primer 3-prime end.
  4. Primer hits are paired only when they land on the same contig and have a valid normal or inverted orientation.
  5. --min_len and --max_len filter the final PCR product span.
  6. --min_fail_len_frac and --min_fail_pident control how much failed-hit detail is written to *_failed.tsv.

For a fuller walkthrough of each filter, including common gotchas, see docs/filtering.md.
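
Steps 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: "normal" is taken to mean the forward primer hits the plus strand upstream of a minus-strand reverse primer hit, "inverted" is taken to mean the mirror arrangement, and pair_hits is a hypothetical helper. Hits are dicts using the BLAST column names, where minus-strand hits have sstart greater than send.

```python
def pair_hits(f_hit, r_hit, min_len=80, max_len=3000):
    """Return (orientation, start, end, length) for a viable product, else None."""
    # Step 4: both primers must land on the same contig.
    if f_hit["sseqid"] != r_hit["sseqid"]:
        return None
    # The primers must converge: one hit on each strand.
    if f_hit["sstrand"] == "plus" and r_hit["sstrand"] == "minus":
        orientation, plus_hit, minus_hit = "normal", f_hit, r_hit
    elif f_hit["sstrand"] == "minus" and r_hit["sstrand"] == "plus":
        orientation, plus_hit, minus_hit = "inverted", r_hit, f_hit
    else:
        return None
    # The product runs from the left edge of the plus-strand hit
    # to the right edge of the minus-strand hit.
    start = min(plus_hit["sstart"], plus_hit["send"])
    end = max(minus_hit["sstart"], minus_hit["send"])
    if start >= end:
        return None  # primers point away from each other
    # Step 5: filter on the final product span.
    length = end - start + 1
    if not (min_len <= length <= max_len):
        return None
    return orientation, start, end, length


forward = {"sseqid": "chr1", "sstart": 1001, "send": 1014, "sstrand": "plus"}
reverse = {"sseqid": "chr1", "sstart": 1500, "send": 1487, "sstrand": "minus"}
print(pair_hits(forward, reverse))  # ('normal', 1001, 1500, 500)
```

Because the length check comes last in this sketch, a pair on the same contig with convergent primers can still be rejected purely because its span falls outside --min_len and --max_len.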

Troubleshooting

  • If exact short primers are missing from BLAST output, rerun BLAST with blastn -task blastn-short. The default blastn task is megablast, whose large word size can miss primer-length queries.
  • If expected partial or distant hits disappear completely, increase --len_tolerance.
  • If *_failed.tsv is emptier than expected, lower --min_fail_pident or --min_fail_len_frac.

Development

Run the tests from an activated environment:

python -m pytest

The current tests verify soybean amplicon-to-GFF3 conversion, synthetic fixture formats, PNG plot previews, and the fully runnable synthetic BLAST example. They are split by purpose:

  • tests/test_core.py: direct function-level tests for blast_primer_analysis-v3.py.
  • tests/test_cli.py: command-line behavior and argument validation.
  • tests/test_examples.py: committed example fixtures and full synthetic BLAST run.

Citation

Conrad R. (2025). BLAST_primer_filter. GitHub repository: https://github.com/rotheconrad/BLAST_primer_filter
