OralDecontam

This repository provides an end-to-end pipeline that classifies paired-end FASTQ reads using Kraken2 and separates the classified reads into three categories:

Human (GRCh38)
Non-Human
Ambiguous/Root (reads only classified to the root taxonomic level)

The tool is built in Python and designed to work with large-scale metagenomic or sequencing data workflows.

Use Case

Ideal for oral samples (e.g. saliva, buccal swab) before alignment for genotyping purposes.

Requirements

Python 3.7+
Kraken2 installed and in $PATH
wget, curl, or aria2c for DB download
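
The requirements above can be checked before a run. The following is a minimal sketch that uses only the standard library; the helper name `missing_tools` is hypothetical and is not part of this repository.

```python
import shutil
import sys


def missing_tools(required=("kraken2",), downloaders=("wget", "curl", "aria2c")):
    """Return a list of unmet requirements; an empty list means all checks passed."""
    problems = []
    # The README asks for Python 3.7 or newer.
    if sys.version_info < (3, 7):
        problems.append("Python 3.7+ is required")
    # Every tool in `required` must be discoverable on $PATH.
    for tool in required:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found in $PATH")
    # Only one of the downloaders is needed for the DB download.
    if downloaders and not any(shutil.which(d) for d in downloaders):
        problems.append("need one of: " + ", ".join(downloaders))
    return problems


if __name__ == "__main__":
    for problem in missing_tools():
        print("MISSING:", problem)
```

Running the file directly prints one line per unmet requirement and prints nothing when everything is in place.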

Download

Download the Kraken2 database with:

wget https://www.decodebiome.org/HROM/data/genome_catalog/decontamination_HROM.tar.gz

Clone the repository:

git clone https://github.com/your-username/kraken2-human-nonhuman-split.git
cd kraken2-human-nonhuman-split

Install any missing Python dependencies:

pip install -r requirements.txt

Usage

Basic CLI Example

python kraken2_split_pipeline.py \
    --forward sample_R1.fastq.gz \
    --reverse sample_R2.fastq.gz \
    --db /path/to/kraken2_db \
    --prefix Sample01

With Custom Output Prefixes and Directory

python kraken2_split_pipeline.py \
    --forward sample_R1.fastq.gz \
    --reverse sample_R2.fastq.gz \
    --db ./kraken_db \
    --classified classified_output \
    --unclassified unclassified_output \
    --prefix ProjectX \
    --read_dir /data/reads
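
How the script calls Kraken2 internally is not documented here. As a rough sketch, an equivalent paired-end Kraken2 invocation could be assembled as below. The function name `build_kraken2_command` and its defaults are assumptions for illustration; the flags are standard Kraken2 options, and the `#` in the output patterns is replaced by Kraken2 with `_1` and `_2` for the two mates.

```python
import shlex


def build_kraken2_command(forward, reverse, db,
                          classified="classified",
                          unclassified="unclassified",
                          threads=1):
    """Build an argument list for a paired-end Kraken2 run (hypothetical helper)."""
    cmd = [
        "kraken2",
        "--db", db,
        "--paired",
        "--threads", str(threads),
        "--output", "kraken2_output.txt",   # per-read classification output
        "--report", "kraken2_report.txt",   # per-taxon summary
        "--classified-out", f"{classified}#.fastq",
        "--unclassified-out", f"{unclassified}#.fastq",
    ]
    # Tell Kraken2 explicitly when the inputs are gzip-compressed.
    if forward.endswith(".gz") and reverse.endswith(".gz"):
        cmd.append("--gzip-compressed")
    cmd += [forward, reverse]
    return cmd


if __name__ == "__main__":
    cmd = build_kraken2_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                                "/path/to/kraken2_db")
    # Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.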
Output Files

All output files are written to the current working directory (or an optional output directory, if added):

Kraken2 Outputs

kraken2_output.txt: Kraken2 classification log
kraken2_report.txt: Classification summary
classified_1.fastq, classified_2.fastq: Reads Kraken2 could classify
unclassified_1.fastq, unclassified_2.fastq: Reads Kraken2 couldn't classify

Separated FASTQ Files

Category	Forward Read File	Reverse Read File
Human	Sample01.grch38_1.fastq	Sample01.grch38_2.fastq
Non-Human	Sample01.non-human_1.fastq	Sample01.non-human_2.fastq
Ambiguous/Root	Sample01.ambiguous_1.fastq	Sample01.ambiguous_2.fastq
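
The naming pattern in the table above can be used to verify that a run produced every expected file. This is a small sketch under the assumption that the names follow `<prefix>.<category>_<mate>.fastq` exactly as tabulated; `expected_outputs` and `missing_outputs` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path

# Category suffixes as they appear in the table of separated FASTQ files.
CATEGORIES = ("grch38", "non-human", "ambiguous")


def expected_outputs(prefix, out_dir="."):
    """Map each category to its (forward, reverse) FASTQ paths."""
    out = Path(out_dir)
    return {
        cat: (out / f"{prefix}.{cat}_1.fastq", out / f"{prefix}.{cat}_2.fastq")
        for cat in CATEGORIES
    }


def missing_outputs(prefix, out_dir="."):
    """Return the expected FASTQ paths that do not exist on disk."""
    return [
        path
        for pair in expected_outputs(prefix, out_dir).values()
        for path in pair
        if not path.exists()
    ]


if __name__ == "__main__":
    for path in missing_outputs("Sample01"):
        print("not found:", path)
```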

Taxonomic Classification Criteria

Human reads are identified based on the following markers:

d__Eukaryota, p__Chordata, c__Mammalia, o__Primates,
f__Hominidae, g__Homo, s__Homosapiens_GrCH38.fna, GRCH38.fna

Reads classified only to root (taxid 1) are considered ambiguous.

All remaining reads are treated as non-human.
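
The three rules above can be sketched as a single decision function. This is an illustration, not the repository's implementation: it assumes the taxon field has the `name (taxid N)` form that Kraken2 writes when run with `--use-names`, that a read counts as human when its assigned taxon name equals one of the listed markers, and that only classified reads are passed in. The function name `categorize` and the taxids in the usage lines are made up for the example.

```python
import re

# Markers copied from the criteria above.
HUMAN_MARKERS = {
    "d__Eukaryota", "p__Chordata", "c__Mammalia", "o__Primates",
    "f__Hominidae", "g__Homo", "s__Homosapiens_GrCH38.fna", "GRCH38.fna",
}
ROOT_TAXID = 1

# Matches the "name (taxid N)" field produced with --use-names.
_TAXON_FIELD = re.compile(r"^(?P<name>.*) \(taxid (?P<taxid>\d+)\)$")


def categorize(taxon_field):
    """Return 'human', 'ambiguous', or 'non-human' for one classified read."""
    match = _TAXON_FIELD.match(taxon_field.strip())
    if match is None:
        raise ValueError(f"unrecognised taxon field: {taxon_field!r}")
    name = match.group("name").strip()
    taxid = int(match.group("taxid"))
    if name in HUMAN_MARKERS:
        return "human"
    if taxid == ROOT_TAXID:
        return "ambiguous"
    return "non-human"


if __name__ == "__main__":
    # Illustrative inputs only; real names and taxids depend on the database.
    for field in ("g__Homo (taxid 9605)", "root (taxid 1)",
                  "s__Streptococcus_mitis (taxid 28037)"):
        print(field, "->", categorize(field))
```

Exact name matching is the stricter choice; a substring match would also catch longer lineage strings but risks false positives on unrelated names.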

📚 References
Kraken2: https://ccb.jhu.edu/software/kraken2/
Human Genome GRCh38: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/
