ikafssn

ikafssn (Independent programs of K-mer-based Alignment-Free Similarity Search for Nucleotide sequences) is a suite of command-line tools that builds inverted indexes over NCBI BLAST DB nucleotide sequences and performs alignment-free similarity search using k-mer matching and collinear chaining.

A conference talk on the methods used in ikafssn is available on YouTube: https://youtu.be/ppoTB6MHsqY (spoken in Japanese, slides in English).

Features

Builds a k-mer inverted index directly from NCBI BLAST databases
Supports both contiguous k-mers and discontiguous megablast-style spaced seeds (-t 13/15/18 with k=8 or 9 for PCR-optimized search, -t 16/18/21 with k=11 or 12, coding/optimal/both templates)
Three-stage search pipeline: fast candidate filtering (Stage 1), position-aware collinear chaining (Stage 2), and Parasail pairwise alignment with CIGAR/percent identity output (Stage 3), with configurable mode selection (1/2/3)
Client-server architecture with UNIX/TCP socket and HTTP REST API support, with multi-database serving from a single process
Handles IUPAC ambiguous bases by expanding degenerate k-mers during indexing and search (configurable expansion limit)
Reports queries that cannot be searched as skip-markers in TSV / JSON / SAM (with reason and detail), so no query is silently dropped
Transparent compressed I/O: FASTA inputs auto-detect gzip / bzip2 / xz / zstd from leading magic bytes; TSV / JSON / FASTA outputs select the codec by .gz / .bz2 / .xz / .zst suffix (SAM/BAM excluded)
Parallel indexing and search via Intel TBB
Lightweight per-command executables, each linking only its required dependencies

Comparison with BLAST (blastn)

ikafssn and BLAST (blastn) both search NCBI BLAST databases, but differ in indexing strategy, search algorithm, and deployment model:

Aspect	ikafssn	blastn
Primary use case	K-mer-based similarity search across large nucleotide collections	Local alignment search for homologous sequences
Search algorithm	Seed-chain-align: k-mer inverted index → candidate filtering → collinear chaining → pairwise alignment	Seed-extend: word seeds → ungapped extension → gapped alignment (Smith-Waterman)
Database format	NCBI BLAST DB (reads via C++ Toolkit)	NCBI BLAST DB (native)
Index structure	Pre-built k-mer inverted index (direct-address table, 4^k entries) stored on disk	No pre-built database index; per-query lookup table built from query words, database scanned sequentially
Pipeline modes & index content	Default Mode 1: Stage 1 candidate filtering only, over a presence-only index (`.kix`, sequence IDs without positions); long parents are split into overlapping fragments at index-build time to preserve locality without storing positions. Mode 2 (optional) adds collinear chaining and Mode 3 (optional) adds pairwise alignment, both over a position-bearing index (`.kpx`)	No pre-built database index and no staged search modes; each query runs the full seed → ungapped extension → gapped alignment pipeline
Seeding	All k-mers indexed exhaustively (contiguous or spaced seeds); high-frequency filtering at index build time only (optional `-max_freq_build`)	Exact word matches (default word size 28 for megablast, 11 for blastn); discontiguous megablast templates
Scoring model	K-mer match count (Stage 1), chain length (Stage 2), semi-global alignment score (Stage 3)	E-value based on local alignment score (bit score) with statistical significance model
Alignment	Parasail semi-global, 1-piece affine gap (optional Stage 3)	BLAST gapped extension, affine gap penalties, with X-drop heuristic
Hits per subject	Configurable via `-stage2_max_nhit_per_subject` (default: 1; 0 = unlimited)	Multiple HSPs per subject by default
Ambiguous bases	IUPAC degenerate expansion in both index and query (configurable limit)	Not expanded; seeds at ambiguous positions are skipped, extension penalizes via scoring matrix
Client-server mode	Built-in: `ikafssnserver` + `ikafssnclient` (UNIX/TCP socket), `ikafssnhttpd` (HTTP REST)	Not built-in (requires external wrappers or NCBI cloud BLAST)
Parallelization	Intel TBB (`parallel_for`, `parallel_sort`); multi-query and multi-volume parallelism	OpenMP-based multi-threading (`-num_threads`)
Output format	TSV, JSON, SAM, BAM	Tabular, XML, ASN.1, HTML, SAM, and others (`-outfmt`)
Expected query sequences	Short to moderate-length sequences (PCR amplicons, marker genes)	Any length (short queries to full chromosomes)

Comparison with minimap2

ikafssn and minimap2 both follow the seed-chain-align paradigm, but differ in design goals and key algorithmic choices:

Aspect	ikafssn	minimap2
Primary use case	Database search: find all similar entries across a large sequence collection	Read mapping: map reads to a single reference genome
Database input format	NCBI BLAST DB	FASTA, FASTQ
Query input format	FASTA	FASTA, FASTQ
Expected query sequences	Short to moderate-length sequences such as PCR amplicons and marker genes (hundreds to a few thousand bases)	Genomic reads (Illumina short reads, PacBio/ONT long reads) and assembled sequences (contigs, chromosomes)
Expected query quality	High-accuracy sequences with few or no sequencing errors (e.g., Sanger, error-corrected consensus)	Designed to tolerate high error rates (ONT ~5–15%, PacBio CLR ~10–15%; also handles HiFi and Illumina)
K-mer size	k = 5–15 (`uint16_t` for k ≤ 8, `uint32_t` for k ≥ 9; selected by `2*k` bit width)	k = 1–28 (default 15)
Seeding	All k-mers indexed in a direct-address table (4^k entries)	Minimizers (subsampled k-mers) indexed in a hash table
Pipeline modes & index content	Default Mode 1: Stage 1 candidate filtering only, over a presence-only index (`.kix`, sequence IDs without positions); long parents are split into overlapping fragments at index-build time to preserve locality without storing positions. Mode 2 (optional) adds collinear chaining and Mode 3 (optional) adds pairwise alignment, both over a position-bearing index (`.kpx`)	Single seed-chain-align pipeline with no presence-only mode; the minimizer index always stores positions (reference ID, position, strand)
Candidate filtering	Explicit Stage 1: scan ID posting lists to score and filter candidates before chaining	No separate filtering stage; seeds go directly to chaining
Chaining DP score	Chain length (anchor count) with a diagonal-deviation constraint (`max_gap`)	Estimated matching bases minus a gap penalty with logarithmic distance term
Hits per subject	Configurable via `-stage2_max_nhit_per_subject` (default: 1 best chain; set >1 or 0 for unlimited)	Multiple chains per subject (primary, secondary, supplementary)
Default result limit	Unlimited (returns all candidates above threshold)	1 primary + up to 5 secondary per query (`-N 5`, `-p 0.8`)
Alignment	Parasail semi-global, 1-piece affine gap (optional Stage 3)	KSW2 with SIMD, 2-piece affine gap and Z-drop heuristic
Ambiguous bases	IUPAC degenerate expansion (configurable limit)	Not supported (N-containing k-mers skipped)
Output format	TSV, JSON, SAM, BAM	PAF, SAM
Index partitioning	BLAST DB multi-volume with `.kvx` manifest	`-I` batch partitioning with `--split-prefix`

Comparison with MMseqs2

ikafssn and MMseqs2 (nucleotide-vs-nucleotide search, --search-type 3) both prune candidates with a k-mer prefilter before alignment, but differ in index design, default pipeline, and intended workload. MMseqs2 is primarily a protein search and clustering suite; nucleotide-vs-nucleotide search is a secondary mode that is most effective for highly similar sequences:

Aspect	ikafssn	MMseqs2 (`--search-type 3`)
Primary use case	K-mer-based similarity search across large nucleotide collections	Many-against-many protein search and clustering; nucleotide-vs-nucleotide as a secondary mode
Search algorithm	Seed-chain-align: k-mer inverted index → candidate filtering → collinear chaining → pairwise alignment	Seed-prefilter-align: k-mer match → ungapped diagonal scoring → Smith-Waterman
Database input format	NCBI BLAST DB (reads via C++ Toolkit)	Own MMseqs2 DB built by `createdb` from FASTA/FASTQ (BLAST DB not read directly)
Index structure	Pre-built on-disk k-mer inverted index (direct-address table, 4^k entries), always accessed via mmap	Optional on-disk index built by `createindex` for repeated searches (loaded into memory or mmapped, `--db-load-mode`); without it, the k-mer index is built in memory at search time — but unlike blastn, a target-side k-mer index is always used
Pipeline modes & index content	Default Mode 1: Stage 1 candidate filtering only, over a presence-only index (`.kix`, sequence IDs without positions); long parents are split into overlapping fragments at index-build time to preserve locality without storing positions. Mode 2 (optional) adds collinear chaining and Mode 3 (optional) adds pairwise alignment, both over a position-bearing index (`.kpx`)	Single prefilter → alignment pipeline with no presence-only mode; the k-mer index always stores positions (`seqId` + `position_j`), required for diagonal scoring
Seeding	All k-mers indexed exhaustively (contiguous or spaced seeds); high-frequency filtering at index build time only (optional `-max_freq_build`)	k-mers (default k=15 for nucleotides), optional spaced seeds (`--spaced-kmer-mode`); per query position, similar k-mers are derived from a substitution matrix and looked up together, controlled by sensitivity `-s`
Candidate filtering	Explicit Stage 1: scan ID posting lists to score and filter candidates; an optional per-diagonal hit-count filter and collinear chaining handle positional consistency in Stage 2	Prefilter requiring two consecutive k-mer matches on the same diagonal, scored by an ungapped diagonal alignment threshold
Alignment	Parasail semi-global, 1-piece affine gap (Mode 3, optional)	Vectorized (striped SIMD) Smith-Waterman local alignment
Scoring model	K-mer match count (Stage 1), chain length (Stage 2), semi-global alignment score (Stage 3)	Prefilter diagonal score; alignment score, sequence identity, and E-value
Ambiguous bases	IUPAC degenerate expansion in both index and query (configurable limit)	No IUPAC degenerate expansion
Client-server mode	Built-in: `ikafssnserver` + `ikafssnclient` (UNIX/TCP socket), `ikafssnhttpd` (HTTP REST)	Not built-in (a separate desktop / local web-server app exists)
Parallelization	Intel TBB (`parallel_for`, `parallel_sort`); multi-query and multi-volume parallelism	OpenMP multi-threading + MPI distribution; GPU acceleration (NVIDIA Ampere or newer)
Output format	TSV, JSON, SAM, BAM	BLAST m8 tabular (via `convertalis`), and other MMseqs2 formats
Clustering	Not provided (search only)	Core feature (`cluster` / `linclust`), mature mainly for proteins; limited for nucleotides
Expected query sequences	Short to moderate-length sequences (PCR amplicons, marker genes)	Best for highly similar nucleotide sequences; less sensitive for divergent ones

Commands

Command	Purpose
`ikafssnindex`	Build k-mer inverted index from BLAST DB
`ikafssnsearch`	Local direct search (mmap index)
`ikafssnretrieve`	Extract matched subsequences (local or NCBI efetch)
`ikafssnserver`	Search daemon (UNIX/TCP socket)
`ikafssnhttpd`	HTTP REST proxy to ikafssnserver
`ikafssnclient`	Client (socket or HTTP)
`ikafssninfo`	Index/DB and server information display

Quick Start

# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Index a BLAST DB
ikafssnindex -db mydb -k 11 -o ./index

# Search
ikafssnsearch -ix ./index -query query.fasta

# Retrieve matched subsequences
ikafssnsearch -ix ./index -query query.fasta | ikafssnretrieve -db mydb > matches.fasta

Dependencies

C++20 compiler (GCC >= 11, Clang >= 13)
CMake >= 3.16
NCBI C++ Toolkit (for BLAST DB access)
Intel TBB (for parallelization)
Parasail >= 2.6 (for Stage 3 pairwise alignment)
htslib >= 1.17 (for SAM/BAM output)
zlib, libbz2, liblzma, libzstd (for transparent compressed I/O); pkg-config required for libzstd discovery
Drogon (optional, for ikafssnhttpd)
libcurl (optional, for HTTP client mode and remote retrieval)

Documentation

For detailed usage, command-line options, deployment architecture, and index file format specifications, see:

English: doc/ikafssn.en.md
Japanese: doc/ikafssn.ja.md

Repository

Primary: https://github.com/astanabe/ikafssn
Secondary: https://gitlab.com/astanabe/ikafssn

License

This project is licensed under the Apache License, Version 2.0. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 273 Commits
.github/workflows		.github/workflows
bench		bench
cmake		cmake
db		db
doc		doc
patches		patches
recipe		recipe
src		src
test		test
tools		tools
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
DEPENDENCIES		DEPENDENCIES
INSTALL		INSTALL
LICENSE		LICENSE
README.md		README.md
ikafssn-archive-keyring.asc		ikafssn-archive-keyring.asc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ikafssn

Features

Comparison with BLAST (blastn)

Comparison with minimap2

Comparison with MMseqs2

Commands

Quick Start

Dependencies

Documentation

Repository

License

About

Uh oh!

Releases 21

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ikafssn

Features

Comparison with BLAST (blastn)

Comparison with minimap2

Comparison with MMseqs2

Commands

Quick Start

Dependencies

Documentation

Repository

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 21

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages