High-performance Rust implementation for the Rinha de Backend 2026 challenge, featuring low-latency fraud scoring with vector similarity search.
This project implements a fraud detection API that uses k-nearest neighbors (k-NN) search on quantized feature vectors. The system is optimized for low latency through:
- AVX2 SIMD for vectorized distance calculations
- Memory-mapped indexes for zero-copy loading
- Epoll-based event loop for efficient I/O multiplexing
- File descriptor passing for inter-process communication
- Learned tree partitioning for efficient index search
- Specialized JSON parsers for fast request parsing
src/api/- HTTP API server with warmup and request handlingsrc/http/- HTTP request/response parsingsrc/fd_passing/- File descriptor passing with epoll event loopsrc/index/- Vector index with tree-based partitioningsrc/vector/- Query vector parsing with multiple strategiessrc/runtime/- Runtime configuration from environment variablessrc/lb/- Load balancer for multi-instance deployment
src/
βββ api/
β βββ mod.rs # Main entry point
β βββ warmup.rs # Index warmup logic
β βββ server.rs # FD mode server
β βββ handler.rs # HTTP request handler
βββ http/
β βββ mod.rs # HTTP module entry
β βββ parser.rs # Request parsing
β βββ responses.rs # Response constants
βββ fd_passing/
β βββ mod.rs # FD passing entry
β βββ evented.rs # Event loop logic
β βββ conn.rs # Connection management
β βββ epoll.rs # Epoll operations
β βββ io.rs # I/O utilities
βββ index/
β βββ mod.rs # Index entry
β βββ build.rs # Index building
β βββ format.rs # Index format
β βββ layout.rs # Memory layout
β βββ partition_scheme.rs # Tree partitioning
β βββ mmap.rs # Memory mapping
β βββ search.rs # Search algorithms
βββ vector/
β βββ mod.rs # Vector parsing entry
β βββ helpers.rs # Helper functions
β βββ compact.rs # Compact ordered parser
β βββ single_pass.rs # Single-pass parser
β βββ serde_fallback.rs # Serde fallback parser
βββ runtime.rs # Runtime configuration
- Rust 2024 Edition - Modern Rust with latest language features
- libc - Direct system calls for epoll, mmap, socket operations
- AVX2 SIMD - Vectorized distance calculations via
std::arch::x86_64 - mimalloc - High-performance memory allocator
- serde_json - Fallback JSON parsing
- threadpool - Thread pool for concurrent operations
- flate2 - Compression support
The index uses a learned decision tree to partition the vector space into 256 buckets (Tree256). Each query is routed to the most relevant partitions based on tree predicates, then k-NN search is performed within those partitions.
Key features:
- Label deferral optimization - Skips searching subtrees when consensus is reached
- Early exit threshold - Stops search when k-th neighbor distance is below threshold
- AVX2-accelerated distance - Computes 8 distances in parallel using SIMD
- Lower bound pruning - Uses bounding boxes to skip irrelevant partitions
The partition scheme is learned from sample queries using a decision tree:
- Tree depth: 8 levels (configurable up to 10)
- Predicates: Learned thresholds on vector dimensions
- Key computation: Binary traversal produces 8-bit partition key
Three parsing strategies are tried in order of performance:
- Compact Ordered Parser - Assumes fixed field order, fastest path
- Single-Pass Parser - Handles any field order with skipping
- Serde Fallback - Full serde deserialization for compatibility
Squared Euclidean distance computed with AVX2:
- Pairs of dimensions processed in parallel
- Early rejection using distance bounds
- Quantized i16 values for cache efficiency
RINHA_INDEX_PATH- Path to the index fileRINHA_NATIVE_SCALE- Quantization scale (build-time)RINHA_EARLY_EXIT_THRESHOLD- Stop search when k-th distance below this valueRINHA_LABEL_DEFER- Enable label deferral optimization (0/1)
RINHA_WARMUP_QUERIES- Number of warmup queriesRINHA_SELF_WARMUP_URL- URL for self-warmupRINHA_SELF_WARMUP_DURATION_MS- Self-warmup durationRINHA_SELF_WARMUP_CONCURRENCY- Self-warmup concurrencyRINHA_WARMUP_PAYLOADS_PATH- Path to warmup payloads
RINHA_EPOLL_BUSY_POLL- Enable busy polling (0/1)RINHA_EPOLL_IDLE_US- Idle timeout in microsecondsRINHA_SPIN_BEFORE_BLOCK_US- Spin duration before blocking
RINHA_CLIENT_FD_PRECONFIGURED- Assume FDs are pre-configured (0/1)
# Build release binaries
cargo build --release --bin api --bin preprocess --bin lb
# Build Docker images
make build
# Validate Docker Compose configuration
make config# Start local stack
make up
# Stop local stack
make down# Run all tests
cargo test
# Run specific test
cargo test --lib vector::tests::tests::compact_ordered_matches_serde_fallback- Memory-mapped indexes - Zero-copy loading with
mmap - Huge page advice -
MADV_HUGEPAGEfor TLB efficiency - mimalloc allocator - Reduced fragmentation
- Quantized vectors - i16 instead of f64 for cache efficiency
- AVX2 SIMD - Parallel distance calculations
- Busy polling - Reduce latency when under load
- Label deferral - Skip unnecessary subtree searches
- Early exit - Stop search when result is confident
- Epoll edge-triggered - Efficient event notification
- Non-blocking sockets -
TCP_NODELAY, non-blocking mode - File descriptor passing - Zero-copy between processes
- Buffer pooling - Reuse buffers to reduce allocations
The index uses a custom binary format (V5):
Header:
- Magic: "RNSPCST5" (8 bytes)
- Scale: i32
- Packed dimensions: i32
- Reference count: i32
- Partition count: i32
- Node count: i32
- Block count: i32
- Partition scheme: i16 (scheme_id, param, cut counts)
- Tree predicates: [dim: u8, flags: u8, threshold: i16]
Data sections:
- Partitions: [key: u16, root: u32, min: [i16; 16], max: [i16; 16]]
- Nodes: [left: i32, right: i32, start: u32, len: u16]
- Vectors: [i16; 16] blocks (AVX2-aligned)
- Labels: [u8; 8] per block
- Reference indices: [u32; 8] per block
- Node class bits: [u8] for label deferral
MIT