Sift 🔍

A fast, local, and fully offline semantic search CLI & TUI for your terminal.
Index folders of codebases, documentation, or notes, and search them instantly using natural language.
Zero cloud APIs. Zero costs. 100% local and private.

Sift is a personal hobby project built for developers who want a lightweight, local alternative to cloud vector databases. It loads the BGE-small-en-v1.5 embedding model from Hugging Face locally via ONNX Runtime (through CGo) and pairs it with a from-scratch HNSW vector graph index written in pure Go. Graph traversal takes less than 1 ms per query, and everything runs completely offline.

🚀 Key Features

  • Offline Local Embeddings: BGE-small-en-v1.5 embeddings computed locally on your CPU via native ONNX Runtime inference.
  • High-Speed HNSW Index: A Hierarchical Navigable Small World (HNSW) graph implemented from scratch in Go (M=16, efSearch=50), using visited bitsets with O(1) membership checks and introsort routines to keep allocations minimal.
  • Hybrid Keyword Boosting: Combines dense vector similarity scores with sparse keyword matches to improve relevance for exact matches and short queries.
  • Semantic Paragraph Chunking: Intelligent boundary chunking based on lines, markdown paragraphs (\n\n), and code blocks instead of mechanical word splits.
  • Debounced TUI: A sleek interactive terminal interface built with BubbleTea featuring fuzzy instant search, spinner indicators, Vim navigation, and direct editor integrations.
  • Incremental Watching: Multi-directory fsnotify file watcher that debounces writes and dynamically updates the index on modification or creation.
  • No DB Dependencies: The complete index fits in a lightweight local folder (.sift/) featuring a custom flat binary graph layout and a JSON metadata skip-cache.
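
As a rough illustration of the hybrid keyword boosting idea, the sketch below adds a keyword-overlap bonus to a dense similarity score. The function names (`keywordOverlap`, `boostScore`), the additive formula, and the `alpha` weight are all assumptions made for this example; they are not taken from Sift's source.

```go
package main

import (
	"fmt"
	"strings"
)

// keywordOverlap returns the fraction of distinct query terms that also occur
// in the chunk (case-insensitive, whitespace-tokenised). Hypothetical helper.
func keywordOverlap(query, chunk string) float64 {
	words := make(map[string]struct{})
	for _, w := range strings.Fields(strings.ToLower(chunk)) {
		words[w] = struct{}{}
	}
	seen := make(map[string]struct{})
	hits := 0
	for _, t := range strings.Fields(strings.ToLower(query)) {
		if _, dup := seen[t]; dup {
			continue
		}
		seen[t] = struct{}{}
		if _, ok := words[t]; ok {
			hits++
		}
	}
	if len(seen) == 0 {
		return 0
	}
	return float64(hits) / float64(len(seen))
}

// boostScore blends a dense similarity score with the keyword overlap.
// The additive form and the alpha weight are assumptions for this sketch.
func boostScore(cosine float64, query, chunk string, alpha float64) float64 {
	return cosine + alpha*keywordOverlap(query, chunk)
}

func main() {
	chunk := "The HNSW graph is flushed to disk"
	fmt.Println(keywordOverlap("hnsw persistence", chunk))
	fmt.Println(boostScore(0.62, "hnsw persistence", chunk, 0.2))
}
```

A boost like this mostly matters for short queries, where one exact term match carries more signal than the embedding alone.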

🏗 System Architecture

Sift is built on a modular pipeline designed to be simple, clean, and self-contained:

[Figure: Sift pipeline architecture]

  Components
  ──────────
  cmd/sift/          Cobra CLI subcommands (root, index, search, watch, tui, stats, clear, rebuild, bench, version)
  internal/config    non-global .sift.toml configuration parsing
  internal/chunker   streaming word-window text splitter, binary sniff
  internal/embed     ONNX session + tokenizer, EmbedDocs / EmbedQuery
  internal/hnsw      from-scratch HNSW graph + binary serialiser
  internal/index     ties chunker → embedder → HNSW, flush / load
  internal/watcher   fsnotify recursive dir watcher with debounce
  internal/tui       BubbleTea TUI (spinner · icons · vim nav · statusbar)

⚡ Quick Start

Prerequisites

  • Go 1.21+
  • make & curl
  • GCC (for native CGo tokenizer bindings)

Installation

Clone, fetch dependency assets, and build the binary:

git clone https://github.com/tejas242/sift
cd sift

# 1. Download native ONNX Runtime shared library (lib/onnxruntime.so)
make download-ort

# 2. Fetch the BGE-small-en-v1.5 model and config files
make download-model

# 3. Compile the production binary
make build

📖 CLI Usage

Sift provides a simple command-line interface for indexing, searching, and managing your files.

# Index a directory recursively (creates a local .sift/ index folder)
./sift index ./docs

# Perform a quick semantic search — prints top-10 ranked chunks
./sift search "how does HNSW handle graph persistence"

# Get results formatted in JSON for integration with other shell tools (like jq)
./sift search --json "asymmetric retrieval prefix"

# Limit result pool size
./sift search --top-k 5 "vector dimensions"

# Quiet execution (suppress verbose logs from ONNX model loading)
./sift -q stats

# Launch the interactive BubbleTea TUI
./sift tui

# Monitor directory recursively and update the index in real-time
./sift watch ./docs

# Wipe your index and rebuild completely from scratch
./sift rebuild ./docs

# Check index file statistics and size
./sift stats

# Wipe index and remove index files
./sift clear

⚙️ Persistent Configuration (.sift.toml)

Sift parses a .sift.toml file in the current working directory to save your setup:

model-dir = "./models"
ort-lib = "./lib/onnxruntime.so"
threads = 0              # 0 = auto-detect the thread count from available CPU cores
max-file-kb = 512        # skip indexing files larger than 512KB

⌨️ TUI Keybindings

When running ./sift tui, you enter a fully interactive terminal application:

  Key                Action
  Type anything      Re-searches the index in real time (debounced at 300 ms)
  ↑ / ↓ or k / j     Navigate through search results
  Enter              Open the selected file in your $EDITOR at the exact line number
  Ctrl+I             Toggle the index diagnostic statistics pane
  Esc                Back to search view
  Ctrl+C / Ctrl+Q    Exit Sift
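
For readers unfamiliar with debouncing, here is a minimal, generic sketch in Go. It is not Sift's TUI or watcher code; the `debouncer` type and the `simulateBurst` helper are hypothetical names, and BubbleTea applications often implement the same idea with tick messages instead of timers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// debouncer delays a callback until no new trigger has arrived for `delay`.
type debouncer struct {
	mu    sync.Mutex
	delay time.Duration
	timer *time.Timer
}

// trigger (re)starts the countdown; only the last call within a burst fires fn.
func (d *debouncer) trigger(fn func()) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.delay, fn)
}

// simulateBurst sends n triggers gapMs apart and reports how many callbacks ran.
func simulateBurst(delayMs, gapMs, n int) int {
	var runs int32
	d := &debouncer{delay: time.Duration(delayMs) * time.Millisecond}
	for i := 0; i < n; i++ {
		d.trigger(func() { atomic.AddInt32(&runs, 1) })
		time.Sleep(time.Duration(gapMs) * time.Millisecond)
	}
	time.Sleep(2 * d.delay) // let the final timer fire
	return int(atomic.LoadInt32(&runs))
}

func main() {
	// Five quick keystrokes 20 ms apart collapse into a single search.
	fmt.Println(simulateBurst(300, 20, 5))
}
```

The same pattern applies to file-system events: a burst of writes to one file restarts the timer, and the index update runs once the writes settle.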

🧠 Algorithmic Performance & Deep Dive

How HNSW Works

HNSW builds a hierarchy of navigable small-world graphs. Each inserted node is assigned a top layer drawn from an exponentially decaying distribution, so higher layers hold progressively fewer points. The upper layers act as a fast highway network over a sparse subset of points, while layer 0 contains all points.
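
The layer draw can be sketched as follows. The formula floor(-ln(u) · mL) with mL = 1/ln(M) is the usual choice from the HNSW literature; whether Sift uses exactly this normalisation is an assumption, and `randomLevel` and `levelHistogram` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randomLevel draws a node's top layer as floor(-ln(u) * mL) with mL = 1/ln(M).
func randomLevel(rng *rand.Rand, m int) int {
	mL := 1.0 / math.Log(float64(m))
	u := 1.0 - rng.Float64() // shift [0,1) to (0,1] so ln(u) is always finite
	return int(math.Floor(-math.Log(u) * mL))
}

// levelHistogram counts how many of n simulated inserts land on each top layer.
func levelHistogram(seed int64, m, n int) map[int]int {
	rng := rand.New(rand.NewSource(seed))
	counts := make(map[int]int)
	for i := 0; i < n; i++ {
		counts[randomLevel(rng, m)]++
	}
	return counts
}

func main() {
	counts := levelHistogram(42, 16, 100000)
	for level := 0; level < 4; level++ {
		fmt.Printf("top layer %d: %d nodes\n", level, counts[level])
	}
}
```

Under this formula with M=16, roughly 15 in 16 nodes stay on layer 0 only, about 1 in 16 reach layer 1 or above, and about 1 in 256 reach layer 2 or above.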

During a query:

  1. Search starts at the entry point on the top layer and greedily moves to the node closest to the query (a local minimum).
  2. The search descends layer by layer, using each layer's local minimum as the entry point for the next.
  3. On layer 0, a bounded beam search of width efSearch collects the approximate nearest neighbors.
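
The three query steps can be sketched on a toy graph. This is a simplified illustration, not Sift's implementation: the `graph` layout, the squared Euclidean metric, and the sort-based frontier (a real index would use heaps and a visited bitset) are all assumptions made to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// graph is a toy layered index: links[l][id] lists the neighbours of node id on layer l.
type graph struct {
	vecs  [][]float32
	links []map[int][]int
	entry int // entry point on the top layer
}

// cand pairs a node id with its distance to the query.
type cand struct {
	id int
	d  float32
}

// dist is squared Euclidean distance (an assumption; any metric works here).
func dist(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += (a[i] - b[i]) * (a[i] - b[i])
	}
	return s
}

// greedy walks layer l from start, hopping to a closer neighbour until none improves.
func (g *graph) greedy(q []float32, start, l int) int {
	cur, best := start, dist(q, g.vecs[start])
	for improved := true; improved; {
		improved = false
		for _, n := range g.links[l][cur] {
			if d := dist(q, g.vecs[n]); d < best {
				cur, best, improved = n, d, true
			}
		}
	}
	return cur
}

// beam runs a bounded best-first search on layer 0 and returns up to ef hits, closest first.
func (g *graph) beam(q []float32, start, ef int) []cand {
	byDist := func(s []cand) { sort.Slice(s, func(i, j int) bool { return s[i].d < s[j].d }) }
	visited := map[int]bool{start: true}
	frontier := []cand{{start, dist(q, g.vecs[start])}}
	results := []cand{frontier[0]}
	for len(frontier) > 0 {
		byDist(frontier)
		c := frontier[0]
		frontier = frontier[1:]
		if len(results) >= ef && c.d > results[len(results)-1].d {
			break // the closest unexplored node is worse than the worst kept hit
		}
		for _, n := range g.links[0][c.id] {
			if visited[n] {
				continue
			}
			visited[n] = true
			if d := dist(q, g.vecs[n]); len(results) < ef || d < results[len(results)-1].d {
				frontier = append(frontier, cand{n, d})
				results = append(results, cand{n, d})
				byDist(results)
				if len(results) > ef {
					results = results[:ef]
				}
			}
		}
	}
	return results
}

// search descends greedily through the upper layers, then beam-searches layer 0.
func (g *graph) search(q []float32, ef int) []cand {
	cur := g.entry
	for l := len(g.links) - 1; l > 0; l-- {
		cur = g.greedy(q, cur, l)
	}
	return g.beam(q, cur, ef)
}

// demoGraph builds six points on a line; layer 1 holds only nodes 0 and 3.
func demoGraph() *graph {
	g := &graph{links: []map[int][]int{
		{0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}},
		{0: {3}, 3: {0}},
	}}
	for i := 0; i < 6; i++ {
		g.vecs = append(g.vecs, []float32{float32(i)})
	}
	return g
}

func main() {
	hits := demoGraph().search([]float32{4.2}, 3)
	fmt.Println(hits[0].id, hits[1].id, hits[2].id) // prints 4 5 3
}
```

In the demo, the layer 1 hop jumps from node 0 straight to node 3, so the layer 0 beam search only has to explore the neighbourhood around the query instead of walking the whole chain.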

Hyperparameters

The pure-Go HNSW implementation is tuned to favor recall:

  Parameter        Value   Description
  M                16      Maximum bidirectional links per node.
  efConstruction   200     Size of the dynamic candidate list built during insertion.
  efSearch         50      Beam width evaluated during a search query.

📈 Latency & Accuracy Benchmarks

Run locally on modest consumer hardware (AMD Ryzen 3 3250U @ 2.6GHz, CPU-only):

  Metric                                  Result
  Recall@10 (1,000 vectors, BGE-small)    90.6%
  Graph insertion latency                 ~3.6 ms / vector
  Graph search latency (ef=50)            < 0.7 ms / query
  Embedding speed (BGE on CPU)            ~30-80 ms / text chunk
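
For context on the Recall@10 figure, recall is normally measured by comparing the index's results against an exhaustive scan. The sketch below shows one way to compute it. `bruteForceTopK` and `recallAtK` are hypothetical names, the squared Euclidean metric is an assumption, and this is not the benchmark code from the repository.

```go
package main

import (
	"fmt"
	"sort"
)

// bruteForceTopK returns the ids of the k vectors closest to q by exhaustive scan.
// Squared Euclidean distance is an assumption made for this sketch.
func bruteForceTopK(vecs [][]float32, q []float32, k int) []int {
	dists := make([]float32, len(vecs))
	ids := make([]int, len(vecs))
	for i, v := range vecs {
		ids[i] = i
		for j := range v {
			dists[i] += (v[j] - q[j]) * (v[j] - q[j])
		}
	}
	sort.SliceStable(ids, func(a, b int) bool { return dists[ids[a]] < dists[ids[b]] })
	if k > len(ids) {
		k = len(ids)
	}
	return ids[:k]
}

// recallAtK is the fraction of the exact top-k ids that the approximate result
// also contains. It assumes the ids in approx are distinct.
func recallAtK(approx, exact []int) float64 {
	if len(exact) == 0 {
		return 0
	}
	truth := make(map[int]struct{}, len(exact))
	for _, id := range exact {
		truth[id] = struct{}{}
	}
	hits := 0
	for _, id := range approx {
		if _, ok := truth[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(exact))
}

func main() {
	vecs := [][]float32{{0, 0}, {1, 0}, {0, 2}, {5, 5}}
	exact := bruteForceTopK(vecs, []float32{0.1, 0.1}, 2)
	approx := []int{0, 2} // pretend the ANN index returned these ids
	fmt.Println(exact, recallAtK(approx, exact)) // prints [0 1] 0.5
}
```

Averaging `recallAtK` over many queries with k=10 gives a Recall@10 number comparable in spirit to the one in the table.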

🛠 Testing & Benchmarking in CI

Even though this is a personal hobby project, I wanted to keep the engineering standards high, so the repository has two automated workflows:

  1. Build & Unit Tests (ci.yml): Compiles the code, runs go vet to catch suspicious constructs, and runs the unit tests without the model by using a mock embedder interface.
  2. Continuous Benchmarking (bench.yml): Tracks our custom BenchmarkHNSWInsert and BenchmarkHNSWSearch execution speeds over commits using github-action-benchmark to ensure no changes introduce performance regressions.
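
To show what model-free testing with a mock embedder can look like, here is a minimal sketch. Only the method names `EmbedDocs` and `EmbedQuery` come from the component list above; the `Embedder` interface, the method signatures, and the hash-based `mockEmbedder` are assumptions for illustration and may differ from the project's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Embedder is a hypothetical interface; only the method names come from the README.
type Embedder interface {
	EmbedDocs(texts []string) ([][]float32, error)
	EmbedQuery(text string) ([]float32, error)
}

// mockEmbedder maps text to a deterministic pseudo-vector without loading a model.
type mockEmbedder struct{ dim int }

// embed seeds a small linear congruential generator from an FNV hash of the text.
func (m mockEmbedder) embed(text string) []float32 {
	h := fnv.New64a()
	h.Write([]byte(text))
	state := h.Sum64()
	vec := make([]float32, m.dim)
	for i := range vec {
		state = state*6364136223846793005 + 1442695040888963407 // LCG step
		vec[i] = float32(state>>40) / float32(1<<24)            // value in [0, 1)
	}
	return vec
}

func (m mockEmbedder) EmbedDocs(texts []string) ([][]float32, error) {
	out := make([][]float32, len(texts))
	for i, t := range texts {
		out[i] = m.embed(t)
	}
	return out, nil
}

func (m mockEmbedder) EmbedQuery(text string) ([]float32, error) {
	return m.embed(text), nil
}

func main() {
	var e Embedder = mockEmbedder{dim: 384} // 384 matches BGE-small's output size
	docs, _ := e.EmbedDocs([]string{"hello", "world"})
	q, _ := e.EmbedQuery("hello")
	fmt.Println(len(docs), len(q), docs[0][0] == q[0]) // prints 2 384 true
}
```

Because the same text always yields the same vector, indexing and search logic can be exercised in CI without the ONNX Runtime library or the model files.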

📄 License

Sift is open-source software released under the MIT License.
