
Pluggable embedding backend; fastembed/ONNX default (#46)#51

Open
VarunGitGood wants to merge 5 commits into main from feat/onnx-embedding-backend


@VarunGitGood
Owner

Closes #46.

Summary

  • Swaps the default embedding stack from sentence-transformers + torch (~790 MB on disk, ~400 MB RSS) to fastembed (~50 MB on disk, ~130 MB RSS). Same all-MiniLM-L6-v2 weights, executed through ONNX Runtime; vectors agree to within ~1e-7 (cosine 1.000000), so existing embeddings keep retrieving without a re-ingest.
  • Introduces a small Embedder Protocol in repi/embeddings/ with two concrete impls (FastembedEmbedder, TorchEmbedder) selected by a single EMBEDDING_BACKEND setting in .repi/config.json. Container depends on the protocol + factory, not on either concrete library (DIP).
  • Adds an embedding_backend column + index to the leaderboard table so eval runs can sit side-by-side per (provider, model, dataset). Eval runner persists the active backend and prints it in the run header.
  • Adds an opt-in eval-compat uv dependency group (uv sync --group eval-compat) that installs sentence-transformers + CPU torch for A/B comparison runs without bloating the default install. The pytorch-cpu index is gated behind tool.uv.sources so it only triggers when the group is selected.
  • Comment cleanup pass: stripped internal issue references, planning-doc paths, and WHAT comments across the touched files. No logic changes.
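A minimal sketch of the protocol + factory split described above. It assumes an `embed(texts) -> list[list[float]]` method plus `name` and `dim` attributes; the real signatures in repi/embeddings/base.py and factory.py may differ. Each backend class touches its heavy library only inside `embed()`, so constructing the configured embedder never imports the other stack.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """What the container depends on; no concrete library leaks through."""

    name: str
    dim: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class FastembedEmbedder:
    name = "fastembed"
    dim = 384

    def __init__(self) -> None:
        self._model = None  # ONNX session is created lazily on first embed()

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        if self._model is None:
            from fastembed import TextEmbedding  # imported only when used
            self._model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
        return [vec.tolist() for vec in self._model.embed(list(texts))]


class TorchEmbedder:
    name = "torch"
    dim = 384

    def __init__(self) -> None:
        self._model = None

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        if self._model is None:
            # Only available when the eval-compat group is installed.
            from sentence_transformers import SentenceTransformer
            self._model = SentenceTransformer("all-MiniLM-L6-v2")
        return self._model.encode(list(texts)).tolist()


_BACKENDS = {"fastembed": FastembedEmbedder, "torch": TorchEmbedder}


def create_embedder(backend: str) -> Embedder:
    try:
        return _BACKENDS[backend]()
    except KeyError:
        raise ValueError(
            f"Unknown EMBEDDING_BACKEND {backend!r}; expected one of {sorted(_BACKENDS)}"
        ) from None
```

Translating the KeyError into a ValueError lets callers such as config validation reject any unknown name, including a retired key, with a single except clause.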

Verification

120/120 tests pass. repi doctor round-trips the new backend: fastembed (all-MiniLM-L6-v2), dim=384.

Eval comparison (3 datasets × 2 backends, Mistral as a self-grading judge; see caveat below):

  Dataset                                   torch       fastembed
  dataset_1_cascading_inventory_migration   0.94 PASS   0.96 PASS
  dataset_2_insufficient_logging            0.70 FAIL   0.37 FAIL
  dataset_3_jwt_key_rotation_noise          0.88 PASS   0.93 PASS

dataset_2's 0.33 gap looked alarming but turned out to be LLM variance, not the backend. Two additional fastembed-only repetitions of dataset_2 scored 0.85 and 0.43: a 0.42 spread from the same backend (0.48 counting the original 0.37 run), which fully envelops the cross-backend gap. dataset_2 is graded against an expected confidence: low, so runs that hedge get rewarded and runs that confabulate get hammered. Tools called, trigger event identified, and retrieval rankings were the same across all four runs.
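The variance argument above, restated as arithmetic over the dataset_2 scores reported in this PR (no other data assumed):

```python
torch_runs = [0.70]
fastembed_runs = [0.37, 0.85, 0.43]  # original run + two repeats

cross_backend_gap = torch_runs[0] - fastembed_runs[0]
repeat_range = max(fastembed_runs[1:]) - min(fastembed_runs[1:])
within_backend_range = max(fastembed_runs) - min(fastembed_runs)

print(f"cross-backend gap:      {cross_backend_gap:.2f}")
print(f"range of the repeats:   {repeat_range:.2f}")
print(f"range of all fastembed: {within_backend_range:.2f}")
```

Because the spread within a single backend exceeds the gap between backends, one run per backend can't separate a backend effect from judge/LLM noise.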

Test plan

  • uv run pytest tests/ -v — 120 passed
  • uv run repi doctor — embedding round-trip passes with fastembed
  • uv run python eval/run_evals.py --judge-provider mistral on both backends; leaderboard rows written with the correct embedding_backend tag
  • uv pip install + uninstall flow verified via the eval-compat group
  • Docker image size measurement before vs after (pending: separate run, flagged in the original plan)
  • Re-run evals with a non-Mistral judge once a second provider key is available, to remove self-grading noise

Caveats / follow-ups

  • The eval harness currently allows self-grading (judge == model under test) when only one provider key is configured. The dataset_2 variance investigation above shows why this is noisy. A second provider key plus N-sample aggregation are needed before claiming "no regression" with rigor.
  • The benchmarks UI on feat/ui-leaderboard doesn't yet surface the new column. After this merges, that branch needs a small SQL update (DISTINCT ON (provider, model, dataset, embedding_backend)) plus a column in web/app/benchmarks/page.tsx.
  • TorchEmbedder ships in the source tree, but its dependencies (torch, sentence-transformers) are not part of the default install; it is a reference implementation for A/B runs via the eval-compat group.
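One possible shape for the N-sample aggregation suggested above. This is a hypothetical helper (nothing in eval/run_evals.py is shown doing this): it compares mean scores across repeated runs and treats a shift within the combined standard deviations as inconclusive. That threshold is a rough heuristic, not a significance test.

```python
import math
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation; spread is NaN for a single run."""
    spread = stdev(scores) if len(scores) > 1 else math.nan
    return mean(scores), spread


def compare(baseline: list[float], candidate: list[float]) -> str:
    b_mean, b_sd = summarize(baseline)
    c_mean, c_sd = summarize(candidate)
    if math.isnan(b_sd) or math.isnan(c_sd):
        return "insufficient samples"
    # Heuristic: a mean shift within the combined spread is treated as noise.
    if abs(c_mean - b_mean) <= b_sd + c_sd:
        return "inconclusive"
    return "regression" if c_mean < b_mean else "improvement"


# dataset_2 from this PR: one torch run vs three fastembed runs.
print(compare([0.70], [0.37, 0.85, 0.43]))
```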

Replaces sentence-transformers + torch (~790 MB on disk, ~400 MB RSS)
with fastembed (~50 MB, ~130 MB RSS) so repi fits Railway's 512 MB tier.
Vectors match to within ~1e-7 (cosine 1.000000): same all-MiniLM-L6-v2
weights, run through ONNX Runtime, so existing embeddings keep
retrieving correctly with no re-ingest needed.

The swap goes through a new `Embedder` protocol (DIP) selected by a
single `EMBEDDING_BACKEND` setting in .repi/config.json. Two impls
ship: FastembedEmbedder (default) and SentenceTransformersEmbedder
(kept for A/B comparison; requires installing the torch deps locally).
Container no longer references either concrete class; it depends only
on the protocol + factory.

Leaderboard gains an `embedding_backend` column so eval runs across
both backends can sit side-by-side per (provider, model, dataset).
The eval runner persists the active backend on every row and prints
it in the run header. The benchmarks API and UI updates that surface
this column live on feat/ui-leaderboard.
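A self-contained sketch of the leaderboard shape and the "latest row per (provider, model, dataset, embedding_backend)" query the UI branch will need. It uses SQLite as a stand-in for the real database, so ROW_NUMBER() replaces Postgres's DISTINCT ON. The id/score/created_at columns, the index name, the 'fastembed' default, the model name and the dates are illustrative assumptions; only the 0.94/0.96 scores come from this PR, and the 0.50 row is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE leaderboard (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        model TEXT NOT NULL,
        dataset TEXT NOT NULL,
        embedding_backend TEXT NOT NULL DEFAULT 'fastembed',
        score REAL NOT NULL,
        created_at TEXT NOT NULL
    );
    CREATE INDEX idx_leaderboard_embedding_backend
        ON leaderboard (embedding_backend);
    """
)

rows = [
    ("mistral", "example-model", "dataset_1", "torch", 0.94, "2026-06-01"),
    ("mistral", "example-model", "dataset_1", "fastembed", 0.50, "2026-05-01"),
    ("mistral", "example-model", "dataset_1", "fastembed", 0.96, "2026-06-05"),
]
conn.executemany(
    "INSERT INTO leaderboard (provider, model, dataset, embedding_backend, score, created_at)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# Newest row per key; SQLite has no DISTINCT ON, so rank rows within each key.
latest = conn.execute(
    """
    SELECT provider, model, dataset, embedding_backend, score FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY provider, model, dataset, embedding_backend
            ORDER BY created_at DESC
        ) AS rn
        FROM leaderboard
    )
    WHERE rn = 1
    ORDER BY embedding_backend
    """
).fetchall()
for row in latest:
    print(row)
```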

Validation:
- 120/120 tests pass
- `repi doctor` reports: Embedding round-trip PASS  fastembed (all-MiniLM-L6-v2), dim=384
- Eval (dataset_1, mistral self-grading): PASS 0.96, leaderboard row written with embedding_backend='fastembed'
- pytorch-cpu index hack removed; uv.lock drops torch/transformers/scikit-learn/scipy/safetensors

Follow-ups:
- Full 3-dataset eval to compare scores vs the torch baseline (needs a
  second provider key for the judge so we don't self-grade).
- Surface embedding_backend in /benchmarks API + page on feat/ui-leaderboard.
- Docker image-size measurement before/after.

Adds an opt-in dependency group so the sentence-transformers backend
can be installed for A/B comparison evals without bloating the default
install. The pytorch-cpu index is gated behind tool.uv.sources so it
only triggers when the group is selected.

Reproduces the comparison in two commands plus a config edit:

    uv sync --group eval-compat
    # set EMBEDDING_BACKEND="sentence-transformers" in .repi/config.json
    uv run python eval/run_evals.py --judge-provider mistral
    # then: revert config, uv sync
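The manual set/run/revert steps above could be wrapped in a context manager. A hypothetical helper, assuming .repi/config.json is a flat JSON object; at this commit the backend value was "sentence-transformers", which a later commit renames to "torch".

```python
import json
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def embedding_backend(config_path: Path, backend: str) -> Iterator[None]:
    """Temporarily set EMBEDDING_BACKEND, restoring the previous file on exit."""
    original = config_path.read_text() if config_path.exists() else None
    config = json.loads(original) if original else {}
    config["EMBEDDING_BACKEND"] = backend
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    try:
        yield
    finally:
        if original is None:
            config_path.unlink()  # the file didn't exist before; remove it again
        else:
            config_path.write_text(original)


# Hypothetical usage, mirroring the recipe above:
#   with embedding_backend(Path(".repi/config.json"), "sentence-transformers"):
#       subprocess.run(["uv", "run", "python", "eval/run_evals.py",
#                       "--judge-provider", "mistral"], check=True)
```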

Verified results across 3 datasets (mistral self-grading judge):

                                          fastembed   sentence-transformers
  dataset_1_cascading_inventory_migration   0.96         0.94
  dataset_2_insufficient_logging            0.37         0.70
  dataset_3_jwt_key_rotation_noise          0.93         0.88
  avg                                       0.75         0.84
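The avg row is the unweighted mean of the three dataset scores, recomputed here from the table:

```python
scores = {
    "fastembed": [0.96, 0.37, 0.93],
    "sentence-transformers": [0.94, 0.70, 0.88],
}
averages = {backend: sum(v) / len(v) for backend, v in scores.items()}
for backend, avg in averages.items():
    print(f"{backend:<22} {avg:.2f}")
```

With one run per cell, the ~0.09 gap in the averages comes entirely from dataset_2; fastembed scores higher on the other two datasets.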

dataset_2's swing is large enough to investigate before declaring no
regression: chunk count differed (8 vs 4), so the ReAct loop pulled
different evidence even though the vectors should be identical. Likely
non-determinism in the LLM rather than the embedding, but worth a
second look before merging.

The comparison being made is torch (PyTorch runtime) vs ONNX
(via fastembed/ONNX Runtime). 'sentence-transformers' is the Python
library that runs the model on torch, but naming the backend after it
muddled what was actually being swapped, so the leaderboard, config,
and factory all now use 'torch'.

- repi/embeddings/sentence_transformers_backend.py → torch_backend.py
- SentenceTransformersEmbedder → TorchEmbedder; .name = "torch"
- factory accepts "torch" (rejects the old key cleanly)
- config docstring + eval-compat group docstring updated
- existing leaderboard rows backfilled via:
    UPDATE leaderboard SET embedding_backend = 'torch'
    WHERE embedding_backend = 'sentence-transformers';
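The backfill is a plain UPDATE, so it can be exercised against an in-memory SQLite table standing in for the real leaderboard (only embedding_backend matters here; the dataset column and rows are placeholders). A second run matches no rows, so repeating it is harmless.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaderboard (dataset TEXT, embedding_backend TEXT)")
conn.executemany(
    "INSERT INTO leaderboard VALUES (?, ?)",
    [("dataset_1", "sentence-transformers"), ("dataset_1", "fastembed")],
)

BACKFILL = (
    "UPDATE leaderboard SET embedding_backend = 'torch' "
    "WHERE embedding_backend = 'sentence-transformers'"
)

with conn:  # commits on success, rolls back if the UPDATE raises
    cur = conn.execute(BACKFILL)
print(cur.rowcount, "row(s) renamed")
print(sorted(r[0] for r in conn.execute("SELECT embedding_backend FROM leaderboard")))
```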

120/120 tests still pass.

Comment-only pass. No logic changed; 120/120 tests still pass.

- Drops issue-number / plan-doc / past-PR references from inline comments
  and docstrings — git log carries that history.
- Removes WHAT comments (restating the next line) and decorative section
  labels above attribute groups.
- Tightens verbose docstrings while preserving load-bearing WHY notes:
  async timezone gotchas, lazy-init reasons, the PUT-as-PATCH config merge
  rationale, the embedder swap tradeoff.

Net diff: -48 comment lines across 11 files.

@vercel

vercel Bot commented Jun 5, 2026

The latest updates on your projects.

Project  Deployment  Actions           Updated (UTC)
repi     Ready       Preview, Comment  Jun 5, 2026 7:46am


Copilot AI left a comment


Pull request overview

This pull request makes the embedding layer pluggable and switches the default local embedding backend from the PyTorch + sentence-transformers stack to fastembed (ONNX Runtime) to substantially reduce install size and runtime memory while keeping vector outputs compatible with existing data.

Changes:

  • Introduces an Embedder protocol + factory with fastembed as the default backend and an optional torch backend for eval A/B comparison.
  • Extends eval persistence to record embedding_backend in the leaderboard table (schema + insert path), and prints the active backend in eval headers.
  • Updates packaging to move torch + sentence-transformers behind an opt-in eval-compat dependency group, keeping the default install slim.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

  • uv.lock: Updates locked dependencies to include the fastembed/ONNX Runtime stack and moves torch-related deps into an optional group.
  • repi/llm/json_utils.py: Cleans up module docstring language for shared JSON extraction utilities.
  • repi/investigation/react_loop.py: Comment cleanup around parse_llm_response re-export and iteration budgeting.
  • repi/embeddings/base.py: Adds Embedder protocol definition to decouple container from concrete embedding libs.
  • repi/embeddings/factory.py: Adds backend factory to construct the configured embedder implementation.
  • repi/embeddings/fastembed_backend.py: Implements default ONNX/fastembed embedder for all-MiniLM-L6-v2.
  • repi/embeddings/torch_backend.py: Adds optional torch + sentence-transformers embedder for A/B eval runs.
  • repi/embeddings/__init__.py: Exposes Embedder + create_embedder as the embeddings module API.
  • repi/core/container.py: Switches container embedding functionality to use the embedder protocol/factory.
  • repi/core/config.py: Adds EMBEDDING_BACKEND setting and updates config-related comments.
  • repi/cli.py: Updates logging suppression and doctor embedding check to use the new embedder abstraction.
  • repi/api/config.py: Docstring update clarifying PUT-as-merge semantics for config updates.
  • repi/api/__init__.py: Minor comment/docstring cleanup around CORS and health endpoint.
  • pyproject.toml: Replaces default torch dependencies with fastembed and adds eval-compat optional group plus CPU-only torch index config.
  • eval/run_evals.py: Persists/prints embedding_backend per run and inserts it into leaderboard rows.
  • db/schema.sql: Adds embedding_backend column and index to the leaderboard table.


Comment thread repi/core/container.py
Comment on lines 88 to +92
 @property
-def model(self) -> "SentenceTransformer":
-    if self._model is None:
-        logger.info("Loading SentenceTransformer (first use) …")
-        from sentence_transformers import SentenceTransformer
-        self._model = SentenceTransformer("all-MiniLM-L6-v2")
-    return self._model
+def embedder(self) -> Embedder:
+    if self._embedder is None:
+        self._embedder = create_embedder(settings.EMBEDDING_BACKEND)
+    return self._embedder
Owner Author


Addressed in c53a479: the property now compares the cached embedder's name against settings.EMBEDDING_BACKEND and rebuilds when they differ, so PUT /config + settings.reload() actually flips the backend on next access.
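A sketch of the rebuild-on-mismatch property this reply describes. `Settings`, `_StubEmbedder` and the stub `create_embedder` are hypothetical stand-ins for repi's settings object and factory; the key assumption, taken from the reply, is that each embedder's `name` equals the EMBEDDING_BACKEND value that built it.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    EMBEDDING_BACKEND: str = "fastembed"


class _StubEmbedder:
    def __init__(self, name: str) -> None:
        self.name = name


def create_embedder(backend: str) -> _StubEmbedder:
    return _StubEmbedder(backend)  # stand-in for the real factory


settings = Settings()


class Container:
    def __init__(self) -> None:
        self._embedder: _StubEmbedder | None = None

    @property
    def embedder(self) -> _StubEmbedder:
        # Rebuild when nothing is cached yet, or when the configured backend
        # no longer matches the cached instance (e.g. after PUT /config).
        wanted = settings.EMBEDDING_BACKEND
        if self._embedder is None or self._embedder.name != wanted:
            self._embedder = create_embedder(wanted)
        return self._embedder
```

Comparing names keeps the common path cheap: as long as the setting is unchanged, every access returns the same cached instance.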

Comment thread repi/api/config.py
Comment on lines +19 to 24
"""Merge `new_config` on top of the existing config.json and reload.

Semantically a PATCH: a partial body (e.g. `{"MISTRAL_API_KEY": "..."}`)
must not clobber unsent fields with their class defaults, which would
break a running container instantly.
"""
Owner Author


Addressed in c53a479: PUT /config now calls create_embedder(validated.EMBEDDING_BACKEND) during validation, so an unknown backend name 400s on write instead of silently persisting and 500ing on first /ingest. (Catches unknown names; missing optional torch deps still surface lazily on first .embed() — that's intentional since we don't want to import torch on every config save.)
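The fail-on-write check can be sketched without a web framework. `put_config` and its `(status, body)` return value are hypothetical, and the stub factory only looks the name up, mirroring the reply's point that torch is not imported on config save. Unlike the reply, which validates the merged config, this sketch only checks the key when the request body includes it.

```python
from typing import Any

_KNOWN_BACKENDS = {"fastembed", "torch"}


def create_embedder(backend: str) -> str:
    # Stand-in for the real factory: validates the name only; heavy imports
    # would happen later, on the first embed() call.
    if backend not in _KNOWN_BACKENDS:
        raise ValueError(f"Unknown EMBEDDING_BACKEND {backend!r}")
    return backend


def put_config(body: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    backend = body.get("EMBEDDING_BACKEND")
    if backend is not None:
        try:
            create_embedder(backend)
        except ValueError as exc:
            return 400, {"detail": str(exc)}  # rejected before anything is persisted
    # Merge, persist and reload would happen here in a real handler.
    return 200, {"status": "ok"}
```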

Comment thread repi/cli.py Outdated
# need to be quieted explicitly — basicConfig only affects loggers
# without an explicit level.
for name in ("sentence_transformers", "httpx", "asyncio"):
for name in ("fastembed", "httpx", "asyncio"):
Owner Author


Addressed in c53a479: added sentence_transformers back to the quieted-loggers tuple alongside fastembed, so eval-compat A/B runs in prod mode stay quiet regardless of which backend is selected.
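A minimal sketch of the quieting loop, assuming WARNING as the target level (the diff doesn't show which level repi uses). An explicit level on a named logger takes precedence over the root level that basicConfig sets, which is why each library logger is listed individually.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Both embedding libraries are listed so output stays quiet whichever
# backend is active.
NOISY_LOGGERS = ("fastembed", "sentence_transformers", "httpx", "asyncio")

for name in NOISY_LOGGERS:
    logging.getLogger(name).setLevel(logging.WARNING)
```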

Comment thread repi/embeddings/fastembed_backend.py Outdated
Comment on lines +9 to +10
import logging
from typing import Optional
Owner Author


Addressed in c53a479.

- container.embedder: rebuild when settings.EMBEDDING_BACKEND changes
  so PUT /config can actually flip the backend at runtime (previously
  the first-built embedder was cached for the process lifetime).
- api/config: PUT /config now constructs the embedder during validation
  so an unknown EMBEDDING_BACKEND value 400s on write instead of 500ing
  on the first /ingest hours later.
- cli: keep sentence_transformers in the quieted-loggers tuple so prod
  mode stays quiet on A/B runs with the eval-compat group installed.
- fastembed_backend: drop unused Optional import.

120/120 tests still pass.

Development

Successfully merging this pull request may close these issues.

O2: Replace torch + sentence-transformers with fastembed (ONNX)

2 participants