Pluggable embedding backend; fastembed/ONNX default (#46) #51
VarunGitGood wants to merge 5 commits
Replaces sentence-transformers + torch (~790 MB on disk, ~400 MB RSS) with fastembed (~50 MB, ~130 MB RSS) so repi fits Railway's 512 MB tier. Vectors are byte-identical (same all-MiniLM-L6-v2 weights through ONNX Runtime), so existing embeddings keep retrieving correctly with no re-ingest needed.

The swap goes through a new `Embedder` protocol (DIP) selected by a single `EMBEDDING_BACKEND` setting in .repi/config.json. Two impls ship: FastembedEmbedder (default) and SentenceTransformersEmbedder (kept for A/B comparison; requires installing the torch deps locally). Container is now ignorant of either concrete class and depends only on the protocol + factory.

Leaderboard gains an `embedding_backend` column so eval runs across both backends can sit side-by-side per (provider, model, dataset). The eval runner persists the active backend on every row and prints it in the run header. The benchmarks API and UI updates that surface this column live on feat/ui-leaderboard.

Validation:
- 120/120 tests pass
- `repi doctor` reports: Embedding round-trip PASS fastembed (all-MiniLM-L6-v2), dim=384
- Eval (dataset_1, mistral self-grading): PASS 0.96, leaderboard row written with embedding_backend='fastembed'
- pytorch-cpu index hack removed; uv.lock drops torch/transformers/scikit-learn/scipy/safetensors

Follow-ups:
- Full 3-dataset eval to compare scores vs the torch baseline (needs a second provider key for the judge so we don't self-grade).
- Surface embedding_backend in /benchmarks API + page on feat/ui-leaderboard.
- Docker image-size measurement before/after.
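For readers who want the shape of the protocol + factory arrangement, here is a minimal sketch. It is not the code in this PR: the two stub classes stand in for the real backends so the example runs without fastembed or torch installed, and the `embed` signature and the `_BACKENDS` registry are assumptions made for illustration. Only the `Embedder` name, the `.name` attribute, `create_embedder`, and the backend keys come from the PR text.

```python
from typing import Callable, Protocol


class Embedder(Protocol):
    """Structural interface the container depends on (signature is assumed)."""

    name: str

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class StubFastembedEmbedder:
    """Hypothetical stand-in for the ONNX backend; returns zero vectors."""

    name = "fastembed"

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[0.0] * 384 for _ in texts]


class StubSentenceTransformersEmbedder:
    """Hypothetical stand-in for the torch-based backend."""

    name = "sentence-transformers"

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[0.0] * 384 for _ in texts]


# Registry of backend key -> zero-argument constructor (illustrative only).
_BACKENDS: dict[str, Callable[[], Embedder]] = {
    "fastembed": StubFastembedEmbedder,
    "sentence-transformers": StubSentenceTransformersEmbedder,
}


def create_embedder(backend: str) -> Embedder:
    """Build the embedder for a backend key, rejecting unknown keys."""
    try:
        return _BACKENDS[backend]()
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(
            f"Unknown EMBEDDING_BACKEND {backend!r}; expected one of: {known}"
        ) from None
```

Callers only ever see `create_embedder(...)` and the protocol, which is what lets the container stay unaware of either concrete library.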
Adds an opt-in dependency group so the sentence-transformers backend
can be installed for A/B comparison evals without bloating the default
install. The pytorch-cpu index is gated behind tool.uv.sources so it
only triggers when the group is selected.
Reproduces the comparison in two commands:
    uv sync --group eval-compat
    # set EMBEDDING_BACKEND="sentence-transformers" in .repi/config.json
    uv run python eval/run_evals.py --judge-provider mistral
    # then: revert config, uv sync
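The manual config edit in the middle of that flow could be scripted. The helper below is a hypothetical convenience, not something this PR ships, and it assumes .repi/config.json is a flat JSON object with `EMBEDDING_BACKEND` as a top-level key.

```python
import json
from pathlib import Path


def set_embedding_backend(config_path: Path, backend: str) -> dict:
    """Set EMBEDDING_BACKEND in a JSON config file, keeping all other keys."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config["EMBEDDING_BACKEND"] = backend
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `set_embedding_backend(Path(".repi/config.json"), "sentence-transformers")` before the eval run, and the same call with `"fastembed"` to revert afterwards.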
Verified results across 3 datasets (mistral self-grading judge):
| Dataset | fastembed | sentence-transformers |
|---|---|---|
| dataset_1_cascading_inventory_migration | 0.96 | 0.94 |
| dataset_2_insufficient_logging | 0.37 | 0.70 |
| dataset_3_jwt_key_rotation_noise | 0.93 | 0.88 |
| avg | 0.75 | 0.84 |
dataset_2's swing is large enough to investigate before declaring no
regression: chunk count differed (8 vs 4), so the ReAct loop pulled
different evidence even though vectors should be identical. This is
likely non-determinism in the LLM rather than the embedding, but worth
a second look before merging.
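The per-backend averages can be re-derived from the per-dataset scores with a few lines. This is only a recomputation of the numbers reported above; the short dataset keys are abbreviations introduced for the example.

```python
from statistics import mean

# Scores copied from the results reported above.
scores = {
    "fastembed": {"dataset_1": 0.96, "dataset_2": 0.37, "dataset_3": 0.93},
    "sentence-transformers": {"dataset_1": 0.94, "dataset_2": 0.70, "dataset_3": 0.88},
}

# Mean score per backend, rounded the same way as the reported averages.
averages = {backend: round(mean(s.values()), 2) for backend, s in scores.items()}

# Dataset with the largest absolute difference between the two backends.
largest_swing = max(
    scores["fastembed"],
    key=lambda ds: abs(scores["fastembed"][ds] - scores["sentence-transformers"][ds]),
)

print(averages, largest_swing)
```

This reproduces the 0.75 and 0.84 averages and singles out dataset_2 as the dataset driving the difference.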
The comparison being made is torch (PyTorch runtime) vs ONNX
(via fastembed/ONNX Runtime). 'sentence-transformers' is the Python
library that wraps torch, but using its name muddled what was actually
being swapped, so the leaderboard / config / factory all now use 'torch'.
- repi/embeddings/sentence_transformers_backend.py → torch_backend.py
- SentenceTransformersEmbedder → TorchEmbedder; .name = "torch"
- factory accepts "torch" (rejects the old key cleanly)
- config docstring + eval-compat group docstring updated
- existing leaderboard rows backfilled via:
UPDATE leaderboard SET embedding_backend = 'torch'
WHERE embedding_backend = 'sentence-transformers';
120/120 tests still pass.
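To see what the backfill does to existing rows, the sketch below runs the same UPDATE against an in-memory SQLite table. SQLite and the two-column table are stand-ins chosen so the example is self-contained; the real `leaderboard` schema lives in db/schema.sql and has more columns.

```python
import sqlite3


def backfill_backend(conn: sqlite3.Connection) -> int:
    """Rename the old backend key on existing rows; returns rows changed."""
    cur = conn.execute(
        "UPDATE leaderboard SET embedding_backend = 'torch' "
        "WHERE embedding_backend = 'sentence-transformers'"
    )
    conn.commit()
    return cur.rowcount


# Simplified, hypothetical table holding only the columns the example needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leaderboard ("
    "id INTEGER PRIMARY KEY, dataset TEXT, embedding_backend TEXT)"
)
conn.executemany(
    "INSERT INTO leaderboard (dataset, embedding_backend) VALUES (?, ?)",
    [
        ("dataset_1", "fastembed"),
        ("dataset_1", "sentence-transformers"),
        ("dataset_2", "sentence-transformers"),
    ],
)

updated = backfill_backend(conn)
backends = sorted(
    row[0] for row in conn.execute("SELECT embedding_backend FROM leaderboard")
)
```

Rows already tagged `fastembed` are untouched, and running the backfill a second time changes nothing.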
Comment-only pass. No logic changed; 120/120 tests still pass.
- Drops issue-number / plan-doc / past-PR references from inline comments and docstrings; git log carries that history.
- Removes WHAT comments (restating the next line) and decorative section labels above attribute groups.
- Tightens verbose docstrings while preserving load-bearing WHY notes: async timezone gotchas, lazy-init reasons, the PUT-as-PATCH config merge rationale, the embedder swap tradeoff.
Net diff: -48 comment lines across 11 files.
Pull request overview
This pull request makes the embedding layer pluggable and switches the default local embedding backend from the PyTorch + sentence-transformers stack to fastembed (ONNX Runtime) to substantially reduce install size and runtime memory while keeping vector outputs compatible with existing data.
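One way to check that kind of compatibility is to embed the same text with both backends and compare the vectors element-wise and by cosine similarity. The sketch below uses synthetic vectors rather than real embeddings, and the tolerances are assumptions picked to sit just above the ~1e-7 agreement reported in the PR summary.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def vectors_agree(
    a: list[float],
    b: list[float],
    atol: float = 1e-6,
    min_cosine: float = 0.999999,
) -> bool:
    """True when two embeddings match within the assumed tolerances."""
    return (
        len(a) == len(b)
        and max_abs_diff(a, b) <= atol
        and cosine_similarity(a, b) >= min_cosine
    )


# Synthetic 384-dim vectors standing in for one text embedded by two backends.
reference = [math.sin(i) for i in range(384)]
reloaded = [x + 1e-7 for x in reference]
print(vectors_agree(reference, reloaded))
```

With real backends, `reference` and `reloaded` would be the two embedders' outputs for the same input string.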
Changes:
- Introduces an `Embedder` protocol + factory with `fastembed` as the default backend and an optional torch backend for eval A/B comparison.
- Extends eval persistence to record `embedding_backend` in the `leaderboard` table (schema + insert path), and prints the active backend in eval headers.
- Updates packaging to move torch + `sentence-transformers` behind an opt-in `eval-compat` dependency group, keeping the default install slim.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| uv.lock | Updates locked dependencies to include fastembed/ONNX runtime stack and moves torch-related deps into an optional group. |
| repi/llm/json_utils.py | Cleans up module docstring language for shared JSON extraction utilities. |
| repi/investigation/react_loop.py | Comment cleanup around parse_llm_response re-export and iteration budgeting. |
| repi/embeddings/base.py | Adds Embedder protocol definition to decouple container from concrete embedding libs. |
| repi/embeddings/factory.py | Adds backend factory to construct the configured embedder implementation. |
| repi/embeddings/fastembed_backend.py | Implements default ONNX/fastembed embedder for all-MiniLM-L6-v2. |
| repi/embeddings/torch_backend.py | Adds optional torch + sentence-transformers embedder for A/B eval runs. |
| repi/embeddings/__init__.py | Exposes Embedder + create_embedder as the embeddings module API. |
| repi/core/container.py | Switches container embedding functionality to use the embedder protocol/factory. |
| repi/core/config.py | Adds EMBEDDING_BACKEND setting and updates config-related comments. |
| repi/cli.py | Updates logging suppression and doctor embedding check to use the new embedder abstraction. |
| repi/api/config.py | Docstring update clarifying PUT-as-merge semantics for config updates. |
| repi/api/__init__.py | Minor comment/docstring cleanup around CORS and health endpoint. |
| pyproject.toml | Replaces default torch dependencies with fastembed and adds eval-compat optional group plus CPU-only torch index config. |
| eval/run_evals.py | Persists/prints embedding_backend per run and inserts it into leaderboard rows. |
| db/schema.sql | Adds embedding_backend column and index to the leaderboard table. |
         @property
    -    def model(self) -> "SentenceTransformer":
    -        if self._model is None:
    -            logger.info("Loading SentenceTransformer (first use) …")
    -            from sentence_transformers import SentenceTransformer
    -            self._model = SentenceTransformer("all-MiniLM-L6-v2")
    -        return self._model
    +    def embedder(self) -> Embedder:
    +        if self._embedder is None:
    +            self._embedder = create_embedder(settings.EMBEDDING_BACKEND)
    +        return self._embedder
Addressed in c53a479: the property now compares the cached embedder's name against settings.EMBEDDING_BACKEND and rebuilds when they differ, so PUT /config + settings.reload() actually flips the backend on next access.
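A minimal sketch of that rebuild-on-change behaviour, assuming the cached embedder exposes a `.name` matching its backend key. `Settings`, `StubEmbedder`, and this `create_embedder` are simplified stand-ins written for the example, not the classes in the repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    """Hypothetical stand-in for the reloadable settings object."""

    EMBEDDING_BACKEND: str = "fastembed"


settings = Settings()


class StubEmbedder:
    """Stand-in embedder that only records which backend it represents."""

    def __init__(self, name: str) -> None:
        self.name = name


def create_embedder(backend: str) -> StubEmbedder:
    return StubEmbedder(backend)


class Container:
    def __init__(self) -> None:
        self._embedder: Optional[StubEmbedder] = None

    @property
    def embedder(self) -> StubEmbedder:
        wanted = settings.EMBEDDING_BACKEND
        # Build on first access, and rebuild when the configured backend
        # no longer matches the cached instance.
        if self._embedder is None or self._embedder.name != wanted:
            self._embedder = create_embedder(wanted)
        return self._embedder
```

Repeated accesses with an unchanged setting return the same cached instance, so the lazy-init cost is still paid only once per backend switch.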
    """Merge `new_config` on top of the existing config.json and reload.

    Semantically a PATCH: a partial body (e.g. `{"MISTRAL_API_KEY": "..."}`)
    must not clobber unsent fields with their class defaults, which would
    break a running container instantly.
    """
Addressed in c53a479: PUT /config now calls create_embedder(validated.EMBEDDING_BACKEND) during validation, so an unknown backend name 400s on write instead of silently persisting and 500ing on first /ingest. (Catches unknown names; missing optional torch deps still surface lazily on first .embed() — that's intentional since we don't want to import torch on every config save.)
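Put together, the merge-then-validate flow might look like the following sketch. It is framework-free on purpose: `apply_config_update` and its `(status, body)` return value are hypothetical, and the stub `create_embedder` only checks the backend name, mirroring the point that heavy optional imports stay deferred.

```python
KNOWN_BACKENDS = {"fastembed", "torch"}


def create_embedder(backend: str) -> str:
    """Stub factory: validates the name only, imports nothing heavy."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown EMBEDDING_BACKEND {backend!r}")
    return backend


def apply_config_update(existing: dict, new_config: dict) -> tuple[int, dict]:
    """Merge a partial body over the stored config, then validate it.

    Returns (status, body): 200 with the merged config, or 400 with an
    error detail. The stored config is never mutated in place.
    """
    merged = {**existing, **new_config}  # unsent keys keep their stored values
    try:
        create_embedder(merged.get("EMBEDDING_BACKEND", "fastembed"))
    except ValueError as exc:
        return 400, {"detail": str(exc)}
    return 200, merged
```

Because validation runs on the merged result before anything is persisted, a bad backend name never reaches config.json.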
         # need to be quieted explicitly — basicConfig only affects loggers
         # without an explicit level.
    -    for name in ("sentence_transformers", "httpx", "asyncio"):
    +    for name in ("fastembed", "httpx", "asyncio"):
Addressed in c53a479: added sentence_transformers back to the quieted-loggers tuple alongside fastembed, so eval-compat A/B runs in prod mode stay quiet regardless of which backend is selected.
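As a sketch, the resulting loop could look like this. The helper name and the `WARNING` level are assumptions; only the logger names come from the thread.

```python
import logging

# Third-party loggers to quiet regardless of which backend is selected.
QUIETED_LOGGERS = ("fastembed", "sentence_transformers", "httpx", "asyncio")


def quiet_noisy_loggers(level: int = logging.WARNING) -> None:
    """Give each noisy logger an explicit level so INFO chatter is dropped."""
    for name in QUIETED_LOGGERS:
        logging.getLogger(name).setLevel(level)
```

Setting a level on a logger whose library is not installed is harmless, which is why both backend names can stay in the tuple.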
    import logging
    from typing import Optional
- container.embedder: rebuild when settings.EMBEDDING_BACKEND changes so PUT /config can actually flip the backend at runtime (previously the first-built embedder was cached for the process lifetime).
- api/config: PUT /config now constructs the embedder during validation so an unknown EMBEDDING_BACKEND value 400s on write instead of 500ing on the first /ingest hours later.
- cli: keep sentence_transformers in the quieted-loggers tuple so prod mode stays quiet on A/B runs with the eval-compat group installed.
- fastembed_backend: drop unused Optional import.
120/120 tests still pass.
Closes #46.
Summary
- Switches the default embedding backend from `sentence-transformers` + `torch` (~790 MB on disk, ~400 MB RSS) to `fastembed` (~50 MB / ~130 MB). Same `all-MiniLM-L6-v2` weights, executed through ONNX Runtime; vectors agree to ~1e-7 (cosine 1.000000), so existing embeddings keep retrieving without a re-ingest.
- Adds an `Embedder` Protocol in `repi/embeddings/` with two concrete impls (`FastembedEmbedder`, `TorchEmbedder`) selected by a single `EMBEDDING_BACKEND` setting in `.repi/config.json`. `Container` depends on the protocol + factory, not on either concrete library (DIP).
- Adds an `embedding_backend` column + index to the `leaderboard` table so eval runs can sit side-by-side per `(provider, model, dataset)`. Eval runner persists the active backend and prints it in the run header.
- Adds an `eval-compat` uv dependency group (`uv sync --group eval-compat`) that installs `sentence-transformers` + CPU torch for A/B comparison runs without bloating the default install. The pytorch-cpu index is gated behind `tool.uv.sources` so it only triggers when the group is selected.
Verification
- 120/120 tests pass.
- `repi doctor` round-trips the new backend: `fastembed (all-MiniLM-L6-v2), dim=384`.

Eval comparison (3 datasets × 2 backends, Mistral self-grading judge; see caveat below):
dataset_2's 0.33 gap looked alarming but turned out to be LLM variance, not a backend difference. Two additional fastembed-only repetitions of dataset_2 scored 0.85 and 0.43, a 0.42 range from the same backend, which fully envelops the cross-backend gap. dataset_2 is graded against `confidence: low`, so runs that hedge get rewarded and runs that confabulate get hammered. Tools called, trigger event identified, and retrieval rankings were the same across all four runs.
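The "within noise" argument is a simple comparison of two spreads, sketched here with the numbers quoted above. It is a sanity check on the arithmetic, not a statistical test; two repetitions are far too few for one.

```python
# dataset_2 scores quoted above.
torch_score = 0.70
fastembed_score = 0.37
fastembed_repeats = [0.85, 0.43]  # two extra fastembed-only runs

# Gap between backends on the original pair of runs.
cross_backend_gap = round(torch_score - fastembed_score, 2)

# Spread observed within a single backend across the repeated runs.
same_backend_range = round(max(fastembed_repeats) - min(fastembed_repeats), 2)

# The cross-backend gap is explainable by run-to-run variance when the
# same-backend spread is at least as large.
gap_within_noise = same_backend_range >= cross_backend_gap
print(cross_backend_gap, same_backend_range, gap_within_noise)
```

A follow-up with more repetitions per backend and an independent judge would give a firmer answer than this single comparison.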
Test plan
- `uv run pytest tests/ -v`: 120 passed
- `uv run repi doctor`: embedding round-trip passes with fastembed
- `uv run python eval/run_evals.py --judge-provider mistral` on both backends; leaderboard rows written with the correct `embedding_backend` tag
- `uv pip` install + uninstall flow verified via the `eval-compat` group
Caveats / follow-ups
- `feat/ui-leaderboard` doesn't yet surface the new column. After this merges, that branch needs a small SQL update (`DISTINCT ON (provider, model, dataset, embedding_backend)`) plus a column in `web/app/benchmarks/page.tsx`.
- `TorchEmbedder` ships in the source tree but its packages don't ship in the default install; it's a reference implementation for A/B runs.