Open
20 commits
588ae9a
Add semantic search support to server store
javiermtorres Apr 24, 2026
70c19b2
Merge branch 'main' into 21-semantic-similarity-queries
javiermtorres Apr 24, 2026
9bceb0a
Remove default embeddings endpoint
javiermtorres Apr 24, 2026
ff6f201
Fix lint
javiermtorres Apr 24, 2026
56cdba2
Make semsearch modulate current scoring
javiermtorres Apr 28, 2026
16d538a
Merge branch 'main' into 21-semantic-similarity-queries
javiermtorres Apr 28, 2026
7fad12d
Add semsearch to the dependencies in CI
javiermtorres Apr 28, 2026
34e2330
Merge branch 'main' into 21-semantic-similarity-queries
javiermtorres Apr 28, 2026
7900707
Fix query tests to async protocol
javiermtorres Apr 28, 2026
5ef8132
Merge origin/main into 21-semantic-similarity-queries
peteski22 Apr 28, 2026
75070cb
server: replace local scoring module with cq.scoring from the SDK
peteski22 Apr 29, 2026
e6a8587
Merge branch 'main' into 21-semantic-similarity-queries
javiermtorres May 4, 2026
e968ee6
Merge branch 'main' into 21-semantic-similarity-queries
javiermtorres May 11, 2026
ab1af3d
docs: document TOKEN_EMBEDDING_URL and SEMSEARCH_EMBEDDING_DIM in ser…
Copilot May 12, 2026
a05f6dd
tests: add semsearch unit tests with monkey-patched _get_embeddings
Copilot May 13, 2026
27b3671
Align with main
javiermtorres May 13, 2026
471d155
Fix distance calculation
javiermtorres May 13, 2026
44341c0
Fix lint
javiermtorres May 13, 2026
763dc4f
Merge branch 'main' into 21-semantic-similarity-queries
javiermtorres May 13, 2026
35a9b90
Fix uv.lock
javiermtorres May 13, 2026
2 changes: 1 addition & 1 deletion Makefile
@@ -101,7 +101,7 @@ setup-sdk-python: setup-schema

.PHONY: setup-server-backend
setup-server-backend:
-	cd server/backend && uv sync --group dev
+	cd server/backend && uv sync --group dev --extra semsearch

.PHONY: setup-server-frontend
setup-server-frontend:
57 changes: 57 additions & 0 deletions server/backend/README.md
@@ -73,3 +73,60 @@ The full environment-variable table for self-hosters lives in

[issue-311]: https://github.com/mozilla-ai/cq/issues/311
[issue-312]: https://github.com/mozilla-ai/cq/issues/312
[issue-310]: https://github.com/mozilla-ai/cq/issues/310

## Semantic search

Semantic similarity search is **disabled by default**. It activates only when
`TOKEN_EMBEDDING_URL` is set *and* the `semsearch` optional extra is installed.

### Installation

```
uv sync --extra semsearch
# or
pip install "cq-server[semsearch]"
```

The extra installs `sqlite-vec`, `numpy`, and `httpx`.

### Environment variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `TOKEN_EMBEDDING_URL` | to enable | — | Base URL of the [encoderfile](https://github.com/mozilla-ai/encoderfile) embedding service. When this variable is set and the `semsearch` extra is installed, every insert/update writes an embedding row and `query` modulates relevance by cosine distance. |
| `SEMSEARCH_EMBEDDING_DIM` | no | `768` | Dimensionality of the embedding vectors. **Must match the model served at `TOKEN_EMBEDDING_URL`.** |

### Embedding service contract

The server calls:

```
POST {TOKEN_EMBEDDING_URL}/predict
Content-Type: application/json

{"inputs": ["word1", "word2", ...]}
```

Expected response:

```json
{
  "results": [
    {
      "embeddings": [
        {"embedding": [0.1, 0.2, ...]},
        ...
      ]
    }
  ]
}
```

The server averages across all returned embeddings to produce a single vector.
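
As a rough illustration of the contract above, the sketch below builds the request body and mean-pools a response of the documented shape into one vector. The helper names (`build_payload`, `mean_pool`) and the `expected_dim` check are assumptions made for this example rather than the server's actual code, and it assumes the average runs over every embedding in every `results` entry. It uses plain Python lists instead of `numpy` so it runs without the extra.

```python
import json


def build_payload(words: list[str]) -> str:
    """Serialise the JSON body for POST {TOKEN_EMBEDDING_URL}/predict."""
    return json.dumps({"inputs": words})


def mean_pool(response: dict, expected_dim: int) -> list[float]:
    """Average every returned embedding into a single vector.

    Hypothetical helper: assumes the documented response shape and that
    all embeddings share the same dimensionality.
    """
    vectors = [
        item["embedding"]
        for result in response["results"]
        for item in result["embeddings"]
    ]
    if not vectors:
        raise ValueError("embedding service returned no embeddings")
    for vec in vectors:
        if len(vec) != expected_dim:
            raise ValueError(f"expected {expected_dim} dimensions, got {len(vec)}")
    count = len(vectors)
    # zip(*vectors) walks the vectors column by column.
    return [sum(column) / count for column in zip(*vectors)]


sample = {
    "results": [
        {"embeddings": [{"embedding": [1.0, 2.0]}, {"embedding": [3.0, 4.0]}]}
    ]
}
pooled = mean_pool(sample, expected_dim=2)  # [2.0, 3.0]
```

Checking the length against `SEMSEARCH_EMBEDDING_DIM` before storing a vector would surface a model mismatch early instead of at insert time.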

### Disabled-by-default behaviour

When `TOKEN_EMBEDDING_URL` is unset (or the `semsearch` extra is not installed),
`semsearch.is_enabled()` returns `False` and all semsearch code paths are
short-circuited; behaviour matches the non-semsearch baseline exactly.
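
The gating described above could be sketched roughly as follows. `semsearch_available`, its parameters, and `REQUIRED_MODULES` are hypothetical names for this example, not part of the server's API; the sketch only shows the two conditions (URL set, optional packages importable) combined into one check.

```python
import importlib.util
from collections.abc import Mapping

# Import names of the optional packages (an assumption for this sketch).
REQUIRED_MODULES = ("sqlite_vec", "numpy", "httpx")


def semsearch_available(
    env: Mapping[str, str],
    required: tuple[str, ...] = REQUIRED_MODULES,
) -> bool:
    """Return True only if the URL is set and every optional package is importable."""
    if not env.get("TOKEN_EMBEDDING_URL"):
        return False
    return all(importlib.util.find_spec(name) is not None for name in required)


# With the variable unset, the check short-circuits before looking at packages.
disabled = semsearch_available({})  # False
```

Passing the environment as a mapping (for example `os.environ`) keeps the check easy to exercise in tests without mutating process state.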
10 changes: 10 additions & 0 deletions server/backend/pyproject.toml
@@ -19,6 +19,12 @@ dependencies = [
"alembic>=1.18.4,<2",
]

[project.optional-dependencies]
semsearch = [
"sqlite_vec~=0.1.9",
"numpy==2.*",
]

[project.scripts]
cq-server = "cq_server.app:main"

@@ -38,6 +44,7 @@ tests = [
"pytest>=9.0.3",
"pytest-asyncio>=1.3.0",
"httpx",
"jsonschema>=4.23.0",
]

[tool.setuptools.packages.find]
@@ -54,3 +61,6 @@ namespaces = false
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
log_cli = true
log_cli_level = "DEBUG"
log_level = "DEBUG"
5 changes: 5 additions & 0 deletions server/backend/src/cq_server/core/db.py
@@ -17,6 +17,8 @@

from .config import Settings

from ..semsearch import _ENABLED as _SEMSEARCH_ENABLED, load as semsearch_load


def _apply_sqlite_pragmas(dbapi_connection, _connection_record) -> None: # noqa: ANN001 (sqlalchemy event signature)
"""Issue cq's required SQLite PRAGMAs on every new connection.
@@ -59,6 +61,9 @@ def __init__(self, settings: Settings) -> None:
                future=True,
            )
            event.listen(self._engine, "connect", _apply_sqlite_pragmas)
            # Evolve into a proper plugin module system later on
            if _SEMSEARCH_ENABLED:
                event.listen(self._engine, "connect", semsearch_load)
        else:
            # PostgreSQL backend is gated by #311/#312; lifespan resolves
            # the URL up-front so this branch should be unreachable in
26 changes: 26 additions & 0 deletions server/backend/src/cq_server/repositories/knowledge.py
@@ -3,6 +3,8 @@
from __future__ import annotations

from datetime import UTC, datetime
import logging

from cq.models import KnowledgeUnit
from cq.scoring import calculate_relevance
@@ -23,6 +25,12 @@
UPDATE_UNIT_DATA,
)

from ..semsearch import _ENABLED as _SEMSEARCH_ENABLED
from ..semsearch.queries import combined_query as sem_query, insert_unit as sem_insert_unit

from sqlalchemy.sql.expression import text as text_clause

logger = logging.getLogger(__name__)

class KnowledgeRepository:
    """Read/write access to knowledge units."""
@@ -53,7 +61,15 @@ async def get_any(self, unit_id: str) -> KnowledgeUnit | None:

    async def insert(self, unit: KnowledgeUnit) -> None:
        """Persist a new unit. Domains are normalised; raises on integrity failure."""
        logger.info(f"Inserting unit {unit.id} with domains {unit.domains} and tier {unit.tier}")
        # FIXME plugins should run in the same transaction as the main
        # db operations. Providing sync and async interfaces simultaneously makes this
        # complicated, so the current setup just runs the plugin in a separate transaction
        # after the main insert.
        await self._db.run_sync(self._insert_sync, unit)
        if _SEMSEARCH_ENABLED:
            with self._db.engine.begin() as conn:
                await sem_insert_unit(conn, unit)

    async def query(
        self,
@@ -65,6 +81,16 @@ async def query(
        limit: int = 5,
    ) -> list[KnowledgeUnit]:
        """Return approved units matching ``domains``, ranked by relevance × confidence."""
        if _SEMSEARCH_ENABLED:
            with self._db.engine.connect() as conn:
                return await sem_query(
                    conn,
                    domains,
                    languages=languages,
                    frameworks=frameworks,
                    pattern=pattern,
                    limit=limit
                )
        return await self._db.run_sync(
            self._query_sync,
            domains,
107 changes: 107 additions & 0 deletions server/backend/src/cq_server/semsearch/__init__.py
@@ -0,0 +1,107 @@
"""Semantic-search helpers backed by sqlite-vec and remote token embeddings.

This module provides optional semantic indexing and retrieval for knowledge units.
Semantic search is enabled only when `TOKEN_EMBEDDING_URL` is configured and the
embedding dependencies are installed.
"""

import logging
import os
import sqlite3
from sqlalchemy.sql.expression import TextClause, text

logger = logging.getLogger(__name__)

_ENABLED = False
_DIM = int(os.environ.get("SEMSEARCH_EMBEDDING_DIM", 768))


_TOKEN_EMBEDDING_URL = os.environ.get("TOKEN_EMBEDDING_URL")
if _TOKEN_EMBEDDING_URL:
    try:
        import sqlite_vec
        import numpy as np
        from httpx import AsyncClient

        _ENABLED = True

        logger.info(f"Token embedding enabled using encoderfile endpoint at {_TOKEN_EMBEDDING_URL}")
    except ImportError:
        logger.warning(
            "TOKEN_EMBEDDING_URL is set but required packages are not installed; "
            "semantic search will be unavailable. To enable, install cq-server with "
            "the 'semsearch' extra: pip install cq-server[semsearch]",
            exc_info=True,
        )


# We avoid a vec0 virtual table because we do not run kNN-style
# searches; we filter by domain first and then rank by distance.
# The syntax for a vec0 table would be:
# CREATE VIRTUAL TABLE IF NOT EXISTS knowledge_units_vec
# USING vec0(
# id TEXT PRIMARY KEY,
# embedding float[{dim}]
# );
_VEC_SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS knowledge_units_vec(
id TEXT PRIMARY KEY,
embedding float[{dim}]
check(
typeof(embedding) == 'blob'
and vec_length(embedding) == {dim}
)
);
"""


_VEC_SEARCH_SQL = """
SELECT
ku.data,
vec_distance_cosine(vec.embedding, :query_embedding) as distance
FROM knowledge_units_vec vec
JOIN knowledge_units ku ON ku.id = vec.id
WHERE ku.status = 'approved'
ORDER BY distance
LIMIT :limit
"""

_QUERY_VEC_COMBINED_SQL = """
SELECT
ku.data,
vec_distance_cosine(vec.embedding, :query_embedding) as distance
FROM knowledge_units ku
JOIN knowledge_units_vec vec ON ku.id = vec.id
WHERE ku.status = 'approved'
AND ku.id IN (
SELECT DISTINCT unit_id
FROM knowledge_unit_domains
WHERE domain IN :domains
)
ORDER BY distance LIMIT :limit
"""

_VEC_DELETE_SQL = "DELETE FROM knowledge_units_vec WHERE id = :unit_id"
_VEC_INSERT_SQL = "INSERT INTO knowledge_units_vec (id, embedding) VALUES (:unit_id, :embedding)"


def is_enabled() -> bool:
    """Return whether semantic search is enabled (URL configured and dependencies importable)."""
    return _ENABLED


def load(conn, _) -> None:
    """Load the sqlite-vec extension into an SQLite connection."""
    if not _ENABLED:
        return
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)
    ensure_schema(conn)


def ensure_schema(conn) -> None:
    """Create the semantic search table if semantic search is enabled."""
    if not _ENABLED:
        return
    conn.executescript(_VEC_SCHEMA_SQL.format(dim=_DIM))
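
To make the storage and ranking above more concrete, here is a small stdlib-only sketch of the two ideas involved: packing a vector as a float32 blob (the compact layout that sqlite-vec's Python examples use, as far as its documentation describes) and computing cosine distance as one minus cosine similarity, which is what `vec_distance_cosine` is understood to return. The function names are hypothetical and the code does not touch sqlite-vec itself; it only mirrors the arithmetic.

```python
import math
import struct


def pack_f32(vector: list[float]) -> bytes:
    """Pack floats as a float32 blob, 4 bytes per element."""
    return struct.pack(f"{len(vector)}f", *vector)


def unpack_f32(blob: bytes) -> list[float]:
    """Inverse of pack_f32."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# A 3-dimensional vector occupies 12 bytes once packed.
blob = pack_f32([1.0, 0.0, 0.5])
```

Because smaller distances mean closer matches, an `ORDER BY distance` clause like the one in the queries above returns the most similar units first.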