fix(search): match multi-word FTS queries#306

Merged: neuromechanist merged 3 commits into develop from feature/issue-305-fts-multiword-search on Jun 5, 2026
@neuromechanist

Member

Summary

Papers (and all other knowledge sources) were synced but never found at query time. Root cause: _sanitize_fts5_query() wrapped the entire query in quotes, forcing an exact consecutive-phrase FTS5 match. A multi-word query therefore matched only documents containing those words adjacent and in the same order, so in practice every multi-word query the agent sent returned zero rows.

Proven on the live dev eeglab.db (1940 papers):

| Query | Old (phrase) | New (OR + rank) |
|---|---|---|
| ICA EEGLAB | 0 | 497 |
| papers about ICA artifact removal | 0 | 710 |
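
The gap between the two columns can be reproduced on a toy index. This is a minimal sketch using Python's sqlite3 module, assuming the local SQLite build includes FTS5. The papers table and its three titles are invented for illustration and are not taken from eeglab.db.

```python
import sqlite3

# Toy FTS5 index; table name and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE papers USING fts5(title)")
conn.executemany(
    "INSERT INTO papers(title) VALUES (?)",
    [
        ("EEGLAB toolbox for ICA decomposition",),
        ("Artifact removal with ICA",),
        ("Sleep staging with deep learning",),
    ],
)


def search(match_expr):
    """Run an FTS5 MATCH and return titles, best BM25 rank first."""
    rows = conn.execute(
        "SELECT title FROM papers WHERE papers MATCH ? ORDER BY rank",
        (match_expr,),
    ).fetchall()
    return [row[0] for row in rows]


# Old behaviour: the whole query is one phrase, so "ICA" must be
# immediately followed by "EEGLAB" in the same document.
phrase_hits = search('"ICA EEGLAB"')

# New behaviour: each term is quoted on its own and the terms are OR-ed.
or_hits = search('"ICA" OR "EEGLAB"')

print(len(phrase_hits), len(or_hits))
```

No title contains "ICA" directly followed by "EEGLAB", so the phrase form finds nothing, while the OR form returns both ICA-related titles.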

Change

_sanitize_fts5_query now: tokenize -> drop stopwords + operator chars -> quote each term individually -> OR them. Results keep the existing BM25 rank ordering, so the best matches surface first.

  • Still injection-safe: each term is individually quoted, so FTS5 operators (AND/OR/NEAR/wildcards) and punctuation are inert.
  • Identifier tokens (e.g. pop_runica, clean_rawdata) are preserved as adjacency phrases.
  • Falls back to a safe quoted phrase when a query is all stopwords/symbols.
  • Fixes all 6 search call sites (across papers, discussions, FAQ, docstrings, and search_all), not just papers.
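
The pipeline described above (tokenize, drop stopwords and operator characters, quote each term, OR the terms) might look roughly like the sketch below. It is not the project's actual _sanitize_fts5_query: the function name sanitize_fts5_query_sketch, the stopword set STOPWORDS_SKETCH, and the docs table are hypothetical, and the tokenizing regex is an assumption. The last part assumes an SQLite build with FTS5 and its default unicode61 tokenizer, which treats the underscore as a separator.

```python
import re
import sqlite3

# Hypothetical stopword set; the project's real _FTS_STOPWORDS is not shown here.
STOPWORDS_SKETCH = frozenset(
    {"a", "about", "an", "and", "for", "in", "not", "of", "or", "the", "to", "what", "with"}
)


def sanitize_fts5_query_sketch(query: str) -> str:
    """Tokenize, drop stopwords, quote each term, and join the terms with OR."""
    # \w+ keeps letters, digits, and underscores; quotes, *, ^, :, and
    # parentheses act as separators and never reach the output.
    tokens = re.findall(r"\w+", query)
    terms = [t for t in tokens if t.lower() not in STOPWORDS_SKETCH]
    if not terms:
        # Fallback for all-stopword or all-symbol input: one quoted phrase,
        # with embedded double quotes doubled as FTS5 string syntax requires.
        return '"' + query.strip().replace('"', '""') + '"'
    return " OR ".join(f'"{t}"' for t in terms)


multi_word = sanitize_fts5_query_sketch("papers about ICA artifact removal")
hostile = sanitize_fts5_query_sketch('ICA" OR title:* NEAR(')
fallback = sanitize_fts5_query_sketch("what about the")

# Identifier tokens: the quoted term "pop_runica" is split by the tokenizer
# into the adjacent tokens pop + runica, i.e. it behaves as a phrase.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("run pop_runica on the dataset",), ("the pop menu calls runica later",)],
)
identifier_hits = [
    row[0]
    for row in conn.execute(
        "SELECT body FROM docs WHERE docs MATCH ?",
        (sanitize_fts5_query_sketch("pop_runica"),),
    )
]

print(multi_word)
print(hostile)
print(fallback)
print(identifier_hits)
```

In this sketch the bare OR in the hostile input is dropped as a stopword, while NEAR survives only as a quoted term, where FTS5 reads it as ordinary text. The identifier query matches the row where pop and runica are adjacent, not the row where they merely co-occur.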

Tests

  • Updated TestFTS5Sanitization for the new OR-of-terms output + injection safety.
  • Added regression tests: a multi-word NL query with non-adjacent terms, and a full question with stopwords, both now return matching papers.
  • 293 passed across knowledge + tool suites; ruff clean.

Closes #305

Replace blanket phrase-wrapping in _sanitize_fts5_query with tokenize ->
drop stopwords/operators -> quote each term -> OR. Multi-word queries
(every query the agent sends for papers/discussions/FAQ/docstrings) were
exact-phrase matched and returned nothing despite populated databases.

Still injection-safe (each term individually quoted); results ordered by
existing BM25 rank. Affects all 6 knowledge-search call sites.

Closes #305

Address PR review:
- strengthen test_sanitize_fts5_operators to assert every term is
  individually quoted (no bare operator can reach MATCH)
- drop 'list' and 'use' from stopwords (meaningful EEGLAB/MATLAB nouns)
  so multi-word queries like 'list channels' don't silently lose a term

@neuromechanist

Member Author

Review (Sonnet, pr-review-toolkit) + resolutions

Passes: FTS5 injection safety sound (regex strips all operator chars, each term individually quoted); fallback branch correct for all-stopword/all-symbol input; single-term stopword queries fall back safely; regression genuinely fixed with real SQLite (no mocks); no new OperationalError paths.
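
To illustrate why individually quoted terms avoid OperationalError, here is a minimal sketch, again assuming an SQLite build with FTS5. The faq table, its single row, and the run_match helper are hypothetical. A dangling bare operator is a syntax error in a MATCH expression, whereas the same word inside double quotes is an ordinary search term.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE faq USING fts5(body)")
conn.execute("INSERT INTO faq(body) VALUES ('ICA and artifact rejection')")


def run_match(expr):
    """Execute one MATCH expression against the toy faq table."""
    return conn.execute(
        "SELECT body FROM faq WHERE faq MATCH ?", (expr,)
    ).fetchall()


# Unsanitized input: AND is parsed as an operator with no right-hand operand.
try:
    run_match("ICA AND")
    bare_error = None
except sqlite3.OperationalError as exc:
    bare_error = str(exc)

# Sanitized form: every term is quoted, so "AND" is just a token to look up.
quoted_rows = run_match('"ICA" OR "AND"')

print("bare operator raised:", bare_error is not None)
print("quoted rows:", len(quoted_rows))
```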

Important findings addressed:

  1. Operator test was too weak (only checked start/end quotes). Strengthened test_sanitize_fts5_operators to assert every OR-separated term is individually quoted and no bare operator text remains.
  2. Stopwords could silently drop meaningful terms. Removed list and use from _FTS_STOPWORDS (they double as EEGLAB/MATLAB content nouns, e.g. 'list channels'). Added test_sanitize_keeps_command_noun_terms. Grammar/boolean words (and/or/not/the/what...) stay dropped on purpose — that's the precision win.
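
The stronger assertion from finding 1 could be expressed along these lines. This is a sketch, not the project's test_sanitize_fts5_operators; the helper name all_terms_individually_quoted is hypothetical. It shows why checking only the first and last character is too weak: a single phrase wrapping a bare OR also starts and ends with a quote.

```python
import re

# One double-quoted term with no embedded quote characters.
_QUOTED_TERM = re.compile(r'"[^"]+"')


def all_terms_individually_quoted(sanitized: str) -> bool:
    """True only if every OR-separated part is exactly one quoted term."""
    parts = sanitized.split(" OR ")
    return all(_QUOTED_TERM.fullmatch(part) is not None for part in parts)


good = '"ICA" OR "EEGLAB"'
single_phrase = '"ICA OR EEGLAB"'   # passes a start/end-quote check only
bare_operator = '"ICA" OR NEAR'     # unquoted operator text remains

for candidate in (good, single_phrase, bare_operator):
    weak_check = candidate.startswith('"') and candidate.endswith('"')
    print(candidate, weak_check, all_terms_individually_quoted(candidate))
```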

No critical issues. 41 passed, ruff clean.

BEP keyword search is OR-based and rank-ordered after the FTS fix, so the
old phrase-only no-match query ('...data type...') now matches BEPs that
mention 'data'. Use a genuinely non-matching query to keep the
no-results path covered.
@neuromechanist neuromechanist merged commit 073d260 into develop Jun 5, 2026
6 checks passed
@neuromechanist neuromechanist deleted the feature/issue-305-fts-multiword-search branch June 5, 2026 07:45
@neuromechanist neuromechanist mentioned this pull request Jun 8, 2026