Retrieval & ranking quality: smarter context, provenance-aware centrality, richer search by zzet · Pull Request #44 · zzet/gortex

zzet · 2026-06-03T07:22:04Z

Summary

A focused quality pass on how Gortex retrieves and ranks code. It sharpens the rerank pipeline, routes smart_context through that pipeline for the first time, hardens query expansion and zero-result recovery, adds finer-grained context compression, recovers exact embedding cosine, and introduces a dependency-closure context tool plus provenance-aware graph centrality.

29 commits, one self-contained feature each. No schema migrations — new behavior is additive and backward-compatible, with the higher-risk levers gated behind config/env flags.

What changed

Rerank pipeline (`internal/search/rerank`)

Generated-file down-ranking — a sixth path-penalty class demotes generated files (*.pb.go, mock_*.go, *_pb2.py, *.g.dart, …), but only when a real same-named hand-written implementation exists, so a generated file that is the sole definition is never demoted into oblivion.
Source-over-test signal — when a query surfaces both an implementation and its test, the implementation is lifted above the test. Batch-relative: it contributes nothing unless the matching test co-occurs, so it sharpens the tie that matters without shifting the rest of the result set.
Edge-provenance attenuation — LSP enrichment materialises a dense layer of interface-dispatch / framework-wiring edges that inflate the apparent centrality of utility code. A shared graph.ProvenanceWeight attenuates the abundant LSP tier (and the weak text-matched tier) relative to the unambiguous ast_resolved baseline, applied in HITS, PageRank, and a new rerank signal. Uniform-weight-safe: a graph with no provenance stamped ranks identically to before.
Continuous α tuning — the discrete query-class → α lookup is replaced by a continuous query-shape score, so a half-identifier query lands between the symbol and natural-language blends instead of jumping a whole tier. Falls back to the discrete table when α is unset, so every existing caller is unchanged.

`smart_context` composer (`internal/mcp`)

Ranks its working set through the full rerank pipeline — churn, HITS, community, co-change, frecency, feedback, path-penalty, source-bias and provenance now influence smart_context ordering (previously it used a feedback-only re-sort over raw BM25).
Always-on blast-radius block — inline callers grouped by file plus exact covering tests, and a no covering tests found warning when a symbol is untested.
Adaptive seeds + project-scaled budget — seed count and token budget scale with graph size (only when the caller didn't pin them explicitly).
Source-over-test embed ordering and polymorphic-sibling skeletonization — large interchangeable symbol families (an interface with many implementers) ship one detailed representative and stub the rest.

Query expansion & recovery (`internal/mcp`, `internal/search`)

Operator-free keyword-soup detection — a bare keyword list is now detected and split into per-disjunct fetches; genuine NL queries stay classified as concept.
Zero-result decomposition fallback — a compound/dotted/CamelCase query that returns nothing is decomposed into leaf symbol-name terms and retried.
Vocabulary-anchored expansion (opt-in) — constrains LLM-expander terms to the repo's own symbol vocabulary.
Concept-relatedness thesaurus — a weighted concept→concept layer on top of the equivalence classes (kept separate from the union-find so distinct concepts don't merge).
Real docs retrieval channel + prose-tuned reranking — corpus:docs now has its own fetch (not just a post-filter) and a prose weight profile that suppresses code-structural signals for documentation nodes.

Context compression (`internal/elide`, `internal/embedding`)

Per-glob fidelity tiers — full | compress | omit per glob pattern on read_file / get_editing_context (with real ** support), generalizing the binary compress_bodies.
AST sub-chunking extended to Ruby, PHP, Kotlin, and Swift (node kinds verified against the live grammars).

Embeddings (`internal/query`, `internal/embedding`, `internal/graph`)

Post-rerank pure-cosine refinement — recovers the exact cosine that the rank-based semantic signal discards, blended into the top-N. Strict no-op when vectors are absent.
Launch-time embedding-model variant selection — choose a named local model via config/env; zero-value preserves the current default.

Graph traversal & cache (`internal/query`, `internal/analysis`, `internal/graph`)

context_closure tool — given seed files/symbols, walks the transitive import/dependency closure and packs it under a token budget, ranked by graph distance or by seeded random-walk proximity.
Community-filtered walks — constrain walk_graph to a detected community.
Precomputed CSR adjacency snapshot + personalized PageRank — built once per analysis pass; powers the closure tool's proximity ranking.
Content-addressed package-scoped bundle cache — serves unchanged packages' search bundles from cache, invalidated by edge-aware per-package fingerprints (conservative: a package whose fingerprint was never reported is never served).

Text search & tokenization (`internal/search`, `internal/mcp`)

search_text regexp mode — runs the trigram backbone through a compiled pattern; bad patterns return a tool error, results carry the enclosing symbol.
Optional sparse sub-word n-gram tokenization + a learned boundary table mined from symbol names at index time. Ships off behind an env flag (precision-sensitive, reindex-required); index and query paths share one emission stage so postings always match.

Defaults & compatibility

Provenance attenuation and cosine refinement are on but provably no-op on graphs/queries without the relevant data.
Continuous α falls back to the legacy discrete table when unset; an explicit query_class pin and keyword-soup keep their discrete handling.
Sparse n-gram tokenization and vocabulary anchoring are off by default.
No response field was added without also updating its compact-wire encoder and pack-root etag, so caching and GCX callers stay coherent.

Add a sixth path-penalty class to the search rerank pipeline: generated files (foo.pb.go, mock_x.go, x_pb2.py, *.g.dart, …) are demoted to 0.4 — but only when a real same-named hand-written peer exists in the graph, so a generated file that is the sole definition stays un-penalised. The penalty fires only in the otherwise-neutral branch, preserving 'most aggressive bucket wins'. The generated-file classifier and its peer-path deriver move into the leaf 'excludes' package (IsGenerated / GeneratedPeerPaths) so both the MCP omission notes and the rerank signal share one source of truth without an import cycle.

Promote a production symbol above its own test when both surface for the same query. The signal is batch-relative: it contributes 0 unless a test candidate for the same symbol (matched by normalised name stem, TestValidateToken -> validatetoken) is present in the candidate set, so it never shifts scores when no test co-occurs — it only sharpens the implementation-over-test ordering on the case that matters. Complements the path penalty (which demotes tests unconditionally): the penalty pushes the test down, this boost pulls the matching implementation up, without double-counting across the result set.

The generated-file classifier moved to internal/excludes; its unit test now lives in excludes/generated_test.go. The MCP-side generated omission note stays covered by TestPathOmissions.

Replace the discrete 5-bucket query-class -> alpha lookup with a continuous query-shape score. AlphaForContinuous interpolates between the identifier-leaning and natural-language anchors by how identifier-shaped a query reads (fraction of CamelCase/snake/dotted tokens, discounted by stopword density and length), so a half-identifier query like 'validateToken handler' lands between the symbol and concept blends instead of jumping a whole tier. Hard shapes (path, signature, keyword-soup) still pin to their class alpha. The live search path opts in via Context.Alpha (set to AlphaForContinuous(query) when the class was auto-detected, deferring to an explicit query_class pin and to keyword-soup's discrete handling); continuousClassMultiplier reproduces the discrete table at its anchors so queries that snap to a class score exactly as before. Context.Alpha==0 preserves the legacy discrete path for every direct caller. The hybrid fusion fallback's AlphaFor now uses the continuous score too.

…ality LSP enrichment (gopls/clangd/tsserver) materialises a dense layer of interface-dispatch and framework-wiring edges; counting them at full weight inflates the apparent centrality of utility/framework code over genuine domain authorities. Add graph.ProvenanceWeight, attenuating the abundant lsp tier and the weak text-matched tier relative to the structurally-unambiguous ast_resolved baseline, and apply it in three places: - HITS authority/hub adjacency (weighted links) - PageRank (weighted out-degree; mass still conserved per source, so an LSP fan-out dispatcher spreads its score thinner) - a new rerank ProvenanceSignal (weight 0.15) scoring a candidate by the average provenance weight of its inbound call/reference edges With uniform weights (a graph with no Origin stamped) HITS is unchanged after L2 normalisation and PageRank's w/outWeight ratio reduces to 1/outDegree — identical to the prior computation — so existing centrality tests hold.

…eline

… tests

…s) into smart_context

… to project size

…leaf terms

…e expansion

…d context

Add a context_closure MCP tool that assembles a context pack from the transitive dependency closure of a set of seed files and/or symbols. Engine.ImportClosure walks the multi-seed dependency closure (imports, calls, references, depends_on plus the infrastructure edges) breadth first, recording each reached node's graph distance to the nearest seed, skipping unresolved/external targets and honouring the workspace/project scope. The handler resolves file seeds to their defined symbols, ranks the closure by distance, and packs the nearest members through the existing token-budgeted focus -> ring -> outline manifest. A rank=proximity option is exposed for a seeded-walk ranking wired in a later change; until then it falls back to distance order.

WalkOptions gains CommunityID + NodeToComm so a budgeted walk can be pinned to a single detected community. In the neighbour-admission block a neighbour is dropped (along with the edge that reached it) only when it has a defined membership that differs from the pinned community; structural nodes with no membership still pass, and the filter no-ops when no community is pinned or the membership map is absent. walk_graph accepts a community argument (ID or label), resolving a label to its ID the same way winnow_symbols does and threading the node->community map from the cached community result. The filter is a no-op until community detection has run.

Add a compact CSR adjacency snapshot over the call / reference graph, built once per analysis pass and cached on the server next to the PageRank result under the same graph-identity invalidation discipline. Each edge rides its provenance weight so the snapshot matches the global PageRank edge weighting. PersonalizedPageRank runs a seeded random-walk-with-restart over the snapshot in O(seeds * iters * edges) with no per-query graph scan: restart mass returns to the seed set so the stationary distribution concentrates on nodes reachable from the seeds along many short paths. The dependency-closure tool's rank=proximity mode is the real consumer: it ranks closure members by their restart-walk proximity to the seed set and surfaces the score per member, falling back to distance ranking when the snapshot has not been built yet.

Add a cache over the sqlite backend's SearchSymbolBundles that serves cached Node + in/out edges for packages whose content fingerprint is unchanged, skipping the node and edge fan-out for them and recomputing only the misses. The cache is keyed at the node level but validated at the package level: an entry is served only while its package's fingerprint still matches the one it was stored at, and a package the daemon has never reported a fingerprint for is never served — so an unvalidated or stale bundle can never escape. The daemon installs the authoritative per-package fingerprints after every analysis pass via the new BundleFingerprintSink capability. Those fingerprints are edge-aware (they fold each package's nodes and the edges touching them), so a reindex that altered a package's nodes or edges — including a cross-file edge landing on a node from elsewhere — retires exactly the affected bundles while leaving untouched packages cached. The cache key carries the repo prefix (the stored file paths are repo-prefixed) for cross-repo isolation, the partial-hit path tolerates a mix of cached and freshly fetched bundles, and the entry count is bounded. The cache stays inert until fingerprints arrive, so correctness never depends on it being warm.

…ol hits search_text gains an optional regexp flag that runs the same trigram-accelerated backbone through a compiled regular expression; an invalid pattern returns a tool error rather than zero hits, and results flow through the identical enclosing-symbol enrichment. The tool description now advertises that every hit already carries the enclosing symbol (symbol_id / symbol_name).

Add an opt-in sub-word n-gram emission stage layered over the existing fixed-rule word tokens. After Tokenize/TokenizeQuery and FTS normalization, the stage optionally appends character n-grams (n=3..4) for each word token, opening fuzzy sub-word recall paths while always preserving the original word tokens so exact matches still score. The stage runs identically on the index path (BM25Backend.Add) and the query path (BM25Backend.Search) via a shared ExpandSparseNgrams helper, so n-grammed postings are always probed with n-grammed query terms and the two can never disagree. Gated behind GORTEX_SPARSE_NGRAM, read once at process start (mirroring the FTS stemming and bigram-typo flags); default off because sub-word noise can demote an exact identifier match. The tokenizer consumes a small NgramBoundaries abstraction so a learned boundary table can later drive data-driven splits; passing nil degrades cleanly to fixed character n-grams.

Mine a per-repo sub-word boundary table from symbol names at index build time (BuildNgramBoundaries, the auto-concept mining pattern): count adjacent character-pair co-occurrence across the vocabulary, rank, and select the rarest pairs as high-information split seams. The table is installed onto the BM25 backend in buildSearchIndex so the optional sparse n-gram tokenizer splits at data-driven boundaries instead of a fixed n. Deterministic (sorted before the percentile cut) and per-repo. A no-op unless the sparse n-gram tokenizer is enabled.

Update the tool reference, the semantic-search guide, and the feature overview to cover the new retrieval behavior: the context_closure tool; search_text regexp mode + enclosing-symbol hits; search_symbols corpus channel / vocab_anchored / zero-result decomposition; get_editing_context fidelity_globs; smart_context blast_radius + working_set + size-scaled budget + family skeletonization; continuous bm25/vector blend; edge-provenance attenuation; generated-file and source-over-test ranking; post-rerank cosine; the embedding model variant; and the opt-in sparse sub-word tokenizer.

Replace the floating go-version 1.26 pins in the ci, release, and security workflows with go-version-file: go.mod, so every job builds and scans with the exact patched toolchain go.mod declares (1.26.4) instead of whatever patch the setup-go manifest happens to serve. Makes go.mod the single source of truth and keeps govulncheck from flagging stdlib advisories already fixed in a patch the floating manifest lagged behind. Matches the go-version-file pattern already used in bench-arm, init-smoke, and publish-claude-plugin.

ensureHugotModel now refuses to fetch a not-yet-cached model when GORTEX_EMBEDDING_OFFLINE is set, so NewLocalProvider falls through to the static provider instead of hitting the network. Gives air-gapped hosts and sandboxes a real knob and makes provider construction testable without a download. Default (unset) keeps the existing download-on-first-use behaviour. TestNewProviderFromConfig_EmptyVariantUsesDefault downloaded MiniLM on a cold cache, tripping an upstream data race in go-huggingface's parallel downloader under -race. It now points XDG_DATA_HOME at an empty temp dir and enables the offline guard, so it runs hermetically and exercises the static fallback it actually asserts.

TestNotebook_PrunesByTTL used a 1ms TTL, but pruneLocked computes its cutoff after the fresh entry's Updated stamp is written; on a loaded runner the save+prune exceeds 1ms and the just-saved entry sweeps itself (0 remain instead of 1). Bump the TTL to a minute, far below the stale entry's 1h age and far above any realistic save latency.

zzet added 30 commits June 2, 2026 23:56

test(mcp): drop local generated-file test, covered by excludes pkg

682037c

The generated-file classifier moved to internal/excludes; its unit test now lives in excludes/generated_test.go. The MCP-side generated omission note stays covered by TestPathOmissions.

feat(mcp): rank smart_context working set through the full rerank pip…

1177011

…eline

feat(mcp): bias smart_context source embedding toward production over…

3d99e9d

… tests

feat(mcp): fuse a blast-radius block (callers by file + covering test…

e460e90

…s) into smart_context

feat(mcp): scale smart_context seed count to graph size

0c96f20

feat(mcp): cluster smart_context working set by file and scale budget…

d1b5fdd

… to project size

feat(search): detect operator-free keyword-soup queries and split them

3e18d39

feat(mcp): rescue zero-result identifier queries by decomposing into …

d20d35a

…leaf terms

feat(search): vocabulary-anchored query expansion option

50af9d1

feat(search): add a concept-relatedness thesaurus layer to equivalenc…

2f90655

…e expansion

feat(mcp): give the docs corpus its own retrieval channel

4d64dac

feat(search): prose-tuned reranking for documentation queries

bbbe872

feat(elide): per-glob full/compress/omit fidelity tiers

8cfe092

feat(mcp): skeletonize large interchangeable symbol families in grade…

16d977a

…d context

feat(embedding): AST sub-chunking for ruby, php, kotlin, and swift

1d158e4

feat(search): recover exact cosine in a post-rerank refinement stage

9c4073c

feat(embedding): launch-time embedding model variant selection

570a390

test(indexer): cover the concurrent embedding worker pool

98f91a1

zzet added 4 commits June 3, 2026 17:22

updade deps

73a6bae

zzet merged commit 9909379 into main Jun 3, 2026
9 checks passed

zzet deleted the feat/retrieval-ranking-quality branch June 3, 2026 20:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retrieval & ranking quality: smarter context, provenance-aware centrality, richer search#44

Retrieval & ranking quality: smarter context, provenance-aware centrality, richer search#44
zzet merged 34 commits into
mainfrom
feat/retrieval-ranking-quality

zzet commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zzet commented Jun 3, 2026

Summary

What changed

Rerank pipeline (internal/search/rerank)

smart_context composer (internal/mcp)

Query expansion & recovery (internal/mcp, internal/search)

Context compression (internal/elide, internal/embedding)

Embeddings (internal/query, internal/embedding, internal/graph)

Graph traversal & cache (internal/query, internal/analysis, internal/graph)

Text search & tokenization (internal/search, internal/mcp)

Defaults & compatibility

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Rerank pipeline (`internal/search/rerank`)

`smart_context` composer (`internal/mcp`)

Query expansion & recovery (`internal/mcp`, `internal/search`)

Context compression (`internal/elide`, `internal/embedding`)

Embeddings (`internal/query`, `internal/embedding`, `internal/graph`)

Graph traversal & cache (`internal/query`, `internal/analysis`, `internal/graph`)

Text search & tokenization (`internal/search`, `internal/mcp`)