Skip to content

Retrieval & ranking quality: smarter context, provenance-aware centrality, richer search#44

Merged
zzet merged 34 commits into
mainfrom
feat/retrieval-ranking-quality
Jun 3, 2026
Merged

Retrieval & ranking quality: smarter context, provenance-aware centrality, richer search#44
zzet merged 34 commits into
mainfrom
feat/retrieval-ranking-quality

Conversation

@zzet
Copy link
Copy Markdown
Owner

@zzet zzet commented Jun 3, 2026

Summary

A focused quality pass on how Gortex retrieves and ranks code. It sharpens the rerank pipeline, routes smart_context through that pipeline for the first time, hardens query expansion and zero-result recovery, adds finer-grained context compression, recovers exact embedding cosine, and introduces a dependency-closure context tool plus provenance-aware graph centrality.

29 commits, one self-contained feature each. No schema migrations — new behavior is additive and backward-compatible, with the higher-risk levers gated behind config/env flags.

What changed

Rerank pipeline (internal/search/rerank)

  • Generated-file down-ranking — a sixth path-penalty class demotes generated files (*.pb.go, mock_*.go, *_pb2.py, *.g.dart, …), but only when a real same-named hand-written implementation exists, so a generated file that is the sole definition is never demoted into oblivion.
  • Source-over-test signal — when a query surfaces both an implementation and its test, the implementation is lifted above the test. Batch-relative: it contributes nothing unless the matching test co-occurs, so it sharpens the tie that matters without shifting the rest of the result set.
  • Edge-provenance attenuation — LSP enrichment materialises a dense layer of interface-dispatch / framework-wiring edges that inflate the apparent centrality of utility code. A shared graph.ProvenanceWeight attenuates the abundant LSP tier (and the weak text-matched tier) relative to the unambiguous ast_resolved baseline, applied in HITS, PageRank, and a new rerank signal. Uniform-weight-safe: a graph with no provenance stamped ranks identically to before.
  • Continuous α tuning — the discrete query-class → α lookup is replaced by a continuous query-shape score, so a half-identifier query lands between the symbol and natural-language blends instead of jumping a whole tier. Falls back to the discrete table when α is unset, so every existing caller is unchanged.

smart_context composer (internal/mcp)

  • Ranks its working set through the full rerank pipeline — churn, HITS, community, co-change, frecency, feedback, path-penalty, source-bias and provenance now influence smart_context ordering (previously it used a feedback-only re-sort over raw BM25).
  • Always-on blast-radius block — inline callers grouped by file plus exact covering tests, and a no covering tests found warning when a symbol is untested.
  • Adaptive seeds + project-scaled budget — seed count and token budget scale with graph size (only when the caller didn't pin them explicitly).
  • Source-over-test embed ordering and polymorphic-sibling skeletonization — large interchangeable symbol families (an interface with many implementers) ship one detailed representative and stub the rest.

Query expansion & recovery (internal/mcp, internal/search)

  • Operator-free keyword-soup detection — a bare keyword list is now detected and split into per-disjunct fetches; genuine NL queries stay classified as concept.
  • Zero-result decomposition fallback — a compound/dotted/CamelCase query that returns nothing is decomposed into leaf symbol-name terms and retried.
  • Vocabulary-anchored expansion (opt-in) — constrains LLM-expander terms to the repo's own symbol vocabulary.
  • Concept-relatedness thesaurus — a weighted concept→concept layer on top of the equivalence classes (kept separate from the union-find so distinct concepts don't merge).
  • Real docs retrieval channel + prose-tuned rerankingcorpus:docs now has its own fetch (not just a post-filter) and a prose weight profile that suppresses code-structural signals for documentation nodes.

Context compression (internal/elide, internal/embedding)

  • Per-glob fidelity tiersfull | compress | omit per glob pattern on read_file / get_editing_context (with real ** support), generalizing the binary compress_bodies.
  • AST sub-chunking extended to Ruby, PHP, Kotlin, and Swift (node kinds verified against the live grammars).

Embeddings (internal/query, internal/embedding, internal/graph)

  • Post-rerank pure-cosine refinement — recovers the exact cosine that the rank-based semantic signal discards, blended into the top-N. Strict no-op when vectors are absent.
  • Launch-time embedding-model variant selection — choose a named local model via config/env; zero-value preserves the current default.

Graph traversal & cache (internal/query, internal/analysis, internal/graph)

  • context_closure tool — given seed files/symbols, walks the transitive import/dependency closure and packs it under a token budget, ranked by graph distance or by seeded random-walk proximity.
  • Community-filtered walks — constrain walk_graph to a detected community.
  • Precomputed CSR adjacency snapshot + personalized PageRank — built once per analysis pass; powers the closure tool's proximity ranking.
  • Content-addressed package-scoped bundle cache — serves unchanged packages' search bundles from cache, invalidated by edge-aware per-package fingerprints (conservative: a package whose fingerprint was never reported is never served).

Text search & tokenization (internal/search, internal/mcp)

  • search_text regexp mode — runs the trigram backbone through a compiled pattern; bad patterns return a tool error, results carry the enclosing symbol.
  • Optional sparse sub-word n-gram tokenization + a learned boundary table mined from symbol names at index time. Ships off behind an env flag (precision-sensitive, reindex-required); index and query paths share one emission stage so postings always match.

Defaults & compatibility

  • Provenance attenuation and cosine refinement are on but provably no-op on graphs/queries without the relevant data.
  • Continuous α falls back to the legacy discrete table when unset; an explicit query_class pin and keyword-soup keep their discrete handling.
  • Sparse n-gram tokenization and vocabulary anchoring are off by default.
  • No response field was added without also updating its compact-wire encoder and pack-root etag, so caching and GCX callers stay coherent.

zzet added 30 commits June 2, 2026 23:56
Add a sixth path-penalty class to the search rerank pipeline:
generated files (foo.pb.go, mock_x.go, x_pb2.py, *.g.dart, …) are
demoted to 0.4 — but only when a real same-named hand-written peer
exists in the graph, so a generated file that is the sole definition
stays un-penalised. The penalty fires only in the otherwise-neutral
branch, preserving 'most aggressive bucket wins'.

The generated-file classifier and its peer-path deriver move into the
leaf 'excludes' package (IsGenerated / GeneratedPeerPaths) so both the
MCP omission notes and the rerank signal share one source of truth
without an import cycle.
Promote a production symbol above its own test when both surface for
the same query. The signal is batch-relative: it contributes 0 unless
a test candidate for the same symbol (matched by normalised name stem,
TestValidateToken -> validatetoken) is present in the candidate set,
so it never shifts scores when no test co-occurs — it only sharpens
the implementation-over-test ordering on the case that matters.

Complements the path penalty (which demotes tests unconditionally):
the penalty pushes the test down, this boost pulls the matching
implementation up, without double-counting across the result set.
The generated-file classifier moved to internal/excludes; its unit
test now lives in excludes/generated_test.go. The MCP-side generated
omission note stays covered by TestPathOmissions.
Replace the discrete 5-bucket query-class -> alpha lookup with a
continuous query-shape score. AlphaForContinuous interpolates between
the identifier-leaning and natural-language anchors by how
identifier-shaped a query reads (fraction of CamelCase/snake/dotted
tokens, discounted by stopword density and length), so a
half-identifier query like 'validateToken handler' lands between the
symbol and concept blends instead of jumping a whole tier. Hard shapes
(path, signature, keyword-soup) still pin to their class alpha.

The live search path opts in via Context.Alpha (set to
AlphaForContinuous(query) when the class was auto-detected, deferring
to an explicit query_class pin and to keyword-soup's discrete
handling); continuousClassMultiplier reproduces the discrete table at
its anchors so queries that snap to a class score exactly as before.
Context.Alpha==0 preserves the legacy discrete path for every direct
caller. The hybrid fusion fallback's AlphaFor now uses the continuous
score too.
…ality

LSP enrichment (gopls/clangd/tsserver) materialises a dense layer of
interface-dispatch and framework-wiring edges; counting them at full
weight inflates the apparent centrality of utility/framework code over
genuine domain authorities.

Add graph.ProvenanceWeight, attenuating the abundant lsp tier and the
weak text-matched tier relative to the structurally-unambiguous
ast_resolved baseline, and apply it in three places:
  - HITS authority/hub adjacency (weighted links)
  - PageRank (weighted out-degree; mass still conserved per source, so
    an LSP fan-out dispatcher spreads its score thinner)
  - a new rerank ProvenanceSignal (weight 0.15) scoring a candidate by
    the average provenance weight of its inbound call/reference edges

With uniform weights (a graph with no Origin stamped) HITS is
unchanged after L2 normalisation and PageRank's w/outWeight ratio
reduces to 1/outDegree — identical to the prior computation — so
existing centrality tests hold.
Add a context_closure MCP tool that assembles a context pack from the
transitive dependency closure of a set of seed files and/or symbols.

Engine.ImportClosure walks the multi-seed dependency closure (imports,
calls, references, depends_on plus the infrastructure edges) breadth
first, recording each reached node's graph distance to the nearest
seed, skipping unresolved/external targets and honouring the
workspace/project scope. The handler resolves file seeds to their
defined symbols, ranks the closure by distance, and packs the nearest
members through the existing token-budgeted focus -> ring -> outline
manifest. A rank=proximity option is exposed for a seeded-walk ranking
wired in a later change; until then it falls back to distance order.
WalkOptions gains CommunityID + NodeToComm so a budgeted walk can be
pinned to a single detected community. In the neighbour-admission block
a neighbour is dropped (along with the edge that reached it) only when
it has a defined membership that differs from the pinned community;
structural nodes with no membership still pass, and the filter no-ops
when no community is pinned or the membership map is absent.

walk_graph accepts a community argument (ID or label), resolving a
label to its ID the same way winnow_symbols does and threading the
node->community map from the cached community result. The filter is a
no-op until community detection has run.
Add a compact CSR adjacency snapshot over the call / reference graph,
built once per analysis pass and cached on the server next to the
PageRank result under the same graph-identity invalidation discipline.
Each edge rides its provenance weight so the snapshot matches the
global PageRank edge weighting.

PersonalizedPageRank runs a seeded random-walk-with-restart over the
snapshot in O(seeds * iters * edges) with no per-query graph scan:
restart mass returns to the seed set so the stationary distribution
concentrates on nodes reachable from the seeds along many short paths.

The dependency-closure tool's rank=proximity mode is the real consumer:
it ranks closure members by their restart-walk proximity to the seed
set and surfaces the score per member, falling back to distance
ranking when the snapshot has not been built yet.
Add a cache over the sqlite backend's SearchSymbolBundles that serves
cached Node + in/out edges for packages whose content fingerprint is
unchanged, skipping the node and edge fan-out for them and recomputing
only the misses. The cache is keyed at the node level but validated at
the package level: an entry is served only while its package's
fingerprint still matches the one it was stored at, and a package the
daemon has never reported a fingerprint for is never served — so an
unvalidated or stale bundle can never escape.

The daemon installs the authoritative per-package fingerprints after
every analysis pass via the new BundleFingerprintSink capability. Those
fingerprints are edge-aware (they fold each package's nodes and the
edges touching them), so a reindex that altered a package's nodes or
edges — including a cross-file edge landing on a node from elsewhere —
retires exactly the affected bundles while leaving untouched packages
cached. The cache key carries the repo prefix (the stored file paths
are repo-prefixed) for cross-repo isolation, the partial-hit path
tolerates a mix of cached and freshly fetched bundles, and the entry
count is bounded. The cache stays inert until fingerprints arrive, so
correctness never depends on it being warm.
…ol hits

search_text gains an optional regexp flag that runs the same
trigram-accelerated backbone through a compiled regular expression; an
invalid pattern returns a tool error rather than zero hits, and
results flow through the identical enclosing-symbol enrichment. The
tool description now advertises that every hit already carries the
enclosing symbol (symbol_id / symbol_name).
Add an opt-in sub-word n-gram emission stage layered over the existing
fixed-rule word tokens. After Tokenize/TokenizeQuery and FTS
normalization, the stage optionally appends character n-grams (n=3..4)
for each word token, opening fuzzy sub-word recall paths while always
preserving the original word tokens so exact matches still score.

The stage runs identically on the index path (BM25Backend.Add) and the
query path (BM25Backend.Search) via a shared ExpandSparseNgrams helper,
so n-grammed postings are always probed with n-grammed query terms and
the two can never disagree.

Gated behind GORTEX_SPARSE_NGRAM, read once at process start (mirroring
the FTS stemming and bigram-typo flags); default off because sub-word
noise can demote an exact identifier match. The tokenizer consumes a
small NgramBoundaries abstraction so a learned boundary table can later
drive data-driven splits; passing nil degrades cleanly to fixed
character n-grams.
Mine a per-repo sub-word boundary table from symbol names at index
build time (BuildNgramBoundaries, the auto-concept mining pattern):
count adjacent character-pair co-occurrence across the vocabulary,
rank, and select the rarest pairs as high-information split seams. The
table is installed onto the BM25 backend in buildSearchIndex so the
optional sparse n-gram tokenizer splits at data-driven boundaries
instead of a fixed n. Deterministic (sorted before the percentile
cut) and per-repo. A no-op unless the sparse n-gram tokenizer is
enabled.
Update the tool reference, the semantic-search guide, and the feature
overview to cover the new retrieval behavior: the context_closure
tool; search_text regexp mode + enclosing-symbol hits; search_symbols
corpus channel / vocab_anchored / zero-result decomposition;
get_editing_context fidelity_globs; smart_context blast_radius +
working_set + size-scaled budget + family skeletonization; continuous
bm25/vector blend; edge-provenance attenuation; generated-file and
source-over-test ranking; post-rerank cosine; the embedding model
variant; and the opt-in sparse sub-word tokenizer.
zzet added 4 commits June 3, 2026 17:22
Replace the floating go-version 1.26 pins in the ci, release, and security workflows with go-version-file: go.mod, so every job builds and scans with the exact patched toolchain go.mod declares (1.26.4) instead of whatever patch the setup-go manifest happens to serve.

Makes go.mod the single source of truth and keeps govulncheck from flagging stdlib advisories already fixed in a patch the floating manifest lagged behind. Matches the go-version-file pattern already used in bench-arm, init-smoke, and publish-claude-plugin.
ensureHugotModel now refuses to fetch a not-yet-cached model when GORTEX_EMBEDDING_OFFLINE is set, so NewLocalProvider falls through to the static provider instead of hitting the network. Gives air-gapped hosts and sandboxes a real knob and makes provider construction testable without a download. Default (unset) keeps the existing download-on-first-use behaviour.

TestNewProviderFromConfig_EmptyVariantUsesDefault downloaded MiniLM on a cold cache, tripping an upstream data race in go-huggingface's parallel downloader under -race. It now points XDG_DATA_HOME at an empty temp dir and enables the offline guard, so it runs hermetically and exercises the static fallback it actually asserts.
TestNotebook_PrunesByTTL used a 1ms TTL, but pruneLocked computes its cutoff after the fresh entry's Updated stamp is written; on a loaded runner the save+prune exceeds 1ms and the just-saved entry sweeps itself (0 remain instead of 1). Bump the TTL to a minute, far below the stale entry's 1h age and far above any realistic save latency.
@zzet zzet merged commit 9909379 into main Jun 3, 2026
9 checks passed
@zzet zzet deleted the feat/retrieval-ranking-quality branch June 3, 2026 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant