Retrieval & ranking quality: smarter context, provenance-aware centrality, richer search#44
Merged
Merged
Conversation
Add a sixth path-penalty class to the search rerank pipeline: generated files (foo.pb.go, mock_x.go, x_pb2.py, *.g.dart, …) are demoted to 0.4 — but only when a real same-named hand-written peer exists in the graph, so a generated file that is the sole definition stays un-penalised. The penalty fires only in the otherwise-neutral branch, preserving 'most aggressive bucket wins'. The generated-file classifier and its peer-path deriver move into the leaf 'excludes' package (IsGenerated / GeneratedPeerPaths) so both the MCP omission notes and the rerank signal share one source of truth without an import cycle.
Promote a production symbol above its own test when both surface for the same query. The signal is batch-relative: it contributes 0 unless a test candidate for the same symbol (matched by normalised name stem, TestValidateToken -> validatetoken) is present in the candidate set, so it never shifts scores when no test co-occurs — it only sharpens the implementation-over-test ordering on the case that matters. Complements the path penalty (which demotes tests unconditionally): the penalty pushes the test down, this boost pulls the matching implementation up, without double-counting across the result set.
The generated-file classifier moved to internal/excludes; its unit test now lives in excludes/generated_test.go. The MCP-side generated omission note stays covered by TestPathOmissions.
Replace the discrete 5-bucket query-class -> alpha lookup with a continuous query-shape score. AlphaForContinuous interpolates between the identifier-leaning and natural-language anchors by how identifier-shaped a query reads (fraction of CamelCase/snake/dotted tokens, discounted by stopword density and length), so a half-identifier query like 'validateToken handler' lands between the symbol and concept blends instead of jumping a whole tier. Hard shapes (path, signature, keyword-soup) still pin to their class alpha. The live search path opts in via Context.Alpha (set to AlphaForContinuous(query) when the class was auto-detected, deferring to an explicit query_class pin and to keyword-soup's discrete handling); continuousClassMultiplier reproduces the discrete table at its anchors so queries that snap to a class score exactly as before. Context.Alpha==0 preserves the legacy discrete path for every direct caller. The hybrid fusion fallback's AlphaFor now uses the continuous score too.
…ality
LSP enrichment (gopls/clangd/tsserver) materialises a dense layer of
interface-dispatch and framework-wiring edges; counting them at full
weight inflates the apparent centrality of utility/framework code over
genuine domain authorities.
Add graph.ProvenanceWeight, attenuating the abundant lsp tier and the
weak text-matched tier relative to the structurally-unambiguous
ast_resolved baseline, and apply it in three places:
- HITS authority/hub adjacency (weighted links)
- PageRank (weighted out-degree; mass still conserved per source, so
an LSP fan-out dispatcher spreads its score thinner)
- a new rerank ProvenanceSignal (weight 0.15) scoring a candidate by
the average provenance weight of its inbound call/reference edges
With uniform weights (a graph with no Origin stamped) HITS is
unchanged after L2 normalisation and PageRank's w/outWeight ratio
reduces to 1/outDegree — identical to the prior computation — so
existing centrality tests hold.
…s) into smart_context
Add a context_closure MCP tool that assembles a context pack from the transitive dependency closure of a set of seed files and/or symbols. Engine.ImportClosure walks the multi-seed dependency closure (imports, calls, references, depends_on plus the infrastructure edges) breadth first, recording each reached node's graph distance to the nearest seed, skipping unresolved/external targets and honouring the workspace/project scope. The handler resolves file seeds to their defined symbols, ranks the closure by distance, and packs the nearest members through the existing token-budgeted focus -> ring -> outline manifest. A rank=proximity option is exposed for a seeded-walk ranking wired in a later change; until then it falls back to distance order.
WalkOptions gains CommunityID + NodeToComm so a budgeted walk can be pinned to a single detected community. In the neighbour-admission block a neighbour is dropped (along with the edge that reached it) only when it has a defined membership that differs from the pinned community; structural nodes with no membership still pass, and the filter no-ops when no community is pinned or the membership map is absent. walk_graph accepts a community argument (ID or label), resolving a label to its ID the same way winnow_symbols does and threading the node->community map from the cached community result. The filter is a no-op until community detection has run.
Add a compact CSR adjacency snapshot over the call / reference graph, built once per analysis pass and cached on the server next to the PageRank result under the same graph-identity invalidation discipline. Each edge rides its provenance weight so the snapshot matches the global PageRank edge weighting. PersonalizedPageRank runs a seeded random-walk-with-restart over the snapshot in O(seeds * iters * edges) with no per-query graph scan: restart mass returns to the seed set so the stationary distribution concentrates on nodes reachable from the seeds along many short paths. The dependency-closure tool's rank=proximity mode is the real consumer: it ranks closure members by their restart-walk proximity to the seed set and surfaces the score per member, falling back to distance ranking when the snapshot has not been built yet.
Add a cache over the sqlite backend's SearchSymbolBundles that serves cached Node + in/out edges for packages whose content fingerprint is unchanged, skipping the node and edge fan-out for them and recomputing only the misses. The cache is keyed at the node level but validated at the package level: an entry is served only while its package's fingerprint still matches the one it was stored at, and a package the daemon has never reported a fingerprint for is never served — so an unvalidated or stale bundle can never escape. The daemon installs the authoritative per-package fingerprints after every analysis pass via the new BundleFingerprintSink capability. Those fingerprints are edge-aware (they fold each package's nodes and the edges touching them), so a reindex that altered a package's nodes or edges — including a cross-file edge landing on a node from elsewhere — retires exactly the affected bundles while leaving untouched packages cached. The cache key carries the repo prefix (the stored file paths are repo-prefixed) for cross-repo isolation, the partial-hit path tolerates a mix of cached and freshly fetched bundles, and the entry count is bounded. The cache stays inert until fingerprints arrive, so correctness never depends on it being warm.
…ol hits search_text gains an optional regexp flag that runs the same trigram-accelerated backbone through a compiled regular expression; an invalid pattern returns a tool error rather than zero hits, and results flow through the identical enclosing-symbol enrichment. The tool description now advertises that every hit already carries the enclosing symbol (symbol_id / symbol_name).
Add an opt-in sub-word n-gram emission stage layered over the existing fixed-rule word tokens. After Tokenize/TokenizeQuery and FTS normalization, the stage optionally appends character n-grams (n=3..4) for each word token, opening fuzzy sub-word recall paths while always preserving the original word tokens so exact matches still score. The stage runs identically on the index path (BM25Backend.Add) and the query path (BM25Backend.Search) via a shared ExpandSparseNgrams helper, so n-grammed postings are always probed with n-grammed query terms and the two can never disagree. Gated behind GORTEX_SPARSE_NGRAM, read once at process start (mirroring the FTS stemming and bigram-typo flags); default off because sub-word noise can demote an exact identifier match. The tokenizer consumes a small NgramBoundaries abstraction so a learned boundary table can later drive data-driven splits; passing nil degrades cleanly to fixed character n-grams.
Mine a per-repo sub-word boundary table from symbol names at index build time (BuildNgramBoundaries, the auto-concept mining pattern): count adjacent character-pair co-occurrence across the vocabulary, rank, and select the rarest pairs as high-information split seams. The table is installed onto the BM25 backend in buildSearchIndex so the optional sparse n-gram tokenizer splits at data-driven boundaries instead of a fixed n. Deterministic (sorted before the percentile cut) and per-repo. A no-op unless the sparse n-gram tokenizer is enabled.
Update the tool reference, the semantic-search guide, and the feature overview to cover the new retrieval behavior: the context_closure tool; search_text regexp mode + enclosing-symbol hits; search_symbols corpus channel / vocab_anchored / zero-result decomposition; get_editing_context fidelity_globs; smart_context blast_radius + working_set + size-scaled budget + family skeletonization; continuous bm25/vector blend; edge-provenance attenuation; generated-file and source-over-test ranking; post-rerank cosine; the embedding model variant; and the opt-in sparse sub-word tokenizer.
Replace the floating go-version 1.26 pins in the ci, release, and security workflows with go-version-file: go.mod, so every job builds and scans with the exact patched toolchain go.mod declares (1.26.4) instead of whatever patch the setup-go manifest happens to serve. Makes go.mod the single source of truth and keeps govulncheck from flagging stdlib advisories already fixed in a patch the floating manifest lagged behind. Matches the go-version-file pattern already used in bench-arm, init-smoke, and publish-claude-plugin.
ensureHugotModel now refuses to fetch a not-yet-cached model when GORTEX_EMBEDDING_OFFLINE is set, so NewLocalProvider falls through to the static provider instead of hitting the network. Gives air-gapped hosts and sandboxes a real knob and makes provider construction testable without a download. Default (unset) keeps the existing download-on-first-use behaviour. TestNewProviderFromConfig_EmptyVariantUsesDefault downloaded MiniLM on a cold cache, tripping an upstream data race in go-huggingface's parallel downloader under -race. It now points XDG_DATA_HOME at an empty temp dir and enables the offline guard, so it runs hermetically and exercises the static fallback it actually asserts.
TestNotebook_PrunesByTTL used a 1ms TTL, but pruneLocked computes its cutoff after the fresh entry's Updated stamp is written; on a loaded runner the save+prune exceeds 1ms and the just-saved entry sweeps itself (0 remain instead of 1). Bump the TTL to a minute, far below the stale entry's 1h age and far above any realistic save latency.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A focused quality pass on how Gortex retrieves and ranks code. It sharpens the rerank pipeline, routes
smart_contextthrough that pipeline for the first time, hardens query expansion and zero-result recovery, adds finer-grained context compression, recovers exact embedding cosine, and introduces a dependency-closure context tool plus provenance-aware graph centrality.29 commits, one self-contained feature each. No schema migrations — new behavior is additive and backward-compatible, with the higher-risk levers gated behind config/env flags.
What changed
Rerank pipeline (
internal/search/rerank)*.pb.go,mock_*.go,*_pb2.py,*.g.dart, …), but only when a real same-named hand-written implementation exists, so a generated file that is the sole definition is never demoted into oblivion.graph.ProvenanceWeightattenuates the abundant LSP tier (and the weak text-matched tier) relative to the unambiguousast_resolvedbaseline, applied in HITS, PageRank, and a new rerank signal. Uniform-weight-safe: a graph with no provenance stamped ranks identically to before.smart_contextcomposer (internal/mcp)smart_contextordering (previously it used a feedback-only re-sort over raw BM25).no covering tests foundwarning when a symbol is untested.Query expansion & recovery (
internal/mcp,internal/search)corpus:docsnow has its own fetch (not just a post-filter) and a prose weight profile that suppresses code-structural signals for documentation nodes.Context compression (
internal/elide,internal/embedding)full | compress | omitper glob pattern onread_file/get_editing_context(with real**support), generalizing the binarycompress_bodies.Embeddings (
internal/query,internal/embedding,internal/graph)Graph traversal & cache (
internal/query,internal/analysis,internal/graph)context_closuretool — given seed files/symbols, walks the transitive import/dependency closure and packs it under a token budget, ranked by graph distance or by seeded random-walk proximity.walk_graphto a detected community.Text search & tokenization (
internal/search,internal/mcp)search_textregexp mode — runs the trigram backbone through a compiled pattern; bad patterns return a tool error, results carry the enclosing symbol.Defaults & compatibility
query_classpin and keyword-soup keep their discrete handling.