feat: Optimize MCP and AGENTS.md instructions, add snippet_lines #198
Adds a `snippet_lines` parameter to both MCP tools and the CLI.
- Default (10): function/class signature + first lines of body
- 0: file path and line range only
- None: full chunk (~30-50 lines, previous behaviour)
Benchmarking on Django SWE-bench tasks showed 10 lines is the sweet spot: agents make one semble call, get enough context to navigate directly to the fix, and cost less overall than agents using grep.
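The three modes can be illustrated with a small helper. This is a minimal sketch, not semble's implementation: `truncate_snippet` is a hypothetical name, and returning `None` for the location-only mode is an assumption made for the example.

```python
from typing import Optional


def truncate_snippet(content: str, snippet_lines: Optional[int] = 10) -> Optional[str]:
    """Return the part of a chunk to show for a given snippet_lines value.

    Hypothetical helper for illustration only.
    """
    if snippet_lines is None:
        # Full chunk, i.e. the previous behaviour.
        return content
    if snippet_lines <= 0:
        # Location only: the caller reports just the file path and line range.
        return None
    # Keep the first N lines, typically the signature and the start of the body.
    return "\n".join(content.splitlines()[:snippet_lines])


# A 30-line stand-in for a chunk.
chunk = "\n".join(f"line {i}" for i in range(1, 31))
default_view = truncate_snippet(chunk)        # first 10 lines
location_only = truncate_snippet(chunk, 0)    # no content
full_view = truncate_snippet(chunk, None)     # whole chunk
```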
Validated by SWE-bench agent experiments (6 Django tasks, gpt-5.4-mini):
- chunk_size 1500→750: improves top-1 retrieval hit rate (4/6 vs 3/6). Smaller chunks separate related files more cleanly; django-13315 switches from wrong (related.py) to correct (forms/models.py) at top-1.
- top_k 5→3, snippet_lines 10→5: reduces tokens per semble call by ~60% with no retrieval loss (gold files are at rank 1 when found).
- Prompt / instructions: replace "use grep for exhaustive literal matches" (too permissive) with "navigate directly to the returned line; do not grep for the same content". Applied to MCP server instructions, claude.md subagent, and AGENTS.md/CLAUDE.md installer snippet.
Combined effect on 5/6 benchmark tasks: WITH semble is now cheaper than WITHOUT semble (-44% on django-13315, -29% on django-11999, -24% on django-14534). Overhead on easy tasks dropped from 2-4x to <5%.
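As a back-of-the-envelope check on the token figure, the upper bound on snippet lines per call is top_k × snippet_lines. The sketch below only does that arithmetic; it counts lines, not tokens, and `max_snippet_lines` is a hypothetical helper. The measured ~60% being smaller than the line-count reduction would be consistent with fixed per-result overhead (paths, line ranges, scores), but that explanation is an assumption, not something the benchmark reports.

```python
def max_snippet_lines(top_k: int, snippet_lines: int) -> int:
    """Upper bound on snippet lines returned by a single search call."""
    return top_k * snippet_lines


before = max_snippet_lines(top_k=5, snippet_lines=10)
after = max_snippet_lines(top_k=3, snippet_lines=5)

# Integer percentage reduction in the line-count upper bound.
reduction_pct = (before - after) * 100 // before
print(f"{before} -> {after} lines ({reduction_pct}% fewer)")
```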
- Apply updated step 3 ("navigate directly, don't re-search or grep") and
step 5 ("grep only for exhaustive literal matches across whole repo") to
all 10 agent files: claude, cursor, gemini, kiro, opencode, copilot,
commandcode, pi, antigravity, reasonix.
- Add --snippet-lines 5 token-efficiency tip to all 9 non-claude agent
files (claude.md already had it from a previous commit).
- Improve MCP search tool docstring: "use function/class names or behavior
descriptions, not error messages" for the query.
- Improve MCP find_related docstring: clarify its use for discovering
all implementations, callers, or tests for a given location.
- Annotate chunk_size=750 with the SWE-bench validation result and a
concrete TODO for making it configurable with cache key invalidation.
- Fix "before using Grep, Glob, or Read" → "instead of using Grep or Glob to discover files". The old phrasing wrongly implied you need semble before reading a file whose path you already know.
- Rewrite workflow steps to lead with the action-oriented description: step 1 explains the query format and default parameters (top_k=3, snippet_lines=5 context lines); steps 2-3 focus on the navigate-and-edit flow; steps 4-5 cover advanced usage and the grep exception.
Update the manual AGENTS.md / CLAUDE.md snippet in docs/installation.md to match the improved agent files:
- Add --snippet-lines 5 usage example for token-efficient searches
- Step 3: "Inspect full files" → "Navigate directly to the returned file and line — do not re-search or grep for the same content"
- Step 5: "quick confirmation" → "every occurrence across the whole repo" with a concrete example (all callers of a renamed function)
Indexes built with the old chunk_size (1500 chars) were silently reused after the change to 750 chars, returning stale retrieval results.
- Save _DESIRED_CHUNK_LENGTH_CHARS in index metadata at build time.
- Validate it in _metadata_matches: a None (old/missing field) or a different value triggers a transparent rebuild on next search.
- Add two regression tests: chunk_size mismatch → None, and missing chunk_size field (old format) → None.
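The invalidation rule can be sketched as follows. This is an illustration under assumptions, not the project's code: the JSON metadata file, the `chunk_size` key, and the function names `save_metadata`, `metadata_matches`, and `load_cached_metadata` are all hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical stand-in for _DESIRED_CHUNK_LENGTH_CHARS.
DESIRED_CHUNK_LENGTH_CHARS = 750


def save_metadata(path: Path, **fields) -> None:
    """Write index metadata, recording the chunk size used at build time."""
    path.write_text(json.dumps({**fields, "chunk_size": DESIRED_CHUNK_LENGTH_CHARS}))


def metadata_matches(metadata: dict) -> bool:
    """True only if the index was built with the current chunk size."""
    # .get() yields None for old-format metadata that lacks the field,
    # so a missing value is treated the same as a mismatch.
    return metadata.get("chunk_size") == DESIRED_CHUNK_LENGTH_CHARS


def load_cached_metadata(path: Path) -> Optional[dict]:
    """Return the metadata if the cache is reusable, else None (caller rebuilds)."""
    if not path.exists():
        return None
    metadata = json.loads(path.read_text())
    return metadata if metadata_matches(metadata) else None


with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "metadata.json"
    save_metadata(meta_path, model="example")
    fresh = load_cached_metadata(meta_path)    # built with the current size
    meta_path.write_text(json.dumps({"chunk_size": 1500}))
    stale = load_cached_metadata(meta_path)    # built with the old size
    meta_path.write_text(json.dumps({"model": "example"}))
    legacy = load_cached_metadata(meta_path)   # field missing (old format)
```

Treating a missing field as a mismatch means old indexes are rebuilt once rather than trusted by default.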
Replace Django-specific id_for_label/BoundWidget example with a generic validate-email example across all agent instructions and docs. Revert top_k default from 3 back to 5 — snippet_lines=5 already handles token efficiency; dropping top_k risked missing relevant results at ranks 4-5 with negligible savings. Remove stale "3 results by default" claim from workflow step wording.
Collapse format_results + format_results_snippet into a single
format_results(query, results, snippet_lines=None). All three modes now
return the same flat structure {file_path, start_line, end_line, score,
content?} — previously snippet_lines=None fell through to the old
format_results which returned a nested {"chunk": {...}} schema,
incompatible with the flat schema returned by snippet_lines=5.
Also renames the per-result field from "snippet" to "content" for
consistency with Chunk.content.
Update test to cover all three modes parametrically.
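A rough sketch of what a single flat formatter might look like, assuming simplified `Chunk` and `SearchResult` types and a `{"query", "results"}` wrapper. Only the per-result field names come from the description above; the types, the wrapper, and the choice to leave `end_line` unchanged when content is truncated are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Chunk:
    """Simplified stand-in for the real chunk type."""
    file_path: str
    start_line: int
    end_line: int
    content: str


@dataclass
class SearchResult:
    """Simplified stand-in pairing a chunk with its score."""
    chunk: Chunk
    score: float


def format_results(
    query: str, results: list[SearchResult], snippet_lines: Optional[int] = None
) -> dict[str, Any]:
    """Return the same flat per-result structure in all three modes."""
    formatted = []
    for result in results:
        chunk = result.chunk
        entry: dict[str, Any] = {
            "file_path": chunk.file_path,
            "start_line": chunk.start_line,
            "end_line": chunk.end_line,
            "score": result.score,
        }
        if snippet_lines is None:
            entry["content"] = chunk.content  # full chunk
        elif snippet_lines > 0:
            entry["content"] = "\n".join(chunk.content.splitlines()[:snippet_lines])
        # snippet_lines == 0: location only, so "content" is omitted.
        formatted.append(entry)
    return {"query": query, "results": formatted}


example = [SearchResult(Chunk("app/models.py", 10, 13, "def save():\n    a\n    b\n    c"), 0.9)]
full = format_results("save", example)
short = format_results("save", example, snippet_lines=2)
bare = format_results("save", example, snippet_lines=0)
```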
The initial benchmark (feat: add snippet_lines) explicitly validated 10 as the sweet spot — enough to show the function/class signature and first body lines. The reduction to 5 landed in the same commit as chunk_size 1500→750 and top_k 5→3 with no isolated validation. Empirical sampling across 5 codebases (pytest, astropy, requests, flask, django, ~75 results) shows that with chunk_size=750, ~56% of chunks start mid-function regardless of snippet length, ~19% would benefit from lines 6-10 being visible, and only ~25% show a function name in the first 5 lines. Raising to 10 recovers most of that 19% at negligible token cost (~250 chars per search call).
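One way a tally like this could be computed is sketched below. It is not the script used for the sampling: the regex only recognises Python `def`/`class` lines, and the three bucket names are hypothetical labels for "name visible in the first 5 lines", "name visible only in lines 6-10", and "no definition in the first 10 lines".

```python
import re
from collections import Counter

# Matches a Python function or class definition at the start of a line.
DEFINITION = re.compile(r"^\s*(?:async\s+def|def|class)\s+\w+")


def classify_chunk(content: str) -> str:
    """Bucket a chunk by where its first definition line appears."""
    for index, line in enumerate(content.splitlines()[:10]):
        if DEFINITION.match(line):
            return "first_5" if index < 5 else "lines_6_10"
    return "no_definition_in_10"


samples = [
    "def validate_email(value):\n    return '@' in value",
    "\n".join(["x = 1"] * 6 + ["class Form:", "    pass"]),
    "    return cleaned\n    # tail of an earlier function",
]
tally = Counter(classify_chunk(sample) for sample in samples)
```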
Codecov Report: ✅ All modified and coverable lines are covered by tests.
... and 1 file with indirect coverage changes
Confidence Score: 4/5
Safe to merge: cache invalidation is correctly wired end-to-end, and the token-saving defaults are easily overridden by agents that need more context. The chunk-size change and cache-invalidation logic are well-implemented and tested. The only issues are a stale TODO comment (the "include in cache key" half is now done) and a slightly overstated line-count estimate in the MCP tool description ("~15-25 lines" for a 750-char chunk is high; ~10-20 lines is more accurate). Neither affects runtime behavior.
Reviews (1): Last reviewed commit: "Update instructions"
stephantul left a comment
Nice!
First off. The em dash comment is not real (it can't hurt you) (I just don't like em dashes)
I think both other comments might require rerunning experiments, so feel free to ignore them; they are mostly about how a human, rather than an agent, might read or use semble.
  1. Start with `semble search` to find relevant chunks. The index is built and cached automatically.
  2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
- 3. Inspect full files only when the returned chunk does not give enough context.
+ 3. Navigate directly to the returned file and line — do not re-search or grep for the same content.
  semble search "authentication flow" ./my-project
  semble search "save_pretrained" ./my-project
  semble search "save model to disk" ./my-project --top-k 10
+ semble search "authentication flow" ./my-project --snippet-lines 10 # signatures only, fast
perhaps make this more literal? e.g.,
# first 10 lines only, fast
Also "fast" is debatable, it's not necessarily faster. Maybe "concise" actually fits better?
  search_p.add_argument("query", help="Natural language or code query.")
  search_p.add_argument("path", nargs="?", default=".", help="Local path or git URL (default: current directory).")
  search_p.add_argument("-k", "--top-k", type=int, default=5, help="Number of results (default: 5).")
+ search_p.add_argument(
The name of this argument, snippet_lines, is not very descriptive. Without knowing what it does, it's actually kind of surprising that:
- this is a numeric argument
- it truncates a chunk to a specific number of lines
So something like n_snippet_lines is maybe clearer. Or lines_per_chunk.
Nevertheless I'm ok with keeping it, I know you did all the tests with this argument probably.
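For context on the naming question, here is a sketch of how such an option could be declared so that the help text and metavar carry the meaning the name leaves out. The diff above is truncated at `search_p.add_argument(`, so the `type`, `default`, `metavar`, and help wording below are assumptions, and how the CLI selects the full-chunk (`None`) mode is not shown in the source and is not modelled here.

```python
import argparse

parser = argparse.ArgumentParser(prog="semble")
subparsers = parser.add_subparsers(dest="command")
search_p = subparsers.add_parser("search")
search_p.add_argument("query", help="Natural language or code query.")
search_p.add_argument("path", nargs="?", default=".", help="Local path or git URL (default: current directory).")
search_p.add_argument("-k", "--top-k", type=int, default=5, help="Number of results (default: 5).")
# Assumed declaration: the real argument list is elided in the diff.
search_p.add_argument(
    "--snippet-lines",
    type=int,
    default=10,
    metavar="N",
    help="Show only the first N lines of each chunk; 0 shows file path and line range only (default: 10).",
)

args = parser.parse_args(["search", "authentication flow", "./my-project", "--snippet-lines", "5"])
defaults = parser.parse_args(["search", "save_pretrained"])
```

argparse maps `--snippet-lines` to the attribute `snippet_lines`, so the existing keyword name can stay as it is while the help output says that the value is a line count.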
This PR adds several changes and optimizations:
I tested these changes on SWE-bench Lite across 5 models (gpt-5.4, gpt-5.4-mini, deepseek-v4-pro, claude-sonnet-4.6, claude-haiku-4.5) and 3 harnesses (Claude Code, Codex, OpenCode). The biggest gains are for the MCP integration, which consistently gave a 50 to 70% cost reduction with a flat or improved gold hit rate. For the CLI integrations (AGENTS.md/sub-agent) the savings were more modest (5 to 15%).