tirth8205 · Jun 14, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -23,6 +23,23 @@ jobs:
       - name: Lint with ruff
         run: ruff check code_review_graph/
 
+  script-lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Validate install.sh (POSIX sh syntax)
+        run: sh -n install.sh
+      - name: Validate install.ps1 (PowerShell syntax)
+        shell: pwsh
+        run: |
+          $errors = $null
+          [void][System.Management.Automation.PSParser]::Tokenize(
+            (Get-Content -Raw ./install.ps1), [ref]$errors)
+          if ($errors.Count -gt 0) {
+            $errors | ForEach-Object { $_.Message }
+            exit 1
+          }
+
   type-check:
     runs-on: ubuntu-latest
     steps:

diff --git a/.github/workflows/eval.yml b/.github/workflows/eval.yml
@@ -51,7 +51,11 @@ jobs:
         if: always()
         run: |
           python - <<'PY' >> "$GITHUB_STEP_SUMMARY"
-          from code_review_graph.eval.reporter import generate_full_report
+          from code_review_graph.eval.reporter import (
+              generate_full_report,
+              median_token_reduction,
+              median_token_reduction_table,
+          )
 
           print("# Weekly eval (report-only)")
           print()
@@ -61,5 +65,28 @@ jobs:
               "not fail CI."
           )
           print()
+
+          # Headline: median per-question token reduction from this run.
+          print("## Headline: median per-question token reduction")
+          print()
+          stats = median_token_reduction("evaluate/results")
+          if stats["median_percent"] is not None:
+              print(
+                  f"**{stats['median_percent']}%** median reduction over "
+                  f"{stats['n']} questions "
+                  f"(source: `{stats['source']}`). This is the median of "
+                  "per-question `(1 - graph_tokens / baseline_tokens) * 100`."
+              )
+          else:
+              print(
+                  "No usable rows this run (clone or measurement failures) — "
+                  "see the CSV artifact for details."
+              )
+          print()
+          print(median_token_reduction_table("evaluate/results"))
+          print()
+
+          print("## Full report")
+          print()
           print(generate_full_report("evaluate/results"))
           PY
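Note on the headline metric: the summary text above defines it as the median of per-question `(1 - graph_tokens / baseline_tokens) * 100`. A minimal sketch of that arithmetic follows, assuming each row is a `(baseline_tokens, graph_tokens)` pair. The helper names `reduction_percent` and `median_reduction` are hypothetical and are not the project's `median_token_reduction`, which reads a results directory and also reports a `source` field.

```python
from statistics import median


def reduction_percent(baseline_tokens: int, graph_tokens: int) -> float:
    """Per-question reduction: (1 - graph_tokens / baseline_tokens) * 100."""
    return (1 - graph_tokens / baseline_tokens) * 100


def median_reduction(rows):
    """Median of per-question reductions over usable rows.

    Rows with a non-positive baseline are treated as unusable (an assumption
    made for this sketch; the real runner may filter differently).
    """
    usable = [reduction_percent(b, g) for b, g in rows if b > 0]
    if not usable:
        # Mirrors the "No usable rows this run" branch in the workflow.
        return {"median_percent": None, "n": 0}
    return {"median_percent": round(median(usable), 1), "n": len(usable)}


if __name__ == "__main__":
    sample = [(1000, 100), (2000, 500), (500, 50)]
    print(median_reduction(sample))
```

Taking the median rather than the mean keeps a single best-case question from dominating the headline number.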
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,71 @@
 
 ## [Unreleased]
 
+## [2.4.0] - 2026-06-14
+
+**The token moat, sharpened.** This release makes the graph spend fewer of your
+tokens by default, lets you *prove* the savings, and removes the Python-install
+friction — without changing how anything works (still a local SQLite graph,
+Tree-sitter parsed, served over MCP). Several defaults changed in your favor;
+see "Behavior changes" below.
+
+### Added
+
+- **`code-review-graph doctor`** — a read-only health checklist (✓/✗) that
+  verifies that the graph exists and is fresh, MCP config is present, the server
+  boots, and hooks are installed, reports embeddings status, then prints your latest
+  **Token Savings** number as proof it's working. Answers the most common
+  question ("is this actually running?") in one command.
+- **One-line install that hides Python**: `install.sh` (macOS/Linux) and
+  `install.ps1` (Windows) bootstrap `uv` if missing, then install the CLI.
+  Run `curl -fsSL .../install.sh | sh`; `pip`/`pipx`/`uvx` remain alternatives.
+- **`serve --tools all|lean|<csv>`** and **`serve --detail minimal|standard|verbose`**
+  (plus `CRG_TOOLS` / `CRG_DETAIL_LEVEL` env vars) to control the exposed tool
+  set and default verbosity per server.
+- **Token budget for review output**: `get_review_context` and `detect_changes`
+  take a `max_tokens` (default 6000), keeping the highest-risk findings and
+  honestly reporting what was omitted — so a review can never blow the context
+  window.
+- Weekly eval CI now surfaces the **median per-question token reduction** in its
+  job summary; README gains a "benchmarks: reproducible" badge.
+
+### Changed (defaults that now save more tokens)
+
+- **Lean tool set by default**: `serve` exposes 7 curated tools
+  (`get_minimal_context`, `query_graph`, `semantic_search_nodes`,
+  `detect_changes`, `get_review_context`, `get_impact_radius`,
+  `get_affected_flows`) instead of all 30 — cutting ~8k tokens of tool
+  descriptions per session and reducing agent mis-picks. Restore everything with
+  `serve --tools all` or `CRG_TOOLS=all`.
+- **Minimal detail by default** for `query_graph`, `semantic_search_nodes`,
+  `get_impact_radius`, and `detect_changes`. Pass `detail_level="standard"` (or
+  `CRG_DETAIL_LEVEL=standard`) for full payloads.
+- `query_graph` gains a `max_results` cap (default 100) so `callers_of` on a hot
+  symbol can't return an unbounded payload.
+
+### Fixed
+
+- **TESTED_BY edge direction** (#515, integrates community PR #527 by
+  @Devilthelegend with an added producer-side regression test): `tests_for`,
+  `get_transitive_tests`, test-gap detection, flow criticality, symbol
+  enrichment, and dead-code detection now read test-coverage edges in the
+  canonical direction. Changed functions with tests are no longer reported as
+  untested, and the GitHub Action's "Tested" column is now correct. No DB
+  migration needed (read-side fix; stored edges were always canonical).
+- **First-run guard**: read MCP tools on a never-built repo now return a clear
+  `not_built` status pointing at `build_or_update_graph_tool`, instead of
+  silently creating an empty `graph.db` and returning "0 results".
+- GitHub Action PR comments now show **repo-relative paths** instead of absolute
+  CI-runner paths.
+
+### Behavior changes (read before upgrading)
+
+- `serve` exposes 7 tools by default; opt back into all 30 with `--tools all` /
+  `CRG_TOOLS=all`.
+- Four tools default to `minimal` detail; pass `detail_level="standard"` or set
+  `CRG_DETAIL_LEVEL=standard` for the old output.
+- Read tools return `not_built` on a fresh repo instead of an empty result.
+
 ## [2.3.6] - 2026-06-10
 
 **Community-response release.** Built from a full audit of every open PR,
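
Note on the TESTED_BY fix in the 2.4.0 entry: the changelog says stored edges were always canonical and only the read side was wrong. A minimal sketch of that distinction follows, assuming edges are stored with source = production symbol and target = test, as the `changes.py` comment in this PR states. `Edge`, `EdgeStore`, and `has_tests` are hypothetical stand-ins for illustration, not the project's classes; only the method names `get_edges_by_source` and `get_edges_by_target` come from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    kind: str
    source: str
    target: str


class EdgeStore:
    """Tiny in-memory stand-in for the graph store (hypothetical)."""

    def __init__(self, edges):
        self._edges = list(edges)

    def get_edges_by_source(self, qualified_name):
        return [e for e in self._edges if e.source == qualified_name]

    def get_edges_by_target(self, qualified_name):
        return [e for e in self._edges if e.target == qualified_name]


def has_tests(store, qualified_name):
    """A production symbol is tested if it is the SOURCE of a TESTED_BY edge."""
    edges = store.get_edges_by_source(qualified_name)
    return any(e.kind == "TESTED_BY" for e in edges)


if __name__ == "__main__":
    store = EdgeStore([
        Edge("TESTED_BY", "app.pay", "tests.test_pay"),
        Edge("CALLS", "app.checkout", "app.pay"),
    ])
    print(has_tests(store, "app.pay"))       # found via source lookup
    print(has_tests(store, "app.checkout"))  # no TESTED_BY edge from this symbol
```

Looking up `app.pay` by target instead returns only the `CALLS` edge, which is how a tested function ends up reported as a test gap.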

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -23,7 +23,8 @@ When using code-review-graph MCP tools, follow these rules:
   - `incremental.py` — Git-based change detection, file watching
   - `embeddings.py` — Optional vector embeddings (local sentence-transformers, OpenAI-compatible endpoints, Google Gemini, MiniMax)
   - `visualization.py` — D3.js interactive HTML graph generator
-  - `cli.py` — CLI entry point (install/init, build, update, postprocess, embed, watch, status, visualize, serve/mcp, wiki, detect-changes, register, unregister, repos, eval, daemon)
+  - `cli.py` — CLI entry point (install/init, build, update, postprocess, embed, watch, status, doctor, visualize, serve/mcp, wiki, detect-changes, register, unregister, repos, eval, daemon)
+  - `doctor.py` — Health checklist (`doctor` command): graph presence, freshness vs git HEAD, MCP config, server boot smoke, hooks, embeddings; surfaces the latest Token Savings number
   - `flows.py` — Execution flow detection and criticality scoring
   - `communities.py` — Community detection (Leiden algorithm or file-based grouping) and architecture overview
   - `search.py` — FTS5 hybrid search (keyword + vector)
@@ -55,6 +56,7 @@ uv run mypy code_review_graph/ --ignore-missing-imports --no-strict-optional
 uv run code-review-graph build              # Full graph build
 uv run code-review-graph update             # Incremental update
 uv run code-review-graph status             # Show stats
+uv run code-review-graph doctor             # Health checklist (verify the install)
 uv run code-review-graph serve              # Start MCP server
 uv run code-review-graph wiki               # Generate markdown wiki
 uv run code-review-graph detect-changes     # Risk-scored change analysis
@@ -102,6 +104,7 @@ uv run code-review-graph eval               # Run evaluation benchmarks
 - `tests/test_prompts.py` — MCP prompt template tests
 - `tests/test_wiki.py` — Wiki generation
 - `tests/test_context_savings.py` — Estimated context-savings metadata
+- `tests/test_doctor.py` — `doctor` health checklist (unbuilt flags, built healthy, exit codes)
 - `tests/test_skills.py` — Install/config generation and shipped skill metadata
 - `tests/test_registry.py` — Multi-repo registry
 - `tests/test_migrations.py` — Database migrations

diff --git a/README.md b/README.md
@@ -18,6 +18,7 @@
   <a href="https://github.com/tirth8205/code-review-graph/stargazers"><img src="https://img.shields.io/github/stars/tirth8205/code-review-graph?style=flat-square" alt="Stars"></a>
   <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square" alt="MIT Licence"></a>
   <a href="https://github.com/tirth8205/code-review-graph/actions/workflows/ci.yml"><img src="https://github.com/tirth8205/code-review-graph/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
+  <a href="https://github.com/tirth8205/code-review-graph/actions/workflows/eval.yml"><img src="https://img.shields.io/badge/benchmarks-reproducible-success?style=flat-square" alt="Benchmarks: reproducible"></a>
   <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10%2B-blue.svg?style=flat-square" alt="Python 3.10+"></a>
   <a href="https://modelcontextprotocol.io/"><img src="https://img.shields.io/badge/MCP-compatible-green.svg?style=flat-square" alt="MCP"></a>
   <a href="https://code-review-graph.com"><img src="https://img.shields.io/badge/website-code--review--graph.com-blue?style=flat-square" alt="Website"></a>
@@ -38,6 +39,8 @@
 
 AI coding tools can end up re-reading large parts of your codebase on review tasks. `code-review-graph` fixes that. It builds a structural map of your code with [Tree-sitter](https://tree-sitter.github.io/tree-sitter/), tracks changes incrementally, and gives your AI assistant precise context via [MCP](https://modelcontextprotocol.io/) so it reads only what matters.
 
+Where retrieval tools focus on navigation, CRG is review-native — risk-scored change analysis, impact radius, and a PR-review GitHub Action — token-efficient, local, and free, with no waitlist.
+
 <p align="center">
   <img src="diagrams/diagram1_before_vs_after.png" alt="The Token Problem: 38x to 528x token reduction across 6 real repositories" width="85%" />
 </p>
@@ -46,10 +49,30 @@ AI coding tools can end up re-reading large parts of your codebase on review tas
 
 ## Quick Start
 
+### Quick install (no Python setup)
+
+macOS / Linux:
+
+```bash
+curl -LsSf https://raw.githubusercontent.com/tirth8205/code-review-graph/main/install.sh | sh
+```
+
+Windows (PowerShell):
+
+```powershell
+irm https://raw.githubusercontent.com/tirth8205/code-review-graph/main/install.ps1 | iex
+```
+
+This installs [uv](https://docs.astral.sh/uv/) (a single static binary that manages Python for you) if it is missing, then installs the `code-review-graph` CLI. It does **not** ship a bundled runtime — uv handles Python under the hood, so you don't have to. The script is idempotent; re-run it any time. Prefer to do it yourself? Use the alternatives below.
+
+### Alternatives (already have Python / uv)
+
 ```bash
 pip install code-review-graph                     # or: pipx install code-review-graph
+uvx code-review-graph install                     # or run without installing, via uv
 code-review-graph install          # auto-detects and configures all supported platforms
 code-review-graph build            # parse your codebase
+code-review-graph doctor           # verify the install is healthy
 ```
 
 One command sets up everything. `install` detects which AI coding tools you have, writes the correct MCP configuration for each one, installs platform-native hooks/skills where supported, and injects graph-aware instructions into your platform rules. It auto-detects whether you installed via `uvx` or `pip`/`pipx` and generates the right config. Restart your editor/tool after installing.
@@ -176,7 +199,7 @@ See [docs/GITHUB_ACTION.md](docs/GITHUB_ACTION.md) for inputs, risk levels, and
 
 **Headline number: the median per-question token reduction across the 6 repos is ~82x** (whole-corpus baseline vs graph query). The frequently quoted **528x is the maximum** — a single best-case repo (fastapi) — not the typical result.
 
-All numbers come from the automated evaluation runner against 6 real open-source repositories (13 commits total). Every config pins an upstream SHA, the Leiden community detector runs with a fixed seed, and embeddings are deterministic on CPU — so two runs on different machines produce identical numbers. The full reproduction recipe with expected outputs is in [`docs/REPRODUCING.md`](docs/REPRODUCING.md). A weekly report-only run on the two smallest configs lives in [`.github/workflows/eval.yml`](.github/workflows/eval.yml).
+All numbers come from the automated evaluation runner against 6 real open-source repositories (13 commits total). Every config pins an upstream SHA, the Leiden community detector runs with a fixed seed, and embeddings are deterministic on CPU — so two runs on different machines produce identical numbers. The full reproduction recipe with expected outputs is in [`docs/REPRODUCING.md`](docs/REPRODUCING.md). A weekly report-only run on the two smallest configs lives in [`.github/workflows/eval.yml`](.github/workflows/eval.yml); it publishes the **median per-question token reduction** for that run as a table in the job summary and uploads the raw CSVs as an artifact. The [`benchmarks: reproducible`](https://github.com/tirth8205/code-review-graph/actions/workflows/eval.yml) badge links to those runs — the badge asserts the pipeline is reproducible; the canonical numbers live here and in [`docs/REPRODUCING.md`](docs/REPRODUCING.md), never auto-committed from CI.
 
 <details>
 <summary><strong>Token efficiency: ~82x median per-question reduction (range 38x – 528x; whole-corpus vs graph query)</strong></summary>
@@ -543,22 +566,39 @@ The cloud-egress warning is auto-skipped when the base URL points to localhost
 > much narrower quality gap against smaller models at this input length.
 > Body/docstring embedding is tracked as a follow-up enhancement.
 
-#### Tool Filtering
+#### Tool Filtering (lean by default)
+
+CRG registers 30 MCP tools, but loading every description costs ~8k tokens
+per LLM turn before any work happens. To protect that budget the server ships
+a **curated lean set of 7 tools by default**:
+
+`get_minimal_context_tool`, `query_graph_tool`, `semantic_search_nodes_tool`,
+`detect_changes_tool`, `get_review_context_tool`, `get_impact_radius_tool`,
+`get_affected_flows_tool`.
+
+These cover every documented workflow (explore, review, impact, flows). When
+the server trims tools it prints a one-line notice to **stderr** so a reduced
+list is never silent.
 
-CRG exposes 30 MCP tools by default. In token-constrained environments, you can
-limit the server to a subset of tools using `--tools` or the `CRG_TOOLS`
-environment variable:
+Restore the full set, pick the curated set explicitly, or pass a custom list
+via `--tools` or the `CRG_TOOLS` environment variable:
 
 ```bash
-# Via CLI flag
-code-review-graph serve --tools query_graph_tool,semantic_search_nodes_tool,detect_changes_tool
+# All 30 tools
+code-review-graph serve --tools all
+CRG_TOOLS=all code-review-graph serve
+
+# The curated lean set (the default), spelled out
+code-review-graph serve --tools lean
 
-# Via environment variable
+# A custom subset
+code-review-graph serve --tools query_graph_tool,semantic_search_nodes_tool,detect_changes_tool
 CRG_TOOLS=query_graph_tool,semantic_search_nodes_tool code-review-graph serve
 ```
 
-The CLI flag takes precedence over the environment variable. When neither is set,
-all tools are available. This is especially useful for MCP client configurations:
+The CLI flag takes precedence over the environment variable, which takes
+precedence over the lean default. Unknown tool names are ignored gracefully.
+This is especially useful for MCP client configurations:
 
 ```json
 {
@@ -571,6 +611,10 @@ all tools are available. This is especially useful for MCP client configurations
 }
 ```
 
+You can also force a server-wide response verbosity with `--detail`
+(`minimal`/`standard`/`verbose`) or the `CRG_DETAIL_LEVEL` env var; it
+overrides each tool's per-call `detail_level`.
+
 </details>
 
 ---
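
Note on tool selection: the README text above states the precedence as CLI flag, then `CRG_TOOLS`, then the lean default, with unknown tool names ignored. A minimal sketch of that resolution order follows. `resolve_tools` is a hypothetical helper, not the server's implementation, and `extra_tool` in the usage lines is a made-up name; the lean tool names are copied from the README hunk.

```python
LEAN_TOOLS = (
    "get_minimal_context_tool",
    "query_graph_tool",
    "semantic_search_nodes_tool",
    "detect_changes_tool",
    "get_review_context_tool",
    "get_impact_radius_tool",
    "get_affected_flows_tool",
)


def resolve_tools(cli_value, env_value, all_tools):
    """Pick the exposed tool list: CLI flag > env var > lean default."""
    raw = cli_value or env_value or "lean"
    if raw == "all":
        return list(all_tools)
    if raw == "lean":
        return [t for t in LEAN_TOOLS if t in all_tools]
    # Custom CSV: keep the requested order, drop blanks and unknown names.
    requested = [name.strip() for name in raw.split(",") if name.strip()]
    return [name for name in requested if name in all_tools]


if __name__ == "__main__":
    registered = list(LEAN_TOOLS) + ["extra_tool"]  # hypothetical registry
    print(resolve_tools(None, None, registered))            # lean default
    print(resolve_tools("all", "lean", registered))         # CLI beats env
    print(resolve_tools(None, "query_graph_tool,nope", registered))
```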

diff --git a/code_review_graph/__init__.py b/code_review_graph/__init__.py
@@ -8,7 +8,7 @@
     format_context_savings,
 )
 
-__version__ = "2.3.6"
+__version__ = "2.4.0"
 
 __all__ = [
     "__version__",

diff --git a/code_review_graph/changes.py b/code_review_graph/changes.py
@@ -352,7 +352,10 @@ def analyze_changes(
     for node in changed_funcs:
         if node.is_test:
             continue
-        tested = store.get_edges_by_target(node.qualified_name)
+        # TESTED_BY edges are stored as source=production, target=test by the
+        # parser, so a changed production function finds its tests by source.
+        # See: #515
+        tested = store.get_edges_by_source(node.qualified_name)
         if not any(e.kind == "TESTED_BY" for e in tested):
             test_gaps.append({
                 "name": _sanitize_name(node.name),