17 commits
3677be3
fix(action): render repo-relative paths in PR comment
tirth8205 Jun 14, 2026
19a6f3a
docs: sharpen review positioning vs retrieval tools
tirth8205 Jun 14, 2026
baf8eb1
feat: code-review-graph doctor health checklist command
tirth8205 Jun 14, 2026
278e400
fix(graph): correct TESTED_BY edge direction in all consumer sites
tirth8205 Jun 14, 2026
a11dc04
test(graph): add end-to-end parser->store->get_transitive_tests guard
tirth8205 Jun 14, 2026
03e319e
fix(refactor): read TESTED_BY by source in dead-code detection
tirth8205 Jun 14, 2026
78d3de0
feat: graph-not-built guard for read tools
tirth8205 Jun 14, 2026
f15c6e0
feat: one-line installer that hides Python (uv-based)
tirth8205 Jun 14, 2026
94c6302
ci: publish median per-question token reduction in weekly eval
tirth8205 Jun 14, 2026
e5c8cd1
feat(tools): token-budget review output and bounded query results
tirth8205 Jun 14, 2026
01d88f3
feat(mcp): lean tool set + minimal-first defaults by default
tirth8205 Jun 14, 2026
56dfa9d
Merge branch 'moat/lean-tools-and-budget' into release/v2.4.0
tirth8205 Jun 14, 2026
a511741
Merge branch 'trust/doctor-and-benchmark-ci' into release/v2.4.0
tirth8205 Jun 14, 2026
3bc2d49
Merge branch 'adopt/frictionless-install' into release/v2.4.0
tirth8205 Jun 14, 2026
21701a4
Merge branch 'review/action-polish-and-positioning' into release/v2.4.0
tirth8205 Jun 14, 2026
06e2a73
Merge branch 'review/tested-by-keystone' into release/v2.4.0
tirth8205 Jun 14, 2026
8ac8cdc
release: v2.4.0 — token-moat improvements, doctor, easy install, TEST…
tirth8205 Jun 14, 2026
17 changes: 17 additions & 0 deletions .github/workflows/ci.yml
@@ -23,6 +23,23 @@ jobs:
       - name: Lint with ruff
         run: ruff check code_review_graph/
 
+  script-lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Validate install.sh (POSIX sh syntax)
+        run: sh -n install.sh
+      - name: Validate install.ps1 (PowerShell syntax)
+        shell: pwsh
+        run: |
+          $errors = $null
+          [void][System.Management.Automation.PSParser]::Tokenize(
+            (Get-Content -Raw ./install.ps1), [ref]$errors)
+          if ($errors.Count -gt 0) {
+            $errors | ForEach-Object { $_.Message }
+            exit 1
+          }
+
   type-check:
     runs-on: ubuntu-latest
     steps:
29 changes: 28 additions & 1 deletion .github/workflows/eval.yml
@@ -51,7 +51,11 @@ jobs:
         if: always()
         run: |
           python - <<'PY' >> "$GITHUB_STEP_SUMMARY"
-          from code_review_graph.eval.reporter import generate_full_report
+          from code_review_graph.eval.reporter import (
+              generate_full_report,
+              median_token_reduction,
+              median_token_reduction_table,
+          )
 
           print("# Weekly eval (report-only)")
           print()
@@ -61,5 +65,28 @@ jobs:
               "not fail CI."
           )
           print()
+
+          # Headline: median per-question token reduction from this run.
+          print("## Headline: median per-question token reduction")
+          print()
+          stats = median_token_reduction("evaluate/results")
+          if stats["median_percent"] is not None:
+              print(
+                  f"**{stats['median_percent']}%** median reduction over "
+                  f"{stats['n']} questions "
+                  f"(source: `{stats['source']}`). This is the median of "
+                  "per-question `(1 - graph_tokens / baseline_tokens) * 100`."
+              )
+          else:
+              print(
+                  "No usable rows this run (clone or measurement failures) — "
+                  "see the CSV artifact for details."
+              )
+          print()
+          print(median_token_reduction_table("evaluate/results"))
+          print()
+
+          print("## Full report")
+          print()
           print(generate_full_report("evaluate/results"))
           PY
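
The headline statistic printed by this step can be illustrated with a minimal sketch. It assumes each eval row reduces to a `(graph_tokens, baseline_tokens)` pair; `per_question_reduction` and `median_reduction` are hypothetical names used only for illustration, not the functions in `code_review_graph.eval.reporter` (whose real version reads a results directory and also reports a `source`).

```python
from statistics import median


def per_question_reduction(graph_tokens, baseline_tokens):
    """Percent of baseline tokens saved on one question."""
    return (1 - graph_tokens / baseline_tokens) * 100


def median_reduction(rows):
    """Median of per-question reductions over usable rows.

    rows: iterable of (graph_tokens, baseline_tokens) pairs (assumed shape).
    Rows with a non-positive baseline are treated as unusable and skipped.
    """
    usable = [
        per_question_reduction(graph, baseline)
        for graph, baseline in rows
        if baseline > 0 and graph >= 0
    ]
    if not usable:
        # Mirrors the "no usable rows" branch in the workflow summary.
        return {"median_percent": None, "n": 0}
    return {"median_percent": round(median(usable), 1), "n": len(usable)}


# Three usable questions and one failed measurement (baseline of 0).
rows = [(100, 1000), (50, 1000), (400, 1000), (0, 0)]
stats = median_reduction(rows)
empty = median_reduction([])
```

Taking the median of per-question percentages, rather than dividing total tokens by total baseline, keeps one very large repository from dominating the headline number.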
65 changes: 65 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,71 @@

## [Unreleased]

## [2.4.0] - 2026-06-14

**The token moat, sharpened.** This release makes the graph spend fewer of your
tokens by default, lets you *prove* the savings, and removes the Python-install
friction — without changing how anything works (still a local SQLite graph,
Tree-sitter parsed, served over MCP). Several defaults changed in your favor;
see "Behavior changes" below.

### Added

- **`code-review-graph doctor`** — a read-only health checklist (✓/✗) that
verifies the graph exists and is fresh, MCP config is present, the server
boots, hooks are installed, and embeddings status — then prints your latest
**Token Savings** number as proof it's working. Answers the most common
question ("is this actually running?") in one command.
- **One-line install that hides Python**: `install.sh` (macOS/Linux) and
`install.ps1` (Windows) bootstrap `uv` if missing, then install the CLI.
`curl -fsSL .../install.sh | sh`. `pip`/`pipx`/`uvx` remain alternatives.
- **`serve --tools all|lean|<csv>`** and **`serve --detail minimal|standard|verbose`**
(plus `CRG_TOOLS` / `CRG_DETAIL_LEVEL` env vars) to control the exposed tool
set and default verbosity per server.
- **Token budget for review output**: `get_review_context` and `detect_changes`
take a `max_tokens` (default 6000), keeping the highest-risk findings and
honestly reporting what was omitted — so a review can never blow the context
window.
- Weekly eval CI now surfaces the **median per-question token reduction** in its
job summary; README gains a "benchmarks: reproducible" badge.
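
A minimal sketch of the token-budget behavior described in the list above, under stated assumptions: findings are dicts with a numeric `risk` and a `text` field, and token cost is estimated at roughly four characters per token. `apply_token_budget` and these field names are hypothetical and are not the project's actual implementation.

```python
def apply_token_budget(findings, max_tokens=6000):
    """Keep the highest-risk findings that fit, and count what was dropped."""

    def estimate_tokens(text):
        # Rough heuristic (assumption): about 4 characters per token.
        return max(1, len(text) // 4)

    kept, omitted, used = [], 0, 0
    # Highest risk first, so the budget is spent on what matters most.
    for finding in sorted(findings, key=lambda f: f["risk"], reverse=True):
        cost = estimate_tokens(finding["text"])
        if used + cost <= max_tokens:
            kept.append(finding)
            used += cost
        else:
            omitted += 1
    # Reporting `omitted` is what lets the caller see the output was trimmed.
    return {"findings": kept, "tokens_used": used, "omitted": omitted}


findings = [
    {"risk": 0.9, "text": "a" * 40},
    {"risk": 0.1, "text": "b" * 40},
    {"risk": 0.5, "text": "c" * 40},
]
result = apply_token_budget(findings, max_tokens=20)
```

With a budget of 20 estimated tokens and 10 per finding, the two highest-risk findings are kept and one is reported as omitted.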

### Changed (defaults that now save more tokens)

- **Lean tool set by default**: `serve` exposes 7 curated tools
(`get_minimal_context`, `query_graph`, `semantic_search_nodes`,
`detect_changes`, `get_review_context`, `get_impact_radius`,
`get_affected_flows`) instead of all 30 — cutting ~8k tokens of tool
descriptions per session and reducing agent mis-picks. Restore everything with
`serve --tools all` or `CRG_TOOLS=all`.
- **Minimal detail by default** for `query_graph`, `semantic_search_nodes`,
`get_impact_radius`, and `detect_changes`. Pass `detail_level="standard"` (or
`CRG_DETAIL_LEVEL=standard`) for full payloads.
- `query_graph` gains a `max_results` cap (default 100) so `callers_of` on a hot
symbol can't return an unbounded payload.

### Fixed

- **TESTED_BY edge direction** (#515, integrates community PR #527 by
@Devilthelegend with an added producer-side regression test): `tests_for`,
`get_transitive_tests`, test-gap detection, flow criticality, symbol
enrichment, and dead-code detection now read test-coverage edges in the
canonical direction. Changed functions with tests are no longer reported as
untested, and the GitHub Action's "Tested" column is now correct. No DB
migration needed (read-side fix; stored edges were always canonical).
- **First-run guard**: read MCP tools on a never-built repo now return a clear
`not_built` status pointing at `build_or_update_graph_tool`, instead of
silently creating an empty `graph.db` and returning "0 results".
- GitHub Action PR comments now show **repo-relative paths** instead of absolute
CI-runner paths.
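
The first-run guard in the list above can be sketched as follows. This is an illustration only: `guard_graph_built` and the payload's `hint` field are hypothetical, and the only details taken from the changelog are the `not_built` status, the `graph.db` file, and the pointer to `build_or_update_graph_tool`.

```python
import tempfile
from pathlib import Path


def guard_graph_built(db_path):
    """Return None when the graph file exists, else a not_built payload.

    The check is read-only: it never creates the database as a side effect.
    """
    if Path(db_path).is_file():
        return None
    return {
        "status": "not_built",
        "hint": "Run build_or_update_graph_tool first.",
    }


with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "graph.db"
    first_run = guard_graph_built(db_file)   # repo never built
    created_by_guard = db_file.exists()      # guard must not create the file
    db_file.touch()                          # stand-in for a real build
    after_build = guard_graph_built(db_file)
```

Checking for the file before opening a connection matters here because opening a missing SQLite path normally creates an empty database, which is how a silent "0 results" answer can arise.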

### Behavior changes (read before upgrading)

- `serve` exposes 7 tools by default; opt back into all 30 with `--tools all` /
`CRG_TOOLS=all`.
- Four tools default to `minimal` detail; pass `detail_level="standard"` or set
`CRG_DETAIL_LEVEL=standard` for the old output.
- Read tools return `not_built` on a fresh repo instead of an empty result.

## [2.3.6] - 2026-06-10

**Community-response release.** Built from a full audit of every open PR,
5 changes: 4 additions & 1 deletion CLAUDE.md
@@ -23,7 +23,8 @@ When using code-review-graph MCP tools, follow these rules:
 - `incremental.py` — Git-based change detection, file watching
 - `embeddings.py` — Optional vector embeddings (local sentence-transformers, OpenAI-compatible endpoints, Google Gemini, MiniMax)
 - `visualization.py` — D3.js interactive HTML graph generator
-- `cli.py` — CLI entry point (install/init, build, update, postprocess, embed, watch, status, visualize, serve/mcp, wiki, detect-changes, register, unregister, repos, eval, daemon)
+- `cli.py` — CLI entry point (install/init, build, update, postprocess, embed, watch, status, doctor, visualize, serve/mcp, wiki, detect-changes, register, unregister, repos, eval, daemon)
+- `doctor.py` — Health checklist (`doctor` command): graph presence, freshness vs git HEAD, MCP config, server boot smoke, hooks, embeddings; surfaces the latest Token Savings number
 - `flows.py` — Execution flow detection and criticality scoring
 - `communities.py` — Community detection (Leiden algorithm or file-based grouping) and architecture overview
 - `search.py` — FTS5 hybrid search (keyword + vector)
@@ -55,6 +56,7 @@ uv run mypy code_review_graph/ --ignore-missing-imports --no-strict-optional
 uv run code-review-graph build # Full graph build
 uv run code-review-graph update # Incremental update
 uv run code-review-graph status # Show stats
+uv run code-review-graph doctor # Health checklist (verify the install)
 uv run code-review-graph serve # Start MCP server
 uv run code-review-graph wiki # Generate markdown wiki
 uv run code-review-graph detect-changes # Risk-scored change analysis
@@ -102,6 +104,7 @@ uv run code-review-graph eval # Run evaluation benchmarks
 - `tests/test_prompts.py` — MCP prompt template tests
 - `tests/test_wiki.py` — Wiki generation
 - `tests/test_context_savings.py` — Estimated context-savings metadata
+- `tests/test_doctor.py` — `doctor` health checklist (unbuilt flags, built healthy, exit codes)
 - `tests/test_skills.py` — Install/config generation and shipped skill metadata
 - `tests/test_registry.py` — Multi-repo registry
 - `tests/test_migrations.py` — Database migrations
64 changes: 54 additions & 10 deletions README.md
@@ -18,6 +18,7 @@
<a href="https://github.com/tirth8205/code-review-graph/stargazers"><img src="https://img.shields.io/github/stars/tirth8205/code-review-graph?style=flat-square" alt="Stars"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square" alt="MIT Licence"></a>
<a href="https://github.com/tirth8205/code-review-graph/actions/workflows/ci.yml"><img src="https://github.com/tirth8205/code-review-graph/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
<a href="https://github.com/tirth8205/code-review-graph/actions/workflows/eval.yml"><img src="https://img.shields.io/badge/benchmarks-reproducible-success?style=flat-square" alt="Benchmarks: reproducible"></a>
<a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10%2B-blue.svg?style=flat-square" alt="Python 3.10+"></a>
<a href="https://modelcontextprotocol.io/"><img src="https://img.shields.io/badge/MCP-compatible-green.svg?style=flat-square" alt="MCP"></a>
<a href="https://code-review-graph.com"><img src="https://img.shields.io/badge/website-code--review--graph.com-blue?style=flat-square" alt="Website"></a>
@@ -38,6 +39,8 @@

AI coding tools can end up re-reading large parts of your codebase on review tasks. `code-review-graph` fixes that. It builds a structural map of your code with [Tree-sitter](https://tree-sitter.github.io/tree-sitter/), tracks changes incrementally, and gives your AI assistant precise context via [MCP](https://modelcontextprotocol.io/) so it reads only what matters.

Where retrieval tools focus on navigation, CRG is review-native — risk-scored change analysis, impact radius, and a PR-review GitHub Action — token-efficient, local, and free, with no waitlist.

<p align="center">
<img src="diagrams/diagram1_before_vs_after.png" alt="The Token Problem: 38x to 528x token reduction across 6 real repositories" width="85%" />
</p>
@@ -46,10 +49,30 @@ AI coding tools can end up re-reading large parts of your codebase on review tas

## Quick Start

### Quick install (no Python setup)

macOS / Linux:

```bash
curl -LsSf https://raw.githubusercontent.com/tirth8205/code-review-graph/main/install.sh | sh
```

Windows (PowerShell):

```powershell
irm https://raw.githubusercontent.com/tirth8205/code-review-graph/main/install.ps1 | iex
```

This installs [uv](https://docs.astral.sh/uv/) (a single static binary that manages Python for you) if it is missing, then installs the `code-review-graph` CLI. It does **not** ship a bundled runtime — uv handles Python under the hood, so you don't have to. The script is idempotent; re-run it any time. Prefer to do it yourself? Use the alternatives below.

### Alternatives (already have Python / uv)

```bash
pip install code-review-graph # or: pipx install code-review-graph
uvx code-review-graph install # or run without installing, via uv
code-review-graph install # auto-detects and configures all supported platforms
code-review-graph build # parse your codebase
code-review-graph doctor # verify the install is healthy
```

One command sets up everything. `install` detects which AI coding tools you have, writes the correct MCP configuration for each one, installs platform-native hooks/skills where supported, and injects graph-aware instructions into your platform rules. It auto-detects whether you installed via `uvx` or `pip`/`pipx` and generates the right config. Restart your editor/tool after installing.
@@ -176,7 +199,7 @@ See [docs/GITHUB_ACTION.md](docs/GITHUB_ACTION.md) for inputs, risk levels, and
 
 **Headline number: the median per-question token reduction across the 6 repos is ~82x** (whole-corpus baseline vs graph query). The frequently quoted **528x is the maximum** — a single best-case repo (fastapi) — not the typical result.
 
-All numbers come from the automated evaluation runner against 6 real open-source repositories (13 commits total). Every config pins an upstream SHA, the Leiden community detector runs with a fixed seed, and embeddings are deterministic on CPU — so two runs on different machines produce identical numbers. The full reproduction recipe with expected outputs is in [`docs/REPRODUCING.md`](docs/REPRODUCING.md). A weekly report-only run on the two smallest configs lives in [`.github/workflows/eval.yml`](.github/workflows/eval.yml).
+All numbers come from the automated evaluation runner against 6 real open-source repositories (13 commits total). Every config pins an upstream SHA, the Leiden community detector runs with a fixed seed, and embeddings are deterministic on CPU — so two runs on different machines produce identical numbers. The full reproduction recipe with expected outputs is in [`docs/REPRODUCING.md`](docs/REPRODUCING.md). A weekly report-only run on the two smallest configs lives in [`.github/workflows/eval.yml`](.github/workflows/eval.yml); it publishes the **median per-question token reduction** for that run as a table in the job summary and uploads the raw CSVs as an artifact. The [`benchmarks: reproducible`](https://github.com/tirth8205/code-review-graph/actions/workflows/eval.yml) badge links to those runs — the badge asserts the pipeline is reproducible; the canonical numbers live here and in [`docs/REPRODUCING.md`](docs/REPRODUCING.md), never auto-committed from CI.
 
 <details>
 <summary><strong>Token efficiency: ~82x median per-question reduction (range 38x – 528x; whole-corpus vs graph query)</strong></summary>
@@ -543,22 +566,39 @@ The cloud-egress warning is auto-skipped when the base URL points to localhost
 > much narrower quality gap against smaller models at this input length.
 > Body/docstring embedding is tracked as a follow-up enhancement.
 
-#### Tool Filtering
+#### Tool Filtering (lean by default)
 
+CRG registers 30 MCP tools, but loading every description costs ~8k tokens
+per LLM turn before any work happens. To protect that budget the server ships
+a **curated lean set of 7 tools by default**:
+
+`get_minimal_context_tool`, `query_graph_tool`, `semantic_search_nodes_tool`,
+`detect_changes_tool`, `get_review_context_tool`, `get_impact_radius_tool`,
+`get_affected_flows_tool`.
+
+These cover every documented workflow (explore, review, impact, flows). When
+the server trims tools it prints a one-line notice to **stderr** so a reduced
+list is never silent.
+
-CRG exposes 30 MCP tools by default. In token-constrained environments, you can
-limit the server to a subset of tools using `--tools` or the `CRG_TOOLS`
-environment variable:
+Restore the full set, pick the curated set explicitly, or pass a custom list
+via `--tools` or the `CRG_TOOLS` environment variable:
 
 ```bash
-# Via CLI flag
-code-review-graph serve --tools query_graph_tool,semantic_search_nodes_tool,detect_changes_tool
+# All 30 tools
+code-review-graph serve --tools all
+CRG_TOOLS=all code-review-graph serve
+
+# The curated lean set (the default), spelled out
+code-review-graph serve --tools lean
 
-# Via environment variable
+# A custom subset
+code-review-graph serve --tools query_graph_tool,semantic_search_nodes_tool,detect_changes_tool
 CRG_TOOLS=query_graph_tool,semantic_search_nodes_tool code-review-graph serve
 ```
 
-The CLI flag takes precedence over the environment variable. When neither is set,
-all tools are available. This is especially useful for MCP client configurations:
+The CLI flag takes precedence over the environment variable, which takes
+precedence over the lean default. Unknown tool names are ignored gracefully.
+This is especially useful for MCP client configurations:
 
 ```json
 {
@@ -571,6 +611,10 @@ all tools are available. This is especially useful for MCP client configurations
}
```

You can also force a server-wide response verbosity with `--detail`
(`minimal`/`standard`/`verbose`) or the `CRG_DETAIL_LEVEL` env var; it
overrides each tool's per-call `detail_level`.
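
The precedence rules in this section (CLI flag, then `CRG_TOOLS`, then the lean default, with unknown names ignored) can be sketched as below. This is a hedged illustration, not the server's code: `resolve_tools`, `LEAN_TOOLS`, and `hypothetical_extra_tool` are names made up for the example, and only the seven tool names and the `all`/`lean`/CSV values come from the text above.

```python
import os

# The curated lean set listed in this section.
LEAN_TOOLS = (
    "get_minimal_context_tool",
    "query_graph_tool",
    "semantic_search_nodes_tool",
    "detect_changes_tool",
    "get_review_context_tool",
    "get_impact_radius_tool",
    "get_affected_flows_tool",
)


def resolve_tools(cli_value, registered, env=None):
    """Pick the exposed tools: CLI flag > CRG_TOOLS env var > lean default."""
    env = os.environ if env is None else env
    spec = cli_value or env.get("CRG_TOOLS") or "lean"
    if spec == "all":
        return list(registered)
    if spec == "lean":
        wanted = list(LEAN_TOOLS)
    else:
        # Custom comma-separated list; blank entries are dropped.
        wanted = [name.strip() for name in spec.split(",") if name.strip()]
    known = set(registered)
    # Unknown names are skipped rather than raising.
    return [name for name in wanted if name in known]


# `hypothetical_extra_tool` stands in for any tool outside the lean set.
registered = list(LEAN_TOOLS) + ["hypothetical_extra_tool"]
default_set = resolve_tools(None, registered, env={})
flag_wins = resolve_tools("all", registered, env={"CRG_TOOLS": "lean"})
custom = resolve_tools(None, registered, env={"CRG_TOOLS": "query_graph_tool,nope"})
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.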

</details>

---
2 changes: 1 addition & 1 deletion code_review_graph/__init__.py
@@ -8,7 +8,7 @@
     format_context_savings,
 )
 
-__version__ = "2.3.6"
+__version__ = "2.4.0"
 
 __all__ = [
     "__version__",
5 changes: 4 additions & 1 deletion code_review_graph/changes.py
@@ -352,7 +352,10 @@ def analyze_changes(
     for node in changed_funcs:
         if node.is_test:
             continue
-        tested = store.get_edges_by_target(node.qualified_name)
+        # TESTED_BY edges are stored as source=production, target=test by the
+        # parser, so a changed production function finds its tests by source.
+        # See: #515
+        tested = store.get_edges_by_source(node.qualified_name)
         if not any(e.kind == "TESTED_BY" for e in tested):
             test_gaps.append({
                 "name": _sanitize_name(node.name),
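
To make the direction fix in this hunk concrete, here is a toy reproduction. `Edge`, `ToyStore`, and `has_tests` are hypothetical stand-ins for the real graph store; the only facts taken from the diff are the two lookup method names and that `TESTED_BY` edges are stored with source=production and target=test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    kind: str
    source: str
    target: str


class ToyStore:
    """In-memory stand-in for the graph store (illustrative only)."""

    def __init__(self, edges):
        self._edges = list(edges)

    def get_edges_by_source(self, name):
        return [e for e in self._edges if e.source == name]

    def get_edges_by_target(self, name):
        return [e for e in self._edges if e.target == name]


def has_tests(store, qualified_name):
    """A production function is tested if it is the SOURCE of a TESTED_BY edge."""
    edges = store.get_edges_by_source(qualified_name)
    return any(e.kind == "TESTED_BY" for e in edges)


# Canonical direction: production function -> test.
store = ToyStore([Edge("TESTED_BY", "pkg.mod.parse", "tests.test_mod.test_parse")])

found_by_source = has_tests(store, "pkg.mod.parse")
# The pre-fix lookup by target misses the edge and reports a false test gap.
found_by_target = any(
    e.kind == "TESTED_BY" for e in store.get_edges_by_target("pkg.mod.parse")
)
```

Because the stored edges were already canonical, only the read side changes, which is consistent with the changelog's note that no database migration is needed.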