test(tools): all-tools integration round-trip canary (follow-up to #157) by cipher813 · Pull Request #158 · cipher813/mnemon

cipher813 · 2026-05-22T16:38:25Z

Summary

Adds tests/test_tools_integration.py — every-MCP-tool round-trip canary that catches the failure class of the 2026-05-22 memory_check_contradictions incident on Fly. This was intended to ship inside PR #157 (per Brian's "add 1 to pr157" directive) but #157 squash-merged before I could push the additional commit, so it lands here as a follow-up.

Brian's question that drove this: "given the difficulty of checking each individual mnemon tool available, are we properly using unit tests to confirm that everything works as expected without having to manually run each tool?"

Honest answer pre-PR: no. Unit tests in test_server.py mock every external dep (LLM, NLI, embedder, vecstore), which is right for asserting per-tool behavior in isolation but leaves a gap: a tool's real call path can raise an uncaught exception (or hang) that no mocked test exercises — the exact failure class that produced the opaque Error occurred during tool execution envelope from memory_check_contradictions.

What ships

Three test classes covering the gap:

test_every_tool_invokes_cleanly — iterates mcp._tool_manager._tools, calls each tool with minimal-valid inputs against a seeded vault, asserts no unhandled exception + return shape matches MCP contract (str / dict / list). Coverage-check assertions force every newly-registered tool to have a fixture input AND every removed tool to have its fixture cleaned up (registered-vs-fixture diff).
test_no_tool_returns_opaque_error_string — outputs must not contain Error occurred during tool execution or Internal server error verbatim. Opaque envelopes come from the MCP transport wrapping an escaped Python exception — never acceptable as a tool's own output.
test_destructive_tools_respect_dry_run — locks the dry_run contract on memory_check_contradictions + memory_sweep. Mutating inputs stubbed to "would-decay" labels via mocked NLI; pre-state == post-state.

Heavy paths (NLI classify, FastEmbed re-embed) are stubbed so the suite stays under ~17s. Real-model paths validated separately by scripts/calibrate_capture_threshold.py and the operator Layer-3 web test ritual (which gets --exercise-all-tools mode in PR #159, per Brian's follow-up plan).

Composability

Catches the failure class that bit memory_check_contradictions on 2026-05-22 (the LLM-required path would have raised through the broad try/except, and "no unhandled exception" would have fired on PR feat(salience): Phase 1 — first-class standing tier (default-off, soak-gated) #154 when the salience-tier tools were added — forcing the fix before merge).
Composes with feedback_no_silent_fails: every tool must catch failures at its boundary and return a clean error string, never let an exception escape.
Doesn't yet exercise the [server]-only install matrix (that's PR ci(server-extras): [server]-only matrix + layer3 --exercise-all-tools #159). It catches any tool that raises through its wrapper regardless of extras — the universal subset of the gap.

Test plan

3 new tests passing locally
Full suite 847 → 850 passing
Real-model smoke test (NLI inference) preserved separately

Two follow-ups already filed in this session per Brian's plan:

PR ci(server-extras): [server]-only matrix + layer3 --exercise-all-tools #159 — [server]-extras-only CI matrix + extended Layer-3 with --exercise-all-tools
PR test(coverage): enforce ≥80% coverage + README badge — 86% baseline #160 — push test coverage to ≥80% + add README coverage badge

After all three follow-ups merge, soak substrate is ready per Brian's "known bugs fixed before soak" standard.

🤖 Generated with Claude Code

Closes the unit-test-coverage gap that let memory_check_contradictions ship to Fly with a hidden bug. Pre-existing test_server.py mocks every external dep (LLM, NLI, embedder, vecstore), which is right for isolated-contract testing but leaves a gap: a tool's real call path can raise an uncaught exception that no mocked test exercises. tests/test_tools_integration.py iterates the entire registered MCP tool manager and invokes each tool against a real seeded vault with minimal-valid inputs. Three test classes: 1. test_every_tool_invokes_cleanly — no unhandled exception, return shape matches MCP contract (str/dict/list). Coverage-check assertions force every newly-registered tool to have a fixture input AND every removed tool to have its fixture cleaned up. 2. test_no_tool_returns_opaque_error_string — outputs must not contain "Error occurred during tool execution" or "Internal server error" verbatim. Opaque envelopes come from the MCP transport wrapping an escaped Python exception — never acceptable as a tool's own output. 3. test_destructive_tools_respect_dry_run — locks the dry_run contract on memory_check_contradictions + memory_sweep. Mutating inputs are stubbed to "would-decay" labels via mocked NLI; the test asserts pre-state == post-state. Heavy paths (NLI classify, FastEmbed re-embed) are stubbed so the suite stays under ~17s; real-model paths are validated by scripts/calibrate_capture_threshold.py and the operator Layer-3 web test ritual (extended in follow-up PR #158). Catches the failure class that bit memory_check_contradictions on 2026-05-22: the LLM-required path would have raised under [server] extras even with the broad try/except, and this canary's assertion "no unhandled exception" would have fired on PR #154 when the salience-tier tools were added — forcing the fix before merge. Note: this PR doesn't yet exercise the [server]-only install matrix (that's PR #158). It DOES catch any tool that raises through its wrapper regardless of extras — the universal subset of the gap. Composes with feedback_no_silent_fails: every tool must catch failures at its boundary and return a clean error string, never let an exception escape. Suite 847 → 850 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…#159) Follow-up to PR #158 — closes the [server]-extras gap that the local integration canary can't see + extends the operator Layer-3 web test ritual to probe every MCP tool against the live Fly app. Two artifacts: 1. .github/workflows/ci-server-extras.yml — installs mnemon-memory[server] ONLY (the Fly Docker install) + pytest as a separate test runner. Runs the full suite under that minimal install. Includes a guard that asserts llama-cpp-python is NOT installed under [server] — so future PRs can't accidentally drag the LLM dep into the production path. This is the workflow that would have caught memory_check_contradictions's LLM hard-dependency on PR #154 when the salience-tier tools were first added; ci.yml passed because [dev] installs everything. 2. scripts/promote_stable.sh layer3 --exercise-all-tools — opt-in flag that, after the test Fly app is up but before downgrade, iterates every registered MCP tool against the remote and asserts each returns cleanly. Catches Fly-specific breakage (missing baked models, Anthropic MCP proxy timeouts, transport regressions) that the local Python-level canary in tests/test_tools_integration.py can't see. Tool list resolved dynamically from mcp._tool_manager._tools, so tools added in future PRs are exercised automatically — no per- release maintenance burden. Per-tool inputs mirror the integration- test fixture; destructive tools (memory_forget, memory_rebuild) skipped; mutating tools constrained to dry_run / round-trip. scripts/_layer3_remote_helper.py gains an exercise-all-tools subcommand wired through the FastMCP tool manager. Two regression- lock tests added to tests/test_promote_stable.sh harness (13 → 15 passing) covering helper dispatch + flag plumbing through the bash dispatcher (cmd_layer3 "$@" forwarding + EXERCISE_ALL_TOOLS=1 set). Full Python suite still 850 passing. Driver: Brian's 2026-05-22 ask after the memory_check_contradictions incident — "given the difficulty of checking each individual mnemon tool available, are we properly using unit tests to confirm that everything works as expected?" PR #158 addressed the Python-level canary; this PR addresses the deployment-environment + Fly-level canary. Together they form the test trio for catching the 2026-05-22 failure class on the next PR rather than on the next operator MCP call. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…160) Closes Brian's "coverage to ≥80% + add a badge" leg of the test-trio plan (PRs #158/#159/#160). Suite at 86.43% with the new omits + 4 added nli.py error-path tests. pyproject.toml gains [tool.coverage.run] + [tool.coverage.report] config with fail_under=80. ci.yml runs `pytest --cov` so a PR that drops coverage below the gate fails the build. Excluded modules are documented inline and all under-testable-by-design: - dashboard/* Streamlit UI; tested by running the app - __main__.py 4-line entry-point shim - upgrade.py Real Fly+AWS interactions; tested by Layer-3 - downgrade.py Same - llm.py Deprecated optional-LLM path; the deployed product is LLM-free by design (2026-05-22). NLI replaced the only production use of this module in mnemon.contradiction. README badge: shields.io static `coverage-86%-brightgreen`. Matches existing static-badge pattern (Status, Python, License, MCP). Manual update on each release; no SaaS / codecov dep. 4 new nli.py tests (suite 850 → 855): - _ensure_loaded raises NLIUnavailableError on HF download failure - _ensure_loaded raises on unexpected label set (model with different output classes fails fast, not silently mis-classifies) - prewarm() swallows unavailability per acceptable-secondary- observability category - classify_pair softmax + input-building exercised with stubbed session (lines 164-189 of nli.py) Composes with the test-trio: PR #158 — Python-level integration canary (every tool round-trip) PR #159 — [server]-extras CI matrix + layer3 --exercise-all-tools PR #160 — this PR: coverage gate + badge Together they catch the 2026-05-22 memory_check_contradictions failure class at PR review time (canary + extras matrix) and ensure overall coverage doesn't regress as the project grows. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rolls the test-trio into a soak-substrate release: - #158 every-MCP-tool integration canary - #159 [server]-extras CI matrix + layer3 --exercise-all-tools - #160 coverage gate at 80% + README badge (86% baseline) Composes with the prior 0.7.0rc2 features (NLI-based contradiction detection, dry_run flag). After PyPI publish + Fly redeploy, this is the version operators should run the Phase 1 standing-tier soak on per Brian's "all known bugs fixed before soak" standard. Suite 855 passing. mnemon --version returns 0.7.0rc3. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit e32882b into main May 22, 2026
9 checks passed

cipher813 deleted the test/tools-integration-canary branch May 22, 2026 17:10

cipher813 mentioned this pull request May 22, 2026

ci(server-extras): [server]-only matrix + layer3 --exercise-all-tools #159

Merged

5 tasks

cipher813 mentioned this pull request May 22, 2026

test(coverage): enforce ≥80% coverage + README badge — 86% baseline #160

Merged

4 tasks

cipher813 mentioned this pull request May 22, 2026

chore(version): bump 0.7.0rc2 → 0.7.0rc3 #161

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(tools): all-tools integration round-trip canary (follow-up to #157)#158

test(tools): all-tools integration round-trip canary (follow-up to #157)#158
cipher813 merged 1 commit into
mainfrom
test/tools-integration-canary

cipher813 commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented May 22, 2026

Summary

What ships

Composability

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant