ci(server-extras): [server]-only matrix + layer3 --exercise-all-tools #159
Merged
Follow-up to PR #158. Closes the [server]-extras gap that the local integration canary can't see, and extends the operator Layer-3 web test ritual to probe every MCP tool against the live Fly app.

Two artifacts:

1. .github/workflows/ci-server-extras.yml: installs mnemon-memory[server] ONLY (the Fly Docker install) plus pytest as a separate test runner, and runs the full suite under that minimal install. Includes a guard that asserts llama-cpp-python is NOT installed under [server], so future PRs can't accidentally drag the LLM dep into the production path. This is the workflow that would have caught memory_check_contradictions's LLM hard-dependency on PR #154 when the salience-tier tools were first added; ci.yml passed because [dev] installs everything.

2. scripts/promote_stable.sh layer3 --exercise-all-tools: opt-in flag that, after the test Fly app is up but before downgrade, iterates every registered MCP tool against the remote and asserts each returns cleanly. Catches Fly-specific breakage (missing baked models, Anthropic MCP proxy timeouts, transport regressions) that the local Python-level canary in tests/test_tools_integration.py can't see. Tool list resolved dynamically from mcp._tool_manager._tools, so tools added in future PRs are exercised automatically, with no per-release maintenance burden. Per-tool inputs mirror the integration-test fixture; destructive tools (memory_forget, memory_rebuild) skipped; mutating tools constrained to dry_run / round-trip.

scripts/_layer3_remote_helper.py gains an exercise-all-tools subcommand wired through the FastMCP tool manager. Two regression-lock tests added to tests/test_promote_stable.sh harness (13 → 15 passing) covering helper dispatch + flag plumbing through the bash dispatcher (cmd_layer3 "$@" forwarding + EXERCISE_ALL_TOOLS=1 set). Full Python suite still 850 passing.
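The llama-cpp-python guard from artifact 1 could be expressed as a small Python check. This is a minimal sketch, not the workflow's actual step: the function names are hypothetical, and it assumes the guard only needs to know whether the `llama_cpp` import name (the module shipped by llama-cpp-python) is resolvable in the current environment.

```python
import importlib.util

# Import names that must stay out of a [server]-only install.
# "llama_cpp" is the module provided by the llama-cpp-python distribution.
# The tuple itself is a hypothetical illustration, not taken from the PR.
FORBIDDEN_MODULES = ("llama_cpp",)


def find_forbidden_modules(names=FORBIDDEN_MODULES):
    """Return the subset of `names` that can be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is not None]


def assert_server_only_install(names=FORBIDDEN_MODULES):
    """Raise with a clear message if a forbidden dependency is installed."""
    leaked = find_forbidden_modules(names)
    if leaked:
        raise AssertionError(
            "[server] install must not include: " + ", ".join(leaked)
        )
```

Run as a CI step or a pytest test, a check of this shape fails the build as soon as a dependency change makes the LLM package importable under the minimal install.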
Driver: Brian's 2026-05-22 ask after the memory_check_contradictions incident: "given the difficulty of checking each individual mnemon tool available, are we properly using unit tests to confirm that everything works as expected?" PR #158 addressed the Python-level canary; this PR addresses the deployment-environment + Fly-level canary. Together they form the test trio for catching the 2026-05-22 failure class on the next PR rather than on the next operator MCP call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
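The Fly-level probe in this PR amounts to a loop over the registered tools. The sketch below illustrates that loop under stated assumptions: `tools` stands in for the mapping the PR reads from `mcp._tool_manager._tools`, while `inputs`, `call`, and the result layout are hypothetical and not taken from `scripts/_layer3_remote_helper.py`.

```python
# Destructive tools are skipped, as described in the PR.
DESTRUCTIVE = frozenset({"memory_forget", "memory_rebuild"})


def exercise_all_tools(tools, inputs, call):
    """Invoke every registered tool once and record the outcome.

    tools:  mapping of tool name -> tool object (stand-in for the tool registry)
    inputs: mapping of tool name -> kwargs (hypothetical per-tool fixture;
            mutating tools would carry dry_run=True here)
    call:   callable(name, kwargs) that performs the remote invocation
    """
    results = {"ok": [], "skipped": [], "failed": {}}
    for name in sorted(tools):
        if name in DESTRUCTIVE:
            results["skipped"].append(name)
            continue
        try:
            call(name, inputs.get(name, {}))
        except Exception as exc:  # any failure is recorded, not raised
            results["failed"][name] = repr(exc)
        else:
            results["ok"].append(name)
    return results
```

Because the tool names come from the registry rather than a hard-coded list, a tool added in a later PR is picked up without touching the probe.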
cipher813 added a commit that referenced this pull request on May 22, 2026
…160)

Closes Brian's "coverage to ≥80% + add a badge" leg of the test-trio plan (PRs #158/#159/#160). Suite at 86.43% with the new omits + 4 added nli.py error-path tests.

pyproject.toml gains [tool.coverage.run] + [tool.coverage.report] config with fail_under=80. ci.yml runs `pytest --cov` so a PR that drops coverage below the gate fails the build.

Excluded modules are documented inline and all under-testable-by-design:

- dashboard/*: Streamlit UI; tested by running the app
- __main__.py: 4-line entry-point shim
- upgrade.py: Real Fly+AWS interactions; tested by Layer-3
- downgrade.py: Same
- llm.py: Deprecated optional-LLM path; the deployed product is LLM-free by design (2026-05-22). NLI replaced the only production use of this module in mnemon.contradiction.

README badge: shields.io static `coverage-86%-brightgreen`. Matches existing static-badge pattern (Status, Python, License, MCP). Manual update on each release; no SaaS / codecov dep.

4 new nli.py tests (suite 850 → 855):

- _ensure_loaded raises NLIUnavailableError on HF download failure
- _ensure_loaded raises on unexpected label set (model with different output classes fails fast, not silently mis-classifies)
- prewarm() swallows unavailability per acceptable-secondary-observability category
- classify_pair softmax + input-building exercised with stubbed session (lines 164-189 of nli.py)

Composes with the test-trio:

- PR #158: Python-level integration canary (every tool round-trip)
- PR #159: [server]-extras CI matrix + layer3 --exercise-all-tools
- PR #160: this PR: coverage gate + badge

Together they catch the 2026-05-22 memory_check_contradictions failure class at PR review time (canary + extras matrix) and ensure overall coverage doesn't regress as the project grows.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
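Two of the behaviours these nli.py tests lock in, failing fast on an unexpected label set and turning logits into probabilities with a softmax, can be illustrated in isolation. This is a sketch only: the expected label names, `validate_labels`, and `softmax` are assumptions made for illustration, and the local `NLIUnavailableError` is a stand-in for the exception named above rather than mnemon's own class.

```python
import math

# Assumed label set for a three-way NLI model; the real nli.py may differ.
EXPECTED_LABELS = frozenset({"entailment", "neutral", "contradiction"})


class NLIUnavailableError(RuntimeError):
    """Stand-in for the exception named in the commit message."""


def validate_labels(id2label):
    """Fail fast when the model's output classes are not the expected ones."""
    labels = {label.lower() for label in id2label.values()}
    if labels != EXPECTED_LABELS:
        raise NLIUnavailableError(f"unexpected label set: {sorted(labels)}")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Rejecting a mismatched label set at load time means a swapped or mis-exported model surfaces as an explicit error instead of producing plausible but wrong classifications.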
cipher813 added a commit that referenced this pull request on May 22, 2026
Rolls the test-trio into a soak-substrate release:

- #158 every-MCP-tool integration canary
- #159 [server]-extras CI matrix + layer3 --exercise-all-tools
- #160 coverage gate at 80% + README badge (86% baseline)

Composes with the prior 0.7.0rc2 features (NLI-based contradiction detection, dry_run flag). After PyPI publish + Fly redeploy, this is the version operators should run the Phase 1 standing-tier soak on per Brian's "all known bugs fixed before soak" standard.

Suite 855 passing. mnemon --version returns 0.7.0rc3.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two artifacts addressing Brian's 2026-05-22 ask ("are we properly using unit tests to confirm that everything works as expected without having to manually run each tool?") at the deployment-environment + Fly-level layers. Composes with PR #158 (Python-level integration canary) to form the test trio that catches the 2026-05-22 `memory_check_contradictions` failure class on the next PR rather than on the next operator MCP call.

Artifact A: `.github/workflows/ci-server-extras.yml`

New CI workflow that installs `mnemon-memory[server]` ONLY (the production-equivalent install used by the Fly Docker image) plus pytest as a separate test runner, and runs the full suite under that minimal install.

Why it matters: the existing `ci.yml` installs `[dev]` extras, which transitively pull in everything mnemon optionally supports. The Fly image installs only `[server]`. A failure that only surfaces under `[server]`-extras-only conditions slips through `ci.yml`, which is exactly what happened with `memory_check_contradictions`'s LLM hard-dependency.

Guards baked in:

- Asserts `llama-cpp-python` is NOT installed under `[server]`: future PRs that accidentally drag the LLM dep into the production path fail this workflow with a clear error
- Runs the full suite under `[server]`-only: catches any test that imports something only in `[llm]`/`[ui]`

Artifact B: `scripts/promote_stable.sh layer3 --exercise-all-tools`

New opt-in flag that, after the test Fly app is up but before the downgrade step, iterates every registered MCP tool against the remote and asserts each returns cleanly (no opaque error envelope, no unhandled exception, no NLI/embedder/baked-model breakage).

Catches Fly-specific failures the local canary can't see:

- Missing baked models
- Anthropic MCP proxy timeouts
- Transport regressions

Operational shape:

- Tool list resolved dynamically from `mcp._tool_manager._tools`: tools added in future PRs are exercised automatically, no per-release maintenance
- Destructive tools (`memory_forget`, `memory_rebuild`) skipped
- Mutating tools constrained to `dry_run=True` or round-trip (promote + demote pair)

`scripts/_layer3_remote_helper.py` gains the `exercise-all-tools` subcommand wired through the FastMCP tool manager. The bash dispatcher in `promote_stable.sh` is refactored to forward args via `cmd_layer3 "$@"`.

Tests

Two regression-lock tests added to `tests/test_promote_stable.sh`:

- `test_helper_exposes_exercise_all_tools_subcommand`: locks the helper dispatcher
- `test_layer3_passes_through_exercise_all_tools_flag`: locks flag plumbing

Test plan

- `tests/test_promote_stable.sh` harness passes
- Run `scripts/promote_stable.sh layer3 --exercise-all-tools` against the next RC bump to validate the Fly-level probe

What's next
🤖 Generated with Claude Code