ci(server-extras): [server]-only matrix + layer3 --exercise-all-tools by cipher813 · Pull Request #159 · cipher813/mnemon

cipher813 · 2026-05-22T17:17:21Z

Summary

Two artifacts addressing Brian's 2026-05-22 ask — "are we properly using unit tests to confirm that everything works as expected without having to manually run each tool?" — at the deployment-environment + Fly-level layers. Composes with PR #158 (Python-level integration canary) to form the test trio that catches the 2026-05-22 memory_check_contradictions failure class on the next PR rather than on the next operator MCP call.

Artifact A — `.github/workflows/ci-server-extras.yml`

New CI workflow that installs mnemon-memory[server] ONLY (the production-equivalent install used by the Fly Docker image) plus pytest as a separate test runner, and runs the full suite under that minimal install.

Why it matters: the existing ci.yml installs [dev] extras, which transitively pull in everything mnemon optionally supports. The Fly image installs only [server]. A failure that only surfaces under [server]-extras-only conditions slips through ci.yml — that's exactly what happened with memory_check_contradictions's LLM hard-dependency.

Guards baked in:

- Asserts llama-cpp-python is NOT installed under [server]: future PRs that accidentally drag the LLM dep into the production path fail this workflow with a clear error
- Runs the full suite (including the integration canary from PR #158) under [server]-only, which catches any test that imports something available only in [llm] / [ui]
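
If written in Python, the llama-cpp-python guard could look roughly like this. It is a minimal sketch, not the workflow's actual step: `is_installed`, `find_leaked`, and `main` are hypothetical names, and only the distribution name comes from the PR text.

```python
from importlib import metadata
from typing import Callable, Iterable, List

# Distributions that must stay out of the [server] install (name taken from the PR text).
FORBIDDEN_UNDER_SERVER = ("llama-cpp-python",)


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this name is installed in the current environment."""
    try:
        metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


def find_leaked(
    forbidden: Iterable[str] = FORBIDDEN_UNDER_SERVER,
    probe: Callable[[str], bool] = is_installed,
) -> List[str]:
    """Return the forbidden distributions that are present, in input order."""
    return [name for name in forbidden if probe(name)]


def main() -> int:
    """Hypothetical entry point: exit code 1 (with a clear message) if an LLM dep leaked in."""
    leaked = find_leaked()
    if leaked:
        print(f"ERROR: [server] install pulled in LLM deps: {', '.join(leaked)}")
        return 1
    print("OK: no LLM deps under [server]")
    return 0
```

A CI step could call `main()` and use its return value as the exit code. The `probe` parameter exists only so the check can be unit-tested without installing or uninstalling anything.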

Artifact B — `scripts/promote_stable.sh layer3 --exercise-all-tools`

New opt-in flag that, after the test Fly app is up but before the downgrade step, iterates every registered MCP tool against the remote and asserts each returns cleanly (no opaque error envelope, no unhandled exception, no NLI/embedder/baked-model breakage).

Catches Fly-specific failures the local canary can't see:

- Missing baked models in the Docker image (e.g., if a future Dockerfile edit drops the FastEmbed or NLI bake)
- Anthropic MCP proxy timeouts (the actual symptom from the 2026-05-22 incident)
- Transport / auth regressions specific to OAuth AS + bearer token paths
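
The sketch below shows one possible meaning of "returns cleanly" for a single tool call. The `probe_tool` name, the outcome labels, and the 30-second default budget are assumptions for illustration. The dict-with-`isError` envelope is modeled loosely on the `isError` field of MCP tool results. The real helper may classify results differently.

```python
import time
from typing import Any, Callable


def probe_tool(call: Callable[[], Any], budget_s: float = 30.0) -> str:
    """Run one tool call and classify the outcome.

    Returns one of "ok", "exception", "error_envelope", or "slow".
    `budget_s` is a hypothetical latency budget, not a value from the PR.
    """
    start = time.monotonic()
    try:
        result = call()
    except Exception:
        # Unhandled exception surfaced by the tool or the transport.
        return "exception"
    elapsed = time.monotonic() - start

    # Assumed envelope shape: a dict carrying a truthy "isError" flag.
    if isinstance(result, dict) and result.get("isError"):
        return "error_envelope"
    # The call completed but took longer than the budget (e.g. a proxy would have timed out).
    if elapsed > budget_s:
        return "slow"
    return "ok"
```

This version only measures elapsed time after the call returns; it does not interrupt a hung call. A probe that must bound total runtime would need a client-side timeout as well.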

Operational shape:

- Tool list resolved dynamically from mcp._tool_manager._tools, so tools added in future PRs are exercised automatically with no per-release maintenance
- Per-tool inputs mirror the PR #158 integration-test fixture (single source of truth for canary inputs)
- Destructive tools (memory_forget, memory_rebuild) skipped
- Mutating tools constrained to dry_run=True or round-trip (promote + demote pair)
- Adds ~30-60s to the layer3 run when the flag is on; releases that do not touch NLI can skip it
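
The skip and dry-run rules above can be sketched as a small planning step. This is an illustration under stated assumptions, not the helper's code: `plan_calls` is a hypothetical name, the tool registry is reduced to a plain list of names, and `hypothetical_write_tool` stands in for whichever tools accept `dry_run`. The two destructive tool names are the only ones taken from the PR text.

```python
from typing import Any, Dict, Iterable, List, Mapping, Set, Tuple

# Destructive tools named in the PR description; these are never called.
DESTRUCTIVE = frozenset({"memory_forget", "memory_rebuild"})


def plan_calls(
    tool_names: Iterable[str],
    canary_inputs: Mapping[str, Mapping[str, Any]],
    dry_run_tools: Set[str],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Build the (tool, arguments) list to run against the remote.

    - Names are sorted so runs are reproducible.
    - Destructive tools are dropped.
    - Tools listed in `dry_run_tools` get dry_run=True forced into their arguments.
    - Tools without a canary entry are called with no arguments.
    """
    plan: List[Tuple[str, Dict[str, Any]]] = []
    for name in sorted(tool_names):
        if name in DESTRUCTIVE:
            continue
        args = dict(canary_inputs.get(name, {}))
        if name in dry_run_tools:
            args["dry_run"] = True
        plan.append((name, args))
    return plan


# Usage with placeholder data; in the real helper the names would come from the tool registry.
example_plan = plan_calls(
    tool_names=["memory_forget", "memory_check_contradictions", "hypothetical_write_tool"],
    canary_inputs={"memory_check_contradictions": {"text": "sample"}},
    dry_run_tools={"hypothetical_write_tool"},
)
```

Because the plan is derived from whatever names are registered, a newly added tool shows up in the run without any edit to the probe, which is the property the PR relies on.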

scripts/_layer3_remote_helper.py gains the exercise-all-tools subcommand wired through the FastMCP tool manager. The bash dispatcher in promote_stable.sh is refactored to forward args via cmd_layer3 "$@".
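
A subcommand dispatcher of the kind described could be wired with `argparse` roughly as follows. This is a sketch only: the `HANDLERS` table, the placeholder handler body, and the `main` signature are assumptions, and the real helper's other subcommands and options are not shown because the PR text does not list them.

```python
import argparse
from typing import Callable, Dict, List, Optional


def exercise_all_tools(args: argparse.Namespace) -> int:
    """Placeholder handler; the real one would call each tool on the remote app."""
    return 0


# Maps subcommand name to handler. Only the subcommand named in the PR is registered here.
HANDLERS: Dict[str, Callable[[argparse.Namespace], int]] = {
    "exercise-all-tools": exercise_all_tools,
}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that requires exactly one known subcommand."""
    parser = argparse.ArgumentParser(prog="_layer3_remote_helper.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in HANDLERS:
        subparsers.add_parser(name)
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse argv and run the matching handler, returning its exit code."""
    args = build_parser().parse_args(argv)
    return HANDLERS[args.command](args)
```

With this shape, a test that "locks the helper dispatcher" only needs to check that the subcommand name parses; an unknown name makes `argparse` exit with status 2.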

Tests

- 2 new bash harness tests in tests/test_promote_stable.sh:
  - test_helper_exposes_exercise_all_tools_subcommand: locks the helper dispatcher
  - test_layer3_passes_through_exercise_all_tools_flag: locks flag plumbing
- Harness 13 → 15 passing
- Python suite unchanged at 850 passing
- Helper smoke-tested: imports clean, dispatcher recognizes new subcommand

Test plan

- tests/test_promote_stable.sh harness passes
- Python suite 850 still passing
- Helper imports + dispatcher smoke-test clean
- Post-merge: CI workflow run validates against real GitHub Actions
- Post-merge: operator can run scripts/promote_stable.sh layer3 --exercise-all-tools against the next RC bump to validate the Fly-level probe

What's next

- PR #160 (test(coverage): enforce ≥80% coverage + README badge, 86% baseline): push test coverage to ≥80% and add a README coverage badge (per Brian's plan)
- Then bump to 0.7.0rc3 (rolling all three test additions into the soak substrate), republish to PyPI, redeploy to Fly, and begin the Phase 1 standing-tier soak

🤖 Generated with Claude Code

Follow-up to PR #158. Closes the [server]-extras gap that the local integration canary can't see, and extends the operator Layer-3 web test ritual to probe every MCP tool against the live Fly app.

Two artifacts:

1. .github/workflows/ci-server-extras.yml: installs mnemon-memory[server] ONLY (the Fly Docker install) plus pytest as a separate test runner. Runs the full suite under that minimal install. Includes a guard that asserts llama-cpp-python is NOT installed under [server], so future PRs can't accidentally drag the LLM dep into the production path. This is the workflow that would have caught memory_check_contradictions's LLM hard-dependency on PR #154 when the salience-tier tools were first added; ci.yml passed because [dev] installs everything.

2. scripts/promote_stable.sh layer3 --exercise-all-tools: opt-in flag that, after the test Fly app is up but before downgrade, iterates every registered MCP tool against the remote and asserts each returns cleanly. Catches Fly-specific breakage (missing baked models, Anthropic MCP proxy timeouts, transport regressions) that the local Python-level canary in tests/test_tools_integration.py can't see. Tool list resolved dynamically from mcp._tool_manager._tools, so tools added in future PRs are exercised automatically with no per-release maintenance burden. Per-tool inputs mirror the integration-test fixture; destructive tools (memory_forget, memory_rebuild) skipped; mutating tools constrained to dry_run / round-trip.

scripts/_layer3_remote_helper.py gains an exercise-all-tools subcommand wired through the FastMCP tool manager. Two regression-lock tests added to tests/test_promote_stable.sh harness (13 → 15 passing) covering helper dispatch + flag plumbing through the bash dispatcher (cmd_layer3 "$@" forwarding + EXERCISE_ALL_TOOLS=1 set). Full Python suite still 850 passing.

Driver: Brian's 2026-05-22 ask after the memory_check_contradictions incident: "given the difficulty of checking each individual mnemon tool available, are we properly using unit tests to confirm that everything works as expected?" PR #158 addressed the Python-level canary; this PR addresses the deployment-environment + Fly-level canary. Together they form the test trio for catching the 2026-05-22 failure class on the next PR rather than on the next operator MCP call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…160) Closes Brian's "coverage to ≥80% + add a badge" leg of the test-trio plan (PRs #158/#159/#160). Suite at 86.43% with the new omits + 4 added nli.py error-path tests.

pyproject.toml gains [tool.coverage.run] + [tool.coverage.report] config with fail_under=80. ci.yml runs `pytest --cov` so a PR that drops coverage below the gate fails the build.

Excluded modules are documented inline and all under-testable-by-design:

- dashboard/*: Streamlit UI; tested by running the app
- `__main__.py`: 4-line entry-point shim
- upgrade.py: Real Fly+AWS interactions; tested by Layer-3
- downgrade.py: Same
- llm.py: Deprecated optional-LLM path; the deployed product is LLM-free by design (2026-05-22). NLI replaced the only production use of this module in mnemon.contradiction.

README badge: shields.io static `coverage-86%-brightgreen`. Matches existing static-badge pattern (Status, Python, License, MCP). Manual update on each release; no SaaS / codecov dep.

4 new nli.py tests (suite 850 → 855):

- _ensure_loaded raises NLIUnavailableError on HF download failure
- _ensure_loaded raises on unexpected label set (model with different output classes fails fast, not silently mis-classifies)
- prewarm() swallows unavailability per acceptable-secondary-observability category
- classify_pair softmax + input-building exercised with stubbed session (lines 164-189 of nli.py)

Composes with the test-trio:

- PR #158: Python-level integration canary (every tool round-trip)
- PR #159: [server]-extras CI matrix + layer3 --exercise-all-tools
- PR #160 (this PR): coverage gate + badge

Together they catch the 2026-05-22 memory_check_contradictions failure class at PR review time (canary + extras matrix) and ensure overall coverage doesn't regress as the project grows.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rolls the test-trio into a soak-substrate release:

- #158 every-MCP-tool integration canary
- #159 [server]-extras CI matrix + layer3 --exercise-all-tools
- #160 coverage gate at 80% + README badge (86% baseline)

Composes with the prior 0.7.0rc2 features (NLI-based contradiction detection, dry_run flag). After PyPI publish + Fly redeploy, this is the version operators should run the Phase 1 standing-tier soak on per Brian's "all known bugs fixed before soak" standard.

Suite 855 passing. mnemon --version returns 0.7.0rc3.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit d0f2a4f into main May 22, 2026
10 checks passed

cipher813 deleted the ci/server-extras-matrix-and-layer3 branch May 22, 2026 17:21

cipher813 mentioned this pull request May 22, 2026

test(coverage): enforce ≥80% coverage + README badge — 86% baseline #160

Merged

4 tasks

cipher813 mentioned this pull request May 22, 2026

chore(version): bump 0.7.0rc2 → 0.7.0rc3 #161

Merged

3 tasks
