ci(server-extras): [server]-only matrix + layer3 --exercise-all-tools #159

Merged
cipher813 merged 1 commit into main from ci/server-extras-matrix-and-layer3
May 22, 2026

@cipher813
Owner

Summary

Two artifacts addressing Brian's 2026-05-22 ask — "are we properly using unit tests to confirm that everything works as expected without having to manually run each tool?" — at the deployment-environment + Fly-level layers. Composes with PR #158 (Python-level integration canary) to form the test trio that catches the 2026-05-22 memory_check_contradictions failure class on the next PR rather than on the next operator MCP call.

Artifact A — .github/workflows/ci-server-extras.yml

New CI workflow that installs mnemon-memory[server] ONLY (the production-equivalent install used by the Fly Docker image) plus pytest as a separate test runner, and runs the full suite under that minimal install.

Why it matters: the existing ci.yml installs [dev] extras, which transitively pull in everything mnemon optionally supports. The Fly image installs only [server]. A failure that only surfaces under [server]-extras-only conditions slips through ci.yml — that's exactly what happened with memory_check_contradictions's LLM hard-dependency.

Guards baked in:

  • Asserts llama-cpp-python is NOT installed under [server] — future PRs that accidentally drag the LLM dep into the production path fail this workflow with a clear error
  • Runs the full suite (including the integration canary from PR #158) under [server]-only; this catches any test that imports something available only in [llm] / [ui]
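
The first guard above can be sketched in Python as follows. This is a minimal illustration, not the workflow's actual step: the function name `installed_forbidden` and the `lookup` parameter are hypothetical, and only the distribution name `llama-cpp-python` comes from this PR.

```python
from importlib import metadata

# Distributions that must NOT be present in a [server]-only install.
# Only llama-cpp-python is named in the PR; extend as needed.
FORBIDDEN = ("llama-cpp-python",)


def installed_forbidden(names=FORBIDDEN, lookup=metadata.version):
    """Return the forbidden distributions that are currently installed.

    `lookup` is injectable (hypothetical parameter) so the check can be
    tested without touching the real environment.
    """
    found = []
    for name in names:
        try:
            lookup(name)  # raises PackageNotFoundError when absent
        except metadata.PackageNotFoundError:
            continue
        found.append(name)
    return found
```

A CI step could call `installed_forbidden()` after the `[server]` install and fail the job with a clear message if the returned list is non-empty.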

Artifact B — scripts/promote_stable.sh layer3 --exercise-all-tools

New opt-in flag that, after the test Fly app is up but before the downgrade step, iterates every registered MCP tool against the remote and asserts each returns cleanly (no opaque error envelope, no unhandled exception, no NLI/embedder/baked-model breakage).

Catches Fly-specific failures the local canary can't see:

  • Missing baked models in the Docker image (e.g., if a future Dockerfile edit drops the FastEmbed or NLI bake)
  • Anthropic MCP proxy timeouts (the actual symptom from the 2026-05-22 incident)
  • Transport / auth regressions specific to OAuth AS + bearer token paths

Operational shape:

  • Tool list resolved dynamically from mcp._tool_manager._tools — tools added in future PRs are exercised automatically, no per-release maintenance
  • Per-tool inputs mirror the PR #158 integration-test fixture (single source of truth for canary inputs)
  • Destructive tools (memory_forget, memory_rebuild) skipped
  • Mutating tools constrained to dry_run=True or round-trip (promote + demote pair)
  • Adds ~30-60s to the layer3 run when the flag is on; non-NLI-touching releases can skip
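
The operational shape above can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not the code in scripts/_layer3_remote_helper.py: the function `exercise_all_tools`, the `call` transport callback, the `dry_run_tools` parameter, and the `{"error": ...}` envelope shape are all hypothetical. Only the skipped tool names and the `dry_run=True` constraint come from the PR description.

```python
# Destructive tools are skipped, per the PR description.
DESTRUCTIVE = frozenset({"memory_forget", "memory_rebuild"})


def exercise_all_tools(tools, call, canary_inputs, dry_run_tools=frozenset()):
    """Call every registered tool once and collect failures.

    tools          -- mapping of tool name -> tool object (e.g. the registry
                      resolved from mcp._tool_manager._tools)
    call           -- hypothetical callback: call(name, args) -> result
    canary_inputs  -- per-tool arguments mirroring the integration fixture
    dry_run_tools  -- mutating tools forced to dry_run=True (assumed set)

    Returns {tool_name: reason} for every tool that did not return cleanly.
    """
    failures = {}
    for name in sorted(tools):
        if name in DESTRUCTIVE:
            continue  # never run destructive tools against the remote
        args = dict(canary_inputs.get(name, {}))
        if name in dry_run_tools:
            args["dry_run"] = True
        try:
            result = call(name, args)
        except Exception as exc:  # any unhandled exception is a failure
            failures[name] = f"unhandled exception: {exc!r}"
            continue
        # Assumed envelope shape: a dict with a truthy "error" key.
        if isinstance(result, dict) and result.get("error"):
            failures[name] = f"error envelope: {result['error']}"
    return failures
```

Because the loop walks whatever registry it is handed, a tool added in a later PR is picked up without editing the probe; only its canary input needs an entry if the tool requires arguments.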

scripts/_layer3_remote_helper.py gains the exercise-all-tools subcommand wired through the FastMCP tool manager. The bash dispatcher in promote_stable.sh is refactored to forward args via cmd_layer3 "$@".

Tests

  • 2 new bash harness tests in tests/test_promote_stable.sh:
    • test_helper_exposes_exercise_all_tools_subcommand — locks the helper dispatcher
    • test_layer3_passes_through_exercise_all_tools_flag — locks flag plumbing
  • Harness 13 → 15 passing
  • Python suite unchanged at 850 passing
  • Helper smoke-tested: imports clean, dispatcher recognizes new subcommand

Test plan

  • tests/test_promote_stable.sh harness passes
  • Python suite 850 still passing
  • Helper imports + dispatcher smoke-test clean
  • Post-merge: CI workflow run validates against real GitHub Actions
  • Post-merge: operator can run scripts/promote_stable.sh layer3 --exercise-all-tools against the next RC bump to validate the Fly-level probe

What's next

🤖 Generated with Claude Code

Follow-up to PR #158 — closes the [server]-extras gap that the local
integration canary can't see + extends the operator Layer-3 web test
ritual to probe every MCP tool against the live Fly app.

Two artifacts:

1. .github/workflows/ci-server-extras.yml — installs mnemon-memory[server]
   ONLY (the Fly Docker install) + pytest as a separate test runner.
   Runs the full suite under that minimal install. Includes a guard
   that asserts llama-cpp-python is NOT installed under [server] — so
   future PRs can't accidentally drag the LLM dep into the production
   path. This is the workflow that would have caught
   memory_check_contradictions's LLM hard-dependency on PR #154 when
   the salience-tier tools were first added; ci.yml passed because
   [dev] installs everything.

2. scripts/promote_stable.sh layer3 --exercise-all-tools — opt-in flag
   that, after the test Fly app is up but before downgrade, iterates
   every registered MCP tool against the remote and asserts each
   returns cleanly. Catches Fly-specific breakage (missing baked
   models, Anthropic MCP proxy timeouts, transport regressions) that
   the local Python-level canary in tests/test_tools_integration.py
   can't see.

   Tool list resolved dynamically from mcp._tool_manager._tools, so
   tools added in future PRs are exercised automatically — no per-
   release maintenance burden. Per-tool inputs mirror the integration-
   test fixture; destructive tools (memory_forget, memory_rebuild)
   skipped; mutating tools constrained to dry_run / round-trip.

scripts/_layer3_remote_helper.py gains an exercise-all-tools
subcommand wired through the FastMCP tool manager. Two regression-
lock tests added to tests/test_promote_stable.sh harness (13 → 15
passing) covering helper dispatch + flag plumbing through the bash
dispatcher (cmd_layer3 "$@" forwarding + EXERCISE_ALL_TOOLS=1 set).

Full Python suite still 850 passing.

Driver: Brian's 2026-05-22 ask after the memory_check_contradictions
incident — "given the difficulty of checking each individual mnemon
tool available, are we properly using unit tests to confirm that
everything works as expected?" PR #158 addressed the Python-level
canary; this PR addresses the deployment-environment + Fly-level
canary. Together they form the test trio for catching the
2026-05-22 failure class on the next PR rather than on the next
operator MCP call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 merged commit d0f2a4f into main on May 22, 2026
10 checks passed
@cipher813 deleted the ci/server-extras-matrix-and-layer3 branch on May 22, 2026 17:21
cipher813 added a commit that referenced this pull request May 22, 2026
…160)

Closes Brian's "coverage to ≥80% + add a badge" leg of the test-trio
plan (PRs #158/#159/#160). Suite at 86.43% with the new omits +
4 added nli.py error-path tests.

pyproject.toml gains [tool.coverage.run] + [tool.coverage.report]
config with fail_under=80. ci.yml runs `pytest --cov` so a PR that
drops coverage below the gate fails the build. Excluded modules
are documented inline; all are under-testable by design:

  - dashboard/*       Streamlit UI; tested by running the app
  - __main__.py       4-line entry-point shim
  - upgrade.py        Real Fly+AWS interactions; tested by Layer-3
  - downgrade.py      Same
  - llm.py            Deprecated optional-LLM path; the deployed
                      product is LLM-free by design (2026-05-22).
                      NLI replaced the only production use of this
                      module in mnemon.contradiction.

README badge: shields.io static `coverage-86%-brightgreen`. Matches
existing static-badge pattern (Status, Python, License, MCP). Manual
update on each release; no SaaS / codecov dep.

4 new nli.py tests (suite 850 → 855):
  - _ensure_loaded raises NLIUnavailableError on HF download failure
  - _ensure_loaded raises on unexpected label set (model with
    different output classes fails fast, not silently mis-classifies)
  - prewarm() swallows unavailability per acceptable-secondary-
    observability category
  - classify_pair softmax + input-building exercised with stubbed
    session (lines 164-189 of nli.py)
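
The two behaviours these tests pin down, failing fast on an unexpected label set and applying softmax to the classifier output, can be sketched as below. This is a hedged illustration, not the contents of nli.py: `EXPECTED_LABELS`, `validate_labels`, and `softmax` are hypothetical names, the three-way label set is an assumption, and the exception type the real `_ensure_loaded` raises for a label mismatch is not stated in the commit message (ValueError is used here as a stand-in).

```python
import math

# Assumed three-way NLI label set; the real model's labels may differ.
EXPECTED_LABELS = {"contradiction", "entailment", "neutral"}


def validate_labels(id2label):
    """Fail fast if the model's output classes are not the expected ones.

    Returns a normalized {index: lowercase_label} mapping on success.
    """
    normalized = {i: str(v).lower() for i, v in id2label.items()}
    labels = set(normalized.values())
    if labels != EXPECTED_LABELS:
        # Stand-in exception type; refusing to load avoids silently
        # mis-classifying with a model that has different classes.
        raise ValueError(f"unexpected NLI label set: {sorted(labels)}")
    return normalized


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```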

Composes with the test-trio:
  PR #158 — Python-level integration canary (every tool round-trip)
  PR #159 — [server]-extras CI matrix + layer3 --exercise-all-tools
  PR #160 — this PR: coverage gate + badge

Together they catch the 2026-05-22 memory_check_contradictions
failure class at PR review time (canary + extras matrix) and ensure
overall coverage doesn't regress as the project grows.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 22, 2026
Rolls the test-trio into a soak-substrate release:
- #158 every-MCP-tool integration canary
- #159 [server]-extras CI matrix + layer3 --exercise-all-tools
- #160 coverage gate at 80% + README badge (86% baseline)

Composes with the prior 0.7.0rc2 features (NLI-based contradiction
detection, dry_run flag). After PyPI publish + Fly redeploy, this is
the version operators should run the Phase 1 standing-tier soak on
per Brian's "all known bugs fixed before soak" standard.

Suite 855 passing. mnemon --version returns 0.7.0rc3.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>