Skip to content

test: cover flagged edge cases + fix snapshot persistence-health bug#330

Merged
VijitSingh97 merged 3 commits into
developfrom
claude/gallant-hodgkin-0cfe2c
Jul 1, 2026
Merged

test: cover flagged edge cases + fix snapshot persistence-health bug#330
VijitSingh97 merged 3 commits into
developfrom
claude/gallant-hodgkin-0cfe2c

Conversation

@VijitSingh97

Copy link
Copy Markdown
Collaborator

What

A source-vs-tests audit across Pithead's four testing tiers. It adds coverage — at the lowest honest tier (all tier 1) — for branches, guards, and render paths that had none, and fixes one real bug found along the way. No production behaviour changes except the bug fix.

The bug (found while auditing)

storage_service.save_snapshot's TypeError branch (a snapshot json.dumps can't serialize) called logger.error directly instead of _db_error like every other write path — so db_healthy stayed True and the #131 "DB write failing" badge never fired while snapshots silently stopped persisting. One-line root-cause fix + a regression test asserting is_db_healthy() flips to False.

Tests added (tier 1)

Area Behaviour now covered
pytest Unserializable snapshot flips db_healthy (guards the fix above)
shell Firewall iptables -I install-failure → warn + roll back, not half-open (#270)
shell Wallet subaddress / integrated payout → abort apply (#250, silent reward-loss)
shell monero.mode=remote with no host → abort; non-/24 network.subnet → abort (#180)
shell ensure_owner conditional chown + whole-tree recursive scan, no -maxdepth (#255)
frontend ProxyTotals reddens rejected only on reject_level=high
frontend Per-worker api_ok=false + reject_flag badges; XvB/Unknown PoolBadge variants
frontend #278/#313 invariant: Tari ✔ gated on a live gRPC channel, never on active-but-dead
frontend Gauge done→✔ vs syncing→percent

Each test fails if the logic is inverted (e.g. the Tari test asserts check-inline is absent for active-but-dead; the snapshot test asserts db_healthy is False).

Docs

  • docs/test-inventory.md regenerated: 764 → 770 enumerated cases (drift check green).
  • docs/testing-strategy.md Known-gaps ledger gains a Coverage-audit follow-ups section (below), so the gaps stay tracked.

Additional gaps found — not covered here (need Docker / real box → tier 3/4)

These are the real completeness risks the audit surfaced. None touch the code changed in this PR, so per "test once at the lowest honest tier" they're follow-ups, not part of this change. The first two are divergence backstops for paths this PR proves only with stubs:

  • Firewall rollback, real kernel. Enforce Tor-only egress at the network layer (fail-closed) — stop relying on per-app config #270 rollback is proven with stubbed iptables (control flow only). Whether a real kernel strips a partially inserted DOCKER-USER rule and ends fail-closed is never exercised with an induced mid-insert failure → needs a --fault-injection case.
  • ensure_owner real mixed-ownership tree. The Run containers as a non-root user (drop root inside the container) #255 scenario (user-owned dir, root-owned contents, one chown -R) can only be built as root → needs a --lifecycle step.
  • Real-container monerod failover in PR CI — only runs on the manual tier-4 box; the mini-stack breaks Tari, not monerod.
  • Non-blocking-Tari "ignore" path with real containers — unit-only; the path that silently kills yield if it regresses to a reject.
  • monerod busy / mid-reorg failover — contract test proves the client reads busy as unreachable; no scenario asserts the dashboard rejects workers on a busy-but-alive node.
  • Double outage, both-must-recover — unit-only; recovery ordering unproven end-to-end.
  • Partial-start / stop-failure idempotency — control-loop retry is unit-only; no tier-3/4 injects a docker start/stop error.
  • pithead doctor on a real box — only its exit code is unit-tested; the NTP/clock-drift check is never fault-injected.
  • Disk-full / ENOSPC verdict — only a headroom warning is checked; a real unhealthy-on-ENOSPC verdict is never forced.
  • Tor-container-down partial start — no Caddy/Tor in the mini-stack, so SOCKS-unreachable is exercised at no tier below the manual box.
  • Insecure + main matrix rowdashboard.secure=false only pairs with pool=nano, so insecure+main has no row.

Verification (all local)

  • make lint → all checks passed
  • make test → dashboard 641 passed / 94.73% cov (gate 80%), pithead shell 432/0, selftest 97/0, compose ✓, fakes 12
  • make test-patch-coverage100% on the changed line (gate 90%)
  • frontend (node --test) → 66 pass / 0 fail
  • make test-inventory-check → up to date

Fixtures use synthetic values only (4AAA…/8AAA… wallets, RFC-1918 IPs) — respects .gitleaks.toml.

🤖 Generated with Claude Code

VijitSingh97 and others added 2 commits June 30, 2026 20:31
Source-vs-tests audit of the four testing tiers. Adds coverage at the lowest
honest tier (all tier 1) for branches with no covering test, and fixes one
real bug found along the way.

Bug fix:
- storage_service.save_snapshot: a non-serializable snapshot (TypeError from
  json.dumps) logged the error but left db_healthy=True, so the #131 "DB write
  failing" badge never fired while snapshots silently stopped persisting. Route
  it through _db_error like every other write path.

Tests added (tier 1):
- pytest: unserializable snapshot flips db_healthy (regression guard for above).
- pithead shell: firewall install-failure rollback (#270); wallet subaddress /
  integrated hard-fail (#250); remote-mode-without-host and non-/24 subnet
  guards (#180); ensure_owner conditional + whole-tree recursive chown (#255).
- frontend: ProxyTotals high-reject styling; per-worker api/reject badges +
  XvB/Unknown pool badges; the #278/#313 Tari-✔ invariant (never on
  active-but-dead); Gauge done-vs-syncing.

Docs:
- test-inventory regenerated (764 -> 770 cases); testing-strategy Known-gaps
  gains a "Coverage-audit follow-ups" section listing the tier-3/4 backstops
  still needed (real-kernel firewall rollback, real mixed-ownership ensure_owner,
  real-container monerod failover, non-blocking-Tari ignore, busy/reorg,
  double-outage, doctor NTP check, ENOSPC, Tor-down, insecure+main row).

make lint + make test green; dashboard 94.73% cov; patch coverage 100%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e_owner

The two divergence risks the audit flagged — paths PR #330 proves only with
stubs — now have live counterparts, folded into the existing opt-in phases (no
new flags), following the --auth-fail-closed "live counterpart to a tier-1
assertion" pattern.

- --fault-injection: fault_firewall_rollback shadows iptables with a wrapper
  that fails every `-I`, re-runs apply_tor_egress_firewall, and asserts the box
  ends fail-closed (no pithead-tor-egress rule left half-installed), then
  reinstates the real firewall. Belt-and-braces reinstates unconditionally so a
  mid-phase abort can't leave clearnet egress open. Proves the real-kernel strip
  the tier-1 stub can't (#270).
- --lifecycle: plants a root-owned file under the dashboard data dir and asserts
  the pool-flip apply (ensure_directories -> ensure_owner) chowns it to uid 1000
  — the #255 "scan contents, not just the dir" regression, which needs root to
  set up so it can't live at tier 1.

Both are DESTRUCTIVE-then-restored, local-box only, and run at the release gate
(not CI). Help text, testing-strategy Known-gaps, and the test inventory updated.

Verified: shellcheck + shfmt clean, bash -n parses, integration self-test 97/0,
make lint green, inventory drift-check green. NOT executed against a real box —
that happens at the release gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@VijitSingh97

Copy link
Copy Markdown
Collaborator Author

Follow-up commit eb3b336: the first two "additional gaps" listed above are now implemented as tier-4 live backstops, folded into the existing --fault-injection / --lifecycle phases (no new flags), following the --auth-fail-closed "live counterpart to a tier-1 assertion" pattern.

Verified: shellcheck + shfmt clean, bash -n parses, integration self-test 97/0, make lint green, inventory drift-check green. Not executed against a real box — both are destructive-then-restored, local-box only, and run at the release gate. The remaining 9 gaps stay tracked in docs/testing-strategy.md.

…iptables

Validated the two tier-4 backstops live on a real box (gouda). The
ensure_owner migration passed as written. The firewall-rollback check did NOT
exercise the rollback: apply_tor_egress_firewall calls `sudo iptables -I`, and
sudo's secure_path ignores a PATH-shadowed iptables, so the real insert
succeeded and the rollback branch never ran (the rc-0 assertion passed for the
wrong reason — a false green).

Fix: shadow `sudo` instead of `iptables`. sudo is still resolved via PATH, so a
wrapper that fails an `iptables … -I …` insert and execs real sudo for
everything else (remove's -D, -N, iptables-save) makes the insert fail exactly
as a real mid-insert error would. Re-validated live: with the real -I failing,
the partial ruleset is rolled back (0 pithead-tor-egress rules), then reinstated
(7). Box restored byte-identical to its pre-test iptables snapshot.

shellcheck + shfmt clean, bash -n parses, self-test 97/0, inventory drift green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@VijitSingh97

Copy link
Copy Markdown
Collaborator Author

Live-validated on a real box (commit 72fae38)

Ran the two tier-4 backstops against a real synced box. Validation caught a real bug in the firewall test — exactly the kind of divergence tier-4 exists to find:

The box was restored byte-identical to its pre-test iptables-save snapshot; deployed pithead untouched (sha unchanged), stack healthy, all artifacts removed.

Note: the box runs release v1.0.3 (which predates these functions), so validation used the develop-branch pithead at a temp path against throwaway targets — it exercises the real functions on the real kernel, not the full run.sh harness end-to-end. The full harness run happens at the release gate once develop is deployed there.

@VijitSingh97 VijitSingh97 merged commit f1e6db2 into develop Jul 1, 2026
16 checks passed
@VijitSingh97 VijitSingh97 deleted the claude/gallant-hodgkin-0cfe2c branch July 1, 2026 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant