Skip to content

fix(GRC-500): stop cleanup_orphans deleting live result sinks (format-matrix panics)#9

Open
p4gs wants to merge 1 commit into
masterfrom
fix/GRC-500-sink-cleanup-race
Open

fix(GRC-500): stop cleanup_orphans deleting live result sinks (format-matrix panics)#9
p4gs wants to merge 1 commit into
masterfrom
fix/GRC-500-sink-cleanup-race

Conversation

@p4gs
Copy link
Copy Markdown
Collaborator

@p4gs p4gs commented May 31, 2026

GRC-500: format-matrix bugs

Root cause (one bug, not three)

The HTML/markdown/JSON "format-matrix" failures share a single root cause in the result-sink lifecycle — not any writer defect. All four export writers receive the identical relationships slice via dispatch_export, so the output format never changes discovery.

is_process_running (result_sink.rs) checked /proc/{pid}, which only exists on Linux. The campaign ran on macOS, where it returned false for every PID. Each run calls ResultSink::cleanup_orphans("/tmp") at startup, which then deleted the live PID-stamped sink files of concurrently-running sibling scans. The victim later panicked in drain_all()read_results() with ENOENT (exit 101) before writing any output — appearing under whichever format ran cold, while CSV got lucky on timing.

Evidence: test-output/campaign-post-tf5/vanta.com/d2_fmt/stderr.txt (the HTML run):

[INFO] Analysis completed — 0 vendors found (possible DNS failure)
thread 'main' panicked at src/app.rs:1608:22:
Failed to read results from disk sink: Failed to open result file:
/tmp/nthpartyfinder-results-59762.jsonl.zst — No such file or directory (os error 2)

The same panic appears across box/braze/circleci/atlassian/auth0 JSON runs — format-independent.

Changes

  • result_sink.rs — portable liveness via sysinfo (already a dependency); correct on Linux/macOS/Windows so live sibling sinks are preserved. Corrected two tests that had codified the buggy macOS behavior; added test_orphan_cleanup_preserves_live_removes_dead that models the race directly.
  • app.rs — the disk-sink read path no longer .expect()-panics. On a genuine read failure it fails loudly with a clear message and a dedicated exit code (4) instead of emitting a silently-empty report — satisfying "fail loudly, never silently empty."
  • cli.rs--timeout help now documents that depth-3+/cold-cache scans routinely exceed the 600s default (the "JSON wall budget"), and that output format does not affect discovery time.
  • CHANGELOG.md — documented.

Done-when mapping

  • All four formats produce consistent, non-empty results for the same discovery → fixed: they share one discovery path; the panic that destroyed cold runs is eliminated.
  • Or fail loudly, never silently empty → the read path now bails with exit 4 + guidance instead of panicking/emitting empty.
  • Document/raise the JSON wall budget--timeout help + CHANGELOG clarify the 600s default and deep-scan guidance.

Verification

  • cargo build, cargo clippy --lib, cargo fmt --check all clean.
  • 39 result_sink tests pass, including the new regression.
  • The 53 failing dns/discovery tests are pre-existing and environmental (network/DoH on this host) — confirmed failing identically on clean master via git stash.

The format-matrix "bugs" (HTML exit 137/no output, markdown cold-cache empty,
JSON ZERO_VENDORS) shared one root cause: a /tmp result-sink race, not any
writer defect. All four writers receive the identical relationships slice, so
format never changes discovery.

`is_process_running` checked `/proc/{pid}`, which only exists on Linux. On
macOS/Windows it returned false for every PID, so each run's startup call to
`ResultSink::cleanup_orphans("/tmp")` deleted the live PID-stamped sink files
of concurrently-running scans. The victim then panicked in `drain_all()` with
ENOENT (exit 101) before any output was written — surfacing under whichever
format ran cold while CSV got lucky on timing.

Evidence: test-output/campaign-post-tf5/vanta.com/d2_fmt/stderr.txt (html run)
panicked at app.rs:1608 "Failed to open result file: /tmp/...-59762.jsonl.zst:
No such file or directory" right after "0 vendors found".

Changes:
- result_sink.rs: portable liveness via `sysinfo` (already a dep); correct on
  Linux/macOS/Windows so live sibling sinks are preserved. Fixed two tests that
  had codified the buggy macOS behavior; added a regression test that models the
  race (live sink preserved, dead orphan removed).
- app.rs: the disk-sink read path no longer `.expect()`-panics; on a genuine
  read failure it fails loudly with a clear message and exit code 4 instead of
  emitting a silently-empty report.
- cli.rs: `--timeout` help now explains deep/cold scans exceed the 600s default
  and that output format does not affect discovery time.
- CHANGELOG.md: document the fix.

Verified: cargo build + clippy + fmt clean; 39 result_sink tests pass incl. the
new regression. The 53 failing dns/discovery tests are pre-existing and
environmental (network/DoH) — confirmed failing identically on clean master.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant