fix(GRC-500): stop cleanup_orphans deleting live result sinks (format-matrix panics)#9
Open
p4gs wants to merge 1 commit into
Open
fix(GRC-500): stop cleanup_orphans deleting live result sinks (format-matrix panics)#9p4gs wants to merge 1 commit into
p4gs wants to merge 1 commit into
Conversation
The format-matrix "bugs" (HTML exit 137/no output, markdown cold-cache empty,
JSON ZERO_VENDORS) shared one root cause: a /tmp result-sink race, not any
writer defect. All four writers receive the identical relationships slice, so
format never changes discovery.
`is_process_running` checked `/proc/{pid}`, which only exists on Linux. On
macOS/Windows it returned false for every PID, so each run's startup call to
`ResultSink::cleanup_orphans("/tmp")` deleted the live PID-stamped sink files
of concurrently-running scans. The victim then panicked in `drain_all()` with
ENOENT (exit 101) before any output was written — surfacing under whichever
format ran cold while CSV got lucky on timing.
Evidence: test-output/campaign-post-tf5/vanta.com/d2_fmt/stderr.txt (html run)
panicked at app.rs:1608 "Failed to open result file: /tmp/...-59762.jsonl.zst:
No such file or directory" right after "0 vendors found".
Changes:
- result_sink.rs: portable liveness via `sysinfo` (already a dep); correct on
Linux/macOS/Windows so live sibling sinks are preserved. Fixed two tests that
had codified the buggy macOS behavior; added a regression test that models the
race (live sink preserved, dead orphan removed).
- app.rs: the disk-sink read path no longer `.expect()`-panics; on a genuine
read failure it fails loudly with a clear message and exit code 4 instead of
emitting a silently-empty report.
- cli.rs: `--timeout` help now explains deep/cold scans exceed the 600s default
and that output format does not affect discovery time.
- CHANGELOG.md: document the fix.
Verified: cargo build + clippy + fmt clean; 39 result_sink tests pass incl. the
new regression. The 53 failing dns/discovery tests are pre-existing and
environmental (network/DoH) — confirmed failing identically on clean master.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
GRC-500: format-matrix bugs
Root cause (one bug, not three)
The HTML/markdown/JSON "format-matrix" failures share a single root cause in the result-sink lifecycle — not any writer defect. All four export writers receive the identical
relationshipsslice viadispatch_export, so the output format never changes discovery.is_process_running(result_sink.rs) checked/proc/{pid}, which only exists on Linux. The campaign ran on macOS, where it returnedfalsefor every PID. Each run callsResultSink::cleanup_orphans("/tmp")at startup, which then deleted the live PID-stamped sink files of concurrently-running sibling scans. The victim later panicked indrain_all()→read_results()with ENOENT (exit 101) before writing any output — appearing under whichever format ran cold, while CSV got lucky on timing.Evidence:
test-output/campaign-post-tf5/vanta.com/d2_fmt/stderr.txt(the HTML run):The same panic appears across box/braze/circleci/atlassian/auth0 JSON runs — format-independent.
Changes
sysinfo(already a dependency); correct on Linux/macOS/Windows so live sibling sinks are preserved. Corrected two tests that had codified the buggy macOS behavior; addedtest_orphan_cleanup_preserves_live_removes_deadthat models the race directly..expect()-panics. On a genuine read failure it fails loudly with a clear message and a dedicated exit code (4) instead of emitting a silently-empty report — satisfying "fail loudly, never silently empty."--timeouthelp now documents that depth-3+/cold-cache scans routinely exceed the 600s default (the "JSON wall budget"), and that output format does not affect discovery time.Done-when mapping
--timeouthelp + CHANGELOG clarify the 600s default and deep-scan guidance.Verification
cargo build,cargo clippy --lib,cargo fmt --checkall clean.result_sinktests pass, including the new regression.dns/discoverytests are pre-existing and environmental (network/DoH on this host) — confirmed failing identically on cleanmasterviagit stash.