fix(session): accurate wait exit codes + KV-warm distinction (#1442, #1439)#1445
Conversation
`csa session wait` was reporting `status=failure exit=1` for sessions whose `result.toml` recorded `status=success exit_code=0`. The wait loop loaded the refreshed `result.toml` after seeing the daemon completion packet, then dropped it via an mtime-based filter that preferred the daemon completion packet whenever `result.toml` was older or had the same mtime (except for one narrow corner case). In practice this filter always lost: the agent writes `result.toml` before the daemon process exits and writes `daemon-completion.toml`, so `result.toml.mtime <= daemon-completion.toml.mtime` is the normal ordering. Any session where the daemon happened to exit non-zero for reasons unrelated to the session outcome (post-write cleanup failure, parent SIGTERM, etc.) would be reported as a failure to callers. Trust `result.toml` whenever the refresh succeeds. It is the canonical session artifact recording the actual outcome; the daemon completion packet only records the daemon process exit and is meant to be a fallback when `result.toml` is unavailable. The earlier regression test `handle_session_wait_prefers_daemon_completion_exit_code` codified the buggy behaviour and is renamed/inverted to `handle_session_wait_prefers_result_toml_over_daemon_completion_exit_code` so the result-authority contract is explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`csa session wait` previously emitted exit code 124 whenever the configured wait cap fired, regardless of whether the underlying session was still healthy. AI agents (and shell callers reading the standard Unix timeout convention) interpreted 124 as failure, even though the loop's intent has always been a polite poll cap that keeps the caller's KV cache warm. This change splits the exit-cap behaviour: * Session daemon still alive at the cap (the common case): emit `CSA:SESSION_WAIT_KV_WARM` with `status=alive` and return exit code 0. The accompanying CSA:CALLER_HINT explicitly tells callers the exit is healthy and to re-wait in a fresh Bash call. * Daemon already gone with no `result.toml` at the cap (defensive, rare): emit `CSA:SESSION_WAIT_TIMEOUT` with `status=dead` and return 124. Earlier loop branches usually catch this case with exit-1 first. Existing tests that asserted the legacy 124-on-alive behaviour are updated, and a new regression test `kv_warm_tests::handle_session_wait_ kv_warm_exit_when_daemon_alive_at_cap` pins the contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the session wait mechanism to prioritize result.toml as the authoritative source for session outcomes, addressing issues where daemon process exit codes caused false failures. It also introduces a new 'KV-warm' exit behavior (returning code 0) when a session remains active at the wait timeout, enabling callers to warm their KV caches before re-waiting. Review feedback identifies opportunities to reduce redundancy in liveness checks and string formatting within session_cmds_daemon_wait.rs, as well as a suggestion to consolidate duplicated test utility code into a shared module to improve maintainability.
| let session_alive = session_has_terminal_process(&session_dir) | ||
| || csa_process::ToolLiveness::is_alive(&session_dir); |
There was a problem hiding this comment.
The check session_has_terminal_process(&session_dir) is redundant here because csa_process::ToolLiveness::is_alive(&session_dir) already includes checks for live processes (both via daemon.pid and lock files). Simplifying this to just use is_alive improves readability and reduces redundant filesystem operations.
| let session_alive = session_has_terminal_process(&session_dir) | |
| || csa_process::ToolLiveness::is_alive(&session_dir); | |
| let session_alive = csa_process::ToolLiveness::is_alive(&session_dir); |
| cd = cd | ||
| .as_ref() | ||
| .map(|p| format!(" --cd '{p}'")) | ||
| .unwrap_or_default(), |
There was a problem hiding this comment.
The cd argument formatting is redundant here. It was already calculated and stored in cd_arg at line 479. Using cd_arg directly simplifies the code and ensures consistency with the other eprintln! calls in this block.
| cd = cd | |
| .as_ref() | |
| .map(|p| format!(" --cd '{p}'")) | |
| .unwrap_or_default(), | |
| cd = cd_arg, |
| struct EnvVarGuard { | ||
| key: &'static str, | ||
| original: Option<String>, | ||
| } | ||
|
|
||
| impl EnvVarGuard { | ||
| fn set(key: &'static str, value: impl AsRef<std::ffi::OsStr>) -> Self { | ||
| let original = std::env::var(key).ok(); | ||
| // SAFETY: test-scoped env mutation guarded by a process-wide mutex. | ||
| unsafe { std::env::set_var(key, value) }; | ||
| Self { key, original } | ||
| } | ||
| } | ||
|
|
||
| impl Drop for EnvVarGuard { | ||
| fn drop(&mut self) { | ||
| // SAFETY: test-scoped env mutation guarded by a process-wide mutex. | ||
| unsafe { | ||
| match self.original.as_deref() { | ||
| Some(value) => std::env::set_var(self.key, value), | ||
| None => std::env::remove_var(self.key), | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| #[cfg(target_os = "linux")] | ||
| fn read_process_start_time_ticks(pid: u32) -> u64 { | ||
| let stat_path = format!("/proc/{pid}/stat"); | ||
| let content = std::fs::read_to_string(stat_path).expect("read /proc stat"); | ||
| let close_paren = content.rfind(')').expect("stat comm terminator"); | ||
| let after_comm = &content[close_paren + 1..]; | ||
| let mut parts = after_comm.split_whitespace(); | ||
| parts.next().expect("state"); | ||
| parts.next().expect("ppid"); | ||
| parts.next().expect("pgrp"); | ||
| for _ in 0..16 { | ||
| parts.next().expect("intermediate stat field"); | ||
| } | ||
| parts | ||
| .next() | ||
| .expect("starttime") | ||
| .parse::<u64>() | ||
| .expect("starttime parse") | ||
| } | ||
|
|
||
| #[cfg(target_os = "linux")] | ||
| fn daemon_pid_record(pid: u32) -> String { | ||
| format!("{pid} {}\n", read_process_start_time_ticks(pid)) | ||
| } |
There was a problem hiding this comment.
The EnvVarGuard struct and read_process_start_time_ticks helper function are duplicated from other test files in this crate (e.g., session_cmds_tests_tail_wait.rs, session_cmds_tests_daemon_pid_tail.rs). This duplication increases maintenance overhead. Consider moving these shared test utilities to a common module within the test suite to improve maintainability.
Summary
csa session waitnow readsresult.tomlto determine actual exit code instead of using daemon process exit. Sessions that completed successfully no longer report false failures.CSA:SESSION_WAIT_KV_WARMhint. Exit 124 reserved for actual timeouts.Test plan
just pre-commitpassescsa review --check-verdictPASS (codex gpt-5.5)CSA:SESSION_WAIT_KV_WARMhint with exit 0Closes #1442, Closes #1439
🤖 Generated with Claude Code