Skip to content

fix(session): accurate wait exit codes + KV-warm distinction (#1442, #1439)#1445

Merged
RyderFreeman4Logos merged 2 commits into
mainfrom
fix/1442-1439-session-wait-exit-code
May 17, 2026
Merged

fix(session): accurate wait exit codes + KV-warm distinction (#1442, #1439)#1445
RyderFreeman4Logos merged 2 commits into
mainfrom
fix/1442-1439-session-wait-exit-code

Conversation

@RyderFreeman4Logos
Copy link
Copy Markdown
Owner

Summary

Test plan

  • just pre-commit passes
  • csa review --check-verdict PASS (codex gpt-5.5)
  • Verify: successful session → wait exits 0
  • Verify: KV-warm exit prints CSA:SESSION_WAIT_KV_WARM hint with exit 0

Closes #1442, Closes #1439

🤖 Generated with Claude Code

RyderFreeman4Logos and others added 2 commits May 17, 2026 09:03
`csa session wait` was reporting `status=failure exit=1` for sessions
whose `result.toml` recorded `status=success exit_code=0`. The wait
loop loaded the refreshed `result.toml` after seeing the daemon
completion packet, then dropped it via an mtime-based filter that
preferred the daemon completion packet whenever `result.toml` was older
or had the same mtime (except for one narrow corner case).

In practice this filter always lost: the agent writes `result.toml`
before the daemon process exits and writes `daemon-completion.toml`,
so `result.toml.mtime <= daemon-completion.toml.mtime` is the normal
ordering. Any session where the daemon happened to exit non-zero for
reasons unrelated to the session outcome (post-write cleanup failure,
parent SIGTERM, etc.) would be reported as a failure to callers.

Trust `result.toml` whenever the refresh succeeds. It is the canonical
session artifact recording the actual outcome; the daemon completion
packet only records the daemon process exit and is meant to be a
fallback when `result.toml` is unavailable.

The earlier regression test
`handle_session_wait_prefers_daemon_completion_exit_code` codified the
buggy behaviour and is renamed/inverted to
`handle_session_wait_prefers_result_toml_over_daemon_completion_exit_code`
so the result-authority contract is explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`csa session wait` previously emitted exit code 124 whenever the configured
wait cap fired, regardless of whether the underlying session was still
healthy. AI agents (and shell callers reading the standard Unix timeout
convention) interpreted 124 as failure, even though the loop's intent has
always been a polite poll cap that keeps the caller's KV cache warm.

This change splits the exit-cap behaviour:

* Session daemon still alive at the cap (the common case): emit
  `CSA:SESSION_WAIT_KV_WARM` with `status=alive` and return exit code 0.
  The accompanying CSA:CALLER_HINT explicitly tells callers the exit is
  healthy and to re-wait in a fresh Bash call.
* Daemon already gone with no `result.toml` at the cap (defensive, rare):
  emit `CSA:SESSION_WAIT_TIMEOUT` with `status=dead` and return 124.
  Earlier loop branches usually catch this case with exit-1 first.

Existing tests that asserted the legacy 124-on-alive behaviour are
updated, and a new regression test `kv_warm_tests::handle_session_wait_
kv_warm_exit_when_daemon_alive_at_cap` pins the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyderFreeman4Logos RyderFreeman4Logos merged commit 80bd6f0 into main May 17, 2026
5 of 7 checks passed
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the session wait mechanism to prioritize result.toml as the authoritative source for session outcomes, addressing issues where daemon process exit codes caused false failures. It also introduces a new 'KV-warm' exit behavior (returning code 0) when a session remains active at the wait timeout, enabling callers to warm their KV caches before re-waiting. Review feedback identifies opportunities to reduce redundancy in liveness checks and string formatting within session_cmds_daemon_wait.rs, as well as a suggestion to consolidate duplicated test utility code into a shared module to improve maintainability.

Comment on lines +483 to +484
let session_alive = session_has_terminal_process(&session_dir)
|| csa_process::ToolLiveness::is_alive(&session_dir);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check session_has_terminal_process(&session_dir) is redundant here because csa_process::ToolLiveness::is_alive(&session_dir) already includes checks for live processes (both via daemon.pid and lock files). Simplifying this to just use is_alive improves readability and reduces redundant filesystem operations.

Suggested change
let session_alive = session_has_terminal_process(&session_dir)
|| csa_process::ToolLiveness::is_alive(&session_dir);
let session_alive = csa_process::ToolLiveness::is_alive(&session_dir);

Comment on lines +504 to +507
cd = cd
.as_ref()
.map(|p| format!(" --cd '{p}'"))
.unwrap_or_default(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cd argument formatting is redundant here. It was already calculated and stored in cd_arg at line 479. Using cd_arg directly simplifies the code and ensures consistency with the other eprintln! calls in this block.

Suggested change
cd = cd
.as_ref()
.map(|p| format!(" --cd '{p}'"))
.unwrap_or_default(),
cd = cd_arg,

Comment on lines +8 to +57
struct EnvVarGuard {
key: &'static str,
original: Option<String>,
}

impl EnvVarGuard {
fn set(key: &'static str, value: impl AsRef<std::ffi::OsStr>) -> Self {
let original = std::env::var(key).ok();
// SAFETY: test-scoped env mutation guarded by a process-wide mutex.
unsafe { std::env::set_var(key, value) };
Self { key, original }
}
}

impl Drop for EnvVarGuard {
fn drop(&mut self) {
// SAFETY: test-scoped env mutation guarded by a process-wide mutex.
unsafe {
match self.original.as_deref() {
Some(value) => std::env::set_var(self.key, value),
None => std::env::remove_var(self.key),
}
}
}
}

#[cfg(target_os = "linux")]
fn read_process_start_time_ticks(pid: u32) -> u64 {
let stat_path = format!("/proc/{pid}/stat");
let content = std::fs::read_to_string(stat_path).expect("read /proc stat");
let close_paren = content.rfind(')').expect("stat comm terminator");
let after_comm = &content[close_paren + 1..];
let mut parts = after_comm.split_whitespace();
parts.next().expect("state");
parts.next().expect("ppid");
parts.next().expect("pgrp");
for _ in 0..16 {
parts.next().expect("intermediate stat field");
}
parts
.next()
.expect("starttime")
.parse::<u64>()
.expect("starttime parse")
}

#[cfg(target_os = "linux")]
fn daemon_pid_record(pid: u32) -> String {
format!("{pid} {}\n", read_process_start_time_ticks(pid))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The EnvVarGuard struct and read_process_start_time_ticks helper function are duplicated from other test files in this crate (e.g., session_cmds_tests_tail_wait.rs, session_cmds_tests_daemon_pid_tail.rs). This duplication increases maintenance overhead. Consider moving these shared test utilities to a common module within the test suite to improve maintainability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

csa session wait exits 1 despite result.toml showing success csa session wait: distinguish KV-warm exit from real timeout via exit code + caller hint

1 participant