Support steering active Codex turns from queued messages#472
Support steering active Codex turns from queued messages#472Liu-KM wants to merge 9 commits intotiann:mainfrom
Conversation
There was a problem hiding this comment.
Findings
- [Major] Messages queued before the turn id is known are not re-checked for steering, so they can fall through to the next turn. Evidence
cli/src/codex/codexRemoteLauncher.ts:261,cli/src/codex/codexRemoteLauncher.ts:771.
Suggested fix:if (turnId) { this.currentTurnId = turnId; void maybeSteerQueuedMessages(); }
- [Major]
await turnSettledcan hang indefinitely if the app-server exits afterturn/startsucceeds but before a terminal event arrives. Evidencecli/src/codex/codexRemoteLauncher.ts:762,cli/src/codex/codexRemoteLauncher.ts:776, contextcli/src/codex/codexAppServerClient.ts:99.
Suggested fix:appServerClient.setDisconnectHandler(() => { turnInFlight = false; activeTurnModeHash = null; this.currentTurnId = null; settleTurnInFlight(); });
Summary
Review mode: initial
The new steering path still misses one active-turn window and can deadlock the remote launcher on app-server disconnect. Residual testing gap: not found coverage for messages queued before turn_id becomes available or for disconnects after turn/start.
Testing
Not run (automation; static review only per prompt constraints)
HAPI Bot
| if (this.shouldExit || turnInFlight === false || steeringInFlight) { | ||
| return; | ||
| } | ||
| if (!this.currentThreadId || !this.currentTurnId || !activeTurnModeHash) { |
There was a problem hiding this comment.
[MAJOR] This early return means a same-mode message that lands after turnInFlight = true but before currentTurnId is populated will miss the steering window. setOnMessage() fires once, exits here, and nothing re-runs maybeSteerQueuedMessages() when startTurn() or turn/started later provides the turn id, so that message waits for a later turn instead of steering the active one.
Suggested fix:
if (turnId) {
this.currentTurnId = turnId;
void maybeSteerQueuedMessages();
}| } else if (!this.currentTurnId) { | ||
| allowAnonymousTerminalEvent = true; | ||
| } | ||
| await turnSettled; |
There was a problem hiding this comment.
[MAJOR] turnSettled is only resolved from terminal notifications. If the app-server exits after turn/start returns but before turn/completed, CodexAppServerClient only rejects pending RPCs (cli/src/codex/codexAppServerClient.ts:99), so this await never resolves and the session stays stuck in thinking while cleanup never runs.
Suggested fix:
appServerClient.setDisconnectHandler(() => {
turnInFlight = false;
activeTurnModeHash = null;
this.currentTurnId = null;
settleTurnInFlight();
});9c60274 to
f9b38bc
Compare
There was a problem hiding this comment.
Findings
- [Major] Normal shutdown now fires the disconnect callback twice, because
disconnect()callsnotifyDisconnected(...)and the killed child still emitsexit, which calls it again. IncodexRemoteLauncherthat callback settles active-turn state, so future disconnect consumers will also need to be idempotent for every cleanup path. Evidencecli/src/codex/codexAppServerClient.ts:100,cli/src/codex/codexAppServerClient.ts:206.
Suggested fix:async disconnect(): Promise<void> { if (!this.connected) { return; } const child = this.process; this.process = null; this.disconnectHandler = null; try { child?.stdin.end(); if (child) { await killProcessByChildProcess(child); } } finally { this.rejectAllPending(new Error('Codex app-server disconnected')); this.connected = false; this.resetParserState(); } }
Summary
Review mode: follow-up after new commits
One issue remains: the new disconnect hook is invoked twice on normal cleanup, which makes the hook contract unsafe for future stateful callers. Residual testing gap: Not found coverage for the normal disconnect() path invoking the handler exactly once.
Testing
Not run (bun not available in this runner; static review only)
HAPI Bot
| this.rejectAllPending(disconnectError); | ||
| this.connected = false; | ||
| this.resetParserState(); | ||
| this.notifyDisconnected(disconnectError); |
There was a problem hiding this comment.
[MAJOR] disconnect() now calls the disconnect hook directly, but the terminated child still emits exit, which calls the same hook again at cli/src/codex/codexAppServerClient.ts:108. That means normal cleanup invokes setDisconnectHandler() twice. codexRemoteLauncher happens to tolerate it today because settleTurnInFlight() is idempotent, but this makes the new hook contract unsafe for any future stateful cleanup logic.
Suggested fix:
async disconnect(): Promise<void> {
if (!this.connected) {
return;
}
const child = this.process;
this.process = null;
this.disconnectHandler = null;
try {
child?.stdin.end();
if (child) {
await killProcessByChildProcess(child);
}
} finally {
this.rejectAllPending(new Error('Codex app-server disconnected'));
this.connected = false;
this.resetParserState();
}
}There was a problem hiding this comment.
Findings
- [Major] App-server disconnect now looks like a clean turn completion, but it does not clear thread lifecycle state. After
settleTurnInFlight()resolves, the loop continues withhasThreadstill true andthis.currentThreadIdstill pointing at the dead server thread, so the next queued message skipsthread/start/thread/resumeand callsturn/startagainst a freshly reconnected app-server with a stalethreadId. Evidencecli/src/codex/codexRemoteLauncher.ts:607,cli/src/codex/codexRemoteLauncher.ts:746,cli/src/codex/codexRemoteLauncher.ts:750,cli/src/codex/codexRemoteLauncher.ts:774.
Suggested fix:appServerClient.setDisconnectHandler(() => { turnInFlight = false; activeTurnModeHash = null; allowAnonymousTerminalEvent = false; this.currentTurnId = null; this.currentThreadId = null; hasThread = false; settleTurnInFlight(); });
Summary
Review mode: follow-up after new commits
One major regression remains: an unexpected app-server disconnect no longer fails the active turn, but the launcher also does not reset thread lifecycle state, so any subsequent queued message can be sent to a new app-server process with a stale threadId. Residual testing gap: Not found coverage for "disconnect during active turn, then another queued message arrives".
Testing
Not run (bun not available in this runner; static review only)
HAPI Bot
| }); | ||
|
|
||
| appServerClient.setDisconnectHandler(() => { | ||
| turnInFlight = false; |
There was a problem hiding this comment.
[MAJOR] This disconnect handler now resolves the active-turn wait, but it leaves the thread lifecycle intact. If the app-server dies mid-turn and another queued message arrives, the main loop resumes from await turnSettled, keeps hasThread === true, skips thread/start / thread/resume, and then calls turn/start with the stale threadId against the auto-reconnected app-server. That turns an app-server crash into a broken follow-up turn instead of a clean restart.
Suggested fix:
appServerClient.setDisconnectHandler(() => {
turnInFlight = false;
activeTurnModeHash = null;
allowAnonymousTerminalEvent = false;
this.currentTurnId = null;
this.currentThreadId = null;
hasThread = false;
settleTurnInFlight();
});There was a problem hiding this comment.
Findings
- [Major] Failed steer retries loop on the same active turn —
turn/steerfailures are requeued, but thefinallyblock immediately callsmaybeSteerQueuedMessages()again whileturnInFlightis still true and the same same-mode batch is at the head of the queue. A permanent app-server rejection therefore spins repeatedturn/steercalls instead of preserving the message for the nextturn/start, which can keep the active turn noisy/busy and starve the intended queued follow-up. Evidencecli/src/codex/codexRemoteLauncher.ts:296,cli/src/codex/codexRemoteLauncher.ts:301.
Suggested fix:let steeringSuspendedForTurn = false; const maybeSteerQueuedMessages = async () => { if (this.shouldExit || turnInFlight === false || steeringInFlight || steeringSuspendedForTurn) { return; } // ... try { // ... await appServerClient.steerTurn(/* ... */); } catch (error) { if (batch) { session.queue.unshift(batch.message, batch.mode); steeringSuspendedForTurn = true; } logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error); } finally { steeringInFlight = false; if (!steeringSuspendedForTurn && turnInFlight && session.queue.size() > 0) { void maybeSteerQueuedMessages(); } } }; // Reset when the active turn settles or when starting a new turn. steeringSuspendedForTurn = false;
Summary
Review mode: follow-up after new commits
One major issue remains: the steer-failure path says it keeps the message queued for the next turn, but it immediately retries the same failed steer while the current turn is still active. Residual risk: static review only; app-server behavior for rejected turn/steer was not exercised here.
Testing
Not run (security policy: PR code not executed; static review only)
HAPI Bot
| logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error); | ||
| } finally { | ||
| steeringInFlight = false; | ||
| if (turnInFlight && session.queue.size() > 0) { |
There was a problem hiding this comment.
[MAJOR] This requeue path immediately retries the same failed steer. After catch puts batch back at the front, finally sees turnInFlight && session.queue.size() > 0, calls maybeSteerQueuedMessages() again, and the guards still match the same active turn/mode. Any deterministic turn/steer rejection will loop repeated steer attempts instead of leaving the message queued for the next turn/start.
Suggested fix:
let steeringSuspendedForTurn = false;
const maybeSteerQueuedMessages = async () => {
if (this.shouldExit || turnInFlight === false || steeringInFlight || steeringSuspendedForTurn) {
return;
}
// ...
try {
// ...
await appServerClient.steerTurn(/* ... */);
} catch (error) {
if (batch) {
session.queue.unshift(batch.message, batch.mode);
steeringSuspendedForTurn = true;
}
logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
} finally {
steeringInFlight = false;
if (!steeringSuspendedForTurn && turnInFlight && session.queue.size() > 0) {
void maybeSteerQueuedMessages();
}
}
};
// Reset when the active turn settles or when starting a new turn.
steeringSuspendedForTurn = false;There was a problem hiding this comment.
Findings
- [Major] App-server disconnect silently drops the active prompt — the new disconnect handler clears turn/thread state and resolves
turnSettled, so ifcodex app-serverexits afterturn/startreturns but before a terminal turn event, execution resumes pastawait turnSettledas if the turn completed normally. The consumedmessageis not requeued and the existingcatchpath does not emitProcess exited unexpectedly, leaving the user with a ready session and a lost prompt. Evidencecli/src/codex/codexRemoteLauncher.ts:618,cli/src/codex/codexRemoteLauncher.ts:793.
Suggested fix:let appServerDisconnectError: Error | null = null; appServerClient.setDisconnectHandler((error) => { appServerDisconnectError = error; turnInFlight = false; steeringSuspendedForTurn = false; activeTurnModeHash = null; allowAnonymousTerminalEvent = false; this.currentTurnId = null; this.currentThreadId = null; hasThread = false; settleTurnInFlight(); }); appServerDisconnectError = null; turnInFlight = true; const turnSettled = new Promise<void>((resolve) => { resolveTurnSettled = resolve; }); // startTurn(...) await turnSettled; if (appServerDisconnectError) { throw appServerDisconnectError; }
Summary
Review mode: follow-up after new commits
The previous steer-retry loop was addressed, but the added disconnect recovery path now allows an in-flight Codex turn to be resolved as successful when the app-server dies before emitting a terminal event. Residual risk: static review only; no app-server process was executed.
Testing
Not run (automation). Suggested coverage: add a codexRemoteLauncher test where startTurn returns a turn id, the disconnect handler fires before turn/completed, and the session goes through the unexpected-exit path instead of emitting ready with the original prompt lost.
HAPI Bot
| this.currentTurnId = null; | ||
| this.currentThreadId = null; | ||
| hasThread = false; | ||
| settleTurnInFlight(); |
There was a problem hiding this comment.
[MAJOR] This resolves the active turn wait after clearing all active turn state, so an app-server exit that happens after turn/start returns but before turn/completed/task_failed makes await turnSettled continue normally. The prompt has already been consumed from session.queue, nothing requeues it, and the catch block that emits Process exited unexpectedly is bypassed.
Suggested fix:
let appServerDisconnectError: Error | null = null;
appServerClient.setDisconnectHandler((error) => {
appServerDisconnectError = error;
turnInFlight = false;
steeringSuspendedForTurn = false;
activeTurnModeHash = null;
allowAnonymousTerminalEvent = false;
this.currentTurnId = null;
this.currentThreadId = null;
hasThread = false;
settleTurnInFlight();
});
appServerDisconnectError = null;
turnInFlight = true;
const turnSettled = new Promise<void>((resolve) => {
resolveTurnSettled = resolve;
});
// startTurn(...)
await turnSettled;
if (appServerDisconnectError) {
throw appServerDisconnectError;
}There was a problem hiding this comment.
Findings
- [Major] Abort can replay a steered message after clearing the queue —
handleAbort()resetssession.queue, but an in-flightturn/steeruses the same abort signal and will reject afterward; this catch then unshifts the consumed batch back into the now-reset queue. A user pressing abort while a steering request is pending can therefore see the aborted follow-up message start as a fresh turn after the terminal event, instead of being discarded with the rest of the queue. Evidencecli/src/codex/codexRemoteLauncher.ts:297.
Suggested fix:} catch (error) { const isAbortError = error instanceof Error && error.name === 'AbortError'; if (batch && !isAbortError && !this.abortController.signal.aborted && !this.shouldExit) { session.queue.unshift(batch.message, batch.mode); steeringSuspendedForTurn = true; } logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error); }
Summary
Review mode: follow-up after new commits
The previous disconnect-loss finding is addressed in the latest commit. I found one new abort-path regression in the steering fallback: abort clears queued work, but the steer catch can put the in-flight message back after that clear.
Testing
Not run (automation: bun is not installed in this runner). Suggested coverage: add a codexRemoteLauncher test where a second message is being steered, handleAbort/abort signal fires before steerTurn rejects, and the message is not requeued or started as a later turn.
HAPI Bot
| signal: this.abortController.signal | ||
| }); | ||
| } catch (error) { | ||
| if (batch) { |
There was a problem hiding this comment.
[MAJOR] This requeues any consumed steering batch on every steerTurn failure. Because the steer request is passed this.abortController.signal, a user abort rejects this await after handleAbort() has already called session.queue.reset(). The catch then unshifts the aborted follow-up message back into the cleared queue, so once the active turn emits its terminal event the supposedly aborted message can run as a new turn.
Suggested fix:
} catch (error) {
const isAbortError = error instanceof Error && error.name === 'AbortError';
if (batch && !isAbortError && !this.abortController.signal.aborted && !this.shouldExit) {
session.queue.unshift(batch.message, batch.mode);
steeringSuspendedForTurn = true;
}
logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
}There was a problem hiding this comment.
Findings
- [Major] Steering consumes queued localIds before the steer RPC succeeds, so a failed or aborted
turn/steercan mark still-queued work as already sent in the web UI. Evidencecli/src/codex/codexRemoteLauncher.ts:280, related contextcli/src/utils/MessageQueue2.ts:287,cli/src/agent/sessionBase.ts:73,web/src/lib/message-window-store.ts:543.
Summary
Review mode: follow-up after new commits
The previous abort-path finding is fixed in the latest head. I found one new regression in the steering path: it uses the normal queue consumption API, which emits messages-consumed before turn/steer succeeds, so the UI can flip a queued follow-up message to sent even when that steer is requeued or dropped.
Testing
Not run (automation: bun is not installed in this runner). Suggested coverage: add a codexRemoteLauncher test that pushes a second message with a localId, forces steerTurn to reject, and asserts no messages-consumed ack is emitted until that message is actually accepted by Codex.
HAPI Bot
| let steerSignal: AbortSignal | null = null; | ||
|
|
||
| try { | ||
| batch = await session.queue.waitForMessagesAndGetAsString(); |
There was a problem hiding this comment.
[MAJOR] waitForMessagesAndGetAsString() removes the queued batch and immediately fires onBatchConsumed, which AgentSessionBase forwards to messages-consumed. In the normal turn-start path that is fine, but here the batch still depends on turn/steer succeeding. If steerTurn rejects, this code either requeues or drops the message after the web has already marked the optimistic bubble as sent, so the UI loses the correct queued/retry state.
Suggested fix:
// Reserve the steering batch without acknowledging localIds first.
const batch = await session.queue.reserveMessagesAsString()
try {
await appServerClient.steerTurn({
threadId: this.currentThreadId,
expectedTurnId: this.currentTurnId,
input: [{ type: 'text', text: batch.message }]
}, { signal: this.abortController.signal })
session.client.emitMessagesConsumed(batch.localIds)
session.queue.commitReservedBatch(batch.token)
} catch (error) {
session.queue.releaseReservedBatch(batch.token)
throw error
}|
Closing this PR and keeping focus on the clean one (#471). |
There was a problem hiding this comment.
Findings
- No issues identified in the added/modified lines of the latest head.
Summary
Review mode: follow-up after new commits
I re-checked the latest PR head, including the previous bot finding. The queued steering path now keeps batches queued until turn/steer succeeds, and the new turn-lifecycle changes look consistent with the added coverage around double-start, abort, delayed turn ids, and reconnect.
Residual risk: I could not run the CLI Vitest suite in this runner because bun is not installed, so this pass is based on code review and the test cases added in the diff.
Testing
Not run (automation: bun is not installed in this runner)
HAPI Bot
Summary
Add support for steering an active Codex app-server turn with queued messages.
turn/steersupport to the Codex app-server clientturn/steerWhy
Codex can accept additional instructions while a task is running. HAPI previously queued those messages for a later turn, and could also start another turn before the active turn had emitted a terminal event, which could leave sessions stuck in a thinking state.
Validation
cd cli && bun x vitest run src/codex/codexRemoteLauncher.test.tsNotes
AI-assisted with GPT-5.4.