Skip to content

Support steering active Codex turns from queued messages#472

Closed
Liu-KM wants to merge 9 commits intotiann:mainfrom
Liu-KM:liu/pr-codex-mid-run-steering
Closed

Support steering active Codex turns from queued messages#472
Liu-KM wants to merge 9 commits intotiann:mainfrom
Liu-KM:liu/pr-codex-mid-run-steering

Conversation

@Liu-KM
Copy link
Copy Markdown

@Liu-KM Liu-KM commented Apr 16, 2026

Summary

Add support for steering an active Codex app-server turn with queued messages.

  • add typed turn/steer support to the Codex app-server client
  • when a same-mode message arrives while a turn is active, send it to turn/steer
  • requeue messages if steering fails
  • keep turn lifecycle state active until a terminal turn event arrives, preventing a second turn from starting too early

Why

Codex can accept additional instructions while a task is running. HAPI previously queued those messages for a later turn, and could also start another turn before the active turn had emitted a terminal event, which could leave sessions stuck in a thinking state.

Validation

  • cd cli && bun x vitest run src/codex/codexRemoteLauncher.test.ts

Notes

AI-assisted with GPT-5.4.

@Liu-KM Liu-KM marked this pull request as ready for review April 16, 2026 15:48
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Messages queued before the turn id is known are not re-checked for steering, so they can fall through to the next turn. Evidence cli/src/codex/codexRemoteLauncher.ts:261, cli/src/codex/codexRemoteLauncher.ts:771.
    Suggested fix:
    if (turnId) {
        this.currentTurnId = turnId;
        void maybeSteerQueuedMessages();
    }
  • [Major] await turnSettled can hang indefinitely if the app-server exits after turn/start succeeds but before a terminal event arrives. Evidence cli/src/codex/codexRemoteLauncher.ts:762, cli/src/codex/codexRemoteLauncher.ts:776, context cli/src/codex/codexAppServerClient.ts:99.
    Suggested fix:
    appServerClient.setDisconnectHandler(() => {
        turnInFlight = false;
        activeTurnModeHash = null;
        this.currentTurnId = null;
        settleTurnInFlight();
    });

Summary
Review mode: initial
The new steering path still misses one active-turn window and can deadlock the remote launcher on app-server disconnect. Residual testing gap: not found coverage for messages queued before turn_id becomes available or for disconnects after turn/start.

Testing
Not run (automation; static review only per prompt constraints)

HAPI Bot

if (this.shouldExit || turnInFlight === false || steeringInFlight) {
return;
}
if (!this.currentThreadId || !this.currentTurnId || !activeTurnModeHash) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This early return means a same-mode message that lands after turnInFlight = true but before currentTurnId is populated will miss the steering window. setOnMessage() fires once, exits here, and nothing re-runs maybeSteerQueuedMessages() when startTurn() or turn/started later provides the turn id, so that message waits for a later turn instead of steering the active one.

Suggested fix:

if (turnId) {
    this.currentTurnId = turnId;
    void maybeSteerQueuedMessages();
}

} else if (!this.currentTurnId) {
allowAnonymousTerminalEvent = true;
}
await turnSettled;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] turnSettled is only resolved from terminal notifications. If the app-server exits after turn/start returns but before turn/completed, CodexAppServerClient only rejects pending RPCs (cli/src/codex/codexAppServerClient.ts:99), so this await never resolves and the session stays stuck in thinking while cleanup never runs.

Suggested fix:

appServerClient.setDisconnectHandler(() => {
    turnInFlight = false;
    activeTurnModeHash = null;
    this.currentTurnId = null;
    settleTurnInFlight();
});

@Liu-KM Liu-KM force-pushed the liu/pr-codex-mid-run-steering branch from 9c60274 to f9b38bc Compare April 20, 2026 04:10
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Normal shutdown now fires the disconnect callback twice, because disconnect() calls notifyDisconnected(...) and the killed child still emits exit, which calls it again. In codexRemoteLauncher that callback settles active-turn state, so future disconnect consumers will also need to be idempotent for every cleanup path. Evidence cli/src/codex/codexAppServerClient.ts:100, cli/src/codex/codexAppServerClient.ts:206.
    Suggested fix:
    async disconnect(): Promise<void> {
        if (!this.connected) {
            return;
        }
    
        const child = this.process;
        this.process = null;
        this.disconnectHandler = null;
    
        try {
            child?.stdin.end();
            if (child) {
                await killProcessByChildProcess(child);
            }
        } finally {
            this.rejectAllPending(new Error('Codex app-server disconnected'));
            this.connected = false;
            this.resetParserState();
        }
    }

Summary
Review mode: follow-up after new commits
One issue remains: the new disconnect hook is invoked twice on normal cleanup, which makes the hook contract unsafe for future stateful callers. Residual testing gap: Not found coverage for the normal disconnect() path invoking the handler exactly once.

Testing
Not run (bun not available in this runner; static review only)

HAPI Bot

Comment thread cli/src/codex/codexAppServerClient.ts Outdated
this.rejectAllPending(disconnectError);
this.connected = false;
this.resetParserState();
this.notifyDisconnected(disconnectError);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] disconnect() now calls the disconnect hook directly, but the terminated child still emits exit, which calls the same hook again at cli/src/codex/codexAppServerClient.ts:108. That means normal cleanup invokes setDisconnectHandler() twice. codexRemoteLauncher happens to tolerate it today because settleTurnInFlight() is idempotent, but this makes the new hook contract unsafe for any future stateful cleanup logic.

Suggested fix:

async disconnect(): Promise<void> {
    if (!this.connected) {
        return;
    }

    const child = this.process;
    this.process = null;
    this.disconnectHandler = null;

    try {
        child?.stdin.end();
        if (child) {
            await killProcessByChildProcess(child);
        }
    } finally {
        this.rejectAllPending(new Error('Codex app-server disconnected'));
        this.connected = false;
        this.resetParserState();
    }
}

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] App-server disconnect now looks like a clean turn completion, but it does not clear thread lifecycle state. After settleTurnInFlight() resolves, the loop continues with hasThread still true and this.currentThreadId still pointing at the dead server thread, so the next queued message skips thread/start/thread/resume and calls turn/start against a freshly reconnected app-server with a stale threadId. Evidence cli/src/codex/codexRemoteLauncher.ts:607, cli/src/codex/codexRemoteLauncher.ts:746, cli/src/codex/codexRemoteLauncher.ts:750, cli/src/codex/codexRemoteLauncher.ts:774.
    Suggested fix:
    appServerClient.setDisconnectHandler(() => {
        turnInFlight = false;
        activeTurnModeHash = null;
        allowAnonymousTerminalEvent = false;
        this.currentTurnId = null;
        this.currentThreadId = null;
        hasThread = false;
        settleTurnInFlight();
    });

Summary
Review mode: follow-up after new commits
One major regression remains: an unexpected app-server disconnect no longer fails the active turn, but the launcher also does not reset thread lifecycle state, so any subsequent queued message can be sent to a new app-server process with a stale threadId. Residual testing gap: Not found coverage for "disconnect during active turn, then another queued message arrives".

Testing
Not run (bun not available in this runner; static review only)

HAPI Bot

});

appServerClient.setDisconnectHandler(() => {
turnInFlight = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This disconnect handler now resolves the active-turn wait, but it leaves the thread lifecycle intact. If the app-server dies mid-turn and another queued message arrives, the main loop resumes from await turnSettled, keeps hasThread === true, skips thread/start / thread/resume, and then calls turn/start with the stale threadId against the auto-reconnected app-server. That turns an app-server crash into a broken follow-up turn instead of a clean restart.

Suggested fix:

appServerClient.setDisconnectHandler(() => {
    turnInFlight = false;
    activeTurnModeHash = null;
    allowAnonymousTerminalEvent = false;
    this.currentTurnId = null;
    this.currentThreadId = null;
    hasThread = false;
    settleTurnInFlight();
});

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Failed steer retries loop on the same active turn — turn/steer failures are requeued, but the finally block immediately calls maybeSteerQueuedMessages() again while turnInFlight is still true and the same same-mode batch is at the head of the queue. A permanent app-server rejection therefore spins repeated turn/steer calls instead of preserving the message for the next turn/start, which can keep the active turn noisy/busy and starve the intended queued follow-up. Evidence cli/src/codex/codexRemoteLauncher.ts:296, cli/src/codex/codexRemoteLauncher.ts:301.
    Suggested fix:
    let steeringSuspendedForTurn = false;
    
    const maybeSteerQueuedMessages = async () => {
        if (this.shouldExit || turnInFlight === false || steeringInFlight || steeringSuspendedForTurn) {
            return;
        }
        // ...
        try {
            // ...
            await appServerClient.steerTurn(/* ... */);
        } catch (error) {
            if (batch) {
                session.queue.unshift(batch.message, batch.mode);
                steeringSuspendedForTurn = true;
            }
            logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
        } finally {
            steeringInFlight = false;
            if (!steeringSuspendedForTurn && turnInFlight && session.queue.size() > 0) {
                void maybeSteerQueuedMessages();
            }
        }
    };
    
    // Reset when the active turn settles or when starting a new turn.
    steeringSuspendedForTurn = false;

Summary
Review mode: follow-up after new commits
One major issue remains: the steer-failure path says it keeps the message queued for the next turn, but it immediately retries the same failed steer while the current turn is still active. Residual risk: static review only; app-server behavior for rejected turn/steer was not exercised here.

Testing
Not run (security policy: PR code not executed; static review only)

HAPI Bot

Comment thread cli/src/codex/codexRemoteLauncher.ts Outdated
logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
} finally {
steeringInFlight = false;
if (turnInFlight && session.queue.size() > 0) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This requeue path immediately retries the same failed steer. After catch puts batch back at the front, finally sees turnInFlight && session.queue.size() > 0, calls maybeSteerQueuedMessages() again, and the guards still match the same active turn/mode. Any deterministic turn/steer rejection will loop repeated steer attempts instead of leaving the message queued for the next turn/start.

Suggested fix:

let steeringSuspendedForTurn = false;

const maybeSteerQueuedMessages = async () => {
    if (this.shouldExit || turnInFlight === false || steeringInFlight || steeringSuspendedForTurn) {
        return;
    }
    // ...
    try {
        // ...
        await appServerClient.steerTurn(/* ... */);
    } catch (error) {
        if (batch) {
            session.queue.unshift(batch.message, batch.mode);
            steeringSuspendedForTurn = true;
        }
        logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
    } finally {
        steeringInFlight = false;
        if (!steeringSuspendedForTurn && turnInFlight && session.queue.size() > 0) {
            void maybeSteerQueuedMessages();
        }
    }
};

// Reset when the active turn settles or when starting a new turn.
steeringSuspendedForTurn = false;

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] App-server disconnect silently drops the active prompt — the new disconnect handler clears turn/thread state and resolves turnSettled, so if codex app-server exits after turn/start returns but before a terminal turn event, execution resumes past await turnSettled as if the turn completed normally. The consumed message is not requeued and the existing catch path does not emit Process exited unexpectedly, leaving the user with a ready session and a lost prompt. Evidence cli/src/codex/codexRemoteLauncher.ts:618, cli/src/codex/codexRemoteLauncher.ts:793.
    Suggested fix:
    let appServerDisconnectError: Error | null = null;
    
    appServerClient.setDisconnectHandler((error) => {
        appServerDisconnectError = error;
        turnInFlight = false;
        steeringSuspendedForTurn = false;
        activeTurnModeHash = null;
        allowAnonymousTerminalEvent = false;
        this.currentTurnId = null;
        this.currentThreadId = null;
        hasThread = false;
        settleTurnInFlight();
    });
    
    appServerDisconnectError = null;
    turnInFlight = true;
    const turnSettled = new Promise<void>((resolve) => {
        resolveTurnSettled = resolve;
    });
    // startTurn(...)
    await turnSettled;
    if (appServerDisconnectError) {
        throw appServerDisconnectError;
    }

Summary
Review mode: follow-up after new commits
The previous steer-retry loop was addressed, but the added disconnect recovery path now allows an in-flight Codex turn to be resolved as successful when the app-server dies before emitting a terminal event. Residual risk: static review only; no app-server process was executed.

Testing
Not run (automation). Suggested coverage: add a codexRemoteLauncher test where startTurn returns a turn id, the disconnect handler fires before turn/completed, and the session goes through the unexpected-exit path instead of emitting ready with the original prompt lost.

HAPI Bot

this.currentTurnId = null;
this.currentThreadId = null;
hasThread = false;
settleTurnInFlight();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This resolves the active turn wait after clearing all active turn state, so an app-server exit that happens after turn/start returns but before turn/completed/task_failed makes await turnSettled continue normally. The prompt has already been consumed from session.queue, nothing requeues it, and the catch block that emits Process exited unexpectedly is bypassed.

Suggested fix:

let appServerDisconnectError: Error | null = null;

appServerClient.setDisconnectHandler((error) => {
    appServerDisconnectError = error;
    turnInFlight = false;
    steeringSuspendedForTurn = false;
    activeTurnModeHash = null;
    allowAnonymousTerminalEvent = false;
    this.currentTurnId = null;
    this.currentThreadId = null;
    hasThread = false;
    settleTurnInFlight();
});

appServerDisconnectError = null;
turnInFlight = true;
const turnSettled = new Promise<void>((resolve) => {
    resolveTurnSettled = resolve;
});
// startTurn(...)
await turnSettled;
if (appServerDisconnectError) {
    throw appServerDisconnectError;
}

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Abort can replay a steered message after clearing the queue — handleAbort() resets session.queue, but an in-flight turn/steer uses the same abort signal and will reject afterward; this catch then unshifts the consumed batch back into the now-reset queue. A user pressing abort while a steering request is pending can therefore see the aborted follow-up message start as a fresh turn after the terminal event, instead of being discarded with the rest of the queue. Evidence cli/src/codex/codexRemoteLauncher.ts:297.
    Suggested fix:
    } catch (error) {
        const isAbortError = error instanceof Error && error.name === 'AbortError';
        if (batch && !isAbortError && !this.abortController.signal.aborted && !this.shouldExit) {
            session.queue.unshift(batch.message, batch.mode);
            steeringSuspendedForTurn = true;
        }
        logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
    }

Summary
Review mode: follow-up after new commits
The previous disconnect-loss finding is addressed in the latest commit. I found one new abort-path regression in the steering fallback: abort clears queued work, but the steer catch can put the in-flight message back after that clear.

Testing
Not run (automation: bun is not installed in this runner). Suggested coverage: add a codexRemoteLauncher test where a second message is being steered, handleAbort/abort signal fires before steerTurn rejects, and the message is not requeued or started as a later turn.

HAPI Bot

Comment thread cli/src/codex/codexRemoteLauncher.ts Outdated
signal: this.abortController.signal
});
} catch (error) {
if (batch) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This requeues any consumed steering batch on every steerTurn failure. Because the steer request is passed this.abortController.signal, a user abort rejects this await after handleAbort() has already called session.queue.reset(). The catch then unshifts the aborted follow-up message back into the cleared queue, so once the active turn emits its terminal event the supposedly aborted message can run as a new turn.

Suggested fix:

} catch (error) {
    const isAbortError = error instanceof Error && error.name === 'AbortError';
    if (batch && !isAbortError && !this.abortController.signal.aborted && !this.shouldExit) {
        session.queue.unshift(batch.message, batch.mode);
        steeringSuspendedForTurn = true;
    }
    logger.debug('[Codex] Failed to steer active turn; keeping message queued for next turn', error);
}

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Steering consumes queued localIds before the steer RPC succeeds, so a failed or aborted turn/steer can mark still-queued work as already sent in the web UI. Evidence cli/src/codex/codexRemoteLauncher.ts:280, related context cli/src/utils/MessageQueue2.ts:287, cli/src/agent/sessionBase.ts:73, web/src/lib/message-window-store.ts:543.

Summary
Review mode: follow-up after new commits
The previous abort-path finding is fixed in the latest head. I found one new regression in the steering path: it uses the normal queue consumption API, which emits messages-consumed before turn/steer succeeds, so the UI can flip a queued follow-up message to sent even when that steer is requeued or dropped.

Testing
Not run (automation: bun is not installed in this runner). Suggested coverage: add a codexRemoteLauncher test that pushes a second message with a localId, forces steerTurn to reject, and asserts no messages-consumed ack is emitted until that message is actually accepted by Codex.

HAPI Bot

Comment thread cli/src/codex/codexRemoteLauncher.ts Outdated
let steerSignal: AbortSignal | null = null;

try {
batch = await session.queue.waitForMessagesAndGetAsString();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] waitForMessagesAndGetAsString() removes the queued batch and immediately fires onBatchConsumed, which AgentSessionBase forwards to messages-consumed. In the normal turn-start path that is fine, but here the batch still depends on turn/steer succeeding. If steerTurn rejects, this code either requeues or drops the message after the web has already marked the optimistic bubble as sent, so the UI loses the correct queued/retry state.

Suggested fix:

// Reserve the steering batch without acknowledging localIds first.
const batch = await session.queue.reserveMessagesAsString()

try {
    await appServerClient.steerTurn({
        threadId: this.currentThreadId,
        expectedTurnId: this.currentTurnId,
        input: [{ type: 'text', text: batch.message }]
    }, { signal: this.abortController.signal })

    session.client.emitMessagesConsumed(batch.localIds)
    session.queue.commitReservedBatch(batch.token)
} catch (error) {
    session.queue.releaseReservedBatch(batch.token)
    throw error
}

@Liu-KM
Copy link
Copy Markdown
Author

Liu-KM commented Apr 20, 2026

Closing this PR and keeping focus on the clean one (#471).

@Liu-KM Liu-KM closed this Apr 20, 2026
@Liu-KM Liu-KM deleted the liu/pr-codex-mid-run-steering branch April 20, 2026 13:29
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • No issues identified in the added/modified lines of the latest head.

Summary
Review mode: follow-up after new commits
I re-checked the latest PR head, including the previous bot finding. The queued steering path now keeps batches queued until turn/steer succeeds, and the new turn-lifecycle changes look consistent with the added coverage around double-start, abort, delayed turn ids, and reconnect.
Residual risk: I could not run the CLI Vitest suite in this runner because bun is not installed, so this pass is based on code review and the test cases added in the diff.

Testing
Not run (automation: bun is not installed in this runner)

HAPI Bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant