Fix Codex session provider repair after provider switch #704

Open
YBloom wants to merge 3 commits into jlcodes99:main from YBloom:fix/codex-session-provider-repair


YBloom commented May 9, 2026

Summary

  • repair stale Codex session provider metadata after switching the default profile between OAuth, API Key, and Local API Service providers
  • repair the affected Codex profile before instance launch when bound account injection changes its provider
  • add directory-level session visibility repair coverage for rollout files and state_5.sqlite rows
  • fix an existing browser timer type so frontend typecheck/build can pass

Verification

  • npm run typecheck
  • npm run build
  • git diff --check

Not run

  • cargo fmt: not run because this local machine does not have rustfmt installed
  • cargo test codex_session_visibility: not run because this local machine does not have cargo installed

Copilot AI review requested due to automatic review settings May 9, 2026 07:21

Copilot AI left a comment


Pull request overview

This PR adds an automatic “session visibility” repair path for Codex profiles when the effective provider changes (OAuth / API Key / Local API Service), ensuring rollout files and state_5.sqlite thread metadata don’t remain pinned to a stale provider after switching.

Changes:

  • Add a directory-scoped repair API (repair_session_visibility_for_dir) plus unit tests covering rollout + SQLite rewrites and no-op behavior.
  • Trigger automatic repairs after account switches and after enabling local access for the default Codex home.
  • Trigger automatic repairs before instance launch when bound-account injection changes a profile’s provider, and fix a frontend timer type so TS build/typecheck passes.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/stores/usePlatformLayoutStore.ts | Adjust timer handle typing for browser window.setTimeout usage to satisfy frontend typechecking. |
| src-tauri/src/modules/codex_session_visibility.rs | Introduce single-directory repair helper + add tests for rollout/SQLite provider repair behavior. |
| src-tauri/src/commands/codex.rs | Invoke automatic session visibility repair after default-home provider changes (account switch / local access activate). |
| src-tauri/src/commands/codex_instance.rs | Repair profile session visibility pre-launch when bound account injection alters provider. |
| CHANGELOG.zh-CN.md | Document the Codex provider-switch repair behavior. |
| CHANGELOG.md | Document the Codex provider-switch repair behavior (English). |


Comment on lines +217 to +224

    let backup_dir = backup_instance_files(
        data_dir,
        &rollout_changes,
        sqlite_rows_to_update > 0,
        instance_id,
        &target_provider,
    )?;
    let backup_dir_string = backup_dir.to_string_lossy().to_string();

Comment on lines +205 to +214

    return Ok(CodexSessionVisibilityRepairItem {
        instance_id: instance_id.to_string(),
        instance_name: instance_name.to_string(),
        target_provider,
        changed_rollout_file_count: 0,
        updated_sqlite_row_count: 0,
        skipped_sqlite_file: sqlite_scan.skipped_unusable_database,
        backup_dir: None,
        running: false,
    });

Comment on lines +110 to +114

    match modules::codex_session_visibility::repair_session_visibility_for_dir(
        profile_dir,
        "__launch__",
        "启动实例",
    ) {

@Str1ckl4nd

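Thanks for fixing this. What I see locally looks like the same class of problem as this PR, but the current fix may still miss one key scenario.

Below is a redacted timeline captured with a local read-only monitoring script. The action was clicking the launch buttons for the subscription account / Local API Service on Cockpit's Codex page:

22:17:47 Initial state
config_provider = codex_local_access
auth = apikey:agt
sqlite_threads = { codex_local_access: 366 }

22:17:57 After switching to OAuth / subscription account
config_provider = openai
auth = unknown:none
sqlite_threads = { codex_local_access: 366 }

22:18:07-22:18:24 Mixed state while the repair is in progress
sqlite_threads = { codex_local_access: 50, openai: 316 }
sqlite_threads = { codex_local_access: 103, openai: 263 }
sqlite_threads = { openai: 366 }

22:19:19 After switching back to the Local API Service
config_provider = codex_local_access
auth = apikey:agt
sqlite_threads = { openai: 367 }

The dangerous state is the last one: Codex already holds the Local API Service agt... key, but the historical sessions are still marked as openai. Resuming an old session at this point can send agt... to https://api.openai.com/v1/responses, which ends in: 401 Incorrect API key provided: agt_code...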
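
The mismatch in this timeline can be checked mechanically. A minimal sketch, assuming the per-provider thread counts have already been read from state_5.sqlite into a map; the helper name stale_thread_count is hypothetical and is not part of the PR:

```rust
use std::collections::BTreeMap;

/// Sum the threads whose recorded provider differs from the provider that
/// config.toml currently selects. A non-zero result means resuming an old
/// session could pair the current credentials with a stale provider.
/// (Hypothetical helper for illustration; not taken from the PR.)
fn stale_thread_count(config_provider: &str, sqlite_threads: &BTreeMap<String, u64>) -> u64 {
    sqlite_threads
        .iter()
        .filter(|(provider, _)| provider.as_str() != config_provider)
        .map(|(_, count)| *count)
        .sum()
}
```

Applied to the 22:19:19 snapshot (config_provider = codex_local_access, sqlite_threads = { openai: 367 }) this yields 367; a launch or resume guard could refuse to proceed until it reaches 0.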

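I observed two problems locally:

  1. The provider/auth switch and the historical-session provider repair are not one atomic flow. config.toml / auth.json change first, and state_5.sqlite and the rollout metadata are only repaired gradually afterwards. If Codex launches or resumes a session inside that window, it ends up in a mixed-provider state.
  2. A rollout file can contain more than one session_meta. Locally I had 37 rollout files with 53 leftover session_meta.payload.model_provider values in total; SQLite already looked repaired at that point, but the rollout files still had leftovers. The current implementation still appears to use read_first_line() / updated_first_line, which means it may only repair the first session_meta in each rollout.

Suggested fix:

  • The provider switch path needs to complete synchronously: record before_provider, write the target auth/config, then block until the profile repair finishes before launching or resuming a session, covering at least state_5.sqlite and the rollout files.
  • Rollout repair should not touch only the first line; it should scan the whole JSONL file: parse each line, and if row.type == "session_meta", set row.payload.model_provider to the target provider. Write the file back atomically only when something changed.
  • Add a test: a single rollout contains multiple session_meta rows and only the second or a later one carries the old provider; the expectation is that every stale session_meta is repaired, not just the first line.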
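
The full-scan rule and the proposed regression case can be sketched as follows. This is a minimal illustration, assuming each JSONL line has already been parsed into a row; the RolloutRow type and the repair_rollout_rows function are hypothetical stand-ins rather than the PR's code, and JSON parsing and file I/O are deliberately left out.

```rust
/// One parsed rollout line, reduced to the two fields the repair looks at.
/// (Hypothetical type; real rows carry a full JSON payload.)
#[derive(Debug, Clone, PartialEq)]
struct RolloutRow {
    row_type: String,
    model_provider: Option<String>,
}

/// Set the provider on every `session_meta` row, not just the first one.
/// A `session_meta` row with no provider is also set to the target.
/// Returns how many rows changed, so the caller can skip the write-back
/// when the count is zero.
fn repair_rollout_rows(rows: &mut [RolloutRow], target_provider: &str) -> usize {
    let mut changed = 0;
    for row in rows.iter_mut() {
        if row.row_type != "session_meta" {
            continue; // other row types are left untouched
        }
        if row.model_provider.as_deref() != Some(target_provider) {
            row.model_provider = Some(target_provider.to_string());
            changed += 1;
        }
    }
    changed
}
```

For a rollout whose first session_meta already matches the target and whose later session_meta is stale, this reports one changed row; a second pass over the same rows reports zero, which is the signal to leave the file alone.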

The logic of my local temporary repair script is roughly:

1. Read the target provider from config.toml: model_provider, defaulting to openai.
2. UPDATE threads SET model_provider = target WHERE model_provider <> target.
3. Walk sessions/**/rollout-*.jsonl and archived_sessions/**/rollout-*.jsonl:
   parse each line as JSON;
   if row.type == "session_meta", set row.payload.model_provider = target;
   write the file back atomically when it changed.
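
The final sub-step, an atomic write-back only when the content changed, is commonly done by writing a temporary file next to the target and renaming it over the original. A minimal sketch using only the Rust standard library; the function name write_back_if_changed and the ".repair.tmp" suffix are assumptions for illustration, not what the script or the PR uses:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Replace `path` with `updated` only if it differs from `original`.
/// Returns Ok(true) when the file was rewritten, Ok(false) for a no-op.
/// (Hypothetical helper; name and temp suffix are illustrative.)
fn write_back_if_changed(path: &Path, original: &str, updated: &str) -> io::Result<bool> {
    if original == updated {
        return Ok(false); // nothing changed, leave the file untouched
    }
    // Keep the temp file in the same directory so the rename stays on one
    // filesystem and replaces the target in a single step.
    let mut tmp_name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?
        .to_os_string();
    tmp_name.push(".repair.tmp");
    let tmp_path = path.with_file_name(tmp_name);
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(updated.as_bytes())?;
        tmp.sync_all()?; // flush to disk before the rename makes it visible
    }
    fs::rename(&tmp_path, path)?;
    Ok(true)
}
```

Readers that open the rollout after the rename see the complete new content. A process that opened the file before the rename may keep reading the old content through its existing handle.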

After the full repair, the local audit converged to:

config_provider = codex_local_access
sqlite_threads = { codex_local_access: 367 }
rollout_session_meta = { codex_local_access: 440 }

YBloom force-pushed the fix/codex-session-provider-repair branch from 35f26c7 to c5ea9f7 on May 13, 2026 06:18
YBloom commented May 13, 2026

Updated the PR branch with a narrower follow-up for the stale rollout metadata case.

What changed:

  • The provider switch paths already block on repair_session_visibility_for_dir after writing target auth/config and before launch, for both default Codex account switches and Local API Service activation.
  • Rollout repair no longer rewrites only the first line. It now scans the full JSONL file and updates every type == "session_meta" row whose payload.model_provider is stale, then writes the file atomically only when something changed.
  • Added a regression test where the first session_meta already matches the target provider but a later session_meta is stale; the repair now updates the stale row so that both carry the target provider.
  • Also kept the OAuth reverse-switch fix that removes legacy top-level base_url from config.toml.

Validation I could run locally:

  • npm run typecheck passed
  • npm run build passed
  • git diff --check passed

Not run locally:

  • cargo fmt / Rust tests, because this machine still does not have cargo / rustfmt installed.

Extra local observation while reproducing: after the on-disk repair converged, an old orphaned Codex app-server process that started days earlier was still holding an old rollout file descriptor and could continue sending requests with a cached stale provider. The PR fix reduces the startup/switch window, but already-running old app-server processes may still need to be restarted once after applying the fix.
