feat(checkpoint): add context rollback #547

Closed
phantom5099 wants to merge 14 commits into 1024XEngineer:main from phantom5099:main

@phantom5099 phantom5099 commented May 3, 2026

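Related issue: #521

Background

NeoCode lacks a unified mechanism for code rollback, session rollback, and run recovery. When a user asks the agent to modify code and then finds it went in the wrong direction, the only options are the user's own Git history or a manual undo, and session state such as the transcript, todo, and plan still holds the polluted content. Specific gaps:

  1. Code state cannot be rolled back independently: there is no managed in-repo code checkpoint, so code modified by the agent cannot be restored precisely.
  2. Code and session share no bound restore point: the session's transcript, task/todo, and plan can be persisted, but they are not tied to a code baseline as a single restore point, so code and session state easily drift apart after a rollback.
  3. Run recovery relies on guesswork: there is no persisted run-recovery object, so after a process interruption the system cannot reliably answer "which step did the current run reach".
  4. Individual turns cannot be rolled back: session-only interactions such as plan mode and build mode have no checkpoint, so the user cannot return to the previous turn's state.
  5. Compact is unprotected: compact rewrites session history without first taking a snapshot, so a poor result cannot be rolled back.
  6. The shadow repository cannot self-heal: when the bare repository is corrupted, the system fails outright.
  7. The whole chain breaks when Git is unavailable: checkpoints do not work in an environment without git.
  8. Gateway lacks RPC interfaces: external clients cannot query, restore, or undo checkpoints.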

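Solution overview

Three-layer recovery model

| Layer | Responsibility | Implementation |
| --- | --- | --- |
| Code rollback | Save and restore the in-repo file tree | ShadowRepository (bare git) |
| Session rollback | Save and restore transcript/todo/plan/skills | SessionCheckpoint (SQLite) |
| Run recovery | Record and restore the position where a run was interrupted | ResumeCheckpoint (SQLite) |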

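Per-Turn Checkpoint

A checkpoint is created automatically at the start of every turn, so the user can roll back to the state before any turn began:

At the start of each turn:
  ├─ shadowRepo available + code changes → full checkpoint (code + session)
  ├─ shadowRepo available + no code changes → session-only checkpoint
  ├─ shadowRepo unavailable → session-only checkpoint (degraded)
  └─ before a compact operation → session-only checkpoint (triggered independently)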
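
The first three branches of the tree can be read as a small pure function. The following is only a sketch of that reading: `CheckpointKind`, `TurnState`, and `decideCheckpoint` are hypothetical names chosen for illustration, not identifiers from this PR, and the compact branch is left out because it is triggered separately.

```go
package main

import "fmt"

// CheckpointKind is a hypothetical label for what a per-turn checkpoint captures.
type CheckpointKind string

const (
	KindFull        CheckpointKind = "full"         // code + session
	KindSessionOnly CheckpointKind = "session_only" // session state only
)

// TurnState holds the two inputs the decision tree looks at (assumed shape).
type TurnState struct {
	ShadowRepoAvailable bool // false when git is missing or the shadow repo could not be initialized
	HasCodeChanges      bool // result of a no-op check against the previous code snapshot
}

// decideCheckpoint returns the checkpoint kind and whether it counts as degraded.
func decideCheckpoint(s TurnState) (kind CheckpointKind, degraded bool) {
	if !s.ShadowRepoAvailable {
		return KindSessionOnly, true
	}
	if s.HasCodeChanges {
		return KindFull, false
	}
	return KindSessionOnly, false
}

func main() {
	states := []TurnState{
		{ShadowRepoAvailable: true, HasCodeChanges: true},
		{ShadowRepoAvailable: true, HasCodeChanges: false},
		{ShadowRepoAvailable: false, HasCodeChanges: true},
	}
	for _, s := range states {
		kind, degraded := decideCheckpoint(s)
		fmt.Printf("%+v -> %s (degraded=%v)\n", s, kind, degraded)
	}
}
```

Keeping the decision free of I/O makes each branch easy to unit test without a real shadow repository.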

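Restore flow

  1. Validate the checkpoint state (available + restorable + session matches)
  2. Detect conflicts (git diff --name-status comparing the target commit with the current working tree)
  3. Create a pre_restore_guard snapshot (used by undo)
  4. Restore code (git checkout <hash> -- .)
  5. Restore the session (within a transaction: delete old messages, re-insert, update the session head)
  6. Mark subsequent checkpoints as restored
  7. Refresh runtime snapshots (RuntimeSnapshot / TodoSnapshot)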
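
Step 2 depends on reading the output of `git diff --name-status`, which prints one tab-separated line per changed path (a status such as `M`, `A`, `D`, or `R100`, then the path, and for renames or copies a second path). A minimal parsing sketch follows; `Change` and `parseNameStatus` are hypothetical names, and which statuses the PR's `DetectConflicts` actually treats as conflicts is not stated here, so that policy is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Change is one entry from `git diff --name-status` output.
type Change struct {
	Status string // e.g. "M", "A", "D", "R100"
	Path   string // for renames and copies, the destination path
}

// parseNameStatus splits raw name-status output into entries.
func parseNameStatus(out string) []Change {
	var changes []Change
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimRight(line, "\r")
		if line == "" {
			continue
		}
		fields := strings.Split(line, "\t")
		if len(fields) < 2 {
			continue // not a status line; skip rather than guess
		}
		changes = append(changes, Change{
			Status: fields[0],
			Path:   fields[len(fields)-1],
		})
	}
	return changes
}

func main() {
	// Sample output written by hand for illustration.
	sample := "M\tinternal/runtime/run.go\nD\tdocs/old.md\nR100\ta.go\tb.go\n"
	for _, c := range parseNameStatus(sample) {
		fmt.Printf("%s %s\n", c.Status, c.Path)
	}
}
```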

Undo mechanism

Look up the most recent pre_restore_guard checkpoint and recursively call RestoreCheckpoint(Force=true), which makes the chain reversible.
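
The guard-then-undo idea can be illustrated with an in-memory model in which a single string stands in for the combined code and session state. This is a sketch under that simplification; `store`, `snapshot`, `save`, `restore`, and `undo` are hypothetical names and do not mirror the PR's types.

```go
package main

import "fmt"

// snapshot is a simplified checkpoint: State stands in for code + session state.
type snapshot struct {
	ID     int
	Reason string // "auto" or "pre_restore_guard" in this sketch
	State  string
}

type store struct {
	current   string
	snapshots []snapshot
	nextID    int
}

// save records the current state under the given reason.
func (s *store) save(reason string) snapshot {
	s.nextID++
	snap := snapshot{ID: s.nextID, Reason: reason, State: s.current}
	s.snapshots = append(s.snapshots, snap)
	return snap
}

// restore first captures the current state as a guard, then switches to the target.
func (s *store) restore(id int) error {
	for _, snap := range s.snapshots {
		if snap.ID == id {
			s.save("pre_restore_guard")
			s.current = snap.State
			return nil
		}
	}
	return fmt.Errorf("checkpoint %d not found", id)
}

// undo restores the most recent guard. Because it goes through restore,
// the undo itself leaves a new guard behind and can be undone in turn.
func (s *store) undo() error {
	for i := len(s.snapshots) - 1; i >= 0; i-- {
		if s.snapshots[i].Reason == "pre_restore_guard" {
			return s.restore(s.snapshots[i].ID)
		}
	}
	return fmt.Errorf("no pre_restore_guard checkpoint to undo")
}

func main() {
	s := &store{current: "v1"}
	cp := s.save("auto")
	s.current = "v2"

	_ = s.restore(cp.ID)
	fmt.Println("after restore:", s.current)

	_ = s.undo()
	fmt.Println("after undo:", s.current)
}
```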

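ResumeCheckpoint

The run position is persisted at phase transitions (plan → execute → verify → stopped), and only the latest record is kept per session. After a process restart the exact run state is restored from it, with no more guesswork.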
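
The "latest record only" rule amounts to an upsert keyed by session. The sketch below models it with a map instead of SQLite; the `ResumeCheckpoint` fields, the `Phase` type, and `resumeStore` are assumptions made for illustration and are not taken from the PR.

```go
package main

import "fmt"

// Phase names follow the transitions listed above; the type itself is hypothetical.
type Phase string

const (
	PhasePlan    Phase = "plan"
	PhaseExecute Phase = "execute"
	PhaseVerify  Phase = "verify"
	PhaseStopped Phase = "stopped"
)

// ResumeCheckpoint here is a stand-in with assumed fields, not the PR's struct.
type ResumeCheckpoint struct {
	SessionID string
	Phase     Phase
	Turn      int
}

// resumeStore keeps exactly one record per session: each write replaces the last.
type resumeStore struct {
	latest map[string]ResumeCheckpoint
}

func newResumeStore() *resumeStore {
	return &resumeStore{latest: make(map[string]ResumeCheckpoint)}
}

// Set overwrites whatever was stored for the session.
func (s *resumeStore) Set(cp ResumeCheckpoint) { s.latest[cp.SessionID] = cp }

// Get returns the stored position, if any, for the session.
func (s *resumeStore) Get(sessionID string) (ResumeCheckpoint, bool) {
	cp, ok := s.latest[sessionID]
	return cp, ok
}

func main() {
	s := newResumeStore()
	for _, p := range []Phase{PhasePlan, PhaseExecute, PhaseVerify} {
		s.Set(ResumeCheckpoint{SessionID: "s1", Phase: p, Turn: 3})
	}
	cp, _ := s.Get("s1")
	fmt.Printf("resume session %s at phase %s, turn %d\n", cp.SessionID, cp.Phase, cp.Turn)
}
```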

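Retention policy

  • Each session keeps its 10 most recent automatic checkpoints
  • Manual checkpoints and pre_restore_guard checkpoints always remain restorable
  • Old checkpoints outside the window are removed with DELETE (not just marked)
  • Window pruning runs at startup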
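
One way to read these rules is as a filter over a session's checkpoints, ordered oldest to newest. The sketch below assumes that "always restorable" means manual and guard checkpoints are exempt from the window; `Checkpoint`, the reason constants, and `prune` are hypothetical names rather than the PR's API.

```go
package main

import "fmt"

// Reason values are assumptions for this sketch.
const (
	ReasonAuto   = "auto"
	ReasonManual = "manual"
	ReasonGuard  = "pre_restore_guard"
)

// Checkpoint carries only what the retention rule needs.
type Checkpoint struct {
	ID     int
	Reason string
}

// prune splits cps (ordered oldest to newest) into those to keep and those to
// delete. Only automatic checkpoints count against the window.
func prune(cps []Checkpoint, window int) (keep, drop []Checkpoint) {
	keepIdx := make([]bool, len(cps))
	autoSeen := 0
	for i := len(cps) - 1; i >= 0; i-- { // walk newest first
		if cps[i].Reason != ReasonAuto {
			keepIdx[i] = true
			continue
		}
		autoSeen++
		keepIdx[i] = autoSeen <= window
	}
	for i, cp := range cps {
		if keepIdx[i] {
			keep = append(keep, cp)
		} else {
			drop = append(drop, cp)
		}
	}
	return keep, drop
}

func main() {
	cps := []Checkpoint{{ID: 1, Reason: ReasonManual}, {ID: 2, Reason: ReasonGuard}}
	for id := 3; id <= 14; id++ { // twelve automatic checkpoints
		cps = append(cps, Checkpoint{ID: id, Reason: ReasonAuto})
	}
	keep, drop := prune(cps, 10)
	fmt.Println("kept:", len(keep), "dropped:", len(drop))
}
```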

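Degraded mode

When Git is unavailable the system degrades to session-only mode: session snapshots are still created and restore/undo still work; only the code snapshot is skipped. The reason is recorded as pre_write_degraded.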

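ShadowRepo self-healing

Init runs a rev-parse --git-dir health check; if the repository is corrupted, the old directory is renamed to .bak and the repository is rebuilt. At startup, the code references of all checkpoints are checked, and those whose reference is missing are marked broken.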
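
The health check and rebuild can be sketched as below. This is an illustration only: `ensureRepo`, `gitHealthy`, and `gitInitBare` are hypothetical helpers, replacing an existing `.bak` directory is an assumption the PR text does not confirm, and the health check and init are passed in as functions so the rename logic can be exercised without git installed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// gitHealthy reports whether dir is a usable git directory. Passing --git-dir
// explicitly stops git from discovering some enclosing repository instead.
func gitHealthy(dir string) bool {
	return exec.Command("git", "--git-dir="+dir, "rev-parse", "--git-dir").Run() == nil
}

// gitInitBare creates a bare repository at dir.
func gitInitBare(dir string) error {
	return exec.Command("git", "init", "--bare", dir).Run()
}

// ensureRepo leaves a healthy repo at dir, moving a damaged one aside first.
// It reports rebuilt=true when it had to create or recreate the repository.
func ensureRepo(dir string, healthy func(string) bool, initBare func(string) error) (bool, error) {
	if healthy(dir) {
		return false, nil
	}
	if _, err := os.Stat(dir); err == nil {
		backup := dir + ".bak"
		// Assumption: an older backup is replaced rather than kept.
		if err := os.RemoveAll(backup); err != nil {
			return false, err
		}
		if err := os.Rename(dir, backup); err != nil {
			return false, err
		}
	} else if !os.IsNotExist(err) {
		return false, err
	}
	if err := initBare(dir); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	root, err := os.MkdirTemp("", "shadow-demo")
	if err != nil {
		fmt.Println("tempdir:", err)
		return
	}
	defer os.RemoveAll(root)

	rebuilt, err := ensureRepo(filepath.Join(root, "shadow.git"), gitHealthy, gitInitBare)
	fmt.Println("rebuilt:", rebuilt, "err:", err)
}
```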

Scope of changes

Core modules

| File | Change |
| --- | --- |
| internal/checkpoint/checkpoint_manager.go | CheckpointStore interface gains 6 methods: RestoreCheckpoint / SetResumeCheckpoint / PruneExpiredCheckpoints (DELETE mode) / RepairCreatingCheckpoints / RepairBrokenCheckpoints / UpdateCheckpointStatus |
| internal/checkpoint/shadow_repo.go | Adds ConflictResult, DetectConflicts, Rebuild, ResolveRef, HasCodeChanges (no-op detection), RefExists; Init gains the health check and self-healing |
| internal/session/checkpoint_types.go | Adds the CheckpointReasonPreWriteDegraded constant |

Runtime

| File | Change |
| --- | --- |
| internal/runtime/checkpoint_restore.go | New file: RestoreCheckpoint / UndoRestoreCheckpoint end-to-end flows, createGuardCheckpoint, updateRuntimeSessionAfterRestore (refreshes RuntimeSnapshot) |
| internal/runtime/checkpoint_resume.go | New file: updateResumeCheckpoint, which builds and persists a ResumeCheckpoint |
| internal/runtime/checkpoint_gate.go | Rewritten as createPerTurnCheckpoint + createFullCheckpoint, triggered at the start of each turn; supports degraded mode (session-only) and no-op detection (HasCodeChanges) |
| internal/runtime/run.go | Inserts the per-turn checkpoint at the top of the turn loop; inserts updateResumeCheckpoint at four points: plan / execute / verify / stopped |
| internal/runtime/compact.go | Calls createCompactCheckpoint before compact |
| internal/runtime/events.go | Adds EventCheckpointRestored / EventCheckpointUndoRestore + payload types |

Gateway

| File | Change |
| --- | --- |
| internal/gateway/types.go | Adds three FrameAction values: checkpoint_list / checkpoint_restore / checkpoint_undo_restore |
| internal/gateway/contracts.go | RuntimePort gains ListCheckpoints / RestoreCheckpoint / UndoRestore + 5 input/output types |
| internal/gateway/registry.go | initCore registers 3 action handlers |
| internal/gateway/bootstrap.go | Adds 3 handler functions + payload decoding helpers |
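
To make the wiring concrete, here is a rough sketch of a registry that maps the three action names to calls on a runtime port. Only the action strings and the three method names come from the table above; the method signatures, `Request`, `handler`, `dispatch`, and `fakeRuntime` are all assumptions for illustration and will differ from the real gateway code.

```go
package main

import "fmt"

// RuntimePort lists the three methods named in the PR; signatures are assumed.
type RuntimePort interface {
	ListCheckpoints(sessionID string) ([]string, error)
	RestoreCheckpoint(sessionID, checkpointID string) error
	UndoRestore(sessionID string) error
}

// Request is a hypothetical decoded frame payload.
type Request struct {
	Action       string
	SessionID    string
	CheckpointID string
}

type handler func(rt RuntimePort, req Request) (any, error)

// handlers plays the role of the action registry.
var handlers = map[string]handler{
	"checkpoint_list": func(rt RuntimePort, req Request) (any, error) {
		return rt.ListCheckpoints(req.SessionID)
	},
	"checkpoint_restore": func(rt RuntimePort, req Request) (any, error) {
		return nil, rt.RestoreCheckpoint(req.SessionID, req.CheckpointID)
	},
	"checkpoint_undo_restore": func(rt RuntimePort, req Request) (any, error) {
		return nil, rt.UndoRestore(req.SessionID)
	},
}

// dispatch routes a request to its handler or rejects unknown actions.
func dispatch(rt RuntimePort, req Request) (any, error) {
	h, ok := handlers[req.Action]
	if !ok {
		return nil, fmt.Errorf("unknown action %q", req.Action)
	}
	return h(rt, req)
}

// fakeRuntime records calls so the routing can be observed.
type fakeRuntime struct{ calls []string }

func (f *fakeRuntime) ListCheckpoints(sessionID string) ([]string, error) {
	f.calls = append(f.calls, "list:"+sessionID)
	return []string{"cp-1"}, nil
}

func (f *fakeRuntime) RestoreCheckpoint(sessionID, checkpointID string) error {
	f.calls = append(f.calls, "restore:"+sessionID+":"+checkpointID)
	return nil
}

func (f *fakeRuntime) UndoRestore(sessionID string) error {
	f.calls = append(f.calls, "undo:"+sessionID)
	return nil
}

func main() {
	rt := &fakeRuntime{}
	out, err := dispatch(rt, Request{Action: "checkpoint_list", SessionID: "s1"})
	fmt.Println(out, err)
	_, err = dispatch(rt, Request{Action: "checkpoint_undo_restore", SessionID: "s1"})
	fmt.Println(rt.calls, err)
}
```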

Bootstrap

| File | Change |
| --- | --- |
| internal/app/bootstrap.go | ShadowRepo + CheckpointStore are always initialized (including in degraded mode); at startup RepairCreatingCheckpoints / RepairBrokenCheckpoints / PruneExpiredCheckpoints run in that order |

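Other

| File | Change |
| --- | --- |
| internal/cli/gateway_runtime_bridge.go | Stub implementations of the 3 RuntimePort methods (CLI mode is not wired up yet) |
| 4 test files | RuntimePort stub types gain the missing interface methods |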

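Expected benefits

  1. Every turn can be rolled back: whether or not a tool was called, the user can return to the code and session state from before the previous turn began.
  2. Complete three-layer recovery: code snapshot, session snapshot, and run position are restored together, avoiding state mismatch after a rollback.
  3. Safe restore: a guard snapshot is created automatically and conflicts are detected before a restore, with one-step undo.
  4. Resumable runs: ResumeCheckpoint records the exact phase/turn position, so nothing is guessed after a process restart.
  5. Compact safety: a snapshot is created automatically before compact, so a poor result can be rolled back.
  6. Robust fault tolerance: ShadowRepo self-healing + broken marking + degraded mode + startup repair of leftover state + window pruning, five layers that keep the system usable in abnormal environments.
  7. Ready for external integration: the Gateway RPC interface lets clients such as the TUI and the desktop app operate on checkpoints directly.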

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

fennoai[bot]

This comment was marked as outdated.

Generated with [codeagent](https://github.com/qbox/codeagent)
Co-authored-by: phantom5099 <245659304+phantom5099@users.noreply.github.com>
@phantom5099

@xgopilot Give me fixes for the issues raised in the comments

phantom5099 and others added 4 commits May 3, 2026 23:41
@1024XEngineer 1024XEngineer deleted a comment from fennoai Bot May 3, 2026
@1024XEngineer 1024XEngineer deleted a comment from fennoai Bot May 3, 2026
xgopilot and others added 2 commits May 3, 2026 16:07
Generated with [codeagent](https://github.com/qbox/codeagent)
Co-authored-by: phantom5099 <245659304+phantom5099@users.noreply.github.com>
test(checkpoint): add checkpoint coverage
@1024XEngineer 1024XEngineer deleted a comment from fennoai Bot May 3, 2026
@1024XEngineer 1024XEngineer deleted a comment from fennoai Bot May 3, 2026
@phantom5099

/review Find flaws and analyze how this deviates from #521


@fennoai fennoai Bot left a comment


Found a few correctness regressions in the new checkpoint flow.

Comment thread internal/app/bootstrap.go

// Checkpoint infrastructure: shadow repository + SQLite checkpoint store.
// Reuse the sessionStore's *sql.DB connection to avoid file locking from multiple connections on Windows.
sessionDB := sessionStore.DB()

sessionStore.DB() is still nil until the store has performed its first ensureDB-backed operation, and BuildGatewayServerDeps has not touched the store yet. On a fresh startup this leaves checkpointStore unset, so checkpoint wiring/repair never runs and the whole feature is silently disabled. Please open the DB explicitly here or fall back to constructing the checkpoint store from DatabasePath(...).

@phantom5099

/code Fix this


@fennoai fennoai Bot May 3, 2026


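Fixed the issue in internal/app/bootstrap.go where sessionStore.DB() was still nil on cold start, leaving the checkpoint store uninitialized.

Changes in this commit:

  • BuildGatewayServerDeps now constructs the checkpoint store from session.DatabasePath(...) by default, so the whole checkpoint chain is no longer silently skipped on cold start.
  • If the session store has already opened a SQLite connection, that connection is still reused in preference, preserving the existing shared-connection path.
  • Added regression test TestBuildGatewayServerDepsInitializesCheckpointStoreOnFreshStartup, which verifies that the runtime's ListCheckpoints works on a fresh startup.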
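
The "reuse the shared connection, otherwise open from the path" rule described in these bullets might look roughly like the sketch below. It is not the code from the fix: `resolveCheckpointDB`, `fakeOpen`, and the `sessions.db` file name are hypothetical, and `fakeOpen` returns a placeholder handle where real code would call `sql.Open` with a registered SQLite driver.

```go
package main

import (
	"database/sql"
	"fmt"
)

// resolveCheckpointDB prefers an already-open shared connection and otherwise
// opens one from path. owned tells the caller whether it must close the handle.
func resolveCheckpointDB(shared *sql.DB, path string, open func(path string) (*sql.DB, error)) (db *sql.DB, owned bool, err error) {
	if shared != nil {
		return shared, false, nil
	}
	db, err = open(path)
	if err != nil {
		return nil, false, fmt.Errorf("open checkpoint db %q: %w", path, err)
	}
	return db, true, nil
}

// fakeOpen stands in for a real opener. The returned handle is a placeholder
// that is never queried in this sketch.
func fakeOpen(path string) (*sql.DB, error) {
	if path == "" {
		return nil, fmt.Errorf("empty database path")
	}
	return new(sql.DB), nil
}

func main() {
	// Cold start: no shared connection yet, so the path-based open is used.
	db, owned, err := resolveCheckpointDB(nil, "sessions.db", fakeOpen)
	fmt.Println(db != nil, owned, err)

	// Warm path: an existing connection is reused and is not owned by the caller.
	_, owned, err = resolveCheckpointDB(db, "sessions.db", fakeOpen)
	fmt.Println(owned, err)
}
```

Returning an `owned` flag is one way to keep the close responsibility clear when the handle may or may not be shared with the session store.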

Verification:

  • go test ./internal/app -run 'TestBuildGatewayServerDepsInitializesCheckpointStoreOnFreshStartup|TestBuildRuntimeUsesWorkdirOverride'

The code has been pushed to the current working branch, and a fork-side PR was also created: https://github.com/phantom5099/neo-code/pull/70


Comment thread internal/checkpoint/shadow_repo.go Outdated
Comment thread internal/checkpoint/shadow_repo.go
Comment thread internal/runtime/run.go Outdated
Comment thread internal/runtime/checkpoint_restore.go
@phantom5099 phantom5099 closed this May 3, 2026