
feat(ckpt): make deamon persist #520

Open
Ziqi002 wants to merge 5 commits into alibaba:main from Ziqi002:feat/ckpt/persist-deamon

Conversation


@Ziqi002 Ziqi002 commented May 13, 2026

Description

feat(ckpt): daemon state persistence and architecture refactor

Converts the ws-ckpt daemon into a stateful mode: core runtime information is persisted in /var/lib/ws-ckpt/state.json, enabling fast recovery on startup, crash detection, and automatic migration of legacy data.


Persisted content (c05349c)

File path: /var/lib/ws-ckpt/state.json, written atomically (tmp + fsync + rename); saved once after daemon startup and once before shutdown.

Field                            Description
version: u32                     schema version number (currently = 1)
backend: BackendIdentity         backend type (BtrfsLoop/BtrfsBase) + selection source (config/persisted/auto) + timestamp
paths: BackendPaths              mount point, data_root, snapshots_root; BtrfsLoop additionally records loop_img state (img_path/size/device)
workspaces: Vec<WorkspaceEntry>  ws_id, user path, registration time, owning backend
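The tmp + fsync + rename sequence above can be sketched as follows. This is a minimal Python illustration, not the actual Rust code; the sample path and the reduced field set are assumptions for demonstration only:

```python
import json
import os

def save_state_atomically(state: dict, state_path: str) -> None:
    """Write state.json atomically via tmp file + fsync + rename.

    A crash mid-write leaves either the old file or the complete new
    file on disk, never a truncated one.
    """
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f, indent=2)
        f.flush()
        os.fsync(f.fileno())          # force the bytes to disk first
    os.rename(tmp_path, state_path)   # then atomically replace the old file

# Illustrative subset of the persisted schema
state = {
    "version": 1,
    "backend": {"backend_type": "BtrfsLoop", "selected_by": "auto"},
    "workspaces": [],
}
save_state_atomically(state, "/tmp/ws-ckpt-demo-state.json")
print(json.load(open("/tmp/ws-ckpt-demo-state.json"))["version"])  # → 1
```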

Other changes

1. Modular split of the daemon architecture (c05349c)

The original lib.rs (522 lines) was split into:

  • startup.rs — startup decision (state present → recover / no state → fresh install + migration)
  • lockfile.rs — flock single instance + crash detection
  • lib.rs — slimmed down to a pure orchestration layer (~157 lines)
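The lockfile.rs idea can be sketched as follows, with Python's fcntl.flock standing in for the Rust implementation. The lock path here is illustrative (the real one is /var/lib/ws-ckpt/daemon.lock); the key trick is that a kill -9 releases the kernel flock but leaves the file behind, which the next start interprets as an unclean shutdown:

```python
import fcntl
import os

def acquire_daemon_lock(path: str):
    """Take an exclusive flock on `path`.

    Returns (file, crashed): `crashed` is True when a leftover lock
    file was found whose lock was free, i.e. the previous daemon died
    without cleaning up.
    """
    crashed = os.path.exists(path)   # file present before we even start
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        raise RuntimeError("another daemon instance holds the lock")
    # Lock acquired: if the file already existed, the previous run
    # never removed it -> unclean shutdown, trigger reconcile.
    f.write(str(os.getpid()))
    f.flush()
    return f, crashed

def release_daemon_lock(f, path: str) -> None:
    """Clean shutdown: remove the file first, then release the lock."""
    os.remove(path)
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()

lock, crashed = acquire_daemon_lock("/tmp/ws-ckpt-demo.lock")
print("unclean shutdown detected" if crashed else "clean start")
release_daemon_lock(lock, "/tmp/ws-ckpt-demo.lock")
```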

2. Automatic migration of legacy data (c05349c)

Added common/migration.rs: when state.json is absent, scan snapshots_root for legacy index.json files, migrate them to state_dir/indexes/, and generate state.json.
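The migration pass can be sketched as follows. This is a minimal Python illustration assuming a per-workspace <ws_id>/index.json layout under snapshots_root; the state.json it writes is reduced to a subset of the real schema:

```python
import json
import os
import shutil

def migrate_v0_layout(snapshots_root: str, state_dir: str) -> list:
    """If state.json is absent, move legacy per-workspace index.json
    files from the btrfs snapshots_root into state_dir/indexes/ and
    generate an initial state.json. Returns the migrated ws_ids."""
    state_path = os.path.join(state_dir, "state.json")
    if os.path.exists(state_path):
        return []                        # already on the new layout
    migrated = []
    for ws_id in sorted(os.listdir(snapshots_root)):
        old_index = os.path.join(snapshots_root, ws_id, "index.json")
        if not os.path.isfile(old_index):
            continue
        new_dir = os.path.join(state_dir, "indexes", ws_id)
        os.makedirs(new_dir, exist_ok=True)
        # move semantics: the legacy copy is gone after migration
        shutil.move(old_index, os.path.join(new_dir, "index.json"))
        migrated.append(ws_id)
    with open(state_path, "w") as f:
        json.dump({"version": 1, "workspaces": migrated}, f)
    return migrated
```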

3. Defensive error handling (c05349c)

  • A state.json parse failure degrades to a fresh start (avoiding an infinite systemd restart loop)
  • Added .context() to 6 bare ? operators
  • #[non_exhaustive] keeps future field additions backward compatible
  • create_backend / detect_and_create_backend narrowed to pub(crate)
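The parse-failure fallback can be sketched as: try to load state.json, and on any decode error log it and fall back to a fresh start rather than exiting, since a hard failure would make systemd restart the daemon forever. A minimal Python illustration (the fresh-state shape is an assumption):

```python
import json

def load_state_or_fresh(state_path: str) -> dict:
    """Return the persisted state, or an empty fresh-start state if the
    file is missing or corrupt. Never raises: a hard failure here would
    put systemd into an endless restart loop."""
    fresh = {"version": 1, "workspaces": []}
    try:
        with open(state_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return fresh                     # first boot: nothing persisted yet
    except json.JSONDecodeError as e:
        # corrupt file: log and degrade to a fresh start
        print(f"state.json corrupt ({e}); degrading to fresh start")
        return fresh
```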

4. Runtime paths unified to the FHS standard (2a826c0)

  • BTRFS_IMG_PATH: /data/ws-ckpt/ → /var/lib/ws-ckpt/
  • systemd service adds StateDirectory=ws-ckpt
  • RPM spec declares the directory as %ghost

5. Removed the OverlayFS placeholder backend (134cd36)

  • Deleted overlayfs.rs and removed the BackendPaths::OverlayFs variant
  • backend_type now only accepts auto / btrfs-base / btrfs-loop

6. RPM spec fix (a89dd95)

  • Removed the install-blocking logic from %pre; directory creation moved to %post

7. bootstrap refactored into a trait method (aae4ce8)

  • Deleted the standalone bootstrap.rs (678 lines); replaced by a StorageBackend::bootstrap() trait method
  • BtrfsLoop: image creation + mount; BtrfsBase: mkdir only (idempotent)
  • Added daemon/util.rs, extracting ensure_symlinks()
  • startup.rs uniformly calls backend.bootstrap(config).await, eliminating redundant branches
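The bootstrap-as-trait-method shape can be sketched as follows. This is a Python stand-in for the Rust trait, with heavily simplified bodies (no real mount or image formatting); the config keys are illustrative:

```python
from abc import ABC, abstractmethod
import os

class StorageBackend(ABC):
    """Each backend implements its own bootstrap; the startup code just
    calls backend.bootstrap(config) without branching on the type."""
    @abstractmethod
    def bootstrap(self, config: dict) -> None: ...

class BtrfsBase(StorageBackend):
    def bootstrap(self, config: dict) -> None:
        # base variant only needs its directories; mkdir -p is idempotent
        os.makedirs(config["data_root"], exist_ok=True)
        os.makedirs(config["snapshots_root"], exist_ok=True)

class BtrfsLoop(StorageBackend):
    def bootstrap(self, config: dict) -> None:
        # loop variant would create and mount the image file; here we
        # only model the image-creation step
        if not os.path.exists(config["img_path"]):
            open(config["img_path"], "wb").close()

def start(backend: StorageBackend, config: dict) -> None:
    backend.bootstrap(config)   # single call site, no per-backend branch
```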

Related Issue

closes #491
closes #501

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional change)
  • Performance improvement
  • CI/CD or build changes

Scope

  • cosh (copilot-shell)
  • sec-core (agent-sec-core)
  • skill (os-skills)
  • sight (agentsight)
  • tokenless (tokenless)
  • ckpt (ws-ckpt)
  • Multiple / Project-wide

Checklist

  • I have read the Contributing Guide
  • My code follows the project's code style
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the documentation accordingly
  • For cosh: Lint passes, type check passes, and tests pass
  • For sec-core (Rust): cargo clippy -- -D warnings and cargo fmt --check pass
  • For sec-core (Python): Ruff format and pytest pass
  • For skill: Skill directory structure is valid and shell scripts pass syntax check
  • For sight: cargo clippy -- -D warnings and cargo fmt --check pass
  • For tokenless: cargo clippy -- -D warnings and cargo fmt --check pass
  • Lock files are up to date (package-lock.json / Cargo.lock)

Testing

Scenario 1: state.json persistence and recovery across restarts

Verification goal: after a restart, the daemon recovers its state from state.json rather than rescanning btrfs.

# --- Step 1: create a workspace and take snapshots ---
ws-ckpt checkpoint -w $WS -i persist-1 -m "persist test 1"
ws-ckpt checkpoint -w $WS -i persist-2 -m "persist test 2"
ws-ckpt checkpoint -w $WS -i persist-3 -m "persist test 3"

# --- Step 2: confirm the snapshots exist ---
ws-ckpt list -w $WS
# Expected: shows 3 snapshots (persist-1, persist-2, persist-3)

# --- Step 3: confirm state.json was generated and contains the workspace ---
cat /var/lib/ws-ckpt/state.json | python3 -m json.tool
# Expected: JSON output contains a "workspaces" list including the current ws_id

# --- Step 4: confirm index.json is in the new location ---
ls /var/lib/ws-ckpt/indexes/
cat /var/lib/ws-ckpt/indexes/ws-*/index.json | python3 -m json.tool
# Expected: a ws-* directory exists under indexes/; index.json holds metadata for the 3 snapshots

# --- Step 5: restart the daemon ---
sudo systemctl restart ws-ckpt

# --- Step 6: confirm no snapshots were lost ---
ws-ckpt list -w $WS
# Expected: still shows 3 snapshots (persist-1, persist-2, persist-3)

# --- Step 7: confirm recovery came from state.json, not a btrfs scan ---
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | grep -i "persisted\|rebuild"
# Expected: a log line like "Restoring from persisted state"
# Expected: no "scanning snapshots_root" log lines

# --- Step 8: clean up test data ---
ws-ckpt delete -w $WS -s persist-1 --force
ws-ckpt delete -w $WS -s persist-2 --force
ws-ckpt delete -w $WS -s persist-3 --force

Scenario 2: authority of the persisted backend type (hard fail after tampering with state.json)

Verification goal: when backend_type is tampered with so that it conflicts with the environment, the daemon refuses to start and prints a fix-it hint.

# --- Step 1: confirm the current backend type ---
cat /var/lib/ws-ckpt/state.json | python3 -c "
import json, sys
state = json.load(sys.stdin)
print('backend_type:', state['backend']['backend_type'])"
# Expected: prints "BtrfsLoop" or "BtrfsBase"

# Inspect config.toml: make sure the file is absent or its backend_type is auto
cat /etc/ws-ckpt/config.toml

# --- Step 2: stop the daemon and back up state.json ---
sudo systemctl stop ws-ckpt
sudo cp /var/lib/ws-ckpt/state.json /var/lib/ws-ckpt/state.json.bak

# --- Step 3: tamper with backend_type so it no longer matches the environment ---
sudo python3 -c "
import json
with open('/var/lib/ws-ckpt/state.json', 'r+') as f:
    state = json.load(f)
    # change backend_type to one with no matching environment
    state['backend']['backend_type'] = 'BtrfsBase'
    f.seek(0); json.dump(state, f, indent=2); f.truncate()"

# --- Step 4: try to start the daemon ---
sudo systemctl start ws-ckpt
sudo systemctl status ws-ckpt
# Expected: service state is failed

# --- Step 5: check the error log ---
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | tail -10
# Expected: a clear error message, e.g.:
#   "Backend type 'btrfs-base' selected but no mounted btrfs partition found."
#   "Please mount a btrfs partition or switch to 'btrfs-loop' backend."

# --- Step 6: restore the original state.json and verify recovery ---
sudo cp /var/lib/ws-ckpt/state.json.bak /var/lib/ws-ckpt/state.json
sudo systemctl start ws-ckpt
sudo systemctl status ws-ckpt
# Expected: service starts normally, state is active (running)

Scenario 3: crash detection via the lockfile (kill -9 simulated crash vs normal stop)

Verification goal: after kill -9 the lockfile is left behind and the next start detects the crash and enters reconcile; after a normal stop the lockfile is cleaned up and the restart raises no false alarm.

# ============================
# 3a: simulated crash -> unclean shutdown detected
# ============================

# --- Step 1: confirm the daemon is running ---
sudo systemctl status ws-ckpt
# Expected: active (running)

# --- Step 2: confirm the lockfile exists and is locked ---
ls -la /var/lib/ws-ckpt/daemon.lock
# Expected: file exists; its content is the current daemon PID

# --- Step 3: simulate a crash (SIGKILL gives the process no chance to clean up) ---
DAEMON_PID=$(cat /var/lib/ws-ckpt/daemon.lock)
sudo kill -9 $DAEMON_PID

# --- Step 4: confirm the lockfile is left behind (process dead, file not removed) ---
ls -la /var/lib/ws-ckpt/daemon.lock
# Expected: file still exists

# --- Step 5: confirm the kernel released the lock (locking now succeeds) ---
flock -n /var/lib/ws-ckpt/daemon.lock echo "lock acquired - previous process crashed"
# Expected: prints "lock acquired - previous process crashed"

# --- Step 6: restart the daemon ---
sudo systemctl start ws-ckpt

# --- Step 7: check the log for crash detection ---
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | grep -i "crash\|unclean\|reconcile"
# Expected: shows "Detected unclean shutdown (lockfile ... present from previous run)"

# --- Step 8: confirm workspace data is intact ---
ws-ckpt list -w $WS
# Expected: all snapshots present, none lost

# ============================
# 3b: normal stop -> no crash warning
# ============================

# --- Step 9: stop the daemon normally ---
sudo systemctl stop ws-ckpt

# --- Step 10: confirm the lockfile was cleaned up ---
ls /var/lib/ws-ckpt/daemon.lock 2>&1
# Expected: "No such file or directory"

# --- Step 11: start the daemon normally ---
sudo systemctl start ws-ckpt

# --- Step 12: confirm there is no crash-related warning ---
journalctl -u ws-ckpt --since "30 seconds ago" --no-pager | grep -i "crash\|unclean"
# Expected: no output (normal startup path, no crash-related logs)

Scenario 4: automatic migration of the v0 layout (mocked legacy layout)

Verification goal: manually construct the v0 file layout (index.json inside the btrfs device, no state.json) and verify that the new daemon discovers and migrates it automatically on startup.

# --- Step 1: take a snapshot normally so the btrfs device has subvolume data ---
ws-ckpt checkpoint -w $WS -i migrate-1 -m "migration test"
ws-ckpt list -w $WS
# Expected: 1 snapshot (migrate-1)

# --- Step 2: stop the daemon ---
sudo systemctl stop ws-ckpt

# --- Step 3: mock the v0 layout ---
# 3a. back up the current index.json content
INDEX_CONTENT=$(cat /var/lib/ws-ckpt/indexes/*/index.json)
WS_ID=$(ls /var/lib/ws-ckpt/indexes/)

# 3b. delete the new-layout state.json and indexes/
sudo rm -f /var/lib/ws-ckpt/state.json
sudo rm -rf /var/lib/ws-ckpt/indexes/

# 3c. put index.json back in the legacy location (under snapshots_root inside the btrfs device)
SNAPSHOTS_ROOT=/mnt/btrfs-workspace/snapshots
sudo mkdir -p "$SNAPSHOTS_ROOT/$WS_ID"
echo "$INDEX_CONTENT" | sudo tee "$SNAPSHOTS_ROOT/$WS_ID/index.json" > /dev/null

# --- Step 4: confirm the mock ---
# state.json absent
ls /var/lib/ws-ckpt/state.json 2>&1
# Expected: No such file or directory

# indexes/ absent
ls /var/lib/ws-ckpt/indexes/ 2>&1
# Expected: No such file or directory

# index.json present in the legacy location
cat "$SNAPSHOTS_ROOT/$WS_ID/index.json" | python3 -m json.tool
# Expected: contains the migrate-1 snapshot metadata

# --- Step 5: start the daemon to trigger the automatic migration ---
sudo systemctl start ws-ckpt

# --- Step 6: confirm the migration log ---
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | grep -i "migrat"
# Expected: shows "Migrating from v0 layout" or a similar line

# --- Step 7: confirm state.json was generated ---
cat /var/lib/ws-ckpt/state.json | python3 -m json.tool
# Expected: contains the workspaces list and backend info

# --- Step 8: confirm the indexes/ directory was written ---
ls /var/lib/ws-ckpt/indexes/
# Expected: contains a directory for the ws_id

cat /var/lib/ws-ckpt/indexes/$WS_ID/index.json | python3 -m json.tool
# Expected: contains the migrate-1 snapshot metadata

# --- Step 9: confirm no snapshots were lost ---
ws-ckpt list -w $WS
# Expected: still contains the snapshot (migrate-1)

# --- Step 10: confirm the legacy file was removed (move semantics) ---
ls "$SNAPSHOTS_ROOT/$WS_ID/index.json" 2>&1
# Expected: No such file or directory (removed right after migration)

# --- Step 11: clean up test data ---
ws-ckpt delete -w $WS -s migrate-1 --force

Scenario 5: precise error when the btrfs device is lost (state.json has records but subvolumes are gone)

Verification goal: when the btrfs device is lost, the daemon uses the records in state.json to report a precise, workspace-level error instead of a generic "no workspaces".

# --- Step 1: take a snapshot so a workspace is registered ---
ws-ckpt checkpoint -w $WS -i device-test -m "device loss test"
ws-ckpt list -w $WS
# Expected: 1 snapshot (device-test)

# --- Step 2: confirm state.json records this workspace ---
cat /var/lib/ws-ckpt/state.json | python3 -c "
import json, sys
state = json.load(sys.stdin)
for ws in state['workspaces']:
    print(f\"  ws_id: {ws['ws_id']}, path: {ws['workspace_path']}\")"
# Expected: prints the current workspace's ws_id and path

# --- Step 3: stop the daemon and simulate total btrfs device loss (back up the img with mv so it can be restored later) ---
sudo systemctl stop ws-ckpt
LOOP_DEV=$(losetup -a | grep btrfs-data.img | cut -d: -f1)
sudo umount /mnt/btrfs-workspace
sudo losetup -d $LOOP_DEV
# back up the img file (do not delete it) so the daemon cannot find it at startup
sudo mv /var/lib/ws-ckpt/btrfs-data.img /var/lib/ws-ckpt/btrfs-data.img.bak

# --- Step 4: start the daemon (backend cannot be recovered) ---
sudo systemctl start ws-ckpt

# --- Step 5: inspect the error message ---
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | tail -20
# Expected: the error includes workspace details, e.g.:
#   "Backend image file not found at ... Persisted state expects a BtrfsLoop backend but the image is missing"
#   or "Snapshot xxx subvolume missing at ..., removing from index"
#   rather than a bare "No workspaces found"

# --- Step 6: restore the environment (put the img backup back, verify the state.json recovery path) ---
sudo systemctl stop ws-ckpt
# restore with -f: if the daemon created a fresh empty img during retries, overwrite it in one go
sudo mv -f /var/lib/ws-ckpt/btrfs-data.img.bak /var/lib/ws-ckpt/btrfs-data.img
sudo systemctl start ws-ckpt
sudo systemctl status ws-ckpt
# Expected: active (running); the daemon recovers from state.json with the workspace and the device-test snapshot intact
ws-ckpt list -w $WS
# Expected: contains the device-test snapshot
# delete the device-test test snapshot
ws-ckpt delete -w $WS -s device-test --force

Scenario 6: a single subvolume lost — reconcile marks it MISSING

Verification goal: when the reconcile phase at daemon startup finds a snapshot's subvolume missing, it marks the snapshot MISSING and blocks rollback to it, without affecting healthy snapshots. After the subvolume is restored, a restart clears the MISSING marker automatically.

# --- Step 1: create two snapshots ---
ws-ckpt checkpoint -w $WS -i snap-a -m "missing test a"
ws-ckpt checkpoint -w $WS -i snap-b -m "missing test b"
ws-ckpt list -w $WS
# Expected: shows snap-a, snap-b

# --- Step 2: stop the daemon ---
sudo systemctl stop ws-ckpt

# --- Step 3: manually delete one snapshot subvolume (simulating accidental loss) ---
WS_ID=$(ls /var/lib/ws-ckpt/indexes/)
sudo btrfs subvolume delete /mnt/btrfs-workspace/snapshots/$WS_ID/snap-a

# --- Step 4: start the daemon ---
sudo systemctl start ws-ckpt

# --- Step 5: check the startup log ---
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | grep -i "missing\|unavailable\|recovered"
# Expected: ERROR-level "Snapshot snap-a subvolume missing at ..., marking as unavailable"
sudo systemctl status ws-ckpt
# Expected: active (running)

# --- Step 6: confirm snap-a is marked missing in index.json ---
cat /var/lib/ws-ckpt/indexes/$WS_ID/index.json | python3 -c "
import json, sys
index = json.load(sys.stdin)
for sid, snap in index['snapshots'].items():
    print(f'  {sid}: missing={snap.get(\"missing\", False)}')"
# Expected: snap-a: missing=True, snap-b: missing=False (or no missing field)

# --- Step 7: list shows the MISSING marker ---
ws-ckpt list -w $WS
# Expected: the snap-a row shows [MISSING]; snap-b is unmarked

# --- Step 8: rollback to the missing snapshot is blocked ---
ws-ckpt rollback -w $WS -s snap-a
# Expected: returns an error containing "subvolume is missing (data lost)" and a "delete --force" hint

# --- Step 9: rollback to a healthy snapshot still works ---
ws-ckpt rollback -w $WS -s snap-b
# Expected: rollback succeeds

# --- Step 10: delete --force removes the missing record ---
ws-ckpt delete -w $WS -s snap-a --force
# Expected: record deleted successfully (the subvolume-deletion step is skipped)
ws-ckpt list -w $WS
# Expected: snap-a no longer listed

# --- Step 11 (optional): recovery scenario ---
# create a new snapshot, simulate loss, then restore it
ws-ckpt checkpoint -w $WS -i snap-c -m "recovery test"
sudo systemctl stop ws-ckpt
WS_ID=$(ls /var/lib/ws-ckpt/indexes/)
sudo btrfs subvolume delete /mnt/btrfs-workspace/snapshots/$WS_ID/snap-c
sudo systemctl start ws-ckpt
ws-ckpt list -w $WS
# Expected: snap-c shows [MISSING]

# stop the daemon and manually restore the subvolume to its original path
# (use the current workspace as the source to regenerate a subvolume with the same name)
sudo systemctl stop ws-ckpt
sudo btrfs subvolume snapshot $WS /mnt/btrfs-workspace/snapshots/$WS_ID/snap-c
sudo systemctl start ws-ckpt
journalctl -u ws-ckpt --since "1 minute ago" --no-pager | grep -i "recovered"
# Expected: INFO log "Snapshot snap-c subvolume recovered"
ws-ckpt list -w $WS
# Expected: snap-c is healthy again, no [MISSING] marker
ws-ckpt rollback -w $WS -s snap-c
# Expected: rollback succeeds

Additional Notes

@Ziqi002 Ziqi002 requested a review from casparant as a code owner May 13, 2026 09:56
@Ziqi002 Ziqi002 requested review from samchu-zsl and yummypeng and removed request for casparant May 13, 2026 09:56
@github-actions github-actions Bot added component:ckpt scope:documentation ./docs/|./*.md|./NOTICE labels May 13, 2026
Ziqi002 added 2 commits May 14, 2026 11:56
- introduce a mechanism for persisting daemon process state
- refactor daemon startup: unified recovery from disk state (config takes precedence)
- index files are migrated from the Btrfs snapshot directory to the state directory
- introduce snapshot "missing" markers and user-visible prompts
- integration of state persistence with auto-cleanup and workspace management
- add configurations: StateDirectory=ws-ckpt / RuntimeDirectory=ws-ckpt

add
@Ziqi002 Ziqi002 force-pushed the feat/ckpt/persist-deamon branch 2 times, most recently from 6eaca8b to 8211e75 Compare May 14, 2026 05:49
@Ziqi002 Ziqi002 force-pushed the feat/ckpt/persist-deamon branch from 8211e75 to c54aece Compare May 14, 2026 05:55


Development

Successfully merging this pull request may close these issues.

[ckpt] bug: rpmspec has a blocking installation process · [ckpt] feat(ckpt): persist daemon state so the daemon can inherit its state after a restart
