@andreasjansson
Collaborator

Fix three bugs that together prevented R2 backup/restore from working:

Bug 1 - waitForProcess didn't handle 'starting' status:
The poll loop only waited while status was 'running', exiting
immediately if a fast command was still 'starting'.
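A minimal sketch of the corrected poll loop, for illustration only. It assumes a process handle exposing an async getStatus() (named later in this description); the ProcessHandle interface, the status strings other than 'starting' and 'running', and the injectable sleep parameter are hypothetical and not taken from the sandbox API.

```typescript
// Hypothetical handle shape: only getStatus() is mentioned in this PR.
type ProcessStatus = string;

interface ProcessHandle {
  getStatus(): Promise<ProcessStatus>;
}

// Both of these mean "not finished yet". Waiting only on 'running' was Bug 1.
const PENDING: ReadonlySet<string> = new Set(["starting", "running"]);

async function waitForProcess(
  proc: ProcessHandle,
  timeoutMs = 5000,
  pollIntervalMs = 100,
  // Injectable so the loop can be exercised without real delays (assumption).
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<ProcessStatus> {
  let waited = 0;
  // Ask for a fresh status each time instead of reading a cached snapshot.
  let status = await proc.getStatus();
  while (PENDING.has(status) && waited < timeoutMs) {
    await sleep(pollIntervalMs);
    waited += pollIntervalMs;
    status = await proc.getStatus();
  }
  // On timeout this is still 'starting' or 'running'; callers should check.
  return status;
}
```

Returning the last observed status rather than throwing leaves the timeout decision to the caller, which is one possible design, not necessarily the one in this PR.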

Bug 2 - exitCode null check broke config detection:
exitCode is often undefined in the sandbox API. Switched to
stdout-based detection (echo exists) instead of exitCode checks.
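The stdout-based check could look roughly like the sketch below. The helper names fileExistsCommand and sawMarker are hypothetical; only the idea of echoing a marker ("echo exists") and reading stdout instead of exitCode comes from this description.

```typescript
// Build a shell command that prints a marker only when the file exists.
function fileExistsCommand(path: string, marker = "exists"): string {
  // Single-quote the path; an embedded ' becomes '\'' so the quoting holds.
  const quoted = `'${path.replace(/'/g, `'\\''`)}'`;
  return `test -f ${quoted} && echo ${marker}`;
}

// Decide from stdout alone, so an undefined exitCode no longer matters.
function sawMarker(stdout: string | undefined, marker = "exists"): boolean {
  return (stdout ?? "")
    .split("\n")
    .some((line) => line.trim() === marker);
}
```

Matching a whole trimmed line, rather than a substring, avoids a false positive if other output happens to contain the marker text.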

Bug 3 - should_restore_from_r2() race condition:
Called three times (config, workspace, skills) but the config
restore copied .last-sync locally, making timestamps match and
skipping workspace/skills. Now evaluates once with DO_RESTORE flag.
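The evaluate-once pattern can be illustrated as follows. This is a TypeScript sketch of the logic only, not the startup script itself; the comparison rule (restore when the R2 timestamp is newer, or when no local timestamp exists) and the names shouldRestoreFromR2 and planRestore are assumptions.

```typescript
// Assumed rule: restore when R2 has a sync timestamp that is newer than the
// local one, or when there is no local timestamp at all.
function shouldRestoreFromR2(
  r2LastSync: string | null,
  localLastSync: string | null,
): boolean {
  if (r2LastSync === null) return false; // nothing has been backed up
  if (localLastSync === null) return true; // fresh container
  return Date.parse(r2LastSync) > Date.parse(localLastSync);
}

// Evaluate once, then apply the same decision to every target. Re-checking
// per target was the race: restoring config copies .last-sync locally, so
// later checks would see matching timestamps and skip workspace and skills.
function planRestore(
  r2LastSync: string | null,
  localLastSync: string | null,
  targets: readonly string[] = ["config", "workspace", "skills"],
): string[] {
  const doRestore = shouldRestoreFromR2(r2LastSync, localLastSync);
  return doRestore ? [...targets] : [];
}
```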

Additional fixes discovered during debugging:

  • proc.status is a readonly snapshot that never updates. Use getStatus() to poll for actual status changes.

  • getStatus() itself returns 'running' indefinitely for shell commands (known sandbox API behavior). Sync now starts the rsync chain and polls for the timestamp file instead of waiting on process status.

  • isR2Mounted used getLogs().stdout which was empty (process hadn't completed). Now uses stdout marker pattern with waitForProcess.

  • Exclude .git from rsync (workspace/.git/ has 50+ hook files, each taking seconds over s3fs).

  • Make workspace/skills rsync non-fatal (directories may not exist in fresh containers).

  • Replace inline wait loops in debug.ts with shared waitForProcess.

  • Add comprehensive e2e tests covering sync, restart persistence, and workspace marker file survival.
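Polling for the timestamp file, rather than for process status, might be sketched like this. The readTimestamp callback is a hypothetical stand-in for however the file is read inside the container, and the attempt count and interval are placeholder values.

```typescript
// Wait until the sync timestamp differs from the value seen before the sync
// started. Returns the new timestamp, or null if it never changed in time.
async function waitForNewTimestamp(
  readTimestamp: () => Promise<string | null>, // hypothetical reader
  previous: string | null,
  maxAttempts = 60,
  intervalMs = 1000,
  // Injectable so the loop can be exercised without real delays (assumption).
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<string | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await readTimestamp();
    const current = raw === null ? "" : raw.trim();
    // A non-empty value that differs from the old one means the rsync chain
    // reached its final step and wrote a fresh timestamp.
    if (current !== "" && current !== previous) return current;
    await sleep(intervalMs);
  }
  return null;
}
```

This only works if the timestamp is written as the last step of the sync chain, which is what the description above implies.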

Fixes #212, #228, #102, #86

@andreasjansson andreasjansson force-pushed the fix-r2-persistence branch 3 times, most recently from 47d5608 to 10bbc8d Compare February 10, 2026 21:01
@Erfan1995

Just ran into this exact issue and did some debugging from inside a running container. Can confirm your diagnosis:

What we saw:

- Backup button failing with "Sync aborted: no config file found"
- Logs showing Sandbox.startProcess - Canceled for test -f /root/.openclaw/openclaw.json
- The config file definitely exists (verified manually)

Root cause confirmed:

- The test -f process was being started but canceled before waitForProcess could get the exit code
- When the process gets canceled, the code falls through to "no config file found"

s3fs performance:

- Manual rsync of 44 files took 2+ minutes over s3fs (each file = network round-trip)
- Processes were getting SIGKILL'd mid-sync
- Excluding .git (17 hook sample files) helped significantly

Workaround that worked:

- Smaller batch copies (cp for individual files) instead of full rsync
- Excluding .git directory

The stdout-based detection and .git exclusion in this PR should fix it. Thanks for the detailed investigation! 🙏

Fix three bugs that together prevented R2 backup/restore from working:

Bug 1 - waitForProcess didn't handle 'starting' status:
  The poll loop only waited while status was 'running', exiting
  immediately if a fast command was still 'starting'.

Bug 2 - exitCode null check broke config detection:
  exitCode is often undefined in the sandbox API. Switched to
  stdout-based detection (echo exists) instead of exitCode checks.

Bug 3 - should_restore_from_r2() race condition:
  Called three times (config, workspace, skills) but the config
  restore copied .last-sync locally, making timestamps match and
  skipping workspace/skills. Now evaluates once with DO_RESTORE flag.

Additional fixes discovered during debugging:

- proc.status is a readonly snapshot that never updates. Use
  getStatus() to poll for actual status changes.

- getStatus() itself returns 'running' indefinitely for shell
  commands (known sandbox API behavior). Sync now starts the rsync
  chain and polls for the timestamp file instead of waiting on
  process status.

- isR2Mounted used getLogs().stdout which was empty (process hadn't
  completed). Now uses stdout marker pattern with waitForProcess.

- Exclude .git from rsync (workspace/.git/ has 50+ hook files,
  each taking seconds over s3fs).

- Make workspace/skills rsync non-fatal (directories may not exist
  in fresh containers).

- Replace inline wait loops in debug.ts with shared waitForProcess.

- Cron handler now uses fire-and-forget sync instead of polling.
  The scheduled handler has a strict time limit — polling competes
  with slow s3fs operations and exceeds it, causing an unhandled
  exception that resets the Durable Object and kills the container.

- Add comprehensive e2e tests covering sync, restart persistence,
  and workspace marker file survival.
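A rough sketch of the fire-and-forget idea for the cron path. The SyncStarter interface and its startProcess signature are assumptions for illustration (the log excerpt in this thread only shows the name Sandbox.startProcess), and startSyncWithoutWaiting is a hypothetical helper.

```typescript
// Hypothetical minimal sandbox surface; the real signature may differ.
interface SyncStarter {
  startProcess(command: string): Promise<unknown>;
}

interface StartResult {
  started: boolean;
  error?: string;
}

// Launch the sync and return immediately. There is no polling here, so the
// scheduled handler's time budget is not spent waiting on slow s3fs I/O.
async function startSyncWithoutWaiting(
  sandbox: SyncStarter,
  command: string,
): Promise<StartResult> {
  try {
    await sandbox.startProcess(command); // assumed to resolve once launched
    return { started: true };
  } catch (err) {
    // Report the failure instead of letting it escape as an unhandled
    // exception from the scheduled handler.
    return {
      started: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

The caller can log the result and return; whether the sync actually finished would then be visible from the timestamp file on a later run.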

Fixes #212, #228, #102, #86
@github-actions

E2E Test Recording (telegram)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (base)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (workers-ai)

✅ Tests passed

E2E Test Video

@andreasjansson
Collaborator Author

@Erfan1995 thank you for your message. I'm also seeing poor s3fs performance. I've done some experiments with rclone, which seems to perform better. I'll test it further and, if it works, open a PR later today.

@github-actions

E2E Test Recording (discord)

❌ Tests failed

E2E Test Video

@andreasjansson andreasjansson merged commit c92bf6d into main Feb 11, 2026
17 of 23 checks passed
@andreasjansson andreasjansson deleted the fix-r2-persistence branch February 11, 2026 09:29
@andreasjansson
Collaborator Author

Tests are flaky, but I'll merge. Looking to roll forward with rclone.

scott-edwards added a commit to scott-edwards/alfred that referenced this pull request Feb 11, 2026
Upstream PRs merged:
- cloudflare#235: Fix R2 persistence (waitForProcess, exitCode, restore race)
- cloudflare#240: Replace s3fs/rsync with rclone (removes cron trigger)

Conflict resolution:
- Keep our GATEWAY_REQUEST_TIMEOUT_MS and CONTAINER_FETCH_TIMEOUT_MS
- Drop CRON_TIMEOUT_MS and R2_MOUNT_PATH (no longer needed)
- Remove scheduled handler (sync now runs inside container)
- Keep sleepAfter: '4h' fix for keepAlive death spiral

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

R2 backup does not run: Sync aborted: no config file found

2 participants