-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix R2 persistence: three interacting bugs + sandbox API workarounds #235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
47d5608 to
10bbc8d
Compare
|
Just ran into this exact issue and did some debugging from inside a running container. Can confirm your diagnosis: What we saw: Backup button failing with "Sync aborted: no config file found" Root cause confirmed: The test -f process was being started but canceled before waitForProcess could get the exit code s3fs performance: Manual rsync of 44 files took 2+ minutes over s3fs (each file = network round-trip) Workaround that worked: Smaller batch copies (cp for individual files) instead of full rsync |
Fix three bugs that together prevented R2 backup/restore from working: Bug 1 - waitForProcess didn't handle 'starting' status: The poll loop only waited while status was 'running', exiting immediately if a fast command was still 'starting'. Bug 2 - exitCode null check broke config detection: exitCode is often undefined in the sandbox API. Switched to stdout-based detection (echo exists) instead of exitCode checks. Bug 3 - should_restore_from_r2() race condition: Called three times (config, workspace, skills) but the config restore copied .last-sync locally, making timestamps match and skipping workspace/skills. Now evaluates once with DO_RESTORE flag. Additional fixes discovered during debugging: - proc.status is a readonly snapshot that never updates. Use getStatus() to poll for actual status changes. - getStatus() itself returns 'running' indefinitely for shell commands (known sandbox API behavior). Sync now starts the rsync chain and polls for the timestamp file instead of waiting on process status. - isR2Mounted used getLogs().stdout which was empty (process hadn't completed). Now uses stdout marker pattern with waitForProcess. - Exclude .git from rsync (workspace/.git/ has 50+ hook files, each taking seconds over s3fs). - Make workspace/skills rsync non-fatal (directories may not exist in fresh containers). - Replace inline wait loops in debug.ts with shared waitForProcess. - Cron handler now uses fire-and-forget sync instead of polling. The scheduled handler has a strict time limit — polling competes with slow s3fs operations and exceeds it, causing an unhandled exception that resets the Durable Object and kills the container. - Add comprehensive e2e tests covering sync, restart persistence, and workspace marker file survival. Fixes #212, #228, #102, #86
10bbc8d to
6ecc273
Compare
|
@Erfan1995 thank you for your message. I'm also seeing poor s3fs performance. I've done some experiments with rclone which seems to perform better. I'll try that a bit more and if it works, I'll open a PR later today. |
|
Tests are flaky, but I'll merge. Looking to roll forward with rclone. |
Upstream PRs merged: - cloudflare#235: Fix R2 persistence (waitForProcess, exitCode, restore race) - cloudflare#240: Replace s3fs/rsync with rclone (removes cron trigger) Conflict resolution: - Keep our GATEWAY_REQUEST_TIMEOUT_MS and CONTAINER_FETCH_TIMEOUT_MS - Drop CRON_TIMEOUT_MS and R2_MOUNT_PATH (no longer needed) - Remove scheduled handler (sync now runs inside container) - Keep sleepAfter: '4h' fix for keepAlive death spiral Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>




Fix three bugs that together prevented R2 backup/restore from working:
Bug 1 - waitForProcess didn't handle 'starting' status:
The poll loop only waited while status was 'running', exiting
immediately if a fast command was still 'starting'.
Bug 2 - exitCode null check broke config detection:
exitCode is often undefined in the sandbox API. Switched to
stdout-based detection (echo exists) instead of exitCode checks.
Bug 3 - should_restore_from_r2() race condition:
Called three times (config, workspace, skills) but the config
restore copied .last-sync locally, making timestamps match and
skipping workspace/skills. Now evaluates once with DO_RESTORE flag.
Additional fixes discovered during debugging:
proc.status is a readonly snapshot that never updates. Use getStatus() to poll for actual status changes.
getStatus() itself returns 'running' indefinitely for shell commands (known sandbox API behavior). Sync now starts the rsync chain and polls for the timestamp file instead of waiting on process status.
isR2Mounted used getLogs().stdout which was empty (process hadn't completed). Now uses stdout marker pattern with waitForProcess.
Exclude .git from rsync (workspace/.git/ has 50+ hook files, each taking seconds over s3fs).
Make workspace/skills rsync non-fatal (directories may not exist in fresh containers).
Replace inline wait loops in debug.ts with shared waitForProcess.
Add comprehensive e2e tests covering sync, restart persistence, and workspace marker file survival.
Fixes #212, #228, #102, #86