Closed
23 commits
3f6dd73
feat(worker): add durable terminal delivery, checkpoints, and resilience
vsumner Feb 24, 2026
c55a8b3
chore(api): trim worker progress interface surface
vsumner Feb 24, 2026
267bfc3
Harden worker delivery contracts and persist worker events
vsumner Feb 25, 2026
0111f4b
Fix clippy -Dwarnings violations in worker contract paths
vsumner Feb 25, 2026
03b846e
Fix worker delivery contract edge cases and align docs
vsumner Feb 25, 2026
05112f7
Harden worker contract semantics and deterministic status hashing
vsumner Feb 25, 2026
8cfca95
Address follow-up reliability and docs findings
vsumner Feb 25, 2026
15aa147
Harden outbound delivery paths and simplify routing flow
vsumner Feb 25, 2026
2fe58fa
Extract delivery hardening and drop worker event journal
vsumner Feb 25, 2026
d99b46c
Add canonical worker timeline projection
vsumner Feb 25, 2026
893dd4e
Fix cancel_worker race and convergence handling
vsumner Feb 25, 2026
ffd016d
Fix cancel responsiveness and emit receipt SSE events
vsumner Feb 25, 2026
190b576
Fix clippy needless_return warnings
vsumner Feb 25, 2026
8cdbdaa
Fix worker lag/flush races and gate outbound SSE emission
vsumner Feb 25, 2026
0f07bf4
Apply follow-up review fixes for contracts, browser, and watcher loop
vsumner Feb 25, 2026
bed2251
Harden worker flush handling and add regression coverage
vsumner Feb 25, 2026
7a0586c
Fix lagged worker convergence and flush panic cleanup
vsumner Feb 25, 2026
16b3233
Fix routed delivery outcome and stabilize worker timeout test
vsumner Feb 25, 2026
48367c9
Move main.rs tests to end for clippy lint
vsumner Feb 25, 2026
fb447d8
Restore stable 20260224 migration numbering
vsumner Feb 25, 2026
9fc6c1d
Fix Z.AI Coding Plan model remap for chat completions
vsumner Feb 25, 2026
1997e6c
Fix duplicate worker completion surfacing with tracked receipts
vsumner Feb 25, 2026
9c6c066
Ensure retriggers relay worker results when reply is missing
vsumner Feb 26, 2026
49 changes: 48 additions & 1 deletion docs/content/docs/(configuration)/config.mdx
@@ -84,6 +84,12 @@ background_threshold = 0.80 # background summarization
aggressive_threshold = 0.85 # aggressive summarization
emergency_threshold = 0.95 # drop oldest 50%, no LLM

# Deterministic worker task contract timing.
[defaults.worker_contract]
ack_secs = 5 # seconds before first ack checkpoint
progress_secs = 45 # seconds between progress heartbeat nudges
tick_secs = 2 # scheduler tick interval for contract deadline checks

# Cortex (system observer) settings.
[defaults.cortex]
tick_interval_secs = 30
@@ -103,6 +109,7 @@ startup_delay_secs = 5
enabled = true
headless = true
evaluate_enabled = false
browser_action_timeout_secs = 45
executable_path = "/path/to/chrome" # optional, auto-detected
screenshot_dir = "/path/to/screenshots" # optional, defaults to data_dir/screenshots

@@ -119,6 +126,12 @@ cron_timezone = "America/Los_Angeles" # optional per-agent cron timezone overri
[agents.routing]
channel = "anthropic/claude-opus-4-20250514"

# Per-agent worker contract overrides (inherits defaults when omitted).
[agents.worker_contract]
ack_secs = 8
progress_secs = 60
tick_secs = 3

# Per-agent sandbox configuration.
[agents.sandbox]
mode = "enabled" # "enabled" (default) or "disabled"
@@ -227,6 +240,7 @@ Most config values are hot-reloaded when their files change. Spacebot watches `c
| `max_concurrent_branches` | Yes | Next branch spawn checks new limit |
| Browser config | Yes | Next worker spawn uses new config |
| Warmup config | Yes | Next warmup pass uses new values |
| `[defaults.worker_contract]` / `[agents.worker_contract]` (`ack_secs`, `progress_secs`, `tick_secs`) | Yes | Runtime and agent-level contract deadlines and polling update without restart |
| Identity files (SOUL.md, etc.) | Yes | Next channel message renders new identity |
| Skills (SKILL.md files) | Yes | Next message / worker spawn sees new skills |
| Bindings | Yes | Next message routes using new bindings |
@@ -471,12 +485,22 @@ Map of model names to ordered fallback chains. Used when the primary model retur

Thresholds are fractions of `context_window`.

### `[defaults.worker_contract]`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ack_secs` | integer | 5 | Deadline to confirm a worker start was surfaced |
| `progress_secs` | integer | 45 | Deadline between meaningful worker progress updates |
| `tick_secs` | integer | 2 | Poll interval for worker contract deadline checks |

Setting `ack_secs`, `progress_secs`, or `tick_secs` to `0` is treated as unset and falls back to the resolved default for that scope.

### `[defaults.cortex]`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `tick_interval_secs` | integer | 30 | How often the cortex checks system state |
-| `worker_timeout_secs` | integer | 300 | Worker timeout before cancellation |
+| `worker_timeout_secs` | integer | 300 | Inactivity timeout for worker progress events before forced cancellation |
| `branch_timeout_secs` | integer | 60 | Branch timeout before cancellation |
| `circuit_breaker_threshold` | integer | 3 | Consecutive failures before auto-disable |

@@ -504,6 +528,7 @@ When branch/worker/cron dispatch happens before readiness is satisfied, Spacebot
| `enabled` | bool | true | Whether workers have browser tools |
| `headless` | bool | true | Run Chrome headless |
| `evaluate_enabled` | bool | false | Allow JavaScript evaluation |
| `browser_action_timeout_secs` | integer | 45 | Per-action timeout for browser operations |
| `executable_path` | string | None | Custom Chrome/Chromium path |
| `screenshot_dir` | string | None | Directory for screenshots |

@@ -518,9 +543,31 @@ When branch/worker/cron dispatch happens before readiness is satisfied, Spacebot
| `max_concurrent_branches` | integer | inherits | Override instance default |
| `max_turns` | integer | inherits | Override instance default |
| `context_window` | integer | inherits | Override instance default |
| `worker_contract` | table | inherits | Per-agent worker contract override |

Agent-specific routing is set via `[agents.routing]` with the same keys as `[defaults.routing]`.

### `[agents.worker_contract]`

Per-agent worker contract override.
Unset keys inherit from `[defaults.worker_contract]`.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `ack_secs` | integer | inherits | Deadline to confirm a worker start was surfaced |
| `progress_secs` | integer | inherits | Deadline between meaningful worker progress updates |
| `tick_secs` | integer | inherits | Poll interval for worker contract deadline checks |

Setting `ack_secs`, `progress_secs`, or `tick_secs` to `0` is treated as unset and falls back to the resolved default for that scope.

```toml
[agents.worker_contract]
# Setting any field to 0 treats it as unset and falls back to the resolved default.
ack_secs = 8
progress_secs = 60
tick_secs = 3
```
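The inheritance rule above can be sketched in Rust. This is a minimal illustration, not the project's actual code: the type names `WorkerContract` and `WorkerContractOverride` and the functions `pick` and `resolve_contract` are hypothetical, and the sketch assumes an omitted key and an explicit `0` are handled identically.

```rust
/// Hypothetical fully resolved contract timing (all values in seconds).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WorkerContract {
    pub ack_secs: u64,
    pub progress_secs: u64,
    pub tick_secs: u64,
}

/// Hypothetical per-agent override; `None` means the key was omitted.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct WorkerContractOverride {
    pub ack_secs: Option<u64>,
    pub progress_secs: Option<u64>,
    pub tick_secs: Option<u64>,
}

/// Treats both a missing key and an explicit `0` as unset (assumption
/// taken from the documented "0 is treated as unset" rule).
fn pick(value: Option<u64>, fallback: u64) -> u64 {
    match value {
        Some(v) if v > 0 => v,
        _ => fallback,
    }
}

/// Resolves an agent override against the instance defaults, key by key.
pub fn resolve_contract(
    defaults: WorkerContract,
    agent: WorkerContractOverride,
) -> WorkerContract {
    WorkerContract {
        ack_secs: pick(agent.ack_secs, defaults.ack_secs),
        progress_secs: pick(agent.progress_secs, defaults.progress_secs),
        tick_secs: pick(agent.tick_secs, defaults.tick_secs),
    }
}
```

With the documented defaults (`5`, `45`, `2`), an agent that sets `ack_secs = 8`, `progress_secs = 0`, and omits `tick_secs` would resolve to `8`, `45`, `2` under these assumptions.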

### `[agents.sandbox]`

OS-level filesystem containment for shell and exec tool subprocesses. Uses bubblewrap (Linux) or sandbox-exec (macOS) to enforce read-only access to everything outside the workspace.
1 change: 1 addition & 0 deletions docs/content/docs/(deployment)/roadmap.mdx
@@ -39,6 +39,7 @@ The full message-in → LLM → response-out pipeline is wired end-to-end across
- **Tools** — 16 tools implement Rig's `Tool` trait with real logic (reply, branch, spawn_worker, route, cancel, skip, react, memory_save, memory_recall, set_status, shell, file, exec, browser, cron, web_search)
- **Workspace containment** — file tool validates paths stay within workspace boundary, shell/exec tools block instance directory traversal, sensitive file access, and secret env var leakage
- **Conversation persistence** — `ConversationLogger` with fire-and-forget SQLite writes, compaction archiving
- **Worker task contracts** — deterministic worker ack/progress/terminal deadlines with one-time SLA nudge and durable terminal convergence (`terminal_acked` / `terminal_failed`)
- **Cron** — scheduler with timers, active hours, circuit breaker (3 failures → disable), creates real channels. CronTool wired into channel tool factory.
- **Message routing** — full event loop with binding resolution, channel lifecycle, outbound routing
- **Settings store** — redb key-value with WorkerLogMode
52 changes: 49 additions & 3 deletions docs/content/docs/(features)/workers.mdx
@@ -62,13 +62,17 @@ Workers don't get memory tools, channel tools, or branch tools. They can't talk

```
Running ──→ Done (fire-and-forget completed)
-Running ──→ Failed (error or cancellation)
+Running ──→ Failed (error)
Running ──→ Cancelled (cancelled by channel/system)
Running ──→ timed_out (inactivity timeout elapsed)
Running ──→ WaitingForInput (interactive worker finished initial task)
WaitingForInput ──→ Running (follow-up message received via route)
WaitingForInput ──→ Failed (follow-up processing failed)
WaitingForInput ──→ Cancelled (cancelled by channel/system)
WaitingForInput ──→ timed_out (inactivity timeout elapsed)
```

-`Done` and `Failed` are terminal. Illegal transitions are runtime errors.
+`Done`, `Failed`, `Cancelled`, and `timed_out` are terminal. Illegal transitions are runtime errors.
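The updated state machine can be sketched as a Rust enum with a checked transition. This is an illustration only: the `WorkerState` enum, its `TimedOut` variant name, and the `transition` method are hypothetical, and the set of legal edges is copied from the diagram above rather than from the project's source.

```rust
/// Hypothetical worker states mirroring the diagram; `TimedOut` stands in
/// for the documented `timed_out` state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkerState {
    Running,
    WaitingForInput,
    Done,
    Failed,
    Cancelled,
    TimedOut,
}

impl WorkerState {
    /// Terminal states have no outgoing transitions.
    pub fn is_terminal(self) -> bool {
        matches!(
            self,
            Self::Done | Self::Failed | Self::Cancelled | Self::TimedOut
        )
    }

    /// Returns the next state, or an error for any edge not in the diagram.
    pub fn transition(self, next: WorkerState) -> Result<WorkerState, String> {
        use WorkerState::*;
        let legal = match (self, next) {
            (Running, Done | Failed | Cancelled | TimedOut | WaitingForInput) => true,
            (WaitingForInput, Running | Failed | Cancelled | TimedOut) => true,
            _ => false,
        };
        if legal {
            Ok(next)
        } else {
            Err(format!("illegal transition: {:?} -> {:?}", self, next))
        }
    }
}
```

Note that the diagram lists no direct `WaitingForInput ──→ Done` edge, so this sketch rejects it; an interactive worker would return to `Running` first.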

## Context and History

@@ -95,7 +99,7 @@ Workers run in segments of 25 turns each. After each segment:

- If the agent returned a result: done
- If max turns hit: compact if needed, continue with "Continue where you left off"
-- If cancelled: state = Failed
+- If cancelled: state = Cancelled
- If context overflow: force compact, retry

This prevents runaway workers and handles long tasks that exceed a single agent loop.
@@ -111,10 +115,52 @@ Workers report progress via the `set_status` tool. The status string (max 256 ch

The channel LLM sees this and can decide whether to wait, ask for more info, or cancel.

Spacebot also forwards throttled worker checkpoints to the user-facing adapter:

- Start and completion updates are always surfaced.
- Mid-run checkpoints are deduped and rate-limited (default: at most one every 20s per worker, with urgent states bypassing the limit).
- Adapters that support message editing (for example Discord) update a single progress message in place to avoid channel spam.
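The dedupe and rate-limit rules above can be sketched as a small per-worker throttle. This is a hedged illustration: `CheckpointThrottle` and `should_send` are hypothetical names, and the sketch assumes that a status identical to the last forwarded one is dropped even when it is marked urgent.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical throttle for mid-run checkpoints, keyed by worker id.
pub struct CheckpointThrottle {
    min_interval: Duration,
    /// worker id -> (time of last forwarded checkpoint, its status text)
    last: HashMap<String, (Instant, String)>,
}

impl CheckpointThrottle {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last: HashMap::new() }
    }

    /// Returns true when the checkpoint should be forwarded to the adapter.
    pub fn should_send(
        &mut self,
        worker_id: &str,
        status: &str,
        urgent: bool,
        now: Instant,
    ) -> bool {
        if let Some((sent_at, last_status)) = self.last.get(worker_id) {
            // Dedupe: an unchanged status is never re-sent (assumption).
            if last_status.as_str() == status {
                return false;
            }
            // Rate limit: non-urgent updates wait out the interval.
            if !urgent && now.duration_since(*sent_at) < self.min_interval {
                return false;
            }
        }
        self.last
            .insert(worker_id.to_string(), (now, status.to_string()));
        true
    }
}
```

Constructing the throttle with `Duration::from_secs(20)` matches the documented default; start and completion updates would bypass this path entirely, since they are always surfaced.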

## Concurrency

Workers run concurrently. The default limit is `max_concurrent_workers: 5` per channel (configurable per agent). Attempting to spawn beyond the limit returns an error to the LLM so it can wait or cancel an existing worker.

## Timeouts

Worker runs are bounded by `worker_timeout_secs` (default `300`) as an inactivity timeout. Any worker progress event (status updates, tool activity, permission/question prompts) resets the timer.

If no progress arrives within the timeout window, Spacebot marks the worker as `timed_out`, records a terminal result, and removes it from active worker state so the channel can continue delegating work.
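The difference between an inactivity timeout and a wall-clock limit can be shown with a short sketch. The `InactivityTimer` type and its methods are hypothetical names used for illustration; only the reset-on-progress behaviour and the `300` second default come from the text above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical inactivity timer: the deadline is measured from the most
/// recent progress event, not from when the worker started.
pub struct InactivityTimer {
    timeout: Duration,
    last_progress: Instant,
}

impl InactivityTimer {
    pub fn new(timeout: Duration, started_at: Instant) -> Self {
        Self { timeout, last_progress: started_at }
    }

    /// Call on any progress event (status update, tool activity, prompt).
    pub fn record_progress(&mut self, at: Instant) {
        self.last_progress = at;
    }

    /// True once a full timeout window has passed with no progress.
    pub fn is_timed_out(&self, now: Instant) -> bool {
        now.duration_since(self.last_progress) >= self.timeout
    }
}
```

Under this model a worker that keeps reporting progress can run well past 300 seconds in total; only a silent window of that length triggers `timed_out`.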

## Deterministic Task Contracts

Each worker run now gets an internal task contract with three deadlines:

- **Acknowledge deadline** — confirms the worker start was surfaced to the user-facing adapter.
- **Progress deadline** — expects a meaningful heartbeat before the deadline.
- **Terminal deadline** — tracks terminal delivery lifecycle until receipt ack/failure.

If the acknowledge deadline is missed, Spacebot emits a synthesized "running" checkpoint. If the progress deadline is missed, it emits one synthesized "still working" nudge (one-time, no spam loop). Terminal receipt ack/failure then closes the contract as `terminal_acked` or `terminal_failed`.
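The deadline checks described above can be sketched as a function the scheduler calls on each tick. All names here (`TaskContract`, `ContractAction`, `tick`) are hypothetical, times are plain seconds since the worker started, and the sketch assumes that emitting the synthesized "running" checkpoint itself satisfies the acknowledge deadline.

```rust
/// Hypothetical action the scheduler should take after a tick.
#[derive(Debug, PartialEq, Eq)]
pub enum ContractAction {
    Nothing,
    EmitRunningCheckpoint,
    EmitStillWorkingNudge,
}

/// Hypothetical in-memory view of one worker's contract.
pub struct TaskContract {
    pub acked: bool,
    pub ack_deadline: u64,      // seconds since worker start
    pub progress_deadline: u64, // seconds since worker start
    pub sla_nudge_sent: bool,   // mirrors the `sla_nudge_sent` column
}

impl TaskContract {
    /// Checks deadlines at time `now` and records what was emitted so the
    /// same synthesized message is never produced twice.
    pub fn tick(&mut self, now: u64) -> ContractAction {
        if !self.acked && now >= self.ack_deadline {
            self.acked = true; // assumption: the synthesized checkpoint counts as the ack
            return ContractAction::EmitRunningCheckpoint;
        }
        if !self.sla_nudge_sent && now >= self.progress_deadline {
            self.sla_nudge_sent = true; // one-time nudge, no spam loop
            return ContractAction::EmitStillWorkingNudge;
        }
        ContractAction::Nothing
    }
}
```

A real implementation would also push `progress_deadline` forward whenever a genuine heartbeat arrives; that bookkeeping is omitted here to keep the one-time nudge logic visible.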

## Terminal Delivery Reliability

Terminal worker notices (`done`, `failed`, `timed_out`, `cancelled`) are queued as durable delivery receipts before they are sent to the messaging adapter.

- Receipts are retried with bounded backoff on adapter delivery errors.
- Successful delivery marks the receipt as acknowledged.
- On process restart, in-flight (`sending`) receipts are re-queued so completion notices are not silently dropped.
- Old terminal receipts (`acked`, `failed`) are pruned periodically to keep storage bounded.
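The receipt lifecycle above can be sketched with three small functions. The status names follow the documented `pending`, `sending`, `acked`, and `failed` values, but everything else is an assumption made for illustration: the function names, the attempt cap of `5`, and the exponential schedule capped at 60 seconds are not taken from the source.

```rust
use std::time::Duration;

/// Receipt statuses as named in the docs and migration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReceiptStatus {
    Pending,
    Sending,
    Acked,
    Failed,
}

// Hypothetical tuning values; the real limits are not stated in the docs.
pub const MAX_ATTEMPTS: u32 = 5;
const BASE_DELAY_SECS: u64 = 2;
const MAX_DELAY_SECS: u64 = 60;

/// Exponential backoff, bounded above by `MAX_DELAY_SECS`.
pub fn backoff_delay(attempt_count: u32) -> Duration {
    let exp = attempt_count.min(16); // keep the shift well inside u64 range
    let secs = BASE_DELAY_SECS
        .saturating_mul(1u64 << exp)
        .min(MAX_DELAY_SECS);
    Duration::from_secs(secs)
}

/// Maps one delivery attempt to the receipt's next status and, when a
/// retry is due, the delay before `next_attempt_at`.
pub fn on_delivery_result(
    attempt_count: u32,
    delivered: bool,
) -> (ReceiptStatus, Option<Duration>) {
    if delivered {
        return (ReceiptStatus::Acked, None);
    }
    let attempts = attempt_count + 1;
    if attempts >= MAX_ATTEMPTS {
        (ReceiptStatus::Failed, None)
    } else {
        (ReceiptStatus::Pending, Some(backoff_delay(attempts)))
    }
}

/// On process restart, in-flight receipts go back to the queue.
pub fn recover_on_restart(status: ReceiptStatus) -> ReceiptStatus {
    match status {
        ReceiptStatus::Sending => ReceiptStatus::Pending,
        other => other,
    }
}
```

Re-queuing `sending` receipts means a notice may be delivered twice if the process died after the adapter accepted it; the sketch favours duplicate delivery over a silently dropped completion notice, which matches the stated goal.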

## Canonical Timeline Projection

Worker execution truth stays in `worker_runs.transcript`. Delivery truth stays in `worker_task_contracts` and `worker_delivery_receipts`.

Spacebot computes a read-time projection (it does not persist a second event log):

- Transcript steps are ordered by step index.
- Delivery/contract snapshots are ordered by timestamp.
- `workers/detail?include_timeline=true` returns the synthesized timeline plus a `terminal_converged` flag.
- `worker_inspect` shows the same projection so transcript and delivery state can be audited together.
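A read-time projection of this kind can be sketched as a pure function over the two sources. This is a loose illustration under stated assumptions: the struct and function names are hypothetical, the two streams are kept as separate ordered lists because the docs do not say how they interleave, and `terminal_converged` is assumed to mean that some contract snapshot reached `terminal_acked` or `terminal_failed`.

```rust
/// Hypothetical transcript entry, ordered by its step index.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TranscriptStep {
    pub index: u32,
    pub summary: String,
}

/// Hypothetical delivery/contract snapshot, ordered by timestamp.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeliverySnapshot {
    pub timestamp: u64, // seconds, any monotonic epoch
    pub state: String,
}

/// Hypothetical projection returned to callers; nothing here is persisted.
#[derive(Debug, PartialEq, Eq)]
pub struct Timeline {
    pub steps: Vec<TranscriptStep>,
    pub deliveries: Vec<DeliverySnapshot>,
    pub terminal_converged: bool,
}

/// Builds the projection from already-loaded rows.
pub fn project_timeline(
    mut steps: Vec<TranscriptStep>,
    mut deliveries: Vec<DeliverySnapshot>,
) -> Timeline {
    steps.sort_by_key(|s| s.index);
    deliveries.sort_by_key(|d| d.timestamp);
    // Assumption: convergence means a terminal contract state was recorded.
    let terminal_converged = deliveries
        .iter()
        .any(|d| matches!(d.state.as_str(), "terminal_acked" | "terminal_failed"));
    Timeline { steps, deliveries, terminal_converged }
}
```

Because the function only sorts and inspects its inputs, calling it twice on the same rows yields the same timeline, which is the property that makes a second persisted event log unnecessary.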

## Model Routing

Workers default to `anthropic/claude-haiku-4.5-20250514`. Task-type overrides apply — for example, a `coding` task type routes to `anthropic/claude-sonnet-4-20250514`. Fallback chains are supported. All hot-reloadable.
28 changes: 28 additions & 0 deletions migrations/20260224000001_worker_delivery_receipts.sql
@@ -0,0 +1,28 @@
-- Durable delivery receipts for terminal worker notifications.
--
-- Tracks whether a terminal worker completion notice has been delivered to the
-- user-facing channel, with bounded retry metadata for transient adapter
-- failures.

CREATE TABLE IF NOT EXISTS worker_delivery_receipts (
id TEXT PRIMARY KEY,
worker_id TEXT NOT NULL,
channel_id TEXT NOT NULL,
kind TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
terminal_state TEXT NOT NULL,
payload_text TEXT NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_error TEXT,
next_attempt_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
acked_at TIMESTAMP,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
UNIQUE(worker_id, kind)
);

CREATE INDEX idx_worker_delivery_receipts_due
ON worker_delivery_receipts(status, next_attempt_at);

CREATE INDEX idx_worker_delivery_receipts_channel
ON worker_delivery_receipts(channel_id, created_at);
35 changes: 35 additions & 0 deletions migrations/20260224000002_worker_task_contracts.sql
@@ -0,0 +1,35 @@
-- Deterministic worker task contracts.
--
-- Tracks acknowledgement/progress/terminal guarantees for worker executions so
-- long-running tasks always provide bounded feedback and reach terminal states.

CREATE TABLE IF NOT EXISTS worker_task_contracts (
id TEXT PRIMARY KEY,
agent_id TEXT NOT NULL,
channel_id TEXT NOT NULL,
worker_id TEXT NOT NULL UNIQUE,
task_summary TEXT NOT NULL,
state TEXT NOT NULL DEFAULT 'created',
ack_deadline_at TIMESTAMP NOT NULL,
progress_deadline_at TIMESTAMP NOT NULL,
terminal_deadline_at TIMESTAMP NOT NULL,
last_progress_at TIMESTAMP,
last_status_hash TEXT,
attempt_count INTEGER NOT NULL DEFAULT 0,
sla_nudge_sent INTEGER NOT NULL DEFAULT 0,
terminal_state TEXT,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_worker_task_contracts_channel_state
ON worker_task_contracts(channel_id, state);

CREATE INDEX idx_worker_task_contracts_ack_due
ON worker_task_contracts(state, ack_deadline_at);

CREATE INDEX idx_worker_task_contracts_progress_due
ON worker_task_contracts(state, progress_deadline_at);

CREATE INDEX idx_worker_task_contracts_terminal_due
ON worker_task_contracts(state, terminal_deadline_at);