Long-running AI coding tasks fail silently. The agent asks "should I continue?", you're not watching, and the task stalls. Or worse — the agent says "done" but didn't actually pass the tests.
thin-supervisor fixes this. It's an acceptance-centered run supervisor that sits alongside your existing coding agent (Claude Code, Codex, or any CLI agent), watches what the agent does, and makes structured decisions: continue, re-inject, verify, retry, branch, recover, escalate, or finish. "Done" means the verifier passed and the acceptance contract is satisfied — not that the agent said so. You stay in your familiar agent UI. The supervisor handles the rest.
Architecture deep-dive: See docs/ARCHITECTURE.md for the six-layer architecture, first-class objects, and design principles.
Docs hub:
- docs/getting-started.md — install and run tmux, open-relay, and JSONL workflows
- docs/ARCHITECTURE.md — object model, layers, and current implementation status
- CHANGELOG.md — release notes and unreleased changes
- docs/design/p2-external-surfaces.md — surface abstraction and roadmap
- docs/design/p3-observation-sources.md — observation sources and normalization
- docs/design/p4-jsonl-observation.md — transcript-backed observation mode
- docs/reviews/2026-04-11-deep-code-review.md — latest deep code review log and remaining-risk audit
- docs/reviews/2026-04-12-amp-supervisor-capability-review.md — Amp-vs-thin-supervisor capability review and oracle-layer roadmap
- Operator-initiated clarification escalation. When a clarification answer comes back below the configured confidence threshold, operators can escalate the question to the worker from any channel: thin-supervisor clarify <run_id> <question> --escalate, the TUI E keybind, or /escalate <run_id> [question] in IM. Each escalation writes a single clarification_escalated_to_worker timeline event so the decision is durably auditable. Actual side-instruction transport to the worker is deferred to 0.3.8; 0.3.7 ships the end-to-end operator UX and the audit surface.
- Operator channel polish. assess_drift can opt into a heavier reasoner via deep_explainer_model while routine explanations stay on the cheap explainer_model. Clarification answers now emit a dedicated explainer_answer timeline event (tagged source="explainer") and carry escalation_recommended=True when confidence falls below clarification_escalation_confidence (default 0.4) so operators know when to route the question to the worker. DriftAssessment and ExchangeView frozen dataclasses give TUI / IM channels a type-safe projection instead of raw dicts.
- JSONL runs close the loop via Stop hook. thin-supervisor hook install wires a Claude Code / Codex Stop hook that reads the supervisor's per-session handoff file, returns the next instruction as the agent's stop reason, and writes an ACK so the supervisor can confirm delivery. Observation-only runs no longer dead-end at "pause for human" when nothing else is wrong.
- A2A inbound is live. thin-supervisor a2a serve exposes a stdlib-only Google A2A JSON-RPC listener (tasks/send / tasks/get) behind a shared inbound boundary guard: bearer auth with localhost fallback, per-IP rate-limit, injection scan, and secret redaction with SHA-256 audit hashes. task_id equals the client request_id, so task identity survives adapter and daemon restarts.
- Layered system observability is live. thin-supervisor overview (with --json / --watch) renders a whole-system view (daemon counts, live/orphaned/completed sessions, event-plane backlog, actionable alerts, and a cross-run timeline) that stays consistent with status, observe, and the TUI global mode (g to toggle).
- Session-first deferred review plumbing is live. Sessions now persist as first-class cross-run identities, external review requests/results land in a shared event-plane store, and operator surfaces can inspect mailbox items and outstanding waits by session_id.
- Post-merge review hardening landed. Event-plane request/result ingest now has tighter validation, wake-policy bookkeeping is recorded deterministically, and session/history export includes the deferred-work timeline for audit and replay.
- Global-first observability is live. status, dashboard, tui, and observe now read from one canonical session index, so runs stay visible across worktrees even after daemon idle shutdown.
- The runtime state machine is split by intent. ATTACHED, RECOVERY_NEEDED, and pause_class now distinguish attach-boundary tightening, operational recovery, and true human-owned pauses.
- Structured checkpoint semantics are live. The runtime now normalizes one canonical NormalizedCheckpoint, accepts v2 semantic fields (progress_class, evidence_scope, escalation_class, requires_authorization, blocking_inputs, reason_code), and routes contradictions by class instead of re-guessing everything from prose.
- Legacy sunset policy is encoded, not hand-waved. reason_code is frozen to esc/rec/ver/sem, live-ingest sunset uses the 14-day dual-signal trigger, and replay/export keep permanent v1 read compatibility.
- Operator IM channels are live. Telegram and Lark/Feishu provider instances are merged into one logical command surface per bot/app, with one inbound owner and multi-target outbound delivery.
- tmux injection is harder to wedge. A readiness gate now checks whether the pane is still changing, actively typing, or actually idle before issuing send-keys.
- The policy-tuning loop is end-to-end. thin-supervisor-dev eval now covers compare, canary, candidate review/status, gating, promotion, and the one-command eval improve wrapper.
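The escalation recommendation described in the list above can be sketched as a simple threshold check. This is a minimal illustration, not the project's code: the `ExplainerAnswer` class and `build_answer` function are hypothetical names, and only the field names `escalation_recommended`, `source="explainer"`, and the default `0.4` come from the text.

```python
from dataclasses import dataclass

# Default value quoted in the release notes; the constant name is illustrative.
CLARIFICATION_ESCALATION_CONFIDENCE = 0.4


@dataclass(frozen=True)
class ExplainerAnswer:
    """Hypothetical shape of a clarification answer (not the real class)."""
    text: str
    confidence: float
    escalation_recommended: bool
    source: str = "explainer"


def build_answer(text: str, confidence: float,
                 threshold: float = CLARIFICATION_ESCALATION_CONFIDENCE) -> ExplainerAnswer:
    # Recommend escalation only when confidence falls strictly below the threshold.
    return ExplainerAnswer(
        text=text,
        confidence=confidence,
        escalation_recommended=confidence < threshold,
    )
```

A frozen dataclass is used here because the notes describe the operator-facing views as frozen dataclasses; whether the answer object itself is one is an assumption.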
pip install -U thin-supervisor
thin-supervisor skill install

pip install -U thin-supervisor gets the runtime CLI from PyPI. thin-supervisor skill install then installs the Codex / Claude skills into your local agent environment.
┌────────────────────────────┐  ┌──────────────────────────┐
│ Your Agent (visible pane)  │  │ Supervisor (sidecar)     │
│ Claude Code / Codex        │  │ reads pane output        │
│                            │  │ parses checkpoints       │
│ ... working ...            │  │ gates decisions          │
│                            │  │ runs verifiers           │
│ <checkpoint>               │──│ injects next step        │
│ status: step_done          │  │                          │
│ </checkpoint>              │  │ state: RUNNING → VERIFY  │
└────────────────────────────┘  └──────────────────────────┘
                        tmux session
| Scenario | Without supervisor | With supervisor |
|---|---|---|
| 10-step implementation plan | Agent asks permission at every step | Runs to completion, verifies each step |
| Test-driven workflow | Agent says "done" without running tests | Verifier runs tests, rejects if failing |
| Agent asks "should I continue?" | You miss it, task stalls for hours | Supervisor auto-answers, keeps going |
| Dangerous operation detected | Agent proceeds silently | Supervisor escalates to you |
| Object | Question it answers | What it is |
|---|---|---|
| WorkflowSpec | What should be done? | YAML task definition with steps, verification criteria, and finish policy |
| CheckpointEvent | What did the agent just report? | Structured status with seq tracking, evidence, and needs |
| SupervisorDecision | What does the control plane think? | Typed gate decision with confidence, reasoning, and causality link |
| HandoffInstruction | What should the agent do next? | Composed instruction with full traceability to the triggering decision |
| ExecutionSurface | How do we talk to the agent? | Protocol for read/inject/cwd — tmux, open-relay, and JSONL observation surfaces |
| SessionRun | Who is this run? | Identity + durable event history — survives crashes, enables recovery |
| Object | Purpose |
|---|---|
| AcceptanceContract | Defines "what counts as truly done" — required evidence, forbidden states, risk class, reviewer gating |
| WorkerProfile | Explicit worker capabilities — provider, model, trust level. Drives supervision intensity. |
| SupervisionPolicy | Three modes: strict_verifier (default) / collaborative_reviewer / directive_lead. Prevents a thin supervisor from micromanaging a strong worker. |
| RoutingDecision | Escalation routing — human, stronger reviewer, or alternate executor |
These form a causality chain: every instruction traces back to the decision that caused it, which traces back to the checkpoint that triggered it.
CheckpointEvent(seq=3) → SupervisorDecision(triggered_by_seq=3) → HandoffInstruction(triggered_by_decision=X)
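A minimal sketch of how such a chain could be walked backwards, assuming each object carries the link fields shown above. The dataclasses below are simplified stand-ins, and `decision_id`, `content`, and `trace` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CheckpointEvent:
    seq: int
    status: str


@dataclass(frozen=True)
class SupervisorDecision:
    decision_id: str        # hypothetical identifier field
    triggered_by_seq: int   # link back to the checkpoint
    decision: str


@dataclass(frozen=True)
class HandoffInstruction:
    triggered_by_decision: str  # link back to the decision
    content: str


def trace(instruction: HandoffInstruction,
          decisions: Iterable[SupervisorDecision],
          checkpoints: Iterable[CheckpointEvent],
          ) -> Tuple[SupervisorDecision, CheckpointEvent]:
    """Walk instruction -> decision -> checkpoint; raises KeyError on a broken link."""
    decision = {d.decision_id: d for d in decisions}[instruction.triggered_by_decision]
    checkpoint = {c.seq: c for c in checkpoints}[decision.triggered_by_seq]
    return decision, checkpoint
```

Raising on a missing link is a design choice for the sketch: a broken chain is treated as an error rather than silently returning a partial trace.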
Full guide: See docs/getting-started.md for step-by-step instructions covering tmux, open-relay, JSONL observation, and Codex/Claude/OpenCode/Droid workflows.
# Install
pip install thin-supervisor
# Install the Codex / Claude skills automatically when supported
thin-supervisor skill install
# Initialize in your project
cd your-project
thin-supervisor init
# If .supervisor/ exists but is missing config, repair the scaffold in place
thin-supervisor init --repair
# Write a spec (or let the Skill generate one)
cat > .supervisor/specs/my-plan.yaml << 'EOF'
kind: linear_plan
id: my_feature
goal: implement feature X with tests
approval:
required: true
status: draft
finish_policy:
require_all_steps_done: true
require_verification_pass: true
policy:
default_continue: true
max_retries_per_node: 3
steps:
- id: write_tests
type: task
objective: write failing tests for feature X
verify:
- type: artifact
path: tests/test_feature_x.py
exists: true
- id: implement
type: task
objective: implement feature X until tests pass
verify:
- type: command
run: pytest -q tests/test_feature_x.py
expect: pass
- id: final_check
type: task
objective: run full test suite
verify:
- type: command
run: pytest -q
expect: pass
EOF
# Approve the draft spec, then attach
thin-supervisor spec approve --spec .supervisor/specs/my-plan.yaml --by human
scripts/thin-supervisor-attach.sh my-plan

Execution entry points reject draft specs. This is deliberate: the clarify/approve step is part of the contract.
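The draft-rejection rule can be pictured as a guard that runs before any attach. This is a sketch under assumptions: the spec is shown as an already-parsed dict, `ensure_approved` and `SpecNotApprovedError` are hypothetical names, and the value `"approved"` is assumed to be what `spec approve` writes (the text only shows `draft`).

```python
class SpecNotApprovedError(RuntimeError):
    """Hypothetical error raised when a spec has not been approved."""


def ensure_approved(spec: dict) -> None:
    approval = spec.get("approval") or {}
    # Assumption: an approved spec carries status "approved".
    if approval.get("required", False) and approval.get("status") != "approved":
        raise SpecNotApprovedError(
            f"spec {spec.get('id', '<unknown>')!r} has status "
            f"{approval.get('status')!r}; approve it before attaching"
        )


draft = {"id": "my_feature", "approval": {"required": True, "status": "draft"}}
try:
    ensure_approved(draft)
    rejected = False
except SpecNotApprovedError:
    rejected = True
```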
- Supervisor reads the agent's pane output every 2 seconds
- Agent emits a <checkpoint> block after completing work
- Supervisor parses the checkpoint and makes a gate decision:
- Continue — agent is making progress, don't interrupt
- Re-inject — run is attached but the first checkpoint still only cites attach/spec/admin work, so tighten the current-node instruction
- Verify — agent says step is done, run the verifier
- Retry — verification failed, inject retry instruction with failure details
- Branch — decision node in workflow, select a path
- Recover — delivery/session-health fault; supervisor attempts bounded auto-recovery before surfacing it
- Escalate — missing credentials, dangerous action, explicit review, or low confidence — pause for human
- Finish — all steps done, all verifiers pass, finish policy and review requirements satisfied
- If continuing or retrying, supervisor injects the next instruction into the pane
- Run-level decisions are logged to session_log.jsonl; project-level bootstrap and repair incidents are logged to .supervisor/runtime/ops_log.jsonl
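The core of the loop above can be sketched as two small functions. This is a deliberately simplified model, not the real gate logic: it ignores re-inject, branch, and recover, and the names `Decision`, `gate`, and `after_verify` are hypothetical. The status values and `max_retries_per_node` come from the checkpoint protocol and the spec example.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    VERIFY = "verify"
    RETRY = "retry"
    ESCALATE = "escalate"
    FINISH = "finish"


def gate(status: str) -> Decision:
    """Map a checkpoint status to a first-pass decision."""
    if status in ("step_done", "workflow_done"):
        return Decision.VERIFY      # the agent claims completion, so check it
    if status == "blocked":
        return Decision.ESCALATE    # pause for a human
    if status == "working":
        return Decision.CONTINUE    # progress is being made, do not interrupt
    raise ValueError(f"unknown checkpoint status: {status!r}")


def after_verify(passed: bool, retries_used: int, max_retries: int,
                 is_last_step: bool) -> Decision:
    """Decide what follows a verifier run."""
    if passed:
        return Decision.FINISH if is_last_step else Decision.CONTINUE
    if retries_used < max_retries:
        return Decision.RETRY       # inject a retry with failure details
    return Decision.ESCALATE        # retry budget exhausted
```

For example, `gate("step_done")` yields `Decision.VERIFY`, and a failed verifier with `retries_used=3, max_retries=3` yields `Decision.ESCALATE`.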
Historical runs can now be turned into stable artifacts and reports:
thin-supervisor run export <run_id> > run.json
thin-supervisor run summarize <run_id> --json
thin-supervisor run replay <run_id> --json
thin-supervisor run postmortem <run_id>

run replay re-evaluates historical checkpoints with the current gate logic but does not inject or verify against live surfaces. run postmortem writes a markdown report under .supervisor/reports/ by default.
If your spec sets acceptance.must_review_by, the run pauses at the finish gate until someone acknowledges review:
thin-supervisor run review <run_id> --by human
# or
thin-supervisor run review <run_id> --by stronger_reviewer

When a run enters PAUSED_FOR_HUMAN, thin-supervisor now derives two user-facing fields:
- pause_reason: why the supervisor stopped
- next_action: the exact recovery command to run next
By default the daemon also emits pause notifications through two built-in channels:
- tmux_display: a tmux display-message alert on the supervised pane
- jsonl: durable records in .supervisor/runtime/notifications.jsonl
Pause handling is now also policy-driven:
- pause_handling_mode: notify_only (notify and remain paused)
- pause_handling_mode: notify_then_ai (notify first, then let the agent attempt an automatic recovery for selected cases such as blocked checkpoints, repeated node mismatch, or retry-budget exhaustion)
The default is currently tuned for test periods:
pause_handling_mode: notify_then_ai
max_auto_interventions: 2

The default config now includes:
notification_channels:
- kind: tmux_display
- kind: jsonl
pause_handling_mode: notify_then_ai
max_auto_interventions: 2

Built-in notification channels today are tmux_display, jsonl, telegram, and lark. Telegram and Lark can also run in command mode through OperatorChannelHost, which merges provider-instance config into one logical command surface with a single inbound owner.
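One way to read the two pause-handling modes is as a small decision function. The sketch below is an assumption-laden illustration: `next_pause_step` and its return strings are hypothetical, and falling back to waiting once `max_auto_interventions` is used up is an inferred behavior, not something the text states.

```python
def next_pause_step(mode: str, eligible: bool, auto_interventions_used: int,
                    max_auto_interventions: int = 2) -> str:
    """Return what happens after the pause notification has been sent.

    eligible: whether the pause cause is one of the selected auto-recoverable
    cases (blocked checkpoint, repeated node mismatch, retry-budget exhaustion).
    """
    if mode == "notify_only":
        return "wait_for_human"
    if mode == "notify_then_ai":
        # Assumption: once the intervention budget is spent, remain paused.
        if eligible and auto_interventions_used < max_auto_interventions:
            return "attempt_auto_recovery"
        return "wait_for_human"
    raise ValueError(f"unknown pause_handling_mode: {mode!r}")
```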
Agents must emit structured checkpoints for the supervisor to parse:
<checkpoint>
run_id: <run_id from thin-supervisor status>
checkpoint_seq: <incrementing integer, start from 1>
status: working | blocked | step_done | workflow_done
current_node: <step_id>
summary: <one-line description>
evidence:
- modified: <file path>
- ran: <command>
- result: <short result>
candidate_next_actions:
- <next action>
needs:
- none
question_for_supervisor:
- none
</checkpoint>
The Codex/Claude Code Skills teach agents this protocol automatically.
For a newly attached node, the first checkpoint must cite execution evidence for the current node's objective. Clarify/spec/attach/baseline artifacts are prior-phase work and do not count as execution progress on the newly injected node.
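To make the format concrete, here is a minimal parser sketch for the checkpoint block shown above. It is not the supervisor's parser: `CHECKPOINT_RE`, `parse_checkpoint`, and `latest_checkpoint` are hypothetical names, all values are kept as strings, and list items are attached to the most recent key that had no inline value.

```python
import re
from typing import Optional

# Capture everything between the opening and closing checkpoint tags.
CHECKPOINT_RE = re.compile(r"<checkpoint>\s*\n(.*?)\n\s*</checkpoint>", re.DOTALL)


def parse_checkpoint(block: str) -> dict:
    fields: dict = {}
    current_list = None
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # List item: belongs to the last key that opened a list.
            if current_list is not None:
                current_list.append(line[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line, skip it
        key, value = key.strip(), value.strip()
        if value:
            fields[key] = value
            current_list = None
        else:
            current_list = fields.setdefault(key, [])
    return fields


def latest_checkpoint(pane_text: str) -> Optional[dict]:
    """Return the last checkpoint found in captured pane output, if any."""
    matches = CHECKPOINT_RE.findall(pane_text)
    return parse_checkpoint(matches[-1]) if matches else None
```

Taking the last match reflects that pane output may contain several checkpoints; a real implementation would presumably also compare `checkpoint_seq` against the last seen value.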
| Type | Fields | Description |
|---|---|---|
| command | run, expect | Run a shell command. expect: pass, fail, contains:<text> |
| artifact | path, exists | Check if a file exists |
| git | check, expect | Check git state (e.g., check: dirty, expect: false) |
| workflow | require_node_done | Check if current node is marked done |
All verifiers run in the agent's working directory (pane cwd), not the supervisor's.
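A rough sketch of how the command and artifact verifiers might behave, with the working directory passed explicitly to mirror the pane-cwd rule. The function names are hypothetical, and two details are assumptions: that pass means exit code 0, and that contains:<text> is matched against combined stdout and stderr.

```python
import subprocess
from pathlib import Path


def run_command_verifier(run: str, expect: str, cwd: str) -> bool:
    # Run in the agent's working directory, not the supervisor's.
    proc = subprocess.run(run, shell=True, cwd=cwd,
                          capture_output=True, text=True)
    if expect == "pass":
        return proc.returncode == 0   # assumption: pass means exit code 0
    if expect == "fail":
        return proc.returncode != 0
    if expect.startswith("contains:"):
        needle = expect[len("contains:"):]
        return needle in (proc.stdout + proc.stderr)
    raise ValueError(f"unsupported expect value: {expect!r}")


def run_artifact_verifier(path: str, exists: bool, cwd: str) -> bool:
    # Relative paths are resolved against the agent's working directory.
    return (Path(cwd) / path).exists() == exists
```

For instance, `run_command_verifier("pytest -q", "pass", cwd=pane_cwd)` would correspond to the final_check step in the spec example, where `pane_cwd` stands for whatever directory the agent's pane reports.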
thin-supervisor init [--force|--repair] # Create or repair .supervisor/ directory
thin-supervisor deinit [--force] # Remove .supervisor/
thin-supervisor bootstrap # Init + daemon + surface validation fast path
thin-supervisor daemon start [--config <path>] # Start background daemon
thin-supervisor daemon stop # Stop daemon
thin-supervisor stop # Legacy alias for daemon stop
thin-supervisor run register --spec <spec> --pane <target> [--surface tmux|open_relay|jsonl]
thin-supervisor run foreground --spec <spec> --pane <target> [--surface ...]
thin-supervisor run stop <run_id>
thin-supervisor run resume --spec <spec> --pane <target> [--surface ...]
thin-supervisor run review <run_id> --by human|stronger_reviewer
thin-supervisor run export <run_id> [--output file]
thin-supervisor run summarize <run_id> [--json]
thin-supervisor run replay <run_id> [--json]
thin-supervisor run postmortem <run_id> [--output file]
thin-supervisor spec approve --spec <spec> [--by human]
thin-supervisor overview # Whole-system view: counts, alerts, cross-run timeline
thin-supervisor overview --json # Same view as JSON for tools / scripts
thin-supervisor overview --watch # Re-render on an interval (Ctrl-C to stop)
thin-supervisor status # Every run across every known worktree (global-first)
thin-supervisor status --local # Restrict to the current worktree only
thin-supervisor list # Detailed active-run view
thin-supervisor dashboard # Interactive dashboard with drill-in
thin-supervisor tui # Operator TUI; press 'g' for the global-mode overview
thin-supervisor ps # Registered daemon processes across worktrees
thin-supervisor pane-owner <pane> # Show which run owns a pane
thin-supervisor observe <run_id> # Read-only snapshot; works even when no daemon is live
thin-supervisor note add <text> [--type ...] [--run ...] # Shared notes for coordination
thin-supervisor note list [--type ...] [--run ...]
thin-supervisor session detect # Detect current agent session ID
thin-supervisor session jsonl # Resolve current transcript path
thin-supervisor session list # List recent sessions and cwd
thin-supervisor config set <key> <value> # Persist config updates
thin-supervisor skill install # Install Codex / Claude skills
thin-supervisor bridge <action> [args] # tmux bridge operations

thin-supervisor is the runtime CLI. It is the only command family normal task users should need.
thin-supervisor-dev learn friction add --kind <kind> --message "..." [--run-id <run_id>] [--signal <signal>]
thin-supervisor-dev learn friction list [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn friction summarize [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn prefs set --key <key> --value <value>
thin-supervisor-dev learn prefs show [--json]
thin-supervisor-dev eval list
thin-supervisor-dev eval run [--suite approval-core|approval-adversarial|clarify-contract-core|routing-core|escalation-core|finish-gate-core|pause-ux-core] [--json]
thin-supervisor-dev eval replay --run-id <run_id> [--json]
thin-supervisor-dev eval compare --suite approval-core --candidate-policy <policy> [--json]
thin-supervisor-dev eval canary --run-id <run_id> [--run-id <run_id> ...] [--candidate-id <candidate_id>] [--phase shadow|limited] [--json]
thin-supervisor-dev eval expand --suite approval-core --output <path> [--variants-per-case 2]
thin-supervisor-dev eval propose --suite approval-core --objective <goal> [--json]
thin-supervisor-dev eval review-candidate --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval candidate-status --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval gate-candidate --candidate-id <candidate_id> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promote-candidate --candidate-id <candidate_id> --approved-by <name> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval improve --suite approval-core --objective <goal> [--approved-by <name>] [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promotion-history [--json]
thin-supervisor-dev eval rollout-history [--candidate-id <candidate_id>] [--json]
thin-supervisor-dev oracle consult --question "..." [--file path ...]

thin-supervisor-dev is the devtime/operator CLI. Use it for local tuning, offline evals, candidate rollout, learning signals, and advisory second opinions. Do not expose it to normal runtime users.
Add --save-report to run, replay, compare, canary, propose, review-candidate, gate-candidate, promote-candidate, or improve to persist a JSON report under .supervisor/evals/reports/. When used with eval propose, thin-supervisor-dev also writes a candidate-lineage manifest under .supervisor/evals/candidates/, eval review-candidate turns that manifest back into a bounded human review summary, eval candidate-status assembles the manifest, latest related reports, and promotion-registry state into one lifecycle dossier, eval gate-candidate combines compare plus optional canary signals into a promotion recommendation, and eval promote-candidate records an approved promotion in .supervisor/evals/promotions.jsonl. eval improve is the one-command wrapper for this path: it runs propose -> review/status -> gate and only promotes when --approved-by is supplied and the gate allows promotion (or --force is used).
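The promotion condition in the last sentence can be written down as a one-line rule. The sketch below encodes one reading of it, which is an assumption: approval is always required, and --force overrides only the gate. `should_promote` is a hypothetical name.

```python
from typing import Optional


def should_promote(gate_allows: bool, approved_by: Optional[str],
                   force: bool = False) -> bool:
    # Assumed reading: no approver means no promotion, even with force.
    if not approved_by:
        return False
    # With an approver present, either the gate allows it or force overrides it.
    return gate_allows or force
```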
If a daemon-managed run pauses, status and list now show the human-readable reason and the suggested next command. For non-active persisted runs, the same hint appears under Local state found:.
Other agents can submit tasks into a supervisor session via Google's A2A protocol:
export SUPERVISOR_A2A_TOKEN=your-secret # optional; omit for localhost-only
thin-supervisor a2a serve --port 8081

The listener advertises itself through overview and routes tasks/send through InboundGuard → EventPlaneIngest.register_request → session mailbox. The identity task_id == request_id is durable: it survives adapter and daemon restarts. See docs/a2a.md for the full protocol mapping and security model.
status, dashboard, and tui all read from a single canonical session index (supervisor/operator/session_index.py) that unions discovery across:
- the current cwd
- list_known_worktrees() (persisted registry, survives daemon/pane shutdown)
- live daemon cwds
- live pane-owner cwds
- git worktree list for the current repo (read-only)
As a result:
- Every operator read surface sees the same run universe. If status shows a run, dashboard and tui see it too, and vice versa.
- A run that outlives its daemon (persisted to disk, daemon idle-shutdown) stays visible from any cwd; it is tagged orphaned instead of disappearing.
- observe <run_id> resolves globally. When no daemon is live for the run, it reads the snapshot and recent events directly from the run's on-disk state + session_log.jsonl, so a paused run in a child worktree is still inspectable from the root workspace.
- status --local narrows the view to the current worktree; ps is process-oriented (which daemon processes are alive) and is distinct from run-oriented status.
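The union-and-tag behavior can be illustrated with a small sketch. It does not reflect the contents of supervisor/operator/session_index.py: `union_worktrees` and `tag_run` are hypothetical names, paths are normalized only lexically, and the three tags are taken from the live/orphaned/completed wording used for the overview.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def union_worktrees(*sources: Iterable[str]) -> List[str]:
    """Merge discovery sources into one ordered, de-duplicated list of cwds."""
    seen = {}
    for source in sources:
        for raw in source:
            # Lexical normalization only, e.g. dropping a trailing slash.
            seen.setdefault(str(PurePosixPath(raw)), None)
    return list(seen)


def tag_run(has_live_daemon: bool, completed: bool) -> str:
    if completed:
        return "completed"
    # A persisted run whose daemon is gone stays visible as orphaned.
    return "live" if has_live_daemon else "orphaned"


worktrees = union_worktrees(
    ["/repo"],                 # current cwd
    ["/repo/", "/repo-wt"],    # persisted registry
    ["/repo-wt"],              # live daemon cwds
)
```

Because every read surface would consume the same merged list, a run discovered by any one source becomes visible to all of them.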
thin-supervisor-dev eval is the first offline evaluation surface for the new skill-evolution work. Bundled suites now cover more than approval copy:
- approval-core checks explicit approval vs re-ask behavior
- approval-adversarial covers tricky mixed signals and repeat-approval cases
- clarify-contract-core checks whether the system locks the right delivery contract instead of silently narrowing “real UAT” work into a mock/dev baseline
- routing-core checks deterministic step_done/workflow_done -> VERIFY_STEP routing
- escalation-core checks blocked -> ESCALATE_TO_HUMAN
- finish-gate-core checks reviewer and completion contracts
- pause-ux-core checks externally visible pause/completion summaries

The remaining subcommands:
- thin-supervisor-dev eval replay --run-id ... wraps the existing history replay path into the same evaluation surface so policy candidates can be checked against real historical traces.
- thin-supervisor-dev eval compare ... adds a blind A/B-style comparator over deterministic suite results so baseline and candidate policies can be compared without hard-coding one output format into the report consumer.
- thin-supervisor-dev eval canary ... aggregates replay pass-rate, mismatch kinds, and friction over a set of real runs so shadow-canary promotion decisions become a command instead of a checklist; when you pass --candidate-id, the same command also records a rollout attempt under .supervisor/evals/rollouts.jsonl.
- thin-supervisor-dev eval expand ... generates provenance-tagged synthetic variants from the golden suite so coverage can grow without mutating the original contract set.
- thin-supervisor-dev eval propose ... is the constrained candidate-generator surface: it summarizes failure cases, consults the advisory/self-review layer, recommends a policy candidate for a stated objective without automatically changing shipped defaults, and can persist a candidate-lineage manifest for later comparison and promotion review.
- thin-supervisor-dev eval review-candidate ... loads one of those manifests and emits the bounded human-review summary for the next promotion step.
- thin-supervisor-dev eval candidate-status ... turns the manifest, related eval reports, promotion-registry state, and recorded rollout attempts into one lifecycle dossier.
- thin-supervisor-dev eval rollout-history ... exposes the rollout ledger directly.
- thin-supervisor-dev eval gate-candidate ... then combines that bounded review with deterministic compare output and optional real-run canary signals before a human decides whether to promote.
- thin-supervisor-dev eval improve ... is the current-main-native convenience wrapper around that same flow, so the old “proposal improvement loop” UX exists without reviving a parallel implementation.
- thin-supervisor-dev eval promote-candidate ... records an approved promotion in the promotion registry so candidate history and current promoted policies are queryable later.
Yes, you should run real canaries. A safe sequence is:
- Offline gate: run eval run, eval replay, eval compare, and optionally eval propose, all with --save-report.
- Shadow canary: pick 3-5 real tasks and keep the baseline behavior in charge. Record each finished run with:
  - thin-supervisor run summarize <run_id>
  - thin-supervisor run postmortem <run_id>
  - thin-supervisor-dev eval replay --run-id <run_id> --save-report
  - thin-supervisor-dev eval canary --run-id <run_id> ... --candidate-id <candidate_id> --phase shadow --save-report
  - thin-supervisor-dev eval rollout-history --candidate-id <candidate_id> --json
- Limited rollout: if shadow canary stays clean, run 10-20 real tasks with the candidate under close observation.
For each real canary, log friction explicitly when needed:
thin-supervisor-dev learn friction add \
--kind repeated_confirmation \
--message "user had to approve twice" \
--run-id <run_id> \
--signal user_repeated_approval

Then summarize what actually accumulated for a run:

thin-supervisor-dev learn friction summarize --run-id <run_id> --json

thin-supervisor bridge read <pane> [lines] # Capture pane output
thin-supervisor bridge type <pane> <text> # Send text (no Enter)
thin-supervisor bridge keys <pane> <key>... # Send special keys
thin-supervisor bridge list # Show all panes
thin-supervisor bridge id # Current pane ID
thin-supervisor bridge doctor # Check tmux connectivity

.supervisor/config.yaml:
surface_type: "tmux" # tmux | open_relay | jsonl
surface_target: "agent" # pane label / oly session ID / transcript path
poll_interval_sec: 2.0 # seconds between reads
read_lines: 100 # lines captured per read
# LLM judge (null = offline stub mode, rules-only)
judge_model: null # e.g., anthropic/claude-haiku-4-5-20251001
judge_temperature: 0.1
judge_max_tokens: 512

jsonl is observation-only: the supervisor can watch checkpoints from a transcript file, but instruction delivery still depends on the agent skill / hook path.
Override with environment variables: SUPERVISOR_SURFACE_TYPE, SUPERVISOR_SURFACE_TARGET, SUPERVISOR_PANE_TARGET, SUPERVISOR_JUDGE_MODEL, etc.
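Environment overrides of this kind are usually applied as an overlay on the loaded config. The sketch below assumes that each variable maps to the config key with the same name minus the SUPERVISOR_ prefix; that mapping is inferred from the naming pattern, and SUPERVISOR_PANE_TARGET is left out because its target key is not stated. `apply_env_overrides` is a hypothetical name.

```python
import os
from typing import Mapping, Optional

# Assumed mapping from environment variable to config key.
ENV_OVERRIDES = {
    "SUPERVISOR_SURFACE_TYPE": "surface_type",
    "SUPERVISOR_SURFACE_TARGET": "surface_target",
    "SUPERVISOR_JUDGE_MODEL": "judge_model",
}


def apply_env_overrides(config: dict,
                        environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of config with any set environment overrides applied."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for env_name, key in ENV_OVERRIDES.items():
        if env_name in environ:
            merged[key] = environ[env_name]
    return merged


file_config = {"surface_type": "tmux", "surface_target": "agent", "judge_model": None}
effective = apply_env_overrides(file_config, {"SUPERVISOR_SURFACE_TYPE": "jsonl"})
```

Returning a copy keeps the on-disk values intact, so `thin-supervisor config set` style persistence would not accidentally capture a temporary override.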
Foundational stance: see docs/PHILOSOPHY.md — why thin-supervisor watches but does not orchestrate.
Inspired by Anthropic's Scaling Managed Agents:
- The system's memory lives in SessionRun, not in the model's context. Crashes don't lose history. Everything is in session_log.jsonl.
- The execution surface is just a "hand", not the system. Today that includes tmux, open-relay, and transcript-backed JSONL observation. Tomorrow it could be a PTY wrapper or a remote session. The SessionAdapter protocol keeps the supervisor decoupled.
- Harnesses change, primitives don't. The current sidecar loop is one harness. The 6 first-class objects (WorkflowSpec, SessionRun, ExecutionSurface, CheckpointEvent, SupervisorDecision, HandoffInstruction) are the stable interface.
- Verification is deterministic, not verbal. "Done" means the verifier passed, not that the agent said so.
- Skill evolution happens from structured hindsight, not ad-hoc prompt edits. friction_events and user_preference_memory give the system a durable learning substrate. The intended loop is: capture friction -> summarize/postmortem -> replay/eval candidate policy changes -> update skills/rules only when the offline signal says they are better.
Install for Claude Code:
cp -r skills/thin-supervisor ~/.claude/skills/

Install for Codex:

cp -r packaging/thin-supervisor-codex ~/.codex/skills/thin-supervisor

Invoke with /thin-supervisor to start the default flow:
- clarify ambiguous goals
- generate a draft spec
- wait for approval
- attach and execute only after approval
The skill is now split into two layers:
- frozen contract: skills/thin-supervisor*/references/contract.md
- optimizable strategy fragments under skills/thin-supervisor*/strategy/
Future policy optimization should target the strategy fragments, not the whole SKILL.md.
If you want an Amp-style "oracle" second opinion without giving up supervisor control, use:
thin-supervisor-dev oracle consult \
--mode review \
--question "Review the retry policy design" \
--file supervisor/loop.py \
--file supervisor/gates/supervision_policy.py

When an external provider key is configured, thin-supervisor calls that provider as a read-only advisor. Without an external key, it falls back to a self-adversarial review scaffold instead of failing hard. Add --run <run_id> to persist the consultation into the shared notes plane for the active supervised run.
git clone https://github.com/fakechris/thin-supervisor
cd thin-supervisor
pip install -e ".[dev]"
pytest -q

For repo-specific setup and examples, start with docs/getting-started.md.
MIT