thin-supervisor

Long-running AI coding tasks fail silently. The agent asks "should I continue?", you're not watching, and the task stalls. Or worse — the agent says "done" but didn't actually pass the tests.

thin-supervisor fixes this. It's an acceptance-centered run supervisor that sits alongside your existing coding agent (Claude Code, Codex, or any CLI agent), watches what the agent does, and makes structured decisions: continue, re-inject, verify, retry, branch, recover, escalate, or finish. "Done" means the verifier passed and the acceptance contract is satisfied — not that the agent said so. You stay in your familiar agent UI. The supervisor handles the rest.

Architecture deep-dive: See docs/ARCHITECTURE.md for the six-layer architecture, first-class objects, and design principles.

Docs hub:

docs/getting-started.md — install and run tmux, open-relay, and JSONL workflows

docs/ARCHITECTURE.md — object model, layers, and current implementation status

CHANGELOG.md — release notes and unreleased changes

docs/design/p2-external-surfaces.md — surface abstraction and roadmap

docs/design/p3-observation-sources.md — observation sources and normalization

docs/design/p4-jsonl-observation.md — transcript-backed observation mode

docs/reviews/2026-04-11-deep-code-review.md — latest deep code review log and remaining-risk audit

docs/reviews/2026-04-12-amp-supervisor-capability-review.md — Amp-vs-thin-supervisor capability review and oracle-layer roadmap

Current Status (0.3.7)

Operator-initiated clarification escalation. When a clarification answer comes back below the configured confidence threshold, operators can escalate the question to the worker from any channel: thin-supervisor clarify <run_id> <question> --escalate, the TUI E keybind, or /escalate <run_id> [question] in IM. Each escalation writes a single clarification_escalated_to_worker timeline event so the decision is durably auditable. Actual side-instruction transport to the worker is deferred to 0.3.8; 0.3.7 ships the end-to-end operator UX and the audit surface.
Operator channel polish. assess_drift can opt into a heavier reasoner via deep_explainer_model while routine explanations stay on the cheap explainer_model. Clarification answers now emit a dedicated explainer_answer timeline event (tagged source="explainer") and carry escalation_recommended=True when confidence falls below clarification_escalation_confidence (default 0.4) so operators know when to route the question to the worker. DriftAssessment and ExchangeView frozen dataclasses give TUI / IM channels a type-safe projection instead of raw dicts.
JSONL runs close the loop via Stop hook. thin-supervisor hook install wires a Claude Code / Codex Stop hook that reads the supervisor's per-session handoff file, returns the next instruction as the agent's stop reason, and writes an ACK so the supervisor can confirm delivery. Observation-only runs no longer dead-end at "pause for human" when nothing else is wrong.
A2A inbound is live. thin-supervisor a2a serve exposes a stdlib-only Google A2A JSON-RPC listener (tasks/send / tasks/get) behind a shared inbound boundary guard — bearer auth with localhost fallback, per-IP rate-limit, injection scan, and secret redaction with SHA-256 audit hashes. task_id equals the client request_id, so task identity survives adapter and daemon restarts.
Layered system observability is live. thin-supervisor overview (with --json / --watch) renders a whole-system view — daemon counts, live/orphaned/completed sessions, event-plane backlog, actionable alerts, and a cross-run timeline — that stays consistent with status, observe, and the TUI global mode (g to toggle).
Session-first deferred review plumbing is live. Sessions now persist as first-class cross-run identities, external review requests/results land in a shared event-plane store, and operator surfaces can inspect mailbox items and outstanding waits by session_id.
Post-merge review hardening landed. Event-plane request/result ingest now has tighter validation, wake-policy bookkeeping is recorded deterministically, and session/history export includes the deferred-work timeline for audit and replay.
Global-first observability is live. status, dashboard, tui, and observe now read from one canonical session index, so runs stay visible across worktrees even after daemon idle shutdown.
The runtime state machine is split by intent. ATTACHED, RECOVERY_NEEDED, and pause_class now distinguish attach-boundary tightening, operational recovery, and true human-owned pauses.
Structured checkpoint semantics are live. The runtime now normalizes one canonical NormalizedCheckpoint, accepts v2 semantic fields (progress_class, evidence_scope, escalation_class, requires_authorization, blocking_inputs, reason_code), and routes contradictions by class instead of re-guessing everything from prose.
Legacy sunset policy is encoded, not hand-waved. reason_code is frozen to esc/rec/ver/sem, live-ingest sunset uses the 14-day dual-signal trigger, and replay/export keep permanent v1 read compatibility.
Operator IM channels are live. Telegram and Lark/Feishu provider instances are merged into one logical command surface per bot/app, with one inbound owner and multi-target outbound delivery.
tmux injection is harder to wedge. A readiness gate now checks whether the pane is still changing, actively typing, or actually idle before issuing send-keys.
The policy-tuning loop is end-to-end. thin-supervisor-dev eval now covers compare, canary, candidate review/status, gating, promotion, and the one-command eval improve wrapper.

Install from PyPI

pip install -U thin-supervisor
thin-supervisor skill install

pip install -U thin-supervisor gets the runtime CLI from PyPI. thin-supervisor skill install then installs the Codex / Claude skills into your local agent environment.

┌────────────────────────────┐  ┌──────────────────────────┐
│  Your Agent (visible pane) │  │  Supervisor (sidecar)    │
│  Claude Code / Codex       │  │  reads pane output       │
│                            │  │  parses checkpoints      │
│  ... working ...           │  │  gates decisions         │
│                            │  │  runs verifiers          │
│  <checkpoint>              │──│  injects next step       │
│  status: step_done         │  │                          │
│  </checkpoint>             │  │  state: RUNNING → VERIFY │
└────────────────────────────┘  └──────────────────────────┘
                 tmux session

When to use this

Scenario	Without supervisor	With supervisor
10-step implementation plan	Agent asks permission at every step	Runs to completion, verifies each step
Test-driven workflow	Agent says "done" without running tests	Verifier runs tests, rejects if failing
Agent asks "should I continue?"	You miss it, task stalls for hours	Supervisor auto-answers, keeps going
Dangerous operation detected	Agent proceeds silently	Supervisor escalates to you

Core Concepts

Runtime Objects (stable)

Object	Question it answers	What it is
WorkflowSpec	What should be done?	YAML task definition with steps, verification criteria, and finish policy
CheckpointEvent	What did the agent just report?	Structured status with seq tracking, evidence, and needs
SupervisorDecision	What does the control plane think?	Typed gate decision with confidence, reasoning, and causality link
HandoffInstruction	What should the agent do next?	Composed instruction with full traceability to the triggering decision
ExecutionSurface	How do we talk to the agent?	Protocol for read/inject/cwd — tmux, open-relay, and JSONL observation surfaces
SessionRun	Who is this run?	Identity + durable event history — survives crashes, enables recovery

Emerging Architecture (implemented, maturing)

Object	Purpose
AcceptanceContract	Defines "what counts as truly done" — required evidence, forbidden states, risk class, reviewer gating
WorkerProfile	Explicit worker capabilities — provider, model, trust level. Drives supervision intensity.
SupervisionPolicy	Three modes: `strict_verifier` (default) / `collaborative_reviewer` / `directive_lead`. Keeps a thin supervisor from micromanaging a strong worker.
RoutingDecision	Escalation routing — human, stronger reviewer, or alternate executor

These form a causality chain: every instruction traces back to the decision that caused it, which traces back to the checkpoint that triggered it.

CheckpointEvent(seq=3) → SupervisorDecision(triggered_by_seq=3) → HandoffInstruction(triggered_by_decision=X)
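
The chain above can be sketched with plain dataclasses. This is a minimal illustration, not the project's actual classes: only `seq`, `triggered_by_seq`, and `triggered_by_decision` come from the text, while `decision_id`, `status`, `decision`, `text`, and the `trace` helper are hypothetical names added for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CheckpointEvent:
    seq: int
    status: str  # hypothetical field for the example


@dataclass(frozen=True)
class SupervisorDecision:
    decision_id: str  # hypothetical identifier
    decision: str     # hypothetical field for the example
    triggered_by_seq: int


@dataclass(frozen=True)
class HandoffInstruction:
    text: str  # hypothetical field for the example
    triggered_by_decision: str


def trace(
    instruction: HandoffInstruction,
    decisions: Dict[str, SupervisorDecision],
    checkpoints: Dict[int, CheckpointEvent],
) -> Tuple[CheckpointEvent, SupervisorDecision, HandoffInstruction]:
    """Walk an instruction back to the decision and checkpoint that caused it."""
    decision = decisions[instruction.triggered_by_decision]
    checkpoint = checkpoints[decision.triggered_by_seq]
    return checkpoint, decision, instruction


checkpoint = CheckpointEvent(seq=3, status="step_done")
decision = SupervisorDecision("X", "verify", triggered_by_seq=3)
instruction = HandoffInstruction("run the verifier", triggered_by_decision="X")
chain = trace(instruction, {"X": decision}, {3: checkpoint})
```

Keeping the links as plain identifiers rather than object references means the chain can be rebuilt from a durable log after a crash.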

Quick Start

Full guide: See docs/getting-started.md for step-by-step instructions covering tmux, open-relay, JSONL observation, and Codex/Claude/OpenCode/Droid workflows.

# Install
pip install thin-supervisor

# Install the Codex / Claude skills automatically when supported
thin-supervisor skill install

# Initialize in your project
cd your-project
thin-supervisor init
# If .supervisor/ exists but is missing config, repair the scaffold in place
thin-supervisor init --repair

# Write a spec (or let the Skill generate one)
cat > .supervisor/specs/my-plan.yaml << 'EOF'
kind: linear_plan
id: my_feature
goal: implement feature X with tests
approval:
  required: true
  status: draft
finish_policy:
  require_all_steps_done: true
  require_verification_pass: true
policy:
  default_continue: true
  max_retries_per_node: 3

steps:
  - id: write_tests
    type: task
    objective: write failing tests for feature X
    verify:
      - type: artifact
        path: tests/test_feature_x.py
        exists: true

  - id: implement
    type: task
    objective: implement feature X until tests pass
    verify:
      - type: command
        run: pytest -q tests/test_feature_x.py
        expect: pass

  - id: final_check
    type: task
    objective: run full test suite
    verify:
      - type: command
        run: pytest -q
        expect: pass
EOF

# Approve the draft spec, then attach
thin-supervisor spec approve --spec .supervisor/specs/my-plan.yaml --by human
scripts/thin-supervisor-attach.sh my-plan

Execution entry points reject draft specs. This is deliberate: the clarify/approve step is part of the contract.
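
As a rough sketch of that rule, an entry point could check the `approval` block of a parsed spec before doing anything else. The function name, the exception, and the `"approved"` status value are assumptions made for illustration; the source only shows `status: draft` and does not document what `spec approve` writes.

```python
class DraftSpecError(Exception):
    """Raised when an execution entry point receives an unapproved spec."""


def require_approved(spec: dict) -> dict:
    """Return the spec unchanged if it may run, otherwise raise."""
    approval = spec.get("approval") or {}
    required = approval.get("required", False)
    status = approval.get("status")
    # "approved" is an assumed status value, not confirmed by the docs.
    if required and status != "approved":
        raise DraftSpecError(
            f"spec {spec.get('id')!r} has approval status {status!r}; approve it first"
        )
    return spec
```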

What happens next

Supervisor reads the agent's pane output every 2 seconds
Agent emits a <checkpoint> block after completing work
Supervisor parses the checkpoint and makes a gate decision:
- Continue — agent is making progress, don't interrupt
- Re-inject — run is attached but the first checkpoint still only cites attach/spec/admin work, so tighten the current-node instruction
- Verify — agent says step is done, run the verifier
- Retry — verification failed, inject retry instruction with failure details
- Branch — decision node in workflow, select a path
- Recover — delivery/session-health fault; supervisor attempts bounded auto-recovery before surfacing it
- Escalate — missing credentials, dangerous action, explicit review, or low confidence — pause for human
- Finish — all steps done, all verifiers pass, finish policy and review requirements satisfied
If continuing or retrying, supervisor injects the next instruction into the pane
Run-level decisions are logged to session_log.jsonl; project-level bootstrap and repair incidents are logged to .supervisor/runtime/ops_log.jsonl
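
The core of the gate can be sketched as two small routing functions. This is a simplified illustration under stated assumptions, not the real gate logic: it covers only the status-driven paths described above (`working`, `blocked`, `step_done`, `workflow_done`) and a retry budget modeled on `max_retries_per_node`. The `Gate` enum and both function names are hypothetical, and the re-inject, branch, and recover paths are omitted.

```python
from enum import Enum


class Gate(Enum):
    CONTINUE = "continue"
    VERIFY = "verify"
    RETRY = "retry"
    ESCALATE = "escalate"
    FINISH = "finish"


def route_checkpoint(status: str) -> Gate:
    """Map a checkpoint status to the first gate decision."""
    if status in ("step_done", "workflow_done"):
        return Gate.VERIFY      # never trust "done" without verifying
    if status == "blocked":
        return Gate.ESCALATE
    if status == "working":
        return Gate.CONTINUE
    raise ValueError(f"unknown checkpoint status: {status!r}")


def route_verification(
    passed: bool, retries_used: int, max_retries: int, is_last_step: bool
) -> Gate:
    """Decide what follows a verifier run."""
    if passed:
        return Gate.FINISH if is_last_step else Gate.CONTINUE
    if retries_used < max_retries:
        return Gate.RETRY
    return Gate.ESCALATE        # retry budget exhausted
```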

Historical runs can now be turned into stable artifacts and reports:

thin-supervisor run export <run_id> > run.json
thin-supervisor run summarize <run_id> --json
thin-supervisor run replay <run_id> --json
thin-supervisor run postmortem <run_id>

run replay re-evaluates historical checkpoints with the current gate logic but does not inject or verify against live surfaces. run postmortem writes a markdown report under .supervisor/reports/ by default.

If your spec sets acceptance.must_review_by, the run pauses at the finish gate until someone acknowledges review:

thin-supervisor run review <run_id> --by human
# or
thin-supervisor run review <run_id> --by stronger_reviewer

When a run enters PAUSED_FOR_HUMAN, thin-supervisor now derives two user-facing fields:

pause_reason — why the supervisor stopped
next_action — the exact recovery command to run next

By default the daemon also emits pause notifications through two built-in channels:

tmux_display — a tmux display-message alert on the supervised pane
jsonl — durable records in .supervisor/runtime/notifications.jsonl

Pause handling is now also policy-driven:

pause_handling_mode: notify_only — notify and remain paused
pause_handling_mode: notify_then_ai — notify first, then let the agent attempt an automatic recovery for selected cases such as blocked checkpoints, repeated node mismatch, or retry-budget exhaustion

The default is currently tuned for test periods:

pause_handling_mode: notify_then_ai
max_auto_interventions: 2

The default config now includes:

notification_channels:
  - kind: tmux_display
  - kind: jsonl
pause_handling_mode: notify_then_ai
max_auto_interventions: 2

Built-in notification channels today are tmux_display, jsonl, telegram, and lark. Telegram and Lark can also run in command mode through OperatorChannelHost, which merges provider-instance config into one logical command surface with a single inbound owner.
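
The interaction between `pause_handling_mode` and `max_auto_interventions` can be sketched as follows. This is an assumption-laden illustration: the function name, the `recoverable` flag (standing in for the "selected cases" listed above), and the returned action labels are hypothetical.

```python
from typing import List


def plan_pause_actions(
    mode: str,
    recoverable: bool,
    auto_interventions_used: int,
    max_auto_interventions: int = 2,
) -> List[str]:
    """Return the ordered actions to take when a run pauses."""
    actions = ["notify"]  # both modes notify first
    budget_left = auto_interventions_used < max_auto_interventions
    if mode == "notify_then_ai" and recoverable and budget_left:
        actions.append("ai_recovery")
    else:
        actions.append("stay_paused")
    return actions
```

Bounding automatic interventions keeps a run that cannot recover from looping indefinitely; once the budget is spent, the pause is left for a human.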

Checkpoint Protocol

Agents must emit structured checkpoints for the supervisor to parse:

<checkpoint>
run_id: <run_id from thin-supervisor status>
checkpoint_seq: <incrementing integer, start from 1>
status: working | blocked | step_done | workflow_done
current_node: <step_id>
summary: <one-line description>
evidence:
  - modified: <file path>
  - ran: <command>
  - result: <short result>
candidate_next_actions:
  - <next action>
needs:
  - none
question_for_supervisor:
  - none
</checkpoint>

The Codex/Claude Code Skills teach agents this protocol automatically.

For a newly attached node, the first checkpoint must cite execution evidence for the current node's objective. Clarify/spec/attach/baseline artifacts are prior-phase work and do not count as execution progress on the newly injected node.
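
To make the protocol concrete, here is a minimal parser sketch for the block format shown above. It is not the supervisor's actual parser: the function name is hypothetical, it handles only flat `key: value` lines and `- item` lists, and it takes the last checkpoint found in the captured text on the assumption that the most recent one is the one that matters.

```python
import re
from typing import Optional

CHECKPOINT_RE = re.compile(r"<checkpoint>(.*?)</checkpoint>", re.DOTALL)


def parse_checkpoint(pane_text: str) -> Optional[dict]:
    """Parse the last <checkpoint> block in pane_text, or return None."""
    blocks = CHECKPOINT_RE.findall(pane_text)
    if not blocks:
        return None
    fields: dict = {}
    list_key = None  # key currently collecting "- item" lines
    for raw in blocks[-1].splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if list_key is not None:
                fields[list_key].append(line[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that are neither fields nor list items
        key, value = key.strip(), value.strip()
        if value:
            fields[key] = value
            list_key = None
        else:
            fields[key] = []
            list_key = key
    seq = fields.get("checkpoint_seq")
    if isinstance(seq, str) and seq.isdigit():
        fields["checkpoint_seq"] = int(seq)
    return fields
```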

Verification Types

Type	Fields	Description
`command`	`run`, `expect`	Run a shell command. `expect`: `pass`, `fail`, `contains:<text>`
`artifact`	`path`, `exists`	Check if a file exists
`git`	`check`, `expect`	Check git state (e.g., `check: dirty`, `expect: false`)
`workflow`	`require_node_done`	Check if current node is marked done

All verifiers run in the agent's working directory (pane cwd), not the supervisor's.
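
A sketch of how the `command` and `artifact` verifiers could behave is shown below. The function names, the timeout, and the exact semantics are assumptions: here `pass` means exit code 0, `fail` means a non-zero exit code, and `contains:<text>` searches combined stdout and stderr. The real verifiers may differ on any of these points.

```python
import os
import subprocess


def verify_command(run: str, expect: str, cwd: str, timeout: float = 600.0) -> bool:
    """Run a shell command in the agent's working directory and check it."""
    proc = subprocess.run(
        run, shell=True, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    if expect == "pass":
        return proc.returncode == 0
    if expect == "fail":
        return proc.returncode != 0
    if expect.startswith("contains:"):
        needle = expect[len("contains:"):]
        return needle in (proc.stdout + proc.stderr)
    raise ValueError(f"unsupported expect value: {expect!r}")


def verify_artifact(path: str, exists: bool, cwd: str) -> bool:
    """Check that a file's presence matches the expectation."""
    return os.path.exists(os.path.join(cwd, path)) == exists
```

Passing `cwd` explicitly mirrors the rule that verification happens where the agent works, not where the supervisor was started.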

CLI

thin-supervisor init [--force|--repair]                   # Create or repair .supervisor/ directory
thin-supervisor deinit [--force]                           # Remove .supervisor/
thin-supervisor bootstrap                                  # Init + daemon + surface validation fast path

thin-supervisor daemon start [--config <path>]             # Start background daemon
thin-supervisor daemon stop                                # Stop daemon
thin-supervisor stop                                       # Legacy alias for daemon stop

thin-supervisor run register --spec <spec> --pane <target> [--surface tmux|open_relay|jsonl]
thin-supervisor run foreground --spec <spec> --pane <target> [--surface ...]
thin-supervisor run stop <run_id>
thin-supervisor run resume --spec <spec> --pane <target> [--surface ...]
thin-supervisor run review <run_id> --by human|stronger_reviewer
thin-supervisor run export <run_id> [--output file]
thin-supervisor run summarize <run_id> [--json]
thin-supervisor run replay <run_id> [--json]
thin-supervisor run postmortem <run_id> [--output file]
thin-supervisor spec approve --spec <spec> [--by human]

thin-supervisor overview                                   # Whole-system view: counts, alerts, cross-run timeline
thin-supervisor overview --json                            # Same view as JSON for tools / scripts
thin-supervisor overview --watch                           # Re-render on an interval (Ctrl-C to stop)
thin-supervisor status                                     # Every run across every known worktree (global-first)
thin-supervisor status --local                             # Restrict to the current worktree only
thin-supervisor list                                       # Detailed active-run view
thin-supervisor dashboard                                  # Interactive dashboard with drill-in
thin-supervisor tui                                        # Operator TUI; press 'g' for the global-mode overview
thin-supervisor ps                                         # Registered daemon processes across worktrees
thin-supervisor pane-owner <pane>                          # Show which run owns a pane
thin-supervisor observe <run_id>                           # Read-only snapshot; works even when no daemon is live
thin-supervisor note add <text> [--type ...] [--run ...]  # Shared notes for coordination
thin-supervisor note list [--type ...] [--run ...]

thin-supervisor session detect                             # Detect current agent session ID
thin-supervisor session jsonl                              # Resolve current transcript path
thin-supervisor session list                               # List recent sessions and cwd
thin-supervisor config set <key> <value>                   # Persist config updates

thin-supervisor skill install                              # Install Codex / Claude skills
thin-supervisor bridge <action> [args]                     # tmux bridge operations

thin-supervisor is the runtime CLI. It is the only command family normal task users should need.

thin-supervisor-dev learn friction add --kind <kind> --message "..." [--run-id <run_id>] [--signal <signal>]
thin-supervisor-dev learn friction list [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn friction summarize [--run-id <run_id>] [--kind <kind>] [--json]
thin-supervisor-dev learn prefs set --key <key> --value <value>
thin-supervisor-dev learn prefs show [--json]
thin-supervisor-dev eval list
thin-supervisor-dev eval run [--suite approval-core|approval-adversarial|clarify-contract-core|routing-core|escalation-core|finish-gate-core|pause-ux-core] [--json]
thin-supervisor-dev eval replay --run-id <run_id> [--json]
thin-supervisor-dev eval compare --suite approval-core --candidate-policy <policy> [--json]
thin-supervisor-dev eval canary --run-id <run_id> [--run-id <run_id> ...] [--candidate-id <candidate_id>] [--phase shadow|limited] [--json]
thin-supervisor-dev eval expand --suite approval-core --output <path> [--variants-per-case 2]
thin-supervisor-dev eval propose --suite approval-core --objective <goal> [--json]
thin-supervisor-dev eval review-candidate --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval candidate-status --candidate-id <candidate_id> [--json]
thin-supervisor-dev eval gate-candidate --candidate-id <candidate_id> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promote-candidate --candidate-id <candidate_id> --approved-by <name> [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval improve --suite approval-core --objective <goal> [--approved-by <name>] [--run-id <run_id> ...] [--json]
thin-supervisor-dev eval promotion-history [--json]
thin-supervisor-dev eval rollout-history [--candidate-id <candidate_id>] [--json]
thin-supervisor-dev oracle consult --question "..." [--file path ...]

thin-supervisor-dev is the devtime/operator CLI. Use it for local tuning, offline evals, candidate rollout, learning signals, and advisory second opinions. Do not expose it to normal runtime users.

Add --save-report to run, replay, compare, canary, propose, review-candidate, gate-candidate, promote-candidate, or improve to persist a JSON report under .supervisor/evals/reports/. When used with eval propose, thin-supervisor-dev also writes a candidate-lineage manifest under .supervisor/evals/candidates/. The remaining candidate commands build on that manifest:

eval review-candidate turns the manifest back into a bounded human review summary
eval candidate-status assembles the manifest, latest related reports, and promotion-registry state into one lifecycle dossier
eval gate-candidate combines compare plus optional canary signals into a promotion recommendation
eval promote-candidate records an approved promotion in .supervisor/evals/promotions.jsonl

eval improve is the one-command wrapper for this path: it runs propose -> review/status -> gate and only promotes when --approved-by is supplied and the gate allows promotion (or --force is used).

If a daemon-managed run pauses, status and list now show the human-readable reason and the suggested next command. For non-active persisted runs, the same hint appears under Local state found:.

A2A inbound adapter

Other agents can submit tasks into a supervisor session via Google's A2A protocol:

export SUPERVISOR_A2A_TOKEN=your-secret   # optional; omit for localhost-only
thin-supervisor a2a serve --port 8081

The listener advertises itself through overview and routes tasks/send through InboundGuard → EventPlaneIngest.register_request → session mailbox. task_id == request_id is durable: it survives adapter and daemon restart. See docs/a2a.md for the full protocol mapping + security model.
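
The "bearer auth with localhost fallback" behavior can be sketched as a single check. This is an illustrative guess at the rule, not the actual InboundGuard code: the function name and the set of loopback addresses are assumptions. When a token is configured, a matching `Authorization: Bearer` header is required; when it is not, only loopback clients are accepted.

```python
import hmac
from typing import Optional

LOOPBACK_ADDRESSES = {"127.0.0.1", "::1"}


def is_authorized(
    remote_ip: str, auth_header: Optional[str], token: Optional[str]
) -> bool:
    """Decide whether an inbound request may proceed."""
    if token:
        if not auth_header or not auth_header.startswith("Bearer "):
            return False
        presented = auth_header[len("Bearer "):]
        # Constant-time comparison avoids leaking the token through timing.
        return hmac.compare_digest(presented.encode(), token.encode())
    # No token configured: fall back to localhost-only access.
    return remote_ip in LOOPBACK_ADDRESSES
```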

Global observability plane

status, dashboard, and tui all read from a single canonical session index (supervisor/operator/session_index.py) that unions discovery across:

the current cwd
list_known_worktrees() (persisted registry, survives daemon/pane shutdown)
live daemon cwds
live pane-owner cwds
git worktree list for the current repo (read-only)

As a result:

Every operator read surface sees the same run universe. If status shows a run, dashboard and tui see it too, and vice versa.
A run that outlives its daemon (persisted to disk, daemon idle-shutdown) stays visible from any cwd — it tags as orphaned instead of disappearing.
observe <run_id> resolves globally. When no daemon is live for the run, it reads the snapshot and recent events directly from the run's on-disk state + session_log.jsonl, so a paused run in a child worktree is still inspectable from the root workspace.
status --local narrows the view to the current worktree; ps is process-oriented (which daemon processes are alive) and is distinct from run-oriented status.
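
The union-and-tag idea behind the session index can be sketched in a few lines. This does not reproduce supervisor/operator/session_index.py; the function names are hypothetical, and the example only distinguishes "live" from "orphaned" based on whether a daemon is currently running in the run's worktree.

```python
import os
from typing import Iterable, List


def union_worktrees(*sources: Iterable[str]) -> List[str]:
    """Merge worktree paths from several discovery sources, dropping duplicates."""
    ordered = {}  # dict preserves first-seen order
    for source in sources:
        for path in source:
            ordered.setdefault(os.path.normpath(path), None)
    return list(ordered)


def tag_run(worktree: str, live_daemon_cwds: Iterable[str]) -> str:
    """Tag a persisted run by whether a daemon is still alive for its worktree."""
    live = {os.path.normpath(cwd) for cwd in live_daemon_cwds}
    return "live" if os.path.normpath(worktree) in live else "orphaned"
```

Because every read surface would call the same union, a run discovered by one source cannot be invisible to another.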

thin-supervisor-dev eval is the first offline evaluation surface for the new skill-evolution work. Bundled suites now cover more than approval copy:

approval-core checks explicit approval vs re-ask behavior
approval-adversarial covers tricky mixed signals and repeat-approval cases
clarify-contract-core checks whether the system locks the right delivery contract instead of silently narrowing “real UAT” work into a mock/dev baseline
routing-core checks deterministic step_done/workflow_done -> VERIFY_STEP routing
escalation-core checks blocked -> ESCALATE_TO_HUMAN
finish-gate-core checks reviewer and completion contracts
pause-ux-core checks externally visible pause/completion summaries

The eval subcommands build on those suites:

thin-supervisor-dev eval replay --run-id ... wraps the existing history replay path into the same evaluation surface so policy candidates can be checked against real historical traces.
thin-supervisor-dev eval compare ... adds a blind A/B-style comparator over deterministic suite results so baseline and candidate policies can be compared without hard-coding one output format into the report consumer.
thin-supervisor-dev eval canary ... aggregates replay pass-rate, mismatch kinds, and friction over a set of real runs so shadow-canary promotion decisions become a command instead of a checklist; when you pass --candidate-id, the same command also records a rollout attempt under .supervisor/evals/rollouts.jsonl.
thin-supervisor-dev eval expand ... generates provenance-tagged synthetic variants from the golden suite so coverage can grow without mutating the original contract set.
thin-supervisor-dev eval propose ... is the constrained candidate-generator surface: it summarizes failure cases, consults the advisory/self-review layer, recommends a policy candidate for a stated objective without automatically changing shipped defaults, and can persist a candidate-lineage manifest for later comparison and promotion review.
thin-supervisor-dev eval review-candidate ... loads one of those manifests and emits the bounded human-review summary for the next promotion step.
thin-supervisor-dev eval candidate-status ... turns the manifest, related eval reports, promotion-registry state, and recorded rollout attempts into one lifecycle dossier.
thin-supervisor-dev eval rollout-history ... exposes the rollout ledger directly.
thin-supervisor-dev eval gate-candidate ... then combines that bounded review with deterministic compare output and optional real-run canary signals before a human decides whether to promote.
thin-supervisor-dev eval improve ... is the current-main-native convenience wrapper around that same flow, so the old “proposal improvement loop” UX exists without reviving a parallel implementation.
thin-supervisor-dev eval promote-candidate ... records an approved promotion in the promotion registry so candidate history and current promoted policies are queryable later.

Real Canary Loop

Yes, you should run real canaries. A safe sequence is:

Offline gate: run eval run, eval replay, eval compare, and optionally eval propose, all with --save-report.
Shadow canary: pick 3-5 real tasks and keep the baseline behavior in charge. Record each finished run with:

thin-supervisor run summarize <run_id>
thin-supervisor run postmortem <run_id>
thin-supervisor-dev eval replay --run-id <run_id> --save-report
thin-supervisor-dev eval canary --run-id <run_id> ... --candidate-id <candidate_id> --phase shadow --save-report
thin-supervisor-dev eval rollout-history --candidate-id <candidate_id> --json

Limited rollout: if shadow canary stays clean, run 10-20 real tasks with the candidate under close observation.

For each real canary, log friction explicitly when needed:

thin-supervisor-dev learn friction add \
  --kind repeated_confirmation \
  --message "user had to approve twice" \
  --run-id <run_id> \
  --signal user_repeated_approval

Then summarize what actually accumulated for a run:

thin-supervisor-dev learn friction summarize --run-id <run_id> --json

Bridge subcommands

thin-supervisor bridge read <pane> [lines]   # Capture pane output
thin-supervisor bridge type <pane> <text>     # Send text (no Enter)
thin-supervisor bridge keys <pane> <key>...   # Send special keys
thin-supervisor bridge list                   # Show all panes
thin-supervisor bridge id                     # Current pane ID
thin-supervisor bridge doctor                 # Check tmux connectivity

Configuration

.supervisor/config.yaml:

surface_type: "tmux"              # tmux | open_relay | jsonl
surface_target: "agent"           # pane label / oly session ID / transcript path
poll_interval_sec: 2.0            # seconds between reads
read_lines: 100                   # lines captured per read

# LLM judge (null = offline stub mode, rules-only)
judge_model: null                 # e.g., anthropic/claude-haiku-4-5-20251001
judge_temperature: 0.1
judge_max_tokens: 512

jsonl is observation-only: the supervisor can watch checkpoints from a transcript file, but instruction delivery still depends on the agent skill / hook path.

Override with environment variables: SUPERVISOR_SURFACE_TYPE, SUPERVISOR_SURFACE_TARGET, SUPERVISOR_PANE_TARGET, SUPERVISOR_JUDGE_MODEL, etc.
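
The override order can be sketched as defaults, then file values, then environment variables. This is a minimal illustration rather than the project's loader: the function name and the key-to-variable mapping are assumptions inferred from the variable names, and SUPERVISOR_PANE_TARGET is left out because the source does not say which key it overrides.

```python
import os
from typing import Mapping

DEFAULTS = {
    "surface_type": "tmux",
    "surface_target": "agent",
    "poll_interval_sec": 2.0,
    "read_lines": 100,
    "judge_model": None,
}

# Assumed mapping from config keys to environment variable names.
ENV_OVERRIDES = {
    "surface_type": "SUPERVISOR_SURFACE_TYPE",
    "surface_target": "SUPERVISOR_SURFACE_TARGET",
    "judge_model": "SUPERVISOR_JUDGE_MODEL",
}


def load_config(file_values: Mapping, environ: Mapping = os.environ) -> dict:
    """Merge defaults, config-file values, and environment overrides."""
    config = {**DEFAULTS, **file_values}
    for key, env_name in ENV_OVERRIDES.items():
        value = environ.get(env_name)
        if value:  # ignore unset or empty variables
            config[key] = value
    return config
```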

Design Philosophy

Foundational stance: see docs/PHILOSOPHY.md — why thin-supervisor watches but does not orchestrate.

Inspired by Anthropic's Scaling Managed Agents:

The system's memory lives in SessionRun, not in the model's context. Crashes don't lose history. Everything is in session_log.jsonl.
The execution surface is just a "hand", not the system. Today that includes tmux, open-relay, and transcript-backed JSONL observation. Tomorrow it could be a PTY wrapper or a remote session. The SessionAdapter protocol keeps the supervisor decoupled.
Harnesses change, primitives don't. The current sidecar loop is one harness. The 6 first-class objects (WorkflowSpec, SessionRun, ExecutionSurface, CheckpointEvent, SupervisorDecision, HandoffInstruction) are the stable interface.
Verification is deterministic, not verbal. "Done" means the verifier passed, not that the agent said so.
Skill evolution happens from structured hindsight, not ad-hoc prompt edits. friction_events and user_preference_memory give the system a durable learning substrate. The intended loop is: capture friction -> summarize/postmortem -> replay/eval candidate policy changes -> update skills/rules only when the offline signal says they are better.

Skill Integration

Install for Claude Code:

cp -r skills/thin-supervisor ~/.claude/skills/

Install for Codex:

cp -r packaging/thin-supervisor-codex ~/.codex/skills/thin-supervisor

Invoke with /thin-supervisor to start the default flow:

clarify ambiguous goals
generate a draft spec
wait for approval
attach and execute only after approval

The skill is now split into two layers:

frozen contract: skills/thin-supervisor*/references/contract.md
optimizable strategy fragments under skills/thin-supervisor*/strategy/

Future policy optimization should target the strategy fragments, not the whole SKILL.md.

Oracle Consultation

If you want an Amp-style "oracle" second opinion without giving up supervisor control, use:

thin-supervisor-dev oracle consult \
  --mode review \
  --question "Review the retry policy design" \
  --file supervisor/loop.py \
  --file supervisor/gates/supervision_policy.py

When an external provider key is configured, thin-supervisor calls that provider as a read-only advisor. Without an external key, it falls back to a self-adversarial review scaffold instead of failing hard. Add --run <run_id> to persist the consultation into the shared notes plane for the active supervised run.

Development

git clone https://github.com/fakechris/thin-supervisor
cd thin-supervisor
pip install -e ".[dev]"
pytest -q

For repo-specific setup and examples, start with docs/getting-started.md.

License

MIT
