Skip to content

Feat/shell allowlist approval#142

Draft
ytallo wants to merge 44 commits into
mainfrom
feat/shell-allowlist-approval
Draft

Feat/shell allowlist approval#142
ytallo wants to merge 44 commits into
mainfrom
feat/shell-allowlist-approval

Conversation

@ytallo
Copy link
Copy Markdown
Contributor

@ytallo ytallo commented May 15, 2026

No description provided.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cabad1f5-738a-4ba7-a6dd-acd98d479890

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/shell-allowlist-approval

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

skill-check — worker

3 verified, 22 skipped (no docs/).

1 error across the verified workers.

File Approximate line Severity Violation
shell/docs/leaves/kill.md ~7 error [Terminology.SlopMarketing] Avoid marketing/anthropomorphic phrasing 'Reaping'. Use concrete, technical language. (error)

@vercel
Copy link
Copy Markdown

vercel Bot commented May 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
workers Error Error May 17, 2026 0:49am

Request Review

ytallo added 25 commits May 16, 2026 14:40
Spec for converting approval-gate from a synchronous blocking gate
(await_decision + 250ms poll) into a deferred-execution trigger model:
the agent receives a pending_approval tool result immediately, the
gate executes the underlying function on resolve, and the outcome is
stitched into the agent's next turn as a system message.

Depends on PR #136 (reactive ui::approval pipeline) landing first.
ytallo added 14 commits May 16, 2026 14:40
…gv function and update ShellConfig for allow_any option
…val_required becomes a fallback

When an operator registers an interceptor rule for a function id (in
iii.worker.yaml), every call to that function goes through the
classifier/pause path — even if the run's `approval_required` list
is empty. The run-level list now only matters for functions that have
no matching rule.

Rationale: the interceptor config is the deployment's source of truth
for which functions need gating. Requiring callers to also list those
ids in approval_required was a second source of truth that's easy to
forget, and the failure mode was silent allowlist-rejection on
shell::exec ("command 'python3' not in allowlist") because the gate
never fired.

Extracted decide_intercept_action() as a pure helper for unit testing
the decision matrix without spinning up the iii engine.
Add ARITY dictionary mapping bash command prefixes to their
human-meaningful token count (git -> 2, npm run -> 3, docker compose -> 3,
etc.) and a prefix() helper that walks longest-first to derive the
canonical command identity for an argv. allowlist_contains now uses
prefix_matches() so operators can write multi-token allowlist entries
('git checkout', 'npm run build') and have them match the right slice;
single-token entries and basename-normalized full-path argv heads stay
back-compat.

Ported from opencode's packages/opencode/src/permission/arity.ts.
Introduce a typed Denial enum (serde tag=kind, content=detail) carried on
wire replies, persisted records, and approval_resolved stream events.
Variants: Policy {classifier_reason, classifier_fn}, UserRejected,
UserCorrected {feedback}, StateError {phase, error}, Legacy {reason}.

approval-gate
- Decision::Deny and ClassifierDecision::Deny wrap Denial.
- block_reply_for emits {block, denial?} instead of a flat reason string.
- transition_record takes Option<Denial> in place of decision_reason.
- handle_resolve parses payload.denial; missing denial defaults to
  UserRejected so single-click reject UIs stay simple; malformed denial
  returns bad_denial.
- handle_intercept state-write failure emits Denial::StateError; already
  resolved / in-flight replays carry a replay discriminator (not a
  denial).
- migrate_legacy_record handles two generations of drift: pre-trigger-
  model status strings and the pre-Denial decision_reason field; the old
  field is stripped on read so it never round-trips.
- approval_resolved stream events carry denial in place of
  decision_reason.
- timed_out records carry no Denial; status is self-describing.

turn-orchestrator
- approval_stitching renders per-kind LLM-readable lines (policy names
  the classifier, user_corrected surfaces the feedback string, etc.).

harness-types (6 vendored copies kept byte-identical)
- ApprovalResolved carries denial: Option<Denial> in place of
  reason: Option<String>; Denial type defined locally and re-exported
  from lib.rs.
…ingle drain RPC (#146)

* feat(approval-gate): layered policy rules with findLast evaluator

Introduce a first-class Rule/Ruleset/Action surface ported from
opencode's Permission.evaluate. Rules are wildcard globs over the iii
function id (pattern is "*" in v1 — the field is reserved for per-
function pattern extractors in a follow-up). Evaluator semantics:
findLast across flattened layers, so operators stack a permissive
default with more-specific overrides without surgery on the base list.

approval-gate
- New rules module: Action (allow/deny/ask), Rule, Ruleset alias,
  wildcard_match (DP, no regex), evaluate(perm, pattern, IntoIterator
  of &Rule). 16 unit tests covering matching, layering, last-wins.
- WorkerConfig.rules: Vec<Rule> (additive; existing InterceptorRule
  flow unchanged).
- apply_policy_rules: pure helper that maps a rule hit to PolicyOutcome
  (Allow / Deny{rule_permission, rule_pattern} / FallThrough).
  Allow short-circuits to pass-through; Deny short-circuits with a
  Denial::Policy that names the matched rule; Ask and no-match fall
  through to the existing per-function interceptor flow.
- handle_subscriber consults the rules layer first, before
  decide_intercept_action.

* feat(approval-gate): cascade allow on `always: true` resolve reply

When a user resolves a pending call with decision=allow + always=true,
approval-gate now pushes a runtime Allow rule for the call's function
id and sweeps the same session's pending records, auto-resolving every
other one the new rule covers. One click instead of N.

approval-gate
- approve_and_execute() extracted from handle_resolve's allow branch as
  a reusable async helper. Both the user-driven allow path and the
  cascade sweep drive state through the same transitions
  (approved -> invoke -> executed | failed).
- Shared policy ruleset wrapped as Arc<RwLock<Ruleset>> so reply-time
  mutation is safe across the subscriber + resolve closures. Read
  guards are scoped so they never cross an .await (std::sync::RwLock
  is not async-safe to hold across suspension).
- cascade_allow_for_session() pushes the new rule under the write
  lock, snapshots session pending via list_prefix, and runs
  approve_and_execute for each newly-Allow record. The originator is
  skipped (already resolved above). Per-record state-write failures
  are logged and the rest of the cascade continues.
- handle_resolve response carries `cascaded: N` when the sweep
  resolved at least one extra record; omitted otherwise so the
  one-shot path stays unchanged.

Cascade scope is intentionally narrow for v1:
- function-id-only matching (pattern "*"), mirrors the v1 rules surface
- same session only
- in-memory rule (no cross-restart persistence)
- allow + always; deny + always is out of scope

Five new tests cover the cascade decision tree (no-always, same-session
match, cross-session non-effect, originator-skip, terminal-record skip).

* chore(approval-gate): remove legacy migration surface

The Denial refactor replaced decision_reason with a typed structure, but
several legacy hooks remained in place. They are now removed because the
underlying schemas they protected against (pre-trigger-model status
strings, pre-Denial flat decision_reason field) cannot occur in any
state store this branch is wired to.

approval-gate
- Drop migrate_legacy_record entirely. The pre-trigger-model
  status="allow"/"deny" rename and the decision_reason -> Denial::Legacy
  lift both go away. Old persisted records (if any) now surface as
  filtered-out orphans rather than silently mistranslated.
- Drop Denial::Legacy variant from the enum.
- Drop legacy_migrated flag from records.
- Drop the four migrate_legacy_record_* unit tests and the
  handle_list_undelivered_persists_migrated_legacy_record integration
  test.
- handle_list_pending no longer pre-filters legacy shapes; orphan
  records lacking a session_id stamp are dropped uniformly.
- timeout_resolved_event drops the decision_reason: "timeout" field
  (status: "timed_out" is self-describing).
- handle_sweep_session drops the optional `reason` payload field. The
  cause (session_deleted vs run_stopped) is the caller's concern; the
  swept timed_out records carry no Denial.
- skills/sweep_session.md updated.

turn-orchestrator
- approval_stitching drops the legacy denial-kind branch and the
  legacy_migrated note line; the corresponding stitch test is removed.
- run_stop no longer forwards reason: "run_stopped" on the
  approval::sweep_session payload; it logs the cause locally instead.
- The run_stop integration test now asserts the absence of the reason
  field on the sweep payload.

harness-types (6 vendored copies, byte-identical)
- Drop the Legacy variant and its docblock line from each agent_event.rs
  copy.

* feat(approval-gate): add typed Record, Status, Next schema

New src/record.rs defines the persisted approval record as a typed
struct alongside its lifecycle Status enum and a Next enum that pairs
each target status with its outcome data. Wire format is byte-identical
to the existing serde_json::Value blobs — operators can adopt the
typed shape incrementally without changing any persisted state.

approval-gate
- New module record:
  - Status enum (Pending / Approved / Executed / Failed / Denied /
    TimedOut), serde snake_case, with is_terminal() helper.
  - Record struct mirroring the historical JSON keys field-for-field,
    optional outcome fields scoped to terminal statuses, serde
    skip-when-None so existing rows round-trip cleanly.
  - Next enum that bundles (target_status, outcome_payload) so the
    type system rules out impossible transitions (e.g. Executed without
    a result, Denied without a Denial). Replaces the old
    (status_string, Option<result>, Option<error>, Option<denial>)
    parallel-Option signature when callers opt in.
  - Record::to_value / Record::from_value bridge to the Value blobs the
    iii state bus still expects.
  - 8 unit tests covering serde round-trips, status semantics, expiry
    saturation, forward-compat with unknown fields.
- lib.rs re-exports Next, Record, Status.

Handler migration to the typed schema is a separate change; this commit
just makes the types available so future work can lift Value access
into typed field access incrementally rather than in one big diff.

* refactor(approval-gate): extract wire.rs from lib.rs

First step of breaking lib.rs (~4300 lines) into focused modules.
This commit lifts the wire-format types and small wire-shape helpers
into their own module — pure data shapes with no I/O, no iii-sdk
dependency, no async — so downstream workers that only need to
understand the approval-gate protocol can depend on a small surface.

Moved (lib.rs -> wire.rs):
- Denial (structured deny payload)
- IncomingCall + requires_approval()
- Decision (Allow | Deny(Denial))
- WireDecision (allow|deny wire enum)
- pending_key()
- extract_call()
- block_reply_for()

All public items stay re-exported from the crate root, so every
existing call site (handlers, tests, integration tests, downstream
crates) continues to work without import changes.

Pure move, no behavior change. 141 approval-gate tests pass;
turn-orchestrator + harness + session + providers + harness-tui +
shell all pass.

* refactor(approval-gate): extract lifecycle.rs from lib.rs

Second step of breaking lib.rs into focused modules. This commit
lifts the persisted-record lifecycle helpers — pure Value-blob
constructors and transitions, no I/O, no iii-sdk dependency — into
their own module.

Moved (lib.rs -> lifecycle.rs):
- is_terminal_status()
- build_pending_record()
- transition_record() + transition_record_with_now()
- collect_timed_out_for_sweep()
- maybe_flip_timed_out()

All public items stay re-exported from the crate root.
Pure move, no behavior change. 141 approval-gate tests pass;
all 8 downstream worker crates pass.

* refactor(approval-gate): extract state.rs from lib.rs

Third step of breaking lib.rs into focused modules. This commit lifts
the iii-backed state-bus and function-executor implementations into
their own module, along with the `__from_approval` marker plumbing
and the boot-time marker-target safety check.

Moved (lib.rs -> state.rs):
- StateBus trait + IiiStateBus impl  (iii state::* wrapper)
- FunctionExecutor trait + IiiFunctionExecutor impl  (iii.trigger wrapper)
- rule_for()  (lookup helper used by IiiFunctionExecutor)
- merge_from_approval_marker_if_needed()  (marker payload composer)
- unverified_marker_targets()  (boot guard)

The StateBus / FunctionExecutor traits are kept exactly as they were
in lib.rs — they exist purely as test seams so unit tests can swap in
InMemoryStateBus / FakeExecutor. No new abstractions added.

Public items stay re-exported from the crate root.
Pure move, no behavior change. 141 approval-gate tests pass;
all 8 downstream worker crates pass (turn-orchestrator's pre-existing
dual_write flake notwithstanding).

* refactor(approval-gate): extract intercept.rs from lib.rs

Fourth step of breaking lib.rs into focused modules. This commit
lifts the intercept decision flow into its own module — the three
layers that decide what the gate does with an incoming function
call: policy rules, per-function interceptor rules, and classifier
replies, plus the async handle_intercept that writes the pending
record.

Moved (lib.rs -> intercept.rs):
- InterceptAction enum + decide_intercept_action()
- PolicyOutcome enum + apply_policy_rules()
- ClassifierDecision enum + interpret_classifier_reply()
- handle_intercept()

handle_intercept stays public (re-exported from the crate root).
The pub(crate) helpers stay pub(crate) and are imported back into
lib.rs for the subscriber closure in register().

Pure move, no behavior change. 141 approval-gate tests pass;
turn-orchestrator + harness + shell verified green.

* refactor(approval-gate): extract resolve.rs from lib.rs

Fifth step of breaking lib.rs into focused modules. This commit
lifts the approval::resolve flow into its own module.

Moved (lib.rs -> resolve.rs):
- handle_lookup_record()
- handle_resolve()
- cascade_allow_for_session() (the always=true sweep)
- approve_and_execute() (shared by user-driven allow + cascade)

handle_resolve and handle_lookup_record stay public via the crate
root re-exports. approve_and_execute and cascade_allow_for_session
stay private to the resolve module.

Pure move, no behavior change. 141 approval-gate tests pass.

* refactor(approval-gate): extract delivery.rs from lib.rs

Sixth step of breaking lib.rs into focused modules. This commit
lifts the delivery-tracking RPCs into their own module.

Moved (lib.rs -> delivery.rs):
- handle_list_pending()
- handle_list_undelivered()
- handle_ack_delivered()
- handle_consume_undelivered()
- handle_flush_delivered()
- handle_sweep_session()
- LIST_UNDELIVERED_DEFAULT_LIMIT

All public items stay re-exported from the crate root.
Pure move, no behavior change. 141 approval-gate tests pass.

* refactor(approval-gate): extract sweeper.rs from lib.rs

Seventh step of breaking lib.rs into focused modules. This commit
lifts the periodic timeout sweeper and the small iii-stream helpers
it owns into their own module.

Moved (lib.rs -> sweeper.rs):
- spawn_timeout_sweeper()
- timeout_resolved_event()
- write_event()
- write_hook_reply()
- uuid_like()

All five stay pub(crate) — only the register() wiring in lib.rs
and the resolve flow need them. Pure move, no behavior change.
141 approval-gate tests pass.

* refactor(approval-gate): extract register.rs from lib.rs

Final step of breaking lib.rs into focused modules. This commit
lifts the iii function/trigger wiring — the heart of the worker
startup path — into its own module.

Moved (lib.rs -> register.rs):
- register() (~420 lines: the subscriber closure + 8 function
  registrations + trigger registration + sweeper spawn)
- Refs struct (the handle bag returned to the binary)
- FN_RESOLVE / FN_LIST_PENDING / FN_LIST_UNDELIVERED /
  FN_CONSUME_UNDELIVERED / FN_ACK_DELIVERED / FN_FLUSH_DELIVERED /
  FN_SWEEP_SESSION / FN_LOOKUP_RECORD constants
- STATE_SCOPE constant

lib.rs is now ~50 lines of module declarations + re-exports +
~2580 lines of inline tests. The test mod uses super::* plus
local serde_json and std::sync imports.

All public items stay re-exported from the crate root so the
binary's `use approval_gate::register;` keeps working unchanged.
Pure move, no behavior change.

141 approval-gate tests pass; turn-orchestrator + harness + shell
verified green (turn-orchestrator's pre-existing dual_write flake
notwithstanding).

* refactor(approval-gate): move pub(crate)-using tests inline

First step of relocating the 2580-line inline test block from
lib.rs into more focused homes. This commit moves the 18 tests
that touch pub(crate) helpers — ones that can't live in tests/*.rs
because integration tests can only see pub items — into the
#[cfg(test)] mod tests block of whichever production module owns
the helper they exercise.

intercept.rs gained:
- interpret_classifier_reply_reads_decision_tags
- decide_intercept_action_* (5 tests)
- apply_policy_rules_* (5 tests)

state.rs gained:
- merge_from_approval_* (4 tests)
- rule_for_returns_matching_rule / rule_for_returns_none_when_absent

sweeper.rs gained:
- timeout_resolved_event_shape

lib.rs loses those 18 tests. 141 approval-gate tests pass (no
duplicate counts — each test runs once). turn-orchestrator + harness
+ shell verified green.

Public-API tests (the remaining ~96 in lib.rs) move to tests/*.rs
in a follow-up commit so they can share fakes via a common module.

* refactor(approval-gate): move public-API tests to tests/*.rs

Final test split. The 86 remaining inline tests in lib.rs that
exercise only the crate's `pub` surface now live in tests/*.rs,
organized by the area they cover. The state-machine proptest gets
its own file. lib.rs drops from 2369 to 52 lines (just module
declarations + re-exports).

New tests/ layout:
- common/mod.rs    — shared fakes: FakeExecutor, InMemoryStateBus,
                     FailingStateBus, sample_call(), empty_policy_rules()
- lifecycle.rs     — 19 tests: build_pending_record, transition_record,
                     maybe_flip_timed_out, collect_timed_out_for_sweep,
                     is_terminal_status, pending_key
- wire.rs          — 10 tests: extract_call, block_reply_for,
                     requires_approval
- intercept.rs     — 9 tests: handle_intercept (replays, fail-closed,
                     session_id stamping, force_pending)
- resolve.rs       — 19 tests: handle_resolve, cascade-on-`always`,
                     handle_lookup_record
- delivery.rs      — 25 tests: list_pending / list_undelivered /
                     ack_delivered / consume_undelivered /
                     flush_delivered / sweep_session
- misc.rs          — 4 tests: FN_* constants, unverified_marker_targets,
                     FakeExecutor smoke test
- state_machine.rs — the proptest with the four lifecycle invariants

Each tests/*.rs uses `mod common;` for shared fakes — only one source
of truth for in-memory bus/executor stand-ins.

141 approval-gate tests pass across 13 test binaries (previously 6).
turn-orchestrator + harness + shell verified green.

* feat(approval-gate): new Record schema with Pending|InFlight|Done lifecycle

Pending → InFlight → Done(Outcome). InFlight is the intermediate
persist that closes the dup-exec race within a worker process — a
second approval::resolve arriving during the invoke await sees the
row is non-Pending and bails.

Outcome is a tagged enum (Executed/Failed/Denied/TimedOut). resolved_at
is stamped on the first non-Pending transition for deterministic
multi-row consume ordering.

lifecycle.rs is deleted; its only surviving helper
(flipped_to_timed_out_if_expired) is now a Record method.

* fix(approval-gate): keep lifecycle.rs as transitional compat shim

Task 1 of the simplification plan deletes lifecycle.rs entirely, but
the plan was a bit optimistic: dependent modules (intercept, resolve,
delivery, sweeper) still import its helpers (build_pending_record,
transition_record, maybe_flip_timed_out, is_terminal_status,
collect_timed_out_for_sweep) and the crate fails to compile without
them. Cargo can't run record.rs's unit tests if the library itself
won't link.

Restore lifecycle.rs as a transitional shim. Tasks 5/6/8/11 will
progressively migrate each callsite to the new Record API; whichever
of those tasks lands last deletes lifecycle.rs for real.

* feat(approval-gate): Denial::Policy carries rule_permission + rule_pattern

The classifier surface is being deleted in favor of the layered rules
engine; rename the Policy variant's detail fields to match the actual
data they'll carry once T5 lands. For now, classifier callsites
(intercept.rs, register.rs) map their old reason/fn locals into the
new field names — transient mapping, deleted in T5.

Test callsites in tests/wire.rs and tests/lifecycle.rs updated to use
representative rule_permission/rule_pattern values.

Note: IncomingCall.approval_required + requires_approval() field/method
are NOT removed in this commit (the plan's T2 calls for it but
intercept.rs's decide_intercept_action still depends on them). They
disappear in T5 when intercept.rs is rewritten end-to-end.

* feat(approval-gate): pattern_for(function_id, args) extractor

For shell::exec/shell::exec_bg, derive a pattern string from
{command, args} so layered rules can match argv-level patterns like
'git status*'. Other function ids default to '*'. Malformed/missing
args fall back to '*' (matches only wildcard rules).

8 new tests cover all branches: join+args, single-string command,
empty args list, missing command, malformed args, non-shell function
id, and the documented space-conflation case.

* feat(approval-gate): StateBus::delete primitive

Required by the new approval::consume RPC (T8), which returns Done
rows and deletes them in the same call. IiiStateBus impl maps to the
existing state::delete RPC (used elsewhere in the codebase, e.g.
auth-credentials). Test fakes (InMemoryStateBus, FailingStateBus) gain
matching impls — InMemory removes the key idempotently, Failing errors
to match its other primitives.

* feat(approval-gate): verdict-driven handle_intercept

The classifier surface and per-function InterceptorRule decision flow
are gone. handle_intercept now reads the verdict via the new lib-level
verdict_for(function_id, args, ruleset) helper:

  Verdict::Allow → {block:false}, no state write
  Verdict::Deny  → {block:true, denial:Policy{rule_permission, rule_pattern}}
  Verdict::Ask   → write Pending row + reply {block:true, status:pending}

Replay defense extended for the new 3-state schema:
  Pending  → replay:in_flight, status:pending
  InFlight → replay:in_flight, status:in_flight  (new)
  Done     → replay:already_resolved, status:done

intercept.rs drops InterceptAction, decide_intercept_action,
ClassifierDecision, interpret_classifier_reply, PolicyOutcome,
apply_policy_rules. register.rs's subscriber closure collapses the
old policy_outcome+decide_intercept+classifier dance into one
handle_intercept call. resolve.rs's cascade loop reuses verdict_for
instead of the deleted apply_policy_rules.

State-write failure still fails closed with Denial::StateError.
7 new tests cover all four verdict branches + the three replay
states + fail-closed behavior.

Note: InterceptorRule config struct and intercept_rules variable
are still in place for T12 to fully remove.

* feat(approval-gate): three-phase resolve + cascade exact-pattern (T6+T7)

handle_resolve rewritten to typed Record + three-phase allow path:
  1. write InFlight (closes the dup-exec race within a worker process)
  2. iii.trigger(function_id, args) and await
  3. write Done(Executed{result}) or Done(Failed{error})

Deny is a single Pending → Done(Denied{denial}) write — no invoke,
no InFlight needed. WireDecision::Deny with no denial in payload
defaults to UserRejected. UserCorrected{feedback} round-trips.

Dup-exec guard: handle_resolve refuses non-Pending rows with
  in_flight (concurrent resolve mid-invoke) or
  already_resolved (terminal).

Cascade allow on always:true (T7) now pushes a runtime Allow rule with
the originator's EXACT pattern via rules::pattern_for(args) — not the
prior blanket pattern:'*'. 'Always allow git status' no longer
auto-allows rm -rf / via the same shell::exec function id.
Lock-ordering invariant pinned in module docs: never hold the ruleset
guard across .await.

* feat(approval-gate): approval::consume + strip delivery dead RPCs + delete sweeper.rs/lifecycle.rs (T8+T9+T11)

Three behavioral changes folded into one commit because they share files:

T8 (new approval::consume RPC):
  Three-phase drain — gather Done candidates, sort by resolved_at + cap,
  delete-and-return. Defensive session_id filter. Lazy timeout flip on
  read. Default cap CONSUME_DEFAULT_LIMIT=50 bounds the response size
  against MB-scale stdout payloads.

T9 (delivery.rs strip):
  handle_list_undelivered, handle_consume_undelivered, handle_ack_delivered,
  handle_flush_delivered all deleted. LIST_UNDELIVERED_DEFAULT_LIMIT
  constant gone. handle_list_pending rewritten on the typed Record API
  with lazy timeout flip on read. handle_sweep_session rewritten on new
  schema — flips Pending+InFlight rows to Done(TimedOut).

T11 (sweeper.rs deleted):
  Background polling task gone. Timeouts now flip lazily on read in
  handle_resolve / handle_consume / handle_list_pending. UI handles
  expires_at countdown client-side; the LLM learns of timeouts on the
  next consume.
  lifecycle.rs transient shim also deleted (no more callers).
  Stream helpers (uuid_like, write_event, write_hook_reply) move into
  register.rs as their only consumer; spawn_timeout_sweeper +
  timeout_resolved_event deleted with the rest of sweeper.rs.

register.rs surgery:
  Refs struct loses 5 dead FunctionRef fields and the sweeper JoinHandle.
  RPC registrations for list_undelivered, consume_undelivered,
  ack_delivered, flush_delivered all gone. New FN_CONSUME registration
  added. Classifier-alias warning check trimmed to live function ids.

15 new tests cover handle_consume (7) + handle_sweep_session (1) +
handle_list_pending lazy flip (1) + the pre-existing edits. 65 lib
tests pass total.

* feat(approval-gate): strip __from_approval marker plumbing (T10)

IiiFunctionExecutor::invoke forwards function_id+args directly to
iii.trigger — no marker stamp.

Deleted from state.rs:
- merge_from_approval_marker_if_needed (and its 4 unit tests)
- unverified_marker_targets (and the 2 rule_for tests)
- rule_for (only marker-related callers)

Deleted from register.rs:
- IiiFunctionExecutor's rules: Arc<Vec<InterceptorRule>> field
- Boot-time unverified-marker check (refused-to-start guard)

Per the spec's Threat Model section: bus access ≡ shell access in the
new model. The harness's bus-level access control is the perimeter;
defense-in-depth via per-target marker verification is out of scope
for v1. Shell-side marker verification is deleted in T13.

T12 will fully retire InterceptorRule + the classifier-alias warning
loop in register.rs.

* feat(approval-gate): strip config to topic+scope+timeout+rules (T12)

config.rs: WorkerConfig now has just {topic, approval_state_scope,
default_timeout_ms, rules}. interceptors + sweeper_interval_ms fields
gone. InterceptorRule struct kept as a no-op shim with minimal fields
so the classifier-alias warning loop in register.rs still compiles —
the loop is fed an empty Vec so the body never runs.

iii.worker.yaml: replaced interceptor config with a curated default
ruleset. Read-only fs/git auto-allowed; shell::exec/exec_bg ask;
catch-all asks. Operators stack their own rules on top (last-match
wins).

register.rs cfg.interceptors → cfg.rules wiring confirmed: policy_rules
is seeded from cfg.rules at startup and remains mutable via cascade.

Note: InterceptorRule shim + classifier-alias warning loop in register
will be deleted in a follow-up pass when the dependent code is fully
cleaned up. They're cosmetic at this point — neither has functional
effect.

* feat(shell): strip classify_argv / allowlist / __from_approval marker (T13)

Shell becomes a plain executor. All policy decisions live in
approval-gate's rules layer (see the layered ruleset shipped in
approval-gate/iii.worker.yaml).

Deleted:
- shell/src/functions/approval_bypass.rs (marker validation module)
- shell/src/functions/classify.rs (classifier handler)
- shell/src/arity.rs (arity-aware allowlist matcher)
- ApprovalMarker + ClassifyArgvRequest types
- ExecRequest.from_approval + ExecBgRequest.from_approval fields
- ShellConfig.allowlist + allow_any + denylist_patterns +
  compiled_denylist fields and their methods (compile_denylist,
  denylist_hit_reason, allowlist_contains, is_command_allowed)
- shell::classify_argv function registration
- shell::classify_argv manifest entry
- All allowlist/denylist unit tests in shell/src/config.rs
- One test in shell/tests/function_handlers.rs that asserted shell
  rejected unlisted commands

Kept (independent surface):
- fs::host_root jail + fs.denylist_paths (filesystem path-based denials
  for shell::fs::* tools; distinct from the exec-policy denylist)
- The fs jail validate_fs_jail boot check

Threat model implication, per the spec: bus access ≡ shell access.
The harness's bus-level access control is the perimeter; any worker on
the bus can call shell::exec with arbitrary args. Defense-in-depth via
shell-side allowlists is gone for v1.

133 shell lib tests + 328 shell integration tests pass.

* feat(turn-orchestrator): switch stitch to approval::consume (T14)

consume_approval_stitch now calls approval::consume (single RPC,
returns + deletes in one shot). Payload is { session_id } only; no
turn_id, no limit (gate enforces a default cap of 50, surfacing
overflow via the 'omitted' counter as before).

stitch_entries rewritten for the new Record wire shape: reads nested
outcome: { kind, detail } instead of the old top-level status +
result/error/denial. Outcome kinds: executed, failed, denied,
timed_out. function_call_id is the canonical key (call_id retained
as a legacy fallback for old rows on disk during the upgrade).

render_denial_lines's policy branch reads rule_permission/rule_pattern
(matching the renamed Denial::Policy detail shape from T2). The
'policy denied by classifier_fn: classifier_reason' wording becomes
'policy denied by rule <perm> : <pat>'.

omission_summary_message reworded: 'approval::flush_delivered' is gone;
the natural recovery path is the next-turn consume, which is what the
new message points at.

131 lib tests pass (was 122 + 9 failing from old-shape assertions).
Pre-existing integration test failure (tests/run_stop.rs imports
run_stop module that isn't declared in lib.rs) is unrelated to this
refactor.

* test(approval-gate): E2E lifecycle tests + delete dead integration tests (T17)

Delete 8 integration test files whose assertions targeted the deleted
surface (FN_LIST_UNDELIVERED, FN_ACK_DELIVERED, transition_record,
InterceptAction, marker plumbing, ...):
- tests/approval_lifecycle.rs (engine-backed, used deleted RPCs)
- tests/delivery.rs (old delivery RPC surface)
- tests/integration.rs (old subscriber wiring)
- tests/intercept.rs (decide_intercept_action, classifier path)
- tests/lifecycle.rs (record-helper tests on deleted schema)
- tests/misc.rs (mixed assertions on old shape)
- tests/resolve.rs (old transition_record / Approved status)
- tests/state_machine.rs (proptest dep + old schema)

The src/* unit tests added across T1-T12 cover the equivalent surface
for the new wire shape (Pending|InFlight|Done, Outcome enum,
Denial::Policy {rule_permission, rule_pattern}, etc.).

Add tests/lifecycle.rs with FIVE E2E lifecycle tests against the
in-memory StateBus + FakeExecutor — the integration safety net for
the whole refactor:
1. allow_path_end_to_end — pending → InFlight → executed → consume drains
2. deny_path_with_user_corrected_feedback_end_to_end — feedback round-trips
3. timeout_path_lazy_flips_on_consume_end_to_end — past expires_at flips on read
4. cascade_path_end_to_end — two pending rows; allow+always pushes exact-pattern
   rule; second auto-resolves; consume drains both; pinned: pushed pattern is
   the originator's argv ('echo go'), NOT blanket '*'
5. allow_rule_short_circuits_with_no_state_write — bonus: Verdict::Allow
   doesn't touch state

72 approval-gate tests pass across 6 suites (lib + wire + manifest +
lifecycle + 2 from common helpers). Net deletion: ~2500 lines of
old-shape integration tests gone, ~250 lines of new E2E in their place.

* feat(harness/web): Allow+Always, Deny feedback, expires_at countdown (T15)

ApprovalRow.tsx grows three real-feature additions matching the
approval-gate refactor's new wire surface:

1. Allow+Always button → sends {decision:'allow', always:true}. The
   gate cascades by pushing a runtime Allow rule with the originator's
   exact pattern and sweeping the session's other Pending rows
   (auto-resolving any that match). The response carries cascaded:N
   when the sweep was non-empty; not surfaced in the UI yet.

2. Optional Deny feedback textarea → when the operator types a
   correction message before clicking deny, the payload becomes
   {decision:'deny', denial:{kind:'user_corrected', detail:{feedback}}}.
   That feedback flows verbatim to the LLM via the stitched message
   on the next consume — the high-value path because the model gets
   actionable correction, not just 'denied'.

3. Client-side expires_at countdown → the modal shows MM:SS remaining
   and disables actions on expiry. No server-emitted timed_out frame
   anymore (the gate's sweeper is gone); the LLM learns of timeout
   via the next-turn approval::consume that lazy-flips the row.

Each pending row is split into its own ApprovalCard component so the
countdown hook respects rules-of-hooks (can't sit inside a .map
callback). ResolvePayload is constructed inline so the bridge's
Record<string, unknown> type accepts it without an explicit cast at
the interface boundary.

CSS classes for the new states (approval-allow-always,
approval-countdown, approval-feedback, approval-expired-note) inherit
default styling for now; a follow-up styling pass can polish.

* docs(harness/fanout): note T16 stream-rewrite as deferred follow-up

The approval-gate refactor's T16 called for switching the fanout from
polling approval::list_pending to subscribing to agent::events for
approval_requested / approval_resolved frames. The polling
implementation continues to work correctly against the new wire shape
— diff_approvals reads function_call_id (now top-level on Record) and
list_pending was kept by approval-gate's T8/T11 — so this is a perf
cleanup, not a correctness fix.

Comment the constant accordingly; defer the actual stream subscription
to a follow-up PR. No code change beyond the doc.

* fix(shell/e2e): drop allowlist/denylist tests + configs

Shell stopped enforcing command-level policy; the e2e suite still
referenced the removed allowlist/denylist_patterns config fields and
pinned behavior that no longer exists. Strip those expectations and
update the docs that described them.

- config.yaml / config-jailed.yaml: remove allowlist + denylist_patterns
  (would fail YAML parse against the new ShellConfig schema)
- cases-safety.ts: drop 3 allowlist/denylist assertions; keep one
  positive 'absolute-path command runs' case
- cases-edge.ts: 'nonexistent unlisted command' rewritten to assert
  the OS spawn error (ENOENT), not an allowlist message
- cases-exec-break.ts: drop 'unlisted command with extra fields'
- cases-exec-sandbox.ts: drop 'sandbox allowlist still enforced'
- cases-vuln-repro.ts: drop S-H3 case (advisory denylist) — the
  finding's whole point was that command policy belongs upstream
- ARCHITECTURE.md / e2e README: reflect the new boundary
…vered, flush_delivered, list_undelivered, and sweep_session. These files have been deleted as part of the ongoing simplification and refactoring of the approval-gate system.
ytallo added 4 commits May 16, 2026 21:56
…ning no-op

The T12 simplification (94528b0) removed the classifier surface but left
behind InterceptorRule and a hard-coded empty-vec warning loop for source
compatibility. Nothing references either — delete the struct, its
re-export, and the orphaned subscriber-closure captures.
…ator, hooks, and UI

When the gate replies status:"pending" the orchestrator, hook bus, and
web reducer all need to cooperate so the placeholder result resolves
into the executed outcome on approval. Six changes in one feature:

- hook-fanout: merge_first_block_wins now forwards the full blocker
  reply instead of flattening to {block, reason}, so status:"pending",
  denial, call_id, and function_id survive the merge.
- turn-orchestrator/run_start: new run::resume function. Approval-gate
  calls it after a late-resolved approval; if the session is Stopped it
  resets to Provisioning so handle_streaming runs and consume drains.
- turn-orchestrator/states/functions:
  - terminate=true on pending block (was false, looping the model).
  - Pass session_id into the before_function_call payload — without it
    the gate cannot key its pending record and silently passes through.
  - Render pending blocks as non-error (not red in the UI).
  - Race-dedupe the FunctionResult append and gate turn_end_emitted so
    two concurrent turn::step handlers don't double-emit.
- turn-orchestrator/states/assistant: resolve_pending_function_results
  replaces the placeholder pending body with the executed / failed /
  denied / timed_out outcome from approval::consume — fixes the chat
  getting stuck on "Awaiting human approval...".
- turn-orchestrator/persistence: dedupe FunctionResult by
  function_call_id at save time (last line of defense for the race).
- harness/web/reducer: dedupe function_result by function_call_id (not
  timestamp+content-length) so race-emitted MessageEnd events don't
  render the same pending block twice.
Bring back the pre-T13 argv-level allowlist + denylist on ShellConfig as
optional, default-empty. Both lists empty (the shipped default and the
playground/harness deployment) means pass-through: the approval-gate
remains sole authority. Operators running shell standalone can populate
either list to get a defense-in-depth check evaluated inside
shell::exec / shell::exec_bg right after argv is parsed.

- ShellConfig regains allowlist (Vec<String>), denylist_patterns
  (Vec<String>), and the compiled_denylist regex cache; load_config
  calls compile_denylist(). allow_any is intentionally not restored —
  its purpose was the classifier-path bypass that 94528b0 removed; with
  shell owning its own check, "no enforcement" is already expressible
  by leaving allowlist empty.
- shell/src/arity.rs is reinstated (the file 94528b0 deleted) so
  allowlist_contains can keep using prefix_matches for token-aligned
  multi-token entries ("git checkout", "docker compose up") and
  basename-normalized full-path argv heads.
- shell::exec and shell::exec_bg invoke cfg.is_command_allowed right
  after parse_argv. Denylist short-circuits before the allowlist check;
  empty lists short-circuit to Ok(()).
- shell/config.yaml ships with both lists empty. The pre-T13 curated
  read-only allowlist is left out on purpose so the harness deployment
  doesn't end up in a re-prompt loop when the gate approves a command
  that the shell-side list excludes.
- Restored config-level tests (allowlist permit/reject, basename match,
  arity matching, denylist precedence, empty-argv rejection) plus a new
  denylist_wins_over_allowlist pin. Handler-level tests un-stub
  cfg_with_allow and gain a passthrough regression test
  (exec_handler_passthrough_when_lists_empty) alongside the reinstated
  exec_handler_rejects_unlisted_command.
- E2E configs lose the stale "T13: gone" comments.

End-to-end verified through the harness web UI: approving a shell::exec
call for an unlisted command now executes the call exactly once instead
of looping back to pending_approval.
Tighter approval-card markup and a dedicated styles block. Functional
behavior is unchanged; this only touches presentation and a11y.

ApprovalRow.tsx:
- Wrap the card in a semantic <section role="region"> labelled by the
  function id (titleId derived from the call id) and use a <header> for
  the eyebrow/title/countdown row.
- Surface state via data attributes (data-expired, data-busy) instead of
  conditional class strings so CSS owns the visual variants.
- Add countdownTone() — emits ok / warn / danger / expired thresholds at
  30s / 10s / 0s, exposed as data-tone for CSS coloring.
- Split the deny-feedback summary into a label + hint pair so it can
  wrap cleanly under small widths.
- Drop the standalone "allow" duplicate; keep the explicit "allow" and
  "allow + always" pair (matches the wireframes pass).

styles.css:
- New "approvals" block at the end of the stylesheet defining .approval,
  .approval-head, .approval-eyebrow, .approval-countdown[data-tone],
  .approval-args, .approval-feedback*, .approval-actions and the three
  action buttons. Uses the existing CSS-variable palette (var(--accent),
  var(--paper), var(--rule)) and color-mix() for the wash + edge tones.
- approval-enter keyframes with prefers-reduced-motion fallback.
- Mobile breakpoint (<=640px) wraps actions and rebalances the deny
  button to its own row.
The review gaps were around approval resolution edge cases, allow+always cascading, resume re-kicks, shell policy fallback, and the web approval controls. This adds focused regression coverage at the handler/integration boundaries and fixes the manifest default_config source so the parity test reflects the shipped worker YAML instead of a divergent struct default.

Constraint: Keep unrelated dirty workspace changes out of the commit

Rejected: Full-stack UI-to-shell E2E | too heavy for this gap pass and less deterministic than the targeted boundary tests

Confidence: high

Scope-risk: narrow

Directive: Do not change approval manifest defaults without keeping iii.worker.yaml parity covered

Tested: approval-gate cargo test in clean staged worktree

Tested: shell cargo test --test function_handlers in clean staged worktree

Tested: turn-orchestrator cargo test --test run_resume in clean staged worktree

Tested: harness/web npm test -- --run src/components/ApprovalRow.test.tsx

Not-tested: Full browser-driven approval flow across orchestrator, approval-gate, shell, and web
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant