feat: chat A2A inner loop, council routing, compaction authority (3/3) #200

Open
mdear wants to merge 18 commits into Intelligent-Internet:main from mdear:feature/a2a-chat-inner-loop_3_of_3

mdear commented Apr 13, 2026

Chat A2A Inner Loop, Council Routing & Compaction Authority (3/3)

Merge order: This PR targets main but is intended to be merged after feature/a2a-agent-inner-loop_2_of_3 (#199), which itself follows feature/local-docker-sandbox_1_of_3 (#198).

This final slice lands the chat-mode A2A path and the follow-up hardening needed to validate model steering, stabilize Copilot session handling, and polish the composer UX.

Core chat A2A delivery

  • Adds the chat A2A turn loop and event translation needed for streamed chat-mode execution
  • Routes council members through A2A independently while preserving usage visibility
  • Prevents native/A2A cross-authority summary chaining during compaction

Chat A2A image retention

  • Adds extract_historical_image_parts() to rehydrate prior-turn images into current A2A requests
  • Integrates into adapter_server event source for multi-turn image continuity
  • Design doc: chat-a2a-image-rehydrate-design.md (superseded by simpler as-built approach)

Model steering and runtime validation

  • Verifies both Agent Settings and Chat Settings model selection paths end-to-end
  • Adds runtime diagnostics/helpers to confirm the selected model reaches the Copilot backend
  • Refreshes the E2E plan and related implementation notes for the steering workflow

Sandbox lifecycle hardening

  • 6-phase orphan cleanup pipeline: soft-delete expired sessions, orphan kill, stale pause, zombie removal, volume cleanup, timeout enforcement
  • Per-sandbox DB isolation (R2), conditional state marking (R1), persistent timeout_at column (R6)
  • Alembic migration for sandbox_timeout_at and FK constraints
  • Design docs: sandbox-lifecycle-assessment.md, sandbox-accumulation-root-cause-analysis.md

Copilot and sandbox hardening

  • Drains trailing post-turn session errors instead of silently swallowing them
  • Improves fallback behavior for connection, timeout, and rate-limit failures
  • Tightens sandbox port discovery/configuration and local stack control flows

Storage proxy fix

  • Adds Content-Length header to proxy_download() — previously used transfer-encoding: chunked which broke PDF/media rendering in some clients

Frontend and input UX

  • Keeps the multiline composer tail visible while typing or pasting on mobile
  • Tightens settings state typing and related model-selection handling

Test coverage and docs

  • 83 multimodal unit tests, 54 adapter server tests, 60 turn loop tests
  • Expands targeted unit and E2E coverage for the A2A turn loop, council, Copilot, and steering paths
  • Updates the supporting review, design, implementation, and test documentation

Verified test totals

  • Unit: 5758 passed, 0 failed, 0 errors, 0 skipped
  • E2E: 39 passed, 0 failed, 0 errors, 0 skipped

Diff stats

  • 92 files changed, +6186 / −758 lines
  • 14 new files (tests, migrations, design docs)

Update 2026-04-19 (commit 52f2682)

Follow-on polish on top of the above:

  • Long-horizon adapter timeouts: AgentSettings.a2a_adapter_timeout_long_horizon (default 3600s) and a2a_adapter_long_horizon_agent_kinds (default {deep_research}) override the standard adapter timeout for long-running agent kinds. agent_kind is now plumbed IIAgent → sandbox metadata → DockerSandbox._a2a_adapter_env with AgentType enum validation via a new _agent_kind_from_name helper so unknown/tool-owned agent names never trigger the override.
  • Opus 4.7 adaptive thinking: Anthropic rejects manual thinking={type:enabled,budget_tokens:...} blocks on Opus 4.7 with HTTP 400. We now detect Opus 4.7+ via _is_opus_4_7_or_later and drop the manual thinking block, letting the model manage thinking adaptively.
  • Compaction lock leak fix: In inner_loop.py the _lock.acquire() + yield CompactionAuthorityEvent(...) pair moved inside the try block. Previously a consumer aclose() could raise GeneratorExit between acquire and yield and bypass the release path.
  • stack_control.sh: new verify subcommand + sha256 manifest for build artefact attestation.
  • CODEMAPS: refreshed architecture.md and dependencies.md.
  • Tests: 24 parametrized tests for _agent_kind_from_name; TestA2AAdapterEnv extended with long-horizon override cases; adapter/orphan/R4 test tweaks.
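
The compaction lock fix listed above comes down to where the acquire sits relative to the try block. A minimal sketch of the pattern, assuming an asyncio.Lock and a simplified async generator; guarded_events and the string events are illustrative stand-ins, not the code in inner_loop.py:

```python
import asyncio
from typing import AsyncIterator


async def guarded_events(lock: asyncio.Lock) -> AsyncIterator[str]:
    """Acquire inside the try so every exit path reaches the finally."""
    acquired = False
    try:
        await lock.acquire()
        acquired = True
        # A consumer may call aclose() while we are suspended at this yield;
        # GeneratorExit is then raised here and the finally still runs.
        yield "compaction_authority"
        yield "turn_output"
    finally:
        if acquired:
            lock.release()


async def main() -> bool:
    lock = asyncio.Lock()
    gen = guarded_events(lock)
    await gen.__anext__()      # generator is suspended at the first yield
    assert lock.locked()       # lock is held at this point
    await gen.aclose()         # early close by the consumer
    return lock.locked()       # report whether the lock leaked


leaked = asyncio.run(main())
```

The acquired flag is an addition for the sketch: it keeps the finally from releasing a lock that was never taken if acquire() itself is cancelled.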

Diff vs prior PR tip: 15 files changed, +770 / −287.
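
The long-horizon override in this update can be pictured as a validated lookup followed by a timeout choice. This is a rough sketch only: the AgentType members other than deep_research, the function names, and the way the standard timeout is passed in are assumptions for illustration, not the helpers shipped in the PR.

```python
from enum import Enum
from typing import Optional


class AgentType(str, Enum):
    # Hypothetical members; only deep_research is named in this PR.
    GENERAL = "general"
    DEEP_RESEARCH = "deep_research"


LONG_HORIZON_TIMEOUT_S = 3600  # default stated in the PR description
LONG_HORIZON_KINDS = frozenset({AgentType.DEEP_RESEARCH})


def agent_kind_from_name(name: Optional[str]) -> Optional[AgentType]:
    """Return the matching AgentType, or None for unknown or tool-owned names."""
    if not name:
        return None
    try:
        return AgentType(name)
    except ValueError:
        return None


def adapter_timeout(name: Optional[str], standard_timeout_s: int) -> int:
    """Use the long-horizon timeout only for validated long-running kinds."""
    kind = agent_kind_from_name(name)
    if kind in LONG_HORIZON_KINDS:
        return LONG_HORIZON_TIMEOUT_S
    return standard_timeout_s
```

Because unknown names resolve to None, a tool-owned agent name falls through to the standard timeout instead of triggering the override.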


Update 2026-04-24 (commit 8a360bb)

Sandbox prewarm pool, host monitoring, and platform-health hardening on top of the lifecycle work above.

Pool & lifecycle

  • SandboxPoolManager (new agents/sandboxes/pool.py + migration 20260422_000006_sandbox_prewarm_pool.py): 2 standby slots with claim / replenish / retire state machine, retire-on-age, dedupe, slot validation, and reap_stuck_initializing() for crash recovery.
  • Cleanup-loop reap fix: reap_stuck_initializing was only invoked from bootstrap() and ensure_full(), and ensure_full() short-circuits on host_state ≥ WARN, so under any sustained host pressure stuck INITIALIZING rows accumulated indefinitely. It is now wired unconditionally into orphan_cleanup (a pure DB UPDATE, safe regardless of host state). POOL-04 reap latency dropped from the 180s timeout to ~34s.
  • QueuePool self-deadlock fix: service.init_sandbox commits the caller's TX before calling set_timeout (which opens its own session); docker._persist_deadline wrapped in asyncio.wait_for(timeout=10s); idle_in_transaction_session_timeout=60000 set in core/db/base.
  • R1/R2/R6 hardening continued: only mark sandbox DELETED after Docker container removal is confirmed; per-sandbox DB session so one failure doesn't roll back the batch; persistent timeout_at enforced as fallback by the cleanup loop.
  • Distributed cleanup lock: Redis advisory lock sandbox:cleanup:lock (5-min TTL, SET NX EX) so only one backend instance runs cleanup at a time in multi-worker deployments.
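
A minimal sketch of the SET NX EX locking idea from the last bullet, assuming a client whose set() accepts nx= and ex= keywords the way redis-py does. The helper name, the random token, and the in-memory FakeRedis (which ignores expiry) are illustrative assumptions so the sketch runs without a Redis server:

```python
import uuid
from typing import Dict, Optional, Protocol


class SupportsSetNx(Protocol):
    def set(self, name: str, value: str, *, nx: bool = False,
            ex: Optional[int] = None) -> Optional[bool]: ...


LOCK_KEY = "sandbox:cleanup:lock"
LOCK_TTL_S = 300  # 5-minute TTL, as stated in the PR description


def try_acquire_cleanup_lock(client: SupportsSetNx) -> Optional[str]:
    """Return a token if this worker won the lock, else None."""
    token = uuid.uuid4().hex
    # SET key value NX EX ttl: succeeds only if the key does not exist yet.
    if client.set(LOCK_KEY, token, nx=True, ex=LOCK_TTL_S):
        return token
    return None


class FakeRedis:
    """In-memory stand-in for the sketch; does not implement expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, name: str, value: str, *, nx: bool = False,
            ex: Optional[int] = None) -> Optional[bool]:
        if nx and name in self._data:
            return None
        self._data[name] = value
        return True


fake = FakeRedis()
first = try_acquire_cleanup_lock(fake)   # first worker wins
second = try_acquire_cleanup_lock(fake)  # second worker is turned away
```

The TTL is what bounds the damage if the winning worker crashes mid-cleanup: the key expires and another instance can take over on a later cycle.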

Host monitor & circuit breaker

  • host_monitor.py: parsers for /proc/buddyinfo, pagetypeinfo, vmstat, meminfo + percentile-baseline evaluator (BOOTSTRAP / OK / WATCH / WARN / CRIT) backed by a 48h ring buffer.
  • breaker.py: circuit breaker around Docker SDK calls.
  • executor.py: bounded thread pool for Docker SDK calls (prevents thread-pool exhaustion under load).
  • New endpoints: /health/host and /health/sandbox-pool.
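
For readers unfamiliar with the pattern behind breaker.py, here is a generic circuit-breaker sketch. The class name, thresholds, and half-open behaviour are assumptions chosen for illustration and are not taken from the PR's implementation:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self._threshold - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

Wrapping Docker SDK calls this way means a wedged daemon produces fast, explicit rejections instead of a pile of blocked threads.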

Platform health & ops

  • scripts/stack_control.sh: status --json, status --all, plus modular scripts/local/lib/platform_checks_*.sh (common / wsl / ubuntu / backend / pool).
  • scripts/99-ii-agent.conf: /etc/sysctl.d drop-in for WSL2 host tuning.
  • Container hardening: read_only=True + tmpfs (/tmp, /var/tmp, /run, /home/user), cap_drop=ALL, selective cap_add, no-new-privileges, mem_limit=3GB, pids_limit=512. Docker socket auto-detection probes /var/run/docker.sock, Colima, OrbStack, Podman.
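
The hardening flags in the last bullet map onto keyword arguments of the Docker SDK for Python. The dictionary below is a sketch of that mapping; the empty tmpfs option strings are an assumption (the PR does not state sizes or modes), and cap_add is left out because the PR does not list which capabilities are re-added:

```python
# Keyword arguments in the shape accepted by docker-py's containers.run().
HARDENED_RUN_KWARGS = {
    "read_only": True,                       # read-only root filesystem
    "tmpfs": {                               # writable scratch mounts
        "/tmp": "",
        "/var/tmp": "",
        "/run": "",
        "/home/user": "",
    },
    "cap_drop": ["ALL"],                     # drop every Linux capability
    "security_opt": ["no-new-privileges"],   # block privilege escalation
    "mem_limit": "3g",                       # 3 GB memory ceiling
    "pids_limit": 512,                       # cap on processes per container
}

# Hypothetical usage, assuming the docker package and a reachable daemon:
#   import docker
#   docker.from_env().containers.run(image, detach=True, **HARDENED_RUN_KWARGS)
```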

Tests

  • 4 new POOL e2e + 2 HOST e2e tests in scripts/local/test_e2e.py.
  • SBOX-06 fix: replaced removed AppConfig import with Settings; hardened parser with regex against log interleaving.
  • New unit suites: test_sandbox_pool, test_sandbox_breaker, test_sandbox_create_semaphore, test_host_monitor (+ integration), test_health_host_endpoint, test_health_sandbox_pool_endpoint, plus a pool e2e suite under src/tests/e2e/.
  • E2E result: 46/46 PASS in 921.7s (zero FAIL/ERROR/SKIP) on a clean sweep after the fixes landed.

Docs

  • Design: sandbox-prewarm-pool.md, sandbox-pool-claim-self-deadlock.md, sandbox-shared-bridge-network.md, stack-control-platform-health.md, a2a-copilot-vision-support-briefing.md; refresh of sandbox-lifecycle-assessment.md.
  • Runtime: docker-wsl2-recovery.md, host-resource-monitoring.md, wsl2-host-configuration.md, sandbox-networking-design.md, post-reboot-followups.md.
  • Impl tracker: sandbox-robustness-impl-tracker.md.

Diff vs prior PR tip: 73 files changed, +12337 / −390.


Update 2026-04-25 (commit 590988f)

Sandbox file-ownership correctness fix and authoritative spec.

Skill deployment under /workspace no longer escalates to root

  • agents/skills/storage.py::copy_skill_to_sandbox: dropped user="root" from mkdir/unzip/chmod; removed the now-redundant chown -R user:user; switched zip cleanup to rm -f for retry safety.
  • Root-owned files under /workspace were breaking subsequent user-mode cleanup with Permission denied on retry. /workspace is owned by user:user 755 (uid 1001) so the root escalation was never necessary.
  • Deduplicated: removed the stale copy of copy_skill_to_sandbox / resolve_storage_uri / create_skill_zip_from_dir from settings/skills/storage.py; that module now owns only the GCS half. The agents/skills/storage.py copy is canonical.

Sandbox base API contract

  • agents/sandboxes/base.py: documented the user= parameter contract — Docker honours it via exec_run, E2B forwards best-effort, and it is not a security boundary.
  • agents/sandboxes/e2b.py: forward user= through to the E2B SDK when set.

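Authoritative filesystem spec

  • docs/design-docs/sandbox-filesystem-design.md: new authoritative spec for /workspace ownership, write paths (put_archive cannot target /tmp on read_only=True containers per moby/moby#42333), and skill deployment.
  • AGENTS.md / CLAUDE.md: link the new spec and add the three governing rules (workspace-only uploads, never user=root under /workspace, root reserved for system commands).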

Drive-bys (unrelated, low-risk docs)

  • docs/design-docs/session-lifecycle-and-data-custody.md — proposal v3.1, ready for core-design review.
  • docs/runtime-docs/crossnote-pdf-export-tmpdir.md — runtime note for the WSL/Ubuntu snap-Chromium ERR_FILE_NOT_FOUND issue with MPE PDF export.

Diff vs prior PR tip: 9 files changed, +1236 / −300.


Update 2026-04-27 (commit 94fb301)

Session-lifecycle purge subsystem — design doc + flag-gated implementation. Driver SHIPS DARK; do not flip the kill switch without core-team sign-off (see §0.0 of the design doc).

Why this is in the PR

The local stack revealed 1,970 of 2,033 sessions rows soft-deleted (97 %), oldest from 2026-04-13 — sessions.is_deleted and sessions.delete_after were already on the schema but no purger existed anywhere in the codebase. The closest precedent (_purge_stale_deleted_rows) only swept agent_sandboxes. This commit lands the deferred purger and the design contract that goes with it.

Schema delta — migrations/versions/20260427_000008_session_purge_v34.py

  • sessions.purge_after / purge_attempts / purge_started_at + two partial indexes (is_deleted=true candidate queue, purge_started_at IS NOT NULL claim watchdog).
  • users.is_purging gate column for the user-account purge path (PR-G).
  • New table purge_dead_letter — operator-facing leaked-resource records (provider, resource_kind, resource_id, error_message, resolved_at/by/note). Indexed on (created_at) WHERE resolved_at IS NULL.

Runtime — src/ii_agent/sessions/purge/ (15 modules, ~2 200 LOC, mypy --strict clean)

  • Three-phase driver: claim.py, pii_strip.py + commit.py glued by session_purge.py as the single arbitration entry point. FOR UPDATE row-lock spans claim through commit; phase (b) is lock-free across I/O.
  • Provider-cleanup hook registry: providers.py exposes register_cleanup_hook. Registry is empty. Phase (b) is a no-op until concrete provider DELETEs (E2B sandboxes, OpenAI vector stores, GCS slide assets, Composio profiles, Stripe customers) are wired.
  • Storage reaper: storage_reaper.py reaps orphan UserAsset blobs (no SessionAsset link, not public, older than SESSIONS_STORAGE_REAPER_MIN_AGE_SECONDS).
  • ORM rails: orm_guards.py::register_purge_guards() defines a before_insert listener that enforces I3 (is_purging gate) at the ORM layer. Exported but not yet wired into app/lifespan.py (listed in the §0.0 pre-flip checklist).
  • Cleanup-loop integration: cleanup_loop_stage_purge_sessions() and cleanup_loop_stage_storage_reaper() slot between _pause_stale_sandboxes and _cleanup_docker_zombies in agents/sandboxes/orphan_cleanup.py.
  • Configuration: core/config/sessions.py::SessionsSettings (env prefix SESSIONS_). Defaults:
    • purge_enabled=False ✋ ships dark
    • storage_reaper_enabled=False ✋ ships dark
    • provider_cleanup_enabled=True
    • purge_grace_period_seconds=2_592_000 (30 d), ephemeral_purge_grace_period_seconds=3_600
    • purge_max_seconds_per_loop=30, purge_max_attempts=5
    • purge_claim_timeout_seconds=600, heartbeat_interval_seconds=120
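
To make the defaults above concrete, here is a stdlib-only stand-in that mirrors them and reads SESSIONS_-prefixed overrides. It is a sketch: the class and loader names are hypothetical, and the real SessionsSettings in core/config/sessions.py may be built on a different settings library.

```python
import os
from dataclasses import dataclass, fields
from typing import Any, Dict, Mapping

ENV_PREFIX = "SESSIONS_"


@dataclass(frozen=True)
class SessionsSettingsSketch:
    purge_enabled: bool = False            # ships dark
    storage_reaper_enabled: bool = False   # ships dark
    provider_cleanup_enabled: bool = True
    purge_grace_period_seconds: int = 2_592_000   # 30 days
    ephemeral_purge_grace_period_seconds: int = 3_600
    purge_max_seconds_per_loop: int = 30
    purge_max_attempts: int = 5
    purge_claim_timeout_seconds: int = 600
    heartbeat_interval_seconds: int = 120


def load_sessions_settings(
    env: Mapping[str, str] = os.environ,
) -> SessionsSettingsSketch:
    """Build settings from defaults plus any SESSIONS_* overrides in env."""
    overrides: Dict[str, Any] = {}
    for f in fields(SessionsSettingsSketch):
        raw = env.get(ENV_PREFIX + f.name.upper())
        if raw is None:
            continue
        if isinstance(f.default, bool):  # check bool before int
            overrides[f.name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        else:
            overrides[f.name] = int(raw)
    return SessionsSettingsSketch(**overrides)
```

With no SESSIONS_PURGE_ENABLED in the environment, the loader returns purge_enabled=False, which is the dormant state the PR describes.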

Tests — src/tests/unit/sessions/purge/ (22 passed, 32 skipped)

  • test_purge_contracts.py — 22 contract tests passing today (types, exceptions, invariant identity, SARRequest validators). 32 PR-E/F/G behavioural skips (claim arbitration, dead-letter retention, ALREADY_PURGED idempotency, phase-(c) re-check, SAR intake, restore-during-SAR, etc.) tracked for follow-up.
  • test_doc_stub_parity.py — every public symbol in purge/__init__.py::__all__ is referenced by name in the design doc; doc-named symbols exist in the package.

Bug fixes vs initial draft

| ID | Fix | Where |
|----|-----|-------|
| B1 | Vanished-row case returns PurgeOutcome.ALREADY_PURGED (I19), not SKIPPED_RESTORED. Specific-id invocations precheck application_events for the canonical event type | commit.py, session_purge.py |
| B2 | Single canonical _AUDIT_EVENT_TYPE = "session.purge_committed"; replaces a legacy mapping that emitted session.purged_by_user / session.purged_by_grace and broke the documented contract | commit.py |
| B3 | ExhaustedRetriesError(message, *, dead_letter_count=0) carries the count; providers.py populates it; session_purge.py propagates it into PurgeResult so logs/metrics reflect reality | exceptions.py, providers.py, session_purge.py |
| B4 | sessions/__init__.py now imports purge.db_models so PurgeDeadLetter registers with Base.metadata at import time; it was missing, which would have made the table invisible to autogenerate | sessions/__init__.py |
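
As an illustration of B3, the sketch below shows an exception carrying dead_letter_count and a caller copying it into a result object. Only the ExhaustedRetriesError signature comes from this PR; the Exception base class, PurgeResultSketch, the outcome strings, and both helper functions are assumptions.

```python
from dataclasses import dataclass
from typing import List


class ExhaustedRetriesError(Exception):
    def __init__(self, message: str, *, dead_letter_count: int = 0) -> None:
        super().__init__(message)
        self.dead_letter_count = dead_letter_count


@dataclass
class PurgeResultSketch:
    outcome: str
    dead_letter_count: int = 0


def run_provider_cleanup_sketch(failed_resources: List[str]) -> None:
    """Pretend provider cleanup: raise if any resource could not be deleted."""
    if failed_resources:
        raise ExhaustedRetriesError(
            "provider cleanup exhausted retries",
            dead_letter_count=len(failed_resources),
        )


def purge_sketch(failed_resources: List[str]) -> PurgeResultSketch:
    try:
        run_provider_cleanup_sketch(failed_resources)
    except ExhaustedRetriesError as exc:
        # Propagate the count so logs and metrics see the real number.
        return PurgeResultSketch("exhausted_retries", exc.dead_letter_count)
    return PurgeResultSketch("committed")
```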

Design doc — docs/design-docs/session-lifecycle-and-data-custody.md (v3.11)

  • New §0.0 Rollout gate: review-request matrix, 10-item pre-flip checklist (review approval, §8 decisions, PR-C FKs, ≥1 real cleanup hook, register_purge_guards wired, PR-E behavioural tests unblocked, canary cycle, PITR drill, observability, backup retention), reversibility envelope, three-signature sign-off block per environment with named rollback owner.
  • §0 banner amended: "The flag MUST NOT be flipped until §0.0 has been signed off by the core team. Wiring complete ≠ approved-to-ship."
  • §5 step 6 cross-references §0.0 as the irreversible boundary; steps 1–5 (schema, indexes, FKs, dead-letter table, backfill) remain zero-risk and may proceed independently.

Verified runtime evidence (live local DB after rebuild)

  • Migration applied: alembic_version = 20260427_000008
  • Schema landed: all new columns + purge_dead_letter table present ✅
  • Code wired: cleanup_loop_stage_purge_sessions() reachable from the orphan-cleanup loop ✅
  • Flag default observed: purge_enabled=False; no SESSIONS_PURGE_ENABLED in docker/.stack.env.local
  • Audit/dead-letter activity: application_events.event_type='session.purge_committed' count = 0; purge_dead_letter count = 0 ✅ (driver dormant by design)
  • Backlog still in place: 1 970 soft-deleted sessions awaiting the gated flip — exactly as specified

Quality gates

  • mypy --strict src/ii_agent/sessions/purge/ src/ii_agent/core/config/sessions.py → Success: no issues found in 15 source files
  • ruff check + ruff format --check on all touched paths: clean
  • pytest src/tests/unit/sessions/purge/ -q → 22 passed, 32 skipped (PR-E/F/G behavioural)

What still needs to happen before the flag flips

Recorded in the §0.0 pre-flip checklist; key items:

  1. Core-team review of design doc + purge/ package
  2. PR-C FK constraints (otherwise §3.1 CASCADE rationale is unenforced)
  3. ≥1 real register_cleanup_hook so phase (b) is not a permanent no-op
  4. register_purge_guards() wired into app/lifespan.py
  5. PR-E behavioural tests unblocked against real DB fixtures
  6. Canary cycle on a non-prod env with measurable purge_committed audit row delta
  7. PITR drill rehearsed; backup retention ≥ 37 days

Diff vs prior PR tip: 26 files changed, +868 / −182 (modified) + 19 new files.


Latest commit — 9ba1240 sessions/purge: harden invariant subsystem with three-tier enforcement

Promotes the runtime purge invariants from prose into a self-validating contract.

Schema (migration 20260429_000011_invariant_hardening):

  • CHECK constraints for I1 / I1b (state machine)
  • partial UNIQUE index for I19 (provider, resource_id)
  • BEFORE DELETE trigger on users (I14)
  • discriminator columns: users.is_purging_set_at, application_events.stripped_at
  • partial covering indexes for invariant probes

Code:

  • invariants.py: rewritten into three disjoint tiers — SCHEMA_ENFORCED (4) / DB_CHECKABLE (9) / STRUCTURAL_TEST_ENFORCED (6); fixes I2 docstring and I17 catalogue entry
  • reconcile_providers.py (new): I9 OpenAI Files audit job — correct column names, idempotent insert via WHERE NOT EXISTS, scoped to provider = 'openai'
  • check_runner.py: assert_cleanup_uses_primary_db sentinel for I17 (replica-engine guard)
  • app/lifespan.py: I17 startup gate wired in at step 4a-bis
  • pii_strip.py / user_purge.py: write discriminator timestamps
  • workers/cron/tasks.py: daily run_purge_invariants_check APScheduler job
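
The "idempotent insert via WHERE NOT EXISTS" mentioned for reconcile_providers.py can be demonstrated in isolation. This sketch uses an in-memory SQLite database as a stand-in for the project's database, with a reduced purge_dead_letter table; the resource kind and message values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE purge_dead_letter (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        resource_kind TEXT NOT NULL,
        resource_id TEXT NOT NULL,
        error_message TEXT
    )
    """
)

# Insert a row only if no row for the same (provider, resource_id) exists.
INSERT_IF_ABSENT = """
INSERT INTO purge_dead_letter (provider, resource_kind, resource_id, error_message)
SELECT :provider, :kind, :rid, :msg
WHERE NOT EXISTS (
    SELECT 1 FROM purge_dead_letter
    WHERE provider = :provider AND resource_id = :rid
)
"""


def record_leak(resource_id: str) -> None:
    conn.execute(
        INSERT_IF_ABSENT,
        {
            "provider": "openai",
            "kind": "file",
            "rid": resource_id,
            "msg": "orphan found by audit",
        },
    )


record_leak("file-123")
record_leak("file-123")  # second run of the audit job is a no-op
row_count = conn.execute("SELECT COUNT(*) FROM purge_dead_letter").fetchone()[0]
```

Running the audit job twice leaves a single row, which is what makes a daily scheduled run safe to repeat.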

Tests:

  • test_purge_structural_invariants.py (new): Tier 3 pinning + strong-form parity test that resolves cited test artefacts (catches catalogue drift mechanically)
  • test_reconcile_providers.py (new): 5 unit tests pinning the audit-job correctness fixes
  • test_doc_stub_parity.py: extended to validate the tier union

Docs:

  • session-lifecycle-and-data-custody.md §2.3: rewritten with Tier column, kept in sync with invariants.py catalogue

Convergence: Three review passes complete (4 → 3 → 0 defects, strictly-decreasing severity: runtime → catalogue-drift → none). 368 sessions+realtime+app unit tests pass; lint clean; migration auto-applied on backend startup verified locally against the running stack.


Latest commit — fa26339 local-dev usability + two silent-failure agent fixes

Local multi-user dev login

  • core/config/settings.py: new DevUserConfig + Settings.dev_users (JSON env DEV_USERS); validates username charset and PIN length.
  • auth/router.py: GET /auth/dev/users chooser endpoint; POST /auth/dev/login now takes {username, pin}, looks up the named user, validates PIN with constant-time compare, per-username rate limit + sleep-throttle on failures, generic error either way. Each named user maps to dev+<username>@localhost for full session/credit isolation between household members.
  • frontend/login.tsx: dropdown chooser + PIN input replaces the single "Dev login" button; only rendered when /auth/dev/users reports enabled: true.
  • frontend/utils.ts: getFirstCharacters made resilient to punctuation / unicode / empty tokens so avatar initials don't break for dev display names.
  • docker/.stack.env.local.example: DEV_USERS placeholder + documentation.
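
A minimal sketch of the constant-time PIN check and the generic failure path described above, using the standard library's hmac.compare_digest. The user table, the dummy PIN, and the function names are hypothetical; rate limiting and sleep-throttling are left out.

```python
import hmac
from typing import Dict

# Hypothetical dev users; the real list comes from the DEV_USERS JSON env var.
DEV_USERS: Dict[str, str] = {"alice": "4821", "bob": "7730"}
_DUMMY_PIN = "0000"  # compared against when the username is unknown


def check_dev_login(username: str, pin: str) -> bool:
    """Return True only for a known user with the right PIN."""
    expected = DEV_USERS.get(username)
    # Always run one comparison so unknown users take a similar code path.
    candidate = expected if expected is not None else _DUMMY_PIN
    pin_ok = hmac.compare_digest(pin.encode("utf-8"), candidate.encode("utf-8"))
    return expected is not None and pin_ok


def dev_email(username: str) -> str:
    """Per-user identity used for session and credit isolation."""
    return f"dev+{username}@localhost"
```

Returning a plain boolean for both "unknown user" and "wrong PIN" is what lets the endpoint respond with the same generic error either way.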

Agent runtime resilience

  • agents/models/anthropic/claude.py: stop synthesising redacted_thinking from plaintext reasoning_content — Anthropic validates redacted_thinking.data as an opaque ciphertext they issued, so plaintext triggers a non-retriable 400 (Invalid data in redacted_thinking block) that permanently bricks replay of the conversation. The block is now dropped with a WARNING; regular text/tool_use content of the assistant message survives. (Triage: session 9785de09, 2026-05-11.)
  • agents/inner_loop.py: detect an empty A2A turn (ASSISTANT_TURN_START → ASSISTANT_TURN_END with no content, reasoning, tool call, or session.error, e.g. Copilot CLI when quota-exhausted) and raise ModelProviderError instead of completing the run with an empty response. The outer fallback path can now retry on the native loop or surface the failure in the run status.
  • Regression tests pinned for both fixes (test_inner_loop.py, test_v1_models_anthropic_claude.py).
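
The empty-turn check can be sketched as a small scan over a turn's events. The Event shape, the lowercase kind strings for substantive events, and the local ModelProviderError class are assumptions for illustration; only the turn marker names and the exception name come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable


class ModelProviderError(RuntimeError):
    """Stand-in for the project's exception of the same name."""


@dataclass(frozen=True)
class Event:
    kind: str
    payload: str = ""


# Event kinds that count as the turn having produced something.
SUBSTANTIVE_KINDS = frozenset({"content", "reasoning", "tool_call", "session.error"})


def ensure_turn_not_empty(events: Iterable[Event]) -> None:
    """Raise if a turn starts and ends without any substantive event."""
    in_turn = False
    saw_substance = False
    for event in events:
        if event.kind == "ASSISTANT_TURN_START":
            in_turn, saw_substance = True, False
        elif event.kind == "ASSISTANT_TURN_END":
            if in_turn and not saw_substance:
                raise ModelProviderError("A2A turn ended with no content")
            in_turn = False
        elif in_turn and event.kind in SUBSTANTIVE_KINDS:
            saw_substance = True
```

Raising here, instead of returning an empty response, is what gives the outer fallback path something to react to.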

Housekeeping

  • scripts/stack_control.sh: suppress spurious [timed out] annotation on AVAILABLE pool slots whose lifetime is governed by retire_at, not timeout_at (the R6 reaper explicitly excludes them).
  • .github/copilot-instructions.md: drop stale --local hint; document stack_control.sh verify for "which containers need rebuild" introspection.

Lint clean (ruff check + ruff format --check) on all changed Python files.

mdear added 3 commits April 13, 2026 15:22
- Docker Compose local stack with PostgreSQL, Redis, MinIO, sandbox
- Local sandbox entrypoint, VNC, browser automation services
- Stack control scripts (stack_control.sh, local/*)
- Backend Dockerfile + entrypoint for local development
- Configuration: .stack.env.local, settings.yaml, model_configs
- SQLAlchemy model fixes (UUID consistency, TimestampColumn)
- Agent tool/runtime improvements (reasoning_content, field renames)
- Credit billing_enabled toggle + usage handler refactor
- E2B sandbox management, VNC URL support
- 246 tests (unit, integration, smoke, E2E)
- Documentation: architecture, getting-started, local-docker-sandbox
- GitHub Copilot instructions and prompt templates
- A2A protocol: adapter server, backends (Copilot, Claude Code, Codex)
- Agent inner loop: strategy pattern, tool bridge, routing
- A2A billing: backend-aware credit calculation, provider-reported strategies
- Circuit breaker, event stream adapter, multimodal support
- Agent factory: inner loop strategy builder, converter
- Health endpoint: A2A mode fields
- CreditUsageHandler: A2A billing strategies
- Config: A2A agent settings (inner_loop_mode, a2a_backend, billing)
- 26 A2A agent tests + 10 billing strategy tests
- 17 A2A design/implementation/runtime docs

mdear commented Apr 13, 2026

I am continuing testing on this branch feature/a2a-chat-inner-loop_3_of_3.

A2A/Copilot flow:
- validate chat and agent model steering end-to-end through the Copilot runtime
- harden adapter/session error handling, council fallback, and post-turn event draining

Frontend and local UX:
- keep multiline composer input visible on mobile and tighten settings state handling
- refine local stack/build helpers and sandbox port configuration for faster iteration

Quality and docs:
- expand unit and E2E coverage, refresh the test plan, and capture implementation notes

mdear commented Apr 16, 2026

This PR is 3 of a series of 3 PRs:

#198: Docker sandbox runtime, local deploy stack, session lifecycle, frontend, test overhaul (389 files)

#199: A2A inner loop strategy, backend registry, billing strategies, adapter server (74 incremental files)

#200: Chat A2A turn loop, council A2A routing, cross-authority compaction (16 incremental files)

Merge order: #198 → #199 → #200

mdear added 10 commits April 17, 2026 14:17
… storage proxy fix

Chat A2A image retention:
- Add extract_historical_image_parts() to rehydrate prior-turn images
- Integrate into adapter_server event source for multi-turn continuity
- 83 multimodal unit tests, 54 adapter server tests, 60 turn loop tests

Sandbox lifecycle hardening:
- 6-phase orphan cleanup pipeline (soft-delete, orphan kill, stale pause,
  zombie removal, volume cleanup, timeout enforcement)
- Per-sandbox DB isolation (R2), conditional state marking (R1),
  persistent timeout_at column (R6)
- Alembic migration for sandbox timeout_at and FK constraints
- Design docs: lifecycle assessment, accumulation root-cause analysis

Storage proxy fix:
- Add Content-Length header to proxy_download (was chunked-only)

Frontend polish:
- Mobile composer scroll-into-view, model tag theming, settings typing

Test results:
- Unit: 5758 passed, 0 failed, 0 errors, 0 skipped
- E2E:    39 passed, 0 failed, 0 errors, 0 skipped
- Add Chat A2A adapter sidecar topology (sandbox-independent)
- Add claude-opus-4-7 as system model (pricing, context, frontend)
- Add A2A backend-specific timeout configuration
- Add A2AAdapterUnavailableError (HTTP 503) exception
- Harden sandbox orphan cleanup (R4 zombie, volume, timeout)
- Enrich /health endpoint with A2A inner-loop diagnostics
- Improve stack_control.sh build vs rebuild help clarity
- Add startup validation for chat A2A strict mode
- Add retry classification, thinking-temperature, inner-loop parity tests
- Add design docs: sidecar deployment, URL resolution, billing
… compaction lock fix

- Plumb agent_kind through IIAgent -> sandbox metadata -> DockerSandbox._a2a_adapter_env
  with AgentType enum validation (new _agent_kind_from_name helper)
- Add AgentSettings.a2a_adapter_timeout_long_horizon (3600s) and
  a2a_adapter_long_horizon_agent_kinds ({deep_research}) overrides
- Opus 4.7 adaptive thinking: drop manual thinking block (Anthropic rejects it
  with HTTP 400 on Opus 4.7); detect via _is_opus_4_7_or_later
- Fix compaction lock leak in inner_loop: acquire + yield moved inside try so
  consumer aclose() cannot bypass release
- stack_control.sh: add verify subcommand + sha256 manifest
- CODEMAPS: refresh architecture.md and dependencies.md
- Tests: 24 parametrized tests for _agent_kind_from_name, extend
  TestA2AAdapterEnv with long-horizon override cases, adapter/orphan/R4 tweaks
- Reorder factory priority so deferred-sandbox path always wins over
  static a2a_agent_url for agent sessions (prevents deep_research from
  hitting the 900s sidecar instead of its 3600s per-sandbox adapter)
- Add regression test for factory inner loop priority
- Add unit tests for _is_opus_4_7_or_later model-id detection
…er_loop_mode

- Per-sandbox A2A adapter now starts only when inner_loop_mode=a2a;
  native-mode sandboxes save 1 host port + adapter process resources
- start-services.sh requires SANDBOX_ADAPTER_ENABLED=true and an explicit
  SANDBOX_ADAPTER_BACKEND (no more 'simulate' fallback in production)
- Chat A2A sidecar hardened to adapter-only: entrypoint:[], read_only,
  minimal tmpfs — no Xvfb/VNC/code-server/MCP overhead
- DockerSandbox.create() skips _a2a_adapter_env() entirely in native mode
  so backend auth tokens (GITHUB_TOKEN, ANTHROPIC_API_KEY, OPENAI_API_KEY)
  do not leak into native sandbox containers
- DEFAULT_EXPOSED_PORTS is now the honest base set (6 ports); adapter
  port is added conditionally
- Add 4 new unit tests: native gating, a2a port+env+token forwarding,
  port-count requirement difference, native-mode token-leak prevention
…alth

Pool & lifecycle
- Add SandboxPoolManager with 2 standby slots, claim/replenish/retire
  state machine, retire-on-age, dedupe, slot validation, and
  reap_stuck_initializing for crash recovery (pool.py + migration
  20260422_000006_sandbox_prewarm_pool).
- Wire reap_stuck_initializing into the orphan_cleanup loop
  unconditionally (pure DB UPDATE, must run even when host monitor is
  WARN/CRIT — ensure_full skips on WARN, so stuck rows accumulated
  indefinitely otherwise).
- Fix QueuePool self-deadlock between caller's open TX and
  set_timeout's separate session: commit before set_timeout in
  service.init_sandbox; wrap docker._persist_deadline in
  asyncio.wait_for(timeout=10s); set
  idle_in_transaction_session_timeout=60000 in core/db/base.
- R1/R2 hardening in orphan_cleanup: only mark sandbox DELETED after
  Docker container removal is confirmed; per-sandbox DB session so one
  failure doesn't roll back the batch.
- R6 persistent timeout via AgentSandbox.timeout_at; cleanup loop
  enforces deadline as fallback.
- Distributed Redis advisory lock (sandbox:cleanup:lock) for cleanup
  loop in multi-worker deployments.

Host monitor & circuit breaker
- New host_monitor.py: /proc/buddyinfo, pagetypeinfo, vmstat, meminfo
  parsers + percentile-baseline evaluator (BOOTSTRAP/OK/WATCH/WARN/
  CRIT) backed by 48h ring buffer.
- New breaker.py: Docker-call circuit breaker.
- New executor.py: bounded thread pool for Docker SDK calls.
- New /health/host and /health/sandbox-pool endpoints.

Platform health & ops
- scripts/stack_control.sh: add status --json, status --all, and
  modular platform_checks_*.sh (common/wsl/ubuntu/backend/pool).
- /etc/sysctl.d drop-in (scripts/99-ii-agent.conf) for WSL2 host
  tuning.
- Docker container hardening: read_only=True + tmpfs, cap_drop=ALL,
  no-new-privileges, mem_limit=3GB, pids_limit=512.

Tests
- 4 new POOL e2e tests + 2 HOST e2e tests in scripts/local/test_e2e.py.
- Fix SBOX-06: replace removed AppConfig import with Settings, harden
  parser with regex against log interleaving.
- New unit tests: sandbox_pool, sandbox_breaker, sandbox_create_
  semaphore, host_monitor + integration, health_host_endpoint,
  health_sandbox_pool_endpoint, plus pool e2e suite.

Docs
- Design docs: sandbox-prewarm-pool, sandbox-pool-claim-self-deadlock,
  sandbox-shared-bridge-network, stack-control-platform-health,
  sandbox-lifecycle-assessment update, a2a-copilot-vision-support-
  briefing.
- Runtime docs: docker-wsl2-recovery, host-resource-monitoring,
  wsl2-host-configuration, sandbox-networking-design,
  post-reboot-followups.
- Impl tracker: sandbox-robustness-impl-tracker.
…t, and lazy MCP retry

- Add `mcp_configured` flag to AgentSandbox model + migration (20260425_000007)
- Sandbox service: MCP handoff waits for mcp_configured=True before releasing slot
- Pool: expose /health/sandbox-pool endpoint with slot occupancy + MCP readiness
- noVNC URL decoration for register_port tools (sandbox + dev variants)
- New novnc.py helper for URL decoration logic
- MCP factory: lazy retry wrapper (lazy_retry.py) for transient MCP connection failures
- Docker shell framing fixes; docker_shell.py correctness improvements
- orphan_cleanup: per-item DB isolation (R2), conditional state marking (R1)
- Claude model: extended thinking + vision support improvements
- A2A turn loop: Copilot backend fixes; fallback billing event
- health.py: /health/ready endpoint; exception handler improvements
- lifespan: startup validation for A2A chat strict mode
- scripts: stack_control.sh enhancements; test_e2e.py expanded coverage; Windows port-forward script
- docs: sandbox-pool-claim-mcp-handoff audit, postgres recovery mode runbook
- tests: 10 new unit test files covering sandbox, noVNC, MCP handoff, health, storage, middleware
- .gitignore: exclude .e2e_last_results.json; remove tracked copy

Skill copy_skill_to_sandbox previously ran mkdir/unzip/chown/chmod as
user="root". /workspace is owned by user:user 755 (uid 1001), so root
escalation was unnecessary AND harmful: root-owned files broke subsequent
user-mode cleanup with Permission denied on retries.

Code changes:
- agents/skills/storage.py: drop user="root" from mkdir/unzip/chmod;
  remove the now-redundant chown -R; switch zip cleanup to rm -f for
  retry safety. Add inline notes documenting the ownership invariant.
- settings/skills/storage.py: delete the duplicated copy_skill_to_sandbox
  + resolve_storage_uri + create_skill_zip_from_dir helpers; this module
  now owns only the GCS half of the pipeline. The agents/skills/storage
  copy is the canonical implementation.
- agents/sandboxes/base.py: document the user= parameter contract
  (Docker honours via exec_run; E2B best-effort) and explicitly warn
  callers against using it as a security boundary.
- agents/sandboxes/e2b.py: forward user= through to the E2B SDK when set.

Docs:
- docs/design-docs/sandbox-filesystem-design.md: new authoritative spec
  for /workspace ownership, write paths (put_archive cannot target /tmp
  on read_only=True containers per moby/moby#42333), and skill deployment.
- AGENTS.md / CLAUDE.md: link the new spec; add the three governing rules
  (workspace-only uploads, never user=root under /workspace, root reserved
  for system commands).

Unrelated drive-bys captured in the same commit:
- docs/design-docs/session-lifecycle-and-data-custody.md: proposal v3.1
  for review.
- docs/runtime-docs/crossnote-pdf-export-tmpdir.md: WSL/Ubuntu snap
  Chromium ERR_FILE_NOT_FOUND fix for MPE PDF export.

Implements §4.1 (three-phase purge driver) and §4.6 (storage reaper) from
docs/design-docs/session-lifecycle-and-data-custody.md, behind dual feature
flags SESSIONS_PURGE_ENABLED and SESSIONS_STORAGE_REAPER_ENABLED. Both
default to false; the cleanup-loop stage is wired but dormant in production.

Schema (migration 20260427_000008):
  - sessions.purge_after / purge_attempts / purge_started_at + partial idx
  - users.is_purging gate
  - purge_dead_letter table for operator-facing leaked-resource records

Runtime (src/ii_agent/sessions/purge/):
  - claim/pii_strip/commit phases, single arbitration point in session_purge
  - provider hook registry (empty until real DELETEs are wired)
  - storage_reaper for orphan UserAsset blobs
  - orm_guards exporting register_purge_guards (not yet called from lifespan)

Cleanup-loop integration:
  - cleanup_loop_stage_purge_sessions + cleanup_loop_stage_storage_reaper
    slot between _pause_stale_sandboxes and _cleanup_docker_zombies

Tests (src/tests/unit/sessions/purge/):
  - 22 contract tests passing; 32 PR-E/F/G behavioural tests skipped pending
    bodies (mypy --strict + ruff clean across the package)

Doc:
  - §0.0 rollout gate added: review-request matrix, 10-item pre-flip
    checklist, sign-off block. Flag MUST NOT flip without core-team approval
  - §0 PR-E row notes register_purge_guards exported but not yet wired
  - §5 step 6 cross-references §0.0 as the irreversible boundary

Bug fixes vs initial draft:
  B1+B2 commit.py — vanished-row case returns ALREADY_PURGED (I19);
        single canonical _AUDIT_EVENT_TYPE='session.purge_committed'
  B3    ExhaustedRetriesError carries dead_letter_count; session_purge
        propagates it into PurgeResult
  B4    sessions/__init__.py imports purge.db_models so PurgeDeadLetter
        registers with Base.metadata at startup
Implements GDPR Art. 17 SAR purge + grace-window cleanup with flag-gated
three-phase commit (strip → orphan-purge → session DELETE).

Mutation gating (I3/I8 §16):
- NotPurgingDep applied to 12 mutating endpoints across sessions/, pin/,
  and wishlist/ routers — closes the PATCH/fork/legacy-restore hole that
  could race the purge driver.

Invariant runner:
- New check_runner + scripts/local/check_purge_invariants.py exercising
  19 DB-checkable invariants (I1–I19); structural-only checks marked
  SKIP rather than failing.
- Runner now rolls back AsyncSession after per-invariant exceptions so
  one bad query no longer cascades into 7+ ERRORs.
- I11 rewritten from a content-key denylist to the real strip discriminator
  (user_id NULL + orphaned session_id + non-allowlist content key).
  Eliminates ~1,236 false positives.
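
A minimal sketch of the rollback-after-exception behaviour described for the runner, not the actual check_runner code. The `Session` protocol, `run_checks`, and the PASS/FAIL/ERROR/SKIP strings are illustrative assumptions; the only part taken from the notes above is that the session is rolled back after a failing check so later checks start from a clean transaction.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Protocol


class Session(Protocol):
    """Hypothetical stand-in for the async DB session the runner holds."""

    async def rollback(self) -> None: ...


Check = Callable[[Session], Awaitable[bool]]


async def run_checks(session: Session, checks: dict[str, Check | None]) -> dict[str, str]:
    """Run each invariant check; a None entry marks a structural-only check."""
    results: dict[str, str] = {}
    for name, check in checks.items():
        if check is None:
            results[name] = "SKIP"  # structural-only: reported, not failed
            continue
        try:
            results[name] = "PASS" if await check(session) else "FAIL"
        except Exception:
            # Without this rollback, an aborted transaction would make every
            # later check raise as well (the cascade described above).
            await session.rollback()
            results[name] = "ERROR"
    return results
```

With this shape, one bad query yields a single ERROR entry rather than poisoning the rest of the run.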

Audit trail:
- PURGE_COMMITTED_EVENT_TYPE constant centralised in purge/types.py;
  consumed by commit.py and session_purge.py.
- application_events.session_id intentionally retains no FK so audit
  rows survive session DELETE as forensic breadcrumbs (migration
  20260428_000010).

Other:
- Storage reaper, OpenAI provider hooks, ORM guards, canary e2e test,
  PITR-restore runbook, and implementation tracker.
- Design doc drift fixes (§14.4, §16) in session-lifecycle-and-data-custody.

Verification: mypy --strict clean across changed files; runner reports
11 PASS / 0 FAIL / 0 ERROR / 8 structural-skip; 24/24 unit tests pass.

mdear commented Apr 29, 2026

Session-purge hardening — pre-flip blockers cleared

Pushed as f16328f on top of the existing branch.

What landed

| Area | Change |
| --- | --- |
| Mutation gating (I3/I8 §16) | NotPurgingDep applied to 12 mutating endpoints across sessions/router.py, sessions/pin/router.py, sessions/wishlist/router.py; closes the PATCH/fork/legacy-restore race against the purge driver. |
| Invariant runner | New check_runner + scripts/local/check_purge_invariants.py exercising 19 DB-checkable invariants (I1–I19). Runner now calls rollback() after per-invariant exceptions so one bad query no longer cascades into 7+ ERRORs. |
| I11 correctness | Rewritten from a content-key denylist to the real strip discriminator: user_id IS NULL ∧ orphaned session_id ∧ non-allowlist content key. Eliminates ~1,236 false positives previously surfaced against static UI strings. |
| Audit constant | PURGE_COMMITTED_EVENT_TYPE centralised in purge/types.py; consumed by commit.py and session_purge.py (single source of truth). |
| Forensic breadcrumb | application_events.session_id intentionally retains no FK so audit rows survive session DELETE (migration 20260428_000010). |
| Other | Storage reaper, OpenAI provider hooks, ORM guards, canary e2e test, PITR-restore runbook, implementation tracker, design-doc drift fixes (§14.4, §16). |

Verification

  • mypy --strict — clean across all changed files (19 source files).
  • Invariant runner against live DB: 11 PASS / 0 FAIL / 0 ERROR / 8 structural-skip in 0.24s (vs. pre-fix: 1 FAIL → first-attempt regression: 7 ERROR → now: all green).
  • Unit tests: 24 passed, 32 skipped (skipped are pre-existing PR-E/F/G structural stubs), 0 failures.

Still open (tracked, not blockers for flag-default-off)

  • 7 §14.4 structural test files still stubbed (tracker §4.3).
  • Prometheus exporter for invariant pass/fail metrics is aspirational (tracker §4.2/§4.2a).
  • 12 system.error rows in audit trail warrant a hand audit before prod flag-flip.
  • 7 pre-existing mypy errors in sessions/router.py (untyped dict returns, AppKind re-export) — confirmed via stash baseline as not introduced by this change; tracked separately.

mdear added 2 commits April 29, 2026 09:57
Promote runtime purge invariants from prose into a self-validating
contract with mechanical artefact checks.

Schema (migration 20260429_000011):
- CHECK constraints for I1 / I1b (state machine)
- partial UNIQUE index for I19 (provider/resource_id)
- BEFORE DELETE trigger on users (I14)
- discriminator columns: users.is_purging_set_at,
  application_events.stripped_at
- partial covering indexes for invariant probes

Code:
- invariants.py: rewrite into three disjoint tiers — SCHEMA_ENFORCED (4),
  DB_CHECKABLE (9), STRUCTURAL_TEST_ENFORCED (6); fixes I2 docstring and
  I17 catalogue entry
- reconcile_providers.py: I9 OpenAI Files audit job (correct column
  names, idempotent insert via WHERE NOT EXISTS, scoped to provider
  'openai')
- check_runner.py: assert_cleanup_uses_primary_db sentinel for I17
- lifespan.py: I17 startup gate at step 4a-bis
- pii_strip.py / user_purge.py: write discriminator timestamps
- workers/cron/tasks.py: daily run_purge_invariants_check job
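
A small sketch of the "idempotent insert via WHERE NOT EXISTS" pattern mentioned for reconcile_providers.py. The table name `leaked_resource`, its columns, and `record_leak` are hypothetical and chosen only for illustration; the real job targets different tables and column names, and SQLite is used here just so the example runs without a server.

```python
import sqlite3

# Hypothetical table; not the project's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leaked_resource (provider TEXT NOT NULL, resource_id TEXT NOT NULL)"
)

# The SELECT produces one row only when no matching row exists yet,
# so running the statement twice inserts at most once.
INSERT_ONCE = """
INSERT INTO leaked_resource (provider, resource_id)
SELECT ?, ?
WHERE NOT EXISTS (
    SELECT 1 FROM leaked_resource WHERE provider = ? AND resource_id = ?
)
"""


def record_leak(db: sqlite3.Connection, provider: str, resource_id: str) -> None:
    """Record a leaked resource; repeated calls for the same pair are no-ops."""
    db.execute(INSERT_ONCE, (provider, resource_id, provider, resource_id))
    db.commit()


def count_rows(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM leaked_resource").fetchone()[0]
```

Under concurrent writers this guard alone can still race, which is where a unique index on (provider, resource_id), like the partial one listed under Schema, would act as the backstop.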

Tests:
- test_purge_structural_invariants.py: Tier 3 pinning + strong-form
  parity test that resolves cited test artefacts
- test_reconcile_providers.py: 5 unit tests pinning the audit-job fixes
- test_doc_stub_parity.py: validates tier union

Docs:
- session-lifecycle-and-data-custody.md §2.3: rewritten with Tier column,
  in sync with invariants.py catalogue

368 sessions+realtime+app unit tests pass; lint clean; migration
auto-applied on backend startup verified locally.
APScheduler uses CLOCK_MONOTONIC for wake-ups. On hypervisor guests
(WSL2, KVM laptops, etc.) the host can suspend the VM's vCPU, freezing
that clock; when it thaws, every fire scheduled during the gap is
reported 'missed by N min' and silently dropped (default grace = 1s).
This put the new daily lifecycle-invariants probe at risk of skipping
a full day per missed window, and was already dropping the 40-min
cleanup jobs in development.

Detect host class (env override, /proc/version for WSL2, hypervisor
flag in /proc/cpuinfo) and apply a tight grace on bare metal (60s) or a
generous grace on VMs (1h default; 6h for the 24h invariants probe),
with coalesce=True everywhere so a backlog collapses to one catch-up
run. Detection result is logged at scheduler start.

Verified end-to-end on this WSL2 host: backend rebuilt, scheduler
launched 3 jobs as host_class=vm, both 40-min cleanup jobs fired and
completed at 04:05 UTC with zero misfire warnings. 13 unit tests cover
detection edge cases and per-job grace assignment.
mdear commented Apr 30, 2026

Follow-up commit: ef22b43 — host-aware cron misfire tuning

Posting a brief note on this commit since it landed after the main review window and is small but operationally meaningful.

What it does

Tunes APScheduler's misfire_grace_time / coalesce settings on the cron jobs registered in src/ii_agent/workers/cron/tasks.py (the same file where the new daily lifecycle-invariants probe lives) based on detected host class:

  • Bare metal → tight 60 s default grace, 30 min on the daily probe.
  • VM / hypervisor guest (WSL2, KVM laptops, etc.) → 1 h default grace, 6 h on the daily probe; coalesce=True everywhere.

Detection order: IIA_CRON_HOST_CLASS env override → microsoft|wsl in /proc/version → hypervisor flag in /proc/cpuinfo → bare metal.
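
For readers who want the detection order in code form, here is a rough sketch under stated assumptions. It is not the `_detect_host_class()` from tasks.py: the function names, the reason strings, the accepted override values ("bare", "vm"), and the choice to ignore an invalid override are all illustrative guesses. Only the order of the probes and the 60 s / 1 h default grace values come from this comment.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping


def _read(path: str) -> str:
    """Return file contents, or "" when the file is missing or unreadable."""
    try:
        return Path(path).read_text(errors="replace")
    except OSError:
        return ""


def detect_host_class(
    env: Mapping[str, str] | None = None,
    proc_version: str = "/proc/version",
    proc_cpuinfo: str = "/proc/cpuinfo",
) -> tuple[str, str]:
    """Return (host_class, reason), probing in the order listed above."""
    env = os.environ if env is None else env
    override = env.get("IIA_CRON_HOST_CLASS", "").strip().lower()
    if override in ("bare", "vm"):  # assumed set of valid values
        return override, "env override"
    version = _read(proc_version).lower()
    if "microsoft" in version or "wsl" in version:
        return "vm", "WSL2 detected via /proc/version"
    for line in _read(proc_cpuinfo).splitlines():
        # Match the flag as a whole token so substrings do not false-match.
        if line.startswith("flags") and "hypervisor" in line.partition(":")[2].split():
            return "vm", "hypervisor flag in /proc/cpuinfo"
    return "bare", "no VM markers found"


def job_defaults(host_class: str) -> dict[str, object]:
    """Default grace per host class; coalesce is on everywhere."""
    grace = 3600 if host_class == "vm" else 60
    return {"misfire_grace_time": grace, "coalesce": True}
```

The resulting dict is the kind of value that would be handed to `AsyncIOScheduler(job_defaults=...)`, with the daily probe overriding `misfire_grace_time` per job.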

Why

APScheduler's AsyncIOScheduler schedules wake-ups against loop.time() (CLOCK_MONOTONIC). When a hypervisor host suspends the guest's vCPU (laptop sleep, WSL2 going idle, host hibernate), CLOCK_MONOTONIC freezes for the duration. On thaw, every fire scheduled during the gap is reported as "Run time of job ... was missed by N min" and, with the default misfire_grace_time = 1 s, silently dropped.

This was empirically observed on a WSL2 dev box: every 40 min cleanup fire from cleanup_long_running_tasks / cleanup_long_running_chat_messages was being missed by 3–4 min and dropped (~99% loss), and the new 24 h run_purge_invariants_check would have been at risk of skipping a full day per gap. Production Linux servers don't suspend, so the bare-metal branch keeps the tight grace there to surface real scheduler stalls.

Why this is summarized rather than verbose

The change is mechanically narrow (one factory function + per-job kwargs in tasks.py, plus 11 new unit tests in test_scheduler_tasks.py). The rationale, the failure mode, and the bare-vs-VM split are documented inline at the top of tasks.py and at each per-job override, so future readers don't need to chase the commit message. No behavioural change on production servers — they detect as bare, default grace stays 60 s. No public API surface, no schema migration, no config required (env override is opt-in only).

Verification

  • 13/13 unit tests pass; ruff clean.
  • End-to-end on this WSL2 host: backend rebuilt + restarted, scheduler logs host_class=vm, reason=WSL2 detected via /proc/version, default_misfire_grace_time=3600s, coalesce=True. Both 40 min cleanup jobs fired at 04:05 UTC (≈4.5 min after their scheduled 04:00:44 UTC slot), completed successfully, and produced zero "missed by" / misfire warnings, versus consistent misses on every fire prior to the change.

Files

  • src/ii_agent/workers/cron/tasks.py (+109/-3): _detect_host_class(), _JOB_DEFAULTS, AsyncIOScheduler(job_defaults=...), per-job misfire_grace_time for the invariants probe, scheduler-startup log.
  • src/tests/unit/scripts/test_scheduler_tasks.py (+157/-1): coverage for env override (valid/invalid), WSL2 detection, hypervisor flag detection, bare-metal default, missing /proc files, substring false-match guard, per-host invariants-probe grace.

mdear added 2 commits May 11, 2026 18:36
Local multi-user dev login
- core/config/settings.py: new DevUserConfig + Settings.dev_users
  (JSON env DEV_USERS); validates username charset and PIN length.
- auth/router.py: GET /auth/dev/users chooser endpoint;
  POST /auth/dev/login now takes {username, pin}, looks up the named
  user, validates PIN with constant-time compare, per-username rate
  limit + sleep-throttle on failures, generic error message.
  Each named user maps to dev+<username>@localhost for full
  session/credit isolation between household members.
- frontend/login.tsx: chooser dropdown + PIN input replaces the single
  "Dev login" button; only shown when /auth/dev/users reports enabled.
- frontend/utils.ts: getFirstCharacters resilient to punctuation /
  unicode / empty tokens so avatar initials don't break for dev names.
- docker/.stack.env.local.example: DEV_USERS placeholder + docs.
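
A hedged sketch of the PIN check described above (constant-time compare plus a per-username failure limit). `check_pin`, `MAX_FAILURES`, `WINDOW_S`, and the in-memory failure log are assumptions made for illustration, not the code in auth/router.py; the sleep-throttle is left out to keep the example short.

```python
from __future__ import annotations

import hmac
import time
from collections import defaultdict, deque

MAX_FAILURES = 5   # assumed limit
WINDOW_S = 60.0    # assumed sliding window, in seconds

# username -> timestamps of recent failed attempts (in-memory, per process)
_failures: dict[str, deque[float]] = defaultdict(deque)


def check_pin(users: dict[str, str], username: str, pin: str, now: float | None = None) -> bool:
    """Return True only for a known user with the right PIN and no lockout."""
    now = time.monotonic() if now is None else now
    recent = _failures[username]
    while recent and now - recent[0] > WINDOW_S:
        recent.popleft()  # forget failures that fell out of the window
    if len(recent) >= MAX_FAILURES:
        return False  # rate-limited; the caller returns the same generic error
    expected = users.get(username)
    # compare_digest avoids leaking how many leading characters matched.
    # Unknown users are compared against "" so both paths do a comparison.
    matches = hmac.compare_digest((expected or "").encode(), pin.encode())
    ok = expected is not None and matches
    if not ok:
        recent.append(now)
    return ok
```

Returning a bare boolean keeps the caller free to emit one generic error message for every failure mode, whether the user is unknown, the PIN is wrong, or the account is rate-limited.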

Agent runtime resilience
- agents/models/anthropic/claude.py: stop synthesizing
  redacted_thinking from plaintext reasoning_content — Anthropic
  rejects non-issued blobs with a non-retriable 400 that bricks
  replay. Drop the block with a warning; text/tool_use survive.
  (Triage: session 9785de09, 2026-05-11.)
- agents/inner_loop.py: detect empty A2A turn (no text, reasoning,
  tool call, or session.error) and raise ModelProviderError instead
  of silently completing. Surfaces quota-exhausted Copilot CLI
  failures to the user / fallback path.
- Tests pinned for both fixes.
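
To illustrate the empty-turn guard described for agents/inner_loop.py, a minimal sketch follows. `TurnResult`, its field names, and `ensure_non_empty` are hypothetical, and `ModelProviderError` is redefined locally as a stand-in for the project's exception of the same name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ModelProviderError(RuntimeError):
    """Local stand-in for the project's provider error type."""


@dataclass
class TurnResult:
    """Hypothetical summary of what one A2A turn produced."""

    text: str = ""
    reasoning: str = ""
    tool_calls: list[dict] = field(default_factory=list)
    session_error: str | None = None


def ensure_non_empty(turn: TurnResult) -> TurnResult:
    """Raise instead of silently completing when a turn produced nothing."""
    produced_something = (
        turn.text.strip()
        or turn.reasoning.strip()
        or turn.tool_calls
        or turn.session_error
    )
    if not produced_something:
        raise ModelProviderError("A2A turn completed with no output")
    return turn
```

Raising here is what lets the caller surface the failure or take the fallback path, instead of recording a blank assistant message.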

Housekeeping
- scripts/stack_control.sh: suppress spurious "[timed out]" annotation
  on AVAILABLE pool slots whose lifetime is governed by retire_at.
- .github/copilot-instructions.md: drop stale --local hint; document
  `stack_control.sh verify`.

Lint clean on changed Python files.
…imers

Long deep_research turns were tripping the single fixed per-turn
wall-clock cap and falling back to the native (billed) Anthropic
provider mid-task even though the Copilot backend was still
productively streaming events.

Splits the watchdog in copilot_backend.py into two independent timers:

  - absolute: hard wall-clock safety net (defaults 300s -> 1800s; long
    horizon 3600s -> 7200s)
  - activity: max idle time with no SDK events; resets on every
    non-heartbeat event (defaults 600s; long horizon 900s)

Surface area:
  - core/config/agent.py: new a2a_adapter_activity_timeout_long_horizon
    setting; tightened docstring on the absolute long-horizon setting
  - integrations/a2a/adapter_server.py: read A2A_*_ACTIVITY_TIMEOUT env
    vars and pass through to CopilotConfig
  - agents/sandboxes/docker.py: forward A2A_*_ACTIVITY_TIMEOUT vars to
    sandbox containers, honouring the long-horizon override for
    research-class agents
  - docker/docker-compose.local.yaml: expose the new env vars on the
    a2a-adapter sidecar with matching defaults
  - tests: cover the new env wiring on both the docker sandbox and
    copilot backend layers

Also in this commit:
  - e2b.Dockerfile: bump GH_CLI_VERSION 2.91.0 -> 2.92.0 (2.91.0 was
    rolled out of the apt repo, breaking sandbox rebuilds)
  - .gitignore: ignore build-manifest-*.json (generated per-build by
    scripts/stack_control.sh and COPY'd into each image; was being
    flagged as untracked after every build)
mdear commented May 12, 2026

Follow-up commit: b1867dc — adapter timeout split + sandbox build fixes

a2a/copilot: split per-turn timeout into absolute + activity (idle) timers

Long deep_research turns were tripping the single fixed per-turn wall-clock cap and falling back to the native (billed) Anthropic provider mid-task even though the Copilot backend was still productively streaming events.

Splits the watchdog in copilot_backend.py into two independent timers:

| Timer | Purpose | Default | Long-horizon |
| --- | --- | --- | --- |
| absolute | hard wall-clock safety net | 300s → 1800s | 3600s → 7200s |
| activity | max idle with no SDK events; resets on every non-heartbeat event | none → 600s | none → 900s |

The activity timer is the real "is the backend stuck?" signal; the absolute timer is now just a forgiving safety net so productive long turns never get killed.
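
A rough asyncio sketch of the two-timer idea, for illustration only. It is not the watchdog in copilot_backend.py: `watch_turn`, `TurnTimeout`, and the dict-shaped events with a "type" key are assumptions. What it borrows from this commit is the split itself: an absolute wall-clock cap plus an idle timer that only non-heartbeat events reset.

```python
from __future__ import annotations

import asyncio
from typing import AsyncIterator


class TurnTimeout(Exception):
    """Raised when a timer expires; kind is "absolute" or "activity"."""

    def __init__(self, kind: str) -> None:
        super().__init__(f"{kind} timeout")
        self.kind = kind


async def watch_turn(
    events: AsyncIterator[dict], absolute_s: float, activity_s: float
) -> list[dict]:
    """Collect events until the stream ends or one of the two timers expires."""
    loop = asyncio.get_running_loop()
    start = last_activity = loop.time()
    iterator = events.__aiter__()
    received: list[dict] = []
    while True:
        now = loop.time()
        abs_left = start + absolute_s - now            # wall-clock budget left
        idle_left = last_activity + activity_s - now   # idle budget left
        if abs_left <= 0:
            raise TurnTimeout("absolute")
        if idle_left <= 0:
            raise TurnTimeout("activity")
        try:
            event = await asyncio.wait_for(
                iterator.__anext__(), timeout=min(abs_left, idle_left)
            )
        except StopAsyncIteration:
            return received
        except asyncio.TimeoutError:
            # Whichever budget was smaller is the one that ran out.
            raise TurnTimeout("absolute" if abs_left <= idle_left else "activity") from None
        if event.get("type") != "heartbeat":
            last_activity = loop.time()  # only real events count as activity
        received.append(event)
```

With the new defaults this would be called with something like `absolute_s=1800, activity_s=600`; a backend that only sends heartbeats trips the activity timer long before the absolute one.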

Surface area

  • core/config/agent.py — new a2a_adapter_activity_timeout_long_horizon setting; tightened docstring on the absolute long-horizon setting.
  • integrations/a2a/adapter_server.py — read A2A_*_ACTIVITY_TIMEOUT env vars and pass through to CopilotConfig.
  • agents/sandboxes/docker.py — forward A2A_*_ACTIVITY_TIMEOUT vars into sandbox containers, honouring the long-horizon override for research-class agents.
  • docker/docker-compose.local.yaml — expose the new env vars on the a2a-adapter sidecar with matching defaults.
  • Tests cover the new env wiring on both the Docker sandbox and Copilot backend layers.

Native loop check: confirmed the native inner loop has no analogous per-turn watchdog — timeouts there are per-HTTP-request only (Anthropic 300s, Google 600s) and there is no turn-level wrapper that could prematurely abort a productive turn. No native-side timing changes required.

Also in this commit

  • e2b.Dockerfile: bump GH_CLI_VERSION 2.91.0 → 2.92.0. 2.91.0 was rolled out of the apt repo, breaking sandbox rebuilds.
  • .gitignore: ignore build-manifest-*.json (generated per-build by scripts/stack_control.sh and COPY'd into each backend/frontend/sandbox image; was showing as untracked after every build).
