Skip to content

feat: add subprocess session isolation for recipe execution#135

Merged
bkrabach merged 18 commits intomainfrom
feat/subprocess-session-isolation
Mar 22, 2026
Merged

feat: add subprocess session isolation for recipe execution#135
bkrabach merged 18 commits intomainfrom
feat/subprocess-session-isolation

Conversation

@bkrabach
Copy link
Collaborator

Summary

Adds an opt-in subprocess runner to amplifier-foundation that allows agent sessions to run in isolated child processes. When a subprocess exits, the OS reclaims ALL memory immediately — eliminating the RSS watermark accumulation that causes OOM kills during parallel recipe execution.

Problem

During parallel recipe execution (foreach with parallel: true), multiple agent sessions run concurrently in the same Python process. Each session loads ~300-400MB of modules. Python's allocator never returns freed pages to the OS after sessions complete. After multiple waves of parallel sessions (e.g., deep discovery running topdown + bottomup + combine), a single process grows to 35-37GB RSS and gets OOM-killed.

Solution

A foundation-layer utility (subprocess_runner.py) that:

  • Serializes session config + prompt to a temp JSON file
  • Spawns a child Python process via asyncio.create_subprocess_exec
  • Child creates a fresh AmplifierSession, runs execute(), writes result to stdout, exits
  • Parent reads result, OS reclaims all child process memory
  • Includes max_subprocess semaphore (default 4) to prevent fork-bombing

Usage

from amplifier_foundation.subprocess_runner import run_session_in_subprocess

result = await run_session_in_subprocess(
    config=child_config,
    prompt=task,
    parent_id=parent_session.session_id,
    project_path=str(project_path),
    session_id=sub_session_id,
    timeout=1800,
)

Files

  • New: amplifier_foundation/subprocess_runner.py (271 lines) — config serialization, child runner, __main__ entry point, parent spawner, concurrency semaphore
  • New: tests/test_subprocess_runner.py (543 lines) — 21 tests covering round-trip serialization, child runner, parent spawner, error handling, timeout, concurrency
  • Modified: amplifier_foundation/__init__.py — exports run_session_in_subprocess

Design

  • Opt-in only — default behavior is unchanged (in-process spawning)
  • No kernel changes — uses existing AmplifierSession API
  • No new dependencies — stdlib only (asyncio, json, sys, tempfile)
  • IPC: temp JSON file (parent→child) + stdout (child→parent) + stderr (errors)
  • Companion PR for amplifier-app-cli wires this into session_spawner.py

Testing

21 tests covering:

  • Config serialization round-trip (minimal, full, without session_id)
  • Missing required keys and malformed JSON
  • Child session runner (success, cleanup on error, no session_id)
  • __main__ entry point validation
  • Parent-side subprocess management (success, non-zero exit, timeout, temp file cleanup)
  • Concurrency semaphore limiting

Design doc: amplifier-bundle-dot-graph/docs/plans/2026-03-21-subprocess-session-isolation-design.md

bkrabach and others added 18 commits March 22, 2026 15:15
Add RESULT_START_MARKER and RESULT_END_MARKER constants to create a
framed envelope around subprocess result output. This prevents stray
print() calls from third-party code or debug output corrupting the
result payload returned to the parent process.

Changes:
- Add RESULT_START_MARKER = '<<<AMPLIFIER_RESULT_START>>>' constant
- Add RESULT_END_MARKER = '<<<AMPLIFIER_RESULT_END>>>' constant
- Add _extract_framed_result(stdout: str) -> str helper that extracts
  content between markers, raises RuntimeError('missing result envelope')
  if markers not found, and logs unframed output at DEBUG
- Update run_session_in_subprocess() to decode stdout then call
  _extract_framed_result() instead of raw .strip()
- Update __main__ block to wrap output with print(RESULT_START_MARKER),
  print(output, end=''), print(), print(RESULT_END_MARKER)
- Update test mocks in test_success, test_temp_file_cleanup_on_success,
  test_passes_session_id_in_config, and test_max_concurrent_limits_parallelism
  to wrap mock stdout in RESULT_START_MARKER/RESULT_END_MARKER envelope
- Add TestStdoutFraming tests: test_framed_output_extracted_correctly
  and test_unframed_output_raises_runtime_error
- Add import re to subprocess_runner.py
- Define _CREDENTIAL_PATTERNS list of 6 compiled regexes:
  - sk-[a-zA-Z0-9\-_]{10,} for API keys (sk- prefix)
  - key=\s*\S+ (case-insensitive) for key=value patterns
  - token=\s*\S+ (case-insensitive) for token=value patterns
  - secret=\s*\S+ (case-insensitive) for secret=value patterns
  - password=\s*\S+ (case-insensitive) for password=value patterns
  - Bearer\s+\S+ for Bearer token headers
- Add _sanitize_error(msg: str) -> str that replaces all matches with [REDACTED]
- Update run_session_in_subprocess(): log full stderr at DEBUG, sanitize
  for RuntimeError message with exit code in the format:
  'Subprocess session failed (exit code {returncode}): {sanitized}'
- Add TestErrorSanitization test class with 4 tests verifying all behaviors

Fixes finding #9: stderr may leak credentials in subprocess error messages

Co-authored-by: Amplifier <amplifier@example.com>
… assert permissions

- Add import stat
- Add _validate_project_path() that resolves path, checks is_dir(), raises
  ValueError('does not exist or is not a directory') if invalid
- In run_session_in_subprocess(): call _validate_project_path first, move temp
  file creation inside try block (tmp_path: str | None = None before try),
  assert/enforce 0o600 permissions after write, unlink in finally only if not None
- In _run_child_session(): call _validate_project_path before os.chdir()
- Add TestCleanupHardening: test_nonexistent_project_path_raises,
  test_file_as_project_path_raises, test_valid_project_path_passes,
  test_parent_validates_project_path

Addresses findings #10 (temp file before try), #12 (file permissions),
#13 (project_path validation)
Blocker B: The __main__ block now emits a JSON envelope between framing
markers on both success and error paths, enabling the parent (session_spawner)
to parse structured output including status, turn_count, and metadata.

Success path:
  result_envelope = json.dumps({
      'output': output, 'status': 'success', 'turn_count': 1, 'metadata': {}
  })

Error path now emits JSON envelope with 'status': 'error' and 'error' field
before calling sys.exit(1).

Blocker C (part 1): Added documentation comment in _run_child_session
explaining why approval_system and display_system are intentionally not
available in subprocess mode (live runtime objects that cannot cross process
boundaries).

Tests added:
- TestMainJsonEnvelope.test_success_emits_json_envelope
- TestMainJsonEnvelope.test_error_emits_json_envelope_with_status_error
@bkrabach bkrabach force-pushed the feat/subprocess-session-isolation branch from 65f49c4 to 7b606cd Compare March 22, 2026 22:18
@bkrabach bkrabach merged commit c0b047b into main Mar 22, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant