Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
100 changes: 100 additions & 0 deletions docs/brainstorms/session-isolation-requirements.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
---
title: Multi-Session State Isolation
scope: Standard
status: approved
created: 2026-04-14
spec_source: docs/specs/multi-session-state-isolation.md
---

# Multi-Session State Isolation — Requirements

## Problem Statement

When multiple Claude Code sessions run apex-master (Plan Agent) in the same project directory,
they share a single `.apex/state.json` cache. `rebuildAndCache("state")` replays ALL events
into one `current_stage`, and the last writer wins — corrupting other sessions' pipeline state.
Secondary issue: concurrent `taskCreate()` calls can produce duplicate Task IDs (`T5` + `T5`),
with one silently discarded.

## Constraints

1. [已验证] Event log format (`.apex/log/state.jsonl`) must not change — append-only JSONL with session_id is already safe.
2. [已验证] Dashboard API (`/api/state`) return structure must not change — `sessionPipelines` already groups by session.
3. [已验证] Worker communication (`.apex/workers/`) must not change — isolated by task_id.
4. [已验证] Global `state.json` must be preserved — Dashboard's `deriveStageFromTasks` reads it as fallback.
5. [已验证] Task ID format must stay `T{N}` — no session prefix (too many downstream consumers).
6. [已验证] Backward compatible — graceful fallback when no per-session cache exists.

## Approaches

### A: Per-session state cache (spec's approach)
- `rebuildAndCache("state")` dual-writes: per-session `.apex/state.{sid}.json` + global `state.json`
- `loadState()` reads per-session first, falls back to global
- `currentSessionId()` stops reading from `state.json` (cross-session pollution source)
- Task ID: derive `maxId` from event log instead of cached `next_id`
- **Pros**: Minimal change ([假设] ~190 lines per spec estimate), backward compatible, Dashboard unaffected
- **Cons**: Per-session cache files accumulate (mitigated by cleanup in `apex init`)

### B: Database-backed state with locking
- Replace JSON files with SQLite for atomic reads/writes
- **Pros**: Proper concurrency, no cache files
- **Cons**: Major rewrite, breaks all existing tooling, overkill for the problem

### C: File locking on state.json
- Use `flock` or similar before read-modify-write cycles
- **Pros**: No new files
- **Cons**: Doesn't solve the fundamental issue (one global stage can't represent N sessions)

**Selected: Approach A** — solves the root cause (per-session isolation) with minimal blast radius.

## Acceptance Criteria

1. Two sessions writing different stages each read back their own stage (not the other's).
2. Session A's exit gate check is not affected by Session B's artifacts.
3. `loadState()` prefers per-session cache, falls back to global when per-session doesn't exist.
4. `currentSessionId()` never reads another session's ID from `state.json`.
5. Concurrent `taskCreate()` calls derive maxId from event log (not cached `next_id`). [假设] Residual micro-race window exists where two processes may read the same maxId — collisions are detected and annotated, not silently dropped. Full uniqueness guarantee would require random-suffix IDs (deferred as higher-cost change).
6. Duplicate task ID events are annotated with `[conflict]` marker and source session (not silently dropped).
7. Per-session cache files older than 7 days are cleaned up by `apex init`.
8. All existing tests continue to pass. [假设] Current count ~278, exact number to be verified at implementation time.

## Risks

| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| Line numbers shifted in source files | Medium | Low | Match on code content, not line numbers. Verified: rebuildAndCache shifted ~10 lines. |
| Per-session cache files accumulate | Low | Low | Cleanup in `apex init` (7-day TTL) |
| `APEX_SESSION_ID` env var not set | Low | Medium | Graceful fallback: generate new ID per process. Worst case = fragmented events, not cross-pollution. |
| Task ID race window (two processes read same maxId) | Very Low | Medium | Event log dedup with conflict annotation instead of silent drop |

## Dependencies

| Dependency | Status |
|-----------|--------|
| `src/state/event-log.ts` (STATE_CACHE, currentSessionId, rebuildAndCache, materializeTasks) | [已验证] Available, line numbers confirmed |
| `src/state/state.ts` (loadState, STATE_PATH) | [已验证] Available |
| `src/state/tasks.ts` (taskCreate) | [已验证] Available |
| `src/commands/init.ts` (cmdInit) | [已验证] Available |
| `bun:test` framework | [已验证] Already used in project |

## Solution Shape

Four surgical changes to the state management layer:

1. **event-log.ts**: Add `sessionStateCachePath()`, modify `currentSessionId()` to stop reading state.json, modify `rebuildAndCache("state")` to dual-write per-session + global cache, modify `materializeTasks` dedup to annotate conflicts.
2. **state.ts**: Modify `loadState()` to prefer per-session cache with global fallback.
3. **tasks.ts**: Modify `taskCreate()` to derive maxId from event log instead of cached `next_id`.
4. **init.ts**: Add stale per-session cache cleanup (7-day TTL).
5. **New test file**: `session-isolation.test.ts` with 6 test cases covering isolation and collision resistance.

No new abstractions. No new dependencies. Existing tests unaffected.

## Confirmed Decisions

| # | Decision | Basis | Status |
|---|----------|-------|--------|
| D1 | Per-session cache approach over SQLite or file locking | [已验证] Root cause requires per-session isolation, not just atomicity | Confirmed |
| D2 | Global state.json preserved for Dashboard compatibility | [已验证] Dashboard's deriveStageFromTasks reads it | Confirmed |
| D3 | Task ID stays T{N} format, maxId from event log | [已验证] Too many downstream consumers for format change | Confirmed |
| D4 | currentSessionId stops reading state.json | [已验证] This is the cross-session pollution source (line 61-70) | Confirmed |
| D5 | rebuildAndCache line numbers shifted to 470-490 | [已验证] Spec says 460-480, actual is ~10 lines later | Confirmed |
42 changes: 42 additions & 0 deletions docs/execution/session-isolation-log.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
---
title: Multi-Session State Isolation — Execution Log
source: docs/plans/session-isolation-plan.md
status: complete
started: 2026-04-14
completed: 2026-04-14
tasks_done: 8
tasks_total: 8
---

# Execution Log

## Task Progress

| Task | ID | Status | Evidence |
|------|-----|--------|----------|
| Export session cache helpers | T31 | done | 3/3 tests pass, 290 total green |
| Isolate currentSessionId | T32 | done | 4/4 tests pass, 291 total green |
| Dual-write rebuildAndCache | T33 | done | 5/5 tests pass, 292 total green |
| Per-session loadState preference | T34 | done | 7/7 tests pass, 294 total green |
| Task ID from event log maxId | T35 | done | 8/8 tests pass, 294 total green |
| Conflict annotation | T36 | done | 9/9 tests pass, 296 total green |
| Stale cache cleanup in init | T37 | done | 10/10 tests pass, 297 total green |
| Full regression | T38 | done | 296/297 pass (1 pre-existing flaky) |

## Files Modified

| File | Changes |
|------|---------|
| `src/state/event-log.ts` | +`sessionStateCachePath()`, +`_resetSessionIdCache()`, `currentSessionId()` state.json read removed, `rebuildAndCache("state")` dual-write, `materializeTasks()` conflict annotation |
| `src/state/state.ts` | `loadState()` per-session cache preference |
| `src/state/tasks.ts` | `taskCreate()` maxId from event log |
| `src/commands/init.ts` | Stale per-session cache cleanup |
| `src/__tests__/session-isolation.test.ts` | 10 new test cases |

## Deviations from Plan

- T36 test: `toContain("[conflict]")` failed due to substring mismatch (`[conflict:]` vs `[conflict]` — the closing bracket is at the end of the full annotation, not immediately after "conflict"). Fixed to search for `[conflict:`. Also fixed the production code guard to match.

## Known Issues

- `CLI Integration > apex consensus test-all` — pre-existing timeout failure, unrelated to this work.
144 changes: 144 additions & 0 deletions docs/plans/session-isolation-plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
---
title: Multi-Session State Isolation — Implementation Plan
scope: Standard
status: approved
created: 2026-04-14
source: docs/brainstorms/session-isolation-requirements.md
spec: docs/specs/multi-session-state-isolation.md
task_count: 8
complexity: medium
---

# Multi-Session State Isolation — Plan

## Problem Frame

Multiple concurrent Claude Code sessions sharing `.apex/state.json` corrupt each
other's pipeline stage. The cache layer materializes all events into one global
stage; last writer wins. Secondary: concurrent `taskCreate()` reads stale `next_id`
from cache, producing duplicate Task IDs with silent loss.

## Decision Log

| # | Decision | Rationale | Rejected Alternative |
|---|----------|-----------|---------------------|
| D1 | Per-session cache files (`state.{sid}.json`) alongside global | Isolates sessions without breaking Dashboard's global read | SQLite (overkill), file locking (doesn't solve multi-stage problem) |
| D2 | `currentSessionId()` stops reading `state.json` | state.json is the cross-session pollution vector (event-log.ts:61-70) | Keep reading but filter — adds complexity without benefit |
| D3 | Dual-write: per-session + global in `rebuildAndCache` | Dashboard backward compatible; CLI reads per-session | Per-session only — breaks Dashboard fallback |
| D4 | Task ID maxId from event log replay, not cached `next_id` | Event log is the source of truth; cache can be stale | Random suffix IDs — changes T{N} format, too many downstream consumers |
| D5 | Conflict annotation instead of silent dedup drop | Preserves both tasks' info; makes race visible | Silent drop (current behavior) — causes task loss |

## File Manifest

### Modified Files

| File | Function(s) Changed | Change Summary |
|------|---------------------|----------------|
| `src/state/event-log.ts` | `sessionStateCachePath()` (new), `_resetSessionIdCache()` (new), `currentSessionId()`, `rebuildAndCache()`, `materializeTasks()` | Add per-session cache path helper, test reset helper, remove state.json read from session ID, dual-write state cache, conflict-annotate duplicate task IDs |
| `src/state/state.ts` | `loadState()` | Prefer per-session cache, fallback to global |
| `src/state/tasks.ts` | `taskCreate()` | Derive maxId from event log instead of `store.next_id` |
| `src/commands/init.ts` | `cmdInit()` | Add stale per-session cache cleanup (7-day TTL) |

### New Files

| File | Purpose |
|------|---------|
| `src/__tests__/session-isolation.test.ts` | 8 test cases for session isolation + task ID collision |

### Files NOT Changed (by design)

`dashboard.ts`, `frontend/app.js`, worker files — per constraints in requirements doc.

## Task Decomposition

### T1: Export session cache helpers (event-log.ts)
- **Files**: `src/state/event-log.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts` (scaffold + first test)
- **Complexity**: trivial
- **Dependencies**: none
- **AC**: Foundation for AC 1-4
- **What**: Add `sessionStateCachePath(sid?)` function that returns `.apex/state.{sid}.json`. Add `_resetSessionIdCache()` for test support. Export both.

### T2: Isolate currentSessionId (event-log.ts)
- **Files**: `src/state/event-log.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts`
- **Complexity**: small
- **Dependencies**: T1
- **AC**: 4 (currentSessionId never reads another session's ID)
- **What**: Remove the state.json read fallback (lines 61-70). Keep: env var → cached → generate new. Three-step only.

### T3: Dual-write rebuildAndCache state (event-log.ts)
- **Files**: `src/state/event-log.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts`
- **Complexity**: medium
- **Dependencies**: T1, T2
- **AC**: 1 (two sessions read own stage), 2 (gate isolation)
- **What**: In `rebuildAndCache("state")` (line ~479), filter events by `session_id` for per-session cache, write both per-session and global.

### T4: Per-session loadState preference (state.ts)
- **Files**: `src/state/state.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts`
- **Complexity**: small
- **Dependencies**: T3
- **AC**: 3 (loadState prefers per-session, falls back to global)
- **What**: In `loadState()` (line 42), check `sessionStateCachePath()` first with `existsSync`, fall back to `STATE_PATH`.

### T5: Task ID from event log maxId (tasks.ts)
- **Files**: `src/state/tasks.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts`
- **Complexity**: small
- **Dependencies**: none (independent of T1-T4)
- **AC**: 5 (concurrent taskCreate derives maxId from event log)
- **What**: In `taskCreate()` (line 110), scan `readEvents("task")` for max numeric ID instead of using `store.next_id`.

### T6: Conflict annotation in materializeTasks (event-log.ts)
- **Files**: `src/state/event-log.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts`
- **Complexity**: small
- **Dependencies**: none (can parallel with T5)
- **AC**: 6 (duplicate IDs annotated, not silently dropped)
- **What**: In `materializeTasks()` (line ~180), change `if (store.tasks.some(...)) break` to find + annotate `[conflict]` with source session_id.

### T7: Stale cache cleanup in init (init.ts)
- **Files**: `src/commands/init.ts`
- **Test files**: `src/__tests__/session-isolation.test.ts`
- **Complexity**: trivial
- **Dependencies**: none
- **AC**: 7 (per-session cache files >7 days cleaned by init)
- **What**: At end of `cmdInit()`, scan `.apex/` for `state.apex-*.json` files older than 7 days, delete them.

### T8: Full regression verification
- **Files**: none (read-only)
- **Test files**: all test files
- **Complexity**: trivial
- **Dependencies**: T1-T7
- **AC**: 8 (all existing tests pass)
- **What**: Run `bun test` and verify full green. Count should be >= previous count.

## Test Plan

| AC | Test Scenario | Test File | Given/When/Then |
|----|--------------|-----------|-----------------|
| 1 | Two sessions write different stages | `session-isolation.test.ts` | Given session A sets brainstorm and session B sets execute / When each reads their per-session cache / Then A sees brainstorm and B sees execute |
| 2 | Gate isolation | `session-isolation.test.ts` | Given session B registers a brainstorm artifact / When session A checks its per-session state / Then session A's state has no artifacts from B |
| 3 | loadState fallback | `session-isolation.test.ts` | Given only global state.json exists (no per-session) / When loadState() is called / Then it returns the global state |
| 3 | loadState preference | `session-isolation.test.ts` | Given both global (stage=review) and per-session (stage=plan) exist / When loadState() is called / Then it returns plan |
| 4 | currentSessionId isolation | `session-isolation.test.ts` | Given state.json contains session_id="other-session" / When currentSessionId() is called with no env var and cleared cache / Then it returns a newly generated ID, not "other-session" |
| 5 | Task ID sequential derivation | `session-isolation.test.ts` | Given session A creates T1 via event log / When session B calls taskCreate / Then B derives maxId=1 from event log and creates T2 (not T1) |
| 6 | Duplicate ID conflict annotation | `session-isolation.test.ts` | Given two task.created events with id="T1" are injected into the event log / When materializeTasks runs / Then first T1 is preserved and second is annotated with [conflict] marker |
| 7 | Stale cache cleanup | `session-isolation.test.ts` | Given `.apex/state.apex-old.json` with mtime 8 days ago / When `cmdInit()` runs / Then the file is deleted |
| 8 | Regression | Full test suite | When `bun test` runs / Then all tests pass |

## Dependency Graph

```
T1 (helpers) ──→ T2 (sessionId) ──→ T3 (rebuildAndCache) ──→ T4 (loadState)
T5 (taskCreate maxId) ─────────────────────────────────────────────┐ │
T6 (conflict annotation) ─────────────────────────────────────────┤ │
T7 (init cleanup) ────────────────────────────────────────────────┤ │
▼ ▼
T8 (regression)
```

T1→T2→T3→T4 is the critical chain. T5, T6, T7 are independent and parallelizable.
54 changes: 54 additions & 0 deletions docs/reviews/session-isolation-review.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
---
title: Multi-Session State Isolation — Review
status: DONE
date: 2026-04-14
personas: Security, Correctness, Spec Compliance, Concurrency, Test Quality
---

# Review: Multi-Session State Isolation

## Summary

Multi-persona review of session isolation implementation across 4 source files + 1 test file.

## Findings Fixed

### P0: Path traversal via unsanitized APEX_SESSION_ID (Security)
- **File**: `src/state/event-log.ts:56-65`
- **Fix**: Added `sanitizeSessionId()` — strips non-alphanumeric chars, limits to 128 chars
- **Verified**: Tests pass, generated IDs pass through unchanged

### P1: stage.ts separate loadState bypasses session isolation (Correctness)
- **File**: `src/state/stage.ts:6-14`
- **Fix**: Updated `loadState()` to prefer per-session cache, consistent with `state.ts`
- **Callers affected**: `src/commands/status.ts`, `src/mcp/tools/status.ts`
- **Verified**: Tests pass, no callers of `saveState` exist

### P2: Cleanup regex misses non-apex session IDs (Security)
- **File**: `src/commands/init.ts:102`
- **Fix**: Broadened regex to `/^state\..+\.json$/` with `!== "state.json"` guard
- **Verified**: Tests pass

## Findings Accepted (Known Limitations)

### P1: Task ID TOCTOU race (Concurrency)
- **File**: `src/state/tasks.ts:116-127`
- **Status**: Known limitation per AC5. Collision detected via conflict annotation. Full fix (random suffix IDs) deferred as higher-cost change.

### P1: Non-atomic dual write (Concurrency)
- **File**: `src/state/event-log.ts:489-501`
- **Status**: Acceptable. Both caches are re-derivable from event log. Crash between writes causes no data loss.

## Findings Noted (P3)

- Conflict annotation in user data field (not metadata) — design choice
- `_resetSessionIdCache` exported without access control — test utility
- `readEvents` silently skips corrupted log lines — existing behavior
- Legacy events without session_id excluded from per-session caches — edge case
- `defaultState()` uses `sessionId()` instead of `currentSessionId()` — pre-existing

## Verification

- 298 tests pass (11 new session isolation tests)
- 1 pre-existing flaky test (CLI consensus timeout) — unrelated
- All P0/P1/P2 findings fixed and verified
Loading
Loading