OpenWar's framing is discipline, not intelligence. v0.9 starts to let the runtime get sharper the more you run it, without getting cleverer. No second LLM judges the first. Adaptation is heuristic, inspectable, deterministic, and operator-gated.
This release ships in two halves.
| Version | Scope | Status |
|---|---|---|
| v0.9.0 | openwar history: read accumulated traces, produce a descriptive report. Counts and quantiles only. No runtime behavior change. |
Shipped. |
| v0.9.1 | Adaptive autonomy plumbing: openwar learn, learned_profile: frontmatter, runner integration, detector sensitivity. Conservative thresholds make the system a no-op until run ~10. |
Shipped. |
| v0.9.2+ | Threshold tuning against real distributions. Patch releases adjust the constants in src/state/heuristics.ts. |
Pending real-run data. |
v0.9.0 is descriptive. v0.9.1 will be prescriptive. The split exists because adapting against synthetic / thin samples ships footguns. The data foundation has to exist before recommendations against it can be load-bearing.
openwar history <project_slug>
openwar history <project_slug> --since <ISO>
openwar history <project_slug> --min-samples <N>
openwar history <project_slug> --json
openwar inspect <brief_id> --history # show the history report for the brief's project
The report covers four sections, each derived deterministically from trace events alone:
- Tool usage: per-tool call count, last-used timestamp, "dead" flag when zero calls across a sufficient sample.
- Phase tool-call distribution: P50 / P90 / max per phase, computed by attributing each
tool_callevent to the most recentphase_enter. - Detector fire rates: count per detector + fires per run.
- Phase timing: total
duration_msper phase across all runs, plus average.
The report is descriptive. v0.9.0 makes no recommendation, applies nothing at runtime, and does not write any profile to disk.
v0.9.1 will recommend detector sensitivity adjustments. Doing that requires a precise definition of "false positive" per detector. Locking it here so v0.9.1 codes to a spec, not a vibe.
A detector "fires" when its predicate returns truthy. Whether that fire was "right" is detector-specific:
- True positive: fired AND the session halted at Phase 2 AND was not resumed with new context.
- False positive: fired AND the session was resumed with continued execution AND completed cleanly.
- Undefined: fired AND the session was resumed AND the resume itself halted. (Treat as 0.5 weight; the original blocker was at least partly real.)
The rare path where blocker FP is actually computable is: resume after blocker, then clean Phase 4. v0.8 captures phase transitions and final phase in session metadata. Sample size will be small.
- True positive: fired AND the operator answered
nat the Phase 3 prompt. - False positive: fired AND the operator answered
yorY. (Approval means the action was authorized in context; the detector was correct to flag it but the operator's call overrode.) - Calibration trap: "false positive" here does NOT mean "the detector was wrong to fire." It means "the detector fired on something the operator considered safe in context." v0.9.1 must surface this distinction in the inspect output so the operator does not interpret a high destructive FP rate as "the detector is broken." It is doing its job.
- No operator-override path exists today. The runtime warns but does not gate.
- FP rate is undefined. v0.9.1 will not recommend sensitivity changes for this detector. The history report counts fires only.
- No gating effect. Pure observation.
- FP rate is undefined. Same as banned_phrases.
- Fires in Phase 0. Required to advance.
- FP rate is undefined (the runtime cannot proceed without it firing; FP would mean "fired without the operator accepting," which is the intake re-prompt path; we treat that as detector working as designed).
- True positive: fired AND the session reached Phase 4 (
done). - False positive: fired AND the session did not reach Phase 4 (operator intervened, blocker hit, max_steps).
The runtime currently halts on completion fire and proceeds to Phase 4. The only way completion is FP is if the runtime intervened between fire and Phase 4 transition (rare; max_steps is the realistic case).
Detectors get a safety_critical: boolean field:
| Detector | safety_critical | v0.9.1 can recommend disabled? |
|---|---|---|
| blocker | true | no, capped at loose |
| destructive | true | no, capped at loose |
| completion | true | no, capped at loose |
| confirmation | true | no, capped at loose |
| banned_phrases | false | yes |
| phase_marker | false | yes |
The cap protects against a per-project FP rate convincing the operator to silence a detector that catches real failures on a future brief that drifts.
No recommendation fires until the sample is meaningful. v0.9.0 inherits this rule for the dead flag on tools (only set when sample ≥ 3). v0.9.1 raises the bar:
| Recommendation | Min sample |
|---|---|
| Tool dead | 3 |
Detector sensitivity loose |
5 |
Detector sensitivity disabled (non-safety-critical only) |
10 |
| Phase budget | 5 |
These numbers are guesses. v0.9.1 may revise them once real distributions surface.
Every tool_call event in v0.8 is attributed to a phase by walking the event stream and crediting the call to the most recent phase_enter. The walker is in src/state/history.ts (attributeToolCallsToPhases). v0.9.1 inherits the same walker for budget math; the history report uses it for the per-phase P50 / P90 / max columns.
A trace whose phase_enter event for a phase is missing (a corrupted middle line) credits subsequent tool calls to the previous known phase. Corrupted lines are surfaced separately in the report's footer.
Same trace inputs produce the same report bit-for-bit:
source_runsarrays sorted lexicographically.- JSON serialization with sorted object keys.
- Map iteration via explicit sort.
The deterministic invariant is tested in tests/state/history.test.ts. Breaking it requires updating the test.
v0.9.0 reads from existing v0.8 trace files:
~/.openwar/sessions/<brief_id>.trace.ndjson
v0.9.0 writes nothing. v0.9.1 will write ~/.openwar/projects/<slug>/learned.json.
Override the trace directory with OPENWAR_SESSIONS_DIR (introduced in v0.8).
- Implicit vs explicit
learned_profile:(lean explicit; decide after one quarter of v0.9.0 use). - Profile discovery rules.
- Recommendation expiry: do old runs auto-age out, or is
--sincealways operator-driven? - A/B harness for sensitivity tuning.
All deferred. v0.9.0 is intentionally narrower.
v0.9.1 ships the runtime plumbing for adaptive autonomy with conservative defaults. The system is a no-op for the first 9 runs against any project; the first usable recommendation arrives around run 10. v0.9.2+ patch releases tune the constants in src/state/heuristics.ts against real observation.
openwar learn <slug> # dry run: print candidate profile + diff vs existing
openwar learn <slug> --apply # write ~/.openwar/projects/<slug>/learned.json
openwar learn <slug> --reset # delete existing profile
openwar learn <slug> --since <ISO> # only consider traces newer than the timestamp
openwar learn <slug> --min-samples <N> # floor 5; default 10
openwar learn <slug> --emit-frontmatter # print the YAML snippet for paste-into-brief
All thresholds live in src/state/heuristics.ts as named constants with a paragraph each explaining what would justify lowering them. Patch releases change these constants and document the observation that justified the change in the CHANGELOG. tests/state/heuristics.test.ts pins the current values so an accidental tuning fails CI.
| Constant | v0.9.1 value | What it gates |
|---|---|---|
DETECTOR_LOOSE_FIRE_RATE_BAR |
0.85 | Min fires-per-run to recommend loose |
DETECTOR_LOOSE_MIN_SAMPLES |
10 | Min runs for any loose recommendation |
DETECTOR_DISABLED_FIRE_RATE_BAR |
0.95 | Min fires-per-run to recommend disabled (non-safety only) |
DETECTOR_DISABLED_MIN_SAMPLES |
20 | Min runs for any disabled recommendation |
PHASE_BUDGET_MIN_SAMPLES |
10 | Min runs reaching a phase before recommending a budget |
PHASE_BUDGET_FORMULA |
"p90+5" | Recommended budget is ceil(p90) + 5 |
DEAD_TOOL_MIN_SAMPLES |
10 | Min runs before declaring a tool dead |
---
project: my-project
brief_id: 2026-05-20-X
scope_locked: true
authorized_costs:
- generation_credits
learned_profile: my-project # explicit only; no auto-discovery from project slug
---When learned_profile: is set:
- Runner loads
~/.openwar/projects/<slug>/learned.jsonat session start. - Detector sensitivity overrides thread through the detector pass via
DetectorSensitivityMap. safety_critical: truedetectors (blocker, destructive, completion, confirmation) ignoredisabledand fall back todefault. The consultation record surfaces the attempted override for audit.- Phase budgets apply to
max_stepsin single-agent runs. Multi-agent coordinator runs ignore learned phase budgets in v0.9.1 (different budget primitives; revisited in v0.9.2+). - Brief-explicit settings always win. There is no current brief field that competes with learned sensitivity, but the precedence order is
brief > learned > defaults.
Missing profile file is a soft warning, not an error. The run proceeds with defaults. Schema-version mismatch raises a hard error with a regenerate-via-openwar learn remediation.
Three new event types in the v0.8 union (additive, no schema bump):
| Event | Fires when |
|---|---|
learned_profile_applied |
Once at session start, when a profile loads. Carries counts of detector overrides, phase budgets, and dead-tool callouts. |
learned_sensitivity_consulted |
Per detector consultation with non-default sensitivity. Records the sensitivity value and whether the detector fired or was suppressed. |
learned_budget_consulted |
At each phase enter, with the recommended budget, the actually-applied value, and the source (learned, brief, or default). |
See docs/observability.md for the full event reference.
Shows the on-disk profile for the brief's project slug plus consultation history from the brief's trace. Render:
Learned profile view
brief_id: <id>
slug: <slug>
generated_at: <ISO>
schema_version: 1
source_runs: N
Detector overrides:
detector sensitivity flag fire_rate sample reason
blocker loose safety_critical 0.91 12 ...
Phase budgets:
phase tool_calls p50 p90 sample
execute 18 8 13 12
Tool usage:
tool calls last_used flag
shell_exec 0 - DEAD
Consultation summary:
Applied at: <ISO>
Counts: detectors=1 budgets=1 dead=1
Detector consultations: 14
fired: 3
suppressed: 11
Notes:
- v0.9.1 conservative thresholds active: ...
- Threshold tuning against observed real-world distributions.
- Per-detector "strict" semantics (the parameter is accepted but treated as default in v0.9.1).
- Multi-agent coordinator budget integration.
- Auto-recommendation expiry / age-off.
- A/B sensitivity-tuning harness.
- OpenTelemetry export of the three new event types.