Agents already act on the web — but every new enterprise system is a fresh maze. MazeBreaker compiles one agent's wrong turns, human corrections and safety boundaries into governed experience capsules the next agent inherits.
It proves the full loop end-to-end with a deterministic decision-agent demo, plus an optional Browser Use runtime that can drive a real saucedemo.com browser task when an LLM key is configured:
Baseline run (AcmeBoard) → Trace & failure classification → Human corrections
→ Experience distillation → Governed capsule → EvoMap publish/retrieve
→ Guidance compilation → Inherited run (NovaDesk) → Benchmark → Confidence update
For a fuller product-facing document, see PRODUCT_MANUAL.md.
Requires Python 3.11+ and Node 18+.
cd mazebreaker
# one-time backend setup
python3.11 -m venv .venv
./.venv/bin/pip install -r backend/requirements.txt
# run backend + frontend together
./.venv/bin/python run.py
Then open http://127.0.0.1:5173.
- Backend API: http://127.0.0.1:8000 (FastAPI, docs at /docs)
- Frontend dev server: http://127.0.0.1:5173 (Vite, proxies /api → backend)
run.py auto-installs frontend deps on first launch. Use run.py --backend to
start only the API.
The agent is a real decision-maker, not a hardcoded script. It navigates a site graph (backend/engine/site_model.py) by scoring the links it sees (backend/adapters/policy_agent.py). Its route, step count, wrong turns and takeovers are measured emergent behaviour — change the site graph or the inherited guidance and the numbers change. (Exact figures below are illustrative of a typical run, not constants.)
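To make the "measured, not hardcoded" claim concrete, here is a minimal sketch of a link-scoring walk over a site graph. It is not the code in backend/adapters/policy_agent.py: the `Link` type, the `prominence` feature, the revisit penalty, and the miniature graph are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    label: str
    target: str
    prominence: float  # hypothetical feature: how visually dominant the link is


# Hypothetical miniature site graph: page name -> outgoing links.
SITE = {
    "home": [Link("Start free trial", "pricing", 0.9), Link("Login", "console", 0.3)],
    "pricing": [Link("Contact Sales", "sales", 0.8), Link("Home", "home", 0.2)],
    "sales": [],  # dead end
    "console": [Link("Team", "team", 0.5)],
    "team": [Link("Invite", "invite_form", 0.6)],
    "invite_form": [],
}


def score(link, visits):
    # Prominent links attract the agent; already-visited targets are penalised.
    return link.prominence - 0.5 * visits.get(link.target, 0)


def walk(site, start, goal, max_steps=10):
    """Greedily follow the best-scoring link until the goal, a dead end, or the step cap."""
    page, route, visits = start, [start], {start: 1}
    for _ in range(max_steps):
        if page == goal or not site[page]:
            break
        best = max(site[page], key=lambda link: score(link, visits))
        page = best.target
        visits[page] = visits.get(page, 0) + 1
        route.append(page)
    return route


baseline_route = walk(SITE, "home", "invite_form")

# Changing the graph changes the measured route: make Login the dominant link.
edited = dict(SITE, home=[Link("Start free trial", "pricing", 0.9),
                          Link("Login", "console", 1.0)])
edited_route = walk(edited, "home", "invite_form")
```

In this toy graph the unguided walk is lured to the Contact Sales dead end, while the edited graph leads straight to the invite form. Nothing about either route is written down in advance; both fall out of the scores.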
- Mission Control — pick the AcmeBoard workspace-setup task, capsule mode OFF, run baseline.
- Baseline Run — with no prior experience the agent is lured by the prominent marketing CTAs, loops on Pricing, hits the Contact Sales dead end, gets stuck and is rescued by a human takeover pointing it to Login, then finds the invite form and stops at the Send Invite boundary. ~13 steps, several wrong turns — all measured.
- Capsule Forge — distill that real trace into a Governed Experience Capsule (strategy / avoid / recovery / safety + owner, verification level, permission scope, validity, rollback signals). Confirm governance, publish to Mock EvoMap.
- Inheritance Benchmark — the second agent (NovaDesk) retrieves the capsule, the capsule is compiled into machine-readable guidance and injected into the agent's scoring policy. It now avoids the traps and goes straight to the console → team → collaborators → invite form (~7 steps, 0 wrong turns), still stopping before Send.
Result: the step / wrong-turn / takeover reductions are computed from the two real runs — proof the inherited experience actually changed the agent's behaviour, not a pre-written number.
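The arithmetic behind such a comparison can be sketched as below. This is an assumption about the shape of the calculation, not the contents of benchmark_engine.py; `RunStats`, `reduction`, and `compare` are hypothetical names, and the input figures are illustrative.

```python
from dataclasses import dataclass


@dataclass
class RunStats:  # hypothetical container, not the project's RunSummary
    steps: int
    wrong_turns: int
    takeovers: int


def reduction(before, after):
    """Fractional reduction relative to the baseline; 0.0 if the baseline is already 0."""
    return 0.0 if before == 0 else (before - after) / before


def compare(baseline, inherited):
    return {
        "step_reduction": reduction(baseline.steps, inherited.steps),
        "wrong_turn_reduction": reduction(baseline.wrong_turns, inherited.wrong_turns),
        "takeover_reduction": reduction(baseline.takeovers, inherited.takeovers),
    }


# Illustrative figures: 13 and 7 steps echo the typical run described above;
# the wrong-turn count of 4 is a made-up stand-in for "several".
baseline = RunStats(steps=13, wrong_turns=4, takeovers=1)
inherited = RunStats(steps=7, wrong_turns=0, takeovers=0)
result = compare(baseline, inherited)
print({name: round(value, 2) for name, value in result.items()})
```

With these inputs the step reduction is 6/13, roughly 46%. Because the inputs come from two recorded runs, a different site graph or capsule yields different percentages.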
guidance_compiler.compile_agent_guidance() turns matched capsules into a
machine-readable policy hint — {prefer, avoid_kinds, avoid_signals, stop_words} —
which create_run injects into the agent before the inherited run. The agent's
link-scoring policy reads it and re-weights its choices. Run the same NovaDesk task
with vs without the capsule and the trace genuinely differs (covered by
test_guidance_injection_changes_behaviour).
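A minimal sketch of how a hint with those four keys could re-weight link choices follows. The key names come from the text above; their exact semantics, the link dictionary shape, and the weights are assumptions made for illustration, not the project's implementation.

```python
def guided_score(link, hint):
    """Base prominence, re-weighted by the hint (all weights are arbitrary)."""
    label = link["label"].lower()
    score = link["prominence"]
    if any(word in label for word in hint.get("prefer", [])):
        score += 1.0  # boost links the capsule recommends
    if link["kind"] in hint.get("avoid_kinds", []):
        score -= 1.0  # penalise whole categories of link
    if any(signal in label for signal in hint.get("avoid_signals", [])):
        score -= 1.0  # penalise specific known traps
    return score


def choose(links, hint=None):
    """Return the label of the best link, never one that matches a stop word."""
    hint = hint or {}
    stop_words = hint.get("stop_words", [])
    allowed = [l for l in links
               if not any(word in l["label"].lower() for word in stop_words)]
    if not allowed:
        return None
    return max(allowed, key=lambda l: guided_score(l, hint))["label"]


# Hypothetical links visible on one page.
links = [
    {"label": "Start free trial", "kind": "marketing_cta", "prominence": 0.9},
    {"label": "Contact Sales", "kind": "marketing_cta", "prominence": 0.8},
    {"label": "Login", "kind": "nav", "prominence": 0.3},
    {"label": "Send Invite", "kind": "action", "prominence": 0.7},
]
hint = {
    "prefer": ["login", "team"],
    "avoid_kinds": ["marketing_cta"],
    "avoid_signals": ["contact sales"],
    "stop_words": ["send"],
}

without_capsule = choose(links)
with_capsule = choose(links, hint)
```

Here the same page yields "Start free trial" without the hint and "Login" with it, which is the kind of trace difference the test named above is described as covering.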
Click ▶ Auto play (pitch) in the header to run the whole story automatically — baseline → forge → govern → publish → inherit → benchmark — and it waits for each run's step-by-step animation to finish before narrating the next act. Click again to stop.
After the inherited run, click ⚠ Replay as UNSAFE variant on the benchmark
page. The agent navigates fast but actually clicks Send Invite (boundary
breached). The benchmark flags safety_breached, MazeBreaker auto-deprecates
the capsule, and a follow-up match returns nothing — a deprecated capsule can
never be inherited again. This demonstrates the lifecycle rule: a capsule that
cannot hold its safety boundary is rolled back regardless of step savings.
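The lifecycle rule can be sketched as a small state update. The `Capsule` fields, the status strings, and the confidence increment below are hypothetical; only the rule itself (a safety breach deprecates the capsule regardless of step savings, and deprecated capsules never match) is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Capsule:  # hypothetical minimal shape
    capsule_id: str
    status: str = "published"
    confidence: float = 0.5


def apply_benchmark(capsule, step_reduction, safety_breached):
    """Update lifecycle state after a benchmark; safety outranks efficiency."""
    if safety_breached:
        capsule.status = "deprecated"  # rolled back; confidence is not raised
        return capsule
    if step_reduction > 0:
        capsule.confidence = min(1.0, capsule.confidence + 0.1)
    return capsule


def match(capsules):
    """Only published capsules are eligible for inheritance."""
    return [c for c in capsules if c.status == "published"]


capsule = Capsule("workspace-setup")
apply_benchmark(capsule, step_reduction=0.6, safety_breached=True)
candidates = match([capsule])
```

Checking the breach before the step savings is the design point: a fast but unsafe run cannot earn confidence.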
The header shows available runtimes (scripted / stagehand·off / browser_use·off).
The runtime layer (backend/adapters/runtime.py) selects an
adapter per task and falls back to scripted when a live runtime is unavailable
or when a live run fails (network error, no LLM key, timeout) — so the demo never breaks.
stagehand_adapter.py is still a stub seam;
browser_use_adapter.py is a working
real-browser integration (see below).
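The fallback behaviour can be illustrated with a short sketch. It does not reproduce backend/adapters/runtime.py; `run_with_fallback`, the task dictionary, and the runner callables are hypothetical.

```python
def run_with_fallback(task, runtimes, default="scripted"):
    """Try the task's requested runtime; on absence or failure, use the default."""
    name = task.get("runtime", default)
    runner = runtimes.get(name)
    if runner is not None and name != default:
        try:
            return name, runner(task)
        except (OSError, RuntimeError) as exc:
            # OSError covers ConnectionError and TimeoutError in Python 3.
            print(f"{name} failed ({exc!r}); falling back to {default}")
    return default, runtimes[default](task)


def scripted(task):
    # Stand-in for the deterministic decision agent.
    return [f"scripted:{task['id']}"]


def flaky_live(task):
    # Stand-in for a live runtime that times out.
    raise TimeoutError("live browser did not respond")


used, trace = run_with_fallback(
    {"id": "saucedemo", "runtime": "browser_use"},
    {"scripted": scripted, "browser_use": flaky_live},
)
```

The caller always receives a trace plus the name of the runtime that actually produced it, so the UI can show when a fallback happened.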
The saucedemo task (domain consumer_checkout, runtime browser_use) drives a
real LLM agent in a real Chromium browser against the public sandbox
saucedemo.com: log in, add an item, go through
checkout, and hard-stop at the order-overview page before the irreversible
Finish. The Mission Control page has a ⚡ Live browser run panel with a
Run live agent on real site button; the resulting trace plays back through the
same normalizer / classifier / safety gate as every other run.
Because browser-use pins newer starlette/uvicorn than the FastAPI app
tolerates, it lives in a separate virtualenv and is invoked as a subprocess
(browser_use_worker.py). The safety
boundary is enforced two ways: a natural-language instruction and a hard
should_stop callback that halts the agent the moment the live URL reaches the
payment-review page (before any Finish click). Inherited capsule guidance is
injected via the agent's system message.
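A URL-based hard stop of this kind can be sketched independently of any browser library. The code below does not use the browser-use API; `make_should_stop` and `run_until_boundary` are hypothetical helpers, and the `checkout-step-two` path marker and the example URLs are assumptions about the sandbox's page names.

```python
from urllib.parse import urlparse

# Assumption: the order-overview page is identified by this path fragment.
STOP_MARKERS = ("checkout-step-two",)


def make_should_stop(get_current_url, markers=STOP_MARKERS):
    """Build a zero-argument predicate that is True once the boundary page is reached."""
    def should_stop():
        path = urlparse(get_current_url()).path.lower()
        return any(marker in path for marker in markers)
    return should_stop


def run_until_boundary(urls, markers=STOP_MARKERS):
    """Simulate an agent visiting URLs in order, halting at the boundary."""
    visited = []
    current = {"url": ""}
    should_stop = make_should_stop(lambda: current["url"], markers)
    for url in urls:
        current["url"] = url
        visited.append(url)
        if should_stop():
            break  # hard stop: nothing after this page is executed
    return visited


visited = run_until_boundary([
    "https://www.saucedemo.com/inventory.html",
    "https://www.saucedemo.com/cart.html",
    "https://www.saucedemo.com/checkout-step-one.html",
    "https://www.saucedemo.com/checkout-step-two.html",
    "https://www.saucedemo.com/checkout-complete.html",
])
```

Checking the URL in code, in addition to the natural-language instruction, means the boundary holds even if the LLM ignores or misreads its prompt.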
To enable the live runtime:
# one-time: isolated venv with browser-use + Chromium
python3.11 -m venv .venv-browser
./.venv-browser/bin/pip install browser-use
./.venv-browser/bin/python -m playwright install chromium
# provide an OpenAI-compatible LLM key (never commit it)
export OPENAI_API_KEY=sk-...
export OPENAI_BASE_URL=... # optional (e.g. an OpenAI-compatible gateway)
export MAZEBREAKER_LLM_MODEL=gpt-4o # optional, default gpt-4o
export MAZEBREAKER_BROWSER_HEADLESS=0 # optional, show the browser during the pitch
With the venv + key present, GET /api/runtimes reports browser_use available
and the live button drives a real browser. Without them, the same button runs the
scripted decision agent on the same saucedemo checkout maze
(site_model.py) — identical trace shape, so the
demo is robust either way.
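The availability rule (isolated venv plus key) could be checked roughly as follows. The environment variable names and the .venv-browser path come from the setup steps above; the function names and the assumption that the browser runs headless unless MAZEBREAKER_BROWSER_HEADLESS is 0 are illustrative guesses, not the project's code.

```python
import os
from pathlib import Path


def live_runtime_available(project_root, env=None):
    """True only when the isolated venv's interpreter exists and an LLM key is set."""
    env = os.environ if env is None else env
    venv_python = Path(project_root) / ".venv-browser" / "bin" / "python"
    return venv_python.exists() and bool(env.get("OPENAI_API_KEY"))


def llm_settings(env=None):
    """Collect the optional settings, applying the documented default model."""
    env = os.environ if env is None else env
    return {
        "model": env.get("MAZEBREAKER_LLM_MODEL", "gpt-4o"),
        "base_url": env.get("OPENAI_BASE_URL"),  # None when no gateway is set
        # Assumption: headless unless explicitly set to "0".
        "headless": env.get("MAZEBREAKER_BROWSER_HEADLESS", "1") != "0",
    }


print(live_runtime_available(".", {}))  # False: no key supplied
print(llm_settings({}))
```

Passing `env` explicitly keeps the check easy to test without touching the real process environment.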
backend/
  app.py                      FastAPI endpoints
  models.py                   dataclasses (incl. v3 governance fields)
  storage.py                  JSON-file persistence
  security.py                 credential redaction + high-risk detection
  engine/
    failure_classifier.py     rule-based step labeling (no LLM)
    trace_normalizer.py       raw steps → TraceStep + RunSummary + corrections
    experience_distiller.py   strategy/avoid/recovery/safety extraction
    capsule_generator.py      governed capsule assembly
    capsule_matcher.py        weighted scoring + rejection rules
    guidance_compiler.py      verification-level-layered guidance
    benchmark_engine.py       baseline vs inherited comparison
  adapters/
    policy_agent.py           decision-making site-graph agent
    scripted_runtime.py       default runtime wrapper around the policy agent
    browser_use_adapter.py    live Browser Use runtime adapter
    browser_use_worker.py     isolated browser-use subprocess worker
    evomap_mock.py            publish / fetch / report / confidence / deprecate
  data/                       JSON store + seed_tasks.json
frontend/                     React + Vite + TypeScript (4 pages, 5 components)
tests/                        pytest, 38 tests (classifier, distiller, matcher, compiler, benchmark, integration, security)
run.py                        one-command launcher
| Method | Path | Purpose |
|---|---|---|
| GET | /api/tasks | list seeded tasks |
| POST | /api/runs | run scripted agent (baseline / inherited) |
| POST | /api/capsules/generate | distill capsule from a run |
| POST | /api/capsules/govern | set owner / verification level / scope |
| POST | /api/capsules/publish | publish to Mock EvoMap |
| POST | /api/capsules/match | retrieve + score candidates for a new task |
| POST | /api/guidance/compile | compile capsule → layered guidance |
| POST | /api/benchmark | compare runs + update confidence/lifecycle |
| POST | /api/safety/check | high-risk action gate |
- Verification levels: observation → reproduced → human_confirmed → enforced. Human corrections default to observation (candidate only); promotion to human_confirmed / enforced requires an owner.
- High-risk words (Send/Invite/Delete/Charge/Pay/Confirm…) trigger a stop; the agent never clicks Send Invite.
- Credentials (password/cookie/token/api_key) are redacted before anything is stored.
- Expired / deprecated / permission-mismatched / unsafe capsules are not used and do not raise confidence.
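The first three rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of security.py: the function names are hypothetical, the high-risk list contains only the words named above (the source list is truncated), matching is whole-word, and redaction is keyed on field names.

```python
import re

HIGH_RISK = ("send", "invite", "delete", "charge", "pay", "confirm")
SECRET_KEYS = ("password", "cookie", "token", "api_key")
LEVELS = ["observation", "reproduced", "human_confirmed", "enforced"]


def is_high_risk(label):
    """True when any whole word of the label is a high-risk word."""
    words = re.findall(r"[a-z]+", label.lower())
    return any(word in HIGH_RISK for word in words)


def redact(record):
    """Replace values whose key looks like a credential before storage."""
    return {
        key: "[REDACTED]" if any(s in key.lower() for s in SECRET_KEYS) else value
        for key, value in record.items()
    }


def promote(current, target, owner):
    """Move a capsule up the verification ladder; upper levels need an owner."""
    if LEVELS.index(target) <= LEVELS.index(current):
        raise ValueError("target level must be higher than the current level")
    if target in ("human_confirmed", "enforced") and not owner:
        raise PermissionError("promotion to this level requires an owner")
    return target


print(is_high_risk("Send Invite"))            # True
print(redact({"user": "ana", "password": "hunter2"}))
print(promote("observation", "reproduced", owner=None))
```

Whole-word matching is a deliberate choice in this sketch so that a label such as "Payment history" is not blocked by the word "pay"; a stricter gate could match substrings instead.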
./.venv/bin/python -m pytest tests/ -q
38 tests covering the documented test cases (failure classification, distillation, matching/rejection, guidance, benchmark math, full pipeline, and security redaction).
Scripted runtime is intentional — the demo proves the experience-compilation loop, not a general browser agent. Stagehand / Browser Use / Playwright and a real EvoMap API are designed as drop-in adapters for P1.