kentdu1996/MazeBreaker

MazeBreaker v3 — Enterprise AI Maze Experience Compiler

Agents already act on the web — but every new enterprise system is a fresh maze. MazeBreaker compiles one agent's wrong turns, human corrections and safety boundaries into governed experience capsules the next agent inherits.

It proves the full loop end-to-end with a deterministic decision-agent demo, plus an optional Browser Use runtime that can drive a real saucedemo.com browser task when an LLM key is configured:

Baseline run (AcmeBoard) → Trace & failure classification → Human corrections
→ Experience distillation → Governed capsule → EvoMap publish/retrieve
→ Guidance compilation → Inherited run (NovaDesk) → Benchmark → Confidence update

For a fuller product-facing document, see PRODUCT_MANUAL.md.

Quick start

Requires Python 3.11+ and Node 18+.

cd mazebreaker

# one-time backend setup
python3.11 -m venv .venv
./.venv/bin/pip install -r backend/requirements.txt

# run backend + frontend together
./.venv/bin/python run.py

Then open http://127.0.0.1:5173.

run.py auto-installs frontend deps on first launch. Use run.py --backend to start only the API.

The 4-act demo (≈3 minutes)

The agent is a real decision-maker, not a hardcoded script. It navigates a site graph (backend/engine/site_model.py) by scoring the links it sees (backend/adapters/policy_agent.py). Its route, step count, wrong turns and takeovers are measured emergent behaviour — change the site graph or the inherited guidance and the numbers change. (Exact figures below are illustrative of a typical run, not constants.)
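To make the "scoring the links it sees" idea concrete, here is a minimal sketch of a site-graph agent. It is not the code in site_model.py or policy_agent.py: the `Link` fields, the toy graph and the scoring weights are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    label: str         # text the agent sees on the page
    target: str        # page the link leads to
    prominence: float  # how visually loud the link is (0..1), assumed signal


# Hypothetical miniature site graph: page -> outgoing links.
SITE = {
    "home": [Link("Start free trial", "pricing", 0.9),
             Link("Login", "console", 0.2)],
    "pricing": [Link("Contact Sales", "sales", 0.8),
                Link("Home", "home", 0.3)],
    "sales": [],        # dead end
    "console": [Link("Team", "team", 0.5)],
    "team": [Link("Invite member", "invite_form", 0.5)],
    "invite_form": [],  # goal page
}


def score(link, visits):
    # Naive policy: follow prominence, penalise pages already visited.
    return link.prominence - 0.5 * visits.get(link.target, 0)


def run(start, goal, max_steps=20):
    page, route, visits = start, [start], {start: 1}
    for _ in range(max_steps):
        if page == goal:
            break
        links = SITE[page]
        if not links:  # stuck: a real run would need a human takeover here
            break
        best = max(links, key=lambda link: score(link, visits))
        page = best.target
        visits[page] = visits.get(page, 0) + 1
        route.append(page)
    return route
```

With no guidance, `run("home", "invite_form")` follows the loudest links and ends at the `sales` dead end. Editing `SITE` or the weights in `score` changes the route, which is the sense in which the behaviour is measured rather than scripted.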

  1. Mission Control: pick the AcmeBoard workspace-setup task, set capsule mode OFF, and run the baseline.
  2. Baseline Run: with no prior experience, the agent is lured by the prominent marketing CTAs, loops on Pricing, and hits the Contact Sales dead end. It gets stuck until a human takeover points it to Login, then finds the invite form and stops at the Send Invite boundary. About 13 steps and several wrong turns, all measured.
  3. Capsule Forge: distill that real trace into a Governed Experience Capsule (strategy / avoid / recovery / safety + owner, verification level, permission scope, validity, rollback signals). Confirm governance, then publish to Mock EvoMap.
  4. Inheritance Benchmark: the second agent (NovaDesk) retrieves the capsule, which is compiled into machine-readable guidance and injected into the agent's scoring policy. It now avoids the traps and goes straight to the console → team → collaborators → invite form (~7 steps, 0 wrong turns), still stopping before Send.

Result: the step / wrong-turn / takeover reductions are computed from the two real runs — proof the inherited experience actually changed the agent's behaviour, not a pre-written number.
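The reduction arithmetic can be sketched as follows. This is an illustration, not the code in benchmark_engine.py: the `RunSummary` fields and the `compare` function are assumed shapes, and the wrong-turn and takeover counts in the usage note are placeholders for "several" and "one".

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    # Hypothetical fields; the real RunSummary may carry more.
    steps: int
    wrong_turns: int
    takeovers: int


def reduction(before, after):
    # Fractional reduction relative to the baseline; 0.0 if nothing to reduce.
    if before == 0:
        return 0.0
    return (before - after) / before


def compare(baseline, inherited):
    return {
        "step_reduction": reduction(baseline.steps, inherited.steps),
        "wrong_turn_reduction": reduction(baseline.wrong_turns,
                                          inherited.wrong_turns),
        "takeover_reduction": reduction(baseline.takeovers,
                                        inherited.takeovers),
    }
```

For example, `compare(RunSummary(13, 4, 1), RunSummary(7, 0, 0))` gives a step reduction of 6/13, roughly 46%, and full reductions for wrong turns and takeovers.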

How guidance injection works (this is what makes inheritance real)

guidance_compiler.compile_agent_guidance() turns matched capsules into a machine-readable policy hint — {prefer, avoid_kinds, avoid_signals, stop_words} — which create_run injects into the agent before the inherited run. The agent's link-scoring policy reads it and re-weights its choices. Run the same NovaDesk task with vs without the capsule and the trace genuinely differs (covered by test_guidance_injection_changes_behaviour).
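A minimal sketch of how such a hint could re-weight link scores is shown below. Only the four key names (prefer, avoid_kinds, avoid_signals, stop_words) come from this README; the link dictionary shape, the `kind` values and the +1.0 / -1.0 weights are assumptions, and the real policy may interpret the keys differently.

```python
def score_link(link, guidance=None):
    # Base score: how prominent the link looks (assumed signal).
    score = link["prominence"]
    if not guidance:
        return score
    label = link["label"].lower()
    if any(word.lower() in label for word in guidance.get("prefer", [])):
        score += 1.0  # inherited strategy: go here
    if link.get("kind") in guidance.get("avoid_kinds", []):
        score -= 1.0  # inherited avoid rule by link kind
    if any(sig.lower() in label for sig in guidance.get("avoid_signals", [])):
        score -= 1.0  # inherited avoid rule by label text
    return score


def choose(links, guidance=None):
    # Links matching a stop word are never candidates for a click.
    stop_words = [w.lower() for w in (guidance or {}).get("stop_words", [])]
    candidates = [
        link for link in links
        if not any(w in link["label"].lower() for w in stop_words)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda link: score_link(link, guidance))


links = [
    {"label": "Start free trial", "kind": "marketing_cta", "prominence": 0.9},
    {"label": "Login", "kind": "nav", "prominence": 0.2},
]
guidance = {
    "prefer": ["login"],
    "avoid_kinds": ["marketing_cta"],
    "avoid_signals": ["contact sales"],
    "stop_words": ["send invite"],
}
```

Without guidance, `choose(links)` picks the marketing CTA; with `guidance` it picks Login. The same page yields a different decision purely because of the inherited hint.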

One-click auto play (pitch mode)

Click ▶ Auto play (pitch) in the header to run the whole story automatically: baseline → forge → govern → publish → inherit → benchmark. It waits for each run's step-by-step animation to finish before narrating the next act. Click again to stop.

Safety guardrail: auto-deprecate an unsafe capsule

After the inherited run, click ⚠ Replay as UNSAFE variant on the benchmark page. The agent navigates fast but actually clicks Send Invite (boundary breached). The benchmark flags safety_breached, MazeBreaker auto-deprecates the capsule, and a follow-up match returns nothing — a deprecated capsule can never be inherited again. This demonstrates the lifecycle rule: a capsule that cannot hold its safety boundary is rolled back regardless of step savings.
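The lifecycle rule can be sketched like this. The `Capsule` fields, the status strings and the confidence increment are assumptions for illustration, not the behaviour of evomap_mock.py or benchmark_engine.py.

```python
from dataclasses import dataclass


@dataclass
class Capsule:
    # Hypothetical minimal capsule record.
    capsule_id: str
    status: str = "published"
    confidence: float = 0.5


def apply_benchmark(capsule, steps_saved, safety_breached):
    # A safety breach wins over any step savings: roll the capsule back.
    if safety_breached:
        capsule.status = "deprecated"
        return capsule
    if steps_saved > 0:
        capsule.confidence = min(1.0, capsule.confidence + 0.1)
    return capsule


def match(capsules):
    # Deprecated capsules are never offered for inheritance.
    return [c for c in capsules if c.status == "published"]
```

A capsule that saved six steps but breached the boundary ends up deprecated with unchanged confidence, and a follow-up `match` returns an empty list.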

Pluggable agent runtimes

The header shows available runtimes (scripted / stagehand·off / browser_use·off). The runtime layer (backend/adapters/runtime.py) selects an adapter per task and falls back to scripted when a live runtime is unavailable or when a live run fails (network error, no LLM key, timeout) — so the demo never breaks. stagehand_adapter.py is still a stub seam; browser_use_adapter.py is a working real-browser integration (see below).
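The select-then-fall-back behaviour can be sketched as below. This is not the code in runtime.py: the function names, the runtime registry and the return shape are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# A runtime takes a task id and returns a list of raw trace steps.
Runtime = Callable[[str], List[str]]


def scripted_runtime(task: str) -> List[str]:
    return [f"scripted step for {task}"]


def failing_live_runtime(task: str) -> List[str]:
    # Stand-in for a live run that hits a network error or timeout.
    raise TimeoutError("live browser run timed out")


def run_with_fallback(
    task: str,
    preferred: str,
    runtimes: Dict[str, Runtime],
    available: Set[str],
) -> Tuple[str, List[str]]:
    # Try the preferred runtime only if it is registered and reported available.
    if preferred in available and preferred in runtimes:
        try:
            return preferred, runtimes[preferred](task)
        except Exception:
            pass  # any live failure degrades to the scripted runtime
    return "scripted", runtimes["scripted"](task)
```

Returning the name of the runtime that actually ran lets the caller show whether the trace came from the live or the scripted path.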

Live browser run (real site via Browser Use)

The saucedemo task (domain consumer_checkout, runtime browser_use) drives a real LLM agent in a real Chromium browser against the public sandbox saucedemo.com: log in, add an item, go through checkout, and hard-stop at the order-overview page before the irreversible Finish. The Mission Control page has a ⚡ Live browser run panel with a Run live agent on real site button; the resulting trace plays back through the same normalizer / classifier / safety gate as every other run.

Because browser-use pins newer starlette/uvicorn than the FastAPI app tolerates, it lives in a separate virtualenv and is invoked as a subprocess (browser_use_worker.py). The safety boundary is enforced two ways: a natural-language instruction and a hard should_stop callback that halts the agent the moment the live URL reaches the payment-review page (before any Finish click). Inherited capsule guidance is injected via the agent's system message.
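The hard stop boils down to a URL predicate. The sketch below shows only that predicate, not how it is wired into browser-use; the path marker is an assumption about the sandbox's order-overview URL and the function is not taken from browser_use_worker.py.

```python
from urllib.parse import urlparse

# Assumed marker for the order-overview (payment review) page.
STOP_PATH_MARKERS = ("checkout-step-two",)


def should_stop(current_url: str) -> bool:
    # Halt as soon as the live URL reaches a page where the next click
    # could be the irreversible Finish.
    path = urlparse(current_url).path.lower()
    return any(marker in path for marker in STOP_PATH_MARKERS)
```

Checking the URL rather than the agent's own report means the boundary holds even if the LLM ignores the natural-language instruction.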

To enable the live runtime:

# one-time: isolated venv with browser-use + Chromium
python3.11 -m venv .venv-browser
./.venv-browser/bin/pip install browser-use
./.venv-browser/bin/python -m playwright install chromium

# provide an OpenAI-compatible LLM key (never commit it)
export OPENAI_API_KEY=sk-...
export OPENAI_BASE_URL=...            # optional (e.g. an OpenAI-compatible gateway)
export MAZEBREAKER_LLM_MODEL=gpt-4o   # optional, default gpt-4o
export MAZEBREAKER_BROWSER_HEADLESS=0 # optional, show the browser during the pitch

With the venv + key present, GET /api/runtimes reports browser_use available and the live button drives a real browser. Without them, the same button runs the scripted decision agent on the same saucedemo checkout maze (site_model.py) — identical trace shape, so the demo is robust either way.
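One plausible way to decide availability and read the optional settings is sketched below. The environment variable names and the gpt-4o default come from this README; the function names, the venv path check and the headless default are assumptions, not the actual /api/runtimes logic.

```python
import os
from pathlib import Path


def browser_use_available(root=".", env=None):
    # Available only if the isolated venv exists and an LLM key is set.
    env = os.environ if env is None else env
    python = Path(root) / ".venv-browser" / "bin" / "python"
    return python.exists() and bool(env.get("OPENAI_API_KEY"))


def worker_config(env=None):
    env = os.environ if env is None else env
    return {
        "model": env.get("MAZEBREAKER_LLM_MODEL", "gpt-4o"),
        # Assumption: headless unless explicitly set to 0.
        "headless": env.get("MAZEBREAKER_BROWSER_HEADLESS", "1") != "0",
        "base_url": env.get("OPENAI_BASE_URL"),  # None means provider default
    }
```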

Architecture

backend/
  app.py                 FastAPI endpoints
  models.py              dataclasses (incl. v3 governance fields)
  storage.py             JSON-file persistence
  security.py            credential redaction + high-risk detection
  engine/
    failure_classifier.py    rule-based step labeling (no LLM)
    trace_normalizer.py      raw steps → TraceStep + RunSummary + corrections
    experience_distiller.py  strategy/avoid/recovery/safety extraction
    capsule_generator.py     governed capsule assembly
    capsule_matcher.py       weighted scoring + rejection rules
    guidance_compiler.py     verification-level-layered guidance
    benchmark_engine.py      baseline vs inherited comparison
  adapters/
    policy_agent.py          decision-making site-graph agent
    scripted_runtime.py      default runtime wrapper around the policy agent
    browser_use_adapter.py   live Browser Use runtime adapter
    browser_use_worker.py    isolated browser-use subprocess worker
    evomap_mock.py           publish / fetch / report / confidence / deprecate
  data/                  JSON store + seed_tasks.json
frontend/                React + Vite + TypeScript (4 pages, 5 components)
tests/                   pytest (classifier, distiller, matcher, compiler,
                         benchmark, integration, security) — 38 tests
run.py                   one-command launcher

Key API endpoints

Method  Path                    Purpose
GET     /api/tasks              list seeded tasks
POST    /api/runs               run scripted agent (baseline / inherited)
POST    /api/capsules/generate  distill capsule from a run
POST    /api/capsules/govern    set owner / verification level / scope
POST    /api/capsules/publish   publish to Mock EvoMap
POST    /api/capsules/match     retrieve + score candidates for a new task
POST    /api/guidance/compile   compile capsule → layered guidance
POST    /api/benchmark          compare runs + update confidence/lifecycle
POST    /api/safety/check       high-risk action gate

Governance & safety (v3)

  • Verification levels: observation → reproduced → human_confirmed → enforced. Human corrections default to observation (candidate only); promotion to human_confirmed/enforced requires an owner.
  • High-risk words (Send/Invite/Delete/Charge/Pay/Confirm…) trigger a stop; the agent never clicks Send Invite.
  • Credentials (password/cookie/token/api_key) are redacted before anything is stored.
  • Expired / deprecated / permission-mismatched / unsafe capsules are not used and do not raise confidence.

Tests

./.venv/bin/python -m pytest tests/ -q

38 tests covering the documented test cases (failure classification, distillation, matching/rejection, guidance, benchmark math, full pipeline, and security redaction).

P0 scope

The scripted runtime is the intentional default: the demo proves the experience-compilation loop, not a general browser agent. Stagehand, Playwright and a real EvoMap API are designed as drop-in adapters for P1; Browser Use is already wired in as the optional live runtime described above.

About

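The maze experience compiler for enterprise AI: it compiles an agent's pitfalls and human corrections into governable, inheritable experience capsules.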
