Mock-first mobile QA proof module for the Phantom Mesh ecosystem.
Give it one sentence describing what a user is trying to do ("sign up and complete checkout"). It generates a dozen realistic scenario variants — different locale, low battery, flaky network, font scaling, screen rotation — runs them across a device matrix, and produces a pass/warn/fail report. Live emulator driving and vision-LLM judging are the hardening path; the reliable public demo today is the mock matrix.
Latest local verification:
- python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock: passed
- python -m pytest tests -q: 70 passed
The current claim covers scenario generation, matrix orchestration, and report evidence; it does not claim that live emulator/VLM execution is fully hardened yet.
A multi-device Android testing prototype where the test author is an AI agent, not a human writing Espresso scripts.
Four cooperating agents handle the whole loop:
| Agent | Role |
|---|---|
| Generator (LLM) | turns one-sentence PRD into ~10 scenario variants |
| Explorer (UIAutomator2) | drives the app on a real / virtual device, captures screenshots |
| Verifier (vision LLM) | looks at each screenshot + asks "does this UX look broken?" |
| Reporter | rolls every cell of the scenario × device matrix into one HTML report |
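The loop above can be sketched in a few lines of Python. This is a minimal mock-mode illustration, not the repo's actual API: the function names are hypothetical, and the matrix is reduced to 5 stressors × 5 devices for brevity (the real run fans out 12 scenarios × 5 devices).

```python
# Hypothetical sketch of the four-agent loop in mock mode.
STRESSORS = ["offline-drop", "font-scale-175", "rtl-locale", "low-battery", "talkback"]
DEVICES = ["Pixel 4", "Pixel 7", "Pixel Fold", "Pixel Tablet", "Nexus 5X"]

def generate(story: str) -> list[dict]:
    """Generator: expand a one-sentence story into scenario variants."""
    return [{"story": story, "stressor": s} for s in STRESSORS]

def explore(scenario: dict, device: str) -> str:
    """Explorer: drive the app under the stressor, return a screenshot path."""
    return f"shots/{device}/{scenario['stressor']}.png"

def verify(screenshot: str) -> str:
    """Verifier: vision-LLM judgment; the mock always passes."""
    return "pass"

def run_matrix(story: str) -> dict:
    """Matrix runner: one verdict per (stressor, device) cell for the Reporter."""
    return {
        (sc["stressor"], dev): verify(explore(sc, dev))
        for sc in generate(story)
        for dev in DEVICES
    }

cells = run_matrix("User signs up and completes checkout")
print(len(cells))  # → 25 (5 stressors × 5 devices)
```

The Reporter then only has to pivot this `(stressor, device) → verdict` mapping into the HTML grid.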
A simulation engine wraps the Explorer so each scenario runs under a chosen environmental stressor — network drop, battery low, RTL locale, font-scale 175%, screen rotation, app backgrounding mid-flow. Most production bugs hide here, not on the happy path.
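The wrapping can be pictured as a context manager around each cell run. This is an illustrative sketch, not the repo's simulation/ API; the `adb shell cmd battery` commands are real Android shell commands, and the wrapper degrades to a no-op when adb is absent:

```python
# Illustrative stressor wrapper: applies a battery condition around one
# scenario run and restores it afterwards; no-op when adb is unavailable.
import contextlib
import shutil
import subprocess

def adb(*args: str) -> bool:
    """Run an adb command; return False (skip) if adb isn't installed."""
    if shutil.which("adb") is None:
        return False
    subprocess.run(["adb", *args], check=False)
    return True

@contextlib.contextmanager
def low_battery(level: int = 5):
    """Pretend the battery is nearly empty for the duration of a run."""
    applied = adb("shell", "cmd", "battery", "set", "level", str(level))
    try:
        yield applied
    finally:
        if applied:
            adb("shell", "cmd", "battery", "reset")

with low_battery() as active:
    print("stressor applied:", active)  # False on machines without adb
```

The other stressors (locale flip, network drop, font scale) follow the same apply/restore shape around the Explorer's run of one cell.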
PRD: "User signs up, verifies email, completes onboarding tutorial."
│
▼
Generator agent (LLM)
│
▼
┌──────────────┐ 12 scenario specs
│ JP locale │ (each variant lives in
│ 175% font │ one cell of the matrix)
│ 5G → 3G drop │
│ low battery │
│ TalkBack on │
│ ... │
└──────────────┘
│
▼
┌─────── 5 emulators ────────┐
│ Pixel 4 Pixel 7 Pixel Fold │
│ Pixel Tablet Nexus 5X │
│ (12 × 5 = 60 cells) │
└─────────────────────────────┘
parallel matrix run
│
▼
each cell: Explorer drives ─▶ screenshots ─▶ Verifier (vision LLM)
▼ ▼
┌────────────────────────────┐
│ per-cell verdict + reason │
└────────────────────────────┘
▼
Reporter → HTML matrix
48 ✓ pass · 8 ⚠ warn · 4 ✗ fail
Three observations about modern mobile testing:
- Espresso-style scripted tests cover happy paths very well and edge cases very poorly. The cases users hit in the wild — broken network during checkout, locale flip mid-flow, "deny" on a permission prompt — rarely make it into the script suite because writing them takes more time than shipping the feature.
- Visual regression by pixel-diff is brittle. A real reviewer would ignore an antialiasing change and flag a misaligned button. A vision LLM approximates that judgment far better than structural diff.
- Scenario authoring scales badly with feature surface. A new feature ships on Monday; manually writing 50 cross-device scenarios takes a week. LLM scenario generation flips this — write one paragraph, get fifty scenarios.
phantom-mobile is a research playground that tackles all three at once.
| | Espresso / UIAutomator | Appium | Maestro | phantom-mobile |
|---|---|---|---|---|
| Test author | human writes Java/Kotlin | human writes WebDriver | human writes YAML | LLM agent reads PRD |
| Scenario count for new feature | hand-written, ~5-10 | hand-written, ~5-10 | hand-written, ~10-20 | ~12 generated from one paragraph |
| Detects UX regressions (not just crashes) | no | no | no | yes — vision LLM judges screenshots |
| Cross-device matrix | manual | manual | manual | automatic 5+ emulator fan-out |
| Environmental stressors (low battery, locale flip, network drop) | manual scripting | manual scripting | manual scripting | 5 simulation modules built-in |
| Self-healing on UI change | no | no | no | 🚧 planned (re-pathfind via LLM) |
| Runs without API keys | yes | yes | yes | mock mode yes; live mode needs vision LLM |
The trade-off is honest: Espresso/Appium/Maestro give you stable scripted runs you can git-blame. phantom-mobile gives you scenario coverage that no human would write by hand because it's tedious — at the cost of LLM non-determinism. Use both. The mock-mode demo bypasses non-determinism entirely (canned outputs) so the framework's shape is reproducible.
The mock demo (make demo-mock) ships with 4 deliberately failing cells
to demonstrate the bug classes the framework catches:
| # | Scenario × Device | Bug class |
|---|---|---|
| 1 | offline-drop × Pixel 4 | UI freezes when network is dropped mid-checkout — not a crash, just a dead button |
| 2 | font-scale-175% × Nexus 5X | Sign-up form clips title; submit button slides off-screen |
| 3 | RTL-locale × Pixel 7 | Email field alignment breaks; right-justified text overlaps icon |
| 4 | TalkBack × Pixel Fold | Accessibility focus order skips the "I agree" checkbox; user can't proceed |
These are representative bug shapes, not production-app findings from a live target.
git clone https://github.com/markl-a/phantom-mobile
cd phantom-mobile
make demo-mock

Output:
→ phantom-mobile matrix run :: story=signup-flow.story.md mock=True
[t+0.0s] generator → 12 scenarios
[t+0.0s] matrix 12 scenarios × 5 devices = 60 runs
[t+0.0s] explorer + verifier → pass=48 warn=8 fail=4
[t+0.0s] reporter composing matrix
[t+0.0s] done
→ artifacts at: reports/runs/<ts>/
- matrix-report.md (cross-device pass/warn/fail grid)
- scenarios.json (12 scenarios with simulation parameters)
- cell-results.jsonl (60 verdicts with reasoning)
→ Pass rate: 48/60 (80%)
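Since cell-results.jsonl is one JSON object per line, the pass rate can be recomputed with a few lines of Python. The field names below ("scenario", "device", "verdict") are assumptions about the artifact's schema, shown against inline sample data rather than a real run:

```python
# Tally verdicts from a cell-results.jsonl-style artifact.
# Schema (field names) is assumed, not confirmed against the repo.
import collections
import json

sample = "\n".join([
    json.dumps({"scenario": "offline-drop", "device": "Pixel 4", "verdict": "fail"}),
    json.dumps({"scenario": "happy-path", "device": "Pixel 7", "verdict": "pass"}),
])

tally = collections.Counter(
    json.loads(line)["verdict"] for line in sample.splitlines()
)
print(dict(tally))  # → {'fail': 1, 'pass': 1}
```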
The matrix report shows scenarios × devices with ✅/⚠️/❌ verdicts per cell.
make test runs 38 unit tests covering simulation profiles, runner shape,
canned data integrity, the phantom-mesh-bridge SDK, run-history queries,
the HIL adapter contract, and the white-box + root-cause agents.
# 1. Start the mesh daemon (in the phantom-mesh repo, in another terminal):
cargo run --release -p phantom-mesh
# 2. Register the mobile agents (translates phantom-mobile/agents/*.toml
# into the mesh's schema and merges into ~/.phantom-mesh/agents.toml):
make install-agents
# 3. Run the SDK 3-example walkthrough (Generator / Verifier / Triage):
make demo-bridge
# Or to drive the full matrix with LLM-backed Generator + Verifier:
python scenarios/run_matrix.py --story tests/signup-flow.story.md --use-llm

See phantom_mesh_bridge/README.md for the SDK details and
docs/JD-MAPPING.md for how the combined system maps to autonomous-agent
testing, AI-native infra, non-AI-expert APIs, and CI/HIL pillars.
The bridge ships seven mobile agents and four supporting modules. Brief tour:
| Capability | One-liner | Entry point |
|---|---|---|
| Black-box test gen (Generator) | PRD/story → 8–15 scenarios with simulation params | agents/generator.toml |
| White-box test gen | Kotlin/Compose source → JUnit + Espresso + property tests with branch-coverage rationale | agents/whitebox-generator.toml |
| Verifier | Captured UI dump + scenario → pass/warn/fail verdict with reason | agents/verifier.toml |
| Triage | 60 cells of failures → clustered bug groups with severity + regression flag | agents/triage.toml |
| Root-cause | Failing cell + log dump → ranked hypothesis (cause / evidence / repro / fix) | agents/root-cause.toml |
| Quality trend CLI | SQLite-backed RunHistory + by-device + by-locale + regression candidates | python -m phantom_mesh_bridge.trends |
| Prompt caching benchmark | Direct Anthropic call demonstrating ~85% cost / ~3× latency drop on cache hits | python -m phantom_mesh_bridge.cache_demo |
| HIL adapter layer | Uniform Backend ABC with Emulator (working) + RealPixel + SensorInject stubs | tools/hil_adapter.py |
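The Backend / Emulator / RealPixel names come from the table above; the method set below is a hypothetical sketch of what such a uniform contract could look like, not tools/hil_adapter.py's actual interface:

```python
# Sketch of a uniform HIL backend contract. Class names follow the docs;
# the methods and their signatures are illustrative assumptions.
from abc import ABC, abstractmethod

class Backend(ABC):
    """Device contract shared by emulator and hardware backends."""
    @abstractmethod
    def tap(self, x: int, y: int) -> None: ...
    @abstractmethod
    def screenshot(self, path: str) -> str: ...

class Emulator(Backend):
    """Working backend: the real adapter would shell out to adb here."""
    def tap(self, x: int, y: int) -> None:
        print(f"adb shell input tap {x} {y}")
    def screenshot(self, path: str) -> str:
        return path  # real adapter: adb exec-out screencap

class RealPixel(Backend):
    """Physical-device stub: declared but not yet implemented."""
    def tap(self, x: int, y: int) -> None:
        raise NotImplementedError
    def screenshot(self, path: str) -> str:
        raise NotImplementedError
```

Swapping RealPixel in for Emulator then changes nothing upstream, which is the point of keeping one ABC in front of every backend.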
Run quality trends after a few --record-history matrix runs:
python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock --record-history
python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock --record-history
python -m phantom_mesh_bridge.trends --story signup-flow --days 30

make emulator-up # boots emulator-matrix.yml configurations
adb install path/to/your-app.apk # install the target APK on each emulator
make demo # runs the matrix against live devices
# Optional: with phantom-mesh LLM-driven generator + verifier
phantom serve &
make demo # runner detects phantom and invokes LLM agents

Live mode requires the Android SDK (emulator, adb), uiautomator2
(pip install uiautomator2 && python -m uiautomator2 init), and the AVDs
declared in android/emulator-matrix.yml. See android/README.md
for AVD setup. All Makefile targets are listed via make help.
Scenarios are markdown stories under tests/. The Generator agent reads the
story, expands it into concrete test specs. The Explorer agent runs each spec
on every emulator. The Verifier agent reviews each screenshot. The Reporter
agent assembles results into the cross-device matrix.
phantom-mobile/
├── agents/ # phantom-mesh agent configs (TOML)
│ ├── generator.toml
│ ├── explorer.toml
│ ├── verifier.toml
│ └── reporter.toml
├── simulation/ # environmental stressors
│ ├── network.py # 3G / packet loss / latency
│ ├── battery.py # low / thermal-throttled
│ ├── locale.py # zh-TW / ar-RTL / ja-JP / etc
│ ├── accessibility.py # font scale / dark mode / TalkBack
│ └── lifecycle.py # background → foreground / rotation
├── tools/ # phantom tool wrappers
│ ├── uiautomator_driver.py
│ ├── adb_helpers.py
│ └── screenshot_judge.py # VLM call
├── android/ # emulator configs
│ ├── emulator-matrix.yml # 5+ configs (Pixel 6, Pixel 8 Pro, low-end, foldable, tablet)
│ ├── start-matrix.sh
│ └── README.md
├── tests/ # human-readable test stories
│ ├── signup-flow.story.md
│ └── checkout-flow.story.md
├── reports/ # output reports
│ └── sample-cross-device-report.md
├── docs/
│ ├── ARCHITECTURE.md
│ ├── SIMULATION-ENGINE.md
│ └── INTERVIEW-TALK-TRACK.md
└── LICENSE
Platform scope (2026-05): Android only. iOS support is planned but not started — XCUITest bridging needs an Apple-Developer signing flow we haven't tackled yet. If you need iOS coverage today, swap the Explorer agent for Appium's iOS driver and keep everything else.
| Component | State |
|---|---|
| Mock-mode end-to-end demo (make demo-mock) | ✅ runnable on any machine, <1s, 60-cell matrix |
| Generator agent (story → scenarios) | ✅ canned 12-scenario set; LLM-driven path opt-in via --use-llm |
| Explorer agent (UIAutomator2 driver) | ⚙️ wrapper done; lazy-imported in live mode |
| Verifier agent (VLM judging) | ⚙️ wrapper + prompt builder done; real VLM call via phantom-mesh API |
| Simulation: network / locale / a11y / battery / lifecycle | ✅ all 5 modules working with graceful fallback when adb absent |
| Cross-device matrix runner | ✅ working in mock mode (60 cells); live mode ⚙️ |
| Reporter (cross-device matrix MD with axis breakdown) | ✅ working |
| Tests (python -m pytest tests -q) | ✅ 70 tests passing in the latest local verification |
| Live-mode signup-flow against real emulators | ⚙️ partial — needs uiautomator2 install + AVD setup |
| Self-healing (UI changed → re-pathfind via LLM) | 🚧 future work |
- 🌟 phantom-mesh — The agent runtime this depends on.
- 🔬 phantom-secops — Sibling project: same agent runtime, red/blue-team simulation domain.
Apache-2.0