
phantom-mobile

Mock-first mobile QA proof module for the Phantom Mesh ecosystem.

Give it one sentence describing what a user is trying to do ("sign up and complete checkout"). It generates a dozen realistic scenario variants — different locale, low battery, flaky network, font scaling, screen rotation — runs them across a device matrix, and produces a pass/warn/fail report. Live emulator driving and vision-LLM judging are the hardening path; the reliable public demo today is the mock matrix.



Current Verification

Latest local verification:

  • python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock: passed
  • python -m pytest tests -q: 70 passed

The verified claim covers scenario generation, matrix orchestration, and report evidence. It does not claim that live emulator/VLM execution is fully hardened yet.


TL;DR — what is this?

A multi-device Android testing prototype where the test author is an AI agent, not a human writing Espresso scripts.

Four cooperating agents handle the whole loop:

| Agent | Role |
| --- | --- |
| Generator (LLM) | turns a one-sentence PRD into ~12 scenario variants |
| Explorer (UIAutomator2) | drives the app on a real / virtual device, captures screenshots |
| Verifier (vision LLM) | looks at each screenshot and asks "does this UX look broken?" |
| Reporter | rolls every cell of the scenario × device matrix into one HTML report |

A simulation engine wraps the Explorer so each scenario runs under a chosen environmental stressor — network drop, battery low, RTL locale, font-scale 175%, screen rotation, app backgrounding mid-flow. Most production bugs hide here, not on the happy path.
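
As a concrete sketch of that wrapping: a stressor can be a context manager that brackets one Explorer run with adb state changes. The module layout and function names below are illustrative assumptions, not the repo's actual simulation/ API:

```python
import subprocess
from contextlib import contextmanager

def adb(serial: str, *args: str) -> None:
    # Run one adb command against a specific device/emulator.
    subprocess.run(["adb", "-s", serial, *args], check=True)

@contextmanager
def low_battery(serial: str, level: int = 5):
    """Fake a low battery level for the duration of one scenario run."""
    adb(serial, "shell", "cmd", "battery", "set", "level", str(level))
    try:
        yield
    finally:
        adb(serial, "shell", "cmd", "battery", "reset")

@contextmanager
def font_scale(serial: str, scale: float = 1.75):
    """Apply font scaling (e.g. 175%), then restore the default."""
    adb(serial, "shell", "settings", "put", "system", "font_scale", str(scale))
    try:
        yield
    finally:
        adb(serial, "shell", "settings", "put", "system", "font_scale", "1.0")

# Hypothetical usage for one cell of the matrix:
#   with low_battery("emulator-5554"):
#       explorer.run(scenario, device="emulator-5554")
```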

  PRD: "User signs up, verifies email, completes onboarding tutorial."
                                  │
                                  ▼
                      Generator agent (LLM)
                                  │
                                  ▼
                  ┌──────────────┐  12 scenario specs
                  │ JP locale    │  (each variant lives in
                  │ 175% font    │   one cell of the matrix)
                  │ 5G → 3G drop │
                  │ low battery  │
                  │ TalkBack on  │
                  │ ...          │
                  └──────────────┘
                                  │
                                  ▼
                  ┌─────── 5 emulators ─────────┐
                  │  Pixel 4 Pixel 7 Pixel Fold │
                  │  Pixel Tablet  Nexus 5X     │
                  │  (12 × 5 = 60 cells)        │
                  └─────────────────────────────┘
                       parallel matrix run
                                  │
                                  ▼
   each cell:  Explorer drives ─▶ screenshots ─▶ Verifier (vision LLM)
                                       ▼              ▼
                              ┌─────────────────────────────┐
                              │  per-cell verdict + reason  │
                              └─────────────────────────────┘
                                          ▼
                                   Reporter → HTML matrix
                                   48 ✓ pass · 8 ⚠ warn · 4 ✗ fail
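
The fan-out itself is plain scatter-gather over the scenario × device grid. A minimal sketch, with run_cell standing in for the Explorer + Verifier steps (the names are illustrative, not run_matrix.py internals):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

DEVICES = ["Pixel 4", "Pixel 7", "Pixel Fold", "Pixel Tablet", "Nexus 5X"]

def run_cell(scenario: dict, device: str) -> dict:
    # Stand-in for: Explorer drives the app, Verifier judges the screenshots.
    # In mock mode both steps return canned results.
    return {"scenario": scenario["id"], "device": device, "verdict": "pass"}

def run_matrix(scenarios: list[dict]) -> list[dict]:
    # 12 scenarios × 5 devices = 60 independent cells, run in parallel.
    with ThreadPoolExecutor(max_workers=len(DEVICES)) as pool:
        futures = [pool.submit(run_cell, s, d)
                   for s, d in product(scenarios, DEVICES)]
        return [f.result() for f in futures]

# e.g. run_matrix([{"id": f"scenario-{i}"} for i in range(12)]) -> 60 verdicts
```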

Why "AI agents" and not just a faster Espresso?

Three observations about modern mobile testing:

  1. Espresso-style scripted tests cover happy paths very well and edge cases very poorly. The cases users hit in the wild — broken network during checkout, locale flip mid-flow, "deny" on a permission prompt — rarely make it into the script suite because writing them takes more time than shipping the feature.
  2. Visual regression by pixel-diff is brittle. A real reviewer would ignore an antialiasing change and flag a misaligned button. A vision LLM approximates that judgment far better than structural diff.
  3. Scenario authoring scales badly with feature surface. A new feature ships on Monday; manually writing 50 cross-device scenarios takes a week. LLM scenario generation flips this — write one paragraph, get fifty scenarios.

phantom-mobile is a research playground that tackles all three at once.

How it differs from Maestro / Appium / Espresso

| | Espresso / UIAutomator | Appium | Maestro | phantom-mobile |
| --- | --- | --- | --- | --- |
| Test author | human writes Java/Kotlin | human writes WebDriver | human writes YAML | LLM agent reads PRD |
| Scenario count for new feature | hand-written, ~5-10 | hand-written, ~5-10 | hand-written, ~10-20 | ~12 generated from one paragraph |
| Detects UX regressions (not just crashes) | no | no | no | yes — vision LLM judges screenshots |
| Cross-device matrix | manual | manual | manual | automatic 5+ emulator fan-out |
| Environmental stressors (low battery, locale flip, network drop) | manual scripting | manual scripting | manual scripting | 5 simulation modules built-in |
| Self-healing on UI change | no | no | no | 🚧 planned (re-pathfind via LLM) |
| Runs without API keys | yes | yes | yes | mock mode yes; live mode needs vision LLM |

The trade-off is honest: Espresso/Appium/Maestro give you stable scripted runs you can git-blame. phantom-mobile gives you scenario coverage that no human writes in practice because it is tedious, at the cost of LLM non-determinism. Use both. The mock-mode demo bypasses the non-determinism entirely (canned outputs), so the framework's shape is reproducible.


What problems it has actually caught

The mock demo (make demo-mock) ships with 4 deliberately failing cells to demonstrate the bug classes the framework is built to catch:

| # | Scenario × Device | Bug class |
| --- | --- | --- |
| 1 | offline-drop × Pixel 4 | UI freezes when network is dropped mid-checkout — not a crash, just a dead button |
| 2 | font-scale-175% × Nexus 5X | Sign-up form clips title; submit button slides off-screen |
| 3 | RTL-locale × Pixel 7 | Email field alignment breaks; right-justified text overlaps icon |
| 4 | TalkBack × Pixel Fold | Accessibility focus order skips the "I agree" checkbox; user can't proceed |

These are representative bug shapes, not production-app findings from a live target.


Quick start

Mock mode — no emulator, no API key, runs anywhere in <1 second

git clone https://github.com/markl-a/phantom-mobile
cd phantom-mobile
make demo-mock

Output:

→ phantom-mobile matrix run :: story=signup-flow.story.md mock=True
  [t+0.0s] generator             → 12 scenarios
  [t+0.0s] matrix                12 scenarios × 5 devices = 60 runs
  [t+0.0s] explorer + verifier   → pass=48 warn=8 fail=4
  [t+0.0s] reporter              composing matrix
  [t+0.0s] done

→ artifacts at: reports/runs/<ts>/
   - matrix-report.md   (cross-device pass/warn/fail grid)
   - scenarios.json     (12 scenarios with simulation parameters)
   - cell-results.jsonl (60 verdicts with reasoning)

→ Pass rate: 48/60 (80%)

The matrix report shows scenarios × devices with ✅/⚠️/❌ glyphs, top failures with reasoning, and a per-axis breakdown (network / locale / a11y / lifecycle / battery). Unit tests run via make test; the suite (70 passing at the latest local verification) covers simulation profiles, runner shape, canned-data integrity, the phantom-mesh-bridge SDK, run-history queries, the HIL adapter contract, and the white-box + root-cause agents.
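
The per-cell verdicts are line-delimited JSON, so post-processing is trivial. A sketch that prints the failing cells of the most recent run; the field names are assumptions for illustration, not a documented schema:

```python
import json
from pathlib import Path

# Assumed record shape (illustrative, not a documented schema):
# {"scenario": "font-scale-175", "device": "Nexus 5X",
#  "verdict": "fail", "reason": "submit button off-screen"}
run_dir = sorted(Path("reports/runs").iterdir())[-1]  # latest timestamped run
with (run_dir / "cell-results.jsonl").open() as f:
    cells = [json.loads(line) for line in f if line.strip()]

for cell in cells:
    if cell["verdict"] == "fail":
        print(f"{cell['scenario']} × {cell['device']}: {cell['reason']}")
```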

LLM mode — runs against a real phantom-mesh daemon

# 1. Start the mesh daemon (in the phantom-mesh repo, in another terminal):
cargo run --release -p phantom-mesh

# 2. Register the mobile agents (translates phantom-mobile/agents/*.toml
#    into the mesh's schema and merges into ~/.phantom-mesh/agents.toml):
make install-agents

# 3. Run the SDK 3-example walkthrough (Generator / Verifier / Triage):
make demo-bridge

# Or to drive the full matrix with LLM-backed Generator + Verifier:
python scenarios/run_matrix.py --story tests/signup-flow.story.md --use-llm

See phantom_mesh_bridge/README.md for the SDK details and docs/JD-MAPPING.md for how the combined system maps to autonomous-agent testing, AI-native infra, non-AI-expert APIs, and CI/HIL pillars.

Phase 2 capabilities

The bridge ships seven mobile agents and four supporting modules. Brief tour:

| Capability | One-liner | Entry point |
| --- | --- | --- |
| Black-box test gen (Generator) | PRD/story → 8–15 scenarios with simulation params | agents/generator.toml |
| White-box test gen | Kotlin/Compose source → JUnit + Espresso + property tests with branch-coverage rationale | agents/whitebox-generator.toml |
| Verifier | captured UI dump + scenario → pass/warn/fail verdict with reason | agents/verifier.toml |
| Triage | 60 cells of failures → clustered bug groups with severity + regression flag | agents/triage.toml |
| Root-cause | failing cell + log dump → ranked hypothesis (cause / evidence / repro / fix) | agents/root-cause.toml |
| Quality trend CLI | SQLite-backed RunHistory + by-device + by-locale + regression candidates | python -m phantom_mesh_bridge.trends |
| Prompt caching benchmark | direct Anthropic call demonstrating ~85% cost / ~3× latency drop on cache hits | python -m phantom_mesh_bridge.cache_demo |
| HIL adapter layer | uniform Backend ABC with Emulator (working) + RealPixel + SensorInject stubs | tools/hil_adapter.py |
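
For a feel of the adapter row: a minimal sketch of a uniform backend ABC of that shape. Method names are guesses for illustration; tools/hil_adapter.py is the authoritative contract:

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """One interface whether the target is an emulator, a physical Pixel,
    or a rig that injects sensor data."""

    @abstractmethod
    def connect(self, serial: str) -> None: ...

    @abstractmethod
    def tap(self, x: int, y: int) -> None: ...

    @abstractmethod
    def screenshot(self, path: str) -> None: ...

class EmulatorBackend(Backend):
    """Working path: would delegate to adb + uiautomator2."""
    def connect(self, serial: str) -> None: ...
    def tap(self, x: int, y: int) -> None: ...
    def screenshot(self, path: str) -> None: ...

class RealPixelBackend(Backend):
    """Stub: physical-device specifics (USB auth, screen unlock) land here."""
    def connect(self, serial: str) -> None:
        raise NotImplementedError("real-device path is a stub")
    def tap(self, x: int, y: int) -> None:
        raise NotImplementedError
    def screenshot(self, path: str) -> None:
        raise NotImplementedError
```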

Run quality trends after a few --record-history matrix runs:

python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock --record-history
python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock --record-history
python -m phantom_mesh_bridge.trends --story signup-flow --days 30

Live mode — against connected emulators

make emulator-up                    # boots emulator-matrix.yml configurations
adb install path/to/your-app.apk    # install the target APK on each
make demo                            # runs the matrix against live devices

# Optional: with phantom-mesh LLM-driven generator + verifier
phantom serve &
make demo  # runner detects phantom and invokes LLM agents

Live mode requires the Android SDK (emulator, adb), uiautomator2 (pip install uiautomator2 && python -m uiautomator2 init), and the AVDs declared in android/emulator-matrix.yml. See android/README.md for AVD setup. All Makefile targets are listed via make help.
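
Before a full live run, it can help to confirm the driver layer independently of this repo. This is plain uiautomator2 usage, nothing phantom-mobile-specific:

```python
import uiautomator2 as u2  # pip install uiautomator2

d = u2.connect()             # first device reported by adb
print(d.info)                # sanity check: resolution, SDK level, etc.
d.screenshot("probe.png")    # confirms screen capture works end to end
# d(text="Sign up").click()  # selector-driven taps, as the Explorer uses
```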

Scenarios are markdown stories under tests/. The Generator agent reads the story and expands it into concrete test specs. The Explorer agent runs each spec on every emulator. The Verifier agent reviews each screenshot. The Reporter agent assembles the results into the cross-device matrix.
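
For reference, an illustrative story file in the general shape the pipeline consumes; the exact format of tests/signup-flow.story.md may differ:

```markdown
# signup-flow

User opens the app for the first time, signs up with email and password,
verifies the email via the in-app prompt, and completes the onboarding
tutorial. Success: the home screen shows the new account's avatar.
```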


Repo layout

phantom-mobile/
├── agents/                     # phantom-mesh agent configs (TOML)
│   ├── generator.toml
│   ├── explorer.toml
│   ├── verifier.toml
│   └── reporter.toml
├── simulation/                 # environmental stressors
│   ├── network.py              # 3G / packet loss / latency
│   ├── battery.py              # low / thermal-throttled
│   ├── locale.py               # zh-TW / ar-RTL / ja-JP / etc
│   ├── accessibility.py        # font scale / dark mode / TalkBack
│   └── lifecycle.py            # background → foreground / rotation
├── tools/                      # phantom tool wrappers
│   ├── uiautomator_driver.py
│   ├── adb_helpers.py
│   └── screenshot_judge.py     # VLM call
├── android/                    # emulator configs
│   ├── emulator-matrix.yml     # 5+ configs (Pixel 6, Pixel 8 Pro, low-end, foldable, tablet)
│   ├── start-matrix.sh
│   └── README.md
├── tests/                      # human-readable test stories
│   ├── signup-flow.story.md
│   └── checkout-flow.story.md
├── reports/                    # output reports
│   └── sample-cross-device-report.md
├── docs/
│   ├── ARCHITECTURE.md
│   ├── SIMULATION-ENGINE.md
│   └── INTERVIEW-TALK-TRACK.md
└── LICENSE

Status

Platform scope (2026-05): Android only. iOS support is planned but not started — XCUITest bridging needs an Apple-Developer signing flow we haven't tackled yet. If you need iOS coverage today, swap the Explorer agent for Appium's iOS driver and keep everything else.

| Component | State |
| --- | --- |
| Mock-mode end-to-end demo (make demo-mock) | ✅ runnable on any machine, <1s, 60-cell matrix |
| Generator agent (story → scenarios) | ✅ canned 12-scenario set; LLM-driven path opt-in via --use-llm |
| Explorer agent (UIAutomator2 driver) | ⚙️ wrapper done; lazy-imported in live mode |
| Verifier agent (VLM judging) | ⚙️ wrapper + prompt builder done; real VLM call via phantom-mesh API |
| Simulation: network / locale / a11y / battery / lifecycle | ✅ all 5 modules working with graceful fallback when adb absent |
| Cross-device matrix runner | ✅ working in mock mode (60 cells); live mode ⚙️ |
| Reporter (cross-device matrix MD with axis breakdown) | ✅ working |
| Tests (python -m pytest tests -q) | ✅ 70 tests passing in the latest local verification |
| Live-mode signup-flow against real emulators | ⚙️ partial — needs uiautomator2 install + AVD setup |
| Self-healing (UI changed → re-pathfind via LLM) | 🚧 future work |

Related projects

  • 🌟 phantom-mesh — The agent runtime this depends on.
  • 🔬 phantom-secops — Sibling project: same agent runtime, red/blue-team simulation domain.

License

Apache-2.0