
phantom-mobile

Mock-first mobile QA proof module for the Phantom Mesh ecosystem.

Give it one sentence describing what a user is trying to do ("sign up and complete checkout"). It generates a dozen realistic scenario variants — different locale, low battery, flaky network, font scaling, screen rotation — runs them across a device matrix, and produces a pass/warn/fail report. Live emulator driving and vision-LLM judging are the hardening path; the reliable public demo today is the mock matrix.



Current Verification

Latest local verification:

  • python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock: passed
  • python -m pytest tests -q: 70 passed

The verified claim covers scenario generation, matrix orchestration, and report evidence. It does not claim that live emulator/VLM execution is fully hardened yet.


TL;DR — what is this?

A multi-device Android testing prototype where the test author is an AI agent, not a human writing Espresso scripts.

Four cooperating agents handle the whole loop:

| Agent | Role |
| --- | --- |
| Generator (LLM) | turns a one-sentence PRD into ~12 scenario variants |
| Explorer (UIAutomator2) | drives the app on a real / virtual device, captures screenshots |
| Verifier (vision LLM) | looks at each screenshot and asks "does this UX look broken?" |
| Reporter | rolls every cell of the scenario × device matrix into one HTML report |

A simulation engine wraps the Explorer so each scenario runs under a chosen environmental stressor — network drop, battery low, RTL locale, font-scale 175%, screen rotation, app backgrounding mid-flow. Most production bugs hide here, not on the happy path.
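
As a concrete sketch of that wrapping: a stressor can be a context manager that brackets one Explorer run with adb state changes. The module layout and function names below are illustrative assumptions, not the repo's actual simulation/ API:

```python
import subprocess
from contextlib import contextmanager

def adb(serial: str, *args: str) -> None:
    # Run one adb command against a specific device/emulator.
    subprocess.run(["adb", "-s", serial, *args], check=True)

@contextmanager
def low_battery(serial: str, level: int = 5):
    """Fake a low battery level for the duration of one scenario run."""
    adb(serial, "shell", "cmd", "battery", "set", "level", str(level))
    try:
        yield
    finally:
        adb(serial, "shell", "cmd", "battery", "reset")

@contextmanager
def font_scale(serial: str, scale: float = 1.75):
    """Apply font scaling (e.g. 175%), then restore the default."""
    adb(serial, "shell", "settings", "put", "system", "font_scale", str(scale))
    try:
        yield
    finally:
        adb(serial, "shell", "settings", "put", "system", "font_scale", "1.0")

# Hypothetical usage for one cell of the matrix:
#   with low_battery("emulator-5554"):
#       explorer.run(scenario, device="emulator-5554")
```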

  PRD: "User signs up, verifies email, completes onboarding tutorial."
                                  │
                                  ▼
                      Generator agent (LLM)
                                  │
                                  ▼
                  ┌──────────────┐  12 scenario specs
                  │ JP locale    │  (each variant lives in
                  │ 175% font    │   one cell of the matrix)
                  │ 5G → 3G drop │
                  │ low battery  │
                  │ TalkBack on  │
                  │ ...          │
                  └──────────────┘
                                  │
                                  ▼
                  ┌─────── 5 emulators ─────────┐
                  │  Pixel 4 Pixel 7 Pixel Fold │
                  │  Pixel Tablet  Nexus 5X     │
                  │  (12 × 5 = 60 cells)        │
                  └─────────────────────────────┘
                       parallel matrix run
                                  │
                                  ▼
   each cell:  Explorer drives ─▶ screenshots ─▶ Verifier (vision LLM)
                                       ▼              ▼
                              ┌─────────────────────────────┐
                              │  per-cell verdict + reason  │
                              └─────────────────────────────┘
                                          ▼
                                   Reporter → HTML matrix
                                   48 ✓ pass · 8 ⚠ warn · 4 ✗ fail
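
The fan-out itself is plain scatter-gather over the scenario × device grid. A minimal sketch, with run_cell standing in for the Explorer + Verifier steps (the names are illustrative, not run_matrix.py internals):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

DEVICES = ["Pixel 4", "Pixel 7", "Pixel Fold", "Pixel Tablet", "Nexus 5X"]

def run_cell(scenario: dict, device: str) -> dict:
    # Stand-in for: Explorer drives the app, Verifier judges the screenshots.
    # In mock mode both steps return canned results.
    return {"scenario": scenario["id"], "device": device, "verdict": "pass"}

def run_matrix(scenarios: list[dict]) -> list[dict]:
    # 12 scenarios × 5 devices = 60 independent cells, run in parallel.
    with ThreadPoolExecutor(max_workers=len(DEVICES)) as pool:
        futures = [pool.submit(run_cell, s, d)
                   for s, d in product(scenarios, DEVICES)]
        return [f.result() for f in futures]

# e.g. run_matrix([{"id": f"scenario-{i}"} for i in range(12)]) -> 60 verdicts
```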

Why "AI agents" and not just a faster Espresso?

Three observations about modern mobile testing:

  1. Espresso-style scripted tests cover happy paths very well and edge cases very poorly. The cases users hit in the wild — broken network during checkout, locale flip mid-flow, "deny" on a permission prompt — rarely make it into the script suite because writing them takes more time than shipping the feature.
  2. Visual regression by pixel-diff is brittle. A real reviewer would ignore an antialiasing change and flag a misaligned button. A vision LLM approximates that judgment far better than structural diff.
  3. Scenario authoring scales badly with feature surface. A new feature ships on Monday; manually writing 50 cross-device scenarios takes a week. LLM scenario generation flips this — write one paragraph, get fifty scenarios.

phantom-mobile is a research playground that tackles all three at once.

How it differs from Maestro / Appium / Espresso

| | Espresso / UIAutomator | Appium | Maestro | phantom-mobile |
| --- | --- | --- | --- | --- |
| Test author | human writes Java/Kotlin | human writes WebDriver | human writes YAML | LLM agent reads PRD |
| Scenario count for new feature | hand-written, ~5-10 | hand-written, ~5-10 | hand-written, ~10-20 | ~12 generated from one paragraph |
| Detects UX regressions (not just crashes) | no | no | no | yes — vision LLM judges screenshots |
| Cross-device matrix | manual | manual | manual | automatic 5+ emulator fan-out |
| Environmental stressors (low battery, locale flip, network drop) | manual scripting | manual scripting | manual scripting | 5 simulation modules built-in |
| Self-healing on UI change | no | no | no | 🚧 planned (re-pathfind via LLM) |
| Runs without API keys | yes | yes | yes | mock mode yes; live mode needs vision LLM |

The trade-off is honest: Espresso/Appium/Maestro give you stable scripted runs you can git-blame. phantom-mobile gives you scenario coverage that no human writes in practice because it is tedious, at the cost of LLM non-determinism. Use both. The mock-mode demo bypasses the non-determinism entirely (canned outputs), so the framework's shape is reproducible.


What problems it has actually caught

The mock demo (make demo-mock) ships with 4 deliberately failing cells to demonstrate the bug classes the framework is built to catch:

| # | Scenario × Device | Bug class |
| --- | --- | --- |
| 1 | offline-drop × Pixel 4 | UI freezes when network is dropped mid-checkout — not a crash, just a dead button |
| 2 | font-scale-175% × Nexus 5X | Sign-up form clips title; submit button slides off-screen |
| 3 | RTL-locale × Pixel 7 | Email field alignment breaks; right-justified text overlaps icon |
| 4 | TalkBack × Pixel Fold | Accessibility focus order skips the "I agree" checkbox; user can't proceed |

These are representative bug shapes, not production-app findings from a live target.


Quick start

Mock mode — no emulator, no API key, runs anywhere in <1 second

git clone https://github.com/markl-a/phantom-mobile
cd phantom-mobile
make demo-mock

Output:

→ phantom-mobile matrix run :: story=signup-flow.story.md mock=True
  [t+0.0s] generator             → 12 scenarios
  [t+0.0s] matrix                12 scenarios × 5 devices = 60 runs
  [t+0.0s] explorer + verifier   → pass=48 warn=8 fail=4
  [t+0.0s] reporter              composing matrix
  [t+0.0s] done

→ artifacts at: reports/runs/<ts>/
   - matrix-report.md   (cross-device pass/warn/fail grid)
   - scenarios.json     (12 scenarios with simulation parameters)
   - cell-results.jsonl (60 verdicts with reasoning)

→ Pass rate: 48/60 (80%)

The matrix report shows scenarios × devices with ✅/⚠️/❌ glyphs, top failures with reasoning, and a per-axis breakdown (network / locale / a11y / lifecycle / battery). Unit tests run via make test; the suite (70 passing at the latest local verification) covers simulation profiles, runner shape, canned-data integrity, the phantom-mesh-bridge SDK, run-history queries, the HIL adapter contract, and the white-box + root-cause agents.
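
The per-cell verdicts are line-delimited JSON, so post-processing is trivial. A sketch that prints the failing cells of the most recent run; the field names are assumptions for illustration, not a documented schema:

```python
import json
from pathlib import Path

# Assumed record shape (illustrative, not a documented schema):
# {"scenario": "font-scale-175", "device": "Nexus 5X",
#  "verdict": "fail", "reason": "submit button off-screen"}
run_dir = sorted(Path("reports/runs").iterdir())[-1]  # latest timestamped run
with (run_dir / "cell-results.jsonl").open() as f:
    cells = [json.loads(line) for line in f if line.strip()]

for cell in cells:
    if cell["verdict"] == "fail":
        print(f"{cell['scenario']} × {cell['device']}: {cell['reason']}")
```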

LLM mode — runs against a real phantom-mesh daemon

# 1. Start the mesh daemon (in the phantom-mesh repo, in another terminal):
cargo run --release -p phantom-mesh

# 2. Register the mobile agents (translates phantom-mobile/agents/*.toml
#    into the mesh's schema and merges into ~/.phantom-mesh/agents.toml):
make install-agents

# 3. Run the SDK 3-example walkthrough (Generator / Verifier / Triage):
make demo-bridge

# Or to drive the full matrix with LLM-backed Generator + Verifier:
python scenarios/run_matrix.py --story tests/signup-flow.story.md --use-llm

See phantom_mesh_bridge/README.md for the SDK details and docs/JD-MAPPING.md for how the combined system maps to autonomous-agent testing, AI-native infra, non-AI-expert APIs, and CI/HIL pillars.

Phase 2 capabilities

The bridge ships seven mobile agents and four supporting modules. Brief tour:

| Capability | One-liner | Entry point |
| --- | --- | --- |
| Black-box test gen (Generator) | PRD/story → 8–15 scenarios with simulation params | agents/generator.toml |
| White-box test gen | Kotlin/Compose source → JUnit + Espresso + property tests with branch-coverage rationale | agents/whitebox-generator.toml |
| Verifier | captured UI dump + scenario → pass/warn/fail verdict with reason | agents/verifier.toml |
| Triage | 60 cells of failures → clustered bug groups with severity + regression flag | agents/triage.toml |
| Root-cause | failing cell + log dump → ranked hypothesis (cause / evidence / repro / fix) | agents/root-cause.toml |
| Quality trend CLI | SQLite-backed RunHistory + by-device + by-locale + regression candidates | python -m phantom_mesh_bridge.trends |
| Prompt caching benchmark | direct Anthropic call demonstrating ~85% cost / ~3× latency drop on cache hits | python -m phantom_mesh_bridge.cache_demo |
| HIL adapter layer | uniform Backend ABC with Emulator (working) + RealPixel + SensorInject stubs | tools/hil_adapter.py |
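
For a feel of the adapter row: a minimal sketch of a uniform backend ABC of that shape. Method names are guesses for illustration; tools/hil_adapter.py is the authoritative contract:

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """One interface whether the target is an emulator, a physical Pixel,
    or a rig that injects sensor data."""

    @abstractmethod
    def connect(self, serial: str) -> None: ...

    @abstractmethod
    def tap(self, x: int, y: int) -> None: ...

    @abstractmethod
    def screenshot(self, path: str) -> None: ...

class EmulatorBackend(Backend):
    """Working path: would delegate to adb + uiautomator2."""
    def connect(self, serial: str) -> None: ...
    def tap(self, x: int, y: int) -> None: ...
    def screenshot(self, path: str) -> None: ...

class RealPixelBackend(Backend):
    """Stub: physical-device specifics (USB auth, screen unlock) land here."""
    def connect(self, serial: str) -> None:
        raise NotImplementedError("real-device path is a stub")
    def tap(self, x: int, y: int) -> None:
        raise NotImplementedError
    def screenshot(self, path: str) -> None:
        raise NotImplementedError
```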

Run quality trends after a few --record-history matrix runs:

python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock --record-history
python scenarios/run_matrix.py --story tests/signup-flow.story.md --mock --record-history
python -m phantom_mesh_bridge.trends --story signup-flow --days 30

Live mode — against connected emulators

make emulator-up                    # boots emulator-matrix.yml configurations
adb install path/to/your-app.apk    # install the target APK on each
make demo                            # runs the matrix against live devices

# Optional: with phantom-mesh LLM-driven generator + verifier
phantom serve &
make demo  # runner detects phantom and invokes LLM agents

Live mode requires the Android SDK (emulator, adb), uiautomator2 (pip install uiautomator2 && python -m uiautomator2 init), and the AVDs declared in android/emulator-matrix.yml. See android/README.md for AVD setup. All Makefile targets are listed via make help.
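
Before a full live run, it can help to confirm the driver layer independently of this repo. This is plain uiautomator2 usage, nothing phantom-mobile-specific:

```python
import uiautomator2 as u2  # pip install uiautomator2

d = u2.connect()             # first device reported by adb
print(d.info)                # sanity check: resolution, SDK level, etc.
d.screenshot("probe.png")    # confirms screen capture works end to end
# d(text="Sign up").click()  # selector-driven taps, as the Explorer uses
```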

Scenarios are markdown stories under tests/. The Generator agent reads the story and expands it into concrete test specs. The Explorer agent runs each spec on every emulator. The Verifier agent reviews each screenshot. The Reporter agent assembles the results into the cross-device matrix.
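
For reference, an illustrative story file in the general shape the pipeline consumes; the exact format of tests/signup-flow.story.md may differ:

```markdown
# signup-flow

User opens the app for the first time, signs up with email and password,
verifies the email via the in-app prompt, and completes the onboarding
tutorial. Success: the home screen shows the new account's avatar.
```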


Repo layout

phantom-mobile/
├── agents/                     # phantom-mesh agent configs (TOML)
│   ├── generator.toml
│   ├── explorer.toml
│   ├── verifier.toml
│   └── reporter.toml
├── simulation/                 # environmental stressors
│   ├── network.py              # 3G / packet loss / latency
│   ├── battery.py              # low / thermal-throttled
│   ├── locale.py               # zh-TW / ar-RTL / ja-JP / etc
│   ├── accessibility.py        # font scale / dark mode / TalkBack
│   └── lifecycle.py            # background → foreground / rotation
├── tools/                      # phantom tool wrappers
│   ├── uiautomator_driver.py
│   ├── adb_helpers.py
│   └── screenshot_judge.py     # VLM call
├── android/                    # emulator configs
│   ├── emulator-matrix.yml     # 5+ configs (Pixel 6, Pixel 8 Pro, low-end, foldable, tablet)
│   ├── start-matrix.sh
│   └── README.md
├── tests/                      # human-readable test stories
│   ├── signup-flow.story.md
│   └── checkout-flow.story.md
├── reports/                    # output reports
│   └── sample-cross-device-report.md
├── docs/
│   ├── ARCHITECTURE.md
│   ├── SIMULATION-ENGINE.md
│   └── INTERVIEW-TALK-TRACK.md
└── LICENSE

Status

Platform scope (2026-05): Android only. iOS support is planned but not started — XCUITest bridging needs an Apple-Developer signing flow we haven't tackled yet. If you need iOS coverage today, swap the Explorer agent for Appium's iOS driver and keep everything else.

| Component | State |
| --- | --- |
| Mock-mode end-to-end demo (make demo-mock) | ✅ runnable on any machine, <1s, 60-cell matrix |
| Generator agent (story → scenarios) | ✅ canned 12-scenario set; LLM-driven path opt-in via --use-llm |
| Explorer agent (UIAutomator2 driver) | ⚙️ wrapper done; lazy-imported in live mode |
| Verifier agent (VLM judging) | ⚙️ wrapper + prompt builder done; real VLM call via phantom-mesh API |
| Simulation: network / locale / a11y / battery / lifecycle | ✅ all 5 modules working with graceful fallback when adb absent |
| Cross-device matrix runner | ✅ working in mock mode (60 cells); live mode ⚙️ |
| Reporter (cross-device matrix MD with axis breakdown) | ✅ working |
| Tests (python -m pytest tests -q) | ✅ 70 tests passing in the latest local verification |
| Live-mode signup-flow against real emulators | ⚙️ partial — needs uiautomator2 install + AVD setup |
| Self-healing (UI changed → re-pathfind via LLM) | 🚧 future work |

Related projects

  • 🌟 phantom-mesh — The agent runtime this depends on.
  • 🔬 phantom-secops — Sibling project: same agent runtime, red/blue-team simulation domain.

License

Apache-2.0