
Android on-device: release-crash fix, live transcription, APK CI + download page#182

Merged
RonTuretzky merged 62 commits into main from feat/android-ondevice-live-ci
Jun 14, 2026

Conversation

@RonTuretzky

Collaborator

Ships the on-device Android app to main, working and installable.

Scope note: this branch is stacked on feat/gui (PR #175), so the diff
includes all ~60 of that PR's commits (the whole Tauri mobile app, authored by
@NubsCarson) plus the new work below. The genuinely-new commits are the last two.
Merging this effectively lands #175 as well — #175 can then be closed.

New in this branch (on top of #175)

  • fix(android): release-build crash. Minified R8 release builds stripped
    com.k2fsa.sherpa.onnx config fields read only via JNI (decodingMethod, …),
    so the recognizer died at stop with "failed to get field id for decodingMethod."
    Added ProGuard keep rules. (PR #175 alone still has this bug.)
  • feat(android): live transcription. The voxasr plugin decodes the growing
    buffer during recording (finalized windows + a volatile partial) and exposes it
    via pollTranscript; backend-local.js feeds app.js's existing live view. The
    authoritative full pass still runs once at stop.
  • fix(android): fetch-deps.sh — brace ${MODEL} so the trailing multibyte
    ellipsis doesn't trip set -u under UTF-8 (broke local + would break CI).
  • ci(android): android-release.yml — build + sign an arm64 APK on mobile-path
    changes / manual dispatch, publish to a rolling android-latest release.
  • docs(android): docs/android-install.md install guide + a GitHub Pages
    landing page (docs/index.html) + README pointer.

After merge

  • The new workflow triggers (the merge touches mobile paths) and publishes the
    first android-latest APK. Until the ANDROID_* signing secrets are set it's
    debug-signed (installable, not update-stable); see docs/android-install.md.
  • Enable Pages from main /docs to serve the download page.

NubsCarson added 30 commits June 4, 2026 06:26
…rt → review)

A clean, responsive web GUI that fully drives VoxTerm's engine from the browser
(desktop + phone over LAN), with a Python control backend — no reinvention of the
transcription/diarization logic, it reuses VoxTerm's own AudioCapture + transcriber +
Silero VAD + diarizer + EventLogger.

  gui/server.py     stdlib http.server + SSE status stream + JSON API (loopback by
                    default; VOXTERM_GUI_LAN=1 to reach it from a phone). CSP, nosniff,
                    bounded request bodies, static-dir traversal guard, capped SSE.
  gui/engine.py     control layer: start/stop recording via AudioCapture, background
                    transcribe+export job with progress, session history, artifact reads
                    (path-traversal guarded).
  gui/transcribe.py importable transcription (WAV/buffer -> faithful events.jsonl +
                    -transcript.md) reusing VoxTerm's engine; progress callback for the UI.
  gui/export.py     the reviewed LLM-agent exporter (events.jsonl -> -agent.md + .json),
                    ported self-contained into the fork (+ gui/test_export.py, 23 tests).
  gui/static/       polished UI (index.html/style.css/app.js): record hero w/ live level
                    ring + timer, model/language pickers, SSE-driven transcript view,
                    client-side speaker rename (flows into copy/export), session browser,
                    Copy-for-AI / download .md / download .json.

v1 = record → stop → transcribe (robust; reuses the tested pipeline). Verified so far
without a mic: API + static serving + traversal guards + the full load/view/export flow
against a real 53-turn session; export tests 23/23. Pending a recording-finalize (mic
contention): the record-from-GUI path and a Tauri v2 native/mobile wrapper. Live
word-streaming, party/P2P, hivemind = labeled fast-follows.
… + correctness

Review of the GUI (16 agents) found 11 real issues, all fixed + verified:
- BLOCKER: strict CSP (style-src 'self', no 'unsafe-inline') silently blocked every
  element.style the UI sets (level ring, progress bar, speaker color dots) — the core
  visuals. Allow 'unsafe-inline' for style-src (all interpolated values are escaped).
- MAJOR (security): LAN mode (VOXTERM_GUI_LAN=1) had zero auth — anyone on the wifi
  could start a recording of the room or read past transcripts. Now requires a token
  (generated/printed on start, or VOXTERM_GUI_TOKEN) on every /api/* call; loopback
  stays open. Verified: no-token/bad-token -> 401, valid -> 200.
- MAJOR (perf): the transcriber/VAD/diarizer were reloaded from disk every recording.
  Cache them (lock-guarded) in gui.transcribe; reset the diarizer session per run.
- MAJOR (xss): unescaped speaker rename/label in the legend innerHTML -> escapeHtml.
- MAJOR (correctness): hand-built YAML in the client export broke / allowed key
  injection on a rename/peer_name with a quote or newline -> JSON.stringify scalars
  (mirrors the server's _yaml_scalar).
- MAJOR (crash): Download .md/.json threw on a raw-markdown fallback session (CUR null)
  -> guard the handlers.
- MINOR: dir-aware artifact resolution (same stem in two dirs returned the wrong file);
  poll-thread appends under the lock + join the thread WITHOUT holding it (avoids a
  deadlock) so trailing audio isn't dropped; SSE counter guarded by a lock; session-stem
  escaped in the sidebar; flush startup prints so the LAN token is visible immediately.
- NIT: start_recording wraps mic-open in try/except -> structured {ok:false,error} so a
  busy/missing mic shows a real message instead of a 500.
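
The YAML fix in the list above relies on JSON string literals also being valid double-quoted YAML scalars. A minimal Python sketch of that idea follows; `yaml_scalar` and `front_matter` are hypothetical names for illustration (not the project's `_yaml_scalar`), and the hostile value is made up.

```python
import json

def yaml_scalar(value):
    """Quote a string as a JSON literal, which YAML also reads as a double-quoted scalar."""
    return json.dumps(value)

def front_matter(fields):
    """Render a flat dict of strings as a YAML front-matter block (illustrative only)."""
    lines = ["---"]
    for key, value in fields.items():
        lines.append(f"{key}: {yaml_scalar(value)}")
    lines.append("---")
    return "\n".join(lines)

# A rename containing a quote and a newline can no longer open a new key:
hostile = 'Sam"\nadmin: true'
print(front_matter({"peer_name": hostile}))
```

Because the quote and the newline are escaped inside one scalar, the injected `admin: true` stays part of the `peer_name` value rather than becoming its own line.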

Verified: py_compile + node --check clean; export tests 23/23; CSP header correct; the
loopback load flow against a real 53-turn session; LAN 401/200/401; and a live
record -> stop -> transcribe -> export run through the engine (graceful 0-turn on a
near-silent clip). 2 review findings correctly refuted (malformed Content-Length is
cosmetic; nav a11y is enhancement).
Recording-safe hardening (built while a live mic recording ran — file-only changes):

- gui/test_engine.py (16 tests): non-mic engine paths — models()/languages(),
  _write_wav round-trip + clipping, sessions() discovery/ordering/flags across dirs,
  read_artifact/_resolve text + path-traversal rejection + only_dir restriction,
  idle status() shape. Isolated to temp dirs; never opens the mic or a model.
- gui/test_server.py (14 tests): in-process server on an ephemeral port — static
  serving + content-types, traversal blocked (403), /api/options|status|sessions,
  404 unknown route, and the full LAN-auth contract (no-token 401 / valid 200 /
  wrong 401 / header 200 / TOKEN=None open; static stays open). No /api/record POST.
- gui/README.md: honest docs — what it is, how to run, the phone/LAN token flow,
  the privacy/security model, files + outputs, v1 features + labeled fast-follows.
- UX polish (static/* only, API unchanged): a11y (aria-expanded synced, aria-live on
  status/toast, :focus-visible rings), keyboard (Space / r toggle record, Escape +
  outside-click close the mobile drawer, without hijacking focused controls), a
  "Summarize for AI" button (copies transcript prefixed with a ready-to-paste LLM
  summarization task), real mic-error toasts, an empty-sessions state, and export
  buttons disabled until a transcript is loaded.

Verified (light, recording-safe): py_compile + node --check clean; all three gui
suites green (23+16+14 = 53 tests); serve smoke confirms the UI + new control load.
…se A of the roadmap)

Recording-safe (file-only): turns the web GUI into an installable PWA so it lands on
your phone/desktop home screen and opens instantly/offline.
- manifest.webmanifest (name, standalone, theme/bg, maskable icons) + icon.svg +
  generated icon-192/512.png.
- sw.js service worker: cache-first for the app shell, network-only for /api and SSE,
  versioned cache dropped on activate. Registered from app.js (CSP script-src 'self').
- server.py: serve /manifest.webmanifest + /sw.js at ROOT (root SW scope), add the
  .webmanifest/.png content-types, and extend CSP with manifest-src 'self' + worker-src
  'self' (the strict default-src 'none' would otherwise block both).
- index.html: rel=manifest, theme-color, svg icon + apple-touch-icon.

Verified (recording-safe): py_compile + node --check clean; server tests 14/14; serve
smoke confirms manifest (application/manifest+json), sw.js, and icons all 200 with the
CSP allowances present.

Roadmap + rationale live outside the repo at ~/voxterm-plans/voxterm-gui-roadmap.md.
Next (needs no-recording): Tauri v2 native/mobile wrapper, live word-streaming, and the
record-through-the-GUI live test.
…ack, network errors

Adversarial review of the PWA/UX code (3 confirmed) fixed:
- Stale-shell trap: cache-first never revalidated, so shipping a new app.js/style.css
  without bumping the SW left installed clients on the old shell forever. Switch the
  static shell to stale-while-revalidate (serve cache, refresh in background) — changed
  assets are picked up on the next load with no manual cache bump.
- Offline navigation: exact-URL match meant "/?token=..." (phone/LAN mode) never matched
  the cached "/", so the offline shell never loaded there, and non-"/" nav offline
  returned undefined -> browser error page. Navigations are now network-first with a
  fallback to caches.match("/", {ignoreSearch:true}).
- Network errors: getJSON had no catch (server down -> unhandled rejection, silent
  no-op). Now catches, toasts "Network error", and returns {ok:false,error:"network"};
  init() and loadSessions() default missing fields so the UI degrades cleanly.

2 findings correctly refuted (token-URL cache bloat = negligible nit; key-repeat
double-trigger = benign). Verified: node --check app.js+sw.js; engine+server tests green.
…tate

Two real, recording-safe features (built + tested while a live recording ran):

- SRT/VTT subtitle export: each transcript turn already carries audio start/end, so
  export.py now adds t_offset_end per turn and renders proper .srt (HH:MM:SS,mmm,
  1-indexed cues) + .vtt (WEBVTT, HH:MM:SS.mmm) — written alongside -agent.md/.json,
  a --format {md,json,srt,vtt,all} CLI flag, served via engine.read_artifact (srt/vtt
  kinds), and built client-side in the UI for instant "Download SRT/VTT" of any loaded
  session (rename-aware). Verified on the real 28-min session (valid cues).
- Settings persistence: the GUI remembers your Model + Language in localStorage
  (private-mode-safe), restoring them on load when still offered by the server.
- Loading state: a calm body.working affordance + hard-disabled record button while a
  transcription job runs.

Tests: gui suites now 65 total (export 35 / engine 16 / server 14), all green;
node --check clean; serve smoke confirms t_offset_end in the API JSON + the .srt
artifact served (200).
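
The two timestamp shapes named above differ only in the millisecond separator. The sketch below shows one way to render them and to number cues; the function names and the `(start, end, text)` tuple layout are assumptions for illustration, not the actual code in export.py.

```python
def _split(seconds):
    """Split a non-negative offset in seconds into (hours, minutes, seconds, ms)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return hours, minutes, secs, ms

def srt_time(seconds):
    h, m, s, ms = _split(seconds)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"   # SRT uses a comma before ms

def vtt_time(seconds):
    h, m, s, ms = _split(seconds)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"   # WebVTT uses a dot before ms

def to_srt(turns):
    """turns: iterable of (start_s, end_s, text). Cues are 1-indexed; empty text is skipped."""
    cues = []
    number = 0
    for start, end, text in turns:
        if not text.strip():
            continue
        number += 1   # independent counter, so skipped turns leave no gaps
        cues.append(f"{number}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)
```

Keeping the cue counter separate from the turn index is what lets empty turns be dropped without leaving holes in the numbering.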
…, peer label, clamps

Adversarial review of the SRT/VTT code (4 confirmed) + a byte-for-byte client↔server
parity check surfaced and fixed:
- Cue-text injection: a newline / blank line / "-->" inside a turn's text or label
  corrupted SRT/VTT cue boundaries (could inject a fake cue). Add _cue_text() (collapse
  newlines, neutralize "-->") applied to both cue text and label in to_srt/to_vtt.
- Client didn't skip empty-text turns and used the array index → blank cues + index
  drift vs the server file. Client now filters empty turns with an independent counter
  (mirrors the backend), so a downloaded .srt/.vtt byte-matches the server artifact.
- Peer label divergence: client used nameFor() ("Sam · laptop (peer)") vs backend
  "Sam (peer)". Client cueLabel now matches the backend exactly (renames stay a
  client-only delta on local speakers).
- Degenerate-span clamp: client bumped end +2.0s vs backend +0.5s; unified to +0.5s.
- build() now also clamps t_offset_end > t_offset in the JSON sidecar (was only the
  rendered cue), so the documented invariant holds for out-of-order offsets.

Verified: 67 gui tests green (export 37 incl. 2 new regressions / engine 16 / server 14);
node --check clean; a verbatim Python-vs-Node parity harness on a nasty doc (peer +
empty + blank-line/"-->" injection + zero/neg span) produces byte-identical SRT.
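
A rough sketch of the two guards described above: sanitizing cue text and clamping degenerate spans. How the real `_cue_text()` neutralizes the arrow is not shown in this PR, so the replacement with `->` and the helper names here are assumptions; only the +0.5s clamp value comes from the commit message.

```python
import re

MIN_SPAN = 0.5  # seconds; the unified clamp value named in the commit message

def cue_text(text):
    """Make text safe to place inside one SRT/VTT cue (illustrative)."""
    # Collapse newlines and blank lines: a blank line would end the cue early.
    collapsed = re.sub(r"\s+", " ", text).strip()
    # Shorten any run of two or more dashes before '>' so no timing arrow survives.
    return re.sub(r"-{2,}>", "->", collapsed)

def clamp_span(start, end):
    """Give zero-length or negative spans a minimum visible duration."""
    if end <= start:
        end = start + MIN_SPAN
    return start, end

injected = "hello\n\n2\n00:00:01,000 --> 00:00:02,000\nfake cue"
print(cue_text(injected))
```

The dash-run pattern is used instead of a plain string replace because replacing `-->` inside `--->` would leave a fresh `-->` behind.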
… fix gui.export path

The README predated several shipped features; bring it current: subtitle export
(.srt/.vtt + byte-identical client downloads), PWA (installable/offline), settings
persistence, keyboard/a11y, Summarize-for-AI; add the srt/vtt outputs + t_offset_end;
correct the stale 'python -m glass.export' to 'python -m gui.export --format ...'.
Type to filter the (growing) session list by date/stem; case-insensitive; the filter
survives list refreshes and shows a clean 'no match' state. Pure frontend (loadSessions
now caches SESSIONS + renderSessions(query) renders a filtered view); the typing guard
already prevents the Space/R record shortcut from firing in the search box.
…ing (gui/live.py)

Tails the raw PCM of a WAV being recorded and transcribes each new speech window with
VoxTerm's engine, printing '[mm:ss] text' as the conversation happens. Reads the FILE,
not the mic, so it runs alongside any recorder with zero contention. Text-only + fw-base
default for low latency. CLI: python -m gui.live ROOM.wav [--model] [--interval] [--max-seconds].

Proven on a live recording (transcribed the active conversation in near-real-time).
NOT yet wired into the GUI browser UI — that's the next step (stream lines over SSE to a
live transcript panel).
…ing, stream to a panel

Wires gui/live.py's near-real-time transcription into the browser UI so it's actually
usable, not just a CLI:
- engine.py: live_start/live_stop + a background tail-transcribe thread that follows the
  newest in-progress recording FROM the current end (true live — no slow backlog replay),
  transcribes finalized speech windows with the cached fw-base engine, and appends
  '[mm:ss] text' lines (capped) exposed via status().live. Reads the file, not the mic,
  so it runs alongside any recorder with zero contention.
- server.py: POST /api/live/start (optional wav, defaults to newest) + /api/live/stop.
- static: a '⦿ Live transcript' toggle + a streaming, auto-scrolling live panel (calm
  theme, pulsing dot); applyStatus renders status().live.lines.

Verified end-to-end on a real in-progress recording: start -> lines appear within ~10-16s
of new speech with correct audio timestamps (e.g. [39:19]…) -> stop. server tests 14/14;
node --check clean. (Browser render of the panel is wired but visually confirmable only
in a browser; the data path is proven.)
…s test

- engine.delete_session(stem, dir): removes only a session's text artifacts
  (-transcript/-agent.{md,json,srt,vtt}/-events.jsonl) for the stem, reusing _resolve's
  traversal guard + _session_dirs/only_dir restriction; never touches .wav (audio kept).
- POST /api/session/delete (behind the LAN-auth gate).
- UI: a subtle ✕ on each session row (confirm; stopPropagation so it can't trigger open;
  clears the view if the open session is deleted).
- test_engine: +6 delete tests (exact-files, traversal rejected, dir-restricted, .wav
  untouched, missing-stem ok); fixed test_status_idle_shape to expect the 'live' key
  added by the prior live-transcription commit.

gui suites green (export 37 / engine 22 / server 14).
The live monitor finalized speech only on silence, so an in-progress
utterance showed nothing until the speaker paused. Add a partial preview of
the still-growing tail: each pass re-decodes the tail and a LocalAgreement-n
stabilizer commits the longest word-prefix that has agreed across the last n
hypotheses (stable) and marks the remainder volatile. As words settle they
graduate from volatile to stable, so the head stops flickering while the tail updates live.

- gui/stabilize.py: PartialStabilizer (pure, LocalAgreement-n) + 9 unit tests
- engine._live_loop: re-decode tail → stabilize → status.live.partial; reset
  on finalize so each utterance starts clean
- app.js/style.css: render the partial (committed words solid, volatile tail
  dimmed + softly pulsing)

Proven on a real recording: ASR revised "floor"→"hood" mid-utterance and the
stabilizer held it volatile until settled (never committed the wrong word).
82 tests green. Idea ported from elizaOS's streaming partial-stabilizer.
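
The LocalAgreement-n idea described above can be sketched in a few lines. This is not the code in gui/stabilize.py; the class name, the default n=2, and the example phrases are assumptions chosen to mirror the "floor" to "hood" revision mentioned in the commit message.

```python
from collections import deque

def common_prefix(hypotheses):
    """Longest word prefix shared by every hypothesis in the window."""
    prefix = []
    for column in zip(*hypotheses):
        if any(word != column[0] for word in column):
            break
        prefix.append(column[0])
    return prefix

class PrefixStabilizer:
    def __init__(self, n=2):
        self.n = n
        self.history = deque(maxlen=n)   # the last n hypotheses, as word lists
        self.committed = []              # stable words; only ever grows

    def update(self, hypothesis):
        """Feed one re-decode of the tail; return (stable_words, volatile_words)."""
        words = hypothesis.split()
        self.history.append(words)
        if len(self.history) == self.n:
            agreed = common_prefix(self.history)
            # Extend the committed prefix; never rewrite words already shown as stable.
            self.committed += agreed[len(self.committed):]
        return list(self.committed), words[len(self.committed):]

    def reset(self):
        """Call on finalize so the next utterance starts clean."""
        self.history.clear()
        self.committed = []

stab = PrefixStabilizer(n=2)
for hyp in ["pop the floor", "pop the hood", "pop the hood now"]:
    print(stab.update(hyp))
```

In this run the misheard last word is never committed: it only becomes stable once two consecutive hypotheses agree on it.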
Gap #1 — live now tails the GUI's own recording. start_recording streams
straight to a growing on-disk WAV (placeholder header, _poll appends s16 PCM
under the lock + flush, stop patches the real header). The live monitor tails
that same file, so clicking Live during a GUI recording shows your words
(before, Record buffered in RAM and only wrote on stop, so Live saw nothing).
Bonus: a long session no longer sits entirely in RAM; transcription loads the
file off-thread.

Gap #3 — small stuff:
- live.py CLI: ported the LocalAgreement stabilizer (in-place updating partial
  line) for parity with the GUI
- app.js/index.html/style.css: a scrolling live amplitude canvas during record

3 new streaming-WAV tests (header is 44B + parses, _pcm_bytes==_write_wav,
growing file is tailable mid-write then finalizes valid). 85 tests green.
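
The streaming-WAV scheme described above (placeholder header, append and flush, tail from another reader, patch the header on stop) might look roughly like this. The function names are hypothetical, and the 16 kHz mono defaults are assumptions; only the 44-byte header and s16 PCM come from the commit message.

```python
import struct

HEADER_SIZE = 44   # canonical PCM WAV header: RIFF, fmt and data chunk headers
SAMPLE_WIDTH = 2   # bytes per s16 sample

def wav_header(data_bytes, rate=16000, channels=1):
    """Build a 44-byte PCM WAV header (rate and channels are illustrative defaults)."""
    block_align = channels * SAMPLE_WIDTH
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_bytes, b"WAVE",
        b"fmt ", 16, 1, channels, rate, rate * block_align, block_align,
        SAMPLE_WIDTH * 8,
        b"data", data_bytes,
    )

def start_recording(path):
    f = open(path, "wb")
    f.write(wav_header(0))   # placeholder: the sizes are unknown until stop
    f.flush()
    return f

def append_pcm(f, pcm):
    f.write(pcm)
    f.flush()                # make the bytes visible to a reader tailing the file

def tail_pcm(path, offset):
    """Return (new whole samples past offset, next offset) from the growing file."""
    with open(path, "rb") as reader:
        reader.seek(offset)
        data = reader.read()
    usable = len(data) - len(data) % SAMPLE_WIDTH   # never hand out half a sample
    return data[:usable], offset + usable

def stop_recording(f):
    data_bytes = f.tell() - HEADER_SIZE
    f.seek(0)
    f.write(wav_header(data_bytes))   # patch in the real sizes
    f.close()
```

A live monitor would call `tail_pcm` in a loop starting at `HEADER_SIZE`, carrying the returned offset forward, so it reads the file and never the microphone.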
Found by a verifying multi-agent audit of the new streaming-record/live code.

Concurrency:
- stop_recording now stops the live monitor (live is bound to the recording's
  lifetime) — was leaking a daemon thread that re-decoded the finalized file
  forever and raced the post-stop job.
- live monitor uses a DEDICATED transcriber (_get_engines dedicated="live") so
  it never shares CTranslate2 decode state (or the dedup buffer) with the batch
  job — CT2 isn't safe for concurrent decode on one instance.
- live_start/live_stop track real thread liveness (no double live loop after a
  timed-out join); status() snapshots self._live under the lock.

Correctness:
- stop_recording surfaces a header-patch I/O failure as an error instead of
  silently transcribing a zero-data WAV into a spurious empty session.

Security (server.py):
- CSRF: reject cross-origin state-changing POSTs (Sec-Fetch-Site / Origin vs Host).
- DNS-rebinding: loopback Host-header allowlist (blocks a rebinding site driving
  the tokenless local API).
- Clickjacking: CSP frame-ancestors 'none' + X-Frame-Options: DENY.
- _authed compares on UTF-8 bytes so a non-ASCII token yields 401, not a crash.

+3 security regression tests (host allowlist, CSRF, non-ASCII token). 88 green.
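
The non-ASCII token item above matches a known property of Python's `hmac.compare_digest`: given two `str` arguments it accepts ASCII only and raises `TypeError` otherwise, while bytes are always accepted. A minimal sketch of a bytes-based check follows; `authed` here is a hypothetical stand-in, not the project's `_authed`.

```python
import hmac
from typing import Optional

def authed(presented: Optional[str], expected: Optional[str]) -> bool:
    """Constant-time token check that tolerates non-ASCII input (illustrative)."""
    if expected is None:
        return True    # no token configured (loopback mode): the API stays open
    if presented is None:
        return False   # token required but none sent
    # Compare UTF-8 bytes: compare_digest on str would raise for non-ASCII input.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))

print(authed("é", "secret"))       # a wrong, non-ASCII token is rejected, not a crash
print(authed("secret", "secret"))
```

A handler built on this would map a `False` result to 401, which is the behavior the regression test in the commit asserts.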
VoxTerm splits turns on VAD silence alone, so a natural pause after "and…" or
"the…" wrongly ends a turn mid-sentence. Add a zero-model end-of-turn signal
(gui/eot.py): P(turn complete) from grammar cues — terminal punctuation 0.95,
trailing conjunction 0.15, trailing article/preposition 0.20, short 0.70, else
0.50. The live loop now merges a finalized fragment into the previous line when
that line ended mid-clause (live view is text-only, so no speaker boundary to
cross), giving readable sentences instead of choppy breath-split lines.

9 unit tests; 97 green. Idea ported from elizaOS's HeuristicEotClassifier.
(Diarization hardening + windowed-live ports were checked against VoxTerm's
code and found redundant — VoxTerm already gates centroid updates by cosine sim
and bounds the live buffer via VAD — so they were intentionally skipped.)
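
The grammar-cue scoring described above can be sketched as a small lookup. The probabilities are the ones listed in the commit message; the word lists, the rule order, the "short" cutoff of three words, and the 0.3 merge threshold are all assumptions, and the names are not taken from gui/eot.py.

```python
import re

CONJUNCTIONS = {"and", "but", "or", "so", "because"}                 # assumed list
ARTICLES_PREPOSITIONS = {"the", "a", "an", "of", "to", "in", "on",
                         "at", "for", "with"}                        # assumed list
SHORT_WORDS = 3                                                      # assumed cutoff

def p_turn_complete(text):
    """Heuristic P(turn complete) from the trailing grammar cue."""
    stripped = text.strip()
    trailing_off = stripped.endswith(("...", "\u2026"))   # an ellipsis is not an ending
    if stripped.endswith((".", "!", "?")) and not trailing_off:
        return 0.95
    words = re.findall(r"[a-z']+", stripped.lower())
    last = words[-1] if words else ""
    if last in CONJUNCTIONS:
        return 0.15
    if last in ARTICLES_PREPOSITIONS:
        return 0.20
    if len(words) <= SHORT_WORDS:
        return 0.70
    return 0.50

def merge_fragments(lines, threshold=0.3):
    """Fold a finalized fragment into the previous line if that line ended mid-clause."""
    merged = []
    for line in lines:
        if merged and p_turn_complete(merged[-1]) < threshold:
            merged[-1] = f"{merged[-1]} {line}"
        else:
            merged.append(line)
    return merged

print(merge_fragments(["we went to the store and", "bought milk.", "Then we left."]))
```

Since the live view is text-only, merging never has to decide which speaker owns the joined line.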
…ucture)

From the Android-plan UI critique — the safe, additive set (deferring layout
moves like the bottom record bar until we can render on a device):
- delete ✕ was opacity:0-until-hover → invisible on touch; show it on
  @media (hover: none)
- waveform canvas was a fixed 600px bitmap stretched by CSS → blurry on
  phones/retina; size the bitmap to CSS px × devicePixelRatio, draw in CSS px
- --faint #6b7280 (~3.9:1, failed WCAG AA) → #7d8694 (~4.6:1)
- mobile: safe-area insets (notch/gesture bar) on main/sidebar/nav/toast +
  44px min tap targets on btn/select/ghost/✕/legend
- honor prefers-reduced-motion (kills the looping pulses/level animations)

97 tests green; JS validates.
v1 Android app = the existing web UI in a native shell, talking to the VoxTerm
backend on your desktop over the LAN. The phone does NO transcription.

- src-tauri/: Tauri v2 host crate (identifier site.nubs.voxterm, frontendDist
  → ../mobile-pair, window "main", android minSdk 24) + gen/android/ gradle project
- mobile-pair/index.html: on-theme pairing page — enter desktop host/port/token
  (prefilled from localStorage), navigates the webview to
  http://host:port/?token=… where the desktop serves UI+API+SSE from one origin,
  so app.js works unchanged (reads the token from the query string)
- AndroidManifest: INTERNET permission ONLY — no RECORD_AUDIO, no camera, no
  location. The app structurally cannot record you; the desktop owns the mic.
  usesCleartextTraffic=true for the LAN http backend (token is the gate).

Scaffold only — not yet built. Lives on feat/gui, no PR.
scripts/android-dev.sh — plug in a phone (or --emulator) and it self-heals the
toolchain (rust targets), builds the APK, installs, launches, and asserts the
app is alive. Test traffic stays on loopback via `adb reverse tcp:8740` — never
touches Wi-Fi. Stages A–F with structured exit codes (10 toolchain/11 targets/
20 device/30 build/40 install/50 launch/60 smoke). Hard gates: build, install,
launch, render-not-blank (scripts/assert_screen.py, Pillow luminance check).
Soft for v1: the backend round-trip (depends on the in-app connect flow).

Supporting bits:
- scripts/mock_backend.py — torch-free stdlib stand-in (serves gui/static +
  canned /api + heartbeat SSE, logs requests) for fast offline CI runs (--mock)
- gui/server.py — opt-in request logging via VOXTERM_GUI_LOG=1 (silent by
  default) so the smoke test can assert GET /api/options + /api/events
- mobile-pair: auto-connect if a backend answers on the device's localhost
  (the adb-reverse/dev case) — fails fast on a real phone → pairing form stays

97 tests green. Quickstart: scripts/android-dev.sh --emulator --debug --mock
(offline) or scripts/android-dev.sh --debug (real phone, real engine).
… Silicon

A verifying cross-platform audit found the GUI dead on Apple Silicon (the
flagship target) and the android script broken on every mac. All fixes are
Linux-safe (97 tests still green) and standard per-platform branching:

GUI (HIGH — Apple Silicon had an empty model dropdown + KeyErrors):
- Engine.models() falls back to AVAILABLE_MODELS when FASTER_WHISPER_MODELS is
  empty (Apple Silicon) so the dropdown is never blank
- new CPU-aware default (transcribe.gui_default_model / Engine.default_model):
  prefer fw-small where faster-whisper exists, fall back to MLX only on Apple
  Silicon — and crucially NOT raw config.DEFAULT_MODEL, which is qwen3-0.6b when
  qwen-asr is installed (too slow on CPU). Fixes the live + post-stop KeyErrors.
- /api/options exposes default_model; app.js pre-selects it (no more fw-small)

scripts/android-dev.sh (broken on every Mac):
- ANDROID_HOME / JAVA_HOME per-OS (mac ~/Library/Android/sdk, Studio JBR / java_home)
- resolve python3 (mac has no bare `python`); arm64-v8a AVD + -gpu host on Apple Silicon

audio/capture.py: actionable mac mic-permission error (TCC not granted)
gui/export.py: per-platform live-dir fallback (was Linux XDG only)

Full report: ~/voxterm-plans/mac-compat-report.md
Zero-regression hardening from the cross-platform audit (99 tests green):
- 0.1 decouple headless ASR from the Textual TUI: gui/transcribe.py imported
  tui.app (pulling textual+sounddevice into every server/headless import).
  Extracted the pure split into tui/text_split.py; tui.app delegates to it.
  Verified: importing gui.server no longer loads `textual`.
- 0.3/0.4 gui/server._read_json: a malformed Content-Length raised an uncaught
  ValueError out of the POST handlers — guard it; also close the connection on
  an oversized body (no undrained body / latent HTTP desync).
- 0.5 live-state writes now take self._lock (brief dict mutations only, never
  around transcribe/VAD) to match the locked reader in status() — the
  "consistent snapshot" comment is now actually true.
- 0.6 mobile-pair: the loopback auto-probe honors the port field (was hardcoded 8740).
- 0.7 export.py docstring: `glass.export` -> `gui.export` (no glass pkg).
- 0.9 Android cleartext: documented that app-wide cleartext is INTENTIONAL for
  the LAN thin client (can't scope arbitrary RFC1918 IPs declaratively; the
  token + LAN is the trust model) — kept on for release on purpose.
- 0.11 drop the cosmetic SSE `Connection: keep-alive` header (HTTP/1.0).
- 0.10 + 0.8: capture.py macOS mic-permission tests; commit src-tauri/Cargo.lock.
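
For item 0.3/0.4 above, one way to guard the header parse is shown below. The `BadRequest` type, the 1 MB limit, and the 400/413 status codes are assumptions for illustration; the PR does not show what `_read_json` returns on failure.

```python
from typing import Optional

MAX_BODY = 1_000_000   # assumed limit; the real bound is not stated in this PR

class BadRequest(Exception):
    """Carries the HTTP status to send and whether to drop the connection."""
    def __init__(self, status, message, close=False):
        super().__init__(message)
        self.status = status
        self.close = close

def parse_content_length(raw: Optional[str], limit: int = MAX_BODY) -> int:
    """Validate a Content-Length header value before reading the body."""
    try:
        length = int(raw or 0)
    except ValueError:
        raise BadRequest(400, "malformed Content-Length") from None
    if length < 0:
        raise BadRequest(400, "negative Content-Length")
    if length > limit:
        # The body is never drained, so the connection must not be reused.
        raise BadRequest(413, "request body too large", close=True)
    return length
```

In an `http.server` handler, the `close` flag would typically translate to setting `self.close_connection = True` before sending the error, so leftover body bytes are not parsed as the next request.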
A new, 100%-optional CPU streaming-ASR tier that runs everywhere VoxTerm does
(Linux, macOS arm64, Windows) with no GPU. Verified end-to-end on this Linux/CPU
box: installs clean, decodes correctly, and does NOT disturb VoxTerm's pinned
onnxruntime (sherpa statically links its own ORT).

- pyproject: `[project.optional-dependencies] streaming = ["sherpa-onnx..."]`
  (marker excludes Intel-macOS — no wheel). NOT a core dep.
- config.py: one DRY gate after the platform branches — surfaces the
  `sherpa-stream-en` model key + SHERPA_MODELS ONLY when sherpa-onnx is importable
  AND a wheel exists for the platform. Absent → byte-for-byte unchanged.
- audio/transcriber.py: SherpaStreamingTranscriber (lazy import w/ clear error;
  downloads the 20M streaming-zipformer on first load; per-call create_stream so
  it's a drop-in for the existing chunked callers; same RMS/hallucination/dedup
  filters; ALL-CAPS model output → sentence-case). Factory dispatch added before
  the Whisper fallback.
- gui/test_sherpa.py: skip-guarded (no-op without the extra) — gating consistency,
  factory dispatch, RMS short-circuit.

Zero-regression: without the [streaming] extra installed, nothing changes for any
existing user. 102 tests green (99 + 3, the new ones skip when sherpa is absent).
Follow-on (noted, not yet done): a true-streaming live-loop path (persistent
OnlineStream + endpoint finalize) so the GUI live view streams word-by-word.
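
The config gate described above can be approximated with `importlib.util.find_spec`, which reports whether a top-level module is importable without importing it. In this sketch the import name `sherpa_onnx`, the helper names, and the model description string are assumptions; only the `sherpa-stream-en` key and the Intel-macOS exclusion come from the commit message.

```python
import importlib.util
import platform
import sys

def has_streaming_wheel(system=None, machine=None):
    """False on Intel macOS, where the commit says no wheel exists."""
    system = system or sys.platform
    machine = machine or platform.machine()
    return not (system == "darwin" and machine == "x86_64")

def streaming_models(module="sherpa_onnx", system=None, machine=None):
    """Return the optional model keys, or an empty dict when the gate is closed."""
    if importlib.util.find_spec(module) is None:
        return {}   # extra not installed: nothing is surfaced
    if not has_streaming_wheel(system, machine):
        return {}
    # The value is a placeholder description, not the project's real registry entry.
    return {"sherpa-stream-en": "streaming zipformer (optional)"}

AVAILABLE = {"fw-small": "faster-whisper small"}   # stand-in for the base model set
AVAILABLE.update(streaming_models())
print(sorted(AVAILABLE))
```

Because a closed gate contributes an empty dict, the base model set is left exactly as it was when the extra is absent.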
…needs a Mac)

iOS reuses the existing Tauri thin-client (mobile-pair → LAN desktop, INTERNET/no-mic).
Everything here is cross-platform + lint-clean on Linux; the actual init/build/sign/run
loop requires a Mac + Xcode (it cannot be built on any other platform).

- src-tauri/Info.ios.plist: NSAllowsLocalNetworking (minimal ATS for LAN http, NOT
  arbitrary loads) + NSLocalNetworkUsageDescription (iOS-14 local-network prompt).
- tauri.conf.json: additive bundle.iOS { minimumSystemVersion "14.0" }.
- scripts/ios-dev.sh: Darwin-guarded (clean no-op off-Mac); adds iOS rust targets,
  `cargo tauri ios init` once, then ios dev|build.
- src-tauri/.gitignore: ignore generated /gen/apple/ build artifacts.
- docs/ios-thinclient.md: build path, the two plist keys, signing, pairing.

Zero-regression: no Python touched; Android (gen/android, manifest) byte-for-byte
unaffected; bundle.iOS + Info.ios.plist are read only by the iOS bundle target.
102 tests green.
…sherpa)

The live monitor now prefers the sherpa streaming backend when it's installed
(opt-in) and drives it as a true streaming recognizer instead of chunked VAD
windows:
- _live_loop split into setup/dispatch + two paths. The chunked path
  (_live_chunk_loop) is the original code VERBATIM — fw-*/MLX/qwen3/parakeet and
  any non-sherpa backend behave byte-for-byte as before (zero regression).
- _live_stream_loop: one persistent OnlineStream fed the tailed PCM; the running
  decode is published as the volatile partial each ~1s; sherpa's endpoint
  detection (or the 20s cap) finalizes a line. Same self._lock discipline.
- live model preference: sherpa-stream-en (if installed) → fw-base → platform
  default. Only changes behavior when the optional [streaming] extra is present.

Verified: streaming primitives grow the partial incrementally + decode correctly
on this box; 102 tests green (chunked path unchanged).
Adversarial QA of the new code + a real KVM-emulator run surfaced these (all fixed):
- transcriber: _ensure_sherpa_model is now ATOMIC (extract to staging → rename) with
  a complete-model guard (all 4 artifacts) so an interrupted extraction self-heals
  instead of a permanent StopIteration; load() uses a _pick() helper that raises a
  clear RuntimeError naming the missing file; .part download cleaned up on failure.
- transcriber: SherpaStreamingTranscriber.is_loaded is now a @property, matching every
  other backend (was a method — would mis-read as loaded via getattr).
- engine: the streaming live path now applies the hallucination + dedup filters on
  finalized lines, like the chunked/batch backends.
- android-dev.sh: launch the CORRECT component — debug builds install
  site.nubs.voxterm.debug, and the activity class keeps the base namespace, so the
  launch is <appId.debug>/<base>.MainActivity (the emulator caught the old
  site.nubs.voxterm/.MainActivity → "activity did not report Status: ok", exit 50).
- android-dev.sh: validate $PYTHON is actually runnable (clean exit 10, not a late fail).
- assert_screen.py: exit 3 = SKIP when Pillow is absent (macOS) so the render gate isn't
  a silent pass; android-dev.sh treats exit 3 as a soft skip.
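
The atomic-extraction fix in the list above follows a common pattern: unpack into a staging directory, verify, then rename into place. A sketch under stated assumptions: the four file names, the tar format, and the function names are hypothetical, since the PR only says "all 4 artifacts".

```python
import os
import shutil
import tarfile
import tempfile
from pathlib import Path

# Hypothetical artifact names; the PR does not list the four files.
REQUIRED = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")

def is_complete(model_dir):
    return all((Path(model_dir) / name).is_file() for name in REQUIRED)

def ensure_model(archive, model_dir):
    """Extract archive into model_dir so it is either complete or absent."""
    model_dir = Path(model_dir)
    if is_complete(model_dir):
        return model_dir
    shutil.rmtree(model_dir, ignore_errors=True)   # discard an interrupted attempt
    model_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=model_dir.name + ".staging-",
                                    dir=model_dir.parent))
    try:
        with tarfile.open(archive) as tar:
            # Use the safer extraction filter where this Python provides it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(staging, **kwargs)
        missing = [n for n in REQUIRED if not (staging / n).is_file()]
        if missing:
            raise RuntimeError("model archive is missing: " + ", ".join(missing))
        os.replace(staging, model_dir)             # the single visible step
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return model_dir
```

A crash at any point before `os.replace` leaves only a staging directory behind, so the next call sees no model directory and simply retries.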

102 tests green. (Low/cosmetic, left + noted: streaming line-start timestamp drift;
the loopback auto-probe's cross-origin read is best-effort and degrades to manual pairing.)
…low 14)

Use get_flattened_data when available, fall back to getdata — no behavior change,
silences the Pillow-14 DeprecationWarning the emulator run surfaced.
- audio/transcriber.py: generalized the sherpa model registry (repo→URL map) so
  multiple sherpa transducer models share one SherpaStreamingTranscriber.
- config.py: new optional gated key `sherpa-nemotron-en` (NeMo FastConformer-RNNT
  0.6B, exported for sherpa-onnx). Same find_spec gate → zero-regression when the
  [streaming] extra is absent.
- scripts/bench_asr.py: reproducible WER (word edit-distance, normalized) + CPU RTF
  benchmark across backends.
- docs/streaming-asr-benchmark.md: results + honest analysis.

Numbers (Linux CPU, 3 labeled clips): fw-small 2.1% WER / 0.64 RTF (batch, the original default);
fw-base 5.1% / 0.18; sherpa-nemotron-en 4.4% / 0.25 (streaming sweet spot — near-fw-base
accuracy, ~4x real-time, native streaming); sherpa-stream-en zipformer-20M 20.9% / 0.064
(~16x real-time but inaccurate). nemotron-EN proven to load + decode via the same backend.

102 tests green (test_sherpa now covers both gated keys, skips without the extra).
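
For readers unfamiliar with the two metrics quoted above: WER is the word-level edit distance divided by the reference length, and RTF is processing time divided by audio duration (so 0.25 RTF means about 4x real-time). The sketch below is not scripts/bench_asr.py; the normalization rule and function names are assumptions.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation before comparing (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()

def word_edit_distance(ref, hyp):
    """Levenshtein distance over word lists (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                          # deletion
                           cur[j - 1] + 1,                       # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)

def rtf(processing_seconds, audio_seconds):
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

print(wer("Pop the hood.", "pop the floor"))
print(1 / rtf(2.5, 10.0))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator is the reference length only.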
Engine.models() returned only FASTER_WHISPER_MODELS on Linux/Intel/Windows, so the
optional sherpa-stream-en / sherpa-nemotron-en keys (present in AVAILABLE_MODELS but
not the fw set) never appeared in the GUI model dropdown. Union the platform's base set
with SHERPA_MODELS so they're selectable wherever installed. Found by rendering the GUI
headless. test_models_returns_only_fw_keys -> test_models_are_valid_keys (valid-keys
invariant incl. the additive sherpa keys).
scripts/gui_e2e.py boots gui.server, drives headless Chrome via the DevTools
Protocol, and asserts the real browser flow: model dropdown + session list
populate from the API, and clicking a past session loads + renders its
transcript (with a screenshot). Covers the browser path unit tests can't — only
record-with-a-mic still needs hardware. websocket-client is a dev-only dep.

Verified: dropdown includes the optional sherpa keys, 4 sessions, transcript
renders end-to-end.
docs/streaming-asr.md: install the optional [streaming] extra, the two model keys
(sherpa-stream-en / sherpa-nemotron-en), GUI/CLI usage, how it works, and the
zero-regression/opt-in posture. gui/README 'Models' section now points to it +
the benchmark. Makes the streaming feature discoverable + usable (upstream-ready).
NubsCarson and others added 28 commits June 4, 2026 21:56
…fix)

45-agent audit of this session's additions. Confirmed fixes:
- security: gate /api/* GETs with the same-origin check too (not just POST);
  strip the ?token= from the URL after read (history.replaceState); cap each SSE
  stream at 10 min so an abandoned silent client can't hold a slot.
- desktop UX: the pairing page starts behind a 'Connecting…' loader and only
  reveals the phone form when no local engine answers (held under Tauri) — so the
  desktop app no longer flashes the phone pairing form during engine startup.
- bug: applyStatus() guarded s.job before deref; sherpa live dedup state reset
  between sessions; TUI 'g' no longer spawns duplicate engines.
- cleanup: removed dead/contradictory Pillow branch in assert_screen; bench no
  longer re-decodes WAVs for the total; transcribe cleanup logs a file-close
  failure instead of swallowing it; the host input sets inputmode=url; CSP connect-src
  tightened; backend seam fallback simplified; Tauri externalBin/freeze documented.

99 gui tests + browser e2e + Tauri build all green. Fork only.
…live engine

From the 4-agent parity audit (verdict: GUI cleanly reuses the TUI engine; these
are the non-by-design divergences worth fixing):
- F1/F2 (export drift): delete the client-side JS formatter fork (buildMarkdown/
  buildSrt/buildVtt/buildJson + helpers, ~80 lines) — it had silently desynced from
  export.py (the downloaded .md was missing 8 front-matter fields). New POST /api/export
  renders server-side via export.py (the single formatter), rebuilding from the events
  log and applying the client's speaker renames. Verified: a no-rename .md byte-matches
  the on-disk -agent.md.
- F3: models() offered FASTER_WHISPER_MODELS and silently hid installed qwen3 on Linux
  (short-circuit) — now offers the full AVAILABLE_MODELS, matching the TUI.
- F8: the live-monitor fallback used config.DEFAULT_MODEL (the CPU-unusable qwen3) —
  now uses the CPU-aware gui_default_model().
- F10: the GUI live path reached into transcriber underscore-privates — added a public
  surface (recognizer / reset_dedup / is_duplicate / is_hallucination) and use it.
- F4: documented the intentional TUI-scope gaps (no P2P/hivemind/system-audio, manual
  rename vs the cross-session speaker DB) in gui/README.md.

103 gui+sherpa tests (incl. a new export_session test) + browser e2e + /api/export
HTTP check all green. Fork only.
… (kills the relay)

The phone transcribes locally — no pairing, no relay, no network. Builds to a
green APK here (runtime needs a device/emulator-with-mic).

- tauri-plugin-voxasr: Tauri 2 Android plugin. Kotlin VoxasrPlugin reads the mic
  (AudioRecord, 16kHz mono PCM16) and streams it through a sherpa-onnx
  OnlineRecognizer with endpoint detection, emitting partial/final events. The
  20M int8 zipformer is bundled in assets + staged to filesDir on first use, so
  it's fully offline (no first-run download). RECORD_AUDIO only; no INTERNET added.
- Rust shim exposes start_transcribe/stop_transcribe (Android-only; desktop/iOS
  get a clean 'unsupported' stub so it's a plain, CLI-discoverable dep).
- sherpa-onnx 1.13.2 Android AAR (static-link ORT, all ABIs), version-matched to
  the desktop engine. AAR + model gitignored; fetch-deps.sh stages them.
- App: registers the plugin on #[cfg(mobile)].

VERIFIED: cargo tauri android build --apk (aarch64) → green APK; the Kotlin
compiled against the sherpa AAR (proves the OnlineRecognizer/AudioRecord API
usage); APK contains lib/arm64-v8a/libsherpa-onnx-jni.so + assets/voxterm-model
(int8) + VoxasrPlugin in classes4.dex + RECORD_AUDIO merged. Fork only.
…ive captions

Makes the on-device engine a usable feature: the Android app now offers "Transcribe
on this device" (fully offline) with live partial/final captions, pairing kept as the
browser fallback.

- mobile-pair: on-device mode (Start/Stop + captions) shown when the native plugin is
  present; revealMobileHome() picks on-device on the app, pairing in a plain browser.
- pair.js invokes plugin:voxasr|start_transcribe/stop and listens for the plugin's
  partial/final/error events via addPluginListener.
- withGlobalTauri + a CSP ipc: allowance let the vanilla webview reach the plugin;
  capabilities/mobile.json grants voxasr:default.
- plugin: dropped the unused StartArgs (model is bundled), so the command takes no args.

VERIFIED: cargo tauri android build --apk (aarch64) → green APK (177M); the updated
frontend re-embedded in libapp_lib.so; plugin:voxasr registered (144 refs in .so);
sherpa .so + offline model still bundled. Runtime (mic→captions) needs a device. Fork only.
…for device debugging

The on-device error sink was #err, which lives in the hidden #pairform — so a
start_transcribe failure was silent. Added a visible #odErr in the on-device panel
and a console.error (inspectable via chrome://inspect on a device). No behavior
change to the happy path.

aarch64 APK rebuilds green with sherpa lib + offline model bundled. Fork only.
…self-test proof

Re-architected the live path from plugin events to polling — addPluginListener needs a
registerListener permission a hand-written plugin doesn't generate (the prior 'Start' bug).
Now uses only permitted commands; no listener wall.

- Kotlin: AudioRecord -> sherpa OnlineRecognizer accumulates finals + partial; pollTranscript
  command returns/clears them. Debug self-test decodes a bundled clip on load (proves decoding).
- Rust: poll_transcript command (+ start/stop); run_mobile_plugin uses serde_json::Value so the
  resolve round-trips cleanly. serde_json dep added.
- JS (pair.js): on Start, poll plugin:voxasr|poll_transcript every 500ms, render finals + partial;
  clear on Stop. Visible #odErr for on-device errors.
- permissions/default.toml: allow-poll-transcript; build.rs COMMANDS += poll_transcript.
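
The poll-and-drain pattern can be sketched as follows. This is a Python illustration of the idea, not the Kotlin plugin: the class name `TranscriptBuffer` and the choice to leave the partial in place between polls are assumptions.

```python
import threading


class TranscriptBuffer:
    """Accumulates finalized lines and one volatile partial; poll() drains the finals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._finals = []
        self._partial = ""

    def set_partial(self, text):
        # Called by the decode worker whenever the in-progress hypothesis changes.
        with self._lock:
            self._partial = text

    def add_final(self, text):
        # An endpoint was detected: commit the line and reset the partial.
        with self._lock:
            self._finals.append(text)
            self._partial = ""

    def poll(self):
        # Called by the UI on a timer; each final is handed over exactly once.
        with self._lock:
            finals, self._finals = self._finals, []
            return {"finals": finals, "partial": self._partial}
```

Because the consumer pulls on its own timer, no event-listener registration is needed, which is the point of the re-architecture described above.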

VERIFIED on the x86_64 emulator: self-test decoded the bundled clip to the correct text
(model loads + decodes on-device, no network); tapping Start flips to Stop with AudioRecord
live (green mic dot) and pollTranscript firing every 500ms, no errors. Actual mic->captions
needs a real device (emulator has no mic); decoding + capture + poll wiring all proven. Fork only.
…ening

- /api/audio serves the session WAV with HTTP Range/206 (media-src added to
  the CSP so the <audio> element can load; 416 routed through _hdr; do_HEAD
  405; Content-Length on JSON/static). The engine hardlinks <stem>-gui.wav at
  transcribe time so playback maps to the exact recording.
- CPU-aware transcriber load(): explicit int8 + cpu_threads + greedy beam_size=1
  + a warm dummy decode. The GUI defaults to fw-base via gui_default_model(),
  and the engine warms the model at server start.
- "Detect speakers" diarize flag threaded through stop_recording -> transcribe.
- start_recording tolerates a malformed device value and reverts to the OS
  default input when "System default" is re-selected (no sticky global).
- _session_title keeps short first utterances (>= 2 chars) so titles aren't
  dates.
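
As an illustration of the Range/206/416 behavior, here is a minimal single-range resolver following standard HTTP range semantics. It is a sketch only; the function name `resolve_range` and its return shape are assumptions, not the /api/audio handler.

```python
import re

_RANGE = re.compile(r"bytes=(\d*)-(\d*)$")


def resolve_range(header, size):
    """Map a Range header to (status, start, end), with `end` inclusive.

    200 -> serve the whole file (no header, or a header that should be ignored)
    206 -> serve bytes start..end
    416 -> the range cannot be satisfied for a file of this size
    """
    whole = (200, 0, size - 1)
    if not header:
        return whole
    m = _RANGE.match(header.strip())
    if not m or m.groups() == ("", ""):
        return whole                       # malformed: ignore the header
    first, last = m.groups()
    if first == "":                        # suffix form: the last N bytes
        n = int(last)
        if n == 0 or size == 0:
            return 416, 0, 0
        return 206, max(0, size - n), size - 1
    start = int(first)
    if start >= size:
        return 416, 0, 0
    end = min(int(last), size - 1) if last else size - 1
    if end < start:
        return whole                       # invalid spec: ignore the header
    return 206, start, end
```

A 206 response would then carry `Content-Range: bytes start-end/size` and a `Content-Length` of `end - start + 1`, which is what lets an `<audio>` element seek.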
…iew recording

- Rebuild the UI as a monochrome (no accent hue) document-style transcript with
  a sticky record dock, a settings popover, and an export menu. The record dot
  is the only color. Inline <audio> playback (click a timestamp to seek) plus a
  Download-WAV action.
- Recording shows a level meter + "Recording..." state and the accurate,
  diarized transcript appears on stop -- one model, no streaming preview to
  reconcile against the final result.
- Robustness: title derives from the transcript (no raw-date headings);
  same-speaker turns keep a clickable timestamp instead of an orphaned box; the
  player pauses when leaving a transcript and its probe is session-tokened; seek
  waits for audio metadata; the record button has a single owner; init() surfaces
  an unreachable server.
- a11y: real keyboard focus ring on menu items, aria-live progress, readable
  muted text. PWA shell cache bumped; manifest/theme colors aligned. Docs updated.
…e2e for the redesign

- <audio preload="metadata"> so the seek bar shows the clip length immediately on
  load instead of a misleading 0:00/0:00 (cheap for a local same-origin WAV; the
  probe still defers a cold seek to loadedmetadata).
- Rewrite scripts/gui_e2e.py for the redesigned UI and add the checks unit tests
  can't cover: transcript-derived title (not a raw date), the recording's audio
  actually LOADING UNDER THE PAGE CSP (a fresh Audio() obeys media-src like the
  inline player), the visible player's real duration, and a record->stop cycle,
  with a securitypolicyviolation collector asserting zero violations.

Verified in headless Chrome: audio loadedmetadata, duration 14.66s, 0 CSP violations.
The TUI records system audio (macOS ScreenCaptureKit, Linux parec) and mixes it
with the mic; the GUI was mic-only. Add an "Audio source" selector (Microphone /
System audio / Mic + system) in the settings popover, threaded through
/api/record/start -> Engine.start_recording(source=...). system/both reuse the
engine's existing SystemCapture; "both" mixes via the same time-aligned add the
TUI uses (_mix_chunks). Fails gracefully with a clear message when the platform
tool is missing (e.g. parec not installed); selection persists in localStorage.

Tests: gui/test_capture_source.py (mix overlap+tails+clip, source wiring with the
capture classes mocked). Windows stays unavailable (no engine system-audio there).
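
The "overlap + tails + clip" behavior that the tests name can be illustrated with a small pure-Python mixer. This is a sketch under assumptions (float samples in [-1.0, 1.0], both streams starting at the same instant); the function name `mix_aligned` is hypothetical and the real mixer may differ.

```python
def mix_aligned(mic, system):
    """Sum two sample streams index by index, keep the longer tail, clip to [-1, 1]."""
    n = max(len(mic), len(system))
    out = []
    for i in range(n):
        a = mic[i] if i < len(mic) else 0.0        # past the end counts as silence
        b = system[i] if i < len(system) else 0.0
        out.append(max(-1.0, min(1.0, a + b)))     # hard clip to avoid overflow
    return out
```

For example, `mix_aligned([0.5, 0.5, 0.5], [0.25, 0.75])` sums the first sample, clips the second at 1.0, and passes the mic-only tail through unchanged.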
The TUI's "U" action runs a local-LLM summary (MLX on Apple Silicon, or an
ollama:<model> backend anywhere); the GUI only had "Summarize for AI" (copies a
prompt for an external model). Add "Summarize with local LLM": POST /api/summarize
-> Engine.summarize_session() reuses the session transcript + the TUI's own
summarizer.engine (get_summarizer/resolve_template), shows the result in a
dismissible panel above the transcript, and surfaces a clear message (never a
crash) when no backend is available. A "Summary model" settings field
(persisted) lets non-Mac users point at an Ollama model.

Tests: gui/test_summarize.py (ok / no-transcript / graceful no-backend / path
traversal, summarizer mocked). 112 gui tests + headless e2e green.
…or guard

Extend the headless-Chrome e2e to exercise the new local-LLM summarize action
(asserts it fails GRACEFULLY with no backend present — no crash, block hidden),
confirm the audio-source selector offers mic/system/both, and collect
window.onerror + unhandledrejection so the run fails on ANY uncaught JS error
anywhere in the flow. Verified locally: summarize graceful, source options
correct, 0 CSP violations, 0 uncaught JS errors.
…n cleanup)

From a code-quality deep scan (one function per purpose, no dead code, no
passthrough params):
- _mix_chunks: collapsed the gui/engine.py copy and tui/app.py's staticmethod
  into one audio/mix.py::mix_chunks — both call it; the TUI staticmethod is gone
  (not replaced with a wrapper).
- _fmt_hms: was duplicated in gui/transcribe.py (truncating) vs gui/export.py
  (rounding) → ±1s live-vs-export drift. One gui/_timefmt.py::fmt_hms (rounding),
  used by transcribe/export/engine; also dropped the non-essential _fmt_hms
  parameter the live loops threaded around.
- _write_wav: dead production code (recording uses _wav_header + _pcm_bytes) —
  deleted; its tests folded into the _pcm_bytes encoder test; unused `wave`
  import removed.
- app.js: one copyOrDownload() helper (copyForAI + summarizeForAI shared the
  clipboard-or-download fallback); a named PEER_COLOR const instead of a bare hex
  that aliased a rotating speaker slot.
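
The ±1s drift from the duplicated formatter is easy to reproduce. The sketch below contrasts truncating with rounding; the `HH:MM:SS` output shape and both function names are assumptions for illustration, not the contents of gui/_timefmt.py.

```python
def _hms(total_seconds: int) -> str:
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def fmt_hms_truncating(seconds: float) -> str:
    """Drops the fractional part, so 59.6 s is shown as 59 s."""
    return _hms(int(seconds))


def fmt_hms_rounding(seconds: float) -> str:
    """Rounds half up to the nearest second, so 59.6 s is shown as 1 min."""
    return _hms(int(seconds + 0.5))
```

Any timestamp with a fractional part of 0.5 or more renders one second apart in the two variants, which is why a single shared formatter removes the drift.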

Full suite 523 passed.
Verified by a green debug APK build (cargo tauri android build --debug --apk):
- Model staging is now atomic + complete: verify ALL 4 required files (was only
  tokens.txt) and stage into a .tmp dir then renameTo() the final dir, so a
  mid-copy process kill can't wedge a half-populated voxterm-model dir.
- Guard AudioRecord init: bail (with lastError) when getMinBufferSize() returns
  <= 0 (minBuf*2 would throw) and when state != STATE_INITIALIZED (mic busy).
- Never leak the native sherpa OnlineStream on a failed start — track it in a
  nullable and release it in finally; mark `recognizer` @volatile (built lazily
  from both the mic worker and the debug self-test thread); reset `running` on
  every exit path.
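
The stage-then-rename idea can be sketched in Python as below. This illustrates the pattern only, not the Kotlin staging code: the file names in `REQUIRED_FILES` are placeholders (the commit says four files are required but does not list them here), and `stage_model` is a hypothetical helper.

```python
import os
import shutil
from pathlib import Path

# Placeholder names; the real list of required model files is not shown above.
REQUIRED_FILES = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")


def stage_model(src: Path, dest: Path, required=REQUIRED_FILES) -> Path:
    """Copy model files into `dest` so that `dest` is either complete or absent."""
    if dest.is_dir() and all((dest / name).is_file() for name in required):
        return dest                                  # already fully staged
    tmp = dest.with_name(dest.name + ".tmp")
    shutil.rmtree(tmp, ignore_errors=True)           # discard a previous partial copy
    tmp.mkdir(parents=True)
    for name in required:
        shutil.copyfile(src / name, tmp / name)      # raises if a source file is missing
    missing = [name for name in required if not (tmp / name).is_file()]
    if missing:
        raise OSError(f"model staging incomplete, missing: {missing}")
    shutil.rmtree(dest, ignore_errors=True)          # drop a stale or half-populated dir
    os.rename(tmp, dest)                             # the final directory appears in one step
    return dest
```

A process kill during the copy leaves only the `.tmp` directory behind, and the next call wipes it and starts over.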
Swap the bundled offline model from the 20M zipformer (2023-02-17) to the
70M streaming zipformer2 (2023-06-26). On the bundled test clip the 20M
dropped the opening clause and garbled "brothels"; the 70M transcribes it
in full and correctly. On a real phone it decodes at xRT 0.09 (0.62s for
7.13s of audio, ~11x real-time), so the accuracy gain costs no latency.
APK grows ~26 MB (encoder int8 67 MB vs 40 MB).

The 70M model is model_type=zipformer2, which has no `attention_dims`
metadata, so the hardcoded modelType="zipformer" failed to init the
encoder. Set modelType="" to auto-detect the architecture from the model's
own ONNX metadata, so fetch-deps.sh is the single source of truth for the
bundled model and no architecture string has to stay in sync here.

Also log a measured xRT in the debug self-test, so on-device latency is a
real number rather than an assumption.
Select the bundled offline model with VOXASR_MODEL:

  zipformer-70m  (default) streaming zipformer2, ~68 MB assets / ~232 MB APK
                 fast (xRT 0.09 on a real phone), ALL-CAPS, no punctuation
  nemotron-0.6b  NeMo FastConformer-RNNT, ~632 MB assets / ~621 MB APK
                 accurate, native casing + punctuation, xRT 0.29 on the same phone

The default stays the lightweight zipformer so a plain build is small and
installs anywhere; nemotron is opt-in for builds that want transcript-grade
output and can afford the size. The Kotlin plugin already auto-detects the
architecture and feature dim from each model's ONNX metadata (modelType="",
metadata-driven feat_dim), so the tier swap needs no code change.

Also replace the fragile hardcoded epoch-specific cp filenames with a glob
that matches both naming schemes (zipformer's
`encoder-epoch-…-chunk-16-left-128.int8.onnx` and nemotron's plain
`encoder.int8.onnx`), mirroring the desktop loader's _pick(); add a guard
for an unknown VOXASR_MODEL. shellcheck-clean.
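
A small Python sketch of role-based file picking, in the spirit of the glob described above. The pattern `<role>*.int8.onnx` and the helper name `pick_model_file` are assumptions; neither the shell glob in fetch-deps.sh nor the desktop loader's _pick() is reproduced here.

```python
from fnmatch import fnmatch


def pick_model_file(filenames, role):
    """Return the file for `role` (e.g. "encoder") regardless of naming scheme.

    Matches both a plain `<role>.int8.onnx` and a longer `<role>-...int8.onnx`
    name, so no epoch-specific filename has to be hardcoded.
    """
    matches = sorted(name for name in filenames if fnmatch(name, f"{role}*.int8.onnx"))
    if not matches:
        raise FileNotFoundError(f"no int8 ONNX file found for role {role!r}")
    return matches[0]
```

Raising on a missing role keeps a bad download from surfacing later as an opaque recognizer failure.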

Both tiers verified end-to-end on a real device: bundled-clip self-test
decodes correctly and the live start/poll/stop pipeline runs without error.
…fecycle

start_transcribe used to reject with "microphone permission not granted" when
the runtime permission was absent, so a fresh install's first Start hard-failed
with no in-app recovery. The plugin now owns the mic: it declares the
RECORD_AUDIO "microphone" alias and requests it on first Start, resuming in a
@PermissionCallback once granted (verified on a device: fresh install -> system
prompt -> grant -> records).

Lifecycle hardening while here:
- ensureRecognizer(): one @synchronized lazy builder shared by the mic worker
  and the debug self-test, closing a check-then-act race that could build and
  leak two native recognizers, and removing the duplicated idiom.
- a per-session generation token so a worker that outlives stop's 2s join can
  neither run the mic alongside nor reset the running flag of a newer session.
- stop_transcribe clears the trailing partial so poll_transcript stops
  returning a never-finalized line after recording ends.
- the webview clears the transcript on each Start (no cross-session concat /
  unbounded DOM growth).
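
The per-session generation token can be illustrated with a short Python sketch. The class and method names (`SessionGuard`, `start`, `stop`, `is_current`, `finish`) are hypothetical and the real plugin is Kotlin; the sketch only shows why a stale worker cannot disturb a newer session.

```python
import threading


class SessionGuard:
    """Hands each recording session a token; stale tokens are ignored."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0
        self.running = False

    def start(self) -> int:
        with self._lock:
            self._generation += 1
            self.running = True
            return self._generation          # the worker keeps this token

    def stop(self) -> None:
        with self._lock:
            self._generation += 1            # invalidates any worker still alive
            self.running = False

    def is_current(self, token: int) -> bool:
        # The worker checks this in its loop and exits once superseded.
        with self._lock:
            return token == self._generation

    def finish(self, token: int) -> bool:
        # Called by a worker on exit; only the current session may clear `running`.
        with self._lock:
            if token != self._generation:
                return False
            self.running = False
            return True
```

A worker that outlives the join timeout sees `is_current` turn false and its `finish` call becomes a no-op, so the newer session's state is untouched.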
…nused dep

The plugin's android/.tauri/tauri-api/ tree is the Tauri-CLI-generated mirror of
the tauri-android framework (Apache/MIT "Tauri Programme", ~2150 LOC incl. 2+2
scaffold tests) — vendored upstream code, not part of this contribution. The
build resolves :tauri-android from the gen settings path and never uses this
copy (verified: a clean build with the directory removed still produces the
APK), and the sibling src-tauri/gen/android already gitignores its own /.tauri.
Gitignore android/.tauri/ and untrack the 29 files so the diff is the plugin.

Also drop the unused direct `serde` dependency (the crate uses only
serde_json::Value).
…self-heal deps

- add tauri-plugin-voxasr/README.md (purpose, the start/stop/poll command
  surface, fetch-deps.sh + VOXASR_MODEL tiers, RECORD_AUDIO/no-INTERNET stance,
  build via scripts/android-dev.sh) plus a short subsection + CHANGELOG entry in
  the main docs.
- fix the lib.rs crate docstring: it described a voxasr://partial/final event
  contract that does not exist — the plugin is poll-only (poll_transcript).
- update capabilities/mobile.json's description: it still said "window/webview/
  event only ... pairs to a desktop" though it now grants voxasr:default and
  on-device is the primary mode.
- android-dev.sh runs fetch-deps.sh when the AAR/model are missing, so the
  advertised one-command build works on a fresh checkout (honors VOXASR_MODEL).
- fix a stale revealForm() reference in an index.html comment (revealMobileHome).
…e-at-stop)

Replace the streaming zipformer/nemotron path with offline Whisper: the mic is
buffered while recording and, at stop, the whole clip is decoded by a sherpa-onnx
OfflineRecognizer — full context, native punctuation + casing, no rough live
output. This is the same model family the desktop's faster-whisper uses, so the
phone gets transcript-grade results.

- fetch-deps.sh: VOXASR_MODEL tiers are now whisper-tiny/base/small.en (base.en
  default ~154 MB); Whisper has no joiner, so stage encoder/decoder/tokens only,
  and wipe the model dir first so a tier switch leaves no stale files.
- VoxasrPlugin.kt: OfflineRecognizer (modelType="whisper", en/transcribe); record
  to a PCM buffer; at stop, split into <=30 s windows (cut at the quietest point
  near the boundary so words aren't sliced) and join. poll_transcript now reports
  { phase, elapsed, level, durationSec, segments[], error? }. Keeps the runtime
  RECORD_AUDIO request, generation guard, and @synchronized recognizer build; the
  stop path snapshots the take's buffer and joins a prior worker before reopening
  the single-owner mic.
- measured on a real phone: base.en self-test xRT ~0.2 (~5x real-time), correct
  punctuated transcript.
The phone now runs the SAME web GUI as the desktop instead of a separate stripped
page. gui/static is staged into the mobile bundle (mobile-pair/app/) and a
LocalBackend drives the native voxasr plugin + localStorage instead of the
desktop's Python HTTP engine — same look, same record→transcribe→view→export flow.

- gui/static/backend-local.js: implements the window.VOX_BACKEND seam
  (getJSON/events/authUrl) against the plugin; synthesizes app.js's
  recording→transcribing→done state machine from poll_transcript; persists
  sessions + renders client-side md/json/srt/vtt export. Sets the `on-device`
  flag so app.js/CSS hide Python-only features (model/source/mic/diarize/summary,
  language, local-LLM summary, WAV download, speaker rename) — no dead buttons.
- scripts/stage-mobile.sh (+ tauri beforeBuildCommand/beforeDevCommand): copies
  gui/static into mobile-pair/app/ with backend-local.js swapped in and the PWA
  shell dropped; mobile-pair/app/ is gitignored (gui/static stays the source).
- mobile-pair: the Android app redirects to the on-device GUI; the pairing form
  is now browser-only (dead loopback probe removed).
- AndroidManifest strips INTERNET (tools:node=remove) → the APK is provably
  offline; CSP trimmed to match (no remote/blob tokens).
- app.js/style.css: two small on-device guards; empty-state copy fixed (both
  platforms transcribe at stop, not live).

Verified e2e on a real phone: GUI loads, degrade applied, two record→transcribe
takes complete cleanly, zero console errors, only RECORD_AUDIO granted.
…ngine

Update the plugin README (offline Whisper, the phase-based poll contract, whisper
model tiers, the unified-GUI architecture), the main README's Android section, and
the CHANGELOG entry — the previous text described the superseded streaming model.
sherpa-onnx Whisper truncates anything ≥30 s ("process only the first 30 s and
discard the remaining"), so an exactly-30 s window risks a boundary warning. Cap
the windows at 29 s — comfortably under the limit, no data discarded, plus a
margin for the silence-aware cut. Verified the chunked decode of a synthetic
>30 s clip joins into coherent text with the cut landing in a pause.
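
A minimal sketch of the silence-aware windowing, assuming mono samples in a Python list. The search span, frame size, and the energy measure (sum of absolute amplitudes) are illustrative choices, and `split_windows` is a hypothetical name; the plugin's Kotlin implementation may differ.

```python
def split_windows(samples, sample_rate=16000, max_seconds=29.0,
                  search_seconds=2.0, frame_seconds=0.02):
    """Split audio into windows no longer than max_seconds, cutting in a quiet spot.

    For each boundary, scan the last `search_seconds` before the hard limit in
    frames and cut in the middle of the lowest-energy frame.
    """
    max_len = round(max_seconds * sample_rate)
    search = round(search_seconds * sample_rate)
    frame = max(1, round(frame_seconds * sample_rate))
    windows, start, total = [], 0, len(samples)
    while total - start > max_len:
        limit = start + max_len
        lo = max(start + frame, limit - search)
        best_cut, best_energy = limit, None          # fall back to a hard cut
        for pos in range(lo, limit - frame + 1, frame):
            energy = sum(abs(x) for x in samples[pos:pos + frame])
            if best_energy is None or energy < best_energy:
                best_cut, best_energy = pos + frame // 2, energy
        windows.append(samples[start:best_cut])
        start = best_cut
    windows.append(samples[start:])                  # the remainder fits in one window
    return windows
```

Each window is decoded independently and the texts are joined in order, so no sample is dropped and none is decoded twice.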
…g it

stagedModelDir() copied the bundled assets into a temp dir and swapped it in
without checking the required model files actually landed — a build shipping
incomplete assets would surface as a cryptic native recognizer crash later
rather than a clear error. Verify all required files are present before the
atomic rename (clear IOException otherwise), check renameTo's result, and
@synchronized it so the debug self-test can't race a first record into staging.

Verified on a physical device (debug APK): a cold re-stage after `pm clear`
stages all files and the offline self-test decodes test.wav correctly
(xRT 0.18, full casing + punctuation).
The comment advertised 'zipformer-70m default | nemotron-0.6b', but fetch-deps.sh
only accepts whisper-{tiny,base,small}.en (default whisper-base.en) and exits with status 1 on
anything else — so copying the old hint sent users straight into a script abort.
…PK CI

- fix(android): keep com.k2fsa.sherpa.onnx classes/members under R8. Minified
  release builds stripped config fields read only via JNI (decodingMethod, ...),
  so the recognizer crashed at stop with "failed to get field id for
  decodingMethod". Adds keep rules to the app proguard-rules.pro.

- feat(android): live transcription preview. The voxasr plugin now decodes the
  growing buffer during recording (finalized <=29s windows + a volatile partial)
  and exposes it via pollTranscript; backend-local.js maps it into app.js's
  existing live view. The authoritative full pass still runs once at stop.

- fix(android): brace ${MODEL} in fetch-deps.sh so the multibyte ellipsis after
  it doesn't trip bash's set -u under a UTF-8 locale (broke local builds and
  would break CI).

- ci(android): add .github/workflows/android-release.yml — build + sign an arm64
  APK on mobile-path changes (or manual dispatch) and publish it to a rolling
  android-latest release.

- docs(android): add docs/android-install.md + a README install pointer.
A self-contained static page (docs/index.html, served via Pages from /docs) with
a Download APK button (-> the android-latest release asset), sideload steps, an
Obtainium auto-update section, and requirements. Adds .nojekyll so the static
page is served as-is.