
feat: VoxTerm as a standalone Tauri app — one GUI on desktop + on-device mobile#175

Merged
RonTuretzky merged 60 commits into dmarzzz:main from NubsCarson:feat/gui
Jun 14, 2026

NubsCarson commented Jun 5, 2026

Contributor

Ready for review. Turns VoxTerm into a standalone Tauri app with one web GUI that runs on both desktop and phone — a thin control surface over the existing engine, not a reimplementation.

  • Desktop: a Tauri native shell spawns VoxTerm's existing Python engine on a loopback port + per-launch token and points the webview at it, so the GUI (gui/static) reuses audio.capture, get_transcriber, Silero VAD, the diarizer, and EventLogger — nothing in the speech pipeline is duplicated.
  • Phone (Android): the same GUI runs on-device with a native engine instead of Python — a LocalBackend drives a sherpa-onnx Tauri plugin (tauri-plugin-voxasr) that records and, at stop, transcribes the clip with offline Whisper (full punctuation). Fully offline: the APK strips the INTERNET permission (RECORD_AUDIO only). Python-only features (diarization, AI summarize, system audio) are hidden on-device.

GUI adds: monochrome document UI, inline audio playback + timestamp-seek, mic/system/both capture, session history/search, server-side md/json/srt/vtt + summary export, loopback-default + LAN token auth + strict CSP.

Verified: full Python suite green (523 passed / 4 skipped) + a headless-Chrome e2e (audio-under-CSP, transcript-derived titles, 0 CSP violations, 0 uncaught JS errors); the Android app + offline on-device transcription verified end-to-end on a real phone. 0 commits behind main, mergeable.

It's large (172 files). If that's a lot to review in one pass, I'm glad to make it easy: split into a stacked series (Tauri shell → on-device plugin → GUI), or land #174's sherpa backend first and rebase this down to just the GUI wrappers. Whatever's easiest for you to review — happy to reshape it.

NubsCarson added 30 commits June 4, 2026 06:26
…rt → review)

A clean, responsive web GUI that fully drives VoxTerm's engine from the browser
(desktop + phone over LAN), with a Python control backend — no reinvention of the
transcription/diarization logic, it reuses VoxTerm's own AudioCapture + transcriber +
Silero VAD + diarizer + EventLogger.

  gui/server.py     stdlib http.server + SSE status stream + JSON API (loopback by
                    default; VOXTERM_GUI_LAN=1 to reach it from a phone). CSP, nosniff,
                    bounded request bodies, static-dir traversal guard, capped SSE.
  gui/engine.py     control layer: start/stop recording via AudioCapture, background
                    transcribe+export job with progress, session history, artifact reads
                    (path-traversal guarded).
  gui/transcribe.py importable transcription (WAV/buffer -> faithful events.jsonl +
                    -transcript.md) reusing VoxTerm's engine; progress callback for the UI.
  gui/export.py     the reviewed LLM-agent exporter (events.jsonl -> -agent.md + .json),
                    ported self-contained into the fork (+ gui/test_export.py, 23 tests).
  gui/static/       polished UI (index.html/style.css/app.js): record hero w/ live level
                    ring + timer, model/language pickers, SSE-driven transcript view,
                    client-side speaker rename (flows into copy/export), session browser,
                    Copy-for-AI / download .md / download .json.

v1 = record → stop → transcribe (robust; reuses the tested pipeline). Verified so far
without a mic: API + static serving + traversal guards + the full load/view/export flow
against a real 53-turn session; export tests 23/23. Pending a recording-finalize (mic
contention): the record-from-GUI path and a Tauri v2 native/mobile wrapper. Live
word-streaming, party/P2P, hivemind = labeled fast-follows.
… + correctness

Review of the GUI (16 agents) found 11 real issues, all fixed + verified:
- BLOCKER: strict CSP (style-src 'self', no 'unsafe-inline') silently blocked every
  element.style the UI sets (level ring, progress bar, speaker color dots) — the core
  visuals. Allow 'unsafe-inline' for style-src (all interpolated values are escaped).
- MAJOR (security): LAN mode (VOXTERM_GUI_LAN=1) had zero auth — anyone on the wifi
  could start a recording of the room or read past transcripts. Now requires a token
  (generated/printed on start, or VOXTERM_GUI_TOKEN) on every /api/* call; loopback
  stays open. Verified: no-token/bad-token -> 401, valid -> 200.
- MAJOR (perf): the transcriber/VAD/diarizer were reloaded from disk every recording.
  Cache them (lock-guarded) in gui.transcribe; reset the diarizer session per run.
- MAJOR (xss): unescaped speaker rename/label in the legend innerHTML -> escapeHtml.
- MAJOR (correctness): hand-built YAML in the client export broke / allowed key
  injection on a rename/peer_name with a quote or newline -> JSON.stringify scalars
  (mirrors the server's _yaml_scalar).
- MAJOR (crash): Download .md/.json threw on a raw-markdown fallback session (CUR null)
  -> guard the handlers.
- MINOR: dir-aware artifact resolution (same stem in two dirs returned the wrong file);
  poll-thread appends under the lock + join the thread WITHOUT holding it (avoids a
  deadlock) so trailing audio isn't dropped; SSE counter guarded by a lock; session-stem
  escaped in the sidebar; flush startup prints so the LAN token is visible immediately.
- NIT: start_recording wraps mic-open in try/except -> structured {ok:false,error} so a
  busy/missing mic shows a real message instead of a 500.

Verified: py_compile + node --check clean; export tests 23/23; CSP header correct; the
loopback load flow against a real 53-turn session; LAN 401/200/401; and a live
record -> stop -> transcribe -> export run through the engine (graceful 0-turn on a
near-silent clip). 2 review findings correctly refuted (malformed Content-Length is
cosmetic; nav a11y is enhancement).
Recording-safe hardening (built while a live mic recording ran — file-only changes):

- gui/test_engine.py (16 tests): non-mic engine paths — models()/languages(),
  _write_wav round-trip + clipping, sessions() discovery/ordering/flags across dirs,
  read_artifact/_resolve text + path-traversal rejection + only_dir restriction,
  idle status() shape. Isolated to temp dirs; never opens the mic or a model.
- gui/test_server.py (14 tests): in-process server on an ephemeral port — static
  serving + content-types, traversal blocked (403), /api/options|status|sessions,
  404 unknown route, and the full LAN-auth contract (no-token 401 / valid 200 /
  wrong 401 / header 200 / TOKEN=None open; static stays open). No /api/record POST.
- gui/README.md: honest docs — what it is, how to run, the phone/LAN token flow,
  the privacy/security model, files + outputs, v1 features + labeled fast-follows.
- UX polish (static/* only, API unchanged): a11y (aria-expanded synced, aria-live on
  status/toast, :focus-visible rings), keyboard (Space / r toggle record, Escape +
  outside-click close the mobile drawer, without hijacking focused controls), a
  "Summarize for AI" button (copies transcript prefixed with a ready-to-paste LLM
  summarization task), real mic-error toasts, an empty-sessions state, and export
  buttons disabled until a transcript is loaded.

Verified (light, recording-safe): py_compile + node --check clean; all three gui
suites green (23+16+14 = 53 tests); serve smoke confirms the UI + new control load.
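The path-traversal rejection these tests exercise can be illustrated with a minimal sketch. It is not the repo's `_resolve`; the function name `resolve_inside` and its signature are hypothetical, and it only assumes that artifacts must live under an allowed base directory.

```python
from pathlib import Path
from typing import Optional


def resolve_inside(base_dir: Path, requested: str) -> Optional[Path]:
    """Return the real path of `requested` if it is a file inside `base_dir`, else None.

    Hypothetical helper for illustration; not VoxTerm's actual `_resolve`.
    """
    base = base_dir.resolve()
    # Joining with an absolute `requested` discards `base`, and resolve()
    # collapses ".." segments and follows symlinks, so the containment check
    # below always sees the final location on disk.
    candidate = (base / requested).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        return None  # the path escaped the allowed directory
    return candidate if candidate.is_file() else None
```

A caller would map a `None` result to a 403 or 404 rather than opening the file; which status applies in each case is a choice for the server, not something this sketch decides.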
…se A of the roadmap)

Recording-safe (file-only): turns the web GUI into an installable PWA so it lands on
your phone/desktop home screen and opens instantly/offline.
- manifest.webmanifest (name, standalone, theme/bg, maskable icons) + icon.svg +
  generated icon-192/512.png.
- sw.js service worker: cache-first for the app shell, network-only for /api and SSE,
  versioned cache dropped on activate. Registered from app.js (CSP script-src 'self').
- server.py: serve /manifest.webmanifest + /sw.js at ROOT (root SW scope), add the
  .webmanifest/.png content-types, and extend CSP with manifest-src 'self' + worker-src
  'self' (the strict default-src 'none' would otherwise block both).
- index.html: rel=manifest, theme-color, svg icon + apple-touch-icon.

Verified (recording-safe): py_compile + node --check clean; server tests 14/14; serve
smoke confirms manifest (application/manifest+json), sw.js, and icons all 200 with the
CSP allowances present.

Roadmap + rationale live outside the repo at ~/voxterm-plans/voxterm-gui-roadmap.md.
Next (needs no-recording): Tauri v2 native/mobile wrapper, live word-streaming, and the
record-through-the-GUI live test.
…ack, network errors

Adversarial review of the PWA/UX code (3 confirmed) fixed:
- Stale-shell trap: cache-first never revalidated, so shipping a new app.js/style.css
  without bumping the SW left installed clients on the old shell forever. Switch the
  static shell to stale-while-revalidate (serve cache, refresh in background) — changed
  assets are picked up on the next load with no manual cache bump.
- Offline navigation: exact-URL match meant "/?token=..." (phone/LAN mode) never matched
  the cached "/", so the offline shell never loaded there, and non-"/" nav offline
  returned undefined -> browser error page. Navigations are now network-first with a
  fallback to caches.match("/", {ignoreSearch:true}).
- Network errors: getJSON had no catch (server down -> unhandled rejection, silent
  no-op). Now catches, toasts "Network error", and returns {ok:false,error:"network"};
  init() and loadSessions() default missing fields so the UI degrades cleanly.

2 findings correctly refuted (token-URL cache bloat = negligible nit; key-repeat
double-trigger = benign). Verified: node --check app.js+sw.js; engine+server tests green.
…tate

Two real, recording-safe features (built + tested while a live recording ran):

- SRT/VTT subtitle export: each transcript turn already carries audio start/end, so
  export.py now adds t_offset_end per turn and renders proper .srt (HH:MM:SS,mmm,
  1-indexed cues) + .vtt (WEBVTT, HH:MM:SS.mmm) — written alongside -agent.md/.json,
  a --format {md,json,srt,vtt,all} CLI flag, served via engine.read_artifact (srt/vtt
  kinds), and built client-side in the UI for instant "Download SRT/VTT" of any loaded
  session (rename-aware). Verified on the real 28-min session (valid cues).
- Settings persistence: the GUI remembers your Model + Language in localStorage
  (private-mode-safe), restoring them on load when still offered by the server.
- Loading state: a calm body.working affordance + hard-disabled record button while a
  transcription job runs.

Tests: gui suites now 65 total (export 35 / engine 16 / server 14), all green;
node --check clean; serve smoke confirms t_offset_end in the API JSON + the .srt
artifact served (200).
…, peer label, clamps

Adversarial review of the SRT/VTT code (4 confirmed) + a byte-for-byte client↔server
parity check surfaced and fixed:
- Cue-text injection: a newline / blank line / "-->" inside a turn's text or label
  corrupted SRT/VTT cue boundaries (could inject a fake cue). Add _cue_text() (collapse
  newlines, neutralize "-->") applied to both cue text and label in to_srt/to_vtt.
- Client didn't skip empty-text turns and used the array index → blank cues + index
  drift vs the server file. Client now filters empty turns with an independent counter
  (mirrors the backend), so a downloaded .srt/.vtt byte-matches the server artifact.
- Peer label divergence: client used nameFor() ("Sam · laptop (peer)") vs backend
  "Sam (peer)". Client cueLabel now matches the backend exactly (renames stay a
  client-only delta on local speakers).
- Degenerate-span clamp: client bumped end +2.0s vs backend +0.5s; unified to +0.5s.
- build() now also clamps t_offset_end > t_offset in the JSON sidecar (was only the
  rendered cue), so the documented invariant holds for out-of-order offsets.

Verified: 67 gui tests green (export 37 incl. 2 new regressions / engine 16 / server 14);
node --check clean; a verbatim Python-vs-Node parity harness on a nasty doc (peer +
empty + blank-line/"-->" injection + zero/neg span) produces byte-identical SRT.
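For reviewers, the cue rules described above (collapse newlines, neutralize "-->", skip empty turns with an independent counter, clamp degenerate spans to +0.5s) can be sketched roughly as follows. This is an illustration, not the code in gui/export.py: the `text` and `label` keys, the "→" replacement, and the `Label: text` layout are assumptions; only `t_offset` / `t_offset_end` and the +0.5s clamp come from the commit message.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    ms = max(0, round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def cue_text(text: str) -> str:
    """Collapse all whitespace (incl. newlines) and neutralize the cue arrow.

    Replacing "-->" with "→" is an assumed neutralization; the real
    _cue_text() may use a different substitute.
    """
    return " ".join(text.split()).replace("-->", "→")


def to_srt(turns: list) -> str:
    cues = []
    n = 0  # independent counter: empty turns must not consume an index
    for turn in turns:
        text = cue_text(turn.get("text", ""))
        if not text:
            continue
        n += 1
        start = float(turn["t_offset"])
        end = float(turn.get("t_offset_end", start))
        if end <= start:
            end = start + 0.5  # degenerate-span clamp
        label = cue_text(turn.get("label", ""))
        body = f"{label}: {text}" if label else text
        cues.append(f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{body}\n")
    return "\n".join(cues)
```

Because blank lines and arrows can no longer appear inside a cue body, a turn whose text contains a forged index and timing line renders as one ordinary cue.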
… fix gui.export path

The README predated several shipped features; bring it current: subtitle export
(.srt/.vtt + byte-identical client downloads), PWA (installable/offline), settings
persistence, keyboard/a11y, Summarize-for-AI; add the srt/vtt outputs + t_offset_end;
correct the stale 'python -m glass.export' to 'python -m gui.export --format ...'.
Type to filter the (growing) session list by date/stem; case-insensitive; the filter
survives list refreshes and shows a clean 'no match' state. Pure frontend (loadSessions
now caches SESSIONS + renderSessions(query) renders a filtered view); the typing guard
already prevents the Space/R record shortcut from firing in the search box.
…ing (gui/live.py)

Tails the raw PCM of a WAV being recorded and transcribes each new speech window with
VoxTerm's engine, printing '[mm:ss] text' as the conversation happens. Reads the FILE,
not the mic, so it runs alongside any recorder with zero contention. Text-only + fw-base
default for low latency. CLI: python -m gui.live ROOM.wav [--model] [--interval] [--max-seconds].

Proven on a live recording (transcribed the active conversation in near-real-time).
NOT yet wired into the GUI browser UI — that's the next step (stream lines over SSE to a
live transcript panel).
…ing, stream to a panel

Wires gui/live.py's near-real-time transcription into the browser UI so it's actually
usable, not just a CLI:
- engine.py: live_start/live_stop + a background tail-transcribe thread that follows the
  newest in-progress recording FROM the current end (true live — no slow backlog replay),
  transcribes finalized speech windows with the cached fw-base engine, and appends
  '[mm:ss] text' lines (capped) exposed via status().live. Reads the file, not the mic,
  so it runs alongside any recorder with zero contention.
- server.py: POST /api/live/start (optional wav, defaults to newest) + /api/live/stop.
- static: a '⦿ Live transcript' toggle + a streaming, auto-scrolling live panel (calm
  theme, pulsing dot); applyStatus renders status().live.lines.

Verified end-to-end on a real in-progress recording: start -> lines appear within ~10-16s
of new speech with correct audio timestamps (e.g. [39:19]…) -> stop. server tests 14/14;
node --check clean. (Browser render of the panel is wired but visually confirmable only
in a browser; the data path is proven.)
…s test

- engine.delete_session(stem, dir): removes only a session's text artifacts
  (-transcript/-agent.{md,json,srt,vtt}/-events.jsonl) for the stem, reusing _resolve's
  traversal guard + _session_dirs/only_dir restriction; never touches .wav (audio kept).
- POST /api/session/delete (behind the LAN-auth gate).
- UI: a subtle ✕ on each session row (confirm; stopPropagation so it can't trigger open;
  clears the view if the open session is deleted).
- test_engine: +6 delete tests (exact-files, traversal rejected, dir-restricted, .wav
  untouched, missing-stem ok); fixed test_status_idle_shape to expect the 'live' key
  added by the prior live-transcription commit.

gui suites green (export 37 / engine 22 / server 14).
The live monitor finalized speech only on silence, so an in-progress
utterance showed nothing until the speaker paused. Add a partial preview of
the still-growing tail: each pass re-decodes the tail and a LocalAgreement-n
stabilizer commits the longest word-prefix that has agreed across the last n
hypotheses (stable) and marks the remainder volatile. As words settle they
graduate from volatile to stable, so the head stops flickering while the tail updates live.

- gui/stabilize.py: PartialStabilizer (pure, LocalAgreement-n) + 9 unit tests
- engine._live_loop: re-decode tail → stabilize → status.live.partial; reset
  on finalize so each utterance starts clean
- app.js/style.css: render the partial (committed words solid, volatile tail
  dimmed + softly pulsing)

Proven on a real recording: ASR revised "floor"→"hood" mid-utterance and the
stabilizer held it volatile until settled (never committed the wrong word).
82 tests green. Idea ported from elizaOS's streaming partial-stabilizer.
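A minimal sketch of the LocalAgreement-n idea, for readers unfamiliar with it. This is not gui/stabilize.py: the class body, the `update`/`reset` method names, and word-level whitespace tokenization are assumptions made for illustration.

```python
from collections import deque


class PartialStabilizer:
    """Commit the longest word-prefix shared by the last n hypotheses (sketch)."""

    def __init__(self, n: int = 2):
        self.n = n
        self.history = deque(maxlen=n)  # last n hypotheses, as word lists
        self.committed = []             # stable words; never un-committed

    def update(self, hypothesis: str):
        """Feed one re-decoded hypothesis; return (stable_words, volatile_words)."""
        words = hypothesis.split()
        self.history.append(words)
        if len(self.history) == self.n:
            agreed = []
            # Walk the hypotheses column by column until they disagree.
            for column in zip(*self.history):
                if all(w == column[0] for w in column):
                    agreed.append(column[0])
                else:
                    break
            # Only ever extend the committed prefix.
            if len(agreed) > len(self.committed):
                self.committed = self.committed + agreed[len(self.committed):]
        return list(self.committed), words[len(self.committed):]

    def reset(self) -> None:
        """Start clean for the next utterance (e.g. after a finalize)."""
        self.history.clear()
        self.committed = []
```

With n=2, feeding "pop the floor" and then "pop the hood" commits only "pop the" and keeps the last word volatile, which mirrors the "floor"→"hood" case described above.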
Gap #1: live now tails the GUI's own recording. start_recording streams
straight to a growing on-disk WAV (placeholder header, _poll appends s16 PCM
under the lock + flush, stop patches the real header). The live monitor tails
that same file, so clicking Live during a GUI recording shows your words
(before, Record buffered in RAM and only wrote on stop, so Live saw nothing).
Bonus: a long session no longer sits entirely in RAM; transcription loads the
file off-thread.

Gap #3 (small stuff):
- live.py CLI: ported the LocalAgreement stabilizer (in-place updating partial
  line) for parity with the GUI
- app.js/index.html/style.css: a scrolling live amplitude canvas during record

3 new streaming-WAV tests (header is 44B + parses, _pcm_bytes==_write_wav,
growing file is tailable mid-write then finalizes valid). 85 tests green.
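The placeholder-header approach can be sketched as below, assuming 16 kHz mono s16 PCM. The `GrowingWav` class and `wav_header` helper are hypothetical names for illustration and omit the locking that the real `_poll` path uses.

```python
import struct


def wav_header(data_bytes: int, sample_rate: int = 16000,
               channels: int = 1, bits: int = 16) -> bytes:
    """Build the canonical 44-byte PCM WAV header for `data_bytes` of audio."""
    block_align = channels * bits // 8
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_bytes, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate,
        sample_rate * block_align, block_align, bits,
        b"data", data_bytes,
    )


class GrowingWav:
    """Stream PCM to disk; patch the real sizes into the header at the end."""

    def __init__(self, path: str, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.data_bytes = 0
        self.f = open(path, "wb")
        self.f.write(wav_header(0, sample_rate))  # placeholder sizes
        self.f.flush()

    def append(self, pcm: bytes) -> None:
        self.f.write(pcm)
        self.f.flush()  # make the new samples visible to a tailing reader
        self.data_bytes += len(pcm)

    def finalize(self) -> None:
        self.f.seek(0)
        self.f.write(wav_header(self.data_bytes, self.sample_rate))
        self.f.close()
```

A reader tailing the file mid-recording would skip the first 44 bytes and treat the rest as raw s16 samples, since the size fields are only trustworthy after `finalize()`.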
Found by a verifying multi-agent audit of the new streaming-record/live code.

Concurrency:
- stop_recording now stops the live monitor (live is bound to the recording's
  lifetime) — was leaking a daemon thread that re-decoded the finalized file
  forever and raced the post-stop job.
- live monitor uses a DEDICATED transcriber (_get_engines dedicated="live") so
  it never shares CTranslate2 decode state (or the dedup buffer) with the batch
  job — CT2 isn't safe for concurrent decode on one instance.
- live_start/live_stop track real thread liveness (no double live loop after a
  timed-out join); status() snapshots self._live under the lock.

Correctness:
- stop_recording surfaces a header-patch I/O failure as an error instead of
  silently transcribing a zero-data WAV into a spurious empty session.

Security (server.py):
- CSRF: reject cross-origin state-changing POSTs (Sec-Fetch-Site / Origin vs Host).
- DNS-rebinding: loopback Host-header allowlist (blocks a rebinding site driving
  the tokenless local API).
- Clickjacking: CSP frame-ancestors 'none' + X-Frame-Options: DENY.
- _authed compares on UTF-8 bytes so a non-ASCII token yields 401, not a crash.

+3 security regression tests (host allowlist, CSRF, non-ASCII token). 88 green.
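On the non-ASCII token fix: Python's `hmac.compare_digest` accepts `str` arguments only when they are ASCII, so comparing encoded bytes avoids the exception. A minimal sketch under that assumption (the `token_ok` name and signature are hypothetical, not the server's `_authed`):

```python
import hmac
from typing import Optional


def token_ok(expected: Optional[str], supplied: Optional[str]) -> bool:
    """Constant-time token check that tolerates non-ASCII input (sketch)."""
    if expected is None:
        return True   # no token configured: the API is open (loopback mode)
    if supplied is None:
        return False  # token required but none was sent -> caller returns 401
    # compare_digest() on str requires ASCII-only values; comparing the
    # UTF-8 bytes works for any text and keeps the timing-safe comparison.
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))
```

With this shape, a request carrying a token such as "é" is simply rejected instead of raising inside the handler.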
VoxTerm splits turns on VAD silence alone, so a natural pause after "and…" or
"the…" wrongly ends a turn mid-sentence. Add a zero-model end-of-turn signal
(gui/eot.py): P(turn complete) from grammar cues — terminal punctuation 0.95,
trailing conjunction 0.15, trailing article/preposition 0.20, short 0.70, else
0.50. The live loop now merges a finalized fragment into the previous line when
that line ended mid-clause (live view is text-only, so no speaker boundary to
cross), giving readable sentences instead of choppy breath-split lines.

9 unit tests; 97 green. Idea ported from elizaOS's HeuristicEotClassifier.
(Diarization hardening + windowed-live ports were checked against VoxTerm's
code and found redundant — VoxTerm already gates centroid updates by cosine sim
and bounds the live buffer via VAD — so they were intentionally skipped.)
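The scoring table above can be read as a small decision ladder. The sketch below is an illustration, not gui/eot.py: the word lists, the three-word "short" cutoff, the ellipsis handling, and the 0.5 merge threshold are all assumptions; only the five probabilities come from the commit message.

```python
CONJUNCTIONS = {"and", "but", "or", "so", "because"}             # assumed list
ARTICLES_PREPOSITIONS = {"the", "a", "an", "of", "to", "in",
                         "on", "for", "with", "at"}              # assumed list
SHORT_WORDS = 3                                                  # assumed cutoff


def p_turn_complete(text: str) -> float:
    """Heuristic P(turn complete) from grammar cues alone (no model)."""
    s = text.strip()
    # A trailing ellipsis means the speaker trailed off, not finished.
    trailing_off = s.endswith("…") or s.endswith("...")
    if trailing_off:
        s = s.rstrip("….").rstrip()
    elif s.endswith((".", "?", "!")):
        return 0.95
    words = s.lower().split()
    if not words:
        return 0.50
    last = words[-1].strip(",;:")
    if last in CONJUNCTIONS:
        return 0.15
    if last in ARTICLES_PREPOSITIONS:
        return 0.20
    if len(words) <= SHORT_WORDS:
        return 0.70
    return 0.50


def merge_fragment(lines: list, fragment: str, threshold: float = 0.5) -> None:
    """Append `fragment`, joining it to the previous line if that line looks unfinished."""
    if lines and p_turn_complete(lines[-1]) < threshold:
        lines[-1] = f"{lines[-1]} {fragment}"
    else:
        lines.append(fragment)
```

For example, "I opened the door and" followed by "walked in." ends up as a single line, while a line that already ends in a period starts a new one.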
…ucture)

From the Android-plan UI critique — the safe, additive set (deferring layout
moves like the bottom record bar until we can render on a device):
- delete ✕ was opacity:0-until-hover → invisible on touch; show it on
  @media (hover: none)
- waveform canvas was a fixed 600px bitmap stretched by CSS → blurry on
  phones/retina; size the bitmap to CSS px × devicePixelRatio, draw in CSS px
- --faint #6b7280 (~3.9:1, failed WCAG AA) → #7d8694 (~4.6:1)
- mobile: safe-area insets (notch/gesture bar) on main/sidebar/nav/toast +
  44px min tap targets on btn/select/ghost/✕/legend
- honor prefers-reduced-motion (kills the looping pulses/level animations)

97 tests green; JS validates.
v1 Android app = the existing web UI in a native shell, talking to the VoxTerm
backend on your desktop over the LAN. The phone does NO transcription.

- src-tauri/: Tauri v2 host crate (identifier site.nubs.voxterm, frontendDist
  → ../mobile-pair, window "main", android minSdk 24) + gen/android/ gradle project
- mobile-pair/index.html: on-theme pairing page — enter desktop host/port/token
  (prefilled from localStorage), navigates the webview to
  http://host:port/?token=… where the desktop serves UI+API+SSE from one origin,
  so app.js works unchanged (reads the token from the query string)
- AndroidManifest: INTERNET permission ONLY — no RECORD_AUDIO, no camera, no
  location. The app structurally cannot record you; the desktop owns the mic.
  usesCleartextTraffic=true for the LAN http backend (token is the gate).

Scaffold only — not yet built. Lives on feat/gui, no PR.
scripts/android-dev.sh — plug in a phone (or --emulator) and it self-heals the
toolchain (rust targets), builds the APK, installs, launches, and asserts the
app is alive. Test traffic stays on loopback via `adb reverse tcp:8740` — never
touches Wi-Fi. Stages A–F with structured exit codes (10 toolchain/11 targets/
20 device/30 build/40 install/50 launch/60 smoke). Hard gates: build, install,
launch, render-not-blank (scripts/assert_screen.py, Pillow luminance check).
Soft for v1: the backend round-trip (depends on the in-app connect flow).

Supporting bits:
- scripts/mock_backend.py — torch-free stdlib stand-in (serves gui/static +
  canned /api + heartbeat SSE, logs requests) for fast offline CI runs (--mock)
- gui/server.py — opt-in request logging via VOXTERM_GUI_LOG=1 (silent by
  default) so the smoke test can assert GET /api/options + /api/events
- mobile-pair: auto-connect if a backend answers on the device's localhost
  (the adb-reverse/dev case) — fails fast on a real phone → pairing form stays

97 tests green. Quickstart: scripts/android-dev.sh --emulator --debug --mock
(offline) or scripts/android-dev.sh --debug (real phone, real engine).
… Silicon

A verifying cross-platform audit found the GUI dead on Apple Silicon (the
flagship target) and the android script broken on every mac. All fixes are
Linux-safe (97 tests still green) and standard per-platform branching:

GUI (HIGH — Apple Silicon had an empty model dropdown + KeyErrors):
- Engine.models() falls back to AVAILABLE_MODELS when FASTER_WHISPER_MODELS is
  empty (Apple Silicon) so the dropdown is never blank
- new CPU-aware default (transcribe.gui_default_model / Engine.default_model):
  prefer fw-small where faster-whisper exists, fall back to MLX only on Apple
  Silicon — and crucially NOT raw config.DEFAULT_MODEL, which is qwen3-0.6b when
  qwen-asr is installed (too slow on CPU). Fixes the live + post-stop KeyErrors.
- /api/options exposes default_model; app.js pre-selects it (no more fw-small)

scripts/android-dev.sh (broke on every Mac):
- ANDROID_HOME / JAVA_HOME per-OS (mac ~/Library/Android/sdk, Studio JBR / java_home)
- resolve python3 (mac has no bare `python`); arm64-v8a AVD + -gpu host on Apple Silicon

audio/capture.py: actionable mac mic-permission error (TCC not granted)
gui/export.py: per-platform live-dir fallback (was Linux XDG only)

Full report: ~/voxterm-plans/mac-compat-report.md
Zero-regression hardening from the cross-platform audit (99 tests green):
- 0.1 decouple headless ASR from the Textual TUI: gui/transcribe.py imported
  tui.app (pulling textual+sounddevice into every server/headless import).
  Extracted the pure split into tui/text_split.py; tui.app delegates to it.
  Verified: importing gui.server no longer loads `textual`.
- 0.3/0.4 gui/server._read_json: a malformed Content-Length raised an uncaught
  ValueError out of the POST handlers — guard it; also close the connection on
  an oversized body (no undrained body / latent HTTP desync).
- 0.5 live-state writes now take self._lock (brief dict mutations only, never
  around transcribe/VAD) to match the locked reader in status() — the
  "consistent snapshot" comment is now actually true.
- 0.6 mobile-pair: the loopback auto-probe honors the port field (was hardcoded 8740).
- 0.7 export.py docstring: `glass.export` -> `gui.export` (no glass pkg).
- 0.9 Android cleartext: documented that app-wide cleartext is INTENTIONAL for
  the LAN thin client (can't scope arbitrary RFC1918 IPs declaratively; the
  token + LAN is the trust model) — kept on for release on purpose.
- 0.11 drop the cosmetic SSE `Connection: keep-alive` header (HTTP/1.0).
- 0.10 + 0.8: capture.py macOS mic-permission tests; commit src-tauri/Cargo.lock.
A new, 100%-optional CPU streaming-ASR tier that runs everywhere VoxTerm does
(Linux, macOS arm64, Windows) with no GPU. Verified end-to-end on this Linux/CPU
box: installs clean, decodes correctly, and does NOT disturb VoxTerm's pinned
onnxruntime (sherpa statically links its own ORT).

- pyproject: `[project.optional-dependencies] streaming = ["sherpa-onnx..."]`
  (marker excludes Intel-macOS — no wheel). NOT a core dep.
- config.py: one DRY gate after the platform branches — surfaces the
  `sherpa-stream-en` model key + SHERPA_MODELS ONLY when sherpa-onnx is importable
  AND a wheel exists for the platform. Absent → byte-for-byte unchanged.
- audio/transcriber.py: SherpaStreamingTranscriber (lazy import w/ clear error;
  downloads the 20M streaming-zipformer on first load; per-call create_stream so
  it's a drop-in for the existing chunked callers; same RMS/hallucination/dedup
  filters; ALL-CAPS model output → sentence-case). Factory dispatch added before
  the Whisper fallback.
- gui/test_sherpa.py: skip-guarded (no-op without the extra) — gating consistency,
  factory dispatch, RMS short-circuit.

Zero-regression: without the [streaming] extra installed, nothing changes for any
existing user. 102 tests green (99 + 3, the new ones skip when sherpa is absent).
Follow-on (noted, not yet done): a true-streaming live-loop path (persistent
OnlineStream + endpoint finalize) so the GUI live view streams word-by-word.
…needs a Mac)

iOS reuses the existing Tauri thin-client (mobile-pair → LAN desktop, INTERNET/no-mic).
Everything here is cross-platform + lint-clean on Linux; the actual init/build/sign/run
loop requires a Mac + Xcode (it cannot be built on other platforms).

- src-tauri/Info.ios.plist: NSAllowsLocalNetworking (minimal ATS for LAN http, NOT
  arbitrary loads) + NSLocalNetworkUsageDescription (iOS-14 local-network prompt).
- tauri.conf.json: additive bundle.iOS { minimumSystemVersion "14.0" }.
- scripts/ios-dev.sh: Darwin-guarded (clean no-op off-Mac); adds iOS rust targets,
  `cargo tauri ios init` once, then ios dev|build.
- src-tauri/.gitignore: ignore generated /gen/apple/ build artifacts.
- docs/ios-thinclient.md: build path, the two plist keys, signing, pairing.

Zero-regression: no Python touched; Android (gen/android, manifest) byte-for-byte
unaffected; bundle.iOS + Info.ios.plist are read only by the iOS bundle target.
102 tests green.
…sherpa)

The live monitor now prefers the sherpa streaming backend when it's installed
(opt-in) and drives it as a true streaming recognizer instead of chunked VAD
windows:
- _live_loop split into setup/dispatch + two paths. The chunked path
  (_live_chunk_loop) is the original code VERBATIM — fw-*/MLX/qwen3/parakeet and
  any non-sherpa backend behave byte-for-byte as before (zero regression).
- _live_stream_loop: one persistent OnlineStream fed the tailed PCM; the running
  decode is published as the volatile partial each ~1s; sherpa's endpoint
  detection (or the 20s cap) finalizes a line. Same self._lock discipline.
- live model preference: sherpa-stream-en (if installed) → fw-base → platform
  default. Only changes behavior when the optional [streaming] extra is present.

Verified: streaming primitives grow the partial incrementally + decode correctly
on this box; 102 tests green (chunked path unchanged).
Adversarial QA of the new code + a real KVM-emulator run surfaced these (all fixed):
- transcriber: _ensure_sherpa_model is now ATOMIC (extract to staging → rename) with
  a complete-model guard (all 4 artifacts) so an interrupted extraction self-heals
  instead of a permanent StopIteration; load() uses a _pick() helper that raises a
  clear RuntimeError naming the missing file; .part download cleaned up on failure.
- transcriber: SherpaStreamingTranscriber.is_loaded is now a @property, matching every
  other backend (was a method — would mis-read as loaded via getattr).
- engine: the streaming live path now applies the hallucination + dedup filters on
  finalized lines, like the chunked/batch backends.
- android-dev.sh: launch the CORRECT component — debug builds install
  site.nubs.voxterm.debug, and the activity class keeps the base namespace, so the
  launch is <appId.debug>/<base>.MainActivity (the emulator caught the old
  site.nubs.voxterm/.MainActivity → "activity did not report Status: ok", exit 50).
- android-dev.sh: validate $PYTHON is actually runnable (clean exit 10, not a late fail).
- assert_screen.py: exit 3 = SKIP when Pillow is absent (macOS) so the render gate isn't
  a silent pass; android-dev.sh treats exit 3 as a soft skip.

102 tests green. (Low/cosmetic, left + noted: streaming line-start timestamp drift;
the loopback auto-probe's cross-origin read is best-effort and degrades to manual pairing.)
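The staging-then-rename pattern behind the atomic model install might look roughly like this. It is a sketch, not `_ensure_sherpa_model`: the four artifact file names in `REQUIRED`, the `populate` callback, and both function names are hypothetical stand-ins for the real download and extraction steps.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Hypothetical artifact names; the real model's four files may differ.
REQUIRED = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")


def is_complete(model_dir: Path) -> bool:
    """Complete-model guard: every required artifact must be present."""
    return all((model_dir / name).is_file() for name in REQUIRED)


def install_atomically(model_dir: Path, populate) -> Path:
    """Fill a staging dir via populate(staging), then rename it into place."""
    if is_complete(model_dir):
        return model_dir
    if model_dir.exists():
        shutil.rmtree(model_dir)  # self-heal a partial install from an earlier run
    model_dir.parent.mkdir(parents=True, exist_ok=True)
    # Staging lives next to the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=model_dir.name + ".staging-",
                                    dir=model_dir.parent))
    try:
        populate(staging)
        missing = [n for n in REQUIRED if not (staging / n).is_file()]
        if missing:
            raise RuntimeError(f"model incomplete, missing: {', '.join(missing)}")
        os.replace(staging, model_dir)  # the only step that makes the model visible
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return model_dir
```

An interrupted `populate` leaves no half-written model directory behind, so the next call starts from a clean state.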
…low 14)

Use get_flattened_data when available, fall back to getdata — no behavior change,
silences the Pillow-14 DeprecationWarning the emulator run surfaced.
- audio/transcriber.py: generalized the sherpa model registry (repo→URL map) so
  multiple sherpa transducer models share one SherpaStreamingTranscriber.
- config.py: new optional gated key `sherpa-nemotron-en` (NeMo FastConformer-RNNT
  0.6B, exported for sherpa-onnx). Same find_spec gate → zero-regression when the
  [streaming] extra is absent.
- scripts/bench_asr.py: reproducible WER (word edit-distance, normalized) + CPU RTF
  benchmark across backends.
- docs/streaming-asr-benchmark.md: results + honest analysis.

Numbers (Linux CPU, 3 labeled clips): fw-small 2.1% WER / 0.64 RTF (batch, original default);
fw-base 5.1% / 0.18; sherpa-nemotron-en 4.4% / 0.25 (streaming sweet spot — near-fw-base
accuracy, ~4x real-time, native streaming); sherpa-stream-en zipformer-20M 20.9% / 0.064
(~16x real-time but inaccurate). nemotron-EN proven to load + decode via the same backend.

102 tests green (test_sherpa now covers both gated keys, skips without the extra).
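For context on the two metrics: WER here is word-level edit distance divided by the reference length, and RTF is decode time divided by audio duration (so 0.25 RTF is about 4x real-time). A minimal sketch, not scripts/bench_asr.py; the normalization rule (lowercase, strip punctuation except apostrophes) is an assumption.

```python
import re


def normalize(text: str) -> list:
    """Lowercase and drop punctuation (assumed normalization), then split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def rtf(decode_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return decode_seconds / audio_seconds
```

WER can exceed 1.0 when the hypothesis contains many insertions, which is worth remembering when reading the table for very short clips.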
Engine.models() returned only FASTER_WHISPER_MODELS on Linux/Intel/Windows, so the
optional sherpa-stream-en / sherpa-nemotron-en keys (present in AVAILABLE_MODELS but
not the fw set) never appeared in the GUI model dropdown. Union the platform's base set
with SHERPA_MODELS so they're selectable wherever installed. Found by rendering the GUI
headless. test_models_returns_only_fw_keys -> test_models_are_valid_keys (valid-keys
invariant incl. the additive sherpa keys).
scripts/gui_e2e.py boots gui.server, drives headless Chrome via the DevTools
Protocol, and asserts the real browser flow: model dropdown + session list
populate from the API, and clicking a past session loads + renders its
transcript (with a screenshot). Covers the browser path unit tests can't — only
record-with-a-mic still needs hardware. websocket-client is a dev-only dep.

Verified: dropdown includes the optional sherpa keys, 4 sessions, transcript
renders end-to-end.
docs/streaming-asr.md: install the optional [streaming] extra, the two model keys
(sherpa-stream-en / sherpa-nemotron-en), GUI/CLI usage, how it works, and the
zero-regression/opt-in posture. gui/README 'Models' section now points to it +
the benchmark. Makes the streaming feature discoverable + usable (upstream-ready).
…ening

- /api/audio serves the session WAV with HTTP Range/206 (media-src added to
  the CSP so the <audio> element can load; 416 routed through _hdr; do_HEAD
  405; Content-Length on JSON/static). The engine hardlinks <stem>-gui.wav at
  transcribe time so playback maps to the exact recording.
- CPU-aware transcriber load(): explicit int8 + cpu_threads + greedy beam_size=1
  + a warm dummy decode. The GUI defaults to fw-base via gui_default_model(),
  and the engine warms the model at server start.
- "Detect speakers" diarize flag threaded through stop_recording -> transcribe.
- start_recording tolerates a malformed device value and reverts to the OS
  default input when "System default" is re-selected (no sticky global).
- _session_title keeps short first utterances (>= 2 chars) so titles aren't
  dates.
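
The Range/206/416 behavior in the first bullet can be sketched as below. This is an illustrative single-range parser under stated assumptions, not the code in gui.server: `parse_range` and `content_range` are hypothetical names, and malformed or multi-range headers are simply ignored in favor of a full 200 response.

```python
import re
from typing import Optional, Tuple

_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_range(header: Optional[str], size: int) -> Tuple[int, int, int]:
    """Return (status, start, end_inclusive) for a file of `size` bytes.

    200: no usable Range header, serve the whole body
    206: satisfiable single range
    416: range not satisfiable
    """
    if not header:
        return 200, 0, size - 1
    m = _RANGE_RE.match(header.strip())
    if not m or (m.group(1) == "" and m.group(2) == ""):
        return 200, 0, size - 1  # malformed or multi-range: ignore the header
    first, last = m.group(1), m.group(2)
    if first == "":
        # Suffix form "bytes=-N": the final N bytes of the file.
        n = int(last)
        if n == 0 or size == 0:
            return 416, 0, 0
        return 206, max(size - n, 0), size - 1
    start = int(first)
    if start >= size:
        return 416, 0, 0
    end = min(int(last), size - 1) if last else size - 1
    if end < start:
        return 200, 0, size - 1  # invalid spec: ignore the header
    return 206, start, end


def content_range(start: int, end: int, size: int) -> str:
    """Value for the Content-Range header of a 206 response."""
    return f"bytes {start}-{end}/{size}"
```

A browser `<audio>` element seeking into the WAV typically sends an open-ended request such as `bytes=500-`, which this sketch maps to a 206 covering the rest of the file.
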
…iew recording

- Rebuild the UI as a monochrome (no accent hue) document-style transcript with
  a sticky record dock, a settings popover, and an export menu. The record dot
  is the only color. Inline <audio> playback (click a timestamp to seek) plus a
  Download-WAV action.
- Recording shows a level meter + "Recording..." state and the accurate,
  diarized transcript appears on stop -- one model, no streaming preview to
  reconcile against the final result.
- Robustness: title derives from the transcript (no raw-date headings);
  same-speaker turns keep a clickable timestamp instead of an orphaned box; the
  player pauses when leaving a transcript and its probe is session-tokened; seek
  waits for audio metadata; the record button has a single owner; init() surfaces
  an unreachable server.
- a11y: real keyboard focus ring on menu items, aria-live progress, readable
  muted text. PWA shell cache bumped; manifest/theme colors aligned. Docs updated.
…e2e for the redesign

- <audio preload="metadata"> so the seek bar shows the clip length immediately on
  load instead of a misleading 0:00/0:00 (cheap for a local same-origin WAV; the
  probe still defers a cold seek to loadedmetadata).
- Rewrite scripts/gui_e2e.py for the redesigned UI and add the checks unit tests
  can't cover: transcript-derived title (not a raw date), the recording's audio
  actually LOADING UNDER THE PAGE CSP (a fresh Audio() obeys media-src like the
  inline player), the visible player's real duration, and a record->stop cycle,
  with a securitypolicyviolation collector asserting zero violations.

Verified in headless Chrome: audio loadedmetadata, duration 14.66s, 0 CSP violations.

The TUI records system audio (macOS ScreenCaptureKit, Linux parec) and mixes it
with the mic; the GUI was mic-only. Add an "Audio source" selector (Microphone /
System audio / Mic + system) in the settings popover, threaded through
/api/record/start -> Engine.start_recording(source=...). system/both reuse the
engine's existing SystemCapture; "both" mixes via the same time-aligned add the
TUI uses (_mix_chunks). Fails gracefully with a clear message when the platform
tool is missing (e.g. parec not installed); selection persists in localStorage.

Tests: gui/test_capture_source.py (mix overlap+tails+clip, source wiring with the
capture classes mocked). Windows stays unavailable (no engine system-audio there).
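
As a rough picture of the "overlap + tails + clip" behavior the tests name, here is a minimal sketch. It assumes two mono float streams at the same sample rate that start at the same instant; `mix_streams` is a hypothetical stand-in, not the signature of the engine's `_mix_chunks`.

```python
from itertools import zip_longest
from typing import List


def mix_streams(mic: List[float], system: List[float]) -> List[float]:
    """Sample-wise sum of two time-aligned mono streams.

    The overlapping region is added, the longer stream's tail passes through
    unchanged (the shorter one is padded with silence), and every sample is
    clipped to [-1.0, 1.0] so the sum cannot overflow the PCM range.
    """
    return [
        max(-1.0, min(1.0, a + b))
        for a, b in zip_longest(mic, system, fillvalue=0.0)
    ]
```

For example, `mix_streams([0.5, 0.5], [0.25, 0.75, -0.5])` sums the first two samples (clipping the second) and carries the third through from the longer stream.
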
The TUI's "U" action runs a local-LLM summary (MLX on Apple Silicon, or an
ollama:<model> backend anywhere); the GUI only had "Summarize for AI" (copies a
prompt for an external model). Add "Summarize with local LLM": POST /api/summarize
-> Engine.summarize_session() reuses the session transcript + the TUI's own
summarizer.engine (get_summarizer/resolve_template), shows the result in a
dismissible panel above the transcript, and surfaces a clear message (never a
crash) when no backend is available. A "Summary model" settings field
(persisted) lets non-Mac users point at an Ollama model.

Tests: gui/test_summarize.py (ok / no-transcript / graceful no-backend / path
traversal, summarizer mocked). 112 gui tests + headless e2e green.
…or guard

Extend the headless-Chrome e2e to exercise the new local-LLM summarize action
(asserts it fails GRACEFULLY with no backend present — no crash, block hidden),
confirm the audio-source selector offers mic/system/both, and collect
window.onerror + unhandledrejection so the run fails on ANY uncaught JS error
anywhere in the flow. Verified locally: summarize graceful, source options
correct, 0 CSP violations, 0 uncaught JS errors.
…n cleanup)

From a code-quality deep scan (one function per purpose, no dead code, no
passthrough params):
- _mix_chunks: collapsed the gui/engine.py copy and tui/app.py's staticmethod
  into one audio/mix.py::mix_chunks — both call it; the TUI staticmethod is gone
  (not replaced with a wrapper).
- _fmt_hms: was duplicated in gui/transcribe.py (truncating) vs gui/export.py
  (rounding) → ±1s live-vs-export drift. One gui/_timefmt.py::fmt_hms (rounding),
  used by transcribe/export/engine; also dropped the non-essential _fmt_hms
  parameter the live loops threaded around.
- _write_wav: dead production code (recording uses _wav_header + _pcm_bytes) —
  deleted; its tests folded into the _pcm_bytes encoder test; unused `wave`
  import removed.
- app.js: one copyOrDownload() helper (copyForAI + summarizeForAI shared the
  clipboard-or-download fallback); a named PEER_COLOR const instead of a bare hex
  that aliased a rotating speaker slot.
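
The ±1s drift described in the _fmt_hms item comes from formatting the same timestamp with two different rounding policies. A small sketch of the effect, assuming an HH:MM:SS output format (the real gui/_timefmt.py::fmt_hms may format differently; both function names below are illustrative):

```python
def _hms(total_seconds: int) -> str:
    """Format a whole number of seconds as HH:MM:SS."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def fmt_hms_truncating(seconds: float) -> str:
    """Drops the fractional part (the behavior attributed to the old live view)."""
    return _hms(int(seconds))


def fmt_hms_rounding(seconds: float) -> str:
    """Rounds half up to the nearest second (the single policy kept after cleanup)."""
    # int(x + 0.5) avoids round()'s banker's rounding for non-negative input.
    return _hms(int(seconds + 0.5))
```

A segment starting at 59.6 s shows as 00:00:59 under truncation but 00:01:00 under rounding, which is the one-second disagreement a shared helper removes.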

Full suite 523 passed.
@NubsCarson

Contributor Author

Self-review (deep-scan) — known follow-ups still open on this draft, for transparency:

In-PR cleanups are already done (latest commits): de-duplicated _mix_chunks → audio/mix.py, _fmt_hms → gui/_timefmt.py (also fixed a ±1s live-vs-export rounding drift), deleted dead _write_wav, and small JS tidy-ups.

Still open:

  • On-device (Kotlin) robustness — needs an APK build to verify, so deliberately not committed blind: verify all 4 model files + stage-to-.tmp-then-rename for atomic model staging; guard AudioRecord init (minBuf <= 0 / STATE_INITIALIZED); release the native stream on a failed start (try/finally) and mark recognizer @Volatile. (Edge cases — the happy path is APK-verified.)
  • Cross-branch dedup at merge time: the sherpa SherpaStreamingTranscriber backend also lands in #174 (feat(asr): optional cross-platform CPU streaming ASR backend (sherpa-onnx)); land it once there and rebase this on top so this PR only adds the GUI wrappers + the CPU-tuned load(). Same for load_wav_16k_mono (→ a shared audio/ helper) and gui_default_model delegating to the CPU-default policy in #168 (fix: default to fw-base on CPU (qwen3-0.6b is unusable without a GPU)).

NubsCarson added 10 commits June 5, 2026 22:15
Verified by a green debug APK build (cargo tauri android build --debug --apk):
- Model staging is now atomic + complete: verify ALL 4 required files (was only
  tokens.txt) and stage into a .tmp dir then renameTo() the final dir, so a
  mid-copy process kill can't wedge a half-populated voxterm-model dir.
- Guard AudioRecord init: bail (with lastError) when getMinBufferSize() returns
  <= 0 (minBuf*2 would throw) and when state != STATE_INITIALIZED (mic busy).
- Never leak the native sherpa OnlineStream on a failed start — track it in a
  nullable and release it in finally; mark `recognizer` @Volatile (built lazily
  from both the mic worker and the debug self-test thread); reset `running` on
  every exit path.
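
The stage-into-.tmp-then-rename idea from the first bullet, shown as a Python analogue rather than the plugin's Kotlin. The file names in `REQUIRED` and the function name `stage_model` are hypothetical; only the pattern (copy everything, verify completeness, then rename into place) is taken from the commit message.

```python
import os
import shutil
from pathlib import Path

# Hypothetical file names standing in for the 4 required model files.
REQUIRED = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")


def stage_model(src: Path, final: Path, required=REQUIRED) -> Path:
    """Copy model files into `final` so it is either complete or absent."""

    def complete(directory: Path) -> bool:
        return directory.is_dir() and all((directory / n).is_file() for n in required)

    if complete(final):
        return final  # already staged by an earlier launch
    tmp = final.with_name(final.name + ".tmp")
    shutil.rmtree(tmp, ignore_errors=True)    # leftovers from a killed run
    shutil.rmtree(final, ignore_errors=True)  # half-populated final dir
    tmp.mkdir(parents=True)
    for name in required:
        if (src / name).is_file():
            shutil.copyfile(src / name, tmp / name)
    if not complete(tmp):
        shutil.rmtree(tmp, ignore_errors=True)
        raise IOError(f"model assets incomplete in {src}")
    os.rename(tmp, final)  # the only step that makes `final` visible
    return final
```

Because the final directory appears only through the rename, a process kill during the copy leaves at most a stale `.tmp` directory, which the next run discards.
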
Swap the bundled offline model from the 20M zipformer (2023-02-17) to the
70M streaming zipformer2 (2023-06-26). On the bundled test clip the 20M
dropped the opening clause and garbled "brothels"; the 70M transcribes it
in full and correct. On a real phone it decodes at xRT 0.09 (0.62s for
7.13s of audio, ~11x real-time), so the accuracy gain costs no latency.
APK grows ~26 MB (encoder int8 67 MB vs 40 MB).

The 70M model is model_type=zipformer2, which has no `attention_dims`
metadata, so the hardcoded modelType="zipformer" failed to init the
encoder. Set modelType="" to auto-detect the architecture from the model's
own ONNX metadata, so fetch-deps.sh is the single source of truth for the
bundled model and no architecture string has to stay in sync here.

Also log a measured xRT in the debug self-test, so on-device latency is a
real number rather than an assumption.

Select the bundled offline model with VOXASR_MODEL:

  zipformer-70m  (default) streaming zipformer2, ~68 MB assets / ~232 MB APK
                 fast (xRT 0.09 on a real phone), ALL-CAPS, no punctuation
  nemotron-0.6b  NeMo FastConformer-RNNT, ~632 MB assets / ~621 MB APK
                 accurate, native casing + punctuation, xRT 0.29 on the same phone

The default stays the lightweight zipformer so a plain build is small and
installs anywhere; nemotron is opt-in for builds that want transcript-grade
output and can afford the size. The Kotlin plugin already auto-detects the
architecture and feature dim from each model's ONNX metadata (modelType="",
metadata-driven feat_dim), so the tier swap needs no code change.

Also replace the fragile hardcoded epoch-specific cp filenames with a glob
that matches both naming schemes (zipformer's
`encoder-epoch-…-chunk-16-left-128.int8.onnx` and nemotron's plain
`encoder.int8.onnx`), mirroring the desktop loader's _pick(); add a guard
for an unknown VOXASR_MODEL. shellcheck-clean.

Both tiers verified end-to-end on a real device: bundled-clip self-test
decodes correctly and the live start/poll/stop pipeline runs without error.
…fecycle

start_transcribe used to reject with "microphone permission not granted" when
the runtime permission was absent, so a fresh install's first Start hard-failed
with no in-app recovery. The plugin now owns the mic: it declares the
RECORD_AUDIO "microphone" alias and requests it on first Start, resuming in a
@PermissionCallback once granted (verified on a device: fresh install -> system
prompt -> grant -> records).

Lifecycle hardening while here:
- ensureRecognizer(): one @Synchronized lazy builder shared by the mic worker
  and the debug self-test, closing a check-then-act race that could build and
  leak two native recognizers, and removing the duplicated idiom.
- a per-session generation token so a worker that outlives stop's 2s join can
  neither run the mic alongside nor reset the running flag of a newer session.
- stop_transcribe clears the trailing partial so poll_transcript stops
  returning a never-finalized line after recording ends.
- the webview clears the transcript on each Start (no cross-session concat /
  unbounded DOM growth).
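
The per-session generation token can be pictured with the following sketch, written in Python rather than the plugin's Kotlin. `SessionGuard` and its methods are hypothetical names; the point is only that a worker must present the token it was started with before it may touch shared state.

```python
import threading


class SessionGuard:
    """Tracks which recording session currently owns the `running` flag."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0
        self.running = False

    def start(self) -> int:
        """Begin a new session and return its token for the worker to keep."""
        with self._lock:
            self._generation += 1
            self.running = True
            return self._generation

    def is_current(self, generation: int) -> bool:
        """Workers check this in their loop and stop recording once stale."""
        with self._lock:
            return generation == self._generation

    def finish(self, generation: int) -> bool:
        """Called by a worker on exit; only the current session clears `running`."""
        with self._lock:
            if generation != self._generation:
                return False  # stale worker: a newer session owns the flag
            self.running = False
            return True
```

A worker that outlives the join timeout still holds the old token, so its late `finish()` call is rejected instead of resetting the flag under the newer session.
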
…nused dep

The plugin's android/.tauri/tauri-api/ tree is the Tauri-CLI-generated mirror of
the tauri-android framework (Apache/MIT "Tauri Programme", ~2150 LOC incl. 2+2
scaffold tests) — vendored upstream code, not part of this contribution. The
build resolves :tauri-android from the gen settings path and never uses this
copy (verified: a clean build with the directory removed still produces the
APK), and the sibling src-tauri/gen/android already gitignores its own /.tauri.
Gitignore android/.tauri/ and untrack the 29 files so the diff is the plugin.

Also drop the unused direct `serde` dependency (the crate uses only
serde_json::Value).
…self-heal deps

- add tauri-plugin-voxasr/README.md (purpose, the start/stop/poll command
  surface, fetch-deps.sh + VOXASR_MODEL tiers, RECORD_AUDIO/no-INTERNET stance,
  build via scripts/android-dev.sh) plus a short subsection + CHANGELOG entry in
  the main docs.
- fix the lib.rs crate docstring: it described a voxasr://partial/final event
  contract that does not exist — the plugin is poll-only (poll_transcript).
- update capabilities/mobile.json's description: it still said "window/webview/
  event only ... pairs to a desktop" though it now grants voxasr:default and
  on-device is the primary mode.
- android-dev.sh runs fetch-deps.sh when the AAR/model are missing, so the
  advertised one-command build works on a fresh checkout (honors VOXASR_MODEL).
- fix a stale revealForm() reference in an index.html comment (revealMobileHome).
…e-at-stop)

Replace the streaming zipformer/nemotron path with offline Whisper: the mic is
buffered while recording and, at stop, the whole clip is decoded by a sherpa-onnx
OfflineRecognizer — full context, native punctuation + casing, no rough live
output. This is the same model family the desktop's faster-whisper uses, so the
phone gets transcript-grade results.

- fetch-deps.sh: VOXASR_MODEL tiers are now whisper-tiny/base/small.en (base.en
  default ~154 MB); Whisper has no joiner, so stage encoder/decoder/tokens only,
  and wipe the model dir first so a tier switch leaves no stale files.
- VoxasrPlugin.kt: OfflineRecognizer (modelType="whisper", en/transcribe); record
  to a PCM buffer; at stop, split into <=30 s windows (cut at the quietest point
  near the boundary so words aren't sliced) and join. poll_transcript now reports
  { phase, elapsed, level, durationSec, segments[], error? }. Keeps the runtime
  RECORD_AUDIO request, generation guard, and @Synchronized recognizer build; the
  stop path snapshots the take's buffer and joins a prior worker before reopening
  the single-owner mic.
- measured on a real phone: base.en self-test xRT ~0.2 (~5x real-time), correct
  punctuated transcript.

The phone now runs the SAME web GUI as the desktop instead of a separate stripped
page. gui/static is staged into the mobile bundle (mobile-pair/app/) and a
LocalBackend drives the native voxasr plugin + localStorage instead of the
desktop's Python HTTP engine — same look, same record→transcribe→view→export flow.

- gui/static/backend-local.js: implements the window.VOX_BACKEND seam
  (getJSON/events/authUrl) against the plugin; synthesizes app.js's
  recording→transcribing→done state machine from poll_transcript; persists
  sessions + renders client-side md/json/srt/vtt export. Sets the `on-device`
  flag so app.js/CSS hide Python-only features (model/source/mic/diarize/summary,
  language, local-LLM summary, WAV download, speaker rename) — no dead buttons.
- scripts/stage-mobile.sh (+ tauri beforeBuildCommand/beforeDevCommand): copies
  gui/static into mobile-pair/app/ with backend-local.js swapped in and the PWA
  shell dropped; mobile-pair/app/ is gitignored (gui/static stays the source).
- mobile-pair: the Android app redirects to the on-device GUI; the pairing form
  is now browser-only (dead loopback probe removed).
- AndroidManifest strips INTERNET (tools:node=remove) → the APK is provably
  offline; CSP trimmed to match (no remote/blob tokens).
- app.js/style.css: two small on-device guards; empty-state copy fixed (both
  platforms transcribe at stop, not live).

Verified e2e on a real phone: GUI loads, degrade applied, two record→transcribe
takes complete cleanly, zero console errors, only RECORD_AUDIO granted.
…ngine

Update the plugin README (offline Whisper, the phase-based poll contract, whisper
model tiers, the unified-GUI architecture), the main README's Android section, and
the CHANGELOG entry: the previous text described the superseded streaming model.

sherpa-onnx Whisper truncates anything ≥30 s ("process only the first 30 s and
discard the remaining"), so an exactly-30 s window risks a boundary warning. Cap
the windows at 29 s — comfortably under the limit, no data discarded, plus a
margin for the silence-aware cut. Verified the chunked decode of a synthetic
>30 s clip joins into coherent text with the cut landing in a pause.
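
A minimal sketch of the silence-aware windowing described above, in Python rather than the plugin's Kotlin. The 29 s cap comes from the commit message; the 2 s search span, the 10 ms frame, the sum-of-squares energy measure, and the name `split_windows` are assumptions for illustration.

```python
from typing import List


def split_windows(samples: List[float], sample_rate: int = 16000,
                  max_sec: float = 29.0, search_sec: float = 2.0,
                  frame: int = 160) -> List[List[float]]:
    """Split audio into windows shorter than max_sec, cutting in quiet spots.

    For each window, the last `search_sec` seconds before the hard limit are
    scanned in `frame`-sized steps and the cut lands in the middle of the
    lowest-energy frame, so a word is less likely to be sliced in half.
    """
    max_len = int(max_sec * sample_rate)
    search = int(search_sec * sample_rate)
    if not frame <= search < max_len:
        raise ValueError("need frame <= search span < window length")
    if not samples:
        return []
    windows, start, total = [], 0, len(samples)
    while total - start > max_len:
        hard_limit = start + max_len
        lo = hard_limit - search
        quietest = min(
            range(lo, hard_limit - frame + 1, frame),
            key=lambda i: sum(x * x for x in samples[i:i + frame]),
        )
        cut = quietest + frame // 2  # always below hard_limit
        windows.append(samples[start:cut])
        start = cut
    windows.append(samples[start:])  # remainder fits in one window
    return windows
```

Each window would then be decoded on its own and the resulting texts joined in order; concatenating the windows reproduces the original buffer, so no audio is discarded.
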
@NubsCarson NubsCarson changed the title feat(gui): web GUI control app over the VoxTerm engine (draft — for visibility/feedback) feat: VoxTerm as a Tauri app — one GUI on desktop + on-device mobile (draft/RFC) Jun 6, 2026
…g it

stagedModelDir() copied the bundled assets into a temp dir and swapped it in
without checking the required model files actually landed — a build shipping
incomplete assets would surface as a cryptic native recognizer crash later
rather than a clear error. Verify all required files are present before the
atomic rename (clear IOException otherwise), check renameTo's result, and
@Synchronized it so the debug self-test can't race a first record into staging.

Verified on a physical device (debug APK): a cold re-stage after `pm clear`
stages all files and the offline self-test decodes test.wav correctly
(xRT 0.18, full casing + punctuation).
@NubsCarson

NubsCarson commented Jun 6, 2026

Contributor Author

For whoever triages this (no rush — flagging in case it helps a human or a review agent):

These 11 PRs split into three buckets — all branched off main, tested, and 0 commits behind origin/main:

Drop-in, independent, mergeable in any order:

Behavior fixes, low-risk, independent:

Bigger / your call:

#179 is a non-urgent follow-up note on #176's residual (session-code entropy), not a blocker for #176.

Totally understand 11 from one contributor is a lot — happy to consolidate, hold, or close any of these to whatever fits your roadmap and bandwidth. Just say the word.

NubsCarson added a commit to NubsCarson/VoxTerm that referenced this pull request Jun 6, 2026
gui/, src-tauri/, tauri-plugin-voxasr/ live only on the feat/gui branch (dmarzzz#175),
not on main — a docs-accuracy PR shouldn't describe a tree that isn't there yet.
They'll be added when dmarzzz#175 lands. Keeps the verified-present additions
(dictation/, network/, summarizer/) and the cross-platform reframing.
@NubsCarson NubsCarson marked this pull request as ready for review June 7, 2026 00:05
@NubsCarson NubsCarson changed the title feat: VoxTerm as a Tauri app — one GUI on desktop + on-device mobile (draft/RFC) feat: VoxTerm as a standalone Tauri app — one GUI on desktop + on-device mobile Jun 7, 2026
The comment advertised 'zipformer-70m default | nemotron-0.6b', but fetch-deps.sh
only accepts whisper-{tiny,base,small}.en (default whisper-base.en) and exit 1s on
anything else — so copying the old hint sent users straight into a script abort.
@RonTuretzky

Collaborator

Thanks @NubsCarson — your on-device Android work landed in main via #182 (your commits kept their authorship in history). On top of it that PR added a release-build crash fix (R8 was stripping the sherpa-onnx JNI config fields → "failed to get field id for decodingMethod"), live on-device transcription, an APK build/sign/publish workflow, and an install page. 🙏
