feat: VoxTerm as a standalone Tauri app (one GUI on desktop + on-device mobile) #175
…rt → review)
A clean, responsive web GUI that fully drives VoxTerm's engine from the browser
(desktop + phone over LAN), with a Python control backend — no reinvention of the
transcription/diarization logic, it reuses VoxTerm's own AudioCapture + transcriber +
Silero VAD + diarizer + EventLogger.
gui/server.py stdlib http.server + SSE status stream + JSON API (loopback by
default; VOXTERM_GUI_LAN=1 to reach it from a phone). CSP, nosniff,
bounded request bodies, static-dir traversal guard, capped SSE.
gui/engine.py control layer: start/stop recording via AudioCapture, background
transcribe+export job with progress, session history, artifact reads
(path-traversal guarded).
gui/transcribe.py importable transcription (WAV/buffer -> faithful events.jsonl +
-transcript.md) reusing VoxTerm's engine; progress callback for the UI.
gui/export.py the reviewed LLM-agent exporter (events.jsonl -> -agent.md + .json),
ported self-contained into the fork (+ gui/test_export.py, 23 tests).
gui/static/ polished UI (index.html/style.css/app.js): record hero w/ live level
ring + timer, model/language pickers, SSE-driven transcript view,
client-side speaker rename (flows into copy/export), session browser,
Copy-for-AI / download .md / download .json.
v1 = record → stop → transcribe (robust; reuses the tested pipeline). Verified so far
without a mic: API + static serving + traversal guards + the full load/view/export flow
against a real 53-turn session; export tests 23/23. Pending a recording-finalize (mic
contention): the record-from-GUI path and a Tauri v2 native/mobile wrapper. Live
word-streaming, party/P2P, hivemind = labeled fast-follows.
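The path-traversal guard mentioned for artifact reads and static serving can be illustrated with a minimal sketch. The function name `resolve_under` and the demo directory are hypothetical and are not taken from gui/engine.py; the sketch only shows the general "resolve, then check containment" technique.

```python
from pathlib import Path
from typing import Optional


def resolve_under(base: Path, requested: str) -> Optional[Path]:
    """Return the resolved path only if it stays inside base, else None.

    Hypothetical helper: the real guard in gui/engine.py may differ.
    """
    base = base.resolve()
    # Joining with an absolute `requested` discards base, so the containment
    # check below also rejects absolute paths such as /etc/passwd.
    candidate = (base / requested).resolve()
    # is_relative_to (Python 3.9+) compares whole path components, so a
    # sibling like "<base>-evil" is not mistaken for a child of base.
    if not candidate.is_relative_to(base):
        return None
    return candidate
```

A caller would treat `None` as a rejected request (the server tests in this PR expect a 403 for blocked traversal on static files).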
… + correctness
Review of the GUI (16 agents) found 11 real issues, all fixed + verified:
- BLOCKER: strict CSP (style-src 'self', no 'unsafe-inline') silently blocked every
element.style the UI sets (level ring, progress bar, speaker color dots) — the core
visuals. Allow 'unsafe-inline' for style-src (all interpolated values are escaped).
- MAJOR (security): LAN mode (VOXTERM_GUI_LAN=1) had zero auth — anyone on the wifi
could start a recording of the room or read past transcripts. Now requires a token
(generated/printed on start, or VOXTERM_GUI_TOKEN) on every /api/* call; loopback
stays open. Verified: no-token/bad-token -> 401, valid -> 200.
- MAJOR (perf): the transcriber/VAD/diarizer were reloaded from disk every recording.
Cache them (lock-guarded) in gui.transcribe; reset the diarizer session per run.
- MAJOR (xss): unescaped speaker rename/label in the legend innerHTML -> escapeHtml.
- MAJOR (correctness): hand-built YAML in the client export broke / allowed key
injection on a rename/peer_name with a quote or newline -> JSON.stringify scalars
(mirrors the server's _yaml_scalar).
- MAJOR (crash): Download .md/.json threw on a raw-markdown fallback session (CUR null)
-> guard the handlers.
- MINOR: dir-aware artifact resolution (same stem in two dirs returned the wrong file);
poll-thread appends under the lock + join the thread WITHOUT holding it (avoids a
deadlock) so trailing audio isn't dropped; SSE counter guarded by a lock; session-stem
escaped in the sidebar; flush startup prints so the LAN token is visible immediately.
- NIT: start_recording wraps mic-open in try/except -> structured {ok:false,error} so a
busy/missing mic shows a real message instead of a 500.
Verified: py_compile + node --check clean; export tests 23/23; CSP header correct; the
loopback load flow against a real 53-turn session; LAN 401/200/401; and a live
record -> stop -> transcribe -> export run through the engine (graceful 0-turn on a
near-silent clip). 2 review findings correctly refuted (malformed Content-Length is
cosmetic; nav a11y is enhancement).
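The LAN token gate described above (token required on every /api/* call, static files open, no token configured means open) could look roughly like the following sketch. `make_token` and `is_authorized` are illustrative names, not the functions in gui/server.py, and the token length is an assumption.

```python
import hmac
import secrets
from typing import Optional


def make_token() -> str:
    # Assumption: a random URL-safe token; the real length/format may differ.
    return secrets.token_urlsafe(24)


def is_authorized(path: str, supplied: Optional[str], expected: Optional[str]) -> bool:
    """Decide whether a request may proceed (True) or should get a 401 (False)."""
    if not path.startswith("/api/"):
        return True  # static files stay open
    if expected is None:
        return True  # no token configured (loopback mode)
    if not supplied:
        return False
    # Compare as bytes in constant time. compare_digest raises TypeError on
    # non-ASCII str inputs, so encoding first turns a bad token into a
    # plain mismatch instead of an exception.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

The constant-time comparison is a design choice of this sketch: it avoids leaking how many leading characters of a guessed token were correct.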
Recording-safe hardening (built while a live mic recording ran; file-only changes):
- gui/test_engine.py (16 tests): non-mic engine paths, covering models()/languages(), _write_wav round-trip + clipping, sessions() discovery/ordering/flags across dirs, read_artifact/_resolve text + path-traversal rejection + only_dir restriction, and the idle status() shape. Isolated to temp dirs; never opens the mic or a model.
- gui/test_server.py (14 tests): in-process server on an ephemeral port, covering static serving + content-types, traversal blocked (403), /api/options|status|sessions, 404 unknown route, and the full LAN-auth contract (no-token 401 / valid 200 / wrong 401 / header 200 / TOKEN=None open; static stays open). No /api/record POST.
- gui/README.md: honest docs covering what it is, how to run, the phone/LAN token flow, the privacy/security model, files + outputs, v1 features + labeled fast-follows.
- UX polish (static/* only, API unchanged): a11y (aria-expanded synced, aria-live on status/toast, :focus-visible rings), keyboard (Space / r toggle record, Escape + outside-click close the mobile drawer, without hijacking focused controls), a "Summarize for AI" button (copies transcript prefixed with a ready-to-paste LLM summarization task), real mic-error toasts, an empty-sessions state, and export buttons disabled until a transcript is loaded.
Verified (light, recording-safe): py_compile + node --check clean; all three gui suites green (23+16+14 = 53 tests); serve smoke confirms the UI + new control load.
…se A of the roadmap)
Recording-safe (file-only): turns the web GUI into an installable PWA so it lands on your phone/desktop home screen and opens instantly/offline.
- manifest.webmanifest (name, standalone, theme/bg, maskable icons) + icon.svg + generated icon-192/512.png.
- sw.js service worker: cache-first for the app shell, network-only for /api and SSE, versioned cache dropped on activate. Registered from app.js (CSP script-src 'self').
- server.py: serve /manifest.webmanifest + /sw.js at ROOT (root SW scope), add the .webmanifest/.png content-types, and extend CSP with manifest-src 'self' + worker-src 'self' (the strict default-src 'none' would otherwise block both).
- index.html: rel=manifest, theme-color, svg icon + apple-touch-icon.
Verified (recording-safe): py_compile + node --check clean; server tests 14/14; serve smoke confirms manifest (application/manifest+json), sw.js, and icons all 200 with the CSP allowances present.
Roadmap + rationale live outside the repo at ~/voxterm-plans/voxterm-gui-roadmap.md.
Next (needs no-recording): Tauri v2 native/mobile wrapper, live word-streaming, and the record-through-the-GUI live test.
…ack, network errors
Adversarial review of the PWA/UX code (3 confirmed) fixed:
- Stale-shell trap: cache-first never revalidated, so shipping a new app.js/style.css
without bumping the SW left installed clients on the old shell forever. Switch the
static shell to stale-while-revalidate (serve cache, refresh in background) — changed
assets are picked up on the next load with no manual cache bump.
- Offline navigation: exact-URL match meant "/?token=..." (phone/LAN mode) never matched
the cached "/", so the offline shell never loaded there, and non-"/" nav offline
returned undefined -> browser error page. Navigations are now network-first with a
fallback to caches.match("/", {ignoreSearch:true}).
- Network errors: getJSON had no catch (server down -> unhandled rejection, silent
no-op). Now catches, toasts "Network error", and returns {ok:false,error:"network"};
init() and loadSessions() default missing fields so the UI degrades cleanly.
2 findings correctly refuted (token-URL cache bloat = negligible nit; key-repeat
double-trigger = benign). Verified: node --check app.js+sw.js; engine+server tests green.
…tate
Two real, recording-safe features (built + tested while a live recording ran):
- SRT/VTT subtitle export: each transcript turn already carries audio start/end, so
export.py now adds t_offset_end per turn and renders proper .srt (HH:MM:SS,mmm,
1-indexed cues) + .vtt (WEBVTT, HH:MM:SS.mmm) — written alongside -agent.md/.json,
a --format {md,json,srt,vtt,all} CLI flag, served via engine.read_artifact (srt/vtt
kinds), and built client-side in the UI for instant "Download SRT/VTT" of any loaded
session (rename-aware). Verified on the real 28-min session (valid cues).
- Settings persistence: the GUI remembers your Model + Language in localStorage
(private-mode-safe), restoring them on load when still offered by the server.
- Loading state: a calm body.working affordance + hard-disabled record button while a
transcription job runs.
Tests: gui suites now 65 total (export 35 / engine 16 / server 14), all green;
node --check clean; serve smoke confirms t_offset_end in the API JSON + the .srt
artifact served (200).
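The two timestamp formats named above (SRT `HH:MM:SS,mmm` with 1-indexed cues, VTT `HH:MM:SS.mmm` under a `WEBVTT` header) can be sketched as follows. `t_offset` and `t_offset_end` are the field names from this commit; the `text` key, the function names, and the omission of speaker labels are assumptions of the sketch, not the code in gui/export.py.

```python
def _split_ms(seconds: float):
    """Split a non-negative duration into (hours, minutes, seconds, millis)."""
    total_ms = max(0, int(round(seconds * 1000)))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return h, m, s, ms


def srt_time(seconds: float) -> str:
    return "%02d:%02d:%02d,%03d" % _split_ms(seconds)  # comma before millis


def vtt_time(seconds: float) -> str:
    return "%02d:%02d:%02d.%03d" % _split_ms(seconds)  # dot before millis


def to_srt(turns) -> str:
    blocks, n = [], 0
    for t in turns:
        text = t["text"].strip()  # "text" key is an assumption
        if not text:
            continue  # skip empty turns; the cue counter stays gap-free
        n += 1
        span = f"{srt_time(t['t_offset'])} --> {srt_time(t['t_offset_end'])}"
        blocks.append(f"{n}\n{span}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(turns) -> str:
    cues = [
        f"{vtt_time(t['t_offset'])} --> {vtt_time(t['t_offset_end'])}\n{t['text'].strip()}\n"
        for t in turns
        if t["text"].strip()
    ]
    return "WEBVTT\n\n" + "\n".join(cues)
```

For example, `srt_time(3661.5)` gives `01:01:01,500` and `vtt_time(0.25)` gives `00:00:00.250`.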
…, peer label, clamps
Adversarial review of the SRT/VTT code (4 confirmed) + a byte-for-byte client↔server
parity check surfaced and fixed:
- Cue-text injection: a newline / blank line / "-->" inside a turn's text or label
corrupted SRT/VTT cue boundaries (could inject a fake cue). Add _cue_text() (collapse
newlines, neutralize "-->") applied to both cue text and label in to_srt/to_vtt.
- Client didn't skip empty-text turns and used the array index → blank cues + index
drift vs the server file. Client now filters empty turns with an independent counter
(mirrors the backend), so a downloaded .srt/.vtt byte-matches the server artifact.
- Peer label divergence: client used nameFor() ("Sam · laptop (peer)") vs backend
"Sam (peer)". Client cueLabel now matches the backend exactly (renames stay a
client-only delta on local speakers).
- Degenerate-span clamp: client bumped end +2.0s vs backend +0.5s; unified to +0.5s.
- build() now also clamps t_offset_end > t_offset in the JSON sidecar (was only the
rendered cue), so the documented invariant holds for out-of-order offsets.
Verified: 67 gui tests green (export 37 incl. 2 new regressions / engine 16 / server 14);
node --check clean; a verbatim Python-vs-Node parity harness on a nasty doc (peer +
empty + blank-line/"-->" injection + zero/neg span) produces byte-identical SRT.
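The two fixes above (cue-text sanitizing and the +0.5s degenerate-span clamp) are small enough to sketch. The names `cue_text` and `clamp_span` are illustrative; in particular, how the real `_cue_text()` neutralizes `-->` is not stated in the commit, so replacing it with `->` is an assumption.

```python
import re

MIN_CUE_SECONDS = 0.5  # the unified clamp from the commit message


def cue_text(s: str) -> str:
    """Make a string safe to place inside one SRT/VTT cue."""
    # Collapse newlines and whitespace runs so a blank line cannot end the cue.
    s = re.sub(r"\s+", " ", s).strip()
    # Assumption: neutralize the timing arrow by shortening it.
    return s.replace("-->", "->")


def clamp_span(start: float, end: float):
    """Guarantee end > start; zero or negative spans get a 0.5 s duration."""
    if end <= start:
        return start, start + MIN_CUE_SECONDS
    return start, end
```

With this, a turn whose text tries to smuggle in `2\n00:00:01,000 --> 00:00:02,000\nfake` is flattened to one harmless line of cue text.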
… fix gui.export path The README predated several shipped features; bring it current: subtitle export (.srt/.vtt + byte-identical client downloads), PWA (installable/offline), settings persistence, keyboard/a11y, Summarize-for-AI; add the srt/vtt outputs + t_offset_end; correct the stale 'python -m glass.export' to 'python -m gui.export --format ...'.
Type to filter the (growing) session list by date/stem; case-insensitive; the filter survives list refreshes and shows a clean 'no match' state. Pure frontend (loadSessions now caches SESSIONS + renderSessions(query) renders a filtered view); the typing guard already prevents the Space/R record shortcut from firing in the search box.
…ing (gui/live.py) Tails the raw PCM of a WAV being recorded and transcribes each new speech window with VoxTerm's engine, printing '[mm:ss] text' as the conversation happens. Reads the FILE, not the mic, so it runs alongside any recorder with zero contention. Text-only + fw-base default for low latency. CLI: python -m gui.live ROOM.wav [--model] [--interval] [--max-seconds]. Proven on a live recording (transcribed the active conversation in near-real-time). NOT yet wired into the GUI browser UI — that's the next step (stream lines over SSE to a live transcript panel).
…ing, stream to a panel
Wires gui/live.py's near-real-time transcription into the browser UI so it's actually usable, not just a CLI:
- engine.py: live_start/live_stop + a background tail-transcribe thread that follows the newest in-progress recording FROM the current end (true live, no slow backlog replay), transcribes finalized speech windows with the cached fw-base engine, and appends '[mm:ss] text' lines (capped) exposed via status().live. Reads the file, not the mic, so it runs alongside any recorder with zero contention.
- server.py: POST /api/live/start (optional wav, defaults to newest) + /api/live/stop.
- static: a '⦿ Live transcript' toggle + a streaming, auto-scrolling live panel (calm theme, pulsing dot); applyStatus renders status().live.lines.
Verified end-to-end on a real in-progress recording: start -> lines appear within ~10-16s of new speech with correct audio timestamps (e.g. [39:19]…) -> stop. server tests 14/14; node --check clean. (Browser render of the panel is wired but visually confirmable only in a browser; the data path is proven.)
…s test
- engine.delete_session(stem, dir): removes only a session's text artifacts
(-transcript/-agent.{md,json,srt,vtt}/-events.jsonl) for the stem, reusing _resolve's
traversal guard + _session_dirs/only_dir restriction; never touches .wav (audio kept).
- POST /api/session/delete (behind the LAN-auth gate).
- UI: a subtle ✕ on each session row (confirm; stopPropagation so it can't trigger open;
clears the view if the open session is deleted).
- test_engine: +6 delete tests (exact-files, traversal rejected, dir-restricted, .wav
untouched, missing-stem ok); fixed test_status_idle_shape to expect the 'live' key
added by the prior live-transcription commit.
gui suites green (export 37 / engine 22 / server 14).
The live monitor finalized speech only on silence, so an in-progress utterance showed nothing until the speaker paused. Add a partial preview of the still-growing tail: each pass re-decodes the tail and a LocalAgreement-n stabilizer commits the longest word-prefix that has agreed across the last n hypotheses (stable) and marks the remainder volatile. As words settle they graduate from volatile to stable, so the head stops flickering while the tail updates live.
- gui/stabilize.py: PartialStabilizer (pure, LocalAgreement-n) + 9 unit tests
- engine._live_loop: re-decode tail → stabilize → status.live.partial; reset on finalize so each utterance starts clean
- app.js/style.css: render the partial (committed words solid, volatile tail dimmed + softly pulsing)
Proven on a real recording: ASR revised "floor"→"hood" mid-utterance and the stabilizer held it volatile until settled (never committed the wrong word). 82 tests green. Idea ported from elizaOS's streaming partial-stabilizer.
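A minimal sketch of the LocalAgreement-n idea described in this commit: keep the last n hypotheses, commit the longest word prefix they all share, and report everything after it as volatile. The class name `PartialStabilizerSketch` is hypothetical and this is not the code in gui/stabilize.py; the "never un-commit" rule and whitespace tokenization are assumptions.

```python
from collections import deque


class PartialStabilizerSketch:
    """Commit the word prefix shared by the last n hypotheses."""

    def __init__(self, n: int = 2):
        self.n = n
        self.history = deque(maxlen=n)  # last n hypotheses, as word lists
        self.stable = []                # committed words (only ever grows)

    def update(self, hypothesis: str):
        words = hypothesis.split()
        self.history.append(words)
        if len(self.history) == self.n:
            agreed = []
            # zip stops at the shortest hypothesis; walk until first mismatch.
            for column in zip(*self.history):
                if all(w == column[0] for w in column):
                    agreed.append(column[0])
                else:
                    break
            # Assumption: committed words are never retracted, only extended.
            if len(agreed) > len(self.stable):
                self.stable.extend(agreed[len(self.stable):])
        volatile = words[len(self.stable):]
        return list(self.stable), volatile

    def reset(self):
        """Call on finalize so the next utterance starts clean."""
        self.history.clear()
        self.stable = []
```

Using a made-up phrase around the revision reported above: after the hypotheses "pop the floor" and then "pop the hood", only "pop the" is stable and "hood" stays volatile; once a further pass repeats "hood" (for example "pop the hood and"), it is committed.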
Gap #1: live now tails the GUI's own recording. start_recording streams straight to a growing on-disk WAV (placeholder header, _poll appends s16 PCM under the lock + flush, stop patches the real header). The live monitor tails that same file, so clicking Live during a GUI recording shows your words (before, Record buffered in RAM and only wrote on stop, so Live saw nothing). Bonus: a long session no longer sits entirely in RAM; transcription loads the file off-thread.
Gap #3: small stuff:
- live.py CLI: ported the LocalAgreement stabilizer (in-place updating partial line) for parity with the GUI
- app.js/index.html/style.css: a scrolling live amplitude canvas during record
3 new streaming-WAV tests (header is 44B + parses, _pcm_bytes==_write_wav, growing file is tailable mid-write then finalizes valid). 85 tests green.
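The "placeholder header, append PCM, patch on stop" pattern can be sketched with the standard 44-byte PCM WAV header. The function names, the 16 kHz mono default, and writing to a generic file object are assumptions of this sketch; the engine's `_wav_header`/`_pcm_bytes` may be shaped differently.

```python
import struct


def wav_header(data_bytes: int, rate: int = 16000, channels: int = 1) -> bytes:
    """Build the canonical 44-byte header for 16-bit PCM."""
    bits = 16
    block_align = channels * bits // 8
    byte_rate = rate * block_align
    return (
        b"RIFF" + struct.pack("<I", 36 + data_bytes) + b"WAVE"
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, rate,
                                byte_rate, block_align, bits)
        + b"data" + struct.pack("<I", data_bytes)
    )


def stream_pcm_to_wav(f, chunks, rate: int = 16000) -> int:
    """Write a placeholder header, append s16 PCM chunks, then patch sizes.

    `f` is a binary file object opened for writing; returns bytes of PCM written.
    """
    f.write(wav_header(0, rate))          # placeholder: sizes unknown yet
    written = 0
    for pcm in chunks:
        f.write(pcm)
        f.flush()                         # make the growth visible to a tailer
        written += len(pcm)
    f.seek(0)
    f.write(wav_header(written, rate))    # patch in the real sizes
    f.seek(0, 2)                          # leave the cursor at end of file
    return written
```

A tailing reader can skip the first 44 bytes and treat everything after as raw s16 PCM while the file is still growing, which is why the placeholder sizes do not block a live monitor.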
Found by a verifying multi-agent audit of the new streaming-record/live code.
Concurrency:
- stop_recording now stops the live monitor (live is bound to the recording's lifetime); it was leaking a daemon thread that re-decoded the finalized file forever and raced the post-stop job.
- live monitor uses a DEDICATED transcriber (_get_engines dedicated="live") so it never shares CTranslate2 decode state (or the dedup buffer) with the batch job; CT2 isn't safe for concurrent decode on one instance.
- live_start/live_stop track real thread liveness (no double live loop after a timed-out join); status() snapshots self._live under the lock.
Correctness:
- stop_recording surfaces a header-patch I/O failure as an error instead of silently transcribing a zero-data WAV into a spurious empty session.
Security (server.py):
- CSRF: reject cross-origin state-changing POSTs (Sec-Fetch-Site / Origin vs Host).
- DNS-rebinding: loopback Host-header allowlist (blocks a rebinding site driving the tokenless local API).
- Clickjacking: CSP frame-ancestors 'none' + X-Frame-Options: DENY.
- _authed compares on UTF-8 bytes so a non-ASCII token yields 401, not a crash.
+3 security regression tests (host allowlist, CSRF, non-ASCII token). 88 green.
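The two header checks listed under Security can be sketched as pure functions over a header mapping. `host_allowed` and `is_cross_origin` are hypothetical names, the exact allowlist entries are an assumption, and a plain dict is used for brevity (real HTTP headers are case-insensitive); this is not the code in gui/server.py.

```python
from typing import Mapping
from urllib.parse import urlsplit

# Assumption: the loopback names a local browser would send in Host.
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def host_allowed(headers: Mapping[str, str]) -> bool:
    """DNS-rebinding guard: accept only loopback names in the Host header."""
    try:
        # urlsplit handles "host:port" and bracketed IPv6 like "[::1]:8740".
        hostname = urlsplit("//" + headers.get("Host", "")).hostname
    except ValueError:
        return False
    return hostname in LOOPBACK_HOSTS


def is_cross_origin(headers: Mapping[str, str]) -> bool:
    """CSRF guard for state-changing POSTs."""
    site = headers.get("Sec-Fetch-Site")
    if site is not None:
        # "none" means a user-initiated request (typed URL, bookmark).
        return site not in ("same-origin", "none")
    origin = headers.get("Origin")
    if origin is None:
        return False  # assumption: non-browser clients send neither header
    return urlsplit(origin).netloc != headers.get("Host", "")
```

A rebinding page reaches the server with its own domain in `Host`, so the allowlist rejects it even though the TCP connection targets 127.0.0.1.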
VoxTerm splits turns on VAD silence alone, so a natural pause after "and…" or "the…" wrongly ends a turn mid-sentence. Add a zero-model end-of-turn signal (gui/eot.py): P(turn complete) from grammar cues — terminal punctuation 0.95, trailing conjunction 0.15, trailing article/preposition 0.20, short 0.70, else 0.50. The live loop now merges a finalized fragment into the previous line when that line ended mid-clause (live view is text-only, so no speaker boundary to cross), giving readable sentences instead of choppy breath-split lines. 9 unit tests; 97 green. Idea ported from elizaOS's HeuristicEotClassifier. (Diarization hardening + windowed-live ports were checked against VoxTerm's code and found redundant — VoxTerm already gates centroid updates by cosine sim and bounds the live buffer via VAD — so they were intentionally skipped.)
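The grammar-cue scoring above can be sketched directly from the listed probabilities. Only the five scores come from the commit message; the word lists, the "short" cutoff of three words, the order of checks, the ellipsis handling, and the 0.5 merge threshold are all assumptions, and this is not the code in gui/eot.py.

```python
# Illustrative word lists (assumptions, not the lists in gui/eot.py).
CONJUNCTIONS = {"and", "but", "or", "so", "because"}
ARTICLES_PREPOSITIONS = {"the", "a", "an", "of", "to", "in", "on", "for", "with", "at"}
SHORT_MAX_WORDS = 3  # assumption: what counts as a "short" utterance


def p_turn_complete(text: str) -> float:
    """Heuristic P(turn complete) from the trailing words of a fragment."""
    core = text.strip()
    # A trailing ellipsis signals trailing off, not a sentence end: drop it.
    while core.endswith(("...", "\u2026")):
        core = (core[:-3] if core.endswith("...") else core[:-1]).rstrip()
    if not core:
        return 0.50
    if core[-1] in ".!?":
        return 0.95
    words = core.lower().split()
    last = words[-1].strip(",;:")
    if last in CONJUNCTIONS:
        return 0.15
    if last in ARTICLES_PREPOSITIONS:
        return 0.20
    if len(words) <= SHORT_MAX_WORDS:
        return 0.70
    return 0.50


def merge_fragment(lines, fragment: str, threshold: float = 0.5):
    """Append fragment, or glue it onto the last line if that ended mid-clause."""
    if lines and p_turn_complete(lines[-1]) < threshold:
        lines[-1] = lines[-1].rstrip() + " " + fragment.strip()
    else:
        lines.append(fragment.strip())
    return lines
```

With these assumptions, a line ending in "and" (0.15) absorbs the next fragment, while a line ending in a full stop (0.95) starts a new one.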
…ucture) From the Android-plan UI critique — the safe, additive set (deferring layout moves like the bottom record bar until we can render on a device): - delete ✕ was opacity:0-until-hover → invisible on touch; show it on @media (hover: none) - waveform canvas was a fixed 600px bitmap stretched by CSS → blurry on phones/retina; size the bitmap to CSS px × devicePixelRatio, draw in CSS px - --faint #6b7280 (~3.9:1, failed WCAG AA) → #7d8694 (~4.6:1) - mobile: safe-area insets (notch/gesture bar) on main/sidebar/nav/toast + 44px min tap targets on btn/select/ghost/✕/legend - honor prefers-reduced-motion (kills the looping pulses/level animations) 97 tests green; JS validates.
v1 Android app = the existing web UI in a native shell, talking to the VoxTerm backend on your desktop over the LAN. The phone does NO transcription. - src-tauri/: Tauri v2 host crate (identifier site.nubs.voxterm, frontendDist → ../mobile-pair, window "main", android minSdk 24) + gen/android/ gradle project - mobile-pair/index.html: on-theme pairing page — enter desktop host/port/token (prefilled from localStorage), navigates the webview to http://host:port/?token=… where the desktop serves UI+API+SSE from one origin, so app.js works unchanged (reads the token from the query string) - AndroidManifest: INTERNET permission ONLY — no RECORD_AUDIO, no camera, no location. The app structurally cannot record you; the desktop owns the mic. usesCleartextTraffic=true for the LAN http backend (token is the gate). Scaffold only — not yet built. Lives on feat/gui, no PR.
scripts/android-dev.sh — plug in a phone (or --emulator) and it self-heals the toolchain (rust targets), builds the APK, installs, launches, and asserts the app is alive. Test traffic stays on loopback via `adb reverse tcp:8740` — never touches Wi-Fi. Stages A–F with structured exit codes (10 toolchain/11 targets/ 20 device/30 build/40 install/50 launch/60 smoke). Hard gates: build, install, launch, render-not-blank (scripts/assert_screen.py, Pillow luminance check). Soft for v1: the backend round-trip (depends on the in-app connect flow). Supporting bits: - scripts/mock_backend.py — torch-free stdlib stand-in (serves gui/static + canned /api + heartbeat SSE, logs requests) for fast offline CI runs (--mock) - gui/server.py — opt-in request logging via VOXTERM_GUI_LOG=1 (silent by default) so the smoke test can assert GET /api/options + /api/events - mobile-pair: auto-connect if a backend answers on the device's localhost (the adb-reverse/dev case) — fails fast on a real phone → pairing form stays 97 tests green. Quickstart: scripts/android-dev.sh --emulator --debug --mock (offline) or scripts/android-dev.sh --debug (real phone, real engine).
… Silicon A verifying cross-platform audit found the GUI dead on Apple Silicon (the flagship target) and the android script broken on every mac. All fixes are Linux-safe (97 tests still green) and standard per-platform branching: GUI (HIGH — Apple Silicon had an empty model dropdown + KeyErrors): - Engine.models() falls back to AVAILABLE_MODELS when FASTER_WHISPER_MODELS is empty (Apple Silicon) so the dropdown is never blank - new CPU-aware default (transcribe.gui_default_model / Engine.default_model): prefer fw-small where faster-whisper exists, fall back to MLX only on Apple Silicon — and crucially NOT raw config.DEFAULT_MODEL, which is qwen3-0.6b when qwen-asr is installed (too slow on CPU). Fixes the live + post-stop KeyErrors. - /api/options exposes default_model; app.js pre-selects it (no more fw-small) scripts/android-dev.sh (broke on all mac): - ANDROID_HOME / JAVA_HOME per-OS (mac ~/Library/Android/sdk, Studio JBR / java_home) - resolve python3 (mac has no bare `python`); arm64-v8a AVD + -gpu host on Apple Silicon audio/capture.py: actionable mac mic-permission error (TCC not granted) gui/export.py: per-platform live-dir fallback (was Linux XDG only) Full report: ~/voxterm-plans/mac-compat-report.md
Zero-regression hardening from the cross-platform audit (99 tests green):
- 0.1 decouple headless ASR from the Textual TUI: gui/transcribe.py imported tui.app (pulling textual+sounddevice into every server/headless import). Extracted the pure split into tui/text_split.py; tui.app delegates to it. Verified: importing gui.server no longer loads `textual`.
- 0.3/0.4 gui/server._read_json: a malformed Content-Length raised an uncaught ValueError out of the POST handlers; guard it, and also close the connection on an oversized body (no undrained body / latent HTTP desync).
- 0.5 live-state writes now take self._lock (brief dict mutations only, never around transcribe/VAD) to match the locked reader in status(); the "consistent snapshot" comment is now actually true.
- 0.6 mobile-pair: the loopback auto-probe honors the port field (was hardcoded 8740).
- 0.7 export.py docstring: `glass.export` -> `gui.export` (no glass pkg).
- 0.9 Android cleartext: documented that app-wide cleartext is INTENTIONAL for the LAN thin client (can't scope arbitrary RFC1918 IPs declaratively; the token + LAN is the trust model); kept on for release on purpose.
- 0.11 drop the cosmetic SSE `Connection: keep-alive` header (HTTP/1.0).
- 0.10 + 0.8: capture.py macOS mic-permission tests; commit src-tauri/Cargo.lock.
A new, 100%-optional CPU streaming-ASR tier that runs everywhere VoxTerm does (Linux, macOS arm64, Windows) with no GPU. Verified end-to-end on this Linux/CPU box: installs clean, decodes correctly, and does NOT disturb VoxTerm's pinned onnxruntime (sherpa statically links its own ORT). - pyproject: `[project.optional-dependencies] streaming = ["sherpa-onnx..."]` (marker excludes Intel-macOS — no wheel). NOT a core dep. - config.py: one DRY gate after the platform branches — surfaces the `sherpa-stream-en` model key + SHERPA_MODELS ONLY when sherpa-onnx is importable AND a wheel exists for the platform. Absent → byte-for-byte unchanged. - audio/transcriber.py: SherpaStreamingTranscriber (lazy import w/ clear error; downloads the 20M streaming-zipformer on first load; per-call create_stream so it's a drop-in for the existing chunked callers; same RMS/hallucination/dedup filters; ALL-CAPS model output → sentence-case). Factory dispatch added before the Whisper fallback. - gui/test_sherpa.py: skip-guarded (no-op without the extra) — gating consistency, factory dispatch, RMS short-circuit. Zero-regression: without the [streaming] extra installed, nothing changes for any existing user. 102 tests green (99 + 3, the new ones skip when sherpa is absent). Follow-on (noted, not yet done): a true-streaming live-loop path (persistent OnlineStream + endpoint finalize) so the GUI live view streams word-by-word.
…needs a Mac)
iOS reuses the existing Tauri thin-client (mobile-pair → LAN desktop, INTERNET/no-mic).
Everything here is cross-platform + lint-clean on Linux; the actual init/build/sign/run
loop requires a Mac + Xcode (cannot build off a Mac).
- src-tauri/Info.ios.plist: NSAllowsLocalNetworking (minimal ATS for LAN http, NOT
arbitrary loads) + NSLocalNetworkUsageDescription (iOS-14 local-network prompt).
- tauri.conf.json: additive bundle.iOS { minimumSystemVersion "14.0" }.
- scripts/ios-dev.sh: Darwin-guarded (clean no-op off-Mac); adds iOS rust targets,
`cargo tauri ios init` once, then ios dev|build.
- src-tauri/.gitignore: ignore generated /gen/apple/ build artifacts.
- docs/ios-thinclient.md: build path, the two plist keys, signing, pairing.
Zero-regression: no Python touched; Android (gen/android, manifest) byte-for-byte
unaffected; bundle.iOS + Info.ios.plist are read only by the iOS bundle target.
102 tests green.
…sherpa) The live monitor now prefers the sherpa streaming backend when it's installed (opt-in) and drives it as a true streaming recognizer instead of chunked VAD windows: - _live_loop split into setup/dispatch + two paths. The chunked path (_live_chunk_loop) is the original code VERBATIM — fw-*/MLX/qwen3/parakeet and any non-sherpa backend behave byte-for-byte as before (zero regression). - _live_stream_loop: one persistent OnlineStream fed the tailed PCM; the running decode is published as the volatile partial each ~1s; sherpa's endpoint detection (or the 20s cap) finalizes a line. Same self._lock discipline. - live model preference: sherpa-stream-en (if installed) → fw-base → platform default. Only changes behavior when the optional [streaming] extra is present. Verified: streaming primitives grow the partial incrementally + decode correctly on this box; 102 tests green (chunked path unchanged).
Adversarial QA of the new code + a real KVM-emulator run surfaced these (all fixed):
- transcriber: _ensure_sherpa_model is now ATOMIC (extract to staging → rename) with a complete-model guard (all 4 artifacts) so an interrupted extraction self-heals instead of a permanent StopIteration; load() uses a _pick() helper that raises a clear RuntimeError naming the missing file; .part download cleaned up on failure.
- transcriber: SherpaStreamingTranscriber.is_loaded is now a @property, matching every other backend (was a method, which would mis-read as loaded via getattr).
- engine: the streaming live path now applies the hallucination + dedup filters on finalized lines, like the chunked/batch backends.
- android-dev.sh: launch the CORRECT component. Debug builds install site.nubs.voxterm.debug, and the activity class keeps the base namespace, so the launch is <appId.debug>/<base>.MainActivity (the emulator caught the old site.nubs.voxterm/.MainActivity → "activity did not report Status: ok", exit 50).
- android-dev.sh: validate $PYTHON is actually runnable (clean exit 10, not a late fail).
- assert_screen.py: exit 3 = SKIP when Pillow is absent (macOS) so the render gate isn't a silent pass; android-dev.sh treats exit 3 as a soft skip.
102 tests green. (Low/cosmetic, left + noted: streaming line-start timestamp drift; the loopback auto-probe's cross-origin read is best-effort and degrades to manual pairing.)
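The "extract to staging → rename, with a complete-model guard" fix can be sketched as below. The four file names are placeholders (the commit only says "all 4 artifacts"), and `ensure_model` with its `extract` callback is a hypothetical shape, not the real `_ensure_sherpa_model`.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Assumption: placeholder names for the four required artifacts.
REQUIRED_FILES = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")


def model_complete(model_dir: Path) -> bool:
    return all((model_dir / name).is_file() for name in REQUIRED_FILES)


def ensure_model(model_dir: Path, extract) -> Path:
    """Populate model_dir atomically; `extract(dest)` fills a directory."""
    if model_complete(model_dir):
        return model_dir
    # Self-heal: a partial directory from an interrupted run is discarded.
    shutil.rmtree(model_dir, ignore_errors=True)
    model_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=model_dir.name + ".staging-",
                                    dir=model_dir.parent))
    try:
        extract(staging)
        missing = [n for n in REQUIRED_FILES if not (staging / n).is_file()]
        if missing:
            raise RuntimeError(f"model archive is missing: {', '.join(missing)}")
        os.replace(staging, model_dir)  # the only step that publishes the model
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return model_dir
```

Because the target directory only ever appears via the final rename, a reader never observes a half-extracted model, and a crash mid-extract leaves nothing that looks loadable.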
…low 14) Use get_flattened_data when available, fall back to getdata — no behavior change, silences the Pillow-14 DeprecationWarning the emulator run surfaced.
- audio/transcriber.py: generalized the sherpa model registry (repo→URL map) so multiple sherpa transducer models share one SherpaStreamingTranscriber. - config.py: new optional gated key `sherpa-nemotron-en` (NeMo FastConformer-RNNT 0.6B, exported for sherpa-onnx). Same find_spec gate → zero-regression when the [streaming] extra is absent. - scripts/bench_asr.py: reproducible WER (word edit-distance, normalized) + CPU RTF benchmark across backends. - docs/streaming-asr-benchmark.md: results + honest analysis. Numbers (Linux CPU, 3 labeled clips): fw-small 2.1% WER / 0.64 RTF (batch, og default); fw-base 5.1% / 0.18; sherpa-nemotron-en 4.4% / 0.25 (streaming sweet spot — near-fw-base accuracy, ~4x real-time, native streaming); sherpa-stream-en zipformer-20M 20.9% / 0.064 (~16x real-time but inaccurate). nemotron-EN proven to load + decode via the same backend. 102 tests green (test_sherpa now covers both gated keys, skips without the extra).
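The two metrics reported above can be written down in a few lines: WER as word-level edit distance over a normalized reference, and RTF as processing time divided by audio duration. This is a generic sketch of those definitions, not scripts/bench_asr.py; the normalization rule (lowercase, drop punctuation except apostrophes) is an assumption.

```python
import re


def normalize(text: str):
    """Lowercase and drop punctuation (assumed normalization), then tokenize."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_edit_distance(ref, hyp) -> int:
    """Levenshtein distance over word lists (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference has no words")
    return word_edit_distance(ref, hyp) / len(ref)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds
```

The speed-ups quoted in the commit follow from 1/RTF: 1/0.25 = 4x real-time for sherpa-nemotron-en, and 1/0.064 ≈ 15.6x (the "~16x") for the zipformer-20M model.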
Engine.models() returned only FASTER_WHISPER_MODELS on Linux/Intel/Windows, so the optional sherpa-stream-en / sherpa-nemotron-en keys (present in AVAILABLE_MODELS but not the fw set) never appeared in the GUI model dropdown. Union the platform's base set with SHERPA_MODELS so they're selectable wherever installed. Found by rendering the GUI headless. test_models_returns_only_fw_keys -> test_models_are_valid_keys (valid-keys invariant incl. the additive sherpa keys).
scripts/gui_e2e.py boots gui.server, drives headless Chrome via the DevTools Protocol, and asserts the real browser flow: model dropdown + session list populate from the API, and clicking a past session loads + renders its transcript (with a screenshot). Covers the browser path unit tests can't — only record-with-a-mic still needs hardware. websocket-client is a dev-only dep. Verified: dropdown includes the optional sherpa keys, 4 sessions, transcript renders end-to-end.
docs/streaming-asr.md: install the optional [streaming] extra, the two model keys (sherpa-stream-en / sherpa-nemotron-en), GUI/CLI usage, how it works, and the zero-regression/opt-in posture. gui/README 'Models' section now points to it + the benchmark. Makes the streaming feature discoverable + usable (upstream-ready).
…ening - /api/audio serves the session WAV with HTTP Range/206 (media-src added to the CSP so the <audio> element can load; 416 routed through _hdr; do_HEAD 405; Content-Length on JSON/static). The engine hardlinks <stem>-gui.wav at transcribe time so playback maps to the exact recording. - CPU-aware transcriber load(): explicit int8 + cpu_threads + greedy beam_size=1 + a warm dummy decode. The GUI defaults to fw-base via gui_default_model(), and the engine warms the model at server start. - "Detect speakers" diarize flag threaded through stop_recording -> transcribe. - start_recording tolerates a malformed device value and reverts to the OS default input when "System default" is re-selected (no sticky global). - _session_title keeps short first utterances (>= 2 chars) so titles aren't dates.
…iew recording - Rebuild the UI as a monochrome (no accent hue) document-style transcript with a sticky record dock, a settings popover, and an export menu. The record dot is the only color. Inline <audio> playback (click a timestamp to seek) plus a Download-WAV action. - Recording shows a level meter + "Recording..." state and the accurate, diarized transcript appears on stop -- one model, no streaming preview to reconcile against the final result. - Robustness: title derives from the transcript (no raw-date headings); same-speaker turns keep a clickable timestamp instead of an orphaned box; the player pauses when leaving a transcript and its probe is session-tokened; seek waits for audio metadata; the record button has a single owner; init() surfaces an unreachable server. - a11y: real keyboard focus ring on menu items, aria-live progress, readable muted text. PWA shell cache bumped; manifest/theme colors aligned. Docs updated.
…e2e for the redesign - <audio preload="metadata"> so the seek bar shows the clip length immediately on load instead of a misleading 0:00/0:00 (cheap for a local same-origin WAV; the probe still defers a cold seek to loadedmetadata). - Rewrite scripts/gui_e2e.py for the redesigned UI and add the checks unit tests can't cover: transcript-derived title (not a raw date), the recording's audio actually LOADING UNDER THE PAGE CSP (a fresh Audio() obeys media-src like the inline player), the visible player's real duration, and a record->stop cycle, with a securitypolicyviolation collector asserting zero violations. Verified in headless Chrome: audio loadedmetadata, duration 14.66s, 0 CSP violations.
The TUI records system audio (macOS ScreenCaptureKit, Linux parec) and mixes it with the mic; the GUI was mic-only. Add an "Audio source" selector (Microphone / System audio / Mic + system) in the settings popover, threaded through /api/record/start -> Engine.start_recording(source=...). system/both reuse the engine's existing SystemCapture; "both" mixes via the same time-aligned add the TUI uses (_mix_chunks). Fails gracefully with a clear message when the platform tool is missing (e.g. parec not installed); selection persists in localStorage. Tests: gui/test_capture_source.py (mix overlap+tails+clip, source wiring with the capture classes mocked). Windows stays unavailable (no engine system-audio there).
The TUI's "U" action runs a local-LLM summary (MLX on Apple Silicon, or an ollama:<model> backend anywhere); the GUI only had "Summarize for AI" (copies a prompt for an external model). Add "Summarize with local LLM": POST /api/summarize -> Engine.summarize_session() reuses the session transcript + the TUI's own summarizer.engine (get_summarizer/resolve_template), shows the result in a dismissible panel above the transcript, and surfaces a clear message (never a crash) when no backend is available. A "Summary model" settings field (persisted) lets non-Mac users point at an Ollama model. Tests: gui/test_summarize.py (ok / no-transcript / graceful no-backend / path traversal, summarizer mocked). 112 gui tests + headless e2e green.
…or guard
Extend the headless-Chrome e2e to exercise the new local-LLM summarize action (asserts it fails GRACEFULLY with no backend present: no crash, block hidden), confirm the audio-source selector offers mic/system/both, and collect window.onerror + unhandledrejection so the run fails on ANY uncaught JS error anywhere in the flow. Verified locally: summarize graceful, source options correct, 0 CSP violations, 0 uncaught JS errors.
…n cleanup)
From a code-quality deep scan (one function per purpose, no dead code, no passthrough params):
- _mix_chunks: collapsed the gui/engine.py copy and tui/app.py's staticmethod into one audio/mix.py::mix_chunks; both call it, and the TUI staticmethod is gone (not replaced with a wrapper).
- _fmt_hms: was duplicated in gui/transcribe.py (truncating) vs gui/export.py (rounding) → ±1s live-vs-export drift. Now one gui/_timefmt.py::fmt_hms (rounding), used by transcribe/export/engine; also dropped the non-essential _fmt_hms parameter the live loops threaded around.
- _write_wav: dead production code (recording uses _wav_header + _pcm_bytes), deleted; its tests folded into the _pcm_bytes encoder test; unused `wave` import removed.
- app.js: one copyOrDownload() helper (copyForAI + summarizeForAI shared the clipboard-or-download fallback); a named PEER_COLOR const instead of a bare hex that aliased a rotating speaker slot.
Full suite 523 passed.
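To see why a truncating and a rounding formatter drift by up to a second, here is a minimal sketch. The `HH:MM:SS` output shape and both function bodies are assumptions for illustration; only the names `fmt_hms` and the truncating-vs-rounding distinction come from the commit message.

```python
def fmt_hms(seconds: float) -> str:
    """Format non-negative seconds as HH:MM:SS, rounding to the nearest second."""
    total = int(seconds + 0.5)  # round half up; avoids round()'s banker's rounding
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def fmt_hms_truncating(seconds: float) -> str:
    """The duplicated variant: same layout, but the fraction is dropped."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# The same timestamp rendered by the two variants differs by one second.
print(fmt_hms(59.6), fmt_hms_truncating(59.6))
```

With a single shared formatter, the live view and the export cannot disagree on the same event time.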
|
Self-review (deep-scan) — known follow-ups still open on this draft, for transparency: In-PR cleanups are already done (latest commits): de-duplicated Still open:
|
Verified by a green debug APK build (cargo tauri android build --debug --apk):
- Model staging is now atomic + complete: verify ALL 4 required files (was only tokens.txt) and stage into a .tmp dir, then renameTo() the final dir, so a mid-copy process kill can't wedge a half-populated voxterm-model dir.
- Guard AudioRecord init: bail (with lastError) when getMinBufferSize() returns <= 0 (minBuf*2 would throw) and when state != STATE_INITIALIZED (mic busy).
- Never leak the native sherpa OnlineStream on a failed start: track it in a nullable and release it in finally; mark `recognizer` @Volatile (built lazily from both the mic worker and the debug self-test thread); reset `running` on every exit path.
Swap the bundled offline model from the 20M zipformer (2023-02-17) to the 70M streaming zipformer2 (2023-06-26). On the bundled test clip the 20M dropped the opening clause and garbled "brothels"; the 70M transcribes it fully and correctly. On a real phone it decodes at xRT 0.09 (0.62s for 7.13s of audio, ~11x real-time), so the accuracy gain costs no latency. APK grows ~26 MB (encoder int8 67 MB vs 40 MB).
The 70M model is model_type=zipformer2, which has no `attention_dims` metadata, so the hardcoded modelType="zipformer" failed to init the encoder. Set modelType="" to auto-detect the architecture from the model's own ONNX metadata, so fetch-deps.sh is the single source of truth for the bundled model and no architecture string has to stay in sync here.
Also log a measured xRT in the debug self-test, so on-device latency is a real number rather than an assumption.
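For readers unfamiliar with the xRT figure, the arithmetic behind "0.62s for 7.13s of audio" is sketched below. The helper name `real_time_factor` is hypothetical; the two input numbers are the ones quoted above.

```python
def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """xRT = time spent decoding / duration of the audio decoded.

    Values below 1.0 mean the decoder runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decode_seconds / audio_seconds


xrt = real_time_factor(0.62, 7.13)
speedup = 1.0 / xrt  # how many seconds of audio are decoded per second
print(f"xRT {xrt:.2f}, about {speedup:.1f}x real-time")
```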
Select the bundled offline model with VOXASR_MODEL:
zipformer-70m (default) streaming zipformer2, ~68 MB assets / ~232 MB APK
fast (xRT 0.09 on a real phone), ALL-CAPS, no punctuation
nemotron-0.6b NeMo FastConformer-RNNT, ~632 MB assets / ~621 MB APK
accurate, native casing + punctuation, xRT 0.29 on the same phone
The default stays the lightweight zipformer so a plain build is small and
installs anywhere; nemotron is opt-in for builds that want transcript-grade
output and can afford the size. The Kotlin plugin already auto-detects the
architecture and feature dim from each model's ONNX metadata (modelType="",
metadata-driven feat_dim), so the tier swap needs no code change.
Also replace the fragile hardcoded epoch-specific cp filenames with a glob
that matches both naming schemes (zipformer's
`encoder-epoch-…-chunk-16-left-128.int8.onnx` and nemotron's plain
`encoder.int8.onnx`), mirroring the desktop loader's _pick(); add a guard
for an unknown VOXASR_MODEL. shellcheck-clean.
Both tiers verified end-to-end on a real device: bundled-clip self-test
decodes correctly and the live start/poll/stop pipeline runs without error.
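The glob-based selection described above could look roughly like this in Python. It is only an illustration: the actual change lives in the shell script, the desktop `_pick()` is not shown in this thread, and the function name `pick`, its error behavior, and the `XX` placeholder in the example filenames (standing in for the elided part of the real zipformer names) are all assumptions.

```python
from fnmatch import fnmatch


def pick(filenames, role):
    """Return the one int8 ONNX file for a role such as "encoder".

    The pattern "<role>*.int8.onnx" matches both an epoch-specific name
    and a plain "<role>.int8.onnx", so no exact filename is hardcoded.
    """
    matches = sorted(n for n in filenames if fnmatch(n, f"{role}*.int8.onnx"))
    if not matches:
        raise FileNotFoundError(f"no {role} model file found")
    if len(matches) > 1:
        raise ValueError(f"ambiguous {role} model files: {matches}")
    return matches[0]


# "XX" is a placeholder, not a real epoch number.
zipformer_files = [
    "encoder-epoch-XX-chunk-16-left-128.int8.onnx",
    "decoder-epoch-XX-chunk-16-left-128.int8.onnx",
    "tokens.txt",
]
nemotron_files = ["encoder.int8.onnx", "decoder.int8.onnx", "tokens.txt"]
print(pick(zipformer_files, "encoder"))
print(pick(nemotron_files, "encoder"))
```

Failing loudly on zero or several matches plays the same role as the script's guard: a bad model directory is reported at staging time instead of at recognizer init.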
…fecycle
start_transcribe used to reject with "microphone permission not granted" when the runtime permission was absent, so a fresh install's first Start hard-failed with no in-app recovery. The plugin now owns the mic: it declares the RECORD_AUDIO "microphone" alias and requests it on first Start, resuming in a @PermissionCallback once granted (verified on a device: fresh install -> system prompt -> grant -> records).
Lifecycle hardening while here:
- ensureRecognizer(): one @Synchronized lazy builder shared by the mic worker and the debug self-test, closing a check-then-act race that could build and leak two native recognizers, and removing the duplicated idiom.
- a per-session generation token so a worker that outlives stop's 2s join can neither run the mic alongside nor reset the running flag of a newer session.
- stop_transcribe clears the trailing partial so poll_transcript stops returning a never-finalized line after recording ends.
- the webview clears the transcript on each Start (no cross-session concat / unbounded DOM growth).
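The per-session generation token can be sketched in a few lines. This is a Python analogue under assumptions, not the Kotlin plugin: the class name `RecorderState` and its methods are hypothetical, and only the idea (a stale worker must not touch a newer session's `running` flag) comes from the commit message.

```python
import threading


class RecorderState:
    """Tracks which recording session currently owns the mic (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0
        self.running = False

    def start(self) -> int:
        """Begin a new session and return its token for the worker to keep."""
        with self._lock:
            self._generation += 1
            self.running = True
            return self._generation

    def stop(self) -> None:
        with self._lock:
            self.running = False

    def is_current(self, token: int) -> bool:
        """A worker checks this in its loop before reading the mic."""
        with self._lock:
            return self.running and token == self._generation

    def finish(self, token: int) -> bool:
        """Worker exit path: only the current session may clear `running`."""
        with self._lock:
            if token != self._generation:
                return False  # stale worker from an older session
            self.running = False
            return True


state = RecorderState()
old = state.start()
state.stop()            # the old worker is slow to exit
new = state.start()     # a newer session begins meanwhile
print(state.finish(old), state.running)  # stale exit is ignored
```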
…nused dep
The plugin's android/.tauri/tauri-api/ tree is the Tauri-CLI-generated mirror of the tauri-android framework (Apache/MIT "Tauri Programme", ~2150 LOC incl. 2+2 scaffold tests): vendored upstream code, not part of this contribution. The build resolves :tauri-android from the gen settings path and never uses this copy (verified: a clean build with the directory removed still produces the APK), and the sibling src-tauri/gen/android already gitignores its own /.tauri. Gitignore android/.tauri/ and untrack the 29 files so the diff is the plugin.
Also drop the unused direct `serde` dependency (the crate uses only serde_json::Value).
…self-heal deps
- add tauri-plugin-voxasr/README.md (purpose, the start/stop/poll command surface, fetch-deps.sh + VOXASR_MODEL tiers, RECORD_AUDIO/no-INTERNET stance, build via scripts/android-dev.sh) plus a short subsection + CHANGELOG entry in the main docs.
- fix the lib.rs crate docstring: it described a voxasr://partial/final event contract that does not exist; the plugin is poll-only (poll_transcript).
- update capabilities/mobile.json's description: it still said "window/webview/ event only ... pairs to a desktop" though it now grants voxasr:default and on-device is the primary mode.
- android-dev.sh runs fetch-deps.sh when the AAR/model are missing, so the advertised one-command build works on a fresh checkout (honors VOXASR_MODEL).
- fix a stale revealForm() reference in an index.html comment (revealMobileHome).
…e-at-stop)
Replace the streaming zipformer/nemotron path with offline Whisper: the mic is
buffered while recording and, at stop, the whole clip is decoded by a sherpa-onnx
OfflineRecognizer — full context, native punctuation + casing, no rough live
output. This is the same model family the desktop's faster-whisper uses, so the
phone gets transcript-grade results.
- fetch-deps.sh: VOXASR_MODEL tiers are now whisper-tiny/base/small.en (base.en
default ~154 MB); Whisper has no joiner, so stage encoder/decoder/tokens only,
and wipe the model dir first so a tier switch leaves no stale files.
- VoxasrPlugin.kt: OfflineRecognizer (modelType="whisper", en/transcribe); record
to a PCM buffer; at stop, split into <=30 s windows (cut at the quietest point
near the boundary so words aren't sliced) and join. poll_transcript now reports
{ phase, elapsed, level, durationSec, segments[], error? }. Keeps the runtime
RECORD_AUDIO request, generation guard, and @Synchronized recognizer build; the
stop path snapshots the take's buffer and joins a prior worker before reopening
the single-owner mic.
- measured on a real phone: base.en self-test xRT ~0.2 (~5x real-time), correct
punctuated transcript.
The phone now runs the SAME web GUI as the desktop instead of a separate stripped page. gui/static is staged into the mobile bundle (mobile-pair/app/) and a LocalBackend drives the native voxasr plugin + localStorage instead of the desktop's Python HTTP engine: same look, same record→transcribe→view→export flow.
- gui/static/backend-local.js: implements the window.VOX_BACKEND seam (getJSON/events/authUrl) against the plugin; synthesizes app.js's recording→transcribing→done state machine from poll_transcript; persists sessions + renders client-side md/json/srt/vtt export. Sets the `on-device` flag so app.js/CSS hide Python-only features (model/source/mic/diarize/summary, language, local-LLM summary, WAV download, speaker rename), leaving no dead buttons.
- scripts/stage-mobile.sh (+ tauri beforeBuildCommand/beforeDevCommand): copies gui/static into mobile-pair/app/ with backend-local.js swapped in and the PWA shell dropped; mobile-pair/app/ is gitignored (gui/static stays the source).
- mobile-pair: the Android app redirects to the on-device GUI; the pairing form is now browser-only (dead loopback probe removed).
- AndroidManifest strips INTERNET (tools:node=remove) → the APK is provably offline; CSP trimmed to match (no remote/blob tokens).
- app.js/style.css: two small on-device guards; empty-state copy fixed (both platforms transcribe at stop, not live).
Verified e2e on a real phone: GUI loads, degrade applied, two record→transcribe takes complete cleanly, zero console errors, only RECORD_AUDIO granted.
…ngine
Update the plugin README (offline Whisper, the phase-based poll contract, whisper model tiers, the unified-GUI architecture), the main README's Android section, and the CHANGELOG entry; the previous text described the superseded streaming model.
sherpa-onnx Whisper truncates anything ≥30 s ("process only the first 30 s and
discard the remaining"), so an exactly-30 s window risks a boundary warning. Cap
the windows at 29 s — comfortably under the limit, no data discarded, plus a
margin for the silence-aware cut. Verified the chunked decode of a synthetic
>30 s clip joins into coherent text with the cut landing in a pause.
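The silence-aware windowing can be illustrated with a short stdlib-only sketch. It is not the plugin's Kotlin code: the function name `split_windows`, the one-second search range, the 10 ms frame, and the sum-of-absolute-values energy measure are all assumptions chosen for the example; only the 29 s cap and the "cut at the quietest point near the boundary" idea come from the commits.

```python
def split_windows(samples, rate=16000, max_s=29.0, search_s=1.0, frame=160):
    """Split samples into windows of at most max_s seconds.

    Each cut is placed in the quietest frame found within search_s seconds
    before the hard limit, so a word is less likely to be sliced in two.
    """
    max_len = int(max_s * rate)
    search = int(search_s * rate)
    windows, start = [], 0
    while len(samples) - start > max_len:
        hard = start + max_len                     # latest allowed cut
        first = max(start + frame, hard - search)  # earliest candidate frame
        cut, quietest = hard, None
        for pos in range(first, hard - frame + 1, frame):
            energy = sum(abs(x) for x in samples[pos:pos + frame])
            # "<=" lets ties fall to the later frame, keeping windows long.
            if quietest is None or energy <= quietest:
                cut, quietest = pos + frame // 2, energy
        windows.append(samples[start:cut])
        start = cut
    windows.append(samples[start:])
    return windows


# Toy signal at 10 Hz with a 1 s cap: one silent sample at index 8.
signal = [1.0] * 25
signal[8] = 0.0
parts = split_windows(signal, rate=10, max_s=1.0, search_s=0.5, frame=1)
print([len(p) for p in parts])
```

Concatenating the windows reproduces the input exactly, so no audio is dropped; each window would then be decoded separately and the texts joined.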
…g it
stagedModelDir() copied the bundled assets into a temp dir and swapped it in without checking the required model files actually landed, so a build shipping incomplete assets would surface as a cryptic native recognizer crash later rather than a clear error. Verify all required files are present before the atomic rename (clear IOException otherwise), check renameTo's result, and mark it @Synchronized so the debug self-test can't race a first record into staging.
Verified on a physical device (debug APK): a cold re-stage after `pm clear` stages all files and the offline self-test decodes test.wav correctly (xRT 0.18, full casing + punctuation).
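The stage, verify, then rename pattern can be sketched as follows. This is a Python analogue of the idea, not the Kotlin `stagedModelDir()`: the function name `stage_model`, the `REQUIRED` file names, and the use of a module-level lock in place of `@Synchronized` are assumptions for illustration.

```python
import os
import shutil
import threading
from pathlib import Path

# Hypothetical file names; the real required set depends on the bundled model.
REQUIRED = ("encoder.onnx", "decoder.onnx", "tokens.txt")
_stage_lock = threading.Lock()


def stage_model(src: Path, final: Path, required=REQUIRED) -> Path:
    """Copy model assets into `final` so it is either complete or absent."""
    with _stage_lock:  # one stager at a time
        if final.is_dir() and all((final / n).is_file() for n in required):
            return final  # already staged and complete
        tmp = final.with_name(final.name + ".tmp")
        shutil.rmtree(tmp, ignore_errors=True)  # clear an interrupted attempt
        shutil.copytree(src, tmp)
        missing = [n for n in required if not (tmp / n).is_file()]
        if missing:
            shutil.rmtree(tmp, ignore_errors=True)
            raise IOError(f"model assets incomplete, missing: {missing}")
        shutil.rmtree(final, ignore_errors=True)  # drop any partial old dir
        os.replace(tmp, final)  # the single step that publishes the model
        return final
```

Because the final directory only appears through the last rename, a process killed mid-copy leaves behind at most a `.tmp` directory, which the next call removes.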
|
For whoever triages this (no rush — flagging in case it helps a human or a review agent): These 11 PRs split into three buckets — all branched off Drop-in, independent, mergeable in any order:
Behavior fixes, low-risk, independent:
Bigger / your call:
#179 is a non-urgent follow-up note on #176's residual (session-code entropy), not a blocker for #176. Totally understand 11 from one contributor is a lot — happy to consolidate, hold, or close any of these to whatever fits your roadmap and bandwidth. Just say the word. |
gui/, src-tauri/, tauri-plugin-voxasr/ live only on the feat/gui branch (dmarzzz#175), not on main — a docs-accuracy PR shouldn't describe a tree that isn't there yet. They'll be added when dmarzzz#175 lands. Keeps the verified-present additions (dictation/, network/, summarizer/) and the cross-platform reframing.
The comment advertised 'zipformer-70m default | nemotron-0.6b', but fetch-deps.sh
only accepts whisper-{tiny,base,small}.en (default whisper-base.en) and exit 1s on
anything else — so copying the old hint sent users straight into a script abort.
|
Thanks @NubsCarson — your on-device Android work landed in |
Ready for review. Turns VoxTerm into a standalone Tauri app with one web GUI that runs on both desktop and phone — a thin control surface over the existing engine, not a reimplementation.
gui/static) reuses audio.capture, get_transcriber, Silero VAD, the diarizer, and EventLogger; nothing in the speech pipeline is duplicated.
LocalBackend drives a sherpa-onnx Tauri plugin (tauri-plugin-voxasr) that records and, at stop, transcribes the clip with offline Whisper (full punctuation). Fully offline: the APK strips the INTERNET permission (RECORD_AUDIO only). Python-only features (diarization, AI summarize, system audio) are hidden on-device.
GUI adds: monochrome document UI, inline audio playback + timestamp-seek, mic/system/both capture, session history/search, server-side md/json/srt/vtt + summary export, loopback-default + LAN token auth + strict CSP.
Verified: full Python suite green (523 passed / 4 skipped) + a headless-Chrome e2e (audio-under-CSP, transcript-derived titles, 0 CSP violations, 0 uncaught JS errors); the Android app + offline on-device transcription verified end-to-end on a real phone. 0 commits behind main, mergeable.
It's large (172 files). If that's a lot to review in one pass, I'm glad to make it easy: split into a stacked series (Tauri shell → on-device plugin → GUI), or land #174's sherpa backend first and rebase this down to just the GUI wrappers. Whatever's easiest for you to review, happy to reshape it.