Minimal push-to-talk desktop dictation app.
Reimplemented from scratch; embeddable, with cross-platform support possible.
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
python3 gui.py

dictate-min

dictate-min --list-input-devices

gui.py is a desktop configuration UI (tkinter/ttk, no extra GUI dependencies).
- Features:
  - grouped env controls (toggles, dropdowns, text fields)
  - preset profiles
  - input device dropdowns (Refresh Input Devices)
  - model dropdowns (Refresh Models) for STT and cleanup backends
  - live runtime apply while dictate-min is running
  - optional auto-restart for restart-required settings (Auto Restart, enabled by default)
- Live runtime updates are written to .dictate-runtime.env and consumed via DICTATE_RUNTIME_ENV_FILE.
- Hold configured PTT key (default Right Ctrl) to record.
- Release to transcribe and paste into the active app.
- Ctrl+C to quit.
- DICTATE_MODE: ptt or loopback (default ptt)
- DICTATE_INPUT_DEVICE: numeric sounddevice input index (optional)
- DICTATE_INPUT_DEVICE_NAME: case-insensitive input-name substring match (used when DICTATE_INPUT_DEVICE is unset)
- DICTATE_SAMPLE_RATE: input sample rate hint (default 16000, auto-falls back if unsupported)
- DICTATE_PASTE: 1 or 0 (default 1 in ptt, 0 in loopback)
- DICTATE_PASTE_ALIGN_FOCUS: 1 or 0 (default 1); capture focused window on PTT press and paste there even if focus changed before output, then restore focus to the newer window
- DICTATE_PASTE_MODE: clipboard, type, primary (default type on Linux, clipboard otherwise)
  - invalid values fall back to clipboard
- DICTATE_PASTE_PRIMARY_CLICK: for primary mode on Linux, trigger middle-click paste after setting PRIMARY selection (default 1)
- DICTATE_PASTE_PRESERVE: preserve and restore prior clipboard/selection text around paste in clipboard/primary modes (default 1)
- DICTATE_PASTE_RESTORE_DELAY_MS: delay before restoring clipboard after paste to avoid race with target app (default 80)
- DICTATE_DEBUG: 1 for verbose runtime diagnostics (default 0)
- DICTATE_DEBUG_KEYS: 1 to log every key press/release and whether it matches PTT (default 0)
- DICTATE_FILE_LOG: 1 to append runtime/model/output events to YYYYMMDD.log (default 1)
- DICTATE_RUNTIME_ENV_FILE: optional path to a runtime .env overrides file polled while running for hot updates (default disabled)
  - hot-reloadable: debug flags, most cleanup settings, DICTATE_PTT_AUTO_PAUSE_MEDIA, ducking settings, DICTATE_PTT_AUTO_SUBMIT, DICTATE_CONTEXT_RESET_EVERY, DICTATE_TRIM_CHUNK_PERIOD, DICTATE_LOOP_GUARD, DICTATE_LOOP_GUARD_ALLOW, DICTATE_PASTE_ALIGN_FOCUS
  - restart-required: mode/device/STT engine settings, paste transport settings, and most audio pipeline geometry settings
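
The polling behaviour behind DICTATE_RUNTIME_ENV_FILE can be pictured with a small sketch. This is not the app's actual code: parse_env_file, split_overrides, and RuntimeEnvWatcher are hypothetical names, the change detection via file mtime is an assumption, and HOT_RELOADABLE lists only the settings named explicitly above (it leaves out the cleanup and ducking settings, which are also described as hot-reloadable).

```python
from pathlib import Path

# Assumption: a deliberately partial set, taken from the names listed above.
HOT_RELOADABLE = {
    "DICTATE_DEBUG", "DICTATE_DEBUG_KEYS", "DICTATE_PTT_AUTO_PAUSE_MEDIA",
    "DICTATE_PTT_AUTO_SUBMIT", "DICTATE_CONTEXT_RESET_EVERY",
    "DICTATE_TRIM_CHUNK_PERIOD", "DICTATE_LOOP_GUARD",
    "DICTATE_LOOP_GUARD_ALLOW", "DICTATE_PASTE_ALIGN_FOCUS",
}


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    overrides = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        overrides[key.strip()] = value
    return overrides


def split_overrides(overrides):
    """Split overrides into (hot, restart_required) dicts."""
    hot = {k: v for k, v in overrides.items() if k in HOT_RELOADABLE}
    restart = {k: v for k, v in overrides.items() if k not in HOT_RELOADABLE}
    return hot, restart


class RuntimeEnvWatcher:
    """Re-reads the overrides file only when its mtime changes."""

    def __init__(self, path):
        self.path = Path(path)
        self._last_mtime = None

    def poll(self):
        """Return parsed overrides if the file changed since the last poll, else None."""
        try:
            mtime = self.path.stat().st_mtime_ns
        except FileNotFoundError:
            return None
        if mtime == self._last_mtime:
            return None
        self._last_mtime = mtime
        return parse_env_file(self.path.read_text(encoding="utf-8"))
```

A main loop would call poll() periodically, apply the hot dict immediately, and flag (or auto-restart on) anything in the restart-required dict.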
- DICTATE_PTT_KEY: cmd_r, super_r, cmd_l, super_l, super, win, shift_r, shift_l, ctrl_l, ctrl_r, alt_l, alt_r (default ctrl_r)
- DICTATE_PTT_AUTO_PAUSE_MEDIA: on PTT press, run playerctl -a pause to pause media playback (Linux, default 1)
- DICTATE_PTT_DUCK_MEDIA: on PTT press, lower default sink volume and restore it on release (Linux, default 0)
- DICTATE_PTT_DUCK_SCOPE: duck target scope when ducking is enabled: default or all (ducks all non-monitor sinks) (default default)
- DICTATE_PTT_DUCK_MEDIA_PERCENT: target volume while PTT is held when ducking is enabled (default 30)
- DICTATE_PTT_AUTO_SUBMIT: press Enter after emitting a PTT chunk; hold Shift while releasing PTT to suppress for that chunk (default 0)
- On Linux PTT release, only players that were playing on press (and are paused at release) are resumed.
- If both ducking and auto-pause are enabled, ducking takes precedence on PTT press.
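
The two rules above (ducking wins over auto-pause, and only players that were playing on press get resumed) can be expressed as two small pure functions. This is an illustrative sketch: the function names are hypothetical, and the "Playing"/"Paused" strings assume the status values reported by playerctl status.

```python
def press_action(duck_enabled, pause_enabled):
    """Decide what to do on PTT press; ducking takes precedence."""
    if duck_enabled:
        return "duck"
    if pause_enabled:
        return "pause"
    return "none"


def players_to_resume(playing_on_press, status_at_release):
    """Return players that were playing on press and are paused at release.

    playing_on_press: list of player names recorded at PTT press.
    status_at_release: dict mapping player name to its status string.
    """
    return [
        name for name in playing_on_press
        if status_at_release.get(name) == "Paused"
    ]
```

A player the user paused before pressing PTT is never in playing_on_press, so it stays paused; a player the user restarted by hand during recording reports "Playing" at release and is left alone.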
- DICTATE_LOOPBACK_CHUNK_S: chunk size in seconds (default 4)
- DICTATE_LOOPBACK_HINT: fallback name hint for non-pulse auto-pick (default loopback pcm)
- DICTATE_PULSE_SOURCE: force pulse source name, e.g. jamesdsp_sink.monitor
- DICTATE_MIN_CHUNK_RMS: skip near-silent chunks below threshold (default 0.0008)
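
As a rough illustration of the DICTATE_MIN_CHUNK_RMS gate, the sketch below computes the root-mean-square level of a chunk and compares it with the threshold. It assumes samples are floats normalized to [-1.0, 1.0]; chunk_rms and should_skip_chunk are hypothetical names, not functions from this project.

```python
import math


def chunk_rms(samples):
    """Root-mean-square level of a chunk of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_skip_chunk(samples, min_rms=0.0008):
    """True when the chunk is quieter than the threshold (near-silent)."""
    return chunk_rms(samples) < min_rms
```

Skipping near-silent chunks before decoding avoids spending STT time on audio that would most likely produce empty or hallucinated text.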
- DICTATE_CLEANUP: enable cleanup pass (1/0, default 1)
- DICTATE_CLEANUP_BACKEND: cleanup backend: ollama (default) or generic_v1 (api_v1/lm_api_v1 aliases)
- DICTATE_CLEANUP_URL: cleanup endpoint URL (defaults to DICTATE_OLLAMA_URL if set, else http://localhost:11434/api/chat)
  - for generic_v1, if this is just a base URL like http://host:1234, app auto-uses http://host:1234/v1/chat
- DICTATE_CLEANUP_MODEL: cleanup model name (defaults to DICTATE_OLLAMA_MODEL; auto-pick for ollama when empty; for generic_v1, attempts discovery from /api/v1/models or /v1/models and picks a single loaded model, else a single available model)
- DICTATE_CLEANUP_API_TOKEN: bearer token for cleanup backend auth (or set LM_API_TOKEN)
- DICTATE_CLEANUP_PROMPT: system prompt for cleanup model (default: "You are a dictation post-processor. Fix punctuation and capitalization only. Do not add ideas. Output only corrected text.")
- DICTATE_CLEANUP_PROMPTS: slash-rule overrides by active window title, format: /match/prompt,/other/prompt2 or /match/prompt ;; /other/prompt2 ;; default prompt
  - matching is case-insensitive substring against active window title
  - if multiple rules match, only the last matching rule is sent
  - if none match, the default prompt is used (when provided in ;; default prompt form), else DICTATE_CLEANUP_PROMPT
  - if no rule matches and the resolved default prompt is empty, cleanup is skipped (no cleanup model request is sent) even when DICTATE_CLEANUP=1
  - prompt templates support placeholders:
    - current: {input}, {target}, {backend}, {model}
    - previous single: {prev_input}, {prev_output}, {prev_prompt}
    - previous indexed (most recent first): {prev_input_1}, {prev_input_2}, {prev_output_1}, {prev_prompt_1}, etc.
    - full buffer (joined by newlines): {buffer_inputs}, {buffer_outputs}, {buffer_prompts}
  - example: DICTATE_CLEANUP_PROMPTS="/terminal/Output only shell-safe text. Keep commands exact.,/code/Preserve code tokens exactly; only fix punctuation around prose."
- DICTATE_CLEANUP_REASONING: for generic_v1 payloads (default off)
- DICTATE_CLEANUP_TEMPERATURE: for generic_v1 payloads (default 0.2)
- DICTATE_CLEANUP_HISTORY_SIZE: number of prior cleanup exchanges kept for template placeholders (default 12)
- DICTATE_OLLAMA_URL / DICTATE_OLLAMA_MODEL: legacy compatibility aliases still supported
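
The DICTATE_CLEANUP_PROMPTS rules described above (case-insensitive substring match on the window title, last match wins, optional default in the ;; form, skip when nothing resolves) can be sketched as follows. This is one possible reading of the format, not the project's parser: the function names are hypothetical, splitting the comma form only at a comma followed by / is an assumption, and leaving unknown placeholders untouched is an assumption as well.

```python
import re


def parse_prompt_rules(spec):
    """Return (rules, default_prompt); rules is a list of (match, prompt)."""
    if ";;" in spec:
        parts = spec.split(";;")
    else:
        # Assumption: in the comma form, a new rule starts at ",/".
        parts = re.split(r",(?=/)", spec)
    rules, default = [], ""
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part.startswith("/") and "/" in part[1:]:
            match, prompt = part[1:].split("/", 1)
            rules.append((match.strip().lower(), prompt.strip()))
        else:
            default = part  # a bare entry is treated as the default prompt
    return rules, default


def resolve_prompt(spec, window_title, fallback_prompt):
    """Pick the prompt for a window title; None means cleanup is skipped."""
    rules, default = parse_prompt_rules(spec)
    title = window_title.lower()
    chosen = None
    for match, prompt in rules:
        if match and match in title:
            chosen = prompt  # later matches overwrite earlier ones
    if chosen is None:
        chosen = default or fallback_prompt
    return chosen or None


def render_prompt(template, values):
    """Fill {name} placeholders; unknown names are left as written."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )
```

With the example value above, a window titled "VS Code - main.py" would resolve to the /code/ prompt, while an unrelated title would fall through to DICTATE_CLEANUP_PROMPT.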
- DICTATE_STT_MODEL: faster-whisper model, e.g. tiny, small, medium.en (default medium.en)
- DICTATE_STT_DEVICE: cpu, auto, cuda (default auto)
- DICTATE_STT_COMPUTE: e.g. int8, float16 (default auto: float16 for auto/cuda, int8 for cpu)
- DICTATE_STT_CONDITION_PREV: maps to condition_on_previous_text (default 0)
- DICTATE_STT_BEAM_SIZE: beam size (default 5)
- DICTATE_STT_NO_SPEECH_THRESHOLD: no-speech threshold (default 0.6)
- DICTATE_STT_LOGPROB_THRESHOLD: log-prob threshold (default -1.0)
- DICTATE_STT_COMPRESSION_RATIO_THRESHOLD: compression-ratio threshold (default 2.4)
- DICTATE_INPUT_LANGUAGE: auto or language code (default auto)
- The current defaults and decoding parameters above are the settings that have worked best for this project so far.
- DICTATE_STT_CONDITION_PREV is still a useful tuning knob to experiment with per setup/content: 0 usually reduces cross-chunk drift/repetition, while 1 can improve continuity.
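
To make the mapping from these variables to faster-whisper concrete, here is a sketch that builds keyword arguments for WhisperModel.transcribe from an environment-like dict. The helper names are hypothetical and the app may wire this differently; the keyword names themselves are the ones faster-whisper's transcribe accepts, and passing language=None asks it to auto-detect.

```python
def stt_compute_type(env):
    """Resolve the compute type: explicit value, else int8 on cpu, float16 otherwise."""
    explicit = env.get("DICTATE_STT_COMPUTE", "")
    if explicit:
        return explicit
    device = env.get("DICTATE_STT_DEVICE", "auto")
    return "int8" if device == "cpu" else "float16"


def stt_transcribe_kwargs(env):
    """Build keyword arguments for faster-whisper's WhisperModel.transcribe."""
    language = env.get("DICTATE_INPUT_LANGUAGE", "auto")
    return {
        "language": None if language == "auto" else language,
        "beam_size": int(env.get("DICTATE_STT_BEAM_SIZE", "5")),
        "condition_on_previous_text": env.get("DICTATE_STT_CONDITION_PREV", "0") == "1",
        "no_speech_threshold": float(env.get("DICTATE_STT_NO_SPEECH_THRESHOLD", "0.6")),
        "log_prob_threshold": float(env.get("DICTATE_STT_LOGPROB_THRESHOLD", "-1.0")),
        "compression_ratio_threshold": float(
            env.get("DICTATE_STT_COMPRESSION_RATIO_THRESHOLD", "2.4")
        ),
        "word_timestamps": True,
    }


# Usage sketch (requires faster-whisper to be installed):
#   import os
#   from faster_whisper import WhisperModel
#   model = WhisperModel("medium.en", device="auto",
#                        compute_type=stt_compute_type(os.environ))
#   segments, info = model.transcribe(audio, **stt_transcribe_kwargs(os.environ))
```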
- DICTATE_CONTEXT: enable text context carryover (1/0, default 1)
- DICTATE_CONTEXT_CHARS: max retained text context chars (default 600)
- DICTATE_CONTEXT_RESET_EVERY: reset context every N emitted chunks (0 disables, default 1)
- DICTATE_AUDIO_CONTEXT_S: prepended previous-audio seconds per chunk (default 1.6)
- DICTATE_AUDIO_CONTEXT_PAD_S: overlap pad used for timestamp clipping (default 0.12)
- DICTATE_STT_TAIL_PAD_S: append trailing silence before Whisper decode to reduce end-of-chunk hallucinations (default 0.08)
- DICTATE_TRIM_CHUNK_PERIOD: trim trailing . / ... from chunk output (default 1)
- DICTATE_LOOP_GUARD: enable pathological-loop detection + context reset (default 1)
- DICTATE_LOOP_GUARD_REPEAT_RATIO: repetition trigger ratio (default 0.55)
- DICTATE_LOOP_GUARD_PUNCT_RATIO: punctuation-density trigger ratio (default 0.35)
- DICTATE_LOOP_GUARD_SHORT_RUN: repeated short-token run trigger length (default 4)
- DICTATE_LOOP_GUARD_SHORT_LEN: max token length considered "short" for run detection (default 3)
- DICTATE_LOOP_GUARD_ALLOW: comma-separated exact-match allowlist that bypasses loop guard (spaces/; also accepted) (default ipv4,mac,semver,dotted_numeric)
  - supported rules: ipv4, ipv6, mac, semver, dotted_numeric
  - examples: 1.1.1.1 (ipv4), AA:BB:CC:DD:EE:FF (mac), 1.2.3 (semver/dotted_numeric)
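
The README does not spell out how the loop-guard ratios are computed, so the sketch below is only one plausible interpretation: repeat ratio as the share of non-unique tokens, punctuation ratio as punctuation characters over non-space characters, and a run check for consecutive identical short tokens. All names, metric definitions, and regular expressions here are assumptions (ipv6 is omitted for brevity); the thresholds are the documented defaults.

```python
import re
import string

# Assumed patterns for a subset of the documented allowlist rules.
ALLOW_PATTERNS = {
    "ipv4": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
    "mac": re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"),
    "semver": re.compile(r"\d+\.\d+\.\d+"),
    "dotted_numeric": re.compile(r"\d+(?:\.\d+)+"),
}


def is_allowlisted(text, allow=("ipv4", "mac", "semver", "dotted_numeric")):
    """True when the whole text exactly matches an enabled allowlist rule."""
    text = text.strip()
    return any(
        ALLOW_PATTERNS[name].fullmatch(text)
        for name in allow
        if name in ALLOW_PATTERNS
    )


def looks_like_loop(text, repeat_ratio=0.55, punct_ratio=0.35,
                    short_run=4, short_len=3):
    """Heuristic check for degenerate, repetitive transcription output."""
    if is_allowlisted(text):
        return False
    tokens = text.lower().split()
    if not tokens:
        return False
    # Share of tokens that repeat an earlier token.
    if 1.0 - len(set(tokens)) / len(tokens) >= repeat_ratio:
        return True
    # Punctuation density over non-space characters.
    chars = [c for c in text if not c.isspace()]
    if sum(c in string.punctuation for c in chars) / len(chars) >= punct_ratio:
        return True
    # Run of identical short tokens, e.g. "uh uh uh uh".
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev and len(cur) <= short_len else 1
        if run >= short_run:
            return True
    return False
```

The allowlist matters because legitimate strings such as 1.1.1.1 are punctuation-dense and would otherwise trip the punctuation check.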
- STT is faster-whisper (works on Linux; no MLX required).
- Cleanup supports ollama and generic_v1-style backends.
- Legacy DICTATE_OLLAMA_URL / DICTATE_OLLAMA_MODEL are still accepted as compatibility aliases.
- Ollama is used for cleanup only (STT is not routed through Ollama in this app).
- If CUDA init fails, set DICTATE_STT_DEVICE=cpu explicitly for stable fallback.
- ALSA/PortAudio stream-open failures like:
  Expression 'AlsaOpen(...)' failed
  Expression 'PaAlsaStream_Initialize(...)' failed
  typically mean PortAudio could see a device but could not open it with current capture parameters/backend routing.
- Common causes:
  - stale DICTATE_INPUT_DEVICE index after device reorder
  - loopback source/backend mismatch (pulse source not actually available at open time)
  - sample-rate/channel constraints on the selected source
  - transient device busy state during backend switch/restart
- Runtime behavior:
  - on stream-open failure, app now retries by refreshing a stale numeric device index from its current device name, then retries once with default input device
- Quick checks:
  - refresh device list: dictate-min --list-input-devices
  - prefer name-based selection: set DICTATE_INPUT_DEVICE_NAME and clear DICTATE_INPUT_DEVICE
  - in loopback mode, set DICTATE_PULSE_SOURCE explicitly to the active monitor source
  - if needed, set DICTATE_SAMPLE_RATE to the source native rate (often 48000)
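
The retry order described under "Runtime behavior" (original index, then the index re-resolved from the device name, then the default input) might look like the sketch below. It is not the app's code: open_stream is a hypothetical callable that opens a capture stream for a device index, devices is assumed to be a list of dicts shaped like the entries sounddevice.query_devices() returns, and None stands for the default input device.

```python
def find_input_by_name(devices, name_hint):
    """Return the index of the first input device whose name contains the hint."""
    hint = name_hint.lower()
    for index, dev in enumerate(devices):
        if dev.get("max_input_channels", 0) > 0 and hint in dev.get("name", "").lower():
            return index
    return None


def open_with_fallback(open_stream, devices, index, name):
    """Try the configured index, then the name-refreshed index, then the default."""
    candidates = [index]
    refreshed = find_input_by_name(devices, name) if name else None
    if refreshed is not None and refreshed != index:
        candidates.append(refreshed)
    if index is not None:
        candidates.append(None)  # None = default input device
    last_error = None
    for candidate in candidates:
        try:
            return open_stream(candidate)
        except Exception as exc:  # a real implementation would catch the audio backend's error type
            last_error = exc
    raise last_error
```

Name-based lookup is what makes the retry robust to device reordering, which is also why setting DICTATE_INPUT_DEVICE_NAME is the preferred configuration.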
- Split platform-specific code into adapters (audio routing, key hooks, paste, media control).
- Define a shared runtime core (config, pipeline, state, events) with platform modules:
  - platform/linux.py
  - platform/macos.py
  - platform/windows.py
- Remove Linux-specific assumptions from startup and runtime hot paths.
- Add feature-capability probing so unsupported features degrade gracefully per OS.
- Keep current backends:
  - ollama
  - generic_v1 (/api/v1/chat / /v1/chat)
- Add providers with first-class schema adapters:
  - OpenAI-compatible chat endpoints
  - Anthropic-style message APIs
  - Additional local gateways as adapter plugins
- Improve model discovery and selection policy:
  - prefer loaded model
  - support pinned model aliases
  - optional fail-fast on missing model
- Keep faster-whisper as default baseline.
- Add pluggable STT engine interface for:
  - alternate local Whisper runtimes
  - cloud STT providers
  - platform-native dictation APIs (where available)
- Standardize STT output normalization (timestamps, confidence, no-speech handling).
- Expand hot-reload coverage where safe and keep strict restart boundaries where required.
- Add explicit “restart needed” diagnostics in CLI logs and GUI state.
- Add profile import/export and per-project profile switching.
- Add automated tests for:
  - prompt rule parsing/templating
  - runtime override application
  - duck/pause/resume state machines
  - backend response parsing edge cases
- Add integration smoke tests for dictate-min and gui.py launch paths.
Use loopback mode to transcribe whatever is playing on a loopback/monitor input device:

source .venv/bin/activate
DICTATE_MODE=loopback dictate-min

If auto-pick chooses the wrong source, pin it:

DICTATE_MODE=loopback DICTATE_INPUT_DEVICE=25 dictate-min

Or steer auto-pick by name:

DICTATE_MODE=loopback DICTATE_LOOPBACK_HINT="loopback pcm" dictate-min

On PipeWire/Pulse systems, loopback mode first tries:
- PULSE_SOURCE=<Active Sink>.monitor (derived from current sink inputs)
- fallback: <Default Sink>.monitor
- with the pulse input device

This makes capture follow real playback routing (e.g. JamesDSP) instead of your default input source.
Overlap handling:
- Uses word_timestamps=True and drops words that fall inside the prepended audio-context window.
- Adds cross-chunk prefix/suffix dedup on emitted words.
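
Both overlap steps can be sketched with plain lists. The code below is an approximation under stated assumptions: words are (text, start, end) tuples with times relative to the start of the decoded audio, the cutoff is taken as context seconds minus the pad, and the dedup compares lowercased words with trailing punctuation stripped. The function names and the max_overlap limit are hypothetical.

```python
def drop_context_words(words, context_s, pad_s=0.12):
    """Keep only words that start after the prepended audio-context window.

    words: list of (text, start_s, end_s) tuples from word-level timestamps.
    """
    cutoff = max(0.0, context_s - pad_s)
    return [text for (text, start, _end) in words if start >= cutoff]


def _norm(word):
    """Normalize a word for comparison across chunks."""
    return word.lower().strip(".,!?;:")


def dedup_overlap(previous, current, max_overlap=8):
    """Drop the longest prefix of current that repeats the suffix of previous."""
    limit = min(max_overlap, len(previous), len(current))
    for k in range(limit, 0, -1):
        if [_norm(w) for w in previous[-k:]] == [_norm(w) for w in current[:k]]:
            return current[k:]
    return current
```

Timestamp clipping removes most of the repeated audio context; the word-level dedup catches the remainder when a word straddles the cutoff and is decoded in both chunks.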
Where transcription goes:
- Always printed to stdout as plain text chunks (space-separated, no TRANSCRIPT: prefix)
- Also pasted only when DICTATE_PASTE=1
- Daily file log in current working directory: YYYYMMDD.log
dictate-min

python3 gui.py

DICTATE_MODE=loopback DICTATE_DEBUG=1 DICTATE_CLEANUP=0 dictate-min

DICTATE_MODE=loopback DICTATE_PULSE_SOURCE=jamesdsp_sink.monitor dictate-min

DICTATE_MODE=ptt DICTATE_INPUT_DEVICE_NAME=HTI-UC320 dictate-min

DICTATE_MODE=loopback \
DICTATE_STT_MODEL=tiny \
DICTATE_LOOPBACK_CHUNK_S=3 \
DICTATE_STT_BEAM_SIZE=1 \
DICTATE_CLEANUP=0 \
dictate-min

DICTATE_MODE=loopback \
DICTATE_STT_MODEL=medium.en \
DICTATE_STT_BEAM_SIZE=5 \
DICTATE_INPUT_LANGUAGE=en \
dictate-min

DICTATE_MODE=loopback \
DICTATE_LOOP_GUARD=1 \
DICTATE_LOOP_GUARD_REPEAT_RATIO=0.40 \
DICTATE_LOOP_GUARD_PUNCT_RATIO=0.20 \
DICTATE_LOOP_GUARD_SHORT_RUN=3 \
DICTATE_CONTEXT_RESET_EVERY=6 \
dictate-min

DICTATE_TRIM_CHUNK_PERIOD=0 dictate-min

DICTATE_PASTE_MODE=type dictate-min

DICTATE_PASTE_MODE=primary DICTATE_PASTE_PRIMARY_CLICK=1 dictate-min

DICTATE_PASTE_PRESERVE=0 dictate-min

Clipboard preservation note:
- Preservation is best-effort for text clipboard content. Non-text clipboard payloads (e.g. images/custom mime types) may not round-trip through text clipboard APIs.
