Add audio transcription with per-chat language support #460
bodharma wants to merge 7 commits into d99kris:master from
|
Hello @bodharma - thanks for contributing, and great work! The patch touches many components, yet manages to follow the design and code style of nchat very well. I'd like to understand a little bit more high-level before going into detailed code review.
No need to update the patch yet, I think it can be good to discuss and understand pros/cons first. |
|
Hello! Thanks for the detailed review and thoughtful questions. Let me address each one:
The separate Alt-Shift-U binding exists for cache management. Alt-U checks cache first (instant display, saves API costs), while Alt-Shift-U forces a fresh API call. This is useful when the first transcription used the
You're absolutely right - the current UX is incomplete. When disabled, transcriptions are cached but not displayed (no feedback). This was intended for future features (search indexing, accessibility), but pressing
The fields serve different purposes:
Storage overhead is minimal. I can add comments documenting intent or remove unused ones?
I chose a separate table for:
If this becomes a bottleneck, it can be optimized with batch loading. If you prefer denormalization for simplicity, I can migrate to messages table, but we'd lose the metadata fields and expiration capability. Let me know which direction you'd like to take! No rush on patch updates - happy to discuss pros/cons first. References:
|
|
Thanks for the quick reply! Firstly, let me share how I envision the feature to work: With this background let me provide my responses:
Also, please help to update TRANSCRIPTION.md, especially section "How to Use" with a real (copy & paste) UI example that reflects what it looks like. If you can update the PR along these lines, I'll re-review. Thanks! |
|
One small point: the documentation instructs the user to run Documentation should instruct the user on how to create a venv, install dependencies in that environment, and then call the script from the virtual environment, e.g. these are path for my locally-installed nchat, but Alternatively, instructions might be easier with |
Thanks for helping to review @pinusc 👍 Yeah, I spotted this too and was going to comment on it once the general structure / scope was settled. Another approach I ended up using is automatically creating a virtual environment and installing dependencies from within the script - see pipdeps.py for an example. It feels a bit hacky, but I haven't found any other reasonably practical way to distribute standalone Python scripts that depend on pip packages. |
|
One thing I would also mention in the documentation is that the I ended up with where this allowed me to use Using I think due to python packaging complexity, especially coupled with nchat as a system package, it makes sense to include a method that is, in theory, easier to set up (for me it was just an AUR package away). About and then in or The config does get ugly this way, but maybe a wrapper script could check for As another side note: these examples create a virtualenv in Finally, another alternative is to sidestep the issue by just ditching python entirely and switching to a bash script. |
|
I would also like to report two bugs. First, long audio transcriptions do not wrap properly, and some words are lost. Resizing the window a bit makes them appear (but then other words disappear). Second, a highlight issue: the last line of the transcription gets highlighted as a reaction (only if there are no reaction emojis). This happens for both hard line breaks and soft wrapping. I didn't say this before, but thanks for developing this! It's awesome and it works really well. Especially useful since WhatsApp does not currently have Italian audio transcription, which I need, and running on PC allows me to run the |
Good points, I also observed these. FYI @bodharma - I have local fixes for these issues - maybe makes more sense to share once we have the high level design laid down (but let me know if you'd like to see them earlier).
Thanks! Same here @bodharma - I have a local fix for this.
Agree it will be a nice addition! |
The fact that it's configurable is great, but I feel like it might be better to have a sane default and then make it work with the external message viewer (alt-w by default) to allow reading the whole transcript. |
|
This looks very interesting. I'm available to help with the code if you'd like. This functionality is a must-have for text-based chats; having audio transcription is a great addition. |
|
In Whisper.cpp is available in: It's actually quite easy to use it after you download a model: just call Example: |
Built over a weekend to read voice messages as text instead of listening.

Features:
- Alt-U to transcribe voice messages
- Alt-Shift-U to re-transcribe
- Alt-L to set language per chat
- 22+ languages (en, ru, uk, es, fr, de, zh, ja, etc.)
- Auto-truncate long transcriptions (configurable max lines)
- Works with OpenAI API, whisper.cpp, or local Whisper

Config in ~/.config/nchat/ui.conf:
- audio_transcribe_enabled=1 # turn it on
- audio_transcribe_language=auto # or en, ru, uk, etc.
- audio_transcribe_max_lines=15 # truncate long ones

Database: Added transcriptionLanguage to chats2 table (schema v9)
Docs: TRANSCRIPTION.md for usage, TRANSCRIPTION-SETUP.md for setup
Add schema 11→12 migration that adds a transcription TEXT column to the messages table and migrates existing data from the transcriptions table. Rewrite StoreTranscription, GetTranscription, HasTranscription, and DeleteTranscription to operate on messages instead of the separate transcriptions table. Remove p_Language and p_Service params from StoreTranscription as they are no longer needed.
…g options
- Remove retranscribe_audio key binding (alt-shift-U) and its dispatch handler; alt-U now always (re-)transcribes unconditionally
- Simplify TranscribeAudio(): drop p_ForceRetranscribe param, remove the cache-check block and the useCache/timeout static variables
- Fix StoreTranscription call to 4-arg signature (drop language arg)
- Remove audio_transcribe_cache and audio_transcribe_inline config defaults
- Remove OnKeyRetranscribeAudio() declaration and definition
|
Hi @d99kris — rebased onto current master and addressed all the feedback:
Could you share your local patches for the two rendering bugs (word wrap and the reaction-highlight on the last transcription line)? Also spotted a bug in the |
d35052a to 9c334f9
Pull request overview
This PR adds an audio transcription feature to nchat, including UI actions to transcribe selected voice messages and set a per-chat transcription language, backed by cache DB schema updates and a bundled Python transcription helper.
Changes:
- Add keybindings/UI handlers for transcribing selected audio messages and setting per-chat transcription language.
- Extend cache DB schema and data model to store per-chat `transcriptionLanguage` and per-message transcriptions.
- Add a bundled `transcribe` Python script plus documentation for usage/setup.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| src/uimodel.h | Exposes new transcription and language-setting key handlers/methods. |
| src/uimodel.cpp | Implements transcription command execution, per-chat language selection, and persistence hooks. |
| src/uilanguagelistdialog.h | Adds a list dialog interface for choosing a transcription language. |
| src/uilanguagelistdialog.cpp | Implements the language picker list and filtering. |
| src/uikeyconfig.cpp | Adds default key mappings for transcribe and set-language actions. |
| src/uihistoryview.cpp | Renders cached transcriptions inline under audio attachments and truncates long outputs. |
| src/uihelpview.cpp | Adds help-bar items for the new shortcuts. |
| src/uiconfig.cpp | Introduces default config keys for transcription feature toggles/parameters. |
| src/transcribe | Provides the Whisper backend wrapper script (OpenAI / whisper.cpp / local whisper). |
| src/main.cpp | Updates interactive help text for transcription shortcuts. |
| lib/ncutil/src/messagecache.h | Adds cache request type + APIs for transcription language and transcription storage. |
| lib/ncutil/src/messagecache.cpp | Implements schema migrations and persistence for transcription language and transcriptions. |
| lib/common/src/protocol.h | Extends ChatInfo with transcriptionLanguage. |
| doc/TRANSCRIPTION.md | Documents transcription usage and configuration. |
| doc/TRANSCRIPTION-SETUP.md | Documents backend setup options (whisper.cpp, local python, OpenAI, etc.). |
| CMakeLists.txt | Builds/installs the new language dialog sources and installs the transcribe script. |

    // Build and execute command
    std::string transcription;
    std::string cmd = cmdTemplate;
    StrUtil::ReplaceString(cmd, "%1", filePath);

    // Add language parameter if specified
    // First check per-chat language setting, then fall back to global setting
    std::string language;
    if (m_ChatInfos.count(profileId) && m_ChatInfos[profileId].count(chatId))
    {
      language = m_ChatInfos[profileId][chatId].transcriptionLanguage;
    }

    // Fall back to global setting if per-chat language is not set
    if (language.empty())
    {
      language = UiConfig::GetStr("audio_transcribe_language");
    }

    // Default to auto if still empty
    if (language.empty())
    {
      language = "auto";
    }

    if (language != "auto")
    {
      cmd += " -l " + language;
    }

Transcription command is built by string concatenation and executed via SysUtil::RunCommand (which uses /bin/sh -c). filePath (from attachment path) and language are inserted unescaped, so paths containing quotes or a malicious/invalid language value can break the command or lead to shell injection. Prefer executing without a shell (argv-style) or at minimum properly shell-escape the substituted file path and strictly validate/whitelist the language value before appending it.
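
The "escape and whitelist" option from this comment could look roughly like the sketch below. `ShellQuote`, `IsSafeLanguage` and `BuildTranscribeCommand` are hypothetical helpers written for illustration, not existing nchat functions, and the sketch assumes the command template contains an unquoted `%1` placeholder. Executing without a shell (argv-style) would avoid the quoting problem altogether and is not shown here.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not an nchat API): wrap a string in single quotes for
// POSIX sh, replacing each embedded ' with '\'' so the shell sees one literal word.
std::string ShellQuote(const std::string& p_Str)
{
  std::string quoted = "'";
  for (char ch : p_Str)
  {
    if (ch == '\'')
    {
      quoted += "'\\''";
    }
    else
    {
      quoted += ch;
    }
  }
  quoted += "'";
  return quoted;
}

// Hypothetical helper: accept only short codes made of ASCII letters, digits,
// '-' and '_' (e.g. "en", "uk", "auto"); anything else is rejected.
bool IsSafeLanguage(const std::string& p_Lang)
{
  if (p_Lang.empty() || (p_Lang.size() > 16)) return false;

  return std::all_of(p_Lang.begin(), p_Lang.end(), [](unsigned char ch)
  {
    return std::isalnum(ch) || (ch == '-') || (ch == '_');
  });
}

// Assumption: p_Template contains an unquoted %1 placeholder for the file path.
std::string BuildTranscribeCommand(std::string p_Template,
                                   const std::string& p_FilePath,
                                   const std::string& p_Language)
{
  const std::string placeholder = "%1";
  const std::string quotedPath = ShellQuote(p_FilePath);
  std::size_t pos = 0;
  while ((pos = p_Template.find(placeholder, pos)) != std::string::npos)
  {
    p_Template.replace(pos, placeholder.size(), quotedPath);
    pos += quotedPath.size();
  }

  // Append -l only for validated, non-auto languages
  if ((p_Language != "auto") && IsSafeLanguage(p_Language))
  {
    p_Template += " -l " + p_Language;
  }

  return p_Template;
}
```

With this sketch an invalid language value is silently dropped, so the backend falls back to auto-detection; logging a warning in that case may be preferable.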

    if (rv && !transcription.empty())
    {
      MessageCache::StoreTranscription(profileId, chatId, msgId, transcription);

      // Update UI
      UpdateHistory();
      return true;
    }

MessageCache::StoreTranscription(...) return value is ignored, but TranscribeAudio() still returns success. If cache is disabled (app.conf cache_enabled=0) or the UPDATE affects no rows, the user will see a successful transcription with nothing displayed. Propagate StoreTranscription failure (or fall back to storing in-memory / showing the transcription directly) so the UI result matches what’s persisted.
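
One way to act on this comment is to make the outcome explicit, as in the minimal sketch below. `TranscribeStatus` and `HandleTranscription` are hypothetical names, and the store callback stands in for `MessageCache::StoreTranscription`, which is assumed here to return `false` when the cache is disabled or no row was updated.

```cpp
#include <functional>
#include <string>

// Hypothetical result type, not part of nchat
enum class TranscribeStatus
{
  Failed,    // command failed or produced no text
  ShownOnly, // text available in memory only, persisting failed
  Stored     // text persisted to the cache
};

// p_Store stands in for MessageCache::StoreTranscription (assumed to return
// false on failure). p_Fallback receives the text when persisting fails, so
// the caller can still display it.
TranscribeStatus HandleTranscription(bool p_CommandOk,
                                     const std::string& p_Text,
                                     const std::function<bool(const std::string&)>& p_Store,
                                     std::string& p_Fallback)
{
  if (!p_CommandOk || p_Text.empty())
  {
    return TranscribeStatus::Failed;
  }

  if (p_Store(p_Text))
  {
    return TranscribeStatus::Stored;
  }

  // Persisting failed: keep the text so the UI result is not silently empty
  p_Fallback = p_Text;
  return TranscribeStatus::ShownOnly;
}
```

The caller could then log a warning for `ShownOnly` and render `p_Fallback` directly instead of relying on the cache.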

    " Alt-u transcribe selected audio message\n"
    " Alt-U re-transcribe selected audio message\n"

Help text advertises Alt-U as “re-transcribe”, but no separate key binding/action exists for it (only transcribe_audio is configured/handled). Either add a dedicated key/action for re-transcribe (and implement different behavior), or remove/update this help entry to match what’s actually supported.

    if (schemaVersion == 9)
    {
      LOG_INFO("update db schema 9 to 10");

      *m_Dbs[p_ProfileId] << "CREATE TABLE IF NOT EXISTS transcriptions ("
        "chatId TEXT NOT NULL,"
        "msgId TEXT NOT NULL,"
        "transcription TEXT NOT NULL,"
        "language TEXT DEFAULT '',"
        "service TEXT DEFAULT '',"
        "timestamp INTEGER NOT NULL,"
        "PRIMARY KEY (chatId, msgId)"
        ");";

      *m_Dbs[p_ProfileId] << "CREATE INDEX IF NOT EXISTS idx_transcriptions_timestamp "
        "ON transcriptions(timestamp);";

      schemaVersion = 10;
      *m_Dbs[p_ProfileId] << "UPDATE version "
        "SET schema = ?;" << schemaVersion;
    }

    if (schemaVersion == 10)
    {
      LOG_INFO("update db schema 10 to 11");

      *m_Dbs[p_ProfileId] << "ALTER TABLE chats2 ADD COLUMN transcriptionLanguage TEXT DEFAULT '';";

      schemaVersion = 11;
      *m_Dbs[p_ProfileId] << "UPDATE version "
        "SET schema = ?;" << schemaVersion;
    }

    if (schemaVersion == 11)
    {
      LOG_INFO("update db schema 11 to 12");

      *m_Dbs[p_ProfileId] << "ALTER TABLE messages ADD COLUMN transcription TEXT DEFAULT '';";

      // Migrate existing transcriptions into messages table
      *m_Dbs[p_ProfileId] <<
        "UPDATE messages SET transcription = ("
        " SELECT t.transcription FROM transcriptions t"
        " WHERE t.chatId = messages.chatId AND t.msgId = messages.id"
        ") WHERE EXISTS ("
        " SELECT 1 FROM transcriptions t"
        " WHERE t.chatId = messages.chatId AND t.msgId = messages.id"
        ");";

      schemaVersion = 12;
      *m_Dbs[p_ProfileId] << "UPDATE version SET schema = ?;" << schemaVersion;
    }

    static const int64_t s_SchemaVersion = 12;

PR description mentions “schema v9”, but the code bumps the cache DB schema to 12 (adds transcriptions table, chats2.transcriptionLanguage, and messages.transcription). Please update the PR description (or adjust the migrations) so the documented schema version matches the actual code.
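
For context, the chained `if (schemaVersion == N)` steps in the diff mean a database at any older version runs every later step in order and ends at the current version. The sketch below mirrors that pattern with an in-memory stand-in for the SQL statements; `MigrateSchema` and the step descriptions are illustrative only, not nchat code.

```cpp
#include <cstdint>
#include <string>
#include <vector>

static const int64_t s_SchemaVersion = 12;

// Illustrative stand-in: each step records what it would do instead of
// running SQL. Returns the schema version reached.
int64_t MigrateSchema(int64_t p_Version, std::vector<std::string>& p_Applied)
{
  if (p_Version == 9)
  {
    p_Applied.push_back("create transcriptions table and index");
    p_Version = 10;
  }

  if (p_Version == 10)
  {
    p_Applied.push_back("add chats2.transcriptionLanguage");
    p_Version = 11;
  }

  if (p_Version == 11)
  {
    p_Applied.push_back("add messages.transcription and copy existing rows");
    p_Version = 12;
  }

  return p_Version;
}
```

A database at version 9 applies all three steps, while one already at 11 applies only the last; in both cases the result equals `s_SchemaVersion`, which is the number the description should quote.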
- Fix shell injection: escape single quotes in filePath and validate language against safe character whitelist before shell substitution
- Align audio extension lists: replace aac with webm in both uimodel and uihistoryview to match the transcribe script SUPPORTED_FORMATS
- Fix isTranscription dim color: reverse std::distance args so posFromEnd increases correctly in reverse iteration (was wrapping to huge size_t)
- Clamp WordWrap width to minimum 1 to prevent unsigned underflow on very narrow terminals
- Check StoreTranscription return and log warning on failure
- Replace INSERT ON CONFLICT with UPDATE-only for transcriptionLanguage to prevent partial chat rows if chatId is missing
- Drop legacy transcriptions table and index in schema 12 migration
- Remove unimplemented config defaults: audio_transcribe_auto, audio_transcribe_timeout, and hardcoded audio_transcribe_command path (runtime fallback uses CMake install prefix instead)
- Update TRANSCRIPTION.md: remove dead config keys, fix keyboard shortcuts, fix cache path/SQL, add webm to supported formats
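
The unsigned underflow mentioned for WordWrap can be illustrated with a small sketch. `SafeWrapWidth` is a hypothetical helper, not the actual nchat change; it assumes the wrap width is computed by subtracting some reserved columns (indent, padding) from the terminal width.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: columns left for wrapped text after reserving p_Reserved.
// A plain p_TotalWidth - p_Reserved on std::size_t wraps to a huge value when
// p_Reserved exceeds p_TotalWidth, so compare first and clamp to at least 1.
std::size_t SafeWrapWidth(std::size_t p_TotalWidth, std::size_t p_Reserved)
{
  const std::size_t available =
    (p_TotalWidth > p_Reserved) ? (p_TotalWidth - p_Reserved) : 0;
  return std::max<std::size_t>(available, 1);
}
```

The same wrap-around is what the `std::distance` item describes: a negative distance converted to `size_t` becomes a very large number, so the comparison has to be done on the signed value or with the arguments in the right order.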