
Add audio transcription with per-chat language support #460

Open
bodharma wants to merge 7 commits into d99kris:master from bodharma:feature/per-chat-transcription-language

Conversation


bodharma commented Jan 4, 2026

Built over a weekend to read voice messages as text instead of listening.

Features:

  • Alt-U to transcribe voice messages
  • Alt-Shift-U to re-transcribe
  • Alt-L to set language per chat
  • 22+ languages (en, ru, uk, es, fr, de, zh, ja, etc.)
  • Auto-truncate long transcriptions (configurable max lines)
  • Works with OpenAI API, whisper.cpp, or local Whisper

Config in ~/.config/nchat/ui.conf:

  • audio_transcribe_enabled=1 # turn it on
  • audio_transcribe_language=auto # or en, ru, uk, etc.
  • audio_transcribe_max_lines=15 # truncate long ones

Database: Added transcriptionLanguage to chats2 table (schema v9)

Docs: TRANSCRIPTION.md for usage, TRANSCRIPTION-SETUP.md for setup

d99kris self-assigned this Jan 6, 2026
d99kris (Owner) commented Jan 11, 2026

Hello @bodharma - thanks for contributing, and great work! The patch touches many components, yet manages to follow the design and code style of nchat very well.

I'd like to understand the feature at a slightly higher level before going into a detailed code review.

  1. Why is a separate key binding for re-transcribe needed? Is there an issue letting alt-u (re-)transcribe every time?

  2. What does disabling audio_transcribe_inline mean from a user's perspective? It seems nothing would be shown to the user.

  3. The new transcriptions db table has several fields not really used (timestamp, service, language), are they added to cater for some future feature addition?

  4. Do you see any concerns with storing the transcribed text in the messages table (as a new field) instead? My main concern with a separate table is performance and I'd like to minimize the number of db queries for common use-cases (like viewing message history).

No need to update the patch yet, I think it can be good to discuss and understand pros/cons first.

bodharma (Author) commented:

Hello! Thanks for the detailed review and thoughtful questions. Let me address each one:

  1. Re-transcribe key binding

The separate Alt-Shift-U binding exists for cache management. Alt-U checks the cache first (instant display, saves API costs), while Alt-Shift-U forces a fresh API call. This is useful when the first transcription used the wrong language, hit network errors, or the user wants to retry with different settings. Without it, users would need to manually clear the cache or disable caching entirely. It follows the common UX pattern (browser Ctrl-R vs Ctrl-Shift-R).

  2. audio_transcribe_inline setting

You're absolutely right - the current UX is incomplete. When disabled, transcriptions are cached but not displayed (no feedback). This was intended for future features (search indexing, accessibility), but pressing Alt-U provides no indication to the user. I can either add a status message or remove the setting entirely. What's your preference?

  3. Unused database fields

The fields serve different purposes:

  • language: ✅ Currently populated - tracks which language was used
  • timestamp: ✅ Currently populated - intended for expiring old transcriptions (privacy)
  • service: ❌ Not yet used - intended for cost tracking/service fallback

Storage overhead is minimal. Should I add comments documenting the intent, or remove the unused fields?

  4. Database architecture (separate vs messages table)

I chose a separate table for:

  • Storage efficiency: 90%+ of messages aren't audio (avoids NULL overhead)
  • Metadata flexibility: Can store language/service/timestamp without bloating messages
  • Privacy features: Can expire transcriptions independently
  • Performance: The N+1 pattern uses PRIMARY KEY lookups on local SQLite (~5ms impact for typical usage)

If this becomes a bottleneck, it can be optimized with batch loading. If you prefer denormalization for simplicity, I can migrate to the messages table, but we'd lose the metadata fields and the expiration capability.

Let me know which direction you'd like to take! No rush on patch updates - happy to discuss pros/cons first.

References:

  • Schema: messagecache.cpp:382-414
  • Logic: uimodel.cpp:3920-4070, uihistoryview.cpp:232-289
  • Bindings: uikeyconfig.cpp:235-236

d99kris (Owner) commented Jan 18, 2026

Thanks for the quick reply!

Firstly, let me share how I envision the feature working:
For messages with an attached audio clip, the user can press alt-u to perform transcription (which is always cached).
When a transcription for a message is available it is always shown to the user inline, they don't need to press any key.
If the user for some reason wants to re-transcribe a clip, they can select the message and press alt-u again.

With this background let me provide my responses:

  1. I'm not seeing a scenario where the user needs to press alt-u to display a cached transcription that already exists (and is not displayed). If there is no such scenario, let's just let alt-u (re-)transcribe every time.

  2. I don't see any benefit of having an audio_transcribe_inline disabled mode. That file could, however, check whether audio_transcribe_enabled is enabled at all before looking up / displaying transcriptions.

  3. Right now these fields are written but not read anywhere in the patch. That means they are effectively dead code / speculative design. Let's remove any fields / code not used.

  4. Yes, let's just go with one extra column in the messages table and drop the metadata. I think it will be better performance-wise (reducing the number of queries), and it will also be simpler to extend other functions to access transcribed data: auto-compose, history export, search (see the sketch below).
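
For illustration, with the single-column design a history read could look something like the sketch below, in the sqlite_modern_cpp style used by messagecache.cpp (the HistoryRow type and the id/text column names are illustrative assumptions, not the actual schema):

#include <sqlite_modern_cpp.h>
#include <string>
#include <vector>

struct HistoryRow
{
  std::string id;
  std::string text;
  std::string transcription; // empty when the message has no transcription
};

// A single query returns the transcription alongside each message, instead
// of one extra PRIMARY KEY lookup per audio message against a side table.
std::vector<HistoryRow> FetchHistory(sqlite::database& db, const std::string& chatId)
{
  std::vector<HistoryRow> rows;
  db << "SELECT id, text, transcription FROM messages WHERE chatId = ?;"
     << chatId
     >> [&](std::string id, std::string text, std::string transcription)
  {
    rows.push_back(HistoryRow{ id, text, transcription });
  };
  return rows;
}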

Also, please help to update TRANSCRIPTION.md, especially the "How to Use" section, with a real (copy & paste) UI example that reflects what it looks like.

If you can update the PR along these lines, I'll re-review. Thanks!

pinusc (Contributor) commented Jan 19, 2026

One small point: the documentation instructs the user to run pip; however, most Linux distros no longer allow running pip system-wide.

Documentation should instruct the user on how to create a venv, install dependencies in that environment, and then call the script from the virtual environment, e.g.

audio_transcribe_command=~/dev/nchat/.venv/bin/python ~/dev/nchat/src/transcribe -s whisper-local -f '%1' -m base

these are the paths for my locally-installed nchat, but ~/dev/nchat/src/transcribe could just as easily be /usr/local/libexec/nchat/transcribe; instructions could use .local/nchat/venv as a default location

Alternatively, instructions might be easier with pipx or uv

d99kris (Owner) commented Jan 20, 2026

> One small point: the documentation instructs the user to run pip; however, most Linux distros no longer allow running pip system-wide.

Thanks for helping to review @pinusc 👍

Yeah, I spotted this too and was going to comment on it once the general structure / scope was settled.
Personally I try to avoid pip packages since that change, but it might be tedious to rewrite it without pip dependencies (could be worth a try though).

Another approach I ended up using is automatically creating a virtual environment and installing dependencies from within the script - see pipdeps.py for an example. It feels a bit hacky, but I haven't found any other reasonably practical way to distribute standalone python scripts that depend on pip packages.

pinusc (Contributor) commented Jan 20, 2026

One thing I would also mention in the documentation is that the transcribe script is not special: any script that takes an audio file as input and gives text as output works.

I ended up with
audio_transcribe_command=~/.config/nchat/transcribe.sh '%1'

where transcribe.sh is

ffmpeg -loglevel quiet -i "$1" -f wav - | whisper-cli -m ~/.local/share/ggml-large-v2.bin -l auto -otxt -f -

This allowed me to use whisper-cli, which I think makes more sense for this use case than running a full-on server.
In fact, I feel like the transcribe script should be able to just call whisper-cli, or use some python bindings, instead of running the server... maybe have that as a fourth option?

Using faster-whisper was just not an option for me as I ran into dependency issues, since the underlying libraries do not support CUDA 12. I think this kind of issue might be common: python packaging is already complicated, and it becomes even worse with CUDA / hardware-acceleration stuff.

I think due to python packaging complexity, especially coupled with nchat as a system package, it makes sense to include a method that is, in theory, easier to set up (for me it was just an AUR package away).


About pipdeps.py: I would recommend against that, as it will make a simple utility script even more complex. Note that we run into python version issues as well as hardware-acceleration issues; for example, faster-whisper does NOT run on python>=3.14, due to a dependency that is not available. My opinion is that if people are running nchat, they are likely proficient enough to follow instructions to create a virtualenv or use uv. In fact, uv solves enough problems (and is becoming widely adopted) that I would just recommend the instructions explain how to install dependencies using uv. This could be as simple as:

cd .config/nchat
uv venv --python 3.13
uv pip install faster-whisper

and then in ui.conf, either

audio_transcribe_command=~/.config/nchat/.venv/bin/python /usr/local/libexec/nchat/transcribe ...

or

audio_transcribe_command=uv run --directory ~/.config/nchat /usr/local/libexec/nchat/transcribe ...

The config does get ugly this way, but maybe a wrapper script could check for .venv/venv within nchat's data directory and try to activate it before falling back to the system python? That way, creating the virtualenv would be sufficient.

As another side note: these examples create a virtualenv in .config/nchat, but I'm a strong proponent of #322, where all non-config stuff would be in XDG_DATA_DIR, e.g. ~/.local/nchat/.


Finally, another alternative is to sidestep the issue by ditching python entirely and switching to a bash script. The OpenAI API can be substituted with a couple of curl calls; drop support for faster-whisper and only explicitly support whisper.cpp, which as I noted above is really easy to handle from a bash script. Or maybe include an optional python script for faster-whisper and hand-wavily tell the user to install it in a python virtual environment (in this scenario the python script could just live on GitHub and not be installed at all).

pinusc (Contributor) commented Jan 20, 2026

I would also like to report two bugs: long audio transcriptions do not wrap properly, and some words are lost. Resizing the window a bit makes them appear (but then other words disappear).
This does not happen with other messages.

Also a highlight issue: the last line of the transcription gets highlighted as a reaction (only if there are no reaction emojis). This happens both for hard linebreaks and soft wrapping.

I didn't say this before, but thanks for developing this! It's awesome and it works really well. It's especially useful since WhatsApp does not currently offer Italian audio transcription, which I need, and running on a PC lets me use the large model, which is a lot more accurate than whatever the phone version runs.

d99kris (Owner) commented Jan 21, 2026

> I would also like to report two bugs: long audio transcriptions do not wrap properly, and some words are lost. Resizing the window a bit makes them appear (but then other words disappear).
> This does not happen with other messages.

Good points, I also observed these. FYI @bodharma - I have local fixes for these issues - maybe it makes more sense to share them once we have the high-level design laid down (but let me know if you'd like to see them earlier).

> Also a highlight issue: the last line of the transcription gets highlighted as a reaction (only if there are no reaction emojis). This happens both for hard linebreaks and soft wrapping.

Thanks! Same here @bodharma - I have a local fix for this.

> I didn't say this before, but thanks for developing this! It's awesome and it works really well. It's especially useful since WhatsApp does not currently offer Italian audio transcription, which I need, and running on a PC lets me use the large model, which is a lot more accurate than whatever the phone version runs.

Agree it will be a nice addition!

pinusc (Contributor) commented Jan 21, 2026

> • Auto-truncate long transcriptions (configurable max lines)

The fact that it's configurable is great, but I feel like it might be better to have a sane default and then make it work with the external message viewer (alt-w by default) to allow reading the whole transcript.

hellocodelinux commented:

This looks very interesting. I'm available to help with the code if you'd like. This functionality is a must-have for text-based chats; having audio transcription is a great addition.


fulalas commented Apr 5, 2026

In TRANSCRIPTION-SETUP.md I would put whisper.cpp at the top, then whisper python (much slower), then Groq, and finally OpenAI, considering price/privacy.

Whisper.cpp is available in:
Arch: https://aur.archlinux.org/packages/whisper.cpp
Debian: https://packages.debian.org/sid/whisper.cpp
Fedora: https://packages.fedoraproject.org/pkgs/whisper-cpp/whisper-cpp

It's actually quite easy to use after you download a model: just call whisper-cli, as it should be in $PATH. No server is needed.

Example: whisper-cli "audio.wav" --model /user_path/ggml-large-v3.bin --threads $(nproc) --language auto --output-txt

Add schema 11→12 migration that adds a transcription TEXT column to the
messages table and migrates existing data from the transcriptions table.
Rewrite StoreTranscription, GetTranscription, HasTranscription, and
DeleteTranscription to operate on messages instead of the separate
transcriptions table. Remove p_Language and p_Service params from
StoreTranscription as they are no longer needed.
…g options

- Remove retranscribe_audio key binding (alt-shift-U) and its dispatch
  handler; alt-U now always (re-)transcribes unconditionally
- Simplify TranscribeAudio(): drop p_ForceRetranscribe param, remove the
  cache-check block and the useCache/timeout static variables
- Fix StoreTranscription call to 4-arg signature (drop language arg)
- Remove audio_transcribe_cache and audio_transcribe_inline config defaults
- Remove OnKeyRetranscribeAudio() declaration and definition
bodharma (Author) commented:

Hi @d99kris — rebased onto current master and addressed all the feedback:

  • DB refactor: transcription now stored as a transcription column in the messages table (schema v12), migrated from the separate transcriptions table. Removes the N+1 query concern and simplifies access from other features.
  • Single key binding: removed Alt-Shift-U; Alt-U now always re-transcribes.
  • Removed config options: dropped audio_transcribe_inline and audio_transcribe_cache. Transcription display is gated on audio_transcribe_enabled only.
  • Docs: added a real UI example to TRANSCRIPTION.md; rewrote TRANSCRIPTION-SETUP.md with whisper.cpp first (system package + simple bash script), uv instead of pip, and backends reordered by cost/privacy as suggested by @fulalas and @pinusc.

Could you share your local patches for the two rendering bugs (word wrap and the reaction-highlight on the last transcription line)?

Also spotted a bug in the isTranscription position check in uihistoryview.cpp: std::distance(wline, wlines.rbegin()) returns negative values for non-initial loop iterations, which wraps around in size_t posFromEnd. This means the dim color for transcription lines never applies. Happy to fix alongside your patches so we can do it in one pass.
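
For reference, the buggy pattern reduced to a minimal sketch (variable names taken from the description above; the actual code in uihistoryview.cpp differs in detail):

#include <iterator>
#include <string>
#include <vector>

void DimCheckSketch(const std::vector<std::wstring>& wlines)
{
  for (auto wline = wlines.rbegin(); wline != wlines.rend(); ++wline)
  {
    // std::distance(first, last) expects first to precede last. With the
    // arguments swapped it yields 0 on the first iteration and negative
    // values afterwards, which wrap to huge numbers in an unsigned size_t:
    // size_t posFromEnd = std::distance(wline, wlines.rbegin()); // buggy
    size_t posFromEnd = std::distance(wlines.rbegin(), wline); // 0, 1, 2, ...
    (void)posFromEnd; // position from the end, as needed for the dim-color check
  }
}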

bodharma force-pushed the feature/per-chat-transcription-language branch from d35052a to 9c334f9 on April 27, 2026 14:37
Copilot AI review requested due to automatic review settings April 27, 2026 14:37

Copilot AI left a comment


Pull request overview

This PR adds an audio transcription feature to nchat, including UI actions to transcribe selected voice messages and set a per-chat transcription language, backed by cache DB schema updates and a bundled Python transcription helper.

Changes:

  • Add keybindings/UI handlers for transcribing selected audio messages and setting per-chat transcription language.
  • Extend cache DB schema and data model to store per-chat transcriptionLanguage and per-message transcriptions.
  • Add a bundled transcribe Python script plus documentation for usage/setup.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 16 comments.

Summary per file:

src/uimodel.h: Exposes new transcription and language-setting key handlers/methods.
src/uimodel.cpp: Implements transcription command execution, per-chat language selection, and persistence hooks.
src/uilanguagelistdialog.h: Adds a list dialog interface for choosing a transcription language.
src/uilanguagelistdialog.cpp: Implements the language picker list and filtering.
src/uikeyconfig.cpp: Adds default key mappings for transcribe and set-language actions.
src/uihistoryview.cpp: Renders cached transcriptions inline under audio attachments and truncates long outputs.
src/uihelpview.cpp: Adds help-bar items for the new shortcuts.
src/uiconfig.cpp: Introduces default config keys for transcription feature toggles/parameters.
src/transcribe: Provides the Whisper backend wrapper script (OpenAI / whisper.cpp / local whisper).
src/main.cpp: Updates interactive help text for transcription shortcuts.
lib/ncutil/src/messagecache.h: Adds cache request type + APIs for transcription language and transcription storage.
lib/ncutil/src/messagecache.cpp: Implements schema migrations and persistence for transcription language and transcriptions.
lib/common/src/protocol.h: Extends ChatInfo with transcriptionLanguage.
doc/TRANSCRIPTION.md: Documents transcription usage and configuration.
doc/TRANSCRIPTION-SETUP.md: Documents backend setup options (whisper.cpp, local python, OpenAI, etc.).
CMakeLists.txt: Builds/installs the new language dialog sources and installs the transcribe script.


Comment thread src/uimodel.cpp
Comment on lines +4364 to +4392
// Build and execute command
std::string transcription;
std::string cmd = cmdTemplate;
StrUtil::ReplaceString(cmd, "%1", filePath);

// Add language parameter if specified
// First check per-chat language setting, then fall back to global setting
std::string language;
if (m_ChatInfos.count(profileId) && m_ChatInfos[profileId].count(chatId))
{
  language = m_ChatInfos[profileId][chatId].transcriptionLanguage;
}

// Fall back to global setting if per-chat language is not set
if (language.empty())
{
  language = UiConfig::GetStr("audio_transcribe_language");
}

// Default to auto if still empty
if (language.empty())
{
  language = "auto";
}

if (language != "auto")
{
  cmd += " -l " + language;
}

Copilot AI Apr 27, 2026


Transcription command is built by string concatenation and executed via SysUtil::RunCommand (which uses /bin/sh -c). filePath (from attachment path) and language are inserted unescaped, so paths containing quotes or a malicious/invalid language value can break the command or lead to shell injection. Prefer executing without a shell (argv-style) or at minimum properly shell-escape the substituted file path and strictly validate/whitelist the language value before appending it.
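
For example, something along these lines (a sketch only; exact helpers in nchat may differ, and the escaping must match how '%1' is quoted in the configured command template):

#include <cctype>
#include <string>

// Escape embedded single quotes so the file path can be substituted into a
// single-quoted '%1' placeholder: each ' becomes '\'' (close the quote,
// emit an escaped quote, reopen).
static std::string EscapeSingleQuotes(const std::string& str)
{
  std::string escaped;
  for (char c : str)
  {
    if (c == '\'') escaped += "'\\''";
    else escaped += c;
  }
  return escaped;
}

// Whitelist language values to short alphabetic codes such as "en" or
// "auto" before appending them to the shell command.
static bool IsSafeLanguage(const std::string& lang)
{
  if (lang.empty() || (lang.size() > 8)) return false;
  for (char c : lang)
  {
    if (!std::isalpha(static_cast<unsigned char>(c))) return false;
  }
  return true;
}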

Comment thread src/uimodel.cpp Outdated
Comment thread src/uimodel.cpp
Comment on lines +4398 to +4405
if (rv && !transcription.empty())
{
  MessageCache::StoreTranscription(profileId, chatId, msgId, transcription);

  // Update UI
  UpdateHistory();
  return true;
}

Copilot AI Apr 27, 2026


MessageCache::StoreTranscription(...) return value is ignored, but TranscribeAudio() still returns success. If cache is disabled (app.conf cache_enabled=0) or the UPDATE affects no rows, the user will see a successful transcription with nothing displayed. Propagate StoreTranscription failure (or fall back to storing in-memory / showing the transcription directly) so the UI result matches what’s persisted.
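
For example, as a drop-in for the snippet quoted above (a sketch; assumes StoreTranscription returns a bool and that a LOG_WARNING macro exists alongside the LOG_INFO used elsewhere in the codebase):

if (rv && !transcription.empty())
{
  if (!MessageCache::StoreTranscription(profileId, chatId, msgId, transcription))
  {
    LOG_WARNING("failed to store transcription");
    return false; // keep the reported result consistent with what was persisted
  }

  // Update UI
  UpdateHistory();
  return true;
}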

Comment thread src/main.cpp
Comment on lines +687 to +688
" Alt-u transcribe selected audio message\n"
" Alt-U re-transcribe selected audio message\n"

Copilot AI Apr 27, 2026


Help text advertises Alt-U as “re-transcribe”, but no separate key binding/action exists for it (only transcribe_audio is configured/handled). Either add a dedicated key/action for re-transcribe (and implement different behavior), or remove/update this help entry to match what’s actually supported.

Comment thread doc/TRANSCRIPTION.md Outdated
Comment thread lib/ncutil/src/messagecache.cpp Outdated
Comment thread src/uiconfig.cpp Outdated
Comment on lines +443 to +496
if (schemaVersion == 9)
{
  LOG_INFO("update db schema 9 to 10");

  *m_Dbs[p_ProfileId] << "CREATE TABLE IF NOT EXISTS transcriptions ("
    "chatId TEXT NOT NULL,"
    "msgId TEXT NOT NULL,"
    "transcription TEXT NOT NULL,"
    "language TEXT DEFAULT '',"
    "service TEXT DEFAULT '',"
    "timestamp INTEGER NOT NULL,"
    "PRIMARY KEY (chatId, msgId)"
    ");";

  *m_Dbs[p_ProfileId] << "CREATE INDEX IF NOT EXISTS idx_transcriptions_timestamp "
    "ON transcriptions(timestamp);";

  schemaVersion = 10;
  *m_Dbs[p_ProfileId] << "UPDATE version "
    "SET schema = ?;" << schemaVersion;
}

if (schemaVersion == 10)
{
  LOG_INFO("update db schema 10 to 11");

  *m_Dbs[p_ProfileId] << "ALTER TABLE chats2 ADD COLUMN transcriptionLanguage TEXT DEFAULT '';";

  schemaVersion = 11;
  *m_Dbs[p_ProfileId] << "UPDATE version "
    "SET schema = ?;" << schemaVersion;
}

if (schemaVersion == 11)
{
  LOG_INFO("update db schema 11 to 12");

  *m_Dbs[p_ProfileId] << "ALTER TABLE messages ADD COLUMN transcription TEXT DEFAULT '';";

  // Migrate existing transcriptions into messages table
  *m_Dbs[p_ProfileId] <<
    "UPDATE messages SET transcription = ("
    " SELECT t.transcription FROM transcriptions t"
    " WHERE t.chatId = messages.chatId AND t.msgId = messages.id"
    ") WHERE EXISTS ("
    " SELECT 1 FROM transcriptions t"
    " WHERE t.chatId = messages.chatId AND t.msgId = messages.id"
    ");";

  schemaVersion = 12;
  *m_Dbs[p_ProfileId] << "UPDATE version SET schema = ?;" << schemaVersion;
}

static const int64_t s_SchemaVersion = 12;

Copilot AI Apr 27, 2026


PR description mentions “schema v9”, but the code bumps the cache DB schema to 12 (adds transcriptions table, chats2.transcriptionLanguage, and messages.transcription). Please update the PR description (or adjust the migrations) so the documented schema version matches the actual code.

Comment thread src/uiconfig.cpp Outdated
Comment thread src/uihistoryview.cpp Outdated
- Fix shell injection: escape single quotes in filePath and validate
  language against safe character whitelist before shell substitution
- Align audio extension lists: replace aac with webm in both uimodel
  and uihistoryview to match the transcribe script SUPPORTED_FORMATS
- Fix isTranscription dim color: reverse std::distance args so posFromEnd
  increases correctly in reverse iteration (was wrapping to huge size_t)
- Clamp WordWrap width to minimum 1 to prevent unsigned underflow on
  very narrow terminals
- Check StoreTranscription return and log warning on failure
- Replace INSERT ON CONFLICT with UPDATE-only for transcriptionLanguage
  to prevent partial chat rows if chatId is missing
- Drop legacy transcriptions table and index in schema 12 migration
- Remove unimplemented config defaults: audio_transcribe_auto,
  audio_transcribe_timeout, and hardcoded audio_transcribe_command path
  (runtime fallback uses CMake install prefix instead)
- Update TRANSCRIPTION.md: remove dead config keys, fix keyboard
  shortcuts, fix cache path/SQL, add webm to supported formats
