Summary
The CustomVoice language conditioning bug documented in #87 (Issue B) and addressed by PR #98 still exists in v0.8.0. The assistantPreambleKernel in v0.8.0 still implements only the 8-row (no-language) prefix layout, producing unintelligible audio output for the only officially supported Qwen3-TTS model family (Qwen3-TTS-12Hz-0.6B-CustomVoice).
We have applied the fix from PR #98 (adapted for v0.8.0) and confirmed by human listening on a Jetson Orin Nano that it produces correct, intelligible speech.
Environment
[...] (main as of 2026-06-15)
Model: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
CMake flags: -DENABLE_CUTE_DSL=gemm -DCUTE_DSL_ARTIFACT_TAG=sm_87 -DCMAKE_CUDA_ARCHITECTURES=87
What fails
Running qwen3_tts_inference against the 0.6B-CustomVoice model with unmodified v0.8.0 produces WAV files that contain:
[...] user role in messages)
The audio is structurally valid (WAV container, correct sample rate) but semantically wrong: it contains no intelligible speech.
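The distinction between "structurally valid" and "contains speech" can be checked partly by machine. The following is a minimal sketch, assuming the output is 16-bit PCM (the sample format is not stated in this report); inspect_wav is a hypothetical helper name, not part of the project.

```python
import math
import struct
import wave


def inspect_wav(path):
    """Return (sample_rate, duration_seconds, rms) for a 16-bit PCM WAV file.

    Hypothetical helper: assumes 16-bit samples, which this report does not confirm.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)
    # Samples are little-endian signed 16-bit, interleaved across channels.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
    return sample_rate, n_frames / sample_rate, rms
```

An RMS near zero flags silence, but a non-zero RMS says nothing about intelligibility: hum or noise passes this check, which is why the result here was judged by listening.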
Root cause (same as #87 Issue B)
assistantPreambleKernel in cpp/kernels/talkerMLPKernels/talkerMLPKernels.cu implements only the 8-row (Base / no-language) prefix: [...]
CustomVoice models require the 9-row layout with language_id: [...]
Additionally, codec_language_id is not propagated from the HuggingFace talker_config during export (in tensorrt_edgellm/scripts/export.py, _patch_tts_config copies codec_think_id but not codec_language_id).
Fix applied (v0.8.0)
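To make the export-side gap concrete, here is a minimal sketch of a key-copy step that includes codec_language_id. It is not the actual body of _patch_tts_config: copy_talker_keys and its signature are hypothetical, and the real function presumably copies further keys.

```python
def copy_talker_keys(engine_config, talker_config,
                     keys=("codec_think_id", "codec_language_id")):
    """Return a copy of engine_config with selected talker_config keys added.

    Hypothetical stand-in for the key-copy step; the key names come from this
    report, everything else is an assumption.
    """
    patched = dict(engine_config)
    for key in keys:
        # Keys absent from the checkpoint config are skipped, not invented.
        if key in talker_config:
            patched[key] = talker_config[key]
    return patched


# The codec_think_id value is a placeholder; english=2050 is the id reported here.
talker_config = {"codec_think_id": 1, "codec_language_id": {"english": 2050}}
engine_config = copy_talker_keys({}, talker_config)
```

If codec_language_id is left out of the key list, the engine config.json has no language map for the runtime to read, which is the situation described above.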
We applied the equivalent of PR #98 adapted for v0.8.0. Files changed:
cpp/kernels/talkerMLPKernels/talkerMLPKernels.{cu,h}: Added languageId, codecThinkId, and prefixRows parameters. When languageId >= 0, the kernel emits the 9-row prefix; otherwise it preserves the 8-row layout.
cpp/runtime/qwen3OmniTTSRuntime.{cpp,h}: Added mLanguageIdMap and mActiveLanguageId. Parses the codec_language_id map and codec_think_id from config.json. Auto-selects "english" as the default when available. Threads languageId through projectToTalkerInput and buildTalkerPrefillFromSegments.
examples/omni/qwen3_tts_inference.cpp: Forwards the "language" field from the input JSON.
tensorrt_edgellm/scripts/export.py: Added "codec_language_id" to the list of keys copied from talker_config in _patch_tts_config.
Engine config.json: Manually added the 12-language codec_language_id map from the HuggingFace checkpoint's talker_config.
Validation
After the fix:
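Output: 3.76s of clear, intelligible English speech ("Hello, I am Chip. It is nice to meet you.")
Validated by human listening on Jetson Orin Nano via reSpeaker XVF3800 USB audio
Speaker: serena (id=3066), Language: english (id=2050)
Additional note on message format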
The official TTS docs show messages with "role": "assistant" only (not "user"). Using "role": "user" produces long silence/hum even with the language fix applied. This is consistent with Qwen3-TTS being a pure TTS model, not a conversational one.
Request
PR #98 addresses this exact issue but targets v0.7.x and currently has merge conflicts (dirty status). Could the maintainers either:
[...] main
This is the only bug preventing the officially supported CustomVoice models from producing correct output.
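For reviewers rebasing the fix, the language selection described under "Fix applied" can be modeled roughly as follows. This is a Python sketch of the logic only, not the C++ runtime code: resolve_language_id, prefix_rows, and NO_LANGUAGE are hypothetical names, and raising on an unknown language is an assumption about behavior this report does not specify.

```python
NO_LANGUAGE = -1  # assumed sentinel: any negative id means "no language row"


def resolve_language_id(language_map, requested=None, default="english"):
    """Pick a codec language id from the codec_language_id map.

    An explicit request must exist in the map (assumption: unknown names are
    an error). Without a request, fall back to the default when it is
    available, else to NO_LANGUAGE.
    """
    if requested is not None:
        if requested not in language_map:
            raise KeyError("unknown language: %s" % requested)
        return language_map[requested]
    return language_map.get(default, NO_LANGUAGE)


def prefix_rows(language_id):
    """9-row prefix when a language id is set (>= 0), otherwise the 8-row layout."""
    return 9 if language_id >= 0 else 8


# english=2050 is the id reported in the validation run.
language_id = resolve_language_id({"english": 2050})
rows = prefix_rows(language_id)
```

With an empty map, as in an engine exported without codec_language_id, the same call falls back to NO_LANGUAGE and the 8-row layout, which matches the failing behavior on CustomVoice models.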
References
[...] codec_language_id map