CustomVoice language conditioning bug confirmed in v0.8.0 — working fix applied (ref #87 / PR #98) #109

Description

@robertvandervoort

Summary

The CustomVoice language conditioning bug documented in #87 (Issue B) and addressed by PR #98 is still present in v0.8.0. The assistantPreambleKernel in v0.8.0 still implements only the 8-row (no-language) prefix layout, which produces unintelligible audio for the only officially supported Qwen3-TTS model family (Qwen3-TTS-12Hz-0.6B-CustomVoice).

We have applied the fix from PR #98 (adapted for v0.8.0) and confirmed by human listening on a Jetson Orin Nano that it produces correct, intelligible speech.

Environment

  • TensorRT-Edge-LLM: v0.8.0 (main as of 2026-06-15)
  • Device: Jetson Orin Nano (SM87, JetPack 6.x)
  • CUDA: 13.2
  • TensorRT: 10.x
  • Model: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
  • Build flags: -DENABLE_CUTE_DSL=gemm -DCUTE_DSL_ARTIFACT_TAG=sm_87 -DCMAKE_CUDA_ARCHITECTURES=87

What fails

Running qwen3_tts_inference against the 0.6B-CustomVoice model with v0.8.0 unmodified produces WAV files that contain:

  • Very short bursts of clicks/hiss (~1-2 seconds)
  • Longer hum/noise with no speech content (~13 seconds when using user role in messages)

The audio is structurally valid (WAV container, correct sample rate) but semantically wrong — no intelligible speech.

Root cause (same as #87 Issue B)

assistantPreambleKernel in cpp/kernels/talkerMLPKernels/talkerMLPKernels.cu only implements the 8-row (Base / no-language) prefix:

rows 0-2: projected[0-2]
row 3:    ttsPad + embTable[codecNothinkId]
row 4:    ttsPad + embTable[codecThinkBosId]
row 5:    ttsPad + embTable[codecThinkEosId]
row 6:    ttsPad + embTable[speakerId]
row 7:    ttsBos + embTable[codecPadId]

CustomVoice models require the 9-row layout with language_id:

rows 0-2: projected[0-2]
row 3:    ttsPad + embTable[codecThinkId]
row 4:    ttsPad + embTable[codecThinkBosId]
row 5:    ttsPad + embTable[languageId]        ← NEW
row 6:    ttsPad + embTable[codecThinkEosId]
row 7:    ttsPad + embTable[speakerId]
row 8:    ttsBos + embTable[codecPadId]
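
To make the difference between the two layouts easier to compare, here is a minimal Python sketch that builds both prefixes as symbolic row descriptions. It is not the kernel code: `build_prefix_rows` is a hypothetical helper, and the only behaviour taken from this report is the row order and the rule that a non-negative language id selects the 9-row layout.

```python
from typing import List


def build_prefix_rows(language_id: int = -1) -> List[str]:
    """Return symbolic descriptions of the talker prefix rows.

    Hypothetical helper for illustration only. A negative language_id
    selects the 8-row (Base / no-language) layout; a non-negative one
    selects the 9-row CustomVoice layout described above.
    """
    rows = ["projected[0]", "projected[1]", "projected[2]"]
    if language_id >= 0:
        # 9-row layout: think token, think BOS, language id, think EOS.
        rows += [
            "ttsPad + embTable[codecThinkId]",
            "ttsPad + embTable[codecThinkBosId]",
            "ttsPad + embTable[languageId]",
            "ttsPad + embTable[codecThinkEosId]",
        ]
    else:
        # 8-row layout: no-think token, think BOS, think EOS.
        rows += [
            "ttsPad + embTable[codecNothinkId]",
            "ttsPad + embTable[codecThinkBosId]",
            "ttsPad + embTable[codecThinkEosId]",
        ]
    # Both layouts end with the speaker row and the BOS/pad row.
    rows += ["ttsPad + embTable[speakerId]", "ttsBos + embTable[codecPadId]"]
    return rows


for index, row in enumerate(build_prefix_rows(language_id=2050)):
    print(f"row {index}: {row}")
```

Note that the two layouts differ in row 3 as well (codecNothinkId versus codecThinkId), not only in the inserted language row.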

Additionally, codec_language_id is not propagated from the HuggingFace talker_config during export: _patch_tts_config in tensorrt_edgellm/scripts/export.py copies codec_think_id but not codec_language_id.
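
The export-side gap amounts to a missing entry in a list of keys to copy. The sketch below illustrates that idea only; `copy_talker_keys` and `KEYS_TO_COPY` are hypothetical names, the real `_patch_tts_config` signature and key list are not reproduced here, and the `codec_think_id` value is a placeholder.

```python
from typing import Any, Dict, Iterable

# Hypothetical key list; the real list in export.py may contain more keys.
KEYS_TO_COPY = ("codec_think_id", "codec_language_id")


def copy_talker_keys(
    talker_config: Dict[str, Any],
    exported_config: Dict[str, Any],
    keys: Iterable[str] = KEYS_TO_COPY,
) -> Dict[str, Any]:
    """Return a copy of exported_config with selected talker keys added."""
    patched = dict(exported_config)
    for key in keys:
        # Keys absent from the checkpoint config are skipped, not invented.
        if key in talker_config:
            patched[key] = talker_config[key]
    return patched


talker_config = {
    "codec_think_id": 0,  # placeholder value, not taken from the checkpoint
    "codec_language_id": {"english": 2050},  # id reported in this issue
}
print(copy_talker_keys(talker_config, {}))
```

Dropping "codec_language_id" from the key list reproduces the reported symptom: the exported config carries codec_think_id but no language map.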

Fix applied (v0.8.0)

We applied the equivalent of PR #98 adapted for v0.8.0. Files changed:

  1. cpp/kernels/talkerMLPKernels/talkerMLPKernels.{cu,h} — Added languageId, codecThinkId, and prefixRows parameters. When languageId >= 0, emits the 9-row prefix; otherwise preserves the 8-row layout.

  2. cpp/runtime/qwen3OmniTTSRuntime.{cpp,h} — Added mLanguageIdMap and mActiveLanguageId. Parses codec_language_id map and codec_think_id from config.json. Auto-selects "english" as default when available. Threads languageId through projectToTalkerInput and buildTalkerPrefillFromSegments.

  3. examples/omni/qwen3_tts_inference.cpp — Forwards "language" field from input JSON.

  4. tensorrt_edgellm/scripts/export.py — Added "codec_language_id" to the list of keys copied from talker_config in _patch_tts_config.

  5. Engine config.json — Manually added the 12-language codec_language_id map from the HuggingFace checkpoint's talker_config.
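
The language selection described in item 2 can be sketched in a few lines of Python. This mirrors the described behaviour (use the requested language, otherwise default to "english" when available) rather than the actual C++ runtime; `resolve_language_id` is a hypothetical name, and the lowercasing and the error on unknown languages are assumptions.

```python
from typing import Any, Dict, Optional


def resolve_language_id(config: Dict[str, Any], requested: Optional[str] = None) -> int:
    """Pick a language id from the config's codec_language_id map.

    Returns -1 when no language applies, which corresponds to the
    8-row (no-language) prefix layout.
    """
    language_map = config.get("codec_language_id") or {}
    if requested is not None:
        key = requested.lower()  # assumption: names are stored lowercase
        if key not in language_map:
            raise ValueError(f"unknown language: {requested!r}")
        return language_map[key]
    # No explicit request: default to English when the map provides it.
    return language_map.get("english", -1)


config = {"codec_language_id": {"english": 2050}}  # id reported in this issue
print(resolve_language_id(config))             # default selection
print(resolve_language_id(config, "English"))  # explicit request
print(resolve_language_id({}))                 # no map: falls back to -1
```

Returning -1 for a config without a language map keeps Base models on the 8-row path, which matches the languageId >= 0 check described in item 1.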

Validation

After the fix:

projectToTalkerInput: seqLen=21, N=13, prefixRows=9, outputSeqLen=24, speakerId=3066, languageId=2050
Batch 0: 47 audio frames (exit: EOS)
  • Output: 3.76s of clear, intelligible English speech ("Hello, I am Chip. It is nice to meet you.")
  • Validated by human listening on Jetson Orin Nano via reSpeaker XVF3800 USB audio
  • Speaker: serena (id=3066), Language: english (id=2050)
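
As a quick consistency check on the numbers above, the frame count and the reported duration agree if the codec runs at 12.5 frames per second. That frame rate is an assumption here (the model name only says "12Hz"), and `frames_to_seconds` is a hypothetical helper.

```python
FRAME_RATE_HZ = 12.5  # assumption: codec frame rate behind the "12Hz" model name


def frames_to_seconds(num_frames: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Convert a count of audio codec frames to a duration in seconds."""
    return num_frames / frame_rate_hz


# 47 frames at 12.5 frames/s is 3.76 s, matching the reported output length.
print(f"{frames_to_seconds(47):.2f} s")
```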

Additional note on message format

The official TTS docs show messages with "role": "assistant" only (not "user"). Using "role": "user" produces long silence/hum even with the language fix applied. This is consistent with Qwen3-TTS being a pure TTS model, not conversational.

Request

PR #98 addresses this exact issue but targets v0.7.x and currently has merge conflicts (dirty status). Could the maintainers either:

  1. Rebase/merge PR #98 ("feat #87: add CustomVoice language conditioning support for Qwen3-TTS"; issue link: https://github.com/NVIDIA/TensorRT-Edge-LLM/issues/87), or
  2. Apply the equivalent fix to v0.8.0 main

This is the only bug preventing the officially supported CustomVoice models from producing correct output.
