CustomVoice language conditioning bug confirmed in v0.8.0 — working fix applied (ref #87 / PR #98)

## Summary

The CustomVoice language conditioning bug documented in #87 (Issue B) and addressed by PR #98 is **confirmed to still exist in v0.8.0**. The `assistantPreambleKernel` in v0.8.0 still only implements the 8-row (no-language) prefix layout, producing unintelligible audio output for the only officially supported Qwen3-TTS model family (`Qwen3-TTS-12Hz-0.6B-CustomVoice`).

We applied the fix from PR #98, adapted for v0.8.0, and confirmed by human listening on a Jetson Orin Nano that it produces correct, intelligible speech.

## Environment

- **TensorRT-Edge-LLM**: v0.8.0 (`main` as of 2026-06-15)
- **Device**: Jetson Orin Nano (SM87, JetPack 6.x)
- **CUDA**: 13.2
- **TensorRT**: 10.x
- **Model**: `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`
- **Build flags**: `-DENABLE_CUTE_DSL=gemm -DCUTE_DSL_ARTIFACT_TAG=sm_87 -DCMAKE_CUDA_ARCHITECTURES=87`

## What fails

Running `qwen3_tts_inference` against the 0.6B-CustomVoice model with v0.8.0 unmodified produces WAV files that contain:
- Very short bursts of clicks/hiss (~1-2 seconds)
- Longer hum/noise with no speech content (~13 seconds when using `user` role in messages)

The audio is structurally valid (WAV container, correct sample rate) but semantically wrong — no intelligible speech.

## Root cause (same as #87 Issue B)

`assistantPreambleKernel` in `cpp/kernels/talkerMLPKernels/talkerMLPKernels.cu` only implements the 8-row (Base / no-language) prefix:

```
rows 0-2: projected[0-2]
row 3:    ttsPad + embTable[codecNothinkId]
row 4:    ttsPad + embTable[codecThinkBosId]
row 5:    ttsPad + embTable[codecThinkEosId]
row 6:    ttsPad + embTable[speakerId]
row 7:    ttsBos + embTable[codecPadId]
```

CustomVoice models require the 9-row layout with `language_id`:

```
rows 0-2: projected[0-2]
row 3:    ttsPad + embTable[codecThinkId]
row 4:    ttsPad + embTable[codecThinkBosId]
row 5:    ttsPad + embTable[languageId]        ← NEW
row 6:    ttsPad + embTable[codecThinkEosId]
row 7:    ttsPad + embTable[speakerId]
row 8:    ttsBos + embTable[codecPadId]
```
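
The two layouts share rows 0-2 and the final two rows, so the choice between them comes down to one branch on the language ID. The Python below is an illustrative sketch, not the CUDA kernel: `build_prefix_rows` is a hypothetical helper, and each row is a symbolic label rather than an actual embedding sum.

```python
def build_prefix_rows(speaker_id: int, language_id: int = -1) -> list:
    """Return symbolic labels for the talker prefix rows.

    Hypothetical helper for illustration only; the real kernel writes
    embedding sums into a tensor, not strings into a list.
    """
    rows = [f"projected[{i}]" for i in range(3)]  # rows 0-2 in both layouts
    if language_id >= 0:
        # 9-row CustomVoice layout: think, think-BOS, language, think-EOS
        rows += [
            "ttsPad + embTable[codecThinkId]",
            "ttsPad + embTable[codecThinkBosId]",
            f"ttsPad + embTable[{language_id}]",
            "ttsPad + embTable[codecThinkEosId]",
        ]
    else:
        # 8-row Base layout: nothink, think-BOS, think-EOS (no language row)
        rows += [
            "ttsPad + embTable[codecNothinkId]",
            "ttsPad + embTable[codecThinkBosId]",
            "ttsPad + embTable[codecThinkEosId]",
        ]
    # Speaker row and BOS/pad row close both layouts
    rows += [
        f"ttsPad + embTable[{speaker_id}]",
        "ttsBos + embTable[codecPadId]",
    ]
    return rows


base = build_prefix_rows(speaker_id=3066)
custom = build_prefix_rows(speaker_id=3066, language_id=2050)
print(len(base), len(custom))  # 8 9
```

Note that the 9-row layout also swaps `codecNothinkId` for `codecThinkId` in row 3, so it is not just the 8-row layout with one row inserted.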

Additionally, `codec_language_id` is not propagated from the HuggingFace `talker_config` during export (`tensorrt_edgellm/scripts/export.py` — `_patch_tts_config` copies `codec_think_id` but not `codec_language_id`).
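
A minimal sketch of the missing propagation step, assuming the export copies a fixed list of keys from `talker_config` into the engine config. `copy_talker_keys` and `TALKER_KEYS` are hypothetical names; the actual body of `_patch_tts_config` is not reproduced here and may be structured differently. The `codec_think_id` value is a placeholder, and only the `english` entry of the language map is shown.

```python
# Hypothetical key list; the point is that "codec_language_id" must be in it.
TALKER_KEYS = ("codec_think_id", "codec_language_id")


def copy_talker_keys(talker_config: dict, engine_config: dict,
                     keys=TALKER_KEYS) -> dict:
    """Copy selected talker_config entries into the engine config."""
    for key in keys:
        if key in talker_config:  # skip keys a Base checkpoint may not define
            engine_config[key] = talker_config[key]
    return engine_config


talker_config = {
    "codec_think_id": 0,                     # placeholder value
    "codec_language_id": {"english": 2050},  # one entry of the full map
}
engine_config = copy_talker_keys(talker_config, {})
print(sorted(engine_config))  # ['codec_language_id', 'codec_think_id']
```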

## Fix applied (v0.8.0)

We applied the equivalent of PR #98 adapted for v0.8.0. Files changed:

1. **`cpp/kernels/talkerMLPKernels/talkerMLPKernels.{cu,h}`** — Added `languageId`, `codecThinkId`, and `prefixRows` parameters. When `languageId >= 0`, emits the 9-row prefix; otherwise preserves the 8-row layout.

2. **`cpp/runtime/qwen3OmniTTSRuntime.{cpp,h}`** — Added `mLanguageIdMap` and `mActiveLanguageId`. Parses `codec_language_id` map and `codec_think_id` from `config.json`. Auto-selects `"english"` as default when available. Threads `languageId` through `projectToTalkerInput` and `buildTalkerPrefillFromSegments`.

3. **`examples/omni/qwen3_tts_inference.cpp`** — Forwards `"language"` field from input JSON.

4. **`tensorrt_edgellm/scripts/export.py`** — Added `"codec_language_id"` to the list of keys copied from `talker_config` in `_patch_tts_config`.

5. **Engine `config.json`** — Manually added the 12-language `codec_language_id` map from the HuggingFace checkpoint's `talker_config`.
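
The language selection in step 2 can be sketched as follows. This is Python for illustration, not the C++ runtime code, and `resolve_language_id` is a hypothetical name. Three behaviours are assumptions rather than confirmed details of the fix: lowercasing the requested name, falling back to the default when the requested language is unknown, and returning `-1` (which selects the 8-row layout) when no map or no `english` entry exists.

```python
from typing import Optional


def resolve_language_id(config: dict, requested: Optional[str] = None) -> int:
    """Pick a codec language ID from the engine config.

    Hypothetical helper; returns -1 when no language can be resolved.
    """
    language_map = config.get("codec_language_id") or {}
    if requested is not None:
        key = requested.lower()  # assumption: map keys are lowercase names
        if key in language_map:
            return language_map[key]
    # Default to "english" when available, otherwise signal "no language"
    return language_map.get("english", -1)


config = {"codec_language_id": {"english": 2050}}
print(resolve_language_id(config))             # 2050
print(resolve_language_id({}, "english"))      # -1
```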

## Validation

After the fix:

```
projectToTalkerInput: seqLen=21, N=13, prefixRows=9, outputSeqLen=24, speakerId=3066, languageId=2050
Batch 0: 47 audio frames (exit: EOS)
```

- Output: 3.76s of clear, intelligible English speech ("Hello, I am Chip. It is nice to meet you.")
- Validated by human listening on Jetson Orin Nano via reSpeaker XVF3800 USB audio
- Speaker: `serena` (id=3066), Language: `english` (id=2050)
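
As a sanity check, the frame count and the reported duration agree if the codec emits 12.5 frames per second. That rate is an assumption based on the "12Hz" in the model name; it is not read from the log.

```python
FRAME_RATE_HZ = 12.5  # assumption: codec frames per second


def frames_to_seconds(num_frames: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Convert a codec frame count to audio duration in seconds."""
    return num_frames / frame_rate_hz


print(f"{frames_to_seconds(47):.2f} s")  # 3.76 s
```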

## Additional note on message format

The official TTS docs show messages with `"role": "assistant"` only (not `"user"`). Using `"role": "user"` produces long silence/hum even with the language fix applied. This is consistent with Qwen3-TTS being a pure TTS model, not conversational.

## Request

PR #98 addresses this exact issue but targets v0.7.x and currently has merge conflicts (`dirty` status). Could the maintainers either:
1. Rebase/merge PR #98, or
2. Apply the equivalent fix to v0.8.0 `main`?

In our testing, this is the only bug preventing the officially supported CustomVoice models from producing correct output.

## References

- #87 — Original bug report (Issue B: language conditioning)
- #98 — Community PR with the fix (v0.7.x, merge-dirty)
- [Qwen3-TTS CustomVoice config.json](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/blob/main/config.json) — source of `codec_language_id` map
