feat(q3tts): add qwen3-tts llama-server runtime by muggle-stack · Pull Request #14 · spacemit-com/llama.cpp

muggle-stack · 2026-07-02T12:03:00Z

Summary

Add Qwen3-TTS runtime tooling under tools/speech/backends/qwen3_tts.
Wire an OpenAI-compatible /v1/audio/speech path through the common speech backend layout.
Register Qwen3-TTS side tensors and SpacemiT talker/CP runtime kernel paths for Q8/Q4 execution.
Add default 24 kHz audio post-processing and guard against no-EOS truncated segment responses.

Validation

git diff --check
K3 native build: cmake --build build-q3tts-spacemit --target llama-server q3tts-runner talker_driver.headmain -j8
K3 speech smoke: /v1/audio/speech returned 24 kHz PCM WAV with X-Speech-Backend: qwen3-tts
K3 regression: Chinese, English, and mixed Chinese/English short/long text cases generated successfully
No-EOS guard check: forcing old long-segment behavior returns HTTP 500 instead of a partial truncated WAV
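The no-EOS guard can be pictured as a validation pass over per-segment results before any audio is returned. The following is a minimal sketch under stated assumptions, not the PR's implementation: `segment_result`, `reached_eos`, and `collect_audio_or_fail` are hypothetical names, and the runner's real data layout is not visible on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-segment result (assumed layout, for illustration only).
struct segment_result {
    std::vector<int16_t> pcm;   // decoded 16-bit samples for this segment
    bool reached_eos;           // false if generation hit a length cap before EOS
};

// Returns an HTTP-style status: 200 with all segments concatenated into `merged`,
// or 500 with `merged` left empty if any segment is truncated or missing audio.
inline int collect_audio_or_fail(const std::vector<segment_result> & segments,
                                 std::vector<int16_t> & merged) {
    merged.clear();

    // Validate every segment first so a failure never leaves partial audio behind.
    std::size_t total = 0;
    for (const auto & seg : segments) {
        if (!seg.reached_eos || seg.pcm.empty()) {
            return 500;
        }
        total += seg.pcm.size();
    }
    if (total == 0) {
        return 500;   // no segments at all
    }

    merged.reserve(total);
    for (const auto & seg : segments) {
        merged.insert(merged.end(), seg.pcm.begin(), seg.pcm.end());
    }
    assert(merged.size() == total);
    return 200;
}
```

Validating before merging means a handler built this way either returns the complete utterance or an error, which matches the behavior described in the check above.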

Regression Results

zh_short     audio 3.760s   wall 3.487s   RTF 0.93   segments 1
zh_long      audio 17.520s  wall 14.949s  RTF 0.85   segments 3
en_short     audio 3.920s   wall 3.611s   RTF 0.92   segments 1
en_long      audio 16.140s  wall 14.300s  RTF 0.89   segments 3
mixed_short  audio 6.720s   wall 6.095s   RTF 0.91   segments 1
mixed_long   audio 20.000s  wall 18.162s  RTF 0.91   segments 4
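The RTF column is consistent with wall-clock time divided by audio duration, so values below 1.0 mean synthesis ran faster than playback. A small sketch of that arithmetic follows; the helper names `real_time_factor` and `round2` are illustrative and do not come from the PR.

```cpp
#include <cassert>
#include <cmath>

// Real-time factor: wall-clock synthesis time divided by the duration of the audio produced.
inline double real_time_factor(double wall_seconds, double audio_seconds) {
    assert(audio_seconds > 0.0);
    return wall_seconds / audio_seconds;
}

// Rounds to two decimals, the precision used in the table above.
inline double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, `round2(real_time_factor(14.949, 17.520))` reproduces the 0.85 reported for zh_long.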

- Add Qwen3-TTS runtime tools and OpenAI-compatible speech endpoint wiring.
- Register Qwen3-TTS side tensors and safe mixed-language stdin splitting.
- Add SpacemiT talker and CP runtime kernel paths for Q8/Q4 execution.
- Clean 24 kHz audio output with default runtime post-processing.
- Move Qwen3-TTS under the common speech backend layout.

Co-authored-by: codex <codex@openai.com>

Copilot

Pull request overview

This PR adds an initial speech synthesis backend integration for llama-server, exposing an OpenAI-compatible /v1/audio/speech endpoint backed by a new Qwen3-TTS runtime/tooling layout under tools/speech/. It also registers Qwen3-TTS side tensors and makes a few performance and graph-reuse changes needed to run the runtime efficiently, especially on SpacemiT/RISC-V targets.

Changes:

Add tools/speech/ + qwen3_tts backend with runner/tools, runtime packaging, and reference prompt conversion utilities.
Wire /v1/audio/speech (and /audio/speech) into llama-server, with a server_speech_service abstraction and a Qwen3-TTS backend implementation.
Add Qwen3 model-side tensor registrations + embed-only mode and integrate SpacemiT CPU kernel/perf toggles used by the runtime.
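The split between a service abstraction and a concrete backend can be sketched as below. This is an illustrative outline only: `speech_backend`, `fake_tts_backend`, `speech_result`, and `speech_service` are hypothetical names and signatures, and the PR's actual `server_speech_service` interface is not shown on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backend interface: the service layer depends only on this.
class speech_backend {
public:
    virtual ~speech_backend() = default;
    virtual std::string name() const = 0;
    // Returns encoded audio bytes, or an empty vector on failure.
    virtual std::vector<uint8_t> synthesize(const std::string & text) = 0;
};

// Stand-in backend used only for illustration; it emits one byte per input character.
class fake_tts_backend : public speech_backend {
public:
    std::string name() const override { return "fake-tts"; }
    std::vector<uint8_t> synthesize(const std::string & text) override {
        return std::vector<uint8_t>(text.begin(), text.end());
    }
};

struct speech_result {
    int status;                 // HTTP-style status code
    std::string backend_name;   // value a server could report in X-Speech-Backend
    std::vector<uint8_t> audio;
};

// Hypothetical service wrapper: validates input, then delegates to the selected backend.
class speech_service {
public:
    explicit speech_service(std::unique_ptr<speech_backend> b) : backend(std::move(b)) {
        assert(backend != nullptr);
    }

    speech_result handle(const std::string & input) {
        if (input.empty()) {
            return { 400, backend->name(), {} };
        }
        std::vector<uint8_t> audio = backend->synthesize(input);
        const int status = audio.empty() ? 500 : 200;
        return { status, backend->name(), std::move(audio) };
    }

private:
    std::unique_ptr<speech_backend> backend;
};
```

Keeping the route handler dependent on the abstract interface is what would let additional backends be added under the common layout without touching the server routes.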

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 8 comments.


File	Description
tools/speech/README.md	Documents speech backend layout and current backends
tools/speech/CMakeLists.txt	Adds speech backend subdir + install target
tools/speech/backends/qwen3_tts/tools/q3tts_run_main.cpp	CLI entrypoint for q3tts runner
tools/speech/backends/qwen3_tts/tools/q3tts_ref_to_bin.cpp	Reference WAV/text to speaker/prompt bin converter
tools/speech/backends/qwen3_tts/tools/q3tts_cp_kernel_bench.cpp	CP kernel microbenchmark for SpacemiT/RVV
tools/speech/backends/qwen3_tts/src/talker_driver.c	In-process talker+CP driver emitting codec frames
tools/speech/backends/qwen3_tts/src/kernels/heads_pool.h	RVV GEMV pool for CP lm-heads
tools/speech/backends/qwen3_tts/README.md	Backend build/run documentation
tools/speech/backends/qwen3_tts/include/qwen3_tts/qwen3_tts_runtime.h	Runtime public header (CLI)
tools/speech/backends/qwen3_tts/include/qwen3_tts/q3tts_codec_ort.h	ONNX Runtime codec decoder pool
tools/speech/backends/qwen3_tts/include/qwen3_tts/q3tts_audio_sdk.h	Optional ALSA segment playback helper
tools/speech/backends/qwen3_tts/CMakeLists.txt	Builds/installs q3tts runtime + tools
tools/speech/backends/qwen3_tts/cmake/talker_driver.qwen3tts-k3.in	Script wrapper for talker_driver defaults
tools/speech/backends/qwen3_tts/cmake/q3tts-run.in	Script wrapper for end-to-end runner
tools/server/server.cpp	Registers speech routes and proxy plumbing
tools/server/server-speech.h	Speech service API surface for server
tools/server/server-speech.cpp	Backend selection + service wrapper implementation
tools/server/server-speech-qwen3-tts.h	Qwen3-TTS speech backend interface
tools/server/server-speech-qwen3-tts.cpp	Qwen3-TTS backend: runner process mgmt + WAV merge
tools/server/server-speech-backend.h	Abstract speech backend interface
tools/server/server-context.h	Adds `post_speech_oai` route slot
tools/server/server-context.cpp	Implements `/audio/speech` handler and backend init
tools/server/CMakeLists.txt	Adds speech sources to server-context target
tools/CMakeLists.txt	Adds `tools/speech` subdir behind build option
src/models/qwen3.cpp	Adds Q3TTS tensors, SWIGLU gate_up support, embed-only mode
src/llama-quant.cpp	Prevents quantizing `q3tts.*` tensors
src/llama-model.h	Adds `ffn_gate_up` tensor pointer in layer struct
src/llama-model.cpp	Avoids layer buft assignment for `TENSOR_SKIP` tensors
src/llama-context.h	Adds graph reuse + threadpool caching fields
src/llama-context.cpp	Adds ctx pad env, 2-way graph cache, threadpool/n_threads caching
src/llama-arch.h	Adds tensor IDs for gate_up and Q3TTS tensors
src/llama-arch.cpp	Maps new tensor IDs to names/infos
ggml/src/ggml-cpu/spacemit/ime2_kernels.cpp	Adds env toggle + new m1 n64 + m2 i8i4_hp kernels
ggml/src/ggml-cpu/spacemit/ime.cpp	Adds env toggles and SWIGLU-down fusion path for SpacemiT
ggml/src/ggml-cpu/spacemit/ime_env.cpp	Fixes env-string empty checks and logging call
ggml/src/ggml-cpu/ggml-cpu.c	Adds SWIGLU-down fusion support and fusion skip behavior
ggml/CMakeLists.txt	Adds `GGML_RV_ZBA` option
common/common.h	Enables media_backend/smt_config_dir fields for speech builds
common/arg.cpp	Enables media backend args under speech builds; expands examples
CMakeLists.txt	Adds LLAMA_BUILD_SPEECH option + LLAMA_BUILD_Q3TTS alias/define
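The "WAV merge" step listed for server-speech-qwen3-tts.cpp amounts to concatenating segment audio and writing one header for the result. The sketch below shows one way to do that with the canonical 44-byte RIFF/WAVE header. It assumes 16-bit mono PCM at 24 kHz; only the sample rate is confirmed by the PR, and `put_le`, `put_tag`, and `merge_to_wav` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the low `bytes` bytes of `value` to `out` in little-endian order.
inline void put_le(std::vector<uint8_t> & out, uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Appends a four-character chunk tag such as "RIFF" or "fmt ".
inline void put_tag(std::vector<uint8_t> & out, const char * tag) {
    out.insert(out.end(), tag, tag + 4);
}

// Concatenates 16-bit mono PCM segments and wraps them in a 44-byte RIFF/WAVE header.
inline std::vector<uint8_t> merge_to_wav(const std::vector<std::vector<int16_t>> & segments,
                                         uint32_t sample_rate = 24000) {
    assert(sample_rate > 0);
    uint32_t n_samples = 0;
    for (const auto & s : segments) {
        n_samples += static_cast<uint32_t>(s.size());
    }
    const uint32_t channels    = 1;                    // assumption: mono
    const uint32_t bits        = 16;                   // assumption: 16-bit samples
    const uint32_t block_align = channels * bits / 8;
    const uint32_t data_bytes  = n_samples * block_align;

    std::vector<uint8_t> out;
    out.reserve(44 + data_bytes);
    put_tag(out, "RIFF"); put_le(out, 36 + data_bytes, 4); put_tag(out, "WAVE");
    put_tag(out, "fmt "); put_le(out, 16, 4);          // fmt chunk size for PCM
    put_le(out, 1, 2);                                 // audio format 1 = PCM
    put_le(out, channels, 2);
    put_le(out, sample_rate, 4);
    put_le(out, sample_rate * block_align, 4);         // byte rate
    put_le(out, block_align, 2);
    put_le(out, bits, 2);
    put_tag(out, "data"); put_le(out, data_bytes, 4);

    for (const auto & s : segments) {
        for (int16_t v : s) {
            put_le(out, static_cast<uint16_t>(v), 2);
        }
    }
    return out;
}
```

Writing the header once from the total sample count avoids the alternative of stripping and re-patching the header of each per-segment WAV.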


+add_subdirectory(backends/qwen3_tts)
+
+add_custom_target(speech-install
+    DEPENDS q3tts-install
+)


+        std::vector<char *> argv;
+        argv.reserve(args.size() + 1);
+        for (auto & arg : args) {
+            argv.push_back(arg.data());
+        }
+        argv.push_back(nullptr);
+        execv(argv[0], argv.data());
+        _exit(127);
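For context, the hunk above is the child side of a fork/exec launch. A self-contained sketch of the full pattern on POSIX follows. It is a one-shot illustration, not the PR's runner management (which presumably keeps the child alive and talks to it over pipes); `run_and_wait` is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Spawns args[0] with the given arguments and returns the child's exit code,
// or -1 if fork/waitpid fails or the child did not exit normally.
inline int run_and_wait(std::vector<std::string> args) {
    assert(!args.empty());

    // Build argv before fork() so the child does not allocate memory,
    // which is unsafe after fork in a multi-threaded parent.
    std::vector<char *> argv;
    argv.reserve(args.size() + 1);
    for (auto & arg : args) {
        argv.push_back(arg.data());   // non-const data() requires C++17
    }
    argv.push_back(nullptr);          // execv expects a null-terminated array

    const pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv.data());
        _exit(127);                   // reached only if execv failed
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Exiting with 127 via `_exit` follows the shell convention for "command not found" and skips the parent's atexit handlers and stdio flushing in the failed child.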


+void qwen3_tts_backend::ensure_started() {
+    std::lock_guard<std::mutex> lock(start_mutex);
+    if (child_pid > 0 && !child_closed) {
+        return;
+    }
+    start_process();
+}
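The lazy-start pattern in this hunk can be exercised in isolation with a stand-in class. In the sketch below, `lazy_runner`, `mark_closed`, and `starts` are hypothetical, and `start_process` only records a fake pid instead of spawning anything; the member names `start_mutex`, `child_pid`, and `child_closed` mirror the hunk.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for a backend that starts its child process on first use.
class lazy_runner {
public:
    // Starts the child if none is running; safe to call from several threads.
    void ensure_started() {
        std::lock_guard<std::mutex> lock(start_mutex);
        if (child_pid > 0 && !child_closed) {
            return;                   // already running, nothing to do
        }
        start_process();
        assert(child_pid > 0);        // postcondition: a child is now recorded
    }

    // Records that the child exited, so the next ensure_started() restarts it.
    void mark_closed() {
        std::lock_guard<std::mutex> lock(start_mutex);
        child_closed = true;
    }

    // Number of times the child was (re)started; read only when no other thread is active.
    int starts() const { return start_count; }

private:
    // Fake launch: a real backend would fork/exec here.
    void start_process() {
        child_pid    = ++start_count;
        child_closed = false;
    }

    std::mutex start_mutex;
    int  child_pid    = -1;
    bool child_closed = true;
    int  start_count  = 0;
};
```

Holding the lock across both the check and the start is what prevents two concurrent requests from each launching a child.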


+if(LLAMA_BUILD_SPEECH)
+    target_sources(${TARGET} PRIVATE
+        server-speech-backend.h
+        server-speech.cpp
+        server-speech.h
+        server-speech-qwen3-tts.cpp
+        server-speech-qwen3-tts.h
+    )
+endif()


+#include <chrono>
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <pthread.h>
+#include <vector>


+add_executable(talker_driver.headmain ${CMAKE_CURRENT_SOURCE_DIR}/src/talker_driver.c)
+target_compile_definitions(talker_driver.headmain PRIVATE _GNU_SOURCE)
+target_compile_options(talker_driver.headmain PRIVATE
+    $<$<C_COMPILER_ID:GNU>:-O2>
+    $<$<C_COMPILER_ID:GNU>:-fno-tree-vectorize>
+)
+if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(riscv)")
+    target_compile_options(talker_driver.headmain PRIVATE
+        $<$<C_COMPILER_ID:GNU>:-march=rv64gcv_zfh_zvfh_zba_zicbop_zihintpause>
+        $<$<C_COMPILER_ID:GNU>:-mabi=lp64d>
+    )
+endif()
+target_include_directories(talker_driver.headmain PRIVATE
+    ${Q3TTS_INCLUDE_DIRS}
+    ${CMAKE_CURRENT_SOURCE_DIR}/src/kernels
+)
+target_link_libraries(talker_driver.headmain PRIVATE
+    llama
+    ggml-cpu
+    ggml-base
+    ggml
+    pthread
+    m
+)
+


+add_executable(q3tts-cp-kernel-bench ${CMAKE_CURRENT_SOURCE_DIR}/tools/q3tts_cp_kernel_bench.cpp)
+target_include_directories(q3tts-cp-kernel-bench PRIVATE
+    ${CMAKE_SOURCE_DIR}/ggml/src
+    ${CMAKE_SOURCE_DIR}/ggml/src/ggml-cpu
+)
+target_link_libraries(q3tts-cp-kernel-bench PRIVATE
+    ggml-cpu
+    ggml-base
+    ggml
+    pthread
+    m
+)
+target_compile_features(q3tts-cp-kernel-bench PRIVATE cxx_std_17)
+


+configure_file(${CMAKE_CURRENT_SOURCE_DIR}/cmake/q3tts-run.in
+               ${Q3TTS_SCRIPT_OUTPUT_DIR}/q3tts-run @ONLY)
+configure_file(${CMAKE_CURRENT_SOURCE_DIR}/cmake/talker_driver.qwen3tts-k3.in
+               ${Q3TTS_SCRIPT_OUTPUT_DIR}/talker_driver.qwen3tts-k3 @ONLY)
+
+execute_process(COMMAND chmod +x
+    ${Q3TTS_SCRIPT_OUTPUT_DIR}/q3tts-run
+    ${Q3TTS_SCRIPT_OUTPUT_DIR}/talker_driver.qwen3tts-k3
+)
+


muggle-stack requested a review from alex-spacemit as a code owner July 2, 2026 12:03

github-actions bot added the documentation, build, server, ggml, and model labels Jul 2, 2026

alex-spacemit requested a review from Copilot July 2, 2026 13:56

Copilot started reviewing on behalf of alex-spacemit July 2, 2026 13:56

Copilot AI reviewed Jul 2, 2026

muggle-stack wants to merge 1 commit into spacemit-com:spacemit-mtmd from muggle-stack:feat/q3tts-llama-server-runtime


3 participants
