feat(q3tts): add qwen3-tts llama-server runtime #14
Open
muggle-stack wants to merge 1 commit into
- Add Qwen3-TTS runtime tools and OpenAI-compatible speech endpoint wiring.
- Register Qwen3-TTS side tensors and safe mixed-language stdin splitting.
- Add SpacemiT talker and CP runtime kernel paths for Q8/Q4 execution.
- Clean 24 kHz audio output with default runtime post-processing.
- Move Qwen3-TTS under the common speech backend layout.

Co-authored-by: codex <codex@openai.com>
Pull request overview
This PR adds an initial speech synthesis backend integration for llama-server, exposing an OpenAI-compatible /v1/audio/speech endpoint backed by a new Qwen3-TTS runtime/tooling layout under tools/speech/. It also introduces Qwen3-side tensors and a few performance/graph-reuse related changes needed to support the runtime efficiently (especially on SpacemiT/RISC-V targets).
Changes:
- Add `tools/speech/` + `qwen3_tts` backend with runner/tools, runtime packaging, and reference prompt conversion utilities.
- Wire `/v1/audio/speech` (and `/audio/speech`) into `llama-server`, with a `server_speech_service` abstraction and a Qwen3-TTS backend implementation.
- Add Qwen3 model-side tensor registrations + embed-only mode and integrate SpacemiT CPU kernel/perf toggles used by the runtime.
Reviewed changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tools/speech/README.md | Documents speech backend layout and current backends |
| tools/speech/CMakeLists.txt | Adds speech backend subdir + install target |
| tools/speech/backends/qwen3_tts/tools/q3tts_run_main.cpp | CLI entrypoint for q3tts runner |
| tools/speech/backends/qwen3_tts/tools/q3tts_ref_to_bin.cpp | Reference WAV/text to speaker/prompt bin converter |
| tools/speech/backends/qwen3_tts/tools/q3tts_cp_kernel_bench.cpp | CP kernel microbenchmark for SpacemiT/RVV |
| tools/speech/backends/qwen3_tts/src/talker_driver.c | In-process talker+CP driver emitting codec frames |
| tools/speech/backends/qwen3_tts/src/kernels/heads_pool.h | RVV GEMV pool for CP lm-heads |
| tools/speech/backends/qwen3_tts/README.md | Backend build/run documentation |
| tools/speech/backends/qwen3_tts/include/qwen3_tts/qwen3_tts_runtime.h | Runtime public header (CLI) |
| tools/speech/backends/qwen3_tts/include/qwen3_tts/q3tts_codec_ort.h | ONNX Runtime codec decoder pool |
| tools/speech/backends/qwen3_tts/include/qwen3_tts/q3tts_audio_sdk.h | Optional ALSA segment playback helper |
| tools/speech/backends/qwen3_tts/CMakeLists.txt | Builds/installs q3tts runtime + tools |
| tools/speech/backends/qwen3_tts/cmake/talker_driver.qwen3tts-k3.in | Script wrapper for talker_driver defaults |
| tools/speech/backends/qwen3_tts/cmake/q3tts-run.in | Script wrapper for end-to-end runner |
| tools/server/server.cpp | Registers speech routes and proxy plumbing |
| tools/server/server-speech.h | Speech service API surface for server |
| tools/server/server-speech.cpp | Backend selection + service wrapper implementation |
| tools/server/server-speech-qwen3-tts.h | Qwen3-TTS speech backend interface |
| tools/server/server-speech-qwen3-tts.cpp | Qwen3-TTS backend: runner process mgmt + WAV merge |
| tools/server/server-speech-backend.h | Abstract speech backend interface |
| tools/server/server-context.h | Adds post_speech_oai route slot |
| tools/server/server-context.cpp | Implements /audio/speech handler and backend init |
| tools/server/CMakeLists.txt | Adds speech sources to server-context target |
| tools/CMakeLists.txt | Adds tools/speech subdir behind build option |
| src/models/qwen3.cpp | Adds Q3TTS tensors, SWIGLU gate_up support, embed-only mode |
| src/llama-quant.cpp | Prevents quantizing q3tts.* tensors |
| src/llama-model.h | Adds ffn_gate_up tensor pointer in layer struct |
| src/llama-model.cpp | Avoids layer buft assignment for TENSOR_SKIP tensors |
| src/llama-context.h | Adds graph reuse + threadpool caching fields |
| src/llama-context.cpp | Adds ctx pad env, 2-way graph cache, threadpool/n_threads caching |
| src/llama-arch.h | Adds tensor IDs for gate_up and Q3TTS tensors |
| src/llama-arch.cpp | Maps new tensor IDs to names/infos |
| ggml/src/ggml-cpu/spacemit/ime2_kernels.cpp | Adds env toggle + new m1 n64 + m2 i8i4_hp kernels |
| ggml/src/ggml-cpu/spacemit/ime.cpp | Adds env toggles and SWIGLU-down fusion path for SpacemiT |
| ggml/src/ggml-cpu/spacemit/ime_env.cpp | Fixes env-string empty checks and logging call |
| ggml/src/ggml-cpu/ggml-cpu.c | Adds SWIGLU-down fusion support and fusion skip behavior |
| ggml/CMakeLists.txt | Adds GGML_RV_ZBA option |
| common/common.h | Enables media_backend/smt_config_dir fields for speech builds |
| common/arg.cpp | Enables media backend args under speech builds; expands examples |
| CMakeLists.txt | Adds LLAMA_BUILD_SPEECH option + LLAMA_BUILD_Q3TTS alias/define |
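The table lists `tools/server/server-speech-qwen3-tts.cpp` as handling "WAV merge", and the PR description reports 24 kHz PCM WAV output. The PR's actual merge code is not shown here, so the following is only a rough sketch of one way such a merge could work. It assumes 16-bit mono PCM and a canonical 44-byte RIFF header on every segment; `make_wav` and `merge_wavs` are hypothetical names, not functions from this PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append integers in little-endian byte order, as the RIFF format requires.
static void put_u32(std::string & out, uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}
static void put_u16(std::string & out, uint16_t v) {
    for (int i = 0; i < 2; ++i) out.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Wrap raw little-endian PCM bytes in a 44-byte WAV header.
// Assumption: mono, 16-bit samples (not confirmed by the PR).
static std::string make_wav(const std::string & pcm, uint32_t sample_rate = 24000) {
    const uint16_t channels    = 1;
    const uint16_t bits        = 16;
    const uint16_t block_align = channels * bits / 8;
    std::string out;
    out += "RIFF"; put_u32(out, 36 + static_cast<uint32_t>(pcm.size()));
    out += "WAVE";
    out += "fmt "; put_u32(out, 16);          // size of the fmt chunk
    put_u16(out, 1);                          // audio format 1 = PCM
    put_u16(out, channels);
    put_u32(out, sample_rate);
    put_u32(out, sample_rate * block_align);  // byte rate
    put_u16(out, block_align);
    put_u16(out, bits);
    out += "data"; put_u32(out, static_cast<uint32_t>(pcm.size()));
    out += pcm;
    return out;
}

// Merge segments by dropping each 44-byte header and re-wrapping the payloads.
// Assumption: every segment uses the same format and the canonical header.
static std::string merge_wavs(const std::vector<std::string> & segments,
                              uint32_t sample_rate = 24000) {
    std::string pcm;
    for (const auto & seg : segments) {
        if (seg.size() > 44) pcm.append(seg, 44, std::string::npos);
    }
    return make_wav(pcm, sample_rate);
}
```

A production implementation would parse the chunk list rather than assume a fixed header length, since WAV files may carry extra chunks before `data`.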
Comment on lines +1 to +5

    add_subdirectory(backends/qwen3_tts)

    add_custom_target(speech-install
        DEPENDS q3tts-install
    )
Comment on lines +529 to +536

    std::vector<char *> argv;
    argv.reserve(args.size() + 1);
    for (auto & arg : args) {
        argv.push_back(arg.data());
    }
    argv.push_back(nullptr);
    execv(argv[0], argv.data());
    _exit(127);
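The snippet above follows the usual POSIX pattern of building a NULL-terminated `argv` and calling `execv` in a forked child. A minimal, self-contained sketch of that pattern is given below for context. It assumes a POSIX system and C++17 (for the non-const `std::string::data()`); `spawn_and_wait` is a hypothetical helper, not a function from this PR, and the real backend keeps the child running rather than waiting for it.

```cpp
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from the PR): run args[0] with the given arguments
// and return the child's exit code, or -1 if fork/waitpid fails or the child
// was terminated by a signal.
static int spawn_and_wait(std::vector<std::string> args) {
    if (args.empty()) {
        return -1;
    }
    // Build argv before fork() so the child does no allocation of its own.
    std::vector<char *> argv;
    argv.reserve(args.size() + 1);
    for (auto & arg : args) {
        argv.push_back(arg.data()); // null-terminated, writable buffer
    }
    argv.push_back(nullptr);        // execv requires a NULL sentinel

    const pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv.data());
        _exit(127);                 // only reached if execv itself failed
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this helper, `spawn_and_wait({"/bin/sh", "-c", "exit 3"})` reports the shell's exit code, while a path that cannot be executed surfaces as 127, the same value the snippet uses.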
Comment on lines +473 to +479

    void qwen3_tts_backend::ensure_started() {
        std::lock_guard<std::mutex> lock(start_mutex);
        if (child_pid > 0 && !child_closed) {
            return;
        }
        start_process();
    }
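The function above is a mutex-guarded lazy start: the first caller launches the runner, later callers return early while the child is alive, and a closed child triggers a restart. The sketch below isolates that pattern so it can be exercised without a real process. `lazy_runner`, `mark_closed`, and `starts` are hypothetical names, and `start_process` here only records that it ran; it is not the PR's implementation.

```cpp
#include <mutex>
#include <sys/types.h>

// Hypothetical stand-in for the backend in the PR; only the locking pattern
// is illustrated.
class lazy_runner {
public:
    void ensure_started() {
        std::lock_guard<std::mutex> lock(start_mutex);
        if (child_pid > 0 && !child_closed) {
            return; // a live child already exists
        }
        start_process();
    }

    // Would be called once the child is observed to have exited.
    void mark_closed() {
        std::lock_guard<std::mutex> lock(start_mutex);
        child_closed = true;
    }

    int starts() {
        std::lock_guard<std::mutex> lock(start_mutex);
        return start_count;
    }

private:
    // Caller must hold start_mutex.
    void start_process() {
        child_pid    = 1000 + start_count; // placeholder for fork()'s return value
        child_closed = false;
        ++start_count;
    }

    std::mutex start_mutex;
    pid_t      child_pid    = -1;
    bool       child_closed = true;
    int        start_count  = 0;
};
```

Note that the early-return check is only as reliable as `child_closed`: if nothing updates that flag when the child dies, callers will keep treating a dead process as running.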
Comment on lines +30 to +38

    if(LLAMA_BUILD_SPEECH)
        target_sources(${TARGET} PRIVATE
            server-speech-backend.h
            server-speech.cpp
            server-speech.h
            server-speech-qwen3-tts.cpp
            server-speech-qwen3-tts.h
        )
    endif()
Comment on lines +1 to +7

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <pthread.h>
    #include <vector>
Comment on lines +147 to +171

    add_executable(talker_driver.headmain ${CMAKE_CURRENT_SOURCE_DIR}/src/talker_driver.c)
    target_compile_definitions(talker_driver.headmain PRIVATE _GNU_SOURCE)
    target_compile_options(talker_driver.headmain PRIVATE
        $<$<C_COMPILER_ID:GNU>:-O2>
        $<$<C_COMPILER_ID:GNU>:-fno-tree-vectorize>
    )
    if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(riscv)")
        target_compile_options(talker_driver.headmain PRIVATE
            $<$<C_COMPILER_ID:GNU>:-march=rv64gcv_zfh_zvfh_zba_zicbop_zihintpause>
            $<$<C_COMPILER_ID:GNU>:-mabi=lp64d>
        )
    endif()
    target_include_directories(talker_driver.headmain PRIVATE
        ${Q3TTS_INCLUDE_DIRS}
        ${CMAKE_CURRENT_SOURCE_DIR}/src/kernels
    )
    target_link_libraries(talker_driver.headmain PRIVATE
        llama
        ggml-cpu
        ggml-base
        ggml
        pthread
        m
    )
Comment on lines +172 to +185

    add_executable(q3tts-cp-kernel-bench ${CMAKE_CURRENT_SOURCE_DIR}/tools/q3tts_cp_kernel_bench.cpp)
    target_include_directories(q3tts-cp-kernel-bench PRIVATE
        ${CMAKE_SOURCE_DIR}/ggml/src
        ${CMAKE_SOURCE_DIR}/ggml/src/ggml-cpu
    )
    target_link_libraries(q3tts-cp-kernel-bench PRIVATE
        ggml-cpu
        ggml-base
        ggml
        pthread
        m
    )
    target_compile_features(q3tts-cp-kernel-bench PRIVATE cxx_std_17)
Comment on lines +186 to +195

    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/cmake/q3tts-run.in
        ${Q3TTS_SCRIPT_OUTPUT_DIR}/q3tts-run @ONLY)
    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/cmake/talker_driver.qwen3tts-k3.in
        ${Q3TTS_SCRIPT_OUTPUT_DIR}/talker_driver.qwen3tts-k3 @ONLY)

    execute_process(COMMAND chmod +x
        ${Q3TTS_SCRIPT_OUTPUT_DIR}/q3tts-run
        ${Q3TTS_SCRIPT_OUTPUT_DIR}/talker_driver.qwen3tts-k3
    )
Summary

- `tools/speech/backends/qwen3_tts`.
- `/v1/audio/speech` path through the common speech backend layout.

Validation

- `git diff --check`
- `cmake --build build-q3tts-spacemit --target llama-server q3tts-runner talker_driver.headmain -j8`
- `/v1/audio/speech` returned 24 kHz PCM WAV with `X-Speech-Backend: qwen3-tts`

Regression Results