From cf0bf250e263ebac706771f90a6602b11e474bb2 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 2 Jul 2026 21:23:02 +0000 Subject: [PATCH 01/29] Fix multi-turn tool-calling checkpoint starvation for recurrent models (Granite-4) Recurrent/hybrid models (granitehybrid, Mamba, Jamba) can only roll a slot back to a saved context checkpoint. In upstream b9859 the near-prompt-end checkpoints are gated by checkpoint_min_step (default 8192 tokens) and new checkpoints are otherwise only created at user-message boundaries. An agentic tool-calling conversation appends only assistant/tool messages after turn 1, so no new checkpoint is ever created and every turn re-prefills the whole conversation tail. Measured on a synthetic granitehybrid model (llama-server, 6-turn tool loop, ~643 new tokens/turn): prefilled tokens per turn grew 901 -> 1544 -> 2187 -> 2830 -> 3473 unpatched, i.e. quadratic total prefill. patches/0005 (upstream-submittable, server-context.cpp): - exempt near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (seq-rm type FULL or RS); SWA-only models are unaffected - never create a checkpoint at the same position as the newest one (the last-user-message checkpoint was re-created identically every turn, flooding the 32-entry checkpoint list) With the patch the same loop prefills a constant 647 tokens/turn (each turn restores the previous turn's near-end checkpoint): 5.4x less prefill at turn 6, growing with conversation length. Outputs verified byte-identical to unpatched at temperature=0. ModelParameters gains setCtxCheckpoints(int) / setCheckpointMinStep(int) (--ctx-checkpoints / --checkpoint-min-step, both LLAMA_EXAMPLE_SERVER scope, reach the embedded server through common_params_parse) so callers can tune checkpoint density/RAM from Java. +2 unit tests (144 pass), javadoc clean, spotless applied. 
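The prefill figures above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the patch: it assumes the unpatched server re-prefills everything after the turn-1 checkpoint, so each turn costs the previous turn's prefill plus the ~643 new tokens, and it takes the patched cost as the measured constant of 647 tokens. The class and method names (`PrefillGrowth`, `unpatchedPrefill`, `patchedPrefill`) are hypothetical.

```java
/** Illustrative arithmetic only; constants are the figures quoted in the commit message. */
final class PrefillGrowth {
    static final int UNPATCHED_TURN_2 = 901;    // measured prefill at turn 2, unpatched
    static final int NEW_TOKENS_PER_TURN = 643; // approximate new tokens appended per turn
    static final int PATCHED_PER_TURN = 647;    // measured constant prefill with the patch

    /** Assumed linear model: each turn re-prefills the whole tail since turn 1. */
    static int unpatchedPrefill(int turn) {
        if (turn < 2) {
            throw new IllegalArgumentException("model covers turns >= 2, got: " + turn);
        }
        return UNPATCHED_TURN_2 + NEW_TOKENS_PER_TURN * (turn - 2);
    }

    /** With the patch, each turn restores the previous turn's near-end checkpoint. */
    static int patchedPrefill(int turn) {
        return PATCHED_PER_TURN;
    }

    public static void main(String[] args) {
        long totalUnpatched = 0;
        long totalPatched = 0;
        for (int turn = 2; turn <= 6; turn++) {
            int before = unpatchedPrefill(turn);
            int after = patchedPrefill(turn);
            totalUnpatched += before;
            totalPatched += after;
            System.out.printf("turn %d: unpatched=%d patched=%d ratio=%.2f%n",
                    turn, before, after, (double) before / after);
        }
        System.out.printf("total: unpatched=%d patched=%d%n", totalUnpatched, totalPatched);
    }
}
```

Under this model the per-turn cost grows linearly when unpatched, so the total over a conversation grows quadratically, while the patched total grows linearly with the number of turns.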
Complements open upstream PRs #24035/#24899/#24891 (checkpoint invalidation/ retention); this fixes checkpoint starvation. Drop the patch once upstream lands role-boundary checkpoint placement. Co-Authored-By: Claude Fable 5 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- CLAUDE.md | 1 + ...ecurrent-near-prompt-end-checkpoints.patch | 32 +++++++++++++++++ .../llama/parameters/ModelParameters.java | 36 +++++++++++++++++++ .../ModelParametersExtendedTest.java | 12 +++++++ 4 files changed, 81 insertions(+) create mode 100644 llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch diff --git a/CLAUDE.md b/CLAUDE.md index e6da18c3..cc9cec4a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -432,6 +432,7 @@ Current patches: | `0002-server-preserve-caller-load-progress-callback.patch` | Load-progress-callback regression introduced in llama.cpp **b9789**: `server_context::load_model` (`tools/server/server-context.cpp`) now **unconditionally** installs the server's own load-progress reporter on `params_base.load_progress_callback` immediately before `common_init_from_params`, clobbering any callback the embedding caller already set. libjllama's `LoadProgressCallback` feature wires `common_params.load_progress_callback` to a JNI trampoline *before* calling `load_model`, so the bump silently killed it — `LoadProgressCallbackTest` saw zero progress updates and the abort-on-`false` path never threw. The patch guards the assignment with `if (params_base.load_progress_callback == nullptr)`, so the server installs its own reporter **only when the caller hasn't** — a caller-supplied callback survives and fires during load. Standalone `llama-server` (no caller callback, so the field is null) is unaffected. Same JNI-vs-standalone divergence class as `0001`. 
| | `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b` includes the change. | | `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b` includes the change. | +| `0005-server-recurrent-near-prompt-end-checkpoints.patch` | **Multi-turn tool-calling perf fix for recurrent/hybrid models (e.g. Granite-4)**, upstream-submittable. 
In `server_context::update_slots` (`tools/server/server-context.cpp`) the near-prompt-end context checkpoints are gated by `checkpoint_min_step` (default 8192 tokens). An agentic conversation that appends only assistant/tool messages never produces a new user-message checkpoint (`is_user_start`/`is_last_user_message` match `COMMON_CHAT_ROLE_USER` only), so after turn 1 no new checkpoint is ever created and — because recurrent state can only roll back to a checkpoint — **every turn re-prefills the whole conversation tail** (measured on a synthetic granitehybrid model: prefilled tokens grew 901 → 1544 → 2187 → 2830 → 3473 over turns 2–6). The patch (1) exempts near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (`ctx_tgt_seq_rm_type` is `FULL` or `RS` — SWA-only models are unaffected), and (2) skips creating a checkpoint whose position equals the newest one (the last-user-message checkpoint was re-created identically on every turn, flooding the 32-entry list). After the patch each turn restores the previous turn's near-end checkpoint and prefill is constant (~new-turn-sized; 647 tokens/turn in the same measurement, ≈5.4× less prefill at turn 6 and growing with conversation length). Validated output-identical (`temperature=0`) vs. unpatched. Complements — not duplicates — open upstream PRs #24035/#24899/#24891 (they fix checkpoint *invalidation/retention*; this fixes checkpoint *starvation*). Drop once upstream solves agentic checkpoint placement (e.g. a merged role-boundary checkpointing design, cf. #21885 / #22826 discussion). 
| ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`) diff --git a/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch b/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch new file mode 100644 index 00000000..bc1df408 --- /dev/null +++ b/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch @@ -0,0 +1,32 @@ +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index 39aa20b..6512fa9 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -3560,8 +3560,26 @@ private: + // do not checkpoint after mtmd chunks + do_checkpoint = do_checkpoint && !has_mtmd; + ++ // recurrent (and hybrid) models cannot partially roll back their state, so the only way to ++ // avoid re-processing an entire multi-turn conversation on the next request is a checkpoint ++ // near the end of the current prompt. without this, a conversation that appends only ++ // assistant/tool messages (agentic tool-calling) re-processes the whole tail every turn, ++ // because no new user-message checkpoint is ever created and the min-step spacing blocks ++ // the near-prompt-end ones. exempt those models' near-end checkpoints from the spacing. 
++ const bool is_ckpt_only_rollback = ++ ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL || ++ ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_RS; ++ + // no need to create checkpoints that are too close together, unless it's the last user message +- do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || is_last_user_message || n_tokens_start > slot.prompt.checkpoints.back().n_tokens + params_base.checkpoint_min_step); ++ do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || is_last_user_message || ++ (near_prompt_end && is_ckpt_only_rollback) || ++ n_tokens_start > slot.prompt.checkpoints.back().n_tokens + params_base.checkpoint_min_step); ++ ++ // never store a duplicate of the newest checkpoint (e.g. the last-user-message checkpoint ++ // would otherwise be re-created on every turn of a tool-calling loop, flooding the ++ // checkpoint list until useful entries are evicted) ++ do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || ++ slot.prompt.checkpoints.back().n_tokens != n_tokens_start); + SLT_DBG(slot, "main/do_checkpoint = %s, pos_min = %d, pos_max = %d\n", do_checkpoint ? "yes" : "no", pos_min, pos_max); + + // note: we create the checkpoint before calling llama_decode(), so the current batch is not diff --git a/llama/src/main/java/net/ladenthin/llama/parameters/ModelParameters.java b/llama/src/main/java/net/ladenthin/llama/parameters/ModelParameters.java index ce62131b..a8c6965b 100644 --- a/llama/src/main/java/net/ladenthin/llama/parameters/ModelParameters.java +++ b/llama/src/main/java/net/ladenthin/llama/parameters/ModelParameters.java @@ -1142,6 +1142,42 @@ public ModelParameters setSlotPromptSimilarity(float similarity) { return putScalar("--slot-prompt-similarity", similarity); } + /** + * Set the maximum number of context checkpoints kept per slot (default: 32; 0 disables + * checkpointing). + * + *

<p>Context checkpoints let the server roll a slot back to an earlier state instead of + * re-processing the whole prompt when a follow-up request diverges from the cached tokens. + * They are essential for models that cannot truncate their state to an arbitrary position: + * recurrent/hybrid architectures (e.g. Granite-4, Mamba, Jamba) and SWA models. Each + * checkpoint costs host memory proportional to the model's recurrent/SWA state size, so + * lower this value on memory-constrained machines or raise it for very long multi-turn + * (agentic tool-calling) sessions.

+ * + * @param ctxCheckpoints the maximum number of context checkpoints per slot + * @return this builder + */ + public ModelParameters setCtxCheckpoints(int ctxCheckpoints) { + return putScalar("--ctx-checkpoints", ctxCheckpoints); + } + + /** + * Set the minimum spacing between context checkpoints in tokens (default: 8192; 0 = no + * minimum). + * + *

<p>Smaller values create checkpoints more often, improving prompt-cache reuse for + * multi-turn conversations at the cost of more host memory (bounded by + * {@link #setCtxCheckpoints(int)}). This matters most for recurrent/hybrid models + * (e.g. Granite-4), whose state can only be rolled back to a checkpoint — with sparse + * checkpoints a follow-up request may have to re-process most of the conversation.

+ * + * @param checkpointMinStep the minimum number of tokens between two checkpoints (must not be negative) + * @return this builder + */ + public ModelParameters setCheckpointMinStep(int checkpointMinStep) { + return putScalar("--checkpoint-min-step", checkpointMinStep); + } + /** * Load LoRA adapters without applying them (apply later via POST /lora-adapters). * diff --git a/llama/src/test/java/net/ladenthin/llama/parameters/ModelParametersExtendedTest.java b/llama/src/test/java/net/ladenthin/llama/parameters/ModelParametersExtendedTest.java index bc4dc3aa..7bf7b476 100644 --- a/llama/src/test/java/net/ladenthin/llama/parameters/ModelParametersExtendedTest.java +++ b/llama/src/test/java/net/ladenthin/llama/parameters/ModelParametersExtendedTest.java @@ -891,6 +891,18 @@ public void testSetSlotPromptSimilarity() { assertThat(p.parameters.get("--slot-prompt-similarity"), is("0.8")); } + @Test + public void testSetCtxCheckpoints() { + ModelParameters p = new ModelParameters().setCtxCheckpoints(8); + assertThat(p.parameters.get("--ctx-checkpoints"), is("8")); + } + + @Test + public void testSetCheckpointMinStep() { + ModelParameters p = new ModelParameters().setCheckpointMinStep(0); + assertThat(p.parameters.get("--checkpoint-min-step"), is("0")); + } + // ------------------------------------------------------------------------- // Override KV // ------------------------------------------------------------------------- From 7ddb792f329a1c6685cdad05fc2a350519f3ff5a Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 06:36:27 +0000 Subject: [PATCH 02/29] patch 0005: extract checkpoint conditions into named booleans MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pure readability refactor of the checkpoint-starvation fix — no behavior change. 
The two compound `do_checkpoint = do_checkpoint && (empty || ...)` assignments are lifted into named locals so the final gate reads: do_checkpoint = do_checkpoint && checkpoint_well_spaced && checkpoint_not_duplicate; - checkpoint_well_spaced: the min-step spacing test with the last-user-message and near-prompt-end (checkpoint-only-rollback) exemptions - checkpoint_not_duplicate: the same-position dedup guard Each named bool keeps the leading `checkpoints.empty() ||` so the `checkpoints.back()` access stays short-circuit-guarded (identical semantics to the previous inlined `&&`-chains). Compiles clean; patch re-verified to apply and reverse-check (idempotence) against pristine b9859 via the same `git apply` path the CMake applier uses. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- ...ecurrent-near-prompt-end-checkpoints.patch | 29 ++++++++++++------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch b/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch index bc1df408..59f729ff 100644 --- a/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch +++ b/llama/patches/0005-server-recurrent-near-prompt-end-checkpoints.patch @@ -1,11 +1,13 @@ diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp -index 39aa20b..6512fa9 100644 +index 39aa20b..d3d5978 100644 --- a/tools/server/server-context.cpp +++ b/tools/server/server-context.cpp -@@ -3560,8 +3560,26 @@ private: +@@ -3560,8 +3560,32 @@ private: // do not checkpoint after mtmd chunks do_checkpoint = do_checkpoint && !has_mtmd; +- // no need to create checkpoints that are too close together, unless it's the last user message +- do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || is_last_user_message || n_tokens_start > slot.prompt.checkpoints.back().n_tokens + params_base.checkpoint_min_step); + // 
recurrent (and hybrid) models cannot partially roll back their state, so the only way to + // avoid re-processing an entire multi-turn conversation on the next request is a checkpoint + // near the end of the current prompt. without this, a conversation that appends only @@ -16,17 +18,22 @@ index 39aa20b..6512fa9 100644 + ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL || + ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_RS; + - // no need to create checkpoints that are too close together, unless it's the last user message -- do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || is_last_user_message || n_tokens_start > slot.prompt.checkpoints.back().n_tokens + params_base.checkpoint_min_step); -+ do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || is_last_user_message || ++ // don't create checkpoints too close together, unless it's the last user message or a ++ // near-prompt-end checkpoint for a checkpoint-only-rollback model (leading empty() guards ++ // the checkpoints.back() access via short-circuit) ++ const bool checkpoint_well_spaced = ++ slot.prompt.checkpoints.empty() || ++ is_last_user_message || + (near_prompt_end && is_ckpt_only_rollback) || -+ n_tokens_start > slot.prompt.checkpoints.back().n_tokens + params_base.checkpoint_min_step); ++ n_tokens_start > slot.prompt.checkpoints.back().n_tokens + params_base.checkpoint_min_step; ++ ++ // and never duplicate the newest checkpoint's position (else the last-user-message ++ // checkpoint is re-created every turn, flooding the list until useful entries are evicted) ++ const bool checkpoint_not_duplicate = ++ slot.prompt.checkpoints.empty() || ++ slot.prompt.checkpoints.back().n_tokens != n_tokens_start; + -+ // never store a duplicate of the newest checkpoint (e.g. 
the last-user-message checkpoint -+ // would otherwise be re-created on every turn of a tool-calling loop, flooding the -+ // checkpoint list until useful entries are evicted) -+ do_checkpoint = do_checkpoint && (slot.prompt.checkpoints.empty() || -+ slot.prompt.checkpoints.back().n_tokens != n_tokens_start); ++ do_checkpoint = do_checkpoint && checkpoint_well_spaced && checkpoint_not_duplicate; SLT_DBG(slot, "main/do_checkpoint = %s, pos_min = %d, pos_max = %d\n", do_checkpoint ? "yes" : "no", pos_min, pos_max); // note: we create the checkpoint before calling llama_decode(), so the current batch is not From 926cd3c45c8340a01a8767d61c7f48e11b2c2276 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 07:47:52 +0000 Subject: [PATCH 03/29] server: add -b/-ub/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs to OpenAiServerCli MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The fat-jar launcher (OpenAiCompatServer.main) parses args via OpenAiServerCli, which only understood a subset of flags and threw on anything else. Extend it with the seven tuning flags that a llama-server.exe user needs so the bundled `java -jar …-jar-with-dependencies.jar` covers a full invocation without any custom Java: -b/--batch-size, -ub/--ubatch-size -> ModelParameters.setBatchSize/setUbatchSize -tb/--threads-batch -> setThreadsBatch -ctk/--cache-type-k, -ctv/--cache-type-v -> setCacheTypeK/V (case-insensitive CacheType lookup; unknown -> error) --jinja -> enableJinja --chat-template-kwargs -> setChatTemplateKwargs --chat-template-kwargs is parsed here (Jackson, already a server-package dep) into the raw-per-value map setChatTemplateKwargs expects, so a malformed object fails fast with usage text instead of at native model load. All setters already existed; the ints/CacheType/kwargs use 0/null "leave the default" sentinels mirroring the existing ctx/threads/parallel handling. 
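The two conventions described above (a case-insensitive lookup that rejects unknown cache types, and `0` as the "leave the default" sentinel for integer flags) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this commit: `KvCacheType` is a hypothetical stand-in for the project's `CacheType` enum with only three values modeled, and `CacheTypeLookup` and its methods are invented names for the example.

```java
/** Hypothetical stand-in for the project's CacheType enum; only three values are modeled. */
enum KvCacheType {
    F16("f16"),
    Q8_0("q8_0"),
    Q4_0("q4_0");

    private final String argValue;

    KvCacheType(String argValue) {
        this.argValue = argValue;
    }

    String getArgValue() {
        return argValue;
    }
}

final class CacheTypeLookup {
    /** Case-insensitive match on the flag value; unknown values fail fast with the choices listed. */
    static KvCacheType parse(String flag, String raw) {
        final String trimmed = raw.trim();
        for (final KvCacheType type : KvCacheType.values()) {
            if (type.getArgValue().equalsIgnoreCase(trimmed)) {
                return type;
            }
        }
        throw new IllegalArgumentException(flag + " expects one of " + choices() + ", got: " + trimmed);
    }

    /** Comma-separated list of accepted values, used in error and usage text. */
    static String choices() {
        final StringBuilder sb = new StringBuilder();
        for (final KvCacheType type : KvCacheType.values()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(type.getArgValue());
        }
        return sb.toString();
    }

    /** Integer flags use 0 as "leave the default": only positive values are forwarded. */
    static boolean shouldApply(int value) {
        return value > 0;
    }

    public static void main(String[] args) {
        System.out.println(parse("-ctk", "Q8_0"));   // matched despite upper case
        System.out.println(shouldApply(0));           // sentinel: not forwarded
        System.out.println(shouldApply(4096));        // explicit value: forwarded
    }
}
```

One consequence of the sentinel convention is that a caller cannot pass an explicit `0` through the launcher for these integer flags; it is indistinguishable from "not set".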
+13 unit tests (30 pass total); usage() and README flag list updated; javadoc and spotless clean. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- README.md | 8 +- .../llama/server/OpenAiServerCli.java | 239 +++++++++++++++++- .../llama/server/OpenAiServerCliTest.java | 128 ++++++++++ 3 files changed, 366 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index ec6f5e75..05c8f44d 100644 --- a/README.md +++ b/README.md @@ -661,8 +661,12 @@ java -cp target/llama-.jar net.ladenthin.llama.server.OpenAiCompatServe ``` Run with `--help` for the full option list (`-m/--model`, `--host`, `-p/--port`, `-c/--ctx-size`, -`-ngl/--n-gpu-layers`, `-t/--threads`, `--parallel`, `--model-id`, `--api-key`, `--mmproj`, -`--embedding`, `--reranking`). +`-b/--batch-size`, `-ub/--ubatch-size`, `-ngl/--n-gpu-layers`, `-t/--threads`, `-tb/--threads-batch`, +`-ctk/--cache-type-k`, `-ctv/--cache-type-v`, `--jinja`, `--chat-template-kwargs`, `--parallel`, +`--model-id`, `--api-key`, `--mmproj`, `--embedding`, `--reranking`). The tuning flags mirror +llama.cpp's server, so an invocation like +`--jinja --chat-template-kwargs '{"reasoning_effort":"low"}' -ctk q8_0 -ctv q8_0 -b 4096 -ub 2048` +works directly. 
Verify with curl (streaming chat): diff --git a/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java b/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java index f32728b8..c810a8a5 100644 --- a/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java +++ b/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java @@ -4,8 +4,15 @@ package net.ladenthin.llama.server; +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; import java.nio.file.Path; import java.nio.file.Paths; +import java.util.Collections; +import java.util.LinkedHashMap; +import java.util.Map; +import net.ladenthin.llama.args.CacheType; import net.ladenthin.llama.parameters.ModelParameters; import org.jspecify.annotations.Nullable; @@ -19,8 +26,11 @@ * via {@link #isHelpRequested(String[])} so callers can print help without it being treated as an error. * *

Flags mirror llama.cpp's own server where they overlap ({@code -m}, {@code -p}, {@code -c}, - * {@code -ngl}, {@code -t}); a few legacy spellings are accepted as aliases so earlier documented - * invocations keep working. + * {@code -b}, {@code -ub}, {@code -ngl}, {@code -t}, {@code -tb}, {@code -ctk}, {@code -ctv}, + * {@code --jinja}, {@code --chat-template-kwargs}); a few legacy spellings are accepted as aliases so + * earlier documented invocations keep working. The {@code --chat-template-kwargs} JSON is parsed here + * (the only JSON this otherwise dependency-light parser touches) so a malformed object fails fast with + * usage text rather than at native model load. */ public final class OpenAiServerCli { @@ -65,7 +75,14 @@ public static Options parse(String... args) { int ctxSize = 0; int gpuLayers = 0; int threads = 0; + int threadsBatch = 0; int parallel = 0; + int batchSize = 0; + int ubatchSize = 0; + @Nullable CacheType cacheTypeK = null; + @Nullable CacheType cacheTypeV = null; + boolean jinja = false; + @Nullable Map chatTemplateKwargs = null; boolean embedding = false; boolean reranking = false; @@ -97,6 +114,32 @@ public static Options parse(String... args) { case "--threads": threads = intValue(args, ++i, arg); break; + case "-tb": + case "--threads-batch": + threadsBatch = intValue(args, ++i, arg); + break; + case "-b": + case "--batch-size": + batchSize = intValue(args, ++i, arg); + break; + case "-ub": + case "--ubatch-size": + ubatchSize = intValue(args, ++i, arg); + break; + case "-ctk": + case "--cache-type-k": + cacheTypeK = cacheTypeValue(args, ++i, arg); + break; + case "-ctv": + case "--cache-type-v": + cacheTypeV = cacheTypeValue(args, ++i, arg); + break; + case "--jinja": + jinja = true; + break; + case "--chat-template-kwargs": + chatTemplateKwargs = parseChatTemplateKwargs(nextValue(args, ++i, arg), arg); + break; case "--parallel": parallel = intValue(args, ++i, arg); break; @@ -131,7 +174,24 @@ public static Options parse(String... 
args) { throw error("Missing required argument: -m/--model "); } return new Options( - host, port, modelPath, modelId, apiKey, mmproj, ctxSize, gpuLayers, threads, parallel, embedding, + host, + port, + modelPath, + modelId, + apiKey, + mmproj, + ctxSize, + gpuLayers, + threads, + threadsBatch, + parallel, + batchSize, + ubatchSize, + cacheTypeK, + cacheTypeV, + jinja, + chatTemplateKwargs, + embedding, reranking); } @@ -155,8 +215,16 @@ public static String usage() { " --host Interface to bind (default: " + DEFAULT_HOST + ")", " -p, --port TCP port to listen on (default: " + DEFAULT_PORT + ")", " -c, --ctx-size Context window size (default: llama.cpp default)", + " -b, --batch-size Logical (prompt) batch size (default: llama.cpp default)", + " -ub, --ubatch-size Physical (micro) batch size (default: llama.cpp default)", " -ngl,--n-gpu-layers Layers to offload to GPU (default: 0 = CPU only)", " -t, --threads Inference thread count (default: llama.cpp default)", + " -tb, --threads-batch Thread count for batch/prompt processing (default: same as -t)", + " -ctk,--cache-type-k KV cache K quantization: " + cacheTypeChoices() + " (default: f16)", + " -ctv,--cache-type-v KV cache V quantization: " + cacheTypeChoices() + " (default: f16)", + " --jinja Use the model's Jinja chat template", + " --chat-template-kwargs JSON object of chat-template variables (requires --jinja),", + " e.g. {\"reasoning_effort\":\"low\"}", " --parallel Parallel inference slots (default: llama.cpp default)", " --model-id Model id reported by /v1/models (default: file name)", " --api-key Require an 'Authorization: Bearer ' header", @@ -191,6 +259,53 @@ private static int intValue(String[] args, int valueIndex, String flag) { } } + /** Reusable parser for the {@code --chat-template-kwargs} JSON object; no state, thread-safe. 
*/ + private static final ObjectMapper CHAT_TEMPLATE_KWARGS_MAPPER = new ObjectMapper(); + + private static CacheType cacheTypeValue(String[] args, int valueIndex, String flag) { + final String raw = nextValue(args, valueIndex, flag).trim(); + for (final CacheType type : CacheType.values()) { + if (type.getArgValue().equalsIgnoreCase(raw)) { + return type; + } + } + throw error(flag + " expects one of " + cacheTypeChoices() + ", got: " + raw); + } + + private static String cacheTypeChoices() { + final StringBuilder sb = new StringBuilder(); + for (final CacheType type : CacheType.values()) { + if (sb.length() > 0) { + sb.append(", "); + } + sb.append(type.getArgValue()); + } + return sb.toString(); + } + + /** + * Parse a {@code --chat-template-kwargs} JSON object into the raw-per-value map that + * {@link ModelParameters#setChatTemplateKwargs(Map)} expects: each entry's value is kept as its + * raw JSON text (a string stays quoted, a boolean/number stays bare), so the object is + * reconstructed verbatim for the native flag. Insertion order is preserved. + */ + private static Map parseChatTemplateKwargs(String json, String flag) { + final JsonNode root; + try { + root = CHAT_TEMPLATE_KWARGS_MAPPER.readTree(json); + } catch (JsonProcessingException e) { + throw error(flag + " expects a JSON object (e.g. {\"reasoning_effort\":\"low\"}), got: " + json, e); + } + if (root == null || !root.isObject()) { + throw error(flag + " expects a JSON object (e.g. {\"reasoning_effort\":\"low\"}), got: " + json); + } + final Map kwargs = new LinkedHashMap<>(); + for (final Map.Entry field : root.properties()) { + kwargs.put(field.getKey(), field.getValue().toString()); + } + return Collections.unmodifiableMap(kwargs); + } + private static IllegalArgumentException error(String message) { return error(message, null); } @@ -200,10 +315,12 @@ private static IllegalArgumentException error(String message, @Nullable Throwabl } /** - * Immutable, parsed launcher options. 
{@code ctxSize}, {@code threads} and {@code parallel} use - * {@code 0} as a sentinel meaning "leave the llama.cpp default" — they are only applied to - * {@link ModelParameters} when positive. {@code gpuLayers} is always applied (its own default of - * {@code 0} already means CPU-only). + * Immutable, parsed launcher options. The integer tuning knobs — {@code ctxSize}, + * {@code threads}, {@code threadsBatch}, {@code parallel}, {@code batchSize} and + * {@code ubatchSize} — use {@code 0} as a sentinel meaning "leave the llama.cpp default", and are + * only applied to {@link ModelParameters} when positive. {@code cacheTypeK}/{@code cacheTypeV} + * and {@code chatTemplateKwargs} use {@code null} as the same "leave the default" sentinel. + * {@code gpuLayers} is always applied (its own default of {@code 0} already means CPU-only). */ public static final class Options { @@ -216,7 +333,14 @@ public static final class Options { private final int ctxSize; private final int gpuLayers; private final int threads; + private final int threadsBatch; private final int parallel; + private final int batchSize; + private final int ubatchSize; + private final @Nullable CacheType cacheTypeK; + private final @Nullable CacheType cacheTypeV; + private final boolean jinja; + private final @Nullable Map chatTemplateKwargs; private final boolean embedding; private final boolean reranking; @@ -230,7 +354,14 @@ private Options( int ctxSize, int gpuLayers, int threads, + int threadsBatch, int parallel, + int batchSize, + int ubatchSize, + @Nullable CacheType cacheTypeK, + @Nullable CacheType cacheTypeV, + boolean jinja, + @Nullable Map chatTemplateKwargs, boolean embedding, boolean reranking) { this.host = host; @@ -242,7 +373,14 @@ private Options( this.ctxSize = ctxSize; this.gpuLayers = gpuLayers; this.threads = threads; + this.threadsBatch = threadsBatch; this.parallel = parallel; + this.batchSize = batchSize; + this.ubatchSize = ubatchSize; + this.cacheTypeK = cacheTypeK; + 
this.cacheTypeV = cacheTypeV; + this.jinja = jinja; + this.chatTemplateKwargs = chatTemplateKwargs; this.embedding = embedding; this.reranking = reranking; } @@ -341,6 +479,72 @@ public int getParallel() { return parallel; } + /** + * The batch/prompt-processing thread count, or {@code 0} for the llama.cpp default (same as + * {@link #getThreads()}). + * + * @return the batch thread count + */ + public int getThreadsBatch() { + return threadsBatch; + } + + /** + * The logical (prompt) batch size, or {@code 0} for the llama.cpp default. + * + * @return the batch size + */ + public int getBatchSize() { + return batchSize; + } + + /** + * The physical (micro) batch size, or {@code 0} for the llama.cpp default. + * + * @return the micro-batch size + */ + public int getUbatchSize() { + return ubatchSize; + } + + /** + * The KV cache K quantization type, or {@code null} for the llama.cpp default. + * + * @return the K cache type, or {@code null} when unset + */ + public @Nullable CacheType getCacheTypeK() { + return cacheTypeK; + } + + /** + * The KV cache V quantization type, or {@code null} for the llama.cpp default. + * + * @return the V cache type, or {@code null} when unset + */ + public @Nullable CacheType getCacheTypeV() { + return cacheTypeV; + } + + /** + * Whether the model's Jinja chat template is enabled. + * + * @return {@code true} if {@code --jinja} was requested + */ + public boolean isJinja() { + return jinja; + } + + /** + * The parsed {@code --chat-template-kwargs} as a raw-per-value map (see + * {@link ModelParameters#setChatTemplateKwargs(Map)}), or {@code null} when unset. The map is + * unmodifiable. + * + * @return the chat-template variables, or {@code null} when unset + */ + public @Nullable Map getChatTemplateKwargs() { + return chatTemplateKwargs; + } + /** * Whether to load the model in embedding mode. 
* @@ -376,9 +580,30 @@ public ModelParameters toModelParameters() { if (threads > 0) { params.setThreads(threads); } + if (threadsBatch > 0) { + params.setThreadsBatch(threadsBatch); + } if (parallel > 0) { params.setParallel(parallel); } + if (batchSize > 0) { + params.setBatchSize(batchSize); + } + if (ubatchSize > 0) { + params.setUbatchSize(ubatchSize); + } + if (cacheTypeK != null) { + params.setCacheTypeK(cacheTypeK); + } + if (cacheTypeV != null) { + params.setCacheTypeV(cacheTypeV); + } + if (jinja) { + params.enableJinja(); + } + if (chatTemplateKwargs != null) { + params.setChatTemplateKwargs(chatTemplateKwargs); + } if (embedding) { params.enableEmbedding(); } diff --git a/llama/src/test/java/net/ladenthin/llama/server/OpenAiServerCliTest.java b/llama/src/test/java/net/ladenthin/llama/server/OpenAiServerCliTest.java index ff3dcd11..30204d6a 100644 --- a/llama/src/test/java/net/ladenthin/llama/server/OpenAiServerCliTest.java +++ b/llama/src/test/java/net/ladenthin/llama/server/OpenAiServerCliTest.java @@ -9,6 +9,7 @@ import static org.hamcrest.Matchers.is; import static org.junit.jupiter.api.Assertions.assertThrows; +import net.ladenthin.llama.args.CacheType; import org.junit.jupiter.api.Test; /** @@ -207,4 +208,131 @@ public void modelParametersIncludeModelPath() { OpenAiServerCli.parse("-m", "models/m.gguf").toModelParameters().toString(); assertThat(json, containsString("models/m.gguf")); } + + @Test + public void tuningFlagsDefaultToSentinels() { + OpenAiServerCli.Options options = OpenAiServerCli.parse("-m", "m.gguf"); + assertThat(options.getBatchSize(), is(0)); + assertThat(options.getUbatchSize(), is(0)); + assertThat(options.getThreadsBatch(), is(0)); + assertThat(options.getCacheTypeK(), is((CacheType) null)); + assertThat(options.getCacheTypeV(), is((CacheType) null)); + assertThat(options.isJinja(), is(false)); + assertThat(options.getChatTemplateKwargs(), is((Object) null)); + } + + @Test + public void tuningShortFlagsParsed() { + 
OpenAiServerCli.Options options = OpenAiServerCli.parse( + "-m", "m.gguf", "-b", "4096", "-ub", "2048", "-tb", "16", "-ctk", "q8_0", "-ctv", "q8_0"); + assertThat(options.getBatchSize(), is(4096)); + assertThat(options.getUbatchSize(), is(2048)); + assertThat(options.getThreadsBatch(), is(16)); + assertThat(options.getCacheTypeK(), is(CacheType.Q8_0)); + assertThat(options.getCacheTypeV(), is(CacheType.Q8_0)); + } + + @Test + public void tuningLongFlagsParsed() { + OpenAiServerCli.Options options = OpenAiServerCli.parse( + "-m", + "m.gguf", + "--batch-size", + "512", + "--ubatch-size", + "256", + "--threads-batch", + "6", + "--cache-type-k", + "f16", + "--cache-type-v", + "q4_0", + "--jinja"); + assertThat(options.getBatchSize(), is(512)); + assertThat(options.getUbatchSize(), is(256)); + assertThat(options.getThreadsBatch(), is(6)); + assertThat(options.getCacheTypeK(), is(CacheType.F16)); + assertThat(options.getCacheTypeV(), is(CacheType.Q4_0)); + assertThat(options.isJinja(), is(true)); + } + + @Test + public void cacheTypeIsCaseInsensitive() { + OpenAiServerCli.Options options = OpenAiServerCli.parse("-m", "m.gguf", "-ctk", "Q8_0"); + assertThat(options.getCacheTypeK(), is(CacheType.Q8_0)); + } + + @Test + public void unknownCacheTypeThrows() { + IllegalArgumentException ex = assertThrows( + IllegalArgumentException.class, () -> OpenAiServerCli.parse("-m", "m.gguf", "-ctk", "q3_k")); + assertThat(ex.getMessage(), containsString("expects one of")); + assertThat(ex.getMessage(), containsString("q8_0")); + assertThat(ex.getMessage(), containsString("q3_k")); + } + + @Test + public void chatTemplateKwargsParsedToRawJsonValues() { + OpenAiServerCli.Options options = OpenAiServerCli.parse( + "-m", "m.gguf", "--chat-template-kwargs", "{\"reasoning_effort\":\"low\",\"enable_thinking\":true}"); + assertThat(options.getChatTemplateKwargs().get("reasoning_effort"), is("\"low\"")); + assertThat(options.getChatTemplateKwargs().get("enable_thinking"), is("true")); + } + + 
@Test + public void chatTemplateKwargsInvalidJsonThrows() { + IllegalArgumentException ex = assertThrows( + IllegalArgumentException.class, + () -> OpenAiServerCli.parse("-m", "m.gguf", "--chat-template-kwargs", "{not json")); + assertThat(ex.getMessage(), containsString("--chat-template-kwargs expects a JSON object")); + } + + @Test + public void chatTemplateKwargsNonObjectThrows() { + IllegalArgumentException ex = assertThrows( + IllegalArgumentException.class, + () -> OpenAiServerCli.parse("-m", "m.gguf", "--chat-template-kwargs", "\"low\"")); + assertThat(ex.getMessage(), containsString("--chat-template-kwargs expects a JSON object")); + } + + @Test + public void toModelParametersCarriesTuningFlags() { + String argv = OpenAiServerCli.parse( + "-m", + "m.gguf", + "-b", + "4096", + "-ub", + "2048", + "-tb", + "16", + "-ctk", + "q8_0", + "-ctv", + "q8_0", + "--jinja", + "--chat-template-kwargs", + "{\"reasoning_effort\":\"low\"}") + .toModelParameters() + .toString(); + assertThat(argv, containsString("--batch-size 4096")); + assertThat(argv, containsString("--ubatch-size 2048")); + assertThat(argv, containsString("--threads-batch 16")); + assertThat(argv, containsString("--cache-type-k q8_0")); + assertThat(argv, containsString("--cache-type-v q8_0")); + assertThat(argv, containsString("--jinja")); + assertThat(argv, containsString("--chat-template-kwargs")); + assertThat(argv, containsString("reasoning_effort")); + } + + @Test + public void usageMentionsNewTuningFlags() { + String usage = OpenAiServerCli.usage(); + assertThat(usage, containsString("--batch-size")); + assertThat(usage, containsString("--ubatch-size")); + assertThat(usage, containsString("--threads-batch")); + assertThat(usage, containsString("--cache-type-k")); + assertThat(usage, containsString("--jinja")); + assertThat(usage, containsString("--chat-template-kwargs")); + } } From 3c8aeb8ddbfc7be0f43053fa4a13c37c87568fc8 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 08:35:18 +0000 
Subject: [PATCH 04/29] =?UTF-8?q?server:=20add=20NativeServer=20=E2=80=94?= =?UTF-8?q?=20run=20the=20full=20llama.cpp=20server=20(WebUI)=20in-DLL=20v?= =?UTF-8?q?ia=20JNI?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Second server mode alongside the Java-transport OpenAiCompatServer: NativeServer runs the *full* upstream llama.cpp HTTP server — embedded WebUI included — inside libjllama over JNI, with no separate llama-server executable. It forwards the raw llama-server argv verbatim, so every flag works exactly as for the standalone binary (no per-flag Java mapping). How: b9859 already exposes `int llama_server(int, char**)` (no main() in server.cpp). patches/0006 makes it embeddable — skips installing process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM), parses the forwarded argv via common_params_parse instead of common_params_parse_main (whose GetCommandLineW recovery would grab java.exe's command line — the Windows bug class 0001 fixes), and adds llama_server_request_shutdown() for out-of-band stop (ctx_server is loop-local). native_server.cpp's JNI bridge runs llama_server on a worker thread; start/stop/isRunning map to the three native methods. CMake: server.cpp + server-tools.cpp are now compiled in (non-Android — both pull subprocess.h/posix_spawn_*, so they share server-models.cpp's guard), plus native_server.cpp. NativeServer is an independent lifecycle (loads its own model from the argv, like llama-server.exe), single-instance per process (upstream keeps shutdown state in file-scope globals), and unavailable on Android. Reusing an already-loaded LlamaModel's context is a documented TODO. libjllama loads lazily in start(), so construction/arg-parsing/close stay pure-Java unit-testable. 
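For illustration only, the single-instance rule described above could be enforced on the Java side roughly as follows. This is a minimal sketch under stated assumptions: the class name `SingleInstanceServerSketch`, its fields, and its messages are hypothetical and are not the actual `NativeServer` code, and the native load/start/stop calls are replaced by comments so the sketch stays pure Java.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a process-wide single-instance guard; not the real NativeServer. */
final class SingleInstanceServerSketch implements AutoCloseable {
    // Process-wide slot: true while some instance owns the (assumed) native server.
    private static final AtomicBoolean RUNNING = new AtomicBoolean(false);

    private final String[] args;
    private boolean started;

    SingleInstanceServerSketch(String... args) {
        // Construction only copies the arguments; nothing native is touched here.
        this.args = args.clone();
    }

    /** Returns a defensive copy of the arguments that would be forwarded. */
    String[] args() {
        return args.clone();
    }

    synchronized SingleInstanceServerSketch start() {
        if (started) {
            throw new IllegalStateException("already started");
        }
        // Claim the process-wide slot atomically; a second concurrent server is rejected.
        if (!RUNNING.compareAndSet(false, true)) {
            throw new IllegalStateException("another native server is already running in this process");
        }
        // A real implementation would lazily load the native library and start the server here.
        started = true;
        return this;
    }

    synchronized boolean isRunning() {
        return started;
    }

    @Override
    public synchronized void close() {
        if (started) {
            // A real implementation would stop the native server before releasing the slot.
            started = false;
            RUNNING.set(false);
        }
    }
}
```

Because `start()` returns `this`, the sketch composes with try-with-resources, and `close()` on a never-started instance is a no-op, which keeps construction and close testable without the native library.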
Verified end-to-end on Linux x86_64 with a synthetic granitehybrid model: server starts, GET /health -> 200 {"status":"ok"}, /v1/models and /props served, / is the native WebUI route (404 locally with the empty-asset stub; serves index.html in released jars that bake in webui-generated assets), close() shuts down cleanly. 7 pure-Java NativeServer tests + javadoc + spotless + clang-format(22.1.5) clean. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- CLAUDE.md | 10 +- README.md | 29 +++ TODO.md | 30 +++ llama/CMakeLists.txt | 17 ++ .../0006-server-embed-native-server-jni.patch | 67 +++++++ llama/src/main/cpp/native_server.cpp | 107 ++++++++++ llama/src/main/cpp/native_server_bridge.h | 22 ++ .../ladenthin/llama/server/NativeServer.java | 189 +++++++++++++----- .../llama/server/NativeServerSmokeTest.java | 53 +++-- 9 files changed, 453 insertions(+), 71 deletions(-) create mode 100644 llama/patches/0006-server-embed-native-server-jni.patch create mode 100644 llama/src/main/cpp/native_server.cpp create mode 100644 llama/src/main/cpp/native_server_bridge.h diff --git a/CLAUDE.md b/CLAUDE.md index cc9cec4a..5e1f5e5a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -433,6 +433,7 @@ Current patches: | `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b` includes the change. 
| | `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b` includes the change. | | `0005-server-recurrent-near-prompt-end-checkpoints.patch` | **Multi-turn tool-calling perf fix for recurrent/hybrid models (e.g. Granite-4)**, upstream-submittable. In `server_context::update_slots` (`tools/server/server-context.cpp`) the near-prompt-end context checkpoints are gated by `checkpoint_min_step` (default 8192 tokens). An agentic conversation that appends only assistant/tool messages never produces a new user-message checkpoint (`is_user_start`/`is_last_user_message` match `COMMON_CHAT_ROLE_USER` only), so after turn 1 no new checkpoint is ever created and — because recurrent state can only roll back to a checkpoint — **every turn re-prefills the whole conversation tail** (measured on a synthetic granitehybrid model: prefilled tokens grew 901 → 1544 → 2187 → 2830 → 3473 over turns 2–6). 
The patch (1) exempts near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (`ctx_tgt_seq_rm_type` is `FULL` or `RS` — SWA-only models are unaffected), and (2) skips creating a checkpoint whose position equals the newest one (the last-user-message checkpoint was re-created identically on every turn, flooding the 32-entry list). After the patch each turn restores the previous turn's near-end checkpoint and prefill is constant (~new-turn-sized; 647 tokens/turn in the same measurement, ≈5.4× less prefill at turn 6 and growing with conversation length). Validated output-identical (`temperature=0`) vs. unpatched. Complements — not duplicates — open upstream PRs #24035/#24899/#24891 (they fix checkpoint *invalidation/retention*; this fixes checkpoint *starvation*). Drop once upstream solves agentic checkpoint placement (e.g. a merged role-boundary checkpointing design, cf. #21885 / #22826 discussion). | +| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. b9859 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). 
`llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`) @@ -846,7 +847,14 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in - `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable. - `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`. - Uses `nlohmann/json` for JSON deserialization of parameters. -- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. **`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. 
**Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). `server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). Only `server.cpp` (the standalone `main()` + route wiring) remains excluded — wiring the routes to JNI is the next step. +- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. **`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. 
`jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). `server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). **`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9859 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. + +### Two server modes (`OpenAiCompatServer` vs `NativeServer`) + +The library exposes **two** ways to serve a model over HTTP, on two different transports: + +1. 
**`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. This is the fat-jar `Main-Class`; its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`). +2. **`server.NativeServer` (native transport).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible. ### Native Helper Architecture diff --git a/README.md b/README.md index 05c8f44d..d5c96dd9 100644 --- a/README.md +++ b/README.md @@ -710,6 +710,35 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` ( `OpenAiServerConfig.apiKey(...)`) to require an `Authorization: Bearer` token; the server binds to `127.0.0.1` by default. +### Native server with the built-in WebUI (`NativeServer`) + +`OpenAiCompatServer` above is a JSON **API** server (its `/` is a 404 — no web page). 
If you want +the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use +`net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over +JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so +every flag works exactly as it does for the standalone binary: + +```java +try (NativeServer server = new NativeServer( + "-m", "gpt-oss-20b-UD-Q4_K_XL.gguf", + "--host", "127.0.0.1", "--port", "8080", + "-c", "65536", "-b", "4096", "-ub", "2048", + "--jinja", "-ngl", "0", "-t", "8", "-tb", "16", + "-ctk", "q8_0", "-ctv", "q8_0", + "--chat-template-kwargs", "{\"reasoning_effort\":\"low\"}", + "--parallel", "1").start()) { + // Open http://127.0.0.1:8080/ in a browser for the WebUI; the OpenAI API is at /v1/... too. + Thread.currentThread().join(); +} +``` + +Differences from `OpenAiCompatServer`: it **loads its own model** from the arguments (an independent +lifecycle, like `llama-server.exe`, not a shared `LlamaModel`), it is **single-instance per +process**, it serves the **WebUI** (in released jars — local `cmake` builds ship the empty-asset +stub, so no UI there), and it is **not available on Android** (the upstream server needs +`posix_spawn`). Readiness: poll `GET /health`. No SSL (plain HTTP — bind localhost or front with a +TLS proxy). + ### LangChain4j integration A separate artifact, **`net.ladenthin:llama-langchain4j`**, adapts a `LlamaModel` to diff --git a/TODO.md b/TODO.md index 66de274f..e6390529 100644 --- a/TODO.md +++ b/TODO.md @@ -13,6 +13,36 @@ cross-cutting initiative. ## Open — jllama-specific +### NativeServer — reuse an already-loaded `LlamaModel` (open, enhancement) + +`net.ladenthin.llama.server.NativeServer` (the native-transport server mode that runs the full +upstream `llama_server` — WebUI included — inside `libjllama` over JNI) currently loads its **own** +model from the forwarded argv, exactly like running `llama-server.exe`. 
This is the "independent +lifecycle" v1: simple, and every llama-server flag is forwarded verbatim. + +**Enhancement:** let `NativeServer` optionally attach to an **already-loaded** `LlamaModel`'s +`server_context` instead of loading a second copy of the weights (saves the RAM/VRAM and load time +of a duplicate model when a caller already has a `LlamaModel` open). Feasibility notes from the +initial investigation: + +- The upstream HTTP transport (`server_http_context`) and the route bundle + (`server_routes routes(params, ctx_server)`) only need a reference to a `server_context`. A + `LlamaModel` already owns and drives one (`jllama_context` in `jni_helpers.hpp`), and its JNI + methods already post tasks to that context's queue — so a second driver (the HTTP routes) posting + to the same queue is plausible; the queue is the synchronization point. +- The real work is **lifecycle/ownership**: today `llama_server()` owns the whole flow (parse → + backend init → `ctx_server.load_model` → `start_loop` on its own thread → cleanup). Reuse would + need a *different* entry that skips model loading and the `start_loop`/backend ownership (the + existing `LlamaModel` worker already runs the loop), registers the HTTP routes against the shared + `server_context`, and starts only `server_http_context`. That is a separate, smaller C++ entry + point (not `llama_server`), plus reconciling params (the loaded model's params vs. server params) + and ensuring only one thread drives `update_slots`. +- Logging: `llama_server` calls `common_init()` which routes llama.cpp logging to stderr/file; a + reuse path must not clobber the JNI log callback a `LlamaModel` consumer may rely on. + +Until then, run `NativeServer` standalone (it owns the process's llama backend + logging while +running), or use the Java-transport `OpenAiCompatServer` when sharing a `LlamaModel`. 
+ ### PIT gate not hermetic — `value.ContentPart.audioFile(Path)` (open) The PIT mutation gate reaches 100% **only when the audio test fixture is present**. Without it the diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index 4c480f81..ec56bbb3 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -355,6 +355,23 @@ if(NOT ANDROID_ABI AND NOT OS_NAME MATCHES "Android") ) endif() +# Native-server mode (net.ladenthin.llama.server.NativeServer): compile the standalone server +# entry point (server.cpp's `llama_server`, made embeddable by patches/0006) and its tools helper +# (server-tools.cpp); jllama's JNI bridge (native_server.cpp) then calls llama_server on a worker +# thread. This runs the *full* upstream HTTP server — WebUI included, every llama-server flag +# forwarded — inside libjllama, with no separate llama-server executable. server.cpp and +# server-tools.cpp both pull in vendor/sheredom/subprocess.h (posix_spawn_*), so they share the +# non-Android guard used for server-models.cpp above; native_server.cpp links against llama_server +# and is guarded too. On Android the NativeServer native methods are simply absent (its JNI calls +# throw UnsatisfiedLinkError) — use OpenAiCompatServer there. 
+if(NOT ANDROID_ABI AND NOT OS_NAME MATCHES "Android") + target_sources(jllama PRIVATE + ${llama.cpp_SOURCE_DIR}/tools/server/server-tools.cpp + ${llama.cpp_SOURCE_DIR}/tools/server/server.cpp + ${CMAKE_SOURCE_DIR}/src/main/cpp/native_server.cpp + ) +endif() + # Phase 2: also compile the upstream HTTP transport (server-http.cpp) and its # cpp-httplib backend directly into jllama, so the OpenAI-compatible server can be # driven natively from JNI — shipped inside libjllama, with no separate diff --git a/llama/patches/0006-server-embed-native-server-jni.patch b/llama/patches/0006-server-embed-native-server-jni.patch new file mode 100644 index 00000000..35a146d5 --- /dev/null +++ b/llama/patches/0006-server-embed-native-server-jni.patch @@ -0,0 +1,67 @@ +diff --git a/tools/server/server.cpp b/tools/server/server.cpp +index 84c7f0b..5c9fac9 100644 +--- a/tools/server/server.cpp ++++ b/tools/server/server.cpp +@@ -25,6 +25,28 @@ + static std::function<void(int)> shutdown_handler; + static std::atomic_flag is_terminating = ATOMIC_FLAG_INIT; + ++// [jllama] Embedded-mode support: when llama_server() is hosted inside libjllama and driven over ++// JNI (net.ladenthin.llama.server.NativeServer), it must NOT install process-wide signal handlers ++// (that would hijack the JVM's SIGINT/SIGTERM), and it must be stoppable out-of-band because ++// ctx_server is local to llama_server(). It also parses exactly the forwarded argv rather than ++// re-deriving it from the process command line (which would be java.exe's — the Windows bug the ++// 0001 patch fixes for the embedded path). These symbols are declared in ++// src/main/cpp/native_server_bridge.h and called by native_server.cpp. 
++static std::atomic<bool> g_llama_server_embedded{false}; ++ ++void llama_server_set_embedded(bool embedded) { ++ g_llama_server_embedded.store(embedded); ++} ++ ++void llama_server_request_shutdown() { ++ // Mirrors the SIGTERM path: invoke the installed shutdown_handler, which unblocks ++ // ctx_server.start_loop() (single-model) / ctx_http.stop() (router). No-op if the server has ++ // not finished starting (handler not yet installed) — stop after /health reports ready. ++ if (shutdown_handler) { ++ shutdown_handler(SIGTERM); ++ } ++} ++ + static inline void signal_handler(int signal) { + if (is_terminating.test_and_set()) { + // in case it hangs, we can force terminate the server by hitting Ctrl+C twice +@@ -87,7 +109,13 @@ int llama_server(int argc, char ** argv) { + // touch it. lifecycle is symmetric, stop_gc() runs in clean_up() before backend free + g_stream_sessions.start_gc(); + +- if (!common_params_parse_main(argc, argv, params, LLAMA_EXAMPLE_SERVER)) { ++ // [jllama] embedded (JNI) callers forward a clean UTF-8 argv, so honor it exactly via ++ // common_params_parse; only the standalone tool needs common_params_parse_main's ++ // process-command-line (GetCommandLineW) UTF-8 recovery. ++ const bool parsed_ok = g_llama_server_embedded.load() ++ ? common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER) ++ : common_params_parse_main(argc, argv, params, LLAMA_EXAMPLE_SERVER); ++ if (!parsed_ok) { + return 1; + } + +@@ -412,6 +440,10 @@ int llama_server(int argc, char ** argv) { + } + + // TODO: refactor in common/console ++ // [jllama] skip installing process-wide signal handlers when embedded in the JVM (they would ++ // hijack the JVM's own SIGINT/SIGTERM). NativeServer stops the embedded server via ++ // llama_server_request_shutdown() instead. 
++ if (!g_llama_server_embedded.load()) { + #if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) + struct sigaction sigint_action; + sigint_action.sa_handler = signal_handler; +@@ -425,6 +457,7 @@ int llama_server(int argc, char ** argv) { + }; + SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true); + #endif ++ } + + SRV_INF("listening on %s\n", ctx_http.listening_address.c_str()); + diff --git a/llama/src/main/cpp/native_server.cpp b/llama/src/main/cpp/native_server.cpp new file mode 100644 index 00000000..d9cfa527 --- /dev/null +++ b/llama/src/main/cpp/native_server.cpp @@ -0,0 +1,107 @@ +// SPDX-FileCopyrightText: 2026 Bernard Ladenthin +// +// SPDX-License-Identifier: MIT + +// JNI bridge for net.ladenthin.llama.server.NativeServer: runs the full upstream llama.cpp HTTP +// server (llama_server(), including its embedded WebUI) inside libjllama, driven over JNI. The +// argv is forwarded verbatim from Java, so every llama-server flag is supported. This is an +// independent server lifecycle (it loads its own model from the argv), distinct from LlamaModel +// and the Java-side OpenAiCompatServer. +// +// Only ONE native server may run per process: server.cpp keeps its shutdown_handler / +// is_terminating state in file-scope globals, so a second concurrent llama_server() would clobber +// them. NativeServer enforces this on the Java side. + +#include "native_server_bridge.h" + +#include <jni.h> + +#include <atomic> +#include <chrono> +#include <string> +#include <thread> +#include <vector> + +namespace { + +// Owns the argv storage for the lifetime of the running server plus the worker thread that runs +// llama_server(). The argv pointers reference the std::string storage in `args`, which is filled +// once (with reserve) and never mutated afterwards, so the pointers stay valid. 
+struct native_server { + std::vector<std::string> args; // args[0] is the program name ("llama-server") + std::vector<char *> argv; // points into `args` + std::thread worker; + std::atomic<bool> finished{false}; + int exit_code = -1; +}; + +} // namespace + +extern "C" { + +JNIEXPORT jlong JNICALL Java_net_ladenthin_llama_server_NativeServer_startNativeServer(JNIEnv *env, jclass, + jobjectArray jargs) { + auto *srv = new native_server(); + + const jsize n = (jargs != nullptr) ? env->GetArrayLength(jargs) : 0; + srv->args.reserve(static_cast<size_t>(n) + 1); + srv->args.emplace_back("llama-server"); // argv[0] + for (jsize i = 0; i < n; ++i) { + auto js = static_cast<jstring>(env->GetObjectArrayElement(jargs, i)); + if (js != nullptr) { + const char *chars = env->GetStringUTFChars(js, nullptr); + srv->args.emplace_back(chars != nullptr ? chars : ""); + if (chars != nullptr) { + env->ReleaseStringUTFChars(js, chars); + } + env->DeleteLocalRef(js); + } else { + srv->args.emplace_back(""); + } + } + + srv->argv.reserve(srv->args.size()); + for (auto &arg : srv->args) { + srv->argv.push_back(const_cast<char *>(arg.c_str())); + } + + // Embedded mode: no process signal handlers, honor the forwarded argv (see patches/0006). + llama_server_set_embedded(true); + + srv->worker = std::thread([srv]() { + srv->exit_code = llama_server(static_cast<int>(srv->argv.size()), srv->argv.data()); + srv->finished.store(true); + }); + + return reinterpret_cast<jlong>(srv); +} + +JNIEXPORT void JNICALL Java_net_ladenthin_llama_server_NativeServer_stopNativeServer(JNIEnv *, jclass, jlong handle) { + auto *srv = reinterpret_cast<native_server *>(handle); + if (srv == nullptr) { + return; + } + // Signal shutdown, retrying until the worker actually returns: a stop issued before the server + // finished starting (shutdown_handler not yet installed by llama_server) would otherwise be + // lost. Once the handler is installed the first signal takes effect; if the model failed to + // load, llama_server has already returned and `finished` is set. 
+ while (!srv->finished.load()) { + llama_server_request_shutdown(); + if (srv->finished.load()) { + break; + } + std::this_thread::sleep_for(std::chrono::milliseconds(50)); + } + if (srv->worker.joinable()) { + srv->worker.join(); + } + delete srv; +} + +JNIEXPORT jboolean JNICALL Java_net_ladenthin_llama_server_NativeServer_isRunningNative(JNIEnv *, jclass, + jlong handle) { + auto *srv = reinterpret_cast<native_server *>(handle); + return (srv != nullptr && !srv->finished.load()) ? JNI_TRUE : JNI_FALSE; +} + +} // extern "C" diff --git a/llama/src/main/cpp/native_server_bridge.h b/llama/src/main/cpp/native_server_bridge.h new file mode 100644 index 00000000..1a40c766 --- /dev/null +++ b/llama/src/main/cpp/native_server_bridge.h @@ -0,0 +1,22 @@ +// SPDX-FileCopyrightText: 2026 Bernard Ladenthin +// +// SPDX-License-Identifier: MIT + +#pragma once + +// Declarations for the upstream server entry point (llama.cpp tools/server/server.cpp) that +// jllama's NativeServer JNI bridge (native_server.cpp) calls to run the full llama.cpp HTTP +// server — WebUI included — inside libjllama, with no separate llama-server executable. +// +// - llama_server: upstream's renamed main (b9859 already exposes `int llama_server(int, char**)` +// as a non-static, externally linkable function). Runs the server and blocks until shutdown, +// returning its process-style exit code (0 = clean). +// - llama_server_set_embedded / llama_server_request_shutdown: added by +// patches/0006-server-embed-native-server-jni.patch so the server can run embedded in the JVM +// (does not install process-wide signal handlers, and honors the forwarded argv instead of +// re-deriving it from the process command line) and can be stopped out-of-band (the SIGTERM +// path) since its server_context is local to llama_server(). 
+ +int llama_server(int argc, char ** argv); +void llama_server_set_embedded(bool embedded); +void llama_server_request_shutdown(); diff --git a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java index 024ac827..db69ca95 100644 --- a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java +++ b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java @@ -5,105 +5,190 @@ package net.ladenthin.llama.server; import java.util.Objects; +import java.util.concurrent.atomic.AtomicBoolean; import lombok.ToString; +import net.ladenthin.llama.loader.LlamaLoader; /** - * Scaffold for the native HTTP server bridge — the planned counterpart to - * {@link OpenAiCompatServer}. + * Runs the full upstream llama.cpp HTTP server — including its embedded + * WebUI — inside {@code libjllama}, driven over JNI, with no separate + * {@code llama-server} executable. It is the second of two server modes, the native counterpart to + * the Java-transport {@link OpenAiCompatServer}. * - *

<p>{@link OpenAiCompatServer} implements the HTTP transport in Java (on the JDK's
- * {@code com.sun.net.httpserver}) and drives the native llama.cpp server core over JNI. This
- * class is instead the entry point for the upstream native HTTP transport that is already
- * compiled into {@code libjllama} (llama.cpp's {@code server-http.cpp} plus its {@code cpp-httplib}
- * backend). That native transport is the only component able to serve the embedded llama.cpp
- * WebUI (the {@code ui.cpp}/{@code ui.h} asset table compiled in behind
- * {@code LLAMA_UI_HAS_ASSETS}).</p>
+ * <p>The constructor takes the raw llama-server command-line arguments and forwards them verbatim
+ * to the native entry point ({@code llama_server}), so every llama-server flag is supported
+ * ({@code -m}, {@code -c}, {@code -b}, {@code -ub}, {@code -ngl}, {@code -t}, {@code -tb},
+ * {@code -ctk}, {@code -ctv}, {@code --jinja}, {@code --chat-template-kwargs}, {@code --host},
+ * {@code --port}, {@code --ui}/{@code --no-ui}, …). Unlike {@link OpenAiCompatServer}, no per-flag
+ * Java mapping is involved.</p>
  *
- * <p>Status: scaffold only. The route registration that upstream performs in
- * {@code server.cpp} (deliberately excluded from this build) is not yet wired to a JNI entry point, so
- * {@link #start()} throws {@link UnsupportedOperationException} for now. This class only fixes the
- * package structure and the public API shape; the native {@code startServer}/{@code stopServer}
- * methods, their C++ implementation, the server lifecycle/threading and WebUI serving are a separate,
- * detailed step (see {@code CLAUDE.md}, "WebUI (llama.cpp Svelte UI) embedding").</p>
+ * <p>Independent lifecycle. {@code NativeServer} loads its own model from
+ * the forwarded arguments — exactly like running {@code llama-server.exe} — and is unrelated to any
+ * {@code net.ladenthin.llama.LlamaModel} you may also have open. Reusing an already-loaded
+ * {@code LlamaModel}'s context instead of loading a second copy is a possible future enhancement
+ * (see {@code TODO.md}). While the native server runs it owns the process-wide llama backend and
+ * routes llama.cpp logging to stderr/file (llama-server's own logging), not the JNI log callback.</p>
  *
- * <p>It is {@link AutoCloseable} so that, once implemented, callers can drive it with
- * try-with-resources exactly like {@link OpenAiCompatServer}.</p>
+ * <p>Single instance per process. The upstream server keeps its shutdown state in
+ * file-scope globals, so only one {@code NativeServer} may run at a time; {@link #start()} throws if
+ * another instance is already running.</p>
+ *
+ * <p>Typical use:</p>
+ * <pre>{@code
+ * try (NativeServer server = new NativeServer(
+ *         "-m", "models/model.gguf", "--host", "127.0.0.1", "--port", "8080", "-c", "65536").start()) {
+ *     // Server (and WebUI at http://127.0.0.1:8080/) runs on a native worker thread.
+ *     // Readiness: poll GET /health until it returns {"status":"ok"}.
+ *     Thread.currentThread().join();
+ * }
+ * }</pre>
+ * <p>Platform note. The native methods are compiled into {@code libjllama} on all
+ * platforms except Android (the upstream server pulls in {@code posix_spawn_*}, unavailable there);
+ * on Android use {@link OpenAiCompatServer}. No SSL: the embedded server is plain HTTP — bind
+ * localhost or front it with a TLS proxy.</p>

*/ @ToString public final class NativeServer implements AutoCloseable { - /** Message thrown by {@link #start()} until the native route-wiring lands. */ - static final String NOT_WIRED_MESSAGE = - "NativeServer is a scaffold: the upstream native HTTP routes (server-http.cpp) are " - + "not yet wired to JNI. Use OpenAiCompatServer for now; the native server and " - + "embedded WebUI are a planned step."; + /** Guards the process-wide single-instance invariant (upstream uses file-scope globals). */ + private static final AtomicBoolean RUNNING = new AtomicBoolean(false); + + /** Default bind host reported by {@link #getHost()} when {@code --host} is not passed. */ + private static final String DEFAULT_HOST = "127.0.0.1"; + + /** Default port reported by {@link #getPort()} when no port flag is passed. */ + private static final int DEFAULT_PORT = 8080; - /** Immutable server configuration (bind host, port, ...) shared with {@link OpenAiCompatServer}. */ - private final OpenAiServerConfig config; + /** The llama-server argument vector, forwarded verbatim to the native entry point. */ + private final String[] args; + + /** Native handle (pointer) while running, or {@code 0} when not started / stopped. */ + private volatile long handle; /** - * Creates a native-server bridge for the given configuration. + * Creates a native-server bridge for the given llama-server arguments. * - *

<p>Construction performs no native work and binds no socket; it only captures the configuration.
- * Call {@link #start()} to launch the server (not implemented yet).</p>
+ * <p>Construction performs no native work and binds no socket; it only captures the arguments.
+ * Call {@link #start()} to launch the server.</p>

* - * @param config the server configuration (host, port, ...); must not be {@code null} + * @param args the llama-server command-line arguments (e.g. {@code "-m", "model.gguf", + * "--port", "8080"}); must not be {@code null} and must not contain {@code null} + * elements */ - public NativeServer(OpenAiServerConfig config) { - this.config = Objects.requireNonNull(config, "config"); + public NativeServer(String... args) { + Objects.requireNonNull(args, "args"); + for (final String arg : args) { + Objects.requireNonNull(arg, "args element"); + } + this.args = args.clone(); } /** - * Starts the native HTTP server and begins serving the embedded WebUI. - * - *

<p>Not implemented yet — this is a scaffold. The native route registration and
- * its JNI binding are a planned step, so this method always throws until then.</p>

+ * Starts the native HTTP server (and its embedded WebUI) on a background thread and returns + * immediately. The server binds and begins serving {@code GET /health} before the model finishes + * loading; poll {@code /health} for readiness. * - * @return this server instance (for fluent / try-with-resources use), once implemented - * @throws UnsupportedOperationException always, until the native routes are wired to JNI + * @return this server instance (for fluent / try-with-resources use) + * @throws IllegalStateException if this instance was already started, or another + * {@code NativeServer} is already running in this process */ - // Scaffold: start() intentionally always throws for now, but must stay callable (not @DoNotCall) - // so the real implementation and its callers/tests keep the same signature. - @SuppressWarnings("DoNotCallSuggester") public NativeServer start() { - throw new UnsupportedOperationException(NOT_WIRED_MESSAGE); + if (handle != 0) { + throw new IllegalStateException("NativeServer already started"); + } + if (!RUNNING.compareAndSet(false, true)) { + throw new IllegalStateException( + "another NativeServer is already running in this process (only one is supported)"); + } + try { + // Load libjllama lazily here (not in a static initializer) so construction, argument + // parsing and close() stay usable — and unit-testable — without the native library. + LlamaLoader.initialize(); + handle = startNativeServer(args); + } catch (final RuntimeException | Error e) { + RUNNING.set(false); + throw e; + } + return this; } /** - * Reports whether the native server is currently running. + * Reports whether the native server worker is currently running. + * + *

<p>Note: this becomes {@code true} as soon as the worker thread starts, which is before the
+ * socket is necessarily accepting connections — use {@code GET /health} to detect readiness.</p>

* - * @return {@code false} — the scaffold never starts a server yet + * @return {@code true} if the server has been started and its worker has not yet exited */ public boolean isRunning() { - return false; + final long h = handle; + return h != 0 && isRunningNative(h); } /** - * Returns the host the server is configured to bind to. + * Returns the bind host parsed from the arguments ({@code --host}), or {@code 127.0.0.1} when + * absent. Best-effort convenience for logging; the authoritative value is what the native server + * parsed. * * @return the configured bind host */ public String getHost() { - return config.getHost(); + for (int i = 0; i < args.length - 1; i++) { + if ("--host".equals(args[i])) { + return args[i + 1]; + } + } + return DEFAULT_HOST; } /** - * Returns the port the server is configured to bind to. + * Returns the port parsed from the arguments ({@code --port} / {@code -p}), or {@code 8080} when + * absent or unparseable. Best-effort convenience for logging. * * @return the configured port */ public int getPort() { - return config.getPort(); + for (int i = 0; i < args.length - 1; i++) { + if ("--port".equals(args[i]) || "-p".equals(args[i])) { + try { + return Integer.parseInt(args[i + 1].trim()); + } catch (final NumberFormatException e) { + return DEFAULT_PORT; + } + } + } + return DEFAULT_PORT; } /** - * Stops the native server if it is running. - * - *

<p>No-op in the scaffold (nothing is ever started), so it is always safe to call, including from
- * try-with-resources. Real lifecycle teardown is part of the planned native-server implementation.</p>

+ * Stops the native server if it is running and releases the native handle. Blocks until the + * server has fully shut down. Safe to call more than once and from try-with-resources even if + * {@link #start()} was never called (no-op then). */ @Override public void close() { - // Nothing is started yet, so there is nothing to release. + final long h = handle; + if (h == 0) { + return; + } + handle = 0; + try { + stopNativeServer(h); + } finally { + RUNNING.set(false); + } } + + /** + * Starts the native server on a worker thread and returns an opaque handle. The argv is + * forwarded verbatim (with a synthetic {@code argv[0]}). + */ + private static native long startNativeServer(String[] args); + + /** Signals shutdown, joins the worker thread, and frees the handle. */ + private static native void stopNativeServer(long handle); + + /** Whether the worker thread for the given handle is still running. */ + private static native boolean isRunningNative(long handle); } diff --git a/llama/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java b/llama/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java index 7e74dec4..389136f9 100644 --- a/llama/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java +++ b/llama/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java @@ -5,44 +5,61 @@ package net.ladenthin.llama.server; import static org.hamcrest.MatcherAssert.assertThat; -import static org.hamcrest.Matchers.containsString; import static org.hamcrest.Matchers.is; import static org.junit.jupiter.api.Assertions.assertThrows; import org.junit.jupiter.api.Test; /** - * Model-free smoke test for the {@link NativeServer} scaffold: it must construct without any native - * work, expose its configured host/port, never report itself running, throw a clear - * {@link UnsupportedOperationException} from {@link NativeServer#start()} until the native routes are - * wired, and be a safe no-op {@link AutoCloseable}. 
No model and no {@code libjllama} required. + * Model-free, library-free unit tests for {@link NativeServer}'s pure-Java surface: it must + * construct without any native work (libjllama is loaded lazily in {@link NativeServer#start()}, + * not in a static initializer), best-effort parse host/port from the forwarded arguments, report + * itself not running before {@code start()}, and be a safe no-op {@link AutoCloseable} when never + * started. Actually starting the native server is exercised by CI / manual runs with a real model. */ public class NativeServerSmokeTest { - private static OpenAiServerConfig config() { - return OpenAiServerConfig.builder().host("127.0.0.1").port(1234).build(); + @Test + public void parsesHostAndPortFromArgs() { + NativeServer server = new NativeServer("-m", "m.gguf", "--host", "0.0.0.0", "--port", "1234"); + assertThat(server.getHost(), is("0.0.0.0")); + assertThat(server.getPort(), is(1234)); + assertThat(server.isRunning(), is(false)); + } + + @Test + public void shortPortFlagParsed() { + NativeServer server = new NativeServer("-m", "m.gguf", "-p", "9099"); + assertThat(server.getPort(), is(9099)); } @Test - public void exposesConfiguredHostAndPortWithoutStarting() { - NativeServer server = new NativeServer(config()); + public void defaultsWhenFlagsAbsent() { + NativeServer server = new NativeServer("-m", "m.gguf"); assertThat(server.getHost(), is("127.0.0.1")); - assertThat(server.getPort(), is(1234)); - assertThat(server.isRunning(), is(false)); + assertThat(server.getPort(), is(8080)); } @Test - public void startThrowsUntilNativeRoutesAreWired() { - NativeServer server = new NativeServer(config()); - UnsupportedOperationException ex = assertThrows(UnsupportedOperationException.class, server::start); - assertThat(ex.getMessage(), containsString("not yet wired")); - assertThat(server.isRunning(), is(false)); + public void nonIntegerPortFallsBackToDefault() { + NativeServer server = new NativeServer("-m", "m.gguf", "--port", "abc"); 
+ assertThat(server.getPort(), is(8080)); } @Test - public void closeIsSafeNoOpEvenViaTryWithResources() { - try (NativeServer server = new NativeServer(config())) { + public void closeBeforeStartIsSafeNoOpViaTryWithResources() { + try (NativeServer server = new NativeServer("-m", "m.gguf")) { assertThat(server.isRunning(), is(false)); } } + + @Test + public void nullArgsRejected() { + assertThrows(NullPointerException.class, () -> new NativeServer((String[]) null)); + } + + @Test + public void nullArgElementRejected() { + assertThrows(NullPointerException.class, () -> new NativeServer("-m", null)); + } } From 8a1a68fb02af52deba1ff9fe1e5746a34ceb3d27 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 08:46:47 +0000 Subject: [PATCH 05/29] server: make NativeServer the default fat-jar Main-Class (keep OpenAiCompatServer) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two runnable server mains now exist. The fat jar's default Main-Class becomes NativeServer, so `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full native llama.cpp server with its embedded WebUI, forwarding every argument. OpenAiCompatServer is unchanged and still runnable via `java -cp net.ladenthin.llama.server.OpenAiCompatServer …`. - NativeServer.main(args): forwards argv, starts the server, registers a JVM shutdown hook (the embedded server installs no signal handlers of its own — see patches/0006 — so the hook is what stops it cleanly on Ctrl-C/SIGTERM), and blocks until the native worker exits. - llama/pom.xml assembly profile: Main-Class OpenAiCompatServer -> NativeServer. - README + CLAUDE.md: document the two modes and how to select each. Verified end-to-end (Linux x86_64, synthetic granitehybrid): `java -cp … NativeServer -m model --port 8972` serves /health=ok after load; SIGTERM to the JVM fires the shutdown hook -> clean "cleaning up before exit" -> port down. 
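The shutdown-hook arrangement described above for NativeServer.main can be sketched roughly as follows. This is an illustration only, not the code in this patch: `ShutdownHookSketch`, `FakeServer` and `newStopHook` are hypothetical names, and `FakeServer` stands in for a server whose `close()` stops it, so the sketch runs without libjllama.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownHookSketch {

    /** Hypothetical stand-in for a server that is stopped by close(). */
    static final class FakeServer implements AutoCloseable {
        private final AtomicBoolean running = new AtomicBoolean(true);

        boolean isRunning() {
            return running.get();
        }

        @Override
        public void close() {
            running.set(false);
        }
    }

    /** Builds (but does not register) a hook thread that records the stop and closes the server. */
    static Thread newStopHook(FakeServer server, AtomicBoolean stoppedByHook) {
        return new Thread(() -> {
            stoppedByHook.set(true);
            server.close();
        }, "server-shutdown-hook");
    }

    public static void main(String[] args) {
        FakeServer server = new FakeServer();
        AtomicBoolean stoppedByHook = new AtomicBoolean(false);
        // The JVM runs registered hooks on normal exit, Ctrl-C and SIGTERM.
        Runtime.getRuntime().addShutdownHook(newStopHook(server, stoppedByHook));
        System.out.println("running=" + server.isRunning());
    }
}
```

Building the hook in a separate method keeps it callable directly, so the stop path can be exercised without terminating the JVM.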
Javadoc + spotless clean; 7 pure-Java NativeServer tests pass. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- CLAUDE.md | 6 +-- README.md | 31 ++++++++++----- llama/pom.xml | 8 ++-- .../ladenthin/llama/server/NativeServer.java | 38 +++++++++++++++++++ 4 files changed, 68 insertions(+), 15 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 5e1f5e5a..f2e9db12 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -836,7 +836,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`. - `OSInfo` — Detects OS and architecture for library resolution. - **`server` package — OpenAI-compatible HTTP endpoint (a single implementation).** - - `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. 
The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198). + - `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), embeddable and runnable via `java -cp net.ladenthin.llama.server.OpenAiCompatServer …` (the fat-jar default `Main-Class` is now `NativeServer` — see "Two server modes"). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198). 
- **Alternative protocol surfaces** (pure translation over the OpenAI chat core — no second inference path; each reconstructs streamed tool calls via `ToolCallDeltaAccumulator`): **Ollama-native** (`GET /api/version`, `/api/tags`, `POST /api/show`, `/api/chat` with NDJSON streaming, `/api/generate` prompt-completion/FIM — `OllamaApiSupport`; `/api/show` advertises tools/insert/vision capabilities + context length for Copilot's Ollama provider), **Anthropic Messages** (`POST /v1/messages`, SSE event stream — `AnthropicApiSupport` + `AnthropicStreamTranslator`), and **OpenAI Responses** (`POST /v1/responses`, SSE event stream — `ResponsesApiSupport` + `ResponsesStreamTranslator`). The llama.cpp-native `GET /props` (context length + `modalities`) is served via `OpenAiSseFormatter.propsJson` for autocomplete clients that size their context from it. - Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`; `corsAllowOrigin`; `supportsVision`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`; flags incl. `--mmproj`/`--embedding`/`--reranking`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON + usage normalization), `OaiRerankSupport` (pure rerank request/response shaping), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`. - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server". 
@@ -853,8 +853,8 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in The library exposes **two** ways to serve a model over HTTP, on two different transports: -1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. This is the fat-jar `Main-Class`; its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`). -2. **`server.NativeServer` (native transport).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible. +1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. 
It has its own `main` (run via `java -cp net.ladenthin.llama.server.OpenAiCompatServer …`); its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`). +2. **`server.NativeServer` (native transport) — the fat-jar default `Main-Class`.** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible. ### Native Helper Architecture diff --git a/README.md b/README.md index d5c96dd9..e1db2a15 100644 --- a/README.md +++ b/README.md @@ -107,7 +107,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++. - **Infilling** (fill-in-the-middle) for code models. - **Tokenize / detokenize** and **JSON-schema → grammar** conversion. - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`). -- **Runnable OpenAI-compatible HTTP server** (`OpenAiCompatServer`, the fat-jar `Main-Class`, streaming SSE, zero extra dependency): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`. 
+- **Two runnable HTTP server modes.** The fat jar's default `Main-Class` is `NativeServer` — the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI, no separate `llama-server.exe`: `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080`. The Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) is also available: `java -cp …-jar-with-dependencies.jar net.ladenthin.llama.server.OpenAiCompatServer --model model.gguf --port 8080`. - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration). - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build. @@ -591,7 +591,9 @@ array alone at `GET /slots`. OpenAI responses preserve `net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra -dependency and no separate server process. It is both embeddable and the fat-jar `Main-Class`. It +dependency and no separate server process. It is embeddable, and runnable via +`java -cp net.ladenthin.llama.server.OpenAiCompatServer …` (the fat jar's default +`Main-Class` is instead `NativeServer` — see "Native server with the built-in WebUI" below). It serves: | Method & path | Backed by | @@ -646,16 +648,17 @@ try (LlamaModel model = new LlamaModel(modelParams); } ``` -…or run it standalone. The fat jar built by the `assembly` profile (`mvn -P assembly package`) is -runnable (its `Main-Class` is `net.ladenthin.llama.server.OpenAiCompatServer`); the plain library jar -works too via `-cp`: +…or run it standalone. 
It has its own `main`, launched by class name via `-cp` (the fat jar's +default `java -jar` `Main-Class` is `NativeServer` — the native server below — so name +`OpenAiCompatServer` explicitly to get this Java one): ```bash -# fat jar (bundles the native lib + Java deps) -java -jar target/llama--jar-with-dependencies.jar \ +# fat jar (bundles the native lib + Java deps) — name the class explicitly +java -cp target/llama--jar-with-dependencies.jar \ + net.ladenthin.llama.server.OpenAiCompatServer \ --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99 -# or the plain jar +# or the plain library jar java -cp target/llama-.jar net.ladenthin.llama.server.OpenAiCompatServer \ --model models/model.gguf --port 8080 --model-id local-model ``` @@ -716,7 +719,17 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` ( the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use `net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so -every flag works exactly as it does for the standalone binary: +every flag works exactly as it does for the standalone binary. 
It is the fat jar's default +`Main-Class`, so `java -jar` just forwards its args to the native server (pass `--help` for the full +llama-server option list): + +```bash +java -jar target/llama--jar-with-dependencies.jar \ + -m models/model.gguf --host 127.0.0.1 --port 8080 -c 65536 --jinja +# then open http://127.0.0.1:8080/ for the WebUI +``` + +Or embed it: ```java try (NativeServer server = new NativeServer( diff --git a/llama/pom.xml b/llama/pom.xml index 67e6e563..e489161c 100644 --- a/llama/pom.xml +++ b/llama/pom.xml @@ -1296,8 +1296,10 @@ SPDX-License-Identifier: MIT + + net.ladenthin + llama + 5.0.4 + vulkan-linux-x86-64 + + + + + net.ladenthin + llama + 5.0.4 + vulkan-linux-aarch64 + + net.ladenthin diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index 67484b35..01bb9a57 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -249,8 +249,10 @@ endif() # OS-aware because the same GGML flag is used on more than one platform: # - GGML_CUDA -> Linux (resources_linux_cuda) AND Windows (resources_windows_cuda) # - GGML_OPENCL -> Android (resources_android_opencl) AND Windows (resources_windows_opencl) -# - GGML_VULKAN -> Windows only (resources_windows_vulkan) -# The classifier->tree mapping is mirrored by the matching Maven profile in pom.xml. +# - GGML_VULKAN -> Windows (resources_windows_vulkan) AND Linux (resources_linux_vulkan) +# The classifier->tree mapping is mirrored by the matching Maven profile in pom.xml. The Linux +# Vulkan tree holds both x86_64 and aarch64 under Linux/${OS_ARCH}; two Maven profiles +# (vulkan-linux / vulkan-linux-aarch64) split it into one single-arch classifier JAR each. 
if(GGML_CUDA) if(OS_NAME STREQUAL "Windows") set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_cuda/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) @@ -260,8 +262,13 @@ if(GGML_CUDA) message(STATUS "GPU (CUDA Linux) build - Installing files to ${JLLAMA_DIR}") endif() elseif(GGML_VULKAN) - set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_vulkan/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) - message(STATUS "GPU (Vulkan) build - Installing files to ${JLLAMA_DIR}") + if(OS_NAME STREQUAL "Windows") + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_vulkan/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (Vulkan Windows) build - Installing files to ${JLLAMA_DIR}") + else() + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_linux_vulkan/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (Vulkan Linux) build - Installing files to ${JLLAMA_DIR}") + endif() elseif(GGML_OPENCL) if(OS_NAME STREQUAL "Windows") set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_opencl/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) diff --git a/llama/pom.xml b/llama/pom.xml index af521698..636e6da1 100644 --- a/llama/pom.xml +++ b/llama/pom.xml @@ -1149,6 +1149,169 @@ SPDX-License-Identifier: MIT + + + vulkan-linux + + + + org.apache.maven.plugins + maven-compiler-plugin + + + vulkan-linux + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_linux_vulkan + + + + + + maven-resources-plugin + + + copy-resources-vulkan-linux + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_linux_vulkan + + + + ${basedir}/src/main/resources_linux_vulkan/ + + net/ladenthin/llama/Linux/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + vulkan-linux + package + + jar + + + vulkan-linux-x86-64 + + ${project.build.outputDirectory}_linux_vulkan + + + + + + + + + + + vulkan-linux-aarch64 + + + + org.apache.maven.plugins + 
maven-compiler-plugin + + + vulkan-linux-aarch64 + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_linux_vulkan_aarch64 + + + + + + maven-resources-plugin + + + copy-resources-vulkan-linux-aarch64 + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_linux_vulkan_aarch64 + + + + ${basedir}/src/main/resources_linux_vulkan/ + + net/ladenthin/llama/Linux/aarch64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + vulkan-linux-aarch64 + package + + jar + + + vulkan-linux-aarch64 + + ${project.build.outputDirectory}_linux_vulkan_aarch64 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + From d9a6a834c47cc86ac42aa2523d3bd82c02f16110 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 13:45:58 +0000 Subject: [PATCH 16/29] Fix ArchUnit violations from the new server code (layering + no-sleep) Two LlamaArchitectureTest rules failed on PR #293 because of this branch's server additions: - layeredArchitecture (12 violations): the branch adds two legitimate new edges out of the Server layer -- Server -> Args (OpenAiServerCli maps -ctk/-ctv to the args.CacheType enum) and Server -> Loader (NativeServer.start() calls LlamaLoader.initialize() before launching the embedded native server). The rule documents itself as the EXACT set of accessors today, to be updated when a new dependency is intended, so Server is added to the Loader and Args mayOnlyBeAccessedByLayers lists (+ a doc note). Server remains the only layer allowed to reach the Api root and stays mayNotBeAccessedByAnyLayer. - noThreadSleep (1 violation): NativeServer.main() kept the JVM alive with a while(isRunning()) Thread.sleep(200) poll loop. The rule bans Thread.sleep and has no suppression seam (it prefers Condition.await/poll), so main() now blocks on a bounded CountDownLatch.await(200ms) signalled by the shutdown hook. 
This is also a behavioural improvement: Ctrl-C/SIGTERM wakes the wait immediately instead of after up to a 200 ms tick, while the timeout still re-checks isRunning() to catch a self-terminated native worker. Verified: LlamaArchitectureTest 12/12 pass; server-package tests 44/44 pass (ServerLauncher, OpenAiServerCli, NativeServerSmoke); javadoc + spotless clean. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .../net/ladenthin/llama/server/NativeServer.java | 15 ++++++++++++--- .../ladenthin/llama/LlamaArchitectureTest.java | 10 ++++++---- 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java index 778d6b72..ea70e1b0 100644 --- a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java +++ b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java @@ -5,6 +5,8 @@ package net.ladenthin.llama.server; import java.util.Objects; +import java.util.concurrent.CountDownLatch; +import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicBoolean; import lombok.ToString; import net.ladenthin.llama.loader.LlamaLoader; @@ -198,6 +200,10 @@ public void close() { public static void main(String[] args) throws InterruptedException { final NativeServer server = new NativeServer(args); final AtomicBoolean stoppedByHook = new AtomicBoolean(false); + // Signalled by the shutdown hook so the main thread wakes immediately on Ctrl-C / SIGTERM + // rather than waiting out a poll tick — and so the wait uses a bounded latch await instead of + // Thread.sleep (banned by LlamaArchitectureTest.noThreadSleep). + final CountDownLatch stopSignal = new CountDownLatch(1); // Graceful Ctrl-C / SIGTERM: the embedded server installs no signal handlers of its own // (see patches/0006), so the JVM-level shutdown hook is what stops it before exit. 
Runtime.getRuntime() @@ -205,13 +211,16 @@ public static void main(String[] args) throws InterruptedException { () -> { stoppedByHook.set(true); server.close(); + stopSignal.countDown(); }, "jllama-native-server-shutdown")); server.start(); // Keep the JVM alive until the native worker exits — on its own (e.g. a fatal startup/model - // error that llama_server has already logged) or because the shutdown hook stopped it. - while (server.isRunning()) { - Thread.sleep(200L); + // error that llama_server has already logged) or because the shutdown hook stopped it. The + // bounded await returns early when the hook fires; on timeout we re-check isRunning() to catch + // a self-terminated worker. + while (server.isRunning() && !stopSignal.await(200L, TimeUnit.MILLISECONDS)) { + // wait for the native worker to exit or the shutdown hook to fire } if (!stoppedByHook.get()) { server.close(); diff --git a/llama/src/test/java/net/ladenthin/llama/LlamaArchitectureTest.java b/llama/src/test/java/net/ladenthin/llama/LlamaArchitectureTest.java index 28897996..a3100e57 100644 --- a/llama/src/test/java/net/ladenthin/llama/LlamaArchitectureTest.java +++ b/llama/src/test/java/net/ladenthin/llama/LlamaArchitectureTest.java @@ -94,8 +94,10 @@ public class LlamaArchitectureTest { * intend it. Conceptual tiers (informational): {@code Server} > {@code Api} (root) > * {@code Loader} > {@code Json}/{@code Parameters} > * {@code Value}/{@code Callback}/{@code Exception}/{@code Args}. The {@code Server} layer is the - * optional OpenAI-compatible HTTP entry point; it is the only layer permitted to access the - * {@code Api} root. 
+ * optional OpenAI-compatible HTTP / native-server entry point; it is the only layer permitted to + * access the {@code Api} root, and it also reaches the {@code Loader} ({@code NativeServer} + * triggers {@code LlamaLoader.initialize()} before starting the embedded native server) and the + * {@code Args} enums ({@code OpenAiServerCli} maps {@code -ctk}/{@code -ctv} to {@code CacheType}). */ @ArchTest static final ArchRule layeredArchitecture = layeredArchitecture() @@ -121,7 +123,7 @@ public class LlamaArchitectureTest { .whereLayer("Api") .mayOnlyBeAccessedByLayers("Server") .whereLayer("Loader") - .mayOnlyBeAccessedByLayers("Api") + .mayOnlyBeAccessedByLayers("Api", "Server") .whereLayer("Json") .mayOnlyBeAccessedByLayers("Api") .whereLayer("Parameters") @@ -133,7 +135,7 @@ public class LlamaArchitectureTest { .whereLayer("Exception") .mayOnlyBeAccessedByLayers("Api", "Loader") .whereLayer("Args") - .mayOnlyBeAccessedByLayers("Api", "Loader", "Parameters") + .mayOnlyBeAccessedByLayers("Api", "Loader", "Parameters", "Server") .whereLayer("Server") .mayNotBeAccessedByAnyLayer(); From bfc766c07d80440655e692f73819ea20ee8bafd2 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 13:49:39 +0000 Subject: [PATCH 17/29] Windows arm64: disable OpenMP so the clang-cl build is self-contained MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The clang-cl fix worked (jllama.dll linked for Windows/aarch64), but the next step failed at test discovery: gtest_discover_tests could not launch jllama_test.exe -> exit 0xc0000135 (STATUS_DLL_NOT_FOUND). Root cause: with clang-cl, ggml links LLVM's OpenMP runtime (libomp.lib -> needs libomp140.aarch64.dll at run time). Unlike MSVC's ambient vcomp140.dll on x64, that DLL is not on PATH, so neither the test exe nor a consumer could load the binary. (Upstream llama.cpp works around this by copying libomp140.aarch64.dll next to its arm64 output.) Fix: pass -DGGML_OPENMP=OFF for the arm64 job. 
ggml falls back to its own std::thread threadpool, so both jllama_test.exe and the shipped arm64 jllama.dll are self-contained with no libomp dependency to ship — cleaner than bundling an LLVM OpenMP DLL into the default JAR. The x86_64/x86 jobs keep OpenMP (MSVC vcomp, which is ambient and already proven). Also updated the job comment + CLAUDE.md to record that VC\Tools\Llvm\ARM64 supplies clang-cl/lld-link (no separate LLVM install needed) and the OpenMP rationale. The getenv/strdup/ctime deprecation messages in the same log are warnings only (clang-cl flagging POSIX names against the MSVC UCRT headers), not the failure. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 13 +++++++++---- CLAUDE.md | 8 +++++++- 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 6a151eea..c5c1c618 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -887,9 +887,14 @@ jobs: # clang-cl (LLVM's MSVC-compatible driver) satisfies that guard (its compiler id is "Clang") # while still leaving CMake's MSVC=TRUE, so our static /MT CRT block (CMAKE_MSVC_RUNTIME_LIBRARY # in CMakeLists.txt) keeps applying and the generator stays Ninja Multi-Config. msvc-dev-cmd - # (arm64) supplies the MSVC headers/libs/linker that clang-cl links against. NOTE: clang-cl must - # be on PATH (the VS "C++ Clang tools" component / LLVM); if a first CI run reports it missing, - # add an LLVM setup step here. + # (arm64) supplies the MSVC headers/libs/linker AND the bundled clang-cl / lld-link under + # VC\Tools\Llvm\ARM64, so no separate LLVM install is needed. 
+ # + # GGML_OPENMP=OFF: with clang-cl, ggml links LLVM's OpenMP (libomp.lib -> needs libomp140.aarch64.dll + # at runtime), which is NOT on PATH like MSVC's ambient vcomp140.dll on x64 — so gtest_discover_tests + # (and any consumer) failed to launch the binary with 0xc0000135 STATUS_DLL_NOT_FOUND. Turning OpenMP + # off makes ggml use its own std::thread threadpool, so the arm64 jllama.dll (and the test exe) are + # self-contained with no libomp dependency to ship. The x86_64/x86 jobs keep OpenMP (MSVC vcomp). runs-on: windows-11-arm steps: - uses: actions/checkout@v7 @@ -909,7 +914,7 @@ jobs: # explicitly (so the OSInfo-class OS-detection path is skipped) — same as the x86_64 job. # clang-cl (see the job comment) is required: ggml refuses MSVC cl.exe on ARM. run: | - .github\build.bat -G "Ninja Multi-Config" -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DOS_NAME=Windows -DOS_ARCH=aarch64 -DBUILD_TESTING=ON + .github\build.bat -G "Ninja Multi-Config" -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DGGML_OPENMP=OFF -DOS_NAME=Windows -DOS_ARCH=aarch64 -DBUILD_TESTING=ON - name: Run C++ unit tests run: ctest --test-dir llama/build --output-on-failure - name: Upload artifacts diff --git a/CLAUDE.md b/CLAUDE.md index 7869417f..5d2ddc26 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -301,7 +301,13 @@ via `if (MSVC AND NOT CMAKE_C_COMPILER_ID STREQUAL "Clang")`; `clang-cl` (LLVM's satisfies that guard (compiler id `"Clang"`) while keeping CMake's `MSVC=TRUE`, so the static `/MT` CRT block still applies and the generator stays Ninja Multi-Config. The job passes `-DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl`; `msvc-dev-cmd` supplies the MSVC -headers/libs/linker `clang-cl` links against. (Upstream llama.cpp instead cross-compiles arm64 from an +headers/libs/linker **and the bundled clang-cl/lld-link** (`VC\Tools\Llvm\ARM64`), so no separate +LLVM install is needed. 
It also passes **`-DGGML_OPENMP=OFF`**: with clang-cl, ggml links LLVM's +OpenMP (`libomp.lib` → `libomp140.aarch64.dll` at runtime), which — unlike MSVC's ambient +`vcomp140.dll` on x64 — is not on `PATH`, so the test exe (and any consumer) failed to launch with +`0xc0000135` (`STATUS_DLL_NOT_FOUND`). Disabling OpenMP makes ggml use its own `std::thread` +threadpool, leaving the arm64 `jllama.dll` self-contained (the x86_64/x86 jobs keep OpenMP via MSVC +`vcomp`). (Upstream llama.cpp instead cross-compiles arm64 from an x64 runner with `vcvarsall amd64_arm64` + a `clang`/`clang++` toolchain file and no arm64 tests; the native-runner + `clang-cl` route here keeps the `/MT` CRT and lets `ctest` run on real ARM hardware.) From c6ac704f354a11cf165fba43ccfb5e0ac8fbd9a0 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 14:38:11 +0000 Subject: [PATCH 18/29] Upgrade llama.cpp from b9864 to b9866 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Trivial additive upgrade — no incompatibilities, no project source changes. Bumps GIT_TAG (+ the TTS provenance banner), the README badge/link, and the CLAUDE.md pinned-version line + build examples. The b9864..b9866 diff is backend/WebUI-only: the CUDA topk-moe kernel gains a case 288 instantiation + accepts n_expert==288 (StepFun 3.7's non-power-of-2 expert count) — device-side, affecting only the cuda13 classifiers; a test-backend-ops.cpp case (not built here, LLAMA_BUILD_TESTS OFF); and WebUI changes (a config string-boolean normalization migration + a thinking-default flip) that auto-follow the pinned GIT_TAG via the build-webui job. The project binds no new symbol. Patch verification: the diff touches no patch-target file and no OuteTTS anchor, so all six patches are byte-identical to b9864. 
Confirmed end-to-end by a clean cmake configure: b9866 fetched (case 288 present) and all six patches applied via the fail-loud PATCH_COMMAND (exit 0; 0005 + 0006 markers present), OuteTTS anchors held. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- CLAUDE.md | 12 ++++++------ README.md | 2 +- docs/history/llama-cpp-breaking-changes.md | 2 ++ llama/CMakeLists.txt | 4 ++-- 4 files changed, 11 insertions(+), 9 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 5d2ddc26..21b962b7 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI. -Current llama.cpp pinned version: **b9864** +Current llama.cpp pinned version: **b9866** ## Upgrading CUDA Version @@ -344,7 +344,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi ships no UI): ```bash # needs node/npm + network; embed.cpp is plain C++17 (no npm) -git clone --depth 1 --branch b9864 https://github.com/ggml-org/llama.cpp /tmp/lc +git clone --depth 1 --branch b9866 https://github.com/ggml-org/llama.cpp /tmp/lc ( cd /tmp/lc/tools/ui && npm ci && npm run build \ && ( cd dist && find . -type f -not -path './_gzip/*' \ | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \ @@ -384,7 +384,7 @@ cache lives in **Depot Cache** over sccache's **WebDAV** backend: - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored as the repo secret **`DEPOT_TOKEN`**. 
-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9864`), the +Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9866`), the ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the *changed* files. Depot's cache is **shared across all branches** (unlike GitHub's per-branch `actions/cache`), so every branch builds incrementally; a `b` version bump @@ -497,7 +497,7 @@ Current patches: | `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b` includes the change. | | `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. 
Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b` includes the change. | | `0005-server-recurrent-near-prompt-end-checkpoints.patch` | **Multi-turn tool-calling perf fix for recurrent/hybrid models (e.g. Granite-4)**, upstream-submittable. In `server_context::update_slots` (`tools/server/server-context.cpp`) the near-prompt-end context checkpoints are gated by `checkpoint_min_step` (default 8192 tokens). An agentic conversation that appends only assistant/tool messages never produces a new user-message checkpoint (`is_user_start`/`is_last_user_message` match `COMMON_CHAT_ROLE_USER` only), so after turn 1 no new checkpoint is ever created and — because recurrent state can only roll back to a checkpoint — **every turn re-prefills the whole conversation tail** (measured on a synthetic granitehybrid model: prefilled tokens grew 901 → 1544 → 2187 → 2830 → 3473 over turns 2–6). The patch (1) exempts near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (`ctx_tgt_seq_rm_type` is `FULL` or `RS` — SWA-only models are unaffected), and (2) skips creating a checkpoint whose position equals the newest one (the last-user-message checkpoint was re-created identically on every turn, flooding the 32-entry list). After the patch each turn restores the previous turn's near-end checkpoint and prefill is constant (~new-turn-sized; 647 tokens/turn in the same measurement, ≈5.4× less prefill at turn 6 and growing with conversation length). Validated output-identical (`temperature=0`) vs. unpatched. Complements — not duplicates — open upstream PRs #24035/#24899/#24891 (they fix checkpoint *invalidation/retention*; this fixes checkpoint *starvation*). Drop once upstream solves agentic checkpoint placement (e.g. 
a merged role-boundary checkpointing design, cf. #21885 / #22826 discussion). | -| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. b9864 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). `llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | +| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. 
b9866 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). `llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`) @@ -911,7 +911,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in - `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable. - `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`. - Uses `nlohmann/json` for JSON deserialization of parameters. -- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. 
**`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). `server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). 
**`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9864 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. +- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. **`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). 
`server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). **`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9866 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. ### Two server modes (`OpenAiCompatServer` vs `NativeServer`) @@ -1100,7 +1100,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson" #### Upstream source location (in CMake build tree) -llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9864`. +llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9866`. 
**GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the diff --git a/README.md b/README.md index 88e2d948..ff968310 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ **Build:** ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational) ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey) -[![llama.cpp b9864](https://img.shields.io/badge/llama.cpp-%23b9864-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9864) +[![llama.cpp b9866](https://img.shields.io/badge/llama.cpp-%23b9866-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9866) [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/) ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162) [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev) diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md index 32849278..6a26f71e 100644 --- a/docs/history/llama-cpp-breaking-changes.md +++ b/docs/history/llama-cpp-breaking-changes.md @@ -417,3 +417,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r | b9859–b9862 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9862. The b9859→b9862 diff touches only two patch-target files — `tools/server/server-context.cpp` and `server-context.h` (the `model_ftype`/`get_meta`/`get_model_info` additions at ~L3989/~L5121 and the new struct field at ~L50). 
Patches **0002** (load-progress guard, ~L1152), **0003** (slot-prompt-similarity getter/setter, ~L3965 + `server_context` struct ~L106) and **0005** (near-prompt-end checkpoints, `update_slots` ~L3560) were **applied in sequence** against the actual b9862 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all three applied cleanly (their regions are disjoint from and far from the b9862 additions). Patches **0001** (`common/arg.{cpp,h}`, `test-arg-parser.cpp`, ~34 standalone mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not present** in the b9859→b9862 changed-file list, so their hunks are byte-identical to b9859 and apply unchanged. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. | | b9862–b9864 | `tools/server/server-context.cpp` + `server-schema.cpp` + `server-task.h` + `tools/server/README.md` + `tools/ui/**` | **New feature (additive), no break.** Adds a **per-request `sse_ping_interval`** to the completion API: `task_params` gains `int32_t sse_ping_interval = 30` (`server-task.h`), `make_llama_cmpl_schema` exposes it as a `field_num` with hard limits `[-1, INT32_MAX]` and `eval_llama_cmpl_schema` seeds it from `params_base.sse_ping_interval` (`server-schema.cpp`), and `handle_completions_impl` (`server-context.cpp`, ~L4089) captures the per-task value (instead of the server-level `params.sse_ping_interval`) into the SSE `next` lambda so a request can override the server `--sse-ping-interval` (`-1` disables pings). All inside upstream-compiled server TUs the project already links; the project binds no new symbol. **NativeServer** mode gets it for free (full `llama_server`). 
The rest of the diff is the **Svelte WebUI** (`tools/ui/**`: MCP server recommendations dialog, a bearer-token Authorization field, migration of the MCP default-enabled key into settings config, `STREAM_VISIBILITY_KICK_MS` 1000→3000, + Vitest units) — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required for the bump itself. **Follow-up (done):** `InferenceParameters.withSsePingInterval(int)` now emits the `sse_ping_interval` key (it flows through the OAI-compat completion path via `eval_llama_cmpl_schema`), covered by a Java wither test + three C++ schema round-trip guards (round-trip, `-1` disables, below-hard-limit throws, absent inherits the server default). The same follow-up **audited the completion schema for other already-parseable-but-unexposed fields** and added the plain-scalar wins as withers: `withXtcProbability`/`withXtcThreshold` (XTC sampler), `withNDiscard`, `withNIndent`, `withTMaxPredictMs`, `withPostSamplingProbs`, `withTimingsPerToken`, `withReturnTokens`. (`t_max_prompt_ms` was deliberately skipped — it is commented out `// TODO: implement` in b9864's `make_llama_cmpl_schema`, so it is not parseable.) Remaining schema fields left unexposed on purpose: OAI aliases already covered (`max_tokens`/`max_completion_tokens` → `n_predict`), OAI/server-internal or array-shaped/advanced knobs (`n`/`n_cmpl`, `logprobs`, `echo`, `verbose`, `include_usage`, `return_progress`, `response_fields`, `lora`, `grammar_lazy`/`grammar_triggers`/`preserved_tokens`, `chat_format`, `parse_tool_calls`, `reasoning_control`, `backend_sampling`, `adaptive_*`). | | b9862–b9864 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9864. 
The b9862→b9864 diff touches exactly one patch-target file — `tools/server/server-context.cpp` — and only in `handle_completions_impl` (~L4089), far below every patched region (0002 load-progress guard ~L1152, 0005 near-prompt-end checkpoints ~L3560, 0003 slot-prompt-similarity getter/setter ~L3965). Patches **0002/0003/0005** were **applied in sequence** against the actual b9864 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all clean. `server-context.h` is unchanged in this range (so 0003's `.h` hunk is byte-identical); `server-schema.cpp`/`server-task.h` are **not** patch targets. Patches **0001** (`common/arg.*`, `test-arg-parser.cpp`, ~34 mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not** in the changed-file list, so they apply unchanged. Confirmed end-to-end by a clean `cmake` configure: b9864 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held (`tools/tts/tts.cpp` unchanged). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. | +| b9864–b9866 | `ggml/src/ggml-cuda/topk-moe.cu` + `tests/test-backend-ops.cpp` + `tools/ui/**` | Backend/WebUI-only, no API surface. (1) **CUDA topk-moe** gains a `case 288` instantiation (`topk_moe_cuda<288>`) and `ggml_cuda_should_use_topk_moe` now also accepts `n_expert == 288` (the non-power-of-2 expert count of **StepFun 3.7**) — a device-side kernel add, internal to `ggml-cuda`, affecting only the `cuda13-*` classifiers (a StepFun-3.7 MoE GGUF now uses the fused topk-moe path on CUDA instead of the generic fallback). (2) `test-backend-ops.cpp` adds the matching `test_topk_moe({288,22,1,1}, …)` case — **not built here** (`LLAMA_BUILD_TESTS` OFF for the FetchContent subproject). 
(3) **WebUI** (`tools/ui/**`): a `config-type-normalization-v1` migration coercing legacy string-encoded booleans in persisted config back to real booleans (the strict server schema now rejects `"true"`/`"false"` strings), and a thinking-enabled default flip to `true` — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required. | +| b9864–b9866 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9866. The b9864→b9866 diff touches **no** patch-target file (`common/arg.*`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `server-schema.cpp`, `server-task.h`, `server.cpp`, `test-arg-parser.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edits are `ggml-cuda/topk-moe.cu`, `tests/test-backend-ops.cpp` and `tools/ui/**` — so every patch hunk/offset is byte-identical to b9864. Confirmed end-to-end by a clean `cmake` configure: b9866 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held. Full build + `ctest` (target 462/462) to be confirmed by the CI pipeline. 
| diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index 462275cb..7552b096 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) FetchContent_Declare( llama.cpp GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git - GIT_TAG b9864 + GIT_TAG b9866 PATCH_COMMAND ${CMAKE_COMMAND} -DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches -DLLAMA_SRC= @@ -166,7 +166,7 @@ execute_process( COMMAND ${CMAKE_COMMAND} -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp -DOUT_CPP=${JLLAMA_TTS_GEN_CPP} - -DLLAMA_TAG=b9864 + -DLLAMA_TAG=b9866 -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT ) From 19303d0ec7f7de84318ceb64652ad90d394e8312 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 14:55:24 +0000 Subject: [PATCH 19/29] Automate llama.cpp version-bump target selection and chunking Add a diff-size-driven bump workflow so a version upgrade never lands an unreviewably large diff in one step. - .github/scripts/llama-next-version.sh: read-only helper that computes the next reviewable step. Reads the current pin from llama/CMakeLists.txt and the target from an explicit b arg or the GitHub releases atom feed, against a cached blobless mirror clone. If git diff cur..target is under the threshold (LLAMA_BUMP_MAX_DIFF_KB, default 100 KiB) it bumps straight to the target; otherwise it binary-searches the intermediate tags for the largest one still under the threshold and prints that chunk plus its compare/.patch URLs. LLAMA_BUMP_EXCLUDE_WEBUI sizes the diff excluding the auto-followed tools/ui WebUI. - docs/upgrade/llama-cpp-version-bump.md: the runbook (documentation root) for target selection, byte-size chunking, the helper, and the edit/verify/commit loop. - CLAUDE.md: link the runbook from the Upgrading/Downgrading section. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/scripts/llama-next-version.sh | 123 +++++++++++++++++++++++ CLAUDE.md | 7 ++ docs/upgrade/llama-cpp-version-bump.md | 132 +++++++++++++++++++++++++ 3 files changed, 262 insertions(+) create mode 100755 .github/scripts/llama-next-version.sh create mode 100644 docs/upgrade/llama-cpp-version-bump.md diff --git a/.github/scripts/llama-next-version.sh b/.github/scripts/llama-next-version.sh new file mode 100755 index 00000000..517d2e9b --- /dev/null +++ b/.github/scripts/llama-next-version.sh @@ -0,0 +1,123 @@ +#!/usr/bin/env bash +# SPDX-FileCopyrightText: 2026 Bernard Ladenthin +# +# SPDX-License-Identifier: MIT +# +# Pick the NEXT llama.cpp tag to bump the pin to, one reviewable chunk at a time. +# +# The runbook this supports is docs/upgrade/llama-cpp-version-bump.md. Strategy: +# * TARGET = the topmost RELEASE on the GitHub releases page (read from the release atom feed), +# or an explicit "b" passed as $1. +# * CURRENT = the pinned tag in llama/CMakeLists.txt (GIT_TAG b). +# * If `git diff CURRENT..TARGET` is smaller than the threshold (default 100 KiB), bump straight +# to TARGET. Otherwise CHUNK: pick the largest intermediate b tag whose diff from CURRENT +# is still under the threshold, so each bump stays a small, reviewable patch. Re-run after each +# bump to walk the remaining chunks up to TARGET. +# +# This tool only READS (a cached mirror clone + the pin file); it never edits the repo. Apply the +# bump by hand per the runbook. It prints the compare/.patch URLs for the chosen step. 
+# +# Env: +# LLAMA_BUMP_MAX_DIFF_KB per-step diff-size threshold in KiB (default 100) +# LLAMA_BUMP_EXCLUDE_WEBUI if "1", size the diff EXCLUDING tools/ui (the auto-followed WebUI, which +# does not need per-bump review); default 0 = the full diff you paste/review +# LLAMA_BUMP_CACHE mirror-clone location (default ~/.cache/jllama-llamacpp-mirror) +# +# Network: needs read access to github.com (git clone/fetch + the release atom feed). No token. + +set -euo pipefail + +THRESHOLD_KB="${LLAMA_BUMP_MAX_DIFF_KB:-100}" +THRESHOLD=$((THRESHOLD_KB * 1024)) +EXCLUDE_WEBUI="${LLAMA_BUMP_EXCLUDE_WEBUI:-0}" +REPO="ggml-org/llama.cpp" +GIT_URL="https://github.com/${REPO}.git" +CACHE="${LLAMA_BUMP_CACHE:-$HOME/.cache/jllama-llamacpp-mirror}" +ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)" +CMAKELISTS="$ROOT/llama/CMakeLists.txt" + +# --- current pinned tag number, e.g. "GIT_TAG b9866" -> 9866 ----------------------------------- +cur="$(grep -oE 'GIT_TAG[[:space:]]+b[0-9]+' "$CMAKELISTS" | grep -oE '[0-9]+' | head -1 || true)" +[ -n "$cur" ] || { echo "ERROR: could not read 'GIT_TAG b' from $CMAKELISTS" >&2; exit 1; } + +# --- cached blobless mirror of llama.cpp (clone once, then fetch tags) -------------------------- +if [ -d "$CACHE/.git" ]; then + git -C "$CACHE" fetch --quiet --tags --prune origin || true +else + echo "cloning ${REPO} (blobless) into $CACHE (one-time) ..." >&2 + git clone --filter=blob:none --no-checkout --quiet "$GIT_URL" "$CACHE" +fi + +# --- target: explicit "$1" (b) or the latest RELEASE from the atom feed ------------------- +if [ "${1:-}" != "" ]; then + target="$(printf '%s' "$1" | grep -oE '[0-9]+' | head -1)" + [ -n "$target" ] || { echo "ERROR: '$1' is not a b tag" >&2; exit 1; } +else + feed="$(curl -sSL --fail --retry 4 --retry-delay 2 "https://github.com/${REPO}/releases.atom" 2>/dev/null || true)" + [ -n "$feed" ] || { echo "ERROR: cannot fetch the releases feed (network/rate limit). 
Read the topmost release at https://github.com/${REPO}/releases and pass it: $0 b" >&2; exit 2; } + target="$(printf '%s' "$feed" | grep -oE 'releases/tag/b[0-9]+' | grep -oE '[0-9]+' | sort -un | tail -1)" + [ -n "$target" ] || { echo "ERROR: parsed no release tags from the feed." >&2; exit 3; } +fi + +git -C "$CACHE" rev-parse -q --verify "b${cur}^{commit}" >/dev/null 2>&1 || { echo "ERROR: b$cur is not a tag in the mirror" >&2; exit 3; } +git -C "$CACHE" rev-parse -q --verify "b${target}^{commit}" >/dev/null 2>&1 || { echo "ERROR: b$target is not a tag in the mirror" >&2; exit 3; } + +# diff byte size between two tag numbers, honoring the WebUI-exclusion toggle +diffsize() { + if [ "$EXCLUDE_WEBUI" = "1" ]; then + git -C "$CACHE" diff "b$1" "b$2" -- . ':(exclude)tools/ui' 2>/dev/null | wc -c + else + git -C "$CACHE" diff "b$1" "b$2" 2>/dev/null | wc -c + fi +} + +scope="full diff" +[ "$EXCLUDE_WEBUI" = "1" ] && scope="diff excluding tools/ui" +echo "current pin : b$cur" +echo "latest release : b$target" +echo "threshold : ${THRESHOLD_KB} KiB per step (${scope})" + +if [ "$cur" -ge "$target" ]; then + echo "=> up to date — no bump needed." 
+ exit 0 +fi + +# --- choose next step: TARGET if it fits, else the largest intermediate tag under the threshold - +if [ "$(diffsize "$cur" "$target")" -lt "$THRESHOLD" ]; then + next="$target" +else + # existing b-tags strictly after cur, up to and including target, ascending + # shellcheck disable=SC2207 + cands=($(git -C "$CACHE" tag -l 'b*' | grep -oE 'b[0-9]+' | grep -oE '[0-9]+' | sort -un \ + | awk -v c="$cur" -v t="$target" '$1 > c && $1 <= t')) + # binary search for the largest candidate whose diff from cur is under the threshold + # (diff size grows monotonically enough with the tag number for this to be a safe heuristic) + lo=0; hi=$(( ${#cands[@]} - 1 )); best="" + while [ "$lo" -le "$hi" ]; do + mid=$(( (lo + hi) / 2 )); T="${cands[$mid]}" + if [ "$(diffsize "$cur" "$T")" -lt "$THRESHOLD" ]; then best="$T"; lo=$(( mid + 1 )); else hi=$(( mid - 1 )); fi + done + if [ -n "$best" ]; then + next="$best" + else + next="${cands[0]}" + echo "NOTE: even b$cur..b$next exceeds ${THRESHOLD_KB} KiB — a single-commit step this large is unavoidable." >&2 + fi +fi + +full=$(git -C "$CACHE" diff "b$cur" "b$next" | wc -c) +noui=$(git -C "$CACHE" diff "b$cur" "b$next" -- . ':(exclude)tools/ui' | wc -c) +commits=$(git -C "$CACHE" rev-list --count "b$cur".."b$next") +echo +echo "next step : b$cur -> b$next" +echo " diff size : $((full / 1024)) KiB full / $((noui / 1024)) KiB excluding tools/ui (auto-followed WebUI)" +echo " commits : $commits" +if [ "$next" -eq "$target" ]; then + echo " progress : reaches the latest release — final chunk" +else + echo " progress : intermediate chunk — re-run this script after the bump for the next one" +fi +echo " review diff : https://github.com/${REPO}/compare/b$cur...b$next" +echo " raw .patch : https://github.com/${REPO}/compare/b$cur...b$next.patch" +echo +echo "Apply this bump per docs/upgrade/llama-cpp-version-bump.md (b$cur -> b$next)." 
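The release-tag extraction used by the helper above can be checked offline, without touching the network. The following is a minimal sketch, assuming a feed fragment shaped like the made-up sample below (real atom entries carry more markup). `sample_feed` and `latest_release_number` are illustrative names, not part of the script; only the `grep | grep | sort | tail` pipeline is taken from it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical feed fragment for illustration only; the entries are deliberately
# out of order to show that the pipeline picks the highest tag number, not the first.
sample_feed='<link href="https://github.com/ggml-org/llama.cpp/releases/tag/b9866"/>
<link href="https://github.com/ggml-org/llama.cpp/releases/tag/b9867"/>
<link href="https://github.com/ggml-org/llama.cpp/releases/tag/b9859"/>'

# Same pipeline as the helper: keep the "releases/tag/b<digits>" matches,
# reduce each to its digits, sort numerically (unique), take the largest.
latest_release_number() {
  printf '%s' "$1" \
    | grep -oE 'releases/tag/b[0-9]+' \
    | grep -oE '[0-9]+' \
    | sort -un \
    | tail -1
}

latest_release_number "$sample_feed"   # prints 9867
```

Because the result is the numeric maximum, the order of entries in the feed does not matter. A feed with no matching entries makes the first `grep` fail, which is why the script guards the pipeline's result with an explicit emptiness check.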
diff --git a/CLAUDE.md b/CLAUDE.md index 21b962b7..badca380 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -536,6 +536,13 @@ re-verify the generator the same way you re-verify `patches/`. ## Upgrading/Downgrading llama.cpp Version +**Runbook (documentation root):** [`docs/upgrade/llama-cpp-version-bump.md`](docs/upgrade/llama-cpp-version-bump.md) +covers the full bump process end-to-end — picking the target (topmost GitHub release, via the atom +feed), **chunking by `git diff` byte-size** (bump straight to the target when the diff is < 100 KiB, +else step through the largest intermediate tag still under the threshold), the +`.github/scripts/llama-next-version.sh` helper that computes the next reviewable step, and the +edit/verify/commit loop below. Use it for any non-trivial bump; the steps here are the mechanical core. + To change the llama.cpp version, update the following **three** files (and re-verify `patches/`): 1. **llama/CMakeLists.txt** — the `GIT_TAG` line for llama.cpp: `GIT_TAG b8831` diff --git a/docs/upgrade/llama-cpp-version-bump.md b/docs/upgrade/llama-cpp-version-bump.md new file mode 100644 index 00000000..8dc3e864 --- /dev/null +++ b/docs/upgrade/llama-cpp-version-bump.md @@ -0,0 +1,132 @@ +# llama.cpp version-bump runbook + +This is the **documentation root** for bumping the pinned llama.cpp version. It links the +mechanical edit steps in [`../../CLAUDE.md`](../../CLAUDE.md#upgradingdowngrading-llamacpp-version) +together with a repeatable **target-selection + chunking** strategy so a bump never lands an +unreviewably large diff in one step. + +The current pin lives in `llama/CMakeLists.txt` as `GIT_TAG b`. llama.cpp tags **every** +master commit as `b`, but only a subset get GitHub *Releases*. + +--- + +## TL;DR + +```bash +# From the repo root. Prints the next reviewable step (b -> b) and its compare/.patch URLs. 
+.github/scripts/llama-next-version.sh # target = latest RELEASE (atom feed) +.github/scripts/llama-next-version.sh b9900 # target = an explicit tag +``` + +Then apply the printed `b -> b` step per [§ Applying a bump](#applying-a-bump) and re-run +the script to walk the next chunk, until it prints **"reaches the latest release — final chunk"**. + +--- + +## 1. Pick the target (topmost release) + +The **target candidate is the topmost release** on +. Read it from the release **atom feed**, which is +reachable from restricted sandboxes where the ggml-org REST API is blocked: + +``` +https://github.com/ggml-org/llama.cpp/releases.atom +``` + +The first ``'s `releases/tag/b` is the latest release. `llama-next-version.sh` does this +for you; if the feed is rate-limited (repeated unauthenticated fetches can return empty), open the +releases page in a browser and pass the tag explicitly: `llama-next-version.sh b`. + +> **Why releases, not just the newest `b` tag:** releases are the versions upstream deems +> shippable; an arbitrary master commit tag may be mid-refactor. Intermediate **chunk** steps +> (below) are allowed to land on non-release tags — they are transient waypoints, not the target. + +## 2. Chunk by diff **byte-size**, not commit count + +The step size is governed by the **size of `git diff` between the pinned tag and the target**, not by +how many commits separate them: + +- If `git diff b b` is **< 100 KiB**, bump straight to the target in one step. +- If it is **≥ 100 KiB**, pick an **intermediate** `b` tag whose diff from the current pin is the + largest still **under** the threshold, bump to that first, then repeat. Each step stays a small, + reviewable patch. + +The threshold is a knob (`LLAMA_BUMP_MAX_DIFF_KB`, default `100`). This is a heuristic: diff size grows +monotonically enough with the tag number that the helper binary-searches the intermediate tags safely. 
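The chunk selection described above can be sketched offline with synthetic numbers. This is an illustration under stated assumptions, not the helper itself: the tag numbers and byte sizes in `cand_tags` / `cand_sizes` are hypothetical, `largest_under` is an illustrative name, and the sizes are assumed to grow with the tag number (the same monotonicity heuristic the helper relies on). In the real helper each size comes from `git diff ... | wc -c`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical candidates strictly after the current pin, ascending by tag number,
# with the (made-up) diff size in bytes from the current pin to each candidate.
cand_tags=(9860 9861 9862 9863 9864 9866)
cand_sizes=(12000 30000 46000 110000 120000 136000)

# Binary search for the largest candidate whose diff size is under the threshold.
# Prints the tag number, or an empty line when even the first candidate is too large.
largest_under() {
  local threshold="$1" lo=0 hi=$(( ${#cand_tags[@]} - 1 )) best="" mid
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if [ "${cand_sizes[$mid]}" -lt "$threshold" ]; then
      best="${cand_tags[$mid]}"   # fits: remember it and look for a larger one
      lo=$(( mid + 1 ))
    else
      hi=$(( mid - 1 ))           # too large: look at smaller candidates
    fi
  done
  printf '%s\n' "$best"
}

largest_under $(( 100 * 1024 ))   # prints 9862
```

With these made-up sizes the 100 KiB threshold selects `9862`. The search needs only about log2(n) size measurements, which matters when each measurement is a `git diff` against a blobless clone. If the sizes were not monotone, the search could miss a larger tag that still fits, so the result is a reasonable step rather than a guaranteed optimum.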
+ +> **`tools/ui` (the WebUI) dominates the full diff** and is *auto-followed* — CI rebuilds the matching +> Svelte UI from the pinned `GIT_TAG`, so it needs no per-bump source review. To size the diff on the +> code you actually review, set `LLAMA_BUMP_EXCLUDE_WEBUI=1` (the helper prints both figures regardless). + +### The helper: `.github/scripts/llama-next-version.sh` + +It only **reads** — a cached blobless mirror clone of llama.cpp plus `llama/CMakeLists.txt`; it never +edits the repo. It prints the chosen `b -> b` step, its full and WebUI-excluded diff size, +the commit count, and the `compare` / `.patch` URLs. Environment: + +| Var | Default | Meaning | +|---|---|---| +| `LLAMA_BUMP_MAX_DIFF_KB` | `100` | Per-step diff-size threshold, in KiB. | +| `LLAMA_BUMP_EXCLUDE_WEBUI` | `0` | `1` = size the diff **excluding** `tools/ui`. | +| `LLAMA_BUMP_CACHE` | `~/.cache/jllama-llamacpp-mirror` | Mirror-clone location (cloned once, then fetched). | + +Worked example — pin `b9859`, latest release `b9866` (full diff 133 KiB ≥ 100 KiB, so it chunks): + +``` +$ .github/scripts/llama-next-version.sh b9866 +current pin : b9859 +latest release : b9866 +threshold : 100 KiB per step (full diff) + +next step : b9859 -> b9862 + diff size : 45 KiB full / ... KiB excluding tools/ui (auto-followed WebUI) + commits : 3 + progress : intermediate chunk — re-run this script after the bump for the next one + review diff : https://github.com/ggml-org/llama.cpp/compare/b9859...b9862 + raw .patch : https://github.com/ggml-org/llama.cpp/compare/b9859...b9862.patch +``` + +## 3. Review the chunk's diff + +Fetch the printed `compare/...patch` URL (or open the `compare` page). 
Walk it against the +**priority-ordered API-compatibility review list** in +[`../../CLAUDE.md`](../../CLAUDE.md#files-to-check-for-api-compatibility) — the 8 header rows that have +historically caused breaks (`common.h`, `chat.h`, `speculative.h`, `mtmd.h`, `llama-cpp.h`, `arg.h`, +`llama.h`, `download.h`), plus the project `CMakeLists.txt` for renamed link targets. Note any new +API surface worth wiring through the Java layer (e.g. a new completion param or model-metadata getter). + +--- + +## Applying a bump + +Once you have the `b -> b` step, apply it exactly as +[`CLAUDE.md § Upgrading/Downgrading`](../../CLAUDE.md#upgradingdowngrading-llamacpp-version) describes. +Concretely: + +1. **Edit the pin — three files:** + - `llama/CMakeLists.txt` — the `GIT_TAG b` line **and** the `-DLLAMA_TAG=b` used by the + WebUI/TTS extraction (both must move together). + - `README.md` — the llama.cpp badge and link (version appears twice). + - `CLAUDE.md` — the "Current llama.cpp pinned version" line (and any build-example `b`). +2. **Re-verify `patches/`** — a clean configure re-runs the fail-loud `PATCH_COMMAND`, so every patch + `0001`–`0006` must still apply. Use a **fresh** build dir (a stale one re-applies over an + already-patched tree and reports a false "does not apply"): + ```bash + cd llama && mvn -q compile # generates the OSInfo class CMake's OS-detection needs + rm -rf build && cmake -B build # fail-loud: aborts here if any patch no longer applies + ``` + If a patch no longer applies, refresh its diff against the new source and recommit it. +3. **Append the history rows** — add a pair of rows to + [`../history/llama-cpp-breaking-changes.md`](../history/llama-cpp-breaking-changes.md) covering the + `b -> b` range (what broke / what was new; "no source change" is a valid row). +4. 
**Commit + push** on the working branch (do not open a new PR if one already tracks the branch): + ```bash + git add llama/CMakeLists.txt README.md CLAUDE.md docs/history/llama-cpp-breaking-changes.md + git commit -m "Upgrade llama.cpp from b to b" + git push -u origin + ``` +5. **Re-run the helper** for the next chunk. Repeat until it reports the **final chunk** (target + reached). + +CI builds every native classifier from the new pin; the full model-backed Java + C++ suites gate the +result. A build failure at the configure step almost always means a patch needs refreshing (step 2). From 2a847872971eb0b33f1d4c97ded020254e4e575c Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 15:32:10 +0000 Subject: [PATCH 20/29] Upgrade llama.cpp from b9866 to b9867 First bump driven by the new .github/scripts/llama-next-version.sh helper: b9866 -> b9867 is a 2 KiB single-commit final chunk (well under the 100 KiB threshold), so it bumps straight to the latest release. b9867 (spec: support spec-draft-p-min in DFlash) changes only common/speculative.cpp: the DFlash draft path now also clamps n_min to the block size, raises the draft sampler top_k 1 -> 10, stops drafting when the top candidate probability drops below p_min, and discards a step producing fewer than n_min tokens. All three use existing common_speculative_params fields; common/speculative.h is untouched. Entirely inside upstream-compiled common; the project binds no common_speculative_* symbol. No project source changes required. Re-verified all six patches (0001-0006) apply cleanly against b9867 via a fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers present); OuteTTS generator anchors held. Appended the b9866->b9867 history rows. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- CLAUDE.md | 12 ++++++------ README.md | 2 +- docs/history/llama-cpp-breaking-changes.md | 2 ++ llama/CMakeLists.txt | 4 ++-- 4 files changed, 11 insertions(+), 9 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index badca380..1f3e053a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI. -Current llama.cpp pinned version: **b9866** +Current llama.cpp pinned version: **b9867** ## Upgrading CUDA Version @@ -344,7 +344,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi ships no UI): ```bash # needs node/npm + network; embed.cpp is plain C++17 (no npm) -git clone --depth 1 --branch b9866 https://github.com/ggml-org/llama.cpp /tmp/lc +git clone --depth 1 --branch b9867 https://github.com/ggml-org/llama.cpp /tmp/lc ( cd /tmp/lc/tools/ui && npm ci && npm run build \ && ( cd dist && find . -type f -not -path './_gzip/*' \ | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \ @@ -384,7 +384,7 @@ cache lives in **Depot Cache** over sccache's **WebDAV** backend: - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored as the repo secret **`DEPOT_TOKEN`**. -Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9866`), the +Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9867`), the ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the *changed* files. 
Depot's cache is **shared across all branches** (unlike GitHub's per-branch `actions/cache`), so every branch builds incrementally; a `b` version bump @@ -497,7 +497,7 @@ Current patches: | `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b` includes the change. | | `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b` includes the change. 
| | `0005-server-recurrent-near-prompt-end-checkpoints.patch` | **Multi-turn tool-calling perf fix for recurrent/hybrid models (e.g. Granite-4)**, upstream-submittable. In `server_context::update_slots` (`tools/server/server-context.cpp`) the near-prompt-end context checkpoints are gated by `checkpoint_min_step` (default 8192 tokens). An agentic conversation that appends only assistant/tool messages never produces a new user-message checkpoint (`is_user_start`/`is_last_user_message` match `COMMON_CHAT_ROLE_USER` only), so after turn 1 no new checkpoint is ever created and — because recurrent state can only roll back to a checkpoint — **every turn re-prefills the whole conversation tail** (measured on a synthetic granitehybrid model: prefilled tokens grew 901 → 1544 → 2187 → 2830 → 3473 over turns 2–6). The patch (1) exempts near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (`ctx_tgt_seq_rm_type` is `FULL` or `RS` — SWA-only models are unaffected), and (2) skips creating a checkpoint whose position equals the newest one (the last-user-message checkpoint was re-created identically on every turn, flooding the 32-entry list). After the patch each turn restores the previous turn's near-end checkpoint and prefill is constant (~new-turn-sized; 647 tokens/turn in the same measurement, ≈5.4× less prefill at turn 6 and growing with conversation length). Validated output-identical (`temperature=0`) vs. unpatched. Complements — not duplicates — open upstream PRs #24035/#24899/#24891 (they fix checkpoint *invalidation/retention*; this fixes checkpoint *starvation*). Drop once upstream solves agentic checkpoint placement (e.g. a merged role-boundary checkpointing design, cf. #21885 / #22826 discussion). 
| -| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. b9866 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). `llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | +| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. 
b9867 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). `llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`) @@ -918,7 +918,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in - `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable. - `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`. - Uses `nlohmann/json` for JSON deserialization of parameters. -- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. 
**`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). `server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). 
**`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9866 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. +- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. **`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). 
`server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). **`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9867 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. ### Two server modes (`OpenAiCompatServer` vs `NativeServer`) @@ -1107,7 +1107,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson" #### Upstream source location (in CMake build tree) -llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9866`. +llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9867`. 
**GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the diff --git a/README.md b/README.md index ff968310..aebd786b 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ **Build:** ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational) ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey) -[![llama.cpp b9866](https://img.shields.io/badge/llama.cpp-%23b9866-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9866) +[![llama.cpp b9867](https://img.shields.io/badge/llama.cpp-%23b9867-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9867) [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/) ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162) [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev) diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md index 6a26f71e..65a7ea61 100644 --- a/docs/history/llama-cpp-breaking-changes.md +++ b/docs/history/llama-cpp-breaking-changes.md @@ -419,3 +419,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r | b9862–b9864 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9864. The b9862→b9864 diff touches exactly one patch-target file — `tools/server/server-context.cpp` — and only in `handle_completions_impl` (~L4089), far below every patched region (0002 load-progress guard ~L1152, 0005 near-prompt-end checkpoints ~L3560, 0003 slot-prompt-similarity getter/setter ~L3965). Patches **0002/0003/0005** were **applied in sequence** against the actual b9864 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all clean. 
`server-context.h` is unchanged in this range (so 0003's `.h` hunk is byte-identical); `server-schema.cpp`/`server-task.h` are **not** patch targets. Patches **0001** (`common/arg.*`, `test-arg-parser.cpp`, ~34 mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not** in the changed-file list, so they apply unchanged. Confirmed end-to-end by a clean `cmake` configure: b9864 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held (`tools/tts/tts.cpp` unchanged). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. | | b9864–b9866 | `ggml/src/ggml-cuda/topk-moe.cu` + `tests/test-backend-ops.cpp` + `tools/ui/**` | Backend/WebUI-only, no API surface. (1) **CUDA topk-moe** gains a `case 288` instantiation (`topk_moe_cuda<288>`) and `ggml_cuda_should_use_topk_moe` now also accepts `n_expert == 288` (the non-power-of-2 expert count of **StepFun 3.7**) — a device-side kernel add, internal to `ggml-cuda`, affecting only the `cuda13-*` classifiers (a StepFun-3.7 MoE GGUF now uses the fused topk-moe path on CUDA instead of the generic fallback). (2) `test-backend-ops.cpp` adds the matching `test_topk_moe({288,22,1,1}, …)` case — **not built here** (`LLAMA_BUILD_TESTS` OFF for the FetchContent subproject). (3) **WebUI** (`tools/ui/**`): a `config-type-normalization-v1` migration coercing legacy string-encoded booleans in persisted config back to real booleans (the strict server schema now rejects `"true"`/`"false"` strings), and a thinking-enabled default flip to `true` — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required. | | b9864–b9866 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9866. 
The b9864→b9866 diff touches **no** patch-target file (`common/arg.*`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `server-schema.cpp`, `server-task.h`, `server.cpp`, `test-arg-parser.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edits are `ggml-cuda/topk-moe.cu`, `tests/test-backend-ops.cpp` and `tools/ui/**` — so every patch hunk/offset is byte-identical to b9864. Confirmed end-to-end by a clean `cmake` configure: b9866 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held. Full build + `ctest` (target 462/462) to be confirmed by the CI pipeline. | +| b9866–b9867 | `common/speculative.cpp` | Internal-only, no API surface. A tweak to the **DFlash** block-diffusion speculative draft path (`common_speculative_impl_draft_dflash`, from the b9829–b9839 DFlash feature): (1) the block-size clamp now also clamps `params.n_min` (not just `n_max`) to `block_size - 1` and logs both; (2) the per-step draft sampler's `top_k` goes `1 → 10`; (3) drafting now **stops early** when the top candidate's probability drops below `params.p_min` (upstream b9867 title "spec: support spec-draft-p-min in DFlash"), and a step that produced fewer than `params.n_min` tokens is discarded (`result.clear()`). All three use **already-existing** `common_speculative_params` fields (`n_min`/`n_max`/`p_min`) — no struct/header/API change (`common/speculative.h` untouched). Entirely inside upstream-compiled `common`; the project binds no `common_speculative_*` symbol and exposes no `--spec-*` inference param, so it flows through `libllama` unchanged. No project source changes required. | +| b9866–b9867 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9867. 
The b9866→b9867 diff touches **no** patch-target file (`common/arg.*`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `server-schema.cpp`, `server-task.h`, `server.cpp`, `test-arg-parser.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edit is `common/speculative.cpp` — so every patch hunk/offset is byte-identical to b9866. Confirmed end-to-end by a clean `cmake` configure: b9867 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` and 0006's `g_llama_server_embedded` markers present), OuteTTS generator anchors held. First bump driven by `.github/scripts/llama-next-version.sh` (b9866→b9867, 2 KiB single-commit final chunk). Full build + `ctest` (target 462/462) to be confirmed by the CI pipeline. | diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index 7552b096..a021df6f 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) FetchContent_Declare( llama.cpp GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git - GIT_TAG b9866 + GIT_TAG b9867 PATCH_COMMAND ${CMAKE_COMMAND} -DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches -DLLAMA_SRC= @@ -166,7 +166,7 @@ execute_process( COMMAND ${CMAKE_COMMAND} -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp -DOUT_CPP=${JLLAMA_TTS_GEN_CPP} - -DLLAMA_TAG=b9866 + -DLLAMA_TAG=b9867 -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT ) From 215db79c8815a8acd1bd9a8df57313db1c2abacd Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 3 Jul 2026 18:12:36 +0000 Subject: [PATCH 21/29] Add REUSE license header to the version-bump runbook doc The new docs/upgrade/llama-cpp-version-bump.md lacked copyright/licensing info, failing the License Compliance (REUSE) check. 
Add the top-of-file HTML-comment SPDX block used by the sibling docs
(docs/history/*.md, docs/feature-investigation-*.md). reuse lint now
reports 310/310 files compliant with REUSE 3.3.

Co-Authored-By: Claude Opus 4.8
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
---
 docs/upgrade/llama-cpp-version-bump.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/upgrade/llama-cpp-version-bump.md b/docs/upgrade/llama-cpp-version-bump.md
index 8dc3e864..c12c4d57 100644
--- a/docs/upgrade/llama-cpp-version-bump.md
+++ b/docs/upgrade/llama-cpp-version-bump.md
@@ -1,3 +1,9 @@
+
+
 # llama.cpp version-bump runbook

 This is the **documentation root** for bumping the pinned llama.cpp version. It links the
From 890da0f81aff125c9da276874910de273f9ea733 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sat, 4 Jul 2026 07:08:06 +0000
Subject: [PATCH 22/29] Upgrade llama.cpp from b9867 to b9870

Driven by .github/scripts/llama-next-version.sh: b9867 -> b9870 is a
21 KiB (11 KiB excl. WebUI) three-commit final chunk, under the 100 KiB
threshold, so it bumps straight to the latest release.

The only source edit in b9867..b9870 is common/chat.cpp: a StepFun
message-content whitespace workaround (issue #24181) that trims leading
and trailing whitespace from each common_chat_msg content,
reasoning_content and text content-part before Jinja rendering, detected
by the StepFun template signature. It uses existing common_chat_msg
fields; common/chat.h is untouched. The removed
stepfun-ai-Step-3.5-Flash.jinja template and the test-chat additions are
not built here (LLAMA_BUILD_TESTS OFF); tools/ui is the auto-followed
WebUI. No project source changes required.

Re-verified all six patches (0001-0006) apply cleanly against b9870 via a
fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers and the
b9870 trim_all_content change present); OuteTTS generator anchors held.
Appended the b9867->b9870 history rows.
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- CLAUDE.md | 12 ++++++------ README.md | 2 +- docs/history/llama-cpp-breaking-changes.md | 2 ++ llama/CMakeLists.txt | 4 ++-- 4 files changed, 11 insertions(+), 9 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 1f3e053a..8f2113cf 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI. -Current llama.cpp pinned version: **b9867** +Current llama.cpp pinned version: **b9870** ## Upgrading CUDA Version @@ -344,7 +344,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi ships no UI): ```bash # needs node/npm + network; embed.cpp is plain C++17 (no npm) -git clone --depth 1 --branch b9867 https://github.com/ggml-org/llama.cpp /tmp/lc +git clone --depth 1 --branch b9870 https://github.com/ggml-org/llama.cpp /tmp/lc ( cd /tmp/lc/tools/ui && npm ci && npm run build \ && ( cd dist && find . -type f -not -path './_gzip/*' \ | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \ @@ -384,7 +384,7 @@ cache lives in **Depot Cache** over sccache's **WebDAV** backend: - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored as the repo secret **`DEPOT_TOKEN`**. -Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9867`), the +Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9870`), the ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the *changed* files. 
Depot's cache is **shared across all branches** (unlike GitHub's per-branch `actions/cache`), so every branch builds incrementally; a `b` version bump @@ -497,7 +497,7 @@ Current patches: | `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b` includes the change. | | `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b` includes the change. 
| | `0005-server-recurrent-near-prompt-end-checkpoints.patch` | **Multi-turn tool-calling perf fix for recurrent/hybrid models (e.g. Granite-4)**, upstream-submittable. In `server_context::update_slots` (`tools/server/server-context.cpp`) the near-prompt-end context checkpoints are gated by `checkpoint_min_step` (default 8192 tokens). An agentic conversation that appends only assistant/tool messages never produces a new user-message checkpoint (`is_user_start`/`is_last_user_message` match `COMMON_CHAT_ROLE_USER` only), so after turn 1 no new checkpoint is ever created and — because recurrent state can only roll back to a checkpoint — **every turn re-prefills the whole conversation tail** (measured on a synthetic granitehybrid model: prefilled tokens grew 901 → 1544 → 2187 → 2830 → 3473 over turns 2–6). The patch (1) exempts near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (`ctx_tgt_seq_rm_type` is `FULL` or `RS` — SWA-only models are unaffected), and (2) skips creating a checkpoint whose position equals the newest one (the last-user-message checkpoint was re-created identically on every turn, flooding the 32-entry list). After the patch each turn restores the previous turn's near-end checkpoint and prefill is constant (~new-turn-sized; 647 tokens/turn in the same measurement, ≈5.4× less prefill at turn 6 and growing with conversation length). Validated output-identical (`temperature=0`) vs. unpatched. Complements — not duplicates — open upstream PRs #24035/#24899/#24891 (they fix checkpoint *invalidation/retention*; this fixes checkpoint *starvation*). Drop once upstream solves agentic checkpoint placement (e.g. a merged role-boundary checkpointing design, cf. #21885 / #22826 discussion). 
| -| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. b9867 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). `llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | +| `0006-server-embed-native-server-jni.patch` | **Makes `server.cpp`'s `llama_server` embeddable in the JVM** so the `NativeServer` JNI bridge can run the full upstream HTTP server (WebUI included) inside `libjllama` — see "Two server modes" below. 
b9870 already exposes `int llama_server(int, char**)` (non-static; no `main` in the file), so the patch only adds embedded-mode support: (1) a `g_llama_server_embedded` flag + `llama_server_set_embedded()` / `llama_server_request_shutdown()` (declared in the committed `src/main/cpp/native_server_bridge.h`); (2) skips installing the process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM's); (3) in embedded mode parses the **forwarded** argv via `common_params_parse` instead of `common_params_parse_main` (whose `GetCommandLineW` recovery would pick up `java.exe`'s command line — the same Windows class of bug `0001` fixes). `llama_server_request_shutdown()` mirrors the SIGTERM path (invokes the installed `shutdown_handler` → `ctx_server.terminate()` unblocks `start_loop()`), giving JNI an out-of-band stop since `ctx_server` is loop-local. Applies **after `0001`** (which flips this call site to `common_params_parse_main`), so its context is the post-`0001` tree; regenerate against `0001`+source on a bump. Only touches `tools/server/server.cpp`. | ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`) @@ -918,7 +918,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in - `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable. - `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`. - Uses `nlohmann/json` for JSON deserialization of parameters. -- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. 
**`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). `server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). 
**`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9867 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. +- The upstream server library (`server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-schema.cpp`, `server-models.cpp`, and — since b9829 — `server-stream.cpp`) is compiled directly into `jllama` via CMake — there is no hand-ported `server.hpp` fork. **`server-stream.cpp` is mandatory, not optional:** it defines the resumable-streaming SSE replay buffer (`g_stream_sessions`, `stream_session_attach_pipe`, `stream_aware_should_stop`, `stream_conv_id_from_headers`, the `stream_pipe_*` types) that `server-context.cpp` / `server-http.cpp` / `server-models.cpp` now `#include "server-stream.h"` and call, so omitting it fails the link with undefined references. It is platform-neutral (threads + std mutex/condvar, no `subprocess.h`/`posix_spawn_*`), so it builds on Android too and sits outside the `server-models.cpp` Android guard. `jllama` wires its own JNI routes and never calls `g_stream_sessions.start_gc()` (only the excluded standalone `server.cpp` `main()` does), so its GC thread stays dormant. **Phase 2:** the upstream HTTP transport (`tools/server/server-http.cpp`) and its `cpp-httplib` backend (`vendor/cpp-httplib/httplib.cpp`) are now compiled into `jllama` too, so the OpenAI-compatible server can be driven natively from JNI *inside* `libjllama` — no separate `llama-server` executable (a JNI shared library loads anywhere a JVM runs, which a standalone binary does not). 
`server-http.cpp` does `#include "ui.h"` (the WebUI asset table that `tools/ui`/`llama-ui` normally generates); since the Svelte WebUI is not shipped, `src/main/cpp/webui_stub/ui.h` supplies the upstream **empty-asset** interface and leaves `LLAMA_UI_HAS_ASSETS` undefined (all static-asset-serving blocks compile out). `` already resolves via `llama-common`'s `vendor/` include dir (same nlohmann/json 3.12.0 as the FetchContent copy). No SSL: `CPPHTTPLIB_OPENSSL_SUPPORT` is left undefined (plain-HTTP; bind localhost / front with a TLS proxy). **`server.cpp` is now compiled in too** (on non-Android — it and `server-tools.cpp` pull in `subprocess.h`/`posix_spawn_*`, so they share `server-models.cpp`'s Android guard): b9870 exposes its entry as `int llama_server(int, char**)` (no `main` in the file), and `patches/0006` makes it embeddable (no process signal handlers, forwarded-argv parse, out-of-band shutdown). The `NativeServer` JNI bridge (`src/main/cpp/native_server.cpp`) calls `llama_server` on a worker thread, so the **full** upstream server — WebUI and all — runs inside `libjllama`. See "Two server modes" below. ### Two server modes (`OpenAiCompatServer` vs `NativeServer`) @@ -1107,7 +1107,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson" #### Upstream source location (in CMake build tree) -llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9867`. +llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9870`. 
**GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the diff --git a/README.md b/README.md index aebd786b..b7170130 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ **Build:** ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational) ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey) -[![llama.cpp b9867](https://img.shields.io/badge/llama.cpp-%23b9867-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9867) +[![llama.cpp b9870](https://img.shields.io/badge/llama.cpp-%23b9870-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9870) [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/) ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162) [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev) diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md index 65a7ea61..cf745f86 100644 --- a/docs/history/llama-cpp-breaking-changes.md +++ b/docs/history/llama-cpp-breaking-changes.md @@ -421,3 +421,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r | b9864–b9866 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9866. The b9864→b9866 diff touches **no** patch-target file (`common/arg.*`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `server-schema.cpp`, `server-task.h`, `server.cpp`, `test-arg-parser.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edits are `ggml-cuda/topk-moe.cu`, `tests/test-backend-ops.cpp` and `tools/ui/**` — so every patch hunk/offset is byte-identical to b9864. 
Confirmed end-to-end by a clean `cmake` configure: b9866 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held. Full build + `ctest` (target 462/462) to be confirmed by the CI pipeline. | | b9866–b9867 | `common/speculative.cpp` | Internal-only, no API surface. A tweak to the **DFlash** block-diffusion speculative draft path (`common_speculative_impl_draft_dflash`, from the b9829–b9839 DFlash feature): (1) the block-size clamp now also clamps `params.n_min` (not just `n_max`) to `block_size - 1` and logs both; (2) the per-step draft sampler's `top_k` goes `1 → 10`; (3) drafting now **stops early** when the top candidate's probability drops below `params.p_min` (upstream b9867 title "spec: support spec-draft-p-min in DFlash"), and a step that produced fewer than `params.n_min` tokens is discarded (`result.clear()`). All three use **already-existing** `common_speculative_params` fields (`n_min`/`n_max`/`p_min`) — no struct/header/API change (`common/speculative.h` untouched). Entirely inside upstream-compiled `common`; the project binds no `common_speculative_*` symbol and exposes no `--spec-*` inference param, so it flows through `libllama` unchanged. No project source changes required. | | b9866–b9867 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9867. The b9866→b9867 diff touches **no** patch-target file (`common/arg.*`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `server-schema.cpp`, `server-task.h`, `server.cpp`, `test-arg-parser.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edit is `common/speculative.cpp` — so every patch hunk/offset is byte-identical to b9866. 
Confirmed end-to-end by a clean `cmake` configure: b9867 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` and 0006's `g_llama_server_embedded` markers present), OuteTTS generator anchors held. First bump driven by `.github/scripts/llama-next-version.sh` (b9866→b9867, 2 KiB single-commit final chunk). Full build + `ctest` (target 462/462) to be confirmed by the CI pipeline. | +| b9867–b9870 | `common/chat.cpp` + `models/templates/stepfun-ai-Step-3.5-Flash.jinja` (removed) + `tests/test-chat*.cpp` | Internal-only, no API surface. Adds a **StepFun** message-content whitespace workaround (issue #24181): `common_chat_templates_apply_jinja` detects a StepFun template (`src.find("You have access to the following functions in JSONSchema format")`) and, before rendering, trims leading/trailing whitespace from each `common_chat_msg`'s `content`/`reasoning_content` and its `"text"` `content_parts` via a new `static` `workaround::trim_all_content(...)` — otherwise leftover whitespace drove the model into reasoning loops. Uses only existing `common_chat_msg` fields; `common/chat.h` is untouched (no struct/API change). The removed `stepfun-ai-Step-3.5-Flash.jinja` embedded template and the `test-chat*.cpp` additions are **not built here** (`LLAMA_BUILD_TESTS` OFF for the FetchContent subproject). All inside upstream-compiled `common`, flowing through the embedded server / `LlamaModel` chat path automatically. No project source changes required. | +| b9867–b9870 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9870. 
The b9867→b9870 diff touches **no** patch-target file (`common/arg.*`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `server-schema.cpp`, `server-task.h`, `server.cpp`, `test-arg-parser.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only source edit is `common/chat.cpp` (a StepFun whitespace workaround), plus `tools/ui/**` (WebUI, auto-followed) and `tests/test-chat*.cpp` (not built) — so every patch hunk/offset is byte-identical to b9867. **Note:** patch `0004` also targets `tests/test-chat.cpp`, which b9870 edits, but `0004`'s hunks add the reasoning-budget cases in a disjoint region (verified clean by the configure below). Confirmed end-to-end by a clean `cmake` configure: b9870 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` and 0006's `g_llama_server_embedded` markers present, b9870's `trim_all_content` present), OuteTTS generator anchors held. Full build + `ctest` (target 462/462) to be confirmed by the CI pipeline. 
|
diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt
index a021df6f..f526b093 100644
--- a/llama/CMakeLists.txt
+++ b/llama/CMakeLists.txt
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
     llama.cpp
     GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-    GIT_TAG b9867
+    GIT_TAG b9870
     PATCH_COMMAND ${CMAKE_COMMAND}
         -DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
         -DLLAMA_SRC=
@@ -166,7 +166,7 @@ execute_process(
     COMMAND ${CMAKE_COMMAND}
         -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
         -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
-        -DLLAMA_TAG=b9867
+        -DLLAMA_TAG=b9870
         -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
     RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
 )
From fce103b5af09c2f9c926363509d4f4f7e942f6a1 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sat, 4 Jul 2026 08:36:54 +0000
Subject: [PATCH 23/29] Add 8 GPU-backend classifiers: ROCm/HIP, SYCL, Win-arm64 OpenCL, OpenVINO

Wire eight new GPU-backend classifiers following the exact same 5-place
pattern as the existing CUDA/Vulkan classifiers (fail-loud, in
package.needs, no continue-on-error, no special cases):

  rocm-linux-x86-64        GGML_HIP       Linux x86_64    (AMD ROCm/HIP)
  rocm-windows-x86-64      GGML_HIP       Windows x86_64  (AMD HIP SDK)
  sycl-fp16-linux-x86-64   GGML_SYCL+F16  Linux x86_64    (Intel oneAPI, fp16)
  sycl-fp32-linux-x86-64   GGML_SYCL      Linux x86_64    (Intel oneAPI, fp32)
  sycl-windows-x86-64      GGML_SYCL      Windows x86_64  (Intel oneAPI)
  opencl-windows-aarch64   GGML_OPENCL    Windows aarch64 (Snapdragon/Adreno)
  openvino-linux-x86-64    GGML_OPENVINO  Linux x86_64    (Intel OpenVINO)
  openvino-windows-x86-64  GGML_OPENVINO  Windows x86_64  (Intel OpenVINO)

- llama/CMakeLists.txt: extend the OS-aware backend routing with GGML_HIP,
  GGML_SYCL (Linux fp16/fp32 split by GGML_SYCL_F16) and GGML_OPENVINO
  branches.
- llama/pom.xml: eight classifier profiles; the existing opencl-windows include is now arch-scoped to Windows/x86_64 so the new aarch64 OpenCL build sharing the resources_windows_opencl tree does not leak into it (vulkan-linux split precedent). - .github/workflows/publish.yml: eight build jobs (build-only; GitHub runners have no matching GPU), all added to package.needs and to the download + profile-activation steps of package/publish-snapshot/publish-release. Vendor toolchain installs are first-pass and intentionally fail loud if a URL/version is stale. - README.md + CLAUDE.md: classifier table rows, dependency snippets, and a wiring/routing section. .gitignore: the seven new resources_* trees. All build-only, no vendor runtime bundled (consumer's driver/toolkit supplies it). Validated locally: CMake CPU reconfigure parses the extended routing, Maven recognizes all 8 profiles, publish.yml is valid YAML, pom.xml is well-formed, REUSE compliant. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 406 +++++++++++++++++++++- .gitignore | 7 + CLAUDE.md | 32 ++ README.md | 90 ++++- llama/CMakeLists.txt | 41 ++- llama/pom.xml | 636 +++++++++++++++++++++++++++++++++- 6 files changed, 1198 insertions(+), 14 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index c5c1c618..00e3a05b 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -1086,6 +1086,302 @@ jobs: path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ if-no-files-found: error + # --------------------------------------------------------------------------- + # Additional GPU-backend classifiers (fail-loud, same wiring as the CUDA/Vulkan/ + # OpenCL jobs): AMD ROCm/HIP, Intel SYCL (oneAPI), Windows-on-ARM OpenCL (Adreno), + # Intel OpenVINO. 
All BUILD-ONLY (GitHub runners have no AMD/Intel/Adreno GPU, and + # no ctest — a GPU-linked jllama_test can't enumerate a device). GPU runtime libs + # are NOT bundled — the consumer's driver/toolkit supplies them. CMakeLists.txt + # routes each backend to its own src/main/resources_* tree; the matching Maven + # profile turns it into a classifier JAR. Toolchain install steps are first-pass — + # if a vendor URL/version 404s in CI, adjust it (the failure is intentional signal). + # --------------------------------------------------------------------------- + + build-linux-x86_64-rocm: + name: Build Linux x86_64 ROCm/HIP (AMD) + needs: [startgate, build-webui] + runs-on: ubuntu-latest + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: ${{ env.JAVA_VERSION }} + - name: Install ROCm/HIP (AMD apt repo) + run: | + sudo mkdir --parents --mode=0755 /etc/apt/keyrings + wget -qO- https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null + echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.3.4 noble main" | sudo tee /etc/apt/sources.list.d/rocm.list + printf 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600\n' | sudo tee /etc/apt/preferences.d/rocm-pin-600 + sudo apt-get update + sudo apt-get install -y rocm-hip-sdk rocblas-dev hipblas-dev + echo "/opt/rocm/bin" >> "$GITHUB_PATH" + echo "ROCM_PATH=/opt/rocm" >> "$GITHUB_ENV" + - name: Build libraries + shell: bash + run: | + mvn --no-transfer-progress -f llama/pom.xml compile + .github/build.sh 
"-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102 -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ -DGGML_NATIVE=OFF -DOS_NAME=Linux -DOS_ARCH=x86_64" + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Linux-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_linux_rocm/net/ladenthin/llama/ + if-no-files-found: error + + build-windows-x86_64-rocm: + name: Build Windows 2025 x86_64 ROCm/HIP (AMD) + needs: [startgate, build-webui] + runs-on: windows-2025-vs2026 + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - name: Set up MSVC developer environment (x64) + uses: ilammy/msvc-dev-cmd@v1 + with: + arch: x64 + - name: Install AMD HIP SDK for Windows + shell: pwsh + run: | + $url = "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-Win10-Win11-For-HIP.exe" + Invoke-WebRequest -Uri $url -OutFile "$env:RUNNER_TEMP\hip-sdk.exe" + Start-Process -FilePath "$env:RUNNER_TEMP\hip-sdk.exe" -ArgumentList "-install" -Wait + "HIP_PATH=C:\Program Files\AMD\ROCm\6.2\" | Out-File -FilePath $env:GITHUB_ENV -Append + "C:\Program Files\AMD\ROCm\6.2\bin" | Out-File -FilePath $env:GITHUB_PATH -Append + - name: Build libraries + shell: cmd + run: | + .github\build.bat -G "Ninja Multi-Config" -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030;gfx1100;gfx1101;gfx1102 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang -DOS_NAME=Windows -DOS_ARCH=x86_64 + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Windows-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_windows_rocm/net/ladenthin/llama/ + if-no-files-found: error + + build-linux-x86_64-sycl-fp16: + name: Build Linux x86_64 SYCL fp16 (Intel oneAPI) + needs: [startgate, build-webui] + runs-on: 
ubuntu-latest + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: ${{ env.JAVA_VERSION }} + - name: Install Intel oneAPI (DPC++ + MKL) + run: | + wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null + echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list + sudo apt-get update + sudo apt-get install -y intel-oneapi-compiler-dpcpp-cpp intel-oneapi-mkl-devel + - name: Build libraries + shell: bash + run: | + source /opt/intel/oneapi/setvars.sh + mvn --no-transfer-progress -f llama/pom.xml compile + .github/build.sh "-DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=OFF -DOS_NAME=Linux -DOS_ARCH=x86_64" + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Linux-x86_64-sycl-fp16 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp16/net/ladenthin/llama/ + if-no-files-found: error + + build-linux-x86_64-sycl-fp32: + name: Build Linux x86_64 SYCL fp32 (Intel oneAPI) + needs: [startgate, build-webui] + runs-on: ubuntu-latest + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: 
webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: ${{ env.JAVA_VERSION }} + - name: Install Intel oneAPI (DPC++ + MKL) + run: | + wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null + echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list + sudo apt-get update + sudo apt-get install -y intel-oneapi-compiler-dpcpp-cpp intel-oneapi-mkl-devel + - name: Build libraries + shell: bash + run: | + source /opt/intel/oneapi/setvars.sh + mvn --no-transfer-progress -f llama/pom.xml compile + .github/build.sh "-DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=OFF -DOS_NAME=Linux -DOS_ARCH=x86_64" + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Linux-x86_64-sycl-fp32 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp32/net/ladenthin/llama/ + if-no-files-found: error + + build-windows-x86_64-sycl: + name: Build Windows 2025 x86_64 SYCL (Intel oneAPI) + needs: [startgate, build-webui] + runs-on: windows-2025-vs2026 + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - name: Set up MSVC developer environment (x64) + uses: ilammy/msvc-dev-cmd@v1 + with: + arch: x64 + - name: Install Intel oneAPI (Windows, DPC++ compiler) + shell: cmd + run: | + curl -fSL -o "%RUNNER_TEMP%\oneapi.exe" "https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9a98af19-1c68-46ce-9fdd-e249240c7c42/intel-oneapi-base-toolkit-2025.0.1.47_offline.exe" + "%RUNNER_TEMP%\oneapi.exe" -s -a --silent --eula accept --components 
intel.oneapi.win.dpcpp-compiler:intel.oneapi.win.mkl.devel + - name: Build libraries + shell: cmd + run: | + call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" + .github\build.bat -G "Ninja Multi-Config" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DOS_NAME=Windows -DOS_ARCH=x86_64 + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Windows-x86_64-sycl + path: ${{ github.workspace }}/llama/src/main/resources_windows_sycl/net/ladenthin/llama/ + if-no-files-found: error + + build-windows-arm64-opencl: + name: Build Windows 11 arm64 OpenCL (Adreno) + needs: [startgate, build-webui] + # Windows-on-ARM OpenCL (Snapdragon X / Adreno). Same clang-cl + GGML_OPENMP=OFF + # toolchain as the arm64 CPU job (ggml refuses MSVC cl.exe on ARM). Reuses the + # resources_windows_opencl tree under Windows/aarch64; the opencl-windows-aarch64 + # Maven profile packages only that subtree. build_opencl_windows.bat stages the + # OpenCL headers + ICD loader before delegating to build.bat. 
+ runs-on: windows-11-arm + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - name: Set up MSVC developer environment (arm64) + uses: ilammy/msvc-dev-cmd@v1 + with: + arch: arm64 + - name: Build libraries + shell: cmd + run: | + .github\build_opencl_windows.bat -G "Ninja Multi-Config" -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DGGML_OPENMP=OFF -DGGML_OPENCL=ON -DGGML_OPENCL_EMBED_KERNELS=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON -DOS_NAME=Windows -DOS_ARCH=aarch64 + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Windows-aarch64-opencl + path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + if-no-files-found: error + + build-linux-x86_64-openvino: + name: Build Linux x86_64 OpenVINO (Intel) + needs: [startgate, build-webui] + runs-on: ubuntu-latest + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: ${{ env.JAVA_VERSION }} + - name: Install Intel OpenVINO (apt repo) + run: | + wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/intel-openvino.gpg > /dev/null + echo "deb [signed-by=/usr/share/keyrings/intel-openvino.gpg] https://apt.repos.intel.com/openvino/2025 ubuntu24 main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2025.list + sudo apt-get update + sudo apt-get install -y openvino-2025.0.0 + - name: Build libraries + shell: bash + 
run: | + source /opt/intel/openvino/setupvars.sh || true + mvn --no-transfer-progress -f llama/pom.xml compile + .github/build.sh "-DGGML_OPENVINO=ON -DGGML_NATIVE=OFF -DOS_NAME=Linux -DOS_ARCH=x86_64" + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Linux-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_linux_openvino/net/ladenthin/llama/ + if-no-files-found: error + + build-windows-x86_64-openvino: + name: Build Windows 2025 x86_64 OpenVINO (Intel) + needs: [startgate, build-webui] + runs-on: windows-2025-vs2026 + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - name: Set up MSVC developer environment (x64) + uses: ilammy/msvc-dev-cmd@v1 + with: + arch: x64 + - name: Install Intel OpenVINO (Windows archive) + shell: pwsh + run: | + $url = "https://storage.openvinotoolkit.org/repositories/openvino/packages/2025.0/windows/openvino_toolkit_windows_2025.0.0.17942.1f68be9f594_x86_64.zip" + Invoke-WebRequest -Uri $url -OutFile "$env:RUNNER_TEMP\openvino.zip" + Expand-Archive -Path "$env:RUNNER_TEMP\openvino.zip" -DestinationPath "C:\openvino" -Force + "OpenVINO_DIR=C:\openvino\runtime\cmake" | Out-File -FilePath $env:GITHUB_ENV -Append + - name: Build libraries + shell: cmd + run: | + .github\build.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOS_NAME=Windows -DOS_ARCH=x86_64 + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Windows-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_windows_openvino/net/ladenthin/llama/ + if-no-files-found: error + # --------------------------------------------------------------------------- # CI-only jobs — no release artifact, purely for test coverage # --------------------------------------------------------------------------- @@ -1668,6 +1964,14 @@ jobs: - 
build-windows-x86_64-cuda - build-windows-x86_64-vulkan - build-windows-x86_64-opencl + - build-linux-x86_64-rocm + - build-windows-x86_64-rocm + - build-linux-x86_64-sycl-fp16 + - build-linux-x86_64-sycl-fp32 + - build-windows-x86_64-sycl + - build-windows-arm64-opencl + - build-linux-x86_64-openvino + - build-windows-x86_64-openvino - test-cpp-linux-x86_64 - build-macos-arm64-metal-15 - test-java-linux-x86_64 @@ -1725,6 +2029,38 @@ jobs: with: name: Windows-x86_64-opencl path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_linux_rocm/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_windows_rocm/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-sycl-fp16 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp16/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-sycl-fp32 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp32/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-sycl + path: ${{ github.workspace }}/llama/src/main/resources_windows_sycl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-aarch64-opencl + path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_linux_openvino/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_windows_openvino/net/ladenthin/llama/ - uses: actions/setup-java@v5 with: distribution: 'temurin' @@ -1738,7 
+2074,7 @@ jobs: # Windows classifier JARs: `windows-msvc` (MSVC-built CPU natives) plus the GPU # backends `cuda-windows` / `vulkan-windows` / `opencl-windows`. The default JAR's # Windows natives are the Ninja `*-libraries` merged into src/main/resources/ above. - run: mvn --batch-mode --no-transfer-progress -P release,cuda,vulkan-linux,vulkan-linux-aarch64,opencl-android,windows-msvc,cuda-windows,vulkan-windows,opencl-windows,assembly -Dmaven.test.skip=true -Dgpg.skip=true package + run: mvn --batch-mode --no-transfer-progress -P release,cuda,vulkan-linux,vulkan-linux-aarch64,opencl-android,windows-msvc,cuda-windows,vulkan-windows,opencl-windows,rocm-linux,rocm-windows,sycl-fp16-linux,sycl-fp32-linux,sycl-windows,opencl-windows-aarch64,openvino-linux,openvino-windows,assembly -Dmaven.test.skip=true -Dgpg.skip=true package - name: Upload JARs uses: actions/upload-artifact@v7 with: @@ -1844,6 +2180,38 @@ jobs: with: name: Windows-x86_64-opencl path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_linux_rocm/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_windows_rocm/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-sycl-fp16 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp16/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-sycl-fp32 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp32/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-sycl + path: ${{ github.workspace }}/llama/src/main/resources_windows_sycl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-aarch64-opencl + path: ${{ github.workspace 
}}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_linux_openvino/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_windows_openvino/net/ladenthin/llama/ - name: Set up Maven Central Repository uses: actions/setup-java@v5 with: @@ -1868,7 +2236,7 @@ jobs: # :llama-langchain4j. The `release` profile (GPG + Central Publishing) is inherited # from the parent, so every module — including the parent pom — is signed. - name: Publish snapshot (reactor - parent + llama + llama-langchain4j) - run: mvn --batch-mode --no-transfer-progress -P release,cuda,vulkan-linux,vulkan-linux-aarch64,opencl-android,windows-msvc,cuda-windows,vulkan-windows,opencl-windows -Dmaven.test.skip=true deploy + run: mvn --batch-mode --no-transfer-progress -P release,cuda,vulkan-linux,vulkan-linux-aarch64,opencl-android,windows-msvc,cuda-windows,vulkan-windows,opencl-windows,rocm-linux,rocm-windows,sycl-fp16-linux,sycl-fp32-linux,sycl-windows,opencl-windows-aarch64,openvino-linux,openvino-windows -Dmaven.test.skip=true deploy env: MAVEN_USERNAME: ${{ secrets.CENTRAL_USERNAME }} MAVEN_PASSWORD: ${{ secrets.CENTRAL_TOKEN }} @@ -1962,6 +2330,38 @@ jobs: with: name: Windows-x86_64-opencl path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_linux_rocm/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-rocm + path: ${{ github.workspace }}/llama/src/main/resources_windows_rocm/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-sycl-fp16 + path: ${{ github.workspace 
}}/llama/src/main/resources_linux_sycl_fp16/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-sycl-fp32 + path: ${{ github.workspace }}/llama/src/main/resources_linux_sycl_fp32/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-sycl + path: ${{ github.workspace }}/llama/src/main/resources_windows_sycl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-aarch64-opencl + path: ${{ github.workspace }}/llama/src/main/resources_windows_opencl/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Linux-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_linux_openvino/net/ladenthin/llama/ + - uses: actions/download-artifact@v8 + with: + name: Windows-x86_64-openvino + path: ${{ github.workspace }}/llama/src/main/resources_windows_openvino/net/ladenthin/llama/ - name: Set up Maven Central Repository uses: actions/setup-java@v5 with: @@ -1977,7 +2377,7 @@ jobs: # :llama-langchain4j. The `release` profile (GPG + Central Publishing) is inherited # from the parent, so every module — including the parent pom — is signed. 
- name: Publish release (reactor - parent + llama + llama-langchain4j) - run: mvn --batch-mode --no-transfer-progress -P release,cuda,vulkan-linux,vulkan-linux-aarch64,opencl-android,windows-msvc,cuda-windows,vulkan-windows,opencl-windows -Dmaven.test.skip=true deploy + run: mvn --batch-mode --no-transfer-progress -P release,cuda,vulkan-linux,vulkan-linux-aarch64,opencl-android,windows-msvc,cuda-windows,vulkan-windows,opencl-windows,rocm-linux,rocm-windows,sycl-fp16-linux,sycl-fp32-linux,sycl-windows,opencl-windows-aarch64,openvino-linux,openvino-windows -Dmaven.test.skip=true deploy env: MAVEN_USERNAME: ${{ secrets.CENTRAL_USERNAME }} MAVEN_PASSWORD: ${{ secrets.CENTRAL_TOKEN }} diff --git a/.gitignore b/.gitignore index 50ea904f..d160476a 100644 --- a/.gitignore +++ b/.gitignore @@ -47,6 +47,13 @@ llama/src/main/resources_windows_msvc/ llama/src/main/resources_windows_cuda/ llama/src/main/resources_windows_vulkan/ llama/src/main/resources_windows_opencl/ +llama/src/main/resources_linux_rocm/ +llama/src/main/resources_windows_rocm/ +llama/src/main/resources_linux_sycl_fp16/ +llama/src/main/resources_linux_sycl_fp32/ +llama/src/main/resources_windows_sycl/ +llama/src/main/resources_linux_openvino/ +llama/src/main/resources_windows_openvino/ llama/src/main/resources/**/*.so llama/src/main/resources/**/*.dylib llama/src/main/resources/**/*.dll diff --git a/CLAUDE.md b/CLAUDE.md index 8f2113cf..f12eac62 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -311,6 +311,38 @@ threadpool, leaving the arm64 `jllama.dll` self-contained (the x86_64/x86 jobs k x64 runner with `vcvarsall amd64_arm64` + a `clang`/`clang++` toolchain file and no arm64 tests; the native-runner + `clang-cl` route here keeps the `/MT` CRT and lets `ctest` run on real ARM hardware.) +## Additional GPU-backend classifiers (ROCm/HIP, SYCL, Win-arm64 OpenCL, OpenVINO) + +Eight further GPU classifiers extend the matrix toward upstream llama.cpp's full release set. 
They +follow the **exact same 5-place wiring** as the CUDA/Vulkan classifiers (no special cases — KISS): a +`CMakeLists.txt` backend branch, a `publish.yml` build job (in `package.needs`, **fail-loud** — a +broken build reds the pipeline, same policy as every GPU job), a `pom.xml` classifier profile, a +`README.md` row, and a git-ignored `resources_*` tree. All are **build-only** (GitHub runners have no +matching GPU) and bundle **no** vendor runtime. + +| Classifier | GGML flag(s) | Job runner / toolchain | Tree | +|---|---|---|---| +| `rocm-linux-x86-64` | `GGML_HIP=ON -DAMDGPU_TARGETS=…` | `ubuntu-latest` + ROCm apt repo (`/opt/rocm/llvm/bin/clang`) | `resources_linux_rocm` | +| `rocm-windows-x86-64` | `GGML_HIP=ON` | `windows-2025-vs2026` + AMD HIP SDK | `resources_windows_rocm` | +| `sycl-fp16-linux-x86-64` | `GGML_SYCL=ON -DGGML_SYCL_F16=ON` (`icx`/`icpx`) | `ubuntu-latest` + Intel oneAPI apt | `resources_linux_sycl_fp16` | +| `sycl-fp32-linux-x86-64` | `GGML_SYCL=ON` (`icx`/`icpx`) | `ubuntu-latest` + Intel oneAPI apt | `resources_linux_sycl_fp32` | +| `sycl-windows-x86-64` | `GGML_SYCL=ON` (`icx`) | `windows-2025-vs2026` + oneAPI installer | `resources_windows_sycl` | +| `opencl-windows-aarch64` | `GGML_OPENCL=ON …ADRENO_KERNELS=ON` (clang-cl, `GGML_OPENMP=OFF`) | `windows-11-arm` (arm64 CPU job's toolchain) | `resources_windows_opencl` (arch subdir `aarch64`) | +| `openvino-linux-x86-64` | `GGML_OPENVINO=ON` | `ubuntu-latest` + OpenVINO apt | `resources_linux_openvino` | +| `openvino-windows-x86-64` | `GGML_OPENVINO=ON` | `windows-2025-vs2026` + OpenVINO archive | `resources_windows_openvino` | + +Two routing notes mirror existing precedent: **Linux SYCL** ships two precision variants at the *same* +arch, so `CMakeLists.txt` routes them to two *distinct* trees by `GGML_SYCL_F16` (fp16 vs fp32). 
+**Windows OpenCL** now holds both `x86_64` (desktop ICD) and `aarch64` (Snapdragon/Adreno) in the one +`resources_windows_opencl` tree, split by the `opencl-windows` / `opencl-windows-aarch64` profiles' +arch-scoped `<include>` — exactly like the `vulkan-linux` / `vulkan-linux-aarch64` split. + +The vendor toolchain install steps in `publish.yml` are **first-pass** (apt repos / vendor installers +pinned to a specific version): if a URL/version 404s in CI, the job fails loud and the step is adjusted +— the failure is intentional signal, not a regression to hide behind `continue-on-error`. +`src/main/resources_{linux_rocm,windows_rocm,linux_sycl_fp16,linux_sycl_fp32,windows_sycl,linux_openvino,windows_openvino}/` +are all git-ignored (staged by CI, never committed). + ## WebUI (llama.cpp Svelte UI) embedding The llama.cpp WebUI is **built once in CI and shared to every native build**, then diff --git a/README.md b/README.md index b7170130..119215b9 100644 --- a/README.md +++ b/README.md @@ -164,10 +164,12 @@ If any of these match your platform, you can include the Maven dependency and ge The Maven coordinate `net.ladenthin:llama` publishes one default JAR (CPU-only; its Windows natives are built with the Ninja Multi-Config + MSVC toolchain) plus -optional JARs selected via a Maven `<classifier>`: three Windows GPU builds -(CUDA / Vulkan / OpenCL), the Linux CUDA and Android OpenCL builds, and an -alternate-toolchain MSVC Windows CPU build. Pick at most one GPU/accelerator -classifier — those are mutually exclusive — and optionally a CPU Windows build. +optional JARs selected via a Maven `<classifier>`: NVIDIA CUDA (Linux / Windows), +Vulkan (Linux x86-64 / aarch64, Windows), AMD ROCm/HIP (Linux / Windows), Intel +SYCL (Linux fp16 / fp32, Windows) and OpenVINO (Linux / Windows) GPU builds, OpenCL +(Android Adreno, Windows x86-64 / Snapdragon-arm64), and an alternate-toolchain MSVC +Windows CPU build.
Pick at most one GPU/accelerator classifier — those are mutually +exclusive — and optionally a CPU Windows build. | Classifier | Backend | Target platform | Runtime requirement | |---|---|---|---| @@ -180,6 +182,22 @@ classifier — those are mutually exclusive — and optionally a CPU Windows bui | `vulkan-linux-x86-64` | Vulkan | Linux x86-64 with a Vulkan 1.2+ GPU (NVIDIA / AMD / Intel) | A Vulkan runtime (`libvulkan.so.1`), which current GPU drivers install. No Vulkan SDK is needed at runtime. The most portable Linux GPU option (vendor-independent, no CUDA toolkit). Built natively on `ubuntu-latest`, so it shares the aarch64 build's higher glibc floor (≈ 2.39). | | `vulkan-linux-aarch64` | Vulkan | Linux aarch64 with a Vulkan 1.2+ GPU | A Vulkan runtime (`libvulkan.so.1`) from the device/driver. glibc ≥ 2.39 (built on `ubuntu-24.04-arm`). | | `opencl-android-aarch64` | OpenCL (Adreno) | Android aarch64 with Qualcomm Adreno GPU | A device-supplied OpenCL ICD (`libOpenCL.so`). Devices without an ICD (e.g. most non-Snapdragon Android hardware) must use the default CPU JAR. | +| `rocm-linux-x86-64` | ROCm / HIP | Linux x86-64 with AMD GPU | An installed AMD ROCm runtime (`libamdhip64.so`, `librocblas.so`, `libhipblas.so`) on the host. Not bundled; native load fails without it. No CPU fallback. | +| `rocm-windows-x86-64` | ROCm / HIP | Windows x86-64 with AMD GPU | The AMD HIP SDK runtime DLLs (`amdhip64.dll`, `rocblas.dll`, `hipblas.dll`) on `PATH`. Not bundled. No CPU fallback. | +| `sycl-fp16-linux-x86-64` | SYCL (Intel oneAPI, fp16) | Linux x86-64 with Intel GPU (Arc / iGPU) | An installed Intel oneAPI / Level-Zero runtime. fp16 accumulation (faster, slightly lower precision). Not bundled. | +| `sycl-fp32-linux-x86-64` | SYCL (Intel oneAPI, fp32) | Linux x86-64 with Intel GPU (Arc / iGPU) | An installed Intel oneAPI / Level-Zero runtime. fp32 accumulation (higher precision). Not bundled. 
| +| `sycl-windows-x86-64` | SYCL (Intel oneAPI) | Windows x86-64 with Intel GPU (Arc / iGPU) | The Intel oneAPI / Level-Zero runtime DLLs on `PATH`. Not bundled. | +| `opencl-windows-aarch64` | OpenCL (Adreno) | Windows-on-ARM aarch64 (Snapdragon X) with Adreno GPU | A device-supplied OpenCL ICD (`OpenCL.dll`, from the Adreno driver). Not bundled. | +| `openvino-linux-x86-64` | OpenVINO | Linux x86-64 (Intel GPU / NPU / CPU) | An installed Intel OpenVINO runtime. Not bundled. | +| `openvino-windows-x86-64` | OpenVINO | Windows x86-64 (Intel GPU / NPU / CPU) | The Intel OpenVINO runtime DLLs on `PATH`. Not bundled. | + +> [!NOTE] +> The AMD (`rocm-*`), Intel SYCL (`sycl-*`), Windows-on-ARM OpenCL +> (`opencl-windows-aarch64`) and Intel OpenVINO (`openvino-*`) classifiers are +> newly added GPU backends. Like the other GPU classifiers they are validated +> **build-only** in CI (GitHub runners have no matching GPU), so end-to-end +> inference is verified locally / on self-hosted hardware. As with every GPU JAR, +> the vendor runtime is supplied by the consumer's driver/toolkit and is not bundled. 
```xml @@ -252,6 +270,70 @@ classifier — those are mutually exclusive — and optionally a CPU Windows bui <version>5.0.4</version> <classifier>msvc-windows</classifier> </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>rocm-linux-x86-64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>rocm-windows-x86-64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>sycl-fp16-linux-x86-64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>sycl-fp32-linux-x86-64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>sycl-windows-x86-64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>opencl-windows-aarch64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>openvino-linux-x86-64</classifier> + </dependency> + + + <dependency> + <groupId>net.ladenthin</groupId> + <artifactId>llama</artifactId> + <version>5.0.4</version> + <classifier>openvino-windows-x86-64</classifier> + </dependency> ``` > [!IMPORTANT] diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index f526b093..b33d2575 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -247,12 +247,18 @@ endif() # under its own Maven classifier, so it must land in a backend-specific resource # root (the default CPU tree stays src/main/resources/). The GPU branches are # OS-aware because the same GGML flag is used on more than one platform: -# - GGML_CUDA -> Linux (resources_linux_cuda) AND Windows (resources_windows_cuda) -# - GGML_OPENCL -> Android (resources_android_opencl) AND Windows (resources_windows_opencl) -# - GGML_VULKAN -> Windows (resources_windows_vulkan) AND Linux (resources_linux_vulkan) +# - GGML_CUDA -> Linux (resources_linux_cuda) AND Windows (resources_windows_cuda) +# - GGML_OPENCL -> Android (resources_android_opencl) AND Windows (resources_windows_opencl) +# - GGML_VULKAN -> Windows (resources_windows_vulkan) AND Linux (resources_linux_vulkan) +# - GGML_HIP -> Linux (resources_linux_rocm) AND Windows (resources_windows_rocm) [AMD ROCm/HIP] +# - GGML_SYCL -> Windows (resources_windows_sycl) AND Linux (fp16/fp32 split, see below) [Intel oneAPI] +# - GGML_OPENVINO -> Linux (resources_linux_openvino) AND Windows (resources_windows_openvino) [Intel OpenVINO] # The classifier->tree mapping is mirrored by the matching Maven profile in pom.xml.
The Linux # Vulkan tree holds both x86_64 and aarch64 under Linux/${OS_ARCH}; two Maven profiles -# (vulkan-linux / vulkan-linux-aarch64) split it into one single-arch classifier JAR each. +# (vulkan-linux / vulkan-linux-aarch64) split it into one single-arch classifier JAR each. The +# Windows OpenCL tree likewise holds both x86_64 (desktop ICD) and aarch64 (Snapdragon/Adreno), +# split by the opencl-windows / opencl-windows-aarch64 profiles. Linux SYCL ships two precision +# variants at the SAME arch, so it is routed to two distinct trees by GGML_SYCL_F16 (fp16 vs fp32). if(GGML_CUDA) if(OS_NAME STREQUAL "Windows") set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_cuda/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) @@ -277,6 +283,33 @@ elseif(GGML_OPENCL) set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_android_opencl/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) message(STATUS "GPU (OpenCL Android) build - Installing files to ${JLLAMA_DIR}") endif() +elseif(GGML_HIP) + if(OS_NAME STREQUAL "Windows") + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_rocm/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (ROCm/HIP Windows) build - Installing files to ${JLLAMA_DIR}") + else() + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_linux_rocm/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (ROCm/HIP Linux) build - Installing files to ${JLLAMA_DIR}") + endif() +elseif(GGML_SYCL) + if(OS_NAME STREQUAL "Windows") + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_sycl/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (SYCL Windows) build - Installing files to ${JLLAMA_DIR}") + elseif(GGML_SYCL_F16) + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_linux_sycl_fp16/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (SYCL Linux fp16) build - Installing files to ${JLLAMA_DIR}") + else() + set(JLLAMA_DIR 
${CMAKE_SOURCE_DIR}/src/main/resources_linux_sycl_fp32/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (SYCL Linux fp32) build - Installing files to ${JLLAMA_DIR}") + endif() +elseif(GGML_OPENVINO) + if(OS_NAME STREQUAL "Windows") + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_windows_openvino/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (OpenVINO Windows) build - Installing files to ${JLLAMA_DIR}") + else() + set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources_linux_openvino/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) + message(STATUS "GPU (OpenVINO Linux) build - Installing files to ${JLLAMA_DIR}") + endif() else() set(JLLAMA_DIR ${CMAKE_SOURCE_DIR}/src/main/resources/net/ladenthin/llama/${OS_NAME}/${OS_ARCH}) message(STATUS "CPU build - Installing files to ${JLLAMA_DIR}") diff --git a/llama/pom.xml b/llama/pom.xml index 636e6da1..da5a58d2 100644 --- a/llama/pom.xml +++ b/llama/pom.xml @@ -1314,12 +1314,15 @@ SPDX-License-Identifier: MIT - + are better supported. The resource copy includes ONLY the Windows/x86_64 + subtree so the aarch64 natives (opencl-windows-aarch64, staged into the same + tree by the sibling job) do not leak into this JAR. Staged by CI before this + profile runs. 
--> opencl-windows @@ -1364,7 +1367,7 @@ SPDX-License-Identifier: MIT ${basedir}/src/main/resources_windows_opencl/ - **/*.* + net/ladenthin/llama/Windows/x86_64/** @@ -1394,6 +1397,633 @@ SPDX-License-Identifier: MIT + + + rocm-linux + + + + org.apache.maven.plugins + maven-compiler-plugin + + + rocm-linux + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_linux_rocm + + + + + + maven-resources-plugin + + + copy-resources-rocm-linux + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_linux_rocm + + + + ${basedir}/src/main/resources_linux_rocm/ + + net/ladenthin/llama/Linux/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + rocm-linux + package + + jar + + + rocm-linux-x86-64 + + ${project.build.outputDirectory}_linux_rocm + + + + + + + + + + rocm-windows + + + + org.apache.maven.plugins + maven-compiler-plugin + + + rocm-windows + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_windows_rocm + + + + + + maven-resources-plugin + + + copy-resources-rocm-windows + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_windows_rocm + + + + ${basedir}/src/main/resources_windows_rocm/ + + net/ladenthin/llama/Windows/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + rocm-windows + package + + jar + + + rocm-windows-x86-64 + + ${project.build.outputDirectory}_windows_rocm + + + + + + + + + + sycl-fp16-linux + + + + org.apache.maven.plugins + maven-compiler-plugin + + + sycl-fp16-linux + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_linux_sycl_fp16 + + + + + + maven-resources-plugin + + + copy-resources-sycl-fp16-linux + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_linux_sycl_fp16 + + + + ${basedir}/src/main/resources_linux_sycl_fp16/ + + 
net/ladenthin/llama/Linux/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + sycl-fp16-linux + package + + jar + + + sycl-fp16-linux-x86-64 + + ${project.build.outputDirectory}_linux_sycl_fp16 + + + + + + + + + + sycl-fp32-linux + + + + org.apache.maven.plugins + maven-compiler-plugin + + + sycl-fp32-linux + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_linux_sycl_fp32 + + + + + + maven-resources-plugin + + + copy-resources-sycl-fp32-linux + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_linux_sycl_fp32 + + + + ${basedir}/src/main/resources_linux_sycl_fp32/ + + net/ladenthin/llama/Linux/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + sycl-fp32-linux + package + + jar + + + sycl-fp32-linux-x86-64 + + ${project.build.outputDirectory}_linux_sycl_fp32 + + + + + + + + + + sycl-windows + + + + org.apache.maven.plugins + maven-compiler-plugin + + + sycl-windows + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_windows_sycl + + + + + + maven-resources-plugin + + + copy-resources-sycl-windows + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_windows_sycl + + + + ${basedir}/src/main/resources_windows_sycl/ + + net/ladenthin/llama/Windows/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + sycl-windows + package + + jar + + + sycl-windows-x86-64 + + ${project.build.outputDirectory}_windows_sycl + + + + + + + + + + opencl-windows-aarch64 + + + + org.apache.maven.plugins + maven-compiler-plugin + + + opencl-windows-aarch64 + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_windows_opencl_aarch64 + + + + + + maven-resources-plugin + + + copy-resources-opencl-windows-aarch64 + process-classes + + copy-resources + + + + 
${project.build.outputDirectory}_windows_opencl_aarch64 + + + + ${basedir}/src/main/resources_windows_opencl/ + + net/ladenthin/llama/Windows/aarch64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + opencl-windows-aarch64 + package + + jar + + + opencl-windows-aarch64 + + ${project.build.outputDirectory}_windows_opencl_aarch64 + + + + + + + + + + openvino-linux + + + + org.apache.maven.plugins + maven-compiler-plugin + + + openvino-linux + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_linux_openvino + + + + + + maven-resources-plugin + + + copy-resources-openvino-linux + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_linux_openvino + + + + ${basedir}/src/main/resources_linux_openvino/ + + net/ladenthin/llama/Linux/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + openvino-linux + package + + jar + + + openvino-linux-x86-64 + + ${project.build.outputDirectory}_linux_openvino + + + + + + + + + + openvino-windows + + + + org.apache.maven.plugins + maven-compiler-plugin + + + openvino-windows + compile + + compile + + + + module-info.java + + + -h + src/main/cpp + + + ${project.build.outputDirectory}_windows_openvino + + + + + + maven-resources-plugin + + + copy-resources-openvino-windows + process-classes + + copy-resources + + + + ${project.build.outputDirectory}_windows_openvino + + + + ${basedir}/src/main/resources_windows_openvino/ + + net/ladenthin/llama/Windows/x86_64/** + + + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + openvino-windows + package + + jar + + + openvino-windows-x86-64 + + ${project.build.outputDirectory}_windows_openvino + + + + + + + + vmlens From 41028260e640a845385a28053f12ff34c6f128ae Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 09:25:29 +0000 Subject: [PATCH 24/29] Fix vendor-toolchain installs for the 4 failing GPU classifier jobs Mirror upstream llama.cpp's own 
release-job recipes for the Windows SYCL and Windows HIP builds, and fix the two OpenVINO installs: - Windows ROCm/HIP: the AMD HIP SDK URL 404'd and find_package(hip) could not locate the SDK. Use HIP SDK 26.Q1 (upstream's pin), resolve HIP_PATH from the installed ROCm dir, and pass -DCMAKE_PREFIX_PATH plus the SDK's own clang/clang++ so ggml-hip's find_package(hip) resolves (GPU_TARGETS, upstream spelling). - Windows SYCL: the oneAPI offline installer URL returned 403. Use upstream's intel-deep-learning-essentials-2025.3.3.18 offline installer with the extract + bootstrapper silent install (DPC++/MKL/oneDNN/TBB components), then setvars intel64 --force and build with cl (C) + icx (C++), matching upstream. - Linux OpenVINO: OpenVINOConfig.cmake's find_package(TBB) failed. Add libtbb-dev (supplies TBBConfig.cmake). - Windows OpenVINO: the archive extracts into a nested versioned folder, so the hard-coded C:\openvino\runtime\cmake did not exist. Resolve the nested dir and pass -DOpenVINO_DIR explicitly. 
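
For illustration only, the nested-folder resolution from the Windows OpenVINO bullet can be sketched in POSIX shell (the workflow itself does this in PowerShell via Get-ChildItem). The helper name `resolve_openvino_dir` is hypothetical and not part of the workflow:

```shell
# Hypothetical helper: print "<first subdirectory of $1>/runtime/cmake".
# Mirrors the idea of picking the first directory under the extraction root.
resolve_openvino_dir() {
    extract_root=$1
    for entry in "$extract_root"/*/; do
        # The glob stays unexpanded when no subdirectory exists; skip that case.
        [ -d "$entry" ] || continue
        # Strip the trailing slash left by the glob before appending the suffix.
        printf '%s/runtime/cmake\n' "${entry%/}"
        return 0
    done
    # Fail loudly instead of emitting a path that does not exist.
    echo "no versioned folder under $extract_root" >&2
    return 1
}
```

Returning non-zero when nothing is found keeps the step fail-loud, in the same spirit as the exit-code check added to the HIP SDK install.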
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 41 ++++++++++++++++++++++------------- 1 file changed, 26 insertions(+), 15 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 00e3a05b..ff9233ed 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -1155,16 +1155,21 @@ jobs: arch: x64 - name: Install AMD HIP SDK for Windows shell: pwsh - run: | - $url = "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q4-Win10-Win11-For-HIP.exe" - Invoke-WebRequest -Uri $url -OutFile "$env:RUNNER_TEMP\hip-sdk.exe" - Start-Process -FilePath "$env:RUNNER_TEMP\hip-sdk.exe" -ArgumentList "-install" -Wait - "HIP_PATH=C:\Program Files\AMD\ROCm\6.2\" | Out-File -FilePath $env:GITHUB_ENV -Append - "C:\Program Files\AMD\ROCm\6.2\bin" | Out-File -FilePath $env:GITHUB_PATH -Append + # Mirrors upstream llama.cpp's windows-hip release job: HIP SDK 26.Q1, then + # resolve HIP_PATH from the installed ROCm dir and point the compilers + + # CMAKE_PREFIX_PATH at it so ggml-hip's find_package(hip) resolves. 
+ run: | + $url = "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-26.Q1-Win11-For-HIP.exe" + Invoke-WebRequest -Uri $url -OutFile "$env:RUNNER_TEMP\rocm-install.exe" + $proc = Start-Process "$env:RUNNER_TEMP\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru -Wait + if ($proc.ExitCode -ne 0) { Write-Error "HIP SDK install failed with exit code $($proc.ExitCode)"; exit 1 } + $hip = $(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Split-Path | Split-Path) + "HIP_PATH=$hip" | Out-File -FilePath $env:GITHUB_ENV -Append + "$hip\bin" | Out-File -FilePath $env:GITHUB_PATH -Append - name: Build libraries shell: cmd run: | - .github\build.bat -G "Ninja Multi-Config" -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030;gfx1100;gfx1101;gfx1102 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang -DOS_NAME=Windows -DOS_ARCH=x86_64 + .github\build.bat -G "Ninja Multi-Config" -DGGML_HIP=ON -DGPU_TARGETS=gfx1030;gfx1100;gfx1101;gfx1102 -DCMAKE_PREFIX_PATH="%HIP_PATH%" -DCMAKE_C_COMPILER="%HIP_PATH%\bin\clang.exe" -DCMAKE_CXX_COMPILER="%HIP_PATH%\bin\clang++.exe" -DOS_NAME=Windows -DOS_ARCH=x86_64 - name: Upload artifacts uses: actions/upload-artifact@v7 with: @@ -1263,16 +1268,19 @@ jobs: uses: ilammy/msvc-dev-cmd@v1 with: arch: x64 - - name: Install Intel oneAPI (Windows, DPC++ compiler) + - name: Install Intel oneAPI (DPC++ + MKL + oneDNN + TBB) shell: cmd + # Mirrors upstream llama.cpp's windows-sycl release job: extract the offline + # installer, then run its bootstrapper with the DPC++/MKL/oneDNN/TBB components. 
run: | - curl -fSL -o "%RUNNER_TEMP%\oneapi.exe" "https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9a98af19-1c68-46ce-9fdd-e249240c7c42/intel-oneapi-base-toolkit-2025.0.1.47_offline.exe" - "%RUNNER_TEMP%\oneapi.exe" -s -a --silent --eula accept --components intel.oneapi.win.dpcpp-compiler:intel.oneapi.win.mkl.devel + curl -fSL -o "%RUNNER_TEMP%\oneapi.exe" "https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe" + "%RUNNER_TEMP%\oneapi.exe" -s -x -f "%RUNNER_TEMP%\oneapi_extracted" --log "%RUNNER_TEMP%\extract.log" + "%RUNNER_TEMP%\oneapi_extracted\bootstrapper.exe" -s --action install --components=intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel --eula=accept -p=NEED_VS2022_INTEGRATION=0 --log-dir="%RUNNER_TEMP%" - name: Build libraries shell: cmd run: | - call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" - .github\build.bat -G "Ninja Multi-Config" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DOS_NAME=Windows -DOS_ARCH=x86_64 + call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force + .github\build.bat -G "Ninja Multi-Config" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DOS_NAME=Windows -DOS_ARCH=x86_64 - name: Upload artifacts uses: actions/upload-artifact@v7 with: @@ -1335,7 +1343,8 @@ jobs: wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/intel-openvino.gpg > /dev/null echo "deb [signed-by=/usr/share/keyrings/intel-openvino.gpg] https://apt.repos.intel.com/openvino/2025 ubuntu24 main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2025.list sudo apt-get update - sudo apt-get install -y openvino-2025.0.0 + # libtbb-dev supplies the TBBConfig.cmake that OpenVINOConfig.cmake's find_package(TBB) needs. 
+ sudo apt-get install -y openvino-2025.0.0 libtbb-dev - name: Build libraries shell: bash run: | @@ -1370,11 +1379,13 @@ jobs: $url = "https://storage.openvinotoolkit.org/repositories/openvino/packages/2025.0/windows/openvino_toolkit_windows_2025.0.0.17942.1f68be9f594_x86_64.zip" Invoke-WebRequest -Uri $url -OutFile "$env:RUNNER_TEMP\openvino.zip" Expand-Archive -Path "$env:RUNNER_TEMP\openvino.zip" -DestinationPath "C:\openvino" -Force - "OpenVINO_DIR=C:\openvino\runtime\cmake" | Out-File -FilePath $env:GITHUB_ENV -Append + # The archive extracts into a nested versioned folder; point OpenVINO_DIR at its runtime/cmake. + $root = (Get-ChildItem "C:\openvino" -Directory | Select-Object -First 1).FullName + "OpenVINO_DIR=$root\runtime\cmake" | Out-File -FilePath $env:GITHUB_ENV -Append - name: Build libraries shell: cmd run: | - .github\build.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOS_NAME=Windows -DOS_ARCH=x86_64 + .github\build.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOpenVINO_DIR="%OpenVINO_DIR%" -DOS_NAME=Windows -DOS_ARCH=x86_64 - name: Upload artifacts uses: actions/upload-artifact@v7 with: From 3119efdba78b18efd21b169254a639771d29c554 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 10:00:19 +0000 Subject: [PATCH 25/29] Fix GPU classifier build errors: HIP MSVC, SYCL/OpenVINO CRT, OpenVINO OpenCL MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Second round of fail-loud CI fixes for the new GPU classifiers, from the actual build logs: - Windows ROCm/HIP: device-code compile failed because ROCm 7.1's HIP clang headers cannot overload the __host__ __device__ isgreater/isless/... that the very new VS 2026 MSVC declares via _CLANG_BUILTIN2. Move the job to windows-2022 (MSVC 14.4x), which is what upstream llama.cpp uses for win-hip. - Windows SYCL: icx rejected the project's static /MT CRT with '-fsycl' ("invalid argument 'MT' not allowed with '-fsycl'"). 
Exempt GGML_SYCL (and GGML_OPENVINO, whose import libs are /MD) from the static-CRT force in CMakeLists so they build with the dynamic /MD runtime. Those classifiers already need the vendor runtime on the host, so the self-contained-DLL rationale doesn't apply; CPU + CUDA/Vulkan/OpenCL keep /MT. - Linux OpenVINO: past the TBB fix, ggml-openvino's find_package(OpenCL) failed. Add ocl-icd-opencl-dev + opencl-headers to the apt install. - Windows OpenVINO: same find_package(OpenCL) need — build it via build_opencl_windows.bat (stages the Khronos headers + OpenCL.lib, then delegates to build.bat) instead of build.bat directly. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 18 +++++++++++++----- llama/CMakeLists.txt | 9 ++++++++- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index ff9233ed..6d815bb3 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -1139,9 +1139,14 @@ jobs: if-no-files-found: error build-windows-x86_64-rocm: - name: Build Windows 2025 x86_64 ROCm/HIP (AMD) + name: Build Windows x86_64 ROCm/HIP (AMD) needs: [startgate, build-webui] - runs-on: windows-2025-vs2026 + # windows-2022 (MSVC 14.4x), NOT windows-2025-vs2026 (VS 2026 / MSVC 14.51): ROCm 7.1's + # HIP clang headers (__clang_hip_cmath.h) cannot overload the __host__ __device__ + # isgreater/isless/... that the very new MSVC declares via _CLANG_BUILTIN2, so the + # device-code compile fails. Upstream llama.cpp builds win-hip on windows-2022 for the same + # reason (it drives the HIP SDK's own clang and relies on the older MSVC STL). 
+ runs-on: windows-2022 steps: - uses: actions/checkout@v7 - name: Download shared WebUI assets @@ -1343,8 +1348,9 @@ jobs: wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/intel-openvino.gpg > /dev/null echo "deb [signed-by=/usr/share/keyrings/intel-openvino.gpg] https://apt.repos.intel.com/openvino/2025 ubuntu24 main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2025.list sudo apt-get update - # libtbb-dev supplies the TBBConfig.cmake that OpenVINOConfig.cmake's find_package(TBB) needs. - sudo apt-get install -y openvino-2025.0.0 libtbb-dev + # libtbb-dev supplies the TBBConfig.cmake that OpenVINOConfig.cmake's find_package(TBB) needs; + # ocl-icd-opencl-dev + opencl-headers satisfy ggml-openvino's find_package(OpenCL). + sudo apt-get install -y openvino-2025.0.0 libtbb-dev ocl-icd-opencl-dev opencl-headers - name: Build libraries shell: bash run: | @@ -1384,8 +1390,10 @@ jobs: "OpenVINO_DIR=$root\runtime\cmake" | Out-File -FilePath $env:GITHUB_ENV -Append - name: Build libraries shell: cmd + # Via build_opencl_windows.bat (not build.bat): it stages the Khronos OpenCL headers + + # OpenCL.lib that ggml-openvino's find_package(OpenCL) needs, then delegates to build.bat. run: | - .github\build.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOpenVINO_DIR="%OpenVINO_DIR%" -DOS_NAME=Windows -DOS_ARCH=x86_64 + .github\build_opencl_windows.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOpenVINO_DIR="%OpenVINO_DIR%" -DOS_NAME=Windows -DOS_ARCH=x86_64 - name: Upload artifacts uses: actions/upload-artifact@v7 with: diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index b33d2575..36603fe5 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -8,7 +8,14 @@ project(jllama CXX) # Must be set before any FetchContent_MakeAvailable() so that llama.cpp and all # other subprojects inherit the same CRT choice (mixing /MT and /MD in a single # link is a linker error). 
-if(MSVC) +# +# EXCEPTION: the Intel oneAPI SYCL and OpenVINO backends must use the DYNAMIC /MD +# runtime — `icx -fsycl` rejects /MT outright ("invalid argument 'MT' not allowed +# with '-fsycl'") and the OpenVINO import libraries are built /MD (mixing would be a +# link error). Those classifiers already require the vendor runtime on the host, so +# the self-contained-DLL rationale does not apply to them; the CPU + CUDA/Vulkan/OpenCL +# classifiers keep /MT. +if(MSVC AND NOT GGML_SYCL AND NOT GGML_OPENVINO) set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$:Debug>" CACHE STRING "" FORCE) endif() From fb88b121a33c9fbddd5463b7d458a686a0f83d54 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 10:24:35 +0000 Subject: [PATCH 26/29] OpenVINO: bump to 2026.2.1 + fix OpenCL headers; raise cmake min to 3.22 OpenVINO backend now compiles but failed on both platforms: - ov::Allocator template error (allocate/is_equal on a const void*): version mismatch. llama.cpp's ggml-openvino targets OpenVINO 2026.2.1 (what upstream ships), not the 2025.0.0 I pinned. Bump Linux apt to openvino-2026.2.1 (repo /openvino/2026) and the Windows archive to 2026.2.1. - Windows 'CL/cl2.hpp' not found: the staged Khronos OpenCL-Headers dropped cl2.hpp. Install OpenCL via vcpkg (opencl:x64-windows ships cl2.hpp) and pass the vcpkg toolchain file, mirroring upstream's windows-openvino job; drop the build_opencl_windows.bat staging for this job. - Linux: add opencl-clhpp-headers + intel-opencl-icd to the apt set (upstream's full OpenCL package list for ubuntu-openvino). Also raise cmake_minimum_required 3.15 -> 3.22 to match what the build actually relies on (runners ship 3.31); no behavior change. 
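
A local build can check its CMake against the new 3.22 floor before configuring. This is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper and not part of the build scripts:

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) to order dotted version strings.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Possible use (left commented out so the sketch does not require cmake):
#   cmake_version=$(cmake --version | head -n 1 | awk '{print $3}')
#   version_ge "$cmake_version" 3.22 || { echo "CMake >= 3.22 required" >&2; exit 1; }
```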
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 24 +++++++++++++++--------- llama/CMakeLists.txt | 2 +- 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 6d815bb3..478d7139 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -1346,11 +1346,14 @@ jobs: - name: Install Intel OpenVINO (apt repo) run: | wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/intel-openvino.gpg > /dev/null - echo "deb [signed-by=/usr/share/keyrings/intel-openvino.gpg] https://apt.repos.intel.com/openvino/2025 ubuntu24 main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2025.list + echo "deb [signed-by=/usr/share/keyrings/intel-openvino.gpg] https://apt.repos.intel.com/openvino/2026 ubuntu24 main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2026.list sudo apt-get update - # libtbb-dev supplies the TBBConfig.cmake that OpenVINOConfig.cmake's find_package(TBB) needs; - # ocl-icd-opencl-dev + opencl-headers satisfy ggml-openvino's find_package(OpenCL). - sudo apt-get install -y openvino-2025.0.0 libtbb-dev ocl-icd-opencl-dev opencl-headers + # OpenVINO 2026.2.1 matches the version llama.cpp's ggml-openvino targets (2025.0.0's + # ov::Allocator API mismatched and broke the template compile). libtbb-dev supplies + # TBBConfig.cmake; ocl-icd-opencl-dev + opencl-headers + opencl-clhpp-headers (the C++ + # CL/cl2.hpp) + intel-opencl-icd satisfy ggml-openvino's find_package(OpenCL) — the same + # OpenCL package set upstream llama.cpp's ubuntu-openvino job installs. 
+ sudo apt-get install -y openvino-2026.2.1 libtbb-dev ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd - name: Build libraries shell: bash run: | @@ -1379,10 +1382,14 @@ jobs: uses: ilammy/msvc-dev-cmd@v1 with: arch: x64 - - name: Install Intel OpenVINO (Windows archive) + - name: Install OpenCL headers (vcpkg) + Intel OpenVINO 2026.2.1 shell: pwsh + # vcpkg's opencl port ships the full C++ headers incl. CL/cl2.hpp that OpenVINO's + # ocl_wrapper.hpp needs (the Khronos OpenCL-Headers dropped cl2.hpp) — same as upstream + # llama.cpp's windows-openvino job. OpenVINO 2026.2.1 matches ggml-openvino's target API. run: | - $url = "https://storage.openvinotoolkit.org/repositories/openvino/packages/2025.0/windows/openvino_toolkit_windows_2025.0.0.17942.1f68be9f594_x86_64.zip" + C:\vcpkg\vcpkg install opencl:x64-windows + $url = "https://storage.openvinotoolkit.org/repositories/openvino/packages/2026.2.1/windows/openvino_toolkit_windows_2026.2.1.21919.ede283a88e3_x86_64.zip" Invoke-WebRequest -Uri $url -OutFile "$env:RUNNER_TEMP\openvino.zip" Expand-Archive -Path "$env:RUNNER_TEMP\openvino.zip" -DestinationPath "C:\openvino" -Force # The archive extracts into a nested versioned folder; point OpenVINO_DIR at its runtime/cmake. @@ -1390,10 +1397,9 @@ jobs: "OpenVINO_DIR=$root\runtime\cmake" | Out-File -FilePath $env:GITHUB_ENV -Append - name: Build libraries shell: cmd - # Via build_opencl_windows.bat (not build.bat): it stages the Khronos OpenCL headers + - # OpenCL.lib that ggml-openvino's find_package(OpenCL) needs, then delegates to build.bat. + # vcpkg toolchain file wires in the OpenCL (incl. cl2.hpp) that ggml-openvino needs. 
run: | - .github\build_opencl_windows.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOpenVINO_DIR="%OpenVINO_DIR%" -DOS_NAME=Windows -DOS_ARCH=x86_64 + .github\build.bat -G "Ninja Multi-Config" -DGGML_OPENVINO=ON -DOpenVINO_DIR="%OpenVINO_DIR%" -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake -DOS_NAME=Windows -DOS_ARCH=x86_64 - name: Upload artifacts uses: actions/upload-artifact@v7 with: diff --git a/llama/CMakeLists.txt b/llama/CMakeLists.txt index 36603fe5..523d4b3b 100644 --- a/llama/CMakeLists.txt +++ b/llama/CMakeLists.txt @@ -1,4 +1,4 @@ -cmake_minimum_required(VERSION 3.15) +cmake_minimum_required(VERSION 3.22) project(jllama CXX) From c83bfe000aca95e801d7698e14af2db2f666b05f Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 10:37:34 +0000 Subject: [PATCH 27/29] Linux OpenVINO: install 2026.2.1 from archive, not the (nonexistent) 2026 apt repo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The apt repo https://apt.repos.intel.com/openvino/2026 returns 404 — Intel only publishes OpenVINO apt repos up to ~2025, and 2025.x has the older ov::Allocator API that breaks ggml-openvino's template compile. Switch Linux OpenVINO to the archive for 2026.2.1, exactly as upstream llama.cpp's linux-setup-openvino composite action does: storage.openvinotoolkit.org/repositories/openvino/packages/2026.2.1/linux/ openvino_toolkit_ubuntu24_2026.2.1.21919.ede283a88e3_x86_64.tgz extracted to /opt/intel/openvino, with OpenVINO_DIR set to its runtime/cmake. OpenCL headers (incl. the C++ CL/cl2.hpp via opencl-clhpp-headers) come from Ubuntu's own repos, so no Intel apt repo is needed at all. 
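
Both OpenVINO jobs now pull the same 2026.2.1 build from the archive server. As a sketch only, the two URLs could be derived from a single pin so they cannot drift apart. The variable and function names below are hypothetical, the workflow hard-codes the URLs, and the path pattern is only known to match this one release (the earlier 2025.0.0 archive used a different directory layout):

```shell
# Single pin shared by the Linux and Windows jobs (values taken from the workflow URLs).
OV_VERSION="2026.2.1"
OV_BUILD="21919.ede283a88e3"
OV_BASE="https://storage.openvinotoolkit.org/repositories/openvino/packages"

# Hypothetical helper: print the archive URL for "linux" or "windows".
openvino_archive_url() {
    case "$1" in
        linux)
            printf '%s/%s/linux/openvino_toolkit_ubuntu24_%s.%s_x86_64.tgz\n' \
                "$OV_BASE" "$OV_VERSION" "$OV_VERSION" "$OV_BUILD" ;;
        windows)
            printf '%s/%s/windows/openvino_toolkit_windows_%s.%s_x86_64.zip\n' \
                "$OV_BASE" "$OV_VERSION" "$OV_VERSION" "$OV_BUILD" ;;
        *)
            # Unknown platform: report and fail rather than print a guess.
            echo "unsupported platform: $1" >&2
            return 1 ;;
    esac
}
```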
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 478d7139..fb21ea71 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -1343,23 +1343,24 @@ jobs: with: distribution: 'temurin' java-version: ${{ env.JAVA_VERSION }} - - name: Install Intel OpenVINO (apt repo) + - name: Install OpenCL dev + Intel OpenVINO 2026.2.1 (archive) run: | - wget -qO- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/intel-openvino.gpg > /dev/null - echo "deb [signed-by=/usr/share/keyrings/intel-openvino.gpg] https://apt.repos.intel.com/openvino/2026 ubuntu24 main" | sudo tee /etc/apt/sources.list.d/intel-openvino-2026.list + # Intel's OpenVINO APT repo only publishes up to ~2025 (the /openvino/2026 path 404s), and + # 2025.x has the older ov::Allocator API that breaks ggml-openvino's template compile. So use + # the ARCHIVE for 2026.2.1 — exactly what upstream llama.cpp's linux-setup-openvino action does. + # OpenCL headers (incl. the C++ CL/cl2.hpp via opencl-clhpp-headers) come from Ubuntu's own repos. sudo apt-get update - # OpenVINO 2026.2.1 matches the version llama.cpp's ggml-openvino targets (2025.0.0's - # ov::Allocator API mismatched and broke the template compile). libtbb-dev supplies - # TBBConfig.cmake; ocl-icd-opencl-dev + opencl-headers + opencl-clhpp-headers (the C++ - # CL/cl2.hpp) + intel-opencl-icd satisfy ggml-openvino's find_package(OpenCL) — the same - # OpenCL package set upstream llama.cpp's ubuntu-openvino job installs. 
- sudo apt-get install -y openvino-2026.2.1 libtbb-dev ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd + sudo apt-get install -y ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd + url="https://storage.openvinotoolkit.org/repositories/openvino/packages/2026.2.1/linux/openvino_toolkit_ubuntu24_2026.2.1.21919.ede283a88e3_x86_64.tgz" + sudo mkdir -p /opt/intel/openvino + curl -fSL "$url" | sudo tar -xz --strip-components=1 -C /opt/intel/openvino + echo "OpenVINO_DIR=/opt/intel/openvino/runtime/cmake" >> "$GITHUB_ENV" - name: Build libraries shell: bash run: | source /opt/intel/openvino/setupvars.sh || true mvn --no-transfer-progress -f llama/pom.xml compile - .github/build.sh "-DGGML_OPENVINO=ON -DGGML_NATIVE=OFF -DOS_NAME=Linux -DOS_ARCH=x86_64" + .github/build.sh "-DGGML_OPENVINO=ON -DOpenVINO_DIR=$OpenVINO_DIR -DGGML_NATIVE=OFF -DOS_NAME=Linux -DOS_ARCH=x86_64" - name: Upload artifacts uses: actions/upload-artifact@v7 with: From c0f2d1a5dfe1dda34e535fb5adeb9e418941eb42 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 12:13:00 +0000 Subject: [PATCH 28/29] Add Linux s390x (big-endian) build with a qemu-user C++ test gate Wire build-linux-s390x: cross-compile for IBM Z (s390x, big-endian) with the GCC cross toolchain (native x86 speed), then run the full 462-test C++ suite under qemu-user as a real big-endian correctness gate for our byte-order- sensitive code (the little-endian WAV writer, JSON/token/embedding transforms, JNI helpers). Model-backed Java tests are not run under emulation (slow/flaky); the Java<->JNI boundary uses host-native array copies, so the C++ gate covers the actual endian risk. - publish.yml: build-linux-s390x (g++-s390x-linux-gnu + qemu-user-static; CMAKE_CROSSCOMPILING_EMULATOR + QEMU_LD_PREFIX make ctest run the s390x exe; GGML_OPENMP=OFF avoids cross-libgomp). 
s390x is a default-jar CPU platform like aarch64, so the artifact merges via the *-libraries glob (no classifier / pom profile). Fail-loud and in package.needs. - OSInfo.java: map os.arch=s390x -> Linux/s390x (S390X constant + archMapping). - README/CLAUDE.md: document the platform + the big-endian gate. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .github/workflows/publish.yml | 48 +++++++++++++++++++ CLAUDE.md | 18 +++++++ README.md | 2 +- .../net/ladenthin/llama/loader/OSInfo.java | 4 ++ 4 files changed, 71 insertions(+), 1 deletion(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index fb21ea71..c8cc9b2d 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -437,6 +437,53 @@ jobs: name: Linux-aarch64-libraries path: ${{ github.workspace }}/llama/src/main/resources/net/ladenthin/llama/ + build-linux-s390x: + name: Build and Test Linux s390x (big-endian, qemu) + needs: [startgate, build-webui] + # Cross-compile for IBM Z (s390x, BIG-ENDIAN) with the GCC cross toolchain, then run the full + # C++ unit suite under qemu-user — a real big-endian correctness gate for our helpers and + # serializers (esp. the little-endian WAV writer, JSON/token/embedding transforms). The BUILD + # is native speed (x86 cross-gcc); only the tiny test binary is emulated. s390x is a DEFAULT-jar + # CPU platform (like aarch64), so the artifact merges via the `*-libraries` glob (no classifier / + # pom profile). Model-backed Java tests are NOT run under emulation (a JVM + GGUF inference under + # qemu-user is slow/flaky); the C++ gate covers the actual byte-order risk since the Java<->JNI + # boundary uses host-native array copies. GGML_OPENMP=OFF avoids cross-libgomp issues (ggml uses + # its own std::thread pool). CMAKE_CROSSCOMPILING_EMULATOR makes ctest run the s390x exe via qemu; + # QEMU_LD_PREFIX lets the emulated binary find the s390x sysroot libs. 
+ runs-on: ubuntu-latest + env: + QEMU_LD_PREFIX: /usr/s390x-linux-gnu + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + steps: + - uses: actions/checkout@v7 + - name: Download shared WebUI assets + uses: actions/download-artifact@v8 + with: + name: webui-generated + path: ${{ github.workspace }}/llama/webui-generated/ + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: ${{ env.JAVA_VERSION }} + - name: Install s390x cross toolchain + qemu-user + run: | + sudo apt-get update + sudo apt-get install -y gcc-s390x-linux-gnu g++-s390x-linux-gnu qemu-user-static + - name: Build libraries (cross-compile s390x) + shell: bash + run: | + mvn --no-transfer-progress -f llama/pom.xml compile + .github/build.sh "-DGGML_NATIVE=OFF -DGGML_OPENMP=OFF -DBUILD_TESTING=ON -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=s390x -DCMAKE_C_COMPILER=s390x-linux-gnu-gcc -DCMAKE_CXX_COMPILER=s390x-linux-gnu-g++ -DCMAKE_CROSSCOMPILING_EMULATOR=/usr/bin/qemu-s390x-static -DOS_NAME=Linux -DOS_ARCH=s390x" + - name: Run C++ unit tests under qemu-s390x (big-endian gate) + run: ctest --test-dir llama/build --output-on-failure + - name: Upload artifacts + uses: actions/upload-artifact@v7 + with: + name: Linux-s390x-libraries + path: ${{ github.workspace }}/llama/src/main/resources/net/ladenthin/llama/ + build-linux-x86_64-vulkan: name: Build Linux x86_64 Vulkan needs: [startgate, build-webui] @@ -1978,6 +2025,7 @@ jobs: needs: - crosscompile-linux-x86_64-cuda - crosscompile-linux-aarch64 + - build-linux-s390x - build-linux-x86_64-vulkan - build-linux-aarch64-vulkan - crosscompile-android-aarch64 diff --git a/CLAUDE.md b/CLAUDE.md index f12eac62..1c861a53 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1067,6 +1067,24 @@ Wiring (mirrors the macOS native jobs, not the dockcross jobs): - Branch protection: if a required check pinned the old 
name "Cross-Compile Linux aarch64 (LTS)", repoint it to "Build and Test Linux aarch64". +### Linux s390x: big-endian cross-build + qemu test gate + +`build-linux-s390x` extends the default JAR to **IBM Z (s390x, big-endian)** — the one target whose +byte order differs from every other platform. It **cross-compiles** with the GCC s390x toolchain +(`g++-s390x-linux-gnu`, native x86 speed — no emulated build) and then runs the **full C++ unit suite +under `qemu-user`** (`CMAKE_CROSSCOMPILING_EMULATOR=/usr/bin/qemu-s390x-static`, `QEMU_LD_PREFIX=/usr/s390x-linux-gnu`). +That `ctest` run is a **real big-endian correctness gate** for the byte-order-sensitive surface — the +little-endian WAV writer (`tts_wav.hpp`), the JSON/token/embedding transforms, and the JNI helpers — +which is where an endian bug in *our* code could hide. Model-backed **Java** tests are deliberately +**not** run under emulation (a JVM + GGUF inference under `qemu-user` is slow and flaky); the Java↔JNI +boundary uses host-native array copies (endian-transparent), so the C++ gate covers the actual risk. +`-DGGML_OPENMP=OFF` sidesteps cross-libgomp issues (ggml uses its own `std::thread` pool). s390x is a +CPU platform like aarch64, so it ships in the **default** JAR (`Linux-s390x-libraries` merges via the +`*-libraries` glob; `OSInfo` maps `os.arch=s390x` → `Linux/s390x`) — no classifier, no pom profile. +**Fail-loud** and in `package.needs` like every other build. (Upstream llama.cpp already supports s390x +— it ships `ubuntu-s390x` with GGUF big-endian handling — so the native inference path is upstream's +concern; this job validates only *our* layer's endian-safety.) + ## Testing ### Java tests diff --git a/README.md b/README.md index 119215b9..772dd18f 100644 --- a/README.md +++ b/README.md @@ -173,7 +173,7 @@ exclusive — and optionally a CPU Windows build. 
| Classifier | Backend | Target platform | Runtime requirement | |---|---|---|---| -| _(none)_ | CPU | Linux x86-64 / aarch64, macOS x86-64 / aarch64, Windows x86-64 / x86 / aarch64 (Ninja Multi-Config + MSVC), Android aarch64 (CPU) | A JDK 8+ JVM. **Linux `aarch64` additionally requires glibc ≥ 2.39** (e.g. Ubuntu 24.04+, Debian 13+) — it is built natively on `ubuntu-24.04-arm`, matching upstream llama.cpp's own ARM binaries; older-glibc ARM hosts (Ubuntu 22.04, Debian 12, RHEL 8/9, Amazon Linux 2023) are not supported. Linux x86-64 keeps a glibc 2.17 floor (manylinux2014). **Windows `aarch64`** (Windows on ARM — Snapdragon X / Surface) is built natively on `windows-11-arm` and ships in the default JAR alongside the x86-64 / x86 natives. | +| _(none)_ | CPU | Linux x86-64 / aarch64 / s390x, macOS x86-64 / aarch64, Windows x86-64 / x86 / aarch64 (Ninja Multi-Config + MSVC), Android aarch64 (CPU) | A JDK 8+ JVM. **Linux `aarch64` additionally requires glibc ≥ 2.39** (e.g. Ubuntu 24.04+, Debian 13+) — it is built natively on `ubuntu-24.04-arm`, matching upstream llama.cpp's own ARM binaries; older-glibc ARM hosts (Ubuntu 22.04, Debian 12, RHEL 8/9, Amazon Linux 2023) are not supported. Linux x86-64 keeps a glibc 2.17 floor (manylinux2014). **Windows `aarch64`** (Windows on ARM — Snapdragon X / Surface) is built natively on `windows-11-arm` and ships in the default JAR alongside the x86-64 / x86 natives. | | `msvc-windows` | CPU (MSVC / Visual Studio generator) | Windows x86-64 and x86 | None beyond a JDK 8+ JVM. Same CPU backend as the default JAR's Windows natives, but compiled with the Visual Studio generator instead of `Ninja Multi-Config`. Both use the same MSVC toolchain (static `/MT` CRT), so they are functionally equivalent — provided as an alternate-toolchain option. 
| | `cuda13-windows-x86-64` | CUDA 13 | Windows x86-64 with NVIDIA GPU | NVIDIA driver + CUDA 13 Toolkit installed on the host (`cudart64_13.dll`, `cublas64_13.dll`, `cublasLt64_13.dll` resolvable on `PATH`). The runtime libraries are **not bundled** in the JAR; native-library load fails with `UnsatisfiedLinkError` if they are absent. No CPU fallback. | | `vulkan-windows-x86-64` | Vulkan | Windows x86-64 with a Vulkan 1.2+ GPU (NVIDIA / AMD / Intel) | A Vulkan runtime (`vulkan-1.dll`), which current GPU drivers install. No Vulkan SDK is needed at runtime. The most portable Windows GPU option (vendor-independent). | diff --git a/llama/src/main/java/net/ladenthin/llama/loader/OSInfo.java b/llama/src/main/java/net/ladenthin/llama/loader/OSInfo.java index 21cf7a3b..138b5900 100644 --- a/llama/src/main/java/net/ladenthin/llama/loader/OSInfo.java +++ b/llama/src/main/java/net/ladenthin/llama/loader/OSInfo.java @@ -114,6 +114,8 @@ public OSInfo() {} public static final String PPC64 = "ppc64"; /** Folder name for 64-bit RISC-V. */ public static final String RISCV64 = "riscv64"; + /** Folder name for 64-bit IBM Z (s390x, big-endian). */ + public static final String S390X = "s390x"; static { // x86 mappings @@ -155,6 +157,8 @@ public OSInfo() {} archMapping.put("ppc64le", PPC64); archMapping.put(RISCV64, RISCV64); + + archMapping.put(S390X, S390X); } /** From 9b0f50b77ed53648fd15c6533cb4a7608a8e95a2 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 14:25:52 +0000 Subject: [PATCH 29/29] NativeServer.main: own the server in try/finally (SonarQube S2095) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address 'Resources should be closed' (java:S2095) on NativeServer.main. The server was already closed on every real path (shutdown hook on SIGTERM, explicit close on self-termination), but not in a structure Sonar recognizes. 
Wrap the body in try/finally so close() is guaranteed on normal or exceptional exit — S2095's 'close in a finally clause' option. try-with-resources is deliberately NOT used: the shutdown hook must also call close() explicitly, which javac flags under -Werror as 'explicit call to close() on an auto-closeable resource'. close() is idempotent (guards on a zero handle), so the finally and the hook both firing is safe. The now-redundant stoppedByHook flag is dropped. All 7 NativeServerSmokeTest cases still pass. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7 --- .../ladenthin/llama/server/NativeServer.java | 52 ++++++++++--------- 1 file changed, 28 insertions(+), 24 deletions(-) diff --git a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java index ea70e1b0..65caf6c8 100644 --- a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java +++ b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java @@ -198,31 +198,35 @@ public void close() { * @throws InterruptedException if interrupted while waiting for the server to exit */ public static void main(String[] args) throws InterruptedException { + // Own the server in a try/finally so close() is guaranteed on normal or exceptional exit of + // the block (satisfies S2095 via the "close in a finally clause" option — try-with-resources + // is not used because the shutdown hook must also call close() explicitly, which javac flags + // under -Werror as an "explicit call to close() on an auto-closeable resource"). close() is + // idempotent (guards on a zero handle), so the finally and the hook both firing is safe. 
final NativeServer server = new NativeServer(args); - final AtomicBoolean stoppedByHook = new AtomicBoolean(false); - // Signalled by the shutdown hook so the main thread wakes immediately on Ctrl-C / SIGTERM - // rather than waiting out a poll tick — and so the wait uses a bounded latch await instead of - // Thread.sleep (banned by LlamaArchitectureTest.noThreadSleep). - final CountDownLatch stopSignal = new CountDownLatch(1); - // Graceful Ctrl-C / SIGTERM: the embedded server installs no signal handlers of its own - // (see patches/0006), so the JVM-level shutdown hook is what stops it before exit. - Runtime.getRuntime() - .addShutdownHook(new Thread( - () -> { - stoppedByHook.set(true); - server.close(); - stopSignal.countDown(); - }, - "jllama-native-server-shutdown")); - server.start(); - // Keep the JVM alive until the native worker exits — on its own (e.g. a fatal startup/model - // error that llama_server has already logged) or because the shutdown hook stopped it. The - // bounded await returns early when the hook fires; on timeout we re-check isRunning() to catch - // a self-terminated worker. - while (server.isRunning() && !stopSignal.await(200L, TimeUnit.MILLISECONDS)) { - // wait for the native worker to exit or the shutdown hook to fire - } - if (!stoppedByHook.get()) { + try { + // Signalled by the shutdown hook so the main thread wakes immediately on Ctrl-C / SIGTERM + // rather than waiting out a poll tick — and so the wait uses a bounded latch await instead + // of Thread.sleep (banned by LlamaArchitectureTest.noThreadSleep). + final CountDownLatch stopSignal = new CountDownLatch(1); + // Graceful Ctrl-C / SIGTERM: the embedded server installs no signal handlers of its own + // (see patches/0006), so the JVM-level shutdown hook is what stops it before exit. 
+ Runtime.getRuntime() + .addShutdownHook(new Thread( + () -> { + server.close(); + stopSignal.countDown(); + }, + "jllama-native-server-shutdown")); + server.start(); + // Keep the JVM alive until the native worker exits — on its own (e.g. a fatal startup/model + // error that llama_server has already logged) or because the shutdown hook stopped it. The + // bounded await returns early when the hook fires; on timeout we re-check isRunning() to + // catch a self-terminated worker. + while (server.isRunning() && !stopSignal.await(200L, TimeUnit.MILLISECONDS)) { + // wait for the native worker to exit or the shutdown hook to fire + } + } finally { server.close(); } }
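
A minimal, standalone sketch of the pattern PATCH 29/29 relies on: an idempotent `close()` called both from a shutdown-hook-like thread and from a `finally` block, with a bounded latch wait in between. `FakeServer`, `runUntilStopped`, and the `AtomicLong` handle are hypothetical stand-ins for illustration only; the real `NativeServer` guards on a zero native handle, and its internals may differ. The hook here is an ordinary thread rather than one registered via `Runtime.addShutdownHook`, so the sketch can run to completion.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class IdempotentCloseSketch {

    /** Hypothetical stand-in for NativeServer: a non-zero handle means "open". */
    public static final class FakeServer {
        private final AtomicLong handle = new AtomicLong(42L);
        private final AtomicInteger releases = new AtomicInteger();

        public boolean isRunning() {
            return handle.get() != 0L;
        }

        /** Idempotent: only the caller that swaps the handle to zero releases it. */
        public void close() {
            if (handle.getAndSet(0L) != 0L) {
                releases.incrementAndGet();
            }
        }

        public int releaseCount() {
            return releases.get();
        }
    }

    /** Waits with a bounded latch await (no Thread.sleep), then closes in finally. */
    public static int runUntilStopped(FakeServer server, CountDownLatch stopSignal)
            throws InterruptedException {
        try {
            while (server.isRunning() && !stopSignal.await(20L, TimeUnit.MILLISECONDS)) {
                // wait for the worker to exit or the hook to fire
            }
        } finally {
            server.close(); // safe even if the hook already closed the server
        }
        return server.releaseCount();
    }

    public static void main(String[] args) throws InterruptedException {
        FakeServer server = new FakeServer();
        CountDownLatch stopSignal = new CountDownLatch(1);
        // Plays the role of the JVM shutdown hook: close first, then wake the waiter.
        Thread hook = new Thread(
                () -> {
                    server.close();
                    stopSignal.countDown();
                },
                "sketch-shutdown-hook");
        hook.start();
        int releases = runUntilStopped(server, stopSignal);
        hook.join();
        System.out.println("releases=" + releases); // prints "releases=1"
    }
}
```

Because the release is tied to the single successful swap of the handle to zero, the hook and the `finally` block can both call `close()` in any order and the resource is still released exactly once, which is why the separate `stoppedByHook` flag becomes unnecessary.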