Feature wave: Android AAR + Kotlin facade, server attach/router modes, LangChain4j streaming, GGUF tooling (llama.cpp b9878)#298
Merged
Three features from the similar-projects investigation (native-server-first scope: no new Java-server routes):
Runtime LoRA adapter control (upstream GET/POST /lora-adapters parity):
- new JNI methods getLoraAdaptersJson/setLoraAdaptersJson posting SERVER_TASK_TYPE_GET_LORA / SET_LORA (parse_lora_request wire format)
- typed LlamaModel.getLoraAdapters() / setLoraAdapters(Map) / setLoraAdapter(int, float); new value.LoraAdapter + json.LoraAdapterResponseParser (finite-scale validation)
- closes the setLoraInitWithoutApply() gap (its Javadoc pointed at an endpoint the bindings could not reach)
Typed batch embeddings (requested by upstream kherud users):
- LlamaModel.embed(Collection<String>) -> List<float[]> over the OAI array-input path of handleEmbeddings; json.EmbeddingResponseParser restores request order via the response index field
UTF-8-safe JNI string path:
- json_to_jstring_impl now serialises via upstream safe_json_to_str (U+FFFD replacement instead of json::type_error 316 when non-stream content ends mid-codepoint at the token limit) and builds the Java String through the cached String(byte[], "UTF-8") constructor (utf8_to_jstring_impl) instead of NewStringUTF, which expects Modified UTF-8 and is spec-invalid for supplementary-plane characters (4-byte emoji; Android CheckJNI aborts)
- applyTemplate return and the log-callback message take the same path
- streamed chunks were already boundary-safe (upstream process_token holds back incomplete UTF-8); pinned end to end by the new tests
Tests: +17 C++ unit tests (utf8/json_to_jstring byte-capture mocks, parse_lora_request, server_task_result_get_lora::to_json; total 479), +28 model-free Java unit tests (parsers + PIT-complete LoraAdapter), +3 model-backed integration classes/methods (RuntimeLoraIntegrationTest, Utf8RoundTripIntegrationTest, LlamaEmbeddingsTest batch cases). PIT 255/255 mutants killed; javadoc:jar clean; ArchUnit green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
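The encoding mismatch behind the NewStringUTF change can be reproduced in plain Java. The following is a minimal sketch, not the binding's code: the class and method names are hypothetical, and it uses DataOutputStream.writeUTF only because that JDK method emits the same Modified UTF-8 form that NewStringUTF expects. It shows that one supplementary-plane code point is 4 bytes in standard UTF-8 but 6 bytes (two 3-byte surrogates) in Modified UTF-8, so the two byte streams are not interchangeable.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; nothing here is part of the jllama API.
final class Utf8PathSketch {

    /** Byte length of the standard UTF-8 encoding. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Byte length of the Modified UTF-8 encoding (the form NewStringUTF expects). */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s); // 2-byte length prefix followed by Modified UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2; // drop the length prefix
    }

    /** Round trip through the byte[] + UTF-8 String constructor. */
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // supplementary plane
        System.out.println(standardUtf8Length(emoji));
        System.out.println(modifiedUtf8Length(emoji));
        System.out.println(roundTrip(emoji).equals(emoji));
    }
}
```

Building the String from a byte[] with an explicit UTF-8 charset sidesteps the mismatch, which is the route the commit describes for utf8_to_jstring_impl.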
Three native-server-focused features:
- NativeServer attach mode (closes the "reuse an already-loaded LlamaModel" TODO): patches/0007 extracts the upstream route table into a shared llama_server_register_common_routes(...) and adds llama_server_attach(), which serves an already-loaded LlamaModel's server_context over the full upstream HTTP frontend (WebUI, resumable streaming) - no second model load, no second start_loop; the model's worker keeps driving the queue. Java: NativeServer(LlamaModel, String...) over startAttachedNativeServer JNI. Validated by NativeServerAttachIntegrationTest (HTTP health/props/completion/chat + concurrent direct JNI calls on the same model).
- In-JVM router mode (multi-model management): the upstream router spawns workers by re-executing its own binary, which inside a JVM is java, so embedded router workers could never start. patches/0008 adds the LLAMA_SERVER_WORKER_CMD override (whitespace-split, replaces only the worker-binary token), exposed as NativeServer.setWorkerCommand(String...); workers relaunch as fresh JVMs running the classic single-model NativeServer. Validated by RouterModeIntegrationTest (Linux CI: --models-dir listing -> POST /models/load -> worker-JVM spawn -> proxied chat completion) plus model-free setWorkerCommand validation tests.
- In-JVM GGUF quantization: LlamaQuantizer.quantize(in, out, QuantizationType[, threads, allowRequantize]) over llama_model_quantize (LLamaSharp/llama-cpp-python precedent). args.QuantizationType pins the llama_ftype b9870 mapping (PIT-complete, 256/256 mutants killed). QuantizerIntegrationTest re-quantizes the 135M draft model and loads the result; refusal-without-opt-in and missing-input error paths covered.
Local verification: full native rebuild with patches 0007/0008 applied cleanly, 479/479 C++ tests pass, NativeLibraryLoadSmokeTest green with the rebuilt lib, javadoc clean, spotless + pinned clang-format applied. The model-backed integration tests run in CI.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
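The worker-command override can be pictured as a small argv rewrite. The sketch below is one reading of "whitespace-split, replaces only the worker-binary token" and is an assumption throughout: the class, the method, and the idea that the binary is the first argv element are all hypothetical, and the real logic lives in patches/0008 on the native side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the override semantics; not the patch's code.
final class WorkerCommandSketch {

    /**
     * Replaces the worker-binary token (assumed to be argv[0]) with the
     * whitespace-split override and keeps every remaining argument as is.
     */
    static List<String> applyOverride(List<String> workerArgv, String override) {
        if (workerArgv.isEmpty()) {
            throw new IllegalArgumentException("worker argv must contain the binary");
        }
        if (override == null || override.trim().isEmpty()) {
            return new ArrayList<>(workerArgv); // no override: argv unchanged
        }
        List<String> result = new ArrayList<>(Arrays.asList(override.trim().split("\\s+")));
        result.addAll(workerArgv.subList(1, workerArgv.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> argv = Arrays.asList("java", "--model", "m.gguf", "--port", "8081");
        System.out.println(applyOverride(argv, "java -cp app.jar com.example.Worker"));
    }
}
```

Under this reading, an embedded router whose own binary resolves to java can still spawn workers, because the override supplies a full JVM launch command while the per-worker arguments pass through untouched.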
Bridge the remaining langchain4j v1 gaps in llama-langchain4j (blocking path):
- Tool calling: ChatRequest.toolSpecifications()/toolChoice() map to the jllama typed tools path (ToolDefinition + tool_choice); assistant tool-call turns and ToolExecutionResultMessages round-trip through the history, and a native tool_calls response comes back as AiMessage.toolExecutionRequests() with finish reason TOOL_EXECUTION.
- JsonSchemaElementSerializer: recursive public-API-only serializer for the langchain4j JsonSchemaElement tree (object/string/integer/number/boolean/enum/array/reference/anyOf/null/raw), emitting langchain4j's $defs / #/$defs/... conventions (their serializer is internal-only).
- response_format: ResponseFormat.JSON maps to json_object mode; a JsonSchema-bearing format maps to the native json_schema grammar constraint (structured output). Applies to both adapters.
- Multimodal user input: ImageContent (base64 or URL) and AudioContent (inline wav/mp3) map to ContentPart array-form content for the mtmd pipeline; unsupported media fails loud instead of silently dropping.
- JllamaStreamingChatModel: fails fast with UnsupportedFeatureException when tools are requested (streaming tool-call reconstruction is the documented follow-up).
Tests: 12 new model-free mapping/serializer tests (31 total in module), plus JllamaToolCallingIntegrationTest (gated on the new net.ladenthin.llama.langchain4j.tool.model property; CI passes the cached Qwen2.5-Instruct tool model to the langchain4j integration job).
Also bundles three SpotBugs verify fixes from the previous batch: LlamaModel static-field ordering (IMC_IMMATURE_CLASS_WRONG_FIELD_ORDER), EmbeddingResponseParser IndexedVector rewrite (CLI_CONSTANT_LIST_INDEX), and a scoped EI_EXPOSE_REP2 exclusion for NativeServer's borrowed-model attach constructor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Replace the submodule/NDK source-integration flow as the recommended
Android path with first-class Maven artifacts, so an Android Studio app
needs exactly one dependency line:
implementation("net.ladenthin:llama-android:<version>") // CPU
implementation("net.ladenthin:llama-android-opencl:<version>") // Adreno
implementation("net.ladenthin:llama-kotlin:<version>") // optional
llama-android/ (standalone plain-Gradle build, NOT a reactor module —
Maven cannot deploy <packaging>aar</packaging>; no AGP and no Android SDK
needed to build, version + mirrored dependency versions are parsed from
the Maven poms so `mvn versions:set` stays the single bump point):
- AAR = manifest (minSdkVersion 28, enforced on consumers by AGP) +
classes.jar (byte-identical Maven-built core classes minus desktop
native resources and module-info.class) + jni/arm64-v8a/libjllama.so +
consumer R8/ProGuard rules (proguard.txt, applied automatically) + R.txt.
- POM mirrors the core's compile deps (jackson/slf4j-api/jspecify/
checker-qual); logback deliberately excluded (JVM-only binding).
- LlamaLoader already tries System.loadLibrary("jllama") first on
Android, so the AAR-installed .so resolves with zero core changes.
llama-kotlin/ (new Maven reactor module, pure Kotlin 2.2 / jvmTarget 1.8):
- generateFlow/generateChatFlow: cold Flow token streaming, source closed
on completion, error, AND cancellation (no leaked native task slots).
- completeSuspend/chatSuspend/chatCompleteTextSuspend/embedSuspend;
completeSuspend wires coroutine cancellation into the cooperative
CancellationToken so a cancelled coroutine stops the native loop at the
next token boundary.
- Core dep is provided-scope so Android consumers pair the facade with
the AAR instead of transitively pulling the fat desktop JAR.
- 6 model-free unit tests over the internal seams.
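The cooperative-cancellation contract described for completeSuspend can be sketched in Java, the language of the core binding. This is a hedged illustration only: the Token class and generate method are hypothetical stand-ins, not the project's CancellationToken or its native loop. The point is that the flag is checked once per token, so a cancel request takes effect at the next token boundary rather than mid-token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical illustration; names do not come from the jllama API.
final class CooperativeCancelSketch {

    /** Minimal cooperative token: one thread sets it, the loop polls it. */
    static final class Token {
        private final AtomicBoolean cancelled = new AtomicBoolean();
        void cancel() { cancelled.set(true); }
        boolean isCancelled() { return cancelled.get(); }
    }

    /** Emits tokens until the source is exhausted or cancellation is observed. */
    static int generate(Iterator<String> source, Token token, Consumer<String> sink) {
        int emitted = 0;
        // The check sits at the token boundary: a token that has started is
        // always delivered whole, and no further token is produced afterwards.
        while (!token.isCancelled() && source.hasNext()) {
            sink.accept(source.next());
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        Token token = new Token();
        List<String> seen = new ArrayList<>();
        Iterator<String> source = Arrays.asList("a", "b", "c", "d").iterator();
        generate(source, token, piece -> {
            seen.add(piece);
            if (seen.size() == 2) {
                token.cancel(); // simulate the coroutine being cancelled
            }
        });
        System.out.println(seen);
    }
}
```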
16 KB page-size (Google Play, Android 15+ targets): CMakeLists.txt now
pins -Wl,-z,max-page-size=16384 for Android builds and CI asserts every
LOAD segment of the shipped .so is 16384-aligned (currently satisfied by
toolchain default; the pin + assert prevent silent regression).
CI (publish.yml):
- test-java-llama-kotlin: model-free unit tests.
- package-android-aar: assembles both AARs from the fresh native
artifacts, validates structure (entries, minSdk, classes.jar content,
16 KB alignment) and runs an AGP consumer smoke test — the minimal app
fixture in .github/android-consumer-test/ resolves the AAR from
mavenLocal and runs a full R8 assembleRelease on the runner's Android
SDK, then asserts the APK carries libjllama.so and the un-stripped
binding (proving Android Studio consumption without an emulator).
- publish-snapshot/publish-release: gated on the new jobs; AAR snapshots
publish to the Central snapshots repo via Gradle, releases upload a
signed Central Portal bundle via the Publisher API. llama-kotlin rides
the normal reactor deploy.
Docs: README "Importing in Android" rewritten around the AAR (source
integration kept as advanced option), module READMEs, CLAUDE.md
(reactor layout, version bump, new "Android AAR + Kotlin facade"
section), RELEASE.md, TODO.md (Android section marked done; sample app
and multi-ABI/emulator-CI stay as follow-ups).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
…inks Add the projects surveyed during the feature-gap research and Android investigation that were not yet linked: the kherud/java-llama.cpp fork parent (previously only in the header note), the sibling llama.cpp bindings in other languages (llama-cpp-python, LLamaSharp, node-llama-cpp), and a new "Other local inference stacks" group for Ollama (whose native API this project's server implements) and ExecuTorch (the engine behind llama-stack-client-kotlin's local mode). The llama-stack-client-kotlin entry now points at the new llama-android AAR + llama-kotlin facade as the native on-device equivalent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Replace the raw HTTP+JSON boilerplate router-mode callers had to write themselves with a typed client for the upstream model-management endpoints:
- value.RouterModel (+ nested Status enum): one GET /models entry with identifier, lifecycle status (exact-match mapping of upstream server_model_status_to_string strings: downloading/downloaded/unloaded/loading/loaded/sleeping, UNKNOWN otherwise), the raw status string, and the router's failed-worker marker (status.failed + exit_code).
- json.RouterModelsResponseParser: pure transform of the router GET /models wire format (data/models array fallback, id/name fallback), unit-testable with JSON literals.
- server.RouterClient: listModels/findModel/loadModel/unloadModel plus awaitModelLoaded(id, timeout), which polls until LOADED and fails fast with the worker's exit code when the router marks the model failed, or immediately for an unknown id, instead of running out the timeout. Non-2xx responses surface the router's error body. Works against the in-JVM NativeServer router or any external llama-server router (plain HTTP, no JNI).
Tests: 25 new model-free tests: RouterModelTest (getters, status mapping, equals/hashCode, toString shapes), RouterModelsResponseParserTest (upstream shape, failed marker, fallbacks, tolerance), RouterClientTest (stub HTTP server: parsing, request bodies, error surfacing, the awaitModelLoaded state machine incl. poll-sequence, fail-fast, and timeout paths). RouterModeIntegrationTest now drives model discovery, load, and readiness through RouterClient against a real router, replacing its hand-rolled JSON polling.
Gates: layeredArchitecture updated (Server may access Json; the rule is the documented intent registry for new inter-package edges); awaitModelLoaded uses a never-counted-down CountDownLatch instead of the banned Thread.sleep; SpotBugs clean (toString/equals/hashCode added, exact status matching avoids IMPROPER_UNICODE, scoped URLCONNECTION_SSRF_FD exclusion with developer-supplied-host rationale); PIT 274/274 (RouterModel inside the value.* 100% gate); javadoc builds clean. README router-mode section, CLAUDE.md, and TODO.md updated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
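The awaitModelLoaded behaviour described above (poll until LOADED, fail fast on a failed worker or an unknown id, otherwise time out) can be sketched without any HTTP. This is a minimal illustration under stated assumptions: the class, the reduced Status enum, the Supplier-based poll, and the exception types are all hypothetical, and only the never-counted-down CountDownLatch pause is taken from the commit text.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical illustration of the polling state machine; not RouterClient itself.
final class AwaitLoadedSketch {

    enum Status { LOADING, LOADED, FAILED, UNKNOWN_ID }

    /** Returns the number of polls it took to observe LOADED. */
    static int awaitLoaded(Supplier<Status> poll, long timeoutMillis, long intervalMillis) {
        // Never counted down: await(...) then acts as an interruptible timed pause.
        CountDownLatch neverReleased = new CountDownLatch(1);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        int polls = 0;
        while (true) {
            Status status = poll.get();
            polls++;
            switch (status) {
                case LOADED:
                    return polls;
                case FAILED:
                    throw new IllegalStateException("worker failed"); // fail fast
                case UNKNOWN_ID:
                    throw new IllegalArgumentException("unknown model id"); // fail fast
                default:
                    break; // still loading: keep polling
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new IllegalStateException("timed out after " + polls + " polls");
            }
            long waitNanos = Math.min(remainingNanos, TimeUnit.MILLISECONDS.toNanos(intervalMillis));
            try {
                neverReleased.await(waitNanos, TimeUnit.NANOSECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```

Because the status source is a plain Supplier, the poll-sequence, fail-fast, and timeout paths can each be exercised with a canned sequence, in the same spirit as the stub-server tests the commit mentions.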
Replace the handwritten equals/hashCode with @EqualsAndHashCode over the host/port fields, matching the established pattern (value.* and the other server.* classes). toString stays intentionally handwritten so the client renders as its target URL in log traces — the same documented handwritten-toString convention ChatMessage/ToolCall/RouterModel use. SpotBugs (IMC_IMMATURE_CLASS_*) stays satisfied by the generated methods. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Three backlog features, each with model-free tests plus gated integration coverage:
GGUF metadata inspector (no model load):
- GgufInspector: pure-Java GGUF v2/v3 header + key/value reader with no native library, no tensor data, and cost independent of file size. Little- and big-endian containers auto-detected via the version field; fail-loud on v1, unknown versions/type ids, truncation, and implausible lengths (sanity caps). All value types decoded (integers→Long, floats→Double, bool, string, arrays).
- value.GgufMetadata: full entry table + typed accessors (architecture, name, parameter count, <arch>.context_length, general.file_type, chat template). Complements the loaded-model getModelMeta().
- 21 tests against in-memory generated fixtures (no committed binaries) + a gated real-model read.
LangChain4j streaming tool calls + thinking events:
- JllamaStreamingChatModel now streams over the native OAI chat.completion.chunk path via the new StreamingChunkAssembler: delta.content → onPartialResponse, delta.reasoning_content → onPartialThinking (+ AiMessage.thinking()), delta.tool_calls fragments accumulated per index → onPartialToolCall / onCompleteToolCall and AiMessage.toolExecutionRequests() with finish reason TOOL_EXECUTION; real finish reason + token usage on the final response. The UnsupportedFeatureException fail-fast is gone; toStreamingParameters now carries tools/tool_choice like the blocking path.
- 6 assembler tests (canned chunks: text, split/parallel tool calls, thinking, usage, fail-loud) + a gated streamed-tool-call integration test.
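The "delta.tool_calls fragments accumulated per index" step can be sketched on its own. The following is a hypothetical illustration, not StreamingChunkAssembler: it assumes the common OpenAI-style streaming shape in which the first fragment for an index carries the call id and function name, and later fragments for the same index carry only further pieces of the arguments string. JSON parsing is left out, so fragments arrive as already-extracted fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; class and method names are not from llama-langchain4j.
final class ToolCallAccumulatorSketch {

    /** One tool call being assembled from streamed fragments. */
    static final class Call {
        String id;
        String name;
        final StringBuilder arguments = new StringBuilder();

        @Override
        public String toString() {
            return name + "(" + arguments + ")";
        }
    }

    // Keyed by the fragment's index so parallel calls can interleave freely.
    private final Map<Integer, Call> calls = new TreeMap<>();

    /** Merges one fragment; null fields mean "not present in this fragment". */
    void accept(int index, String id, String name, String argumentsPiece) {
        Call call = calls.computeIfAbsent(index, i -> new Call());
        if (id != null) {
            call.id = id;
        }
        if (name != null) {
            call.name = name;
        }
        if (argumentsPiece != null) {
            call.arguments.append(argumentsPiece);
        }
    }

    /** Completed calls in index order. */
    List<Call> complete() {
        return new ArrayList<>(calls.values());
    }

    public static void main(String[] args) {
        ToolCallAccumulatorSketch acc = new ToolCallAccumulatorSketch();
        acc.accept(0, "call_a", "get_weather", "{\"city\":");
        acc.accept(1, "call_b", "get_time", "{\"tz\":\"UTC\"}");
        acc.accept(0, null, null, "\"Oslo\"}");
        System.out.println(acc.complete());
    }
}
```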
Session fork/rewind (conversation checkpoints):
- Session.checkpoint(filepath) → value.SessionCheckpoint pairing the native slot KV-save file with the transcript-turn snapshot; Session.rewind(checkpoint) restores both atomically under the session lock (native state and transcript cannot drift); Session.fork(newSlotId, filepath) branches into an independent session on another slot (same system message + params customizer; requires setParallel >= 2). All rejected while a stream is in progress, same guard as save/restore.
- Plumbing: ChatTranscript.turnsSnapshot()/resetTurns(), SessionState.turnsSnapshot()/restoreTurns()/getSystemMessage().
- Model-free bookkeeping/guard tests + SessionForkRewindIntegrationTest (rewind-continue, independent fork, own-slot fail-fast).
Gates: PIT 295/295 (GgufMetadata, SessionCheckpoint, ChatTranscript additions inside the value.* 100% gate); SpotBugs clean (dynamic exception messages in GgufInspector; scoped exclusions with rationale for the stateful-reader PRMC false positive, the tagged-decoder URV, and SessionCheckpoint's order-significant List parameter, following the ChatMessage precedent); javadoc clean; langchain4j module verify green (38 tests). README sections for checkpoints and GGUF inspection; TODO.md updated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
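For the GGUF inspector in this commit, the fixed-size header read and the endianness detection can be sketched in a few lines. This is not GgufInspector: the class is hypothetical, key/value decoding is omitted, and the detection rule (a version whose low 16 bits are zero when read little-endian is taken to be byte-swapped) is an assumption about what "auto-detected via the version field" means. The layout used is the GGUF v2/v3 header: 4 magic bytes "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration of a GGUF v2/v3 header read; not the project's reader.
final class GgufHeaderSketch {
    final ByteOrder order;
    final int version;
    final long tensorCount;
    final long kvCount;

    private GgufHeaderSketch(ByteOrder order, int version, long tensorCount, long kvCount) {
        this.order = order;
        this.version = version;
        this.tensorCount = tensorCount;
        this.kvCount = kvCount;
    }

    static GgufHeaderSketch parse(byte[] bytes) {
        if (bytes.length < 24) {
            throw new IllegalArgumentException("truncated GGUF header");
        }
        if (bytes[0] != 'G' || bytes[1] != 'G' || bytes[2] != 'U' || bytes[3] != 'F') {
            throw new IllegalArgumentException("not a GGUF file");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int version = buf.getInt(4);
        ByteOrder order = ByteOrder.LITTLE_ENDIAN;
        // Assumed heuristic: small version numbers leave the low 16 bits empty
        // only when the field was written big-endian.
        if ((version & 0xFFFF) == 0) {
            order = ByteOrder.BIG_ENDIAN;
            version = Integer.reverseBytes(version);
        }
        if (version != 2 && version != 3) {
            throw new IllegalArgumentException("unsupported GGUF version " + version);
        }
        buf.order(order);
        long tensorCount = buf.getLong(8);
        long kvCount = buf.getLong(16);
        if (tensorCount < 0 || kvCount < 0) {
            throw new IllegalArgumentException("implausible header counts");
        }
        return new GgufHeaderSketch(order, version, tensorCount, kvCount);
    }

    /** Builds an in-memory header fixture, so no binary file needs committing. */
    static byte[] fixture(ByteOrder order, int version, long tensorCount, long kvCount) {
        ByteBuffer buf = ByteBuffer.allocate(24);
        buf.put(new byte[] {'G', 'G', 'U', 'F'});
        buf.order(order).putInt(version).putLong(tensorCount).putLong(kvCount);
        return buf.array();
    }
}
```

A call such as GgufHeaderSketch.parse(GgufHeaderSketch.fixture(ByteOrder.BIG_ENDIAN, 3, 7L, 11L)) reads back the same counts and reports a big-endian container, which mirrors the in-memory-fixture testing approach the commit describes.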
Extends Android from build-verified to runtime-verified in CI, and makes
the binding usable on x86_64 Android environments (Android Studio
emulator, Chromebooks, x86-64 Android hardware) for all consumers.
Phase 1 — new native build:
- .github/dockcross/dockcross-android-x86_64 wrapper (same pinned image
tag as the arm64 one; wrappers are image-generic launchers — verified
byte-identical modulo image name; update.sh already listed the
generation command).
- crosscompile-android-x86_64 job (dockcross + the same sccache
steady-state env), artifact Linux-Android-x86_64-libraries — fail-loud
and in the package/publish needs graphs. The artifact ALSO merges into
the default JAR's Linux-Android/x86_64 tree automatically via the
*-libraries glob (OSInfo already maps x86_64 Android there), so plain
JAR consumers get the ABI too. The CMake Android guard (weak symbols +
16 KB max-page-size) keys on OS_NAME and applies unchanged.
Phase 2 — multi-ABI AAR:
- llama-android CPU AAR now ships jni/arm64-v8a + jni/x86_64 (per-ABI
fail-loud staging checks; app bundles split per ABI so phones download
only arm64). OpenCL flavor stays arm64-only (Adreno = Qualcomm ARM).
- Structural validation covers both ABIs incl. the 16 KB LOAD-alignment
readelf check per .so; the R8 consumer smoke asserts both libs in the
APK; publish jobs stage both ABIs.
Phase 3 — on-emulator instrumentation:
- test-android-emulator job: KVM-accelerated x86_64 emulator (API 30,
reactivecircus/android-emulator-runner), publishes the CPU AAR to
mavenLocal (per-publication task), adb-pushes the already-cached
draft model (AMD-Llama-135m, no new download) and runs the consumer
fixture's connectedDebugAndroidTest.
- OnDeviceInferenceTest (androidx.test): System.loadLibrary("jllama")
from the APK's native-lib dir + JNI_OnLoad FindClass against D8-dexed
classes, pure-Java GgufInspector on-device, and real native inference
(non-empty generation). Self-skips without the pushed model so a bare
local emulator run stays green.
- VALIDATION-ONLY for now (not in the publish needs graphs): emulator
boot is the flakiest CI machinery; promote to a release gate after a
stable streak (same staged policy as the sccache rollout). Not
covered by the emulator: arm64 kernels and the Adreno flavor — the
planned example app covers those on real hardware.
Docs: README (default-JAR platforms, 64-bit-only note, AAR section),
llama-android/README (multi-ABI), CLAUDE.md, TODO.md, fixture README.
Locally verified: multi-ABI AAR assembles with both ABIs, per-ABI
fail-loud check fires on a missing .so, and the per-publication
mavenLocal task publishes the CPU AAR.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Two first-run failures on PR #298:
- REUSE compliance (test job): the four files added by the Android/Kotlin work lacked SPDX info. llama-android/README.md, llama-kotlin/README.md, and the javadoc-placeholder README.txt get SPDX headers; the generated dockcross-android-x86_64 wrapper joins the existing dockcross wrapper annotation in REUSE.toml. `reuse lint` is compliant again (365/365).
- SonarCloud "Build and analyze": RouterModeIntegrationTest.tearDown called NativeServer.setWorkerCommand() unconditionally; when the class self-skips via a @BeforeAll assumption (no model on the lib-less analysis runner), @AfterAll still runs, and setWorkerCommand loads the native library -> UnsatisfiedLinkError. The teardown now clears the worker-command override only when setup actually installed it (workerCommandSet flag), so a skipped class tears down as a no-op.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
…RL() CodeQL flagged the URL(String) constructor (deprecated since JDK 20, no validation/encoding). URI.create(...).toURL() is the non-deprecated equivalent and is available on Java 8, so the bytecode floor is unaffected. Behavior identical for the fixed localhost/router URLs; 9/9 RouterClientTest green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
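The replacement named in this commit is a one-line change, sketched here in a self-contained form. The helper class and its parameters are hypothetical; URI.create and URI.toURL are standard JDK calls. Unlike new URL(String), URI.create validates the syntax up front and throws IllegalArgumentException for input such as an unencoded space.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper; only the URI.create(...).toURL() idiom comes from the commit.
final class RouterUrlSketch {

    /** Builds an endpoint URL without the deprecated URL(String) constructor. */
    static URL endpoint(String host, int port, String path) {
        try {
            // URI.create validates the string; toURL converts to the legacy type.
            return URI.create("http://" + host + ":" + port + path).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a usable URL", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(endpoint("127.0.0.1", 8080, "/models"));
    }
}
```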
Two more first-run failures on PR #298 (head 725f570):
- Package + Validate Android AARs: the R8 release pass in the consumer smoke failed with "Missing classes". The AAR's consumer keep rule retains the whole binding, so R8 verifies every referenced type, including compile-time-only ones absent on Android: com.sun.net.httpserver.* (JVM-only OpenAiCompatServer transport), lombok.Generated and animal-sniffer's IgnoreJRERequirement (CLASS-retention build annotations). consumer-proguard.txt now ships the matching -dontwarn rules, so every consumer app's R8 pass gets them automatically, which is the standard treatment for compileOnly references in published Android libraries.
- Android emulator on-device test: sh exit code 2 with no gradle output. reactivecircus/android-emulator-runner executes the script input LINE BY LINE via sh, so the multi-line if-block was fed as a lone "if ...; then" (syntax error). The logic moves into the committed .github/run-android-emulator-test.sh (bash -n verified) and the job's script: is a single line invoking it.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
The on-device emulator test failed with UnsatisfiedLinkError ("No native
library found ... Directly from .apk/lib") even though the x86_64
libjllama.so was verifiably inside the APK. Root cause (confirmed via
readelf -d on the shipped 5.0.5 arm64 Android lib, which carries the same
latent defect): the dockcross cross-clang links two DT_NEEDED entries that
exist on no Android device, so bionic's dlopen rejects the library:
- libomp.so (LLVM OpenMP runtime, pulled in by ggml's OpenMP path)
- libc++_shared.so (NDK shared C++ runtime, only present when an app
packages it itself)
Three-part fix:
1. llama/CMakeLists.txt (Android guard): set GGML_OPENMP OFF (ggml falls
back to its own std::thread pool — the same trade the Windows-arm64
clang-cl job makes) and link -static-libstdc++ so libc++ is embedded.
Only bionic system libraries remain as dependencies.
2. publish.yml (package-android-aar validation): per-.so DT_NEEDED
whitelist via readelf -dW (libc/libm/libdl/liblog/libandroid, plus
libOpenCL.so for the OpenCL flavor) — a future toolchain bump cannot
silently reintroduce a non-bionic dependency; the job fails naming the
offending library.
3. LlamaLoader: the Android System.loadLibrary catch block now includes
the UnsatisfiedLinkError message in the "Directly from .apk/lib (...)"
tried-path entry — the actual dlopen reason was previously swallowed,
which made this failure look like a missing library.
Also documents the new dlopen-ability invariant in CLAUDE.md next to the
16 KB page-size invariant.
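The DT_NEEDED whitelist in step 2 amounts to a filter over `readelf -dW` output. The CI job itself is a workflow step; the Java below is only a sketch of the same check, with a hypothetical class name and the assumption that the whitelisted names carry a .so suffix (the commit lists them as libc/libm/libdl/liblog/libandroid). It parses the "(NEEDED) Shared library: [name]" lines that readelf prints for the dynamic section and returns every library that is not allowed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the whitelist check; the real check runs in publish.yml.
final class DtNeededWhitelistSketch {

    // Assumed .so spellings of the bionic libraries named in the commit.
    static final Set<String> BIONIC = new HashSet<>(Arrays.asList(
            "libc.so", "libm.so", "libdl.so", "liblog.so", "libandroid.so"));

    private static final Pattern NEEDED =
            Pattern.compile("\\(NEEDED\\)\\s+Shared library: \\[([^\\]]+)\\]");

    /** Returns the DT_NEEDED entries in the readelf output that are not allowed. */
    static List<String> offending(String readelfOutput, Set<String> allowed) {
        List<String> bad = new ArrayList<>();
        Matcher matcher = NEEDED.matcher(readelfOutput);
        while (matcher.find()) {
            String library = matcher.group(1);
            if (!allowed.contains(library)) {
                bad.add(library); // named so the failure message can point at it
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String sample =
                " 0x0000000000000001 (NEEDED)             Shared library: [libm.so]\n"
              + " 0x0000000000000001 (NEEDED)             Shared library: [libomp.so]\n"
              + " 0x000000000000000e (SONAME)             Library soname: [libjllama.so]\n";
        System.out.println(offending(sample, BIONIC));
    }
}
```

For the OpenCL flavor, the allowed set would additionally contain libOpenCL.so, as the commit states.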
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Small upstream range (5 files, ~9.5 KiB): a quantized-tensor fix for the CPU concat op and a null-buffer guard for the K/V rotation graph inputs (upstream #25215), plus WebUI settings changes (auto-followed by the build-webui job) and a test-backend-ops addition (not built here). All eight local patches (0001-0008) re-verified: applied cleanly in order onto a b9873 checkout; the range touches no patch-target file and no OuteTTS generator anchor. History row appended to docs/history/llama-cpp-breaking-changes.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
ggml-only range (3 files, ~9.6 KiB): the CUDA concat op gains the same quantized-tensor block-size handling b9873 added to the CPU op, plus a tensor-parallel + -ncmoe crash fix on MoE models (upstream #25028). No API surface, no project source changes. All eight local patches (0001-0008) re-verified: applied cleanly in order onto a b9876 checkout; the range touches no patch-target file and no OuteTTS generator anchor. History rows appended to docs/history/llama-cpp-breaking-changes.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
The TODO entry "PIT gate not hermetic — value.ContentPart.audioFile(Path)" was stale: ContentPartTest already carries the hermetic @TempDir tests the entry proposed (wav dispatch incl. case-insensitive .WAV, mp3 dispatch, unknown-extension rejection). Verified in a fixture-less, network-restricted sandbox: mvn -f llama/pom.xml test-compile pitest:mutationCoverage reports 295/295 mutations killed (100%), 0 NO_COVERAGE. No committed audio fixture is needed for the PIT gate; the model-backed AudioInputIntegrationTest remains separately (and intentionally) gated on a real speech clip.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
…e gate
The committed sample.wav (260ddb0) turned both REUSE lint jobs red: a new binary with no license info. This covers and wires it:
- llama/src/test/resources/audios/README.md: provenance/license/override notes mirroring the images/ README (recorded by the project author, MIT-granted for this project).
- REUSE.toml: the audios README joins the MIT markdown list and sample.wav gets its own MIT annotation (WAV has no in-file header channel, same as test-image.jpg). reuse lint: 368/368 compliant.
- AudioInputIntegrationTest now defaults the audio prompt to the committed clip (TestConstants.DEFAULT_AUDIO_INPUT_PATH), mirroring the vision.image default; only the audio model + mmproj still need staging. README/CLAUDE.md property tables updated.
Also promotes test-android-emulator to a RELEASE GATE (both publish needs: graphs) per owner decision: the job ran flake-free through PR #298's validation cycle (boot ~30 s, on-device inference green), so a broken on-device runtime now blocks publishing, the same fail-loud policy as every native artifact job.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Every DONE/RESOLVED entry moves out of "Open"; a concise 2026-07-05 record (one-liners with pointers to PR #298 / CLAUDE.md / git history) is kept in the Done section. Trims:
- Dropped fully-done sections: NativeServer attach mode, typed router API, GGUF inspector, session fork/rewind, PIT hermeticity, Windows native classifiers, b9739 arg-parse regression, code audit (its one optional follow-up becomes its own small open section), branch protection rename (closed as a no-op per owner).
- OpenAI-compat endpoint section reduced to its open follow-ups, marked deprioritized per the native-server-first owner decision.
- Similar-projects backlog reduced to the jbang example remainder.
- Android section reduced to the example-app follow-up.
- Upstream-PR section generalized from patch 0001 to all six upstream-submittable patches (0001/0002/0005-0008).
- License Compliance entry notes the same 17-issues status now blocks PR #298's merge state.
File shrinks 654 -> 315 lines; only genuinely open work remains under "Open".
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Smallest range yet (2 files, ~1.8 KiB), internal-only: a fail-loud GGML_ABORT guard in the ggml meta backend for unsupported multi-buffers (upstream #22197), and llama_model now copies the borrowed tensor_split array into an owned vector so tensor-parallel KV-cache split metadata cannot read a dangling caller pointer. No API surface, no project source changes. All eight local patches (0001-0008) re-verified: applied cleanly in order onto a b9878 checkout; the range touches no patch-target file and no OuteTTS generator anchor. History rows appended to docs/history/llama-cpp-breaking-changes.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Summary
- net.ladenthin:llama-android / llama-android-opencl AARs (standalone plain-Gradle build, no AGP/SDK needed; classes byte-identical to the Maven core jar) + the net.ladenthin:llama-kotlin coroutines facade (Flow streaming, suspend wrappers, cancellation wired to CancellationToken). The CPU AAR is multi-ABI (arm64-v8a + x86_64); a new dockcross x86_64 job also feeds the default JAR. Includes the Android dlopen fix: GGML_OPENMP OFF + -static-libstdc++ remove the libomp.so / libc++_shared.so DT_NEEDED entries that made System.loadLibrary fail on every device (latent in the released 5.0.5 arm64 lib too); CI now enforces a bionic-only DT_NEEDED whitelist and 16 KB LOAD alignment per shipped .so.
- NativeServer attach mode (NativeServer(LlamaModel, String...), patch 0007) serves an already-loaded model over the full upstream HTTP frontend (one copy of the weights); in-JVM router mode (patch 0008 + NativeServer.setWorkerCommand) with the typed RouterClient / RouterModel API (list/load/unload/await-loaded with fail-fast on failed workers).
- GgufInspector (GGUF v2/v3 header/metadata reader, LE+BE, no model load), LlamaQuantizer (in-JVM llama_model_quantize), Session.checkpoint/rewind/fork (slot-checkpoint based conversation branching), runtime LoRA adapter control, typed batch embeddings (embed(Collection)), UTF-8-safe JNI string path (utf8_to_jstring_impl; fixes supplementary-plane/emoji handling, Android CheckJNI-safe).
- JsonSchemaElementSerializer), JSON mode / response_format, multimodal user input, and full streaming: streamed tool calls (onPartialToolCall / onCompleteToolCall), per-token thinking events, real finish reason + token usage (StreamingChunkAssembler).
- test-java-llama-kotlin, package-android-aar (structure, 16 KB alignment, DT_NEEDED whitelist, AGP/R8 consumer smoke), crosscompile-android-x86_64, and test-android-emulator: on-device System.loadLibrary + GgufInspector + real inference on a KVM x86_64 emulator, promoted to a release gate after running flake-free.
- Committed audio fixture (audios/sample.wav, REUSE-annotated) is now the AudioInputIntegrationTest default prompt. TODO.md cleaned to open-items-only.
Test plan
- 99359a0, incl. emulator on-device inference, model-backed integration suites on Linux/macOS x3/Windows x3, PIT 295/295, C++ suites incl. s390x/qemu; final head re-run pending after the b9876/b9878 bumps (both ranges are internal-only upstream fixes with all patches re-verified))
- docs/history/llama-cpp-breaking-changes.md rows, TODO.md cleanup)
Related issues / PRs
- LlamaModel" TODO via attach mode (patch 0007).
- UnsatisfiedLinkError class of failures (libomp/libc++_shared DT_NEEDED); also latent in the 5.0.4/5.0.5 Android arm64 artifacts.
- 0003 (server : add slot_prompt_similarity getter/setter ggml-org/llama.cpp#22393) and 0004 (server: honour per-request reasoning_budget_tokens in chat completions ggml-org/llama.cpp#23116); patches 0001/0002/0005–0008 are upstream-submittable (tracked in TODO.md).
Checklist
- CONTRIBUTING.md and CODE_OF_CONDUCT.md
- SECURITY.md)
🤖 Generated with Claude Code
https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX