Feature wave: Android AAR + Kotlin facade, server attach/router modes, LangChain4j streaming, GGUF tooling (llama.cpp b9878) #298

Merged: bernardladenthin merged 20 commits into main from claude/java-llama-cpp-features-l4tl6v on Jul 5, 2026

@bernardladenthin bernardladenthin commented Jul 5, 2026

Owner

Summary

  • Android distribution: net.ladenthin:llama-android / llama-android-opencl AARs (standalone plain-Gradle build, no AGP/SDK needed; classes byte-identical to the Maven core jar) + the net.ladenthin:llama-kotlin coroutines facade (Flow streaming, suspend wrappers, cancellation wired to CancellationToken). The CPU AAR is multi-ABI (arm64-v8a + x86_64); a new dockcross x86_64 job also feeds the default JAR. Includes the Android dlopen fix: GGML_OPENMP OFF + -static-libstdc++ remove the libomp.so/libc++_shared.so DT_NEEDED entries that made System.loadLibrary fail on every device (latent in the released 5.0.5 arm64 lib too); CI now enforces a bionic-only DT_NEEDED whitelist and 16 KB LOAD alignment per shipped .so.
  • Server modes: NativeServer attach mode (NativeServer(LlamaModel, String...), patch 0007) serves an already-loaded model over the full upstream HTTP frontend (one copy of the weights); in-JVM router mode (patch 0008 + NativeServer.setWorkerCommand) with the typed RouterClient/RouterModel API (list/load/unload/await-loaded with fail-fast on failed workers).
  • API additions: pure-Java GgufInspector (GGUF v2/v3 header/metadata reader, LE+BE, no model load), LlamaQuantizer (in-JVM llama_model_quantize), Session.checkpoint/rewind/fork (slot-checkpoint based conversation branching), runtime LoRA adapter control, typed batch embeddings (embed(Collection)), UTF-8-safe JNI string path (utf8_to_jstring_impl — fixes supplementary-plane/emoji handling, Android CheckJNI-safe).
  • LangChain4j: blocking tool calling (own JsonSchemaElementSerializer), JSON mode / response_format, multimodal user input, and full streaming — streamed tool calls (onPartialToolCall/onCompleteToolCall), per-token thinking events, real finish reason + token usage (StreamingChunkAssembler).
  • llama.cpp pin: b9870 → b9878 (three small internal-only ranges; all 8 carried patches re-verified at each step; history rows appended).
  • CI: test-java-llama-kotlin, package-android-aar (structure, 16 KB alignment, DT_NEEDED whitelist, AGP/R8 consumer smoke), crosscompile-android-x86_64, and test-android-emulator — on-device System.loadLibrary + GgufInspector + real inference on a KVM x86_64 emulator, promoted to a release gate after running flake-free. Committed audio fixture (audios/sample.wav, REUSE-annotated) is now the AudioInputIntegrationTest default prompt. TODO.md cleaned to open-items-only.

Test plan

  • Affected unit / integration tests pass locally
  • CI is green on this branch (full 45-job matrix green on the b9873 head 99359a0, incl. emulator on-device inference, model-backed integration suites on Linux/macOS x3/Windows x3, PIT 295/295, C++ suites incl. s390x/qemu; final head re-run pending after the b9876/b9878 bumps — both ranges are internal-only upstream fixes with all patches re-verified)
  • Docs / CHANGELOG updated where applicable (README classifier table + Importing in Android + Similar Projects, CLAUDE.md sections for AAR/Kotlin/emulator gate/dlopen invariant, docs/history/llama-cpp-breaking-changes.md rows, TODO.md cleanup)

Related issues / PRs

Checklist

  • I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
  • My commits follow Conventional Commits
  • No security-sensitive changes (if there are, I have notified the maintainer privately per SECURITY.md)

🤖 Generated with Claude Code

https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX

claude added 9 commits July 5, 2026 08:24
Three features from the similar-projects investigation (native-server-first
scope — no new Java-server routes):

Runtime LoRA adapter control (upstream GET/POST /lora-adapters parity):
- new JNI methods getLoraAdaptersJson/setLoraAdaptersJson posting
  SERVER_TASK_TYPE_GET_LORA / SET_LORA (parse_lora_request wire format)
- typed LlamaModel.getLoraAdapters() / setLoraAdapters(Map) /
  setLoraAdapter(int, float); new value.LoraAdapter +
  json.LoraAdapterResponseParser (finite-scale validation)
- closes the setLoraInitWithoutApply() gap (its Javadoc pointed at an
  endpoint the bindings could not reach)

Typed batch embeddings (requested by upstream kherud users):
- LlamaModel.embed(Collection<String>) -> List<float[]> over the OAI
  array-input path of handleEmbeddings; json.EmbeddingResponseParser
  restores request order via the response index field
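The order-restoring step can be sketched as follows. This is a minimal illustration, not the project's EmbeddingResponseParser: the IndexedVector holder and the inRequestOrder method are hypothetical names, and JSON parsing is left out so that only the sort-by-index idea remains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class EmbeddingOrderSketch {

    /** Hypothetical holder: one embedding plus the "index" value echoed in the response. */
    static final class IndexedVector {
        final int index;
        final float[] vector;

        IndexedVector(int index, float[] vector) {
            this.index = index;
            this.vector = vector;
        }
    }

    /** Returns the vectors sorted by their response index, i.e. in request order. */
    static List<float[]> inRequestOrder(List<IndexedVector> received) {
        List<IndexedVector> sorted = new ArrayList<>(received);
        sorted.sort(Comparator.comparingInt((IndexedVector v) -> v.index));
        List<float[]> ordered = new ArrayList<>(sorted.size());
        for (IndexedVector v : sorted) {
            ordered.add(v.vector);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Entries arrive out of order; the index field says where each one belongs.
        List<IndexedVector> received = Arrays.asList(
                new IndexedVector(2, new float[] {0.3f}),
                new IndexedVector(0, new float[] {0.1f}),
                new IndexedVector(1, new float[] {0.2f}));
        for (float[] v : inRequestOrder(received)) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```

Sorting a copy keeps the caller's list untouched, which makes the transform easy to unit-test with literals.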

UTF-8-safe JNI string path:
- json_to_jstring_impl now serialises via upstream safe_json_to_str
  (U+FFFD replacement instead of json::type_error 316 when non-stream
  content ends mid-codepoint at the token limit) and builds the Java
  String through the cached String(byte[], "UTF-8") constructor
  (utf8_to_jstring_impl) instead of NewStringUTF, which expects
  Modified UTF-8 and is spec-invalid for supplementary-plane
  characters (4-byte emoji; Android CheckJNI aborts)
- applyTemplate return and the log-callback message take the same path
- streamed chunks were already boundary-safe (upstream process_token
  holds back incomplete UTF-8); pinned end to end by the new tests
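The difference between standard UTF-8 and Modified UTF-8 can be seen from plain Java, without any JNI. The sketch below is only an illustration of the encoding mismatch, under the assumption that DataOutputStream.writeUTF (which the JDK documents as writing modified UTF-8) is an acceptable stand-in for what NewStringUTF expects; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Sketch {

    /** Standard UTF-8: a supplementary-plane code point is one 4-byte sequence. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 as written by DataOutputStream.writeUTF, without its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = buffer.toByteArray();
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    /** The safe direction: decode standard UTF-8 bytes with an explicit charset. */
    static String decodeStandard(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // one code point, two UTF-16 chars
        System.out.println(standardUtf8(emoji).length); // prints 4
        System.out.println(modifiedUtf8(emoji).length); // prints 6 (each surrogate encoded separately)
        System.out.println(decodeStandard(standardUtf8(emoji)).equals(emoji)); // prints true
    }
}
```

Because the two encodings disagree on every supplementary-plane character, handing standard UTF-8 bytes to an API that expects the modified form is the failure mode the commit describes.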

Tests: +17 C++ unit tests (utf8/json_to_jstring byte-capture mocks,
parse_lora_request, server_task_result_get_lora::to_json; total 479),
+28 model-free Java unit tests (parsers + PIT-complete LoraAdapter),
+3 model-backed integration classes/methods (RuntimeLoraIntegrationTest,
Utf8RoundTripIntegrationTest, LlamaEmbeddingsTest batch cases).
PIT 255/255 mutants killed; javadoc:jar clean; ArchUnit green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Three native-server-focused features:

- NativeServer attach mode (closes the "reuse an already-loaded LlamaModel"
  TODO): patches/0007 extracts the upstream route table into a shared
  llama_server_register_common_routes(...) and adds llama_server_attach(),
  which serves an already-loaded LlamaModel's server_context over the full
  upstream HTTP frontend (WebUI, resumable streaming) - no second model load,
  no second start_loop; the model's worker keeps driving the queue. Java:
  NativeServer(LlamaModel, String...) over startAttachedNativeServer JNI.
  Validated by NativeServerAttachIntegrationTest (HTTP health/props/
  completion/chat + concurrent direct JNI calls on the same model).

- In-JVM router mode (multi-model management): the upstream router spawns
  workers by re-executing its own binary, which inside a JVM is java, so
  embedded router workers could never start. patches/0008 adds the
  LLAMA_SERVER_WORKER_CMD override (whitespace-split, replaces only the
  worker-binary token), exposed as NativeServer.setWorkerCommand(String...);
  workers relaunch as fresh JVMs running the classic single-model
  NativeServer. Validated by RouterModeIntegrationTest (Linux CI:
  --models-dir listing -> POST /models/load -> worker-JVM spawn -> proxied
  chat completion) plus model-free setWorkerCommand validation tests.

- In-JVM GGUF quantization: LlamaQuantizer.quantize(in, out,
  QuantizationType[, threads, allowRequantize]) over llama_model_quantize
  (LLamaSharp/llama-cpp-python precedent). args.QuantizationType pins the
  llama_ftype b9870 mapping (PIT-complete, 256/256 mutants killed).
  QuantizerIntegrationTest re-quantizes the 135M draft model and loads the
  result; refusal-without-opt-in and missing-input error paths covered.

Local verification: full native rebuild with patches 0007/0008 applied
cleanly, 479/479 C++ tests pass, NativeLibraryLoadSmokeTest green with the
rebuilt lib, javadoc clean, spotless + pinned clang-format applied. The
model-backed integration tests run in CI.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Bridge the remaining langchain4j v1 gaps in llama-langchain4j (blocking
path):

- Tool calling: ChatRequest.toolSpecifications()/toolChoice() map to the
  jllama typed tools path (ToolDefinition + tool_choice); assistant
  tool-call turns and ToolExecutionResultMessages round-trip through the
  history, and a native tool_calls response comes back as
  AiMessage.toolExecutionRequests() with finish reason TOOL_EXECUTION.
- JsonSchemaElementSerializer: recursive public-API-only serializer for
  the langchain4j JsonSchemaElement tree (object/string/integer/number/
  boolean/enum/array/reference/anyOf/null/raw), emitting langchain4j's
  $defs / #/$defs/... conventions (their serializer is internal-only).
- response_format: ResponseFormat.JSON maps to json_object mode; a
  JsonSchema-bearing format maps to the native json_schema grammar
  constraint (structured output). Applies to both adapters.
- Multimodal user input: ImageContent (base64 or URL) and AudioContent
  (inline wav/mp3) map to ContentPart array-form content for the mtmd
  pipeline; unsupported media fails loud instead of silently dropping.
- JllamaStreamingChatModel: fails fast with UnsupportedFeatureException
  when tools are requested (streaming tool-call reconstruction is the
  documented follow-up).

Tests: 12 new model-free mapping/serializer tests (31 total in module),
plus JllamaToolCallingIntegrationTest (gated on the new
net.ladenthin.llama.langchain4j.tool.model property; CI passes the
cached Qwen2.5-Instruct tool model to the langchain4j integration job).

Also bundles three SpotBugs verify fixes from the previous batch:
LlamaModel static-field ordering (IMC_IMMATURE_CLASS_WRONG_FIELD_ORDER),
EmbeddingResponseParser IndexedVector rewrite (CLI_CONSTANT_LIST_INDEX),
and a scoped EI_EXPOSE_REP2 exclusion for NativeServer's borrowed-model
attach constructor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Replace the submodule/NDK source-integration flow as the recommended
Android path with first-class Maven artifacts, so an Android Studio app
needs exactly one dependency line:

  implementation("net.ladenthin:llama-android:<version>")           // CPU
  implementation("net.ladenthin:llama-android-opencl:<version>")    // Adreno
  implementation("net.ladenthin:llama-kotlin:<version>")            // optional

llama-android/ (standalone plain-Gradle build, NOT a reactor module —
Maven cannot deploy <packaging>aar</packaging>; no AGP and no Android SDK
needed to build, version + mirrored dependency versions are parsed from
the Maven poms so `mvn versions:set` stays the single bump point):
- AAR = manifest (minSdkVersion 28, enforced on consumers by AGP) +
  classes.jar (byte-identical Maven-built core classes minus desktop
  native resources and module-info.class) + jni/arm64-v8a/libjllama.so +
  consumer R8/ProGuard rules (proguard.txt, applied automatically) + R.txt.
- POM mirrors the core's compile deps (jackson/slf4j-api/jspecify/
  checker-qual); logback deliberately excluded (JVM-only binding).
- LlamaLoader already tries System.loadLibrary("jllama") first on
  Android, so the AAR-installed .so resolves with zero core changes.

llama-kotlin/ (new Maven reactor module, pure Kotlin 2.2 / jvmTarget 1.8):
- generateFlow/generateChatFlow: cold Flow token streaming, source closed
  on completion, error, AND cancellation (no leaked native task slots).
- completeSuspend/chatSuspend/chatCompleteTextSuspend/embedSuspend;
  completeSuspend wires coroutine cancellation into the cooperative
  CancellationToken so a cancelled coroutine stops the native loop at the
  next token boundary.
- Core dep is provided-scope so Android consumers pair the facade with
  the AAR instead of transitively pulling the fat desktop JAR.
- 6 model-free unit tests over the internal seams.

16 KB page-size (Google Play, Android 15+ targets): CMakeLists.txt now
pins -Wl,-z,max-page-size=16384 for Android builds and CI asserts every
LOAD segment of the shipped .so is 16384-aligned (currently satisfied by
toolchain default; the pin + assert prevent silent regression).

CI (publish.yml):
- test-java-llama-kotlin: model-free unit tests.
- package-android-aar: assembles both AARs from the fresh native
  artifacts, validates structure (entries, minSdk, classes.jar content,
  16 KB alignment) and runs an AGP consumer smoke test — the minimal app
  fixture in .github/android-consumer-test/ resolves the AAR from
  mavenLocal and runs a full R8 assembleRelease on the runner's Android
  SDK, then asserts the APK carries libjllama.so and the un-stripped
  binding (proving Android Studio consumption without an emulator).
- publish-snapshot/publish-release: gated on the new jobs; AAR snapshots
  publish to the Central snapshots repo via Gradle, releases upload a
  signed Central Portal bundle via the Publisher API. llama-kotlin rides
  the normal reactor deploy.

Docs: README "Importing in Android" rewritten around the AAR (source
integration kept as advanced option), module READMEs, CLAUDE.md
(reactor layout, version bump, new "Android AAR + Kotlin facade"
section), RELEASE.md, TODO.md (Android section marked done; sample app
and multi-ABI/emulator-CI stay as follow-ups).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
…inks

Add the projects surveyed during the feature-gap research and Android
investigation that were not yet linked: the kherud/java-llama.cpp fork
parent (previously only in the header note), the sibling llama.cpp
bindings in other languages (llama-cpp-python, LLamaSharp,
node-llama-cpp), and a new "Other local inference stacks" group for
Ollama (whose native API this project's server implements) and
ExecuTorch (the engine behind llama-stack-client-kotlin's local mode).
The llama-stack-client-kotlin entry now points at the new llama-android
AAR + llama-kotlin facade as the native on-device equivalent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Replace the raw HTTP+JSON boilerplate router-mode callers had to write
themselves with a typed client for the upstream model-management
endpoints:

- value.RouterModel (+ nested Status enum): one GET /models entry —
  identifier, lifecycle status (exact-match mapping of upstream
  server_model_status_to_string strings: downloading/downloaded/
  unloaded/loading/loaded/sleeping, UNKNOWN otherwise), the raw status
  string, and the router's failed-worker marker (status.failed +
  exit_code).
- json.RouterModelsResponseParser: pure transform of the router
  GET /models wire format (data/models array fallback, id/name
  fallback), unit-testable with JSON literals.
- server.RouterClient: listModels/findModel/loadModel/unloadModel plus
  awaitModelLoaded(id, timeout) — polls until LOADED and fails fast
  with the worker's exit code when the router marks the model failed,
  or immediately for an unknown id, instead of running out the
  timeout. Non-2xx responses surface the router's error body. Works
  against the in-JVM NativeServer router or any external llama-server
  router (plain HTTP, no JNI).

Tests: 25 new model-free tests — RouterModelTest (getters, status
mapping, equals/hashCode, toString shapes), RouterModelsResponseParserTest
(upstream shape, failed marker, fallbacks, tolerance), RouterClientTest
(stub HTTP server: parsing, request bodies, error surfacing, the
awaitModelLoaded state machine incl. poll-sequence, fail-fast, and
timeout paths). RouterModeIntegrationTest now drives model discovery,
load, and readiness through RouterClient against a real router,
replacing its hand-rolled JSON polling.

Gates: layeredArchitecture updated (Server may access Json — the rule
is the documented intent registry for new inter-package edges);
awaitModelLoaded uses a never-counted-down CountDownLatch instead of
the banned Thread.sleep; SpotBugs clean (toString/equals/hashCode
added, exact status matching avoids IMPROPER_UNICODE, scoped
URLCONNECTION_SSRF_FD exclusion with developer-supplied-host
rationale); PIT 274/274 (RouterModel inside the value.* 100% gate);
javadoc builds clean. README router-mode section, CLAUDE.md, and
TODO.md updated.
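The polling behaviour described for awaitModelLoaded, including the latch-based delay, might look roughly like the sketch below. It is an illustration under stated assumptions, not the RouterClient source: the Status enum, the Supplier-based poll hook, the exception types, and the outcome helper are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class AwaitLoadedSketch {

    /** Hypothetical, simplified model states. */
    enum Status { LOADING, LOADED, FAILED, UNKNOWN_ID }

    /** Polls until LOADED; fails fast on FAILED or UNKNOWN_ID instead of waiting out the timeout. */
    static void awaitLoaded(Supplier<Status> poll, long timeoutMillis, long intervalMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        long interval = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        // Never counted down: await(...) is used purely as an interruptible delay.
        CountDownLatch never = new CountDownLatch(1);
        while (true) {
            Status status = poll.get();
            if (status == Status.LOADED) {
                return;
            }
            if (status == Status.FAILED) {
                throw new IllegalStateException("worker failed");
            }
            if (status == Status.UNKNOWN_ID) {
                throw new IllegalArgumentException("unknown model id");
            }
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                throw new TimeoutException("model not loaded in time");
            }
            never.await(Math.min(remaining, interval), TimeUnit.NANOSECONDS);
        }
    }

    /** Convenience wrapper that maps each exit path to a label. */
    static String outcome(Supplier<Status> poll, long timeoutMillis, long intervalMillis) {
        try {
            awaitLoaded(poll, timeoutMillis, intervalMillis);
            return "loaded";
        } catch (IllegalStateException e) {
            return "failed";
        } catch (IllegalArgumentException e) {
            return "unknown";
        } catch (TimeoutException e) {
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(() -> Status.FAILED, 1000, 10));  // prints failed
        System.out.println(outcome(() -> Status.LOADING, 30, 5));    // prints timeout
    }
}
```

Checking the terminal states before the deadline is what makes the failure paths return immediately rather than after the full timeout.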

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Replace the handwritten equals/hashCode with @EqualsAndHashCode over the
host/port fields, matching the established pattern (value.* and the other
server.* classes). toString stays intentionally handwritten so the client
renders as its target URL in log traces — the same documented
handwritten-toString convention ChatMessage/ToolCall/RouterModel use.
SpotBugs (IMC_IMMATURE_CLASS_*) stays satisfied by the generated methods.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Three backlog features, each with model-free tests plus gated
integration coverage:

GGUF metadata inspector (no model load):
- GgufInspector: pure-Java GGUF v2/v3 header + key/value reader — no
  native library, no tensor data, cost independent of file size.
  Little- and big-endian containers auto-detected via the version
  field; fail-loud on v1, unknown versions/type ids, truncation, and
  implausible lengths (sanity caps). All value types decoded
  (integers→Long, floats→Double, bool, string, arrays).
- value.GgufMetadata: full entry table + typed accessors
  (architecture, name, parameter count, <arch>.context_length,
  general.file_type, chat template). Complements the loaded-model
  getModelMeta().
- 21 tests against in-memory generated fixtures (no committed
  binaries) + a gated real-model read.
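One way to auto-detect container endianness from the version field is sketched below. It relies only on the public GGUF header layout (the 4-byte magic "GGUF" followed by a 32-bit version); the detection rule itself (accept whichever byte order yields version 2 or 3) is an assumption made for illustration, and GgufHeaderSketch, detectByteOrder, header, and rejects are hypothetical names rather than the GgufInspector API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class GgufHeaderSketch {

    private static boolean isSupported(int version) {
        return version == 2 || version == 3;
    }

    /** Reads the version field in both byte orders and returns the order that gives v2 or v3. */
    static ByteOrder detectByteOrder(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("truncated header");
        }
        if (header[0] != 'G' || header[1] != 'G' || header[2] != 'U' || header[3] != 'F') {
            throw new IllegalArgumentException("bad magic");
        }
        int littleEndian = ByteBuffer.wrap(header, 4, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        int bigEndian = ByteBuffer.wrap(header, 4, 4).order(ByteOrder.BIG_ENDIAN).getInt();
        if (isSupported(littleEndian)) {
            return ByteOrder.LITTLE_ENDIAN;
        }
        if (isSupported(bigEndian)) {
            return ByteOrder.BIG_ENDIAN;
        }
        if (littleEndian == 1 || bigEndian == 1) {
            throw new IllegalArgumentException("GGUF v1 is not supported");
        }
        throw new IllegalArgumentException("unknown GGUF version");
    }

    /** Builds an 8-byte test header: the magic followed by the four given version bytes. */
    static byte[] header(int b4, int b5, int b6, int b7) {
        return new byte[] {'G', 'G', 'U', 'F', (byte) b4, (byte) b5, (byte) b6, (byte) b7};
    }

    /** True when detectByteOrder refuses the header. */
    static boolean rejects(byte[] header) {
        try {
            detectByteOrder(header);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(detectByteOrder(header(3, 0, 0, 0))); // prints LITTLE_ENDIAN
        System.out.println(detectByteOrder(header(0, 0, 0, 3))); // prints BIG_ENDIAN
    }
}
```

The small set of valid versions is what makes this check unambiguous: a version read in the wrong byte order comes out as a very large number, never as 2 or 3.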

LangChain4j streaming tool calls + thinking events:
- JllamaStreamingChatModel now streams over the native OAI
  chat.completion.chunk path via the new StreamingChunkAssembler:
  delta.content → onPartialResponse, delta.reasoning_content →
  onPartialThinking (+ AiMessage.thinking()), delta.tool_calls
  fragments accumulated per index → onPartialToolCall /
  onCompleteToolCall and AiMessage.toolExecutionRequests() with
  finish reason TOOL_EXECUTION; real finish reason + token usage on
  the final response. The UnsupportedFeatureException fail-fast is
  gone; toStreamingParameters now carries tools/tool_choice like the
  blocking path.
- 6 assembler tests (canned chunks: text, split/parallel tool calls,
  thinking, usage, fail-loud) + a gated streamed-tool-call
  integration test.
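The per-index accumulation of streamed tool-call fragments can be sketched as below. This is not the StreamingChunkAssembler itself: JSON parsing and the LangChain4j callbacks are omitted, and the Fragment, Pending, accept, and complete names are hypothetical. It assumes the OpenAI-style streaming shape, where the first fragment for an index carries the function name and later fragments carry pieces of the arguments string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToolCallAccumulatorSketch {

    /** Hypothetical, already-parsed view of one delta.tool_calls element. */
    static final class Fragment {
        final int index;
        final String name;           // null when this fragment carries no name
        final String argumentsPart;  // null when this fragment carries no arguments text

        Fragment(int index, String name, String argumentsPart) {
            this.index = index;
            this.name = name;
            this.argumentsPart = argumentsPart;
        }
    }

    private static final class Pending {
        String name;
        final StringBuilder arguments = new StringBuilder();
    }

    // TreeMap keeps the calls ordered by their stream index.
    private final Map<Integer, Pending> byIndex = new TreeMap<>();

    /** Merges one fragment into the pending call that shares its index. */
    void accept(Fragment fragment) {
        Pending pending = byIndex.computeIfAbsent(fragment.index, i -> new Pending());
        if (fragment.name != null) {
            pending.name = fragment.name;
        }
        if (fragment.argumentsPart != null) {
            pending.arguments.append(fragment.argumentsPart);
        }
    }

    /** Returns "name arguments" for each accumulated call, in index order. */
    List<String> complete() {
        List<String> calls = new ArrayList<>();
        for (Pending pending : byIndex.values()) {
            calls.add(pending.name + " " + pending.arguments);
        }
        return calls;
    }

    public static void main(String[] args) {
        ToolCallAccumulatorSketch acc = new ToolCallAccumulatorSketch();
        // Two parallel calls whose argument text arrives interleaved.
        acc.accept(new Fragment(0, "get_weather", "{\"city\":"));
        acc.accept(new Fragment(1, "get_time", "{\"tz\":"));
        acc.accept(new Fragment(0, null, "\"Berlin\"}"));
        acc.accept(new Fragment(1, null, "\"UTC\"}"));
        acc.complete().forEach(System.out::println);
    }
}
```

Keying on the index rather than on arrival order is what lets split and parallel tool calls be reassembled independently.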

Session fork/rewind (conversation checkpoints):
- Session.checkpoint(filepath) → value.SessionCheckpoint pairing the
  native slot KV-save file with the transcript-turn snapshot;
  Session.rewind(checkpoint) restores both atomically under the
  session lock (native state and transcript cannot drift);
  Session.fork(newSlotId, filepath) branches into an independent
  session on another slot (same system message + params customizer;
  requires setParallel >= 2). All rejected while a stream is in
  progress, same guard as save/restore.
- Plumbing: ChatTranscript.turnsSnapshot()/resetTurns(),
  SessionState.turnsSnapshot()/restoreTurns()/getSystemMessage().
- Model-free bookkeeping/guard tests + SessionForkRewindIntegrationTest
  (rewind-continue, independent fork, own-slot fail-fast).

Gates: PIT 295/295 (GgufMetadata, SessionCheckpoint, ChatTranscript
additions inside the value.* 100% gate); SpotBugs clean (dynamic
exception messages in GgufInspector; scoped exclusions with rationale
for the stateful-reader PRMC false positive, the tagged-decoder URV,
and SessionCheckpoint's order-significant List parameter — ChatMessage
precedent); javadoc clean; langchain4j module verify green (38 tests).
README sections for checkpoints and GGUF inspection; TODO.md updated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Extends Android from build-verified to runtime-verified in CI, and makes
the binding usable on x86_64 Android environments (Android Studio
emulator, Chromebooks, x86-64 Android hardware) for all consumers.

Phase 1 — new native build:
- .github/dockcross/dockcross-android-x86_64 wrapper (same pinned image
  tag as the arm64 one; wrappers are image-generic launchers — verified
  byte-identical modulo image name; update.sh already listed the
  generation command).
- crosscompile-android-x86_64 job (dockcross + the same sccache
  steady-state env), artifact Linux-Android-x86_64-libraries — fail-loud
  and in the package/publish needs graphs. The artifact ALSO merges into
  the default JAR's Linux-Android/x86_64 tree automatically via the
  *-libraries glob (OSInfo already maps x86_64 Android there), so plain
  JAR consumers get the ABI too. The CMake Android guard (weak symbols +
  16 KB max-page-size) keys on OS_NAME and applies unchanged.

Phase 2 — multi-ABI AAR:
- llama-android CPU AAR now ships jni/arm64-v8a + jni/x86_64 (per-ABI
  fail-loud staging checks; app bundles split per ABI so phones download
  only arm64). OpenCL flavor stays arm64-only (Adreno = Qualcomm ARM).
- Structural validation covers both ABIs incl. the 16 KB LOAD-alignment
  readelf check per .so; the R8 consumer smoke asserts both libs in the
  APK; publish jobs stage both ABIs.

Phase 3 — on-emulator instrumentation:
- test-android-emulator job: KVM-accelerated x86_64 emulator (API 30,
  reactivecircus/android-emulator-runner), publishes the CPU AAR to
  mavenLocal (per-publication task), adb-pushes the already-cached
  draft model (AMD-Llama-135m, no new download) and runs the consumer
  fixture's connectedDebugAndroidTest.
- OnDeviceInferenceTest (androidx.test): System.loadLibrary("jllama")
  from the APK's native-lib dir + JNI_OnLoad FindClass against D8-dexed
  classes, pure-Java GgufInspector on-device, and real native inference
  (non-empty generation). Self-skips without the pushed model so a bare
  local emulator run stays green.
- VALIDATION-ONLY for now (not in the publish needs graphs): emulator
  boot is the flakiest CI machinery; promote to a release gate after a
  stable streak (same staged policy as the sccache rollout). Not
  covered by the emulator: arm64 kernels and the Adreno flavor — the
  planned example app covers those on real hardware.

Docs: README (default-JAR platforms, 64-bit-only note, AAR section),
llama-android/README (multi-ABI), CLAUDE.md, TODO.md, fixture README.
Locally verified: multi-ABI AAR assembles with both ABIs, per-ABI
fail-loud check fires on a missing .so, and the per-publication
mavenLocal task publishes the CPU AAR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Two first-run failures on PR #298:

- REUSE compliance (test job): the four files added by the Android/Kotlin
  work lacked SPDX info — llama-android/README.md, llama-kotlin/README.md,
  and the javadoc-placeholder README.txt get SPDX headers; the generated
  dockcross-android-x86_64 wrapper joins the existing dockcross wrapper
  annotation in REUSE.toml. `reuse lint` is compliant again (365/365).

- SonarCloud "Build and analyze": RouterModeIntegrationTest.tearDown
  called NativeServer.setWorkerCommand() unconditionally; when the class
  self-skips via a @BeforeAll assumption (no model on the lib-less
  analysis runner) @AfterAll still runs, and setWorkerCommand loads the
  native library -> UnsatisfiedLinkError. The teardown now clears the
  worker-command override only when setup actually installed it
  (workerCommandSet flag), so a skipped class tears down as a no-op.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Comment thread llama/src/main/java/net/ladenthin/llama/server/RouterClient.java Fixed
…RL()

CodeQL flagged the URL(String) constructor (deprecated since JDK 20, no
validation/encoding). URI.create(...).toURL() is the non-deprecated
equivalent and is available on Java 8, so the bytecode floor is
unaffected. Behavior identical for the fixed localhost/router URLs;
9/9 RouterClientTest green.
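The replacement pattern looks like this in isolation. The sketch uses only java.net.URI and java.net.URL from the JDK; the endpoint and rejects helpers and the host, port, and path values are hypothetical and do not come from RouterClient.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class UrlConstructionSketch {

    /** Builds an http URL via URI, which validates the syntax, instead of the URL(String) constructor. */
    static URL endpoint(String host, int port, String path) {
        try {
            return URI.create("http://" + host + ":" + port + path).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** True when URI.create refuses the given string. */
    static boolean rejects(String candidate) {
        try {
            URI.create(candidate);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(endpoint("127.0.0.1", 8080, "/models")); // prints http://127.0.0.1:8080/models
        System.out.println(rejects("http://localhost/a b"));        // prints true (unencoded space)
    }
}
```

URI.create throws IllegalArgumentException for malformed input such as an unencoded space, which is the validation the URL(String) constructor does not perform.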

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Two more first-run failures on PR #298 (head 725f570):

- Package + Validate Android AARs: the R8 release pass in the consumer
  smoke failed with "Missing classes" — the AAR's consumer keep rule
  retains the whole binding, so R8 verifies every referenced type,
  including compile-time-only ones absent on Android:
  com.sun.net.httpserver.* (JVM-only OpenAiCompatServer transport),
  lombok.Generated and animal-sniffer's IgnoreJRERequirement
  (CLASS-retention build annotations). consumer-proguard.txt now ships
  the matching -dontwarn rules, so every consumer app's R8 pass gets
  them automatically — the standard treatment for compileOnly
  references in published Android libraries.

- Android emulator on-device test: sh exit code 2 with no gradle
  output — reactivecircus/android-emulator-runner executes the script
  input LINE BY LINE via sh, so the multi-line if-block was fed as a
  lone "if ...; then" (syntax error). The logic moves into the
  committed .github/run-android-emulator-test.sh (bash -n verified)
  and the job's script: is a single line invoking it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
The on-device emulator test failed with UnsatisfiedLinkError ("No native
library found ... Directly from .apk/lib") even though the x86_64
libjllama.so was verifiably inside the APK. Root cause (confirmed via
readelf -d on the shipped 5.0.5 arm64 Android lib, which carries the same
latent defect): the dockcross cross-clang links two DT_NEEDED entries that
exist on no Android device, so bionic's dlopen rejects the library:

  - libomp.so         (LLVM OpenMP runtime, pulled in by ggml's OpenMP path)
  - libc++_shared.so  (NDK shared C++ runtime, only present when an app
                       packages it itself)

Three-part fix:

1. llama/CMakeLists.txt (Android guard): set GGML_OPENMP OFF (ggml falls
   back to its own std::thread pool — the same trade the Windows-arm64
   clang-cl job makes) and link -static-libstdc++ so libc++ is embedded.
   Only bionic system libraries remain as dependencies.

2. publish.yml (package-android-aar validation): per-.so DT_NEEDED
   whitelist via readelf -dW (libc/libm/libdl/liblog/libandroid, plus
   libOpenCL.so for the OpenCL flavor) — a future toolchain bump cannot
   silently reintroduce a non-bionic dependency; the job fails naming the
   offending library.

3. LlamaLoader: the Android System.loadLibrary catch block now includes
   the UnsatisfiedLinkError message in the "Directly from .apk/lib (...)"
   tried-path entry — the actual dlopen reason was previously swallowed,
   which made this failure look like a missing library.
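The whitelist check from step 2 can be illustrated in Java, although the CI job itself runs in the workflow. The sketch assumes the usual readelf -dW dynamic-section line format, "(NEEDED) Shared library: [name]"; the sample text in main is hand-written for illustration and is not captured output. The class name, method name, and the ".so" file names in the allowed set are assumptions derived from the library names listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DtNeededWhitelistSketch {

    // Matches the library name inside a "(NEEDED) ... Shared library: [libfoo.so]" line.
    private static final Pattern NEEDED =
            Pattern.compile("\\(NEEDED\\)\\s+Shared library: \\[([^\\]]+)\\]");

    /** Returns every DT_NEEDED library in the readelf text that is not on the allowed list. */
    static List<String> offending(String readelfOutput, Set<String> allowed) {
        List<String> bad = new ArrayList<>();
        Matcher matcher = NEEDED.matcher(readelfOutput);
        while (matcher.find()) {
            String library = matcher.group(1);
            if (!allowed.contains(library)) {
                bad.add(library);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList(
                "libc.so", "libm.so", "libdl.so", "liblog.so", "libandroid.so"));
        // Hand-written sample in readelf's format (illustrative, not real tool output).
        String sample =
                " 0x0000000000000001 (NEEDED)             Shared library: [libomp.so]\n"
              + " 0x0000000000000001 (NEEDED)             Shared library: [libc.so]\n"
              + " 0x0000000000000001 (NEEDED)             Shared library: [libc++_shared.so]\n";
        System.out.println(offending(sample, allowed)); // prints [libomp.so, libc++_shared.so]
    }
}
```

Returning the offending names, rather than a boolean, mirrors the commit's requirement that the job fail by naming the library.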

Also documents the new dlopen-ability invariant in CLAUDE.md next to the
16 KB page-size invariant.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Small upstream range (5 files, ~9.5 KiB): a quantized-tensor fix for the
CPU concat op and a null-buffer guard for the K/V rotation graph inputs
(upstream #25215), plus WebUI settings changes (auto-followed by the
build-webui job) and a test-backend-ops addition (not built here).

All eight local patches (0001-0008) re-verified: applied cleanly in order
onto a b9873 checkout; the range touches no patch-target file and no
OuteTTS generator anchor. History row appended to
docs/history/llama-cpp-breaking-changes.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
ggml-only range (3 files, ~9.6 KiB): the CUDA concat op gains the same
quantized-tensor block-size handling b9873 added to the CPU op, plus a
tensor-parallel + -ncmoe crash fix on MoE models (upstream #25028). No
API surface, no project source changes.

All eight local patches (0001-0008) re-verified: applied cleanly in
order onto a b9876 checkout; the range touches no patch-target file and
no OuteTTS generator anchor. History rows appended to
docs/history/llama-cpp-breaking-changes.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
The TODO entry "PIT gate not hermetic — value.ContentPart.audioFile(Path)"
was stale: ContentPartTest already carries the hermetic @TempDir tests the
entry proposed (wav dispatch incl. case-insensitive .WAV, mp3 dispatch,
unknown-extension rejection). Verified in a fixture-less, network-restricted
sandbox: mvn -f llama/pom.xml test-compile pitest:mutationCoverage reports
295/295 mutations killed (100%), 0 NO_COVERAGE. No committed audio fixture
is needed for the PIT gate; the model-backed AudioInputIntegrationTest
remains separately (and intentionally) gated on a real speech clip.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
…e gate

The committed sample.wav (260ddb0) turned both REUSE lint jobs red: a new
binary with no license info. This commit licenses it and wires it in:

- llama/src/test/resources/audios/README.md: provenance/license/override
  notes mirroring the images/ README (recorded by the project author,
  MIT-granted for this project).
- REUSE.toml: the audios README joins the MIT markdown list and
  sample.wav gets its own MIT annotation (WAV has no in-file header
  channel, same as test-image.jpg). reuse lint: 368/368 compliant.
- AudioInputIntegrationTest now defaults the audio prompt to the
  committed clip (TestConstants.DEFAULT_AUDIO_INPUT_PATH), mirroring the
  vision.image default — only the audio model + mmproj still need
  staging. README/CLAUDE.md property tables updated.

Also promotes test-android-emulator to a RELEASE GATE (both publish
needs: graphs) per owner decision: the job ran flake-free through PR
#298's validation cycle (boot ~30 s, on-device inference green), so a
broken on-device runtime now blocks publishing — same fail-loud policy
as every native artifact job.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Every DONE/RESOLVED entry moves out of "Open"; a concise 2026-07-05
record (one-liners with pointers to PR #298 / CLAUDE.md / git history)
is kept in the Done section. Trims:

- Dropped fully-done sections: NativeServer attach mode, typed router
  API, GGUF inspector, session fork/rewind, PIT hermeticity, Windows
  native classifiers, b9739 arg-parse regression, code audit (its one
  optional follow-up becomes its own small open section), branch
  protection rename (closed as a no-op per owner).
- OpenAI-compat endpoint section reduced to its open follow-ups, marked
  deprioritized per the native-server-first owner decision.
- Similar-projects backlog reduced to the jbang example remainder.
- Android section reduced to the example-app follow-up.
- Upstream-PR section generalized from patch 0001 to all six
  upstream-submittable patches (0001/0002/0005-0008).
- License Compliance entry notes the same 17-issues status now blocks
  PR #298's merge state.

File shrinks 654 -> 315 lines; only genuinely open work remains under
"Open".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
Smallest range yet (2 files, ~1.8 KiB), internal-only: a fail-loud
GGML_ABORT guard in the ggml meta backend for unsupported multi-buffers
(upstream #22197), and llama_model now copies the borrowed tensor_split
array into an owned vector so tensor-parallel KV-cache split metadata
cannot read a dangling caller pointer. No API surface, no project source
changes.

All eight local patches (0001-0008) re-verified: applied cleanly in
order onto a b9878 checkout; the range touches no patch-target file and
no OuteTTS generator anchor. History rows appended to
docs/history/llama-cpp-breaking-changes.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVMuGj2shABrHWJ9sNqLqX
sonarqubecloud Bot commented Jul 5, 2026

@bernardladenthin bernardladenthin changed the title from "Add Kotlin coroutines facade, Android AAR packaging, and server attach mode" to "Feature wave: Android AAR + Kotlin facade, server attach/router modes, LangChain4j streaming, GGUF tooling (llama.cpp b9878)" on Jul 5, 2026
@bernardladenthin bernardladenthin merged commit 8d08b37 into main Jul 5, 2026
13 of 67 checks passed
@bernardladenthin bernardladenthin deleted the claude/java-llama-cpp-features-l4tl6v branch July 5, 2026 20:19

3 participants