Very new here, so apologies if this already exists...
The intent here is to let JVM warm-up kick in instead of paying the full startup cost on every request...
Description:
Currently, GPULlama3.java requires spawning a new JVM process for each inference request when wrapped in a web API. This causes 20-80s latency per request due to repeated JVM/TornadoVM/model loading overhead.
Request: Add a persistent server mode where:
- Model loads once at startup and stays in GPU memory
- HTTP server accepts inference requests without process restarts
- Similar to how Ollama operates (loads the model once, serves all requests from the same process); a minimal sketch of what this could look like follows below
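For illustration only, here is a rough sketch of such a mode using the JDK's built-in `com.sun.net.httpserver` (no extra framework required). The `/generate` path, port 8080, and the `LlamaModel`/`loadModel` names are assumptions, not the project's real API; the stub would need to be wired to whatever GPULlama3.java actually exposes for model loading and inference:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class Llama3Server {

    // Stand-in for whatever GPULlama3.java exposes for GPU model loading and
    // inference; names and signatures here are hypothetical.
    interface LlamaModel {
        String generate(String prompt);
    }

    public static void main(String[] args) throws Exception {
        // Pay the JVM/TornadoVM/model-load cost exactly once, at startup.
        LlamaModel model = loadModel(args.length > 0 ? args[0] : "model.gguf");

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Single worker thread so only one request touches the GPU at a time.
        server.setExecutor(Executors.newFixedThreadPool(1));

        server.createContext("/generate", exchange -> {
            String prompt = new String(exchange.getRequestBody().readAllBytes(),
                                       StandardCharsets.UTF_8);
            // JVM stays warm and weights stay resident in GPU memory,
            // so each request only pays for inference itself.
            String completion = model.generate(prompt);
            byte[] body = completion.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    private static LlamaModel loadModel(String ggufPath) {
        // Placeholder: replace with the project's actual model/TornadoVM setup.
        return prompt -> "stub completion for: " + prompt;
    }
}
```

A client would then just POST a prompt to `/generate`, much like Ollama's `/api/generate` endpoint, instead of spawning a fresh JVM per request.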
Current workaround limitations:
- Flask + subprocess: 20-80s latency (JVM/model reload per request)
- Spring Boot + LangChain4j: Version incompatibility (langchain4j-gpu-llama3 requires Java 21, base image has Java 17)
Ideal solution: Built-in HTTP server (like Ollama) or Java 17-compatible LangChain4j integration
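Assuming the hypothetical `/generate` endpoint sketched above, any wrapper (Flask, Spring Boot, or a plain script) would reduce to a cheap HTTP call rather than a process spawn. A minimal Java 11+ client, just for comparison:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Llama3Client {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One HTTP round trip per request; no JVM startup, no model reload.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/generate"))
                .POST(HttpRequest.BodyPublishers.ofString("Why is the sky blue?"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```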