FlexInfer is a mobile-oriented LLM inference runtime for running large
quantized models under tight memory budgets. It extends the llama.cpp and
ggml stack with prefetch-aware execution paths, memory planning, and
experiment scripts for Linux hosts and Android.
This repository contains the code artifact for the FlexInfer paper. Model weights are not included; use GGUF models that you are licensed to use.
- src/, include/, common/, ggml/, gguf-py/: runtime and model format code derived from the llama.cpp ecosystem, with FlexInfer modifications.
- examples/main/: command line inference binaries, including the FlexInfer prefetch-aware CLI flexinfer-cli.
- scripts/: conversion, benchmark, and experiment helper scripts.
- docs/: artifact, Android, benchmark, and release notes.
Upstream baseline executables intentionally retain the llama.cpp naming style,
such as llama-cli and llama-bench. FlexInfer-specific executables use
FlexInfer names, including flexinfer-cli and flexinfer-bench.
The FlexInfer prefetch path is compiled with the FLEXINFER definition.
The flexinfer-cli and flexinfer-bench binaries enable it through their
prefetch libraries; upstream-style binaries do not. Some shared compatibility
changes intentionally affect both paths, including alignment defaults, public
debug/benchmark parameters, and build/logging plumbing.
- CMake 3.14 or newer
- A C/C++ compiler with C++11 support
- Python 3 for model conversion scripts
- Android NDK for Android cross-compilation
Install Python dependencies when you need model conversion utilities:
python -m pip install -r requirements.txt
Convert a local Hugging Face model checkpoint to GGUF:
bash scripts/convert-hf-models.sh <local_model_path> hf-models/ggml-model-<model_name>-q4_0.gguf
Replace <local_model_path> and <model_name> with your own model path and
output name. Keep generated model files outside git; hf-models/ is ignored.
Build for the host machine:
bash build-host.sh
Cross-compile command-line Android artifacts:
export ANDROID_NDK_ROOT=/path/to/android-ndk
bash build-android.sh
The build scripts install artifacts into host/ or android/. See
docs/artifact.md for a fuller artifact setup guide.
Standard llama.cpp style inference:
LD_LIBRARY_PATH=host/lib ./host/bin/llama-cli \
-m hf-models/ggml-model-llama-2-7b-chat-q4_0.gguf \
-p "I believe the meaning of life is" \
-n 64 -t 1 -c 512
FlexInfer prefetch-aware inference:
LD_LIBRARY_PATH=host/lib ./host/bin/flexinfer-cli \
-m hf-models/ggml-model-llama-2-7b-chat-q4_0.gguf \
-p "I believe the meaning of life is" \
-n 64 -t 1 -c 512 -am 2 -tp 1
The key FlexInfer flags are:
- -am: available memory budget in GB for FlexInfer planning. Fractional budgets such as 0.5 are supported.
- -tp: number of threads used for prefetching. Use 0 to disable threaded prefetching and use synchronous reads.
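
When scripting several runs, it can help to assemble the flexinfer-cli argument list in one place. The following is a minimal sketch, not part of the repository: build_flexinfer_cmd is a hypothetical helper, and its range checks on -am and -tp are assumptions of this sketch rather than documented CLI behavior. It uses only the flags shown above.

```python
import shlex


def build_flexinfer_cmd(model, prompt, n_predict=64, threads=1, ctx=512,
                        avail_mem_gb=2.0, prefetch_threads=1,
                        binary="./host/bin/flexinfer-cli"):
    """Return the argv list for one flexinfer-cli run (hypothetical helper)."""
    # Assumption of this sketch: a budget must be positive to be meaningful.
    if avail_mem_gb <= 0:
        raise ValueError("-am must be a positive memory budget in GB")
    # 0 selects synchronous reads; negative values are rejected here.
    if prefetch_threads < 0:
        raise ValueError("-tp must be 0 or a positive thread count")
    return [
        binary,
        "-m", model,
        "-p", prompt,
        "-n", str(n_predict),
        "-t", str(threads),
        "-c", str(ctx),
        "-am", format(avail_mem_gb, "g"),  # keeps 0.5 as "0.5" and 2.0 as "2"
        "-tp", str(prefetch_threads),
    ]


cmd = build_flexinfer_cmd(
    "hf-models/ggml-model-llama-2-7b-chat-q4_0.gguf",
    "I believe the meaning of life is",
    avail_mem_gb=0.5,
    prefetch_threads=0,
)
# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run directly; set LD_LIBRARY_PATH in the environment as in the commands above.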
FlexInfer's Linux direct-I/O path expects GGUF tensor data to be 4096-byte
aligned. Models produced by scripts/convert-hf-models.sh use this alignment
by default. If you use externally generated GGUF files, regenerate or quantize
them with 4096-byte GGUF alignment before using the direct-I/O FlexInfer path.
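
The alignment requirement comes down to simple offset arithmetic. The sketch below is illustrative only: the offsets are made-up values, and align_up and is_aligned are hypothetical helpers, not FlexInfer or gguf-py functions. It shows how an offset is checked against, and padded up to, a 4096-byte boundary.

```python
ALIGNMENT = 4096  # bytes, the alignment the direct-I/O path expects


def align_up(offset, alignment=ALIGNMENT):
    """Round offset up to the next multiple of alignment."""
    return offset + (alignment - offset % alignment) % alignment


def is_aligned(offset, alignment=ALIGNMENT):
    """Return True when offset already sits on an alignment boundary."""
    return offset % alignment == 0


# Hypothetical tensor data offsets, for illustration only.
for offset in (0, 4096, 5000, 8191):
    print(offset, is_aligned(offset), align_up(offset))
```

A GGUF file whose tensor data offsets all pass is_aligned needs no padding; otherwise the file has to be regenerated with the larger alignment, as described above.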
On Android, use the installed Android tree after pushing it to the target device:
LD_LIBRARY_PATH=android/lib ./android/bin/flexinfer-cli \
-m hf-models/ggml-model-llama-2-7b-chat-q4_0.gguf \
-p "I believe the meaning of life is" \
-n 64 -t 1 -c 512 -am 2 -tp 1
The benchmark helper supports Linux host artifacts from build-host.sh and
Android artifacts from build-android.sh:
AM=8 TP=2 P=16 N=16 bash scripts/bench-speed.sh
Limit a helper run to selected models by passing file names relative to
MODEL_PREFIX:
MODEL_LIST="ggml-model-llama-2-7b-chat-q4_0.gguf" \
AM=2 TP=2 T=48 P=16 N=16 bash scripts/bench-speed.sh
Run a sweep over multiple memory budgets:
MEMORY_CONFIGS="0.5 1 2 4" TP=4 bash scripts/bench-speed-memory-sweep.sh
Benchmark output is written to bench-results/ unless RESULT_DIR is set.
Generated benchmark logs are ignored by git.
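
The sweep can also be driven from Python by passing the same environment variables. This is a rough sketch under stated assumptions: sweep_env and run_sweep are hypothetical helpers, and the sketch assumes the sweep script honors RESULT_DIR the same way the text above describes for benchmark output. Only the variable names come from this README.

```python
import os
import subprocess


def sweep_env(budgets_gb, prefetch_threads, result_dir=None):
    """Build environment overrides for the memory sweep (hypothetical helper)."""
    if not budgets_gb:
        raise ValueError("at least one memory budget is required")
    env = {
        # Space-separated budgets in GB, e.g. "0.5 1 2 4".
        "MEMORY_CONFIGS": " ".join(format(b, "g") for b in budgets_gb),
        "TP": str(prefetch_threads),
    }
    if result_dir is not None:
        # Assumed to redirect output away from bench-results/.
        env["RESULT_DIR"] = result_dir
    return env


def run_sweep(overrides, script="scripts/bench-speed-memory-sweep.sh"):
    """Run the sweep script with the overrides layered on the current env."""
    return subprocess.run(["bash", script],
                          env={**os.environ, **overrides}, check=True)


overrides = sweep_env([0.5, 1, 2, 4], prefetch_threads=4)
print(overrides)
```

Calling run_sweep(overrides) from the repository root would then start the sweep with those settings.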
For I/O-sensitive comparisons, enable cold-cache and resource-control options
such as DROP_CACHES=1, TASKSET_CPUS, and CGROUP_SPEC when the platform
allows them. See docs/benchmark.md for benchmark variables, cooldown controls,
cache control, CPU binding, and result file layout.
You can use cgroups to emulate smaller memory budgets on Linux:
sudo cgcreate -g memory:/limmem
echo 2147483648 | sudo tee /sys/fs/cgroup/limmem/memory.max
sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
sudo cgexec -g memory:limmem ./host/bin/flexinfer-cli \
-m hf-models/ggml-model-llama-2-70b-chat-q4_0.gguf \
-p "I believe the meaning of life is" \
-n 16 -am 2 -tp 1
If you use FlexInfer in your research, please cite:
@inproceedings{du2025flexinfer,
author = {Du, Hongchao and Wu, Shangyu and Kharlamova, Arina and Guan, Nan and Xue, Chun Jason},
title = {FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference},
booktitle = {Proceedings of the 5th Workshop on Machine Learning and Systems},
series = {EuroMLSys '25},
year = {2025},
isbn = {979-8-4007-1538-9},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
location = {Rotterdam, Netherlands},
pages = {56--65},
numpages = {10},
doi = {10.1145/3721146.3721961}
}
This repository is distributed under the MIT License. FlexInfer changes are
released under the same terms unless a source file states otherwise. Portions
of the code are derived from llama.cpp, ggml, and gguf-py; retain their
copyright and permission notices when redistributing source or binary artifacts.
See NOTICE and embedded third-party license files for details.