FastCache provides ready-to-run scripts and utilities to test KV-cache compression for MiniCPM-V-2.6 and LLaVA models under concurrent workloads. Two entry scripts are provided:
- `scripts/minicpm_testcon2.6.py`
- `scripts/llava_testcon2.6.py`

These share a common utils package in `utils_ccm/`.
- Python 3.10+
- NVIDIA GPU with CUDA (tested on an A100)
- Recommended: create a fresh virtualenv
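For example:

```bash
python3 -m venv .venv
source .venv/bin/activate
```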
Install dependencies:
```bash
pip install -r requirements.txt
```

You need the base model weights locally (HF or an internal mirror). Example paths are given below; adjust them for your machine.
- MiniCPM-V-2.6 (vision-language): put at e.g. /data/huggingface/MiniCPM-V-2_6
- LLaVA 1.5 7B HF: put at e.g. /data/huggingface/llava-1.5-7b-hf
The scripts load these repos with `trust_remote_code=True`.
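For reference, a minimal loading sketch following the standard transformers `Auto*` API (the path is the example above; adjust it for your machine):

```python
# Minimal sketch: load MiniCPM-V-2.6 with remote code enabled,
# as the entry scripts do.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "/data/huggingface/MiniCPM-V-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16
).eval().cuda()
```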
We include a tiny demo dataset in `datasets/` with two small images and a LLaVA-style JSON (`datasets/llava_v1_5_mix665k_demo.json`).
Use it to validate your setup without downloading large datasets:
```bash
# MiniCPM quick demo (2 samples)
python scripts/minicpm_testcon2.6.py \
  --model_path /data/huggingface/MiniCPM-V-2_6 \
  --ckpt_path ckpt/minicpm_mlp.pth \
  --datasets_js_path datasets/llava_v1_5_mix665k_demo.json \
  --datasets_img_path ./datasets \
  --use_compression --comp_mode ccm \
  --prefill_batch_size 1 --compress_batch_size 1 --decoding_batch_size 1 \
  --req_per_sec 1.0 --num_samples 2 --torch_dtype float16
```
```bash
# LLaVA quick demo (2 samples)
python scripts/llava_testcon2.6.py \
  --model_path /data/huggingface/llava-1.5-7b-hf \
  --ckpt_path ckpt/llava_mlp.pth \
  --datasets_js_path datasets/llava_v1_5_mix665k_demo.json \
  --datasets_img_path ./datasets \
  --use_compression --comp_mode ccm \
  --prefill_batch_size 1 --compress_batch_size 1 --decoding_batch_size 1 \
  --req_per_sec 1.0 --num_samples 2 --torch_dtype float16
```

Two MLP checkpoints (MiniCPM + LLaVA) are stored in this repo via Git LFS so you can run out of the box:
- `ckpt/minicpm_mlp.pth` (MiniCPM)
- `ckpt/llava_mlp.pth` (LLaVA)
Make sure you have Git LFS:
```bash
git lfs install
# If the files are not present after clone, fetch them explicitly:
git lfs pull
```

If you want to host the weights elsewhere (e.g., GitHub Releases or Hugging Face), just update `--ckpt_path` or `configs/experiments.yaml`.
The examples expect either GQA-style or MileBench-style VQA data. Provide two paths:
- `--datasets_js_path`: annotation JSON or folder (e.g. `/data/huggingface/LLaVA-Instruct-150K/llava_v1_5_mix665k.json`)
- `--datasets_img_path`: image root folder (e.g. `./datasets`)
You can swap in your own data; see the `load_image_data*` helpers in each script.
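For reference, a LLaVA-style annotation file is a JSON list of entries like the following (values are illustrative; the `image` field is resolved under `--datasets_img_path`):

```json
[
  {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
      { "from": "human", "value": "<image>\nWhat are the colors of the bus in the image?" },
      { "from": "gpt", "value": "The bus in the image is white and red." }
    ]
  }
]
```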
MiniCPM (single run, CCM):
```bash
python scripts/minicpm_testcon2.6.py \
  --model_path /data/huggingface/MiniCPM-V-2_6 \
  --ckpt_path ckpt/minicpm_mlp.pth \
  --datasets_js_path /data/huggingface/LLaVA-Instruct-150K/llava_v1_5_mix665k.json \
  --datasets_img_path ./datasets \
  --use_compression --comp_mode ccm \
  --prefill_batch_size 15 --compress_batch_size 45 --decoding_batch_size 90 \
  --req_per_sec 10.0 --num_samples 90 --torch_dtype float16
```

LLaVA (single run, CCM):
```bash
python scripts/llava_testcon2.6.py \
  --model_path /data/huggingface/llava-1.5-7b-hf \
  --ckpt_path ckpt/llava_mlp.pth \
  --datasets_js_path /data/huggingface/LLaVA-Instruct-150K/llava_v1_5_mix665k.json \
  --datasets_img_path ./datasets \
  --use_compression --comp_mode ccm \
  --prefill_batch_size 15 --compress_batch_size 45 --decoding_batch_size 90 \
  --req_per_sec 10.0 --num_samples 90 --torch_dtype float16
```

Batch (YAML) mode (MiniCPM example):
```bash
python scripts/minicpm_testcon2.6.py --config configs/experiments.yaml
```

- For CCM (`--comp_mode ccm`), the scripts build either `KVCacheHybridCompressor` (MiniCPM) or `KVCacheLinearDecoupleCompressor` (LLaVA) from `utils_ccm/module_ccm_v11.py` and load your MLP `.pth`.
- Other modes such as Knorm / SnapKV / ExpectedAttention are available via kvpress; ensure the package is installed (see the sketch after this list).
- `utils_ccm/utils_kvcachePool.py` provides a simple KV-cache pool for merging batched requests.
- `utils_ccm/utils_schedule_v11.py` implements a basic dynamic scheduler and GPU monitoring (pynvml required).
- NVTX ranges are used for profiling (nvtx). If you don't have it, remove those imports.
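As a standalone illustration of those kvpress modes, here is a minimal sketch following kvpress's context-manager pattern (the model name is only an example; the entry scripts integrate presses into their own batching logic):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvpress import KnormPress  # SnapKVPress / ExpectedAttentionPress work the same way

model_name = "Qwen/Qwen2.5-7B-Instruct"  # example only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

press = KnormPress(compression_ratio=0.5)  # evict ~50% of the KV entries
inputs = tokenizer("A long context to compress ...", return_tensors="pt").to(model.device)
with press(model):  # forward hooks compress the KV-cache during prefill
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```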
- `configs/experiments.yaml`: set your real `model_path`, `ckpt_path`, and dataset paths (a sketch of the expected fields follows this list).
- Or pass them via CLI as shown above.
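For orientation, a hypothetical `configs/experiments.yaml` might look like this; the field names mirror the CLI flags above but are assumptions, so check the shipped file for the actual schema:

```yaml
# Hypothetical example; field names mirror the CLI flags above.
model_path: /data/huggingface/MiniCPM-V-2_6
ckpt_path: ckpt/minicpm_mlp.pth
datasets_js_path: /data/huggingface/LLaVA-Instruct-150K/llava_v1_5_mix665k.json
datasets_img_path: ./datasets
use_compression: true
comp_mode: ccm
prefill_batch_size: 15
compress_batch_size: 45
decoding_batch_size: 90
req_per_sec: 10.0
num_samples: 90
torch_dtype: float16
```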
- Transformers versions: `OffloadedCache`/LLaVA classes require a recent transformers (>= 4.43). If missing, upgrade.
- CUDA OOM: reduce batch sizes and/or use `--torch_dtype float16`.
- kvpress import errors: `pip install kvpress`.
We provide training scripts for both the LLaVA and MiniCPM compressors in `scripts/`.
They train MLP-based KV-cache compressors that reduce the memory footprint while preserving output quality: the compressor learns to compress the Key-Value cache with minimal information loss.
- Knowledge Distillation: The compressor is trained to produce outputs that match the original (uncompressed) model's output distribution.
- Loss Function: KL divergence between the compressed and original output logits, combined with accuracy metrics (sketched after this list).
- Architecture: a linear decouple compressor (`KVCacheLinearDecoupleCompressor`) for LLaVA and a hybrid compressor (`KVCacheHybridCompressor`) for MiniCPM.
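A minimal sketch of that objective (assuming temperature-1 distillation on the logits; the actual loss in the training scripts may add further terms):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on the next-token distributions, where the
    student decodes with the compressed cache and the teacher with the
    original, uncompressed one."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # reduction="batchmean" matches the mathematical definition of KL divergence
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```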
| Script | Model | Output Checkpoint |
|---|---|---|
| `scripts/finetune_llava.py` | LLaVA-1.5-7B | `best_finetune_mlp_1030_mm_9.pth` |
| `scripts/finetune_minicpm.py` | MiniCPM-V-2.6 | `best_finetune_mlp_13B_mm_1_minicpm.pth` |
1. Load the pre-trained VLM (LLaVA or MiniCPM)
2. Forward pass to generate the original KV-cache
3. Compress the KV-cache with the compressor network
4. Compute the loss between compressed and original outputs
5. Backpropagate and update the compressor weights (see the toy loop below)
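Put together, a runnable toy version of this loop (stand-in modules and shapes for illustration; these are not the repo's actual classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

head_dim, vocab = 64, 1000
teacher_head = nn.Linear(head_dim, vocab).eval().requires_grad_(False)  # frozen VLM stand-in
compressor = nn.Sequential(  # MLP compressor stand-in (down- then up-project)
    nn.Linear(head_dim, head_dim // 4), nn.GELU(), nn.Linear(head_dim // 4, head_dim)
)
opt = torch.optim.AdamW(compressor.parameters(), lr=5e-4)

for step in range(100):
    kv = torch.randn(8, head_dim)                  # (1-2) "original KV-cache" from a forward pass
    with torch.no_grad():
        teacher_logits = teacher_head(kv)          # teacher output with the uncompressed cache
    student_logits = teacher_head(compressor(kv))  # (3) output with the compressed cache
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),   # (4) KL vs. the original
                    F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()   # (5) update only the compressor
```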
```bash
# Modify paths in finetune_llava.py before running
python scripts/finetune_llava.py
```

Key parameters to adjust in the script:

- `model_path`: path to the LLaVA/MiniCPM model
- `json_path`: path to the training data (LLaVA-Instruct-150K)
- `imgsets_path`: path to the image datasets
- `compression_ratio`: target compression ratio (default: 5x)
- `num_epochs`: number of training epochs
- `learning_rate`: learning rate (default: 5e-4)
- `scripts/`: entry points and training scripts
- `utils_ccm/`: shared modules (compressors, scheduler, KV-pool, helpers)
- `configs/`: example YAML for batch experiments
- `ckpt/`: place the two MLP weights here locally (or download them from a Release)
This repository also contains a method-aware KV-cache compression pipeline built on nano-vllm, located at:

- `kv-cache pipeline/`: runtime and compression scheduler
- `kv-cache pipeline/bench/bench_kvcache_matrix.py`: end-to-end throughput matrix benchmark
- `kv-cache pipeline/greenctx_shim/`: GreenContext shim source and build script
- `docs/kv-cache-pipeline-e2e.md`: complete end-to-end guide (in Chinese)
cd "kv-cache pipeline"编译 GreenContext shim(可选,用于 sgl_kernel ABI 不匹配时):
./greenctx_shim/build_shim.sh运行基准测试(示例):
```bash
CUDA_VISIBLE_DEVICES=0 \
LD_PRELOAD=./greenctx_shim/sgl_kernel_shim.so \
python bench/bench_kvcache_matrix.py \
  --models qwen3-8b \
  --workloads synthetic-long \
  --variants compression-only,triad \
  --synthetic-context 4096 \
  --synthetic-output 128 \
  --synthetic-token "Hello" \
  --synthetic-num-batches 1 \
  --synthetic-batch-sizes "32,64" \
  --qwen3-8b /data/huggingface/Qwen3-8B
```

For more details, see `docs/kv-cache-pipeline-e2e.md`.
Zhu, J., Wu, H., Wang, H., Li, Y., Hou, B., Li, R., & Zhai, J. (2025). FastCache: Optimizing multimodal LLM serving through lightweight KV-cache compression framework. arXiv preprint arXiv:2503.08461.