A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.
- Why llm-cache-router
- Features
- Installation
- Quickstart
- Streaming
- Cache Warmup
- Routing Strategies
- Cache Backends
- Multimodal Messages & Cache Keys
- Budget and Cost Tracking
- FastAPI Integration
- Async Context Manager
- Supported Providers
- Architecture
- Development
- Roadmap
- Contributing
- License
Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:
- Save money — a semantic cache returns answers for near-duplicate queries without re-calling the provider, typically cutting spend by 30–70% on production workloads.
- Stay resilient — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
- Control cost — built-in daily/monthly budget guardrails with Prometheus metrics for every request.
One dependency. Six providers. Three cache backends. Full async support.
- Semantic cache — vector-similarity matching via sentence-transformers, not just exact string hashing.
- Multimodal-aware cache keys — images, audio, and video blocks are hashed into the query; cache is scoped per requested model.
- Multi-provider routing across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
- Three routing strategies: CHEAPEST_FIRST, FASTEST_FIRST, FALLBACK_CHAIN.
- Pluggable cache backends: in-memory (FAISS), Redis, Qdrant.
- Streaming — native async SSE streaming for every provider, transparent to the cache layer.
- Cost tracker with per-model pricing, daily/monthly budget limits and savings accounting.
- Cache warmup with controlled concurrency for pre-production pre-loading.
- FastAPI middleware + Prometheus metrics endpoint out of the box.
- Typed — Pydantic v2 models everywhere, fully typed public API.
- Tested — 11 test modules covering router, cache (incl. multimodal keys & model isolation), providers, retry, warmup, and HTTP middleware.
Latest: v0.2.4 release notes — multimodal cache keys and per-model cache isolation.
pip install llm-cache-router

Optional extras:
pip install "llm-cache-router[redis]" # Redis cache backend
pip install "llm-cache-router[qdrant]" # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]" # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]" # everything above
pip install "llm-cache-router[dev]" # tests, ruff, mypy

Requires Python 3.11+.
import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy

async def main() -> None:
    router = LLMRouter(
        providers={
            "openai": {"api_key": "sk-...", "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
            "gemini": {"api_key": "AIza...", "models": ["gemini-1.5-flash"]},
            "ollama": {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,  # cosine similarity threshold
            ttl=3600,  # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )
    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")

asyncio.run(main())

All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer is transparent: on a cache hit you receive a single final chunk; on a miss you receive a real streaming response, which is also written to the cache once complete.
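The hit/miss behaviour described above can be sketched with plain async generators. This is an illustration under stated assumptions, not the library's code: `Chunk`, `fake_provider_stream` and `cached_stream` are hypothetical names, and a plain dict with exact-match keys stands in for the semantic cache to keep the example short.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass
class Chunk:
    delta: str
    is_final: bool = False


async def fake_provider_stream(prompt: str) -> AsyncIterator[Chunk]:
    """Stand-in for a provider SSE stream: yields the answer word by word."""
    for word in ["async", "is", "cooperative"]:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield Chunk(delta=word + " ")


async def cached_stream(prompt: str, cache: dict[str, str]) -> AsyncIterator[Chunk]:
    """Yield one final chunk on a hit; otherwise stream and store the full text."""
    if prompt in cache:
        yield Chunk(delta=cache[prompt], is_final=True)
        return
    parts: list[str] = []
    async for chunk in fake_provider_stream(prompt):
        parts.append(chunk.delta)
        yield chunk
    cache[prompt] = "".join(parts)  # written only after the stream completes
    yield Chunk(delta="", is_final=True)


async def collect(prompt: str, cache: dict[str, str]) -> list[Chunk]:
    return [chunk async for chunk in cached_stream(prompt, cache)]


store: dict[str, str] = {}
first = asyncio.run(collect("explain async", store))   # miss: three deltas, then a final marker
second = asyncio.run(collect("explain async", store))  # hit: a single final chunk
```

Writing to the cache only after the last delta means a stream that fails midway never leaves a partial answer behind.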
async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")

Pre-load the cache with known queries before traffic hits production:
from llm_cache_router.models import WarmupEntry
results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results)  # {"warmed": 2, "skipped": 0, "failed": 0}

| Strategy | Description |
|---|---|
| CHEAPEST_FIRST | Picks the cheapest provider/model by live pricing for each call. |
| FASTEST_FIRST | Picks the provider with the lowest observed latency (EMA). |
| FALLBACK_CHAIN | Tries providers in order, falls back on error/timeout. |
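The selection logic behind the three strategies in the table can be sketched in a few lines. The code below is a standalone illustration, not the router's internals: `Candidate`, the prices, the smoothing factor `alpha=0.2` and the helper names are all assumptions made for the example.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    usd_per_1k_tokens: float     # illustrative price, not real pricing data
    latency_ema_s: float = 0.0   # smoothed latency in seconds

    def observe_latency(self, sample_s: float, alpha: float = 0.2) -> None:
        """Update the exponential moving average (EMA) with a new sample."""
        if self.latency_ema_s == 0.0:
            self.latency_ema_s = sample_s  # first sample seeds the average
        else:
            self.latency_ema_s = alpha * sample_s + (1 - alpha) * self.latency_ema_s


def cheapest_first(candidates: list[Candidate]) -> Candidate:
    return min(candidates, key=lambda c: c.usd_per_1k_tokens)


def fastest_first(candidates: list[Candidate]) -> Candidate:
    return min(candidates, key=lambda c: c.latency_ema_s)


def fallback_chain(candidates: list[Candidate], call: Callable[[Candidate], str]) -> str:
    """Try each candidate in order; move on when the call raises."""
    last_error: Exception | None = None
    for candidate in candidates:
        try:
            return call(candidate)
        except Exception as exc:  # a real router would catch narrower error types
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


a = Candidate("provider-a", usd_per_1k_tokens=0.60)
b = Candidate("provider-b", usd_per_1k_tokens=0.15)
a.observe_latency(0.4)
b.observe_latency(1.0)
b.observe_latency(2.0)  # EMA becomes 0.2 * 2.0 + 0.8 * 1.0


def flaky(candidate: Candidate) -> str:
    if candidate.name == "provider-a":
        raise TimeoutError("provider-a is down")
    return f"ok from {candidate.name}"


answer = fallback_chain([a, b], flaky)
```

An EMA reacts to recent latency changes without letting a single slow request dominate, which is why it is a common choice for this kind of ranking.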
router = LLMRouter(
    providers={
        "openai": {"api_key": "sk-...", "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)

In-memory (FAISS) is the default backend. It needs no dependencies beyond the core install and is best for single-process apps and tests.

cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)

Redis is a production-grade distributed cache backend with LRU eviction, configurable timeouts, retry/backoff, and a bounded candidate set for vector search.
cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)

Qdrant is a native vector database backend for very large caches (millions of entries) and cross-service deployments.

pip install "llm-cache-router[qdrant]"

cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,  # optional for Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)

Messages follow the OpenAI-compatible shape: content can be a string or a list of blocks (text, image_url, Anthropic image, audio, video). The router passes your requested model into the cache layer so different models never share a hit for the same text.
from llm_cache_router.models import Message, WarmupEntry
messages: list[Message] = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64,..."},
            },
        ],
    }
]
response = await router.complete(messages=messages, model="gpt-4o-mini")
# Second call with the same text + same image → cache_hit=True
# Same text but a different image → cache miss
# Same messages but model="gpt-4o" → cache miss (different model scope)

Warmup supports the same multimodal payloads:

WarmupEntry(
    messages=messages,
    model="gpt-4o-mini",
)

Binary media is stored in the cache key as a short sha256 fingerprint, not the full base64 payload.
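To make the fingerprinting idea concrete, here is a minimal sketch of how a cache key could replace media payloads with short digests. The key layout, the 16-character truncation and the function names are assumptions for illustration; the library's actual key format may differ.

```python
import hashlib
import json


def fingerprint(data: str, length: int = 16) -> str:
    """Short sha256 hex digest used in place of the full media payload."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:length]


def cache_key(model: str, messages: list[dict]) -> str:
    """Build a deterministic key from the model name and normalized messages."""
    normalized = []
    for message in messages:
        content = message["content"]
        blocks = [{"type": "text", "text": content}] if isinstance(content, str) else content
        parts = []
        for block in blocks:
            if block["type"] == "text":
                parts.append(["text", block["text"]])
            elif block["type"] == "image_url":
                parts.append(["image", fingerprint(block["image_url"]["url"])])
            else:
                # Other block types: hash the JSON-serialized block as a whole.
                parts.append([block["type"], fingerprint(json.dumps(block, sort_keys=True))])
        normalized.append([message["role"], parts])
    payload = json.dumps({"model": model, "messages": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def image_question(url: str) -> list[dict]:
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": url}},
        ],
    }]


key_a = cache_key("gpt-4o-mini", image_question("data:image/png;base64,AAAA"))
key_same = cache_key("gpt-4o-mini", image_question("data:image/png;base64,AAAA"))
key_other_image = cache_key("gpt-4o-mini", image_question("data:image/png;base64,BBBB"))
key_other_model = cache_key("gpt-4o", image_question("data:image/png;base64,AAAA"))
```

The same text and image give the same key, while a different image or a different model gives a different one, matching the three cases listed in the comments of the example above.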
Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.
router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)
stats = router.stats()
print(stats.total_cost_usd)  # total spent since start
print(stats.saved_cost_usd)  # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd)  # None if no limit is set
print(stats.cache_hit_rate)  # 0.0–1.0

pip install "llm-cache-router[fastapi]"

from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)
app = FastAPI()
add_http_metrics_middleware(app=app)
mount_metrics_endpoint(app=app, router=router, path="/metrics")

Exposed Prometheus metrics:
- llm_router_http_requests_total{method,path,status}
- llm_router_http_request_duration_seconds_* (histogram)
- llm_router_cache_hits_total, llm_router_cache_misses_total
- llm_router_cost_usd_total, llm_router_saved_cost_usd_total
async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically on exit: it closes provider clients and cache connections

| Provider | Streaming | Notes |
|---|---|---|
| OpenAI | yes | gpt-4o, gpt-4o-mini, o1-*, etc. |
| Anthropic | yes | Claude 3.5 Sonnet/Haiku, Opus |
| Google Gemini | yes | 1.5 Flash, 1.5 Pro |
| Ollama | yes | Any locally-served model |
| MiniMax | yes | MiniMax-Text-01 and others |
| Qwen (Dashscope) | yes | qwen-plus, qwen-max, etc. |
To add a new provider, subclass LLMProvider and register it with @register_provider("name"). See llm_cache_router/providers/base.py.
llm_cache_router/
    cache/          # memory (FAISS) / redis / qdrant backends
    providers/      # openai, anthropic, gemini, ollama, minimax, qwen
    strategies/     # cheapest, fastest, fallback
    embeddings/     # SentenceEncoder, HashingEncoder
    cost/           # CostTracker with daily/monthly budgets
    middleware/     # FastAPI middleware
    observability/  # Prometheus metrics
    models.py       # Pydantic models (Message, LLMResponse, CacheEntry, ...)
    router.py       # LLMRouter — public entrypoint
    retry.py        # RetryConfig + exponential backoff
    warmup.py       # async warmup helper
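The exponential backoff mentioned for retry.py follows a well-known pattern, sketched below. The helper names, defaults and the cap are assumptions for illustration and are not taken from the library's RetryConfig.

```python
import asyncio
import random


def backoff_delays(attempts: int, base_sec: float = 0.2, factor: float = 2.0,
                   max_sec: float = 5.0) -> list[float]:
    """Delays before each retry: base, base * factor, ... capped at max_sec."""
    return [min(base_sec * factor**i, max_sec) for i in range(attempts)]


async def retry_async(func, attempts: int = 3, base_sec: float = 0.2, jitter: float = 0.0):
    """Call func(); on failure wait with exponential backoff and try again."""
    delays = backoff_delays(attempts, base_sec)
    for attempt in range(attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(delays[attempt] + random.uniform(0, jitter))


calls = {"n": 0}


async def flaky() -> str:
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(retry_async(flaky, attempts=3, base_sec=0.001))
```

Adding a small random jitter spreads out retries from many clients so they do not all hit a recovering provider at the same moment.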
git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router
# using uv (recommended)
uv sync --all-extras
uv run pytest
# or plain pip
pip install -e ".[all,dev]"
pytest

Code quality is enforced in CI via:
- ruff check (lint) and ruff format --check (style)
- mypy --ignore-missing-imports (type check)
- pytest on Python 3.11, 3.12, 3.13 with coverage
- v0.3 — Django helpers and middleware.
- v0.4 — Streaming retry (reconnect on SSE drop).
- v0.5 — Request tracing hooks (OpenTelemetry).
- v1.0 — Full OTel spans, pluggable pricing providers, cache invalidation API.
Pull requests are welcome. Please:
- Open an issue first for anything larger than a small bug fix.
- Add tests for new behaviour.
- Run ruff check, ruff format, mypy and pytest before pushing.
MIT — see LICENSE for details.
llm-cache-router is a lightweight, production-ready Python library for semantic caching of LLM requests, multi-provider routing, and budget control. It saves 30–70% on LLM bills through a vector cache, switches between providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) without changes to application code, and includes built-in cost tracking with daily/monthly limits. It supports three cache backends (in-memory / Redis / Qdrant), native streaming for all providers, and FastAPI middleware with Prometheus metrics.
v0.2.4: correct cache keys for multimodal messages (a media hash instead of base64) and cache isolation per model. Release notes.
Installation:
pip install llm-cache-router
# with additional backends
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"

Requires Python 3.11+. Full documentation and examples are above.