llm-cache-router

A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.

Why llm-cache-router

Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:

Save money — a semantic cache returns answers for near-duplicate queries without re-calling the provider, typically cutting spend by 30–70% on production workloads.
Stay resilient — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
Control cost — built-in daily/monthly budget guardrails with Prometheus metrics for every request.

One dependency. Six providers. Three cache backends. Full async support.

Features

Semantic cache — vector-similarity matching via sentence-transformers, not just exact string hashing.
Multimodal-aware cache keys — images, audio, and video blocks are hashed into the query; cache is scoped per requested model.
Multi-provider routing across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
Three routing strategies: CHEAPEST_FIRST, FASTEST_FIRST, FALLBACK_CHAIN.
Pluggable cache backends: in-memory (FAISS), Redis, Qdrant.
Streaming — native async SSE streaming for every provider, transparent to the cache layer.
Cost tracker with per-model pricing, daily/monthly budget limits and savings accounting.
Cache warmup with controlled concurrency for pre-production pre-loading.
FastAPI middleware + Prometheus metrics endpoint out of the box.
Typed — Pydantic v2 models everywhere, fully typed public API.
Tested — 11 test modules covering router, cache (incl. multimodal keys & model isolation), providers, retry, warmup, and HTTP middleware.

Latest release: v0.2.4, which adds multimodal cache keys and per-model cache isolation (see the release notes).

Installation

pip install llm-cache-router

Optional extras:

pip install "llm-cache-router[redis]"     # Redis cache backend
pip install "llm-cache-router[qdrant]"    # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]"   # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]"       # everything above
pip install "llm-cache-router[dev]"       # tests, ruff, mypy

Requires Python 3.11+.

Quickstart

import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy


async def main() -> None:
    router = LLMRouter(
        providers={
            "openai":    {"api_key": "sk-...",           "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...",       "models": ["claude-3-5-sonnet"]},
            "gemini":    {"api_key": "AIza...",          "models": ["gemini-1.5-flash"]},
            "ollama":    {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,       # cosine similarity threshold
            ttl=3600,             # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )

    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")


asyncio.run(main())

Streaming

All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer is transparent: on a cache hit you receive a single final chunk; on a miss you receive a real streaming response, which is written to the cache once it completes.

async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")
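The hit/miss behaviour described above can be sketched with plain async generators. This is only an illustration under stated assumptions, not the library's implementation: `Chunk`, `fake_provider_stream`, `cached_stream`, and the exact-match `dict` standing in for the semantic cache are all hypothetical names.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's stream chunk type."""
    delta: str
    is_final: bool = False


async def fake_provider_stream(prompt: str) -> AsyncIterator[Chunk]:
    """Pretend provider that streams a canned answer piece by piece."""
    for word in ["async", " lets", " you", " wait"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield Chunk(delta=word)


async def cached_stream(prompt: str, cache: dict[str, str]) -> AsyncIterator[Chunk]:
    """Yield one final chunk on a hit; stream and then store on a miss."""
    if prompt in cache:
        yield Chunk(delta=cache[prompt], is_final=True)
        return
    parts: list[str] = []
    async for chunk in fake_provider_stream(prompt):
        parts.append(chunk.delta)
        yield chunk
    cache[prompt] = "".join(parts)  # written only after the stream completes
    yield Chunk(delta="", is_final=True)


async def collect(prompt: str, cache: dict[str, str]) -> list[Chunk]:
    """Drain the stream into a list so the two paths are easy to compare."""
    return [chunk async for chunk in cached_stream(prompt, cache)]
```

Calling `collect` twice with the same prompt and the same `dict` returns several chunks the first time and a single final chunk the second time.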

Cache Warmup

Pre-load the cache with known queries before traffic hits production:

from llm_cache_router.models import WarmupEntry

results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results)  # {"warmed": 2, "skipped": 0, "failed": 0}
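For readers curious how `concurrency` and `skip_cached` might interact, here is a minimal sketch built on `asyncio.Semaphore`. It is an assumption about the general technique, not the library's `warmup.py`: `warm_all`, `fake_fetch`, and the `dict` cache are hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def warm_all(
    prompts: list[str],
    fetch: Callable[[str], Awaitable[str]],
    cache: dict[str, str],
    concurrency: int = 5,
    skip_cached: bool = True,
) -> dict[str, int]:
    """Fill `cache`, running at most `concurrency` fetches at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    counts = {"warmed": 0, "skipped": 0, "failed": 0}

    async def warm_one(prompt: str) -> None:
        if skip_cached and prompt in cache:
            counts["skipped"] += 1
            return
        async with semaphore:  # waits while `concurrency` fetches are in flight
            try:
                cache[prompt] = await fetch(prompt)
                counts["warmed"] += 1
            except Exception:
                counts["failed"] += 1  # one bad entry does not abort the batch

    await asyncio.gather(*(warm_one(p) for p in prompts))
    return counts


async def fake_fetch(prompt: str) -> str:
    """Hypothetical provider call; fails for one prompt to exercise error counting."""
    await asyncio.sleep(0)
    if prompt == "boom":
        raise RuntimeError("provider error")
    return f"answer to {prompt}"
```

Bounding concurrency this way keeps a large warmup batch from tripping provider rate limits while still overlapping network waits.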

Routing Strategies

| Strategy | Description |
|---|---|
| `CHEAPEST_FIRST` | Picks the cheapest provider/model by live pricing for each call. |
| `FASTEST_FIRST` | Picks the provider with the lowest observed latency (EMA). |
| `FALLBACK_CHAIN` | Tries providers in order, falls back on error/timeout. |
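The "observed latency (EMA)" used by `FASTEST_FIRST` refers to an exponential moving average. A minimal sketch of that idea follows; `LatencyTracker` and the smoothing factor `alpha` are hypothetical, and the library's actual smoothing constant is not stated here.

```python
class LatencyTracker:
    """Hypothetical EMA tracker: new = alpha * sample + (1 - alpha) * old."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha
        self._ema: dict[str, float] = {}

    def observe(self, provider: str, latency_sec: float) -> None:
        """Fold one latency sample into the provider's running average."""
        previous = self._ema.get(provider)
        if previous is None:
            self._ema[provider] = latency_sec  # first sample seeds the average
        else:
            self._ema[provider] = self.alpha * latency_sec + (1 - self.alpha) * previous

    def fastest(self) -> str:
        """Return the provider with the lowest current average."""
        return min(self._ema, key=self._ema.__getitem__)


tracker = LatencyTracker(alpha=0.5)
tracker.observe("openai", 1.0)
tracker.observe("anthropic", 0.8)
tracker.observe("openai", 0.2)  # EMA becomes 0.5 * 0.2 + 0.5 * 1.0 = 0.6
```

After these three samples `tracker.fastest()` picks `"openai"`, because its average (0.6) has dropped below Anthropic's (0.8). A larger `alpha` reacts faster to recent samples; a smaller one smooths out spikes.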

router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)
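Conceptually, a fallback chain is a loop that returns the first successful answer. The sketch below shows the pattern under stated assumptions: `complete_with_fallback`, `broken`, and `healthy` are hypothetical, and the per-call timeout value is illustrative rather than a library default.

```python
import asyncio
from collections.abc import Awaitable, Callable

ProviderCall = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str,
    chain: list[tuple[str, ProviderCall]],
    timeout_sec: float = 10.0,
) -> tuple[str, str]:
    """Return (target_name, answer) from the first target that succeeds."""
    errors: list[str] = []
    for name, call in chain:
        try:
            answer = await asyncio.wait_for(call(prompt), timeout=timeout_sec)
            return name, answer
        except Exception as exc:  # includes the timeout raised by wait_for
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def broken(prompt: str) -> str:
    """Hypothetical provider that is currently down."""
    raise ConnectionError("vendor is down")


async def healthy(prompt: str) -> str:
    """Hypothetical provider that answers normally."""
    return f"echo: {prompt}"
```

With the chain `[("openai/gpt-4o", broken), ("anthropic/claude-3-5-sonnet", healthy)]`, the call returns the Anthropic target's answer after the first target raises.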

Cache Backends

In-memory (FAISS)

Default. Zero dependencies beyond the core install. Best for single-process apps and tests.

cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)

Redis

Production-grade distributed cache with LRU eviction, configurable timeouts, retry/backoff, and a bounded candidate set for vector search.

cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)
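The timeout and retry settings above suggest a retry-with-backoff loop around each Redis command. The sketch below shows one common form of that pattern. It is an assumption, not the library's code: `with_retry` and `FlakyCommand` are hypothetical, and doubling the pause on each attempt is a guess at the backoff policy.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def with_retry(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    backoff_sec: float = 0.2,
    timeout_sec: float = 1.5,
) -> T:
    """Run `operation` up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout_sec)
        except (ConnectionError, TimeoutError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(backoff_sec * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


class FlakyCommand:
    """Hypothetical Redis command that fails twice, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    async def __call__(self) -> str:
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("connection reset")
        return "PONG"
```

With `attempts=3`, `FlakyCommand` succeeds on its third call; with `attempts=2` the second `ConnectionError` would propagate to the caller.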

Qdrant

Native vector database for very large caches (millions of entries) and cross-service deployments.

pip install "llm-cache-router[qdrant]"

cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,           # optional; set it when using Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)

Multimodal Messages & Cache Keys

Messages follow the OpenAI-compatible shape: content can be a string or a list of blocks (text, image_url, Anthropic image, audio, video). The router passes your requested model into the cache layer so different models never share a hit for the same text.

from llm_cache_router.models import Message, WarmupEntry

messages: list[Message] = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64,..."},
            },
        ],
    }
]

response = await router.complete(messages=messages, model="gpt-4o-mini")
# Second call with the same text + same image → cache_hit=True
# Same text but a different image → cache miss
# Same messages but model="gpt-4o" → cache miss (different model scope)

Warmup supports the same multimodal payloads:

WarmupEntry(
    messages=messages,
    model="gpt-4o-mini",
)

Binary media is stored in the cache key as a short sha256 fingerprint, not the full base64 payload.
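A rough sketch of how such a key could be built with `hashlib` is shown below. The key layout, the 16-character fingerprint length, and the helper names (`fingerprint`, `cache_key`, `ask`) are assumptions for illustration; only text and `image_url` blocks are handled, and the library's real key format may differ.

```python
import hashlib
import json


def fingerprint(data: str, length: int = 16) -> str:
    """Short sha256 hex digest used in place of a large media payload."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:length]


def cache_key(model: str, messages: list[dict]) -> str:
    """Build a key from the model plus text and media fingerprints (hypothetical layout)."""
    parts: list[dict] = []
    for message in messages:
        content = message["content"]
        # Normalise plain-string content into a single text block.
        blocks = [{"type": "text", "text": content}] if isinstance(content, str) else content
        for block in blocks:
            if block["type"] == "text":
                parts.append({"role": message["role"], "text": block["text"]})
            elif block["type"] == "image_url":
                url = block["image_url"]["url"]
                parts.append({"role": message["role"], "image": fingerprint(url)})
    return json.dumps({"model": model, "parts": parts}, sort_keys=True)


def ask(image_b64: str) -> list[dict]:
    """Hypothetical helper: the same question about a given base64 image."""
    return [{"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]}]
```

Two calls with the same model and image produce equal keys, while changing either the image or the model produces a different key, which mirrors the three cases listed in the comments above.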

Budget and Cost Tracking

Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.

router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)

stats = router.stats()
print(stats.total_cost_usd)           # total spent since start
print(stats.saved_cost_usd)           # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd)     # None if no limit is set
print(stats.cache_hit_rate)           # 0.0–1.0

FastAPI Integration

pip install "llm-cache-router[fastapi]"

from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)

app = FastAPI()
add_http_metrics_middleware(app=app)
# `router` is an existing LLMRouter instance, e.g. the one built in Quickstart
mount_metrics_endpoint(app=app, router=router, path="/metrics")

Exposed Prometheus metrics:

llm_router_http_requests_total{method,path,status}
llm_router_http_request_duration_seconds_* (histogram)
llm_router_cache_hits_total, llm_router_cache_misses_total
llm_router_cost_usd_total, llm_router_saved_cost_usd_total

Async Context Manager

async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically — closes provider clients and cache connections
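For reference, the `async with` form relies on Python's `__aenter__` / `__aexit__` protocol. The sketch below shows the protocol with a hypothetical `ManagedRouter` stand-in; it does not reproduce what `LLMRouter.close()` actually releases.

```python
import asyncio


class ManagedRouter:
    """Hypothetical stand-in showing the context-manager protocol only."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        # A real router would close HTTP clients and cache connections here.
        self.closed = True

    async def __aenter__(self) -> "ManagedRouter":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()  # runs on normal exit and when the body raises


async def demo() -> bool:
    """Return whether the router was closed after leaving the block."""
    async with ManagedRouter() as router:
        assert not router.closed
    return router.closed
```

Because `__aexit__` also runs when the body raises, connections are released even on error paths, which is the main reason to prefer this form over calling `close()` by hand.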

Supported Providers

| Provider | Streaming | Notes |
|---|---|---|
| OpenAI | yes | `gpt-4o`, `gpt-4o-mini`, `o1-*`, etc. |
| Anthropic | yes | Claude 3.5 Sonnet/Haiku, Opus |
| Google Gemini | yes | 1.5 Flash, 1.5 Pro |
| Ollama | yes | Any locally-served model |
| MiniMax | yes | `MiniMax-Text-01` and others |
| Qwen (Dashscope) | yes | `qwen-plus`, `qwen-max`, etc. |

To add a new provider, subclass LLMProvider and register it with @register_provider("name"). See llm_cache_router/providers/base.py.

Architecture

llm_cache_router/
  cache/          # memory (FAISS) / redis / qdrant backends
  providers/      # openai, anthropic, gemini, ollama, minimax, qwen
  strategies/     # cheapest, fastest, fallback
  embeddings/     # SentenceEncoder, HashingEncoder
  cost/           # CostTracker with daily/monthly budgets
  middleware/     # FastAPI middleware
  observability/  # Prometheus metrics
  models.py       # Pydantic models (Message, LLMResponse, CacheEntry, ...)
  router.py       # LLMRouter — public entrypoint
  retry.py        # RetryConfig + exponential backoff
  warmup.py       # async warmup helper

Development

git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router

# using uv (recommended)
uv sync --all-extras
uv run pytest

# or plain pip
pip install -e ".[all,dev]"
pytest

Code quality is enforced in CI via:

ruff check (lint) and ruff format --check (style)
mypy --ignore-missing-imports (type check)
pytest on Python 3.11, 3.12, 3.13 with coverage

Roadmap

v0.3 — Django helpers and middleware.
v0.4 — Streaming retry (reconnect on SSE drop).
v0.5 — Request tracing hooks (OpenTelemetry).
v1.0 — Full OTel spans, pluggable pricing providers, cache invalidation API.

Contributing

Pull requests are welcome. Please:

Open an issue first for anything larger than a small bug fix.
Add tests for new behaviour.
Run ruff check, ruff format, mypy and pytest before pushing.

License

MIT — see LICENSE for details.

Short description (translated from Russian)

llm-cache-router is a lightweight, production-ready Python library for semantic caching of LLM requests, multi-provider routing, and budget control. It saves 30–70% on LLM bills through its vector cache, switches between providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) without changes to application code, and includes built-in cost tracking with daily and monthly limits. It supports three cache backends (in-memory / Redis / Qdrant), native streaming for all providers, and FastAPI middleware with Prometheus metrics.

v0.2.4: correct cache keys for multimodal messages (a media hash instead of base64) and cache isolation by model. See the release notes.

Installation:

pip install llm-cache-router

# with additional backends
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"

Requires Python 3.11+. Full documentation and examples are in the sections above.
