Skip to content

svalench/llm-cache-router

Repository files navigation

llm-cache-router

PyPI version Python versions PyPI Downloads License: MIT CI Code style: ruff

A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.


Table of Contents


Why llm-cache-router

Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:

  • Save money — a semantic cache returns answers for near-duplicate queries without re-calling the provider, typically cutting spend by 30–70% on production workloads.
  • Stay resilient — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
  • Control cost — built-in daily/monthly budget guardrails with Prometheus metrics for every request.

One dependency. Six providers. Three cache backends. Full async support.

Features

  • Semantic cache — vector-similarity matching via sentence-transformers, not just exact string hashing.
  • Multimodal-aware cache keys — images, audio, and video blocks are hashed into the query; cache is scoped per requested model.
  • Multi-provider routing across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
  • Three routing strategies: CHEAPEST_FIRST, FASTEST_FIRST, FALLBACK_CHAIN.
  • Pluggable cache backends: in-memory (FAISS), Redis, Qdrant.
  • Streaming — native async SSE streaming for every provider, transparent to the cache layer.
  • Cost tracker with per-model pricing, daily/monthly budget limits and savings accounting.
  • Cache warmup with controlled concurrency for pre-production pre-loading.
  • FastAPI middleware + Prometheus metrics endpoint out of the box.
  • Typed — Pydantic v2 models everywhere, fully typed public API.
  • Tested — 11 test modules covering router, cache (incl. multimodal keys & model isolation), providers, retry, warmup, and HTTP middleware.

Latest: v0.2.4 release notes — multimodal cache keys and per-model cache isolation.

Installation

pip install llm-cache-router

Optional extras:

pip install "llm-cache-router[redis]"     # Redis cache backend
pip install "llm-cache-router[qdrant]"    # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]"   # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]"       # everything above
pip install "llm-cache-router[dev]"       # tests, ruff, mypy

Requires Python 3.11+.

Quickstart

import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy


async def main() -> None:
    router = LLMRouter(
        providers={
            "openai":    {"api_key": "sk-...",           "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...",       "models": ["claude-3-5-sonnet"]},
            "gemini":    {"api_key": "AIza...",          "models": ["gemini-1.5-flash"]},
            "ollama":    {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,       # cosine similarity threshold
            ttl=3600,             # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )

    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")


asyncio.run(main())

Streaming

All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer is transparent: on a cache hit you receive a single final chunk, on a miss — a real streaming response that is also written to the cache once complete.

async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")

Cache Warmup

Pre-load the cache with known queries before traffic hits production:

from llm_cache_router.models import WarmupEntry

results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results)  # {"warmed": 2, "skipped": 0, "failed": 0}

Routing Strategies

Strategy Description
CHEAPEST_FIRST Picks the cheapest provider/model by live pricing for each call.
FASTEST_FIRST Picks the provider with the lowest observed latency (EMA).
FALLBACK_CHAIN Tries providers in order, falls back on error/timeout.
router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)

Cache Backends

In-memory (FAISS)

Default. Zero dependencies beyond the core install. Best for single-process apps and tests.

cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)

Redis

Production-grade distributed cache with LRU eviction, configurable timeouts, retry/backoff and bounded candidate set for vector search.

cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)

Qdrant

Native vector database for very large caches (millions of entries) and cross-service deployments.

pip install "llm-cache-router[qdrant]"
cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,           # optional for Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)

Multimodal Messages & Cache Keys

Messages follow the OpenAI-compatible shape: content can be a string or a list of blocks (text, image_url, Anthropic image, audio, video). The router passes your requested model into the cache layer so different models never share a hit for the same text.

from llm_cache_router.models import Message, WarmupEntry

messages: list[Message] = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64,..."},
            },
        ],
    }
]

response = await router.complete(messages=messages, model="gpt-4o-mini")
# Second call with the same text + same image → cache_hit=True
# Same text but a different image → cache miss
# Same messages but model="gpt-4o" → cache miss (different model scope)

Warmup supports the same multimodal payloads:

WarmupEntry(
    messages=messages,
    model="gpt-4o-mini",
)

Binary media is stored in the cache key as a short sha256 fingerprint, not the full base64 payload.

Budget and Cost Tracking

Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.

router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)

stats = router.stats()
print(stats.total_cost_usd)           # total spent since start
print(stats.saved_cost_usd)           # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd)     # None if no limit is set
print(stats.cache_hit_rate)           # 0.0–1.0

FastAPI Integration

pip install "llm-cache-router[fastapi]"
from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)

app = FastAPI()
add_http_metrics_middleware(app=app)
mount_metrics_endpoint(app=app, router=router, path="/metrics")

Exposed Prometheus metrics:

  • llm_router_http_requests_total{method,path,status}
  • llm_router_http_request_duration_seconds_* (histogram)
  • llm_router_cache_hits_total, llm_router_cache_misses_total
  • llm_router_cost_usd_total, llm_router_saved_cost_usd_total

Async Context Manager

async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically — closes provider clients and cache connections

Supported Providers

Provider Streaming Notes
OpenAI yes gpt-4o, gpt-4o-mini, o1-*, etc.
Anthropic yes Claude 3.5 Sonnet/Haiku, Opus
Google Gemini yes 1.5 Flash, 1.5 Pro
Ollama yes Any locally-served model
MiniMax yes MiniMax-Text-01 and others
Qwen (Dashscope) yes qwen-plus, qwen-max, etc.

Adding a new provider = subclass LLMProvider, register with @register_provider("name"). See llm_cache_router/providers/base.py.

Architecture

llm_cache_router/
  cache/          # memory (FAISS) / redis / qdrant backends
  providers/      # openai, anthropic, gemini, ollama, minimax, qwen
  strategies/     # cheapest, fastest, fallback
  embeddings/     # SentenceEncoder, HashingEncoder
  cost/           # CostTracker with daily/monthly budgets
  middleware/     # FastAPI middleware
  observability/  # Prometheus metrics
  models.py       # Pydantic models (Message, LLMResponse, CacheEntry, ...)
  router.py       # LLMRouter — public entrypoint
  retry.py        # RetryConfig + exponential backoff
  warmup.py       # async warmup helper

Development

git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router

# using uv (recommended)
uv sync --all-extras
uv run pytest

# or plain pip
pip install -e ".[all,dev]"
pytest

Code quality is enforced in CI via:

  • ruff check (lint) and ruff format --check (style)
  • mypy --ignore-missing-imports (type check)
  • pytest on Python 3.11, 3.12, 3.13 with coverage

Roadmap

  • v0.3 — Django helpers and middleware.
  • v0.4 — Streaming retry (reconnect on SSE drop).
  • v0.5 — Request tracing hooks (OpenTelemetry).
  • v1.0 — Full OTel spans, pluggable pricing providers, cache invalidation API.

Contributing

Pull requests are welcome. Please:

  1. Open an issue first for anything larger than a small bug fix.
  2. Add tests for new behaviour.
  3. Run ruff check, ruff format, mypy and pytest before pushing.

License

MIT — see LICENSE for details.


🇷🇺 Краткое описание (Russian)

llm-cache-router — лёгкая production-ready Python-библиотека для семантического кэширования LLM-запросов, мульти-провайдер роутинга и контроля бюджета. Экономит 30–70% на LLM-счетах за счёт векторного кэша, переключается между провайдерами (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) без изменений в коде приложения, и включает встроенный трекинг стоимости с дневными/месячными лимитами. Поддерживает три бэкенда кэша (in-memory / Redis / Qdrant), нативный стриминг для всех провайдеров и FastAPI-middleware с Prometheus-метриками.

v0.2.4: корректные ключи кэша для multimodal-сообщений (хэш медиа вместо base64) и изоляция кэша по model. Release notes.

Установка:

pip install llm-cache-router

# с дополнительными бэкендами
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"

Требуется Python 3.11+. Полная документация и примеры — выше (на английском).

About

A lightweight, production-ready Python library that combines semantic caching, multi-provider LLM routing, and cost tracking in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages