| title | emoji | colorFrom | colorTo | sdk | pinned |
|---|---|---|---|---|---|
| Cascade - Intelligent LLM Router | 🌊 | purple | blue | docker | false |
Intelligent LLM Request Router - Reduce API costs by 60%+ through smart routing and semantic caching.
Live API (no signup required):
```bash
curl -X POST http://136.111.230.240:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Explain AI in 5 words"}]}'
```

Experience Cascade's intelligent routing and cost optimization in action!
Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).
- ML-Powered Routing: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- Semantic Caching: Vector similarity search finds cached responses for similar queries (see the sketch after this list)
- OpenAI Compatible: Drop-in replacement for OpenAI API
- Cost Analytics: Real-time dashboard showing savings and usage metrics
- 60%+ Cost Reduction: Typical savings by routing simple queries to free models
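The semantic cache can be pictured as an embedding comparison: embed the incoming query, compare it against embeddings of previously answered queries, and reuse the stored response when cosine similarity clears the configured threshold (`SIMILARITY_THRESHOLD`, default 0.92 in the configuration table below). Cascade backs this with Redis and Qdrant; the in-memory sketch below only illustrates the idea, and the embedding model name is an assumption rather than the one Cascade ships with.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

SIMILARITY_THRESHOLD = 0.92  # default from the configuration table below

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
cache: list[tuple[np.ndarray, str]] = []          # (query embedding, cached response)

def cache_lookup(query: str) -> str | None:
    """Return a cached response if a previous query is similar enough, else None."""
    q = model.encode(query, normalize_embeddings=True)
    for emb, response in cache:
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:  # cosine similarity of unit vectors
            return response
    return None

def cache_store(query: str, response: str) -> None:
    """Remember a query/response pair for future lookups."""
    cache.append((model.encode(query, normalize_embeddings=True), response))

cache_store("What is the capital of France?", "Paris.")
print(cache_lookup("what's the capital of France?"))   # likely a cache hit
print(cache_lookup("Explain quantum entanglement"))    # None -> route to an LLM
```

In Cascade the per-query scan is handled by Qdrant's vector search rather than a Python loop, so lookups stay fast as the cache grows.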
The overall request flow:

```
┌──────────────────────────────────────────────────────────────┐
│                           Cascade                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return        │
│                    │                                         │
│                    ▼ (miss)                                  │
│              ML Classifier                                   │
│                    │                                         │
│         ┌──────────┼──────────┐                              │
│         ▼          ▼          ▼                              │
│       Simple    Medium     Complex                           │
│         │          │          │                              │
│         ▼          ▼          ▼                              │
│     Llama3.2  GPT-4o-mini  GPT-4o                            │
│      (free)   ($0.15/1M) ($2.50/1M)                          │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key
```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
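Once the API server is running, a quick way to confirm it is reachable is the health endpoint listed in the API table further below. A minimal check, assuming the `requests` library (the README does not document the response body, so only the status code is inspected):

```python
import requests

# /health is listed in the API endpoints table below.
resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code)  # expect 200 when the server is up
```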
Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

You can check aggregate usage and savings at any time:

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
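The savings fields follow directly from the two cost figures: `saved_dollars = baseline - actual` and `saved_percentage = 100 * saved_dollars / baseline` (here 7.89 - 2.34 = 5.55, and 5.55 / 7.89 ≈ 70.3%). A small client-side check of the same arithmetic, assuming only the `requests` library and the response shape shown above:

```python
import requests

# Fetch usage statistics from a running Cascade instance (response shape as above).
stats = requests.get("http://localhost:8000/v1/stats").json()

actual = stats["cost"]["actual"]
baseline = stats["cost"]["baseline"]

saved_dollars = baseline - actual
saved_percentage = 100 * saved_dollars / baseline
print(f"Saved ${saved_dollars:.2f} ({saved_percentage:.1f}%) vs. the baseline")
```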
| Environment Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
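To make the defaults concrete, the snippet below reads the same settings from the environment with the defaults taken from the table; it only illustrates how the values map to types and is not Cascade's actual settings module:

```python
import os

# Defaults mirror the configuration table above; override via environment variables or .env.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # required, no default
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
SIMILARITY_THRESHOLD = float(os.getenv("SIMILARITY_THRESHOLD", "0.92"))  # cosine similarity cut-off
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))  # seconds
```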
```
cascade/
├── src/cascade/
│   ├── api/          # FastAPI application
│   ├── cache/        # Redis + Qdrant caching
│   ├── cost/         # Cost tracking & analytics
│   ├── providers/    # LLM provider adapters
│   ├── router/       # ML classifier & routing
│   └── ui/           # Streamlit dashboard
├── ml/               # ML training pipeline
│   ├── data/         # Dataset loading
│   ├── training/     # Model training
│   └── export/       # ONNX conversion
├── tests/            # Test suite
└── docker-compose.yml
```
- Request Arrives: User sends a chat completion request
- Cache Check: Check the semantic cache for similar previous queries
- Complexity Classification: ML model predicts query complexity (0-1)
- Routing Decision (sketched in code after this list):
  - Score < 0.35 → Ollama (free)
  - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
  - Score > 0.70 → GPT-4o ($2.50/1M tokens)
- Response: Forward the request to the selected model, cache the result, return it
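The routing step itself is just a comparison of the predicted complexity score against the two thresholds above. A minimal sketch of that decision, assuming only the score boundaries listed here (the function and model identifiers are illustrative, not Cascade's internal API):

```python
def pick_model(complexity: float) -> str:
    """Map a complexity score in [0, 1] to a model tier using the thresholds above."""
    if complexity < 0.35:
        return "llama3.2"      # local Ollama model, free
    elif complexity <= 0.70:
        return "gpt-4o-mini"   # $0.15 per 1M tokens
    else:
        return "gpt-4o"        # $2.50 per 1M tokens

print(pick_model(0.12))  # llama3.2  -- e.g. "What is 2+2?"
print(pick_model(0.85))  # gpt-4o    -- e.g. "Write a compiler..."
```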
```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```

Railway offers the easiest deployment with automatic builds:
```bash
# Install Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add Redis plugin)
```

To deploy on Fly.io:

```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```

To deploy on Render:

- Fork this repository
- Connect to Render
- Use the `render.yaml` blueprint
- Set `OPENAI_API_KEY` in environment variables
```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for semantic cache) |
| `PORT` | No | Server port (default: 8000) |
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
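Because the API is OpenAI-compatible, `/v1/models` can be queried with the same client used earlier; the returned model IDs depend on the providers configured in the running instance:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# GET /v1/models via the OpenAI SDK; prints the model IDs the router exposes.
for model in client.models.list():
    print(model.id)
```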
Contributions are welcome! Please read our contributing guidelines first.
MIT License - see LICENSE for details.