LLM Cost and Latency Optimization Dashboard
An LLM usage analytics dashboard for cost, latency, routing, and optimization decisions.
CostPilot is a dashboard and middleware toolkit that tracks LLM usage, cost, latency, cache hit rate, model selection, and workflow-level spend. It serves as the financial control panel for AI systems.
- Real-time Cost Tracking: Monitor token usage and costs across all LLM providers
- Latency Analytics: Track response times with p50, p95, p99 percentiles
- Workflow Attribution: Attribute costs to specific workflows and features
- Cache Hit Rate: Measure prompt caching effectiveness and savings
- Model Comparison: Compare cost and performance across models
- Expensive Prompt Detection: Identify costly prompts for optimization
- Budget Alerts: Set spending thresholds and get notified
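The latency percentiles mentioned above can be sketched with a simple nearest-rank computation. This is a minimal illustration: the `percentile` helper and the sample records are hypothetical, not part of the SDK.

```python
# Minimal sketch of p50/p95/p99 latency percentiles over logged records.
# The "latency_ms" field mirrors the SDK's usage payload; everything else
# here is illustrative.

def percentile(sorted_values, p):
    """Nearest-rank percentile over a pre-sorted list of samples."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = round(p / 100 * len(sorted_values))  # nearest rank, 1-based
    index = max(0, min(len(sorted_values) - 1, rank - 1))
    return sorted_values[index]

records = [{"latency_ms": v}
           for v in (800, 950, 1200, 1500, 3100, 900, 1100, 1250, 4000, 1000)]
latencies = sorted(r["latency_ms"] for r in records)

p50 = percentile(latencies, 50)  # median latency
p95 = percentile(latencies, 95)  # tail latency
p99 = percentile(latencies, 99)
```

A production dashboard would typically push this into the database (e.g. SQL `percentile_cont`) rather than computing it in application code.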
```
┌─────────────┐      ┌──────────────┐      ┌────────────┐
│   SDK/MW    │─────▶│   FastAPI    │─────▶│ PostgreSQL │
│  (Python)   │      │    Server    │      │            │
└─────────────┘      └──────┬───────┘      └────────────┘
                            │
                     ┌──────▼───────┐
                     │   Next.js    │
                     │  Dashboard   │
                     └──────────────┘
```
```
cp .env.example .env
docker-compose up -d
```

The dashboard will be available at http://localhost:3000 and the API at http://localhost:8000.
```
# Server
cd server
pip install -r requirements.txt
uvicorn app.main:app --reload
```

```
# Dashboard
cd dashboard
npm install
npm run dev
```

```
# SDK (editable install)
cd sdk
pip install -e .
```

```python
from costpilot import CostPilotClient

client = CostPilotClient(
    server_url="http://localhost:8000",
    api_key="your-api-key",
    project="my-project"
)

await client.log_usage({
    "model": "gpt-4o",
    "workflow": "summarization",
    "prompt_tokens": 1500,
    "completion_tokens": 500,
    "latency_ms": 1200,
    "cached": False
})
```

```python
from costpilot.decorators import track_cost, track_llm_call

@track_cost(workflow="content-generation")
@track_llm_call(model="gpt-4o")
async def generate_content(prompt: str) -> str:
    response = await openai.chat.completions.create(...)
    return response.choices[0].message.content
```

```python
from costpilot.middleware import CostPilotMiddleware

app.add_middleware(CostPilotMiddleware, client=client)
```

CostPilot calculates costs based on provider pricing data:
- Input tokens: Charged per 1K tokens at the input rate
- Output tokens: Charged per 1K tokens at the output rate
- Cache savings: Cached tokens are tracked separately for savings calculation
- Workflow aggregation: Costs roll up to workflow and project levels
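The rules above amount to a simple per-1K-token calculation. The sketch below illustrates it; the rates and the cached-savings handling are illustrative assumptions, not CostPilot's shipped pricing or logic.

```python
# Sketch of the per-1K-token cost rules above.
# Rates are illustrative placeholders, not real provider pricing.
PRICING = {
    "gpt-4o": {"input_per_1k": 0.0025, "output_per_1k": 0.01},
}

def record_cost(record):
    """Return (billed_cost, cache_savings) in USD for one usage record."""
    rates = PRICING[record["model"]]
    input_cost = record["prompt_tokens"] / 1000 * rates["input_per_1k"]
    output_cost = record["completion_tokens"] / 1000 * rates["output_per_1k"]
    cost = input_cost + output_cost
    # Assumption: a cached call bills nothing and its would-be cost
    # is tracked separately as savings.
    if record.get("cached"):
        return 0.0, cost
    return cost, 0.0

cost, savings = record_cost({
    "model": "gpt-4o",
    "prompt_tokens": 1500,
    "completion_tokens": 500,
    "cached": False,
})
```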
Pricing data is loaded from YAML configuration files. See config/pricing.example.yaml for the format.
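A pricing file might look something like the following. This is a hypothetical sketch of one plausible shape; the authoritative schema is in config/pricing.example.yaml, and the rates shown are placeholders.

```yaml
# Hypothetical pricing file; consult config/pricing.example.yaml for the
# real schema. Rates below are illustrative, not current provider pricing.
openai:
  gpt-4o:
    input_per_1k_usd: 0.0025
    output_per_1k_usd: 0.01
  gpt-4o-mini:
    input_per_1k_usd: 0.00015
    output_per_1k_usd: 0.0006
anthropic:
  claude-3-5-sonnet:
    input_per_1k_usd: 0.003
    output_per_1k_usd: 0.015
```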
Supported providers:
- OpenAI (GPT-4o, GPT-4o-mini, GPT-4 Turbo, GPT-3.5 Turbo)
- Anthropic (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku)
- Google (Gemini 1.5 Pro, Gemini 1.5 Flash)
- Mistral (Mistral Large, Mistral Medium)
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/usage | POST | Log a usage record |
| /api/v1/usage/batch | POST | Log batch usage records |
| /api/v1/costs | GET | Query cost data |
| /api/v1/costs/by-workflow | GET | Costs grouped by workflow |
| /api/v1/costs/by-model | GET | Costs grouped by model |
| /api/v1/analytics/over-time | GET | Cost trends over time |
| /api/v1/analytics/expensive-prompts | GET | Most expensive prompts |
| /api/v1/analytics/optimization | GET | Optimization suggestions |
| /api/v1/health | GET | Health check |
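As an illustration of the usage endpoint above, a record could be posted with the standard library. The Bearer auth header is an assumption about the deployment, and the request is prepared but deliberately not sent here.

```python
import json
import urllib.request

# Hypothetical call to POST /api/v1/usage; the Authorization scheme
# is an assumption, not a documented requirement.
payload = {
    "model": "gpt-4o",
    "workflow": "summarization",
    "prompt_tokens": 1500,
    "completion_tokens": 500,
    "latency_ms": 1200,
    "cached": False,
}

req = urllib.request.Request(
    "http://localhost:8000/api/v1/usage",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer your-api-key"},
    method="POST",
)
# urllib.request.urlopen(req) would perform the request; omitted here.
```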
- Home: Summary overview with key metrics and charts
- Costs: Detailed cost breakdown by model, workflow, and time
- Latency: Latency distribution and trends per model
- Workflows: Workflow-level spend and performance metrics
```
# SDK tests
cd sdk && pytest

# Server tests
cd server && pytest

# Dashboard build
cd dashboard && npm run build
```

See .env.example for all configuration options.
Budget thresholds can include webhook_url, warning_threshold_percent, and critical_threshold_percent. CostPilot records the last crossed threshold and exposes the alert payload from /api/v1/budget/status/{project}:

```json
{
  "project": "my-project",
  "monthly_budget_usd": 1000.0,
  "current_spend_usd": 850.0,
  "percent_used": 85.0,
  "status": "warning",
  "threshold_percent": 80,
  "triggered_at": "2026-05-08T22:00:00Z"
}
```

Outbound webhook delivery requires an approved sender integration; the API stores the webhook configuration and exposes the exact alert payload through the status endpoint.
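The status field in the payload above can be derived from spend and thresholds. A minimal sketch follows; the 95% critical default is an assumption (only the 80% warning figure appears in the example payload), and the function name is illustrative.

```python
# Sketch: deriving the alert status from spend and thresholds,
# mirroring the /api/v1/budget/status payload fields above.
# The 95% critical default is an assumption, not a documented value.
def budget_status(current_spend_usd, monthly_budget_usd,
                  warning_threshold_percent=80,
                  critical_threshold_percent=95):
    percent_used = round(current_spend_usd / monthly_budget_usd * 100, 1)
    if percent_used >= critical_threshold_percent:
        status = "critical"
    elif percent_used >= warning_threshold_percent:
        status = "warning"
    else:
        status = "ok"
    return {"percent_used": percent_used, "status": status}
```

With the numbers from the example payload, `budget_status(850.0, 1000.0)` yields 85.0% used and a "warning" status.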
MIT