The RAG Cache API provides a high-performance caching layer for Large Language Model (LLM) queries, with both exact and semantic matching. Built with FastAPI, it offers automatic OpenAPI documentation and type-safe endpoints.
Development: http://localhost:8000
Production: https://your-domain.com
FastAPI provides automatic interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
Currently, the API does not require authentication for local development. For production deployment, implement one of the following:
- API Key authentication
- OAuth2 with JWT tokens
- Basic HTTP authentication
See DEPLOYMENT.md for production authentication setup.
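As a rough illustration of the first option, the sketch below checks an API key taken from request headers without depending on any framework. The `X-API-Key` header name and the `is_authorized` helper are assumptions made for this example; they are not part of the current API, and DEPLOYMENT.md remains the reference for the real setup.

```python
import hmac

# Hypothetical header name: the API does not define one today.
API_KEY_HEADER = "X-API-Key"


def is_authorized(headers: dict, expected_key: str) -> bool:
    """Return True when the request carries the expected API key."""
    if not expected_key:
        # Refuse everything if no key is configured, rather than allowing all.
        return False
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(API_KEY_HEADER.lower(), "")
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


print(is_authorized({"x-api-key": "dev-only-key"}, "dev-only-key"))
print(is_authorized({}, "dev-only-key"))
```

In a FastAPI application a check like this would typically live in a dependency or middleware, so that every protected endpoint shares the same logic.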
Check the health status of the service and its dependencies.
Endpoint: GET /health
Response:
{
"status": "healthy",
"timestamp": "2025-11-11T10:30:00.000Z",
"services": {
"redis": "connected",
"qdrant": "connected"
}
}
Status Codes:
- `200 OK` - Service is healthy
- `503 Service Unavailable` - Service is unhealthy
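A monitoring script might combine the status code with the response body before declaring the service usable. The sketch below is one way to do that; the `is_healthy` helper is hypothetical, and treating any dependency state other than `"connected"` as unhealthy is an assumption, since the source only shows the healthy case.

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Treat the service as healthy only if every signal agrees."""
    if status_code != 200:
        return False
    payload = json.loads(body)
    services = payload.get("services", {})
    # Assumption: any dependency state other than "connected" means trouble.
    return payload.get("status") == "healthy" and all(
        state == "connected" for state in services.values()
    )


sample = json.dumps({
    "status": "healthy",
    "timestamp": "2025-11-11T10:30:00.000Z",
    "services": {"redis": "connected", "qdrant": "connected"},
})
print(is_healthy(200, sample))  # True
```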
Example:
curl http://localhost:8000/health
Process a query with caching support.
Endpoint: POST /api/v1/query
Request Body:
{
"query": "What is artificial intelligence?",
"use_cache": true,
"semantic_threshold": 0.85,
"provider": "openai",
"model": "gpt-3.5-turbo",
"max_tokens": 500,
"temperature": 0.7
}
Request Parameters:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | Yes | - | The query text to process |
| `use_cache` | boolean | No | `true` | Whether to use cache |
| `semantic_threshold` | float | No | `0.85` | Similarity threshold for semantic cache (0.0-1.0) |
| `provider` | string | No | `"openai"` | LLM provider (openai, anthropic) |
| `model` | string | No | `"gpt-3.5-turbo"` | Model name |
| `max_tokens` | integer | No | `500` | Maximum response tokens |
| `temperature` | float | No | `0.7` | Temperature for response generation (0.0-1.0) |
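The defaults and ranges in the table can be mirrored on the client so that obviously invalid requests never leave the process. The following is a minimal sketch, not the server's actual validation: the `QueryRequest` class is hypothetical, and the rule that `max_tokens` must be positive is an assumption the table does not state.

```python
from dataclasses import asdict, dataclass


@dataclass
class QueryRequest:
    """Request body for POST /api/v1/query, using the documented defaults."""

    query: str
    use_cache: bool = True
    semantic_threshold: float = 0.85
    provider: str = "openai"
    model: str = "gpt-3.5-turbo"
    max_tokens: int = 500
    temperature: float = 0.7

    def validate(self) -> list:
        """Return a list of problems; an empty list means the request looks valid."""
        problems = []
        if not self.query.strip():
            problems.append("query must not be empty")
        if not 0.0 <= self.semantic_threshold <= 1.0:
            problems.append("semantic_threshold must be between 0.0 and 1.0")
        if not 0.0 <= self.temperature <= 1.0:
            problems.append("temperature must be between 0.0 and 1.0")
        if self.max_tokens <= 0:  # assumption: not stated in the table
            problems.append("max_tokens must be positive")
        return problems


request = QueryRequest(query="What is artificial intelligence?")
print(request.validate())  # []
print(asdict(request))     # dict ready to send as the JSON body
```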
Response:
{
  "query": "What is artificial intelligence?",
  "response": "Artificial intelligence (AI) refers to...",
  "provider": "openai",
  "model": "gpt-3.5-turbo",
  "cache_hit": true,
  "cache_type": "exact",
  "latency_ms": 45,
  "tokens_used": 0,
  "timestamp": "2025-11-11T10:30:00.000Z"
}
Response Fields:
| Field | Type | Description |
|---|---|---|
| `query` | string | Original query text |
| `response` | string | Generated or cached response |
| `provider` | string | LLM provider used |
| `model` | string | Model used for generation |
| `cache_hit` | boolean | Whether response was from cache |
| `cache_type` | string | Type of cache hit (`exact`, `semantic`, `none`) |
| `latency_ms` | integer | Response time in milliseconds |
| `tokens_used` | integer | Tokens used (0 if cache hit) |
| `timestamp` | string | Response timestamp (ISO 8601) |
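On the client side, the response fields map naturally onto a small typed container. The sketch below assumes a hypothetical `QueryResult` class; the sample payload follows the field descriptions above, including `tokens_used` being 0 on a cache hit.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class QueryResult:
    """Typed view of the documented response fields."""

    query: str
    response: str
    provider: str
    model: str
    cache_hit: bool
    cache_type: str
    latency_ms: int
    tokens_used: int
    timestamp: str

    @classmethod
    def from_json(cls, body: str) -> "QueryResult":
        data = json.loads(body)
        # Keep only the documented fields so unknown extras do not break parsing.
        return cls(**{f.name: data[f.name] for f in fields(cls)})

    def describe(self) -> str:
        source = f"{self.cache_type} cache" if self.cache_hit else f"{self.provider}/{self.model}"
        return f"answered from {source} in {self.latency_ms} ms ({self.tokens_used} tokens)"


sample = json.dumps({
    "query": "What is artificial intelligence?",
    "response": "Artificial intelligence (AI) refers to...",
    "provider": "openai",
    "model": "gpt-3.5-turbo",
    "cache_hit": True,
    "cache_type": "exact",
    "latency_ms": 45,
    "tokens_used": 0,
    "timestamp": "2025-11-11T10:30:00.000Z",
})
result = QueryResult.from_json(sample)
print(result.describe())  # answered from exact cache in 45 ms (0 tokens)
```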
Status Codes:
- `200 OK` - Query processed successfully
- `400 Bad Request` - Invalid request parameters
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Service dependencies unavailable
Examples:
# Basic query
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?"
}'
# Query with custom parameters
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"query": "Explain neural networks",
"provider": "openai",
"model": "gpt-4",
"temperature": 0.5,
"semantic_threshold": 0.90
}'
# Query without cache
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is deep learning?",
"use_cache": false
}'
Get cache performance statistics.
Endpoint: GET /api/v1/cache/stats
Response:
{
"total_queries": 1500,
"cache_hits": 875,
"cache_misses": 625,
"hit_rate": 58.33,
"exact_hits": 500,
"semantic_hits": 375,
"avg_latency_ms": 125,
"avg_cache_latency_ms": 45,
"avg_llm_latency_ms": 850,
"tokens_saved": 125000,
"uptime_seconds": 86400
}
Response Fields:
| Field | Type | Description |
|---|---|---|
| `total_queries` | integer | Total queries processed |
| `cache_hits` | integer | Total cache hits |
| `cache_misses` | integer | Total cache misses |
| `hit_rate` | float | Cache hit rate percentage |
| `exact_hits` | integer | Exact cache hits |
| `semantic_hits` | integer | Semantic cache hits |
| `avg_latency_ms` | integer | Average latency |
| `avg_cache_latency_ms` | integer | Average cache hit latency |
| `avg_llm_latency_ms` | integer | Average LLM query latency |
| `tokens_saved` | integer | Total tokens saved by caching |
| `uptime_seconds` | integer | Service uptime |
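The counters in the sample response appear to be related by simple arithmetic: hits plus misses equal the total, exact plus semantic hits equal the hits, and `hit_rate` is the hit share as a percentage. The sketch below checks those relationships against the sample values; the relationships are inferred from the example, not stated by the API, and `hit_rate` is a hypothetical helper.

```python
def hit_rate(cache_hits: int, total_queries: int) -> float:
    """Cache hit rate as a percentage, rounded to two decimals."""
    if total_queries == 0:
        return 0.0  # avoid dividing by zero before any query has been served
    return round(100.0 * cache_hits / total_queries, 2)


stats = {
    "total_queries": 1500,
    "cache_hits": 875,
    "cache_misses": 625,
    "exact_hits": 500,
    "semantic_hits": 375,
}

# Inferred consistency checks on the sample payload.
print(stats["cache_hits"] + stats["cache_misses"] == stats["total_queries"])  # True
print(stats["exact_hits"] + stats["semantic_hits"] == stats["cache_hits"])    # True
print(hit_rate(stats["cache_hits"], stats["total_queries"]))                  # 58.33
```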
Example:
curl http://localhost:8000/api/v1/cache/stats
Clear all cached entries (admin operation).
Endpoint: DELETE /api/v1/cache
Response:
{
"message": "Cache cleared successfully",
"entries_deleted": 150
}
Status Codes:
- `200 OK` - Cache cleared successfully
- `500 Internal Server Error` - Failed to clear cache
Example:
curl -X DELETE http://localhost:8000/api/v1/cache
Retrieve a specific cache entry by query.
Endpoint: GET /api/v1/cache/entry
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | string | Yes | The query to look up |
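Because the query travels in the URL, spaces and characters such as `?` must be percent-encoded. A small sketch using the standard library follows; the `cache_entry_url` helper is hypothetical, and `quote_via=quote` is chosen so spaces become `%20` rather than `+`.

```python
from urllib.parse import quote, urlencode


def cache_entry_url(base_url: str, query: str) -> str:
    """Build the lookup URL with the query percent-encoded."""
    params = urlencode({"query": query}, quote_via=quote)
    return f"{base_url.rstrip('/')}/api/v1/cache/entry?{params}"


print(cache_entry_url("http://localhost:8000", "What is AI?"))
# http://localhost:8000/api/v1/cache/entry?query=What%20is%20AI%3F
```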
Response:
{
"query": "What is AI?",
"response": "Artificial intelligence refers to...",
"created_at": "2025-11-11T10:30:00.000Z",
"ttl": 3600,
"hits": 5
}
Status Codes:
- `200 OK` - Cache entry found
- `404 Not Found` - Cache entry not found
Example:
curl "http://localhost:8000/api/v1/cache/entry?query=What%20is%20AI%3F"
All error responses follow a consistent format:
{
"detail": "Error message describing what went wrong",
"error_code": "ERROR_CODE",
"timestamp": "2025-11-11T10:30:00.000Z"
}
| Code | Status | Description |
|---|---|---|
| `INVALID_QUERY` | 400 | Query text is empty or invalid |
| `INVALID_THRESHOLD` | 400 | Semantic threshold out of range (0.0-1.0) |
| `PROVIDER_NOT_FOUND` | 400 | Invalid LLM provider specified |
| `CACHE_ERROR` | 500 | Cache service error |
| `LLM_ERROR` | 500 | LLM provider error |
| `SERVICE_UNAVAILABLE` | 503 | Required service unavailable |
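Since every error shares the same shape, a client can convert error bodies into a single exception type. The sketch below is one possible approach: `RAGCacheError` and `parse_error` are hypothetical names, and the idea that only a 503 is worth retrying is an assumption rather than documented behavior.

```python
import json


class RAGCacheError(Exception):
    """Client-side exception carrying the documented error fields."""

    def __init__(self, status: int, error_code: str, detail: str):
        super().__init__(f"[{status}] {error_code}: {detail}")
        self.status = status
        self.error_code = error_code
        self.detail = detail

    @property
    def retryable(self) -> bool:
        # Assumption: only a 503 is worth retrying; 4xx errors need a fixed request.
        return self.status == 503


def parse_error(status: int, body: str) -> RAGCacheError:
    """Build an exception from an error response body."""
    data = json.loads(body)
    return RAGCacheError(
        status,
        data.get("error_code", "UNKNOWN"),  # "UNKNOWN" is a placeholder, not a documented code
        data.get("detail", ""),
    )


body = json.dumps({
    "detail": "Semantic threshold must be between 0.0 and 1.0",
    "error_code": "INVALID_THRESHOLD",
    "timestamp": "2025-11-11T10:30:00.000Z",
})
error = parse_error(400, body)
print(error)            # [400] INVALID_THRESHOLD: Semantic threshold must be between 0.0 and 1.0
print(error.retryable)  # False
```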
{
"detail": "Semantic threshold must be between 0.0 and 1.0",
"error_code": "INVALID_THRESHOLD",
"timestamp": "2025-11-11T10:30:00.000Z"
}
Rate limiting status: Not implemented in the MVP
Future Implementation:
- Rate limiting by IP address
- Rate limiting by API key
- Configurable limits (requests per minute/hour)
- 429 Too Many Requests response
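Since rate limiting is not implemented yet, the following is only a sketch of how a per-client limit could work, using a sliding window kept in memory. The `SlidingWindowLimiter` class and its parameters are hypothetical; a real deployment would more likely store the counters in a shared store so that limits hold across processes.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id (IP or API key) -> request times

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the caller would answer with 429 Too Many Requests
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("203.0.113.7", now=t) for t in (0.0, 1.0, 2.0)])  # [True, True, False]
```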
Pagination status: Not applicable for current endpoints
Future Implementation:
- Paginated cache history endpoint
- Paginated query logs endpoint
Webhooks status: Not implemented
Future Implementation:
- Webhook notifications for cache events
- Configurable webhook endpoints
- Retry logic for failed deliveries
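Retry logic for failed deliveries is commonly built on exponential backoff. The sketch below illustrates that idea only; webhooks do not exist in the API yet, and `backoff_delays`, `deliver_with_retry`, and the `send` callable are all hypothetical.

```python
import time


def backoff_delays(attempts: int, base_seconds: float = 1.0,
                   factor: float = 2.0, cap_seconds: float = 60.0) -> list:
    """Delays that grow geometrically and never exceed cap_seconds."""
    return [min(cap_seconds, base_seconds * factor ** i) for i in range(attempts)]


def deliver_with_retry(send, payload, delays, sleep=time.sleep) -> bool:
    """Call send(payload) until it returns True or the delays run out."""
    for delay in delays:
        if send(payload):
            return True
        sleep(delay)  # wait before the next attempt
    return bool(send(payload))  # final attempt after the last wait


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Passing `sleep` as a parameter keeps the function testable without real waiting.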
# Python client example
import requests


class RAGCacheClient:
    """Minimal client for the RAG Cache API."""

    def __init__(self, base_url: str = "http://localhost:8000", timeout: float = 30.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout  # seconds; avoids hanging forever on a stalled request

    def query(
        self,
        query: str,
        use_cache: bool = True,
        semantic_threshold: float = 0.85,
        provider: str = "openai",
        model: str = "gpt-3.5-turbo",
    ) -> dict:
        """Send a query to the RAG Cache API."""
        response = requests.post(
            f"{self.base_url}/api/v1/query",
            json={
                "query": query,
                "use_cache": use_cache,
                "semantic_threshold": semantic_threshold,
                "provider": provider,
                "model": model,
            },
            timeout=self.timeout,
        )
        response.raise_for_status()  # raise on 4xx/5xx responses
        return response.json()

    def get_stats(self) -> dict:
        """Get cache statistics."""
        response = requests.get(f"{self.base_url}/api/v1/cache/stats", timeout=self.timeout)
        response.raise_for_status()
        return response.json()


# Usage
client = RAGCacheClient()
result = client.query("What is machine learning?")
print(f"Response: {result['response']}")
print(f"Cache hit: {result['cache_hit']}")
// TypeScript client example
class RAGCacheClient {
  constructor(private baseUrl: string = 'http://localhost:8000') {}

  async query(options: {
    query: string;
    useCache?: boolean;
    semanticThreshold?: number;
    provider?: string;
    model?: string;
  }): Promise<any> {
    const response = await fetch(`${this.baseUrl}/api/v1/query`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        query: options.query,
        use_cache: options.useCache ?? true,
        semantic_threshold: options.semanticThreshold ?? 0.85,
        provider: options.provider ?? 'openai',
        model: options.model ?? 'gpt-3.5-turbo',
      }),
    });
    if (!response.ok) {
      throw new Error(`API error: ${response.statusText}`);
    }
    return response.json();
  }

  async getStats(): Promise<any> {
    const response = await fetch(`${this.baseUrl}/api/v1/cache/stats`);
    // Fail on non-2xx responses, matching the behavior of query()
    if (!response.ok) {
      throw new Error(`API error: ${response.statusText}`);
    }
    return response.json();
  }
}

// Usage
const client = new RAGCacheClient();
const result = await client.query({ query: 'What is machine learning?' });
console.log(`Response: ${result.response}`);
console.log(`Cache hit: ${result.cache_hit}`);
# Function to query the API
query_api() {
  local query="$1"
  local provider="${2:-openai}"
  local model="${3:-gpt-3.5-turbo}"
  # Arguments are interpolated directly into the JSON body, so values that
  # contain double quotes or backslashes must be escaped by the caller.
  curl -X POST http://localhost:8000/api/v1/query \
    -H "Content-Type: application/json" \
    -d "{
      \"query\": \"$query\",
      \"provider\": \"$provider\",
      \"model\": \"$model\"
    }"
}
# Usage
query_api "What is machine learning?" "openai" "gpt-4"
Target response times:
- Cache Hit: < 50ms
- Cache Miss (with LLM): < 1000ms
- Health Check: < 10ms
Each query is resolved in this order:
- Exact Match: Check for an exact query match in Redis
- Semantic Match: Check for similar queries in Qdrant
- LLM Query: Fall back to the LLM if there is no cache hit
- Cache Store: Store the new response in both caches
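The lookup order can be sketched in a few lines with in-memory stand-ins: a dict in place of Redis and a list of vectors in place of Qdrant. Everything named here (`TwoTierCache`, `toy_embed`, `VOCAB`, the fake LLM) is an assumption made for illustration; the real service uses proper embeddings and the actual Redis and Qdrant clients.

```python
import math
import re

VOCAB = ["what", "is", "machine", "learning", "explain", "neural", "networks"]


def toy_embed(text: str) -> list:
    """Toy bag-of-words vector over a fixed vocabulary (not a real embedding)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, embed, call_llm, semantic_threshold: float = 0.85):
        self.embed = embed
        self.call_llm = call_llm
        self.semantic_threshold = semantic_threshold
        self.exact = {}    # stands in for Redis: query text -> response
        self.vectors = []  # stands in for Qdrant: (embedding, response) pairs

    def answer(self, query: str):
        """Return (response, cache_type) following the four steps above."""
        # 1. Exact match on the raw query text.
        if query in self.exact:
            return self.exact[query], "exact"
        # 2. Semantic match: best stored vector at or above the threshold.
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for stored, response in self.vectors:
            score = cosine_similarity(vector, stored)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.semantic_threshold:
            return best_response, "semantic"
        # 3. Fall back to the LLM.
        response = self.call_llm(query)
        # 4. Store the new response in both caches.
        self.exact[query] = response
        self.vectors.append((vector, response))
        return response, "none"


cache = TwoTierCache(toy_embed, call_llm=lambda q: f"LLM answer to: {q}")
for q in ["What is machine learning?", "What is machine learning?", "what is machine learning"]:
    print(cache.answer(q)[1])  # none, then exact, then semantic
```

The returned labels match the documented `cache_type` values, so the same structure shows where `cache_hit` would be set.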
Recommendations for getting the most from the cache:
- Use Caching: Enable `use_cache` for better performance
- Adjust Threshold: Lower `semantic_threshold` for more cache hits, at the cost of looser matches
- Batch Queries: Group similar queries together
- Monitor Stats: Check cache hit rate regularly
- Clean Cache: Periodically clear old entries
Future implementation will include Prometheus metrics:
- `/metrics` - Prometheus metrics endpoint
- Counter: `ragcache_queries_total`
- Counter: `ragcache_cache_hits_total`
- Counter: `ragcache_cache_misses_total`
- Histogram: `ragcache_query_duration_seconds`
- Gauge: `ragcache_cache_size`
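To give a feel for what the `/metrics` endpoint might return, the sketch below keeps the three planned counters in memory and renders them in the Prometheus text exposition format. It is an illustration only: the `QueryMetrics` class is hypothetical, the histogram and gauge are left out, and an actual implementation would most likely rely on a Prometheus client library instead of hand-written formatting.

```python
class QueryMetrics:
    """In-memory counters using the planned metric names."""

    def __init__(self):
        self.counters = {
            "ragcache_queries_total": 0,
            "ragcache_cache_hits_total": 0,
            "ragcache_cache_misses_total": 0,
        }

    def record(self, cache_hit: bool) -> None:
        """Count one processed query as either a hit or a miss."""
        self.counters["ragcache_queries_total"] += 1
        key = "ragcache_cache_hits_total" if cache_hit else "ragcache_cache_misses_total"
        self.counters[key] += 1

    def render(self) -> str:
        """Render the counters as Prometheus text exposition lines."""
        lines = []
        for name, value in self.counters.items():
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


metrics = QueryMetrics()
for hit in (True, False, True):
    metrics.record(hit)
print(metrics.render())
```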
All API requests are logged with:
- Request ID
- Timestamp
- Endpoint
- Status code
- Response time
- Cache hit/miss status
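A structured log entry carrying these six fields could look like the sketch below. The key names and the `build_log_record` helper are assumptions; the source lists which fields are logged but not their exact format.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_log_record(endpoint: str, status_code: int, response_time_ms: float,
                     cache_hit: bool, request_id: Optional[str] = None) -> dict:
    """Assemble one log entry with the fields listed above."""
    return {
        "request_id": request_id or str(uuid.uuid4()),  # generated when not supplied
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "endpoint": endpoint,
        "status_code": status_code,
        "response_time_ms": response_time_ms,
        "cache_status": "hit" if cache_hit else "miss",
    }


record = build_log_record("/api/v1/query", 200, 45.0, cache_hit=True)
print(json.dumps(record))  # one JSON object per line is easy to ship to log tooling
```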
The API uses URL-based versioning:
- Current Version: v1
- Base Path: `/api/v1`
Breaking changes will be introduced only in new versions (`/api/v2`, etc.); existing versions remain backward compatible.
For API issues or questions:
- Check the interactive documentation
- Review ARCHITECTURE.md for system design
- See CONTRIBUTING.md for development guidelines
- Open an issue on GitHub for bugs or feature requests
Changelog:
- Initial API release
- Query processing endpoint
- Health check endpoint
- Cache statistics endpoint
- Exact and semantic caching support
- OpenAI provider integration
Planned releases:
- v1.1.0: Add Anthropic provider
- v1.2.0: Add rate limiting
- v1.3.0: Add authentication
- v2.0.0: Breaking changes (TBD)
Last Updated: November 11, 2025