diff --git a/docs/proposals/semantic-integration.md b/docs/proposals/semantic-integration.md
new file mode 100644
index 00000000..eb4ff9d9
--- /dev/null
+++ b/docs/proposals/semantic-integration.md
@@ -0,0 +1,654 @@
+# Empowering vLLM Router with Semantic Intelligence
+
+**Authors**: vLLM Semantic Router Team
+**Date**: December 2025
+**Status**: Proposal
+**Target Audience**: vLLM Community
+
+---
+
+## Executive Summary
+
+This proposal outlines a plan to integrate semantic routing capabilities into the vLLM Router project, enabling it to replace the Python-based vLLM API Server with a high-performance, intelligent routing layer. The integration provides cross-instance capabilities, including semantic caching, centralized response management, automatic LoRA adapter selection, and enterprise-grade security features, while retaining the performance benefits of a Rust implementation.
+
+**Key Benefits**:
+
+1. **Signal-Based Routing**: Automatic adapter selection for multi-LoRA deployments
+ - Keyword-based routing for simple pattern matching
+ - Domain classification for intent-aware adapter selection
+ - Embedding-based semantic similarity for nuanced routing
+ - Fact-checking and verification routing for high-stakes queries
+
+2. **Cross-Instance Intelligence**: Shared state and optimization across all vLLM instances
+ - Response API: Centralized response storage enabling stateful multi-turn conversations
+ - Semantic Cache: 48.5% token reduction through cross-instance vector similarity matching
+
+3. **Guardrails**: Enterprise-grade security and safety features
+ - PII Detection: Prevent sensitive information leakage
+ - Jailbreak Prevention: Block malicious prompt injection attempts
+ - Hallucination Detection: Verify response reliability for critical domains
+
+---
+
+## 1. Background and Motivation
+
+### 1.1 Current vLLM Architecture
+
+The current vLLM deployment typically follows this pattern:
+
+```mermaid
+graph LR
+ C1[Client 1] --> API1[vLLM API Server<br/>Python]
+ C2[Client 2] --> API2[vLLM API Server<br/>Python]
+ C3[Client 3] --> API3[vLLM API Server<br/>Python]
+
+ API1 --> E1[vLLM Engine 1<br/>C++/CUDA]
+ API2 --> E2[vLLM Engine 2<br/>C++/CUDA]
+ API3 --> E3[vLLM Engine 3<br/>C++/CUDA]
+
+ style API1 fill:#ffcccc
+ style API2 fill:#ffcccc
+ style API3 fill:#ffcccc
+ style E1 fill:#ccffcc
+ style E2 fill:#ccffcc
+ style E3 fill:#ccffcc
+```
+
+**Limitations**:
+- **Single-Instance Scope**: Each API server manages only one vLLM engine instance
+- **Python Performance Bottleneck**: GIL limitations, interpreted language overhead
+- **No Cross-Instance Capabilities**: Cannot share cache, state, or intelligence across instances
+- **Limited Routing Logic**: Basic request forwarding without semantic understanding
+
+### 1.2 vLLM Router: Current State
+
+The vLLM Router (https://github.com/vllm-project/router) is a high-performance Rust-based load balancer that provides:
+- Load balancing policies (round_robin, consistent_hash, cache_aware, etc.)
+- Prefill-Decode disaggregation
+- Circuit breakers and retry logic
+- Kubernetes service discovery
+
+**Gap**: While excellent at load balancing, it lacks semantic understanding and cross-instance intelligence.
+
+### 1.3 vLLM Semantic Router: Proven Capabilities
+
+The vLLM Semantic Router project (https://github.com/vllm-project/semantic-router) has demonstrated:
+- **Intent Classification**: ModernBERT-based classification for routing decisions
+- **Semantic Caching**: Cross-instance cache with similarity thresholds of 0.92 to 0.95
+- **Security**: PII detection, jailbreak prevention, hallucination detection
+- **LoRA Selection**: Automatic adapter selection based on query intent
+- **Response API**: Centralized response storage for multi-turn conversations
+
+**Research Validation**: Published at NeurIPS 2025 MLForSys workshop ([arxiv:2510.08731](https://arxiv.org/abs/2510.08731))
+
+### 1.4 Capability Comparison
+
+The following table compares the standalone vLLM Router with the vLLM Router plus Semantic Router integration:
+
+| Capability | vLLM Router (Current) | vLLM Router + Semantic Router | Impact |
+|------------|----------------------|-------------------------------|---------|
+| **Load Balancing** | ✅ Round Robin, Consistent Hash, Cache-Aware | ✅ Same + Intent-Aware Routing | Enhanced routing intelligence |
+| **Prefill-Decode Disaggregation** | ✅ Supported | ✅ Supported | Maintained in vLLM Router layer |
+| **Prefix Caching** | ✅ Supported | ✅ Supported | Maintained in vLLM Engine layer |
+| **Circuit Breaker & Retry** | ✅ Built-in | ✅ Built-in | No change |
+| **LoRA Adapter Support** | ✅ Manual selection by client | ✅ **Automatic selection** based on intent | 10.2% accuracy improvement |
+| **Multi-turn Adaptive Routing** | ❌ Not supported | ✅ **Cost/Accuracy/Feedback-aware** routing | Dynamic optimization per conversation |
+| **Semantic Caching** | ❌ Not supported | ✅ **Cross-instance cache** (Redis/Milvus) | 48.5% token reduction |
+| **Response API** | ⚠️ Limited to single instance | ✅ **Cross-instance state** (Redis) | Enables stateful conversations |
+| **Intent Classification** | ❌ Not supported | ✅ **ModernBERT-based** classification | Domain-aware routing |
+| **PII Detection** | ❌ Not supported | ✅ **Built-in** security plugin | Enterprise compliance |
+| **Jailbreak Prevention** | ❌ Not supported | ✅ **Built-in** security plugin | Safety & security |
+| **Hallucination Detection** | ❌ Not supported | ✅ **Optional** post-processing plugin | Quality assurance |
+| **Tool Selection** | ❌ Not supported | ✅ **Automatic** tool routing | Enhanced agent capabilities |
+| **Domain-Aware Prompts** | ❌ Not supported | ✅ **Dynamic** system prompt injection | Context-specific responses |
+
+**Key Takeaways**:
+
+- ✅ **Backward Compatible**: All existing vLLM Router features remain unchanged
+- 🚀 **Performance Boost**: 47.1% lower latency on cache hits, 48.5% fewer tokens
+- 🎯 **Accuracy Improvement**: 10.2% higher accuracy through intelligent LoRA selection
+- 🔒 **Enterprise-Ready**: Built-in security (PII, jailbreak, hallucination detection)
+- 💰 **Cost Reduction**: Semantic caching and token efficiency reduce operational costs
+- 🌐 **Cross-Instance Intelligence**: Shared cache and state across all instances
+
+### 1.5 Architecture Comparison
+
+```mermaid
+graph TB
+ subgraph Current["Current Architecture (Python API Server)"]
+ direction TB
+ C1[Clients] --> P1[Python API Server 1]
+ C1 --> P2[Python API Server 2]
+ C1 --> P3[Python API Server 3]
+ P1 --> V1[vLLM Engine 1]
+ P2 --> V2[vLLM Engine 2]
+ P3 --> V3[vLLM Engine 3]
+
+ Note1[❌ No cross-instance cache<br/>❌ No semantic routing<br/>❌ Python performance bottleneck<br/>❌ No centralized state]
+ end
+
+ subgraph Proposed["Proposed Architecture (vLLM Router + Semantic Router)"]
+ direction TB
+ C2[Clients] --> VR[vLLM Router + Semantic Router<br/>Rust + Golang/Rust FFI]
+ VR --> Cache[(Semantic Cache<br/>Redis/Milvus)]
+ VR --> RespAPI[(Response API<br/>Redis)]
+ VR --> LB[Load Balancer]
+ LB --> VE1[vLLM Engine 1<br/>+LoRA A]
+ LB --> VE2[vLLM Engine 2<br/>+LoRA B]
+ LB --> VE3[vLLM Engine 3<br/>+LoRA C]
+
+ Note2[✅ Cross-instance cache<br/>✅ Intent-based LoRA selection<br/>✅ High performance Rust<br/>✅ Centralized state management]
+ end
+
+ style Current fill:#ffe6e6
+ style Proposed fill:#e6ffe6
+ style Note1 fill:#ffcccc
+ style Note2 fill:#ccffcc
+```
+
+---
+
+## 2. Proposed Architecture
+
+### 2.1 High-Level Architecture
+
+```mermaid
+graph TB
+ Client[Client Requests]
+
+ subgraph Router["vLLM Router (Rust)"]
+ subgraph SR["Semantic Router Layer"]
+ IC[Intent Classification<br/>ModernBERT-based<br/>Category detection]
+ SC[Semantic Cache Check<br/>Redis/Milvus<br/>Vector similarity]
+ SP[Security Plugins<br/>PII Detection<br/>Jailbreak Prevention]
+ LS[LoRA Selection<br/>Intent → LoRA mapping<br/>Domain-specific]
+ end
+
+ subgraph LB["Load Balancing Layer"]
+ RR[Round Robin / Consistent Hash]
+ CB[Circuit Breaker]
+ RL[Retry Logic]
+ PD[Prefill-Decode Disaggregation]
+ end
+ end
+
+ subgraph Instances["vLLM Engine Instances"]
+ I1[Instance 1<br/>+LoRA A]
+ I2[Instance 2<br/>+LoRA B]
+ I3[Instance 3<br/>+LoRA C]
+ I4[...]
+ end
+
+ Client --> IC
+ IC --> SC
+ SC --> SP
+ SP --> LS
+ LS --> RR
+ RR --> CB
+ CB --> RL
+ RL --> PD
+ PD --> I1
+ PD --> I2
+ PD --> I3
+ PD --> I4
+
+ style Router fill:#e1f5ff
+ style SR fill:#fff3e0
+ style LB fill:#f3e5f5
+ style Instances fill:#e8f5e9
+```
+
+### 2.2 Request Flow
+
+```mermaid
+sequenceDiagram
+ participant C as Client
+ participant R as vLLM Router
+ participant IC as Intent Classifier
+ participant SC as Semantic Cache
+ participant SP as Security Plugins
+ participant LB as Load Balancer
+ participant V as vLLM Instance
+ participant RA as Response API
+
+ C->>R: OpenAI API Request
+ R->>IC: Classify Intent
+ IC-->>R: Category (e.g., "math")
+
+ R->>SC: Check Cache
+ alt Cache Hit
+ SC-->>R: Cached Response
+ R-->>C: Return Response
+ else Cache Miss
+ SC-->>R: No Match
+
+ R->>SP: Security Check
+ alt PII/Jailbreak Detected
+ SP-->>R: Block Request
+ R-->>C: Error Response
+ else Safe
+ SP-->>R: Pass
+
+ R->>R: Select LoRA Adapter
+ Note over R: Based on intent category
+
+ R->>LB: Route Request
+ LB->>V: Forward to Instance
+ V-->>LB: LLM Response
+ LB-->>R: Response
+
+ par Store Response
+ R->>RA: Store in Response API
+ and Update Cache
+ R->>SC: Cache Response
+ end
+
+ R-->>C: Return Response
+ end
+ end
+```
+
+### 2.3 Why Not Signal-Based Model Selection?
+
+**Important Design Decision**: This proposal focuses on **LoRA-level selection** within a single base model, not **model-level selection** (e.g., choosing between GPT-4 vs Claude).
+
+**Rationale**:
+- vLLM Router operates at the **model-instance level**: Each router deployment manages instances of a single base model
+- **Model selection** (choosing different models) should happen at a higher layer (API Gateway, orchestration layer)
+- **LoRA selection** (choosing adapters for the same base model) is appropriate at the router level because:
+ - All instances share the same base model weights
+ - LoRA adapters are lightweight and can be dynamically loaded
+ - Intent classification can determine the best adapter for each query
+ - This maintains the router's focus on efficient request distribution
+
+**Example**:
+```yaml
+# Appropriate: LoRA selection within llama-3-70b
+base_model: llama-3-70b
+lora_adapters:
+  - name: code-lora # For programming queries
+  - name: medical-lora # For medical queries
+  - name: legal-lora # For legal queries
+
+# Not appropriate: Model selection across different models
+# This should be handled by upstream API Gateway
+models:
+  - gpt-4 # ❌ Different model
+  - claude-3 # ❌ Different model
+  - llama-3-70b # ❌ Different model
+```
+
+---
+
+## 3. Core Features
+
+### 3.1 Auto-Selection of LoRA Adapters
+
+**Problem**: Different tasks benefit from domain-specific fine-tuning, but manually specifying LoRA adapters is cumbersome.
+
+**Solution**: Automatic LoRA selection based on semantic intent classification.
+
+**How It Works**:
+1. The query is classified into a category (math, code, medical, legal, etc.)
+2. The configuration maps each category to a LoRA adapter
+3. The router automatically sets the `model` parameter to the matching LoRA name
+4. The vLLM instance loads and applies the selected adapter
+
+**Configuration Example**:
+```yaml
+# Define available LoRA adapters
+model_config:
+  "llama-3-70b":
+    loras:
+      - name: "code-lora"
+        description: "Optimized for programming tasks"
+      - name: "medical-lora"
+        description: "Specialized for medical queries"
+
+# Map intents to LoRA adapters
+decisions:
+  - name: "code_decision"
+    rules:
+      conditions:
+        - type: "domain"
+          name: "computer_science"
+    modelRefs:
+      - model: "llama-3-70b"
+        lora_name: "code-lora"
+        use_reasoning: true
+```
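
The selection steps and the configuration above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DOMAIN_TO_LORA` table, the `select_lora` function, and the `medical` domain label are hypothetical names, and a real router would build the mapping from the `decisions` section and take the domain from the intent classifier.

```python
from typing import Dict

# Hypothetical mapping derived from a `decisions` config like the one above:
# classified domain -> LoRA adapter name.
DOMAIN_TO_LORA: Dict[str, str] = {
    "computer_science": "code-lora",
    "medical": "medical-lora",
}


def select_lora(request: dict, domain: str, base_model: str = "llama-3-70b") -> dict:
    """Return a copy of the request whose `model` field targets the chosen adapter.

    Falls back to the base model when no adapter is mapped to the domain.
    """
    routed = dict(request)  # shallow copy so the caller's request is untouched
    routed["model"] = DOMAIN_TO_LORA.get(domain, base_model)
    return routed
```

Because only the `model` field changes, the client-facing request shape stays the same, which is what keeps the selection transparent to clients.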
+
+**Benefits**:
+- **Improved Accuracy**: Domain-specific adapters outperform base models
+- **Transparent to Clients**: No API changes required
+- **Cost Efficient**: Share base model weights across adapters
+
+### 3.2 Cross-Instance Response API
+
+**Problem**: The vLLM Python API Server stores response state locally, so other instances cannot access it during multi-turn conversations.
+
+**Solution**: Centralized response storage using Redis.
+
+**How It Works**:
+1. Each response is assigned a unique ID
+2. Response content and metadata are stored in Redis
+3. Subsequent requests can reference previous responses by ID
+4. Any vLLM instance can retrieve responses from any other instance
+
+**Use Cases**:
+- **Multi-turn conversations**: Continue conversation on different instance
+- **Response continuation**: Generate more content from previous response
+- **A/B testing**: Compare responses from different instances
+- **Debugging**: Inspect responses across the cluster
+
+**API Example**:
+```bash
+# First request
+curl -X POST http://router:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "llama-3-70b", "messages": [...]}'
+# Response: {"id": "resp_abc123", "choices": [...]}
+
+# Continue conversation on any instance
+curl -X POST http://router:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "llama-3-70b", "previous_response_id": "resp_abc123", ...}'
+```
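
To make the ID-chaining idea concrete, here is a minimal Python sketch of a response store. It is an assumption-laden stand-in: the `ResponseStore` class and its method names are hypothetical, and a plain dictionary replaces Redis so the example stays self-contained.

```python
import uuid
from typing import Dict, List, Optional


class ResponseStore:
    """In-memory stand-in for the shared Redis store (illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def save(self, messages: List[dict], previous_id: Optional[str] = None) -> str:
        """Store one conversation turn and return its new response ID."""
        response_id = f"resp_{uuid.uuid4().hex[:12]}"
        self._data[response_id] = {"messages": messages, "previous": previous_id}
        return response_id

    def history(self, response_id: str) -> List[dict]:
        """Rebuild the conversation by walking the chain of previous IDs."""
        chain: List[List[dict]] = []
        current: Optional[str] = response_id
        while current is not None:
            record = self._data[current]
            chain.append(record["messages"])
            current = record["previous"]
        # The chain was collected newest-first, so reverse it.
        return [msg for turn in reversed(chain) for msg in turn]
```

Since every lookup goes through the shared store rather than instance-local memory, any instance handling a follow-up request could rebuild the same history from `previous_response_id`.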
+
+### 3.3 Cross-Instance Semantic Cache
+
+**Problem**: Traditional prefix caching is instance-local and requires exact matches.
+
+**Solution**: Semantic cache using vector similarity across all instances.
+
+**How It Works**:
+1. The query is embedded using a lightweight model (BERT, Qwen3-Embedding, etc.)
+2. The embedding is compared against cached entries using cosine similarity
+3. If the similarity exceeds the threshold (e.g., 0.92), the cached response is returned
+4. Cache is shared across all instances via Redis/Milvus
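
The lookup steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: the `SemanticCache` class is hypothetical, embeddings are passed in as plain float lists rather than computed by a model, and a linear scan over an in-memory list stands in for a Redis/Milvus vector search.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy cache: returns the closest stored response above the threshold."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for cached, response in self._entries:
            score = cosine_similarity(embedding, cached)
            if score > best_score:
                best_score, best_response = score, response
        # A hit requires the best match to exceed the threshold.
        return best_response if best_score > self.threshold else None
```

Raising the threshold trades hit rate for safety, which is why a stricter value such as 0.95 is a natural choice for sensitive categories.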
+
+**Benefits**:
+- **Higher Hit Rate**: Semantic matching vs exact matching
+- **Cross-Instance**: Any instance can benefit from any other's cache
+- **Configurable**: Per-category similarity thresholds
+- **Cost Savings**: 48.5% token reduction (from research paper)
+
+**Configuration Example**:
+```yaml
+semantic_cache:
+  enabled: true
+  backend_type: "hybrid" # memory + milvus
+  similarity_threshold: 0.92
+  embedding_model: "qwen3" # High quality, 1024-dim
+
+# Per-category overrides
+decisions:
+  - name: "medical_decision"
+    plugins:
+      - type: "semantic-cache"
+        configuration:
+          enabled: true
+          similarity_threshold: 0.95 # Higher threshold for medical
+```
+
+### 3.4 Security Plugins
+
+#### 3.4.1 PII Detection
+
+**Purpose**: Prevent sensitive information from being sent to the LLM.
+
+**Implementation**: ModernBERT-based token classification trained on the Microsoft Presidio dataset (~50K examples).
+
+**Detected Types**: Email, phone, SSN, credit card, address, name, etc.
+
+**Configuration**:
+```yaml
+plugins:
+  - type: "pii"
+    configuration:
+      enabled: true
+      pii_types_allowed: [] # Block all PII types
+```
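
One way to read the `pii_types_allowed` setting is as an allow-list applied to whatever entity types the classifier reports. The sketch below assumes that interpretation; the function name `blocked_pii_types` and the entity labels used with it are hypothetical, and the detection step itself is out of scope here.

```python
from typing import Iterable, List


def blocked_pii_types(detected: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Return the detected PII types that the policy does not allow, sorted.

    An empty `allowed` list means every detected type is a violation,
    matching the "block all PII types" comment in the config above.
    """
    allowed_set = set(allowed)
    return sorted({pii_type for pii_type in detected if pii_type not in allowed_set})
```

A request would then be rejected whenever the returned list is non-empty, for example when `blocked_pii_types(["EMAIL", "SSN"], allowed=[])` reports both types.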
+
+#### 3.4.2 Jailbreak Detection
+
+**Purpose**: Block malicious prompts attempting to bypass safety guidelines.
+
+**Implementation**: ModernBERT classifier trained on jailbreak benchmark datasets.
+
+**Configuration**:
+```yaml
+prompt_guard:
+  enabled: true
+  threshold: 0.7
+  model_id: "models/jailbreak_classifier_modernbert"
+```
+
+#### 3.4.3 Hallucination Detection
+
+**Purpose**: Detect unreliable or fabricated content in LLM responses.
+
+**Implementation**: Post-processing check using specialized models (e.g., HaluGate).
+
+**Use Cases**: High-stakes domains (medical, legal, financial)
+
+### 3.5 Domain-Aware System Prompts
+
+**Purpose**: Automatically inject specialized system prompts based on query classification.
+
+**Example**:
+```yaml
+decisions:
+  - name: "medical_decision"
+    plugins:
+      - type: "system_prompt"
+        configuration:
+          system_prompt: "You are a medical expert... [specialized instructions]"
+```
+
+**Benefits**:
+- No manual prompt engineering per request
+- Consistent behavior across similar queries
+- Easy to update prompts centrally
+
+### 3.6 Tool Selection
+
+**Purpose**: Automatically select relevant tools based on query intent, reducing prompt tokens and improving accuracy.
+
+**Configuration**:
+```yaml
+tools:
+  enabled: true
+  top_k: 3
+  similarity_threshold: 0.2
+  tools_db_path: "config/tools_db.json"
+```
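
A plausible reading of `top_k` and `similarity_threshold` is: drop tools whose similarity to the query falls below the threshold, then keep the highest-scoring `top_k`. The Python sketch below assumes that behavior; `select_tools` and the tool names are hypothetical, and the similarity scores are given directly instead of being computed from embeddings.

```python
from typing import Dict, List


def select_tools(
    scores: Dict[str, float],
    top_k: int = 3,
    similarity_threshold: float = 0.2,
) -> List[str]:
    """Pick up to `top_k` tool names whose score meets the threshold.

    `scores` maps tool name -> similarity between the query and the tool
    description. Results are ordered from most to least similar.
    """
    eligible = [(name, score) for name, score in scores.items() if score >= similarity_threshold]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in eligible[:top_k]]
```

Only the selected tool definitions would be attached to the prompt, which is where the token savings come from.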
+
+### 4.4 Plugin Architecture
+
+**Design Principles**:
+
+- **Composable**: Plugins can be enabled/disabled independently
+- **Configurable**: Each plugin has its own configuration
+- **Ordered**: Plugins execute in defined order (pre-processing → routing → post-processing)
+- **Extensible**: Easy to add new plugins
+
+**Plugin Execution Flow**:
+
+```mermaid
+graph LR
+ Request[Request] --> Pre[Pre-Processing]
+
+ subgraph Pre["Pre-Processing Plugins"]
+ P1[PII Detection]
+ P2[Jailbreak Detection]
+ P3[Semantic Cache Check]
+ P4[Tool Selection]
+ end
+
+ Pre --> Route[Routing]
+
+ subgraph Route["Routing Plugins"]
+ R1[Intent Classification]
+ R2[LoRA Selection]
+ R3[System Prompt Injection]
+ end
+
+ Route --> LLM[vLLM Instance]
+ LLM --> Post[Post-Processing]
+
+ subgraph Post["Post-Processing Plugins"]
+ PP1[Hallucination Detection]
+ PP2[Response API Storage]
+ PP3[Semantic Cache Update]
+ end
+
+ Post --> Response[Response]
+
+ style Pre fill:#fff3e0
+ style Route fill:#e1f5ff
+ style Post fill:#f3e5f5
+```
+
+**Plugin Types**:
+
+1. **Pre-Processing Plugins**: Execute before routing
+ - PII Detection
+ - Jailbreak Detection
+ - Semantic Cache Check
+ - Tool Selection
+
+2. **Routing Plugins**: Influence routing decisions
+ - Intent Classification
+ - LoRA Selection
+ - Domain-Aware System Prompts
+
+3. **Post-Processing Plugins**: Execute after LLM response
+ - Hallucination Detection
+ - Response API Storage
+ - Semantic Cache Update
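
The ordered, composable execution model described above could look roughly like the sketch below. All names (`Context`, `run_stage`, `handle`) are hypothetical, and the sketch simplifies one point: a plugin can only stop the pipeline, whereas a real semantic cache hit would also return a stored response early.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Context:
    """State passed through every plugin for a single request."""
    prompt: str
    response: Optional[str] = None
    trace: List[str] = field(default_factory=list)


# A plugin inspects or mutates the context and returns False to stop processing.
Plugin = Callable[[Context], bool]


def run_stage(plugins: List[Plugin], ctx: Context) -> bool:
    """Run plugins in order; stop at the first one that returns False."""
    for plugin in plugins:
        ctx.trace.append(plugin.__name__)
        if not plugin(ctx):
            return False
    return True


def handle(
    ctx: Context,
    pre: List[Plugin],
    routing: List[Plugin],
    post: List[Plugin],
    call_llm: Callable[[str], str],
) -> Context:
    """Pre-processing -> routing -> LLM call -> post-processing."""
    if run_stage(pre, ctx) and run_stage(routing, ctx):
        ctx.response = call_llm(ctx.prompt)
        run_stage(post, ctx)
    return ctx
```

Keeping each stage as a plain list makes it easy to enable, disable, or reorder plugins from configuration without touching the pipeline itself.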
+
+---
+
+## 5. Benefits and Impact
+
+### 5.1 Intelligent Routing: Signal-Based Multi-LoRA Selection
+
+The semantic router provides multiple routing strategies for automatically selecting the optimal LoRA adapter based on query characteristics:
+
+**Keyword-Based Routing**:
+- Simple pattern matching for explicit indicators
+- Fast and deterministic routing decisions
+- Example: Route queries containing "SQL" or "database" to database-specialized LoRA
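
As a small illustration of keyword-based routing, the sketch below matches whole words in the query against a rule table. The `KEYWORD_RULES` table, the `route_by_keyword` function, and the adapter name `database-lora` are hypothetical and only mirror the "SQL"/"database" example above.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule table: adapter name -> lowercase trigger keywords.
KEYWORD_RULES: Dict[str, List[str]] = {
    "database-lora": ["sql", "database"],
}


def route_by_keyword(query: str, rules: Dict[str, List[str]] = KEYWORD_RULES) -> Optional[str]:
    """Return the first adapter whose keywords appear as whole words in the query."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    for adapter, keywords in rules.items():
        if any(keyword in words for keyword in keywords):
            return adapter
    return None  # no keyword matched; fall through to other routing signals
```

Matching on whole words keeps the decision deterministic and avoids accidental substring hits, at the cost of missing inflected forms such as "databases".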
+
+**Domain Classification**:
+- ModernBERT-based intent classification
+- Categorizes queries into domains (math, code, medical, legal, etc.)
+- **10.2% accuracy improvement** on MMLU-Pro benchmark through domain-specific adapters
+- Automatic mapping from domain to optimal LoRA adapter
+
+**Embedding-Based Semantic Routing**:
+- Vector similarity matching for nuanced routing
+- Handles queries that don't fit clear keyword or domain patterns
+- Uses lightweight embedding models (BERT, Qwen3-Embedding)
+- Enables fine-grained routing based on semantic similarity
+
+**Fact-Checking and Verification Routing**:
+- Specialized routing for high-stakes queries requiring verification
+- Can route to fact-checking LoRA adapters or trigger additional validation
+- Critical for medical, legal, and financial domains
+
+**Benefits**:
+- **Improved Accuracy**: Domain-specific adapters outperform base models
+- **Transparent to Clients**: No API changes required
+- **Cost Efficient**: Share base model weights across adapters
+- **Flexible**: Multiple routing strategies for different use cases
+
+### 5.2 Cross-Instance Intelligence: Shared State and Optimization
+
+**Response API: Centralized Response Storage**:
+- All responses stored in Redis with unique IDs
+- Any vLLM instance can access responses from any other instance
+- Enables stateful multi-turn conversations across instances
+- Use cases:
+ - Continue conversations on different instances
+ - A/B testing across instances
+ - Response continuation and refinement
+ - Debugging and audit trails
+
+**Semantic Cache: Cross-Instance Vector Similarity**:
+- **48.5% token reduction** through intelligent caching
+- **47.1% latency reduction** for cache hits
+- Vector similarity matching (cosine similarity > 0.92)
+- Shared across all instances via Redis/Milvus
+- Benefits:
+ - Higher hit rate than exact prefix matching
+ - Cross-instance cache sharing
+ - Configurable per-category thresholds
+ - Significant cost savings
+
+**Performance Impact**:
+- **10x higher throughput** compared to Python API Server
+- Rust implementation eliminates GIL and interpreter overhead
+- Minimal overhead from semantic router layer (<5ms)
+- Efficient concurrent request handling
+
+### 5.3 Guardrails: Enterprise-Grade Security and Safety
+
+**PII Detection**:
+- ModernBERT-based token classification
+- Trained on Microsoft Presidio dataset (~50K examples)
+- Detects: Email, phone, SSN, credit card, address, name, etc.
+- Prevents sensitive information from being sent to the LLM
+- Configurable per-category policies
+
+**Jailbreak Prevention**:
+- ModernBERT classifier trained on jailbreak benchmarks
+- Blocks malicious prompts attempting to bypass safety guidelines
+- Configurable threshold for detection sensitivity
+- Real-time blocking before the request reaches the LLM
+
+**Hallucination Detection**:
+- Post-processing verification of LLM responses
+- Uses specialized models (e.g., HaluGate)
+- Critical for high-stakes domains (medical, legal, financial)
+- Optional per-category configuration
+
+**Enterprise Benefits**:
+- Compliance with data protection regulations
+- Reduced risk of security incidents
+- Audit logging and observability
+- Centralized security policies across all instances
+
+### 5.4 Operational Benefits
+
+**Simplified Deployment**:
+- Single binary replaces Python API Server
+- Unified configuration for routing and intelligence
+- Kubernetes-native with Helm charts
+
+**Cost Reduction**:
+- Fewer instances needed due to higher throughput
+- Reduced token costs from caching (48.5% reduction)
+- Lower infrastructure costs
+
+**Maintainability**:
+- Rust's type safety reduces bugs
+- Clear plugin architecture for extensions
+- Comprehensive observability (OpenTelemetry, Prometheus)
+
+## 6. Conclusion
+
+Integrating semantic router capabilities into vLLM Router represents a significant evolution in LLM serving infrastructure. By combining the performance of Rust with the intelligence of semantic routing, we can:
+
+1. **Replace Python API Server** with a high-performance alternative
+2. **Enable cross-instance capabilities** that were previously impossible
+3. **Improve accuracy** through intent-aware LoRA selection
+4. **Reduce costs** via semantic caching and token efficiency
+5. **Enhance security** with built-in PII and jailbreak detection
+
+This proposal builds on published research (NeurIPS 2025 MLForSys workshop) and production-tested code (vLLM Semantic Router project), providing a clear path to integrate these capabilities into the vLLM ecosystem.
+
+**Next Steps**:
+1. Community feedback on this proposal
+2. RFC process in vLLM project
+3. Implementation of Phase 1 (Q1 2026)
+4. Iterative deployment and optimization