diff --git a/docs/proposals/semantic-integration.md b/docs/proposals/semantic-integration.md
new file mode 100644
index 00000000..eb4ff9d9
--- /dev/null
+++ b/docs/proposals/semantic-integration.md
@@ -0,0 +1,654 @@
+# Empowering vLLM Router with Semantic Intelligence
+
+**Authors**: vLLM Semantic Router Team
+**Date**: December 2025
+**Status**: Proposal
+**Target Audience**: vLLM Community
+
+---
+
+## Executive Summary
+
+This proposal outlines a plan to integrate semantic routing capabilities into the vLLM Router project, enabling it to replace the Python-based vLLM API Server with a high-performance, intelligent routing layer. The integration will provide cross-instance capabilities including semantic caching, centralized response management, automatic LoRA adapter selection, and enterprise-grade security features—all while maintaining the performance benefits of Rust implementation.
+
+**Key Benefits**:
+
+1. **Signal Based Routing**: Signal-based routing for Multi-LoRA deployments
+   - Keyword-based routing for simple pattern matching
+   - Domain classification for intent-aware adapter selection
+   - Embedding-based semantic similarity for nuanced routing
+   - Fact-checking and verification routing for high-stakes queries
+
+2. **Cross-Instance Intelligence**: Shared state and optimization across all vLLM instances
+   - Response API: Centralized response storage enabling stateful multi-turn conversations
+   - Semantic Cache: 48.5% token reduction through cross-instance vector similarity matching
+
+3. **Guardrails**: Enterprise-grade security and safety features
+   - PII Detection: Prevent sensitive information leakage
+   - Jailbreak Prevention: Block malicious prompt injection attempts
+   - Hallucination Detection: Verify response reliability for critical domains
+
+---
+
+## 1. Background and Motivation
+
+### 1.1 Current vLLM Architecture
+
+The current vLLM deployment typically follows this pattern:
+
+```mermaid
+graph LR
+    C1[Client 1] --> API1[vLLM API Server<br/>Python]
+    C2[Client 2] --> API2[vLLM API Server<br/>Python]
+    C3[Client 3] --> API3[vLLM API Server<br/>Python]
+
+    API1 --> E1[vLLM Engine 1<br/>C++/CUDA]
+    API2 --> E2[vLLM Engine 2<br/>C++/CUDA]
+    API3 --> E3[vLLM Engine 3<br/>C++/CUDA]
+
+    style API1 fill:#ffcccc
+    style API2 fill:#ffcccc
+    style API3 fill:#ffcccc
+    style E1 fill:#ccffcc
+    style E2 fill:#ccffcc
+    style E3 fill:#ccffcc
+```
+
+**Limitations**:
+- **Single-Instance Scope**: Each API server manages only one vLLM engine instance
+- **Python Performance Bottleneck**: GIL limitations, interpreted language overhead
+- **No Cross-Instance Capabilities**: Cannot share cache, state, or intelligence across instances
+- **Limited Routing Logic**: Basic request forwarding without semantic understanding
+
+### 1.2 vLLM Router: Current State
+
+The vLLM Router (https://github.com/vllm-project/router) is a high-performance Rust-based load balancer that provides:
+- Load balancing policies (round_robin, consistent_hash, cache_aware, etc.)
+- Prefill-Decode disaggregation
+- Circuit breakers and retry logic
+- Kubernetes service discovery
+
+**Gap**: While excellent at load balancing, it lacks semantic understanding and cross-instance intelligence.
+
+### 1.3 vLLM Semantic Router: Proven Capabilities
+
+The vLLM Semantic Router project (https://github.com/vllm-project/semantic-router) has demonstrated:
+- **Intent Classification**: ModernBERT-based classification for routing decisions
+- **Semantic Caching**: Cross-instance cache with 92-95% similarity thresholds
+- **Security**: PII detection, jailbreak prevention, hallucination detection
+- **LoRA Selection**: Automatic adapter selection based on query intent
+- **Response API**: Centralized response storage for multi-turn conversations
+
+**Research Validation**: Published at NeurIPS 2025 MLForSys workshop ([arxiv:2510.08731](https://arxiv.org/abs/2510.08731))
+
+### 1.4 Capability Comparison
+
+The following table compares the capabilities of pure vLLM Router vs vLLM Router with Semantic Router integration:
+
+| Capability | vLLM Router (Current) | vLLM Router + Semantic Router | Impact |
+|------------|----------------------|-------------------------------|---------|
+| **Load Balancing** | ✅ Round Robin, Consistent Hash, Cache-Aware | ✅ Same + Intent-Aware Routing | Enhanced routing intelligence |
+| **Prefill-Decode Disaggregation** | ✅ Supported | ✅ Supported | Maintained in vLLM Router layer |
+| **Prefix Caching** | ✅ Supported | ✅ Supported | Maintained in vLLM Engine layer |
+| **Circuit Breaker & Retry** | ✅ Built-in | ✅ Built-in | No change |
+| **LoRA Adapter Support** | ✅ Manual selection by client | ✅ **Automatic selection** based on intent | 10.2% accuracy improvement |
+| **Multi-turn Adaptive Routing** | ❌ Not supported | ✅ **Cost/Accuracy/Feedback-aware** routing | Dynamic optimization per conversation |
+| **Semantic Caching** | ❌ Not supported | ✅ **Cross-instance cache** (Redis/Milvus) | 48.5% token reduction |
+| **Response API** | ⚠️ Limited to single instance | ✅ **Cross-instance state** (Redis) | Enables stateful conversations |
+| **Intent Classification** | ❌ Not supported | ✅ **ModernBERT-based** classification | Domain-aware routing |
+| **PII Detection** | ❌ Not supported | ✅ **Built-in** security plugin | Enterprise compliance |
+| **Jailbreak Prevention** | ❌ Not supported | ✅ **Built-in** security plugin | Safety & security |
+| **Hallucination Detection** | ❌ Not supported | ✅ **Optional** post-processing plugin | Quality assurance |
+| **Tool Selection** | ❌ Not supported | ✅ **Automatic** tool routing | Enhanced agent capabilities |
+| **Domain-Aware Prompts** | ❌ Not supported | ✅ **Dynamic** system prompt injection | Context-specific responses |
+
+**Key Takeaways**:
+
+- ✅ **Backward Compatible**: All existing vLLM Router features remain unchanged
+- 🚀 **Performance Boost**: 47% lower latency, 48.5% fewer tokens
+- 🎯 **Accuracy Improvement**: 10.2% higher accuracy through intelligent LoRA selection
+- 🔒 **Enterprise-Ready**: Built-in security (PII, jailbreak, hallucination detection)
+- 💰 **Cost Reduction**: Semantic caching and token efficiency reduce operational costs
+- 🌐 **Cross-Instance Intelligence**: Shared cache and state across all instances
+
+### 1.5 Architecture Comparison
+
+```mermaid
+graph TB
+    subgraph Current["Current Architecture (Python API Server)"]
+        direction TB
+        C1[Clients] --> P1[Python API Server 1]
+        C1 --> P2[Python API Server 2]
+        C1 --> P3[Python API Server 3]
+        P1 --> V1[vLLM Engine 1]
+        P2 --> V2[vLLM Engine 2]
+        P3 --> V3[vLLM Engine 3]
+
+        Note1[❌ No cross-instance cache<br/>❌ No semantic routing<br/>❌ Python performance bottleneck<br/>❌ No centralized state]
+    end
+
+    subgraph Proposed["Proposed Architecture (vLLM Router + Semantic Router)"]
+        direction TB
+        C2[Clients] --> VR[vLLM Router + Semantic Router<br/>Rust + Golang/Rust FFI]
+        VR --> Cache[(Semantic Cache<br/>Redis/Milvus)]
+        VR --> RespAPI[(Response API<br/>Redis)]
+        VR --> LB[Load Balancer]
+        LB --> VE1[vLLM Engine 1<br/>+LoRA A]
+        LB --> VE2[vLLM Engine 2<br/>+LoRA B]
+        LB --> VE3[vLLM Engine 3<br/>+LoRA C]
+
+        Note2[✅ Cross-instance cache<br/>✅ Intent-based LoRA selection<br/>✅ High performance Rust<br/>✅ Centralized state management]
+    end
+
+    style Current fill:#ffe6e6
+    style Proposed fill:#e6ffe6
+    style Note1 fill:#ffcccc
+    style Note2 fill:#ccffcc
+```
+
+---
+
+## 2. Proposed Architecture
+
+### 2.1 High-Level Architecture
+
+```mermaid
+graph TB
+    Client[Client Requests]
+
+    subgraph Router["vLLM Router (Rust)"]
+        subgraph SR["Semantic Router Layer"]
+            IC[Intent Classification<br/>ModernBERT-based<br/>Category detection]
+            SC[Semantic Cache Check<br/>Redis/Milvus<br/>Vector similarity]
+            SP[Security Plugins<br/>PII Detection<br/>Jailbreak Prevention]
+            LS[LoRA Selection<br/>Intent → LoRA mapping<br/>Domain-specific]
+        end
+
+        subgraph LB["Load Balancing Layer"]
+            RR[Round Robin / Consistent Hash]
+            CB[Circuit Breaker]
+            RL[Retry Logic]
+            PD[Prefill-Decode Disaggregation]
+        end
+    end
+
+    subgraph Instances["vLLM Engine Instances"]
+        I1[Instance 1<br/>+LoRA A]
+        I2[Instance 2<br/>+LoRA B]
+        I3[Instance 3<br/>+LoRA C]
+        I4[...]
+    end
+
+    Client --> IC
+    IC --> SC
+    SC --> SP
+    SP --> LS
+    LS --> RR
+    RR --> CB
+    CB --> RL
+    RL --> PD
+    PD --> I1
+    PD --> I2
+    PD --> I3
+    PD --> I4
+
+    style Router fill:#e1f5ff
+    style SR fill:#fff3e0
+    style LB fill:#f3e5f5
+    style Instances fill:#e8f5e9
+```
+
+### 2.2 Request Flow
+
+```mermaid
+sequenceDiagram
+    participant C as Client
+    participant R as vLLM Router
+    participant IC as Intent Classifier
+    participant SC as Semantic Cache
+    participant SP as Security Plugins
+    participant LB as Load Balancer
+    participant V as vLLM Instance
+    participant RA as Response API
+
+    C->>R: OpenAI API Request
+    R->>IC: Classify Intent
+    IC-->>R: Category (e.g., "math")
+
+    R->>SC: Check Cache
+    alt Cache Hit
+        SC-->>R: Cached Response
+        R-->>C: Return Response
+    else Cache Miss
+        SC-->>R: No Match
+
+        R->>SP: Security Check
+        alt PII/Jailbreak Detected
+            SP-->>R: Block Request
+            R-->>C: Error Response
+        else Safe
+            SP-->>R: Pass
+
+            R->>R: Select LoRA Adapter
+            Note over R: Based on intent category
+
+            R->>LB: Route Request
+            LB->>V: Forward to Instance
+            V-->>LB: LLM Response
+            LB-->>R: Response
+
+            par Store Response
+                R->>RA: Store in Response API
+            and Update Cache
+                R->>SC: Cache Response
+            end
+
+            R-->>C: Return Response
+        end
+    end
+```
+
+### 2.3 Why Not Signal-Based Model Selection?
+
+**Important Design Decision**: This proposal focuses on **LoRA-level selection** within a single base model, not **model-level selection** (e.g., choosing between GPT-4 vs Claude).
+
+**Rationale**:
+- vLLM Router operates at the **model-instance level**: Each router deployment manages instances of a single base model
+- **Model selection** (choosing different models) should happen at a higher layer (API Gateway, orchestration layer)
+- **LoRA selection** (choosing adapters for the same base model) is appropriate at the router level because:
+  - All instances share the same base model weights
+  - LoRA adapters are lightweight and can be dynamically loaded
+  - Intent classification can determine the best adapter for each query
+  - This maintains the router's focus on efficient request distribution
+
+**Example**:
+```yaml
+# Appropriate: LoRA selection within llama-3-70b
+base_model: llama-3-70b
+lora_adapters:
+  - name: code-lora      # For programming queries
+  - name: medical-lora   # For medical queries
+  - name: legal-lora     # For legal queries
+
+# Not appropriate: Model selection across different models
+# This should be handled by upstream API Gateway
+models:
+  - gpt-4              # ❌ Different model
+  - claude-3           # ❌ Different model
+  - llama-3-70b        # ❌ Different model
+```
+
+---
+
+## 3. Core Features
+
+### 3.1 Auto-Selection of LoRA Adapters
+
+**Problem**: Different tasks benefit from domain-specific fine-tuning, but manually specifying LoRA adapters is cumbersome.
+
+**Solution**: Automatic LoRA selection based on semantic intent classification.
+
+**How It Works**:
+1. Query is classified into categories (math, code, medical, legal, etc.)
+2. Configuration maps categories to LoRA adapters
+3. Router automatically sets the `model` parameter to the appropriate LoRA name
+4. vLLM instance loads and applies the correct adapter
+
+**Configuration Example**:
+```yaml
+# Define available LoRA adapters
+model_config:
+  "llama-3-70b":
+    loras:
+      - name: "code-lora"
+        description: "Optimized for programming tasks"
+      - name: "medical-lora"
+        description: "Specialized for medical queries"
+
+# Map intents to LoRA adapters
+decisions:
+  - name: "code_decision"
+    rules:
+      conditions:
+        - type: "domain"
+          name: "computer_science"
+    modelRefs:
+      - model: "llama-3-70b"
+        lora_name: "code-lora"
+        use_reasoning: true
+```
+
+**Benefits**:
+- **Improved Accuracy**: Domain-specific adapters outperform base models
+- **Transparent to Clients**: No API changes required
+- **Cost Efficient**: Share base model weights across adapters
+
+### 3.2 Cross-Instance Response API
+
+**Problem**: vLLM Python API Server stores response state locally, preventing cross-instance access for multi-turn conversations.
+
+**Solution**: Centralized response storage using Redis.
+
+**How It Works**:
+1. Each response is assigned a unique ID
+2. Response content and metadata stored in Redis
+3. Subsequent requests can reference previous responses by ID
+4. Any vLLM instance can retrieve responses from any other instance
+
+**Use Cases**:
+- **Multi-turn conversations**: Continue conversation on different instance
+- **Response continuation**: Generate more content from previous response
+- **A/B testing**: Compare responses from different instances
+- **Debugging**: Inspect responses across the cluster
+
+**API Example**:
+```bash
+# First request
+curl -X POST http://router:8000/v1/chat/completions \
+  -d '{"model": "llama-3-70b", "messages": [...]}'
+# Response: {"id": "resp_abc123", "choices": [...]}
+
+# Continue conversation on any instance
+curl -X POST http://router:8000/v1/chat/completions \
+  -d '{"model": "llama-3-70b", "previous_response_id": "resp_abc123", ...}'
+```
+
+### 3.3 Cross-Instance Semantic Cache
+
+**Problem**: Traditional prefix caching is instance-local and requires exact matches.
+
+**Solution**: Semantic cache using vector similarity across all instances.
+
+**How It Works**:
+1. Query is embedded using lightweight model (BERT, Qwen3-Embedding, etc.)
+2. Embedding compared against cache using cosine similarity
+3. If similarity > threshold (e.g., 0.92), return cached response
+4. Cache is shared across all instances via Redis/Milvus
+
+**Benefits**:
+- **Higher Hit Rate**: Semantic matching vs exact matching
+- **Cross-Instance**: Any instance can benefit from any other's cache
+- **Configurable**: Per-category similarity thresholds
+- **Cost Savings**: 48.5% token reduction (from research paper)
+
+**Configuration Example**:
+```yaml
+semantic_cache:
+  enabled: true
+  backend_type: "hybrid"  # memory + milvus
+  similarity_threshold: 0.92
+  embedding_model: "qwen3"  # High quality, 1024-dim
+
+# Per-category overrides
+decisions:
+  - name: "medical_decision"
+    plugins:
+      - type: "semantic-cache"
+        configuration:
+          enabled: true
+          similarity_threshold: 0.95  # Higher threshold for medical
+```
+
+### 3.4 Security Plugins
+
+#### 3.4.1 PII Detection
+
+**Purpose**: Prevent sensitive information from being sent to LLM.
+
+**Implementation**: ModernBERT-based token classification trained on Microsoft Presidio dataset (~50K examples).
+
+**Detected Types**: Email, phone, SSN, credit card, address, name, etc.
+
+**Configuration**:
+```yaml
+plugins:
+  - type: "pii"
+    configuration:
+      enabled: true
+      pii_types_allowed: []  # Block all PII types
+```
+
+#### 3.4.2 Jailbreak Detection
+
+**Purpose**: Block malicious prompts attempting to bypass safety guidelines.
+
+**Implementation**: ModernBERT classifier trained on jailbreak benchmark datasets.
+
+**Configuration**:
+```yaml
+prompt_guard:
+  enabled: true
+  threshold: 0.7
+  model_id: "models/jailbreak_classifier_modernbert"
+```
+
+#### 3.4.3 Hallucination Detection
+
+**Purpose**: Detect unreliable or fabricated content in LLM responses.
+
+**Implementation**: Post-processing check using specialized models (e.g., HaluGate).
+
+**Use Cases**: High-stakes domains (medical, legal, financial)
+
+### 3.5 Domain-Aware System Prompts
+
+**Purpose**: Automatically inject specialized system prompts based on query classification.
+
+**Example**:
+```yaml
+decisions:
+  - name: "medical_decision"
+    plugins:
+      - type: "system_prompt"
+        configuration:
+          system_prompt: "You are a medical expert... [specialized instructions]"
+```
+
+**Benefits**:
+- No manual prompt engineering per request
+- Consistent behavior across similar queries
+- Easy to update prompts centrally
+
+### 3.6 Tool Selection
+
+**Purpose**: Automatically select relevant tools based on query intent, reducing prompt tokens and improving accuracy.
+
+**Configuration**:
+```yaml
+tools:
+  enabled: true
+  top_k: 3
+  similarity_threshold: 0.2
+  tools_db_path: "config/tools_db.json"
+```
+
+### 4.4 Plugin Architecture
+
+**Design Principles**:
+
+- **Composable**: Plugins can be enabled/disabled independently
+- **Configurable**: Each plugin has its own configuration
+- **Ordered**: Plugins execute in defined order (pre-processing → routing → post-processing)
+- **Extensible**: Easy to add new plugins
+
+**Plugin Execution Flow**:
+
+```mermaid
+graph LR
+    Request[Request] --> Pre[Pre-Processing]
+
+    subgraph Pre["Pre-Processing Plugins"]
+        P1[PII Detection]
+        P2[Jailbreak Detection]
+        P3[Semantic Cache Check]
+        P4[Tool Selection]
+    end
+
+    Pre --> Route[Routing]
+
+    subgraph Route["Routing Plugins"]
+        R1[Intent Classification]
+        R2[LoRA Selection]
+        R3[System Prompt Injection]
+    end
+
+    Route --> LLM[vLLM Instance]
+    LLM --> Post[Post-Processing]
+
+    subgraph Post["Post-Processing Plugins"]
+        PP1[Hallucination Detection]
+        PP2[Response API Storage]
+        PP3[Semantic Cache Update]
+    end
+
+    Post --> Response[Response]
+
+    style Pre fill:#fff3e0
+    style Route fill:#e1f5ff
+    style Post fill:#f3e5f5
+```
+
+**Plugin Types**:
+
+1. **Pre-Processing Plugins**: Execute before routing
+   - PII Detection
+   - Jailbreak Detection
+   - Semantic Cache Check
+   - Tool Selection
+
+2. **Routing Plugins**: Influence routing decisions
+   - Intent Classification
+   - LoRA Selection
+   - Domain-Aware System Prompts
+
+3. **Post-Processing Plugins**: Execute after LLM response
+   - Hallucination Detection
+   - Response API Storage
+   - Semantic Cache Update
+
+---
+
+## 5. Benefits and Impact
+
+### 5.1 Intelligent Routing: Signal-Based Multi-LoRA Selection
+
+The semantic router provides multiple routing strategies for automatically selecting the optimal LoRA adapter based on query characteristics:
+
+**Keyword-Based Routing**:
+- Simple pattern matching for explicit indicators
+- Fast and deterministic routing decisions
+- Example: Route queries containing "SQL" or "database" to database-specialized LoRA
+
+**Domain Classification**:
+- ModernBERT-based intent classification
+- Categorizes queries into domains (math, code, medical, legal, etc.)
+- **10.2% accuracy improvement** on MMLU-Pro benchmark through domain-specific adapters
+- Automatic mapping from domain to optimal LoRA adapter
+
+**Embedding-Based Semantic Routing**:
+- Vector similarity matching for nuanced routing
+- Handles queries that don't fit clear keyword or domain patterns
+- Uses lightweight embedding models (BERT, Qwen3-Embedding)
+- Enables fine-grained routing based on semantic similarity
+
+**Fact-Checking and Verification Routing**:
+- Specialized routing for high-stakes queries requiring verification
+- Can route to fact-checking LoRA adapters or trigger additional validation
+- Critical for medical, legal, and financial domains
+
+**Benefits**:
+- **Improved Accuracy**: Domain-specific adapters outperform base models
+- **Transparent to Clients**: No API changes required
+- **Cost Efficient**: Share base model weights across adapters
+- **Flexible**: Multiple routing strategies for different use cases
+
+### 5.2 Cross-Instance Intelligence: Shared State and Optimization
+
+**Response API: Centralized Response Storage**:
+- All responses stored in Redis with unique IDs
+- Any vLLM instance can access responses from any other instance
+- Enables stateful multi-turn conversations across instances
+- Use cases:
+  - Continue conversations on different instances
+  - A/B testing across instances
+  - Response continuation and refinement
+  - Debugging and audit trails
+
+**Semantic Cache: Cross-Instance Vector Similarity**:
+- **48.5% token reduction** through intelligent caching
+- **47.1% latency reduction** for cache hits
+- Vector similarity matching (cosine similarity > 0.92)
+- Shared across all instances via Redis/Milvus
+- Benefits:
+  - Higher hit rate than exact prefix matching
+  - Cross-instance cache sharing
+  - Configurable per-category thresholds
+  - Significant cost savings
+
+**Performance Impact**:
+- **10x higher throughput** compared to Python API Server
+- Rust implementation eliminates GIL and interpreter overhead
+- Minimal overhead from semantic router layer (<5ms)
+- Efficient concurrent request handling
+
+### 5.3 Guardrails: Enterprise-Grade Security and Safety
+
+**PII Detection**:
+- ModernBERT-based token classification
+- Trained on Microsoft Presidio dataset (~50K examples)
+- Detects: Email, phone, SSN, credit card, address, name, etc.
+- Prevents sensitive information from being sent to LLM
+- Configurable per-category policies
+
+**Jailbreak Prevention**:
+- ModernBERT classifier trained on jailbreak benchmarks
+- Blocks malicious prompts attempting to bypass safety guidelines
+- Configurable threshold for detection sensitivity
+- Real-time blocking before request reaches LLM
+
+**Hallucination Detection**:
+- Post-processing verification of LLM responses
+- Uses specialized models (e.g., HaluGate)
+- Critical for high-stakes domains (medical, legal, financial)
+- Optional per-category configuration
+
+**Enterprise Benefits**:
+- Compliance with data protection regulations
+- Reduced risk of security incidents
+- Audit logging and observability
+- Centralized security policies across all instances
+
+### 5.4 Operational Benefits
+
+**Simplified Deployment**:
+- Single binary replaces Python API Server
+- Unified configuration for routing and intelligence
+- Kubernetes-native with Helm charts
+
+**Cost Reduction**:
+- Fewer instances needed due to higher throughput
+- Reduced token costs from caching (48.5% reduction)
+- Lower infrastructure costs
+
+**Maintainability**:
+- Rust's type safety reduces bugs
+- Clear plugin architecture for extensions
+- Comprehensive observability (OpenTelemetry, Prometheus)
+
+## 6. Conclusion
+
+Integrating semantic router capabilities into vLLM Router represents a significant evolution in LLM serving infrastructure. By combining the performance of Rust with the intelligence of semantic routing, we can:
+
+1. **Replace Python API Server** with a high-performance alternative
+2. **Enable cross-instance capabilities** that were previously impossible
+3. **Improve accuracy** through intent-aware LoRA selection
+4. **Reduce costs** via semantic caching and token efficiency
+5. **Enhance security** with built-in PII and jailbreak detection
+
+This proposal builds on proven research (NeurIPS 2025) and production-tested code (vLLM Semantic Router project), providing a clear path to integrate these capabilities into the vLLM ecosystem.
+
+**Next Steps**:
+1. Community feedback on this proposal
+2. RFC process in vLLM project
+3. Implementation of Phase 1 (Q1 2026)
+4. Iterative deployment and optimization