This repository contains the Membrane system: a distributed, content-addressed, reconstruction-driven memory fabric for LLM inference, built on top of the analytical throughput model from:
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang
arXiv:2604.15039v2
The paper reproduction covers:

- Analytical throughput model (Section 3.4.1, Equations 1–6)
  - Per-instance KV throughput `Phi_kv(l)`
  - Stage throughputs `Theta_membrane`, `Theta_pd-p`, `Theta_pd-d`
  - End-to-end system throughput `Lambda_max`
- Throughput-optimal configuration (Section 3.4.2)
  - Grid search over routing threshold `t` and PD prefill/decode split
  - Optimality conditions from Equations 7–8
- Dual-timescale scheduling (Section 3.4.3)
  - Short-term: bandwidth- and cache-aware routing with congestion response
  - Long-term: traffic-driven reallocation and re-optimization
- Workload generator (Section 4.1)
  - Truncated log-normal request lengths (`mu = 9.90`, `sigma = 1.00`, truncated to `[128, 128K]`)
  - Fixed output length of 1024 tokens
- Case-study baselines (Section 4)
  - Membrane-PD (selective offloading)
  - Homogeneous PD
  - Naive Heterogeneous PD
- Evaluation metrics
  - `Lambda_max` (sustainable throughput)
  - Mean and P90 TTFT
  - Cross-datacenter bandwidth utilization
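The workload distribution above is fully specified, so it can be sketched stand-alone. The helper below is hypothetical (the repo's real generator lives in `membrane/model/workload.py`); it uses rejection sampling against the stated truncation bounds, with only the stdlib:

```python
import random

def sample_request_lengths(n, mu=9.90, sigma=1.00, lo=128, hi=128 * 1024, seed=0):
    """Draw n input lengths from a log-normal truncated to [lo, hi] tokens
    via rejection sampling. Hypothetical helper, not the repo's API."""
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = rng.lognormvariate(mu, sigma)
        if lo <= length <= hi:
            lengths.append(int(length))
    return lengths

lengths = sample_request_lengths(10_000)
# The untruncated distribution's median is exp(mu) ≈ 19.9K tokens,
# so a routing threshold near 19–20K splits traffic roughly in half.
```

Truncation barely affects the lower tail (`ln(128) ≈ 4.85` is five sigmas below `mu`), so rejection sampling accepts the vast majority of draws.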
Beyond the paper reproduction, this codebase implements:
- Fragment data model: Immutable, content-addressed KV segments with structural signatures.
- Four in-memory indices: Exact, Semantic, Positional, and Co-access.
- Graph layer: Fragment relationship graph with weighted edges.
- Reconstruction engine: Reads fragments via `rebuild_context()` with fallback to prefill.
- Global directory: Multi-node fragment location resolution.
- Transfer plane: Delta-sync and chunked, resumable fragment transfer.
- Multi-tenant deduplication: Canonical store with tenant isolation.
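The fragment data model can be pictured with a small illustrative dataclass. All names here are hypothetical stand-ins for the real `fragment.py` and `structural_signature.py` APIs; the point is only that identity derives from the payload plus its structural signature, so identical segments dedupe naturally:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class FragmentSketch:
    """Illustrative stand-in for a Membrane fragment: an immutable KV
    segment addressed by a hash of its payload and structural signature
    (model, layer range, token span). Not the repo's actual class."""
    payload: bytes
    model: str
    layer_range: tuple
    token_span: tuple

    @property
    def content_id(self) -> str:
        sig = f"{self.model}:{self.layer_range}:{self.token_span}".encode()
        return hashlib.sha256(sig + self.payload).hexdigest()

a = FragmentSketch(b"kv-bytes", "demo-model", (0, 31), (0, 512))
b = FragmentSketch(b"kv-bytes", "demo-model", (0, 31), (0, 512))
# Identical content and signature → identical address → one stored copy.
```

Freezing the dataclass mirrors the immutability requirement: a fragment's address is only stable if the fragment never changes after creation.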
```bash
# Create a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate

# Install the package in editable mode with dev dependencies
pip install -e ".[dev]"
```

Core dependencies are `typer` and `rich` for the CLI. Optional extras are available:

```bash
pip install -e ".[server]"     # FastAPI, uvicorn, gRPC, Redis
pip install -e ".[gpu]"        # PyTorch CUDA backend
pip install -e ".[local-llm]"  # HuggingFace Transformers backend
```

Run the test suite and demos:

```bash
pytest tests/ -v

# Paper reproduction demo
python scripts/demo.py

# Full Membrane multi-phase demo
python scripts/demo_full.py

# Multi-node simulation demo
python scripts/demo_membrane.py
```

membrane/
├── model/ # Analytical model and simulator (paper reproduction)
│ ├── __init__.py
│ ├── profiler.py # Table 5 profiling data + interpolation
│ ├── throughput_model.py # Equations (1)–(6)
│ ├── optimizer.py # Grid search for t and N_p/N_d
│ ├── workload.py # Truncated log-normal generator
│ ├── router.py # Length-based and cache-aware routing
│ ├── scheduler.py # Dual-timescale scheduler
│ ├── metrics.py # TTFT and bandwidth metrics
│ └── simulator.py # End-to-end baseline simulations
├── compute/ # Compute backends
│ ├── backend.py
│ ├── cpu_backend.py
│ ├── gpu_backend.py
│ ├── transformers_backend.py
│ ├── openai_backend.py
│ ├── anthropic_backend.py
│ └── ollama_backend.py
├── persistence/ # Storage backends
│ ├── memory_backend.py
│ └── redis_backend.py
├── transport/ # Network transports
│ ├── http_server.py # stdlib HTTP server
│ ├── fastapi_server.py # FastAPI + uvicorn server
│ └── grpc_server.py # gRPC server
├── network/ # Cluster and peer networking
│ ├── cluster_manager.py
│ ├── config.py
│ ├── gossip_state.py
│ ├── peer_client.py
│ └── remote_transfer.py
├── __init__.py # Public API exports
├── server.py # Unified production server
├── cli.py # Command-line interface + dashboard
├── fragment.py # Core data model
├── exact_index.py # Hash-based exact index
├── semantic_index.py # Embedding similarity index
├── positional_index.py # Interval-based adjacency index
├── co_access_index.py # Co-access graph adjacency index
├── graph_manager.py # Graph lifecycle management
├── fragment_graph.py # Fragment relationship graph
├── reconstruction_engine.py # Context reconstruction from fragments
├── prefill_adapter.py # Adapter to analytical model
├── global_directory.py # Fragment location directory
├── distributed_directory.py # Shard-aware distributed directory
├── transfer_service.py # Sender/receiver negotiation
├── chunked_transfer.py # Chunk-based fragment transfer
├── delta_encoder.py # Delta encoding for transport
├── canonical_store.py # Multi-tenant deduplicated storage
├── tenant_isolation.py # Tenant policy enforcement
├── kv_cache_manager.py # Local KV cache management
├── membrane_node.py # Node owning a shard
├── origin_node.py # Primary fragment host
├── replica_node.py # Replica fragment host
├── supernode.py # Cluster coordination node
├── latency_router.py # Latency-aware routing
├── economic_router.py # Cost-aware routing
├── hash_ring.py # Consistent hashing
├── workload_analyzer.py # Session history analysis
├── session_tracker.py # Per-session access tracking
├── node_telemetry.py # Node health telemetry
├── dynamic_role_manager.py # Compute/memory role assignment
├── joint_optimizer.py # Joint placement optimization
├── offload_decision_engine.py # KV offload decisions
├── remote_prefill_dispatcher.py # Remote prefill dispatch
├── cluster_replicator.py # Cluster-wide replication
├── promotion_policy.py # Replica promotion logic
├── predictor.py # Access pattern prediction
├── cost_model.py # Cost estimation
├── value_density.py # Value density scoring
├── semantic_cluster.py # Semantic clustering
├── semantic_hash.py # Semantic hashing utilities
├── structural_signature.py # Model/layer/token span signatures
├── prefix.py # Prefix data model
├── prefix_version_chain.py # Version chain for prefixes
├── fragmentation_engine.py # Prompt-to-fragment conversion
├── index_system.py # Unified index facade
├── cache_metrics.py # Cache performance metrics
├── tool_trace.py # Tool execution traces
├── artifact.py # Artifact data model
├── memory_object.py # Memory object protocol
└── ...
tests/
├── test_*.py # Model unit tests
└── membrane/
├── test_*.py # Membrane unit and integration tests
scripts/
├── demo.py # Paper reproduction demo
├── demo_full.py # Full Membrane multi-phase demo
└── demo_membrane.py # Multi-node simulation demo
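Several of the modules above (`hash_ring.py`, `global_directory.py`, `distributed_directory.py`) rely on consistent hashing to decide which node owns a fragment. A toy ring, purely illustrative and not the repo's implementation:

```python
import bisect
import hashlib

class MiniHashRing:
    """Toy consistent-hash ring (illustrative; see membrane/hash_ring.py
    for the real one). Each node contributes `vnodes` points on the ring;
    a key belongs to the first node point clockwise from its hash."""

    def __init__(self, nodes, vnodes=64):
        self._points = []  # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self._points, (h, node))

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        # Wrap around to the start of the ring past the last point.
        idx = bisect.bisect(self._points, (h, "")) % len(self._points)
        return self._points[idx][1]

ring = MiniHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("fragment:abc123")
```

Virtual nodes smooth the load distribution, and adding or removing one node only remaps the keys between it and its ring neighbors rather than reshuffling everything.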
| Component | Status | Notes |
|---|---|---|
| Throughput equations (1)–(6) | Exact | Implemented verbatim from the paper. |
| Table 5 profiling data | Exact | All four measured points transcribed exactly. |
| Grid-search optimizer | Exact | Exhaustive 2-D search over t and N_p. |
| Workload distribution | Exact | Truncated log-normal with paper parameters. |
| Routing logic | Exact | Threshold and cache-aware rules from Sections 3.3 and 3.4.3. |
| Dual-timescale scheduler | Approximate | Congestion threshold value and relaxation rate are ASSUMPTION (not specified in paper). Long-term reallocation period is NOT DETERMINED. |
| Decode constants | Approximate | T_decode = 25 ms and BS_max = 20 are ASSUMPTION inferred from Table 6 consistency, not explicitly stated. |
| Prefix cache model | Approximate | Simplified to per-request prefix lengths; hit-rate distribution is NOT DETERMINED by the paper. |
| TTFT model | Approximate | Adds KV transfer time to prefill time for Membrane requests; queuing delay is NOT DETERMINED (assumed negligible for steady-state throughput). |
| Interpolation | Approximate | Linear interpolation between measured profiling points; method is NOT DETERMINED by the paper. |
| Homogeneous baseline | Exact | Uses same total instance count, no Membrane. |
| Naive heterogeneous baseline | Exact | All prefill on Membrane, all decode on PD, no threshold. |
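The inferred decode constants imply a simple per-instance ceiling. The arithmetic below is only a sanity check under the ASSUMPTION values from the table, not the paper's exact formula:

```python
T_DECODE = 0.025  # seconds per decode step (ASSUMPTION, see table above)
BS_MAX = 20       # max decode batch size (ASSUMPTION, see table above)

# With a full batch, each step emits BS_MAX tokens every T_DECODE seconds.
decode_tokens_per_sec = BS_MAX / T_DECODE  # 800 tokens/s per decode instance

# The fixed 1024-token output then occupies one batch slot for:
output_latency = 1024 * T_DECODE           # 25.6 s per request
```

Any pair of constants with the same ratio would give the same steady-state throughput, which is why the values could only be pinned down by checking consistency across the three Table 6 baselines.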
Known deviations:

- Table 6 absolute numbers: Because `Lambda_max` depends on the conditional means `E[L | L > t]` and `E[L | L <= t]`, which are computed from a synthetic sample, exact reproduction of the paper's `t = 19.4K` and `Lambda_max = 3.24` requires either the exact same random seed or a very large sample. Our search with 20k–50k samples consistently finds `t` in the 16K–21K range and `Lambda_max` in the 3.0–3.4 range, which is statistically consistent with the paper.
- Decode SLO constants: The paper states that `BS_max` and `T_decode` are "SLO-governed constants" but does not give their values. We inferred `T_decode = 0.025 s` and `BS_max = 20` from the consistency of Table 6 across the three baselines (see the `optimizer.py` docstring for the derivation). These are marked as ASSUMPTION.
- Scheduler parameters: The paper describes the short-term scheduler's behavior conceptually ("raise the effective threshold") but does not specify exact utilization thresholds, step sizes, or the long-term reallocation period. We chose `congestion_threshold = 0.85` and a 10% threshold increase, with gradual 1% relaxation. These are marked as ASSUMPTION.
- Prefix cache evaluation: The case study in Section 4 does not report prefix-cache hit rates. Our baseline reproduction assumes zero cache hits so that the results are driven solely by the routing threshold and hardware allocation, matching the paper's primary comparison. The router module still implements the cache-aware logic from Section 3.4.3 for extensibility.
- No GPU-level simulation: The paper's system is a real serving stack. Our reproduction is an analytical/discrete-event simulator that captures the steady-state throughput model. Fine-grained GPU scheduling, TCP congestion control, and layer-wise pipelining (Section 3.3) are not modeled at the packet or kernel level.
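The seed sensitivity behind the Table 6 deviation is easy to demonstrate. A sketch (hypothetical helper, not the repo's code) that estimates both conditional means from a synthetic truncated log-normal sample:

```python
import random

def conditional_means(t, n=50_000, mu=9.90, sigma=1.00,
                      lo=128, hi=128 * 1024, seed=0):
    """Estimate E[L | L > t] and E[L | L <= t] from a truncated
    log-normal sample. Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        length = rng.lognormvariate(mu, sigma)
        if lo <= length <= hi:
            sample.append(length)
    above = [l for l in sample if l > t]
    below = [l for l in sample if l <= t]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(above), mean(below)

hi_mean, lo_mean = conditional_means(t=19_400)
# Re-running with a different seed shifts both means slightly, which in
# turn shifts the recovered optimum t and Lambda_max from run to run.
```

This is why the reproduction reports ranges (`t` in 16K–21K, `Lambda_max` in 3.0–3.4) rather than point estimates.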
License: MIT