Your RAG system is confidently serving wrong answers 67% of the time. When React deprecates an API, Python removes a module, or a medical guideline changes, retrieval keeps surfacing the old version. The user gets a well-formatted, authoritative, stale response.
VDD makes memory decay a function of environmental change instead of a fixed timer. When knowledge is stable, forget slowly. When semantic drift is detected, forget fast.
The honest pitch: VDD is not the best method in any single scenario. Recency wins in high-drift. Time-weighted wins in gradual transitions. But VDD is never the worst---making it the safest default when you don't know what kind of drift you'll face. That's the real-world condition.
Every RAG system with a knowledge base faces a question: how aggressively should old memories be discounted?
- Too slow (static low decay) → stale answers accumulate (67% staleness rate)
- Too fast (aggressive recency) → valid knowledge is destroyed during reversions
- Any fixed rate → wrong for at least some drift pattern you'll encounter
VDD makes the decay rate adaptive:
```
lambda(t) = lambda_base + (lambda_max - lambda_base) * sigmoid(k * (V_t - V_0))
```
V_t is the detected semantic volatility (embedding centroid distance). When embeddings shift, lambda increases and old memories fade. When embeddings are stable, lambda stays low and knowledge is preserved.
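As a concrete sketch of this update (function and parameter names are ours, not necessarily the library's API):

```python
import math

def vdd_lambda(v_t, lambda_base=0.05, lambda_max=0.9, k=10.0, v0=0.1):
    """Map detected volatility V_t to a decay rate via a sigmoid gate."""
    sigmoid = 1.0 / (1.0 + math.exp(-k * (v_t - v0)))
    return lambda_base + (lambda_max - lambda_base) * sigmoid

# At the threshold (V_t == V_0) the sigmoid is 0.5, so lambda sits halfway.
assert abs(vdd_lambda(0.1) - 0.475) < 1e-9
# Low volatility keeps decay near lambda_base; drift pushes it toward lambda_max.
assert vdd_lambda(0.0) < 0.3
assert vdd_lambda(0.5) > 0.85
```

The sigmoid makes the transition between the two regimes smooth but sharp: with the default k=10, lambda moves from near-resting to near-maximum over a narrow band of volatility around V_0.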
| Parameter | Default | What it does |
|---|---|---|
| lambda_base | 0.05 | Resting decay (stable periods) |
| lambda_max | 0.9 | Maximum decay (during drift) |
| k | 10 | Sigmoid steepness (at k>=5, activation choice is immaterial) |
| V_0 | 0.1 | Volatility threshold (critical to tune, 254% performance range) |
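V_t itself comes from comparing a short recent window of embeddings against a longer archive window. A minimal illustration of centroid-distance volatility (our own simplified sketch using Euclidean distance; the shipped `EmbeddingDistanceDetector` may differ in details):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def volatility(recent, archive):
    """Euclidean distance between the centroids of a recent embedding
    window and a longer archive window: near zero when the distribution
    is stable, positive when the recent embeddings have shifted."""
    c_r, c_a = centroid(recent), centroid(archive)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_r, c_a)))

# Identical distributions -> zero volatility; a shifted recent window -> positive.
stable = volatility([[1.0, 0.0]] * 10, [[1.0, 0.0]] * 200)
drift = volatility([[0.0, 1.0]] * 10, [[1.0, 0.0]] * 200)
assert stable == 0.0 and drift > 1.0
```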
42 experiments. 120 real-world facts. 3 domains. 13 methods compared. Here's the truth.
| Claim | Evidence | Experiment |
|---|---|---|
| 87.5% error reduction vs static baselines | IAE 15.81 vs 126.67 | Exp 2 |
| Better staleness handling than recency | d = +1.85 (large effect) | Exp 20 |
| Better reversion handling than recency | d = +1.22 | Exp 18 |
| Never the worst across all 4 drift patterns | Rank #2/13, 100% rank stability | Exp 8, 22, 31 |
| Negligible latency overhead | 2.75ms mean, 6.48ms P99 | Exp 7 |
| Beats LRU-based decay | d = +0.921 | Exp 37 |
| FreshQA: 100% current-answer retrieval | vs no_decay 70-80% | Exp 39 |
| Auto-calibration eliminates V_0 tuning | 22.7% improvement over hand-tuned | Exp 40 |
| Rankings hold across embedding models | Spearman rho = 0.978 | Exp 41 |
| Scenario | Winner | How badly VDD loses |
|---|---|---|
| Constant high drift | Recency (lambda=0.5) | d = -2.90 (very large) |
| Gradual transitions | Time-weighted (alpha=0.01) | 0.871 vs 0.584 accuracy |
| 3 of 4 synthetic scenarios | Holt-Winters, EMA-lambda | VDD ranks #5, they rank #2-3 |
| Real embeddings, absolute accuracy | Time-weighted, static | 0.875 and 0.736 vs VDD 0.628 |
| StreamingQA (36K questions) | No decay | d = -7.68 (accumulation, not replacement) |
The key insight: VDD doesn't win individual races. It places consistently, which means it's the method you want when you can't predict the race.
Tested on React (v16-18), Python (v3.8-3.12), Node.js (v16-20) documentation:
| Method | Accuracy | Staleness | Verdict |
|---|---|---|---|
| Time-weighted | 0.875 | 0.000 | Best for gradual transitions |
| Static (lambda=0.1) | 0.736 | 0.135 | Good if stable |
| VDD | 0.628 | 0.243 | Best staleness/robustness tradeoff |
| Recency (lambda=0.5) | 0.570 | 0.300 | Wins on constant churn |
| Sliding window | 0.421 | 0.398 | Catastrophic failure on 2/3 domains |
| No decay | 0.333 | 0.667 | Random chance |
- 85/88 significant results survive Benjamini-Hochberg FDR correction (no headline claim is lost)
- Type M analysis: at d=0.5 true effect, our experiments inflate to d=0.715 (1.43x). Production effects will be smaller than reported.
- Hash-to-real embedding transfer: rankings preserved in 4/5 scenarios (r=0.935), but effect sizes attenuate ~30%
- LLM-in-the-loop validation: n=50 seeds, 7,500 LLM calls via llama3.1:8b + nomic-embed-text
Use VDD when:
- You don't know what drift pattern you'll face (the common case)
- Knowledge reversions are possible (rollbacks, seasonal changes, corrections)
- Avoiding stale answers matters more than peak accuracy
- You're building a coding assistant (gradual deprecations + sudden breaking changes)
- You're building a medical/legal knowledge base (long stability + sudden guideline changes)
Prefer another method when:
- Constant high churn (news, social media) → use recency
- Gradual, predictable evolution → use time-weighted decay
- Known stable domain → use static decay with low lambda
- You have labeled data for drift → online_lambda matches VDD with labels (d = -0.056)
- And avoid sliding windows (N>50) on multi-topic domains: catastrophic 0.333 accuracy on 2/3 test domains
See the practitioner decision tree for a visual guide to method selection based on your use case.
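For readers who prefer code to diagrams, the guidance above roughly condenses to the following (purely illustrative; the decision tree in the paper is authoritative, and the method names here are our shorthand):

```python
def pick_decay_method(drift_pattern="unknown", has_drift_labels=False):
    """Rough method selector distilled from the guidance above (illustrative)."""
    if has_drift_labels:
        return "online_lambda"   # matches VDD when drift labels exist (d = -0.056)
    return {
        "constant_high_churn": "recency",      # news, social media
        "gradual_evolution": "time_weighted",  # predictable transitions
        "stable": "static_low_lambda",         # known stable domain
    }.get(drift_pattern, "vdd")                # unknown drift -> never-worst default

assert pick_decay_method() == "vdd"
assert pick_decay_method("constant_high_churn") == "recency"
```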
```bash
git clone https://github.com/abe238/volatility-driven-decay.git
cd volatility-driven-decay
python -m venv venv && source venv/bin/activate
pip install -e ".[all]"
```

```python
from vdd.drift_detection import EmbeddingDistanceDetector
from vdd.memory import VDDMemoryBank
from vdd.retrieval import VDDRetriever

detector = EmbeddingDistanceDetector(current_window=10, archive_window=200)
memory = VDDMemoryBank(lambda_base=0.05, lambda_max=0.9, k=10.0, v0=0.1)
retriever = VDDRetriever(memory, detector)

retriever.add("React 18 uses Suspense for data fetching", embedding)
results = retriever.retrieve(query_embedding, k=5)
```

LangChain: Custom VDDRetriever wrapping any vectorstore. Drift detection runs on the embedding history within the retriever's `_get_relevant_documents` method.
LlamaIndex: Custom NodePostprocessor that applies VDD's temporal weighting after initial retrieval.
Haystack: Custom Ranker component operating between retriever and reader stages.
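All three integrations reduce to the same post-retrieval step: rescale candidate scores by an exponential decay in memory age, with lambda supplied by the drift detector. A framework-agnostic sketch (class and function names are ours, and `exp(-lambda * age)` is one plausible weighting, not necessarily the library's exact formula):

```python
import math
from dataclasses import dataclass

@dataclass
class ScoredDoc:
    text: str
    score: float   # similarity score from the underlying retriever
    age: float     # time since the memory was written, in arbitrary units

def apply_vdd_weighting(docs, lam):
    """Down-weight each candidate by exp(-lambda * age), then re-sort.

    `lam` is the current decay rate from the volatility detector: small
    during stable periods (old memories survive), large during drift
    (stale memories sink in the ranking)."""
    reweighted = [(d, d.score * math.exp(-lam * d.age)) for d in docs]
    return [d for d, _ in sorted(reweighted, key=lambda p: p[1], reverse=True)]

docs = [ScoredDoc("React 16 class components", score=0.95, age=5.0),
        ScoredDoc("React 18 Suspense", score=0.70, age=0.5)]
# Stable period (low lambda): the strong old match still wins.
assert apply_vdd_weighting(docs, lam=0.05)[0].text == "React 16 class components"
# Under drift (lambda near lambda_max): the fresher memory wins.
assert apply_vdd_weighting(docs, lam=0.9)[0].text == "React 18 Suspense"
```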
42 experiments organized by what they test:
| Group | Experiments | What they validate | External deps |
|---|---|---|---|
| Core validation | 1-8 | Drift detection, VDD mechanism, baselines | None |
| Extended validation | 9-15 | Precision, scaling, statistical hardening | None |
| Real-world | 16-20 | 120 facts, 3 domains, staleness | Ollama (Exp 16) |
| Advanced analysis | 21-26, 29 | Effective lambda, sigmoid sensitivity | None |
| Ollama-required | 23, 27-28, 33 | Real embeddings, LLM evaluation, 3-domain | Ollama |
| Confirmatory | 30-32 | n=30 reruns, adaptive baselines, bimodality | None |
| Revision | 34-41 | FDR correction, n=50 LLM, real embedding suite, activation ablation, StreamingQA, FreshQA, auto-calibration, cross-model embedding | Ollama (35-36, 38-39, 41) |
| Practitioner | 42 | Decision tree visualization for method selection | None |
```bash
# Run all CPU experiments (~15 min)
python run_experiments.py --all

# Run a specific experiment
python experiments/02_scalar_simulation_fixed.py

# Experiments requiring Ollama
ollama pull nomic-embed-text && ollama pull llama3.1:8b
python experiments/33_three_domain.py
```

All results (plots + JSON) are saved to `results/`.
This paper started with a fatal flaw and was rebuilt from scratch. We're sharing the full journey because science should show its work.
V1: Built a VDD prototype with oracle drift detection. Looked amazing. The oracle was the result---we were measuring our own assumptions. Fatal flaw discovered, entire approach scrapped.
V2-V3: Replaced oracle with real detection. Discovered recency (the simplest baseline) beats VDD on IAE. This was demoralizing but honest. We kept going.
V4: Reframed from "VDD is best" to "VDD is never worst." Added staleness analysis. The weaker claim turned out to be the more useful one: practitioners need reliable defaults, not optimistic benchmarks.
V5: Proved VDD's lambda distribution is genuinely bimodal (Hartigan's Dip p < 0.001), not just noisy static decay. Expanded to 90 facts, n=30 seeds. Added bootstrap CIs.
V6: Three-domain validation (React + Python + Node.js, 120 facts). Added adaptive baselines (Holt-Winters, EMA-lambda, DWM-lite). Published with 31 experiments.
V7-V8: Peer-review revisions and condensed arXiv formatting.
V9 (current): Major revision addressing 11 peer review issues head-on:
- Added Benjamini-Hochberg FDR correction (85/88 survive)
- Expanded LLM validation from n=5 to n=50 (7,500 LLM calls)
- Validated hash-to-real embedding transfer (4/5 scenarios, r=0.935)
- Added 12 missing citations (HippoRAG, MemoRAG, Mem0, ARM, T-GRAG, etc.)
- Proved activation function choice is immaterial at k>=5
- Added Type M error analysis (1.43x inflation at d=0.5)
- Acknowledged every limitation we could identify (14 total)
- Added theoretical analysis (regret framework formalizing "never worst" property)
- Added practitioner decision tree for method selection (Exp 42)
The paper is stronger because reviewers pushed back. Every objection made the science more honest.
- Recency often wins on raw accuracy (d = -2.90 in high-drift)
- Adaptive baselines (Holt-Winters, EMA-lambda) beat VDD in 3/4 synthetic scenarios
- Detector false positive rate is 13.5%, not the ~1% initially reported
- Time-weighted dominates gradual transitions (0.871 vs 0.584)
- V_0 and window sizes require tuning (254-325% performance range), though auto-calibration (Exp 40) mitigates V_0 tuning with 22.7% improvement over hand-tuned defaults
- Effect sizes inflate ~1.43x under controlled conditions; production effects will be smaller
- O(n) scaling; >10K memories needs approximate nearest neighbor search
- LLM validation shows VDD/recency have zero variance (deterministic retrieval, not statistical robustness)
- Only 3 technical documentation domains tested; no news, medical, or legal validation
- No production deployment study
- StreamingQA shows decay is counterproductive on accumulation tasks (d = -7.68); need replacement-focused benchmarks
- Effect sizes attenuate ~30% with real embeddings vs hash-based
- online_lambda matches VDD (d = -0.056), weakening the unique mechanism claim
- 120-fact dataset is purpose-built, not a community benchmark
We believe listing what doesn't work is as valuable as listing what does.
120 versioned facts with ground-truth answers across 3 API evolution timelines:
| Domain | Facts | Versions | Example drift |
|---|---|---|---|
| React | 60 | v16, v17, v18 | Class components → Hooks → Suspense |
| Python | 30 | 3.8, 3.10, 3.12 | Assignment expressions, match statements |
| Node.js | 30 | v16, v18, v20 | Fetch API, test runner, permission model |
All data is in data/real_rag/. Released under Apache 2.0 for reproducibility.
```
volatility-driven-decay/
├── paper_v9.md              # Paper (markdown, 1,190+ lines)
├── paper_v9.pdf             # Paper (PDF, 42 pages)
├── paper_v9.tex             # Paper (LaTeX)
├── paper_v9.docx            # Paper (Word)
├── arxiv_submission/        # arXiv-ready package
│   ├── main.tex             # Self-contained LaTeX (1,800+ lines)
│   ├── main.pdf             # Compiled PDF
│   └── figures/             # All 19 figures
├── src/vdd/                 # Core library
│   ├── drift_detection/     # ADWIN, embedding distance, Page-Hinkley
│   ├── memory/              # VDD memory bank, static decay baseline
│   └── retrieval/           # VDD retriever
├── experiments/             # All 42 experiment scripts
├── results/                 # Generated plots and JSON data
├── data/real_rag/           # 120 versioned facts (React, Python, Node.js)
├── tests/                   # Unit tests
└── run_experiments.py       # CLI experiment runner
```
```bibtex
@article{diaz2026vdd,
  title={Volatility-Driven Decay: Adaptive Memory Retention for
         RAG Systems Under Unknown Drift},
  author={Diaz, Abe},
  journal={arXiv preprint},
  year={2026}
}
```

Apache License 2.0. See LICENSE.
Opinions are my own. This work does not relate to my position at Amazon.