Empirical validation of the Specificity Erosion signal from the DriftAlert framework.
- Pre-registered: OSF pkr84, locked 2026-05-05 before any LLM API call. The pre-registration is immutable; deviations are appended to `results/reports/deviations.md`.
- Primary execution: complete (commit 5679dd7; Haiku 4.5, N=12 sessions × 20 turns × 2 arms; T=0.3 confirmatory).
- Stretch (Path 1+2 robustness, Path 4 hardening): complete (commit 2328ce6; Sonnet 4.6 N=3 cross-model, Haiku T=0.7 N=6 sweep, and four post-hoc hardening probes: multiverse, in-context residue mechanism, negative-control margin calibration, institutional-corpus subsample robustness).
- Path 3 (content-controlled v2 user script): descoped. The existing Path 1 + Path 2 + Path 4 evidence forms a robust empirical story, and running additional experimental arms risked introducing power-limited contradictions. The `--user-script` flag and `script_version` kwarg in the harness are inert plumbing kept for transparency; the rationale is logged as Deviation 5 in `results/reports/deviations.md`.
- Companion paper: in preparation; this repository is the code-and-data companion released alongside the arXiv preprint.
Specificity Erosion is a hypothesized failure mode of multi-turn LLM deployments in which model responses drift over the course of a session, away from the institutional context they were conditioned on at turn 1 and toward the model's pretraining prior. In an enterprise setting (e.g., a regulated wealth-advisory firm with named methodologies, prohibitions, and house style), this drift presents as a quiet loss of firm-specific detail: replies remain plausible but become generic.
DriftBench-SE measures this quantitatively. For each LLM response r_t at turn t, we compute a contrastive embedding margin
m_t = cos(emb(r_t), μ_inst) − cos(emb(r_t), μ_generic)
where μ_inst is the centroid of an institutional corpus and μ_generic is the centroid of a generic-domain corpus, both embedded with the same sentence-transformer (BGE-large-en-v1.5 primary; e5-large-v2 secondary; MiniLM-L6-v2 non-retrieval robustness). We compare a control arm (institutional context injected at turn 1 only) against a treatment arm (re-injected on every turn) over 20-turn sessions and test whether the per-turn margin decays in the control arm and whether the treatment arm flattens that decay.
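As a minimal sketch of the margin computation (pure NumPy, assuming responses and corpus documents have already been embedded and unit-normalized; the actual implementation lives in `src/metrics.py`):

```python
import numpy as np

def _unit(v: np.ndarray) -> np.ndarray:
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def centroid(embeddings: np.ndarray) -> np.ndarray:
    """Centroid of unit-normalized embeddings, renormalized to the unit sphere."""
    return _unit(np.asarray(embeddings, dtype=float).mean(axis=0))

def contrastive_margin(r_emb: np.ndarray, mu_inst: np.ndarray, mu_generic: np.ndarray) -> float:
    """m_t = cos(emb(r_t), mu_inst) - cos(emb(r_t), mu_generic)."""
    r = _unit(np.asarray(r_emb, dtype=float))
    return float(r @ mu_inst - r @ mu_generic)
```

A positive m_t means the response sits closer to the institutional centroid than to the generic one; drift toward the pretraining prior would show up as m_t shrinking over turns.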
Three converging conclusions, supported across embedders, sampling temperatures, and post-hoc hardening probes (full numbers in paper/section_3_2_1.md and results/reports/):
- T20 level effect is real and robust on Haiku 4.5. Treatment exceeds control at turn 20 with Hedges' g ≈ 0.90 (BGE), 0.97 (e5), 0.80 (MiniLM). The multiverse / specification-curve probe finds the `arm[T.treatment]` main effect is positive in 100 % of 48 LMM specifications (median +0.017, IQR [+0.007, +0.024]), and a corpus-subsample probe finds the headline survives even when only 25 % of the institutional corpus anchors μ_inst (median Hedges' g = +0.886 across 50 random N=10 subsamples; 100 % reach Welch p < .05).
- H1 / H2 trajectory null is real on Haiku 4.5. Across embedders, the per-session Mann-Kendall test on the control arm's margin trajectory yields zero significant decreasing sessions; Bayes factors (BF₀₁) on the LMM `turn:arm` interaction are 8.7 / 22.9 / 2.8 (BGE / e5 / MiniLM), all favoring the null; and TOST equivalence holds on e5 at α = .05. Higher temperature (T=0.7, N=6) does not unmask drift, ruling out the audit's "low-T mode-hugging" hypothesis.
- H6 cross-model on Sonnet 4.6 (N=3) does not replicate. Welch t = −0.05 (one-sided p = 0.520, BGE), but at N=3 the test has only 24 % power for the Haiku-sized effect (post-hoc; MDE = Hedges' g ≥ 2.48). The non-replication is consistent with both "model-specific to Haiku" and "under-powered at N=3"; we report both interpretations honestly. Resolving the ambiguity (Sonnet at N=12) is identified as the cleanest near-term follow-up.
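For reference, the T20 level effect above is a two-sample comparison of per-session margins at turn 20. A minimal sketch of the effect-size computation (illustrative only, not the exact code in `src/stats.py`):

```python
import numpy as np

def hedges_g(treatment: np.ndarray, control: np.ndarray) -> float:
    """Hedges' g: Cohen's d with the small-sample bias correction factor J."""
    x = np.asarray(treatment, dtype=float)
    y = np.asarray(control, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled SD from unbiased (ddof=1) variances
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    d = (x.mean() - y.mean()) / sp
    j = 1.0 - 3.0 / (4.0 * (nx + ny) - 9.0)  # small-sample correction
    return j * d
```

With N=12 sessions per arm the correction factor is close to but below 1, so g is slightly smaller in magnitude than the uncorrected d.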
The repository is publication-ready on findings 1+2 with the post-hoc hardening probes summarized in paper/section_3_2_1.md § "Hardening probes."
If you use this software, the corpora, or the empirical results, please cite the companion paper (when available on arXiv) and this repository. GitHub renders a "Cite this repository" button driven by CITATION.cff.
Companion paper (BibTeX placeholder; replace with arXiv ID and DOI on preprint upload):
```bibtex
@article{rayman2026agentic,
  author  = {Rayman, Drew and Prakash, Pranadarth and Haraldsdottir, Kara Kristin Bloendal},
  title   = {The Nature of Agentic Drift: Detecting and Recalibrating Semantic Fidelity Loss in Enterprise AI Systems},
  year    = {2026},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  note    = {Replace XXXX.XXXXX with the assigned arXiv identifier.}
}
```

This repository (DOI-citable; archived at Zenodo):
```bibtex
@software{prakash2026driftbenchse,
  author    = {Rayman, Drew and Prakash, Pranadarth and Haraldsdottir, Kara Kristin Bloendal},
  title     = {DriftBench-SE: Empirical validation of the Specificity Erosion signal},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20098704},
  url       = {https://doi.org/10.5281/zenodo.20098704},
  version   = {0.1.0}
}
```

OSF pre-registration:
```bibtex
@misc{prakash2026driftbenchprereg,
  author       = {Prakash, Pranadarth and Rayman, Drew and Haraldsdottir, Kara Kristin Bloendal},
  title        = {DriftBench-SE: Empirical validation of the Specificity Erosion signal (pre-registration)},
  year         = {2026},
  publisher    = {OSF},
  howpublished = {\url{https://osf.io/pkr84}}
}
```

```text
driftbench-se/
├── README.md                     this file
├── LICENSE                       MIT (code) + per-zone licenses for data and reports
├── CITATION.cff                  machine-readable citation metadata
├── CHANGELOG.md                  release history
├── prereg.md                     locked OSF pre-registration (immutable)
├── pyproject.toml                pinned deps; py3.11; cross-platform via [tool.uv] required-environments
├── uv.lock                       deterministic dep resolution
├── requirements-lock.txt         pip-friendly lock derived from uv export
├── data/
│   ├── institutional/            40 inst_NNN.txt + manifest.json + NOTICE.md (CC-BY-4.0; synthetic)
│   ├── generic/                  manifest.json with per-doc-type license metadata
│   │                             (CC-BY-SA-4.0 for Wikipedia, public domain for SEC ADV);
│   │                             docs NOT redistributed; fetch via recorded URLs
│   ├── user_script.json          locked 20-turn user script
│   └── filler_block.txt          neutral filler for the control_filler stretch arm
├── src/
│   ├── llm_clients.py            LLMClient ABC + Anthropic + OpenAI clients + SQLite cache + retry
│   ├── runner.py                 run_session(): control/treatment/control_filler/amnesiac
│   ├── metrics.py                embed_texts, centroid, contrastive_margin, build_inst_context_block
│   ├── stats.py                  MK, LMM (with fit_failed fallback), mediation, Holm,
│   │                             TOST, Savage-Dickey BF, post-hoc power, MDE
│   └── corpora.py, viz.py, utils.py, __init__.py
├── scripts/
│   ├── 00_preflight_variance.py  Haiku 4.5 sampling-variance pre-flight
│   ├── 01_build_corpora.py       Wikipedia / SEC scraping for the generic corpus
│   ├── 02_validate_corpora.py    linear-probe + centroid-distance gates
│   ├── 03_run_experiment.py      primary + stretch arms; --user-script flag
│   ├── 04_compute_stats.py       primary statistical pipeline (H1, H2, level effect)
│   ├── 05_make_figures.py        Fig 1 (trajectory), Fig 2 (final dist), Fig 3 (UMAP), Fig 4 (robustness 2×2)
│   ├── 06_compute_stretch.py     Stretch H4/H5/H6/A7/A8 + cross-model run
│   ├── 07_robustness.py          Path 1: TOST + Bayes factor + post-hoc power
│   ├── 08_temperature_compare.py Path 2: T=0.3 vs T=0.7 (H7a/b/c)
│   ├── 10_multiverse.py          Path 4 (i): 48-spec specification-curve analysis
│   ├── 11_residue_mechanism.py   Path 4 (ii): in-context residue trajectory test
│   ├── 12_negative_control.py    Path 4 (iii): generic-corpus split placebo + LOO sanity
│   └── 13_corpus_subsample.py    Path 4 (iv): institutional-corpus subsample robustness
│   # Slot 09 reserved for Path 3 (content-controlled user script); descoped 2026-05-08.
├── tests/                        73 unit tests; pytest target
├── results/
│   ├── raw/                      per-session JSON (gitignored; reproducible from cache)
│   ├── tables/                   runs_with_margins{,_stretch}.parquet, residue_per_turn.parquet
│   └── reports/                  stats_*, stretch_stats_*, robustness_*,
│                                 temperature_compare_*, multiverse_*, residue_mechanism_*,
│                                 negative_control_*, corpus_subsample_*, deviations.md, corpus_validation.json
├── figures/                      fig1-4 PDF + PNG (publication-ready)
├── paper/
│   ├── section_3_2_1.md          drop-in §3.2.1 with primary + stretch + Path 1/2/4 numbers
│   ├── stretch_protocol.md       pre-registered stretch protocol (locked 2026-05-05)
│   └── paper_context_bundle.md   Claude.ai-uploadable bundle of every empirical artifact
└── results/reports/deviations.md appended-only log of pre-registration deviations
```
The full pipeline runs end-to-end from a fresh clone in roughly 35 minutes on Apple Silicon (longer on Intel Mac due to slower CPU embedding). The repository is hardware-agnostic — pyproject.toml carries [tool.uv] required-environments for arm64-darwin, x86_64-darwin, and linux-x86_64. Intel Mac resolves to torch 2.2.2, transformers <5, numpy <2 per Deviation 3.
```bash
# 1. Clone and install
git clone https://github.com/pranuprakash/driftbench-se.git
cd driftbench-se
uv sync

# 2. API keys (gitignored)
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
EOF
# (OPENAI_API_KEY only needed if you reactivate the original cross-vendor stretch arm;
#  the project as committed is Anthropic-only; Sonnet 4.6 substitutes per Deviation 2.)

# 3. Verify environment
uv run pytest tests/                     # expect 73 / 73 pass
uv run ruff check src/ scripts/ tests/   # expect "All checks passed!"
uv run mypy src/                         # expect "Success: no issues found"

# 4. Build the generic corpus from the recorded URLs (Wikipedia + SEC ADV).
#    The institutional corpus is already in data/institutional/ (synthetic).
uv run python scripts/01_build_corpora.py

# 5. Validate corpora against the locked gates (centroid distance ≥ 0.15, linear-probe ≥ 0.85).
uv run python scripts/02_validate_corpora.py
# Must print "VALIDATION PASS" before any experiment is allowed.

# 6. Primary experiment (Haiku 4.5, N=12, T=0.3, max_tokens=4096; ~140 min, ~$10)
uv run python scripts/03_run_experiment.py \
    --llm claude-haiku-4-5-20251001 \
    --n-sessions 12 --n-turns 20 --temperature 0.3 --max-tokens 4096

# 7. Compute primary statistics + render figures
uv run python scripts/04_compute_stats.py
uv run python scripts/05_make_figures.py

# 8. (Optional) Stretch arms: filler / amnesiac / cross-model on Sonnet 4.6.
#    See paper/stretch_protocol.md for the locked protocol.
uv run python scripts/03_run_experiment.py --arms control_filler --n-sessions 12
uv run python scripts/03_run_experiment.py --arms amnesiac_control --n-sessions 12
uv run python scripts/03_run_experiment.py --llm claude-sonnet-4-6 --n-sessions 3 --max-cost-usd 6
uv run python scripts/06_compute_stretch.py

# 9. Robustness re-analysis + temperature sweep (Paths 1 and 2)
uv run python scripts/07_robustness.py
uv run python scripts/03_run_experiment.py --temperature 0.7 --n-sessions 6 --max-cost-usd 7
uv run python scripts/08_temperature_compare.py

# 10. Hardening probes (Path 4): read-only on existing parquets, no API
uv run python scripts/10_multiverse.py
uv run python scripts/11_residue_mechanism.py
uv run python scripts/12_negative_control.py
uv run python scripts/13_corpus_subsample.py
```

The LLM cache at `.cache/llm_cache.sqlite` keys on (system_prompt + chat_history + session_idx) per the deviation footnote in `prereg.md`; re-running with the same configuration is effectively free after the first execution. Re-running stats and figures on the committed `runs_with_margins.parquet` and `runs_with_margins_stretch.parquet` is also free (no API calls).
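The cache-key scheme can be sketched as follows (a hypothetical illustration, not the exact code in `src/llm_clients.py`; the hash choice and table schema are assumptions):

```python
import hashlib
import json
import sqlite3

def cache_key(system_prompt: str, chat_history: list, session_idx: int) -> str:
    """Deterministic key over (system_prompt + chat_history + session_idx)."""
    payload = json.dumps(
        {"system": system_prompt, "history": chat_history, "session": session_idx},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(db_path: str, key: str):
    """Return a cached response for `key`, or None on a cache miss."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS llm_cache (k TEXT PRIMARY KEY, v TEXT)")
    row = con.execute("SELECT v FROM llm_cache WHERE k = ?", (key,)).fetchone()
    con.close()
    return row[0] if row else None
```

Including `session_idx` in the key is what lets N sessions with identical prompts draw independent samples on the first run while still replaying for free afterwards.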
- `results/raw/`: one JSON file per session × arm × LLM (gitignored; reproducible from the LLM cache).
- `results/tables/`: analyzed Parquet.
  - `runs_with_margins.parquet`: primary Haiku 4.5 (BGE + e5 margins).
  - `runs_with_margins_stretch.parquet`: primary + filler + amnesiac + Sonnet cross-model (BGE + e5 + MiniLM margins).
  - `residue_per_turn.parquet`: Path 4 (ii) cumulative-history residue trajectories.
- `results/reports/`: human-readable + JSON statistics. Start at:
  - `stats_report.txt`: primary headline (LMM, MK, Welch, Holm correction).
  - `stretch_stats_report.txt`: H4 / H5 / H6 / A7 / A8 results.
  - `robustness_report.txt`: Path 1 (TOST + Bayes + post-hoc power).
  - `temperature_compare_report.txt`: Path 2 (T=0.7 sweep).
  - `multiverse_report.txt`, `residue_mechanism_report.txt`, `negative_control_report.txt`, `corpus_subsample_report.txt`: Path 4 (i)-(iv) hardening probes.
  - `deviations.md`: appended-only log of all pre-registration deviations (six entries as of 2026-05-08).
  - `corpus_validation.json`: gate-check record (centroid distance 0.18; linear-probe 5-fold accuracy 0.99).
- `figures/`: `fig1_trajectory.{pdf,png}` (per-turn trajectory, control vs treatment, both embedders); `fig2_final_distribution.{pdf,png}` (T20 distribution); `fig3_umap.{pdf,png}` (corpus + response 2-D projection); `fig4_robustness.{pdf,png}` (2×2 small multiples, {BGE, e5} × {Haiku, Sonnet}).
A reviewer wanting only the result should open paper/section_3_2_1.md, then figures/fig1_trajectory.pdf and figures/fig2_final_distribution.pdf.
The pre-registration (prereg.md, OSF pkr84) was locked on 2026-05-05 before any LLM API call. It is immutable. All deviations from the locked plan are recorded with timestamp, the original locked text, the deviation, and a justification, in results/reports/deviations.md.
| # | Deviation | Summary |
|---|---|---|
| 1 | Secondary embedder substitution | stella_en_400M_v5 → intfloat/e5-large-v2 (xformers / MPS platform incompatibility) |
| 2 | Cross-LLM substitution | gpt-4o-mini → claude-sonnet-4-6 at N=3 (no OpenAI key on analysis machine) |
| 3 | Hardware-agnostic dependency markers | pyproject.toml resolves on Apple Silicon, Intel Mac, and Linux |
| 4 | Temperature sweep at T=0.7 | Exploratory Path-2 arm probing low-T mode-hugging |
| 5 | (reserved; Path 3 descoped) | — |
| 6 | Path 4 hardening probes | Multiverse, residue mechanism, negative-control calibration, corpus subsample — additive only, no new API spend |
See LICENSE for the full text. Summary:
- Code (`src/`, `scripts/`, `tests/`, top-level Python project files): MIT License.
- Institutional corpus (`data/institutional/`): CC-BY-4.0 (see `data/institutional/NOTICE.md`). Synthetic; refers to a fictional firm ("Meridian Heritage Advisors") and is not investment advice.
- Generic corpus: linked rather than redistributed. Source URLs, retrieval timestamps, and per-doc-type upstream licenses (Wikipedia: CC-BY-SA-4.0; SEC Form ADV Part 2A brochures: U.S. federal-government public domain) are recorded in `data/generic/manifest.json`. Reproducers fetch the documents themselves at the recorded URLs to respect upstream terms.
- Pre-registration, paper section, statistical reports, figures: CC-BY-4.0.
- Drew Rayman — meetsynthia.ai — drew@meetsynthia.ai
- Pranadarth Prakash — Columbia University — pranup48@gmail.com (corresponding)
- Kara Kristin Bloendal Haraldsdottir — Columbia University — kkb2143@columbia.edu
For reproducibility questions or to report a deviation, please open an issue at https://github.com/pranuprakash/driftbench-se/issues or email the corresponding author.
- Python 3.11 (pinned via `.python-version`).
- ~10 GB free disk for sentence-transformer weights (BGE-large ≈ 1.3 GB; e5-large ≈ 1.3 GB; MiniLM-L6 ≈ 90 MB; the rest is the torch wheel and CUDA-free dependencies).
- CPU is sufficient: no GPU is required at the corpus and response volumes used here. A consumer laptop with 16 GB RAM completes the full pipeline; an Intel Mac is ~6× slower than Apple Silicon for the embedding stages but produces identical results.