
DriftBench-SE

Empirical validation of the Specificity Erosion signal from the DriftAlert framework.

DOI · Pre-registered on OSF · License: MIT · Data License: CC-BY-4.0 · Python 3.11


Status

  • Pre-registered: OSF pkr84, locked 2026-05-05 before any LLM API call. The pre-registration is immutable; deviations are appended to results/reports/deviations.md.
  • Primary execution: complete (commit 5679dd7, Haiku 4.5 N=12 sessions × 20 turns × 2 arms; T=0.3 confirmatory).
  • Stretch (Path 1+2 robustness, Path 4 hardening): complete (commit 2328ce6, Sonnet 4.6 N=3 cross-model, Haiku T=0.7 N=6 sweep, four post-hoc hardening probes — multiverse, in-context residue mechanism, negative-control margin calibration, institutional-corpus subsample robustness).
  • Path 3 (content-controlled v2 user-script) is descoped: existing Path 1 + Path 2 + Path 4 evidence forms a robust empirical story; running additional experimental arms risked introducing power-limited contradictions. The --user-script flag and script_version kwarg in the harness are inert plumbing kept for transparency. Rationale logged as Deviation 5 in results/reports/deviations.md.
  • Companion paper: in preparation; this repository is the code-and-data companion to be released alongside the arXiv preprint.

What this is

Specificity Erosion is a hypothesized failure mode of multi-turn LLM deployments in which model responses drift over the course of a session, away from the institutional context they were conditioned on at turn 1 and toward the model's pretraining prior. In an enterprise setting (e.g., a regulated wealth-advisory firm with named methodologies, prohibitions, and house style), this drift presents as a quiet erosion of firm specificity: replies remain plausible but become generic.

DriftBench-SE measures this quantitatively. For each LLM response r_t at turn t, we compute a contrastive embedding margin

m_t = cos(emb(r_t), μ_inst) − cos(emb(r_t), μ_generic)

where μ_inst is the centroid of an institutional corpus and μ_generic is the centroid of a generic-domain corpus, both embedded with the same sentence-transformer (BGE-large-en-v1.5 primary; e5-large-v2 secondary; MiniLM-L6-v2 non-retrieval robustness). We compare a control arm (institutional context injected at turn 1 only) against a treatment arm (re-injected on every turn) over 20-turn sessions and test whether the per-turn margin decays in the control arm and whether the treatment arm flattens that decay.
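The margin computation is small enough to sketch directly. A minimal NumPy version, assuming pre-computed embeddings (function names here are illustrative, not necessarily the exact src/metrics.py API):

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def centroid(embs: np.ndarray) -> np.ndarray:
    """Corpus centroid: mean over document embeddings (rows)."""
    return embs.mean(axis=0)

def contrastive_margin(emb_rt: np.ndarray,
                       mu_inst: np.ndarray,
                       mu_generic: np.ndarray) -> float:
    """m_t = cos(emb(r_t), mu_inst) - cos(emb(r_t), mu_generic)."""
    return cos(emb_rt, mu_inst) - cos(emb_rt, mu_generic)

# Toy 2-D illustration: a response embedded near the institutional
# centroid gets a positive margin.
mu_inst = np.array([1.0, 0.0])
mu_gen = np.array([0.0, 1.0])
response = np.array([0.9, 0.1])
print(round(contrastive_margin(response, mu_inst, mu_gen), 3))  # → 0.883
```

In the actual pipeline the embeddings come from BGE-large-en-v1.5 (e5-large-v2 and MiniLM-L6-v2 as robustness checks), and the μ vectors are centroids over the validated corpora.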

Headline findings

Three converging conclusions, supported across embedders, sampling temperatures, and post-hoc hardening probes (full numbers in paper/section_3_2_1.md and results/reports/):

  1. T20 level effect is real and robust on Haiku 4.5. Treatment exceeds control at turn 20 with Hedges' g ≈ 0.90 (BGE), 0.97 (e5), 0.80 (MiniLM); the multiverse / specification-curve probe finds the arm[T.treatment] main effect is positive in 100 % of 48 LMM specifications (median +0.017, IQR [+0.007, +0.024]); a corpus-subsample probe finds the headline survives even when only 25 % of the institutional corpus anchors μ_inst (median Hedges' g = +0.886 across 50 random N=10 subsamples; 100 % reach Welch p<.05).
  2. H1 / H2 trajectory null is real on Haiku 4.5. Across embedders, the per-session Mann-Kendall test on the control arm's margin trajectory yields zero significant decreasing sessions; Bayes factors (BF₀₁) on the LMM turn:arm interaction are 8.7 / 22.9 / 2.8 (BGE / e5 / MiniLM), all favoring the null; TOST equivalence holds on e5 at α = .05. Higher temperature (T=0.7, N=6) does not unmask drift, ruling out the audit's "low-T mode-hugging" hypothesis.
  3. H6 cross-model on Sonnet 4.6 (N=3) does not replicate. Welch t = −0.05 (one-sided p = 0.520, BGE), but at N=3 the test has only 24 % power for the Haiku-sized effect (post-hoc; MDE = Hedges' g ≥ 2.48). The non-replication is consistent with both "model-specific to Haiku" and "under-powered at N=3"; we report both interpretations honestly. Resolving the ambiguity (Sonnet at N=12) is identified as the cleanest near-term follow-up.
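The level-effect statistics in findings 1 and 3 reduce to Welch's t plus a small-sample-corrected standardized mean difference. A sketch with synthetic per-session T20 margins standing in for the committed data (this may differ in detail from src/stats.py):

```python
import numpy as np
from scipy import stats

def hedges_g(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d on a pooled SD, times Hedges' small-sample correction."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return d * (1 - 3 / (4 * (na + nb) - 9))

rng = np.random.default_rng(0)
treat = rng.normal(0.05, 0.02, 12)  # synthetic treatment-arm T20 margins
ctrl = rng.normal(0.03, 0.02, 12)   # synthetic control-arm T20 margins
t, p = stats.ttest_ind(treat, ctrl, equal_var=False)  # Welch's t-test
print(f"Hedges' g = {hedges_g(treat, ctrl):.2f}, Welch t = {t:.2f}, p = {p:.4f}")
```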

The repository is publication-ready on findings 1+2 with the post-hoc hardening probes summarized in paper/section_3_2_1.md § "Hardening probes."
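The per-session trajectory test behind finding 2 is Mann-Kendall. For untied data its S statistic is a rescaling of Kendall's tau against the turn index, so a close stand-in (not necessarily the exact src/stats.py implementation) is:

```python
import numpy as np
from scipy.stats import kendalltau

def mann_kendall_trend(margins: np.ndarray) -> tuple[float, float]:
    """Monotonic-trend test on one session's margin trajectory.

    Kendall's tau of margin vs. turn index shares the Mann-Kendall
    S statistic's sign and, for untied data, its p-value.
    """
    turns = np.arange(1, len(margins) + 1)
    tau, p = kendalltau(turns, margins)
    return tau, p

# A flat-with-noise trajectory (the H1 null) should not test significant.
rng = np.random.default_rng(1)
flat = 0.04 + rng.normal(0, 0.01, 20)
tau, p = mann_kendall_trend(flat)
print(f"tau = {tau:+.2f}, p = {p:.3f}")
```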


Citation

If you use this software, the corpora, or the empirical results, please cite the companion paper (when available on arXiv) and this repository. GitHub renders a "Cite this repository" button driven by CITATION.cff.

Companion paper (BibTeX placeholder; replace with arXiv ID and DOI on preprint upload):

@article{rayman2026agentic,
  author  = {Rayman, Drew and Prakash, Pranadarth and Haraldsdottir, Kara Kristin Bloendal},
  title   = {The Nature of Agentic Drift: Detecting and Recalibrating Semantic Fidelity Loss in Enterprise AI Systems},
  year    = {2026},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  note    = {Replace XXXX.XXXXX with the assigned arXiv identifier.}
}

This repository (DOI-citable; archived at Zenodo):

@software{prakash2026driftbenchse,
  author  = {Rayman, Drew and Prakash, Pranadarth and Haraldsdottir, Kara Kristin Bloendal},
  title   = {DriftBench-SE: Empirical validation of the Specificity Erosion signal},
  year    = {2026},
  publisher = {Zenodo},
  doi     = {10.5281/zenodo.20098704},
  url     = {https://doi.org/10.5281/zenodo.20098704},
  version = {0.1.0}
}

OSF pre-registration:

@misc{prakash2026driftbenchprereg,
  author       = {Prakash, Pranadarth and Rayman, Drew and Haraldsdottir, Kara Kristin Bloendal},
  title        = {DriftBench-SE: Empirical validation of the Specificity Erosion signal (pre-registration)},
  year         = {2026},
  publisher    = {OSF},
  howpublished = {\url{https://osf.io/pkr84}}
}

Repository layout

driftbench-se/
├── README.md                  this file
├── LICENSE                    MIT (code) + per-zone licenses for data and reports
├── CITATION.cff               machine-readable citation metadata
├── CHANGELOG.md               release history
├── prereg.md                  locked OSF pre-registration (immutable)
├── pyproject.toml             pinned deps; py3.11; cross-platform via [tool.uv] required-environments
├── uv.lock                    deterministic dep resolution
├── requirements-lock.txt      pip-friendly lock derived from uv export
├── data/
│   ├── institutional/         40 inst_NNN.txt + manifest.json + NOTICE.md (CC-BY-4.0; synthetic)
│   ├── generic/               manifest.json with per-doc-type license metadata
│   │                          (CC-BY-SA-4.0 for Wikipedia, public domain for SEC ADV);
│   │                          docs NOT redistributed — fetch via recorded URLs
│   ├── user_script.json       locked 20-turn user script
│   └── filler_block.txt       neutral filler for the control_filler stretch arm
├── src/
│   ├── llm_clients.py         LLMClient ABC + Anthropic + OpenAI clients + SQLite cache + retry
│   ├── runner.py              run_session(): control/treatment/control_filler/amnesiac
│   ├── metrics.py             embed_texts, centroid, contrastive_margin, build_inst_context_block
│   ├── stats.py               MK, LMM (with fit_failed fallback), mediation, Holm,
│   │                          TOST, Savage-Dickey BF, post-hoc power, MDE
│   ├── corpora.py, viz.py, utils.py, __init__.py
├── scripts/
│   ├── 00_preflight_variance.py     Haiku 4.5 sampling-variance pre-flight
│   ├── 01_build_corpora.py          Wikipedia / SEC scraping for the generic corpus
│   ├── 02_validate_corpora.py       linear-probe + centroid-distance gates
│   ├── 03_run_experiment.py         primary + stretch arms; --user-script flag
│   ├── 04_compute_stats.py          primary statistical pipeline (H1, H2, level effect)
│   ├── 05_make_figures.py           Fig 1 (trajectory), Fig 2 (final dist), Fig 3 (UMAP), Fig 4 (robustness 2×2)
│   ├── 06_compute_stretch.py        Stretch H4/H5/H6/A7/A8 + cross-model run
│   ├── 07_robustness.py             Path 1: TOST + Bayes factor + post-hoc power
│   ├── 08_temperature_compare.py    Path 2: T=0.3 vs T=0.7 (H7a/b/c)
│   ├── 10_multiverse.py             Path 4 (i): 48-spec specification-curve analysis
│   ├── 11_residue_mechanism.py      Path 4 (ii): in-context residue trajectory test
│   ├── 12_negative_control.py       Path 4 (iii): generic-corpus split placebo + LOO sanity
│   └── 13_corpus_subsample.py       Path 4 (iv): institutional-corpus subsample robustness
│   # Slot 09 reserved for Path 3 (content-controlled user script); descoped 2026-05-08.
├── tests/                     73 unit tests; pytest target
├── results/
│   ├── raw/                   per-session JSON (gitignored; reproducible from cache)
│   ├── tables/                runs_with_margins{,_stretch}.parquet, residue_per_turn.parquet
│   └── reports/               stats_*, stretch_stats_*, robustness_*,
│                              temperature_compare_*, multiverse_*, residue_mechanism_*,
│                              negative_control_*, corpus_subsample_*, deviations.md, corpus_validation.json
├── figures/                   fig1-4 PDF + PNG (publication-ready)
└── paper/
    ├── section_3_2_1.md       drop-in §3.2.1 with primary + stretch + Path 1/2/4 numbers
    ├── stretch_protocol.md    pre-registered stretch protocol (locked 2026-05-05)
    └── paper_context_bundle.md  Claude.ai-uploadable bundle of every empirical artifact

Quick reproduce

The full pipeline runs end-to-end from a fresh clone in roughly 35 minutes on Apple Silicon (longer on an Intel Mac due to slower CPU embedding). The repository is hardware-agnostic: pyproject.toml carries [tool.uv] required-environments entries for arm64-darwin, x86_64-darwin, and linux-x86_64. On Intel Mac the resolution pins torch 2.2.2, transformers <5, and numpy <2 per Deviation 3.

# 1. Clone and install
git clone https://github.com/pranuprakash/driftbench-se.git
cd driftbench-se
uv sync

# 2. API keys (gitignored)
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
EOF
# (OPENAI_API_KEY only needed if you reactivate the original cross-vendor stretch arm;
#  the project as committed is Anthropic-only — Sonnet 4.6 substitutes per Deviation 2.)

# 3. Verify environment
uv run pytest tests/                        # expect 73 / 73 pass
uv run ruff check src/ scripts/ tests/      # expect "All checks passed!"
uv run mypy src/                            # expect "Success: no issues found"

# 4. Build the generic corpus from the recorded URLs (Wikipedia + SEC ADV).
#    The institutional corpus is already in data/institutional/ (synthetic).
uv run python scripts/01_build_corpora.py

# 5. Validate corpora against the locked gates (centroid distance ≥ 0.15, linear-probe ≥ 0.85)
uv run python scripts/02_validate_corpora.py
# Must print "VALIDATION PASS" before any experiment is allowed.

# 6. Primary experiment (Haiku 4.5, N=12, T=0.3, max_tokens=4096; ~140 min, ~$10)
uv run python scripts/03_run_experiment.py \
    --llm claude-haiku-4-5-20251001 \
    --n-sessions 12 --n-turns 20 --temperature 0.3 --max-tokens 4096

# 7. Compute primary statistics + render figures
uv run python scripts/04_compute_stats.py
uv run python scripts/05_make_figures.py

# 8. (Optional) Stretch arms — filler / amnesiac / cross-model on Sonnet 4.6
#    See paper/stretch_protocol.md for the locked protocol.
uv run python scripts/03_run_experiment.py --arms control_filler --n-sessions 12
uv run python scripts/03_run_experiment.py --arms amnesiac_control --n-sessions 12
uv run python scripts/03_run_experiment.py --llm claude-sonnet-4-6 --n-sessions 3 --max-cost-usd 6
uv run python scripts/06_compute_stretch.py

# 9. Robustness re-analysis + temperature sweep (Paths 1 and 2)
uv run python scripts/07_robustness.py
uv run python scripts/03_run_experiment.py --temperature 0.7 --n-sessions 6 --max-cost-usd 7
uv run python scripts/08_temperature_compare.py

# 10. Hardening probes (Path 4) — read-only on existing parquets, no API
uv run python scripts/10_multiverse.py
uv run python scripts/11_residue_mechanism.py
uv run python scripts/12_negative_control.py
uv run python scripts/13_corpus_subsample.py

The LLM cache at .cache/llm_cache.sqlite keys on (system_prompt + chat_history + session_idx), as noted in the deviation footnote in prereg.md; re-running with the same configuration is effectively free after the first execution. Re-running stats and figures on the committed runs_with_margins.parquet and runs_with_margins_stretch.parquet is also free (no API calls).
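The caching idea fits in a few lines. The field names and table schema below are illustrative, not the actual src/llm_clients.py implementation:

```python
import hashlib
import json
import sqlite3

def cache_key(system_prompt: str, chat_history: list[dict], session_idx: int) -> str:
    """Deterministic key over everything that defines one completion request."""
    payload = json.dumps(
        {"system": system_prompt, "history": chat_history, "session": session_idx},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

conn = sqlite3.connect(":memory:")  # the real harness persists to .cache/llm_cache.sqlite
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

key = cache_key("institutional system prompt", [{"role": "user", "content": "hi"}], 0)
conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, "cached reply"))
hit = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
print(hit[0])  # identical configuration -> cache hit, no API spend
```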


Outputs

  • results/raw/ — one JSON file per session × arm × LLM (gitignored; reproducible from the LLM cache).
  • results/tables/ — analyzed Parquet:
    • runs_with_margins.parquet — primary Haiku 4.5 (BGE + e5 margins).
    • runs_with_margins_stretch.parquet — primary + filler + amnesiac + Sonnet cross-model (BGE + e5 + MiniLM margins).
    • residue_per_turn.parquet — Path 4 (ii) cumulative-history residue trajectories.
  • results/reports/ — human-readable + JSON statistics. Start at:
    • stats_report.txt — primary headline (LMM, MK, Welch, Holm correction).
    • stretch_stats_report.txt — H4 / H5 / H6 / A7 / A8 results.
    • robustness_report.txt — Path 1 (TOST + Bayes + post-hoc power).
    • temperature_compare_report.txt — Path 2 (T=0.7 sweep).
    • multiverse_report.txt, residue_mechanism_report.txt, negative_control_report.txt, corpus_subsample_report.txt — Path 4 (i)–(iv) hardening probes.
    • deviations.md — appended-only log of all pre-registration deviations (six entries as of 2026-05-08).
    • corpus_validation.json — gate-check record (centroid distance 0.18; linear-probe 5-fold accuracy 0.99).
  • figures/fig1_trajectory.{pdf,png} (per-turn trajectory, control vs treatment, both embedders); fig2_final_distribution.{pdf,png} (T20 distribution); fig3_umap.{pdf,png} (corpus + response 2-D projection); fig4_robustness.{pdf,png} (2×2 small-multiple {BGE, e5} × {Haiku, Sonnet}).

A reviewer wanting only the result should open paper/section_3_2_1.md, then figures/fig1_trajectory.pdf and figures/fig2_final_distribution.pdf.
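For readers who want the numbers programmatically, the Parquet tables load with pandas. The column names below ('session', 'arm', 'turn', 'margin_bge') are assumptions about the schema; check the committed file before relying on them:

```python
import pandas as pd

# In practice: df = pd.read_parquet("results/tables/runs_with_margins.parquet")
# Toy rows with the assumed schema stand in here.
df = pd.DataFrame({
    "session": [0, 0, 1, 1],
    "arm": ["control", "treatment", "control", "treatment"],
    "turn": [20, 20, 20, 20],
    "margin_bge": [0.031, 0.052, 0.028, 0.049],
})
t20 = df[df["turn"] == 20].groupby("arm")["margin_bge"].agg(["mean", "std", "count"])
print(t20)
```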


Pre-registration and deviations

The pre-registration (prereg.md, OSF pkr84) was locked on 2026-05-05 before any LLM API call. It is immutable. All deviations from the locked plan are recorded with timestamp, the original locked text, the deviation, and a justification, in results/reports/deviations.md.

#   Deviation                               Summary
1   Secondary embedder substitution         stella_en_400M_v5 → intfloat/e5-large-v2 (xformers / MPS platform incompatibility)
2   Cross-LLM substitution                  gpt-4o-mini → claude-sonnet-4-6 at N=3 (no OpenAI key on the analysis machine)
3   Hardware-agnostic dependency markers    pyproject.toml resolves on Apple Silicon, Intel Mac, and Linux
4   Temperature sweep at T=0.7              exploratory Path-2 arm probing low-T mode-hugging
5   Path 3 descoped                         rationale for dropping the content-controlled user-script arm (script slot 09 reserved)
6   Path 4 hardening probes                 multiverse, residue mechanism, negative-control calibration, corpus subsample; additive only, no new API spend

License

See LICENSE for the full text. Summary:

  • Code (src/, scripts/, tests/, top-level Python project files): MIT License.
  • Institutional corpus (data/institutional/): CC-BY-4.0 (data/institutional/NOTICE.md). Synthetic; refers to a fictional firm ("Meridian Heritage Advisors") and is not investment advice.
  • Generic corpus: linked rather than redistributed. Source URLs, retrieval timestamps, and per-doc-type upstream licenses (Wikipedia: CC-BY-SA-4.0; SEC Form ADV Part 2A brochures: U.S. federal-government public-domain) are recorded in data/generic/manifest.json. Reproducers fetch the documents themselves at the recorded URLs to respect upstream terms.
  • Pre-registration, paper section, statistical reports, figures: CC-BY-4.0.

Authors and contact

For reproducibility questions or to report a deviation, please open an issue at https://github.com/pranuprakash/driftbench-se/issues or email the corresponding author.


Hardware requirements

  • Python 3.11 (pinned via .python-version).
  • ~10 GB free disk for sentence-transformer weights (BGE-large ≈ 1.3 GB; e5-large ≈ 1.3 GB; MiniLM-L6 ≈ 90 MB; the rest is the torch wheel and CUDA-free dependencies).
  • CPU is sufficient. No GPU is required at the corpus and response volumes used here. A consumer laptop with 16 GB RAM completes the full pipeline; Intel Mac is ~6× slower than Apple Silicon for the embedding stages but produces identical results.

