
DriftBench-SE

Empirical validation of the Specificity Erosion signal from the DriftAlert framework.

DOI · Pre-registered on OSF · License: MIT · Data License: CC-BY-4.0 · Python 3.11


Status

  • Pre-registered: OSF pkr84, locked 2026-05-05 before any LLM API call. The pre-registration is immutable; deviations are appended to results/reports/deviations.md.
  • Primary execution: complete (commit 5679dd7, Haiku 4.5 N=12 sessions × 20 turns × 2 arms; T=0.3 confirmatory).
  • Stretch (Path 1+2 robustness, Path 4 hardening): complete (commit 2328ce6, Sonnet 4.6 N=3 cross-model, Haiku T=0.7 N=6 sweep, four post-hoc hardening probes — multiverse, in-context residue mechanism, negative-control margin calibration, institutional-corpus subsample robustness).
  • Path 3 (content-controlled v2 user-script) is descoped: existing Path 1 + Path 2 + Path 4 evidence forms a robust empirical story; running additional experimental arms risked introducing power-limited contradictions. The --user-script flag and script_version kwarg in the harness are inert plumbing kept for transparency. Rationale logged as Deviation 5 in results/reports/deviations.md.
  • Companion paper: in preparation; this repository is the code-and-data companion to be released alongside the arXiv preprint.

What this is

Specificity Erosion is a hypothesized failure mode of multi-turn LLM deployments in which model responses drift over the course of a session, away from the institutional context they were conditioned on at turn 1 and toward the model's pretraining prior. In an enterprise setting (e.g., a regulated wealth-advisory firm with named methodologies, prohibitions, and house style), this drift presents as a quiet erosion of firm specificity: replies remain plausible but become generic.

DriftBench-SE measures this quantitatively. For each LLM response r_t at turn t, we compute a contrastive embedding margin

m_t = cos(emb(r_t), μ_inst) − cos(emb(r_t), μ_generic)

where μ_inst is the centroid of an institutional corpus and μ_generic is the centroid of a generic-domain corpus, both embedded with the same sentence-transformer (BGE-large-en-v1.5 primary; e5-large-v2 secondary; MiniLM-L6-v2 non-retrieval robustness). We compare a control arm (institutional context injected at turn 1 only) against a treatment arm (re-injected on every turn) over 20-turn sessions and test whether the per-turn margin decays in the control arm and whether the treatment arm flattens that decay.
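The margin computation is small enough to sketch directly. A minimal NumPy version, assuming pre-computed embeddings (function names here are illustrative, not necessarily the exact src/metrics.py API):

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def centroid(embs: np.ndarray) -> np.ndarray:
    """Corpus centroid: mean over document embeddings (rows)."""
    return embs.mean(axis=0)

def contrastive_margin(emb_rt: np.ndarray,
                       mu_inst: np.ndarray,
                       mu_generic: np.ndarray) -> float:
    """m_t = cos(emb(r_t), mu_inst) - cos(emb(r_t), mu_generic)."""
    return cos(emb_rt, mu_inst) - cos(emb_rt, mu_generic)

# Toy 2-D illustration: a response embedded near the institutional
# centroid gets a positive margin.
mu_inst = np.array([1.0, 0.0])
mu_gen = np.array([0.0, 1.0])
response = np.array([0.9, 0.1])
print(round(contrastive_margin(response, mu_inst, mu_gen), 3))  # → 0.883
```

In the actual pipeline the embeddings come from BGE-large-en-v1.5 (e5-large-v2 and MiniLM-L6-v2 as robustness checks), and the μ vectors are centroids over the validated corpora.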

Headline findings

Three converging conclusions, supported across embedders, sampling temperatures, and post-hoc hardening probes (full numbers in paper/section_3_2_1.md and results/reports/):

  1. T20 level effect is real and robust on Haiku 4.5. Treatment exceeds control at turn 20 with Hedges' g ≈ 0.90 (BGE), 0.97 (e5), 0.80 (MiniLM); the multiverse / specification-curve probe finds the arm[T.treatment] main effect is positive in 100 % of 48 LMM specifications (median +0.017, IQR [+0.007, +0.024]); a corpus-subsample probe finds the headline survives even when only 25 % of the institutional corpus anchors μ_inst (median Hedges' g = +0.886 across 50 random N=10 subsamples; 100 % reach Welch p<.05).
  2. H1 / H2 trajectory null is real on Haiku 4.5. Across embedders, the per-session Mann-Kendall test on the control arm's margin trajectory yields zero significant decreasing sessions; Bayes factors (BF₀₁) on the LMM turn:arm interaction are 8.7 / 22.9 / 2.8 (BGE / e5 / MiniLM), all favoring the null; TOST equivalence holds on e5 at α = .05. Higher temperature (T=0.7, N=6) does not unmask drift, ruling out the audit's "low-T mode-hugging" hypothesis.
  3. H6 cross-model on Sonnet 4.6 (N=3) does not replicate. Welch t = −0.05 (one-sided p = 0.520, BGE), but at N=3 the test has only 24 % power for the Haiku-sized effect (post-hoc; MDE = Hedges' g ≥ 2.48). The non-replication is consistent with both "model-specific to Haiku" and "under-powered at N=3"; we report both interpretations honestly. Resolving the ambiguity (Sonnet at N=12) is identified as the cleanest near-term follow-up.
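The level-effect statistics in findings 1 and 3 reduce to Welch's t plus a small-sample-corrected standardized mean difference. A sketch with synthetic per-session T20 margins standing in for the committed data (this may differ in detail from src/stats.py):

```python
import numpy as np
from scipy import stats

def hedges_g(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d on a pooled SD, times Hedges' small-sample correction."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return d * (1 - 3 / (4 * (na + nb) - 9))

rng = np.random.default_rng(0)
treat = rng.normal(0.05, 0.02, 12)  # synthetic treatment-arm T20 margins
ctrl = rng.normal(0.03, 0.02, 12)   # synthetic control-arm T20 margins
t, p = stats.ttest_ind(treat, ctrl, equal_var=False)  # Welch's t-test
print(f"Hedges' g = {hedges_g(treat, ctrl):.2f}, Welch t = {t:.2f}, p = {p:.4f}")
```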

The repository is publication-ready on findings 1+2 with the post-hoc hardening probes summarized in paper/section_3_2_1.md § "Hardening probes."
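The per-session trajectory test behind finding 2 is Mann-Kendall. For untied data its S statistic is a rescaling of Kendall's tau against the turn index, so a close stand-in (not necessarily the exact src/stats.py implementation) is:

```python
import numpy as np
from scipy.stats import kendalltau

def mann_kendall_trend(margins: np.ndarray) -> tuple[float, float]:
    """Monotonic-trend test on one session's margin trajectory.

    Kendall's tau of margin vs. turn index shares the Mann-Kendall
    S statistic's sign and, for untied data, its p-value.
    """
    turns = np.arange(1, len(margins) + 1)
    tau, p = kendalltau(turns, margins)
    return tau, p

# A flat-with-noise trajectory (the H1 null) should not test significant.
rng = np.random.default_rng(1)
flat = 0.04 + rng.normal(0, 0.01, 20)
tau, p = mann_kendall_trend(flat)
print(f"tau = {tau:+.2f}, p = {p:.3f}")
```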


Citation

If you use this software, the corpora, or the empirical results, please cite the companion paper (when available on arXiv) and this repository. GitHub renders a "Cite this repository" button driven by CITATION.cff.

Companion paper (BibTeX placeholder; replace with arXiv ID and DOI on preprint upload):

@article{rayman2026agentic,
  author  = {Rayman, Drew and Prakash, Pranadarth and Haraldsdottir, Kara Kristin Bloendal},
  title   = {The Nature of Agentic Drift: Detecting and Recalibrating Semantic Fidelity Loss in Enterprise AI Systems},
  year    = {2026},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  note    = {Replace XXXX.XXXXX with the assigned arXiv identifier.}
}

This repository (DOI-citable; archived at Zenodo):

@software{prakash2026driftbenchse,
  author  = {Rayman, Drew and Prakash, Pranadarth and Haraldsdottir, Kara Kristin Bloendal},
  title   = {DriftBench-SE: Empirical validation of the Specificity Erosion signal},
  year    = {2026},
  publisher = {Zenodo},
  doi     = {10.5281/zenodo.20098704},
  url     = {https://doi.org/10.5281/zenodo.20098704},
  version = {0.1.0}
}

OSF pre-registration:

@misc{prakash2026driftbenchprereg,
  author       = {Prakash, Pranadarth and Rayman, Drew and Haraldsdottir, Kara Kristin Bloendal},
  title        = {DriftBench-SE: Empirical validation of the Specificity Erosion signal (pre-registration)},
  year         = {2026},
  publisher    = {OSF},
  howpublished = {\url{https://osf.io/pkr84}}
}

Repository layout

driftbench-se/
├── README.md                  this file
├── LICENSE                    MIT (code) + per-zone licenses for data and reports
├── CITATION.cff               machine-readable citation metadata
├── CHANGELOG.md               release history
├── prereg.md                  locked OSF pre-registration (immutable)
├── pyproject.toml             pinned deps; py3.11; cross-platform via [tool.uv] required-environments
├── uv.lock                    deterministic dep resolution
├── requirements-lock.txt      pip-friendly lock derived from uv export
├── data/
│   ├── institutional/         40 inst_NNN.txt + manifest.json + NOTICE.md (CC-BY-4.0; synthetic)
│   ├── generic/               manifest.json with per-doc-type license metadata
│   │                          (CC-BY-SA-4.0 for Wikipedia, public domain for SEC ADV);
│   │                          docs NOT redistributed — fetch via recorded URLs
│   ├── user_script.json       locked 20-turn user script
│   └── filler_block.txt       neutral filler for the control_filler stretch arm
├── src/
│   ├── llm_clients.py         LLMClient ABC + Anthropic + OpenAI clients + SQLite cache + retry
│   ├── runner.py              run_session(): control/treatment/control_filler/amnesiac
│   ├── metrics.py             embed_texts, centroid, contrastive_margin, build_inst_context_block
│   ├── stats.py               MK, LMM (with fit_failed fallback), mediation, Holm,
│   │                          TOST, Savage-Dickey BF, post-hoc power, MDE
│   ├── corpora.py, viz.py, utils.py, __init__.py
├── scripts/
│   ├── 00_preflight_variance.py     Haiku 4.5 sampling-variance pre-flight
│   ├── 01_build_corpora.py          Wikipedia / SEC scraping for the generic corpus
│   ├── 02_validate_corpora.py       linear-probe + centroid-distance gates
│   ├── 03_run_experiment.py         primary + stretch arms; --user-script flag
│   ├── 04_compute_stats.py          primary statistical pipeline (H1, H2, level effect)
│   ├── 05_make_figures.py           Fig 1 (trajectory), Fig 2 (final dist), Fig 3 (UMAP), Fig 4 (robustness 2×2)
│   ├── 06_compute_stretch.py        Stretch H4/H5/H6/A7/A8 + cross-model run
│   ├── 07_robustness.py             Path 1: TOST + Bayes factor + post-hoc power
│   ├── 08_temperature_compare.py    Path 2: T=0.3 vs T=0.7 (H7a/b/c)
│   ├── 10_multiverse.py             Path 4 (i): 48-spec specification-curve analysis
│   ├── 11_residue_mechanism.py      Path 4 (ii): in-context residue trajectory test
│   ├── 12_negative_control.py       Path 4 (iii): generic-corpus split placebo + LOO sanity
│   └── 13_corpus_subsample.py       Path 4 (iv): institutional-corpus subsample robustness
│   # Slot 09 reserved for Path 3 (content-controlled user script); descoped 2026-05-08.
├── tests/                     73 unit tests; pytest target
├── results/
│   ├── raw/                   per-session JSON (gitignored; reproducible from cache)
│   ├── tables/                runs_with_margins{,_stretch}.parquet, residue_per_turn.parquet
│   └── reports/               stats_*, stretch_stats_*, robustness_*,
│                              temperature_compare_*, multiverse_*, residue_mechanism_*,
│                              negative_control_*, corpus_subsample_*, deviations.md, corpus_validation.json
├── figures/                   fig1-4 PDF + PNG (publication-ready)
└── paper/
    ├── section_3_2_1.md       drop-in §3.2.1 with primary + stretch + Path 1/2/4 numbers
    ├── stretch_protocol.md    pre-registered stretch protocol (locked 2026-05-05)
    └── paper_context_bundle.md  Claude.ai-uploadable bundle of every empirical artifact

Quick reproduce

The full pipeline runs end-to-end from a fresh clone in roughly 35 minutes on Apple Silicon (longer on an Intel Mac due to slower CPU embedding). The repository is hardware-agnostic: pyproject.toml carries [tool.uv] required-environments entries for arm64-darwin, x86_64-darwin, and linux-x86_64. On Intel Mac the resolution pins torch 2.2.2, transformers <5, and numpy <2 per Deviation 3.

# 1. Clone and install
git clone https://github.com/pranuprakash/driftbench-se.git
cd driftbench-se
uv sync

# 2. API keys (gitignored)
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
EOF
# (OPENAI_API_KEY only needed if you reactivate the original cross-vendor stretch arm;
#  the project as committed is Anthropic-only — Sonnet 4.6 substitutes per Deviation 2.)

# 3. Verify environment
uv run pytest tests/                        # expect 73 / 73 pass
uv run ruff check src/ scripts/ tests/      # expect "All checks passed!"
uv run mypy src/                            # expect "Success: no issues found"

# 4. Build the generic corpus from the recorded URLs (Wikipedia + SEC ADV).
#    The institutional corpus is already in data/institutional/ (synthetic).
uv run python scripts/01_build_corpora.py

# 5. Validate corpora against the locked gates (centroid distance ≥ 0.15, linear-probe ≥ 0.85)
uv run python scripts/02_validate_corpora.py
# Must print "VALIDATION PASS" before any experiment is allowed.

# 6. Primary experiment (Haiku 4.5, N=12, T=0.3, max_tokens=4096; ~140 min, ~$10)
uv run python scripts/03_run_experiment.py \
    --llm claude-haiku-4-5-20251001 \
    --n-sessions 12 --n-turns 20 --temperature 0.3 --max-tokens 4096

# 7. Compute primary statistics + render figures
uv run python scripts/04_compute_stats.py
uv run python scripts/05_make_figures.py

# 8. (Optional) Stretch arms — filler / amnesiac / cross-model on Sonnet 4.6
#    See paper/stretch_protocol.md for the locked protocol.
uv run python scripts/03_run_experiment.py --arms control_filler --n-sessions 12
uv run python scripts/03_run_experiment.py --arms amnesiac_control --n-sessions 12
uv run python scripts/03_run_experiment.py --llm claude-sonnet-4-6 --n-sessions 3 --max-cost-usd 6
uv run python scripts/06_compute_stretch.py

# 9. Robustness re-analysis + temperature sweep (Paths 1 and 2)
uv run python scripts/07_robustness.py
uv run python scripts/03_run_experiment.py --temperature 0.7 --n-sessions 6 --max-cost-usd 7
uv run python scripts/08_temperature_compare.py

# 10. Hardening probes (Path 4) — read-only on existing parquets, no API
uv run python scripts/10_multiverse.py
uv run python scripts/11_residue_mechanism.py
uv run python scripts/12_negative_control.py
uv run python scripts/13_corpus_subsample.py

The LLM cache at .cache/llm_cache.sqlite keys on (system_prompt + chat_history + session_idx), as noted in the deviation footnote in prereg.md; re-running with the same configuration is effectively free after the first execution. Re-running stats and figures on the committed runs_with_margins.parquet and runs_with_margins_stretch.parquet is also free (no API calls).
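The caching idea fits in a few lines. The field names and table schema below are illustrative, not the actual src/llm_clients.py implementation:

```python
import hashlib
import json
import sqlite3

def cache_key(system_prompt: str, chat_history: list[dict], session_idx: int) -> str:
    """Deterministic key over everything that defines one completion request."""
    payload = json.dumps(
        {"system": system_prompt, "history": chat_history, "session": session_idx},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

conn = sqlite3.connect(":memory:")  # the real harness persists to .cache/llm_cache.sqlite
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

key = cache_key("institutional system prompt", [{"role": "user", "content": "hi"}], 0)
conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, "cached reply"))
hit = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
print(hit[0])  # identical configuration -> cache hit, no API spend
```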


Outputs

  • results/raw/ — one JSON file per session × arm × LLM (gitignored; reproducible from the LLM cache).
  • results/tables/ — analyzed Parquet:
    • runs_with_margins.parquet — primary Haiku 4.5 (BGE + e5 margins).
    • runs_with_margins_stretch.parquet — primary + filler + amnesiac + Sonnet cross-model (BGE + e5 + MiniLM margins).
    • residue_per_turn.parquet — Path 4 (ii) cumulative-history residue trajectories.
  • results/reports/ — human-readable + JSON statistics. Start at:
    • stats_report.txt — primary headline (LMM, MK, Welch, Holm correction).
    • stretch_stats_report.txt — H4 / H5 / H6 / A7 / A8 results.
    • robustness_report.txt — Path 1 (TOST + Bayes + post-hoc power).
    • temperature_compare_report.txt — Path 2 (T=0.7 sweep).
    • multiverse_report.txt, residue_mechanism_report.txt, negative_control_report.txt, corpus_subsample_report.txt — Path 4 (i)–(iv) hardening probes.
    • deviations.md — appended-only log of all pre-registration deviations (six entries as of 2026-05-08).
    • corpus_validation.json — gate-check record (centroid distance 0.18; linear-probe 5-fold accuracy 0.99).
  • figures/fig1_trajectory.{pdf,png} (per-turn trajectory, control vs treatment, both embedders); fig2_final_distribution.{pdf,png} (T20 distribution); fig3_umap.{pdf,png} (corpus + response 2-D projection); fig4_robustness.{pdf,png} (2×2 small-multiple {BGE, e5} × {Haiku, Sonnet}).

A reviewer wanting only the result should open paper/section_3_2_1.md, then figures/fig1_trajectory.pdf and figures/fig2_final_distribution.pdf.
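For readers who want the numbers programmatically, the Parquet tables load with pandas. The column names below ('session', 'arm', 'turn', 'margin_bge') are assumptions about the schema; check the committed file before relying on them:

```python
import pandas as pd

# In practice: df = pd.read_parquet("results/tables/runs_with_margins.parquet")
# Toy rows with the assumed schema stand in here.
df = pd.DataFrame({
    "session": [0, 0, 1, 1],
    "arm": ["control", "treatment", "control", "treatment"],
    "turn": [20, 20, 20, 20],
    "margin_bge": [0.031, 0.052, 0.028, 0.049],
})
t20 = df[df["turn"] == 20].groupby("arm")["margin_bge"].agg(["mean", "std", "count"])
print(t20)
```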


Pre-registration and deviations

The pre-registration (prereg.md, OSF pkr84) was locked on 2026-05-05 before any LLM API call. It is immutable. All deviations from the locked plan are recorded with timestamp, the original locked text, the deviation, and a justification, in results/reports/deviations.md.

#   Deviation                               Summary
1   Secondary embedder substitution         stella_en_400M_v5 → intfloat/e5-large-v2 (xformers / MPS platform incompatibility)
2   Cross-LLM substitution                  gpt-4o-mini → claude-sonnet-4-6 at N=3 (no OpenAI key on the analysis machine)
3   Hardware-agnostic dependency markers    pyproject.toml resolves on Apple Silicon, Intel Mac, and Linux
4   Temperature sweep at T=0.7              exploratory Path-2 arm probing low-T mode-hugging
5   Path 3 descoped                         rationale for dropping the content-controlled user-script arm (script slot 09 reserved)
6   Path 4 hardening probes                 multiverse, residue mechanism, negative-control calibration, corpus subsample; additive only, no new API spend

License

See LICENSE for the full text. Summary:

  • Code (src/, scripts/, tests/, top-level Python project files): MIT License.
  • Institutional corpus (data/institutional/): CC-BY-4.0 (data/institutional/NOTICE.md). Synthetic; refers to a fictional firm ("Meridian Heritage Advisors") and is not investment advice.
  • Generic corpus: linked rather than redistributed. Source URLs, retrieval timestamps, and per-doc-type upstream licenses (Wikipedia: CC-BY-SA-4.0; SEC Form ADV Part 2A brochures: U.S. federal-government public-domain) are recorded in data/generic/manifest.json. Reproducers fetch the documents themselves at the recorded URLs to respect upstream terms.
  • Pre-registration, paper section, statistical reports, figures: CC-BY-4.0.

Authors and contact

For reproducibility questions or to report a deviation, please open an issue at https://github.com/pranuprakash/driftbench-se/issues or email the corresponding author.


Hardware requirements

  • Python 3.11 (pinned via .python-version).
  • ~10 GB free disk for sentence-transformer weights (BGE-large ≈ 1.3 GB; e5-large ≈ 1.3 GB; MiniLM-L6 ≈ 90 MB; the rest is the torch wheel and CUDA-free dependencies).
  • CPU is sufficient. No GPU is required at the corpus and response volumes used here. A consumer laptop with 16 GB RAM completes the full pipeline; Intel Mac is ~6× slower than Apple Silicon for the embedding stages but produces identical results.

