
Add paired-bootstrap CIs to memory-pollution model differences #142

Merged
weich97 merged 2 commits into main from mempollution-ci on Jun 15, 2026

Conversation


@weich97 weich97 commented Jun 15, 2026

Owner

Surfaces the paired-bootstrap 95% intervals that the memory-pollution analysis already computed but did not write out: model_difference.csv gains ci_low/ci_high columns, and the per-model figure gets error bars. This is consistent with the execution-sensitivity paper's CI reporting (the audit benchmark uses Wilson intervals because its outcome is a proportion). The intervals confirm the split: the gemini and deepseek shifts exclude zero, while the gpt and claude intervals include it.
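
For readers wiring the new columns into a plot, here is a minimal sketch of turning ci_low/ci_high into asymmetric error-bar lengths. Only the ci_low and ci_high column names come from this PR; the `model` and `shift` column names, the helper name, and the sample values are hypothetical placeholders, not the repository's actual schema or data.

```python
import csv
import io


def error_bar_offsets(rows):
    """Convert absolute CI bounds into offsets from the point estimate.

    Returns (models, shifts, [lower_offsets, upper_offsets]). The last item
    has the 2 x N layout that matplotlib's `yerr` argument accepts.
    Assumes hypothetical columns `model` and `shift` next to ci_low/ci_high.
    """
    models, shifts, lower, upper = [], [], [], []
    for row in rows:
        shift = float(row["shift"])
        models.append(row["model"])
        shifts.append(shift)
        # Clamp at zero: error-bar lengths must not be negative.
        lower.append(max(0.0, shift - float(row["ci_low"])))
        upper.append(max(0.0, float(row["ci_high"]) - shift))
    return models, shifts, [lower, upper]


# Made-up rows purely for illustration; not values from the analysis.
EXAMPLE_CSV = """model,shift,ci_low,ci_high
model_a,0.10,0.04,0.17
model_b,0.01,-0.05,0.06
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
models, shifts, yerr = error_bar_offsets(rows)
for name, row in zip(models, rows):
    excludes_zero = float(row["ci_low"]) > 0 or float(row["ci_high"]) < 0
    print(name, "excludes zero" if excludes_zero else "includes zero")
```

With matplotlib, the result could be passed along as `ax.bar(models, shifts, yerr=yerr)`. Offsets are used rather than absolute bounds because `yerr` expects distances from the bar height.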

Test plan

  • analysis re-runs, figure renders with error bars
  • ruff clean

weich97 added 2 commits June 15, 2026 11:12
- run_audit_eval.py: surface the exception message (not just the type)
  on a failed task so balance-exhaustion (402) is visible to wrappers.
- render_finaudit_figures.py: draw 95% Wilson intervals on the
  difficulty-tier bars when the CI experiment is present, using its
  temperature-0.7 point estimates for consistency.
- audit_eval_ci_wilson.csv: per-(model, tier) recall with Wilson 95%
  intervals over 6 models x 100 tasks x 3 samples. The difficulty
  inversion is confirmed: strong auditors' L1 intervals (lower bound
  >= 0.73) are disjoint from weak auditors' (upper bound <= 0.60).
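
A minimal sketch of the Wilson score interval for a binomial proportion such as recall, plus the disjointness check the commit message relies on. The function names and the counts in the demo are hypothetical; this is the textbook formula, not necessarily the code in render_finaudit_figures.py.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a proportion (z defaults to the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_disjoint(a, b):
    """True when two (low, high) intervals do not overlap."""
    return a[1] < b[0] or b[1] < a[0]


# Hypothetical counts for illustration only; not taken from the CSV.
strong = wilson_interval(260, 300)
weak = wilson_interval(150, 300)
print(strong, weak, intervals_disjoint(strong, weak))
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly near the boundaries, which is the usual reason to prefer it for proportions.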
model_difference.csv now carries the paired-bootstrap 95% interval on
each agent's hold-ratio shift (the interval the analysis already
computed but did not surface), and the per-model figure draws error
bars. This matches the confidence reporting in the execution-sensitivity
results; the binomial recall in the audit benchmark uses Wilson
intervals instead. The intervals confirm the model split: gemini-3.1-pro
and deepseek-v4-pro shifts exclude zero while gpt-5.5 and
claude-opus-4.7 sit on it.
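
A minimal sketch of a paired percentile bootstrap for a shift like the one described above. It assumes the statistic is the mean of per-unit paired differences (for example, hold ratio with and without pollution on the same unit); the PR does not state the resampling unit or the statistic, so the function name, arguments, and demo data are hypothetical.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (treated - baseline).

    Pairs are resampled together, so within-pair correlation is preserved.
    Returns (point_estimate, ci_low, ci_high).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample pair indices with replacement and average the differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


def excludes_zero(ci_low, ci_high):
    """True when the interval lies entirely on one side of zero."""
    return ci_low > 0 or ci_high < 0


# Made-up hold ratios for illustration only.
clean = [0.40, 0.55, 0.35, 0.60, 0.50, 0.45]
polluted = [0.52, 0.61, 0.44, 0.70, 0.58, 0.49]
point, low, high = paired_bootstrap_ci(clean, polluted)
print(point, low, high, excludes_zero(low, high))
```

Resampling the pairs rather than each condition independently is what makes the interval "paired": variation shared by both conditions cancels in each difference, which typically tightens the interval.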
@weich97 weich97 merged commit cc4c363 into main Jun 15, 2026
10 checks passed
@weich97 weich97 deleted the mempollution-ci branch June 15, 2026 03:24
