Add paired-bootstrap CIs to memory-pollution model differences by weich97 · Pull Request #142 · weich97/TreLLM

weich97 · 2026-06-15T03:19:07Z

Surfaces the paired-bootstrap 95% intervals that the memory-pollution analysis already computed but did not output: model_difference.csv gains ci_low/ci_high columns and the per-model figure gets error bars. This is consistent with the execution-sensitivity paper's CI reporting (the audit benchmark uses Wilson intervals because its outcome is a proportion). The intervals confirm the split: the gemini/deepseek shifts exclude zero, while the gpt/claude intervals include it.

Test plan

analysis re-runs, figure renders with error bars
ruff clean

- run_audit_eval.py: surface the exception message (not just the type) on a failed task so balance-exhaustion (402) is visible to wrappers.
- render_finaudit_figures.py: draw 95% Wilson intervals on the difficulty-tier bars when the CI experiment is present, using its temperature-0.7 point estimates for consistency.
- audit_eval_ci_wilson.csv: per-(model, tier) recall with Wilson 95% intervals over 6 models x 100 tasks x 3 samples. The difficulty inversion is confirmed: strong auditors' L1 intervals (lower bound >= 0.73) are disjoint from weak auditors' (upper bound <= 0.60).
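For reference, a Wilson score interval for a binomial recall can be sketched as below. This is a minimal illustration, not the code in render_finaudit_figures.py: the function name `wilson_interval` and its arguments are hypothetical, and the counts in the usage note are made up rather than taken from audit_eval_ci_wilson.csv.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # The centre is pulled from p_hat towards 0.5, which keeps the interval inside [0, 1].
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Calling `wilson_interval(50, 100)` with these illustrative counts gives an interval that is symmetric about 0.5. Unlike the normal-approximation interval, the Wilson interval does not collapse to zero width when the observed recall is exactly 0 or 1.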

model_difference.csv now carries the paired-bootstrap 95% interval on each agent's hold-ratio shift (the interval the analysis already computed but did not surface), and the per-model figure draws error bars. This matches the confidence reporting in the execution-sensitivity results; the binomial recall in the audit benchmark uses Wilson intervals instead. The intervals confirm the model split: the gemini-3.1-pro and deepseek-v4-pro shifts exclude zero, while the gpt-5.5 and claude-opus-4.7 intervals include it.
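A paired-bootstrap percentile interval on a mean shift can be sketched as follows. This is an illustration under assumptions, not the analysis code in this repository: the function name `paired_bootstrap_ci`, the resample count, and the idea that each pair is one unit measured under a baseline and a polluted-memory condition are all hypothetical.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean_shift, ci_low, ci_high) for paired samples.

    Pairs are resampled together (by resampling the per-pair differences),
    which preserves the within-pair correlation.
    """
    if len(baseline) != len(treated) or len(baseline) == 0:
        raise ValueError("need equal-length, non-empty paired samples")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boot_means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    # Percentile interval: read off the alpha/2 and 1 - alpha/2 order statistics.
    low_idx = int((alpha / 2) * n_boot)
    high_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return fmean(diffs), boot_means[low_idx], boot_means[high_idx]
```

In this sketch, a shift "excludes zero" when `ci_low` and `ci_high` have the same sign. The two bounds would map onto the ci_low/ci_high columns, and the distances from the point estimate to each bound give asymmetric error-bar lengths.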

weich97 added 2 commits June 15, 2026 11:12

weich97 merged commit cc4c363 into main Jun 15, 2026
10 checks passed

weich97 deleted the mempollution-ci branch June 15, 2026 03:24
