Add FinAudit recall confidence intervals from 3-sample experiment #141

Merged
weich97 merged 1 commit into main from finaudit-ci-intervals
Jun 15, 2026
@weich97 weich97 commented Jun 15, 2026

Owner

Confirms the FinAudit difficulty inversion with confidence intervals from a 3-sample (temperature 0.7) experiment over all six auditors (1800 instances).

  • run_audit_eval.py surfaces the exception message on task failure (so 402 exhaustion is visible to supervisors).
  • Figure 1 gains 95% Wilson intervals; the strong/weak L1 intervals are disjoint (strong lower bound >= 0.73, weak upper bound <= 0.60), so the inversion is not a sampling artifact.
  • audit_eval_ci_wilson.csv released.
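
For readers unfamiliar with the interval used here, the following is a minimal sketch of a 95% Wilson score interval and a disjointness check. It is not the repository's `statistics.wilson_interval`; the function names, signatures, and the example counts are assumptions made for illustration, and the counts are not taken from `audit_eval_ci_wilson.csv`. The sketch also treats the pooled samples as independent Bernoulli trials, which is the standard assumption behind the Wilson interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # The Wilson center is pulled from p_hat toward 0.5, more strongly for small n.
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_disjoint(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two closed intervals share no point."""
    return a[1] < b[0] or b[1] < a[0]


if __name__ == "__main__":
    # Hypothetical counts for one strong and one weak auditor on a single tier.
    strong = wilson_interval(75, 90)
    weak = wilson_interval(40, 90)
    print("strong:", strong)
    print("weak:", weak)
    print("disjoint:", intervals_disjoint(strong, weak))
```

Disjoint intervals are a conservative criterion: two intervals can overlap slightly while the underlying difference is still significant, so disjointness is sufficient evidence of a gap but not necessary for one.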

Test plan

  • ruff clean; figures render with error bars
  • statistics.wilson_interval unit-tested on main

- run_audit_eval.py: surface the exception message (not just the type)
  on a failed task so balance-exhaustion (402) is visible to wrappers.
- render_finaudit_figures.py: draw 95% Wilson intervals on the
  difficulty-tier bars when the CI experiment is present, using its
  temperature-0.7 point estimates for consistency.
- audit_eval_ci_wilson.csv: per-(model, tier) recall with Wilson 95%
  intervals over 6 models x 100 tasks x 3 samples. The difficulty
  inversion is confirmed: strong auditors' L1 intervals (lower bound
  >= 0.73) are disjoint from weak auditors' (upper bound <= 0.60).
@weich97 weich97 merged commit 4749907 into main Jun 15, 2026
10 checks passed
@weich97 weich97 deleted the finaudit-ci-intervals branch June 15, 2026 03:16