Add temperature/repeated-sampling support and Wilson intervals #138

Merged: weich97 merged 1 commit into main from audit-eval-sampling-ci on Jun 13, 2026
@weich97 (Owner) commented Jun 13, 2026

Infrastructure for confidence intervals on auditor recall (FinAudit / paper 04):

  • configurable temperature on the analyst (default 0.0, so deterministic runs are unchanged)
  • run_audit_eval.py gains --samples-per-task and --temperature; the cache and checkpoint are keyed by sample (sample 0 keeps the legacy key, so the temp-0 main table replays from the existing cache)
  • statistics.wilson_interval with unit tests (robust at the 0/1 extremes)

The temp-0.7 3-sample CI experiment is running; results land separately.
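
To illustrate the sample-keyed caching described above, here is a minimal sketch of how a key could be derived so that sample 0 reuses the legacy key. The function name `cache_key`, the `task_id` argument, and the `::sample` suffix format are all hypothetical; the actual key layout in run_audit_eval.py is not shown in this PR description.

```python
def cache_key(task_id: str, sample_index: int) -> str:
    """Hypothetical cache/checkpoint key for one (task, sample) pair.

    Assumption: the legacy (pre-sampling) key was just the task id.
    Sample 0 returns that key unchanged, so entries written by earlier
    deterministic runs are still found; later samples get a suffix.
    """
    if sample_index < 0:
        raise ValueError("sample_index must be non-negative")
    if sample_index == 0:
        return task_id  # legacy key: existing temp-0 cache entries replay
    return f"{task_id}::sample{sample_index}"  # suffix format is illustrative


# Keys for a 3-sample run over one hypothetical task id.
keys = [cache_key("task-001", i) for i in range(3)]
```

Under this scheme a run with --samples-per-task 1 touches only legacy keys, which is consistent with the stated goal of leaving the temp-0 results undisturbed.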

Test plan

  • statistics tests (incl. Wilson extremes) pass
  • live 3-sample smoke run
  • ruff clean

- DeepSeekLLMAnalyst gains a configurable temperature (default 0.0, so
  existing deterministic runs are unchanged).
- run_audit_eval.py accepts --samples-per-task and --temperature, keys
  the cache and checkpoint by sample (sample 0 keeps the legacy key so
  the deterministic main-table cache replays), enabling confidence
  intervals over auditor recall without disturbing the temp-0 results.
- statistics.wilson_interval: binomial score interval, robust at the
  0/1 extremes that the auditor recalls hit, with unit tests.
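
For reference, the Wilson score interval mentioned above can be sketched as follows. This is not the code merged in this PR: the signature, the default `z` of 1.96 (roughly a 95% interval), and the error handling are assumptions made for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative sketch).

    Assumed signature; the real statistics.wilson_interval may differ.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# At the extremes the interval stays informative instead of collapsing:
low_zero, high_zero = wilson_interval(0, 10)    # recall of 0/10
low_full, high_full = wilson_interval(10, 10)   # recall of 10/10
```

Unlike the normal-approximation (Wald) interval, which has zero width when the observed proportion is exactly 0 or 1, the Wilson interval keeps a non-trivial bound on the open side, which matters when recall values sit at those extremes.
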
@weich97 weich97 merged commit c8405e4 into main Jun 13, 2026
10 checks passed
@weich97 weich97 deleted the audit-eval-sampling-ci branch June 13, 2026 11:37