Add temperature/repeated-sampling support and Wilson intervals by weich97 · Pull Request #138 · weich97/TreLLM

weich97 · 2026-06-13T11:32:54Z

Infrastructure for confidence intervals on auditor recall (FinAudit / paper 04):

- configurable temperature on the analyst (default 0.0, deterministic runs unchanged)
- run_audit_eval.py --samples-per-task / --temperature, cache+checkpoint keyed by sample (sample 0 keeps the legacy key so the temp-0 main table replays)
- statistics.wilson_interval with unit tests (robust at 0/1 extremes)
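The sample-keyed cache can be pictured with a small sketch. This is only an illustration of the idea that sample 0 reuses the legacy key; the function name `sample_cache_key`, the `task_id` argument, and the `::sample=` suffix format are all hypothetical and are not taken from run_audit_eval.py.

```python
def sample_cache_key(task_id: str, sample_index: int) -> str:
    """Return a cache/checkpoint key for one (task, sample) pair.

    Hypothetical sketch: the real key format in run_audit_eval.py may differ.
    """
    if sample_index < 0:
        raise ValueError("sample_index must be non-negative")
    if sample_index == 0:
        # Sample 0 keeps the legacy (pre-sampling) key, so entries cached
        # by earlier temperature-0 runs are found again and replayed.
        return task_id
    # Later samples get a distinct suffix so they never collide with
    # the deterministic entry or with each other.
    return f"{task_id}::sample={sample_index}"
```

Under this scheme a run with --samples-per-task 3 would read the existing entry for sample 0 and only issue new calls for samples 1 and 2, assuming the cache lookup is otherwise unchanged.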

The temp-0.7 3-sample CI experiment is running; results land separately.

Test plan

- statistics tests (incl. Wilson extremes) pass
- live 3-sample smoke
- ruff clean

- DeepSeekLLMAnalyst gains a configurable temperature (default 0.0, so existing deterministic runs are unchanged).
- run_audit_eval.py accepts --samples-per-task and --temperature, keys the cache and checkpoint by sample (sample 0 keeps the legacy key so the deterministic main-table cache replays), enabling confidence intervals over auditor recall without disturbing the temp-0 results.
- statistics.wilson_interval: binomial score interval, robust at the 0/1 extremes that the auditor recalls hit, with unit tests.
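For readers unfamiliar with the Wilson score interval, here is a minimal standalone sketch of the standard formula. It is not the code merged in this PR: the name `wilson_interval_sketch`, its `(successes, trials, z)` signature, and the default `z = 1.96` (roughly a 95% interval) are assumptions made for illustration, and statistics.wilson_interval may be shaped differently.

```python
import math


def wilson_interval_sketch(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Illustrative only; signature and defaults are assumptions.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    # The centre is pulled from p_hat toward 0.5, which keeps the
    # interval inside [0, 1] even when p_hat is exactly 0 or 1.
    centre = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)) / denom
    )
    # Clamp to guard against floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

Unlike the normal-approximation (Wald) interval, which collapses to zero width when the observed recall is 0 or 1, this form still gives a non-degenerate interval there, which is the property the description calls robust at the 0/1 extremes.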

weich97 merged commit c8405e4 into main Jun 13, 2026
10 checks passed

weich97 deleted the audit-eval-sampling-ci branch June 13, 2026 11:37

weich97 merged 1 commit into main from audit-eval-sampling-ci