Confidence calibration toolkit for verbalized-probability outputs from LLMs. Benchmarked on 998 BoolQ questions with Llama-3.1-8B: ECE 0.148 -> 0.030, log-loss 3.9 -> 0.41 after calibration.
calibration ece isotonic-regression platt-scaling groq temperature-scaling llm anthropic brier boolq
Updated May 12, 2026 - Python
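
For readers unfamiliar with the two metrics quoted in the description, the following is a minimal sketch of how ECE and log-loss are commonly computed for binary predictions. It is not taken from the toolkit: the function names, the equal-width 10-bin scheme, and the clipping constant `eps` are assumptions chosen for illustration, and the repository may bin or clip differently.

```python
import math


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary predictions (illustrative, not the toolkit's API).

    probs  : predicted probability that the label is 1
    labels : ground-truth labels, 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # p == 1.0 would index past the last bin, so clamp it.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's confidence/frequency gap by its share of samples.
        ece += (len(bucket) / n) * abs(avg_conf - frac_pos)
    return ece


def log_loss(probs, labels, eps=1e-15):
    """Mean binary cross-entropy; eps clipping is an assumed convention."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip so that a stated probability of exactly 0 or 1 stays finite.
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.9, 0.9, 0.1]
    labels = [1, 1, 1, 0, 0]
    print("ECE:", expected_calibration_error(probs, labels))
    print("log-loss:", log_loss(probs, labels))
```

Log-loss penalizes confident wrong answers far more heavily than ECE does, so the two metrics can move by very different amounts under the same calibration step.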