
feat(evaluate): Sprint 16 - evaluator reliability + GGUF backend (v1.6.0) #8

Merged
wesleyscholl merged 1 commit into main from claude/konjo-toki-xPy83 on Jun 14, 2026

@konjoinfinity

Contributor

Summary

  • Evaluator reliability fix (arXiv 2603.06594): single LLM judges degrade to near-random accuracy on adversarial samples under distribution shift. Introduces RuleScorer + HybridScorer as a defensible measurement baseline that is not susceptible to judge distribution shift.
  • GGUF backend: GGUFEvaluator wraps llama-cpp-python for CPU-only inference on quantized models (Phi-3, Qwen2.5 Q4_K_M, etc.), so no GPU is required in CI. Raises ImportError cleanly when the dependency is absent.
  • CLI: python -m toki evaluate --evaluator rule|hybrid|gguf://path/to/model.gguf
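
The `--evaluator` value carries either a mode name or a `gguf://` model locator. A minimal sketch of how such a value could be split is shown here; `EvaluatorSpec` and `parse_evaluator_spec` are hypothetical names for illustration, not taken from this PR, and it assumes everything after `gguf://` is treated as a filesystem path.

```python
from typing import NamedTuple, Optional

GGUF_PREFIX = "gguf://"


class EvaluatorSpec(NamedTuple):
    """Hypothetical container: evaluator kind plus an optional model path."""
    kind: str
    model_path: Optional[str]


def parse_evaluator_spec(value: str) -> EvaluatorSpec:
    """Split an --evaluator value into a kind and an optional GGUF path."""
    if value in ("rule", "hybrid"):
        return EvaluatorSpec(kind=value, model_path=None)
    if value.startswith(GGUF_PREFIX):
        # Assumption: the remainder of the string is a local file path.
        path = value[len(GGUF_PREFIX):]
        if not path:
            raise ValueError("gguf:// must be followed by a model path")
        return EvaluatorSpec(kind="gguf", model_path=path)
    raise ValueError(f"unknown evaluator: {value!r}")
```

Rejecting an empty path up front keeps a malformed flag from surfacing later as a confusing model-load failure.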

Deliverables

| Symbol | Description |
| --- | --- |
| EvaluatorMode | StrEnum: RULE / LLM / HYBRID |
| RuleScorer | Compiled-regex safety scorer, zero external deps |
| ScoredResult | Frozen dataclass: score, rule_score, llm_score, agreement, flagged |
| HybridScorer | Ensemble of RuleScorer + optional JudgeBase; logs disagreements |
| GGUFEvaluator | llama-cpp-python backend with RuleScorer fallback on parse errors |
| RobustnessEvaluator | Extended with evaluator_mode + llm_judge params (backward compat) |
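
To illustrate how these pieces could fit together, here is a rough sketch rather than the code in this PR. The refusal pattern, the "1.0 means refusal" convention, averaging in HYBRID mode, and defining `agreement` as `1 - |rule - llm|` are all assumptions made for illustration. Only the field names and the default threshold of 0.2 come from the PR text. The PR uses `StrEnum`; the sketch uses `str` + `Enum` so it also runs before Python 3.11.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class EvaluatorMode(str, Enum):
    RULE = "rule"
    LLM = "llm"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class ScoredResult:
    score: float
    rule_score: float
    llm_score: Optional[float]
    agreement: Optional[float]
    flagged: bool


# Placeholder pattern (assumption): a real rule set would be far broader.
_REFUSAL = re.compile(r"\b(i can(?:no|')t|i won't|i am unable to)\b", re.IGNORECASE)


def rule_score(response: str) -> float:
    """Return 1.0 if the response looks like a refusal, else 0.0."""
    return 1.0 if _REFUSAL.search(response) else 0.0


def hybrid_score(
    response: str,
    mode: EvaluatorMode,
    llm_judge: Optional[Callable[[str], float]] = None,
    agreement_threshold: float = 0.2,
) -> ScoredResult:
    """Combine a rule score with an optional LLM judge score."""
    rule = rule_score(response)
    if mode is EvaluatorMode.RULE or llm_judge is None:
        return ScoredResult(rule, rule, None, None, False)
    llm = llm_judge(response)
    gap = abs(rule - llm)
    # Assumption: HYBRID averages the two scores; LLM mode trusts the judge.
    score = llm if mode is EvaluatorMode.LLM else (rule + llm) / 2
    return ScoredResult(
        score=score,
        rule_score=rule,
        llm_score=llm,
        agreement=1.0 - gap,
        flagged=gap > agreement_threshold,
    )
```

Because the rule score is always computed, a flagged result still carries a deterministic number that can be inspected when the judge drifts.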

Test plan

  • python/tests/test_hybrid_scorer.py: 27 tests covering EvaluatorMode, RuleScorer patterns, HybridScorer RULE/LLM/HYBRID modes, ScoredResult fields, disagreement flagging
  • python/tests/test_gguf_evaluator.py: 15 tests covering import guard, mocked llama_cpp.Llama, score clamping, parse-error fallback to RuleScorer
  • python/tests/test_evaluator_extended.py: 11 tests covering RobustnessEvaluator backward compat, RULE mode, HYBRID mode with MockJudge
  • python/tests/test_main.py additions: 5 CLI tests covering --evaluator rule, --evaluator hybrid, score-in-range, default=rule, GGUF ImportError
  • 600 / 600 tests passing (547 → 600)
  • cargo test green, cargo clippy -- -D warnings clean
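
The score-clamping and parse-error-fallback behaviour exercised by the GGUF tests could look roughly like the sketch below. It assumes the judge replies with free text containing a number and that valid scores live in [0, 1]. `clamp01` and `parse_judge_score` are hypothetical names, and the `fallback` callable stands in for a rule-based scorer.

```python
import re
from typing import Callable

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def parse_judge_score(
    raw: str,
    response: str,
    fallback: Callable[[str], float],
) -> float:
    """Extract the first number from the judge output, clamped to [0, 1].

    If no number is found (a parse error), score the original response
    with the fallback scorer instead.
    """
    match = _NUMBER.search(raw)
    if match is None:
        return clamp01(fallback(response))
    return clamp01(float(match.group()))
```

Clamping after parsing means an out-of-range reply from the model cannot push an aggregate metric outside its expected bounds.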

Unblocks

P3-3 (SaLoRA/SPLoRA safety-subspace LoRA) and P3-1 (dual-agent red-team loop) both require trustworthy measurement before their attack success rate (ASR) numbers are meaningful. This sprint provides it.

https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx


Generated by Claude Code

…6.0)

Addresses arXiv 2603.06594: single LLM judges degrade to near-random accuracy
on adversarial samples. Adds a hybrid scoring layer so measurement is defensible.

- EvaluatorMode (RULE | LLM | HYBRID) + RuleScorer (zero deps, compiled regex)
- HybridScorer: ensemble of RuleScorer + optional JudgeBase; logs disagreements
  when |rule − llm| > agreement_threshold (default 0.2)
- ScoredResult: frozen dataclass with score, rule_score, llm_score, agreement, flagged
- GGUFEvaluator: llama-cpp-python backend (optional dep); raises ImportError cleanly
  when absent; falls back to RuleScorer on parse errors
- RobustnessEvaluator: evaluator_mode + llm_judge optional params (backward compat)
- CLI: python -m toki evaluate --evaluator rule|hybrid|gguf://path
- pyproject.toml: [gguf] optional dep (llama-cpp-python>=0.2.0), version 1.6.0
- 53 new tests (27 hybrid_scorer + 15 gguf + 11 evaluator_extended + 5 CLI)
- 600 total tests passing (547 → 600)
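
For the optional-dependency behaviour, one common way to raise ImportError cleanly is to defer the import and re-raise with an actionable message. The sketch below is generic and not the code from this commit; `require_optional` is a hypothetical helper, and the wording of the error message is an assumption.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional module, or raise ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this backend; "
            f"install the [{extra}] optional dependency group"
        ) from exc
```

A backend would call something like `require_optional("llama_cpp", "gguf")` at construction time, so that importing the package itself never fails when the extra is not installed.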

https://claude.ai/code/session_01XCHiLCiVeL6WXQdsAcQTbx
@wesleyscholl wesleyscholl marked this pull request as ready for review June 14, 2026 12:53
@wesleyscholl wesleyscholl merged commit 518be83 into main Jun 14, 2026
7 checks passed
@wesleyscholl wesleyscholl deleted the claude/konjo-toki-xPy83 branch June 14, 2026 12:53