
Conversation


@jsong468 jsong468 commented Dec 23, 2025

Refactors the scorer evaluation framework to provide a simpler, more user-friendly API. Scorers now have direct evaluate_async() and get_scorer_metrics() methods, eliminating the need for manual evaluator instantiation.

Simplified API

  • New ScorerEvalDatasetFiles configuration
  • Maps input dataset file patterns (glob) to an output result file
  • Scorers define a default evaluation_file_mapping, which users can override
  • harm_category is now part of the config (required for harm scorers); a hedged usage sketch follows this list
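A minimal sketch of the simplified API described above, assuming parameter names that are not spelled out in this description (dataset_files, input_dataset_glob, output_result_file are illustrative, not the definitive signatures):

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer


async def main():
    scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

    # Option A: rely on the scorer's default evaluation_file_mapping.
    await scorer.evaluate_async()

    # Option B (hypothetical override): map input dataset globs to an output result file.
    # custom_files = ScorerEvalDatasetFiles(
    #     input_dataset_glob="datasets/score/refusal/*.csv",
    #     output_result_file="refusal_results.jsonl",
    # )
    # await scorer.evaluate_async(dataset_files=custom_files)

    # Metrics are available directly on the scorer after evaluation.
    metrics = scorer.get_scorer_metrics()
    print(metrics)


asyncio.run(main())
```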

New Scorer Evaluation Registry and scorer_metrics_io.py Module

  • Thread-safe utilities for reading and writing metrics to JSONL files. These files form PyRIT's official scoring registry, storing the metrics produced by running common scoring configurations on the official labeled datasets in the repo.
  • Note: evaluation can still be run without persisting results to the registry by using RegistryUpdateBehavior.NEVER_UPDATE.
  • The JSONL files are split by harm category for harm (float) scoring, with one additional file for objective_achieved scoring and one for refusal scoring.
  • get_all_objective_metrics and get_all_harm_metrics for browsing and comparing scorer configurations
  • find_metrics_by_hash for looking up a specific configuration
  • scorer_eval directories are cleaned up and split into harm, objective, and refusal; a hedged lookup sketch follows this list
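A minimal sketch of browsing the registry, assuming the import location and keyword arguments (harm_category, registry_update_behavior) purely for illustration:

```python
from pyrit.score import (  # assumed import location for the registry helpers
    RegistryUpdateBehavior,
    find_metrics_by_hash,
    get_all_harm_metrics,
    get_all_objective_metrics,
)

# Browse and compare metrics across all evaluated scorer configurations.
for metrics in get_all_objective_metrics():
    print(metrics)

# Harm metrics are split by category, so filtering by category is the natural entry point.
hate_metrics = get_all_harm_metrics(harm_category="hate_speech")  # hypothetical kwarg

# Look up the metrics for one specific scorer configuration by its hash.
specific = find_metrics_by_hash("<configuration-hash>")

# To evaluate without writing to the registry (hypothetical kwarg name):
# await scorer.evaluate_async(registry_update_behavior=RegistryUpdateBehavior.NEVER_UPDATE)
```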

Standardized CSV Format

  • Standard column names are auto-detected: assistant_response, human_score, objective/harm_category, data_type
  • Column names no longer need to be specified in API calls
  • CSVs now specify a version; harm evaluation datasets additionally specify the harm_category file (YAML) and the harm definition version. An illustrative layout follows this list.
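An illustrative (not authoritative) example of the standardized columns; only the column names come from this PR, and the row contents and column order are made up for demonstration:

```python
import csv
import io

sample = io.StringIO(
    "assistant_response,human_score,objective,data_type\n"
    '"I cannot help with that request.",0.0,"Explain how to pick a lock",text\n'
)

for row in csv.DictReader(sample):
    # Because the column names are standardized, the framework can auto-detect them
    # instead of requiring callers to pass column names in API calls.
    print(row["assistant_response"], row["human_score"], row["data_type"])
```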

Added harm definitions/versions to HarmScorerMetrics and more details overall in ScorerMetrics

  • This is important for knowing which definitions the responses were rated against.
  • Added the HarmDefinition class to manage this. HarmDefinition includes a version, which flows into HarmScorerMetrics, and HarmScorerMetrics gains new harm-specific fields.
  • The ScorerMetrics base class includes more details about the evaluation, such as num_responses and num_scorer_trials. An illustrative sketch of these shapes follows this list.
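A rough sketch of the metadata shapes, assuming field names beyond those named above (harm_category, harm_definition_version) purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class HarmDefinition:
    harm_category: str
    version: str  # definition version, recorded so metrics state what they were rated against


@dataclass
class ScorerMetrics:
    num_responses: int      # how many labeled responses were evaluated
    num_scorer_trials: int  # how many scoring trials were run per response


@dataclass
class HarmScorerMetrics(ScorerMetrics):
    harm_category: str
    harm_definition_version: str  # hypothetical field name; flows from HarmDefinition.version
```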

Removed Components

  • Deleted: scorer_evals.ipynb/.py (replaced by improved 8_scorer_metrics.ipynb)
  • Removed complex run_evaluation_from_csv_async() with many parameters

Added evaluate_scorers script

  • Seeds the JSONL registry files with metrics for an initial set of scorers; a hedged sketch follows.
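A minimal sketch of what such a seeding script could look like; the scorer list, the likert_scale value, and the default persistence behavior are assumptions for illustration:

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import LikertScalePaths, SelfAskLikertScorer, SelfAskRefusalScorer


async def seed_registry():
    target = OpenAIChatTarget()
    # Illustrative list; the real script may cover more scorers and configurations.
    scorers = [
        SelfAskRefusalScorer(chat_target=target),
        SelfAskLikertScorer(chat_target=target, likert_scale=LikertScalePaths.HARM_SCALE.value),
    ]
    for scorer in scorers:
        # With the default registry behavior, the resulting metrics are persisted to the JSONL files.
        await scorer.evaluate_async()


asyncio.run(seed_registry())
```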

Scorer Printer

  • Adds a Scorer Printer class that Scenarios and scorers use to display results

Documentation

  • Rewrote 8_scorer_metrics.ipynb with clear metric explanations and practical examples

Testing

  • Unit tests added for changes and existing tests updated accordingly
  • Removed the previous scoring integration tests; they no longer apply now that scorer configurations are much more granular, and we were not running them in our pipeline anyway.
  • Added tests to make sure every evaluation CSV can be parsed without error.

Breaking Changes

  • ScorerEvaluator has changed significantly, and some older methods no longer exist.
  • SelfAskLikertScorer parameter renamed from likert_scale_path to likert_scale (see the sketch after this list)
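A before/after sketch of the parameter rename; the value passed to likert_scale is an assumption, since this description only names the parameter change:

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import LikertScalePaths, SelfAskLikertScorer

target = OpenAIChatTarget()

# Before this PR:
# scorer = SelfAskLikertScorer(chat_target=target, likert_scale_path=LikertScalePaths.HARM_SCALE.value)

# After this PR (renamed parameter; accepted value type assumed here):
scorer = SelfAskLikertScorer(chat_target=target, likert_scale=LikertScalePaths.HARM_SCALE.value)
```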

TODO (future PRs)

  • Add debug info: after an evaluation, allow running it in a way that shows what was missed, using the trial_scores attribute of ScorerMetrics.
  • Add a ScenarioRegistry call so we make sure every scorer is evaluated for every scenario.
  • Add accurate models; I'd update the existing metrics rather than rerunning.


rlundeen2 commented Dec 28, 2025

I made a PR into your branch; check it out :)

jsong468#2

@jsong468 jsong468 changed the title DRAFT: Scorer evaluation refactor FEAT: Scorer evaluation refactor Jan 5, 2026

@rlundeen2 rlundeen2 left a comment


This looks great!

Because it is gigantic but has been worked on by both of us, I recommend the following next steps:

  1. Merge main
  2. Make sure the .ipynb files match the .py files
  3. Run integration tests and e2e tests to make sure nothing is broken
  4. Merge the PR
  5. Keep testing; do follow-up PRs for any issues

@jsong468 jsong468 changed the title FEAT: Scorer evaluation refactor BREAKING FEAT: Scorer evaluation refactor Jan 6, 2026
@jsong468 jsong468 changed the title BREAKING FEAT: Scorer evaluation refactor FEAT [BREAKING]: Scorer evaluation refactor Jan 6, 2026
@jsong468 jsong468 merged commit 4d9d72f into Azure:main Jan 7, 2026
20 checks passed