
Conversation


@jsong468 jsong468 commented Dec 23, 2025

Refactors the scorer evaluation framework to provide a simpler, more user-friendly API. Scorers now have direct evaluate_async() and get_scorer_metrics() methods, eliminating the need for manual evaluator instantiation.

Simplified API

  • New ScorerEvalDatasetFiles configuration
  • Maps input dataset file patterns (glob) to an output result file
  • Scorers define a default evaluation_file_mapping, which users can override
  • harm_category is now part of the config (required for harm scorers); a hedged usage sketch follows this list
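A minimal sketch of the simplified API described above, assuming parameter names that are not spelled out in this description (dataset_files, input_dataset_glob, output_result_file are illustrative, not the definitive signatures):

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer


async def main():
    scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

    # Option A: rely on the scorer's default evaluation_file_mapping.
    await scorer.evaluate_async()

    # Option B (hypothetical override): map input dataset globs to an output result file.
    # custom_files = ScorerEvalDatasetFiles(
    #     input_dataset_glob="datasets/score/refusal/*.csv",
    #     output_result_file="refusal_results.jsonl",
    # )
    # await scorer.evaluate_async(dataset_files=custom_files)

    # Metrics are available directly on the scorer after evaluation.
    metrics = scorer.get_scorer_metrics()
    print(metrics)


asyncio.run(main())
```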

New Scorer Evaluation Registry and scorer_metrics_io.py Module

  • Thread-safe utilities for reading and writing metrics to JSONL files. These files form PyRIT's official scoring registry, storing the metrics produced by running common scoring configurations on the official labeled datasets in the repo.
  • Note: evaluation can still be run without persisting results to the registry by using RegistryUpdateBehavior.NEVER_UPDATE.
  • The JSONL files are split by harm category for harm (float) scoring, with one additional file for objective_achieved scoring and one for refusal scoring.
  • get_all_objective_metrics and get_all_harm_metrics for browsing and comparing scorer configurations
  • find_metrics_by_hash for looking up a specific configuration
  • scorer_eval directories are cleaned up and split into harm, objective, and refusal; a hedged lookup sketch follows this list
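A minimal sketch of browsing the registry, assuming the import location and keyword arguments (harm_category, registry_update_behavior) purely for illustration:

```python
from pyrit.score import (  # assumed import location for the registry helpers
    RegistryUpdateBehavior,
    find_metrics_by_hash,
    get_all_harm_metrics,
    get_all_objective_metrics,
)

# Browse and compare metrics across all evaluated scorer configurations.
for metrics in get_all_objective_metrics():
    print(metrics)

# Harm metrics are split by category, so filtering by category is the natural entry point.
hate_metrics = get_all_harm_metrics(harm_category="hate_speech")  # hypothetical kwarg

# Look up the metrics for one specific scorer configuration by its hash.
specific = find_metrics_by_hash("<configuration-hash>")

# To evaluate without writing to the registry (hypothetical kwarg name):
# await scorer.evaluate_async(registry_update_behavior=RegistryUpdateBehavior.NEVER_UPDATE)
```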

Standardized CSV Format

  • Standard column names are auto-detected: assistant_response, human_score, objective/harm_category, data_type
  • Column names no longer need to be specified in API calls
  • CSVs now specify a version; harm evaluation datasets additionally specify the harm_category file (YAML) and the harm definition version. An illustrative layout follows this list.
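An illustrative (not authoritative) example of the standardized columns; only the column names come from this PR, and the row contents and column order are made up for demonstration:

```python
import csv
import io

sample = io.StringIO(
    "assistant_response,human_score,objective,data_type\n"
    '"I cannot help with that request.",0.0,"Explain how to pick a lock",text\n'
)

for row in csv.DictReader(sample):
    # Because the column names are standardized, the framework can auto-detect them
    # instead of requiring callers to pass column names in API calls.
    print(row["assistant_response"], row["human_score"], row["data_type"])
```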

Added harm definitions/versions to HarmScorerMetrics and more details overall in ScorerMetrics

  • This is important for knowing which definitions the responses were rated against.
  • Added the HarmDefinition class to manage this. HarmDefinition includes a version, which flows into HarmScorerMetrics, and HarmScorerMetrics gains new harm-specific fields.
  • The ScorerMetrics base class includes more details about the evaluation, such as num_responses and num_scorer_trials. An illustrative sketch of these shapes follows this list.
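A rough sketch of the metadata shapes, assuming field names beyond those named above (harm_category, harm_definition_version) purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class HarmDefinition:
    harm_category: str
    version: str  # definition version, recorded so metrics state what they were rated against


@dataclass
class ScorerMetrics:
    num_responses: int      # how many labeled responses were evaluated
    num_scorer_trials: int  # how many scoring trials were run per response


@dataclass
class HarmScorerMetrics(ScorerMetrics):
    harm_category: str
    harm_definition_version: str  # hypothetical field name; flows from HarmDefinition.version
```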

Removed Components

  • Deleted: scorer_evals.ipynb/.py (replaced by improved 8_scorer_metrics.ipynb)
  • Removed complex run_evaluation_from_csv_async() with many parameters

Added evaluate_scorers script

  • Seeds the JSONL registry files with metrics for an initial set of scorers; a hedged sketch follows.
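A minimal sketch of what such a seeding script could look like; the scorer list, the likert_scale value, and the default persistence behavior are assumptions for illustration:

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import LikertScalePaths, SelfAskLikertScorer, SelfAskRefusalScorer


async def seed_registry():
    target = OpenAIChatTarget()
    # Illustrative list; the real script may cover more scorers and configurations.
    scorers = [
        SelfAskRefusalScorer(chat_target=target),
        SelfAskLikertScorer(chat_target=target, likert_scale=LikertScalePaths.HARM_SCALE.value),
    ]
    for scorer in scorers:
        # With the default registry behavior, the resulting metrics are persisted to the JSONL files.
        await scorer.evaluate_async()


asyncio.run(seed_registry())
```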

Scorer Printer

  • Adds a Scorer Printer class that Scenarios and scorers use to display results

Documentation

  • Rewrote 8_scorer_metrics.ipynb with clear metric explanations and practical examples

Testing

  • Unit tests added for changes and existing tests updated accordingly
  • Removed the previous scoring integration tests; they no longer apply now that scorer configurations are much more granular, and we were not running them in our pipeline anyway.
  • Added tests to make sure every evaluation CSV can be parsed without error.

Breaking Changes

  • ScorerEvaluator has changed significantly, and some older methods no longer exist.
  • SelfAskLikertScorer parameter renamed from likert_scale_path to likert_scale (see the sketch after this list)
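A before/after sketch of the parameter rename; the value passed to likert_scale is an assumption, since this description only names the parameter change:

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import LikertScalePaths, SelfAskLikertScorer

target = OpenAIChatTarget()

# Before this PR:
# scorer = SelfAskLikertScorer(chat_target=target, likert_scale_path=LikertScalePaths.HARM_SCALE.value)

# After this PR (renamed parameter; accepted value type assumed here):
scorer = SelfAskLikertScorer(chat_target=target, likert_scale=LikertScalePaths.HARM_SCALE.value)
```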

TODO (future PRs)

  • Add debug info: after an evaluation, allow running it in a way that shows what was missed, using the trial_scores attribute of ScorerMetrics.
  • Add a ScenarioRegistry call so we make sure every scorer is evaluated for every scenario.
  • Add accurate models; I'd update the existing metrics rather than rerunning.


rlundeen2 commented Dec 28, 2025

I made a PR into your branch; check it out :)

jsong468#2

@jsong468 jsong468 changed the title DRAFT: Scorer evaluation refactor FEAT: Scorer evaluation refactor Jan 5, 2026

@rlundeen2 rlundeen2 left a comment


This looks great!

Because it is gigantic but has been worked on by both of us, I recommend the following next steps:

  1. Merge main
  2. Make sure the .ipynb files match the .py files
  3. Run integration tests and e2e tests to make sure nothing is broken
  4. Merge the PR
  5. Keep testing; do follow-up PRs for any issues

@jsong468 jsong468 changed the title FEAT: Scorer evaluation refactor BREAKING FEAT: Scorer evaluation refactor Jan 6, 2026
@jsong468 jsong468 changed the title BREAKING FEAT: Scorer evaluation refactor FEAT [BREAKING]: Scorer evaluation refactor Jan 6, 2026
@jsong468 jsong468 merged commit 4d9d72f into Azure:main Jan 7, 2026
20 checks passed