Parse acceptance criteria from report JSON #126
Merged
vmilosevic merged 4 commits on May 29, 2026
…JSON
tt-shield's release/nightly reports now include three new top-level fields
that describe model-level acceptance against tier-based criteria:
- acceptance_criteria: bool, overall pass/fail
- acceptance_blockers: dict[check_name → failure message]
- acceptance_criteria_metadata: { enforcement_result, model_status, enforced_tiers, informational_tiers, failed_enforced_tiers, informational_blockers, masked_blockers }
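As a rough illustration of the three fields listed above, the report shape could be typed as follows. This is only a sketch: the class names and the `has_acceptance_fields` helper are hypothetical, and the element types of the tier and blocker collections are assumptions (the description only states that their lengths are taken).

```python
from typing import Any, Dict, List, TypedDict


class AcceptanceCriteriaMetadata(TypedDict, total=False):
    enforcement_result: str                 # e.g. "PASS" or "FAIL"
    model_status: str                       # e.g. "FUNCTIONAL"
    enforced_tiers: List[str]               # element type assumed
    informational_tiers: List[str]          # element type assumed
    failed_enforced_tiers: List[str]        # element type assumed
    informational_blockers: Dict[str, str]  # container type assumed
    masked_blockers: Dict[str, str]         # container type assumed


class AcceptanceReportFields(TypedDict, total=False):
    acceptance_criteria: bool               # overall pass/fail
    acceptance_blockers: Dict[str, str]     # check_name -> failure message
    acceptance_criteria_metadata: AcceptanceCriteriaMetadata


def has_acceptance_fields(report: Dict[str, Any]) -> bool:
    """Hypothetical check: True only when the boolean verdict is present."""
    return isinstance(report.get("acceptance_criteria"), bool)
```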
ShieldBenchmarkDataMapper now emits a CompleteBenchmarkRun with
run_type='acceptance_criteria' carrying these as BenchmarkMeasurements
(step_name='acceptance_criteria'):
| Measurement | Value |
|---|---|
| passed | acceptance_criteria bool (2=pass, 3=fail) |
| enforcement_result | metadata.enforcement_result (2=PASS, 3=FAIL, 1=unknown) |
| model_status | metadata.model_status ordinal tier (0=NON-FUNCTIONAL, 1=EXPERIMENTAL, 2=FUNCTIONAL, 3=COMPLETE) |
| num_enforced_tiers | len(metadata.enforced_tiers) |
| num_failed_enforced_tiers | len(metadata.failed_enforced_tiers) |
| num_informational_tiers | len(metadata.informational_tiers) |
| num_blockers | len(acceptance_blockers) |
| num_informational_blockers | len(metadata.informational_blockers) |
| num_masked_blockers | len(metadata.masked_blockers) |
Encoding choice mirrors the existing accuracy_check convention so the
downstream dashboard (tt-shield/models-dashboards) can use the same
"2=pass, 3=fail, 1=n/a" pass-check style for the new column.
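The encoding described above could be sketched roughly as below. This is not the code from the PR: `encode_acceptance_measurements` is a hypothetical helper that returns plain name-to-float pairs instead of BenchmarkMeasurement objects, the tier key spellings are taken from the description, and skipping an unrecognized model_status is an assumption.

```python
from typing import Any, Dict

# Encodings as listed in the description (2=pass, 3=fail, 1=unknown).
PASS_FAIL = {"PASS": 2.0, "FAIL": 3.0}
UNKNOWN = 1.0
# Ordinal tiers; the exact key spellings in real reports are assumed.
MODEL_STATUS_TIERS = {
    "NON-FUNCTIONAL": 0.0,
    "EXPERIMENTAL": 1.0,
    "FUNCTIONAL": 2.0,
    "COMPLETE": 3.0,
}


def encode_acceptance_measurements(report: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical helper: map acceptance fields to float measurement values."""
    verdict = report.get("acceptance_criteria")
    if not isinstance(verdict, bool):
        return {}  # report carries no acceptance data

    meta = report.get("acceptance_criteria_metadata") or {}
    blockers = report.get("acceptance_blockers") or {}

    def count(key: str) -> float:
        # Length of a metadata collection, 0 when the key is missing.
        return float(len(meta.get(key) or []))

    values = {
        "passed": 2.0 if verdict else 3.0,
        "enforcement_result": PASS_FAIL.get(meta.get("enforcement_result"), UNKNOWN),
        "num_enforced_tiers": count("enforced_tiers"),
        "num_failed_enforced_tiers": count("failed_enforced_tiers"),
        "num_informational_tiers": count("informational_tiers"),
        "num_blockers": float(len(blockers)),
        "num_informational_blockers": count("informational_blockers"),
        "num_masked_blockers": count("masked_blockers"),
    }
    status = MODEL_STATUS_TIERS.get(meta.get("model_status"))
    if status is not None:  # assumption: unrecognized tiers are skipped
        values["model_status"] = status
    return values
```

Keeping every value a float matches the constraint that BenchmarkMeasurement.value is float-only, which is also why the markdown summary has no place in this mapping.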
acceptance_summary_markdown is intentionally not emitted (text payload;
BenchmarkMeasurement.value is float-only — leave for a future text column).
Verified against the real Llama-3.2-1B-Instruct release report
(report_76636001547.json): all 9 measurements match the source values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support for mapping acceptance_criteria fields from shield reports into a dedicated acceptance_criteria run type, along with tests to validate the numeric encodings and skipping behavior.
Changes:
- Add acceptance-criteria numeric encodings (PASS/FAIL and model status tiers) in ShieldBenchmarkDataMapper.
- Emit an acceptance_criteria CompleteBenchmarkRun when relevant report fields are present.
- Add pytest coverage for pass/fail, tier encoding, unknown-value skipping, and absence handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/actions/collect_data/src/benchmark.py | Implements acceptance-criteria run generation and value encoding. |
| .github/actions/collect_data/test/test_benchmark_mapper.py | Adds unit tests for acceptance-criteria mapping and edge cases. |
mjeremicTT
reviewed
May 28, 2026
    # to floats. Pass/fail uses the same convention as accuracy_check (2=pass,
    # 3=fail). Model status is an ordinal tier.
    _ACCEPTANCE_PASS_FAIL = {"PASS": 2.0, "FAIL": 3.0}
    _MODEL_STATUS_TIERS = {
It might be better to put the model status tiers in config_params with their original values; with the numeric encoding we have an extra mental mapping to remember.
vmilosevic
approved these changes
May 29, 2026
This PR adds a new _process_acceptance_criteria method to ShieldBenchmarkDataMapper, wired into map_benchmark_data, that emits a dedicated acceptance_criteria run whenever a report contains a boolean acceptance_criteria field (encoded as a single passed measurement: 1.0 for pass, 0.0 for fail). The run's config_params merges any model_spec_data with three additional report fields: acceptance_blockers, acceptance_criteria_metadata, and acceptance_summary_markdown
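The behaviour summarized above can be sketched as follows, using the 1.0/0.0 encoding stated in this summary. This is an illustration under assumptions, not the merged implementation: `build_acceptance_run` is a hypothetical stand-in for `_process_acceptance_criteria`, and the run is modelled as a plain dict instead of a CompleteBenchmarkRun.

```python
from typing import Any, Dict, Optional

# Report fields carried through config_params, as named in the summary.
PASSTHROUGH_FIELDS = (
    "acceptance_blockers",
    "acceptance_criteria_metadata",
    "acceptance_summary_markdown",
)


def build_acceptance_run(
    report: Dict[str, Any],
    model_spec_data: Optional[Dict[str, Any]] = None,
) -> Optional[Dict[str, Any]]:
    """Hypothetical sketch: return an acceptance_criteria run, or None."""
    verdict = report.get("acceptance_criteria")
    if not isinstance(verdict, bool):
        return None  # no boolean verdict, so no run is emitted

    # Start from a copy of the model spec so the caller's dict is not mutated.
    config_params = dict(model_spec_data or {})
    for field in PASSTHROUGH_FIELDS:
        if field in report:
            config_params[field] = report[field]

    return {
        "run_type": "acceptance_criteria",
        "config_params": config_params,
        "measurements": [
            {
                "step_name": "acceptance_criteria",
                "name": "passed",
                "value": 1.0 if verdict else 0.0,
            }
        ],
    }
```

Carrying the structured and text fields in config_params keeps their original values readable, which is the direction suggested in the review comment on the tier encoding.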