Parse acceptance criteria from report JSON #126

Merged
vmilosevic merged 4 commits into tenstorrent:main from acvejicTT:feat/parse-acceptance-criteria-shield
May 29, 2026
Conversation

@acvejicTT (Contributor) commented May 28, 2026

This PR adds a new _process_acceptance_criteria method to ShieldBenchmarkDataMapper and wires it into map_benchmark_data. Whenever a report contains a boolean acceptance_criteria field, the mapper emits a dedicated acceptance_criteria run with a single passed measurement (1.0 for pass, 0.0 for fail). The run's config_params merges any model_spec_data with three additional report fields: acceptance_blockers, acceptance_criteria_metadata, and acceptance_summary_markdown.
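
A minimal sketch of the behavior described above, not the code from this PR. It assumes model_spec_data is a top-level dict in the report and uses plain dicts in place of the real run and measurement classes; the function name sketch_acceptance_run and the output layout are hypothetical.

```python
from typing import Any, Dict, Optional

# Report fields that the description says are merged into config_params.
ACCEPTANCE_EXTRA_FIELDS = (
    "acceptance_blockers",
    "acceptance_criteria_metadata",
    "acceptance_summary_markdown",
)


def sketch_acceptance_run(report: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in for _process_acceptance_criteria."""
    passed = report.get("acceptance_criteria")
    # Only a real boolean triggers the run; a missing or non-bool value is skipped.
    if not isinstance(passed, bool):
        return None

    # Assumption: model_spec_data, when present, is a dict at the report's top level.
    config_params = dict(report.get("model_spec_data") or {})
    for field in ACCEPTANCE_EXTRA_FIELDS:
        if field in report:
            config_params[field] = report[field]

    return {
        "run_type": "acceptance_criteria",
        "config_params": config_params,
        "measurements": [
            {
                "step_name": "acceptance_criteria",
                "name": "passed",
                "value": 1.0 if passed else 0.0,
            }
        ],
    }
```

Keeping the blockers, metadata, and markdown summary in config_params preserves their original values, so only the overall verdict needs a numeric encoding.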

…JSON

tt-shield's release/nightly reports now include three new top-level fields
that describe model-level acceptance against tier-based criteria:

  acceptance_criteria              — bool, overall pass/fail
  acceptance_blockers              — dict[check_name → failure message]
  acceptance_criteria_metadata     — { enforcement_result, model_status,
                                       enforced_tiers, informational_tiers,
                                       failed_enforced_tiers,
                                       informational_blockers,
                                       masked_blockers }

ShieldBenchmarkDataMapper now emits a CompleteBenchmarkRun with
run_type='acceptance_criteria' carrying these as BenchmarkMeasurements
(step_name='acceptance_criteria'):

  passed                       acceptance_criteria bool  (2=pass, 3=fail)
  enforcement_result           metadata.enforcement_result
                                 (2=PASS, 3=FAIL, 1=unknown)
  model_status                 metadata.model_status ordinal tier
                                 (0=NON-FUNCTIONAL, 1=EXPERIMENTAL,
                                  2=FUNCTIONAL, 3=COMPLETE)
  num_enforced_tiers           len(metadata.enforced_tiers)
  num_failed_enforced_tiers    len(metadata.failed_enforced_tiers)
  num_informational_tiers      len(metadata.informational_tiers)
  num_blockers                 len(acceptance_blockers)
  num_informational_blockers   len(metadata.informational_blockers)
  num_masked_blockers          len(metadata.masked_blockers)

Encoding choice mirrors the existing accuracy_check convention so the
downstream dashboard (tt-shield/models-dashboards) can use the same
"2=pass, 3=fail, 1=n/a" pass-check style for the new column.
acceptance_summary_markdown is intentionally not emitted (text payload;
BenchmarkMeasurement.value is float-only — leave for a future text column).

Verified against the real Llama-3.2-1B-Instruct release report
(report_76636001547.json): all 9 measurements match the source values.
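
The nine-measurement encoding listed in this commit message could be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: encode_acceptance is a hypothetical name, the tier strings are assumed to appear in the report exactly as written above, and omitting an unrecognized model_status is a guess.

```python
from typing import Any, Dict

# Same convention as the commit message: 2=pass, 3=fail, 1=unknown.
_ACCEPTANCE_PASS_FAIL = {"PASS": 2.0, "FAIL": 3.0}
_UNKNOWN = 1.0
# Ordinal tiers; the exact string keys are an assumption.
_MODEL_STATUS_TIERS = {
    "NON-FUNCTIONAL": 0.0,
    "EXPERIMENTAL": 1.0,
    "FUNCTIONAL": 2.0,
    "COMPLETE": 3.0,
}
# Measurement name -> metadata key whose length is reported.
_COUNTED_METADATA = (
    ("num_enforced_tiers", "enforced_tiers"),
    ("num_failed_enforced_tiers", "failed_enforced_tiers"),
    ("num_informational_tiers", "informational_tiers"),
    ("num_informational_blockers", "informational_blockers"),
    ("num_masked_blockers", "masked_blockers"),
)


def encode_acceptance(report: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical helper: map acceptance fields to float measurements."""
    meta = report.get("acceptance_criteria_metadata") or {}
    out: Dict[str, float] = {}

    passed = report.get("acceptance_criteria")
    if isinstance(passed, bool):
        out["passed"] = 2.0 if passed else 3.0

    result = meta.get("enforcement_result")
    out["enforcement_result"] = (
        _ACCEPTANCE_PASS_FAIL.get(result, _UNKNOWN) if isinstance(result, str) else _UNKNOWN
    )

    status = meta.get("model_status")
    # Assumption: an unrecognized tier is left out rather than encoded.
    if isinstance(status, str) and status in _MODEL_STATUS_TIERS:
        out["model_status"] = _MODEL_STATUS_TIERS[status]

    for name, key in _COUNTED_METADATA:
        out[name] = float(len(meta.get(key) or []))
    out["num_blockers"] = float(len(report.get("acceptance_blockers") or {}))
    return out
```

Every value is a float because, as noted above, BenchmarkMeasurement.value is float-only; that is also why the markdown summary has no measurement here.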

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 11:27
@acvejicTT acvejicTT requested a review from a team as a code owner May 28, 2026 11:27

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for mapping acceptance_criteria fields from shield reports into a dedicated acceptance_criteria run type, along with tests to validate the numeric encodings and skipping behavior.

Changes:

  • Add acceptance-criteria numeric encodings (PASS/FAIL and model status tiers) in ShieldBenchmarkDataMapper.
  • Emit an acceptance_criteria CompleteBenchmarkRun when relevant report fields are present.
  • Add pytest coverage for pass/fail, tier encoding, unknown-value skipping, and absence handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/actions/collect_data/src/benchmark.py | Implements acceptance-criteria run generation and value encoding. |
| .github/actions/collect_data/test/test_benchmark_mapper.py | Adds unit tests for acceptance-criteria mapping and edge cases. |

Comment thread .github/actions/collect_data/src/benchmark.py Outdated
Comment thread .github/actions/collect_data/src/benchmark.py Outdated
Comment thread .github/actions/collect_data/src/benchmark.py
# to floats. Pass/fail uses the same convention as accuracy_check (2=pass,
# 3=fail). Model status is an ordinal tier.
_ACCEPTANCE_PASS_FAIL = {"PASS": 2.0, "FAIL": 3.0}
_MODEL_STATUS_TIERS = {

It may be better to keep the model status tiers in config_params with their original values; encoding them as numbers adds another mental mapping to remember.

Copilot AI review requested due to automatic review settings May 29, 2026 13:35

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

Comment thread .github/actions/collect_data/src/benchmark.py
Comment thread .github/actions/collect_data/src/benchmark.py
Comment thread .github/actions/collect_data/test/test_benchmark_mapper.py
Comment thread .github/actions/collect_data/test/test_benchmark_mapper.py
Comment thread .github/actions/collect_data/test/test_benchmark_mapper.py
Comment thread .github/actions/collect_data/test/test_benchmark_mapper.py
Comment thread .github/actions/collect_data/test/test_benchmark_mapper.py
@vmilosevic vmilosevic merged commit 32ac6cc into tenstorrent:main May 29, 2026
2 checks passed
4 participants