Parse acceptance criteria from report JSON by acvejicTT · Pull Request #126 · tenstorrent/tt-github-actions

acvejicTT · 2026-05-28T11:27:09Z

This PR adds a new _process_acceptance_criteria method to ShieldBenchmarkDataMapper, wired into map_benchmark_data. Whenever a report contains a boolean acceptance_criteria field, the method emits a dedicated acceptance_criteria run with a single passed measurement (1.0 for pass, 0.0 for fail). The run's config_params merges any model_spec_data with three additional report fields: acceptance_blockers, acceptance_criteria_metadata, and acceptance_summary_markdown.
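The behavior summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function name build_acceptance_run, the plain-dict return shape, and the choice to skip report fields that are absent are all hypothetical. The real mapper builds CompleteBenchmarkRun and BenchmarkMeasurement objects, whose constructors are not shown on this page.

```python
from typing import Any, Optional

# Report fields merged into config_params next to model_spec_data
# (field names taken from the PR description).
EXTRA_CONFIG_FIELDS = (
    "acceptance_blockers",
    "acceptance_criteria_metadata",
    "acceptance_summary_markdown",
)


def build_acceptance_run(
    report: dict[str, Any],
    model_spec_data: Optional[dict[str, Any]] = None,
) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in for _process_acceptance_criteria.

    Returns a plain dict instead of a CompleteBenchmarkRun.
    """
    accepted = report.get("acceptance_criteria")
    # Emit a run only when the field is a real boolean; a missing or
    # non-boolean value produces no run at all.
    if not isinstance(accepted, bool):
        return None

    # Start from model_spec_data, then add the acceptance fields.
    # Assumption: fields absent from the report are simply left out.
    config_params = dict(model_spec_data or {})
    for field in EXTRA_CONFIG_FIELDS:
        if field in report:
            config_params[field] = report[field]

    return {
        "run_type": "acceptance_criteria",
        "config_params": config_params,
        "measurements": [
            {"name": "passed", "value": 1.0 if accepted else 0.0},
        ],
    }
```

Keeping the text and dict payloads in config_params leaves the measurement list to float values only, which matches the float-only constraint on BenchmarkMeasurement.value mentioned in the commit message.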

…JSON

tt-shield's release/nightly reports now include three new top-level fields that describe model-level acceptance against tier-based criteria:

acceptance_criteria: bool, overall pass/fail
acceptance_blockers: dict[check_name → failure message]
acceptance_criteria_metadata: { enforcement_result, model_status, enforced_tiers, informational_tiers, failed_enforced_tiers, informational_blockers, masked_blockers }

ShieldBenchmarkDataMapper now emits a CompleteBenchmarkRun with run_type='acceptance_criteria' carrying these as BenchmarkMeasurements (step_name='acceptance_criteria'):

Measurement	Source
passed	acceptance_criteria bool (2=pass, 3=fail)
enforcement_result	metadata.enforcement_result (2=PASS, 3=FAIL, 1=unknown)
model_status	metadata.model_status ordinal tier (0=NON-FUNCTIONAL, 1=EXPERIMENTAL, 2=FUNCTIONAL, 3=COMPLETE)
num_enforced_tiers	len(metadata.enforced_tiers)
num_failed_enforced_tiers	len(metadata.failed_enforced_tiers)
num_informational_tiers	len(metadata.informational_tiers)
num_blockers	len(acceptance_blockers)
num_informational_blockers	len(metadata.informational_blockers)
num_masked_blockers	len(metadata.masked_blockers)

Encoding choice mirrors the existing accuracy_check convention so the downstream dashboard (tt-shield/models-dashboards) can use the same "2=pass, 3=fail, 1=n/a" pass-check style for the new column.

acceptance_summary_markdown is intentionally not emitted (text payload; BenchmarkMeasurement.value is float-only, so it is left for a future text column).

Verified against the real Llama-3.2-1B-Instruct release report (report_76636001547.json): all 9 measurements match the source values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
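The nine-measurement encoding listed in this commit message can be sketched as below. It is an approximation for illustration only: encode_acceptance and the returned name-to-float dict are hypothetical, the tier keys are copied from the commit message rather than from the source file, and two behaviors are assumptions (an unrecognized enforcement_result maps to 1, and an unrecognized model_status is skipped, in line with the "unknown-value skipping" that the tests are said to cover).

```python
from typing import Any

# Pass/fail convention shared with accuracy_check (2=pass, 3=fail, 1=unknown).
PASS_FAIL = {"PASS": 2.0, "FAIL": 3.0}
UNKNOWN = 1.0

# Ordinal tiers as listed in the commit message.
MODEL_STATUS_TIERS = {
    "NON-FUNCTIONAL": 0.0,
    "EXPERIMENTAL": 1.0,
    "FUNCTIONAL": 2.0,
    "COMPLETE": 3.0,
}


def encode_acceptance(report: dict[str, Any]) -> dict[str, float]:
    """Hypothetical helper: map acceptance fields to float measurements."""
    measurements: dict[str, float] = {}
    metadata = report.get("acceptance_criteria_metadata") or {}

    accepted = report.get("acceptance_criteria")
    if isinstance(accepted, bool):
        measurements["passed"] = 2.0 if accepted else 3.0

    if "enforcement_result" in metadata:
        # Assumption: anything other than PASS/FAIL is encoded as unknown.
        result = metadata["enforcement_result"]
        measurements["enforcement_result"] = (
            PASS_FAIL.get(result, UNKNOWN) if isinstance(result, str) else UNKNOWN
        )

    status = metadata.get("model_status")
    # Assumption: an unrecognized tier is skipped rather than encoded.
    if isinstance(status, str) and status in MODEL_STATUS_TIERS:
        measurements["model_status"] = MODEL_STATUS_TIERS[status]

    # Count-style measurements are the lengths of the source collections.
    counts = {
        "num_enforced_tiers": metadata.get("enforced_tiers"),
        "num_failed_enforced_tiers": metadata.get("failed_enforced_tiers"),
        "num_informational_tiers": metadata.get("informational_tiers"),
        "num_blockers": report.get("acceptance_blockers"),
        "num_informational_blockers": metadata.get("informational_blockers"),
        "num_masked_blockers": metadata.get("masked_blockers"),
    }
    for name, collection in counts.items():
        if collection is not None:
            measurements[name] = float(len(collection))

    return measurements
```

A reader of the resulting values has to remember two different numeric schemes (2/3/1 for pass/fail and 0 to 3 for tiers), which is the trade-off of fitting everything into float-only measurements.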

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for mapping acceptance_criteria fields from shield reports into a dedicated acceptance_criteria run type, along with tests to validate the numeric encodings and skipping behavior.

Changes:

Add acceptance-criteria numeric encodings (PASS/FAIL and model status tiers) in ShieldBenchmarkDataMapper.
Emit an acceptance_criteria CompleteBenchmarkRun when relevant report fields are present.
Add pytest coverage for pass/fail, tier encoding, unknown-value skipping, and absence handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
.github/actions/collect_data/src/benchmark.py	Implements acceptance-criteria run generation and value encoding.
.github/actions/collect_data/test/test_benchmark_mapper.py	Adds unit tests for acceptance-criteria mapping and edge cases.


mjeremicTT · 2026-05-28T11:37:20Z

+    # to floats. Pass/fail uses the same convention as accuracy_check (2=pass,
+    # 3=fail). Model status is an ordinal tier.
+    _ACCEPTANCE_PASS_FAIL = {"PASS": 2.0, "FAIL": 3.0}
+    _MODEL_STATUS_TIERS = {


It might be better to put the model status tiers in config_params with their original values; encoding them as numbers gives us an additional mental mapping to remember.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

Copilot AI review requested due to automatic review settings May 28, 2026 11:27

acvejicTT requested a review from a team as a code owner May 28, 2026 11:27

Copilot AI reviewed May 28, 2026


Comment thread .github/actions/collect_data/src/benchmark.py Outdated

Comment thread .github/actions/collect_data/src/benchmark.py Outdated

Comment thread .github/actions/collect_data/src/benchmark.py

mjeremicTT reviewed May 28, 2026


mjeremicTT added 2 commits May 29, 2026 15:32

refactor acceptance criteria

4390f3e

fix lint issues

627cfe1

Copilot AI review requested due to automatic review settings May 29, 2026 13:35

Copilot started reviewing on behalf of mjeremicTT May 29, 2026 13:36

remove whitespaces

db7696b

Copilot AI reviewed May 29, 2026


vmilosevic approved these changes May 29, 2026


vmilosevic merged commit 32ac6cc into tenstorrent:main May 29, 2026
2 checks passed

vmilosevic merged 4 commits into tenstorrent:main from acvejicTT:feat/parse-acceptance-criteria-shield

acvejicTT commented May 28, 2026 • edited by mjeremicTT

4 participants
