Add additional metrics by acvejicTT · Pull Request #131 · tenstorrent/tt-github-actions

acvejicTT · 2026-06-03T17:05:42Z

We did some analysis and found many new metrics that we are not parsing and pushing to the DB. We will also implement a drift script on our side to catch this more proactively.

…amilies

Add every scalar metric observed across LLM, image/video/diffusion, audio and embedding models in a 109-job tt-shield nightly run to the four ShieldBenchmarkDataMapper whitelists:

- _process_benchmarks (run_type='benchmark')
  + total_token_throughput, mean_latency_ms, p50/p90/p95_latency_ms, throughput_rps, inference_steps_per_second, num_inference_steps, performance_check, rtr, wer, embedding_dimension
- _process_benchmarks_summary outer (run_type='benchmark_summary')
  + num_requests, latency, inference_steps_per_second, num_inference_steps, e2el_ms, tput_prefill, latency_p90, latency_p95, rtr
- _process_benchmarks_summary target_checks (run_type='benchmark_summary_<tier>')
  + tput, tput_ratio, latency, latency_ratio, latency_check, e2el_ms, e2el_ms_ratio, e2el_ms_check, tput_prefill, tput_prefill_ratio, tput_prefill_check, rtr_check
- _process_evals (run_type='eval')
  + performance_check, tolerance, num_inference_steps, num_prompts, deviation_clip_score, max_clip, min_clip, clip_standard_deviation, fvd, fvmd, latency_p50/p90/p95, rtr, throughput_rps, tput_user, correct, total, mismatches_count, cosine_pearson, cosine_spearman, euclidean_pearson, euclidean_spearman, manhattan_pearson, manhattan_spearman, pearson, spearman, main_score

Why: Previous whitelists were LLM-centric and silently dropped checks / metrics from media (whisper / diffusion) and embedding models. The dashboard's Benchmarks column was empty for ~17 models because their latency_check, e2el_ms_check, tput_prefill_check, rtr_check were not ingested. The same shape of bug existed for raw image quality / audio / classification / embedding-similarity metrics on the eval side.

Going forward, a metric-drift script in tt-shield (separate PR) will diff each nightly run's reports against these whitelists and alert Slack when a producer emits a new metric we are not picking up, so this whitelist is the single source of truth.

Tests: Four new whitelist-coverage tests (one per location) pin down the full set of expected keys. Full suite: 74 pass.
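
The diff step that the planned drift script would perform can be sketched roughly as follows. This is only an illustration of the idea, not the tt-shield script or the ShieldBenchmarkDataMapper code: `BENCHMARK_WHITELIST` and `split_by_whitelist` are hypothetical names, the whitelist is a small subset of the keys listed above, and treating only `int`/`float` values as scalar metrics is an assumption.

```python
# Hypothetical subset of one whitelist; the real lists live in benchmark.py.
BENCHMARK_WHITELIST = frozenset(
    {"mean_latency_ms", "throughput_rps", "rtr", "wer"}
)


def split_by_whitelist(record, whitelist):
    """Split one report record into (ingested, dropped) scalar metrics.

    Assumption: a "scalar metric" is any int or float value. bool is
    skipped explicitly because it is a subclass of int in Python.
    """
    ingested, dropped = {}, {}
    for key, value in record.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # metadata such as model_name or device
        target = ingested if key in whitelist else dropped
        target[key] = value
    return ingested, dropped


if __name__ == "__main__":
    record = {"model_name": "test", "rtr": 0.5, "new_metric": 3}
    ingested, dropped = split_by_whitelist(record, BENCHMARK_WHITELIST)
    # Any key in `dropped` is a candidate for a drift alert.
    print("ingested:", ingested)
    print("dropped:", dropped)
```

Any key that lands in `dropped` is one a producer emits but the whitelist does not cover, which is the condition the commit message says should trigger an alert.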

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Extends benchmark/eval metric ingestion whitelists to cover additional model families (LLM, image/video, audio, embedding) and adds tests to ensure observed production keys are not silently dropped.

Changes:

Expand benchmarks[*], benchmarks_summary[*], target_checks, and evals[*] whitelists with additional metric keys.
Add “whitelist coverage” tests that assert every observed metric key is ingested into measurements for the appropriate step.
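
The shape of such a coverage test can be sketched as below. The sketch uses a toy mapper rather than the real one: `EXPECTED_KEYS`, `WHITELIST`, `map_measurements` and `missing_keys` are hypothetical stand-ins, and the measurement layout (`name` / `value` dicts) is an assumption made for illustration.

```python
# Hypothetical stand-ins: the real keys and mapper live in benchmark.py.
EXPECTED_KEYS = {"mean_latency_ms", "p50_latency_ms", "throughput_rps"}
WHITELIST = {"mean_latency_ms", "p50_latency_ms", "throughput_rps", "rtr"}


def map_measurements(record):
    """Toy mapper: keep only whitelisted keys, one measurement per key."""
    return [
        {"name": key, "value": value}
        for key, value in record.items()
        if key in WHITELIST
    ]


def missing_keys(expected):
    """Return the expected keys that the mapper failed to ingest."""
    # Give every expected key a dummy value, plus non-metric metadata.
    record = {key: 1.0 for key in expected}
    record.update({"model_name": "test", "device": "test"})
    ingested = {m["name"] for m in map_measurements(record)}
    return set(expected) - ingested


def test_whitelist_covers_expected_keys():
    missing = missing_keys(EXPECTED_KEYS)
    assert not missing, f"metrics silently dropped: {sorted(missing)}"
```

Pinning the full expected key set in the test means that removing a key from the whitelist, or forgetting to add a newly observed one to both places, fails loudly instead of dropping data silently.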

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
.github/actions/collect_data/test/test_benchmark_mapper.py	Adds coverage tests that validate whitelist ingestion across report JSON locations/model families.
.github/actions/collect_data/src/benchmark.py	Expands the allowlisted metric keys ingested for benchmarks, benchmarks_summary/target_checks, and evals.


+    benchmark = {k: 1.0 for k in _BENCHMARK_KEYS_ALL_FAMILIES}
+    benchmark.update({"model_name": "test", "device": "test"})
+    result = mapper.map_benchmark_data(pipeline, 1, {"benchmarks": [benchmark]})


+                    # Image / video / diffusion
+                    "mean_latency_ms",
+                    "p50_latency_ms",
+                    "p90_latency_ms",
+                    "p95_latency_ms",
+                    "throughput_rps",
+                    "inference_steps_per_second",
+                    "num_inference_steps",
+                    "performance_check",


Copilot AI review requested due to automatic review settings June 3, 2026 17:05

acvejicTT requested a review from a team as a code owner June 3, 2026 17:05

Copilot AI reviewed Jun 3, 2026


vmilosevic approved these changes Jun 4, 2026


vmilosevic merged commit bcc36f5 into tenstorrent:main Jun 4, 2026
3 checks passed

vmilosevic merged 1 commit into tenstorrent:main from acvejicTT:acvejic/add-missing-metrics-and-checks
