Add additional metrics #131

Merged
vmilosevic merged 1 commit into
tenstorrent:mainfrom
acvejicTT:acvejic/add-missing-metrics-and-checks
Jun 4, 2026
Conversation

@acvejicTT

After some analysis, we found many metrics that we are not parsing or pushing to the DB. We will also implement a drift script on our side to catch these more proactively.

…amilies

Add every scalar metric observed across LLM, image/video/diffusion, audio
and embedding models in a 109-job tt-shield nightly run to the four
ShieldBenchmarkDataMapper whitelists:

  - _process_benchmarks (run_type='benchmark')
      + total_token_throughput, mean_latency_ms, p50/p90/p95_latency_ms,
        throughput_rps, inference_steps_per_second, num_inference_steps,
        performance_check, rtr, wer, embedding_dimension

  - _process_benchmarks_summary outer (run_type='benchmark_summary')
      + num_requests, latency, inference_steps_per_second,
        num_inference_steps, e2el_ms, tput_prefill, latency_p90,
        latency_p95, rtr

  - _process_benchmarks_summary target_checks
    (run_type='benchmark_summary_<tier>')
      + tput, tput_ratio,
        latency, latency_ratio, latency_check,
        e2el_ms, e2el_ms_ratio, e2el_ms_check,
        tput_prefill, tput_prefill_ratio, tput_prefill_check,
        rtr_check

  - _process_evals (run_type='eval')
      + performance_check, tolerance, num_inference_steps, num_prompts,
        deviation_clip_score, max_clip, min_clip, clip_standard_deviation,
        fvd, fvmd,
        latency_p50/p90/p95, rtr, throughput_rps, tput_user,
        correct, total, mismatches_count,
        cosine_pearson, cosine_spearman, euclidean_pearson,
        euclidean_spearman, manhattan_pearson, manhattan_spearman,
        pearson, spearman, main_score
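
For readers unfamiliar with the whitelist pattern, here is a minimal sketch of how a whitelist decides which keys of a report record become measurements. It is not the code in benchmark.py: `BENCHMARK_KEYS` is a hypothetical subset of the real list, and `pick_measurements` and the `{"name", "value"}` record shape are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical subset of the benchmark whitelist (the full list is in benchmark.py).
BENCHMARK_KEYS = (
    "total_token_throughput",
    "mean_latency_ms",
    "throughput_rps",
    "rtr",
    "wer",
)


def pick_measurements(
    record: Dict[str, Any], allowed: Iterable[str]
) -> List[Dict[str, Any]]:
    """Return one measurement per whitelisted key that holds a numeric value.

    Keys outside `allowed` are ignored, which is how an unlisted metric
    ends up silently dropped.
    """
    measurements = []
    for key in allowed:
        value = record.get(key)
        # Skip missing keys and non-numeric values such as strings or None.
        if not isinstance(value, (int, float)):
            continue
        measurements.append({"name": key, "value": float(value)})
    return measurements


record = {"mean_latency_ms": 12.5, "rtr": 2, "model_name": "test", "new_metric": 3.0}
picked = pick_measurements(record, BENCHMARK_KEYS)
```

In this sketch `new_metric` never reaches `picked` because it is not in the whitelist, which is the failure mode this PR addresses for media and embedding models.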

Why:
  Previous whitelists were LLM-centric and silently dropped checks /
  metrics from media (whisper / diffusion) and embedding models. The
  dashboard's Benchmarks column was empty for ~17 models because their
  latency_check, e2el_ms_check, tput_prefill_check, rtr_check were not
  ingested. Same shape of bug existed for raw image quality / audio /
  classification / embedding-similarity metrics on the eval side.

  Going forward, a metric-drift script in tt-shield (separate PR) will
  diff each nightly run's reports against these whitelists and alert
  Slack when a producer emits a new metric we are not picking up — so
  this whitelist is the single source of truth.
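
The drift check described above could look roughly like the following. This is only a sketch under assumptions: the actual script lives in a separate tt-shield PR that is not shown here, and `find_unlisted_metrics`, `NON_METRIC_KEYS`, and the sample records are hypothetical names and data. Slack alerting is left out.

```python
from typing import Any, Dict, Iterable, Set

# Assumed descriptive fields that are numeric-or-not but never metrics.
NON_METRIC_KEYS = frozenset({"model_name", "device"})


def find_unlisted_metrics(
    records: Iterable[Dict[str, Any]],
    whitelist: Iterable[str],
    ignore: Iterable[str] = NON_METRIC_KEYS,
) -> Set[str]:
    """Return numeric keys seen in the reports that the whitelist lacks."""
    ignored = set(ignore)
    observed = set()
    for record in records:
        for key, value in record.items():
            # Only scalar numeric values are treated as candidate metrics.
            if isinstance(value, (int, float)) and key not in ignored:
                observed.add(key)
    return observed - set(whitelist)


nightly_records = [
    {"model_name": "m", "rtr": 1.0, "new_metric": 2.0},
    {"device": "d", "wer": 0.1},
]
drift = find_unlisted_metrics(nightly_records, {"rtr", "wer"})
```

A non-empty `drift` set is the condition that would trigger an alert.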

Tests:
  Four new whitelist-coverage tests (one per location) pin down the
  full set of expected keys. Full suite: 74 pass.
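
A whitelist-coverage test of this kind can be sketched as below. It is not one of the four tests in this PR: `WHITELIST`, `EXPECTED_KEYS`, and `ingest` are hypothetical stand-ins for the mapper and its key lists, so the example runs on its own.

```python
from typing import Any, Dict

# Stand-in for the whitelist inside the mapper under test.
WHITELIST = frozenset({"mean_latency_ms", "throughput_rps", "rtr", "wer"})

# Keys observed in production reports; the test pins these down.
EXPECTED_KEYS = frozenset({"mean_latency_ms", "throughput_rps", "rtr", "wer"})


def ingest(record: Dict[str, Any]) -> Dict[str, float]:
    """Stand-in for the mapper: keep only whitelisted keys as floats."""
    return {k: float(v) for k, v in record.items() if k in WHITELIST}


def test_whitelist_covers_expected_keys() -> None:
    # Build a record carrying every expected metric plus descriptive fields.
    record: Dict[str, Any] = {key: 1.0 for key in EXPECTED_KEYS}
    record.update({"model_name": "test", "device": "test"})

    ingested = ingest(record)

    missing = EXPECTED_KEYS - set(ingested)
    assert not missing, f"metrics dropped by the whitelist: {sorted(missing)}"
    # Descriptive fields must not leak into measurements.
    assert "model_name" not in ingested
```

If a key is removed from the whitelist but stays in `EXPECTED_KEYS`, the test fails and names the dropped metric.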
Copilot AI review requested due to automatic review settings June 3, 2026 17:05
@acvejicTT acvejicTT requested a review from a team as a code owner June 3, 2026 17:05

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Extends benchmark/eval metric ingestion whitelists to cover additional model families (LLM, image/video, audio, embedding) and adds tests to ensure observed production keys are not silently dropped.

Changes:

  • Expand benchmarks[*], benchmarks_summary[*], target_checks, and evals[*] whitelists with additional metric keys.
  • Add “whitelist coverage” tests that assert every observed metric key is ingested into measurements for the appropriate step.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/actions/collect_data/test/test_benchmark_mapper.py | Adds coverage tests that validate whitelist ingestion across report JSON locations/model families. |
| .github/actions/collect_data/src/benchmark.py | Expands the allowlisted metric keys ingested for benchmarks, benchmarks_summary/target_checks, and evals. |

Comment on lines +407 to +409
benchmark = {k: 1.0 for k in _BENCHMARK_KEYS_ALL_FAMILIES}
benchmark.update({"model_name": "test", "device": "test"})
result = mapper.map_benchmark_data(pipeline, 1, {"benchmarks": [benchmark]})
Comment on lines +342 to +350
# Image / video / diffusion
"mean_latency_ms",
"p50_latency_ms",
"p90_latency_ms",
"p95_latency_ms",
"throughput_rps",
"inference_steps_per_second",
"num_inference_steps",
"performance_check",
@vmilosevic vmilosevic merged commit bcc36f5 into tenstorrent:main Jun 4, 2026
3 checks passed