anon retrieval: ensure dealbot internal ops dashboard is updated #619

Description

@SgtPooki

Problem

Sampled retrieval (renamed from anon retrieval, #487 / #612) emits a full set of Prometheus metrics, but none of them show up on the internal ops dashboard (Dealbot Operational Dashboard 618457). We have no visibility into sampled-retrieval success rate, sub-status breakdown, or latency, and the check is absent from sev2.5/sev3 alerting.

Scope is 618457 only. It is an ops-only dashboard. The provider dashboards (782421, 624265) and the web-app dashboards are explicitly out of scope.

Why nothing shows

Metrics are listed in docs/checks/events-and-metrics.md § Sampled Retrieval: status counters sampledPieceRetrievalStatus, sampledIpniStatus, sampledCarParseStatus, sampledBlockFetchStatus, sampledPieceHttpResponseCode; histograms sampledPieceRetrievalFirstByteMs, sampledPieceRetrievalLastByteMs, sampledPieceRetrievalThroughputBps, sampledRetrievalCheckMs.

  • The check-level rollups hardcode a metric allowlist. "Check Success Rate by Type" (8973679935) and "Check Samples by Type" (11950321364) both filter name IN (...), and no sampled status metric is in that list, so sampled retrieval cannot appear in either chart until the list is edited.
  • The job-level charts group by label('job_type') dynamically, so retrieval_sampled renders automatically once the job runs. Those need verification, not edits.
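The difference between the two chart styles can be sketched in TypeScript. This is only an illustration of the filtering logic, not the dashboard queries themselves: the row shape, the function names, and the placeholder names existingCheckStatus and existing_job are assumptions made for the example.

```typescript
// Hypothetical row shape for illustration; the real charts query BetterStack.
interface MetricRow {
  name: string;    // metric name, e.g. "sampledPieceRetrievalStatus"
  jobType: string; // value of the job_type label
  count: number;
}

// Rollup style: only metric names on a fixed allowlist survive the filter,
// mirroring a `name IN (...)` condition.
function rollupByAllowlist(rows: MetricRow[], allowlist: string[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    if (!allowlist.includes(row.name)) continue; // unlisted metrics are dropped
    totals.set(row.name, (totals.get(row.name) ?? 0) + row.count);
  }
  return totals;
}

// Job-level style: group by whatever job_type values are present in the data.
function groupByJobType(rows: MetricRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.jobType, (totals.get(row.jobType) ?? 0) + row.count);
  }
  return totals;
}

// Placeholder data: "existingCheckStatus" and "existing_job" are made up.
const sampleRows: MetricRow[] = [
  { name: "existingCheckStatus", jobType: "existing_job", count: 3 },
  { name: "sampledPieceRetrievalStatus", jobType: "retrieval_sampled", count: 2 },
];

const beforeEdit = rollupByAllowlist(sampleRows, ["existingCheckStatus"]);
const afterEdit = rollupByAllowlist(sampleRows, ["existingCheckStatus", "sampledPieceRetrievalStatus"]);
const byJobType = groupByJobType(sampleRows);
```

Here beforeEdit has no entry for sampledPieceRetrievalStatus, afterEdit does, and byJobType contains retrieval_sampled without any change to the grouping code.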

Prerequisite

Sampled-retrieval jobs are gated on SUBGRAPH_ENDPOINT (apps/backend/src/jobs/jobs.service.ts:1302). Until it is set in our infra, no jobs are scheduled and no sampled* metrics are emitted. Confirmed empty in BetterStack: 0 rows in Infra Prod over 30d, 0 in Infra Staging over 7d (checked 2026-06-24). Charts can be built now, but cannot be verified with live data until SUBGRAPH_ENDPOINT is set. The sampling rate is also tunable via SAMPLED_RETRIEVALS_PER_SP_PER_HOUR (default 2/SP/hr).
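A minimal sketch of what this gating implies, assuming a hypothetical helper named planSampledRetrieval. It is not the code in jobs.service.ts. Two things in it are assumptions: falling back to the default on an invalid rate, and treating the per-hour rate as an evenly spaced interval.

```typescript
// Hypothetical plan shape; not taken from the backend.
interface SampledRetrievalPlan {
  enabled: boolean;
  intervalMinutes: number | null; // null when no jobs would be scheduled
}

const DEFAULT_PER_SP_PER_HOUR = 2; // default stated in this issue

function planSampledRetrieval(env: Record<string, string | undefined>): SampledRetrievalPlan {
  // Gate: without SUBGRAPH_ENDPOINT nothing is scheduled, so no sampled* metrics emit.
  const endpoint = (env.SUBGRAPH_ENDPOINT ?? "").trim();
  if (endpoint === "") {
    return { enabled: false, intervalMinutes: null };
  }

  // Assumption: an unset or invalid rate falls back to the default.
  const raw = env.SAMPLED_RETRIEVALS_PER_SP_PER_HOUR;
  const parsed = raw === undefined ? DEFAULT_PER_SP_PER_HOUR : Number(raw);
  const perHour = isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_PER_SP_PER_HOUR;

  // Assumption: checks are spread evenly across the hour.
  return { enabled: true, intervalMinutes: 60 / perHour };
}

// The endpoint URL below is a placeholder.
const gatedOff = planSampledRetrieval({});
const defaultRate = planSampledRetrieval({ SUBGRAPH_ENDPOINT: "https://example.invalid/subgraph" });
const customRate = planSampledRetrieval({
  SUBGRAPH_ENDPOINT: "https://example.invalid/subgraph",
  SAMPLED_RETRIEVALS_PER_SP_PER_HOUR: "4",
});
```

With no endpoint the plan is disabled. With the default rate the sketch yields one check per SP every 30 minutes, and with a rate of 4 one every 15 minutes.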

618457 is a severity-organized alerting dashboard. Every chart on it maps to an alert threshold. It carries rollups for every check plus a small number of targeted detail charts that each have their own sev threshold (IPNI Verification, dataSetCreation, On-chain vs Upload). It does not carry per-sub-status breakdowns or latency percentile charts for any check. Keep sampled retrieval to that same convention.

Work

  1. Add sampledPieceRetrievalStatus to the two check-level rollups (8973679935, 11950321364). Exclude skipped from the success-rate denominator. Sampled retrieval records skipped (empty piece pool) where other checks use pending, and the current query only filters value != 'pending', so skipped samples would count as failures and could falsely trip the sev3 "alert at 0%".
  2. Decide alerting: include sampled retrieval in the sev2.5 "Approved SPs fraction failing by check type" (12253190876); decide whether the rollup's sev3 "alert at 0%" is enough or it needs its own success-rate threshold chart (the way IPNI Verification has one). Same skipped caveat applies.
  3. Verify retrieval_sampled renders in the job-level "by type" charts (dynamic, should be automatic).
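The skipped caveat in items 1 and 2 comes down to which statuses enter the denominator. The sketch below shows the arithmetic only. The status values "success" and "failure", the StatusCounts type, and the successRate helper are assumptions for illustration; the real fix is a change to the chart query.

```typescript
// Hypothetical map of status value to sample count.
type StatusCounts = Record<string, number>;

// Success rate over all samples whose status is not excluded.
// Returns null when nothing is left to count (no data rather than 0%).
function successRate(counts: StatusCounts, excluded: string[]): number | null {
  let success = 0;
  let total = 0;
  for (const status of Object.keys(counts)) {
    if (excluded.includes(status)) continue; // left out of the denominator
    total += counts[status];
    if (status === "success") success += counts[status];
  }
  return total === 0 ? null : success / total;
}

// A window where every sample was skipped because the piece pool was empty.
const allSkipped: StatusCounts = { success: 0, skipped: 10 };
const currentFilter = successRate(allSkipped, ["pending"]);          // 0, reads as 0% success
const fixedFilter = successRate(allSkipped, ["pending", "skipped"]); // null, no data

// A mixed window: skipped samples drag the rate down if they stay in the denominator.
const mixed: StatusCounts = { success: 8, failure: 2, skipped: 10 };
const mixedCurrent = successRate(mixed, ["pending"]);          // 0.4
const mixedFixed = successRate(mixed, ["pending", "skipped"]); // 0.8
```

With only pending excluded, the all-skipped window reports 0 and would trip an "alert at 0%" even though no retrieval failed. Excluding skipped as well turns that window into no data.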

Out of scope: per-sub-status charts (sampledIpniStatus, sampledCarParseStatus, sampledBlockFetchStatus, sampledPieceHttpResponseCode) and latency/throughput charts (the four sampled histograms). Those belong on the detail/provider dashboards, tracked separately if we want them.

Done when

  • Sampled retrieval appears in the two rollup charts on 618457 (samples + success rate), with skipped excluded from the success-rate denominator.
  • Sampled retrieval is wired into sev2.5/sev3 alerting, or the decision to leave it on the rollup-only "alert at 0%" is recorded on this issue.
  • Job-level "by type" charts confirmed rendering retrieval_sampled.
  • Verified against live mainnet + calibration data after SUBGRAPH_ENDPOINT is enabled.

Metadata

Labels

enhancement (New feature or request)
ready-for-work (Triaged: scope, plan, and DoD are clear; contributor can pick up)

Projects

Status
⌚️ Issue awaiting PR merge
