Problem
Sampled retrieval (renamed from anon retrieval, #487 / #612) emits a full set of Prometheus metrics, but none of them show up on the internal ops dashboard (Dealbot Operational Dashboard 618457). We have no visibility into sampled-retrieval success rate, sub-status breakdown, or latency, and the check is absent from sev2.5/sev3 alerting.
Scope is 618457 only. It is an ops-only dashboard. The provider dashboards (782421, 624265) and the web-app dashboards are explicitly out of scope.
Why nothing shows
Metrics are listed in docs/checks/events-and-metrics.md § Sampled Retrieval: status counters sampledPieceRetrievalStatus, sampledIpniStatus, sampledCarParseStatus, sampledBlockFetchStatus, sampledPieceHttpResponseCode; histograms sampledPieceRetrievalFirstByteMs, sampledPieceRetrievalLastByteMs, sampledPieceRetrievalThroughputBps, sampledRetrievalCheckMs.
- The check-level rollups hardcode a metric allowlist. "Check Success Rate by Type" (8973679935) and "Check Samples by Type" (11950321364) both filter name IN (...) with no sampled status in the list, so they can never show it until edited.
- The job-level charts group by label('job_type') dynamically, so retrieval_sampled renders automatically once the job runs. Those need verification, not edits.
Prerequisite
Sampled-retrieval jobs are gated on SUBGRAPH_ENDPOINT (apps/backend/src/jobs/jobs.service.ts:1302). Until it is set in our infra, no jobs schedule and no sampled* metrics emit. Confirmed empty in BetterStack: 0 rows in Infra Prod over 30d, 0 in Infra Staging over 7d (checked 2026-06-24). Charts can be built now, but cannot be verified with live data until SUBGRAPH_ENDPOINT is enabled (rate also tunable via SAMPLED_RETRIEVALS_PER_SP_PER_HOUR, default 2/SP/hr).
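As a rough aid for sanity-checking sample volume once the endpoint is enabled, the gating described above can be sketched as follows. This is an illustration under stated assumptions, not the code in jobs.service.ts: the function name expectedSamplesPerHour, the env interface, and the providerCount parameter are hypothetical, and the result is only an upper bound on scheduled jobs per hour.

```typescript
// Hypothetical shape of the two environment variables named in this issue.
interface SampledRetrievalEnv {
  SUBGRAPH_ENDPOINT?: string;
  SAMPLED_RETRIEVALS_PER_SP_PER_HOUR?: string;
}

// Default rate stated in this issue (2 per SP per hour).
const DEFAULT_RATE_PER_SP_PER_HOUR = 2;

// Returns an upper bound on sampled-retrieval jobs scheduled per hour.
// Assumption: an unset or blank SUBGRAPH_ENDPOINT means no jobs schedule at all.
function expectedSamplesPerHour(env: SampledRetrievalEnv, providerCount: number): number {
  const endpoint = (env.SUBGRAPH_ENDPOINT ?? "").trim();
  if (endpoint === "") return 0;

  const raw = env.SAMPLED_RETRIEVALS_PER_SP_PER_HOUR;
  const parsed = raw === undefined || raw.trim() === "" ? NaN : Number(raw);
  // Fall back to the default when the override is missing or not a valid non-negative number.
  const rate = Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_RATE_PER_SP_PER_HOUR;
  return rate * providerCount;
}
```

If "Check Samples by Type" shows far fewer sampled rows than this bound after SUBGRAPH_ENDPOINT is set, that points at scheduling or metric emission rather than the chart query.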
618457 is a severity-organized alerting dashboard. Every chart on it maps to an alert threshold. It carries rollups for every check plus a small number of targeted detail charts that each have their own sev threshold (IPNI Verification, dataSetCreation, On-chain vs Upload). It does not carry per-sub-status breakdowns or latency percentile charts for any check. Keep sampled retrieval to that same convention.
Work
- Add sampledPieceRetrievalStatus to the two check-level rollups (8973679935, 11950321364). Exclude skipped from the success-rate denominator: sampled retrieval records skipped (empty piece pool) where other checks use pending, and the current query only filters value != 'pending', so skipped would count as failure and falsely trip the sev3 "alert at 0%".
- Decide alerting: include sampled retrieval in the sev2.5 "Approved SPs fraction failing by check type" (12253190876); decide whether the rollup's sev3 "alert at 0%" is enough or it needs its own success-rate threshold chart (the way IPNI Verification has one). Same skipped caveat applies.
- Verify retrieval_sampled renders in the job-level "by type" charts (dynamic, should be automatic).
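The skipped caveat in the first item can be sketched as follows. This is a minimal illustration of the intended denominator, not the dashboard query itself: the successRate function, the StatusCounts type, and the 'success' and 'failure' status values are assumptions; only pending and skipped are named in this issue.

```typescript
// Count of samples per status value for one check over the chart window.
type StatusCounts = Record<string, number>;

// Statuses that carry no pass/fail signal and must stay out of the denominator.
// The current query excludes only 'pending'; sampled retrieval also needs 'skipped'.
const EXCLUDED_FROM_DENOMINATOR = new Set(["pending", "skipped"]);

// Returns the success fraction, or null when no pass/fail samples exist.
// Returning null (rather than 0) is the point: a window of only skipped samples
// should read as "no data", not as a 0% success rate.
function successRate(counts: StatusCounts, successValue = "success"): number | null {
  let denominator = 0;
  for (const [status, count] of Object.entries(counts)) {
    if (!EXCLUDED_FROM_DENOMINATOR.has(status)) denominator += count;
  }
  if (denominator === 0) return null;
  return (counts[successValue] ?? 0) / denominator;
}
```

With counts of 3 success, 1 failure, and 6 skipped, this gives 3/4; if skipped were left in the denominator the same window would show 3/10, and a window containing only skipped samples would show 0% and trip the sev3 alert.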
Out of scope: per-sub-status charts (sampledIpniStatus, sampledCarParseStatus, sampledBlockFetchStatus, sampledPieceHttpResponseCode) and latency/throughput charts (the four sampled histograms). Those belong on the detail/provider dashboards, tracked separately if we want them.
Done when
- Sampled retrieval appears in the two rollup charts on 618457 (samples + success rate), with skipped excluded from the success-rate denominator.
- Sampled retrieval is wired into sev2.5/sev3 alerting, or the decision to leave it on the rollup-only "alert at 0%" is recorded on this issue.
- Job-level "by type" charts confirmed rendering retrieval_sampled.
- Verified against live mainnet + calibration data after SUBGRAPH_ENDPOINT is enabled.