
Add accuracy/performance CI and benchmarks infrastructure #248

Merged
milindsrivastava1997 merged 3 commits into main from ci/accuracy-performance
Apr 2, 2026

Conversation

Contributor

@zzylol zzylol commented Mar 28, 2026

Summary

  • accuracy_performance.yml: Full e2e eval — builds ASAP service images once (planner-rs, summary-ingest, query-engine) with a sha-<short-sha> tag, spins up the quickstart stack, waits for Arroyo pipeline + sketch ingestion, runs PromQL suite against both Prometheus (baseline) and ASAPQuery, compares accuracy and latency.
  • benchmarks/docker-compose.yml: Compose override replacing GHCR images with ${ASAP_IMAGE_TAG} so CI always tests the latest committed code.
  • benchmarks/queries/promql_suite.json: 14-query fixed suite covering avg/sum/max/min/quantile at p50/p90/p95/p99, with and without grouping.
  • benchmarks/scripts/: compare.py, run_baseline.py, run_asap.py, wait_for_stack.sh, ingest_wait.sh (with null-data handling, Arroyo state detection, 600s timeout for UDF compilation).
  • asap-summary-ingest/Dockerfile: Switch FROM to ghcr.io base image so builds work in any job without a local sketchdb-base image.
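The image-swapping override in benchmarks/docker-compose.yml could look roughly like this (a sketch — the actual service names and registry paths in the repo may differ):

```yaml
# benchmarks/docker-compose.yml (illustrative sketch)
# Layered on top of the quickstart compose file so CI runs the
# freshly built images instead of published GHCR tags.
services:
  planner-rs:
    image: ghcr.io/example/planner-rs:${ASAP_IMAGE_TAG}
  summary-ingest:
    image: ghcr.io/example/asap-summary-ingest:${ASAP_IMAGE_TAG}
  query-engine:
    image: ghcr.io/example/query-engine:${ASAP_IMAGE_TAG}
```

CI would export `ASAP_IMAGE_TAG=sha-<short-sha>` and start the stack with both files, e.g. `docker compose -f quickstart/docker-compose.yml -f benchmarks/docker-compose.yml up -d` (paths assumed).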

Pass/fail policy: no query failures allowed; ASAP-native relative error >5% warns (does not fail); latency regressions are warn-only on ephemeral GH runners.
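The pass/fail policy above can be sketched as a small classifier (illustrative only — the real compare.py is likely structured differently; the field names `baseline`, `asap`, `approximate`, and `error` are assumptions, while the 5% threshold and warn-only semantics come from the policy text):

```python
# Sketch of compare.py-style accuracy classification, assuming each result
# pairs a Prometheus baseline value with an ASAPQuery value for one query.
REL_ERR_WARN = 0.05  # >5% relative error on sketch-based queries warns, does not fail

def relative_error(baseline: float, asap: float) -> float:
    """Relative error vs. the Prometheus baseline; 0 when both are zero."""
    if baseline == 0.0:
        return 0.0 if asap == 0.0 else float("inf")
    return abs(asap - baseline) / abs(baseline)

def classify(results):
    """Split results into hard failures and accuracy warnings.

    Any query failure fails the job; the relative-error threshold is
    enforced only for sketch-based approximate queries (quantiles),
    never for exact aggregations (avg/sum/max/min).
    """
    failures, warnings = [], []
    for r in results:
        if r.get("error"):
            failures.append(r["query"])
        elif r.get("approximate"):
            if relative_error(r["baseline"], r["asap"]) > REL_ERR_WARN:
                warnings.append(r["query"])
    return failures, warnings
```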

Test plan

  • Accuracy/performance workflow builds Docker images and runs H2O groupby accuracy check within error bounds
  • relative-regression job runs and reports latency comparison table
  • workflow_dispatch works with runner: ubuntu-latest
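The `workflow_dispatch` trigger with a runner input could be declared roughly as follows (a sketch — the actual workflow may name the input and jobs differently):

```yaml
# accuracy_performance.yml excerpt (illustrative sketch)
on:
  workflow_dispatch:
    inputs:
      runner:
        description: "Runner label to execute on"
        default: ubuntu-latest

jobs:
  accuracy:
    # Fall back to ubuntu-latest when triggered without inputs
    runs-on: ${{ inputs.runner || 'ubuntu-latest' }}
```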

🤖 Generated with Claude Code

zzylol and others added 2 commits March 31, 2026 14:20
Clarifies that the field marks sketch-based approximate queries (quantiles), where accuracy-threshold enforcement applies, as opposed to exact aggregations (avg/sum/max/min), where it does not.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
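A suite entry carrying that flag might look like this (the field names and query are hypothetical — the commit only describes what the flag means):

```json
{
  "name": "latency_p99_by_service",
  "promql": "histogram_quantile(0.99, sum by (service, le) (rate(request_latency_bucket[5m])))",
  "approximate": true
}
```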
Local/Cloudlab builds default to sketchdb-base:latest (built locally); CI overrides this via the BASE_IMAGE build arg to pull from GHCR, fixing the installation break reported earlier.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
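The defaulted-build-arg pattern described in that commit is typically written like this (a sketch — image names and build steps are illustrative):

```dockerfile
# asap-summary-ingest/Dockerfile (illustrative sketch)
# Local/Cloudlab builds use the locally built base image by default;
# CI passes --build-arg BASE_IMAGE=ghcr.io/<org>/sketchdb-base:latest
# so no local sketchdb-base image is required.
ARG BASE_IMAGE=sketchdb-base:latest
FROM ${BASE_IMAGE}
# ... build steps unchanged ...
```

Note that `ARG` must precede `FROM` for the substitution to apply to the base image.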
@milindsrivastava1997 milindsrivastava1997 merged commit cebb798 into main Apr 2, 2026
12 checks passed
@milindsrivastava1997 milindsrivastava1997 deleted the ci/accuracy-performance branch April 2, 2026 19:50