
Add accuracy, correctness, and performance CI/CD workflows#236

Closed
zzylol wants to merge 9 commits into main from ci/accuracy-performance-correctness

Conversation

Contributor

@zzylol zzylol commented Mar 26, 2026

Summary

  • correctness.yml: cross-language Python/Rust PromQL pattern-matching parity tests (token matching, serialised pattern comparison, and Rust unit tests). Runs on ephemeral GH-hosted VMs, which are deterministic and well suited to this workload.
  • accuracy.yml: ASAP vs ClickHouse exact baseline accuracy regression check on H2O groupby dataset. Fails if relative error exceeds 5% threshold. Runs in Docker on GH-hosted VMs — sufficient for catching approximation regressions.
  • performance.yml: Two-tier benchmark setup:
    • relative-regression job runs on every PR — catches latency regressions ≥2× on GH-hosted VMs (noisy but useful for relative comparisons)
    • query-latency job is manual (workflow_dispatch only) with a runner input; it warns when not running on a self-hosted runner, and is intended for use once the asap-tools benchmarking infrastructure (used for the TurboProm paper) is decoupled from Cloudlab and registered as a GitHub self-hosted runner
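The accuracy gate in the list above can be sketched in a few lines. This is an illustrative sketch, not the actual accuracy.yml script: the function names and the (asap_value, clickhouse_value) pair shape are assumptions; only the 5% relative-error threshold comes from the workflow description.

```python
def relative_error(approx: float, exact: float) -> float:
    """Relative error of an approximate value, guarding against a zero baseline."""
    if exact == 0.0:
        return abs(approx)
    return abs(approx - exact) / abs(exact)

def check_accuracy(pairs, threshold=0.05):
    """pairs: iterable of (asap_value, clickhouse_exact_value).

    Returns (passed, worst_error); the workflow would fail the job
    when worst_error exceeds the 5% threshold.
    """
    worst = 0.0
    for approx, exact in pairs:
        worst = max(worst, relative_error(approx, exact))
    return worst <= threshold, worst

ok, worst = check_accuracy([(100.4, 100.0), (51.0, 50.0)])
# worst relative error here is 2%, so ok is True
```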

Notes on runner choice

Experiments run inside Docker containers on ephemeral GH Actions VMs — good for accuracy and correctness regression detection. For precise, absolute performance benchmarking the team's dedicated asap-tools infra (baseline + ASAP, highly configurable) should be registered as a self-hosted runner once decoupled from Cloudlab.

Test plan

  • Correctness workflow triggers on PR and passes cross-language token matching and pattern serialisation jobs
  • Accuracy workflow builds Docker images and runs H2O groupby accuracy check within error bounds
  • Performance relative-regression job runs and reports latency comparison table
  • workflow_dispatch on performance workflow works with runner: ubuntu-latest and runner: self-hosted (once available)

🤖 Generated with Claude Code

zzylol and others added 9 commits March 26, 2026 18:26
- correctness.yml: cross-language Python/Rust PromQL pattern matching
  parity tests (token matching, serialised patterns, Rust unit tests)
- accuracy.yml: ASAP vs ClickHouse exact baseline error regression
  check (≤5% relative error threshold) on H2O groupby dataset
- performance.yml: relative latency regression detection on GH-hosted
  VMs; manual workflow_dispatch with self-hosted runner support for
  precise absolute benchmarks once asap-tools infra is decoupled from
  Cloudlab

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add [workspace] table to three standalone test Cargo.toml files
  (compare_patterns, compare_matched_tokens/rust_tests, rust_pattern_matching)
  to prevent Cargo's "believes it's in a workspace" error against the root Cargo.toml
- Add test_data/promql_queries.json with 10 PromQL test cases covering
  ONLY_TEMPORAL, ONLY_SPATIAL, and ONE_TEMPORAL_ONE_SPATIAL patterns,
  required by the cross-language master_test_runner
- Unignore test_data/*.json in asap-common/.gitignore (tests/**/*.json
  was intended for generated result files, not fixture input data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PDF guide

Implements all 20 steps from the ASAPQuery PR Performance & Accuracy
Evaluation Guide:

- eval-pr.yml: triggers on core code changes, builds the 3 ASAP-owned
  services (planner-rs, summary-ingest, query-engine) from PR source,
  spins up the full quickstart stack, waits for pipeline + ingestion,
  runs queries against both Prometheus (baseline) and ASAPQuery, then
  evaluates accuracy and performance

- benchmarks/docker-compose.yml: Compose override that replaces the
  GHCR images for planner-rs/summary-ingest/queryengine with `build:`
  directives so CI always tests the latest committed code in the branch

- benchmarks/queries/promql_suite.json: 14-query fixed suite covering
  avg/sum/max/min/quantile at p50/p90/p95/p99, with and without
  grouping by pattern/region/service; marks ASAP-native queries

- benchmarks/scripts/wait_for_stack.sh: polls Prometheus, Arroyo, and
  QueryEngine until healthy (180s timeout each)

- benchmarks/scripts/ingest_wait.sh: waits for asap-demo Arroyo
  pipeline to reach RUNNING, then 90s for sketch accumulation

- benchmarks/scripts/run_baseline.py: queries Prometheus /api/v1/query
  3x per query, records latencies and results

- benchmarks/scripts/run_asap.py: same for ASAPQuery /api/v1/query

- benchmarks/scripts/compare.py: normalises vector results by label
  set, enforces the pass/fail policy (no query failures; ASAP-native
  relative error ≤ 1%); latency regressions are warn-only, per the
  PDF's practical caution about GH-runner noise

Pass/fail policy (PDF page 3):
  - No query failures
  - Max relative error ≤ 1%
  - No >10% p95 latency regression (warn-only on ephemeral runners)
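The normalisation-and-compare step described above can be sketched as follows. This is a hedged sketch of the idea, not the real compare.py: the result shape follows the Prometheus /api/v1/query instant-vector JSON format, and the helper names are assumptions; only the label-set keying and the ≤1% policy come from the commit message.

```python
def by_label_set(vector_result):
    """Key each sample by its label set: {frozenset(labels) -> value}."""
    out = {}
    for sample in vector_result:
        key = frozenset(sample["metric"].items())
        out[key] = float(sample["value"][1])  # value is [timestamp, "string"]
    return out

def max_relative_error(baseline, asap):
    """Worst relative error of ASAP vs the baseline across matching series."""
    base, approx = by_label_set(baseline), by_label_set(asap)
    worst = 0.0
    for key, exact in base.items():
        if key not in approx:
            return float("inf")  # a missing series counts as a failure
        if exact != 0.0:
            worst = max(worst, abs(approx[key] - exact) / abs(exact))
    return worst

baseline = [{"metric": {"region": "us"}, "value": [0, "200.0"]}]
asap     = [{"metric": {"region": "us"}, "value": [0, "201.0"]}]
assert max_relative_error(baseline, asap) <= 0.01  # 0.5% passes the policy
```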

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… fail)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split accuracy_performance.yml into build + eval jobs: images are built
  and pushed to GHCR with a sha-<short-sha> tag in the build job, then
  pulled (not rebuilt) by the eval job via ASAP_IMAGE_TAG env var
- Remove build: sections from benchmarks/docker-compose.yml; it now only
  overrides image tags using ${ASAP_IMAGE_TAG}
- Change asap-summary-ingest/Dockerfile FROM to ghcr.io/projectasap/asap-base:latest
  so it can be built in any job without requiring a local sketchdb-base image
- Add --project-directory . to all docker compose calls to fix build context
  path resolution (asap-quickstart/asap-summary-ingest lstat error)
- Fix stale eval-pr.yml path reference in accuracy_performance.yml triggers
- Fix PromQLPattern::new call sites in three test files: remove stale second
  argument after collect_tokens was removed from the function signature

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n tests

- accuracy_performance.yml: remove --project-directory . from eval job so
  bind mounts in asap-quickstart/docker-compose.yml resolve from asap-quickstart/
  (fixes asap-planner-rs exit 1 due to missing controller-config.yaml)
- correctness.yml: fix pip install path to asap-common/dependencies/py/promql_utilities/
  (setup.py is there, not at the parent dir; fixes silent failure + missing promql_parser)
- pattern_tests.rs: add avg_over_time and *_over_time variants to ONLY_TEMPORAL;
  remove duplicate ONLY_VECTOR key so disambiguation works deterministically;
  add build_combined_quantile_pattern for 2-arg quantile_over_time in combined queries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r timeout

Three bugs caused the pipeline to always show as 'not found':

1. data.get('data', []) returns None when Arroyo returns {"data": null} for an
   empty pipeline list. dict.get() only falls back to the default when the key
   is absent, not when the value is null. Fixed with (data.get('data') or []).

2. Arroyo signals a running pipeline via state=null + stop='none', not a literal
   "Running" string. The ingest_wait.sh state check was looking for the wrong
   value; the correct pattern is already used in asap-tools/run_pipeline.sh.

3. MAX_PIPELINE_WAIT=300s is too short: Arroyo must compile Rust UDFs before
   the pipeline can start, which takes several minutes in CI. Raised to 600s.

Also: normalise hyphens/underscores in the name match so 'asap-demo' matches
whether Arroyo stores it as 'asap-demo' or 'asap_demo'; add pipeline list dump
on timeout for easier future diagnosis.
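The three fixes above can be condensed into one small lookup. This is a minimal Python sketch for illustration (the actual fix lives in ingest_wait.sh): the JSON shape mirrors the Arroyo pipeline-list response as described in the commit message, and the exact field names are assumptions.

```python
def find_running_pipeline(data: dict, wanted: str):
    """Return the running pipeline named `wanted`, or None."""
    # Fix 1: data.get("data", []) still returns None for {"data": null},
    # because dict.get only falls back when the key is absent. `or []`
    # also covers the explicit-null case.
    pipelines = data.get("data") or []

    # Also: treat '-' and '_' as equivalent when matching the name.
    norm = lambda s: s.replace("-", "_")
    for p in pipelines:
        if norm(p.get("name", "")) != norm(wanted):
            continue
        # Fix 2: Arroyo signals "running" as state=null with stop='none',
        # not a literal "Running" string.
        if p.get("state") is None and p.get("stop") == "none":
            return p
    return None

assert find_running_pipeline({"data": None}, "asap-demo") is None  # null-safe
running = {"data": [{"name": "asap_demo", "state": None, "stop": "none"}]}
assert find_running_pipeline(running, "asap-demo") is not None
```

Fix 3 (the 300s → 600s timeout) is a plain constant change in the script and is not shown here.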

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
