
Add accuracy, correctness, and performance CI/CD workflows#236

Closed
zzylol wants to merge 9 commits into main from ci/accuracy-performance-correctness

Conversation

Contributor

@zzylol zzylol commented Mar 26, 2026

Summary

  • correctness.yml: cross-language Python/Rust PromQL pattern-matching parity tests (token matching, serialised pattern comparison, and Rust unit tests). Runs on ephemeral GH-hosted VMs, which are deterministic and well suited to this workload.
  • accuracy.yml: ASAP vs ClickHouse exact baseline accuracy regression check on H2O groupby dataset. Fails if relative error exceeds 5% threshold. Runs in Docker on GH-hosted VMs — sufficient for catching approximation regressions.
  • performance.yml: Two-tier benchmark setup:
    • relative-regression job runs on every PR — catches latency regressions ≥2× on GH-hosted VMs (noisy but useful for relative comparisons)
    • query-latency job is manual (workflow_dispatch only) with a runner input; it warns when not running on a self-hosted runner, and is intended for use once the asap-tools benchmarking infrastructure (used for the TurboProm paper) is decoupled from Cloudlab and registered as a GitHub self-hosted runner
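The accuracy gate in the list above can be sketched in a few lines. This is an illustrative sketch, not the actual accuracy.yml script: the function names and the (asap_value, clickhouse_value) pair shape are assumptions; only the 5% relative-error threshold comes from the workflow description.

```python
def relative_error(approx: float, exact: float) -> float:
    """Relative error of an approximate value, guarding against a zero baseline."""
    if exact == 0.0:
        return abs(approx)
    return abs(approx - exact) / abs(exact)

def check_accuracy(pairs, threshold=0.05):
    """pairs: iterable of (asap_value, clickhouse_exact_value).

    Returns (passed, worst_error); the workflow would fail the job
    when worst_error exceeds the 5% threshold.
    """
    worst = 0.0
    for approx, exact in pairs:
        worst = max(worst, relative_error(approx, exact))
    return worst <= threshold, worst

ok, worst = check_accuracy([(100.4, 100.0), (51.0, 50.0)])
# worst relative error here is 2%, so ok is True
```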

Notes on runner choice

Experiments run inside Docker containers on ephemeral GH Actions VMs — good for accuracy and correctness regression detection. For precise, absolute performance benchmarking the team's dedicated asap-tools infra (baseline + ASAP, highly configurable) should be registered as a self-hosted runner once decoupled from Cloudlab.

Test plan

  • Correctness workflow triggers on PR and passes cross-language token matching and pattern serialisation jobs
  • Accuracy workflow builds Docker images and runs H2O groupby accuracy check within error bounds
  • Performance relative-regression job runs and reports latency comparison table
  • workflow_dispatch on performance workflow works with runner: ubuntu-latest and runner: self-hosted (once available)

🤖 Generated with Claude Code

zzylol and others added 9 commits March 26, 2026 18:26
- correctness.yml: cross-language Python/Rust PromQL pattern matching
  parity tests (token matching, serialised patterns, Rust unit tests)
- accuracy.yml: ASAP vs ClickHouse exact baseline error regression
  check (≤5% relative error threshold) on H2O groupby dataset
- performance.yml: relative latency regression detection on GH-hosted
  VMs; manual workflow_dispatch with self-hosted runner support for
  precise absolute benchmarks once asap-tools infra is decoupled from
  Cloudlab

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add [workspace] table to three standalone test Cargo.toml files
  (compare_patterns, compare_matched_tokens/rust_tests, rust_pattern_matching)
  to prevent Cargo's "believes it's in a workspace" error against the root Cargo.toml
- Add test_data/promql_queries.json with 10 PromQL test cases covering
  ONLY_TEMPORAL, ONLY_SPATIAL, and ONE_TEMPORAL_ONE_SPATIAL patterns,
  required by the cross-language master_test_runner
- Unignore test_data/*.json in asap-common/.gitignore (tests/**/*.json
  was intended for generated result files, not fixture input data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PDF guide

Implements all 20 steps from the ASAPQuery PR Performance & Accuracy
Evaluation Guide:

- eval-pr.yml: triggers on core code changes, builds the 3 ASAP-owned
  services (planner-rs, summary-ingest, query-engine) from PR source,
  spins up the full quickstart stack, waits for pipeline + ingestion,
  runs queries against both Prometheus (baseline) and ASAPQuery, then
  evaluates accuracy and performance

- benchmarks/docker-compose.yml: Compose override that replaces the
  GHCR images for planner-rs/summary-ingest/queryengine with `build:`
  directives so CI always tests the latest committed code in the branch

- benchmarks/queries/promql_suite.json: 14-query fixed suite covering
  avg/sum/max/min/quantile at p50/p90/p95/p99, with and without
  grouping by pattern/region/service; marks ASAP-native queries

- benchmarks/scripts/wait_for_stack.sh: polls Prometheus, Arroyo, and
  QueryEngine until healthy (180s timeout each)

- benchmarks/scripts/ingest_wait.sh: waits for asap-demo Arroyo
  pipeline to reach RUNNING, then 90s for sketch accumulation

- benchmarks/scripts/run_baseline.py: queries Prometheus /api/v1/query
  3x per query, records latencies and results

- benchmarks/scripts/run_asap.py: same for ASAPQuery /api/v1/query

- benchmarks/scripts/compare.py: normalises vector results by label
  set, enforces the pass/fail policy (no query failures; ASAP-native
  relative error ≤ 1%); latency regressions are warn-only, per the
  PDF's practical caution about GH-runner noise

Pass/fail policy (PDF page 3):
  - No query failures
  - Max relative error ≤ 1%
  - No >10% p95 latency regression (warn-only on ephemeral runners)
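The normalisation-and-compare step described above can be sketched as follows. This is a hedged sketch of the idea, not the real compare.py: the result shape follows the Prometheus /api/v1/query instant-vector JSON format, and the helper names are assumptions; only the label-set keying and the ≤1% policy come from the commit message.

```python
def by_label_set(vector_result):
    """Key each sample by its label set: {frozenset(labels) -> value}."""
    out = {}
    for sample in vector_result:
        key = frozenset(sample["metric"].items())
        out[key] = float(sample["value"][1])  # value is [timestamp, "string"]
    return out

def max_relative_error(baseline, asap):
    """Worst relative error of ASAP vs the baseline across matching series."""
    base, approx = by_label_set(baseline), by_label_set(asap)
    worst = 0.0
    for key, exact in base.items():
        if key not in approx:
            return float("inf")  # a missing series counts as a failure
        if exact != 0.0:
            worst = max(worst, abs(approx[key] - exact) / abs(exact))
    return worst

baseline = [{"metric": {"region": "us"}, "value": [0, "200.0"]}]
asap     = [{"metric": {"region": "us"}, "value": [0, "201.0"]}]
assert max_relative_error(baseline, asap) <= 0.01  # 0.5% passes the policy
```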

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… fail)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split accuracy_performance.yml into build + eval jobs: images are built
  and pushed to GHCR with a sha-<short-sha> tag in the build job, then
  pulled (not rebuilt) by the eval job via ASAP_IMAGE_TAG env var
- Remove build: sections from benchmarks/docker-compose.yml; it now only
  overrides image tags using ${ASAP_IMAGE_TAG}
- Change asap-summary-ingest/Dockerfile FROM to ghcr.io/projectasap/asap-base:latest
  so it can be built in any job without requiring a local sketchdb-base image
- Add --project-directory . to all docker compose calls to fix build context
  path resolution (asap-quickstart/asap-summary-ingest lstat error)
- Fix stale eval-pr.yml path reference in accuracy_performance.yml triggers
- Fix PromQLPattern::new call sites in three test files: remove stale second
  argument after collect_tokens was removed from the function signature

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n tests

- accuracy_performance.yml: remove --project-directory . from eval job so
  bind mounts in asap-quickstart/docker-compose.yml resolve from asap-quickstart/
  (fixes asap-planner-rs exit 1 due to missing controller-config.yaml)
- correctness.yml: fix pip install path to asap-common/dependencies/py/promql_utilities/
  (setup.py is there, not at the parent dir; fixes silent failure + missing promql_parser)
- pattern_tests.rs: add avg_over_time and *_over_time variants to ONLY_TEMPORAL;
  remove duplicate ONLY_VECTOR key so disambiguation works deterministically;
  add build_combined_quantile_pattern for 2-arg quantile_over_time in combined queries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r timeout

Three bugs caused the pipeline to always show as 'not found':

1. data.get('data', []) returns None when Arroyo returns {"data": null} for an
   empty pipeline list. dict.get() only falls back to the default when the key
   is absent, not when the value is null. Fixed with (data.get('data') or []).

2. Arroyo signals a running pipeline via state=null + stop='none', not a literal
   "Running" string. The ingest_wait.sh state check was looking for the wrong
   value; the correct pattern is already used in asap-tools/run_pipeline.sh.

3. MAX_PIPELINE_WAIT=300s is too short: Arroyo must compile Rust UDFs before
   the pipeline can start, which takes several minutes in CI. Raised to 600s.

Also: normalise hyphens/underscores in the name match so 'asap-demo' matches
whether Arroyo stores it as 'asap-demo' or 'asap_demo'; add pipeline list dump
on timeout for easier future diagnosis.
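The three fixes above can be condensed into one small lookup. This is a minimal Python sketch for illustration (the actual fix lives in ingest_wait.sh): the JSON shape mirrors the Arroyo pipeline-list response as described in the commit message, and the exact field names are assumptions.

```python
def find_running_pipeline(data: dict, wanted: str):
    """Return the running pipeline named `wanted`, or None."""
    # Fix 1: data.get("data", []) still returns None for {"data": null},
    # because dict.get only falls back when the key is absent. `or []`
    # also covers the explicit-null case.
    pipelines = data.get("data") or []

    # Also: treat '-' and '_' as equivalent when matching the name.
    norm = lambda s: s.replace("-", "_")
    for p in pipelines:
        if norm(p.get("name", "")) != norm(wanted):
            continue
        # Fix 2: Arroyo signals "running" as state=null with stop='none',
        # not a literal "Running" string.
        if p.get("state") is None and p.get("stop") == "none":
            return p
    return None

assert find_running_pipeline({"data": None}, "asap-demo") is None  # null-safe
running = {"data": [{"name": "asap_demo", "state": None, "stop": "none"}]}
assert find_running_pipeline(running, "asap-demo") is not None
```

Fix 3 (the 300s → 600s timeout) is a plain constant change in the script and is not shown here.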

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
