Add accuracy, correctness, and performance CI/CD workflows #236
Closed
- correctness.yml: cross-language Python/Rust PromQL pattern matching parity tests (token matching, serialised patterns, Rust unit tests)
- accuracy.yml: ASAP vs ClickHouse exact baseline error regression check (≤5% relative error threshold) on H2O groupby dataset
- performance.yml: relative latency regression detection on GH-hosted VMs; manual workflow_dispatch with self-hosted runner support for precise absolute benchmarks once asap-tools infra is decoupled from Cloudlab
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add [workspace] table to three standalone test Cargo.toml files (compare_patterns, compare_matched_tokens/rust_tests, rust_pattern_matching) to prevent "believes it's in a workspace" error on the root Cargo.toml
- Add test_data/promql_queries.json with 10 PromQL test cases covering ONLY_TEMPORAL, ONLY_SPATIAL, and ONE_TEMPORAL_ONE_SPATIAL patterns, required by the cross-language master_test_runner
- Unignore test_data/*.json in asap-common/.gitignore (tests/**/*.json was intended for generated result files, not fixture input data)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PDF guide
Implements all 20 steps from the ASAPQuery PR Performance & Accuracy Evaluation Guide:
- eval-pr.yml: triggers on core code changes, builds the 3 ASAP-owned services (planner-rs, summary-ingest, query-engine) from PR source, spins up the full quickstart stack, waits for pipeline + ingestion, runs queries against both Prometheus (baseline) and ASAPQuery, then evaluates accuracy and performance
- benchmarks/docker-compose.yml: Compose override that replaces the GHCR images for planner-rs/summary-ingest/queryengine with `build:` directives so CI always tests the latest committed code in the branch
- benchmarks/queries/promql_suite.json: 14-query fixed suite covering avg/sum/max/min/quantile at p50/p90/p95/p99, with and without grouping by pattern/region/service; marks ASAP-native queries
- benchmarks/scripts/wait_for_stack.sh: polls Prometheus, Arroyo, and QueryEngine until healthy (180s timeout each)
- benchmarks/scripts/ingest_wait.sh: waits for asap-demo Arroyo pipeline to reach RUNNING, then 90s for sketch accumulation
- benchmarks/scripts/run_baseline.py: queries Prometheus /api/v1/query 3x per query, records latencies and results
- benchmarks/scripts/run_asap.py: same for ASAPQuery /api/v1/query
- benchmarks/scripts/compare.py: normalises vector results by label set, enforces pass/fail policy (no query failures; ASAP-native relative error ≤ 1%); latency regressions are warn-only per PDF practical caution note on GH runner noise
Pass/fail policy (PDF page 3):
- No query failures
- Max relative error ≤ 1%
- No >10% p95 latency regression (warn-only on ephemeral runners)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
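The accuracy part of the policy above (normalise vector results by label set, then bound the relative error) could look roughly like the sketch below. It is not the actual compare.py: the function names are hypothetical, dropping `__name__` before matching is an assumption, and only the Prometheus instant-vector shape (`metric` plus `value: [timestamp, "string"]`) and the 1% threshold come from the description.

```python
from typing import Dict, FrozenSet, List, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


def index_by_labels(vector: List[dict]) -> Dict[LabelSet, float]:
    """Key each sample of an instant vector by its label set."""
    indexed: Dict[LabelSet, float] = {}
    for sample in vector:
        # Assumption: the metric name is ignored so that baseline and
        # candidate series match on their grouping labels only.
        labels = {k: v for k, v in sample["metric"].items() if k != "__name__"}
        _timestamp, value = sample["value"]  # value arrives as a string
        indexed[frozenset(labels.items())] = float(value)
    return indexed


def max_relative_error(baseline: List[dict], candidate: List[dict]) -> float:
    """Worst per-series relative error; infinite if the label sets differ."""
    base = index_by_labels(baseline)
    cand = index_by_labels(candidate)
    if base.keys() != cand.keys():
        return float("inf")
    worst = 0.0
    for key, expected in base.items():
        actual = cand[key]
        if expected == 0.0:
            # Relative error is undefined at zero; treat any deviation as a miss.
            err = 0.0 if actual == 0.0 else float("inf")
        else:
            err = abs(actual - expected) / abs(expected)
        worst = max(worst, err)
    return worst


def passes(baseline: List[dict], candidate: List[dict], threshold: float = 0.01) -> bool:
    """Apply the 'max relative error <= 1%' rule from the policy."""
    return max_relative_error(baseline, candidate) <= threshold
```

Keying on a frozenset of label pairs makes the comparison independent of the order in which either backend returns its series.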
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… fail) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split accuracy_performance.yml into build + eval jobs: images are built
and pushed to GHCR with a sha-<short-sha> tag in the build job, then
pulled (not rebuilt) by the eval job via ASAP_IMAGE_TAG env var
- Remove build: sections from benchmarks/docker-compose.yml; it now only
overrides image tags using ${ASAP_IMAGE_TAG}
- Change asap-summary-ingest/Dockerfile FROM to ghcr.io/projectasap/asap-base:latest
so it can be built in any job without requiring a local sketchdb-base image
- Add --project-directory . to all docker compose calls to fix build context
path resolution (asap-quickstart/asap-summary-ingest lstat error)
- Fix stale eval-pr.yml path reference in accuracy_performance.yml triggers
- Fix PromQLPattern::new call sites in three test files: remove stale second
argument after collect_tokens was removed from the function signature
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n tests
- accuracy_performance.yml: remove --project-directory . from eval job so bind mounts in asap-quickstart/docker-compose.yml resolve from asap-quickstart/ (fixes asap-planner-rs exit 1 due to missing controller-config.yaml)
- correctness.yml: fix pip install path to asap-common/dependencies/py/promql_utilities/ (setup.py is there, not at the parent dir; fixes silent failure + missing promql_parser)
- pattern_tests.rs: add avg_over_time and *_over_time variants to ONLY_TEMPORAL; remove duplicate ONLY_VECTOR key so disambiguation works deterministically; add build_combined_quantile_pattern for 2-arg quantile_over_time in combined queries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r timeout
Three bugs caused the pipeline to always show as 'not found':
1. data.get('data', []) returns None when Arroyo returns {"data": null} for an
empty pipeline list. dict.get() only falls back to the default when the key
is absent, not when the value is null. Fixed with (data.get('data') or []).
2. Arroyo signals a running pipeline via state=null + stop='none', not a literal
"Running" string. The ingest_wait.sh state check was looking for the wrong
value; the correct pattern is already used in asap-tools/run_pipeline.sh.
3. MAX_PIPELINE_WAIT=300s is too short: Arroyo must compile Rust UDFs before
the pipeline can start, which takes several minutes in CI. Raised to 600s.
Also: normalise hyphens/underscores in the name match so 'asap-demo' matches
whether Arroyo stores it as 'asap-demo' or 'asap_demo'; add pipeline list dump
on timeout for easier future diagnosis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
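The first two bugs and the name normalisation can be illustrated with a short Python sketch. The helper names are hypothetical, and the assumption that `name`, `state`, and `stop` sit together on one pipeline object is made only for illustration; the real Arroyo response layout may differ.

```python
from typing import Optional


def normalise(name: str) -> str:
    """Treat hyphens and underscores as equivalent in pipeline names."""
    return name.replace("-", "_")


def find_pipeline(response: dict, name: str) -> Optional[dict]:
    """Look up a pipeline by name in a parsed pipeline-list response."""
    # response.get("data", []) would return None for {"data": null},
    # because the default applies only when the key is absent.
    pipelines = response.get("data") or []
    wanted = normalise(name)
    for pipeline in pipelines:
        if normalise(pipeline.get("name") or "") == wanted:
            return pipeline
    return None


def is_running(pipeline: dict) -> bool:
    """Running is signalled by state=null plus stop='none', not a 'Running' string."""
    return pipeline.get("state") is None and pipeline.get("stop") == "none"
```

With this shape, an empty list response yields "not found" cleanly instead of raising on iteration over `None`.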
Summary
- `relative-regression` job runs on every PR; it catches latency regressions ≥2× on GH-hosted VMs (noisy but useful for relative comparisons)
- `query-latency` job is manual (`workflow_dispatch` only) with a `runner` input; it warns when not on a self-hosted runner and is intended for use once the asap-tools benchmarking infra (used for TurboProm paper) is decoupled from Cloudlab and registered as a GitHub self-hosted runner
Notes on runner choice
Experiments run inside Docker containers on ephemeral GH Actions VMs — good for accuracy and correctness regression detection. For precise, absolute performance benchmarking the team's dedicated asap-tools infra (baseline + ASAP, highly configurable) should be registered as a self-hosted runner once decoupled from Cloudlab.
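Because absolute timings on ephemeral VMs are noisy, the relative check from the summary compares two runs taken on the same machine. A minimal sketch follows, assuming the median of repeated samples is used as the summary statistic; that choice and the function names are illustrative and not taken from the workflow. Only the 2× factor comes from the summary.

```python
import statistics
from typing import Sequence


def regression_ratio(baseline_s: Sequence[float], candidate_s: Sequence[float]) -> float:
    """Ratio of candidate to baseline median latency, both in seconds."""
    base = statistics.median(baseline_s)
    if base <= 0:
        raise ValueError("baseline latency must be positive")
    return statistics.median(candidate_s) / base


def is_regression(baseline_s: Sequence[float], candidate_s: Sequence[float],
                  factor: float = 2.0) -> bool:
    """Flag the candidate when it is at least `factor` times slower."""
    return regression_ratio(baseline_s, candidate_s) >= factor
```

A coarse factor such as 2× tolerates run-to-run jitter on shared runners while still catching large slowdowns.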
Test plan
- `relative-regression` job runs and reports latency comparison table
- `workflow_dispatch` on performance workflow works with `runner: ubuntu-latest` and `runner: self-hosted` (once available)
🤖 Generated with Claude Code