93 changes: 93 additions & 0 deletions asap-tools/execution-utilities/asap_benchmark_pipeline/README.md
# ASAP H2O Benchmark Pipeline

Benchmarks ASAP (KLL sketch-based) query serving against a ClickHouse baseline using the [H2O groupby dataset](https://h2oai.github.io/db-benchmark/) (10M rows, 100 groups).

**ASAP mode** streams data through Kafka into Arroyo, which builds KLL sketches per tumbling window per group. The QueryEngine (QE) ingests these sketches and serves approximate quantile queries directly from them.

**Baseline mode** loads the same data into a ClickHouse MergeTree table and runs equivalent quantile queries using ClickHouse's native `quantile()` function.
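For illustration, a baseline query might look like the sketch below. The table name `h2o_groupby` matches the table dropped by `cleanup.sh`; the column names (`id1`, `v3`) follow the standard H2O groupby schema and are an assumption here, so the actual queries in `clickhouse_quantile_queries.sql` may differ:

```sql
-- Hypothetical baseline query: median of v3 per id1 group,
-- using ClickHouse's native approximate quantile function.
SELECT id1, quantile(0.5)(v3) AS median_v3
FROM h2o_groupby
GROUP BY id1;
```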

Each benchmark run produces a CSV of per-query latencies and a latency plot (`.png`).

## Prerequisites

- Kafka and ClickHouse installed (see `../../installation/`)
- Arroyo binary built (`arroyo/target/release/arroyo`)
- QueryEngine binary built (`target/release/query_engine_rust`)
- Python 3 with `requests`, `kafka-python`, `gdown`, `matplotlib`
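One quick way to verify the Python prerequisites is to probe for each package before running the pipeline. Note that `kafka-python` installs the importable module `kafka`. This helper is an illustrative sketch, not part of the repository:

```python
import importlib.util

# Importable module names for the required packages
# (the kafka-python distribution imports as "kafka").
REQUIRED = ["requests", "kafka", "gdown", "matplotlib"]

def missing_packages(names=REQUIRED):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Python prerequisites found")
```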

## Usage

### Full pipeline (recommended)

```bash
# Run baseline benchmark (starts infra, loads data, runs queries)
./run_pipeline.sh --mode baseline --load-data --output baseline_results.csv

# Clean up between runs for fair comparison
./cleanup.sh

# Run ASAP benchmark
./run_pipeline.sh --mode asap --load-data --output asap_results.csv

# Run both back-to-back and generate comparison plot
./run_pipeline.sh --mode both --load-data
```

### Options

```
--mode [asap|baseline|both] Execution mode (default: asap)
--load-data Download and load the H2O dataset
--output [FILE] Output CSV file
--skip-infra Skip starting Kafka/ClickHouse (already running)
--max-rows [N] Limit rows loaded (0 = all, default: all)
```
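`run_pipeline.sh` itself is a shell script, but for readers who want to drive the pipeline from Python, the option surface above can be mirrored with an `argparse` sketch (illustrative only; this parser is not part of the repository):

```python
import argparse

def build_parser():
    """Mirror of the run_pipeline.sh flags, for wrapping it from Python."""
    p = argparse.ArgumentParser(description="ASAP benchmark pipeline options (sketch)")
    p.add_argument("--mode", choices=["asap", "baseline", "both"], default="asap",
                   help="Execution mode")
    p.add_argument("--load-data", action="store_true",
                   help="Download and load the H2O dataset")
    p.add_argument("--output", default=None, help="Output CSV file")
    p.add_argument("--skip-infra", action="store_true",
                   help="Skip starting Kafka/ClickHouse")
    p.add_argument("--max-rows", type=int, default=0,
                   help="Limit rows loaded (0 = all)")
    return p
```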

### Cleanup

`cleanup.sh` kills all benchmark processes (QE, Arroyo, Kafka, ClickHouse), clears Kafka topics and consumer offsets, drops the ClickHouse table and its data directories, and drops the OS page cache, dentries, and inodes to ensure identical starting conditions between runs.

```bash
./cleanup.sh # full cleanup (requires sudo for cache drop)
./cleanup.sh --no-sudo # skip OS cache clearing
```

### Benchmark only (infra already running, data already loaded)

```bash
python3 run_benchmark.py --mode baseline --output baseline_results.csv
python3 run_benchmark.py --mode asap --output asap_results.csv
```
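The per-query latency measurement in `run_benchmark.py` can be sketched as below; `run_query` is a hypothetical stand-in for whatever executes a query against QE or ClickHouse, and the CSV layout is illustrative:

```python
import csv
import io
import time

def time_queries(queries, run_query):
    """Execute each query via run_query, returning (query, latency_seconds) pairs."""
    results = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        results.append((q, time.perf_counter() - start))
    return results

def to_csv(rows):
    """Render (query, latency) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["query", "latency_s"])
    writer.writerows(rows)
    return buf.getvalue()
```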

### Comparison plot

After running both modes, generate a side-by-side comparison:

```bash
python3 plot_latency.py
```
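The comparison that `plot_latency.py` visualizes boils down to summary statistics over the two latency series. A minimal sketch of that reduction (not the actual plotting code):

```python
from statistics import mean, median

def compare(baseline_latencies, asap_latencies):
    """Summary stats plus the mean-latency speedup of ASAP over the baseline."""
    summary = {
        "baseline_mean_s": mean(baseline_latencies),
        "asap_mean_s": mean(asap_latencies),
        "baseline_median_s": median(baseline_latencies),
        "asap_median_s": median(asap_latencies),
    }
    # Ratio > 1 means ASAP answered faster on average.
    summary["mean_speedup"] = summary["baseline_mean_s"] / summary["asap_mean_s"]
    return summary
```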

## Configuration

| File | Purpose |
|------|---------|
| `streaming_config.yaml` | Arroyo sketch pipeline config (window size, aggregation type, grouping) |
| `inference_config.yaml` | QE query-to-sketch mapping (must match `streaming_config.yaml` window size) |
| `h2o_init.sql` | ClickHouse table schema |
| `asap_quantile_queries.sql` | Queries for ASAP mode (QUANTILE syntax) |
| `clickhouse_quantile_queries.sql` | Queries for baseline mode (quantile() syntax) |

### Changing window size

To benchmark with a different tumbling window size (e.g., 120s):

1. Set `tumblingWindowSize` and `windowSize` in `streaming_config.yaml`
2. Set the DATEADD offset in `inference_config.yaml` to match (e.g., `-120`)
3. Regenerate query files with matching window boundaries
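The invariant in steps 1-2 — both YAML files must agree on the window size — can be captured in a small helper. The key names come from the steps above, while the exact YAML layout is an assumption:

```python
def window_settings(window_seconds):
    """Derive the values that must stay in sync across the two config files."""
    return {
        "streaming_config.yaml": {
            "tumblingWindowSize": window_seconds,
            "windowSize": window_seconds,
        },
        "inference_config.yaml": {
            # DATEADD offset is the negated window size, per step 2.
            "dateadd_offset": -window_seconds,
        },
    }
```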

## Output

Each run produces:
- `<output>.csv` — per-query latencies, row counts, and result previews
- `<output>.png` — bar chart of latency by query execution order
- `latency_comparison.png` — side-by-side comparison (from `plot_latency.py`)

Large diffs are not rendered by default.

99 changes: 99 additions & 0 deletions asap-tools/execution-utilities/asap_benchmark_pipeline/cleanup.sh
#!/bin/bash
#
# Clean up all benchmark processes, Kafka state, ClickHouse data, and OS caches
# so that the next run_pipeline.sh invocation starts from identical conditions.
#
# Usage:
# ./cleanup.sh # full cleanup (requires sudo for cache drop)
# ./cleanup.sh --no-sudo # skip OS cache clearing (no sudo required)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
TOOLS_DIR="$(cd "$SCRIPT_DIR/../.." &>/dev/null && pwd)"

KAFKA_INSTALL_DIR="${KAFKA_INSTALL_DIR:-$TOOLS_DIR/installation/kafka}"
CLICKHOUSE_INSTALL_DIR="${CLICKHOUSE_INSTALL_DIR:-$TOOLS_DIR/installation/clickhouse}"
KAFKA_DIR="$KAFKA_INSTALL_DIR/kafka"
CLICKHOUSE_DIR="$CLICKHOUSE_INSTALL_DIR/clickhouse"

NO_SUDO=0
for arg in "$@"; do
case "$arg" in
--no-sudo) NO_SUDO=1 ;;
--help) echo "Usage: ./cleanup.sh [--no-sudo]"; exit 0 ;;
esac
done

# ==========================================
# 1. Kill application processes
# ==========================================
echo "Stopping application processes..."
pkill -f "query_engine_rust" 2>/dev/null || true
pkill -f "arroyo.*cluster" 2>/dev/null || true
sleep 2
# Force-kill any stragglers
pkill -9 -f "query_engine_rust" 2>/dev/null || true
pkill -9 -f "arroyo.*cluster" 2>/dev/null || true

# ==========================================
# 2. Stop Kafka
# ==========================================
echo "Stopping Kafka..."
if [ -x "$KAFKA_DIR/bin/kafka-server-stop.sh" ]; then
"$KAFKA_DIR/bin/kafka-server-stop.sh" 2>/dev/null || true
fi
if [ -x "$KAFKA_DIR/bin/zookeeper-server-stop.sh" ]; then
"$KAFKA_DIR/bin/zookeeper-server-stop.sh" 2>/dev/null || true
fi
sleep 2
pkill -f "kafka\.Kafka" 2>/dev/null || true
pkill -f "QuorumPeerMain" 2>/dev/null || true
pkill -9 -f "kafka\.Kafka" 2>/dev/null || true
pkill -9 -f "QuorumPeerMain" 2>/dev/null || true

# Clean Kafka data directories (topics, consumer group offsets, logs)
if [ -d "$KAFKA_DIR" ]; then
echo "Clearing Kafka data..."
rm -rf "$KAFKA_DIR/data" "$KAFKA_DIR/logs" /tmp/kafka-logs /tmp/zookeeper 2>/dev/null || true
fi

# ==========================================
# 3. Stop ClickHouse and clear its data
# ==========================================
echo "Stopping ClickHouse..."
if curl -sf "http://localhost:8123/ping" >/dev/null 2>&1; then
# Drop the table before stopping so next run starts clean
curl -sf "http://localhost:8123" -d "DROP TABLE IF EXISTS h2o_groupby" 2>/dev/null || true
fi
pkill -f "clickhouse-server" 2>/dev/null || true
pkill -f "clickhouse server" 2>/dev/null || true
sleep 2
pkill -9 -f "clickhouse-server" 2>/dev/null || true
pkill -9 -f "clickhouse server" 2>/dev/null || true

# Clear ClickHouse data directory
if [ -d "$CLICKHOUSE_DIR" ]; then
echo "Clearing ClickHouse data..."
rm -rf "$CLICKHOUSE_DIR/data" "$CLICKHOUSE_DIR/store" "$CLICKHOUSE_DIR/metadata" 2>/dev/null || true
fi

# ==========================================
# 4. Clear QE output directories
# ==========================================
echo "Clearing QE output..."
rm -rf "$SCRIPT_DIR/output" "$SCRIPT_DIR/outputs" 2>/dev/null || true

# ==========================================
# 5. Drop OS page cache and dentries/inodes
# ==========================================
if [ "$NO_SUDO" -eq 0 ]; then
echo "Dropping OS page cache, dentries, and inodes..."
sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
echo "OS caches cleared"
else
echo "Skipping OS cache clearing (--no-sudo)"
fi

echo "Cleanup complete. Next run_pipeline.sh will start from a clean state."