93 changes: 93 additions & 0 deletions asap-tools/execution-utilities/asap_benchmark_pipeline/README.md
# ASAP H2O Benchmark Pipeline

Benchmarks ASAP (KLL sketch-based) query serving against a ClickHouse baseline using the [H2O groupby dataset](https://h2oai.github.io/db-benchmark/) (10M rows, 100 groups).

**ASAP mode** streams data through Kafka into Arroyo, which builds KLL sketches per tumbling window per group. The QueryEngine (QE) ingests these sketches and serves approximate quantile queries directly from them.

**Baseline mode** loads the same data into a ClickHouse MergeTree table and runs equivalent quantile queries using ClickHouse's native `quantile()` function.
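For illustration, a baseline query might look like the sketch below. The table name `h2o_groupby` matches the table dropped by `cleanup.sh`; the column names (`id1`, `v3`) follow the standard H2O groupby schema and are an assumption here, so the actual queries in `clickhouse_quantile_queries.sql` may differ:

```sql
-- Hypothetical baseline query: median of v3 per id1 group,
-- using ClickHouse's native approximate quantile function.
SELECT id1, quantile(0.5)(v3) AS median_v3
FROM h2o_groupby
GROUP BY id1;
```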

Each benchmark run produces a CSV of per-query latencies and a latency plot (`.png`).

## Prerequisites

- Kafka and ClickHouse installed (see `../../installation/`)
- Arroyo binary built (`arroyo/target/release/arroyo`)
- QueryEngine binary built (`target/release/query_engine_rust`)
- Python 3 with `requests`, `kafka-python`, `gdown`, `matplotlib`
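One quick way to verify the Python prerequisites is to probe for each package before running the pipeline. Note that `kafka-python` installs the importable module `kafka`. This helper is an illustrative sketch, not part of the repository:

```python
import importlib.util

# Importable module names for the required packages
# (the kafka-python distribution imports as "kafka").
REQUIRED = ["requests", "kafka", "gdown", "matplotlib"]

def missing_packages(names=REQUIRED):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Python prerequisites found")
```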

## Usage

### Full pipeline (recommended)

```bash
# Run baseline benchmark (starts infra, loads data, runs queries)
./run_pipeline.sh --mode baseline --load-data --output baseline_results.csv

# Clean up between runs for fair comparison
./cleanup.sh

# Run ASAP benchmark
./run_pipeline.sh --mode asap --load-data --output asap_results.csv

# Run both back-to-back and generate comparison plot
./run_pipeline.sh --mode both --load-data
```

### Options

```
--mode [asap|baseline|both] Execution mode (default: asap)
--load-data Download and load the H2O dataset
--output [FILE] Output CSV file
--skip-infra Skip starting Kafka/ClickHouse (already running)
--max-rows [N] Limit rows loaded (0 = all, default: all)
```
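`run_pipeline.sh` itself is a shell script, but for readers who want to drive the pipeline from Python, the option surface above can be mirrored with an `argparse` sketch (illustrative only; this parser is not part of the repository):

```python
import argparse

def build_parser():
    """Mirror of the run_pipeline.sh flags, for wrapping it from Python."""
    p = argparse.ArgumentParser(description="ASAP benchmark pipeline options (sketch)")
    p.add_argument("--mode", choices=["asap", "baseline", "both"], default="asap",
                   help="Execution mode")
    p.add_argument("--load-data", action="store_true",
                   help="Download and load the H2O dataset")
    p.add_argument("--output", default=None, help="Output CSV file")
    p.add_argument("--skip-infra", action="store_true",
                   help="Skip starting Kafka/ClickHouse")
    p.add_argument("--max-rows", type=int, default=0,
                   help="Limit rows loaded (0 = all)")
    return p
```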

### Cleanup

`cleanup.sh` kills all benchmark processes (QE, Arroyo, Kafka, ClickHouse), clears Kafka topics and consumer offsets, drops the ClickHouse table and its data directories, and drops the OS page cache, dentries, and inodes to ensure identical starting conditions between runs.

```bash
./cleanup.sh # full cleanup (requires sudo for cache drop)
./cleanup.sh --no-sudo # skip OS cache clearing
```

### Benchmark only (infra already running, data already loaded)

```bash
python3 run_benchmark.py --mode baseline --output baseline_results.csv
python3 run_benchmark.py --mode asap --output asap_results.csv
```
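The per-query latency measurement in `run_benchmark.py` can be sketched as below; `run_query` is a hypothetical stand-in for whatever executes a query against QE or ClickHouse, and the CSV layout is illustrative:

```python
import csv
import io
import time

def time_queries(queries, run_query):
    """Execute each query via run_query, returning (query, latency_seconds) pairs."""
    results = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        results.append((q, time.perf_counter() - start))
    return results

def to_csv(rows):
    """Render (query, latency) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["query", "latency_s"])
    writer.writerows(rows)
    return buf.getvalue()
```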

### Comparison plot

After running both modes, generate a side-by-side comparison:

```bash
python3 plot_latency.py
```
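The comparison that `plot_latency.py` visualizes boils down to summary statistics over the two latency series. A minimal sketch of that reduction (not the actual plotting code):

```python
from statistics import mean, median

def compare(baseline_latencies, asap_latencies):
    """Summary stats plus the mean-latency speedup of ASAP over the baseline."""
    summary = {
        "baseline_mean_s": mean(baseline_latencies),
        "asap_mean_s": mean(asap_latencies),
        "baseline_median_s": median(baseline_latencies),
        "asap_median_s": median(asap_latencies),
    }
    # Ratio > 1 means ASAP answered faster on average.
    summary["mean_speedup"] = summary["baseline_mean_s"] / summary["asap_mean_s"]
    return summary
```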

## Configuration

| File | Purpose |
|------|---------|
| `streaming_config.yaml` | Arroyo sketch pipeline config (window size, aggregation type, grouping) |
| `inference_config.yaml` | QE query-to-sketch mapping (must match `streaming_config.yaml` window size) |
| `h2o_init.sql` | ClickHouse table schema |
| `asap_quantile_queries.sql` | Queries for ASAP mode (QUANTILE syntax) |
| `clickhouse_quantile_queries.sql` | Queries for baseline mode (quantile() syntax) |

### Changing window size

To benchmark with a different tumbling window size (e.g., 120s):

1. Set `tumblingWindowSize` and `windowSize` in `streaming_config.yaml`
2. Set the DATEADD offset in `inference_config.yaml` to match (e.g., `-120`)
3. Regenerate query files with matching window boundaries
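The invariant in steps 1-2 — both YAML files must agree on the window size — can be captured in a small helper. The key names come from the steps above, while the exact YAML layout is an assumption:

```python
def window_settings(window_seconds):
    """Derive the values that must stay in sync across the two config files."""
    return {
        "streaming_config.yaml": {
            "tumblingWindowSize": window_seconds,
            "windowSize": window_seconds,
        },
        "inference_config.yaml": {
            # DATEADD offset is the negated window size, per step 2.
            "dateadd_offset": -window_seconds,
        },
    }
```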

## Output

Each run produces:
- `<output>.csv` — per-query latencies, row counts, and result previews
- `<output>.png` — bar chart of latency by query execution order
- `latency_comparison.png` — side-by-side comparison (from `plot_latency.py`)

Large diffs are not rendered by default.

99 changes: 99 additions & 0 deletions asap-tools/execution-utilities/asap_benchmark_pipeline/cleanup.sh
#!/bin/bash
#
# Clean up all benchmark processes, Kafka state, ClickHouse data, and OS caches
# so that the next run_pipeline.sh invocation starts from identical conditions.
#
# Usage:
# ./cleanup.sh # full cleanup (requires sudo for cache drop)
# ./cleanup.sh --no-sudo # skip OS cache clearing (no sudo required)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
TOOLS_DIR="$(cd "$SCRIPT_DIR/../.." &>/dev/null && pwd)"

KAFKA_INSTALL_DIR="${KAFKA_INSTALL_DIR:-$TOOLS_DIR/installation/kafka}"
CLICKHOUSE_INSTALL_DIR="${CLICKHOUSE_INSTALL_DIR:-$TOOLS_DIR/installation/clickhouse}"
KAFKA_DIR="$KAFKA_INSTALL_DIR/kafka"
CLICKHOUSE_DIR="$CLICKHOUSE_INSTALL_DIR/clickhouse"

NO_SUDO=0
for arg in "$@"; do
case "$arg" in
--no-sudo) NO_SUDO=1 ;;
--help) echo "Usage: ./cleanup.sh [--no-sudo]"; exit 0 ;;
esac
done

# ==========================================
# 1. Kill application processes
# ==========================================
echo "Stopping application processes..."
pkill -f "query_engine_rust" 2>/dev/null || true
pkill -f "arroyo.*cluster" 2>/dev/null || true
sleep 2
# Force-kill any stragglers
pkill -9 -f "query_engine_rust" 2>/dev/null || true
pkill -9 -f "arroyo.*cluster" 2>/dev/null || true

# ==========================================
# 2. Stop Kafka
# ==========================================
echo "Stopping Kafka..."
if [ -x "$KAFKA_DIR/bin/kafka-server-stop.sh" ]; then
"$KAFKA_DIR/bin/kafka-server-stop.sh" 2>/dev/null || true
fi
if [ -x "$KAFKA_DIR/bin/zookeeper-server-stop.sh" ]; then
"$KAFKA_DIR/bin/zookeeper-server-stop.sh" 2>/dev/null || true
fi
sleep 2
pkill -f "kafka\.Kafka" 2>/dev/null || true
pkill -f "QuorumPeerMain" 2>/dev/null || true
pkill -9 -f "kafka\.Kafka" 2>/dev/null || true
pkill -9 -f "QuorumPeerMain" 2>/dev/null || true

# Clean Kafka data directories (topics, consumer group offsets, logs)
if [ -d "$KAFKA_DIR" ]; then
echo "Clearing Kafka data..."
rm -rf "$KAFKA_DIR/data" "$KAFKA_DIR/logs" /tmp/kafka-logs /tmp/zookeeper 2>/dev/null || true
fi

# ==========================================
# 3. Stop ClickHouse and clear its data
# ==========================================
echo "Stopping ClickHouse..."
if curl -sf "http://localhost:8123/ping" >/dev/null 2>&1; then
# Drop the table before stopping so next run starts clean
curl -sf "http://localhost:8123" -d "DROP TABLE IF EXISTS h2o_groupby" 2>/dev/null || true
fi
pkill -f "clickhouse-server" 2>/dev/null || true
pkill -f "clickhouse server" 2>/dev/null || true
sleep 2
pkill -9 -f "clickhouse-server" 2>/dev/null || true
pkill -9 -f "clickhouse server" 2>/dev/null || true

# Clear ClickHouse data directory
if [ -d "$CLICKHOUSE_DIR" ]; then
echo "Clearing ClickHouse data..."
rm -rf "$CLICKHOUSE_DIR/data" "$CLICKHOUSE_DIR/store" "$CLICKHOUSE_DIR/metadata" 2>/dev/null || true
fi

# ==========================================
# 4. Clear QE output directories
# ==========================================
echo "Clearing QE output..."
rm -rf "$SCRIPT_DIR/output" "$SCRIPT_DIR/outputs" 2>/dev/null || true

# ==========================================
# 5. Drop OS page cache and dentries/inodes
# ==========================================
if [ "$NO_SUDO" -eq 0 ]; then
echo "Dropping OS page cache, dentries, and inodes..."
sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
echo "OS caches cleared"
else
echo "Skipping OS cache clearing (--no-sudo)"
fi

echo "Cleanup complete. Next run_pipeline.sh will start from a clean state."