Sketchlib Count Min with Heap#255

Merged
milindsrivastava1997 merged 12 commits into main from sketchlib-count-min-with-heap on Mar 31, 2026

@GnaneshGnani commented Mar 31, 2026

Summary

Extends the sketchlib-rust integration to Count-Min-With-Heap (CountMinSketchWithHeap) by adding a CountMinWithHeapBackend enum (Legacy matrix + local heap vs Sketchlib CMSHeap) that keeps the same msgpack wire format as the Arroyo UDFs.

  • sketchlib_fidelity: section titles now reflect per-sketch CLI flags (--cms-impl for CountMinSketch, --cmwh-impl for CountMinSketchWithHeap), not whether KLL defaults to legacy.

UDF validation: countminsketchwithheap_topk.rs.j2 uses a hardcoded const IMPL_MODE: ImplMode = ImplMode::Sketchlib;, like countminsketch_count.rs.j2 / countminsketch_sum.rs.j2.

Changes

New files

asap-common/sketch-core/src/count_min_with_heap_sketchlib.rs

  • Sketchlib-rust CMSHeap integration layer (CMSHeap<Vector2D<i64>, RegularPath>).
  • Helpers: new_sketchlib_cms_heap, sketchlib_cms_heap_from_matrix_and_heap, matrix_from_sketchlib_cms_heap, heap_to_wire, sketchlib_cms_heap_update, sketchlib_cms_heap_query.
  • WireHeapItem for heap serialization without circular imports.

Modified files

asap-common/sketch-core/src/lib.rs

  • pub mod count_min_with_heap_sketchlib;

asap-common/sketch-core/src/count_min_with_heap.rs

  • CountMinWithHeapBackend: Legacy { sketch, heap } | Sketchlib(SketchlibCMSHeap).
  • CountMinSketchWithHeap: backend field; sketch_matrix(), topk_heap_items(), from_legacy_matrix(); dispatch for new, update, query_key, merge, msgpack serde, aggregate_topk.
  • Clone for sketchlib path rebuilds from matrix + wire heap (sketchlib type clone limitations).
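The Clone bullet above (rebuilding from matrix + wire heap) can be illustrated with a minimal sketch. OpaqueSketch, Wrapper, and all method names here are hypothetical stand-ins for a backend type that cannot be cloned directly; they are not the sketchlib-rust API or the actual code in this PR.

```rust
// Hypothetical stand-in for a backend type that does not implement Clone.
// None of these names come from sketchlib-rust or sketch-core.
pub struct OpaqueSketch {
    counters: Vec<Vec<i64>>,
    heap: Vec<(String, i64)>,
}

impl OpaqueSketch {
    /// Rebuild a backend from an exported counter matrix and heap items.
    pub fn from_parts(counters: Vec<Vec<i64>>, heap: Vec<(String, i64)>) -> Self {
        Self { counters, heap }
    }

    /// Export the counter matrix as plain data.
    pub fn to_matrix(&self) -> Vec<Vec<i64>> {
        self.counters.clone()
    }

    /// Export the heap items in a serialization-friendly form.
    pub fn to_wire_heap(&self) -> Vec<(String, i64)> {
        self.heap.clone()
    }
}

/// Wrapper that needs Clone even though its inner type lacks it.
pub struct Wrapper {
    pub inner: OpaqueSketch,
}

impl Clone for Wrapper {
    fn clone(&self) -> Self {
        // Export the state as plain data, then construct a fresh backend from it.
        let matrix = self.inner.to_matrix();
        let heap = self.inner.to_wire_heap();
        Wrapper {
            inner: OpaqueSketch::from_parts(matrix, heap),
        }
    }
}
```

The trade-off of this pattern is that every clone pays for a full export and rebuild, which is acceptable when clones are rare relative to updates.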

asap-common/sketch-core/src/config.rs

  • DEFAULT_CMWH_IMPL = ImplMode::Sketchlib (Count-Min-With-Heap defaults to sketchlib when configure() is not used).
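A minimal sketch of how a "default unless configure() was called" setting might look, using only the standard library. ImplMode, DEFAULT_CMWH_IMPL, and configure() are named in this PR, but the signatures, the CMWH_IMPL static, and the cmwh_impl() accessor below are assumptions for illustration, not the actual config.rs.

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ImplMode {
    Legacy,
    Sketchlib,
}

/// Default used when configure() is never called (value taken from the PR text).
pub const DEFAULT_CMWH_IMPL: ImplMode = ImplMode::Sketchlib;

// Hypothetical process-wide slot; it can be set at most once.
static CMWH_IMPL: OnceLock<ImplMode> = OnceLock::new();

/// Assumed signature: sets the mode once, returning Err(mode) if already set.
pub fn configure(mode: ImplMode) -> Result<(), ImplMode> {
    CMWH_IMPL.set(mode)
}

/// Hypothetical accessor: the configured mode, or the default if unset.
pub fn cmwh_impl() -> ImplMode {
    CMWH_IMPL.get().copied().unwrap_or(DEFAULT_CMWH_IMPL)
}
```

With this shape, code that never calls configure() observes Sketchlib, and a second configure() call is rejected rather than silently changing the backend mid-run.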

asap-query-engine/src/precompute_operators/count_min_sketch_with_heap_accumulator.rs

  • Uses CountMinSketchWithHeap::from_legacy_matrix / accessors instead of direct struct fields; tests updated for dual backend.

asap-summary-ingest/templates/udfs/countminsketchwithheap_topk.rs.j2

  • sketchlib-rust dependency; compile-time ImplMode / IMPL_MODE hardcoded to Sketchlib. run_arroyosketch.py may still pass impl_mode in parameters; the template does not consume it, matching CMS.
  • Dual path: legacy (twox-hash + local CountMinSketch + BinaryHeap) vs sketchlib (CountMin + sketchlib updates + same heap maintenance + wire matrix copy for serialization).

asap-common/sketch-core/src/bin/sketchlib_fidelity.rs

  • Benchmarks for CountMinSketchWithHeap (top-k recall vs true top-k, Pearson / MAPE / RMSE on exact top-k keys).
  • CLI: --cms-impl, --kll-impl, --cmwh-impl (defaults from DEFAULT_*_IMPL).

asap-common/sketch-core/report.md

  • Fidelity tables for CountMinSketch and CountMinSketchWithHeap (legacy vs sketchlib-rust).
  • Updated CountMinSketchWithHeap sketchlib metrics after merge / current sketchlib-rust revision (see Fidelity Results below).

asap-query-engine/src/lib.rs, asap-query-engine/src/main.rs, run_arroyosketch.py, CMS UDF templates, Cargo.lock, etc.

Technical approach

Backend abstraction

pub enum CountMinWithHeapBackend {
    Legacy {
        sketch: Vec<Vec<f64>>,
        heap: Vec<HeapItem>,
    },
    Sketchlib(SketchlibCMSHeap),
}

pub struct CountMinSketchWithHeap {
    pub row_num: usize,
    pub col_num: usize,
    pub heap_size: usize,
    pub backend: CountMinWithHeapBackend,
}
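To illustrate how methods might dispatch over such a backend enum, here is a simplified, self-contained sketch. StubSketchlibHeap is a hypothetical stand-in for SketchlibCMSHeap (whose API is not shown in this PR), the heap is omitted, and query_cols takes precomputed column indices instead of hashing a key; only the match-based dispatch is the point.

```rust
/// Hypothetical stand-in for the sketchlib-backed type; integer counters.
pub struct StubSketchlibHeap {
    pub counts: Vec<Vec<i64>>,
}

impl StubSketchlibHeap {
    /// Count-Min estimate: minimum counter across rows at the given columns.
    pub fn estimate(&self, cols: &[usize]) -> f64 {
        self.counts
            .iter()
            .zip(cols)
            .map(|(row, &c)| row[c])
            .min()
            .unwrap_or(0) as f64
    }
}

/// Simplified two-variant backend (heap fields left out for brevity).
pub enum Backend {
    Legacy { sketch: Vec<Vec<f64>> },
    Sketchlib(StubSketchlibHeap),
}

pub struct Cmwh {
    pub row_num: usize,
    pub col_num: usize,
    pub backend: Backend,
}

impl Cmwh {
    /// cols[i] is the column index for row i; hashing is out of scope here.
    pub fn query_cols(&self, cols: &[usize]) -> f64 {
        match &self.backend {
            Backend::Legacy { sketch } => sketch
                .iter()
                .zip(cols)
                .map(|(row, &c)| row[c])
                .fold(f64::INFINITY, f64::min),
            Backend::Sketchlib(s) => s.estimate(cols),
        }
    }

    /// Accessor that hides which backend holds the counters.
    pub fn sketch_matrix(&self) -> Vec<Vec<f64>> {
        match &self.backend {
            Backend::Legacy { sketch } => sketch.clone(),
            Backend::Sketchlib(s) => s
                .counts
                .iter()
                .map(|row| row.iter().map(|&v| v as f64).collect())
                .collect(),
        }
    }
}
```

Callers that go through accessors such as sketch_matrix() stay unaffected when the backend changes, which is the motivation given above for replacing direct struct-field access.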

Wire format

Unchanged nested msgpack: inner CmsData + topk_heap + heap_size. Both backends serialize compatibly for UDF ↔ QueryEngine.

UDF vs QueryEngine (sketchlib path)

  • QueryEngine / sketch-core (sketchlib): CMSHeap (integrated Count-Min + heavy-hitter heap in sketchlib-rust).
  • Arroyo UDF (sketchlib path): sketchlib CountMin (Vector2D<i64>) plus a local BinaryHeap for top-k, then copies sketch storage into the wire matrix for msgpack.

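Both paths target the same on-the-wire layout. The algorithms differ slightly (integrated CMSHeap vs. CountMin plus a local BinaryHeap), so estimates and heap contents may not match exactly, but the serialized sketches remain compatible.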
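The general technique behind the UDF-side path (a Count-Min table plus a local BinaryHeap for top-k) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the template's code: CmsTopK and its methods are hypothetical names, hashing uses std's DefaultHasher rather than twox-hash or sketchlib's hashing, and the heap policy is one simple choice among several.

```rust
use std::cmp::Reverse;
use std::collections::hash_map::DefaultHasher;
use std::collections::BinaryHeap;
use std::hash::{Hash, Hasher};

/// Hypothetical Count-Min sketch with a bounded min-heap of heavy hitters.
pub struct CmsTopK {
    rows: usize,
    cols: usize,
    counts: Vec<Vec<i64>>,
    heap_size: usize,
    // Reverse turns the max-heap into a min-heap on (estimate, key).
    heap: BinaryHeap<Reverse<(i64, String)>>,
}

impl CmsTopK {
    pub fn new(rows: usize, cols: usize, heap_size: usize) -> Self {
        Self { rows, cols, counts: vec![vec![0; cols]; rows], heap_size, heap: BinaryHeap::new() }
    }

    // Per-row column index; the row number acts as the hash seed.
    fn col(&self, row: usize, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        row.hash(&mut h);
        key.hash(&mut h);
        (h.finish() % self.cols as u64) as usize
    }

    /// Count-Min estimate: minimum counter across rows (never an underestimate).
    pub fn query(&self, key: &str) -> i64 {
        (0..self.rows).map(|r| self.counts[r][self.col(r, key)]).min().unwrap_or(0)
    }

    pub fn update(&mut self, key: &str, value: i64) {
        for r in 0..self.rows {
            let c = self.col(r, key);
            self.counts[r][c] += value;
        }
        let est = self.query(key);
        // Drop any stale entry for this key, then re-insert the fresh estimate.
        let mut items = std::mem::take(&mut self.heap).into_vec();
        items.retain(|Reverse((_, k))| k.as_str() != key);
        self.heap = BinaryHeap::from(items);
        self.heap.push(Reverse((est, key.to_string())));
        // Evict the smallest estimate once the heap exceeds its capacity.
        if self.heap.len() > self.heap_size {
            self.heap.pop();
        }
    }

    /// Tracked keys, largest estimate first.
    pub fn topk(&self) -> Vec<(String, i64)> {
        let mut v: Vec<_> = self.heap.iter().map(|Reverse((e, k))| (k.clone(), *e)).collect();
        v.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
        v
    }
}
```

Rebuilding the heap on every update costs O(heap_size), which is fine for small heaps; an integrated structure can instead keep a key-to-position index. Differences of this kind in heap maintenance are one way two otherwise compatible implementations can report different top-k sets.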

Hashing note (existing)

QueryEngine uses xxhash-rust xxh32; Arroyo UDF templates use twox-hash XxHash32 for the legacy path and sketchlib’s hashing for the sketchlib path.

Testing

# Unit tests (legacy backends via sketch-core test ctor)
cargo test -p sketch-core
cargo test -p query_engine_rust

# Library tests with sketchlib backends for query engine
cargo test -p query_engine_rust --features sketchlib-tests

# Fidelity (examples)
cargo run -p sketch-core --bin sketchlib_fidelity -- --cms-impl sketchlib --cmwh-impl sketchlib
cargo run -p sketch-core --bin sketchlib_fidelity -- --cms-impl legacy --cmwh-impl legacy

# UDF validation (Arroyo must be reachable, e.g. quickstart kafka + arroyo)
cd asap-summary-ingest && python3 validate_udfs.py --udfs countminsketchwithheap_topk
# or: python3 validate_udfs.py --all_udfs

Validation performed on this branch (post-merge)

| Step | Result |
| --- | --- |
| cargo test -p sketch-core | 36 passed |
| cargo test -p query_engine_rust | 337 passed, 5 ignored; test_both_backends spawns sketchlib run |
| cargo test -p query_engine_rust --features sketchlib-tests | 337 passed |
| validate_udfs.py --udfs countminsketchwithheap_topk | errors: [] against local Arroyo |
| validate_udfs.py --all_udfs | |
| Planner topk → streaming config | aggregationType: CountMinSketchWithHeap, aggregationSubType: topk |
| docker compose build queryengine (context .., asap-query-engine/Dockerfile) | Succeeded |

Fidelity results

See asap-common/sketch-core/report.md. Snapshot after merge (sketchlib-rust git rev as locked in Cargo.lock):

CountMinSketchWithHeap (examples)

| Scenario | Legacy top-k recall | sketchlib top-k recall | Legacy Pearson (top-k) | sketchlib Pearson (top-k) |
| --- | --- | --- | --- | --- |
| depth=3, width=1024, heap=10 | 0.40 | 0.80 | 0.9571 | 1.0000 |
| depth=5, width=2048, heap=20 | 0.60 | 1.00 | 0.9964 | 0.9982 |
| depth=5, width=2048, heap=50 | 0.40 | 0.48 | 0.9999983 | 0.9999990 |

Top-k recall can differ between backends because heap maintenance differs; Pearson / MAPE / RMSE on the true top-k keys generally favor sketchlib in these runs.
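For readers unfamiliar with the metrics named above, here is a minimal sketch of how top-k recall, Pearson correlation, RMSE, and MAPE are commonly computed. The function names and exact definitions (for example, MAPE as a fraction that skips zero-valued truths) are assumptions; sketchlib_fidelity.rs may define them differently.

```rust
use std::collections::HashSet;

/// Fraction of the true top-k keys that also appear in the reported top-k.
pub fn topk_recall(true_topk: &[&str], reported: &[&str]) -> f64 {
    if true_topk.is_empty() {
        return 0.0;
    }
    let reported: HashSet<&str> = reported.iter().copied().collect();
    let hits = true_topk.iter().filter(|k| reported.contains(*k)).count();
    hits as f64 / true_topk.len() as f64
}

/// Pearson correlation; None if lengths differ, n < 2, or a series is constant.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mx, b - my);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Root-mean-square error between true and estimated values (paired by index).
pub fn rmse(truth: &[f64], est: &[f64]) -> f64 {
    let n = truth.len().min(est.len());
    if n == 0 {
        return 0.0;
    }
    let sq: f64 = truth.iter().zip(est).map(|(t, e)| (t - e).powi(2)).sum();
    (sq / n as f64).sqrt()
}

/// Mean absolute percentage error as a fraction; pairs with truth == 0 are skipped.
pub fn mape(truth: &[f64], est: &[f64]) -> f64 {
    let terms: Vec<f64> = truth
        .iter()
        .zip(est)
        .filter(|(t, _)| **t != 0.0)
        .map(|(t, e)| ((t - e) / t).abs())
        .collect();
    if terms.is_empty() {
        0.0
    } else {
        terms.iter().sum::<f64>() / terms.len() as f64
    }
}
```

Recall depends only on which keys end up in the heap, while Pearson, MAPE, and RMSE depend on the count estimates for the true top-k keys, so the two kinds of metric can move independently, as the last table row suggests.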

Benefits

  • Unified sketchlib CMSHeap in sketch-core for merge/update/query with automatic heavy-hitter integration.
  • Runtime selection via sketch_cmwh_impl consistent with CMS/KLL flags.
  • Fidelity harness and report cover CMS + CMWH.
  • UDF validation works without template changes to validate_udfs.py (CMWH matches CMS: no impl_mode in Jinja surface).

@milindsrivastava1997 merged commit 11f348a into main on Mar 31, 2026
19 checks passed
@milindsrivastava1997 deleted the sketchlib-count-min-with-heap branch on March 31, 2026 22:59