Sketchlib Count Min with Heap by GnaneshGnani · Pull Request #255 · ProjectASAP/ASAPQuery

GnaneshGnani · 2026-03-31T16:09:59Z

Summary

Extends the sketchlib-rust integration to Count-Min-With-Heap (CountMinSketchWithHeap) by adding a CountMinWithHeapBackend enum with two variants: Legacy (matrix plus local heap) and Sketchlib (CMSHeap). Both variants keep the same msgpack wire format as the Arroyo UDFs.

sketchlib_fidelity: section titles now reflect per-sketch CLI flags (--cms-impl for CountMinSketch, --cmwh-impl for CountMinSketchWithHeap), not whether KLL defaults to legacy.

UDF validation: countminsketchwithheap_topk.rs.j2 uses a hardcoded const IMPL_MODE: ImplMode = ImplMode::Sketchlib;, like countminsketch_count.rs.j2 and countminsketch_sum.rs.j2.

Changes

New files

asap-common/sketch-core/src/count_min_with_heap_sketchlib.rs

Sketchlib-rust CMSHeap integration layer (CMSHeap<Vector2D<i64>, RegularPath>).
Helpers: new_sketchlib_cms_heap, sketchlib_cms_heap_from_matrix_and_heap, matrix_from_sketchlib_cms_heap, heap_to_wire, sketchlib_cms_heap_update, sketchlib_cms_heap_query.
WireHeapItem for heap serialization without circular imports.

Modified files

asap-common/sketch-core/src/lib.rs

pub mod count_min_with_heap_sketchlib;

asap-common/sketch-core/src/count_min_with_heap.rs

CountMinWithHeapBackend: Legacy { sketch, heap } | Sketchlib(SketchlibCMSHeap).
CountMinSketchWithHeap: backend field; sketch_matrix(), topk_heap_items(), from_legacy_matrix(); dispatch for new, update, query_key, merge, msgpack serde, aggregate_topk.
Clone on the sketchlib path rebuilds the sketch from its matrix and wire heap, because of clone limitations in the sketchlib type.
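The rebuild-on-clone idea can be sketched as follows. This is a minimal illustration, not the PR's code: `OpaqueCmsHeap`, `SketchHandle`, `from_parts`, `export_matrix` and `export_heap` are hypothetical stand-ins for a sketch type that cannot be cloned directly but can export and re-import its state.

```rust
/// Hypothetical stand-in for a sketch type that does not implement `Clone`.
pub struct OpaqueCmsHeap {
    counts: Vec<Vec<i64>>,
    heap: Vec<(String, i64)>,
}

impl OpaqueCmsHeap {
    /// Rebuild a sketch from an exported counter matrix and heap entries.
    pub fn from_parts(counts: Vec<Vec<i64>>, heap: Vec<(String, i64)>) -> Self {
        OpaqueCmsHeap { counts, heap }
    }

    /// Copy the counter matrix out of the sketch.
    pub fn export_matrix(&self) -> Vec<Vec<i64>> {
        self.counts.clone()
    }

    /// Copy the heap entries out in a plain, serializable form.
    pub fn export_heap(&self) -> Vec<(String, i64)> {
        self.heap.clone()
    }
}

/// Hypothetical wrapper that owns the non-cloneable sketch.
pub struct SketchHandle {
    inner: OpaqueCmsHeap,
}

impl SketchHandle {
    pub fn new(inner: OpaqueCmsHeap) -> Self {
        SketchHandle { inner }
    }

    pub fn inner(&self) -> &OpaqueCmsHeap {
        &self.inner
    }
}

impl Clone for SketchHandle {
    /// Clone by exporting state and rebuilding, instead of deriving `Clone`.
    fn clone(&self) -> Self {
        SketchHandle {
            inner: OpaqueCmsHeap::from_parts(
                self.inner.export_matrix(),
                self.inner.export_heap(),
            ),
        }
    }
}
```

The cost of this approach is a full copy of the matrix and heap on every clone, which is the same work a derived `Clone` would do for plain vectors.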

asap-common/sketch-core/src/config.rs

DEFAULT_CMWH_IMPL = ImplMode::Sketchlib (Count-Min-With-Heap defaults to sketchlib when configure() is not used).

asap-query-engine/src/precompute_operators/count_min_sketch_with_heap_accumulator.rs

Uses CountMinSketchWithHeap::from_legacy_matrix / accessors instead of direct struct fields; tests updated for dual backend.

asap-summary-ingest/templates/udfs/countminsketchwithheap_topk.rs.j2

sketchlib-rust dependency; compile-time ImplMode / IMPL_MODE hardcoded to Sketchlib. run_arroyosketch.py may still pass impl_mode in parameters; the template does not consume it, matching CMS.
Dual path: legacy (twox-hash + local CountMinSketch + BinaryHeap) vs sketchlib (CountMin + sketchlib updates + same heap maintenance + wire matrix copy for serialization).

asap-common/sketch-core/src/bin/sketchlib_fidelity.rs

Benchmarks for CountMinSketchWithHeap (top-k recall vs true top-k, Pearson / MAPE / RMSE on exact top-k keys).
CLI: --cms-impl, --kll-impl, --cmwh-impl (defaults from DEFAULT_*_IMPL).

asap-common/sketch-core/report.md

Fidelity tables for CountMinSketch and CountMinSketchWithHeap (legacy vs sketchlib-rust).
Updated CountMinSketchWithHeap sketchlib metrics after merge / current sketchlib-rust revision (see Fidelity Results below).

asap-query-engine/src/lib.rs, asap-query-engine/src/main.rs, run_arroyosketch.py, CMS UDF templates, Cargo.lock, etc.

Technical approach

Backend abstraction

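/// Storage backend for Count-Min-With-Heap.
pub enum CountMinWithHeapBackend {
    /// Legacy path: counter matrix plus a locally maintained heap.
    Legacy {
        sketch: Vec<Vec<f64>>,
        heap: Vec<HeapItem>,
    },
    /// sketchlib-rust path: integrated CMSHeap.
    Sketchlib(SketchlibCMSHeap),
}

pub struct CountMinSketchWithHeap {
    /// Number of sketch rows (depth).
    pub row_num: usize,
    /// Number of sketch columns (width).
    pub col_num: usize,
    /// Capacity of the top-k heap.
    pub heap_size: usize,
    /// Active backend; methods dispatch on this.
    pub backend: CountMinWithHeapBackend,
}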
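A rough sketch of how methods can dispatch over such a backend enum is shown below. It is an illustration under stated assumptions rather than the code in count_min_with_heap.rs: `Sketch`, `Backend`, `IntegerSketch` and `column` are hypothetical names, the heap is left out, and hashing uses the standard library's `DefaultHasher` instead of xxhash or sketchlib's hashing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the payload of the sketchlib-backed variant.
pub struct IntegerSketch {
    counts: Vec<Vec<i64>>,
}

/// Simplified two-variant backend (heap omitted for brevity).
pub enum Backend {
    Legacy { sketch: Vec<Vec<f64>> },
    Sketchlib(IntegerSketch),
}

pub struct Sketch {
    pub row_num: usize,
    pub col_num: usize,
    pub backend: Backend,
}

/// Column for `key` in `row`. Illustrative hashing only (assumption).
fn column(key: &str, row: usize, col_num: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    (row as u64).hash(&mut hasher);
    key.hash(&mut hasher);
    (hasher.finish() % col_num as u64) as usize
}

impl Sketch {
    pub fn new(row_num: usize, col_num: usize, use_sketchlib: bool) -> Self {
        assert!(col_num > 0, "col_num must be positive");
        let backend = if use_sketchlib {
            Backend::Sketchlib(IntegerSketch {
                counts: vec![vec![0; col_num]; row_num],
            })
        } else {
            Backend::Legacy {
                sketch: vec![vec![0.0; col_num]; row_num],
            }
        };
        Sketch { row_num, col_num, backend }
    }

    /// Add `value` to one counter per row, whichever backend is active.
    pub fn update(&mut self, key: &str, value: f64) {
        for row in 0..self.row_num {
            let col = column(key, row, self.col_num);
            match &mut self.backend {
                Backend::Legacy { sketch } => sketch[row][col] += value,
                Backend::Sketchlib(inner) => inner.counts[row][col] += value as i64,
            }
        }
    }

    /// Count-Min estimate: the minimum counter over all rows.
    pub fn query_key(&self, key: &str) -> f64 {
        (0..self.row_num)
            .map(|row| {
                let col = column(key, row, self.col_num);
                match &self.backend {
                    Backend::Legacy { sketch } => sketch[row][col],
                    Backend::Sketchlib(inner) => inner.counts[row][col] as f64,
                }
            })
            .fold(f64::INFINITY, f64::min)
    }
}
```

Keeping `row_num` and `col_num` on the outer struct means callers never need to know which variant is active; only the `match` arms do.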

Wire format

Unchanged nested msgpack: inner CmsData + topk_heap + heap_size — both backends serialize compatibly for UDF ↔ QueryEngine.
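The nested layout can be pictured with plain structs. The names CmsData, topk_heap, heap_size and WireHeapItem come from this PR; the field names inside CmsData and WireHeapItem, the `CmwhWire` wrapper, and the assumption that the integer counters sit in flat row-major storage are hypothetical, and serde/msgpack derives are omitted so the sketch stays standard-library only.

```rust
/// Inner Count-Min payload (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct CmsData {
    pub sketch: Vec<Vec<f64>>,
    pub row_num: usize,
    pub col_num: usize,
}

/// One heap entry in wire form (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct WireHeapItem {
    pub key: String,
    pub value: f64,
}

/// Hypothetical outer wrapper: inner CmsData + topk_heap + heap_size.
#[derive(Debug, Clone, PartialEq)]
pub struct CmwhWire {
    pub sketch: CmsData,
    pub topk_heap: Vec<WireHeapItem>,
    pub heap_size: usize,
}

/// Copy flat row-major integer counters into the nested f64 wire matrix.
pub fn matrix_from_flat(flat: &[i64], row_num: usize, col_num: usize) -> Vec<Vec<f64>> {
    assert_eq!(flat.len(), row_num * col_num, "flat storage has the wrong length");
    if col_num == 0 {
        return vec![Vec::new(); row_num];
    }
    flat.chunks(col_num)
        .map(|row| row.iter().map(|&count| count as f64).collect())
        .collect()
}

/// Assemble the wire struct from integer counters and heap entries.
pub fn to_wire(
    flat: &[i64],
    row_num: usize,
    col_num: usize,
    heap: &[(String, i64)],
    heap_size: usize,
) -> CmwhWire {
    CmwhWire {
        sketch: CmsData {
            sketch: matrix_from_flat(flat, row_num, col_num),
            row_num,
            col_num,
        },
        topk_heap: heap
            .iter()
            .map(|(key, count)| WireHeapItem { key: key.clone(), value: *count as f64 })
            .collect(),
        heap_size,
    }
}
```

The `i64` to `f64` cast is exact only for counts up to 2^53, which is worth keeping in mind when an integer backend feeds a floating-point wire matrix.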

UDF vs QueryEngine (sketchlib path)

QueryEngine / sketch-core (sketchlib): CMSHeap (integrated Count-Min + heavy-hitter heap in sketchlib-rust).
Arroyo UDF (sketchlib path): sketchlib CountMin (Vector2D<i64>) plus a local BinaryHeap for top-k, then copies sketch storage into the wire matrix for msgpack.

Both paths target the same on-the-wire layout. Heap maintenance differs between them, so the reported top-k sets can differ, but both serialize a Count-Min matrix and top-k heap in the same format.
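For readers unfamiliar with the "Count-Min plus local BinaryHeap" shape, here is a minimal sketch of that pattern. It is not the UDF template: `TopKSketch` is a hypothetical type, hashing uses `DefaultHasher` rather than twox-hash or sketchlib, and the heap is rebuilt on every update for simplicity.

```rust
use std::cmp::Reverse;
use std::collections::hash_map::DefaultHasher;
use std::collections::BinaryHeap;
use std::hash::{Hash, Hasher};

/// Count-Min counters plus a bounded min-heap of the largest estimates.
pub struct TopKSketch {
    counts: Vec<Vec<i64>>,
    heap: BinaryHeap<Reverse<(i64, String)>>,
    heap_size: usize,
}

impl TopKSketch {
    pub fn new(depth: usize, width: usize, heap_size: usize) -> Self {
        assert!(width > 0, "width must be positive");
        TopKSketch {
            counts: vec![vec![0; width]; depth],
            heap: BinaryHeap::new(),
            heap_size,
        }
    }

    /// Column for `key` in `row` (illustrative hashing, an assumption).
    fn column(&self, key: &str, row: usize) -> usize {
        let mut hasher = DefaultHasher::new();
        (row as u64).hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() % self.counts[row].len() as u64) as usize
    }

    /// Count-Min estimate: minimum counter across rows.
    pub fn query(&self, key: &str) -> i64 {
        (0..self.counts.len())
            .map(|row| self.counts[row][self.column(key, row)])
            .min()
            .unwrap_or(0)
    }

    /// Update the counters, then refresh this key's heap entry.
    pub fn update(&mut self, key: &str, value: i64) {
        for row in 0..self.counts.len() {
            let col = self.column(key, row);
            self.counts[row][col] += value;
        }
        let estimate = self.query(key);
        // Drop any stale entry for this key, then re-insert the fresh estimate.
        let mut items = std::mem::take(&mut self.heap).into_vec();
        items.retain(|Reverse((_, k))| k.as_str() != key);
        items.push(Reverse((estimate, key.to_string())));
        self.heap = BinaryHeap::from(items);
        // Evict the smallest estimate once the heap exceeds its capacity.
        if self.heap.len() > self.heap_size {
            self.heap.pop();
        }
    }

    /// Heap contents sorted by estimate, largest first.
    pub fn topk(&self) -> Vec<(String, i64)> {
        let mut out: Vec<(String, i64)> = self
            .heap
            .iter()
            .map(|Reverse((count, key))| (key.clone(), *count))
            .collect();
        out.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        out
    }
}
```

Wrapping entries in `Reverse` turns the standard max-heap into a min-heap, so the cheapest element to evict is always at the top. How and when entries are refreshed or evicted is exactly the kind of detail that can make top-k recall differ between implementations.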

Hashing note (existing)

QueryEngine uses xxhash-rust xxh32; Arroyo UDF templates use twox-hash XxHash32 for the legacy path and sketchlib’s hashing for the sketchlib path.

Testing

# Unit tests (legacy backends via sketch-core test ctor)
cargo test -p sketch-core
cargo test -p query_engine_rust

# Library tests with sketchlib backends for query engine
cargo test -p query_engine_rust --features sketchlib-tests

# Fidelity (examples)
cargo run -p sketch-core --bin sketchlib_fidelity -- --cms-impl sketchlib --cmwh-impl sketchlib
cargo run -p sketch-core --bin sketchlib_fidelity -- --cms-impl legacy --cmwh-impl legacy

# UDF validation (Arroyo must be reachable, e.g. quickstart kafka + arroyo)
cd asap-summary-ingest && python3 validate_udfs.py --udfs countminsketchwithheap_topk
# or: python3 validate_udfs.py --all_udfs

Validation performed on this branch (post-merge)

Step	Result
`cargo test -p sketch-core`	36 passed
`cargo test -p query_engine_rust`	337 passed, 5 ignored; `test_both_backends` spawns sketchlib run
`cargo test -p query_engine_rust --features sketchlib-tests`	337 passed
`validate_udfs.py --udfs countminsketchwithheap_topk`	`errors: []` against local Arroyo
`validate_udfs.py --all_udfs`
Planner topk → streaming config	`aggregationType: CountMinSketchWithHeap`, `aggregationSubType: topk`
`docker compose build queryengine` (context `..`, `asap-query-engine/Dockerfile`)	Succeeded

Fidelity results

See asap-common/sketch-core/report.md. Snapshot after merge (sketchlib-rust git rev as locked in Cargo.lock):

CountMinSketchWithHeap (examples)

Scenario	Legacy top-k recall	sketchlib top-k recall	Legacy Pearson (top-k)	sketchlib Pearson (top-k)
depth=3, width=1024, heap=10	0.40	0.80	0.9571	1.0000
depth=5, width=2048, heap=20	0.60	1.00	0.9964	0.9982
depth=5, width=2048, heap=50	0.40	0.48	0.9999983	0.9999990

Top-k recall can differ between backends because heap maintenance differs; Pearson / MAPE / RMSE on the true top-k keys generally favor sketchlib in these runs.
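For reference, the metrics named above can be computed roughly as follows. This is a generic sketch of the standard definitions, not the code in sketchlib_fidelity.rs; the function names are hypothetical, and MAPE is returned as a fraction rather than a percentage.

```rust
use std::collections::HashSet;

/// Fraction of the true top-k keys that the sketch also reported.
pub fn topk_recall(true_topk: &[&str], reported: &[&str]) -> f64 {
    if true_topk.is_empty() {
        return 1.0;
    }
    let reported: HashSet<&str> = reported.iter().copied().collect();
    let hits = true_topk.iter().filter(|key| reported.contains(*key)).count();
    hits as f64 / true_topk.len() as f64
}

/// Pearson correlation between exact and estimated counts.
/// Returns None for mismatched lengths, fewer than two points, or zero variance.
pub fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (x, y) in xs.iter().zip(ys) {
        cov += (x - mean_x) * (y - mean_y);
        var_x += (x - mean_x).powi(2);
        var_y += (y - mean_y).powi(2);
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}

/// Root-mean-square error between exact and estimated counts.
pub fn rmse(truth: &[f64], estimate: &[f64]) -> Option<f64> {
    if truth.len() != estimate.len() || truth.is_empty() {
        return None;
    }
    let sum_sq: f64 = truth.iter().zip(estimate).map(|(t, e)| (e - t).powi(2)).sum();
    Some((sum_sq / truth.len() as f64).sqrt())
}

/// Mean absolute percentage error as a fraction; skips zero-valued truths.
pub fn mape(truth: &[f64], estimate: &[f64]) -> Option<f64> {
    if truth.len() != estimate.len() {
        return None;
    }
    let terms: Vec<f64> = truth
        .iter()
        .zip(estimate)
        .filter(|(t, _)| **t != 0.0)
        .map(|(t, e)| ((e - t) / t).abs())
        .collect();
    if terms.is_empty() {
        return None;
    }
    Some(terms.iter().sum::<f64>() / terms.len() as f64)
}
```

Under these definitions, a recall of 0.40 with heap=10 means 4 of the 10 true top-k keys appeared in the reported heap. Pearson can stay close to 1 even when recall is low, because it is computed on the counts of the true top-k keys rather than on heap membership.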

Benefits

Unified sketchlib CMSHeap in sketch-core for merge/update/query with automatic heavy-hitter integration.
Runtime selection via sketch_cmwh_impl consistent with CMS/KLL flags.
Fidelity harness and report cover CMS + CMWH.
UDF validation works without template changes to validate_udfs.py (CMWH matches CMS: no impl_mode in Jinja surface).

- Merged latest main (includes backend abstraction from PR #207)
- Set Count-Min Sketch to use sketchlib backend by default
- Set Count-Min-With-Heap to use sketchlib backend by default
- Keep KLL in legacy mode (not yet implemented)
- UDFs correctly configured: CMS and CMWH use sketchlib

GnaneshGnani and others added 12 commits March 20, 2026 09:36

Integrate sketchlib CMSHeap for Count-Min-With-Heap

1380e02

Restore per-backend default constants, global default Legacy

b95fead

Use per-backend defaults in fidelity, configurable impl_mode in UDF t…

364365d

…emplates

report: scope to CMS and CMWH only for PR 4

786cd50

UDFs: use same impl mode as QueryEngine (sketch_cms_impl, etc.)

4f75d18

Simplify UDF impl mode, default CMS and CMWH to sketchlib

1f6c726

Fix black formatting in arroyo.py

01d9a4b

Merge main into sketchlib-count-min-with-heap; resolve Cargo.lock

f4628c6

fidelity: per-section mode labels; refresh CMWH report; fix validate_…

6a0db0a

…udfs impl_mode default

Hardcode default mode in UDFs

a4fe95e

Merge branch 'main' into sketchlib-count-min-with-heap

77c3dcf

GnaneshGnani requested a review from milindsrivastava1997 March 31, 2026 19:50

milindsrivastava1997 approved these changes Mar 31, 2026

milindsrivastava1997 merged commit 11f348a into main Mar 31, 2026
19 checks passed

milindsrivastava1997 deleted the sketchlib-count-min-with-heap branch March 31, 2026 22:59

milindsrivastava1997 mentioned this pull request Apr 1, 2026

258 add capability based matching to asap query engine instead of matching incoming queries against pre configured queries in inference config #259

Open
