Sketchlib Count Min with Heap #255
Merged
milindsrivastava1997 merged 12 commits into main on Mar 31, 2026
- Merged latest main (includes backend abstraction from PR #207)
- Set Count-Min Sketch to use sketchlib backend by default
- Set Count-Min-With-Heap to use sketchlib backend by default
- Keep KLL in legacy mode (not yet implemented)
- UDFs correctly configured: CMS and CMWH use sketchlib
…udfs impl_mode default
milindsrivastava1997
approved these changes
Mar 31, 2026
Summary
Extends the sketchlib-rust integration to Count-Min-With-Heap (`CountMinSketchWithHeap`): a `CountMinWithHeapBackend` enum (Legacy matrix + local heap vs Sketchlib `CMSHeap`) with the same msgpack wire format as the Arroyo UDFs.
`sketchlib_fidelity`: section titles now reflect per-sketch CLI flags (`--cms-impl` for CountMinSketch, `--cmwh-impl` for CountMinSketchWithHeap), not whether KLL defaults to legacy.
UDF validation
`countminsketchwithheap_topk.rs.j2` uses a hardcoded `const IMPL_MODE: ImplMode = ImplMode::Sketchlib;`, like `countminsketch_count.rs.j2` / `countminsketch_sum.rs.j2`.
Changes
New files
- `asap-common/sketch-core/src/count_min_with_heap_sketchlib.rs`
  - `CMSHeap<Vector2D<i64>, RegularPath>`.
  - `new_sketchlib_cms_heap`, `sketchlib_cms_heap_from_matrix_and_heap`, `matrix_from_sketchlib_cms_heap`, `heap_to_wire`, `sketchlib_cms_heap_update`, `sketchlib_cms_heap_query`.
  - `WireHeapItem` for heap serialization without circular imports.
Modified files
- `asap-common/sketch-core/src/lib.rs`
  - `pub mod count_min_with_heap_sketchlib;`
- `asap-common/sketch-core/src/count_min_with_heap.rs`
  - `CountMinWithHeapBackend`: `Legacy { sketch, heap }` | `Sketchlib(SketchlibCMSHeap)`.
  - `CountMinSketchWithHeap`: `backend` field; `sketch_matrix()`, `topk_heap_items()`, `from_legacy_matrix()`; dispatch for `new`, `update`, `query_key`, `merge`, msgpack serde, `aggregate_topk`.
- `asap-common/sketch-core/src/config.rs`
  - `DEFAULT_CMWH_IMPL = ImplMode::Sketchlib` (Count-Min-With-Heap defaults to sketchlib when `configure()` is not used).
- `asap-query-engine/src/precompute_operators/count_min_sketch_with_heap_accumulator.rs`
  - `CountMinSketchWithHeap::from_legacy_matrix` / accessors instead of direct struct fields; tests updated for dual backend.
- `asap-summary-ingest/templates/udfs/countminsketchwithheap_topk.rs.j2`
  - `sketchlib-rust` dependency; compile-time `ImplMode` / `IMPL_MODE` hardcoded to `Sketchlib`.
  - `run_arroyosketch.py` may still pass `impl_mode` in parameters; the template does not consume it, matching CMS.
  - Legacy (`CountMinSketch` + `BinaryHeap`) vs sketchlib (`CountMin` + sketchlib updates + same heap maintenance + wire matrix copy for serialization).
- `asap-common/sketch-core/src/bin/sketchlib_fidelity.rs`
  - `--cms-impl`, `--kll-impl`, `--cmwh-impl` (defaults from `DEFAULT_*_IMPL`).
- `asap-common/sketch-core/report.md`
  - `sketchlib-rust` revision (see Fidelity results below).
- `asap-query-engine/src/lib.rs`, `asap-query-engine/src/main.rs`, `run_arroyosketch.py`, CMS UDF templates, `Cargo.lock`, etc.
Technical approach
Backend abstraction
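As a rough illustration of the dispatch this PR describes, the sketch below models a two-variant backend behind a single wrapper type. Only the names `CountMinWithHeapBackend`, `CountMinSketchWithHeap`, `ImplMode`, `sketch_matrix`, `query_key`, and the `Legacy { sketch, heap }` / `Sketchlib(..)` variant shapes come from the PR. The field types, the `StubSketchlibCmsHeap` stand-in, the hash function, and all method bodies are assumptions chosen so the example compiles with the standard library alone; it is not the code in `count_min_with_heap.rs`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Backend selector; the variant names mirror the PR's `ImplMode`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ImplMode {
    Legacy,
    Sketchlib,
}

/// Hypothetical stand-in for the sketchlib-rust heap sketch (NOT its real API):
/// a flat, row-major counter matrix.
struct StubSketchlibCmsHeap {
    rows: usize,
    cols: usize,
    counts: Vec<i64>,
}

/// Two-variant backend, shaped like the enum described in the PR.
#[allow(dead_code)] // `heap` is carried for shape only in this sketch
enum CountMinWithHeapBackend {
    Legacy { sketch: Vec<Vec<i64>>, heap: Vec<(String, i64)> },
    Sketchlib(StubSketchlibCmsHeap),
}

struct CountMinSketchWithHeap {
    backend: CountMinWithHeapBackend,
}

/// Assumed per-row hash; both variants share it so their matrices stay comparable.
fn column(key: &str, row: usize, cols: usize) -> usize {
    let mut h = DefaultHasher::new();
    (row as u64).hash(&mut h);
    key.hash(&mut h);
    (h.finish() % cols as u64) as usize
}

impl CountMinSketchWithHeap {
    /// Assumes `rows >= 1` and `cols >= 1`.
    fn new(mode: ImplMode, rows: usize, cols: usize) -> Self {
        let backend = match mode {
            ImplMode::Legacy => CountMinWithHeapBackend::Legacy {
                sketch: vec![vec![0; cols]; rows],
                heap: Vec::new(),
            },
            ImplMode::Sketchlib => CountMinWithHeapBackend::Sketchlib(StubSketchlibCmsHeap {
                rows,
                cols,
                counts: vec![0; rows * cols],
            }),
        };
        Self { backend }
    }

    /// Adds `value` to one counter per row, whichever backend is active.
    fn update(&mut self, key: &str, value: i64) {
        match &mut self.backend {
            CountMinWithHeapBackend::Legacy { sketch, .. } => {
                let cols = sketch[0].len();
                for (r, row) in sketch.iter_mut().enumerate() {
                    row[column(key, r, cols)] += value;
                }
            }
            CountMinWithHeapBackend::Sketchlib(s) => {
                for r in 0..s.rows {
                    s.counts[r * s.cols + column(key, r, s.cols)] += value;
                }
            }
        }
    }

    /// Count-min estimate: the minimum counter across rows.
    fn query_key(&self, key: &str) -> i64 {
        match &self.backend {
            CountMinWithHeapBackend::Legacy { sketch, .. } => sketch
                .iter()
                .enumerate()
                .map(|(r, row)| row[column(key, r, row.len())])
                .min()
                .unwrap_or(0),
            CountMinWithHeapBackend::Sketchlib(s) => (0..s.rows)
                .map(|r| s.counts[r * s.cols + column(key, r, s.cols)])
                .min()
                .unwrap_or(0),
        }
    }

    /// Accessor returning a backend-independent matrix copy.
    fn sketch_matrix(&self) -> Vec<Vec<i64>> {
        match &self.backend {
            CountMinWithHeapBackend::Legacy { sketch, .. } => sketch.clone(),
            CountMinWithHeapBackend::Sketchlib(s) => {
                s.counts.chunks(s.cols).map(|c| c.to_vec()).collect()
            }
        }
    }
}
```

Callers that go through `sketch_matrix()` and `query_key()` never touch the variant fields directly, which is the same motivation the PR gives for replacing direct struct-field access in the accumulator.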
Wire format
Unchanged nested msgpack: inner `CmsData` + `topk_heap` + `heap_size`; both backends serialize compatibly for UDF ↔ QueryEngine.
UDF vs QueryEngine (sketchlib path)
`Vector2D<i64>`) plus a local `BinaryHeap` for top-k, then copies sketch storage into the wire matrix for msgpack.
Both paths target the same on-the-wire layout; the algorithms differ slightly but remain probabilistic-sketch compatible.
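The UDF-side flow described above (a count-min matrix, a local `BinaryHeap` for top-k, then a copy into a wire struct) might look roughly like the following. The names `CmsData`, `topk_heap`, `heap_size`, and `WireHeapItem` appear in this PR; every field type, the `WireSketch` and `TopKSketch` layouts, the hash function, and the drop-and-reinsert heap strategy are assumptions for illustration. The msgpack encoding step is omitted to keep the example standard-library only.

```rust
use std::cmp::Reverse;
use std::collections::hash_map::DefaultHasher;
use std::collections::BinaryHeap;
use std::hash::{Hash, Hasher};

/// Assumed wire shapes; only the names CmsData, topk_heap, heap_size,
/// and WireHeapItem come from the PR text.
struct CmsData {
    matrix: Vec<Vec<i64>>,
}

#[derive(Debug, Clone, PartialEq)]
struct WireHeapItem {
    key: String,
    value: i64,
}

struct WireSketch {
    sketch: CmsData,
    topk_heap: Vec<WireHeapItem>,
    heap_size: usize,
}

/// Hypothetical in-memory state: counter matrix plus a min-heap of candidates.
struct TopKSketch {
    matrix: Vec<Vec<i64>>,
    heap: BinaryHeap<Reverse<(i64, String)>>, // smallest estimate on top
    heap_size: usize,
}

/// Assumed per-row hash (the real code paths use xxhash variants).
fn column(key: &str, row: usize, cols: usize) -> usize {
    let mut h = DefaultHasher::new();
    (row as u64).hash(&mut h);
    key.hash(&mut h);
    (h.finish() % cols as u64) as usize
}

impl TopKSketch {
    /// Assumes `rows >= 1` and `cols >= 1`.
    fn new(rows: usize, cols: usize, heap_size: usize) -> Self {
        Self { matrix: vec![vec![0; cols]; rows], heap: BinaryHeap::new(), heap_size }
    }

    fn update(&mut self, key: &str, value: i64) {
        let cols = self.matrix[0].len();
        let mut est = i64::MAX;
        for (r, row) in self.matrix.iter_mut().enumerate() {
            let c = column(key, r, cols);
            row[c] += value;
            est = est.min(row[c]); // count-min estimate after this update
        }
        // Drop any stale entry for this key, then re-insert the fresh estimate.
        let mut items = std::mem::take(&mut self.heap).into_vec();
        items.retain(|Reverse((_, k))| k.as_str() != key);
        self.heap = BinaryHeap::from(items);
        self.heap.push(Reverse((est, key.to_string())));
        if self.heap.len() > self.heap_size {
            self.heap.pop(); // evicts the candidate with the smallest estimate
        }
    }

    /// Copies sketch storage and heap contents into the wire layout.
    fn to_wire(&self) -> WireSketch {
        let mut topk: Vec<WireHeapItem> = self
            .heap
            .iter()
            .map(|Reverse((v, k))| WireHeapItem { key: k.clone(), value: *v })
            .collect();
        // Largest estimate first; ties broken by key for a stable order.
        topk.sort_by(|a, b| b.value.cmp(&a.value).then_with(|| a.key.cmp(&b.key)));
        WireSketch {
            sketch: CmsData { matrix: self.matrix.clone() },
            topk_heap: topk,
            heap_size: self.heap_size,
        }
    }
}
```

Because the wire struct carries only the matrix and the heap items, any backend that can produce those two pieces can serialize to the same layout, regardless of how it stores counters internally.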
Hashing note (existing)
QueryEngine uses xxhash-rust `xxh32`; Arroyo UDF templates use twox-hash `XxHash32` for the legacy path and sketchlib’s hashing for the sketchlib path.
Testing
Validation performed on this branch (post-merge)
- `cargo test -p sketch-core`
- `cargo test -p query_engine_rust`
  - `test_both_backends` spawns sketchlib run
- `cargo test -p query_engine_rust --features sketchlib-tests`
- `validate_udfs.py --udfs countminsketchwithheap_topk`
  - `errors: []` against local Arroyo
- `validate_udfs.py --all_udfs`
  - `aggregationType: CountMinSketchWithHeap`, `aggregationSubType: topk`
- `docker compose build queryengine` (context `..`, `asap-query-engine/Dockerfile`)
Fidelity results
See `asap-common/sketch-core/report.md`. Snapshot after merge (sketchlib-rust git rev as locked in `Cargo.lock`):
CountMinSketchWithHeap (examples)
Top-k recall can differ between backends because heap maintenance differs; Pearson / MAPE / RMSE on the true top-k keys generally favor sketchlib in these runs.
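For readers unfamiliar with the metrics named above, here is a minimal sketch of how top-k recall, MAPE, RMSE, and Pearson correlation are conventionally computed over true versus estimated counts. These are the textbook definitions. Whether `sketchlib_fidelity` computes them exactly this way (for example, how it treats zero true counts) is an assumption, and the function names are hypothetical.

```rust
use std::collections::HashSet;

/// Fraction of the true top-k keys that also appear in the estimated top-k.
fn topk_recall(true_topk: &[&str], est_topk: &[&str]) -> f64 {
    if true_topk.is_empty() {
        return 1.0;
    }
    let est: HashSet<&str> = est_topk.iter().copied().collect();
    let hits = true_topk.iter().filter(|k| est.contains(*k)).count();
    hits as f64 / true_topk.len() as f64
}

/// Mean absolute percentage error, in percent.
/// Assumption: pairs whose true value is zero are skipped.
fn mape(truth: &[f64], est: &[f64]) -> f64 {
    let terms: Vec<f64> = truth
        .iter()
        .zip(est)
        .filter(|(t, _)| **t != 0.0)
        .map(|(t, e)| ((e - t) / t).abs())
        .collect();
    if terms.is_empty() {
        0.0
    } else {
        100.0 * terms.iter().sum::<f64>() / terms.len() as f64
    }
}

/// Root mean squared error.
fn rmse(truth: &[f64], est: &[f64]) -> f64 {
    if truth.is_empty() {
        return 0.0;
    }
    let sse: f64 = truth.iter().zip(est).map(|(t, e)| (e - t).powi(2)).sum();
    (sse / truth.len() as f64).sqrt()
}

/// Pearson correlation coefficient; yields NaN when either input has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx.sqrt() * syy.sqrt())
}
```

Recall depends only on which keys end up in the heap, while the other three depend on the counter estimates for those keys, so a backend can score lower on recall and still do better on Pearson, MAPE, and RMSE.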
Benefits
- `sketch_cmwh_impl` consistent with CMS/KLL flags.
- `validate_udfs.py` (CMWH matches CMS: no `impl_mode` in Jinja surface).