ASAPSketchLib is a Rust sketch library with reusable sketch building blocks, sketch implementations, and orchestration frameworks.
| Goal | Use This | When to pick it | Pandas/Polars equivalent (exact, unbounded memory) |
|---|---|---|---|
| Frequency estimation | CountMin, Count Sketch | You need fast approximate counts for high-volume keys. | `df.groupby("key").size()` / `df.group_by("key").agg(pl.len())` (exact, but O(distinct keys) memory) |
| Cardinality estimation | HyperLogLog (Regular, DataFusion, HIP) | You need approximate distinct counts with bounded memory. | `df["col"].nunique()` / `df["col"].n_unique()` (exact, but O(n) memory) |
| Quantiles/distribution | KLL, DDSketch | You need percentile/latency summaries over streams. | `df["col"].quantile(0.99)` (exact, but requires storing all values) |
| Advanced use cases (frameworks) | see Advanced Use Cases | Hierarchical subpopulation queries, multi-sketch coordination, or sliding-window aggregation over streams. | No direct equivalent; sketches are the only practical solution at stream scale |
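To make the frequency-estimation row concrete, here is a minimal, standard-library-only sketch of the Count-Min idea. It is not this crate's implementation or API: `ToyCountMin` and its methods are hypothetical names used only for illustration, and the per-row hashing via `DefaultHasher` is an assumption chosen for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Count-Min sketch (hypothetical, for illustration only):
/// `depth` rows of `width` counters stored in one flat, row-major array.
struct ToyCountMin {
    width: usize,
    depth: usize,
    counters: Vec<u64>,
}

impl ToyCountMin {
    fn new(width: usize, depth: usize) -> Self {
        Self { width, depth, counters: vec![0; width * depth] }
    }

    /// Column for `key` in `row`; the row number is hashed first so rows differ.
    fn column<K: Hash>(&self, key: &K, row: usize) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.width
    }

    /// Increment one counter per row.
    fn insert<K: Hash>(&mut self, key: &K) {
        for row in 0..self.depth {
            let col = self.column(key, row);
            self.counters[row * self.width + col] += 1;
        }
    }

    /// Minimum over rows: never below the true count, may exceed it on collisions.
    fn estimate<K: Hash>(&self, key: &K) -> u64 {
        (0..self.depth)
            .map(|row| self.counters[row * self.width + self.column(key, row)])
            .min()
            .unwrap_or(0)
    }
}

fn main() {
    let mut sketch = ToyCountMin::new(1024, 4);
    for key in ["a", "b", "a", "c", "a", "b"] {
        sketch.insert(&key);
    }
    println!("estimate for a: {}", sketch.estimate(&"a"));
}
```

Memory is fixed at `width * depth` counters regardless of how many distinct keys arrive, which is the trade-off against the exact `groupby` approach in the table.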
Full sketch status and API details: APIs Index.
Simple demo use case: estimate unique users with HyperLogLog. Example usage:

    use asap_sketchlib::{DataFusion, HyperLogLog, SketchInput};

    let mut hll = HyperLogLog::<DataFusion>::default();
    // Simulate a stream of user IDs (with duplicates)
    for user_id in [101, 202, 303, 101, 404, 202, 505, 101] {
        hll.insert(&SketchInput::U64(user_id));
    }
    let unique_users = hll.estimate();
    println!("estimated unique users: {unique_users}"); // ≈ 5

To validate the repo quickly:

    cargo test

Common dev commands:

    cargo build --all-targets
    cargo test --all-features
    cargo bench

Performance is the primary motivation for this library:
- Performance-focused implementations with cache-friendly flat counter arrays, row-major layouts, and direct slice access in core sketch paths. `FastPath` mode computes a single hash and derives row indices via bit masking, reducing hashing overhead relative to independent-hash modes.
- Native Rust: no JNI/FFI bridge. Memory layout, allocation, and hashing stay within the Rust implementation.
- Rust-first API: typed inputs (`SketchInput`) and largely consistent `insert`/`estimate`/`merge` patterns across the main sketches, with pluggable hashing via `SketchHasher`.
- Built-in framework layer (`Hydra`, `HashSketchEnsemble`, `ExponentialHistogram`, `UnivMon`) included in the same crate, including hash-reuse support for coordinated sketch collections.
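The single-hash idea mentioned above can be illustrated as follows. This is a rough sketch under stated assumptions, not the crate's actual `FastPath` code: the helper `row_indices` is hypothetical, it uses the standard library's `DefaultHasher`, and it assumes the sketch width is a power of two so that masking can replace a modulo.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash the key once, then slice the 64-bit result
/// into one column index per row using shifts and a bit mask.
/// Assumes `width` is a power of two and `depth * log2(width) <= 64`.
fn row_indices<K: Hash>(key: &K, width: usize, depth: usize) -> Vec<usize> {
    assert!(width.is_power_of_two());
    let bits = width.trailing_zeros(); // log2(width)
    assert!(depth as u32 * bits <= 64, "not enough hash bits for all rows");

    // One hash computation for the whole insert.
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    let hash = hasher.finish();

    // Each row reads its own `bits`-wide slice of the hash.
    let mask = (width as u64) - 1;
    (0..depth)
        .map(|row| ((hash >> (row as u32 * bits)) & mask) as usize)
        .collect()
}

fn main() {
    let indices = row_indices(&"user-42", 1024, 4);
    println!("column per row: {indices:?}");
}
```

With a width of 1024 and 4 rows, this consumes 40 of the 64 hash bits and hashes the key once instead of four times, which is where the saving relative to independent per-row hashes comes from.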
When DataSketches may be a better fit:
- You need its broader algorithm catalog: CPC sketch, Theta/Tuple sketches with set operators (Union, Intersection, Difference), REQ quantiles sketch, VarOpt/Reservoir sampling, or FM85.
- You need cross-language binary compatibility with existing DataSketches deployments in Java, C++, or Python.
- You need long-running production maturity and an Apache-governed release cycle.
Algorithms this library provides that DataSketches does not: UnivMon (universal monitoring), Hydra (hierarchical subpopulation sketching), FoldCMS/FoldCS (memory-efficient windowed sketching), and NitroBatch.
Several sketches address the same analytical goal with different trade-offs. For example, CountMin and Count Sketch both estimate frequencies; the HyperLogLog variants (Regular, DataFusion, HIP) all estimate cardinality; KLL and DDSketch both answer quantile queries.
The best current approach is to profile the sketch against a representative sample of your actual data and compare error rates, memory usage, and insert throughput for your specific key distribution and stream volume. The APIs Index lists the status and caveats for each sketch.
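One way to set up such a comparison is a small harness that computes exact answers on a sample and scores an estimator against them. The sketch below is an assumption about how you might structure it, not part of this library: `exact_counts` and `mean_relative_error` are hypothetical helpers, and the estimator in `main` is a placeholder closure that you would replace with calls into the sketch under test.

```rust
use std::collections::HashMap;

/// Exact per-key counts, used as ground truth for a sample of the stream.
fn exact_counts(stream: &[u64]) -> HashMap<u64, u64> {
    let mut counts = HashMap::new();
    for &key in stream {
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// Mean relative error of `estimate` over all distinct keys in `truth`.
fn mean_relative_error<F: Fn(u64) -> f64>(truth: &HashMap<u64, u64>, estimate: F) -> f64 {
    if truth.is_empty() {
        return 0.0;
    }
    let total: f64 = truth
        .iter()
        .map(|(&key, &count)| (estimate(key) - count as f64).abs() / count as f64)
        .sum();
    total / truth.len() as f64
}

fn main() {
    // Synthetic sample; substitute a representative slice of real data.
    let stream: Vec<u64> = (0..10_000u64).map(|i| (i * i) % 97).collect();
    let truth = exact_counts(&stream);

    // Placeholder estimator that overcounts every key by 1.
    // Replace this closure with the estimate from the sketch being evaluated.
    let err = mean_relative_error(&truth, |key| truth[&key] as f64 + 1.0);
    println!("mean relative error: {err:.4}");
}
```

Running the same harness over each candidate sketch, with the same sample and memory budget, gives directly comparable error figures; insert throughput can be measured separately, for example with `cargo bench`.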
A detailed comparison guide with benchmark data across sketch types and workloads is planned.
For more details, see Docs Index.
Copyright 2025 ProjectASAP
Licensed under the MIT License. See LICENSE.