Support column stats for isin and isnotin by poodlewars · Pull Request #3014 · man-group/ArcticDB

poodlewars · 2026-04-08T13:21:30Z

Please review commit-by-commit - 0969284 is just a straight move of some test code.

Reference Issues/PRs

11292567032

What does this implement or fix?

Use query stats at read time with isin and isnotin. For a given set,

q = q[q["col"].isin(<set>)

we compute min(set) and max(set) and first evaluate column stats against that. This can show that the whole set lies outside of a given block cheaply (and in some degenerate cases show that everything in the block matches the set).

If this gives an UNKNOWN result, we then walk through the set evaluating the EqualsOperator against the stats and each element.

This has quite careful NaN handling. In computing the min and max of the set we exclude NaN as collections including NaN do not have a stable ordering. If statistics for a block are min=max=nan this means that the entire block is NaN, and therefore isin will always be False.

Worth noting that our NaN handling with isin does not match Pandas - I added a test in test_filtering.py to show the difference.

…e column stats test files are getting massive

claude · 2026-04-10T10:57:22Z

cpp/arcticdb/pipeline/column_stats_dispatch.hpp

    );
 }

+StatsComparison stats_membership_comparator(const ColumnStatsValues& stats, ValueSet& value_set, OperationType op);


The value_set parameter should be const ValueSet& — the function only reads from it (empty(), min_value(), max_value(), get_set()). This requires marking get_set() as const in value_set.hpp, which is straightforward since it just returns a shared_ptr to immutable data.

claude · 2026-04-10T10:58:00Z

ArcticDB Code Review Summary

API & Compatibility

✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
✅ Deprecation protocol N/A
✅ On-disk format unchanged
✅ Protobuf schema unchanged
✅ Key types and ValueType enum unchanged

Memory & Safety

✅ RAII used for all resource management
✅ No use-after-move or use-after-free patterns
✅ std::call_once used correctly for thread-safe lazy min_value()/max_value() caching
✅ GIL not relevant to changed code
✅ No accidental copies — ValueSet accessed by reference, shared_ptr shared across segments

Correctness

✅ NaN handling is carefully thought out: compute_min_max() excludes NaN when building the set range; stats.min == stats.max == NaN (all-NaN segment) correctly returns NONE_MATCH for isin. The divergence from Pandas NaN semantics is acknowledged in the PR description and has a dedicated test.
✅ Per-element fallback loop: breaking early on UNKNOWN is safe — once UNKNOWN is seen, min ≠ max, making ALL_MATCH (which requires min == max == elem) unreachable.
✅ ISNOTIN result correctly derived by flipping ISIN result at the function exit
✅ New ValueSet/StatsVariantData variants handled exhaustively in all std::visit / visit_unary_boolean_stats / dispatch_unary_stats sites

Code Quality

❌ stats_membership_comparator takes ValueSet& instead of const ValueSet&. The function only reads from the ValueSet; making get_set<T>() const (it just returns a stored shared_ptr) would allow this to be fully const-correct. (See inline comment.)

Testing

✅ New C++ unit tests cover empty set, single-value block, disjoint ranges, overlapping ranges, mixed types, all combinations of NaN in stats and set
✅ Python integration tests verify per-element pruning, NaN-in-set, all-NaN segment, mixed types, and negated expressions (~isin, ~isnotin)
✅ Compound filter test (col_1 > 2 & col_2.isin(...)) confirms correct interaction with boolean stats
✅ Pandas divergence for NaN is explicitly tested in test_filtering.py with a descriptive docstring

Build & Dependencies

✅ Both new test files added to CMakeLists.txt
✅ No new dependencies introduced

Security

✅ No hardcoded credentials
✅ No buffer overflow potential

PR Title & Description

✅ Title is clear and accurate
✅ Description explains the two-phase algorithm (range check then per-element walk) and the NaN rationale
✅ Linked issue referenced

Documentation

✅ No public Python API changes requiring docstring updates
❌ docs/claude/cpp/PIPELINE.md or PROCESSING.md should be updated to document that column stats now support ISIN/ISNOTIN — specifically the two-pass algorithm and the NaN exclusion from set min/max computation

Overall: The logic is sound and the implementation is clean. The NaN handling is particularly careful. Two items to address before merge: const-correctness on stats_membership_comparator (minor), and a docs/claude/ update for the new stats dispatch path.

poodlewars added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Apr 8, 2026

poodlewars force-pushed the aseaton/column-stats/isin-rebase branch from bc7f2ef to cfd38a0 Compare April 10, 2026 10:40

Alex Seaton added 2 commits April 10, 2026 11:50

11292567032 isin and isnotin support with query stats

6609f27

Split out IsIN and ColumnStatsData tests in to their own files, as th…

0969284

…e column stats test files are getting massive

poodlewars force-pushed the aseaton/column-stats/isin-rebase branch from cfd38a0 to 0969284 Compare April 10, 2026 10:50

poodlewars marked this pull request as ready for review April 10, 2026 10:51

poodlewars requested review from IvoDD and alexowens90 as code owners April 10, 2026 10:51

claude bot reviewed Apr 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support column stats for isin and isnotin#3014

Support column stats for isin and isnotin#3014
poodlewars wants to merge 2 commits intomasterfrom
aseaton/column-stats/isin-rebase

poodlewars commented Apr 8, 2026 •

edited

Loading

Uh oh!

claude bot Apr 10, 2026

Uh oh!

claude bot commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poodlewars commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement or fix?

Uh oh!

claude bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot commented Apr 10, 2026

ArcticDB Code Review Summary

API & Compatibility

Memory & Safety

Correctness

Code Quality

Testing

Build & Dependencies

Security

PR Title & Description

Documentation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poodlewars commented Apr 8, 2026 •

edited

Loading