Skip to content

Support column stats for isin and isnotin#3014

Open
poodlewars wants to merge 2 commits intomasterfrom
aseaton/column-stats/isin-rebase
Open

Support column stats for isin and isnotin#3014
poodlewars wants to merge 2 commits intomasterfrom
aseaton/column-stats/isin-rebase

Conversation

@poodlewars
Copy link
Copy Markdown
Collaborator

@poodlewars poodlewars commented Apr 8, 2026

Please review commit-by-commit - 0969284 is just a straight move of some test code.

Reference Issues/PRs

11292567032

What does this implement or fix?

Use query stats at read time with isin and isnotin. For a given set,

q = q[q["col"].isin(<set>)

we compute min(set) and max(set) and first evaluate column stats against that. This can show that the whole set lies outside of a given block cheaply (and in some degenerate cases show that everything in the block matches the set).

If this gives an UNKNOWN result, we then walk through the set evaluating the EqualsOperator against the stats and each element.

This has quite careful NaN handling. In computing the min and max of the set we exclude NaN as collections including NaN do not have a stable ordering. If statistics for a block are min=max=nan this means that the entire block is NaN, and therefore isin will always be False.

Worth noting that our NaN handling with isin does not match Pandas - I added a test in test_filtering.py to show the difference.

@poodlewars poodlewars added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Apr 8, 2026
@poodlewars poodlewars force-pushed the aseaton/column-stats/isin-rebase branch from bc7f2ef to cfd38a0 Compare April 10, 2026 10:40
@poodlewars poodlewars force-pushed the aseaton/column-stats/isin-rebase branch from cfd38a0 to 0969284 Compare April 10, 2026 10:50
@poodlewars poodlewars marked this pull request as ready for review April 10, 2026 10:51
);
}

StatsComparison stats_membership_comparator(const ColumnStatsValues& stats, ValueSet& value_set, OperationType op);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value_set parameter should be const ValueSet& — the function only reads from it (empty(), min_value(), max_value(), get_set()). This requires marking get_set() as const in value_set.hpp, which is straightforward since it just returns a shared_ptr to immutable data.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 10, 2026

ArcticDB Code Review Summary

API & Compatibility

  • ✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
  • ✅ Deprecation protocol N/A
  • ✅ On-disk format unchanged
  • ✅ Protobuf schema unchanged
  • ✅ Key types and ValueType enum unchanged

Memory & Safety

  • ✅ RAII used for all resource management
  • ✅ No use-after-move or use-after-free patterns
  • std::call_once used correctly for thread-safe lazy min_value()/max_value() caching
  • ✅ GIL not relevant to changed code
  • ✅ No accidental copies — ValueSet accessed by reference, shared_ptr shared across segments

Correctness

  • ✅ NaN handling is carefully thought out: compute_min_max() excludes NaN when building the set range; stats.min == stats.max == NaN (all-NaN segment) correctly returns NONE_MATCH for isin. The divergence from Pandas NaN semantics is acknowledged in the PR description and has a dedicated test.
  • ✅ Per-element fallback loop: breaking early on UNKNOWN is safe — once UNKNOWN is seen, min ≠ max, making ALL_MATCH (which requires min == max == elem) unreachable.
  • ISNOTIN result correctly derived by flipping ISIN result at the function exit
  • ✅ New ValueSet/StatsVariantData variants handled exhaustively in all std::visit / visit_unary_boolean_stats / dispatch_unary_stats sites

Code Quality

  • stats_membership_comparator takes ValueSet& instead of const ValueSet&. The function only reads from the ValueSet; making get_set<T>() const (it just returns a stored shared_ptr) would allow this to be fully const-correct. (See inline comment.)

Testing

  • ✅ New C++ unit tests cover empty set, single-value block, disjoint ranges, overlapping ranges, mixed types, all combinations of NaN in stats and set
  • ✅ Python integration tests verify per-element pruning, NaN-in-set, all-NaN segment, mixed types, and negated expressions (~isin, ~isnotin)
  • ✅ Compound filter test (col_1 > 2 & col_2.isin(...)) confirms correct interaction with boolean stats
  • ✅ Pandas divergence for NaN is explicitly tested in test_filtering.py with a descriptive docstring

Build & Dependencies

  • ✅ Both new test files added to CMakeLists.txt
  • ✅ No new dependencies introduced

Security

  • ✅ No hardcoded credentials
  • ✅ No buffer overflow potential

PR Title & Description

  • ✅ Title is clear and accurate
  • ✅ Description explains the two-phase algorithm (range check then per-element walk) and the NaN rationale
  • ✅ Linked issue referenced

Documentation

  • ✅ No public Python API changes requiring docstring updates
  • docs/claude/cpp/PIPELINE.md or PROCESSING.md should be updated to document that column stats now support ISIN/ISNOTIN — specifically the two-pass algorithm and the NaN exclusion from set min/max computation

Overall: The logic is sound and the implementation is clean. The NaN handling is particularly careful. Two items to address before merge: const-correctness on stats_membership_comparator (minor), and a docs/claude/ update for the new stats dispatch path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-release-notes This PR shouldn't be added to release notes. patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant