Support column stats for isin and isnotin #3014
Open
poodlewars wants to merge 2 commits into master
bc7f2ef to cfd38a0
added 2 commits April 10, 2026 11:50
…e column stats test files are getting massive
cfd38a0 to 0969284
StatsComparison stats_membership_comparator(const ColumnStatsValues& stats, ValueSet& value_set, OperationType op);
Contributor
The value_set parameter should be const ValueSet& — the function only reads from it (empty(), min_value(), max_value(), get_set()). This requires marking get_set() as const in value_set.hpp, which is straightforward since it just returns a shared_ptr to immutable data.
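A minimal sketch of the const-correctness change this comment asks for. The `ValueSet` below is a hypothetical stand-in rather than the real class from value_set.hpp, and `count_members` is an invented consumer; only the shape of the accessor is the point.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for ValueSet; the real class in value_set.hpp differs.
class ValueSet {
  public:
    explicit ValueSet(std::unordered_set<long> values)
        : values_(std::make_shared<const std::unordered_set<long>>(std::move(values))) {}

    bool empty() const { return values_->empty(); }

    // Marked const: handing out a copy of the shared_ptr does not modify the
    // ValueSet, and the pointee is immutable.
    std::shared_ptr<const std::unordered_set<long>> get_set() const { return values_; }

  private:
    std::shared_ptr<const std::unordered_set<long>> values_;
};

// A read-only consumer (invented for this sketch) can now take the set by
// const reference, which would not compile if get_set() were non-const.
std::size_t count_members(const ValueSet& value_set) {
    return value_set.empty() ? 0 : value_set.get_set()->size();
}
```

With a const accessor, callers holding only a `const ValueSet&` can still read the set, so the comparator signature no longer needs a mutable reference.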
Contributor
ArcticDB Code Review Summary
API & Compatibility
Memory & Safety
Correctness
Code Quality
Testing
Build & Dependencies
Security
PR Title & Description
Documentation
Overall: The logic is sound and the implementation is clean. The NaN handling is particularly careful. Two items to address before merge: const-correctness on …
Please review commit-by-commit - 0969284 is just a straight move of some test code.
Reference Issues/PRs
11292567032
What does this implement or fix?
Use column stats at read time with isin and isnotin. For a given set, we compute min(set) and max(set) and first evaluate column stats against that. This can cheaply show that the whole set lies outside of a given block (and in some degenerate cases show that everything in the block matches the set).

If this gives an UNKNOWN result, we then walk through the set, evaluating the EqualsOperator against the stats and each element.

This has quite careful NaN handling. In computing the min and max of the set we exclude NaN, as collections including NaN do not have a stable ordering. If the statistics for a block are min=max=nan, this means that the entire block is NaN, and therefore isin will always be False.

Worth noting that our NaN handling with isin does not match Pandas. I added a test in test_filtering.py to show the difference.
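The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `BlockStats`, `stats_equals` and `stats_isin` are hypothetical names, values are plain doubles, NaN in the set is assumed never to match, and a block with finite min == max is assumed to contain no NaN rows.

```cpp
#include <cmath>
#include <optional>
#include <vector>

// Hypothetical types for this sketch; not ArcticDB's real definitions.
enum class StatsComparison { ALL_MATCH, NONE_MATCH, UNKNOWN };

struct BlockStats {
    double min;
    double max;
};

// Equality of a single non-NaN value against a block's min/max.
StatsComparison stats_equals(const BlockStats& stats, double value) {
    if (value < stats.min || value > stats.max) return StatsComparison::NONE_MATCH;
    // Assumption: a block with min == max holds that one value in every row.
    if (stats.min == stats.max) return StatsComparison::ALL_MATCH;
    return StatsComparison::UNKNOWN;
}

StatsComparison stats_isin(const BlockStats& stats, const std::vector<double>& values) {
    // min = max = NaN means the whole block is NaN, so isin is always false.
    if (std::isnan(stats.min) && std::isnan(stats.max)) return StatsComparison::NONE_MATCH;

    // min(set) and max(set), skipping NaN because it has no stable ordering.
    std::optional<double> lo;
    std::optional<double> hi;
    for (double v : values) {
        if (std::isnan(v)) continue;
        if (!lo || v < *lo) lo = v;
        if (!hi || v > *hi) hi = v;
    }
    // Empty set, or a set of only NaN: nothing can match (sketch assumption).
    if (!lo) return StatsComparison::NONE_MATCH;

    // Cheap check: the whole set lies outside the block's range.
    if (*hi < stats.min || *lo > stats.max) return StatsComparison::NONE_MATCH;

    // Otherwise walk the set, testing equality of each element against the stats.
    bool any_unknown = false;
    for (double v : values) {
        if (std::isnan(v)) continue;
        const StatsComparison result = stats_equals(stats, v);
        if (result == StatsComparison::ALL_MATCH) return StatsComparison::ALL_MATCH;
        if (result == StatsComparison::UNKNOWN) any_unknown = true;
    }
    return any_unknown ? StatsComparison::UNKNOWN : StatsComparison::NONE_MATCH;
}
```

The per-element walk matters for a set such as {5, 25} against a block with min=10, max=20: the set's range overlaps the block, so the cheap check is inconclusive, but no individual element can be present.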