Skip to content

ASV benchmarks for column stats#3018

Open
poodlewars wants to merge 2 commits intomasterfrom
aseaton/column-stats/asv-benchmarks
Open

ASV benchmarks for column stats#3018
poodlewars wants to merge 2 commits intomasterfrom
aseaton/column-stats/asv-benchmarks

Conversation

@poodlewars
Copy link
Copy Markdown
Collaborator

Add benchmarks that run with and without column stats, on both ordered data (that should benefit a lot) and unordered data (that should not benefit). The unordered data helps us to measure the overhead of evaluating column stats when it cannot provide any benefit.

Since we're still waiting to merge #2958 the runs with and without column stats should be identical to start with. This will also help give us some nice evidence when we rebase and merge #2958.

Evidence of stability from 3 runs:

  | Case                              | Run 1   | Run 2   | Run 3   |   CV |
  |-----------------------------------|---------|---------|---------|------|
  | uint64 1M stats=T ordered=T      | 5.97ms  | 5.37ms  | 5.37ms  | 6.2% |
  | uint64 1M stats=F random=F       | 3.67ms  | 3.71ms  | 3.97ms  | 4.3% |
  | uint64 10M stats=T ordered=T     | 13.2ms  | 11.9ms  | 12.4ms  | 5.2% |
  | uint64 10M stats=F random=F      | 16.8ms  | 16.0ms  | 17.2ms  | 3.7% |
  | combined_or 1M stats=T ordered=T | 7.93ms  | 7.38ms  | 8.35ms  | 6.2% |
  | combined_or 10M stats=T ordered=T| 23.5ms  | 22.7ms  | 24.0ms  | 2.8% |
  | combined_or 10M stats=F random=F | 22.7ms  | 22.3ms  | 24.0ms  | 3.9% |

@poodlewars poodlewars added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Apr 9, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 9, 2026

ArcticDB Code Review Summary

PR: ASV benchmarks for column stats (#3018)
Files: python/benchmarks/column_stats.py (extended), python/.asv/results/benchmarks.json (updated seed), Makefile (QUICK=1 flag).

Notes on previous review findings

Three issues flagged in the prior review have been addressed and remain addressed in this commit:

  • Docstring updated (line 95): accurately reflects time_filter_bool's actual selectivity.
  • time_create_column_stats now measures pure creation: the drop_column_stats call was moved to setup().
  • time_isnotin_uint64 asymmetric skip resolved: the SkipNotImplemented was removed entirely — isnotin now runs for both ordered and unordered data, symmetric with time_isin_uint64.

API & Compatibility

  • ✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
  • ✅ On-disk format unchanged (benchmark-only change)
  • ✅ Protobuf schema backwards-compatible
  • ✅ Key types and ValueType enum unchanged

Memory & Safety

  • ✅ RAII used for all resource management (Python only)
  • ✅ No use-after-move or use-after-free patterns
  • ✅ No accidental copies of large objects

Correctness

  • ColumnStatsWideFilter builds the OR tree as a balanced tree to avoid Python recursion limit — comment explains this explicitly and the approach is correct
  • ColumnStatsDynamicSchema creates a separate library with dynamic_schema=True — correct since create_libraries_across_storages uses default library options
  • ✅ Both new benchmark classes call unset_config_int in teardown() so config does not leak between iterations
  • ⚠️ ColumnStatsCreate.setup() calls drop_column_stats before stats exist on the first iteration: unchanged from before. The PR's 3-run stability evidence suggests this is a no-op drop, but a comment confirming drop_column_stats is safe when no stats exist would help.

Code Quality

  • BENCHMARK_COLUMNS constant avoids magic strings and is reused consistently across all read calls
  • _add_extra_columns helper centralises filler-column generation; used by both _generate_column_stats_dataframe and _generate_random_dataframe
  • ✅ Loop ordering in setup_cache fixed: storage loop now outermost so DataFrames are generated once per (rows, ordered) combination instead of once per storage
  • ColumnStatsCreate class docstring removed: the class docstring was deleted and not replaced. ASV uses class docstrings in benchmark reports; ColumnStatsManagement and the new classes still have theirs.

Testing

  • ✅ Two new benchmark classes: ColumnStatsDynamicSchema (filtered column exists only in some segments) and ColumnStatsWideFilter (1004 columns, 1001 OR conditions)
  • ColumnStatsDynamicSchema correctly exercises the "sometimes missing column" path with 5 of 10 chunks missing the column
  • ColumnStatsWideFilter creates stats on all 1004 columns so column-stats pruning can fire in the column_stats_enabled=True case
  • ✅ Reducing num_rows to [10_000_000] only (dropping 1M) is reasonable
  • ✅ Updated benchmarks.json seed data includes all new benchmark entries

Build & Dependencies

  • ✅ No new source files requiring CMakeLists changes (Python benchmark only)
  • create_library and LibraryOptions are existing exports — no new dependencies
  • ✅ Makefile QUICK=1 addition is correct and documented

Security

  • ✅ No hardcoded credentials or secrets

PR Title & Description

  • ✅ Title is clear and concise
  • ✅ Description explains what changed and why, with stability evidence
  • ⚠️ Description does not mention the second commit's additions (ColumnStatsDynamicSchema, ColumnStatsWideFilter, filler columns, loop reordering)

Documentation

  • ✅ New benchmark classes have docstrings explaining measurement intent
  • ✅ No docs/claude/ updates required (no production code changed)

Overall: The extensions in this commit are well-designed and address all previously flagged issues. One minor item: restore a docstring for ColumnStatsCreate (it was the only class whose docstring was removed). Everything else looks good to merge.

@poodlewars poodlewars marked this pull request as draft April 9, 2026 13:05
@poodlewars poodlewars force-pushed the aseaton/column-stats/asv-benchmarks branch from 5cff615 to 45be6f5 Compare April 9, 2026 13:28
@poodlewars poodlewars marked this pull request as ready for review April 9, 2026 13:28
self.lib.read(self.symbol, columns=["uint64_col"], query_builder=q)


class ColumnStatsCreate:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this and ColumnStatsManagement worth benchmarking at this point? I think the current implementation will get ditched, and most stats will be generated at modification time, not through these explicit calls

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the APIs will survive - we will still need a way to create stats over old data that isn't being modified any more.

Alex Seaton added 2 commits April 10, 2026 16:24
Evidence of stability from 3 runs:

  | Case                              | Run 1   | Run 2   | Run 3   |   CV |
  |-----------------------------------|---------|---------|---------|------|
  | uint64 1M stats=T ordered=T      | 5.97ms  | 5.37ms  | 5.37ms  | 6.2% |
  | uint64 1M stats=F random=F       | 3.67ms  | 3.71ms  | 3.97ms  | 4.3% |
  | uint64 10M stats=T ordered=T     | 13.2ms  | 11.9ms  | 12.4ms  | 5.2% |
  | uint64 10M stats=F random=F      | 16.8ms  | 16.0ms  | 17.2ms  | 3.7% |
  | combined_or 1M stats=T ordered=T | 7.93ms  | 7.38ms  | 8.35ms  | 6.2% |
  | combined_or 10M stats=T ordered=T| 23.5ms  | 22.7ms  | 24.0ms  | 2.8% |
  | combined_or 10M stats=F random=F | 22.7ms  | 22.3ms  | 24.0ms  | 3.9% |
@poodlewars poodlewars force-pushed the aseaton/column-stats/asv-benchmarks branch from 45be6f5 to 079025f Compare April 10, 2026 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-release-notes This PR shouldn't be added to release notes. patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants