Skip to content

Symbol concatenation with arrow written data#3010

Open
IvoDD wants to merge 1 commit intomasterfrom
arrow-sparse-complex-query-builder-tests
Open

Symbol concatenation with arrow written data#3010
IvoDD wants to merge 1 commit intomasterfrom
arrow-sparse-complex-query-builder-tests

Conversation

@IvoDD
Copy link
Copy Markdown
Collaborator

@IvoDD IvoDD commented Apr 6, 2026

Reference Issues/PRs

Monday ref: 18053022100

What does this implement or fix?

Concatenation is made to work with arrow written normalization metadata. Mixing arrow and pandas also works as long as arrow.has_index() == pandas_index.is_physically_stored().

This PR also adds sparse arrow testing for groupby, resampling and concat.

GroupBy and concat work as expected.

Resampling does not and is a non trivial fix. Monday ref to track: 11679866800

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@IvoDD IvoDD added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Apr 6, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 6, 2026

ArcticDB Code Review Summary

API & Compatibility

  • No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
  • Deprecation protocol followed for removed/renamed parameters
  • On-disk format unchanged (or migration path documented)
  • Protobuf schema backwards-compatible
  • Key types and ValueType enum unchanged (or additive only)

Memory & Safety

  • RAII used for all resource management
  • No use-after-move or use-after-free patterns
  • Buffer size calculations correct (no overflow potential)
  • GIL correctly managed at Python-C++ boundary
  • No accidental copies of large objects

Correctness

  • All ValueType/KeyType switch statements exhaustive
  • Edge cases handled (empty data, NaN, Unicode, concurrent access)
  • Error codes unique and in correct ErrorCategory
  • Storage backend interface fully implemented
  • Query clause composition correct

Code Quality

  • No duplicated logic (search repo for existing utilities)
  • C++ and Python implementations of shared concepts are consistent
  • Tests are not duplicating existing test scenarios

Testing

  • Behavioural changes have corresponding tests
  • Edge cases covered in tests
  • Integration tests for storage/version engine changes
  • No regression in existing test expectations

Build & Dependencies

  • New source files added to CMakeLists.txt
  • Dependency changes justified and version-pinned
  • Cross-platform compatibility maintained
  • vcpkg submodule points to official upstream commit
  • vcpkg manifest versions pinned and consistent with submodule baseline
  • Custom overlay ports up-to-date with dependency versions
  • Conda environment synced with vcpkg dependency versions

Security

  • No hardcoded credentials or secrets
  • Input validation at system boundaries
  • No buffer overflow potential in C++ code

PR Title & Description

  • Title is clear, concise, and uses imperative verb
  • Title and description are free of typos and grammatical errors
  • Description explains what changed and why
  • All significant changes in the diff are mentioned in the description
  • API/breaking changes explicitly called out in the description
  • Linked issues referenced where applicable
  • PR labelled appropriately (patch label set)

Documentation

  • Public API docstrings accurate
  • Breaking changes flagged with appropriate labels

Summary

This PR adds comprehensive tests for groupby, resample, and concat on sparse Arrow data, fixes concat to accept experimental_arrow normalization metadata, and refactors generate_norm_meta into a pairwise accumulate_norm_metadata helper that handles mixed Arrow+Pandas and Arrow+Arrow cases, plus adds TestMixedArrowPandasConcat integration tests.

Open issues (still unresolved):

Medium — xfail markers missing strict=True (from previous review, unchanged)
Both TestSparseArrowResample (class-level marker) and TestSparseArrowConcat.test_concat_with_resample use @pytest.mark.xfail without strict=True. When sparse resampling is eventually implemented these tests will silently XPASS instead of prompting removal of the marker.

Low — Combined sparse/row-count check in sorted_aggregation.cpp (from previous review, unchanged)
The single schema::check guards two independent invariants (!is_sparse() and row_count() == index_row_count()). A non-sparse column with a mismatched row count would produce the misleading message "Cannot aggregate sparse column".

Informational — _check_query_result Pandas cross-check for resample (from previous review, partially addressed)
The resample (sort_by=None) path compares unsorted results. This is a weaker assertion rather than a correctness bug.


Latest push (f2fb26d):

  • Documentation only: Reorganised comments in accumulate_norm_metadata in clause_utils.cpp. The single block comment that preceded the function header has been split into per-section inline comments (Arrow+Arrow, One arrow/one pandas, Pandas+Pandas), with the Pandas+Pandas section gaining a detailed bullet-list description of the merge rules. No logic changes. No new issues introduced.

Previously:

  • Rename only (fe96b45): common_norm_metadataaccumulate_norm_metadata and parameter firstaccumulated in clause_utils.cpp. Pure cosmetic rename — no logic changes.
  • test_different_columns had an undefined fixture parameterindex_column removed from method signature.
  • test_no_index / test_with_index expected ordering was always arrow-first — Both tests now compute expected correctly respecting the arrow_first parameter.
  • PR labelpatch label is now set.

@IvoDD IvoDD force-pushed the arrow-sparse-complex-query-builder-tests branch from 31c5dd5 to f8bc36a Compare April 8, 2026 15:04
@IvoDD IvoDD changed the title Thorough testing complex processing pipeline with sparse data Symbol concatenation with arrow written data Apr 8, 2026
@IvoDD IvoDD force-pushed the arrow-sparse-complex-query-builder-tests branch from f8bc36a to 92d9f8f Compare April 9, 2026 09:14
return pandas_meta;
}

// Pandas + Pandas
Copy link
Copy Markdown
Collaborator Author

@IvoDD IvoDD Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Pandas + Pandas code in accumulate_norm_metadata is moved from the previous for loop in generate_norm_meta with the following minor changes:

  • Variable renames to be more clear
  • A couple more readability changes (non-behavior changing) highlighted in separate comments

}
if (index.tz() != res_index->tz()) {
res_index->clear_tz();
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was moved in favor of another comment in generate_norm_metadata explaining the interaction between non_mathching_name_indices and fake_field_pos

res_index->fake_name()) {
res_index->set_name("index");
res_index->set_is_int(false);
res_index->set_fake_name(true);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was replaced by an:

            if ((*res_index->mutable_timezone())[idx] != idx_timezone) {
                (*res_index->mutable_timezone())[idx] = "";
            }

@IvoDD IvoDD force-pushed the arrow-sparse-complex-query-builder-tests branch from 92d9f8f to fe96b45 Compare April 9, 2026 09:40
This PR adds sparse arrow testing for groupby, resampling and concat.

GroupBy works.

Resampling does not. Monday ref: 11679866800

Concat is made to work in this PR and includes extra testing for
concatenation between arrow and pandas
@IvoDD IvoDD force-pushed the arrow-sparse-complex-query-builder-tests branch from fe96b45 to f2fb26d Compare April 9, 2026 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-release-notes This PR shouldn't be added to release notes. patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant