Summary
atompack-py/src/database_batch.rs is still the main complexity hotspot after the typed-float work.
The current code is functional and fast enough, but it still carries a real tradeoff between:
- keeping the Python batch write path simple
- preserving the throughput of the canonical add_arrays_batch path
This should be handled in a dedicated follow-up instead of continuing to widen PR #25.
Why this needs a follow-up
The typed-float PR had to recover performance regressions while also supporting:
- v2 and v3
- f32 and f64 builtin float fields
- raw SOA passthrough
- schema locking / schema-aware append
Most of the remaining accidental complexity is now concentrated in atompack-py/src/database_batch.rs.
In particular, the file still contains:
- a canonical fast path for the common add_arrays_batch layout
- a more generic path for dtype-flexible writes
- duplicated setup/extraction logic across those paths
- Python-side batching code that still knows a fair amount about storage sections and schema layout
This is the area where simplicity and performance are currently in the strongest tension.
Goal
Redesign the Python batch writer so that it is easier to reason about and maintain without regressing throughput.
The target is not to change DB semantics or the on-disk format. The target is to reduce accidental complexity in the Python-side batch write pipeline.
Desired direction
A good redesign would likely move toward:
- one shared parsed batch representation, instead of two mostly separate setup flows
- clearer separation between:
  - NumPy boundary validation/extraction
  - schema metadata assembly
  - record emission
- less repeated dtype/shape dispatch in database_batch.rs
- less Python-side knowledge of storage layout details where Rust can safely own the decision
One possible shape is a BatchPlan / column-view style abstraction that:
- validates shapes and supported dtypes once
- exposes cheap per-record access to builtin/custom payloads
- can feed either a fast canonical emitter or a generic emitter without duplicating the setup logic
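As a rough illustration of that shape, the sketch below validates a batch once and then lets two emitters read from the same plan. Everything in it is hypothetical and not taken from atompack: the names FloatColumn, BatchPlan and PlanError, the two-column layout (positions and energies), the choice of "f32 positions with f64 energies" as the canonical case, and the little-endian byte output are all assumptions made only to keep the example small. Plain slices stand in for NumPy views.

```rust
/// Borrowed float column. Hypothetical stand-in for a validated NumPy view.
#[derive(Debug, Clone, Copy)]
pub enum FloatColumn<'a> {
    F32(&'a [f32]),
    F64(&'a [f64]),
}

impl<'a> FloatColumn<'a> {
    fn len(&self) -> usize {
        match self {
            FloatColumn::F32(v) => v.len(),
            FloatColumn::F64(v) => v.len(),
        }
    }

    /// Append `count` values starting at `start` as little-endian bytes.
    fn write_le(&self, start: usize, count: usize, out: &mut Vec<u8>) {
        match self {
            FloatColumn::F32(v) => {
                for x in &v[start..start + count] {
                    out.extend_from_slice(&x.to_le_bytes());
                }
            }
            FloatColumn::F64(v) => {
                for x in &v[start..start + count] {
                    out.extend_from_slice(&x.to_le_bytes());
                }
            }
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum PlanError {
    ShapeMismatch { column: &'static str, expected: usize, actual: usize },
}

fn check_len(column: &'static str, expected: usize, actual: usize) -> Result<(), PlanError> {
    if expected == actual {
        Ok(())
    } else {
        Err(PlanError::ShapeMismatch { column, expected, actual })
    }
}

/// Shapes are validated once in `new`; both emitters read from the same plan.
pub struct BatchPlan<'a> {
    n_records: usize,
    atoms_per_record: usize,
    positions: FloatColumn<'a>, // flat (n_records, atoms_per_record, 3)
    energies: FloatColumn<'a>,  // flat (n_records,)
}

impl<'a> BatchPlan<'a> {
    pub fn new(
        n_records: usize,
        atoms_per_record: usize,
        positions: FloatColumn<'a>,
        energies: FloatColumn<'a>,
    ) -> Result<Self, PlanError> {
        check_len("positions", n_records * atoms_per_record * 3, positions.len())?;
        check_len("energies", n_records, energies.len())?;
        Ok(Self { n_records, atoms_per_record, positions, energies })
    }

    pub fn len(&self) -> usize {
        self.n_records
    }

    /// Assumed definition of "canonical" for this sketch only.
    pub fn is_canonical(&self) -> bool {
        matches!((self.positions, self.energies), (FloatColumn::F32(_), FloatColumn::F64(_)))
    }

    /// Generic emitter: dispatches on dtype for every column of record `i`.
    /// Panics if `i` is out of range.
    pub fn emit_generic(&self, i: usize) -> Vec<u8> {
        let n = self.atoms_per_record * 3;
        let mut out = Vec::new();
        self.positions.write_le(i * n, n, &mut out);
        self.energies.write_le(i, 1, &mut out);
        out
    }

    /// Fast emitter: returns None unless the plan is canonical, and pre-sizes
    /// the output buffer because the dtypes are known.
    pub fn emit_canonical(&self, i: usize) -> Option<Vec<u8>> {
        match (self.positions, self.energies) {
            (FloatColumn::F32(pos), FloatColumn::F64(en)) => {
                let n = self.atoms_per_record * 3;
                let mut out = Vec::with_capacity(n * 4 + 8);
                for x in &pos[i * n..(i + 1) * n] {
                    out.extend_from_slice(&x.to_le_bytes());
                }
                out.extend_from_slice(&en[i].to_le_bytes());
                Some(out)
            }
            _ => None,
        }
    }
}
```

The point of the sketch is only that validation lives in one constructor and that the fast path is a specialization over the same data, so the two emitters can be checked against each other record by record. A real version would also have to cover custom payloads and the schema-aware append cases listed earlier.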
Constraints
Any redesign should preserve the current performance expectations for the official smoke path, especially:
- Database.add_arrays_batch(...) with canonical builtin arrays
- no meaningful regression versus origin/main
The redesign should also avoid:
- adding more database-format branching
- duplicating logic that already exists in Rust storage code
- forcing extra copies or schema reparsing in the hot path unless measurement proves the cost is negligible
Acceptance criteria
- atompack-py/src/database_batch.rs is materially smaller and easier to follow
- setup/validation duplication between canonical and generic paths is reduced
- throughput smoke remains at parity with the current post-fix state
- no changes to on-disk compatibility or schema semantics are required
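The throughput criterion is easier to enforce if "parity" is written down as a number. The helpers below are a minimal sketch of one way to do that, not the project's actual smoke harness: measure_throughput and at_parity are hypothetical names, and the tolerance is a placeholder that would need to be chosen from the observed run-to-run noise.

```rust
use std::time::Instant;

/// Records per second for one call to `write_batch` that processes `n_records`.
/// `write_batch` is a placeholder for whatever the smoke test actually runs.
pub fn measure_throughput<F: FnMut()>(n_records: usize, mut write_batch: F) -> f64 {
    let start = Instant::now();
    write_batch();
    let secs = start.elapsed().as_secs_f64();
    // Guard against a zero elapsed time on very short runs.
    n_records as f64 / secs.max(f64::EPSILON)
}

/// True if `candidate` is no more than `tolerance` (a fraction, e.g. 0.05)
/// below `baseline`. Faster-than-baseline results always pass.
pub fn at_parity(baseline: f64, candidate: f64, tolerance: f64) -> bool {
    candidate >= baseline * (1.0 - tolerance)
}
```

A single timed run is noisy, so in practice the comparison would more likely use the median of several runs on both the baseline and the redesigned branch.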
Context
This issue is intentionally a follow-up to the typed-float / v3 PR, not part of that merge.