refactor: simplify python batch writer without regressing throughput #28

## Summary

`atompack-py/src/database_batch.rs` is still the main complexity hotspot after the typed-float work.

The current code is functional and fast enough, but it still forces a tradeoff between:
- keeping the Python batch write path simple
- preserving the throughput of the canonical `add_arrays_batch` path

This should be handled in a dedicated follow-up instead of continuing to widen PR #25.

## Why this needs a follow-up

The typed-float PR had to recover performance regressions while also supporting:
- `v2` and `v3`
- `f32` and `f64` builtin float fields
- raw SOA passthrough
- schema locking / schema-aware append

Most of the remaining accidental complexity is now concentrated in `atompack-py/src/database_batch.rs`.

In particular, the file still contains:
- a canonical fast path for the common `add_arrays_batch` layout
- a more generic path for dtype-flexible writes
- duplicated setup/extraction logic across those paths
- Python-side batching code that still knows a fair amount about storage sections and schema layout

This is the area where simplicity and performance are currently in the strongest tension.

## Goal

Redesign the Python batch writer so that it is easier to reason about and maintain **without regressing throughput**.

The target is not to change DB semantics or the on-disk format. The target is to reduce accidental complexity in the Python-side batch write pipeline.

## Desired direction

A good redesign would likely move toward:
- one shared parsed batch representation, instead of two mostly separate setup flows
- clearer separation between:
  - NumPy boundary validation/extraction
  - schema metadata assembly
  - record emission
- less repeated dtype/shape dispatch in `database_batch.rs`
- less Python-side knowledge of storage layout details where Rust can safely own the decision

One possible shape is a `BatchPlan` / column-view style abstraction that:
- validates shapes and supported dtypes once
- exposes cheap per-record access to builtin/custom payloads
- can feed either a fast canonical emitter or a generic emitter, without duplicating the setup logic
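
To make the `BatchPlan` idea concrete, here is a minimal sketch of what such an abstraction might look like. Everything in it is hypothetical: `FloatColumn`, `BatchPlan`, `PlanError`, the two-field layout (positions plus optional energies), the little-endian byte output, and especially the rule used for "canonical" are placeholders for illustration, not the current contents of `database_batch.rs`. It uses only the standard library; the real version would sit behind the NumPy boundary.

```rust
// Hypothetical sketch: none of these names exist in atompack today.

/// Borrowed view over a flattened float column, tagged with its element type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FloatColumn<'a> {
    F32(&'a [f32]),
    F64(&'a [f64]),
}

impl<'a> FloatColumn<'a> {
    fn len(self) -> usize {
        match self {
            FloatColumn::F32(v) => v.len(),
            FloatColumn::F64(v) => v.len(),
        }
    }

    /// Borrow the `width` values that belong to record `index`; no copy is made.
    fn record(self, index: usize, width: usize) -> FloatColumn<'a> {
        let range = index * width..(index + 1) * width;
        match self {
            FloatColumn::F32(v) => FloatColumn::F32(&v[range]),
            FloatColumn::F64(v) => FloatColumn::F64(&v[range]),
        }
    }

    /// Append the values as little-endian bytes (a stand-in for record emission).
    fn write_le(self, out: &mut Vec<u8>) {
        match self {
            FloatColumn::F32(v) => v.iter().for_each(|x| out.extend_from_slice(&x.to_le_bytes())),
            FloatColumn::F64(v) => v.iter().for_each(|x| out.extend_from_slice(&x.to_le_bytes())),
        }
    }
}

#[derive(Debug, PartialEq)]
enum PlanError {
    ShapeMismatch { field: &'static str, expected: usize, actual: usize },
}

/// Shapes are checked once in `parse`; the emitters only read the plan.
#[derive(Debug)]
struct BatchPlan<'a> {
    records: usize,
    atoms: usize,
    positions: FloatColumn<'a>,        // flattened (records, atoms, 3)
    energies: Option<FloatColumn<'a>>, // flattened (records,)
}

impl<'a> BatchPlan<'a> {
    fn parse(
        records: usize,
        atoms: usize,
        positions: FloatColumn<'a>,
        energies: Option<FloatColumn<'a>>,
    ) -> Result<Self, PlanError> {
        let expected = records * atoms * 3;
        if positions.len() != expected {
            return Err(PlanError::ShapeMismatch { field: "positions", expected, actual: positions.len() });
        }
        if let Some(e) = energies {
            if e.len() != records {
                return Err(PlanError::ShapeMismatch { field: "energies", expected: records, actual: e.len() });
            }
        }
        Ok(BatchPlan { records, atoms, positions, energies })
    }

    /// Placeholder rule for "canonical"; the real criteria are an open design question.
    fn is_canonical(&self) -> bool {
        matches!(self.positions, FloatColumn::F32(_)) && self.energies.is_none()
    }

    /// Generic emitter: walks records one by one with per-record dtype dispatch.
    fn emit_generic(&self) -> Vec<u8> {
        let mut out = Vec::new();
        for i in 0..self.records {
            self.positions.record(i, self.atoms * 3).write_le(&mut out);
            if let Some(e) = self.energies {
                e.record(i, 1).write_le(&mut out);
            }
        }
        out
    }

    /// Fast emitter: one pass over the whole column; only valid when `is_canonical()`.
    fn emit_canonical(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.positions.len() * 4);
        self.positions.write_le(&mut out);
        out
    }

    /// Single entry point: both emitters consume the same validated plan.
    fn emit(&self) -> Vec<u8> {
        if self.is_canonical() { self.emit_canonical() } else { self.emit_generic() }
    }
}
```

The point of the shape is that `parse` is the only place that knows about shapes and dtypes, and the two emitters differ only in how they walk an already-validated plan. For canonical input the two emitters should produce identical bytes, which gives a cheap equivalence test for any future fast path.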

## Constraints

Any redesign should preserve the current performance expectations for the official smoke path, especially:
- `Database.add_arrays_batch(...)` with canonical builtin arrays
- no meaningful regression versus `origin/main`

The redesign should also avoid:
- adding more database-format branching
- duplicating logic that already exists in Rust storage code
- forcing extra copies or schema reparsing in the hot path unless measurement proves the cost is negligible

## Acceptance criteria

- `atompack-py/src/database_batch.rs` is materially smaller and easier to follow
- setup/validation duplication between canonical and generic paths is reduced
- the throughput smoke test remains at parity with the current post-fix state
- no changes to on-disk compatibility or schema semantics are required
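
One way the parity criterion could be made checkable is sketched below. The helper names (`best_of`, `throughput`, `at_parity`) and the idea of a fractional tolerance are assumptions for illustration; the issue does not define a numeric threshold, so any value passed as `tolerance` would need to be agreed on separately.

```rust
use std::time::{Duration, Instant};

/// Run `work` `runs` times and keep the fastest wall-clock time,
/// which is less sensitive to scheduler noise than a single run.
fn best_of<F: FnMut()>(runs: usize, mut work: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .min()
        .unwrap_or(Duration::ZERO)
}

/// Records written per second for one timed batch.
fn throughput(records: usize, elapsed: Duration) -> f64 {
    records as f64 / elapsed.as_secs_f64()
}

/// True when `candidate` is at most `tolerance` (a fraction, e.g. 0.05)
/// below `baseline`. The tolerance is a hypothetical knob, not a project setting.
fn at_parity(baseline: f64, candidate: f64, tolerance: f64) -> bool {
    candidate >= baseline * (1.0 - tolerance)
}
```

A smoke check along these lines would time the same `add_arrays_batch` workload on the baseline and on the refactored branch, then compare the two throughput numbers with `at_parity`.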

## Context

This issue is intentionally a follow-up to the typed-float / v3 PR, not part of that merge.

