refactor: simplify python batch writer without regressing throughput #28

@Ramlaoui

Summary

atompack-py/src/database_batch.rs is still the main complexity hotspot after the typed-float work.

The current code is functional and fast enough, but it still carries a real tradeoff between:

  • keeping the Python batch write path simple
  • preserving the throughput of the canonical add_arrays_batch path

This should be handled in a dedicated follow-up instead of continuing to widen PR #25.

Why this needs a follow-up

The typed-float PR had to recover performance regressions while also supporting:

  • v2 and v3
  • f32 and f64 builtin float fields
  • raw SOA passthrough
  • schema locking / schema-aware append

Most of the remaining accidental complexity is now concentrated in atompack-py/src/database_batch.rs.

In particular, the file still contains:

  • a canonical fast path for the common add_arrays_batch layout
  • a more generic path for dtype-flexible writes
  • duplicated setup/extraction logic across those paths
  • Python-side batching code that still knows a fair amount about storage sections and schema layout

This is the area where simplicity and performance are currently in the strongest tension.

Goal

Redesign the Python batch writer so that it is easier to reason about and maintain without regressing throughput.

The target is not to change DB semantics or the on-disk format. The target is to reduce accidental complexity in the Python-side batch write pipeline.

Desired direction

A good redesign would likely move toward:

  • one shared parsed batch representation, instead of two mostly separate setup flows
  • clearer separation between:
    • NumPy boundary validation/extraction
    • schema metadata assembly
    • record emission
  • less repeated dtype/shape dispatch in database_batch.rs
  • less Python-side knowledge of storage layout details where Rust can safely own the decision

One possible shape is a BatchPlan / column-view style abstraction that:

  • validates shapes and supported dtypes once
  • exposes cheap per-record access to builtin/custom payloads
  • can feed either a fast canonical emitter or a generic emitter, without duplicating the setup logic
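
To make the BatchPlan idea concrete, here is a minimal sketch in plain Rust (no PyO3 or NumPy bindings). Every name in it (`FloatColumn`, `BatchPlan`, `PlanError`, `emit_canonical`, `emit_generic`) is hypothetical and not taken from the current code. It also assumes, purely for illustration, that the canonical layout means f32 positions of shape (n_records, n_atoms, 3) stored flat in row-major order, and that a record payload is just its little-endian bytes.

```rust
/// Borrowed view over a flat, row-major float column (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FloatColumn<'a> {
    F32(&'a [f32]),
    F64(&'a [f64]),
}

impl<'a> FloatColumn<'a> {
    pub fn len(&self) -> usize {
        match *self {
            FloatColumn::F32(v) => v.len(),
            FloatColumn::F64(v) => v.len(),
        }
    }

    /// Little-endian bytes of the values in `start..end` (range must be in bounds).
    fn le_bytes(&self, start: usize, end: usize) -> Vec<u8> {
        let mut out = Vec::new();
        match *self {
            FloatColumn::F32(v) => {
                for x in &v[start..end] {
                    out.extend_from_slice(&x.to_le_bytes());
                }
            }
            FloatColumn::F64(v) => {
                for x in &v[start..end] {
                    out.extend_from_slice(&x.to_le_bytes());
                }
            }
        }
        out
    }
}

#[derive(Debug, PartialEq)]
pub enum PlanError {
    Overflow,
    ShapeMismatch { expected: usize, actual: usize },
}

/// Shapes and dtypes are validated once here; both emitters consume the result.
pub struct BatchPlan<'a> {
    n_records: usize,
    stride: usize, // values per record = n_atoms * 3
    positions: FloatColumn<'a>,
}

impl<'a> BatchPlan<'a> {
    pub fn new(
        n_records: usize,
        n_atoms: usize,
        positions: FloatColumn<'a>,
    ) -> Result<Self, PlanError> {
        let stride = n_atoms.checked_mul(3).ok_or(PlanError::Overflow)?;
        let expected = n_records.checked_mul(stride).ok_or(PlanError::Overflow)?;
        if positions.len() != expected {
            return Err(PlanError::ShapeMismatch { expected, actual: positions.len() });
        }
        Ok(BatchPlan { n_records, stride, positions })
    }
}

/// Fast path for the assumed canonical dtype (f32); returns None for anything else.
pub fn emit_canonical(plan: &BatchPlan<'_>) -> Option<Vec<Vec<u8>>> {
    match plan.positions {
        FloatColumn::F32(_) => Some(emit_generic(plan)),
        FloatColumn::F64(_) => None,
    }
}

/// Generic path: one payload per record, dispatching on dtype inside the column view.
pub fn emit_generic(plan: &BatchPlan<'_>) -> Vec<Vec<u8>> {
    (0..plan.n_records)
        .map(|i| plan.positions.le_bytes(i * plan.stride, (i + 1) * plan.stride))
        .collect()
}
```

In this toy version the canonical emitter simply delegates to the generic one after checking the dtype. A real fast path would specialize the inner loop instead, but the point of the sketch is that both emitters start from the same validated `BatchPlan`, so shape and dtype checks live in one place.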

Constraints

Any redesign should preserve the current performance expectations for the official smoke path, especially:

  • Database.add_arrays_batch(...) with canonical builtin arrays
  • no meaningful regression versus origin/main

The redesign should also avoid:

  • adding more database-format branching
  • duplicating logic that already exists in Rust storage code
  • forcing extra copies or schema reparsing in the hot path unless measurement proves the cost is negligible
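
The "no meaningful regression" constraint could be made mechanical with a small parity check like the sketch below. The helper names and the idea of a fractional tolerance are assumptions for illustration. This issue does not define a threshold, so the tolerance would have to be agreed on separately.

```rust
use std::time::Duration;

/// Records written per second for one smoke run (hypothetical helper).
pub fn throughput(records: u64, elapsed: Duration) -> f64 {
    records as f64 / elapsed.as_secs_f64()
}

/// True when `candidate` is no more than `tolerance` (a fraction, e.g. 0.05)
/// below `baseline`. Both values are throughputs in records per second.
pub fn is_at_parity(baseline: f64, candidate: f64, tolerance: f64) -> bool {
    candidate >= baseline * (1.0 - tolerance)
}
```

A comparison against origin/main would then measure both builds on the same machine and the same input, and pass the two throughputs to `is_at_parity`.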

Acceptance criteria

  • atompack-py/src/database_batch.rs is materially smaller and easier to follow
  • setup/validation duplication between canonical and generic paths is reduced
  • the throughput smoke test remains at parity with the current post-fix state
  • no changes to on-disk compatibility or schema semantics are required

Context

This issue is intentionally a follow-up to the typed-float / v3 PR, not part of that merge.
