Discuss if we should change serialization format #37

@e-lo

This issue proposes revisiting whether HDF5 remains the best long-term container choice for OMX, or whether a modern alternative (or optional backend) would better support current and future use cases.

The goal is not to break OMX semantics, but to evaluate whether the container layer could evolve while preserving:

  • Stable matrix semantics
  • Efficient sparse and dense storage
  • Long-term reproducibility

Background

OMX currently uses HDF5 as its underlying container, a design choice discussed early in the project’s history. At the time, HDF5 provided a mature, high-performance, cross-platform solution for large matrix storage.

Since then, the data ecosystem—especially in Python—has shifted significantly toward columnar, cross-language formats such as Apache Arrow and Parquet, which emphasize interoperability, cloud friendliness, and zero-copy data exchange.

Questions for discussion

  1. Does HDF5 continue to meet OMX’s needs in modern Python and cloud-based workflows?
  2. Are there known pain points with HDF5 (tooling, deployment, performance, maintenance)?
  3. Could Arrow IPC / Feather or Parquet realistically serve as:
    a. A replacement container?
    b. An optional backend?
    ...while still supporting critical current OMX features (random access, slicing, determinism)?

Prerequisites

We need to evolve our governance model before it can support a decision this significant.

Possible outcomes

  1. Affirm HDF5 as the long-term container and document why
  2. Support an alternative container backend while retaining OMX semantics
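
To make outcome 2 a bit more concrete, here is a rough sketch of what a container-agnostic layer could look like. Everything in it is hypothetical: `MatrixBackend`, `InMemoryBackend`, and `intrazonal_total` are illustrative names, not part of any existing OMX library, and the in-memory class only stands in for a real HDF5-, Arrow-, or Parquet-backed implementation. The point is just that caller code would depend on the matrix semantics (named matrices, block reads by row/column slice) rather than on the container.

```python
from typing import Dict, List, Protocol, Sequence


class MatrixBackend(Protocol):
    """Hypothetical container interface; not an existing OMX API."""

    def write_matrix(self, name: str, rows: Sequence[Sequence[float]]) -> None: ...

    def read_block(self, name: str, rows: slice, cols: slice) -> List[List[float]]: ...

    def list_matrices(self) -> List[str]: ...


class InMemoryBackend:
    """Toy stand-in for an HDF5-, Arrow-, or Parquet-backed container."""

    def __init__(self) -> None:
        self._data: Dict[str, List[List[float]]] = {}

    def write_matrix(self, name, rows):
        copied = [list(r) for r in rows]
        # Reject ragged input so every stored matrix is rectangular.
        if copied and any(len(r) != len(copied[0]) for r in copied):
            raise ValueError("all rows must have the same length")
        self._data[name] = copied

    def read_block(self, name, rows, cols):
        # Random access: return only the requested row/column window.
        return [row[cols] for row in self._data[name][rows]]

    def list_matrices(self):
        # Sorted so the listing order is deterministic.
        return sorted(self._data)


def intrazonal_total(backend: MatrixBackend, name: str, n_zones: int) -> float:
    """Sum the diagonal using only the interface, not the container."""
    return sum(
        backend.read_block(name, slice(i, i + 1), slice(i, i + 1))[0][0]
        for i in range(n_zones)
    )


backend = InMemoryBackend()
backend.write_matrix("trips", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
block = backend.read_block("trips", slice(0, 2), slice(1, 3))
```

Under this kind of split, swapping the container would mean adding another class that satisfies `MatrixBackend`, while functions like `intrazonal_total` stay unchanged. Whether a real Arrow or Parquet backend could offer efficient block reads is exactly the open question above; this sketch does not answer it.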
