Description
This issue proposes revisiting whether HDF5 remains the best long-term container choice for OMX, or whether a modern alternative (or optional backend) would better support current and future use cases.
The goal is not to break OMX semantics, but to evaluate whether the container layer could evolve while preserving:
- Stable matrix semantics
- Efficient sparse and dense storage
- Long-term reproducibility
Background
OMX currently uses HDF5 as its underlying container, a design choice discussed early in the project’s history. At the time, HDF5 provided a mature, high-performance, cross-platform solution for large matrix storage.
Since then, the data ecosystem—especially in Python—has shifted significantly toward columnar, cross-language formats such as Apache Arrow and Parquet, which emphasize interoperability, cloud friendliness, and zero-copy data exchange.
Questions for discussion
- Does HDF5 continue to meet OMX’s needs in modern Python and cloud-based workflows?
- Are there known pain points with HDF5 (tooling, deployment, performance, maintenance)?
- Could Arrow IPC / Feather or Parquet realistically serve in either of the following roles while preserving critical current OMX features (random access, slicing, determinism)?
a. A replacement container
b. An optional backend
Prerequisites
We would first need to evolve our governance model so that it can support a decision this significant.
Possible outcomes
- Affirm HDF5 as the long-term container and document why
- Support an alternative container backend while retaining OMX semantics
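To make the second outcome more concrete, here is a minimal sketch of how matrix semantics could be separated from the container layer. Everything in it is hypothetical and for discussion only: `MatrixBackend`, `InMemoryBackend`, `check_backend`, and the matrix name `travel_time` are illustrative names, not part of any existing OMX API. The in-memory class is only a stand-in for a real HDF5, Arrow, or Parquet implementation. The check exercises the three features named in the questions above: random access, slicing, and determinism.

```python
import hashlib
import struct
from typing import Dict, List, Protocol

Matrix = List[List[float]]


class MatrixBackend(Protocol):
    """Hypothetical container interface (an assumption, not an existing OMX API)."""

    def write(self, name: str, rows: Matrix) -> None: ...

    def read_block(self, name: str, row_slice: slice, col_slice: slice) -> Matrix: ...

    def digest(self, name: str) -> str: ...


class InMemoryBackend:
    """Stand-in backend that keeps matrices in a dict, used only to exercise the interface."""

    def __init__(self) -> None:
        self._store: Dict[str, Matrix] = {}

    def write(self, name: str, rows: Matrix) -> None:
        # Reject ragged input so every stored matrix is rectangular.
        width = len(rows[0]) if rows else 0
        if any(len(row) != width for row in rows):
            raise ValueError("matrix rows must have equal length")
        self._store[name] = [[float(v) for v in row] for row in rows]

    def read_block(self, name: str, row_slice: slice, col_slice: slice) -> Matrix:
        # Random access: return only the requested sub-block.
        return [row[col_slice] for row in self._store[name][row_slice]]

    def digest(self, name: str) -> str:
        # Hash the row lengths and little-endian float64 values in row-major order.
        h = hashlib.sha256()
        for row in self._store[name]:
            h.update(struct.pack("<Q", len(row)))
            h.update(struct.pack(f"<{len(row)}d", *row))
        return h.hexdigest()


def check_backend(backend: MatrixBackend) -> bool:
    """Return True if the backend slices correctly and rewrites deterministically."""
    data = [[0.0, 1.5, 2.0], [3.0, 0.0, 4.5], [6.0, 7.0, 0.0]]
    backend.write("travel_time", data)
    block = backend.read_block("travel_time", slice(1, 3), slice(0, 2))
    first = backend.digest("travel_time")
    backend.write("travel_time", data)  # writing the same data again must not change the digest
    second = backend.digest("travel_time")
    return block == [[3.0, 0.0], [6.0, 7.0]] and first == second
```

Calling `check_backend(InMemoryBackend())` runs the check against the stand-in. A real backend would likely need a stronger notion of determinism (for example, byte-identical files on disk), which this sketch does not attempt to define.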