Skip to content

Explicitly declare spill codec dependency#21917

Open
kosiew wants to merge 5 commits intoapache:mainfrom
kosiew:codecissue-21914
Open

Explicitly declare spill codec dependency#21917
kosiew wants to merge 5 commits intoapache:mainfrom
kosiew:codecissue-21914

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Apr 29, 2026

Which issue does this PR close?

Rationale for this change

The spill subsystem implicitly relied on workspace-level Cargo feature unification (e.g., arrow-ipc/zstd), which made codec availability fragile and prone to silent regressions when features were altered elsewhere. This change establishes an explicit, crate-local contract for required codecs, improving correctness, maintainability, and debuggability.

What changes are included in this PR?

  • Add an explicit arrow-ipc dependency (with lz4 and zstd features) to datafusion-physical-plan to guarantee codec availability for spill compression.
  • Document the codec contract in IPCStreamWriter, clarifying runtime expectations and failure modes.
  • Add arrow-ipc to Cargo.lock.
  • Add cargo-machete ignore entries where dependencies are feature-only (e.g., arrow-ipc, serde_json) to avoid false positives.
  • Minor comment cleanups to clarify intent around feature-only dependencies.

Are these changes tested?

No new tests are included in this PR.

Are there any user-facing changes?

No user-facing changes. This PR only makes an internal dependency contract explicit and improves developer-facing documentation.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

…ec contract

- Cleaned up alphabetical ordering in Cargo.toml.
- Added comments to clarify feature-only dependencies related to IPCStreamWriter in spill/mod.rs.
- Documented codec contract requirement in IPCStreamWriter::new for better understanding of feature dependencies.
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Apr 29, 2026
- Changed `ignored = "serde_json"` to `ignored = ["serde_json"]` to comply with TOML array requirements.
- Fixed additional TOML parse errors.
- Added `arrow-ipc` to the `cargo-machete` ignore list with a comment on its necessity for spill compression codec support.
@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2026
kosiew added a commit to kosiew/datafusion that referenced this pull request Apr 29, 2026
- Clarified spill codec comments in Cargo.toml:50 and Cargo.toml:99 to explain that arrow-ipc is consumed only via the arrow::ipc re-export, addressing potential unused flags from cargo-machete.
- Revised writer doc comment in mod.rs:295 to specify "build-visible during Cargo feature resolution" instead of "compiler-visible."
- Replaced bare commit hash reference with PR apache#21917 for clarity.
- Added brief notes to the opportunistic serde_json cargo-machete cleanup in Cargo.toml:85 and Cargo.toml:97 to indicate they are unrelated to the spill contract.
- Clarified spill codec comments in Cargo.toml:50 and Cargo.toml:99 to explain that arrow-ipc is consumed only via the arrow::ipc re-export, addressing potential unused flags from cargo-machete.
- Revised writer doc comment in mod.rs:295 to specify "build-visible during Cargo feature resolution" instead of "compiler-visible."
- Replaced bare commit hash reference with PR apache#21917 for clarity.
- Added brief notes to the opportunistic serde_json cargo-machete cleanup in Cargo.toml:85 and Cargo.toml:97 to indicate they are unrelated to the spill contract.
@kosiew kosiew force-pushed the codecissue-21914 branch from cf9228a to 7ba281a Compare April 29, 2026 10:16
@github-actions github-actions Bot added the development-process Related to development process of DataFusion label Apr 29, 2026
@kosiew kosiew changed the title Explicitly declare spill codec dependency on arrow-ipc (lz4, zstd) and document contract Explicitly declare spill codec dependency and move semver PR comment to follow-up workflow Apr 29, 2026
@kosiew kosiew marked this pull request as ready for review April 29, 2026 13:06
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 29, 2026

Can we please split the breaking_change refactor from the codec depedendcy work? They seem separtae and having two PRs would be easier to review

@kosiew kosiew marked this pull request as draft April 30, 2026 04:54
@kosiew kosiew force-pushed the codecissue-21914 branch from 7e54a70 to 7ba281a Compare April 30, 2026 05:18
@github-actions github-actions Bot removed the development-process Related to development process of DataFusion label Apr 30, 2026
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented Apr 30, 2026

split the breaking_change refactor into #21951

@rluvaton
Copy link
Copy Markdown
Member

merged with main since the breaking change detector is now fixed

@kosiew kosiew force-pushed the codecissue-21914 branch from fe58db3 to 0e927a6 Compare April 30, 2026 09:55
@kosiew kosiew marked this pull request as ready for review April 30, 2026 10:26
@kosiew kosiew requested a review from alamb April 30, 2026 10:26
@rluvaton
Copy link
Copy Markdown
Member

can you update the title as well?

@kosiew kosiew changed the title Explicitly declare spill codec dependency and move semver PR comment to follow-up workflow Explicitly declare spill codec dependency Apr 30, 2026
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented Apr 30, 2026

@rluvaton
Thanks for catching this.
Updated the PR title.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make spill codec availability an explicit contract of the spill stack

3 participants