fix: equalize with delta/delta-rs (v0.52) #55
Open
mandrush wants to merge 503 commits into
adampolomski
previously approved these changes
Feb 17, 2026
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
I cleaned up some junk labels last week and didn't realize that I broke this action a bit 🙈
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Liam Brannigan <liambrannigan@Liams-MacBook-Pro.local>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: adam.polomski <adam.polomski@relativity.com>
# Description
This PR contains a small fix that addresses non-determinism in the
DeltaTableProvider.
The issue occurs under the following circumstances:
- A projection is used
- There is a filter condition that contains at least two columns that
are not in the projection
Currently, the additional columns which are used in the filter are
stored in a `HashSet`. As a result, the iterator over the HashSet may
return the additional columns in any order. This introduces
non-determinism in creating the logical schema. As the schema affects
the string representation of the query plan (@<column_index>), this is a
pain for asserting query plans in downstream projects.
Running the new test without the fix produces one of two possible query
plans:
Query Plan 1:
```text
DeltaScan
DataSourceExec: file_groups={1 group: [[]]}, projection=[v1], file_type=parquet, predicate=v2@1 = CAST(2 AS Int64) AND v3@2 = CAST(3 AS Int64), pruning_predicate=v2_null_count@2 != row_count@3 AND v2_min@0 <= 2 AND 2 <= v2_max@1 AND v3_null_count@6 != row_count@3 AND v3_min@4 <= 3 AND 3 <= v3_max@5, required_guarantees=[]
```
Query Plan 2:
```text
DeltaScan
DataSourceExec: file_groups={1 group: [[]]}, projection=[v1], file_type=parquet, predicate=v2@2 = CAST(2 AS Int64) AND v3@1 = CAST(3 AS Int64), pruning_predicate=v2_null_count@2 != row_count@3 AND v2_min@0 <= 2 AND 2 <= v2_max@1 AND v3_null_count@6 != row_count@3 AND v3_min@4 <= 3 AND 3 <= v3_max@5, required_guarantees=[]
```
The crux lies in the predicate: `predicate=v2@1 [...]` in the first plan
versus `predicate=v2@2 [...]` in the second. Sorting the additional
columns should remove the non-determinism.
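The idea can be illustrated with a minimal, standalone sketch. This is not the actual patch: `additional_columns`, its parameters, and the column names are hypothetical stand-ins, and only the standard library is used.

```rust
use std::collections::HashSet;

/// Returns the columns referenced by the filter that are not already part of
/// the projection, in a deterministic (sorted) order.
/// Both parameters are hypothetical inputs used for illustration only.
fn additional_columns(projection: &[&str], filter_columns: &HashSet<String>) -> Vec<String> {
    let mut extra: Vec<String> = filter_columns
        .iter()
        .filter(|c| !projection.iter().any(|p| *p == c.as_str()))
        .cloned()
        .collect();
    // HashSet iteration order is unspecified, so sort before the columns are
    // appended to the schema; this pins the resulting column indices.
    extra.sort();
    extra
}

fn main() {
    let filter_columns: HashSet<String> =
        ["v3", "v2"].iter().map(|s| s.to_string()).collect();
    let extra = additional_columns(&["v1"], &filter_columns);
    println!("{:?}", extra);
}
```

With the sort in place, the extra columns always follow the projected ones in the same order, so the `@<column_index>` values in the plan string no longer vary between runs.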
# Related Issue(s)
I think this is a trivial fix, so I did not create an issue.
# Documentation
- [Expr::column_refs] returns a `HashSet`
(https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#method.column_refs)
---------
Signed-off-by: Tobias Schwarzinger <tobias.schwarzinger@tuwien.ac.at>
Co-authored-by: Ethan Urbanski <ethanurbanski@gmail.com>
# Description
- Replaced the manual commit pipeline (`into_prepared_commit_future` + manual `write_commit_entry`) in restore.rs with the standard pipeline, so all 3 stages run automatically like other endpoints
- Exposed `post_commithook_properties` in the Python binding, public API, and type stub
- Added tests verifying the parameter is accepted with a post-commit hook
# Related Issue(s)
closes #4251
---------
Signed-off-by: Byeori Kim <bk.byeori.kim@gmail.com>
Co-authored-by: Ethan Urbanski <ethanurbanski@gmail.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
These slipped through CI somehow and I didn't notice that RawJson is only available in the `datafusion` feature build.
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: Adam Reeve <adam.reeve@gr-oss.io>
This passing demonstrates that this bug is actually fixed, who knows when it was fixed, but it was fixed! 😄
Closes #2882
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
See #1214
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: R Tyler Croy <rtyler@brokenco.de>
This issue was well in the backlog and I don't really understand the use-case of binary partition columns 😆 but this seemed like a pretty straight-forward fix.
Below is the 🦜 generated commentary, which is mostly useless.
- Add pick_binary_partition_values: bypasses the natural-order guard in pick_stats so Binary partition values can be read as BinaryArray.
- Add contained_binary: implements set-membership comparison for Binary partition columns, encoding predicate bytes with the same unicode-escape format used when serializing partition values into the Delta log.
- Wire contained_binary into contained() via a type check before the existing StringArray path, so = / IN predicates prune binary partitions while min_values/max_values still return None (range predicates stay conservatively un-pruned).
- Update existing regression test: assert kept_files == 1 (pruning works).
- Add range-predicate test: assert kept_files == 2 (< and > keep all files).
Closes #1214
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: R Tyler Croy <rtyler@brokenco.de>
# Description
Views from the UC API are no longer added to the information schema in datafusion. Also includes some refactoring to set up better usage in the future.
# Related Issue(s)
- closes #4423
---------
Signed-off-by: Stephen Carman <stephen.carman@databricks.com>
Co-authored-by: Stephen Carman <stephen.carman@databricks.com>
…rties
Introduces a new `deltalake-hf` crate that wires up the `hf://` URL scheme using OpenDAL's HuggingFace service.
`HfObjectStore` adapts Delta's `PutMode::Create` contract to HF Hub's Git-backed, single-writer model and strips the repo-id prefix from paths so the OpenDAL operator sees repo-relative paths.
`HfLogStoreFactory` rebuilds the `PrefixStore` from the root store using the parsed table path, bypassing the generic `decorate_store` logic that constructs an incorrect prefix.
Adds four Delta table properties under `delta.parquet.contentDefinedChunking.*` that transparently activate Parquet content-defined chunking (CDC) during writes. When enabled, row-group boundaries become deterministic functions of data content, which improves deduplication and incremental upload efficiency on content-addressable stores such as HuggingFace Hub / Xet.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
CDC is a Parquet-format concern, not a Delta-semantic property, so the
settings belong in `Metadata.format.options` (per the Delta spec) rather
than in the table-level `configuration` map.
Changes:
- Extend `MetadataExt` with `format_options()` getter and
`with_format_options()` setter. Fix the existing stop-gap `with_*`
methods to preserve `format.options` across mutations instead of
destroying them with hardcoded `"options": {}`.
- Add `with_format_options(...)` builder methods on `CreateBuilder` and
`WriteBuilder`. `WriteBuilder` resolves CDC settings from
`Metadata.format.options` on existing tables (via snapshot) or from
the write-time map on new tables.
- Move `parquet_cdc_options` helper to `table::config` with bare keys
(`contentDefinedChunking.enabled` etc.) — no longer needs a
`delta.parquet.*` prefix since `format.provider = "parquet"` is
implied.
- Drop the four `TableProperty::ParquetContentDefinedChunking*`
variants — they don't belong in the `delta.*` namespace.
- Python: `write_deltalake(format_options=...)` / `create_deltalake`
kwarg; `dt.metadata().format_options` read accessor.
- Pin `delta_kernel` to `kszucs/delta-kernel-rs#read-format-options`
branch — the kernel's `MetadataVisitor` was hardcoding
`options: HashMap::new()` on read, dropping any committed format
options. Patch fixes the visitor to actually read getter index 4.
- CI: `HF_DATASET` → `HF_BUCKET` (we primarily target HF storage
buckets, not git-backed datasets); drop the job-level `if:` guard
since the test fixture already skips when secrets are absent.
- Drop the `hf-native-tls` feature on the top-level `deltalake` crate
— only `s3` has the dual-TLS feature split in this project.
- Restore the HEAD-then-PUT shim in `HfObjectStore.put_opts` —
OpenDAL's HF backend doesn't declare `write_with_if_not_exists`,
so without the shim every Delta commit fails with `Unsupported`.
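The HEAD-then-PUT idea from the last item can be sketched in isolation. This is only an illustration using the standard library: `SimpleStore`, `MemoryStore`, `PutError`, and `put_if_absent` are hypothetical names, not the `object_store` or OpenDAL APIs, and the real `HfObjectStore.put_opts` may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum PutError {
    AlreadyExists,
}

/// Hypothetical minimal store interface: an existence check and an
/// unconditional write, i.e. a backend without native put-if-absent.
trait SimpleStore {
    fn head(&self, path: &str) -> bool;
    fn put(&mut self, path: &str, data: Vec<u8>);
}

#[derive(Default)]
struct MemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl SimpleStore for MemoryStore {
    fn head(&self, path: &str) -> bool {
        self.objects.contains_key(path)
    }
    fn put(&mut self, path: &str, data: Vec<u8>) {
        self.objects.insert(path.to_string(), data);
    }
}

/// Emulates create-only semantics with a check followed by a write.
/// The two steps are not atomic, so this is only safe with a single writer.
fn put_if_absent<S: SimpleStore>(store: &mut S, path: &str, data: Vec<u8>) -> Result<(), PutError> {
    if store.head(path) {
        return Err(PutError::AlreadyExists);
    }
    store.put(path, data);
    Ok(())
}

fn main() {
    let mut store = MemoryStore::default();
    let commit = "_delta_log/00000000000000000000.json";
    println!("{:?}", put_if_absent(&mut store, commit, b"{}".to_vec()));
    println!("{:?}", put_if_absent(&mut store, commit, b"{}".to_vec()));
}
```

Because the check and the write are separate operations, two concurrent writers could both pass the check; the sketch therefore only fits a single-writer setting.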
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Convert the HF-specific `deltalake-hf` crate into a generic `deltalake-opendal` crate that can back a Delta table with any OpenDAL service, with HuggingFace folded in as one feature-gated specialization.
- Generic `OpendalAdapter` trait drives shared `OpendalObjectStoreFactory` / `OpendalLogStoreFactory`; operators are built via `Operator::via_iter`.
- `GenericAdapter` maps bucket-root services using the `opendal.<key>` storage-option convention.
- `HfAdapter` reuses the repo prefix-strip store plus a generalized `ConditionalPutShim` (HF lacks `write_with_if_not_exists`).
- `SortedListStore` restores object_store's lexicographic listing order, which OpenDAL does not guarantee (e.g. fs readdir order) and which delta-kernel's log replay requires.
- The `deltalake` crate's `hf` feature now maps to `deltalake-opendal/hf`; adds an opt-in `opendal` feature. Registration is auto-wired via ctor.
- Tests: network-free end-to-end roundtrips over the `fs` and `memory` services; gated `#[ignore]` integration tests for `s3` (LocalStack/MinIO) and HuggingFace Hub.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
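To illustrate why listing order matters for the log, here is a small sketch of the sorting step on its own. `sorted_listing` is a hypothetical helper, not the `SortedListStore` implementation; the file names follow the Delta log's zero-padded 20-digit version naming, which makes lexicographic order coincide with version order.

```rust
/// Sorts listed paths lexicographically.
/// Hypothetical helper for illustration; the real store wraps a listing stream.
fn sorted_listing(mut paths: Vec<String>) -> Vec<String> {
    paths.sort();
    paths
}

fn main() {
    // Hypothetical readdir-style order, as a backend might return it.
    let listed = vec![
        "_delta_log/00000000000000000002.json".to_string(),
        "_delta_log/00000000000000000000.json".to_string(),
        "_delta_log/00000000000000000001.json".to_string(),
    ];
    for path in sorted_listing(listed) {
        println!("{}", path);
    }
}
```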
Make `deltalake-opendal` purely generic and move the HuggingFace specialization into a dedicated `deltalake-hf` crate that builds on it.
- `deltalake-opendal` no longer knows about HF: removed the hf module, the `opendal-hf` feature, and the HF registration. Default features are now `["rustls", "opendal-fs", "opendal-memory"]` (network-free locals).
- New `deltalake-hf` crate depends on `deltalake-opendal` and supplies the `HfAdapter` (URL parsing + repo prefix-strip) registered for `hf://`. The conditional-put shim is applied generically by the factory based on the operator's `write_with_if_not_exists` capability, so HF no longer special-cases it.
- `deltalake` umbrella: `hf` feature -> `deltalake-hf`; per-service `opendal-*` features -> `deltalake-opendal`, each with its own ctor.
- Python: enable the OpenDAL service features and register both the HF and generic OpenDAL handlers in the extension init (HF was previously not registered at all). Add a network-free `opendalfs://` roundtrip test.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
…s PR
Separating the CDC format_options feature into its own PR as requested in review. Removes PARQUET_CDC_* constants, parquet_cdc_options(), MetadataExt format_options/with_format_options methods, WriteBuilder/CreateBuilder format_options field and Python bindings.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
The deltalake-opendal crate is generic by design — any OpenDAL service works
without per-backend Rust code. HuggingFace Hub is now just another entry in
GENERIC_SERVICES, enabled by the new `hf` feature flag (opendal/services-hf).
Users configure the HF service via `opendal.*` storage options
(opendal.repo_type, opendal.repo_id, opendal.revision, opendal.token) and use
`hf:///table_path` URIs (no host, to avoid the generic adapter injecting a
spurious `bucket` key that the HF service doesn't accept).
- Delete crates/hf entirely
- Add `hf = ["opendal/services-hf"]` feature to deltalake-opendal
- Wire `("hf", "hf")` into GENERIC_SERVICES in lib.rs
- Update deltalake/Cargo.toml: hf feature now enables opendal + opendal/hf
- Remove deltalake_hf re-export and ctor auto-register from deltalake/src/lib.rs
- Remove manual deltalake::hf::register_handlers call from python/src/lib.rs
- Update Python and Rust integration tests to use opendal.* storage options
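The `opendal.*` storage-option convention described above can be sketched as a prefix filter. This is an assumption about how such a mapping might look, not the crate's code: `operator_options` and the sample values are hypothetical, and only the option names (`opendal.repo_type`, `opendal.repo_id`) come from the commit message.

```rust
use std::collections::HashMap;

const PREFIX: &str = "opendal.";

/// Collects the `opendal.<key>` entries from the storage options and strips
/// the prefix, yielding the key/value pairs an operator would be built from.
/// Entries without the prefix are ignored. Hypothetical helper.
fn operator_options(storage_options: &HashMap<String, String>) -> HashMap<String, String> {
    storage_options
        .iter()
        .filter_map(|(k, v)| {
            k.strip_prefix(PREFIX)
                .map(|key| (key.to_string(), v.clone()))
        })
        .collect()
}

fn main() {
    let mut storage_options = HashMap::new();
    // Placeholder values for illustration only.
    storage_options.insert("opendal.repo_type".to_string(), "dataset".to_string());
    storage_options.insert("opendal.repo_id".to_string(), "user/repo".to_string());
    storage_options.insert("unrelated_option".to_string(), "ignored".to_string());

    let opts = operator_options(&storage_options);
    println!("{} operator options", opts.len());
}
```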
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
HuggingFace is just another generic OpenDAL service, so its feature flag should follow the same `opendal-<service>` convention as every other backend (opendal-s3, opendal-fs, …) rather than being a special-cased `hf` name.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
HF is now just another generic OpenDAL service, so the crate's doc comments should not single it out. Removes the HF examples from the OperatorSpec/wrap_store/factory/shim docs (the wrap_store comment was also stale: no adapter overrides it anymore) and trims the crate description.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
…scheme
Replace the ad-hoc (delta scheme, opendal service) tuple mapping with a uniform scheme: every enabled service registers under an unambiguous opendal+<service>:// scheme, and additionally under its bare <service>:// scheme when that does not collide with a native delta backend (tracked in NATIVE_SCHEMES).
This drops the old opendalfs/opendalmem/opendals3 names in favour of opendal+fs/opendal+memory/opendal+s3, while non-colliding services like hf://, fs://, gcs:// keep their natural scheme.
Adds a guard test pinning the known native collisions (s3, memory) so they can never be dropped from NATIVE_SCHEMES and silently shadow a native backend.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
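The registration rule can be sketched as a small function. This is a guess at the shape of the logic, not the crate's code: `schemes_for` is hypothetical, and the `NATIVE_SCHEMES` contents here are limited to the two collisions the message names (the real list may be longer).

```rust
/// Assumed contents: only the collisions named in the commit message.
const NATIVE_SCHEMES: &[&str] = &["s3", "memory"];

/// Returns the URL schemes a service would register under: always
/// `opendal+<service>`, plus the bare `<service>` when it does not collide
/// with a native backend. Hypothetical helper.
fn schemes_for(service: &str) -> Vec<String> {
    let mut schemes = vec![format!("opendal+{}", service)];
    if !NATIVE_SCHEMES.iter().any(|native| *native == service) {
        schemes.push(service.to_string());
    }
    schemes
}

fn main() {
    for service in ["s3", "memory", "fs", "hf"] {
        println!("{} -> {:?}", service, schemes_for(service));
    }
}
```

A guard test in the same spirit would assert that `s3` and `memory` never gain a bare scheme, so a native backend cannot be shadowed by accident.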
Document the generic OpenDAL backend: the opendal+<service>:// and bare <service>:// scheme rules, the opendal.<key> storage-option convention, and worked examples for an S3-compatible store, the local filesystem, and the HuggingFace Hub.
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
… support Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>