
fix: equalize with delta/delta-rs (v0.52) #55

Open
mandrush wants to merge 503 commits into relativityone:main from delta-io:main

Conversation

@mandrush

Collaborator

Description

The description of the main changes of your pull request

Related Issue(s)

Documentation

@github-actions github-actions Bot added the binding/python, binding/rust, delta-inspect, and documentation labels Feb 17, 2026
adampolomski previously approved these changes Feb 17, 2026
khalidmammadov and others added 3 commits March 31, 2026 09:12
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
I cleaned up some junk labels last week and didn't realize that I broke
this action a bit 🙈

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
@github-actions github-actions Bot added the ci label Mar 31, 2026
ethan-tyler and others added 11 commits March 31, 2026 12:14
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Liam Brannigan <liambrannigan@Liams-MacBook-Pro.local>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: adam.polomski <adam.polomski@relativity.com>
# Description

This PR contains a small fix that addresses non-determinism in the
DeltaTableProvider.

The issue occurs under the following circumstances:
- A projection is used
- There is a filter condition that contains at least two columns that
are not in the projection

Currently, the additional columns used in the filter are stored in a
`HashSet`, so iterating over it may return them in any order. This
introduces non-determinism when the logical schema is created. Because
the schema affects the string representation of the query plan
(`@<column_index>`), query plans are hard to assert on in downstream
projects.

Running the new test without the fix produces one of two possible query
plans:

Query Plan 1:
```text
DeltaScan
  DataSourceExec: file_groups={1 group: [[]]}, projection=[v1], file_type=parquet, predicate=v2@1 = CAST(2 AS Int64) AND v3@2 = CAST(3 AS Int64), pruning_predicate=v2_null_count@2 != row_count@3 AND v2_min@0 <= 2 AND 2 <= v2_max@1 AND v3_null_count@6 != row_count@3 AND v3_min@4 <= 3 AND 3 <= v3_max@5, required_guarantees=[]
```

Query Plan 2:
```text
DeltaScan
  DataSourceExec: file_groups={1 group: [[]]}, projection=[v1], file_type=parquet, predicate=v2@2 = CAST(2 AS Int64) AND v3@1 = CAST(3 AS Int64), pruning_predicate=v2_null_count@2 != row_count@3 AND v2_min@0 <= 2 AND 2 <= v2_max@1 AND v3_null_count@6 != row_count@3 AND v3_min@4 <= 3 AND 3 <= v3_max@5, required_guarantees=[]
```

The crux lies in the predicate: `predicate=v2@2 [...]` versus
`predicate=v2@1 [...]`. Sorting the additional columns fixes their order,
and therefore their column indices, which should remove the
non-determinism.
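
As a rough illustration of the idea (not the code from this PR), the sketch below collects the filter columns that are missing from the projection and sorts them before use. The helper name `extra_filter_columns` is hypothetical, and plain strings stand in for the column references that `Expr::column_refs` returns.

```rust
use std::collections::HashSet;

/// Returns the columns referenced by a filter that are missing from the
/// projection, in a stable (sorted) order.
///
/// `filter_columns` stands in for the `HashSet` of column references;
/// strings keep the sketch free of DataFusion dependencies.
fn extra_filter_columns(projection: &[&str], filter_columns: &HashSet<String>) -> Vec<String> {
    let mut extra: Vec<String> = filter_columns
        .iter()
        .filter(|name| !projection.contains(&name.as_str()))
        .cloned()
        .collect();
    // HashSet iteration order is unspecified, so sort before the columns
    // are appended to the schema and receive their `@<column_index>` slots.
    extra.sort();
    extra
}
```

With a projection of `v1` and a filter over `v2` and `v3`, this always yields `v2` before `v3`, regardless of how the set happens to iterate.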

# Related Issue(s)

I think this is a trivial fix, so I did not create an issue.

# Documentation

- [Expr::column_refs](https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#method.column_refs)
returns a `HashSet`

---------

Signed-off-by: Tobias Schwarzinger <tobias.schwarzinger@tuwien.ac.at>
Co-authored-by: Ethan Urbanski <ethanurbanski@gmail.com>
# Description
- replaced the manual commit pipeline (`into_prepared_commit_future` +
manual `write_commit_entry`) in restore.rs with the standard pipeline, so
all 3 stages run automatically like other endpoints

- exposed `post_commithook_properties` in the Python binding, public
API, and type stub

- added tests verifying that the parameter is accepted with a post-commit hook

# Related Issue(s)
closes #4251

---------

Signed-off-by: Byeori Kim <bk.byeori.kim@gmail.com>
Co-authored-by: Ethan Urbanski <ethanurbanski@gmail.com>
ethan-tyler and others added 30 commits June 14, 2026 13:41
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Khalid Mammadov <khalidmammadov9@gmail.com>
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
These slipped through CI somehow and I didn't notice that RawJson is
only available in the `datafusion` feature build.

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: Adam Reeve <adam.reeve@gr-oss.io>
This passing demonstrates that this bug is actually fixed, who knows
when it was fixed, but it was fixed! 😄

Closes #2882

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
See #1214

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: R Tyler Croy <rtyler@brokenco.de>
This issue was well in the backlog and I don't really understand the
use-case of binary partition columns 😆 but this seemed like a
pretty straight-forward fix

Below is the 🦜 generated commentary, which is mostly useless.

- Add pick_binary_partition_values: bypasses the natural-order guard in
  pick_stats so Binary partition values can be read as BinaryArray.
- Add contained_binary: implements set-membership comparison for Binary
  partition columns, encoding predicate bytes with the same unicode-escape
  format used when serializing partition values into the Delta log.
- Wire contained_binary into contained() via a type check before the
  existing StringArray path, so = / IN predicates prune binary partitions
  while min_values/max_values still return None (range predicates stay
  conservatively un-pruned).
- Update existing regression test: assert kept_files == 1 (pruning works).
- Add range-predicate test: assert kept_files == 2 (< and > keep all files).
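
The pruning behavior described in the list above can be sketched roughly as follows. This is a simplified model, not the actual implementation: `FileEntry`, `prune_by_membership`, and `prune_by_range` are hypothetical names, and the sketch compares raw bytes directly instead of applying the unicode-escape encoding the commit mentions.

```rust
/// Hypothetical per-file summary: the raw bytes of a Binary partition value.
struct FileEntry {
    path: &'static str,
    partition_value: Vec<u8>,
}

/// Equality / IN pruning: keep only files whose partition value matches
/// one of the predicate literals (set membership).
fn prune_by_membership<'a>(files: &'a [FileEntry], literals: &[&[u8]]) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| literals.iter().any(|lit| *lit == f.partition_value.as_slice()))
        .map(|f| f.path)
        .collect()
}

/// Range pruning: no min/max values are exposed for Binary partition
/// columns in this model, so every file is conservatively kept.
fn prune_by_range<'a>(files: &'a [FileEntry]) -> Vec<&'a str> {
    files.iter().map(|f| f.path).collect()
}
```

With two files in different binary partitions, an equality predicate on one value keeps a single file, while a `<` or `>` predicate keeps both, which mirrors the `kept_files` assertions listed above.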

Closes #1214

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: R Tyler Croy <rtyler@brokenco.de>
This demonstrates that #4501 is no longer an issue

Closes #4501

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
# Description
Views from the UC API are no longer added to the information schema in
DataFusion. Also includes some refactoring to set up better usage in the
future.

# Related Issue(s)
- closes #4423

---------

Signed-off-by: Stephen Carman <stephen.carman@databricks.com>
Co-authored-by: Stephen Carman <stephen.carman@databricks.com>
…rties

Introduces a new `deltalake-hf` crate that wires up the `hf://` URL scheme
using OpenDAL's HuggingFace service. `HfObjectStore` adapts Delta's
`PutMode::Create` contract to HF Hub's Git-backed, single-writer model and
strips the repo-id prefix from paths so the OpenDAL operator sees repo-relative
paths. `HfLogStoreFactory` rebuilds the `PrefixStore` from the root store using
the parsed table path, bypassing the generic `decorate_store` logic that
constructs an incorrect prefix.

Adds four Delta table properties under `delta.parquet.contentDefinedChunking.*`
that transparently activate Parquet content-defined chunking (CDC) during writes.
When enabled, row-group boundaries become deterministic functions of data content,
which improves deduplication and incremental upload efficiency on
content-addressable stores such as HuggingFace Hub / Xet.

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
CDC is a Parquet-format concern, not a Delta-semantic property, so the
settings belong in `Metadata.format.options` (per the Delta spec) rather
than in the table-level `configuration` map.

Changes:
- Extend `MetadataExt` with `format_options()` getter and
  `with_format_options()` setter. Fix the existing stop-gap `with_*`
  methods to preserve `format.options` across mutations instead of
  destroying them with hardcoded `"options": {}`.
- Add `with_format_options(...)` builder methods on `CreateBuilder` and
  `WriteBuilder`. `WriteBuilder` resolves CDC settings from
  `Metadata.format.options` on existing tables (via snapshot) or from
  the write-time map on new tables.
- Move `parquet_cdc_options` helper to `table::config` with bare keys
  (`contentDefinedChunking.enabled` etc.) — no longer needs a
  `delta.parquet.*` prefix since `format.provider = "parquet"` is
  implied.
- Drop the four `TableProperty::ParquetContentDefinedChunking*`
  variants — they don't belong in the `delta.*` namespace.
- Python: `write_deltalake(format_options=...)` / `create_deltalake`
  kwarg; `dt.metadata().format_options` read accessor.
- Pin `delta_kernel` to `kszucs/delta-kernel-rs#read-format-options`
  branch — the kernel's `MetadataVisitor` was hardcoding
  `options: HashMap::new()` on read, dropping any committed format
  options. Patch fixes the visitor to actually read getter index 4.
- CI: `HF_DATASET` → `HF_BUCKET` (we primarily target HF storage
  buckets, not git-backed datasets); drop the job-level `if:` guard
  since the test fixture already skips when secrets are absent.
- Drop the `hf-native-tls` feature on the top-level `deltalake` crate
  — only `s3` has the dual-TLS feature split in this project.
- Restore the HEAD-then-PUT shim in `HfObjectStore.put_opts` —
  OpenDAL's HF backend doesn't declare `write_with_if_not_exists`,
  so without the shim every Delta commit fails with `Unsupported`.
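
For readers unfamiliar with the HEAD-then-PUT pattern named in the last item, here is a minimal sketch against a toy in-memory store. `ToyStore`, `PutError`, and `put_if_absent` are hypothetical names for illustration; the real shim lives in `HfObjectStore.put_opts` and works against OpenDAL.

```rust
use std::collections::HashMap;

/// Toy in-memory store standing in for a backend without native
/// "create only if absent" support (hypothetical, for illustration).
#[derive(Default)]
struct ToyStore {
    objects: HashMap<String, Vec<u8>>,
}

#[derive(Debug, PartialEq)]
enum PutError {
    AlreadyExists,
}

impl ToyStore {
    /// HEAD: does the object exist?
    fn head(&self, path: &str) -> bool {
        self.objects.contains_key(path)
    }

    /// Unconditional PUT: silently overwrites any existing object.
    fn put(&mut self, path: &str, data: &[u8]) {
        self.objects.insert(path.to_string(), data.to_vec());
    }

    /// Emulated create-only put: HEAD first, then PUT.
    /// Against a remote store this check-then-write sequence is not
    /// atomic; two writers could both pass the HEAD check.
    fn put_if_absent(&mut self, path: &str, data: &[u8]) -> Result<(), PutError> {
        if self.head(path) {
            return Err(PutError::AlreadyExists);
        }
        self.put(path, data);
        Ok(())
    }
}
```

The second attempt to create the same commit file fails instead of overwriting it, which is the behavior a Delta commit relies on. The non-atomic window is presumably tolerable only because of the single-writer model described earlier.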

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Convert the HF-specific `deltalake-hf` crate into a generic
`deltalake-opendal` crate that can back a Delta table with any OpenDAL
service, with HuggingFace folded in as one feature-gated specialization.

- Generic `OpendalAdapter` trait drives shared `OpendalObjectStoreFactory`
  / `OpendalLogStoreFactory`; operators are built via `Operator::via_iter`.
- `GenericAdapter` maps bucket-root services using the `opendal.<key>`
  storage-option convention.
- `HfAdapter` reuses the repo prefix-strip store plus a generalized
  `ConditionalPutShim` (HF lacks `write_with_if_not_exists`).
- `SortedListStore` restores object_store's lexicographic listing order,
  which OpenDAL does not guarantee (e.g. fs readdir order) and which
  delta-kernel's log replay requires.
- The `deltalake` crate's `hf` feature now maps to `deltalake-opendal/hf`;
  adds an opt-in `opendal` feature. Registration is auto-wired via ctor.
- Tests: network-free end-to-end roundtrips over the `fs` and `memory`
  services; gated `#[ignore]` integration tests for `s3` (LocalStack/MinIO)
  and HuggingFace Hub.
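
To illustrate why the listing order matters for `SortedListStore`, the sketch below relies on the fact that Delta commit files use a zero-padded, 20-digit version in their name, so lexicographic order equals version order. Both helper names are hypothetical and the real wrapper operates on object store listings rather than plain strings.

```rust
/// Delta commit files are named with a zero-padded, 20-digit version,
/// so lexicographic order on the name matches numeric version order.
fn commit_file_name(version: u64) -> String {
    format!("_delta_log/{version:020}.json")
}

/// Restores lexicographic order on a listing whose backend returned
/// entries in arbitrary order (for example, raw directory order).
fn sorted_listing(mut paths: Vec<String>) -> Vec<String> {
    paths.sort();
    paths
}
```

A listing that arrives as versions 10, 2, 1 comes back as 1, 2, 10 after sorting, which is the order log replay expects.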

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Make `deltalake-opendal` purely generic and move the HuggingFace
specialization into a dedicated `deltalake-hf` crate that builds on it.

- `deltalake-opendal` no longer knows about HF: removed the hf module,
  the `opendal-hf` feature, and the HF registration. Default features are
  now `["rustls", "opendal-fs", "opendal-memory"]` (network-free locals).
- New `deltalake-hf` crate depends on `deltalake-opendal` and supplies the
  `HfAdapter` (URL parsing + repo prefix-strip) registered for `hf://`. The
  conditional-put shim is applied generically by the factory based on the
  operator's `write_with_if_not_exists` capability, so HF no longer special-
  cases it.
- `deltalake` umbrella: `hf` feature -> `deltalake-hf`; per-service
  `opendal-*` features -> `deltalake-opendal`, each with its own ctor.
- Python: enable the OpenDAL service features and register both the HF and
  generic OpenDAL handlers in the extension init (HF was previously not
  registered at all). Add a network-free `opendalfs://` roundtrip test.

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
…s PR

Separating the CDC format_options feature into its own PR as requested in
review. Removes PARQUET_CDC_* constants, parquet_cdc_options(), MetadataExt
format_options/with_format_options methods, WriteBuilder/CreateBuilder
format_options field and Python bindings.

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
The deltalake-opendal crate is generic by design — any OpenDAL service works
without per-backend Rust code. HuggingFace Hub is now just another entry in
GENERIC_SERVICES, enabled by the new `hf` feature flag (opendal/services-hf).

Users configure the HF service via `opendal.*` storage options
(opendal.repo_type, opendal.repo_id, opendal.revision, opendal.token) and use
`hf:///table_path` URIs (no host, to avoid the generic adapter injecting a
spurious `bucket` key that the HF service doesn't accept).
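
A rough sketch of how the `opendal.*` convention and the host-to-`bucket` behavior could fit together is shown below. The function `operator_config` is hypothetical, and the exact rules of the real generic adapter (for example, precedence between an explicit `opendal.bucket` and the URL host) are assumptions here.

```rust
use std::collections::HashMap;

/// Builds the key/value configuration handed to the backend operator.
///
/// Assumptions for illustration: only `opendal.`-prefixed storage options
/// are forwarded (with the prefix stripped), and a URL host, when present,
/// is injected as `bucket`.
fn operator_config(
    url_host: Option<&str>,
    storage_options: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut cfg: HashMap<String, String> = storage_options
        .iter()
        .filter_map(|(k, v)| {
            k.strip_prefix("opendal.")
                .map(|key| (key.to_string(), v.clone()))
        })
        .collect();
    if let Some(host) = url_host {
        // A host-less URI such as `hf:///table_path` skips this branch,
        // so no `bucket` key reaches a service that does not accept one.
        cfg.insert("bucket".to_string(), host.to_string());
    }
    cfg
}
```

Under these assumptions, `hf:///table_path` with `opendal.repo_id` set produces a config containing `repo_id` and no `bucket`.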

- Delete crates/hf entirely
- Add `hf = ["opendal/services-hf"]` feature to deltalake-opendal
- Wire `("hf", "hf")` into GENERIC_SERVICES in lib.rs
- Update deltalake/Cargo.toml: hf feature now enables opendal + opendal/hf
- Remove deltalake_hf re-export and ctor auto-register from deltalake/src/lib.rs
- Remove manual deltalake::hf::register_handlers call from python/src/lib.rs
- Update Python and Rust integration tests to use opendal.* storage options

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
HuggingFace is just another generic OpenDAL service, so its feature flag
should follow the same `opendal-<service>` convention as every other backend
(opendal-s3, opendal-fs, …) rather than being a special-cased `hf` name.

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
HF is now just another generic OpenDAL service, so the crate's doc comments
should not single it out. Removes the HF examples from the OperatorSpec/
wrap_store/factory/shim docs (the wrap_store comment was also stale — no
adapter overrides it anymore) and trims the crate description.

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
…scheme

Replace the ad-hoc (delta scheme, opendal service) tuple mapping with a
uniform scheme: every enabled service registers under an unambiguous
opendal+<service>:// scheme, and additionally under its bare <service>://
scheme when that does not collide with a native delta backend (tracked in
NATIVE_SCHEMES). This drops the old opendalfs/opendalmem/opendals3 names in
favour of opendal+fs/opendal+memory/opendal+s3, while non-colliding services
like hf://, fs://, gcs:// keep their natural scheme.

Adds a guard test pinning the known native collisions (s3, memory) so they
can never be dropped from NATIVE_SCHEMES and silently shadow a native backend.
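
The registration rule described above can be sketched as a small function. The contents of `NATIVE_SCHEMES` here are an assumption limited to the two collisions named in the text (`s3`, `memory`), and `schemes_for` is a hypothetical helper name.

```rust
/// Assumed for illustration: only the native collisions named above.
/// The real list may contain more entries.
const NATIVE_SCHEMES: &[&str] = &["s3", "memory"];

/// Returns the URL schemes a service registers under: always
/// `opendal+<service>`, plus the bare `<service>` scheme when it does
/// not collide with a native backend.
fn schemes_for(service: &str) -> Vec<String> {
    let mut schemes = vec![format!("opendal+{service}")];
    if !NATIVE_SCHEMES.contains(&service) {
        schemes.push(service.to_string());
    }
    schemes
}
```

Under this model `s3` is reachable only as `opendal+s3://`, while `hf` is reachable as both `opendal+hf://` and `hf://`.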

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Document the generic OpenDAL backend: the opendal+<service>:// and bare
<service>:// scheme rules, the opendal.<key> storage-option convention, and
worked examples for an S3-compatible store, the local filesystem, and the
HuggingFace Hub.

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
… support

Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztian Szucs <szucs.krisztian@gmail.com>