feat: upsert by mandrush · Pull Request #53 · relativityone/delta-rs

mandrush · 2026-02-16T12:13:10Z

Description

The description of the main changes of your pull request

Related Issue(s)

Documentation

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

…ing execution

Remove cache calls on large DataFrames while keeping small result caching:
- Keep conflicts_df cache (small: join keys + file paths only)
- Remove implicit materializations from target_df, filtered_target_df, non_conflicting_target, and result_df
- All large DataFrames now use lazy streaming execution
- Add schema normalization (cast Dictionary to Utf8) for file path column to fix compatibility
- Add helper method find_conflicts_keys_only for clean anti-join logic

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

The target_df parameter was not used in the function body, which only selects keys from self.source. Removed the parameter and updated the call site.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

This reverts commit 9a5a201.


… detection

Instead of caching the conflicts DataFrame to work around DataFusion's Dictionary encoding schema mismatch, implement manual join logic:
- Collect target DataFrame with join keys + file paths (small result)
- Collect distinct source join keys (small result)
- Perform join in memory using HashSet for efficient lookup
- Extract file paths that have matching keys

This avoids materializing large DataFrames while still handling the schema inconsistency by working entirely in memory on small, already-collected data.

Memory impact: Only materializes join keys + file paths (one row per conflicting file), not full row data. Much more efficient than caching full DataFrames.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
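The in-memory lookup described in this commit message can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the function name `conflicting_file_paths`, the use of `String` join keys, and the `(key, file path)` tuple layout are all assumptions made for the example, and the DataFusion collection step is left out.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical helper (not taken from the PR).
/// `target_rows` holds already-collected (join key, file path) pairs from the target;
/// `source_keys` holds the already-collected distinct join keys from the source.
/// Returns the distinct file paths that contain at least one matching key.
pub fn conflicting_file_paths(
    target_rows: &[(String, String)],
    source_keys: &[String],
) -> BTreeSet<String> {
    // Build the lookup set once so each target row is checked in O(1) on average.
    let lookup: HashSet<&str> = source_keys.iter().map(String::as_str).collect();

    target_rows
        .iter()
        // Keep only target rows whose join key also appears in the source.
        .filter(|(key, _)| lookup.contains(key.as_str()))
        // Project to the file path; the set deduplicates repeated paths.
        .map(|(_, path)| path.clone())
        .collect()
}
```

Calling it with target rows `("a", "f1.parquet")`, `("b", "f1.parquet")`, `("c", "f2.parquet")` and source keys `"b"`, `"z"` yields a set containing only `f1.parquet`. A `BTreeSet` is used for the result purely to get a deterministic ordering in the example.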

The previous approach incorrectly materialized the entire target DataFrame, which could be billions of rows. The corrected approach:
1. Keeps target_df and source lazy (not materialized)
2. Performs inner join in DataFusion (lazy operation)
3. Selects only minimal columns (join keys + file path, not full rows)
4. Collects ONLY the join result, which is small (only conflicting rows)

Memory footprint: For a table with billions of rows but only thousands of conflicts, we materialize only thousands of rows with minimal columns, not billions of full rows. The join result is inherently small because it contains only rows where join keys match between source and target (actual conflicts).

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

…rics

- Changed extract_conflicting_filenames to extract_conflicts_dataframe to return a DataFrame
- Added extract_file_paths_from_conflicts to extract file paths from the cached DataFrame
- Cache the conflicts DataFrame for reuse in multiple places
- Added num_conflicting_records field to UpsertMetrics
- Count and report conflicting records in metrics
- Updated tests to verify num_conflicting_records metric

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>
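To illustrate why the conflicts result is reused, the sketch below derives both the file paths and the conflict count from a single collected set of conflict rows. It is an assumption-laden stand-in: the `UpsertMetrics` struct here models only the `num_conflicting_records` field named above, `summarize_conflicts` is a hypothetical helper, and plain tuples take the place of a DataFusion DataFrame.

```rust
use std::collections::BTreeSet;

/// Stripped-down, hypothetical stand-in for the PR's UpsertMetrics.
/// Only the field named in the commit message is modeled here.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct UpsertMetrics {
    pub num_conflicting_records: usize,
}

/// Hypothetical helper (not taken from the PR).
/// `conflicts` holds one (join key, file path) pair per conflicting record.
/// Both outputs come from the same collected data, so nothing is recomputed.
pub fn summarize_conflicts(
    conflicts: &[(String, String)],
) -> (BTreeSet<String>, UpsertMetrics) {
    // Distinct file paths that contain at least one conflicting record.
    let file_paths: BTreeSet<String> =
        conflicts.iter().map(|(_, path)| path.clone()).collect();

    // Every row in the conflicts set counts as one conflicting record.
    let metrics = UpsertMetrics {
        num_conflicting_records: conflicts.len(),
    };

    (file_paths, metrics)
}
```

With three conflict rows spread over two files, this returns two file paths and `num_conflicting_records == 3`.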

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

adampolomski and others added 30 commits February 16, 2026 12:59

Upsert initial implementation

b09c62a

Upsert initial implementation

7cfea82

Trying workspace filter

986aa16

Trying workspace filter

329cc9c

Trying workspace filter

588069d

Trying workspace filter

8f46167

Trying workspace filter

6f59ef1

Removing old files from partition ... maybe?

3328c5a

Conflict check.

fc20767

Initial plan

04c3626

Refactor upsert.rs for improved readability and add metrics

377c5a3

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

Remove hardcoded workspace_id assumptions and make upsert generic

dfdc77f

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

Fixed tests.

a78d8dc

Cleaned up metrics and tests.

1de4970

Removed unnecessary Add struct creation for Remove actions.

7450df7

feat: removed metrics cloning

2d6fe19

feat: removed more cloning

cdfd84c

feat: reworked session handling

47bc300

feat: handling more partition column types, error on unhandled

c96c974

feat: revert formatting changes in unrelated file

cc65987

feat: removed print

8e18ab4

feat: execution time metric

5d349c4

feat: one more test case

958aad9

feat: not rewriting the whole partition for conflict resolution

710ad0f

feat: not rewriting the whole partition for conflict resolution

57f245d

feat: file column name set explicitly

0bf6ab6

feat: test case extended

cde2e54

feat: Append mode

14d98c6

feat: More comprehensive tests

5aa3d0b

feat: More comprehensive tests

9415bf4

adampolomski and others added 25 commits February 16, 2026 12:59

feat: additional test case for duplicate conflicts

08b8f4a

feat: handle schema reorderings

0d7a3c4

feat: handle schema reorderings

802979c

feat: handle cross-partition upserts

2c596b4

feat: multi-partition test case reworked

b232f56

Initial plan

b502aa2

feat: removed useless caching

d184051

Revert "feat: removed useless caching"

b1d8f1e

This reverts commit 9a5a201.

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.git…

ca26695

…hub.com>

feat: removed unnecessary columns

1f97484

feat: review comments

37b1058

feat: fetch only distinct files

6f5fd7e

feat: fetch only distinct files

d49058c

feat: optimised join

3f5c407

Initial plan

c076d37

Remove unnecessary clone when counting conflicts DataFrame

216d16a

Co-authored-by: adampolomski <10196659+adampolomski@users.noreply.github.com>

feat: moved stuff around

0bfd6ce

feat: optimised join

51edbc1

feat: optimised join

2b4ea1e

chore: rebase to main (v0.51)

2976758

github-actions Bot added the binding/rust label Feb 16, 2026


feat: upsert #53
mandrush wants to merge 55 commits into main from upsert

mandrush commented Feb 16, 2026


3 participants
