Skip to content

Conversation

@KiteSoar
Copy link
Contributor

@KiteSoar KiteSoar commented Dec 21, 2025

Describe the issue this Pull Request addresses

This PR refactors the codebase to use HoodieStorageFactory instead of direct HoodieHadoopStorage instantiation, improving code maintainability and extensibility.
closes #17549

Summary and Changelog

Summary:
Users and developers will benefit from a more maintainable and extensible storage layer architecture. The factory pattern allows for easier switching between different storage implementations in the future.

Changelog:

  • Replaced ~20 direct new HoodieHadoopStorage() instantiations with HoodieStorageFactory.getStorage() calls

Impact

Public API Changes: None - This is an internal refactoring only.

User-Facing Changes: None - No behavioral changes for end users.

Performance Impact: Negligible - The factory pattern adds minimal overhead (one additional method call).

Risk Level

Low

Mitigation:

  • The refactoring preserves existing functionality - only the instantiation pattern changed
  • Internal implementation classes (HoodieHadoopIOFactory, HoodieHadoopStorage itself) continue to use direct instantiation as needed
  • No changes to storage behavior, configurations, or file formats

Documentation Update

none - This is an internal code refactoring with no user-facing changes, configuration updates, or new features requiring documentation.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 21, 2025
@KiteSoar KiteSoar marked this pull request as draft December 21, 2025 13:25
@KiteSoar KiteSoar force-pushed the refactor/hoodie-storage-factory-17549 branch from 4c50af2 to 5c00a52 Compare December 21, 2025 15:15
@KiteSoar KiteSoar marked this pull request as ready for review December 21, 2025 15:16
* @return {@link HoodieStorage} instance
* @throws HoodieException if unable to create the storage instance
*/
public static HoodieStorage getStorage(StorageConfiguration<?> conf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove HoodieStorageUtils and directly use HoodieStorageFactory for construction? Move all instantiation logic into HoodieStorageFactory.

@yihua
Copy link
Contributor

yihua commented Dec 22, 2025

@CTTY could you review this PR?

@CTTY
Copy link
Contributor

CTTY commented Dec 22, 2025

Hi @KiteSoar @yihua , I'm having second thoughts on this change now.

Initially I was thinking of something like:

factory.withPath().withConf().build()

And then we can slowly migrate the entry point of storage from just HoodieStorage to HoodieStorageFactory. But I think this may be too intrusive no matter how we do it.

It makes more sense to me right now that we don't add a new HoodieStorageFactory and keep the existing HoodieStorageUtils. In this PR, we only need to switch the usages of HoodieHadoopStorage constructor to HoodieStorageUtils whenever possible.

I still want to stop people from using the HoodieHadoopStorage constructor directly but I don't have a good solution right now. The only idea in my mind is to set the constructor to be default visibility and use reflection when you can only use HoodieHadoopStorage, but I think this is anti-pattern and deserves a separate thread to discuss.

Please let me know what you think!

@yihua
Copy link
Contributor

yihua commented Dec 22, 2025

It makes more sense to me right now that we don't add a new HoodieStorageFactory and keep the existing HoodieStorageUtils. In this PR, we only need to switch the usages of HoodieHadoopStorage constructor to HoodieStorageUtils whenever possible.

Using existing HoodieStorageUtils to replace direct usages of HoodieHadoopStorage constructor makes sense to me. Right now HoodieStorageUtils serves as the factory for the HoodieStorage; the proposed HoodieStorageFactory can be a refactoring later, though not necessary at this point.

@yihua
Copy link
Contributor

yihua commented Dec 22, 2025

The only idea in my mind is to set the constructor to be default visibility and use reflection when you can only use HoodieHadoopStorage, but I think this is anti-pattern and deserves a separate thread to discuss.

We should revisit the places to instantiate the HoodieStorage instance. Ideally, there should only be a handful of places doing that, and all other places should pass the HoodieStorage instance around.

@KiteSoar KiteSoar closed this Dec 23, 2025
@KiteSoar KiteSoar reopened this Dec 23, 2025
@KiteSoar
Copy link
Contributor Author

It makes more sense to me right now that we don't add a new HoodieStorageFactory and keep the existing HoodieStorageUtils. In this PR, we only need to switch the usages of HoodieHadoopStorage constructor to HoodieStorageUtils whenever possible.

Using existing HoodieStorageUtils to replace direct usages of HoodieHadoopStorage constructor makes sense to me. Right now HoodieStorageUtils serves as the factory for the HoodieStorage; the proposed HoodieStorageFactory can be a refactoring later, though not necessary at this point.

Thanks for the review @CTTY @yihua.
That makes sense. I agree with the suggestion to drop HoodieStorageFactory and stick with the existing HoodieStorageUtils.
So, just to confirm the consensus: the goal of this PR is to switch the usages of the HoodieHadoopStorage constructor to HoodieStorageUtils wherever possible. I'll proceed with this change.

@KiteSoar KiteSoar force-pushed the refactor/hoodie-storage-factory-17549 branch from 5c00a52 to 6fe4b19 Compare December 25, 2025 01:16
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Dec 25, 2025
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KiteSoar is this ready for review? If so, @CTTY could you review again?

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @KiteSoar , I just did another round of review. The approach looks good in general while I have some concerns about introducing a new util method that accepts Object[] params to instantiate storage classes. We should assume that a storage implementation only has constructor that takes {StoragePath.class, StorageConfiguration.class}

@@ -91,7 +93,10 @@ void testFetchNextBatchFromSource(Boolean useRowWriter, Boolean hasTransformer,
Boolean isNullTargetSchema, Boolean hasErrorTable, Boolean shouldTryWriteToErrorTable) {
//basic deltastreamer inputs
HoodieSparkEngineContext hoodieSparkEngineContext = mock(HoodieSparkEngineContext.class);
HoodieStorage storage = new HoodieHadoopStorage(mock(FileSystem.class));
HoodieStorage storage = HoodieStorageUtils.getStorage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we can also use HoodieStorageUtils to get storage instance in tests like this(not with the FileSystem). But I'm not sure how feasible it is. If it's quite tricky then we can just use HoodieHadoopStorage constructor for tests

@KiteSoar
Copy link
Contributor Author

KiteSoar commented Dec 27, 2025

Hi @KiteSoar , I just did another round of review. The approach looks good in general while I have some concerns about introducing a new util method that accepts Object[] params to instantiate storage classes. We should assume that a storage implementation only has constructor that takes {StoragePath.class, StorageConfiguration.class}

Hi @CTTY,
Thanks for the review! I agree with your concern about the Object[] params approach. I've updated the implementation to use HoodieStorageUtils.getStorage(StoragePath, StorageConfiguration) consistently. However, I found that in some test scenarios, we need to directly use new HoodieHadoopStorage(FileSystem) constructor instead of the factory method.
Example scenario: In tests like TestFSUtilsWithRetryWrapperEnable.java, we construct custom FileSystem wrappers (e.g., HoodieRetryWrapperFileSystem wrapping a FakeRemoteFileSystem) to simulate specific behaviors like retries or failures. When using the factory method getStorage(path, conf), it internally calls FileSystem.get(path, conf) which creates a new FileSystem instance instead of using our carefully constructed test FileSystem. This breaks the test logic.
So now, in some test codes, I will continue to use new HoodieHadoopStorage(fs)

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I'm ok with using HoodieHadoopStorage in tests. Thanks for the deep dive!

We can merge after the CI turns green

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm looks like there are some build issues, could you fix it?

@KiteSoar
Copy link
Contributor Author

Hmm looks like there are some build issues, could you fix it?

I’ve fixed it. Thanks for your help with my first Hudi contribution!

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@KiteSoar KiteSoar requested a review from CTTY December 30, 2025 01:33
Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is green, let's go!

@CTTY CTTY merged commit ede9513 into apache:master Dec 30, 2025
71 of 72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a HoodieStorageFactory to provide HoodieStorage instance

4 participants