Skip to content

Disable Reactor Netty auto-retry on HttpClient to prevent duplicate P…#488

Open
dxichen wants to merge 1 commit intolinkedin:mainfrom
dxichen:dchen1/fix-disable-netty-retry-put
Open

Disable Reactor Netty auto-retry on HttpClient to prevent duplicate P…#488
dxichen wants to merge 1 commit intolinkedin:mainfrom
dxichen:dchen1/fix-disable-netty-retry-put

Conversation

@dxichen
Copy link
Collaborator

@dxichen dxichen commented Mar 5, 2026

…UT requests

Reactor Netty by default retries idempotent requests (and in some versions non-idempotent ones) when a pooled connection is closed before a response is received (PrematureCloseException). For PUT requests to the OpenHouse catalog this caused a second concurrent commit attempt: the first succeeded (200 OK) while the second received a 409 Conflict, which Iceberg mapped to CommitFailedException and triggered SnapshotProducer.cleanAll() - deleting the manifest list files written by the successful commit and corrupting the table.

Fix: call disableRetry(true) on every HttpClient instance so Reactor Netty never silently retries a request, regardless of connection strategy (NEW or POOLED) or transport (HTTP or HTTPS).

Summary

Issue] Briefly discuss the summary of the changes made in this
pull request in 2-3 lines.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

…UT requests

Reactor Netty by default retries idempotent requests (and in some versions
non-idempotent ones) when a pooled connection is closed before a response is
received (PrematureCloseException). For PUT requests to the OpenHouse catalog
this caused a second concurrent commit attempt: the first succeeded (200 OK)
while the second received a 409 Conflict, which Iceberg mapped to
CommitFailedException and triggered SnapshotProducer.cleanAll() - deleting the
manifest list files written by the successful commit and corrupting the table.

Fix: call disableRetry(true) on every HttpClient instance so Reactor Netty never
silently retries a request, regardless of connection strategy (NEW or POOLED) or
transport (HTTP or HTTPS).

Fixes: https://linkedin.atlassian.net/browse/BDP-72706

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cbb330
Copy link
Collaborator

cbb330 commented Mar 6, 2026

PUT requests can retry on the network layer too. So disabling retries on netty would not necessarily solve this class of commit conflict.

I think its safer to handle duplicate metadata.json PUT commits on the server. For example, mysql will respond differently if entry already exists, we catch this signal in hts and return 201. We should discuss the status code in this case -- 201 could be ok, since existence of metadata at commit already could be ok but there are implications

} else {
log.info("Using connection pool strategy");
client = HttpClient.create(getCustomConnectionProvider());
client = HttpClient.create(getCustomConnectionProvider()).disableRetry(true);
Copy link
Member

@abhisheknath2011 abhisheknath2011 Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config is for pooled connection and not at the request level. Normally Reactor Netty may retry once automatically when:

  • A pooled connection becomes stale
  • Connection closed before request write
  • TCP reset during acquisition
  • Connection closed by server before request starts.

Ideally on 200 or 409 Netty should not retry. In which scenario you are observing the issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is common client side http connection pool config, changing default netty behaviour might have some consequences. Disabling netty retry can increase:

  • 503 errors.
  • connection reset
  • PrematureCloseException
  • ClosedChannelException

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is common and not only used across openhouse codebase, but also in client side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants