Disable Reactor Netty auto-retry on HttpClient to prevent duplicate P…#488
Disable Reactor Netty auto-retry on HttpClient to prevent duplicate P…#488dxichen wants to merge 1 commit intolinkedin:mainfrom
Conversation
…UT requests Reactor Netty by default retries idempotent requests (and in some versions non-idempotent ones) when a pooled connection is closed before a response is received (PrematureCloseException). For PUT requests to the OpenHouse catalog this caused a second concurrent commit attempt: the first succeeded (200 OK) while the second received a 409 Conflict, which Iceberg mapped to CommitFailedException and triggered SnapshotProducer.cleanAll() - deleting the manifest list files written by the successful commit and corrupting the table. Fix: call disableRetry(true) on every HttpClient instance so Reactor Netty never silently retries a request, regardless of connection strategy (NEW or POOLED) or transport (HTTP or HTTPS). Fixes: https://linkedin.atlassian.net/browse/BDP-72706 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
PUT requests can retry on the network layer too. So disabling retries on netty would not necessarily solve this class of commit conflict. I think its safer to handle duplicate metadata.json PUT commits on the server. For example, mysql will respond differently if entry already exists, we catch this signal in hts and return 201. We should discuss the status code in this case -- 201 could be ok, since existence of metadata at commit already could be ok but there are implications |
| } else { | ||
| log.info("Using connection pool strategy"); | ||
| client = HttpClient.create(getCustomConnectionProvider()); | ||
| client = HttpClient.create(getCustomConnectionProvider()).disableRetry(true); |
There was a problem hiding this comment.
This config is for pooled connection and not at the request level. Normally Reactor Netty may retry once automatically when:
- A pooled connection becomes stale
- Connection closed before request write
- TCP reset during acquisition
- Connection closed by server before request starts.
Ideally on 200 or 409 Netty should not retry. In which scenario you are observing the issue?
There was a problem hiding this comment.
As this is common client side http connection pool config, changing default netty behaviour might have some consequences. Disabling netty retry can increase:
- 503 errors.
- connection reset
- PrematureCloseException
- ClosedChannelException
There was a problem hiding this comment.
This code is common and not only used across openhouse codebase, but also in client side.
…UT requests
Reactor Netty by default retries idempotent requests (and in some versions non-idempotent ones) when a pooled connection is closed before a response is received (PrematureCloseException). For PUT requests to the OpenHouse catalog this caused a second concurrent commit attempt: the first succeeded (200 OK) while the second received a 409 Conflict, which Iceberg mapped to CommitFailedException and triggered SnapshotProducer.cleanAll() - deleting the manifest list files written by the successful commit and corrupting the table.
Fix: call disableRetry(true) on every HttpClient instance so Reactor Netty never silently retries a request, regardless of connection strategy (NEW or POOLED) or transport (HTTP or HTTPS).
Summary
Issue] Briefly discuss the summary of the changes made in this
pull request in 2-3 lines.
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.