Skip to content

Conversation

@DL1231
Copy link
Collaborator

@DL1231 DL1231 commented Dec 15, 2025

This PR distinguishes between RemoteStorageException and
RetriableRemoteStorageException in RemoteLogManager to handle
temporary storage degradations gracefully:

  1. Copy path: Avoids deleting partially uploaded segments when
    RetriableRemoteStorageException is thrown;
  2. Delete path: Skips incrementing failedRemoteDeleteRequestRate metric
    for retriable exceptions;
  3. Documentation: Updates RemoteStorageManager Javadoc to clarify
    exception usage;
  4. Testing: Adds UT for retriable scenarios.

@github-actions github-actions bot added triage PRs from the community storage Pull requests that target the storage module tiered-storage Related to the Tiered Storage feature labels Dec 15, 2025
@DL1231 DL1231 requested a review from kamalcph December 15, 2025 07:43
Copy link
Contributor

@kamalcph kamalcph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! It looks good to me.

We can further update the RemoteLogManager javadoc to convey the contract to the plugin implementor.

@showuon Could you also review this PR? More details are on the ticket: KAFKA-19981. We want to take different actions in Kafka based on the error thrown by the plugin. Let me know your thoughts. Thanks!

@github-actions github-actions bot removed the triage PRs from the community label Dec 16, 2025
Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should not treat the retriable error as a failure. Left some comments.

Comment on lines 1055 to 1057
} catch (RemoteStorageException e) {
logger.info("Copy failed, cleaning segment {}", copySegmentStartedRlsm.remoteLogSegmentId());
try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In L1058 below, when deleteRemoteLogSegment throws RetriableRemoteStorageException, we don't catch it. Then, the exception thrown will be the one from deleteRemoteLogSegment, not the one from copyLogSegmentData. With that, the RetriableRemoteStorageException from deleteRemoteLogSegment could be caught in L1009 above and not update the metrics as expected. Maybe we should catch all exceptions thrown from L1058 because we want to retry later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have to handle the RetriableRemoteStorageException in RLMTask L844, so that it won't be logged as WARN log.

} catch (RetriableException | RetriableRemoteStorageException ex) {
                logger.debug("Encountered a retryable error while executing current task for partition {}", topicIdPartition, ex);
            }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DL1231 Could you also address the above comment? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should catch all exceptions thrown from L1058 because we want to retry later?

since RRSE extends RSE, the current patch looks clean to me. (Open to suggestions)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when deleteRemoteLogSegment throws RetriableRemoteStorageException, we don't catch it.

as @kamalcph pointed out, RRSE extends RSE, so when deleteRemoteLogSegment throws RRSE, it will be caught.

Comment on lines 1055 to 1057
} catch (RemoteStorageException e) {
logger.info("Copy failed, cleaning segment {}", copySegmentStartedRlsm.remoteLogSegmentId());
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have to handle the RetriableRemoteStorageException in RLMTask L844, so that it won't be logged as WARN log.

} catch (RetriableException | RetriableRemoteStorageException ex) {
                logger.debug("Encountered a retryable error while executing current task for partition {}", topicIdPartition, ex);
            }

Copy link
Contributor

@kamalcph kamalcph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

storage Pull requests that target the storage module tiered-storage Related to the Tiered Storage feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants