-
Notifications
You must be signed in to change notification settings - Fork 14.8k
KAFKA-19981: Handle retriable remote storage exception in RemoteLogManager #21150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! It looks good to me.
We can further update the RemoteLogManager javadoc to convey the contract to the plugin implementor.
@showuon Could you also review this PR? More details are on the ticket: KAFKA-19981. We want to take different actions in Kafka based on the error thrown by the plugin. Let me know your thoughts. Thanks!
showuon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should not treat the retriable error as a failure. Left some comments.
| } catch (RemoteStorageException e) { | ||
| logger.info("Copy failed, cleaning segment {}", copySegmentStartedRlsm.remoteLogSegmentId()); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In L1058 below, when deleteRemoteLogSegment throws RetriableRemoteStorageException, we don't catch it. Then, the exception thrown will be the one from deleteRemoteLogSegment, not the one from copyLogSegmentData. With that, the RetriableRemoteStorageException from deleteRemoteLogSegment could be caught in L1009 above and not update the metrics as expected. Maybe we should catch all exceptions thrown from L1058 because we want to retry later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have to handle the RetriableRemoteStorageException in RLMTask L844, so that it won't be logged as WARN log.
} catch (RetriableException | RetriableRemoteStorageException ex) {
logger.debug("Encountered a retryable error while executing current task for partition {}", topicIdPartition, ex);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DL1231 Could you also address the above comment? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should catch all exceptions thrown from L1058 because we want to retry later?
since RRSE extends RSE, the current patch looks clean to me. (Open to suggestions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when deleteRemoteLogSegment throws RetriableRemoteStorageException, we don't catch it.
as @kamalcph pointed out, RRSE extends RSE, so when deleteRemoteLogSegment throws RRSE, it will be caught.
storage/src/test/java/org/apache/kafka/server/log/remote/storage/RemoteLogManagerTest.java
Outdated
Show resolved
Hide resolved
storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManager.java
Outdated
Show resolved
Hide resolved
| } catch (RemoteStorageException e) { | ||
| logger.info("Copy failed, cleaning segment {}", copySegmentStartedRlsm.remoteLogSegmentId()); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have to handle the RetriableRemoteStorageException in RLMTask L844, so that it won't be logged as WARN log.
} catch (RetriableException | RetriableRemoteStorageException ex) {
logger.debug("Encountered a retryable error while executing current task for partition {}", topicIdPartition, ex);
}
kamalcph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR distinguishes between
RemoteStorageExceptionandRetriableRemoteStorageExceptioninRemoteLogManagerto handletemporary storage degradations gracefully:
RetriableRemoteStorageExceptionis thrown;for retriable exceptions;
RemoteStorageManagerJavadoc to clarifyexception usage;