KAFKA-19995: Record copy lag metrics during failures #21158
kamalcph left a comment:
Thanks for the PR! Left a few comments. PTAL.
long bytesLag = log.onlyLocalLogSegmentsSize() - log.activeSegment().size();
long segmentsLag = log.onlyLocalLogSegmentsCount() - 1;
recordLagStats(bytesLag, segmentsLag);
} catch (Exception e) {
This change can emit negative values when there are no candidate segments to upload and highestOffsetInRemoteStorage is higher than the base offset of the active segment. This can happen after a leader change:
We can update this to ensure that negative values are not emitted:
long bytesLag = Math.max(0, log.onlyLocalLogSegmentsSize() - log.activeSegment().size());
long segmentsLag = Math.max(0, log.onlyLocalLogSegmentsCount() - 1);
This issue did not exist previously because the metric was reported only when candidate segments were available. Could you cover this PR with unit tests?
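The clamping suggested above can be sketched in isolation as follows. This is a minimal illustration, not the PR's code: the class and method names are hypothetical, and the plain long parameters stand in for the values returned by log.onlyLocalLogSegmentsSize(), log.activeSegment().size() and log.onlyLocalLogSegmentsCount(). Only the two Math.max expressions are taken from the review comment.

```java
public class CopyLagClamp {
    // Hypothetical helper: bytes lag excludes the active segment and never goes below zero.
    static long bytesLag(long onlyLocalLogSegmentsSize, long activeSegmentSize) {
        return Math.max(0L, onlyLocalLogSegmentsSize - activeSegmentSize);
    }

    // Hypothetical helper: segments lag excludes the active segment and never goes below zero.
    static long segmentsLag(long onlyLocalLogSegmentsCount) {
        return Math.max(0L, onlyLocalLogSegmentsCount - 1);
    }

    public static void main(String[] args) {
        // No local-only segments are counted, but the active segment holds 100 bytes:
        // the unclamped arithmetic would give -100 and -1.
        System.out.println(bytesLag(0L, 100L));   // prints 0
        System.out.println(segmentsLag(0L));      // prints 0

        // Normal case: 300 local-only bytes across 3 segments, active segment is 100 bytes.
        System.out.println(bytesLag(300L, 100L)); // prints 200
        System.out.println(segmentsLag(3L));      // prints 2
    }
}
```

Keeping the clamp next to the subtraction means every code path that records the metric, including the new failure paths, gets the same lower bound.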
When tiered storage segment copies are failing with CustomMetadataSizeLimitExceededException, InterruptedException, RetriableException or any outer-level exception, the RemoteCopyLagBytes and RemoteCopyLagSegments metrics are not emitted. This makes it impossible to detect growing lag during copy failures.
b5b8280 to a26e440
when(mockLog.activeSegment()).thenReturn(activeSegment);
when(mockLog.onlyLocalLogSegmentsSize()).thenReturn(100L);
when(activeSegment.size()).thenReturn(100);
when(mockLog.onlyLocalLogSegmentsCount()).thenReturn(0L);
When the activeSegment size is 100, onlyLocalLogSegmentsCount should be 1.
Also, mock highestOffsetInRemoteStorage as 125L, the activeSegment baseOffset as 100L, and the last-stable-offset / log-end-offset as 150; the lag should then not report negative values.
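The scenario described above can be worked through with a small stand-alone model. This is a sketch under a stated assumption, not Kafka's implementation: it assumes a segment counts as "local only" when its base offset is at or above highestOffsetInRemoteStorage. The Segment class and the helper methods are hypothetical; only the offsets (100, 125) and the active segment size (100) come from the comment.

```java
import java.util.List;

public class LeaderChangeLagScenario {
    // Hypothetical stand-in for a log segment: only base offset and size matter here.
    static final class Segment {
        final long baseOffset;
        final int sizeInBytes;

        Segment(long baseOffset, int sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Assumption: a segment is "local only" when baseOffset >= highestOffsetInRemoteStorage.
    static long onlyLocalSize(List<Segment> segments, long highestOffsetInRemoteStorage) {
        return segments.stream()
                .filter(s -> s.baseOffset >= highestOffsetInRemoteStorage)
                .mapToLong(s -> s.sizeInBytes)
                .sum();
    }

    static long onlyLocalCount(List<Segment> segments, long highestOffsetInRemoteStorage) {
        return segments.stream()
                .filter(s -> s.baseOffset >= highestOffsetInRemoteStorage)
                .count();
    }

    public static void main(String[] args) {
        Segment active = new Segment(100L, 100);      // active segment, base offset 100
        List<Segment> segments = List.of(active);
        long highestOffsetInRemoteStorage = 125L;     // already past the active base offset

        long rawBytesLag = onlyLocalSize(segments, highestOffsetInRemoteStorage) - active.sizeInBytes;
        long rawSegmentsLag = onlyLocalCount(segments, highestOffsetInRemoteStorage) - 1;

        System.out.println(rawBytesLag);                  // prints -100
        System.out.println(rawSegmentsLag);               // prints -1
        System.out.println(Math.max(0L, rawBytesLag));    // prints 0
        System.out.println(Math.max(0L, rawSegmentsLag)); // prints 0
    }
}
```

Under that assumption the active segment is excluded from the local-only totals, so subtracting it again drives both values below zero, which is the case a unit test for this PR would need to pin down.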