@nandini12396
When tiered storage segment copies are failing with CustomMetadataSizeLimitExceededException, InterruptedException, RetriableException or any outer-level exception, the RemoteCopyLagBytes and RemoteCopyLagSegments metrics are not emitted. This makes it impossible to detect growing lag during copy failures.

@github-actions github-actions bot added triage PRs from the community storage Pull requests that target the storage module tiered-storage Related to the Tiered Storage feature small Small PRs labels Dec 15, 2025
@kamalcph kamalcph left a comment

Thanks for the PR! Left a few comments. PTAL.

long bytesLag = log.onlyLocalLogSegmentsSize() - log.activeSegment().size();
long segmentsLag = log.onlyLocalLogSegmentsCount() - 1;
recordLagStats(bytesLag, segmentsLag);
} catch (Exception e) {
@kamalcph kamalcph Dec 17, 2025

This change can emit negative values when there are no candidate segments to upload and the highestOffsetInRemoteStorage is higher than the base offset of the active segment. This can happen after a leader change.

We can update this to ensure that negative values are not emitted:

long bytesLag = Math.max(0, log.onlyLocalLogSegmentsSize() - log.activeSegment().size());
long segmentsLag = Math.max(0, log.onlyLocalLogSegmentsCount() - 1);

This issue wasn't there previously because the metric was reported only when candidate segments were available. Could you cover this PR with unit tests?
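
For illustration, the clamping suggested above can be isolated in a small helper. This is only a minimal sketch, not the code in the PR: `CopyLagClamp` and its method names are hypothetical, and the parameters stand in for the values returned by `log.onlyLocalLogSegmentsSize()`, `log.activeSegment().size()` and `log.onlyLocalLogSegmentsCount()`.

```java
// Hypothetical helper, not part of Kafka: isolates the Math.max clamping
// proposed in the review so it can be exercised without a UnifiedLog.
final class CopyLagClamp {
    private CopyLagClamp() { }

    /** Only-local bytes minus the active segment's bytes, never below zero. */
    static long bytesLag(long onlyLocalLogSegmentsSize, long activeSegmentSize) {
        return Math.max(0L, onlyLocalLogSegmentsSize - activeSegmentSize);
    }

    /** Only-local segment count minus the active segment, never below zero. */
    static long segmentsLag(long onlyLocalLogSegmentsCount) {
        return Math.max(0L, onlyLocalLogSegmentsCount - 1);
    }

    public static void main(String[] args) {
        // No only-local segments reported: both values clamp to zero.
        System.out.println(bytesLag(0L, 100L));   // 0
        System.out.println(segmentsLag(0L));      // 0
        // Three only-local segments, one of them active: real lag is preserved.
        System.out.println(bytesLag(350L, 100L)); // 250
        System.out.println(segmentsLag(3L));      // 2
    }
}
```

Clamping at the call site keeps the gauge monotone with respect to real backlog: a value of zero then always means "nothing eligible to copy", never an accounting artifact.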

@nandini12396 nandini12396 force-pushed the KAFKA-19995-copy-lag-metrics branch from b5b8280 to a26e440 Compare December 17, 2025 08:39
when(mockLog.activeSegment()).thenReturn(activeSegment);
when(mockLog.onlyLocalLogSegmentsSize()).thenReturn(100L);
when(activeSegment.size()).thenReturn(100);
when(mockLog.onlyLocalLogSegmentsCount()).thenReturn(0L);
When the activeSegment size is 100, onlyLocalLogSegmentsCount should be 1.

Also, add a case that mocks highestOffsetInRemoteStorage as 125L, the activeSegment baseOffset as 100L, and the last-stable-offset / log-end-offset as 150; the lag should then not be reported as negative.
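
The scenario above can also be modeled without Mockito to see where the negative numbers come from. This is a rough sketch under a stated assumption: that the only-local accessors consider only segments whose base offset is at or above highestOffsetInRemoteStorage (this thread does not confirm how `UnifiedLog` computes them). `LeaderChangeLagScenario`, `Segment` and the helper methods are hypothetical, and the 100-byte segment size is borrowed from the mock above.

```java
import java.util.List;

// Hypothetical model of the leader-change scenario; none of these types exist in Kafka.
final class LeaderChangeLagScenario {

    /** Minimal stand-in for a log segment: base offset and size in bytes. */
    static final class Segment {
        final long baseOffset;
        final int sizeInBytes;

        Segment(long baseOffset, int sizeInBytes) {
            this.baseOffset = baseOffset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // ASSUMPTION: a segment counts as "only local" when its base offset is at or
    // above the highest offset already present in remote storage.
    static boolean isOnlyLocal(Segment s, long highestOffsetInRemoteStorage) {
        return s.baseOffset >= highestOffsetInRemoteStorage;
    }

    /** Returns {bytesLag, segmentsLag} as the unclamped formula computes them. */
    static long[] rawLag(List<Segment> segments, Segment active, long highestOffsetInRemoteStorage) {
        long onlyLocalSize = 0L;
        long onlyLocalCount = 0L;
        for (Segment s : segments) {
            if (isOnlyLocal(s, highestOffsetInRemoteStorage)) {
                onlyLocalSize += s.sizeInBytes;
                onlyLocalCount++;
            }
        }
        return new long[] {onlyLocalSize - active.sizeInBytes, onlyLocalCount - 1};
    }

    /** Same computation with the Math.max(0, ...) clamp applied to both values. */
    static long[] clampedLag(List<Segment> segments, Segment active, long highestOffsetInRemoteStorage) {
        long[] raw = rawLag(segments, active, highestOffsetInRemoteStorage);
        return new long[] {Math.max(0L, raw[0]), Math.max(0L, raw[1])};
    }

    public static void main(String[] args) {
        // Active segment starts at offset 100; remote storage already holds data up to 125.
        // The log-end-offset of 150 does not enter the formula, so it is not modeled here.
        Segment active = new Segment(100L, 100);
        List<Segment> segments = List.of(active);

        long[] raw = rawLag(segments, active, 125L);
        long[] clamped = clampedLag(segments, active, 125L);
        System.out.println(raw[0] + " " + raw[1]);         // -100 -1
        System.out.println(clamped[0] + " " + clamped[1]); // 0 0
    }
}
```

Under that assumption the active segment is excluded from the only-local totals while still being subtracted, which is why both raw values dip below zero.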
