KAFKA-19925: Fix transaction timeout handling during broker upgrades #21161
+16
−4
Problem
During broker upgrades, the sendOffsetsToTransaction call would sometimes hang. Logs showed that it continuously returned errorCode=51, which is CONCURRENT_TRANSACTIONS. The test would eventually hit its timeout and fail. This happened for every single version upgrade and occurred in around 30% of the runs.
Resolution
The problem above left the producer in a broken state that did not resolve itself even after 5-10 minutes of waiting (including a few minutes past the transaction.max.timeout.ms limit). I tried several workarounds, including waiting for extended periods and retrying sendOffsetsToTransaction multiple times whenever a timeout occurred. Unfortunately, the producer stayed permanently stuck and kept receiving errorCode=51.
In this case, the resolution recommended in the Kafka docs is to close the previous producer and create a new one: https://kafka.apache.org/documentation/#usingtransactions
Reusing the old transactional.id would continue to lead to a stuck state, so this fix creates a brand new producer with a new ID and then rewinds the consumer offset to preserve exactly-once semantics (EOS).
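For readers who want the shape of the recovery flow, here is a minimal Java sketch. It is an illustration only, not the code changed in this PR. TxProducer, RewindableConsumer, StuckProducerRecovery, and the ID-suffix scheme are all hypothetical stand-ins, not Kafka client classes. The sketch assumes a timeout on sending offsets is the signal that the producer is stuck.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for a transactional producer; NOT the Kafka client API.
interface TxProducer {
    // Stands in for sendOffsetsToTransaction; throws when the call times out.
    void sendOffsets(long offset) throws TimeoutException;
    void close();
}

// Hypothetical stand-in for a consumer whose position can be reset.
interface RewindableConsumer {
    long lastCommittedOffset();
    void seek(long offset);
}

final class StuckProducerRecovery {
    private static final AtomicInteger GENERATION = new AtomicInteger();

    // Derives a fresh transactional ID so the replacement never reuses the stuck one.
    // The "-<n>" suffix scheme is an assumption made for this sketch.
    static String nextTransactionalId(String baseId) {
        return baseId + "-" + GENERATION.incrementAndGet();
    }

    // Tries to send offsets. On timeout, closes the stuck producer, creates a
    // replacement under a new transactional ID, and rewinds the consumer to its
    // last committed offset so records from the failed transaction are reprocessed.
    // Returns the producer the caller should use from now on.
    static TxProducer sendOffsetsOrReplace(TxProducer producer,
                                           RewindableConsumer consumer,
                                           long offset,
                                           String baseId,
                                           Function<String, TxProducer> factory) {
        try {
            producer.sendOffsets(offset);
            return producer;
        } catch (TimeoutException e) {
            producer.close();
            TxProducer replacement = factory.apply(nextTransactionalId(baseId));
            consumer.seek(consumer.lastCommittedOffset());
            return replacement;
        }
    }
}
```

The rewind step is what keeps the copy exactly-once: anything consumed after the last committed offset belonged to the abandoned transaction, so it has to be read again and sent through the new producer.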
Testing and Validation
Previously, I was able to run the test for a single version upgrade and have it fail within the first 5-10 runs. After the fix, I was able to run it 40 times continuously with 0 failures. I also ran the full test (all versions) ~5 times with 9/9 cases passing.