@FrancisGodinho

Problem

During broker upgrades, the sendOffsetsToTransaction call would sometimes hang. Logs showed that it repeatedly returned errorCode=51, which is CONCURRENT_TRANSACTIONS. The test would eventually hit its timeout and fail. The failure showed up for every version upgrade and occurred in around 30% of the runs.

Resolution

The problem above left the producer in a broken state, and it did not resolve itself even after 5-10 minutes of waiting (including a few minutes past the transaction.max.ms time). I tried multiple solutions, including waiting for extended periods and retrying sendOffsetsToTransaction multiple times whenever a timeout occurred.

Unfortunately, the producer was permanently stuck and kept receiving errorCode=51. In this case, the recommended resolution in the Kafka docs is to close the previous producer and create a new producer. https://kafka.apache.org/documentation/#usingtransactions

Reusing the old transactional.id would continue to lead to a stuck state, so this fix creates a brand new producer with a new ID and then rewinds the consumer offset to preserve exactly-once delivery (EOD).
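For readers who want the shape of that recovery path, here is a minimal sketch in Python. It is not the code in this PR: `StubProducer`, `StubConsumer`, `ConcurrentTransactionsError`, `send_offsets_with_recovery`, and the `copier-N` IDs are all hypothetical stand-ins rather than Kafka client APIs, and it assumes that "rewinding" means seeking back to the last committed position so the uncommitted batch gets reprocessed.

```python
import itertools


class ConcurrentTransactionsError(Exception):
    """Hypothetical stand-in for a broker reply with errorCode=51."""


class StubProducer:
    """Hypothetical producer stub; not the Kafka client API."""

    def __init__(self, transactional_id, stuck=False):
        self.transactional_id = transactional_id
        self.stuck = stuck
        self.closed = False

    def send_offsets_to_transaction(self, offsets):
        # A stuck producer fails on every call, like the hang described above.
        if self.stuck:
            raise ConcurrentTransactionsError(self.transactional_id)
        return dict(offsets)

    def close(self):
        self.closed = True


class StubConsumer:
    """Hypothetical consumer stub tracking one position per partition."""

    def __init__(self, committed):
        self.committed = dict(committed)
        self.position = dict(committed)

    def seek_to_committed(self):
        # Assumption: rewinding means returning to the last committed offsets.
        self.position = dict(self.committed)


def send_offsets_with_recovery(producer, consumer, offsets, make_producer,
                               max_attempts=3):
    """Return (producer, ok). ok is False when the batch must be reprocessed."""
    for _ in range(max_attempts):
        try:
            producer.send_offsets_to_transaction(offsets)
            return producer, True
        except ConcurrentTransactionsError:
            continue
    # Still stuck after bounded retries: drop the old producer, rewind the
    # consumer, and hand back a fresh producer with a new transactional ID.
    producer.close()
    consumer.seek_to_committed()
    return make_producer(), False


ids = itertools.count(1)


def make_producer():
    return StubProducer(f"copier-{next(ids)}")


consumer = StubConsumer({"input-0": 10})
consumer.position["input-0"] = 15  # five records consumed but not committed
stuck = StubProducer("copier-0", stuck=True)
new_producer, ok = send_offsets_with_recovery(
    stuck, consumer, {"input-0": 15}, make_producer)
```

In this sketch the caller treats `ok == False` as "start the batch again with `new_producer`". Because the consumer position is back at the committed offset, the records that were in flight are read again rather than skipped.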

Testing and Validation

Previously, I was able to run the test for a single version upgrade and have it fail within the first 5-10 runs. After the fix, I was able to run it 40 times continuously with 0 failures. I also ran the full test (all versions) ~5 times with 9/9 cases passing.

@github-actions bot added the triage (PRs from the community), tools, and small (Small PRs) labels on Dec 16, 2025
self.perform_upgrade(from_kafka_version)

- copier_timeout_sec = 180
+ copier_timeout_sec = 360
Author

Note: because of the timeouts and the re-creation of the producer, copier_timeout_sec needed to be increased. I experimented a bit and found that 360s was a consistently reliable value.

@FrancisGodinho
Author

@chia7712 can you take a look when you get a chance please?

@github-actions bot removed the triage (PRs from the community) label on Dec 16, 2025
@chia7712
Member

@FrancisGodinho thanks for your patch. I have identified some underlying issues in e2e and TV2. Addressing them should allow us to achieve more stable transaction behavior. Please check https://issues.apache.org/jira/browse/KAFKA-19999 and https://issues.apache.org/jira/browse/KAFKA-20000 for more details.
