fix: resolve data races and improve safety for workers > 1#7
Merged
Conversation
- config.binaryData data race (workers > 1): Replace config.binaryData [][]byte field with config.binaryDataPool sync.Pool. Each worker borrows a *[][]byte from the pool, uses it for the flush, then returns it. This eliminates the race where two workers shared the same backing array and corrupted each other's data. - exactly-once offset TOCTOU (workers > 1): Move offsetCounter += rowCount inside sendRequestExactlyOnce, within the same mutex critical section as AppendRows. Previously the offset was incremented in a separate Lock/Unlock in flushChunk, allowing a second worker to read the same offset=N before the first worker incremented it, causing duplicate-offset errors in BigQuery. Pass rowCount int64 through sendRequest -> sendRequestRetries -> sendRequestExactlyOnce. - sendRequestRetries unlocked stream rebuild (workers > 1): Acquire config.mutex before Finalize/Close/buildStream on the rebuildPredicate path. Previously these calls ran without the lock, so a concurrent worker (workers > 1) could use or observe the stream while it was being torn down and replaced (use-after-free equivalent).
The mutex in outputConfig guards managedStreamSlice and its elements (streamConfig.managedstream, offsetCounter, appendResults). Renaming to streamMu makes this intent explicit at the declaration site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ms_ctx = context.Background() was a global variable used in FLBPluginInit, finalizeCloseAllStreams, and FLBPluginExitCtx. It served no purpose as a global since context.Background() is a package-level singleton; using it as a named global only obscured intent. Changes: - Remove ms_ctx global var - FLBPluginInit: use local initCtx := context.Background() - finalizeCloseAllStreams: accept ctx context.Context parameter - FLBPluginExitCtx: create exitCtx with config.flushTimeout so drain and finalize operations cannot hang indefinitely (fixes issue GoogleCloudPlatform#19 in docs/code-analysis.md) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
30s was excessive given Fluent Bit's typical flush interval of 1-5s and BigQuery Write API's normal latency of sub-second to a few seconds (including retries with backoff). 10s provides sufficient headroom for retries while staying proportional to real-world flush intervals. The value remains overridable via Flush_Timeout_Sec in the config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes three concurrent data races that occur when Fluent Bit is
configured with
workers > 1(supported since Fluent Bit v1.8), andincludes several code clarity improvements identified during the analysis.