fix: resolve data races and improve safety for workers > 1 by yuzone · Pull Request #7 · yuzone/fluentbit-bigquery-writeapi-sink

yuzone · 2026-03-27T15:05:12Z

Summary

This PR fixes three concurrent data races that occur when Fluent Bit is
configured with workers > 1 (supported since Fluent Bit v1.8), and
includes several code clarity improvements identified during the analysis.

- config.binaryData data race (workers > 1): Replace the config.binaryData [][]byte field with config.binaryDataPool sync.Pool. Each worker borrows a *[][]byte from the pool, uses it for the flush, then returns it. This eliminates the race where two workers shared the same backing array and corrupted each other's data.
- exactly-once offset TOCTOU (workers > 1): Move offsetCounter += rowCount inside sendRequestExactlyOnce, within the same mutex critical section as AppendRows. Previously the offset was incremented in a separate Lock/Unlock in flushChunk, allowing a second worker to read the same offset=N before the first worker incremented it, causing duplicate-offset errors in BigQuery. Pass rowCount int64 through sendRequest -> sendRequestRetries -> sendRequestExactlyOnce.
- sendRequestRetries unlocked stream rebuild (workers > 1): Acquire config.mutex before Finalize/Close/buildStream on the rebuildPredicate path. Previously these calls ran without the lock, so a concurrent worker could use or observe the stream while it was being torn down and replaced (use-after-free equivalent).
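The borrow/use/return pattern from the first bullet can be sketched as follows. This is a minimal illustration, not the plugin's code: rowPool stands in for config.binaryDataPool, and flush, the initial capacity of 64, and the string records are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// rowPool stands in for config.binaryDataPool. The name and the initial
// capacity are assumptions for this sketch.
var rowPool = sync.Pool{
	New: func() interface{} {
		buf := make([][]byte, 0, 64)
		// Store a pointer so Get/Put do not copy the slice header.
		return &buf
	},
}

// flush simulates one worker's flush (hypothetical helper): borrow a buffer,
// fill it, use it, and hand it back to the pool.
func flush(records []string) int {
	bufPtr := rowPool.Get().(*[][]byte)

	// Reset the length but keep the capacity so the backing array is reused.
	rows := (*bufPtr)[:0]
	for _, r := range records {
		rows = append(rows, []byte(r))
	}
	n := len(rows) // a real flush would send rows here

	// Save the (possibly grown) slice before returning the buffer.
	*bufPtr = rows
	rowPool.Put(bufPtr)
	return n
}

func main() {
	var wg sync.WaitGroup
	counts := make([]int, 4)
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so counts is not shared state.
			counts[id] = flush([]string{"a", "b", "c"})
		}(w)
	}
	wg.Wait()
	fmt.Println(counts)
}
```

Because each goroutine holds its own buffer between Get and Put, no two workers write to the same backing array at once, which is the property the fix relies on.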

The mutex in outputConfig guards managedStreamSlice and its elements (streamConfig.managedstream, offsetCounter, appendResults). Renaming to streamMu makes this intent explicit at the declaration site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
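To illustrate why offsetCounter has to be read and advanced under the same lock, here is a minimal sketch. The stream type, appendAt, and sendExactlyOnce are hypothetical stand-ins for streamConfig, the AppendRows call, and sendRequestExactlyOnce. The sketch advances the counter unconditionally, whereas the real code may only do so after a successful append.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a simplified stand-in for the plugin's streamConfig.
type stream struct {
	mu            sync.Mutex // plays the role of streamMu
	offsetCounter int64      // guarded by mu
}

// appendAt is a hypothetical placeholder for the AppendRows call. It only
// reports the offset it was asked to write at.
func appendAt(offset, rowCount int64) int64 {
	return offset
}

// sendExactlyOnce reads the offset, sends, and advances the counter inside
// one critical section, so no other worker can observe the same offset.
func (s *stream) sendExactlyOnce(rowCount int64) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()

	used := appendAt(s.offsetCounter, rowCount)
	s.offsetCounter += rowCount
	return used
}

func main() {
	s := &stream{}
	var wg sync.WaitGroup
	used := make([]int64, 8)
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			used[id] = s.sendExactlyOnce(10)
		}(w)
	}
	wg.Wait()

	// Every worker must have been given a distinct offset.
	seen := map[int64]bool{}
	for _, o := range used {
		seen[o] = true
	}
	fmt.Println(len(seen) == len(used), s.offsetCounter)
}
```

If the increment happened in a second Lock/Unlock pair instead, two workers could both read offset N between the two critical sections, which is the TOCTOU window the PR closes.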

ms_ctx = context.Background() was a global variable used in FLBPluginInit, finalizeCloseAllStreams, and FLBPluginExitCtx. It served no purpose as a global since context.Background() is a package-level singleton; using it as a named global only obscured intent.

Changes:
- Remove ms_ctx global var
- FLBPluginInit: use local initCtx := context.Background()
- finalizeCloseAllStreams: accept ctx context.Context parameter
- FLBPluginExitCtx: create exitCtx with config.flushTimeout so drain and finalize operations cannot hang indefinitely (fixes issue GoogleCloudPlatform#19 in docs/code-analysis.md)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
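The bounded-exit idea can be sketched with context.WithTimeout. In this illustration finalizeAll is a hypothetical stand-in for finalizeCloseAllStreams, and exitWithTimeout and the simulated work duration are assumptions. Only the shape (a context passed as a parameter, with a deadline derived from the flush timeout) comes from the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// finalizeAll is a hypothetical stand-in for finalizeCloseAllStreams. It takes
// the context as a parameter and gives up as soon as the context ends.
// work simulates how long draining and finalizing would take.
func finalizeAll(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// exitWithTimeout mirrors the idea behind exitCtx: derive a context bounded
// by the flush timeout so shutdown cannot block forever.
func exitWithTimeout(flushTimeout, work time.Duration) error {
	exitCtx, cancel := context.WithTimeout(context.Background(), flushTimeout)
	defer cancel() // release the timer even when finalizeAll returns early
	return finalizeAll(exitCtx, work)
}

func main() {
	// Work finishes well inside the deadline.
	fmt.Println(exitWithTimeout(200*time.Millisecond, time.Millisecond))

	// Work would take far longer than the deadline, so the call is cut short.
	err := exitWithTimeout(10*time.Millisecond, time.Second)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Passing the context explicitly also makes the deadline visible at each call site, which a shared global like ms_ctx did not.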

The 30s default flush timeout was excessive given Fluent Bit's typical flush interval of 1-5s and the BigQuery Write API's normal latency of sub-second to a few seconds (including retries with backoff). 10s provides sufficient headroom for retries while staying proportional to real-world flush intervals. The value remains overridable via Flush_Timeout_Sec in the config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
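A default-with-override setting of this kind might be resolved as in the sketch below. Only the 10s default and the Flush_Timeout_Sec key come from the commit message. parseFlushTimeout, defaultFlushTimeout, and the validation rules are assumptions for illustration, and the plugin's actual parsing may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// defaultFlushTimeout reflects the new 10s default described above.
const defaultFlushTimeout = 10 * time.Second

// parseFlushTimeout is a hypothetical helper. raw is the value of
// Flush_Timeout_Sec from the config, or "" when the key is not set.
func parseFlushTimeout(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultFlushTimeout, nil
	}
	secs, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid Flush_Timeout_Sec %q: %w", raw, err)
	}
	if secs <= 0 {
		return 0, fmt.Errorf("invalid Flush_Timeout_Sec %d: must be positive", secs)
	}
	return time.Duration(secs) * time.Second, nil
}

func main() {
	for _, raw := range []string{"", "30", "abc"} {
		d, err := parseFlushTimeout(raw)
		fmt.Println(d, err)
	}
}
```

Rejecting zero and negative values is a design choice of this sketch: a non-positive timeout would make the derived context expire immediately.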

yuzone and others added 4 commits March 27, 2026 23:25

yuzone merged commit 54b000f into main Mar 27, 2026
1 check passed

yuzone deleted the fix/workers-gt-1-data-races branch March 27, 2026 15:07
