Skip to content

fix: resolve data races and improve safety for workers > 1#7

Merged
yuzone merged 4 commits intomainfrom
fix/workers-gt-1-data-races
Mar 27, 2026
Merged

fix: resolve data races and improve safety for workers > 1#7
yuzone merged 4 commits intomainfrom
fix/workers-gt-1-data-races

Conversation

@yuzone
Copy link
Copy Markdown
Owner

@yuzone yuzone commented Mar 27, 2026

Summary

This PR fixes three concurrent data races that occur when Fluent Bit is
configured with workers > 1 (supported since Fluent Bit v1.8), and
includes several code clarity improvements identified during the analysis.

yuzone and others added 4 commits March 27, 2026 23:25
- config.binaryData data race (workers > 1):
  Replace config.binaryData [][]byte field with config.binaryDataPool
  sync.Pool. Each worker borrows a *[][]byte from the pool, uses it
  for the flush, then returns it. This eliminates the race where two
  workers shared the same backing array and corrupted each other's data.

- exactly-once offset TOCTOU (workers > 1):
  Move offsetCounter += rowCount inside sendRequestExactlyOnce, within
  the same mutex critical section as AppendRows. Previously the offset
  was incremented in a separate Lock/Unlock in flushChunk, allowing a
  second worker to read the same offset=N before the first worker
  incremented it, causing duplicate-offset errors in BigQuery.
  Pass rowCount int64 through sendRequest -> sendRequestRetries ->
  sendRequestExactlyOnce.

- sendRequestRetries unlocked stream rebuild (workers > 1):
  Acquire config.mutex before Finalize/Close/buildStream on the
  rebuildPredicate path. Previously these calls ran without the lock,
  so a concurrent worker (workers > 1) could use or observe the stream
  while it was being torn down and replaced (use-after-free equivalent).
The mutex in outputConfig guards managedStreamSlice and its elements
(streamConfig.managedstream, offsetCounter, appendResults). Renaming
to streamMu makes this intent explicit at the declaration site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ms_ctx = context.Background() was a global variable used in
FLBPluginInit, finalizeCloseAllStreams, and FLBPluginExitCtx.
It served no purpose as a global since context.Background() is
a package-level singleton; using it as a named global only
obscured intent.

Changes:
- Remove ms_ctx global var
- FLBPluginInit: use local initCtx := context.Background()
- finalizeCloseAllStreams: accept ctx context.Context parameter
- FLBPluginExitCtx: create exitCtx with config.flushTimeout so
  drain and finalize operations cannot hang indefinitely (fixes
  issue GoogleCloudPlatform#19 in docs/code-analysis.md)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
30s was excessive given Fluent Bit's typical flush interval of 1-5s
and BigQuery Write API's normal latency of sub-second to a few seconds
(including retries with backoff). 10s provides sufficient headroom
for retries while staying proportional to real-world flush intervals.
The value remains overridable via Flush_Timeout_Sec in the config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yuzone yuzone merged commit 54b000f into main Mar 27, 2026
1 check passed
@yuzone yuzone deleted the fix/workers-gt-1-data-races branch March 27, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant