Skip to content

feat(openfeature): subscribe to FFE_FLAGS during tracer RC setup#4495

Open
leoromanovsky wants to merge 15 commits intomainfrom
leo/ffe-flags-early-rc-subscription
Open

feat(openfeature): subscribe to FFE_FLAGS during tracer RC setup#4495
leoromanovsky wants to merge 15 commits intomainfrom
leo/ffe-flags-early-rc-subscription

Conversation

@leoromanovsky
Copy link
Contributor

@leoromanovsky leoromanovsky commented Mar 4, 2026

What does this PR do?

Subscribes to FFE_FLAGS during tracer.startRemoteConfig() so feature flag configurations are included in the very first RC poll, eliminating the extra poll interval latency when NewDatadogProvider() is called after tracer.Start().

Also fixes several correctness issues discovered during review: deep copying buffered configs, preventing race conditions between tracer and provider subscription, and explicitly rejecting multiple providers (which were never supported but failed silently).

Motivation

When NewDatadogProvider() is called after tracer.Start(), the RC client has already started and the first poll may have already fired. This means FFE_FLAGS misses the first poll, adding one extra poll interval of latency before flag configs arrive and the provider becomes ready.

Changes

Early subscription (original goal):

  • Subscribe to FFE_FLAGS during tracer.startRemoteConfig() (alongside APM_TRACING and LIVE_DEBUGGING)
  • A forwarding callback in internal/openfeature buffers the latest config until the provider attaches
  • When NewDatadogProvider() is called, it replays the buffered config (fast path) or falls back to its own RC subscription (slow path)
  • Moved hardcoded ffeCapability = 46 into the remoteconfig capability iota block as FFEFlagEvaluation

Correctness fixes (from review):

  • Deep copy byte slices with bytes.Clone when buffering RC updates to prevent corruption if RC reuses buffers
  • Add SubscribeProvider() to serialize tracer/provider subscription and prevent race conditions
  • Explicitly reject multiple providers in both fast path (AttachCallback) and slow path (SubscribeProvider) - multiple providers were never supported but previously failed silently with the last callback overwriting earlier ones in productCallbacks()
  • Start RC client only in slow path (when tracer didn't subscribe) to avoid unnecessary work and test races

Verification

I wrote a side-by-side comparison in ffe-dogfooding for testing:

  1. app-go: Uses published SDK (github.com/DataDog/dd-trace-go/v2 v2.5.0)
  2. app-go-local: Uses local SDK with the fix (via go.mod replace directive)

Both apps were modified to:

  • Add timing instrumentation logging key events
  • Introduce a 7-second delay between tracer.Start() and NewDatadogProvider() to simulate the real-world scenario where an app starts the tracer, performs other initialization, and only later creates the OpenFeature provider
How

The following instrumentation was added to both apps:

tracer.Start()
log.Printf("[TIMING] tracer.Start() complete: T+%dms", time.Since(startTime).Milliseconds())

// Simulate delay (longer than RC poll interval of 5s)
time.Sleep(7 * time.Second)

log.Printf("[TIMING] Calling NewDatadogProvider(): T+%dms", time.Since(startTime).Milliseconds())
provider, err := ddopenfeature.NewDatadogProvider(ddopenfeature.ProviderConfig{})
log.Printf("[TIMING] NewDatadogProvider() returned: T+%dms", time.Since(startTime).Milliseconds())

openfeature.SetProviderAndWait(provider)
log.Printf("[TIMING] SetProviderAndWait() complete (PROVIDER READY): T+%dms", time.Since(startTime).Milliseconds())
Results
Metric Published SDK (v2.5.0) Local SDK (with fix)
tracer.Start() complete T+15ms T+8ms
7s delay complete T+7015ms T+7009ms
NewDatadogProvider() called T+7015ms T+7009ms
PROVIDER READY T+15038ms T+7014ms
Time waiting after NewDatadogProvider() ~8023ms ~5ms
Conclusion
  • Without the fix: The provider must wait for the next RC poll (~8 seconds in this test)
  • With the fix: The provider immediately replays the buffered RC config (~5ms)

The fix eliminates one full RC poll interval of latency (~5-8 seconds) when NewDatadogProvider() is called after the first RC poll has already completed.

Try it out

The test setup is in test/pr-4495-rc-timing-fix in ffe-dogfooding. Find/replace /path/to/local/dd-trace-go with your local path. Then run docker-compose up --build app-go app-go-local and observe and compare the [TIMING] logs.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running make lint locally.
  • New code doesn't break existing tests. You can check this by running make test locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date. You can check this by running make generate locally.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running make fix-modules locally.

@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Mar 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 50.62%
Overall Coverage: 59.32% (+0.07%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: b95c6a4 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@codecov
Copy link

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 41.86047% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.94%. Comparing base (89047df) to head (41ae016).

Files with missing lines Patch % Lines
internal/openfeature/rc_subscription.go 45.28% 28 Missing and 1 partial ⚠️
internal/openfeature/testing.go 33.33% 12 Missing ⚠️
openfeature/remoteconfig.go 40.00% 4 Missing and 2 partials ⚠️
ddtrace/tracer/remote_config.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
internal/remoteconfig/remoteconfig.go 64.83% <ø> (-9.73%) ⬇️
openfeature/rc_subscription.go 100.00% <100.00%> (ø)
ddtrace/tracer/remote_config.go 88.33% <0.00%> (-4.52%) ⬇️
openfeature/remoteconfig.go 76.28% <40.00%> (+1.01%) ⬆️
internal/openfeature/testing.go 33.33% <33.33%> (ø)
internal/openfeature/rc_subscription.go 45.28% <45.28%> (ø)

... and 258 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces OpenFeature flag readiness latency by ensuring the FFE_FLAGS Remote Config product is subscribed during tracer Remote Config initialization (so it’s included in the first poll), while allowing the later-created DatadogProvider to attach to that early subscription and replay the latest buffered snapshot.

Changes:

  • Add a tracer-time RC subscription for FFE_FLAGS, with an internal forwarding callback that buffers the latest config snapshot until the provider attaches.
  • Add provider-side “fast path” attach/replay logic; keep “slow path” behavior for cases where no tracer RC subscription exists.
  • Promote the previously hardcoded FFE capability (46) into internal/remoteconfig as FFEFlagEvaluation and update module sums/mods accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
openfeature/remoteconfig.go Adds attach fast-path; switches to remoteconfig.FFEFlagEvaluation; updates unregister logic.
openfeature/rc_subscription.go Adds a thin wrapper to attach provider callback to the internal subscription bridge.
openfeature/rc_subscription_test.go Adds tests for attach/replay behavior and fast-path initialization.
internal/remoteconfig/remoteconfig.go Adds FFEFlagEvaluation capability constant to the capability iota block.
internal/openfeature/rc_subscription.go Implements tracer-time subscription + forwarding/buffering + attach/replay bridge.
internal/openfeature/rc_subscription_test.go Adds unit tests for buffering/forwarding and attach behavior.
internal/openfeature/testing.go Adds test helpers to manipulate internal subscription state.
ddtrace/tracer/remote_config.go Subscribes to FFE_FLAGS during tracer RC setup when env var is enabled.
ddtrace/tracer/tracer_test.go Adds a goleak ignore entry for an OpenFeature SDK goroutine.
instrumentation/internal/namingschematest/go.sum Dependency checksum updates.
go.work.sum Workspace checksum updates.
contrib/confluentinc/confluent-kafka-go/kafka.v2/go.mod Adds/updates indirect deps (e.g., pflag, x/crypto, x/oauth2).
contrib/confluentinc/confluent-kafka-go/kafka.v2/go.sum Dependency checksum updates.
contrib/bradfitz/gomemcache/go.mod Updates indirect golang.org/x/crypto.
contrib/bradfitz/gomemcache/go.sum Dependency checksum updates.
contrib/aws/aws-sdk-go/go.mod Updates indirect golang.org/x/crypto.
contrib/aws/aws-sdk-go/go.sum Dependency checksum updates.
contrib/aws/aws-sdk-go-v2/go.mod Updates indirect golang.org/x/crypto.
contrib/aws/aws-sdk-go-v2/go.sum Dependency checksum updates.
contrib/99designs/gqlgen/go.mod Updates indirect golang.org/x/crypto.
contrib/99designs/gqlgen/go.sum Dependency checksum updates.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +64 to +70
func forwardingCallback(update remoteconfig.ProductUpdate) map[string]rc.ApplyStatus {
rcState.Lock()
defer rcState.Unlock()

if rcState.callback != nil {
return rcState.callback(update)
}
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In forwardingCallback, rcState is held while invoking the provider callback. Calling external code under a mutex can block RC processing and risks deadlocks if the callback ever calls back into this package. Consider copying rcState.callback to a local variable under lock, releasing the lock, then invoking it (and similarly for the buffering path).

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree with the deadlock risk. The callback chain is: forwardingCallback (holds rcState.Lock) → rcCallbackupdateConfiguration (acquires p.mu.Lock)

For a deadlock, we'd need another code path that acquires p.mu first, then tries to acquire rcState.Lock. This doesn't exist - rcCallback and updateConfiguration never call back into internal/openfeature.

The suggestion to release the lock early introduces risks. If we release rcState.Lock before invoking the callback, a second RC update could arrive and be processed concurrently or out-of-order. The current implementation ensures serial processing of updates, which is important since each RC update is a full config snapshot.

Comment on lines +91 to +103
defer rcState.Unlock()

if !rcState.subscribed {
return false
}

rcState.callback = cb

// Replay the buffered update if there is one.
if rcState.buffered != nil {
log.Debug("openfeature: replaying buffered RC config to provider")
cb(*rcState.buffered)
rcState.buffered = nil
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AttachCallback replays the buffered update by calling cb while still holding rcState.Lock(). As with forwardingCallback, invoking callbacks under a mutex can lead to lock contention or future deadlocks. Suggest capturing the buffered update (and clearing it) under lock, then unlocking before calling cb.

Suggested change
defer rcState.Unlock()
if !rcState.subscribed {
return false
}
rcState.callback = cb
// Replay the buffered update if there is one.
if rcState.buffered != nil {
log.Debug("openfeature: replaying buffered RC config to provider")
cb(*rcState.buffered)
rcState.buffered = nil
if !rcState.subscribed {
rcState.Unlock()
return false
}
rcState.callback = cb
// Capture and clear the buffered update under lock, then replay it after
// releasing the mutex to avoid invoking callbacks while holding rcState.Lock().
buffered := rcState.buffered
rcState.buffered = nil
rcState.Unlock()
// Replay the buffered update if there is one.
if buffered != nil {
log.Debug("openfeature: replaying buffered RC config to provider")
cb(*buffered)

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as #4495 (comment) - no deadlock risk exists here.

Also, releasing the lock early in AttachCallback is problematic: if we unlock after setting rcState.callback = cb but before calling cb(*rcState.buffered), a new RC update could arrive and forwardingCallback would forward it to cb concurrently with the replay. This could cause race conditions in the provider.

Holding the lock ensures the replay completes atomically before any new updates are processed.

Comment on lines +10 to +38
// ResetForTest resets the global rcState for test isolation.
func ResetForTest() {
rcState.Lock()
defer rcState.Unlock()
rcState.subscribed = false
rcState.callback = nil
rcState.buffered = nil
}

// SetSubscribedForTest sets the subscribed flag without actually calling Subscribe.
func SetSubscribedForTest(v bool) {
rcState.Lock()
defer rcState.Unlock()
rcState.subscribed = v
}

// SetBufferedForTest sets a buffered update for testing.
func SetBufferedForTest(u *remoteconfig.ProductUpdate) {
rcState.Lock()
defer rcState.Unlock()
rcState.buffered = u
}

// GetBufferedForTest returns the current buffered update for testing.
func GetBufferedForTest() *remoteconfig.ProductUpdate {
rcState.Lock()
defer rcState.Unlock()
return rcState.buffered
}
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These exported *ForTest helpers live in a non-_test.go file, so they become part of the shipped internal/openfeature package API and allow mutating global state from any in-repo code. If possible, prefer keeping this test-only surface out of production builds (e.g., move tests that need state mutation into the internal/openfeature package, or put test utilities in a dedicated test-only package/build-tagged file).

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: the suggestions

"Move tests that need state mutation into the internal/openfeature package" - Not feasible. The tests in openfeature/rc_subscription_test.go need access to unexported types/functions from the openfeature package (newDatadogProvider, attachProvider, universalFlagsConfiguration, etc.). Moving them to internal/openfeature would break this access.

"Put test utilities in a build-tagged file" - This would work (//go:build testutils) but requires all test runs to use go test -tags testutils ./..., including CI updates. These changes are well beyond the scope of what we're trying to achieve here.

The primary risk seems to be that "in-repo code could misuse these" - which is a code review concern and is already communicated with the *ForTest suffix for each of these methods.

Comment on lines 203 to +210
//
// Note: This function is currently not fully implemented as Remote Config
// doesn't provide an Unsubscribe method yet. The provider will continue
// to receive updates until the Remote Config client is stopped.
func stopRemoteConfig() error {
log.Debug("openfeature: unregistered from Remote Config")
// For now, we can unregister the capability to stop receiving updates
_ = remoteconfig.UnregisterCapability(ffeCapability)
_ = remoteconfig.UnregisterCapability(remoteconfig.FFEFlagEvaluation)
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above stopRemoteConfig says Remote Config “doesn't provide an Unsubscribe method yet”, but internal/remoteconfig does expose Unsubscribe() (and Subscribe returns a token). Either update this comment to reflect the real limitation (e.g., this package currently discards the subscription token), or store the token from Subscribe and call Unsubscribe here instead of unregistering the capability globally.

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is incorrect. The Unsubscribe() method is right here!

func Unsubscribe(token SubscriptionToken) error {

However, this package discards the subscription token.

if _, err := remoteconfig.Subscribe(internalffe.FFEProductName, provider.rcCallback, remoteconfig.FFEFlagEvaluation); err != nil {

remoteconfig.Subscribe returns a token you can use later to unsubscribe: token, err := remoteconfig.Subscribe(product, callback, capability) but we ignore it with the _, so we can't do remoteconfig.Unsubscribe(token) afterwards.

IMO it would be nice to fix this, but it feels out of scope for this PR, so I'm just updating the comment for now. I see the subscription token getting discarded elsewhere in the SDK as well, so this seems to be a common pattern

9789643

@pr-commenter
Copy link

pr-commenter bot commented Mar 4, 2026

Benchmarks

Benchmark execution time: 2026-03-13 12:41:30

Comparing candidate commit b95c6a4 in PR branch leo/ffe-flags-early-rc-subscription with baseline commit 89047df in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 155 metrics, 9 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

leoromanovsky and others added 10 commits March 6, 2026 21:45
Move the FFE_FLAGS Remote Config subscription into the tracer's
startRemoteConfig() so it's included in the very first RC poll,
eliminating one poll interval of latency before flag configs arrive.
Move the forwarding callback and buffering logic into
internal/openfeature so the tracer doesn't import the full
openfeature package (which pulls in OTel metrics/gRPC).

Also fix lint issues: maps.Copy, err.Error() in log, gofmt.
Run fixmodules to update contrib go.mod/go.sum.
@sameerank sameerank force-pushed the leo/ffe-flags-early-rc-subscription branch from 2de74cb to 41ae016 Compare March 7, 2026 05:46
@sameerank sameerank marked this pull request as ready for review March 7, 2026 06:11
@sameerank sameerank requested review from a team as code owners March 7, 2026 06:11
@sameerank sameerank requested review from greghuels and typotter March 7, 2026 06:11
Copy link

@sameerank sameerank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can approve for myself, but it would be good to get another review

Replace startWithRemoteConfig() call with direct SubscribeProvider() +
attachProvider() calls to avoid touching global remoteconfig state,
which raced with lingering shutdown goroutines from integration tests.

Also fix redundant err.Error() in log.Warn format string.
@darccio
Copy link
Member

darccio commented Mar 12, 2026

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f00a862da4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

SubscribeRC() cached subscribed=true permanently, but
remoteconfig.Stop() destroys all subscriptions. On a
Start() -> Stop() -> Start() cycle, SubscribeRC() would no-op,
leaving the provider permanently stuck on stale config.

Now verify the subscription is still live via HasProduct() before
short-circuiting. If lost, reset state and re-subscribe.
@leoromanovsky leoromanovsky requested a review from darccio March 13, 2026 12:23
Copy link
Member

@darccio darccio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback addressed, LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants