Per-Attempt Reconnect Jitter #1586

@scottf

Per-Attempt Reconnect Jitter

Problem

The existing reconnectJitter / reconnectJitterTls options apply jitter only after a full sweep of the server pool — i.e., after reconnectImplConnect() has tried every server in the pool once without success. Looking at the current reconnectImplConnect() loop:

// NatsConnection.reconnectImplConnect() — current behaviour
while ((cur = serverPool.nextServer()) != null) {
    if (first == null) {
        first = cur;
    } else if (first.equals(cur)) {
        // went around the pool an entire time — jitter fires here only
        invokeReconnectDelayHandler(++totalRounds);
    }
    tryToConnect(cur, resolved, NatsSystemClock.nanoTime()); // zero delay per individual attempt
}

Per-individual-attempt delay: zero. With a 3-broker pool, all failing clients simultaneously hit broker 1, fail, immediately hit broker 2, fail, immediately hit broker 3, fail — then wait for the full-sweep delay (default 2s + jitter). During a rolling restart this creates a synchronized reconnect storm where every client hammers each broker at the same moment.

Passive first-attempt problem: When passiveForceReconnect() is called (LDM or crash), reconnectImplConnect() immediately tries the first server with zero delay. If many application instances receive the same LDM signal simultaneously, all their passive connections attempt to reconnect to the same first server at exactly the same time — amplifying broker load at the worst possible moment.

ApConnection.newPassive() initial connect: The very first passive connection is made via passive.connect(true) — the initial connect path, not the reconnect path — so even the existing full-sweep jitter does not apply. The first passive connect always has zero delay regardless of configured options.

Why the existing ReconnectDelayHandler is not sufficient

ReconnectDelayHandler.getWaitTime(totalRounds) is called once per full pool sweep — the same trigger point as reconnectJitter. It cannot provide per-attempt delay and does not cover the newPassive() initial connect path at all.

Proposed API addition

Add a new reconnectAttemptJitter option applied before each individual server attempt in the reconnect loop, independent of the existing full-sweep reconnectJitter:

// In Options.Builder
/**
 * Sets the maximum random jitter added before each individual reconnect attempt.
 * A random duration in [0, reconnectAttemptJitter) is applied before each call
 * to tryToConnect() in the reconnect loop, spreading reconnect load across time
 * regardless of pool size.
 *
 * This is separate from reconnectJitter which fires after a full sweep of the pool.
 * Default: 0 (no per-attempt jitter — preserves existing behaviour).
 *
 * @param time maximum per-attempt jitter; null or zero disables it
 */
public Builder reconnectAttemptJitter(Duration time) {
    this.reconnectAttemptJitter = time;
    return this;
}

Apply inside reconnectImplConnect() before each tryToConnect():

// In NatsConnection.reconnectImplConnect() — proposed change
for (NatsUri resolved : resolvedList) {
    applyPerAttemptJitter(); // sleep [0, reconnectAttemptJitter) if configured
    tryToConnect(cur, resolved, NatsSystemClock.nanoTime());
    ...
}
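
The applyPerAttemptJitter() helper is not spelled out above. A minimal sketch of what it could look like follows, assuming a uniform random delay and a blocking sleep on the reconnect thread. The class name PerAttemptJitter, the jitterNanos helper, and passing the configured Duration as a parameter are illustrative assumptions, not part of the existing client:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not an existing class in the client.
final class PerAttemptJitter {
    private PerAttemptJitter() {}

    /**
     * Picks a uniform random delay in nanoseconds in [0, max).
     * Returns 0 when max is null, zero, or negative, i.e. jitter disabled.
     */
    static long jitterNanos(Duration max) {
        if (max == null || max.isZero() || max.isNegative()) {
            return 0L;
        }
        // nextLong(bound) returns a value in [0, bound); bound is > 0 here
        return ThreadLocalRandom.current().nextLong(max.toNanos());
    }

    /** Sleeps for the chosen jitter, if any, before a single connect attempt. */
    static void applyPerAttemptJitter(Duration max) throws InterruptedException {
        long nanos = jitterNanos(max);
        if (nanos > 0) {
            TimeUnit.NANOSECONDS.sleep(nanos);
        }
    }
}
```

Treating null, zero, and negative values as "disabled" keeps the default behaviour unchanged, matching the Javadoc on the proposed option. Whether the real implementation should sleep, park, or schedule the attempt is left open here.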

Additionally, in ApConnection.newPassive(), apply the same jitter before the first passive connect() call so that initial passive connections are staggered across application instances:

// In ApConnection.newPassive() — proposed change
private void newPassive() throws InterruptedException {
    if (passive != null) {
        passive.close(false, true);
    }
    applyPerAttemptJitter(); // stagger initial passive connect across instances
    passive = new NatsConnection(passiveOptions);
    passive.connect(true);
    ...
}

Expected benefit

With reconnectAttemptJitter(Duration.ofMillis(100)):

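  • 10 application instances reconnecting simultaneously spread their first broker attempt over a [0, 100ms) window → the broker sees staggered attempts instead of all 10 at the same instant (up to ~10× lower instantaneous load if the random offsets happen to spread evenly)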
  • Passive reconnects triggered by the same LDM signal are staggered before the first attempt, not only after a full failed sweep
  • Initial passive connects on application startup are also staggered
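
To get a rough feel for the spreading effect, the sketch below draws one random offset per client and counts how many attempts fall into the same time bucket. It is a toy model, not a measurement: the class JitterSpreadSketch, the 10 ms bucket size, and the uniform distribution are all assumptions made for illustration.

```java
import java.util.Random;

// Hypothetical toy model of per-attempt jitter; not part of the client.
final class JitterSpreadSketch {
    private JitterSpreadSketch() {}

    /** Draws one start offset in [0, windowMillis) for each client. */
    static long[] drawOffsets(int clients, long windowMillis, Random rng) {
        long[] offsets = new long[clients];
        for (int i = 0; i < clients; i++) {
            // nextDouble() is in [0.0, 1.0), so the offset stays below windowMillis
            offsets[i] = (long) (rng.nextDouble() * windowMillis);
        }
        return offsets;
    }

    /** Returns the largest number of attempts that land in any single bucket. */
    static int peakPerBucket(long[] offsets, long windowMillis, long bucketMillis) {
        int buckets = (int) ((windowMillis + bucketMillis - 1) / bucketMillis);
        int[] counts = new int[buckets];
        int peak = 0;
        for (long offset : offsets) {
            int count = ++counts[(int) (offset / bucketMillis)];
            peak = Math.max(peak, count);
        }
        return peak;
    }

    public static void main(String[] args) {
        long[] offsets = drawOffsets(10, 100, new Random());
        System.out.println("peak attempts per 10ms bucket: "
                + peakPerBucket(offsets, 100, 10));
    }
}
```

Without jitter all 10 attempts land in the same bucket. With uniform random offsets the peak is usually lower than 10 but often above 1, so the ~10× figure is best read as an upper bound on the improvement rather than a guarantee.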
