## Per-Attempt Reconnect Jitter

### Problem

The existing `reconnectJitter` / `reconnectJitterTls` options apply jitter only after a full sweep of the server pool, that is, after `reconnectImplConnect()` has tried every server in the pool once without success. The current `reconnectImplConnect()` loop shows this:

```java
// NatsConnection.reconnectImplConnect() — current behaviour
while ((cur = serverPool.nextServer()) != null) {
    if (first == null) {
        first = cur;
    } else if (first.equals(cur)) {
        // went around the pool an entire time — jitter fires here only
        invokeReconnectDelayHandler(++totalRounds);
    }
    tryToConnect(cur, resolved, NatsSystemClock.nanoTime()); // zero delay per individual attempt
}
```

Per-attempt delay is zero. With a 3-broker pool, every failing client hits broker 1, fails, immediately hits broker 2, fails, immediately hits broker 3, fails, and only then waits for the full-sweep delay (default 2s + jitter). During a rolling restart this creates a synchronized reconnect storm in which all clients hammer each broker at the same moment.

Passive first-attempt problem: When `passiveForceReconnect()` is called (after a lame duck mode (LDM) notice or a crash), `reconnectImplConnect()` immediately tries the first server with zero delay. If many application instances receive the same LDM signal simultaneously, all their passive connections attempt to reconnect to the same first server at the same time, amplifying broker load at the worst possible moment.

`ApConnection.newPassive()` initial connect: The very first passive connection is made via `passive.connect(true)` — the initial connect path, not the reconnect path — so even the existing full-sweep jitter does not apply. The first passive connect always has zero delay regardless of configured options.

### Why the existing ReconnectDelayHandler is not sufficient

`ReconnectDelayHandler.getWaitTime(totalRounds)` is called once per full pool sweep — the same trigger point as `reconnectJitter`. It cannot provide per-attempt delay and does not cover the `newPassive()` initial connect path at all.
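To make the trigger point concrete, the following is a minimal, self-contained sketch that mimics the control flow of the loop shown in the Problem section and counts how often the delay handler would fire compared with the number of connect attempts. The class `SweepVsAttemptCount` and the round-robin stand-in for `serverPool.nextServer()` are illustrative assumptions, not jnats code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: mirrors the first/cur comparison from the
// reconnect loop, with a simple round-robin list standing in for the server pool.
final class SweepVsAttemptCount {
    private SweepVsAttemptCount() {}

    /** Returns {connectAttempts, delayHandlerCalls} after walking the pool. */
    static int[] count(List<String> pool, int totalAttempts) {
        int attempts = 0;
        int handlerCalls = 0;
        String first = null;
        for (int i = 0; i < totalAttempts; i++) {
            String cur = pool.get(i % pool.size()); // stand-in for nextServer()
            if (first == null) {
                first = cur;
            } else if (first.equals(cur)) {
                handlerCalls++; // the only point where a delay would be applied
            }
            attempts++; // every server is still tried without any delay
        }
        return new int[] {attempts, handlerCalls};
    }

    public static void main(String[] args) {
        int[] r = count(Arrays.asList("broker1", "broker2", "broker3"), 9);
        System.out.println(r[0] + " attempts, " + r[1] + " delay-handler calls");
    }
}
```

With a 3-server pool and 9 attempts, the handler fires twice, so most attempts run back to back with no delay in between.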

### Proposed API addition

Add a new `reconnectAttemptJitter` option applied before each individual server attempt in the reconnect loop, independent of the existing full-sweep `reconnectJitter`:

```java
// In Options.Builder
/**
 * Sets the maximum random jitter added before each individual reconnect attempt.
 * A random duration in [0, reconnectAttemptJitter) is applied before each call
 * to tryToConnect() in the reconnect loop, spreading reconnect load across time
 * regardless of pool size.
 *
 * This is separate from reconnectJitter which fires after a full sweep of the pool.
 * Default: 0 (no per-attempt jitter — preserves existing behaviour).
 *
 * @param time maximum per-attempt jitter; null or zero disables it
 */
public Builder reconnectAttemptJitter(Duration time) {
    this.reconnectAttemptJitter = time;
    return this;
}
```

Apply inside `reconnectImplConnect()` before each `tryToConnect()`:

```java
// In NatsConnection.reconnectImplConnect() — proposed change
for (NatsUri resolved : resolvedList) {
    applyPerAttemptJitter(); // sleep [0, reconnectAttemptJitter) if configured
    tryToConnect(cur, resolved, NatsSystemClock.nanoTime());
    ...
}
```
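The snippet above calls `applyPerAttemptJitter()` without defining it. One possible shape for that helper is sketched below, assuming a uniform random draw and a plain sleep. The class name `PerAttemptJitter` and both method names are hypothetical; an actual implementation would presumably live inside `NatsConnection` and read the value from `Options`.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; names and placement are assumptions, not existing jnats code.
final class PerAttemptJitter {
    private PerAttemptJitter() {}

    /** Picks a uniform random delay in [0, maxJitter) nanoseconds, or 0 when disabled. */
    static long nextJitterNanos(Duration maxJitter) {
        if (maxJitter == null || maxJitter.isZero() || maxJitter.isNegative()) {
            return 0L; // null, zero, or negative: jitter is disabled
        }
        return ThreadLocalRandom.current().nextLong(maxJitter.toNanos());
    }

    /** Sleeps for the chosen delay; interruption propagates to the caller. */
    static void apply(Duration maxJitter) throws InterruptedException {
        long nanos = nextJitterNanos(maxJitter);
        if (nanos > 0) {
            TimeUnit.NANOSECONDS.sleep(nanos);
        }
    }
}
```

Letting `InterruptedException` propagate is a design choice in this sketch: it lets a concurrent close or shutdown cut the wait short rather than blocking for the full jitter.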

Additionally, in `ApConnection.newPassive()`, apply the same jitter before the first passive `connect()` call so that initial passive connections are staggered across application instances:

```java
// In ApConnection.newPassive() — proposed change
private void newPassive() throws InterruptedException {
    if (passive != null) {
        passive.close(false, true);
    }
    applyPerAttemptJitter(); // stagger initial passive connect across instances
    passive = new NatsConnection(passiveOptions);
    passive.connect(true);
    ...
}
```

### Expected benefit

With `reconnectAttemptJitter(Duration.ofMillis(100))`:

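- 10 application instances reconnecting simultaneously spread their first broker attempt over a [0, 100ms) window instead of arriving at the same instant, which reduces peak instantaneous broker load by up to ~10× (the exact factor depends on how the random delays happen to fall)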
- Passive reconnects triggered by the same LDM signal are staggered before the first attempt, not only after a full failed sweep
- Initial passive connects on application startup are also staggered


Per-Attempt Reconnect Jitter #1586