Per-Attempt Reconnect Jitter
Problem
The existing reconnectJitter / reconnectJitterTls options apply jitter only after a full sweep of the server pool — i.e., after reconnectImplConnect() has tried every server in the pool once without success. Looking at the current reconnectImplConnect() loop:
// NatsConnection.reconnectImplConnect(): current behaviour
while ((cur = serverPool.nextServer()) != null) {
    if (first == null) {
        first = cur;
    } else if (first.equals(cur)) {
        // went around the pool an entire time; jitter fires here only
        invokeReconnectDelayHandler(++totalRounds);
    }
    tryToConnect(cur, resolved, NatsSystemClock.nanoTime()); // zero delay per individual attempt
}
Per-individual-attempt delay: zero. With a 3-broker pool, all failing clients simultaneously hit broker 1, fail, immediately hit broker 2, fail, immediately hit broker 3, fail — then wait for the full-sweep delay (default 2s + jitter). During a rolling restart this creates a synchronized reconnect storm where every client hammers each broker at the same moment.
Passive first-attempt problem: When passiveForceReconnect() is called (after a lame duck mode (LDM) notification or a crash), reconnectImplConnect() immediately tries the first server with zero delay. If many application instances receive the same LDM signal simultaneously, all their passive connections attempt to reconnect to the same first server at exactly the same time, amplifying broker load at the worst possible moment.
ApConnection.newPassive() initial connect: The very first passive connection is made via passive.connect(true) — the initial connect path, not the reconnect path — so even the existing full-sweep jitter does not apply. The first passive connect always has zero delay regardless of configured options.
Why the existing ReconnectDelayHandler is not sufficient
ReconnectDelayHandler.getWaitTime(totalRounds) is called once per full pool sweep — the same trigger point as reconnectJitter. It cannot provide per-attempt delay and does not cover the newPassive() initial connect path at all.
Proposed API addition
Add a new reconnectAttemptJitter option applied before each individual server attempt in the reconnect loop, independent of the existing full-sweep reconnectJitter:
// In Options.Builder
/**
 * Sets the maximum random jitter added before each individual reconnect attempt.
 * A random duration in [0, reconnectAttemptJitter) is applied before each call
 * to tryToConnect() in the reconnect loop, spreading reconnect load across time
 * regardless of pool size.
 *
 * This is separate from reconnectJitter which fires after a full sweep of the pool.
 * Default: 0 (no per-attempt jitter; preserves existing behaviour).
 *
 * @param time maximum per-attempt jitter; null or zero disables it
 */
public Builder reconnectAttemptJitter(Duration time) {
    this.reconnectAttemptJitter = time;
    return this;
}
Apply inside reconnectImplConnect() before each tryToConnect():
// In NatsConnection.reconnectImplConnect(): proposed change
for (NatsUri resolved : resolvedList) {
    applyPerAttemptJitter(); // sleep [0, reconnectAttemptJitter) if configured
    tryToConnect(cur, resolved, NatsSystemClock.nanoTime());
    ...
}
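The applyPerAttemptJitter() helper is not defined anywhere in this proposal. A minimal sketch of what it could look like follows, written as a standalone class so it can be run in isolation. The class name PerAttemptJitter, the method names, and the choice to treat a negative duration as "disabled" are assumptions made for illustration, not existing client code.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the existing client.
final class PerAttemptJitter {
    private PerAttemptJitter() {}

    /** Picks a uniformly random delay in [0, maxJitter) in nanoseconds, or 0 when disabled. */
    static long nextDelayNanos(Duration maxJitter) {
        // Assumption: null, zero, and negative values all disable per-attempt jitter.
        if (maxJitter == null || maxJitter.isZero() || maxJitter.isNegative()) {
            return 0L;
        }
        // nextLong(bound) returns a value in [0, bound), matching the documented range.
        return ThreadLocalRandom.current().nextLong(maxJitter.toNanos());
    }

    /** Sleeps for the chosen delay; interruption is propagated to the caller. */
    static void apply(Duration maxJitter) throws InterruptedException {
        long nanos = nextDelayNanos(maxJitter);
        if (nanos > 0) {
            Thread.sleep(nanos / 1_000_000L, (int) (nanos % 1_000_000L));
        }
    }
}
```

Propagating InterruptedException instead of swallowing it lets a close or shutdown interrupt a pending jitter sleep, which matches the throws clause already present on newPassive().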
Additionally, in ApConnection.newPassive(), apply the same jitter before the first passive connect() call so that initial passive connections are staggered across application instances:
// In ApConnection.newPassive(): proposed change
private void newPassive() throws InterruptedException {
    if (passive != null) {
        passive.close(false, true);
    }
    applyPerAttemptJitter(); // stagger initial passive connect across instances
    passive = new NatsConnection(passiveOptions);
    passive.connect(true);
    ...
}
Expected benefit
With reconnectAttemptJitter(Duration.ofMillis(100)):
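- 10 application instances reconnecting simultaneously spread their first broker attempt over a [0, 100ms) window, so attempts arrive roughly 10 ms apart on average instead of all at the same instant; if they spread evenly, peak instantaneous broker load drops by about 10×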
- Passive reconnects triggered by the same LDM signal are staggered before the first attempt, not only after a full failed sweep
- Initial passive connects on application startup are also staggered
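The spreading effect described above can be illustrated with a rough simulation. This is a sketch only: it draws random offsets and counts how many attempts land in the same 10 ms bucket, and it does not touch the NATS client or measure real broker load. The class name, the bucket size, and the seed are all arbitrary choices made for the example.

```java
import java.util.Random;

// Illustrative simulation only; names and parameters are hypothetical.
final class JitterSpreadSimulation {
    private JitterSpreadSimulation() {}

    /** Returns the largest number of attempts that land in any single bucket of bucketMillis. */
    static int peakAttemptsPerBucket(int instances, int jitterMillis, int bucketMillis, long seed) {
        Random random = new Random(seed);
        // Number of buckets needed to cover [0, jitterMillis); at least one bucket.
        int bucketCount = Math.max(1, (jitterMillis + bucketMillis - 1) / bucketMillis);
        int[] buckets = new int[bucketCount];
        int peak = 0;
        for (int i = 0; i < instances; i++) {
            // jitterMillis == 0 models the current behaviour: every attempt at offset 0.
            int offset = jitterMillis > 0 ? random.nextInt(jitterMillis) : 0;
            int count = ++buckets[offset / bucketMillis];
            peak = Math.max(peak, count);
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("no jitter, peak per 10 ms:     " + peakAttemptsPerBucket(10, 0, 10, 42L));
        System.out.println("100 ms jitter, peak per 10 ms: " + peakAttemptsPerBucket(10, 100, 10, 42L));
    }
}
```

Without jitter, all 10 attempts fall into one bucket. With jitter the peak depends on the random draw, so the ~10× figure is best read as the ideal case of a perfectly even spread.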