Problem
Requests fail with a timeout error, but the reported timeout often disguises the actual cause. The fix should make the real cause visible; the failure modes are discussed below. All changes must be backward compatible.
Discussion
What actually causes "Timeout or no response waiting for NATS JetStream server" on a JS publish
Traced against the jnats source in this tree (nats.java/src/main/java/io/nats/client/impl/).
The mechanism — why it's a catch-all
The exception is thrown in one place, on one condition:
// NatsJetStreamImpl.responseRequired (line 257)
Message responseRequired(Message respMessage) throws IOException {
    if (respMessage == null) {
        throw new IOException("Timeout or no response waiting for NATS JetStream server");
    }
    return respMessage;
}
So the only trigger is respMessage == null. That null is manufactured one level down, where three distinct failure modes are collapsed into the same null:
// NatsConnection.requestInternal (line 1316-1325)
CompletableFuture<Message> incoming = requestFutureInternal(...);
try {
    return incoming.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
}
catch (TimeoutException | ExecutionException | CancellationException e) {
    return null; // <-- all three look identical upstream
}
- TimeoutException — the PubAck genuinely didn't arrive within the deadline.
- CancellationException — the future was cancel()-ed by something else (reconnect cleanup, 503-in-CANCEL-mode, periodic cleanup of timed-out futures).
- ExecutionException — the future was completed exceptionally.
The text says "Timeout" but two of the three paths are not timeouts at all.
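One backward-compatible way to stop collapsing the three cases is sketched below using only java.util.concurrent. It keeps throwing IOException and keeps the existing text as a prefix, but appends which path was taken and attaches the original exception as the cause. ReplyWaiter, await and LEGACY are hypothetical names, not jnats API, and whether a message suffix is acceptable to existing callers is an assumption to verify.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of jnats): waits on a reply future and keeps
// the three failure modes distinguishable instead of mapping them all to null.
final class ReplyWaiter {
    // Existing message kept as a prefix so callers matching on it keep working.
    static final String LEGACY = "Timeout or no response waiting for NATS JetStream server";

    static <T> T await(CompletableFuture<T> incoming, Duration timeout) throws IOException {
        try {
            return incoming.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
        } catch (TimeoutException e) {
            // The reply really did not arrive before the deadline.
            throw new IOException(LEGACY + " [deadline elapsed after " + timeout.toMillis() + " ms]", e);
        } catch (CancellationException e) {
            // Something else cancelled the future, e.g. reconnect cleanup.
            throw new IOException(LEGACY + " [request cancelled]", e);
        } catch (ExecutionException e) {
            // The future was completed exceptionally; expose the real failure as the cause.
            throw new IOException(LEGACY + " [request failed: " + e.getCause() + "]", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IOException(LEGACY + " [interrupted]", e);
        }
    }
}
```

With this shape, a caller that only checks for IOException or for the old message prefix behaves as before, while logs and getCause() show whether the failure was a deadline, a cancel, or an exceptional completion.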
Important: for the publish path, a 503 is NOT hidden here
publishSyncInternal passes CancelAction.COMPLETE:
// NatsJetStream.publishSyncInternal (line 157)
Message resp = makeInternalRequestResponseRequired(
    subject, merged, data, getTimeout(), CancelAction.COMPLETE, conn.forceFlushOnRequest);
return processPublishResponse(resp, options);
With COMPLETE, a 503 No-Responders status comes back as a real message (not a cancel):
// NatsConnection.deliverReply (line 1460-1471)
if (msg.isStatusMessage() && msg.getStatus().getCode() == 503) {
    switch (f.getCancelAction()) {
        case COMPLETE: f.complete(msg); break; // <-- publish path
        case REPORT: f.completeExceptionally(new JetStreamStatusException(...)); break;
        case CANCEL: default: f.cancel(true); // <-- core request(); becomes null/"Timeout"
    }
}
...and processPublishResponse turns that into a different exception:
if (resp.isStatusMessage()) {
    throw new IOException("Error Publishing: " + resp.getStatus().getMessageWithCode());
}
So on the current publish path, a no-responders 503 surfaces as "Error Publishing: 503 No Responders", not as "Timeout or no response". (The CANCEL→null→"Timeout" hiding of 503 is what bites the plain connection.request() / many JS management calls — which is likely the path where you had to add explicit 503 handling.)
That means: if the publish is throwing the Timeout message specifically, it is genuinely respMessage == null, i.e. a timeout, a reconnect cancel, or an exceptional completion. The causes below are listed in rough order of how strongly each one "looks like a server timeout but isn't."
Causes, prioritized
1. Disconnect / reconnect mid-request (cancel disguised as timeout): top suspect for intermittent failures
On any connection drop, every in-flight request future is cancelled:
// NatsConnection, disconnect path (line 891) -> cleanResponses(true)
// cleanResponses (line 1231-1233)
else if (closing) {
    remove = true;
    future.cancelClosing(); // -> CancellationException -> null -> "Timeout or no response"
}
A network blip, server restart, rolling cluster upgrade, LB idle-timeout, or a brief disconnect drops the in-flight PubAck future and you get this immediately (not after the full timeout). Correlate the exception timestamps with ConnectionListener DISCONNECTED/RECONNECTED events. If the timeouts cluster around reconnects, this is it.
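The correlation step can be automated with a small in-memory timeline, sketched here with only the JDK. ConnectionTimeline, record and nearEvent are hypothetical names; the assumption is that record would be called from a ConnectionListener callback on DISCONNECTED/RECONNECTED events (that wiring is not shown), and that nearEvent would be consulted where the "Timeout" exception is caught.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: remembers recent connection events so a failure
// timestamp can be checked against them.
final class ConnectionTimeline {
    private static final int MAX_EVENTS = 256; // arbitrary bound for the sketch
    private final Deque<Instant> events = new ArrayDeque<>();

    // Record the time of a disconnect or reconnect; the oldest entry is
    // evicted once the buffer is full.
    synchronized void record(Instant when) {
        if (events.size() == MAX_EVENTS) {
            events.removeFirst();
        }
        events.addLast(when);
    }

    // True if any recorded event lies within `window` of the failure time,
    // in either direction.
    synchronized boolean nearEvent(Instant failure, Duration window) {
        for (Instant e : events) {
            if (Duration.between(e, failure).abs().compareTo(window) <= 0) {
                return true;
            }
        }
        return false;
    }
}
```

A failure that arrives well before the configured request timeout and for which nearEvent returns true is a strong hint that the future was cancelled by reconnect cleanup rather than timed out.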
2. Permissions violation (full-duration timeout, error goes to a different channel)
If the account/user lacks:
- publish permission on the target subject, or
- subscribe permission on the reply inbox (_INBOX.> or the configured inbox prefix),
then the server emits a -ERR 'Permissions Violation' on the protocol and never delivers a PubAck. The request waits out the full timeout and throws "Timeout." The actual cause only appears in the ErrorListener (errorOccurred). If you don't have an ErrorListener wired up, you're blind to this. The inbox-subscribe case is especially sneaky: publishing looks allowed, you just never receive the ack.
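Once an ErrorListener is wired up, the error text can be classified so that a permissions problem is reported as such. The sketch below is JDK-only and assumes the server text resembles `Permissions Violation for Publish to "subject"` or `Permissions Violation for Subscription to "subject"`; that wording is an assumption to check against your server version. PermissionErrors and describe are hypothetical names.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the operation and subject from a
// permissions-violation error string handed to an error callback.
final class PermissionErrors {
    // Assumed shape of the server text; the quotes around the subject are optional here.
    private static final Pattern VIOLATION = Pattern.compile(
        "permissions violation for (publish|subscription) to \"?([^\"']+)\"?",
        Pattern.CASE_INSENSITIVE);

    // Returns "publish <subject>" or "subscription <subject>", or empty
    // when the text is not a permissions violation.
    static Optional<String> describe(String serverError) {
        if (serverError == null) {
            return Optional.empty();
        }
        Matcher m = VIOLATION.matcher(serverError);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT) + " " + m.group(2).trim());
    }
}
```

A result starting with "subscription" whose subject matches the inbox prefix points at the inbox-subscribe case, where the publish itself succeeds but the ack can never be delivered.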
3. Clustered JetStream with no stream leader (transient)
In R3/clustered JS, during leader election (node loss, rolling restart, lost quorum) the stream's RAFT group has no leader and won't ack until one is elected. Publishes in that window time out for real. Correlated with cluster events; clears on its own. Server logs show election / "no leader for stream."
4. Inbox-dispatcher slow consumer under high concurrency
All requests on a connection share one internal inbox dispatcher (mainInbox). Under a flood of concurrent synchronous publishes (note your stack runs on VirtualThread — easy to spin up thousands), reply delivery can back up past the dispatcher's pending-message limit and replies get dropped as a slow consumer. The dropped reply → future never completes → timeout. Symptom: timeouts appear only under load, scale with publish concurrency. Mitigations: bound in-flight publishes, prefer publishAsync with a bounded window, raise pending limits, or spread load over more connections.
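The first mitigation, bounding in-flight publishes, can be sketched with a plain Semaphore. BoundedPublisher and call are hypothetical names; the Callable stands in for whatever synchronous publish the application performs, and the right bound relative to the dispatcher's pending limit is something to tune rather than a known value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: caps how many publish calls may be waiting for an
// ack at the same time, regardless of how many threads call in.
final class BoundedPublisher {
    private final Semaphore permits;

    BoundedPublisher(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Runs `publish` only while fewer than maxInFlight calls are active.
    // Fails fast with a distinct exception if no slot frees up in time, so
    // local back-pressure is not mistaken for a server timeout.
    <T> T call(Callable<T> publish, long waitMillis) throws Exception {
        if (!permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("no publish slot free within " + waitMillis + " ms");
        }
        try {
            return publish.call();
        } finally {
            permits.release(); // always return the slot, even on failure
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```

Because the wait for a slot blocks only the calling thread, this fits virtual threads well: thousands may queue on the semaphore while only a bounded number have replies outstanding on the shared inbox.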
5. The deadline is just short (default = 2s)
JetStreamOptions.DEFAULT_TIMEOUT = Options.DEFAULT_CONNECTION_TIMEOUT = Duration.ofSeconds(2). Two seconds covers a lot, but a GC pause, an fsync stall on FileStore, R3 replication lag, or a momentary server CPU spike can blow past it. This overlaps with "real" timeouts but the root cause is a tunable, not the network. Raise JetStreamOptions.builder().requestTimeout(...) and see if the timeouts disappear.
6. Genuinely slow server / storage
Disk full or slow disk, fsync-heavy storage, replication lag in R3/R5, account at maxStore/maxMemory (though hitting a limit usually returns an error PubAck → "Error Publishing", not a timeout), server overloaded. The legitimate case — but rule out 1–5 first.
7. Max payload violation
Message larger than server max_payload → server sends -ERR 'Maximum Payload Violation' and typically closes the connection → no ack → timeout (plus a reconnect, which loops back to cause #1). Check message sizes against the server's max_payload.
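A size check before publishing turns this case into an immediate, clearly labeled failure. The sketch below is JDK-only; PayloadGuard and check are hypothetical names, and the limit is assumed to be supplied by the caller (for example from the value the connected server advertises). It checks the body only, so headers would need to be accounted for separately, and the client library may already perform a similar check depending on its options.

```java
// Hypothetical helper: rejects an oversized body before it reaches the wire.
final class PayloadGuard {
    // maxPayload <= 0 is treated as "limit unknown" and skips the check.
    static void check(byte[] data, long maxPayload) {
        int size = (data == null) ? 0 : data.length;
        if (maxPayload > 0 && size > maxPayload) {
            throw new IllegalArgumentException(
                "payload of " + size + " bytes exceeds max_payload of " + maxPayload + " bytes");
        }
    }
}
```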
8. Client-side starvation (timeout without server involvement)
If the JVM is GC-thrashing or thread-starved, future.get(timeout) can expire even though the PubAck is sitting in the socket buffer. The "timeout" is entirely client-side. Watch for GC logs / event-loop starvation correlating with the exceptions. (Heavy work on message-handler threads is a common trigger — your stack is doing processMessage → sendNotification → publish synchronously inside a subscription callback.)
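One way to keep heavy work off the message-handler thread is to hand each message to a bounded worker pool, sketched here with the JDK executor classes. CallbackOffload and its methods are hypothetical names, the Consumer stands in for the processMessage → sendNotification → publish chain, and the pool and queue sizes are placeholders.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: moves slow per-message work off the thread that
// delivers subscription callbacks.
final class CallbackOffload<M> {
    private final ExecutorService workers;
    private final Consumer<M> slowWork;

    CallbackOffload(int threads, int queueSize, Consumer<M> slowWork) {
        // Bounded queue; when it is full, CallerRunsPolicy runs the task on
        // the submitting thread, which slows the callback instead of dropping work.
        this.workers = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize), new ThreadPoolExecutor.CallerRunsPolicy());
        this.slowWork = slowWork;
    }

    // Intended to be called from the subscription callback; returns as soon
    // as the task is queued (unless the queue is full, see above).
    void onMessage(M msg) {
        workers.execute(() -> slowWork.accept(msg));
    }

    // Stops accepting work and waits for queued tasks to finish.
    boolean shutdownAndWait(long millis) throws InterruptedException {
        workers.shutdown();
        return workers.awaitTermination(millis, TimeUnit.MILLISECONDS);
    }
}
```

The trade-off is ordering: with more than one worker thread, messages are no longer processed strictly in arrival order, so a single worker is the safer choice when order matters.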