Address generic Timeout or no response waiting for NATS JetStream server #1581

@scottf

Problem

Requests often fail with a timeout whose message disguises the actual cause. The goal is to surface the real cause, as discussed below. All changes must be backward compatible.

Discussion

What actually causes "Timeout or no response waiting for NATS JetStream server" on a JS publish

Traced against the jnats source in this tree (nats.java/src/main/java/io/nats/client/impl/).

The mechanism — why it's a catch-all

The exception is thrown in one place, on one condition:

// NatsJetStreamImpl.responseRequired (line 257)
Message responseRequired(Message respMessage) throws IOException {
    if (respMessage == null) {
        throw new IOException("Timeout or no response waiting for NATS JetStream server");
    }
    return respMessage;
}

So the only trigger is respMessage == null. That null is manufactured one level down, where three distinct failure modes are collapsed into the same null:

// NatsConnection.requestInternal (line 1316-1325)
CompletableFuture<Message> incoming = requestFutureInternal(...);
try {
    return incoming.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
}
catch (TimeoutException | ExecutionException | CancellationException e) {
    return null;   // <-- all three look identical upstream
}

The three exceptions mean different things:

  • TimeoutException: the PubAck genuinely didn't arrive within the deadline.
  • CancellationException: the future was cancel()-ed by something else (reconnect cleanup, 503-in-CANCEL-mode, periodic cleanup of timed-out futures).
  • ExecutionException: the future was completed exceptionally.

The text says "Timeout" but two of the three paths are not timeouts at all.
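To make the difference concrete, here is a minimal JDK-only sketch that waits on a future the same way but reports which branch was hit instead of returning null. The class `RequestOutcomes`, its `Outcome` enum, and the `await` method are hypothetical names for illustration; none of this is jnats code.

```java
import java.time.Duration;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration, not jnats code: keeps the failure modes apart.
final class RequestOutcomes {
    enum Outcome { COMPLETED, TIMED_OUT, CANCELLED, FAILED, INTERRUPTED }

    // Waits on the future with a deadline and reports which path was taken.
    static Outcome await(CompletableFuture<?> incoming, Duration timeout) {
        try {
            incoming.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            return Outcome.TIMED_OUT;    // the deadline really expired
        } catch (CancellationException e) {
            return Outcome.CANCELLED;    // something called cancel() on the future
        } catch (ExecutionException e) {
            return Outcome.FAILED;       // completeExceptionally(...) was called
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Outcome.INTERRUPTED;
        }
    }
}
```

A future that was cancelled before the call yields `CANCELLED` immediately, while one that is never completed yields `TIMED_OUT` only after the full deadline.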

Important: for the publish path, a 503 is NOT hidden here

publishSyncInternal passes CancelAction.COMPLETE:

// NatsJetStream.publishSyncInternal (line 157)
Message resp = makeInternalRequestResponseRequired(
    subject, merged, data, getTimeout(), CancelAction.COMPLETE, conn.forceFlushOnRequest);
return processPublishResponse(resp, options);

With COMPLETE, a 503 No-Responders status comes back as a real message (not a cancel):

// NatsConnection.deliverReply (line 1460-1471)
if (msg.isStatusMessage() && msg.getStatus().getCode() == 503) {
    switch (f.getCancelAction()) {
        case COMPLETE: f.complete(msg); break;                 // <-- publish path
        case REPORT:   f.completeExceptionally(new JetStreamStatusException(...)); break;
        case CANCEL:   default: f.cancel(true);                // <-- core request(); becomes null -> "Timeout"
    }
}

...and processPublishResponse turns that into a different exception:

if (resp.isStatusMessage()) {
    throw new IOException("Error Publishing: " + resp.getStatus().getMessageWithCode());
}

So on the current publish path, a no-responders 503 surfaces as "Error Publishing: 503 No Responders", not as "Timeout or no response". (The CANCEL→null→"Timeout" hiding of 503 is what bites the plain connection.request() / many JS management calls — which is likely the path where you had to add explicit 503 handling.)
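Until the library reports these cases separately, a caller can at least sort publish failures by the two message texts quoted above. The following is a rough sketch under that assumption; `PublishFailures` and its `Kind` enum are hypothetical names, and matching on exception text is brittle, so treat it as a diagnostic aid rather than a contract.

```java
import java.io.IOException;

// Hypothetical helper: sorts publish failures by the message texts quoted in this issue.
final class PublishFailures {
    enum Kind { NO_RESPONSE, ERROR_STATUS, OTHER }

    static final String NO_RESPONSE_TEXT =
        "Timeout or no response waiting for NATS JetStream server";
    static final String STATUS_PREFIX = "Error Publishing: ";

    static Kind classify(IOException e) {
        String msg = e.getMessage();
        if (msg == null) {
            return Kind.OTHER;
        }
        if (msg.equals(NO_RESPONSE_TEXT)) {
            return Kind.NO_RESPONSE;   // respMessage == null: timeout, cancel, or exceptional completion
        }
        if (msg.startsWith(STATUS_PREFIX)) {
            return Kind.ERROR_STATUS;  // a status reply such as the 503 reached processPublishResponse
        }
        return Kind.OTHER;
    }
}
```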

That means: if the publish is throwing the Timeout message specifically, respMessage really was null, so the cause is a timeout, a reconnect cancel, or an exceptional completion. The causes below are listed in rough order of "looks like a server timeout but isn't."

Causes, prioritized

1. Disconnect / reconnect mid-request (cancel disguised as timeout): top suspect for intermittent failures

On any connection drop, every in-flight request future is cancelled:

// NatsConnection, disconnect path (line 891)  ->  cleanResponses(true)
// cleanResponses (line 1231-1233)
else if (closing) {
    remove = true;
    future.cancelClosing();   // -> CancellationException -> null -> "Timeout or no response"
}

A network blip, server restart, rolling cluster upgrade, LB idle-timeout, or a brief disconnect drops the in-flight PubAck future and you get this immediately (not after the full timeout). Correlate the exception timestamps with ConnectionListener DISCONNECTED/RECONNECTED events. If the timeouts cluster around reconnects, this is it.
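The correlation itself can be scripted once both sets of timestamps are collected. This is a minimal sketch assuming you log an Instant for each exception and for each DISCONNECTED/RECONNECTED event; `ReconnectCorrelation` and `failuresNearEvents` are hypothetical names, and how the timestamps are captured is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper: how many failures happened close to a connection event?
final class ReconnectCorrelation {
    // Counts failures that occurred within `window` (before or after) of any connection event.
    static long failuresNearEvents(List<Instant> failures,
                                   List<Instant> connectionEvents,
                                   Duration window) {
        return failures.stream()
            .filter(f -> connectionEvents.stream()
                .anyMatch(ev -> Duration.between(ev, f).abs().compareTo(window) <= 0))
            .count();
    }
}
```

If most failures fall inside a window of a few seconds around connection events, cause 1 is the likely explanation; if almost none do, move on down the list.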

2. Permissions violation (full-duration timeout, error goes to a different channel)

If the account/user lacks:

  • publish permission on the target subject, or
  • subscribe permission on the reply inbox (_INBOX.> or the configured inbox prefix),

then the server emits a -ERR 'Permissions Violation' on the protocol and never delivers a PubAck. The request waits out the full timeout and throws "Timeout." The actual cause only appears in the ErrorListener (errorOccurred). If you don't have an ErrorListener wired up, you're blind to this. The inbox-subscribe case is especially sneaky: publishing looks allowed, you just never receive the ack.
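The inbox case can be checked offline by testing the reply subject against the subscribe patterns the user is allowed. Here is a small sketch of NATS subject wildcard matching (`*` matches exactly one token, `>` matches one or more trailing tokens); `SubjectMatch` is a hypothetical helper, it only models allow patterns, and it ignores deny lists.

```java
// Hypothetical helper: NATS-style subject matching for checking allow patterns.
final class SubjectMatch {
    // Returns true if `subject` is covered by `pattern`.
    // '*' matches exactly one token; '>' matches one or more tokens and must be last.
    static boolean matches(String pattern, String subject) {
        String[] p = pattern.split("\\.");
        String[] s = subject.split("\\.");
        for (int i = 0; i < p.length; i++) {
            if (p[i].equals(">")) {
                return i == p.length - 1 && s.length > i;
            }
            if (i >= s.length) {
                return false;               // subject has fewer tokens than the pattern
            }
            if (!p[i].equals("*") && !p[i].equals(s[i])) {
                return false;               // literal token mismatch
            }
        }
        return p.length == s.length;        // no trailing tokens left unmatched
    }
}
```

For example, a reply subject under a custom inbox prefix such as `svc.inbox.abc` (a made-up prefix) is not covered by an allow pattern of `_INBOX.>`, which is exactly the situation where the publish is accepted but the ack never arrives.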

3. Clustered JetStream with no stream leader (transient)

In R3/clustered JS, during leader election (node loss, rolling restart, lost quorum) the stream's RAFT group has no leader and won't ack until one is elected. Publishes in that window time out for real. Correlated with cluster events; clears on its own. Server logs show election / "no leader for stream."

4. Inbox-dispatcher slow consumer under high concurrency

All requests on a connection share one internal inbox dispatcher (mainInbox). Under a flood of concurrent synchronous publishes (note your stack runs on VirtualThread — easy to spin up thousands), reply delivery can back up past the dispatcher's pending-message limit and replies get dropped as a slow consumer. The dropped reply → future never completes → timeout. Symptom: timeouts appear only under load, scale with publish concurrency. Mitigations: bound in-flight publishes, prefer publishAsync with a bounded window, raise pending limits, or spread load over more connections.
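One way to bound in-flight publishes is a semaphore around whatever call returns the future. The sketch below is JDK-only and assumes the async operation hands back a CompletableFuture; `BoundedInFlight` is a hypothetical class, not part of jnats.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: caps the number of async operations awaiting a reply.
final class BoundedInFlight {
    private final Semaphore permits;

    BoundedInFlight(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller while maxInFlight operations are pending, then starts one more.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> asyncOp) {
        permits.acquireUninterruptibly();
        CompletableFuture<T> future;
        try {
            future = asyncOp.get();
        } catch (RuntimeException e) {
            permits.release();          // the operation never started, so give the permit back
            throw e;
        }
        // Release the permit whether the operation succeeds, fails, or is cancelled.
        return future.whenComplete((result, error) -> permits.release());
    }

    int available() {
        return permits.availablePermits();
    }
}
```

With jnats this would presumably be used as `bounded.submit(() -> js.publishAsync(subject, data))`. The window size is a tuning choice: it should stay well below the dispatcher's pending-message limit so replies cannot pile up faster than they are consumed.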

5. The deadline is just short (default = 2s)

JetStreamOptions.DEFAULT_TIMEOUT = Options.DEFAULT_CONNECTION_TIMEOUT = Duration.ofSeconds(2). Two seconds covers a lot, but a GC pause, an fsync stall on FileStore, R3 replication lag, or a momentary server CPU spike can blow past it. This overlaps with "real" timeouts, but the root cause is a tunable, not the network. Raise the timeout with JetStreamOptions.builder().requestTimeout(...) and see if the failures disappear.

6. Genuinely slow server / storage

Disk full or slow disk, fsync-heavy storage, replication lag in R3/R5, account at maxStore/maxMemory (though hitting a limit usually returns an error PubAck → "Error Publishing", not a timeout), server overloaded. The legitimate case — but rule out 1–5 first.

7. Max payload violation

Message larger than server max_payload → server sends -ERR 'Maximum Payload Violation' and typically closes the connection → no ack → timeout (plus a reconnect, which loops back to cause #1). Check message sizes against the server's max_payload.

8. Client-side starvation (timeout without server involvement)

If the JVM is GC-thrashing or thread-starved, future.get(timeout) can expire even though the PubAck is sitting in the socket buffer. The "timeout" is entirely client-side. Watch for GC logs / event-loop starvation correlating with the exceptions. (Heavy work on message-handler threads is a common trigger: your stack is doing processMessage → sendNotification → publish synchronously inside a subscription callback.)
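A common way to take heavy work off the callback thread is to hand each message to a separate executor and return right away. This is a minimal JDK-only sketch of that handoff; `OffloadingHandler` is a hypothetical class, and the `slowWork` consumer stands in for whatever the callback currently does inline, including the synchronous publish.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Hypothetical helper: keeps the delivery thread free by running slow work elsewhere.
final class OffloadingHandler<M> {
    private final ExecutorService workers;
    private final Consumer<M> slowWork;

    OffloadingHandler(ExecutorService workers, Consumer<M> slowWork) {
        this.workers = workers;
        this.slowWork = slowWork;
    }

    // Called on the delivery thread: schedules the work and returns without waiting for it.
    CompletableFuture<Void> onMessage(M msg) {
        return CompletableFuture.runAsync(() -> slowWork.accept(msg), workers);
    }
}
```

The returned future lets the caller log or count failures from the offloaded work. The executor should be bounded, otherwise the backlog just moves from the dispatcher into the executor's queue.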
