Address generic Timeout or no response waiting for NATS JetStream server

## Problem

Requests fail with a generic timeout error, but the message often disguises the actual cause. The fix should surface the real cause, as discussed below. All changes must be backward compatible.

## Discussion

### What actually causes "Timeout or no response waiting for NATS JetStream server" on a JS publish

Traced against the jnats source in this tree (`nats.java/src/main/java/io/nats/client/impl/`).

### The mechanism — why it's a catch-all

The exception is thrown in one place, on one condition:

```java
// NatsJetStreamImpl.responseRequired (line 257)
Message responseRequired(Message respMessage) throws IOException {
    if (respMessage == null) {
        throw new IOException("Timeout or no response waiting for NATS JetStream server");
    }
    return respMessage;
}
```

So the *only* trigger is `respMessage == null`. That null is manufactured one level down, where **three distinct failure modes are collapsed into the same null**:

```java
// NatsConnection.requestInternal (line 1316-1325)
CompletableFuture<Message> incoming = requestFutureInternal(...);
try {
    return incoming.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
}
catch (TimeoutException | ExecutionException | CancellationException e) {
    return null;   // <-- all three look identical upstream
}
```

- **TimeoutException** — the PubAck genuinely didn't arrive within the deadline.
- **CancellationException** — the future was `cancel()`-ed by something else (reconnect cleanup, 503-in-CANCEL-mode, periodic cleanup of timed-out futures).
- **ExecutionException** — the future was completed exceptionally.

The text says "Timeout" but two of the three paths are not timeouts at all.
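One way to keep the three outcomes apart is to catch each exception type separately and return a small result object instead of `null`. The sketch below is a minimal illustration using only the JDK; `ReplyOutcome`, its `Kind` enum, and `await` are hypothetical names and not part of jnats, and `Object` stands in for `io.nats.client.Message`.

```java
import java.time.Duration;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of jnats: records why a reply future produced no message.
final class ReplyOutcome {
    enum Kind { REPLIED, TIMED_OUT, CANCELLED, FAILED }

    final Kind kind;
    final Object reply;     // stands in for io.nats.client.Message; set only for REPLIED
    final Throwable cause;  // set only for FAILED

    private ReplyOutcome(Kind kind, Object reply, Throwable cause) {
        this.kind = kind;
        this.reply = reply;
        this.cause = cause;
    }

    // Waits for the reply and keeps the failure mode instead of collapsing it to null.
    static ReplyOutcome await(CompletableFuture<?> incoming, Duration timeout) {
        try {
            Object reply = incoming.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
            return new ReplyOutcome(Kind.REPLIED, reply, null);
        } catch (TimeoutException e) {
            return new ReplyOutcome(Kind.TIMED_OUT, null, null);   // deadline expired
        } catch (CancellationException e) {
            return new ReplyOutcome(Kind.CANCELLED, null, null);   // future was cancel()-ed
        } catch (ExecutionException e) {
            return new ReplyOutcome(Kind.FAILED, null, e.getCause()); // completed exceptionally
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                    // preserve interrupt status
            return new ReplyOutcome(Kind.FAILED, null, e);
        }
    }
}
```

A caller could then build a different exception message per `Kind` while still throwing `IOException`, which would keep existing catch blocks working.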

### Important: for the *publish* path, a 503 is NOT hidden here

`publishSyncInternal` passes `CancelAction.COMPLETE`:

```java
// NatsJetStream.publishSyncInternal (line 157)
Message resp = makeInternalRequestResponseRequired(
    subject, merged, data, getTimeout(), CancelAction.COMPLETE, conn.forceFlushOnRequest);
return processPublishResponse(resp, options);
```

With `COMPLETE`, a 503 No-Responders status comes back as a *real message* (not a cancel):

```java
// NatsConnection.deliverReply (line 1460-1471)
if (msg.isStatusMessage() && msg.getStatus().getCode() == 503) {
    switch (f.getCancelAction()) {
        case COMPLETE: f.complete(msg); break;                 // <-- publish path
        case REPORT:   f.completeExceptionally(new JetStreamStatusException(...)); break;
        case CANCEL:   default: f.cancel(true);                // <-- core request() path; becomes null/"Timeout"
    }
}
```

...and `processPublishResponse` turns that into a different exception:

```java
if (resp.isStatusMessage()) {
    throw new IOException("Error Publishing: " + resp.getStatus().getMessageWithCode());
}
```

**So on the current publish path, a no-responders 503 surfaces as `"Error Publishing: 503 No Responders"`, not as `"Timeout or no response"`.** (The CANCEL→null→"Timeout" hiding of 503 is what bites the *plain* `connection.request()` / many JS *management* calls — which is likely the path where you had to add explicit 503 handling.)

That means: if the publish throws the **Timeout** message specifically, `respMessage` really was `null`, i.e. a timeout, a reconnect cancel, or an exceptional completion. The causes below are listed in rough order of "looks like a server timeout but isn't."
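Until the library distinguishes these cases itself, a caller can at least separate the two messages quoted above when logging or retrying. The following is a rough sketch under the assumption that the message prefixes stay as shown in this issue; `PublishFailure` and `classify` are hypothetical names, and matching on exception text is brittle by nature.

```java
import java.io.IOException;

// Hypothetical caller-side helper; the matched prefixes are the two messages quoted in this issue.
final class PublishFailure {
    enum Kind { NO_RESPONDERS, NO_REPLY, OTHER }

    static Kind classify(IOException e) {
        String msg = e.getMessage();
        if (msg == null) {
            return Kind.OTHER;
        }
        if (msg.startsWith("Error Publishing: 503")) {
            return Kind.NO_RESPONDERS;   // the 503 status came back as a real message
        }
        if (msg.startsWith("Timeout or no response")) {
            return Kind.NO_REPLY;        // respMessage was null: timeout, cancel, or exceptional completion
        }
        return Kind.OTHER;
    }
}
```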

### Causes, prioritized

#### 1. Disconnect / reconnect mid-request (cancel disguised as timeout; top suspect for intermittent failures)
On any connection drop, every in-flight request future is cancelled:

```java
// NatsConnection, disconnect path (line 891)  ->  cleanResponses(true)
// cleanResponses (line 1231-1233)
else if (closing) {
    remove = true;
    future.cancelClosing();   // -> CancellationException -> null -> "Timeout or no response"
}
```

A network blip, server restart, rolling cluster upgrade, LB idle-timeout, or a brief disconnect drops the in-flight PubAck future and you get this *immediately* (not after the full timeout). Correlate the exception timestamps with `ConnectionListener` DISCONNECTED/RECONNECTED events. If the timeouts cluster around reconnects, this is it.
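The correlation step can be automated with a small in-memory log of disconnect times. This is a sketch only: `DisconnectLog` and its methods are hypothetical names, and it assumes `recordDisconnect` is called from your own `ConnectionListener` when a DISCONNECTED event arrives, and `nearDisconnect` from wherever the publish exception is caught.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical helper: remembers recent disconnect times so a failure can be checked against them.
final class DisconnectLog {
    private final ConcurrentLinkedDeque<Instant> disconnects = new ConcurrentLinkedDeque<>();
    private final int capacity;

    DisconnectLog(int capacity) {
        this.capacity = capacity;
    }

    // Assumed to be called from a ConnectionListener on a DISCONNECTED event.
    void recordDisconnect(Instant when) {
        disconnects.addLast(when);
        while (disconnects.size() > capacity) {
            disconnects.pollFirst();   // drop the oldest entry once the log is full
        }
    }

    // True when some recorded disconnect lies within `window` of the failure time.
    boolean nearDisconnect(Instant failure, Duration window) {
        for (Instant d : disconnects) {
            if (Duration.between(d, failure).abs().compareTo(window) <= 0) {
                return true;
            }
        }
        return false;
    }
}
```

If most "Timeout" exceptions report `nearDisconnect(...) == true` with a window of a few hundred milliseconds, that points at this cause rather than a slow server.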

#### 2. Permissions violation (full-duration timeout, error goes to a different channel)
If the account/user lacks:
- **publish** permission on the target subject, or
- **subscribe** permission on the reply inbox (`_INBOX.>` or the configured inbox prefix),

then the server emits a `-ERR 'Permissions Violation'` on the protocol and **never delivers a PubAck**. The request waits out the full timeout and throws "Timeout." The actual cause only appears in the **ErrorListener** (`errorOccurred`). If you don't have an ErrorListener wired up, you're blind to this. The inbox-subscribe case is especially sneaky: publishing looks allowed, you just never receive the ack.
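When the allow lists from the server configuration are at hand, both subjects can be checked before blaming the network. The sketch below implements standard NATS wildcard matching (`*` matches exactly one token, a trailing `>` matches one or more tokens); `SubjectMatch` is a hypothetical helper, and it deliberately ignores deny lists, so a `true` result is necessary but not sufficient.

```java
import java.util.List;

// Hypothetical helper: checks a concrete subject against NATS-style allow patterns.
final class SubjectMatch {
    // '*' matches exactly one token; '>' must be the last token and matches one or more tokens.
    static boolean matches(String pattern, String subject) {
        String[] p = pattern.split("\\.", -1);
        String[] s = subject.split("\\.", -1);
        for (int i = 0; i < p.length; i++) {
            if (p[i].equals(">")) {
                return i == p.length - 1 && s.length > i;
            }
            if (i >= s.length) {
                return false;              // pattern has more tokens than the subject
            }
            if (!p[i].equals("*") && !p[i].equals(s[i])) {
                return false;              // literal token mismatch
            }
        }
        return p.length == s.length;       // no wildcard tail, so lengths must agree
    }

    // True when at least one allow pattern covers the subject (deny lists are not modelled).
    static boolean anyMatch(List<String> allowed, String subject) {
        for (String pattern : allowed) {
            if (matches(pattern, subject)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, a subscribe allow list containing only `orders.>` does not cover a reply subject such as `_INBOX.abc.123`, which is exactly the sneaky case described above.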

#### 3. Clustered JetStream with no stream leader (transient)
In R3/clustered JS, during leader election (node loss, rolling restart, lost quorum) the stream's RAFT group has no leader and won't ack until one is elected. Publishes in that window time out for real. Correlated with cluster events; clears on its own. Server logs show election / "no leader for stream."

#### 4. Inbox-dispatcher slow consumer under high concurrency
All requests on a connection share **one** internal inbox dispatcher (`mainInbox`). Under a flood of concurrent synchronous publishes (note your stack runs on `VirtualThread` — easy to spin up thousands), reply delivery can back up past the dispatcher's pending-message limit and replies get **dropped as a slow consumer**. The dropped reply → future never completes → timeout. Symptom: timeouts appear only under load, scale with publish concurrency. Mitigations: bound in-flight publishes, prefer `publishAsync` with a bounded window, raise pending limits, or spread load over more connections.
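The "bounded window" mitigation can be sketched with a plain `Semaphore` around any operation that returns a `CompletableFuture`. `BoundedWindow` is a hypothetical wrapper, not a jnats class; it assumes the wrapped operation (for example a `publishAsync` call) eventually completes its future, since the permit is only returned at that point.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical wrapper: caps how many async operations are outstanding at once.
final class BoundedWindow {
    private final Semaphore permits;

    BoundedWindow(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller while the window is full, then starts the operation.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> op) {
        permits.acquireUninterruptibly();
        CompletableFuture<T> started;
        try {
            started = op.get();
        } catch (RuntimeException e) {
            permits.release();             // the operation never started, so free the slot
            throw e;
        }
        // Free the slot when the operation finishes, whether it succeeded or failed.
        return started.whenComplete((result, error) -> permits.release());
    }

    // Number of free slots; useful for metrics and tests.
    int available() {
        return permits.availablePermits();
    }
}
```

Usage would look roughly like `window.submit(() -> js.publishAsync(subject, data))`, with the window size chosen well below the dispatcher's pending-message limit.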

#### 5. The deadline is just short (default = 2s)
`JetStreamOptions.DEFAULT_TIMEOUT = Options.DEFAULT_CONNECTION_TIMEOUT = Duration.ofSeconds(2)`. Two seconds covers a lot, but a GC pause, a fsync stall on `FileStore`, R3 replication lag, or a momentary server CPU spike can blow past it. This overlaps with "real" timeouts but the root cause is a tunable, not the network. Raise `JetStreamOptions.builder().requestTimeout(...)` and see if it disappears.

#### 6. Genuinely slow server / storage
Disk full or slow disk, fsync-heavy storage, replication lag in R3/R5, account at `maxStore`/`maxMemory` (though hitting a *limit* usually returns an error PubAck → "Error Publishing", not a timeout), server overloaded. The legitimate case — but rule out 1–5 first.

#### 7. Max payload violation
Message larger than server `max_payload` → server sends `-ERR 'Maximum Payload Violation'` and typically **closes the connection** → no ack → timeout (plus a reconnect, which loops back to cause #1). Check message sizes against the server's `max_payload`.

#### 8. Client-side starvation (timeout without server involvement)
If the JVM is GC-thrashing or thread-starved, `future.get(timeout)` can expire even though the PubAck is sitting in the socket buffer. The "timeout" is entirely client-side. Watch for GC logs / event-loop starvation correlating with the exceptions. (Heavy work on message-handler threads is a common trigger — your stack is doing `processMessage` → `sendNotification` → publish synchronously inside a subscription callback.)

