SDK
All
SDK Version
newest
Operating System
No response
Description
When Arrow Flight ingestion receives a retryable server error such as RESOURCE_EXHAUSTED, recovery starts, but new ingest_batch() calls can return an error during the reconnect window.
This is ambiguous because ingest_batch() adds the batch to pending_batches before checking whether the sender is available. If batch_tx is temporarily None, the caller receives an error, but the batch may still be retained and replayed during recovery. The caller cannot know whether it should retry the batch, which risks duplicate ingestion.
Reproduction
Create an Arrow Flight stream with recovery enabled and a small recovery_backoff_ms / recovery_timeout_ms so the recovery window is easy to hit.
Configure the server or mock Flight server to return a retryable error, for example gRPC RESOURCE_EXHAUSTED, after receiving a batch.
Call ingest_batch(batch1) to trigger the server error.
While recovery is in progress, keep calling ingest_batch(batchN) from the client.
Observe that some ingest_batch() calls return an error instead of blocking or returning an offset.
Also note that the failed ingest_batch() call may have already inserted the batch into Arrow pending_batches before returning the error, so recovery may replay a batch the caller believes was rejected.
Expected behavior
ingest_batch() should have clear enqueue semantics during retryable recovery, like proto. Accept and retain the batch, return an offset, and let recovery replay it.
Proto ingestion has an explicit pause path for graceful stream closure, but Arrow currently has no equivalent pause/recovery state, so retryable Flight errors can surface directly to callers during recovery.
Is it a regression?
No
Logs
Additional context
No response
SDK
All
SDK Version
newest
Operating System
No response
Description
When Arrow Flight ingestion receives a retryable server error such as RESOURCE_EXHAUSTED, recovery starts, but new ingest_batch() calls can return an error during the reconnect window.
This is ambiguous because ingest_batch() adds the batch to pending_batches before checking whether the sender is available. If batch_tx is temporarily None, the caller receives an error, but the batch may still be retained and replayed during recovery. The caller cannot know whether it should retry the batch, which risks duplicate ingestion.
Reproduction
Create an Arrow Flight stream with recovery enabled and a small recovery_backoff_ms / recovery_timeout_ms so the recovery window is easy to hit.
Configure the server or mock Flight server to return a retryable error, for example gRPC RESOURCE_EXHAUSTED, after receiving a batch.
Call ingest_batch(batch1) to trigger the server error.
While recovery is in progress, keep calling ingest_batch(batchN) from the client.
Observe that some ingest_batch() calls return an error instead of blocking or returning an offset.
Also note that the failed ingest_batch() call may have already inserted the batch into Arrow pending_batches before returning the error, so recovery may replay a batch the caller believes was rejected.
Expected behavior
ingest_batch() should have clear enqueue semantics during retryable recovery, like proto. Accept and retain the batch, return an offset, and let recovery replay it.
Proto ingestion has an explicit pause path for graceful stream closure, but Arrow currently has no equivalent pause/recovery state, so retryable Flight errors can surface directly to callers during recovery.
Is it a regression?
No
Logs
Additional context
No response