Symptom
There are two peers, a producer and a consumer. The producer sends bulk data to the consumer, and the consumer sends periodic pings to the producer. Under backpressure between the two (saturated link, slow consumer), some of the data sent by the producer may never arrive at the consumer, even though every send call on the producer succeeded. See here for a reproducer.
This may also explain at least some of the reported truncated kubectl cp/exec/attach transfers.
Background
What a successful send call means
When a user space process calls send, the only guarantee a successful call gives is that the kernel has taken over the data and that it now sits in the send buffer. There are no further guarantees. Typically the data is put on the wire shortly afterwards, but that is not a given. If the transmission stalls because the consumer is slow or blocked or the link is saturated, the data can sit in the send buffer for some time. The user space process can also close the socket after a successful send call, even though the data has not reached the peer yet. The kernel then continues delivering the buffered data in the background and shuts the connection down with the appropriate sequence of packets.
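To make this concrete, here is a minimal Go sketch (not part of this change; `writeToIdlePeer` is a hypothetical name). It writes to a loopback peer that accepts the connection but never reads from it. For a payload small enough to fit in the send buffer, `Write` is expected to report success anyway, because success only means the kernel has accepted the bytes.

```go
package main

import (
	"net"
)

// writeToIdlePeer (hypothetical helper) connects to a local listener whose
// accepting side never reads, writes payload, and returns what Write reported.
func writeToIdlePeer(payload []byte) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	accepted := make(chan net.Conn, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			close(accepted)
			return
		}
		accepted <- c // hold the connection open, but never read from it
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	// Success here only means the kernel took the bytes into the send
	// buffer; the peer's application has not read any of them.
	n, err := conn.Write(payload)

	if peer := <-accepted; peer != nil {
		peer.Close()
	}
	return n, err
}
```

Calling `writeToIdlePeer(make([]byte, 4096))` should return the full length and a nil error, although the peer application never consumed a single byte.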
Closing a socket with data in the receive buffer
When a socket is closed while there is still data in the receive buffer, the kernel will typically send an RST instead of a FIN to indicate to the peer that the data it sent could not be delivered. An RST will also be sent if data arrives after the socket has been closed. Its purpose is the same: to indicate to the peer that data it sent can no longer be delivered. From RFC 1122, section 4.2.2.13:
[...] If such a host issues a CLOSE call while received data is still
pending in TCP, or if new data is received after CLOSE is called, its TCP
SHOULD send a RST to show that data was lost.
The destructive side effect of an RST
The RST has one destructive side effect: whatever data remains in the send buffer when the RST is triggered is discarded by the kernel. For the case described above, that means: the producer successfully sends its data frames, tears everything down, and closes its socket. If a PING frame then arrives while some of these data frames are still in the send buffer, those frames are discarded. Instead of receiving the remaining data, the consumer sees a truncated transfer and a "connection reset by peer" error.
A potential fix
As we have no influence over the consumer sending ping packets, the best way to avoid this situation is to drain the producer's receive buffer, ideally until the consumer is done sending. This can be done in three steps:
- Close the write end of the connection, so the consumer knows we are done sending. Under normal conditions (i.e. without backpressure), this will almost immediately put a FIN on the wire.
- Drain the receive buffer, to prevent any late incoming bytes from causing truncation of the transfer. The drain timeout is currently hardcoded to 10s. (The timeout only becomes relevant in case of a misbehaving client.) Alternatively, one could reuse closeTimeout, or make it separately configurable. Feedback welcome.
- Close the socket. In case the consumer got the FIN from the producer while draining, all data is successfully delivered. Even in case no FIN was seen within 10 seconds, the likelihood of a successful delivery has been significantly increased.