fix(transport): contain TLS pump failures that crash the host app#54
Merged
Conversation
…rash the host app Ktor launches the TLS record-pump (input/output) coroutines on the CoroutineContext passed to Socket.tls(). We passed a bare Dispatchers.IO, which carries no Job and no CoroutineExceptionHandler — so a write that failed around the handshake (e.g. a broken pipe when the peer reset the socket during reconnect churn) propagated to the thread's default uncaught handler and crashed the embedding app. Pass a SupervisorJob + Dispatchers.IO + CoroutineExceptionHandler context instead, tracked in tlsJob and cancelled in close() and on the connect failure path. The failure is now contained; it still closes the channels, which the read loop surfaces as a normal connection-lost -> reconnect. This was the top crash on Meshtastic Android 2.7.14 (uncaught io.ktor TLS write: Broken pipe / ClosedWriteChannelException / Buffer NPE). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Pass a supervised, exception-handled
CoroutineContexttoSocket.tls(...)so a TLS write that fails on a dead socket can no longer crash the embedding app.Why
Ktor launches the TLS record-pump (input/output) coroutines on the context passed to
Socket.tls(...).TcpTransportpassed a bareDispatchers.IO— noJob, noCoroutineExceptionHandler— so those pumps run as root coroutines. When the socket dies around the handshake (e.g. a broker resets it during reconnect churn), theoutputcoroutine's write throws and, with no handler in scope, the exception reaches the thread's default uncaught handler and crashes the host app.This is the #1 crash on Meshtastic Android 2.7.14 (~520 events / 30d on the shipped build), surfacing as three variants of one root cause, all at
io.ktor.network.tls.TLSClientHandshake$output … (TLSClientHandshake.kt:139):io.ktor.utils.io.ByteChannel.flushWriteBuffer→NullPointerExceptionio.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImpl→IOException: Broken pipeio.ktor.utils.io.ByteChannel.getWriteBuffer→ClosedWriteChannelExceptionChange
SupervisorJob() + Dispatchers.IO + CoroutineName("mqtt-tls") + CoroutineExceptionHandler { }and pass it to.tls(...).tlsJob; cancel it inclose()and on the connect failure path.Why it's safe
The failure is still surfaced normally — the dying socket closes the app-facing channels, so
MqttConnection's read loop throws →handleFatalError→ reconnect. We only suppress the duplicate, fatal report from the detached pump. Graceful DISCONNECT ordering is preserved (it is written beforeclose()cancelstlsJob).Testing
spotlessCheck detekt :library:compileKotlinJvm :library:jvmTest apiCheckall pass. No public API change (TcpTransportisinternal). A true regression test needs a broker that resets mid-handshake (integration-only), so this is verified by reasoning + the existing suite.🤖 Generated with Claude Code