Skip to content

fix(transport): contain TLS pump failures that crash the host app#54

Merged
jamesarich merged 1 commit into
mainfrom
fix/transport-tls-uncaught-write
Jun 9, 2026
Merged

fix(transport): contain TLS pump failures that crash the host app#54
jamesarich merged 1 commit into
mainfrom
fix/transport-tls-uncaught-write

Conversation

@jamesarich

Copy link
Copy Markdown
Collaborator

What

Pass a supervised, exception-handled CoroutineContext to Socket.tls(...) so a TLS write that fails on a dead socket can no longer crash the embedding app.

Why

Ktor launches the TLS record-pump (input/output) coroutines on the context passed to Socket.tls(...). TcpTransport passed a bare Dispatchers.IO — no Job, no CoroutineExceptionHandler — so those pumps run as root coroutines. When the socket dies around the handshake (e.g. a broker resets it during reconnect churn), the output coroutine's write throws and, with no handler in scope, the exception reaches the thread's default uncaught handler and crashes the host app.

This is the #1 crash on Meshtastic Android 2.7.14 (~520 events / 30d on the shipped build), surfacing as three variants of one root cause, all at io.ktor.network.tls.TLSClientHandshake$output … (TLSClientHandshake.kt:139):

  • io.ktor.utils.io.ByteChannel.flushWriteBufferNullPointerException
  • io.ktor.network.sockets.CIOWriterKt$attachForWritingDirectImplIOException: Broken pipe
  • io.ktor.utils.io.ByteChannel.getWriteBufferClosedWriteChannelException

Change

  • Build SupervisorJob() + Dispatchers.IO + CoroutineName("mqtt-tls") + CoroutineExceptionHandler { } and pass it to .tls(...).
  • Track it in tlsJob; cancel it in close() and on the connect failure path.

Why it's safe

The failure is still surfaced normally — the dying socket closes the app-facing channels, so MqttConnection's read loop throws → handleFatalError → reconnect. We only suppress the duplicate, fatal report from the detached pump. Graceful DISCONNECT ordering is preserved (it is written before close() cancels tlsJob).

Testing

spotlessCheck detekt :library:compileKotlinJvm :library:jvmTest apiCheck all pass. No public API change (TcpTransport is internal). A true regression test needs a broker that resets mid-handshake (integration-only), so this is verified by reasoning + the existing suite.

🤖 Generated with Claude Code

…rash the host app

Ktor launches the TLS record-pump (input/output) coroutines on the
CoroutineContext passed to Socket.tls(). We passed a bare Dispatchers.IO,
which carries no Job and no CoroutineExceptionHandler — so a write that
failed around the handshake (e.g. a broken pipe when the peer reset the
socket during reconnect churn) propagated to the thread's default uncaught
handler and crashed the embedding app.

Pass a SupervisorJob + Dispatchers.IO + CoroutineExceptionHandler context
instead, tracked in tlsJob and cancelled in close() and on the connect
failure path. The failure is now contained; it still closes the channels,
which the read loop surfaces as a normal connection-lost -> reconnect.

This was the top crash on Meshtastic Android 2.7.14 (uncaught io.ktor TLS
write: Broken pipe / ClosedWriteChannelException / Buffer NPE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jamesarich jamesarich added this pull request to the merge queue Jun 9, 2026
Merged via the queue into main with commit 8c8170e Jun 9, 2026
13 of 14 checks passed
@jamesarich jamesarich deleted the fix/transport-tls-uncaught-write branch June 9, 2026 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant