fix(frontend): cancel GetBlockRange producer goroutine on client disconnect#560
Conversation
LarryRuane
left a comment
There was a problem hiding this comment.
This fix works perfectly. I verified it by running the reproduce program given in #559 and then pausing lightwalletd in the debugger, and observed that there are a huge number of those blocked threads in GetBlockRange:

The unit tests that call GetBlockRange no longer compile, due to the missing first argument, so we can add this to your PR (it should go into the same commit):
--- a/common/common_test.go
+++ b/common/common_test.go
@@ -6,6 +6,7 @@ package common
import (
"bufio"
"bytes"
+ "context"
"encoding/json"
"errors"
"fmt"
@@ -454,7 +455,7 @@ func TestGetBlockRange(t *testing.T) {
Start: &walletrpc.BlockID{Height: 380640},
End: &walletrpc.BlockID{Height: 380642},
}
- go GetBlockRange(testcache, blockChan, errChan, blockRange)
+ go GetBlockRange(context.Background(), testcache, blockChan, errChan, blockRange)
// read in block 380640
select {
@@ -557,7 +558,7 @@ func TestGetBlockRangeReverse(t *testing.T) {
Start: &walletrpc.BlockID{Height: 380642},
End: &walletrpc.BlockID{Height: 380640},
}
- go GetBlockRange(testcache, blockChan, errChan, blockRange)
+ go GetBlockRange(context.Background(), testcache, blockChan, errChan, blockRange)
// read in block 380642
select {That's the simplest way to fix the unit test, or we can modify these tests to actually test recovery from cancellation. All tests need to compile and run successfully before merging this PR (so I'm requesting that this be done).
In the interest of time, we could do the simple fix here in this PR, and improve the test coverage in a follow-up PR. Or the testing can be improved in this PR. I believe you've written a new test, range_stream_leak_poc_test.go, that we can add to this PR, but I haven't reviewed it.
| "GetBlockRange: must specify start and end heights") | ||
| } | ||
| ctx := resp.Context() | ||
| blockChan := make(chan *walletrpc.CompactBlock) |
There was a problem hiding this comment.
Nice that you're moving this down to where it's used (before the unrelated error return)
1cb78c2 to
35d62e2
Compare
…onnect
`GetBlockRange` and `GetBlockRangeNullifiers` spawn a producer goroutine that
sends blocks on an unbuffered channel without any context-cancellation
hygiene. When a gRPC client cancels mid-stream (TCP RST, deadline, peer
crash), the handler returns but the producer blocks forever on the next
`blockOut <- block` because no consumer remains. Each cancelled stream
leaks one goroutine indefinitely, retaining a `*BlockCache` reference and
one in-flight `*CompactBlock` until process exit.
Pass `resp.Context()` into the producer; have both sides respect
cancellation via `select { case ch <- v: case <-ctx.Done(): }`. The
producer's first iteration after the consumer leaves now exits cleanly
instead of parking on the unbuffered send.
Sibling-handler audit: scanned `frontend/service.go` for the same
`make(chan ...)` + `go ...` antipattern. Only `GetBlockRange` and
`GetBlockRangeNullifiers` exhibit it; other streaming RPCs iterate inline
against a callback or synchronous loop and propagate cancellation through
`resp.Send` errors.
35d62e2 to
0b155d5
Compare
|
Thanks for verifying the fix in the debugger @LarryRuane — and apologies for the unit-test compile breakage, that should have been caught locally before opening. Force-pushed an amended commit ( Happy to put together a Thanks much, dingledropper |
LarryRuane
left a comment
There was a problem hiding this comment.
tACK 0b155d5
Thanks for this PR! All tests are now passing. I think, as a follow-up PR for the regression test, of course, there's no harm in doing that. But I'm not sure it's worth the effort, so I'll leave that to you. The reason I'm not sure is that regression tests are great for covering things that might be broken by unrelated changes, but I think that's unlikely in this case.
|
Thanks @LarryRuane — and good call on the regression test; agreed the unrelated-change risk is low enough here that focused commit history wins over coverage-for-coverage's-sake. Will leave it. Appreciate the quick turnaround. Thanks much, dingledropper |
Summary
Fixes #559.
GetBlockRangeandGetBlockRangeNullifiers(frontend/service.go:230, 256) spawn a producer goroutine that sends blocks on an unbuffered channel without context-cancellation hygiene. When a gRPC client cancels mid-stream, the handler returns but the producer blocks forever on the nextblockOut <- block, leaking one goroutine per cancelled stream and retaining a*BlockCachereference + one*CompactBlockuntil process exit.Fix
common/common.go: addctx context.Contextas first parameter; wrap each blocking send inselect { case ch <- v: case <-ctx.Done(): return }so the producer exits cleanly when the consumer leaves.frontend/service.go(both handlers): passresp.Context()to the producer and add a<-ctx.Done()case to the consumer's select loop.Sibling-handler audit: scanned
frontend/service.gofor the samemake(chan ...)+go ...antipattern. Only the two named handlers exhibit it; other streaming RPCs (GetTaddressTransactions,GetTaddressBalanceStream,GetMempoolStream,GetMempoolTx,GetSubtreeRoots,GetAddressUtxosStream) iterate inline and propagate cancellation throughresp.Senderrors.Test plan
go build ./...(I do not have a Go toolchain locally; please verify)go test ./...common.GetBlockRange)GetBlockRangeintegration tests passNotes
ctxonly to the producer's send-paths. Out-of-scope follow-up: threadctxthroughGetBlock/ the upstream zcashd JSON-RPC client so an in-flight upstream call cancels with the gRPC stream. Happy to follow up if you'd like.errOut <- nilon success at the end of the loop) was also wrapped inselectfor symmetry, even though that path can only be reached when the consumer is still alive.AI disclosure
This finding was identified, the patch was drafted, and the PR description was written with assistance from Claude (Anthropic). I reviewed each step and am the responsible author.