Skip to content

fix(frontend): cancel GetBlockRange producer goroutine on client disconnect#560

Merged
LarryRuane merged 1 commit into
zcash:masterfrom
dingledropper:fix/getblockrange-goroutine-leak
May 11, 2026
Merged

fix(frontend): cancel GetBlockRange producer goroutine on client disconnect#560
LarryRuane merged 1 commit into
zcash:masterfrom
dingledropper:fix/getblockrange-goroutine-leak

Conversation

@dingledropper
Copy link
Copy Markdown
Contributor

Summary

Fixes #559.

GetBlockRange and GetBlockRangeNullifiers (frontend/service.go:230, 256) spawn a producer goroutine that sends blocks on an unbuffered channel without context-cancellation hygiene. When a gRPC client cancels mid-stream, the handler returns but the producer blocks forever on the next blockOut <- block, leaking one goroutine per cancelled stream and retaining a *BlockCache reference + one *CompactBlock until process exit.

Fix

  • common/common.go: add ctx context.Context as first parameter; wrap each blocking send in select { case ch <- v: case <-ctx.Done(): return } so the producer exits cleanly when the consumer leaves.
  • frontend/service.go (both handlers): pass resp.Context() to the producer and add a <-ctx.Done() case to the consumer's select loop.

Sibling-handler audit: scanned frontend/service.go for the same make(chan ...) + go ... antipattern. Only the two named handlers exhibit it; other streaming RPCs (GetTaddressTransactions, GetTaddressBalanceStream, GetMempoolStream, GetMempoolTx, GetSubtreeRoots, GetAddressUtxosStream) iterate inline and propagate cancellation through resp.Send errors.

Test plan

Notes

  • Patch passes ctx only to the producer's send-paths. Out-of-scope follow-up: thread ctx through GetBlock / the upstream zcashd JSON-RPC client so an in-flight upstream call cancels with the gRPC stream. Happy to follow up if you'd like.
  • The third blocking send (errOut <- nil on success at the end of the loop) was also wrapped in select for symmetry, even though that path can only be reached when the consumer is still alive.

AI disclosure

This finding was identified, the patch was drafted, and the PR description was written with assistance from Claude (Anthropic). I reviewed each step and am the responsible author.

Copy link
Copy Markdown
Collaborator

@LarryRuane LarryRuane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix works perfectly. I verified it by running the reproduce program given in #559 and then pausing lightwalletd in the debugger, and observed that there are a huge number of those blocked threads in GetBlockRange:
Image

The unit tests that call GetBlockRange no longer compile, due to the missing first argument, so we can add this to your PR (it should go into the same commit):

 --- a/common/common_test.go
+++ b/common/common_test.go
@@ -6,6 +6,7 @@ package common
 import (
        "bufio"
        "bytes"
+       "context"
        "encoding/json"
        "errors"
        "fmt"
@@ -454,7 +455,7 @@ func TestGetBlockRange(t *testing.T) {
                Start: &walletrpc.BlockID{Height: 380640},
                End:   &walletrpc.BlockID{Height: 380642},
        }
-       go GetBlockRange(testcache, blockChan, errChan, blockRange)
+       go GetBlockRange(context.Background(), testcache, blockChan, errChan, blockRange)
 
        // read in block 380640
        select {
@@ -557,7 +558,7 @@ func TestGetBlockRangeReverse(t *testing.T) {
                Start: &walletrpc.BlockID{Height: 380642},
                End:   &walletrpc.BlockID{Height: 380640},
        }
-       go GetBlockRange(testcache, blockChan, errChan, blockRange)
+       go GetBlockRange(context.Background(), testcache, blockChan, errChan, blockRange)
 
        // read in block 380642
        select {

That's the simplest way to fix the unit test, or we can modify these tests to actually test recovery from cancellation. All tests need to compile and run successfully before merging this PR (so I'm requesting that this be done).

In the interest of time, we could do the simple fix here in this PR, and improve the test coverage in a follow-up PR. Or the testing can be improved in this PR. I believe you've written a new test, range_stream_leak_poc_test.go, that we can add to this PR, but I haven't reviewed it.

Comment thread frontend/service.go
"GetBlockRange: must specify start and end heights")
}
ctx := resp.Context()
blockChan := make(chan *walletrpc.CompactBlock)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that you're moving this down to where it's used (before the unrelated error return)

@dingledropper dingledropper force-pushed the fix/getblockrange-goroutine-leak branch from 1cb78c2 to 35d62e2 Compare May 8, 2026 06:23
…onnect

`GetBlockRange` and `GetBlockRangeNullifiers` spawn a producer goroutine that
sends blocks on an unbuffered channel without any context-cancellation
hygiene. When a gRPC client cancels mid-stream (TCP RST, deadline, peer
crash), the handler returns but the producer blocks forever on the next
`blockOut <- block` because no consumer remains. Each cancelled stream
leaks one goroutine indefinitely, retaining a `*BlockCache` reference and
one in-flight `*CompactBlock` until process exit.

Pass `resp.Context()` into the producer; have both sides respect
cancellation via `select { case ch <- v: case <-ctx.Done(): }`. The
producer's first iteration after the consumer leaves now exits cleanly
instead of parking on the unbuffered send.

Sibling-handler audit: scanned `frontend/service.go` for the same
`make(chan ...)` + `go ...` antipattern. Only `GetBlockRange` and
`GetBlockRangeNullifiers` exhibit it; other streaming RPCs iterate inline
against a callback or synchronous loop and propagate cancellation through
`resp.Send` errors.
@dingledropper dingledropper force-pushed the fix/getblockrange-goroutine-leak branch from 35d62e2 to 0b155d5 Compare May 8, 2026 06:28
@dingledropper
Copy link
Copy Markdown
Contributor Author

Thanks for verifying the fix in the debugger @LarryRuane — and apologies for the unit-test compile breakage, that should have been caught locally before opening.

Force-pushed an amended commit (0b155d5) per your "should go into the same commit" note. The amend applies your suggested compile-fix diff verbatim to common/common_test.go: adds the context import and passes context.Background() as the first arg to GetBlockRange in both TestGetBlockRange and TestGetBlockRangeReverse. Both tests pass locally.

Happy to put together a range_stream_leak_poc_test.go regression test in a follow-up PR if you'd like the test coverage. Re-requesting review.

Thanks much, dingledropper

Copy link
Copy Markdown
Collaborator

@LarryRuane LarryRuane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK 0b155d5

Thanks for this PR! All tests are now passing. I think, as a follow-up PR for the regression test, of course, there's no harm in doing that. But I'm not sure it's worth the effort, so I'll leave that to you. The reason I'm not sure is that regression tests are great for covering things that might be broken by unrelated changes, but I think that's unlikely in this case.

@LarryRuane LarryRuane merged commit 45b2f4c into zcash:master May 11, 2026
@dingledropper
Copy link
Copy Markdown
Contributor Author

Thanks @LarryRuane — and good call on the regression test; agreed the unrelated-change risk is low enough here that focused commit history wins over coverage-for-coverage's-sake. Will leave it.

Appreciate the quick turnaround. Thanks much, dingledropper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GetBlockRange/GetBlockRangeNullifiers leak producer goroutine on client disconnect

2 participants