Skip to content

Fix goroutine and heap leak in Worker.Process#405

Merged
PaNaVTEC merged 1 commit into
mainfrom
fix/process-goroutine-leak
Jun 15, 2026
Merged

Fix goroutine and heap leak in Worker.Process#405
PaNaVTEC merged 1 commit into
mainfrom
fix/process-goroutine-leak

Conversation

@PaNaVTEC

Copy link
Copy Markdown
Contributor

Problem

Worker.Process fetches the document in a goroutine and passes the payload back over a channel:

chanPayload := make(chan []byte)   // unbuffered
chanError := make(chan error)      // unbuffered
go func() {
    payload, err := w.fetchFile(ctx, path)
    ...
    chanPayload <- payload          // blocks until the caller receives
}()

The channels were unbuffered, so the goroutine blocks on the send until someone receives. On the png path, Process has two early returns before the select that drains the channel:

  • extractToken fails (signed URL without a token query param), or
  • fetchAnnotations fails — this calls Redis (FetchAnnotation) on every png request, so any Redis hiccup/timeout or a malformed stored annotation triggers it.

When either fires, the fetch goroutine finishes fetchFile and then blocks forever on the send — nobody will ever receive, and the goroutine doesn't select on ctx.Done(), so request cancellation can't free it. Each leak strands one goroutine and the full PDF payload it holds, until the pod restarts.

Impact

In ec1 prod this is the sawtooth in the runtime metrics: live goroutines climbing ~200→600 and heap-live-size ~160→256 MiB linearly across the month, both snapping back to baseline on each deploy. The html path is unaffected (no early return before its select).

Fix

Buffer both channels with capacity 1 so the goroutine can always complete its single send and exit, even when the caller returns early and abandons it. On an early return the buffered channel and its payload simply go out of scope and are GC'd. One-line change on each channel.

Test

Added TestWorkerProcessNoGoroutineLeak, which drives the fetchAnnotations-error path on the png format 50× and asserts the goroutine count returns to baseline. Verified it fails against the unbuffered version (baseline 2 → 52 stuck goroutines) and passes with the fix. Full suite passes under -race.

Follow-up (not in this PR)

This leak is also the reason the eu-prod pods look heavily overprovisioned — worth right-sizing lazyraster-deploy resource requests once this has baked, since the memory ramp was partly this leak.

🤖 Generated with Claude Code

Process fetches the file in a goroutine that sends the payload over a
channel. The channels were unbuffered, so the goroutine blocked on the
send until the caller received. On the png path, Process can return early
(invalid token, or a fetchAnnotations failure) before reaching the select
that drains the channel. When that happened the fetch goroutine blocked
forever on the send and never observed ctx cancellation, leaking the
goroutine and the full PDF payload it held in memory.

In ec1 prod this showed up as live goroutines and heap-live-size ramping
linearly between deploys and resetting on each restart. fetchAnnotations
hits Redis on every png request, so intermittent Redis errors (or a
malformed stored annotation) steadily accumulated leaked goroutines.

Buffer both channels with capacity 1 so the goroutine can always complete
its single send and exit even when the caller abandons it. Add a
regression test that drives the fetchAnnotations-error path and asserts
goroutines return to baseline; it leaks ~50 goroutines without the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@PaNaVTEC PaNaVTEC merged commit 95a6e96 into main Jun 15, 2026
2 checks passed
@PaNaVTEC PaNaVTEC deleted the fix/process-goroutine-leak branch June 15, 2026 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants