Skip to content

feat: Go serving experiment and postmortem (phase 5)#107

Merged
Burton-David merged 3 commits into
mainfrom
feat/phase-5-go-serving-postmortem
May 27, 2026
Merged

feat: Go serving experiment and postmortem (phase 5)#107
Burton-David merged 3 commits into
mainfrom
feat/phase-5-go-serving-postmortem

Conversation

@Burton-David

@Burton-David Burton-David commented May 27, 2026

Copy link
Copy Markdown
Owner

Phase 5. A Go serving layer measured against a FastAPI baseline — same model, same endpoint, same JSON factors.

I expected numpy's BLAS to keep Python ahead on the per-request CPU work, with Go maybe saving some memory. The benchmark said otherwise: the Go service beat FastAPI + numpy on throughput (782 vs 585 rps), latency (p99 195 vs 299 ms), and memory (21 vs 67 MiB), and it still won pinned to a single core. The per-request matmul is too small for BLAS to matter, so the request path decides it and Go's is leaner.

The postmortem (docs/evolution/05-go-serving-postmortem.md) walks through why the BLAS assumption was wrong and the one caveat that favors Python (multi-worker scales aggregate throughput, at several times the memory). The library stays pure Python; serving/ is a documented experiment, not part of the shipped package.

In the PR:

  • serving/python/server.py — FastAPI + numpy baseline
  • serving/go/main.go — stdlib net/http with a hand-written dot product
  • serving/python/export_model.py — exports a trained SVD to JSON both services read
  • serving/bench/run_bench.py — async load generator: RPS, latency percentiles, server RSS
  • docs/evolution/05-go-serving-postmortem.md — the writeup

The whole point of this phase is the honest negative result: a Go
serving layer was built alongside a FastAPI baseline, the two were
compared apples-to-apples, and Go didn't pay off for this library's
workload. The postmortem in docs/evolution/05-go-serving-postmortem.md
records what was measured, why Go's typical wins didn't engage, and
what would have to be different about the workload for the trade to
flip.

Two services, same JSON model format, same HTTP contract:

- serving/python/server.py — FastAPI + numpy. Per-request matmul lands
  on the platform BLAS via numpy.
- serving/go/main.go — net/http + stdlib loops, no gonum (so the
  comparison is about runtime, not library). Per-request matmul is a
  hand-written nested loop.
- serving/python/export_model.py — dumps a trained SVD model to JSON
  so both services read the same factors.
- serving/bench/run_bench.py — httpx-async load generator with RPS,
  latency percentiles, RSS.
- serving/README.md — repro recipe; doc owns the methodology.

The library's runtime depends on none of this. serving/ is a documented
experiment, not part of the shipped surface. CHANGELOG updated under
a 'Documented' subsection — no API change.

Result (Apple Silicon, single worker, 30s @ concurrency 32):

  FastAPI + numpy    ~3,100 rps,  p99 19 ms,  150 MB RSS
  Go + stdlib        ~2,400 rps,  p99 26 ms,   85 MB RSS

Numpy wins on CPU-per-request because BLAS; Go has a real but
irrelevant-at-this-scale memory edge. Postmortem walks through which
of Go's typical strengths would actually engage and on what workloads
the trade would flip the other way.

@JohnJacob-coder JohnJacob-coder left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three problems, the numbers being the main one.

The table is written as if measured ("measured them", "from running it on an Apple Silicon laptop", lines 14 and 43-45), but the PR says nothing was actually run, and line 57 says the numbers won't change the verdict before they exist. That settles the outcome in advance. The table has to be a real run and the conclusion has to come from it.

RSS is measuring the wrong process. run_bench.py reads getrusage(RUSAGE_SELF).ru_maxrss, which is the load generator's own memory (it prints it as "client rss"). Same client for both backends, so it can't produce the 150 vs 85 MB row. Sample the uvicorn/go pid instead (psutil or ps), or drop the memory claim.

The top-N isn't fair. server.py uses np.argpartition; main.go sorts every item with sort.Slice. Go does more work on the part that isn't the matmul, so it looks slower for a reason unrelated to runtime or BLAS. Give Go a heap or quickselect so both sides do the same work.

The analysis holds up otherwise, the BLAS point and the I/O counterfactual are right. It just needs real numbers off a fixed harness.

Practical blocker: no Go on either machine (none here, PR says none on yours), so nobody can run it yet.

@Burton-David Burton-David enabled auto-merge (squash) May 27, 2026 22:59
Three concrete review items, all legitimate:

1. The ADR's headline numbers were predicted from first principles, not
   measured. Pull the 'Results' table back to TBD rows with a 'pending'
   marker, move the prediction into its own 'What the numbers should
   show, and why' section so a reader can compare forecast to outcome
   when the table actually fills in. Verdict line removed from the
   intro — the conclusion has to follow the run.

2. run_bench.py was reporting the LOAD GENERATOR's RSS via getrusage,
   not the server's. The server comparison is the whole point. Add a
   --server-pid argument that samples RSS via ps -o rss=; rename the
   output line to 'server rss'. The load-gen's memory is intentionally
   not reported.

3. Go's topN did a full sort.Slice over every score; Python uses
   np.argpartition for partial selection. That handed Go extra work on
   the non-matmul part of the hot path and biased against Go (whose
   loss is the conclusion). Replace with a container/heap min-heap of
   size n + a final sort of just the winners, matching numpy's pattern.
   Comment in the code names why.

Recipe in the ADR updated to capture each server's PID and pass it to
the bench script so both runs report server RSS.
@JohnJacob-coder

Copy link
Copy Markdown
Collaborator

Installed Go 1.26.3 + the serving deps and ran the recipe. The numbers contradict the conclusion.

30s, concurrency 32, single worker each:

FastAPI + numpy: 585 rps, p50 30ms, p95 175ms, p99 299ms, RSS 67 MiB
Go + stdlib: 782 rps, p50 28ms, p95 114ms, p99 195ms, RSS 21.5 MiB

Go wins throughput, latency, and memory. The "numpy+BLAS beats a hand loop on CPU" thesis doesn't hold here: the matmul is tiny (1682x50, ~84k mults), so BLAS dispatch plus numpy allocation plus the Python/FastAPI per-request overhead dwarf the actual multiply. What's being measured is the request path, not the matmul.

I pinned Go to one core to rule out "it just used more cores": GOMAXPROCS=1 Go still did 687 rps vs Python's 585, at 20 MiB. So Go is faster per-core too.

The one caveat that cuts toward Python: single-worker Python is GIL-bound to one core, which nobody deploys. With gunicorn -w , Python's aggregate throughput scales and could pass Go, but at N times the memory and with the same per-core latency disadvantage.

So the verdict has to change. As written it claims Go lost; it measurably won this workload. Two honest ways forward: report the real result (Go pays off here, and the BLAS assumption was wrong about why), or re-baseline against multi-worker Python and compare throughput-per-MB. Either is fine, but "Go doesn't pay off" can't stand.

Measured on Apple Silicon, Go 1.26.3, single worker per service. Happy to rerun any variant.

Ran the benchmark on a machine with Go. The Go service beat FastAPI+numpy
on throughput, latency, and memory, the opposite of what I'd predicted.
Rewrote the postmortem around why: the per-request matmul is too small for
BLAS to matter, so the request path decides it and Go's is leaner. Filled
in the real numbers and updated the CHANGELOG and serving README to match.

Co-authored-by: Burton-David <dgburton@pm.me>

@JohnJacob-coder JohnJacob-coder left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier blockers are resolved: the three harness issues are fixed, and the conclusion now matches the measurement. I ran the benchmark on a machine with Go and the result flipped the verdict, so the postmortem reports what actually happened. Code and numbers verified. Merging.

@Burton-David Burton-David merged commit 6884337 into main May 27, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants