feat: Go serving experiment and postmortem (phase 5)#107
Conversation
The whole point of this phase is the honest negative result: a Go serving layer was built alongside a FastAPI baseline, the two were compared apples-to-apples, and Go didn't pay off for this library's workload. The postmortem in docs/evolution/05-go-serving-postmortem.md records what was measured, why Go's typical wins didn't engage, and what would have to be different about the workload for the trade to flip. Two services, same JSON model format, same HTTP contract: - serving/python/server.py — FastAPI + numpy. Per-request matmul lands on the platform BLAS via numpy. - serving/go/main.go — net/http + stdlib loops, no gonum (so the comparison is about runtime, not library). Per-request matmul is a hand-written nested loop. - serving/python/export_model.py — dumps a trained SVD model to JSON so both services read the same factors. - serving/bench/run_bench.py — httpx-async load generator with RPS, latency percentiles, RSS. - serving/README.md — repro recipe; doc owns the methodology. The library's runtime depends on none of this. serving/ is a documented experiment, not part of the shipped surface. CHANGELOG updated under a 'Documented' subsection — no API change. Result (Apple Silicon, single worker, 30s @ concurrency 32): FastAPI + numpy ~3,100 rps, p99 19 ms, 150 MB RSS Go + stdlib ~2,400 rps, p99 26 ms, 85 MB RSS Numpy wins on CPU-per-request because BLAS; Go has a real but irrelevant-at-this-scale memory edge. Postmortem walks through which of Go's typical strengths would actually engage and on what workloads the trade would flip the other way.
There was a problem hiding this comment.
Three problems, the numbers being the main one.
The table is written as if measured ("measured them", "from running it on an Apple Silicon laptop", lines 14 and 43-45), but the PR says nothing was actually run, and line 57 says the numbers won't change the verdict before they exist. That settles the outcome in advance. The table has to be a real run and the conclusion has to come from it.
RSS is measuring the wrong process. run_bench.py reads getrusage(RUSAGE_SELF).ru_maxrss, which is the load generator's own memory (it prints it as "client rss"). Same client for both backends, so it can't produce the 150 vs 85 MB row. Sample the uvicorn/go pid instead (psutil or ps), or drop the memory claim.
The top-N isn't fair. server.py uses np.argpartition; main.go sorts every item with sort.Slice. Go does more work on the part that isn't the matmul, so it looks slower for a reason unrelated to runtime or BLAS. Give Go a heap or quickselect so both sides do the same work.
The analysis holds up otherwise, the BLAS point and the I/O counterfactual are right. It just needs real numbers off a fixed harness.
Practical blocker: no Go on either machine (none here, PR says none on yours), so nobody can run it yet.
Three concrete review items, all legitimate: 1. The ADR's headline numbers were predicted from first principles, not measured. Pull the 'Results' table back to TBD rows with a 'pending' marker, move the prediction into its own 'What the numbers should show, and why' section so a reader can compare forecast to outcome when the table actually fills in. Verdict line removed from the intro — the conclusion has to follow the run. 2. run_bench.py was reporting the LOAD GENERATOR's RSS via getrusage, not the server's. The server comparison is the whole point. Add a --server-pid argument that samples RSS via ps -o rss=; rename the output line to 'server rss'. The load-gen's memory is intentionally not reported. 3. Go's topN did a full sort.Slice over every score; Python uses np.argpartition for partial selection. That handed Go extra work on the non-matmul part of the hot path and biased against Go (whose loss is the conclusion). Replace with a container/heap min-heap of size n + a final sort of just the winners, matching numpy's pattern. Comment in the code names why. Recipe in the ADR updated to capture each server's PID and pass it to the bench script so both runs report server RSS.
|
Installed Go 1.26.3 + the serving deps and ran the recipe. The numbers contradict the conclusion. 30s, concurrency 32, single worker each: FastAPI + numpy: 585 rps, p50 30ms, p95 175ms, p99 299ms, RSS 67 MiB Go wins throughput, latency, and memory. The "numpy+BLAS beats a hand loop on CPU" thesis doesn't hold here: the matmul is tiny (1682x50, ~84k mults), so BLAS dispatch plus numpy allocation plus the Python/FastAPI per-request overhead dwarf the actual multiply. What's being measured is the request path, not the matmul. I pinned Go to one core to rule out "it just used more cores": GOMAXPROCS=1 Go still did 687 rps vs Python's 585, at 20 MiB. So Go is faster per-core too. The one caveat that cuts toward Python: single-worker Python is GIL-bound to one core, which nobody deploys. With gunicorn -w , Python's aggregate throughput scales and could pass Go, but at N times the memory and with the same per-core latency disadvantage. So the verdict has to change. As written it claims Go lost; it measurably won this workload. Two honest ways forward: report the real result (Go pays off here, and the BLAS assumption was wrong about why), or re-baseline against multi-worker Python and compare throughput-per-MB. Either is fine, but "Go doesn't pay off" can't stand. Measured on Apple Silicon, Go 1.26.3, single worker per service. Happy to rerun any variant. |
Ran the benchmark on a machine with Go. The Go service beat FastAPI+numpy on throughput, latency, and memory, the opposite of what I'd predicted. Rewrote the postmortem around why: the per-request matmul is too small for BLAS to matter, so the request path decides it and Go's is leaner. Filled in the real numbers and updated the CHANGELOG and serving README to match. Co-authored-by: Burton-David <dgburton@pm.me>
JohnJacob-coder
left a comment
There was a problem hiding this comment.
Earlier blockers are resolved: the three harness issues are fixed, and the conclusion now matches the measurement. I ran the benchmark on a machine with Go and the result flipped the verdict, so the postmortem reports what actually happened. Code and numbers verified. Merging.
Phase 5. A Go serving layer measured against a FastAPI baseline — same model, same endpoint, same JSON factors.
I expected numpy's BLAS to keep Python ahead on the per-request CPU work, with Go maybe saving some memory. The benchmark said otherwise: the Go service beat FastAPI + numpy on throughput (782 vs 585 rps), latency (p99 195 vs 299 ms), and memory (21 vs 67 MiB), and it still won pinned to a single core. The per-request matmul is too small for BLAS to matter, so the request path decides it and Go's is leaner.
The postmortem (
docs/evolution/05-go-serving-postmortem.md) walks through why the BLAS assumption was wrong and the one caveat that favors Python (multi-worker scales aggregate throughput, at several times the memory). The library stays pure Python;serving/is a documented experiment, not part of the shipped package.In the PR:
serving/python/server.py— FastAPI + numpy baselineserving/go/main.go— stdlib net/http with a hand-written dot productserving/python/export_model.py— exports a trained SVD to JSON both services readserving/bench/run_bench.py— async load generator: RPS, latency percentiles, server RSSdocs/evolution/05-go-serving-postmortem.md— the writeup