Summary
AWS spot instance reclamation is increasingly interrupting our GPU notebook builds mid-run across the lecture repos. Because these builds run on a single g4dn.2xlarge GPU for 15–25 minutes and can't cheaply checkpoint, a reclamation near the end discards the whole run — and with it any spot savings. This is happening often enough that it's now a reliability problem worth tracking centrally rather than patching per-repo.
Most recent example
The weekly cache.yml build in lecture-jax was killed at 91% of execution: https://github.com/QuantEcon/lecture-jax/actions/runs/27929889850
The runner was a spot instance ("Runner details (g4dn.2xlarge, spot, v3.1.1)", "InstanceLifecycle: spot"). At 04:54:43, while Sphinx was at "reading sources... [91%]", the log shows "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled", immediately followed by "The operation was canceled". The "AssertionError: self.km is not None" raised by nbclient cleanup is a symptom of the kernel being torn down mid-execution; there was no actual notebook/cell error (32 of 33 lectures had already executed cleanly).
Impact
- Wasted GPU minutes: a reclamation at 90% throws away ~20 min of g4dn time, so the spot discount is often a net loss on long builds.
- Flaky scheduled caches & publishes: the weekly cache and publish* builds intermittently fail for reasons unrelated to the lectures, requiring manual re-runs.
- Noise: the failure surfaces as a Python AssertionError/RuntimeError traceback, which looks like a content bug and costs triage time to rule out.
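To put a number on the wasted-minutes point, here is a rough expected-cost sketch in Python. It is a simplified model, not a measurement. It assumes each spot attempt is reclaimed independently with some probability, that a reclaimed attempt is billed for a fixed fraction of a full build, and that the job is re-run on spot until it succeeds. The function names and all parameter values are hypothetical; real spot prices and reclamation rates would have to come from billing data and run history.

```python
def expected_spot_cost_ratio(spot_price_ratio, reclaim_prob, wasted_fraction):
    """Expected cost of one successful spot build, relative to on-demand (= 1.0).

    spot_price_ratio: spot price divided by on-demand price (assumed input).
    reclaim_prob:     probability that a single attempt is reclaimed (assumed input).
    wasted_fraction:  share of a full build billed before the reclamation hits.
    """
    if not 0 <= reclaim_prob < 1:
        raise ValueError("reclaim_prob must be in [0, 1)")
    # Mean number of reclaimed attempts before the first success (geometric model).
    expected_failures = reclaim_prob / (1 - reclaim_prob)
    return spot_price_ratio * (1 + expected_failures * wasted_fraction)


def break_even_reclaim_prob(spot_price_ratio, wasted_fraction):
    """Reclamation probability at which spot costs the same as on-demand."""
    # Solve spot_price_ratio * (1 + q * f / (1 - q)) == 1 for q.
    k = (1 - spot_price_ratio) / spot_price_ratio
    return k / (wasted_fraction + k)
```

A ratio above 1.0 means spot is more expensive than on-demand in pure compute terms. The model leaves out triage time and the delay of a manual re-run, so it understates the real cost of a reclamation.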
Affected repos
Any lecture repo using g4dn GPU RunsOn runners — at minimum lecture-jax (cache, ci, publish, collab workflows). Other GPU-using series likely share the same pattern and the same default-spot config.
Immediate mitigation (in progress)
Forcing on-demand by adding spot=false to the RunsOn runner spec. First PR: QuantEcon/lecture-jax#327.
Discussion — what should the org-wide default be?
Options, roughly in increasing order of effort:
- On-demand for long GPU builds (cache, publish), keep spot for short/cheap PR jobs (ci, collab). Simple, mostly captures the reliability win where it matters.
- On-demand for all GPU builds. Maximally reliable; higher cost. Worth it only if PR GPU jobs are also getting reclaimed.
- Keep spot but add automatic retry-on-reclamation (re-run failed job once) so a single reclamation self-heals without a green-button click.
- Tune the RunsOn spot strategy (e.g. capacity-optimized allocation, broader instance family pool) to reduce reclamation frequency while staying on spot.
A consistent default belongs here rather than drifting per-repo. Related runner-health/telemetry work: #328.
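The retry-on-reclamation option depends on telling a reclamation apart from a genuine content failure. A minimal Python sketch of that check follows, keyed on the two log messages quoted in the example above. The function names and the `max_attempts` default are hypothetical. The shutdown message also appears when a runner service is stopped for other reasons, so this is a heuristic rather than a definitive signal.

```python
# Log lines quoted from the lecture-jax run above.
SHUTDOWN_MARKER = "The runner has received a shutdown signal"
CANCEL_MARKER = "The operation was canceled"


def looks_like_reclamation(log_text: str) -> bool:
    """Return True if the log shows a shutdown signal followed by a cancellation."""
    shutdown_at = log_text.find(SHUTDOWN_MARKER)
    if shutdown_at == -1:
        return False
    # Only count a cancellation that appears at or after the shutdown signal.
    return CANCEL_MARKER in log_text[shutdown_at:]


def should_retry(log_text: str, attempt: int, max_attempts: int = 2) -> bool:
    """Retry only reclamation-like failures, and only up to max_attempts runs in total."""
    return attempt < max_attempts and looks_like_reclamation(log_text)
```

One possible wiring, as an assumption rather than an existing setup: a follow-up job fetches the failed log with `gh run view <run-id> --log-failed`, passes it to `should_retry`, and calls `gh run rerun <run-id> --failed` when the result is true. With `max_attempts=2`, a second reclamation still surfaces as a failure instead of looping.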
Asks