GPU spot instance reclamation increasingly interrupting lecture builds #330

@mmcky

Summary

AWS spot instance reclamation is increasingly interrupting our GPU notebook builds mid-run across the lecture repos. Because these builds run on a single g4dn.2xlarge GPU for 15–25 minutes and can't cheaply checkpoint, a reclamation near the end discards the whole run — and with it any spot savings. This is happening often enough that it's now a reliability problem worth tracking centrally rather than patching per-repo.

Most recent example

The weekly cache.yml build in lecture-jax was killed at 91% of execution: https://github.com/QuantEcon/lecture-jax/actions/runs/27929889850

The runner was a spot instance: the log reports "Runner details (g4dn.2xlarge, spot, v3.1.1)" and "InstanceLifecycle: spot". At 04:54:43, while Sphinx was at "reading sources... [91%]", the log shows "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled", immediately followed by "The operation was canceled". The "AssertionError: self.km is not None" raised by nbclient cleanup is a symptom of the kernel being torn down mid-execution; there was no actual notebook or cell error (32 of 33 lectures had already executed cleanly).
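To cut the triage cost of telling a reclamation apart from a content bug, a log check along these lines might help. This is only a sketch: the marker string is the one quoted from the run above, the function name and return labels are hypothetical, and the same shutdown message can also appear when a runner is stopped manually, so a match means "probably reclaimed", not "certainly reclaimed".

```python
# Marker copied from the failed run's log; it signals the runner was shut down
# from outside, which is what a spot reclamation looks like from the job's view.
SHUTDOWN_MARKER = "The runner has received a shutdown signal"


def classify_failure(log_text: str) -> str:
    """Label a failed build log (hypothetical helper, labels are illustrative).

    Returns "likely_reclamation" when the shutdown marker is present,
    otherwise "needs_triage" so a human still looks at the traceback.
    """
    if SHUTDOWN_MARKER in log_text:
        return "likely_reclamation"
    return "needs_triage"


if __name__ == "__main__":
    sample = (
        "reading sources... [ 91%]\n"
        "The runner has received a shutdown signal. This can happen when the "
        "runner service is stopped, or a manually started runner is canceled.\n"
        "The operation was canceled.\n"
        "AssertionError\n"
    )
    print(classify_failure(sample))
```

The check deliberately looks for the shutdown signal rather than the AssertionError, since the traceback is only a downstream symptom.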

Impact

  • Wasted GPU minutes — a reclamation at 90% throws away ~20 min of g4dn time, so the spot discount is often a net loss on long builds.
  • Flaky scheduled caches & publishes — the weekly cache and publish* builds intermittently fail for reasons unrelated to the lectures, requiring manual re-runs.
  • Noise — the failure surfaces as a Python AssertionError/RuntimeError traceback, which looks like a content bug and costs triage time to rule out.
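The "net loss" point can be made concrete with a rough cost model. The sketch below measures cost in on-demand-minute equivalents (1 on-demand minute = 1.0) so no real prices are needed. The discount value, the function name, and the assumption of exactly one reclaimed attempt followed by one full re-run are all illustrative, not measured.

```python
def interrupted_run_cost(build_min: float, lost_fraction: float,
                         discount: float, rerun_on_spot: bool) -> float:
    """Cost of one reclaimed spot attempt plus one full re-run.

    build_min      -- length of a clean build in minutes
    lost_fraction  -- share of the build completed when reclaimed (e.g. 0.9)
    discount       -- spot discount vs on-demand, 0..1 (hypothetical value)
    rerun_on_spot  -- True if the re-run is also on spot, False if on-demand

    The result is in on-demand-minute equivalents, so a clean on-demand
    build costs exactly build_min.
    """
    spot_rate = 1.0 - discount
    wasted = build_min * lost_fraction * spot_rate   # minutes thrown away
    rerun = build_min * (spot_rate if rerun_on_spot else 1.0)
    return wasted + rerun


if __name__ == "__main__":
    build = 22.0  # within the 15-25 minute range quoted in the summary
    for on_spot in (True, False):
        cost = interrupted_run_cost(build, 0.9, 0.5, rerun_on_spot=on_spot)
        print(f"rerun_on_spot={on_spot}: {cost:.1f} vs on-demand {build:.1f}")
```

Under this model, an interrupted run whose re-run is on-demand always costs more than a clean on-demand build, because the wasted spot minutes are added on top of a full-price run. When the re-run is also on spot, the outcome depends on the discount and on how late the reclamation hits.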

Affected repos

Any lecture repo using g4dn GPU RunsOn runners — at minimum lecture-jax (cache, ci, publish, collab workflows). Other GPU-using series likely share the same pattern and the same default-spot config.

Immediate mitigation (in progress)

Forcing on-demand by adding spot=false to the RunsOn runner spec. First PR: QuantEcon/lecture-jax#327.

Discussion — what should the org-wide default be?

Options, roughly in increasing order of effort:

  1. On-demand for long GPU builds (cache, publish), keep spot for short/cheap PR jobs (ci, collab). Simple, mostly captures the reliability win where it matters.
  2. On-demand for all GPU builds. Maximally reliable; higher cost. Worth it only if PR GPU jobs are also getting reclaimed.
  3. Keep spot but add automatic retry-on-reclamation (re-run failed job once) so a single reclamation self-heals without a green-button click.
  4. Tune the RunsOn spot strategy (e.g. capacity-optimized allocation, broader instance family pool) to reduce reclamation frequency while staying on spot.
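Option 3 could be sketched roughly as follows. The re-run endpoint is GitHub's REST "re-run failed jobs from a workflow run" call; everything else (function names, the one-retry cap, how the reclamation is detected, where the token comes from) is an assumption for illustration. The code only builds the request and does not send it.

```python
import urllib.request

API_ROOT = "https://api.github.com"


def should_retry(was_reclaimed: bool, run_attempt: int,
                 max_attempts: int = 2) -> bool:
    """Retry only reclaimed runs, and only while under the attempt cap.

    run_attempt starts at 1 for the first run, so max_attempts=2 means
    "re-run once", which is what option 3 proposes.
    """
    return was_reclaimed and run_attempt < max_attempts


def build_rerun_request(owner: str, repo: str, run_id: int,
                        token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that re-runs the failed jobs."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs"
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    if should_retry(was_reclaimed=True, run_attempt=1):
        req = build_rerun_request("QuantEcon", "lecture-jax", 27929889850, "<token>")
        print(req.get_method(), req.full_url)
        # urllib.request.urlopen(req) would actually trigger the re-run.
```

Capping at one retry keeps a persistently reclaimed job from looping, and gating on the reclamation check means genuine content failures still surface as failures.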

A consistent default belongs here rather than drifting per-repo. Related runner-health/telemetry work: #328.
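As a way of pinning down what a central default might look like, here is a small sketch of option 1 as a lookup. The workflow names come from the lecture-jax list above and spot=false is the override from the mitigation PR. The helper name, the prefix match for publish* workflows, and the choice to treat unknown workflows as on-demand are assumptions to be debated, not an existing convention.

```python
from typing import Optional

# Option 1: long GPU builds go on-demand, short PR jobs stay on the spot default.
LONG_GPU_PREFIXES = ("cache", "publish")   # also covers publish* variants
SHORT_GPU_WORKFLOWS = {"ci", "collab"}


def spot_override(workflow: str) -> Optional[str]:
    """Return the runner-spec override for a workflow, or None for the default.

    "spot=false" forces on-demand; None leaves the repo's spot default alone.
    Unknown workflows fall back to on-demand (a conservative assumption).
    """
    name = workflow.lower()
    if name in SHORT_GPU_WORKFLOWS:
        return None
    if name.startswith(LONG_GPU_PREFIXES):
        return "spot=false"
    return "spot=false"  # unknown workflow: prefer reliability over cost


if __name__ == "__main__":
    for wf in ("cache", "publish-pdf", "ci", "collab"):
        print(wf, "->", spot_override(wf))
```

Keeping the mapping in one place is the point: each repo would read the same table instead of hand-editing its own runner spec.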

Asks
