GPU spot instance reclamation increasingly interrupting lecture builds

## Summary

AWS **spot instance reclamation is increasingly interrupting our GPU notebook builds mid-run** across the lecture repos. Because these builds run on a single `g4dn.2xlarge` GPU for 15–25 minutes and can't cheaply checkpoint, a reclamation near the end discards the whole run — and with it any spot savings. This is happening often enough that it's now a reliability problem worth tracking centrally rather than patching per-repo.

## Most recent example

The weekly `cache.yml` build in `lecture-jax` was killed at **91%** of execution: https://github.com/QuantEcon/lecture-jax/actions/runs/27929889850

The runner was a spot instance (`Runner details (g4dn.2xlarge, spot, v3.1.1)`, `InstanceLifecycle: spot`). At `04:54:43`, while Sphinx was at `reading sources... [91%]`, the log shows `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled`, immediately followed by `The operation was canceled`. The `AssertionError: self.km is not None` raised by `nbclient` cleanup is a *symptom* of the kernel being torn down mid-execution — there was no actual notebook/cell error (32 of 33 lectures had already executed cleanly).
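To cut triage time on failures like this one, a log could be pre-classified before anyone reads the traceback. The following is a minimal sketch, not an existing tool: the function name and return labels are hypothetical, and the marker string is copied from the run above, so other runner versions may word it differently. A shutdown signal also appears when a runner is stopped manually, so a match suggests a reclamation but does not prove it.

```python
import re

# Marker copied from the failing run's log (assumption: wording is stable across runner versions).
SHUTDOWN_MARKER = "The runner has received a shutdown signal"

# Traceback types that, in the run above, were side effects of the kernel being torn down.
SYMPTOM_PATTERN = re.compile(r"AssertionError|RuntimeError")


def classify_failure(log_text: str) -> str:
    """Return 'runner-shutdown', 'possible-content-error', or 'unknown'."""
    # Check the shutdown marker first: if the runner was killed, any traceback
    # that follows is likely cleanup noise rather than a notebook bug.
    if SHUTDOWN_MARKER in log_text:
        return "runner-shutdown"
    if SYMPTOM_PATTERN.search(log_text):
        return "possible-content-error"
    return "unknown"
```

Checking the shutdown marker before the traceback pattern is the point of the sketch: the `AssertionError` only counts as a content problem when no shutdown was logged.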

## Impact

- **Wasted GPU minutes:** a reclamation at 90% throws away ~20 min of `g4dn` time and forces a full re-run, so on long builds repeated reclamations can cancel out the spot discount.
- **Flaky scheduled caches & publishes:** the weekly cache and `publish*` builds intermittently fail for reasons unrelated to the lectures, requiring manual re-runs.
- **Noise:** the failure surfaces as a Python `AssertionError`/`RuntimeError` traceback, which looks like a content bug and costs triage time to rule out.
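The cost argument in the first bullet can be made concrete with a rough model. This is a sketch under stated assumptions, not measured data: each attempt is reclaimed independently with probability `p_reclaim`, a reclaimed attempt is billed for `wasted_fraction` of a full run, and the job is retried until it succeeds. `price_ratio` is the spot price divided by the on-demand price. All parameter names and the example values are hypothetical.

```python
def expected_spot_cost_ratio(price_ratio: float, p_reclaim: float, wasted_fraction: float) -> float:
    """Expected spot cost divided by the cost of one on-demand run.

    A value above 1.0 means spot is expected to cost more than on-demand.
    """
    if not 0.0 <= p_reclaim < 1.0:
        raise ValueError("p_reclaim must be in [0, 1)")
    # Retrying until success gives p / (1 - p) failed attempts on average.
    expected_failed_attempts = p_reclaim / (1.0 - p_reclaim)
    return price_ratio * (1.0 + wasted_fraction * expected_failed_attempts)


def break_even_reclaim_probability(price_ratio: float, wasted_fraction: float) -> float:
    """Reclaim probability at which expected spot cost equals on-demand cost."""
    # Solve price_ratio * (1 + wasted_fraction * x) = 1 for x = p / (1 - p).
    x = (1.0 / price_ratio - 1.0) / wasted_fraction
    return x / (1.0 + x)
```

For example, with a hypothetical price ratio of 0.4 and reclamations at 90% of the run, `break_even_reclaim_probability(0.4, 0.9)` is 0.625. The model counts compute cost only; it leaves out the manual re-runs and delayed publishes listed above, which push the practical break-even lower.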

## Affected repos

Any lecture repo using `g4dn` GPU RunsOn runners — at minimum `lecture-jax` (cache, ci, publish, collab workflows). Other GPU-using series likely share the same pattern and the same default-spot config.

## Immediate mitigation (in progress)

We are forcing on-demand instances by adding `spot=false` to the RunsOn runner spec. First PR: QuantEcon/lecture-jax#327.
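To see which workflows still need this change, a small audit script could scan workflow files for GPU jobs whose runner spec lacks `spot=false`. The sketch below is hypothetical and rests on one assumption about layout: the spec sits on a single `runs-on:` line as a slash-separated string of labels. Specs written as YAML lists, or split across lines, are not detected.

```python
import re

# Matches single-line `runs-on: <spec>` entries (assumption: spec is on the same line).
RUNS_ON_LINE = re.compile(r"^[ \t]*runs-on:[ \t]*(?P<spec>\S.*?)[ \t]*$", re.MULTILINE)


def gpu_jobs_missing_on_demand(workflow_text: str, gpu_marker: str = "g4dn") -> list[str]:
    """Return runner specs that mention the GPU family but do not set spot=false."""
    missing = []
    for match in RUNS_ON_LINE.finditer(workflow_text):
        spec = match.group("spec").strip("\"'")
        labels = spec.split("/")  # assumed slash-separated label format
        if gpu_marker in spec and "spot=false" not in labels:
            missing.append(spec)
    return missing
```

Running this over the text of each file in `.github/workflows/` would give a per-repo list of specs still on the default spot configuration.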

## Discussion — what should the org-wide default be?

Options, roughly in increasing order of effort:

1. **On-demand for long GPU builds** (`cache`, `publish`), keep spot for short/cheap PR jobs (`ci`, `collab`). Simple, mostly captures the reliability win where it matters.
2. **On-demand for all GPU builds.** Maximally reliable; higher cost. Worth it only if PR GPU jobs are also getting reclaimed.
3. **Keep spot but add automatic retry-on-reclamation** (re-run failed job once) so a single reclamation self-heals without a green-button click.
4. **Tune the RunsOn spot strategy** (e.g. capacity-optimized allocation, broader instance family pool) to reduce reclamation frequency while staying on spot.

A consistent default belongs here rather than drifting per-repo. Related runner-health/telemetry work: #328.
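If option 1 were chosen, the policy could be written down once as a lookup rather than left to each repo. This is a minimal sketch, assuming workflows are identified by short names like those used above (`cache`, `publish*`, `ci`, `collab`); the function names are hypothetical and the real workflow file names may differ.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Long GPU builds from option 1 that should run on-demand.
ON_DEMAND_PATTERNS = ("cache", "publish*")


def use_spot(workflow_name: str) -> bool:
    """True if the workflow may stay on spot under option 1."""
    return not any(fnmatchcase(workflow_name, pattern) for pattern in ON_DEMAND_PATTERNS)


def extra_runner_label(workflow_name: str) -> Optional[str]:
    """Label to add to the runner spec, or None to keep the default spot config."""
    return None if use_spot(workflow_name) else "spot=false"
```

Here `extra_runner_label("publish-pdf")` returns `"spot=false"`, while `extra_runner_label("ci")` returns `None`, leaving short PR jobs on the existing default.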

## Asks

- [x] Decide the org-wide GPU spot policy (one of the options above).
- [x] Roll the chosen config to all GPU-using lecture repos.
- [ ] Optionally track GPU spot reclamation frequency as part of #328 runner-health telemetry.

