Force on-demand GPU runners (spot=false) #936

Merged: mmcky merged 4 commits into main from disable-spot-gpu-runners on Jun 26, 2026

@mmcky mmcky commented Jun 26, 2026

What

Adds spot=false to the RunsOn g4dn.2xlarge GPU runner spec in all four GPU workflows so they run on on-demand instances instead of spot: cache.yml, ci.yml, collab.yml, publish.yml.

Why

AWS spot reclamation has been interrupting the GPU notebook builds mid-run. Each build runs on a single GPU for 15–25 minutes and can't cheaply checkpoint, so a reclamation near the end discards the whole build. As a result the spot discount is often a net loss, and scheduled cache/publish builds fail intermittently for reasons unrelated to the lectures; the failure surfaces as an AssertionError or shutdown signal that looks like a content bug. We just hit exactly this on a forced cache.yml run here: attempt 1 was reclaimed mid-build, produced no execution reports, and was automatically re-queued.

This rolls out the org-wide decision in QuantEcon/meta#330 (on-demand for all GPU builds) to lecture-python.myst, and mirrors the reference change in QuantEcon/lecture-jax#327.

Change

| Workflow | Runner | Change |
| --- | --- | --- |
| cache.yml | g4dn.2xlarge | append /spot=false |
| ci.yml | g4dn.2xlarge | append /spot=false |
| collab.yml | g4dn.2xlarge | append /spot=false |
| publish.yml | g4dn.2xlarge | append /spot=false |

One-line change per file; no other workflow logic touched.
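
A minimal sketch of how the change above could be checked mechanically. It is not part of this PR. The function name `gpu_labels_missing_spot_false` and the sample `runs-on` line (including the `family=` segment) are assumptions for illustration; the actual label strings in the four workflows may contain other segments.

```python
import re

# Matches a `runs-on:` line and captures everything after the colon.
RUNS_ON = re.compile(r"^\s*runs-on:\s*(?P<label>.+?)\s*$", re.MULTILINE)


def gpu_labels_missing_spot_false(workflow_text, instance="g4dn.2xlarge"):
    """Return runs-on labels that mention `instance` but lack a spot=false segment."""
    missing = []
    for match in RUNS_ON.finditer(workflow_text):
        label = match.group("label").strip("\"'")
        segments = label.split("/")  # RunsOn-style labels are '/'-separated key=value pairs
        if any(instance in s for s in segments) and "spot=false" not in segments:
            missing.append(label)
    return missing


# Hypothetical workflow snippet before the change (label contents are assumed).
before = (
    "jobs:\n"
    "  build:\n"
    '    runs-on: "runs-on=${{ github.run_id }}/family=g4dn.2xlarge"\n'
)
# The same snippet after appending /spot=false, as described in the table.
after = before.replace('g4dn.2xlarge"', 'g4dn.2xlarge/spot=false"')

print(gpu_labels_missing_spot_false(before))
print(gpu_labels_missing_spot_false(after))
```

Run against the text of each of the four workflow files, an empty result would indicate that every GPU runner spec carries the on-demand flag.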

🤖 Generated with Claude Code

mmcky and others added 3 commits June 26, 2026 15:31
The weekly link checker (#933) flags 8 errors out of ~25k links, all
false positives or harmless artifacts on non-content pages:

- IEEE Xplore returns "202 Accepted" (anti-bot) for a valid DOI cited in
  zreferences.html -> add 202 to --accept.
- genindex / search / prf-prf are auto-generated utility pages with no
  source notebook, so the theme's "Download Notebook" button points at a
  nonexistent _notebooks/<page>.ipynb and renders a second href="None"
  -> --exclude-path those three pages.
- A Journal of Derivatives DOI redirects into a login/paywall loop that
  exceeds max-redirects; the citation itself is valid -> --exclude it.

Configuration is kept inline in the workflow args (rather than a
lychee.toml) because lychee runs against the gh-pages checkout, which
does not contain repo-root config files.

Closes #933

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
lychee treats --exclude-path values as regular expressions, so the
unescaped dots in genindex.html / search.html / prf-prf.html were regex
wildcards and the patterns were unanchored. Escape the dot and anchor the
end ('<name>\.html$') so each matches only the intended generated page.

Addresses Copilot review feedback on #934.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
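
The escaping fix described in the commit message above can be illustrated with a short sketch. Python's `re` module stands in for lychee's regex engine here, which is an assumption; for these simple patterns the meaning of `.`, `\.` and `$` is the same. The sample paths are hypothetical.

```python
import re

# Unescaped and unanchored: '.' matches any character, and the match may
# occur anywhere in the path.
loose = re.compile(r"search.html")
# Escaped dot and end anchor, in the '<name>\.html$' form from the commit.
strict = re.compile(r"search\.html$")

# Hypothetical paths; only the first is the generated page meant to be excluded.
paths = [
    "site/search.html",
    "site/research_html_notes.html",
    "site/search.html.orig",
]

loose_hits = [p for p in paths if loose.search(p)]
strict_hits = [p for p in paths if strict.search(p)]

print(loose_hits)   # all three paths match the loose pattern
print(strict_hits)  # only site/search.html matches the strict pattern
```

The pattern is anchored only at the end, so a path such as `research.html` would still match `search\.html$`; anchoring the start as well would need a leading path separator in the pattern.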
AWS spot reclamation has been interrupting the g4dn.2xlarge GPU notebook
builds mid-run, discarding the whole build. Add spot=false to the RunsOn
runner spec in all four GPU workflows (cache, ci, collab, publish) so they
run on on-demand instances.

Rolls out the org-wide decision in QuantEcon/meta#330; mirrors
QuantEcon/lecture-jax#327.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 26, 2026 07:16

Copilot AI left a comment

Pull request overview

This PR updates GitHub Actions workflow configuration to reduce GPU build interruptions by forcing on-demand GPU runner allocation (disabling spot instances). It also adjusts the scheduled link checker workflow’s lychee arguments to reduce known false positives when checking the published gh-pages HTML output.

Changes:

  • Append /spot=false to the runs-on runner spec for the g4dn.2xlarge GPU workflows (cache.yml, ci.yml, collab.yml, publish.yml) to ensure on-demand instances.
  • Expand linkcheck.yml lychee CLI arguments to accept additional HTTP statuses and exclude a small set of known-noise pages/DOI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/cache.yml | Forces on-demand g4dn.2xlarge runner usage via /spot=false. |
| .github/workflows/ci.yml | Forces on-demand g4dn.2xlarge runner usage via /spot=false. |
| .github/workflows/collab.yml | Forces on-demand g4dn.2xlarge runner usage via /spot=false. |
| .github/workflows/publish.yml | Forces on-demand g4dn.2xlarge runner usage via /spot=false. |
| .github/workflows/linkcheck.yml | Refines lychee linkcheck args to reduce documented false positives. |

github-actions Bot commented Jun 26, 2026

📖 Netlify Preview Ready!

Preview URL: https://pr-936--sunny-cactus-210e3e.netlify.app

Commit: b8140a9


@mmcky mmcky merged commit 0567c05 into main Jun 26, 2026
1 check passed
@mmcky mmcky deleted the disable-spot-gpu-runners branch June 26, 2026 07:53