⚙️ Disable spot instances on GPU runners (use on-demand) #327
Merged
AWS spot reclamation is increasingly interrupting long GPU notebook builds mid-run (e.g. the weekly cache build was killed at 91% in https://github.com/QuantEcon/lecture-jax/actions/runs/27929889850 when the spot g4dn.2xlarge received a shutdown signal). Force on-demand capacity by adding spot=false to all four GPU runner specs. See QuantEcon/meta for the tracking discussion on GPU spot reclamation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
✅ Deploy Preview for incomparable-parfait-2417f8 ready!
Pull request overview
This PR updates the repository’s GPU GitHub Actions workflows to request on-demand GPU runner capacity (instead of spot), improving reliability for long-running notebook builds that are susceptible to spot reclamation.
Changes:
- Adds `spot=false` to the RunsOn GPU runner spec in all GPU workflows.
- Applies the same runner-spec adjustment consistently across the `cache`, `ci`, `publish`, and `collab` workflows.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/cache.yml | Forces on-demand GPU capacity for scheduled/manual cache builds by adding spot=false to the runner spec. |
| .github/workflows/ci.yml | Forces on-demand GPU capacity for PR preview builds by adding spot=false to the runner spec. |
| .github/workflows/publish.yml | Forces on-demand GPU capacity for tag-based publish builds by adding spot=false to the runner spec. |
| .github/workflows/collab.yml | Forces on-demand GPU capacity for PR Colab execution checks by adding spot=false to the runner spec. |
mmcky added a commit to QuantEcon/lecture-python.myst that referenced this pull request on Jun 26, 2026
* [linkcheck] Clear residual false positives in weekly lychee report

  The weekly link checker (#933) flags 8 errors out of ~25k links, all false positives or harmless artifacts on non-content pages:

  - IEEE Xplore returns "202 Accepted" (anti-bot) for a valid DOI cited in zreferences.html -> add 202 to --accept.
  - genindex / search / prf-prf are auto-generated utility pages with no source notebook, so the theme's "Download Notebook" button points at a nonexistent _notebooks/<page>.ipynb and renders a second href="None" -> --exclude-path those three pages.
  - A Journal of Derivatives DOI redirects into a login/paywall loop that exceeds max-redirects; the citation itself is valid -> --exclude it.

  Configuration is kept inline in the workflow args (rather than a lychee.toml) because lychee runs against the gh-pages checkout, which does not contain repo-root config files.

  Closes #933

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* [linkcheck] Escape and anchor --exclude-path regexes

  lychee treats --exclude-path values as regular expressions, so the unescaped dots in genindex.html / search.html / prf-prf.html were regex wildcards and the patterns were unanchored. Escape the dot and anchor the end ('<name>\.html$') so each matches only the intended generated page.

  Addresses Copilot review feedback on #934.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Force on-demand GPU runners (spot=false)

  AWS spot reclamation has been interrupting the g4dn.2xlarge GPU notebook builds mid-run, discarding the whole build. Add spot=false to the RunsOn runner spec in all four GPU workflows (cache, ci, collab, publish) so they run on on-demand instances.

  Rolls out the org-wide decision in QuantEcon/meta#330; mirrors QuantEcon/lecture-jax#327.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
What
Forces on-demand GPU capacity instead of spot by adding `spot=false` to the RunsOn runner spec in all four GPU workflows: `cache.yml`, `ci.yml`, `publish.yml`, and `collab.yml`.
Why
AWS spot reclamation is increasingly interrupting our long GPU notebook builds mid-run. In the most recent weekly cache build (run 27929889850) the `g4dn.2xlarge` spot instance received a shutdown signal at 91% of the build (`reading sources... [91%]`), producing `The runner has received a shutdown signal` followed by `The operation was canceled`. The downstream `AssertionError: self.km is not None` in `nbclient` cleanup is a symptom of the kernel being killed mid-execution, not a real notebook failure.

These builds run 17–22 minutes on a single GPU, so a reclamation near the end wastes the entire run (and any spot savings along with it). On-demand trades a higher hourly rate for build reliability, which is the right call for GPU jobs that can't cheaply checkpoint and resume.
Change
| Workflow | Runner spec |
|---|---|
| `cache.yml` | `…/volume=80gb/spot=false` |
| `ci.yml` | `…/volume=80gb/spot=false` |
| `publish.yml` (`publish*` tag) | `…/volume=80gb/spot=false` |
| `collab.yml` | `…/volume=80gb/spot=false` |

Notes
`ci.yml`, `collab.yml`) and only force on-demand for the long `cache.yml` / `publish.yml` builds, that's an easy narrowing; happy to adjust.

🤖 Generated with Claude Code
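
For readers unfamiliar with RunsOn labels, the change described above might look roughly like the fragment below. This is only a sketch. The job name `build` is hypothetical, and the `runs-on=${{ github.run_id }}/family=g4dn.2xlarge` prefix is an assumption standing in for the part of the spec that the PR elides as `…`. Only the `volume=80gb/spot=false` suffix and the `g4dn.2xlarge` instance type are taken from this PR.

```yaml
# Illustrative fragment only; not copied from the repository's workflows.
jobs:
  build:  # hypothetical job name
    # The prefix up to family=g4dn.2xlarge is assumed; the real spec is elided in the PR.
    # Appending spot=false asks RunsOn for on-demand capacity instead of spot.
    runs-on: "runs-on=${{ github.run_id }}/family=g4dn.2xlarge/volume=80gb/spot=false"
    steps:
      - uses: actions/checkout@v4
```

Keeping the option as a trailing `/spot=false` segment means that narrowing the change later, as floated in the notes, would only require removing that suffix from the shorter workflows.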