
⚙️ Disable spot instances on GPU runners (use on-demand) #327

Merged
mmcky merged 1 commit into main from infra/disable-spot-gpu-runners
Jun 26, 2026



@mmcky mmcky commented Jun 26, 2026

Contributor

What

Forces on-demand GPU capacity instead of spot by adding spot=false to the RunsOn runner spec in all four GPU workflows: cache.yml, ci.yml, publish.yml, and collab.yml.

Why

AWS spot reclamation is increasingly interrupting our long GPU notebook builds mid-run. In the most recent weekly cache build (run 27929889850), the g4dn.2xlarge spot instance received a shutdown signal at 91% of the build ("reading sources... [91%]"), producing "The runner has received a shutdown signal" followed by "The operation was canceled". The downstream "AssertionError: self.km is not None" in nbclient cleanup is a symptom of the kernel being killed mid-execution, not a real notebook failure.
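As a rough illustration of the triage described above, the sketch below labels a failed build log by looking for the messages quoted in this description. The helper name `classify_failure`, the label strings, and the sample log are hypothetical. The sample is assembled from the quoted messages, not copied from the real run, and real logs may carry timestamps or prefixes that need looser matching.

```python
# Marker strings quoted in the PR description.
SHUTDOWN_MARKER = "The runner has received a shutdown signal"
CANCELED_MARKER = "The operation was canceled"
SYMPTOM_MARKER = "AssertionError: self.km is not None"


def classify_failure(log_text: str) -> str:
    """Return a coarse label for a failed build log (hypothetical helper)."""
    if SHUTDOWN_MARKER in log_text:
        # The runner was told to shut down, so later errors are likely symptoms.
        return "runner-shutdown"
    if CANCELED_MARKER in log_text:
        return "canceled"
    if SYMPTOM_MARKER in log_text:
        # Without a shutdown message this may be a genuine notebook problem.
        return "needs-investigation"
    return "other-failure"


# Illustrative log assembled from the quoted messages, not a real excerpt.
sample_log = "\n".join([
    "reading sources... [91%]",
    SHUTDOWN_MARKER,
    CANCELED_MARKER,
    SYMPTOM_MARKER,
])
print(classify_failure(sample_log))
```

Checking for the shutdown message first reflects the reasoning in the paragraph: once the runner has been signalled, the nbclient assertion is treated as fallout and not as a notebook bug.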

These builds run 17–22 minutes on a single GPU, so a reclamation near the end wastes the entire run (and any spot savings along with it). On-demand trades a higher hourly rate for build reliability, which is the right call for GPU jobs that can't cheaply checkpoint and resume.
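The trade-off can be sketched with a simple retry model. Everything here is an assumption made for illustration: interruptions are independent per attempt, a failed attempt wastes a fixed fraction of a full run, the build is retried until it succeeds, and the rates are placeholders, not real AWS prices. Under those assumptions the expected number of failed attempts before a success is p / (1 - p).

```python
def expected_cost(hourly_rate: float, run_minutes: float,
                  p_interrupt: float, wasted_fraction: float = 1.0) -> float:
    """Expected cost of one successful build under a simple retry model.

    Assumptions (not from the PR): each attempt is interrupted independently
    with probability p_interrupt, an interrupted attempt wastes
    wasted_fraction of a full run, and the build is retried until it succeeds.
    """
    if not 0.0 <= p_interrupt < 1.0:
        raise ValueError("p_interrupt must be in [0, 1)")
    run_cost = hourly_rate * run_minutes / 60.0
    # Mean number of failed attempts before the first success (geometric).
    expected_failures = p_interrupt / (1.0 - p_interrupt)
    return run_cost * (1.0 + wasted_fraction * expected_failures)


# Placeholder numbers for illustration only; they are not real AWS prices.
on_demand = expected_cost(hourly_rate=1.00, run_minutes=20, p_interrupt=0.0)
spot = expected_cost(hourly_rate=0.40, run_minutes=20,
                     p_interrupt=0.5, wasted_fraction=0.9)
print(f"on-demand: {on_demand:.3f}, spot: {spot:.3f}")
```

The model counts compute spend only. It ignores the extra wall-clock time of each retry and the maintainer time spent triaging a failed run, which are the costs this PR is mostly concerned with.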

Change

| Workflow | Trigger | Runner spec |
| --- | --- | --- |
| cache.yml | weekly schedule | …/volume=80gb/spot=false |
| ci.yml | pull_request | …/volume=80gb/spot=false |
| publish.yml | publish* tag | …/volume=80gb/spot=false |
| collab.yml | pull_request | …/volume=80gb/spot=false |
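The edit itself amounts to appending one `key=value` segment to a slash-separated runner spec. A minimal sketch of that transformation follows. The helper `with_label` and the example spec are hypothetical. The real specs are elided ("…") in the table above, so the prefix used here is a placeholder and not the repository's actual value.

```python
def with_label(spec: str, key: str, value: str) -> str:
    """Set key=value in a slash-separated runner spec, replacing any existing key."""
    parts = [p for p in spec.split("/") if p]
    # Drop any earlier setting of the same key so the result is idempotent.
    kept = [p for p in parts if p.split("=", 1)[0] != key]
    return "/".join(kept + [f"{key}={value}"])


# Hypothetical spec; only the volume=80gb segment appears in the PR.
example = "hypothetical-prefix/volume=80gb"
updated = with_label(example, "spot", "false")
print(updated)
```

Replacing an existing key before appending means the helper can be run repeatedly over the same workflow file without stacking duplicate `spot=` segments.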

Notes

  • Tracking the broader pattern across GPU-using lecture repos in a QuantEcon/meta issue.
  • If we'd rather keep spot for the cheaper/shorter PR jobs (ci.yml, collab.yml) and only force on-demand for the long cache.yml/publish.yml builds, that's an easy narrowing — happy to adjust.
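The narrower option in the second note can be written as a small policy function. This is a sketch only: the workflow names come from this PR, the function name `forces_on_demand` is hypothetical, and the merged change applies `spot=false` to all four workflows, not the narrowed set.

```python
# Workflow names come from the PR; the split into "long" builds follows the
# narrower option floated in the notes, not what was actually merged.
GPU_WORKFLOWS = ["cache.yml", "ci.yml", "publish.yml", "collab.yml"]
LONG_BUILDS = {"cache.yml", "publish.yml"}


def forces_on_demand(workflow: str, narrow: bool) -> bool:
    """Return True if the workflow's runner spec should carry spot=false."""
    if workflow not in GPU_WORKFLOWS:
        raise ValueError(f"unknown GPU workflow: {workflow}")
    return (workflow in LONG_BUILDS) if narrow else True


for name in GPU_WORKFLOWS:
    merged = forces_on_demand(name, narrow=False)
    narrowed = forces_on_demand(name, narrow=True)
    print(f"{name}: merged={merged}, narrowed={narrowed}")
```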

🤖 Generated with Claude Code

AWS spot reclamation is increasingly interrupting long GPU notebook
builds mid-run (e.g. the weekly cache build was killed at 91% in
https://github.com/QuantEcon/lecture-jax/actions/runs/27929889850 when
the spot g4dn.2xlarge received a shutdown signal). Force on-demand
capacity by adding spot=false to all four GPU runner specs.

See QuantEcon/meta for the tracking discussion on GPU spot reclamation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 26, 2026 06:34

netlify Bot commented Jun 26, 2026


Deploy Preview for incomparable-parfait-2417f8 ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | fba7b7e |
| 🔍 Latest deploy log | https://app.netlify.com/projects/incomparable-parfait-2417f8/deploys/6a3e1d78baa0a600086ec402 |
| 😎 Deploy Preview | https://deploy-preview-327--incomparable-parfait-2417f8.netlify.app |

Copilot AI left a comment


Pull request overview

This PR updates the repository’s GPU GitHub Actions workflows to request on-demand GPU runner capacity (instead of spot), improving reliability for long-running notebook builds that are susceptible to spot reclamation.

Changes:

  • Adds spot=false to the RunsOn GPU runner spec in all GPU workflows.
  • Applies the same runner-spec adjustment consistently across cache, ci, publish, and collab workflows.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/cache.yml | Forces on-demand GPU capacity for scheduled/manual cache builds by adding spot=false to the runner spec. |
| .github/workflows/ci.yml | Forces on-demand GPU capacity for PR preview builds by adding spot=false to the runner spec. |
| .github/workflows/publish.yml | Forces on-demand GPU capacity for tag-based publish builds by adding spot=false to the runner spec. |
| .github/workflows/collab.yml | Forces on-demand GPU capacity for PR Colab execution checks by adding spot=false to the runner spec. |



github-actions Bot commented Jun 26, 2026


@github-actions github-actions Bot temporarily deployed to pull request June 26, 2026 06:43 Inactive
@github-actions github-actions Bot temporarily deployed to pull request June 26, 2026 06:47 Inactive
@mmcky mmcky merged commit babfc41 into main Jun 26, 2026
8 checks passed
@mmcky mmcky deleted the infra/disable-spot-gpu-runners branch June 26, 2026 06:50
mmcky added a commit to QuantEcon/lecture-python.myst that referenced this pull request Jun 26, 2026
* [linkcheck] Clear residual false positives in weekly lychee report

The weekly link checker (#933) flags 8 errors out of ~25k links, all
false positives or harmless artifacts on non-content pages:

- IEEE Xplore returns "202 Accepted" (anti-bot) for a valid DOI cited in
  zreferences.html -> add 202 to --accept.
- genindex / search / prf-prf are auto-generated utility pages with no
  source notebook, so the theme's "Download Notebook" button points at a
  nonexistent _notebooks/<page>.ipynb and renders a second href="None"
  -> --exclude-path those three pages.
- A Journal of Derivatives DOI redirects into a login/paywall loop that
  exceeds max-redirects; the citation itself is valid -> --exclude it.

Configuration is kept inline in the workflow args (rather than a
lychee.toml) because lychee runs against the gh-pages checkout, which
does not contain repo-root config files.

Closes #933

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* [linkcheck] Escape and anchor --exclude-path regexes

lychee treats --exclude-path values as regular expressions, so the
unescaped dots in genindex.html / search.html / prf-prf.html were regex
wildcards and the patterns were unanchored. Escape the dot and anchor the
end ('<name>\.html$') so each matches only the intended generated page.

Addresses Copilot review feedback on #934.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Force on-demand GPU runners (spot=false)

AWS spot reclamation has been interrupting the g4dn.2xlarge GPU notebook
builds mid-run, discarding the whole build. Add spot=false to the RunsOn
runner spec in all four GPU workflows (cache, ci, collab, publish) so they
run on on-demand instances.

Rolls out the org-wide decision in QuantEcon/meta#330; mirrors
QuantEcon/lecture-jax#327.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
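The escaping and anchoring fix described in the second commit of the referenced message can be illustrated with a short check. This sketch uses Python's `re` module for demonstration only; lychee uses its own regex engine, and the assumption here is that `.`, `\.` and `$` behave the same way for these simple patterns. The sample paths are hypothetical.

```python
import re

# Unescaped and unanchored: "." matches any character, anywhere in the path.
unescaped = re.compile(r"search.html")
# Escaped and anchored, following the '<name>\.html$' form from the commit.
anchored = re.compile(r"search\.html$")

# Hypothetical paths chosen to show the difference between the two patterns.
paths = ["search.html", "research_html_notes.html", "search.html.bak"]

for path in paths:
    loose = unescaped.search(path) is not None
    strict = anchored.search(path) is not None
    print(f"{path}: unescaped={loose}, anchored={strict}")
```

The unescaped pattern matches all three paths, while the anchored one matches only the path that ends in `search.html`. The anchor applies only at the end, so a path with extra characters before `search.html` would still match.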

2 participants