scripts: serialize PyPy/Rust install and verify toolchain is usable#13828
scripts: serialize PyPy/Rust install and verify toolchain is usable#13828avi-starkware wants to merge 3 commits intomain-v0.14.2from
Conversation
Two related fixes for flaky docker builds:
1. Serialize install_pypy and install_rust. rustup-init's multi-threaded
component downloader has a known race where a worker thread panics
("thread 'CloseHandle' panicked at src/diskio/mod.rs: RecvError").
The race is triggered or aggravated by concurrent I/O from parallel
installers; when it fires, rustup leaves a partial toolchain stub
behind.
2. After installation, probe the active toolchain with `cargo --version`
in the project dir (where rust-toolchain.toml pins the channel). On
a healthy install this is a ~10ms no-op. If it fails — meaning
rustup created a stub for the pinned channel but never finished
populating its components — retry with `rustup toolchain install
--force` to actually download rustc/cargo/etc.
Without these fixes, install_cargo_tools.sh later runs `cargo install
sccache`, cargo proxies to the pinned (partial) toolchain, and the
build fails with "'cargo' is not installed for the toolchain '1.92-…'".
PR SummaryLow Risk Overview Adds a post-install probe ( Reviewed by Cursor Bugbot for commit 416eff2. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
Artifacts upload workflows: |
dorimedini-starkware
left a comment
There was a problem hiding this comment.
@dorimedini-starkware reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on avi-starkware, idan-starkware, and yoavGrs).
scripts/install_build_tools.sh line 106 at r1 (raw file):
log_step "install_build_tools" "Installing PyPy and Rust..." # Intentionally serial: rustup-init's multi-threaded component downloader has a # known race (`thread 'CloseHandle' panicked at src/diskio/mod.rs: RecvError`)
is there some github issue you can link to here?
Code quote:
known racescripts/install_build_tools.sh line 112 at r1 (raw file):
install_pypy install_rust log_step "install_build_tools" "PyPy and Rust installed"
Suggestion:
log_step "install_build_tools" "Installing PyPy..."
# Intentionally serial: rustup-init's multi-threaded component downloader has a
# known race (`thread 'CloseHandle' panicked at src/diskio/mod.rs: RecvError`)
# that's triggered or aggravated by concurrent I/O from parallel installers.
# When the race fires, rustup leaves a partial toolchain stub behind and later
# `cargo install` calls fail with "'cargo' is not installed for the toolchain".
install_pypy
log_step "install_build_tools" "PyPy installed. Installing Rust..."
install_rust
log_step "install_build_tools" "Rust installed."- Link rust-lang/rustup#2926 for the CloseHandle/RecvError panic - Split the PyPy and Rust log_step messages so each installer's progress is visible (per dorimedini-starkware's suggestion)
rust-lang/rustup#2926 is one known trigger for a mid-install rustup failure, not proven to be the mechanism behind every partial-install occurrence we've seen. Reword the block comment to describe the pattern (any mid-install failure can leave a registered stub without components) rather than claiming a specific causal chain, and note that the verify + force-reinstall step is what actually guarantees correctness.
dorimedini-starkware
left a comment
There was a problem hiding this comment.
@dorimedini-starkware reviewed 1 file and all commit messages, and resolved 2 discussions.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on idan-starkware and yoavGrs).
avi-starkware
left a comment
There was a problem hiding this comment.
@avi-starkware made 2 comments.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on idan-starkware and yoavGrs).
scripts/install_build_tools.sh line 106 at r1 (raw file):
Previously, dorimedini-starkware wrote…
is there some github issue you can link to here?
Done.
scripts/install_build_tools.sh line 112 at r1 (raw file):
install_pypy install_rust log_step "install_build_tools" "PyPy and Rust installed"
Done.
dorimedini-starkware
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! all files reviewed, all discussions resolved (waiting on idan-starkware and yoavGrs).
Summary
Fixes a flaky Docker build failure of the form:
What we observed
'cargo' is not installed for the toolchain '1.9x-…'on several runs (PR scripts: install project Rust toolchain before cargo tools #13795's reproducer,remove-mock-forest-storage's sequencer docker build, and the most recent report oneinat/proof-flow-1-state-reader-fixes).thread 'CloseHandle' panicked at src/diskio/mod.rs:…: called Result::unwrap() on an Err value: RecvErroron two separate runs (one on this branch, one on main'snative-blockifier-artifacts-push). In both cases, buildx then hung for ~50 minutes until its gRPC keepalive timed out.Both observations share a root pattern: rustup is asked to install the
rust-toolchain.toml-pinned channel, the install is interrupted mid-flight, and rustup is left with a stub entry for the channel (marked "installed" / "active") while its components (rustc, cargo, …) never finished downloading. On the next invocation,cargo install …fails with "cargo is not installed for the toolchain".I don't have a single log that shows the panic directly causing a subsequent "cargo not installed" error — the buildx hang usually masks what would have surfaced — but the panic is at least one plausible mid-install failure mode, and any mid-install exit (panic, OOM, timeout, etc.) produces the same partial-install state.
Fixes
install_pypyandinstall_rust. Concurrent disk I/O from a sibling installer is a known trigger for rustup's multi-threaded downloader panic (rust-lang/rustup#2926). Serial execution removes that trigger. The extra wall-time cost is ~10s per fresh base-image build (PyPy's runtime), and the step is Docker-cached most of the time.cargo --versionin the project dir (whererust-toolchain.tomlis the active override) is a ~10ms no-op on a healthy install, so devs who already have the toolchain fully installed pay nothing.--forceis only invoked when the check actually fails.Why not unconditional
rustup toolchain install --force?It works in Docker (layer cache absorbs the cost) but is disruptive on a developer machine:
--forcere-downloads ~200 MB even when the toolchain is perfectly fine. The conditional probe avoids that.Test plan