fix(megatron-fp16): avoid assert crash & double grad scaling under fp16#44
fix(megatron-fp16): avoid assert crash & double grad scaling under fp16#44jamesruio wants to merge 1 commit into
Conversation
Before this fix, training with --fp16 + dynamic loss scaling crashed
immediately on the first gradient overflow:
File "relax/backends/megatron/model.py", line 529, in train_one_step
assert update_successful
AssertionError
Root cause: in fp16 with dynamic loss scaling, optimizer.step() returns
(False, None, None) on overflow as the documented signal for the loss
scaler to reduce the scale and retry next step. The previous code path
treated this as a fatal error.
Additionally, the previous flow called optimizer.prepare_grads()
externally and then optimizer.step(), which internally calls
prepare_grads() again. This caused two correctness issues:
- _unscale_main_grads_and_check_for_nan ran twice, dividing main grads
by inv_scale a second time
- grad_scaler.update() ran twice, double-advancing the loss-scale
growth interval
Fix:
- Replace the two-step flow with a single optimizer.step() call,
eliminating both the double unscale and the double scaler update.
- Detect overflow via the (False, None, None) return signature; set
valid_step=False, skip the LR scheduler step, and log a warning
with step_id and the current loss scale so the dynamic scaling
behavior is observable.
- Move the CI MTP grad check before optimizer.step() to match the
original comment intent (gradients may be modified during step).
Aligns with THUDM/slime#1842.
|
Thanks for the fix. The fp16 overflow handling is implemented in the right direction. I have a few comments below:
We’ll merge this PR into our internal branch, run full CI & CE tests, and sync to main after all tests pass. |
|
@yxyOo Thanks for your reply ! I could reproduce the update_successful assertion crash with Qwen3-4B and script is provided below。 ` now=$(date "+%Y-%m-%d-%H:%M:%S") export WORKDIR=/workspace/wurui04 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)" if [ -z "${RELAX_ENTRYPOINT_MODE:-}" ]; then PROJECT_NAME="${PROJECT_NAME:=Relax/dev/dapo-math}" CKPT_ARGS=( PROMPT_SET=${EXP_DIR}/dapo-math-17k/dapo-math-17k.jsonl ROLLOUT_ARGS=( --rm-type dapo --num-rollout ${NUM_ROLLOUT} --global-batch-size 256 EVAL_ARGS=( PERF_ARGS=( --recompute-granularity full --calculate-per-token-loss --use-dynamic-batch-size GRPO_ARGS=( --use-tis OPTIMIZER_ARGS=( SGLANG_ARGS=( WANDB_ARGS=( MISC_ARGS=( mkdir -p log |
@yxyOo Thanks for the clarification ! I would clarify that the double unscale and duplicate grad_scaler.update() calls issue exsits only in slime framework. The bug is avoided in original relax code (line 508). |
@yxyOo Hi yxyOo,could you reproduce the update_successful assertion crash with Qwen3-4B with my script ? I would greatly appreciate it if you could notify me of any progress. |
I couldn't reproduce the issue after training 110+ steps on 8×H800 (80GB) with your config, have you tested this on NVIDIA GPUs?
|
Sure! I tested it on a single 8xH20 machine using the config above, based on the latest Relax image. Could you kindly review our configurations to see if there are any remaining diffs? I would extremely appreciate it if this PR could be merged. |




Bug Fix
Aligns with THUDM/slime#1842.
Detail
Before this fix, training with --fp16 + dynamic loss scaling crashed immediately on the first gradient overflow:
Root cause: in fp16 with dynamic loss scaling, optimizer.step() returns (False, None, None) on overflow as the documented signal for the loss scaler to reduce the scale and retry next step. The previous code path treated this as a fatal error.
Additionally, the previous flow called optimizer.prepare_grads() externally and then optimizer.step(), which internally calls prepare_grads() again. This caused two correctness issues: