Add fault injection support via nvidia_resiliency_ext. #4370
hexinw-nvidia wants to merge 3 commits into NVIDIA:main
maanug-nv left a comment:
lgtm, thanks for addressing feedback
/ok to test 6179109
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25242089343
Add fault injection support via nvidia_resiliency_ext
Adds fault injection capability for resilience testing, driven by CLI args.
Fault injection is a no-op unless --fault-injector-ranks or --fault-injector-num-ranks is explicitly passed.
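A minimal sketch of the opt-in gating described above, using plain argparse. Only the two flag names come from this PR; the argument types, the `build_parser` helper, and the `fault_injection_enabled` check are assumptions for illustration and are not the PR's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two fault-injection flags (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    group = parser.add_argument_group("fault injection")
    # Flag names are from the PR description; types and nargs are assumptions.
    group.add_argument("--fault-injector-ranks", type=int, nargs="+", default=None)
    group.add_argument("--fault-injector-num-ranks", type=int, default=None)
    return parser


def fault_injection_enabled(args: argparse.Namespace) -> bool:
    """Injection is active only when at least one flag was passed explicitly."""
    return (
        args.fault_injector_ranks is not None
        or args.fault_injector_num_ranks is not None
    )
```

Defaulting both flags to None, rather than to an empty list or zero, lets the check distinguish "not passed" from any value the user might pass.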
- Uses nvidia_resiliency_ext.shared_utils.inject_fault; provides setup_fault_injection(), which broadcasts the fault plan across ranks, and re-exports maybe_raise_workload_exception().
- FaultInjectorConfig dataclass (ranks, fault type and probabilities, MTTI-based delay, seed), following the RerunStateMachineConfig pattern.
- CLI args generated with ArgumentGroupFactory(FaultInjectorConfig); all default to None.
- Calls setup_fault_injection() at train() entry when the args are set, then calls maybe_raise_workload_exception() each step after the first.
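The control flow in the list above (set up once at train() entry, then check on every step after the first) can be sketched with stand-ins. Everything here is hypothetical: `SketchFaultConfig`, `sketch_setup`, `sketch_train`, and `InjectedFault` are illustration-only names, the fault fires at a fixed step rather than after an MTTI-based delay, and rank selection by count is simplified. None of it calls or mirrors the real nvidia_resiliency_ext API.

```python
from dataclasses import dataclass
from typing import List, Optional


class InjectedFault(RuntimeError):
    """Hypothetical exception standing in for an injected workload fault."""


@dataclass
class SketchFaultConfig:
    # All fields default to None, so injection stays off unless something is set.
    ranks: Optional[List[int]] = None
    num_ranks: Optional[int] = None
    fault_step: Optional[int] = None  # assumption: the PR uses a time delay, not a step


def sketch_setup(config: SketchFaultConfig, rank: int) -> bool:
    """Return True if this rank should inject a fault (stand-in for the setup call)."""
    if config.ranks is None and config.num_ranks is None:
        return False  # nothing was requested, so this is a no-op
    if config.ranks is not None:
        return rank in config.ranks
    # Assumption: with only a count given, target the lowest-numbered ranks.
    return rank < config.num_ranks


def sketch_train(config: SketchFaultConfig, rank: int, num_steps: int) -> int:
    """Run a dummy training loop and return the number of completed steps."""
    armed = sketch_setup(config, rank)  # done once, at loop entry
    completed = 0
    for step in range(num_steps):
        # The check is skipped on the first step, as the description states.
        if armed and step > 0 and step == config.fault_step:
            raise InjectedFault(f"rank {rank} failed at step {step}")
        completed += 1
    return completed
```

With a default `SketchFaultConfig()`, `sketch_train` runs every step untouched; with `ranks=[1]` and `fault_step=2`, only rank 1 raises, and only once it reaches step 2.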