Add fault injection support via nvidia_resiliency_ext. #4370

Queued
hexinw-nvidia wants to merge 3 commits into NVIDIA:main from hexinw-nvidia:inject_fault

Contributor

@hexinw-nvidia hexinw-nvidia commented Apr 17, 2026

Add fault injection support via nvidia_resiliency_ext

Adds fault injection capability for resilience testing, driven by CLI args.
Fault injection is a no-op unless --fault-injector-ranks or --fault-injector-num-ranks is explicitly passed.

  • megatron/core/fault_injector.py: new module wrapping
    nvidia_resiliency_ext.shared_utils.inject_fault; provides
    setup_fault_injection() (broadcasts fault plan across ranks) and
    re-exports maybe_raise_workload_exception()
  • megatron/training/config/resilience_config.py: FaultInjectorConfig
    dataclass (ranks, fault type/probabilities, MTTI-based delay, seed)
    following the RerunStateMachineConfig pattern
  • megatron/training/arguments.py: registers fault injector args via
    ArgumentGroupFactory(FaultInjectorConfig); all default to None
  • megatron/training/training.py: lazily imports and calls
    setup_fault_injection() at train() entry when args are set;
    calls maybe_raise_workload_exception() each step after the first
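
The gating described above can be illustrated with a minimal, self-contained sketch. It does not import nvidia_resiliency_ext or Megatron; `FaultInjectorSketchConfig`, `fault_injection_requested`, and `run_training` are hypothetical names, and the `setup` and `check` callables merely stand in for `setup_fault_injection()` and `maybe_raise_workload_exception()`, whose real signatures are not shown in this PR description. The field names and the exact position of the per-step check inside the loop are also assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FaultInjectorSketchConfig:
    """Hypothetical stand-in for FaultInjectorConfig; field names are assumed."""

    ranks: Optional[List[int]] = None  # mirrors --fault-injector-ranks
    num_ranks: Optional[int] = None    # mirrors --fault-injector-num-ranks


def fault_injection_requested(cfg: FaultInjectorSketchConfig) -> bool:
    """Fault injection stays a no-op unless one of the two args was passed."""
    return cfg.ranks is not None or cfg.num_ranks is not None


def run_training(
    cfg: FaultInjectorSketchConfig,
    num_steps: int,
    setup: Callable[[FaultInjectorSketchConfig], None],
    check: Callable[[], None],
) -> int:
    """Toy training loop; returns the number of completed steps.

    `setup` stands in for setup_fault_injection() and `check` for
    maybe_raise_workload_exception(). In the real code the import is lazy,
    so the dependency is only needed when injection is requested.
    """
    enabled = fault_injection_requested(cfg)
    if enabled:
        setup(cfg)  # called once at train() entry

    completed = 0
    for step in range(num_steps):
        if enabled and step > 0:
            check()  # each step after the first; may raise to simulate a fault
        completed += 1  # placeholder for the actual training step
    return completed
```

With the default config neither callable is invoked. With `ranks=[0]` and four steps, `setup` runs once and `check` runs three times.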


copy-pr-bot Bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@hexinw-nvidia hexinw-nvidia force-pushed the inject_fault branch 8 times, most recently from 3299aa4 to 8b330b4 April 17, 2026 22:27
@hexinw-nvidia hexinw-nvidia changed the title from "Added fault injector testing utils. Enable via MEGATRON_FAULT_INJECTION" to "Add fault injection support via nvidia_resiliency_ext." Apr 20, 2026
@hexinw-nvidia hexinw-nvidia marked this pull request as ready for review April 20, 2026 16:29
@hexinw-nvidia hexinw-nvidia requested review from a team as code owners April 20, 2026 16:29
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 20, 2026 16:29
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 20, 2026
Comment thread megatron/training/config/resilience_config.py Outdated
@hexinw-nvidia hexinw-nvidia force-pushed the inject_fault branch 2 times, most recently from cc8ac54 to bbd67f3 April 23, 2026 20:15
@hexinw-nvidia hexinw-nvidia requested review from a team as code owners April 23, 2026 20:15
@ko3n1g ko3n1g removed request for a team April 28, 2026 19:20
Contributor

@maanug-nv maanug-nv left a comment

lgtm, thanks for addressing feedback

@maanug-nv maanug-nv requested a review from ericharper April 29, 2026 18:46
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review label (PR is in the "final review" stage) Apr 29, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review label (PR is in the "final review" stage) May 1, 2026
@ericharper
Contributor

/ok to test 6179109

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved label (All necessary approvals have been made) May 1, 2026
@ericharper ericharper enabled auto-merge May 1, 2026 18:00
@ericharper ericharper added this pull request to the merge queue May 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25242089343

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25242663760

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25243992329

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25246370882

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25247829108

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25255546383

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25257650007

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25258781947

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25262600975

Labels

Approved (All necessary approvals have been made)
complexity: medium

8 participants