Skip to content

[codex] Migrate dFactory to latest VeOmni with NPU support#22

Open
Kirrito-k423 wants to merge 2 commits into
inclusionAI:mainfrom
Kirrito-k423:veomni-npu-migration
Open

[codex] Migrate dFactory to latest VeOmni with NPU support#22
Kirrito-k423 wants to merge 2 commits into
inclusionAI:mainfrom
Kirrito-k423:veomni-npu-migration

Conversation

@Kirrito-k423

@Kirrito-k423 Kirrito-k423 commented Jun 24, 2026

Copy link
Copy Markdown

Summary

  • Upgrade the VeOmni submodule from the old v0.1.2-era commit to current main (8ca09d7).
  • Migrate LLaDA2 MoE model registration, training arguments, dataloader/checkpoint/optimizer usage, and SFT configs to the latest VeOmni APIs.
  • Add Ascend NPU core support through NPU-aware ops config, fused_npu MoE dispatch binding, NPU-safe Liger handling, and an eager MoE fallback for baseline execution.
  • Fix LLaDA2 EP parallel plan construction for latest VeOmni (ParallelPlan(extra_parallel_plan={...})).
  • Keep this upstream PR core-only: no validation harness scripts, migration reports, workload reports, or generated artifacts are included in the PR diff.

Full delivery branch with validation scripts, migration/workload reports, and runbooks: https://github.com/Kirrito-k423/dFactory/tree/veomni-npu-migration-full-delivery

Validation Summary

The validation harness and reports live in the full delivery branch above, not in this PR diff.

  • Core PR static checks:
    • python3 -m py_compile models/llada2_moe/__init__.py models/llada2_moe/modeling_llada2_moe.py tasks/dataset/__init__.py tasks/train_llada2_common.py tasks/train_llada2_bd.py tasks/train_llada2_bd_with_dparallel.py models/llada2_moe/parallel_plan.py
    • ruby -ryaml -e 'Dir["configs/sft/llada2_*bd_sft*.yaml"].sort.each { |p| YAML.load_file(p); puts p }'
    • git diff --check --cached
  • Remote Ascend 910B2 smoke in veomni_qwen35:
    • model/config registry import passes.
    • CPU eager tiny model smoke: finite loss 5.636879920959473.
    • NPU fused_npu vs eager tiny model: kernel npu_fused_moe_forward, loss/logits/grad diff 0.0.
  • Legacy parity against a temporary pre-migration checkout (dFactory + VeOmni@600fe6d):
    • tiny eager path current loss = legacy loss = 5.506927490234375.
    • loss/logits/full-gradient diff = 0.0.
  • Real-weight old/new alignment on Ascend 910B2:
    • Model: inclusionAI/LLaDA2.0-mini-preview, 7 safetensors shards, asset check complete with --validate-safetensors.
    • Result: current loss = legacy loss = 13.501853942871094; loss_abs_diff = 0.0; logits max_abs = 0.0, mean_abs = 0.0, max_rel = 0.0; current/legacy shards loaded = 7; current/legacy missing and unexpected keys = 0.
  • SP tiny old/new parity on 2-card Ascend 910B2:
    • ulysses_size=2, eager MoE to isolate MoE kernel noise.
    • current loss = legacy loss = 5.550048351287842; rank0/rank1 loss/logits/grad diff = 0.0.
  • EP tiny validation on 2-card Ascend 910B2:
    • New VeOmni path passes with ep_size=2, kernel npu_fused_moe_forward, ep_enabled=true; expert params shard from 4 global experts to 2 local experts per rank; forward/backward passes.
    • Legacy EP NPU baseline does not run in the same environment: legacy fused EP fails with NameError: name 'group_gemm_same_nk' is not defined, so old/new EP numerical parity cannot be claimed.

Loss Alignment Chart

This is a loss alignment chart from the completed validation points, not a multi-step training loss curve.

xychart-beta
  title "LLaDA2 loss alignment"
  x-axis ["Tiny old/new", "Tiny NPU fused/eager", "Real-weight old/new", "SP old/new"]
  y-axis "loss" 0 --> 14
  line "baseline/reference" [5.506927, 5.469104, 13.501854, 5.550048]
  line "migrated/current" [5.506927, 5.469104, 13.501854, 5.550048]
Loading

SP / EP Status

  • SP: tiny 2-card NPU old/new parity is verified with ulysses_size=2; loss/logits/grad diff is 0.0 on both ranks. This validates no functional regression under SP state, but LLaDA2's custom attention still does not implement production sequence-sliced attention, so SP performance/long-sequence benefits remain future work.
  • EP: new VeOmni EP basic function is verified after the ParallelPlan(extra_parallel_plan={...}) fix. It runs ep_size=2 with fused_npu and sharded local expert parameters.
  • EP old/new parity: not claimable because the legacy VeOmni EP NPU path fails before producing baseline outputs.

Notes

  • Real-weight alignment was run forward-only (--backward disabled) because the 16B model's backward pass has much higher HBM/CPU memory pressure. Tiny-model old/new full-gradient parity is covered separately and is zero diff.
  • Remaining production work is real-weight EP, SP/EP/FSDP2 combined multi-card validation, checkpoint save/load and HF safetensor export under real SFT steps, and performance work for LLaDA2 attention/RMSNorm/RoPE NPU kernels if needed.
  • transformers v5 emits a non-blocking AttentionMaskConverter deprecation warning that currently trips VeOmni logger formatting noise in validation logs.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates the LLaDA2 MoE model and training scripts to the latest VeOmni API, consolidating training logic into a common module, updating model registration, and adding support for Ascend NPU. The review feedback identifies several key issues and optimization opportunities: a bug where a boolean constant fallback is incorrectly defined as a function, a potential training crash due to nan values in entropy calculations, a PCIe transfer bottleneck from copying the block diffusion mask on every micro-step, an optimization to only run expert forward passes when tokens are routed, a parsing error in train.sh that limits NPU detection, and a missing safety check for None gradients in the validation script.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +47 to +51
try:
from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13
except ImportError:
def is_torch_greater_or_equal_than_1_13():
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In transformers.pytorch_utils, is_torch_greater_or_equal_than_1_13 is a boolean variable, not a function. Defining it as a fallback function returning True causes not is_torch_greater_or_equal_than_1_13 to always evaluate to False (since function objects are truthy in Python). It should be defined as a boolean constant instead.

try:
    from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13
except ImportError:
    is_torch_greater_or_equal_than_1_13 = True

Comment on lines +169 to +172
log_probs = F.log_softmax(logits, dim=-1)
probs = torch.exp(log_probs)
entropy_per_token = -torch.sum(probs * log_probs, dim=-1)
return entropy_per_token[correct_mask].mean()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Computing entropy as probs * log_probs can produce nan values if any probability underflows to exactly 0.0 (which is common in large vocabulary models), because 0.0 * -inf evaluates to nan in PyTorch. Use torch.where to safely mask out zero probabilities and prevent silent training crashes.

Suggested change
log_probs = F.log_softmax(logits, dim=-1)
probs = torch.exp(log_probs)
entropy_per_token = -torch.sum(probs * log_probs, dim=-1)
return entropy_per_token[correct_mask].mean()
log_probs = F.log_softmax(logits, dim=-1)
probs = torch.exp(log_probs)
entropy_per_token = -torch.sum(torch.where(probs > 0, probs * log_probs, 0.0), dim=-1)
return entropy_per_token[correct_mask].mean()

logger.info_rank0(f"Load distributed checkpoint from {args.train.checkpoint.load_path} successfully!")

block_diffusion_attn_mask = _build_block_diffusion_mask(args)
helper.empty_cache()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The block_diffusion_attn_mask is a large tensor (e.g., ~33MB to 67MB depending on precision). Currently, it is created on CPU and copied to the accelerator device on every single micro-step inside _prepare_micro_batch. Moving it to the target device once during initialization avoids a massive host-to-device PCIe transfer bottleneck.

    block_diffusion_attn_mask = _build_block_diffusion_mask(args)
    if block_diffusion_attn_mask is not None:
        block_diffusion_attn_mask = block_diffusion_attn_mask.to(get_device_type())

Comment on lines +447 to +448
for i in range(self.config.num_experts):
y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To optimize performance and avoid unnecessary compute or potential empty tensor edge cases in the underlying matrix multiplication kernels, only invoke the expert forward pass when there are actually tokens routed to that expert.

Suggested change
for i in range(self.config.num_experts):
y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i)
for i in range(self.config.num_experts):
mask = flat_topk_idx == i
if mask.any():
y[mask] = self.experts(hidden_states[mask], expert_idx=i)

Comment thread train.sh
if command -v nvidia-smi >/dev/null 2>&1; then
NPROC_PER_NODE=$(nvidia-smi --list-gpus | wc -l | tr -d ' ')
elif command -v npu-smi >/dev/null 2>&1; then
NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' | sort -u | wc -l | tr -d ' ')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The npu-smi info command outputs a table enclosed in | characters. The current awk pattern /^[[:space:]]*[0-9]+/ expects lines to start with a number or whitespace, which fails to match any lines due to the leading |. This causes NPROC_PER_NODE to always default to 1 even on multi-NPU systems. Use a field-separator-based awk command to correctly parse the NPU IDs.

Suggested change
NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' | sort -u | wc -l | tr -d ' ')
NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk -F '|' '$2 ~ /[0-9]+/ {print $2}' | awk '{print $1}' | sort -u | wc -l | tr -d ' ')

Comment thread scripts/validate_llada2_moe_veomni.py Outdated
loss.backward()

grad_name = "model.layers.0.mlp.experts.gate_proj"
grad = dict(model.named_parameters())[grad_name].grad.detach().float().cpu()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the backward pass fails to propagate gradients to the specified parameter, grad will be None. Calling .detach() on None will raise an unhelpful AttributeError. Adding a check makes the validation script much more robust and informative.

    grad_param = dict(model.named_parameters())[grad_name].grad
    if grad_param is None:
        raise ValueError(f"Gradient for {grad_name} is None. Backward pass might have failed or parameter is frozen.")
    grad = grad_param.detach().float().cpu()

@Kirrito-k423 Kirrito-k423 marked this pull request as ready for review June 25, 2026 02:52
@Kirrito-k423 Kirrito-k423 force-pushed the veomni-npu-migration branch from 8ad2cd0 to f93835c Compare June 25, 2026 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant