[codex] Migrate dFactory to latest VeOmni with NPU support by Kirrito-k423 · Pull Request #22 · inclusionAI/dFactory

Kirrito-k423 · 2026-06-24T14:18:02Z

Summary

Upgrade the VeOmni submodule from the old v0.1.2-era commit to current main (8ca09d7).
Migrate LLaDA2 MoE model registration, training arguments, dataloader/checkpoint/optimizer usage, and SFT configs to the latest VeOmni APIs.
Add Ascend NPU core support through NPU-aware ops config, fused_npu MoE dispatch binding, NPU-safe Liger handling, and an eager MoE fallback for baseline execution.
Fix LLaDA2 EP parallel plan construction for latest VeOmni (ParallelPlan(extra_parallel_plan={...})).
Keep this upstream PR core-only: no validation harness scripts, migration reports, workload reports, or generated artifacts are included in the PR diff.

Full delivery branch with validation scripts, migration/workload reports, and runbooks: https://github.com/Kirrito-k423/dFactory/tree/veomni-npu-migration-full-delivery

Validation Summary

The validation harness and reports live in the full delivery branch above, not in this PR diff.

Core PR static checks:
- python3 -m py_compile models/llada2_moe/__init__.py models/llada2_moe/modeling_llada2_moe.py tasks/dataset/__init__.py tasks/train_llada2_common.py tasks/train_llada2_bd.py tasks/train_llada2_bd_with_dparallel.py models/llada2_moe/parallel_plan.py
- ruby -ryaml -e 'Dir["configs/sft/llada2_*bd_sft*.yaml"].sort.each { |p| YAML.load_file(p); puts p }'
- git diff --check --cached
Remote Ascend 910B2 smoke in veomni_qwen35:
- model/config registry import passes.
- CPU eager tiny model smoke: finite loss 5.636879920959473.
- NPU fused_npu vs eager tiny model: kernel npu_fused_moe_forward, loss/logits/grad diff 0.0.
Legacy parity against a temporary pre-migration checkout (dFactory + VeOmni@600fe6d):
- tiny eager path current loss = legacy loss = 5.506927490234375.
- loss/logits/full-gradient diff = 0.0.
Real-weight old/new alignment on Ascend 910B2:
- Model: inclusionAI/LLaDA2.0-mini-preview, 7 safetensors shards, asset check complete with --validate-safetensors.
- Result: current loss = legacy loss = 13.501853942871094; loss_abs_diff = 0.0; logits max_abs = 0.0, mean_abs = 0.0, max_rel = 0.0; current/legacy shards loaded = 7; current/legacy missing and unexpected keys = 0.
SP tiny old/new parity on 2-card Ascend 910B2:
- ulysses_size=2, eager MoE to isolate MoE kernel noise.
- current loss = legacy loss = 5.550048351287842; rank0/rank1 loss/logits/grad diff = 0.0.
EP tiny validation on 2-card Ascend 910B2:
- New VeOmni path passes with ep_size=2, kernel npu_fused_moe_forward, ep_enabled=true; expert params shard from 4 global experts to 2 local experts per rank; forward/backward passes.
- Legacy EP NPU baseline does not run in the same environment: legacy fused EP fails with NameError: name 'group_gemm_same_nk' is not defined, so old/new EP numerical parity cannot be claimed.

Loss Alignment Chart

This is a loss alignment chart from the completed validation points, not a multi-step training loss curve.

xychart-beta
  title "LLaDA2 loss alignment"
  x-axis ["Tiny old/new", "Tiny NPU fused/eager", "Real-weight old/new", "SP old/new"]
  y-axis "loss" 0 --> 14
  line "baseline/reference" [5.506927, 5.469104, 13.501854, 5.550048]
  line "migrated/current" [5.506927, 5.469104, 13.501854, 5.550048]

SP / EP Status

SP: tiny 2-card NPU old/new parity is verified with ulysses_size=2; loss/logits/grad diff is 0.0 on both ranks. This validates no functional regression under SP state, but LLaDA2's custom attention still does not implement production sequence-sliced attention, so SP performance/long-sequence benefits remain future work.
EP: new VeOmni EP basic function is verified after the ParallelPlan(extra_parallel_plan={...}) fix. It runs ep_size=2 with fused_npu and sharded local expert parameters.
EP old/new parity: not claimable because the legacy VeOmni EP NPU path fails before producing baseline outputs.

Notes

Real-weight alignment was run forward-only (--backward disabled) because the 16B model's backward pass has much higher HBM/CPU memory pressure. Tiny-model old/new full-gradient parity is covered separately and is zero diff.
Remaining production work is real-weight EP, SP/EP/FSDP2 combined multi-card validation, checkpoint save/load and HF safetensor export under real SFT steps, and performance work for LLaDA2 attention/RMSNorm/RoPE NPU kernels if needed.
transformers v5 emits a non-blocking AttentionMaskConverter deprecation warning that currently trips VeOmni logger formatting noise in validation logs.

gemini-code-assist

Code Review

This pull request migrates the LLaDA2 MoE model and training scripts to the latest VeOmni API, consolidating training logic into a common module, updating model registration, and adding support for Ascend NPU. The review feedback identifies several key issues and optimization opportunities: a bug where a boolean constant fallback is incorrectly defined as a function, a potential training crash due to nan values in entropy calculations, a PCIe transfer bottleneck from copying the block diffusion mask on every micro-step, an optimization to only run expert forward passes when tokens are routed, a parsing error in train.sh that limits NPU detection, and a missing safety check for None gradients in the validation script.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-24T14:20:21Z

+try:
+    from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13
+except ImportError:
+    def is_torch_greater_or_equal_than_1_13():
+        return True


In transformers.pytorch_utils, is_torch_greater_or_equal_than_1_13 is a boolean variable, not a function. Defining it as a fallback function returning True causes not is_torch_greater_or_equal_than_1_13 to always evaluate to False (since function objects are truthy in Python). It should be defined as a boolean constant instead.

try: from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13 except ImportError: is_torch_greater_or_equal_than_1_13 = True

gemini-code-assist · 2026-06-24T14:20:21Z

+    log_probs = F.log_softmax(logits, dim=-1)
+    probs = torch.exp(log_probs)
+    entropy_per_token = -torch.sum(probs * log_probs, dim=-1)
+    return entropy_per_token[correct_mask].mean()


Computing entropy as probs * log_probs can produce nan values if any probability underflows to exactly 0.0 (which is common in large vocabulary models), because 0.0 * -inf evaluates to nan in PyTorch. Use torch.where to safely mask out zero probabilities and prevent silent training crashes.

Suggested change

log_probs = F.log_softmax(logits, dim=-1)

probs = torch.exp(log_probs)

entropy_per_token = -torch.sum(probs * log_probs, dim=-1)

return entropy_per_token[correct_mask].mean()

log_probs = F.log_softmax(logits, dim=-1)

probs = torch.exp(log_probs)

entropy_per_token = -torch.sum(torch.where(probs > 0, probs * log_probs, 0.0), dim=-1)

return entropy_per_token[correct_mask].mean()

gemini-code-assist · 2026-06-24T14:20:21Z

+        logger.info_rank0(f"Load distributed checkpoint from {args.train.checkpoint.load_path} successfully!")
+
+    block_diffusion_attn_mask = _build_block_diffusion_mask(args)
+    helper.empty_cache()


The block_diffusion_attn_mask is a large tensor (e.g., ~33MB to 67MB depending on precision). Currently, it is created on CPU and copied to the accelerator device on every single micro-step inside _prepare_micro_batch. Moving it to the target device once during initialization avoids a massive host-to-device PCIe transfer bottleneck.

block_diffusion_attn_mask = _build_block_diffusion_mask(args) if block_diffusion_attn_mask is not None: block_diffusion_attn_mask = block_diffusion_attn_mask.to(get_device_type())

gemini-code-assist · 2026-06-24T14:20:21Z

+        for i in range(self.config.num_experts):
+            y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i)


To optimize performance and avoid unnecessary compute or potential empty tensor edge cases in the underlying matrix multiplication kernels, only invoke the expert forward pass when there are actually tokens routed to that expert.

Suggested change

for i in range(self.config.num_experts):

y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i)

for i in range(self.config.num_experts):

mask = flat_topk_idx == i

if mask.any():

y[mask] = self.experts(hidden_states[mask], expert_idx=i)

gemini-code-assist · 2026-06-24T14:20:21Z

+  if command -v nvidia-smi >/dev/null 2>&1; then
+    NPROC_PER_NODE=$(nvidia-smi --list-gpus | wc -l | tr -d ' ')
+  elif command -v npu-smi >/dev/null 2>&1; then
+    NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' | sort -u | wc -l | tr -d ' ')


The npu-smi info command outputs a table enclosed in | characters. The current awk pattern /^[[:space:]]*[0-9]+/ expects lines to start with a number or whitespace, which fails to match any lines due to the leading |. This causes NPROC_PER_NODE to always default to 1 even on multi-NPU systems. Use a field-separator-based awk command to correctly parse the NPU IDs.

Suggested change

NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' | sort -u | wc -l | tr -d ' ')

NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk -F '|' '$2 ~ /[0-9]+/ {print $2}' | awk '{print $1}' | sort -u | wc -l | tr -d ' ')

gemini-code-assist · 2026-06-24T14:20:22Z

+    loss.backward()
+
+    grad_name = "model.layers.0.mlp.experts.gate_proj"
+    grad = dict(model.named_parameters())[grad_name].grad.detach().float().cpu()


If the backward pass fails to propagate gradients to the specified parameter, grad will be None. Calling .detach() on None will raise an unhelpful AttributeError. Adding a check makes the validation script much more robust and informative.

grad_param = dict(model.named_parameters())[grad_name].grad if grad_param is None: raise ValueError(f"Gradient for {grad_name} is None. Backward pass might have failed or parameter is frozen.") grad = grad_param.detach().float().cpu()

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

Kirrito-k423 marked this pull request as ready for review June 25, 2026 02:52

Migrate dFactory core to latest VeOmni with NPU support

f93835c

Kirrito-k423 force-pushed the veomni-npu-migration branch from 8ad2cd0 to f93835c Compare June 25, 2026 03:18

Fix LLaDA2 EP parallel plan for latest VeOmni

15e9f20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[codex] Migrate dFactory to latest VeOmni with NPU support#22

[codex] Migrate dFactory to latest VeOmni with NPU support#22
Kirrito-k423 wants to merge 2 commits into
inclusionAI:mainfrom
Kirrito-k423:veomni-npu-migration

Kirrito-k423 commented Jun 24, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		for i in range(self.config.num_experts):
		y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i)

	NPROC_PER_NODE=$(npu-smi info 2>/dev/null \| awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' \| sort -u \| wc -l \| tr -d ' ')
	NPROC_PER_NODE=$(npu-smi info 2>/dev/null \| awk -F '\|' '$2 ~ /[0-9]+/ {print $2}' \| awk '{print $1}' \| sort -u \| wc -l \| tr -d ' ')

Uh oh!

Conversation

Kirrito-k423 commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation Summary

Loss Alignment Chart

SP / EP Status

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Kirrito-k423 commented Jun 24, 2026 •

edited

Loading