[codex] Migrate dFactory to latest VeOmni with NPU support#22
[codex] Migrate dFactory to latest VeOmni with NPU support#22Kirrito-k423 wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request migrates the LLaDA2 MoE model and training scripts to the latest VeOmni API, consolidating training logic into a common module, updating model registration, and adding support for Ascend NPU. The review feedback identifies several key issues and optimization opportunities: a bug where a boolean constant fallback is incorrectly defined as a function, a potential training crash due to nan values in entropy calculations, a PCIe transfer bottleneck from copying the block diffusion mask on every micro-step, an optimization to only run expert forward passes when tokens are routed, a parsing error in train.sh that limits NPU detection, and a missing safety check for None gradients in the validation script.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13 | ||
| except ImportError: | ||
| def is_torch_greater_or_equal_than_1_13(): | ||
| return True |
There was a problem hiding this comment.
In transformers.pytorch_utils, is_torch_greater_or_equal_than_1_13 is a boolean variable, not a function. Defining it as a fallback function returning True causes not is_torch_greater_or_equal_than_1_13 to always evaluate to False (since function objects are truthy in Python). It should be defined as a boolean constant instead.
try:
from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13
except ImportError:
is_torch_greater_or_equal_than_1_13 = True| log_probs = F.log_softmax(logits, dim=-1) | ||
| probs = torch.exp(log_probs) | ||
| entropy_per_token = -torch.sum(probs * log_probs, dim=-1) | ||
| return entropy_per_token[correct_mask].mean() |
There was a problem hiding this comment.
Computing entropy as probs * log_probs can produce nan values if any probability underflows to exactly 0.0 (which is common in large vocabulary models), because 0.0 * -inf evaluates to nan in PyTorch. Use torch.where to safely mask out zero probabilities and prevent silent training crashes.
| log_probs = F.log_softmax(logits, dim=-1) | |
| probs = torch.exp(log_probs) | |
| entropy_per_token = -torch.sum(probs * log_probs, dim=-1) | |
| return entropy_per_token[correct_mask].mean() | |
| log_probs = F.log_softmax(logits, dim=-1) | |
| probs = torch.exp(log_probs) | |
| entropy_per_token = -torch.sum(torch.where(probs > 0, probs * log_probs, 0.0), dim=-1) | |
| return entropy_per_token[correct_mask].mean() |
| logger.info_rank0(f"Load distributed checkpoint from {args.train.checkpoint.load_path} successfully!") | ||
|
|
||
| block_diffusion_attn_mask = _build_block_diffusion_mask(args) | ||
| helper.empty_cache() |
There was a problem hiding this comment.
The block_diffusion_attn_mask is a large tensor (e.g., ~33MB to 67MB depending on precision). Currently, it is created on CPU and copied to the accelerator device on every single micro-step inside _prepare_micro_batch. Moving it to the target device once during initialization avoids a massive host-to-device PCIe transfer bottleneck.
block_diffusion_attn_mask = _build_block_diffusion_mask(args)
if block_diffusion_attn_mask is not None:
block_diffusion_attn_mask = block_diffusion_attn_mask.to(get_device_type())| for i in range(self.config.num_experts): | ||
| y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i) |
There was a problem hiding this comment.
To optimize performance and avoid unnecessary compute or potential empty tensor edge cases in the underlying matrix multiplication kernels, only invoke the expert forward pass when there are actually tokens routed to that expert.
| for i in range(self.config.num_experts): | |
| y[flat_topk_idx == i] = self.experts(hidden_states[flat_topk_idx == i], expert_idx=i) | |
| for i in range(self.config.num_experts): | |
| mask = flat_topk_idx == i | |
| if mask.any(): | |
| y[mask] = self.experts(hidden_states[mask], expert_idx=i) |
| if command -v nvidia-smi >/dev/null 2>&1; then | ||
| NPROC_PER_NODE=$(nvidia-smi --list-gpus | wc -l | tr -d ' ') | ||
| elif command -v npu-smi >/dev/null 2>&1; then | ||
| NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' | sort -u | wc -l | tr -d ' ') |
There was a problem hiding this comment.
The npu-smi info command outputs a table enclosed in | characters. The current awk pattern /^[[:space:]]*[0-9]+/ expects lines to start with a number or whitespace, which fails to match any lines due to the leading |. This causes NPROC_PER_NODE to always default to 1 even on multi-NPU systems. Use a field-separator-based awk command to correctly parse the NPU IDs.
| NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+/ {print $1}' | sort -u | wc -l | tr -d ' ') | |
| NPROC_PER_NODE=$(npu-smi info 2>/dev/null | awk -F '|' '$2 ~ /[0-9]+/ {print $2}' | awk '{print $1}' | sort -u | wc -l | tr -d ' ') |
| loss.backward() | ||
|
|
||
| grad_name = "model.layers.0.mlp.experts.gate_proj" | ||
| grad = dict(model.named_parameters())[grad_name].grad.detach().float().cpu() |
There was a problem hiding this comment.
If the backward pass fails to propagate gradients to the specified parameter, grad will be None. Calling .detach() on None will raise an unhelpful AttributeError. Adding a check makes the validation script much more robust and informative.
grad_param = dict(model.named_parameters())[grad_name].grad
if grad_param is None:
raise ValueError(f"Gradient for {grad_name} is None. Backward pass might have failed or parameter is frozen.")
grad = grad_param.detach().float().cpu()8ad2cd0 to
f93835c
Compare
Summary
8ca09d7).fused_npuMoE dispatch binding, NPU-safe Liger handling, and an eager MoE fallback for baseline execution.ParallelPlan(extra_parallel_plan={...})).Full delivery branch with validation scripts, migration/workload reports, and runbooks: https://github.com/Kirrito-k423/dFactory/tree/veomni-npu-migration-full-delivery
Validation Summary
The validation harness and reports live in the full delivery branch above, not in this PR diff.
python3 -m py_compile models/llada2_moe/__init__.py models/llada2_moe/modeling_llada2_moe.py tasks/dataset/__init__.py tasks/train_llada2_common.py tasks/train_llada2_bd.py tasks/train_llada2_bd_with_dparallel.py models/llada2_moe/parallel_plan.pyruby -ryaml -e 'Dir["configs/sft/llada2_*bd_sft*.yaml"].sort.each { |p| YAML.load_file(p); puts p }'git diff --check --cachedveomni_qwen35:5.636879920959473.fused_npuvs eager tiny model: kernelnpu_fused_moe_forward, loss/logits/grad diff0.0.dFactory+VeOmni@600fe6d):5.506927490234375.0.0.inclusionAI/LLaDA2.0-mini-preview, 7 safetensors shards, asset check complete with--validate-safetensors.13.501853942871094;loss_abs_diff = 0.0; logitsmax_abs = 0.0,mean_abs = 0.0,max_rel = 0.0; current/legacy shards loaded =7; current/legacy missing and unexpected keys =0.ulysses_size=2, eager MoE to isolate MoE kernel noise.5.550048351287842; rank0/rank1 loss/logits/grad diff =0.0.ep_size=2, kernelnpu_fused_moe_forward,ep_enabled=true; expert params shard from 4 global experts to 2 local experts per rank; forward/backward passes.NameError: name 'group_gemm_same_nk' is not defined, so old/new EP numerical parity cannot be claimed.Loss Alignment Chart
This is a loss alignment chart from the completed validation points, not a multi-step training loss curve.
SP / EP Status
ulysses_size=2; loss/logits/grad diff is0.0on both ranks. This validates no functional regression under SP state, but LLaDA2's custom attention still does not implement production sequence-sliced attention, so SP performance/long-sequence benefits remain future work.ParallelPlan(extra_parallel_plan={...})fix. It runsep_size=2withfused_npuand sharded local expert parameters.Notes
--backwarddisabled) because the 16B model's backward pass has much higher HBM/CPU memory pressure. Tiny-model old/new full-gradient parity is covered separately and is zero diff.AttentionMaskConverterdeprecation warning that currently trips VeOmni logger formatting noise in validation logs.