Conversation
Code Review
This pull request updates the GatedDeltaNet module to support the Megatron-Core modular API by adding an explicit init method and importing GatedDeltaNetSubmodules. Feedback suggests addressing potential backward compatibility issues with the new import and handling runtime errors that occur if the base class falls back to object.
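The backward-compatibility concern above can be sketched as a guarded import: if the new GatedDeltaNetSubmodules symbol is missing in an older Megatron-Core, fail with a clear message at use time rather than letting the base class silently fall back to object. The import path and helper name are illustrative assumptions, not the PR's actual code.

```python
# Hedged sketch: guard the new import so older Megatron-Core versions still
# load this module; raise a descriptive error only when the symbol is needed.
try:
    # Assumed import path for illustration only.
    from megatron.core.ssm.gated_delta_net import GatedDeltaNetSubmodules
    HAVE_GDN_SUBMODULES = True
except ImportError:
    GatedDeltaNetSubmodules = None
    HAVE_GDN_SUBMODULES = False

def require_gdn_submodules():
    """Fail loudly instead of surfacing a confusing AttributeError later."""
    if not HAVE_GDN_SUBMODULES:
        raise RuntimeError(
            "GatedDeltaNetSubmodules is unavailable; upgrade Megatron-Core "
            "to a version that provides the modular GatedDeltaNet API."
        )
    return GatedDeltaNetSubmodules
```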
/gemini review
Code Review
This pull request refactors the GatedDeltaNet implementation by splitting the input projection into two separate modules, in_proj_qkvz and in_proj_ba, and integrates Transformer Engine components like TEColumnParallelLinear and TENorm. It also introduces support for FP8 scale inversion during weight conversion and updates the transformers dependency range. The review feedback identifies several critical issues in the weight conversion logic within gpt_bridge.py, including an AttributeError caused by accessing a deleted attribute, potential KeyErrors from incorrect HuggingFace state dict keys, and a logic error where scale inversion values were overwriting weight tensors instead of being assigned to the correct scale_inv keys.
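The scale-inversion logic error called out above can be illustrated with a small state-dict transform: the inverted scale must be written under its own scale_inv key, not assigned over the weight tensor it describes. Key names and the function are illustrative, not the actual gpt_bridge.py code.

```python
# Hedged sketch of the corrected FP8 conversion step: the quantized weight
# stays under its original key, and the scale inversion is stored under a
# separate "_scale_inv" key (the bug was overwriting the weight instead).
def convert_fp8_entry(state_dict, weight_key, scale):
    out = dict(state_dict)
    out[weight_key] = state_dict[weight_key]        # weight is untouched
    out[weight_key + "_scale_inv"] = 1.0 / scale    # inversion goes to its own key
    return out

sd = {"layers.0.in_proj_qkvz.weight": "fp8-bytes"}
converted = convert_fp8_entry(sd, "layers.0.in_proj_qkvz.weight", 4.0)
```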
…' into support_qwen3_5_fp8
/gemini review
Code Review
This pull request introduces support for decoupled input projections within the GatedDeltaNet module, enabling separate linear layers for QKVZ and BA components. The changes include updates to the model configuration, state dictionary conversion logic in the bridge to handle both standard and LoRA weights, and specific layer specifications for Qwen 3.5 GDN. Critical feedback identifies a potential KeyError in the bridge due to redundant prefix handling and a NameError in the GatedDeltaNet forward pass where a variable is accessed outside its conditional definition scope.
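The NameError pattern flagged in the forward pass reduces to a variable bound only inside a conditional and then read unconditionally. A minimal sketch of the fix, with illustrative names rather than the real GatedDeltaNet code:

```python
# Hedged sketch: bind the variable before the conditional so the later
# read cannot raise NameError on the code path that skips the branch.
def forward(x, use_cache=False):
    cache_state = None          # fix: defined on every path
    if use_cache:
        cache_state = {"x": x}
    # Without the initialization above, this return raises NameError
    # whenever use_cache is False.
    return x, cache_state
```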
/gemini review
Code Review
This pull request introduces support for decoupled input projections in the GatedDeltaNet architecture, specifically for Qwen 3.5 models, including configuration updates and weight conversion logic for LoRA and FP8. Review feedback highlights a potential ImportError due to a top-level import of the optional transformer_engine library and a possible regression in word_embeddings export logic that may affect various models. Additionally, suggestions were provided to replace hardcoded CUDA device references with portable device selection logic to support non-GPU environments.
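Both suggestions above can be sketched together: defer the optional transformer_engine import into a function so that merely importing the module cannot raise ImportError, and select the device portably instead of hardcoding "cuda". The availability flag is passed in here so the sketch stays framework-free; in real code it would come from torch.cuda.is_available().

```python
# Hedged sketch, not the PR's actual code.
def get_te_linear():
    """Return transformer_engine's Linear if installed, else None.

    Importing lazily keeps the optional dependency from breaking module
    import on machines without Transformer Engine.
    """
    try:
        from transformer_engine.pytorch import Linear
        return Linear
    except ImportError:
        return None

def pick_device(cuda_available: bool) -> str:
    # Real code would use: "cuda" if torch.cuda.is_available() else "cpu"
    return "cuda" if cuda_available else "cpu"
```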
No description provided.