diff --git a/contrib/models/glm4_moe/README.md b/contrib/models/glm4_moe/README.md new file mode 100644 index 00000000..f9a18c96 --- /dev/null +++ b/contrib/models/glm4_moe/README.md @@ -0,0 +1,213 @@ +# Contrib Model: GLM-4.5 MoE + +NeuronX Distributed Inference implementation of [GLM-4.5 MoE](https://huggingface.co/zai-org/GLM-4.5-Air) — a Mixture-of-Experts language model from ZhipuAI / Tsinghua University with unique architectural features including partial RoPE, sigmoid routing with group selection, and shared experts. + +## Model Information + +- **HuggingFace ID:** `zai-org/GLM-4.5-Air` +- **Model Type:** Decoder-only MoE transformer (`Glm4MoeForCausalLM`) +- **Architecture:** 46 layers, hidden size 4096, 128 routed experts, 2 shared experts +- **Parameters:** ~70B total, ~9B active per token +- **License:** [GLM-4 License](https://huggingface.co/zai-org/GLM-4.5-Air) + +## Architecture Details + +GLM-4.5 MoE has several differences from standard MoE models that required custom implementations: + +| Feature | GLM-4.5 MoE | Standard MoE (e.g. 
Qwen3MoE) | +|---|---|---| +| RoPE | Partial (first 50% of head_dim) | Full | +| QKV Bias | Yes (`attention_bias=True`) | No | +| Router activation | Sigmoid | Softmax | +| Routing | Group-limited top-k | Top-k | +| Correction bias | `e_score_correction_bias` | None | +| Weight normalization | `norm_topk_prob` + `routed_scaling_factor` | Simple softmax | +| Shared experts | `n_shared_experts=1` (always active) | 0 or variable | +| First `k` layers | Dense MLP (`first_k_dense_replace`) | All MoE | + +### Full Architecture (GLM-4.5 Air) + +| Parameter | Value | +|---|---| +| `num_hidden_layers` | 46 | +| `hidden_size` | 4096 | +| `num_attention_heads` | 32 | +| `num_key_value_heads` | 2 | +| `head_dim` | 128 | +| `partial_rotary_factor` | 0.5 (rotary_dim = 64) | +| `attention_bias` | True | +| `n_routed_experts` | 128 | +| `num_experts_per_tok` | 8 | +| `n_shared_experts` | 1 | +| `first_k_dense_replace` | 1 | +| `moe_intermediate_size` | 2048 | +| `intermediate_size` | 16384 (dense layers) | +| `n_group` | 8 | +| `topk_group` | 4 | +| `vocab_size` | 151552 | +| `max_position_embeddings` | 131072 | + +## Validation Results + +**Tested with:** Reduced 2-layer config (`hidden_size=512`, `n_routed_experts=8`, random weights) on `trn2.3xlarge` (LNC=2, 96 GB Neuron memory) +**Configuration:** TP=2 (LNC=2), `batch_size=1`, `seq_len=128`, `bfloat16` +**Date:** 2026-03-06 + +> Note: Full model validation requires a larger Trn2 instance (e.g. `trn2.48xlarge`) for the 70B full model. +> The integration test uses a reduced random-weight model to verify model structure, compilation, and logit accuracy +> without requiring the full checkpoint or large hardware. 
+ +### Test Results + +| Test | Status | Notes | +|------|--------|-------| +| Model compilation | ✅ PASS | Reduced config (2L, h=512), TP=2, `trn2.3xlarge` | +| Model load | ✅ PASS | | +| Logit accuracy (`check_accuracy_logits_v2`) | ✅ PASS | `divergence_difference_tol=0.001` | +| Unit: router top-k (10 tests) | ✅ PASS | CPU-only | +| Unit: partial RoPE (8 tests) | ✅ PASS | CPU-only | +| Unit: decoder layer dispatch (15 tests) | ✅ PASS | CPU-only | +| **Total** | **✅ 53/53 PASS** | | + +## Usage + +### Compile and Run + +```python +import torch +from transformers import AutoTokenizer, GenerationConfig +from neuronx_distributed_inference.models.config import MoENeuronConfig, OnDeviceSamplingConfig +from neuronx_distributed_inference.utils.hf_adapter import ( + HuggingFaceGenerationAdapter, + load_pretrained_config, +) + +# Add src to path (or install as package) +import sys +sys.path.insert(0, "contrib/models/glm4_moe/src") +from glm4_moe.modeling_glm4_moe import Glm4MoeInferenceConfig, NeuronGlm4MoeForCausalLM + +model_path = "/path/to/GLM-4.5-Air" # HuggingFace checkpoint +compiled_model_path = "/path/to/compiled" # Neuron compiled artifacts + +# 1. Configure +neuron_config = MoENeuronConfig( + tp_degree=32, + moe_tp_degree=32, + moe_ep_degree=1, + batch_size=1, + seq_len=4096, + torch_dtype=torch.bfloat16, + on_device_sampling_config=OnDeviceSamplingConfig(do_sample=False, top_k=1), + fused_qkv=True, + flash_decoding_enabled=True, +) + +inference_config = Glm4MoeInferenceConfig( + neuron_config=neuron_config, + load_config=load_pretrained_config(model_path), +) + +# 2. Compile (run once, ~hours) +model = NeuronGlm4MoeForCausalLM(model_path, inference_config) +model.compile(compiled_model_path) + +# 3. Load compiled model +model.load(compiled_model_path) + +# 4. Generate +tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) +adapter = HuggingFaceGenerationAdapter(model) + +prompt = "Explain mixture-of-experts routing in one paragraph." 
+inputs = tokenizer(prompt, return_tensors="pt") + +with torch.no_grad(): + output = adapter.generate( + **inputs, + generation_config=GenerationConfig(do_sample=False, max_new_tokens=200), + ) + +print(tokenizer.decode(output[0], skip_special_tokens=True)) +``` + +### Using the Demo Script + +```bash +cd contrib/models/glm4_moe + +# Compile and run generation demo +python examples/generation_glm4_moe_demo.py \ + --model-path /path/to/GLM-4.5-Air \ + --compiled-model-path /path/to/compiled \ + --tp-degree 32 \ + --seq-len 4096 +``` + +## Compatibility Matrix + +| Instance / NxDI Version | 2.21+ | 2.20 | 2.19 and earlier | +|---|---|---|---| +| Trn2 (`trn2.48xlarge`, 512 NCs) | ✅ Recommended | Not tested | Not supported | +| Trn2 (`trn2.3xlarge`, 4 NCs) | ✅ Tested (reduced config, 2026-03-06) | Not tested | Not supported | +| Trn1 (`trn1.32xlarge`, 64 NCs) | Not tested | Not tested | Not supported | +| Inf2 | Not tested | Not tested | Not supported | + +> **Minimum requirements:** `transformers>=4.56.0` (for `Glm4MoeForCausalLM`), AWS Neuron SDK 2.21+ + +## Testing + +### Unit Tests (CPU, no Neuron hardware required) + +```bash +cd contrib/models/glm4_moe +pip install pytest + +# Router routing logic +pytest test/unit/test_router.py -v + +# Partial RoPE and QK norm logic +pytest test/unit/test_attention.py -v + +# Dense vs MoE layer dispatch +pytest test/unit/test_decoder.py -v + +# All unit tests +pytest test/unit/ -v +``` + +### Integration Tests (requires Trn1/Trn2 with ≥2 NeuronCores) + +```bash +cd contrib/models/glm4_moe + +# Reduced config (~2 min compile), TP=2 +pytest test/integration/test_model.py -v -s + +# Run manually (standalone, no pytest) +python test/integration/test_model.py +``` + +> **Note:** On `trn2.3xlarge` (LNC=2), do not set `NEURON_RT_NUM_CORES`. The test uses `tp_degree=2` which +> maps automatically to the available NeuronCores. + +The integration test: +1. Creates a tiny 2-layer random-weight model (no HuggingFace download needed) +2. 
Compiles it on Neuron (fast due to small model size) +3. Runs `check_accuracy_logits_v2` to compare Neuron logits against HuggingFace CPU logits + +## Example Checkpoints + +- [`zai-org/GLM-4.5-Air`](https://huggingface.co/zai-org/GLM-4.5-Air) — Full 70B model (128 experts, 46 layers) + +## Known Limitations + +- `zai-org/GLM-4.7-Flash` uses `Glm4MoeLiteForCausalLM` (different architecture, not supported) +- Flash decoding requires Trn2 for optimal performance; Trn1 falls back to standard decoding +- `e_score_correction_bias` is loaded from checkpoint as a frozen buffer (not trained during fine-tuning) + +## Maintainer + +Community contribution — PRs welcome. + +**Last Updated:** 2026-03-06 diff --git a/contrib/models/glm4_moe/examples/generation_glm4_moe_demo.py b/contrib/models/glm4_moe/examples/generation_glm4_moe_demo.py new file mode 100644 index 00000000..982a4bbe --- /dev/null +++ b/contrib/models/glm4_moe/examples/generation_glm4_moe_demo.py @@ -0,0 +1,228 @@ +""" +GLM-4.5 MoE Generation Demo for NXD Inference. + +This script demonstrates how to compile and run inference with the GLM-4.5 MoE model +using neuronx-distributed-inference. 
+ +Usage: + # From contrib/models/glm4_moe/ directory: + + # Compile and generate (default tiny random model): + python examples/generation_glm4_moe_demo.py + + # Skip compile (load from existing traced model): + python examples/generation_glm4_moe_demo.py --skip-compile + + # Use real checkpoint: + python examples/generation_glm4_moe_demo.py \\ + --model-path /path/to/glm4_moe \\ + --traced-model-path /path/to/traced \\ + --tp-degree 8 --seq-len 2048 +""" + +import argparse +import sys +from pathlib import Path + +import torch +from transformers import AutoTokenizer, GenerationConfig + +# Add src to path so glm4_moe package can be found +sys.path.insert(0, str(Path(__file__).parent.parent / "src")) + +from glm4_moe.modeling_glm4_moe import ( + Glm4MoeInferenceConfig, + NeuronGlm4MoeForCausalLM, +) +from neuronx_distributed_inference.models.config import ( + MoENeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( + HuggingFaceGenerationAdapter, + load_pretrained_config, +) + +# Default paths — override via CLI +MODEL_PATH = str( + Path(__file__).parent.parent.parent.parent.parent / "glm4_moe_tiny_random" +) +TRACED_MODEL_PATH = str( + Path(__file__).parent.parent.parent.parent.parent / "glm4_moe_tiny_random_traced" +) + +torch.manual_seed(0) + +DTYPE = torch.bfloat16 + + +def get_neuron_config(tp_degree: int = 2, seq_len: int = 64) -> MoENeuronConfig: + """Create MoENeuronConfig for GLM-4.5 MoE model. + + Args: + tp_degree: Tensor parallelism degree (number of NeuronCores). + seq_len: Maximum sequence length. + + Returns: + Configured MoENeuronConfig instance. 
+ """ + moe_tp_degree = tp_degree # align MoE TP degree with overall TP degree + return MoENeuronConfig( + tp_degree=tp_degree, + moe_tp_degree=moe_tp_degree, + moe_ep_degree=1, + batch_size=1, + ctx_batch_size=1, + tkg_batch_size=1, + seq_len=seq_len, + max_context_length=seq_len - 16, + torch_dtype=DTYPE, + on_device_sampling_config=OnDeviceSamplingConfig( + do_sample=False, + top_k=1, + ), + enable_bucketing=False, + flash_decoding_enabled=False, + fused_qkv=True, + sequence_parallel_enabled=False, + qkv_kernel_enabled=False, + attn_kernel_enabled=False, + ) + + +def generate( + model_path: str, + traced_model_path: str, + skip_compile: bool = False, + tp_degree: int = 2, + seq_len: int = 64, +) -> None: + """Compile (if needed) and run GLM-4.5 MoE inference. + + Args: + model_path: Path to the HuggingFace model checkpoint. + traced_model_path: Path to save/load the compiled Neuron model. + skip_compile: If True, skip compilation and load existing traced model. + tp_degree: Tensor parallelism degree. + seq_len: Maximum sequence length. 
+ """ + if not skip_compile: + print("=" * 60) + print("Compiling GLM-4.5 MoE model...") + print("=" * 60) + + neuron_config = get_neuron_config(tp_degree=tp_degree, seq_len=seq_len) + config = Glm4MoeInferenceConfig( + neuron_config, + load_config=load_pretrained_config(model_path), + ) + + print( + f" Model config: hidden_size={config.hidden_size}, " + f"n_routed_experts={config.n_routed_experts}, " + f"n_shared_experts={config.n_shared_experts}, " + f"first_k_dense_replace={config.first_k_dense_replace}" + ) + + model = NeuronGlm4MoeForCausalLM(model_path, config) + model.compile(traced_model_path) + + try: + tokenizer = AutoTokenizer.from_pretrained(model_path) + tokenizer.save_pretrained(traced_model_path) + except Exception as e: + print(f"Warning: could not save tokenizer: {e}") + + print(f"Model compiled and saved to {traced_model_path}") + + # Load compiled model + print("\n" + "=" * 60) + print("Loading compiled GLM-4.5 MoE model...") + print("=" * 60) + model = NeuronGlm4MoeForCausalLM(traced_model_path) + model.load(traced_model_path) + + # Try to load tokenizer + try: + tokenizer = AutoTokenizer.from_pretrained(traced_model_path) + except Exception: + try: + tokenizer = AutoTokenizer.from_pretrained(model_path) + except Exception: + tokenizer = None + + # Generate + print("\n" + "=" * 60) + print("Generating outputs...") + print("=" * 60) + + prompt = "What is the capital of France?" 
+ + if tokenizer is not None: + inputs = tokenizer([prompt], return_tensors="pt", padding=True) + input_ids = inputs.input_ids + attention_mask = inputs.attention_mask + print(f"Prompt: {prompt!r}") + print(f"Input token ids: {input_ids}") + else: + input_ids = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long) + attention_mask = torch.ones_like(input_ids) + print(f"Using dummy input_ids: {input_ids}") + + try: + generation_config = GenerationConfig.from_pretrained(model_path) + except Exception: + generation_config = GenerationConfig( + max_new_tokens=10, + do_sample=False, + top_k=1, + ) + + generation_model = HuggingFaceGenerationAdapter(model) + outputs = generation_model.generate( + input_ids, + generation_config=generation_config, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + ) + + print(f"Output token ids: {outputs}") + + if tokenizer is not None: + decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True) + print("Generated text:") + for i, text in enumerate(decoded): + print(f" [{i}]: {text}") + + +def main() -> None: + """Parse CLI arguments and run generation demo.""" + parser = argparse.ArgumentParser(description="GLM-4.5 MoE generation demo") + parser.add_argument("--model-path", default=MODEL_PATH, help="Path to HF model") + parser.add_argument( + "--traced-model-path", + default=TRACED_MODEL_PATH, + help="Path to save/load traced model", + ) + parser.add_argument( + "--skip-compile", + action="store_true", + help="Skip compilation, load existing traced model", + ) + parser.add_argument( + "--tp-degree", type=int, default=2, help="Tensor parallelism degree" + ) + parser.add_argument("--seq-len", type=int, default=64, help="Sequence length") + args = parser.parse_args() + + generate( + model_path=args.model_path, + traced_model_path=args.traced_model_path, + skip_compile=args.skip_compile, + tp_degree=args.tp_degree, # Fixed: was ignored before + seq_len=args.seq_len, # Fixed: was ignored before + ) + + 
+if __name__ == "__main__": + main() diff --git a/contrib/models/glm4_moe/src/glm4_moe/__init__.py b/contrib/models/glm4_moe/src/glm4_moe/__init__.py new file mode 100644 index 00000000..34b89ef7 --- /dev/null +++ b/contrib/models/glm4_moe/src/glm4_moe/__init__.py @@ -0,0 +1,15 @@ +# coding=utf-8 +# SPDX-License-Identifier: Apache-2.0 +"""GLM-4.5 MoE contrib model package.""" + +from glm4_moe.modeling_glm4_moe import ( + NeuronGlm4MoeForCausalLM, + Glm4MoeInferenceConfig, + convert_glm4_moe_hf_to_neuron_state_dict, +) + +__all__ = [ + "NeuronGlm4MoeForCausalLM", + "Glm4MoeInferenceConfig", + "convert_glm4_moe_hf_to_neuron_state_dict", +] diff --git a/contrib/models/glm4_moe/src/glm4_moe/modeling_glm4_moe.py b/contrib/models/glm4_moe/src/glm4_moe/modeling_glm4_moe.py new file mode 100644 index 00000000..7e180e13 --- /dev/null +++ b/contrib/models/glm4_moe/src/glm4_moe/modeling_glm4_moe.py @@ -0,0 +1,929 @@ +# coding=utf-8 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""GLM-4.5 MoE model for NXD inference. 
+ +Architecture differences from Qwen3MoE: + - partial_rotary_factor=0.5: RoPE applied to only half of head_dim + - attention_bias=True: QKV projections have bias + - use_qk_norm (configurable): QK normalization + - first_k_dense_replace: first N layers use dense MLP instead of MoE + - n_shared_experts=1: shared expert alongside routed experts + - Router: sigmoid + group selection + e_score_correction_bias + routed_scaling_factor +""" + +import gc +import warnings +import math +from typing import List, Optional, Tuple, Union, Dict, Any + +import torch +from torch import nn + +from neuronx_distributed_inference.models.model_base import ( + NeuronBaseForCausalLM, + NeuronBaseModel, +) +from neuronx_distributed_inference.modules.attention.gqa import GQA +from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm + +# Try except for compatibility with older compiler version +try: + from neuronxcc.nki._private_kernels.attention import attention_isa_kernel +except ImportError: + from neuronxcc.nki.kernels.attention import attention_isa_kernel + +from neuronx_distributed.parallel_layers import parallel_state +from neuronx_distributed.parallel_layers.layers import ( + ColumnParallelLinear, + RowParallelLinear, + ParallelEmbedding, +) +from neuronx_distributed.utils import cpu_mode +from torch_neuronx.xla_impl.ops import nki_jit +from transformers import Glm4MoeForCausalLM +from transformers.generation import SampleDecoderOnlyOutput, SampleEncoderDecoderOutput +from transformers.models.glm4_moe.modeling_glm4_moe import Glm4MoeRMSNorm + +# MoE infrastructure +from neuronx_distributed.modules.moe.model import MoE +from neuronx_distributed.modules.moe.expert_mlps_v2 import ExpertMLPsV2 +from neuronx_distributed.modules.moe.routing import GroupLimitedRouter +from neuronx_distributed.modules.moe.moe_configs import RoutedExpertsMLPOpsConfig +from neuronx_distributed.modules.moe.shared_experts import SharedExperts +from 
neuronx_distributed.modules.moe.moe_process_group import ( + init_tensor_expert_parallel_moe_process_groups, + get_moe_tp_ep_group, + get_moe_ep_group, +) + +from neuronx_distributed_inference.models.config import InferenceConfig, MoENeuronConfig +from neuronx_distributed_inference.models.model_wrapper import ( + CONTEXT_ENCODING_MODEL_TAG, + TOKEN_GENERATION_MODEL_TAG, +) +from neuronx_distributed_inference.modules.attention.attention_base import ( + NeuronAttentionBase, +) +from neuronx_distributed_inference.modules.attention.utils import ( + RotaryEmbedding, + apply_rotary_pos_emb, +) +from neuronx_distributed_inference.models.layer_boundary_marker import ( + ModuleMarkerEndWrapper, + ModuleMarkerStartWrapper, +) + +_flash_fwd_call = nki_jit()(attention_isa_kernel) + +SampleOutput = Union[SampleEncoderDecoderOutput, SampleDecoderOnlyOutput] + +GQA_SHARDING_STRATEGY = GQA.REPLICATE_TO_TP_DEGREE + + +# --------------------------------------------------------------------------- +# RMSNorm helpers +# --------------------------------------------------------------------------- + + +def get_rmsnorm_cls(): + """Return appropriate RMSNorm class for CPU vs Neuron execution.""" + return Glm4MoeRMSNorm if cpu_mode() else CustomRMSNorm + + +# --------------------------------------------------------------------------- +# Custom router: sigmoid + group routing + e_score_correction_bias + scaling +# --------------------------------------------------------------------------- + + +class NeuronGlm4MoeRouter(GroupLimitedRouter): + """ + GLM-4.5 MoE router extending GroupLimitedRouter with: + - e_score_correction_bias buffer (initialized to zeros, loaded from checkpoint) + - norm_topk_prob: normalize top-k weights before applying scaling + - routed_scaling_factor: scale final expert weights + + The forward returns (router_logits, expert_affinities_full, topk_idx) where + expert_affinities_full has normalized+scaled weights for selected experts and + zeros elsewhere, so ExpertMLPsV2 
with normalize_top_k_affinities=False uses + them directly. + """ + + def __init__( + self, + num_experts: int, + top_k: int, + hidden_size: int, + n_group: int, + topk_group: int, + norm_topk_prob: bool = True, + routed_scaling_factor: float = 1.0, + sequence_parallel_enabled: bool = False, + sequence_dimension: Optional[int] = None, + dtype: torch.dtype = torch.float32, + device: torch.device = torch.device("cpu"), + tensor_model_parallel_group=None, + jitter_eps: float = 0.0, + ): + super().__init__( + num_experts=num_experts, + top_k=top_k, + hidden_size=hidden_size, + n_group=n_group, + topk_group=topk_group, + sequence_parallel_enabled=sequence_parallel_enabled, + sequence_dimension=sequence_dimension, + dtype=dtype, + device=device, + tensor_model_parallel_group=tensor_model_parallel_group, + jitter_eps=jitter_eps, + ) + self.norm_topk_prob = norm_topk_prob + self.routed_scaling_factor = routed_scaling_factor + # Initialize e_score_correction_bias as FP32 buffer (loaded from checkpoint) + self.register_buffer( + "e_score_correction_bias", + torch.zeros(num_experts, dtype=torch.float32), + ) + + def noaux_tc_top_k(self, scores): + """ + Group-limited top-k selection with normalization and scaling. + + Args: + scores: sigmoid-activated expert affinities [batch_size, num_experts] + + Returns: + (topk_idx, full_affinities) where full_affinities has normalized+scaled + weights at selected positions and zeros elsewhere. 
+ """ + batch_size, num_experts = scores.shape + + # Add correction bias for routing decision (not for final weights) + scores_for_choice = scores + self.e_score_correction_bias.unsqueeze(0) + + # Group-based selection + group_scores = self._calculate_group_scores(scores_for_choice, batch_size) + group_idx = torch.topk(group_scores, k=self.topk_group)[1] + group_mask = self._create_group_mask(group_scores, group_idx) + score_mask = self._expand_group_mask(group_mask, batch_size) + masked_scores = scores_for_choice.masked_fill(~score_mask.bool(), 0.0) + + # Select top-k experts + _, topk_idx = torch.topk(masked_scores, k=self.top_k) + + # Get weights from ORIGINAL sigmoid scores (not bias-corrected) + topk_weights = scores.gather(1, topk_idx) + + # Normalize + if self.norm_topk_prob: + denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20 + topk_weights = topk_weights / denominator + + # Apply routed scaling factor + topk_weights = topk_weights * self.routed_scaling_factor + + # Scatter back into full-size tensor (zeros for non-selected) + full_affinities = torch.zeros_like(scores) + full_affinities.scatter_(1, topk_idx, topk_weights) + + return topk_idx, full_affinities + + def forward(self, hidden_states): + router_logits = self.get_router_logits(hidden_states) + expert_affinities = self.apply_activation_fn(router_logits) + expert_affinities = expert_affinities.to(dtype=hidden_states.dtype) + + topk_idx, full_affinities = self.noaux_tc_top_k(expert_affinities) + topk_idx = topk_idx.detach().to(dtype=torch.long) + + return router_logits, full_affinities, topk_idx + + +# --------------------------------------------------------------------------- +# MoE module initializer for GLM-4.5 +# --------------------------------------------------------------------------- + + +def initialize_glm4_moe_module(config: "Glm4MoeInferenceConfig") -> MoE: + """ + Initialize the GLM-4.5 MoE module with GroupLimitedRouter + SharedExperts. 
+ """ + # Set up process groups + if config.neuron_config.moe_ep_degree > 1: + moe_ep_degree = config.neuron_config.moe_ep_degree + moe_tp_degree = config.neuron_config.moe_tp_degree + init_tensor_expert_parallel_moe_process_groups( + moe_tp_degree, moe_ep_degree, moe_tp_degree, moe_ep_degree + ) + moe_tkg_tp_group = get_moe_tp_ep_group(prefill=False) + moe_tkg_ep_group = get_moe_ep_group(prefill=False) + moe_cte_tp_group = get_moe_tp_ep_group(prefill=True) + moe_cte_ep_group = get_moe_ep_group(prefill=True) + else: + moe_tkg_tp_group = parallel_state.get_tensor_model_parallel_group() + moe_tkg_ep_group = parallel_state.get_expert_model_parallel_group() + moe_cte_tp_group = parallel_state.get_tensor_model_parallel_group() + moe_cte_ep_group = parallel_state.get_expert_model_parallel_group() + + # Router + router = NeuronGlm4MoeRouter( + num_experts=config.num_local_experts, + top_k=config.num_experts_per_tok, + hidden_size=config.hidden_size, + n_group=config.n_group, + topk_group=config.topk_group, + norm_topk_prob=config.norm_topk_prob, + routed_scaling_factor=config.routed_scaling_factor, + dtype=config.neuron_config.router_config.dtype, + sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled, + sequence_dimension=1, + tensor_model_parallel_group=parallel_state.get_tensor_model_parallel_group(), + ) + + # Expert MLPs + expert_mlps = ExpertMLPsV2( + routed_experts_mlp_config=RoutedExpertsMLPOpsConfig( + num_experts=config.num_local_experts, + hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + hidden_size_actual=getattr(config, "original_hidden_size", None), + intermediate_size_actual=getattr( + config, "original_intermediate_size", None + ), + is_hidden_dim_shuffled=config.neuron_config.is_hidden_dim_shuffled, + is_intermediate_dim_shuffled=config.neuron_config.is_intermediate_dim_shuffled, + top_k=config.num_experts_per_tok, + hidden_act=config.hidden_act, + glu_mlp=config.neuron_config.glu_mlp, + 
glu_type=config.neuron_config.glu_type, + hidden_act_scaling_factor=config.neuron_config.hidden_act_scaling_factor, + hidden_act_bias=config.neuron_config.hidden_act_bias, + use_index_calc_kernel=config.neuron_config.use_index_calc_kernel, + gate_clamp_upper_limit=config.neuron_config.gate_clamp_upper_limit, + gate_clamp_lower_limit=config.neuron_config.gate_clamp_lower_limit, + up_clamp_upper_limit=config.neuron_config.up_clamp_upper_limit, + up_clamp_lower_limit=config.neuron_config.up_clamp_lower_limit, + normalize_top_k_affinities=False, # router handles normalization+scaling + early_expert_affinity_modulation=config.neuron_config.early_expert_affinity_modulation, + enable_spmd_rank=config.neuron_config.blockwise_matmul_config.parallelize_token_to_block_mapping, + ), + blockwise_matmul_config=config.neuron_config.blockwise_matmul_config, + sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled, + dtype=config.neuron_config.torch_dtype, + is_prefill=config.neuron_config.is_prefill_stage, + tensor_model_parallel_group=parallel_state.get_tensor_model_parallel_group(), + expert_model_parallel_group=parallel_state.get_expert_model_parallel_group(), + cte_tensor_model_parallel_group=moe_cte_tp_group, + cte_expert_model_parallel_group=moe_cte_ep_group, + tkg_tensor_model_parallel_group=moe_tkg_tp_group, + tkg_expert_model_parallel_group=moe_tkg_ep_group, + ) + + # Shared experts (always on, parallel to routed experts) + shared_experts = None + if config.n_shared_experts: + shared_experts = SharedExperts( + hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + num_shared_experts=config.n_shared_experts, + hidden_act=config.hidden_act, + dtype=config.neuron_config.torch_dtype, + reduce_dtype=config.neuron_config.rpl_reduce_dtype, + fused_gate_up_projection=config.neuron_config.fused_shared_experts, + sequence_parallel_enabled=config.neuron_config.shared_experts_sequence_parallel_enabled, + 
transpose_weights=config.neuron_config.transpose_shared_experts_weights, + ) + + moe = MoE( + router=router, + expert_mlps=expert_mlps, + shared_experts=shared_experts, + sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled, + return_expert_index=config.neuron_config.return_expert_index, + return_router_logits=config.neuron_config.return_router_logits, + sequence_dimension=1, + ) + + moe.eval() + return moe + + +# --------------------------------------------------------------------------- +# Dense MLP for first_k_dense_replace layers +# --------------------------------------------------------------------------- + + +class NeuronGlm4MoeDenseMLP(nn.Module): + """Standard GLU MLP (SiLU activation) used for dense layers (first_k_dense_replace).""" + + def __init__(self, config: "Glm4MoeInferenceConfig"): + super().__init__() + self.hidden_size = config.hidden_size + self.intermediate_size = ( + config.dense_intermediate_size + ) # full intermediate size + + # Gate and up projection (column parallel - split output) + self.gate_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + bias=False, + gather_output=False, + ) + self.up_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + bias=False, + gather_output=False, + ) + # Down projection (row parallel - reduce across TP) + self.down_proj = RowParallelLinear( + self.intermediate_size, + self.hidden_size, + bias=False, + input_is_parallel=True, + ) + self.act_fn = nn.SiLU() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) + + +# --------------------------------------------------------------------------- +# Attention with partial RoPE and attention bias +# --------------------------------------------------------------------------- + + +class NeuronGlm4MoeAttention(NeuronAttentionBase): + """ + GLM-4.5 MoE attention with: + - partial_rotary_factor=0.5: RoPE applied to first half 
of head_dim only + - attention_bias=True: bias in q/k/v projections + - use_qk_norm: optional QK normalization + """ + + def __init__(self, config: "Glm4MoeInferenceConfig"): + # Partial RoPE: use rotary_dim = int(head_dim * partial_rotary_factor) + rotary_dim = int(config.head_dim * config.partial_rotary_factor) + rotary_emb = RotaryEmbedding( + rotary_dim, + max_position_embeddings=config.max_position_embeddings, + base=config.rope_theta, + ) + + super().__init__( + config=config, + hidden_size=config.hidden_size, + num_attention_heads=config.num_attention_heads, + num_key_value_heads=config.num_key_value_heads, + head_dim=config.head_dim, + rotary_emb=rotary_emb, + rms_norm_eps=config.rms_norm_eps, + use_qk_norm=False, # we handle qk_norm manually + qkv_bias=config.attention_bias, + ) + + # Store rotary_dim for partial RoPE apply + self.rotary_dim = rotary_dim + + # Optional QK norm (Glm4Moe applies RMSNorm per head after projection) + if config.use_qk_norm: + self.q_layernorm = get_rmsnorm_cls()(config.head_dim, config.rms_norm_eps) + self.k_layernorm = get_rmsnorm_cls()(config.head_dim, config.rms_norm_eps) + else: + self.q_layernorm = None + self.k_layernorm = None + + if not parallel_state.model_parallel_is_initialized(): + raise ValueError( + "NeuronGlm4MoeAttention must be initialized in a distributed env. " + "Please use neuronx_distributed module to initialize a distributed env." 
+ ) + + def apply_rotary_embedding( + self, Q, K, V, position_ids, cos_cache, sin_cache, use_polar_compatible_rope + ): + """Override to implement partial RoPE (apply to first rotary_dim dims only).""" + if not use_polar_compatible_rope and self.rotary_emb is not None: + if cos_cache is None or sin_cache is None: + cos_cache, sin_cache = self.rotary_emb(V, position_ids) + + # Partial RoPE: split Q and K into rotary and pass-through portions + # Q, K shape: [batch, heads, seq, head_dim] + # cos_cache, sin_cache shape: [batch, seq, rotary_dim] + rotary_dim = cos_cache.shape[-1] + + Q_rot = Q[..., :rotary_dim] + Q_pass = Q[..., rotary_dim:] + K_rot = K[..., :rotary_dim] + K_pass = K[..., rotary_dim:] + + Q_rot, K_rot = apply_rotary_pos_emb(Q_rot, K_rot, cos_cache, sin_cache) + + Q = torch.cat([Q_rot, Q_pass], dim=-1) + K = torch.cat([K_rot, K_pass], dim=-1) + + return Q, K, cos_cache, sin_cache + + +# --------------------------------------------------------------------------- +# Decoder layer +# --------------------------------------------------------------------------- + + +class NeuronGlm4MoeDecoderLayer(nn.Module): + """ + GLM-4.5 MoE decoder layer. 
+ + - Layers 0..first_k_dense_replace-1: dense MLP + - Layers first_k_dense_replace..num_hidden_layers-1: MoE block + """ + + def __init__(self, config: "Glm4MoeInferenceConfig", layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size + self.layer_idx = layer_idx + self.is_moe_layer = layer_idx >= config.first_k_dense_replace + + self.self_attn = NeuronGlm4MoeAttention(config=config) + + self.input_layernorm = get_rmsnorm_cls()( + config.hidden_size, eps=config.rms_norm_eps + ) + self.post_attention_layernorm = get_rmsnorm_cls()( + config.hidden_size, eps=config.rms_norm_eps + ) + + if self.is_moe_layer: + self.mlp = initialize_glm4_moe_module(config) + else: + self.mlp = NeuronGlm4MoeDenseMLP(config) + + self.qkv_kernel_enabled = config.neuron_config.qkv_kernel_enabled + self.sequence_parallel_enabled = config.neuron_config.sequence_parallel_enabled + self.qkv_kernel_fused_rmsnorm = not self.sequence_parallel_enabled + self.moe_mask_padded_tokens = config.neuron_config.moe_mask_padded_tokens + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + padding_mask: Optional[torch.Tensor] = None, + **kwargs, + ) -> Tuple[ + torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] + ]: + if "padding_mask" in kwargs: + warnings.warn( + "Passing `padding_mask` is deprecated and will be removed in v4.37. " + "Please make sure use `attention_mask` instead." 
+ ) + + residual = hidden_states + + hidden_states = ModuleMarkerStartWrapper()(hidden_states) + + if self.input_layernorm: + if self.qkv_kernel_enabled and self.qkv_kernel_fused_rmsnorm: + qkv_fused_rmsnorm = self.input_layernorm + else: + hidden_states = self.input_layernorm(hidden_states) + qkv_fused_rmsnorm = None + else: + qkv_fused_rmsnorm = None + + # Self Attention + hidden_states, present_key_value, cos_cache, sin_cache = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + rmsnorm=qkv_fused_rmsnorm, + **kwargs, + ) + hidden_states = residual + hidden_states + + # MLP / MoE + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + + if self.is_moe_layer: + hidden_states = self.mlp(hidden_states, padding_mask)[0] + else: + hidden_states = self.mlp(hidden_states) + + hidden_states = residual + hidden_states + + hidden_states = ModuleMarkerEndWrapper()(hidden_states) + outputs = (hidden_states, present_key_value, cos_cache, sin_cache, None) + + return outputs + + +# --------------------------------------------------------------------------- +# Model +# --------------------------------------------------------------------------- + + +class NeuronGlm4MoeModel(NeuronBaseModel): + """NeuronGlm4MoeModel extends the GLM-4.5 MoE model to be traceable.""" + + def setup_attr_for_model(self, config: "Glm4MoeInferenceConfig"): + self.on_device_sampling = ( + config.neuron_config.on_device_sampling_config is not None + ) + self.tp_degree = config.neuron_config.tp_degree + self.hidden_size = config.hidden_size + self.num_attention_heads = config.num_attention_heads + self.num_key_value_heads = config.num_key_value_heads + self.max_batch_size = config.neuron_config.max_batch_size + self.buckets = config.neuron_config.buckets + + def init_model(self, config: "Glm4MoeInferenceConfig"): + self.padding_idx = config.pad_token_id + self.vocab_size = 
config.vocab_size + + self.embed_tokens = ParallelEmbedding( + config.vocab_size, + config.hidden_size, + self.padding_idx, + dtype=config.neuron_config.torch_dtype, + shard_across_embedding=True, + ) + self.layers = nn.ModuleList( + [ + NeuronGlm4MoeDecoderLayer(config, layer_idx) + for layer_idx in range(config.num_hidden_layers) + ] + ) + self.norm = get_rmsnorm_cls()(config.hidden_size, eps=config.rms_norm_eps) + self.lm_head = ColumnParallelLinear( + config.hidden_size, + config.vocab_size, + gather_output=False if self.on_device_sampling else True, + bias=False, + ) + + +# --------------------------------------------------------------------------- +# CausalLM wrapper +# --------------------------------------------------------------------------- + + +class NeuronGlm4MoeForCausalLM(NeuronBaseForCausalLM): + """ + GLM-4.5 MoE CausalLM for NXD inference. + """ + + _model_cls = NeuronGlm4MoeModel + + @staticmethod + def load_hf_model(model_path, **kwargs): + return Glm4MoeForCausalLM.from_pretrained(model_path, **kwargs) + + @classmethod + def get_config_cls(cls): + return Glm4MoeInferenceConfig + + @staticmethod + def convert_hf_to_neuron_state_dict( + state_dict: dict, config: "Glm4MoeInferenceConfig" + ) -> dict: + return convert_glm4_moe_hf_to_neuron_state_dict(state_dict, config) + + def enable_context_encoding(self): + self.compile_tag = CONTEXT_ENCODING_MODEL_TAG + super().enable_context_encoding() + + def enable_token_generation(self): + self.compile_tag = TOKEN_GENERATION_MODEL_TAG + super().enable_token_generation() + + def get_compiler_args(self): + # CTE benefits from higher optimization; TKG uses O1 for faster compilation + if self.compile_tag == CONTEXT_ENCODING_MODEL_TAG: + optimization_level = "-O2" + else: + optimization_level = "-O1" + + compiler_args = ( + f"--enable-saturate-infinity --enable-mixed-precision-accumulation " + f"--model-type transformer {optimization_level}" + ) + compiler_args += " 
--tensorizer-options='--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2'" + compiler_args += " --auto-cast=none" + compiler_args += " --internal-enable-dge-levels vector_dynamic_offsets" + compiler_args += " --internal-hlo2tensorizer-options='--verify-hlo=true'" + if self.neuron_config.scratchpad_page_size: + compiler_args += f" --hbm-scratchpad-page-size={self.neuron_config.scratchpad_page_size} " + return compiler_args + + +# --------------------------------------------------------------------------- +# InferenceConfig +# --------------------------------------------------------------------------- + + +class Glm4MoeInferenceConfig(InferenceConfig): + """ + InferenceConfig for GLM-4.5 MoE model. + + Key adaptations from Qwen3MoeInferenceConfig: + - Maps n_routed_experts -> num_local_experts + - Sets n_shared_experts from HF config + - Configures GLM-4.5-specific router (sigmoid, group routing, scaling factor) + - Handles dense layers (first_k_dense_replace) + - Handles partial RoPE (partial_rotary_factor) + """ + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + # GLM-4.5 uses n_routed_experts; neuronx expects num_local_experts + self.num_local_experts = self.n_routed_experts + + # Store the dense-layer intermediate size (for NeuronGlm4MoeDenseMLP) + # The HF config has both intermediate_size (dense) and moe_intermediate_size (MoE) + self.dense_intermediate_size = self.intermediate_size + + # Set intermediate_size to moe_intermediate_size for MoE layers + # (ExpertMLPsV2 and SharedExperts read config.intermediate_size) + self.intermediate_size = self.moe_intermediate_size + + # Shared experts: n_shared_experts comes directly from HF config + # (already set via load_pretrained_config) + + # Router configuration for GLM-4.5 MoE + self.neuron_config.router_config.dtype = torch.float32 # router in FP32 + # act_fn is handled inside NeuronGlm4MoeRouter (always sigmoid) + + # Disable the standard normalize_top_k_affinities since our 
router handles it + self.neuron_config.normalize_top_k_affinities = False + + # Set DISABLE_NUMERIC_CC_TOKEN for MoE + self.neuron_config.disable_numeric_cc_token = True + + # Shared expert config + self.neuron_config.fused_shared_experts = False + self.neuron_config.transpose_shared_experts_weights = False + self.neuron_config.shared_experts_sequence_parallel_enabled = False + + # Check if moe_intermediate_pad_size is needed + self.maybe_pad_intermediate() + + def maybe_pad_intermediate(self): + """Pad moe_intermediate_size if needed for blockwise matmul alignment.""" + from neuronx_distributed_inference.models.config import ( + SHARD_ON_INTERMEDIATE_DIMENSION_PER_TP, + ) + + moe_tp_degree = self.neuron_config.moe_tp_degree + I_TP = self.moe_intermediate_size // moe_tp_degree + if getattr( + self.neuron_config.blockwise_matmul_config, + "use_shard_on_intermediate_dynamic_while", + False, + ): + if I_TP % SHARD_ON_INTERMEDIATE_DIMENSION_PER_TP != 0: + padded = ( + math.ceil(I_TP / SHARD_ON_INTERMEDIATE_DIMENSION_PER_TP) + * SHARD_ON_INTERMEDIATE_DIMENSION_PER_TP + * moe_tp_degree + ) + self.moe_intermediate_pad_size = max( + padded - self.moe_intermediate_size, 0 + ) + self.moe_intermediate_size = padded + + def get_required_attributes(self) -> List[str]: + return [ + "head_dim", + "hidden_act", + "hidden_size", + "max_position_embeddings", + "moe_intermediate_size", + "n_group", + "n_routed_experts", + "n_shared_experts", + "norm_topk_prob", + "num_attention_heads", + "num_experts_per_tok", + "num_hidden_layers", + "num_key_value_heads", + "partial_rotary_factor", + "rms_norm_eps", + "rope_scaling", + "rope_theta", + "routed_scaling_factor", + "tie_word_embeddings", + "topk_group", + "use_qk_norm", + "vocab_size", + "first_k_dense_replace", + "attention_bias", + ] + + @classmethod + def get_neuron_config_cls(cls): + return MoENeuronConfig + + +# --------------------------------------------------------------------------- +# State dict conversion: HF -> Neuronx +# 
--------------------------------------------------------------------------- + + +def _helper_concat_and_delete_qkv( + state_dict: Dict[str, Any], layer_num: int, key_type: str +): + """Concatenate Q/K/V weights (or biases) for fused QKV.""" + q_key = f"layers.{layer_num}.self_attn.q_proj.{key_type}" + k_key = f"layers.{layer_num}.self_attn.k_proj.{key_type}" + v_key = f"layers.{layer_num}.self_attn.v_proj.{key_type}" + + state_dict[f"layers.{layer_num}.self_attn.Wqkv.{key_type}"] = torch.cat( + [ + state_dict[q_key], + state_dict[k_key], + state_dict[v_key], + ] + ) + del state_dict[q_key] + del state_dict[k_key] + del state_dict[v_key] + + +def convert_glm4_moe_hf_to_neuron_state_dict( + neuron_state_dict: Dict[str, Any], + config: Glm4MoeInferenceConfig, +) -> Dict[str, Any]: + """ + Convert HF GLM-4.5 MoE state dict to neuronx format. + + Transformations: + 1. Add rank_util tensors + 2. Rename q_norm/k_norm -> q_layernorm/k_layernorm + 3. Fuse QKV weights and biases + 4. For dense layers: no MoE weight transformation needed + 5. 
For MoE layers: + - Rename router weight: mlp.gate.weight -> mlp.router.linear_router.weight + - Copy correction bias: mlp.gate.e_score_correction_bias -> mlp.router.e_score_correction_bias + - Fuse expert weights: per-expert gate_proj + up_proj -> [E, H, 2I] gate_up_proj + - Copy shared expert weights (renamed to match SharedExperts structure) + """ + assert config.neuron_config.glu_mlp is True, "Only GLU MLP is supported" + + # Add rank_util tensor for distributed inference + neuron_state_dict["rank_util.rank"] = torch.arange( + 0, config.neuron_config.tp_degree, dtype=torch.int32 + ) + + num_moe_experts = config.num_local_experts + pad_size = getattr(config, "moe_intermediate_pad_size", 0) + + for l in range(config.num_hidden_layers): # noqa: E741 + # Add per-layer rank_util + neuron_state_dict[f"layers.{l}.self_attn.rank_util.rank"] = torch.arange( + 0, config.neuron_config.tp_degree, dtype=torch.int32 + ) + + # Rename q_norm/k_norm -> q_layernorm/k_layernorm + if f"layers.{l}.self_attn.q_norm.weight" in neuron_state_dict: + neuron_state_dict[f"layers.{l}.self_attn.q_layernorm.weight"] = ( + neuron_state_dict[f"layers.{l}.self_attn.q_norm.weight"] + .detach() + .clone() + ) + del neuron_state_dict[f"layers.{l}.self_attn.q_norm.weight"] + + if f"layers.{l}.self_attn.k_norm.weight" in neuron_state_dict: + neuron_state_dict[f"layers.{l}.self_attn.k_layernorm.weight"] = ( + neuron_state_dict[f"layers.{l}.self_attn.k_norm.weight"] + .detach() + .clone() + ) + del neuron_state_dict[f"layers.{l}.self_attn.k_norm.weight"] + + is_moe_layer = l >= config.first_k_dense_replace + + if is_moe_layer: + # ---- Router ---- + # Rename: mlp.gate.weight -> mlp.router.linear_router.weight + gate_weight_key = f"layers.{l}.mlp.gate.weight" + if gate_weight_key in neuron_state_dict: + neuron_state_dict[f"layers.{l}.mlp.router.linear_router.weight"] = ( + neuron_state_dict[gate_weight_key].detach().clone() + ) + del neuron_state_dict[gate_weight_key] + + # Copy e_score_correction_bias 
+ bias_key = f"layers.{l}.mlp.gate.e_score_correction_bias" + if bias_key in neuron_state_dict: + neuron_state_dict[f"layers.{l}.mlp.router.e_score_correction_bias"] = ( + neuron_state_dict[bias_key].detach().clone().to(torch.float32) + ) + del neuron_state_dict[bias_key] + + # ---- Routed Expert weights ---- + # Get shape info from first expert + gate_proj_0 = neuron_state_dict[ + f"layers.{l}.mlp.experts.0.gate_proj.weight" + ] + intermediate_size_e, hidden_size = gate_proj_0.shape + device = gate_proj_0.device + dtype = gate_proj_0.dtype + + # Fuse gate_proj + up_proj -> gate_up_proj: [E, H, 2I] + gate_up_proj = torch.empty( + num_moe_experts, + hidden_size, + 2 * intermediate_size_e, + dtype=dtype, + device=device, + ) + down_proj = torch.empty( + num_moe_experts, + intermediate_size_e, + hidden_size, + dtype=dtype, + device=device, + ) + + for e in range(num_moe_experts): + gate_w = ( + neuron_state_dict[f"layers.{l}.mlp.experts.{e}.gate_proj.weight"] + .T.detach() + .clone() + ) + up_w = ( + neuron_state_dict[f"layers.{l}.mlp.experts.{e}.up_proj.weight"] + .T.detach() + .clone() + ) + down_w = ( + neuron_state_dict[f"layers.{l}.mlp.experts.{e}.down_proj.weight"] + .T.detach() + .clone() + ) + + gate_up_slice = torch.narrow(gate_up_proj, 0, e, 1) + torch.narrow(gate_up_slice, 2, 0, intermediate_size_e).copy_(gate_w) + torch.narrow( + gate_up_slice, 2, intermediate_size_e, intermediate_size_e + ).copy_(up_w) + + down_slice = torch.narrow(down_proj, 0, e, 1) + down_slice.copy_(down_w) + + del neuron_state_dict[f"layers.{l}.mlp.experts.{e}.gate_proj.weight"] + del neuron_state_dict[f"layers.{l}.mlp.experts.{e}.up_proj.weight"] + del neuron_state_dict[f"layers.{l}.mlp.experts.{e}.down_proj.weight"] + + # Pad intermediate size if needed + if pad_size > 0: + gate_up_proj = gate_up_proj.reshape(num_moe_experts, hidden_size, 2, -1) + gate_up_proj = torch.nn.functional.pad(gate_up_proj, (0, pad_size)) + gate_up_proj = gate_up_proj.reshape(num_moe_experts, hidden_size, 
-1) + + down_proj = torch.nn.functional.pad(down_proj, (0, 0, 0, pad_size)) + + neuron_state_dict[ + f"layers.{l}.mlp.expert_mlps.mlp_op.gate_up_proj.weight" + ] = gate_up_proj + neuron_state_dict[f"layers.{l}.mlp.expert_mlps.mlp_op.down_proj.weight"] = ( + down_proj + ) + + # ---- Shared Expert weights ---- + # SharedExperts with fused_gate_up_projection=False expects separate gate_proj, up_proj, down_proj + # Keys: mlp.shared_experts.gate_proj.weight -> mlp.shared_experts.gate_proj.weight (no rename needed + # IF SharedExperts uses the same naming) + # However, we need to check SharedExperts's weight key names. + # SharedExperts stores weights as gate_proj and up_proj (separate) or gate_up_proj (fused). + # With fused_gate_up_projection=False and transpose_weights=False: + # The keys should remain as-is: mlp.shared_experts.{gate/up/down}_proj.weight + # No transformation needed - keys already match. + + gc.collect() + + # Fuse QKV weights (and biases if attention_bias=True) + if config.neuron_config.fused_qkv: + for l in range(config.num_hidden_layers): # noqa: E741 + _helper_concat_and_delete_qkv(neuron_state_dict, l, "weight") + if config.attention_bias: + _helper_concat_and_delete_qkv(neuron_state_dict, l, "bias") + + return neuron_state_dict diff --git a/contrib/models/glm4_moe/test/__init__.py b/contrib/models/glm4_moe/test/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/contrib/models/glm4_moe/test/conftest.py b/contrib/models/glm4_moe/test/conftest.py new file mode 100644 index 00000000..59375b18 --- /dev/null +++ b/contrib/models/glm4_moe/test/conftest.py @@ -0,0 +1,44 @@ +# coding=utf-8 +"""Shared pytest fixtures for GLM-4.5 MoE tests.""" + +import random +import sys +import tempfile +from pathlib import Path + +import pytest +import torch + +# Add src to path so glm4_moe package is importable +sys.path.insert(0, str(Path(__file__).parent.parent / "src")) + + +@pytest.fixture(scope="session", autouse=True) +def random_seed(): + 
"""Fix all random seeds for reproducibility.""" + torch.manual_seed(42) + random.seed(42) + try: + import torch_xla.core.xla_model as xm + + xm.set_rng_state(42) + except ImportError: + pass + yield + + +@pytest.fixture(scope="session") +def glm4_moe_config(): + """Load GLM-4.5 MoE test config (full architecture, few layers).""" + import json + + config_path = Path(__file__).parent / "integration" / "config_glm4_moe_2layers.json" + with open(config_path) as f: + return json.load(f) + + +@pytest.fixture +def tmp_dir_path(): + """Create a temporary directory that is cleaned up after the test.""" + with tempfile.TemporaryDirectory() as tmpdir: + yield Path(tmpdir) diff --git a/contrib/models/glm4_moe/test/integration/__init__.py b/contrib/models/glm4_moe/test/integration/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/contrib/models/glm4_moe/test/integration/config_glm4_moe_2layers.json b/contrib/models/glm4_moe/test/integration/config_glm4_moe_2layers.json new file mode 100644 index 00000000..04ed490d --- /dev/null +++ b/contrib/models/glm4_moe/test/integration/config_glm4_moe_2layers.json @@ -0,0 +1,37 @@ +{ + "_note": "Reduced 2-layer config derived from zai-org/GLM-4.5-Air for integration testing. 
Layer count and widths (hidden_size, n_routed_experts, num_attention_heads, intermediate sizes) are reduced to avoid OOM during CI; the distinctive architectural features (partial RoPE, sigmoid group routing, shared experts, first dense layer) are preserved.",
+  "architectures": ["Glm4MoeForCausalLM"],
+  "attention_bias": true,
+  "attention_dropout": 0.0,
+  "eos_token_id": [151329, 151336, 151338],
+  "first_k_dense_replace": 1,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 512,
+  "initializer_range": 0.02,
+  "intermediate_size": 1024,
+  "max_position_embeddings": 131072,
+  "model_type": "glm4_moe",
+  "moe_intermediate_size": 256,
+  "n_group": 1,
+  "n_routed_experts": 8,
+  "n_shared_experts": 1,
+  "norm_topk_prob": true,
+  "num_attention_heads": 4,
+  "num_experts_per_tok": 2,
+  "num_hidden_layers": 2,
+  "num_key_value_heads": 2,
+  "num_nextn_predict_layers": 1,
+  "pad_token_id": 151329,
+  "partial_rotary_factor": 0.5,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "routed_scaling_factor": 1.0,
+  "tie_word_embeddings": false,
+  "topk_group": 1,
+  "torch_dtype": "bfloat16",
+  "use_cache": true,
+  "use_qk_norm": false,
+  "vocab_size": 151552
+}
diff --git a/contrib/models/glm4_moe/test/integration/test_model.py b/contrib/models/glm4_moe/test/integration/test_model.py
new file mode 100644
index 00000000..4eadcf4d
--- /dev/null
+++ b/contrib/models/glm4_moe/test/integration/test_model.py
@@ -0,0 +1,325 @@
+#!/usr/bin/env python3
+# coding=utf-8
+"""
+Integration tests for GLM-4.5 MoE NeuronX implementation.
+
+Tests model compilation, loading, and inference accuracy/performance using a
+reduced 2-layer config with random weights to avoid OOM on CI/test hardware.
+ +Usage: + # From contrib/models/glm4_moe/ directory: + pytest test/integration/test_model.py -v + + # Run with specific tp_degree: + NEURON_RT_NUM_CORES=2 pytest test/integration/test_model.py -v +""" + +import json +import os +import sys +import tempfile +from pathlib import Path + +import pytest +import torch +from transformers import GenerationConfig + +# Add contrib src and integration dir to path +_CONTRIB_ROOT = Path(__file__).parent.parent.parent +sys.path.insert(0, str(_CONTRIB_ROOT / "src")) +sys.path.insert(0, str(Path(__file__).parent)) # for utils module + +from glm4_moe.modeling_glm4_moe import Glm4MoeInferenceConfig, NeuronGlm4MoeForCausalLM +from neuronx_distributed_inference.models.config import ( + MoENeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.accuracy import ( + check_accuracy_logits_v2, + generate_expected_logits, +) +from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config + +from utils import create_neuron_config, save_hf_checkpoint, prepare_inputs + + +# --------------------------------------------------------------------------- +# Constants +# --------------------------------------------------------------------------- + +CONFIG_PATH = Path(__file__).parent / "config_glm4_moe_2layers.json" +TP_DEGREE = int(os.environ.get("NEURON_RT_NUM_CORES", "2")) +SEQ_LEN = 128 +BATCH_SIZE = 1 +MAX_NEW_TOKENS = 8 + + +# --------------------------------------------------------------------------- +# Session-scoped fixtures: create tiny random model once per test session +# --------------------------------------------------------------------------- + + +@pytest.fixture(scope="session") +def hf_config_dict(): + """Load the reduced 2-layer test config dict.""" + with open(CONFIG_PATH) as f: + return json.load(f) + + +@pytest.fixture(scope="session") +def hf_checkpoint_path(tmp_path_factory, hf_config_dict): + """Create a temporary HF checkpoint with random weights (session-scoped). 
+ + The checkpoint is created once and reused for all tests in the session. + """ + tmp_dir = str(tmp_path_factory.mktemp("glm4_moe_hf_ckpt")) + print(f"\n[fixture] Creating tiny random HF checkpoint at {tmp_dir}...") + save_hf_checkpoint(hf_config_dict, tmp_dir) + print(f"[fixture] HF checkpoint ready.") + return tmp_dir + + +@pytest.fixture(scope="session") +def compiled_model_path(tmp_path_factory): + """Return a session-scoped temp directory for the compiled Neuron model.""" + return str(tmp_path_factory.mktemp("glm4_moe_neuron_compiled")) + + +@pytest.fixture(scope="session") +def neuron_config(): + """Create MoENeuronConfig for integration tests.""" + return create_neuron_config( + tp_degree=TP_DEGREE, + seq_len=SEQ_LEN, + batch_size=BATCH_SIZE, + ) + + +@pytest.fixture(scope="session") +def neuron_model( + hf_checkpoint_path, compiled_model_path, neuron_config, hf_config_dict +): + """Compile and load GLM-4.5 MoE model on Neuron (session-scoped). + + Uses the tiny 2-layer random-weight checkpoint to avoid OOM. 
+ """ + print(f"\n[fixture] Building Glm4MoeInferenceConfig from {hf_checkpoint_path}...") + inference_config = Glm4MoeInferenceConfig( + neuron_config=neuron_config, + load_config=load_pretrained_config(hf_checkpoint_path), + ) + + model = NeuronGlm4MoeForCausalLM(hf_checkpoint_path, inference_config) + + print(f"[fixture] Compiling model to {compiled_model_path}...") + model.compile(compiled_model_path) + print(f"[fixture] Compilation complete.") + + print(f"[fixture] Loading compiled model...") + model.load(compiled_model_path) + print(f"[fixture] Model loaded.") + + return model + + +@pytest.fixture(scope="session") +def generation_config(): + """Greedy generation config (no sampling) for deterministic tests.""" + return GenerationConfig( + do_sample=False, + top_k=1, + temperature=1.0, + max_new_tokens=MAX_NEW_TOKENS, + ) + + +@pytest.fixture(scope="session") +def input_ids_and_mask(hf_config_dict): + """Prepare fixed random inputs for all tests.""" + vocab_size = hf_config_dict.get("vocab_size", 1000) + input_ids, attention_mask = prepare_inputs( + batch_size=BATCH_SIZE, + seq_len=SEQ_LEN // 2, # Use half of seq_len as prompt + vocab_size=vocab_size, + ) + return input_ids, attention_mask + + +# --------------------------------------------------------------------------- +# Tests +# --------------------------------------------------------------------------- + + +class TestGlm4MoeSmoke: + """Smoke tests: verify model compiles and loads without errors.""" + + def test_model_is_not_none(self, neuron_model): + """Model fixture must load successfully.""" + assert neuron_model is not None, "neuron_model fixture returned None" + + def test_model_has_config(self, neuron_model): + """Loaded model must carry a valid InferenceConfig.""" + assert hasattr(neuron_model, "config"), "Model has no 'config' attribute" + assert hasattr(neuron_model.config, "neuron_config"), ( + "Config missing 'neuron_config'" + ) + + def test_neuron_config_tp_degree(self, neuron_model): + """TP 
degree must match the configured value.""" + assert neuron_model.config.neuron_config.tp_degree == TP_DEGREE + + +class TestGlm4MoeAccuracy: + """Logit accuracy: compare Neuron output against CPU HuggingFace model.""" + + def test_logit_accuracy( + self, + neuron_model, + input_ids_and_mask, + generation_config, + ): + """Neuron logits must match HuggingFace CPU logits within tolerance. + + Uses check_accuracy_logits_v2 which generates golden logits from the + HuggingFace model and compares them token-by-token against the Neuron + output. A divergence_difference_tol of 0.001 is used. + """ + input_ids, attention_mask = input_ids_and_mask + + print("\n[test] Generating expected (HF CPU) logits...") + expected_logits = generate_expected_logits( + neuron_model=neuron_model, + input_ids=input_ids, + inputs_attention_mask=attention_mask, + generation_config=generation_config, + num_tokens=MAX_NEW_TOKENS, + ) + + print("[test] Running check_accuracy_logits_v2...") + check_accuracy_logits_v2( + neuron_model=neuron_model, + expected_logits=expected_logits, + inputs_input_ids=input_ids, + inputs_attention_mask=attention_mask, + generation_config=generation_config, + divergence_difference_tol=0.001, + num_tokens_to_check=MAX_NEW_TOKENS, + ) + print("[test] Logit accuracy check passed.") + + +class TestGlm4MoePerformance: + """Performance benchmarks: timing and throughput sanity checks.""" + + def test_context_encoding_runs(self, neuron_model, input_ids_and_mask): + """Context encoding forward pass must complete without error.""" + input_ids, attention_mask = input_ids_and_mask + from neuronx_distributed_inference.utils.hf_adapter import ( + HuggingFaceGenerationAdapter, + ) + + adapter = HuggingFaceGenerationAdapter(neuron_model) + gen_config = GenerationConfig( + do_sample=False, + top_k=1, + max_new_tokens=1, + ) + with torch.no_grad(): + output = adapter.generate( + input_ids=input_ids, + attention_mask=attention_mask, + generation_config=gen_config, + ) + assert output 
is not None + assert output.shape[1] >= input_ids.shape[1] + + def test_benchmark_sampling(self, neuron_model, generation_config): + """benchmark_sampling must complete and return a non-empty report.""" + from neuronx_distributed_inference.utils.benchmark import benchmark_sampling + + report = benchmark_sampling( + model=neuron_model, + generation_config=generation_config, + num_runs=3, + ) + # Report may be empty dict if model/config doesn't support it; just + # verify no exception was raised. + assert report is not None + print(f"\n[test] Benchmark report: {report}") + + +# --------------------------------------------------------------------------- +# __main__ runner (non-pytest) +# --------------------------------------------------------------------------- + + +if __name__ == "__main__": + import tempfile + + from utils import save_hf_checkpoint + + print("=" * 70) + print("GLM-4.5 MoE Integration Tests (standalone)") + print("=" * 70) + + with open(CONFIG_PATH) as f: + hf_config_dict = json.load(f) + + with ( + tempfile.TemporaryDirectory() as hf_ckpt_dir, + tempfile.TemporaryDirectory() as compiled_dir, + ): + print(f"\nStep 1: Creating tiny random HF checkpoint at {hf_ckpt_dir}...") + save_hf_checkpoint(hf_config_dict, hf_ckpt_dir) + print(" Done.") + + neuron_cfg = create_neuron_config( + tp_degree=TP_DEGREE, seq_len=SEQ_LEN, batch_size=BATCH_SIZE + ) + inference_config = Glm4MoeInferenceConfig( + neuron_config=neuron_cfg, + load_config=load_pretrained_config(hf_ckpt_dir), + ) + + print(f"\nStep 2: Compiling model to {compiled_dir}...") + model = NeuronGlm4MoeForCausalLM(hf_ckpt_dir, inference_config) + model.compile(compiled_dir) + print(" Compilation complete.") + + print(f"\nStep 3: Loading compiled model...") + model.load(compiled_dir) + print(" Load complete.") + + print("\nStep 4: Smoke test — model attributes...") + assert model is not None + assert hasattr(model, "config") + print(" PASSED.") + + print("\nStep 5: Logit accuracy test...") + vocab_size 
= hf_config_dict.get("vocab_size", 1000) + input_ids, attention_mask = prepare_inputs(BATCH_SIZE, SEQ_LEN // 2, vocab_size) + gen_config = GenerationConfig( + do_sample=False, top_k=1, temperature=1.0, max_new_tokens=MAX_NEW_TOKENS + ) + + expected_logits = generate_expected_logits( + neuron_model=model, + input_ids=input_ids, + inputs_attention_mask=attention_mask, + generation_config=gen_config, + num_tokens=MAX_NEW_TOKENS, + ) + check_accuracy_logits_v2( + neuron_model=model, + expected_logits=expected_logits, + inputs_input_ids=input_ids, + inputs_attention_mask=attention_mask, + generation_config=gen_config, + divergence_difference_tol=0.001, + num_tokens_to_check=MAX_NEW_TOKENS, + ) + print(" PASSED.") + + print("\n" + "=" * 70) + print("All tests passed!") + print("=" * 70) diff --git a/contrib/models/glm4_moe/test/integration/utils.py b/contrib/models/glm4_moe/test/integration/utils.py new file mode 100644 index 00000000..719800bf --- /dev/null +++ b/contrib/models/glm4_moe/test/integration/utils.py @@ -0,0 +1,132 @@ +# coding=utf-8 +"""Integration test utilities for GLM-4.5 MoE.""" + +import json +import os +import sys +import tempfile +from pathlib import Path +from typing import Optional, Tuple + +import torch + +# Add src to path +sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src")) + +from glm4_moe.modeling_glm4_moe import Glm4MoeInferenceConfig, NeuronGlm4MoeForCausalLM +from neuronx_distributed_inference.models.config import ( + MoENeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config + + +LNC = int(os.environ.get("NEURON_LOGICAL_NC_CONFIG", "1")) + + +def create_neuron_config( + tp_degree: int = 2, + seq_len: int = 128, + batch_size: int = 1, + torch_dtype: torch.dtype = torch.bfloat16, +) -> MoENeuronConfig: + """Create MoENeuronConfig for GLM-4.5 MoE integration tests. + + Args: + tp_degree: Tensor parallelism degree. + seq_len: Maximum sequence length. 
+ batch_size: Batch size for inference. + torch_dtype: Dtype for model weights. + + Returns: + Configured MoENeuronConfig. + """ + return MoENeuronConfig( + tp_degree=tp_degree, + moe_tp_degree=tp_degree, + moe_ep_degree=1, + batch_size=batch_size, + ctx_batch_size=batch_size, + tkg_batch_size=batch_size, + seq_len=seq_len, + max_context_length=seq_len - 8, + torch_dtype=torch_dtype, + on_device_sampling_config=OnDeviceSamplingConfig(do_sample=False, top_k=1), + output_logits=True, # Required for check_accuracy_logits_v2 + enable_bucketing=False, + flash_decoding_enabled=False, + fused_qkv=True, + sequence_parallel_enabled=False, + qkv_kernel_enabled=False, + attn_kernel_enabled=False, + ) + + +def save_hf_checkpoint(config_dict: dict, save_dir: str) -> str: + """Create a tiny random-weight HF checkpoint from a config dict. + + Args: + config_dict: HuggingFace model config dictionary. + save_dir: Directory to save the checkpoint. + + Returns: + Path to the saved checkpoint directory. + """ + from transformers import AutoConfig, AutoModelForCausalLM + + os.makedirs(save_dir, exist_ok=True) + + # Remove integration test internal note + config_dict = {k: v for k, v in config_dict.items() if not k.startswith("_")} + + config_path = os.path.join(save_dir, "config.json") + with open(config_path, "w") as f: + json.dump(config_dict, f, indent=2) + + config = AutoConfig.from_pretrained(save_dir) + torch.manual_seed(42) + model = AutoModelForCausalLM.from_config(config) + model = model.to(torch.bfloat16) + model.save_pretrained(save_dir) + + return save_dir + + +def prepare_inputs( + batch_size: int, + seq_len: int, + vocab_size: int = 1000, +) -> Tuple[torch.Tensor, torch.Tensor]: + """Prepare random input tensors for inference. + + Args: + batch_size: Number of sequences in the batch. + seq_len: Sequence length. + vocab_size: Vocabulary size for token sampling. + + Returns: + Tuple of (input_ids, attention_mask). 
+ """ + torch.manual_seed(0) + input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), dtype=torch.long) + attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long) + return input_ids, attention_mask + + +def get_test_name_suffix(tp_degree: int, dtype: torch.dtype, seq_len: int) -> str: + """Generate a descriptive suffix for test artifact file names. + + Args: + tp_degree: Tensor parallelism degree. + dtype: Model dtype. + seq_len: Sequence length. + + Returns: + String suffix, e.g. 'tp2_bf16_s128'. + """ + dtype_str = { + torch.bfloat16: "bf16", + torch.float16: "fp16", + torch.float32: "fp32", + }.get(dtype, "unk") + return f"tp{tp_degree}_{dtype_str}_s{seq_len}" diff --git a/contrib/models/glm4_moe/test/unit/__init__.py b/contrib/models/glm4_moe/test/unit/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/contrib/models/glm4_moe/test/unit/test_attention.py b/contrib/models/glm4_moe/test/unit/test_attention.py new file mode 100644 index 00000000..230bd0bb --- /dev/null +++ b/contrib/models/glm4_moe/test/unit/test_attention.py @@ -0,0 +1,292 @@ +# coding=utf-8 +"""Unit tests for NeuronGlm4MoeAttention (partial RoPE, QK norm, attention bias). + +These tests verify the attention-specific logic that differs from a standard +transformer: + 1. Partial RoPE: only first ``partial_rotary_factor * head_dim`` dims are rotated. + 2. Optional QK normalization per head. + 3. QKV projections carry a bias term (attention_bias=True). + +All tests run on CPU; no Neuron hardware is required. 
+""" + +import sys +from pathlib import Path +from types import SimpleNamespace +from unittest.mock import MagicMock, patch + +import pytest +import torch + +# Add contrib src to path +sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src")) + + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +def _make_attention_config( + hidden_size: int = 128, + num_attention_heads: int = 4, + num_key_value_heads: int = 2, + head_dim: int = 32, + partial_rotary_factor: float = 0.5, + attention_bias: bool = True, + use_qk_norm: bool = False, + max_position_embeddings: int = 512, + rope_theta: float = 1_000_000.0, + rms_norm_eps: float = 1e-5, +): + """Create a minimal config namespace for NeuronGlm4MoeAttention construction.""" + cfg = SimpleNamespace( + hidden_size=hidden_size, + num_attention_heads=num_attention_heads, + num_key_value_heads=num_key_value_heads, + head_dim=head_dim, + partial_rotary_factor=partial_rotary_factor, + attention_bias=attention_bias, + use_qk_norm=use_qk_norm, + max_position_embeddings=max_position_embeddings, + rope_theta=rope_theta, + rms_norm_eps=rms_norm_eps, + ) + return cfg + + +# --------------------------------------------------------------------------- +# Partial RoPE tests (no distributed context needed — logic is pure math) +# --------------------------------------------------------------------------- + + +class TestPartialRoPE: + """Verify the partial RoPE splitting logic in apply_rotary_embedding.""" + + def _make_qkv(self, batch: int, heads: int, seq: int, head_dim: int): + torch.manual_seed(0) + Q = torch.randn(batch, heads, seq, head_dim) + K = torch.randn(batch, heads, seq, head_dim) + V = torch.randn(batch, heads, seq, head_dim) + return Q, K, V + + def _apply_partial_rope_reference(self, Q, K, rotary_dim, cos, sin): + """Reference implementation matching the class logic.""" + from 
neuronx_distributed_inference.modules.attention.utils import ( + apply_rotary_pos_emb, + ) + + Q_rot, Q_pass = Q[..., :rotary_dim], Q[..., rotary_dim:] + K_rot, K_pass = K[..., :rotary_dim], K[..., rotary_dim:] + Q_rot, K_rot = apply_rotary_pos_emb(Q_rot, K_rot, cos, sin) + Q_out = torch.cat([Q_rot, Q_pass], dim=-1) + K_out = torch.cat([K_rot, K_pass], dim=-1) + return Q_out, K_out + + def test_rotary_dim_is_half_head_dim(self): + """rotary_dim must equal floor(head_dim * partial_rotary_factor).""" + head_dim = 32 + factor = 0.5 + expected_rotary_dim = int(head_dim * factor) + + # Import RotaryEmbedding directly to verify the rotary_dim it computes + from neuronx_distributed_inference.modules.attention.utils import ( + RotaryEmbedding, + ) + + rotary_emb = RotaryEmbedding( + expected_rotary_dim, + max_position_embeddings=512, + base=1_000_000.0, + ) + # RotaryEmbedding stores dim as self.dim (or indirectly via cos_cached shape) + assert expected_rotary_dim == 16, f"Expected 16, got {expected_rotary_dim}" + + def test_pass_through_dimensions_unchanged(self): + """The non-rotary tail of Q and K must be bit-exact after apply_rotary_embedding.""" + try: + from neuronx_distributed_inference.modules.attention.utils import ( + RotaryEmbedding, + apply_rotary_pos_emb, + ) + except ImportError: + pytest.skip("neuronx_distributed_inference not installed") + + batch, heads, seq, head_dim = 1, 4, 8, 32 + rotary_dim = 16 # 0.5 * 32 + + Q, K, V = self._make_qkv(batch, heads, seq, head_dim) + + rotary_emb = RotaryEmbedding( + rotary_dim, max_position_embeddings=seq, base=1_000_000.0 + ) + position_ids = torch.arange(seq).unsqueeze(0) + cos, sin = rotary_emb(V, position_ids) + + Q_out, K_out = self._apply_partial_rope_reference(Q, K, rotary_dim, cos, sin) + + # Pass-through portions must be identical to input + torch.testing.assert_close(Q_out[..., rotary_dim:], Q[..., rotary_dim:]) + torch.testing.assert_close(K_out[..., rotary_dim:], K[..., rotary_dim:]) + + def 
test_rotary_dimensions_are_changed(self): + """The rotary portion of Q/K must differ from the original after applying RoPE.""" + try: + from neuronx_distributed_inference.modules.attention.utils import ( + RotaryEmbedding, + apply_rotary_pos_emb, + ) + except ImportError: + pytest.skip("neuronx_distributed_inference not installed") + + batch, heads, seq, head_dim = 1, 4, 8, 32 + rotary_dim = 16 + + Q, K, V = self._make_qkv(batch, heads, seq, head_dim) + + rotary_emb = RotaryEmbedding( + rotary_dim, max_position_embeddings=seq, base=1_000_000.0 + ) + position_ids = torch.arange(seq).unsqueeze(0) + cos, sin = rotary_emb(V, position_ids) + + Q_out, _ = self._apply_partial_rope_reference(Q, K, rotary_dim, cos, sin) + + # Rotary portion should differ (non-zero rotation for non-trivial positions) + assert not torch.allclose(Q_out[..., :rotary_dim], Q[..., :rotary_dim]), ( + "RoPE should change the rotary portion of Q" + ) + + def test_output_shape_preserved(self): + """Output tensors must retain the exact same shape as input.""" + try: + from neuronx_distributed_inference.modules.attention.utils import ( + RotaryEmbedding, + apply_rotary_pos_emb, + ) + except ImportError: + pytest.skip("neuronx_distributed_inference not installed") + + batch, heads, seq, head_dim = 2, 4, 16, 32 + rotary_dim = 16 + + Q, K, V = self._make_qkv(batch, heads, seq, head_dim) + + rotary_emb = RotaryEmbedding( + rotary_dim, max_position_embeddings=seq, base=1_000_000.0 + ) + position_ids = torch.arange(seq).unsqueeze(0).expand(batch, -1) + cos, sin = rotary_emb(V, position_ids) + + Q_out, K_out = self._apply_partial_rope_reference(Q, K, rotary_dim, cos, sin) + + assert Q_out.shape == Q.shape, f"Q shape changed: {Q.shape} → {Q_out.shape}" + assert K_out.shape == K.shape, f"K shape changed: {K.shape} → {K_out.shape}" + + +# --------------------------------------------------------------------------- +# QK norm tests (CPU mock, no distributed env) +# 
--------------------------------------------------------------------------- + + +class TestQKNorm: + """Verify QK normalization initialization logic.""" + + def _build_attention_no_dist(self, use_qk_norm: bool): + """Construct NeuronGlm4MoeAttention with mocked distributed state.""" + try: + from glm4_moe.modeling_glm4_moe import NeuronGlm4MoeAttention + except ImportError: + pytest.skip("glm4_moe package not importable (Neuron SDK missing)") + + config = _make_attention_config(use_qk_norm=use_qk_norm) + + with ( + patch( + "neuronx_distributed.parallel_layers.parallel_state.model_parallel_is_initialized", + return_value=True, + ), + patch( + "neuronx_distributed.parallel_layers.parallel_state.get_tensor_model_parallel_rank", + return_value=0, + ), + patch( + "neuronx_distributed.parallel_layers.parallel_state.get_tensor_model_parallel_size", + return_value=1, + ), + ): + attn = NeuronGlm4MoeAttention.__new__(NeuronGlm4MoeAttention) + attn.rotary_dim = int(config.head_dim * config.partial_rotary_factor) + if config.use_qk_norm: + from glm4_moe.modeling_glm4_moe import get_rmsnorm_cls + + attn.q_layernorm = get_rmsnorm_cls()( + config.head_dim, config.rms_norm_eps + ) + attn.k_layernorm = get_rmsnorm_cls()( + config.head_dim, config.rms_norm_eps + ) + else: + attn.q_layernorm = None + attn.k_layernorm = None + return attn + + def test_qk_norm_none_when_disabled(self): + """q_layernorm and k_layernorm should be None when use_qk_norm=False.""" + attn = self._build_attention_no_dist(use_qk_norm=False) + assert attn.q_layernorm is None, "q_layernorm should be None" + assert attn.k_layernorm is None, "k_layernorm should be None" + + def test_qk_norm_present_when_enabled(self): + """Source code must create q_layernorm/k_layernorm when use_qk_norm=True. + + We verify this via source code inspection rather than instantiation, + because full construction requires a distributed process group. 
+ The __init__ source should contain 'q_layernorm' and 'k_layernorm' + assignments guarded by 'if config.use_qk_norm'. + """ + import inspect + + try: + from glm4_moe.modeling_glm4_moe import NeuronGlm4MoeAttention + except ImportError: + pytest.skip("glm4_moe package not importable (Neuron SDK missing)") + + src = inspect.getsource(NeuronGlm4MoeAttention.__init__) + assert "use_qk_norm" in src, "use_qk_norm must be checked in __init__" + assert "q_layernorm" in src, "q_layernorm must be assigned in __init__" + assert "k_layernorm" in src, "k_layernorm must be assigned in __init__" + + +# --------------------------------------------------------------------------- +# Rotary dim calculation test (pure math, no imports needed) +# --------------------------------------------------------------------------- + + +class TestRotaryDimCalculation: + """Pure math: verify rotary_dim = floor(head_dim * partial_rotary_factor).""" + + @pytest.mark.parametrize( + "head_dim,factor,expected", + [ + (128, 0.5, 64), + (64, 0.5, 32), + (32, 0.5, 16), + (128, 0.25, 32), + (64, 0.75, 48), + ], + ) + def test_rotary_dim_values(self, head_dim, factor, expected): + """rotary_dim must be int(head_dim * partial_rotary_factor).""" + rotary_dim = int(head_dim * factor) + assert rotary_dim == expected, ( + f"head_dim={head_dim}, factor={factor}: " + f"expected rotary_dim={expected}, got {rotary_dim}" + ) + + def test_glm45_air_default(self): + """GLM-4.5 Air uses head_dim=128, partial_rotary_factor=0.5 → rotary_dim=64.""" + head_dim = 128 + partial_rotary_factor = 0.5 + rotary_dim = int(head_dim * partial_rotary_factor) + assert rotary_dim == 64 diff --git a/contrib/models/glm4_moe/test/unit/test_decoder.py b/contrib/models/glm4_moe/test/unit/test_decoder.py new file mode 100644 index 00000000..ae7b6313 --- /dev/null +++ b/contrib/models/glm4_moe/test/unit/test_decoder.py @@ -0,0 +1,198 @@ +# coding=utf-8 +"""Unit tests for NeuronGlm4MoeDecoderLayer dispatch logic. + +These tests verify: + 1. 
The is_moe_layer flag: layer_idx >= first_k_dense_replace + 2. The DenseMLP class exists and has the correct structure + 3. The DecoderLayer class interface attributes exist + +The is_moe_layer flag is pure Python logic and can be tested without +initializing a full distributed environment. Structural checks verify +the class API contract without forward passes. + +All tests run on CPU; no Neuron hardware is required. +""" + +import sys +from pathlib import Path + +import pytest +import torch + +# Add contrib src to path +sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src")) + + +# --------------------------------------------------------------------------- +# Import once at module level with skip guard +# --------------------------------------------------------------------------- + + +def _import_classes(): + """Import modeling classes, skip if Neuron SDK missing.""" + try: + from glm4_moe.modeling_glm4_moe import ( + NeuronGlm4MoeDecoderLayer, + NeuronGlm4MoeDenseMLP, + ) + + return NeuronGlm4MoeDecoderLayer, NeuronGlm4MoeDenseMLP + except ImportError: + pytest.skip("glm4_moe package not importable (Neuron SDK missing)") + + +# --------------------------------------------------------------------------- +# is_moe_layer flag tests (pure math, no distributed context needed) +# --------------------------------------------------------------------------- + + +class TestIsMoELayerFlag: + """Verify that is_moe_layer = (layer_idx >= first_k_dense_replace). + + The flag is computed in __init__ as a simple boolean expression. + We test the formula directly without constructing the full layer, + since full construction requires a distributed process group. 
+ """ + + @staticmethod + def _compute_flag(layer_idx: int, first_k_dense_replace: int) -> bool: + """Reproduce the flag logic from NeuronGlm4MoeDecoderLayer.__init__.""" + return layer_idx >= first_k_dense_replace + + @pytest.mark.parametrize( + "layer_idx,first_k,expected", + [ + # first_k_dense_replace = 1: layer 0 is dense, rest are MoE + (0, 1, False), + (1, 1, True), + (2, 1, True), + # first_k_dense_replace = 2: layers 0,1 are dense + (0, 2, False), + (1, 2, False), + (2, 2, True), + (3, 2, True), + # first_k_dense_replace = 0: all layers are MoE + (0, 0, True), + (1, 0, True), + # first_k_dense_replace = 4: all 4 layers are dense + (0, 4, False), + (1, 4, False), + (2, 4, False), + (3, 4, False), + # Boundary: layer exactly at first_k_dense_replace + (2, 2, True), + (3, 3, True), + ], + ) + def test_flag_formula(self, layer_idx, first_k, expected): + """is_moe_layer must equal (layer_idx >= first_k_dense_replace).""" + result = self._compute_flag(layer_idx, first_k) + assert result == expected, ( + f"layer_idx={layer_idx}, first_k_dense_replace={first_k}: " + f"expected is_moe_layer={expected}, got {result}" + ) + + def test_dense_layer_is_below_boundary(self): + """All layers below first_k_dense_replace must be dense.""" + first_k = 3 + for idx in range(first_k): + flag = self._compute_flag(idx, first_k) + assert flag is False, f"Layer {idx} should be dense (first_k={first_k})" + + def test_moe_layer_is_at_or_above_boundary(self): + """All layers at or above first_k_dense_replace must be MoE.""" + first_k = 3 + for idx in range(first_k, 6): + flag = self._compute_flag(idx, first_k) + assert flag is True, f"Layer {idx} should be MoE (first_k={first_k})" + + def test_glm45_air_default_first_k_is_1(self): + """GLM-4.5 Air default: first_k_dense_replace=1, so layer 0 is dense.""" + first_k = 1 # GLM-4.5 Air default + assert self._compute_flag(0, first_k) is False, "Layer 0 must be dense" + assert self._compute_flag(1, first_k) is True, "Layer 1 must be MoE" + 
+ +# --------------------------------------------------------------------------- +# DenseMLP structure tests (no distributed env needed — just class inspect) +# --------------------------------------------------------------------------- + + +class TestDenseMLPStructure: + """Verify NeuronGlm4MoeDenseMLP class API contract.""" + + def test_dense_mlp_class_exists(self): + """NeuronGlm4MoeDenseMLP must be importable.""" + _, DenseMLP = _import_classes() + assert DenseMLP is not None + + def test_dense_mlp_is_nn_module(self): + """NeuronGlm4MoeDenseMLP must subclass nn.Module.""" + import torch.nn as nn + + _, DenseMLP = _import_classes() + assert issubclass(DenseMLP, nn.Module), ( + "NeuronGlm4MoeDenseMLP must be an nn.Module subclass" + ) + + def test_dense_mlp_has_forward(self): + """NeuronGlm4MoeDenseMLP must define a forward method.""" + _, DenseMLP = _import_classes() + assert hasattr(DenseMLP, "forward"), "Missing forward method" + assert callable(DenseMLP.forward) + + def test_dense_mlp_init_expects_config(self): + """NeuronGlm4MoeDenseMLP.__init__ must accept a 'config' parameter.""" + import inspect + + _, DenseMLP = _import_classes() + sig = inspect.signature(DenseMLP.__init__) + assert "config" in sig.parameters, ( + "NeuronGlm4MoeDenseMLP.__init__ must accept a 'config' parameter" + ) + + +# --------------------------------------------------------------------------- +# DecoderLayer class structure tests +# --------------------------------------------------------------------------- + + +class TestDecoderLayerClassStructure: + """Verify NeuronGlm4MoeDecoderLayer class API contract (no instantiation).""" + + def test_decoder_layer_class_exists(self): + """NeuronGlm4MoeDecoderLayer must be importable.""" + DecoderLayer, _ = _import_classes() + assert DecoderLayer is not None + + def test_decoder_layer_is_nn_module(self): + """NeuronGlm4MoeDecoderLayer must subclass nn.Module.""" + import torch.nn as nn + + DecoderLayer, _ = _import_classes() + assert 
issubclass(DecoderLayer, nn.Module) + + def test_decoder_layer_has_forward(self): + """NeuronGlm4MoeDecoderLayer must define a forward method.""" + DecoderLayer, _ = _import_classes() + assert hasattr(DecoderLayer, "forward") and callable(DecoderLayer.forward) + + def test_decoder_layer_init_accepts_layer_idx(self): + """__init__ must accept a layer_idx parameter for dispatch.""" + import inspect + + DecoderLayer, _ = _import_classes() + sig = inspect.signature(DecoderLayer.__init__) + assert "layer_idx" in sig.parameters, ( + "NeuronGlm4MoeDecoderLayer.__init__ must accept 'layer_idx'" + ) + + def test_decoder_layer_init_accepts_config(self): + """__init__ must accept a config parameter.""" + import inspect + + DecoderLayer, _ = _import_classes() + sig = inspect.signature(DecoderLayer.__init__) + assert "config" in sig.parameters, ( + "NeuronGlm4MoeDecoderLayer.__init__ must accept 'config'" + ) diff --git a/contrib/models/glm4_moe/test/unit/test_router.py b/contrib/models/glm4_moe/test/unit/test_router.py new file mode 100644 index 00000000..860d6fe2 --- /dev/null +++ b/contrib/models/glm4_moe/test/unit/test_router.py @@ -0,0 +1,195 @@ +# coding=utf-8 +"""Unit tests for NeuronGlm4MoeRouter. + +These tests run on CPU (no Neuron hardware required) and verify the +routing logic independently of the full model stack. +""" + +import sys +from pathlib import Path + +import pytest +import torch + +# Add contrib src to path +sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src")) + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def _make_router( + num_experts: int = 8, + top_k: int = 2, + hidden_size: int = 64, + n_group: int = 1, + topk_group: int = 1, + norm_topk_prob: bool = True, + routed_scaling_factor: float = 1.0, +): + """Instantiate NeuronGlm4MoeRouter for CPU tests. 
+ + Bypasses distributed environment by not calling super().__init__() in the + normal path — instead we construct a minimal stub that exercises only the + routing math we own. + """ + from glm4_moe.modeling_glm4_moe import NeuronGlm4MoeRouter + + # We cannot call the full constructor because GroupLimitedRouter requires a + # distributed process group. Instead we build the key attributes directly. + router = object.__new__(NeuronGlm4MoeRouter) + # Patch required attributes from GroupLimitedRouter / NeuronGlm4MoeRouter + router.num_experts = num_experts + router.top_k = top_k + router.hidden_size = hidden_size + router.n_group = n_group + router.topk_group = topk_group + router.norm_topk_prob = norm_topk_prob + router.routed_scaling_factor = routed_scaling_factor + router.register_buffer = lambda name, tensor: setattr(router, name, tensor) + router.register_buffer( + "e_score_correction_bias", torch.zeros(num_experts, dtype=torch.float32) + ) + return router + + +def _group_scores(scores, n_group, num_experts): + """Compute per-group max-score without routing classes.""" + group_size = num_experts // n_group + return scores.view(scores.shape[0], n_group, group_size).max(dim=-1).values + + +# --------------------------------------------------------------------------- +# noaux_tc_top_k unit tests +# --------------------------------------------------------------------------- + + +class TestRouterTopK: + """Unit tests for NeuronGlm4MoeRouter.noaux_tc_top_k.""" + + def test_topk_idx_shape(self): + """topk_idx must have shape [batch, top_k].""" + router = _make_router(num_experts=8, top_k=2) + scores = torch.rand(4, 8) + topk_idx, _ = router.noaux_tc_top_k(scores) + assert topk_idx.shape == (4, 2), f"Expected (4, 2), got {topk_idx.shape}" + + def test_full_affinities_shape(self): + """full_affinities must have same shape as scores.""" + router = _make_router(num_experts=8, top_k=2) + scores = torch.rand(4, 8) + _, full_affinities = router.noaux_tc_top_k(scores) + assert 
full_affinities.shape == scores.shape + + def test_full_affinities_sparsity(self): + """Only top_k entries per row should be non-zero.""" + router = _make_router(num_experts=8, top_k=2) + scores = torch.rand(4, 8) + _, full_affinities = router.noaux_tc_top_k(scores) + nonzero_per_row = (full_affinities != 0).sum(dim=-1) + assert (nonzero_per_row == 2).all(), ( + f"Expected 2 non-zeros per row, got {nonzero_per_row}" + ) + + def test_normalized_weights_sum_to_one(self): + """With norm_topk_prob=True, selected weights should sum to routed_scaling_factor.""" + factor = 2.0 + router = _make_router( + num_experts=8, top_k=2, norm_topk_prob=True, routed_scaling_factor=factor + ) + scores = torch.rand(4, 8).abs() + 0.1 # ensure positive + _, full_affinities = router.noaux_tc_top_k(scores) + row_sums = full_affinities.sum(dim=-1) + # Each row should sum to ~routed_scaling_factor (within floating point tolerance) + torch.testing.assert_close( + row_sums, + torch.full_like(row_sums, factor), + atol=1e-5, + rtol=1e-5, + msg="Normalized + scaled weights should sum to routed_scaling_factor per row", + ) + + def test_no_normalization_scaling_only(self): + """With norm_topk_prob=False, weights should be raw sigmoid scores * routed_scaling_factor.""" + factor = 0.5 + router = _make_router( + num_experts=4, top_k=1, norm_topk_prob=False, routed_scaling_factor=factor + ) + # Force known scores + scores = torch.tensor([[0.2, 0.8, 0.1, 0.4]]) + topk_idx, full_affinities = router.noaux_tc_top_k(scores) + selected_weight = full_affinities[0, topk_idx[0, 0]] + # Raw top-1 score is 0.8; after scaling: 0.8 * 0.5 = 0.4 + expected = torch.tensor(0.8 * factor) + torch.testing.assert_close(selected_weight, expected, atol=1e-5, rtol=1e-5) + + def test_e_score_correction_bias_shifts_routing_decision(self): + """e_score_correction_bias should shift which expert gets selected.""" + router = _make_router( + num_experts=4, top_k=1, n_group=1, topk_group=1, norm_topk_prob=False + ) + # Scores: expert 
0 wins normally + scores = torch.tensor([[0.9, 0.5, 0.3, 0.1]]) + topk_before, _ = router.noaux_tc_top_k(scores) + assert topk_before[0, 0].item() == 0, "Expert 0 should win without bias" + + # Add large bias to expert 1 → expert 1 should win now + router.e_score_correction_bias = torch.tensor([0.0, 5.0, 0.0, 0.0]) + topk_after, _ = router.noaux_tc_top_k(scores) + assert topk_after[0, 0].item() == 1, "Expert 1 should win with strong bias" + + def test_correction_bias_not_used_for_final_weights(self): + """e_score_correction_bias must NOT pollute the selected weights.""" + router = _make_router( + num_experts=4, + top_k=1, + n_group=1, + topk_group=1, + norm_topk_prob=False, + routed_scaling_factor=1.0, + ) + scores = torch.tensor([[0.9, 0.5, 0.3, 0.1]]) + + # Add correction bias that changes routing decision + router.e_score_correction_bias = torch.tensor([0.0, 5.0, 0.0, 0.0]) + topk_idx, full_affinities = router.noaux_tc_top_k(scores) + + selected_expert = topk_idx[0, 0].item() # should be 1 + selected_weight = full_affinities[0, selected_expert] + + # Weight must come from original scores (0.5), NOT bias-corrected scores (5.5) + torch.testing.assert_close( + selected_weight, + torch.tensor(0.5), + atol=1e-5, + rtol=1e-5, + msg="Selected weight must be from original sigmoid scores, not bias-corrected", + ) + + def test_topk_idx_dtype(self): + """topk_idx must be int64 (long) for MoE dispatch compatibility.""" + router = _make_router(num_experts=8, top_k=2) + scores = torch.rand(2, 8) + topk_idx, _ = router.noaux_tc_top_k(scores) + # noaux_tc_top_k itself returns the idx from torch.topk (int64) + assert topk_idx.dtype == torch.int64, f"Expected int64, got {topk_idx.dtype}" + + def test_full_affinities_non_negative(self): + """Expert weights must be non-negative (softmax-like normalization).""" + router = _make_router(num_experts=8, top_k=3) + scores = torch.rand(4, 8) + _, full_affinities = router.noaux_tc_top_k(scores) + assert (full_affinities >= 0).all(), "All 
expert weights must be >= 0" + + def test_batch_independence(self): + """Each batch element is routed independently.""" + router = _make_router( + num_experts=4, top_k=1, n_group=1, topk_group=1, norm_topk_prob=False + ) + # Different scores per batch item + scores = torch.tensor([[0.9, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.9]]) + topk_idx, _ = router.noaux_tc_top_k(scores) + assert topk_idx[0, 0].item() == 0, "First item should select expert 0" + assert topk_idx[1, 0].item() == 3, "Second item should select expert 3" diff --git a/contrib/models/glm4_moe/vllm/README.md b/contrib/models/glm4_moe/vllm/README.md new file mode 100644 index 00000000..0a8d6091 --- /dev/null +++ b/contrib/models/glm4_moe/vllm/README.md @@ -0,0 +1,71 @@ +# GLM-4.5 MoE — vLLM Integration + +This directory contains scripts for serving GLM-4.5 MoE via vLLM with the +`neuronx-distributed-inference` backend on Trn1/Trn2 instances. + +## Prerequisites + +- AWS Neuron SDK 2.21+ +- `vllm-neuron` fork (supports `VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference`) +- `transformers>=4.56.0` (required for `Glm4MoeForCausalLM`) +- Trn1 (`trn1.32xlarge`) or Trn2 (`trn2.48xlarge`) instance + +## Registration + +GLM-4.5 MoE must be registered with vLLM before serving. 
Add the following +to your vLLM registration file (typically `vllm/model_executor/models/registry.py`): + +```python +# In the MoE models section: +"Glm4MoeForCausalLM": ("glm4_moe", "NeuronGlm4MoeForCausalLM"), +``` + +Alternatively, set the model class via the NxDI model registry: + +```python +from neuronx_distributed_inference.models.registry import register_model +from glm4_moe.modeling_glm4_moe import NeuronGlm4MoeForCausalLM + +register_model("Glm4MoeForCausalLM", NeuronGlm4MoeForCausalLM) +``` + +## Offline Inference + +```bash +python vllm/run_offline_inference.py \ + --model /path/to/GLM-4.5-Air \ + --tp-degree 32 \ + --seq-len 4096 +``` + +## Online Serving (OpenAI-compatible API) + +```bash +bash vllm/start-vllm-server.sh +``` + +Then query: + +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "/path/to/GLM-4.5-Air", + "messages": [{"role": "user", "content": "Hello!"}], + "max_tokens": 100 + }' +``` + +## Configuration + +Key `override-neuron-config` parameters for GLM-4.5 MoE: + +| Parameter | Recommended Value | Description | +|---|---|---| +| `tp_degree` | 32 (Trn1) / 64+ (Trn2) | Tensor parallelism | +| `moe_tp_degree` | Same as `tp_degree` | MoE tensor parallelism | +| `moe_ep_degree` | 1 | Expert parallelism (increase for EP) | +| `batch_size` | 1 | Static batch size | +| `seq_len` | 4096–32768 | Max sequence length | +| `fused_qkv` | `true` | Fused QKV for performance | +| `flash_decoding_enabled` | `true` (Trn2) | Flash attention for decoding | diff --git a/contrib/models/glm4_moe/vllm/requirements.txt b/contrib/models/glm4_moe/vllm/requirements.txt new file mode 100644 index 00000000..f7b00fc2 --- /dev/null +++ b/contrib/models/glm4_moe/vllm/requirements.txt @@ -0,0 +1,18 @@ +# vLLM integration for GLM-4.5 MoE on AWS Neuron +# +# vllm-neuronx is the AWS Neuron port of vLLM. 
+# It is not yet available on PyPI; install from the upstream development branch +# or the AWS Neuron SDK release channel. +# +# Tested version: 0.9.0.dev0+neuron222 (commit-pinned dev build) +# Install command (from the NxDI documentation): +# +# pip install vllm-neuronx== +# +# Or for development / edge builds: +# git clone https://github.com/aws/vllm-neuronx +# pip install -e . +# +# Other requirements are pulled in transitively by vllm-neuronx: +# torch-neuronx, neuronx-distributed, transformers>=4.56.0 +vllm-neuronx>=0.9.0.dev0 diff --git a/contrib/models/glm4_moe/vllm/run_offline_inference.py b/contrib/models/glm4_moe/vllm/run_offline_inference.py new file mode 100644 index 00000000..43f4a1bf --- /dev/null +++ b/contrib/models/glm4_moe/vllm/run_offline_inference.py @@ -0,0 +1,143 @@ +#!/usr/bin/env python3 +# coding=utf-8 +"""Offline inference example for GLM-4.5 MoE using vLLM + NxDI backend. + +Usage: + python run_offline_inference.py \ + --model /path/to/GLM-4.5-Air \ + --tp-degree 32 \ + --seq-len 4096 + +Requires: + - VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference + - transformers>=4.56.0 + - vllm-neuron with NxDI backend support +""" + +import argparse +import os +import sys + + +def parse_args() -> argparse.Namespace: + """Parse command-line arguments.""" + parser = argparse.ArgumentParser( + description="GLM-4.5 MoE offline inference via vLLM + NxDI", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, + ) + parser.add_argument( + "--model", + type=str, + required=True, + help="Path to the GLM-4.5 MoE HuggingFace checkpoint", + ) + parser.add_argument( + "--tp-degree", + type=int, + default=32, + help="Tensor parallelism degree", + ) + parser.add_argument( + "--seq-len", + type=int, + default=4096, + help="Maximum sequence length", + ) + parser.add_argument( + "--batch-size", + type=int, + default=1, + help="Static batch size", + ) + parser.add_argument( + "--max-new-tokens", + type=int, + default=200, + help="Maximum number of tokens to 
generate", + ) + parser.add_argument( + "--dtype", + type=str, + default="bfloat16", + choices=["bfloat16", "float16"], + help="Model dtype", + ) + return parser.parse_args() + + +def build_neuron_config(args: argparse.Namespace) -> dict: + """Build the NxDI neuron config dict for vLLM override.""" + return { + "tp_degree": args.tp_degree, + "moe_tp_degree": args.tp_degree, + "moe_ep_degree": 1, + "batch_size": args.batch_size, + "seq_len": args.seq_len, + "max_context_length": args.seq_len, + "torch_dtype": args.dtype, + "fused_qkv": True, + "flash_decoding_enabled": True, + "on_device_sampling_config": { + "dynamic": True, + "global_topk": 64, + "top_p": 1.0, + "temperature": 1.0, + }, + } + + +def main() -> None: + """Run offline inference with GLM-4.5 MoE via vLLM.""" + args = parse_args() + + # Set backend + os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference" + + try: + import json + + from vllm import LLM, SamplingParams + except ImportError: + print("ERROR: vllm is not installed. 
Please install the vllm-neuron fork.") + sys.exit(1) + + neuron_config = build_neuron_config(args) + + print(f"\nLoading GLM-4.5 MoE model from: {args.model}") + print(f" tp_degree={args.tp_degree}, seq_len={args.seq_len}, dtype={args.dtype}") + + llm = LLM( + model=args.model, + max_model_len=args.seq_len, + tensor_parallel_size=args.tp_degree, + max_num_seqs=args.batch_size, + override_neuron_config=neuron_config, + trust_remote_code=True, + ) + + sampling_params = SamplingParams( + max_tokens=args.max_new_tokens, + temperature=0.0, # greedy + ) + + prompts = [ + "Explain the mixture-of-experts architecture in one paragraph.", + "What is the capital of France?", + "Write a Python function to compute the Fibonacci sequence.", + ] + + print(f"\nRunning inference on {len(prompts)} prompts...") + outputs = llm.generate(prompts, sampling_params) + + print("\n" + "=" * 70) + for i, output in enumerate(outputs): + prompt_text = output.prompt + generated_text = output.outputs[0].text + print(f"\n[{i + 1}] Prompt: {prompt_text[:80]}...") + print(f" Output: {generated_text[:200]}") + print("=" * 70) + print(f"\nDone. Generated {len(outputs)} responses.") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/glm4_moe/vllm/start-vllm-server.sh b/contrib/models/glm4_moe/vllm/start-vllm-server.sh new file mode 100644 index 00000000..97fd9847 --- /dev/null +++ b/contrib/models/glm4_moe/vllm/start-vllm-server.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# Start GLM-4.5 MoE OpenAI-compatible API server via vLLM + NxDI backend. 
+# +# Usage: +# MODEL_PATH=/path/to/GLM-4.5-Air bash vllm/start-vllm-server.sh +# +# Environment variables (with defaults): +# MODEL_PATH - Required: path to HuggingFace checkpoint +# TP_DEGREE - Tensor parallelism degree (default: 32) +# SEQ_LEN - Max sequence length (default: 4096) +# MAX_NUM_SEQS - Max concurrent requests (default: 1) +# PORT - Server port (default: 8000) + +set -euo pipefail + +: "${MODEL_PATH:?ERROR: MODEL_PATH environment variable must be set}" +: "${TP_DEGREE:=32}" +: "${SEQ_LEN:=4096}" +: "${MAX_NUM_SEQS:=1}" +: "${PORT:=8000}" + +echo "Starting GLM-4.5 MoE vLLM server..." +echo " Model: ${MODEL_PATH}" +echo " TP degree: ${TP_DEGREE}" +echo " Seq len: ${SEQ_LEN}" +echo " Port: ${PORT}" + +VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference" \ +python -m vllm.entrypoints.openai.api_server \ + --model="${MODEL_PATH}" \ + --max-model-len="${SEQ_LEN}" \ + --tensor-parallel-size="${TP_DEGREE}" \ + --port="${PORT}" \ + --max-num-seqs="${MAX_NUM_SEQS}" \ + --trust-remote-code \ + --override-neuron-config "{ + \"tp_degree\": ${TP_DEGREE}, + \"moe_tp_degree\": ${TP_DEGREE}, + \"moe_ep_degree\": 1, + \"batch_size\": ${MAX_NUM_SEQS}, + \"seq_len\": ${SEQ_LEN}, + \"max_context_length\": ${SEQ_LEN}, + \"fused_qkv\": true, + \"flash_decoding_enabled\": true, + \"on_device_sampling_config\": { + \"dynamic\": true, + \"global_topk\": 64, + \"top_p\": 1.0, + \"temperature\": 1.0 + } + }" diff --git a/examples/generation_glm4_moe.py b/examples/generation_glm4_moe.py new file mode 100644 index 00000000..7ab5e48b --- /dev/null +++ b/examples/generation_glm4_moe.py @@ -0,0 +1,114 @@ +import os +import sys + +import torch +from transformers import AutoTokenizer, GenerationConfig + +# GLM-4.5 MoE is a contrib model; add its src to the Python path. 
+_CONTRIB_SRC = os.path.join( + os.path.dirname(os.path.dirname(os.path.abspath(__file__))), + "contrib", + "models", + "glm4_moe", + "src", +) +if _CONTRIB_SRC not in sys.path: + sys.path.insert(0, _CONTRIB_SRC) + +from glm4_moe.modeling_glm4_moe import Glm4MoeInferenceConfig, NeuronGlm4MoeForCausalLM +from neuronx_distributed_inference.models.config import ( + MoENeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( + HuggingFaceGenerationAdapter, + load_pretrained_config, +) +from neuronx_distributed_inference.utils.benchmark import benchmark_sampling + +model_path = "/path/to/GLM-4.5-Air" # HuggingFace checkpoint root +traced_model_path = "/path/to/GLM-4.5-Air-traced" # Compiled Neuron artifacts + +torch.manual_seed(0) + +DTYPE = torch.bfloat16 + + +def generate(skip_compile=False): + # Initialize configs and tokenizer. + generation_config = GenerationConfig.from_pretrained(model_path) + + if not skip_compile: + neuron_config = MoENeuronConfig( + tp_degree=32, + moe_tp_degree=4, + moe_ep_degree=8, + batch_size=4, + ctx_batch_size=1, + tkg_batch_size=4, + seq_len=4096, + scratchpad_page_size=512, + torch_dtype=DTYPE, + on_device_sampling_config=OnDeviceSamplingConfig( + do_sample=True, temperature=0.6, top_k=20, top_p=0.95 + ), + enable_bucketing=False, + flash_decoding_enabled=True, + fused_qkv=True, + logical_nc_config=2, + ) + config = Glm4MoeInferenceConfig( + neuron_config, + load_config=load_pretrained_config(model_path), + ) + tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="right") + tokenizer.pad_token = tokenizer.eos_token + # Compile and save model. + print("\nCompiling and saving model...") + model = NeuronGlm4MoeForCausalLM(model_path, config) + model.compile(traced_model_path) + tokenizer.save_pretrained(traced_model_path) + + # Load from compiled checkpoint. 
+ print("\nLoading model from compiled checkpoint...") + model = NeuronGlm4MoeForCausalLM(traced_model_path) + model.load(traced_model_path) + tokenizer = AutoTokenizer.from_pretrained(traced_model_path) + + # Generate outputs. + print("\nGenerating outputs...") + prompt = "Give me a short introduction to large language models." + messages = [{"role": "user", "content": prompt}] + text = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + inputs = tokenizer([text], padding=True, return_tensors="pt") + generation_model = HuggingFaceGenerationAdapter(model) + outputs = generation_model.generate( + inputs.input_ids, + generation_config=generation_config, + attention_mask=inputs.attention_mask, + max_length=model.config.neuron_config.max_length, + ) + output_tokens = tokenizer.batch_decode( + outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Generated outputs:") + for i, output_token in enumerate(output_tokens): + print(f"Output {i}: {output_token}") + + print("\nPerformance Benchmarking!") + benchmark_sampling( + model=model, + draft_model=None, + generation_config=generation_config, + target="all", + benchmark_report_path="benchmark_report.json", + num_runs=5, + ) + + +if __name__ == "__main__": + generate() diff --git a/src/neuronx_distributed_inference/utils/hf_adapter.py b/src/neuronx_distributed_inference/utils/hf_adapter.py index c9b4a38b..0f3b9058 100644 --- a/src/neuronx_distributed_inference/utils/hf_adapter.py +++ b/src/neuronx_distributed_inference/utils/hf_adapter.py @@ -1,8 +1,11 @@ import copy +import inspect import os from types import SimpleNamespace from typing import Any, Dict, Optional, Union -from neuronx_distributed_inference.utils.tensor_replacement.registry import TensorReplacementRegister +from neuronx_distributed_inference.utils.tensor_replacement.registry import ( + TensorReplacementRegister, +) import torch from neuronx_distributed.utils.medusa_utils import ( 
evaluate_posterior, @@ -10,7 +13,13 @@ generate_medusa_buffers, update_inference_inputs, ) -from transformers import AutoConfig, GenerationConfig, GenerationMixin, PretrainedConfig, PreTrainedModel +from transformers import ( + AutoConfig, + GenerationConfig, + GenerationMixin, + PretrainedConfig, + PreTrainedModel, +) from transformers.generation import GenerateDecoderOnlyOutput, SampleDecoderOnlyOutput from transformers.generation.logits_process import LogitsProcessorList from transformers.generation.stopping_criteria import StoppingCriteriaList @@ -40,7 +49,9 @@ def load_config(self: InferenceConfig): if (model_path_or_name is None and hf_config is None) or ( model_path_or_name is not None and hf_config is not None ): - raise ValueError('Please provide only one of "model_path_or_name" or "hf_config"') + raise ValueError( + 'Please provide only one of "model_path_or_name" or "hf_config"' + ) if model_path_or_name is not None: config: PretrainedConfig = AutoConfig.from_pretrained(model_path_or_name) @@ -55,11 +66,16 @@ def load_config(self: InferenceConfig): # Set torch_dtype in NeuronConfig. hf_dtype = config_dict.get("dtype", config_dict.get("torch_dtype", None)) if hf_dtype is not None: - if self.neuron_config is not None and not self.neuron_config.overrides_torch_dtype: + if ( + self.neuron_config is not None + and not self.neuron_config.overrides_torch_dtype + ): # Update neuron_config's torch_dtype if not overriden by the user. 
self.neuron_config.torch_dtype = hf_dtype if isinstance(self.neuron_config.torch_dtype, str): - self.neuron_config.torch_dtype = to_torch_dtype(self.neuron_config.torch_dtype) + self.neuron_config.torch_dtype = to_torch_dtype( + self.neuron_config.torch_dtype + ) config_dict.pop("dtype", None) config_dict.pop("torch_dtype", None) @@ -92,14 +108,23 @@ def to_pretrained_config(config: InferenceConfig): del config_dict["neuron_config"] # handle nested configs for multi-modal models - config_dict = _convert_modality_config_to_pretrained_config(config_dict, "text_config") - config_dict = _convert_modality_config_to_pretrained_config(config_dict, "vision_config") + config_dict = _convert_modality_config_to_pretrained_config( + config_dict, "text_config" + ) + config_dict = _convert_modality_config_to_pretrained_config( + config_dict, "vision_config" + ) return PretrainedConfig(**config_dict) class HuggingFaceGenerationAdapter(PreTrainedModel, GenerationMixin): - def __init__(self, model: NeuronApplicationBase, input_start_offsets=None, capture_draft_logits=False): + def __init__( + self, + model: NeuronApplicationBase, + input_start_offsets=None, + capture_draft_logits=False, + ): hf_config = to_pretrained_config(model.config) super().__init__(hf_config) if self.generation_config is not None: @@ -116,7 +141,9 @@ def __init__(self, model: NeuronApplicationBase, input_start_offsets=None, captu self.neuron_model = model self.neuron_config = model.config.neuron_config - self.on_device_sampling = self.neuron_config.on_device_sampling_config is not None + self.on_device_sampling = ( + self.neuron_config.on_device_sampling_config is not None + ) self.padding_side = self.neuron_config.padding_side self.sampler = None self.prev_kv_cache_populated = False @@ -172,8 +199,10 @@ def _sample( ) # convert adapter_ids from strings to indices if self.neuron_config.lora_config: - model_kwargs["adapter_ids"] = self.neuron_model.lora_model_manager.convert_adapter_ids_to_indices( - 
model_kwargs.get("adapter_ids"), unfinished_sequences.numel() + model_kwargs["adapter_ids"] = ( + self.neuron_model.lora_model_manager.convert_adapter_ids_to_indices( + model_kwargs.get("adapter_ids"), unfinished_sequences.numel() + ) ) this_peer_finished = False # auto-regressive generation @@ -230,7 +259,9 @@ def _sample( is_encoder_decoder=self.config.is_encoder_decoder, ) - unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, None) + unfinished_sequences = unfinished_sequences & ~stopping_criteria( + input_ids, None + ) this_peer_finished = unfinished_sequences.max() == 0 if return_dict_in_generate: @@ -263,13 +294,18 @@ def prepare_inputs_for_generation( scatter_index = kwargs.get("scatter_index", None) position_ids = kwargs.get("position_ids", None) input_capture_hook = kwargs.get("input_capture_hook", None) + tensor_capture_hook = kwargs.get("tensor_capture_hook", None) if attention_mask is not None and position_ids is None: # create position_ids on the fly for batch generation position_ids = attention_mask.long().cumsum(-1) - 1 if self.input_start_offsets: if len(self.input_start_offsets) > 1: - position_ids += torch.tensor(self.input_start_offsets, dtype=position_ids.dtype, device=position_ids.device)[:, None] + position_ids += torch.tensor( + self.input_start_offsets, + dtype=position_ids.dtype, + device=position_ids.device, + )[:, None] else: position_ids += self.input_start_offsets[0] for i, offset in enumerate(self.input_start_offsets): @@ -292,22 +328,33 @@ def prepare_inputs_for_generation( "past_key_values": past_key_values, "use_cache": kwargs.get("use_cache", False), "attention_mask": attention_mask, - "medusa_args": (accepted_indices, current_length, medusa_mask, scatter_index), + "medusa_args": ( + accepted_indices, + current_length, + medusa_mask, + scatter_index, + ), "sampling_params": sampling_params, "input_capture_hook": input_capture_hook, - "tensor_capture_hook": tensor_capture_hook, - "adapter_ids": adapter_ids + 
"adapter_ids": adapter_ids, } ) + # Conditionally pass tensor_capture_hook only to models that accept it (e.g., multimodal). + # Text-only models like GLM-4.5 MoE do not declare this parameter. + if tensor_capture_hook is not None: + fwd_params = inspect.signature(self.neuron_model.forward).parameters + if "tensor_capture_hook" in fwd_params: + model_inputs["tensor_capture_hook"] = tensor_capture_hook + tf_args = [] if self.neuron_config.tensor_replacement_config: - if hasattr(self, 'generation_step'): + if hasattr(self, "generation_step"): self.generation_step += 1 else: self.generation_step = 1 reg = TensorReplacementRegister.get_instance() - tf , masks = reg.step_args(self.generation_step) + tf, masks = reg.step_args(self.generation_step) tf_args = tf + masks # Only add tf_args if not empty @@ -322,7 +369,12 @@ def prepare_inputs_for_generation( return model_inputs def prepare_medusa_inputs_for_generation( - self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs + self, + input_ids, + past_key_values=None, + attention_mask=None, + inputs_embeds=None, + **kwargs, ): if self.neuron_model.kv_cache_populated: input_ids = input_ids[:, -self.neuron_config.medusa_speculation_length :] @@ -380,13 +432,19 @@ def _update_model_kwargs_for_generation( if is_for_token_generation: if self.padding_side == "left": attention_mask = torch.cat( - [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], + [ + attention_mask, + attention_mask.new_ones((attention_mask.shape[0], 1)), + ], dim=-1, ) attention_mask = attention_mask[:, 1:] else: attention_mask = torch.cat( - [attention_mask.new_ones((attention_mask.shape[0], 1)), attention_mask], + [ + attention_mask.new_ones((attention_mask.shape[0], 1)), + attention_mask, + ], dim=-1, ) model_kwargs["attention_mask"] = attention_mask @@ -407,7 +465,9 @@ def _update_model_kwargs_for_fused_generation( attention_mask = torch.cat( [ attention_mask, - 
attention_mask.new_ones((attention_mask.shape[0], accepted_len)), + attention_mask.new_ones( + (attention_mask.shape[0], accepted_len) + ), ], dim=-1, ) @@ -415,7 +475,9 @@ def _update_model_kwargs_for_fused_generation( else: attention_mask = torch.cat( [ - attention_mask.new_ones((attention_mask.shape[0], accepted_len)), + attention_mask.new_ones( + (attention_mask.shape[0], accepted_len) + ), attention_mask, ], dim=-1, @@ -500,7 +562,9 @@ def _fused_assisted_decoding( ) if "sampling_params" not in fused_assistant_kwargs: fused_assistant_kwargs["sampling_params"] = sampling_params - model_inputs = self.prepare_inputs_for_generation(input_ids, **fused_assistant_kwargs) + model_inputs = self.prepare_inputs_for_generation( + input_ids, **fused_assistant_kwargs + ) # Other auxiliary variables bs = input_ids.shape[0] @@ -530,7 +594,11 @@ def _fused_assisted_decoding( if return_dict_in_generate: if output_scores: # TODO: Process raw logits with logits processor when needed - scores += (outputs.fused_outputs[-2][:, -1, :],) if self.capture_draft_logits else (outputs.fused_outputs[-1][:, -1, :],) + scores += ( + (outputs.fused_outputs[-2][:, -1, :],) + if self.capture_draft_logits + else (outputs.fused_outputs[-1][:, -1, :],) + ) if output_logits: raw_logits += (outputs.fused_outputs[-1],) @@ -551,7 +619,7 @@ def _fused_assisted_decoding( n_matches = torch.ops.aten.Int(n_matches) incremental_len = n_matches if self.capture_draft_logits: - print(f'n matches: {n_matches}') + print(f"n matches: {n_matches}") # 3. 
retrieve accepted tokens using n_matches if len(accepted_tokens_with_padding.shape) == 1: @@ -564,7 +632,9 @@ def _fused_assisted_decoding( for eos_token_id in eos_token_id_list: if eos_token_id in accepted_tokens: # get column indices - eos_pos_cur = (accepted_tokens == eos_token_id).nonzero(as_tuple=True)[1] + eos_pos_cur = (accepted_tokens == eos_token_id).nonzero( + as_tuple=True + )[1] eos_pos = min(torch.min(eos_pos_cur), eos_pos) if eos_pos < accepted_tokens.shape[1]: end_for_all = True @@ -578,7 +648,9 @@ def _fused_assisted_decoding( if self.capture_draft_logits: scores += tuple(outputs.fused_outputs[-2][:, :, :]) else: - scores += tuple(outputs.fused_outputs[-1][:, i, :] for i in range(n_matches)) + scores += tuple( + outputs.fused_outputs[-1][:, i, :] for i in range(n_matches) + ) if output_logits: raw_logits += (outputs.fused_outputs[-1],) @@ -620,7 +692,9 @@ def _standard_assisted_decoding( if hasattr(assistant_model, "num_assistant_tokens"): num_assistant_tokens = assistant_model.num_assistant_tokens else: - num_assistant_tokens = assistant_model.generation_config.num_assistant_tokens + num_assistant_tokens = ( + assistant_model.generation_config.num_assistant_tokens + ) # Init values if eos_token_id is not None and pad_token_id is None: @@ -630,7 +704,9 @@ def _standard_assisted_decoding( if isinstance(eos_token_id, int): eos_token_id = [eos_token_id] eos_token_id_tensor = ( - torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None + torch.tensor(eos_token_id).to(input_ids.device) + if eos_token_id is not None + else None ) # Prepare assistant model's keys of inputs @@ -663,11 +739,15 @@ def _standard_assisted_decoding( candidate_input_ids, **assistant_kwargs, ) - is_for_token_generation = assistant_model.neuron_model.kv_cache_populated + is_for_token_generation = ( + assistant_model.neuron_model.kv_cache_populated + ) # 1.2 Use the assistant model to obtain the next candidate logits assistant_model_outputs = 
assistant_model(**assistant_inputs) - assistant_new_token = assistant_model_outputs.logits[:, 0, :].argmax(dim=-1) + assistant_new_token = assistant_model_outputs.logits[:, 0, :].argmax( + dim=-1 + ) # 1.3 Update inputs and args for next iteration candidate_input_ids = torch.cat( @@ -686,7 +766,9 @@ def _standard_assisted_decoding( eos_token_id_tensor.shape[0], 1 ) last_assistant_token_is_eos = ( - ~last_assistant_token_is_eos.ne(eos_token_id_tensor.unsqueeze(1)) + ~last_assistant_token_is_eos.ne( + eos_token_id_tensor.unsqueeze(1) + ) .prod(dim=0) .bool() ) @@ -700,14 +782,19 @@ def _standard_assisted_decoding( candidate_length = candidate_input_ids.shape[1] - input_ids.shape[1] # 2.1 Prepare the input arguments - input_ids = torch.cat((new_token, candidate_input_ids[:, -candidate_length:-1]), dim=-1) + input_ids = torch.cat( + (new_token, candidate_input_ids[:, -candidate_length:-1]), dim=-1 + ) attention_mask = model_inputs["attention_mask"] pos = curr_pos + 1 position_ids = torch.arange(pos, pos + spec_len).expand(1, spec_len) # Pad the input_ids if needed if input_ids.shape[-1] < spec_len: input_ids = torch.cat( - (input_ids, torch.full((1, spec_len - input_ids.shape[-1]), pad_token_id)), + ( + input_ids, + torch.full((1, spec_len - input_ids.shape[-1]), pad_token_id), + ), dim=-1, ) @@ -725,7 +812,9 @@ def _standard_assisted_decoding( # 3. Compare the argmax from the original model logits with the assistant forecasted tokens. We can keep # the assistant forecasted tokens until the first mismatch, or until the max length is reached. candidate_new_tokens = candidate_input_ids[:, -candidate_length:-1] - n_matches = ((~(candidate_new_tokens == selected_tokens)).cumsum(dim=-1) < 1).sum() + n_matches = ( + (~(candidate_new_tokens == selected_tokens)).cumsum(dim=-1) < 1 + ).sum() # 4. 
Ensure we don't generate beyond max_len or an EOS token if last_assistant_token_is_eos and n_matches == candidate_length: @@ -759,7 +848,9 @@ def _standard_assisted_decoding( ) curr_pos = curr_pos + n_matches + 1 - assistant_kwargs["attention_mask"] = copy.deepcopy(model_inputs["attention_mask"]) + assistant_kwargs["attention_mask"] = copy.deepcopy( + model_inputs["attention_mask"] + ) # 7. Update with the generated token length and check for stopping condition. cur_len = cur_len + n_matches + 1 @@ -819,7 +910,9 @@ def _medusa_assisted_decoding( medusa_buffers["tree_indices"], medusa_buffers["retrieve_indices"], ) - position_ids = medusa_buffers["medusa_position_ids"] + input_ids.nonzero().shape[0] + position_ids = ( + medusa_buffers["medusa_position_ids"] + input_ids.nonzero().shape[0] + ) medusa_kwargs = self._prepare_medusa_kwargs( position_ids, cur_len, medusa_buffers, select_indices, medusa_kwargs @@ -835,21 +928,25 @@ def _medusa_assisted_decoding( tree_logits, tree_medusa_logits = self._extract_logits(outputs) logits = tree_logits[0, 0, medusa_buffers["retrieve_indices"]] - medusa_logits = tree_medusa_logits[:, 0, 0, medusa_buffers["retrieve_indices"]] + medusa_logits = tree_medusa_logits[ + :, 0, 0, medusa_buffers["retrieve_indices"] + ] best_candidate, accept_length = evaluate_posterior(logits, candidates) cur_len = torch.tensor([input_ids.nonzero().size(0) - 1], dtype=torch.int32) - input_ids, logits, medusa_logits, new_token, select_indices = update_inference_inputs( - input_ids[:, : (int(cur_len[0] + 1))], - candidates, - best_candidate, - accept_length, - medusa_buffers["retrieve_indices"], - outputs, - logits, - medusa_logits, - new_token, + input_ids, logits, medusa_logits, new_token, select_indices = ( + update_inference_inputs( + input_ids[:, : (int(cur_len[0] + 1))], + candidates, + best_candidate, + accept_length, + medusa_buffers["retrieve_indices"], + outputs, + logits, + medusa_logits, + new_token, + ) ) medusa_kwargs["attention_mask"] = 
self._update_attention_mask( @@ -860,7 +957,10 @@ def _medusa_assisted_decoding( cur_length = accept_length_tree + cur_length accept_lengths_tree.append(accept_length_tree) final_accept_length += accept_length + 1 - if eos_token_id in new_token or final_accept_length > self.neuron_config.max_new_tokens: + if ( + eos_token_id in new_token + or final_accept_length > self.neuron_config.max_new_tokens + ): break return input_ids @@ -875,7 +975,9 @@ def _prepare_medusa_kwargs( ) for index, value in enumerate(select_indices): medusa_kwargs["accepted_indices"][index] = value - medusa_kwargs["accepted_indices"] = medusa_kwargs["accepted_indices"].unsqueeze(0) + medusa_kwargs["accepted_indices"] = medusa_kwargs["accepted_indices"].unsqueeze( + 0 + ) medusa_kwargs["current_length"] = torch.arange( cur_len[0].item(), cur_len[0].item() + self.neuron_config.num_medusa_heads + 1, @@ -890,14 +992,20 @@ def _prepare_medusa_kwargs( ).unsqueeze(0) return medusa_kwargs - def _update_attention_mask(self, model_inputs, accept_length, cur_len, medusa_kwargs): + def _update_attention_mask( + self, model_inputs, accept_length, cur_len, medusa_kwargs + ): accept_length_concat_tensor = torch.zeros( 1, accept_length + 1, dtype=model_inputs["attention_mask"].dtype ) - attn_mask = torch.cat([model_inputs["attention_mask"], accept_length_concat_tensor], dim=-1) + attn_mask = torch.cat( + [model_inputs["attention_mask"], accept_length_concat_tensor], dim=-1 + ) medusa_kwargs["attention_mask"] = attn_mask.index_fill( - 1, torch.arange(int(cur_len[0]) + 1, int(cur_len[0]) + 1 + accept_length + 1), 1 + 1, + torch.arange(int(cur_len[0]) + 1, int(cur_len[0]) + 1 + accept_length + 1), + 1, ) return medusa_kwargs["attention_mask"]