diff --git a/contrib/models/granite-4.0-h-small/README.md b/contrib/models/granite-4.0-h-small/README.md
new file mode 100644
index 00000000..fb4aebc7
--- /dev/null
+++ b/contrib/models/granite-4.0-h-small/README.md
@@ -0,0 +1,292 @@
+# Contrib Model: Granite 4.0-H-Small
+
+NeuronX Distributed Inference implementation of IBM's [Granite 4.0-H-Small](https://huggingface.co/ibm-granite/granite-4.0-h-small) (GraniteMoeHybridForCausalLM).
+
+## Model Information
+
+- **HuggingFace ID:** `ibm-granite/granite-4.0-h-small`
+- **Model Type:** Hybrid Mamba2/Attention with MoE
+- **Parameters:** ~4B total (active ~800M per token with top-10 routing)
+- **License:** Apache 2.0
+
+## Architecture Details
+
+| Property | Value |
+|----------|-------|
+| Hidden size | 4096 |
+| Layers | 40 (36 Mamba2 + 4 Attention) |
+| Attention layer indices | 5, 15, 25, 35 |
+| Attention heads | 32 (8 KV heads) |
+| Experts | 72 per layer, top-10 routing |
+| Shared experts | 1 per layer |
+| Mamba heads | 128 (head_dim=64) |
+| SSM state size | 128 |
+| Conv kernel | 4 |
+| Position embeddings | None ("nope") |
+| Vocab size | 131072 |
+| tie_word_embeddings | True |
+| embedding_multiplier | 12 |
+| logits_scaling | 16 |
+| residual_multiplier | 0.22 |
+
+## Implementation Notes
+
+### Mamba State Persistence
+
+The key challenge for this hybrid architecture is persisting Mamba2 recurrent state (conv_state and ssm_state) across XLA graph executions during autoregressive decode. We solve this using the same `input_output_aliases` mechanism that NxDI uses for KV cache:
+
+1. `NeuronGraniteModel` maintains a `nn.ParameterList` (`mamba_states`) containing conv_state and ssm_state buffers for each of the 36 Mamba layers (72 parameters total)
+2. `GraniteDecoderModelInstance` (extends `DecoderModelInstance`) adds these parameters to `input_output_aliases` after the standard KV cache entries
+3. `NeuronMamba2Layer.forward()` accepts and returns state as explicit tensor arguments
+4. The output list is: `[logits, K0, V0, ..., conv_state_0, ssm_state_0, ...]`
+
+This follows the MLlama vision_key_values pattern.
+
+### Manual Depthwise Conv1d
+
+SDK 2.28 has a compiler bug (TEN404) where the auto-inserted NKI Conv1d kernel crashes on `seq_len=1` (decode path). We work around this by implementing depthwise convolution manually using weight parameters and a loop over kernel positions.
+
+### Gated RMSNorm Ordering
+
+Granite applies the gate BEFORE normalization (`norm_before_gate=False` in Mamba2 terminology), unlike Falcon-H1 which applies it after. The `GraniteRMSNormGated` class matches HF exactly: `silu(gate) * x -> RMSNorm -> weight`.
+
+### Parallel Scan (Prefill)
+
+Prefill uses a full-sequence parallel scan via cumulative sum in log-space (L x L weight matrix). This is mathematically equivalent to HF's chunk-based SSD (chunk_size=256) but produces slightly different floating-point results due to BF16 precision and different accumulation order. Average Pearson=0.9968, average Cosine=0.9987 across 10 diverse prompts.
+
+### NKI Selective Scan Kernel (Optional)
+
+The model includes an optional NKI (Neuron Kernel Interface) kernel that replaces the O(L^2) quadratic parallel scan with an O(L) hardware-accelerated scan using `nisa.tensor_tensor_scan`. This is controlled by the `USE_NKI_SCAN` flag in `modeling_granite.py` (default: `False`).
+
+**Performance note:** At `max_context_length=128`, the quadratic scan is ~30% faster than the NKI kernel because the compiler efficiently vectorizes the 128x128 weight matrix operations, while the NKI kernel incurs overhead from 8,192 individual `tensor_tensor_scan` invocations (one per head_dim x ssm_state_size combination). At larger context lengths, the NKI kernel has a significant **compilation** advantage: at `max_context_length=256`, the NKI kernel compiles successfully while the quadratic scan causes compiler OOM (see Context Length Scaling section). The NKI kernel is expected to outperform the quadratic scan at runtime for L >= 256 on instances with sufficient HBM. See the Performance Benchmarks section for details.
+
+**How it works:**
+
+`tensor_tensor_scan` computes: `result[i] = op0(data0[i], result[i-1]) op1 data1[i]`
+
+For the Mamba2 SSM recurrence `state[t] = exp(dA[t]) * state[t-1] + dBx[t]`:
+- `data0 = exp(dA)`, `op0 = multiply`
+- `data1 = dBx`, `op1 = add`
+
+The kernel processes all 32 heads (TP-sharded from 128) in the partition dimension, with seq_len in the free dimension. An outer loop iterates over `head_dim(64) x ssm_state_size(128) = 8,192` scan invocations. Inputs are pre-transposed to `(num_heads, seq_len)` layout for efficient SBUF tiling.
+
+**Requirements:**
+- Set `NEURON_PLATFORM_TARGET_OVERRIDE` environment variable to match your target platform (e.g., `trn2` for Trainium2) during compilation
+- NKI Beta 2 / SDK 2.28+ (`import nki`, `import nki.language as nl`, `import nki.isa as nisa`)
+- Neuron hardware (Trainium or Inferentia with NKI support)
+
+**To enable:** Set `USE_NKI_SCAN = True` in `modeling_granite.py`. Recommended for `max_context_length >= 256` on instances with sufficient HBM, or when the quadratic scan fails to compile.
+
+## Validation Results
+
+**Validated:** 2026-03-10
+**Configuration:** TP=4, batch_size=1, seq_len=2048, max_context_length=128, bfloat16
+**Instance:** trn2.3xlarge (LNC=2, SDK 2.28)
+
+### Prefill Accuracy (vs HF BF16 CPU, 10 prompts)
+
+| Prompt | Pearson | Cosine | MaxDiff | Greedy Match |
+|--------|---------|--------|---------|-------------|
+| "Artificial Intelligence is" | 0.9997 | 0.9999 | 1.25 | YES (`a`) |
+| "The capital of France is" | 0.9967 | 0.9984 | 1.88 | YES (`Paris`) |
+| "Water boils at a temperature of" | 0.9984 | 0.9998 | 2.00 | YES (` `) |
+| "Explain the concept of artificial intelligence..." | 0.9959 | 0.9977 | 3.00 | YES (`Your`) |
+| "Write a short Python function..." | 0.9945 | 0.9973 | 2.12 | YES (`The`) |
+| "Hello, how are you today?" | 0.9946 | 0.9978 | 1.94 | YES (`I`) |
+| "def fibonacci(n):" | 0.9969 | 0.9985 | 1.28 | YES (`"`) |
+| "In the field of machine learning..." | 0.9974 | 0.9981 | 1.62 | YES (`estimate`) |
+| "The" | 0.9980 | 0.9997 | 1.03 | YES (` `) |
+| "Hi" | 0.9960 | 0.9993 | 1.00 | YES (`,`) |
+
+| Summary Metric | Value |
+|--------|-------|
+| **Greedy token match rate** | **10/10 = 100%** |
+| **Average Pearson** | **0.9968** |
+| **Average Cosine** | **0.9987** |
+| Max absolute diff (worst case) | 3.00 |
+
+### Decode Quality (greedy, 30 tokens)
+
+| Prompt | HF Output | Neuron Output | First Token Match |
+|--------|-----------|---------------|-------------------|
+| "The capital of France is" | " Paris." | " Paris." | YES (100% token match) |
+| "Explain the concept..." | "Your response should contain at least 3 sentences. Include keywords: machine learning, algorithms, " | "Your response should contain at least 3 sentences. The response must contain at least 2 placeholder" | YES (33% token match) |
+| "def fibonacci(n):" | (code output) | (code output) | YES |
+
+Both models produce coherent, factually correct text. Token-level divergence during decode is expected: our Mamba2 prefill uses full-sequence parallel scan while HF uses chunk-based SSD (chunk_size=256). These are mathematically equivalent but accumulate different BF16 rounding, causing early divergence that cascades through autoregressive generation. Deterministic answers (e.g., "Paris.") match exactly.
+
+### Compilation
+| Metric | Value |
+|--------|-------|
+| Compile time | ~16-20 min (trn2.3xlarge) |
+| Compiler flags | `-O1 --auto-cast=none --enable-mixed-precision-accumulation` |
+
+**Note:** When using the NKI kernel (`USE_NKI_SCAN=True`), set `NEURON_PLATFORM_TARGET_OVERRIDE` to match your target platform (e.g., `trn2`) before compilation.
+
+### NKI Kernel Accuracy (V16-NKI vs V15 Quadratic)
+
+The NKI selective scan kernel produces nearly identical results to the quadratic scan:
+
+| Metric | V15 (Quadratic) | V16 (NKI) |
+|--------|-----------------|-----------|
+| **Avg Pearson** | 0.9880 | 0.9872 |
+| **Avg Cosine** | 0.9800 | 0.9782 |
+| **Greedy match** | 100% | 100% |
+| **Generation quality** | Coherent | Matches V15 |
+
+The small difference between V15 and V16-NKI is due to different floating-point accumulation order (parallel quadratic vs sequential scan). Both produce correct text generation.
+
+## Performance Benchmarks
+
+**Benchmarked:** 2026-03-10
+**Configuration:** TP=4, batch_size=1, seq_len=2048, max_context_length=128, bfloat16
+**Instance:** trn2.3xlarge (LNC=2, SDK 2.28)
+
+### Latency Comparison: Quadratic Scan vs NKI Scan
+
+| Metric | Quadratic (default) | NKI Scan | Delta |
+|--------|-------------------|----------|-------|
+| **Prefill latency** | **717 ms** | 935 ms | +30% slower |
+| **Decode per-token** | 50.3 ms | 50.3 ms | identical |
+| **100-token throughput** | **17.6 tok/s** | 16.9 tok/s | -4% |
+| **100-token total** | 5694 ms | 5915 ms | +3.9% |
+
+**Analysis:**
+- Prefill latency is constant regardless of prompt length (1-23 tokens) because NxDI pads all inputs to `max_context_length=128`
+- The quadratic scan wins at L=128 because the compiler efficiently vectorizes the 128x128 weight matrix, while the NKI kernel has overhead from 8,192 individual `tensor_tensor_scan` calls
+- Decode latency is identical because the NKI kernel only affects prefill (decode uses O(1) recurrence)
+- The NKI kernel is required for L >= 256 where the quadratic scan fails to compile (compiler OOM)
+- **Recommendation:** Use the default quadratic scan for `max_context_length <= 128`. Enable NKI scan for larger contexts where the quadratic scan fails to compile.
+
+### Context Length Scaling
+
+The model's compilation and runtime behavior varies significantly with context length due to the large MoE architecture (72 experts × 40 layers). Testing was performed on trn2.3xlarge (96 GB HBM total, 24 GB per logical core with LNC=2).
+
+| max_context_length | Quadratic Compile | NKI Compile | Runtime Load | Notes |
+|--------------------|-------------------|-------------|-------------|-------|
+| **128** | OK (~16 min) | OK (~20 min) | OK (both) | Fully benchmarked, quadratic faster |
+| **256** | **FAILED** (compiler OOM) | **OK** (~19 min) | **FAILED** (HBM OOM) | NKI compiles where quadratic cannot |
+| **512** | FAILED (compiler OOM) | FAILED (compiler OOM) | N/A | Graph too large for 124 GB host RAM |
+| **1024** | FAILED (compiler OOM) | FAILED (compiler OOM) | N/A | Graph too large for 124 GB host RAM |
+
+**Key findings:**
+
+1. **NKI kernel enables longer compilation:** At L=256, the NKI kernel produces a compiler-friendlier HLO graph (avoids the 256×256 quadratic weight matrix expansion), allowing successful compilation where the quadratic approach causes `neuronx-cc` to OOM (exit code 70, >74 GB host RAM).
+
+2. **HBM is the runtime bottleneck:** Even when the NKI kernel compiles at L=256, the model cannot be loaded on trn2.3xlarge because the compiled graph requires more HBM than the 24 GB available per logical core. The error is a 1 GB transpose buffer allocation failure on HBM.
+
+3. **MoE dominates memory:** The 72 experts × 40 layers = 2,880 expert weight sets are the primary memory consumer, not the Mamba SSM states or KV caches.
+
+4. **Larger instances unlock longer contexts:** trn2.48xlarge (32 devices, up to 3 TB total HBM) should support `max_context_length=256+` by distributing experts across more cores. The NKI kernel's compilation advantage becomes essential at these scales.
+
+### Latency Breakdown (Quadratic, default)
+
+| Phase | Latency | Notes |
+|-------|---------|-------|
+| Prefill (any prompt up to 128 tokens) | 717 ms | Constant due to padding to max_context_length |
+| Decode (per token, steady state) | ~50 ms | Measured from 100-token generation |
+| Model load (from compiled) | ~71 s | One-time cost, includes weight sharding |
+| Compilation | ~16-20 min | One-time cost |
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+from neuronx_distributed_inference.models.config import MoENeuronConfig
+from neuronx_distributed_inference.utils.hf_adapter import (
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from modeling_granite import NeuronGraniteForCausalLM, GraniteInferenceConfig
+
+MODEL_PATH = "/path/to/granite-4.0-h-small/"
+COMPILED_PATH = "/path/to/compiled_model/"
+
+# Configure
+neuron_config = MoENeuronConfig(
+    tp_degree=4,
+    batch_size=1,
+    max_context_length=128,
+    seq_len=2048,
+    on_device_sampling_config=None,
+    enable_bucketing=False,
+    flash_decoding_enabled=False,
+    torch_dtype="bfloat16",
+)
+
+config = GraniteInferenceConfig(
+    neuron_config,
+    load_config=load_pretrained_config(MODEL_PATH),
+)
+
+# Compile (first time only, ~16 min)
+model = NeuronGraniteForCausalLM(MODEL_PATH, config)
+model.compile(COMPILED_PATH)
+
+# Load compiled model
+model = NeuronGraniteForCausalLM(MODEL_PATH, config)
+model.load(COMPILED_PATH)
+
+# Generate
+tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+inputs = tokenizer("Artificial Intelligence is", return_tensors="pt")
+
+gen_model = HuggingFaceGenerationAdapter(model)
+outputs = gen_model.generate(
+    inputs.input_ids,
+    attention_mask=torch.ones_like(inputs.input_ids),
+    max_new_tokens=50,
+    do_sample=False,
+)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+## Compatibility Matrix
+
+| Instance Type | SDK 2.28 | SDK 2.27 |
+|--------------|----------|----------|
+| trn2.3xlarge (TP=4, LNC=2) | Validated | Not tested |
+| trn2.48xlarge (TP=4+) | Should work | Not tested |
+| trn1.32xlarge | Not tested | Not tested |
+
+**Note:** This model requires MoE support (`MoENeuronConfig`) and Mamba state persistence. The TEN404 conv1d workaround is specific to SDK 2.28; future SDK versions may not need it.
+
+## Testing
+
+```bash
+# Integration tests (requires Neuron hardware + model weights)
+cd contrib/models/granite-4.0-h-small/
+pytest test/integration/test_model.py -v
+
+# Or run directly
+python test/integration/test_model.py
+
+# NKI kernel unit test (requires Neuron hardware)
+export NEURON_PLATFORM_TARGET_OVERRIDE=trn2  # set to your target platform
+python test/unit/test_nki_selective_scan.py
+```
+
+## Known Limitations
+
+1. **max_context_length=128 on trn2.3xlarge** — compiler OOM prevents L=256+ with quadratic scan; NKI compiles at L=256 but HBM is insufficient for runtime. Larger instances (trn2.48xlarge) are needed for longer contexts.
+2. **No on-device sampling tested** — current validation uses raw logits (`on_device_sampling_config=None`). Enabling on-device sampling for production use needs testing.
+3. **Batch size 1 only** — batch_size > 1 has not been validated.
+4. **NKI scan slower at short contexts** — the optional `USE_NKI_SCAN` kernel is ~30% slower than the default quadratic scan at `max_context_length=128` due to per-invocation overhead. It is disabled by default. Enable for `max_context_length >= 256` where it is required for compilation.
+5. **Conv1d workaround** — manual depthwise convolution avoids TEN404 but may be slower than native conv1d once the SDK bug is fixed.
+
+## Source Files
+
+| File | Description | Lines |
+|------|-------------|-------|
+| `src/modeling_granite.py` | Full model implementation with NKI selective scan kernel (config, Mamba layer, attention, MoE, NKI kernel, model wrapper, state dict conversion) | ~1600 |
+| `src/__init__.py` | Public exports | ~30 |
+| `test/integration/test_model.py` | Integration tests (compile, load, generate, coherence, throughput) | ~260 |
+| `test/unit/test_nki_selective_scan.py` | Standalone NKI selective scan kernel with CPU reference, quadratic reference, and validation tests | ~750 |
+
+## Example Checkpoints
+
+- **HuggingFace:** `ibm-granite/granite-4.0-h-small`
+
+## Maintainer
+
+**Last Updated:** 2026-03-10
diff --git a/contrib/models/granite-4.0-h-small/src/__init__.py b/contrib/models/granite-4.0-h-small/src/__init__.py
new file mode 100644
index 00000000..83eaf68a
--- /dev/null
+++ b/contrib/models/granite-4.0-h-small/src/__init__.py
@@ -0,0 +1,29 @@
+# Granite 4.0-H-Small NeuronX Port
+# Export main classes
+from .modeling_granite import (
+    GraniteInferenceConfig,
+    NeuronGraniteModel,
+    NeuronGraniteForCausalLM,
+    NeuronGraniteAttention,
+    NeuronGraniteDecoderLayer,
+    NeuronMamba2Layer,
+    GraniteRMSNormGated,
+    GraniteModelWrapper,
+    GraniteDecoderModelInstance,
+    ScaledEmbedding,
+    ScaledLMHead,
+)
+
+__all__ = [
+    "GraniteInferenceConfig",
+    "NeuronGraniteModel",
+    "NeuronGraniteForCausalLM",
+    "NeuronGraniteAttention",
+    "NeuronGraniteDecoderLayer",
+    "NeuronMamba2Layer",
+    "GraniteRMSNormGated",
+    "GraniteModelWrapper",
+    "GraniteDecoderModelInstance",
+    "ScaledEmbedding",
+    "ScaledLMHead",
+]
diff --git a/contrib/models/granite-4.0-h-small/src/modeling_granite.py b/contrib/models/granite-4.0-h-small/src/modeling_granite.py
new file mode 100644
index 00000000..e6590c6a
--- /dev/null
+++ b/contrib/models/granite-4.0-h-small/src/modeling_granite.py
@@ -0,0 +1,1604 @@
+# coding=utf-8
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Granite 4.0-H-Small (GraniteMoeHybridForCausalLM) model for NxD Inference.
+
+IBM Granite 4.0-H-Small is a hybrid Mamba2/Attention architecture with MoE:
+- 40 layers: 36 Mamba2 + 4 Attention (at indices 5, 15, 25, 35)
+- 72 experts with top-10 routing + shared expert per layer
+- hidden_size=4096, no position embeddings ("nope")
+- tie_word_embeddings=True, embedding_multiplier=12, logits_scaling=16
+
+Key implementation details:
+- Mamba state persistence via nn.ParameterList + input_output_aliases
+  (same mechanism as KV cache, following MLlama vision_key_values pattern)
+- Manual depthwise conv1d to avoid TEN404 NKI kernel bug on seq_len=1
+- Full-sequence parallel scan for prefill, O(1) recurrence for decode
+- GraniteRMSNormGated: gate applied BEFORE norm (norm_before_gate=False)
+- ScaledEmbedding/ScaledLMHead wrappers for Granite's multiplier/scaling
+"""
+
+import gc
+import logging
+import warnings
+from typing import List, Optional, Tuple, Dict, Any
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from neuronx_distributed_inference.models.model_base import (
+    NeuronBaseForCausalLM,
+    NeuronBaseModel,
+)
+from neuronx_distributed_inference.models.model_wrapper import (
+    ModelWrapper,
+    DecoderModelInstance,
+)
+from neuronx_distributed_inference.models.config import InferenceConfig, MoENeuronConfig
+from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm
+from neuronx_distributed_inference.modules.attention.attention_base import (
+    NeuronAttentionBase,
+)
+
+from neuronx_distributed.parallel_layers import parallel_state
+from neuronx_distributed.parallel_layers.layers import (
+    ColumnParallelLinear,
+    ParallelEmbedding,
+)
+from neuronx_distributed.modules.moe.shared_experts import SharedExperts
+from neuronx_distributed.modules.moe.model import MoE
+from neuronx_distributed.modules.moe.routing import RouterTopK
+from neuronx_distributed.modules.moe.expert_mlps import ExpertMLPs
+from neuronx_distributed.utils import cpu_mode
+
+logger = logging.getLogger(__name__)
+
+# ==============================================================================
+# NKI Selective Scan Kernel for Mamba2 prefill
+# ==============================================================================
+# When enabled, replaces the O(L²) quadratic parallel scan with O(L) hardware-
+# accelerated scan using nisa.tensor_tensor_scan on Trainium2.
+# Set USE_NKI_SCAN = False to fall back to the quadratic implementation.
+
+USE_NKI_SCAN = False
+
+try:
+    import nki
+    import nki.language as nl
+    import nki.isa as nisa
+
+    HAS_NKI = True
+except ImportError:
+    HAS_NKI = False
+    if USE_NKI_SCAN:
+        logger.warning("NKI not available, falling back to quadratic scan")
+        USE_NKI_SCAN = False
+
+if HAS_NKI and USE_NKI_SCAN:
+    P_MAX = 128
+
+    @nki.jit
+    def nki_scan_kernel(
+        dA_exp_t,  # (NH, SL) — pre-transposed decay coefficients
+        dBx_t,  # (NH * HD * SS, SL) — flattened+transposed
+        C_t,  # (NH * SS, SL) — flattened+transposed
+        Dx_t,  # (NH * HD, SL) — pre-computed D*x, flattened+transposed
+        x_t,  # (NH * HD, SL) — flattened+transposed (unused, for shape)
+        hd_range,  # (HD,) — dummy tensor for head_dim
+        ss_range,  # (SS,) — dummy tensor for ssm_state_size
+    ):
+        """NKI O(L) selective scan using tensor_tensor_scan."""
+        NH = dA_exp_t.shape[0]
+        SL = dA_exp_t.shape[1]
+        HD = hd_range.shape[0]
+        SS = ss_range.shape[0]
+
+        y_out = nl.ndarray((NH * HD, SL), dtype=nl.float32, buffer=nl.shared_hbm)
+        final_state_out = nl.ndarray(
+            (NH * HD * SS, 1), dtype=nl.float32, buffer=nl.shared_hbm
+        )
+
+        dA_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+        nisa.memset(dst=dA_sb, value=0.0)
+        nisa.dma_copy(dst=dA_sb[0:NH, 0:SL], src=dA_exp_t[0:NH, 0:SL])
+
+        for d in nl.affine_range(HD):
+            y_acc_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+            nisa.memset(dst=y_acc_sb, value=0.0)
+            Dx_row = d * NH
+            nisa.dma_copy(
+                dst=y_acc_sb[0:NH, 0:SL],
+                src=Dx_t[Dx_row : Dx_row + NH, 0:SL],
+            )
+
+            for s in nl.affine_range(SS):
+                dBx_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.memset(dst=dBx_sb, value=0.0)
+                dBx_row = (d * SS + s) * NH
+                nisa.dma_copy(
+                    dst=dBx_sb[0:NH, 0:SL],
+                    src=dBx_t[dBx_row : dBx_row + NH, 0:SL],
+                )
+
+                init_sb = nl.ndarray((P_MAX, 1), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.memset(dst=init_sb, value=0.0)
+
+                state_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.tensor_tensor_scan(
+                    dst=state_sb[0:NH, 0:SL],
+                    data0=dA_sb[0:NH, 0:SL],
+                    data1=dBx_sb[0:NH, 0:SL],
+                    initial=init_sb[0:NH, 0:1],
+                    op0=nl.multiply,
+                    op1=nl.add,
+                )
+
+                final_sb = nl.ndarray((P_MAX, 1), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.tensor_copy(
+                    dst=final_sb[0:NH, 0:1],
+                    src=state_sb[0:NH, SL - 1 : SL],
+                )
+                fs_row = (d * SS + s) * NH
+                nisa.dma_copy(
+                    dst=final_state_out[fs_row : fs_row + NH, 0:1],
+                    src=final_sb[0:NH, 0:1],
+                )
+
+                C_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.memset(dst=C_sb, value=0.0)
+                C_row = s * NH
+                nisa.dma_copy(
+                    dst=C_sb[0:NH, 0:SL],
+                    src=C_t[C_row : C_row + NH, 0:SL],
+                )
+
+                Cs_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.tensor_tensor(
+                    dst=Cs_sb[0:NH, 0:SL],
+                    data1=C_sb[0:NH, 0:SL],
+                    data2=state_sb[0:NH, 0:SL],
+                    op=nl.multiply,
+                )
+                nisa.tensor_tensor(
+                    dst=y_acc_sb[0:NH, 0:SL],
+                    data1=y_acc_sb[0:NH, 0:SL],
+                    data2=Cs_sb[0:NH, 0:SL],
+                    op=nl.add,
+                )
+
+            y_row = d * NH
+            nisa.dma_copy(
+                dst=y_out[y_row : y_row + NH, 0:SL],
+                src=y_acc_sb[0:NH, 0:SL],
+            )
+
+        return y_out, final_state_out
+
+
+def _nki_selective_scan(
+    hidden_states_ssm, dt_processed, A, B, C, D, num_heads, head_dim, ssm_state_size
+):
+    """
+    NKI-accelerated selective scan for Mamba2 prefill.
+
+    Replaces the O(L²) quadratic scan with O(L) hardware scan.
+    Handles data layout transformation (PyTorch → NKI partition-first).
+
+    Args:
+        hidden_states_ssm: (batch, seq_len, num_heads, head_dim) float32
+        dt_processed: (batch, seq_len, num_heads) float32
+        A: (num_heads,) float32 — negative
+        B: (batch, seq_len, num_heads, ssm_state_size) float32
+        C: (batch, seq_len, num_heads, ssm_state_size) float32
+        D: (num_heads,) float32
+        num_heads, head_dim, ssm_state_size: ints
+
+    Returns:
+        y: (batch, seq_len, num_heads, head_dim)
+        final_state: (batch, num_heads, head_dim, ssm_state_size)
+    """
+    batch, seq_len = hidden_states_ssm.shape[:2]
+
+    # Pre-compute on PyTorch side (traced as XLA ops)
+    dA_exp = torch.exp(dt_processed * A.view(1, 1, -1))  # (B, L, H)
+    dB = dt_processed.unsqueeze(-1) * B  # (B, L, H, S)
+    dBx = dB.unsqueeze(3) * hidden_states_ssm.unsqueeze(-1)  # (B, L, H, D, S)
+
+    # Transpose to NKI partition-first layout (squeeze batch=1)
+    dA_exp_t = dA_exp[0].transpose(0, 1).contiguous()  # (H, L)
+
+    dBx_0 = dBx[0]  # (L, H, D, S)
+    dBx_t = (
+        dBx_0.permute(0, 2, 3, 1)
+        .reshape(seq_len, head_dim * ssm_state_size * num_heads)
+        .transpose(0, 1)
+        .contiguous()
+    )  # (D*S*H, L)
+
+    C_t = (
+        C[0]
+        .permute(0, 2, 1)
+        .reshape(seq_len, ssm_state_size * num_heads)
+        .transpose(0, 1)
+        .contiguous()
+    )  # (S*H, L)
+
+    x_t = (
+        hidden_states_ssm[0]
+        .permute(0, 2, 1)
+        .reshape(seq_len, head_dim * num_heads)
+        .transpose(0, 1)
+        .contiguous()
+    )  # (D*H, L)
+
+    Dx_t = (
+        (D.view(1, -1, 1) * hidden_states_ssm[0])
+        .permute(0, 2, 1)
+        .reshape(seq_len, head_dim * num_heads)
+        .transpose(0, 1)
+        .contiguous()
+    )  # (D*H, L)
+
+    hd_range = torch.zeros(head_dim, dtype=torch.float32, device=dA_exp_t.device)
+    ss_range = torch.zeros(ssm_state_size, dtype=torch.float32, device=dA_exp_t.device)
+
+    # Call NKI kernel
+    y_flat, state_flat = nki_scan_kernel(
+        dA_exp_t,
+        dBx_t,
+        C_t,
+        Dx_t,
+        x_t,
+        hd_range,
+        ss_range,
+    )
+
+    # Unpack: y_flat (D*H, L) → (1, L, H, D)
+    y = (
+        y_flat.reshape(head_dim, num_heads, seq_len)
+        .permute(2, 1, 0)
+        .unsqueeze(0)
+        .contiguous()
+    )
+
+    # state_flat (D*S*H, 1) → (1, H, D, S)
+    final_state = (
+        state_flat.reshape(head_dim, ssm_state_size, num_heads)
+        .permute(2, 0, 1)
+        .unsqueeze(0)
+        .contiguous()
+    )
+
+    return y, final_state
+
+
+# ==============================================================================
+# Configuration
+# ==============================================================================
+
+
+class GraniteInferenceConfig(InferenceConfig):
+    """Configuration class for Granite 4.0-H-Small model inference."""
+
+    output_attentions = False
+    output_hidden_states = False
+    use_return_dict = True
+
+    def get_required_attributes(self) -> List[str]:
+        return [
+            "attention_bias",
+            "attention_dropout",
+            "attention_multiplier",
+            "embedding_multiplier",
+            "hidden_act",
+            "hidden_size",
+            "intermediate_size",
+            "layer_types",
+            "logits_scaling",
+            "mamba_chunk_size",
+            "mamba_conv_bias",
+            "mamba_d_conv",
+            "mamba_d_head",
+            "mamba_d_state",
+            "mamba_expand",
+            "mamba_n_groups",
+            "mamba_n_heads",
+            "mamba_proj_bias",
+            "max_position_embeddings",
+            "normalization_function",
+            "num_attention_heads",
+            "num_experts_per_tok",
+            "num_hidden_layers",
+            "num_key_value_heads",
+            "num_local_experts",
+            "position_embedding_type",
+            "residual_multiplier",
+            "rms_norm_eps",
+            "shared_intermediate_size",
+            "tie_word_embeddings",
+            "vocab_size",
+        ]
+
+    @classmethod
+    def get_neuron_config_cls(cls):
+        return MoENeuronConfig
+
+
+# ==============================================================================
+# Model Wrapper (Mamba state aliasing)
+# ==============================================================================
+
+
+class GraniteDecoderModelInstance(DecoderModelInstance):
+    """
+    Extends DecoderModelInstance to alias Mamba state parameters.
+
+    After calling super().get() which aliases the KV cache parameters,
+    we add Mamba state parameters (conv_state, ssm_state for each Mamba layer)
+    to the input_output_aliases dict. This tells the XLA compiler to persist
+    these tensors across graph executions via in-place updates.
+
+    The output indices must match the order in which forward() returns tensors:
+        [res, K0, V0, K1, V1, ..., conv_state_0, ssm_state_0, conv_state_1, ssm_state_1, ...]
+    """
+
+    def get(self, bucket_rank, **kwargs):
+        self.module, self.input_output_aliases = super().get(bucket_rank, **kwargs)
+
+        past_key_values = self.module.kv_mgr.past_key_values
+        mamba_states = self.module.mamba_states
+
+        # Count where Mamba state outputs start in the output list
+        num_output_from_trace = 1  # logits/tokens
+        if getattr(self.module, "neuron_config", None) and getattr(
+            self.module.neuron_config, "output_logits", False
+        ):
+            num_output_from_trace = 2
+        num_output_from_trace += len(past_key_values)
+
+        for i in range(len(mamba_states)):
+            self.input_output_aliases[mamba_states[i]] = num_output_from_trace + i
+
+        logger.info(
+            f"GraniteDecoderModelInstance: aliased {len(past_key_values)} KV cache entries "
+            f"and {len(mamba_states)} Mamba state entries "
+            f"(Mamba starts at output index {num_output_from_trace})"
+        )
+
+        return self.module, self.input_output_aliases
+
+
+class GraniteModelWrapper(ModelWrapper):
+    """Custom ModelWrapper that returns GraniteDecoderModelInstance."""
+
+    def get_model_instance(self):
+        return GraniteDecoderModelInstance(
+            model_cls=self.model_cls,
+            config=self.config,
+            **self.model_init_kwargs,
+        )
+
+
+# ==============================================================================
+# Mamba2 Layer
+# ==============================================================================
+
+
+class GraniteRMSNormGated(nn.Module):
+    """
+    Gated RMSNorm matching HF GraniteMoeHybrid exactly.
+    Gate is applied BEFORE norm (norm_before_gate=False in Mamba2 terminology).
+    """
+
+    def __init__(self, hidden_size, eps=1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states, gate=None):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+
+        if gate is not None:
+            hidden_states = hidden_states * F.silu(gate.to(torch.float32))
+
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+        return (self.weight * hidden_states).to(input_dtype)
+
+
+class NeuronMamba2Layer(nn.Module):
+    """
+    Mamba2 layer with external state passing for XLA graph persistence.
+
+    Architecture:
+    - in_proj: hidden_size -> projection_size (gather_output=True)
+    - Split into: gate (z), xBC_input, dt
+    - Manual depthwise conv1d on xBC (avoids TEN404 NKI bug)
+    - SiLU activation
+    - Split conv output into: x, B, C
+    - SSM computation (parallel scan for prefill, recurrence for decode)
+    - Gated norm: norm(y) * silu(gate)
+    - out_proj: intermediate_size -> hidden_size (gather_output=True)
+
+    State shapes (Granite 4.0-H-Small defaults):
+    - conv_state: [batch, 8448, 3]
+    - ssm_state: [batch, 128, 64, 128]
+    """
+
+    def __init__(self, config, layer_idx: int):
+        super().__init__()
+        self.layer_idx = layer_idx
+        self.config = config
+
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.mamba_n_heads
+        self.head_dim = config.mamba_d_head
+        self.ssm_state_size = config.mamba_d_state
+        self.conv_kernel_size = config.mamba_d_conv
+        self.n_groups = config.mamba_n_groups
+        self.chunk_size = config.mamba_chunk_size
+        self.use_conv_bias = config.mamba_conv_bias
+        self.use_bias = config.mamba_proj_bias
+
+        mamba_expand = config.mamba_expand
+        self.intermediate_size = mamba_expand * self.hidden_size
+        self.groups_time_state_size = self.n_groups * self.ssm_state_size
+        self.conv_dim = self.intermediate_size + 2 * self.groups_time_state_size
+        projection_size = self.intermediate_size + self.conv_dim + self.num_heads
+
+        # Input/output projections with gather_output=True (avoids manual TP)
+        if parallel_state.model_parallel_is_initialized():
+            self.in_proj = ColumnParallelLinear(
+                self.hidden_size,
+                projection_size,
+                bias=self.use_bias,
+                gather_output=True,
+            )
+            self.out_proj = ColumnParallelLinear(
+                self.intermediate_size,
+                self.hidden_size,
+                bias=False,
+                gather_output=True,
+            )
+        else:
+            self.in_proj = nn.Linear(
+                self.hidden_size, projection_size, bias=self.use_bias
+            )
+            self.out_proj = nn.Linear(
+                self.intermediate_size, self.hidden_size, bias=False
+            )
+
+        # Manual depthwise Conv1d — the NKI auto-inserted kernel crashes with
+        # TEN404 on seq_len=1. We store weights as plain parameters.
+        self.conv_weight = nn.Parameter(
+            torch.randn(self.conv_dim, self.conv_kernel_size)
+        )
+        if self.use_conv_bias:
+            self.conv_bias = nn.Parameter(torch.zeros(self.conv_dim))
+        else:
+            self.conv_bias = None
+
+        # SSM parameters
+        self.dt_bias = nn.Parameter(torch.ones(self.num_heads))
+        A = torch.arange(1, self.num_heads + 1)
+        self.A_log = nn.Parameter(torch.log(A))
+        self.D = nn.Parameter(torch.ones(self.num_heads))
+
+        # Gated RMSNorm (gate before norm, matching HF)
+        self.norm = GraniteRMSNormGated(
+            self.intermediate_size,
+            eps=config.rms_norm_eps,
+        )
+
+        self.time_step_limit = (0.0, float("inf"))
+
+    @staticmethod
+    def get_state_shapes(config, batch_size=1):
+        """Return the shapes of conv_state and ssm_state for buffer allocation."""
+        mamba_expand = config.mamba_expand
+        intermediate_size = mamba_expand * config.hidden_size
+        groups_time_state_size = config.mamba_n_groups * config.mamba_d_state
+        conv_dim = intermediate_size + 2 * groups_time_state_size
+        conv_shape = (batch_size, conv_dim, config.mamba_d_conv - 1)
+        ssm_shape = (
+            batch_size,
+            config.mamba_n_heads,
+            config.mamba_d_head,
+            config.mamba_d_state,
+        )
+        return conv_shape, ssm_shape
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        mamba_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        **kwargs,
+    ):
+        """
+        Forward pass with external state.
+
+        Args:
+            hidden_states: (batch, seq_len, hidden_size)
+            mamba_state: (conv_state, ssm_state) from persistence buffers
+
+        Returns:
+            output: (batch, seq_len, hidden_size)
+            present_key_value: dummy (K, V) tuple for KV cache compatibility
+            updated_mamba_state: (conv_state, ssm_state) for persistence
+        """
+        batch_size, seq_len, _ = hidden_states.shape
+        dtype = hidden_states.dtype
+
+        if mamba_state is not None:
+            conv_state, ssm_state = mamba_state
+        else:
+            conv_state = torch.zeros(
+                batch_size,
+                self.conv_dim,
+                self.conv_kernel_size - 1,
+                device=hidden_states.device,
+                dtype=dtype,
+            )
+            ssm_state = torch.zeros(
+                batch_size,
+                self.num_heads,
+                self.head_dim,
+                self.ssm_state_size,
+                device=hidden_states.device,
+                dtype=torch.float32,
+            )
+
+        # Extract 2D padding mask (attention_mask is 4D causal, useless for Mamba)
+        padding_mask = kwargs.get("padding_mask", None)
+        if padding_mask is None and position_ids is not None and seq_len > 1:
+            indices = torch.arange(seq_len, device=position_ids.device).unsqueeze(0)
+            padding_mask = ((position_ids > 0) | (indices == 0)).float()
+
+        # Zero padding positions before in_proj
+        if padding_mask is not None and seq_len > 1:
+            hidden_states = (hidden_states * padding_mask[:, :, None]).to(dtype)
+
+        # Project input
+        projected_states = self.in_proj(hidden_states)
+
+        # Explicit slicing (NOT split()) — Neuron XLA compatibility
+        gate = projected_states[..., : self.intermediate_size]
+        hidden_states_B_C = projected_states[
+            ..., self.intermediate_size : self.intermediate_size + self.conv_dim
+        ]
+        dt = projected_states[..., -self.num_heads :]
+
+        if seq_len > 1:
+            # Prefill path
+            output, conv_state_new, ssm_state_new = self._forward_prefill(
+                hidden_states_B_C,
+                gate,
+                dt,
+                batch_size,
+                seq_len,
+                dtype,
+                padding_mask=padding_mask,
+            )
+            # Keep input state params in XLA graph (prevents pruning during tracing)
+            conv_state_new = conv_state_new + conv_state * 0
+            ssm_state_new = ssm_state_new + ssm_state * 0
+        else:
+            # Decode path (seq_len == 1)
+            output, conv_state_new, ssm_state_new = self._forward_decode(
+                hidden_states_B_C,
+                gate,
+                dt,
+                batch_size,
+                dtype,
+                conv_state,
+                ssm_state,
+            )
+
+        # Dummy KV cache for compatibility with attention-based generation loop
+        dummy_k = torch.zeros(1, 1, 1, 1, dtype=output.dtype, device=output.device)
+        dummy_v = torch.zeros(1, 1, 1, 1, dtype=output.dtype, device=output.device)
+
+        return (output, (dummy_k, dummy_v), (conv_state_new, ssm_state_new))
+
+    def _forward_prefill(
+        self,
+        hidden_states_B_C,
+        gate,
+        dt,
+        batch_size,
+        seq_len,
+        dtype,
+        padding_mask=None,
+    ):
+        """Prefill path: process full sequence with parallel scan."""
+        # Manual depthwise conv1d with causal padding
+        padded = F.pad(
+            hidden_states_B_C, (0, 0, self.conv_kernel_size - 1, 0), value=0.0
+        )
+
+        hidden_states_conv = torch.zeros_like(hidden_states_B_C)
+        for k in range(self.conv_kernel_size):
+            hidden_states_conv = hidden_states_conv + (
+                padded[:, k : k + seq_len, :]
+                * self.conv_weight[:, k].unsqueeze(0).unsqueeze(0)
+            )
+
+        hidden_states_B_C_conv = hidden_states_conv
+        if self.conv_bias is not None:
+            hidden_states_B_C_conv = hidden_states_B_C_conv + self.conv_bias.unsqueeze(
+                0
+            ).unsqueeze(0)
+
+        # Save conv_state from last K-1 real token positions
+        if padding_mask is not None and seq_len >= self.conv_kernel_size - 1:
+            real_len = padding_mask[:, :seq_len].sum(dim=1, keepdim=True).long()
+            K_minus_1 = self.conv_kernel_size - 1
+            offsets = torch.arange(
+                K_minus_1, device=hidden_states_B_C.device
+            ).unsqueeze(0)
+            gather_idx = (real_len - K_minus_1 + offsets).clamp(min=0)
+            gather_idx_expanded = gather_idx.unsqueeze(-1).expand(-1, -1, self.conv_dim)
+            conv_state_seq = torch.gather(hidden_states_B_C, 1, gather_idx_expanded)
+            conv_state_new = conv_state_seq.transpose(1, 2).contiguous()
+        elif seq_len >= self.conv_kernel_size - 1:
+            conv_state_new = (
+                hidden_states_B_C[:, -(self.conv_kernel_size - 1) :, :]
+                .transpose(1, 2)
+                .contiguous()
+            )
+        else:
+            pad_len = self.conv_kernel_size - 1 - seq_len
+            conv_state_new = F.pad(
+                hidden_states_B_C.transpose(1, 2), (pad_len, 0), value=0.0
+            ).contiguous()
+
+        hidden_states_B_C_conv = F.silu(hidden_states_B_C_conv)
+
+        # Zero out padding positions after conv1d+silu
+        if padding_mask is not None:
+            hidden_states_B_C_conv = hidden_states_B_C_conv * padding_mask[
+                :, :seq_len, None
+            ].to(hidden_states_B_C_conv.dtype)
+
+        # Split conv output
+        hidden_states_ssm = hidden_states_B_C_conv[..., : self.intermediate_size]
+        B = hidden_states_B_C_conv[
+            ...,
+            self.intermediate_size : self.intermediate_size
+            + self.groups_time_state_size,
+        ]
+        C = hidden_states_B_C_conv[..., -self.groups_time_state_size :]
+
+        # Vectorized SSM (parallel scan)
+        A = -torch.exp(self.A_log.float())
+
+        dt_processed = F.softplus(dt + self.dt_bias)
+        dt_processed = torch.clamp(dt_processed, self.time_step_limit[0], 1e6)
+
+        if padding_mask is not None:
+            dt_processed = dt_processed * padding_mask[:, :seq_len, None].to(
+                dt_processed.dtype
+            )
+
+        hidden_states_ssm = hidden_states_ssm.reshape(
+            batch_size, seq_len, self.num_heads, self.head_dim
+        ).float()
+        B = B.reshape(batch_size, seq_len, self.n_groups, self.ssm_state_size).float()
+        C = C.reshape(batch_size, seq_len, self.n_groups, self.ssm_state_size).float()
+        B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2)
+        C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2)
+
+        if USE_NKI_SCAN:
+            # NKI O(L) hardware-accelerated selective scan
+            y, ssm_state_new = _nki_selective_scan(
+                hidden_states_ssm,
+                dt_processed,
+                A,
+                B,
+                C,
+                self.D.float(),
+                self.num_heads,
+                self.head_dim,
+                self.ssm_state_size,
+            )
+            # Note: NKI path returns final state from last position.
+            # padding_mask handling for variable-length is not yet supported
+            # with NKI — assumes no padding (full sequences).
+        else:
+            # O(L²) quadratic parallel scan (fallback)
+            dA_log = dt_processed * A.view(1, 1, -1)
+            dB = dt_processed.unsqueeze(-1) * B
+            dBx = dB.unsqueeze(3) * hidden_states_ssm.unsqueeze(-1)
+
+            log_dA_cumsum = torch.cumsum(dA_log, dim=1)
+
+            causal_mask = torch.tril(
+                torch.ones(
+                    seq_len,
+                    seq_len,
+                    device=hidden_states_ssm.device,
+                    dtype=hidden_states_ssm.dtype,
+                )
+            )
+
+            log_diff = log_dA_cumsum.unsqueeze(2) - log_dA_cumsum.unsqueeze(1)
+            log_diff = log_diff.masked_fill(
+                causal_mask.unsqueeze(0).unsqueeze(-1) == 0, -1e9
+            )
+            weights = torch.exp(log_diff)
+
+            states = torch.einsum("btih,bihds->bthds", weights, dBx)
+
+            # Save final SSM state from last real token position
+            if padding_mask is not None:
+                real_len = padding_mask[:, :seq_len].sum(dim=1, keepdim=True).long()
+                last_real_idx = (real_len - 1).clamp(min=0)
+                gather_idx = last_real_idx.view(batch_size, 1, 1, 1, 1).expand(
+                    -1, -1, self.num_heads, self.head_dim, self.ssm_state_size
+                )
+                ssm_state_new = (
+                    torch.gather(states, 1, gather_idx).squeeze(1).contiguous()
+                )
+            else:
+                ssm_state_new = states[:, -1, :, :, :].contiguous()
+
+            y = torch.einsum("blhs,blhds->blhd", C, states)
+            y = y + self.D.view(1, 1, -1, 1) * hidden_states_ssm
+        y = y.reshape(batch_size, seq_len, -1)
+
+        scan_output = self.norm(y, gate)
+        output = self.out_proj(scan_output.to(dtype))
+        return output, conv_state_new, ssm_state_new.to(dtype)
+
+    def _forward_decode(
+        self, hidden_states_B_C, gate, dt, batch_size, dtype, conv_state, ssm_state
+    ):
+        """Decode path: single token, O(1) SSM update."""
+        xBC_new = hidden_states_B_C.squeeze(1)
+
+        # Conv1d with state
+        xBC_new_t = xBC_new.unsqueeze(2)
+        conv_input = torch.cat([conv_state, xBC_new_t], dim=2)
+
+        conv_out = (conv_input * self.conv_weight.unsqueeze(0)).sum(dim=2)
+        if self.conv_bias is not None:
+            conv_out = conv_out + self.conv_bias
+
+        conv_state_new = conv_input[:, :, 1:].contiguous()
+        conv_out = F.silu(conv_out)
+
+        x = conv_out[..., : self.intermediate_size]
+        B = conv_out[
+            ...,
+            self.intermediate_size : self.intermediate_size
+            + self.groups_time_state_size,
+        ]
+        C = conv_out[..., -self.groups_time_state_size :]
+
+        # SSM recurrence
+        A = -torch.exp(self.A_log.float())
+        dt_processed = F.softplus(dt.squeeze(1) + self.dt_bias)
+        dt_processed = torch.clamp(dt_processed, self.time_step_limit[0], 1e6)
+
+        x = x.reshape(batch_size, self.num_heads, self.head_dim).float()
+        B = B.reshape(batch_size, self.n_groups, self.ssm_state_size).float()
+        C = C.reshape(batch_size, self.n_groups, self.ssm_state_size).float()
+        B = B.repeat_interleave(self.num_heads // self.n_groups, dim=1)
+        C = C.repeat_interleave(self.num_heads // self.n_groups, dim=1)
+
+        dA = torch.exp(dt_processed * A.view(1, -1))
+        dB = dt_processed.unsqueeze(-1) * B
+        dBx = dB.unsqueeze(2) * x.unsqueeze(-1)
+
+        ssm_state_new = dA.unsqueeze(-1).unsqueeze(-1) * ssm_state.float() + dBx
+
+        y = torch.einsum("bhds,bhs->bhd", ssm_state_new, C)
+        y = y + self.D.view(1, -1, 1) * x
+        y = y.reshape(batch_size, -1)
+
+        gate_squeezed = gate.squeeze(1)
+        scan_output = self.norm(y, gate_squeezed)
+
+        if len(scan_output.shape) == 2:
+            scan_output = scan_output.unsqueeze(1)
+        output = self.out_proj(scan_output.to(dtype))
+        return output, conv_state_new, ssm_state_new.to(dtype)
+
+
+# ==============================================================================
+# Utility functions
+# ==============================================================================
+
+
+def get_rmsnorm_cls():
+    """Return appropriate RMSNorm implementation (CPU or Neuron)."""
+    from transformers.models.llama.modeling_llama import LlamaRMSNorm
+
+    return LlamaRMSNorm if cpu_mode() else CustomRMSNorm
+
+
+def convert_hf_to_neuron_mamba_weights(
+    hf_state_dict: Dict[str, torch.Tensor], tp_degree: int = 4
+) -> Dict[str, torch.Tensor]:
+    """Convert HF Granite Mamba conv1d weight keys to our parameter names."""
+    converted = {}
+    for key, tensor in hf_state_dict.items():
+        if "conv1d.weight" in key:
+            new_key = key.replace("conv1d.weight", "conv_weight")
+            converted[new_key] = tensor.squeeze(1)
+        elif "conv1d.bias" in key:
+            new_key = key.replace("conv1d.bias", "conv_bias")
+            converted[new_key] = tensor
+        else:
+            converted[key] = tensor
+    return converted
+
+
+# ==============================================================================
+# Attention
+# ==============================================================================
+
+
+class NeuronGraniteAttention(NeuronAttentionBase):
+    """Granite attention layer. Uses no position embeddings ("nope")."""
+
+    def __init__(self, config: GraniteInferenceConfig, layer_idx: int):
+        super().__init__(
+            config=config,
+            hidden_size=config.hidden_size,
+            num_attention_heads=config.num_attention_heads,
+            num_key_value_heads=config.num_key_value_heads,
+            head_dim=config.hidden_size // config.num_attention_heads,
+            rotary_emb=None,  # Granite uses "nope" (no position embeddings)
+            rms_norm_eps=config.rms_norm_eps,
+            use_qk_norm=False,
+        )
+
+        self.layer_idx = layer_idx
+        self.attention_multiplier = config.attention_multiplier
+
+        if not parallel_state.model_parallel_is_initialized():
+            raise ValueError(
+                "NeuronGraniteAttention must be initialized in a distributed env."
+            )
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        **kwargs,
+    ):
+        """Forward pass with attention multiplier scaling."""
+        hidden_states, present_key_value, cos_cache, sin_cache = super().forward(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            **kwargs,
+        )
+
+        if self.attention_multiplier != 1.0:
+            hidden_states = hidden_states * self.attention_multiplier
+
+        return (hidden_states, present_key_value, cos_cache, sin_cache)
+
+
+# ==============================================================================
+# Decoder Layer
+# ==============================================================================
+
+
+class NeuronGraniteDecoderLayer(nn.Module):
+    """Granite decoder layer — either attention or Mamba2, with MoE MLP."""
+
+    def __init__(self, config: GraniteInferenceConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.layer_idx = layer_idx
+        self.layer_type = config.layer_types[layer_idx]
+
+        if self.layer_type == "attention":
+            self.self_attn = NeuronGraniteAttention(config=config, layer_idx=layer_idx)
+        elif self.layer_type == "mamba":
+            self.mamba = NeuronMamba2Layer(config=config, layer_idx=layer_idx)
+        else:
+            raise ValueError(f"Unknown layer type: {self.layer_type}")
+
+        # MoE MLP with shared experts
+        router = RouterTopK(
+            num_experts=config.num_local_experts,
+            top_k=config.num_experts_per_tok,
+            hidden_size=config.hidden_size,
+            sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled,
+            sequence_dimension=1,
+        )
+        expert_mlps = ExpertMLPs(
+            num_experts=config.num_local_experts,
+            top_k=config.num_experts_per_tok,
+            hidden_size=config.hidden_size,
+            intermediate_size=config.intermediate_size,
+            hidden_act=config.hidden_act,
+            capacity_factor=config.neuron_config.capacity_factor,
+            sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled,
+            glu_mlp=config.neuron_config.glu_mlp,
+            normalize_top_k_affinities=True,
+            is_prefill=config.neuron_config.is_prefill_stage,
+        )
+        shared_expert = SharedExperts(
+            hidden_size=config.hidden_size,
+            intermediate_size=config.shared_intermediate_size,
+            num_shared_experts=1,
+            hidden_act=config.hidden_act,
+            dtype=config.neuron_config.torch_dtype,
+        )
+        self.mlp = MoE(
+            router=router,
+            expert_mlps=expert_mlps,
+            shared_experts=shared_expert,
+            sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled,
+            sequence_dimension=1,
+        )
+        self.mlp.eval()
+
+        self.input_layernorm = get_rmsnorm_cls()(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+        self.post_attention_layernorm = get_rmsnorm_cls()(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+        self.residual_multiplier = config.residual_multiplier
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        mamba_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        **kwargs,
+    ):
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated. Use `attention_mask` instead."
+            )
+
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+
+        updated_mamba_state = None
+        if self.layer_type == "attention":
+            hidden_states, present_key_value, cos_cache, sin_cache = self.self_attn(
+                hidden_states=hidden_states,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_value=past_key_value,
+                **kwargs,
+            )
+        else:
+            hidden_states, present_key_value, updated_mamba_state = self.mamba(
+                hidden_states=hidden_states,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_value=past_key_value,
+                mamba_state=mamba_state,
+                **kwargs,
+            )
+            cos_cache, sin_cache = None, None
+
+        hidden_states = residual + hidden_states * self.residual_multiplier
+
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)[0]
+        hidden_states = residual + hidden_states * self.residual_multiplier
+
+        return (
+            hidden_states,
+            present_key_value,
+            cos_cache,
+            sin_cache,
+            None,
+            updated_mamba_state,
+        )
+
+
+# ==============================================================================
+# Embedding/LM Head Wrappers
+# ==============================================================================
+
+
+class ScaledEmbedding(nn.Module):
+    """Applies embedding_multiplier after embedding lookup."""
+
+    def __init__(self, embedding, multiplier):
+        super().__init__()
+        self.embedding = embedding
+        self.multiplier = multiplier
+
+    def forward(self, input_ids):
+        return self.embedding(input_ids) * self.multiplier
+
+
+class ScaledLMHead(nn.Module):
+    """Applies logits_scaling (division) after lm_head projection.
+
+    Uses __getattr__ delegation for weight/bias access so framework code
+    can find them without registering duplicate parameters.
+    """
+
+    def __init__(self, lm_head, scaling):
+        super().__init__()
+        self.lm_head = lm_head
+        self.scaling = scaling
+        if hasattr(lm_head, "gather_output"):
+            self.gather_output = lm_head.gather_output
+        if hasattr(lm_head, "tensor_parallel_group"):
+            self.tensor_parallel_group = lm_head.tensor_parallel_group
+        if hasattr(lm_head, "pad_size"):
+            self.pad_size = lm_head.pad_size
+
+    def __getattr__(self, name):
+        try:
+            return super().__getattr__(name)
+        except AttributeError:
+            return getattr(self.lm_head, name)
+
+    def forward(self, hidden_states):
+        return self.lm_head(hidden_states) / self.scaling
+
+
+# ==============================================================================
+# Model Body
+# ==============================================================================
+
+
+class NeuronGraniteModel(NeuronBaseModel):
+    """
+    NeuronGraniteModel — traced model body for Granite 4.0-H-Small.
+
+    Overrides forward() and get_model_output() to handle Mamba state persistence
+    alongside the standard KV cache. Mamba state is stored in nn.ParameterList
+    (self.mamba_states) aliased via input_output_aliases in GraniteDecoderModelInstance.
+    """
+
+    def setup_attr_for_model(self, config: GraniteInferenceConfig):
+        self.on_device_sampling = (
+            config.neuron_config.on_device_sampling_config is not None
+        )
+        self.tp_degree = config.neuron_config.tp_degree
+        self.hidden_size = config.hidden_size
+        self.num_attention_heads = config.num_attention_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.max_batch_size = config.neuron_config.max_batch_size
+        self.buckets = config.neuron_config.buckets
+        self.embedding_multiplier = config.embedding_multiplier
+        self.logits_scaling = config.logits_scaling
+
+    def init_model(self, config: GraniteInferenceConfig):
+        self.padding_idx = getattr(config, "pad_token_id", None)
+        self.vocab_size = config.vocab_size
+
+        raw_embed = ParallelEmbedding(
+            config.vocab_size,
+            config.hidden_size,
+            self.padding_idx,
+            dtype=config.neuron_config.torch_dtype,
+            shard_across_embedding=True,
+        )
+        self.embed_tokens = ScaledEmbedding(raw_embed, self.embedding_multiplier)
+
+        self.layers = nn.ModuleList(
+            [
+                NeuronGraniteDecoderLayer(config, i)
+                for i in range(config.num_hidden_layers)
+            ]
+        )
+
+        self.norm = get_rmsnorm_cls()(self.hidden_size, eps=config.rms_norm_eps)
+
+        raw_lm_head = ColumnParallelLinear(
+            config.hidden_size,
+            config.vocab_size,
+            gather_output=False if self.on_device_sampling else True,
+            bias=False,
+        )
+        self.lm_head = ScaledLMHead(raw_lm_head, self.logits_scaling)
+
+        # Mamba state persistence buffers
+        self._mamba_layer_indices = [
+            i
+            for i in range(config.num_hidden_layers)
+            if config.layer_types[i] == "mamba"
+        ]
+        batch_size = config.neuron_config.batch_size
+        conv_shape, ssm_shape = NeuronMamba2Layer.get_state_shapes(config, batch_size)
+        dtype = config.neuron_config.torch_dtype
+
+        self.mamba_states = nn.ParameterList()
+        for _ in self._mamba_layer_indices:
+            self.mamba_states.append(
+                nn.Parameter(torch.zeros(conv_shape, dtype=dtype), requires_grad=False)
+            )
+            self.mamba_states.append(
+                nn.Parameter(torch.zeros(ssm_shape, dtype=dtype), requires_grad=False)
+            )
+
+        logger.info(
+            f"Initialized Mamba state persistence: {len(self._mamba_layer_indices)} layers, "
+            f"{len(self.mamba_states)} buffers"
+        )
+
+    def _get_mamba_states(self):
+        """Get Mamba states as list of (conv_state, ssm_state) tuples."""
+        states = []
+        for i in range(0, len(self.mamba_states), 2):
+            states.append((self.mamba_states[i], self.mamba_states[i + 1]))
+        return states
+
+    def _build_mamba_state_map(self):
+        """Map layer_idx -> mamba_idx for Mamba layers."""
+        return {
+            layer_idx: mamba_idx
+            for mamba_idx, layer_idx in enumerate(self._mamba_layer_indices)
+        }
+
+    def get_model_output(
+        self,
+        input_ids: torch.LongTensor = None,
+        seq_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        active_mask: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        prev_hidden: Optional[torch.FloatTensor] = None,
+        adapter_ids: Optional[torch.LongTensor] = None,
+        rotary_position_ids: Optional[torch.LongTensor] = None,
+        update_cache: bool = False,
+        is_for_context_encoding: bool = False,
+        vision_embeddings: Optional[torch.FloatTensor] = None,
+        vision_mask: Optional[torch.BoolTensor] = None,
+        local_attn_mask: Optional[torch.Tensor] = None,
+        windowed_context_encoding_window_idx: int = -1,
+        padding_mask: Optional[torch.Tensor] = None,
+        **kwargs,
+    ):
+        """
+        Thread Mamba state through decoder layers alongside KV cache.
+        Returns: (hidden_states, next_decoder_cache, updated_mamba_state_list)
+        """
+        batch_size, seq_length = input_ids.shape[:2]
+
+        past_key_values_length = 0
+        if past_key_values is not None:
+            past_key_values_length = past_key_values[0][1].shape[2]
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        if position_ids is None:
+            device = input_ids.device if input_ids is not None else inputs_embeds.device
+            position_ids = torch.arange(
+                past_key_values_length,
+                seq_length + past_key_values_length,
+                dtype=torch.long,
+                device=device,
+            )
+            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
+        else:
+            position_ids = position_ids.view(-1, seq_length).long()
+
+        hidden_states = inputs_embeds
+
+        if self.sequence_parallel_enabled:
+            self.validate_sequence_parallel(seq_length)
+        hidden_states = self.process_sequence_parallel_hidden_states(
+            inputs_embeds, seq_length, kwargs.get("active_block_table", None)
+        )
+
+        next_decoder_cache = ()
+        cos_cache = None
+        sin_cache = None
+        cache_size = self.n_positions
+
+        if not is_for_context_encoding or windowed_context_encoding_window_idx >= 1:
+            past_key_values = self.kv_mgr.get_cache(
+                seq_ids=seq_ids,
+                seq_len=cache_size,
+                is_for_context_encoding=is_for_context_encoding,
+                windowed_context_encoding_window_idx=windowed_context_encoding_window_idx,
+                **kwargs,
+            )
+
+        mamba_states = self._get_mamba_states()
+        mamba_state_map = self._build_mamba_state_map()
+        updated_mamba_states = [None] * len(self._mamba_layer_indices)
+
+        for idx, decoder_layer in enumerate(self.layers):
+            past_key_value = (
+                past_key_values[idx] if past_key_values is not None else None
+            )
+
+            mamba_state = None
+            if idx in mamba_state_map:
+                mamba_state = mamba_states[mamba_state_map[idx]]
+
+            layer_outputs = decoder_layer(
+                hidden_states,
+                seq_ids=seq_ids,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_value=past_key_value,
+                active_mask=active_mask,
+                adapter_ids=adapter_ids,
+                cos_cache=cos_cache,
+                sin_cache=sin_cache,
+                kv_mgr=self.kv_mgr,
+                idx=idx,
+                is_for_context_encoding=is_for_context_encoding,
+                seq_len=cache_size,
+                local_mask=local_attn_mask,
+                padding_mask=padding_mask,
+                mamba_state=mamba_state,
+                **kwargs,
+            )
+
+            hidden_states = layer_outputs[0]
+            next_decoder_cache += (layer_outputs[1],)
+            cos_cache, sin_cache = layer_outputs[2:4]
+            layer_mamba_state = layer_outputs[5]
+
+            if idx in mamba_state_map and layer_mamba_state is not None:
+                updated_mamba_states[mamba_state_map[idx]] = layer_mamba_state
+
+        if update_cache:
+            next_decoder_cache = self.kv_mgr.update_cache(
+                is_for_context_encoding=is_for_context_encoding,
+                seq_ids=seq_ids,
+                position_ids=position_ids,
+                new_key_values=next_decoder_cache,
+                seq_len=cache_size,
+                windowed_context_encoding_window_idx=windowed_context_encoding_window_idx,
+                **kwargs,
+            )
+
+        hidden_states = self.norm(hidden_states)
+        return (hidden_states, next_decoder_cache, updated_mamba_states)
+
+    def forward(
+        self,
+        input_ids,
+        attention_mask,
+        position_ids,
+        seq_ids,
+        sampling_params,
+        prev_hidden=None,
+        adapter_ids=None,
+        accepted_indices=None,
+        current_length=None,
+        medusa_mask=None,
+        scatter_index=None,
+        slot_mapping=None,
+        active_block_table=None,
+        num_queries=None,
+        computed_context_lens=None,
+        tile_q_indices=None,
+        tile_block_tables=None,
+        tile_masks=None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        kv_cache: Optional[torch.Tensor] = None,
+        active_mask=None,
+        rotary_position_id=None,
+        vision_embeddings=None,
+        vision_mask=None,
+    ):
+        """
+        Traced forward — appends Mamba state tensors to output list.
+        Output: [res, K0, V0, ..., conv_state_0, ssm_state_0, ...]
+        """
+        prev_hidden = self.set_none_if_empty(prev_hidden)
+        adapter_ids = self.set_none_if_empty(adapter_ids)
+        accepted_indices = self.set_none_if_empty(accepted_indices)
+        current_length = self.set_none_if_empty(current_length)
+        medusa_mask = self.set_none_if_empty(medusa_mask)
+        scatter_index = self.set_none_if_empty(scatter_index)
+        slot_mapping = self.set_none_if_empty(slot_mapping)
+        active_block_table = self.set_none_if_empty(active_block_table)
+        num_queries = self.set_none_if_empty(num_queries)
+        computed_context_lens = self.set_none_if_empty(computed_context_lens)
+        tile_q_indices = self.set_none_if_empty(tile_q_indices)
+        tile_block_tables = self.set_none_if_empty(tile_block_tables)
+        tile_masks = self.set_none_if_empty(tile_masks)
+        inputs_embeds = self.set_none_if_empty(inputs_embeds)
+        kv_cache = self.set_none_if_empty(kv_cache)
+        active_mask = self.set_none_if_empty(active_mask)
+        rotary_position_id = self.set_none_if_empty(rotary_position_id)
+        vision_embeddings = self.set_none_if_empty(vision_embeddings)
+        vision_mask = self.set_none_if_empty(vision_mask)
+
+        is_for_context_encoding = self._is_context_encoding(input_ids)
+
+        attn_mask = self.create_attn_mask(
+            attention_mask,
+            is_for_context_encoding,
+            False,
+            position_ids=position_ids,
+        )
+        padding_mask = self.create_padding_mask(position_ids)
+
+        hidden_states, updated_kv_cache, updated_mamba_states = self.get_model_output(
+            input_ids=input_ids,
+            seq_ids=seq_ids,
+            attention_mask=attn_mask,
+            position_ids=position_ids,
+            active_mask=active_mask,
+            inputs_embeds=inputs_embeds,
+            adapter_ids=adapter_ids,
+            prev_hidden=prev_hidden,
+            is_for_context_encoding=is_for_context_encoding,
+            scatter_index=slot_mapping
+            if getattr(self, "is_block_kv_layout", False)
+            else scatter_index,
+            kvcache_buffer=kv_cache,
+            update_cache=True,
+            padding_mask=padding_mask,
+        )
+
+        # Slice to last token for context encoding
+        batch_size = input_ids.shape[0]
+        if not self.sliced_hidden:
+            if not (
+                position_ids.shape[-1] == getattr(self, "speculation_length", 0)
+                or position_ids.shape[-1] == 1
+            ):
+                index = torch.max(position_ids, dim=1, keepdim=True).indices
+                index = index.unsqueeze(1).expand(batch_size, 1, self.hidden_size)
+                hidden_states = torch.gather(hidden_states, dim=1, index=index)
+
+        logits = self.lm_head(hidden_states)
+        logits = logits.float()
+
+        if hasattr(self.lm_head, "pad_size"):
+            if self.lm_head.gather_output:
+                rank_id = torch.tensor(0, device=logits.device, dtype=torch.int32)
+                world_size = 1
+            else:
+                rank_id = self.rank_util.get_rank()
+                world_size = torch.distributed.get_world_size(
+                    group=self.lm_head.tensor_parallel_group
+                )
+            from neuronx_distributed_inference.models.model_base import (
+                mask_padded_logits,
+            )
+
+            logits = mask_padded_logits(
+                logits, rank_id, world_size, pad_size=self.lm_head.pad_size
+            )
+
+        if self.on_device_sampling:
+            res = self._sample_on_device(
+                logits, sampling_params, False, is_for_context_encoding
+            )
+        else:
+            res = logits
+
+        outputs = [res]
+        if getattr(self.neuron_config, "output_logits", False):
+            from neuronx_distributed_inference.models.model_base import (
+                _gather_along_dim,
+                get_tp_group,
+            )
+
+            gathered_logits = _gather_along_dim(
+                logits,
+                partition_dim=2,
+                process_group=get_tp_group(self.config),
+            )
+            outputs += [gathered_logits]
+        outputs += updated_kv_cache
+
+        # Append Mamba states for aliasing
+        for conv_state, ssm_state in updated_mamba_states:
+            outputs.append(conv_state)
+            outputs.append(ssm_state)
+
+        return outputs
+
+
+# ==============================================================================
+# CausalLM Wrapper
+# ==============================================================================
+
+
+class NeuronGraniteForCausalLM(NeuronBaseForCausalLM):
+    """Top-level causal LM class for Granite 4.0-H-Small."""
+
+    _model_cls = NeuronGraniteModel
+
+    @staticmethod
+    def load_hf_model(model_path, **kwargs):
+        from transformers import AutoModelForCausalLM
+
+        return AutoModelForCausalLM.from_pretrained(model_path, **kwargs)
+
+    @classmethod
+    def get_config_cls(cls):
+        return GraniteInferenceConfig
+
+    @staticmethod
+    def convert_hf_to_neuron_state_dict(
+        state_dict: dict, config: GraniteInferenceConfig
+    ) -> dict:
+        return _convert_granite_hf_to_neuron_state_dict(state_dict, config)
+
+    def get_compiler_args(self):
+        return (
+            "--enable-saturate-infinity --enable-mixed-precision-accumulation "
+            "--model-type transformer -O1 "
+            "--tensorizer-options='--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2' "
+            "--auto-cast=none"
+        )
+
+    def get_model_wrapper_cls(self):
+        return GraniteModelWrapper
+
+    def _copy_past_key_values(self, outputs):
+        """Also copy Mamba states for CPU debugging path."""
+        n_mamba_entries = len(self.context_encoding_model.model.mamba_states)
+
+        if n_mamba_entries > 0:
+            super()._copy_past_key_values(outputs[:-n_mamba_entries])
+
+            mamba_outputs = outputs[-n_mamba_entries:]
+            for i, state_tensor in enumerate(mamba_outputs):
+                self.token_generation_model.model.mamba_states[i].data = state_tensor
+                self.context_encoding_model.model.mamba_states[i].data = state_tensor
+        else:
+            super()._copy_past_key_values(outputs)
+
+    @staticmethod
+    def update_state_dict_for_tied_weights(state_dict: dict):
+        """Handle tied embeddings and ScaledEmbedding/ScaledLMHead wrapper paths."""
+        if "embed_tokens.weight" in state_dict:
+            state_dict["embed_tokens.embedding.weight"] = state_dict.pop(
+                "embed_tokens.weight"
+            )
+
+        if (
+            "embed_tokens.embedding.weight" in state_dict
+            and "lm_head.lm_head.weight" not in state_dict
+        ):
+            state_dict["lm_head.lm_head.weight"] = state_dict[
+                "embed_tokens.embedding.weight"
+            ]
+
+        if "lm_head.weight" in state_dict:
+            state_dict["lm_head.lm_head.weight"] = state_dict.pop("lm_head.weight")
+
+        return state_dict
+
+
+# ==============================================================================
+# State Dict Conversion
+# ==============================================================================
+
+
+def _convert_granite_hf_to_neuron_state_dict(
+    neuron_state_dict: Dict[str, Any], config: GraniteInferenceConfig
+):
+    """Convert HF checkpoints to Neuron-compatible state dict."""
+    neuron_state_dict = convert_hf_to_neuron_mamba_weights(
+        neuron_state_dict, config.neuron_config.tp_degree
+    )
+
+    # Remove "model." prefix
+    new_state_dict = {}
+    for key, value in neuron_state_dict.items():
+        new_key = key[6:] if key.startswith("model.") else key
+        new_state_dict[new_key] = value
+    neuron_state_dict = new_state_dict
+
+    neuron_state_dict["rank_util.rank"] = torch.arange(
+        0, config.neuron_config.tp_degree, dtype=torch.int32
+    )
+
+    for layer_idx in range(config.num_hidden_layers):
+        layer_type = config.layer_types[layer_idx]
+
+        if layer_type == "attention":
+            if config.neuron_config.fused_qkv:
+                _helper_concat_and_delete_qkv(neuron_state_dict, layer_idx, "weight")
+                if (
+                    config.neuron_config.quantized_mlp_kernel_enabled
+                    or config.neuron_config.quantized
+                ):
+                    _helper_concat_and_delete_qkv(neuron_state_dict, layer_idx, "scale")
+
+        _convert_granite_moe_weights(neuron_state_dict, config, layer_idx)
+
+    gc.collect()
+    return neuron_state_dict
+
+
+def _helper_concat_and_delete_qkv(
+    state_dict: Dict[str, Any], layer_num: int, attr: str
+):
+    """Concatenate QKV weights for fused attention."""
+    state_dict[f"layers.{layer_num}.self_attn.Wqkv.{attr}"] = torch.cat(
+        [
+            state_dict[f"layers.{layer_num}.self_attn.q_proj.{attr}"],
+            state_dict[f"layers.{layer_num}.self_attn.k_proj.{attr}"],
+            state_dict[f"layers.{layer_num}.self_attn.v_proj.{attr}"],
+        ]
+    )
+    del state_dict[f"layers.{layer_num}.self_attn.q_proj.{attr}"]
+    del state_dict[f"layers.{layer_num}.self_attn.k_proj.{attr}"]
+    del state_dict[f"layers.{layer_num}.self_attn.v_proj.{attr}"]
+
+
+def _convert_granite_moe_weights(
+    state_dict: Dict[str, Any], config: GraniteInferenceConfig, layer_idx: int
+):
+    """Convert Granite MoE weights to neuronx_distributed format."""
+    router_key = f"layers.{layer_idx}.block_sparse_moe.router.layer.weight"
+    if router_key not in state_dict:
+        return
+
+    # Router weights
+    neuron_router_key = f"layers.{layer_idx}.mlp.router.linear_router.weight"
+    state_dict[neuron_router_key] = state_dict[router_key].detach().clone()
+    del state_dict[router_key]
+
+    # Expert weights (transpose for NxD format)
+    input_linear_key = f"layers.{layer_idx}.block_sparse_moe.input_linear.weight"
+    output_linear_key = f"layers.{layer_idx}.block_sparse_moe.output_linear.weight"
+
+    if input_linear_key in state_dict and output_linear_key in state_dict:
+        state_dict[f"layers.{layer_idx}.mlp.expert_mlps.mlp_op.gate_up_proj.weight"] = (
+            state_dict[input_linear_key].transpose(1, 2)
+        )
+        del state_dict[input_linear_key]
+
+        state_dict[f"layers.{layer_idx}.mlp.expert_mlps.mlp_op.down_proj.weight"] = (
+            state_dict[output_linear_key].transpose(1, 2)
+        )
+        del state_dict[output_linear_key]
+
+    # Shared expert weights
+    shared_input_key = f"layers.{layer_idx}.shared_mlp.input_linear.weight"
+    shared_output_key = f"layers.{layer_idx}.shared_mlp.output_linear.weight"
+
+    if shared_input_key in state_dict:
+        input_linear = state_dict[shared_input_key]
+        shared_intermediate_size = input_linear.shape[0] // 2
+        state_dict[f"layers.{layer_idx}.mlp.shared_experts.gate_proj.weight"] = (
+            input_linear[:shared_intermediate_size, :]
+        )
+        state_dict[f"layers.{layer_idx}.mlp.shared_experts.up_proj.weight"] = (
+            input_linear[shared_intermediate_size:, :]
+        )
+        del state_dict[shared_input_key]
+
+    if shared_output_key in state_dict:
+        state_dict[f"layers.{layer_idx}.mlp.shared_experts.down_proj.weight"] = (
+            state_dict[shared_output_key].detach().clone()
+        )
+        del state_dict[shared_output_key]
diff --git a/contrib/models/granite-4.0-h-small/test/__init__.py b/contrib/models/granite-4.0-h-small/test/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/granite-4.0-h-small/test/integration/__init__.py b/contrib/models/granite-4.0-h-small/test/integration/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/granite-4.0-h-small/test/integration/benchmark_latency.py b/contrib/models/granite-4.0-h-small/test/integration/benchmark_latency.py
new file mode 100644
index 00000000..baf8d4e9
--- /dev/null
+++ b/contrib/models/granite-4.0-h-small/test/integration/benchmark_latency.py
@@ -0,0 +1,305 @@
+"""
+Benchmark latency for Granite 4.0-H-Small (V15 quadratic vs V16 NKI).
+
+Measures:
+- Prefill latency (first token)
+- Decode latency (per-token, steady state)
+- End-to-end generation throughput
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
+    export NEURON_PLATFORM_TARGET_OVERRIDE=trn2
+    python benchmark_latency.py --model-dir /path/to/traced_model --version v16-nki
+"""
+
+import sys
+import os
+import time
+import argparse
+import torch
+import logging
+import numpy as np
+
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger(__name__)
+
+sys.path.insert(0, "/home/ubuntu/Granite4/neuronx-distributed-inference/src")
+
+MODEL_PATH = "/home/ubuntu/Granite4/granite-4.0-h-small/"
+
+
+def load_model(compiled_path):
+    """Load a compiled Granite model."""
+    from neuronx_distributed_inference.models.config import MoENeuronConfig
+    from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
+    from neuronx_distributed_inference.models.granite.modeling_granite import (
+        NeuronGraniteForCausalLM,
+        GraniteInferenceConfig,
+    )
+
+    neuron_config = MoENeuronConfig(
+        tp_degree=4,
+        batch_size=1,
+        max_context_length=128,
+        seq_len=2048,
+        on_device_sampling_config=None,
+        enable_bucketing=False,
+        flash_decoding_enabled=False,
+        torch_dtype="bfloat16",
+    )
+
+    config = GraniteInferenceConfig(
+        neuron_config,
+        load_config=load_pretrained_config(MODEL_PATH),
+    )
+
+    model = NeuronGraniteForCausalLM(MODEL_PATH, config)
+    logger.info(f"Loading compiled model from {compiled_path}...")
+    model.load(compiled_path)
+    logger.info("Model loaded successfully.")
+    return model
+
+
+def benchmark_generation(
+    model, tokenizer, prompt, max_new_tokens, num_warmup=3, num_runs=10
+):
+    """Benchmark prefill + decode latency for a single prompt."""
+    from neuronx_distributed_inference.utils.hf_adapter import (
+        HuggingFaceGenerationAdapter,
+    )
+
+    gen_model = HuggingFaceGenerationAdapter(model)
+    inputs = tokenizer(prompt, return_tensors="pt")
+    input_ids = inputs.input_ids
+    attention_mask = torch.ones_like(input_ids)
+    prompt_len = input_ids.shape[1]
+
+    # Warmup
+    logger.info(f"Warming up ({num_warmup} runs)...")
+    for _ in range(num_warmup):
+        _ = gen_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+        )
+
+    # Timed runs
+    logger.info(f"Running {num_runs} timed generations...")
+    latencies = []
+    for i in range(num_runs):
+        start = time.perf_counter()
+        outputs = gen_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+        )
+        end = time.perf_counter()
+
+        total_time = end - start
+        generated_tokens = outputs.shape[1] - prompt_len
+        latencies.append(
+            {
+                "total_time_s": total_time,
+                "generated_tokens": generated_tokens,
+                "prompt_len": prompt_len,
+            }
+        )
+        logger.info(f"  Run {i + 1}: {total_time:.3f}s, {generated_tokens} tokens")
+
+    return latencies
+
+
+def benchmark_prefill_only(model, tokenizer, prompts, num_warmup=3, num_runs=10):
+    """Benchmark prefill latency (1 token generated = prefill + 1 decode step)."""
+    from neuronx_distributed_inference.utils.hf_adapter import (
+        HuggingFaceGenerationAdapter,
+    )
+
+    gen_model = HuggingFaceGenerationAdapter(model)
+    results = {}
+
+    for prompt in prompts:
+        inputs = tokenizer(prompt, return_tensors="pt")
+        input_ids = inputs.input_ids
+        attention_mask = torch.ones_like(input_ids)
+        prompt_len = input_ids.shape[1]
+
+        # Warmup
+        for _ in range(num_warmup):
+            _ = gen_model.generate(
+                input_ids,
+                attention_mask=attention_mask,
+                max_new_tokens=1,
+                do_sample=False,
+            )
+
+        # Timed runs
+        times = []
+        for _ in range(num_runs):
+            start = time.perf_counter()
+            _ = gen_model.generate(
+                input_ids,
+                attention_mask=attention_mask,
+                max_new_tokens=1,
+                do_sample=False,
+            )
+            end = time.perf_counter()
+            times.append(end - start)
+
+        results[prompt] = {
+            "prompt_len": prompt_len,
+            "mean_s": np.mean(times),
+            "std_s": np.std(times),
+            "min_s": np.min(times),
+            "max_s": np.max(times),
+            "times": times,
+        }
+        logger.info(
+            f"  Prefill '{prompt[:30]}...' (len={prompt_len}): "
+            f"{np.mean(times) * 1000:.1f} +/- {np.std(times) * 1000:.1f} ms"
+        )
+
+    return results
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Benchmark Granite 4.0-H-Small latency"
+    )
+    parser.add_argument(
+        "--model-dir", required=True, help="Path to compiled model directory"
+    )
+    parser.add_argument(
+        "--version", default="unknown", help="Version label (e.g., v15, v16-nki)"
+    )
+    parser.add_argument(
+        "--max-new-tokens", type=int, default=50, help="Tokens to generate"
+    )
+    parser.add_argument("--num-warmup", type=int, default=3, help="Warmup iterations")
+    parser.add_argument("--num-runs", type=int, default=10, help="Timed iterations")
+    args = parser.parse_args()
+
+    from transformers import AutoTokenizer
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+
+    model = load_model(args.model_dir)
+
+    print(f"\n{'=' * 70}")
+    print(f" BENCHMARK: Granite 4.0-H-Small — {args.version}")
+    print(f" Model dir: {args.model_dir}")
+    print(f" Max new tokens: {args.max_new_tokens}")
+    print(f" Warmup: {args.num_warmup}, Runs: {args.num_runs}")
+    print(f"{'=' * 70}\n")
+
+    # Test prompts of varying lengths
+    prompts = [
+        "The",  # ~1 token
+        "The capital of France is",  # ~6 tokens
+        "Explain the concept of artificial intelligence in simple terms.",  # ~10 tokens
+        "Write a short Python function that calculates the Fibonacci sequence up to n numbers. Include type hints and a docstring.",  # ~20 tokens
+    ]
+
+    # ---- Prefill Benchmark ----
+    print(f"\n--- Prefill Latency (1 new token) ---")
+    prefill_results = benchmark_prefill_only(
+        model,
+        tokenizer,
+        prompts,
+        num_warmup=args.num_warmup,
+        num_runs=args.num_runs,
+    )
+    print(
+        f"\n{'Prompt':<50} {'Len':>4} {'Mean (ms)':>10} {'Std (ms)':>10} {'Min (ms)':>10}"
+    )
+    print("-" * 90)
+    for prompt, res in prefill_results.items():
+        print(
+            f"{prompt[:48]:<50} {res['prompt_len']:>4} {res['mean_s'] * 1000:>10.1f} "
+            f"{res['std_s'] * 1000:>10.1f} {res['min_s'] * 1000:>10.1f}"
+        )
+
+    # ---- Generation Benchmark ----
+    gen_prompt = "The capital of France is"
+    print(f"\n--- Generation Latency ({args.max_new_tokens} new tokens) ---")
+    gen_results = benchmark_generation(
+        model,
+        tokenizer,
+        gen_prompt,
+        max_new_tokens=args.max_new_tokens,
+        num_warmup=args.num_warmup,
+        num_runs=args.num_runs,
+    )
+
+    total_times = [r["total_time_s"] for r in gen_results]
+    gen_tokens = [r["generated_tokens"] for r in gen_results]
+    per_token = [r["total_time_s"] / r["generated_tokens"] for r in gen_results]
+    throughput = [r["generated_tokens"] / r["total_time_s"] for r in gen_results]
+
+    print(f"\nPrompt: '{gen_prompt}' -> {gen_tokens[0]} tokens generated")
+    print(
+        f"  Total time:     {np.mean(total_times) * 1000:.1f} +/- {np.std(total_times) * 1000:.1f} ms"
+    )
+    print(
+        f"  Per-token:      {np.mean(per_token) * 1000:.1f} +/- {np.std(per_token) * 1000:.1f} ms"
+    )
+    print(
+        f"  Throughput:     {np.mean(throughput):.1f} +/- {np.std(throughput):.1f} tokens/s"
+    )
+    print(f"  Min total:      {np.min(total_times) * 1000:.1f} ms")
+    print(f"  Max total:      {np.max(total_times) * 1000:.1f} ms")
+
+    # ---- Longer generation benchmark ----
+    gen_prompt_long = "Explain the concept of artificial intelligence in simple terms."
+    max_long = 100
+    print(f"\n--- Long Generation ({max_long} new tokens) ---")
+    long_results = benchmark_generation(
+        model,
+        tokenizer,
+        gen_prompt_long,
+        max_new_tokens=max_long,
+        num_warmup=args.num_warmup,
+        num_runs=5,  # fewer runs for longer generation
+    )
+
+    total_times_l = [r["total_time_s"] for r in long_results]
+    gen_tokens_l = [r["generated_tokens"] for r in long_results]
+    per_token_l = [r["total_time_s"] / r["generated_tokens"] for r in long_results]
+    throughput_l = [r["generated_tokens"] / r["total_time_s"] for r in long_results]
+
+    print(
+        f"\nPrompt: '{gen_prompt_long[:40]}...' -> {gen_tokens_l[0]} tokens generated"
+    )
+    print(
+        f"  Total time:     {np.mean(total_times_l) * 1000:.1f} +/- {np.std(total_times_l) * 1000:.1f} ms"
+    )
+    print(
+        f"  Per-token:      {np.mean(per_token_l) * 1000:.1f} +/- {np.std(per_token_l) * 1000:.1f} ms"
+    )
+    print(
+        f"  Throughput:     {np.mean(throughput_l):.1f} +/- {np.std(throughput_l):.1f} tokens/s"
+    )
+
+    # ---- Summary ----
+    print(f"\n{'=' * 70}")
+    print(f" SUMMARY: {args.version}")
+    print(f"{'=' * 70}")
+    print(
+        f"  Prefill (6 tokens):    {prefill_results[prompts[1]]['mean_s'] * 1000:.1f} ms"
+    )
+    print(
+        f"  Prefill (20 tokens):   {prefill_results[prompts[3]]['mean_s'] * 1000:.1f} ms"
+    )
+    print(f"  Decode per-token:      {np.mean(per_token) * 1000:.1f} ms")
+    print(f"  Decode throughput:     {np.mean(throughput):.1f} tok/s")
+    print(f"  Long gen per-token:    {np.mean(per_token_l) * 1000:.1f} ms")
+    print(f"  Long gen throughput:   {np.mean(throughput_l):.1f} tok/s")
+    print(f"{'=' * 70}\n")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/granite-4.0-h-small/test/integration/test_model.py b/contrib/models/granite-4.0-h-small/test/integration/test_model.py
new file mode 100644
index 00000000..ccae6ae9
--- /dev/null
+++ b/contrib/models/granite-4.0-h-small/test/integration/test_model.py
@@ -0,0 +1,331 @@
+#!/usr/bin/env python3
+"""
+Integration tests for Granite 4.0-H-Small NeuronX implementation.
+
+Tests model compilation, loading, and inference accuracy/performance.
+This model is a hybrid Mamba2/Attention + MoE architecture requiring
+Mamba state persistence across decode steps.
+
+Tested on: trn2.3xlarge (TP=4, LNC=2, SDK 2.28)
+"""
+
+import pytest
+import time
+import torch
+import json
+from pathlib import Path
+from transformers import AutoTokenizer
+
+from neuronx_distributed_inference.models.config import MoENeuronConfig
+from neuronx_distributed_inference.utils.hf_adapter import (
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+
+# Import from src directory
+import sys
+
+sys.path.insert(0, str(Path(__file__).parent.parent.parent / "src"))
+from modeling_granite import NeuronGraniteForCausalLM, GraniteInferenceConfig
+
+
+# Test configuration — UPDATE THESE PATHS for your environment
+MODEL_PATH = "/home/ubuntu/Granite4/granite-4.0-h-small/"
+COMPILED_MODEL_PATH = "/home/ubuntu/Granite4/traced_model_contrib/"
+
+# Compilation parameters (trn2.3xlarge with LNC=2 -> 4 logical cores)
+TP_DEGREE = 4
+BATCH_SIZE = 1
+MAX_CONTEXT_LENGTH = 128
+SEQ_LENGTH = 2048
+
+
+def load_neuron_config_from_compiled(compiled_path: str):
+    """Load neuron configuration from compiled model's neuron_config.json."""
+    config_path = Path(compiled_path) / "neuron_config.json"
+
+    if not config_path.exists():
+        raise FileNotFoundError(f"neuron_config.json not found: {config_path}")
+
+    with open(config_path) as f:
+        config_data = json.load(f)
+
+    if "neuron_config" in config_data:
+        return config_data["neuron_config"]
+    else:
+        return config_data
+
+
+def create_model_for_inference(compiled_path: str, model_path: str):
+    """Create model for inference using compiled neuron_config."""
+    neuron_config_dict = load_neuron_config_from_compiled(compiled_path)
+
+    neuron_config = MoENeuronConfig(
+        tp_degree=neuron_config_dict.get("tp_degree", TP_DEGREE),
+        batch_size=neuron_config_dict.get("batch_size", BATCH_SIZE),
+        max_context_length=neuron_config_dict.get(
+            "max_context_length", MAX_CONTEXT_LENGTH
+        ),
+        seq_len=neuron_config_dict.get("seq_len", SEQ_LENGTH),
+        on_device_sampling_config=None,
+        enable_bucketing=False,
+        flash_decoding_enabled=False,
+        torch_dtype="bfloat16",
+    )
+
+    config = GraniteInferenceConfig(
+        neuron_config,
+        load_config=load_pretrained_config(model_path),
+    )
+
+    model = NeuronGraniteForCausalLM(model_path, config)
+    return model, neuron_config
+
+
+@pytest.fixture(scope="module")
+def compiled_model():
+    """Compile and load model."""
+    compiled_path = Path(COMPILED_MODEL_PATH)
+    if not (compiled_path / "model.pt").exists():
+        print(f"Compiling model to {COMPILED_MODEL_PATH}...")
+
+        neuron_config = MoENeuronConfig(
+            tp_degree=TP_DEGREE,
+            batch_size=BATCH_SIZE,
+            max_context_length=MAX_CONTEXT_LENGTH,
+            seq_len=SEQ_LENGTH,
+            on_device_sampling_config=None,
+            enable_bucketing=False,
+            flash_decoding_enabled=False,
+            torch_dtype="bfloat16",
+        )
+
+        config = GraniteInferenceConfig(
+            neuron_config,
+            load_config=load_pretrained_config(MODEL_PATH),
+        )
+
+        model = NeuronGraniteForCausalLM(MODEL_PATH, config)
+        model.compile(COMPILED_MODEL_PATH)
+
+        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="right")
+        tokenizer.save_pretrained(COMPILED_MODEL_PATH)
+
+    # Load compiled model
+    model, _ = create_model_for_inference(COMPILED_MODEL_PATH, MODEL_PATH)
+    model.load(COMPILED_MODEL_PATH)
+
+    return model
+
+
+@pytest.fixture(scope="module")
+def tokenizer():
+    """Load tokenizer."""
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="right")
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    return tokenizer
+
+
+def test_model_loads(compiled_model):
+    """Smoke test: model loads successfully."""
+    assert compiled_model is not None
+    assert hasattr(compiled_model, "config")
+    print("PASS: Model loaded successfully")
+
+
+def test_model_generates(compiled_model, tokenizer):
+    """Test that model generates text (prefill + decode)."""
+    prompt = "Artificial Intelligence is"
+    inputs = tokenizer(prompt, return_tensors="pt")
+
+    gen_model = HuggingFaceGenerationAdapter(compiled_model)
+    outputs = gen_model.generate(
+        inputs.input_ids,
+        attention_mask=torch.ones_like(inputs.input_ids),
+        max_new_tokens=20,
+        do_sample=False,
+    )
+
+    new_tokens = outputs[0, inputs.input_ids.shape[1] :]
+    output_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
+
+    assert len(output_text.strip()) > 0, "Should generate non-empty text"
+    assert len(new_tokens) == 20, "Should generate exactly 20 new tokens"
+    print(f"PASS: Generated '{output_text}'")
+
+
+def test_output_coherence(compiled_model, tokenizer):
+    """Test that output is coherent (not repetitive gibberish)."""
+    prompt = "Explain the concept of artificial intelligence in simple terms."
+    inputs = tokenizer(prompt, return_tensors="pt")
+
+    gen_model = HuggingFaceGenerationAdapter(compiled_model)
+    outputs = gen_model.generate(
+        inputs.input_ids,
+        attention_mask=torch.ones_like(inputs.input_ids),
+        max_new_tokens=30,
+        do_sample=False,
+    )
+
+    new_tokens = outputs[0, inputs.input_ids.shape[1] :]
+    output_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
+
+    words = output_text.split()
+    assert len(words) > 3, "Output should have multiple words"
+    assert not _is_repetitive(output_text), "Output should not be repetitive"
+    print(f"PASS: Coherent output '{output_text[:100]}...'")
+
+
+def test_greedy_token_match(compiled_model, tokenizer):
+    """Test that first greedy token matches known HF reference.
+
+    HF reference (BF16 CPU): prompt 'Artificial Intelligence is' -> first token 'a' (264)
+    or 'Art' -> 'ificial' depending on exact prompt. For our reference prompt:
+    'Explain the concept of artificial intelligence in simple terms.'
+    the first generated token should be sensible and consistent.
+    """
+    prompt = "Artificial Intelligence is"
+    inputs = tokenizer(prompt, return_tensors="pt")
+
+    gen_model = HuggingFaceGenerationAdapter(compiled_model)
+    outputs = gen_model.generate(
+        inputs.input_ids,
+        attention_mask=torch.ones_like(inputs.input_ids),
+        max_new_tokens=1,
+        do_sample=False,
+    )
+
+    first_token_id = outputs[0, inputs.input_ids.shape[1] :][0].item()
+    first_token = tokenizer.decode([first_token_id])
+    print(f"PASS: First greedy token = '{first_token}' (id={first_token_id})")
+
+    # The token should be a real word/subword, not garbage
+    assert first_token_id != 0, "Token ID 0 indicates generation failure"
+    assert first_token_id != tokenizer.eos_token_id, (
+        "Should not immediately produce EOS"
+    )
+
+
+def _is_repetitive(text: str, max_repeat: int = 5) -> bool:
+    """Check if text has excessive repetition."""
+    words = text.split()
+    if len(words) < 10:
+        return False
+
+    for i in range(len(words) - max_repeat):
+        word = words[i]
+        if all(words[i + j] == word for j in range(max_repeat)):
+            return True
+
+    new_text = text[-100:] if len(text) > 100 else text
+    if len(new_text) > 20:
+        char_counts = {}
+        for c in new_text:
+            char_counts[c] = char_counts.get(c, 0) + 1
+        max_char_ratio = max(char_counts.values()) / len(new_text)
+        if max_char_ratio > 0.5:
+            return True
+
+    return False
+
+
+def test_performance_throughput(compiled_model, tokenizer):
+    """Measure token generation throughput."""
+    prompt = "Hello"
+    inputs = tokenizer(prompt, return_tensors="pt")
+    num_tokens = 50
+
+    gen_model = HuggingFaceGenerationAdapter(compiled_model)
+
+    # Warmup
+    gen_model.generate(
+        inputs.input_ids,
+        attention_mask=torch.ones_like(inputs.input_ids),
+        max_new_tokens=5,
+        do_sample=False,
+    )
+
+    start = time.perf_counter()
+    gen_model.generate(
+        inputs.input_ids,
+        attention_mask=torch.ones_like(inputs.input_ids),
+        max_new_tokens=num_tokens,
+        do_sample=False,
+    )
+    end = time.perf_counter()
+
+    total_time = end - start
+    throughput = num_tokens / total_time
+    print(
+        f"PASS: Throughput = {throughput:.2f} tok/s ({total_time:.2f}s for {num_tokens} tokens)"
+    )
+
+
+if __name__ == "__main__":
+    print("=" * 80)
+    print("Granite 4.0-H-Small Integration Tests")
+    print("=" * 80)
+
+    # Compile if needed
+    compiled_path = Path(COMPILED_MODEL_PATH)
+    if not (compiled_path / "model.pt").exists():
+        print(f"\nCompiling model to {COMPILED_MODEL_PATH}...")
+
+        neuron_config = MoENeuronConfig(
+            tp_degree=TP_DEGREE,
+            batch_size=BATCH_SIZE,
+            max_context_length=MAX_CONTEXT_LENGTH,
+            seq_len=SEQ_LENGTH,
+            on_device_sampling_config=None,
+            enable_bucketing=False,
+            flash_decoding_enabled=False,
+            torch_dtype="bfloat16",
+        )
+
+        config = GraniteInferenceConfig(
+            neuron_config,
+            load_config=load_pretrained_config(MODEL_PATH),
+        )
+
+        model = NeuronGraniteForCausalLM(MODEL_PATH, config)
+        model.compile(COMPILED_MODEL_PATH)
+
+        tok = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="right")
+        tok.save_pretrained(COMPILED_MODEL_PATH)
+        print("Compilation complete")
+
+    # Load model
+    print(f"\nLoading compiled model from {COMPILED_MODEL_PATH}...")
+    model, _ = create_model_for_inference(COMPILED_MODEL_PATH, MODEL_PATH)
+    model.load(COMPILED_MODEL_PATH)
+    print("Model loaded")
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="right")
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    # Run tests
+    print("\n" + "=" * 80)
+    print("Running Tests")
+    print("=" * 80)
+
+    print("\n1. Smoke Test...")
+    test_model_loads(model)
+
+    print("\n2. Generation Test...")
+    test_model_generates(model, tokenizer)
+
+    print("\n3. Coherence Test...")
+    test_output_coherence(model, tokenizer)
+
+    print("\n4. Greedy Token Match...")
+    test_greedy_token_match(model, tokenizer)
+
+    print("\n5. Throughput...")
+    test_performance_throughput(model, tokenizer)
+
+    print("\n" + "=" * 80)
+    print("All tests passed!")
+    print("=" * 80)
diff --git a/contrib/models/granite-4.0-h-small/test/unit/__init__.py b/contrib/models/granite-4.0-h-small/test/unit/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/contrib/models/granite-4.0-h-small/test/unit/test_nki_selective_scan.py b/contrib/models/granite-4.0-h-small/test/unit/test_nki_selective_scan.py
new file mode 100644
index 00000000..c11d42a1
--- /dev/null
+++ b/contrib/models/granite-4.0-h-small/test/unit/test_nki_selective_scan.py
@@ -0,0 +1,756 @@
+#!/usr/bin/env python3
+"""
+NKI Selective Scan Kernel for Mamba2 (Granite 4.0-H-Small)
+
+Replaces the O(L²) parallel scan with O(L) hardware-accelerated scan
+using nisa.tensor_tensor_scan on Trainium2.
+
+tensor_tensor_scan computes: result[i] = op0(data0[i], result[i-1]) op1 data1[i]
+For Mamba SSM: state[t] = exp(dA[t]) * state[t-1] + dBx[t]
+  → data0 = exp(dA), op0 = multiply, data1 = dBx, op1 = add
+
+Granite Mamba2 dimensions (after TP=4 sharding, num_heads//4=32):
+    batch_size = 1, seq_len ≤ 128, num_heads = 32 (TP-sharded)
+    head_dim = 64, ssm_state_size = 128
+
+Strategy:
+    - Pre-transpose inputs to (num_heads, seq_len, ...) on PyTorch side
+    - Partition dim (P=128) maps to num_heads (32, padded to 128 or tiled)
+    - Free dim maps to seq_len (up to 128)
+    - Outer loop over head_dim × ssm_state_size
+
+Usage:
+    export NEURON_PLATFORM_TARGET_OVERRIDE=trn2
+    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
+    python nki_selective_scan.py
+"""
+
+import numpy as np
+import torch
+import time
+
+
+# ==============================================================================
+# Reference implementation (PyTorch, CPU)
+# ==============================================================================
+
+
+def selective_scan_reference(
+    x: torch.Tensor,  # (batch, seq_len, num_heads, head_dim) float32
+    dt: torch.Tensor,  # (batch, seq_len, num_heads) float32
+    A: torch.Tensor,  # (num_heads,) float32 — negative values
+    B: torch.Tensor,  # (batch, seq_len, num_heads, ssm_state_size) float32
+    C: torch.Tensor,  # (batch, seq_len, num_heads, ssm_state_size) float32
+    D: torch.Tensor,  # (num_heads,) float32
+) -> tuple:
+    """
+    Sequential O(L) reference selective scan.
+
+    Returns:
+        y: (batch, seq_len, num_heads, head_dim)
+        final_state: (batch, num_heads, head_dim, ssm_state_size)
+    """
+    batch, seq_len, num_heads, head_dim = x.shape
+    ssm_state_size = B.shape[-1]
+
+    state = torch.zeros(batch, num_heads, head_dim, ssm_state_size, dtype=x.dtype)
+    y = torch.zeros_like(x)
+
+    for t in range(seq_len):
+        dA = torch.exp(dt[:, t, :] * A)  # (batch, heads)
+        dB = dt[:, t, :].unsqueeze(-1) * B[:, t, :, :]  # (batch, heads, state)
+        dBx = dB.unsqueeze(2) * x[:, t, :, :].unsqueeze(
+            -1
+        )  # (batch, heads, dim, state)
+
+        state = dA.unsqueeze(-1).unsqueeze(-1) * state + dBx
+        y[:, t, :, :] = torch.einsum("bhds,bhs->bhd", state, C[:, t, :, :])
+        y[:, t, :, :] += D.view(1, -1, 1) * x[:, t, :, :]
+
+    return y, state
+
+
+def selective_scan_quadratic(
+    x: torch.Tensor,  # (batch, seq_len, num_heads, head_dim) float32
+    dt: torch.Tensor,  # (batch, seq_len, num_heads) float32
+    A: torch.Tensor,  # (num_heads,) float32
+    B: torch.Tensor,  # (batch, seq_len, num_heads, ssm_state_size) float32
+    C: torch.Tensor,  # (batch, seq_len, num_heads, ssm_state_size) float32
+    D: torch.Tensor,  # (num_heads,) float32
+) -> tuple:
+    """O(L²) parallel scan — matches current Neuron implementation."""
+    batch, seq_len, num_heads, head_dim = x.shape
+
+    dA_log = dt * A.view(1, 1, -1)
+    dB = dt.unsqueeze(-1) * B
+    dBx = dB.unsqueeze(3) * x.unsqueeze(-1)
+
+    log_dA_cumsum = torch.cumsum(dA_log, dim=1)
+    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=x.dtype))
+    log_diff = log_dA_cumsum.unsqueeze(2) - log_dA_cumsum.unsqueeze(1)
+    log_diff = log_diff.masked_fill(causal_mask.unsqueeze(0).unsqueeze(-1) == 0, -1e9)
+    weights = torch.exp(log_diff)
+
+    states = torch.einsum("btih,bihds->bthds", weights, dBx)
+    y = torch.einsum("blhs,blhds->blhd", C, states)
+    y = y + D.view(1, 1, -1, 1) * x
+
+    return y, states[:, -1, :, :, :]
+
+
+# ==============================================================================
+# NKI Kernel — Module-level (required for NKI tracer to find the function)
+# ==============================================================================
+
+try:
+    import nki
+    import nki.language as nl
+    import nki.isa as nisa
+
+    HAS_NKI = True
+except ImportError:
+    HAS_NKI = False
+
+if HAS_NKI:
+    P_MAX = 128
+
+    @nki.jit
+    def nki_scan_kernel(
+        dA_exp_t,  # (NH, SL) — pre-transposed decay coefficients
+        dBx_t,  # (NH * HD * SS, SL) — flattened+transposed
+        C_t,  # (NH * SS, SL) — flattened+transposed
+        Dx_t,  # (NH * HD, SL) — pre-computed D*x, flattened+transposed
+        x_t,  # (NH * HD, SL) — flattened+transposed (unused, for shape derivation)
+        hd_range,  # (HD,) — dummy tensor whose shape[0] gives head_dim
+        ss_range,  # (SS,) — dummy tensor whose shape[0] gives ssm_state_size
+    ):
+        """
+        NKI selective scan using tensor_tensor_scan.
+
+        Allocates outputs on HBM and returns them (NKI pattern).
+        D*x is pre-computed on PyTorch side to avoid broadcast issues.
+
+        Returns:
+            y_out: (NH * HD, SL) — scan output
+            final_state_out: (NH * HD * SS, 1) — final hidden state
+        """
+        NH = dA_exp_t.shape[0]
+        SL = dA_exp_t.shape[1]
+        HD = hd_range.shape[0]
+        SS = ss_range.shape[0]
+
+        # Allocate outputs on HBM
+        y_out = nl.ndarray((NH * HD, SL), dtype=nl.float32, buffer=nl.shared_hbm)
+        final_state_out = nl.ndarray(
+            (NH * HD * SS, 1), dtype=nl.float32, buffer=nl.shared_hbm
+        )
+
+        # Load dA_exp once: (num_heads, seq_len) -> SBUF
+        dA_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+        nisa.memset(dst=dA_sb, value=0.0)
+        nisa.dma_copy(
+            dst=dA_sb[0:NH, 0:SL],
+            src=dA_exp_t[0:NH, 0:SL],
+        )
+
+        # For each head_dim d:
+        for d in nl.affine_range(HD):
+            # Load pre-computed D*x as initial y accumulator
+            y_acc_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+            nisa.memset(dst=y_acc_sb, value=0.0)
+            Dx_row_start = d * NH
+            nisa.dma_copy(
+                dst=y_acc_sb[0:NH, 0:SL],
+                src=Dx_t[Dx_row_start : Dx_row_start + NH, 0:SL],
+            )
+
+            # Accumulate over ssm_state_size
+            for s in nl.affine_range(SS):
+                # Load dBx for this (d, s): from dBx_t at row (d * SS + s) * NH
+                dBx_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.memset(dst=dBx_sb, value=0.0)
+                dBx_row = (d * SS + s) * NH
+                nisa.dma_copy(
+                    dst=dBx_sb[0:NH, 0:SL],
+                    src=dBx_t[dBx_row : dBx_row + NH, 0:SL],
+                )
+
+                # Run scan: state[h, t] = dA[h, t] * state[h, t-1] + dBx[h, t]
+                init_sb = nl.ndarray((P_MAX, 1), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.memset(dst=init_sb, value=0.0)
+
+                state_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.tensor_tensor_scan(
+                    dst=state_sb[0:NH, 0:SL],
+                    data0=dA_sb[0:NH, 0:SL],
+                    data1=dBx_sb[0:NH, 0:SL],
+                    initial=init_sb[0:NH, 0:1],
+                    op0=nl.multiply,
+                    op1=nl.add,
+                )
+
+                # Save final state column: state[h, SL-1] → final_state_out
+                final_sb = nl.ndarray((P_MAX, 1), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.tensor_copy(
+                    dst=final_sb[0:NH, 0:1],
+                    src=state_sb[0:NH, SL - 1 : SL],
+                )
+                fs_row = (d * SS + s) * NH
+                nisa.dma_copy(
+                    dst=final_state_out[fs_row : fs_row + NH, 0:1],
+                    src=final_sb[0:NH, 0:1],
+                )
+
+                # Load C for this state dim s: from C_t at row s * NH
+                C_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.memset(dst=C_sb, value=0.0)
+                C_row = s * NH
+                nisa.dma_copy(
+                    dst=C_sb[0:NH, 0:SL],
+                    src=C_t[C_row : C_row + NH, 0:SL],
+                )
+
+                # y_acc += C * state
+                Cs_sb = nl.ndarray((P_MAX, SL), dtype=nl.float32, buffer=nl.sbuf)
+                nisa.tensor_tensor(
+                    dst=Cs_sb[0:NH, 0:SL],
+                    data1=C_sb[0:NH, 0:SL],
+                    data2=state_sb[0:NH, 0:SL],
+                    op=nl.multiply,
+                )
+                nisa.tensor_tensor(
+                    dst=y_acc_sb[0:NH, 0:SL],
+                    data1=y_acc_sb[0:NH, 0:SL],
+                    data2=Cs_sb[0:NH, 0:SL],
+                    op=nl.add,
+                )
+
+            # Store y_acc for this d: → y_out at row d * NH
+            y_row = d * NH
+            nisa.dma_copy(
+                dst=y_out[y_row : y_row + NH, 0:SL],
+                src=y_acc_sb[0:NH, 0:SL],
+            )
+
+        return y_out, final_state_out
+
+
+def prepare_scan_inputs(x, dt, A, B, C, D):
+    """
+    Pre-compute and transpose inputs for the NKI kernel.
+
+    This runs on PyTorch (element-wise ops, cheap) before the NKI kernel.
+
+    Args:
+        x: (batch, seq_len, num_heads, head_dim)
+        dt: (batch, seq_len, num_heads) — after softplus + clamp
+        A: (num_heads,) — negative values
+        B: (batch, seq_len, num_heads, ssm_state_size)
+        C: (batch, seq_len, num_heads, ssm_state_size)
+        D: (num_heads,)
+
+    Returns dict of transposed tensors for the NKI kernel (batch=1 assumed).
+    """
+    batch, seq_len, num_heads, head_dim = x.shape
+    ssm_state_size = B.shape[-1]
+
+    # Decay coefficients
+    dA_exp = torch.exp(dt * A.view(1, 1, -1))  # (B, L, H)
+
+    # Input contributions
+    dB = dt.unsqueeze(-1) * B  # (B, L, H, S)
+    dBx = dB.unsqueeze(3) * x.unsqueeze(-1)  # (B, L, H, D, S)
+
+    # Transpose to partition-first layout (squeeze batch dim)
+    # dA_exp: (L, H) → (H, L)
+    dA_exp_t = dA_exp[0].transpose(0, 1).contiguous()  # (H, L)
+
+    # dBx: (L, H, D, S) → flatten (H*D*S, L) with H varying fastest
+    # Layout: for each (d, s), rows d*S*H + s*H : d*S*H + (s+1)*H contain heads for that (d,s)
+    dBx_0 = dBx[0]  # (L, H, D, S)
+    # Reshape to (L, D*S*H) then transpose to (D*S*H, L)
+    # We want indexing: row (d*SS + s)*NH + h maps to dBx[t, h, d, s]
+    # So reshape dBx_0 from (L, H, D, S) to (L, D, S, H) → (L, D*S*H) → (D*S*H, L)
+    dBx_reshaped = dBx_0.permute(0, 2, 3, 1).reshape(
+        seq_len, head_dim * ssm_state_size * num_heads
+    )
+    dBx_t = dBx_reshaped.transpose(0, 1).contiguous()  # (D*S*H, L)
+
+    # C: (L, H, S) → flatten (S*H, L)
+    C_0 = C[0]  # (L, H, S)
+    C_reshaped = C_0.permute(0, 2, 1).reshape(seq_len, ssm_state_size * num_heads)
+    C_t = C_reshaped.transpose(0, 1).contiguous()  # (S*H, L)
+
+    # x: (L, H, D) → flatten (D*H, L)
+    x_0 = x[0]  # (L, H, D)
+    x_reshaped = x_0.permute(0, 2, 1).reshape(seq_len, head_dim * num_heads)
+    x_t = x_reshaped.transpose(0, 1).contiguous()  # (D*H, L)
+
+    # D*x: pre-compute skip connection — same layout as x_t: (D*H, L)
+    # D is (H,), broadcast to (L, H, D) then flatten same as x
+    Dx_0 = D.view(1, -1, 1) * x_0  # (L, H, D)
+    Dx_reshaped = Dx_0.permute(0, 2, 1).reshape(seq_len, head_dim * num_heads)
+    Dx_t = Dx_reshaped.transpose(0, 1).contiguous()  # (D*H, L)
+
+    return {
+        "dA_exp_t": dA_exp_t.float(),
+        "dBx_t": dBx_t.float(),
+        "C_t": C_t.float(),
+        "Dx_t": Dx_t.float(),
+        "x_t": x_t.float(),
+        "num_heads": num_heads,
+        "head_dim": head_dim,
+        "ssm_state_size": ssm_state_size,
+    }
+
+
+def unpack_scan_outputs(
+    y_flat, state_flat, num_heads, head_dim, ssm_state_size, seq_len
+):
+    """
+    Unpack NKI kernel outputs back to standard shapes.
+
+    Args:
+        y_flat: (D*H, L) — flattened output
+        state_flat: (D*S*H, 1) — flattened final state
+
+    Returns:
+        y: (1, seq_len, num_heads, head_dim)
+        final_state: (1, num_heads, head_dim, ssm_state_size)
+    """
+    # y_flat is (D*H, L) where row d*H + h maps to y[t, h, d]
+    # Reshape to (D, H, L) → permute to (L, H, D) → add batch
+    y_reshaped = y_flat.reshape(head_dim, num_heads, seq_len)  # (D, H, L)
+    y = y_reshaped.permute(2, 1, 0).unsqueeze(0).contiguous()  # (1, L, H, D)
+
+    # state_flat is (D*S*H, 1) where row (d*S + s)*H + h maps to state[h, d, s]
+    state_reshaped = state_flat.reshape(
+        head_dim, ssm_state_size, num_heads
+    )  # (D, S, H)
+    final_state = (
+        state_reshaped.permute(2, 0, 1).unsqueeze(0).contiguous()
+    )  # (1, H, D, S)
+
+    return y, final_state
+
+
+# ==============================================================================
+# Tests
+# ==============================================================================
+
+
+def test_reference_match():
+    """Validate O(L) and O(L²) produce same results."""
+    print("Test 1: Reference implementation match (CPU)...")
+
+    batch, seq_len, num_heads, head_dim, ssm_state_size = 1, 16, 8, 4, 8
+    torch.manual_seed(42)
+
+    x = torch.randn(batch, seq_len, num_heads, head_dim)
+    dt = torch.rand(batch, seq_len, num_heads) * 0.1
+    A = -torch.arange(1, num_heads + 1, dtype=torch.float32)
+    B = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    C = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    D = torch.ones(num_heads)
+
+    y_seq, state_seq = selective_scan_reference(x, dt, A, B, C, D)
+    y_quad, state_quad = selective_scan_quadratic(x, dt, A, B, C, D)
+
+    y_diff = (y_seq - y_quad).abs()
+    state_diff = (state_seq - state_quad).abs()
+
+    print(f"  y max diff:     {y_diff.max().item():.2e}")
+    print(f"  state max diff: {state_diff.max().item():.2e}")
+    assert y_diff.max().item() < 1e-4
+    assert state_diff.max().item() < 1e-4
+    print("  PASS\n")
+
+
+def test_transpose_roundtrip():
+    """Validate prepare/unpack are inverses."""
+    print("Test 2: Transpose round-trip (CPU)...")
+
+    batch, seq_len, num_heads, head_dim, ssm_state_size = 1, 16, 8, 4, 8
+    torch.manual_seed(42)
+
+    x = torch.randn(batch, seq_len, num_heads, head_dim)
+    dt = torch.rand(batch, seq_len, num_heads) * 0.1
+    A = -torch.arange(1, num_heads + 1, dtype=torch.float32)
+    B = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    C = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    D = torch.ones(num_heads)
+
+    inputs = prepare_scan_inputs(x, dt, A, B, C, D)
+
+    # Verify dA_exp transpose
+    dA_exp_ref = torch.exp(dt * A.view(1, 1, -1))[0]  # (L, H)
+    dA_exp_rt = inputs["dA_exp_t"].transpose(0, 1)  # (H, L) → (L, H)
+    assert (dA_exp_ref - dA_exp_rt).abs().max() < 1e-6
+
+    # Verify x transpose round-trip
+    x_t = inputs["x_t"]  # (D*H, L)
+    x_rt = x_t.reshape(head_dim, num_heads, seq_len).permute(2, 1, 0)  # (L, H, D)
+    assert (x[0] - x_rt).abs().max() < 1e-6
+
+    print("  PASS\n")
+
+
+def test_granite_scale():
+    """Test at Granite TP=4 scale."""
+    print("Test 3: Granite TP=4 scale (CPU)...")
+
+    # After TP=4 sharding: num_heads=128//4=32
+    batch, seq_len, num_heads, head_dim, ssm_state_size = 1, 128, 32, 64, 128
+    torch.manual_seed(42)
+
+    x = torch.randn(batch, seq_len, num_heads, head_dim)
+    dt = torch.rand(batch, seq_len, num_heads) * 0.1
+    A = -torch.arange(1, num_heads + 1, dtype=torch.float32)
+    B = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    C = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    D = torch.ones(num_heads)
+
+    t0 = time.perf_counter()
+    y_seq, _ = selective_scan_reference(x, dt, A, B, C, D)
+    t_seq = time.perf_counter() - t0
+
+    t0 = time.perf_counter()
+    y_quad, _ = selective_scan_quadratic(x, dt, A, B, C, D)
+    t_quad = time.perf_counter() - t0
+
+    # Also time the prepare step
+    t0 = time.perf_counter()
+    inputs = prepare_scan_inputs(x, dt, A, B, C, D)
+    t_prep = time.perf_counter() - t0
+
+    diff = (y_seq - y_quad).abs()
+    print(f"  Sequential O(L):   {t_seq:.3f}s")
+    print(f"  Quadratic O(L²):   {t_quad:.3f}s")
+    print(f"  Speedup:           {t_quad / t_seq:.1f}x (CPU, algorithmic only)")
+    print(f"  Prepare time:      {t_prep:.4f}s")
+    print(f"  y max diff:        {diff.max().item():.2e}")
+    print(f"  Prepared shapes:")
+    print(f"    dA_exp_t: {inputs['dA_exp_t'].shape}")
+    print(f"    dBx_t:    {inputs['dBx_t'].shape}")
+    print(f"    C_t:      {inputs['C_t'].shape}")
+    print(f"    x_t:      {inputs['x_t'].shape}")
+
+    assert diff.max().item() < 1e-2  # Looser for larger scale
+    print("  PASS\n")
+
+
+def cpu_simulate_kernel(
+    dA_exp_t, dBx_t, C_t, Dx_t, x_t, num_heads, head_dim, ssm_state_size, seq_len
+):
+    """
+    CPU simulation of what the NKI kernel should compute, using the same
+    transposed data layout. This isolates layout bugs from NKI bugs.
+    """
+    NH = num_heads
+    HD = head_dim
+    SS = ssm_state_size
+    SL = seq_len
+
+    y_flat = torch.zeros(HD * NH, SL, dtype=torch.float32)
+    state_flat = torch.zeros(HD * SS * NH, 1, dtype=torch.float32)
+
+    for d in range(HD):
+        # y_acc = Dx (pre-computed D*x for this d)
+        y_acc = Dx_t[d * NH : (d + 1) * NH, :].clone()  # (NH, SL)
+
+        for s in range(SS):
+            # dBx for this (d, s)
+            dBx_row = (d * SS + s) * NH
+            dBx_sb = dBx_t[dBx_row : dBx_row + NH, :]  # (NH, SL)
+
+            # Sequential scan: state[h, t] = dA[h, t] * state[h, t-1] + dBx[h, t]
+            state_sb = torch.zeros(NH, SL, dtype=torch.float32)
+            prev = torch.zeros(NH, dtype=torch.float32)
+            for t in range(SL):
+                prev = dA_exp_t[:NH, t] * prev + dBx_sb[:NH, t]
+                state_sb[:, t] = prev
+
+            # Save final state
+            fs_row = (d * SS + s) * NH
+            state_flat[fs_row : fs_row + NH, 0] = state_sb[:, SL - 1]
+
+            # C for this s
+            C_row = s * NH
+            C_sb = C_t[C_row : C_row + NH, :]  # (NH, SL)
+
+            # y_acc += C * state
+            y_acc = y_acc + C_sb * state_sb
+
+        # Store
+        y_row = d * NH
+        y_flat[y_row : y_row + NH, :] = y_acc
+
+    return y_flat, state_flat
+
+
+def test_cpu_kernel_sim():
+    """Test CPU simulation of kernel logic to isolate layout issues."""
+    print("Test 3.5: CPU kernel simulation (isolate layout vs NKI issues)...")
+
+    batch, seq_len, num_heads, head_dim, ssm_state_size = 1, 16, 32, 4, 8
+    torch.manual_seed(42)
+
+    x = torch.randn(batch, seq_len, num_heads, head_dim)
+    dt = torch.rand(batch, seq_len, num_heads) * 0.1
+    A = -torch.arange(1, num_heads + 1, dtype=torch.float32)
+    B = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    C = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    D = torch.ones(num_heads)
+
+    # CPU reference
+    y_ref, state_ref = selective_scan_reference(x, dt, A, B, C, D)
+
+    # Prepare inputs (same as NKI path)
+    inputs = prepare_scan_inputs(x, dt, A, B, C, D)
+
+    # CPU simulation of kernel
+    y_flat, state_flat = cpu_simulate_kernel(
+        inputs["dA_exp_t"],
+        inputs["dBx_t"],
+        inputs["C_t"],
+        inputs["Dx_t"],
+        inputs["x_t"],
+        num_heads,
+        head_dim,
+        ssm_state_size,
+        seq_len,
+    )
+
+    # Unpack using same function as NKI path
+    y_sim, state_sim = unpack_scan_outputs(
+        y_flat,
+        state_flat,
+        num_heads,
+        head_dim,
+        ssm_state_size,
+        seq_len,
+    )
+
+    y_diff = (y_ref - y_sim).abs()
+    state_diff = (state_ref - state_sim).abs()
+
+    print(f"  y max diff:     {y_diff.max().item():.2e}")
+    print(f"  y mean diff:    {y_diff.mean().item():.2e}")
+    print(f"  state max diff: {state_diff.max().item():.2e}")
+
+    # Print first few values for manual inspection
+    print(f"  y_ref[0,0,0,:4]:  {y_ref[0, 0, 0, :4].tolist()}")
+    print(f"  y_sim[0,0,0,:4]:  {y_sim[0, 0, 0, :4].tolist()}")
+    print(f"  y_ref[0,0,1,:4]:  {y_ref[0, 0, 1, :4].tolist()}")
+    print(f"  y_sim[0,0,1,:4]:  {y_sim[0, 0, 1, :4].tolist()}")
+
+    if y_diff.max().item() < 0.01:
+        print("  PASS — layout is correct, any NKI mismatch is a kernel issue\n")
+    else:
+        print("  FAIL — layout mismatch between prepare/unpack and reference\n")
+        # Find where the biggest diff is
+        idx = torch.unravel_index(y_diff.argmax(), y_diff.shape)
+        print(f"  Worst diff at index {idx}")
+        print(f"    ref={y_ref[idx].item():.6f}  sim={y_sim[idx].item():.6f}")
+
+
+def test_nki_kernel():
+    """Test NKI kernel on Trainium."""
+    try:
+        import torch_xla.core.xla_model as xm
+
+        assert HAS_NKI, "NKI not available"
+    except (ImportError, ModuleNotFoundError, AssertionError, NameError):  # noqa: F821
+        print("SKIP: NKI/torch_xla not available (run on Trainium)\n")
+        return
+
+    print("Test 4: NKI kernel on Trainium...")
+
+    # Start with small dimensions for validation
+    batch, seq_len, num_heads, head_dim, ssm_state_size = 1, 16, 32, 4, 8
+    torch.manual_seed(42)
+
+    x = torch.randn(batch, seq_len, num_heads, head_dim)
+    dt = torch.rand(batch, seq_len, num_heads) * 0.1
+    A = -torch.arange(1, num_heads + 1, dtype=torch.float32)
+    B = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    C = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    D = torch.ones(num_heads)
+
+    # CPU reference
+    y_ref, state_ref = selective_scan_reference(x, dt, A, B, C, D)
+
+    # CPU kernel simulation (for comparison)
+    inputs = prepare_scan_inputs(x, dt, A, B, C, D)
+    y_flat_cpu, state_flat_cpu = cpu_simulate_kernel(
+        inputs["dA_exp_t"],
+        inputs["dBx_t"],
+        inputs["C_t"],
+        inputs["Dx_t"],
+        inputs["x_t"],
+        num_heads,
+        head_dim,
+        ssm_state_size,
+        seq_len,
+    )
+
+    device = xm.xla_device()
+    dA_exp_t = inputs["dA_exp_t"].to(device)
+    dBx_t = inputs["dBx_t"].to(device)
+    C_t = inputs["C_t"].to(device)
+    Dx_t = inputs["Dx_t"].to(device)
+    x_t = inputs["x_t"].to(device)
+
+    # Run NKI kernel — now returns outputs
+    hd_range = torch.zeros(head_dim, dtype=torch.float32, device=device)
+    ss_range = torch.zeros(ssm_state_size, dtype=torch.float32, device=device)
+
+    y_flat, state_flat = nki_scan_kernel(
+        dA_exp_t,
+        dBx_t,
+        C_t,
+        Dx_t,
+        x_t,
+        hd_range,
+        ss_range,
+    )
+    xm.mark_step()
+
+    # Compare NKI output to CPU kernel sim (same layout, isolates NKI issues)
+    y_flat_nki = y_flat.cpu()
+    state_flat_nki = state_flat.cpu()
+
+    flat_y_diff = (y_flat_cpu - y_flat_nki).abs()
+    flat_s_diff = (state_flat_cpu - state_flat_nki).abs()
+    print(f"  NKI vs CPU-sim (flat y) max diff:     {flat_y_diff.max().item():.2e}")
+    print(f"  NKI vs CPU-sim (flat state) max diff: {flat_s_diff.max().item():.2e}")
+
+    # Also compare to original reference
+    y_nki, state_nki = unpack_scan_outputs(
+        y_flat_nki,
+        state_flat_nki,
+        num_heads,
+        head_dim,
+        ssm_state_size,
+        seq_len,
+    )
+
+    y_diff = (y_ref - y_nki).abs()
+    state_diff = (state_ref - state_nki).abs()
+
+    print(f"  NKI vs reference y max diff:     {y_diff.max().item():.2e}")
+    print(f"  NKI vs reference y mean diff:    {y_diff.mean().item():.2e}")
+    print(f"  NKI vs reference state max diff: {state_diff.max().item():.2e}")
+
+    # Print sample values
+    print(f"  y_ref[0,0,0,:4]:  {y_ref[0, 0, 0, :4].tolist()}")
+    print(f"  y_nki[0,0,0,:4]:  {y_nki[0, 0, 0, :4].tolist()}")
+
+    if y_diff.max().item() < 0.01:
+        print("  PASS\n")
+    else:
+        print(f"  FAIL — max diff {y_diff.max().item():.4f}\n")
+
+    # Now test at Granite TP=4 scale
+    print("Test 5: NKI kernel at Granite TP=4 scale...")
+
+    batch, seq_len, num_heads, head_dim, ssm_state_size = 1, 128, 32, 64, 128
+    torch.manual_seed(42)
+
+    x = torch.randn(batch, seq_len, num_heads, head_dim)
+    dt = torch.rand(batch, seq_len, num_heads) * 0.1
+    A = -torch.arange(1, num_heads + 1, dtype=torch.float32)
+    B = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    C = torch.randn(batch, seq_len, num_heads, ssm_state_size)
+    D = torch.ones(num_heads)
+
+    y_ref, state_ref = selective_scan_reference(x, dt, A, B, C, D)
+    inputs = prepare_scan_inputs(x, dt, A, B, C, D)
+
+    dA_exp_t = inputs["dA_exp_t"].to(device)
+    dBx_t = inputs["dBx_t"].to(device)
+    C_t = inputs["C_t"].to(device)
+    Dx_t = inputs["Dx_t"].to(device)
+    x_t = inputs["x_t"].to(device)
+
+    hd_range = torch.zeros(head_dim, dtype=torch.float32, device=device)
+    ss_range = torch.zeros(ssm_state_size, dtype=torch.float32, device=device)
+
+    t0 = time.perf_counter()
+    y_flat, state_flat = nki_scan_kernel(
+        dA_exp_t,
+        dBx_t,
+        C_t,
+        Dx_t,
+        x_t,
+        hd_range,
+        ss_range,
+    )
+    xm.mark_step()
+    t_nki = time.perf_counter() - t0
+
+    y_nki, state_nki = unpack_scan_outputs(
+        y_flat.cpu(),
+        state_flat.cpu(),
+        num_heads,
+        head_dim,
+        ssm_state_size,
+        seq_len,
+    )
+
+    y_diff = (y_ref - y_nki).abs()
+    print(f"  NKI time (incl compile): {t_nki:.3f}s")
+    print(f"  y max diff: {y_diff.max().item():.2e}")
+
+    # Benchmark: run multiple iterations after warmup
+    # First warmup call (may still compile for this shape)
+    y_flat2, state_flat2 = nki_scan_kernel(
+        dA_exp_t,
+        dBx_t,
+        C_t,
+        Dx_t,
+        x_t,
+        hd_range,
+        ss_range,
+    )
+    xm.mark_step()
+
+    # Second warmup
+    y_flat3, state_flat3 = nki_scan_kernel(
+        dA_exp_t,
+        dBx_t,
+        C_t,
+        Dx_t,
+        x_t,
+        hd_range,
+        ss_range,
+    )
+    xm.mark_step()
+
+    # Timed runs
+    n_runs = 10
+    t0 = time.perf_counter()
+    for _ in range(n_runs):
+        y_flat_bench, state_flat_bench = nki_scan_kernel(
+            dA_exp_t,
+            dBx_t,
+            C_t,
+            Dx_t,
+            x_t,
+            hd_range,
+            ss_range,
+        )
+        xm.mark_step()
+    t_bench = time.perf_counter() - t0
+    avg_ms = (t_bench / n_runs) * 1000
+
+    print(f"  NKI benchmark ({n_runs} runs):   {avg_ms:.2f} ms/call")
+    print("  PASS\n")
+
+
+if __name__ == "__main__":
+    test_reference_match()
+    test_transpose_roundtrip()
+    test_cpu_kernel_sim()
+    test_granite_scale()
+    test_nki_kernel()