Graph Surgery Framework for TRT-RTX#992
Conversation
…-RTX and ORT-GenAI Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough: Adds a new modelopt.onnx.graph_surgery package with a CLI, utilities, and transformations: GroupQueryAttention replacement, encoder cross-attention KV output injection, FP16→BF16 conversion, and DequantizeLinear weight transposition, plus graph helper utilities and unit tests.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 6
🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/dq_transpose.py (1)
125-143: Consider consolidating duplicate helper functions. These helper functions (_find_initializer, _find_node_by_output, _get_consumers) duplicate functionality in modelopt/onnx/graph_surgery/utils/graph_utils.py. Consider importing from there to reduce code duplication.
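A minimal sketch of that consolidation, assuming graph_utils exports helpers under these (or similar) names; aliasing keeps the existing underscore-prefixed call sites in dq_transpose.py unchanged:

```python
# Sketch only: the exact names exported by graph_utils may differ.
from modelopt.onnx.graph_surgery.utils.graph_utils import (
    find_initializer as _find_initializer,
    find_node_by_output as _find_node_by_output,
    get_consumers as _get_consumers,
)
```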
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 926-933: The code is adding cos_cache and sin_cache to
graph.value_info twice; remove the manual creation and extension of
cos_cache_info and sin_cache_info and rely on add_initializer (from graph_utils)
to register the value_info; specifically delete the
helper.make_tensor_value_info(...) lines and the
graph.value_info.extend([cos_cache_info, sin_cache_info]) call in
gqa_replacement.py so only add_initializer creates the metadata, or if you must
keep them, first check for existing entries before extending to avoid
duplicates.
- Around line 1398-1408: external_data_name can stay None when output_path has
no directory, causing use_external_data to attempt saving with location=None;
ensure external_data_name is set whenever use_external_data is true and
external_data_name is None by assigning external_data_name =
os.path.basename(output_path) + "_data" outside (or after) the output_dir
conditional so that a default filename is created even for cwd outputs; update
the block around output_dir/external_data_name/use_external_data (references:
external_data_name, output_path, use_external_data) to set the default
unconditionally when needed (see the sketch after this list).
- Around line 1306-1316: The current code reads initializer bytes with
np.frombuffer(init.raw_data) which can corrupt data for external/typed
initializers; instead use onnx.numpy_helper.to_array(init) to get a safe numpy
array, cast it to float16, then create a new TensorProto with
onnx.numpy_helper.from_array(arr_fp16, name=init.name) and replace the original
initializer in graph.initializer (or assign its raw_data/data_type from the new
TensorProto), ensuring you preserve init.name and dims; update the block that
inspects graph.node for quant_mul_name and scale_name to use
numpy_helper.to_array/from_array rather than direct raw_data access (see the sketch after this list).
In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py`:
- Around line 121-130: Remove the special-case handling that rewrites
"value_float" into BF16: delete the branch that checks attr.name ==
"value_float" and the code that creates new_tensor, sets data_type to
TensorProto.BFLOAT16, converts via fp16_to_bf16_array, assigns raw_data, calls
attr.t.CopyFrom and renames attr.name to "value"; leave the existing "value"
attribute logic (the guarded path that checks tensor element types) intact so
native FP32 scalar attributes are not downconverted by the FP16→BF16 pass.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 243-246: Guard against None for hidden_size/num_attention_heads
before computing head_size: check that hidden_size and num_attention_heads are
not None (and num_attention_heads != 0) when hf_model_id is None, and if they
are missing raise a clear ValueError indicating that either hf_model_id must be
provided or both hidden_size and num_attention_heads must be supplied; update
the code around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 37-38: The cross-attention K/V weight and bias initializers are
always converted with numpy_helper.from_array so BF16 models detected by
detect_model_dtype end up with FLOAT proto types; update the serialization code
that creates these initializers (the places that currently call
numpy_helper.from_array for cross-attention key/value weights/biases) to use
array_to_initializer with the is_bfloat16 flag when onnx_dtype ==
TensorProto.BFLOAT16, and convert the underlying float32 arrays via
fp16_to_bf16_array (do float32 → float16 → bfloat16) before passing to
array_to_initializer so the TensorProto dtype matches BF16 models.
---
Nitpick comments:
In `@modelopt/onnx/graph_surgery/dq_transpose.py`:
- Around line 125-143: The three helper functions _find_initializer,
_find_node_by_output, and _get_consumers in dq_transpose.py duplicate utilities
already provided in modelopt/onnx/graph_surgery/utils/graph_utils.py; remove
these local implementations and import the corresponding functions from
graph_utils (e.g., import find_initializer, find_node_by_output, get_consumers
or their exact exported names) and update all local call sites in
dq_transpose.py to use the imported functions (ensure names match or alias
imports to keep existing call sites like _find_initializer,
_find_node_by_output, _get_consumers working).
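For the external-data finding above (gqa_replacement.py, lines 1398-1408), a minimal sketch of the suggested default; the helper name is hypothetical and the real code operates on local variables rather than a function:

```python
import os


def default_external_data_name(
    output_path: str, external_data_name: str | None, use_external_data: bool
) -> str | None:
    """Sketch: always pick a sidecar filename when external data is enabled."""
    if use_external_data and external_data_name is None:
        # Works even when output_path has no directory component (cwd outputs),
        # so the saver never receives location=None.
        external_data_name = os.path.basename(output_path) + "_data"
    return external_data_name
```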
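For the raw-data finding above (gqa_replacement.py, lines 1306-1316), a sketch of the safer cast via onnx.numpy_helper; the function name is hypothetical, and externally stored tensors would additionally need a base_dir passed to to_array:

```python
import numpy as np
from onnx import TensorProto, numpy_helper


def cast_initializer_to_fp16(init: TensorProto) -> None:
    """Sketch: swap an initializer's payload to FP16 without reading raw_data directly."""
    # to_array decodes typed tensors correctly, unlike np.frombuffer(init.raw_data),
    # which assumes a raw little-endian float buffer.
    arr = numpy_helper.to_array(init)
    fp16 = numpy_helper.from_array(arr.astype(np.float16), name=init.name)
    init.CopyFrom(fp16)  # keeps the original name; dims come from the converted array
```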
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8fa348a4-244d-4101-9817-ae0b39af1d53
📒 Files selected for processing (12)
modelopt/onnx/graph_surgery/__init__.py
modelopt/onnx/graph_surgery/__main__.py
modelopt/onnx/graph_surgery/dq_transpose.py
modelopt/onnx/graph_surgery/gqa_replacement.py
modelopt/onnx/graph_surgery/utils/__init__.py
modelopt/onnx/graph_surgery/utils/dtype_conversion.py
modelopt/onnx/graph_surgery/utils/graph_utils.py
modelopt/onnx/graph_surgery/utils/rope_cache.py
modelopt/onnx/graph_surgery/utils/whisper_utils.py
modelopt/onnx/graph_surgery/whisper/__init__.py
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tests/unit/onnx/test_gqa_graph_surgery.py
add_initializer(graph, "cos_cache", cos_cache, onnx_dtype)
add_initializer(graph, "sin_cache", sin_cache, onnx_dtype)

# Add value_info for cos/sin caches
cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
graph.value_info.extend([cos_cache_info, sin_cache_info])
cos_cache/sin_cache are registered twice in value_info.
add_initializer already appends corresponding value_info entries (modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 435-461). Re-adding them here creates duplicate metadata entries.
Suggested change
- # Add value_info for cos/sin caches
- cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
- sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
- graph.value_info.extend([cos_cache_info, sin_cache_info])
# Handle value_float attribute
if attr.name == "value_float":
    fp32_val = np.array([attr.f], dtype=np.float32)
    bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
    new_tensor = onnx.TensorProto()
    new_tensor.data_type = TensorProto.BFLOAT16
    new_tensor.raw_data = bf16_uint16.tobytes()
    attr.t.CopyFrom(new_tensor)
    attr.name = "value"
    return True
🧩 Analysis chain
🏁 Scripts executed:
cat -n modelopt/onnx/graph_surgery/utils/dtype_conversion.py | head -150
cat -n modelopt/onnx/graph_surgery/utils/dtype_conversion.py | sed -n '90,140p'
cat -n modelopt/onnx/graph_surgery/utils/dtype_conversion.py | sed -n '1,150p'
Remove the value_float attribute conversion block.
The function converts FP16 constants to BF16, but value_float encodes native FP32 scalars. Converting them to BF16 introduces unnecessary precision loss (FP32→FP16→BF16) and changes semantics—these constants were not originally FP16 and should not be rewritten in an FP16-to-BF16 pass. The value attribute conversion correctly guards against this with a type check (line 105); value_float lacks this guard.
Suggested change
- # Handle value_float attribute
- if attr.name == "value_float":
- fp32_val = np.array([attr.f], dtype=np.float32)
- bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
- new_tensor = onnx.TensorProto()
- new_tensor.data_type = TensorProto.BFLOAT16
- new_tensor.raw_data = bf16_uint16.tobytes()
- attr.t.CopyFrom(new_tensor)
- attr.name = "value"
- return True
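A standalone illustration of the precision and semantics point above (not code from this PR): a native FP32 scalar routed through FP16 can be corrupted outright, for example by overflowing the FP16 range, whereas a direct FP32-to-BF16 truncation keeps it finite:

```python
import numpy as np

x = np.float32(70000.0)          # representable in FP32 and BF16, not in FP16

via_fp16 = x.astype(np.float16)  # -> inf, FP16 tops out around 65504

# Direct FP32 -> BF16 by truncating the low 16 bits (round-toward-zero, for illustration only).
bf16_bits = np.uint32(x.view(np.uint32) >> np.uint32(16))
recovered = np.uint32(bf16_bits << np.uint32(16)).view(np.float32)

print(via_fp16)   # inf
print(recovered)  # 69632.0, finite with only a truncation error
```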
# Compute head_size
head_size = hidden_size // num_attention_heads
Guard against None values when hf_model_id is not provided.
If hf_model_id is None and the caller doesn't supply hidden_size or num_attention_heads, line 245 will raise a TypeError when attempting division with None.
🛡️ Proposed fix to add validation
+ # Validate required parameters
+ if hidden_size is None or num_attention_heads is None:
+ raise ValueError(
+ "hidden_size and num_attention_heads must be provided when hf_model_id is not specified"
+ )
+
# Compute head_size
head_size = hidden_size // num_attention_heads
from ..utils.graph_utils import detect_model_dtype
🧩 Analysis chain
🏁 Scripts executed:
find modelopt -name "encoder_cross_kv.py" -type f
find modelopt -name "graph_utils.py" -o -name "dtype_conversion.py" | head -20
cd modelopt && find . -type f -name "*.py" | grep -E "(graph_utils|dtype_conversion)" | head -20
wc -l modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
cat -n modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
cat -n modelopt/onnx/graph_surgery/utils/graph_utils.py
head -100 modelopt/onnx/graph_surgery/utils/dtype_conversion.py
sed -n '180,195p' modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
sed -n '215,240p' modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tail -50 modelopt/onnx/graph_surgery/utils/dtype_conversion.py
rg "array_to_initializer" modelopt --type py -B 2 -A 2
sed -n '1,50p' modelopt/onnx/graph_surgery/gqa_replacement.py | head -60
grep -B 5 -A 10 "is_bfloat16" modelopt/onnx/graph_surgery/gqa_replacement.py | head -60
grep -B 3 -A 3 "fp16_to_bf16_array" modelopt/onnx/graph_surgery/utils/graph_utils.py
Add BF16-aware initializer serialization for cross-attention K/V weights and biases.
detect_model_dtype allows detecting BF16 models, but these initializers are always created with numpy_helper.from_array(...). Since NumPy doesn't support bfloat16, when the model is detected as BF16 the weights remain as FP32 arrays, causing numpy_helper.from_array() to create FLOAT TensorProtos instead of BFLOAT16. This dtype mismatch will cause inference failures.
Use array_to_initializer with the is_bfloat16 flag and convert float32 → float16 → bfloat16 via fp16_to_bf16_array when onnx_dtype == TensorProto.BFLOAT16.
Suggested change
-from ..utils.graph_utils import detect_model_dtype
+from ..utils.graph_utils import array_to_initializer, detect_model_dtype
+from ..utils.dtype_conversion import fp16_to_bf16_array
@@
- new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
- new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+ if onnx_dtype == TensorProto.BFLOAT16:
+ k_bf16 = fp16_to_bf16_array(k_weight_t.astype(np.float16))
+ v_bf16 = fp16_to_bf16_array(v_weight_t.astype(np.float16))
+ new_initializers.append(array_to_initializer(k_bf16, k_weight_name, is_bfloat16=True))
+ new_initializers.append(array_to_initializer(v_bf16, v_weight_name, is_bfloat16=True))
+ else:
+ new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
+ new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
@@
- new_initializers.append(
- numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
- )
+ if onnx_dtype == TensorProto.BFLOAT16:
+ k_bias_bf16 = fp16_to_bf16_array(k_bias.astype(np.float16))
+ new_initializers.append(
+ array_to_initializer(k_bias_bf16, k_bias_name, is_bfloat16=True)
+ )
+ else:
+ new_initializers.append(
+ numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
+ )
@@
- new_initializers.append(
- numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
- )
+ if onnx_dtype == TensorProto.BFLOAT16:
+ v_bias_bf16 = fp16_to_bf16_array(v_bias.astype(np.float16))
+ new_initializers.append(
+ array_to_initializer(v_bias_bf16, v_bias_name, is_bfloat16=True)
+ )
+ else:
+ new_initializers.append(
+ numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
+ )
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #992 +/- ##
==========================================
- Coverage 72.12% 70.10% -2.02%
==========================================
Files 209 220 +11
Lines 23628 25240 +1612
==========================================
+ Hits 17042 17695 +653
- Misses 6586 7545 +959
☔ View full report in Codecov by Sentry.
Actionable comments posted: 8
♻️ Duplicate comments (2)
modelopt/onnx/graph_surgery/utils/whisper_utils.py (1)
246-247: ⚠️ Potential issue | 🟠 Major
Validate hidden_size and num_attention_heads before computing head_size. When hf_model_id is omitted, Line 247 can still raise TypeError or ZeroDivisionError here. Please fail fast with a clear ValueError instead of letting the arithmetic blow up.
Proposed fix
- # Compute head_size
- head_size = hidden_size // num_attention_heads
+ # Compute head_size
+ if hidden_size is None or num_attention_heads is None or num_attention_heads <= 0:
+     raise ValueError(
+         "Either hf_model_id must be provided, or hidden_size and "
+         "a positive num_attention_heads must be supplied."
+     )
+ head_size = hidden_size // num_attention_heads
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py (1)
198-199: ⚠️ Potential issue | 🟠 Major
Serialize BF16 cross-attention tensors as BF16, not FLOAT.
detect_model_dtype() maps BF16 models to np.float32 in modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 117-144, so numpy_helper.from_array(...) emits FLOAT initializers here. The new MatMul/Add nodes then mix BF16 activations with FLOAT weights/biases. Reuse the raw-data BF16 initializer path already used in modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 413-432, for both weights and biases.
Suggested change
- new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
- new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+ new_initializers.append(_to_initializer(k_weight_t, k_weight_name))
+ new_initializers.append(_to_initializer(v_weight_t, v_weight_name))
@@
- new_initializers.append(
-     numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
- )
+ new_initializers.append(_to_initializer(k_bias, k_bias_name))
@@
- new_initializers.append(
-     numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
- )
+ new_initializers.append(_to_initializer(v_bias, v_bias_name))
Apply the same helper to weights and biases so the BF16 path uses raw BF16 storage instead of FLOAT TensorProtos.
Also applies to: 223-243, 387-409
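The suggested change above calls a _to_initializer helper that is not shown. A possible sketch, with onnx_dtype passed explicitly (the review's version presumably closes over it) and a plain bit truncation standing in for the repo's fp16_to_bf16_array, whose exact behavior is not shown here:

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper


def _to_initializer(arr: np.ndarray, name: str, onnx_dtype: int) -> TensorProto:
    """Sketch: emit a raw-data BF16 initializer for BF16 models, FLOAT otherwise."""
    if onnx_dtype == TensorProto.BFLOAT16:
        # Truncate FP32 values to BF16 bit patterns and store them as raw data.
        bf16_bits = (arr.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)
        return helper.make_tensor(
            name=name,
            data_type=TensorProto.BFLOAT16,
            dims=list(arr.shape),
            vals=bf16_bits.tobytes(),
            raw=True,
        )
    return numpy_helper.from_array(arr.astype(np.float32), name=name)
```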
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 41-54: The module currently imports AutoConfig and get_rope_caches
at module scope causing an unconditional dependency on transformers; move or
gate these imports so ONNX-only use-cases don't require HF. Specifically, remove
top-level imports of AutoConfig and get_rope_caches and instead import
AutoConfig inside functions that need HF behavior (e.g., any GQA/rope-related
functions) and import/get_rope_caches lazily where used (or wrap in a try/except
ImportError that raises a clear message when HF features are invoked). Ensure
identifiers AutoConfig and get_rope_caches are only referenced after their
lazy/guarded import so functions like analyze_attention_pattern and other
ONNX-only entry points can be imported without transformers installed (see the sketch after this list).
- Around line 703-709: The fused qkv_proj/Add node (qkv_add_node) is being
appended to graph.node immediately which causes it to be removed during the
attention cleanup (Steps 7-8) or to appear before its producer qkv_matmul_node;
instead, stop appending qkv_add_node to graph.node at creation and defer
insertion until after the attention cleanup/retention logic completes—add it
only once attention-related removals are done and qkv_matmul_node has been
inserted (ensure insertion order places qkv_matmul_node before qkv_add_node).
- Around line 219-244: The _add_bf16_lm_head_cast() logic only updates outputs
named exactly "logits", leaving suffixed names like "logits_###" with the wrong
BF16/FP16 type; change the output-type update loop (the loop that iterates
graph.output and checks out.name) to match either exact "logits" or suffixed
variants (e.g., out.name == "logits" or out.name.startswith("logits") / a regex
like r"^logits(_\d+)?$") and set out.type.tensor_type.elem_type =
TensorProto.FLOAT for all matches so the renamed suffixed output (created via
lm_head_matmul_name -> new_matmul_output) ends up with FP32 type consistent with
the Cast node.
In `@modelopt/onnx/graph_surgery/utils/rope_cache.py`:
- Around line 63-66: The RoPE cache slicing uses head_dim in the condition but
must use the computed rotary width dim (head_dim * partial_rotary_factor);
update the logic that inspects cos_cache.shape[-1] to compare to dim (not
head_dim) and always slice the returned caches to the rotary width and then to
the expected half-width for GroupQueryAttention (i.e., return cos_cache[...,
:dim//2] and similarly for sin_cache), and apply the same fix to the other
analogous block (lines referenced around 89-102) so both cosine and sine caches
are correctly trimmed when partial_rotary_factor < 1.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 389-470: The helpers update_genai_config_encoder and
update_genai_config_decoder currently proceed and overwrite the file even when
"model.encoder" or "model.decoder" (and their expected nested
"inputs"/"outputs") are missing, which can also lead to KeyError; before
mutating or writing, validate that config contains "model" and the appropriate
"encoder"/"decoder" dicts and that the nested "inputs"/"outputs" keys exist, and
if any required section is missing raise a clear ValueError (e.g., "Missing
'model.encoder' section in genai_config.json" or "Missing
'model.decoder.outputs'") instead of continuing; only perform the assignments
and write back the file after these checks pass in update_genai_config_encoder
and update_genai_config_decoder.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 467-493: The code skips writing sidecar files when output_path is
a bare filename because output_dir becomes an empty string; change the logic
that computes or checks output_dir so that a cwd target is treated as "."
instead of "" (e.g., set output_dir = os.path.dirname(output_path) or "." when
that results in ""), then continue to call save_audio_processor_config(...) and
_save_genai_config(...) when generate_genai_config is True; update references
around output_path, output_dir, save_audio_processor_config, and
_save_genai_config to use the normalized output_dir so current-directory outputs
produce audio_processor_config.json and genai_config.json as intended.
In `@tests/unit/onnx/test_gqa_graph_surgery.py`:
- Around line 680-695: The current final assertion uses a very loose tolerance
(finite.max() < 1.0) comparing orig_logits and gqa_logits from _run_session;
replace this with a strict numerical comparison (e.g.,
numpy.testing.assert_allclose) or a much smaller absolute tolerance (for example
1e-5 or 1e-3 depending on expected precision) applied to orig_logits and
gqa_logits (or to finite via np.max(np.abs(...))) and keep the existing
non-finite check; update the assertion that references finite.max() to use the
new tighter threshold or assert_allclose so the test fails on materially wrong
logits.
- Around line 30-33: The test currently downloads HF configs in _get_config()
and again via replace_attention_with_gqa() ->
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches; instead, create a
small local config dict/object in the test (matching only the keys used by
_get_config() and replace_attention_with_gqa(), e.g., vocab_size, n_head,
n_embd, max_position_embeddings, rotary/rope params) and update _get_config() to
return it, and monkeypatch
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches to return
deterministic, precomputed rope caches (e.g., simple arrays sized by
SEQ_LEN/MAX_SEQ_LEN) so no network calls occur during
replace_attention_with_gqa() or _get_config() (see the sketch after this list).
---
Duplicate comments:
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 246-247: Before computing head_size = hidden_size //
num_attention_heads, validate that hidden_size and num_attention_heads are
non-None integers and that num_attention_heads is > 0 (and optionally that
hidden_size % num_attention_heads == 0 if required by the model). Replace the
direct division in the block around head_size with a guard that raises a clear
ValueError identifying hidden_size and num_attention_heads when they are invalid
(None, wrong type, or zero), referencing the variables hidden_size and
num_attention_heads in the message so callers see actionable failure instead of
TypeError/ZeroDivisionError.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 198-199: The code currently creates BF16 cross-attention
initializers using numpy_helper.from_array (see
new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
and v_weight_t/v_weight_name), which forces FLOAT TensorProtos because
detect_model_dtype maps BF16 to np.float32; replace these
numpy_helper.from_array calls with the BF16 raw-data initializer helper used in
modelopt/onnx/graph_surgery/utils/graph_utils.py (the helper that emits raw BF16
storage for TensorProtos) for both weights and biases so the MatMul/Add nodes
keep BF16 storage; apply the same replacement for the other similar blocks (the
weight/bias creation ranges around the other occurrences mentioned).
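For the transformers-dependency finding above (gqa_replacement.py, lines 41-54), a minimal sketch of the lazy-import pattern; the helper name is hypothetical:

```python
def _load_hf_config(hf_model_id: str, trust_remote_code: bool = False):
    """Sketch: import transformers only when an HF model id is actually used."""
    try:
        from transformers import AutoConfig
    except ImportError as exc:  # keep ONNX-only entry points importable without HF
        raise ImportError(
            "transformers is required when hf_model_id is given; install it or pass "
            "hidden_size/num_attention_heads explicitly."
        ) from exc
    return AutoConfig.from_pretrained(hf_model_id, trust_remote_code=trust_remote_code)
```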
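For the test-isolation finding above (test_gqa_graph_surgery.py, lines 30-33), a sketch of stubbing the download with pytest's monkeypatch; the cache shapes and the get_rope_caches signature are assumptions and may need adjusting:

```python
import numpy as np
import pytest


@pytest.fixture
def fake_rope_caches(monkeypatch):
    """Sketch: deterministic RoPE caches so the test never downloads an HF config."""
    max_seq_len, half_rotary_dim = 64, 16  # assumed test-sized dimensions

    def _fake_get_rope_caches(*args, **kwargs):
        cos = np.ones((max_seq_len, half_rotary_dim), dtype=np.float32)
        sin = np.zeros((max_seq_len, half_rotary_dim), dtype=np.float32)
        return cos, sin

    monkeypatch.setattr(
        "modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches",
        _fake_get_rope_caches,
    )
```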
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 362e6545-4833-4165-baf7-2774ff39987c
📒 Files selected for processing (6)
modelopt/onnx/graph_surgery/__main__.py
modelopt/onnx/graph_surgery/gqa_replacement.py
modelopt/onnx/graph_surgery/utils/rope_cache.py
modelopt/onnx/graph_surgery/utils/whisper_utils.py
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tests/unit/onnx/test_gqa_graph_surgery.py
if lm_head_node and len(lm_head_node.output) > 0:
    # Create new output name for MatMul
    new_matmul_output = f"{lm_head_matmul_name}_output_bf16"
    lm_head_node.output[0] = new_matmul_output

    # Create Cast node to FP32
    lm_head_cast_node = helper.make_node(
        "Cast",
        inputs=[new_matmul_output],
        outputs=["logits"],
        name=f"{lm_head_matmul_name}/Cast_to_fp32",
        to=TensorProto.FLOAT,
    )
    graph.node.append(lm_head_cast_node)

    # Add value_info for bf16 intermediate
    lm_head_bf16_info = helper.make_tensor_value_info(
        new_matmul_output, TensorProto.BFLOAT16, ["batch_size", "sequence_length", "vocab_size"]
    )
    graph.value_info.append(lm_head_bf16_info)

    # Update logits output to FP32
    for out in graph.output:
        if out.name == "logits":
            out.type.tensor_type.elem_type = TensorProto.FLOAT
🧩 Analysis chain
🏁 Scripts executed:
cd modelopt && wc -l onnx/graph_surgery/gqa_replacement.py
sed -n '210,250p' onnx/graph_surgery/gqa_replacement.py
sed -n '955,990p' onnx/graph_surgery/gqa_replacement.py
sed -n '180,300p' onnx/graph_surgery/gqa_replacement.py
sed -n '210,250p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '955,990p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '180,300p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "_add_bf16_lm_head_cast\|_rename_standard_io" modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '100,180p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "def _rename_standard_io\|_rename_standard_io(" modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '880,920p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '935,975p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "def.*rename\|io_renames" modelopt/onnx/graph_surgery/gqa_replacement.py | head -20
sed -n '120,210p' modelopt/onnx/graph_surgery/gqa_replacement.py | head -40
sed -n '1,50p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -B 5 "for out in graph.output:" modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '201,245p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '880,910p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '950,1000p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n 'if out\.name ==' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "Step [0-9]" modelopt/onnx/graph_surgery/gqa_replacement.py
Update type check to handle renamed suffixed output names.
When the graph has a suffixed logits output (e.g., logits_###), _add_bf16_lm_head_cast() only checks for outputs named exactly "logits" at line 242, leaving the original suffixed output with BF16/FP16 type. Later, Step 2.5 renames logits_### to logits, but the output type remains incorrect and no longer matches the FP32 Cast node output.
Suggested change
if lm_head_node and len(lm_head_node.output) > 0:
+ original_output_name = lm_head_node.output[0]
# Create new output name for MatMul
new_matmul_output = f"{lm_head_matmul_name}_output_bf16"
lm_head_node.output[0] = new_matmul_output
@@
# Update logits output to FP32
for out in graph.output:
- if out.name == "logits":
+ if out.name in {original_output_name, "logits"}:
out.type.tensor_type.elem_type = TensorProto.FLOAT
out.type.tensor_type.elem_type = TensorProto.FLOAT🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 219 - 244, The
_add_bf16_lm_head_cast() logic only updates outputs named exactly "logits",
leaving suffixed names like "logits_###" with the wrong BF16/FP16 type; change
the output-type update loop (the loop that iterates graph.output and checks
out.name) to match either exact "logits" or suffixed variants (e.g., out.name ==
"logits" or out.name.startswith("logits") / a regex like r"^logits(_\d+)?$") and
set out.type.tensor_type.elem_type = TensorProto.FLOAT for all matches so the
renamed suffixed output (created via lm_head_matmul_name -> new_matmul_output)
ends up with FP32 type consistent with the Cast node.
qkv_add_node = helper.make_node(
    "Add",
    inputs=[qkv_matmul_output, qkv_bias_name],
    outputs=[qkv_add_output],
    name=qkv_add_name,
)
graph.node.append(qkv_add_node)
🧩 Analysis chain
🏁 Scripts executed:
# First, let's get an overview of the file and locate the specific lines
wc -l modelopt/onnx/graph_surgery/gqa_replacement.py
# Read the section around lines 703-709 with context
sed -n '680,730p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Read the section around lines 1133-1194 with context
sed -n '1110,1220p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Search for all places where qkv_matmul_nodes is used and what it is
rg -n "qkv_matmul_nodes" modelopt/onnx/graph_surgery/gqa_replacement.py
# Search for Step 7 and Step 8 comments to understand the flow
rg -n "Step [78]" modelopt/onnx/graph_surgery/gqa_replacement.py | head -20
# Look at more context around the problematic lines (700-800)
sed -n '695,790p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Check what exactly is at lines 1133-1194 and the surrounding context
sed -n '1130,1200p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Search for all places where qkv_add_node is created or used
rg -n "qkv_add" modelopt/onnx/graph_surgery/gqa_replacement.py
Defer inserting the fused qkv_proj/Add until after attention cleanup.
qkv_add_node is appended to graph.node immediately, causing it to be identified and removed during Steps 7-8. For non-quantized models, the node doesn't match any of the preserved patterns (lines 1161-1169) and is removed. Even in combined-QKV models where it survives removal, it appears in the graph before its producer qkv_matmul_node is added, violating node ordering requirements.
Suggested change
- graph.node.append(qkv_add_node)
+ qkv_matmul_nodes.append(qkv_add_node)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 703 - 709, The
fused qkv_proj/Add node (qkv_add_node) is being appended to graph.node
immediately which causes it to be removed during the attention cleanup (Steps
7-8) or to appear before its producer qkv_matmul_node; instead, stop appending
qkv_add_node to graph.node at creation and defer insertion until after the
attention cleanup/retention logic completes—add it only once attention-related
removals are done and qkv_matmul_node has been inserted (ensure insertion order
places qkv_matmul_node before qkv_add_node).
theta = getattr(config, "rope_theta", 10000.0)
head_dim = config.hidden_size // config.num_attention_heads
partial_factor = getattr(config, "partial_rotary_factor", 1.0)
dim = int(head_dim * partial_factor)
🧩 Analysis chain
🏁 Scripts executed:
cat -n modelopt/onnx/graph_surgery/utils/rope_cache.py
# Search for tests or usage of get_rope_caches function
rg "get_rope_caches" --type py -B 2 -A 5
# Look at how get_rope_caches is used in gqa_replacement.py
rg "cos_cache|sin_cache" modelopt/onnx/graph_surgery/gqa_replacement.py -B 3 -A 3
# Search for tests related to rope or partial rotary
fd -e py "test.*rope" -o -e py | xargs grep -l "partial_rotary\|get_rope" 2>/dev/null | head -10
find . -name "*test*.py" -type f | head -20 | xargs grep -l "rope\|partial_rotary" 2>/dev/null
grep -n "partial_rotary\|rope\|get_rope" ./tests/_test_utils/torch/deploy/lib_test_models.py -B 2 -A 2
find . -path ./node_modules -prune -o -name "*.py" -type f -exec grep -l "get_rope_caches\|gqa_replacement" {} \; 2>/dev/null
cat -n ./tests/unit/onnx/test_gqa_graph_surgery.py
🏁 Script executed:
# Let's trace the logic more carefully by creating a test script
python3 << 'EOF'
import torch
import numpy as np
# Test case 1: Full rotary (partial_factor=1.0)
print("=== Case 1: Full Rotary (partial_factor=1.0) ===")
head_dim = 128
partial_factor = 1.0
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim: {dim}")
# Compute cache as in rope_cache.py
max_seq_len = 4
theta = 10000.0
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
print(f"inv_freq shape: {inv_freq.shape}")
t = torch.arange(max_seq_len, dtype=torch.int64).float()
freqs = torch.outer(t, inv_freq)
print(f"freqs shape: {freqs.shape}")
emb = torch.cat((freqs, freqs), dim=-1)
print(f"emb shape after concat: {emb.shape}")
print(f"Condition check: emb.shape[-1] ({emb.shape[-1]}) == head_dim ({head_dim})? {emb.shape[-1] == head_dim}")
if emb.shape[-1] == head_dim:
    sliced = emb[:, : head_dim // 2]
    print(f"After slice: {sliced.shape} (head_dim//2={head_dim//2})")
else:
    print(f"No slice applied, returned: {emb.shape}")
print("\n=== Case 2: Partial Rotary (partial_factor=0.5) ===")
partial_factor = 0.5
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim: {dim}")
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
print(f"inv_freq shape: {inv_freq.shape}")
freqs = torch.outer(t, inv_freq)
print(f"freqs shape: {freqs.shape}")
emb = torch.cat((freqs, freqs), dim=-1)
print(f"emb shape after concat: {emb.shape}")
print(f"Condition check: emb.shape[-1] ({emb.shape[-1]}) == head_dim ({head_dim})? {emb.shape[-1] == head_dim}")
if emb.shape[-1] == head_dim:
    sliced = emb[:, : head_dim // 2]
    print(f"After slice: {sliced.shape}")
else:
    print(f"No slice applied, RETURNED: {emb.shape}")
print("\n=== Proposed Fix ===")
print(f"dim // 2 = {dim} // 2 = {dim // 2}")
print(f"Should slice to: [max_seq_len, {dim // 2}]")
print(f"For partial factor 0.5, correct cache shape should be [4, {dim // 2}]")
# What GroupQueryAttention expects
print(f"\nGroupQueryAttention expects cache shape: [max_seq_len, head_dim//2]")
print(f"For full rotary (dim=head_dim): [4, {head_dim//2}]")
partial_factor = 0.5
dim_partial = int(head_dim * partial_factor)
print(f"For partial rotary (dim={dim_partial}): [4, {dim_partial//2}]")
EOF
🏁 Script executed:
# Let me trace the logic manually
python3 << 'EOF'
# Case 1: Full rotary (partial_factor=1.0)
print("=== Case 1: Full Rotary (partial_factor=1.0) ===")
head_dim = 128
partial_factor = 1.0
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim = int({head_dim} * {partial_factor}) = {dim}")
# inv_freq has shape [dim // 2]
inv_freq_len = dim // 2
print(f"inv_freq shape: [{inv_freq_len}]")
# freqs = outer(t, inv_freq) -> [max_seq_len, dim//2]
max_seq_len = 4
freqs_shape = (max_seq_len, inv_freq_len)
print(f"freqs shape: {freqs_shape}")
# emb = cat(freqs, freqs) -> [max_seq_len, dim]
emb_shape = (max_seq_len, dim)
print(f"emb shape after cat: {emb_shape}")
# Condition: emb.shape[-1] == head_dim?
cond1 = emb_shape[-1] == head_dim
print(f"Condition: emb.shape[-1] ({emb_shape[-1]}) == head_dim ({head_dim})? {cond1}")
if cond1:
    final_shape = (max_seq_len, head_dim // 2)
    print(f"✓ Slice applied → final shape: {final_shape}")
else:
    final_shape = emb_shape
    print(f"✗ No slice → final shape: {final_shape}")
print("\n=== Case 2: Partial Rotary (partial_factor=0.5) ===")
partial_factor = 0.5
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim = int({head_dim} * {partial_factor}) = {dim}")
# inv_freq has shape [dim // 2]
inv_freq_len = dim // 2
print(f"inv_freq shape: [{inv_freq_len}]")
# freqs = outer(t, inv_freq) -> [max_seq_len, dim//2]
freqs_shape = (max_seq_len, inv_freq_len)
print(f"freqs shape: {freqs_shape}")
# emb = cat(freqs, freqs) -> [max_seq_len, dim]
emb_shape = (max_seq_len, dim)
print(f"emb shape after cat: {emb_shape}")
# Condition: emb.shape[-1] == head_dim?
cond2 = emb_shape[-1] == head_dim
print(f"Condition: emb.shape[-1] ({emb_shape[-1]}) == head_dim ({head_dim})? {cond2}")
if cond2:
    final_shape = (max_seq_len, head_dim // 2)
    print(f"✓ Slice applied → final shape: {final_shape}")
else:
    final_shape = emb_shape
    print(f"✗ NO SLICE → RETURNED shape: {final_shape}")
print("\n=== What should be returned? ===")
print(f"For full rotary (dim={128}): [max_seq_len, {128//2}]")
print(f"For partial rotary (dim={64}): [max_seq_len, {64//2}]")
print(f"\nThe docstring says: [max_seq_len, head_dim//2]")
print(f" - Full case: [max_seq_len, {128//2}] ✓ correct")
print(f" - Partial case: [max_seq_len, {128//2}] but got [max_seq_len, {64}] ✗ WRONG!")
print(f"\nReviewer's fix: always slice to dim//2 regardless of condition")
EOF
🏁 Scripts executed:
# Check how the cache is actually used in gqa_replacement.py
sed -n '140,200p' modelopt/onnx/graph_surgery/gqa_replacement.py
# Find where GroupQueryAttention is created and how cache is passed
rg "GroupQueryAttention" modelopt/onnx/graph_surgery/gqa_replacement.py -B 3 -A 8
# Find the exact inputs passed to GroupQueryAttention for cos_cache and sin_cache
rg "cos_cache\|sin_cache" modelopt/onnx/graph_surgery/gqa_replacement.py -B 5 -A 5 | head -100
# Search for where cos_cache and sin_cache are added to the graph
rg "add_initializer.*cos_cache\|cos_cache.*add_initializer" modelopt/onnx/graph_surgery/gqa_replacement.py -B 3 -A 3
# Let's look at the part where caches are added as initializers
sed -n '200,250p' modelopt/onnx/graph_surgery/gqa_replacement.py
Always slice RoPE cache to rotary dimension width, not full head width.
The current conditional slice only executes when cos_cache.shape[-1] == head_dim, which is never true for partial rotary models (partial_rotary_factor < 1). For these models, dim != head_dim, causing the cache to be returned unsliced at shape [max_seq_len, dim] instead of the correct [max_seq_len, dim // 2] that GroupQueryAttention expects.
Suggested change
- dim = int(head_dim * partial_factor)
+ dim = int(head_dim * partial_factor)
+ dim -= dim % 2
@@
- # Slice to half (GQA expects this)
- if cos_cache.shape[-1] == head_dim:
- cos_cache = cos_cache[:, : head_dim // 2]
- sin_cache = sin_cache[:, : head_dim // 2]
+ # Slice to half of the effective rotary dimension (GQA expects this)
+ rotary_cache_width = dim // 2
+ cos_cache = cos_cache[:, :rotary_cache_width]
+ sin_cache = sin_cache[:, :rotary_cache_width]
Also applies to: 89-102
def update_genai_config_encoder(
    config_path: str,
    encoder_filename: str,
    encoder_cross_present_key_pattern: str = "present_key_cross_%d",
    encoder_cross_present_value_pattern: str = "present_value_cross_%d",
) -> dict[str, Any]:
    """Update an existing genai_config.json with encoder settings.

    Useful when you already have a config from ONNX Runtime export and
    want to update the encoder section after running encoder surgery.

    Args:
        config_path: Path to existing genai_config.json.
        encoder_filename: New encoder filename.
        encoder_cross_present_key_pattern: Pattern for cross-attention key outputs.
        encoder_cross_present_value_pattern: Pattern for cross-attention value outputs.

    Returns:
        Updated configuration dictionary.
    """
    with open(config_path) as f:
        config = json.load(f)

    # Update encoder section
    if "model" in config and "encoder" in config["model"]:
        config["model"]["encoder"]["filename"] = encoder_filename
        config["model"]["encoder"]["outputs"]["cross_present_key_names"] = (
            encoder_cross_present_key_pattern
        )
        config["model"]["encoder"]["outputs"]["cross_present_value_names"] = (
            encoder_cross_present_value_pattern
        )

    # Save updated config
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    logger.info(f"Updated encoder section in {config_path}")
    return config


def update_genai_config_decoder(
    config_path: str,
    decoder_filename: str,
    decoder_past_key_pattern: str = "past_key_values.%d.decoder.key",
    decoder_past_value_pattern: str = "past_key_values.%d.decoder.value",
    decoder_present_key_pattern: str = "present.%d.decoder.key",
    decoder_present_value_pattern: str = "present.%d.decoder.value",
) -> dict[str, Any]:
    """Update an existing genai_config.json with decoder settings.

    Useful when you've run decoder surgery (MHA/GQA fusion) and need
    to update the decoder section.

    Args:
        config_path: Path to existing genai_config.json.
        decoder_filename: New decoder filename.
        decoder_past_key_pattern: Pattern for past key input names.
        decoder_past_value_pattern: Pattern for past value input names.
        decoder_present_key_pattern: Pattern for present key output names.
        decoder_present_value_pattern: Pattern for present value output names.

    Returns:
        Updated configuration dictionary.
    """
    with open(config_path) as f:
        config = json.load(f)

    # Update decoder section
    if "model" in config and "decoder" in config["model"]:
        config["model"]["decoder"]["filename"] = decoder_filename
        config["model"]["decoder"]["inputs"]["past_key_names"] = decoder_past_key_pattern
        config["model"]["decoder"]["inputs"]["past_value_names"] = decoder_past_value_pattern
        config["model"]["decoder"]["outputs"]["present_key_names"] = decoder_present_key_pattern
        config["model"]["decoder"]["outputs"]["present_value_names"] = decoder_present_value_pattern

    # Save updated config
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    logger.info(f"Updated decoder section in {config_path}")
    return config
Don’t silently succeed when the target config sections are missing.
Both update helpers write the file back and log success even if model.encoder / model.decoder is absent. That makes bad input configs look valid, and direct indexing into inputs / outputs can still raise KeyError. Please validate the expected structure up front and raise a clear error before writing.
Proposed fix
def update_genai_config_encoder(
@@
) -> dict[str, Any]:
@@
with open(config_path) as f:
config = json.load(f)
- # Update encoder section
- if "model" in config and "encoder" in config["model"]:
- config["model"]["encoder"]["filename"] = encoder_filename
- config["model"]["encoder"]["outputs"]["cross_present_key_names"] = (
- encoder_cross_present_key_pattern
- )
- config["model"]["encoder"]["outputs"]["cross_present_value_names"] = (
- encoder_cross_present_value_pattern
- )
+ try:
+ encoder_cfg = config["model"]["encoder"]
+ except KeyError as exc:
+ raise ValueError(f"{config_path} does not contain model.encoder") from exc
+
+ outputs_cfg = encoder_cfg.setdefault("outputs", {})
+ encoder_cfg["filename"] = encoder_filename
+ outputs_cfg["cross_present_key_names"] = encoder_cross_present_key_pattern
+ outputs_cfg["cross_present_value_names"] = encoder_cross_present_value_pattern
@@
def update_genai_config_decoder(
@@
) -> dict[str, Any]:
@@
with open(config_path) as f:
config = json.load(f)
- # Update decoder section
- if "model" in config and "decoder" in config["model"]:
- config["model"]["decoder"]["filename"] = decoder_filename
- config["model"]["decoder"]["inputs"]["past_key_names"] = decoder_past_key_pattern
- config["model"]["decoder"]["inputs"]["past_value_names"] = decoder_past_value_pattern
- config["model"]["decoder"]["outputs"]["present_key_names"] = decoder_present_key_pattern
- config["model"]["decoder"]["outputs"]["present_value_names"] = decoder_present_value_pattern
+ try:
+ decoder_cfg = config["model"]["decoder"]
+ except KeyError as exc:
+ raise ValueError(f"{config_path} does not contain model.decoder") from exc
+
+ inputs_cfg = decoder_cfg.setdefault("inputs", {})
+ outputs_cfg = decoder_cfg.setdefault("outputs", {})
+ decoder_cfg["filename"] = decoder_filename
+ inputs_cfg["past_key_names"] = decoder_past_key_pattern
+ inputs_cfg["past_value_names"] = decoder_past_value_pattern
+ outputs_cfg["present_key_names"] = decoder_present_key_pattern
+ outputs_cfg["present_value_names"] = decoder_present_value_pattern🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py` around lines 389 - 470,
The helpers update_genai_config_encoder and update_genai_config_decoder
currently proceed and overwrite the file even when "model.encoder" or
"model.decoder" (and their expected nested "inputs"/"outputs") are missing,
which can also lead to KeyError; before mutating or writing, validate that
config contains "model" and the appropriate "encoder"/"decoder" dicts and that
the nested "inputs"/"outputs" keys exist, and if any required section is missing
raise a clear ValueError (e.g., "Missing 'model.encoder' section in
genai_config.json" or "Missing 'model.decoder.outputs'") instead of continuing;
only perform the assignments and write back the file after these checks pass in
update_genai_config_encoder and update_genai_config_decoder.
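To make the requested behavior concrete, here is a minimal sketch of the up-front validation the comment asks for; the helper name and error messages are illustrative, not the PR's actual code.

```python
import json


def _require_section(config: dict, *keys: str) -> dict:
    """Walk nested keys and raise a clear error if any level is missing (illustrative helper)."""
    node = config
    for i, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            missing = ".".join(keys[: i + 1])
            raise ValueError(f"genai_config.json is missing the '{missing}' section")
        node = node[key]
    return node


def update_decoder_section(config_path: str, decoder_filename: str) -> dict:
    with open(config_path) as f:
        config = json.load(f)

    # Validate the expected structure before mutating anything.
    decoder = _require_section(config, "model", "decoder")
    _require_section(config, "model", "decoder", "inputs")
    _require_section(config, "model", "decoder", "outputs")

    decoder["filename"] = decoder_filename
    # ... update the inputs/outputs patterns here, as the real helper does ...

    # Only write the file back once validation has passed.
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```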
```python
# Generate config files if output directory exists
output_dir = os.path.dirname(output_path)
if output_dir:
    # Save audio processor config
    from ..utils.whisper_utils import save_audio_processor_config

    save_audio_processor_config(
        output_dir,
        hf_model_id=hf_model_id,
        overwrite=False,
        trust_remote_code=trust_remote_code,
    )

    # Generate genai_config.json with encoder pointing to this output
    if generate_genai_config:
        from ..utils.whisper_utils import save_genai_config as _save_genai_config

        encoder_filename = os.path.basename(output_path)
        _save_genai_config(
            output_dir=output_dir,
            encoder_filename=encoder_filename,
            decoder_filename=decoder_filename,
            hf_model_id=hf_model_id,
            provider=provider,
            trust_remote_code=trust_remote_code,
            overwrite=False,  # Don't overwrite if exists
        )
```
Treat cwd outputs as a valid sidecar target.
When output_path is just a filename, output_dir is "" and both audio_processor_config.json and genai_config.json are skipped even though generation is enabled by default. Using "." here preserves the expected behavior for current-directory outputs.
Suggested change
- output_dir = os.path.dirname(output_path)
+ output_dir = os.path.dirname(output_path) or "."
  if output_dir:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Generate config files if output directory exists
output_dir = os.path.dirname(output_path) or "."
if output_dir:
    # Save audio processor config
    from ..utils.whisper_utils import save_audio_processor_config

    save_audio_processor_config(
        output_dir,
        hf_model_id=hf_model_id,
        overwrite=False,
        trust_remote_code=trust_remote_code,
    )

    # Generate genai_config.json with encoder pointing to this output
    if generate_genai_config:
        from ..utils.whisper_utils import save_genai_config as _save_genai_config

        encoder_filename = os.path.basename(output_path)
        _save_genai_config(
            output_dir=output_dir,
            encoder_filename=encoder_filename,
            decoder_filename=decoder_filename,
            hf_model_id=hf_model_id,
            provider=provider,
            trust_remote_code=trust_remote_code,
            overwrite=False,  # Don't overwrite if exists
        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py` around lines 467 -
493, The code skips writing sidecar files when output_path is a bare filename
because output_dir becomes an empty string; change the logic that computes or
checks output_dir so that a cwd target is treated as "." instead of "" (e.g.,
set output_dir = os.path.dirname(output_path) or "." when that results in ""),
then continue to call save_audio_processor_config(...) and
_save_genai_config(...) when generate_genai_config is True; update references
around output_path, output_dir, save_audio_processor_config, and
_save_genai_config to use the normalized output_dir so current-directory outputs
produce audio_processor_config.json and genai_config.json as intended.
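For reference, the edge case under discussion is simply how os.path.dirname behaves on a bare filename; a self-contained illustration:

```python
import os

# os.path.dirname on a bare filename returns "" (falsy), so the sidecar branch is skipped.
print(os.path.dirname("model.onnx"))      # ""
print(os.path.dirname("out/model.onnx"))  # "out"

# The suggested normalization treats the current directory as a valid target.
output_dir = os.path.dirname("model.onnx") or "."
print(output_dir)                         # "."
```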
| MODEL_ID = "Qwen/Qwen2.5-0.5B" | ||
| VOCAB_SIZE = 64 | ||
| SEQ_LEN = 4 | ||
| MAX_SEQ_LEN = 128 |
🧩 Analysis chain
🏁 Scripts executed:
```shell
# First, find and read the test file to understand its structure
fd -t f test_gqa_graph_surgery.py | head -5

# Get the file size to understand how to read it
wc -l tests/unit/onnx/test_gqa_graph_surgery.py

# Read the relevant sections: lines 30-40 and 584-630
sed -n '1,50p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Check for fixture definitions and AutoConfig usage
rg -n "def _get_config|AutoConfig|replace_attention_with_gqa" tests/unit/onnx/test_gqa_graph_surgery.py -A 5

# Read lines around 584-623
sed -n '580,630p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Find the gqa_replacement.py file
fd -t f gqa_replacement.py

# Search for replace_attention_with_gqa function and how it uses hf_model_id
rg -n "def replace_attention_with_gqa" modelopt/onnx/graph_surgery/gqa_replacement.py -A 20

# Search for AutoConfig usage within the gqa_replacement module
rg -n "AutoConfig" modelopt/onnx/graph_surgery/gqa_replacement.py

# View the implementation around line 119 to see full context of AutoConfig usage
sed -n '110,130p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n

# Check if there are any mocking or offline alternatives already in place
rg -n "@pytest.fixture|@mock|@patch|monkeypatch" tests/unit/onnx/test_gqa_graph_surgery.py

# Verify the fixture scope and check what models_and_config provides
sed -n '625,650p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Check if there are other tests in this file that don't use the fixture
rg -n "def test_" tests/unit/onnx/test_gqa_graph_surgery.py | head -15

# Verify that _get_config is called only within the fixture
rg -n "_get_config" tests/unit/onnx/test_gqa_graph_surgery.py

# Check the full fixture implementation to see the complete picture
sed -n '604,625p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Verify if there are any environment markers or skip decorators
sed -n '600,630p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Let's check if there are any skip markers or environment checks
sed -n '1,35p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Verify if get_rope_caches function exists and what it does
rg -n "def get_rope_caches" modelopt/onnx/graph_surgery/gqa_replacement.py

# Check what imports and dependencies are needed for the suggested fix
rg -n "get_rope_caches" modelopt/onnx/graph_surgery/gqa_replacement.py

# Look at the rope_cache module to understand how it works
fd -t f rope_cache.py | head -3

# Check get_rope_caches implementation
rg -n "def get_rope_caches" modelopt/onnx/graph_surgery/utils/rope_cache.py -A 15
```
Remove network calls from this unit test fixture.
The fixture makes two network calls to download Hugging Face config: once in _get_config() and again inside replace_attention_with_gqa() → get_rope_caches(). This violates the guideline that unit tests in tests/unit should be fast CPU-based tests taking no more than a few seconds to run. Replace the HuggingFace config downloads with a small local config object and monkeypatch modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches to return deterministic, pre-computed caches.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 30 - 33, The test
currently downloads HF configs in _get_config() and again via
replace_attention_with_gqa() ->
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches; instead, create a
small local config dict/object in the test (matching only the keys used by
_get_config() and replace_attention_with_gqa(), e.g., vocab_size, n_head,
n_embd, max_position_embeddings, rotary/rope params) and update _get_config() to
return it, and monkeypatch
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches to return
deterministic, precomputed rope caches (e.g., simple arrays sized by
SEQ_LEN/MAX_SEQ_LEN) so no network calls occur during
replace_attention_with_gqa() or _get_config().
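A fixture along these lines would keep the test offline; the config fields, cache shapes, and patch targets below are assumptions drawn from the discussion above, not verified against the repository.

```python
from types import SimpleNamespace

import numpy as np
import pytest


@pytest.fixture
def offline_hf(monkeypatch):
    # Hypothetical stand-in for the HuggingFace config; field names are assumed, not verified.
    fake_cfg = SimpleNamespace(
        hidden_size=64,
        num_attention_heads=4,
        num_key_value_heads=2,
        max_position_embeddings=128,
        rope_theta=10000.0,
    )
    monkeypatch.setattr(
        "transformers.AutoConfig.from_pretrained", lambda *args, **kwargs: fake_cfg
    )

    # Deterministic, precomputed RoPE caches sized [max_seq_len, head_dim // 2].
    head_dim = fake_cfg.hidden_size // fake_cfg.num_attention_heads
    cos = np.ones((fake_cfg.max_position_embeddings, head_dim // 2), dtype=np.float32)
    sin = np.zeros_like(cos)
    # If gqa_replacement binds get_rope_caches via "from ... import", patch the name
    # at its use site instead of (or in addition to) the defining module.
    monkeypatch.setattr(
        "modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches",
        lambda *args, **kwargs: (cos, sin),
    )
    return fake_cfg
```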
```python
orig_logits = _run_session(orig, orig_feeds)[0].astype(np.float32)
gqa_logits = _run_session(gqa, gqa_feeds)[0].astype(np.float32)

diff = np.abs(orig_logits - gqa_logits)
finite = diff[np.isfinite(diff)]

print(f"\n  Original nodes: {len(orig.graph.node)} -> GQA nodes: {len(gqa.graph.node)}")
print(f"  Logits shape: {orig_logits.shape}")
print(f"  Original[0,:4]: {orig_logits[0, 0, :4]}")
print(f"  GQA     [0,:4]: {gqa_logits[0, 0, :4]}")
if len(finite) > 0:
    print(f"  Max abs diff: {finite.max():.6f}")
    print(f"  Mean abs diff: {finite.mean():.6f}")

assert len(finite) > 0, "All values are non-finite"
assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance"
```
Tighten the numerical assertion.
finite.max() < 1.0 will still pass with materially wrong logits. Since this is the only accuracy guard here, use a much smaller tolerance or assert_allclose().
Suggested change
assert len(finite) > 0, "All values are non-finite"
- assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance"
+    np.testing.assert_allclose(orig_logits, gqa_logits, rtol=1e-2, atol=1e-2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 680 - 695, The
current final assertion uses a very loose tolerance (finite.max() < 1.0)
comparing orig_logits and gqa_logits from _run_session; replace this with a
strict numerical comparison (e.g., numpy.testing.assert_allclose) or a much
smaller absolute tolerance (for example 1e-5 or 1e-3 depending on expected
precision) applied to orig_logits and gqa_logits (or to finite via
np.max(np.abs(...))) and keep the existing non-finite check; update the
assertion that references finite.max() to use the new tighter threshold or
assert_allclose so the test fails on materially wrong logits.
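For context on why assert_allclose is the sharper guard here: it checks |actual - desired| <= atol + rtol * |desired| element-wise and prints a detailed mismatch report on failure. A self-contained illustration with made-up values:

```python
import numpy as np

orig_logits = np.array([0.10, 0.20, 0.30], dtype=np.float32)
gqa_logits = orig_logits + 1e-3  # small rounding-style deviation

# Passes: every element satisfies |gqa - orig| <= atol + rtol * |orig|.
np.testing.assert_allclose(gqa_logits, orig_logits, rtol=1e-2, atol=1e-2)

# A materially wrong value now raises AssertionError with a mismatch report.
try:
    np.testing.assert_allclose(gqa_logits + 0.5, orig_logits, rtol=1e-2, atol=1e-2)
except AssertionError as err:
    print(err)
```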
vishalpandya1990
left a comment
Discussed offline, some suggestions on high-level flow here:
- Design changes / directory restructuring
- Add test case for DQ-transpose
- Make sure the API documentation is updated
- Example: surgery on a quantized model (can be added in a follow-up PR)
Overall looks good to me.
…erform this test Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
♻️ Duplicate comments (8)
modelopt/onnx/graph_surgery/gqa_replacement.py (5)
938-941:⚠️ Potential issue | 🟡 Minor
`cos_cache`/`sin_cache` are registered twice in `value_info`. This issue was previously identified. The
`add_initializer` function (from graph_utils.py) already appends corresponding `value_info` entries. Re-adding them here creates duplicate metadata entries.
Suggested fix
- # Add value_info for cos/sin caches - cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape)) - sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape)) - graph.value_info.extend([cos_cache_info, sin_cache_info])🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 938 - 941, Remove the duplicate value_info entries for the cos/sin caches by eliminating the explicit creation/extension of cos_cache_info and sin_cache_info in GQA replacement; rely on the existing add_initializer function in graph_utils.py (which already registers value_info when adding initializers) instead of calling graph.value_info.extend([cos_cache_info, sin_cache_info]) in this file, or add a presence check before extending to avoid duplicates (refer to the symbols cos_cache_info, sin_cache_info, graph.value_info.extend and add_initializer).
704-710: ⚠️ Potential issue | 🟠 Major — Defer inserting the fused
`qkv_proj`/Add until after attention cleanup. This issue was previously identified. The
qkv_add_nodeis appended tograph.nodeimmediately at line 710. During Steps 7-8, this node can be identified and removed because it doesn't match the preserved patterns for non-quantized models (lines 1161-1169 only keep specific MatMul nodes, not Add nodes).Suggested fix
- graph.node.append(qkv_add_node) + qkv_matmul_nodes.append(qkv_add_node)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 704 - 710, The fused qkv_proj/Add node (qkv_add_node created with name qkv_add_name and outputs qkv_add_output) is being appended to graph.node too early and will be removed during the later attention-cleanup steps; instead, delay inserting this qkv_add_node into graph.node until after the attention cleanup phase completes (i.e., create the node object but do not call graph.node.append(...) immediately — append it only after the steps that prune MatMul/Add patterns have run), ensuring qkv_proj/Add persists for non-quantized models.
241-244: ⚠️ Potential issue | 🟠 Major — Update type check to handle renamed suffixed output names.
This issue was previously identified. The
_add_bf16_lm_head_cast()function runs at Step 0.7, but I/O renaming (which converts suffixed names likelogits_###tologits) happens later at Step 2.5. When the graph has a suffixed logits output, this check only matches"logits"exactly, leaving the original suffixed output with incorrect type.Suggested fix
if lm_head_node and len(lm_head_node.output) > 0: + original_output_name = lm_head_node.output[0] # Create new output name for MatMul new_matmul_output = f"{lm_head_matmul_name}_output_bf16" - lm_head_node.output[0] = new_matmul_output + lm_head_node.output[0] = new_matmul_output @@ # Update logits output to FP32 for out in graph.output: - if out.name == "logits": + if out.name in {original_output_name, "logits"}: out.type.tensor_type.elem_type = TensorProto.FLOAT🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 241 - 244, The output-type update loop in _add_bf16_lm_head_cast only matches exact "logits" and misses suffixed names like "logits_###"; update the check that iterates graph.output (variable out) to treat any suffixed logits as logits by testing the base name (e.g., out.name.startswith("logits") or out.name.split("_",1)[0] == "logits") and then set out.type.tensor_type.elem_type = TensorProto.FLOAT for those entries so both suffixed and canonical outputs are handled.
1319-1324: ⚠️ Potential issue | 🔴 Critical — Use
`onnx.numpy_helper` to read and convert initializer data.
init.raw_datavianp.frombuffer()loses data when the initializer uses external storage or typed fields (e.g.,float_data,int32_data). This can corrupt the quantization scale.Suggested fix
- arr = np.frombuffer(init.raw_data, dtype=np.float32).copy() - arr_fp16 = arr.astype(np.float16) - init.raw_data = arr_fp16.tobytes() - init.data_type = TensorProto.FLOAT16 + arr_fp16 = onnx.numpy_helper.to_array(init).astype(np.float16) + converted_init = onnx.numpy_helper.from_array(arr_fp16, name=init.name) + init.CopyFrom(converted_init)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1319 - 1324, The loop in gqa_replacement.py that converts an initializer to float16 currently reads init.raw_data with np.frombuffer, which can lose data for external or typed initializers; replace that logic with onnx.numpy_helper.to_array(init) to get a numpy array reliably, cast it to np.float16, then use onnx.numpy_helper.from_array(cast_array, name=init.name) (or set init.raw_data = numpy_helper.from_array(...).raw_data and update init.data_type = TensorProto.FLOAT16) so the initializer preserves external/typed storage, dims, and metadata while converting the values to FLOAT16.
1407-1412:⚠️ Potential issue | 🔴 Critical
external_data_namecan remainNonefor cwd outputs.This issue was previously identified. If
output_pathhas no directory component (e.g., just"model.onnx"), the default external filename is never set because the assignment is inside theif output_dir:block. Ifuse_external_data=True, this causeslocation=Noneto be passed to the save functions.Suggested fix
output_dir = os.path.dirname(output_path) if output_dir: os.makedirs(output_dir, exist_ok=True) - if external_data_name is None: - external_data_name = os.path.basename(output_path) + "_data" + if external_data_name is None: + external_data_name = os.path.basename(output_path) + "_data"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1407 - 1412, external_data_name can stay None when output_path has no directory component because the default assignment is inside the if output_dir block; move the external_data_name defaulting logic so that if external_data_name is None it is set to os.path.basename(output_path) + "_data" regardless of output_dir (but only when use_external_data/save path logic requires it). Update the block around output_dir/output_path in gqa_replacement.py so the os.makedirs call stays conditional on output_dir but the external_data_name fallback runs unconditionally when external_data_name is None, referencing the external_data_name, output_path, and output_dir variables used by the existing save/export functions.tests/unit/onnx/test_gqa_graph_surgery.py (2)
586-596:⚠️ Potential issue | 🟠 MajorRemove network calls from this unit test fixture.
This issue was previously identified. The
_get_config()function makes a network call to download HuggingFace config viaAutoConfig.from_pretrained(MODEL_ID, ...). Additionally,replace_attention_with_gqa()makes another network call throughget_rope_caches(). This violates the coding guideline that unit tests intests/unitshould be fast CPU-based tests taking no more than a few seconds to run.Consider mocking
AutoConfig.from_pretrainedandget_rope_cachesto return deterministic, pre-computed values.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 586 - 596, The test fixture _get_config currently calls AutoConfig.from_pretrained(MODEL_ID) and replace_attention_with_gqa triggers get_rope_caches, which make network calls; instead, patch/mocks these to return deterministic in-memory values: patch AutoConfig.from_pretrained used in _get_config to return an object with attributes hidden_size, num_attention_heads, optional num_key_value_heads, and rope_theta so _get_config can compute hd and inv_freq locally, and patch get_rope_caches to return precomputed rope cache tensors; use your test's monkeypatch/pytest fixture to replace AutoConfig.from_pretrained and get_rope_caches calls so the unit test is CPU-only and fast while keeping the same return shapes/values expected by replace_attention_with_gqa and downstream code.
696-697:⚠️ Potential issue | 🟠 MajorTighten the numerical assertion.
This issue was previously identified. The tolerance
finite.max() < 1.0is extremely loose and will pass even with materially wrong logits. Since this is the only accuracy guard, use a much smaller tolerance.Suggested fix
assert len(finite) > 0, "All values are non-finite" - assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance" + np.testing.assert_allclose(orig_logits, gqa_logits, rtol=1e-2, atol=1e-2)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 696 - 697, The numerical assertion is too loose: replace the tolerance check on finite.max() (the second assertion that currently uses 1.0) with a much tighter threshold (e.g., 1e-3 or 1e-4) and update the error message accordingly; locate the assertions that reference the variable finite and change the line assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance" to use the new numeric tolerance and include the chosen tolerance value in the message.modelopt/onnx/graph_surgery/utils/rope_cache.py (1)
90-93:⚠️ Potential issue | 🟠 MajorAlways slice RoPE cache to rotary dimension width, not full head width.
This issue was previously identified. The conditional slice at line 91 only executes when
cos_cache.shape[-1] == head_dim, which is never true for partial rotary models (partial_rotary_factor < 1). For these models, the cache is returned with shape[max_seq_len, dim]instead of the correct[max_seq_len, dim // 2].Suggested fix
- # Slice to half (GQA expects this) - if cos_cache.shape[-1] == head_dim: - cos_cache = cos_cache[:, : head_dim // 2] - sin_cache = sin_cache[:, : head_dim // 2] + # Slice to half of the effective rotary dimension (GQA expects this) + rotary_cache_width = dim // 2 + cos_cache = cos_cache[:, :rotary_cache_width] + sin_cache = sin_cache[:, :rotary_cache_width]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/utils/rope_cache.py` around lines 90 - 93, The RoPE cache slicing currently only triggers when cos_cache.shape[-1] == head_dim, which misses partial-rotary models; change the logic so cos_cache and sin_cache are always sliced to the rotary width (use head_dim // 2) instead of relying on that conditional—i.e., remove the conditional and set cos_cache = cos_cache[:, : head_dim // 2] and sin_cache = sin_cache[:, : head_dim // 2] so the returned caches have shape [max_seq_len, head_dim//2] for partial and full rotary cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 938-941: Remove the duplicate value_info entries for the cos/sin
caches by eliminating the explicit creation/extension of cos_cache_info and
sin_cache_info in GQA replacement; rely on the existing add_initializer function
in graph_utils.py (which already registers value_info when adding initializers)
instead of calling graph.value_info.extend([cos_cache_info, sin_cache_info]) in
this file, or add a presence check before extending to avoid duplicates (refer
to the symbols cos_cache_info, sin_cache_info, graph.value_info.extend and
add_initializer).
- Around line 704-710: The fused qkv_proj/Add node (qkv_add_node created with
name qkv_add_name and outputs qkv_add_output) is being appended to graph.node
too early and will be removed during the later attention-cleanup steps; instead,
delay inserting this qkv_add_node into graph.node until after the attention
cleanup phase completes (i.e., create the node object but do not call
graph.node.append(...) immediately — append it only after the steps that prune
MatMul/Add patterns have run), ensuring qkv_proj/Add persists for non-quantized
models.
- Around line 241-244: The output-type update loop in _add_bf16_lm_head_cast
only matches exact "logits" and misses suffixed names like "logits_###"; update
the check that iterates graph.output (variable out) to treat any suffixed logits
as logits by testing the base name (e.g., out.name.startswith("logits") or
out.name.split("_",1)[0] == "logits") and then set
out.type.tensor_type.elem_type = TensorProto.FLOAT for those entries so both
suffixed and canonical outputs are handled.
- Around line 1319-1324: The loop in gqa_replacement.py that converts an
initializer to float16 currently reads init.raw_data with np.frombuffer, which
can lose data for external or typed initializers; replace that logic with
onnx.numpy_helper.to_array(init) to get a numpy array reliably, cast it to
np.float16, then use onnx.numpy_helper.from_array(cast_array, name=init.name)
(or set init.raw_data = numpy_helper.from_array(...).raw_data and update
init.data_type = TensorProto.FLOAT16) so the initializer preserves
external/typed storage, dims, and metadata while converting the values to
FLOAT16.
- Around line 1407-1412: external_data_name can stay None when output_path has
no directory component because the default assignment is inside the if
output_dir block; move the external_data_name defaulting logic so that if
external_data_name is None it is set to os.path.basename(output_path) + "_data"
regardless of output_dir (but only when use_external_data/save path logic
requires it). Update the block around output_dir/output_path in
gqa_replacement.py so the os.makedirs call stays conditional on output_dir but
the external_data_name fallback runs unconditionally when external_data_name is
None, referencing the external_data_name, output_path, and output_dir variables
used by the existing save/export functions.
In `@modelopt/onnx/graph_surgery/utils/rope_cache.py`:
- Around line 90-93: The RoPE cache slicing currently only triggers when
cos_cache.shape[-1] == head_dim, which misses partial-rotary models; change the
logic so cos_cache and sin_cache are always sliced to the rotary width (use
head_dim // 2) instead of relying on that conditional—i.e., remove the
conditional and set cos_cache = cos_cache[:, : head_dim // 2] and sin_cache =
sin_cache[:, : head_dim // 2] so the returned caches have shape [max_seq_len,
head_dim//2] for partial and full rotary cases.
In `@tests/unit/onnx/test_gqa_graph_surgery.py`:
- Around line 586-596: The test fixture _get_config currently calls
AutoConfig.from_pretrained(MODEL_ID) and replace_attention_with_gqa triggers
get_rope_caches, which make network calls; instead, patch/mocks these to return
deterministic in-memory values: patch AutoConfig.from_pretrained used in
_get_config to return an object with attributes hidden_size,
num_attention_heads, optional num_key_value_heads, and rope_theta so _get_config
can compute hd and inv_freq locally, and patch get_rope_caches to return
precomputed rope cache tensors; use your test's monkeypatch/pytest fixture to
replace AutoConfig.from_pretrained and get_rope_caches calls so the unit test is
CPU-only and fast while keeping the same return shapes/values expected by
replace_attention_with_gqa and downstream code.
- Around line 696-697: The numerical assertion is too loose: replace the
tolerance check on finite.max() (the second assertion that currently uses 1.0)
with a much tighter threshold (e.g., 1e-3 or 1e-4) and update the error message
accordingly; locate the assertions that reference the variable finite and change
the line assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds
tolerance" to use the new numeric tolerance and include the chosen tolerance
value in the message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dbdd275c-1479-48c2-ae96-e4b163228eb4
📒 Files selected for processing (3)
modelopt/onnx/graph_surgery/gqa_replacement.py
modelopt/onnx/graph_surgery/utils/rope_cache.py
tests/unit/onnx/test_gqa_graph_surgery.py
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
modelopt/onnx/graph_surgery/encoder_cross_kv.py (1)
416-418: ⚠️ Potential issue | 🟠 Major — Current-directory outputs still skip sidecar files.
When `output_path` is just a filename, `os.path.dirname(output_path)` is `""`, so current-directory saves never generate audio_processor_config.json or genai_config.json even though that path is enabled by default. Normalize `output_dir` once with `or "."` and reuse it for both the model save and sidecar writes.
Suggested fix
```diff
-    output_dir = os.path.dirname(output_path)
-    if output_dir:
-        os.makedirs(output_dir, exist_ok=True)
+    output_dir = os.path.dirname(output_path) or "."
+    os.makedirs(output_dir, exist_ok=True)
@@
-    output_dir = os.path.dirname(output_path)
     if output_dir:
```
Also applies to: 467-493
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py` around lines 416 - 418, The code computes output_dir = os.path.dirname(output_path) and skips creating sidecar files when output_path is a bare filename; change the normalization to set output_dir = os.path.dirname(output_path) or "." and then call os.makedirs(output_dir, exist_ok=True) and use this same output_dir whenever saving the model and writing sidecar files (e.g., audio_processor_config.json and genai_config.json); apply the same normalization/fix to the other save block handling sidecars (the block around the later save logic) so both model save and sidecar writes always target the normalized directory.
🧹 Nitpick comments (1)
tests/unit/onnx/test_dq_transpose_surgery.py (1)
53-59: Add one axis=1 fixture as well.
transpose_dequantize_linear_weights() also has an axis-rewrite path for DQ nodes that start on axis 1, but this builder only exercises axis 0 -> 1. A second minimal model covering axis=1 would make this suite much better at catching regressions in the other branch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/test_dq_transpose_surgery.py` around lines 53 - 59, Add a second test fixture that constructs a DequantizeLinear node with axis=1 (mirror the existing dq_node which uses axis=0) and build a minimal ONNX model exercising the axis=1 path so transpose_dequantize_linear_weights() is invoked on an axis=1 DQ node; update the test to run the same assertions for both fixtures (or add a new test function) referencing the existing test utilities and the node name "dq_weight" pattern (or use a new name like "dq_weight_axis1") so the axis-rewrite branch is covered and validated.
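For illustration, a minimal axis=1 model of the kind this nitpick asks for, built with onnx.helper only; the shapes and tensor names here are made up and not taken from the existing test.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper


def make_dq_axis1_model() -> onnx.ModelProto:
    # 4x8 int8 weight dequantized per-column (axis=1), feeding a MatMul.
    weight = numpy_helper.from_array(
        np.random.randint(-8, 8, size=(4, 8), dtype=np.int8), name="weight_q"
    )
    scale = numpy_helper.from_array(
        np.full((8,), 0.1, dtype=np.float32), name="weight_scale"
    )
    dq = helper.make_node(
        "DequantizeLinear", ["weight_q", "weight_scale"], ["weight_dq"],
        name="dq_weight_axis1", axis=1,
    )
    matmul = helper.make_node("MatMul", ["x", "weight_dq"], ["y"], name="matmul")
    graph = helper.make_graph(
        [dq, matmul], "dq_axis1",
        inputs=[helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])],
        outputs=[helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 8])],
        initializer=[weight, scale],
    )
    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
```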
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/__init__.py`:
- Around line 67-70: The module-level imports in modelopt.onnx.graph_surgery
eagerly import optional modules; change the __init__ to lazily re-export
transpose_dequantize_linear_weights, add_cross_kv_to_encoder,
replace_attention_with_gqa, and convert_fp16_to_bf16 by implementing __getattr__
(or thin wrapper callables) that perform the actual import inside the accessor;
for example, when name == "transpose_dequantize_linear_weights" import
.dq_transpose and return the symbol, likewise for add_cross_kv_to_encoder (from
.encoder_cross_kv), replace_attention_with_gqa (from .gqa_replacement), and
convert_fp16_to_bf16 (from .utils.dtype_conversion), so optional dependencies
are only imported when those symbols are first accessed.
In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py`:
- Around line 54-68: The helper currently logs model metadata with logger.info
unconditionally (see the logger.info calls around the
WhisperForConditionalGeneration loading and the subsequent size/shape logs)
which bypasses add_cross_kv_to_encoder's verbose flag; update
add_cross_kv_to_encoder to accept and forward a verbose boolean to this helper
(or accept a logger level) and change the helper to only emit these messages
when verbose is True, or alternatively downgrade those logger.info calls to
logger.debug; apply the same change to the other block referenced (the logs in
the 134-170 region) so both places respect the verbose/quiet setting.
---
Duplicate comments:
In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py`:
- Around line 416-418: The code computes output_dir =
os.path.dirname(output_path) and skips creating sidecar files when output_path
is a bare filename; change the normalization to set output_dir =
os.path.dirname(output_path) or "." and then call os.makedirs(output_dir,
exist_ok=True) and use this same output_dir whenever saving the model and
writing sidecar files (e.g., audio_processor_config.json and genai_config.json);
apply the same normalization/fix to the other save block handling sidecars (the
block around the later save logic) so both model save and sidecar writes always
target the normalized directory.
---
Nitpick comments:
In `@tests/unit/onnx/test_dq_transpose_surgery.py`:
- Around line 53-59: Add a second test fixture that constructs a
DequantizeLinear node with axis=1 (mirror the existing dq_node which uses
axis=0) and build a minimal ONNX model exercising the axis=1 path so
transpose_dequantize_linear_weights() is invoked on an axis=1 DQ node; update
the test to run the same assertions for both fixtures (or add a new test
function) referencing the existing test utilities and the node name "dq_weight"
pattern (or use a new name like "dq_weight_axis1") so the axis-rewrite branch is
covered and validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cbc44444-e980-4b34-abf2-96f783686023
📒 Files selected for processing (4)
modelopt/onnx/graph_surgery/__init__.py
modelopt/onnx/graph_surgery/__main__.py
modelopt/onnx/graph_surgery/encoder_cross_kv.py
tests/unit/onnx/test_dq_transpose_surgery.py
| from .dq_transpose import transpose_dequantize_linear_weights | ||
| from .encoder_cross_kv import add_cross_kv_to_encoder | ||
| from .gqa_replacement import replace_attention_with_gqa | ||
| from .utils.dtype_conversion import convert_fp16_to_bf16 |
Lazy-load these re-exports.
Importing modelopt.onnx.graph_surgery now eagerly imports every surgery module, so environments missing some [onnx]/[hf] extras can fail before the caller even touches those features. Please re-export these symbols lazily (__getattr__, thin wrappers, or guarded imports) instead of importing them unconditionally at module import time.
Based on learnings: Avoid hard imports of optional dependencies at module level; gate features by install extras ([onnx], [hf], [all]).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/__init__.py` around lines 67 - 70, The
module-level imports in modelopt.onnx.graph_surgery eagerly import optional
modules; change the __init__ to lazily re-export
transpose_dequantize_linear_weights, add_cross_kv_to_encoder,
replace_attention_with_gqa, and convert_fp16_to_bf16 by implementing __getattr__
(or thin wrapper callables) that perform the actual import inside the accessor;
for example, when name == "transpose_dequantize_linear_weights" import
.dq_transpose and return the symbol, likewise for add_cross_kv_to_encoder (from
.encoder_cross_kv), replace_attention_with_gqa (from .gqa_replacement), and
convert_fp16_to_bf16 (from .utils.dtype_conversion), so optional dependencies
are only imported when those symbols are first accessed.
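For reference, a minimal sketch of the PEP 562 pattern being suggested; the mapping mirrors the four re-exports quoted above, while the exact layout is left to the PR author.

```python
# modelopt/onnx/graph_surgery/__init__.py (sketch, not the PR's actual file)
import importlib

_LAZY_EXPORTS = {
    "transpose_dequantize_linear_weights": ".dq_transpose",
    "add_cross_kv_to_encoder": ".encoder_cross_kv",
    "replace_attention_with_gqa": ".gqa_replacement",
    "convert_fp16_to_bf16": ".utils.dtype_conversion",
}

__all__ = list(_LAZY_EXPORTS)


def __getattr__(name):
    # Import the owning module only when the symbol is first accessed, so missing
    # optional extras do not break `import modelopt.onnx.graph_surgery`.
    if name in _LAZY_EXPORTS:
        module = importlib.import_module(_LAZY_EXPORTS[name], __name__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```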
| logger.info(f"Loading PyTorch model: {model_id}") | ||
| model = WhisperForConditionalGeneration.from_pretrained( | ||
| model_id, trust_remote_code=trust_remote_code | ||
| ) | ||
|
|
||
| weights = {} | ||
| num_layers = model.config.decoder_layers | ||
| num_heads = model.config.decoder_attention_heads | ||
| hidden_size = model.config.d_model | ||
| head_size = hidden_size // num_heads | ||
|
|
||
| logger.info(f" num_layers: {num_layers}") | ||
| logger.info(f" num_heads: {num_heads}") | ||
| logger.info(f" hidden_size: {hidden_size}") | ||
| logger.info(f" head_size: {head_size}") |
verbose=False does not fully suppress progress logs.
add_cross_kv_to_encoder(verbose=False) still calls helpers that emit logger.info(...) unconditionally, so the CLI's --quiet flag keeps printing model metadata and layer-count messages. Please thread verbose into these helpers or downgrade these messages to debug.
Also applies to: 134-170
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py` around lines 54 - 68, The
helper currently logs model metadata with logger.info unconditionally (see the
logger.info calls around the WhisperForConditionalGeneration loading and the
subsequent size/shape logs) which bypasses add_cross_kv_to_encoder's verbose
flag; update add_cross_kv_to_encoder to accept and forward a verbose boolean to
this helper (or accept a logger level) and change the helper to only emit these
messages when verbose is True, or alternatively downgrade those logger.info
calls to logger.debug; apply the same change to the other block referenced (the
logs in the 134-170 region) so both places respect the verbose/quiet setting.
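A small sketch of the two options mentioned above; the function and parameter names are placeholders for the Whisper weight-loading helper being discussed, not the PR's real signatures.

```python
import logging

logger = logging.getLogger(__name__)


# Option 1: downgrade progress messages, assuming --quiet raises the level above INFO.
def load_decoder_weights(model_id: str) -> None:
    logger.debug("Loading PyTorch model: %s", model_id)  # was logger.info(...)


# Option 2: thread the caller's verbose flag through the helper.
def load_decoder_weights_verbose(model_id: str, verbose: bool = True) -> None:
    log = logger.info if verbose else logger.debug
    log("Loading PyTorch model: %s", model_id)
```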
```python
k_weight_t = k_weight.T.astype(np_dtype)
v_weight_t = v_weight.T.astype(np_dtype)

# Add weight initializers
k_weight_name = f"encoder.cross_attn_k_weight.{layer_idx}"
v_weight_name = f"encoder.cross_attn_v_weight.{layer_idx}"

new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
```
🧩 Analysis chain
🏁 Scripts executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== detect_model_dtype BF16 mapping =="
fd graph_utils.py modelopt | xargs -r sed -n '117,160p'
echo
echo "== BF16-sensitive initializer creation in encoder_cross_kv.py =="
fd encoder_cross_kv.py modelopt | xargs -r sed -n '184,245p'
echo
echo "== Existing BF16 tensor helpers in graph_surgery =="
rg -n "BFLOAT16|bfloat16|ml_dtypes|numpy_helper\.from_array|make_tensor\(" modelopt/onnx/graph_surgery

# Verify the array_to_initializer usage pattern in gqa_replacement.py to confirm it's the correct approach
fd encoder_cross_kv.py modelopt | xargs -r grep -A 3 "detect_model_dtype"

# Check how detect_model_dtype is used in encoder_cross_kv.py
fd encoder_cross_kv.py modelopt | xargs -r grep -B 5 -A 5 "detect_model_dtype\|onnx_dtype\|np_dtype"

# Get the full function context to understand dtype flow
fd encoder_cross_kv.py modelopt | xargs -r sed -n '1,250p' | head -n 200
```
Use array_to_initializer() helper for BF16-aware weight and bias serialization.
detect_model_dtype() maps BF16 models to np.float32 for NumPy compatibility, but this code passes that dtype directly to numpy_helper.from_array(...). For BF16 models, the new K/V weights and biases (lines 191–199, 225–242) will serialize as FLOAT initializers while the output tensors use BFLOAT16, creating a type mismatch in MatMul/Add operations that breaks graph validation and runtime execution.
Use the existing array_to_initializer(arr, name, is_bfloat16=True) helper (as shown in gqa_replacement.py lines 631, 697) to properly serialize tensors with the detected dtype, or insert explicit Cast nodes around the new subgraph.
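For readers unfamiliar with the problem, here is a minimal sketch of how a BFLOAT16 initializer can be serialized from a NumPy array; this illustrates the idea only and is not the repository's array_to_initializer implementation.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper


def bf16_initializer(arr: np.ndarray, name: str) -> onnx.TensorProto:
    """Serialize a float array as a BFLOAT16 initializer by truncating fp32 bits."""
    # bfloat16 is the top 16 bits of an IEEE float32 value (simple truncation here,
    # not round-to-nearest-even, which is usually acceptable for weights).
    as_uint32 = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
    bf16_bits = (as_uint32 >> 16).astype(np.uint16)
    return helper.make_tensor(
        name=name,
        data_type=TensorProto.BFLOAT16,
        dims=list(arr.shape),
        vals=bf16_bits.tobytes(),
        raw=True,
    )
```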
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/gqa_replacement.py (1)
1419-1434: Redundant call to `convert_model_to_external_data` before `onnx.save`.
`onnx.save()` with `save_as_external_data=True` handles external data conversion internally. The prior `convert_model_to_external_data()` call is unnecessary overhead.
Suggested simplification
```diff
-        convert_model_to_external_data(
-            model,
-            all_tensors_to_one_file=True,
-            location=external_data_name,
-            size_threshold=1024,
-            convert_attribute=False,
-        )
-
         onnx.save(
             model,
             output_path,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1419 - 1434, The code redundantly calls convert_model_to_external_data(model, ...) immediately before onnx.save(..., save_as_external_data=True, ...); remove the explicit convert_model_to_external_data invocation and rely on onnx.save to perform external data conversion (keep the onnx.save call with save_as_external_data=True, all_tensors_to_one_file=True, location=external_data_name, size_threshold=1024 and write to output_path). This eliminates unnecessary overhead while preserving the same external-data behavior for model, external_data_name and output_path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 1419-1434: The code redundantly calls
convert_model_to_external_data(model, ...) immediately before onnx.save(...,
save_as_external_data=True, ...); remove the explicit
convert_model_to_external_data invocation and rely on onnx.save to perform
external data conversion (keep the onnx.save call with
save_as_external_data=True, all_tensors_to_one_file=True,
location=external_data_name, size_threshold=1024 and write to output_path). This
eliminates unnecessary overhead while preserving the same external-data behavior
for model, external_data_name and output_path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bea3482-be27-4eff-aa08-16ad595f0353
📒 Files selected for processing (1)
modelopt/onnx/graph_surgery/gqa_replacement.py
What does this PR do?
Type of change: new feature
Adds the ONNX Graph Surgery framework — graph-level transformations on exported ONNX models for optimized inference with ONNX Runtime.
GQA Attention Replacement — Replaces native attention subgraphs (Q/K/V projections, RoPE, softmax, KV cache) with a single fused GroupQueryAttention operator. Handles RoPE cache computation, attention mask reformatting, Q/K/V weight fusion, and KV cache I/O automatically from a HuggingFace model ID. Supports FP16/BF16, INT4/AWQ quantized, and combined QKV models.
DequantizeLinear Weight Transpose — Transposes quantized weights in DequantizeLinear nodes to column-major layout for providers like NvTensorRtRtx. Handles INT4/UINT4 packed formats.
Whisper Encoder Cross-Attention KV — Adds cross-attention K/V projection outputs to the Whisper encoder for the ONNX Runtime GenAI pipeline. Loads cross-attention weights from HuggingFace and generates genai_config.json.
Usage
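The snippet below is an illustrative call into the new package's Python entry points; the keyword arguments are placeholders rather than confirmed signatures, so consult the API documentation for the real parameters.

```python
from modelopt.onnx.graph_surgery import (
    replace_attention_with_gqa,
    transpose_dequantize_linear_weights,
)

# Hypothetical arguments shown for illustration only.
replace_attention_with_gqa(
    onnx_path="decoder_model.onnx",
    output_path="decoder_model_gqa.onnx",
    hf_model_id="Qwen/Qwen2.5-0.5B",
)
transpose_dequantize_linear_weights(
    onnx_path="decoder_model_gqa.onnx",
    output_path="decoder_model_gqa_transposed.onnx",
)
```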
Testing
Added a test case for GQA graph surgery at tests/unit/onnx/test_gqa_graph_surgery.py. It builds a toy attention subgraph with native ops matching the real Optimum export pattern and applies GQA surgery on it. The end outputs of the original and surgically modified models were compared and match exactly.
Summary by CodeRabbit
New Features
Tests