Graph Surgery Framework for TRT-RTX#992
Conversation
…-RTX and ORT-GenAI Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough: Adds a new modelopt.onnx.graph_surgery package with a CLI, utilities, and transformations: GroupQueryAttention replacement, encoder cross-attention KV output injection, FP16→BF16 conversion, and DequantizeLinear weight transposition, plus graph helper utilities and unit tests.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 6
🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/dq_transpose.py (1)
125-143: Consider consolidating duplicate helper functions. These helper functions (_find_initializer, _find_node_by_output, _get_consumers) duplicate functionality in modelopt/onnx/graph_surgery/utils/graph_utils.py. Consider importing from there to reduce code duplication.
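A minimal sketch of that consolidation, assuming graph_utils exports helpers under these (or similar) names; aliasing keeps the existing underscore-prefixed call sites in dq_transpose.py unchanged:

```python
# Sketch only: the exact names exported by graph_utils may differ.
from modelopt.onnx.graph_surgery.utils.graph_utils import (
    find_initializer as _find_initializer,
    find_node_by_output as _find_node_by_output,
    get_consumers as _get_consumers,
)
```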
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 926-933: The code is adding cos_cache and sin_cache to
graph.value_info twice; remove the manual creation and extension of
cos_cache_info and sin_cache_info and rely on add_initializer (from graph_utils)
to register the value_info; specifically delete the
helper.make_tensor_value_info(...) lines and the
graph.value_info.extend([cos_cache_info, sin_cache_info]) call in
gqa_replacement.py so only add_initializer creates the metadata, or if you must
keep them, first check for existing entries before extending to avoid
duplicates.
- Around line 1398-1408: external_data_name can stay None when output_path has
no directory, causing use_external_data to attempt saving with location=None;
ensure external_data_name is set whenever use_external_data is true and
external_data_name is None by assigning external_data_name =
os.path.basename(output_path) + "_data" outside (or after) the output_dir
conditional so that a default filename is created even for cwd outputs; update
the block around output_dir/external_data_name/use_external_data (references:
external_data_name, output_path, use_external_data) to set the default
unconditionally when needed (see the sketch after this list).
- Around line 1306-1316: The current code reads initializer bytes with
np.frombuffer(init.raw_data) which can corrupt data for external/typed
initializers; instead use onnx.numpy_helper.to_array(init) to get a safe numpy
array, cast it to float16, then create a new TensorProto with
onnx.numpy_helper.from_array(arr_fp16, name=init.name) and replace the original
initializer in graph.initializer (or assign its raw_data/data_type from the new
TensorProto), ensuring you preserve init.name and dims; update the block that
inspects graph.node for quant_mul_name and scale_name to use
numpy_helper.to_array/from_array rather than direct raw_data access (see the sketch after this list).
In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py`:
- Around line 121-130: Remove the special-case handling that rewrites
"value_float" into BF16: delete the branch that checks attr.name ==
"value_float" and the code that creates new_tensor, sets data_type to
TensorProto.BFLOAT16, converts via fp16_to_bf16_array, assigns raw_data, calls
attr.t.CopyFrom and renames attr.name to "value"; leave the existing "value"
attribute logic (the guarded path that checks tensor element types) intact so
native FP32 scalar attributes are not downconverted by the FP16→BF16 pass.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 243-246: Guard against None for hidden_size/num_attention_heads
before computing head_size: check that hidden_size and num_attention_heads are
not None (and num_attention_heads != 0) when hf_model_id is None, and if they
are missing raise a clear ValueError indicating that either hf_model_id must be
provided or both hidden_size and num_attention_heads must be supplied; update
the code around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 37-38: The cross-attention K/V weight and bias initializers are
always converted with numpy_helper.from_array so BF16 models detected by
detect_model_dtype end up with FLOAT proto types; update the serialization code
that creates these initializers (the places that currently call
numpy_helper.from_array for cross-attention key/value weights/biases) to use
array_to_initializer with the is_bfloat16 flag when onnx_dtype ==
TensorProto.BFLOAT16, and convert the underlying float32 arrays via
fp16_to_bf16_array (do float32 → float16 → bfloat16) before passing to
array_to_initializer so the TensorProto dtype matches BF16 models.
---
Nitpick comments:
In `@modelopt/onnx/graph_surgery/dq_transpose.py`:
- Around line 125-143: The three helper functions _find_initializer,
_find_node_by_output, and _get_consumers in dq_transpose.py duplicate utilities
already provided in modelopt/onnx/graph_surgery/utils/graph_utils.py; remove
these local implementations and import the corresponding functions from
graph_utils (e.g., import find_initializer, find_node_by_output, get_consumers
or their exact exported names) and update all local call sites in
dq_transpose.py to use the imported functions (ensure names match or alias
imports to keep existing call sites like _find_initializer,
_find_node_by_output, _get_consumers working).
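For the external-data finding above (gqa_replacement.py, lines 1398-1408), a minimal sketch of the suggested default; the helper name is hypothetical and the real code operates on local variables rather than a function:

```python
import os


def default_external_data_name(
    output_path: str, external_data_name: str | None, use_external_data: bool
) -> str | None:
    """Sketch: always pick a sidecar filename when external data is enabled."""
    if use_external_data and external_data_name is None:
        # Works even when output_path has no directory component (cwd outputs),
        # so the saver never receives location=None.
        external_data_name = os.path.basename(output_path) + "_data"
    return external_data_name
```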
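For the raw-data finding above (gqa_replacement.py, lines 1306-1316), a sketch of the safer cast via onnx.numpy_helper; the function name is hypothetical, and externally stored tensors would additionally need a base_dir passed to to_array:

```python
import numpy as np
from onnx import TensorProto, numpy_helper


def cast_initializer_to_fp16(init: TensorProto) -> None:
    """Sketch: swap an initializer's payload to FP16 without reading raw_data directly."""
    # to_array decodes typed tensors correctly, unlike np.frombuffer(init.raw_data),
    # which assumes a raw little-endian float buffer.
    arr = numpy_helper.to_array(init)
    fp16 = numpy_helper.from_array(arr.astype(np.float16), name=init.name)
    init.CopyFrom(fp16)  # keeps the original name; dims come from the converted array
```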
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8fa348a4-244d-4101-9817-ae0b39af1d53
📒 Files selected for processing (12)
modelopt/onnx/graph_surgery/__init__.py
modelopt/onnx/graph_surgery/__main__.py
modelopt/onnx/graph_surgery/dq_transpose.py
modelopt/onnx/graph_surgery/gqa_replacement.py
modelopt/onnx/graph_surgery/utils/__init__.py
modelopt/onnx/graph_surgery/utils/dtype_conversion.py
modelopt/onnx/graph_surgery/utils/graph_utils.py
modelopt/onnx/graph_surgery/utils/rope_cache.py
modelopt/onnx/graph_surgery/utils/whisper_utils.py
modelopt/onnx/graph_surgery/whisper/__init__.py
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tests/unit/onnx/test_gqa_graph_surgery.py
add_initializer(graph, "cos_cache", cos_cache, onnx_dtype)
add_initializer(graph, "sin_cache", sin_cache, onnx_dtype)

# Add value_info for cos/sin caches
cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
graph.value_info.extend([cos_cache_info, sin_cache_info])
cos_cache/sin_cache are registered twice in value_info.
add_initializer already appends corresponding value_info entries (modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 435-461). Re-adding them here creates duplicate metadata entries.
Suggested change
- # Add value_info for cos/sin caches
- cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
- sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
- graph.value_info.extend([cos_cache_info, sin_cache_info])
# Handle value_float attribute
if attr.name == "value_float":
    fp32_val = np.array([attr.f], dtype=np.float32)
    bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
    new_tensor = onnx.TensorProto()
    new_tensor.data_type = TensorProto.BFLOAT16
    new_tensor.raw_data = bf16_uint16.tobytes()
    attr.t.CopyFrom(new_tensor)
    attr.name = "value"
    return True
🧩 Analysis chain
🏁 Scripts executed:
cat -n modelopt/onnx/graph_surgery/utils/dtype_conversion.py | head -150
cat -n modelopt/onnx/graph_surgery/utils/dtype_conversion.py | sed -n '90,140p'
cat -n modelopt/onnx/graph_surgery/utils/dtype_conversion.py | sed -n '1,150p'
Remove the value_float attribute conversion block.
The function converts FP16 constants to BF16, but value_float encodes native FP32 scalars. Converting them to BF16 introduces unnecessary precision loss (FP32→FP16→BF16) and changes semantics—these constants were not originally FP16 and should not be rewritten in an FP16-to-BF16 pass. The value attribute conversion correctly guards against this with a type check (line 105); value_float lacks this guard.
Suggested change
- # Handle value_float attribute
- if attr.name == "value_float":
- fp32_val = np.array([attr.f], dtype=np.float32)
- bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
- new_tensor = onnx.TensorProto()
- new_tensor.data_type = TensorProto.BFLOAT16
- new_tensor.raw_data = bf16_uint16.tobytes()
- attr.t.CopyFrom(new_tensor)
- attr.name = "value"
- return True
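A standalone illustration of the precision and semantics point above (not code from this PR): a native FP32 scalar routed through FP16 can be corrupted outright, for example by overflowing the FP16 range, whereas a direct FP32-to-BF16 truncation keeps it finite:

```python
import numpy as np

x = np.float32(70000.0)          # representable in FP32 and BF16, not in FP16

via_fp16 = x.astype(np.float16)  # -> inf, FP16 tops out around 65504

# Direct FP32 -> BF16 by truncating the low 16 bits (round-toward-zero, for illustration only).
bf16_bits = np.uint32(x.view(np.uint32) >> np.uint32(16))
recovered = np.uint32(bf16_bits << np.uint32(16)).view(np.float32)

print(via_fp16)   # inf
print(recovered)  # 69632.0, finite with only a truncation error
```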
# Compute head_size
head_size = hidden_size // num_attention_heads
Guard against None values when hf_model_id is not provided.
If hf_model_id is None and the caller doesn't supply hidden_size or num_attention_heads, line 245 will raise a TypeError when attempting division with None.
🛡️ Proposed fix to add validation
+ # Validate required parameters
+ if hidden_size is None or num_attention_heads is None:
+ raise ValueError(
+ "hidden_size and num_attention_heads must be provided when hf_model_id is not specified"
+ )
+
# Compute head_size
head_size = hidden_size // num_attention_heads
from ..utils.graph_utils import detect_model_dtype
🧩 Analysis chain
🏁 Scripts executed:
find modelopt -name "encoder_cross_kv.py" -type f
find modelopt -name "graph_utils.py" -o -name "dtype_conversion.py" | head -20
cd modelopt && find . -type f -name "*.py" | grep -E "(graph_utils|dtype_conversion)" | head -20
wc -l modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
cat -n modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
cat -n modelopt/onnx/graph_surgery/utils/graph_utils.py
head -100 modelopt/onnx/graph_surgery/utils/dtype_conversion.py
sed -n '180,195p' modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
sed -n '215,240p' modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tail -50 modelopt/onnx/graph_surgery/utils/dtype_conversion.py
rg "array_to_initializer" modelopt --type py -B 2 -A 2
sed -n '1,50p' modelopt/onnx/graph_surgery/gqa_replacement.py | head -60
grep -B 5 -A 10 "is_bfloat16" modelopt/onnx/graph_surgery/gqa_replacement.py | head -60
grep -B 3 -A 3 "fp16_to_bf16_array" modelopt/onnx/graph_surgery/utils/graph_utils.py
Add BF16-aware initializer serialization for cross-attention K/V weights and biases.
detect_model_dtype allows detecting BF16 models, but these initializers are always created with numpy_helper.from_array(...). Since NumPy doesn't support bfloat16, when the model is detected as BF16 the weights remain as FP32 arrays, causing numpy_helper.from_array() to create FLOAT TensorProtos instead of BFLOAT16. This dtype mismatch will cause inference failures.
Use array_to_initializer with the is_bfloat16 flag and convert float32 → float16 → bfloat16 via fp16_to_bf16_array when onnx_dtype == TensorProto.BFLOAT16.
Suggested change
-from ..utils.graph_utils import detect_model_dtype
+from ..utils.graph_utils import array_to_initializer, detect_model_dtype
+from ..utils.dtype_conversion import fp16_to_bf16_array
@@
- new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
- new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+ if onnx_dtype == TensorProto.BFLOAT16:
+ k_bf16 = fp16_to_bf16_array(k_weight_t.astype(np.float16))
+ v_bf16 = fp16_to_bf16_array(v_weight_t.astype(np.float16))
+ new_initializers.append(array_to_initializer(k_bf16, k_weight_name, is_bfloat16=True))
+ new_initializers.append(array_to_initializer(v_bf16, v_weight_name, is_bfloat16=True))
+ else:
+ new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
+ new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
@@
- new_initializers.append(
- numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
- )
+ if onnx_dtype == TensorProto.BFLOAT16:
+ k_bias_bf16 = fp16_to_bf16_array(k_bias.astype(np.float16))
+ new_initializers.append(
+ array_to_initializer(k_bias_bf16, k_bias_name, is_bfloat16=True)
+ )
+ else:
+ new_initializers.append(
+ numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
+ )
@@
- new_initializers.append(
- numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
- )
+ if onnx_dtype == TensorProto.BFLOAT16:
+ v_bias_bf16 = fp16_to_bf16_array(v_bias.astype(np.float16))
+ new_initializers.append(
+ array_to_initializer(v_bias_bf16, v_bias_name, is_bfloat16=True)
+ )
+ else:
+ new_initializers.append(
+ numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
+ )
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #992 +/- ##
==========================================
- Coverage 72.12% 70.10% -2.02%
==========================================
Files 209 220 +11
Lines 23628 25240 +1612
==========================================
+ Hits 17042 17695 +653
- Misses 6586 7545 +959
☔ View full report in Codecov by Sentry.
Actionable comments posted: 8
♻️ Duplicate comments (2)
modelopt/onnx/graph_surgery/utils/whisper_utils.py (1)
246-247: ⚠️ Potential issue | 🟠 Major
Validate hidden_size and num_attention_heads before computing head_size. When hf_model_id is omitted, Line 247 can still raise TypeError or ZeroDivisionError here. Please fail fast with a clear ValueError instead of letting the arithmetic blow up.
Proposed fix
- # Compute head_size
- head_size = hidden_size // num_attention_heads
+ # Compute head_size
+ if hidden_size is None or num_attention_heads is None or num_attention_heads <= 0:
+     raise ValueError(
+         "Either hf_model_id must be provided, or hidden_size and "
+         "a positive num_attention_heads must be supplied."
+     )
+ head_size = hidden_size // num_attention_heads
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py (1)
198-199: ⚠️ Potential issue | 🟠 Major
Serialize BF16 cross-attention tensors as BF16, not FLOAT.
detect_model_dtype() maps BF16 models to np.float32 in modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 117-144, so numpy_helper.from_array(...) emits FLOAT initializers here. The new MatMul/Add nodes then mix BF16 activations with FLOAT weights/biases. Reuse the raw-data BF16 initializer path already used in modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 413-432, for both weights and biases.
Suggested change
- new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
- new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+ new_initializers.append(_to_initializer(k_weight_t, k_weight_name))
+ new_initializers.append(_to_initializer(v_weight_t, v_weight_name))
@@
- new_initializers.append(
-     numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
- )
+ new_initializers.append(_to_initializer(k_bias, k_bias_name))
@@
- new_initializers.append(
-     numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
- )
+ new_initializers.append(_to_initializer(v_bias, v_bias_name))
Apply the same helper to weights and biases so the BF16 path uses raw BF16 storage instead of FLOAT TensorProtos.
Also applies to: 223-243, 387-409
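The suggested change above calls a _to_initializer helper that is not shown. A possible sketch, with onnx_dtype passed explicitly (the review's version presumably closes over it) and a plain bit truncation standing in for the repo's fp16_to_bf16_array, whose exact behavior is not shown here:

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper


def _to_initializer(arr: np.ndarray, name: str, onnx_dtype: int) -> TensorProto:
    """Sketch: emit a raw-data BF16 initializer for BF16 models, FLOAT otherwise."""
    if onnx_dtype == TensorProto.BFLOAT16:
        # Truncate FP32 values to BF16 bit patterns and store them as raw data.
        bf16_bits = (arr.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)
        return helper.make_tensor(
            name=name,
            data_type=TensorProto.BFLOAT16,
            dims=list(arr.shape),
            vals=bf16_bits.tobytes(),
            raw=True,
        )
    return numpy_helper.from_array(arr.astype(np.float32), name=name)
```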
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 41-54: The module currently imports AutoConfig and get_rope_caches
at module scope causing an unconditional dependency on transformers; move or
gate these imports so ONNX-only use-cases don't require HF. Specifically, remove
top-level imports of AutoConfig and get_rope_caches and instead import
AutoConfig inside functions that need HF behavior (e.g., any GQA/rope-related
functions) and import/get_rope_caches lazily where used (or wrap in a try/except
ImportError that raises a clear message when HF features are invoked). Ensure
identifiers AutoConfig and get_rope_caches are only referenced after their
lazy/guarded import so functions like analyze_attention_pattern and other
ONNX-only entry points can be imported without transformers installed (see the sketch after this list).
- Around line 703-709: The fused qkv_proj/Add node (qkv_add_node) is being
appended to graph.node immediately which causes it to be removed during the
attention cleanup (Steps 7-8) or to appear before its producer qkv_matmul_node;
instead, stop appending qkv_add_node to graph.node at creation and defer
insertion until after the attention cleanup/retention logic completes—add it
only once attention-related removals are done and qkv_matmul_node has been
inserted (ensure insertion order places qkv_matmul_node before qkv_add_node).
- Around line 219-244: The _add_bf16_lm_head_cast() logic only updates outputs
named exactly "logits", leaving suffixed names like "logits_###" with the wrong
BF16/FP16 type; change the output-type update loop (the loop that iterates
graph.output and checks out.name) to match either exact "logits" or suffixed
variants (e.g., out.name == "logits" or out.name.startswith("logits") / a regex
like r"^logits(_\d+)?$") and set out.type.tensor_type.elem_type =
TensorProto.FLOAT for all matches so the renamed suffixed output (created via
lm_head_matmul_name -> new_matmul_output) ends up with FP32 type consistent with
the Cast node.
In `@modelopt/onnx/graph_surgery/utils/rope_cache.py`:
- Around line 63-66: The RoPE cache slicing uses head_dim in the condition but
must use the computed rotary width dim (head_dim * partial_rotary_factor);
update the logic that inspects cos_cache.shape[-1] to compare to dim (not
head_dim) and always slice the returned caches to the rotary width and then to
the expected half-width for GroupQueryAttention (i.e., return cos_cache[...,
:dim//2] and similarly for sin_cache), and apply the same fix to the other
analogous block (lines referenced around 89-102) so both cosine and sine caches
are correctly trimmed when partial_rotary_factor < 1.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 389-470: The helpers update_genai_config_encoder and
update_genai_config_decoder currently proceed and overwrite the file even when
"model.encoder" or "model.decoder" (and their expected nested
"inputs"/"outputs") are missing, which can also lead to KeyError; before
mutating or writing, validate that config contains "model" and the appropriate
"encoder"/"decoder" dicts and that the nested "inputs"/"outputs" keys exist, and
if any required section is missing raise a clear ValueError (e.g., "Missing
'model.encoder' section in genai_config.json" or "Missing
'model.decoder.outputs'") instead of continuing; only perform the assignments
and write back the file after these checks pass in update_genai_config_encoder
and update_genai_config_decoder.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 467-493: The code skips writing sidecar files when output_path is
a bare filename because output_dir becomes an empty string; change the logic
that computes or checks output_dir so that a cwd target is treated as "."
instead of "" (e.g., set output_dir = os.path.dirname(output_path) or "." when
that results in ""), then continue to call save_audio_processor_config(...) and
_save_genai_config(...) when generate_genai_config is True; update references
around output_path, output_dir, save_audio_processor_config, and
_save_genai_config to use the normalized output_dir so current-directory outputs
produce audio_processor_config.json and genai_config.json as intended.
In `@tests/unit/onnx/test_gqa_graph_surgery.py`:
- Around line 680-695: The current final assertion uses a very loose tolerance
(finite.max() < 1.0) comparing orig_logits and gqa_logits from _run_session;
replace this with a strict numerical comparison (e.g.,
numpy.testing.assert_allclose) or a much smaller absolute tolerance (for example
1e-5 or 1e-3 depending on expected precision) applied to orig_logits and
gqa_logits (or to finite via np.max(np.abs(...))) and keep the existing
non-finite check; update the assertion that references finite.max() to use the
new tighter threshold or assert_allclose so the test fails on materially wrong
logits.
- Around line 30-33: The test currently downloads HF configs in _get_config()
and again via replace_attention_with_gqa() ->
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches; instead, create a
small local config dict/object in the test (matching only the keys used by
_get_config() and replace_attention_with_gqa(), e.g., vocab_size, n_head,
n_embd, max_position_embeddings, rotary/rope params) and update _get_config() to
return it, and monkeypatch
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches to return
deterministic, precomputed rope caches (e.g., simple arrays sized by
SEQ_LEN/MAX_SEQ_LEN) so no network calls occur during
replace_attention_with_gqa() or _get_config() (see the sketch after this list).
---
Duplicate comments:
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 246-247: Before computing head_size = hidden_size //
num_attention_heads, validate that hidden_size and num_attention_heads are
non-None integers and that num_attention_heads is > 0 (and optionally that
hidden_size % num_attention_heads == 0 if required by the model). Replace the
direct division in the block around head_size with a guard that raises a clear
ValueError identifying hidden_size and num_attention_heads when they are invalid
(None, wrong type, or zero), referencing the variables hidden_size and
num_attention_heads in the message so callers see actionable failure instead of
TypeError/ZeroDivisionError.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 198-199: The code currently creates BF16 cross-attention
initializers using numpy_helper.from_array (see
new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
and v_weight_t/v_weight_name), which forces FLOAT TensorProtos because
detect_model_dtype maps BF16 to np.float32; replace these
numpy_helper.from_array calls with the BF16 raw-data initializer helper used in
modelopt/onnx/graph_surgery/utils/graph_utils.py (the helper that emits raw BF16
storage for TensorProtos) for both weights and biases so the MatMul/Add nodes
keep BF16 storage; apply the same replacement for the other similar blocks (the
weight/bias creation ranges around the other occurrences mentioned).
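For the transformers-dependency finding above (gqa_replacement.py, lines 41-54), a minimal sketch of the lazy-import pattern; the helper name is hypothetical:

```python
def _load_hf_config(hf_model_id: str, trust_remote_code: bool = False):
    """Sketch: import transformers only when an HF model id is actually used."""
    try:
        from transformers import AutoConfig
    except ImportError as exc:  # keep ONNX-only entry points importable without HF
        raise ImportError(
            "transformers is required when hf_model_id is given; install it or pass "
            "hidden_size/num_attention_heads explicitly."
        ) from exc
    return AutoConfig.from_pretrained(hf_model_id, trust_remote_code=trust_remote_code)
```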
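For the test-isolation finding above (test_gqa_graph_surgery.py, lines 30-33), a sketch of stubbing the download with pytest's monkeypatch; the cache shapes and the get_rope_caches signature are assumptions and may need adjusting:

```python
import numpy as np
import pytest


@pytest.fixture
def fake_rope_caches(monkeypatch):
    """Sketch: deterministic RoPE caches so the test never downloads an HF config."""
    max_seq_len, half_rotary_dim = 64, 16  # assumed test-sized dimensions

    def _fake_get_rope_caches(*args, **kwargs):
        cos = np.ones((max_seq_len, half_rotary_dim), dtype=np.float32)
        sin = np.zeros((max_seq_len, half_rotary_dim), dtype=np.float32)
        return cos, sin

    monkeypatch.setattr(
        "modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches",
        _fake_get_rope_caches,
    )
```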
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 362e6545-4833-4165-baf7-2774ff39987c
📒 Files selected for processing (6)
modelopt/onnx/graph_surgery/__main__.py
modelopt/onnx/graph_surgery/gqa_replacement.py
modelopt/onnx/graph_surgery/utils/rope_cache.py
modelopt/onnx/graph_surgery/utils/whisper_utils.py
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tests/unit/onnx/test_gqa_graph_surgery.py
if lm_head_node and len(lm_head_node.output) > 0:
    # Create new output name for MatMul
    new_matmul_output = f"{lm_head_matmul_name}_output_bf16"
    lm_head_node.output[0] = new_matmul_output

    # Create Cast node to FP32
    lm_head_cast_node = helper.make_node(
        "Cast",
        inputs=[new_matmul_output],
        outputs=["logits"],
        name=f"{lm_head_matmul_name}/Cast_to_fp32",
        to=TensorProto.FLOAT,
    )
    graph.node.append(lm_head_cast_node)

    # Add value_info for bf16 intermediate
    lm_head_bf16_info = helper.make_tensor_value_info(
        new_matmul_output, TensorProto.BFLOAT16, ["batch_size", "sequence_length", "vocab_size"]
    )
    graph.value_info.append(lm_head_bf16_info)

    # Update logits output to FP32
    for out in graph.output:
        if out.name == "logits":
            out.type.tensor_type.elem_type = TensorProto.FLOAT
🧩 Analysis chain
🏁 Scripts executed:
cd modelopt && wc -l onnx/graph_surgery/gqa_replacement.py
sed -n '210,250p' onnx/graph_surgery/gqa_replacement.py
sed -n '955,990p' onnx/graph_surgery/gqa_replacement.py
sed -n '180,300p' onnx/graph_surgery/gqa_replacement.py
sed -n '210,250p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '955,990p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '180,300p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "_add_bf16_lm_head_cast\|_rename_standard_io" modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '100,180p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "def _rename_standard_io\|_rename_standard_io(" modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '880,920p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '935,975p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "def.*rename\|io_renames" modelopt/onnx/graph_surgery/gqa_replacement.py | head -20
sed -n '120,210p' modelopt/onnx/graph_surgery/gqa_replacement.py | head -40
sed -n '1,50p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -B 5 "for out in graph.output:" modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '201,245p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '880,910p' modelopt/onnx/graph_surgery/gqa_replacement.py
sed -n '950,1000p' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n 'if out\.name ==' modelopt/onnx/graph_surgery/gqa_replacement.py
grep -n "Step [0-9]" modelopt/onnx/graph_surgery/gqa_replacement.py
Update type check to handle renamed suffixed output names.
When the graph has a suffixed logits output (e.g., logits_###), _add_bf16_lm_head_cast() only checks for outputs named exactly "logits" at line 242, leaving the original suffixed output with BF16/FP16 type. Later, Step 2.5 renames logits_### to logits, but the output type remains incorrect and no longer matches the FP32 Cast node output.
Suggested change
if lm_head_node and len(lm_head_node.output) > 0:
+ original_output_name = lm_head_node.output[0]
# Create new output name for MatMul
new_matmul_output = f"{lm_head_matmul_name}_output_bf16"
lm_head_node.output[0] = new_matmul_output
@@
# Update logits output to FP32
for out in graph.output:
- if out.name == "logits":
+ if out.name in {original_output_name, "logits"}:
out.type.tensor_type.elem_type = TensorProto.FLOAT
out.type.tensor_type.elem_type = TensorProto.FLOAT🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 219 - 244, The
_add_bf16_lm_head_cast() logic only updates outputs named exactly "logits",
leaving suffixed names like "logits_###" with the wrong BF16/FP16 type; change
the output-type update loop (the loop that iterates graph.output and checks
out.name) to match either exact "logits" or suffixed variants (e.g., out.name ==
"logits" or out.name.startswith("logits") / a regex like r"^logits(_\d+)?$") and
set out.type.tensor_type.elem_type = TensorProto.FLOAT for all matches so the
renamed suffixed output (created via lm_head_matmul_name -> new_matmul_output)
ends up with FP32 type consistent with the Cast node.
qkv_add_node = helper.make_node(
    "Add",
    inputs=[qkv_matmul_output, qkv_bias_name],
    outputs=[qkv_add_output],
    name=qkv_add_name,
)
graph.node.append(qkv_add_node)
🧩 Analysis chain
🏁 Scripts executed:
# First, let's get an overview of the file and locate the specific lines
wc -l modelopt/onnx/graph_surgery/gqa_replacement.py
# Read the section around lines 703-709 with context
sed -n '680,730p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Read the section around lines 1133-1194 with context
sed -n '1110,1220p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Search for all places where qkv_matmul_nodes is used and what it is
rg -n "qkv_matmul_nodes" modelopt/onnx/graph_surgery/gqa_replacement.py
# Search for Step 7 and Step 8 comments to understand the flow
rg -n "Step [78]" modelopt/onnx/graph_surgery/gqa_replacement.py | head -20
# Look at more context around the problematic lines (700-800)
sed -n '695,790p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Check what exactly is at lines 1133-1194 and the surrounding context
sed -n '1130,1200p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n
# Search for all places where qkv_add_node is created or used
rg -n "qkv_add" modelopt/onnx/graph_surgery/gqa_replacement.py
Defer inserting the fused qkv_proj/Add until after attention cleanup.
qkv_add_node is appended to graph.node immediately, causing it to be identified and removed during Steps 7-8. For non-quantized models, the node doesn't match any of the preserved patterns (lines 1161-1169) and is removed. Even in combined-QKV models where it survives removal, it appears in the graph before its producer qkv_matmul_node is added, violating node ordering requirements.
Suggested change
- graph.node.append(qkv_add_node)
+ qkv_matmul_nodes.append(qkv_add_node)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 703 - 709, The
fused qkv_proj/Add node (qkv_add_node) is being appended to graph.node
immediately which causes it to be removed during the attention cleanup (Steps
7-8) or to appear before its producer qkv_matmul_node; instead, stop appending
qkv_add_node to graph.node at creation and defer insertion until after the
attention cleanup/retention logic completes—add it only once attention-related
removals are done and qkv_matmul_node has been inserted (ensure insertion order
places qkv_matmul_node before qkv_add_node).
theta = getattr(config, "rope_theta", 10000.0)
head_dim = config.hidden_size // config.num_attention_heads
partial_factor = getattr(config, "partial_rotary_factor", 1.0)
dim = int(head_dim * partial_factor)
🧩 Analysis chain
🏁 Scripts executed:
cat -n modelopt/onnx/graph_surgery/utils/rope_cache.py
# Search for tests or usage of get_rope_caches function
rg "get_rope_caches" --type py -B 2 -A 5
# Look at how get_rope_caches is used in gqa_replacement.py
rg "cos_cache|sin_cache" modelopt/onnx/graph_surgery/gqa_replacement.py -B 3 -A 3
# Search for tests related to rope or partial rotary
fd -e py "test.*rope" -o -e py | xargs grep -l "partial_rotary\|get_rope" 2>/dev/null | head -10
find . -name "*test*.py" -type f | head -20 | xargs grep -l "rope\|partial_rotary" 2>/dev/null
grep -n "partial_rotary\|rope\|get_rope" ./tests/_test_utils/torch/deploy/lib_test_models.py -B 2 -A 2
find . -path ./node_modules -prune -o -name "*.py" -type f -exec grep -l "get_rope_caches\|gqa_replacement" {} \; 2>/dev/null
cat -n ./tests/unit/onnx/test_gqa_graph_surgery.py
🏁 Script executed:
# Let's trace the logic more carefully by creating a test script
python3 << 'EOF'
import torch
import numpy as np
# Test case 1: Full rotary (partial_factor=1.0)
print("=== Case 1: Full Rotary (partial_factor=1.0) ===")
head_dim = 128
partial_factor = 1.0
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim: {dim}")
# Compute cache as in rope_cache.py
max_seq_len = 4
theta = 10000.0
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
print(f"inv_freq shape: {inv_freq.shape}")
t = torch.arange(max_seq_len, dtype=torch.int64).float()
freqs = torch.outer(t, inv_freq)
print(f"freqs shape: {freqs.shape}")
emb = torch.cat((freqs, freqs), dim=-1)
print(f"emb shape after concat: {emb.shape}")
print(f"Condition check: emb.shape[-1] ({emb.shape[-1]}) == head_dim ({head_dim})? {emb.shape[-1] == head_dim}")
if emb.shape[-1] == head_dim:
    sliced = emb[:, : head_dim // 2]
    print(f"After slice: {sliced.shape} (head_dim//2={head_dim//2})")
else:
    print(f"No slice applied, returned: {emb.shape}")
print("\n=== Case 2: Partial Rotary (partial_factor=0.5) ===")
partial_factor = 0.5
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim: {dim}")
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
print(f"inv_freq shape: {inv_freq.shape}")
freqs = torch.outer(t, inv_freq)
print(f"freqs shape: {freqs.shape}")
emb = torch.cat((freqs, freqs), dim=-1)
print(f"emb shape after concat: {emb.shape}")
print(f"Condition check: emb.shape[-1] ({emb.shape[-1]}) == head_dim ({head_dim})? {emb.shape[-1] == head_dim}")
if emb.shape[-1] == head_dim:
    sliced = emb[:, : head_dim // 2]
    print(f"After slice: {sliced.shape}")
else:
    print(f"No slice applied, RETURNED: {emb.shape}")
print("\n=== Proposed Fix ===")
print(f"dim // 2 = {dim} // 2 = {dim // 2}")
print(f"Should slice to: [max_seq_len, {dim // 2}]")
print(f"For partial factor 0.5, correct cache shape should be [4, {dim // 2}]")
# What GroupQueryAttention expects
print(f"\nGroupQueryAttention expects cache shape: [max_seq_len, head_dim//2]")
print(f"For full rotary (dim=head_dim): [4, {head_dim//2}]")
partial_factor = 0.5
dim_partial = int(head_dim * partial_factor)
print(f"For partial rotary (dim={dim_partial}): [4, {dim_partial//2}]")
EOF
🏁 Script executed:
# Let me trace the logic manually
python3 << 'EOF'
# Case 1: Full rotary (partial_factor=1.0)
print("=== Case 1: Full Rotary (partial_factor=1.0) ===")
head_dim = 128
partial_factor = 1.0
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim = int({head_dim} * {partial_factor}) = {dim}")
# inv_freq has shape [dim // 2]
inv_freq_len = dim // 2
print(f"inv_freq shape: [{inv_freq_len}]")
# freqs = outer(t, inv_freq) -> [max_seq_len, dim//2]
max_seq_len = 4
freqs_shape = (max_seq_len, inv_freq_len)
print(f"freqs shape: {freqs_shape}")
# emb = cat(freqs, freqs) -> [max_seq_len, dim]
emb_shape = (max_seq_len, dim)
print(f"emb shape after cat: {emb_shape}")
# Condition: emb.shape[-1] == head_dim?
cond1 = emb_shape[-1] == head_dim
print(f"Condition: emb.shape[-1] ({emb_shape[-1]}) == head_dim ({head_dim})? {cond1}")
if cond1:
    final_shape = (max_seq_len, head_dim // 2)
    print(f"✓ Slice applied → final shape: {final_shape}")
else:
    final_shape = emb_shape
    print(f"✗ No slice → final shape: {final_shape}")
print("\n=== Case 2: Partial Rotary (partial_factor=0.5) ===")
partial_factor = 0.5
dim = int(head_dim * partial_factor)
print(f"head_dim: {head_dim}, partial_factor: {partial_factor}")
print(f"dim = int({head_dim} * {partial_factor}) = {dim}")
# inv_freq has shape [dim // 2]
inv_freq_len = dim // 2
print(f"inv_freq shape: [{inv_freq_len}]")
# freqs = outer(t, inv_freq) -> [max_seq_len, dim//2]
freqs_shape = (max_seq_len, inv_freq_len)
print(f"freqs shape: {freqs_shape}")
# emb = cat(freqs, freqs) -> [max_seq_len, dim]
emb_shape = (max_seq_len, dim)
print(f"emb shape after cat: {emb_shape}")
# Condition: emb.shape[-1] == head_dim?
cond2 = emb_shape[-1] == head_dim
print(f"Condition: emb.shape[-1] ({emb_shape[-1]}) == head_dim ({head_dim})? {cond2}")
if cond2:
    final_shape = (max_seq_len, head_dim // 2)
    print(f"✓ Slice applied → final shape: {final_shape}")
else:
    final_shape = emb_shape
    print(f"✗ NO SLICE → RETURNED shape: {final_shape}")
print("\n=== What should be returned? ===")
print(f"For full rotary (dim={128}): [max_seq_len, {128//2}]")
print(f"For partial rotary (dim={64}): [max_seq_len, {64//2}]")
print(f"\nThe docstring says: [max_seq_len, head_dim//2]")
print(f" - Full case: [max_seq_len, {128//2}] ✓ correct")
print(f" - Partial case: [max_seq_len, {128//2}] but got [max_seq_len, {64}] ✗ WRONG!")
print(f"\nReviewer's fix: always slice to dim//2 regardless of condition")
EOF
🏁 Scripts executed:
# Check how the cache is actually used in gqa_replacement.py
sed -n '140,200p' modelopt/onnx/graph_surgery/gqa_replacement.py
# Find where GroupQueryAttention is created and how cache is passed
rg "GroupQueryAttention" modelopt/onnx/graph_surgery/gqa_replacement.py -B 3 -A 8
# Find the exact inputs passed to GroupQueryAttention for cos_cache and sin_cache
rg "cos_cache\|sin_cache" modelopt/onnx/graph_surgery/gqa_replacement.py -B 5 -A 5 | head -100
# Search for where cos_cache and sin_cache are added to the graph
rg "add_initializer.*cos_cache\|cos_cache.*add_initializer" modelopt/onnx/graph_surgery/gqa_replacement.py -B 3 -A 3
# Let's look at the part where caches are added as initializers
sed -n '200,250p' modelopt/onnx/graph_surgery/gqa_replacement.py
Always slice RoPE cache to rotary dimension width, not full head width.
The current conditional slice only executes when cos_cache.shape[-1] == head_dim, which is never true for partial rotary models (partial_rotary_factor < 1). For these models, dim != head_dim, causing the cache to be returned unsliced at shape [max_seq_len, dim] instead of the correct [max_seq_len, dim // 2] that GroupQueryAttention expects.
Suggested change
- dim = int(head_dim * partial_factor)
+ dim = int(head_dim * partial_factor)
+ dim -= dim % 2
@@
- # Slice to half (GQA expects this)
- if cos_cache.shape[-1] == head_dim:
- cos_cache = cos_cache[:, : head_dim // 2]
- sin_cache = sin_cache[:, : head_dim // 2]
+ # Slice to half of the effective rotary dimension (GQA expects this)
+ rotary_cache_width = dim // 2
+ cos_cache = cos_cache[:, :rotary_cache_width]
+ sin_cache = sin_cache[:, :rotary_cache_width]
Also applies to: 89-102
def update_genai_config_encoder(
    config_path: str,
    encoder_filename: str,
    encoder_cross_present_key_pattern: str = "present_key_cross_%d",
    encoder_cross_present_value_pattern: str = "present_value_cross_%d",
) -> dict[str, Any]:
    """Update an existing genai_config.json with encoder settings.

    Useful when you already have a config from ONNX Runtime export and
    want to update the encoder section after running encoder surgery.

    Args:
        config_path: Path to existing genai_config.json.
        encoder_filename: New encoder filename.
        encoder_cross_present_key_pattern: Pattern for cross-attention key outputs.
        encoder_cross_present_value_pattern: Pattern for cross-attention value outputs.

    Returns:
        Updated configuration dictionary.
    """
    with open(config_path) as f:
        config = json.load(f)

    # Update encoder section
    if "model" in config and "encoder" in config["model"]:
        config["model"]["encoder"]["filename"] = encoder_filename
        config["model"]["encoder"]["outputs"]["cross_present_key_names"] = (
            encoder_cross_present_key_pattern
        )
        config["model"]["encoder"]["outputs"]["cross_present_value_names"] = (
            encoder_cross_present_value_pattern
        )

    # Save updated config
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    logger.info(f"Updated encoder section in {config_path}")
    return config


def update_genai_config_decoder(
    config_path: str,
    decoder_filename: str,
    decoder_past_key_pattern: str = "past_key_values.%d.decoder.key",
    decoder_past_value_pattern: str = "past_key_values.%d.decoder.value",
    decoder_present_key_pattern: str = "present.%d.decoder.key",
    decoder_present_value_pattern: str = "present.%d.decoder.value",
) -> dict[str, Any]:
    """Update an existing genai_config.json with decoder settings.

    Useful when you've run decoder surgery (MHA/GQA fusion) and need
    to update the decoder section.

    Args:
        config_path: Path to existing genai_config.json.
        decoder_filename: New decoder filename.
        decoder_past_key_pattern: Pattern for past key input names.
        decoder_past_value_pattern: Pattern for past value input names.
        decoder_present_key_pattern: Pattern for present key output names.
        decoder_present_value_pattern: Pattern for present value output names.

    Returns:
        Updated configuration dictionary.
    """
    with open(config_path) as f:
        config = json.load(f)

    # Update decoder section
    if "model" in config and "decoder" in config["model"]:
        config["model"]["decoder"]["filename"] = decoder_filename
        config["model"]["decoder"]["inputs"]["past_key_names"] = decoder_past_key_pattern
        config["model"]["decoder"]["inputs"]["past_value_names"] = decoder_past_value_pattern
        config["model"]["decoder"]["outputs"]["present_key_names"] = decoder_present_key_pattern
        config["model"]["decoder"]["outputs"]["present_value_names"] = decoder_present_value_pattern

    # Save updated config
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    logger.info(f"Updated decoder section in {config_path}")
    return config
Don’t silently succeed when the target config sections are missing.
Both update helpers write the file back and log success even if model.encoder / model.decoder is absent. That makes bad input configs look valid, and direct indexing into inputs / outputs can still raise KeyError. Please validate the expected structure up front and raise a clear error before writing.
Proposed fix
def update_genai_config_encoder(
@@
) -> dict[str, Any]:
@@
with open(config_path) as f:
config = json.load(f)
- # Update encoder section
- if "model" in config and "encoder" in config["model"]:
- config["model"]["encoder"]["filename"] = encoder_filename
- config["model"]["encoder"]["outputs"]["cross_present_key_names"] = (
- encoder_cross_present_key_pattern
- )
- config["model"]["encoder"]["outputs"]["cross_present_value_names"] = (
- encoder_cross_present_value_pattern
- )
+ try:
+ encoder_cfg = config["model"]["encoder"]
+ except KeyError as exc:
+ raise ValueError(f"{config_path} does not contain model.encoder") from exc
+
+ outputs_cfg = encoder_cfg.setdefault("outputs", {})
+ encoder_cfg["filename"] = encoder_filename
+ outputs_cfg["cross_present_key_names"] = encoder_cross_present_key_pattern
+ outputs_cfg["cross_present_value_names"] = encoder_cross_present_value_pattern
@@
def update_genai_config_decoder(
@@
) -> dict[str, Any]:
@@
with open(config_path) as f:
config = json.load(f)
- # Update decoder section
- if "model" in config and "decoder" in config["model"]:
- config["model"]["decoder"]["filename"] = decoder_filename
- config["model"]["decoder"]["inputs"]["past_key_names"] = decoder_past_key_pattern
- config["model"]["decoder"]["inputs"]["past_value_names"] = decoder_past_value_pattern
- config["model"]["decoder"]["outputs"]["present_key_names"] = decoder_present_key_pattern
- config["model"]["decoder"]["outputs"]["present_value_names"] = decoder_present_value_pattern
+ try:
+ decoder_cfg = config["model"]["decoder"]
+ except KeyError as exc:
+ raise ValueError(f"{config_path} does not contain model.decoder") from exc
+
+ inputs_cfg = decoder_cfg.setdefault("inputs", {})
+ outputs_cfg = decoder_cfg.setdefault("outputs", {})
+ decoder_cfg["filename"] = decoder_filename
+ inputs_cfg["past_key_names"] = decoder_past_key_pattern
+ inputs_cfg["past_value_names"] = decoder_past_value_pattern
+ outputs_cfg["present_key_names"] = decoder_present_key_pattern
+ outputs_cfg["present_value_names"] = decoder_present_value_pattern🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py` around lines 389 - 470,
The helpers update_genai_config_encoder and update_genai_config_decoder
currently proceed and overwrite the file even when "model.encoder" or
"model.decoder" (and their expected nested "inputs"/"outputs") are missing,
which can also lead to KeyError; before mutating or writing, validate that
config contains "model" and the appropriate "encoder"/"decoder" dicts and that
the nested "inputs"/"outputs" keys exist, and if any required section is missing
raise a clear ValueError (e.g., "Missing 'model.encoder' section in
genai_config.json" or "Missing 'model.decoder.outputs'") instead of continuing;
only perform the assignments and write back the file after these checks pass in
update_genai_config_encoder and update_genai_config_decoder.
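To make the requested behavior concrete, here is a minimal sketch of the up-front validation the comment asks for; the helper name and error messages are illustrative, not the PR's actual code.

```python
import json


def _require_section(config: dict, *keys: str) -> dict:
    """Walk nested keys and raise a clear error if any level is missing (illustrative helper)."""
    node = config
    for i, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            missing = ".".join(keys[: i + 1])
            raise ValueError(f"genai_config.json is missing the '{missing}' section")
        node = node[key]
    return node


def update_decoder_section(config_path: str, decoder_filename: str) -> dict:
    with open(config_path) as f:
        config = json.load(f)

    # Validate the expected structure before mutating anything.
    decoder = _require_section(config, "model", "decoder")
    _require_section(config, "model", "decoder", "inputs")
    _require_section(config, "model", "decoder", "outputs")

    decoder["filename"] = decoder_filename
    # ... update the inputs/outputs patterns here, as the real helper does ...

    # Only write the file back once validation has passed.
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```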
```python
# Generate config files if output directory exists
output_dir = os.path.dirname(output_path)
if output_dir:
    # Save audio processor config
    from ..utils.whisper_utils import save_audio_processor_config

    save_audio_processor_config(
        output_dir,
        hf_model_id=hf_model_id,
        overwrite=False,
        trust_remote_code=trust_remote_code,
    )

    # Generate genai_config.json with encoder pointing to this output
    if generate_genai_config:
        from ..utils.whisper_utils import save_genai_config as _save_genai_config

        encoder_filename = os.path.basename(output_path)
        _save_genai_config(
            output_dir=output_dir,
            encoder_filename=encoder_filename,
            decoder_filename=decoder_filename,
            hf_model_id=hf_model_id,
            provider=provider,
            trust_remote_code=trust_remote_code,
            overwrite=False,  # Don't overwrite if exists
        )
```
Treat cwd outputs as a valid sidecar target.
When output_path is just a filename, output_dir is "" and both audio_processor_config.json and genai_config.json are skipped even though generation is enabled by default. Using "." here preserves the expected behavior for current-directory outputs.
Suggested change
- output_dir = os.path.dirname(output_path)
+ output_dir = os.path.dirname(output_path) or "."
  if output_dir:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Generate config files if output directory exists
output_dir = os.path.dirname(output_path) or "."
if output_dir:
    # Save audio processor config
    from ..utils.whisper_utils import save_audio_processor_config

    save_audio_processor_config(
        output_dir,
        hf_model_id=hf_model_id,
        overwrite=False,
        trust_remote_code=trust_remote_code,
    )

    # Generate genai_config.json with encoder pointing to this output
    if generate_genai_config:
        from ..utils.whisper_utils import save_genai_config as _save_genai_config

        encoder_filename = os.path.basename(output_path)
        _save_genai_config(
            output_dir=output_dir,
            encoder_filename=encoder_filename,
            decoder_filename=decoder_filename,
            hf_model_id=hf_model_id,
            provider=provider,
            trust_remote_code=trust_remote_code,
            overwrite=False,  # Don't overwrite if exists
        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py` around lines 467 -
493, The code skips writing sidecar files when output_path is a bare filename
because output_dir becomes an empty string; change the logic that computes or
checks output_dir so that a cwd target is treated as "." instead of "" (e.g.,
set output_dir = os.path.dirname(output_path) or "." when that results in ""),
then continue to call save_audio_processor_config(...) and
_save_genai_config(...) when generate_genai_config is True; update references
around output_path, output_dir, save_audio_processor_config, and
_save_genai_config to use the normalized output_dir so current-directory outputs
produce audio_processor_config.json and genai_config.json as intended.
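For reference, the edge case under discussion is simply how os.path.dirname behaves on a bare filename; a self-contained illustration:

```python
import os

# os.path.dirname on a bare filename returns "" (falsy), so the sidecar branch is skipped.
print(os.path.dirname("model.onnx"))      # ""
print(os.path.dirname("out/model.onnx"))  # "out"

# The suggested normalization treats the current directory as a valid target.
output_dir = os.path.dirname("model.onnx") or "."
print(output_dir)                         # "."
```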
| MODEL_ID = "Qwen/Qwen2.5-0.5B" | ||
| VOCAB_SIZE = 64 | ||
| SEQ_LEN = 4 | ||
| MAX_SEQ_LEN = 128 |
🧩 Analysis chain
🏁 Scripts executed:
```shell
# First, find and read the test file to understand its structure
fd -t f test_gqa_graph_surgery.py | head -5

# Get the file size to understand how to read it
wc -l tests/unit/onnx/test_gqa_graph_surgery.py

# Read the relevant sections: lines 30-40 and 584-630
sed -n '1,50p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Check for fixture definitions and AutoConfig usage
rg -n "def _get_config|AutoConfig|replace_attention_with_gqa" tests/unit/onnx/test_gqa_graph_surgery.py -A 5

# Read lines around 584-623
sed -n '580,630p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Find the gqa_replacement.py file
fd -t f gqa_replacement.py

# Search for replace_attention_with_gqa function and how it uses hf_model_id
rg -n "def replace_attention_with_gqa" modelopt/onnx/graph_surgery/gqa_replacement.py -A 20

# Search for AutoConfig usage within the gqa_replacement module
rg -n "AutoConfig" modelopt/onnx/graph_surgery/gqa_replacement.py

# View the implementation around line 119 to see full context of AutoConfig usage
sed -n '110,130p' modelopt/onnx/graph_surgery/gqa_replacement.py | cat -n

# Check if there are any mocking or offline alternatives already in place
rg -n "@pytest.fixture|@mock|@patch|monkeypatch" tests/unit/onnx/test_gqa_graph_surgery.py

# Verify the fixture scope and check what models_and_config provides
sed -n '625,650p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Check if there are other tests in this file that don't use the fixture
rg -n "def test_" tests/unit/onnx/test_gqa_graph_surgery.py | head -15

# Verify that _get_config is called only within the fixture
rg -n "_get_config" tests/unit/onnx/test_gqa_graph_surgery.py

# Check the full fixture implementation to see the complete picture
sed -n '604,625p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Verify if there are any environment markers or skip decorators
sed -n '600,630p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Let's check if there are any skip markers or environment checks
sed -n '1,35p' tests/unit/onnx/test_gqa_graph_surgery.py | cat -n

# Verify if get_rope_caches function exists and what it does
rg -n "def get_rope_caches" modelopt/onnx/graph_surgery/gqa_replacement.py

# Check what imports and dependencies are needed for the suggested fix
rg -n "get_rope_caches" modelopt/onnx/graph_surgery/gqa_replacement.py

# Look at the rope_cache module to understand how it works
fd -t f rope_cache.py | head -3

# Check get_rope_caches implementation
rg -n "def get_rope_caches" modelopt/onnx/graph_surgery/utils/rope_cache.py -A 15
```
Remove network calls from this unit test fixture.
The fixture makes two network calls to download Hugging Face config: once in _get_config() and again inside replace_attention_with_gqa() → get_rope_caches(). This violates the guideline that unit tests in tests/unit should be fast CPU-based tests taking no more than a few seconds to run. Replace the HuggingFace config downloads with a small local config object and monkeypatch modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches to return deterministic, pre-computed caches.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 30 - 33, The test
currently downloads HF configs in _get_config() and again via
replace_attention_with_gqa() ->
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches; instead, create a
small local config dict/object in the test (matching only the keys used by
_get_config() and replace_attention_with_gqa(), e.g., vocab_size, n_head,
n_embd, max_position_embeddings, rotary/rope params) and update _get_config() to
return it, and monkeypatch
modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches to return
deterministic, precomputed rope caches (e.g., simple arrays sized by
SEQ_LEN/MAX_SEQ_LEN) so no network calls occur during
replace_attention_with_gqa() or _get_config().
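A fixture along these lines would keep the test offline; the config fields, cache shapes, and patch targets below are assumptions drawn from the discussion above, not verified against the repository.

```python
from types import SimpleNamespace

import numpy as np
import pytest


@pytest.fixture
def offline_hf(monkeypatch):
    # Hypothetical stand-in for the HuggingFace config; field names are assumed, not verified.
    fake_cfg = SimpleNamespace(
        hidden_size=64,
        num_attention_heads=4,
        num_key_value_heads=2,
        max_position_embeddings=128,
        rope_theta=10000.0,
    )
    monkeypatch.setattr(
        "transformers.AutoConfig.from_pretrained", lambda *args, **kwargs: fake_cfg
    )

    # Deterministic, precomputed RoPE caches sized [max_seq_len, head_dim // 2].
    head_dim = fake_cfg.hidden_size // fake_cfg.num_attention_heads
    cos = np.ones((fake_cfg.max_position_embeddings, head_dim // 2), dtype=np.float32)
    sin = np.zeros_like(cos)
    # If gqa_replacement binds get_rope_caches via "from ... import", patch the name
    # at its use site instead of (or in addition to) the defining module.
    monkeypatch.setattr(
        "modelopt.onnx.graph_surgery.utils.rope_cache.get_rope_caches",
        lambda *args, **kwargs: (cos, sin),
    )
    return fake_cfg
```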
```python
orig_logits = _run_session(orig, orig_feeds)[0].astype(np.float32)
gqa_logits = _run_session(gqa, gqa_feeds)[0].astype(np.float32)

diff = np.abs(orig_logits - gqa_logits)
finite = diff[np.isfinite(diff)]

print(f"\n  Original nodes: {len(orig.graph.node)} -> GQA nodes: {len(gqa.graph.node)}")
print(f"  Logits shape: {orig_logits.shape}")
print(f"  Original[0,:4]: {orig_logits[0, 0, :4]}")
print(f"  GQA     [0,:4]: {gqa_logits[0, 0, :4]}")
if len(finite) > 0:
    print(f"  Max abs diff: {finite.max():.6f}")
    print(f"  Mean abs diff: {finite.mean():.6f}")

assert len(finite) > 0, "All values are non-finite"
assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance"
```
Tighten the numerical assertion.
finite.max() < 1.0 will still pass with materially wrong logits. Since this is the only accuracy guard here, use a much smaller tolerance or assert_allclose().
Suggested change
assert len(finite) > 0, "All values are non-finite"
- assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance"
+    np.testing.assert_allclose(orig_logits, gqa_logits, rtol=1e-2, atol=1e-2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 680 - 695, The
current final assertion uses a very loose tolerance (finite.max() < 1.0)
comparing orig_logits and gqa_logits from _run_session; replace this with a
strict numerical comparison (e.g., numpy.testing.assert_allclose) or a much
smaller absolute tolerance (for example 1e-5 or 1e-3 depending on expected
precision) applied to orig_logits and gqa_logits (or to finite via
np.max(np.abs(...))) and keep the existing non-finite check; update the
assertion that references finite.max() to use the new tighter threshold or
assert_allclose so the test fails on materially wrong logits.
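For context on why assert_allclose is the sharper guard here: it checks |actual - desired| <= atol + rtol * |desired| element-wise and prints a detailed mismatch report on failure. A self-contained illustration with made-up values:

```python
import numpy as np

orig_logits = np.array([0.10, 0.20, 0.30], dtype=np.float32)
gqa_logits = orig_logits + 1e-3  # small rounding-style deviation

# Passes: every element satisfies |gqa - orig| <= atol + rtol * |orig|.
np.testing.assert_allclose(gqa_logits, orig_logits, rtol=1e-2, atol=1e-2)

# A materially wrong value now raises AssertionError with a mismatch report.
try:
    np.testing.assert_allclose(gqa_logits + 0.5, orig_logits, rtol=1e-2, atol=1e-2)
except AssertionError as err:
    print(err)
```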
vishalpandya1990
left a comment
Discussed offline, some suggestions on high-level flow here:
- Design changes / directory restructuring
- Add test case for DQ-transpose
- Make sure the API documentation is updated
- Example: surgery on a quantized model (can be added in a follow-up PR)
Overall looks good to me.
…erform this test Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
♻️ Duplicate comments (8)
modelopt/onnx/graph_surgery/gqa_replacement.py (5)
938-941:⚠️ Potential issue | 🟡 Minor
`cos_cache`/`sin_cache` are registered twice in `value_info`. This issue was previously identified. The
`add_initializer` function (from graph_utils.py) already appends corresponding `value_info` entries. Re-adding them here creates duplicate metadata entries.
Suggested fix
- # Add value_info for cos/sin caches - cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape)) - sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape)) - graph.value_info.extend([cos_cache_info, sin_cache_info])🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 938 - 941, Remove the duplicate value_info entries for the cos/sin caches by eliminating the explicit creation/extension of cos_cache_info and sin_cache_info in GQA replacement; rely on the existing add_initializer function in graph_utils.py (which already registers value_info when adding initializers) instead of calling graph.value_info.extend([cos_cache_info, sin_cache_info]) in this file, or add a presence check before extending to avoid duplicates (refer to the symbols cos_cache_info, sin_cache_info, graph.value_info.extend and add_initializer).
704-710: ⚠️ Potential issue | 🟠 Major — Defer inserting the fused
`qkv_proj`/Add until after attention cleanup. This issue was previously identified. The
qkv_add_nodeis appended tograph.nodeimmediately at line 710. During Steps 7-8, this node can be identified and removed because it doesn't match the preserved patterns for non-quantized models (lines 1161-1169 only keep specific MatMul nodes, not Add nodes).Suggested fix
- graph.node.append(qkv_add_node) + qkv_matmul_nodes.append(qkv_add_node)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 704 - 710, The fused qkv_proj/Add node (qkv_add_node created with name qkv_add_name and outputs qkv_add_output) is being appended to graph.node too early and will be removed during the later attention-cleanup steps; instead, delay inserting this qkv_add_node into graph.node until after the attention cleanup phase completes (i.e., create the node object but do not call graph.node.append(...) immediately — append it only after the steps that prune MatMul/Add patterns have run), ensuring qkv_proj/Add persists for non-quantized models.
241-244: ⚠️ Potential issue | 🟠 Major — Update type check to handle renamed suffixed output names.
This issue was previously identified. The
_add_bf16_lm_head_cast()function runs at Step 0.7, but I/O renaming (which converts suffixed names likelogits_###tologits) happens later at Step 2.5. When the graph has a suffixed logits output, this check only matches"logits"exactly, leaving the original suffixed output with incorrect type.Suggested fix
if lm_head_node and len(lm_head_node.output) > 0: + original_output_name = lm_head_node.output[0] # Create new output name for MatMul new_matmul_output = f"{lm_head_matmul_name}_output_bf16" - lm_head_node.output[0] = new_matmul_output + lm_head_node.output[0] = new_matmul_output @@ # Update logits output to FP32 for out in graph.output: - if out.name == "logits": + if out.name in {original_output_name, "logits"}: out.type.tensor_type.elem_type = TensorProto.FLOAT🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 241 - 244, The output-type update loop in _add_bf16_lm_head_cast only matches exact "logits" and misses suffixed names like "logits_###"; update the check that iterates graph.output (variable out) to treat any suffixed logits as logits by testing the base name (e.g., out.name.startswith("logits") or out.name.split("_",1)[0] == "logits") and then set out.type.tensor_type.elem_type = TensorProto.FLOAT for those entries so both suffixed and canonical outputs are handled.
1319-1324: ⚠️ Potential issue | 🔴 Critical — Use
`onnx.numpy_helper` to read and convert initializer data.
init.raw_datavianp.frombuffer()loses data when the initializer uses external storage or typed fields (e.g.,float_data,int32_data). This can corrupt the quantization scale.Suggested fix
- arr = np.frombuffer(init.raw_data, dtype=np.float32).copy() - arr_fp16 = arr.astype(np.float16) - init.raw_data = arr_fp16.tobytes() - init.data_type = TensorProto.FLOAT16 + arr_fp16 = onnx.numpy_helper.to_array(init).astype(np.float16) + converted_init = onnx.numpy_helper.from_array(arr_fp16, name=init.name) + init.CopyFrom(converted_init)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1319 - 1324, The loop in gqa_replacement.py that converts an initializer to float16 currently reads init.raw_data with np.frombuffer, which can lose data for external or typed initializers; replace that logic with onnx.numpy_helper.to_array(init) to get a numpy array reliably, cast it to np.float16, then use onnx.numpy_helper.from_array(cast_array, name=init.name) (or set init.raw_data = numpy_helper.from_array(...).raw_data and update init.data_type = TensorProto.FLOAT16) so the initializer preserves external/typed storage, dims, and metadata while converting the values to FLOAT16.
1407-1412:⚠️ Potential issue | 🔴 Critical
external_data_namecan remainNonefor cwd outputs.This issue was previously identified. If
output_pathhas no directory component (e.g., just"model.onnx"), the default external filename is never set because the assignment is inside theif output_dir:block. Ifuse_external_data=True, this causeslocation=Noneto be passed to the save functions.Suggested fix
output_dir = os.path.dirname(output_path) if output_dir: os.makedirs(output_dir, exist_ok=True) - if external_data_name is None: - external_data_name = os.path.basename(output_path) + "_data" + if external_data_name is None: + external_data_name = os.path.basename(output_path) + "_data"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1407 - 1412, external_data_name can stay None when output_path has no directory component because the default assignment is inside the if output_dir block; move the external_data_name defaulting logic so that if external_data_name is None it is set to os.path.basename(output_path) + "_data" regardless of output_dir (but only when use_external_data/save path logic requires it). Update the block around output_dir/output_path in gqa_replacement.py so the os.makedirs call stays conditional on output_dir but the external_data_name fallback runs unconditionally when external_data_name is None, referencing the external_data_name, output_path, and output_dir variables used by the existing save/export functions.tests/unit/onnx/test_gqa_graph_surgery.py (2)
586-596:⚠️ Potential issue | 🟠 MajorRemove network calls from this unit test fixture.
This issue was previously identified. The
_get_config()function makes a network call to download HuggingFace config viaAutoConfig.from_pretrained(MODEL_ID, ...). Additionally,replace_attention_with_gqa()makes another network call throughget_rope_caches(). This violates the coding guideline that unit tests intests/unitshould be fast CPU-based tests taking no more than a few seconds to run.Consider mocking
AutoConfig.from_pretrainedandget_rope_cachesto return deterministic, pre-computed values.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 586 - 596, The test fixture _get_config currently calls AutoConfig.from_pretrained(MODEL_ID) and replace_attention_with_gqa triggers get_rope_caches, which make network calls; instead, patch/mocks these to return deterministic in-memory values: patch AutoConfig.from_pretrained used in _get_config to return an object with attributes hidden_size, num_attention_heads, optional num_key_value_heads, and rope_theta so _get_config can compute hd and inv_freq locally, and patch get_rope_caches to return precomputed rope cache tensors; use your test's monkeypatch/pytest fixture to replace AutoConfig.from_pretrained and get_rope_caches calls so the unit test is CPU-only and fast while keeping the same return shapes/values expected by replace_attention_with_gqa and downstream code.
696-697:⚠️ Potential issue | 🟠 MajorTighten the numerical assertion.
This issue was previously identified. The tolerance
finite.max() < 1.0is extremely loose and will pass even with materially wrong logits. Since this is the only accuracy guard, use a much smaller tolerance.Suggested fix
assert len(finite) > 0, "All values are non-finite" - assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance" + np.testing.assert_allclose(orig_logits, gqa_logits, rtol=1e-2, atol=1e-2)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/test_gqa_graph_surgery.py` around lines 696 - 697, The numerical assertion is too loose: replace the tolerance check on finite.max() (the second assertion that currently uses 1.0) with a much tighter threshold (e.g., 1e-3 or 1e-4) and update the error message accordingly; locate the assertions that reference the variable finite and change the line assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds tolerance" to use the new numeric tolerance and include the chosen tolerance value in the message.modelopt/onnx/graph_surgery/utils/rope_cache.py (1)
90-93:⚠️ Potential issue | 🟠 MajorAlways slice RoPE cache to rotary dimension width, not full head width.
This issue was previously identified. The conditional slice at line 91 only executes when
cos_cache.shape[-1] == head_dim, which is never true for partial rotary models (partial_rotary_factor < 1). For these models, the cache is returned with shape[max_seq_len, dim]instead of the correct[max_seq_len, dim // 2].Suggested fix
- # Slice to half (GQA expects this) - if cos_cache.shape[-1] == head_dim: - cos_cache = cos_cache[:, : head_dim // 2] - sin_cache = sin_cache[:, : head_dim // 2] + # Slice to half of the effective rotary dimension (GQA expects this) + rotary_cache_width = dim // 2 + cos_cache = cos_cache[:, :rotary_cache_width] + sin_cache = sin_cache[:, :rotary_cache_width]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/utils/rope_cache.py` around lines 90 - 93, The RoPE cache slicing currently only triggers when cos_cache.shape[-1] == head_dim, which misses partial-rotary models; change the logic so cos_cache and sin_cache are always sliced to the rotary width (use head_dim // 2) instead of relying on that conditional—i.e., remove the conditional and set cos_cache = cos_cache[:, : head_dim // 2] and sin_cache = sin_cache[:, : head_dim // 2] so the returned caches have shape [max_seq_len, head_dim//2] for partial and full rotary cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 938-941: Remove the duplicate value_info entries for the cos/sin
caches by eliminating the explicit creation/extension of cos_cache_info and
sin_cache_info in GQA replacement; rely on the existing add_initializer function
in graph_utils.py (which already registers value_info when adding initializers)
instead of calling graph.value_info.extend([cos_cache_info, sin_cache_info]) in
this file, or add a presence check before extending to avoid duplicates (refer
to the symbols cos_cache_info, sin_cache_info, graph.value_info.extend and
add_initializer).
- Around line 704-710: The fused qkv_proj/Add node (qkv_add_node created with
name qkv_add_name and outputs qkv_add_output) is being appended to graph.node
too early and will be removed during the later attention-cleanup steps; instead,
delay inserting this qkv_add_node into graph.node until after the attention
cleanup phase completes (i.e., create the node object but do not call
graph.node.append(...) immediately — append it only after the steps that prune
MatMul/Add patterns have run), ensuring qkv_proj/Add persists for non-quantized
models.
- Around line 241-244: The output-type update loop in _add_bf16_lm_head_cast
only matches exact "logits" and misses suffixed names like "logits_###"; update
the check that iterates graph.output (variable out) to treat any suffixed logits
as logits by testing the base name (e.g., out.name.startswith("logits") or
out.name.split("_",1)[0] == "logits") and then set
out.type.tensor_type.elem_type = TensorProto.FLOAT for those entries so both
suffixed and canonical outputs are handled.
- Around line 1319-1324: The loop in gqa_replacement.py that converts an
initializer to float16 currently reads init.raw_data with np.frombuffer, which
can lose data for external or typed initializers; replace that logic with
onnx.numpy_helper.to_array(init) to get a numpy array reliably, cast it to
np.float16, then use onnx.numpy_helper.from_array(cast_array, name=init.name)
(or set init.raw_data = numpy_helper.from_array(...).raw_data and update
init.data_type = TensorProto.FLOAT16) so the initializer preserves
external/typed storage, dims, and metadata while converting the values to
FLOAT16.
- Around line 1407-1412: external_data_name can stay None when output_path has
no directory component because the default assignment is inside the if
output_dir block; move the external_data_name defaulting logic so that if
external_data_name is None it is set to os.path.basename(output_path) + "_data"
regardless of output_dir (but only when use_external_data/save path logic
requires it). Update the block around output_dir/output_path in
gqa_replacement.py so the os.makedirs call stays conditional on output_dir but
the external_data_name fallback runs unconditionally when external_data_name is
None, referencing the external_data_name, output_path, and output_dir variables
used by the existing save/export functions.
In `@modelopt/onnx/graph_surgery/utils/rope_cache.py`:
- Around line 90-93: The RoPE cache slicing currently only triggers when
cos_cache.shape[-1] == head_dim, which misses partial-rotary models; change the
logic so cos_cache and sin_cache are always sliced to the rotary width (use
head_dim // 2) instead of relying on that conditional—i.e., remove the
conditional and set cos_cache = cos_cache[:, : head_dim // 2] and sin_cache =
sin_cache[:, : head_dim // 2] so the returned caches have shape [max_seq_len,
head_dim//2] for partial and full rotary cases.
In `@tests/unit/onnx/test_gqa_graph_surgery.py`:
- Around line 586-596: The test fixture _get_config currently calls
AutoConfig.from_pretrained(MODEL_ID) and replace_attention_with_gqa triggers
get_rope_caches, which make network calls; instead, patch/mocks these to return
deterministic in-memory values: patch AutoConfig.from_pretrained used in
_get_config to return an object with attributes hidden_size,
num_attention_heads, optional num_key_value_heads, and rope_theta so _get_config
can compute hd and inv_freq locally, and patch get_rope_caches to return
precomputed rope cache tensors; use your test's monkeypatch/pytest fixture to
replace AutoConfig.from_pretrained and get_rope_caches calls so the unit test is
CPU-only and fast while keeping the same return shapes/values expected by
replace_attention_with_gqa and downstream code.
- Around line 696-697: The numerical assertion is too loose: replace the
tolerance check on finite.max() (the second assertion that currently uses 1.0)
with a much tighter threshold (e.g., 1e-3 or 1e-4) and update the error message
accordingly; locate the assertions that reference the variable finite and change
the line assert finite.max() < 1.0, f"Max abs diff {finite.max():.4f} exceeds
tolerance" to use the new numeric tolerance and include the chosen tolerance
value in the message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dbdd275c-1479-48c2-ae96-e4b163228eb4
📒 Files selected for processing (3)
modelopt/onnx/graph_surgery/gqa_replacement.py
modelopt/onnx/graph_surgery/utils/rope_cache.py
tests/unit/onnx/test_gqa_graph_surgery.py
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
modelopt/onnx/graph_surgery/encoder_cross_kv.py (1)
416-418: ⚠️ Potential issue | 🟠 Major — Current-directory outputs still skip sidecar files.
When `output_path` is just a filename, `os.path.dirname(output_path)` is `""`, so current-directory saves never generate audio_processor_config.json or genai_config.json even though that path is enabled by default. Normalize `output_dir` once with `or "."` and reuse it for both the model save and sidecar writes.
Suggested fix
```diff
-    output_dir = os.path.dirname(output_path)
-    if output_dir:
-        os.makedirs(output_dir, exist_ok=True)
+    output_dir = os.path.dirname(output_path) or "."
+    os.makedirs(output_dir, exist_ok=True)
@@
-    output_dir = os.path.dirname(output_path)
     if output_dir:
```
Also applies to: 467-493
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py` around lines 416 - 418, The code computes output_dir = os.path.dirname(output_path) and skips creating sidecar files when output_path is a bare filename; change the normalization to set output_dir = os.path.dirname(output_path) or "." and then call os.makedirs(output_dir, exist_ok=True) and use this same output_dir whenever saving the model and writing sidecar files (e.g., audio_processor_config.json and genai_config.json); apply the same normalization/fix to the other save block handling sidecars (the block around the later save logic) so both model save and sidecar writes always target the normalized directory.
🧹 Nitpick comments (1)
tests/unit/onnx/test_dq_transpose_surgery.py (1)
53-59: Add one axis=1 fixture as well.
transpose_dequantize_linear_weights() also has an axis-rewrite path for DQ nodes that start on axis 1, but this builder only exercises axis 0 -> 1. A second minimal model covering axis=1 would make this suite much better at catching regressions in the other branch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/test_dq_transpose_surgery.py` around lines 53 - 59, Add a second test fixture that constructs a DequantizeLinear node with axis=1 (mirror the existing dq_node which uses axis=0) and build a minimal ONNX model exercising the axis=1 path so transpose_dequantize_linear_weights() is invoked on an axis=1 DQ node; update the test to run the same assertions for both fixtures (or add a new test function) referencing the existing test utilities and the node name "dq_weight" pattern (or use a new name like "dq_weight_axis1") so the axis-rewrite branch is covered and validated.
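For illustration, a minimal axis=1 model of the kind this nitpick asks for, built with onnx.helper only; the shapes and tensor names here are made up and not taken from the existing test.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper


def make_dq_axis1_model() -> onnx.ModelProto:
    # 4x8 int8 weight dequantized per-column (axis=1), feeding a MatMul.
    weight = numpy_helper.from_array(
        np.random.randint(-8, 8, size=(4, 8), dtype=np.int8), name="weight_q"
    )
    scale = numpy_helper.from_array(
        np.full((8,), 0.1, dtype=np.float32), name="weight_scale"
    )
    dq = helper.make_node(
        "DequantizeLinear", ["weight_q", "weight_scale"], ["weight_dq"],
        name="dq_weight_axis1", axis=1,
    )
    matmul = helper.make_node("MatMul", ["x", "weight_dq"], ["y"], name="matmul")
    graph = helper.make_graph(
        [dq, matmul], "dq_axis1",
        inputs=[helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])],
        outputs=[helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 8])],
        initializer=[weight, scale],
    )
    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
```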
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/__init__.py`:
- Around line 67-70: The module-level imports in modelopt.onnx.graph_surgery
eagerly import optional modules; change the __init__ to lazily re-export
transpose_dequantize_linear_weights, add_cross_kv_to_encoder,
replace_attention_with_gqa, and convert_fp16_to_bf16 by implementing __getattr__
(or thin wrapper callables) that perform the actual import inside the accessor;
for example, when name == "transpose_dequantize_linear_weights" import
.dq_transpose and return the symbol, likewise for add_cross_kv_to_encoder (from
.encoder_cross_kv), replace_attention_with_gqa (from .gqa_replacement), and
convert_fp16_to_bf16 (from .utils.dtype_conversion), so optional dependencies
are only imported when those symbols are first accessed.
In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py`:
- Around line 54-68: The helper currently logs model metadata with logger.info
unconditionally (see the logger.info calls around the
WhisperForConditionalGeneration loading and the subsequent size/shape logs)
which bypasses add_cross_kv_to_encoder's verbose flag; update
add_cross_kv_to_encoder to accept and forward a verbose boolean to this helper
(or accept a logger level) and change the helper to only emit these messages
when verbose is True, or alternatively downgrade those logger.info calls to
logger.debug; apply the same change to the other block referenced (the logs in
the 134-170 region) so both places respect the verbose/quiet setting.
---
Duplicate comments:
In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py`:
- Around line 416-418: The code computes output_dir =
os.path.dirname(output_path) and skips creating sidecar files when output_path
is a bare filename; change the normalization to set output_dir =
os.path.dirname(output_path) or "." and then call os.makedirs(output_dir,
exist_ok=True) and use this same output_dir whenever saving the model and
writing sidecar files (e.g., audio_processor_config.json and genai_config.json);
apply the same normalization/fix to the other save block handling sidecars (the
block around the later save logic) so both model save and sidecar writes always
target the normalized directory.
---
Nitpick comments:
In `@tests/unit/onnx/test_dq_transpose_surgery.py`:
- Around line 53-59: Add a second test fixture that constructs a
DequantizeLinear node with axis=1 (mirror the existing dq_node which uses
axis=0) and build a minimal ONNX model exercising the axis=1 path so
transpose_dequantize_linear_weights() is invoked on an axis=1 DQ node; update
the test to run the same assertions for both fixtures (or add a new test
function) referencing the existing test utilities and the node name "dq_weight"
pattern (or use a new name like "dq_weight_axis1") so the axis-rewrite branch is
covered and validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cbc44444-e980-4b34-abf2-96f783686023
📒 Files selected for processing (4)
modelopt/onnx/graph_surgery/__init__.py
modelopt/onnx/graph_surgery/__main__.py
modelopt/onnx/graph_surgery/encoder_cross_kv.py
tests/unit/onnx/test_dq_transpose_surgery.py
| from .dq_transpose import transpose_dequantize_linear_weights | ||
| from .encoder_cross_kv import add_cross_kv_to_encoder | ||
| from .gqa_replacement import replace_attention_with_gqa | ||
| from .utils.dtype_conversion import convert_fp16_to_bf16 |
Lazy-load these re-exports.
Importing modelopt.onnx.graph_surgery now eagerly imports every surgery module, so environments missing some [onnx]/[hf] extras can fail before the caller even touches those features. Please re-export these symbols lazily (__getattr__, thin wrappers, or guarded imports) instead of importing them unconditionally at module import time.
Based on learnings: Avoid hard imports of optional dependencies at module level; gate features by install extras ([onnx], [hf], [all]).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/__init__.py` around lines 67 - 70, The
module-level imports in modelopt.onnx.graph_surgery eagerly import optional
modules; change the __init__ to lazily re-export
transpose_dequantize_linear_weights, add_cross_kv_to_encoder,
replace_attention_with_gqa, and convert_fp16_to_bf16 by implementing __getattr__
(or thin wrapper callables) that perform the actual import inside the accessor;
for example, when name == "transpose_dequantize_linear_weights" import
.dq_transpose and return the symbol, likewise for add_cross_kv_to_encoder (from
.encoder_cross_kv), replace_attention_with_gqa (from .gqa_replacement), and
convert_fp16_to_bf16 (from .utils.dtype_conversion), so optional dependencies
are only imported when those symbols are first accessed.
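For reference, a minimal sketch of the PEP 562 pattern being suggested; the mapping mirrors the four re-exports quoted above, while the exact layout is left to the PR author.

```python
# modelopt/onnx/graph_surgery/__init__.py (sketch, not the PR's actual file)
import importlib

_LAZY_EXPORTS = {
    "transpose_dequantize_linear_weights": ".dq_transpose",
    "add_cross_kv_to_encoder": ".encoder_cross_kv",
    "replace_attention_with_gqa": ".gqa_replacement",
    "convert_fp16_to_bf16": ".utils.dtype_conversion",
}

__all__ = list(_LAZY_EXPORTS)


def __getattr__(name):
    # Import the owning module only when the symbol is first accessed, so missing
    # optional extras do not break `import modelopt.onnx.graph_surgery`.
    if name in _LAZY_EXPORTS:
        module = importlib.import_module(_LAZY_EXPORTS[name], __name__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```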
| logger.info(f"Loading PyTorch model: {model_id}") | ||
| model = WhisperForConditionalGeneration.from_pretrained( | ||
| model_id, trust_remote_code=trust_remote_code | ||
| ) | ||
|
|
||
| weights = {} | ||
| num_layers = model.config.decoder_layers | ||
| num_heads = model.config.decoder_attention_heads | ||
| hidden_size = model.config.d_model | ||
| head_size = hidden_size // num_heads | ||
|
|
||
| logger.info(f" num_layers: {num_layers}") | ||
| logger.info(f" num_heads: {num_heads}") | ||
| logger.info(f" hidden_size: {hidden_size}") | ||
| logger.info(f" head_size: {head_size}") |
verbose=False does not fully suppress progress logs.
add_cross_kv_to_encoder(verbose=False) still calls helpers that emit logger.info(...) unconditionally, so the CLI's --quiet flag keeps printing model metadata and layer-count messages. Please thread verbose into these helpers or downgrade these messages to debug.
Also applies to: 134-170
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/encoder_cross_kv.py` around lines 54 - 68, The
helper currently logs model metadata with logger.info unconditionally (see the
logger.info calls around the WhisperForConditionalGeneration loading and the
subsequent size/shape logs) which bypasses add_cross_kv_to_encoder's verbose
flag; update add_cross_kv_to_encoder to accept and forward a verbose boolean to
this helper (or accept a logger level) and change the helper to only emit these
messages when verbose is True, or alternatively downgrade those logger.info
calls to logger.debug; apply the same change to the other block referenced (the
logs in the 134-170 region) so both places respect the verbose/quiet setting.
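A small sketch of the two options mentioned above; the function and parameter names are placeholders for the Whisper weight-loading helper being discussed, not the PR's real signatures.

```python
import logging

logger = logging.getLogger(__name__)


# Option 1: downgrade progress messages, assuming --quiet raises the level above INFO.
def load_decoder_weights(model_id: str) -> None:
    logger.debug("Loading PyTorch model: %s", model_id)  # was logger.info(...)


# Option 2: thread the caller's verbose flag through the helper.
def load_decoder_weights_verbose(model_id: str, verbose: bool = True) -> None:
    log = logger.info if verbose else logger.debug
    log("Loading PyTorch model: %s", model_id)
```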
```python
k_weight_t = k_weight.T.astype(np_dtype)
v_weight_t = v_weight.T.astype(np_dtype)

# Add weight initializers
k_weight_name = f"encoder.cross_attn_k_weight.{layer_idx}"
v_weight_name = f"encoder.cross_attn_v_weight.{layer_idx}"

new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
```
🧩 Analysis chain
🏁 Scripts executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== detect_model_dtype BF16 mapping =="
fd graph_utils.py modelopt | xargs -r sed -n '117,160p'
echo
echo "== BF16-sensitive initializer creation in encoder_cross_kv.py =="
fd encoder_cross_kv.py modelopt | xargs -r sed -n '184,245p'
echo
echo "== Existing BF16 tensor helpers in graph_surgery =="
rg -n "BFLOAT16|bfloat16|ml_dtypes|numpy_helper\.from_array|make_tensor\(" modelopt/onnx/graph_surgery

# Verify the array_to_initializer usage pattern in gqa_replacement.py to confirm it's the correct approach
fd encoder_cross_kv.py modelopt | xargs -r grep -A 3 "detect_model_dtype"

# Check how detect_model_dtype is used in encoder_cross_kv.py
fd encoder_cross_kv.py modelopt | xargs -r grep -B 5 -A 5 "detect_model_dtype\|onnx_dtype\|np_dtype"

# Get the full function context to understand dtype flow
fd encoder_cross_kv.py modelopt | xargs -r sed -n '1,250p' | head -n 200
```
Use array_to_initializer() helper for BF16-aware weight and bias serialization.
detect_model_dtype() maps BF16 models to np.float32 for NumPy compatibility, but this code passes that dtype directly to numpy_helper.from_array(...). For BF16 models, the new K/V weights and biases (lines 191–199, 225–242) will serialize as FLOAT initializers while the output tensors use BFLOAT16, creating a type mismatch in MatMul/Add operations that breaks graph validation and runtime execution.
Use the existing array_to_initializer(arr, name, is_bfloat16=True) helper (as shown in gqa_replacement.py lines 631, 697) to properly serialize tensors with the detected dtype, or insert explicit Cast nodes around the new subgraph.
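For readers unfamiliar with the problem, here is a minimal sketch of how a BFLOAT16 initializer can be serialized from a NumPy array; this illustrates the idea only and is not the repository's array_to_initializer implementation.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper


def bf16_initializer(arr: np.ndarray, name: str) -> onnx.TensorProto:
    """Serialize a float array as a BFLOAT16 initializer by truncating fp32 bits."""
    # bfloat16 is the top 16 bits of an IEEE float32 value (simple truncation here,
    # not round-to-nearest-even, which is usually acceptable for weights).
    as_uint32 = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
    bf16_bits = (as_uint32 >> 16).astype(np.uint16)
    return helper.make_tensor(
        name=name,
        data_type=TensorProto.BFLOAT16,
        dims=list(arr.shape),
        vals=bf16_bits.tobytes(),
        raw=True,
    )
```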
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/gqa_replacement.py (1)
1419-1434: Redundant call to `convert_model_to_external_data` before `onnx.save`.
`onnx.save()` with `save_as_external_data=True` handles external data conversion internally. The prior `convert_model_to_external_data()` call is unnecessary overhead.
Suggested simplification
```diff
-        convert_model_to_external_data(
-            model,
-            all_tensors_to_one_file=True,
-            location=external_data_name,
-            size_threshold=1024,
-            convert_attribute=False,
-        )
-
         onnx.save(
             model,
             output_path,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1419 - 1434, The code redundantly calls convert_model_to_external_data(model, ...) immediately before onnx.save(..., save_as_external_data=True, ...); remove the explicit convert_model_to_external_data invocation and rely on onnx.save to perform external data conversion (keep the onnx.save call with save_as_external_data=True, all_tensors_to_one_file=True, location=external_data_name, size_threshold=1024 and write to output_path). This eliminates unnecessary overhead while preserving the same external-data behavior for model, external_data_name and output_path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 1419-1434: The code redundantly calls
convert_model_to_external_data(model, ...) immediately before onnx.save(...,
save_as_external_data=True, ...); remove the explicit
convert_model_to_external_data invocation and rely on onnx.save to perform
external data conversion (keep the onnx.save call with
save_as_external_data=True, all_tensors_to_one_file=True,
location=external_data_name, size_threshold=1024 and write to output_path). This
eliminates unnecessary overhead while preserving the same external-data behavior
for model, external_data_name and output_path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bea3482-be27-4eff-aa08-16ad595f0353
📒 Files selected for processing (1)
modelopt/onnx/graph_surgery/gqa_replacement.py
What does this PR do?
Type of change: new feature
Adds the ONNX Graph Surgery framework — graph-level transformations on exported ONNX models for optimized inference with ONNX Runtime.
GQA Attention Replacement — Replaces native attention subgraphs (Q/K/V projections, RoPE, softmax, KV cache) with a single fused GroupQueryAttention operator. Handles RoPE cache computation, attention mask reformatting, Q/K/V weight fusion, and KV cache I/O automatically from a HuggingFace model ID. Supports FP16/BF16, INT4/AWQ quantized, and combined QKV models.
DequantizeLinear Weight Transpose — Transposes quantized weights in DequantizeLinear nodes to column-major layout for providers like NvTensorRtRtx. Handles INT4/UINT4 packed formats.
Whisper Encoder Cross-Attention KV — Adds cross-attention K/V projection outputs to the Whisper encoder for the ONNX Runtime GenAI pipeline. Loads cross-attention weights from HuggingFace and generates genai_config.json.
Usage
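The snippet below is an illustrative call into the new package's Python entry points; the keyword arguments are placeholders rather than confirmed signatures, so consult the API documentation for the real parameters.

```python
from modelopt.onnx.graph_surgery import (
    replace_attention_with_gqa,
    transpose_dequantize_linear_weights,
)

# Hypothetical arguments shown for illustration only.
replace_attention_with_gqa(
    onnx_path="decoder_model.onnx",
    output_path="decoder_model_gqa.onnx",
    hf_model_id="Qwen/Qwen2.5-0.5B",
)
transpose_dequantize_linear_weights(
    onnx_path="decoder_model_gqa.onnx",
    output_path="decoder_model_gqa_transposed.onnx",
)
```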
Testing
Added a test case for GQA graph surgery at tests/unit/onnx/test_gqa_graph_surgery.py. It builds a toy attention subgraph with native ops matching the real Optimum export pattern and applies GQA surgery on it. The end outputs of the original and surgically modified models were compared and match exactly.
Summary by CodeRabbit
New Features
Tests