Problem
Configuring Pollara 400 NICs for RDMA workloads (auto-negotiation, PFC, QoS, DCQCN) requires imperative nicctl commands per node. These settings don't persist across reboots unless NIC personas are used — and even with personas, applying them is still an imperative, per-node operation that is error-prone at scale.
Proposal
A new CRD (e.g., NicPolicy, NicConfig) that allows users to declaratively define NIC parameters and lets the operator reconcile them. Scope could include:
- Port: auto-negotiation, MTU, FEC, pause type
- QoS: DSCP/PCP classification, PFC no-drop priorities, scheduling (DWRR/SPQ)
- DCQCN: congestion control parameters per RDMA device
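To make the scope concrete, here is a minimal sketch of what such a resource might look like, built as a Python dict and serialized to JSON (which kubectl apply -f accepts alongside YAML). Everything in it is an assumption for illustration: the API group, the NicPolicy kind, all field names, and the values are hypothetical placeholders, not an existing operator API, real nicctl parameters, or recommended settings. The DCQCN parameter map is left empty because the actual parameter names live in the Ops Guide.

```python
import json


def build_nic_policy(name, node_selector):
    """Return a hypothetical NicPolicy manifest as a plain dict.

    All field names and values are illustrative placeholders.
    """
    return {
        "apiVersion": "example.com/v1alpha1",  # hypothetical API group
        "kind": "NicPolicy",                   # hypothetical kind
        "metadata": {"name": name},
        "spec": {
            # Which nodes this policy targets (e.g. a training pool).
            "nodeSelector": node_selector,
            # Port-level settings named in the proposal.
            "port": {
                "autoNegotiation": True,
                "mtu": 9000,
                "fec": "rs",
                "pauseType": "pfc",
            },
            # QoS: classification, PFC no-drop priorities, scheduling.
            "qos": {
                "classification": "dscp",
                "dscpToPriority": {"24": 3, "46": 6},
                "noDropPriorities": [3],
                "scheduling": [
                    {"priority": 3, "type": "dwrr", "weight": 90},
                    {"priority": 6, "type": "spq"},
                ],
            },
            # DCQCN parameters would be keyed per RDMA device.
            "dcqcn": {"enabled": True, "parameters": {}},
        },
    }


if __name__ == "__main__":
    policy = build_nic_policy("training-rdma", {"pool": "training"})
    print(json.dumps(policy, indent=2))
```

Piping the printed JSON into kubectl apply -f - would submit it, assuming a CRD with this shape were installed.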
Benefits
- Persistence: the operator reconciles the desired state, so settings are reapplied after a reboot or firmware change
- Validation: reject invalid combinations before applying (e.g., no-drop on an unmapped priority)
- Observability: CRD status reports the applied state per node and flags drift
- Day-2 ops: config changes via kubectl apply, not SSH into each node
- Composability: different policies for different node groups (training vs inference)
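The validation and drift-detection benefits can be sketched in a few lines. This is a rough illustration under stated assumptions, not the operator's actual logic: the field names (dscpToPriority, noDropPriorities, port.mtu) are hypothetical, and the observed state is assumed to arrive as a nested dict from some node agent; no nicctl calls are made here.

```python
def validate_qos(qos):
    """Return a list of error strings; an empty list means the QoS block is valid.

    Rule sketched here: every no-drop priority must be the target of at
    least one DSCP mapping (hypothetical field names).
    """
    mapped = set(qos.get("dscpToPriority", {}).values())
    errors = []
    for prio in qos.get("noDropPriorities", []):
        if prio not in mapped:
            errors.append(f"no-drop priority {prio} has no DSCP mapping")
    return errors


def find_drift(desired, observed, prefix=""):
    """Return {dotted.path: (desired, observed)} for each leaf that differs.

    Only keys present in the desired state are compared, so extra
    observed settings are not reported as drift.
    """
    drift = {}
    for key, want in desired.items():
        path = f"{prefix}{key}"
        have = observed.get(key)
        if isinstance(want, dict):
            nested = have if isinstance(have, dict) else {}
            drift.update(find_drift(want, nested, path + "."))
        elif want != have:
            drift[path] = (want, have)
    return drift


if __name__ == "__main__":
    bad_qos = {"dscpToPriority": {"24": 3}, "noDropPriorities": [4]}
    print(validate_qos(bad_qos))

    desired = {"port": {"mtu": 9000, "fec": "rs"}}
    observed = {"port": {"mtu": 1500, "fec": "rs"}}
    print(find_drift(desired, observed))
```

In an operator, a check like validate_qos would typically run at admission time, while find_drift would feed the per-node status on each reconcile.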
Context
Running RDMA benchmarks on Pollara 400 requires several nicctl commands per node covering auto-neg, PFC, QoS, and DCQCN configuration. The full set of required parameters is documented in the AMD AI NIC Pollara 400 Ops Guide (UG1801, Sections: QoS Configuration, DCQCN Configuration) and the AMD AI NIC Benchmarking Guide (UG1813, Section: Host Configuration).