Merged
70 changes: 39 additions & 31 deletions .wordlist.txt
@@ -1,8 +1,8 @@
AGFHC
AINIC
AINICs
Allocatable
allocatable
Allocatable
amd
amdgpu
ANP
@@ -16,47 +16,49 @@ caFile
cardinality
CatalogSource
certFile
Chrony
clientCAConfigMap
clientName
ClusterIP
ClusterServiceVersion
CN
CNI
CNIPlugins
CNIPluginsSpec
CNIs
ClusterServiceVersion
ConfigMap
configMap
configmap
configMap
ConfigMap
ConfigMaps
configs
containerd
controllerManager
CoreOS
coredns
CoreOS
cpu
CRI
CRD
crds
CRDs
CRI
CronJob
CronJobs
CRs
cryptographic
CVEs
cvf
daemonset
DaemonSet
DaemonSets
Daemonsets
DTK
daemonset
DaemonSets
DeviceConfig
DevicePlugin
DevicePluginImage
DevicePluginImagePullPolicy
DevicePluginSpec
devicePluginSpec
DevicePluginSpec
disableHttps
DockerHub
DTK
EnableNodeLabeller
etcd
ethtool
@@ -76,37 +78,38 @@ ibverbs
IfNotPresent
imagePullPolicy
imagePullSecrets
ImageStream
imageRegistrySecret
ImageStream
insecureSkipVerify
installdefaultNFDRule
io
ipc
ipam
ipc
IPC
IPv
json
K8s
kaniko
keyFile
keySecret
KMM
kmm
KMM
kube
kubeconfig
Kubelet
kubectl
kubelet
Kubelet
kubernetes
kubectl
Labeller
labeller
lifecycle
Labeller
LIF
lifecycle
MachineConfig
MaxUnavailable
MCO
MetricsExporter
MetricsExporterSpec
MPICH
MPIJob
mTLS
multus
@@ -118,30 +121,34 @@ NetworkAttachmentDefinitions
NetworkConfig
networkconfigs
NFD
NodeFeatureDiscovery
nic
NICCTL
nodeAffinity
NodeLabeller
NodeFeatureDiscovery
Nodelabeller
NodeLabeller
NodeLabellerImage
NodeLabellerImagePullPolicy
NodePort
nodePort
NodePort
nodeSelector
nodeSelectorTerms
NotReady
NPL
NTP
NVMe
OLM
OnDelete
OOM
OpenMPI
OpenShift
OpenShift's
OperatorGroup
OperatorHub
oyaml
pci
PDS
pds
PDS
Pensando
Podman
Pollara
@@ -153,51 +160,52 @@ RCCL
rdma
RDMA
relatedImageBuild
repo
relatedImageBuildPullSecret
RPMs
relatedImageSign
relatedImageSignPullSecret
relatedImageWorker
relatedImageWorkerPullSecret
RoCE
repo
roce
RoCE
ROCm
RollingUpdate
RPMs
SAR
SBR
serverName
sdk
serverName
ServiceAccount
ServiceAccounts
serviceAccountNamespaceSelector
ServiceAccounts
serviceAccountSelector
ServiceMonitor
serviceType
SR-IOV
sriov
SR-IOV
staticAuthorization
SubjectAccessReview
tawk
TBD
techsupport
TechSupport
tlsConfig
TLS
tlsConfig
TokenReview
tolerations
UI
uncordoned
Uncordoning
upgradeCRD
UpgradePolicy
upgradePolicy
UpgradeStrategy
UpgradePolicy
upgradeStrategy
UpgradeStrategy
VFs
virtualized
vnic
vNIC
VNICs
vnic
webhook
webhook's
webhookServer
16 changes: 8 additions & 8 deletions docs/cluster_validation_framework/README.md
@@ -61,7 +61,7 @@ This framework supports Gang Scheduling by checking for Pod Running status and
## Key Components

| Component | Description |
-|------------|-------------|
+| ------------ | ------------- |
| **CronJob** | Periodically triggers cluster node validation checks (e.g., every 24 hours). |
| **ConfigMap** | Stores configuration, candidate selection script, Job and MPIJob manifest templates. |
| **ServiceAccount + RBAC** | Grants permission to list/label nodes and create workloads. |
@@ -143,20 +143,20 @@ kubectl logs job/cluster-validation-mpi-job-<20251110-0715>-launcher

## Example Output Labels

-| Node | Label | Meaning |
-|:--------|:--------------------------------------------|:-----------------------------------------------------------|
-| node-a | `amd.com/cluster-validation-status=passed` | Node successfully passed all RCCL tests |
-| node-b | `amd.com/cluster-validation-status=failed` | Node failed one or more RCCL tests |
-| node-c | *(no label)* | Node not part of current candidate set |
+| Node | Label | Meaning |
+| ------ | ------------------------------------------- | ----------------------------------------------------------- |
+| node-a | `amd.com/cluster-validation-status=passed` | Node successfully passed all RCCL tests |
+| node-b | `amd.com/cluster-validation-status=failed` | Node failed one or more RCCL tests |
+| node-c | *(no label)* | Node not part of current candidate set |
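
As a minimal sketch of how a script might interpret the labels in the table above: the helper below maps a node's label dictionary to a status string. The function name `validation_status` and the `"not-validated"` value for unlabeled nodes are illustrative assumptions, not part of the framework.

```python
# Label key taken from the table above.
LABEL_KEY = "amd.com/cluster-validation-status"


def validation_status(node_labels: dict) -> str:
    """Return 'passed' or 'failed' from the label, else 'not-validated'.

    'not-validated' is a hypothetical name for nodes that carry no label,
    i.e. nodes outside the current candidate set.
    """
    value = node_labels.get(LABEL_KEY)
    if value in ("passed", "failed"):
        return value
    return "not-validated"


# Example label sets mirroring node-a, node-b, and node-c from the table.
nodes = {
    "node-a": {LABEL_KEY: "passed"},
    "node-b": {LABEL_KEY: "failed"},
    "node-c": {},
}
summary = {name: validation_status(labels) for name, labels in nodes.items()}
print(summary)
```

In a real cluster the label dictionaries would come from the Kubernetes API (for example, from `kubectl get nodes -o json`) rather than from a hard-coded mapping.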

---

## Notes for Operators

* Update image tags (**roce-workload**, **network-operator-utils**) as needed before deployment.
* Modify `cluster-validation-config.yaml` to align with your deployment environment.
-* Ensure `slotsPerWorker` and resource limits correspond to the underlying GPU and NIC configuration.
-* Adjust `CronJob.spec` to set the job frequency.
+* Ensure `slotsPerWorker` and resource limits correspond to the underlying GPU and NIC configuration.
+* Adjust `CronJob.spec` to set the job frequency.
* Set `debug_delay` to pause after job completion for debugging.
* Configure `fluent_log_output` to define the log destination for Fluent sidecar-based centralized logging.

15 changes: 8 additions & 7 deletions docs/device_plugin/deviceplugin.md
@@ -32,13 +32,14 @@ spec:

### Field Description

-| Field Name | Description |
-|----------------------------------|-------------------------------------------------|
-| **DevicePluginImage** | Device plugin image |
-| **DevicePluginImagePullPolicy** | One of Always, Never, IfNotPresent. |
-| **NodeLabellerImage** | Image to use for the Node Labeller |
-| **NodeLabellerImagePullPolicy** | Image pull policy: Always, Never, IfNotPresent |
-| **EnableNodeLabeller** | Enable or disable the Node Labeller (true/false)|
+| Field Name | Description |
+|----------------------------------|--------------------------------------------------|
+| **DevicePluginImage** | Device plugin image |
+| **DevicePluginImagePullPolicy** | One of Always, Never, IfNotPresent. |
+| **NodeLabellerImage** | Image to use for the Node Labeller |
+| **NodeLabellerImagePullPolicy** | Image pull policy: Always, Never, IfNotPresent |
+| **EnableNodeLabeller** | Enable or disable the Node Labeller (true/false) |

<br>

The `ImagePullPolicy` field defaults to `Always` if the image tag is `:latest`, or to `IfNotPresent` for other tags. This follows the default Kubernetes behavior for `ImagePullPolicy`.
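
The defaulting rule described above can be sketched as a small function. This is an illustration of the Kubernetes behavior, not the operator's code; the function name `default_image_pull_policy` is hypothetical. The handling of an omitted tag and of digest-pinned images follows the general Kubernetes defaulting rules and is not stated in this document.

```python
def default_image_pull_policy(image: str) -> str:
    """Sketch of how Kubernetes defaults imagePullPolicy from an image reference."""
    # An image pinned by digest (name@sha256:...) defaults to IfNotPresent.
    if "@" in image:
        return "IfNotPresent"
    # Look for a tag only in the last path segment, so a registry port
    # such as "registry.example.com:5000/app" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    _, sep, tag = last_segment.partition(":")
    # A missing tag is treated like ":latest".
    if not sep or tag == "latest":
        return "Always"
    return "IfNotPresent"


for ref in (
    "docker.io/rocm/network-operator:v1.2.0",
    "docker.io/rocm/network-operator:latest",
    "docker.io/rocm/network-operator",
):
    print(ref, "->", default_image_pull_policy(ref))
```

Setting `DevicePluginImagePullPolicy` or `NodeLabellerImagePullPolicy` explicitly overrides this default.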
12 changes: 6 additions & 6 deletions docs/drivers/upgrading.md
@@ -67,7 +67,7 @@ To check the full spec of upgrade configuration run kubectl get crds networkconf
#### `driver.upgradePolicy` Parameters

| Parameter | Description | Default |
-|-----------|-------------|---------|
+| --------- | ----------- | ------- |
| `enable` | Enable this upgrade policy | `false` |
| `maxParallelUpgrades` | Maximum number of nodes which will be upgraded in parallel | `1` |
| `maxUnavailableNodes` | Maximum number (or Percentage) of nodes which can be unavailable (cordoned) in the cluster | `25%` |
@@ -76,14 +76,14 @@ To check the full spec of upgrade configuration run kubectl get crds networkconf
#### `driver.upgradePolicy.nodeDrainPolicy` Parameters

| Parameter | Description | Default |
-|-----------|-------------|---------|
+| --------- | ----------- | ------- |
| `force` | Allow the drain to proceed even if the node runs managed pods such as those owned by a DaemonSet. In such cases the drain does not proceed unless this option is set to `true` | `true` |
| `timeout` | The length of time to wait before giving up. Zero means infinite | `300s` |

#### `driver.upgradePolicy.podDeletionPolicy` Parameters

| Parameter | Description | Default |
-|-----------|-------------|---------|
+| --------- | ----------- | ------- |
| `force` | Force-delete all pods that use AMD NICs | `true` |
| `timeout` | The length of time to wait before giving up. Zero means infinite | `300s` |
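
The `maxUnavailableNodes` parameter of `driver.upgradePolicy` accepts either an absolute number or a percentage (default `25%`). The sketch below shows one way such a value could be turned into a node count. The function name `resolve_max_unavailable` is hypothetical, and rounding percentages down is an assumption; this document does not say how the operator rounds.

```python
import math


def resolve_max_unavailable(value, total_nodes: int) -> int:
    """Convert a count (3) or a percentage string ("25%") into a node count.

    Assumption: percentages are rounded down. The operator's actual
    rounding behavior is not documented here.
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return math.floor(total_nodes * percent / 100)
    return int(value)


# With the default of 25% on an 8-node cluster, up to 2 nodes may be cordoned.
print(resolve_max_unavailable("25%", 8))
# An absolute value is used as-is.
print(resolve_max_unavailable(3, 8))
```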

@@ -127,7 +127,7 @@ status:
The following are the different node states during the upgrade process:

| State | Description |
-|-----------|---------|
+| ----- | ----------- |
| `Install-In-Progress` | Driver is being installed on the node for the first time |
| `Install-Complete` | Driver install is complete |
| `Upgrade-Not-Started` | Automatic upgrade enabled and driver version change is detected. All nodes move to this state |
@@ -136,7 +136,7 @@ The following are the different node states during the upgrade process
| `Upgrade-Timed-Out` | Driver upgrade couldn't finish within 2 hours |
| `Cordon-Failed` | Cordoning of the node failed |
| `Uncordon-Failed` | Uncordoning of the node failed |
-| `Drain-Failed` | Drain node or Delete pods operation failed|
+| `Drain-Failed` | Drain node or Delete pods operation failed |
| `Reboot-In-Progress` | Driver upgrade is done and reboot is in progress |
| `Reboot-Failed` | Driver upgrade is done and reboot attempt failed |
| `Upgrade-Failed` | Driver upgrade failed for any other reasons |
@@ -202,7 +202,7 @@ The operator will automatically:
The operator uses specific tag formats based on the OS:

| OS | Tag Format | Example |
-|----|------------|---------|
+| -- | ---------- | ------- |
| Ubuntu | `ubuntu-<version>-<kernel>-<driver>` | `ubuntu-22.04-6.8.0-40-generic-6.1.3` |
| RHEL CoreOS | `coreos-<version>-<kernel>-<driver>` | `coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2` |
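
To make the tag formats above concrete, here is a minimal sketch that assembles a tag from its parts. The function `driver_image_tag` and the `OS_PREFIXES` mapping are illustrative names, not the operator's implementation; only the two formats and the example values come from the table.

```python
# Prefix per OS, as shown in the "Tag Format" column above.
OS_PREFIXES = {
    "Ubuntu": "ubuntu",
    "RHEL CoreOS": "coreos",
}


def driver_image_tag(os_name: str, os_version: str, kernel: str, driver: str) -> str:
    """Build '<prefix>-<version>-<kernel>-<driver>' for a supported OS."""
    try:
        prefix = OS_PREFIXES[os_name]
    except KeyError:
        raise ValueError(f"no tag format known for OS {os_name!r}") from None
    return f"{prefix}-{os_version}-{kernel}-{driver}"


# Reproduces the two examples from the table.
print(driver_image_tag("Ubuntu", "22.04", "6.8.0-40-generic", "6.1.3"))
print(driver_image_tag("RHEL CoreOS", "416.94", "5.14.0-427.28.1.el9_4.x86_64", "6.2.2"))
```

Because the kernel string itself contains hyphens, a tag of this shape is easy to build but cannot be split back into its parts unambiguously on `-` alone.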

16 changes: 8 additions & 8 deletions docs/index.md
@@ -16,7 +16,7 @@ AMD Network Operator simplifies the use of AMD AINICs in Kubernetes environments
### **Supported Hardware**

| Hardware | Status |
-|-----------|---------|
+| -------- | ------ |
| AMD Pensando™ Pollara AI NIC | ✅ Supported |

### OS & Platform Support Matrix
@@ -25,18 +25,18 @@ Below is a list of operating systems and Kubernetes versions validated with the
Additional versions will be added in future releases.

| Operating System | Kubernetes Versions |
-|------------------|---------------------|
+| ---------------- | ------------------- |
| Ubuntu 22.04 LTS | 1.29 – 1.34 |
| Ubuntu 24.04 LTS | 1.29 – 1.34 |

### Software Version Compatibility Matrix

-| Network Operator | AINIC Firmware | Supported NICs |
-|------------------|----------------|----------------|
-| v1.0.0 | 1.117.1-a-63 | Pollara 400 |
-| v1.0.1 | 1.117.1-a-63 | Pollara 400 |
-| v1.1.0 | 1.117.5-a-56 | Pollara 400 |
-| v1.2.0 | 1.117.5-a-56<br>1.117.5-a-77 | Pollara 400 |
+| Network Operator | AINIC Firmware | Supported NICs |
+|------------------|--------------------------------|----------------|
+| v1.0.0 | 1.117.1-a-63 | Pollara 400 |
+| v1.0.1 | 1.117.1-a-63 | Pollara 400 |
+| v1.1.0 | 1.117.5-a-56 | Pollara 400 |
+| v1.2.0 | 1.117.5-a-56<br>1.117.5-a-77 | Pollara 400 |

## Prerequisites

2 changes: 1 addition & 1 deletion docs/installation/kubernetes-helm.md
@@ -141,7 +141,7 @@ helm show values rocm/network-operator-charts
```

| Key | Type | Default | Description |
-|-----|------|---------|-------------|
+| ----- | ------ | --------- | ------------- |
| controllerManager.manager.image.repository | string | `"docker.io/rocm/network-operator"` | AMD Network operator controller manager image repository |
| controllerManager.manager.image.tag | string | `"v1.2.0"` | AMD Network operator controller manager image tag |
| controllerManager.manager.imagePullPolicy | string | `"Always"` | Image pull policy for AMD Network operator controller manager pod |