
Add assisted installer deployment method for spoke TNF clusters#51

Open
gamado wants to merge 9 commits into openshift-eng:main from gamado:assisted

Conversation

@gamado
Contributor

@gamado gamado commented Feb 15, 2026

Deploy a spoke TNF cluster via ACM/MCE assisted installer on an existing hub cluster. This adds a third deployment path alongside dev-scripts (IPI) and kcli methods.

New Ansible roles:

  • assisted/acm-install: installs ACM operator, MultiClusterHub, AgentServiceConfig with auto-detected RHCOS ISO, enables TNF support, and configures provisioning for external BMH management
  • assisted/assisted-spoke: creates spoke libvirt network and VMs, deploys cluster resources (ClusterDeployment, AgentClusterInstall, InfraEnv, BareMetalHost with fencing credentials), monitors installation, and extracts spoke credentials

Usage: make deploy fencing-assisted

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

Verified by:
Spoke Cluster Verification

Nodes:
NAME STATUS ROLES AGE VERSION
spoke-tnf-master-0 Ready control-plane,master,worker 46m v1.34.2
spoke-tnf-master-1 Ready control-plane,master,worker 27m v1.34.2

Cluster Operators: 34/34 Available, 0 Progressing, 0 Degraded

Full List of Resources:
* Clone Set: kubelet-clone [kubelet]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]
* spoke-tnf-master-0_redfish (stonith:fence_redfish): Started
* spoke-tnf-master-1_redfish (stonith:fence_redfish): Started
* Clone Set: etcd-clone [etcd]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

Result: PASS

@openshift-ci openshift-ci bot requested review from jaypoulz and slintes February 15, 2026 16:55
@openshift-ci

openshift-ci bot commented Feb 15, 2026

Hi @gamado. Thanks for your PR.

I'm waiting for an openshift-eng member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 15, 2026
Contributor

@fonta-rh fonta-rh left a comment


Review: Add Assisted Installer Deployment for Spoke TNF Clusters

Solid architecture — clean separation between acm-install (hub-side) and assisted-spoke (spoke-side) with good reuse of existing roles (proxy-setup, common/update-cluster-inventory). The RHCOS ISO auto-detection with fallback and the BMH fencing credential setup are well done.

Found 5 critical issues that should be fixed before merge, 8 important issues for follow-up, and 5 minor suggestions. Details below.


Critical Issues

C1. SSH key hardcodes id_rsa.pub — will fail on ed25519-only hosts

File: roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml

The task uses cat ~/.ssh/id_rsa.pub but the project convention is ~/.ssh/id_ed25519.pub (see CLAUDE.md, kcli role's prepare.yml). Modern RHEL 9 defaults to ed25519. If the hypervisor was provisioned with ed25519 only, this task silently fails.

Suggestion: Use the same multi-key detection pattern from the kcli role or common/:

- name: Get SSH public key
  shell: |
    for key in ~/.ssh/id_ed25519.pub ~/.ssh/id_rsa.pub ~/.ssh/id_ecdsa.pub; do
      [ -f "$key" ] && cat "$key" && exit 0
    done
    echo "ERROR: No SSH public key found" >&2 && exit 1
  register: ssh_pub_key
  changed_when: false

C2. pull_secret_path duplicated across both roles

Files:

  • roles/assisted/acm-install/vars/main.yml:23
  • roles/assisted/assisted-spoke/vars/main.yml:13

Both independently define pull_secret_path: /opt/dev-scripts/pull_secret.json. If the path changes, both must be updated independently.

Suggestion: Define once as a playbook-level variable in assisted-install.yml (or in the vars file loaded by vars_files), remove from both role vars/main.yml.
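A minimal sketch of that consolidation (the role names follow the PR description; the exact play layout is an assumption, and note the role-level copies must be deleted since role vars outrank play vars in Ansible precedence):

```yaml
# assisted-install.yml (sketch, structure assumed from the review)
- hosts: hypervisor
  vars:
    # Single source of truth; remove the copies from both roles' vars/main.yml
    pull_secret_path: /opt/dev-scripts/pull_secret.json
  roles:
    - assisted/acm-install
    - assisted/assisted-spoke
```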


C3. Hub release image extraction duplicated between roles

Files:

  • roles/assisted/acm-install/tasks/agent-service-config.yml:1-14 — fetches hub_release_image and hub_ocp_version
  • roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml:1-20 — fetches the same data

Same oc get clusterversion calls, same parsing, same variable names. Runs twice during a single deployment.

Suggestion: Move to a pre_tasks block in assisted-install.yml and pass as facts, or create a shared task file under roles/assisted/common/.
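A pre_tasks sketch of the single extraction (the exact jsonpath fields the roles parse are an assumption):

```yaml
# assisted-install.yml pre_tasks (sketch)
pre_tasks:
  - name: Fetch hub release image and version once
    shell: >-
      oc get clusterversion version
      -o jsonpath='{.status.desired.image} {.status.desired.version}'
    environment:
      KUBECONFIG: "{{ hub_kubeconfig }}"
    register: hub_cv
    changed_when: false

  - name: Expose as facts for both roles
    set_fact:
      hub_release_image: "{{ hub_cv.stdout.split(' ') | first }}"
      hub_ocp_version: "{{ hub_cv.stdout.split(' ') | last }}"
```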


C4. RHCOS ISO fallback logic has edge cases that pass silently

File: roles/assisted/acm-install/tasks/agent-service-config.yml

Two issues:

  1. The primary extraction pipes oc image info -o json 2>/dev/null to Python. If oc image info fails, Python gets empty stdin → JSONDecodeError → rc=1. But failed_when only checks "'FAILED' in rhcos_iso_extraction.stdout", not rc. The task succeeds with empty stdout.
  2. The set_fact uses rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout). default() only triggers when the variable is undefined (i.e., fallback was skipped). If fallback runs but produces empty output, rhcos_iso_url becomes "" — the failed_when checks for 'FAILED' and 'NEEDS_FALLBACK' but not empty string.

Suggestion:

- name: Set RHCOS ISO URL fact
  set_fact:
    rhcos_iso_url: >-
      {{ (rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout)) | trim }}
  failed_when: >-
    rhcos_iso_url == 'FAILED' or
    rhcos_iso_url == 'NEEDS_FALLBACK' or
    rhcos_iso_url == ''

And add failed_when: rhcos_iso_extraction.rc != 0 or 'FAILED' in rhcos_iso_extraction.stdout to the primary extraction task.


C5. VM cleanup can cause stale disk reuse

File: roles/assisted/assisted-spoke/tasks/cleanup.yml

virsh undefine --remove-all-storage is wrapped in 2>/dev/null || true + failed_when: false. If storage removal fails (locked disk, permission denied), the VM definition is removed but the disk persists. On re-deployment, qemu-img create uses creates: guard and skips disk creation — the new VM boots from the OLD disk.

Suggestion: Separate the storage cleanup from VM undefine, or explicitly delete the qcow2 files in the cleanup task:

- name: Remove spoke VM disk images
  file:
    path: "{{ spoke_vm_image_dir }}/{{ spoke_cluster_name }}-master-{{ item }}.qcow2"
    state: absent
  loop: "{{ range(spoke_ctlplanes) | list }}"
  become: true

Important Issues

I6. Deploy script hardcodes spoke path

File: scripts/deploy-fencing-assisted.sh:42

Success message says KUBECONFIG=~/spoke-tnf/auth/kubeconfig but spoke_cluster_name is configurable in vars/assisted.yml. A user with spoke_cluster_name: my-cluster gets the wrong path in the output.

Suggestion: Either source the vars file to get the actual name, or note that the path depends on the configured spoke_cluster_name.


I7. No input validation for spoke-specific variables

File: (missing — no validate.yml in assisted-spoke)

Both dev-scripts/install-dev and kcli/kcli-install have dedicated validate.yml files. The assisted roles validate hub health but not spoke inputs (CIDR format, VIP within CIDR, non-empty cluster name, valid BMC credentials format).

Suggestion: Add a validate.yml to assisted-spoke that checks at minimum:

  • spoke_api_vip and spoke_ingress_vip are within spoke_network_cidr
  • spoke_cluster_name is non-empty and DNS-safe
  • spoke_ctlplanes >= 2
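One way to sketch that file, using the ansible.utils collection this PR already adds to collections/requirements.yml (the filter usage is an assumption; ansible.utils.ipaddr with a CIDR argument returns the address when it falls inside the network and a falsy value otherwise):

```yaml
# roles/assisted/assisted-spoke/tasks/validate.yml (sketch)
- name: Validate spoke deployment inputs
  assert:
    that:
      - spoke_cluster_name | default('') | length > 0
      # DNS-safe label: lowercase alphanumerics and hyphens, no leading/trailing hyphen
      - spoke_cluster_name is match('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')
      - spoke_ctlplanes | int >= 2
      - spoke_api_vip | ansible.utils.ipaddr(spoke_network_cidr)
      - spoke_ingress_vip | ansible.utils.ipaddr(spoke_network_cidr)
    fail_msg: "Invalid spoke configuration: check vars/assisted.yml"
```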

I8. No cluster state tracking — must integrate with existing system

Existing roles update cluster-vm-state.json via common/tasks/cluster-state.yml. The assisted roles skip this entirely, which means the deployment tracking system has no awareness of spoke clusters. This breaks make info, lifecycle management, and any tooling that depends on cluster-vm-state.json to know what's deployed.

The assisted-spoke role must call common/tasks/cluster-state.yml (or equivalent) to register spoke VMs and their state. This is a project convention — all deployment methods must support it.


I9. failed_when: false too broadly applied

Multiple locations silence legitimate errors:

  • cleanup.yml:3-7 — namespace delete timeout silenced entirely
  • enable-watch-all-namespaces.yml:17 — Provisioning CR patch hides API errors
  • retrieve-credentials.yml:63-78 — VM start failure hidden, triggers 2-minute blind pause

Suggestion: At minimum, log the error output. For the VM start case, if virsh start fails, don't wait 120 seconds.
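For the virsh start case, a hedged sketch (variable names like vm_name are illustrative, not taken from the PR):

```yaml
# retrieve-credentials.yml (sketch): surface the real error, skip the blind pause
- name: Start spoke VM
  command: virsh start {{ vm_name }}
  register: vm_start
  become: true
  # an already-running domain is fine; any other failure should stop the play
  failed_when: vm_start.rc != 0 and 'already active' not in (vm_start.stderr | default(''))

- name: Wait for VM boot
  pause:
    seconds: 120
  when: vm_start.rc == 0
```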


I10. Wait loops provide no diagnostics on timeout

When wait for assisted-service pod or wait for agents to register times out, the user sees only the retry count. Consider adding a rescue block that dumps pod status, events, or logs.
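A sketch of that pattern for one wait loop (the namespace and label selector are assumptions about where assisted-service runs):

```yaml
# Sketch: dump context when a wait loop times out
- block:
    - name: Wait for assisted-service pod
      shell: >-
        oc -n multicluster-engine get pods -l app=assisted-service
        -o jsonpath='{.items[0].status.phase}'
      register: pod_phase
      until: pod_phase.stdout == 'Running'
      retries: 60
      delay: 10
      changed_when: false
  rescue:
    - name: Collect diagnostics on timeout
      shell: |
        oc -n multicluster-engine get pods
        oc -n multicluster-engine get events --sort-by=.lastTimestamp | tail -n 20
      register: diag
      ignore_errors: true
      changed_when: false

    - name: Show diagnostics
      debug:
        var: diag.stdout_lines

    - name: Fail with context
      fail:
        msg: "Timed out waiting for assisted-service; see diagnostics above"
```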


I11. hub_auth_dir defined but never used

File: roles/assisted/assisted-spoke/vars/main.yml:21

Dead variable.


I12. storage.yml destructive rm -rf on re-run

File: roles/assisted/acm-install/tasks/storage.yml

rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*

Runs every time the role executes (not just on cleanup), wiping existing assisted-service data. This belongs in cleanup.yml, not storage.yml. Gate it with a when: force_cleanup or move to cleanup.
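A minimal gate, reusing the force_cleanup flag the assisted-spoke role already exposes (whether acm-install shares that flag is an assumption):

```yaml
# storage.yml (sketch): only wipe when explicitly requested
- name: Wipe previous assisted-service data
  shell: rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*
  become: true
  when: force_cleanup | default(false) | bool
```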


I13. No README for the new roles

File: (missing — no README.md in either assisted/acm-install/ or assisted/assisted-spoke/)

Every existing role in this project has a comprehensive README (dev-scripts/install-dev/README.md, kcli/kcli-install/README.md, etc.). READMEs are mandatory for new roles — they document prerequisites, variables, usage examples, and known limitations. The assisted.yml.template comments are not a substitute.


Suggestions

S14. changed_when: true on network/VM creation masks idempotency. Use actual command output to determine change status.
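A sketch of output-driven change detection for the network case (resource and variable names are illustrative):

```yaml
# Sketch: report changed only when the network was actually defined
- name: Define spoke libvirt network
  shell: |
    if virsh net-info {{ spoke_network_name }} >/dev/null 2>&1; then
      echo EXISTS
    else
      virsh net-define {{ spoke_network_xml_path }} && echo CREATED
    fi
  become: true
  register: net_define
  changed_when: "'CREATED' in net_define.stdout"
```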

S15. setup-ksushy.yml assertion message assumes 2 hub nodes. Hubs can have 3+ nodes (e.g., 3-node HA).

S16. Template should document DHCP range constraints — VIPs must be outside .50-.150 range.

S17. Missing hub_network_cidr in template — if hub network differs from 192.168.111.0/24, nftables rules silently fail.

S18. Nearly all new files lack a trailing newline. Minor but should be fixed for consistency.


Strengths

  • Architecture is sound — clean separation between acm-install and assisted-spoke with the playbook orchestrating both
  • Good reuse of existing roles — proxy-setup and common/update-cluster-inventory in post_tasks
  • RHCOS ISO auto-detection with fallback is genuinely useful
  • Operational comments are excellent (e.g., "Requires both a ConfigMap AND an annotation on AgentServiceConfig")
  • assisted.yml.template is well-structured as user-facing documentation
  • BMH creation with fencing credentials correctly uses bmac.agent-install.openshift.io/fencing-credentials-secret-name
  • Makefile integration correctly chains fencing-ipi before fencing-assisted

@openshift-ci

openshift-ci bot commented Feb 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gamado
Once this PR has been reviewed and has the lgtm label, please ask for approval from fonta-rh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gamado
Contributor Author

gamado commented Feb 18, 2026

Both clusters are healthy:

Spoke cluster (spoke-tnf):

  • 2 nodes: spoke-tnf-master-0 and spoke-tnf-master-1, both Ready
  • OCP 4.22.0-0.nightly, all cluster operators Available=True, Degraded=False
  • etcd operator healthy

Hub cluster:

  • 2 nodes: master-0 and master-1, both Ready
  • Pacemaker cluster TNF running with quorum
  • All resources started: kubelet-clone, STONITH fencing (fence_redfish), etcd-clone
  • Corosync, Pacemaker, pcsd all active/enabled

The critical fixes (C1-C5) are fully validated. Ready to start on the important issues whenever you are.

@gamado
Contributor Author

gamado commented Feb 18, 2026

Review Fixes Status

Two commits pushed addressing the review findings:

Commit 1: 3a61c01 — Critical Fixes (C1-C5)

| Issue | Status | Description |
| --- | --- | --- |
| C1 | Fixed | SSH key detection: multi-key loop (ed25519, rsa, ecdsa) instead of hardcoded id_rsa.pub |
| C2 | Fixed | pull_secret_path and hub_kubeconfig moved to playbook-level vars, removed from both roles |
| C3 | Fixed | Hub release image extraction moved to playbook pre_tasks, removed duplication from both roles |
| C4 | Fixed | RHCOS ISO extraction: added rc check on primary, empty string check on set_fact |
| C5 | Fixed | Explicit disk image removal after VM undefine to prevent stale disk reuse |

Commit 2: 38bfca3 — Important Fixes + Suggestions (I6, I8, I9, I10, I13, S16-S18)

| Issue | Status | Description |
| --- | --- | --- |
| I6 | Fixed | Deploy script parses spoke_cluster_name from vars/assisted.yml instead of hardcoding path |
| I7 | Skipped | Input validation for spoke variables — deferred |
| I8 | Fixed | Cluster state tracking added to assisted-install.yml (deploying/deployed phases via common/cluster-state.yml) |
| I9 | Fixed | failed_when: false replaced: ignore_errors: true in cleanup, conditional check in provisioning patch, removed from virsh start |
| I10 | Fixed | Block/rescue diagnostic handlers added to all 9 wait loops across 4 files |
| I11 | Fixed | hub_auth_dir removed (was fixed in C2 commit) |
| I12 | Skipped | Destructive rm -rf in storage.yml — deferred |
| I13 | Fixed | READMEs added for acm-install and assisted-spoke roles |
| S14 | Skipped | changed_when: true idempotency — deferred |
| S15 | Skipped | setup-ksushy.yml assertion message — deferred |
| S16 | Fixed | DHCP range constraints documented in assisted.yml.template |
| S17 | Fixed | hub_network_cidr exposed in assisted.yml.template |
| S18 | Fixed | Trailing newlines added to all 22 new files |

Summary

  • 14 of 18 review findings addressed
  • 4 deferred (I7, I12, S14, S15 — low risk, can be follow-up)
  • End-to-end deployment validated: make deploy fencing-assisted completed successfully (ok=155, changed=48, failed=0)
  • Both hub and spoke clusters verified healthy (nodes Ready, all COs available, Pacemaker fencing operational)

@gamado gamado requested a review from fonta-rh February 18, 2026 18:09
@gamado
Contributor Author

gamado commented Feb 19, 2026

Environment

  • Platform: AWS c5.metal (RHEL 9.6)
  • OCP Version: 4.22.0-0.nightly-2026-02-17-112443
  • Deployment method: make deploy fencing-assisted
  • Date: 2026-02-19

Deployment Summary

Full end-to-end deployment completed successfully:

PLAY RECAP *********************************************************************
ec2-user@51.85.92.84  : ok=163  changed=50  unreachable=0  failed=0  skipped=18  rescued=0  ignored=0

Hub Cluster Status

| Component | Status |
| --- | --- |
| Nodes | 2/2 Ready (master-0, master-1) |
| Cluster Operators | All healthy (Available=True, Degraded=False) |
| Pacemaker | TNF cluster running, partition with quorum |
| STONITH | fence_redfish started on both nodes |
| etcd | etcd-clone started on both nodes |
| kubelet | kubelet-clone started on both nodes |
| Corosync/Pacemaker/pcsd | active/enabled |

Spoke Cluster Status

| Component | Status |
| --- | --- |
| Nodes | 2/2 Ready (spoke-tnf-master-0, spoke-tnf-master-1) |
| Cluster Operators | 34/34 healthy (Available=True, Degraded=False) |
| OCP Version | 4.22.0-0.nightly-2026-02-17-112443 |

Fixes Validated

All fixes from the two review-fix commits were validated in this run:

| Fix | Validated |
| --- | --- |
| C1 - SSH key multi-key detection | Yes — key detected and used |
| C2 - Playbook-level pull_secret_path / hub_kubeconfig | Yes — no duplication errors |
| C3 - Hub release extraction in pre_tasks | Yes — version extracted once, passed to both roles |
| C4 - RHCOS ISO fallback error handling | Yes — ISO URL resolved |
| C5 - Explicit disk cleanup | Yes — clean VM creation |
| I6 - Dynamic spoke path in deploy script | Yes — output shows correct ~/spoke-tnf/auth/kubeconfig |
| I8 - Cluster state tracking | Yes — DEPLOYING/DEPLOYED phases logged for ASSISTED method |
| I9 - failed_when: false replaced | Yes — no silent failures, rescued=0 |
| I10 - Wait loop diagnostics | Yes — all waits passed cleanly, rescue blocks available but not triggered |
| I13 - Role READMEs | Yes — present in repo |
| S16 - DHCP range documented | Yes — template updated |
| S17 - hub_network_cidr in template | Yes — template updated |
| S18 - Trailing newlines | Yes — all files fixed |

@fonta-rh
Contributor

fonta-rh commented Apr 8, 2026

Hey @gamado — looks like the March 12 rebase dropped your two fix commits (3a61c01 and 38bfca3). The current branch only has:

33931e6 — Support Instance type of c5.metal
4c60c79 — Add assisted installer deployment method for spoke TNF clusters
c509770 — comments fix (removes a merge conflict marker)

The review fixes you validated on Feb 18-19 are no longer in the diff. A few concrete examples:

  • C1: create-cluster-resources.yml still has cat ~/.ssh/id_rsa.pub (your fix was the multi-key loop)
  • C2: pull_secret_path is still defined in both acm-install/vars/main.yml and assisted-spoke/vars/main.yml (your fix moved it to playbook-level)
  • C3: oc get clusterversion still runs independently in both agent-service-config.yml and create-cluster-resources.yml (your fix moved it to pre_tasks)

Looks like the fix commits weren't included during the interactive rebase. The original commits should still be recoverable via git reflog — look for 3a61c01 and 38bfca3 and cherry-pick them back on top of the current branch.

gamado and others added 4 commits April 9, 2026 11:47
Deploy a spoke TNF cluster via ACM/MCE assisted installer on an existing
hub cluster. This adds a third deployment path alongside dev-scripts (IPI)
and kcli methods.

New Ansible roles:
- assisted/acm-install: installs ACM operator, MultiClusterHub,
  AgentServiceConfig with auto-detected RHCOS ISO, enables TNF support,
  and configures provisioning for external BMH management
- assisted/assisted-spoke: creates spoke libvirt network and VMs, deploys
  cluster resources (ClusterDeployment, AgentClusterInstall, InfraEnv,
  BareMetalHost with fencing credentials), monitors installation, and
  extracts spoke credentials

Usage: make deploy fencing-assisted

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- C1: SSH key detection now tries ed25519 first, falls back to rsa/ecdsa
- C2: Deduplicate pull_secret_path and hub_kubeconfig to playbook level
- C3: Move hub release image extraction to playbook pre_tasks (run once)
- C4: RHCOS ISO extraction checks rc and catches empty string failures
- C5: Explicit disk cleanup prevents stale qcow2 reuse on re-deploy
- Remove unused hub_auth_dir variable (I11)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- I6: Parse spoke_cluster_name from vars/assisted.yml in deploy script
  instead of hardcoding ~/spoke-tnf/auth/kubeconfig
- I8: Add cluster state tracking (deploying/deployed) to assisted-install.yml
  using common/cluster-state.yml, consistent with dev-scripts and kcli methods
- I9: Replace blanket failed_when:false with ignore_errors:true in cleanup,
  conditional error checking in enable-watch-all-namespaces, and remove
  failed_when:false from virsh start in retrieve-credentials
- I10: Add block/rescue diagnostic handlers to all 9 wait loops across
  install-operator, agent-service-config, wait-for-install, and
  retrieve-credentials, dumping relevant status on timeout
- I13: Add README.md for acm-install and assisted-spoke roles
- S16: Document DHCP range constraints in assisted.yml.template
- S17: Expose hub_network_cidr in assisted.yml.template
- S18: Add trailing newlines to all 22 new files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Apr 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a "fencing-assisted" deployment path: a Makefile target and script that run a new Ansible playbook to prepare an existing hub (ACM/MCE/assisted-service) and provision an assisted-installer spoke TNF cluster via two new roles (acm-install and assisted-spoke), templates, vars, and documentation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Top-level docs & templates: CLAUDE.md, deploy/openshift-clusters/vars/assisted.yml.template | Documented the new fencing-assisted flow and added a user-facing assisted vars template. |
| Makefile & runner script: deploy/Makefile, deploy/openshift-clusters/scripts/deploy-fencing-assisted.sh | Added fencing-assisted make target and a wrapper script that validates inputs and launches ansible-playbook assisted-install.yml. |
| Hypervisor tweak: deploy/aws-hypervisor/scripts/create.sh | Relaxed EC2 instance-type regex to match additional metal-instance suffix patterns. |
| Ansible collections: deploy/openshift-clusters/collections/requirements.yml | Added ansible.utils collection requirement (>=2.0.0). |
| Entry playbook: deploy/openshift-clusters/assisted-install.yml | New playbook that validates hub artifacts, computes effective release image/version, sets cluster state, runs assisted/acm-install and assisted/assisted-spoke, and performs post-deploy proxy/credential publishing. |
| acm-install role: deploy/openshift-clusters/roles/assisted/acm-install/... | New hub-prep role: cluster validation, hostPath PV creation, operator subscription (ACM/MCE) with channel autodetect, MultiClusterHub creation, RHCOS ISO extraction (with fallback) and AgentServiceConfig apply, TNF enablement, Metal3 watch-all patching; includes defaults, vars, templates, and many tasks. |
| assisted-spoke role: deploy/openshift-clusters/roles/assisted/assisted-spoke/... | New spoke-provision role: libvirt network creation, qcow2 images and VM import, ksushy validation, rendering/applying hub resources (Namespace, PullSecret, ClusterImageSet, ClusterDeployment, AgentClusterInstall, InfraEnv), BMH and fencing secrets, install monitoring, credential retrieval, cleanup and numerous tasks, templates, defaults, and vars. |
| Templates & manifests: .../acm-install/templates/*, .../assisted-spoke/templates/* | Added Jinja2 templates for AgentServiceConfig, MultiClusterHub, Operator Subscription, ClusterImageSet, ClusterDeployment, AgentClusterInstall, InfraEnv, libvirt network XML, etc. |
| Task orchestration: .../acm-install/tasks/*.yml, .../assisted-spoke/tasks/*.yml | Many new task files (validation, storage provisioning, operator install, ISO extraction/apply, TNF enablement, Metal3 patching, VM/image creation, BMH creation, install/wait logic, credentials retrieval, ksushy checks, cleanup). Pay attention to ISO extraction logic, oc polling/error handling, Metal3 patch semantics, virsh idempotency, and timeouts. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Makefile
  participant Script as deploy-fencing-assisted.sh
  participant Ansible as assisted-install.yml
  participant Hub as Hub KubeAPI
  participant Assisted as Assisted-Service/ACM
  participant Hypervisor as Hypervisor (ksushy/libvirt)

  User->>Makefile: make deploy fencing-assisted
  Makefile->>Script: exec deploy-fencing-assisted.sh
  Script->>Ansible: ansible-playbook assisted-install.yml -i inventory.ini
  Ansible->>Hub: validate hub kubeconfig and proxy (oc)
  Ansible->>Assisted: install ACM/MCE, create MultiClusterHub
  Ansible->>Assisted: extract RHCOS ISO, apply AgentServiceConfig
  Ansible->>Hypervisor: create libvirt network, images, VMs (ksushy)
  Ansible->>Hub: apply spoke resources (Namespace, ClusterImageSet, ClusterDeployment, AgentClusterInstall, InfraEnv)
  Ansible->>Hub: apply BareMetalHost + fencing secrets
  Assisted->>Hub: agents register and progress installation
  Hub->>Ansible: provide spoke kubeconfig and admin password
  Ansible->>User: print spoke credentials and access instructions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


- Add OPENSHIFT_INSTALL_EXPERIMENTAL_DISABLE_IMAGE_POLICY to assisted-unsupported-config
  to prevent spoke MCD signature verification failures during package layering
- Add retries (30x30s) to packagemanifest auto-detection to handle catalogsource
  sync delays (up to 25 minutes observed)
- Add fencing-assisted to VALID_CLUSTER_TYPES in Makefile

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (17)
deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml (1)

9-18: Consider adding nodeAffinity to hostPath PVs.

For multi-node clusters, hostPath PVs should include nodeAffinity to ensure pods are scheduled on the node where the storage exists. Without it, pods may be scheduled on a different node and fail to access the data.

♻️ Example nodeAffinity addition
apiVersion: v1
kind: PersistentVolume
metadata:
  name: assisted-pv-images
spec:
  capacity:
    storage: {{ assisted_images_size }}
  accessModes: [ReadWriteOnce]
  hostPath:
    path: {{ assisted_images_path }}
  storageClassName: {{ assisted_storage_class }}
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - {{ hub_nodes.stdout_lines[0] }}

Note: This requires restructuring to capture node names before creating PVs, or templating both PVs per node.

deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/main.yml (1)

4-30: Consider adding a rescue block for diagnostic output on failure.

Per the PR objectives, commit 38bfca3 was supposed to add block/rescue diagnostic handlers (I10 fix). The current code lacks a rescue block which would help capture diagnostic information when spoke deployment fails. This may have been lost during the rebase.

♻️ Example rescue block for diagnostics
 - block:
     - name: Cleanup existing spoke resources
       include_tasks: cleanup.yml
       when: force_cleanup | bool
     # ... other tasks ...
     - name: Retrieve spoke cluster credentials
       include_tasks: retrieve-credentials.yml
+  rescue:
+    - name: Collect diagnostic information on failure
+      shell: |
+        echo "=== AgentClusterInstall status ==="
+        oc get agentclusterinstall -n {{ spoke_cluster_name }} -o yaml || true
+        echo "=== Agent status ==="
+        oc get agent -n {{ spoke_cluster_name }} -o wide || true
+      register: diagnostics
+      changed_when: false
+      ignore_errors: true
+
+    - name: Display diagnostics
+      debug:
+        var: diagnostics.stdout_lines
+
+    - name: Fail with context
+      fail:
+        msg: "Spoke cluster deployment failed. See diagnostics above."
   environment:
     KUBECONFIG: "{{ hub_kubeconfig }}"
deploy/openshift-clusters/scripts/deploy-fencing-assisted.sh (1)

38-42: Fragile YAML parsing - consider using a proper YAML parser.

The grep | awk approach won't correctly handle quoted values containing colons, inline comments, or multi-line values. Consider using yq or Python for robust YAML parsing.

♻️ More robust parsing using yq
 # Parse spoke_cluster_name from vars/assisted.yml
-SPOKE_CLUSTER_NAME=$(grep '^spoke_cluster_name:' vars/assisted.yml | awk '{print $2}' | tr -d '"' | tr -d "'")
+if command -v yq &> /dev/null; then
+    SPOKE_CLUSTER_NAME=$(yq -r '.spoke_cluster_name // "spoke-tnf"' vars/assisted.yml)
+else
+    # Fallback to basic parsing (may not handle all edge cases)
+    SPOKE_CLUSTER_NAME=$(grep '^spoke_cluster_name:' vars/assisted.yml | awk '{print $2}' | tr -d '"' | tr -d "'" || true)
+fi
 if [[ -z "${SPOKE_CLUSTER_NAME}" ]]; then
     SPOKE_CLUSTER_NAME="spoke-tnf"
 fi
CLAUDE.md (1)

135-139: Consider adding link to template file or noting key variables.

The Assisted Installer Method section mentions copying vars/assisted.yml.template but doesn't indicate what key variables users should customize. This differs from the Kcli Method section (line 131-133) which mentions specific configuration details.

Consider adding a brief note about key variables like spoke_cluster_name, spoke_network_cidr, or referencing the role README for full variable documentation.

As per coding guidelines: "Document configuration options and environment variables in documentation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 135 - 139, Update the "Assisted Installer Method"
section to reference the vars/assisted.yml.template and call out the key
variables users must customize (e.g., spoke_cluster_name, spoke_network_cidr)
and/or add a link to the role README for full variable documentation; ensure you
mention that users should copy vars/assisted.yml.template to vars/assisted.yml
and populate those variables before running the installer and optionally list a
couple of other common variables to customize.
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml (2)

39-43: changed_when detection may miss updates.

The changed_when: "'created' in ..." pattern only detects new resource creation. If oc apply updates an existing resource (showing "configured"), the task incorrectly reports no change. This doesn't affect functionality but misrepresents actual changes.

♻️ Detect both create and update
-  changed_when: "'created' in cis_result.stdout"
+  changed_when: "'created' in cis_result.stdout or 'configured' in cis_result.stdout"

Apply a similar change to all four `oc apply` tasks.

Also applies to: 51-55, 63-67, 75-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml`
around lines 39 - 43, The task "Apply ClusterImageSet" currently sets
changed_when to only detect "'created' in cis_result.stdout" which misses
updates; change the condition to detect both creation and updates (e.g., use
"changed_when: \"'created' in cis_result.stdout or 'configured' in
cis_result.stdout\"" or include 'patched' if desired) for the cis_result
variable, and apply the same change to the other three "Apply ..." tasks that
register their own result variables so they report changes on both create and
update.

22-31: Pull secret creation is idempotent but has a minor race condition.

The check-then-create pattern has a small TOCTOU window. In practice, this is unlikely to matter for this use case, but using oc apply with a templated secret would be more robust.

♻️ Alternative using oc apply for true idempotency

Consider templating the secret and using oc apply similar to the other resources, which avoids the check-then-create pattern.
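A minimal sketch of that alternative, assuming the same `spoke_cluster_name` and `pull_secret_path` variables used elsewhere in the role (task name and secret name are illustrative):

```yaml
- name: Apply spoke pull secret (idempotent)
  shell: |
    oc create secret generic pull-secret \
      --from-file=.dockerconfigjson={{ pull_secret_path }} \
      --type=kubernetes.io/dockerconfigjson \
      -n {{ spoke_cluster_name }} \
      --dry-run=client -o yaml | oc apply -f -
  register: ps_apply
  changed_when: "'created' in ps_apply.stdout or 'configured' in ps_apply.stdout"
```

Because `oc apply` computes the diff server-side in one request, there is no check-then-create window.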

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml`
around lines 22 - 31, The current "Create spoke pull secret" shell task uses a
check-then-create pattern and should be replaced with an idempotent apply;
change the task that references spoke_cluster_name and pull_secret_path to
generate a secret manifest (or run oc create secret generic ...
--from-file=.dockerconfigjson={{ pull_secret_path }}
--type=kubernetes.io/dockerconfigjson --dry-run=client -o yaml) and pipe that
into oc apply -n {{ spoke_cluster_name }} -f - so the operation is atomic and
avoids the TOCTOU race; remove reliance on ps_result and adjust changed_when to
rely on the apply result (or let Ansible detect changes) instead.
deploy/openshift-clusters/roles/assisted/acm-install/tasks/enable-tnf.yml (1)

34-41: Pod selection could fail silently if no pods exist.

If no assisted-service pods exist when this runs (e.g., rollout is still in progress despite the previous wait), head -1 returns empty string, causing oc exec to fail with an unclear error.

♻️ Add guard for empty pod selection
 - name: Verify TNF support is enabled
   shell: |
-    oc exec -n {{ assisted_service_namespace }} \
-      $(oc get pod -n {{ assisted_service_namespace }} -l app=assisted-service -o name | head -1) \
-      -c assisted-service -- env | grep -i TNF_CLUSTERS_SUPPORT
+    POD=$(oc get pod -n {{ assisted_service_namespace }} -l app=assisted-service -o name 2>/dev/null | head -1)
+    if [ -z "$POD" ]; then
+      echo "ERROR: No assisted-service pod found" >&2
+      exit 1
+    fi
+    oc exec -n {{ assisted_service_namespace }} "$POD" -c assisted-service -- env | grep -i TNF_CLUSTERS_SUPPORT
   register: tnf_verify
   changed_when: false
   failed_when: "'TNF_CLUSTERS_SUPPORT=true' not in tnf_verify.stdout"
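The failure mode is easy to reproduce without a cluster: when the pod query returns nothing, `head -1` yields an empty string, and the guard is what turns that into a clear error. A sketch with a simulated empty query:

```shell
# Simulate `oc get pod ... -o name` returning no pods.
POD=$(printf '' | head -1)

# The guard converts an empty selection into an explicit, readable failure
# instead of letting `oc exec ""` fail with a confusing usage error.
if [ -z "$POD" ]; then
  MSG="ERROR: No assisted-service pod found"
else
  MSG="would exec into $POD"
fi
echo "$MSG"
```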
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/assisted/acm-install/tasks/enable-tnf.yml`
around lines 34 - 41, The current task "Verify TNF support is enabled" uses a
shell pipeline with head -1 and then oc exec, which will error unclearly if no
assisted-service pod exists; change the task to first capture the pod name (from
oc get pod -n {{ assisted_service_namespace }} -l app=assisted-service -o name |
head -1) into a variable, add an explicit guard/failure if that variable is
empty (fail with a clear message referencing assisted_service_namespace and the
app=assisted-service selector), and only then run the oc exec to check
TNF_CLUSTERS_SUPPORT, keeping the tnf_verify registration and failed_when logic
intact but relying on the pre-check to avoid silent/unclear failures.
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-network.yml (2)

33-44: nftables rule persistence and check logic.

Two observations:

  1. The grep pattern on line 36 only checks for rules in one direction (spoke...hub). If only the reverse rule exists, both rules would still be inserted, potentially creating duplicates for the hub→spoke direction.

  2. nftables rules added via nft insert are not persisted across reboots. Consider whether persistence is needed for your use case.

♻️ More robust rule checking
 - name: Add cross-bridge nftables FORWARD rules for spoke<->hub connectivity
   shell: |
     # Check if rules already exist
-    if sudo nft list chain ip filter FORWARD 2>/dev/null | grep -q "{{ spoke_network_cidr }}.*{{ hub_network_cidr }}"; then
+    FORWARD_RULES=$(sudo nft list chain ip filter FORWARD 2>/dev/null || echo "")
+    if echo "$FORWARD_RULES" | grep -q "{{ spoke_network_cidr }}.*{{ hub_network_cidr }}" && \
+       echo "$FORWARD_RULES" | grep -q "{{ hub_network_cidr }}.*{{ spoke_network_cidr }}"; then
       echo "Rules already exist"
     else
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-network.yml`
around lines 33 - 44, Update the shell task "Add cross-bridge nftables FORWARD
rules for spoke<->hub connectivity" to check and insert each direction
separately using the variables spoke_network_cidr and hub_network_cidr (e.g.,
grep for both "{{ spoke_network_cidr }}.*{{ hub_network_cidr }}" and "{{
hub_network_cidr }}.*{{ spoke_network_cidr }}" or run two independent checks) so
you only insert missing rules and avoid duplicates; adjust the echo messages and
changed_when (registered nft_result.stdout) to reflect per-direction additions;
additionally, if persistence is required, save the nftables ruleset after
changes (e.g., run "sudo nft list ruleset > /etc/nftables.conf" or use the
distro's nftables persistence mechanism) so rules survive reboots.

23-31: changed_when: true doesn't accurately reflect actual state changes.

Both virsh net-start and virsh net-autostart tasks report changed_when: true unconditionally, even when the network was already started or autostart was already enabled. This breaks idempotent reporting.

♻️ Suggested improvement for accurate change detection
 - name: Start spoke libvirt network
   shell: |
-    sudo virsh net-start {{ spoke_network_name }} 2>/dev/null || true
-  changed_when: true
+    if sudo virsh net-info {{ spoke_network_name }} 2>/dev/null | grep -q "Active:.*yes"; then
+      echo "ALREADY_RUNNING"
+    else
+      sudo virsh net-start {{ spoke_network_name }} && echo "STARTED"
+    fi
+  register: net_start_result
+  changed_when: "'STARTED' in net_start_result.stdout"

 - name: Set spoke libvirt network to autostart
   shell: |
-    sudo virsh net-autostart {{ spoke_network_name }}
-  changed_when: true
+    if sudo virsh net-info {{ spoke_network_name }} 2>/dev/null | grep -q "Autostart:.*yes"; then
+      echo "ALREADY_AUTOSTART"
+    else
+      sudo virsh net-autostart {{ spoke_network_name }} && echo "ENABLED"
+    fi
+  register: net_autostart_result
+  changed_when: "'ENABLED' in net_autostart_result.stdout"

As per coding guidelines: "Make deployments idempotent in Ansible playbooks and scripts".
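The detection itself is plain text matching on `virsh net-info` output, so the pattern can be sanity-checked offline with a canned sample (hypothetical output; field layout assumed from typical `virsh net-info`):

```shell
# Canned `virsh net-info` output for a network that is already running
# but not yet marked for autostart.
NET_INFO='Name:           spoke-tnf
Active:         yes
Autostart:      no'

# A task would report "changed" only when it actually has to act.
printf '%s\n' "$NET_INFO" | grep -q "Active:.*yes"    && START_CHANGED=no || START_CHANGED=yes
printf '%s\n' "$NET_INFO" | grep -q "Autostart:.*yes" && AUTO_CHANGED=no  || AUTO_CHANGED=yes

echo "start changed: $START_CHANGED, autostart changed: $AUTO_CHANGED"
```

For this sample, starting the network is a no-op while enabling autostart is a real change.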

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-network.yml`
around lines 23 - 31, The two tasks "Start spoke libvirt network" and "Set spoke
libvirt network to autostart" should stop unconditionally reporting changes;
register each command's result (e.g., start_result, autostart_result) and set
changed_when to a real change-detection expression instead of true — for
example, for "Start spoke libvirt network" check the network's current state
with virsh net-list --all or inspect start_result.rc/output to only mark changed
when the network was actually started, and for "Set spoke libvirt network to
autostart" query virsh net-info or parse autostart_result to only mark changed
when autostart was newly enabled. Ensure both tasks use the registered result in
failed_when/changed_when to preserve idempotence and accurate reporting.
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/setup-ksushy.yml (2)

19-25: Redfish endpoint query lacks error handling for connection failures.

If sushy-tools is running but the curl request fails (network issue, wrong IP/port), the task will fail with a JSON parsing error rather than a clear connection error message.

♻️ Add connection error handling
 - name: Verify spoke VMs are visible via sushy-tools
   shell: |
-    curl -sk https://{{ spoke_ksushy_ip }}:{{ spoke_ksushy_port }}/redfish/v1/Systems/ \
+    RESPONSE=$(curl -sk --connect-timeout 10 https://{{ spoke_ksushy_ip }}:{{ spoke_ksushy_port }}/redfish/v1/Systems/ \
       -u {{ spoke_bmc_user }}:{{ spoke_bmc_password }} \
-      | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['Members@odata.count'])"
+      2>&1) || { echo "ERROR: Failed to connect to sushy-tools at {{ spoke_ksushy_ip }}:{{ spoke_ksushy_port }}" >&2; exit 1; }
+    echo "$RESPONSE" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['Members@odata.count'])"
   register: sushy_systems
   changed_when: false
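The parsing hazard can be shown offline: feeding a curl error string straight into the JSON step raises a parse error, while a small guard turns it into a readable message. A sketch with no real endpoint involved:

```shell
# A typical curl failure string in place of a Redfish JSON body.
RESPONSE='curl: (7) Failed to connect to host'

# Guard the JSON extraction so non-JSON input produces a clear error
# instead of a raw Python traceback.
if COUNT=$(printf '%s' "$RESPONSE" \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['Members@odata.count'])" 2>/dev/null); then
  MSG="systems: $COUNT"
else
  MSG="ERROR: response was not valid Redfish JSON"
fi
echo "$MSG"
```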
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/setup-ksushy.yml`
around lines 19 - 25, The Redfish curl+python task "Verify spoke VMs are visible
via sushy-tools" can raise a JSON parse error when curl fails; update this shell
task to detect curl failures before JSON parsing by using curl with --fail/-f
and capturing its exit code and output into the registered variable
sushy_systems, then add a failed_when that checks curl succeeded (e.g.,
sushy_systems.rc == 0) and otherwise fail with a clear message including the
curl stderr/output and the spoke_ksushy_ip/spoke_ksushy_port; only run the
python/json extraction when curl succeeded (or guard the python command to
return a sane empty JSON on failure) so the JSON parser never receives non-JSON
input.

27-34: Assert may fail if spoke VMs aren't created yet.

The assertion checks for spoke_ctlplanes + 2 systems, but according to the task flow in the README (and main.yml context), setup-ksushy.yml runs after create-spoke-vms.yml. However, if the task order changes or VMs fail to register with sushy-tools immediately, this could produce confusing failures.

Consider whether this is intentional validation or if the check should be more lenient.

The current logic assumes spoke VMs exist and are visible to sushy-tools. If this is a pre-flight check, the assertion message could clarify that spoke VMs must already be created and registered.

deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/agentclusterinstall.yml.j2 (1)

14-19: Consider making hostPrefix configurable.

The hostPrefix: 23 is hardcoded, which works for the current /14 pod network and 2-node TNF topology. For future flexibility with different cluster network CIDRs or scaling needs, consider making this configurable via a role variable with a sensible default.

♻️ Proposed change to make hostPrefix configurable

In defaults/main.yml:

spoke_host_prefix: 23

In template:

   networking:
     clusterNetwork:
     - cidr: "{{ spoke_cluster_network_cidr }}"
-      hostPrefix: 23
+      hostPrefix: {{ spoke_host_prefix }}
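The interaction between the cluster CIDR and hostPrefix is simple arithmetic, which is why 23 is a reasonable default here: with a /14 pod network, a hostPrefix of 23 yields 2^(23-14) = 512 per-node subnets of roughly 510 usable pod IPs each (shell arithmetic sketch; the exact usable count depends on addresses the CNI reserves):

```shell
CLUSTER_PREFIX=14   # e.g. a 10.128.0.0/14 pod network
HOST_PREFIX=23      # per-node subnet size

# Number of per-node subnets the cluster network can be carved into.
NODE_SUBNETS=$(( 1 << (HOST_PREFIX - CLUSTER_PREFIX) ))
# Classic usable-host count: subtract network and broadcast addresses.
PODS_PER_NODE=$(( (1 << (32 - HOST_PREFIX)) - 2 ))

echo "$NODE_SUBNETS node subnets, $PODS_PER_NODE pod IPs per node"
```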
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/agentclusterinstall.yml.j2`
around lines 14 - 19, The template currently hardcodes hostPrefix: 23 in
agentclusterinstall.yml.j2; make this value configurable by introducing a role
variable (e.g., spoke_host_prefix) in defaults/main.yml with a sensible default
of 23 and replace the literal 23 in the template with the variable reference
(spoke_host_prefix) so different clusterNetwork CIDRs or topologies can override
it; update any role README or variable docs to mention spoke_host_prefix and its
purpose.
deploy/openshift-clusters/roles/assisted/acm-install/tasks/enable-watch-all-namespaces.yml (1)

30-38: Consider adding rescue diagnostics for metal3 pod wait timeout.

Unlike other wait loops in this PR that use block/rescue patterns for timeout diagnostics, this wait loop will fail with minimal context if the metal3 pod doesn't stabilize. Consider adding a rescue block to collect pod status and events on timeout for easier troubleshooting.

♻️ Proposed improvement with rescue diagnostics
-- name: Wait for metal3 pod to stabilize after provisioning change
+- block:
+    - name: Wait for metal3 pod to stabilize after provisioning change
+      shell: |
+        oc get pods -n openshift-machine-api \
+          -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state \
+          --no-headers 2>/dev/null | grep -v Terminating | grep -q Running
+      register: metal3_pod
+      until: metal3_pod.rc == 0
+      retries: "{{ (metal3_stabilize_timeout / 15) | int }}"
+      delay: 15
+  rescue:
+    - name: Collect metal3 pod diagnostics
+      shell: |
+        echo "=== Metal3 Pods ==="
+        oc get pods -n openshift-machine-api -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state
+        echo ""
+        echo "=== Recent Events ==="
+        oc get events -n openshift-machine-api --sort-by='.lastTimestamp' | tail -15
+      register: metal3_diag
+      changed_when: false
+      failed_when: false
+
+    - name: Display metal3 pod diagnostics
+      debug:
+        msg: "{{ metal3_diag.stdout }}"
+
+    - name: Fail after metal3 pod timeout
+      fail:
+        msg: "metal3 pod did not stabilize within timeout"
-  shell: |
-    oc get pods -n openshift-machine-api \
-      -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state \
-      --no-headers 2>/dev/null | grep -v Terminating | grep -q Running
-  register: metal3_pod
-  until: metal3_pod.rc == 0
-  retries: "{{ (metal3_stabilize_timeout / 15) | int }}"
-  delay: 15
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/acm-install/tasks/enable-watch-all-namespaces.yml`
around lines 30 - 38, Wrap the existing task "Wait for metal3 pod to stabilize
after provisioning change" (the shell task that sets metal3_pod and uses until
with metal3_stabilize_timeout) into an Ansible block with a rescue section; on
timeout (rescue) run diagnostic commands like oc get pods -n
openshift-machine-api -o wide, oc describe pod for the metal3 pod(s) and oc get
events -n openshift-machine-api, register their outputs and log them (via debug
or copy to a file) so you surface pod status, descriptions and recent events
when metal3_pod.wait fails; ensure you still preserve the same retry/delay logic
and variables (metal3_pod, metal3_stabilize_timeout) inside the block.
deploy/openshift-clusters/assisted-install.yml (1)

148-154: Clarify proxy.env source in deployment summary.

Line 149 instructs users to source proxy.env for spoke access, but this file gets overwritten by proxy-setup role. The message should clarify this is the spoke's proxy.env. Consider making the instruction more explicit.

♻️ Proposed clarification
           Access spoke cluster:
-            source proxy.env
+            source proxy.env  # (spoke cluster proxy - written by proxy-setup role)
             KUBECONFIG={{ spoke_kubeconfig_path }} oc get nodes
 
           Access hub cluster:
-            source hub-proxy.env
+            source hub-proxy.env  # (hub cluster proxy - preserved from original deployment)
             KUBECONFIG=~/auth/kubeconfig oc get nodes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/assisted-install.yml` around lines 148 - 154, The
deployment summary currently says to "source proxy.env" which is ambiguous
because proxy.env is overwritten by the proxy-setup role; update the text to
explicitly state "source the spoke proxy.env (generated by the proxy-setup role
for the spoke cluster)" when instructing spoke access and reference the
spoke_kubeconfig_path variable in the same sentence; similarly, clarify hub
access as "source hub-proxy.env (for the hub cluster)" to avoid confusion
between spoke and hub proxy files.
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-vms.yml (1)

37-42: Consider reporting change when VMs are actually destroyed.

The changed_when: false suppresses change reporting even when VMs are successfully destroyed. While this is functionally correct (the goal is to ensure VMs are off), it reduces visibility in Ansible output. Consider checking if destruction actually occurred.

♻️ Proposed improvement for accurate change reporting
 - name: Ensure spoke VMs are shut off
   shell: |
-    sudo virsh destroy {{ spoke_cluster_name }}-master-{{ item }} 2>/dev/null || true
+    if sudo virsh domstate {{ spoke_cluster_name }}-master-{{ item }} 2>/dev/null | grep -q running; then
+      sudo virsh destroy {{ spoke_cluster_name }}-master-{{ item }}
+      echo "DESTROYED"
+    else
+      echo "ALREADY_OFF"
+    fi
   loop: "{{ range(spoke_ctlplanes) | list }}"
-  changed_when: false
-  failed_when: false
+  register: destroy_result
+  changed_when: "'DESTROYED' in destroy_result.stdout"
+  failed_when: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-vms.yml`
around lines 37 - 42, The task "Ensure spoke VMs are shut off" currently forces
no change reporting; instead, register the shell result (e.g., register:
virsh_destroy) for the loop that runs "sudo virsh destroy {{ spoke_cluster_name
}}-master-{{ item }}" over "{{ range(spoke_ctlplanes) | list }}" and set
changed_when based on the command outcome (mark changed when virsh actually
destroyed a domain and not when it returned "cannot find domain" or similar).
Update failed_when accordingly to preserve suppression of real errors while
allowing changed_when to reflect actual destructions.
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-bmh.yml (1)

52-53: Treat configured output as a change too.

oc apply reports updates as configured, so this predicate hides real Secret/BMH reconciliations on reruns.

Suggested change
-  changed_when: "'created' in bmh_result.stdout"
+  changed_when: >
+    'created' in bmh_result.stdout or
+    'configured' in bmh_result.stdout

As per coding guidelines, deploy/**/*.{yml,yaml,sh}: Make deployments idempotent in Ansible playbooks and scripts.
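The predicate is just substring matching on `oc apply` output, which a short sketch makes concrete (the sample output strings are illustrative):

```shell
# Classify an `oc apply` output line the way changed_when would.
check_changed() {
  case "$1" in
    *created*|*configured*) echo changed ;;
    *) echo unchanged ;;
  esac
}

check_changed "baremetalhost.metal3.io/spoke-tnf-master-0 configured"   # changed
check_changed "baremetalhost.metal3.io/spoke-tnf-master-0 unchanged"    # unchanged
```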

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-bmh.yml`
around lines 52 - 53, The current changed_when predicate in create-bmh.yml only
treats "'created' in bmh_result.stdout" as a change, so updates reported as
"configured" by oc apply are ignored; update the changed_when condition that
references bmh_result to also consider "configured" (and possibly
"patched"/other positive indicators) in bmh_result.stdout so that Secret/BMH
reconciliations on reruns are marked as changed (e.g., expand the predicate for
the registered variable bmh_result to detect "configured" in addition to
"created").
deploy/openshift-clusters/vars/assisted.yml.template (1)

24-29: Don’t document spoke_ctlplanes as two-node-only here.

spoke_ctlplanes is parameterized, but this comment hardcodes the assisted path to 2. Please document the supported values instead of saying it is always two-node so the new workflow stays aligned with the repo’s topology requirements.

As per coding guidelines, deploy/openshift-clusters/**: Support both SNO (Single Node OpenShift) and two-node topologies in cluster management code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/vars/assisted.yml.template` around lines 24 - 29,
The comment for the template variable spoke_ctlplanes incorrectly states it is
"Always 2 for TNF"; update the comment next to the spoke_ctlplanes declaration
to document the supported values (e.g., 1 for SNO, 2 for two-node/TNF) and note
that it is parameterized rather than fixed, so the variable (spoke_ctlplanes)
can be set to either 1 or 2 depending on topology; locate the spoke_ctlplanes
line in assisted.yml.template and replace the hardcoded "Always 2 for TNF" text
with a brief description of allowed values and intended use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/Makefile`:
- Around line 79-81: The Makefile defines a target fencing-assisted but the
validation variable VALID_CLUSTER_TYPES does not include "fencing-assisted",
causing make deploy fencing-assisted to fail; update the VALID_CLUSTER_TYPES
list (the variable named VALID_CLUSTER_TYPES) to include the token
"fencing-assisted" or adjust the validation logic that checks cluster types to
accept this alias so the fencing-assisted target (the fencing-assisted recipe
that calls fencing-ipi and
./openshift-clusters/scripts/deploy-fencing-assisted.sh) can run without being
blocked by arg validation.

In
`@deploy/openshift-clusters/roles/assisted/acm-install/tasks/agent-service-config.yml`:
- Around line 72-74: The task "Get RHCOS version from ISO URL" currently sets
rhcos_version by applying regex_search(... ) | first which can fail if the regex
doesn't match; update this task to validate the extraction from rhcos_iso_url by
capturing the regex result into a temporary variable (e.g.,
rhcos_version_matches) using regex_search, then assert or fail with a clear
error via the fail module if the list is empty, or set rhcos_version using a
safe default via the default filter only after confirming matches exist;
reference the existing symbols rhcos_iso_url, rhcos_version, regex_search and
the task name to locate and modify the code.

In
`@deploy/openshift-clusters/roles/assisted/acm-install/tasks/install-operator.yml`:
- Around line 107-153: The play currently only creates and waits for
MultiClusterHub when hub_operator == 'acm', so when hub_operator == 'mce' no
operand CR is created; add an equivalent branch for MCE: template a
multiclusterengine.yml.j2 (similar to multiclusterhub.yml.j2) and write it to
/tmp/multiclusterengine.yml, then oc apply -f /tmp/multiclusterengine.yml
(registering result as e.g. mce_result), and add a wait block that polls the
MultiClusterEngine resource (oc get multiclusterengine multiclusterengine -n {{
op_config.namespace }} -o jsonpath='{.status.phase}' or appropriate ready field)
with retries/delay and a rescue block that collects the same diagnostics and
fails with a clear message; guard all these tasks with when: hub_operator ==
'mce' and reuse naming patterns (task names, registered vars) analogous to the
existing MultiClusterHub tasks.
- Around line 31-40: Add an explicit guard after the "Auto-detect channel from
packagemanifest" task to fail fast when detection fails or returns empty: insert
a task (e.g., "Fail if auto-detected channel empty") that uses the fail module
and triggers when detected_channel.rc != 0 or detected_channel.stdout | trim ==
'' and include a clear message mentioning op_config.package_name and the
detected_channel output; keep the existing "Set operator channel fact" task only
executed when detected_channel.stdout | trim is not empty (or rely on the fail
task to stop execution).

In `@deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml`:
- Around line 48-58: The task "Fix hostPath permissions and SELinux context on
each hub node" unconditionally runs rm -rf on {{ assisted_images_path }} and {{
assisted_db_path }}, risking data loss on re-runs; remove the unconditional
deletion from that task and instead create a separate "Clean hostPath
directories" task that runs the rm -rf only when a boolean flag (e.g.,
force_cleanup | default(false) | bool) is true, loop over
hub_nodes.stdout_lines, and keep the permissions/SELinux chown/chcon steps in
the original task so the selinux_fix registration still applies.

In `@deploy/openshift-clusters/roles/assisted/acm-install/vars/main.yml`:
- Around line 7-17: The hard-coded catalog source values under
operator_config.acm.source and operator_config.mce.source prevent disconnected
deployments; replace those literals with configurable variables (e.g.,
acm_catalog_source and mce_catalog_source or a single operator_catalog_source)
and update references in the vars file so operator_config.acm.source and
operator_config.mce.source use those variables (provide sensible defaults like
"redhat-operators" to preserve current behavior). Ensure the variable names are
declared in the same vars file so mirrored catalogs can be supplied in
disconnected environments and that both the acm and mce blocks reference the new
variables.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/retrieve-credentials.yml`:
- Around line 10-18: The "Wait for admin-kubeconfig secret" task uses fixed
retries/delay (10 × 15s) which is inconsistent with the password wait that uses
spoke_credentials_timeout; update this task (named "Wait for admin-kubeconfig
secret" in retrieve-credentials.yml) to derive retries and delay from the same
spoke_credentials_timeout variable used elsewhere (use the same chunk size as
the password wait, e.g. 30s) so the total timeout is configurable and consistent
with the password secret wait.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/wait-for-install.yml`:
- Around line 107-138: The task "Wait for all agents to reach Done stage"
currently succeeds when every currently listed agent is Done, but it must also
ensure the cluster has the expected agent count; update the shell/python check
so it only exits 0 when done == total AND total == spoke_ctlplanes. Inject the
expected count (spoke_ctlplanes) into the embedded Python (via Jinja variable
interpolation or an exported env var) and compare int(total) to
int(spoke_ctlplanes) before exiting; also print a clear message when counts
mismatch and keep the registered name agents_done and the until condition
semantics unchanged.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/clusterimageset.yml.j2`:
- Line 6: The YAML template sets releaseImage: {{ effective_release_image }}
without quotes which can break when the pull spec contains special characters
(':' or '@'); update the clusterimageset.yml.j2 template to render the
releaseImage value as a quoted string (wrap the Jinja expression for
effective_release_image in quotes) so releaseImage is always emitted as a proper
YAML string.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/spoke-network.xml.j2`:
- Line 8: The template uses {{ spoke_network_name }} (derived from
spoke_cluster_name) as the libvirt bridge name which is limited to 15
characters; add either a pre-flight validation that asserts
len(spoke_cluster_name) ≤ 15 (fail early with a clear message referencing
spoke_cluster_name/spoke_network_name and virsh net-define) or update the README
variable documentation for spoke_cluster_name to explicitly state the
15-character Linux bridge name limit and show an example; ensure the check/log
mentions the exact constraint and references the libvirt/virsh failure mode so
maintainers know why the restriction exists.

In `@deploy/openshift-clusters/scripts/deploy-fencing-assisted.sh`:
- Line 1: The script's shebang uses /bin/bash but must follow guidelines to use
/usr/bin/bash; update the top line of deploy-fencing-assisted.sh (the existing
"#!/bin/bash") to "#!/usr/bin/bash" so the script invokes the expected bash
binary location.

In `@deploy/openshift-clusters/vars/assisted.yml.template`:
- Around line 4-5: Update the usage comment in assisted.yml.template to
reference the correct make target: change the second sentence from "run 'make
fencing-assisted' to deploy a spoke TNF cluster via ACM/assisted installer." to
use "run 'make deploy fencing-assisted'" so it matches the PR description and
the actual entrypoint; ensure the wording stays consistent with the first
sentence that references 'make deploy fencing-ipi'.

---

Nitpick comments:
In `@deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml`:
- Around line 9-18: Add nodeAffinity to the PersistentVolume definition for
assisted-pv-images so the hostPath PV binds only to the node that actually
contains {{ assisted_images_path }}; update the PersistentVolume resource
(metadata.name: assisted-pv-images, spec.hostPath) to include a
spec.nodeAffinity.required.nodeSelectorTerms matching the node's hostname (e.g.,
kubernetes.io/hostname) or otherwise ensure PVs are templated per-node, and
adjust the playbook/template to capture a target node name (hub_nodes or similar
variable) before creating the PV so the nodeAffinity can reference that node.
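
One possible shape for the pinned PV — the capacity, access mode, and the `assisted_pv_node` variable are placeholders; the `nodeAffinity` stanza is the point:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: assisted-pv-images
spec:
  capacity:
    storage: 50Gi          # placeholder; keep whatever the role currently requests
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "{{ assisted_images_path }}"
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - "{{ assisted_pv_node }}"   # hypothetical var: target node's hostname
```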

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-bmh.yml`:
- Around line 52-53: The current changed_when predicate in create-bmh.yml only
treats "'created' in bmh_result.stdout" as a change, so updates reported as
"configured" by oc apply are ignored; update the changed_when condition that
references bmh_result to also consider "configured" (and possibly
"patched"/other positive indicators) in bmh_result.stdout so that Secret/BMH
reconciliations on reruns are marked as changed (e.g., expand the predicate for
the registered variable bmh_result to detect "configured" in addition to
"created").

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml`:
- Around line 39-43: The task "Apply ClusterImageSet" currently sets
changed_when to only detect "'created' in cis_result.stdout" which misses
updates; change the condition to detect both creation and updates (e.g., use
"changed_when: \"'created' in cis_result.stdout or 'configured' in
cis_result.stdout\"" or include 'patched' if desired) for the cis_result
variable, and apply the same change to the other three "Apply ..." tasks that
register their own result variables so they report changes on both create and
update.
- Around line 22-31: The current "Create spoke pull secret" shell task uses a
check-then-create pattern and should be replaced with an idempotent apply;
change the task that references spoke_cluster_name and pull_secret_path to
generate a secret manifest (or run oc create secret generic ...
--from-file=.dockerconfigjson={{ pull_secret_path }}
--type=kubernetes.io/dockerconfigjson --dry-run=client -o yaml) and pipe that
into oc apply -n {{ spoke_cluster_name }} -f - so the operation is atomic and
avoids the TOCTOU race; remove reliance on ps_result and adjust changed_when to
rely on the apply result (or let Ansible detect changes) instead.
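
The idempotent apply could look like this (the secret name `pull-secret` is an assumption; adjust it to whatever the role currently creates):

```yaml
- name: Create or update spoke pull secret
  shell: |
    oc create secret generic pull-secret \
      --from-file=.dockerconfigjson={{ pull_secret_path }} \
      --type=kubernetes.io/dockerconfigjson \
      -n {{ spoke_cluster_name }} \
      --dry-run=client -o yaml | oc apply -f -
  register: ps_apply
  changed_when: "'created' in ps_apply.stdout or 'configured' in ps_apply.stdout"
```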

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-network.yml`:
- Around line 33-44: Update the shell task "Add cross-bridge nftables FORWARD
rules for spoke<->hub connectivity" to check and insert each direction
separately using the variables spoke_network_cidr and hub_network_cidr (e.g.,
grep for both "{{ spoke_network_cidr }}.*{{ hub_network_cidr }}" and "{{
hub_network_cidr }}.*{{ spoke_network_cidr }}" or run two independent checks) so
you only insert missing rules and avoid duplicates; adjust the echo messages and
changed_when (registered nft_result.stdout) to reflect per-direction additions;
additionally, if persistence is required, save the nftables ruleset after
changes (e.g., run "sudo nft list ruleset > /etc/nftables.conf" or use the
distro's nftables persistence mechanism) so rules survive reboots.
- Around line 23-31: The two tasks "Start spoke libvirt network" and "Set spoke
libvirt network to autostart" should stop unconditionally reporting changes;
register each command's result (e.g., start_result, autostart_result) and set
changed_when to a real change-detection expression instead of true — for
example, for "Start spoke libvirt network" check the network's current state
with virsh net-list --all or inspect start_result.rc/output to only mark changed
when the network was actually started, and for "Set spoke libvirt network to
autostart" query virsh net-info or parse autostart_result to only mark changed
when autostart was newly enabled. Ensure both tasks use the registered result in
failed_when/changed_when to preserve idempotence and accurate reporting.
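
For the nftables item above, a per-direction sketch — the `inet filter FORWARD` table/chain names are assumptions about the host ruleset, and the hub→spoke direction would mirror this task:

```yaml
- name: Add spoke->hub FORWARD rule if missing
  shell: |
    if ! sudo nft list chain inet filter FORWARD 2>/dev/null \
        | grep -q '{{ spoke_network_cidr }}.*{{ hub_network_cidr }}'; then
      sudo nft insert rule inet filter FORWARD \
        ip saddr {{ spoke_network_cidr }} ip daddr {{ hub_network_cidr }} accept
      echo "added spoke->hub"
    fi
  register: nft_fwd
  changed_when: "'added' in nft_fwd.stdout"
```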

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-vms.yml`:
- Around line 37-42: The task "Ensure spoke VMs are shut off" currently forces
no change reporting; instead, register the shell result (e.g., register:
virsh_destroy) for the loop that runs "sudo virsh destroy {{ spoke_cluster_name
}}-master-{{ item }}" over "{{ range(spoke_ctlplanes) | list }}" and set
changed_when based on the command outcome (mark changed when virsh actually
destroyed a domain and not when it returned "cannot find domain" or similar).
Update failed_when accordingly to preserve suppression of real errors while
allowing changed_when to reflect actual destructions.

In `@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/main.yml`:
- Around line 4-30: The block that runs spoke deployment tasks (the block
including include_tasks: cleanup.yml, create-spoke-network.yml,
create-spoke-vms.yml, setup-ksushy.yml, create-cluster-resources.yml,
create-bmh.yml, wait-for-install.yml, retrieve-credentials.yml) needs a rescue:
section to capture diagnostics on failure; add a rescue: that runs a diagnostic
include_tasks (e.g. include_tasks: diagnostics.yml or include_tasks:
gather-spoke-diagnostics.yml) and any necessary cleanup/notify steps, ensure the
rescue runs with the same environment KUBECONFIG: "{{ hub_kubeconfig }}", and
reference the same task names so failures during Create spoke VMs / Wait for
spoke cluster installation / Retrieve spoke cluster credentials trigger
diagnostic collection.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/setup-ksushy.yml`:
- Around line 19-25: The Redfish curl+python task "Verify spoke VMs are visible
via sushy-tools" can raise a JSON parse error when curl fails; update this shell
task to detect curl failures before JSON parsing by using curl with --fail/-f
and capturing its exit code and output into the registered variable
sushy_systems, then add a failed_when that checks curl succeeded (e.g.,
sushy_systems.rc == 0) and otherwise fail with a clear message including the
curl stderr/output and the spoke_ksushy_ip/spoke_ksushy_port; only run the
python/json extraction when curl succeeded (or guard the python command to
return a sane empty JSON on failure) so the JSON parser never receives non-JSON
input.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/agentclusterinstall.yml.j2`:
- Around line 14-19: The template currently hardcodes hostPrefix: 23 in
agentclusterinstall.yml.j2; make this value configurable by introducing a role
variable (e.g., spoke_host_prefix) in defaults/main.yml with a sensible default
of 23 and replace the literal 23 in the template with the variable reference
(spoke_host_prefix) so different clusterNetwork CIDRs or topologies can override
it; update any role README or variable docs to mention spoke_host_prefix and its
purpose.
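
A minimal sketch of the parameterization, following standard role layout:

```yaml
# roles/assisted/assisted-spoke/defaults/main.yml
# Default preserves current behavior (/23 host subnets).
spoke_host_prefix: 23

# templates/agentclusterinstall.yml.j2 — replace the literal with:
#   hostPrefix: {{ spoke_host_prefix }}
```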

In `@deploy/openshift-clusters/scripts/deploy-fencing-assisted.sh`:
- Around line 38-42: The current extraction of SPOKE_CLUSTER_NAME using grep/awk
is fragile; replace the pipeline that sets SPOKE_CLUSTER_NAME by invoking a
YAML-aware parser (e.g., yq or a small Python snippet) to read the
spoke_cluster_name key from vars/assisted.yml and fall back to "spoke-tnf" if
empty; update the logic surrounding the SPOKE_CLUSTER_NAME variable so the
script uses the parsed value (keep the same variable name SPOKE_CLUSTER_NAME)
and preserve the existing empty-check fallback.

In `@deploy/openshift-clusters/vars/assisted.yml.template`:
- Around line 24-29: The comment for the template variable spoke_ctlplanes
incorrectly states it is "Always 2 for TNF"; update the comment next to the
spoke_ctlplanes declaration to document the supported values (e.g., 1 for SNO, 2
for two-node/TNF) and note that it is parameterized rather than fixed, so the
variable (spoke_ctlplanes) can be set to either 1 or 2 depending on topology;
locate the spoke_ctlplanes line in assisted.yml.template and replace the
hardcoded "Always 2 for TNF" text with a brief description of allowed values and
intended use.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85da9815-129b-4cc2-8bbf-4b85d5ca6970

📥 Commits

Reviewing files that changed from the base of the PR and between c2c0209 and bc768c3.

📒 Files selected for processing (37)
  • CLAUDE.md
  • deploy/Makefile
  • deploy/aws-hypervisor/scripts/create.sh
  • deploy/openshift-clusters/assisted-install.yml
  • deploy/openshift-clusters/collections/requirements.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/README.md
  • deploy/openshift-clusters/roles/assisted/acm-install/defaults/main.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/agent-service-config.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/enable-tnf.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/enable-watch-all-namespaces.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/install-operator.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/main.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/validate.yml
  • deploy/openshift-clusters/roles/assisted/acm-install/templates/agentserviceconfig.yml.j2
  • deploy/openshift-clusters/roles/assisted/acm-install/templates/multiclusterhub.yml.j2
  • deploy/openshift-clusters/roles/assisted/acm-install/templates/operator-subscription.yml.j2
  • deploy/openshift-clusters/roles/assisted/acm-install/vars/main.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/README.md
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/defaults/main.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/cleanup.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-bmh.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-network.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/create-spoke-vms.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/main.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/retrieve-credentials.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/setup-ksushy.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/wait-for-install.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/agentclusterinstall.yml.j2
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/clusterdeployment.yml.j2
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/clusterimageset.yml.j2
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/infraenv.yml.j2
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/spoke-network.xml.j2
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/vars/main.yml
  • deploy/openshift-clusters/scripts/deploy-fencing-assisted.sh
  • deploy/openshift-clusters/vars/assisted.yml.template

Comment on lines +72 to +74
- name: Get RHCOS version from ISO URL
  set_fact:
    rhcos_version: "{{ rhcos_iso_url | regex_search('rhcos-([\\d.]+-\\d+)-live', '\\1') | first }}"

⚠️ Potential issue | 🟡 Minor

Add validation for rhcos_version extraction.

The regex extraction assumes the RHCOS ISO URL follows a specific naming pattern. If the URL format changes or the regex doesn't match, rhcos_version could be empty or cause a Jinja2 error (if | first is called on an empty list). Consider adding validation.

🛡️ Proposed validation
 - name: Get RHCOS version from ISO URL
   set_fact:
-    rhcos_version: "{{ rhcos_iso_url | regex_search('rhcos-([\\d.]+-\\d+)-live', '\\1') | first }}"
+    rhcos_version: "{{ rhcos_iso_url | regex_search('rhcos-([\\d.]+-\\d+)-live', '\\1') | first | default('unknown') }}"
+
+- name: Warn if RHCOS version could not be extracted
+  debug:
+    msg: "WARNING: Could not extract RHCOS version from URL, using 'unknown'"
+  when: rhcos_version == 'unknown'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Get RHCOS version from ISO URL
  set_fact:
    rhcos_version: "{{ rhcos_iso_url | regex_search('rhcos-([\\d.]+-\\d+)-live', '\\1') | first }}"
- name: Get RHCOS version from ISO URL
  set_fact:
    rhcos_version: "{{ rhcos_iso_url | regex_search('rhcos-([\\d.]+-\\d+)-live', '\\1') | default([], true) | first | default('unknown') }}"

- name: Warn if RHCOS version could not be extracted
  debug:
    msg: "WARNING: Could not extract RHCOS version from URL, using 'unknown'"
  when: rhcos_version == 'unknown'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/acm-install/tasks/agent-service-config.yml`
around lines 72 - 74, The task "Get RHCOS version from ISO URL" currently sets
rhcos_version by applying regex_search(... ) | first which can fail if the regex
doesn't match; update this task to validate the extraction from rhcos_iso_url by
capturing the regex result into a temporary variable (e.g.,
rhcos_version_matches) using regex_search, then assert or fail with a clear
error via the fail module if the list is empty, or set rhcos_version using a
safe default via the default filter only after confirming matches exist;
reference the existing symbols rhcos_iso_url, rhcos_version, regex_search and
the task name to locate and modify the code.

Comment on lines +107 to +153
# Create MultiClusterHub (for ACM) or MultiClusterEngine (for MCE)
- name: Create MultiClusterHub CR
  template:
    src: multiclusterhub.yml.j2
    dest: /tmp/multiclusterhub.yml
    mode: '0644'
  when: hub_operator == 'acm'

- name: Apply MultiClusterHub CR
  shell: |
    oc apply -f /tmp/multiclusterhub.yml
  register: mch_result
  changed_when: "'created' in mch_result.stdout"
  when: hub_operator == 'acm'

- block:
    - name: Wait for MultiClusterHub to reach Running phase
      shell: |
        oc get multiclusterhub multiclusterhub -n {{ op_config.namespace }} \
          -o jsonpath='{.status.phase}'
      register: mch_phase
      until: mch_phase.stdout == 'Running'
      retries: "{{ (multiclusterhub_timeout / 30) | int }}"
      delay: 30
  rescue:
    - name: Collect MCH timeout diagnostics
      shell: |
        echo "=== MultiClusterHub Status ==="
        oc get multiclusterhub multiclusterhub -n {{ op_config.namespace }} -o yaml 2>/dev/null | grep -A 50 'status:'
        echo ""
        echo "=== Non-Running Pods ==="
        oc get pods -n {{ op_config.namespace }} --no-headers 2>/dev/null | grep -v Running
        echo ""
        echo "=== Recent Events ==="
        oc get events -n {{ op_config.namespace }} --sort-by='.lastTimestamp' 2>/dev/null | tail -20
      register: mch_diag
      changed_when: false
      failed_when: false

    - name: Display MCH timeout diagnostics
      debug:
        msg: "{{ mch_diag.stdout }}"

    - name: Fail after MCH timeout
      fail:
        msg: "MultiClusterHub did not reach Running phase within timeout"
  when: hub_operator == 'acm'

⚠️ Potential issue | 🔴 Critical

hub_operator: mce never creates the operand CR.

This task file exposes ACM/MCE selection, but the operand creation path only exists for ACM: it templates/applies multiclusterhub.yml.j2 and waits on MultiClusterHub. When hub_operator is mce, the CSV may install successfully but no MultiClusterEngine resource is ever created, so the assisted-service stack is never brought up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/acm-install/tasks/install-operator.yml`
around lines 107 - 153, The play currently only creates and waits for
MultiClusterHub when hub_operator == 'acm', so when hub_operator == 'mce' no
operand CR is created; add an equivalent branch for MCE: template a
multiclusterengine.yml.j2 (similar to multiclusterhub.yml.j2) and write it to
/tmp/multiclusterengine.yml, then oc apply -f /tmp/multiclusterengine.yml
(registering result as e.g. mce_result), and add a wait block that polls the
MultiClusterEngine resource (oc get multiclusterengine multiclusterengine -n {{
op_config.namespace }} -o jsonpath='{.status.phase}' or appropriate ready field)
with retries/delay and a rescue block that collects the same diagnostics and
fails with a clear message; guard all these tasks with when: hub_operator ==
'mce' and reuse naming patterns (task names, registered vars) analogous to the
existing MultiClusterHub tasks.
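
A sketch of the missing MCE branch — the CR name `multiclusterengine` and the `Available` phase check are assumptions based on MCE defaults, and the diagnostics rescue block is omitted for brevity:

```yaml
- name: Create MultiClusterEngine CR
  template:
    src: multiclusterengine.yml.j2   # new template, mirroring multiclusterhub.yml.j2
    dest: /tmp/multiclusterengine.yml
    mode: '0644'
  when: hub_operator == 'mce'

- name: Apply MultiClusterEngine CR
  shell: |
    oc apply -f /tmp/multiclusterengine.yml
  register: mce_result
  changed_when: "'created' in mce_result.stdout or 'configured' in mce_result.stdout"
  when: hub_operator == 'mce'

- name: Wait for MultiClusterEngine to become Available
  shell: |
    oc get multiclusterengine multiclusterengine -o jsonpath='{.status.phase}'
  register: mce_phase
  until: mce_phase.stdout == 'Available'
  retries: "{{ (multiclusterhub_timeout / 30) | int }}"
  delay: 30
  when: hub_operator == 'mce'
```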

Comment on lines +107 to +138
- block:
- name: Wait for all agents to reach Done stage
shell: |
oc get agents -n {{ spoke_cluster_name }} -o json 2>/dev/null \
| python3 -c "
import json
import sys

data = json.load(sys.stdin)
agents = data.get('items', [])
total = len(agents)
done = 0
stuck = []
for a in agents:
stage = a.get('status', {}).get('progress', {}).get('currentStage', 'unknown')
if stage == 'Done':
done += 1
else:
state = a.get('status', {}).get('debugInfo', {}).get('state', 'unknown')
hostname = a.get('spec', {}).get('hostname', 'unknown')
stuck.append(f'{hostname}: state={state}, stage={stage}')

print(f'Agents Done: {done} / {total}')
for s in stuck:
print(f' {s}')
sys.exit(0 if done == total else 1)
"
register: agents_done
until: agents_done.rc == 0
retries: "{{ (spoke_install_timeout / 30) | int }}"
delay: 30
changed_when: false

⚠️ Potential issue | 🟠 Major

Also require the expected agent count in the final Done gate.

This block passes whenever every currently listed agent is Done. If the namespace only has 1 agent left after a partial cleanup or host loss, done == total still succeeds even though spoke_ctlplanes was not satisfied at the end. Please make this check require total == spoke_ctlplanes as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/wait-for-install.yml`
around lines 107 - 138, The task "Wait for all agents to reach Done stage"
currently succeeds when every currently listed agent is Done, but it must also
ensure the cluster has the expected agent count; update the shell/python check
so it only exits 0 when done == total AND total == spoke_ctlplanes. Inject the
expected count (spoke_ctlplanes) into the embedded Python (via Jinja variable
interpolation or an exported env var) and compare int(total) to
int(spoke_ctlplanes) before exiting; also print a clear message when counts
mismatch and keep the registered name agents_done and the until condition
semantics unchanged.
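
One way to inject the expected count, sketched — the variable names follow the existing task, while the env-var hand-off is illustrative:

```yaml
- name: Wait for all agents to reach Done stage
  shell: |
    oc get agents -n {{ spoke_cluster_name }} -o json 2>/dev/null \
      | EXPECTED={{ spoke_ctlplanes }} python3 -c "
    import json, os, sys
    data = json.load(sys.stdin)
    agents = data.get('items', [])
    total = len(agents)
    expected = int(os.environ['EXPECTED'])
    done = sum(1 for a in agents
               if a.get('status', {}).get('progress', {}).get('currentStage') == 'Done')
    print(f'Agents Done: {done} / {total} (expected {expected})')
    sys.exit(0 if (done == total and total == expected) else 1)
    "
  register: agents_done
  until: agents_done.rc == 0
  retries: "{{ (spoke_install_timeout / 30) | int }}"
  delay: 30
  changed_when: false
```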

metadata:
  name: "{{ effective_ocp_version }}.0"
spec:
  releaseImage: {{ effective_release_image }}

⚠️ Potential issue | 🟡 Minor

Quote releaseImage to safely handle pull spec special characters.

At Line 6, this value should be quoted (:, @ are common in release images).

💡 Suggested change
-  releaseImage: {{ effective_release_image }}
+  releaseImage: "{{ effective_release_image }}"

As per coding guidelines, "**/*.{yml,yaml}: Quote strings when they contain special characters in YAML files".

📝 Committable suggestion


Suggested change
releaseImage: {{ effective_release_image }}
releaseImage: "{{ effective_release_image }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/clusterimageset.yml.j2`
at line 6, The YAML template sets releaseImage: {{ effective_release_image }}
without quotes which can break when the pull spec contains special characters
(':' or '@'); update the clusterimageset.yml.j2 template to render the
releaseImage value as a quoted string (wrap the Jinja expression for
effective_release_image in quotes) so releaseImage is always emitted as a proper
YAML string.

<port start="1024" end="65535"/>
</nat>
</forward>
<bridge name="{{ spoke_network_name }}" stp="on" delay="0"/>

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or validation for bridge name length
rg -n "15|IFNAMSIZ|bridge.*name.*length|name.*limit" deploy/openshift-clusters/roles/assisted/

Repository: openshift-eng/two-node-toolbox

Length of output: 1100


🏁 Script executed:

# Read the spoke-network.xml.j2 template file
cat -n deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/spoke-network.xml.j2

Repository: openshift-eng/two-node-toolbox

Length of output: 1462


🏁 Script executed:

# Check vars/main.yml for spoke_network_name definition
cat -n deploy/openshift-clusters/roles/assisted/assisted-spoke/vars/main.yml

Repository: openshift-eng/two-node-toolbox

Length of output: 930


🏁 Script executed:

# Search for README files in the assisted role
find deploy/openshift-clusters/roles/assisted -name "README*" -o -name "readme*" | head -20

Repository: openshift-eng/two-node-toolbox

Length of output: 202


🏁 Script executed:

# Look for any validation logic related to network or bridge names
rg -n "assert|fail|validate" deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/ | grep -i "name\|bridge\|network" | head -20

Repository: openshift-eng/two-node-toolbox

Length of output: 56


🏁 Script executed:

# Search for any comments or documentation about network or bridge naming constraints
rg -n "bridge.*name|network.*name" deploy/openshift-clusters/roles/assisted/assisted-spoke/ -A 2 | head -40

Repository: openshift-eng/two-node-toolbox

Length of output: 3889


🏁 Script executed:

# Read the assisted-spoke README to check for naming constraints documentation
cat -n deploy/openshift-clusters/roles/assisted/assisted-spoke/README.md

Repository: openshift-eng/two-node-toolbox

Length of output: 6375


🏁 Script executed:

# Check if there are any default values or examples that might set spoke_cluster_name
find deploy/openshift-clusters/roles/assisted/assisted-spoke -name "defaults" -type d
cat deploy/openshift-clusters/roles/assisted/assisted-spoke/defaults/main.yml 2>/dev/null || echo "No defaults file found"

Repository: openshift-eng/two-node-toolbox

Length of output: 1055


🏁 Script executed:

# Search for any examples or documentation about cluster names in the entire assisted role
rg -n "spoke_cluster_name|cluster.*name" deploy/openshift-clusters/roles/assisted/ -B 2 -A 2 | grep -i "example\|default\|must\|should\|limit\|constraint\|validation" | head -30

Repository: openshift-eng/two-node-toolbox

Length of output: 1202


Add bridge name length validation or documentation.

The spoke_network_name (derived from spoke_cluster_name) is used directly as the libvirt bridge name. Linux bridge interface names are limited to 15 characters (IFNAMSIZ). While the default spoke-tnf (9 characters) is safe, a custom cluster name exceeding 15 characters would cause virsh net-define to fail. The README documents that spoke_cluster_name must be "DNS-safe" but does not mention this constraint.

Either:

  • Add validation in a pre-flight check to assert spoke_cluster_name length ≤ 15 characters, or
  • Document the 15-character limit in the README under the spoke_cluster_name variable description
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@deploy/openshift-clusters/roles/assisted/assisted-spoke/templates/spoke-network.xml.j2`
at line 8, The template uses {{ spoke_network_name }} (derived from
spoke_cluster_name) as the libvirt bridge name which is limited to 15
characters; add either a pre-flight validation that asserts
len(spoke_cluster_name) ≤ 15 (fail early with a clear message referencing
spoke_cluster_name/spoke_network_name and virsh net-define) or update the README
variable documentation for spoke_cluster_name to explicitly state the
15-character Linux bridge name limit and show an example; ensure the check/log
mentions the exact constraint and references the libvirt/virsh failure mode so
maintainers know why the restriction exists.
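
A pre-flight assert along these lines would fail early (assumes spoke_network_name is derived directly from spoke_cluster_name):

```yaml
- name: Validate spoke_cluster_name fits in a Linux bridge name (IFNAMSIZ = 15)
  assert:
    that:
      - spoke_cluster_name | length <= 15
    fail_msg: >-
      spoke_cluster_name '{{ spoke_cluster_name }}' is longer than 15 characters;
      it is used as the libvirt bridge name ({{ spoke_network_name }}), and
      virsh net-define will reject bridge names over 15 characters.
```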

@@ -0,0 +1,59 @@
#!/bin/bash

🛠️ Refactor suggestion | 🟠 Major

Use #!/usr/bin/bash shebang per coding guidelines.

As per coding guidelines, shell scripts should use #!/usr/bin/bash shebang.

♻️ Proposed fix
-#!/bin/bash
+#!/usr/bin/bash
📝 Committable suggestion


Suggested change
#!/bin/bash
#!/usr/bin/bash
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/scripts/deploy-fencing-assisted.sh` at line 1, The
script's shebang uses /bin/bash but must follow guidelines to use /usr/bin/bash;
update the top line of deploy-fencing-assisted.sh (the existing "#!/bin/bash")
to "#!/usr/bin/bash" so the script invokes the expected bash binary location.

Comment on lines +4 to +5
# Usage: After deploying a hub cluster with 'make deploy fencing-ipi',
# run 'make fencing-assisted' to deploy a spoke TNF cluster via ACM/assisted installer.

⚠️ Potential issue | 🟡 Minor

The usage comment points to the wrong make target.

The PR description introduces this flow as make deploy fencing-assisted, but this template tells users to run make fencing-assisted. That will send readers to the wrong entrypoint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/vars/assisted.yml.template` around lines 4 - 5,
Update the usage comment in assisted.yml.template to reference the correct make
target: change the second sentence from "run 'make fencing-assisted' to deploy a
spoke TNF cluster via ACM/assisted installer." to use "run 'make deploy
fencing-assisted'" so it matches the PR description and the actual entrypoint;
ensure the wording stays consistent with the first sentence that references
'make deploy fencing-ipi'.

- I7: Add validate.yml for assisted-spoke role that checks cluster name
  is DNS-safe, ctlplanes >= 2, VIPs are within spoke CIDR and outside
  DHCP range, and API/Ingress VIPs are different
- I12: Remove rm -rf from storage.yml that wiped assisted-service data
  on every run. Directories are now created idempotently with mkdir -p
  without destroying existing data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml (1)

5-40: Consider improving changed_when accuracy and hostPath data locality.

Two minor observations:

  1. changed_when underreports: oc apply outputs "configured" when updating existing resources, so this condition only detects initial creation. Consider:

    changed_when: "'created' in storage_result.stdout or 'configured' in storage_result.stdout"
  2. hostPath without nodeAffinity: In a multi-node hub cluster, if the assisted-service pod migrates to a different node, it will access an empty directory instead of its existing data. If high availability isn't required, consider adding nodeAffinity to pin the PV to a specific node, or document this as a single-node limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml`
around lines 5 - 40, The play currently only marks changes when oc apply prints
"created" (registered to storage_result), so update the changed_when for
storage_result to also check for "configured" in storage_result.stdout (e.g.,
"'created' in storage_result.stdout or 'configured' in storage_result.stdout")
to accurately reflect updates; additionally, to avoid hostPath data loss when
pods move nodes, add nodeAffinity to the PersistentVolume specs
(assisted-pv-images and assisted-pv-db) to pin them to the host node where
assisted_images_path/assisted_db_path reside, or explicitly document the
limitation that these PVs require a single-node hub if you choose not to add
nodeAffinity.
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml (1)

42-48: Refactor DHCP exclusion check for readability/maintainability.

The logic is correct, but repeated conversions make this block hard to audit. Consider task-level vars for precomputed integer values and keep that conditions compact.

♻️ Suggested refactor
 - name: Validate VIPs are not in DHCP range (.50-.150)
+  vars:
+    dhcp_start_int: "{{ spoke_dhcp_start | ansible.utils.ipaddr('int') | int }}"
+    dhcp_end_int: "{{ spoke_dhcp_end | ansible.utils.ipaddr('int') | int }}"
+    api_vip_int: "{{ spoke_api_vip | ansible.utils.ipaddr('int') | int }}"
+    ingress_vip_int: "{{ spoke_ingress_vip | ansible.utils.ipaddr('int') | int }}"
   assert:
     that:
-      - spoke_api_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or
-        spoke_api_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int
-      - spoke_ingress_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or
-        spoke_ingress_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int
+      - api_vip_int < dhcp_start_int or api_vip_int > dhcp_end_int
+      - ingress_vip_int < dhcp_start_int or ingress_vip_int > dhcp_end_int

As per coding guidelines, "Comment complex configurations in YAML files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml`
around lines 42 - 48, Precompute integer forms of the IPs and DHCP bounds to
simplify the assert: create task-level vars or a preceding set_fact that
converts spoke_api_vip, spoke_ingress_vip, spoke_dhcp_start and spoke_dhcp_end
to integers (e.g. api_vip_int, ingress_vip_int, dhcp_start_int, dhcp_end_int)
using the ansible.utils.ipaddr('int') filter, then replace the long repeated
expressions in the "Validate VIPs are not in DHCP range (.50-.150)" task's
assert/that with concise comparisons using those new names (e.g. api_vip_int <
dhcp_start_int or api_vip_int > dhcp_end_int and similarly for ingress_vip_int)
to improve readability and maintainability.
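Concretely, the set_fact variant of that refactor might look like this (a sketch following the prompt's suggested names, not the PR's final code):

```yaml
# Precompute integer forms once so the assert stays readable.
- name: Precompute integer forms of VIPs and DHCP bounds
  ansible.builtin.set_fact:
    api_vip_int: "{{ spoke_api_vip | ansible.utils.ipaddr('int') | int }}"
    ingress_vip_int: "{{ spoke_ingress_vip | ansible.utils.ipaddr('int') | int }}"
    dhcp_start_int: "{{ spoke_dhcp_start | ansible.utils.ipaddr('int') | int }}"
    dhcp_end_int: "{{ spoke_dhcp_end | ansible.utils.ipaddr('int') | int }}"

- name: Validate VIPs are not in DHCP range (.50-.150)
  assert:
    that:
      - api_vip_int | int < dhcp_start_int | int or api_vip_int | int > dhcp_end_int | int
      - ingress_vip_int | int < dhcp_start_int | int or ingress_vip_int | int > dhcp_end_int | int
```

Note that set_fact renders values as strings, so the casts back to `| int` inside the assert keep the comparisons numeric rather than lexicographic.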
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml`:
- Around line 21-24: The assert checks use "ansible.utils.ipaddr(...) is not
none" which is unsafe because the filter returns False for invalid/mismatched
inputs and "False is not none" passes; change each assertion (e.g., the task
named "Validate spoke_network_cidr is a valid CIDR" and the other two assertions
at the same block) to rely on the filter's truthiness instead of "is not none" —
e.g., replace "spoke_network_cidr | ansible.utils.ipaddr('network/prefix') is
not none" with simply "spoke_network_cidr |
ansible.utils.ipaddr('network/prefix')" (do the same for the other two
variables/filters referenced at lines 31 and 38).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a50f92a1-91e7-458d-8888-5acf115e7240

📥 Commits

Reviewing files that changed from the base of the PR and between 6b85d70 and 625462b.

📒 Files selected for processing (3)
  • deploy/openshift-clusters/roles/assisted/acm-install/tasks/storage.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/main.yml
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/main.yml

@fonta-rh
Contributor

fonta-rh commented Apr 9, 2026

Thanks @gamado — confirmed the fix commits are restored (8719ccc, bc768c3). Re-reviewed the full diff:

All 5 critical fixes present and correct:

  • C1: SSH key multi-detection (ed25519/rsa/ecdsa fallback)
  • C2: pull_secret_path deduped to playbook level
  • C3: Hub release image extraction in pre_tasks (run once)
  • C4: RHCOS ISO extraction checks rc + empty string
  • C5: Explicit disk removal after virsh undefine

Important fixes (7/9): I6 deploy script, I7 validate.yml, I8 cluster state, I9 failed_when cleanup, I10 rescue diagnostics, I11 hub_auth_dir removed, I13 READMEs, I14 TNF_CLUSTERS_SUPPORT runtime verification — all good.

One remaining item — I12: storage.yml:53 still has rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/* inside the "Fix hostPath permissions" task, which runs unconditionally. On a re-run against an existing hub with assisted-service data, this wipes the image cache and database. Consider gating it with force_cleanup or moving it to cleanup.yml.

Otherwise LGTM — nice work getting these back in.
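A guard along the lines fonta-rh suggests could look like this (a sketch; `force_cleanup` is an assumed opt-in variable, and the path variables come from the role defaults discussed above):

```yaml
# Destructive: wipes the assisted-service image cache and database.
# Only runs when the user explicitly opts in, e.g. -e force_cleanup=true.
- name: Clean assisted-service hostPath data
  ansible.builtin.shell: >-
    rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*
  when: force_cleanup | default(false) | bool
```

Alternatively, moving the task into cleanup.yml would mean re-runs against an existing hub never touch the data at all.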

The ipaddr() filter returns False (not None) for invalid inputs,
so 'is not none' always passes. Use truthiness check instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gamado
Contributor Author

gamado commented Apr 9, 2026

All 13 review findings addressed (C1-C5, I6-I13). I12 fixed in 625462b, ipaddr validation bug fixed in 7b4ec76. Ready for re-review.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml (1)

45-48: Quote complex `that:` expressions containing special characters.

Please quote these boolean expressions to match repository YAML rules and avoid parser ambiguity.

Proposed fix
-      - spoke_api_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or
-        spoke_api_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int
-      - spoke_ingress_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or
-        spoke_ingress_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int
+      - "spoke_api_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or spoke_api_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int"
+      - "spoke_ingress_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or spoke_ingress_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int"

As per coding guidelines, "Quote strings in YAML files when they contain special characters".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml`
around lines 45 - 48, The boolean expressions comparing spoke_api_vip and
spoke_ingress_vip against spoke_dhcp_start and spoke_dhcp_end contain special
characters (pipes, comparison operators) and must be quoted to avoid YAML
parsing issues; update the conditional lines that reference spoke_api_vip,
spoke_ingress_vip, spoke_dhcp_start and spoke_dhcp_end in validate.yml so each
complex that expression is wrapped in quotes (e.g. quote the entire expression
that uses ansible.utils.ipaddr('int') | int < ... or ... > ...) preserving the
exact logic and spacing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml`:
- Around line 4-13: The assert task "Validate spoke_cluster_name is non-empty
and DNS-safe" currently checks definition, non-empty, and a regex but not DNS
label length; update the assert to add a condition that spoke_cluster_name |
length <= 63 and update the fail_msg to mention the 63-character maximum so
invalid names are rejected early. Ensure you reference the existing task name
and the variable spoke_cluster_name when making the change.
- Around line 42-52: Add a pre-assert before the VIP range checks that verifies
the DHCP bounds are derivable and valid: ensure spoke_dhcp_start and
spoke_dhcp_end are defined, are valid IP addresses (parsable with
ansible.utils.ipaddr) and that the integer value of spoke_dhcp_start is less
than spoke_dhcp_end; if not, fail with a clear message explaining the CIDR/DHCP
range could not be derived so subsequent VIP comparisons (spoke_api_vip,
spoke_ingress_vip) won't run against invalid/empty bounds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b8d37fbd-4b5b-4553-b128-7312a4ed199b

📥 Commits

Reviewing files that changed from the base of the PR and between 625462b and 7b4ec76.

📒 Files selected for processing (1)
  • deploy/openshift-clusters/roles/assisted/assisted-spoke/tasks/validate.yml

Comment on lines +4 to +13
- name: Validate spoke_cluster_name is non-empty and DNS-safe
  assert:
    that:
      - spoke_cluster_name is defined
      - spoke_cluster_name | length > 0
      - spoke_cluster_name is match('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')
    fail_msg: >-
      spoke_cluster_name '{{ spoke_cluster_name | default("") }}' is invalid.
      Must be non-empty, lowercase, start/end with alphanumeric, and contain only [a-z0-9-].


⚠️ Potential issue | 🟠 Major

Enforce DNS label max length for spoke_cluster_name.

Line 9 validates characters/format, but not DNS-label length. Add <= 63 to fail early before downstream Kubernetes/OpenShift resource creation fails.

Proposed fix
 - name: Validate spoke_cluster_name is non-empty and DNS-safe
   assert:
     that:
       - spoke_cluster_name is defined
       - spoke_cluster_name | length > 0
+      - spoke_cluster_name | length <= 63
       - spoke_cluster_name is match('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')
     fail_msg: >-
       spoke_cluster_name '{{ spoke_cluster_name | default("") }}' is invalid.
-      Must be non-empty, lowercase, start/end with alphanumeric, and contain only [a-z0-9-].
+      Must be 1-63 chars, lowercase, start/end with alphanumeric, and contain only [a-z0-9-].

Comment on lines +42 to +52
- name: Validate VIPs are not in DHCP range (.50-.150)
  assert:
    that:
      - spoke_api_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or
        spoke_api_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int
      - spoke_ingress_vip | ansible.utils.ipaddr('int') | int < spoke_dhcp_start | ansible.utils.ipaddr('int') | int or
        spoke_ingress_vip | ansible.utils.ipaddr('int') | int > spoke_dhcp_end | ansible.utils.ipaddr('int') | int
    fail_msg: >-
      VIPs must be outside the DHCP range ({{ spoke_dhcp_start }} - {{ spoke_dhcp_end }}).
      API VIP: {{ spoke_api_vip }}, Ingress VIP: {{ spoke_ingress_vip }}.


⚠️ Potential issue | 🟠 Major

Validate DHCP bounds are derivable before VIP range checks.

This block assumes spoke_dhcp_start/spoke_dhcp_end are valid IPs. Add an explicit pre-assert so unsupported/small CIDRs fail with a clear message instead of reaching ambiguous integer comparisons.

Proposed fix
+- name: Validate DHCP bounds are derivable from spoke_network_cidr
+  assert:
+    that:
+      - spoke_dhcp_start | ansible.utils.ipaddr
+      - spoke_dhcp_end | ansible.utils.ipaddr
+      - (spoke_dhcp_start | ansible.utils.ipaddr('int') | int) < (spoke_dhcp_end | ansible.utils.ipaddr('int') | int)
+    fail_msg: >-
+      DHCP range could not be derived from spoke_network_cidr '{{ spoke_network_cidr }}'.
+      Ensure the CIDR is large enough for configured DHCP offsets.
+
 - name: Validate VIPs are not in DHCP range (.50-.150)
   assert:

gamado and others added 2 commits April 9, 2026 21:48
Ansible assert requires boolean conditionals. The ipaddr() filter
returns a string on success and False on failure. Adding '| bool'
converts both cases to proper booleans.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ipaddr() filter returns a string on valid input. '| bool' coerces
strings like "192.168.125.0/24" to False. Use '!= false' to correctly
distinguish valid results (strings) from invalid results (False).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
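Given that behavior, the final form of the CIDR assertion presumably reads something like this (a sketch reconstructed from the commit messages; only the first of the three assertions is shown):

```yaml
- name: Validate spoke_network_cidr is a valid CIDR
  assert:
    that:
      # ipaddr returns a string on success and False on failure, so
      # compare against false instead of relying on truthiness or | bool.
      - spoke_network_cidr | ansible.utils.ipaddr('network/prefix') != false
    fail_msg: "spoke_network_cidr '{{ spoke_network_cidr }}' is not a valid CIDR."
```

Comparing against false rather than piping through `| bool` keeps valid string results like "192.168.125.0/24" from being coerced to False.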

Labels

needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants