[release-4.20] Avoid using lb-ext.kubeconfig for seedgen cleanup#6008
Conversation
# Background During cleanup before seed generation, we want to delete all pods in the Succeeded or Failed phase. # Issue The current implementation uses `oc delete pod --all-namespaces --field-selector=status.phase==...` with the node `lb-ext.kubeconfig` as a kubeconfig. This is in contrast to the rest of the controller, which uses the in-cluster config via a client loaded with ctrl.GetConfigOrDie(). We've had users report TLS errors when the controller tries to use the lb-ext.kubeconfig, which seems expected since that kubeconfig is meant for external use and may not have the correct CA or credentials for in-cluster API access. Before they hit that error, the client has already been used, so it proves that the in-cluster config works, and this is specifically an issue with using the lb-ext.kubeconfig for this cleanup step. # Solution The solution is to avoid using the lb-ext.kubeconfig for this cleanup step, and instead use the same in-cluster client that the rest of the controller uses. # Caveat As the existing TODO states: ```go // TODO: Can this be done cleanly via client? The client.DeleteAllOf seems to require a specified namespace, so maybe loop over the namespaces ``` The Kubernetes API does not support DeleteCollection without a namespace for namespaced resources, so we cannot simply use DeleteAllOf with no namespace. Instead, we copy the behavior of kubectl when using `delete --all-namespaces`, which is to list the resources with the field selector, and then delete them one by one. This is less efficient than a single DeleteCollection call, but it's already what kubectl was doing when we were calling it, so we know it works and is acceptable for this use case. # Implementation Details We cannot use `client.MatchingFields` because we do not have an index set up and it seems overkill to do so. # Agent Interaction Agent assisted. The conversation went something like this: - Where does r.Client get its kubeconfig? → ctrl.GetConfigOrDie() → in-cluster SA - Where does the oc call get its kubeconfig? → lb-ext.kubeconfig on the host via chroot executor - Can we generate a kubeconfig from rest.Config? → no built-in utility, hand-rolled versions kept being incomplete - Just use DeleteAllOf with no namespace? → tried it, in a toy project using `sigs.k8s.io/controller-runtime/pkg/envtest`, API rejects it (no cluster-scoped DeleteCollection for namespaced resources) - Inspect kubectl source - what does it do when delete --all-namespaces? → list + delete one by one, sequentially - Did that, modified toy project to prove it works, updated the controller, added comment explaining the reasoning
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jc-rh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cherrypick release-4.19 release-4.18 release-4.17 release-4.16 |
|
@jc-rh: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
b4a97c1
into
openshift-kni:release-4.20
|
@jc-rh: new pull request created: #6009 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This is a *manual* cherry-pick of openshift-kni#6008, with an additional RBAC fix: add pod delete permission to the seedgen controller's kubebuilder RBAC annotation. On main, switching from `oc delete pod` (authenticated via lb-ext.kubeconfig) to `r.Client.Delete()` (authenticated via the controller's service account) worked because an unrelated commit d756889 ("MGMT-21789: Add new feature for deleting pods stuck in ImagePullBackOff error") had already added the pod delete verb to the ClusterRole. That commit is not in release-4.19, so on a clean cherry-pick the service account lacked the permission and pod cleanup during seed generation will fail with a forbidden error: ``` 2026-05-15T15:06:58.207369896Z 2026-05-15T15:06:58Z ERROR controllers.SeedGenerator Seed generation failed {"error": "failed to cleanup Succeeded pods: failed to delete pod openshift-kube-scheduler/installer-5-retry-1-seed-sno-node: pods \"installer-5-retry-1-seed-sno-node\" is forbidden: User \"system:serviceaccount:openshift-lifecycle-agent:lifecycle-agent-controller-manager\" cannot delete resource \"pods\" in API group \"\" in the namespace \"openshift-kube-scheduler\""} ``` The original kubebuilder annotation was never updated to include delete when the code was changed to use the client — it was simply masked on main by the MGMT-21789 RBAC addition. This commit fixes the annotation at the source so the permission is explicitly tied to the code that needs it.
This is a *manual* cherry-pick of #6008, with an additional RBAC fix: add pod delete permission to the seedgen controller's kubebuilder RBAC annotation. On main, switching from `oc delete pod` (authenticated via lb-ext.kubeconfig) to `r.Client.Delete()` (authenticated via the controller's service account) worked because an unrelated commit d756889 ("MGMT-21789: Add new feature for deleting pods stuck in ImagePullBackOff error") had already added the pod delete verb to the ClusterRole. That commit is not in release-4.19, so on a clean cherry-pick the service account lacked the permission and pod cleanup during seed generation will fail with a forbidden error: ``` 2026-05-15T15:06:58.207369896Z 2026-05-15T15:06:58Z ERROR controllers.SeedGenerator Seed generation failed {"error": "failed to cleanup Succeeded pods: failed to delete pod openshift-kube-scheduler/installer-5-retry-1-seed-sno-node: pods \"installer-5-retry-1-seed-sno-node\" is forbidden: User \"system:serviceaccount:openshift-lifecycle-agent:lifecycle-agent-controller-manager\" cannot delete resource \"pods\" in API group \"\" in the namespace \"openshift-kube-scheduler\""} ``` The original kubebuilder annotation was never updated to include delete when the code was changed to use the client — it was simply masked on main by the MGMT-21789 RBAC addition. This commit fixes the annotation at the source so the permission is explicitly tied to the code that needs it.
This is a *manual* cherry-pick of #6008, with an additional RBAC fix: add pod delete permission to the seedgen controller's kubebuilder RBAC annotation. On main, switching from `oc delete pod` (authenticated via lb-ext.kubeconfig) to `r.Client.Delete()` (authenticated via the controller's service account) worked because an unrelated commit d756889 ("MGMT-21789: Add new feature for deleting pods stuck in ImagePullBackOff error") had already added the pod delete verb to the ClusterRole. That commit is not in release-4.19, so on a clean cherry-pick the service account lacked the permission and pod cleanup during seed generation will fail with a forbidden error: ``` 2026-05-15T15:06:58.207369896Z 2026-05-15T15:06:58Z ERROR controllers.SeedGenerator Seed generation failed {"error": "failed to cleanup Succeeded pods: failed to delete pod openshift-kube-scheduler/installer-5-retry-1-seed-sno-node: pods \"installer-5-retry-1-seed-sno-node\" is forbidden: User \"system:serviceaccount:openshift-lifecycle-agent:lifecycle-agent-controller-manager\" cannot delete resource \"pods\" in API group \"\" in the namespace \"openshift-kube-scheduler\""} ``` The original kubebuilder annotation was never updated to include delete when the code was changed to use the client — it was simply masked on main by the MGMT-21789 RBAC addition. This commit fixes the annotation at the source so the permission is explicitly tied to the code that needs it.
This is an automated cherry-pick of #6003
/assign openshift-cherrypick-robot