Skip to content

[release-4.20] Avoid using lb-ext.kubeconfig for seedgen cleanup#6008

Merged
openshift-merge-bot[bot] merged 2 commits into
openshift-kni:release-4.20from
openshift-cherrypick-robot:cherry-pick-6003-to-release-4.20
Apr 19, 2026
Merged

[release-4.20] Avoid using lb-ext.kubeconfig for seedgen cleanup#6008
openshift-merge-bot[bot] merged 2 commits into
openshift-kni:release-4.20from
openshift-cherrypick-robot:cherry-pick-6003-to-release-4.20

Conversation

@openshift-cherrypick-robot
Copy link
Copy Markdown

This is an automated cherry-pick of #6003

/assign openshift-cherrypick-robot

omertuc and others added 2 commits April 19, 2026 10:30
# Background

During cleanup before seed generation, we want to delete all pods in the
Succeeded or Failed phase.

# Issue

The current implementation uses `oc delete pod --all-namespaces
--field-selector=status.phase==...` with the node `lb-ext.kubeconfig` as
a kubeconfig. This is in contrast to the rest of the controller, which
uses the in-cluster config via a client loaded with
ctrl.GetConfigOrDie().

We've had users report TLS errors when the controller tries to use the
lb-ext.kubeconfig, which seems expected since that kubeconfig is meant
for external use and may not have the correct CA or credentials for
in-cluster API access.

Before they hit that error, the client has already been used, so it
proves that the in-cluster config works, and this is specifically an
issue with using the lb-ext.kubeconfig for this cleanup step.

# Solution

The solution is to avoid using the lb-ext.kubeconfig for this cleanup
step, and instead use the same in-cluster client that the rest of the
controller uses.

# Caveat

As the existing TODO states:

```go
// TODO: Can this be done cleanly via client? The client.DeleteAllOf
seems to require a specified namespace, so maybe loop over the
namespaces
```

The Kubernetes API does not support DeleteCollection without a namespace
for namespaced resources, so we cannot simply use DeleteAllOf with no
namespace. Instead, we copy the behavior of kubectl when using `delete
--all-namespaces`, which is to list the resources with the field
selector, and then delete them one by one. This is less efficient than a
single DeleteCollection call, but it's already what kubectl was doing
when we were calling it, so we know it works and is acceptable for this
use case.

# Implementation Details

We cannot use `client.MatchingFields` because we do not have an index
set up and it seems overkill to do so.

# Agent Interaction

Agent assisted. The conversation went something like this:

- Where does r.Client get its kubeconfig? → ctrl.GetConfigOrDie() → in-cluster SA
- Where does the oc call get its kubeconfig? → lb-ext.kubeconfig on the host via chroot executor
- Can we generate a kubeconfig from rest.Config? → no built-in utility, hand-rolled versions kept being incomplete
- Just use DeleteAllOf with no namespace? → tried it, in a toy project using `sigs.k8s.io/controller-runtime/pkg/envtest`, API rejects it (no cluster-scoped DeleteCollection for namespaced resources)
- Inspect kubectl source - what does it do when delete --all-namespaces? → list + delete one by one, sequentially
- Did that, modified toy project to prove it works, updated the controller, added comment explaining the reasoning
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 19, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jc-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2026
@jc-rh
Copy link
Copy Markdown
Member

jc-rh commented Apr 19, 2026

/cherrypick release-4.19 release-4.18 release-4.17 release-4.16

@openshift-cherrypick-robot
Copy link
Copy Markdown
Author

@jc-rh: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

Details

In response to this:

/cherrypick release-4.19 release-4.18 release-4.17 release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit b4a97c1 into openshift-kni:release-4.20 Apr 19, 2026
11 checks passed
@openshift-cherrypick-robot
Copy link
Copy Markdown
Author

@jc-rh: new pull request created: #6009

Details

In response to this:

/cherrypick release-4.19 release-4.18 release-4.17 release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

omertuc added a commit to omertuc/lifecycle-agent that referenced this pull request May 26, 2026
This is a *manual* cherry-pick of openshift-kni#6008, with an additional RBAC fix:
add pod delete permission to the seedgen controller's kubebuilder RBAC
annotation.

On main, switching from `oc delete pod` (authenticated via
lb-ext.kubeconfig) to `r.Client.Delete()` (authenticated via the
controller's service account) worked because an unrelated commit
d756889 ("MGMT-21789: Add new feature for deleting pods stuck in
ImagePullBackOff error") had already added the pod delete verb to the
ClusterRole. That commit is not in release-4.19, so on a clean
cherry-pick the service account lacked the permission and pod cleanup
during seed generation will fail with a forbidden error:

```
2026-05-15T15:06:58.207369896Z 2026-05-15T15:06:58Z	ERROR	controllers.SeedGenerator	Seed generation failed	{"error": "failed to cleanup Succeeded pods: failed to delete pod openshift-kube-scheduler/installer-5-retry-1-seed-sno-node: pods \"installer-5-retry-1-seed-sno-node\" is forbidden: User \"system:serviceaccount:openshift-lifecycle-agent:lifecycle-agent-controller-manager\" cannot delete resource \"pods\" in API group \"\" in the namespace \"openshift-kube-scheduler\""}
```

The original kubebuilder annotation was never updated to include delete
when the code was changed to use the client — it was simply masked on
main by the MGMT-21789 RBAC addition. This commit fixes the annotation
at the source so the permission is explicitly tied to the code that
needs it.
openshift-merge-bot Bot pushed a commit that referenced this pull request May 26, 2026
This is a *manual* cherry-pick of #6008, with an additional RBAC fix:
add pod delete permission to the seedgen controller's kubebuilder RBAC
annotation.

On main, switching from `oc delete pod` (authenticated via
lb-ext.kubeconfig) to `r.Client.Delete()` (authenticated via the
controller's service account) worked because an unrelated commit
d756889 ("MGMT-21789: Add new feature for deleting pods stuck in
ImagePullBackOff error") had already added the pod delete verb to the
ClusterRole. That commit is not in release-4.19, so on a clean
cherry-pick the service account lacked the permission and pod cleanup
during seed generation will fail with a forbidden error:

```
2026-05-15T15:06:58.207369896Z 2026-05-15T15:06:58Z	ERROR	controllers.SeedGenerator	Seed generation failed	{"error": "failed to cleanup Succeeded pods: failed to delete pod openshift-kube-scheduler/installer-5-retry-1-seed-sno-node: pods \"installer-5-retry-1-seed-sno-node\" is forbidden: User \"system:serviceaccount:openshift-lifecycle-agent:lifecycle-agent-controller-manager\" cannot delete resource \"pods\" in API group \"\" in the namespace \"openshift-kube-scheduler\""}
```

The original kubebuilder annotation was never updated to include delete
when the code was changed to use the client — it was simply masked on
main by the MGMT-21789 RBAC addition. This commit fixes the annotation
at the source so the permission is explicitly tied to the code that
needs it.
openshift-merge-bot Bot pushed a commit that referenced this pull request May 28, 2026
This is a *manual* cherry-pick of #6008, with an additional RBAC fix:
add pod delete permission to the seedgen controller's kubebuilder RBAC
annotation.

On main, switching from `oc delete pod` (authenticated via
lb-ext.kubeconfig) to `r.Client.Delete()` (authenticated via the
controller's service account) worked because an unrelated commit
d756889 ("MGMT-21789: Add new feature for deleting pods stuck in
ImagePullBackOff error") had already added the pod delete verb to the
ClusterRole. That commit is not in release-4.19, so on a clean
cherry-pick the service account lacked the permission and pod cleanup
during seed generation will fail with a forbidden error:

```
2026-05-15T15:06:58.207369896Z 2026-05-15T15:06:58Z	ERROR	controllers.SeedGenerator	Seed generation failed	{"error": "failed to cleanup Succeeded pods: failed to delete pod openshift-kube-scheduler/installer-5-retry-1-seed-sno-node: pods \"installer-5-retry-1-seed-sno-node\" is forbidden: User \"system:serviceaccount:openshift-lifecycle-agent:lifecycle-agent-controller-manager\" cannot delete resource \"pods\" in API group \"\" in the namespace \"openshift-kube-scheduler\""}
```

The original kubebuilder annotation was never updated to include delete
when the code was changed to use the client — it was simply masked on
main by the MGMT-21789 RBAC addition. This commit fixes the annotation
at the source so the permission is explicitly tied to the code that
needs it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants