1091 ibm cpd scheduler not deleted by cp4d delete instancesh #1096

Open
luigimolinaro wants to merge 8 commits into main from
1091-ibm-cpd-scheduler-not-deleted-by-cp4d-delete-instancesh

@luigimolinaro
Contributor

@luigimolinaro luigimolinaro commented Mar 9, 2026

Bug Fixes

Fixed Scheduler Namespace Deletion

  • Issue: Script used hardcoded ibm-scheduling namespace instead of the configurable variable
  • Fix: Now uses PROJECT_SCHEDULING_SERVICE variable (default: cpd-scheduler)
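
A minimal sketch of the fix described above, assuming the namespace is resolved with a shell default expansion. The helper name `resolve_scheduler_ns` is hypothetical and is not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the scheduler namespace from the environment,
# falling back to the default named in this PR (cpd-scheduler).
resolve_scheduler_ns() {
  printf '%s\n' "${PROJECT_SCHEDULING_SERVICE:-cpd-scheduler}"
}

# The resolved value takes the place of the previously hardcoded 'ibm-scheduling'.
SCHEDULER_NS="$(resolve_scheduler_ns)"
echo "Scheduler namespace: ${SCHEDULER_NS}"
```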

Removed Duplicate Deletion Attempts

  • Eliminated redundant namespace deletion calls in delete_ibm_license_server() and delete_ibm_certificate_manager()
  • Improved script efficiency and reduced unnecessary API calls

New Features

1. Force Finalizer Removal (--force-finalizer)

  • Enables forced removal of Kubernetes finalizers via OpenShift REST API
  • Helps resolve stuck namespaces in Terminating state
  • Works on both namespace-level and resource-level finalizers
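
For namespace-level finalizers, one common way to do this is to clear `spec.finalizers` and PUT the result to the namespace's `/finalize` subresource. The sketch below assumes `oc` is logged in and `jq` is installed; the function name is hypothetical and the PR's actual implementation may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clear spec.finalizers on a namespace and send the result
# to the /finalize subresource. Assumes a logged-in `oc` and an installed `jq`.
force_finalize_namespace() {
  local ns="$1"
  # Refuse to act on an empty name.
  if [ -z "${ns}" ]; then
    echo "namespace name is required" >&2
    return 2
  fi

  oc get namespace "${ns}" -o json \
    | jq '.spec.finalizers = []' \
    | oc replace --raw "/api/v1/namespaces/${ns}/finalize" -f -
}

# Usage (against a namespace stuck in Terminating):
#   force_finalize_namespace cpd-scheduler
```

Bypassing finalizers can leave orphaned backing resources, so this is best kept behind an explicit opt-in such as `--force-finalizer`.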

2. Configurable Timeout (--timeout <SECONDS>)

  • Set custom timeout for namespace deletion (default: 900 seconds)
  • Prevents indefinite waiting on stuck operations
  • Example: --timeout 1200 for 20-minute timeout
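
A possible shape for parsing the two options introduced so far, shown only as a sketch. The variable names `NS_TIMEOUT` and `FORCE_FINALIZER` and the function `parse_args` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical option parser for --timeout and --force-finalizer.
NS_TIMEOUT=900          # default from the PR description
FORCE_FINALIZER=false

parse_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --timeout)
        # Accept digits only, so a missing or mistyped value is rejected.
        case "${2:-}" in
          ''|*[!0-9]*)
            echo "--timeout needs a number of seconds" >&2
            return 2 ;;
        esac
        NS_TIMEOUT="$2"
        shift 2 ;;
      --force-finalizer)
        FORCE_FINALIZER=true
        shift ;;
      *)
        shift ;;  # other arguments are out of scope for this sketch
    esac
  done
}

# Usage: parse_args "$@"
```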

3. Automatic Retry Logic

  • Up to 3 automatic retry attempts when timeout is reached
  • Shorter timeout (300s) for retry attempts
  • Automatic forced cleanup on retry if --force-finalizer is enabled
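
The retry behaviour above could look roughly like the following. This sketch assumes one initial attempt with the full timeout followed by up to three retries at 300 seconds; `WAIT_CMD` and `CLEANUP_CMD` are placeholder hooks, not names from the script.

```shell
#!/usr/bin/env bash
# Hypothetical retry wrapper. WAIT_CMD waits for a namespace to disappear and
# CLEANUP_CMD forces cleanup; both are placeholders for this sketch.
WAIT_CMD="${WAIT_CMD:-wait_ns_deleted}"
CLEANUP_CMD="${CLEANUP_CMD:-force_remove_resource_finalizers}"

delete_with_retries() {
  local ns="$1" timeout="${2:-900}" force="${3:-false}"
  local max_retries=3 retry_timeout=300 attempt=1

  # First attempt uses the full timeout.
  "${WAIT_CMD}" "${ns}" "${timeout}" && return 0

  while [ "${attempt}" -le "${max_retries}" ]; do
    echo "Retry ${attempt}/${max_retries} for namespace ${ns}" >&2
    # Forced cleanup only runs when the caller opted in.
    if [ "${force}" = "true" ]; then
      "${CLEANUP_CMD}" "${ns}"
    fi
    "${WAIT_CMD}" "${ns}" "${retry_timeout}" && return 0
    attempt=$((attempt + 1))
  done
  return 1
}
```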

4. Parallel Namespace Deletion (--parallel)

  • Delete multiple namespaces simultaneously for significantly faster execution
  • Up to 3x faster than sequential deletion
  • Reduces total deletion time from ~15-30 minutes to ~5-10 minutes
  • Maintains safety by keeping instance and operator namespace deletions sequential
  • Parallel deletion applies to cluster-wide namespaces only
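
As an illustration of the parallel mode, the sketch below starts one background job per namespace and then collects each exit status. `DELETE_CMD` is a placeholder for whatever deletes a single namespace and waits for it; it is not a name from the script.

```shell
#!/usr/bin/env bash
# Hypothetical parallel driver built on background jobs and `wait`.
DELETE_CMD="${DELETE_CMD:-delete_one_namespace}"

delete_namespaces_parallel() {
  local ns i failed=0
  local -a pids=() names=()

  # Start one background job per namespace.
  for ns in "$@"; do
    "${DELETE_CMD}" "${ns}" &
    pids+=("$!")
    names+=("${ns}")
  done

  # Collect every job's exit status so one failure does not hide another.
  for i in "${!pids[@]}"; do
    if wait "${pids[$i]}"; then
      echo "deleted: ${names[$i]}"
    else
      echo "failed: ${names[$i]}" >&2
      failed=1
    fi
  done
  return "${failed}"
}

# Usage: delete_namespaces_parallel knative-eventing knative-serving cs-control
```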

5. Colored Output

  • Green (log_success): Successful operations
  • Red (log_error): Errors and failures
  • Yellow (log_warning): Warnings and timeouts
  • Cyan (log_info): Informational messages
  • Makes it easy to quickly identify operation status
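
The logging helpers might be implemented along these lines. The function names come from the description above; the bodies, the timestamp format, and sending errors to stderr are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical logging helpers based on ANSI escape codes.
_log() {
  local color="$1" symbol="$2"
  shift 2
  # %b expands the escape sequences held in the color and reset arguments.
  printf '%b%s [%s] %s%b\n' "${color}" "${symbol}" \
    "$(date '+%Y-%m-%d %H:%M:%S')" "$*" '\033[0m'
}

log_success() { _log '\033[0;32m' '✓' "$@"; }      # green
log_error()   { _log '\033[0;31m' '✗' "$@" >&2; }  # red, written to stderr (assumption)
log_warning() { _log '\033[0;33m' '⚠' "$@"; }      # yellow
log_info()    { _log '\033[0;36m' 'ℹ' "$@"; }      # cyan

# Usage: log_info "Using parallel deletion mode for faster execution"
```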

Robust Cleanup Functions

force_remove_resource_finalizers()

Removes finalizers from resources that commonly block namespace deletion:

  • PersistentVolumeClaims (PVCs): Often have finalizers preventing deletion
  • PersistentVolumes (PVs): Associated volumes that may be stuck
  • Pods in Terminating state: Force deleted with --grace-period=0
  • Services: Remove finalizers blocking service cleanup
  • ConfigMaps & Secrets: Clean up resources with finalizers
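
For resource-level finalizers, a merge patch with `oc patch` is usually enough. The following is a sketch only: the function name and loop are assumptions, and cluster-scoped PersistentVolumes are left out to keep it short.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: clear finalizers on namespaced resources with `oc patch`.
clear_resource_finalizers() {
  local ns="$1" kind name
  if [ -z "${ns}" ]; then
    echo "namespace name is required" >&2
    return 2
  fi

  for kind in pvc service configmap secret; do
    # `-o name` prints one "kind/name" entry per line.
    for name in $(oc get "${kind}" -n "${ns}" -o name 2>/dev/null); do
      oc patch "${name}" -n "${ns}" --type=merge \
        -p '{"metadata":{"finalizers":null}}'
    done
  done

  # Once the namespace is being deleted every pod is terminating, so this
  # sketch force-deletes all remaining pods without a grace period.
  oc delete pods --all -n "${ns}" --grace-period=0 --force
}
```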

diagnose_namespace_stuck()

Provides comprehensive diagnostic information when namespaces are stuck:

  • Lists all remaining resources in the namespace
  • Shows namespace status and finalizers
  • Identifies pods in Terminating state
  • Lists PVCs that may be blocking deletion
  • Helps troubleshoot deletion issues
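
A diagnostic function of this kind could gather its information with a handful of read-only `oc` queries. The function name and the exact queries below are assumptions, not the PR's code.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic sketch using read-only `oc` queries.
diagnose_stuck_namespace() {
  local ns="$1" kind

  echo "== Namespace status and finalizers =="
  oc get namespace "${ns}" \
    -o jsonpath='{.status.phase}{"\n"}{.spec.finalizers}{"\n"}'

  echo "== Remaining namespaced resources =="
  for kind in $(oc api-resources --verbs=list --namespaced -o name); do
    oc get "${kind}" -n "${ns}" --ignore-not-found -o name
  done

  echo "== Pods in Terminating state =="
  # grep exits non-zero when nothing matches; that is not an error here.
  oc get pods -n "${ns}" --no-headers | grep Terminating || true

  echo "== PVCs =="
  oc get pvc -n "${ns}" --ignore-not-found
}
```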

Enhanced wait_ns_deleted()

  • Progress logging every 60 seconds
  • Automatic diagnostics on timeout
  • Retry logic with forced cleanup
  • Better error handling and return codes
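
The polling loop with periodic progress messages and a distinct return code on timeout might be sketched as follows. `NS_EXISTS_CMD`, `POLL_INTERVAL`, `PROGRESS_INTERVAL` and the function names are illustrative, not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical wait loop with progress logging.
NS_EXISTS_CMD="${NS_EXISTS_CMD:-ns_exists}"
POLL_INTERVAL="${POLL_INTERVAL:-5}"
PROGRESS_INTERVAL="${PROGRESS_INTERVAL:-60}"

# Default probe: succeeds while the namespace is still present.
ns_exists() { oc get namespace "$1" >/dev/null 2>&1; }

# Returns 0 once the namespace is gone, 1 on timeout.
wait_until_ns_gone() {
  local ns="$1" timeout="${2:-900}" waited=0
  local next_log="${PROGRESS_INTERVAL}"

  while "${NS_EXISTS_CMD}" "${ns}"; do
    if [ "${waited}" -ge "${timeout}" ]; then
      echo "Timed out after ${timeout}s waiting for ${ns}" >&2
      return 1
    fi
    if [ "${waited}" -ge "${next_log}" ]; then
      echo "Still waiting for ${ns} (${waited}s elapsed)"
      next_log=$((next_log + PROGRESS_INTERVAL))
    fi
    sleep "${POLL_INTERVAL}"
    waited=$((waited + POLL_INTERVAL))
  done
  return 0
}
```

A caller can branch on the return code, for example to run diagnostics or forced cleanup only after a timeout.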

@luigimolinaro luigimolinaro linked an issue Mar 9, 2026 that may be closed by this pull request
Luigi Molinaro added 4 commits March 9, 2026 11:05
The delete_ibm_scheduler() function was using hardcoded 'ibm-scheduling'
instead of the PROJECT_SCHEDULING_SERVICE variable. This caused the
scheduler namespace to not be deleted when running cp4d-delete-instance.sh.

Changes:
- Use PROJECT_SCHEDULING_SERVICE variable with default 'cpd-scheduler'
- Update confirmation message to use the variable
- Align with cpd_vars.sh standard variable naming

Fixes #1091

Signed-off-by: Luigi Molinaro <luigi.molinaro@ibm.com>

Enhanced delete_ibm_scheduler() to handle stuck namespaces in Terminating state:
- Use PROJECT_SCHEDULING_SERVICE variable (default: cpd-scheduler)
- Export namespace to JSON and remove kubernetes finalizers
- Use OpenShift REST API to force finalize the namespace
- Clean up temporary JSON file after operation
- Align with best practices from focedeletens.sh script

This ensures the scheduler namespace is properly deleted even when
stuck with finalizers.

Fixes #1091

Signed-off-by: Luigi Molinaro <luigi.molinaro@ibm.com>

Enhanced cp4d-delete-instance.sh with multiple improvements for reliable namespace deletion:

NEW FEATURES:
- Added --force-finalizer option to enable forced finalizer removal via OpenShift REST API
- Added --timeout option to configure namespace deletion timeout (default: 900s)
- Implemented automatic retry logic with up to 3 attempts when timeout is reached
- Added comprehensive diagnostic output when namespace deletion is stuck

ROBUST CLEANUP FUNCTIONS:
- force_remove_resource_finalizers(): Removes finalizers from blocking resources
  * PersistentVolumeClaims (PVCs)
  * PersistentVolumes (PVs) associated with namespace
  * Pods stuck in Terminating state (force delete with grace-period=0)
  * Services with finalizers
  * ConfigMaps with finalizers
  * Secrets with finalizers

- diagnose_namespace_stuck(): Provides detailed diagnostic information
  * Lists all remaining resources in namespace
  * Shows namespace status and finalizers
  * Identifies terminating pods
  * Lists PVCs that may be blocking deletion

ENHANCED WAIT LOGIC:
- Configurable timeout with progress logging every 60 seconds
- Automatic retry with forced cleanup when timeout is reached
- Shorter timeout (300s) for retry attempts
- Better error handling and return codes

NAMESPACE DELETION IMPROVEMENTS:
- Applied force_remove_finalizers() to all namespace deletion functions:
  * delete_operator_ns (CP4D operators)
  * delete_instance_ns (CP4D instance)
  * delete_knative (knative-eventing, knative-serving)
  * delete_app_connect (ibm-app-connect)
  * delete_ibm_scheduler (cpd-scheduler) - now uses PROJECT_SCHEDULING_SERVICE variable
  * delete_ibm_license_server (ibm-licensing)
  * delete_ibm_certificate_manager (ibm-cert-manager)
  * delete_common_services_control (cs-control)

BUG FIXES:
- Fixed delete_ibm_scheduler() to use PROJECT_SCHEDULING_SERVICE variable instead of hardcoded 'ibm-scheduling'
- Removed duplicate namespace deletion attempts in licensing and cert-manager functions

USAGE:
  ./cp4d-delete-instance.sh <namespace>
  ./cp4d-delete-instance.sh -n <namespace> --force-finalizer --timeout 1200

Signed-off-by: Luigi Molinaro <luigi.molinaro@ibm.com>

Added color-coded logging functions with visual indicators:
- ✓ Green (log_success): Successful operations
- ✗ Red (log_error): Errors and failures
- ⚠ Yellow (log_warning): Warnings and timeouts
- ℹ Cyan (log_info): Informational messages

Applied colored logging throughout the script:
- Success messages when namespaces are deleted
- Warnings for timeouts and stuck namespaces
- Errors for failed deletion attempts
- Info messages for cleanup operations and diagnostics

This makes it much easier to quickly identify the status of operations
during namespace deletion, especially when dealing with stuck resources.

Signed-off-by: Luigi Molinaro <luigi.molinaro@ibm.com>

@luigimolinaro luigimolinaro force-pushed the 1091-ibm-cpd-scheduler-not-deleted-by-cp4d-delete-instancesh branch from f42326c to a8e277d Compare March 9, 2026 10:05
@luigimolinaro luigimolinaro requested a review from fketelaars March 9, 2026 10:05
Implemented parallel deletion mode to significantly speed up namespace cleanup:

NEW OPTION:
- --parallel: Enable parallel deletion of multiple namespaces

NEW FUNCTIONS:
- start_ns_deletion(): Initiates namespace deletion in background (non-blocking)
- wait_multiple_ns_deleted(): Waits for multiple namespaces to complete deletion in parallel
  * Monitors all namespaces simultaneously
  * Progress logging every 60 seconds
  * Applies forced cleanup to stuck namespaces if --force-finalizer is enabled
  * Individual status reporting for each namespace

PARALLEL DELETION STRATEGY:
1. Instance and operator namespaces deleted sequentially (dependencies)
2. Cluster-wide namespaces deleted in parallel:
   - knative-eventing
   - knative-serving
   - ibm-app-connect
   - cpd-scheduler
   - ibm-licensing (if not shared)
   - ibm-cert-manager (if not shared)
   - cs-control

PERFORMANCE IMPROVEMENT:
- Sequential mode: ~15-30 minutes (namespaces deleted one by one)
- Parallel mode: ~5-10 minutes (multiple namespaces deleted simultaneously)
- Up to 3x faster for environments with many namespaces

USAGE:
  # Sequential deletion (default, original behavior)
  ./cp4d-delete-instance.sh -n cpd-instance

  # Parallel deletion (faster)
  ./cp4d-delete-instance.sh -n cpd-instance --parallel

  # Parallel with force finalizer
  ./cp4d-delete-instance.sh -n cpd-instance --parallel --force-finalizer

BACKWARD COMPATIBILITY:
- Default behavior unchanged (sequential deletion)
- Parallel mode only activated with --parallel flag
- All existing options work with both modes

Signed-off-by: Luigi Molinaro <luigi.molinaro@ibm.com>
Collaborator

@fketelaars fketelaars left a comment

  • Enables forced removal of Kubernetes finalizers via OpenShift REST API --> Why use the REST API here when other parts of the code use the oc patch command?
  • When the command is run against a non-existent CP4D instance namespace, it tries to delete the ' ' namespace and waits 900 seconds
  • Parallel should be the default
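
One way to address the empty-namespace case raised in this review is a guard that runs before any deletion or wait. The sketch below is illustrative; `should_delete_ns` and `ns_exists` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical guard: never start a deletion or a wait for an empty or
# missing namespace.
ns_exists() { oc get namespace "$1" >/dev/null 2>&1; }

# Returns 0 when the namespace should be processed, 1 when it should be skipped.
should_delete_ns() {
  local ns="$1"
  # Reject empty or whitespace-only names.
  case "${ns}" in
    *[![:space:]]*) ;;
    *)
      echo "Skipping: namespace name is empty" >&2
      return 1 ;;
  esac
  if ! ns_exists "${ns}"; then
    echo "Skipping: namespace ${ns} not found" >&2
    return 1
  fi
  return 0
}

# Usage: should_delete_ns "${INSTANCE_NS}" && echo "would delete ${INSTANCE_NS}"
```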

Collaborator

@fketelaars fketelaars left a comment

This is what I mean:

cp4d-delete-instance.sh cpd --parallel
About to delete the following from the cluster:
- Instance namespace:
- Operator namespace:
- IBM Custom Resource Definitions
Are you sure (y/N)? y
ℹ [2026-03-13 07:12:21] Using parallel deletion mode for faster execution
error: resource(s) were provided, but no name was specified
[2026-03-13 07:12:21] Getting Custom Resources in OpenShift project ...
You must specify the type of resource to get. Use "oc api-resources" for a complete list of supported resources.

error: Required resource not specified.
Use "oc explain <resource>" for a detailed description of that resource (e.g. oc explain pods).
See 'oc get -h' for help and examples
[2026-03-13 07:12:22] Delete all Custom Resources except the base ones
[2026-03-13 07:12:22] Delete remaining Custom Resources
[2026-03-13 07:12:22] Delete role binding if Cloud Pak for Data was connected to IAM
error: resource(s) were provided, but no name was specified
error: the server doesn't have a resource type "authentication"
[2026-03-13 07:12:22] Waiting for deletion of namespace  (timeout: 900s)...

@luigimolinaro
Contributor Author

luigimolinaro commented Mar 13, 2026

My assumption is that the environment variables normally used to manage Cloud Pak for Data are set.

For example, variables like:

export PROJECT_CERT_MANAGER="cert-manager"
export PROJECT_LICENSE_SERVICE="ibm-licensing"
export PROJECT_SCHEDULING_SERVICE="cpd-scheduler"
export PROJECT_CPD_INST_OPERATORS="cpd-operators"
export PROJECT_CPD_INST_OPERANDS="cpd"

If these variables are not defined, the script has no way to determine the namespaces and resources it should operate on.
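
Until discovery is settled, a pre-flight check could at least report which of these variables are missing instead of proceeding with empty values. This is a sketch using bash indirect expansion; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: print the name of every variable from the
# list above that is unset or empty.
missing_cpd_vars() {
  local name
  for name in PROJECT_CERT_MANAGER PROJECT_LICENSE_SERVICE \
              PROJECT_SCHEDULING_SERVICE PROJECT_CPD_INST_OPERATORS \
              PROJECT_CPD_INST_OPERANDS; do
    # ${!name} expands to the value of the variable whose name is in $name.
    [ -n "${!name:-}" ] || echo "${name}"
  done
}

# Usage:
#   missing="$(missing_cpd_vars)"
#   [ -z "${missing}" ] || echo "Unset variables: ${missing}" >&2
```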

These variables are usually well understood when Cloud Pak is installed without the deployer, because they are configured explicitly. However, when using the Cloud Pak Deployer, users often rely only on the deployer abstraction and may not be aware of these underlying variables.

I'm currently thinking about how we could improve the script so it can automatically discover these values instead of relying on environment variables.

Let me think about the best way to implement this.
If you have any suggestions on how we could detect the namespaces or resources dynamically, they would be very welcome. 👍

Luigi Molinaro added 2 commits March 13, 2026 09:14
Major improvements:
- Auto-discovery of CP4D namespaces by finding ZenService resources
- Flexible pattern matching for namespace variants (licensing, cert-manager, scheduler)
- --dry-run mode for safe testing without deletion
- Enhanced confirmation summary with detailed resource counts
- Improved help documentation with comprehensive usage examples
- Fixed argument parsing to properly handle flags like --dry-run
- Fixed variable scope issues (removed invalid 'local' declarations)
- Smart namespace detection excludes OpenShift system namespaces
- No default assumptions for optional namespaces (scheduler, cert-manager, licensing)
- Single confirmation prompt with comprehensive deletion summary
- Better error messages when auto-discovery fails

The script now provides a much safer and user-friendly experience for deleting CP4D instances.
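
A dry-run mode is often implemented by routing destructive commands through a small wrapper. The sketch below shows that pattern; `run` and `DRY_RUN` are illustrative names and may not match the script.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: commands passed to `run` are only printed
# when DRY_RUN=true and are executed otherwise.
DRY_RUN="${DRY_RUN:-false}"

run() {
  if [ "${DRY_RUN}" = "true" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Usage: run oc delete namespace "${SOME_NAMESPACE}" --wait=false
```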
@luigimolinaro
Contributor Author

luigimolinaro commented Mar 13, 2026

I think I found a way:

=== Discovered Cloud Pak for Data Configuration ===
Instance namespace:   cpd
Operator namespace:   cpd-operators
Scheduler namespace:  <not found>
Cert-manager:         cert-manager
Licensing:            ibm-licensing
==================================================

Please check

Collaborator

@fketelaars fketelaars left a comment

I'm not sure if the changes are heading in the right direction. Concerns:

  • cert-manager and the operator namespace are Red Hat. I don't want this script to touch these, only the IBM certificate manager if it exists.
  • I really don't want auto-discovery for the CP4D instance namespace; this is too risky IMO. Let users specify the namespace and assume that the operator namespace is <instance_namespace>-operators unless it is specified on the command line or through an environment variable
  • There are situations where instance namespace was deleted (or pending deletion) and the operator namespace was not, also vice-versa. Don't limit the script by assuming that these are still there; I want it to clean up residuals as well, even if orphaned.
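
The namespace resolution requested in this review could be sketched as below. The precedence (command line, then `PROJECT_CPD_INST_OPERATORS`, then the `<instance_namespace>-operators` convention) is an assumption, and `resolve_operator_ns` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical resolution of the operator namespace without auto-discovery.
resolve_operator_ns() {
  local instance_ns="$1" cli_operator_ns="${2:-}"
  if [ -z "${instance_ns}" ]; then
    echo "instance namespace must be specified" >&2
    return 2
  fi
  if [ -n "${cli_operator_ns}" ]; then
    printf '%s\n' "${cli_operator_ns}"
  elif [ -n "${PROJECT_CPD_INST_OPERATORS:-}" ]; then
    printf '%s\n' "${PROJECT_CPD_INST_OPERATORS}"
  else
    printf '%s\n' "${instance_ns}-operators"
  fi
}

# Usage: OPERATOR_NS="$(resolve_operator_ns cpd)"
```

Because the result does not depend on either namespace still existing, the instance and operator namespaces can each be cleaned up independently, including orphaned leftovers.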

Development

Successfully merging this pull request may close these issues.

ibm-cpd-scheduler not deleted by cp4d-delete-instance.sh

2 participants