1. Environment:
- CrateDB Operator Version: crate/crate-operator:2.47.1 (Upgraded from 2.42.0 during troubleshooting)
- CrateDB Version (Targeted in CR): 5.8.1
- Kubernetes Version:
  - Client: v1.31.6 (gitCommit: 6b3560758b37680cb713dfc71da03c04cadd657c)
  - Server: v1.31.3 (gitCommit: c83cbee114ddb732cdc06d3d1b62c9eb9220726f)
- Kubernetes Distribution: kubeadm-deployed cluster (control plane on Ubuntu 24.04.2 LTS, worker nodes on Ubuntu 22.04.5 LTS, some with Azure kernels)
- Number of Kubernetes Nodes: 4 (1 control-plane node: iot-api; 3 worker nodes: vmhygenco, vmhygenco2, vmhygenco3)
2. Observed Behavior & Symptoms:
Even after upgrading to CrateDB Operator v2.47.1 and its corresponding CRDs, we continue to encounter critical issues when attempting to manage a CrateDB cluster with more than one data node pool, specifically when updating an existing second pool or when trying to add a second pool to a single-pool cluster via a CR update.
- `KeyError: 1` during Updates to Second Node Pool: After successfully creating a two-pool cluster (`hot` and `hotter`) via a "clean slate" (delete the CR, restart the operator, then create a CR with both pools defined from the start), any subsequent attempt to modify the configuration of the second pool (`hotter`) in the CrateDB CR (e.g., changing `spec.nodes.data[1].resources.heapRatio` from `0.25` to `0.26`) causes the operator to fail with a `KeyError: 1`. The traceback consistently points to `handle_update_cratedb.py` when accessing `old_spec[node_spec_idx]` (or the equivalent `old_spec_val[node_spec_idx]`).
```python
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once  # Path from 2.47.1 logs
    result = await invoke_handler(
  # ... (elided for brevity) ...
  File "/etc/cloud/main.py", line 173, in cluster_update  # Path from 2.47.1 logs
    await update_cratedb(namespace, name, patch, status, diff, started)
  File "/usr/local/lib/python3.12/site-packages/crate/operator/handlers/handle_update_cratedb.py", line 126, in update_cratedb
    old_spec = old_spec[node_spec_idx]
KeyError: 1
```
- Initial Failure to Add Second Pool via Update (observed with v2.42.0; behavior with v2.47.1 on this specific path needs re-confirmation but is suspected to be similar given the persistent `KeyError`): When starting with a single `hot` pool and then applying a CR update to add the `hotter` pool, the operator often failed to create the StatefulSet for the `hotter` pool. The `operator.cloud.crate.io/last` annotation would update to reflect the two-pool spec, but no resources for `hotter` would be provisioned. Subsequent changes would then often trigger the `KeyError: 1`.
- "Updating is superseded by..." Loop: During troubleshooting with v2.42.0, the operator frequently entered a state logging "Updating is superseded by resuming..." followed by inactivity, failing to progress. This specific loop was not re-tested extensively with v2.47.1 after the `KeyError` resurfaced during the pool update attempt.
- `nodepool` Functionality: When a two-pool cluster is successfully created via the "clean slate" method, the `nodepool` directives are correctly translated into pod affinity/selectors, and pods schedule on the appropriate nodes. This suggests the create path handles node pool attributes correctly.
3. Expected Behavior:
- The CrateDB Operator should allow reliable modification of any data node pool's configuration in an existing multi-pool cluster.
- The operator should seamlessly add a new data node pool when it is defined in the `spec.nodes.data` array of an existing CrateDB CR.
- The operator should not encounter `KeyError` exceptions when diffing or applying changes to `spec.nodes.data`.
4. Steps to Reproduce (Illustrating `KeyError` on Update with Operator v2.47.1):
- Ensure CrateDB Operator v2.47.1 and its corresponding CRDs (`cratedbs.cloud.crate.io` CRD at `generation: 2` or later) are installed.
- Perform a "clean slate" deployment:
a. Ensure no CrateDB CRs or related resources exist in the target namespace (e.g., cratedb-ns).
b. Restart the operator pod for a clean state.
c. Apply a CrateDB CR that defines both `hot` and `hotter` data node pools from the outset (e.g., `hotter` with `heapRatio: 0.25`):

```yaml
# Initial two-pool CR for clean create
apiVersion: cloud.crate.io/v1
kind: CrateDB
metadata:
  name: my-cluster
  namespace: cratedb-ns
spec:
  cluster:
    imageRegistry: crate
    name: crate-dev
    version: "5.8.1"
  nodes:
    data:
      - name: hot
        replicas: 1
        nodepool: default-pool
        resources:
          limits:
            cpu: 4
            memory: 8Gi
          disk:
            count: 1
            size: 50GiB
            storageClass: local-path
          heapRatio: 0.25
      - name: hotter
        replicas: 1
        nodepool: hotter-pool
        resources:
          limits:
            cpu: 4
            memory: 13Gi
          disk:
            count: 1
            size: 100GiB
            storageClass: local-path
          heapRatio: 0.25
```
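For reference, this step can be driven with standard kubectl commands (the manifest file name `cratedb-two-pool.yaml` is illustrative):

```shell
# Confirm the installed CRD is at generation 2 or later:
kubectl get crd cratedbs.cloud.crate.io -o jsonpath='{.metadata.generation}'

# Apply the two-pool CR, then watch the resources come up:
kubectl -n cratedb-ns apply -f cratedb-two-pool.yaml
kubectl -n cratedb-ns get statefulsets,pods
```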
- Wait for both pools (`hot` and `hotter`) to be stable and running.
- Modify the CrateDB CR to change a property of the second pool (`hotter`), e.g., `heapRatio` from `0.25` to `0.26`:

```yaml
# ... (spec remains the same except for hotter.heapRatio) ...
      - name: hotter
        replicas: 1
        nodepool: hotter-pool
        resources:
          limits:
            cpu: 4
            memory: 13Gi
          disk:
            count: 1
            size: 100GiB
            storageClass: local-path
          heapRatio: 0.26  # Changed
```
- Apply the updated CR.
- Observation: The operator logs show the `cluster_update` handler failing with the `KeyError: 1` stack trace.
5. Analysis/Hypothesis of the Root Cause:
The fundamental issue persists in v2.47.1 and lies in `handle_update_cratedb.py`. The loop `for node_spec_idx in range(len(old_spec_val)):` attempts to compare elements of `old_spec_val` (representing `spec.nodes.data` from the previous state) and `new_spec_val` at matching indices.
The `KeyError: 1` when accessing `old_spec_val[node_spec_idx]` (where `node_spec_idx` is `1` for the second pool) indicates that the `old_spec_val` used by the operator for that specific diff operation does not contain an element at index 1. (The fact that it is a `KeyError` rather than an `IndexError` also suggests the old value is being handled as an index-keyed mapping rather than a plain list.) This occurs even though the `operator.cloud.crate.io/last` annotation, when inspected via kubectl, appears to correctly store the two-pool configuration. This points to a problem with how the operator's internal cache or Kopf's diff mechanism retrieves and presents the "old state" for the `spec.nodes.data` array, particularly when modifications are made to pools other than the first in the list. The operator appears to diff against an "old state" that contains only the first data pool, which fails as soon as the second pool is processed.
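For illustration only (this is not operator source code), a minimal sketch of the suspected failure shape, assuming the "old" value reaches the handler as an index-keyed mapping that holds only the first pool; the pool dictionaries and loop bounds here are stand-ins:

```python
def apply_pool_updates(old_spec_val, new_spec_val):
    # Mimics an index-based lookup like the one in handle_update_cratedb.py:
    # each pool in the new spec is looked up at the same index in the old spec.
    merged = []
    for node_spec_idx in range(len(new_spec_val)):
        old_spec = old_spec_val[node_spec_idx]  # raises KeyError: 1 below
        merged.append({**old_spec, **new_spec_val[node_spec_idx]})
    return merged

# Stale "old" view: an index-keyed mapping containing only the first pool.
old_view = {0: {"name": "hot", "heapRatio": 0.25}}
new_pools = [
    {"name": "hot", "heapRatio": 0.25},
    {"name": "hotter", "heapRatio": 0.26},  # the change that triggers the bug
]

try:
    apply_pool_updates(old_view, new_pools)
except KeyError as exc:
    print(f"KeyError: {exc}")  # → KeyError: 1
```

The sketch reproduces the reported symptom: the lookup succeeds for the first pool and fails with `KeyError: 1` as soon as the second pool is processed.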
6. Successful Workaround (Only for Initial Create, Not for Updates):
The only method that achieved the desired two-pool configuration was a "clean slate" create: deleting any existing CrateDB CR and its resources, restarting the operator, and then applying a CR manifest that defines both pools from the very beginning. Subsequent attempts to update the second pool in this successfully created two-pool cluster still trigger the `KeyError: 1`.
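For reference, the clean-slate workaround amounts to the following kubectl sequence (the CR name, operator deployment name, and manifest file name are illustrative; adjust to your environment):

```shell
# 1. Remove the existing CR and let the operator tear down its resources.
kubectl -n cratedb-ns delete cratedb my-cluster

# 2. Restart the operator pod to clear any cached state.
kubectl -n <operator-namespace> rollout restart deployment crate-operator

# 3. Re-apply a manifest that defines BOTH pools from the start.
kubectl -n cratedb-ns apply -f cratedb-two-pool.yaml
```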
7. Request:
We urgently request an investigation and fix of the `spec.nodes.data` diffing and update logic in the CrateDB operator (v2.47.1). The operator must reliably:
a. Allow the addition of new data node pools to an existing cluster via CR update.
b. Allow modifications to any existing data node pool in a multi-pool cluster without encountering `KeyError` exceptions.
c. Maintain a consistent and correct "previous state" representation for `spec.nodes.data` during diff operations.
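As one possible direction for (b) and (c), pairing pools by their `name` field instead of their list position would make the diff robust to a stale or partial old state. This is an illustrative sketch, not operator code:

```python
def diff_pools_by_name(old_pools, new_pools):
    """Pair old and new pool specs by 'name' rather than list index,
    so a missing or reordered entry in the old state cannot raise KeyError."""
    old_by_name = {pool["name"]: pool for pool in old_pools}
    added, changed = [], []
    for new_pool in new_pools:
        old_pool = old_by_name.get(new_pool["name"])
        if old_pool is None:
            added.append(new_pool)          # pool newly defined in the CR
        elif old_pool != new_pool:
            changed.append((old_pool, new_pool))
    return added, changed

# Stale old state that only knows about the first pool:
old = [{"name": "hot", "heapRatio": 0.25}]
new = [
    {"name": "hot", "heapRatio": 0.25},
    {"name": "hotter", "heapRatio": 0.26},
]
added, changed = diff_pools_by_name(old, new)
print(added)    # → [{'name': 'hotter', 'heapRatio': 0.26}]
print(changed)  # → []
```

With name-keyed pairing, a pool absent from the old state is classified as an addition rather than causing an index lookup to fail.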
8. Additional Information:
- CRD Used: `cratedbs.cloud.crate.io` version `v1` (CRD `metadata.generation: 2`, `creationTimestamp: "2024-11-27T05:45:08Z"`).