1. Environment:
- CrateDB Operator Version: crate/crate-operator:2.47.1 (Upgraded from 2.42.0 during troubleshooting)
- CrateDB Version (Targeted in CR): 5.8.1
- Kubernetes Version:
  - Client: v1.31.6 (gitCommit: 6b3560758b37680cb713dfc71da03c04cadd657c)
  - Server: v1.31.3 (gitCommit: c83cbee114ddb732cdc06d3d1b62c9eb9220726f)
- Kubernetes Distribution: kubeadm-deployed cluster (control plane on Ubuntu 24.04.2 LTS, worker nodes on Ubuntu 22.04.5 LTS, some with Azure kernels)
- Number of Kubernetes Nodes: 4 (1 control-plane node: iot-api; 3 worker nodes: vmhygenco, vmhygenco2, vmhygenco3)
2. Observed Behavior & Symptoms:
Even after upgrading to CrateDB Operator v2.47.1 and its corresponding CRDs, we continue to encounter critical issues when attempting to manage a CrateDB cluster with more than one data node pool, specifically when updating an existing second pool or when trying to add a second pool to a single-pool cluster via a CR update.
- `KeyError: 1` during Updates to Second Node Pool: After successfully creating a two-pool cluster (`hot` and `hotter`) via a "clean slate" (delete the CR, restart the operator, then create a CR with both pools defined from the start), any subsequent attempt to modify the configuration of the second pool (`hotter`) in the CrateDB CR (e.g., changing `spec.nodes.data[1].resources.heapRatio` from `0.25` to `0.26`) causes the operator to fail with a `KeyError: 1`. The traceback consistently points to `handle_update_cratedb.py` when accessing `old_spec[node_spec_idx]` (or the equivalent `old_spec_val[node_spec_idx]`).
```python
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once  # Path from 2.47.1 logs
    result = await invoke_handler(
  # ... (elided for brevity) ...
  File "/etc/cloud/main.py", line 173, in cluster_update  # Path from 2.47.1 logs
    await update_cratedb(namespace, name, patch, status, diff, started)
  File "/usr/local/lib/python3.12/site-packages/crate/operator/handlers/handle_update_cratedb.py", line 126, in update_cratedb
    old_spec = old_spec[node_spec_idx]
KeyError: 1
```
- Initial Failure to Add Second Pool via Update (observed with v2.42.0; behavior with v2.47.1 on this specific path needs re-confirmation but is suspected to be similar given the persistent `KeyError`): When starting with a single `hot` pool and then applying a CR update to add the `hotter` pool, the operator often failed to create the StatefulSet for the `hotter` pool. The `operator.cloud.crate.io/last` annotation would update to reflect the two-pool spec, but no resources for `hotter` would be provisioned. Subsequent changes would then often trigger the `KeyError: 1`.
- "Updating is superseded by..." Loop: During troubleshooting with v2.42.0, the operator frequently entered a state logging "Updating is superseded by resuming..." followed by inactivity, failing to progress. This specific loop was not re-tested extensively with v2.47.1 after the `KeyError` resurfaced during the pool update attempt.
- `nodepool` Functionality: When a two-pool cluster is successfully created via the "clean slate" method, the `nodepool` directives are correctly translated into pod affinity/selectors, and pods schedule on the appropriate nodes. This suggests the create path handles node pool attributes correctly.
3. Expected Behavior:
- The CrateDB Operator should allow reliable modification of any data node pool's configuration in an existing multi-pool cluster.
- The operator should seamlessly add a new data node pool when it is defined in the `spec.nodes.data` array of an existing CrateDB CR.
- The operator should not encounter `KeyError` exceptions when diffing or applying changes to `spec.nodes.data`.
4. Steps to Reproduce (Illustrating `KeyError` on Update with Operator v2.47.1):
- Ensure CrateDB Operator v2.47.1 and its corresponding CRDs (`cratedbs.cloud.crate.io` CRD at `generation: 2` or later) are installed.
- Perform a "clean slate" deployment:
a. Ensure no CrateDB CRs or related resources exist in the target namespace (e.g., cratedb-ns).
b. Restart the operator pod for a clean state.
c. Apply a CrateDB CR that defines both `hot` and `hotter` data node pools from the outset (e.g., `hotter` with `heapRatio: 0.25`):

```yaml
# Initial two-pool CR for clean create
apiVersion: cloud.crate.io/v1
kind: CrateDB
metadata:
  name: my-cluster
  namespace: cratedb-ns
spec:
  cluster:
    imageRegistry: crate
    name: crate-dev
    version: "5.8.1"
  nodes:
    data:
      - name: hot
        replicas: 1
        nodepool: default-pool
        resources:
          limits:
            cpu: 4
            memory: 8Gi
          disk:
            count: 1
            size: 50GiB
            storageClass: local-path
          heapRatio: 0.25
      - name: hotter
        replicas: 1
        nodepool: hotter-pool
        resources:
          limits:
            cpu: 4
            memory: 13Gi
          disk:
            count: 1
            size: 100GiB
            storageClass: local-path
          heapRatio: 0.25
```
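For reference, this step can be driven with standard kubectl commands (the manifest file name `cratedb-two-pool.yaml` is illustrative):

```shell
# Confirm the installed CRD is at generation 2 or later:
kubectl get crd cratedbs.cloud.crate.io -o jsonpath='{.metadata.generation}'

# Apply the two-pool CR, then watch the resources come up:
kubectl -n cratedb-ns apply -f cratedb-two-pool.yaml
kubectl -n cratedb-ns get statefulsets,pods
```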
- Wait for both pools (`hot` and `hotter`) to be stable and running.
- Modify the CrateDB CR to change a property of the second pool (`hotter`), e.g., `heapRatio` from `0.25` to `0.26`:

```yaml
# ... (spec remains the same except for hotter.heapRatio) ...
      - name: hotter
        replicas: 1
        nodepool: hotter-pool
        resources:
          limits:
            cpu: 4
            memory: 13Gi
          disk:
            count: 1
            size: 100GiB
            storageClass: local-path
          heapRatio: 0.26  # Changed
```
- Apply the updated CR.
- Observation: The operator logs show the `cluster_update` handler failing with the `KeyError: 1` stack trace.
5. Analysis/Hypothesis of the Root Cause:
The fundamental issue persists in v2.47.1 and lies in `handle_update_cratedb.py`. The loop `for node_spec_idx in range(len(old_spec_val)):` attempts to compare elements of `old_spec_val` (representing `spec.nodes.data` from the previous state) and `new_spec_val` at matching indices.
The `KeyError: 1` when accessing `old_spec_val[node_spec_idx]` (where `node_spec_idx` is `1` for the second pool) indicates that the `old_spec_val` used by the operator for that specific diff operation does not contain an element at index 1. (The fact that it is a `KeyError` rather than an `IndexError` also suggests the old value is being handled as an index-keyed mapping rather than a plain list.) This occurs even though the `operator.cloud.crate.io/last` annotation, when inspected via kubectl, appears to correctly store the two-pool configuration. This points to a problem with how the operator's internal cache or Kopf's diff mechanism retrieves and presents the "old state" for the `spec.nodes.data` array, particularly when modifications are made to pools other than the first in the list. The operator appears to diff against an "old state" that contains only the first data pool, which fails as soon as the second pool is processed.
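For illustration only (this is not operator source code), a minimal sketch of the suspected failure shape, assuming the "old" value reaches the handler as an index-keyed mapping that holds only the first pool; the pool dictionaries and loop bounds here are stand-ins:

```python
def apply_pool_updates(old_spec_val, new_spec_val):
    # Mimics an index-based lookup like the one in handle_update_cratedb.py:
    # each pool in the new spec is looked up at the same index in the old spec.
    merged = []
    for node_spec_idx in range(len(new_spec_val)):
        old_spec = old_spec_val[node_spec_idx]  # raises KeyError: 1 below
        merged.append({**old_spec, **new_spec_val[node_spec_idx]})
    return merged

# Stale "old" view: an index-keyed mapping containing only the first pool.
old_view = {0: {"name": "hot", "heapRatio": 0.25}}
new_pools = [
    {"name": "hot", "heapRatio": 0.25},
    {"name": "hotter", "heapRatio": 0.26},  # the change that triggers the bug
]

try:
    apply_pool_updates(old_view, new_pools)
except KeyError as exc:
    print(f"KeyError: {exc}")  # → KeyError: 1
```

The sketch reproduces the reported symptom: the lookup succeeds for the first pool and fails with `KeyError: 1` as soon as the second pool is processed.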
6. Successful Workaround (Only for Initial Create, Not for Updates):
The only method that achieved the desired two-pool configuration was a "clean slate" create: deleting any existing CrateDB CR and its resources, restarting the operator, and then applying a CR manifest that defines both pools from the very beginning. Subsequent attempts to update the second pool in this successfully created two-pool cluster still trigger the `KeyError: 1`.
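For reference, the clean-slate workaround amounts to the following kubectl sequence (the CR name, operator deployment name, and manifest file name are illustrative; adjust to your environment):

```shell
# 1. Remove the existing CR and let the operator tear down its resources.
kubectl -n cratedb-ns delete cratedb my-cluster

# 2. Restart the operator pod to clear any cached state.
kubectl -n <operator-namespace> rollout restart deployment crate-operator

# 3. Re-apply a manifest that defines BOTH pools from the start.
kubectl -n cratedb-ns apply -f cratedb-two-pool.yaml
```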
7. Request:
We urgently request an investigation and fix of the `spec.nodes.data` diffing and update logic in the CrateDB operator (v2.47.1). The operator must reliably:
a. Allow the addition of new data node pools to an existing cluster via CR update.
b. Allow modifications to any existing data node pool in a multi-pool cluster without encountering `KeyError` exceptions.
c. Maintain a consistent and correct "previous state" representation for `spec.nodes.data` during diff operations.
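As one possible direction for (b) and (c), pairing pools by their `name` field instead of their list position would make the diff robust to a stale or partial old state. This is an illustrative sketch, not operator code:

```python
def diff_pools_by_name(old_pools, new_pools):
    """Pair old and new pool specs by 'name' rather than list index,
    so a missing or reordered entry in the old state cannot raise KeyError."""
    old_by_name = {pool["name"]: pool for pool in old_pools}
    added, changed = [], []
    for new_pool in new_pools:
        old_pool = old_by_name.get(new_pool["name"])
        if old_pool is None:
            added.append(new_pool)          # pool newly defined in the CR
        elif old_pool != new_pool:
            changed.append((old_pool, new_pool))
    return added, changed

# Stale old state that only knows about the first pool:
old = [{"name": "hot", "heapRatio": 0.25}]
new = [
    {"name": "hot", "heapRatio": 0.25},
    {"name": "hotter", "heapRatio": 0.26},
]
added, changed = diff_pools_by_name(old, new)
print(added)    # → [{'name': 'hotter', 'heapRatio': 0.26}]
print(changed)  # → []
```

With name-keyed pairing, a pool absent from the old state is classified as an addition rather than causing an index lookup to fail.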
8. Additional Information:
- CRD Used: `cratedbs.cloud.crate.io` version `v1` (CRD `metadata.generation: 2`, `creationTimestamp: "2024-11-27T05:45:08Z"`).