Skip to content

feat: add eks_scale_node_group playbook action (#223)#2089

Open
gerardrecinto wants to merge 2 commits into
robusta-dev:masterfrom
gerardrecinto:feat/eks-scale-node-group-action
Open

feat: add eks_scale_node_group playbook action (#223)#2089
gerardrecinto wants to merge 2 commits into
robusta-dev:masterfrom
gerardrecinto:feat/eks-scale-node-group-action

Conversation

@gerardrecinto
Copy link
Copy Markdown

Summary

Closes #223

Adds eks_scale_node_group — a playbook action that increases the maxSize of an EKS managed node group. Intended as a remediation step when the cluster autoscaler is blocked because the node group has reached its configured maximum.

New files:

  • playbooks/robusta_playbooks/aws_node_group_actions.py — action + params model
  • tests/test_aws_node_group_actions.py — 6 pytest unit tests

How it works

The action calls eks:DescribeNodegroup to read the current scaling config, then calls eks:UpdateNodegroupConfig to raise maxSize. minSize and desiredSize are left unchanged.

Params:

Param Type Description
cluster_name str EKS cluster name
region str AWS region (e.g. us-east-1)
node_group_name str Managed node group name
new_max_size int New maxSize — must exceed current value
aws_access_key_id str (optional) Falls back to instance role / environment
aws_secret_access_key str (optional) Falls back to instance role / environment

Example playbook config:

triggers:
  - on_prometheus_alert:
      alert_name: KubeNodeNotReady
actions:
  - eks_scale_node_group:
      cluster_name: my-cluster
      region: us-east-1
      node_group_name: workers
      new_max_size: 10

Required IAM permissions: eks:DescribeNodegroup, eks:UpdateNodegroupConfig

Test plan

  • test_scale_up_succeeds — verifies update_nodegroup_config called with correct args and finding emitted
  • test_no_op_when_new_max_not_larger — new_max_size < current max → no update, enrichment message returned
  • test_no_op_when_new_max_is_equal — new_max_size == current max → no update
  • test_raises_on_describe_failureClientError on describe → ActionException raised, update never called
  • test_raises_on_update_failureClientError on update → ActionException raised
  • test_boto_client_uses_explicit_credentials — explicit key/secret passed through to boto3

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 26, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR adds an eks_scale_node_group remediation action (params + implementation), tests covering success/no-op/error/credential wiring, a remediation docs entry, Rootly and Sentry send-events docs, private registry proxy guidance, and a Helm dependency version bump.

Changes

EKS Node Group Scaling

Layer / File(s) Summary
Action contract and core implementation
playbooks/robusta_playbooks/aws_node_group_actions.py, docs/playbook-reference/actions/remediation.rst
EksNodeGroupParams and eks_scale_node_group implemented: creates EKS client (optional credentials), describes nodegroup, no-ops if new_max_size ≤ current maxSize, updates maxSize when larger (preserving minSize/desiredSize), creates a Finding (EksNodeGroupScaled), and wraps AWS ClientError as ActionException.
Test fixtures, helpers, and validation cases
tests/test_aws_node_group_actions.py
Pytest fixtures and tests: success path (asserts update_nodegroup_config call and finding), no-op paths (new_max ≤ current), describe/update ClientErrorActionException, and explicit AWS credential wiring to boto3.client.

Documentation and Helm updates

Layer / File(s) Summary
Rootly send-events docs + integration card
docs/configuration/exporting/send-events/rootly.rst, docs/configuration/exporting/send-events-api.rst
Adds a Rootly send-events guide and registers it in the send-events API integrations grid.
Sentry send-events rewrite
docs/configuration/exporting/send-events/sentry.rst
Rewrites Sentry forwarding docs to document two forwarding modes, webhook URL/auth details, Custom Integration setup, alert-rule action schema, verification, and troubleshooting.
Private registry / proxy guidance
docs/setup-robusta/proxies.rst
New section describing how to mirror chart images to a private registry and Helm values to override registry prefixes.
Helm dependency bump
helm/robusta/Chart.yaml
Bumped holmes dependency version 0.30.0 → 0.31.0.

Sequence Diagram(s)

sequenceDiagram
  participant eks_scale_node_group
  participant boto3_EKS_client
  participant EKS_Service
  eks_scale_node_group->>boto3_EKS_client: describe_nodegroup(clusterName, nodegroupName)
  boto3_EKS_client->>EKS_Service: DescribeNodegroup API call
  EKS_Service-->>boto3_EKS_client: returns scaling.config (minSize,maxSize,desiredSize)
  eks_scale_node_group->>eks_scale_node_group: compare new_max_size with maxSize
  alt new_max_size > maxSize
    eks_scale_node_group->>boto3_EKS_client: update_nodegroup_config(set new maxSize, preserve min/desired)
    boto3_EKS_client->>EKS_Service: UpdateNodegroupConfig API call
    EKS_Service-->>boto3_EKS_client: update response
    eks_scale_node_group-->>eks_scale_node_group: create Finding and enrich event
  else
    eks_scale_node_group-->>eks_scale_node_group: enrich event with "no change"
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ❓ Inconclusive Most changes are in-scope, but updates to unrelated docs (send-events-api.rst, sentry.rst, proxies.rst) and Helm chart dependency appear tangential to the linked issue #223. Clarify why documentation changes for Rootly, Sentry integration, and proxy setup, plus holmes dependency bump, are included in this PR focused on EKS node group scaling.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the primary change: adding a new EKS playbook action for scaling node groups.
Description check ✅ Passed The description is directly related to the changeset, providing clear context about the new action, parameters, usage example, and comprehensive test coverage.
Linked Issues check ✅ Passed The PR fully implements the linked issue #223 objective by introducing eks_scale_node_group action with proper node group scaling functionality, comprehensive tests, and documentation.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
playbooks/robusta_playbooks/aws_node_group_actions.py (1)

61-65: ⚡ Quick win

Preserve original exception context when re-raising.

Use explicit exception chaining in both except ClientError as e blocks (raise ... from e) so the root AWS error remains visible in tracebacks.

Proposed patch
     except ClientError as e:
         raise ActionException(
             ErrorCodes.ACTION_UNEXPECTED_ERROR,
             f"Failed to describe node group '{params.node_group_name}' "
             f"in cluster '{params.cluster_name}': {e}",
-        )
+        ) from e
@@
     except ClientError as e:
         raise ActionException(
             ErrorCodes.ACTION_UNEXPECTED_ERROR,
             f"Failed to update node group '{params.node_group_name}': {e}",
-        )
+        ) from e

Also applies to: 94-97

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@playbooks/robusta_playbooks/aws_node_group_actions.py` around lines 61 - 65,
The except ClientError as e handlers that raise ActionException should preserve
the original exception context by using explicit exception chaining; locate the
raise statements that construct ActionException with
ErrorCodes.ACTION_UNEXPECTED_ERROR (the blocks referencing
params.node_group_name and params.cluster_name and the later similar block at
lines 94-97) and change the re-raise to use "raise ActionException(... ) from e"
so the original AWS ClientError is retained in the traceback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@playbooks/robusta_playbooks/aws_node_group_actions.py`:
- Around line 61-65: The except ClientError as e handlers that raise
ActionException should preserve the original exception context by using explicit
exception chaining; locate the raise statements that construct ActionException
with ErrorCodes.ACTION_UNEXPECTED_ERROR (the blocks referencing
params.node_group_name and params.cluster_name and the later similar block at
lines 94-97) and change the re-raise to use "raise ActionException(... ) from e"
so the original AWS ClientError is retained in the traceback.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dd5c2ac7-227d-4783-8f36-07774d6f876d

📥 Commits

Reviewing files that changed from the base of the PR and between f945b48 and a12181d.

📒 Files selected for processing (2)
  • playbooks/robusta_playbooks/aws_node_group_actions.py
  • tests/test_aws_node_group_actions.py

gerardrecinto added a commit to gerardrecinto/robusta that referenced this pull request May 26, 2026
Preserves original ClientError in tracebacks when re-raising as
ActionException, per CodeRabbit review on PR robusta-dev#2089.
@gerardrecinto
Copy link
Copy Markdown
Author

@aantn could you please review when you have a chance, and approve the pending fork workflows if this looks ok?

I updated the branch with latest master and pushed a cleanup commit: 08ab23ca.

Changes made:

  • Fixed the no-op test coverage so below-current max and exact-equal max are tested separately.
  • Added assertions that no finding is emitted for no-op cases.
  • Added assertions that ActionException preserves the underlying ClientError as __cause__.
  • Added the new eks_scale_node_group action to the remediation action docs.

Why run_tests is still pending: the PR workflows are being created as completed/action_required with 0s runtime for this fork PR, so the required run_tests status is still waiting for a maintainer-approved workflow run rather than failing in code. The latest runs after this push show the same state for Test robusta with pytest and Docker Build on PR.

Local validation I could run here:

  • git diff --check
  • /usr/bin/python3 -m compileall playbooks/robusta_playbooks/aws_node_group_actions.py tests/test_aws_node_group_actions.py

I could not run the targeted pytest locally because this shell does not have the repo's Python 3.11/Poetry environment installed; default python3 is 3.14 and outside the repo's declared >=3.10,<3.12 range.

@gerardrecinto gerardrecinto force-pushed the feat/eks-scale-node-group-action branch from 08ab23c to 62b7a3f Compare June 1, 2026 20:51
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_aws_node_group_actions.py`:
- Around line 148-158: The test asserts a hardcoded secret in the
EksNodeGroupParams call and can trigger Ruff/flake8-bandit S106; locate the
assertion that calls mock_boto_client.assert_called_once_with(...) after
invoking eks_scale_node_group and append an inline suppression comment for S106
(e.g., add "  # noqa: S106" to the assertion line) so the linter ignores this
test-only hardcoded secret; keep references to EksNodeGroupParams,
eks_scale_node_group, and mock_boto_client.assert_called_once_with when editing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a970722-a258-43c8-a871-96e0d82ee659

📥 Commits

Reviewing files that changed from the base of the PR and between 08ab23c and 62b7a3f.

⛔ Files ignored due to path filters (1)
  • helm/robusta/Chart.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • docs/configuration/exporting/send-events-api.rst
  • docs/configuration/exporting/send-events/rootly.rst
  • docs/configuration/exporting/send-events/sentry.rst
  • docs/playbook-reference/actions/remediation.rst
  • docs/setup-robusta/proxies.rst
  • helm/robusta/Chart.yaml
  • helm/robusta/charts/holmes-0.30.0.tgz
  • helm/robusta/charts/holmes-0.31.0.tgz
  • playbooks/robusta_playbooks/aws_node_group_actions.py
  • tests/test_aws_node_group_actions.py
✅ Files skipped from review due to trivial changes (5)
  • docs/configuration/exporting/send-events/rootly.rst
  • docs/setup-robusta/proxies.rst
  • docs/configuration/exporting/send-events-api.rst
  • helm/robusta/Chart.yaml
  • docs/playbook-reference/actions/remediation.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • playbooks/robusta_playbooks/aws_node_group_actions.py

Comment on lines +148 to +158
params = EksNodeGroupParams(
**{**BASE_PARAMS, "aws_access_key_id": "AKID", "aws_secret_access_key": "SECRET"}
)
eks_scale_node_group(mock_event, params)

mock_boto_client.assert_called_once_with(
"eks",
region_name="us-west-2",
aws_access_key_id="AKID",
aws_secret_access_key="SECRET",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate ruff config and inspect select/ignore and per-file-ignores for tests
fd -H -t f -e toml -e cfg 'pyproject|ruff|setup' | xargs -r grep -nEi 'ruff|S106|flake8-bandit' 2>/dev/null
fd -H 'ruff.toml|.ruff.toml' --type f
rg -nP -C3 '\[tool\.ruff' -g 'pyproject.toml' || true
rg -nP -C3 'per-file-ignores|extend-select|^\s*select|^\s*ignore' -g 'pyproject.toml' -g 'ruff.toml' -g '.ruff.toml' || true

Repository: robusta-dev/robusta

Length of output: 813


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Print relevant parts of ruff.toml to confirm whether S (flake8-bandit) rules are selected.
ls -la ruff.toml 2>/dev/null || true
sed -n '1,120p' ruff.toml

Repository: robusta-dev/robusta

Length of output: 1229


Suppress Ruff S106 for the hardcoded test secret
ruff.toml enables flake8-bandit ("S"), and only S101 is ignored; S106 is not ignored globally, so aws_secret_access_key="SECRET" may fail lint. Suppress it at the assertion line:

🔧 Suppress the false positive
     mock_boto_client.assert_called_once_with(
         "eks",
         region_name="us-west-2",
         aws_access_key_id="AKID",
-        aws_secret_access_key="SECRET",
+        aws_secret_access_key="SECRET",  # noqa: S106
     )
🧰 Tools
🪛 Ruff (0.15.15)

[error] 157-157: Possible hardcoded password assigned to argument: "aws_secret_access_key"

(S106)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_aws_node_group_actions.py` around lines 148 - 158, The test
asserts a hardcoded secret in the EksNodeGroupParams call and can trigger
Ruff/flake8-bandit S106; locate the assertion that calls
mock_boto_client.assert_called_once_with(...) after invoking
eks_scale_node_group and append an inline suppression comment for S106 (e.g.,
add "  # noqa: S106" to the assertion line) so the linter ignores this test-only
hardcoded secret; keep references to EksNodeGroupParams, eks_scale_node_group,
and mock_boto_client.assert_called_once_with when editing.

@gerardrecinto gerardrecinto force-pushed the feat/eks-scale-node-group-action branch from f054ba2 to 90c4f23 Compare June 1, 2026 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New playbook action to increase node pool size limits

2 participants