feat: add eks_scale_node_group playbook action (#223) by gerardrecinto · Pull Request #2089 · robusta-dev/robusta

gerardrecinto · 2026-05-26T20:51:44Z

Summary

Closes #223

Adds eks_scale_node_group — a playbook action that increases the maxSize of an EKS managed node group. Intended as a remediation step when the cluster autoscaler is blocked because the node group has reached its configured maximum.

New files:

playbooks/robusta_playbooks/aws_node_group_actions.py — action + params model
tests/test_aws_node_group_actions.py — 6 pytest unit tests

How it works

The action calls eks:DescribeNodegroup to read the current scaling config, then calls eks:UpdateNodegroupConfig to raise maxSize. minSize and desiredSize are left unchanged.

Params:

Param	Type	Description
`cluster_name`	str	EKS cluster name
`region`	str	AWS region (e.g. `us-east-1`)
`node_group_name`	str	Managed node group name
`new_max_size`	int	New maxSize — must exceed current value
`aws_access_key_id`	str (optional)	Falls back to instance role / environment
`aws_secret_access_key`	str (optional)	Falls back to instance role / environment

Example playbook config:

triggers:
  - on_prometheus_alert:
      alert_name: KubeNodeNotReady
actions:
  - eks_scale_node_group:
      cluster_name: my-cluster
      region: us-east-1
      node_group_name: workers
      new_max_size: 10

Required IAM permissions: eks:DescribeNodegroup, eks:UpdateNodegroupConfig

Test plan

test_scale_up_succeeds — verifies update_nodegroup_config called with correct args and finding emitted
test_no_op_when_new_max_not_larger — new_max_size < current max → no update, enrichment message returned
test_no_op_when_new_max_is_equal — new_max_size == current max → no update
test_raises_on_describe_failure — ClientError on describe → ActionException raised, update never called
test_raises_on_update_failure — ClientError on update → ActionException raised
test_boto_client_uses_explicit_credentials — explicit key/secret passed through to boto3

CLAassistant · 2026-05-26T20:51:52Z

All committers have signed the CLA.

coderabbitai · 2026-05-26T20:52:03Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

This PR adds an eks_scale_node_group remediation action (params + implementation), tests covering success/no-op/error/credential wiring, a remediation docs entry, Rootly and Sentry send-events docs, private registry proxy guidance, and a Helm dependency version bump.

Changes

EKS Node Group Scaling

Layer / File(s)	Summary
Action contract and core implementation `playbooks/robusta_playbooks/aws_node_group_actions.py`, `docs/playbook-reference/actions/remediation.rst`	`EksNodeGroupParams` and `eks_scale_node_group` implemented: creates EKS client (optional credentials), describes nodegroup, no-ops if `new_max_size` ≤ current `maxSize`, updates `maxSize` when larger (preserving `minSize`/`desiredSize`), creates a Finding (`EksNodeGroupScaled`), and wraps AWS `ClientError` as `ActionException`.
Test fixtures, helpers, and validation cases `tests/test_aws_node_group_actions.py`	Pytest fixtures and tests: success path (asserts `update_nodegroup_config` call and finding), no-op paths (new_max ≤ current), describe/update `ClientError` → `ActionException`, and explicit AWS credential wiring to `boto3.client`.

Documentation and Helm updates

Layer / File(s)	Summary
Rootly send-events docs + integration card `docs/configuration/exporting/send-events/rootly.rst`, `docs/configuration/exporting/send-events-api.rst`	Adds a Rootly send-events guide and registers it in the send-events API integrations grid.
Sentry send-events rewrite `docs/configuration/exporting/send-events/sentry.rst`	Rewrites Sentry forwarding docs to document two forwarding modes, webhook URL/auth details, Custom Integration setup, alert-rule action schema, verification, and troubleshooting.
Private registry / proxy guidance `docs/setup-robusta/proxies.rst`	New section describing how to mirror chart images to a private registry and Helm values to override registry prefixes.
Helm dependency bump `helm/robusta/Chart.yaml`	Bumped `holmes` dependency version 0.30.0 → 0.31.0.

Sequence Diagram(s)

sequenceDiagram
  participant eks_scale_node_group
  participant boto3_EKS_client
  participant EKS_Service
  eks_scale_node_group->>boto3_EKS_client: describe_nodegroup(clusterName, nodegroupName)
  boto3_EKS_client->>EKS_Service: DescribeNodegroup API call
  EKS_Service-->>boto3_EKS_client: returns scaling.config (minSize,maxSize,desiredSize)
  eks_scale_node_group->>eks_scale_node_group: compare new_max_size with maxSize
  alt new_max_size > maxSize
    eks_scale_node_group->>boto3_EKS_client: update_nodegroup_config(set new maxSize, preserve min/desired)
    boto3_EKS_client->>EKS_Service: UpdateNodegroupConfig API call
    EKS_Service-->>boto3_EKS_client: update response
    eks_scale_node_group-->>eks_scale_node_group: create Finding and enrich event
  else
    eks_scale_node_group-->>eks_scale_node_group: enrich event with "no change"
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	❓ Inconclusive	Most changes are in-scope, but updates to unrelated docs (send-events-api.rst, sentry.rst, proxies.rst) and Helm chart dependency appear tangential to the linked issue `#223`.	Clarify why documentation changes for Rootly, Sentry integration, and proxy setup, plus holmes dependency bump, are included in this PR focused on EKS node group scaling.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the primary change: adding a new EKS playbook action for scaling node groups.
Description check	✅ Passed	The description is directly related to the changeset, providing clear context about the new action, parameters, usage example, and comprehensive test coverage.
Linked Issues check	✅ Passed	The PR fully implements the linked issue `#223` objective by introducing eks_scale_node_group action with proper node group scaling functionality, comprehensive tests, and documentation.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

playbooks/robusta_playbooks/aws_node_group_actions.py (1)

61-65: ⚡ Quick win

Preserve original exception context when re-raising.

Use explicit exception chaining in both except ClientError as e blocks (raise ... from e) so the root AWS error remains visible in tracebacks.

Proposed patch

     except ClientError as e:
         raise ActionException(
             ErrorCodes.ACTION_UNEXPECTED_ERROR,
             f"Failed to describe node group '{params.node_group_name}' "
             f"in cluster '{params.cluster_name}': {e}",
-        )
+        ) from e
@@
     except ClientError as e:
         raise ActionException(
             ErrorCodes.ACTION_UNEXPECTED_ERROR,
             f"Failed to update node group '{params.node_group_name}': {e}",
-        )
+        ) from e

Also applies to: 94-97

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@playbooks/robusta_playbooks/aws_node_group_actions.py` around lines 61 - 65,
The except ClientError as e handlers that raise ActionException should preserve
the original exception context by using explicit exception chaining; locate the
raise statements that construct ActionException with
ErrorCodes.ACTION_UNEXPECTED_ERROR (the blocks referencing
params.node_group_name and params.cluster_name and the later similar block at
lines 94-97) and change the re-raise to use "raise ActionException(... ) from e"
so the original AWS ClientError is retained in the traceback.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@playbooks/robusta_playbooks/aws_node_group_actions.py`:
- Around line 61-65: The except ClientError as e handlers that raise
ActionException should preserve the original exception context by using explicit
exception chaining; locate the raise statements that construct ActionException
with ErrorCodes.ACTION_UNEXPECTED_ERROR (the blocks referencing
params.node_group_name and params.cluster_name and the later similar block at
lines 94-97) and change the re-raise to use "raise ActionException(... ) from e"
so the original AWS ClientError is retained in the traceback.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dd5c2ac7-227d-4783-8f36-07774d6f876d

📥 Commits

Reviewing files that changed from the base of the PR and between f945b48 and a12181d.

📒 Files selected for processing (2)

playbooks/robusta_playbooks/aws_node_group_actions.py
tests/test_aws_node_group_actions.py

Preserves original ClientError in tracebacks when re-raising as ActionException, per CodeRabbit review on PR robusta-dev#2089.

gerardrecinto · 2026-06-01T20:43:08Z

@aantn could you please review when you have a chance, and approve the pending fork workflows if this looks ok?

I updated the branch with latest master and pushed a cleanup commit: 08ab23ca.

Changes made:

Fixed the no-op test coverage so below-current max and exact-equal max are tested separately.
Added assertions that no finding is emitted for no-op cases.
Added assertions that ActionException preserves the underlying ClientError as __cause__.
Added the new eks_scale_node_group action to the remediation action docs.

Why run_tests is still pending: the PR workflows are being created as completed/action_required with 0s runtime for this fork PR, so the required run_tests status is still waiting for a maintainer-approved workflow run rather than failing in code. The latest runs after this push show the same state for Test robusta with pytest and Docker Build on PR.

Local validation I could run here:

git diff --check
/usr/bin/python3 -m compileall playbooks/robusta_playbooks/aws_node_group_actions.py tests/test_aws_node_group_actions.py

I could not run the targeted pytest locally because this shell does not have the repo's Python 3.11/Poetry environment installed; default python3 is 3.14 and outside the repo's declared >=3.10,<3.12 range.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_aws_node_group_actions.py`:
- Around line 148-158: The test asserts a hardcoded secret in the
EksNodeGroupParams call and can trigger Ruff/flake8-bandit S106; locate the
assertion that calls mock_boto_client.assert_called_once_with(...) after
invoking eks_scale_node_group and append an inline suppression comment for S106
(e.g., add "  # noqa: S106" to the assertion line) so the linter ignores this
test-only hardcoded secret; keep references to EksNodeGroupParams,
eks_scale_node_group, and mock_boto_client.assert_called_once_with when editing.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a970722-a258-43c8-a871-96e0d82ee659

📥 Commits

Reviewing files that changed from the base of the PR and between 08ab23c and 62b7a3f.

⛔ Files ignored due to path filters (1)

helm/robusta/Chart.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

docs/configuration/exporting/send-events-api.rst
docs/configuration/exporting/send-events/rootly.rst
docs/configuration/exporting/send-events/sentry.rst
docs/playbook-reference/actions/remediation.rst
docs/setup-robusta/proxies.rst
helm/robusta/Chart.yaml
helm/robusta/charts/holmes-0.30.0.tgz
helm/robusta/charts/holmes-0.31.0.tgz
playbooks/robusta_playbooks/aws_node_group_actions.py
tests/test_aws_node_group_actions.py

✅ Files skipped from review due to trivial changes (5)

docs/configuration/exporting/send-events/rootly.rst
docs/setup-robusta/proxies.rst
docs/configuration/exporting/send-events-api.rst
helm/robusta/Chart.yaml
docs/playbook-reference/actions/remediation.rst

🚧 Files skipped from review as they are similar to previous changes (1)

playbooks/robusta_playbooks/aws_node_group_actions.py

coderabbitai · 2026-06-01T20:55:50Z

+    params = EksNodeGroupParams(
+        **{**BASE_PARAMS, "aws_access_key_id": "AKID", "aws_secret_access_key": "SECRET"}
+    )
+    eks_scale_node_group(mock_event, params)
+
+    mock_boto_client.assert_called_once_with(
+        "eks",
+        region_name="us-west-2",
+        aws_access_key_id="AKID",
+        aws_secret_access_key="SECRET",
+    )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Locate ruff config and inspect select/ignore and per-file-ignores for tests fd -H -t f -e toml -e cfg 'pyproject|ruff|setup' | xargs -r grep -nEi 'ruff|S106|flake8-bandit' 2>/dev/null fd -H 'ruff.toml|.ruff.toml' --type f rg -nP -C3 '\[tool\.ruff' -g 'pyproject.toml' || true rg -nP -C3 'per-file-ignores|extend-select|^\s*select|^\s*ignore' -g 'pyproject.toml' -g 'ruff.toml' -g '.ruff.toml' || true

Repository: robusta-dev/robusta

Length of output: 813

🏁 Script executed:

#!/bin/bash set -euo pipefail # Print relevant parts of ruff.toml to confirm whether S (flake8-bandit) rules are selected. ls -la ruff.toml 2>/dev/null || true sed -n '1,120p' ruff.toml

Repository: robusta-dev/robusta

Length of output: 1229

Suppress Ruff S106 for the hardcoded test secret
ruff.toml enables flake8-bandit ("S"), and only S101 is ignored; S106 is not ignored globally, so aws_secret_access_key="SECRET" may fail lint. Suppress it at the assertion line:

🔧 Suppress the false positive

mock_boto_client.assert_called_once_with( "eks", region_name="us-west-2", aws_access_key_id="AKID", - aws_secret_access_key="SECRET", + aws_secret_access_key="SECRET", # noqa: S106 )

🧰 Tools

🪛 Ruff (0.15.15)

[error] 157-157: Possible hardcoded password assigned to argument: "aws_secret_access_key"

(S106)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_aws_node_group_actions.py` around lines 148 - 158, The test asserts a hardcoded secret in the EksNodeGroupParams call and can trigger Ruff/flake8-bandit S106; locate the assertion that calls mock_boto_client.assert_called_once_with(...) after invoking eks_scale_node_group and append an inline suppression comment for S106 (e.g., add " # noqa: S106" to the assertion line) so the linter ignores this test-only hardcoded secret; keep references to EksNodeGroupParams, eks_scale_node_group, and mock_boto_client.assert_called_once_with when editing.

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

gerardrecinto added a commit to gerardrecinto/robusta that referenced this pull request May 26, 2026

fix: use explicit exception chaining (raise ... from e)

b107301

Preserves original ClientError in tracebacks when re-raising as ActionException, per CodeRabbit review on PR robusta-dev#2089.

gerardrecinto force-pushed the feat/eks-scale-node-group-action branch from 08ab23c to 62b7a3f Compare June 1, 2026 20:51

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

add scale_node_group action for EKS

90c4f23

gerardrecinto force-pushed the feat/eks-scale-node-group-action branch from f054ba2 to 90c4f23 Compare June 1, 2026 21:09

Merge branch 'master' into feat/eks-scale-node-group-action

eece6a7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add eks_scale_node_group playbook action (#223)#2089

feat: add eks_scale_node_group playbook action (#223)#2089
gerardrecinto wants to merge 2 commits into
robusta-dev:masterfrom
gerardrecinto:feat/eks-scale-node-group-action

gerardrecinto commented May 26, 2026

Uh oh!

CLAassistant commented May 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

gerardrecinto commented Jun 1, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gerardrecinto commented May 26, 2026

Summary

How it works

Test plan

Uh oh!

CLAassistant commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

gerardrecinto commented Jun 1, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CLAassistant commented May 26, 2026 •

edited

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading