Version-control IAM policies; add lambda:InvokeFunction to deploy role #17

Merged
cipher813 merged 1 commit into main from feat/version-controlled-iam-policies
Apr 11, 2026

@cipher813
Owner

Summary

Tonight's third red deploy in a row surfaced a gap: the GitHub Actions OIDC role `github-actions-lambda-deploy` has lacked `lambda:InvokeFunction` ever since the role was created for #12. Every auto-deploy since has looked red even though the underlying Lambda updates succeeded: `infrastructure/deploy.sh`'s post-update canary step fails with `AccessDenied`, the rollback then fails as well and its error is swallowed by `|| true`, and the alias ends up stranded on whatever version was just published. Three deploys in a row (#13, #14, #16) all looked like failures despite the Lambda being updated.

You also flagged that we're not version-controlling IAM at all, which is how this gap existed without anyone noticing.

Changes

  1. `infrastructure/iam/` — new home for version-controlled IAM policies. Flat JSON files, one per role, applied by an idempotent shell script. No CloudFormation, no Terraform. For a 5-module infra-light project, this is the right amount of rigor — migrate to CFN later if the blast radius grows.
  2. `infrastructure/iam/github-actions-lambda-deploy.json` — full policy document for the OIDC deploy role. Preserves the existing ECRAuth, ECRPush, and LambdaUpdate statements verbatim, adds a new `LambdaInvokeCanary` statement granting `lambda:InvokeFunction` on all 5 alpha-engine Lambdas + their aliases/versions.
  3. `infrastructure/iam/apply.sh` — idempotent applier. One policy file per role, filename == role name. Supports `--dry-run` and single-role invocation.
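
A minimal sketch of what such an applier might look like, assuming it wraps `aws iam put-role-policy` (which creates an inline policy or overwrites it in place, so repeated runs converge on the same state). The `run` helper and the `DRY_RUN` variable are illustrative names, not taken from the actual `apply.sh`; the log line and the `<role>-policy` naming follow the output shown under "Already applied live".

```bash
# Hypothetical sketch of an idempotent IAM policy applier; not the real apply.sh.

# Print the command when DRY_RUN=1 (assumed variable name), otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# Apply one policy file. The role name is derived from the filename
# (filename == role name) and the inline policy is named "<role>-policy".
apply_policy() {
  local file="$1"
  local role
  role="$(basename "$file" .json)"
  echo "Applying $(basename "$file") -> role=${role} policy=${role}-policy"
  # put-role-policy replaces an existing inline policy of the same name,
  # so re-applying an unchanged file leaves IAM in the same state.
  run aws iam put-role-policy \
    --role-name "$role" \
    --policy-name "${role}-policy" \
    --policy-document "file://${file}"
}
```

Under these assumptions, `DRY_RUN=1 apply_policy infrastructure/iam/github-actions-lambda-deploy.json` would print the command instead of calling AWS.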

Already applied live

Applied before committing so tonight's next deploy isn't blocked:

```
$ infrastructure/iam/apply.sh github-actions-lambda-deploy
Applying github-actions-lambda-deploy.json -> role=github-actions-lambda-deploy policy=github-actions-lambda-deploy-policy
OK

$ aws iam delete-role-policy --role-name github-actions-lambda-deploy --policy-name deploy-lambdas
# cleaned up the orphaned old inline policy name
```

Blast radius analysis

The new Invoke permission is scoped to exactly the same 5 Lambda ARNs the role already has `UpdateFunctionCode` on. An attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas — adding InvokeFunction doesn't meaningfully expand what they can do. The alternative (short-circuit canary in CI) would silently skip the safety check in the environment where it matters most (unattended deploys), which is strictly worse.
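
For illustration only, the shape of a statement scoped this way might be generated as below. This is a sketch, not the committed policy file: the `invoke_statement` helper is hypothetical, and the account ID default is a placeholder. Each function contributes two resources, the bare function ARN and a `:*` variant, because qualified invocations (such as the `:live` alias) are authorized against the qualified ARN.

```bash
# Hypothetical helper: print a LambdaInvokeCanary statement for the given
# function names. ACCOUNT_ID and REGION defaults are placeholders.
invoke_statement() {
  local account="${ACCOUNT_ID:-123456789012}" region="${REGION:-us-east-1}"
  local resources="" fn arn
  for fn in "$@"; do
    arn="arn:aws:lambda:${region}:${account}:function:${fn}"
    # Bare ARN matches unqualified invokes; ":*" matches aliases and versions.
    resources+="${resources:+, }\"${arn}\", \"${arn}:*\""
  done
  printf '{"Sid": "LambdaInvokeCanary", "Effect": "Allow", "Action": "lambda:InvokeFunction", "Resource": [%s]}\n' "$resources"
}
```

Calling `invoke_statement alpha-engine-data-collector` emits one statement covering that function and its qualifiers; passing all five function names would yield the full resource list.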

Test plan

  • JSON validates
  • `apply.sh --dry-run` generates correct commands
  • Applied to live AWS IAM, `list-role-policies` shows exactly one inline policy matching the committed file
  • Manual canary via local creds still works (`status: "OK"`)
  • Merge this PR → the Deploy workflow run for the merge commit should be the first green auto-deploy since #12 (Add GitHub Actions auto-deploy on push to main)
  • Subsequent deploys should either go fully green or auto-rollback as designed

Known follow-up (NOT in scope)

`infrastructure/deploy.sh`'s rollback block uses `|| true` to swallow errors silently (lines 188-193), which is why the stranded alias never got rolled back on the three recent broken canaries. Separate PR.
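
The effect of the `|| true` pattern described above can be shown with a small sketch. It does not reproduce the code in `deploy.sh`: `rollback_alias` is a hypothetical stand-in that always fails, to mimic the `AccessDenied` case, and the two wrappers contrast the swallowing form with one that propagates the failure.

```bash
# Hypothetical stand-in for the real rollback command; it always fails here
# to mimic the AccessDenied case.
rollback_alias() {
  echo "rollback: AccessDenied" >&2
  return 1
}

# Swallowing form: the failure is discarded, so the caller sees success.
rollback_silent() {
  rollback_alias || true
}

# Loud form: report the failure and return a non-zero status to the caller.
rollback_loud() {
  if ! rollback_alias; then
    echo "ERROR: rollback failed; alias may be stranded on the new version" >&2
    return 1
  fi
}
```

With the first form the script continues as if the rollback had worked; with the second, the caller can stop and mark the deploy as needing manual attention.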

🤖 Generated with Claude Code

Fixes the canary IAM gap that has caused every auto-deploy since #12
to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/...
    is not authorized to perform: lambda:InvokeFunction on resource:
    arn:aws:lambda:us-east-1:711398986525:function:
    alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when
the GitHub Actions auto-deploy workflow was introduced in #12. It
had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not
InvokeFunction — so infrastructure/deploy.sh's post-update canary
step (aws lambda invoke with dry_run=true) failed with
AccessDenied, the rollback then also failed silently because the
script's || true swallowed the error, and every deploy since has
been leaving the alias stranded on whatever version just got
published with no safety net. Three deploys in a row (#13, #14,
#16) all looked like failures despite the underlying Lambda being
updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled
   IAM policies. It's intentionally low-ceremony — flat JSON files,
   one per role, applied via a small idempotent shell script. No
   CloudFormation, no Terraform. For a 5-module infra-light project,
   a flat directory is the right amount of rigor. Migrate to CFN
   later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing
   deploy-role policy, granting lambda:InvokeFunction on all 5
   alpha-engine Lambdas and their aliases/versions. Scoped
   narrowly to the same functions the role already has
   UpdateFunctionCode on, so the blast radius is unchanged: an
   attacker with ECR push + UpdateFunctionCode can already run
   arbitrary code in these Lambdas.

Applied live via `infrastructure/iam/apply.sh
github-actions-lambda-deploy` before committing — so the next
deploy workflow run actually passes the canary step. Also cleaned
up the orphaned old `deploy-lambdas` inline policy (the new file's
name is `github-actions-lambda-deploy-policy`, matching the
convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is
diffable, reviewable, and recoverable. If a future PR drops a
permission, code review catches it at PR time instead of surfacing
as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic
still uses `|| true` to swallow errors silently, which is why the
stranded alias never got rolled back. That's a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 461a8f2 into main Apr 11, 2026
1 check passed
@cipher813 cipher813 deleted the feat/version-controlled-iam-policies branch April 11, 2026 03:22
cipher813 added a commit that referenced this pull request Apr 11, 2026
The post-canary IAM block attempted `aws iam get-role alpha-engine-data-role`
as a "create if missing" bootstrap, then fell through to CreateRole
when the get-role call failed. This was dead code with a silent-fail
trap inside it:

1. `alpha-engine-data-role` already exists in AWS and is the live
   Lambda's execution role. The get-role check should succeed and
   exit the branch as a no-op — it never actually needs to create
   anything.
2. The deploy.sh header already documents the role as a prerequisite.
   The bootstrap block contradicted that contract.
3. `&>/dev/null` swallowed any non-zero exit from get-role, including
   "permission denied" from the github-actions-lambda-deploy role
   (which correctly lacks iam:* permissions). The branch then
   interpreted "permission denied" as "role does not exist" and
   tried to create it, which failed explicitly with iam:CreateRole
   AccessDenied.

Today's auto-deploy after PR #18 merged surfaced all of this: the
Lambda itself deployed successfully (version 4 live, canary passed),
but the workflow failed at the dead IAM step and marked the run red.

Fix: delete the entire block. Replace with a comment explaining the
role is a one-time out-of-band provisioning concern, ideally extended
into `infrastructure/iam/` the same way #17 did for
github-actions-lambda-deploy.

This also aligns with the no-silent-fails rule: any future IAM
provisioning that belongs in this path should fail loudly, not fall
through a pattern-matching check.
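
As a sketch of what "fail loudly" could look like if an existence check were ever reintroduced (the commit itself deletes the block instead), the function below captures stderr from `aws iam get-role` rather than discarding it, and treats only a `NoSuchEntity` error as "missing". The `role_status` name is hypothetical.

```bash
# Hypothetical sketch: classify the outcome of `aws iam get-role` instead of
# sending it to /dev/null.
role_status() {
  local role="$1" err
  # Capture stderr only; the JSON on stdout is not needed here.
  if err="$(aws iam get-role --role-name "$role" 2>&1 >/dev/null)"; then
    echo "exists"
  elif [[ "$err" == *NoSuchEntity* ]]; then
    echo "missing"
  else
    # AccessDenied, throttling, network errors: the state is unknown,
    # so report it and fail rather than guess.
    echo "ERROR: cannot determine state of role ${role}: ${err}" >&2
    return 1
  fi
}
```

Under this shape, a caller lacking `iam:GetRole` gets a non-zero exit with the original error text, rather than a fall-through into `CreateRole`.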

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 5, 2026
#155)

Initial PR commit followed the (now-corrected) embeddings.py docstring
and reported `embedding.dimension = 1024`. The schema declares
`vector(512)` and pgvector enforces dim on INSERT — the production
column has to be 512 for ingestion to be working. voyage-3-lite is 512-d.

Companion fix to alpha-engine-lib PR #17 which updates the docstring.

Tests: 8/8 (test_embedding_metadata updated to assert 512).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>