Version-control IAM policies; add lambda:InvokeFunction to deploy role #17
Merged
Fixes the canary IAM gap that has caused every auto-deploy since #12 to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/... is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:711398986525:function:alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when the GitHub Actions auto-deploy workflow was introduced in #12. It had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not InvokeFunction, so infrastructure/deploy.sh's post-update canary step (aws lambda invoke with dry_run=true) failed with AccessDenied. The rollback then also failed silently because the script's || true swallowed the error, and every deploy since has been leaving the alias stranded on whatever version just got published with no safety net. Three deploys in a row (#13, #14, #16) all looked like failures despite the underlying Lambda being updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled IAM policies. It's intentionally low-ceremony: flat JSON files, one per role, applied via a small idempotent shell script. No CloudFormation, no Terraform. For a 5-module infra-light project, a flat directory is the right amount of rigor. Migrate to CFN later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing deploy-role policy, granting lambda:InvokeFunction on all 5 alpha-engine Lambdas and their aliases/versions. Scoped narrowly to the same functions the role already has UpdateFunctionCode on, so the blast radius is unchanged: an attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas.

Applied live via `infrastructure/iam/apply.sh github-actions-lambda-deploy` before committing, so the next deploy workflow run actually passes the canary step. Also cleaned up the orphaned old `deploy-lambdas` inline policy (the new file's name is `github-actions-lambda-deploy-policy`, matching the convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is diffable, reviewable, and recoverable. If a future PR drops a permission, code review catches it at PR time instead of surfacing as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic still uses `|| true` to swallow errors silently, which is why the stranded alias never got rolled back. That's a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 11, 2026
The post-canary IAM block attempted `aws iam get-role alpha-engine-data-role` as a "create if missing" bootstrap, then fell through to CreateRole when the get-role call failed. This was dead code with a silent-fail trap inside it:

1. `alpha-engine-data-role` already exists in AWS and is the live Lambda's execution role. The get-role check should succeed and exit the branch as a no-op; it never actually needs to create anything.

2. The deploy.sh header already documents the role as a prerequisite. The bootstrap block contradicted that contract.

3. `&>/dev/null` swallowed any non-zero exit from get-role, including "permission denied" from the github-actions-lambda-deploy role (which correctly lacks iam:* permissions). The branch then interpreted "permission denied" as "role does not exist" and tried to create it, which failed explicitly with iam:CreateRole AccessDenied.

Today's auto-deploy after PR #18 merged surfaced all of this: the Lambda itself deployed successfully (version 4 live, canary passed), but the workflow failed at the dead IAM step and marked the run red.

Fix: delete the entire block. Replace with a comment explaining the role is a one-time out-of-band provisioning concern, ideally extended into `infrastructure/iam/` the same way #17 did for github-actions-lambda-deploy.

This also aligns with the no-silent-fails rule: any future IAM provisioning that belongs in this path should fail loudly, not fall through a pattern-matching check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
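To illustrate the trap described in point 3, here is a minimal sketch of how a role-existence check could tell "role is missing" apart from "caller may not look". It is not the deleted deploy.sh block and not a proposal to bring it back. The function names `classify_get_role` and `check_role` are hypothetical, and the sketch assumes the AWS CLI reports a missing role with `NoSuchEntity` in its error text.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify the result of `aws iam get-role` instead of
# discarding stderr and treating every failure as "role does not exist".

# classify_get_role STATUS STDERR_TEXT
# Prints one of: exists, missing, error.
classify_get_role() {
  local status="$1" stderr_text="$2"
  if [ "$status" -eq 0 ]; then
    echo "exists"
  elif printf '%s' "$stderr_text" | grep -q 'NoSuchEntity'; then
    echo "missing"
  else
    # AccessDenied, throttling, network failures and anything else land here.
    echo "error"
  fi
}

# check_role ROLE_NAME
# Captures stderr only (stdout is discarded) and fails loudly on "error".
check_role() {
  local role="$1" err="" status=0 state
  err="$(aws iam get-role --role-name "$role" 2>&1 >/dev/null)" || status=$?
  state="$(classify_get_role "$status" "$err")"
  echo "$state"
  if [ "$state" = "error" ]; then
    echo "get-role failed for ${role}: ${err}" >&2
    return 1
  fi
}
```

With a check of this shape, an AccessDenied from a role that correctly lacks iam:* permissions stops the script with a readable message instead of falling through to a CreateRole attempt.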
cipher813 added a commit that referenced this pull request May 5, 2026
#155)

Initial PR commit followed the (now-corrected) embeddings.py docstring and reported `embedding.dimension = 1024`. The schema declares `vector(512)` and pgvector enforces the dimension on INSERT, so the production column has to be 512 for ingestion to work. voyage-3-lite is 512-d.

Companion fix to alpha-engine-lib PR #17, which updates the docstring.

Tests: 8/8 (test_embedding_metadata updated to assert 512).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Tonight's third-in-a-row red deploy surfaced a gap: the GitHub Actions OIDC role `github-actions-lambda-deploy` has been missing `lambda:InvokeFunction` since #12 introduced it. Every auto-deploy since has looked red, but the underlying Lambda updates actually succeeded — `infrastructure/deploy.sh`'s post-update canary step fails with `AccessDenied`, the rollback then silently swallows its own error via `|| true`, and the alias ends up stranded on whatever version just got published. Three deploys in a row (#13, #14, #16) all looked like failures despite the Lambda being updated.
You also flagged that we're not version-controlling IAM at all, which is how this gap existed without anyone noticing.
Changes
Already applied live
Applied before committing so tonight's next deploy isn't blocked:
```
$ infrastructure/iam/apply.sh github-actions-lambda-deploy
Applying github-actions-lambda-deploy.json -> role=github-actions-lambda-deploy policy=github-actions-lambda-deploy-policy
OK
$ aws iam delete-role-policy --role-name github-actions-lambda-deploy --policy-name deploy-lambdas
# cleaned up the orphaned old inline policy name
```
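For readers who have not opened `infrastructure/iam/apply.sh`, the following is a rough sketch of what an idempotent apply step of this kind could look like. It is reconstructed only from the log line above and is not the real script: the function name `apply_role_policy`, the optional directory argument, and the error handling are assumptions. The one AWS call used, `aws iam put-role-policy`, creates or overwrites an inline policy, which is what makes re-running safe.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent inline-policy apply step.
# Assumed convention (inferred from the log output, not confirmed):
#   policy file  = <dir>/<role>.json
#   policy name  = <role>-policy

# apply_role_policy ROLE [POLICY_DIR]
apply_role_policy() {
  local role="$1"
  local dir="${2:-infrastructure/iam}"
  local file="${dir}/${role}.json"
  local policy="${role}-policy"

  if [ ! -f "$file" ]; then
    echo "policy file not found: ${file}" >&2
    return 1
  fi

  echo "Applying ${role}.json -> role=${role} policy=${policy}"
  # put-role-policy replaces any existing inline policy with the same name.
  aws iam put-role-policy \
    --role-name "$role" \
    --policy-name "$policy" \
    --policy-document "file://${file}" || return 1
  echo "OK"
}
```

Because the file on disk is the whole policy, a reviewer can read the effective permissions straight from the diff, which is the point of keeping these files in the repository.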
Blast radius analysis
The new Invoke permission is scoped to exactly the same 5 Lambda ARNs the role already has `UpdateFunctionCode` on. An attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas — adding InvokeFunction doesn't meaningfully expand what they can do. The alternative (short-circuit canary in CI) would silently skip the safety check in the environment where it matters most (unattended deploys), which is strictly worse.
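To make the scoping concrete, here is a small sketch that prints a statement of the shape described above for a given list of function names. The Sid `LambdaInvokeCanary`, the region, the account ID, and the function name `alpha-engine-data-collector` come from this PR. The helper name `canary_invoke_statement` and the exact resource layout (unqualified ARN plus a `:*` qualifier for aliases and versions) are assumptions, so the committed JSON may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit an IAM statement that allows lambda:InvokeFunction
# on an explicit list of functions and on their aliases/versions.

# canary_invoke_statement REGION ACCOUNT_ID FUNCTION_NAME...
canary_invoke_statement() {
  local region="$1" account="$2"
  shift 2
  local resources="" sep="" fn arn
  for fn in "$@"; do
    arn="arn:aws:lambda:${region}:${account}:function:${fn}"
    # Unqualified ARN, then a wildcard qualifier covering aliases and versions.
    resources="${resources}${sep}\"${arn}\", \"${arn}:*\""
    sep=", "
  done
  cat <<EOF
{
  "Sid": "LambdaInvokeCanary",
  "Effect": "Allow",
  "Action": "lambda:InvokeFunction",
  "Resource": [${resources}]
}
EOF
}

# Example with the one function name that appears in this PR.
canary_invoke_statement us-east-1 711398986525 alpha-engine-data-collector
```

Listing functions explicitly, instead of using a broad wildcard over the account, keeps the Invoke grant tied to the same resources as the existing UpdateFunctionCode grant.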
Test plan
Known follow-up (NOT in scope)
`infrastructure/deploy.sh`'s rollback block uses `|| true` to swallow errors silently (lines 188-193), which is why the stranded alias never got rolled back on the three recent broken canaries. Separate PR.
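As a pointer for that follow-up, this is one possible shape for a rollback path that reports its own failure instead of hiding it behind `|| true`. It is a sketch under assumptions, not the code in deploy.sh: the function name `run_canary_with_rollback`, the callback-style arguments, and the exit codes are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run a canary, roll back on failure, and never
# swallow a failed rollback.
#
# Exit codes (an assumed convention):
#   0  canary passed
#   1  canary failed, rollback succeeded
#   2  canary failed AND rollback failed (alias may be stranded)

# run_canary_with_rollback CANARY_CMD ROLLBACK_CMD
run_canary_with_rollback() {
  local canary_cmd="$1" rollback_cmd="$2"

  if "$canary_cmd"; then
    echo "canary passed"
    return 0
  fi

  echo "canary failed; attempting rollback" >&2
  if ! "$rollback_cmd"; then
    echo "ROLLBACK FAILED: alias may be stranded on the new version" >&2
    return 2
  fi

  echo "rollback succeeded" >&2
  return 1
}
```

A caller would pass two shell functions, for example one that wraps the `aws lambda invoke` canary and one that repoints the alias at the previous version. Distinct exit codes let the workflow distinguish a clean rollback from a stranded alias.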
🤖 Generated with Claude Code