Fix Phase 2 Lambda deploy: drop dead COPY config.yaml, copy store/ #14
Merged
The Phase 2 Lambda deploy workflow has been failing on every push to main since #12 merged yesterday:

    ERROR: failed to calculate checksum of ref ... "/config.yaml": not found

Root cause: the Dockerfile copies config.yaml into the image, but config.yaml is gitignored per the repo's security policy (it contains bucket names / prefixes we keep out of the public repo), so there is literally nothing to copy during the GitHub Actions checkout.

The Lambda handler (lambda/handler.py) already falls back to a hardcoded default when config.yaml is absent:

    config = {
        "bucket": "alpha-engine-research",
        "market_data": {"s3_prefix": "market_data/"},
    }

So the COPY was dead weight AND a build breakage. Dropped the line and left an inline comment explaining why it must not come back.

Also copied store/ into the image. It was added in #13 as a shared home for the S3 parquet loader, and collectors/macro imports it at module top level. The Lambda handler doesn't import macro directly (Phase 2 only calls alternative.collect), but weekly_collector.py is in the image and does `from collectors import macro` at the top, so any future handler refactor that touches weekly_collector would bomb on `ModuleNotFoundError: store` without this.

Added 'store/**' to the deploy.yml paths filter so a future change to the shared loader actually retriggers the deploy workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
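The fallback described in the commit message above can be sketched as follows. This is a minimal illustration, not the actual contents of lambda/handler.py: the function name `load_config`, the `path` parameter, and the use of PyYAML for the present-file case are all assumptions. Only the default dictionary is taken from the commit message.

```python
import copy
from pathlib import Path

# Default values quoted in the commit message; everything else is illustrative.
DEFAULT_CONFIG = {
    "bucket": "alpha-engine-research",
    "market_data": {"s3_prefix": "market_data/"},
}


def load_config(path="config.yaml"):
    """Return the parsed config file, or a copy of the default if it is absent.

    Hypothetical helper: the real handler may structure this differently.
    """
    config_file = Path(path)
    if not config_file.is_file():
        # config.yaml is gitignored, so in CI and in the image this branch runs.
        return copy.deepcopy(DEFAULT_CONFIG)

    # Assumption: the file is YAML parsed with PyYAML. The import is lazy so
    # the fallback path works even when PyYAML is not installed.
    import yaml

    with config_file.open() as fh:
        return yaml.safe_load(fh)


if __name__ == "__main__":
    print(load_config("does-not-exist.yaml")["bucket"])
```

Because the handler never requires the file, removing the `COPY config.yaml` line changes nothing at runtime; it only removes a build step that cannot succeed in a clean checkout.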
cipher813 added a commit that referenced this pull request on Apr 11, 2026
Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly (validating the #13 and #15 fixes) but DataPhase2 failed with:

    Runtime.ImportModuleError: No module named 'ssm_secrets'

Root cause: lambda/handler.py:22 does `from ssm_secrets import load_secrets` at module top level, but the Dockerfile never copied ssm_secrets.py into the image. The old pre-#14 image was apparently built manually at a time when this import didn't exist, so it ran fine. #14's GitHub Actions auto-deploy rebuilt the image from the current Dockerfile, published version 2, and flipped the `live` alias before the canary step failed on an unrelated IAM permission (see follow-up note below), so the alias is now stuck on the broken v2.

Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile.

Audited the lambda/handler.py + collectors/alternative.py import graphs via AST walk: the only first-party modules pulled in at handler import time are `collectors` (already copied) and `ssm_secrets` (now copied). ssm_secrets.py itself only imports stdlib (logging, os), so it doesn't transitively pull in anything else.

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's paths filter. Both were missing, which means changes to handler.py or the secrets loader would silently not retrigger a Lambda rebuild.

Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` + `lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's why infrastructure/deploy.sh's canary step fails after every successful build, and why the post-canary rollback also fails silently, leaving the alias stranded on whatever version just got published. Two options: (a) add those perms to the OIDC role, or (b) short-circuit the canary step in CI. Separate infrastructure PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
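The import audit mentioned in the commit message above could look roughly like the sketch below. It is not the script that was actually run: the function names, the sample `HANDLER_SRC` text, and the exact import spellings in it (other than `from ssm_secrets import load_secrets`) are assumptions. It also only inspects statements directly at module top level and does not follow imports transitively, so a real audit would repeat it for each first-party module found.

```python
import ast


def top_level_imports(source):
    """Return root module names imported by statements directly at module level."""
    names = set()
    for node in ast.parse(source).body:  # direct children only, no function bodies
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def first_party_imports(source, first_party):
    """Keep only the imports that name a module shipped in the repo itself."""
    return top_level_imports(source) & set(first_party)


# Hypothetical stand-in for lambda/handler.py, for illustration only.
HANDLER_SRC = '''
import logging
import os
from ssm_secrets import load_secrets
from collectors import alternative

def handler(event, context):
    import json  # function-local: not executed when the module is imported
    return json.dumps({"ok": True})
'''

if __name__ == "__main__":
    repo_modules = {"collectors", "ssm_secrets", "store"}
    print(sorted(first_party_imports(HANDLER_SRC, repo_modules)))
```

Every name this reports must have a matching `COPY` line in the Dockerfile, otherwise the Lambda fails at import time the way it did here.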
5 tasks
cipher813 added a commit that referenced this pull request on Apr 11, 2026
Fixes the canary IAM gap that has caused every auto-deploy since #12 to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/... is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:711398986525:function:alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad hoc when the GitHub Actions auto-deploy workflow was introduced in #12. It had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not InvokeFunction. As a result, infrastructure/deploy.sh's post-update canary step (aws lambda invoke with dry_run=true) failed with AccessDenied, the rollback then also failed silently because the script's `|| true` swallowed the error, and every deploy since has been leaving the alias stranded on whatever version just got published, with no safety net. Three deploys in a row (#13, #14, #16) all looked like failures despite the underlying Lambda being updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled IAM policies. It's intentionally low-ceremony: flat JSON files, one per role, applied via a small idempotent shell script. No CloudFormation, no Terraform. For a 5-module infra-light project, a flat directory is the right amount of rigor. Migrate to CFN later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing deploy-role policy, granting lambda:InvokeFunction on all 5 alpha-engine Lambdas and their aliases/versions. Scoped narrowly to the same functions the role already has UpdateFunctionCode on, so the blast radius is unchanged: an attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas.

Applied live via `infrastructure/iam/apply.sh github-actions-lambda-deploy` before committing, so the next deploy workflow run actually passes the canary step.

Also cleaned up the orphaned old `deploy-lambdas` inline policy (the new file's name is `github-actions-lambda-deploy-policy`, matching the convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is diffable, reviewable, and recoverable. If a future PR drops a permission, code review catches it at PR time instead of surfacing as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic still uses `|| true` to swallow errors silently, which is why the stranded alias never got rolled back. That's a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
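To make the shape of the LambdaInvokeCanary statement concrete, here is a small sketch that builds one as a Python dict and prints it as JSON. It is an illustration under assumptions, not the file committed under infrastructure/iam/: the helper name `invoke_canary_statement` is hypothetical, the account ID is a placeholder, and only `alpha-engine-data-collector` is listed because the other four function names are not given in this thread.

```python
import json


def invoke_canary_statement(function_names, account_id, region="us-east-1"):
    """Build an IAM statement allowing lambda:InvokeFunction on the given functions.

    Hypothetical helper; the real policy is a hand-written JSON file.
    """
    resources = []
    for name in function_names:
        base = f"arn:aws:lambda:{region}:{account_id}:function:{name}"
        resources.append(base)         # the unqualified function
        resources.append(f"{base}:*")  # any alias or version qualifier, e.g. :live
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


if __name__ == "__main__":
    statement = invoke_canary_statement(
        ["alpha-engine-data-collector"],  # the one function named in this thread
        account_id="123456789012",        # placeholder account ID
    )
    print(json.dumps(statement, indent=2))
```

Listing both the bare function ARN and the `:*` qualifier form is what lets the canary invoke the `live` alias, which is the resource named in the AccessDeniedException above.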
6 tasks
Summary

The Phase 2 Lambda `Deploy` workflow has been failing on every push to main since #12 merged yesterday, including the #13 merge a few minutes ago. The failure is unrelated to #13; #12 introduced it.

Root cause: the Dockerfile copies `config.yaml` into the image, but `config.yaml` is gitignored per the repo's security policy. There is literally nothing to copy at build time, so `docker build` dies with: `ERROR: failed to calculate checksum of ref ... "/config.yaml": not found`

`lambda/handler.py` already falls back to a hardcoded default (`{"bucket": "alpha-engine-research", "market_data": {"s3_prefix": "market_data/"}}`) when `config.yaml` is absent, so the COPY was dead weight and a build breakage.

Changes

- Drop `COPY config.yaml`, leave an inline comment explaining why it must not come back.
- Add `COPY store/ ${LAMBDA_TASK_ROOT}/store/`. The store package is new (from #13, "Fix macro breadth: compute from slim cache instead of writing null") and `collectors/macro.py` imports from it at module top level. `weekly_collector.py` is in the image and does `from collectors import macro`, so any future handler refactor that touches weekly_collector would bomb on `ModuleNotFoundError: store` without this.
- Add `store/**` to the paths filter so future changes to the shared loader retrigger the deploy workflow.

Test plan

- `pytest tests/ -q`: 61 passed
- `Deploy` workflow on main turns green
- `LastModified` timestamp moves forward after the merge

Follow-up
The `Deploy` workflow only runs on push to `main`, not on PR. Consider adding a build-only `docker build` step (no push or deploy) to `ci.yml`, or a separate `build-check` workflow on PR, so that breakage like this is caught before merge. Not in scope for this hot-fix.

🤖 Generated with Claude Code