Fix Phase 2 Lambda deploy: drop dead COPY config.yaml, copy store/ by cipher813 · Pull Request #14 · cipher813/alpha-engine-data

cipher813 · 2026-04-11T00:48:33Z

Summary

The Phase 2 Lambda Deploy workflow has been failing on every push to main since #12 merged yesterday — including the #13 merge a few minutes ago. The failure is unrelated to #13; #12 introduced it.

Root cause: the Dockerfile copies config.yaml into the image, but config.yaml is gitignored per the repo's security policy. There is literally nothing to copy at build time, so docker build dies with:

ERROR: failed to calculate checksum of ref ...
"/config.yaml": not found

lambda/handler.py already falls back to a hardcoded default ({"bucket": "alpha-engine-research", "market_data": {"s3_prefix": "market_data/"}}) when config.yaml is absent, so the COPY was dead weight and a build breakage.
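
A minimal sketch of the fallback described above, assuming the handler loads its config roughly this way. The function name `load_config` and the lazy PyYAML import are illustrative assumptions, not the actual contents of lambda/handler.py; only the default dictionary is taken from this PR.

```python
import copy
from pathlib import Path

# Default values quoted in this PR; everything else in this sketch is hypothetical.
DEFAULT_CONFIG = {
    "bucket": "alpha-engine-research",
    "market_data": {"s3_prefix": "market_data/"},
}


def load_config(path="config.yaml"):
    """Return the parsed config file, or the built-in default if it is absent."""
    config_path = Path(path)
    if not config_path.is_file():
        # No config.yaml in the image (it is gitignored), so this branch runs.
        return copy.deepcopy(DEFAULT_CONFIG)

    import yaml  # PyYAML, imported lazily so the fallback path needs no dependency

    with config_path.open() as fh:
        loaded = yaml.safe_load(fh)
    # An empty file parses to None; treat that like a missing file.
    return loaded if loaded else copy.deepcopy(DEFAULT_CONFIG)
```

With a loader shaped like this, the image never needs config.yaml to exist, which is why the COPY line could be removed without changing runtime behavior.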

Changes

Dockerfile: drop COPY config.yaml, leave an inline comment explaining why it must not come back.
Dockerfile: add COPY store/ ${LAMBDA_TASK_ROOT}/store/. The store package is new (added in #13, "Fix macro breadth: compute from slim cache instead of writing null") and collectors/macro.py imports from it at module top level. weekly_collector.py is in the image and does from collectors import macro, so any future handler refactor that touches weekly_collector would bomb on ModuleNotFoundError: store without this.
.github/workflows/deploy.yml: add store/** to the paths filter so future changes to the shared loader retrigger the deploy workflow.

Test plan

pytest tests/ -q — 61 passed
Gitleaks pre-commit pass
After merge, verify Deploy workflow on main turns green
Confirm Phase 2 Lambda's LastModified timestamp moves forward after the merge

Follow-up

The Deploy workflow only runs on push to main, not on PR. Consider adding a build-only docker build step (no push, no deploy) to ci.yml, or a separate build-check workflow on PR, so that breakage like this is caught before merge. Not in scope for this hot-fix.
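
As a cheaper complement to a full image build on PR, a pre-build check could confirm that every COPY source actually exists in the checkout. The sketch below is an assumption about how such a check might look, not part of this repo; `missing_copy_sources` is a hypothetical name, and the parser deliberately ignores line continuations, the JSON-array form of COPY, and .dockerignore.

```python
import glob
import os
import shlex


def missing_copy_sources(dockerfile_text, context_dir="."):
    """Return COPY sources in a Dockerfile that match nothing in the build context."""
    missing = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line.upper().startswith("COPY "):
            continue
        tokens = shlex.split(line)[1:]
        # COPY --from=... reads from another stage or image, not the context.
        if any(t.startswith("--from=") for t in tokens):
            continue
        args = [t for t in tokens if not t.startswith("--")]
        if len(args) < 2:
            continue
        for src in args[:-1]:  # the last argument is the destination
            if not glob.glob(os.path.join(context_dir, src)):
                missing.append(src)
    return missing
```

Run against the pre-fix Dockerfile in a clean checkout, a check like this would have reported config.yaml before docker build ever started.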

🤖 Generated with Claude Code

The Phase 2 Lambda deploy workflow has been failing on every push to main since #12 merged yesterday:

    ERROR: failed to calculate checksum of ref ...
    "/config.yaml": not found

Root cause: the Dockerfile copies config.yaml into the image, but config.yaml is gitignored per the repo's security policy (it contains bucket names / prefixes we keep out of the public repo), so there is literally nothing to copy during the GitHub Actions checkout.

The Lambda handler (lambda/handler.py) already falls back to a hardcoded default when config.yaml is absent:

    config = {
        "bucket": "alpha-engine-research",
        "market_data": {"s3_prefix": "market_data/"},
    }

So the COPY was dead weight AND a build breakage. Dropped the line and left an inline comment explaining why it must not come back.

Also copied store/ into the image. It was added in #13 as a shared home for the S3 parquet loader, and collectors/macro imports it at module top level. The Lambda handler doesn't import macro directly (Phase 2 only calls alternative.collect), but weekly_collector.py is in the image and does `from collectors import macro` at the top, so any future handler refactor that touches weekly_collector would bomb on `ModuleNotFoundError: store` without this.

Added 'store/**' to the deploy.yml paths filter so a future change to the shared loader actually retriggers the deploy workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly (validating the #13 and #15 fixes) but DataPhase2 failed with:

    Runtime.ImportModuleError: No module named 'ssm_secrets'

Root cause: lambda/handler.py:22 does `from ssm_secrets import load_secrets` at module top-level, but the Dockerfile never copied ssm_secrets.py into the image. The old pre-#14 image was apparently built manually at a time when this import didn't exist, so it ran fine. #14's GitHub Actions auto-deploy rebuilt the image from the current Dockerfile, published version 2, and flipped the `live` alias before the canary step failed on an unrelated IAM permission (see follow-up note below), so the alias is now stuck on the broken v2.

Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile.

Audited lambda/handler.py + collectors/alternative.py import graphs via AST walk: the only first-party modules pulled in at handler import time are `collectors` (already copied) and `ssm_secrets` (now copied). ssm_secrets.py itself only imports stdlib (logging, os), so it doesn't transitively pull anything else.

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's paths filter. Both were missing, which means changes to handler.py or the secrets loader would silently not retrigger a Lambda rebuild.

Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` + `lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's why infrastructure/deploy.sh's canary step fails after every successful build, and why the post-canary rollback also fails silently, leaving the alias stranded on whatever version just got published. Two options: (a) add those perms to the OIDC role, or (b) short-circuit the canary step in CI. Separate infrastructure PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
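
The AST-walk audit mentioned above could look something like the sketch below. It is an assumption about the approach, not the script that was actually run: the helper names are hypothetical, it inspects a single file rather than following imports transitively, and it only sees statements directly at module level (imports nested inside `try` or `if` blocks are skipped).

```python
import ast


def top_level_imports(source):
    """Return the root module names imported by module-level statements."""
    names = set()
    for node in ast.parse(source).body:  # module level only, not function bodies
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def missing_first_party(source, first_party, copied):
    """First-party modules imported at module level but absent from the image."""
    return sorted((top_level_imports(source) & set(first_party)) - set(copied))


# Hypothetical stand-in for handler.py; only the ssm_secrets import is quoted from this PR.
sample = (
    "import logging\n"
    "from ssm_secrets import load_secrets\n"
    "from collectors import alternative\n"
)
print(missing_first_party(sample, {"collectors", "ssm_secrets", "store"}, {"collectors"}))
```

Feeding the real handler source and the set of paths the Dockerfile copies into a check like this is one way to catch a missing COPY before the Lambda fails at import time.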

Fixes the canary IAM gap that has caused every auto-deploy since #12 to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/... is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:711398986525:function:alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when the GitHub Actions auto-deploy workflow was introduced in #12. It had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not InvokeFunction. As a result, infrastructure/deploy.sh's post-update canary step (aws lambda invoke with dry_run=true) failed with AccessDenied, the rollback then also failed silently because the script's || true swallowed the error, and every deploy since has been leaving the alias stranded on whatever version just got published with no safety net. Three deploys in a row (#13, #14, #16) all looked like failures despite the underlying Lambda being updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled IAM policies. It's intentionally low-ceremony: flat JSON files, one per role, applied via a small idempotent shell script. No CloudFormation, no Terraform. For a 5-module infra-light project, a flat directory is the right amount of rigor. Migrate to CFN later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing deploy-role policy, granting lambda:InvokeFunction on all 5 alpha-engine Lambdas and their aliases/versions. Scoped narrowly to the same functions the role already has UpdateFunctionCode on, so the blast radius is unchanged: an attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas.

Applied live via `infrastructure/iam/apply.sh github-actions-lambda-deploy` before committing, so the next deploy workflow run actually passes the canary step.

Also cleaned up the orphaned old `deploy-lambdas` inline policy (the new file's name is `github-actions-lambda-deploy-policy`, matching the convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is diffable, reviewable, and recoverable. If a future PR drops a permission, code review catches it at PR time instead of surfacing as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic still uses `|| true` to swallow errors silently, which is why the stranded alias never got rolled back. That's a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
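
For illustration, the shape of a LambdaInvokeCanary statement like the one described above can be sketched as follows. This is an assumption about the policy layout, not the JSON committed under infrastructure/iam/: the builder function is hypothetical, the account ID is a placeholder, and only one of the five function names appears in this PR.

```python
import json


def invoke_canary_statement(account_id, region, function_names):
    """Build an IAM statement allowing InvokeFunction on functions and their qualifiers."""
    resources = []
    for name in function_names:
        base = f"arn:aws:lambda:{region}:{account_id}:function:{name}"
        resources.append(base)         # the unqualified function
        resources.append(f"{base}:*")  # any alias or version, such as :live
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


# Placeholder account ID; the function name is the one quoted in the error above.
statement = invoke_canary_statement(
    "123456789012", "us-east-1", ["alpha-engine-data-collector"]
)
print(json.dumps(statement, indent=2))
```

Listing both the bare function ARN and the `:*` qualifier pattern is what lets the canary invoke the `live` alias without widening the grant beyond the named functions.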

cipher813 merged commit b95052c into main Apr 11, 2026
1 check passed

cipher813 deleted the fix/dockerfile-remove-gitignored-config-copy branch April 11, 2026 00:50

cipher813 mentioned this pull request Apr 11, 2026

Copy ssm_secrets.py into Phase 2 Lambda image (unblocks DataPhase2) #16

Merged

5 tasks

cipher813 mentioned this pull request Apr 11, 2026

Version-control IAM policies; add lambda:InvokeFunction to deploy role #17

Merged

6 tasks

cipher813 merged 1 commit into main from fix/dockerfile-remove-gitignored-config-copy
