Fix Phase 2 Lambda deploy: drop dead COPY config.yaml, copy store/ #14

Merged
cipher813 merged 1 commit into main from fix/dockerfile-remove-gitignored-config-copy on Apr 11, 2026
@cipher813
Owner

Summary

The Phase 2 Lambda Deploy workflow has been failing on every push to main since #12 merged yesterday — including the #13 merge a few minutes ago. The failure is unrelated to #13; #12 introduced it.

Root cause: the Dockerfile copies config.yaml into the image, but config.yaml is gitignored per the repo's security policy. There is literally nothing to copy at build time, so docker build dies with:

ERROR: failed to calculate checksum of ref ...
"/config.yaml": not found

lambda/handler.py already falls back to a hardcoded default ({"bucket": "alpha-engine-research", "market_data": {"s3_prefix": "market_data/"}}) when config.yaml is absent, so the COPY was dead weight and a build breakage.
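
The fallback described above can be sketched as follows. This is a minimal illustration, not the actual contents of lambda/handler.py: the function name `load_config` and the use of PyYAML are assumptions, and only the default dictionary is taken from this PR.

```python
import copy
import os

# Default values as quoted in this PR description.
DEFAULT_CONFIG = {
    "bucket": "alpha-engine-research",
    "market_data": {"s3_prefix": "market_data/"},
}


def load_config(path="config.yaml"):
    """Return the parsed config if the file exists, else a copy of the default.

    `load_config` is a hypothetical name; the real handler may be organized differently.
    """
    if not os.path.exists(path):
        # No config.yaml in the image: use the hardcoded default.
        return copy.deepcopy(DEFAULT_CONFIG)
    import yaml  # PyYAML, assumed here; imported lazily so the fallback path needs no extra dependency

    with open(path) as fh:
        loaded = yaml.safe_load(fh)
    # An empty file parses to None, so fall back in that case too.
    return loaded if loaded else copy.deepcopy(DEFAULT_CONFIG)
```

Because the missing-file branch never touches the YAML parser, an image built without config.yaml behaves the same as one where the COPY had succeeded with default values.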

Changes

  • Dockerfile: drop COPY config.yaml, leave an inline comment explaining why it must not come back.
  • Dockerfile: add COPY store/ ${LAMBDA_TASK_ROOT}/store/. The store package is new (from #13, "Fix macro breadth: compute from slim cache instead of writing null") and collectors/macro.py imports from it at module top level. weekly_collector.py is in the image and does from collectors import macro, so without this any future handler refactor that imports weekly_collector would fail with ModuleNotFoundError: No module named 'store'.
  • .github/workflows/deploy.yml: add store/** to the paths filter so future changes to the shared loader retrigger the deploy workflow.

Test plan

  • pytest tests/ -q — 61 passed
  • Gitleaks pre-commit pass
  • After merge, verify Deploy workflow on main turns green
  • Confirm Phase 2 Lambda's LastModified timestamp moves forward after the merge

Follow-up

The Deploy workflow only runs on push to main, not on PR. Consider adding a build-only docker build step (no push, no deploy) to ci.yml, or a separate build-check workflow on PR, so that breakage like this is caught before merge. Not in scope for this hot-fix.
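
Short of a full image build, a cheaper PR-time check for this specific failure is to confirm that every COPY source in the Dockerfile exists in a clean checkout. The sketch below assumes a simple Dockerfile: shell-form COPY lines, no globs, no line continuations. `missing_copy_sources` is a hypothetical helper, not something in this repo.

```python
import os
import shlex


def missing_copy_sources(dockerfile_text, context_dir):
    """Return COPY sources that do not exist under context_dir.

    Handles only shell-form COPY instructions; the JSON array form,
    globs, and backslash continuations are out of scope for this sketch.
    """
    missing = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line.upper().startswith("COPY "):
            continue  # also skips commented-out COPY lines, which start with '#'
        tokens = shlex.split(line)[1:]
        # Copies from another build stage do not read from the checkout.
        if any(tok.startswith("--from=") for tok in tokens):
            continue
        args = [tok for tok in tokens if not tok.startswith("--")]
        # The last argument is the destination; everything before it is a source.
        for src in args[:-1]:
            if not os.path.exists(os.path.join(context_dir, src)):
                missing.append(src)
    return missing
```

This only means something in a fresh checkout such as a CI runner. On a developer machine a gitignored config.yaml may well be present, which is how the original COPY could have looked fine locally.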

🤖 Generated with Claude Code

The Phase 2 Lambda deploy workflow has been failing on every push to
main since #12 merged yesterday:

    ERROR: failed to calculate checksum of ref ...
    "/config.yaml": not found

Root cause: the Dockerfile copies config.yaml into the image, but
config.yaml is gitignored per the repo's security policy (it contains
bucket names / prefixes we keep out of the public repo), so there is
literally nothing to copy during the GitHub Actions checkout.

The Lambda handler (lambda/handler.py) already falls back to a
hardcoded default when config.yaml is absent:

    config = {
        "bucket": "alpha-engine-research",
        "market_data": {"s3_prefix": "market_data/"},
    }

So the COPY was dead weight AND a build breakage. Dropped the line
and left an inline comment explaining why it must not come back.

Also copied store/ into the image — it was added in #13 as a shared
home for the S3 parquet loader and collectors/macro imports it at
module top level. The Lambda handler doesn't import macro directly
(Phase 2 only calls alternative.collect), but weekly_collector.py is
in the image and does `from collectors import macro` at the top, so
any future handler refactor that touches weekly_collector would bomb
on `ModuleNotFoundError: No module named 'store'` without this.

Added 'store/**' to the deploy.yml paths filter so a future change
to the shared loader actually retriggers the deploy workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 merged commit b95052c into main on Apr 11, 2026
1 check passed
cipher813 deleted the fix/dockerfile-remove-gitignored-config-copy branch on April 11, 2026 00:50
cipher813 added a commit that referenced this pull request Apr 11, 2026
Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly
(validating the #13 and #15 fixes) but DataPhase2 failed with:

    Runtime.ImportModuleError: No module named 'ssm_secrets'

Root cause: lambda/handler.py:22 does `from ssm_secrets import
load_secrets` at module top-level, but the Dockerfile never copied
ssm_secrets.py into the image. The old pre-#14 image was apparently
built manually at a time when this import didn't exist, so it ran
fine. #14's GitHub Actions auto-deploy rebuilt the image from the
current Dockerfile, published version 2, and flipped the `live` alias
before the canary step failed on an unrelated IAM permission (see
follow-up note below) — so the alias is now stuck on the broken v2.

Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile.
Audited lambda/handler.py + collectors/alternative.py import graphs
via AST walk: the only first-party modules pulled in at handler
import time are `collectors` (already copied) and `ssm_secrets` (now
copied). ssm_secrets.py itself only imports stdlib (logging, os), so
it doesn't transitively pull anything else.
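
An import audit of this kind can be sketched with the standard-library `ast` module. The commit does not include the script that was actually run, so the helper names below are hypothetical. The sketch inspects one file's module-level statements only: it does not follow imports transitively, and it does not see imports nested inside `try` or `if` blocks.

```python
import ast
import os


def top_level_imports(source):
    """Return root module names imported by module-level statements in source."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they stay inside the package.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def first_party(names, repo_root):
    """Keep only names that resolve to a .py file or a directory under repo_root."""
    found = set()
    for name in names:
        base = os.path.join(repo_root, name)
        if os.path.isfile(base + ".py") or os.path.isdir(base):
            found.add(name)
    return found
```

Running `first_party(top_level_imports(text), repo_root)` on each file the handler loads gives the set of first-party modules to compare against the Dockerfile's COPY lines.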

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's
paths filter — both were missing, which means changes to handler.py
or the secrets loader would silently not retrigger a Lambda rebuild.

Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC
role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` +
`lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's
why infrastructure/deploy.sh's canary step fails after every
successful build, and why the post-canary rollback also fails
silently — leaving the alias stranded on whatever version just got
published. Two options: (a) add those perms to the OIDC role, or
(b) short-circuit the canary step in CI. Separate infrastructure PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 11, 2026
Fixes the canary IAM gap that has caused every auto-deploy since #12
to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/...
    is not authorized to perform: lambda:InvokeFunction on resource:
    arn:aws:lambda:us-east-1:711398986525:function:
    alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when
the GitHub Actions auto-deploy workflow was introduced in #12. It
had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not
InvokeFunction — so infrastructure/deploy.sh's post-update canary
step (aws lambda invoke with dry_run=true) failed with
AccessDenied, the rollback then also failed silently because the
script's || true swallowed the error, and every deploy since has
been leaving the alias stranded on whatever version just got
published with no safety net. Three deploys in a row (#13, #14,
#16) all looked like failures despite the underlying Lambda being
updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled
   IAM policies. It's intentionally low-ceremony — flat JSON files,
   one per role, applied via a small idempotent shell script. No
   CloudFormation, no Terraform. For a 5-module infra-light project,
   a flat directory is the right amount of rigor. Migrate to CFN
   later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing
   deploy-role policy, granting lambda:InvokeFunction on all 5
   alpha-engine Lambdas and their aliases/versions. Scoped
   narrowly to the same functions the role already has
   UpdateFunctionCode on, so the blast radius is unchanged: an
   attacker with ECR push + UpdateFunctionCode can already run
   arbitrary code in these Lambdas.
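
For illustration, a statement of the shape described in item 2 could be generated as below. This is a sketch under assumptions: the real file under infrastructure/iam/ is not shown in this commit, the helper name is hypothetical, the account ID is a placeholder, and only one of the five function names appears in this thread.

```python
import json


def invoke_canary_statement(function_names, region, account_id):
    """Build an IAM statement allowing lambda:InvokeFunction on each function
    and on any of its aliases or versions."""
    resources = []
    for name in function_names:
        arn = f"arn:aws:lambda:{region}:{account_id}:function:{name}"
        resources.append(arn)         # unqualified function ARN
        resources.append(arn + ":*")  # any alias or version qualifier, e.g. :live
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


# Placeholder account ID; the other four function names are not listed in this thread.
example = invoke_canary_statement(
    ["alpha-engine-data-collector"], "us-east-1", "123456789012"
)
print(json.dumps(example, indent=2))
```

Listing both the bare ARN and the `:*` form matters here because the canary invokes the `live` alias, and a statement scoped only to the unqualified ARN would not cover a qualified one.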

Applied live via `infrastructure/iam/apply.sh
github-actions-lambda-deploy` before committing — so the next
deploy workflow run actually passes the canary step. Also cleaned
up the orphaned old `deploy-lambdas` inline policy (the new file's
name is `github-actions-lambda-deploy-policy`, matching the
convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is
diffable, reviewable, and recoverable. If a future PR drops a
permission, code review catches it at PR time instead of surfacing
as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic
still uses `|| true` to swallow errors silently, which is why the
stranded alias never got rolled back. That's a separate PR.
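
The control flow that follow-up is after can be sketched as below, in Python rather than shell so it can be exercised without AWS. It is not the contents of deploy.sh: `set_alias` and `run_canary` are hypothetical callables standing in for the alias update and the dry-run invoke.

```python
class CanaryFailed(RuntimeError):
    """Raised when the canary fails and the alias was rolled back successfully."""


def promote_with_canary(set_alias, run_canary, previous_version, new_version):
    """Point the alias at new_version, run the canary, roll back on failure.

    Nothing is swallowed: if the rollback itself raises, that exception
    propagates, with the canary error attached as its context.
    """
    set_alias(new_version)
    try:
        run_canary()
    except Exception as canary_err:
        # The equivalent of dropping `|| true`: a failed rollback is loud.
        set_alias(previous_version)
        raise CanaryFailed(
            f"canary failed; alias rolled back to {previous_version}"
        ) from canary_err
    return new_version
```

With this shape a run ends in one of three visible states: promoted, rolled back (`CanaryFailed`), or rollback failed (the underlying error), so a stranded alias can no longer pass unnoticed.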

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>