Copy ssm_secrets.py into Phase 2 Lambda image (unblocks DataPhase2)#16
Merged
Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly (validating the #13 and #15 fixes) but DataPhase2 failed with:

    Runtime.ImportModuleError: No module named 'ssm_secrets'

Root cause: lambda/handler.py:22 does `from ssm_secrets import load_secrets` at module top-level, but the Dockerfile never copied ssm_secrets.py into the image. The old pre-#14 image was apparently built manually at a time when this import didn't exist, so it ran fine. #14's GitHub Actions auto-deploy rebuilt the image from the current Dockerfile, published version 2, and flipped the `live` alias before the canary step failed on an unrelated IAM permission (see follow-up note below), so the alias is now stuck on the broken v2.

Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile. Audited lambda/handler.py + collectors/alternative.py import graphs via AST walk: the only first-party modules pulled in at handler import time are `collectors` (already copied) and `ssm_secrets` (now copied). ssm_secrets.py itself only imports stdlib (logging, os), so it doesn't transitively pull anything else.

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's paths filter. Both were missing, which means changes to handler.py or the secrets loader would silently not retrigger a Lambda rebuild.

Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` + `lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's why infrastructure/deploy.sh's canary step fails after every successful build, and why the post-canary rollback also fails silently, leaving the alias stranded on whatever version just got published. Two options: (a) add those perms to the OIDC role, or (b) short-circuit the canary step in CI. Separate infrastructure PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on Apr 11, 2026
Fixes the canary IAM gap that has caused every auto-deploy since #12 to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/... is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:711398986525:function:alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when the GitHub Actions auto-deploy workflow was introduced in #12. It had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not InvokeFunction. As a result, infrastructure/deploy.sh's post-update canary step (aws lambda invoke with dry_run=true) failed with AccessDenied, the rollback then also failed silently because the script's `|| true` swallowed the error, and every deploy since has been leaving the alias stranded on whatever version just got published with no safety net. Three deploys in a row (#13, #14, #16) all looked like failures despite the underlying Lambda being updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled IAM policies. It's intentionally low-ceremony: flat JSON files, one per role, applied via a small idempotent shell script. No CloudFormation, no Terraform. For a 5-module infra-light project, a flat directory is the right amount of rigor. Migrate to CFN later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing deploy-role policy, granting lambda:InvokeFunction on all 5 alpha-engine Lambdas and their aliases/versions. Scoped narrowly to the same functions the role already has UpdateFunctionCode on, so the blast radius is unchanged: an attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas.

Applied live via `infrastructure/iam/apply.sh github-actions-lambda-deploy` before committing, so the next deploy workflow run actually passes the canary step.

Also cleaned up the orphaned old `deploy-lambdas` inline policy (the new file's name is `github-actions-lambda-deploy-policy`, matching the convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is diffable, reviewable, and recoverable. If a future PR drops a permission, code review catches it at PR time instead of surfacing as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic still uses `|| true` to swallow errors silently, which is why the stranded alias never got rolled back. That's a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
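For illustration, the shape of a statement like LambdaInvokeCanary can be sketched in Python. This is only a sketch under assumptions: the helper name `invoke_canary_statement` is hypothetical, the account ID and region are taken from the AccessDenied message above, and only the one function name that appears in this thread is used (the other four are not listed here, so they are not guessed). The actual JSON file under infrastructure/iam/ may be laid out differently.

```python
import json

# Taken from the ARN in the AccessDenied message; everything else is illustrative.
ACCOUNT_ID = "711398986525"
REGION = "us-east-1"


def invoke_canary_statement(function_names):
    """Build an IAM statement allowing InvokeFunction on each function
    and on any alias or version qualifier of that function."""
    resources = []
    for name in function_names:
        base = f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{name}"
        resources.append(base)         # the unqualified function
        resources.append(f"{base}:*")  # any alias or version, e.g. ":live"
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


# Only the function named in this thread; the real policy covers five.
statement = invoke_canary_statement(["alpha-engine-data-collector"])
print(json.dumps(statement, indent=2))
```

Listing both the bare ARN and the `:*` form matters because the canary invokes the `live` alias, and a qualified ARN is a distinct resource from the unqualified function for IAM matching.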
cipher813 added a commit that referenced this pull request on May 3, 2026
…143)

Stage 3 of the RAG cleanup arc. Lib v0.3.0 (alpha-engine-lib PR #16, merged + tagged) introduced alpha_engine_lib.rag with the consolidated db/embeddings/retrieval/schema. Stage 2 (alpha-engine-research PR #97) migrated research's side. This PR migrates data's.

Changes:

Pipeline imports updated (5 files):
- rag/pipelines/{ingest_theses,ingest_8k_filings,ingest_sec_filings,ingest_earnings_finnhub,filing_change_detection}.py: `from rag.{embeddings,retrieval,db}` → `from alpha_engine_lib.rag.{...}`

Deleted (now redundant; code lives in lib):
- rag/db.py (was the canonical drift source; its register_vector fix was the basis for the lib version)
- rag/embeddings.py
- rag/retrieval.py
- rag/schema.sql

Kept:
- rag/__init__.py: minimal namespace placeholder
- rag/pipelines/: ingestion pipelines stay here (data's versions inline the signals lookup; canonical location decision deferred to a later cleanup arc pending production SF state verification)
- rag/preflight.py: already uses alpha_engine_lib.logging; no changes needed

Lib pin bumped:
- requirements.txt: alpha-engine-lib v0.2.4 → v0.3.0
- Extras: added [rag] alongside [arcticdb,flow_doctor]
- Note flagged on the direct pgvector/psycopg2-binary pins above the lib line. These become redundant once the [rag] extra soaks; will drop in a follow-up PR rather than ripping them out today

Verified:
- All 433 data tests pass
- alpha_engine_lib.rag imports resolve (retrieve, is_available, embed_texts)
- rag.preflight imports cleanly

Companion:
- alpha-engine-lib PR #16 (merged + tagged v0.3.0)
- alpha-engine-research PR #97 (Stage 2: research migration)
- Stage 4 (deferred): pipelines canonical-location decision

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Tonight's pipeline rerun passed DataPhase1, RAG, and Research cleanly (validating #13 macro-breadth fix + #15 Step Function HandleFailure fix end-to-end), but DataPhase2 failed with:
```
Runtime.ImportModuleError: No module named 'ssm_secrets'
```
Root cause
`lambda/handler.py:22` does `from ssm_secrets import load_secrets` at module top level, but the Dockerfile never copied `ssm_secrets.py` into the image. The pre-#14 image was built manually before this import existed, so it ran without error. #14's GitHub Actions auto-deploy rebuilt the image from the current Dockerfile, published version 2, and flipped the `live` alias before the canary step failed on an unrelated IAM permission, so the alias is now stuck on the broken v2.
Verified:
```
aws lambda get-alias --function-name alpha-engine-data-collector --name live
FunctionVersion: 2
LastModified: 2026-04-11T00:51:38Z (mid-#14-deploy)
```
Fix
Add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile. Audited both `lambda/handler.py` and `collectors/alternative.py` import graphs via AST walk — the only first-party modules pulled in at handler import time are `collectors` (already copied) and `ssm_secrets` (now copied). `ssm_secrets.py` itself only imports stdlib (`logging`, `os`), so no transitive issues.
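A minimal sketch of the kind of AST walk described above, for anyone repeating the audit. This is not the script that was actually run. The function names and the sample handler source are hypothetical (only the `from ssm_secrets import load_secrets` line is known to exist in `lambda/handler.py`), and it inspects module-level statements only, so imports nested inside `try`/`if` blocks would need a deeper walk.

```python
import ast


def top_level_imports(source: str) -> set[str]:
    """Return the root package names imported by module-level statements."""
    roots = set()
    # Iterate over tree.body only: these statements execute at import time,
    # which is when Runtime.ImportModuleError is raised.
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            roots.add(node.module.split(".")[0])
    return roots


def first_party(source: str, local_modules: set[str]) -> set[str]:
    """Keep only the imports that must be COPY'd into the image."""
    return top_level_imports(source) & local_modules


# Hypothetical stand-in for lambda/handler.py.
HANDLER_SNIPPET = '''
import logging
import os
from ssm_secrets import load_secrets
from collectors import alternative

def handler(event, context):
    import json  # function-level import: not executed at module import time
    return {}
'''

found = first_party(HANDLER_SNIPPET, {"collectors", "ssm_secrets", "rag"})
print(sorted(found))  # ['collectors', 'ssm_secrets']
```

Comparing the resulting set against the Dockerfile's `COPY` lines gives a quick check that every first-party module needed at import time is present in the image.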
Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's `paths` filter — both were missing, which means changes to `handler.py` or the secrets loader would silently not retrigger a Lambda rebuild.
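To make the failure mode concrete, here is a rough Python approximation of how a `paths` filter decides whether a push retriggers the workflow. It is a sketch under assumptions: `fnmatch` only approximates GitHub's glob rules, the `OLD_PATHS` list is hypothetical (the previous filter contents are not shown in this PR), and `would_trigger` is an illustrative helper, not part of any tool.

```python
from fnmatch import fnmatchcase

# Hypothetical previous filter; the real list is not shown in this PR.
OLD_PATHS = ["Dockerfile", "collectors/**", "infrastructure/deploy.sh"]
# The two entries this PR adds.
NEW_PATHS = OLD_PATHS + ["ssm_secrets.py", "lambda/**"]


def would_trigger(changed_files, patterns):
    """True if any changed file matches any pattern (approximate semantics)."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


changed = ["lambda/handler.py"]
print(would_trigger(changed, OLD_PATHS))  # False: the edit ships no rebuild
print(would_trigger(changed, NEW_PATHS))  # True: the edit retriggers the deploy
```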
Test plan
Known follow-up (NOT in scope)
The GitHub Actions OIDC role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` + (implicitly) `lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's why `infrastructure/deploy.sh`'s canary step has failed after every auto-deploy since #14 landed, and the post-canary rollback path also silently fails, stranding the alias on whatever version just got published. The Lambda still gets updated because `UpdateFunctionCode` succeeds first; the canary is a post-update safety check. Two fixes possible: (a) add those permissions to the OIDC role, or (b) short-circuit the canary step in CI.
Either way, that's a separate infrastructure PR. Tonight's manual canary can run from the local AWS creds after this PR merges.
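For reference, the canary-then-rollback behaviour the follow-up is aiming for can be sketched in Python. The real logic lives in `infrastructure/deploy.sh` and is not reproduced here. `canary_or_rollback`, `CanaryFailed`, and `FakeLambda` are hypothetical names, the `{"dry_run": true}` payload shape is an assumption, and the client is duck-typed to match the keyword arguments of boto3's Lambda `invoke` and `update_alias` so the sketch runs without AWS access.

```python
import json


class CanaryFailed(RuntimeError):
    """Raised after a failed canary so the deploy is reported as failed."""


def canary_or_rollback(client, function_name, alias, previous_version):
    """Invoke the alias with a dry-run payload; on failure, point the alias
    back at previous_version and raise instead of swallowing the error."""
    try:
        resp = client.invoke(
            FunctionName=function_name,
            Qualifier=alias,
            Payload=json.dumps({"dry_run": True}).encode(),
        )
        failed = resp.get("StatusCode") != 200 or "FunctionError" in resp
        reason = resp.get("FunctionError", f"status {resp.get('StatusCode')}")
    except Exception as exc:  # e.g. AccessDenied from a missing permission
        failed, reason = True, repr(exc)
    if not failed:
        return
    # No error suppression here: if the rollback itself fails, that
    # exception propagates and the stranded alias is visible in CI.
    client.update_alias(
        FunctionName=function_name, Name=alias, FunctionVersion=previous_version
    )
    raise CanaryFailed(f"canary failed ({reason}); alias reset to v{previous_version}")


class FakeLambda:
    """Stand-in for the Lambda client, used only to exercise the sketch."""

    def __init__(self, invoke_ok):
        self.invoke_ok = invoke_ok
        self.alias_version = "2"  # alias currently on the new version

    def invoke(self, FunctionName, Qualifier, Payload):
        if not self.invoke_ok:
            raise PermissionError("AccessDeniedException (simulated)")
        return {"StatusCode": 200}

    def update_alias(self, FunctionName, Name, FunctionVersion):
        self.alias_version = FunctionVersion
        return {"FunctionVersion": FunctionVersion}


fake = FakeLambda(invoke_ok=False)
try:
    canary_or_rollback(fake, "alpha-engine-data-collector", "live", previous_version="1")
    outcome = "passed"
except CanaryFailed:
    outcome = "rolled back"
print(outcome, fake.alias_version)  # rolled back 1
```

The design point is that both failure paths surface: a failed canary raises after the alias is restored, and a failed rollback raises on its own, so neither can leave the alias stranded without CI noticing.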
🤖 Generated with Claude Code