Copy ssm_secrets.py into Phase 2 Lambda image (unblocks DataPhase2)#16

Merged
cipher813 merged 1 commit into main from fix/dockerfile-copy-ssm-secrets on Apr 11, 2026

@cipher813
Owner

Summary

Tonight's pipeline rerun passed DataPhase1, RAG, and Research cleanly (validating #13 macro-breadth fix + #15 Step Function HandleFailure fix end-to-end), but DataPhase2 failed with:

```
Runtime.ImportModuleError: No module named 'ssm_secrets'
```

Root cause

`lambda/handler.py:22` does `from ssm_secrets import load_secrets` at module top level, but the Dockerfile never copied `ssm_secrets.py` into the image. The pre-#14 image was built manually before this import existed, so it kept working. #14's GitHub Actions auto-deploy rebuilt the image from the current Dockerfile, published version 2, and flipped the `live` alias before the canary step failed on an unrelated IAM permission, so the alias is now stuck on the broken v2.

Verified:
```
aws lambda get-alias --function-name alpha-engine-data-collector --name live
FunctionVersion: 2
LastModified: 2026-04-11T00:51:38Z (mid-#14-deploy)
```

Fix

Add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile. Audited both `lambda/handler.py` and `collectors/alternative.py` import graphs via AST walk — the only first-party modules pulled in at handler import time are `collectors` (already copied) and `ssm_secrets` (now copied). `ssm_secrets.py` itself only imports stdlib (`logging`, `os`), so no transitive issues.

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's `paths` filter — both were missing, which means changes to `handler.py` or the secrets loader would silently not retrigger a Lambda rebuild.

Test plan

  • `pytest tests/ -q` — 61 passed
  • Gitleaks pre-commit pass
  • AST audit of handler + alternative imports confirms `ssm_secrets` is the only missing first-party dep
  • After merge + auto-deploy, verify `aws lambda invoke --function-name alpha-engine-data-collector:live --cli-binary-format raw-in-base64-out --payload '{"phase": 2, "dry_run": true}' out.json` returns OK
  • Redrive the failed Step Function execution `manual-recovery-20260411T010521Z` from DataPhase2

Known follow-up (NOT in scope)

The GitHub Actions OIDC role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` + (implicitly) `lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's why `infrastructure/deploy.sh`'s canary step has failed after every auto-deploy since #14 landed, and the post-canary rollback path also silently fails — stranding the alias on whatever version just got published. The Lambda still gets updated because `UpdateFunctionCode` succeeds first; the canary is a post-update safety check. Two fixes possible:

  1. Add `lambda:InvokeFunction` + `lambda:UpdateAlias` on the data-collector Lambda ARN to the OIDC role
  2. Short-circuit the canary step when running in CI (`[ -n "$CI" ] && skip`)

Either way, that's a separate infrastructure PR. Tonight's manual canary can run from the local AWS creds after this PR merges.

🤖 Generated with Claude Code

Tonight's pipeline rerun passed DataPhase1 + RAG + Research cleanly
(validating the #13 and #15 fixes) but DataPhase2 failed with:

    Runtime.ImportModuleError: No module named 'ssm_secrets'

Root cause: lambda/handler.py:22 does `from ssm_secrets import
load_secrets` at module top-level, but the Dockerfile never copied
ssm_secrets.py into the image. The old pre-#14 image was apparently
built manually at a time when this import didn't exist, so it ran
fine. #14's GitHub Actions auto-deploy rebuilt the image from the
current Dockerfile, published version 2, and flipped the `live` alias
before the canary step failed on an unrelated IAM permission (see
follow-up note below) — so the alias is now stuck on the broken v2.

Fix: add `COPY ssm_secrets.py ${LAMBDA_TASK_ROOT}/` to the Dockerfile.
Audited lambda/handler.py + collectors/alternative.py import graphs
via AST walk: the only first-party modules pulled in at handler
import time are `collectors` (already copied) and `ssm_secrets` (now
copied). ssm_secrets.py itself only imports stdlib (logging, os), so
it doesn't transitively pull anything else.

Also added `ssm_secrets.py` and `lambda/**` to the deploy workflow's
paths filter — both were missing, which means changes to handler.py
or the secrets loader would silently not retrigger a Lambda rebuild.

Known follow-up (NOT in scope for this PR): the GitHub Actions OIDC
role `github-actions-lambda-deploy` lacks `lambda:InvokeFunction` +
`lambda:UpdateAlias` on `alpha-engine-data-collector:live`. That's
why infrastructure/deploy.sh's canary step fails after every
successful build, and why the post-canary rollback also fails
silently — leaving the alias stranded on whatever version just got
published. Two options: (a) add those perms to the OIDC role, or
(b) short-circuit the canary step in CI. Separate infrastructure PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 32c6285 into main Apr 11, 2026
1 check passed
@cipher813 cipher813 deleted the fix/dockerfile-copy-ssm-secrets branch April 11, 2026 01:54
cipher813 added a commit that referenced this pull request Apr 11, 2026
Fixes the canary IAM gap that has caused every auto-deploy since #12
to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/...
    is not authorized to perform: lambda:InvokeFunction on resource:
    arn:aws:lambda:us-east-1:711398986525:function:
    alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when
the GitHub Actions auto-deploy workflow was introduced in #12. It
had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not
InvokeFunction — so infrastructure/deploy.sh's post-update canary
step (aws lambda invoke with dry_run=true) failed with
AccessDenied, the rollback then also failed silently because the
script's || true swallowed the error, and every deploy since has
been leaving the alias stranded on whatever version just got
published with no safety net. Three deploys in a row (#13, #14,
#16) all looked like failures despite the underlying Lambda being
updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled
   IAM policies. It's intentionally low-ceremony — flat JSON files,
   one per role, applied via a small idempotent shell script. No
   CloudFormation, no Terraform. For a 5-module infra-light project,
   a flat directory is the right amount of rigor. Migrate to CFN
   later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing
   deploy-role policy, granting lambda:InvokeFunction on all 5
   alpha-engine Lambdas and their aliases/versions. Scoped
   narrowly to the same functions the role already has
   UpdateFunctionCode on, so the blast radius is unchanged: an
   attacker with ECR push + UpdateFunctionCode can already run
   arbitrary code in these Lambdas.
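
For illustration, the shape of such a statement can be generated rather than hand-written. This is a sketch under assumptions: the helper name is hypothetical, the actual JSON under infrastructure/iam/ is not shown in this thread, and only the one function name that appears in the error above is used. The account ID and region are the ones in that error message.

```python
import json


def invoke_canary_statement(account_id: str, region: str, function_names: list[str]) -> dict:
    """Build an IAM statement allowing InvokeFunction on each function and its qualifiers."""
    resources = []
    for name in function_names:
        arn = f"arn:aws:lambda:{region}:{account_id}:function:{name}"
        resources.append(arn)          # the unqualified function
        resources.append(f"{arn}:*")   # any alias or version, e.g. ":live"
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


statement = invoke_canary_statement(
    "711398986525", "us-east-1", ["alpha-engine-data-collector"]
)
print(json.dumps(statement, indent=2))
```

Listing both the bare ARN and the `:*` form matters because the canary invokes the `:live` alias, which is a distinct resource from the unqualified function.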

Applied live via `infrastructure/iam/apply.sh
github-actions-lambda-deploy` before committing — so the next
deploy workflow run actually passes the canary step. Also cleaned
up the orphaned old `deploy-lambdas` inline policy (the new file's
name is `github-actions-lambda-deploy-policy`, matching the
convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is
diffable, reviewable, and recoverable. If a future PR drops a
permission, code review catches it at PR time instead of surfacing
as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic
still uses `|| true` to swallow errors silently, which is why the
stranded alias never got rolled back. That's a separate PR.
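
The control flow that follow-up is after can be sketched in Python. This is not deploy.sh and not the planned fix; the names are hypothetical and the two steps are passed in as callables so the logic can be exercised without AWS. The point is that a failed rollback is reported instead of swallowed.

```python
from typing import Callable


class DeployError(RuntimeError):
    """Raised when the canary fails, with the rollback outcome in the message."""


def canary_with_rollback(run_canary: Callable[[], None],
                         rollback: Callable[[], None]) -> None:
    try:
        run_canary()
    except Exception as canary_err:
        try:
            rollback()
        except Exception as rollback_err:
            # The case `|| true` hides: the alias stays on the new version.
            raise DeployError(
                f"canary failed ({canary_err}); rollback also failed "
                f"({rollback_err}); alias is stranded"
            ) from rollback_err
        raise DeployError(
            f"canary failed ({canary_err}); alias rolled back"
        ) from canary_err
```

Either way the deploy is reported as failed, but the message now says whether the alias was restored.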

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 3, 2026
…143)

Stage 3 of the RAG cleanup arc. Lib v0.3.0 (alpha-engine-lib PR #16,
merged + tagged) introduced alpha_engine_lib.rag with the
consolidated db/embeddings/retrieval/schema. Stage 2 (alpha-engine-
research PR #97) migrated research's side. This PR migrates data's.

Changes:

Pipeline imports updated (5 files):
- rag/pipelines/{ingest_theses,ingest_8k_filings,ingest_sec_filings,
  ingest_earnings_finnhub,filing_change_detection}.py — `from
  rag.{embeddings,retrieval,db}` → `from alpha_engine_lib.rag.{...}`

Deleted (now redundant — code lives in lib):
- rag/db.py (was the canonical drift source — its register_vector
  fix was the basis for the lib version)
- rag/embeddings.py
- rag/retrieval.py
- rag/schema.sql

Kept:
- rag/__init__.py — minimal namespace placeholder
- rag/pipelines/ — ingestion pipelines stay here (data's versions
  inline the signals lookup; canonical location decision deferred
  to a later cleanup arc pending production SF state verification)
- rag/preflight.py — already uses alpha_engine_lib.logging; no
  changes needed

Lib pin bumped:
- requirements.txt: alpha-engine-lib v0.2.4 → v0.3.0
- Extras: added [rag] alongside [arcticdb,flow_doctor]
- Note flagged on the direct pgvector/psycopg2-binary pins above the
  lib line — these become redundant once the [rag] extra soaks; will
  drop in a follow-up PR rather than ripping them out today

Verified:
- All 433 data tests pass
- alpha_engine_lib.rag imports resolve (retrieve, is_available,
  embed_texts)
- rag.preflight imports cleanly
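
A check like the import-resolution item above can be scripted. This is a minimal sketch with a hypothetical helper name; it assumes the three names are attributes of the `alpha_engine_lib.rag` package itself, which this message does not confirm. The runnable part uses a stdlib module so it works without the lib installed.

```python
import importlib


def missing_names(module_name: str, names: list[str]) -> list[str]:
    """Import the module (raises ImportError if absent) and list unresolved names."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Stand-in check against the stdlib; "no_such_name" is deliberately bogus.
print(missing_names("json", ["loads", "dumps", "no_such_name"]))
# ['no_such_name']

# With the lib installed, the equivalent check would be (assumed layout):
# missing_names("alpha_engine_lib.rag", ["retrieve", "is_available", "embed_texts"])
```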

Companion:
- alpha-engine-lib PR #16 (merged + tagged v0.3.0)
- alpha-engine-research PR #97 (Stage 2 — research migration)
- Stage 4 (deferred): pipelines canonical-location decision

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>