Add GitHub Actions auto-deploy on push to main #12
Merged
Automates the Phase 2 Lambda deploy (alpha-engine-data-collector) via GitHub Actions OIDC. Every merge to main that touches the collectors code, the polygon client, the Dockerfile, or the requirements files triggers a build, an ECR push, and a Lambda update.

Note: DataPhase1 runs via SSM on the micro EC2 instance (not as a Lambda), so this workflow does not help with Phase 1 drift. Phase 1 still requires a manual `git pull` on the micro instance after any change to `collectors/prices.py` or `collectors/constituents.py`. Separate follow-up: add a scheduled pull job or an EC2 user-data script to automate that too.

Mechanism: same as the alpha-engine-research and alpha-engine-predictor workflows. Uses the pre-existing `github-actions-lambda-deploy` IAM role, which is scoped to this repo plus research and predictor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
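To reason about which pushes fire the workflow, here is a rough Python approximation of GitHub Actions path-filter matching for the patterns listed in the PR body (`collectors/**`, `polygon_client.py`, `Dockerfile`, `requirements*`). It is a sketch, not GitHub's matcher: it only handles `*` (any run of characters except `/`) and `**` (any run, including `/`), and ignores the rest of the filter syntax (`?`, `+`, `[]`, `!` negation). `DEPLOY_PATHS`, `glob_to_regex`, and `should_deploy` are hypothetical names.

```python
import re

# Hypothetical mirror of the deploy.yml `paths:` filter described in this PR.
DEPLOY_PATHS = ("collectors/**", "polygon_client.py", "Dockerfile", "requirements*")


def glob_to_regex(pattern: str) -> "re.Pattern":
    """Approximate a GitHub path-filter glob as an anchored regex.

    `**` -> any characters including `/`; `*` -> any characters except `/`.
    Every other character is matched literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    # re.match anchors at the start; \Z anchors at the end of the path.
    return re.compile("".join(parts) + r"\Z")


def should_deploy(changed_files, patterns=DEPLOY_PATHS) -> bool:
    """True if any changed path matches any filter pattern."""
    compiled = [glob_to_regex(p) for p in patterns]
    return any(rx.match(path) for path in changed_files for rx in compiled)
```

For example, a change to a hypothetical `store/loader.py` does not trigger a deploy with this list, while `should_deploy(["store/loader.py"], DEPLOY_PATHS + ("store/**",))` does, which is the gap the `store/**` addition in the follow-up commit closes.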
cipher813 added a commit that referenced this pull request on Apr 11, 2026
The Phase 2 Lambda deploy workflow has been failing on every push to main since #12 merged yesterday:

`ERROR: failed to calculate checksum of ref ... "/config.yaml": not found`

Root cause: the Dockerfile copies config.yaml into the image, but config.yaml is gitignored per the repo's security policy (it contains bucket names and prefixes we keep out of the public repo), so there is nothing to copy during the GitHub Actions checkout. The Lambda handler (`lambda/handler.py`) already falls back to a hardcoded default when config.yaml is absent:

`config = { "bucket": "alpha-engine-research", "market_data": {"s3_prefix": "market_data/"}, }`

So the COPY was dead weight and a build breakage. Dropped the line and left an inline comment explaining why it must not come back.

Also copied `store/` into the image. It was added in #13 as a shared home for the S3 parquet loader, and `collectors/macro` imports it at module top level. The Lambda handler doesn't import macro directly (Phase 2 only calls `alternative.collect`), but `weekly_collector.py` is in the image and does `from collectors import macro` at the top, so without this change any future handler refactor that touches weekly_collector would fail with `ModuleNotFoundError: store`.

Added `'store/**'` to the deploy.yml paths filter so that a future change to the shared loader actually retriggers the deploy workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
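A minimal sketch of the fallback that makes dropping the COPY safe, assuming the handler simply checks whether the file exists. `load_config` is a hypothetical name (the actual `lambda/handler.py` code isn't shown in this PR); the default dict is the one quoted above, and parsing a file that is present assumes PyYAML is installed.

```python
import copy
import os

# Default quoted in the commit message; used when config.yaml is not in the image.
DEFAULT_CONFIG = {
    "bucket": "alpha-engine-research",
    "market_data": {"s3_prefix": "market_data/"},
}


def load_config(path: str = "config.yaml") -> dict:
    """Hypothetical loader: parse `path` if it exists, else return the default."""
    if not os.path.exists(path):
        # config.yaml is gitignored, so CI-built images never contain it.
        return copy.deepcopy(DEFAULT_CONFIG)
    import yaml  # PyYAML: assumed available only where a config file is shipped
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```

Returning a deep copy keeps one invocation from mutating the shared default, which matters because module-level state survives across warm Lambda invocations.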
cipher813 added a commit that referenced this pull request on Apr 11, 2026
Fixes the canary IAM gap that has caused every auto-deploy since #12 to report red even when the Lambda update itself succeeded:

`AccessDeniedException: User: .../github-actions-lambda-deploy/... is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-east-1:711398986525:function:alpha-engine-data-collector:live`

The `github-actions-lambda-deploy` OIDC role was created ad hoc when the GitHub Actions auto-deploy workflow was introduced in #12. It had ECR push plus Lambda UpdateFunctionCode/UpdateAlias, but not InvokeFunction. As a result, the post-update canary step in `infrastructure/deploy.sh` (`aws lambda invoke` with `dry_run=true`) failed with AccessDenied, the rollback then also failed silently because the script's `|| true` swallowed the error, and every deploy since then has left the alias stranded on whichever version was just published, with no safety net. Three deploys in a row (#13, #14, #16) all looked like failures even though the underlying Lambda had been updated.

This PR does two things:

1. Adds `infrastructure/iam/` as the new home for version-controlled IAM policies. It's intentionally low-ceremony: flat JSON files, one per role, applied via a small idempotent shell script. No CloudFormation, no Terraform. For a 5-module infra-light project, a flat directory is the right amount of rigor; migrate to CFN later if the blast radius grows.
2. Adds a new `LambdaInvokeCanary` statement to the existing deploy-role policy, granting `lambda:InvokeFunction` on all 5 alpha-engine Lambdas and their aliases/versions. It is scoped to the same functions the role already has UpdateFunctionCode on, so the blast radius is unchanged: an attacker with ECR push + UpdateFunctionCode can already run arbitrary code in these Lambdas.

Applied live via `infrastructure/iam/apply.sh github-actions-lambda-deploy` before committing, so the next deploy workflow run actually passes the canary step.
Also cleaned up the orphaned `deploy-lambdas` inline policy (the new file's name is `github-actions-lambda-deploy-policy`, matching the filename == role name convention).

Why this matters beyond tonight: every IAM change from here on is diffable, reviewable, and recoverable. If a future PR drops a permission, code review catches it at PR time instead of it surfacing as a mysterious AccessDenied in production.

Follow-up: the rollback-on-canary-failure logic in deploy.sh still uses `|| true` to swallow errors silently, which is why the stranded alias never got rolled back. That's a separate PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
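Both gaps described in this commit can be sketched in Python: building the `LambdaInvokeCanary` statement, and a canary/rollback step that surfaces a failed rollback instead of swallowing it the way `|| true` does. This is not the contents of `infrastructure/iam/` or `deploy.sh`; all function names are hypothetical, and because only `alpha-engine-data-collector` is named in this PR, the other four function names are left to the caller. `point_alias` and `run_canary` are assumed stand-ins for `aws lambda update-alias` and the `dry_run=true` invoke.

```python
import json


def lambda_invoke_canary_statement(region, account_id, function_names):
    """Build an IAM statement allowing lambda:InvokeFunction on the given functions.

    Each function gets its unqualified ARN plus a `:*` qualified ARN, so invoking
    an alias such as `:live` or a numbered version is allowed as well.
    """
    resources = []
    for name in function_names:
        arn = f"arn:aws:lambda:{region}:{account_id}:function:{name}"
        resources.extend([arn, f"{arn}:*"])
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


class RollbackFailed(RuntimeError):
    """The canary failed AND re-pointing the alias at the previous version failed."""


def deploy_with_canary(point_alias, run_canary, new_version, previous_version):
    """Point the alias at new_version, run the canary, and roll back on failure.

    Returns "deployed" or "rolled_back". If the rollback itself fails, raise
    RollbackFailed rather than continuing as if nothing happened.
    """
    point_alias(new_version)
    try:
        run_canary()
    except Exception as canary_err:
        try:
            point_alias(previous_version)
        except Exception as rollback_err:
            raise RollbackFailed(
                f"canary failed ({canary_err!r}); alias left on {new_version}"
            ) from rollback_err
        return "rolled_back"
    return "deployed"


if __name__ == "__main__":
    stmt = lambda_invoke_canary_statement(
        "us-east-1", "711398986525", ["alpha-engine-data-collector"]
    )
    print(json.dumps(stmt, indent=2))
```

In a workflow, a `"rolled_back"` result should still make the step exit non-zero, so the run goes red for the right reason while the alias is back on a known-good version.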
cipher813 added a commit that referenced this pull request on May 10, 2026
…e seed + backfill (#206)

* feat(signal_returns): write calibrator-v1 context on score_performance seed + backfill

Root-cause closure for the 2026-05-09 Saturday SF evaluator P0 (weight_optimizer ERROR: "None of [Index(['quant_score','qual_score'])] are in the [columns]"; auto-rollback Sharpe -42.2% vs baseline).

A producer audit revealed that two parallel writers had silently diverged after research migration #12 (2026-05-08):

* `scoring/performance_tracker.py::record_new_buy_scores` writes ALL 5 canonical context columns, but has zero production callers.
* `collectors/signal_returns.py::_seed_score_performance` is the actual production writer (runs weekly in DataPhase1) and only wrote (symbol, score_date, score, price_on_date). The 5 canonical columns (quant_score, qual_score, conviction, sector_modifier, market_regime) were never populated.

Single-fact-single-writer rebuild:

* `_seed_score_performance` now extracts the 5 context fields from the same signals.json payload that drives the BUY filter: a single source-of-truth fetch per signals.json, with no second round-trip.
* New `_backfill_score_context` repairs legacy rows whose canonical columns are NULL. It uses UPDATE-WHERE-NULL, so re-runs are no-ops once every row has a source.
* `_ensure_score_performance_schema` defensively mirrors research migration #12 in case DataPhase1 ever fires against a fresh research.db before research's cold-start migrations run.

Composes with backtester #176 (PR-day consumer-side coalesce fix). With this PR the producer becomes authoritative; the next backtester PR can retire the S3 round-trip in `weight_optimizer.load_with_subscores`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(signal_returns): drift gate for canonical context coverage (CW gauge)

Locks in the producer-side contract established in the previous commit: after seed + backfill complete, query score_performance for rows with score_date >= 2026-05-17 (the first Saturday SF after this PR merges) and emit the coverage percentage as a CloudWatch gauge:

`AlphaEngine/Data/score_performance_canonical_coverage_pct`

Coverage = the fraction of post-cutover rows with ALL 5 canonical context columns populated (quant_score, qual_score, conviction, sector_modifier, market_regime). 100% is the contract; the gauge is always emitted (including 100.0) so alarm baselines stay continuous.

Mirrors the chronic-gap drift detection pattern at `weekly_collector.py:_check_chronic_gap_polygon_recovery`: same best-effort emit, same observability-not-load-bearing posture. A follow-up alpha-engine-lib transparency_inventory entry can wire this into the substrate health alarm if desired; the metric itself is the drift signal.

A tripwire test asserts that `_CANONICAL_CONTEXT_COLUMNS` stays in lockstep with the seed INSERT; adding a 6th column to one without the other would make the drift gate blind to that field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Automates the Phase 2 Lambda deploy via GitHub Actions OIDC. Every merge to main that touches `collectors/**`, `polygon_client.py`, `Dockerfile`, or `requirements*` triggers a build + ECR push + Lambda update for `alpha-engine-data-collector`.

Note: DataPhase1 runs as EC2 SSM on the micro instance (not Lambda), so this workflow does not help with Phase 1 drift. Phase 1 still requires a manual `git pull` on the micro instance after changes to Phase 1-only code. Follow-up task TBD.

Uses the same `github-actions-lambda-deploy` IAM role as alpha-engine-research PR #9 and alpha-engine-predictor PR #11.

Fallback: the existing `bash infrastructure/deploy.sh` still works manually.