Add GitHub Actions auto-deploy on push to main #12

Merged
cipher813 merged 1 commit into main from feat/github-actions-deploy
Apr 10, 2026

Conversation

@cipher813
Owner

Automates the Phase 2 Lambda deploy via GitHub Actions OIDC. Every merge to main that touches collectors/**, polygon_client.py, Dockerfile, or requirements* triggers a build + ECR push + Lambda update for alpha-engine-data-collector.

Note: DataPhase1 runs as EC2 SSM on the micro instance (not Lambda), so this workflow does not help with Phase 1 drift. Phase 1 still requires a manual git pull on the micro after changes to Phase1-only code. Follow-up task TBD.

Uses the same github-actions-lambda-deploy IAM role as alpha-engine-research PR #9 and alpha-engine-predictor PR #11.

Fallback: existing bash infrastructure/deploy.sh still works manually.

Automates the Phase 2 Lambda deploy (alpha-engine-data-collector)
via GitHub Actions OIDC. Every merge to main that touches the
collectors code, polygon client, Dockerfile, or requirements
triggers a build + ECR push + Lambda update.

Note: DataPhase1 runs as EC2 SSM (not Lambda) on the micro
instance so this workflow does not help with Phase 1 drift.
Phase 1 still requires a manual git pull on the micro instance
after any collectors/prices.py or collectors/constituents.py
change. Separate follow-up: add a scheduled pull job or EC2
user-data script to automate that too.

Mechanism: same as alpha-engine-research + alpha-engine-predictor
workflows. Uses the pre-existing github-actions-lambda-deploy IAM
role scoped to this repo + research + predictor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 941a2d5 into main Apr 10, 2026
1 check passed
@cipher813 cipher813 deleted the feat/github-actions-deploy branch April 10, 2026 22:01
cipher813 added a commit that referenced this pull request Apr 11, 2026
The Phase 2 Lambda deploy workflow has been failing on every push to
main since #12 merged yesterday:

    ERROR: failed to calculate checksum of ref ...
    "/config.yaml": not found

Root cause: the Dockerfile copies config.yaml into the image, but
config.yaml is gitignored per the repo's security policy (it contains
bucket names / prefixes we keep out of the public repo), so there is
literally nothing to copy during the GitHub Actions checkout.

The Lambda handler (lambda/handler.py) already falls back to a
hardcoded default when config.yaml is absent:

    config = {
        "bucket": "alpha-engine-research",
        "market_data": {"s3_prefix": "market_data/"},
    }

So the COPY was dead weight AND a build breakage. Dropped the line
and left an inline comment explaining why it must not come back.
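
A minimal sketch of the fallback behaviour described above, for readers who
have not opened lambda/handler.py. The default dict is the one quoted in this
message; the function name `load_config`, the use of PyYAML, and the file
lookup are assumptions for illustration, not the handler's actual code.

```python
import copy
from pathlib import Path

# Defaults quoted from the commit message above.
DEFAULT_CONFIG = {
    "bucket": "alpha-engine-research",
    "market_data": {"s3_prefix": "market_data/"},
}


def load_config(path="config.yaml"):
    """Return the parsed config file if present, else a copy of the defaults.

    Hypothetical helper: the real handler may structure this differently.
    """
    config_file = Path(path)
    if not config_file.exists():
        # No config.yaml in the image (it is gitignored), so use defaults.
        return copy.deepcopy(DEFAULT_CONFIG)
    import yaml  # PyYAML, assumed; only needed when the file is present

    with config_file.open() as fh:
        return yaml.safe_load(fh)


config = load_config("/nonexistent/config.yaml")
print(config["bucket"])
```

Because the defaults are returned whenever the file is missing, an image built
without config.yaml still starts with a usable configuration.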

Also copied store/ into the image — it was added in #13 as a shared
home for the S3 parquet loader and collectors/macro imports it at
module top level. The Lambda handler doesn't import macro directly
(Phase 2 only calls alternative.collect), but weekly_collector.py is
in the image and does `from collectors import macro` at the top, so
any future handler refactor that touches weekly_collector would bomb
on `ModuleNotFoundError: store` without this.

Added 'store/**' to the deploy.yml paths filter so a future change
to the shared loader actually retriggers the deploy workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 11, 2026
Fixes the canary IAM gap that has caused every auto-deploy since #12
to report red even when the Lambda update itself succeeded:

    AccessDeniedException: User: .../github-actions-lambda-deploy/...
    is not authorized to perform: lambda:InvokeFunction on resource:
    arn:aws:lambda:us-east-1:711398986525:function:
    alpha-engine-data-collector:live

The github-actions-lambda-deploy OIDC role was created ad-hoc when
the GitHub Actions auto-deploy workflow was introduced in #12. It
had ECR push + Lambda UpdateFunctionCode/UpdateAlias, but not
InvokeFunction. As a result, infrastructure/deploy.sh's post-update
canary step (aws lambda invoke with dry_run=true) failed with
AccessDenied. The rollback then also failed silently because the
script's || true swallowed the error, so every deploy since has
left the alias stranded on whatever version was just published,
with no safety net. Three deploys in a row (#13, #14, #16) all
looked like failures despite the underlying Lambda being updated.

This PR does two things:

1. Adds infrastructure/iam/ as the new home for version-controlled
   IAM policies. It's intentionally low-ceremony — flat JSON files,
   one per role, applied via a small idempotent shell script. No
   CloudFormation, no Terraform. For a 5-module infra-light project,
   a flat directory is the right amount of rigor. Migrate to CFN
   later if the blast radius grows.

2. Adds a new LambdaInvokeCanary statement to the existing
   deploy-role policy, granting lambda:InvokeFunction on all 5
   alpha-engine Lambdas and their aliases/versions. Scoped
   narrowly to the same functions the role already has
   UpdateFunctionCode on, so the blast radius is unchanged: an
   attacker with ECR push + UpdateFunctionCode can already run
   arbitrary code in these Lambdas.
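
To make the shape of the new statement concrete, here is a hedged sketch that
builds one for a list of function names. The Sid, action, account ID, region,
and the alpha-engine-data-collector name come from this message; the helper
name `invoke_canary_statement` is hypothetical, and the other four function
names are not listed here, so only one is shown. The actual JSON file under
infrastructure/iam/ may be laid out differently.

```python
import json


def invoke_canary_statement(account_id, region, function_names):
    """Build an IAM statement allowing InvokeFunction on each function
    and on any of its aliases or versions (hypothetical helper)."""
    resources = []
    for name in function_names:
        base = f"arn:aws:lambda:{region}:{account_id}:function:{name}"
        resources.append(base)         # unqualified function ARN
        resources.append(f"{base}:*")  # any alias or version qualifier
    return {
        "Sid": "LambdaInvokeCanary",
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": resources,
    }


statement = invoke_canary_statement(
    account_id="711398986525",
    region="us-east-1",
    function_names=["alpha-engine-data-collector"],
)
print(json.dumps(statement, indent=2))
```

Listing both the bare ARN and the `:*` form covers invocations that target
the function directly as well as a qualifier such as `:live`.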

Applied live via `infrastructure/iam/apply.sh
github-actions-lambda-deploy` before committing — so the next
deploy workflow run actually passes the canary step. Also cleaned
up the orphaned old `deploy-lambdas` inline policy (the new file's
name is `github-actions-lambda-deploy-policy`, matching the
convention of filename == role name).

Why this matters beyond tonight: every IAM change from here on is
diffable, reviewable, and recoverable. If a future PR drops a
permission, code review catches it at PR time instead of surfacing
as a mysterious AccessDenied in production.

Follow-up: the deploy.sh script's rollback-on-canary-failure logic
still uses `|| true` to swallow errors silently, which is why the
stranded alias never got rolled back. That's a separate PR.
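
For that follow-up, one possible shape of a rollback that does not swallow
errors, sketched in Python rather than bash. Everything here is illustrative:
`deploy_with_canary`, `DeployError`, and the injected callables are
hypothetical stand-ins for the publish, alias-update, and canary steps in
deploy.sh, not its real structure.

```python
class DeployError(RuntimeError):
    """Raised when the canary fails, whether or not rollback succeeded."""


def deploy_with_canary(publish, point_alias, run_canary, previous_version):
    """Publish, repoint the alias, run the canary, roll back on failure.

    Unlike `|| true`, a failed rollback is reported, not ignored.
    """
    new_version = publish()
    point_alias(new_version)
    try:
        run_canary()
    except Exception as canary_error:
        try:
            point_alias(previous_version)
        except Exception as rollback_error:
            # Both steps failed: say so loudly, the alias is stranded.
            raise DeployError(
                f"canary failed ({canary_error}); rollback to "
                f"{previous_version} also failed ({rollback_error})"
            ) from rollback_error
        raise DeployError(
            f"canary failed ({canary_error}); "
            f"rolled back to {previous_version}"
        ) from canary_error
    return new_version
```

The deploy still exits non-zero when the canary fails, but the message now
distinguishes "rolled back cleanly" from "alias left on the new version".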

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 10, 2026
…e seed + backfill (#206)

* feat(signal_returns): write calibrator-v1 context on score_performance seed + backfill

Root-cause closure for the 2026-05-09 Saturday SF evaluator P0
(weight_optimizer ERROR: "None of [Index(['quant_score','qual_score'])]
are in the [columns]"; auto-rollback Sharpe -42.2% vs baseline).

Producer audit revealed two parallel writers diverged silently after
research migration #12 (2026-05-08):
  * scoring/performance_tracker.py::record_new_buy_scores writes ALL 5
    canonical context columns — but has zero production callers.
  * collectors/signal_returns.py::_seed_score_performance is the actual
    production writer (runs weekly in DataPhase1) and only wrote
    (symbol, score_date, score, price_on_date). The 5 canonical
    columns (quant_score, qual_score, conviction, sector_modifier,
    market_regime) were never populated.

Single-fact-single-writer rebuild:
  * _seed_score_performance now extracts the 5 context fields from the
    same signals.json payload that drives the BUY filter — single
    source-of-truth fetch per signals.json, no second round-trip.
  * New _backfill_score_context repairs legacy rows whose canonical
    columns are NULL. UPDATE-WHERE-NULL so re-runs are no-ops once
    every row has a source.
  * _ensure_score_performance_schema mirrors research migration #12
    defensively in case DataPhase1 ever fires against a fresh
    research.db before research's cold-start migrations run.
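
The UPDATE-WHERE-NULL idea can be sketched as follows, assuming research.db
is SQLite and that rows are keyed by (symbol, score_date). The helper name
`backfill_context` and the exact SQL are assumptions; the real
_backfill_score_context may differ.

```python
import sqlite3

CONTEXT_COLUMNS = (
    "quant_score", "qual_score", "conviction",
    "sector_modifier", "market_regime",
)


def backfill_context(conn, symbol, score_date, context):
    """Fill only the NULL context columns of one row.

    Returns the number of rows updated: 0 once nothing is NULL,
    which is what makes re-runs no-ops. Hypothetical helper.
    """
    # COALESCE keeps any value that is already populated.
    assignments = ", ".join(
        f"{col} = COALESCE({col}, :{col})" for col in CONTEXT_COLUMNS
    )
    any_null = " OR ".join(f"{col} IS NULL" for col in CONTEXT_COLUMNS)
    params = {col: context.get(col) for col in CONTEXT_COLUMNS}
    params.update(symbol=symbol, score_date=score_date)
    cur = conn.execute(
        f"UPDATE score_performance SET {assignments} "
        f"WHERE symbol = :symbol AND score_date = :score_date "
        f"AND ({any_null})",
        params,
    )
    return cur.rowcount
```

Calling it a second time with the same payload matches no rows, because the
WHERE clause only selects rows that still have a NULL context column.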

Composes with backtester #176 (PR-day consumer-side coalesce fix). With
this PR the producer becomes authoritative; the next backtester PR can
retire the S3 round-trip in weight_optimizer.load_with_subscores.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(signal_returns): drift gate — canonical context coverage CW gauge

Locks the producer-side contract established in the previous commit:
after seed + backfill complete, query score_performance for rows with
score_date >= 2026-05-17 (first Sat SF after this PR merges) and emit
the coverage percentage as a CloudWatch gauge:

  AlphaEngine/Data/score_performance_canonical_coverage_pct

Coverage = fraction of post-cutover rows with ALL 5 canonical context
columns populated (quant_score, qual_score, conviction,
sector_modifier, market_regime). 100% is the contract; the gauge is
always emitted (including 100.0) so alarm baselines stay continuous.
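
As a rough illustration of the coverage definition above, the percentage can
be computed with one query. This sketch assumes a SQLite research.db and ISO
date strings in score_date; `canonical_coverage_pct` is a hypothetical name,
the CloudWatch emit is omitted, and treating "no post-cutover rows" as 100.0
is an assumption, not something this commit states.

```python
import sqlite3

CANONICAL_COLUMNS = (
    "quant_score", "qual_score", "conviction",
    "sector_modifier", "market_regime",
)


def canonical_coverage_pct(conn, cutover_date="2026-05-17"):
    """Percent of post-cutover rows with all canonical columns set."""
    all_present = " AND ".join(
        f"{col} IS NOT NULL" for col in CANONICAL_COLUMNS
    )
    total, covered = conn.execute(
        "SELECT COUNT(*), "
        f"COALESCE(SUM(CASE WHEN {all_present} THEN 1 ELSE 0 END), 0) "
        "FROM score_performance WHERE score_date >= ?",
        (cutover_date,),
    ).fetchone()
    if total == 0:
        # Assumption: with nothing to cover, report full coverage.
        return 100.0
    return 100.0 * covered / total
```

A row counts as covered only if every one of the five columns is non-NULL, so
a single missing field pulls the gauge below 100.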

Mirrors the chronic-gap drift detection pattern at
weekly_collector.py:_check_chronic_gap_polygon_recovery — same
best-effort emit, same observability-not-load-bearing posture. A
follow-up alpha-engine-lib transparency_inventory entry can wire this
into the substrate health alarm if desired; the metric itself is the
drift signal.

Tripwire test asserts _CANONICAL_CONTEXT_COLUMNS stays in lockstep
with the seed INSERT — adding a 6th column to one without the other
would make the drift gate blind to that field.
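
One way such a tripwire could look, as a sketch only. The column names are the
ones listed in this PR; `SEED_INSERT_SQL`, `_BASE_COLUMNS`, and
`insert_columns` are hypothetical stand-ins, since the real seed statement and
test are not shown here.

```python
import re

_CANONICAL_CONTEXT_COLUMNS = (
    "quant_score", "qual_score", "conviction",
    "sector_modifier", "market_regime",
)

# Hypothetical stand-in for the seed statement in signal_returns.py.
SEED_INSERT_SQL = """
INSERT INTO score_performance
    (symbol, score_date, score, price_on_date,
     quant_score, qual_score, conviction, sector_modifier, market_regime)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
"""

# Columns the seed wrote before this PR; not part of the context set.
_BASE_COLUMNS = ("symbol", "score_date", "score", "price_on_date")


def insert_columns(sql):
    """Extract the column list from an INSERT INTO ... (...) statement."""
    match = re.search(
        r"INSERT\s+INTO\s+\w+\s*\(([^)]*)\)", sql, re.IGNORECASE
    )
    if match is None:
        raise ValueError("no column list found in INSERT statement")
    return tuple(col.strip() for col in match.group(1).split(","))


def test_context_columns_match_seed_insert():
    context_in_insert = [
        col for col in insert_columns(SEED_INSERT_SQL)
        if col not in _BASE_COLUMNS
    ]
    # Fails if a column is added to one side but not the other.
    assert sorted(context_in_insert) == sorted(_CANONICAL_CONTEXT_COLUMNS)
```

Comparing sorted lists keeps the check independent of column order while
still failing on any addition or removal on either side.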

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>