EOD Step Function definition #5

Merged
cipher813 merged 1 commit into main from feat/eod-step-function
Apr 7, 2026
@cipher813
Owner

Summary

  • New Step Function: alpha-engine-eod-pipeline
  • Triggered by daemon shutdown (not cron/timer)
  • PostMarketData (micro EC2) → EODReconcile (trading EC2) → StopInstance
  • PostMarketData failure is non-blocking (EOD falls back to IB Gateway prices)
  • Instance always stops, even on failure (ForceStopInstance fallback)
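
The flow in the summary can be sketched as an Amazon States Language definition built in Python. This is a minimal illustration, not the contents of infrastructure/step_function_eod.json: the state names PostMarketData, EODReconcile, StopInstance and ForceStopInstance come from the summary, while the PipelineFailed state, the SDK-integration resource ARNs and the omission of Parameters are assumptions made for brevity.

```python
import json

# Assumed AWS SDK service integrations; the real definition may use others.
RUN_COMMAND = "arn:aws:states:::aws-sdk:ssm:sendCommand"
STOP_INSTANCES = "arn:aws:states:::aws-sdk:ec2:stopInstances"


def build_eod_definition() -> dict:
    """Return an illustrative ASL definition mirroring the summary bullets."""
    catch_all = ["States.ALL"]
    return {
        "Comment": "alpha-engine-eod-pipeline (illustrative sketch)",
        "StartAt": "PostMarketData",
        "States": {
            "PostMarketData": {
                "Type": "Task",
                "Resource": RUN_COMMAND,
                # Non-blocking: a failure still advances to EODReconcile.
                "Catch": [{"ErrorEquals": catch_all, "Next": "EODReconcile"}],
                "Next": "EODReconcile",
            },
            "EODReconcile": {
                "Type": "Task",
                "Resource": RUN_COMMAND,
                # On failure, skip the normal stop and force the instance down.
                "Catch": [{"ErrorEquals": catch_all, "Next": "ForceStopInstance"}],
                "Next": "StopInstance",
            },
            "StopInstance": {"Type": "Task", "Resource": STOP_INSTANCES, "End": True},
            "ForceStopInstance": {
                "Type": "Task",
                "Resource": STOP_INSTANCES,
                "Next": "PipelineFailed",
            },
            # Hypothetical terminal state so the execution is still marked failed.
            "PipelineFailed": {"Type": "Fail", "Error": "EODPipelineFailed"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_eod_definition(), indent=2))
```

Every path through this graph ends in a state that stops the instance, which is the property the last bullet describes.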

Deployment

After merge, create the Step Function in AWS:
aws stepfunctions create-state-machine --name alpha-engine-eod-pipeline --definition file://infrastructure/step_function_eod.json --role-arn arn:aws:iam::711398986525:role/alpha-engine-step-function-role

Test plan

  • Step Function created in AWS console
  • Daemon triggers it on shutdown
  • Full pipeline completes: PostMarketData → EOD → StopInstance
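
One way the daemon could start the pipeline on shutdown is sketched below. The helper name, the execution-name format and the input payload are hypothetical; the only real API assumed is the Step Functions client's `start_execution(stateMachineArn=..., name=..., input=...)` call, and the client is passed in so the sketch does not depend on how the daemon builds it.

```python
import json
from datetime import datetime, timezone


def trigger_eod_pipeline(sfn_client, state_machine_arn, now=None):
    """Start one EOD execution and return its ARN.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    """
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: one timestamped execution per shutdown.
    execution_name = f"eod-{now:%Y%m%dT%H%M%S}"
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=execution_name,
        # Hypothetical payload; the real definition may expect other fields.
        input=json.dumps({"trigger": "daemon_shutdown"}),
    )
    return response["executionArn"]
```

Calling this from the daemon's shutdown hook, rather than from a cron or timer, keeps the trigger tied to the event the summary names.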

@cipher813 cipher813 merged commit 46aa5ca into main Apr 7, 2026
1 check passed
@cipher813 cipher813 deleted the feat/eod-step-function branch April 7, 2026 22:28
cipher813 added a commit that referenced this pull request May 1, 2026
* feat(ci): wire deploy.yml + deploy-infrastructure.yml into system-wide changelog

Adds a final step to both deploy workflows that calls the
append-changelog composite action in alpha-engine-docs. Each
successful (or failed) deploy now emits one JSON to
s3://alpha-engine-research/changelog/.

Two distinct entries per merge that touches both surfaces:
- deploy.yml          → Phase 2 Lambda image rebuild + alias bump
- deploy-infrastructure.yml → SF + CF stamp re-deploy

Distinguished by the deploy_workflow field on each entry, so the
materialized CHANGELOG.md can show both as separate items under the
same SHA.

Uses if: always() + ternary on job.status so failed deploys also
register in the log — the failure signal is itself a useful
provenance record.

Companion: alpha-engine-docs PR #3 (composite action + aggregator),
alpha-engine-data PR #120 (IAM grant — already merged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(orchestration): SNS→S3 changelog incident mirror Lambda

Adds a small Lambda subscribed to the alpha-engine-alerts SNS topic
that mirrors every alert as one JSON entry under
s3://alpha-engine-research/changelog/incidents/. Closes the event-
mining loop alongside the deploy-side log: now both "what shipped"
and "what failed" feed the same time-ordered changelog.

Why
The 2026-05-01 weekday SF timeout cascade is the canonical example.
The deploy log records the 4 PRs that fixed it, but it never
captured the original SNS alert email at 06:01 PT — the failure
event itself. With this Lambda, that alert would have landed at
changelog/incidents/2026/05/01T13-01-XX_alpha-engine-alerts_*.json
with full subject + body, queryable months later for retro mining
("show me every SF failure incident this quarter").

Resources added (4)
- ChangelogIncidentMirrorRole       — minimal: PutObject scoped to
  changelog/incidents/* + AWSLambdaBasicExecutionRole for logs.
- ChangelogIncidentMirrorFunction   — python3.12, arm64, 256 MB,
  30s timeout. Inline ZipFile (~50 lines). Reads SNS Records,
  builds a JSON entry, S3 PutObject. No-ops cleanly on malformed
  timestamps (falls back to "now").
- ChangelogIncidentMirrorSubscription — SNS subscription on
  AlertsTopic with Protocol: lambda.
- ChangelogIncidentMirrorPermission — Lambda::Permission letting
  SNS invoke the function.

Schema (matches the deploy-side action's event_type discriminator)
{
  "ts_utc": ...,
  "event_type": "incident",
  "source": "alpha-engine-alerts",
  "subject": "...",
  "summary": "...",        // first 240 chars of subject or message line 1
  "details": "...",        // full message body
  "sns_message_id": "...",
  "topic_arn": "..."
}
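
A minimal sketch of how one SNS record could be turned into an entry of this shape. It is not the inline ZipFile handler from the template. The function names are hypothetical, the timestamp format is the one SNS normally delivers to Lambda, and the message-ID suffix in the object key is an assumption (the source shows only `*` there). The S3 PutObject call is left out so the sketch stays dependency-free.

```python
from datetime import datetime, timezone

SNS_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_sns_timestamp(raw):
    """Parse an SNS timestamp, falling back to the current UTC time."""
    try:
        return datetime.strptime(raw, SNS_TS_FORMAT).replace(tzinfo=timezone.utc)
    except (TypeError, ValueError):
        # Malformed or missing timestamp: no-op cleanly and use "now".
        return datetime.now(timezone.utc)


def build_incident(record):
    """Return (s3_key, entry) for one element of event["Records"]."""
    sns = record.get("Sns", {})
    ts = parse_sns_timestamp(sns.get("Timestamp"))
    subject = sns.get("Subject") or ""
    message = sns.get("Message") or ""
    first_line = message.splitlines()[0] if message else ""
    topic_arn = sns.get("TopicArn", "")
    # The topic name is the last colon-separated field of the ARN.
    source = topic_arn.rsplit(":", 1)[-1] or "unknown"
    message_id = sns.get("MessageId", "")
    entry = {
        "ts_utc": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event_type": "incident",
        "source": source,
        "subject": subject,
        "summary": (subject or first_line)[:240],
        "details": message,
        "sns_message_id": message_id,
        "topic_arn": topic_arn,
    }
    # Key suffix after the source name is an assumption.
    key = (
        f"changelog/incidents/{ts:%Y/%m/%d}T{ts:%H-%M-%S}"
        f"_{source}_{message_id}.json"
    )
    return key, entry
```

Keeping the entry builder separate from the S3 write makes the schema easy to check without AWS access.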

Apply state
Already applied live via aws cloudformation execute-change-set;
smoke-tested with one SNS publish — entry landed at
s3://alpha-engine-research/changelog/incidents/2026/05/01T15-52-57_*
within 2s, schema validated, then cleaned up. This PR is the
codification of the source-of-truth template.

Companions
- alpha-engine-docs PR #5 (event_type schema + aggregator support)
- Future: flow-doctor S3 notifier, manual CLI helper.

Note on template description
The template's docstring says "Does NOT manage Lambda functions or
IAM roles." Strictly we now manage one of each — narrow exception
for the SNS-mirror because it's tightly coupled to AlertsTopic
defined here. Not updating the doc this commit; will revisit if a
second exception lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 2, 2026
PR #134's pre-MorningEnrich preflight calls ``constituents.collect()`` and
reads ``cons_result.get("tickers", [])`` to feed prune_delisted_tickers'
``constituents_override`` and the daily_closes request list. But collect()
was returning only ``{"status": "ok", "count": N}`` — no tickers — so the
preflight got [] and MorningEnrich aborted with "No tickers available".

2026-05-02 SF redrive #5 was the live failure: prune correctly dropped
all 8 stragglers (architectural fix worked!), but then no tickers got
fed to daily_closes. The whole MorningEnrich step exited 1.

Add ``tickers`` to both happy-path returns (ok + ok_dry_run). Additive,
no breakage:
- ``_run_phase1`` (the only other caller) previously round-tripped to S3
  to re-read what it just wrote — now uses ``const_result["tickers"]``
  directly.
- The dry-run fork in _run_phase1 (which separately called the private
  ``_fetch_constituents``) is also collapsed.

Contract test in tests/test_constituents_sector_map.py locks both return
shapes, guarding against this exact regression class sneaking back.
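
The return-shape contract described above might look like the following sketch. The `collect` function here is a hypothetical stand-in that takes the tickers as an argument, not the real constituents.collect(), and `check_collect_contract` is an illustrative helper rather than the test in tests/test_constituents_sector_map.py.

```python
def collect(tickers, dry_run=False):
    """Hypothetical stand-in returning the post-fix shape."""
    status = "ok_dry_run" if dry_run else "ok"
    # "tickers" is the additive field; "status" and "count" are unchanged.
    return {"status": status, "count": len(tickers), "tickers": list(tickers)}


def check_collect_contract(result):
    """Assert the shape that the preflight and _run_phase1 rely on."""
    assert result["status"] in {"ok", "ok_dry_run"}
    assert isinstance(result["tickers"], list)
    assert result["tickers"], "an empty list would abort MorningEnrich"
    assert result["count"] == len(result["tickers"])


# Both happy paths must carry the ticker list.
check_collect_contract(collect(["AAPL", "MSFT"]))
check_collect_contract(collect(["AAPL", "MSFT"], dry_run=True))
```

A caller that reads `result.get("tickers", [])` silently gets `[]` when the key is missing, which is why the contract asserts the key's presence instead of relying on the default.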

376 tests pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 18, 2026
…y path) (#259)

ROADMAP "Friday shell-run — per-module dry-path activation" owed-item #1.
Under the Friday shell_run, the DataPhase1/MorningEnrich + RAGIngestion
spot states now boot the spot for real, run their EXISTING preflight,
then exit 0 with ZERO external API data fetch and ZERO
S3/ArcticDB/config/email/SNS writes — catching bootstrap-class breakage
(lib-pin drift, sys.path collision, stale ArcticDB symbol, SSM timeout,
Dockerfile/image gap) ~12h before the real Saturday run.

Reuses the existing preflight substrate; no parallel preflight written.

Where the gate sits / zero-fetch zero-write proof:

- weekly_collector.py: new `--preflight-only` argparse flag. main()
  exits HERE — `raise SystemExit(0)` immediately after the existing
  `DataPreflight(config["bucket"], mode).run()` and strictly BEFORE
  `run_weekly(config, args)`. run_weekly() is the SOLE function in the
  module that performs ANY collector fetch (polygon/FMP/FRED/yfinance)
  or ANY S3/ArcticDB/parquet/config/module-health write — gating in
  front of it makes every fetch/write code path statically unreachable.
  The preflight itself only does read-only/auth probes (S3 HEAD,
  polygon/FRED reference-data auth calls that fetch no collector data,
  ArcticDB list_libraries) plus a self-cleaning S3 PUT+DELETE sentinel
  under preflight/ (the preflight's own liveness probe, not a data
  write). Ordering pinned by an AST-source test.

- rag/pipelines/run_weekly_ingestion.sh: new `--preflight-only` flag.
  Exits 0 after Step 0 (`python -m rag.preflight`: check_env_vars +
  check_s3_bucket HEAD — read-only, zero fetch, zero write) and strictly
  BEFORE Step 1 (ingest_sec_filings). Every ingest_* pipeline, Voyage
  embedding call, and Postgres/pgvector + parquet write lives in Steps
  1-9 — all unreachable once the guard exits.

- infrastructure/spot_data_weekly.sh: new `--preflight-only` flag sets
  PREFLIGHT_ONLY=1, a MODIFIER orthogonal to RUN_MODE so it composes
  with the data path AND --rag-only. A dedicated data-path block runs
  `weekly_collector.py --morning-enrich --preflight-only` and/or
  `weekly_collector.py --phase 1 --preflight-only` (gated by the
  existing DO_MORNING_ENRICH/DO_PHASE1 split) then exit 0 before the
  real WORKLOADS heredoc — no prune (prune-audit JSON write), no RAG,
  no CloudWatch heartbeat, no S3 log upload.
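
The weekly_collector.py gate described in the first bullet can be sketched as follows. This is a simplified illustration, not the module's code: `preflight` and `run_weekly` are injected callables standing in for `DataPreflight(...).run()` and `run_weekly(config, args)`, and only the `--preflight-only` flag is modeled.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="weekly collector (sketch)")
    parser.add_argument(
        "--preflight-only",
        action="store_true",
        help="run the preflight, then exit 0 before any fetch or write",
    )
    return parser


def main(argv, preflight, run_weekly):
    args = build_parser().parse_args(argv)
    # The existing preflight always runs first.
    preflight()
    if args.preflight_only:
        # Exit here: nothing below this line can run in dry-path mode.
        raise SystemExit(0)
    # The sole entry point for collector fetches and writes.
    run_weekly(args)
```

Because the exit sits between the preflight and the single function that does real work, the dry path needs no per-collector guards.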

--rag-only --preflight-only behavior: runs ONLY the RAG-path preflight
(boot + SSM secret fetch so rag.preflight's check_env_vars sees them +
`run_weekly_ingestion.sh --preflight-only` = step-0-only + exit 0). No
real RAG ingestion, no rag-ingestion heartbeat. `--preflight-only` alone
runs ONLY the DataPhase1/MorningEnrich preflight.

Universe-freshness tolerance note (ROADMAP owed-item #5): the Friday
shell-run uses the phase1 / morning_enrich preflight modes. Per
preflight.py::DataPreflight.run, NEITHER mode runs check_arcticdb_fresh
— they only do _check_arcticdb_libraries_present (a presence read, not a
freshness gate). morning_enrich deliberately omits freshness (it is part
of what *makes* ArcticDB fresh); phase1 *populates* ArcticDB. The only
freshness gate (check_arcticdb_fresh macro/SPY 4d) lives in the "daily"
mode, which the Saturday/Friday data path never selects. So a Friday run
predating Friday's settled polygon aggregate does NOT spuriously fail on
a Thursday-last-bar — no --preflight-only-scoped tolerance code is
required for the data path. Documented inline so a future mode-mapping
change re-audits this invariant.

Tests: new tests/test_preflight_only_dry_path.py (10 tests, static
greps + AST-source assertions, matching the existing
test_spot_data_weekly_run_modes.py / test_weekly_collector_preflight_
mode_mapping.py convention) pins: flag parsing on all 3 files, the
exit-0-after-preflight-before-fetch/write ordering invariant,
--rag-only --preflight-only step-0-only behavior, and the
no-prune/no-RAG/no-heartbeat/no-S3-upload hard invariant. Full suite:
1229 passed, 1 skipped (pre-existing). bash -n clean on both shell
scripts. No new deps, no secrets.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>