
Embed git pull + hard-fail exit propagation in Saturday Step Function SSM #20

Merged
cipher813 merged 2 commits into main from fix/step-function-git-pull-and-exit-propagation
Apr 11, 2026


@cipher813
Owner

Summary

Fix three stacked bugs in every SSM command the Saturday Step Function issues to the ops EC2 instance:

  1. No git pull: commands ran whatever was last manually checked out on `/home/ec2-user/alpha-engine-{data,predictor,backtester}`. PR #18 (Hard-fail on partial DataPhase1 + fix features unhashable dict TypeError) had been merged hours earlier, but the pipeline was still executing pre-#18 code. Now every SSM command runs `git pull --ff-only origin main` before doing any work.
  2. `| tee` without `set -o pipefail` (DataPhase1, RAGIngestion, HealthCheck): `tee`'s exit code masked Python's, so even `SystemExit(1)` was reported to SSM as Success. Now every command starts with `set -eo pipefail`.
  3. Cosmetic `echo "EXIT_CODE=$?"` as the final line (DataPhase1, RAGIngestion) — made the shell script's exit code `echo`'s exit code (always 0), losing whatever had happened earlier. Removed.
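
Bugs 2 and 3 can be reproduced in isolation. The following is a minimal sketch, assuming bash; `failing_job` is a hypothetical stand-in for a Python entry point that ends in `SystemExit(1)`, not code from the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a Python entry point that exits with status 1.
failing_job() { echo "collector failed"; return 1; }

# Bug 2: without pipefail, a pipeline reports the status of its last command (tee).
status_no_pipefail=0
( set +o pipefail; failing_job | tee /dev/null >/dev/null ) || status_no_pipefail=$?

# With pipefail, the pipeline reports the failing command's status instead.
status_pipefail=0
( set -o pipefail; failing_job | tee /dev/null >/dev/null ) || status_pipefail=$?

# Bug 3: a trailing echo becomes the script's exit status, even after a failure.
status_trailing_echo=0
( failing_job >/dev/null; echo "EXIT_CODE=$?" >/dev/null ) || status_trailing_echo=$?

echo "no pipefail: $status_no_pipefail, pipefail: $status_pipefail, trailing echo: $status_trailing_echo"
```

Only the `pipefail` variant surfaces the failure. The other two exit 0, which is what SSM saw.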

Live deployment

Applied directly to the live state machine via:
```
aws stepfunctions update-state-machine --state-machine-arn ... --definition file://infrastructure/step_function.json
```
Revision: `ac2011c6`. This PR is the repo-side record.

Answers to the "what can we do to prevent this issue?" question

  • Now (this PR): git pull + set -eo pipefail + drop cosmetic echo — inside the Step Function itself so every run picks up main automatically.
  • Medium-term: emit `git rev-parse HEAD` into every phase's manifest so drift between deployed code and expected code is monitorable.
  • Long-term: immutable artifacts — build a tar/container with a git-SHA tag, upload to S3 on merge, SSM extracts. Eliminates the git-on-EC2 model entirely.
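
The medium-term item could look roughly like the sketch below. The helper name `write_git_manifest`, the output path, and the JSON field names are assumptions for illustration; this PR does not define the manifest schema.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record a repo's checked-out commit in a small JSON manifest.
# The field names ("phase", "git_sha") are illustrative, not the project's schema.
write_git_manifest() {
  local phase="$1" repo_dir="$2" out_file="$3" sha
  # Fail the caller if the directory is not a usable git checkout.
  sha=$(git -C "$repo_dir" rev-parse HEAD) || return 1
  printf '{"phase": "%s", "git_sha": "%s"}\n' "$phase" "$sha" > "$out_file"
}
```

A monitor could then compare `git_sha` against the expected commit on `main` and alert on drift.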

Test plan

🤖 Generated with Claude Code

cipher813 and others added 2 commits April 11, 2026 14:10
… SSM

Three stacked bugs in the Step Function's SSM commands that left the
Saturday pipeline blind to DataPhase1 failures and running stale EC2
code for an unknown duration:

1. No `git pull` — every SSM command ran whatever was checked out on
   /home/ec2-user/alpha-engine-{data,predictor,backtester,research}
   last time an operator manually pulled. PR #18's hard-fail fix was
   merged to main hours ago but today's pipeline runs were still
   executing pre-#18 code.

2. `| tee` without `set -o pipefail` — DataPhase1, RAGIngestion, and
   HealthCheck piped Python output through tee without pipefail, so
   tee's exit code (always 0) masked the Python exit code. Even when
   main() raised SystemExit(1), SSM reported Success.

3. `echo "EXIT_CODE=$?"` as the final command on DataPhase1 and
   RAGIngestion — this was cosmetic decoration that made the shell
   script exit with echo's exit code (always 0), losing whatever
   had happened earlier in the script. Pure write-only code.

All six SSM commands now start with `set -eo pipefail`, pull their
repo from origin main with --ff-only (fails loudly if EC2 has
diverged), and drop the cosmetic echo. DriftDetection pulls both
alpha-engine-data and alpha-engine-predictor since it consumes
both. HealthCheck pulls alpha-engine-data.

This also answers the "what can we do to prevent this issue?"
question at the first layer: git pull in the SSM command itself.
Next layers (per session discussion):
- Medium-term: emit git SHA into each phase manifest for drift
  detection monitoring
- Long-term: immutable artifacts — tar/container with git-SHA tag
  uploaded to S3 on merge, SSM extracts instead of git pull

## Live deployment
Applied directly to the live state machine via
`aws stepfunctions update-state-machine` (revision ac2011c6) so the
rerun can pick it up immediately. This PR is the repo-side record.

## Test plan
- [x] JSON validates
- [ ] New execution picks up new definition
- [ ] DataPhase1 runs new code (PR #18 hard-fail), fails fast if any
  collector is non-ok
- [ ] No "EXIT_CODE=0" lines in CloudWatch despite Python errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 0a3a90b into main Apr 11, 2026
1 check passed
@cipher813 cipher813 deleted the fix/step-function-git-pull-and-exit-propagation branch April 11, 2026 21:18
cipher813 added a commit that referenced this pull request Apr 11, 2026
PR #20 added `git pull --ff-only origin main` to every SSM command in
the Saturday Step Function. When executed, every command failed with:

    fatal: detected dubious ownership in repository at
    '/home/ec2-user/alpha-engine-data'
    failed to run commands: exit status 128

Cause: SSM RunShellScript runs as root on Amazon Linux, but the four
repo checkouts are owned by ec2-user. Since version 2.35.2, Git's
safe.directory check refuses to operate on repos owned by a different
user unless explicitly allowed.

Fix: run the git pull as ec2-user via
`sudo -u ec2-user git -C /path pull --ff-only origin main`.
`git -C <path>` passes the repo path explicitly, so the pull does not
depend on the working directory carrying over across `sudo`. The rest
of each command (cd, source, Python/bash) continues to run as root
as before, with no behavior change for non-git steps.
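
Combined with PR #20, each SSM command takes roughly the shape sketched below. `render_ssm_command` is a hypothetical helper that only prints the lines, and the run command passed to it is a placeholder. The actual commands live in `infrastructure/step_function.json`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the shell lines for one SSM command.
# repo_dir and run_cmd are placeholders, not the real per-phase values.
render_ssm_command() {
  local repo_dir="$1" run_cmd="$2"
  printf '%s\n' \
    'set -eo pipefail' \
    "sudo -u ec2-user git -C ${repo_dir} pull --ff-only origin main" \
    "cd ${repo_dir}" \
    "${run_cmd}"
}

# Placeholder run command; the real entry points differ per phase.
render_ssm_command /home/ec2-user/alpha-engine-data 'python entry_point.py 2>&1 | tee /tmp/run.log'
```

Only the pull runs as ec2-user. The remaining lines run as root, as before.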

All six SSM commands updated consistently:
- DataPhase1, RAGIngestion, HealthCheck (alpha-engine-data)
- PredictorTraining (alpha-engine-predictor)
- DriftDetection (alpha-engine-data + alpha-engine-predictor)
- Backtester (alpha-engine-backtester)

Verified working via a standalone SSM probe before pushing this PR —
`sudo -u ec2-user git -C /home/ec2-user/alpha-engine-data pull` ran
cleanly and advanced the EC2 checkout from 292e51e to 0a3a90b.

## Live deployment
Applied directly to the live state machine. This PR is the
repo-side record.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 3, 2026
…ndex (#144)

Follow-up to the merged PR #142 (verbose 154-line README). Per the
canonical-template revisit landing in alpha-engine-docs PR #20, the
locked README target is ~60 lines and detail pushes to OVERVIEW.md.

README changes (154 → 68 lines):
- Drop Quick start, Key files, How it runs, S3 contract, Testing →
  all push to OVERVIEW.md
- Drop env-var enumeration → push to OVERVIEW.md
- Compress Configuration to 1-2 sentences naming the disclosure
  boundary, no file index
- Switch ASCII architecture diagram to mermaid (4-flow high-level
  picture: Phase 1 / RAG / Phase 2 / EOD → ArcticDB / pgvector / S3
  staging)
- Drop ASCII detail block (constituents/prices/slim_cache/macro/
  universe_returns/feature_store enumeration) — that's an OVERVIEW
  concern, not a README concern
- Drop verbose Quality Gates + Design Rationale prose (compressed
  to a one-line note under the diagram)

Path corrections:
- Prior README claimed `feature_store/compute.py` — that path
  doesn't exist (actual: `features/feature_engineer.py`). All file
  references in OVERVIEW.md verified against the filesystem before
  committing.

OVERVIEW.md (new, 79 lines):
- 7 sections per locked template (alpha-engine-docs PR #20):
  Module purpose, Entry points (4 files), Where things live (~20
  concept→file mappings), Inputs/outputs (S3 contract), Run modes,
  Tests (one paragraph)
- All file paths verified against the actual filesystem

This is the first prototype OVERVIEW.md. Same shape will apply to
the other 5 module repos in upcoming PRs.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>