Embed git pull + hard-fail exit propagation in Saturday Step Function SSM #20
Merged
… SSM
Three stacked bugs in the Step Function's SSM commands that left the
Saturday pipeline blind to DataPhase1 failures and running stale EC2
code for an unknown duration:
1. No `git pull` — every SSM command ran whatever was checked out on
/home/ec2-user/alpha-engine-{data,predictor,backtester,research}
last time an operator manually pulled. PR #18's hard-fail fix was
merged to main hours ago but today's pipeline runs were still
executing pre-#18 code.
2. `| tee` without `set -o pipefail` — DataPhase1, RAGIngestion, and
HealthCheck piped Python output through tee without pipefail, so
tee's exit code (always 0) masked the Python exit code. Even when
main() raised SystemExit(1), SSM reported Success.
3. `echo "EXIT_CODE=$?"` as the final command on DataPhase1 and
RAGIngestion — this was cosmetic decoration that made the shell
script exit with echo's exit code (always 0), losing whatever
had happened earlier in the script. Pure write-only code.
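Bugs 2 and 3 can be reproduced in isolation. The following is a minimal sketch that uses `false` as a stand-in for a failing Python entry point; it is not taken from the actual SSM commands.

```bash
#!/usr/bin/env bash
# `false` stands in for a Python process that exits non-zero (an assumption
# for illustration; the real commands run Python entry points).

# Bug 2: without pipefail, the pipeline's status is tee's status (0).
without_pipefail=$(bash -c 'false | tee /dev/null; echo $?')

# With pipefail, the failing left-hand side determines the pipeline's status.
with_pipefail=$(bash -c 'set -o pipefail; false | tee /dev/null; echo $?')

# Bug 3: a trailing echo makes the script exit with echo's status (0),
# even though the echoed text itself reports the failure.
bash -c 'false; echo "EXIT_CODE=$?"' >/dev/null
trailing_echo=$?

echo "without pipefail: $without_pipefail"
echo "with pipefail:    $with_pipefail"
echo "trailing echo:    $trailing_echo"
```

The first and third values are 0 and the second is 1, which is why SSM saw Success whenever either pattern was present.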
All six SSM commands now start with `set -eo pipefail`, pull their
repo from origin main with --ff-only (fails loudly if EC2 has
diverged), and drop the cosmetic echo. DriftDetection pulls both
alpha-engine-data and alpha-engine-predictor since it consumes
both. HealthCheck pulls alpha-engine-data.
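The "fails loudly if EC2 has diverged" behaviour of `--ff-only` can be checked with throwaway repositories. This is a sketch under stated assumptions: the temp directories and the `g` wrapper are hypothetical stand-ins, not the real EC2 checkouts or the real SSM command.

```bash
#!/usr/bin/env bash
# All repositories live in a temp directory; nothing here touches real checkouts.
work=$(mktemp -d)
# Wrapper that supplies a throwaway identity so commits work on any machine.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# "origin": one commit on main.
g init -q "$work/origin"
g -C "$work/origin" symbolic-ref HEAD refs/heads/main
g -C "$work/origin" commit -q --allow-empty -m "initial"

# "EC2 checkout": a clone of origin.
g clone -q "$work/origin" "$work/checkout"

# Case 1: origin advances and the checkout is merely behind, so the pull fast-forwards.
g -C "$work/origin" commit -q --allow-empty -m "new fix"
g -C "$work/checkout" pull -q --ff-only origin main
behind_status=$?

# Case 2: the checkout gains a local commit while origin also advances.
# The histories have diverged, so --ff-only refuses and exits non-zero.
g -C "$work/checkout" commit -q --allow-empty -m "local hotfix"
g -C "$work/origin" commit -q --allow-empty -m "another fix"
g -C "$work/checkout" pull -q --ff-only origin main 2>/dev/null
diverged_status=$?

echo "behind: $behind_status, diverged: $diverged_status"
rm -rf "$work"
```

Combined with `set -eo pipefail` at the top of the command, the non-zero status in the diverged case stops the step before any stale code runs.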
This also answers the "what can we do to prevent this issue?"
question at the first layer: git pull in the SSM command itself.
Next layers (per session discussion):
- Medium-term: emit git SHA into each phase manifest for drift
detection monitoring
- Long-term: immutable artifacts — tar/container with git-SHA tag
uploaded to S3 on merge, SSM extracts instead of git pull
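The medium-term layer could look roughly like the sketch below. The helper name `manifest_fragment` and the field names `phase` and `git_sha` are assumptions for illustration; the real manifest format is not described here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a one-line JSON fragment recording which commit a phase ran.
manifest_fragment() {
  local phase=$1 sha=$2
  printf '{"phase": "%s", "git_sha": "%s"}\n' "$phase" "$sha"
}

# On the instance the SHA would come from the checkout itself, for example:
#   manifest_fragment DataPhase1 "$(git -C /home/ec2-user/alpha-engine-data rev-parse HEAD)"
# A placeholder SHA is used here so the sketch runs anywhere.
manifest_fragment DataPhase1 0a3a90b
```

A monitor could then compare the recorded SHA against origin main and flag any phase that ran behind it.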
## Live deployment
Applied directly to the live state machine via
`aws stepfunctions update-state-machine` (revision ac2011c6) so the
rerun can pick it up immediately. This PR is the repo-side record.
## Test plan
- [x] JSON validates
- [ ] New execution picks up new definition
- [ ] DataPhase1 runs new code (PR #18 hard-fail), fails fast if any
collector is non-ok
- [ ] No "EXIT_CODE=0" lines in CloudWatch despite Python errors
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 11, 2026
PR #20 added `git pull --ff-only origin main` to every SSM command in the Saturday Step Function. When executed, every command failed with:

    fatal: detected dubious ownership in repository at '/home/ec2-user/alpha-engine-data'
    failed to run commands: exit status 128

Cause: SSM RunShellScript runs as root on Amazon Linux, but the four repo checkouts are owned by ec2-user. Git's safe.directory check (Git >= 2.35.2) refuses to operate on repos owned by a different user unless explicitly allowed.

Fix: run the git pull as ec2-user via `sudo -u ec2-user git -C /path pull --ff-only origin main`. `git -C <path>` avoids the pwd-across-sudo subshell issue. The rest of each command (cd, source, Python/bash) continues to run as root as before, so there is no behavior change for non-git steps.

All six SSM commands updated consistently:
- DataPhase1, RAGIngestion, HealthCheck (alpha-engine-data)
- PredictorTraining (alpha-engine-predictor)
- DriftDetection (alpha-engine-data + alpha-engine-predictor)
- Backtester (alpha-engine-backtester)

Verified working via a standalone SSM probe before pushing this PR: `sudo -u ec2-user git -C /home/ec2-user/alpha-engine-data pull` ran cleanly and advanced the EC2 checkout from 292e51e to 0a3a90b.

## Live deployment

Applied directly to the live state machine. This PR is the repo-side record.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
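The per-repo pull lines described in this commit can be sketched as a dry run that prints the commands instead of executing them. `pull_line` is a hypothetical helper, not part of the actual state machine definition.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not execute) the pull line for one repo checkout.
pull_line() {
  printf 'sudo -u ec2-user git -C /home/ec2-user/%s pull --ff-only origin main\n' "$1"
}

# DriftDetection consumes two repos, so it gets two pull lines.
drift_pulls=$(for repo in alpha-engine-data alpha-engine-predictor; do pull_line "$repo"; done)
echo "$drift_pulls"
```

Because `git -C` takes the repository path explicitly, the command does not depend on the working directory surviving the switch to ec2-user.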
cipher813 added a commit that referenced this pull request May 3, 2026
…ndex (#144)

Follow-up to the merged PR #142 (verbose 154-line README). Per the canonical-template revisit landing in alpha-engine-docs PR #20, the locked README target is ~60 lines and detail pushes to OVERVIEW.md.

README changes (154 → 68 lines):
- Drop Quick start, Key files, How it runs, S3 contract, Testing → all push to OVERVIEW.md
- Drop env-var enumeration → push to OVERVIEW.md
- Compress Configuration to 1-2 sentences naming the disclosure boundary, no file index
- Switch ASCII architecture diagram to mermaid (4-flow high-level picture: Phase 1 / RAG / Phase 2 / EOD → ArcticDB / pgvector / S3 staging)
- Drop ASCII detail block (constituents/prices/slim_cache/macro/universe_returns/feature_store enumeration); that's an OVERVIEW concern, not a README concern
- Drop verbose Quality Gates + Design Rationale prose (compressed to a one-line note under the diagram)

Path corrections:
- Prior README claimed `feature_store/compute.py`, but that path doesn't exist (actual: `features/feature_engineer.py`). All file references in OVERVIEW.md verified against the filesystem before committing.

OVERVIEW.md (new, 79 lines):
- 7 sections per locked template (alpha-engine-docs PR #20): Module purpose, Entry points (4 files), Where things live (~20 concept→file mappings), Inputs/outputs (S3 contract), Run modes, Tests (one paragraph)
- All file paths verified against the actual filesystem

This is the first prototype OVERVIEW.md. Same shape will apply to the other 5 module repos in upcoming PRs.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fix three stacked bugs in every SSM command the Saturday Step Function issues to the ops EC2 instance:
- No `git pull`: commands ran whatever was last manually checked out on `/home/ec2-user/alpha-engine-{data,predictor,backtester}`. PR #18 (Hard-fail on partial DataPhase1 + fix features unhashable dict TypeError) had been merged hours ago but the pipeline was still executing pre-#18 code. Now every SSM command starts with `git pull --ff-only origin main`.

Live deployment
Applied directly to the live state machine via:
```
aws stepfunctions update-state-machine --state-machine-arn ... --definition file://infrastructure/step_function.json
```
Revision: `ac2011c6`. This PR is the repo-side record.
Answers to the "what can we do to prevent this issue?" question
Test plan
🤖 Generated with Claude Code