Background
The emulator image public.ecr.aws/durable-functions/aws-durable-execution-emulator:latest is consumed automatically by aws-sam-cli: on every sam local invoke of a durable function, sam-cli pulls :latest and refreshes the local cache (see durable_functions_emulator_container.py; customers can override per-invoke with DURABLE_EXECUTIONS_EMULATOR_IMAGE_TAG but the default is :latest). This means any image we publish ships immediately to every durable-functions customer running sam-cli, with no version pin in between.
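The default-versus-override behaviour described above can be pictured with a small sketch. This is not sam-cli's actual code: `resolve_emulator_image` is a hypothetical helper, and it assumes the override replaces only the tag portion of the image reference.

```python
import os

# Repository name as given in this issue; that only the tag varies is an assumption.
EMULATOR_REPOSITORY = "public.ecr.aws/durable-functions/aws-durable-execution-emulator"
TAG_ENV_VAR = "DURABLE_EXECUTIONS_EMULATOR_IMAGE_TAG"


def resolve_emulator_image(env=None):
    """Return the image reference a local invoke would use (hypothetical helper)."""
    env = os.environ if env is None else env
    # With no override set, the floating "latest" tag is used.
    tag = env.get(TAG_ENV_VAR, "latest")
    return f"{EMULATOR_REPOSITORY}:{tag}"


# No override: whatever was most recently published as :latest is what runs.
print(resolve_emulator_image({}))
# Per-invoke override: the customer (or a CI job) selects a specific tag.
print(resolve_emulator_image({TAG_ENV_VAR: "pr-abc123"}))
```

Under that assumption, the only thing standing between a publish and a customer's next invoke is whether they happened to set the variable.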
PR #216 in this repo recently demonstrated the blast radius: ~26 sam-cli durable integration tests went red across local-invoke, local-start-lambda, tier1-finch, and tier1-windows-other jobs (e.g. aws-sam-cli Integration Tests #496, run #8779 / local-start-lambda) the moment v1.2.0 went to :latest. Customer-visible symptom: a fresh samdev local invoke against any durable function 500s on first checkpoint or 404s on first local execution get|history|stop|callback. Mitigations are in flight on the sam-cli side (aws/aws-sam-cli#9038 merged, #9040 open) but they do not address the class problem: this repo's release pipeline has no signal from sam-cli before publishing :latest.
Why our existing tests didn't catch this
tests/web/e2e/routes_arn_encoding_int_test.py (added in #222) drives a real boto client against this repo's WebServer and would have caught the emulator-side routing bug. It does not — and cannot — exercise sam-cli's LocalLambdaHttpService, which is a separate Flask service that customers' boto clients actually hit when using samdev local invoke. Anything we change in the ARN, callback ID, or function-qualifier shape can break sam-cli's service without touching ours.
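To illustrate why the ARN shape matters to any HTTP service that places it in a URL path, here is a minimal sketch. The ARN values and the `/durable-executions/...` route layout are made up for illustration; only the overall ARN shape comes from the References section.

```python
from urllib.parse import quote, unquote

# Example ARN following the documented shape; every field value is invented.
ARN = (
    "arn:aws:lambda:us-east-1:123456789012:function:my-fn:$LATEST"
    "/durable-execution/my-exec/abc123"
)


def to_path_segment(value):
    """Percent-encode a value so it occupies exactly one URL path segment."""
    return quote(value, safe="")


def from_path_segment(segment):
    """Reverse to_path_segment on the server side."""
    return unquote(segment)


def count_segments(path):
    """Count the '/'-separated segments a naive router would see."""
    return len(path.strip("/").split("/"))


# Hypothetical route layout: /durable-executions/<arn>
raw_path = f"/durable-executions/{ARN}"
encoded_path = f"/durable-executions/{to_path_segment(ARN)}"

raw_segments = count_segments(raw_path)          # the ARN's slashes split it apart
encoded_segments = count_segments(encoded_path)  # the ARN stays in one segment
round_tripped = from_path_segment(encoded_path.rsplit("/", 1)[1])
```

A router that matches on a single path parameter only sees the ARN intact in the encoded form, and only recovers the original value if it decodes the segment. Both services have to agree on this independently, which is why a test against one does not cover the other.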
Proposal
Add a pre-publish CI step that builds the candidate emulator image and runs sam-cli's durable integration suite against it. Concrete shape:
1. Build the emulator image from this repo (we already do this in ecr-release.yml).
2. Tag it locally with a candidate tag, e.g. aws-durable-execution-emulator:pr-${SHA}.
3. Check out aws/aws-sam-cli at develop, install in SAM_CLI_DEV=1 mode.
4. Run, with DURABLE_EXECUTIONS_EMULATOR_IMAGE_TAG=pr-${SHA}:
    pytest -vv \
      tests/integration/local/invoke/test_invoke_durable.py \
      tests/integration/local/start_lambda/test_start_lambda_durable.py \
      tests/integration/local/execution/test_execution.py \
      tests/integration/local/callback/test_callback.py
   That's the durable subset: ~50 tests, running in ~3–5 min in CI based on the local-invoke and local-start-lambda timings above.
5. Publish to ECR only if step 4 is green.
Gate this on PRs that touch src/** so we get the signal pre-merge as well as pre-publish.
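The proposed sequence could be driven by a small script along these lines. This is a sketch under assumptions, not the final workflow: `build_plan` and `run_plan` are hypothetical helpers, the `pip install -e .` step stands in for whatever sam-cli's dev setup actually requires, and the Docker build context is assumed to be the repository root.

```python
import os
import subprocess

SAM_CLI_REPO = "https://github.com/aws/aws-sam-cli.git"

# The four durable test files named in the proposal.
DURABLE_TESTS = [
    "tests/integration/local/invoke/test_invoke_durable.py",
    "tests/integration/local/start_lambda/test_start_lambda_durable.py",
    "tests/integration/local/execution/test_execution.py",
    "tests/integration/local/callback/test_callback.py",
]


def build_plan(sha, sam_cli_dir="aws-sam-cli"):
    """Return ordered (argv, extra_env, cwd) steps without executing anything."""
    tag = f"pr-{sha}"
    test_env = {
        "SAM_CLI_DEV": "1",
        "DURABLE_EXECUTIONS_EMULATOR_IMAGE_TAG": tag,
    }
    return [
        # Build the candidate image and tag it (build context is an assumption).
        (["docker", "build", "-t", f"aws-durable-execution-emulator:{tag}", "."], {}, None),
        # Check out sam-cli at develop.
        (["git", "clone", "--depth", "1", "--branch", "develop", SAM_CLI_REPO, sam_cli_dir], {}, None),
        # Install sam-cli for development (assumed to be sufficient).
        (["python", "-m", "pip", "install", "-e", "."], test_env, sam_cli_dir),
        # Run the durable subset against the candidate tag.
        (["python", "-m", "pytest", "-vv", *DURABLE_TESTS], test_env, sam_cli_dir),
    ]


def run_plan(plan):
    """Execute each step in order; check=True aborts on the first failure."""
    for argv, extra_env, cwd in plan:
        subprocess.run(argv, env={**os.environ, **extra_env}, cwd=cwd, check=True)
```

Calling `run_plan(build_plan(sha))` raises `subprocess.CalledProcessError` on the first failing step, so a publish step placed after it only runs when the durable tests pass. Keeping plan construction separate from execution also lets the step list be inspected or unit-tested without Docker.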
Acceptance criteria
- A workflow (e.g. .github/workflows/sam-cli-compat.yml) that runs the four sam-cli durable test files against the locally-built emulator image and is required for PRs that change src/.
- The publish job (ecr-release.yml) gated on the same workflow's success.
- A README / CONTRIBUTING note explaining that any change affecting the emulator's HTTP contract — ARN shape, callback-token shape, route layout, response codes — must keep this job green.
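A contributor note about the ARN shape could be accompanied by a shape check like the one below. This is an illustrative sketch only: the pattern follows the DurableExecutionArn shape listed under References, but the character classes allowed in each field are assumptions, and `parse_durable_execution_arn` is a hypothetical helper.

```python
import re

# arn:<partition>:lambda:<region>:<account>:function:<fn>:<qualifier>
#   /durable-execution/<execution-name>/<execution-id>
# Which characters each field may contain is an assumption of this sketch.
DURABLE_EXECUTION_ARN = re.compile(
    r"arn:(?P<partition>[^:/]+):lambda:(?P<region>[^:/]+):(?P<account>[^:/]+)"
    r":function:(?P<function>[^:/]+):(?P<qualifier>[^:/]+)"
    r"/durable-execution/(?P<execution_name>[^/]+)/(?P<execution_id>[^/]+)"
)


def parse_durable_execution_arn(arn):
    """Return the ARN's fields as a dict, or None if the shape does not match."""
    match = DURABLE_EXECUTION_ARN.fullmatch(arn)
    return match.groupdict() if match else None


# Made-up example values; note the three "/" characters the pattern must accept.
example = (
    "arn:aws:lambda:us-east-1:123456789012:function:my-fn:$LATEST"
    "/durable-execution/my-exec/abc123"
)
fields = parse_durable_execution_arn(example)
```

A pattern that excluded "/" from the ARN would reject this example outright, which is the class of failure the compatibility job is meant to surface before publish.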
Out of scope
- Pinning sam-cli to a specific emulator tag. That just inverts the dependency: customers stop picking up emulator fixes until sam-cli ships a new release. Roll-forward + this CI gate is the durable answer.
- Running the full sam-cli integration suite. The four files above cover every code path that talks to the emulator.
References
- DurableExecutionArn shape: arn:<partition>:lambda:<region>:<account>:function:<fn>:<qualifier>/durable-execution/<execution-name>/<execution-id> (API_GetDurableExecution)
- [Bug]: Durable integration tests can't extract execution ARN that contains "/"
- fix(tests): accept '/' in durable execution ARN regex
- [Bug]: Local Lambda HTTP service rejects DurableExecutionArn / CallbackId containing "/"
- fix(local-lambda): accept documented Lambda DurableExecutionArn shape
- [Bug]: WebServer route layer doesn't URL-decode path segments #222 (URL-decode in own WebServer)