Minimal end-to-end ML pipeline take-home project with two clearly separated delivery paths:
- Training path: reproducible training, versioned artifacts, experiment logging, and a bridge toward continuous retraining.
- Inference path: lightweight Flask serving, model-agnostic deploy image, and runtime model selection.
The notebook walkthrough (`docs/01_notebook_local.md`) is retained as an exploratory/legacy path; the current maintained execution paths are `training/train.py` and `api/app.py` together with the CI workflows.
| Document | Purpose |
|---|---|
| `docs/QUICKSTART.md` | Fast local run commands |
| `docs/01_notebook_local.md` | Original notebook-only local walkthrough |
| `docs/02_training.md` | Training flow, artifacts, metrics, manifest |
| `docs/03_inference.md` | API behavior, model loading, Docker paths |
| `docs/04_monitoring.md` | Monitoring/observability design |
| `docs/05_tradeoffs.md` | Trade-offs and limitations |
| `.github/workflows/validation.yml` | Repo-wide quality gate |
| `.github/workflows/train.yml` | Training CI + submit workflow |
| `.github/workflows/inference.yml` | Inference CI + image release workflow |
| `.github/workflows/promote.yml` | Mock staging-to-production model promotion workflow |
```mermaid
flowchart TD
    codePush[Code push or PR] --> validation[validation.yml quality gate]
    validation --> trainFlow
    validation --> inferFlow

    subgraph trainFlow [Training: train.yml]
        subgraph trainValidate [Validate]
            trainChecks[PR/Push training checks]
            trainTests[Training unit tests]
            trainSmoke["Training CI smoke run (test-only, no ECR push)"]
            trainChecks --> trainTests --> trainSmoke
        end
        subgraph trainRelease [Release]
            trainDispatch[workflow_dispatch release handoff]
            trainContract[training_submission.json]
            trainManaged["Managed training job contract (design-level)"]
            trainDispatch --> trainContract --> trainManaged
        end
    end

    subgraph inferFlow [Inference: inference.yml]
        subgraph inferValidate [Validate]
            inferChecks[PR/Push inference checks]
            inferTests[Inference unit tests]
            inferCiBuild["Inference CI image build (test-only, no ECR push)"]
            inferChecks --> inferTests --> inferCiBuild
        end
        subgraph inferRelease [Release]
            inferDispatch[workflow_dispatch release handoff]
            inferBuildRelease[Build/tag inference Docker image]
            inferMeta[inference_build_metadata.json]
            inferDispatch --> inferBuildRelease --> inferMeta
        end
    end

    localDocker[Dockerfile.local] --> localVerify["Locally test /predict & /health"]
    trainContract --> runtimeModel[Runtime model reference]
    runtimeModel --> promoteFlow["Promote to prod (manual/gated)"]
    promoteFlow --> InferenceEndpoint
    inferFlow --> InferenceEndpoint[Create/update inference endpoint]
```
- Entry point: `training/train.py`.
- Trains a regression model from synthetic data or a `--data-uri` input.
- Emits immutable versioned runs and updates a `latest` alias.
- Logs params/metrics/artifacts to MLflow (local file backend by default).

Per run under `runs/artifacts/runs/vNNN/`:

- `sample_model.joblib`
- `metrics.json`
- `model_version.txt`
- `manifest.json`

The active alias under `runs/artifacts/latest/` mirrors the same contract.
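For illustration, a minimal consumer of this per-run contract might look like the sketch below (assuming a training run has already populated `runs/artifacts/latest/` and that the `.json` files parse as their names suggest):

```python
import json
from pathlib import Path

import joblib

# Resolve the active alias; any pinned run under runs/artifacts/runs/vNNN/
# exposes the same files, so the same code works for a specific version.
artifact_dir = Path("runs/artifacts/latest")

model = joblib.load(artifact_dir / "sample_model.joblib")
metrics = json.loads((artifact_dir / "metrics.json").read_text())
model_version = (artifact_dir / "model_version.txt").read_text().strip()
manifest = json.loads((artifact_dir / "manifest.json").read_text())

print(f"Loaded model {model_version} with metrics {metrics}")
```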
- Validate (PR/push): `train.yml` runs training checks, training unit tests, and a small CI smoke run (test-only, no ECR push).
- Release handoff: `workflow_dispatch` runs the local training path or emits the AWS training submission contract that a managed training release would use.
- Release handoff artifact: `training_submission.json`.
The repository is set up to evolve toward continuous retraining via:

- external data input (`--data-uri`),
- a versioned artifact + manifest contract,
- a documented release handoff contract (`training_submission.json` plus a design-level managed training job handoff; a sketch follows this list),
- explicit SageMaker job/pipeline contract items in the Productization section.
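The authoritative schema of `training_submission.json` is whatever the repo emits; purely as a sketch, a submission built from the contract items listed under Productization (I/O URIs, ResourceConfig, StoppingCondition, RoleArn, Region) could look roughly like this, with every field name below an assumption:

```python
# Hypothetical shape of training_submission.json; field names are
# assumptions drawn from the Productization contract items, not the
# repo's actual schema.
submission = {
    "training_image_uri": "<ecr-image-uri>",
    "input_data_uri": "s3://<bucket>/input/",
    "output_artifact_uri": "s3://<bucket>/artifacts/",
    "resource_config": {"instance_type": "ml.m5.large", "instance_count": 1},
    "stopping_condition": {"max_runtime_seconds": 3600},
    "role_arn": "arn:aws:iam::<account>:role/<training-role>",
    "region": "us-east-1",
}
```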
- Entry point: `api/app.py`.
- Endpoints:
  - `GET /health`
  - `POST /predict`
- Loads model artifacts via `api/model_loader.py` from either:
  - a local `MODEL_DIR` (default `runs/artifacts/latest`), or
  - a runtime `MODEL_ARTIFACT_URI` (including the S3/local tarball flow); a loading sketch follows below.
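The resolution order can be sketched roughly as follows; the real logic lives in `api/model_loader.py`, and the function names here are illustrative, not the module's actual API:

```python
import os
import tarfile
import tempfile

def fetch_artifact(uri: str) -> str:
    """Hypothetical helper: return a local path for `uri`.

    A real implementation would download s3:// URIs; this sketch only
    handles paths that are already local.
    """
    if uri.startswith("s3://"):
        raise NotImplementedError("S3 download elided in this sketch")
    return uri

def resolve_model_dir() -> str:
    """Sketch of the resolution order described above."""
    artifact_uri = os.environ.get("MODEL_ARTIFACT_URI")
    if artifact_uri:
        # The runtime model reference wins: fetch the tarball and unpack it.
        target = tempfile.mkdtemp(prefix="model-")
        with tarfile.open(fetch_artifact(artifact_uri)) as tar:
            tar.extractall(target)
        return target
    # Otherwise fall back to local/baked artifacts.
    return os.environ.get("MODEL_DIR", "runs/artifacts/latest")
```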
- Validate (PR/push): `inference.yml` runs inference checks, inference unit tests, and an inference CI image build (test-only, no ECR push).
- Release handoff (`workflow_dispatch`): builds and tags the `Dockerfile.inference` image and emits build metadata; the registry push remains placeholder/design-level in this take-home.
- `Dockerfile.local` is intentionally local-only and not part of inference release CI.
Containerization is intentionally split according to purpose:
- `Dockerfile.local`: local verification image with baked local artifacts for quick `/predict` checks.
- `Dockerfile.inference`: model-agnostic deployment image; the model reference is supplied at runtime.
- `Dockerfile.training`: training container contract for orchestrated training runs.
For future training in AWS, the intended flow is:

- build `Dockerfile.training`,
- push that image to ECR,
- create/start a SageMaker Training Job or a SageMaker Pipeline training step that references the ECR image.
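A minimal sketch of the last step with boto3 is shown below; every name, ARN, and URI is a placeholder, and the real job parameters are whatever `training_submission.json` captures:

```python
import boto3

# Sketch: start a SageMaker Training Job from the pushed ECR training image.
# All identifiers below are placeholders.
sagemaker = boto3.client("sagemaker", region_name="us-east-1")

sagemaker.create_training_job(
    TrainingJobName="ml-pipeline-train-v001",
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.us-east-1.amazonaws.com/ml-training:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    OutputDataConfig={"S3OutputPath": "s3://<bucket>/artifacts/"},
    ResourceConfig={
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```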
Quality signals come from CI, while operational/model observability is runtime-focused:

- CI monitoring signals:
  - `validation.yml`: repo-wide lint + full tests.
  - `train.yml`: training path quality + submit metadata contract.
  - `inference.yml`: inference path quality + deploy image build contract.
- Runtime monitoring signals:
  - structured JSON logs from `api/structured_logging.py` (`request_id`, `latency_ms`, `predict_ms`, status/error events, `model_version`),
  - log shipping via runtime log drivers/agents to CloudWatch,
  - a design for API SLOs, model quality checks, and drift checks in `docs/04_monitoring.md`.
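To illustrate the log shape, a standalone sketch that mirrors the fields listed above (the concrete implementation is in `api/structured_logging.py` and may differ):

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

# Emit one JSON line per request, carrying the fields named above so a
# log shipper can forward them to CloudWatch unchanged.
def log_prediction(status: str, latency_ms: float, predict_ms: float,
                   model_version: str) -> None:
    logger.info(json.dumps({
        "event": "predict",
        "request_id": str(uuid.uuid4()),
        "status": status,
        "latency_ms": round(latency_ms, 2),
        "predict_ms": round(predict_ms, 2),
        "model_version": model_version,
        "ts": time.time(),
    }))

log_prediction("ok", latency_ms=12.4, predict_ms=3.1, model_version="v003")
```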
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python training/train.py --output-dir runs/artifacts
python -m api.app
```
```bash
# Local verification image
docker build -f Dockerfile.local -t mlpipeline-api:local .
docker run --rm -p 8080:8080 mlpipeline-api:local
```

Prediction test:

```bash
curl -X POST http://127.0.0.1:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"age": 42, "income_k": 88.0, "tenure_years": 6}'
```

- Reproducible training entrypoint: `training/train.py`
- Model artifact + params + metrics logging: MLflow (local file backend)
- Prediction API with validation + model version response: `api/app.py`
- Dockerized local verification path: `Dockerfile.local`
- Model-agnostic deploy image path: `Dockerfile.inference`
- CI split by responsibility: `validation.yml`, `train.yml`, `inference.yml`
- Mock staging-to-production model promotion workflow: `.github/workflows/promote.yml`
- Monitoring design: `docs/04_monitoring.md`
Current choices in this repository:
- GitHub Actions is the primary orchestrator here for simplicity and transparency.
- Training orchestration can later move to dedicated managed orchestration while Actions remains validation/release trigger glue.
- Packaging training in `Dockerfile.training` keeps dependencies and runtime behavior consistent across local, CI, and managed compute, avoiding host-environment drift.
- Inference delivery intentionally favors model-agnostic images plus runtime model references for operational simplicity.
- Move training execution to managed jobs on AWS (SageMaker Training Jobs or SageMaker Pipeline training step).
- Build and publish training image to ECR with explicit training job contract (I/O URIs, ResourceConfig, StoppingCondition, RoleArn, Region, optional VPC).
- Attach managed experiment tracking/lineage (SageMaker Experiments or managed MLflow).
- Publish model artifacts to a registry with staged promotion.
- Gate promotions with quality checks + integration smoke tests.
- Deploy inference service with environment-specific runtime model references and secrets.
- Add production observability: latency/error SLOs, drift checks, post-deploy quality monitoring.
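For a flavor of the metric gate, a sketch that reuses the `metrics.json` contract from the training path (the metric name `rmse` and the tolerance are assumptions, not values the repo defines):

```python
import json
from pathlib import Path

# Sketch of a promotion quality gate: block promotion if the candidate
# regresses against the current production metrics beyond a tolerance.
def passes_gate(candidate_dir: str, production_dir: str,
                metric: str = "rmse", tolerance: float = 0.05) -> bool:
    cand = json.loads((Path(candidate_dir) / "metrics.json").read_text())
    prod = json.loads((Path(production_dir) / "metrics.json").read_text())
    # Lower is better for an error metric like RMSE.
    return cand[metric] <= prod[metric] * (1 + tolerance)
```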
Promotion should move an immutable model artifact/version from staging to production, not a mutable `latest` pointer.

- Train and package a candidate artifact with `manifest.json` and quality metrics.
- Register the staging candidate with provenance (`model_version`, artifact URI, git SHA, training run id, image tag).
- Run stage gates (metric quality evaluation, API contract smoke checks, deployment health checks).
- Require manual approval before production promotion.
- Promote by updating the production runtime model reference (`MODEL_ARTIFACT_URI` and optional `MODEL_VERSION`) to the approved artifact.
- Run post-promotion verification (`/health`, latency/error SLOs) and roll back to the prior artifact reference if checks fail.
This keeps deployment artifacts immutable and auditable, while avoiding per-model inference image rebuilds.
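As a concrete illustration of the provenance captured at registration time, a record might look like the following (field names mirror the list above; the values and the exact shape are assumptions):

```python
# Hypothetical staging registration record; fields mirror the provenance
# items listed in the promotion flow above.
candidate = {
    "model_version": "v003",
    "artifact_uri": "s3://<bucket>/artifacts/runs/v003/model.tar.gz",
    "git_sha": "<commit-sha>",
    "training_run_id": "<mlflow-run-id>",
    "image_tag": "ml-inference:<tag>",
    "stage": "staging",
}
# Promotion then amounts to repointing production's MODEL_ARTIFACT_URI
# (and optional MODEL_VERSION) at candidate["artifact_uri"] after gates
# pass and approval is granted.
```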