moshezvi/mlpipeline


mlpipeline

Minimal end-to-end ML pipeline take-home project with two clearly separated delivery paths:

  • Training path: reproducible training, versioned artifacts, experiment logging, and a bridge toward continuous retraining.
  • Inference path: lightweight Flask serving, model-agnostic deploy image, and runtime model selection.

The notebook walkthrough (docs/01_notebook_local.md) is retained as an exploratory/legacy path; the current maintained execution paths are training/train.py and api/app.py with CI workflows.

Docs index

  • docs/QUICKSTART.md: fast local run commands
  • docs/01_notebook_local.md: original notebook-only local walkthrough
  • docs/02_training.md: training flow, artifacts, metrics, manifest
  • docs/03_inference.md: API behavior, model loading, Docker paths
  • docs/04_monitoring.md: monitoring/observability design
  • docs/05_tradeoffs.md: trade-offs and limitations
  • .github/workflows/validation.yml: repo-wide quality gate
  • .github/workflows/train.yml: training CI + submit workflow
  • .github/workflows/inference.yml: inference CI + image release workflow
  • .github/workflows/promote.yml: mock staging-to-production model promotion workflow

Architecture at a glance

flowchart TD
    codePush[Code push or PR] --> validation[validation.yml quality gate]
    validation --> trainFlow[train.yml]
    validation --> inferFlow[inference.yml]

    subgraph trainFlow [Training]

      subgraph trainValidate [Validate]
        trainChecks[PR/Push training checks]
        trainTests[Training unit tests]
        trainSmoke["Training CI smoke run (test-only, no ECR push)"]
        trainChecks --> trainTests --> trainSmoke
      end

      subgraph trainRelease [Release]
        trainDispatch[workflow_dispatch release handoff]
        trainContract[training_submission.json]
        trainManaged[Managed training job contract design-level]
        trainDispatch --> trainContract --> trainManaged
      end

    end

    subgraph inferFlow [Inference]

      subgraph inferValidate [Validate]
        inferChecks[PR/Push inference checks]
        inferTests[Inference unit tests]
        inferCiBuild["Inference CI image build (test-only, no ECR push)"]
        inferChecks --> inferTests --> inferCiBuild
      end

      subgraph inferRelease [Release]
        inferDispatch[workflow_dispatch release handoff]
        inferBuildRelease[Build/tag inference Docker image]
        inferMeta[inference_build_metadata.json]
        inferDispatch --> inferBuildRelease --> inferMeta
      end
    end

    localDocker[Dockerfile.local] --> localVerify[Locally test /predict & /health]
    trainContract --> runtimeModel[Runtime model reference]
    runtimeModel --> promoteFlow["Promote to prod (manual/gated)"]
    promoteFlow --> InferenceEndpoint
    inferFlow --> InferenceEndpoint[Create/update inference endpoint]

Training path

What this path does

  • Entry point: training/train.py.
  • Trains a regression model from synthetic data or --data-uri input.
  • Emits immutable versioned runs and updates a latest alias.
  • Logs params/metrics/artifacts to MLflow (local file backend by default).

Outputs

Per run under runs/artifacts/runs/vNNN/:

  • sample_model.joblib
  • metrics.json
  • model_version.txt
  • manifest.json

Active alias under runs/artifacts/latest/ mirrors the same contract.
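The per-run and latest-alias contract can be checked mechanically. A minimal sketch (file names are taken from the list above; `missing_artifacts` is a hypothetical helper, not part of the repo):

```python
from pathlib import Path

# Contract files every versioned run (and the latest alias) is expected
# to contain, per the artifact list above.
CONTRACT_FILES = (
    "sample_model.joblib",
    "metrics.json",
    "model_version.txt",
    "manifest.json",
)

def missing_artifacts(run_dir: str) -> list:
    """Return the contract files absent from run_dir; an empty list means valid."""
    root = Path(run_dir)
    return [name for name in CONTRACT_FILES if not (root / name).is_file()]
```

A check like this fits naturally into the CI smoke run, failing fast if a run directory drifts from the contract.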

CI and orchestration alignment

  • Validate (PR/push): train.yml runs training checks, training unit tests, and a small CI smoke run (test-only, no ECR push).
  • Release handoff: workflow_dispatch runs the local training path or emits the AWS training submission contract that a managed training release would use.
  • Release handoff artifact: training_submission.json.

Continuous retraining direction

The repository is set up to evolve toward continuous retraining via:

  • external data input (--data-uri),
  • versioned artifact + manifest contract,
  • documented release handoff contract (training_submission.json + managed training job design-level handoff),
  • explicit SageMaker job/pipeline contract items in Productization section.

Inference path

What this path does

  • Entry point: api/app.py.
  • Endpoints:
    • GET /health
    • POST /predict
  • Loads model artifacts via api/model_loader.py from either:
    • local MODEL_DIR (default runs/artifacts/latest), or
    • runtime MODEL_ARTIFACT_URI (including S3/local tarball flow).
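The precedence between those two sources can be sketched as follows (illustrative only; the actual logic lives in api/model_loader.py and may differ in detail):

```python
import os

def resolve_model_source(env=None):
    """Pick the model source: a runtime MODEL_ARTIFACT_URI (S3 or local
    tarball) takes precedence; otherwise fall back to a local MODEL_DIR,
    defaulting to runs/artifacts/latest."""
    env = os.environ if env is None else env
    uri = env.get("MODEL_ARTIFACT_URI")
    if uri:
        return ("artifact_uri", uri)
    return ("model_dir", env.get("MODEL_DIR", "runs/artifacts/latest"))
```

Keeping this resolution in one place is what lets the same deploy image serve any model version via environment configuration alone.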

CI alignment

  • Validate (PR/push): inference.yml runs inference checks, inference unit tests, and an inference CI image build (test-only, no ECR push).
  • Release handoff (workflow_dispatch): builds/tags Dockerfile.inference and emits build metadata; registry push remains placeholder/design-level in this take-home.
  • Dockerfile.local is intentionally local-only, not part of inference release CI.

Containerization model

Containerization is intentionally split according to purpose:

  • Dockerfile.local: local verification image with baked local artifacts for quick /predict checks.
  • Dockerfile.inference: model-agnostic deployment image; model reference is supplied at runtime.
  • Dockerfile.training: training container contract for orchestrated training runs.

For future training in AWS, the intended flow is:

  1. build Dockerfile.training,
  2. push that image to ECR,
  3. create/start a SageMaker Training Job or SageMaker Pipeline step that references the ECR image.
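That flow ends in a CreateTrainingJob call. A design-level sketch of the request it would assemble (all names, URIs, and ARNs are caller-supplied placeholders; the instance sizing is an assumption):

```python
def training_job_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Build a SageMaker CreateTrainingJob request that references the
    Dockerfile.training image pushed to ECR. Values here are placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # ECR URI of the Dockerfile.training image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # assumption; size to the workload
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# Submission itself stays design-level in this take-home:
# boto3.client("sagemaker").create_training_job(**training_job_request(...))
```

These fields correspond to the contract items called out in the Productization section (I/O URIs, ResourceConfig, StoppingCondition, RoleArn).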

Monitoring and observability

Quality signals come from CI, while operational/model observability is runtime-focused:

  • CI monitoring signals
    • validation.yml: repo-wide lint + full tests.
    • train.yml: training path quality + submit metadata contract.
    • inference.yml: inference path quality + deploy image build contract.
  • Runtime monitoring signals
    • structured JSON logs from api/structured_logging.py (request_id, latency_ms, predict_ms, status/error events, model_version),
    • log shipping via runtime log drivers/agents to CloudWatch,
    • design for API SLOs, model quality checks, and drift checks in docs/04_monitoring.md.
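A minimal sketch of such a JSON formatter (field names mirror the README's description of api/structured_logging.py; the real module may differ):

```python
import json
import logging

class JsonLogFormatter(logging.Formatter):
    """Render each record as one JSON object per line, carrying the
    request-scoped fields listed above when the caller supplies them."""
    FIELDS = ("request_id", "latency_ms", "predict_ms", "model_version")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)
```

Callers pass the fields via `extra={"request_id": ..., "latency_ms": ...}`; because each line is self-contained JSON, a log driver/agent can ship it to CloudWatch unchanged.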

Quick local commands

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

python training/train.py --output-dir runs/artifacts
python -m api.app

# Local verification image
docker build -f Dockerfile.local -t mlpipeline-api:local .
docker run --rm -p 8080:8080 mlpipeline-api:local

Prediction test:

curl -X POST http://127.0.0.1:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"age": 42, "income_k": 88.0, "tenure_years": 6}'

Deliverables status

  • Reproducible training entrypoint: training/train.py
  • Model artifact + params + metrics logging: MLflow (local file backend)
  • Prediction API with validation + model version response: api/app.py
  • Dockerized local verification path: Dockerfile.local
  • Model-agnostic deploy image path: Dockerfile.inference
  • CI split by responsibility: validation.yml, train.yml, inference.yml
  • Mock staging-to-production model promotion workflow: .github/workflows/promote.yml
  • Monitoring design: docs/04_monitoring.md

Productization considerations

Current choices in this repository:

  • GitHub Actions is the primary orchestrator here for simplicity and transparency.
  • Training orchestration can later move to a dedicated managed orchestrator, with Actions remaining the validation and release-trigger glue.
  • Packaging training in Dockerfile.training keeps dependencies and runtime behavior consistent across local, CI, and managed compute, avoiding host-environment drift.
  • Inference delivery intentionally favors model-agnostic images + runtime model references for operational simplicity.

Plan for productization

  1. Move training execution to managed jobs on AWS (SageMaker Training Jobs or SageMaker Pipeline training step).
  2. Build and publish training image to ECR with explicit training job contract (I/O URIs, ResourceConfig, StoppingCondition, RoleArn, Region, optional VPC).
  3. Attach managed experiment tracking/lineage (SageMaker Experiments or managed MLflow).
  4. Publish model artifacts to a registry with staged promotion.
  5. Gate promotions with quality checks + integration smoke tests.
  6. Deploy inference service with environment-specific runtime model references and secrets.
  7. Add production observability: latency/error SLOs, drift checks, post-deploy quality monitoring.

Stage-to-prod promotion model

Promotion should move an immutable model artifact/version from staging to production, not a mutable latest pointer.

  1. Train and package a candidate artifact with manifest.json and quality metrics.
  2. Register the staging candidate with provenance (model_version, artifact URI, git SHA, training run id, image tag).
  3. Run stage gates (metric quality evaluation, API contract smoke checks, deployment health checks).
  4. Require manual approval before production promotion.
  5. Promote by updating production runtime model reference (MODEL_ARTIFACT_URI and optional MODEL_VERSION) to the approved artifact.
  6. Run post-promotion verification (/health, latency/error SLOs) and roll back to the prior artifact reference if checks fail.

This keeps deployment artifacts immutable and auditable, while avoiding per-model inference image rebuilds.
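The metric gate in step 3 can be sketched as a pure check against thresholds (metric names and the lower-is-better convention are assumptions; wire in whatever metrics.json actually records):

```python
def metric_gate(candidate_metrics, max_allowed):
    """Return (passed, failures): every lower-is-better metric must be
    present and not exceed its threshold before the candidate reaches
    manual approval. Metric names here are hypothetical examples."""
    failures = {}
    for name, limit in max_allowed.items():
        value = candidate_metrics.get(name)
        if value is None or value > limit:
            failures[name] = value
    return (not failures, failures)
```

Because the gate is a pure function of recorded metrics, the same check can run in the promote.yml workflow and be replayed locally when auditing a promotion decision.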
