CLI-first ML pipeline template. Train → eval-against-baseline → serve, tracked in MLflow, driven by one machine-readable binary. The house style for a set of production ML reference repos — rl-studio, vision-pipeline, and ml-pipeline all derive from it.
The ML here is deliberately trivial (a tabular classifier on iris). The point is the operational shell: a CLI an agent or a human can drive end-to-end,
--json everywhere, load-bearing exit codes, MLflow as the single source of truth, and an honest baseline reported with every metric. The domain repos swap the model; they keep the shell.
Most ML demos are a notebook that works once on the author's laptop. This is the opposite: a script-not-a-ritual pipeline that runs the same way in CI, in an agent loop, and on a fresh checkout.
- CLI-first: every capability is an mlt subcommand. No notebook-only happy paths.
- --json on every command: output is a contract, so tools and agents compose it.
- Exit codes mean something: 0 ok, non-zero failure with one line on stderr.
- MLflow is the source of truth: falls back to a local file:./mlruns store with zero services running.
- Baseline with every metric: a model that doesn't beat its baseline is a finding, not a number to hide.
- Marimo .py, never .ipynb: notebooks that diff, grep, and edit like source.
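
The output and exit-code rules above can be sketched as one small helper. This is an illustration under stated assumptions, not the code in this repo: the `emit` function and the `{"ok": ..., "error": ...}` result shape are hypothetical, and whether a failing command also prints JSON in `--json` mode is assumed here.

```python
import json
import sys


def emit(result: dict, as_json: bool, stdout=sys.stdout, stderr=sys.stderr) -> int:
    """Render one command result and return the process exit code.

    `result` uses a hypothetical shape: {"ok": bool, ...payload, "error": str}.
    """
    ok = bool(result.get("ok"))
    if as_json:
        # Machine mode: exactly one JSON object on stdout (assumed for failures too).
        stdout.write(json.dumps(result, sort_keys=True) + "\n")
    elif ok:
        # Human mode: short "key value" lines, skipping the ok flag itself.
        for key, value in result.items():
            if key != "ok":
                stdout.write(f"{key} {value}\n")
    if not ok:
        # Failure: a single line on stderr, so newlines in the message are flattened.
        message = str(result.get("error", "unknown failure")).replace("\n", " ")
        stderr.write(f"error: {message}\n")
    return 0 if ok else 1
```

A command entry point would then end with something like `sys.exit(emit(result, as_json=args.json))`, so the exit code and the printed output always come from the same result.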
uv sync --extra dev # install
uv run mlt doctor # environment readiness check (--json for CI)
uv run mlt train configs/iris.yaml
uv run mlt infer iris-rf --features 6.3,3.3,6.0,2.5
The block above is marked <!-- ci-test -->: CI runs these exact commands on every push, so this quickstart cannot silently drift from the code.
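
One way such a check could pull the commands out of the README is sketched below. It is not the repo's actual script: it assumes the marker sits immediately before a fenced code block, and `ci_test_commands` is a hypothetical name. The comment stripping is deliberately naive.

```python
import re

MARKER = "<!-- ci-test -->"
FENCE = "`" * 3  # built at runtime so this sketch can itself live in a fenced block


def ci_test_commands(readme_text: str) -> list[str]:
    """Return the commands found in fenced blocks that directly follow MARKER."""
    pattern = re.compile(
        re.escape(MARKER) + r"\s*" + FENCE + r"[^\n]*\n(.*?)" + FENCE,
        re.DOTALL,
    )
    commands = []
    for block in pattern.findall(readme_text):
        for line in block.splitlines():
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # skip blank lines and comment-only lines
            # Drop a trailing " # annotation" (naive: ignores quoting).
            commands.append(stripped.split(" #", 1)[0].strip())
    return commands
```

Each returned string could then be run with `subprocess.run(shlex.split(cmd), check=True)`, so any command that exits non-zero fails the build.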
Output of train (human mode):
trained iris-rf (random_forest)
accuracy 0.9667 (baseline 0.3, lift +0.6667)
model -> artifacts/iris-rf/model.joblib
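
The lift in that output is the model's accuracy minus the baseline's (0.9667 - 0.3 = 0.6667). The sketch below shows the arithmetic, assuming a majority-class baseline; the source does not say which baseline the pipeline uses, and both function names are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most common training label (assumed baseline)."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


def lift(accuracy: float, baseline: float) -> float:
    """Absolute improvement over the baseline, rounded to 4 places for reporting."""
    return round(accuracy - baseline, 4)
```

A lift at or below zero means the model is no better than the trivial predictor, which is the finding the baseline exists to surface.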
Everything is also available via make: make demo runs the full train→infer loop.
mlt doctor [--json] # is this environment ready?
mlt train <config> [--out] [--json] # train, eval vs baseline, log to MLflow
mlt infer <name> [--features] [--json]
mlt version [--json]
Tracking UI (optional): make up starts MLflow on localhost:5050, then
export MLFLOW_TRACKING_URI=http://localhost:5050.
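
The fallback to a local store can be pictured as a single lookup. This is a minimal sketch, assuming the pipeline picks its tracking URI this way; `resolve_tracking_uri` is a hypothetical helper, not a function from this repo or from MLflow.

```python
import os

DEFAULT_TRACKING_URI = "file:./mlruns"  # local file store, no services required


def resolve_tracking_uri(env=None) -> str:
    """Use MLFLOW_TRACKING_URI when set and non-empty, else the local file store."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or DEFAULT_TRACKING_URI
```

The result would typically be passed to `mlflow.set_tracking_uri(...)` before any run is started.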
uv run marimo edit notebooks/01_explore.py # feature distributions, class balance
uv run marimo edit notebooks/02_metrics.py # compare MLflow runs vs baseline

| Path | Status |
|---|---|
| mlt train / infer on CPU | ✅ verified |
| pytest smoke suite + ruff in CI | ✅ verified |
| MLflow local sqlite store | ✅ verified |
| MLflow server via docker-compose | 🟡 compose provided, runs locally |
Fork it, replace src/mlt/lib/pipeline.py with your domain (training,
evaluation, serving), keep the CLI / output / tracking shell. The three domain
repos linked above show exactly that, for RL fine-tuning, computer vision, and
classic ML.
Every command is non-interactive, emits a single JSON object with --json, and returns a load-bearing exit code — so AI coding agents (Codex, Claude Code, Cursor, Copilot, Windsurf, …) and plain scripts can drive the full train → eval → serve loop and parse results with no TTY, no UI, no screen-scraping.
mlt train configs/iris.yaml --json # -> {"ok": true, "metrics": {...}} exit 0
Agent instructions live in AGENTS.md, the cross-tool standard. CLAUDE.md is a symlink to it, so every tool reads one source of truth.
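
A script or agent can consume that contract with only the standard library. The wrapper below is a sketch: `run_json` is a hypothetical helper, and the fallback payload it builds when stdout is not JSON is an assumption, not something the CLI emits.

```python
import json
import subprocess


def run_json(argv: list[str], timeout: float = 600.0) -> tuple[int, dict]:
    """Run a command non-interactively; return (exit_code, parsed JSON payload)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        # No parseable JSON on stdout: surface the stderr line instead.
        payload = {"ok": False, "error": proc.stderr.strip() or "no JSON on stdout"}
    return proc.returncode, payload
```

For example, `code, result = run_json(["mlt", "train", "configs/iris.yaml", "--json"])` lets the caller branch on `code` first and read `result["metrics"]` only when it is 0.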
Most repos' CI checks that the code parses. This one checks that the pipeline works — three things beyond lint + tests, all stdlib, no extra deps:
- It runs the pipeline and publishes the numbers. Every push trains the model and posts a live metrics table to the GitHub Actions run summary (scripts/ci_report.py). The numbers in CI are produced on that commit, not pasted by hand.
- It keeps the docs honest. The Quickstart block is marked <!-- ci-test --> and scripts/test_readme.py runs those exact commands in CI. Docs that drift from the code fail the build.
- It proves determinism. scripts/check_repro.py trains twice and asserts identical metrics: a seed is a promise, and CI verifies the promise holds.
Run them locally too: make summary, make readme, make repro.
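
The determinism check reduces to "same seed in, same metrics out". The sketch below shows that idea with a stand-in trainer. `check_repro` and `toy_train` are hypothetical names, and the real script trains the actual model rather than drawing a random number.

```python
import random


def check_repro(train_fn, seed: int) -> dict:
    """Train twice with the same seed and fail loudly if the metrics differ."""
    first = train_fn(seed)
    second = train_fn(seed)
    if first != second:
        raise AssertionError(f"non-deterministic metrics: {first} != {second}")
    return first


def toy_train(seed: int) -> dict:
    """Stand-in for a training run: its only randomness comes from a seeded RNG."""
    rng = random.Random(seed)
    return {"accuracy": round(rng.random(), 4)}
```

Comparing metrics for exact equality is intentional here: a tolerance would hide the small drifts that usually point to an unseeded source of randomness.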
Apache-2.0.