Predict used-car prices (PLN) from tabular specs like mileage, year, engine, transmission, drivetrain, make/model, etc.
Built with scikit-learn pipelines (imputation + one-hot encoding), `HistGradientBoostingRegressor`, a log-transformed target for stability, and a ready-to-run FastAPI service with a lightweight `/dashboard`.
- Preprocessing pipeline: numeric & categorical, missing-value imputation, OneHot (`handle_unknown="ignore"`).
- Model: `HistGradientBoostingRegressor` wrapped in `TransformedTargetRegressor` (log target).
- Evaluation: MAE & RMSE (+ optional 5-fold OOF CV).
- Reports: parity, residuals, histogram, permutation importance (saved to `reports/`).
- API: `/predict`, `/health`, and a Jinja2 `/dashboard` showing the estimate with ±MAE/±RMSE bands.
- Artifacts: model saved as `models/*.joblib` + `*.meta.json` (columns + metrics).
- Docker Compose: train and serve with one command.
.
├── data/ # input CSV (e.g., cars_5m.csv.gz)
├── models/ # trained models + meta.json
├── reports/ # evaluation plots
├── src/ # ML code (main.py, pipeline.py, _helpers.py)
├── app/ # FastAPI app (main.py, templates/)
├── docker-compose.yml # train + api services
├── requirements.txt
└── README.md
1) Train the model (once):

    docker compose run --rm train

2) Run only the API + dashboard:

    docker compose up -d api
    # open http://localhost:8001/dashboard

3) Or run everything:

    docker compose up --build

If a model file already exists, training is skipped. Use `--force` to retrain.
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt
    # train
    python -m src.main --data data/cars_5m.csv.gz --out models/carworth_hgbr_v1.joblib --plots
    # serve
    uvicorn app.main:app --reload --port 8001

Key flags for `src.main`:
- `--data <path>` – input CSV
- `--out <path>` – output model `.joblib`
- `--plots` – save plots into `reports/`
- `--cv` – 5-fold OOF (can be heavy for big data)
- `--force` – retrain even if a model exists
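For readers who want to see how these flags might fit together, here is a small sketch using `argparse`. It is not the actual parser in `src/main.py`: the function names `build_parser` and `should_train`, and treating `--data` and `--out` as required, are assumptions made for illustration. The skip-unless-forced rule mirrors the behaviour described for existing model files.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="src.main")
    # Assumption: both paths are mandatory.
    parser.add_argument("--data", type=Path, required=True, help="input CSV")
    parser.add_argument("--out", type=Path, required=True, help="output model .joblib")
    parser.add_argument("--plots", action="store_true", help="save plots into reports/")
    parser.add_argument("--cv", action="store_true", help="5-fold OOF CV")
    parser.add_argument("--force", action="store_true", help="retrain even if a model exists")
    return parser


def should_train(out_path: Path, force: bool) -> bool:
    """Train when forced, or when no model file exists at out_path yet."""
    return force or not out_path.exists()
```

With this shape, `should_train(args.out, args.force)` would be the single check that decides whether training runs or is skipped.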
Request (API names are mapped to training names):
- `brand` → `manufacturer`
- `mileage_km` → `odometer_km`
- `power_hp` → `engine_power_hp`
- `car_age` is derived from `year` automatically.

    {
      "year": 2016,
      "mileage_km": 98000,
      "power_hp": 120,
      "brand": "Volkswagen",
      "model": "Golf",
      "fuel": "petrol",
      "transmission": "manual",
      "drivetrain": "fwd",
      "body_type": "hatchback",
      "condition": "used",
      "city": "Warszawa",
      "country": "PL"
    }

Response:

    { "price": 27009.77 }

`/dashboard`: simple form UI to get a quote with ±MAE/±RMSE bands (metrics read from `*.meta.json`).

`/health`: basic info about the loaded model (file name, required columns count, API→TRAIN mapping).
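The API→TRAIN mapping described above could look roughly like this. It is a sketch under assumptions, not the code in `app/main.py`: the names `API_TO_TRAIN` and `to_training_row` are hypothetical, and `car_age` is assumed to be the current year minus `year`, clamped at zero.

```python
from datetime import date
from typing import Any, Dict, Optional

# Renames taken from the mapping listed in this README.
API_TO_TRAIN = {
    "brand": "manufacturer",
    "mileage_km": "odometer_km",
    "power_hp": "engine_power_hp",
}


def to_training_row(
    payload: Dict[str, Any], current_year: Optional[int] = None
) -> Dict[str, Any]:
    """Rename API fields to training names and derive car_age from year."""
    year_now = current_year if current_year is not None else date.today().year
    # Fields without a mapping entry keep their name.
    row = {API_TO_TRAIN.get(key, key): value for key, value in payload.items()}
    if row.get("year") is not None:
        # Assumed definition of car_age; the service may compute it differently.
        row["car_age"] = max(0, year_now - int(row["year"]))
    return row
```

Passing `current_year` explicitly keeps the derivation reproducible in tests; the service itself would presumably use the current date.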
Environment variables (API):
- `MODEL_PATH` – defaults to `/models/carworth_hgbr_v1.joblib`
- `MODEL_META` – defaults to the same path with `.meta.json`
- `METRIC_MAE`, `METRIC_RMSE` – override metrics from meta (optional)
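As an illustration of how the metric overrides and the ±MAE/±RMSE bands might combine, consider the sketch below. The meta layout (`{"metrics": {"mae": ..., "rmse": ...}}`), the function names, and the clamping of the lower bound at zero are assumptions; only the environment variable names come from this README.

```python
import os
from typing import Any, Dict, Mapping, Optional, Tuple


def resolve_metrics(
    meta: Mapping[str, Any], env: Mapping[str, str] = os.environ
) -> Dict[str, Optional[float]]:
    """Prefer METRIC_MAE / METRIC_RMSE from the environment, else use meta."""
    stored = meta.get("metrics", {})  # assumed key inside *.meta.json
    resolved: Dict[str, Optional[float]] = {}
    for key, env_name in (("mae", "METRIC_MAE"), ("rmse", "METRIC_RMSE")):
        raw = env.get(env_name, stored.get(key))
        resolved[key] = float(raw) if raw is not None else None
    return resolved


def price_bands(
    price: float, mae: float, rmse: float
) -> Dict[str, Tuple[float, float]]:
    """Return (low, high) ranges around the estimate; low is clamped at 0."""
    return {
        "mae": (max(0.0, price - mae), price + mae),
        "rmse": (max(0.0, price - rmse), price + rmse),
    }
```

For the sample response of 27009.77 PLN and the test metrics quoted in this README, the ±MAE band would span roughly 23,277 to 30,743 PLN.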
Test → MAE ≈ 3,733 PLN | RMSE ≈ 7,470 PLN
Exact numbers depend on the dataset; the above reflects the sample in this repo.
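For reference, the two metrics are defined as below. This is a plain-Python sketch of the standard formulas, assuming predictions are compared with true prices on the original PLN scale; the project itself most likely uses library implementations.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: the average size of the miss, in PLN."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: like MAE, but large misses weigh more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

RMSE is never smaller than MAE on the same data, and a gap as wide as the one reported here usually points to a minority of cars with large errors.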
Plots written to `reports/`:
- `parity_test.png`
- `residuals_test.png`
- `residuals_hist_test.png`
- `perm_importance_test.png`
Generate a large synthetic dataset:
    python scripts/generate_cars_dataset.py \
      --rows 5_000_000 \
      --out data/cars_5m.csv.gz \
      --format csv \
      --chunksize 250_000 \
      --seed 1

- `columns are missing: {...}` – Provide the missing fields or fix the API→TRAIN mapping; ensure imputers are in the pipeline and `OneHotEncoder(handle_unknown="ignore")` is used.
- `*.meta.json` not found – Retrain with `--force` to generate meta (used by `/dashboard` and `/health`).
- Slow CV/plots – Avoid `--cv`/`--plots` on the full 5M rows or sample down.
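One possible way to sample down is to stream the gzipped CSV and keep each row with a fixed probability, so the full file never has to fit in memory. The helper below is a hypothetical sketch using only the standard library; `sample_csv_gz` and its parameters are not part of this repo.

```python
import csv
import gzip
import random


def sample_csv_gz(src: str, dst: str, fraction: float, seed: int = 1) -> int:
    """Copy the header plus a random ~fraction of rows from src to dst.

    Returns the number of data rows written.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept = 0
    with gzip.open(src, "rt", newline="", encoding="utf-8") as fin, \
            gzip.open(dst, "wt", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:  # empty input: nothing to copy
            return 0
        writer.writerow(header)
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)
                kept += 1
    return kept
```

For example, `sample_csv_gz("data/cars_5m.csv.gz", "data/cars_sample.csv.gz", 0.05)` would keep about 5% of the rows; the output file name here is only an example.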
