Production-oriented ML repository for product price prediction using text, image, and numeric signals.
This project contains:
- an end-to-end offline training pipeline,
- an inference pipeline and submission generation,
- a bundle-backed online serving API,
- a registry, promotion, and deployment state layer,
- a live Hugging Face Space deployment path,
- CI quality gates and developer onboarding assets.
The system predicts product price from multimodal inputs:
- text content (titles/descriptions),
- image representations,
- parsed numeric features (quantity/unit and derived signals).
Primary metric: SMAPE (lower is better).
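For reference, SMAPE can be computed as in the minimal sketch below. It assumes the common definition that divides by the mean of the absolute actual and predicted values and reports a percentage between 0 and 200. The metric code under src/training/ may use a different scaling or zero-handling rule, and the `eps` guard is an illustrative assumption.

```python
from typing import Sequence


def smape(y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-9) -> float:
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        # Denominator is the mean magnitude of the pair; eps (an assumption)
        # avoids division by zero when both values are 0.
        denom = (abs(actual) + abs(predicted)) / 2.0
        total += abs(predicted - actual) / max(denom, eps)
    return 100.0 * total / len(y_true)
```

A perfect prediction scores 0, and predicting 0 for a non-zero price scores the maximum of 200.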
main.py            # CLI entrypoint (train/inference/features/ensemble/quickrun)
configs/           # YAML configs for training, inference, models, and features
src/
  data/            # Data loading, parsing, text cleaning
  features/        # Text/image/numeric feature builders and reducers
  models/          # Model wrappers (Linear/RF/LGBM/XGB/Cat/etc.)
  training/        # CV utilities, trainer, metrics
  inference/       # Predict and postprocess pipeline
  pipelines/       # Train/infer/feature/ensemble orchestrators
  serving/         # FastAPI serving and live service validation
ci_cd/tests/       # CI test suite
docs/              # Handover, setup, and workflow docs
experiments/       # Artifacts: bundles, registry state, reports, submissions
Detailed flow (Mermaid):
flowchart LR
A[Raw Data\ntrain.csv / test.csv] --> B[src/data\nLoad + Parse + Clean]
B --> C[src/features\nText/Image/Numeric Features]
C --> D[src/pipelines/train_pipeline.py\nTraining Orchestration]
D --> E[src/training + src/models\nCV Models + Optional Stacker]
E --> F[experiments/\nModels + OOF + Reports]
C --> G[src/pipelines/inference_pipeline.py\nBatch Inference]
F --> G
G --> H[src/inference/postprocess.py\nClip / Round / Submission Build]
H --> I[data/submission + experiments/submissions\nPrediction CSV]
C --> J[src/serving/app.py\nFastAPI Online Serving]
F --> J
J --> K["/healthz /readyz /v1/predict"]
L[main.py CLI] --> D
L --> G
L --> J
M[CI: .github/workflows/ci.yml] --> N[compileall]
M --> O[pytest ci_cd/tests]
M --> P[python main.py --help]
- Data ingestion and normalization: src/data/
- Multimodal feature construction: src/features/
- Pipeline orchestration: src/pipelines/
- Model training and ensembling: src/training/, src/models/
- Offline inference and output formatting: src/inference/
- Online API serving: src/serving/app.py
- Entry-point command interface: main.py
- Quality gates and regression checks: ci_cd/tests/, .github/workflows/ci.yml
- Load dataset
- Parse/clean features
- Build multimodal feature matrix
- Optional dimensionality reduction
- CV training for base models
- Build OOF matrix and optional stacker
- Persist artifacts and reports
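The cross-validation and OOF steps in the list above can be illustrated with a small, dependency-free sketch. It is not the code in src/training/: the fold layout, the toy mean-price model, and the names `kfold_indices`, `fit_mean_model`, and `build_oof` are all hypothetical. It only shows how each row of the OOF matrix is filled by a model that never saw that row.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], float]
FitFn = Callable[[list, list], Predictor]


def kfold_indices(n_samples: int, n_folds: int) -> List[List[int]]:
    """Split range(n_samples) into n_folds contiguous validation folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_mean_model(X: list, y: list) -> Predictor:
    """Toy base model (assumption): always predicts the training-fold mean."""
    mean_price = fmean(y)
    return lambda row: mean_price


def build_oof(X: list, y: list, fit_fns: List[FitFn], n_folds: int = 5) -> List[List[float]]:
    """Return an n_samples x n_models matrix of out-of-fold predictions."""
    oof = [[0.0] * len(fit_fns) for _ in y]
    for val_idx in kfold_indices(len(y), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        for m, fit in enumerate(fit_fns):
            model = fit(X_train, y_train)
            for i in val_idx:
                # Row i was excluded from this model's training data.
                oof[i][m] = model(X[i])
    return oof
```

A stacker would then be trained on the columns of this matrix against `y`, which is why the matrix must only contain predictions for held-out rows.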
- Load inference CSV
- Rebuild features with saved/cached transforms
- Load fold models + stacker
- Predict + postprocess
- Write output CSV
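The postprocess step is described elsewhere in this README as clip and round. A minimal sketch of that idea follows. The floor, ceiling, and number of decimals are illustrative assumptions, and `postprocess_prices` is a hypothetical name rather than the function in src/inference/postprocess.py.

```python
import math
from typing import Iterable, List


def postprocess_prices(
    raw: Iterable[float],
    min_price: float = 0.01,      # assumption: prices must be positive
    max_price: float = math.inf,  # assumption: no upper cap by default
    decimals: int = 2,            # assumption: currency-style rounding
) -> List[float]:
    """Clip each raw prediction into [min_price, max_price], then round."""
    cleaned = []
    for value in raw:
        clipped = min(max(value, min_price), max_price)
        cleaned.append(round(clipped, decimals))
    return cleaned
```

Clipping before rounding keeps negative or extreme regressor outputs from reaching the submission file.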
FastAPI service in src/serving/app.py:
- GET /healthz
- GET /readyz
- GET /service/info
- GET /metrics/json
- POST /v1/warmup
- POST /v1/predict
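A Python client for POST /v1/predict might look like the sketch below. The request body mirrors the curl example later in this README. Treating `unique_identifier`, `Description`, and `image_path` as required fields is an assumption, the helper names are hypothetical, and the response is returned as parsed JSON because its schema is not documented here.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumption: field names taken from the README's curl example.
RECORD_FIELDS = ("unique_identifier", "Description", "image_path")


def build_predict_payload(records: List[Dict[str, Any]]) -> bytes:
    """Serialize records into the JSON body expected by /v1/predict."""
    for record in records:
        missing = [key for key in RECORD_FIELDS if key not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
    return json.dumps({"records": records}).encode("utf-8")


def predict(base_url: str, records: List[Dict[str, Any]], timeout: float = 10.0) -> Any:
    """POST records to the serving API and return the decoded JSON reply."""
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/predict",
        data=build_predict_payload(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, `predict("http://127.0.0.1:8000", records)` sends the same request as the curl command.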
- Python 3.10+
- pip
python -m pip install --upgrade pip
pip install -r requirements.txt

python -m compileall src main.py
pytest -q ci_cd/tests

python main.py --help

python main.py train --config configs/training/final_train.yaml
Train single model only:
python main.py train --config configs/training/final_train.yaml --model lgbm

python main.py features --config configs/features/all_features.yaml
python main.py inference --config configs/inference/inference.yaml
python main.py ensemble --config configs/model/ensemble.yaml
python main.py quickrun

Create local env file first:
cp .env.example .env
On Windows PowerShell:
Copy-Item .env.example .env
Important: if TEXT_METHOD=tfidf, TFIDF_VECTORIZER_PATH must point to a fitted vectorizer artifact from training.
Start API:
uvicorn src.serving.app:app --host 0.0.0.0 --port 8000
Health and readiness:
curl http://127.0.0.1:8000/healthz
curl http://127.0.0.1:8000/readyz
Prediction example:
curl -X POST "http://127.0.0.1:8000/v1/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "records": [
      {"unique_identifier": 1, "Description": "Organic green tea 20 bags", "image_path": ""}
    ]
  }'

Typical generated artifacts:
- experiments/runs/<run_id>/bundle/ (immutable run-scoped model bundle)
- experiments/registry/index.json (registry state and active production pointer)
- experiments/registry/deployment_manifest.json (verified deployment record)
- experiments/registry/production_tracker.json (active production metadata and metrics)
- experiments/oof/ (OOF matrix, model names)
- experiments/reports/ (comparison and stacker summaries)
- experiments/submissions/ (prediction files)
Current checked-in production state points to:
- active production run: train_20260325T155219Z
- strategy: hf_space
- status: active
- deployment URL: https://arpitkumariitkgp-aml25.hf.space
GitHub Actions workflow in .github/workflows/ci.yml runs:
- dependency installation,
- syntax gate (compileall),
- test gate (pytest -q ci_cd/tests),
- CLI smoke gate (python main.py --help).
Use YAML configs under configs/:
- configs/training/ for training runs and CV behavior,
- configs/inference/ for inference inputs/outputs,
- configs/model/ for model-specific hyperparameters,
- configs/features/ for feature settings.
The CLI automatically supports nested config sections (for example, training: and inference: blocks).
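One way such nested-section handling could work is sketched below on an already-parsed config dictionary. The merge rule (top-level scalars act as shared defaults, nested keys win on conflict) and the name `resolve_section` are assumptions for illustration. The actual behavior of main.py may differ.

```python
from collections.abc import Mapping
from typing import Any, Dict


def resolve_section(config: Mapping, section: str) -> Dict[str, Any]:
    """Return flat settings from a config that may nest them under `section`."""
    nested = config.get(section)
    if not isinstance(nested, Mapping):
        # No nested block: the config is already flat.
        return dict(config)
    # Assumption: top-level scalar keys are shared defaults, and other
    # nested blocks (for example inference:) are ignored for this section.
    shared = {k: v for k, v in config.items() if not isinstance(v, Mapping)}
    return {**shared, **nested}
```

For example, a file with a top-level `seed` plus `training:` and `inference:` blocks resolves to the `training:` keys with `seed` filled in unless the block overrides it.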
- Designed as a modular ML codebase with a real production-like release path.
- Current serving stack is FastAPI with bundle-backed loading, registry-aware deployment state, and a Hugging Face Space deployment target.
- The most credible current portfolio story is: train a run-scoped bundle, promote the run, deploy it to Hugging Face Space, verify live endpoints, then persist production state.
To move from challenge-grade ML workflows to production-grade, hyper-scalable architecture, the planned target includes:
- Feature Store integration (Feast/Hopsworks) for online/offline parity and point-in-time correctness.
- Distributed training + Bayesian HPO (Ray/Kubeflow + Optuna/Ray Tune) for stronger model search quality.
- Asynchronous inference architecture (Redis/RabbitMQ/Kafka) with task_id-based queue + worker execution.
- Model registry lifecycle controls (MLflow/W&B) with Champion/Challenger promotion flow.
- Drift detection and observability automation (Evidently/Arize + metrics/alerts) with retraining triggers.
Current vs target trend:
- Data logic: local CSV pipelines -> distributed ETL.
- Feature management: script-level features -> governed online/offline feature store.
- Inference: synchronous REST -> async queue workers + model serving layer.
- Experimentation: local artifacts -> tracked registry lifecycle.
- Scaling: vertical scaling -> horizontal autoscaling on Kubernetes.
See:
- docs/DEVELOPER_ONBOARDING_AND_TECHNICAL_HANDOVER.md
- docs/GITHUB_SECRETS_SETUP.md
- docs/KAGGLE_COLAB_DVC_MLFLOW_GITHUB_RUNBOOK.md
Detailed execution roadmap: docs/DEVELOPER_ONBOARDING_AND_TECHNICAL_HANDOVER.md section "10) Target Future Development Plan (Gap Closure Roadmap)".
Training and inference runs are tracked through src/utils/mlflow_utils.py.
- Start local MLflow server:
./scripts/start_mlflow_server.ps1 -Port 5000
- Set tracking env vars:
$env:MLFLOW_ENABLED='1'
$env:MLFLOW_TRACKING_URI='http://127.0.0.1:5000'
- Run training experiment:
$env:PYTHONPATH='.'
python main.py train --config configs/training/final_train.yaml

Use environment variables for credentials. Do not hardcode tokens in YAML.
$env:MLFLOW_ENABLED='1'
$env:DAGSHUB_MLFLOW_ENABLED='1'
$env:DAGSHUB_REPO_OWNER='<your_dagshub_username_or_org>'
$env:DAGSHUB_REPO_NAME='A_ML_25'
$env:DAGSHUB_TOKEN='<your_dagshub_access_token>'
Then run:
$env:PYTHONPATH='.'
python main.py train --config configs/training/final_train.yaml
Expected outputs after run:
- MLflow run metadata in the generated manifest under outputs.mlflow.
- Registry linkage in experiments/registry/index.json under tracking.mlflow.
- Run visible in DagsHub experiment UI when DagsHub mode is enabled.
- Create focused changes in one subsystem.
- Add/update tests under ci_cd/tests for behavior changes.
- Run local quality checks before PR.
- Keep config and artifact paths consistent with existing conventions.
This repository is intended for ML challenge work and production-learning workflows. Ensure model/data usage follows challenge rules and organizational policy.
This repository follows a strict split:
- Git tracks: code, configs, docs, CI, and .dvc pointer metadata.
- DVC tracks: large datasets, feature caches, model artifacts, and generated experiment payloads.
- Pull code and data pointers:
git pull
dvc pull
- Run training/inference and update artifacts.
- Track changed payloads with DVC:
dvc add data/raw/train.csv data/raw/test.csv
dvc add data/processed/dimred.joblib data/processed/features.joblib
dvc add experiments/oof/oof_matrix.joblib experiments/reports/model_comparison.csv
- Commit pointers and related code/config together:
git add .
git commit -m "feat: update model and DVC pointers"
- Push data cache then code:
dvc push
git push

If you keep DagsHub credentials in .env, use the helper script so DVC commands automatically pick them up:
./scripts/dvc_with_env.ps1 push -r origin --all-commits
./scripts/dvc_with_env.ps1 pull -r origin

- .gitignore blocks large payload directories while allowing .dvc files.
- CI runs python scripts/check_repo_hygiene.py to fail PRs if binary payloads are committed directly to Git.
- If dvc push fails due to credentials, data pointers may be in Git but remote data will not be available to collaborators until the push succeeds.
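The hygiene rule described above can be pictured with the sketch below. It is not the contents of scripts/check_repo_hygiene.py: the suffix list and the function name are assumptions, and the real script may also check file size or directory.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumption: these suffixes mark payloads that belong in DVC, not Git.
PAYLOAD_SUFFIXES = {".joblib", ".pkl", ".npy", ".parquet", ".csv"}


def find_hygiene_violations(tracked_paths: Iterable[str]) -> List[str]:
    """Return Git-tracked paths that look like payloads rather than pointers."""
    violations = []
    for path in tracked_paths:
        suffix = PurePosixPath(path).suffix.lower()
        if suffix == ".dvc":
            # DVC pointer files are small text files and are allowed in Git.
            continue
        if suffix in PAYLOAD_SUFFIXES:
            violations.append(path)
    return sorted(violations)
```

In CI the input could be the output of `git ls-files`, with a non-empty result failing the job.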
Production-oriented ML delivery pipeline with training, promotion, verified deployment, and health monitoring.
Phase 3 automates the current portfolio deployment lifecycle:
Git Push -> CI Gate -> Drift Check -> Training -> Promotion Approval -> HF Space Deploy -> Live Validation -> Production State Update
- Trigger: daily at 22:00 UTC or manual workflow_dispatch
- Flow:
  - check data drift
  - pull data through DVC
  - prepare training data
  - train and validate a canonical bundle-backed run
  - register the run in experiments/registry/
  - persist workflow outputs and smoke-test artifacts
- Main output: canonical local run_id, immutable bundle, registry entry, metrics artifacts
- Supported stages: staging and production
- Flow:
  - resolve canonical run ID
  - restore durable bundle
  - validate promotion thresholds
  - require environment approval for production
  - update registry state and tag approved releases
- Important distinction: promotion means approved for deployment, not yet live
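The threshold and approval steps in the promotion flow above could be modeled as in this sketch. The SMAPE limit, the `approved_by` flag standing in for environment approval, and all names are hypothetical. The real workflow's checks are not specified in this README.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STAGES = ("staging", "production")


@dataclass(frozen=True)
class PromotionDecision:
    approved: bool
    reasons: List[str] = field(default_factory=list)


def validate_promotion(
    run_smape: float,
    stage: str,
    max_smape: float,                   # assumption: a single quality limit
    approved_by: Optional[str] = None,  # assumption: stands in for env approval
) -> PromotionDecision:
    """Approve a run for a stage only if every check passes."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    reasons = []
    if run_smape > max_smape:  # SMAPE is lower-is-better
        reasons.append(f"SMAPE {run_smape:.2f} exceeds limit {max_smape:.2f}")
    if stage == "production" and not approved_by:
        reasons.append("production promotion requires an approver")
    return PromotionDecision(approved=not reasons, reasons=reasons)
```

An approved decision here only marks the run as eligible for deployment. Making it live is a separate step.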
- Deployment target: Hugging Face Space
- Flow:
  - resolve canonical run ID
  - restore immutable bundle
  - run pre-deployment checks
  - run bundle-backed inference smoke test
  - create HF Space package
  - publish to Hugging Face Space
  - wait for /readyz and /service/info to report the expected run
  - run live prediction smoke test
  - write deployment_manifest.json
  - activate the production run in the registry
  - write production_tracker.json
  - persist verified deployment state back to Git
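The wait step in the deployment flow above is essentially a polling loop. The sketch below shows one possible shape, with the HTTP call injected as `fetch_json` so it can be tested offline. The `run_id` field in the /service/info reply, the retry counts, and the function name are assumptions.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_run(
    fetch_json: Callable[[str], Mapping[str, Any]],
    expected_run_id: str,
    attempts: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the service is ready and reports the expected run."""
    for attempt in range(attempts):
        try:
            # fetch_json is assumed to raise OSError on connection errors
            # and non-2xx replies (urllib's URLError is an OSError subclass).
            fetch_json("/readyz")
            info = fetch_json("/service/info")
        except OSError:
            info = None  # the Space may still be building
        # Assumption: /service/info exposes the active run as "run_id".
        if info is not None and info.get("run_id") == expected_run_id:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Checking the reported run, not just readiness, guards against validating a previous deployment that is still being served.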
- Frequency: every 6 hours or manual trigger
- Checks:
  - MLflow connectivity when configured
  - production model loadability
  - registry consistency
  - inference health
  - live production validation when service URL is available
- Uses the active production run and can derive the HF Space URL from workflow variables
- Builds monitoring artifacts and checks alert conditions on a schedule
Set these in Settings -> Secrets and variables -> Actions:
MLFLOW_TRACKING_URI
MLFLOW_TRACKING_USERNAME
MLFLOW_TRACKING_PASSWORD
DAGSHUB_USERNAME
DAGSHUB_TOKEN
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
HF_SPACE_TOKEN
PRODUCTION_SERVICE_BASE_URL # optional
HF_SPACE_REPO_ID # variable, optional if using the default repo id
See docs/GITHUB_SECRETS_SETUP.md for setup guidance.
Located at experiments/registry/:
index.json # run registry and active production pointer
promotion_history.jsonl # promotion audit log
deployment_manifest.json # verified deployment record
production_tracker.json # active production metadata + metrics
gh workflow run training.yml -f force_retrain=true -f model_config=configs/training/final_train.yaml

Open Actions -> Model Promotion -> Run workflow:
- enter the canonical run_id
- choose staging or production
- review promotion validation artifacts in the workflow logs
Open Actions -> Deploy Live Service -> Run workflow:
- enter the approved run_id
- enter space_repo_id such as arpitkumariitkgp/aml25
- set create_if_missing=true only when bootstrapping a new Space
Health checks run automatically every 6 hours. Use health-check.yml for manual checks and daily-monitoring.yml for scheduled monitoring artifacts.
Push/PR -> ci.yml
Schedule/manual -> training.yml -> staging registry entry
Manual approval -> promote.yml -> production-approved run
Manual deploy -> deploy.yml -> HF Space publish -> live verification -> deployment manifest + production tracker update
Scheduled/manual -> health-check.yml -> production validation
Scheduled -> daily-monitoring.yml -> monitoring artifacts and alerts
Manual rollback:
python scripts/rollback_deployment.py \
  --to-previous-production \
  --reason "Manual: detected latency spike"

Operationally, rollback uses the previous production run recorded in the deployment state and updates registry-tracked production metadata.
Key metrics to monitor:
- SMAPE or quality trend from training and promotion checks
- prediction latency (p50, p95, p99)
- error rate / exception rate
- data drift magnitude
- model freshness
- deployment success rate
Representative alert thresholds:
WARN if latency_p95 > 2.0s | CRITICAL if > 5.0s
WARN if error_rate > 0.02 | CRITICAL if > 0.05
WARN if drift > 0.10 | CRITICAL if > 0.25
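The thresholds above map directly onto a small classifier. The sketch below copies the listed limits. The metric key names, the "OK" level, and the strict greater-than comparison are assumptions about how a monitoring script might apply them.

```python
from typing import Dict, Mapping, Tuple

# (warn, critical) limits copied from the representative thresholds above.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "latency_p95": (2.0, 5.0),   # seconds
    "error_rate": (0.02, 0.05),  # fraction of requests
    "drift": (0.10, 0.25),       # drift score; unit not specified in the README
}


def classify(metric: str, value: float) -> str:
    """Return OK, WARN, or CRITICAL for one metric value."""
    warn, critical = THRESHOLDS[metric]
    if value > critical:
        return "CRITICAL"
    if value > warn:
        return "WARN"
    return "OK"


def evaluate(snapshot: Mapping[str, float]) -> Dict[str, str]:
    """Classify every metric in a monitoring snapshot."""
    return {name: classify(name, value) for name, value in snapshot.items()}
```

Values exactly at a limit stay at the lower level because the comparison is strict, matching the "if >" wording of the thresholds.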
- docs/DEVELOPER_ONBOARDING_AND_TECHNICAL_HANDOVER.md
- docs/GITHUB_SECRETS_SETUP.md
- docs/KAGGLE_COLAB_DVC_MLFLOW_GITHUB_RUNBOOK.md
gh workflow run training.yml -f force_retrain=true -f model_config=configs/training/final_train.yaml

cat experiments/registry/index.json | jq '.'
cat experiments/registry/promotion_history.jsonl | jq '.'

python scripts/health_check.py \
  --check-mlflow \
  --check-production-model \
  --check-registry \
  --check-inference \
  --output /tmp/health.json
