🏦 Credit Risk MLOps Platform

Production-grade MLOps pipeline for credit card default prediction
Built on Kubernetes with Kubeflow, KServe, and comprehensive monitoring

📑 Table of Contents

Screenshots
Dashboard
Overview
Architecture
Quick Start
Infrastructure Setup
ML Pipeline
Model Serving
Monitoring
API Usage
Testing
Troubleshooting
Project Structure
License

📸 Screenshots

Grafana Monitoring Dashboard

Real-time data drift detection with Prometheus metrics:

Kubernetes Pods Status

All services running across namespaces (ml-credit-risk, kubeflow, kserve):

Live Model Prediction

KServe InferenceService returning real-time credit risk predictions:

Load Test Simulation

End-to-end simulation with 100% success rate at 1.76 req/s:

🖥️ Dashboard

A Streamlit-based web dashboard deployed on Kubernetes for interactive credit risk prediction.

Dashboard Screenshots

Overview Tab - System metrics and interactive architecture diagram:

Live Predictions Tab - Real-time predictions via KServe:

Infrastructure Tab - Kubernetes namespaces and pod status:

Dashboard Features

Tab	Features
📊 Overview	System metrics, interactive architecture diagram
🔮 Live Predictions	Customer input form, real KServe API calls, gauge chart
📈 Model Insights	Feature importance, ROC curve, metrics
🖥️ Infrastructure	Pod status cards, namespace overview

Running the Dashboard

# Deploy to Kubernetes
kubectl apply -f infra/dashboard.yaml

# Access locally via port-forward
kubectl port-forward svc/mlops-dashboard -n ml-credit-risk 8501:80

# Open http://localhost:8501

Dashboard Code

The dashboard is located in dashboard/:

dashboard/
├── app.py              # Streamlit application (400+ lines)
├── Dockerfile          # Container image definition
└── requirements.txt    # Python dependencies

Key files:

app.py - Main Streamlit application with 4 tabs
Dockerfile - Containerization for K8s deployment
infra/dashboard.yaml - Kubernetes Deployment & Service (kept with the other manifests in infra/)

Docker Image: shrinet82/mlops-dashboard:latest

📊 Overview

Problem Statement

Credit card default prediction is critical for financial institutions. This platform provides:

✅ Automated ML training with Kubeflow Pipelines
✅ Scalable model serving with KServe (auto-scaling 1-3 replicas)
✅ Real-time monitoring with Prometheus & Grafana
✅ CI/CD automation with GitHub Actions

Dataset

Source: UCI ML Repository - Taiwan Credit Card Default
Size: 30,000 samples, 23 features
Target: Binary classification (default vs. non-default)

Model Performance

Metric	Score
AUC-ROC	0.78
Accuracy	76%
Precision	47%
Recall	63%
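
To make the precision and recall rows concrete, here is a minimal sketch of how both are derived from confusion-matrix counts. The counts in the example are hypothetical and chosen only for illustration; they are not the project's evaluation results.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts."""
    # Precision: share of predicted defaults that really defaulted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real defaults that the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only.
precision, recall = precision_recall(tp=63, fp=71, fn=37)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A precision below recall, as in the table above, means the model flags more customers than actually default, which is a common trade-off when missing a default is costlier than a false alarm.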

Why MLOps? The Gaps This Platform Addresses

This platform was built to address four critical gaps in traditional ML workflows:

⚡ The "Velocity" Gap

Metric	Without MLOps	With This Platform	Improvement
Model deployment time	2-4 hours (manual)	~5 minutes (automated)	24-48x faster
Feature engineering iteration	30+ min per change	2 min (rerun pipeline)	15x faster
API endpoint provisioning	1-2 hours	Instant (KServe)	60x faster
Rollback time	30+ min	1 command (~30 sec)	60x faster

Measured: From code commit to production inference in < 10 minutes with automated pipeline.

🔧 The "Toil" Gap

Manual Task	Before	After	Toil Eliminated
Model retraining	Manual script execution	Kubeflow pipeline trigger	100%
Feature scaling	Copy-paste code	Shared `features.py` module	100%
Deployment YAML creation	Write from scratch	Template-based `infra/*.yaml`	90%
Metrics collection	Manual logging	Automatic Prometheus scraping	100%
Health monitoring	SSH & check logs	Grafana dashboard	95%

Actual: 5-stage pipeline automates data loading → feature engineering → training → validation → registration.

💰 The "Cost" Gap

Resource	Traditional Approach	This Platform	Savings
Compute (inference)	Always-on VM	HPA (1-3 replicas on demand)	~40-60%
Storage	Duplicate models	Minio S3 with versioning	~30%
Developer time	8+ hrs/week on ops	1-2 hrs/week	~75%
Incident response	2+ hrs MTTR	15 min (automated alerts)	~85%

Note: Cost estimates based on K3s on DigitalOcean ($48/mo for 3-node cluster vs $100+/mo for dedicated ML VMs).

🔒 The "Security" Gap

Security Measure	Implementation	Coverage
Secrets management	Kubernetes Secrets (base64-encoded, not encrypted)	✅ S3 credentials, API keys
Network isolation	Namespace segmentation	✅ 4 namespaces (kubeflow, ml-credit-risk, monitoring, kserve)
RBAC	ServiceAccounts per component	✅ KServe SA with Minio access only
Container scanning	Official Docker Hub base images	⚠️ Manual review (no automated scanning yet)
Audit logging	Prometheus metrics	✅ All predictions logged
Drift detection	Evidently integration	✅ Real-time monitoring

Honest assessment: Basic security is implemented. Production use would also need image scanning, OPA policies, mTLS, and an external secrets manager.

🏗️ Architecture

High-Level System Architecture

flowchart TB
    subgraph External["External Layer"]
        USER[("👤 User/Client")]
        GH["GitHub Repository"]
    end

    subgraph K8s["Kubernetes Cluster - K3s"]
        subgraph Training["Training Pipeline"]
            KF["Kubeflow Pipelines"]
            MINIO[("Minio S3")]
            MLFLOW["MLflow Tracking"]
        end

        subgraph Serving["Model Serving"]
            KSERVE["KServe InferenceService"]
            PREDICTOR["XGBoost Predictor Pod"]
        end

        subgraph Monitoring["Observability Stack"]
            PROM["Prometheus"]
            GRAF["Grafana"]
            METRICS["Metrics Server"]
        end
    end

    USER -->|"API Request"| KSERVE
    GH -->|"CI/CD Trigger"| KF
    KF -->|"Store Model"| MINIO
    KF -->|"Log Metrics"| MLFLOW
    MINIO -->|"Load Model"| PREDICTOR
    KSERVE --> PREDICTOR
    PREDICTOR -->|"Prediction"| USER
    METRICS -->|"Scrape"| PROM
    PROM -->|"Visualize"| GRAF

Cluster Topology

flowchart LR
    subgraph Master["k3s-master - Control Plane"]
        API["K8s API Server"]
        ETCD[("etcd")]
        SCHED["Scheduler"]
    end

    subgraph Worker1["k3s-worker - Core Services"]
        KF_PODS["Kubeflow Pods"]
        MINIO_POD["Minio"]
        MYSQL["MySQL"]
        WORKFLOW["Workflow Controller"]
    end

    subgraph Worker2["k3s-worker-2 - ML Workloads"]
        KSERVE_POD["KServe Predictor"]
        MON_POD["Monitoring Service"]
    end

    Master <-->|"kubelet"| Worker1
    Master <-->|"kubelet"| Worker2
    Worker1 <-->|"Model Fetch"| Worker2

Data Flow Architecture

flowchart LR
    subgraph Input["Data Input"]
        RAW["Raw CSV Data"]
        API_REQ["API Request"]
    end

    subgraph Processing["Feature Engineering"]
        FE["22 Engineered Features"]
        SCALE["StandardScaler"]
    end

    subgraph Model["ML Model"]
        XGB["XGBoost Classifier"]
    end

    subgraph Output["Prediction Output"]
        PROB["Probability Score"]
        RISK["Risk Classification"]
    end

    RAW --> FE
    API_REQ --> FE
    FE --> SCALE
    SCALE --> XGB
    XGB --> PROB
    PROB -->|">0.5"| RISK
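
The last edge of the data flow (probability to risk class at the 0.5 cut-off) can be sketched as follows. The function name and the label strings are assumptions for illustration; the real logic lives in the serving code.

```python
def classify_risk(probability: float, threshold: float = 0.5) -> str:
    """Map a default probability to a risk label using a strict '>' cut-off."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {probability}")
    # Strictly greater than the threshold, matching the ">0.5" edge in the diagram.
    return "default" if probability > threshold else "non-default"


print(classify_risk(0.62))
print(classify_risk(0.50))
```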

Namespace Organization

flowchart TB
    subgraph kubeflow["kubeflow namespace"]
        direction LR
        KF_API["ml-pipeline API"]
        KF_UI["Pipeline UI"]
        KF_MINIO["Minio Storage"]
        KF_MYSQL["MySQL DB"]
        KF_WF["Workflow Controller"]
    end

    subgraph mlcr["ml-credit-risk namespace"]
        direction LR
        ISVC["InferenceService"]
        PRED["Predictor Deployment"]
        MON_SVC["Monitoring Service"]
    end

    subgraph monitoring["monitoring namespace"]
        direction LR
        PROM_NS["Prometheus"]
        GRAF_NS["Grafana"]
        ALERT["Alertmanager"]
    end

    subgraph kserve_ns["kserve namespace"]
        direction LR
        CTRL["KServe Controller"]
        WEBHOOK["Webhook Server"]
    end

    kubeflow -->|"Store Models"| mlcr
    mlcr -->|"Expose Metrics"| monitoring
    kserve_ns -->|"Manage"| mlcr

Component Interaction Matrix

Component	Interacts With	Protocol	Purpose
Kubeflow	Minio, MySQL	S3, TCP	Store artifacts & metadata
KServe	Minio, Predictor	S3, HTTP	Load model, route requests
Predictor	Clients	HTTP/REST	Serve predictions
Prometheus	All services	HTTP	Scrape metrics
Grafana	Prometheus	PromQL	Query & visualize

Tech Stack

ML: XGBoost, LightGBM, scikit-learn
Tracking: MLflow
Pipeline: Kubeflow Pipelines v2.0
Serving: KServe v0.11 (RawDeployment mode)
Monitoring: Prometheus, Grafana, Evidently
Infrastructure: K3s on DigitalOcean

🚀 Quick Start

Local Development

# 1. Clone and setup
git clone <repo-url>
cd ML-OPS
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 2. Download dataset
python src/download_dataset.py

# 3. Train model
python src/train_baseline.py

# 4. Run local API
uvicorn src.serve:app --reload --port 8000

Production Deployment

# 1. Upload model to Minio
kubectl port-forward svc/minio-service -n kubeflow 9001:9000 &
python src/upload_model.py

# 2. Deploy KServe InferenceService
kubectl apply -f infra/kserve-setup.yaml
kubectl apply -f infra/inference-service.yaml

# 3. Deploy monitoring
kubectl apply -f infra/monitoring.yaml
kubectl apply -f infra/servicemonitor.yaml

# 4. Verify deployment
kubectl get inferenceservice -n ml-credit-risk

⚙️ Infrastructure Setup

Prerequisites

K3s v1.33+ cluster (3 nodes recommended)
kubectl configured
Helm 3.x
Python 3.10+

Install Kubeflow Pipelines

export PIPELINE_VERSION=2.0.5
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-emissary?ref=$PIPELINE_VERSION"

Install KServe

# KServe core
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.2/kserve.yaml

# Required CRDs (even for RawDeployment mode)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.11.0/serving-crds.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.19/manifests/charts/base/crds/crd-all.gen.yaml

# Serving runtimes
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.2/kserve-runtimes.yaml

Install Monitoring Stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

🔧 ML Pipeline

Pipeline DAG Overview

graph LR
    A[data_loader] --> B[feature_engineer]
    B --> C[train_model]
    C --> D[validate_model]
    D --> E[register_model]
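
The stage ordering in this DAG can be checked with a small standard-library sketch. This is not the Kubeflow Pipelines API; it only models the dependencies shown above, and the `PIPELINE_DAG` name is an assumption.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
PIPELINE_DAG = {
    "data_loader": set(),
    "feature_engineer": {"data_loader"},
    "train_model": {"feature_engineer"},
    "validate_model": {"train_model"},
    "register_model": {"validate_model"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


print(execution_order(PIPELINE_DAG))
```

Because the DAG is a simple chain, there is exactly one valid order, from `data_loader` through `register_model`.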

Detailed Pipeline Flow

flowchart TB
    subgraph Stage1["Stage 1: Data Loading"]
        DL_IN["UCI Dataset URL"]
        DL_PROC["Download & Split"]
        DL_OUT1[("train.csv")]
        DL_OUT2[("test.csv")]

        DL_IN --> DL_PROC
        DL_PROC --> DL_OUT1
        DL_PROC --> DL_OUT2
    end

    subgraph Stage2["Stage 2: Feature Engineering"]
        FE_IN1[("train.csv")]
        FE_IN2[("test.csv")]
        FE_PROC["Apply 22 Transformations"]
        FE_SCALE["StandardScaler Fit"]
        FE_OUT1[("train_features.csv")]
        FE_OUT2[("test_features.csv")]
        FE_OUT3[("scaler.pkl")]

        FE_IN1 --> FE_PROC
        FE_IN2 --> FE_PROC
        FE_PROC --> FE_SCALE
        FE_SCALE --> FE_OUT1
        FE_SCALE --> FE_OUT2
        FE_SCALE --> FE_OUT3
    end

    subgraph Stage3["Stage 3: Model Training"]
        TR_IN[("train_features.csv")]
        TR_PROC["XGBoost Training"]
        TR_OUT1[("model.json")]
        TR_OUT2["metrics.json"]

        TR_IN --> TR_PROC
        TR_PROC --> TR_OUT1
        TR_PROC --> TR_OUT2
    end

    subgraph Stage4["Stage 4: Validation"]
        VAL_MODEL[("model.json")]
        VAL_TEST[("test_features.csv")]
        VAL_PROC["Compute AUC"]
        VAL_GATE{"AUC > 0.75?"}
        VAL_PASS["✅ Continue"]
        VAL_FAIL["❌ Fail Pipeline"]

        VAL_MODEL --> VAL_PROC
        VAL_TEST --> VAL_PROC
        VAL_PROC --> VAL_GATE
        VAL_GATE -->|"Yes"| VAL_PASS
        VAL_GATE -->|"No"| VAL_FAIL
    end

    subgraph Stage5["Stage 5: Registration"]
        REG_MODEL[("model.json")]
        REG_SCALER[("scaler.pkl")]
        REG_PROC["MLflow Log"]
        REG_OUT["Registered Model"]

        REG_MODEL --> REG_PROC
        REG_SCALER --> REG_PROC
        REG_PROC --> REG_OUT
    end

    Stage1 --> Stage2
    Stage2 --> Stage3
    Stage3 --> Stage4
    Stage4 --> Stage5
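
The Stage 4 gate can be sketched without any ML library. The pairwise AUC below is a simple reference formulation, not necessarily what `validate_model.py` uses, and the function names are assumptions.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0))
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def validation_gate(auc: float, threshold: float = 0.75) -> bool:
    """Pass only when AUC is strictly above the threshold, as in the diagram."""
    return auc > threshold


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc, validation_gate(auc))
```

The pairwise form is quadratic in the number of samples, so it suits small checks; a rank-based implementation is the usual choice for a full test set.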

Feature Engineering Breakdown

flowchart LR
    subgraph Raw["Raw Features - 23 columns"]
        direction TB
        PAY["PAY_0 to PAY_6"]
        BILL["BILL_AMT1 to BILL_AMT6"]
        AMT["PAY_AMT1 to PAY_AMT6"]
        DEMO["LIMIT_BAL, AGE, SEX, EDUCATION, MARRIAGE"]
    end

    subgraph Engineered["Engineered Features - 22 new"]
        direction TB
        subgraph Payment["Payment Behavior - 6"]
            PB1["LATE_PAYMENTS"]
            PB2["MAX_DELAY"]
            PB3["AVG_DELAY"]
            PB4["SEVERE_DELAY"]
            PB5["EVER_2MONTH_LATE"]
            PB6["RECENT_DELAY_WEIGHTED"]
        end

        subgraph Ratios["Financial Ratios - 4"]
            FR1["UTILIZATION"]
            FR2["AVG_UTILIZATION"]
            FR3["PAY_RATIO"]
            FR4["RECENT_PAY_RATIO"]
        end

        subgraph Trends["Trends - 3"]
            TR1["BILL_TREND"]
            TR2["PAY_TREND"]
            TR3["INCREASING_DEBT"]
        end

        subgraph Aggregates["Aggregates - 4"]
            AG1["AVG_BILL_AMT"]
            AG2["AVG_PAY_AMT"]
            AG3["TOTAL_BILL"]
            AG4["TOTAL_PAY"]
        end

        subgraph Interactions["Interactions - 4"]
            IN1["LIMIT_AGE"]
            IN2["DELAY_UTIL"]
            IN3["HIGH_EDUCATION"]
            IN4["SINGLE"]
        end
    end

    PAY --> Payment
    BILL --> Ratios
    BILL --> Trends
    BILL --> Aggregates
    AMT --> Aggregates
    DEMO --> Interactions
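
A few of the engineered features can be sketched for a single customer record. The formulas below are assumptions inferred from the feature names (for example, `UTILIZATION` as the latest bill over the credit limit); the authoritative definitions are in `src/features.py`.

```python
# Column names follow the prediction request example (the dataset has no PAY_1).
PAY_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
PAY_AMT_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def engineer_features(row: dict) -> dict:
    """Compute a subset of engineered features (assumed definitions)."""
    delays = [row[c] for c in PAY_COLS]
    total_bill = sum(row[c] for c in BILL_COLS)
    total_pay = sum(row[c] for c in PAY_AMT_COLS)
    limit = row["LIMIT_BAL"]
    return {
        "LATE_PAYMENTS": sum(1 for d in delays if d > 0),  # months paid late
        "MAX_DELAY": max(delays),                          # worst delay status
        "UTILIZATION": row["BILL_AMT1"] / limit if limit else 0.0,
        "TOTAL_BILL": total_bill,
        "TOTAL_PAY": total_pay,
        "PAY_RATIO": total_pay / total_bill if total_bill else 0.0,
    }


sample = {"LIMIT_BAL": 50000, **{c: 0 for c in PAY_COLS}}
sample.update({c: 20000 - 1000 * i for i, c in enumerate(BILL_COLS)})
sample.update({c: 2000 - 200 * i for i, c in enumerate(PAY_AMT_COLS)})
print(engineer_features(sample))
```

The zero-division guards matter in practice: customers with no billed amount would otherwise produce infinite or undefined ratios before scaling.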

Pipeline Stages Reference

Stage	File	Input	Output	Description
`data_loader`	`data_loader.py`	URL	train.csv, test.csv	Downloads UCI dataset, splits 80/20
`feature_engineer`	`feature_engineer.py`	train.csv, test.csv	train_features.csv, test_features.csv, scaler.pkl	Creates 22 features, scales data
`train_model`	`train_model.py`	train_features.csv	model.json, metrics.json	XGBoost with class weights
`validate_model`	`validate_model.py`	model.json, test_features.csv	report.json	AUC threshold gate (>0.75)
`register_model`	`register_model.py`	artifacts	MLflow entry	Logs to model registry
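
For the "XGBoost with class weights" step, one common approach is XGBoost's `scale_pos_weight` parameter, set to the ratio of negative to positive samples. Whether `train_model.py` uses this exact parameter is an assumption; the sketch only shows how the ratio is computed.

```python
def scale_pos_weight(labels: list[int]) -> float:
    """Return negatives / positives, the usual value for XGBoost's scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive samples; cannot compute a class weight")
    return negatives / positives


# Hypothetical label vector: 1 = default, 0 = non-default.
print(scale_pos_weight([0, 0, 0, 1]))
```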

Compile Pipeline

cd kubeflow && python pipeline.py
# Output: pipeline.yaml

🚢 Model Serving

KServe Configuration

# infra/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: credit-risk-model
  namespace: ml-credit-risk
  annotations:
    serving.kserve.io/deploymentMode: "RawDeployment"
spec:
  predictor:
    serviceAccountName: kserve-sa
    minReplicas: 1
    maxReplicas: 3
    xgboost:
      storageUri: s3://models/credit-risk/

Testing the Endpoint

# Port-forward
kubectl port-forward svc/credit-risk-model-predictor -n ml-credit-risk 8081:80 &

# Test prediction
python src/test_kserve.py

Load Testing

# 60-second simulation at 2 req/s
python src/simulation_test.py 60 2

# Intensive test: 120s at 5 req/s
python src/simulation_test.py 120 5
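
The two arguments are a duration in seconds and a target request rate. A minimal sketch of such a fixed-rate loop is shown below; it is not the contents of `simulation_test.py`, and `run_simulation`, `send`, and the result keys are hypothetical names.

```python
import time
from typing import Callable


def run_simulation(
    send: Callable[[], bool],
    duration_s: float,
    rate: float,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() duration_s * rate times, pausing 1/rate seconds between calls."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    total = int(duration_s * rate)
    successes = 0
    start = time.monotonic()
    for _ in range(total):
        try:
            if send():
                successes += 1
        except Exception:
            pass  # a raised error counts as a failed request
        sleep(1.0 / rate)
    elapsed = time.monotonic() - start
    return {
        "requests": total,
        "successes": successes,
        "success_rate": successes / total if total else 0.0,
        "achieved_rps": total / elapsed if elapsed > 0 else float("inf"),
    }


# Dry run with a stub sender and no real sleeping.
print(run_simulation(lambda: True, duration_s=5, rate=2, sleep=lambda s: None))
```

In a loop like this the request latency is added on top of each pause, so the achieved rate ends up below the target rate. That is one plausible reason a run targeting 2 req/s would report a lower measured throughput.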

📊 Monitoring

Metrics Exposed

Metric	Type	Description
`credit_risk_drift_score`	Gauge	Data drift score as a fraction (0-1)
`credit_risk_drift_detected`	Gauge	Binary drift flag
`credit_risk_predictions_total`	Counter	Total predictions logged
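
For reference, here is a sketch of what these three metrics look like in the Prometheus text exposition format that the `/metrics` endpoint serves. The real service presumably uses a client library; `render_metrics` and the HELP strings are assumptions.

```python
def render_metrics(drift_score: float, drift_detected: bool, predictions_total: int) -> str:
    """Render the three metrics in Prometheus text exposition format."""
    lines = [
        "# HELP credit_risk_drift_score Data drift score as a fraction (0-1)",
        "# TYPE credit_risk_drift_score gauge",
        f"credit_risk_drift_score {drift_score}",
        "# HELP credit_risk_drift_detected Binary drift flag",
        "# TYPE credit_risk_drift_detected gauge",
        f"credit_risk_drift_detected {1 if drift_detected else 0}",
        "# HELP credit_risk_predictions_total Total predictions logged",
        "# TYPE credit_risk_predictions_total counter",
        f"credit_risk_predictions_total {predictions_total}",
    ]
    # The format requires a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


print(render_metrics(0.25, True, 42), end="")
```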

Access Grafana

kubectl port-forward svc/monitoring-grafana -n monitoring 3001:80 &
# Open http://localhost:3001
# Credentials: admin / admin (if login fails, try prom-operator, the kube-prometheus-stack chart's default admin password)

Dashboard Features

Data Drift Score: Time-series visualization
Drift Detection Alert: Binary status indicator
Auto-refresh: 5-second intervals

🔗 API Usage

Health Check

curl http://localhost:8000/health

Prediction Request

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "LIMIT_BAL": 50000,
    "SEX": 1,
    "EDUCATION": 2,
    "MARRIAGE": 1,
    "AGE": 35,
    "PAY_0": 0, "PAY_2": 0, "PAY_3": 0,
    "PAY_4": 0, "PAY_5": 0, "PAY_6": 0,
    "BILL_AMT1": 20000, "BILL_AMT2": 19000,
    "BILL_AMT3": 18000, "BILL_AMT4": 17000,
    "BILL_AMT5": 16000, "BILL_AMT6": 15000,
    "PAY_AMT1": 2000, "PAY_AMT2": 1800,
    "PAY_AMT3": 1600, "PAY_AMT4": 1400,
    "PAY_AMT5": 1200, "PAY_AMT6": 1000
  }'
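
The same request can be sent from Python using only the standard library. This is a sketch: `build_predict_request` and `predict` are hypothetical helper names, and the response is returned as raw JSON because the response schema is not documented here.

```python
import json
import urllib.request


def build_predict_request(customer: dict, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST /predict request carrying the customer record as JSON."""
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(customer).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(customer: dict, base_url: str = "http://localhost:8000", timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response (needs the API running)."""
    request = build_predict_request(customer, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network.
req = build_predict_request({"LIMIT_BAL": 50000, "AGE": 35})
print(req.get_method(), req.full_url)
```

Call `predict(...)` with the full 23-field record once the local API from the Quick Start is running.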

KServe V1 Protocol

curl -X POST http://localhost:8081/v1/models/credit-risk-model:predict \
  -H "Host: credit-risk.local" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, -0.2, 0.3, ...]]}'  # one row of 45 scaled features; "..." is a placeholder, not valid JSON
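
The V1 protocol wraps input rows in an `instances` list and returns results under `predictions`. The sketch below builds and validates such a payload; `build_v1_payload`, `parse_v1_response`, and the sample response body are hypothetical, and the inputs must already be scaled with the pipeline's scaler.

```python
import json

N_FEATURES = 45  # scaled features per row, as noted in the curl example


def build_v1_payload(rows: list[list[float]]) -> str:
    """Serialize feature rows into a KServe V1 request body."""
    for index, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(f"row {index} has {len(row)} features, expected {N_FEATURES}")
    return json.dumps({"instances": [[float(value) for value in row] for row in rows]})


def parse_v1_response(body: str) -> list:
    """Extract the predictions list from a KServe V1 response body."""
    return json.loads(body)["predictions"]


payload = build_v1_payload([[0.0] * N_FEATURES])
print(len(json.loads(payload)["instances"][0]))
print(parse_v1_response('{"predictions": [0.12]}'))  # made-up response for illustration
```

Checking the row length on the client side gives a clearer error than the shape mismatch the predictor would otherwise report.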

🧪 Testing

Run Unit Tests

pytest tests/

Run with Coverage

pytest tests/ --cov=src --cov-report=html

CI/CD Pipeline

GitHub Actions workflow (.github/workflows/ci.yaml):

Triggers on push/PR to main
Runs flake8 linting
Executes pytest suite

🔧 Troubleshooting

KServe Controller Crashes

# Install missing CRDs
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.11.0/serving-crds.yaml
kubectl rollout restart deployment kserve-controller-manager -n kserve

Model Not Found

# Ensure model.json exists in Minio
python src/upload_model.py

XGBoost Version Mismatch

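# A merge patch would replace the whole containers list (dropping args and other fields),
# so use a JSON patch that changes only the image. Assumes kserve-container is the first container.
kubectl patch clusterservingruntime kserve-xgbserver --type json \
  -p '[{"op":"replace","path":"/spec/containers/0/image","value":"kserve/xgbserver:v0.13.0"}]'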

Grafana "No Data"

# Verify metrics endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never -- \
  curl http://monitoring-service.ml-credit-risk/metrics

📁 Project Structure

ML-OPS/
├── .github/workflows/ci.yaml     # CI/CD pipeline
├── dashboard/                    # Streamlit dashboard (app.py, Dockerfile)
├── data/                         # Dataset files
├── infra/                        # Kubernetes manifests
│   ├── dashboard.yaml            # Dashboard Deployment & Service
│   ├── inference-service.yaml    # KServe InferenceService
│   ├── kserve-setup.yaml         # Secrets & ServiceAccount
│   ├── monitoring.yaml           # Monitoring service
│   └── servicemonitor.yaml       # Prometheus ServiceMonitor
├── kubeflow/
│   ├── components/               # 5 pipeline stages
│   ├── pipeline.py               # Pipeline definition
│   └── pipeline.yaml             # Compiled pipeline
├── models/                       # Trained models
├── monitoring/                   # Grafana dashboards
├── src/
│   ├── download_dataset.py       # Data downloader
│   ├── train_baseline.py         # Training script
│   ├── serve.py                  # FastAPI endpoint
│   ├── features.py               # Feature engineering
│   ├── test_kserve.py            # KServe test client
│   ├── simulation_test.py        # Load testing
│   ├── upload_model.py           # Uploads model to Minio
│   └── metrics_server.py         # Prometheus metrics
├── tests/                        # Unit tests
└── requirements.txt              # Dependencies

📄 License

MIT License

Created: January 2026 | Author: Shashwat Pratap Singh
