Home
Welcome to the MLOps Platform wiki – your central guide to understanding, deploying, and extending a production‑grade machine learning infrastructure.
This repository implements a full MLOps stack that automates the entire machine learning lifecycle:
- Feature Engineering using Feast (offline & online store)
- Experiment Tracking & Model Registry with MLflow
- Training Pipeline Orchestration via Kubeflow Pipelines
- Drift Detection (Kolmogorov‑Smirnov test, Jensen‑Shannon divergence, PCA) that triggers automatic retraining
- Model Serving with KServe and GPU‑optimised transformers
- Infrastructure as Code (Terraform) on AWS EKS
- CI/CD with GitHub Actions
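The platform is designed to be scalable and cost‑efficient, and its Kubernetes‑based components are intended to be cloud‑agnostic (the bundled Terraform targets AWS EKS). It supports teams that need to move from experimentation to production with confidence.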
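To make the drift checks listed above more concrete, here is a minimal sketch of two of them (the Kolmogorov‑Smirnov statistic and the Jensen‑Shannon divergence) using only the Python standard library. It is not the repository's implementation: the function names, the histogram binning, and both thresholds are assumptions chosen for illustration, and the PCA‑based check is not covered.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        for x in ref + cur
    )


def js_divergence(reference, current, bins=10):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two histograms."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # guard against identical constant samples

    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    p, q = histogram(reference), histogram(current)
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def should_retrain(reference, current, ks_threshold=0.2, js_threshold=0.1):
    """Flag drift when either statistic exceeds its threshold (thresholds are hypothetical)."""
    return (
        ks_statistic(reference, current) > ks_threshold
        or js_divergence(reference, current) > js_threshold
    )
```

In this sketch, `reference` would be a feature column from the training data and `current` the same feature from recent traffic; a `True` result is the signal that would start a retraining run.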

    graph TD
        subgraph "Data"
            A[Data Lake / S3]
            B[Streaming Events]
            C[Feast Feature Store]
        end
        subgraph "ML Pipelines (Kubeflow)"
            D[Training Pipeline]
            E[Drift Detection]
        end
        subgraph "MLflow"
            F[Experiment Tracking]
            G[Model Registry]
        end
        subgraph "Serving (KServe)"
            H[InferenceService]
            I[Transformer]
            J[Predictor]
        end
        subgraph "Infrastructure (EKS)"
            K[Kubernetes]
            L[GPU Nodes]
            M[Prometheus Monitoring]
        end
        A --> C
        B --> C
        C --> D
        D --> F
        F --> G
        G --> H
        H --> I --> J
        E --> D
        K --> L
        M --> K

For a detailed walk‑through see the Architecture Deep-Dive page.
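As an illustration of the serving part of the diagram (InferenceService, Transformer, Predictor), the sketch below builds a KServe `v1beta1` InferenceService manifest as a plain Python dictionary. The function name, the `mlflow` model format, the `v2` protocol setting, and placing the GPU limit on the transformer container are all assumptions made for this example; the manifests in the repository may be structured differently.

```python
import json


def inference_service_manifest(name, storage_uri, transformer_image, gpus=0):
    """Build a KServe v1beta1 InferenceService with a transformer and a predictor."""
    transformer_container = {"name": "kserve-container", "image": transformer_image}
    if gpus:
        # Request GPUs through the NVIDIA device-plugin resource name.
        transformer_container["resources"] = {"limits": {"nvidia.com/gpu": str(gpus)}}
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            # The transformer handles pre/post-processing in front of the predictor.
            "transformer": {"containers": [transformer_container]},
            "predictor": {
                "model": {
                    "modelFormat": {"name": "mlflow"},  # assumed model format
                    "protocolVersion": "v2",
                    "storageUri": storage_uri,
                }
            },
        },
    }
```

Because `kubectl` accepts JSON as well as YAML, the result of `json.dumps(inference_service_manifest(...), indent=2)` could be written to a file and applied with `kubectl apply -f`.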
- Clone the repository

      git clone https://github.com/Awrsha/mlops-platform.git
      cd mlops-platform

- Provision infrastructure with Terraform

      cd terraform
      terraform init && terraform apply

- Deploy cluster services (KServe, MLflow, Feast, Kubeflow)

      kubectl apply -f kubernetes/

- Trigger a training pipeline

  Submit the compiled pipeline to your Kubeflow endpoint or let the CI/CD handle it automatically.

For a more detailed setup, read the Installation Guide.
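For the manual route in the last step, submitting a compiled pipeline could look roughly like the sketch below, which uses the Kubeflow Pipelines SDK (`kfp`). The helper names, the default experiment name, and any file name or host URL you pass in are assumptions for illustration; the repository's CI/CD workflow may submit runs differently.

```python
def build_run_request(pipeline_package, run_name, parameters=None, experiment="training"):
    """Collect the keyword arguments for a Kubeflow Pipelines run submission."""
    return {
        "pipeline_file": pipeline_package,
        "arguments": dict(parameters or {}),
        "run_name": run_name,
        "experiment_name": experiment,  # hypothetical experiment name
    }


def submit_run(host, request):
    """Submit the compiled pipeline package to a Kubeflow Pipelines endpoint."""
    import kfp  # imported lazily so build_run_request works without the SDK installed

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(**request)
```

A call such as `submit_run(host, build_run_request("pipeline.yaml", "manual-run"))` would then start a run, where `host` is the URL of your Kubeflow Pipelines endpoint and `pipeline.yaml` stands in for the compiled pipeline file.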
- Installation Guide – Step‑by‑step instructions to deploy the platform.
- Architecture Deep-Dive – Component design, data flow, and scaling strategies.
- Drift Detection – How statistical tests trigger automatic retraining.
- CI/CD Pipelines – GitHub Actions workflows explained.
- Model Serving – KServe configuration, transformers, and GPU optimisation.
- Monitoring & Alerting – Prometheus rules, Grafana dashboards, and alerts.
- Troubleshooting – Common issues and solutions.
We welcome contributions! See the main README or the Contributing Guide for details.
Open a GitHub Issue or reach out to the team on Discussions.
Happy building! 🚀